Quicker reg exps?

  • LeWalrus
    New Member
    • Sep 2008
    • 1

    Hi, I've written a reg exp for capturing a group of numbers from text files in the following format:
    -1.4326 s < 0.6758 s < 1.4334 s
    Any of the numbers can be positive or negative and the units (s) can change or even be absent. What I wanted was the three numbers (signs included)! Here was the reg exp I used to capture:
    Code:
    $line =~ m!([ |-]\d+\.\d+)\s+.*?<\s+([ |-]\d+\.\d+)\s+(.*?)<\s+([ |-]\d+\.\d+)!
    The problem is that this must be used hundreds of thousands of times per file, so speed is an issue! Does anyone have any ideas to make this reg exp faster? I'm not fully aware of which reg exp constructs incur speed penalties.
    Thanks!
    Last edited by numberwhun; Sep 24 '08, 12:07 PM. Reason: Please use code tags
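A minimal sketch of what the posted pattern actually captures. It assumes the real files leave a two-space gap after each `<` (room for a sign column); with the single spaces shown in the sample line the pattern does not appear to match, so the forum may have collapsed whitespace. The helper name `capture_original` is made up for illustration.

```perl
use strict;
use warnings;

# Assumption: two spaces after each '<', one of them standing in for the sign.
my $line = "-1.4326 s <  0.6758 s <  1.4334 s";

# The pattern from the post, unchanged.
sub capture_original {
    my ($text) = @_;
    return $text =~ m!([ |-]\d+\.\d+)\s+.*?<\s+([ |-]\d+\.\d+)\s+(.*?)<\s+([ |-]\d+\.\d+)!;
}

my @fields = capture_original($line);
print scalar(@fields), " captures: ", join(",", map { "[$_]" } @fields), "\n";

# Inside [...] the pipe is a literal character, not alternation,
# so [ |-] also accepts "|" in front of a number.
print "a literal pipe matches [ |-]\n" if "|" =~ /^[ |-]$/;
```

Note that the pattern has four capture groups, not three: the `(.*?)` between the second and third numbers captures the unit text, so the third number ends up in `$4`. If only a space or a minus sign is wanted in front of the digits, `[ -]` would be enough.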
  • numberwhun
    Recognized Expert Moderator Specialist
    • May 2007
    • 3467

    #2
    The only thing I can think of right off (it's still early and my brain is still sleeping) is to make your regex non-greedy if you can. You can read about it here and here.
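For readers unfamiliar with the term, here is a small sketch of the difference between a greedy and a non-greedy quantifier. The test string is invented for illustration and is not taken from the original data.

```perl
use strict;
use warnings;

my $text = "a < b < c";

# Greedy: .* grabs as much as it can, then backs off to the LAST '<'.
my ($greedy) = $text =~ /^(.*)</;

# Non-greedy: .*? grows one character at a time and stops at the FIRST '<'.
my ($lazy) = $text =~ /^(.*?)</;

print "greedy: [$greedy]\n";
print "lazy:   [$lazy]\n";
```

Which form is faster depends on how far the engine has to scan or backtrack for the data at hand, so it is worth measuring rather than assuming.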

    Having a more exact regular expression is one key to speed. Also, in the beginning of your regex you have the following:

    Code:
    [ |-]
    I assume that the spacing before the pipe symbol is supposed to be a space, but to a regex, it's just whitespace and not part of the regex. To indicate a space in a regex, you would use a \s, not an actual space.

    Regards,

    Jeff


    • Ganon11
      Recognized Expert Specialist
      • Oct 2006
      • 3651

      #3
      Jeff,

      A space inside a character class (such as the one he has) matches just that - a space. Whitespace is matched literally inside regexes unless a certain option is turned on (which I forget right now). In other words,

      Code:
      $line =~ /This is a test./;

      will correctly match "This is a test." but not "Thisisatest."

      Code:
      C:\Users\Ganon11>perl
      while (1) {
         chomp(my $line = <STDIN>);
         if ($line =~ /This is a test./) {
            print "Successful match.\n";
         } else {
            print "No match.\n";
         }
      }
      ^Z
      This is a test.
      Successful match.
      Thisisatest.
      No match.
      ^C
      Similarly,

      Code:
      $line =~ /(\w+)[ \t]/;

      will match "Dogs ", "Cats ", but not "Mouse".

      Code:
      C:\Users\Ganon11>perl
      while (1) {
              chomp(my $line = <STDIN>);
              if ($line =~ /(\w+)[ \t]/) {
                      print "Successful match.\n";
              } else {
                      print "No match.\n";
              }
      }
      ^Z
      Dogs
      No match.
      Dogs and
      Successful match.
      Cats
      Successful match.
      There was a tab in the previous line
      Successful match.
      Mousenospace
      No match.
      ^C
      The special character \s is special only because it matches any kind of whitespace - therefore, I believe \s is equivalent to [ \t\n].
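The option that makes Perl ignore whitespace in a pattern is the /x modifier. A short sketch of its effect follows; the variable names are invented for illustration. It also shows that a space inside a bracketed character class stays literal under a single /x.

```perl
use strict;
use warnings;

# Without /x, spaces in the pattern are literal and must appear in the text.
my $plain_hit  = "This is a test." =~ /This is a test\./  ? 1 : 0;

# With /x, unbracketed whitespace in the pattern is ignored,
# so the pattern effectively becomes /Thisisatest\./.
my $x_spaced   = "This is a test." =~ /This is a test\./x ? 1 : 0;
my $x_squeezed = "Thisisatest."    =~ /This is a test\./x ? 1 : 0;

# A space inside [...] still matches a space, even under /x.
my $x_class    = "a b"             =~ /a [ ] b/x          ? 1 : 0;

print "plain, spaced text:    $plain_hit\n";
print "/x, spaced text:       $x_spaced\n";
print "/x, squeezed text:     $x_squeezed\n";
print "/x, class with space:  $x_class\n";
```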


      • KevinADC
        Recognized Expert Specialist
        • Jan 2007
        • 4092

        #4
        Try:

        Code:
        $line =~ m/(-?\d+\.\d+)[^<]+<\s+(-?\d+\.\d+)[^<]+<\s+(-?\d+\.\d+)/o;
        The "o" on the end might also give some performance boost, but you would have to test the code to see whether that is true for your application.
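One way to test it is the core Benchmark module. The sketch below compares the original pattern with the negated-class version on a single sample line; the sub names and the two-space sample line are assumptions made for illustration, and real timings will depend on the Perl version and the actual data. As far as perlop describes it, /o only affects patterns that interpolate variables, which this one does not, so it is left off here.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Assumption: two spaces after each '<' so that the original pattern matches too.
my $line = "-1.4326 s <  0.6758 s <  1.4334 s";

# Original pattern: lazy .*? plus an extra capture group for the units.
sub match_lazy {
    my @n = $_[0] =~ m!([ |-]\d+\.\d+)\s+.*?<\s+([ |-]\d+\.\d+)\s+(.*?)<\s+([ |-]\d+\.\d+)!;
    return @n[0, 1, 3];    # drop the unit capture
}

# Suggested pattern: [^<]+ skips ahead to the next '<' without lazy stepping.
sub match_negated {
    return $_[0] =~ m/(-?\d+\.\d+)[^<]+<\s+(-?\d+\.\d+)[^<]+<\s+(-?\d+\.\d+)/;
}

# Run each variant a fixed number of times and print a comparison table.
cmpthese(500_000, {
    lazy    => sub { my @n = match_lazy($line) },
    negated => sub { my @n = match_negated($line) },
});
```

Running this against a few thousand real lines from the files, rather than one sample string, would give a more trustworthy comparison.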


        • numberwhun
          Recognized Expert Moderator Specialist
          • May 2007
          • 3467

          #5
          Originally posted by Ganon11
          The special character \s is special only because it matches any kind of whitespace - therefore, I believe \s is equivalent to [ \t\n].
          Plus, with \s, you can add the modifiers to match none or many, whereas I believe you would have to include as many spaces as you expect the way he has done it. I was just looking at efficiency, but also wasn't aware you could use a literal space like that.


          • Ganon11
            Recognized Expert Specialist
            • Oct 2006
            • 3651

            #6
            You could use [ \t\n]+ or [ \t\n]* just like \s, it's just faster to write \s+ or \s*. I think.


            • KevinADC
              Recognized Expert Specialist
              • Jan 2007
              • 4092

              #7
              \s is actually a character class, not just a meta character; it's like \d ([0-9]) or \w ([a-zA-Z0-9_]) and not like \t or \n, which are meta characters that have only one interpolated meaning (tab and newline). Its actual meaning may also vary between older versions of Perl and newer ones.

              According to the Perl 5.10 documentation:

              \s matches a whitespace character, the set [\ \t\r\n\f] and others
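A quick sketch of where \s and the hand-written class [ \t\n] differ, using only the characters named in the quoted documentation. The helper names `in_s` and `in_narrow` are made up for illustration.

```perl
use strict;
use warnings;

# Does the character match \s?
sub in_s      { return $_[0] =~ /\s/      ? 1 : 0 }
# Does the character match the narrower hand-written class?
sub in_narrow { return $_[0] =~ /[ \t\n]/ ? 1 : 0 }

my %names = (
    " "  => 'SPACE',
    "\t" => 'TAB',
    "\n" => 'LF',
    "\r" => 'CR',
    "\f" => 'FF',
);

for my $ch (sort keys %names) {
    printf "%-5s  \\s: %d  [ \\t\\n]: %d\n", $names{$ch}, in_s($ch), in_narrow($ch);
}

# Quantifiers work on either form; only the set of characters differs.
print "quantified class ok\n" if "a \t b" =~ /^a[ \t\n]+b$/;
```

Carriage return and form feed match \s but not [ \t\n], which matters for files with Windows line endings.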
