re.search much slower than grep on some regular expressions

  • J. Cliff Dyer

    #46
    Re: re.search much slower than grep on some regular expressions


    On Wed, 2008-07-09 at 12:29 -0700, samwyse wrote:
    On Jul 8, 11:01 am, Kris Kennaway <k...@FreeBSD.org> wrote:
    >samwyse wrote:
    >>"Another advantage of Plex is that it compiles all of the regular
    >>expressions into a single DFA. Once that's done, the input can be
    >>processed in a time proportional to the number of characters to be
    >>scanned, and independent of the number or complexity of the regular
    >>expressions. Python's existing regular expression matchers do not have
    >>this property."
    >
    >Hmm, unfortunately it's still orders of magnitude slower than grep in my
    >own application that involves matching lots of strings and regexps
    >against large files (I killed it after 400 seconds, compared to 1.5 for
    >grep), and that's leaving aside the much longer compilation time (over a
    >minute). If the matching was fast then I could possibly pickle the
    >lexer though (but it's not).

    That's funny, the compilation is almost instantaneous for me.
    However, I just tested it against several files, the first containing
    4875*'a', the rest each twice the size of the previous. And you're
    right, for each doubling of the file size, the match takes four times
    as long, meaning O(n^2). 156000*'a' would probably take 8 hours.
    Here are my results:

    compile_lexicon() took 0.0236021580595 secs
    test('file-0.txt') took 24.8322969831 secs
    test('file-1.txt') took 99.3956799681 secs
    test('file-2.txt') took 398.349623132 secs

    Sounds like a good strategy would be to find the smallest chunk of the
    file that matches can't cross, and iterate your search on units of those
    chunks. For example, if none of your regexes cross line boundaries,
    search each line of the file individually. That may help turn around
    the speed degradation you're seeing.

    Cheers,
    Cliff
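
    A minimal sketch of the line-by-line chunking Cliff suggests above, assuming
    none of the patterns need to span newlines; the patterns and the file name
    below are placeholders, not the ones from Kris's application:

        import re

        # Placeholder patterns -- stand-ins for the real set of strings/regexps.
        patterns = [re.compile(r"foo\d+"), re.compile(r"\bbar(?:baz)?\b")]

        def scan(path):
            # Search each line separately, so the cost of a slow match is
            # bounded by the line length rather than by the whole file.
            hits = []
            with open(path) as f:
                for lineno, line in enumerate(f, 1):
                    for pat in patterns:
                        m = pat.search(line)
                        if m:
                            hits.append((lineno, pat.pattern, m.group(0)))
            return hits

    Only the iteration strategy changes here; the patterns themselves are left
    untouched.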

    • Sebastian "lunar" Wiesner

      #47
      Re: re.search much slower than grep on some regular expressions

      Marc 'BlackJack' Rintsch <bj_666@gmx.net>:
      On Mon, 07 Jul 2008 16:44:22 +0200, Sebastian "lunar" Wiesner wrote:
      >
      >Mark Wooding <mdw@distorted.org.uk>:
      >>
      >>Sebastian "lunar" Wiesner <basti.wiesner@gmx.net> wrote:
      >>>
      >>># perl -e '("a" x 100000) =~ /^(ab?)*$/;'
      >>>zsh: segmentation fault perl -e '("a" x 100000) =~ /^(ab?)*$/;'
      >>>
      >>(Did you really run that as root?)
      >>
      >How come, that you think so?
      >
      The '#' is usually the prompt of root accounts while ordinary users get a
      '$'.

      Oh ... I just used "#" as a general placeholder, because I didn't want to
      copy my whole two-line prompt ;) I can assure you that I did not try this
      as root ;) and will use "$" in the future ;)

      --
      Freedom is always the freedom of dissenters.
      (Rosa Luxemburg)
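
      For comparison, a rough Python equivalent of that one-liner is below; how an
      interpreter copes with it (a quick match, heavy memory use, or an error)
      depends on the regex engine, so treat this as an experiment rather than a
      claim about CPython:

          import re
          import time

          # Same pattern and input as the Perl one-liner quoted above.
          s = "a" * 100000
          start = time.time()
          m = re.match(r"^(ab?)*$", s)
          print(m is not None, "in", round(time.time() - start, 3), "secs")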

      • Kris Kennaway

        #48
        Re: re.search much slower than grep on some regular expressions

        J. Cliff Dyer wrote:
        On Wed, 2008-07-09 at 12:29 -0700, samwyse wrote:
        >On Jul 8, 11:01 am, Kris Kennaway <k...@FreeBSD.org> wrote:
        >>samwyse wrote:
        >>>You might want to look at Plex.
        >>>http://www.cosc.canterbury.ac.nz/gre...g/python/Plex/
        >>>"Another advantage of Plex is that it compiles all of the regular
        >>>expressions into a single DFA. Once that's done, the input can be
        >>>processed in a time proportional to the number of characters to be
        >>>scanned, and independent of the number or complexity of the regular
        >>>expressions. Python's existing regular expression matchers do not have
        >>>this property."
        >>Hmm, unfortunately it's still orders of magnitude slower than grep in my
        >>own application that involves matching lots of strings and regexps
        >>against large files (I killed it after 400 seconds, compared to 1.5 for
        >>grep), and that's leaving aside the much longer compilation time (over a
        >>minute). If the matching was fast then I could possibly pickle the
        >>lexer though (but it's not).
        >That's funny, the compilation is almost instantaneous for me.
        >However, I just tested it against several files, the first containing
        >4875*'a', the rest each twice the size of the previous. And you're
        >right, for each doubling of the file size, the match takes four times
        >as long, meaning O(n^2). 156000*'a' would probably take 8 hours.
        >Here are my results:
        >
        >compile_lexicon() took 0.0236021580595 secs
        >test('file-0.txt') took 24.8322969831 secs
        >test('file-1.txt') took 99.3956799681 secs
        >test('file-2.txt') took 398.349623132 secs
        >
        Sounds like a good strategy would be to find the smallest chunk of the
        file that matches can't cross, and iterate your search on units of those
        chunks. For example, if none of your regexes cross line boundaries,
        search each line of the file individually. That may help turn around
        the speed degradation you're seeing.

        That's what I'm doing. I've also tried various other things like
        mmapping the file and searching it at once, etc, but almost all of the
        time is spent in the regexp engine so optimizing other things only gives
        marginal improvement.

        Kris
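
        For reference, the mmap-and-search-in-one-pass variant Kris describes looks
        roughly like the sketch below; the file name and the combined pattern are
        placeholders, and as he notes it changes little when the time is spent
        inside the regexp engine itself:

            import mmap
            import re

            # Placeholder: the fixed strings and regexps folded into a single
            # alternation so the whole file is scanned in one pass.
            pattern = re.compile(rb"ERROR|WARN(?:ING)?|timeout=\d+")

            with open("large-file.log", "rb") as f:  # hypothetical input file
                mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
                try:
                    for m in pattern.finditer(mm):
                        print(m.start(), m.group(0))
                finally:
                    mm.close()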
