re.search much slower then grep on some regular expressions

Collapse
This topic is closed.
X
X
 
  • Time
  • Show
Clear All
new posts
  • Sebastian \lunar\ Wiesner

    #16
    Re: re.search much slower then grep on some regular expressions

    Carl Banks <pavlovevidence @gmail.com>:
    On Jul 5, 4:12 am, "Sebastian \"lunar\" Wiesner"
    <basti.wies...@ gmx.netwrote:
    >Paddy <paddy3...@goog lemail.com>:
    >>
    >>
    >>
    On Jul 4, 1:36 pm, Peter Otten <__pete...@web. dewrote:
    >Henning_Thornb lad wrote:
    What can be the cause of the large difference between re.search and
    grep?
    >>
    >grep uses a smarter algorithm ;)
    >>
    This script takes about 5 min to run on my computer:
    #!/usr/bin/env python
    import re
    >>
    row=""
    for a in range(156000):
    row+="a"
    print re.search('[^ "=]*/',row)
    >>
    While doing a simple grep:
    grep '[^ "=]*/' input (input contains 156.000 a in
    one row)
    doesn't even take a second.
    >>
    Is this a bug in python?
    >>
    >You could call this a performance bug, but it's not common enough in
    >real code to get the necessary brain cycles from the core developers.
    >So you can either write a patch yourself or use a workaround.
    >>
    >re.search('[^ "=]*/', row) if "/" in row else None
    >>
    >might be good enough.
    >>
    >Peter
    >>
    It is not a smarter algorithm that is used in grep. Python RE's have
    more capabilities than grep RE's which need a slower, more complex
    algorithm.
    >>
    >FWIW, grep itself can confirm this statement. The following command
    >roughly takes as long as Python's re.search:
    >>
    ># grep -P '[^ "=]*/' input
    >>
    >-P tells grep to use real perl-compatible regular expressions.
    >
    This confirms that a particular engine might not be optimized for it,
    but it's not necessarily a reflection that the engine is more complex.
    My posting wasn't intended to reflect the differences in complexity between
    normal GNU grep expressions (which are basically extended POSIX
    expressions) and perl-compatible expressions. The latter just _are_ more
    complex, having additional features like look aheads or non-greedy
    qualifiers.

    I just wanted to illustrate, that the speed of the given search is somehow
    related to the complexity of the engine.

    Btw, other pcre implementation are as slow as Python or "grep -P". I tried
    a sample C++-code using pcre++ (a wrapper around libpcre) and saw it
    running equally long.


    --
    Freedom is always the freedom of dissenters.
    (Rosa Luxemburg)

    Comment

    • Carl Banks

      #17
      Re: re.search much slower then grep on some regular expressions

      On Jul 5, 6:44 am, "Sebastian \"lunar\" Wiesner"
      <basti.wies...@ gmx.netwrote:
      Carl Banks <pavlovevide... @gmail.com>:
      >
      >
      >
      On Jul 5, 4:12 am, "Sebastian \"lunar\" Wiesner"
      <basti.wies...@ gmx.netwrote:
      Paddy <paddy3...@goog lemail.com>:
      >
      On Jul 4, 1:36 pm, Peter Otten <__pete...@web. dewrote:
      Henning_Thornbl ad wrote:
      What can be the cause of the large difference between re.search and
      grep?
      >
      grep uses a smarter algorithm ;)
      >
      This script takes about 5 min to run on my computer:
      #!/usr/bin/env python
      import re
      >
      row=""
      for a in range(156000):
      row+="a"
      print re.search('[^ "=]*/',row)
      >
      While doing a simple grep:
      grep '[^ "=]*/' input (input contains 156.000 a in
      one row)
      doesn't even take a second.
      >
      Is this a bug in python?
      >
      You could call this a performance bug, but it's not common enough in
      real code to get the necessary brain cycles from the core developers.
      So you can either write a patch yourself or use a workaround.
      >
      re.search('[^ "=]*/', row) if "/" in row else None
      >
      might be good enough.
      >
      Peter
      >
      It is not a smarter algorithm that is used in grep. Python RE's have
      more capabilities than grep RE's which need a slower, more complex
      algorithm.
      >
      FWIW, grep itself can confirm this statement. The following command
      roughly takes as long as Python's re.search:
      >
      # grep -P '[^ "=]*/' input
      >
      -P tells grep to use real perl-compatible regular expressions.
      >
      This confirms that a particular engine might not be optimized for it,
      but it's not necessarily a reflection that the engine is more complex.
      >
      My posting wasn't intended to reflect the differences in complexity between
      normal GNU grep expressions (which are basically extended POSIX
      expressions) and perl-compatible expressions. The latter just _are_ more
      complex, having additional features like look aheads or non-greedy
      qualifiers.
      >
      I just wanted to illustrate, that the speed of the given search is somehow
      related to the complexity of the engine.
      I don't think you've illustrated that at all. What you've illustrated
      is that one implementation of regexp optimizes something that another
      doesn't. It might be due to differences in complexity; it might not.
      (Maybe there's something about PCREs that precludes the optimization
      that the default grep uses, but I'd be inclined to think not.)


      Carl Banks

      Comment

      • bearophileHUGS@lycos.com

        #18
        Re: re.search much slower then grep on some regular expressions

        Paddy:
        You could argue that if the costly RE features are not used then maybe
        simpler, faster algorithms should be automatically swapped in but ....
        Many Python implementations contains a TCL interpreter. TCL REs may be
        better than Python ones, so it can be interesting to benchmark the
        same RE with TCL, to see how much time it needs. If someone here knows
        TCL it may require to write just few lines of TCL code (used through
        tkinter, by direct call).

        Bye,
        bearophile

        Comment

        • Mark Dickinson

          #19
          Re: re.search much slower then grep on some regular expressions

          On Jul 5, 1:54 pm, Carl Banks <pavlovevide... @gmail.comwrote :
          I don't think you've illustrated that at all.  What you've illustrated
          is that one implementation of regexp optimizes something that another
          doesn't.  It might be due to differences in complexity; it might not.
          (Maybe there's something about PCREs that precludes the optimization
          that the default grep uses, but I'd be inclined to think not.)
          It seems like an appropriate moment to point out *this* paper:



          Apparently, grep and Tcl convert a regex to a finite state machine.
          Matching is then *very* fast: essentially linear time in the
          length of the string being matched, even in the worst case.
          Though it is possible for the size of the finite state machine
          to grow exponentially with the size of the regex.

          But not all PCREs can be converted to a finite state machine, so
          Perl, Python, etc. use a backtracking approach, which has
          exponential running time in the worst case. In particular,
          it's not possible to use a finite state machine to represent
          a regular expression that contains backreferences.

          Part of the problem is a lack of agreement on what
          'regular expression' means. Strictly speaking, PCREs aren't
          regular expressions at all, for some values of the term
          'regular expression'. See



          Mark

          Comment

          • Paddy

            #20
            Re: re.search much slower then grep on some regular expressions

            On Jul 5, 4:13 pm, Mark Dickinson <dicki...@gmail .comwrote:
            It seems like an appropriate moment to point out *this* paper:
            >

            >
            That's the one!

            Thanks Mark.

            - Paddy.

            Comment

            • Terry Reedy

              #21
              Re: re.search much slower then grep on some regular expressions



              Mark Dickinson wrote:
              On Jul 5, 1:54 pm, Carl Banks <pavlovevide... @gmail.comwrote :
              Part of the problem is a lack of agreement on what
              'regular expression' means.
              Twenty years ago, there was. Calling a extended re-derived grammar
              expression like Perl's a 'regular-expression' is a bit like calling a
              Hummer a 'car' -- perhaps to hide its gas-guzzling behavior.
              Strictly speaking, PCREs aren't
              regular expressions at all, for some values of the term
              'regular expression'. See
              >
              http://en.wikipedia.org/wiki/Regular_expression

              Comment

              • Sebastian \lunar\ Wiesner

                #22
                Re: re.search much slower then grep on some regular expressions

                Terry Reedy <tjreedy@udel.e du>:
                Mark Dickinson wrote:
                >On Jul 5, 1:54 pm, Carl Banks <pavlovevide... @gmail.comwrote :
                >
                >Part of the problem is a lack of agreement on what
                >'regular expression' means.
                >
                Twenty years ago, there was. Calling a extended re-derived grammar
                expression like Perl's a 'regular-expression' is a bit like calling a
                Hummer a 'car' -- perhaps to hide its gas-guzzling behavior.
                Yet, to many programmers the Hummer is not only just a car, its _the_ car.
                Everything else is just silly, bloody old posix crap ;)

                (meaning, that pcres are often seen as true regular expressions, and
                everything else is referred to as "being somehow restricted")

                --
                Freedom is always the freedom of dissenters.
                (Rosa Luxemburg)

                Comment

                • sjdevnull@yahoo.com

                  #23
                  Re: re.search much slower then grep on some regular expressions

                  On Jul 5, 11:13 am, Mark Dickinson <dicki...@gmail .comwrote:
                  Apparently, grep and Tcl convert a regex to a finite state machine.
                  ....
                  But not all PCREs can be converted to a finite state machine
                  ....
                  Part of the problem is a lack of agreement on what
                  'regular expression' means. Strictly speaking, PCREs aren't
                  regular expressions at all, for some values of the term
                  'regular expression'. See
                  >
                  http://en.wikipedia.org/wiki/Regular_expression
                  Formally, there's no lack of agreement that I'm aware of. Anyone
                  formally defining it will use something along the lines of what
                  wikipedia specifies (with "Regular expressions in this sense can
                  express the regular languages, exactly the class of languages accepted
                  by finite state automata."). Colloquially it's used to mean "any text-
                  matching scheme that looks vaguely like sed/grep regexes".

                  PCREs are certainly not "real" regular expressions, but they are
                  colloquially.

                  Comment

                  • Mark Wooding

                    #24
                    Re: re.search much slower then grep on some regular expressions

                    Sebastian "lunar" Wiesner <basti.wiesner@ gmx.netwrote:
                    I just wanted to illustrate, that the speed of the given search is somehow
                    related to the complexity of the engine.
                    >
                    Btw, other pcre implementation are as slow as Python or "grep -P". I tried
                    a sample C++-code using pcre++ (a wrapper around libpcre) and saw it
                    running equally long.
                    So some other implementations are equally poor. I note that Perl itself
                    handles this case very quickly, as does Edi Weitz's CL-PPCRE library.

                    Yes, Perl-compatible `regular expressions' are more complicated than
                    POSIX extended regular expressions; but that doesn't mean that you have
                    to implement them in a dumb way. Indeed, it stands to reason that
                    expressions describing truly regular languages can be matched using the
                    same old algorithms that grep uses.

                    -- [mdw]

                    Comment

                    • Sebastian \lunar\ Wiesner

                      #25
                      Re: re.search much slower then grep on some regular expressions

                      Mark Wooding <mdw@distorted. org.uk>:
                      Sebastian "lunar" Wiesner <basti.wiesner@ gmx.netwrote:
                      >
                      >I just wanted to illustrate, that the speed of the given search is
                      >somehow related to the complexity of the engine.
                      >>
                      >Btw, other pcre implementation are as slow as Python or "grep -P". I
                      >tried a sample C++-code using pcre++ (a wrapper around libpcre) and saw
                      >it running equally long.
                      >
                      So some other implementations are equally poor. I note that Perl itself
                      handles this case very quickly, as does Edi Weitz's CL-PPCRE library.
                      I don't know about the latter, but perl doesn't use any smart algorithm, it
                      just heavily relies on memoization to gain speed.

                      This makes perl perform poorly in other situations, as mentioned in the
                      paper cited by Mark Dickinson:

                      # perl -e '("a" x 100000) =~ /^(ab?)*$/;'
                      zsh: segmentation fault perl -e '("a" x 100000) =~ /^(ab?)*$/;'

                      In Python on the other, this pattern works fine, at the cost of performance
                      losses on other patterns.

                      It'd be interesting to know, how CL-PPCRE performs here (I don't know this
                      library).
                      Yes, Perl-compatible `regular expressions' are more complicated than
                      POSIX extended regular expressions; but that doesn't mean that you have
                      to implement them in a dumb way. Indeed, it stands to reason that
                      expressions describing truly regular languages can be matched using the
                      same old algorithms that grep uses.
                      I completely agree. I'd just believe, that the combination of some finite
                      state machine for "classic" expressions with some backtracking code is
                      terribly hard to implement. But I'm not an expert in this, probably some
                      libraries out there already do this. In this case, it'd be time to give
                      pythons re engine a more sophisticated implementation ;)

                      --
                      Freedom is always the freedom of dissenters.
                      (Rosa Luxemburg)

                      Comment

                      • Terry Reedy

                        #26
                        Re: re.search much slower then grep on some regular expressions



                        Sebastian "lunar" Wiesner wrote:
                        I completely agree. I'd just believe, that the combination of some finite
                        state machine for "classic" expressions with some backtracking code is
                        terribly hard to implement. But I'm not an expert in this, probably some
                        libraries out there already do this. In this case, it'd be time to give
                        pythons re engine a more sophisticated implementation ;)
                        One thing to remember in making comparisons is that Python moved to its
                        own implementation to support unicode as well as extended ascii (bytes
                        in 3.0). Does grep do that? Has pcre, which Python once used, been
                        upgraded to do that?

                        Comment

                        • Mark Wooding

                          #27
                          Re: re.search much slower then grep on some regular expressions

                          Sebastian "lunar" Wiesner <basti.wiesner@ gmx.netwrote:
                          # perl -e '("a" x 100000) =~ /^(ab?)*$/;'
                          zsh: segmentation fault perl -e '("a" x 100000) =~ /^(ab?)*$/;'
                          (Did you really run that as root?)
                          It'd be interesting to know, how CL-PPCRE performs here (I don't know this
                          library).
                          Stack overflow condition. :-(

                          -- [mdw]

                          Comment

                          • Sebastian \lunar\ Wiesner

                            #28
                            Re: re.search much slower then grep on some regular expressions

                            Mark Wooding <mdw@distorted. org.uk>:
                            Sebastian "lunar" Wiesner <basti.wiesner@ gmx.netwrote:
                            >
                            ># perl -e '("a" x 100000) =~ /^(ab?)*$/;'
                            >zsh: segmentation fault perl -e '("a" x 100000) =~ /^(ab?)*$/;'
                            >
                            (Did you really run that as root?)
                            How come, that you think so?

                            --
                            Freedom is always the freedom of dissenters.
                            (Rosa Luxemburg)

                            Comment

                            • Marc 'BlackJack' Rintsch

                              #29
                              Re: re.search much slower then grep on some regular expressions

                              On Mon, 07 Jul 2008 16:44:22 +0200, Sebastian \"lunar\" Wiesner wrote:
                              Mark Wooding <mdw@distorted. org.uk>:
                              >
                              >Sebastian "lunar" Wiesner <basti.wiesner@ gmx.netwrote:
                              >>
                              >># perl -e '("a" x 100000) =~ /^(ab?)*$/;'
                              >>zsh: segmentation fault perl -e '("a" x 100000) =~ /^(ab?)*$/;'
                              >>
                              >(Did you really run that as root?)
                              >
                              How come, that you think so?
                              The '#' is usually the prompt of root accounts while ordinary users get a '$'.

                              Ciao,
                              Marc 'BlackJack' Rintsch

                              Comment

                              • Henning Thornblad

                                #30
                                Re: re.search much slower then grep on some regular expressions

                                When trying to find an alternative way of solving my problem i found
                                that running this script:

                                #!/usr/bin/env python

                                import re

                                row=""
                                for a in range(156000):
                                row+="a"
                                print "How many, dude?"
                                print re.search('/[^ "=]*',row) (the / has moved)

                                wouldn't take even a second (The re.search part of course)

                                This is for me very strange since this,
                                in theory, is the same problem as before.

                                /Henning Thornblad

                                Comment

                                Working...