re.search much slower then grep on some regular expressions

Collapse
This topic is closed.
X
X
 
  • Time
  • Show
Clear All
new posts
  • Henning_Thornblad

    re.search much slower then grep on some regular expressions

    What can be the cause of the large difference between re.search and
    grep?

    This script takes about 5 min to run on my computer:
    #!/usr/bin/env python
    import re

    row=""
    for a in range(156000):
    row+="a"
    print re.search('[^ "=]*/',row)


    While doing a simple grep:
    grep '[^ "=]*/' input (input contains 156.000 a in
    one row)
    doesn't even take a second.

    Is this a bug in python?

    Thanks...
    Henning Thornblad
  • Bruno Desthuilliers

    #2
    Re: re.search much slower then grep on some regular expressions

    Henning_Thornbl ad a écrit :
    What can be the cause of the large difference between re.search and
    grep?
    >
    This script takes about 5 min to run on my computer:
    #!/usr/bin/env python
    import re
    >
    row=""
    for a in range(156000):
    row+="a"
    print re.search('[^ "=]*/',row)
    >
    >
    While doing a simple grep:
    grep '[^ "=]*/' input (input contains 156.000 a in
    one row)
    doesn't even take a second.
    >
    Is this a bug in python?
    Please re-read carefully your python code. Don't you think there's a
    subtle difference between reading a file and buildin 156000 string objects ?

    Comment

    • Bruno Desthuilliers

      #3
      Re: re.search much slower then grep on some regular expressions

      Bruno Desthuilliers a écrit :
      Henning_Thornbl ad a écrit :
      >What can be the cause of the large difference between re.search and
      >grep?
      >>
      >This script takes about 5 min to run on my computer:
      >#!/usr/bin/env python
      >import re
      >>
      >row=""
      >for a in range(156000):
      > row+="a"
      >print re.search('[^ "=]*/',row)
      >>
      >>
      >While doing a simple grep:
      >grep '[^ "=]*/' input (input contains 156.000 a in
      >one row)
      >doesn't even take a second.
      >>
      >Is this a bug in python?
      >
      Please re-read carefully your python code. Don't you think there's a
      subtle difference between reading a file and buildin 156000 string
      objects ?
      >
      Mmm... This set aside, after testing it (building the string in a
      somewhat more efficient way), the call to re.search effectively takes
      ages to return. Please forget my previous post.

      Comment

      • Peter Otten

        #4
        Re: re.search much slower then grep on some regular expressions

        Henning_Thornbl ad wrote:
        What can be the cause of the large difference between re.search and
        grep?
        grep uses a smarter algorithm ;)
        This script takes about 5 min to run on my computer:
        #!/usr/bin/env python
        import re
        >
        row=""
        for a in range(156000):
        row+="a"
        print re.search('[^ "=]*/',row)
        >
        >
        While doing a simple grep:
        grep '[^ "=]*/' input (input contains 156.000 a in
        one row)
        doesn't even take a second.
        >
        Is this a bug in python?
        You could call this a performance bug, but it's not common enough in real
        code to get the necessary brain cycles from the core developers.
        So you can either write a patch yourself or use a workaround.

        re.search('[^ "=]*/', row) if "/" in row else None

        might be good enough.

        Peter

        Comment

        • Paddy

          #5
          Re: re.search much slower then grep on some regular expressions

          On Jul 4, 1:36 pm, Peter Otten <__pete...@web. dewrote:
          Henning_Thornbl ad wrote:
          What can be the cause of the large difference between re.search and
          grep?
          >
          grep uses a smarter algorithm ;)
          >
          >
          >
          This script takes about 5 min to run on my computer:
          #!/usr/bin/env python
          import re
          >
          row=""
          for a in range(156000):
              row+="a"
          print re.search('[^ "=]*/',row)
          >
          While doing a simple grep:
          grep '[^ "=]*/' input                  (input contains 156.000 a in
          one row)
          doesn't even take a second.
          >
          Is this a bug in python?
          >
          You could call this a performance bug, but it's not common enough in real
          code to get the necessary brain cycles from the core developers.
          So you can either write a patch yourself or use a workaround.
          >
          re.search('[^ "=]*/', row) if "/" in row else None
          >
          might be good enough.
          >
          Peter
          It is not a smarter algorithm that is used in grep. Python RE's have
          more capabilities than grep RE's which need a slower, more complex
          algorithm.
          You could argue that if the costly RE features are not used then maybe
          simpler, faster algorithms should be automatically swapped in but ....

          - Paddy.

          Comment

          • Filipe Fernandes

            #6
            Re: re.search much slower then grep on some regular expressions

            On Fri, Jul 4, 2008 at 8:36 AM, Peter Otten <__peter__@web. dewrote:
            Henning_Thornbl ad wrote:
            >
            >What can be the cause of the large difference between re.search and
            >grep?
            >
            grep uses a smarter algorithm ;)
            >
            >This script takes about 5 min to run on my computer:
            >#!/usr/bin/env python
            >import re
            >>
            >row=""
            >for a in range(156000):
            > row+="a"
            >print re.search('[^ "=]*/',row)
            >>
            >>
            >While doing a simple grep:
            >grep '[^ "=]*/' input (input contains 156.000 a in
            >one row)
            >doesn't even take a second.
            >>
            >Is this a bug in python?
            >
            You could call this a performance bug, but it's not common enough in real
            code to get the necessary brain cycles from the core developers.
            So you can either write a patch yourself or use a workaround.
            >
            re.search('[^ "=]*/', row) if "/" in row else None
            >
            might be good enough.
            >
            Wow... I'm rather surprised at how slow this is... using re.match
            yields much quicker results, but of course it's not quite the same as
            re.search

            Incidentally, if you add the '/' to "row" at the end of the string,
            re.search returns instantly with a match object.

            @ Peter
            I'm not versed enough in regex to tell if this is a bug or not
            (although I suspect it is), but why would you say this particular
            regex isn't common enough in real code?

            filipe

            Comment

            • Carl Banks

              #7
              Re: re.search much slower then grep on some regular expressions

              On Jul 4, 4:43 pm, "Filipe Fernandes" <fernandes...@g mail.comwrote:
              On Fri, Jul 4, 2008 at 8:36 AM, Peter Otten <__pete...@web. dewrote:
              Henning_Thornbl ad wrote:
              >
              What can be the cause of the large difference between re.search and
              grep?
              >
              grep uses a smarter algorithm ;)
              >
              This script takes about 5 min to run on my computer:
              #!/usr/bin/env python
              import re
              >
              row=""
              for a in range(156000):
              row+="a"
              print re.search('[^ "=]*/',row)
              >
              While doing a simple grep:
              grep '[^ "=]*/' input (input contains 156.000 a in
              one row)
              doesn't even take a second.
              >
              Is this a bug in python?
              >
              You could call this a performance bug, but it's not common enough in real
              code to get the necessary brain cycles from the core developers.
              So you can either write a patch yourself or use a workaround.
              >
              re.search('[^ "=]*/', row) if "/" in row else None
              >
              might be good enough.
              >
              Wow... I'm rather surprised at how slow this is... using re.match
              yields much quicker results, but of course it's not quite the same as
              re.search
              >
              Incidentally, if you add the '/' to "row" at the end of the string,
              re.search returns instantly with a match object.
              This behavior is showing that you're getting n-squared performance;
              the regexp seems to be checking 156000*(156000-1)/2 substrings for a
              match.

              I don't think it's possible to avoid quadratic behavior in regexps in
              general, but clearly there are ways to optimize in some cases.

              I'm guessing that grep builds a table of locations of individual
              characters as it scans and, when the regexp exhausts the input, it
              tries to backtrack to the last slash it saw, except there wasn't one
              so it knows the regexp cannot be satisfied and it exits early.

              @ Peter
              I'm not versed enough in regex to tell if this is a bug or not
              (although I suspect it is),
              I'm pretty sure it isn't: the regexp documentation makes no claims as
              to the performance of the regexp that I'm aware of.

              but why would you say this particular
              regex isn't common enough in real code?
              When re.search regexps start with things like [^...]* or .*, typically
              the excluded characters are a typically found frequently in the
              input. For example, the pattern .*hello.* could be used to find a
              line with hello in it, with the expectation that there are lots of
              newlines. But if there aren't any newlines the regexp wouldn't be
              very useful.



              Carl Banks

              Comment

              • John Nagle

                #8
                Re: re.search much slower then grep on some regular expressions

                Henning_Thornbl ad wrote:
                What can be the cause of the large difference between re.search and
                grep?
                >
                This script takes about 5 min to run on my computer:
                #!/usr/bin/env python
                import re
                >
                row=""
                for a in range(156000):
                row+="a"
                print re.search('[^ "=]*/',row)
                >
                >
                While doing a simple grep:
                grep '[^ "=]*/' input (input contains 156.000 a in
                one row)
                doesn't even take a second.
                >
                Is this a bug in python?
                >
                Thanks...
                Henning Thornblad
                You're recompiling the regular expression on each use.
                Use "re.compile " before the loop to do it once.

                John Nagle

                Comment

                • Peter Pearson

                  #9
                  Re: re.search much slower then grep on some regular expressions

                  On Fri, 4 Jul 2008 20:34:03 -0700 (PDT), Carl Banks wrote:
                  On Jul 4, 4:43 pm, "Filipe Fernandes" <fernandes...@g mail.comwrote:
                  >On Fri, Jul 4, 2008 at 8:36 AM, Peter Otten <__pete...@web. dewrote:
                  Henning_Thornbl ad wrote:
                  >>
                  >This script takes about 5 min to run on my computer:
                  >#!/usr/bin/env python
                  >import re
                  >>
                  >row=""
                  >for a in range(156000):
                  > row+="a"
                  >print re.search('[^ "=]*/',row)
                  [snip]
                  >Is this a bug in python?
                  >
                  This behavior is showing that you're getting n-squared performance;
                  the regexp seems to be checking 156000*(156000-1)/2 substrings for a
                  match.
                  I did this:

                  $ python -m timeit -s "import re" "re.search( '[^13]*x', 900*'a' )"
                  100 loops, best of 3: 16.7 msec per loop

                  for values of 900 ranging from 300 to 1000, and the time taken
                  per loop was indeed quadratic.

                  --
                  To email me, substitute nowhere->spamcop, invalid->net.

                  Comment

                  • Peter Otten

                    #10
                    Re: re.search much slower then grep on some regular expressions

                    John Nagle wrote:
                    Henning_Thornbl ad wrote:
                    >What can be the cause of the large difference between re.search and
                    >grep?
                    >>
                    >This script takes about 5 min to run on my computer:
                    >#!/usr/bin/env python
                    >import re
                    >>
                    >row=""
                    >for a in range(156000):
                    > row+="a"
                    >print re.search('[^ "=]*/',row)
                    >>
                    >>
                    >While doing a simple grep:
                    >grep '[^ "=]*/' input (input contains 156.000 a in
                    >one row)
                    >doesn't even take a second.
                    >>
                    >Is this a bug in python?
                    >>
                    >Thanks...
                    >Henning Thornblad
                    >
                    You're recompiling the regular expression on each use.
                    Use "re.compile " before the loop to do it once.
                    Now that's premature optimization :-)

                    Apart from the fact that re.search() is executed only once in the above
                    script the re library uses a caching scheme so that even if the re.search()
                    call were in a loop the overhead would be a few microseconds for the cache
                    lookup.

                    Peter

                    Comment

                    • Peter Otten

                      #11
                      Re: re.search much slower then grep on some regular expressions

                      Paddy wrote:
                      It is not a smarter algorithm that is used in grep. Python RE's have
                      more capabilities than grep RE's which need a slower, more complex
                      algorithm.
                      So you're saying the Python algo is alternatively gifted...

                      Peter

                      Comment

                      • Peter Otten

                        #12
                        Re: re.search much slower then grep on some regular expressions

                        Filipe Fernandes wrote:
                        but why would you say this particular
                        regex isn't common enough in real code?
                        As Carl says, it's not just the regex, it's the the combination with a long
                        line that exposes the re library's weakness.

                        Peter

                        Comment

                        • Paddy

                          #13
                          Re: re.search much slower then grep on some regular expressions

                          On Jul 5, 7:01 am, Peter Otten <__pete...@web. dewrote:
                          Paddy wrote:
                          It is not a smarter algorithm that is used in grep. Python RE's have
                          more capabilities than grep RE's which need a slower, more complex
                          algorithm.
                          >
                          So you're saying the Python algo is alternatively gifted...
                          >
                          Peter
                          The following isn't the article I read on regexp types and their speed
                          differences but it does give info on regexp engine types:
                          Regular expressions are an extremely powerful tool for manipulating text and data. They are now standard features in a wide range of languages and popular tools, including Perl, Python, Ruby, Java, VB.NET and C# (and any language using the .NET Framework), PHP, and MySQL. If you don't use regular expressions yet, you will discover in this book a whole new world of mastery over your data. If you already use them, you'll appreciate this book's unprecedented detail and breadth of coverage. If you think you know all you need to know about regular expressions, this book is a stunning eye-opener. As this book shows, a command of regular expressions is an invaluable skill. Regular expressions allow you to code complex and subtle text processing that you never imagined could be automated. Regular expressions can save you time and aggravation. They can be used to craft elegant solutions to a wide range of problems. Once you've mastered regular expressions, they'll become an invaluable part of your toolkit. You will wonder how you ever got by without them. Yet despite their wide availability, flexibility, and unparalleled power, regular expressions are frequently underutilized. Yet what is power in the hands of an expert can be fraught with peril for the unwary. Mastering Regular Expressions will help you navigate the minefield to becoming an expert and help you optimize your use of regular expressions. Mastering Regular Expressions, Third Edition, now includes a full chapter devoted to PHP and its powerful and expressive suite of regular expression functions, in addition to enhanced PHP coverage in the central "core" chapters. Furthermore, this edition has been updated throughout to reflect advances in other languages, including expanded in-depth coverage of Sun's java.util.regex package, which has emerged as the standard Java regex implementation.Topics include: A comparison of features among different versions of many languages and tools How the regular expression engine works Optimization (major savings available here!) Matching just what you want, but not what you don't want Sections and chapters on individual languages Written in the lucid, entertaining tone that makes a complex, dry topic become crystal-clear to programmers, and sprinkled with solutions to complex real-world problems, Mastering Regular Expressions, Third Edition offers a wealth information that you can put to immediate use. Reviews of this new edition and the second edition: "There isn't a better (or more useful) book available on regular expressions." --Zak Greant, Managing Director, eZ Systems "A real tour-de-force of a book which not only covers the mechanics of regexes in extraordinary detail but also talks about efficiency and the use of regexes in Perl, Java, and .NET...If you use regular expressions as part of your professional work (even if you already have a good book on whatever language you're programming in) I would strongly recommend this book to you." --Dr. Chris Brown, Linux Format "The author does an outstanding job leading the reader from regex novice to master. The book is extremely easy to read and chock full of useful and relevant examples...Regular expressions are valuable tools that every developer should have in their toolbox. Mastering Regular Expressions is the definitive guide to the subject, and an outstanding resource that belongs on every programmer's bookshelf. Ten out of Ten Horseshoes." --Jason Menard, Java Ranch


                          - Paddy.

                          Comment

                          • Sebastian \lunar\ Wiesner

                            #14
                            Re: re.search much slower then grep on some regular expressions

                            Paddy <paddy3118@goog lemail.com>:
                            On Jul 4, 1:36 pm, Peter Otten <__pete...@web. dewrote:
                            >Henning_Thornb lad wrote:
                            What can be the cause of the large difference between re.search and
                            grep?
                            >>
                            >grep uses a smarter algorithm ;)
                            >>
                            >>
                            >>
                            This script takes about 5 min to run on my computer:
                            #!/usr/bin/env python
                            import re
                            >>
                            row=""
                            for a in range(156000):
                            row+="a"
                            print re.search('[^ "=]*/',row)
                            >>
                            While doing a simple grep:
                            grep '[^ "=]*/' input                  (input contains 156.000 a in
                            one row)
                            doesn't even take a second.
                            >>
                            Is this a bug in python?
                            >>
                            >You could call this a performance bug, but it's not common enough in real
                            >code to get the necessary brain cycles from the core developers.
                            >So you can either write a patch yourself or use a workaround.
                            >>
                            >re.search('[^ "=]*/', row) if "/" in row else None
                            >>
                            >might be good enough.
                            >>
                            >Peter
                            >
                            It is not a smarter algorithm that is used in grep. Python RE's have
                            more capabilities than grep RE's which need a slower, more complex
                            algorithm.
                            FWIW, grep itself can confirm this statement. The following command roughly
                            takes as long as Python's re.search:

                            # grep -P '[^ "=]*/' input

                            -P tells grep to use real perl-compatible regular expressions.

                            --
                            Freedom is always the freedom of dissenters.
                            (Rosa Luxemburg)

                            Comment

                            • Carl Banks

                              #15
                              Re: re.search much slower then grep on some regular expressions

                              On Jul 5, 4:12 am, "Sebastian \"lunar\" Wiesner"
                              <basti.wies...@ gmx.netwrote:
                              Paddy <paddy3...@goog lemail.com>:
                              >
                              >
                              >
                              On Jul 4, 1:36 pm, Peter Otten <__pete...@web. dewrote:
                              Henning_Thornbl ad wrote:
                              What can be the cause of the large difference between re.search and
                              grep?
                              >
                              grep uses a smarter algorithm ;)
                              >
                              This script takes about 5 min to run on my computer:
                              #!/usr/bin/env python
                              import re
                              >
                              row=""
                              for a in range(156000):
                              row+="a"
                              print re.search('[^ "=]*/',row)
                              >
                              While doing a simple grep:
                              grep '[^ "=]*/' input (input contains 156.000 a in
                              one row)
                              doesn't even take a second.
                              >
                              Is this a bug in python?
                              >
                              You could call this a performance bug, but it's not common enough in real
                              code to get the necessary brain cycles from the core developers.
                              So you can either write a patch yourself or use a workaround.
                              >
                              re.search('[^ "=]*/', row) if "/" in row else None
                              >
                              might be good enough.
                              >
                              Peter
                              >
                              It is not a smarter algorithm that is used in grep. Python RE's have
                              more capabilities than grep RE's which need a slower, more complex
                              algorithm.
                              >
                              FWIW, grep itself can confirm this statement. The following command roughly
                              takes as long as Python's re.search:
                              >
                              # grep -P '[^ "=]*/' input
                              >
                              -P tells grep to use real perl-compatible regular expressions.
                              This confirms that a particular engine might not be optimized for it,
                              but it's not necessarily a reflection that the engine is more complex.

                              I'm not sure of the specifics and maybe that is the case, but it could
                              also be a case of a different codebase which is optimized differently.


                              Carl Banks

                              Comment

                              Working...