re.search much slower then grep on some regular expressions

**Bruno Desthuilliers** · Jul 4 '08, 12:35 PM

Re: re.search much slower then grep on some regular expressions

Henning_Thornbl ad a écrit :

What can be the cause of the large difference between re.search and
grep?
>
This script takes about 5 min to run on my computer:
#!/usr/bin/env python
import re
>
row=""
for a in range(156000):
row+="a"
print re.search('[^ "=]*/',row)
>
>
While doing a simple grep:
grep '[^ "=]*/' input (input contains 156.000 a in
one row)
doesn't even take a second.
>
Is this a bug in python?

Please re-read carefully your python code. Don't you think there's a
subtle difference between reading a file and buildin 156000 string objects ?

**Bruno Desthuilliers** · Jul 4 '08, 12:35 PM

Re: re.search much slower then grep on some regular expressions

Bruno Desthuilliers a écrit :

Henning_Thornbl ad a écrit :

>What can be the cause of the large difference between re.search and
>grep?
>>
>This script takes about 5 min to run on my computer:
>#!/usr/bin/env python
>import re
>>
>row=""
>for a in range(156000):
> row+="a"
>print re.search('[^ "=]*/',row)
>>
>>
>While doing a simple grep:
>grep '[^ "=]*/' input (input contains 156.000 a in
>one row)
>doesn't even take a second.
>>
>Is this a bug in python?

>
Please re-read carefully your python code. Don't you think there's a
subtle difference between reading a file and buildin 156000 string
objects ?
>

Mmm... This set aside, after testing it (building the string in a
somewhat more efficient way), the call to re.search effectively takes
ages to return. Please forget my previous post.

**Peter Otten** · Jul 4 '08, 12:45 PM

Re: re.search much slower then grep on some regular expressions

Henning_Thornbl ad wrote:

What can be the cause of the large difference between re.search and
grep?

grep uses a smarter algorithm ;)

This script takes about 5 min to run on my computer:
#!/usr/bin/env python
import re
>
row=""
for a in range(156000):
row+="a"
print re.search('[^ "=]*/',row)
>
>
While doing a simple grep:
grep '[^ "=]*/' input (input contains 156.000 a in
one row)
doesn't even take a second.
>
Is this a bug in python?

You could call this a performance bug, but it's not common enough in real
code to get the necessary brain cycles from the core developers.
So you can either write a patch yourself or use a workaround.

re.search('[^ "=]*/', row) if "/" in row else None

might be good enough.

Peter

**Paddy** · Jul 4 '08, 04:45 PM

Re: re.search much slower then grep on some regular expressions

On Jul 4, 1:36 pm, Peter Otten <__pete...@web. dewrote:

Henning_Thornbl ad wrote:

What can be the cause of the large difference between re.search and
grep?

>
grep uses a smarter algorithm ;)
>
>
>

This script takes about 5 min to run on my computer:
#!/usr/bin/env python
import re

>

row=""
for a in range(156000):
row+="a"
print re.search('[^ "=]*/',row)

>

While doing a simple grep:
grep '[^ "=]*/' input (input contains 156.000 a in
one row)
doesn't even take a second.

>

Is this a bug in python?

>
You could call this a performance bug, but it's not common enough in real
code to get the necessary brain cycles from the core developers.
So you can either write a patch yourself or use a workaround.
>
re.search('[^ "=]*/', row) if "/" in row else None
>
might be good enough.
>
Peter

It is not a smarter algorithm that is used in grep. Python RE's have
more capabilities than grep RE's which need a slower, more complex
algorithm.
You could argue that if the costly RE features are not used then maybe
simpler, faster algorithms should be automatically swapped in but ....

- Paddy.

**Filipe Fernandes** · Jul 4 '08, 08:45 PM

Re: re.search much slower then grep on some regular expressions

On Fri, Jul 4, 2008 at 8:36 AM, Peter Otten <__peter__@web. dewrote:

Henning_Thornbl ad wrote:
>

>What can be the cause of the large difference between re.search and
>grep?

>
grep uses a smarter algorithm ;)
>

>This script takes about 5 min to run on my computer:
>#!/usr/bin/env python
>import re
>>
>row=""
>for a in range(156000):
> row+="a"
>print re.search('[^ "=]*/',row)
>>
>>
>While doing a simple grep:
>grep '[^ "=]*/' input (input contains 156.000 a in
>one row)
>doesn't even take a second.
>>
>Is this a bug in python?

>
You could call this a performance bug, but it's not common enough in real
code to get the necessary brain cycles from the core developers.
So you can either write a patch yourself or use a workaround.
>
re.search('[^ "=]*/', row) if "/" in row else None
>
might be good enough.
>

Wow... I'm rather surprised at how slow this is... using re.match
yields much quicker results, but of course it's not quite the same as
re.search

Incidentally, if you add the '/' to "row" at the end of the string,
re.search returns instantly with a match object.

@ Peter
I'm not versed enough in regex to tell if this is a bug or not
(although I suspect it is), but why would you say this particular
regex isn't common enough in real code?

filipe

**Carl Banks** · Jul 5 '08, 03:35 AM

Re: re.search much slower then grep on some regular expressions

On Jul 4, 4:43 pm, "Filipe Fernandes" <fernandes...@g mail.comwrote:

On Fri, Jul 4, 2008 at 8:36 AM, Peter Otten <__pete...@web. dewrote:

Henning_Thornbl ad wrote:

>

What can be the cause of the large difference between re.search and
grep?

>

grep uses a smarter algorithm ;)

>

This script takes about 5 min to run on my computer:
#!/usr/bin/env python
import re

>

row=""
for a in range(156000):
row+="a"
print re.search('[^ "=]*/',row)

>

While doing a simple grep:
grep '[^ "=]*/' input (input contains 156.000 a in
one row)
doesn't even take a second.

>

Is this a bug in python?

>

You could call this a performance bug, but it's not common enough in real
code to get the necessary brain cycles from the core developers.
So you can either write a patch yourself or use a workaround.

>

re.search('[^ "=]*/', row) if "/" in row else None

>

might be good enough.

>
Wow... I'm rather surprised at how slow this is... using re.match
yields much quicker results, but of course it's not quite the same as
re.search
>
Incidentally, if you add the '/' to "row" at the end of the string,
re.search returns instantly with a match object.

This behavior is showing that you're getting n-squared performance;
the regexp seems to be checking 156000*(156000-1)/2 substrings for a
match.

I don't think it's possible to avoid quadratic behavior in regexps in
general, but clearly there are ways to optimize in some cases.

I'm guessing that grep builds a table of locations of individual
characters as it scans and, when the regexp exhausts the input, it
tries to backtrack to the last slash it saw, except there wasn't one
so it knows the regexp cannot be satisfied and it exits early.

@ Peter
I'm not versed enough in regex to tell if this is a bug or not
(although I suspect it is),

I'm pretty sure it isn't: the regexp documentation makes no claims as
to the performance of the regexp that I'm aware of.

but why would you say this particular
regex isn't common enough in real code?

When re.search regexps start with things like [^...]* or .*, typically
the excluded characters are a typically found frequently in the
input. For example, the pattern .*hello.* could be used to find a
line with hello in it, with the expectation that there are lots of
newlines. But if there aren't any newlines the regexp wouldn't be
very useful.

Carl Banks

**John Nagle** · Jul 5 '08, 04:15 AM

Re: re.search much slower then grep on some regular expressions

Henning_Thornbl ad wrote:

What can be the cause of the large difference between re.search and
grep?
>
This script takes about 5 min to run on my computer:
#!/usr/bin/env python
import re
>
row=""
for a in range(156000):
row+="a"
print re.search('[^ "=]*/',row)
>
>
While doing a simple grep:
grep '[^ "=]*/' input (input contains 156.000 a in
one row)
doesn't even take a second.
>
Is this a bug in python?
>
Thanks...
Henning Thornblad

You're recompiling the regular expression on each use.
Use "re.compile " before the loop to do it once.

John Nagle

**Peter Pearson** · Jul 5 '08, 04:15 AM

Re: re.search much slower then grep on some regular expressions

On Fri, 4 Jul 2008 20:34:03 -0700 (PDT), Carl Banks wrote:

On Jul 4, 4:43 pm, "Filipe Fernandes" <fernandes...@g mail.comwrote:

>On Fri, Jul 4, 2008 at 8:36 AM, Peter Otten <__pete...@web. dewrote:

Henning_Thornbl ad wrote:

>>

>This script takes about 5 min to run on my computer:
>#!/usr/bin/env python
>import re

>>

>row=""
>for a in range(156000):
> row+="a"
>print re.search('[^ "=]*/',row)

[snip]

>Is this a bug in python?

>
This behavior is showing that you're getting n-squared performance;
the regexp seems to be checking 156000*(156000-1)/2 substrings for a
match.

I did this:

$ python -m timeit -s "import re" "re.search( '[^13]*x', 900*'a' )"
100 loops, best of 3: 16.7 msec per loop

for values of 900 ranging from 300 to 1000, and the time taken
per loop was indeed quadratic.

--
To email me, substitute nowhere->spamcop, invalid->net.

**Peter Otten** · Jul 5 '08, 06:05 AM

Re: re.search much slower then grep on some regular expressions

John Nagle wrote:

Henning_Thornbl ad wrote:

>What can be the cause of the large difference between re.search and
>grep?
>>
>This script takes about 5 min to run on my computer:
>#!/usr/bin/env python
>import re
>>
>row=""
>for a in range(156000):
> row+="a"
>print re.search('[^ "=]*/',row)
>>
>>
>While doing a simple grep:
>grep '[^ "=]*/' input (input contains 156.000 a in
>one row)
>doesn't even take a second.
>>
>Is this a bug in python?
>>
>Thanks...
>Henning Thornblad

>
You're recompiling the regular expression on each use.
Use "re.compile " before the loop to do it once.

Now that's premature optimization :-)

Apart from the fact that re.search() is executed only once in the above
script the re library uses a caching scheme so that even if the re.search()
call were in a loop the overhead would be a few microseconds for the cache
lookup.

Peter

**Peter Otten** · Jul 5 '08, 06:05 AM

Re: re.search much slower then grep on some regular expressions

Paddy wrote:

It is not a smarter algorithm that is used in grep. Python RE's have
more capabilities than grep RE's which need a slower, more complex
algorithm.

So you're saying the Python algo is alternatively gifted...

Peter

**Peter Otten** · Jul 5 '08, 06:15 AM

Re: re.search much slower then grep on some regular expressions

Filipe Fernandes wrote:

but why would you say this particular
regex isn't common enough in real code?

As Carl says, it's not just the regex, it's the the combination with a long
line that exposes the re library's weakness.

Peter

**Paddy** · Jul 5 '08, 07:35 AM

Re: re.search much slower then grep on some regular expressions

On Jul 5, 7:01 am, Peter Otten <__pete...@web. dewrote:

Paddy wrote:

It is not a smarter algorithm that is used in grep. Python RE's have
more capabilities than grep RE's which need a slower, more complex
algorithm.

>
So you're saying the Python algo is alternatively gifted...
>
Peter

The following isn't the article I read on regexp types and their speed
differences but it does give info on regexp engine types:

Mastering Regular Expressions

http://books.google.co.uk/books?id=GX3w_18-JegC&pg=PA145&lpg=PA145&dq=%2Bregex+%2Bspeed+%2Bgrep+%2Btcl+&source=web&ots=PHljNqeuJZ&sig=CyBCOZYsyZaJTMfj2Br53if46Bc&hl=en&sa=X&oi=book_result&resnum=3&ct=result#PPA145,M1

Regular expressions are an extremely powerful tool for manipulating text and data. They are now standard features in a wide range of languages and popular tools, including Perl, Python, Ruby, Java, VB.NET and C# (and any language using the .NET Framework), PHP, and MySQL. If you don't use regular expressions yet, you will discover in this book a whole new world of mastery over your data. If you already use them, you'll appreciate this book's unprecedented detail and breadth of coverage. If you think you know all you need to know about regular expressions, this book is a stunning eye-opener. As this book shows, a command of regular expressions is an invaluable skill. Regular expressions allow you to code complex and subtle text processing that you never imagined could be automated. Regular expressions can save you time and aggravation. They can be used to craft elegant solutions to a wide range of problems. Once you've mastered regular expressions, they'll become an invaluable part of your toolkit. You will wonder how you ever got by without them. Yet despite their wide availability, flexibility, and unparalleled power, regular expressions are frequently underutilized. Yet what is power in the hands of an expert can be fraught with peril for the unwary. Mastering Regular Expressions will help you navigate the minefield to becoming an expert and help you optimize your use of regular expressions. Mastering Regular Expressions, Third Edition, now includes a full chapter devoted to PHP and its powerful and expressive suite of regular expression functions, in addition to enhanced PHP coverage in the central "core" chapters. Furthermore, this edition has been updated throughout to reflect advances in other languages, including expanded in-depth coverage of Sun's java.util.regex package, which has emerged as the standard Java regex implementation.Topics include: A comparison of features among different versions of many languages and tools How the regular expression engine works Optimization (major savings available here!) Matching just what you want, but not what you don't want Sections and chapters on individual languages Written in the lucid, entertaining tone that makes a complex, dry topic become crystal-clear to programmers, and sprinkled with solutions to complex real-world problems, Mastering Regular Expressions, Third Edition offers a wealth information that you can put to immediate use. Reviews of this new edition and the second edition: "There isn't a better (or more useful) book available on regular expressions." --Zak Greant, Managing Director, eZ Systems "A real tour-de-force of a book which not only covers the mechanics of regexes in extraordinary detail but also talks about efficiency and the use of regexes in Perl, Java, and .NET...If you use regular expressions as part of your professional work (even if you already have a good book on whatever language you're programming in) I would strongly recommend this book to you." --Dr. Chris Brown, Linux Format "The author does an outstanding job leading the reader from regex novice to master. The book is extremely easy to read and chock full of useful and relevant examples...Regular expressions are valuable tools that every developer should have in their toolbox. Mastering Regular Expressions is the definitive guide to the subject, and an outstanding resource that belongs on every programmer's bookshelf. Ten out of Ten Horseshoes." --Jason Menard, Java Ranch

- Paddy.

**Sebastian \lunar\ Wiesner** · Jul 5 '08, 08:15 AM

Re: re.search much slower then grep on some regular expressions

Paddy <paddy3118@goog lemail.com>:

On Jul 4, 1:36Â pm, Peter Otten <__pete...@web. dewrote:

>Henning_Thornb lad wrote:

What can be the cause of the large difference between re.search and
grep?

>>
>grep uses a smarter algorithm ;)
>>
>>
>>

This script takes about 5 min to run on my computer:
#!/usr/bin/env python
import re

>>

row=""
for a in range(156000):
row+="a"
print re.search('[^ "=]*/',row)

>>

While doing a simple grep:
grep '[^ "=]*/' input Â Â Â Â Â Â Â Â Â (input contains 156.000 a in
one row)
doesn't even take a second.

>>

Is this a bug in python?

>>
>You could call this a performance bug, but it's not common enough in real
>code to get the necessary brain cycles from the core developers.
>So you can either write a patch yourself or use a workaround.
>>
>re.search('[^ "=]*/', row) if "/" in row else None
>>
>might be good enough.
>>
>Peter

>
It is not a smarter algorithm that is used in grep. Python RE's have
more capabilities than grep RE's which need a slower, more complex
algorithm.

FWIW, grep itself can confirm this statement. The following command roughly
takes as long as Python's re.search:

# grep -P '[^ "=]*/' input

-P tells grep to use real perl-compatible regular expressions.

--
Freedom is always the freedom of dissenters.
(Rosa Luxemburg)

**Carl Banks** · Jul 5 '08, 08:25 AM

Re: re.search much slower then grep on some regular expressions

On Jul 5, 4:12 am, "Sebastian \"lunar\" Wiesner"
<basti.wies...@ gmx.netwrote:

Paddy <paddy3...@goog lemail.com>:
>
>
>

On Jul 4, 1:36 pm, Peter Otten <__pete...@web. dewrote:

Henning_Thornbl ad wrote:
What can be the cause of the large difference between re.search and
grep?

>

grep uses a smarter algorithm ;)

>

This script takes about 5 min to run on my computer:
#!/usr/bin/env python
import re

>

row=""
for a in range(156000):
row+="a"
print re.search('[^ "=]*/',row)

>

While doing a simple grep:
grep '[^ "=]*/' input (input contains 156.000 a in
one row)
doesn't even take a second.

>

Is this a bug in python?

>

You could call this a performance bug, but it's not common enough in real
code to get the necessary brain cycles from the core developers.
So you can either write a patch yourself or use a workaround.

>

re.search('[^ "=]*/', row) if "/" in row else None

>

might be good enough.

>

Peter

>

It is not a smarter algorithm that is used in grep. Python RE's have
more capabilities than grep RE's which need a slower, more complex
algorithm.

>
FWIW, grep itself can confirm this statement. The following command roughly
takes as long as Python's re.search:
>
# grep -P '[^ "=]*/' input
>
-P tells grep to use real perl-compatible regular expressions.

This confirms that a particular engine might not be optimized for it,
but it's not necessarily a reflection that the engine is more complex.

I'm not sure of the specifics and maybe that is the case, but it could
also be a case of a different codebase which is optimized differently.

Carl Banks

re.search much slower then grep on some regular expressions

re.search much slower then grep on some regular expressions

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment