re.search much slower then grep on some regular expressions

**Sebastian \lunar\ Wiesner** · Jul 5 '08, 10:55 AM

Re: re.search much slower then grep on some regular expressions

Carl Banks <pavlovevidence @gmail.com>:

On Jul 5, 4:12 am, "Sebastian \"lunar\" Wiesner"
<basti.wies...@ gmx.netwrote:

>Paddy <paddy3...@goog lemail.com>:
>>
>>
>>

On Jul 4, 1:36 pm, Peter Otten <__pete...@web. dewrote:
>Henning_Thornb lad wrote:
What can be the cause of the large difference between re.search and
grep?

>>

>grep uses a smarter algorithm ;)

>>

This script takes about 5 min to run on my computer:
#!/usr/bin/env python
import re

>>

row=""
for a in range(156000):
row+="a"
print re.search('[^ "=]*/',row)

>>

While doing a simple grep:
grep '[^ "=]*/' input (input contains 156.000 a in
one row)
doesn't even take a second.

>>

Is this a bug in python?

>>

>You could call this a performance bug, but it's not common enough in
>real code to get the necessary brain cycles from the core developers.
>So you can either write a patch yourself or use a workaround.

>>

>re.search('[^ "=]*/', row) if "/" in row else None

>>

>might be good enough.

>>

>Peter

>>

It is not a smarter algorithm that is used in grep. Python RE's have
more capabilities than grep RE's which need a slower, more complex
algorithm.

>>
>FWIW, grep itself can confirm this statement. The following command
>roughly takes as long as Python's re.search:
>>
># grep -P '[^ "=]*/' input
>>
>-P tells grep to use real perl-compatible regular expressions.

>
This confirms that a particular engine might not be optimized for it,
but it's not necessarily a reflection that the engine is more complex.

My posting wasn't intended to reflect the differences in complexity between
normal GNU grep expressions (which are basically extended POSIX
expressions) and perl-compatible expressions. The latter just _are_ more
complex, having additional features like look aheads or non-greedy
qualifiers.

I just wanted to illustrate, that the speed of the given search is somehow
related to the complexity of the engine.

Btw, other pcre implementation are as slow as Python or "grep -P". I tried
a sample C++-code using pcre++ (a wrapper around libpcre) and saw it
running equally long.

--
Freedom is always the freedom of dissenters.
(Rosa Luxemburg)

**Carl Banks** · Jul 5 '08, 12:55 PM

Re: re.search much slower then grep on some regular expressions

On Jul 5, 6:44 am, "Sebastian \"lunar\" Wiesner"
<basti.wies...@ gmx.netwrote:

Carl Banks <pavlovevide... @gmail.com>:
>
>
>

On Jul 5, 4:12 am, "Sebastian \"lunar\" Wiesner"
<basti.wies...@ gmx.netwrote:

Paddy <paddy3...@goog lemail.com>:

>

On Jul 4, 1:36 pm, Peter Otten <__pete...@web. dewrote:
Henning_Thornbl ad wrote:
What can be the cause of the large difference between re.search and
grep?

>

grep uses a smarter algorithm ;)

>

This script takes about 5 min to run on my computer:
#!/usr/bin/env python
import re

>

row=""
for a in range(156000):
row+="a"
print re.search('[^ "=]*/',row)

>

While doing a simple grep:
grep '[^ "=]*/' input (input contains 156.000 a in
one row)
doesn't even take a second.

>

Is this a bug in python?

>

You could call this a performance bug, but it's not common enough in
real code to get the necessary brain cycles from the core developers.
So you can either write a patch yourself or use a workaround.

>

re.search('[^ "=]*/', row) if "/" in row else None

>

might be good enough.

>

Peter

>

It is not a smarter algorithm that is used in grep. Python RE's have
more capabilities than grep RE's which need a slower, more complex
algorithm.

>

FWIW, grep itself can confirm this statement. The following command
roughly takes as long as Python's re.search:

>

# grep -P '[^ "=]*/' input

>

-P tells grep to use real perl-compatible regular expressions.

>

This confirms that a particular engine might not be optimized for it,
but it's not necessarily a reflection that the engine is more complex.

>
My posting wasn't intended to reflect the differences in complexity between
normal GNU grep expressions (which are basically extended POSIX
expressions) and perl-compatible expressions. The latter just _are_ more
complex, having additional features like look aheads or non-greedy
qualifiers.
>
I just wanted to illustrate, that the speed of the given search is somehow
related to the complexity of the engine.

I don't think you've illustrated that at all. What you've illustrated
is that one implementation of regexp optimizes something that another
doesn't. It might be due to differences in complexity; it might not.
(Maybe there's something about PCREs that precludes the optimization
that the default grep uses, but I'd be inclined to think not.)

Carl Banks

**bearophileHUGS@lycos.com** · Jul 5 '08, 01:35 PM

Re: re.search much slower then grep on some regular expressions

Paddy:

You could argue that if the costly RE features are not used then maybe
simpler, faster algorithms should be automatically swapped in but ....

Many Python implementations contains a TCL interpreter. TCL REs may be
better than Python ones, so it can be interesting to benchmark the
same RE with TCL, to see how much time it needs. If someone here knows
TCL it may require to write just few lines of TCL code (used through
tkinter, by direct call).

Bye,
bearophile

**Mark Dickinson** · Jul 5 '08, 03:15 PM

Re: re.search much slower then grep on some regular expressions

On Jul 5, 1:54 pm, Carl Banks <pavlovevide... @gmail.comwrote :

I don't think you've illustrated that at all. What you've illustrated
is that one implementation of regexp optimizes something that another
doesn't. It might be due to differences in complexity; it might not.
(Maybe there's something about PCREs that precludes the optimization
that the default grep uses, but I'd be inclined to think not.)

It seems like an appropriate moment to point out *this* paper:

Regular Expression Matching Can Be Simple And Fast

http://swtch.com/~rsc/regexp/regexp1.html

Apparently, grep and Tcl convert a regex to a finite state machine.
Matching is then *very* fast: essentially linear time in the
length of the string being matched, even in the worst case.
Though it is possible for the size of the finite state machine
to grow exponentially with the size of the regex.

But not all PCREs can be converted to a finite state machine, so
Perl, Python, etc. use a backtracking approach, which has
exponential running time in the worst case. In particular,
it's not possible to use a finite state machine to represent
a regular expression that contains backreferences.

Part of the problem is a lack of agreement on what
'regular expression' means. Strictly speaking, PCREs aren't
regular expressions at all, for some values of the term
'regular expression'. See

Regular expression - Wikipedia

http://en.wikipedia.org/wiki/Regular_expression

Mark

**Paddy** · Jul 5 '08, 04:45 PM

Re: re.search much slower then grep on some regular expressions

On Jul 5, 4:13 pm, Mark Dickinson <dicki...@gmail .comwrote:

It seems like an appropriate moment to point out *this* paper:
>

Regular Expression Matching Can Be Simple And Fast

http://swtch.com/~rsc/regexp/regexp1.html

>

That's the one!

Thanks Mark.

- Paddy.

**Terry Reedy** · Jul 5 '08, 10:25 PM

Re: re.search much slower then grep on some regular expressions

Mark Dickinson wrote:

On Jul 5, 1:54 pm, Carl Banks <pavlovevide... @gmail.comwrote :

Part of the problem is a lack of agreement on what
'regular expression' means.

Twenty years ago, there was. Calling a extended re-derived grammar
expression like Perl's a 'regular-expression' is a bit like calling a
Hummer a 'car' -- perhaps to hide its gas-guzzling behavior.

Strictly speaking, PCREs aren't
regular expressions at all, for some values of the term
'regular expression'. See
>
http://en.wikipedia.org/wiki/Regular_expression

**Sebastian \lunar\ Wiesner** · Jul 5 '08, 11:35 PM

Re: re.search much slower then grep on some regular expressions

Terry Reedy <tjreedy@udel.e du>:

Mark Dickinson wrote:

>On Jul 5, 1:54 pm, Carl Banks <pavlovevide... @gmail.comwrote :

>

>Part of the problem is a lack of agreement on what
>'regular expression' means.

>
Twenty years ago, there was. Calling a extended re-derived grammar
expression like Perl's a 'regular-expression' is a bit like calling a
Hummer a 'car' -- perhaps to hide its gas-guzzling behavior.

Yet, to many programmers the Hummer is not only just a car, its _the_ car.
Everything else is just silly, bloody old posix crap ;)

(meaning, that pcres are often seen as true regular expressions, and
everything else is referred to as "being somehow restricted")

--
Freedom is always the freedom of dissenters.
(Rosa Luxemburg)

**sjdevnull@yahoo.com** · Jul 6 '08, 07:15 AM

Re: re.search much slower then grep on some regular expressions

On Jul 5, 11:13 am, Mark Dickinson <dicki...@gmail .comwrote:

Apparently, grep and Tcl convert a regex to a finite state machine.

....

But not all PCREs can be converted to a finite state machine

....

Part of the problem is a lack of agreement on what
'regular expression' means. Strictly speaking, PCREs aren't
regular expressions at all, for some values of the term
'regular expression'. See
>
http://en.wikipedia.org/wiki/Regular_expression

Formally, there's no lack of agreement that I'm aware of. Anyone
formally defining it will use something along the lines of what
wikipedia specifies (with "Regular expressions in this sense can
express the regular languages, exactly the class of languages accepted
by finite state automata."). Colloquially it's used to mean "any text-
matching scheme that looks vaguely like sed/grep regexes".

PCREs are certainly not "real" regular expressions, but they are
colloquially.

**Mark Wooding** · Jul 6 '08, 05:05 PM

Re: re.search much slower then grep on some regular expressions

Sebastian "lunar" Wiesner <basti.wiesner@ gmx.netwrote:

I just wanted to illustrate, that the speed of the given search is somehow
related to the complexity of the engine.
>
Btw, other pcre implementation are as slow as Python or "grep -P". I tried
a sample C++-code using pcre++ (a wrapper around libpcre) and saw it
running equally long.

So some other implementations are equally poor. I note that Perl itself
handles this case very quickly, as does Edi Weitz's CL-PPCRE library.

Yes, Perl-compatible `regular expressions' are more complicated than
POSIX extended regular expressions; but that doesn't mean that you have
to implement them in a dumb way. Indeed, it stands to reason that
expressions describing truly regular languages can be matched using the
same old algorithms that grep uses.

-- [mdw]

**Sebastian \lunar\ Wiesner** · Jul 6 '08, 06:25 PM

Re: re.search much slower then grep on some regular expressions

Mark Wooding <mdw@distorted. org.uk>:

Sebastian "lunar" Wiesner <basti.wiesner@ gmx.netwrote:
>

>I just wanted to illustrate, that the speed of the given search is
>somehow related to the complexity of the engine.
>>
>Btw, other pcre implementation are as slow as Python or "grep -P". I
>tried a sample C++-code using pcre++ (a wrapper around libpcre) and saw
>it running equally long.

>
So some other implementations are equally poor. I note that Perl itself
handles this case very quickly, as does Edi Weitz's CL-PPCRE library.

I don't know about the latter, but perl doesn't use any smart algorithm, it
just heavily relies on memoization to gain speed.

This makes perl perform poorly in other situations, as mentioned in the
paper cited by Mark Dickinson:

# perl -e '("a" x 100000) =~ /^(ab?)*$/;'
zsh: segmentation fault perl -e '("a" x 100000) =~ /^(ab?)*$/;'

In Python on the other, this pattern works fine, at the cost of performance
losses on other patterns.

It'd be interesting to know, how CL-PPCRE performs here (I don't know this
library).

Yes, Perl-compatible `regular expressions' are more complicated than
POSIX extended regular expressions; but that doesn't mean that you have
to implement them in a dumb way. Indeed, it stands to reason that
expressions describing truly regular languages can be matched using the
same old algorithms that grep uses.

I completely agree. I'd just believe, that the combination of some finite
state machine for "classic" expressions with some backtracking code is
terribly hard to implement. But I'm not an expert in this, probably some
libraries out there already do this. In this case, it'd be time to give
pythons re engine a more sophisticated implementation ;)

--
Freedom is always the freedom of dissenters.
(Rosa Luxemburg)

**Terry Reedy** · Jul 6 '08, 08:05 PM

Re: re.search much slower then grep on some regular expressions

Sebastian "lunar" Wiesner wrote:

I completely agree. I'd just believe, that the combination of some finite
state machine for "classic" expressions with some backtracking code is
terribly hard to implement. But I'm not an expert in this, probably some
libraries out there already do this. In this case, it'd be time to give
pythons re engine a more sophisticated implementation ;)

One thing to remember in making comparisons is that Python moved to its
own implementation to support unicode as well as extended ascii (bytes
in 3.0). Does grep do that? Has pcre, which Python once used, been
upgraded to do that?

**Mark Wooding** · Jul 6 '08, 09:45 PM

Re: re.search much slower then grep on some regular expressions

Sebastian "lunar" Wiesner <basti.wiesner@ gmx.netwrote:

# perl -e '("a" x 100000) =~ /^(ab?)*$/;'
zsh: segmentation fault perl -e '("a" x 100000) =~ /^(ab?)*$/;'

(Did you really run that as root?)

It'd be interesting to know, how CL-PPCRE performs here (I don't know this
library).

Stack overflow condition. :-(

-- [mdw]

**Sebastian \lunar\ Wiesner** · Jul 7 '08, 02:55 PM

Re: re.search much slower then grep on some regular expressions

Mark Wooding <mdw@distorted. org.uk>:

Sebastian "lunar" Wiesner <basti.wiesner@ gmx.netwrote:
>

># perl -e '("a" x 100000) =~ /^(ab?)*$/;'
>zsh: segmentation fault perl -e '("a" x 100000) =~ /^(ab?)*$/;'

>
(Did you really run that as root?)

How come, that you think so?

--
Freedom is always the freedom of dissenters.
(Rosa Luxemburg)

**Marc 'BlackJack' Rintsch** · Jul 7 '08, 03:15 PM

Re: re.search much slower then grep on some regular expressions

On Mon, 07 Jul 2008 16:44:22 +0200, Sebastian \"lunar\" Wiesner wrote:

Mark Wooding <mdw@distorted. org.uk>:
>

>Sebastian "lunar" Wiesner <basti.wiesner@ gmx.netwrote:
>>

>># perl -e '("a" x 100000) =~ /^(ab?)*$/;'
>>zsh: segmentation fault perl -e '("a" x 100000) =~ /^(ab?)*$/;'

>>
>(Did you really run that as root?)

>
How come, that you think so?

The '#' is usually the prompt of root accounts while ordinary users get a '$'.

Ciao,
Marc 'BlackJack' Rintsch

**Henning Thornblad** · Jul 7 '08, 04:55 PM

Re: re.search much slower then grep on some regular expressions

When trying to find an alternative way of solving my problem i found
that running this script:

#!/usr/bin/env python

import re

row=""
for a in range(156000):
row+="a"
print "How many, dude?"
print re.search('/[^ "=]*',row) (the / has moved)

wouldn't take even a second (The re.search part of course)

This is for me very strange since this,
in theory, is the same problem as before.

/Henning Thornblad

re.search much slower then grep on some regular expressions

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment