88k regex = RuntimeError

  • jodawi

    88k regex = RuntimeError

    I need to find a bunch of C function declarations by searching
    thousands of source or html files for thousands of known function
    names. My initial simple approach was to do this:

    rxAllSupported = re.compile(r"\b(" + "|".join(gAllSupported) + r")\b")
    # giving a regex of \b(AAFoo|ABFoo| (uh... 88kb more...) |zFoo)\b

    for root, dirs, files in os.walk( ... ):
        ...
        for fileName in files:
            ...
            filePath = os.path.join(root, fileName)
            file = open(filePath, "r")
            contents = file.read()
            ...
            result = re.search(rxAllSupported, contents)

    but this happens:

    result = re.search(rxAllSupported, contents)
      File "C:\Python24\Lib\sre.py", line 134, in search
        return _compile(pattern, flags).search(string)
    RuntimeError: internal error in regular expression engine

    I assume it's hitting some limit, but don't know where the limit is to
    remove it. I tried stepping into it repeatedly with Komodo, but didn't
    see the problem.

    Suggestions?

  • Diez B. Roggisch

    #2
    Re: 88k regex = RuntimeError

    > I assume it's hitting some limit, but don't know where the limit is to
    > remove it. I tried stepping into it repeatedly with Komodo, but didn't
    > see the problem.

    That's because it is buried in the C-library that is the actual
    implementation. There has been a discussion about this a few weeks ago -
    and AFAIK there isn't much you can do about that.
    > Suggestions?

    Yes. Don't do it :) After all, what you do is nothing but a simple
    word-search. If I had that problem, my naive approach would be to simply
    tokenize the sources and look for the words in them being part of your
    function-name-set. A bit of statekeeping to keep track of the position, and
    you're done. Check out pyparsing; it might help you with the tokenization.
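
    The tokenize-and-test idea might look roughly like this (a minimal
    sketch; gAllSupported and the source text are hypothetical stand-ins
    for the real 88k name set and the real files):

```python
import re

# Hypothetical stand-ins for the real name set and source text.
gAllSupported = {"AAFoo", "zFoo"}
source = "int AAFoo(void);\nint other(void);\nvoid zFoo(int);\n"

# One tiny identifier pattern instead of an 88k alternation.
rxWord = re.compile(r"[A-Za-z_]\w*")

# Walk the tokens line by line, keeping a bit of position state:
# (name, line number, column) for every supported name found.
hits = []
for lineno, line in enumerate(source.splitlines(), 1):
    for m in rxWord.finditer(line):
        if m.group(0) in gAllSupported:
            hits.append((m.group(0), lineno, m.start()))

print(hits)
```

    The membership test against the set is O(1) per token, so the total
    cost no longer depends on how many names are being searched for.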


    I admit that the apparent ease of the regular expression would have lured me
    into the same trap.

    Diez


    • Tim N. van der Leeuw

      #3
      Re: 88k regex = RuntimeError

      Why don't you create a regex that finds all C function
      declarations for you (and which returns the function names); apply
      re.findall() to all files with that regex; and then check those
      function names against the set of allSupported?

      You might even be able to find a regex for C function declarations on
      the web.

      Your gAllSupported can be a set(); you can then create the intersection
      between gAllSupported and the function-names found by your regex.
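
      A rough sketch of that (the declaration pattern is deliberately
      simplified and the names are hypothetical; real C declarations
      with pointers, macros, or line breaks would need more work):

```python
import re

# Hypothetical names standing in for the real 88k-entry set.
gAllSupported = {"AAFoo", "ABFoo", "zFoo"}

# Deliberately simplified declaration pattern: a return-type word,
# whitespace, the function name, an opening paren.  Pointer returns
# like "char *other(...)" slip through this sketch.
rxDecl = re.compile(r"\b[A-Za-z_]\w*\s+([A-Za-z_]\w*)\s*\(")

contents = "int AAFoo(void);\nstatic char *other(int x);\nvoid zFoo(int);\n"

# Intersect the declared names with the supported set.
found = set(rxDecl.findall(contents)) & gAllSupported
print(sorted(found))
```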

      Cheers,

      --Tim


      • Peter Otten

        #4
        Re: 88k regex = RuntimeError

        jodawi wrote:
        > I need to find a bunch of C function declarations by searching
        > thousands of source or html files for thousands of known function
        > names. My initial simple approach was to do this:
        >
        > rxAllSupported = re.compile(r"\b(" + "|".join(gAllSupported) + r")\b")
        > # giving a regex of \b(AAFoo|ABFoo| (uh... 88kb more...) |zFoo)\b
        >
        > for root, dirs, files in os.walk( ... ):
        >     ...
        >     for fileName in files:
        >         ...
        >         filePath = os.path.join(root, fileName)
        >         file = open(filePath, "r")
        >         contents = file.read()
        >         ...
        >         result = re.search(rxAllSupported, contents)
        >
        > but this happens:
        >
        > result = re.search(rxAllSupported, contents)
        >   File "C:\Python24\Lib\sre.py", line 134, in search
        >     return _compile(pattern, flags).search(string)
        > RuntimeError: internal error in regular expression engine
        >
        > I assume it's hitting some limit, but don't know where the limit is to
        > remove it. I tried stepping into it repeatedly with Komodo, but didn't
        > see the problem.
        >
        > Suggestions?

        One workaround may be as easy as

        wanted = set(["foo", "bar", "baz"])
        file_content = "foo bar-baz ignored foo()"

        r = re.compile(r"\w+")
        found = [name for name in r.findall(file_content) if name in wanted]

        print found

        Peter


        • Kent Johnson

          #5
          Re: 88k regex = RuntimeError

          jodawi wrote:
          > I need to find a bunch of C function declarations by searching
          > thousands of source or html files for thousands of known function
          > names. My initial simple approach was to do this:
          >
          > rxAllSupported = re.compile(r"\b(" + "|".join(gAllSupported) + r")\b")
          > # giving a regex of \b(AAFoo|ABFoo| (uh... 88kb more...) |zFoo)\b

          Maybe you can be more clever about the regex? If the names above are
          representative then something like r'\b(\w{1,2})Foo\b' might work.


          • Tim N. van der Leeuw

            #6
            Re: 88k regex = RuntimeError

            This is basically the same idea as what I tried to describe in my
            previous post but without any samples.
            I wonder if it's more efficient to create a new list using a
            list-comprehension, and checking each entry against the 'wanted' set,
            or to create a new set which is the intersection of set 'wanted' and
            the iterable of all matches...

            Your sample code would then look like this:
            >>> import re
            >>> r = re.compile(r"\w+")
            >>> file_content = "foo bar-baz ignored foo()"
            >>> wanted = set(["foo", "bar", "baz"])
            >>> found = wanted.intersection(name for name in r.findall(file_content))
            >>> print found
            set(['baz', 'foo', 'bar'])
            >>>

            Anyone who has an idea what is faster? (This dataset is so limited that
            it doesn't make sense to do any performance-tests with it)
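
            If a bigger dataset turns up, a quick harness along these
            lines could settle it (a sketch; the repeated toy string is
            just a synthetic stand-in for real source files):

```python
import re
import timeit

r = re.compile(r"\w+")
wanted = set(["foo", "bar", "baz"])
# Synthetic stand-in input; real source files would be used instead.
file_content = "foo bar-baz ignored foo() " * 1000

def via_listcomp():
    # Keeps duplicates and original order.
    return [name for name in r.findall(file_content) if name in wanted]

def via_intersection():
    # Deduplicates; order is whatever the set yields.
    return wanted.intersection(r.findall(file_content))

for fn in (via_listcomp, via_intersection):
    print(fn.__name__, timeit.timeit(fn, number=100))
```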

            Cheers,

            --Tim


            • Peter Otten

              #7
              Re: 88k regex = RuntimeError

              Tim N. van der Leeuw wrote:
              > This is basically the same idea as what I tried to describe in my
              > previous post but without any samples.
              > I wonder if it's more efficient to create a new list using a
              > list-comprehension, and checking each entry against the 'wanted' set,
              > or to create a new set which is the intersection of set 'wanted' and
              > the iterable of all matches...
              >
              > Your sample code would then look like this:
              >
              >>>> import re
              >>>> r = re.compile(r"\w+")
              >>>> file_content = "foo bar-baz ignored foo()"
              >>>> wanted = set(["foo", "bar", "baz"])
              >>>> found = wanted.intersection(name for name in r.findall(file_content))

              Just

              found = wanted.intersection(r.findall(file_content))

              >>>> print found
              > set(['baz', 'foo', 'bar'])
              >>>>
              >
              > Anyone who has an idea what is faster? (This dataset is so limited that
              > it doesn't make sense to do any performance-tests with it)

              I guess that your approach would be a bit faster though most of the time
              will be spent on IO anyway. The result would be slightly different, and
              again yours (without duplicates) seems more useful.

              However, I'm not sure whether the OP would rather stop at the first match or
              need a match object and not just the text. In that case:

              matches = (m for m in r.finditer(file_content) if m.group(0) in wanted)
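
              Since that generator is lazy, stopping at the first hit
              only pulls one item from it (a minimal sketch with the toy
              input from above; next() with a default needs Python 2.6+):

```python
import re

r = re.compile(r"\w+")
wanted = set(["foo", "bar", "baz"])
file_content = "foo bar-baz ignored foo()"

# Lazy generator of match objects whose text is in the wanted set.
matches = (m for m in r.finditer(file_content) if m.group(0) in wanted)

# Pull just the first match; None if nothing matched at all.
first = next(matches, None)
if first is not None:
    print(first.group(0), first.start())
```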

              Peter
