88k regex = RuntimeError

  • jodawi

    88k regex = RuntimeError

    I need to find a bunch of C function declarations by searching
    thousands of source or html files for thousands of known function
    names. My initial simple approach was to do this:

    rxAllSupported = re.compile(r"\b(" + "|".join(gAllSupported) + r")\b")
    # giving a regex of \b(AAFoo|ABFoo| (uh... 88kb more...) |zFoo)\b

    for root, dirs, files in os.walk( ... ):
        ...
        for fileName in files:
            ...
            filePath = os.path.join(root, fileName)
            file = open(filePath, "r")
            contents = file.read()
            ...
            result = re.search(rxAllSupported, contents)

    but this happens:

    result = re.search(rxAllSupported, contents)
      File "C:\Python24\Lib\sre.py", line 134, in search
        return _compile(pattern, flags).search(string)
    RuntimeError: internal error in regular expression engine

    I assume it's hitting some limit, but don't know where the limit is to
    remove it. I tried stepping into it repeatedly with Komodo, but didn't
    see the problem.

    Suggestions?

  • Diez B. Roggisch

    #2
    Re: 88k regex = RuntimeError

    > I assume it's hitting some limit, but don't know where the limit is to
    > remove it. I tried stepping into it repeatedly with Komodo, but didn't
    > see the problem.

    That's because it is buried in the C-library that is the actual
    implementation. There has been a discussion about this a few weeks ago -
    and AFAIK there isn't much you can do about that.
    > Suggestions?

    Yes. Don't do it :) After all, what you do is nothing but a simple
    word-search. If I had that problem, my naive approach would be to simply
    tokenize the sources and look for the words in them being part of your
    function-name-set. A bit of statekeeping to keep track of the position, and
    you're done. Check out pyparsing; it might help you with the tokenization.
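
    The tokenize-and-test idea might look roughly like this (a minimal
    sketch; gAllSupported and the source text are hypothetical stand-ins
    for the real 88k name set and the real files):

```python
import re

# Hypothetical stand-ins for the real name set and source text.
gAllSupported = {"AAFoo", "zFoo"}
source = "int AAFoo(void);\nint other(void);\nvoid zFoo(int);\n"

# One tiny identifier pattern instead of an 88k alternation.
rxWord = re.compile(r"[A-Za-z_]\w*")

# Walk the tokens line by line, keeping a bit of position state:
# (name, line number, column) for every supported name found.
hits = []
for lineno, line in enumerate(source.splitlines(), 1):
    for m in rxWord.finditer(line):
        if m.group(0) in gAllSupported:
            hits.append((m.group(0), lineno, m.start()))

print(hits)
```

    The membership test against the set is O(1) per token, so the total
    cost no longer depends on how many names are being searched for.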


    I admit that the apparent ease of the regular expression would have lured me
    into the same trap.

    Diez


    • Tim N. van der Leeuw

      #3
      Re: 88k regex = RuntimeError

      Why don't you create a regex that finds all C function
      declarations for you (and which returns the function names); apply
      re.findall() to all files with that regex; and then check those
      function names against the set of allSupported?

      You might even be able to find a regex for C function declarations on
      the web.

      Your gAllSupported can be a set(); you can then create the intersection
      between gAllSupported and the function-names found by your regex.
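
      A rough sketch of that (the declaration pattern is deliberately
      simplified and the names are hypothetical; real C declarations
      with pointers, macros, or line breaks would need more work):

```python
import re

# Hypothetical names standing in for the real 88k-entry set.
gAllSupported = {"AAFoo", "ABFoo", "zFoo"}

# Deliberately simplified declaration pattern: a return-type word,
# whitespace, the function name, an opening paren.  Pointer returns
# like "char *other(...)" slip through this sketch.
rxDecl = re.compile(r"\b[A-Za-z_]\w*\s+([A-Za-z_]\w*)\s*\(")

contents = "int AAFoo(void);\nstatic char *other(int x);\nvoid zFoo(int);\n"

# Intersect the declared names with the supported set.
found = set(rxDecl.findall(contents)) & gAllSupported
print(sorted(found))
```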

      Cheers,

      --Tim


      • Peter Otten

        #4
        Re: 88k regex = RuntimeError

        jodawi wrote:
        > I need to find a bunch of C function declarations by searching
        > thousands of source or html files for thousands of known function
        > names. My initial simple approach was to do this:
        >
        > rxAllSupported = re.compile(r"\b(" + "|".join(gAllSupported) + r")\b")
        > # giving a regex of \b(AAFoo|ABFoo| (uh... 88kb more...) |zFoo)\b
        >
        > for root, dirs, files in os.walk( ... ):
        >     ...
        >     for fileName in files:
        >         ...
        >         filePath = os.path.join(root, fileName)
        >         file = open(filePath, "r")
        >         contents = file.read()
        >         ...
        >         result = re.search(rxAllSupported, contents)
        >
        > but this happens:
        >
        > result = re.search(rxAllSupported, contents)
        >   File "C:\Python24\Lib\sre.py", line 134, in search
        >     return _compile(pattern, flags).search(string)
        > RuntimeError: internal error in regular expression engine
        >
        > I assume it's hitting some limit, but don't know where the limit is to
        > remove it. I tried stepping into it repeatedly with Komodo, but didn't
        > see the problem.
        >
        > Suggestions?

        One workaround may be as easy as

        wanted = set(["foo", "bar", "baz"])
        file_content = "foo bar-baz ignored foo()"

        r = re.compile(r"\w+")
        found = [name for name in r.findall(file_content) if name in wanted]

        print found

        Peter


        • Kent Johnson

          #5
          Re: 88k regex = RuntimeError

          jodawi wrote:
          > I need to find a bunch of C function declarations by searching
          > thousands of source or html files for thousands of known function
          > names. My initial simple approach was to do this:
          >
          > rxAllSupported = re.compile(r"\b(" + "|".join(gAllSupported) + r")\b")
          > # giving a regex of \b(AAFoo|ABFoo| (uh... 88kb more...) |zFoo)\b

          Maybe you can be more clever about the regex? If the names above are
          representative then something like r'\b(\w{1,2})Foo\b' might work.


          • Tim N. van der Leeuw

            #6
            Re: 88k regex = RuntimeError

            This is basically the same idea as what I tried to describe in my
            previous post but without any samples.
            I wonder if it's more efficient to create a new list using a
            list-comprehension, and checking each entry against the 'wanted' set,
            or to create a new set which is the intersection of set 'wanted' and
            the iterable of all matches...

            Your sample code would then look like this:
            >>> import re
            >>> r = re.compile(r"\w+")
            >>> file_content = "foo bar-baz ignored foo()"
            >>> wanted = set(["foo", "bar", "baz"])
            >>> found = wanted.intersection(name for name in r.findall(file_content))
            >>> print found
            set(['baz', 'foo', 'bar'])
            >>>

            Anyone who has an idea what is faster? (This dataset is so limited that
            it doesn't make sense to do any performance-tests with it)
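
            If a bigger dataset turns up, a quick harness along these
            lines could settle it (a sketch; the repeated toy string is
            just a synthetic stand-in for real source files):

```python
import re
import timeit

r = re.compile(r"\w+")
wanted = set(["foo", "bar", "baz"])
# Synthetic stand-in input; real source files would be used instead.
file_content = "foo bar-baz ignored foo() " * 1000

def via_listcomp():
    # Keeps duplicates and original order.
    return [name for name in r.findall(file_content) if name in wanted]

def via_intersection():
    # Deduplicates; order is whatever the set yields.
    return wanted.intersection(r.findall(file_content))

for fn in (via_listcomp, via_intersection):
    print(fn.__name__, timeit.timeit(fn, number=100))
```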

            Cheers,

            --Tim


            • Peter Otten

              #7
              Re: 88k regex = RuntimeError

              Tim N. van der Leeuw wrote:
              > This is basically the same idea as what I tried to describe in my
              > previous post but without any samples.
              > I wonder if it's more efficient to create a new list using a
              > list-comprehension, and checking each entry against the 'wanted' set,
              > or to create a new set which is the intersection of set 'wanted' and
              > the iterable of all matches...
              >
              > Your sample code would then look like this:
              >
              >>>> import re
              >>>> r = re.compile(r"\w+")
              >>>> file_content = "foo bar-baz ignored foo()"
              >>>> wanted = set(["foo", "bar", "baz"])
              >>>> found = wanted.intersection(name for name in r.findall(file_content))

              Just

              found = wanted.intersection(r.findall(file_content))

              >>>> print found
              > set(['baz', 'foo', 'bar'])
              >>>>
              >
              > Anyone who has an idea what is faster? (This dataset is so limited that
              > it doesn't make sense to do any performance-tests with it)

              I guess that your approach would be a bit faster though most of the time
              will be spent on IO anyway. The result would be slightly different, and
              again yours (without duplicates) seems more useful.

              However, I'm not sure whether the OP would rather stop at the first match or
              need a match object and not just the text. In that case:

              matches = (m for m in r.finditer(file_content) if m.group(0) in wanted)
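
              Since that generator is lazy, stopping at the first hit
              only pulls one item from it (a minimal sketch with the toy
              input from above; next() with a default needs Python 2.6+):

```python
import re

r = re.compile(r"\w+")
wanted = set(["foo", "bar", "baz"])
file_content = "foo bar-baz ignored foo()"

# Lazy generator of match objects whose text is in the wanted set.
matches = (m for m in r.finditer(file_content) if m.group(0) in wanted)

# Pull just the first match; None if nothing matched at all.
first = next(matches, None)
if first is not None:
    print(first.group(0), first.start())
```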

              Peter
