re.sub hangs on text from large files.

  • Atli
    Recognized Expert Expert
    • Nov 2006
    • 5062

    Hi everybody.

    In an effort to teach myself the basics of Python, I set out to create a script that would read a PHP file, remove any comments and then save it to another location.
    I've got it all pretty much worked out now, except...

    I have this re.sub regex that is meant to remove /*...*/ comments.
    It works fine on small files, but causes Python to hang on larger files.
    (By large I mean files over 20 KB, sometimes containing a few thousand lines of code.)
    [code=python]
    inFile = open(inPath + cfile)
    outFile = open(outPath + cfile, "w")

    inText = inFile.read()
    outText = re.sub("\/\*(.|\s)*\*\/", "", inText)
    outFile.write(outText)

    inFile.close()
    outFile.close()
    [/code]
    Running this causes Python to hang, and when I interrupt it (Ctrl+C) this is what I get:
    Code:
    Traceback (most recent call last):
      File "./scandir.py", line 60, in <module>
        listdirrec(inPath, outPath)
      File "./scandir.py", line 44, in listdirrec
        listdirrec(inPath + entry +"/", outPath + entry +"/")
      File "./scandir.py", line 53, in listdirrec
        outText = re.sub("\/\*(.|\s)*\*\/", "", inText)
      File "/usr/lib/python2.5/re.py", line 150, in sub
        return _compile(pattern, 0).sub(repl, string, count)
    KeyboardInterrupt
    I'm running Python 2.5.2 on Ubuntu 8.04.

    Any input would be greatly appreciated.
    Thanks
  • jlm699
    Contributor
    • Jul 2007
    • 314

    #2
    OK, so I don't have any large text files to test with; however, here is one piece of advice I can give:

    In my experience with the re module, it is almost always a good idea to compile your regular expressions. This should speed up your processing and may fix the problem that you are seeing.
    [code=python]
    import re
    rc = re.compile("\/\*(.|\s)*\*\/")
    rc.sub("", inputText)[/code]

    On an entirely different note I always get worried when I see people using string concatenation to construct paths. (inPath + myFile, etc.)
    I usually use os.path.join(), as it makes things much easier; for example:
    [code=python]
    >>> import os
    >>> os.path.join('/usr', 'src', 'bin')
    '/usr/src/bin'
    >>> # On a windows system:
    >>> os.path.join('C:\\', 'Program Files', 'Python', 'Rules')
    'C:\\Program Files\\Python\\Rules'
    [/code]
    That's just my two cents and the method that I always go with; however, there's nothing wrong with what you've done.
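
    The two tips above (a compiled pattern and os.path.join) can be combined in a small helper. This is only a sketch: the name strip_file and its parameters are made up for illustration, and it assumes a Python version where the with statement is available without a __future__ import (2.6 or later, so not the 2.5 interpreter used in this thread).

```python
import os
import re


def strip_file(in_dir, out_dir, name, pattern):
    """Copy in_dir/name to out_dir/name with every match of pattern removed.

    strip_file and its arguments are hypothetical names for this sketch;
    pattern is expected to be a compiled regular expression.
    """
    # os.path.join inserts the correct separator, so the directory
    # arguments do not need a trailing "/".
    in_path = os.path.join(in_dir, name)
    out_path = os.path.join(out_dir, name)

    # "with" closes each file even if reading or writing raises.
    with open(in_path) as in_file:
        text = in_file.read()
    with open(out_path, "w") as out_file:
        out_file.write(pattern.sub("", text))
    return out_path
```

    Compiling the pattern once and passing it in keeps it out of the per-file loop, which is handy when walking a whole directory tree.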

    • bvdet
      Recognized Expert Specialist
      • Oct 2006
      • 2851

      #3
      Is each comment on one line? If so, try iterating on the file object.
      Example:
      [code=python]
      f = open(file_name)
      for line in f:
          pass  # process each line here
      f.close()
      [/code]
      From the Python docs:
      "Also note that when in non-blocking mode, less data than what was requested may be returned, even if no size parameter was given."
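
      For the case where every comment does sit on a single line, the line-by-line idea could look roughly like this. It is a sketch under that assumption only; strip_one_line_comments is a hypothetical name, and a comment that spans several lines would pass through untouched.

```python
import re

# Without re.DOTALL, "." never matches a newline, so this pattern can
# only remove comments that open and close on the same line.
ONE_LINE_COMMENT = re.compile(r"/\*.*?\*/")


def strip_one_line_comments(lines):
    """Yield each line with its single-line /*...*/ comments removed.

    `lines` can be any iterable of strings, including an open file object.
    """
    for line in lines:
        yield ONE_LINE_COMMENT.sub("", line)
```

      Because a file object is itself an iterable of lines, it can be passed in directly, and only one line is held in memory at a time.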

      • Atli
        Recognized Expert Expert
        • Nov 2006
        • 5062

        #4
        Originally posted by bvdet
        Is each comment on one line? If so, try iterating on the file object.
        No, these comments can (and usually do) span multiple lines.

        I did manage to find a solution tho!

        After some testing, I found that making the expression non-greedy fixes the problem no matter what combination of whitespace characters I use.

        A greedy expression also works, except when the group matches the \n character together with any other whitespace character; then the process gets caught in what looks like an endless loop, running at 100% CPU.
        The funny thing is, though, that it takes up virtually no memory.

        Anyhow...
        This ended up working for me.
        [code=python]
        rc = re.compile("\/\*(.|\s)*?\*\/")
        outText = rc.sub("", inText)
        [/code]
        Thanks for the help guys!

        Oh, and thanks for the os.path.join tip.
        That's going to save me a lot of trouble :)
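
        A plausible explanation for the hang, offered as an assumption rather than something confirmed in the thread: in (.|\s)* a space or tab can be matched by either branch, so a backtracking engine may end up trying a very large number of combinations before it settles on a match. The sketch below uses a variant that is not from the thread: it drops the alternation and lets "." match newlines through re.DOTALL. It also shows that greedy and non-greedy patterns differ in what they delete, not only in speed.

```python
import re

# Both patterns avoid the (.|\s) alternation; re.DOTALL lets "." match "\n".
GREEDY = re.compile(r"/\*.*\*/", re.DOTALL)   # runs to the LAST "*/" in the text
LAZY = re.compile(r"/\*.*?\*/", re.DOTALL)    # stops at the FIRST "*/" after "/*"

sample = "/* one */\n$a = 1;\n/* two\n   lines */\n$b = 2;\n"

# Greedy treats everything from the first "/*" to the last "*/" as a
# single comment, so the code between the two comments is deleted too.
greedy_result = GREEDY.sub("", sample)

# Non-greedy removes each comment on its own and keeps the code between them.
lazy_result = LAZY.sub("", sample)

print(repr(greedy_result))  # '\n$b = 2;\n'
print(repr(lazy_result))    # '\n$a = 1;\n\n$b = 2;\n'
```

        Note that a regex like this does not know about PHP string literals, so a "/*" inside a quoted string would still be treated as the start of a comment.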
