regular expressions ... slow

Collapse
This topic is closed.
X
X
 
  • Time
  • Show
Clear All
new posts
  • Uwe Schmitt

    regular expressions ... slow

    Hi,

    Is anobody aware of this post: http://swtch.com/~rsc/regexp/regexp1.html
    ?

    Are there any plans to speed up Pythons regular expression module ?
    Or
    is the example in this artricle too far from reality ???

    Greetings, Uwe
  • Jerry Hill

    #2
    Re: regular expressions ... slow

    On Mon, Nov 17, 2008 at 4:37 PM, Uwe Schmitt
    <rocksportrocke r@googlemail.co mwrote:
    Hi,
    >
    Is anobody aware of this post: http://swtch.com/~rsc/regexp/regexp1.html ?
    Yes, it's been brought up here, on python-dev and python-ideas several
    times in the past year and a half.
    Are there any plans to speed up Pythons regular expression module ?
    Or
    is the example in this artricle too far from reality ???
    I don't think anyone has taken any concrete steps towards re-writing
    the regular expression module. My understanding from previous threads
    on the topic is that the core developers would be willing to accept a
    re-written regular expression engine, but none of them are interested
    in doing it themselves. The general consensus seemed to be that the
    pathological cases hilited in that article are not very common in the
    real world, and that simply switching to the alternative approach
    advocated there would require giving up things like backreferences
    that are actually used in the real world, which is probably
    unacceptable.

    Some references:




    Personally, I know very little about the nitty gritty of regular
    expression engines, but there's some reference material for you to
    chew on.

    --
    Jerry

    Comment

    • Terry Reedy

      #3
      Re: regular expressions ... slow

      Uwe Schmitt wrote:
      Hi,
      >
      Is anobody aware of this post: http://swtch.com/~rsc/regexp/regexp1.html
      ?
      Near the end:

      While writing the text editor sam [6] in the early 1980s, Rob Pike wrote
      a new regular expression implementation, which Dave Presotto extracted
      into a library that appeared in the Eighth Edition. Pike's
      implementation incorporated submatch tracking into an efficient NFA
      simulation but, like the rest of the Eighth Edition source, was not
      widely distributed. Pike himself did not realize that his technique was
      anything new. Henry Spencer reimplemented the Eighth Edition library
      interface from scratch, but using backtracking, and released his
      implementation into the public domain. It became very widely used,
      eventually serving as the basis for the slow regular expression
      implementations mentioned earlier: Perl, PCRE, Python, and so on. (In
      his defense, Spencer knew the routines could be slow, and he didn't know
      that a more efficient algorithm existed. He even warned in the
      documentation, “Many users have found the speed perfectly adequate,
      although replacing the insides of egrep with this code would be a
      mistake.”) Pike's regular expression implementation, extended to support
      Unicode, was made freely available with sam in late 1992, but the
      particularly efficient regular expression search algorithm went
      unnoticed. The code is now available in many forms: as part of sam, as
      Plan 9's regular expression library, or packaged separately for Unix.
      Ville Laurikari independently discovered Pike's algorithm in 1999,
      developing a theoretical foundation as well [2].

      So, depending on the license, there appears to be a potential basis for
      a Python unicode version.

      Are there any plans to speed up Pythons regular expression module ?
      Yes, but I don't know how much people with such plans have considered
      the adaptive approach recommended.
      is the example in this artricle too far from reality ???
      The example is, but I don't think the problem illustrated is.

      tjr




      Comment

      • Terry Reedy

        #4
        Re: regular expressions ... slow

        Jerry Hill wrote:
        On Mon, Nov 17, 2008 at 4:37 PM, Uwe Schmitt
        <rocksportrocke r@googlemail.co mwrote:
        >Hi,
        >>
        >Is anobody aware of this post: http://swtch.com/~rsc/regexp/regexp1.html ?
        >
        Yes, it's been brought up here, on python-dev and python-ideas several
        times in the past year and a half.
        >
        >Are there any plans to speed up Pythons regular expression module ?
        >Or
        >is the example in this artricle too far from reality ???
        >
        I don't think anyone has taken any concrete steps towards re-writing
        the regular expression module. My understanding from previous threads
        on the topic is that the core developers would be willing to accept a
        re-written regular expression engine, but none of them are interested
        in doing it themselves. The general consensus seemed to be that the
        pathological cases hilited in that article are not very common in the
        real world, and that simply switching to the alternative approach
        advocated there would require giving up things like backreferences
        that are actually used in the real world, which is probably
        unacceptable.
        >
        Some references:



        >
        Personally, I know very little about the nitty gritty of regular
        expression engines, but there's some reference material for you to
        chew on.
        Searching the tracker for open items with 'regular expression' in the
        text brings up about 20 items to also consider.


        Comment

        • MRAB

          #5
          Re: regular expressions ... slow

          On Nov 17, 10:24 pm, Terry Reedy <tjre...@udel.e duwrote:
          Jerry Hill wrote:
          On Mon, Nov 17, 2008 at 4:37 PM, Uwe Schmitt
          <rocksportroc.. .@googlemail.co mwrote:
          Hi,
          >
          Is anobody aware of this post:  http://swtch.com/~rsc/regexp/regexp1..html?
          >
          Yes, it's been brought up here, on python-dev and python-ideas several
          times in the past year and a half.
          >
          Are there any plans  to speed up Pythons regular expression module ?
          Or
          is the example in this artricle too far from reality ???
          >
          I don't think anyone has taken any concrete steps towards re-writing
          the regular expression module.  My understanding from previous threads
          on the topic is that the core developers would be willing to accept a
          re-written regular expression engine, but none of them are interested
          in doing it themselves.  The general consensus seemed to be that the
          pathological cases hilited in that article are not very common in the
          real world, and that simply switching to the alternative approach
          advocated there would require giving up things like backreferences
          that are actually used in the real world, which is probably
          unacceptable.
          >>
          Personally, I know very little about the nitty gritty of regular
          expression engines, but there's some reference material for you to
          chew on.
          >
          Searching the tracker for open items with 'regular expression' in the
          text brings up about 20 items to also consider.
          Work is currently being done on the re module.

          I don't think the DFA approach works permits backreferences, capture
          groups or non-greedy repetition, but it certainly could be used if
          those features aren't required by the regular expression, so the
          answer is definitely maybe! :-)

          Comment

          • Gabriel Genellina

            #6
            Re: regular expressions ... slow

            En Mon, 17 Nov 2008 19:37:18 -0200, Uwe Schmitt
            <rocksportrocke r@googlemail.co mescribió:
            Is anobody aware of this post: http://swtch.com/~rsc/regexp/regexp1.html
            ?
            >
            Are there any plans to speed up Pythons regular expression module ?
            Or
            is the example in this artricle too far from reality ???
            It's a pathological case. There are some known cases of horrible behaviour
            that are explained in many books on regular expressions. If you avoid
            those constructs when writing the RE, you're reasonably safe. I for
            myself avoid using RE at all :)

            --
            Gabriel Genellina

            Comment

            • Stefan Behnel

              #7
              Re: regular expressions ... slow

              Kay Schluehr wrote:
              All of this is prototyped in Python and it is still work in progress.
              As long as development has not reached a stable state I refuse to
              rebuild the system in an optimized C version.
              And rightfully so:

              1) the approach is algorithmically better, so it may even beat the current
              C implementation by design.

              2) switching languages before finishing and benchmarking the prototype is a
              premature optimisation. It wouldn't be the first prototype going into
              production.

              3) even before considering a reimplementatio n, you should throw it into
              Cython to translate it into C code, and then benchmark that.

              Stefan

              Comment

              • Kay Schluehr

                #8
                Re: regular expressions ... slow

                On 18 Nov., 18:47, Stefan Behnel <stefan...@behn el.dewrote:
                Kay Schluehr wrote:
                All of this is prototyped in Python and it is still work in progress.
                As long as development has not reached a stable state I refuse to
                rebuild the system in an optimized C version.
                >
                And rightfully so:
                >
                1) the approach is algorithmically better, so it may even beat the current
                C implementation by design.
                >
                2) switching languages before finishing and benchmarking the prototype is a
                premature optimisation. It wouldn't be the first prototype going into
                production.
                >
                3) even before considering a reimplementatio n, you should throw it into
                Cython to translate it into C code, and then benchmark that.
                >
                Stefan
                I fully agree and in fact the Trail parser generator contains a single
                extension module called cyTrail which is written in Cython - it's not
                just a trivial recompilation of Python in Cython but it uses all kinds
                of cdefs.

                There is just a difference between optimizing existing code using the
                best techniques available and writing code from scratch for speed. I
                consider this as a different, subsequent project ( call it cTrail )
                and I want to have more fine control than being possible with Cython -
                actually I do want to understand the code in a simple language as C. I
                have to idea what the code does, generated by Cython. There are entire
                layers that can be stripped off when not dealing with Python types and
                dynamic memory allocation.

                Kay

                Comment

                • Lawrence D'Oliveiro

                  #9
                  Re: regular expressions ... slow

                  Uwe Schmitt wrote:
                  Are there any plans to speed up Pythons regular expression module ?
                  Or
                  is the example in this artricle too far from reality ???
                  <http://regex.info/blog/2006-09-15/247>

                  Comment

                  Working...