Is there a maximum length of a regular expression in python?

Collapse
This topic is closed.
X
X
 
  • Time
  • Show
Clear All
new posts
  • olekristianvillabo@gmail.com

    Is there a maximum length of a regular expression in python?

    I have a regular expression that is approximately 100k bytes. (It is
    basically a list of all known norwegian postal numbers and the
    corresponding place with | in between. I know this is not the intended
    use for regular expressions, but it should nonetheless work.

    the pattern is
    ur'(N-|NO-)?(5259 HJELLESTAD|4026 STAVANGER|4027 STAVANGER...... ..|8305
    SVOLVÆR)'

    The error message I get is:
    RuntimeError: internal error in regular expression engine

  • Steve Holden

    #2
    Re: Is there a maximum length of a regular expression in python?

    olekristianvill abo@gmail.com wrote:[color=blue]
    > I have a regular expression that is approximately 100k bytes. (It is
    > basically a list of all known norwegian postal numbers and the
    > corresponding place with | in between. I know this is not the intended
    > use for regular expressions, but it should nonetheless work.
    >
    > the pattern is
    > ur'(N-|NO-)?(5259 HJELLESTAD|4026 STAVANGER|4027 STAVANGER...... ..|8305
    > SVOLVÆR)'
    >
    > The error message I get is:
    > RuntimeError: internal error in regular expression engine
    >[/color]
    And I'm not the least bit surprised. Your code is brittle (i.e. likely
    to break) and cannot, for example, cope with multiple spaces between the
    number and the word(s). Quite apart from breaking the interpreter :-)

    I'd say your test was the clearest possible demonstration that there
    *is* a limit.

    Wouldn't it be better to have a dict keyed on the number and containing
    the word (which you can construct from the same source you constructed
    your horrendously long regexp)?

    Then if you find something matching the pattern (untested)

    ur'(N-|NO-)?((\d\d\d\d)\s *([A-Za-z ]+))'

    or something like it that actually works (I invariably get regexps wrong
    at least three times before I get them right) you can use the dict to
    validate the number and name.

    Quite apart from anything else, if the text line you are examining
    doesn't have the right syntactic form then you are going to test
    hundreds of options, none of which can possibly match. So matching the
    syntax and then validating the data identified seems like a much more
    sensible option (to me, at least).

    regards
    Steve
    --
    Steve Holden +44 150 684 7255 +1 800 494 3119
    Holden Web LLC www.holdenweb.com
    PyCon TX 2006 www.python.org/pycon/

    Comment

    • Fredrik Lundh

      #3
      Re: Is there a maximum length of a regular expression in python?

      olekristianvill abo@gmail.com wrote:
      [color=blue]
      > I have a regular expression that is approximately 100k bytes. (It is
      > basically a list of all known norwegian postal numbers and the
      > corresponding place with | in between. I know this is not the intended
      > use for regular expressions, but it should nonetheless work.
      >
      > the pattern is
      > ur'(N-|NO-)?(5259 HJELLESTAD|4026 STAVANGER|4027 STAVANGER...... ..|8305
      > SVOLVÆR)'
      >
      > The error message I get is:
      > RuntimeError: internal error in regular expression engine[/color]

      you're most likely exceeding the allowed code size (usually 64k).

      however, putting all postal numbers in a single RE is a horrid abuse of the RE
      engine. why not just scan for "(N-|NO-)?(\d+)" and use a dictionary to check
      if you have a valid match?

      postcodes = {
      "5269": "HJELLESTAD ",
      ...
      "9999": "ØSTRE FJORDVIDDA",
      }

      for m in re.finditer("(N-|NO-)?(\d+) ", text):
      prefix, number = m.groups()
      try:
      place = postcodes[number]
      except KeyError:
      continue
      if not text.startswith (place, m.end()):
      continue
      # got a match!
      print prefix, number, place

      </F>



      Comment

      • Roy Smith

        #4
        Re: Is there a maximum length of a regular expression in python?

        In article <1137591911.333 364.221990@z14g 2000cwz.googleg roups.com>,
        olekristianvill abo@gmail.com wrote:
        [color=blue]
        > I have a regular expression that is approximately 100k bytes. (It is
        > basically a list of all known norwegian postal numbers and the
        > corresponding place with | in between. I know this is not the intended
        > use for regular expressions, but it should nonetheless work.
        >
        > the pattern is
        > ur'(N-|NO-)?(5259 HJELLESTAD|4026 STAVANGER|4027 STAVANGER...... ..|8305
        > SVOLVÆR)'
        >
        > The error message I get is:
        > RuntimeError: internal error in regular expression engine[/color]

        I don't know of any stated maximum length, but I'm not at all surprised
        this causes the regex compiler to blow up. This is clearly a case of regex
        being the wrong tool for the job.

        I'm guessing a dictionary, with the numeric codes as keys and the city
        names as values (or perhaps the other way around) is what you want.

        Comment

        • Frithiof Andreas Jensen

          #5
          Re: Is there a maximum length of a regular expression in python?


          <olekristianvil labo@gmail.com> skrev i en meddelelse
          news:1137591911 .333364.221990@ z14g2000cwz.goo glegroups.com.. .[color=blue]
          > I have a regular expression that is approximately 100k bytes. (It is
          > basically a list of all known norwegian postal numbers and the
          > corresponding place with | in between. I know this is not the intended
          > use for regular expressions, but it should nonetheless work.[/color]

          Err. No.

          A while back it was established in this forum that re's per design can have
          a maximum of 99 match groups ... I suspect that every "|" silently consumes
          one match group.


          Comment

          • Fredrik Lundh

            #6
            Re: Is there a maximum length of a regular expression in python?

            Frithiof Andreas Jensen wrote:
            [color=blue][color=green]
            > > I have a regular expression that is approximately 100k bytes. (It is
            > > basically a list of all known norwegian postal numbers and the
            > > corresponding place with | in between. I know this is not the intended
            > > use for regular expressions, but it should nonetheless work.[/color]
            >
            > Err. No.
            >
            > A while back it was established in this forum that re's per design can have
            > a maximum of 99 match groups ... I suspect that every "|" silently consumes
            > one match group.[/color]

            nope. this is a code size limit, not a group count limit.

            </F>



            Comment

            • Bryan Olson

              #7
              Re: Is there a maximum length of a regular expression in python?

              Roy Smith wrote:[color=blue]
              > olekristianvill abo@gmail.com wrote:
              >
              >[color=green]
              >>I have a regular expression that is approximately 100k bytes. (It is
              >>basically a list of all known norwegian postal numbers and the
              >>correspondi ng place with | in between. I know this is not the intended
              >>use for regular expressions, but it should nonetheless work.
              >>
              >>the pattern is
              >>ur'(N-|NO-)?(5259 HJELLESTAD|4026 STAVANGER|4027 STAVANGER...... ..|8305
              >>SVOLVÆR)'
              >>
              >>The error message I get is:
              >>RuntimeErro r: internal error in regular expression engine[/color]
              >
              >
              > I don't know of any stated maximum length, but I'm not at all surprised
              > this causes the regex compiler to blow up. This is clearly a case of regex
              > being the wrong tool for the job.[/color]

              Does no one care about an internal error in the regular expression
              engine?


              --
              --Bryan

              Comment

              • Roy Smith

                #8
                Re: Is there a maximum length of a regular expression in python?

                In article <iudAf.15405$Yu .7291@newssvr27 .news.prodigy.n et>,
                Bryan Olson <fakeaddress@no where.org> wrote:
                [color=blue]
                > Roy Smith wrote:[color=green]
                > > olekristianvill abo@gmail.com wrote:
                > >
                > >[color=darkred]
                > >>I have a regular expression that is approximately 100k bytes. (It is
                > >>basically a list of all known norwegian postal numbers and the
                > >>correspondi ng place with | in between. I know this is not the intended
                > >>use for regular expressions, but it should nonetheless work.
                > >>
                > >>the pattern is
                > >>ur'(N-|NO-)?(5259 HJELLESTAD|4026 STAVANGER|4027 STAVANGER...... ..|8305
                > >>SVOLVÆR)'
                > >>
                > >>The error message I get is:
                > >>RuntimeErro r: internal error in regular expression engine[/color]
                > >
                > >
                > > I don't know of any stated maximum length, but I'm not at all surprised
                > > this causes the regex compiler to blow up. This is clearly a case of regex
                > > being the wrong tool for the job.[/color]
                >
                > Does no one care about an internal error in the regular expression
                > engine?[/color]

                I think the most that could be said here is that it should probably produce
                a better error message.

                Comment

                • Steve Holden

                  #9
                  Re: Is there a maximum length of a regular expression in python?

                  Bryan Olson wrote:[color=blue]
                  > Roy Smith wrote:
                  >[color=green]
                  >> olekristianvill abo@gmail.com wrote:
                  >>
                  >>
                  >>[color=darkred]
                  >>>I have a regular expression that is approximately 100k bytes. (It is
                  >>>basically a list of all known norwegian postal numbers and the
                  >>>correspondin g place with | in between. I know this is not the intended
                  >>>use for regular expressions, but it should nonetheless work.
                  >>>
                  >>>the pattern is
                  >>>ur'(N-|NO-)?(5259 HJELLESTAD|4026 STAVANGER|4027 STAVANGER...... ..|8305
                  >>>SVOLVÆR)'
                  >>>
                  >>>The error message I get is:
                  >>>RuntimeError : internal error in regular expression engine[/color]
                  >>
                  >>
                  >>I don't know of any stated maximum length, but I'm not at all surprised
                  >>this causes the regex compiler to blow up. This is clearly a case of regex
                  >>being the wrong tool for the job.[/color]
                  >
                  >
                  > Does no one care about an internal error in the regular expression
                  > engine?
                  >
                  >[/color]
                  Not one that requires parsing a 100 kilobyte re that should be replaced
                  by something more sensible, no.

                  regards
                  Steve
                  --
                  Steve Holden +44 150 684 7255 +1 800 494 3119
                  Holden Web LLC www.holdenweb.com
                  PyCon TX 2006 www.python.org/pycon/

                  Comment

                  • Paul Rubin

                    #10
                    Re: Is there a maximum length of a regular expression in python?

                    Steve Holden <steve@holdenwe b.com> writes:[color=blue][color=green]
                    > > Does no one care about an internal error in the regular expression
                    > > engine?
                    > >[/color]
                    > Not one that requires parsing a 100 kilobyte re that should be
                    > replaced by something more sensible, no.[/color]

                    If the internal error means the re engine bumped into some internal
                    limit and gracefully raised an exception, then fine. If "internal
                    error" means the re engine unexpectedly got into some inconsistent
                    internal state, then threw up its hands and barfed after discovering
                    the error sometime later, that's bad. Does nobody care which it is?

                    Comment

                    • Roy Smith

                      #11
                      Re: Is there a maximum length of a regular expression in python?

                      In article <7xhd7yb9ya.fsf @ruckus.brouhah a.com>,
                      Paul Rubin <http://phr.cx@NOSPAM.i nvalid> wrote:
                      [color=blue]
                      > Steve Holden <steve@holdenwe b.com> writes:[color=green][color=darkred]
                      > > > Does no one care about an internal error in the regular expression
                      > > > engine?
                      > > >[/color]
                      > > Not one that requires parsing a 100 kilobyte re that should be
                      > > replaced by something more sensible, no.[/color]
                      >
                      > If the internal error means the re engine bumped into some internal
                      > limit and gracefully raised an exception, then fine. If "internal
                      > error" means the re engine unexpectedly got into some inconsistent
                      > internal state, then threw up its hands and barfed after discovering
                      > the error sometime later, that's bad. Does nobody care which it is?[/color]

                      The nice thing about an open source project is that if nobody else gets
                      excited about some particular issue which is bothering you, you can take a
                      look yourself.

                      (from Python-2.3.4/Modules/_sre.c):

                      static void
                      pattern_error(i nt status)
                      {
                      switch (status) {
                      case SRE_ERROR_RECUR SION_LIMIT:
                      PyErr_SetString (
                      PyExc_RuntimeEr ror,
                      "maximum recursion limit exceeded"
                      );
                      break;
                      case SRE_ERROR_MEMOR Y:
                      PyErr_NoMemory( );
                      break;
                      default:
                      /* other error codes indicate compiler/engine bugs */
                      PyErr_SetString (
                      PyExc_RuntimeEr ror,
                      "internal error in regular expression engine"
                      );
                      }
                      }

                      I suppose one man's graceful exit is another man's barf.

                      Comment

                      • Tim Peters

                        #12
                        Re: Is there a maximum length of a regular expression in python?

                        [Bryan Olson][color=blue][color=green]
                        >> Does no one care about an internal error in the regular expression
                        >> engine?[/color][/color]

                        [Steve Holden][color=blue]
                        > Not one that requires parsing a 100 kilobyte re that should be replaced
                        > by something more sensible, no.[/color]

                        I care: this is a case of not detecting information loss due to
                        unchecked downcasting in C, and it was pure luck that it resulted in
                        an internal re error rather than, say, a wrong result. God only knows
                        what other pathologies the re engine could tricked into exhibiting
                        this way. Python 2.5 will raise an exception instead, during regexp
                        compilation (I just checked in code for this on the trunk; with some
                        luck, someone will backport that to 2.4 too).

                        Comment

                        • Steve Holden

                          #13
                          Re: Is there a maximum length of a regular expression in python?

                          Tim Peters wrote:[color=blue]
                          > [Bryan Olson]
                          >[color=green][color=darkred]
                          >>>Does no one care about an internal error in the regular expression
                          >>>engine?[/color][/color]
                          >
                          >
                          > [Steve Holden]
                          >[color=green]
                          >>Not one that requires parsing a 100 kilobyte re that should be replaced
                          >>by something more sensible, no.[/color]
                          >
                          >
                          > I care: this is a case of not detecting information loss due to
                          > unchecked downcasting in C, and it was pure luck that it resulted in
                          > an internal re error rather than, say, a wrong result. God only knows
                          > what other pathologies the re engine could tricked into exhibiting
                          > this way. Python 2.5 will raise an exception instead, during regexp
                          > compilation (I just checked in code for this on the trunk; with some
                          > luck, someone will backport that to 2.4 too).[/color]

                          Just goes to show you, ignorance is bliss.
                          What would we do without you, Tim?

                          regards
                          Steve
                          --
                          Steve Holden +44 150 684 7255 +1 800 494 3119
                          Holden Web LLC www.holdenweb.com
                          PyCon TX 2006 www.python.org/pycon/

                          Comment

                          • Frithiof Andreas Jensen

                            #14
                            Re: Is there a maximum length of a regular expression in python?


                            "Bryan Olson" <fakeaddress@no where.org> skrev i en meddelelse
                            news:iudAf.1540 5$Yu.7291@newss vr27.news.prodi gy.net...[color=blue]
                            > Roy Smith wrote:[/color]
                            [color=blue]
                            > Does no one care about an internal error in the regular expression
                            > engine?[/color]

                            Yes, but - given the example - In about the same way that I care about an
                            internal error in my car engine after dropping a spanner into it ;-)


                            Comment

                            • fuzzylollipop

                              #15
                              Re: Is there a maximum length of a regular expression in python?

                              this should really be posted to http://www.thedailywtf.com/, I wonder
                              if they have a german version of TheDailyWTF.com ?

                              Comment

                              Working...