Freeze problem with Regular Expression

  • Kirk

    Freeze problem with Regular Expression

    Hi All,
    the following regular expression match seems to enter an infinite
    loop:

    ################
    import re
    text = ' MSX INTERNATIONAL HOLDINGS ITALIA srl (di seguito MSX ITALIA) una '
    re.findall('[^A-Z|0-9]*((?:[0-9]*[A-Z]+[0-9|a-z|\-]*)+\s*[a-z]*\s*(?:[0-9]*[A-Z]+[0-9|a-z|\-]*\s*)*)([^A-Z]*)$', text)
    ################

    No problem with Perl with the same expression:

    ################
    $s = ' MSX INTERNATIONAL HOLDINGS ITALIA srl (di seguito MSX ITALIA) una ';
    $s =~ /[^A-Z|0-9]*((?:[0-9]*[A-Z]+[0-9|a-z|\-]*)+\s*[a-z]*\s*(?:[0-9]*[A-Z]+[0-9|a-z|\-]*\s*)*)([^A-Z]*)$/;
    print $1;
    ################

    I have Python 2.5.2 on Ubuntu 8.04.
    Any idea?
    Thanks!

    --
    Kirk
  • cirfu

    #2
    Re: Freeze problem with Regular Expression

    what are you trying to do?


    • Reedick, Andrew

      #3
      RE: Freeze problem with Regular Expression

      It locks up on 2.5.2 on Windows also. Probably too much recursion going
      on.


      What's with the |'s in [0-9|a-z|\-]? Inside a character class, '|' is a
      literal character, not an 'or' operator. I think you meant to say either
      '[0-9a-z\-]' or '[0-9a-z\-|]'.
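
To illustrate the point about '|' inside brackets, here is a minimal sketch. It assumes Python 3 syntax, and the `sample` string is made up for the demonstration rather than taken from the original post.

```python
import re

sample = "a|1-Z"  # hypothetical input, not from the original post

# Inside [...] the pipe is just another member of the set,
# so listing it twice changes nothing.
with_pipe = re.findall(r"[0-9|a-z|\-]", sample)
without_pipe = re.findall(r"[0-9a-z\-]", sample)

print(with_pipe)     # ['a', '|', '1', '-']
print(without_pipe)  # ['a', '1', '-']
```

The only difference between the two classes is whether a literal '|' in the input is accepted.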




      • Maric Michaud

        #4
        Re: Freeze problem with Regular Expression

        On Wednesday 25 June 2008 18:40:08, cirfu wrote:
        > what are you trying to do?

        This is indeed the right question.

        Whatever the implementation or language, something like that may happily
        work, but I doubt you'll find one that tells you whether it *should* work
        or not; my brain-embedded parser is going into an infinite loop too...

        That said, "[0-9|a-z|\-]" is strange in itself: a pipe (|) between square
        brackets is the literal character '|', so there is no reason for it to
        appear twice.

        Very complicated regexps are always evil, and a two- or three-stage filter
        is likely to do the job with good, or at least better, readability.

        But once more, what are you trying to do? It is not even clear that regexp
        matching is the best tool for it.
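
As one possible reading of the "two or three stage filtering" idea, the sketch below assumes the goal is to pull out runs of capitalised tokens separated only by whitespace. That goal is a guess, since the thread never states it, and the names `TOKEN` and `capitalised_runs` are hypothetical. Python 3 syntax is assumed.

```python
import re

# Stage 1: a simple token pattern with no nested quantifiers.
TOKEN = re.compile(r"[0-9]*[A-Z]+[0-9a-z\-]*")

def capitalised_runs(text):
    """Group adjacent tokens that are separated only by whitespace."""
    runs = []
    prev_end = None
    for m in TOKEN.finditer(text):
        # Stage 2: plain Python decides whether the token extends the current run.
        gap = "" if prev_end is None else text[prev_end:m.start()]
        if runs and gap.isspace():
            runs[-1].append(m.group())
        else:
            runs.append([m.group()])
        prev_end = m.end()
    return [" ".join(run) for run in runs]

text = ' MSX INTERNATIONAL HOLDINGS ITALIA srl (di seguito MSX ITALIA) una '
print(capitalised_runs(text))
# ['MSX INTERNATIONAL HOLDINGS ITALIA', 'MSX ITALIA']
```

Because the token pattern is flat and the grouping happens in ordinary code, there is no ambiguity for the regex engine to explore by backtracking.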

        --
        _____________

        Maric Michaud


        • John Machin

          #5
          Re: Freeze problem with Regular Expression

          On Jun 26, 1:20 am, Kirk <nore...@yahoo.com> wrote:
          > Hi All,
          > the following regular expression matching seems to enter in a infinite
          > loop:
          >
          > import re
          > text = ' MSX INTERNATIONAL HOLDINGS ITALIA srl (di seguito MSX ITALIA) una '
          > re.findall('[^A-Z|0-9]*((?:[0-9]*[A-Z]+[0-9|a-z|\-]*)+\s*[a-z]*\s*(?:[0-9]*[A-Z]+[0-9|a-z|\-]*\s*)*)([^A-Z]*)$', text)
          >
          [expletives deleted]
          >
          > I've python 2.5.2 on Ubuntu 8.04.
          > any idea?

          Several problems:
          (1) lose the vertical bars (as advised by others)
          (2) ALWAYS use a raw string for regexes; your \s* will match on lower-
          case 's', not on spaces
          (3) why are you using findall on a pattern that ends in "$"?
          (4) using non-verbose regexes of that length means you haven't got a
          petrol drum's hope in hell of understanding what's going on
          (5) too many variable-length patterns, will take a finite (but very
          long) time to evaluate
          (6) as remarked by others, you haven't said what you are trying to do;
          what it actually is doing doesn't look sensible (see below).

          Following code is after fixing problems 1,2,3,4:

          C:\junk>type infinitere.py
          import re
          text = ' MSX INTERNATIONAL HOLDINGS ITALIA srl (di seguito MSX ITALIA) una '
          regex0 = r"""
              [^A-Z0-9]*              # match leading space
              (
                  (?:
                      [0-9]*          # match nothing
                      [A-Z]+          # match "MSX"
                      [0-9a-z\-]*     # match nothing
                  )+                  # match "MSX"
                  \s*                 # match " "
                  [a-z]*              # match nothing
                  \s*                 # match nothing
                  (?:
                      [0-9]*
                      [A-Z]+
                      [0-9a-z\-]*
                      \s*
                  )*                  # match "INTERNATIONAL HOLDINGS ITALIA "
              )
              ([^A-Z]*)               # match "srl (di seguito "
          """
          regex1 = regex0 + "$"
          for rxno, rx in enumerate([regex0, regex1]):
              mobj = re.compile(rx, re.VERBOSE).match(text)
              if mobj:
                  print rxno, mobj.groups()
              else:
                  print rxno, "failed"

          C:\junk>infinitere.py
          0 ('MSX INTERNATIONAL HOLDINGS ITALIA ', 'srl (di seguito ')
          ### taking a long time, interrupted

          HTH,
          John


          • John Machin

            #6
            Re: Freeze problem with Regular Expression

            On Jun 26, 8:29 am, John Machin <sjmac...@lexicon.net> wrote:
            > (2) ALWAYS use a raw string for regexes; your \s* will match on lower-
            > case 's', not on spaces

            and should have written:

            (2) ALWAYS use a raw string for regexes. <<<=== Big fat full stop
            aka period.

            but he was at the time only half-way through the first cup of coffee
            for the day :-)
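
A small sketch of why the raw-string advice still matters even though `\s` happened to survive here: Python leaves unrecognised escapes such as `\s` in the string unchanged (newer Python 3 releases warn about this), but recognised ones such as `\b` are converted before the regex engine ever sees them. The `\bITALIA\b` pattern is an illustrative assumption, not something from the original post. Python 3 syntax is assumed.

```python
import re

text = ' MSX INTERNATIONAL HOLDINGS ITALIA srl (di seguito MSX ITALIA) una '

# '\b' in a normal string literal is the backspace character (0x08) ...
plain = '\bITALIA\b'
# ... while r'\b' keeps the backslash, so the regex engine sees a word boundary.
raw = r'\bITALIA\b'

print(len(plain), len(raw))     # 8 10
print(re.findall(plain, text))  # []
print(re.findall(raw, text))    # ['ITALIA', 'ITALIA']
```

The non-raw pattern silently searches for backspace characters and finds nothing, which is a much harder bug to spot than a syntax error.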


            • Peter Pearson

              #7
              Re: Freeze problem with Regular Expression

              If it will help some smarter person identify the problem, it can
              be simplified to this:

              re.findall('[^X]*((?:0*X+0*)+\s*a*\s*(?:0*X+0*\s*)*)([^X]*)$',
                         "XXXXXXXXXXXXXX XXX (X")

              This doesn't actually hang, it just takes a long time. The
              time taken increases quickly as the chain of X's gets longer.
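
The growth described above can be observed with a rough timing sketch. It assumes Python 3 (`time.perf_counter`), uses the simplified pattern as a raw string, and keeps the run of X's short so the script finishes quickly. The helper name `time_findall` and the chosen lengths are hypothetical, and the absolute numbers will vary by machine and interpreter version.

```python
import re
import time

PATTERN = re.compile(r'[^X]*((?:0*X+0*)+\s*a*\s*(?:0*X+0*\s*)*)([^X]*)$')

def time_findall(n):
    """Return the seconds spent in findall for a subject that starts with n X's."""
    subject = 'X' * n + ' XXX (X'
    start = time.perf_counter()
    PATTERN.findall(subject)
    return time.perf_counter() - start

# Deliberately small lengths: each extra X multiplies the work.
timings = [(n, time_findall(n)) for n in (6, 8, 10, 12)]
for n, seconds in timings:
    print(n, round(seconds, 4))
```

The two adjacent repeated groups can split the same run of X's between them in many ways, and the engine tries them all before giving up on each starting position.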

              HTH

              --
              To email me, substitute nowhere->spamcop, invalid->net.


              • Kirk

                #8
                Re: Freeze problem with Regular Expression

                On Wed, 25 Jun 2008 15:29:38 -0700, John Machin wrote:
                > Several problems:

                Ciao John (and all participating in this thread),
                first of all I'm sorry for the delay but I was away on business.

                > (1) lose the vertical bars (as advised by others) (2) ALWAYS use a raw
                > string for regexes; your \s* will match on lower-case 's', not on
                > spaces

                right! thanks!

                > (3) why are you using findall on a pattern that ends in "$"?

                Yes, you are right, I started with a different need and then it changed
                over time...

                > (6) as remarked by others, you haven't said what you are trying to do;

                I reply here to all of you on this point: that's not important,
                although I appreciate your suggestions very much!
                My point was 'something that works in Perl has problems in Python'.
                In this respect, I thank Peter for his analysis.
                Probably Perl has a different pattern matching algorithm.

                Thanks again to all of you!

                Bye!

                --
                Kirk


                • John Machin

                  #9
                  Re: Freeze problem with Regular Expression

                  On Jul 1, 12:45 am, Kirk <nore...@yahoo.com> wrote:
                  > I reply here to all of you about such point: that's not important,
                  > although I appreciate very much your suggestions!
                  > My point was 'something that works in Perl, has problems in Python'.

                  It *is* important; our point was 'you didn't define "works", and it
                  was near-impossible (without transcribing your regex into verbose
                  mode) to guess at what you suppose it might do sometimes'.


                  • Kirk

                    #10
                    Re: Freeze problem with Regular Expression

                    On Mon, 30 Jun 2008 13:43:22 -0700, John Machin wrote:
                    >> I reply here to all of you about such point: that's not important,
                    >> although I appreciate very much your suggestions! My point was
                    >> 'something that works in Perl, has problems in Python'.
                    >
                    > It *is* important; our point was 'you didn't define "works", and it was

                    ok...

                    > near-impossible (without transcribing your regex into verbose mode) to
                    > guess at what you suppose it might do sometimes'.

                    fine: it's supposed to terminate! :-)

                    Do you think that hanging is an *admissible* behavior? Couldn't we learn
                    something from the Perl implementation?

                    This is my point.

                    Bye

                    --
                    Kirk
