regex confusion

  • John Hunter

    regex confusion


    In trying to debug why a certain regex wasn't working like I expected
    it to, I came across this strange (to me) behavior. The file I am
    trying to match definitely contains many instances of the letter 'a',
    so I would expect the regex

    rgxPrev = re.compile('.*?a.*?')

    to match the string contents of the file. But it doesn't. Here is
    a complete example

    import re, urllib
    rgxPrev = re.compile('.*?a.*?')

    url = 'http://nitace.bsd.uchicago.edu:8080/files/share/showdown_example2.html'
    s = urllib.urlopen(url).read()
    m = rgxPrev.match(s)
    print m
    print s.find('a')

    m is None (no match) and the s.find('a') reports an 'a' at index 48.

    I read the regex to mean non-greedy match of anything up to an a,
    followed by non-greedy match of anything following an a, which this
    file should match.

    Or am I insane?

    John Hunter


    hunter:~/python/projects/poker/data/pokerroom> uname -a
    Linux hunter.paradise.lost 2.4.20-8smp #1 SMP Thu Mar 13 17:45:54 EST 2003 i686
    i686 i386 GNU/Linux
    hunter:~/python/projects/poker/data/pokerroom> python
    Python 2.3.2 (#1, Oct 13 2003, 11:33:15)
    [GCC 3.3.1] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    Welcome to rlcompleter2 0.95
    for nice experiences hit <tab> multiple times
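For anyone who wants to reproduce this without fetching the URL, here is a minimal sketch. It assumes Python 3 syntax (the post itself uses Python 2.3) and a made-up two-line string `s` standing in for the page contents, so the index it prints differs from the 48 reported above.

```python
import re

# Same pattern as in the post.
rgxPrev = re.compile('.*?a.*?')

# Hypothetical stand-in for the downloaded page: the first line has no 'a'.
s = "first line\nsecond line has an a\n"

m = rgxPrev.match(s)
first_a = s.find('a')

print(m)        # None: no match is reported
print(first_a)  # 24: yet the string does contain an 'a'
```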

  • Luther Barnum

    #2
    Re: regex confusion

    Maybe you meant:
    import re, urllib
    rgxPrev = re.compile('.*?a.*?')

    url = 'http://nitace.bsd.uchicago.edu:8080/files/share/showdown_example2.html'
    s = urllib.urlopen(url).read()
    m = re.match(rgxPrev, s)  # <-- the changed line
    print m
    print s.find('a')

    match takes two arguments

    "John Hunter" <jdhunter@ace.bsd.uchicago.edu> wrote in message
    news:mailman.266.1070985064.16879.python-list@python.org...
    >
    > In trying to debug why a certain regex wasn't working like I expected
    > it to, I came across this strange (to me) behavior. The file I am
    > trying to match definitely contains many instances of the letter 'a',
    > so I would expect the regex
    >
    > rgxPrev = re.compile('.*?a.*?')
    >
    > to match the string contents of the file. But it doesn't. Here is
    > a complete example
    >
    > import re, urllib
    > rgxPrev = re.compile('.*?a.*?')
    >
    > url = 'http://nitace.bsd.uchicago.edu:8080/files/share/showdown_example2.html'
    > s = urllib.urlopen(url).read()
    > m = rgxPrev.match(s)
    > print m
    > print s.find('a')
    >
    > m is None (no match) and the s.find('a') reports an 'a' at index 48.
    >
    > I read the regex to mean non-greedy match of anything up to an a,
    > followed by non-greedy match of anything following an a, which this
    > file should match.
    >
    > Or am I insane?
    >
    > John Hunter
    >
    >
    > hunter:~/python/projects/poker/data/pokerroom> uname -a
    > Linux hunter.paradise.lost 2.4.20-8smp #1 SMP Thu Mar 13 17:45:54 EST 2003 i686
    > i686 i386 GNU/Linux
    > hunter:~/python/projects/poker/data/pokerroom> python
    > Python 2.3.2 (#1, Oct 13 2003, 11:33:15)
    > [GCC 3.3.1] on linux2
    > Type "help", "copyright", "credits" or "license" for more information.
    > Welcome to rlcompleter2 0.95
    > for nice experiences hit <tab> multiple times
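As far as the re module goes, both call forms exist: a compiled pattern has its own match(string) method, and the module-level re.match(pattern, string) also accepts a compiled pattern. Here is a minimal sketch (Python 3 syntax; the sample string `text` is made up) suggesting that the two give the same result, so switching between them should not change whether the pattern matches.

```python
import re

rgxPrev = re.compile('.*?a.*?')
text = "xxaxx"  # hypothetical single-line sample

m_method = rgxPrev.match(text)      # method on the compiled pattern
m_module = re.match(rgxPrev, text)  # module-level function, same pattern

print(m_method.group())  # xxa
print(m_module.group())  # xxa
```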



    • Diez B. Roggisch

      #3
      Re: regex confusion

      John Hunter wrote:
      >
      > In trying to debug why a certain regex wasn't working like I expected
      > it to, I came across this strange (to me) behavior. The file I am
      > trying to match definitely contains many instances of the letter 'a',
      > so I would expect the regex
      >
      > rgxPrev = re.compile('.*?a.*?')

      This is a bogus regex - a '*' means "zero or more occurrences" for the
      expression to the left. '?' means "zero or one occurrence" for the exp to
      the left. I'm not exactly sure why this is not working, but it's definitely
      redundant. Eliminating the redundancy gives you this:

      rgxPrev = re.compile('.*a.*')

      Works perfectly.

      Regards,

      Diez


      • A.M. Kuchling

        #4
        Re: regex confusion

        On Tue, 09 Dec 2003 09:43:24 -0600,
        John Hunter <jdhunter@ace.bsd.uchicago.edu> wrote:
        > rgxPrev = re.compile('.*?a.*?')

        '.' doesn't match newlines unless you specify the re.DOTALL / (?s) flag, so it
        won't match unless 'a' is on the very first line. Add (?s) to your
        expression, and it should work (though it'll be much slower than the .find()
        method).

        --amk
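A minimal sketch of the DOTALL suggestion above, in Python 3 syntax with a made-up string `s` whose first 'a' sits on the second line. Both spellings of the flag are shown; the variable names are only for illustration.

```python
import re

s = "first line\nsecond line has an a\n"  # hypothetical sample

plain = re.compile('.*?a.*?')              # '.' stops at '\n'
dotall = re.compile('.*?a.*?', re.DOTALL)  # '.' also matches '\n'
inline = re.compile('(?s).*?a.*?')         # same flag, written inline

print(plain.match(s))         # None
print(dotall.match(s).end())  # 25: the match runs up to and including the first 'a'
print(inline.match(s).end())  # 25
```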


        • Peter Hansen

          #5
          Re: regex confusion

          "Diez B. Roggisch" wrote:
          >
          > John Hunter wrote:
          >
          > >
          > > In trying to debug why a certain regex wasn't working like I expected
          > > it to, I came across this strange (to me) behavior. The file I am
          > > trying to match definitely contains many instances of the letter 'a',
          > > so I would expect the regex
          > >
          > > rgxPrev = re.compile('.*?a.*?')
          >
          > This is a bogus regex - a '*' means "zero or more occurrences" for the
          > expression to the left. '?' means "zero or one occurrence" for the exp to
          > the left.

          Not true. See http://www.python.org/doc/current/lib/re-syntax.html :

          *?, +?, ??
          The "*", "+", and "?" qualifiers are all greedy; they match as much text
          as possible. .... Adding "?" after the qualifier makes it perform the match
          in non-greedy or minimal fashion; as few characters as possible will be
          matched. ....

          -Peter
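To illustrate the documentation excerpt quoted above, here is a small sketch (Python 3 syntax; the sample string is made up) comparing a greedy and a non-greedy quantifier on the same input.

```python
import re

text = "<b>bold</b>"  # hypothetical sample

greedy = re.match('<.*>', text).group()  # '*' takes as much as it can
lazy = re.match('<.*?>', text).group()   # '*?' takes as little as it can

print(greedy)  # <b>bold</b>
print(lazy)    # <b>
```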


          • Peter Otten

            #6
            Re: regex confusion

            John Hunter wrote:
            >
            > In trying to debug why a certain regex wasn't working like I expected
            > it to, I came across this strange (to me) behavior. The file I am
            > trying to match definitely contains many instances of the letter 'a',
            > so I would expect the regex
            >
            > rgxPrev = re.compile('.*?a.*?')
            >
            > to match the string contents of the file. But it doesn't. Here is

            [...]

            > I read the regex to mean non-greedy match of anything up to an a,
            > followed by non-greedy match of anything following an a, which this
            > file should match.

            There is a nice example where non-greedy regexes are really useful in A. M.
            Kuchling's Regex Howto (http://www.amk.ca/python/howto/regex/regex.html)

            > Or am I insane?

            This may be off-topic, but the easiest if not fastest way to find multiple
            occurrences of a string in a text is:

            >>> import re
            >>> r = re.compile("a")
            >>> for m in r.finditer("abca\na"):
            ...     print m.start()
            ...
            0
            3
            5
            >>>

            Peter


            • Diez B. Roggisch

              #7
              Re: regex confusion

              >> This is a bogus regex - a '*' means "zero or more occurrences" for the
              >> expression to the left. '?' means "zero or one occurrence" for the exp to
              >> the left.
              >
              > Not true. See http://www.python.org/doc/current/lib/re-syntax.html :
              >
              > *?, +?, ??
              > The "*", "+", and "?" qualifiers are all greedy; they match as much text
              > as possible. .... Adding "?" after the qualifier makes it perform the
              > match in non-greedy or minimal fashion; as few characters as possible will
              > be matched. ....

              Hmm. But if that's true, what does ".??" mean then - the first ? is not
              greedy, so nothing is matched at all. The same is true for ".*?", and
              ".+?" is then equal to ".". So what makes this useful? The regex in question
              definitely didn't work with it.

              Diez


              • Diez B. Roggisch

                #8
                Re: regex confusion

                > Hmm. But if that's true, what does ".??" mean then - the first ? is not
                > greedy, so nothing is matched at all. The same is true for ".*?", and
                > ".+?" is then equal to ".". So what makes this useful? The regex in
                > question definitely didn't work with it.

                Ok - I just found out - it makes sense once you take into account what
                follows in the regex, since that part then gets to match as early as
                possible. Neat - didn't know that such things existed.

                Diez
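The point above can be checked directly. Here is a minimal sketch (Python 3 syntax, made-up sample string): on their own the non-greedy forms match as little as they are allowed to, and they only consume more when something after them in the pattern needs it.

```python
import re

text = "xxaxxa"  # hypothetical sample

alone_star = re.match('.*?', text).group()     # '': zero characters suffice
alone_opt = re.match('.??', text).group()      # '': zero characters suffice
alone_plus = re.match('.+?', text).group()     # 'x': one character is the minimum
lazy_then_a = re.match('.*?a', text).group()   # 'xxa': stops at the first 'a'
greedy_then_a = re.match('.*a', text).group()  # 'xxaxxa': runs to the last 'a'

print(repr(alone_star), repr(alone_opt), repr(alone_plus))
print(lazy_then_a, greedy_then_a)
```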


                • John Hunter

                  #9
                  Re: regex confusion

                  >>>>> "Peter" == Peter Otten <__peter__@web.de> writes:

                  Peter> This may be off-topic, but the easiest if not fastest way
                  Peter> to find multiple occurrences of a string in a text is:

                  Right, I actually am using regex matching and not literal char
                  matching, but in trying to debug why my regex wasn't working, I
                  simplified it to the simplest case I could, which was a string
                  literal.

                  Thanks for the DOTALL pointer above.

                  JDH
