regex confusion

**Luther Barnum** · Jul 18 '05, 06:30 AM

Re: regex confusion

Maybe you meant:
import re, urllib
rgxPrev = re.compile('.*?a.*?')

url = 'http://nitace.bsd.uchicago.edu:8080/files/share/showdown_example2.html'
s = urllib.urlopen(url).read()
m = re.match(rgxPrev, s)
print m
print s.find('a')

match takes two arguments
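
For what it's worth, both call forms exist in the `re` module: the module-level `re.match(pattern, string)` takes two arguments, while the `match` method of a compiled pattern takes only the string. A minimal sketch, assuming Python 3 and a made-up sample string rather than the page from the original post:

```python
import re

# Sample text is an assumption for illustration; its 'a' is on the first line.
s = "xyz abc"
rgxPrev = re.compile('.*?a.*?')

m1 = rgxPrev.match(s)         # method on the compiled pattern: one argument
m2 = re.match(rgxPrev, s)     # module-level function: pattern plus string
m3 = re.match('.*?a.*?', s)   # the pattern may also be a plain string

# All three calls find the same text.
print(m1.group(), m2.group(), m3.group())
```

So the original `rgxPrev.match(s)` call was already valid, and the missing match has to come from somewhere else.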

"John Hunter" <jdhunter@ace.bsd.uchicago.edu> wrote in message
news:mailman.266.1070985064.16879.python-list@python.org...
>
> In trying to debug why a certain regex wasn't working like I expected
> it to, I came across this strange (to me) behavior. The file I am
> trying to match definitely contains many instances of the letter 'a',
> so I would expect the regex
>
> rgxPrev = re.compile('.*?a.*?')
>
> to match the string contents of the file. But it doesn't. Here is
> a complete example
>
> import re, urllib
> rgxPrev = re.compile('.*?a.*?')
>
> url = 'http://nitace.bsd.uchicago.edu:8080/files/share/showdown_example2.html'
> s = urllib.urlopen(url).read()
> m = rgxPrev.match(s)
> print m
> print s.find('a')
>
> m is None (no match) and the s.find('a') reports an 'a' at index 48.
>
> I read the regex to mean non-greedy match of anything up to an a,
> followed by non-greedy match of anything following an a, which this
> file should match.
>
> Or am I insane?
>
> John Hunter
>
>
> hunter:~/python/projects/poker/data/pokerroom> uname -a
> Linux hunter.paradise.lost 2.4.20-8smp #1 SMP Thu Mar 13 17:45:54 EST 2003 i686 i686 i386 GNU/Linux
> hunter:~/python/projects/poker/data/pokerroom> python
> Python 2.3.2 (#1, Oct 13 2003, 11:33:15)
> [GCC 3.3.1] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> Welcome to rlcompleter2 0.95
> for nice experiences hit <tab> multiple times
>

**Diez B. Roggisch** · Jul 18 '05, 06:30 AM

Re: regex confusion

John Hunter wrote:
>
> In trying to debug why a certain regex wasn't working like I expected
> it to, I came across this strange (to me) behavior. The file I am
> trying to match definitely contains many instances of the letter 'a',
> so I would expect the regex
>
> rgxPrev = re.compile('.*?a.*?')

This is a bogus regex - a '*' means "zero or more occurrences" of the
expression to its left. '?' means "zero or one occurrence" of the expression
to its left. I'm not exactly sure why this is not working, but it's definitely
redundant. Eliminating the redundancy gives you this:

rgxPrev = re.compile('.*a.*')

Works perfect.

Regards,

Diez

**A.M. Kuchling** · Jul 18 '05, 06:30 AM

Re: regex confusion

On Tue, 09 Dec 2003 09:43:24 -0600,
John Hunter <jdhunter@ace.bsd.uchicago.edu> wrote:
> rgxPrev = re.compile('.*?a.*?')

'.' doesn't match newlines unless you specify the re.DOTALL / (?s) flag, so it
won't match unless 'a' is on the very first line. Add (?s) to your
expression, and it should work (though it'll be much slower than the .find()
method).
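
A small sketch of that behaviour, assuming Python 3 and a made-up two-line string standing in for the downloaded page (the real file contents are not shown in the thread):

```python
import re

# Hypothetical stand-in for the page: the first 'a' is on the second line.
s = "first line\nsecond has a\n"

plain = re.compile('.*?a.*?')
dotall = re.compile('.*?a.*?', re.DOTALL)   # equivalent to '(?s).*?a.*?'

m_plain = plain.match(s)     # '.' cannot cross the newline, so no match
m_dotall = dotall.match(s)   # with DOTALL, '.' also matches '\n'

print(m_plain)               # None
print(m_dotall.end())        # position just past the first 'a'
print(s.find('a'))           # index of the first 'a'
```

Because `match` is anchored at the start of the string and `.` stops at the first newline, the plain pattern only ever sees the first line.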

--amk

**Peter Hansen** · Jul 18 '05, 06:30 AM

Re: regex confusion

"Diez B. Roggisch" wrote:
>
> John Hunter wrote:
>
> >
> > In trying to debug why a certain regex wasn't working like I expected
> > it to, I came across this strange (to me) behavior. The file I am
> > trying to match definitely contains many instances of the letter 'a',
> > so I would expect the regex
> >
> > rgxPrev = re.compile('.*?a.*?')
>
> This is a bogus regex - a '*' means "zero or more occurrences" for the
> expression to the left. '?' means "zero or one occurrence" for the exp to
> the left.

Not true. See http://www.python.org/doc/current/lib/re-syntax.html :

*?, +?, ??
The "*", "+", and "?" qualifiers are all greedy; they match as much text
as possible. ... Adding "?" after the qualifier makes it perform the match
in non-greedy or minimal fashion; as few characters as possible will be
matched. ...
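
To make the quoted passage concrete, here is a minimal sketch (Python 3, with a made-up sample string) comparing a greedy and a non-greedy qualifier on the same input:

```python
import re

s = "<b>bold</b>"   # sample text, chosen only for illustration

greedy = re.match('<.*>', s).group()    # '*' grabs as much as possible
lazy = re.match('<.*?>', s).group()     # '*?' grabs as little as possible

print(greedy)   # <b>bold</b>
print(lazy)     # <b>
```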

-Peter

**Peter Otten** · Jul 18 '05, 06:30 AM

Re: regex confusion

John Hunter wrote:
>
> In trying to debug why a certain regex wasn't working like I expected
> it to, I came across this strange (to me) behavior. The file I am
> trying to match definitely contains many instances of the letter 'a',
> so I would expect the regex
>
> rgxPrev = re.compile('.*?a.*?')
>
> to match the string contents of the file. But it doesn't. Here is

[...]

> I read the regex to mean non-greedy match of anything up to an a,
> followed by non-greedy match of anything following an a, which this
> file should match.

There is a nice example where non-greedy regexes are really useful in A. M.
Kuchling's Regex Howto (http://www.amk.ca/python/howto/regex/regex.html)

> Or am I insane?

This may be off-topic, but the easiest if not fastest way to find multiple
occurrences of a string in a text is:

>>> import re
>>> r = re.compile("a")
>>> for m in r.finditer("abca\na"):
...     print m.start()
...
0
3
5
>>>

Peter

**Diez B. Roggisch** · Jul 18 '05, 06:30 AM

Re: regex confusion

>> This is a bogus regex - a '*' means "zero or more occurrences" for the
>> expression to the left. '?' means "zero or one occurrence" for the exp to
>> the left.
>
> Not true. See http://www.python.org/doc/current/lib/re-syntax.html :
>
> *?, +?, ??
> The "*", "+", and "?" qualifiers are all greedy; they match as much text
> as possible. ... Adding "?" after the qualifier makes it perform the
> match in non-greedy or minimal fashion; as few characters as possible will
> be matched. ...

Hmm. But if that's true, what does ".??" mean then? The first ? is not
greedy, so nothing is matched at all. The same is true for ".*?", and
".+?" is then equal to ".". So what makes this useful? The regex in question
definitely didn't work with it.

Diez

**Diez B. Roggisch** · Jul 18 '05, 06:30 AM

Re: regex confusion

> Hmm. But if that's true, what does ".??" mean then? The first ? is not
> greedy, so nothing is matched at all. The same is true for ".*?", and
> ".+?" is then equal to ".". So what makes this useful? The regex in
> question definitely didn't work with it.

Ok - I just found out - it makes sense when taking into account what follows
in the regex, as that will be matched earlier. Neat - didn't know that such
things existed.
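
A short sketch of that effect, assuming Python 3 and a made-up sample string: on its own a non-greedy qualifier matches the minimum, but it grows as far as needed for the rest of the pattern to match.

```python
import re

s = "xxaxxa"   # sample text, chosen only for illustration

alone_star = re.match('.*?', s).group()    # nothing forces it to grow
alone_opt = re.match('.??', s).group()     # same: prefers zero characters
alone_plus = re.match('.+?', s).group()    # must take at least one character
lazy_then_a = re.match('.*?a', s).group()  # grows up to the first 'a'
greedy_then_a = re.match('.*a', s).group() # runs to the last 'a'

print(repr(alone_star))      # ''
print(repr(alone_opt))       # ''
print(repr(alone_plus))      # 'x'
print(repr(lazy_then_a))     # 'xxa'
print(repr(greedy_then_a))   # 'xxaxxa'
```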

Diez

**John Hunter** · Jul 18 '05, 06:30 AM

Re: regex confusion

>>>>> "Peter" == Peter Otten <__peter__@web.de> writes:

Peter> This may be off-topic, but the easiest if not fastest way
Peter> to find multiple occurrences of a string in a text is:

Right, I actually am using regex matching and not literal char
matching, but in trying to debug why my regex wasn't working, I
simplified it to the simplest case I could, which was a string
literal.

Thanks for the DOTALL pointer above.

JDH
