why does this call to re.findall() loop forever?

**james.kirin40@gmail.com** · Nov 9 '08, 11:05 PM

Re: why does this call to re.findall() loop forever?

My apologies, given that Google Groups messes up the formatting, the
regexp should read

regexp = re.compile("""<li class=\"post\".*?<h4 class=\"desc\"> <a
href=
\"(.*?)\" rel=\"nofollow\">(.*?)</a>.*?</div>\s*(?:<p class=\"notes
\">(.*?)</p>)?.*?<div class=\"meta\"> (?:to ((?:<a class=\"tag\".*?)
+))*.*?<span class=\"date\" title=\"(.*?)\" >.*?</span>\s*</div>.*?</
li>""", re.DOTALL)

**Terry Reedy** · Nov 9 '08, 11:45 PM

Re: why does this call to re.findall() loop forever?

james.kirin40@gmail.com wrote:

> Hi everyone,
>
> I am using Python's re module to extract some data from html. The
> following code never returns, and I was wondering if someone can
> explain to me why. Is this a problem with my regexp (I tried really
> hard to find it?)?

[snip] html/xml string

> regexp = re.compile("<li class=\"post\".*?<h4 class=\"desc\"> <a href=
> \"(.*?)\" rel=\"nofollow\">(.*?)</a>.*?</div>\s*(?:<p class=\"notes
> \">(.*?)</p>)?.*?<div class=\"meta\"> (?:to ((?:<a class=\"tag\".*?)
> +))*.*?<span class=\"date\" title=\"(.*?)\" >.*?</span>\s*</div>.*?</
> li>", re.DOTALL)
>
> re.findall(regexp, s)

Python has several modules for parsing and working with XML. Do you
not know of them, or is there some reason they won't work?
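
As a rough sketch of the parser route (not code from this thread): the example below uses html.parser from the standard library of current Python 3. The PostParser class, the extract_posts helper, and the SAMPLE markup are all made up for illustration. The markup is only a guess at the page layout, reconstructed from the class names in the regexp quoted above, so the tag tests would need adjusting for the real page.

```python
from html.parser import HTMLParser


class PostParser(HTMLParser):
    """Collect one dict per <li class="post"> element (hypothetical layout)."""

    def __init__(self):
        super().__init__()
        self.posts = []      # finished posts
        self._post = None    # post currently being filled in, if any
        self._field = None   # text field currently being captured, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        cls = attrs.get("class")
        if tag == "li" and cls == "post":
            self._post = {"href": None, "title": "", "notes": "",
                          "tags": [], "date": None}
        elif self._post is None:
            return  # ignore everything outside a post
        elif tag == "a" and attrs.get("rel") == "nofollow":
            self._post["href"] = attrs.get("href")
            self._field = "title"
        elif tag == "p" and cls == "notes":
            self._field = "notes"
        elif tag == "a" and cls == "tag":
            self._post["tags"].append("")
            self._field = "tag"
        elif tag == "span" and cls == "date":
            self._post["date"] = attrs.get("title")

    def handle_data(self, data):
        if self._post is None or self._field is None:
            return
        if self._field == "tag":
            self._post["tags"][-1] += data
        else:
            self._post[self._field] += data

    def handle_endtag(self, tag):
        if tag in ("a", "p"):
            self._field = None  # stop capturing text
        elif tag == "li" and self._post is not None:
            self.posts.append(self._post)
            self._post = None


# Assumed sample markup; the real page may differ.
SAMPLE = """
<ul>
<li class="post">
  <h4 class="desc"><a href="http://example.com/" rel="nofollow">Example</a></h4>
  <p class="notes">A note</p>
  <div class="meta">to <a class="tag" href="/t/python">python</a>
    <a class="tag" href="/t/regex">regex</a>
    <span class="date" title="2008-11-09">Nov 08</span></div>
</li>
</ul>
"""


def extract_posts(html):
    """Return a list of post dicts found in the given HTML string."""
    parser = PostParser()
    parser.feed(html)
    parser.close()
    return parser.posts
```

Calling extract_posts(SAMPLE) gives one dict per post, with the link, title, notes, tag names, and date. A parser like this reads the input in a single pass, so it has no backtracking to get stuck in.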

**Nick Craig-Wood** · Nov 10 '08, 12:35 PM

Re: why does this call to re.findall() loop forever?

james.kirin40@gmail.com <james.kirin40@gmail.com> wrote:

> My apologies, given that Google Groups messes up the formatting, the
> regexp should read
>
> regexp = re.compile("""<li class=\"post\".*?<h4 class=\"desc\"> <a
> href=
> \"(.*?)\" rel=\"nofollow\">(.*?)</a>.*?</div>\s*(?:<p class=\"notes
> \">(.*?)</p>)?.*?<div class=\"meta\"> (?:to ((?:<a class=\"tag\".*?)
> +))*.*?<span class=\"date\" title=\"(.*?)\" >.*?</span>\s*</div>.*?</
> li>""", re.DOTALL)

Some regular expressions can't be searched in a reasonable length of
time. Not sure whether this is your problem but it might be! Search
for "exponential time regular expression" if you want some examples.

E.g. http://bugs.python.org/issue1515829
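
For a small illustration of the exponential case, here is a textbook pattern with nested quantifiers. It is not the regexp from this thread and not a diagnosis of it; the names NESTED and time_failed_match are made up for the example. On a backtracking engine such as CPython's re, the time for a failed match typically grows very quickly with the input length, so the lengths are kept small.

```python
import re
import time

# Nested quantifiers: the inner a+ and the outer (...)+ can split a run of
# "a"s in a very large number of ways, and a backtracking engine tries them
# before concluding that the trailing "b" is missing.
NESTED = re.compile(r"(a+)+b")


def time_failed_match(n):
    """Return (match, seconds) for matching NESTED against n 'a's and no 'b'."""
    text = "a" * n
    start = time.perf_counter()
    match = NESTED.match(text)
    return match, time.perf_counter() - start


if __name__ == "__main__":
    # Keep n small: each extra character can make the failed match much slower.
    for n in (14, 16, 18, 20):
        match, seconds = time_failed_match(n)
        print(n, match, f"{seconds:.4f}s")
```

A regexp that combines a repeated group with lazy .*? inside it, under re.DOTALL, may run into the same kind of blow-up when no match exists, which would look like a call that never returns.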

I'd probably attack this problem using BeautifulSoup rather than
regexps!

--
Nick Craig-Wood <nick@craig-wood.com> -- http://www.craig-wood.com/nick
