Hi everyone,
I am using Python's re module to extract some data from HTML. The
following code never returns, and I was wondering if someone could
explain why. Is this a problem with my regexp? I tried really hard to
find one.
The string contains three records (list items in an HTML page). Notice
that NONE of them matches the regexp: these records do not contain the
"title" attribute which the regexp expects inside '<span
class="date">'.
The weird thing is that removing any of the three records makes
findall() immediately return an empty list, while if I pass all three
records to findall() it never returns. Why does this happen?
This is using Python 2.6.
Thanks so much for any help.
-james
s="""<li class="post" key="4994199a0b80136cb3174e9e875c545e">
<h4 class="desc"><a href="http://www.sluggy.com/"
rel="nofollow"> Sluggy Freelance</a>
</h4>
<div class="commands " <a save href="/post?url=http%3A%2F
%2Fwww.sluggy.com%2F&title=Sluggy
%20Freelance&amp;copyuser=crowebert&copytags=imported%2BRSS
%2BComics%2Bhumor%2Bdaily%2Bwebcomics&jump=no&partner=del"
class="copy" rel="nofollow"> save this</a></div> <div class="meta">to
<a class="tag" href="/crowebert/imported">imported</a> <a class="tag"
href="/crowebert/RSS">RSS</a> <a class="tag" href="/crowebert/
Comics">Comics</a> <a class="tag" href="/crowebert/humor">humor</a> <a
class="tag" href="/crowebert/daily">daily</a> <a class="tag" href="/
crowebert/webcomics">webcomics</a> ... <a class="pop" href="/url/
ac655d3fe17873b31abeb29a1043e439" style="padding: 0 0.2em; background-
color: rgb(100%, 66%, 66%);">saved by 983 other people</a> <span
class="date">1945-07-18</span> </div>
</li>
<li class="post" key="65d66f4197fc7eba5c214fe85ed77725">
<h4 class="desc"><a href="http://www.snackbar-games.com/
gbacovers.php" rel="nofollow"> Snackbar-Games.com :: GBA DS Cover
Project</a>
</h4>
<div class="commands " <a save href="/post?url=http%3A%2F
%2Fwww.snackbar-games.com%2Fgbacovers.php&title=Snackbar-Games.com
%20%3A%3A%20GBA%20DS%20Cover
%20Project&copyuser=crowebert&copytags=imported%2BBookmarkMenu
%2BGameStuff%2Bart%2BGBA%2Bgames
%2Bnintendo&amp;jump=no&partner=del" class="copy"
rel="nofollow"> save this</a></div> <div class="meta">to <a class="tag"
href="/crowebert/imported">imported</a> <a class="tag" href="/
crowebert/BookmarkMenu">BookmarkMenu</a> <a class="tag" href="/
crowebert/GameStuff">GameStuff</a> <a class="tag" href="/crowebert/
art">art</a> <a class="tag" href="/crowebert/GBA">GBA</a> <a
class="tag" href="/crowebert/games">games</a> <a class="tag" href="/
crowebert/nintendo">nintendo</a> ... <a class="pop" href="/url/
a65a4a0ebe813ec6e9c881331e3f9583" style="padding: 0 0.2em; background-
color: rgb(100%, 84%, 84%);">saved by 26 other people</a> <span
class="date">1948-12-31</span> </div>
</li>
<li class="post" key="690ace1f465ae419dee8145ad3871024">
<h4 class="desc"><a href="http://www.megatokyo.com/"
rel="nofollow"> MegaTokyo</a>
</h4>
<div class="commands " <a save href="/post?url=http%3A%2F
%2Fwww.megatokyo.com
%2F&title=MegaTokyo&copyuser=crowebert&copytags=imported
%2BBookmarkBar%2BWeekendComics%2Bcomics%2Bmanga%2Bhumor
%2Bwebcomics&amp;jump=no&partner=del" class="copy"
rel="nofollow"> save this</a></div> <div class="meta">to <a class="tag"
href="/crowebert/imported">imported</a> <a class="tag" href="/
crowebert/BookmarkBar">BookmarkBar</a> <a class="tag" href="/crowebert/
WeekendComics">WeekendComics</a> <a class="tag" href="/crowebert/
comics">comics</a> <a class="tag" href="/crowebert/manga">manga</a> <a
class="tag" href="/crowebert/humor">humor</a> <a class="tag" href="/
crowebert/webcomics">webcomics</a> ... <a class="pop" href="/url/
94843244f0c6d80f1c6806ed5c0abec7" style="padding: 0 0.2em; background-
color: rgb(100%, 60%, 60%);">saved by 2784 other people</a> <span
class="date">1946-01-28</span> </div>
</li>"""
import re

# Same pattern as before, split across adjacent raw strings so that the
# wrapped lines form a valid Python expression.
regexp = re.compile(
    r'<li class="post".*?<h4 class="desc"><a href="(.*?)" rel="nofollow">(.*?)</a>'
    r'.*?</div>\s*(?:<p class="notes">(.*?)</p>)?.*?<div class="meta">'
    r'(?:to ((?:<a class="tag".*?)+))*.*?'
    r'<span class="date" title="(.*?)">.*?</span>\s*</div>.*?</li>',
    re.DOTALL)
re.findall(regexp, s)
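
Since findall() returns immediately on smaller inputs, one possible workaround is to run the regexp on one record at a time instead of on the whole string. The sketch below is only an illustration of that idea, not a verified fix for the regexp above: the helper name findall_per_record and the RECORD delimiter pattern are hypothetical, and it assumes every record starts with '<li class="post"' and ends at the next '</li>'.

```python
import re

# Assumption: each record is one <li class="post"> ... </li> chunk.
RECORD = re.compile(r'<li class="post".*?</li>', re.DOTALL)

def findall_per_record(pattern, html):
    """Apply a compiled pattern to each record separately and pool the results."""
    results = []
    for record in RECORD.findall(html):
        # The pattern only ever sees one record, so a failed match
        # cannot keep searching across the following records.
        results.extend(pattern.findall(record))
    return results
```

It would be called as findall_per_record(regexp, s). Whether this is fast enough for the regexp above on a single record is something I have not measured.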