why does this call to re.findall() loop forever?

  • james.kirin40@gmail.com

    why does this call to re.findall() loop forever?

    Hi everyone,

    I am using Python's re module to extract some data from HTML. The
    following code never returns, and I was wondering if someone can
    explain to me why. Is this a problem with my regexp? (I tried really
    hard to find it.)

    The string contains three records (list items in an HTML page). Notice
    that NONE of them matches the regexp: these records do not contain the
    "title" attribute that the regexp expects inside '<span class="date">'.

    The weird thing is that removing any of the three records makes
    findall() immediately return an empty list, while if I pass all three
    records to findall() it never returns. Why does this happen?

    This is with Python 2.6.

    Thanks so much for any help

    -james

    s="""<li class="post" key="4994199a0b80136cb3174e9e875c545e">
    <h4 class="desc"><a href="http://www.sluggy.com/"
    rel="nofollow">Sluggy Freelance</a>
    </h4>
    <div class="commands"> &nbsp;<a save href="/post?url=http%3A%2F
    %2Fwww.sluggy.com%2F&amp;title=Sluggy
    %20Freelance&amp;copyuser=crowebert&amp;copytags=imported%2BRSS
    %2BComics%2Bhumor%2Bdaily%2Bwebcomics&amp;jump=no&amp;partner=del"
    class="copy" rel="nofollow">save this</a></div> <div class="meta">to
    <a class="tag" href="/crowebert/imported">imported</a> <a class="tag"
    href="/crowebert/RSS">RSS</a> <a class="tag" href="/crowebert/
    Comics">Comics</a> <a class="tag" href="/crowebert/humor">humor</a> <a
    class="tag" href="/crowebert/daily">daily</a> <a class="tag" href="/
    crowebert/webcomics">webcomics</a> ... <a class="pop" href="/url/
    ac655d3fe17873b31abeb29a1043e439" style="padding: 0 0.2em; background-
    color: rgb(100%, 66%, 66%);">saved by 983 other people</a> <span
    class="date">1945-07-18</span> </div>
    </li>

    <li class="post" key="65d66f4197fc7eba5c214fe85ed77725">
    <h4 class="desc"><a href="http://www.snackbar-games.com/
    gbacovers.php" rel="nofollow">Snackbar-Games.com :: GBA DS Cover
    Project</a>
    </h4>
    <div class="commands"> &nbsp;<a save href="/post?url=http%3A%2F
    %2Fwww.snackbar-games.com%2Fgbacovers.php&amp;title=Snackbar-Games.com
    %20%3A%3A%20GBA%20DS%20Cover
    %20Project&amp;copyuser=crowebert&amp;copytags=imported%2BBookmarkMenu
    %2BGameStuff%2Bart%2BGBA%2Bgames
    %2Bnintendo&amp;jump=no&amp;partner=del" class="copy"
    rel="nofollow">save this</a></div> <div class="meta">to <a class="tag"
    href="/crowebert/imported">imported</a> <a class="tag" href="/
    crowebert/BookmarkMenu">BookmarkMenu</a> <a class="tag" href="/
    crowebert/GameStuff">GameStuff</a> <a class="tag" href="/crowebert/
    art">art</a> <a class="tag" href="/crowebert/GBA">GBA</a> <a
    class="tag" href="/crowebert/games">games</a> <a class="tag" href="/
    crowebert/nintendo">nintendo</a> ... <a class="pop" href="/url/
    a65a4a0ebe813ec6e9c881331e3f9583" style="padding: 0 0.2em; background-
    color: rgb(100%, 84%, 84%);">saved by 26 other people</a> <span
    class="date">1948-12-31</span> </div>
    </li>

    <li class="post" key="690ace1f465ae419dee8145ad3871024">
    <h4 class="desc"><a href="http://www.megatokyo.com/"
    rel="nofollow">MegaTokyo</a>
    </h4>
    <div class="commands"> &nbsp;<a save href="/post?url=http%3A%2F
    %2Fwww.megatokyo.com
    %2F&amp;title=MegaTokyo&amp;copyuser=crowebert&amp;copytags=imported
    %2BBookmarkBar%2BWeekendComics%2Bcomics%2Bmanga%2Bhumor
    %2Bwebcomics&amp;jump=no&amp;partner=del" class="copy"
    rel="nofollow">save this</a></div> <div class="meta">to <a class="tag"
    href="/crowebert/imported">imported</a> <a class="tag" href="/
    crowebert/BookmarkBar">BookmarkBar</a> <a class="tag" href="/crowebert/
    WeekendComics">WeekendComics</a> <a class="tag" href="/crowebert/
    comics">comics</a> <a class="tag" href="/crowebert/manga">manga</a> <a
    class="tag" href="/crowebert/humor">humor</a> <a class="tag" href="/
    crowebert/webcomics">webcomics</a> ... <a class="pop" href="/url/
    94843244f0c6d80f1c6806ed5c0abec7" style="padding: 0 0.2em; background-
    color: rgb(100%, 60%, 60%);">saved by 2784 other people</a> <span
    class="date">1946-01-28</span> </div>
    </li>"""

    regexp = re.compile("<li class=\"post\".*?<h4 class=\"desc\"><a href=
    \"(.*?)\" rel=\"nofollow\">(.*?)</a>.*?</div>\s*(?:<p class=\"notes
    \">(.*?)</p>)?.*?<div class=\"meta\">(?:to ((?:<a class=\"tag\".*?)
    +))*.*?<span class=\"date\" title=\"(.*?)\">.*?</span>\s*</div>.*?</
    li>", re.DOTALL)

    re.findall(regexp, s)
  • james.kirin40@gmail.com

    #2
    Re: why does this call to re.findall() loop forever?

    My apologies, given that Google Groups messes up the formatting, the
    regexp should read

    regexp = re.compile("""<li class=\"post\".*?<h4 class=\"desc\"><a href=
    \"(.*?)\" rel=\"nofollow\">(.*?)</a>.*?</div>\s*(?:<p class=\"notes
    \">(.*?)</p>)?.*?<div class=\"meta\">(?:to ((?:<a class=\"tag\".*?)
    +))*.*?<span class=\"date\" title=\"(.*?)\">.*?</span>\s*</div>.*?</
    li>""", re.DOTALL)

    • Terry Reedy

      #3
      Re: why does this call to re.findall() loop forever?

      james.kirin40@gmail.com wrote:
      > Hi everyone,
      >
      > I am using Python's re module to extract some data from HTML. The
      > following code never returns, and I was wondering if someone can
      > explain to me why. Is this a problem with my regexp? (I tried really
      > hard to find it.)
      [snip] html/xml string
      > regexp = re.compile("<li class=\"post\".*?<h4 class=\"desc\"><a href=
      > \"(.*?)\" rel=\"nofollow\">(.*?)</a>.*?</div>\s*(?:<p class=\"notes
      > \">(.*?)</p>)?.*?<div class=\"meta\">(?:to ((?:<a class=\"tag\".*?)
      > +))*.*?<span class=\"date\" title=\"(.*?)\">.*?</span>\s*</div>.*?</
      > li>", re.DOTALL)
      >
      > re.findall(regexp, s)
      Python has several modules for parsing and working with XML. Do you
      not know of them, or is there some reason they won't work?
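
The parser-based route suggested here can be sketched with the standard library alone. What follows is a minimal sketch, assuming Python 3's html.parser (the thread itself uses Python 2.6, where the module is named HTMLParser). The class name PostParser, the field names, and the short sample string are hypothetical, made up for illustration. It reads the date from the text of the <span class="date"> element instead of requiring a title attribute.

```python
from html.parser import HTMLParser


class PostParser(HTMLParser):
    """Collect href, link text and date text for each <li class="post">."""

    def __init__(self):
        super().__init__()
        self.posts = []      # finished records
        self._post = None    # record currently being filled in
        self._field = None   # name of the text field being captured

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        css = attrs.get("class")
        if tag == "li" and css == "post":
            self._post = {"href": None, "title": "", "date": ""}
        elif self._post is None:
            return  # ignore everything outside a post record
        elif (tag == "a" and attrs.get("rel") == "nofollow"
              and self._post["href"] is None):
            # Only the first rel="nofollow" link in a record is the description.
            self._post["href"] = attrs.get("href")
            self._field = "title"
        elif tag == "span" and css == "date":
            self._field = "date"

    def handle_data(self, data):
        if self._post is not None and self._field:
            self._post[self._field] += data

    def handle_endtag(self, tag):
        if tag in ("a", "span"):
            self._field = None
        elif tag == "li" and self._post is not None:
            self._post["title"] = self._post["title"].strip()
            self._post["date"] = self._post["date"].strip()
            self.posts.append(self._post)
            self._post = None


# Hypothetical, shortened record in the style of the ones posted above.
sample = """<li class="post" key="k1">
<h4 class="desc"><a href="http://www.sluggy.com/" rel="nofollow">Sluggy Freelance</a></h4>
<div class="meta">to <a class="tag" href="/crowebert/RSS">RSS</a>
<span class="date">1945-07-18</span></div>
</li>"""

parser = PostParser()
parser.feed(sample)
parser.close()
print(parser.posts)
```

Because the parser handles one tag at a time, a record that lacks an expected piece simply yields an empty field; there is no pattern to retry.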

      • Nick Craig-Wood

        #4
        Re: why does this call to re.findall() loop forever?

        james.kirin40@gmail.com <james.kirin40@gmail.com> wrote:
        > My apologies, given that Google Groups messes up the formatting, the
        > regexp should read
        >
        > regexp = re.compile("""<li class=\"post\".*?<h4 class=\"desc\"><a href=
        > \"(.*?)\" rel=\"nofollow\">(.*?)</a>.*?</div>\s*(?:<p class=\"notes
        > \">(.*?)</p>)?.*?<div class=\"meta\">(?:to ((?:<a class=\"tag\".*?)
        > +))*.*?<span class=\"date\" title=\"(.*?)\">.*?</span>\s*</div>.*?</
        > li>""", re.DOTALL)
        Some regular expressions can't be searched in a reasonable length of
        time. Not sure whether this is your problem, but it might be! Search
        for "exponential time regular expression" if you want some examples.

        E.g. http://bugs.python.org/issue1515829
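
To make the exponential-time behaviour concrete, here is a minimal sketch using a textbook pattern, not the regexp from this thread. The names NESTED, FLAT and time_failed_match are hypothetical. The posted regexp has a similar shape, since (?:to ((?:<a class=\"tag\".*?)+))* nests a lazy .*? inside + inside *, which may be why the failing search takes so long; the thread does not confirm that.

```python
import re
import time

# Nested quantifiers: the inner a+ and the outer (...)+ can split a run of
# "a" characters in exponentially many ways.
NESTED = re.compile(r"(a+)+$")
# Accepts the same strings, but there is only one way to split each run.
FLAT = re.compile(r"a+$")


def time_failed_match(pattern, n):
    """Return the seconds spent failing to match n 'a's followed by 'b'."""
    text = "a" * n + "b"  # the trailing "b" guarantees that the match fails
    start = time.perf_counter()
    result = pattern.match(text)
    elapsed = time.perf_counter() - start
    assert result is None
    return elapsed


if __name__ == "__main__":
    # Each extra character roughly doubles the time for NESTED,
    # while FLAT stays fast.
    for n in (10, 14, 18, 20):
        print(n, time_failed_match(NESTED, n), time_failed_match(FLAT, n))
```

The slowdown only shows up when the match fails, because the engine tries every way of splitting the input before giving up. That is consistent with the original report, where none of the records matches.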

        I'd probably attack this problem using BeautifulSoup rather than
        regexps!
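
Where a regexp has to stay, for instance because no parser library can be installed, one common way to limit backtracking is to cut the page into records first and then use small patterns built on negated character classes. This is a minimal sketch under that assumption, not a drop-in replacement for the posted regexp: the names RECORD, LINK, DATE and extract_posts and the sample string are hypothetical, and only the link and the date text are extracted.

```python
import re

# One record per match. A single lazy .*? has no nested repetition around it.
RECORD = re.compile(r'<li class="post".*?</li>', re.DOTALL)
# [^"]*, [^>]* and [^<]* cannot run past a quote or tag boundary, so each
# piece of the pattern has essentially one way to match.
LINK = re.compile(r'<h4 class="desc">\s*<a href="([^"]*)"[^>]*>([^<]*)</a>')
DATE = re.compile(r'<span class="date"[^>]*>([^<]*)</span>')


def extract_posts(html):
    """Return (href, title, date) tuples for records that contain both parts."""
    posts = []
    for record in RECORD.findall(html):
        link = LINK.search(record)
        date = DATE.search(record)
        if link and date:
            posts.append((link.group(1),
                          link.group(2).strip(),
                          date.group(1).strip()))
    return posts


# Hypothetical, shortened record in the style of the ones posted above.
sample_html = """<li class="post" key="k1">
<h4 class="desc"><a href="http://www.sluggy.com/"
rel="nofollow">Sluggy Freelance</a>
</h4>
<div class="meta">to <a class="tag" href="/crowebert/RSS">RSS</a>
<span class="date">1945-07-18</span> </div>
</li>"""

print(extract_posts(sample_html))
```

Optional parts, such as the notes paragraph or the tag list, can be handled with further small searches on each record instead of optional groups inside one large pattern.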

        --
        Nick Craig-Wood <nick@craig-wood.com> -- http://www.craig-wood.com/nick
