Another BeautifulSoup crash on bad HTML

  • John Nagle

    Can't really blame BeautifulSoup for this, but our crawler hit a page
    ("http://clagnut.com/privacy/") with an out-of-range character escape:

    &#82167;

    in this text:

    If you provide a name, email address and/or website and choose ‘Remember
    me&#82167;, these details will be stored as a cookie on your computer.

    The author clearly meant "&#8217;" (’), which is a single close quote.

    The traceback as BeautifulSoup aborts:

    SGMLParser.feed(self, markup or "")
      File "/usr/local/lib/python2.5/sgmllib.py", line 99, in feed
        self.goahead(0)
      File "/usr/local/lib/python2.5/sgmllib.py", line 181, in goahead
        self.handle_charref(name)
      File "/var/www/vhosts/sitetruth.com/cgi-bin/sitetruth/BeautifulSoup.py", line 1250, in handle_charref
        data = unichr(int(ref))
    ValueError: unichr() arg not in range(0x10000) (narrow Python build)

    Another item in our ongoing saga of "What happens when you parse real-world
    HTML".

    A try-block in handle_charref would be appropriate.
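
For illustration only, here is a minimal sketch of what such a try-block might look like. It is not BeautifulSoup's actual code: the function name `charref_to_text` is hypothetical, and falling back to the literal escape text is just one assumed policy (dropping the reference or substituting U+FFFD would be others). On a narrow Python 2 build, `unichr(82167)` raises the ValueError shown in the traceback; on Python 3 only values above 0x10FFFF (or non-numeric references) reach the fallback.

```python
try:
    unichr  # Python 2
except NameError:
    unichr = chr  # Python 3: chr() covers the full Unicode range


def charref_to_text(ref):
    """Convert the digits of a decimal character reference to text.

    Hypothetical helper: if the reference cannot be converted
    (out of range for this build, or not a number at all), return
    the original escape text instead of raising.
    """
    try:
        return unichr(int(ref))
    except (ValueError, OverflowError):
        # Assumed fallback policy: keep the bad reference verbatim.
        return u"&#%s;" % ref
```

With this in place a bad reference passes through as ordinary text and the parse continues, while a valid one such as "8217" still becomes the close quote.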

    John Nagle
    SiteTruth