Web page from hell breaks BeautifulSoup, almost

  • John Nagle

    This web page:



    parses OK with BeautifulSoup, but "prettify" will hit the
    recursion limit if you try to display it. I raised the
    recursion limit to a large number, and it was converted
    to 5MB of text successfully, in about a minute.
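The workaround described above can be sketched roughly as follows. This is only an illustration using the standard library: `raised_recursion_limit` and `render` are hypothetical names, and `render` is a toy recursive printer standing in for a recursive pretty-printer such as "prettify". With the real library, the `soup.prettify()` call would presumably go inside the same `with` block.

```python
import sys
from contextlib import contextmanager


@contextmanager
def raised_recursion_limit(limit):
    """Temporarily raise the interpreter's recursion limit, then restore it."""
    old = sys.getrecursionlimit()
    sys.setrecursionlimit(max(old, limit))  # never lower the current limit
    try:
        yield
    finally:
        sys.setrecursionlimit(old)


def render(node, out, depth=0):
    """Toy recursive printer: one stack frame per level of nesting."""
    tag, children = node
    out.append("%d:<%s>" % (depth, tag))
    for child in children:
        render(child, out, depth + 1)


# Build a chain of 2000 nested "font" nodes, deeper than the default limit
# of 1000 frames, the way a run of unclosed tags would nest.
chain = ("font", [])
for _ in range(2000):
    chain = ("font", [chain])

lines = []
with raised_recursion_limit(20000):
    render(chain, lines)
print(len(lines))
```

Restoring the old limit afterwards keeps a runaway recursion elsewhere in a bulk job from being masked by the larger value.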

    The page has real problems. 1901 errors from the W3C validator,
    and that's after forcing an encoding and a doctype. "body" tags
    nested 3 deep. "head" element inside two "body" tags. Tags
    opened with an upper case tag and closed with a lower case tag.
    All "font" tags unclosed. Hundreds of "li" tags outside an
    "ol" or "ul". Yet Firefox is quite happy to display it.
    It looks even better in IE, according to comments on the page.

    The page consists of a long list of classified ads, all with
    unclosed tags. Each unclosed tag ends up containing everything
    that follows it, so the maximum depth is huge.
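To get a feel for how quickly unclosed tags pile up, here is a rough depth estimator built on the standard library's `html.parser`. It is a deliberately naive model and an assumption on my part, not how BeautifulSoup or a browser actually recovers from bad markup: every start tag opens a level, and an end tag closes back to the nearest matching open tag. `DepthMeter` and `max_nesting_depth` are hypothetical names.

```python
from html.parser import HTMLParser

# Elements that never have content, so they never add a nesting level.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class DepthMeter(HTMLParser):
    """Track the deepest stack of open tags seen while parsing."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.max_depth = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <FONT> and </font> match.
        if tag in VOID:
            return
        self.stack.append(tag)
        self.max_depth = max(self.max_depth, len(self.stack))

    def handle_endtag(self, tag):
        # Close back to the nearest matching open tag; ignore stray end tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def max_nesting_depth(html):
    meter = DepthMeter()
    meter.feed(html)
    meter.close()
    return meter.max_depth


# 500 ads, each opening an "li" and a "font" tag and closing neither.
print(max_nesting_depth("<li><font>ad" * 500))
```

Under this model the depth grows by two for every ad, which is enough to exceed Python's default recursion limit of 1000 after a few hundred ads if the tree is walked recursively.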

    Worst HTML I've seen in a while.

    (We use BeautifulSoup to parse hostile web sites in bulk,
    so we tend to discover more hard cases than most users.)

    John Nagle
    SiteTruth