This web page:
parses OK with BeautifulSoup, but "prettify" hits the
recursion limit if you try to display it. After I raised the
recursion limit to a large number, "prettify" succeeded and
produced 5MB of text, in about a minute.
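
A minimal sketch of that workaround, in Python: raise the recursion
limit only around the deep call and restore it afterwards. The names
raised_recursion_limit, render and make_chain are made up for
illustration; render is only a stand-in for a recursive pretty-printer,
not BeautifulSoup's own code.

```python
import sys
from contextlib import contextmanager


@contextmanager
def raised_recursion_limit(limit):
    """Temporarily raise the interpreter's recursion limit (never lowers it)."""
    old = sys.getrecursionlimit()
    sys.setrecursionlimit(max(old, limit))
    try:
        yield
    finally:
        sys.setrecursionlimit(old)


def make_chain(depth):
    """Build a (name, children) tree that is `depth` levels deep, iteratively."""
    node = ("font", [])
    for _ in range(depth - 1):
        node = ("font", [node])
    return node


def render(node, out, depth=0):
    """Hypothetical recursive printer: one Python call per nesting level."""
    name, children = node
    out.append(f"{depth}:{name}")
    for child in children:
        render(child, out, depth + 1)


if __name__ == "__main__":
    base = sys.getrecursionlimit()
    tree = make_chain(base + 500)      # deeper than the current limit allows
    lines = []
    with raised_recursion_limit(base + 2000):
        render(tree, lines)
    print(len(lines), "lines rendered")
```

With BeautifulSoup, the "with" block would wrap the call to prettify()
instead. A very large limit can still overflow the C stack on some
Python builds, so this is a workaround rather than a fix.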
The page has real problems. 1901 errors from the W3C validator,
and that's after forcing an encoding and a doctype. "body" tags
nested 3 deep. A "head" element inside two "body" tags. Tags
opened in upper case and closed in lower case.
All "font" tags unclosed. Hundreds of "li" tags outside any
"ol" or "ul". Yet Firefox is quite happy to display it.
It looks even better in IE, according to comments on the page.
The page consists of a long list of classified ads, all with
unclosed tags, so each ad ends up nested inside the ones before
it and the maximum depth of the parse tree is huge.
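
To illustrate how unclosed tags pile up, here is a rough sketch that
estimates nesting depth using only the standard library's html.parser.
The stack model (push on every open tag, pop back to the matching tag
on a close) is an assumption made for this example, not BeautifulSoup's
actual tree-building rules, and DepthMeter and max_nesting_depth are
made-up names.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag, so they do not add depth.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class DepthMeter(HTMLParser):
    """Track the deepest stack of open tags seen while parsing."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.max_depth = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <FONT> and </font> match.
        if tag in VOID:
            return
        self.stack.append(tag)
        self.max_depth = max(self.max_depth, len(self.stack))

    def handle_endtag(self, tag):
        # Ignore stray closing tags; otherwise pop back to the match.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def max_nesting_depth(html):
    meter = DepthMeter()
    meter.feed(html)
    meter.close()
    return meter.max_depth


if __name__ == "__main__":
    ads = "<ul>" + "<li><FONT size=2>ad" * 100 + "</ul>"
    print(max_nesting_depth(ads))
```

Under this model, 100 ads with unclosed "li" and "font" tags give a
depth of 201, while the same ads with every tag closed give a depth
of 3.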
Worst HTML I've seen in a while.
(We use BeautifulSoup to parse hostile web sites in bulk,
so we tend to discover more hard cases than most users.)
John Nagle
SiteTruth