HTMLParser fragility

**Rene Pijlman** · Apr 5 '06, 10:55 AM

Re: HTMLParser fragility

Lawrence D'Oliveiro:
>I've been using HTMLParser to scrape Web sites. The trouble with this
>is, there's a lot of malformed HTML out there. Real browsers have to be
>written to cope gracefully with this, but HTMLParser does not.

There are two solutions to this:

1. Tidy the source before parsing it.

eGenix.com: mxTidy - HTML Tidy for Python

http://www.egenix.com/files/python/mxTidy.html

Cleanup your HTML files, convert even broken HTML into validating XHTML, prepare web scraping input for XML processing. All this using a single function and implemented in a thread-safe and scalable way.

2. Use something more forgiving, like BeautifulSoup.

Beautiful Soup: We called him Tortoise because he taught us.

http://www.crummy.com/software/BeautifulSoup/

--
René Pijlman

**Daniel Dittmar** · Apr 5 '06, 11:05 AM

Re: HTMLParser fragility

Lawrence D'Oliveiro wrote:
> I've been using HTMLParser to scrape Web sites. The trouble with this
> is, there's a lot of malformed HTML out there. Real browsers have to be
> written to cope gracefully with this, but HTMLParser does not. Not only
> does it raise an exception, but the parser object then gets into a
> confused state after that so you cannot continue using it.
>
> The way I'm currently working around this is to do a dummy pre-parsing
> run with a dummy (non-subclassed) HTMLParser object. Every time I hit
> HTMLParseError, I note the line number in a set of lines to skip, then
> create a new HTMLParser object and restart the scan from the beginning,
> skipping all the lines I've noted so far. Only when I get to the end
> without further errors do I do the proper parse with all my appropriate
> actions.

You could try HTMLTidy (http://www.egenix.com/files/python/mxTidy.html)
as a first step to get well-formed HTML.

Daniel
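
The pre-parsing workaround quoted above can be sketched roughly as follows. This is only an illustration under stated assumptions, not Lawrence's actual code: `StrictParser`, `ParseError`, `KNOWN_TAGS` and `find_bad_lines` are hypothetical names, and the strictness is simulated, because the sketch is written against Python 3's `html.parser`, which in current releases does not raise on bad markup.

```python
from html.parser import HTMLParser

# Hypothetical whitelist, used only to simulate a parser that rejects bad markup.
KNOWN_TAGS = {"html", "head", "title", "body", "p", "pre"}


class ParseError(Exception):
    """Hypothetical stand-in for the old HTMLParseError; carries a line number."""

    def __init__(self, message, lineno):
        super().__init__(message)
        self.lineno = lineno


class StrictParser(HTMLParser):
    """Hypothetical strict parser: raises on any start tag outside KNOWN_TAGS."""

    def handle_starttag(self, tag, attrs):
        if tag not in KNOWN_TAGS:
            lineno, _offset = self.getpos()  # 1-based line of the offending tag
            raise ParseError("unknown tag <%s>" % tag, lineno)


def find_bad_lines(source):
    """Return the set of 1-based line numbers that must be skipped to parse cleanly."""
    lines = source.splitlines(keepends=True)
    skip = set()
    while True:
        parser = StrictParser()  # fresh parser per pass; the old one is discarded
        # Blank out skipped lines but keep a newline so line numbers stay stable.
        text = "".join(
            "\n" if number in skip else line
            for number, line in enumerate(lines, 1)
        )
        try:
            parser.feed(text)
            parser.close()
            return skip
        except ParseError as err:
            skip.add(err.lineno)  # note the line and restart from the beginning


if __name__ == "__main__":
    page = "<html>\n<body>\n<pree>Hello</pre>\n<p>ok</p>\n</body></html>\n"
    print(sorted(find_bad_lines(page)))
```

Once the set of bad lines is known, the real parse (with the subclass that does the actual work) would be run once over the same blanked-out text. Each pass adds at least one new line to the skip set, so the loop terminates.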

**Richie Hindle** · Apr 5 '06, 01:25 PM

Re: HTMLParser fragility

[Daniel]
> You could try HTMLTidy (http://www.egenix.com/files/python/mxTidy.html)
> as a first step to get well-formed HTML.

But Tidy fails on huge numbers of real-world HTML pages. Simple things like
misspelled tags make it fail:
>>> from mx.Tidy import tidy
>>> results = tidy("<html><body><pree>Hello world!</pre></body></html>")
>>> print results[3]
line 1 column 7 - Warning: inserting missing 'title' element
line 1 column 13 - Error: <pree> is not recognized!
line 1 column 13 - Warning: discarding unexpected <pree>
line 1 column 31 - Warning: discarding unexpected </pre>
This document has errors that must be fixed before
using HTML Tidy to generate a tidied up version.

Is there a Python HTML tidier which will do as good a job as a browser?

--
Richie

**Walter Dörwald** · Apr 6 '06, 02:05 PM

Re: HTMLParser fragility

Rene Pijlman wrote:
> Lawrence D'Oliveiro:
>> I've been using HTMLParser to scrape Web sites. The trouble with this
>> is, there's a lot of malformed HTML out there. Real browsers have to be
>> written to cope gracefully with this, but HTMLParser does not.
>
> There are two solutions to this:
>
> 1. Tidy the source before parsing it.
> http://www.egenix.com/files/python/mxTidy.html
>
> 2. Use something more forgiving, like BeautifulSoup.
> http://www.crummy.com/software/BeautifulSoup/

You can also use the HTML parser from libxml2 or any of the available
wrappers for it.

Bye,
Walter Dörwald

**Paul Boddie** · Apr 6 '06, 06:25 PM

Re: HTMLParser fragility

Richie Hindle wrote:
>
> But Tidy fails on huge numbers of real-world HTML pages. Simple things like
> misspelled tags make it fail:
>
> >>> from mx.Tidy import tidy
> >>> results = tidy("<html><body><pree>Hello world!</pre></body></html>")

[Various error messages]

> Is there a Python HTML tidier which will do as good a job as a browser?

As pointed out elsewhere, libxml2 will attempt to parse HTML if asked
to:
>>> import libxml2dom
>>> d = libxml2dom.parseString("<html><body><pree>Hello world!</pre></body></html>", html=1)
>>> print d.toString()
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
"http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><pree>Hello world!</pree></body></html>

See how it fixes up the mismatching tags. The libxml2dom package is
available in the usual place:

http://www.python.org/pypi/libxml2dom

Paul

**Lawrence D'Oliveiro** · Apr 7 '06, 05:25 AM

Re: HTMLParser fragility

In article <fr7732hslt4p5242nuevd591ldot5rvbmn@4ax.com>,
Rene Pijlman <reply.in.the.newsgroup@my.address.is.invalid> wrote:

>2. Use something more forgiving, like BeautifulSoup.
>http://www.crummy.com/software/BeautifulSoup/

That sounds like what I'm after!

**Richie Hindle** · Apr 7 '06, 08:55 AM

Re: HTMLParser fragility

[Richie]
> But Tidy fails on huge numbers of real-world HTML pages. [...]
> Is there a Python HTML tidier which will do as good a job as a browser?

[Walter]
> You can also use the HTML parser from libxml2

[Paul]
> libxml2 will attempt to parse HTML if asked to [...] See how it fixes
> up the mismatching tags.

Great! Many thanks.

--
Richie Hindle
richie@entrian.com

**John J. Lee** · Apr 10 '06, 07:55 PM

Re: HTMLParser fragility

"Lawrence D'Oliveiro" <ldo@geek-central.gen.new_zealand> writes:

> I've been using HTMLParser to scrape Web sites. The trouble with this
> is, there's a lot of malformed HTML out there. Real browsers have to be
> written to cope gracefully with this, but HTMLParser does not. Not only
> does it raise an exception, but the parser object then gets into a
> confused state after that so you cannot continue using it.
[...]

sgmllib.SGMLParser (or htmllib.HTMLParser) is more tolerant than
HTMLParser.HTMLParser.

BeautifulSoup derives from sgmllib.SGMLParser, and introduces extra
robustness, of a sort.

John
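
For anyone reading this thread with a current Python 3: `sgmllib` and `htmllib` are no longer in the standard library, and the standard `html.parser.HTMLParser` has since become lenient and no longer raises on malformed markup. The sketch below feeds it the misspelled-tag sample from earlier in the thread. `TagLogger` is a hypothetical name used only for illustration.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Hypothetical helper that records parse events instead of acting on them."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


parser = TagLogger()
# Same malformed sample as above: <pree> is opened but </pre> is closed.
parser.feed("<html><body><pree>Hello world!</pre></body></html>")
parser.close()

for event in parser.events:
    print(event)
```

Note that the parser only reports the mismatched `<pree>` and `</pre>` tags as it finds them. It does not repair the document the way the libxml2 HTML parser or BeautifulSoup does, so those remain the options if a fixed-up tree is needed.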
