HTMLParser fragility

  • Lawrence D'Oliveiro

    HTMLParser fragility

    I've been using HTMLParser to scrape Web sites. The trouble with this
    is, there's a lot of malformed HTML out there. Real browsers have to be
    written to cope gracefully with this, but HTMLParser does not. Not only
    does it raise an exception, but the parser object then gets into a
    confused state after that so you cannot continue using it.

    The way I'm currently working around this is to do a dummy pre-parsing
    run with a dummy (non-subclassed) HTMLParser object. Every time I hit
    HTMLParseError, I note the line number in a set of lines to skip, then
    create a new HTMLParser object and restart the scan from the beginning,
    skipping all the lines I've noted so far. Only when I get to the end
    without further errors do I do the proper parse with all my appropriate
    actions.
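
A minimal sketch of the skip-and-restart loop described above, written for current Python 3, where HTMLParseError no longer exists. `StrictParseError` and `toy_strict_parse` are hypothetical stand-ins for the exception and the strict parser (neither is part of any library); the only property assumed is that the error carries a 1-based `lineno`. Skipped lines are blanked rather than deleted, a choice made here so that reported line numbers keep matching the original text.

```python
class StrictParseError(Exception):
    """Hypothetical stand-in for HTMLParseError; carries a 1-based line number."""

    def __init__(self, message, lineno):
        super().__init__(message)
        self.lineno = lineno


def toy_strict_parse(text):
    """Hypothetical strict parser: rejects a line whose '<' and '>' counts differ."""
    for lineno, line in enumerate(text.split("\n"), start=1):
        if line.count("<") != line.count(">"):
            raise StrictParseError("unbalanced angle brackets", lineno)


def find_lines_to_skip(text, strict_parse=toy_strict_parse):
    """Repeat a dummy parse, noting each failing line, until a run succeeds."""
    lines = text.split("\n")
    skip = set()
    while True:
        # Blank out noted lines (instead of removing them) so the parser's
        # line numbers still refer to the original text.
        candidate = "\n".join(
            "" if number in skip else line
            for number, line in enumerate(lines, start=1)
        )
        try:
            strict_parse(candidate)  # every attempt restarts from the top
        except StrictParseError as err:
            if err.lineno in skip:
                raise  # the same line failed twice: no progress is possible
            skip.add(err.lineno)
        else:
            return skip, candidate
```

`find_lines_to_skip(page_text)` returns the set of skipped line numbers together with the cleaned text, which can then go through the real parse. Each failure costs a full re-scan, so the cost grows with the number of bad lines.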
  • Rene Pijlman

    #2
    Re: HTMLParser fragility

    Lawrence D'Oliveiro:
    >I've been using HTMLParser to scrape Web sites. The trouble with this
    >is, there's a lot of malformed HTML out there. Real browsers have to be
    >written to cope gracefully with this, but HTMLParser does not.

    There are two solutions to this:

    1. Tidy the source before parsing it.
    http://www.egenix.com/files/python/mxTidy.html

    2. Use something more forgiving, like BeautifulSoup.
    http://www.crummy.com/software/BeautifulSoup/

    --
    René Pijlman


    • Daniel Dittmar

      #3
      Re: HTMLParser fragility

      Lawrence D'Oliveiro wrote:
      > I've been using HTMLParser to scrape Web sites. The trouble with this
      > is, there's a lot of malformed HTML out there. Real browsers have to be
      > written to cope gracefully with this, but HTMLParser does not. Not only
      > does it raise an exception, but the parser object then gets into a
      > confused state after that so you cannot continue using it.
      >
      > The way I'm currently working around this is to do a dummy pre-parsing
      > run with a dummy (non-subclassed) HTMLParser object. Every time I hit
      > HTMLParseError, I note the line number in a set of lines to skip, then
      > create a new HTMLParser object and restart the scan from the beginning,
      > skipping all the lines I've noted so far. Only when I get to the end
      > without further errors do I do the proper parse with all my appropriate
      > actions.

      You could try HTMLTidy (http://www.egenix.com/files/python/mxTidy.html)
      as a first step to get well formed HTML.

      Daniel


      • Richie Hindle

        #4
        Re: HTMLParser fragility


        [Daniel]
        > You could try HTMLTidy (http://www.egenix.com/files/python/mxTidy.html)
        > as a first step to get well formed HTML.

        But Tidy fails on huge numbers of real-world HTML pages. Simple things like
        misspelled tags make it fail:

        >>> from mx.Tidy import tidy
        >>> results = tidy("<html><body><pree>Hello world!</pre></body></html>")
        >>> print results[3]
        line 1 column 7 - Warning: inserting missing 'title' element
        line 1 column 13 - Error: <pree> is not recognized!
        line 1 column 13 - Warning: discarding unexpected <pree>
        line 1 column 31 - Warning: discarding unexpected </pre>
        This document has errors that must be fixed before
        using HTML Tidy to generate a tidied up version.

        Is there a Python HTML tidier which will do as good a job as a browser?

        --
        Richie


        • Walter Dörwald

          #5
          Re: HTMLParser fragility

          Rene Pijlman wrote:
          > Lawrence D'Oliveiro:
          >> I've been using HTMLParser to scrape Web sites. The trouble with this
          >> is, there's a lot of malformed HTML out there. Real browsers have to be
          >> written to cope gracefully with this, but HTMLParser does not.
          >
          > There are two solutions to this:
          >
          > 1. Tidy the source before parsing it.
          > http://www.egenix.com/files/python/mxTidy.html
          >
          > 2. Use something more forgiving, like BeautifulSoup.
          > http://www.crummy.com/software/BeautifulSoup/

          You can also use the HTML parser from libxml2 or any of the available
          wrappers for it.

          Bye,
          Walter Dörwald


          • Paul Boddie

            #6
            Re: HTMLParser fragility

            Richie Hindle wrote:
            >
            > But Tidy fails on huge numbers of real-world HTML pages. Simple things like
            > misspelled tags make it fail:
            >
            > >>> from mx.Tidy import tidy
            > >>> results = tidy("<html><body><pree>Hello world!</pre></body></html>")

            [Various error messages]

            > Is there a Python HTML tidier which will do as good a job as a browser?

            As pointed out elsewhere, libxml2 will attempt to parse HTML if asked
            to:

            >>> import libxml2dom
            >>> d = libxml2dom.parseString("<html><body><pree>Hello world!</pre></body></html>", html=1)
            >>> print d.toString()
            <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
            "http://www.w3.org/TR/REC-html40/loose.dtd">
            <html><body><pree>Hello world!</pree></body></html>

            See how it fixes up the mismatching tags. The libxml2dom package is
            available in the usual place:



            Paul
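
The kind of tag fix-up shown in the libxml2dom output can be imitated in a few lines with the standard library. This is only a sketch of one simple repair strategy (a stack of open tags), not how libxml2 itself works. `TagBalancer` and `balance` are names invented here, and the sketch assumes Python 3.5 or later. It ignores void elements such as `<br>`, comments, doctypes and script content, so it is no substitute for a real tidier.

```python
from html import escape
from html.parser import HTMLParser


class TagBalancer(HTMLParser):
    """Re-emit markup, closing open tags and dropping stray end tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []        # pieces of the repaired document
        self.open_tags = []  # stack of start tags not yet closed

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)
        rendered = "".join(
            f" {name}" if value is None else f' {name}="{escape(value)}"'
            for name, value in attrs
        )
        self.out.append(f"<{tag}{rendered}>")

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray end tag, e.g. </pre> with no <pre>: drop it
        # Close everything opened since the matching start tag, innermost first.
        while self.open_tags:
            top = self.open_tags.pop()
            self.out.append(f"</{top}>")
            if top == tag:
                break

    def handle_data(self, data):
        # Character references were decoded by the parser; re-escape the text.
        self.out.append(escape(data, quote=False))


def balance(markup):
    """Return markup in which every start tag has a matching end tag."""
    parser = TagBalancer()
    parser.feed(markup)
    parser.close()
    while parser.open_tags:  # close whatever is still open at the end
        parser.out.append(f"</{parser.open_tags.pop()}>")
    return "".join(parser.out)
```

For the misspelled-tag input from this thread, `balance` closes `<pree>` where `</body>` arrives and discards the unmatched `</pre>`.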


            • Lawrence D'Oliveiro

              #7
              Re: HTMLParser fragility

              In article <fr7732hslt4p5242nuevd591ldot5rvbmn@4ax.com>,
              Rene Pijlman <reply.in.the.newsgroup@my.address.is.invalid> wrote:

              >2. Use something more forgiving, like BeautifulSoup.
              >http://www.crummy.com/software/BeautifulSoup/

              That sounds like what I'm after!


              • Richie Hindle

                #8
                Re: HTMLParser fragility


                [Richie]
                > But Tidy fails on huge numbers of real-world HTML pages. [...]
                > Is there a Python HTML tidier which will do as good a job as a browser?

                [Walter]
                > You can also use the HTML parser from libxml2

                [Paul]
                > libxml2 will attempt to parse HTML if asked to [...] See how it fixes
                > up the mismatching tags.

                Great! Many thanks.

                --
                Richie Hindle
                richie@entrian.com


                • John J. Lee

                  #9
                  Re: HTMLParser fragility

                  "Lawrence D'Oliveiro" <ldo@geek-central.gen.new_zealand> writes:

                  > I've been using HTMLParser to scrape Web sites. The trouble with this
                  > is, there's a lot of malformed HTML out there. Real browsers have to be
                  > written to cope gracefully with this, but HTMLParser does not. Not only
                  > does it raise an exception, but the parser object then gets into a
                  > confused state after that so you cannot continue using it.
                  [...]

                  sgmllib.SGMLParser (or htmllib.HTMLParser) is more tolerant than
                  HTMLParser.HTMLParser.

                  BeautifulSoup derives from sgmllib.SGMLParser, and introduces extra
                  robustness, of a sort.


                  John
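
The sgmllib and htmllib modules named above were removed in Python 3. As a rough modern counterpart, and assuming Python 3.5 or later (where `html.parser.HTMLParser` no longer raises HTMLParseError), the sketch below logs the events the standard-library parser produces for the misspelled-tag input from this thread. `EventLogger` and `log_events` are hypothetical names used only for illustration.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Record start tags, end tags and text in the order they are seen."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


def log_events(markup):
    """Parse markup and return the list of (kind, value) events."""
    parser = EventLogger()
    parser.feed(markup)
    parser.close()  # flush any text still buffered
    return parser.events


events = log_events("<html><body><pree>Hello world!</pre></body></html>")
for kind, value in events:
    print(kind, value)
```

The parser reports `<pree>` and the unmatched `</pre>` as ordinary start and end events without complaint; deciding what to do about the mismatch is left to the handler code.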
