Good HTML Parser

  • Chris

    Good HTML Parser

    Can anyone recommend a good HTML/XHTML parser, similar to
    HTMLParser.HTMLParser or htmllib.HTMLParser, but able to intelligently
    know that certain tags, like <br>, are implicitly closed? I need to
    iterate through the entire DOM, building up a DOM path, but the stdlib
    parsers aren't calling handle_endtag() for any implicitly closed tags.
    I looked at BeautifulSoup, but it only seems to work by first parsing
    the entire document, then allowing you to query the document
    afterwards. I need something like a SAX parser.
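
    For illustration, here is a minimal sketch of the behaviour being asked
    for, limited to void elements such as <br>. It assumes Python 3, where
    the parser lives in html.parser; PathParser, VOID_TAGS and the "a/b/c"
    path format are hypothetical names chosen for this example. It does not
    handle other implicit closes (for example a <p> ended by the next <p>),
    which need more context.

```python
from html.parser import HTMLParser

# Void elements never take an end tag (list taken from the HTML standard).
VOID_TAGS = frozenset({
    "area", "base", "br", "col", "embed", "hr", "img", "input",
    "link", "meta", "source", "track", "wbr",
})


class PathParser(HTMLParser):
    """Hypothetical helper: records (event, path) pairs while parsing."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.events = []  # collected ("start" | "end", "a/b/c") tuples

    def _emit(self, kind):
        self.events.append((kind, "/".join(self.stack)))

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        self._emit("start")
        if tag in VOID_TAGS:
            # Synthesize the end event the stdlib parser never sends.
            self._emit("end")
            self.stack.pop()

    def handle_endtag(self, tag):
        # Void tags were closed already (this also covers "<br/>", where the
        # parser calls handle_starttag followed by handle_endtag).
        if tag in VOID_TAGS or tag not in self.stack:
            return
        # Close any still-open descendants, then the tag itself.
        while True:
            closed = self.stack[-1]
            self._emit("end")
            self.stack.pop()
            if closed == tag:
                break


parser = PathParser()
parser.feed("<html><body><p>Hi<br>there</p></body></html>")
parser.close()
for event in parser.events:
    print(event)
```

    Each consumer sees a matching "end" for every "start", so the stack
    always reflects the current DOM path.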
  • Diez B. Roggisch

    #2
    Re: Good HTML Parser

    Chris wrote:
    Can anyone recommend a good HTML/XHTML parser, similar to
    HTMLParser.HTMLParser or htmllib.HTMLParser, but able to intelligently
    know that certain tags, like <br>, are implicitly closed? I need to
    iterate through the entire DOM, building up a DOM path, but the stdlib
    parsers aren't calling handle_endtag() for any implicitly closed tags.
    I looked at BeautifulSoup, but it only seems to work by first parsing
    the entire document, then allowing you to query the document
    afterwards. I need something like a SAX parser.

    This isn't possible. Your own example of arbitrarily closeable tags needs
    context that a purely SAX-like parser can't provide.

    I suggest you use BeautifulSoup and, if you must, build your own event
    generation around it that you can attach consumers to.
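
    A rough sketch of what such event generation could look like, assuming
    the current bs4 package and its "html.parser" tree builder. iter_events
    and the path format are hypothetical, made up for this example; the
    document is still parsed completely before any event is produced.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag


def iter_events(node, path=()):
    """Yield ("start" | "end", "a/b/c") pairs for every tag below node."""
    for child in node.children:
        if not isinstance(child, Tag):
            continue  # skip text nodes and comments
        child_path = path + (child.name,)
        yield ("start", "/".join(child_path))
        # Recurse into the children before emitting the end event.
        yield from iter_events(child, child_path)
        yield ("end", "/".join(child_path))


soup = BeautifulSoup("<div><p>Hi<br>there</p></div>", "html.parser")
events = list(iter_events(soup))
for event in events:
    print(event)
```

    Consumers can then iterate over the generator much as they would react
    to SAX callbacks.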

    Diez

    • Stefan Behnel

      #3
      Re: Good HTML Parser

      Chris wrote:
      Can anyone recommend a good HTML/XHTML parser, similar to
      HTMLParser.HTMLParser or htmllib.HTMLParser, but able to intelligently
      know that certain tags, like <br>, are implicitly closed? I need to
      iterate through the entire DOM, building up a DOM path, but the stdlib
      parsers aren't calling handle_endtag() for any implicitly closed tags.
      I looked at BeautifulSoup, but it only seems to work by first parsing
      the entire document, then allowing you to query the document
      afterwards. I need something like a SAX parser.
      Try lxml.html. It's very memory friendly and extremely fast, so you may end up
      without any reason to use SAX anymore.
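
      As a hedged illustration, assuming lxml is installed: lxml.html builds
      the tree, and lxml.etree.iterwalk replays it as start/end events, with
      <br> closed like any other element. The sample markup and the
      tag_events name are made up for this sketch.

```python
import lxml.html
from lxml import etree

# Parse a small fragment; fromstring() returns the <div> element.
root = lxml.html.fromstring("<div><p>Hi<br>there</p></div>")

# iterwalk() yields (event, element) pairs in document order.
tag_events = [
    (event, element.tag)
    for event, element in etree.iterwalk(root, events=("start", "end"))
]
for item in tag_events:
    print(item)
```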



      Stefan
