Which HTMLParser?

  • Tuang

    Which HTMLParser?

    The library docs show that there is an HTMLParser module and an
    htmllib module, both of which apparently contain classes named
    "HTMLParser". There is a bit of description of the differences, but it
    still doesn't seem clear to me what the intent is.

    Which one is the best choice for parsing arbitrary real-life Web
    pages? I get the feeling that maybe the HTMLParser module is the more
    recent, more practical utility, while the htmllib version is the older
    one, retained for backward compatibility, but I'm not sure. The docs
    don't exactly say that.

    Any recommendations or clarifications of what's going on would be
    helpful.

    Thanks.
  • Jarek Zgoda

    #2
    Re: Which HTMLParser?

    Tuang <tuanglen@hotmail.com> wrote:

    > Which one is the best choice for parsing arbitrary real-life Web
    > pages? I get the feeling that maybe the HTMLParser module is the more
    > recent, more practical utility, while the htmllib version is the older
    > one, retained for backward compatibility, but I'm not sure. The docs
    > don't exactly say that.
    >
    > Any recommendations or clarifications of what's going on would be
    > helpful.

    If you are not sure that your source is valid HTML, use an SGML parser
    instead. Personally I recommend F. Lundh's sgmlop: a fast, robust and
    well-written piece of software, a real Meisterstueck. It works perfectly
    on Unix, Windows and IBM iSeries (formerly AS/400).

    --
    Jarek Zgoda
    Unregistered Linux User # -1
    http://www.zgoda.biz/ JID:zgoda@chrome.pl http://zgoda.jogger.pl/


    • Rene Pijlman

      #3
      Re: Which HTMLParser?

      Tuang:
      >The library docs show that there is an HTMLParser module and an
      >htmllib module, both of which apparently contain classes named
      >"HTMLParser". There is a bit of description of differences, but it
      >still doesn't seem clear to me what the intent is.

      I think the intent is to use HTMLParser. It's newer, and its documentation
      doesn't scare you off with phrases like "HTML 2.0" and "SGML" :-)

      >Which one is the best choice for parsing arbitrary real-life Web pages?

      Neither! Real-life web pages are typically not HTML-parseable. Try tidying
      them up a bit first. See http://groups.google.nl/groups?th=58cd394d2e71137f
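For anyone who wants to see what the HTMLParser approach looks like in practice, here is a minimal sketch. It assumes Python 3, where the `HTMLParser` module discussed in this thread is available as `html.parser` (and `htmllib` no longer exists). The names `LinkCollector` and `collect_links`, and the sample markup, are made up for illustration. It only shows the subclass-and-override pattern; it says nothing about how well any given real-life page will parse.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags (hypothetical helper, for illustration only)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # The parser passes tag and attribute names in lowercase;
        # attrs is a list of (name, value) pairs.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def collect_links(markup):
    """Feed markup to a fresh parser and return the hrefs it saw, in order."""
    parser = LinkCollector()
    parser.feed(markup)
    parser.close()
    return parser.links


# Made-up sample: unclosed <p> and <a> tags, an unquoted attribute value,
# and upper-case tag and attribute names.
sample = '<p>See <a href=page1.html>one<p><A HREF="/b">two</a>'
print(collect_links(sample))
```

The parser reports events (start tags, end tags, text) as it goes and builds no tree, so unclosed tags like the ones in the sample are simply passed through. If the markup is badly broken, running it through a tidying step first, as suggested above, is still the safer route.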

      --
      René Pijlman
