Looking for a decent HTML parser for Python...

  • Just Another Victim of the Ambient Morality

    Looking for a decent HTML parser for Python...

    I'm trying to parse HTML in a very generic way.
    So far, I'm using SGMLParser in the sgmllib module. The problem is that
    it forces you to parse very specific tags through object methods like
    start_a(), start_p() and the like, forcing you to know exactly which tags
    you want to handle. I want to be able to handle the start tags of any and
    all tags, like how one would do in the Xerces C++ XML parser. In other
    words, I would like a simple start() method that is called whenever any tag
    is encountered. How may I do this?
    Thank you...



  • Just Another Victim of the Ambient Morality

    #2
    Re: Looking for a decent HTML parser for Python...


    "Just Another Victim of the Ambient Morality" <ihatespam@hotmail.com> wrote
    in message news:qKqdh.303031$tl2.45967@fe10.news.easynews.com...
    > I'm trying to parse HTML in a very generic way.
    > So far, I'm using SGMLParser in the sgmllib module. The problem is
    > that it forces you to parse very specific tags through object methods like
    > start_a(), start_p() and the like, forcing you to know exactly which tags
    > you want to handle. I want to be able to handle the start tags of any and
    > all tags, like how one would do in the Xerces C++ XML parser. In other
    > words, I would like a simple start() method that is called whenever any
    > tag is encountered. How may I do this?
    > Thank you...
    Okay, I think I found what I'm looking for in HTMLParser in the
    HTMLParser module.
    Thanks...
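
For readers on Python 3, where the HTMLParser module was renamed to html.parser and sgmllib was removed, the single catch-all handler asked about above might look like the following sketch. The class name AnyTagParser, the helper collect_start_tags, and the sample markup are made up for illustration; handle_starttag is the standard hook that HTMLParser calls for every start tag.

```python
from html.parser import HTMLParser


class AnyTagParser(HTMLParser):
    """Records every start tag, whatever its name (hypothetical helper class)."""

    def __init__(self):
        super().__init__()
        self.seen = []  # (tag, attrs-dict) pairs in document order

    def handle_starttag(self, tag, attrs):
        # Called once per start tag; attrs is a list of (name, value) pairs.
        self.seen.append((tag, dict(attrs)))


def collect_start_tags(markup):
    """Parse markup and return the recorded (tag, attrs) pairs."""
    parser = AnyTagParser()
    parser.feed(markup)
    parser.close()
    return parser.seen


if __name__ == "__main__":
    sample = '<html><body><p class="intro">Hi <a href="/next">next</a></p></body></html>'
    for tag, attrs in collect_start_tags(sample):
        print(tag, attrs)
```

No per-tag methods such as start_a() or start_p() are needed here: the one handler receives the tag name as an argument and can dispatch on it however it likes.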




    • Just Another Victim of the Ambient Morality

      #3
      Re: Looking for a decent HTML parser for Python...


      "Just Another Victim of the Ambient Morality" <ihatespam@hotmail.com> wrote
      in message news:Gordh.303466$tl2.18227@fe10.news.easynews.com...
      > Okay, I think I found what I'm looking for in HTMLParser in the
      > HTMLParser module.
      Except it appears to be buggy or, at least, not very robust. There are
      websites for which it falsely terminates early in the parsing. I have a
      sneaking feeling the sgml parser will be more robust, if only it had that
      one feature I am looking for.
      Can someone help me out here?
      Thank you...




      • Fredrik Lundh

        #4
        Re: Looking for a decent HTML parser for Python...

        > Except it appears to be buggy or, at least, not very robust. There are
        > websites for which it falsely terminates early in the parsing.
        which probably means that the sites are broken. the amount of broken
        HTML on the net is staggering, as is the amount of code in a typical web
        browser for dealing with all that crap. for a more tolerant parser, see:



        </F>
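
As a point of comparison, the parser that ships with current Python 3 (html.parser) tends to keep going through common kinds of broken markup, such as unclosed tags and unquoted attribute values, rather than stopping early. The following is only a small sketch of that behaviour, not the tolerant parser recommended in this post; TagLogger, scan, and the sample string are hypothetical names chosen for the example.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Logs start tags, end tags, and text so the three can be compared."""

    def __init__(self):
        super().__init__()
        self.opened = []
        self.closed = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.opened.append(tag)

    def handle_endtag(self, tag):
        self.closed.append(tag)

    def handle_data(self, data):
        self.text.append(data)


def scan(markup):
    """Feed markup to a fresh TagLogger and return it."""
    logger = TagLogger()
    logger.feed(markup)
    logger.close()  # flush any text still buffered at end of input
    return logger


if __name__ == "__main__":
    # Deliberately sloppy input: unclosed <p>, <b>, <a>, and an unquoted href.
    broken = "<p>first<p>second <b>bold <a href=page.html>link</p>"
    result = scan(broken)
    print("opened:", result.opened)
    print("closed:", result.closed)
    print("text:  ", "".join(result.text))
```

The parser reports tags as it meets them and does not try to repair the tree, so matching up open and close tags is left to the caller.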


        • Stephen Eilert

          #5
          Re: Looking for a decent HTML parser for Python...


          Fredrik Lundh wrote:
          >> Except it appears to be buggy or, at least, not very robust. There are
          >> websites for which it falsely terminates early in the parsing.
          >
          > which probably means that the sites are broken. the amount of broken
          > HTML on the net is staggering, as is the amount of code in a typical web
          > browser for dealing with all that crap. for a more tolerant parser, see:
          >
          > </F>

          +1 for BeautifulSoup.

          The documentation is quite brief and sometimes confusing, but I've
          found it the easiest parser I've ever worked with.


          Stephen


          • hubritic

            #6
            Re: Looking for a decent HTML parser for Python...

            Agreed that the web sites are probably broken. Try running the HTML
            through HTMLTidy (http://tidy.sourceforge.net/). Doing that has allowed
            me to parse pages where I had problems such as yours.

            I have also had luck with BeautifulSoup, which also includes a tidy
            function in it.



            Just Another Victim of the Ambient Morality wrote:
            > "Just Another Victim of the Ambient Morality" <ihatespam@hotmail.com> wrote
            > in message news:Gordh.303466$tl2.18227@fe10.news.easynews.com...
            >
            >> Okay, I think I found what I'm looking for in HTMLParser in the
            >> HTMLParser module.
            >
            > Except it appears to be buggy or, at least, not very robust. There are
            > websites for which it falsely terminates early in the parsing. I have a
            > sneaking feeling the sgml parser will be more robust, if only it had that
            > one feature I am looking for.
            > Can someone help me out here?
            > Thank you...
