HTML Parsing

  • disappearedng@gmail.com

    HTML Parsing

    Hi everyone,
    I am trying to build my own web crawler for an experiment, and I don't
    know how to access the HTTP protocol with Python.

    Also, is there an open-source parsing engine for HTML documents
    available in Python too? That would be great.


  • Dan Stromberg

    #2
    Re: HTML Parsing

    On Sat, 28 Jun 2008 19:03:39 -0700, disappearedng wrote:
    > Hi everyone,
    > I am trying to build my own web crawler for an experiment, and I don't
    > know how to access the HTTP protocol with Python.
    >
    > Also, is there an open-source parsing engine for HTML documents
    > available in Python too? That would be great.

    Check out BeautifulSoup. I don't recall what license it uses, but the
    source is available, and it deals well with
    not-necessarily-beautiful-inside HTML.
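
As a rough illustration of the BeautifulSoup suggestion, here is a minimal sketch of pulling links out of sloppy markup. It assumes the current `bs4` package (`pip install beautifulsoup4`) rather than the 2008-era release, and `SAMPLE_HTML` and `extract_links` are made-up names for this example only.

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

# Deliberately sloppy markup: unclosed <p> and <a> tags, no closing
# </body> or </html>. (Hypothetical sample input.)
SAMPLE_HTML = """
<html><body>
<p>First paragraph
<a href="/about">About
<p>Second paragraph <a href="http://example.com/">Example</a>
"""


def extract_links(html):
    """Return the href of every <a> tag that has one, in document order."""
    # "html.parser" selects the parser bundled with Python, so no
    # extra parser package is needed.
    soup = BeautifulSoup(html, "html.parser")
    return [a["href"] for a in soup.find_all("a", href=True)]
```

Calling `extract_links(SAMPLE_HTML)` should return both link targets even though several tags are never closed, which is the kind of tolerance a crawler needs.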


    • Benjamin

      #3
      Re: HTML Parsing

      On Jun 28, 9:03 pm, disappeare...@gmail.com wrote:
      > Hi everyone,
      > I am trying to build my own web crawler for an experiment, and I don't
      > know how to access the HTTP protocol with Python.

      Look at the httplib module.

      > Also, is there an open-source parsing engine for HTML documents
      > available in Python too? That would be great.
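
For readers on Python 3, where `httplib` was renamed `http.client`, a minimal sketch of a single GET request might look like this. The function name `fetch_page` and the `User-Agent` string are assumptions for illustration, not part of the module.

```python
import http.client


def fetch_page(host, path="/", port=80, timeout=10):
    """Send one GET request and return (status_code, body_bytes)."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        # The User-Agent value is a hypothetical crawler name.
        conn.request("GET", path, headers={"User-Agent": "example-crawler/0.1"})
        response = conn.getresponse()
        return response.status, response.read()
    finally:
        # Always release the socket, even if the request fails.
        conn.close()
```

A call such as `status, body = fetch_page("example.com")` returns the status code and the raw bytes of the page. This sketch does not follow redirects or speak HTTPS; for the latter, `http.client.HTTPSConnection` is the stdlib counterpart.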


      • Stefan Behnel

        #4
        Re: HTML Parsing

        disappearedng@gmail.com wrote:
        > I am trying to build my own web crawler for an experiment, and I don't
        > know how to access the HTTP protocol with Python.
        >
        > Also, is there an open-source parsing engine for HTML documents
        > available in Python too? That would be great.

        Try lxml.html. It parses broken HTML, supports HTTP, is much faster than
        BeautifulSoup and threadable, all of which should be helpful for your crawler.



        Stefan


        • Sebastian \lunar\ Wiesner

          #5
          Re: HTML Parsing

          Stefan Behnel <stefan_ml@behnel.de>:
          > disappearedng@gmail.com wrote:
          >> I am trying to build my own web crawler for an experiment, and I don't
          >> know how to access the HTTP protocol with Python.
          >>
          >> Also, is there an open-source parsing engine for HTML documents
          >> available in Python too? That would be great.
          >
          > Try lxml.html. It parses broken HTML, supports HTTP, is much faster than
          > BeautifulSoup and threadable, all of which should be helpful for your
          > crawler.

          You should mention its powerful features like XPath and CSS selection
          support and its easy API here, too ;)
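
To make the XPath point concrete, here is a small sketch using `lxml.html` (`pip install lxml`). The sample markup and the function names are invented for this example; only `lxml.html.fromstring` and the `xpath` method come from the library.

```python
import lxml.html  # third-party: pip install lxml

# Hypothetical sample page with an unclosed <p> and <a>.
SAMPLE_HTML = """
<html><body>
<div class="nav"><a href="/home">Home</a> <a href="/news">News</a></div>
<p>Unclosed paragraph with <a href="http://example.com/">a link
</body></html>
"""


def links_with_xpath(html):
    """Return every href in the document, in document order."""
    doc = lxml.html.fromstring(html)
    return doc.xpath("//a/@href")


def nav_links_with_xpath(html):
    """Return only the hrefs of links directly inside <div class="nav">."""
    doc = lxml.html.fromstring(html)
    return doc.xpath('//div[@class="nav"]/a/@href')
```

The CSS equivalent of the second query would be `doc.cssselect("div.nav a")`; in recent lxml releases that method relies on the separately installed `cssselect` package, so it is left out of the sketch.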

          --
          Freedom is always the freedom of dissenters.
          (Rosa Luxemburg)


          • Larry Bates

            #6
            Re: HTML Parsing

            disappearedng@gmail.com wrote:
            > Hi everyone,
            > I am trying to build my own web crawler for an experiment, and I don't
            > know how to access the HTTP protocol with Python.
            >
            > Also, is there an open-source parsing engine for HTML documents
            > available in Python too? That would be great.

            Check out Mechanize. It wraps Beautiful Soup inside of methods that aid in
            website crawling.



            -Larry
