Extract headlines from a HTML file.

  • Phong Ho

    Extract headlines from a HTML file.

    Hi everyone,

    I am trying to write a simple web crawler. It has to do the following:
    1) Open a URL and retrieve an HTML file.
    2) Extract news headlines from the HTML file.
    3) Put the headlines into an RSS file.
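
Steps 1 and 3 above can be sketched as follows. This is only an illustration in Python, using its standard library, not a PHP solution; the function names `fetch_html` and `build_rss` are hypothetical, and the generated channel description is an assumption made so the feed has the elements RSS 2.0 expects.

```python
import urllib.request
import xml.etree.ElementTree as ET


def fetch_html(url, timeout=10):
    """Step 1: download a page and return it as text (undecodable bytes are replaced)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset (assumption).
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def build_rss(channel_title, channel_link, headlines):
    """Step 3: turn (title, link) pairs into an RSS 2.0 document string."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = channel_title
    ET.SubElement(channel, "link").text = channel_link
    ET.SubElement(channel, "description").text = "Headlines extracted from " + channel_link
    for title, link in headlines:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = title  # ElementTree escapes &, < and >
        ET.SubElement(item, "link").text = link
    return ET.tostring(rss, encoding="unicode")
```

Step 2, finding the headlines, is the hard part, and it depends on how the particular site lays out its pages.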

    For example, I want to go to this site and extract the headlines:
    www.unstrung.com/section.asp?section_id=86

    The problem is that I do not know how to extract a headline from an HTML
    file. HTML is not as strictly structured as XML, so I do not really know
    how to solve this problem. I notice that PHP has URL Functions to deal
    with HTML files. For example, get_meta_tags() extracts meta tag content
    attributes from an HTML file. But extracting meta tags is easy. With
    headlines, I don't really know where they are in an HTML file. Would
    anyone give me input on this?

    This is not an impossible problem. If you look at Google News
    (http://news.google.com/), they crawl the web and sort the headlines
    on their site.

    Thanks,
    P. Ho
  • Andy Hassall

    #2
    Re: Extract headlines from a HTML file.

    On 3 Aug 2004 14:24:11 -0700, peter_ho98@yaho o.com (Phong Ho) wrote:
    >I try to write a simple web crawler. It has to do the following:
    >1) Open an URL and retrieve a HTML file.
    >2) Extract news headlines from the HTML file
    >3) Put the headlines into a RSS file.
    >
    >For example, I want to go to this site and extract the headlines:
    >www.unstrung.com/section.asp?section_id=86
    >
    >The problem is I do not know howto extract a headline from a HTML
    >file.
    >I mean HTML is not structured as XML, so I do not really know to solve
    >this problem. I notice that PHP has URL Functions to deal with HTML
    >file. For example, you have get_meta_tags () to extract meta tag
    >content attributes from a HTML file. But then, extract meta tag is
    >easy. With headlines, I don't really know where the headlines are on
    >a HTML file. Would anyone give me inputs on this?
    >
    >This is not an impossible problem. If you look at Google News
    >(http://news.google.com/), they crawl the web and sort the headlines
    >on their site.

    Whilst entirely possible in PHP, I'd use Perl for this, as there are many
    extremely useful modules for this sort of thing: HTML::TableExtract,
    HTML::Parser, WWW::Mechanize, to name a few.
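
To make the parser-based approach concrete, here is a minimal sketch in Python rather than Perl, using the standard `html.parser` module, which is event-driven and tolerant of sloppy markup in roughly the way HTML::Parser is. The class `LinkCollector`, the function `likely_headlines`, and the rule "a link with at least four words of text is probably a headline" are all assumptions for illustration; a real site would need its own rule.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect (text, href) for every <a href=...> element."""

    def __init__(self, base_url=""):
        super().__init__(convert_charrefs=True)
        self.base_url = base_url
        self.links = []
        self._href = None  # href of the <a> currently open, if any
        self._text = []    # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace inside the link text.
            text = " ".join("".join(self._text).split())
            if text:
                self.links.append((text, self._href))
            self._href = None


def likely_headlines(html, base_url="", min_words=4):
    """Heuristic (assumption): keep links whose text has at least min_words words."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return [(text, href) for text, href in parser.links
            if len(text.split()) >= min_words]
```

Short navigation links such as "Home" are dropped by the word-count rule, while longer link text survives. Restricting the search to one table or one CSS class on the target page would usually be more reliable than a word count.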

    But make sure you're not infringing on their copyright by redistributing the
    extracted headlines.

    --
    Andy Hassall / <andy@andyh.co.uk> / <http://www.andyh.co.uk>
    <http://www.andyhsoftware.co.uk/space> Space: disk usage analysis tool
    (v1.4.0 new 1st Aug 2004)


    • steve

      #3
      Re: Re: Extract headlines from a HTML file.

      "Andy Hassall" wrote:
      > Whilst entirely possible in PHP, I’d use Perl for this, as there are many
      > extremely useful modules for this sort of thing. HTML::TableExtract,
      > HTML::Parser, WWW::Mechanize to name a few.
      >
      > But make sure you’re not infringing on their copyright by redistributing the
      > extracted headlines.

      Or do it the brute-force way, using regular expressions. Just refer
      to the manual for doing that. I don’t think PHP has well-developed
      modules for doing this (vs. Perl), but I could be wrong.

      If you know regular expressions, this is an easy thing to do, by the
      way.
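
As a rough illustration of the regular-expression route, the sketch below pulls link text out of a page with two patterns. It is written in Python, not PHP; the names `ANCHOR_RE`, `TAG_RE`, and `headlines_by_regex` and the four-word threshold are assumptions. Patterns like these are brittle: they will miss or mangle markup they were not written for, so they suit one known page layout rather than HTML in general.

```python
import html
import re

# Matches <a ... href=VALUE ...>INNER</a>; VALUE may be quoted or bare.
ANCHOR_RE = re.compile(
    r'<a\b[^>]*?\bhref\s*=\s*["\']?([^"\'\s>]+)[^>]*>(.*?)</a>',
    re.IGNORECASE | re.DOTALL,
)
# Matches any tag, used to strip markup such as <b> from the link text.
TAG_RE = re.compile(r"<[^>]+>")


def headlines_by_regex(page, min_words=4):
    """Return (text, href) for links with at least min_words words of text."""
    results = []
    for href, inner in ANCHOR_RE.findall(page):
        text = html.unescape(TAG_RE.sub("", inner))
        text = " ".join(text.split())  # collapse whitespace and newlines
        if len(text.split()) >= min_words:
            results.append((text, html.unescape(href)))
    return results
```

The word-count filter is only a stand-in for whatever actually distinguishes headlines on the page being scraped.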



      • peter

        #4
        Re: Extract headlines from a HTML file.

        I do not have experience with Perl, but I would like to ask a question
        about it. I will be using a web hosting company that supports
        Perl (CGI-BIN), but I won't have the root account. Do you need a
        root account to install a Perl module?

        Also, does anyone know of a PHP sample that extracts the headlines
        from an HTML file on the internet?

        Thanks



        steve wrote:
        > Or do it the brute-force way, using regular expressions. Just refer
        > to the manual for doing that. I don’t think PHP has well-developed
        > modules for doing this (vs. Perl), but I could be wrong.
        >
        > If you know regular expressions, this is an easy thing to do, by the
        > way.
