using HTML::Parser

  • Divya Rao

    using HTML::Parser

    Hi,
    I need to parse an HTML file, and extract all the text in it (not the
    images, tags). I cannot figure out how to do it. I have the HTML file
    saved in my local directory. I need to have the text printed/saved in
    my local directory. I would really appreciate any help in this regard.

    Thanks,
    Divya Rao
  • Jürgen Exner

    #2
    Re: using HTML::Parser

    Divya Rao wrote:
    > I need to parse an HTML file, and extract all the text in it (not the
    > images, tags). I cannot figure out how to do it. I have the HTML file
    > saved in my local directory. I need to have the text printed/saved in
    > my local directory. I would really appreciate any help in this regard.

    HTML::Parser comes with one example application that does exactly that.
    Unfortunately the examples are not included in the standard Perl
    installation, so you will have to download the module and unpack it manually
    to find the example programs.

    jue



    • Joe Smith

      #3
      Re: using HTML::Parser

      Divya Rao wrote:
      > Hi,
      > I need to parse an HTML file, and extract all the text in it (not the
      > images, tags). I cannot figure out how to do it. I have the HTML file
      > saved in my local directory. I need to have the text printed/saved in
      > my local directory. I would really appreciate any help in this regard.

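      unix% cat /usr/local/bin/nohtml
      #!/usr/bin/perl -w
      # Name: nohtml Author: Joe.Smith@inwap.com 07-Nov-2001
      # Purpose: Extracts just the text portions of a document.

      use strict;
      use HTML::Parser ();

      sub text_handler { # Ordinary text, entities already decoded ("dtext")
          print @_;
      }

      my $p = HTML::Parser->new(api_version => 3);
      $p->handler( text => \&text_handler, "dtext");

      # Read the named file, or STDIN when no name (or "-") is given.
      # Passing the handle avoids relying on open() treating "-" as STDIN.
      my $file = shift;
      $file = \*STDIN if !defined $file || $file eq "-";
      $p->parse_file($file) || die $!;

      1;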

      unix% cat /usr/local/bin/nh
      #!/bin/sh
      PATH=$PATH:/usr/local/bin; export PATH
      nohtml - | less -s

      Usage: while reading e-mail, pipe the message into '|nh'.
      -Joe
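
Since the original question asks for the text to be saved to a file, here is a minimal sketch along the same lines as nohtml that writes its output to a second file instead of STDOUT. It is only an illustration, not part of Joe's script: the helper name html_to_text and the two command-line arguments are made up for this example, the input is assumed to be UTF-8, and script and style elements are dropped with ignore_elements so their contents do not end up in the output.

```perl
#!/usr/bin/perl
# Sketch only: html_to_text and the command-line layout are hypothetical,
# chosen for this example; they do not come from the thread.
use strict;
use warnings;
use HTML::Parser ();

# Return the text of an HTML string with tags removed and entities decoded.
sub html_to_text {
    my ($html) = @_;
    my $text = '';
    my $p = HTML::Parser->new(
        api_version => 3,
        # Append each decoded text chunk to $text.
        text_h => [ sub { $text .= shift }, 'dtext' ],
    );
    # Suppress <script> and <style> elements together with their contents.
    $p->ignore_elements(qw(script style));
    $p->parse($html);
    $p->eof;    # signal end of document so buffered text is flushed
    return $text;
}

# Usage: html2txt input.html output.txt   (both names are placeholders)
my ($in, $out) = @ARGV;
if (defined $in && defined $out) {
    # Assumption: the HTML file is UTF-8 encoded.
    open my $ifh, '<:encoding(UTF-8)', $in or die "Cannot read $in: $!";
    my $html = do { local $/; <$ifh> };    # slurp the whole file
    close $ifh;

    open my $ofh, '>:encoding(UTF-8)', $out or die "Cannot write $out: $!";
    print {$ofh} html_to_text($html);
    close $ofh or die "Cannot close $out: $!";
}
```

Collecting the text in a string rather than printing it from the handler keeps the parsing step separate from the file handling, which makes it easier to post-process the text (for example, collapsing blank lines) before saving it.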
