Extracting info from an HTML page using Perl

  • fox
    New Member
    • Mar 2007
    • 2


    I need to extract patterns from a line in a web page. These patterns sometimes appear twice on the same line, so using grep with the pattern only grabs one of them.

    For example, I need
    <td width="30%" class="navtext" >Sample1</td><td width="7%">20</td>
    <td width="30%" class="navtext" >Sample2</td><td width="7%">18</td>

    extracted from the line of HTML at the bottom of this post.

    The final result should be
    Sample1 20
    Sample2 18
    written to a separate file.
    Is there a way to do this with grep or sed from a Perl script?

    <td width="7%"></td><td width="30%" class="navtext" >Sample1</td><td width="7%">20</td><td width="7%"></td><td width="30%" class="navtext" >Sample2</td><td width="7%">18</td></font>

    Thanks
  • KevinADC
    Recognized Expert Specialist
    • Jan 2007
    • 4092

    #2
    grep returns a list. Let's see the code you have been using.
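
To illustrate the point about grep returning a list, here is a minimal sketch. It assumes the page has already been read into an array; the `@lines` data is a made-up stand-in built from the sample in the first post, not the real tmp52.html. Perl's built-in `grep` returns every matching element, but each element is the whole line, so it cannot by itself pull out the individual names and scores.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the lines of tmp52.html.
my @lines = (
    '<tr><td>header</td></tr>',
    '<td width="30%" class="navtext" >Sample1</td><td width="7%">20</td>'
      . '<td width="30%" class="navtext" >Sample2</td><td width="7%">18</td>',
);

# In list context grep returns every matching element, unchanged.
my @hits = grep { /class="navtext"/ } @lines;

print scalar(@hits), " matching line(s)\n";
print "$_\n" for @hits;    # the whole line, with both samples still inside it
```

Selecting the line is only the first step; the captures still have to be taken out of it with a regex match.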

    • fox
      New Member
      • Mar 2007
      • 2

      #3
      This is the code I use. It finds the line the pattern is on, but then prints everything, not just
      Sample1:20
      Sample2:18

      my $grepline=`grep '<td width=\"30%\" class=\"navtext\" >' tmp52.html`;
      if ($grepline=~/<td width=\"30%\" class=\"navtext\" >(.+)<\/td><td width=\"7%\">(.+)<\/td>/){
      open (SCORE, ">tmp55.html");
      print SCORE "$1:$2\n";
      print"$1:$2\n";
      }
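
A likely reason this prints "everything" is that `(.+)` is greedy: it grabs as much of the line as it can while still letting the rest of the pattern match. The sketch below is only an illustration on the sample line from the first post, held in a string rather than read from tmp52.html, comparing `(.+)` with the non-greedy `(.+?)`. The variable names are made up for the example.

```perl
use strict;
use warnings;

# The sample line from the first post.
my $line = '<td width="7%"></td><td width="30%" class="navtext" >Sample1</td><td width="7%">20</td>'
         . '<td width="7%"></td><td width="30%" class="navtext" >Sample2</td><td width="7%">18</td></font>';

# Greedy: (.+) runs to the last place the rest of the pattern can still match.
my ($greedy_name, $greedy_score) =
    $line =~ m{<td width="30%" class="navtext" >(.+)</td><td width="7%">(.+)</td>};

# Non-greedy: (.+?) stops at the first place the rest of the pattern can match.
my ($lazy_name, $lazy_score) =
    $line =~ m{<td width="30%" class="navtext" >(.+?)</td><td width="7%">(.+?)</td>};

print "greedy: $greedy_name:$greedy_score\n";   # name swallows the markup up to Sample2
print "lazy:   $lazy_name:$lazy_score\n";       # Sample1:20
```

The non-greedy form fixes the first pair, but a single match still returns only one pair per line.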

      • KevinADC
        Recognized Expert Specialist
        • Jan 2007
        • 4092

        #4
        This might work, or it might not:

        Code:
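        # Read the page line by line and write every name/score pair to tmp55.html.
        open (FH, '<', 'tmp52.html') or die "$!";
        open (SCORE, '>', 'tmp55.html') or die "$!";
        while(<FH>){
           # /g in a while loop finds every pair on the line, not just the first;
           # \s* allows for the space before ">" in the sample HTML.
           while (m|<td width="30%" class="navtext"\s*>(.+?)</td><td width="7%">(.+?)</td>|g) {
              print SCORE "$1:$2\n";
           }
        }
        close SCORE;
        close FH;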
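
For checking the pattern without touching any files, here is a small sketch run against the sample line from the first post, held in a string. It assumes the cells always appear in name/score order. In list context, a match with `/g` returns the captures of every match on the line, which are then paired up and printed in the "Sample1 20" form asked for. The names `$cell`, `@fields` and `@pairs` are made up for the example.

```perl
use strict;
use warnings;

# The sample line from the first post.
my $line = '<td width="7%"></td><td width="30%" class="navtext" >Sample1</td><td width="7%">20</td>'
         . '<td width="7%"></td><td width="30%" class="navtext" >Sample2</td><td width="7%">18</td></font>';

# \s* tolerates the stray space before ">" seen in the sample HTML.
my $cell = qr{<td width="30%" class="navtext"\s*>(.+?)</td>\s*<td width="7%">(.+?)</td>};

# In list context, /g returns the captures of every match:
# (name, score, name, score, ...).
my @fields = $line =~ /$cell/g;

# Take the captures two at a time and build one output line per pair.
my @pairs;
while (my ($name, $score) = splice(@fields, 0, 2)) {
    push @pairs, "$name $score";
}

print "$_\n" for @pairs;
```

Regexes are fragile on HTML whose attributes or spacing vary; if the real page differs from the sample, a parser module may be the safer route.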
