Newbie...parsing from multiple lines.

  • drjay
    New Member
    • Mar 2008
    • 3

    I'm trying to get my script to parse a bunch of files and grab the data between the <title></title> and <blah></blah> tags. Yes, yes, I'm parsing HTML with regex; it works, though. :)

    The issue I have is that sometimes there is one line, sometimes 30 lines, between <title> and <blah>, so I can't just .+ it all the way. Plus, there are multiple <blah> tags in each file. I'm looking for a way to scan the file for <title>, assign it to $1, then search for every instance of <blah> and assign each to $2 and upwards as necessary, then print $1 \t $2 \t $3 etc. to the tab file. Boy, I hope that gibberish made sense, lol. I'm new, so an explanation with hardcore jargon might not be good for me. Here's what I have so far:

    [code=perl]#!/usr/bin/env perl
    #fix.py


    $dir = 'e:\\tmp';
    $outdir = "newfiles";
    $tabfile = "tabdata.txt";


    ### EDIT CAREFULLY BELOW HERE :) ###
    open(TAB, ">$dir\\$outdir\\$tabfile");
    print TAB ("Item Name\tItem Number\tCost\tAdd\tIn All\n");
    open(PARTNUMBER, "$dir\\$outdir\\partnumber.txt");
    while (<PARTNUMBER>) {
        chomp;
        $i = $_;
    }
    close(PARTNUMBER);
    print "Opening $dir\n";
    opendir(DH, $dir);
    while (defined(my $filename = readdir(DH))) {
        if ($filename =~ m/\.htm/) {
            $outfilename = ">$dir\\$outdir\\$filename";
            print "Opening $filename\n";
            open(FHI, $filename);
            while (<FHI>) {
                $html .= $_;
            }
            close(FHI);
            while ($html =~ s/<title>(.+?)<\/title>/$1$2$3$4/)
            {
                print TAB ("$1\t$2\t$3\t$i\n");
                open(PARTNUMBER, ">$dir\\$outdir\\partnumber.txt");
                print PARTNUMBER ($i);
                close(PARTNUMBER);
                print "$i matches found in $filename\n";
                print "Saving to $outfilename\n";
                open(FHO, $outfilename);
                print FHO ($html);
                close(FHO);
            }
        }
        $html = '';
    }
    print "Done\n";
    [/code]

    Thanks in advance!
    Last edited by eWish; Mar 26 '08, 12:59 AM. Reason: Added language to code tags for readability
  • KevinADC
    Recognized Expert Specialist
    • Jan 2007
    • 4092

    #2
    Some sample input and sample output would probably help.


    • eWish
      Recognized Expert Contributor
      • Jul 2007
      • 973

      #3
      Is this the line you are using to capture the data between the title tags?
      [CODE=perl]while ($html =~ s/<title>(.+?)<\/title>/$1$2$3$4/)[/CODE]
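      The reason I ask is that s/// is the substitution operator: it replaces the matched text in $html rather than just capturing it. Also, that pattern has only one capture group, so $2, $3 and $4 will be empty. To capture without changing $html, use the match operator m// instead.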
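
      A minimal sketch of the capture-only approach, assuming the whole file has already been read into one string. The sample HTML and the extract_row helper are hypothetical, made up for illustration; only the <title> and <blah> tag names come from the question. The /s modifier lets . match newlines, so it does not matter whether 1 or 30 lines sit between the tags, and /g in list context returns every <blah> match, however many there are.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical sample standing in for one slurped .htm file.
my $html = "<html><head><title>Widget\nDeluxe</title></head>\n"
         . "<body>\n<blah>PN-100</blah>\n<p>filler</p>\n"
         . "<blah>19.99\nUSD</blah>\n</body></html>\n";

# extract_row is a hypothetical helper: it returns the title followed by
# every <blah> value, with runs of whitespace collapsed to single spaces.
sub extract_row {
    my ($text) = @_;

    # m// only captures and leaves $text untouched.
    # /s lets . match newlines; /i ignores the case of the tag names.
    my ($title) = $text =~ m{<title>(.*?)</title>}si;
    return unless defined $title;

    # With /g in list context, the match returns all captures at once.
    my @blahs = $text =~ m{<blah>(.*?)</blah>}gsi;

    my @row = ($title, @blahs);
    for (@row) {
        s/\s+/ /g;    # collapse newlines and repeated spaces
        s/^ //;       # trim a leading space
        s/ $//;       # trim a trailing space
    }
    return @row;
}

# One tab-separated line per file: title first, then each <blah> value.
my @row = extract_row($html);
print join("\t", @row), "\n";
```

      Because the fields come back as a list, there is no need to know in advance how many of $2, $3 and so on exist; join("\t", ...) handles any count. In the original script the same print could go to the TAB filehandle instead of standard output.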

      --Kevin
