I'm trying to get my script to parse a bunch of files and grab data between the <title></> and <blah></> tags. Yes yes, I'm parsing html with regex, it works though. :)
The issue I have is that sometimes there is one line, sometimes 30 lines, between <title> and <blah>, so I can't just .+ it all the way. Plus there are multiple <blah> tags in each file. I'm looking for a way to scan the file for <title>, assign it to $1, then search for every instance of <blah> and assign those to $2 and upwards as necessary, then print $1 \t $2 \t $3 etc. to the tab file. Boy, I hope that gibberish made sense lol. I'm new, so an explanation with hardcore jargon might not be good for me. Here's what I have so far:
[code=perl]#!/usr/bin/env perl
#fix.py
$dir = 'e:\\tmp';
$outdir = "newfiles";
$tabfile = "tabdata.txt";
### EDIT CAREFULLY BELOW HERE :) ###
open(TAB, ">$dir\\$outdir\\$tabfile");
print TAB ("Item Name\tItem Number\tCost\tAdd\tIn All\n");
open(PARTNUMBER, "$dir\\$outdir\\partnumber.txt");
while (<PARTNUMBER>) {
    chomp;
    $i = $_;
}
close(PARTNUMBER);
print "Opening $dir\n";
opendir(DH,$dir);
while (defined ( my $filename = readdir(DH))) {
    if ($filename =~ m/\.htm/ ) {
        $outfilename=">$dir\\$outdir\\$filename";
        print "Opening $filename\n";
        open(FHI,$filename);
        while (<FHI>) {
            $html .= $_;
        }
        close(FHI);
        while ($html =~ s/<title>(.+?)<\/title>/$1$2$3$4/)
        {
            print TAB ("$1\t$2\t$3\t$i\n");
            open (PARTNUMBER, ">$dir\\$outdir\\partnumber.txt");
            print PARTNUMBER ($i);
            close(PARTNUMBER);
            print "$i matches foung in $filename\n";
            print "Saving to $outfilename\n";
            open(FHO, $outfilename);
            print FHO ($html);
            close(FHO);
        }
    }
    $html = '';
}
print "Done\n";
[/code]
Thanks in advance!
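For reference, here is a minimal sketch of the approach described above: grab the <title> once, then collect every <blah> in the file, however many lines each one spans. It is only an illustration under a few assumptions: the tags carry no attributes, the sub name extract_fields and the sample HTML are made up for the example, and the whole file has already been read into one string (as the $html .= $_ loop does). The two key pieces are the /s modifier, which lets "." match newlines, and the /g modifier in list context, which returns all matches at once instead of filling $2, $3 and so on.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper (not from the original script): returns the <title>
# text followed by the text of every <blah> element found in $html.
sub extract_fields {
    my ($html) = @_;
    my @fields;

    # /s lets "." match newlines, so a tag may span 1 line or 30.
    # /i ignores tag case; the non-greedy .*? stops at the first closing tag.
    if ($html =~ m{<title>(.*?)</title>}si) {
        push @fields, $1;
    }
    else {
        return;    # no title found: return an empty list
    }

    # In list context, a /g match returns every capture, one per <blah> pair.
    push @fields, $html =~ m{<blah>(.*?)</blah>}gsi;

    # Collapse embedded newlines and tabs so each record stays on one row.
    for (@fields) {
        s/\s+/ /g;
        s/^ //;
        s/ $//;
    }

    return @fields;
}

# Made-up sample input: one <blah> on a single line, one spread over three.
my $html = <<'HTML';
<html><head><title>Widget</title></head>
<body>
<blah>12345</blah>
some filler
<blah>
  9.99
</blah>
</body></html>
HTML

# One tab-separated row: title first, then each <blah> value.
print join("\t", extract_fields($html)), "\n";
```

With the sample above this prints Widget, 12345 and 9.99 separated by tabs. In the script itself, the same join("\t", ...) line could be printed to the TAB filehandle once per file in place of the s/// loop.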