Hi,
What I'm trying to do seems right up Perl's alley, but I can't get it to work. I'm using the WWW::Mechanize module to retrieve a sprawling HTML document from which I want to extract certain strings and save them. I can get this much to work:
Code:
use WWW::Mechanize;
$url = "http://someurl";
my $mechanize = WWW::Mechanize->new(autocheck => 1);
$mechanize->get($url);
my @array_of_data = $mechanize->content;
but now I am stuck on how to process that data.
The HTML doc is quite long and contains numbers that I want to extract. Each number is always preceded by the same text string, such as:
<a href bla bla bla>bla bla bla<random tag>mydigits=493409 834%bla bla bla<meaningless tag>bla bla</a>
where the string "mydigits=" always precedes the desired number and is usually all lowercase but can occasionally look like "MyDigits="; where the number itself may be anywhere from one to 10 digits in length; and where "%" might literally be "%" or any other non-digit character, including a space. Moreover, the desired string might appear more than once per line -- assuming Perl doesn't see the HTML doc as just one single long line of text anyway.
What I have tried is many extremely ugly variations on
Code:
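my $pattern = "[Mm]y[Dd]igits=[0-9]*[^0-9]";
foreach (@array_of_data){
    if ( /$pattern/ ){
        print "$_\n";
    }
}
but if I don't get an error, all I get is a spew-out of the entire HTML doc instead of what I am hoping for, which would be a printout or file that looks like: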
219824
2230239084
04598
98739874
etc., etc.
or better yet, assign the output to an array that looks like:
@desired_array = ( 219824, 2230239084, 04598, 98739874);
I know I must be missing something very fundamental, so if anyone can help steer me away from the major mistakes I'm making, I'd appreciate it. Thanks.
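For what it's worth, here is a minimal sketch of one way the extraction described above could be done. It is not a definitive solution. It assumes the page text is held in a single scalar: $mechanize->content returns the whole document as one string, so assigning it to an array gives a one-element array, and printing $_ on a match prints the entire page. The $html value below is made-up sample data standing in for the real page, and a capture group with the /g and /i modifiers is just one possible way to pull the digits out.

```perl
use strict;
use warnings;

# Made-up sample text standing in for the real page; in the actual
# script this would be: my $html = $mechanize->content;
my $html = join "\n",
    '<a href="x">bla<b>mydigits=493409 834%bla<i>bla</a>',
    '<p>MyDigits=219824;more mydigits=04598%</p>';

# /i matches "mydigits=" and "MyDigits=" alike.
# The parentheses capture only the run of 1 to 10 digits.
# /g in list context returns every capture, including several per line.
my @desired_array = $html =~ /mydigits=([0-9]{1,10})/gi;

# The captures are strings, so a leading zero such as in 04598 is kept.
print "$_\n" for @desired_array;
```

With the sample text this prints 493409, 219824 and 04598, one per line. Matching against the whole string means it does not matter whether the document contains line breaks.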