Need help with some very Practical Extraction

  • theapeman
    New Member
    • Jan 2007
    • 5


    Hi,

    What I'm trying to do seems right up Perl's alley, but I can't get it to work. I'm using the WWW::Mechanize module to retrieve a sprawling HTML document from which I want to extract certain strings and save them. I can get this much to work:
    Code:
    use WWW::Mechanize;
    $url = "http://someurl";
    my $mechanize = WWW::Mechanize->new(autocheck => 1);
    $mechanize->get($url);
    my @array_of_data = $mechanize->content;
    but now I am stuck on how to process that data.

    The HTML doc is quite long, and contains numbers that I want to extract, numbers that are always preceded by a text string that is the same each time, such as:

    <a href bla bla bla>bla bla bla<random tag>mydigits=493409 834%bla bla bla<meaningless tag>bla bla</a>

    where the string "mydigits=" always precedes the desired number and is sometimes all lowercase but can occasionally look like "MyDigits="; where the number itself may be anywhere from one to ten digits in length; and where "%" might literally be "%" or any other non-digit character, including a space. Moreover, the desired string might appear more than once per line, assuming Perl doesn't see the HTML doc as just one single long line of text anyway.

    What I have tried is many extremely ugly variations on
    Code:
    my $pattern = "[Mm]y[Dd]igits=[0-9]*[^0-9]";
    foreach (@array_of_data){
        if ( /$pattern/ ){
            print "$_\n";
        }
    }
    but if I don't get an error, all I get is a spew-out of the entire HTML doc instead of what I am hoping for, which would be a printout or file that looks like:

    219824
    2230239084
    04598
    98739874
    etc., etc.

    or better yet, assign the output to an array that looks like:

    @desired_array = ("219824", "2230239084", "04598", "98739874");

    I know I must be missing something very fundamental, so if anyone can help steer me away from the major mistakes I'm making, I'd appreciate it. Thanks.
  • KevinADC
    Recognized Expert Specialist
    • Jan 2007
    • 4092

    #2
    as long as the pattern is on the same line this should work:


    Code:
    my @digits = ();
    foreach (@array_of_data){
        if ( /mydigits=(\d+)/i ){
           print "found $1 in this line: $_\n";
           push @digits,$1;
        }
    }
    print "$_\n" for @digits;

    but can be changed if the pattern is broken over multiple lines.
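
When the marker occurs several times in one line (or the whole page arrives as a single string), a `while` loop with the `g` modifier visits every occurrence instead of only the first. This is a minimal sketch, assuming made-up sample lines in place of the real page; the sample data and the `$line` variable are illustrative only.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for the fetched HTML.
my @array_of_data = (
    '<a href="x">foo<b>mydigits=493409 834%bar</b>MyDigits=77;baz</a>',
    '<p>nothing to see here</p>',
    '<p>MYDIGITS=04598%</p>',
);

my @digits;
foreach my $line (@array_of_data) {
    # In scalar context, /g resumes after the previous match,
    # so the loop body runs once per occurrence in $line.
    while ( $line =~ /mydigits=(\d+)/ig ) {
        push @digits, $1;
    }
}
print "$_\n" for @digits;
```

Because the captures are kept as strings, a leading zero such as the one in 04598 survives.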


    • theapeman
      New Member
      • Jan 2007
      • 5

      #3
      Wow, thank you, that is very helpful. Love the i modifier for case insensitivity!

      As I suspected, the HTML doc looks like one long line to Perl, so what happens is it finds the first instance, say, 123456789, prints

      "found 123456789 in this line:"

      followed by what to you and me looks like more than 3000 lines of HTML, then prints:

      "123456789"

      and then quits. But that is more than I could get it to do before, and this definitely has me pointed in the right direction, so thanks again. :)
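
The "one long line" behaviour follows from how the data was stored: the `content` method returns the page as a single string, so assigning it to an array yields a one-element array. A rough sketch of the difference, using a hypothetical `fake_content` sub in place of the real `$mechanize->content` call so that it runs without a network connection:

```perl
use strict;
use warnings;

# Hypothetical stand-in for $mechanize->content, which returns one string.
sub fake_content { return "<p>mydigits=1%</p>\n<p>mydigits=22 </p>\n" }

# Assigning a single string to an array gives exactly one element.
my @array_of_data = fake_content();

# Splitting on newlines gives one element per physical line instead.
my @lines = split /\n/, fake_content();

print scalar(@array_of_data), "\n";
print scalar(@lines), "\n";
```

With the one-element array, a `foreach` loop runs once and `$_` holds the entire document, which is why the whole page was printed.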


      • theapeman
        New Member
        • Jan 2007
        • 5

        #4
        Wow, I just realized what you did with the parentheses and the $1 to extract only the digits. Awesome!
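
As a small illustration of the capture mechanism, a match in list context returns the captured groups directly, which avoids reading `$1` after a match that may have failed. The sample string and the `$number` variable below are made up for the sketch.

```perl
use strict;
use warnings;

my $text = 'bla MyDigits=98739874%bla';   # made-up sample

# Parentheses capture; in list context the match returns the captures,
# or an empty list when the pattern does not match.
my ($number) = $text =~ /mydigits=(\d+)/i;

print defined $number ? "$number\n" : "no match\n";
```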


        • KevinADC
          Recognized Expert Specialist
          • Jan 2007
          • 4092

          #5
          See how this works:

          Code:
          use WWW::Mechanize;
          my $url = "http://someurl";
          my $mechanize = WWW::Mechanize->new(autocheck => 1);
          $mechanize->get($url);
          # content returns the whole page as a single string
          my $string_of_data = $mechanize->content;
          # in list context, the g modifier returns every captured number
          my @digits = $string_of_data =~ m/mydigits=(\d+)/igm;
          print "$_\n" for @digits;
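          If that doesn't work, try changing the 'm' after 'ig' to an 's', although neither should matter here: 'm' only changes what ^ and $ match, 's' only changes what '.' matches, and this pattern uses none of them. The 'g' in list context is what returns every match.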
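
To check how the modifiers behave on a string that spans several lines, the three variants can be compared side by side. This is only a sketch on a made-up two-line string; the variable names `@plain`, `@with_m` and `@with_s` are illustrative.

```perl
use strict;
use warnings;

# Made-up page text spanning two lines.
my $string_of_data =
    "<a>mydigits=493409 834%bla</a>\n<b>MyDigits=2230239084x</b>";

# In list context, /g returns every captured group in the string.
my @plain  = $string_of_data =~ /mydigits=(\d+)/ig;
my @with_m = $string_of_data =~ /mydigits=(\d+)/igm;
my @with_s = $string_of_data =~ /mydigits=(\d+)/igs;

print "@plain\n";
print "same\n" if "@plain" eq "@with_m" and "@plain" eq "@with_s";
```

All three lists come out identical for this input, since the pattern contains no ^, $ or '.' for the extra modifiers to act on.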
