How to count names and dates by parsing an HTML file using Perl?

  • moddster
    New Member
    • Apr 2007
    • 6

    Hi guys. I am a newbie to Perl and need some help with a problem.

    PROBLEM: I have to parse an HTML file, get rid of all the HTML tags, and count the number of submissions each person has across the dates found. The condition is that multiple submissions by the same person on the same date are counted as 1.

    I have already gotten rid of the HTML tags using:

    Code:
    #!/usr/bin/perl -w

    use strict;

    package HTMLStrip;
    use base "HTML::Parser";

    # called by HTML::Parser for each chunk of plain text between tags
    sub text {
        my ($self, $text) = @_;
        print $text;
    }

    my $p = HTMLStrip->new;
    # parse line-by-line, rather than the whole file at once
    while (<>) {
        #chomp;
        s/&nbsp;/ /g;
        s/&gt;/ /g;
        s/Remove/ /g;
        $p->parse($_);
    }
    # flush and parse remaining unparsed HTML
    $p->eof;
    After parsing the HTML file, my output looks like this (only part of the output is shown):

    Aneeka Bhalla Bhalla, Aneeka (abhalla7840)Received 01-24-2007 10:51

    Andrew Johnson 1-24-07 Johnson, Andrew (aljohnson8711)Received 01-24-2007 10:51

    Stephen Pennington - Jan 24, 06 Pennington, Stephen (sjpennington8423)Received 01-24-2007 10:51

    Sarah Gatliff Gatliff, Sarah (sngatliff7093)Received 01-24-2007 10:51

    Kyle McCracken McCracken, Kyle (krmccracken9032)Received 01-24-2007 10:51

    Exercise 1 1/24/07 Monk, Megan (mjmonk7907)Received 01-24-2007 10:50

    homework Ilieva, Mariya (mkilieva7030)Received 01-18-2007 15:15

    Sarah Gatliff Gatliff, Sarah (sngatliff7093)Received 01-17-2007 10:48

    William Shaun Greening Greening, William (wsgreening7657)Received 01-17-2007 10:48

    Shearita Henderson Received 01-17-2007 10:48

    pfe quotes 1-17-07 Monk, Megan (mjmonk7907)Received 01-17-2007 10:47

    Sondra Denise Grissom Received 01-17-2007 10:47

    Anthony Harris Harris, Anthony (adharris9208)Received 01-17-2007 10:47

    Curtis Box Intro Worksheet Box, Curtis (cbox9827)Received 01-17-2007 10:47

    Jason Hughes Hughes, Jason (jbhughes8891)Received 01-17-2007 10:47

    charles christopherson Christopherson, Charles (cachristopherson9444)Received 01-17-2007 10:47

    Darwin Moore Moore, Darwin (ddmoore7092)Received 01-17-2007 10:47

    April Stephens Stephens, April (atstephens4498)Received 01-17-2007 10:47

    Lyntisha Miller Miller, Lyntisha (lsmiller8647)Received 01-17-2007 10:47

    Kyle McCracken McCracken, Kyle (krmccracken9032)Received 01-17-2007 10:47

    Aneeka Bhalla Bhalla, Aneeka (abhalla7840)Received 01-17-2007 10:47


    Format for your understanding:

    <file name> <lastname>,<firstname> <userid> Received <Date and Time>
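
A line in that format can be split into its fields with a single regular expression. The sketch below is only an illustration: the helper name `parse_line` and the field names it returns are hypothetical, and it assumes the user ID always appears in parentheses directly before `Received`, as in the sample output above.

```perl
use strict;
use warnings;

# Hypothetical helper: split one line of tag-stripped output into fields.
# Returns a hash reference, or nothing when the line does not match.
sub parse_line {
    my ($line) = @_;
    return unless $line =~ /(\S+),\s+(\S+)\s+\(([^)]+)\)\s*Received\s+(\d{2}-\d{2}-\d{4})\s+(\d{2}:\d{2})/;
    return { last => $1, first => $2, userid => $3, date => $4, time => $5 };
}

my $rec = parse_line('Sarah Gatliff Gatliff, Sarah (sngatliff7093)Received 01-17-2007 10:48');
print "$rec->{first} $rec->{last} submitted on $rec->{date}\n" if $rec;
```

Lines without the `Lastname, Firstname (userid)` part, such as `Shearita Henderson Received 01-17-2007 10:48`, do not match this pattern and are returned as nothing.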

    My output should be:

    <firstname> <lastname> (<number of times the user submitted>)

    e.g.

    Aneeka Bhalla (2)
    Kyle McCracken (1)
    ....

    I need help with the counting and comparing dates part.

    Any help appreciated!
  • KevinADC
    Recognized Expert Specialist
    • Jan 2007
    • 4092

    #2
    What have you tried as far as the counting goes? Is this school/class work?


    • moddster
      New Member
      • Apr 2007
      • 6

      #3
      Originally posted by KevinADC
      What have you tried as far as the counting goes? Is this school/class work?
      I want to use loops and arrays, but I don't know how to do that in Perl.

      It is not part of my class, but I am a student assistant for a professor. Part of my job includes keeping a record of all the students' submissions. I am learning Perl on my own. If I can write a script that parses the HTML and does the counting for me, that will be awesome! I wouldn't have to write it on paper and count every time!


      • KevinADC
        Recognized Expert Specialist
        • Jan 2007
        • 4092

        #4
        What about lines like this that do not follow the format of the file:

        Code:
        <file name> <lastname>,<firstname> <userid> Received <Date and Time>
        Sondra Denise Grissom Received 01-17-2007 10:47
        What do you propose to do with those lines? Is there more than one format?
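
One way to see how many lines fall outside the format is to collect the non-matching lines instead of silently dropping them. This is a rough sketch: `split_lines` is a hypothetical helper, and the pattern is only a guess at the format based on the sample lines shown in this thread.

```perl
use strict;
use warnings;

# Hypothetical helper: separate lines that follow the
# "<lastname>, <firstname> (<userid>)Received <date>" shape from the rest.
sub split_lines {
    my @lines = @_;
    my (@matched, @skipped);
    for my $line (@lines) {
        next unless $line =~ /\S/;    # ignore blank lines
        if ($line =~ /\S+,\s+\S+\s+\([^)]+\)\s*Received\s+\S+/) {
            push @matched, $line;
        }
        else {
            push @skipped, $line;
        }
    }
    return (\@matched, \@skipped);
}

my ($ok, $bad) = split_lines(
    'Sarah Gatliff Gatliff, Sarah (sngatliff7093)Received 01-17-2007 10:48',
    'Sondra Denise Grissom Received 01-17-2007 10:47',
);
printf "%d matched, %d skipped\n", scalar @$ok, scalar @$bad;
print "skipped: $_\n" for @$bad;
```

Printing the skipped lines makes it easier to decide whether a second format needs handling or whether those lines can safely be ignored.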


        • moddster
          New Member
          • Apr 2007
          • 6

          #5
          Originally posted by KevinADC
          What about lines like this that do not follow the format of the file:

          Code:
          <file name> <lastname>,<firstname> <userid> Received <Date and Time>
          Sondra Denise Grissom Received 01-17-2007 10:47
          What do you propose to do with those lines? Is there more than one format?
          I want to omit everything else with spaces. I have sent you a PM with a link to the output file.


          • KevinADC
            Recognized Expert Specialist
            • Jan 2007
            • 4092

            #6
            Well, your requirements are still sort of vague. Making some assumptions, I came up with this:

            Code:
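            use strict;
            use warnings;

            my %results;
            # out.txt: the tag-stripped output file
            open(my $fh, '<', 'out.txt') or die "Cannot open out.txt: $!";
            while (my $line = <$fh>) {
               # capture last name, first name and the date after "Received"
               if ($line =~ /(\S+),\s+(\S+)\s+\([^)]+\)Received (\S+)/) {
                  $results{"$2 $1"}{$3}++;
                  # count a date only the first time it is seen for this person
                  $results{"$2 $1"}{Total}++ if $results{"$2 $1"}{$3} == 1;
               }
            }
            close($fh);
            foreach my $person (keys %results) {
               print "$person ($results{$person}->{Total})\n";
            }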
            This returns the following from the out.txt file:

            Code:
            Curtis Box (13)
            Sarah Gatliff (13)
            Megan Monk (12)
            Mariya Ilieva (6)
            Charles Christopherson (14)
            Lyntisha Miller (8)
            Kyle McCracken (11)
            Megan Gonzales (7)
            Jennifer Smith (10)
            Jason Hughes (12)
            Johnathan Van (10)
            Elizabeth Myrick (10)
            Stephen Pennington (13)
            Jacob Leigh (10)
            William Greening (3)
            Anthony Harris (10)
            Holli Peek (10)
            Andrew Johnson (12)
            April Stephens (11)
            Darwin Moore (9)
            Aneeka Bhalla (13)
            You need to check the results for accuracy.
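
Because each person's inner hash is keyed by date, the number of distinct dates is simply the number of keys, so a separate `Total` counter is not strictly needed. The sketch below is a variation on the script above, not a replacement for it. `count_submissions` is a hypothetical name, the pattern additionally tolerates whitespace before `Received`, the output is sorted by name to make checking by hand easier, and the last sample line is made up to show two submissions on the same date.

```perl
use strict;
use warnings;

# Hypothetical helper: map "First Last" => number of distinct submission dates.
sub count_submissions {
    my @lines = @_;
    my %dates_for;
    for my $line (@lines) {
        next unless $line =~ /(\S+),\s+(\S+)\s+\([^)]+\)\s*Received\s+(\S+)/;
        $dates_for{"$2 $1"}{$3} = 1;    # repeated dates collapse into one key
    }
    my %count;
    $count{$_} = scalar keys %{ $dates_for{$_} } for keys %dates_for;
    return %count;
}

my %count = count_submissions(
    'Aneeka Bhalla Bhalla, Aneeka (abhalla7840)Received 01-24-2007 10:51',
    'Aneeka Bhalla Bhalla, Aneeka (abhalla7840)Received 01-17-2007 10:47',
    'Kyle McCracken McCracken, Kyle (krmccracken9032)Received 01-17-2007 10:47',
    'Kyle McCracken McCracken, Kyle (krmccracken9032)Received 01-17-2007 10:40',    # made-up line
);
print "$_ ($count{$_})\n" for sort keys %count;
```

With these four sample lines the sketch prints `Aneeka Bhalla (2)` and `Kyle McCracken (1)`, since both of Kyle's lines share the date 01-17-2007.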
