compare two file contents

  • lilly07
    New Member
    • Jul 2008
    • 89

    compare two file contents

    Hi
    I have two text files and each file contains 2 tab separated strings as below:

    File R.txt

    class1 12345
    class2 26789
    class1 4567
  • lilly07
    New Member
    • Jul 2008
    • 89

    #2
    Hi Sorry, I submitted my post before finishing the draft.


    I have two text files and each file contains 2 tab separated strings as below:

    File testR.txt
    class1 12345
    class2 26789
    class1 4567
    class5 567
    class3 987

    and another file as below:
    File testP.txt
    class5 525
    class7 728
    class1 670
    class8 34
    class3 567

    I need to compare the two files as follows. For every record in R.txt, I have to check whether column 1 equals column 1 of any record in P.txt and then process further. I tried the script below, but somehow the search is not complete. For every record in R.txt, I have to search every record in P.txt and do the comparison.

    I have posted my script below. There seems to be a flaw in the logic, and I would also like to know whether this kind of search is optimal, because each file is large: around 5000 records for R.txt and 3000 for P.txt. Thanks, and please let me know what the problem in my script is.

    Code:
    #!/usr/bin/perl
    $file1 = 'testR.txt';
    $file2 = 'testP.txt';
    open (R, $file1) || die ("Could not open $file!");
    open (P, $file2) || die ("Could not open $file!");
    $counter = 0;
    while ($Rline = <R>)
    { 
            chomp $_;
            my @R = split(/\s+/,$Rline);
            
            while ($Pline = <P>)
            { 
                    chomp $_;
                    my @P = split(/\s+/,$Pline);
                            if($R[0] eq $P[0]) {
                            print "$R[0]\t$R[1]\t$P[0]\t$P[1]\n";
                            }
            }
            close (P);
            print "$counter\n";
            $counter++;
      
      
    }
    close (R);
    Thanks.


    • KevinADC
      Recognized Expert Specialist
      • Jan 2007
      • 4092

      #3
      Your code looks like it should not work, since you close file P after searching only the first line of file R. I'm not sure what to suggest, though, because what you are trying to do is not clear to me. Most likely you want to load the data into a hash and search the hash instead of reading the file over and over.
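
A minimal sketch of the hash idea suggested above, assuming the only question is whether a first-column value from R also occurs in P. With about 5000 and 3000 records, the nested file scan can make up to 15 million comparisons, while a hash needs roughly one pass over each file. The helper name `keys_seen` and the in-memory sample lines are illustrative and not from the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: return a hash ref whose keys are the first columns seen.
sub keys_seen {
    my %seen;
    for my $line (@_) {
        my ($key) = split ' ', $line;      # split ' ' ignores leading whitespace
        $seen{$key} = 1 if defined $key;   # blank lines yield no key and are skipped
    }
    return \%seen;
}

# Sample records copied from the question (tab separated).
my @r_lines = ("class1\t12345", "class2\t26789", "class1\t4567", "class5\t567", "class3\t987");
my @p_lines = ("class5\t525", "class7\t728", "class1\t670", "class8\t34", "class3\t567");

# Read P once and remember which keys it contains.
my $in_p = keys_seen(@p_lines);

# One pass over R; each check is a hash lookup rather than a rescan of P.
for my $line (@r_lines) {
    my ($key, $value) = split ' ', $line;
    my $status = exists $in_p->{$key} ? "found in P" : "not in P";
    print "$key\t$value\t$status\n";
}
```

This version only records that a key exists; it does not keep the second-column values needed for a later comparison.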


      • lilly07
        New Member
        • Jul 2008
        • 89

        #4
        Hi Kevin,

        I cannot add the file contents to a hash. In the above example, I need to compare the first column of file1 with the first column of file2 and, if they are the same, process further. Basically I have to check every element of file1 against every element of file2.

        Thanks.


        • KevinADC
          Recognized Expert Specialist
          • Jan 2007
          • 4092

          #5
          Why can't you add the file contents into a hash? This way is very inefficient, but see if it works:

          Code:
          #!/usr/bin/perl
          $file1 = 'testR.txt';
          $file2 = 'testP.txt';
          open (R, $file1) || die ("Could not open $file1!");
          open (P, $file2) || die ("Could not open $file2!");
          $counter = 0;
          while ($Rline = <R>)
          {
                  chomp $Rline; # strip the newline from the line just read
                  my @R = split(/\s+/,$Rline);
                  seek P,0,0; # return to beginning of the P file
                  while ($Pline = <P>)
                  {
                          chomp $Pline;
                          my @P = split(/\s+/,$Pline);
                          if($R[0] eq $P[0]) {
                                  print "$R[0]\t$R[1]\t$P[0]\t$P[1]\n";
                          }
                  }
                  print "$counter\n";
                  $counter++;
          }
          close (P);
          close (R);


          • lilly07
            New Member
            • Jul 2008
            • 89

            #6
            Hi Kevin,
            As a beginner, I always find hashes confusing. My basic objective is to check whether the 1st column in the R file equals the 1st column in the P file, and then take the difference between their 2nd columns to check whether they differ by just 100. For example,

            Code:
            class1 12345
            from R and
            Code:
            class1 670
            of P

            The 1st columns are the same and the diff is the absolute value of (12345 - 670). I have to check every record in R against every record in P. Since it is confusing to think in terms of hashes, I did the search in a very primitive way.

            Anyway, as you suggested, I will try to put the values in a hash and compare the array values. I understand that this brings my search time down, but it is still a bit confusing to compare values between hashes.
            Thanks again.


            • KevinADC
              Recognized Expert Specialist
              • Jan 2007
              • 4092

              #7
              I understand. It will be even harder because you have many duplicate "keys" in the files. Hash keys are unique, so you would actually have to use something like a hash of arrays. I will see what I can come up with.
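
To illustrate the point about unique keys, here is a small sketch comparing a plain hash with a hash of arrays on three of the testR.txt records, where class1 appears twice. The helper names `build_plain` and `build_hoa` are made up for this example.

```perl
use strict;
use warnings;

# Three records from testR.txt in the question; class1 appears twice.
my @records = (['class1', 12345], ['class2', 26789], ['class1', 4567]);

# Hypothetical helper: plain hash, one value per key.
# A later value for the same key overwrites the earlier one.
sub build_plain {
    my %h;
    $h{ $_->[0] } = $_->[1] for @_;
    return \%h;
}

# Hypothetical helper: hash of arrays, every value for a key is kept.
sub build_hoa {
    my %h;
    push @{ $h{ $_->[0] } }, $_->[1] for @_;
    return \%h;
}

my $plain = build_plain(@records);
my $hoa   = build_hoa(@records);

print "plain hash, class1:     $plain->{class1}\n";       # 4567 (12345 was overwritten)
print "hash of arrays, class1: @{ $hoa->{class1} }\n";    # 12345 4567
```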


              • KevinADC
                Recognized Expert Specialist
                • Jan 2007
                • 4092

                #8
                Here's a rather quick write-up of some code. It does what you want, I think. The output is probably much more verbose than you need, but it can be changed to display only the results you want, for example when the diff is 100. I would run this on a small set of data first, since it prints a lot of results. If it appears to work correctly, the output can be modified.

                Code:
                use strict;
                use warnings;
                #use Data::Dumper;
                my $file1 = 'c:/perl_test/testR.txt';
                my $file2 = 'c:/perl_test/testP.txt';
                my %HoA;
                open (R, $file1) or  die ("Could not open $file1!");
                while(<R>){
                   chomp;
                   my ($k, $v) = split(/\s+/);
                   push @{$HoA{'R'}{$k}},$v;
                }
                close(R);
                open (P, $file2) or die ("Could not open $file2!");
                while(<P>){
                   chomp;
                   my ($k, $v) = split(/\s+/);
                   push @{$HoA{'P'}{$k}},$v;
                }
                close(P);
                #print Dumper \%HoA;
                foreach my $R (keys %{ $HoA{'R'} }) {
                   if (exists $HoA{'P'}{$R}) {
                      print "$R\ntestR     testP     diff\n------------------------------\n";
                      foreach my $classR ( @{$HoA{'R'}{$R}} ) {
                         foreach my $classP ( @{$HoA{'P'}{$R}} ) {
                            printf "%-10s%-10s%s\n",$classR,$classP,$classR-$classP;
                         }
                      }
                      print "\n";
                   }
                   else {
                      print "\n$R has no match in testP\n\n";
                   }
                }
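
One way the verbose output above might be narrowed is sketched here. It assumes that "just 100 in difference" means an absolute difference of at most 100; the helper name `close_pairs` and its threshold parameter are not part of the original script.

```perl
use strict;
use warnings;

# Hypothetical helper: return [r_value, p_value, abs_diff] for every pair of
# values whose absolute difference is at most $max_diff.
sub close_pairs {
    my ($r_values, $p_values, $max_diff) = @_;
    my @pairs;
    for my $r (@$r_values) {
        for my $p (@$p_values) {
            my $diff = abs($r - $p);
            push @pairs, [$r, $p, $diff] if $diff <= $max_diff;
        }
    }
    return @pairs;
}

# Values for class5 from the sample data: 567 in testR, 525 in testP.
for my $pair (close_pairs([567], [525], 100)) {
    printf "%-10s%-10s%s\n", @$pair;
}
```

In the script above, the two inner `foreach` loops could be replaced by a call such as `close_pairs($HoA{'R'}{$R}, $HoA{'P'}{$R}, 100)`. If an exact difference of 100 is wanted instead, the test becomes `$diff == 100`.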


                • KevinADC
                  Recognized Expert Specialist
                  • Jan 2007
                  • 4092

                  #9
                  Output with your small sample data is:

                  Code:
                  class5
                  testR     testP     diff
                  ------------------------------
                  567       525       42
                  
                  class1
                  testR     testP     diff
                  ------------------------------
                  12345     670       11675
                  4567      670       3897
                  
                  
                  class2 has no match in testP
                  
                  class3
                  testR     testP     diff
                  ------------------------------
                  987       567       420


                  • lilly07
                    New Member
                    • Jul 2008
                    • 89

                    #10
                    Hi Kevin,

                    Thank you so much. It works.


                    • KevinADC
                      Recognized Expert Specialist
                      • Jan 2007
                      • 4092

                      #11
                      You're welcome. Hopefully it helps you learn how to use hashes and more complex data for future needs.


                      • lilly07
                        New Member
                        • Jul 2008
                        • 89

                        #12
                        Hi Kevin,
                        It was very helpful, especially with hashes, and saved a lot of time compared with the primitive way of searching. Thanks a lot again for your time.

                        Code:
                         
                        push @{$HoA{'P'}{$k}},$v;
                        is a bit tricky. Could you please explain?
                        Regards
                        Lilly


                        • KevinADC
                          Recognized Expert Specialist
                          • Jan 2007
                          • 4092

                          #13
                          You are already familiar with the push function I assume:

                          push @array,$var;

                          This is really the same thing, albeit with more brackets:

                          push @{$HoA{'P'}{$k} },$v;


                          It's a hash of hashes of arrays:

                          $HoA{'P'} <-- top level of the hash
                          $HoA{'P'}{$R} <-- second level of the hash (its value is an array reference)
                          @{ $HoA{'P'}{$R} } <-- this dereferences the array reference stored at the second level so it can be used as an array
                          push @{$HoA{'P'}{$k} },$v; <-- this adds $v to the end of the array @{$HoA{'P'}{$k} }

                          All the bracketing makes it look more complicated than it is. But notice that the sigil is the same as in the simple case: @ for array.
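
The explanation above can be spelled out step by step. This sketch builds the same structure first in a long form and then with the one-line form from the thread; the sample values 670 and 34 are taken from testP.txt, but pairing them under one key is only for illustration.

```perl
use strict;
use warnings;

my %HoA;
my ($k, $v) = ('class1', 670);

# Long form: create each level explicitly before pushing.
$HoA{P}     = {} unless exists $HoA{P};       # top level holds a hash reference
$HoA{P}{$k} = [] unless exists $HoA{P}{$k};   # second level holds an array reference
my $array_ref = $HoA{P}{$k};
push @{$array_ref}, $v;                       # @{ } dereferences the reference for push

# Short form from the thread: Perl creates any missing levels on its own
# (autovivification), so a single line does the same job.
push @{ $HoA{P}{$k} }, 34;

print "$k => @{ $HoA{P}{$k} }\n";             # prints "class1 => 670 34"
```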
