Compare two files in perl

  • raj14
    New Member
    • Feb 2014
    • 4

    Compare two files in perl

    I have a problem: I am trying to compare two large text files. I have written a Perl script to cross-check them, but it takes a very long time. The code works fine on small amounts of data. The sample files are attached here.

    I want the first line of chr.txt to be checked against every line in exon.txt, and the process repeated until every line of chr.txt has been checked against the lines of exon.txt.

    This is the code I wrote.

    Code:
    use strict;
    use warnings;
    
    my $file1 = "exon.txt";
    my $file2 = "chr.txt";
    
    open(FILE1, '<', $file1) || die "couldn't open $file1: $!";
    open(FILE2, '<', $file2) || die "couldn't open $file2: $!";
    
    open(OUT, '>', "result.txt") || die "couldn't open result.txt: $!";
    
    my @arr1 = <FILE1>;
    my @arr2 = <FILE2>;
    
    foreach my $arr1 (@arr1) {
    
        chomp $arr1;
        my ($eChr, $eStart, $eEnd, $eCat) = split(/\t/, $arr1);
    
        # Every exon line is compared against every line of chr.txt
        foreach my $arr2 (@arr2) {
    
            chomp $arr2;
            my ($cChr, $cStart, $cEnd) = split(/\t/, $arr2);
            if (($cChr eq $eChr) && ($cStart >= $eStart) && ($cEnd <= $eEnd)) {
                print OUT "$cChr\t$cStart\t$cEnd\t$eCat\t$eStart\t$eEnd\n";
            }
        }
    }
    close(FILE1);
    close(FILE2);
    close OUT;
    Attached Files
  • RonB
    Recognized Expert Contributor
    • Jun 2009
    • 589

    #2
    You're looping over the data too many times.

    Load the first file (exon.txt) into a HoAoH (Hash of Array of Hashes) where the key is the "chr" and the hash ref would hold the rest of the data. Then loop over the chr.txt file line-by-line checking for the existence of the "chr" key.

    The sample data you posted won't produce any matching results, but presumably your real data set will.

    Code:
    #!/usr/bin/perl
    
    use strict;
    use warnings;
    
    my $file1 = "exon.txt";
    open my $exon_fh, '<', $file1 or die "couldn't open $file1 $!";
    
    my %exon;
    while (my $line = <$exon_fh>) {
        next if $line =~ /^\s*$/;
        chomp $line;
        my ($chr,$start,$end,$cat) = split(/\t/, $line);
        push @{$exon{$chr}}, {
            start => $start,
            end   => $end,
            cat   => $cat,
        };
    }
    close $exon_fh;
    
    my $file2 = "chr.txt";
    open my $chr_fh, '<', $file2 or die "couldn't open $file2 $!";
    
    while (my $line = <$chr_fh>) {
        next if $line =~ /^\s*$/;
        chomp $line;
        my ($chr,$start,$end) = split(/\t/, $line);
        next unless exists $exon{$chr};
    
        foreach my $exon ( $exon{$chr} ) {
            if ($start >= $exon{start} && $end <= $exon{end} ) {
                print join("\t", $chr, $start, $exon{start}, $end <= $exon{end}) . "\n";
            }
        }
    }
    close $chr_fh;
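
To make the shape of that hash of arrays of hashes concrete, here is a minimal sketch that builds it from a few inline rows and walks it back out. The rows and the `$rec` loop variable are hypothetical, chosen for illustration; they are not taken from the attached files.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample rows (chr, start, end, category); not from the attached files.
my @rows = (
    "chr1\t100\t200\texonA",
    "chr1\t300\t400\texonB",
    "chr2\t50\t80\texonC",
);

my %exon;
foreach my $row (@rows) {
    my ($chr, $start, $end, $cat) = split /\t/, $row;
    # One array per chromosome; each element is a hash ref for one exon.
    push @{ $exon{$chr} }, { start => $start, end => $end, cat => $cat };
}

# $exon{chr1} is an array reference, so it is dereferenced with @{ ... },
# and each record is read through the loop variable with ->.
foreach my $chr (sort keys %exon) {
    foreach my $rec (@{ $exon{$chr} }) {
        print join("\t", $chr, $rec->{start}, $rec->{end}, $rec->{cat}), "\n";
    }
}
```

Grouping by chromosome means a line from chr.txt only has to be compared against the exons stored under its own key rather than against every line of exon.txt.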


    • RonB
      Recognized Expert Contributor
      • Jun 2009
      • 589

      #3
      I just noticed that I had an error in the print statement. It should be:
      Code:
      print join("\t", $chr, $start, $exon{cat}, $exon{start}, $exon{end}) . "\n";


      • raj14
        New Member
        • Feb 2014
        • 4

        #4
        Thanks for the help, RonB. But when I run this script, it reports an error: "Use of uninitialized value".

        Can you explain this part?

        Code:
        push @{$exon{$chr}}, ;{
            start => $start,
            end   => $end,
            cat   => $cat,
        };


        • RonB
          Recognized Expert Contributor
          • Jun 2009
          • 589

          #5
          Which part do you want explained? The syntax errors that you added to the code I gave you or what the code should do without your syntax errors?


          • raj14
            New Member
            • Feb 2014
            • 4

            #6
            Originally posted by RonB
            Which part do you want explained? The syntax errors that you added to the code I gave you or what the code should do without your syntax errors?

            The error is "Use of uninitialized value in numeric ge (>=)". So I guess the syntax has some problem. This part of your code has the error; I attach it here.
            Code:
            next if $line =~ /^\s*$/;
            chomp $line;
            my ($chr,$start,$end,$cat) = split(/\t/, $line);
            push @{$exon{$chr}}, ;{
                start => $start,
                end   => $end,
                cat   => $cat,
            };


            • RonB
              Recognized Expert Contributor
              • Jun 2009
              • 589

              #7
              Remove the semi colon in this line:
              Code:
              push @{$exon{$chr}}, ;{


              • raj14
                New Member
                • Feb 2014
                • 4

                #8
                It still produces the same error.


                • RonB
                  Recognized Expert Contributor
                  • Jun 2009
                  • 589

                  #9
                  You didn't say which line the warning message was referring to.

                  The only line in the code I gave that does that numerical test is this one (line 32):
                  Code:
                  if ($start >= $exon{start} && $end <= $exon{end} ) {
                  You need to dump those 4 vars (via the Data::Dumper module) to see which one is undefined.
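
A minimal sketch of such a dump, with hypothetical inline data standing in for exon.txt. It also shows one cause of this warning worth ruling out in the loop from post #2: inside `foreach my $exon (...)`, the expression `$exon{start}` reads the key "start" from the hash `%exon`, not from the loop variable `$exon`. The loop variable holds a reference, so it needs `$exon->{start}`. Likewise, `( $exon{$chr} )` is a one-element list containing the array reference, whereas `@{ $exon{$chr} }` yields the individual records. The name `$rec` below is an illustrative choice.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;

# Hypothetical data standing in for exon.txt.
my %exon = (
    chr1 => [
        { start => 100, end => 200, cat => 'exonA' },
        { start => 300, end => 400, cat => 'exonB' },
    ],
);

# Hypothetical line from chr.txt, already split into fields.
my ($chr, $start, $end) = ('chr1', 120, 180);

# $exon{start} and $exon{end} look up keys in %exon itself.
# No such keys exist, so Dumper shows them as undef.
print Dumper($start, $end, $exon{start}, $exon{end});

# Dereference the array ref, then read each record through the loop variable.
foreach my $rec (@{ $exon{$chr} }) {
    if ($start >= $rec->{start} && $end <= $rec->{end}) {
        print join("\t", $chr, $start, $end, $rec->{cat}, $rec->{start}, $rec->{end}), "\n";
    }
}
```

If the dump shows `undef` for the last two values, the comparison is reading from the wrong variable rather than from bad input data.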
