Filtering out Duplicate IDs

  • idorjee
    New Member
    • Mar 2007
    • 76

    Hi all,
    I would really appreciate it if you could let me know how to get just the first row for each unique Accession in a tab-delimited file like the one below.

    Accession TC# TCAcc Evalue %ID Score
    2005490039 3.A.1.2.9 Q7BSH4 9.00E-18 24.78991597 289
    2005490039 3.A.1.111.2 P33116 1.00E-15 28.94736842 289
    2005490048 3.A.1.107.1 P30962 1.00E-17 35.34482759 31
    2005490048 9.B.14.2.1 P29961 2.00E-16 27.97202797 31

    Thanks a lot.
  • KevinADC
    Recognized Expert Specialist
    • Jan 2007
    • 4092

    #2
    what have you tried so far?

    • idorjee
      New Member
      • Mar 2007
      • 76

      #3
      This is what I tried, but it doesn't filter anything; the output is identical to the input file.

      Code:
      while (<INFILE>) {
      	if ($_ =~ /(\S+)\t(.+)/) {
      		my $qa = $1;
      		my $rest = $2;
      		my $lowest = $qa;
      		$lowest = $qa if $qa ne $lowest;
      		print OUTFILE "$lowest\t$rest\n";
      	}
      }
      Thanks.
      Last edited by miller; Apr 12 '07, 12:09 AM. Reason: Code Tag and Reformatting

      • KevinADC
        Recognized Expert Specialist
        • Jan 2007
        • 4092

        #4
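        Your loop assigns $lowest = $qa on every line, so the "ne" test is never true and every row gets printed. You need a hash to keep track of which Accessions you have already "seen", so that repeats are skipped: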

        Code:
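        my %seen = ();    # Accessions encountered so far
        while (<INFILE>) {
            # Capture the first tab-delimited field (the Accession)
            if (/^(\S+)\t/) {
                # Skip the line if this Accession has appeared before
                next if ++$seen{$1} > 1;
            }
            print OUTFILE;    # prints $_, which still carries its newline
        }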
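
For anyone who wants to try the %seen idea without setting up INFILE and OUTFILE first, here is a minimal self-contained sketch. The helper name first_row_per_key is made up for illustration and is not part of the thread; the sample rows are the ones from the first post, and the sketch assumes the fields really are separated by tabs.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): keeps the first line seen for
# each distinct value of the first tab-separated field.
sub first_row_per_key {
    my @lines = @_;
    my %seen;    # keys already encountered
    my @kept;
    for my $line (@lines) {
        my ($key) = split /\t/, $line, 2;
        next unless defined $key && length $key;    # ignore blank lines
        push @kept, $line unless $seen{$key}++;     # first occurrence only
    }
    return @kept;
}

# Sample rows from the question, written with explicit tabs.
my @rows = (
    "Accession\tTC#\tTCAcc\tEvalue\t%ID\tScore",
    "2005490039\t3.A.1.2.9\tQ7BSH4\t9.00E-18\t24.78991597\t289",
    "2005490039\t3.A.1.111.2\tP33116\t1.00E-15\t28.94736842\t289",
    "2005490048\t3.A.1.107.1\tP30962\t1.00E-17\t35.34482759\t31",
    "2005490048\t9.B.14.2.1\tP29961\t2.00E-16\t27.97202797\t31",
);

print "$_\n" for first_row_per_key(@rows);
```

With these five input lines, the header row and the first row for each of the two Accessions are kept, so three lines are printed. The header survives simply because "Accession" is itself a key that is seen only once.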

        • idorjee
          New Member
          • Mar 2007
          • 76

          #5
          Thanks a lot, Kevin,
          that worked fine.
          ^ ^*
