Average and Standard deviation

  • kumarboston
    New Member
    • Sep 2007
    • 55

    Hi All,
    I am trying to compute an average value for my data. Here is my data file:
    Code:
    DATA FILE
    EP1934.PDB 250 250 11.27
    EP1934.PDB 251 251 12.7332
    EP1934.PDB 252 252 6.38341
    EP1934.PDB 253 253 8.04318
    EP1934.PDB 254 254 13.7123
    EP1934.PDB 255 255 10.5251
    EP1934.PDB 256 256 6.0811
    EP1934.PDB 257 257 13.317
    EP1934.PDB 258 258 14.1105
    EP1934.PDB 259 259 6.98834
    EP1934.PDB 260 260 9.93146
    EP1934.PDB 261 261 15.0784
    EP1934.PDB 262 262 11.2232
    EP1934.PDB 263 263 5.8835
    EP1934.PDB 264 264 12.9708
    EP1934.PDB 265 265 14.6467
    EP1934.PDB 266 266 7.85166
    EP1934.PDB 267 267 8.95534
    EP1934.PDB 268 268 14.5541
    EP1934.PDB 269 269 11.5805
    EP1934.PDB 270 270 5.62243
    EP1934.PDB 271 271 12.6822
    EP1934.PDB 272 272 14.9681
    EP1934.PDB 273 273 8.78424
    EP1934.PDB 274 274 9.98951
    EP1935.PDB 250 250 11.793
    EP1935.PDB 251 251 13.2081
    EP1935.PDB 252 252 6.3147
    EP1935.PDB 253 253 8.55546
    EP1935.PDB 254 254 13.8497
    EP1935.PDB 255 255 10.091
    EP1935.PDB 256 256 5.70243
    EP1935.PDB 257 257 12.8827
    EP1935.PDB 258 258 13.4507
    EP1935.PDB 259 259 6.39756
    EP1935.PDB 260 260 9.43181
    EP1935.PDB 261 261 14.7167
    EP1935.PDB 262 262 10.9966
    EP1935.PDB 263 263 5.71955
    EP1935.PDB 264 264 13.135
    EP1935.PDB 265 265 14.4682
    EP1935.PDB 266 266 7.93579
    EP1935.PDB 267 267 9.48097
    EP1935.PDB 268 268 15.5227
    EP1935.PDB 269 269 12.5595
    EP1935.PDB 270 270 6.47589
    EP1935.PDB 271 271 13.1677
    EP1935.PDB 272 272 15.9816
    EP1935.PDB 273 273 10.2107
    EP1935.PDB 274 274 10.7019
    EP1936.PDB 250 250 12.0315
    EP1936.PDB 251 251 13.6144
    EP1936.PDB 252 252 6.44758
    EP1936.PDB 253 253 8.70471
    EP1936.PDB 254 254 13.9884
    EP1936.PDB 255 255 10.4086
    EP1936.PDB 256 256 5.42416
    EP1936.PDB 257 257 12.5661
    EP1936.PDB 258 258 13.497
    EP1936.PDB 259 259 6.49391
    EP1936.PDB 260 260 9.43865
    EP1936.PDB 261 261 14.9835
    EP1936.PDB 262 262 11.4903
    EP1936.PDB 263 263 6.2322
    EP1936.PDB 264 264 13.3191
    EP1936.PDB 265 265 15.0674
    EP1936.PDB 266 266 8.56444
    EP1936.PDB 267 267 9.8656
    EP1936.PDB 268 268 16.3347
    EP1936.PDB 269 269 13.6462
    EP1936.PDB 270 270 7.47648
    EP1936.PDB 271 271 13.8738
    EP1936.PDB 272 272 16.8272
    EP1936.PDB 273 273 11.1519
    EP1936.PDB 274 274 9.61694
    EP1937.PDB 250 250 11.2767
    EP1937.PDB 251 251 12.8564
    EP1937.PDB 252 252 6.13925
    EP1937.PDB 253 253 8.30244
    EP1937.PDB 254 254 14.1491
    EP1937.PDB 255 255 10.6535
    EP1937.PDB 256 256 5.36572
    EP1937.PDB 257 257 12.1148
    EP1937.PDB 258 258 13.3093
    EP1937.PDB 259 259 6.15769
    EP1937.PDB 260 260 9.39202
    EP1937.PDB 261 261 14.6329
    EP1937.PDB 262 262 11.1803
    EP1937.PDB 263 263 6.40411
    EP1937.PDB 264 264 13.6729
    EP1937.PDB 265 265 14.5391
    EP1937.PDB 266 266 8.22699
    EP1937.PDB 267 267 8.98709
    EP1937.PDB 268 268 15.2712
    EP1937.PDB 269 269 13.2764
    EP1937.PDB 270 270 6.57068
    EP1937.PDB 271 271 11.7033
    EP1937.PDB 272 272 16.2944
    EP1937.PDB 273 273 11.7734
    EP1937.PDB 274 274 8.73714
    EP1940.PDB 250 250 11.7256
    EP1940.PDB 251 251 13.3999
    EP1940.PDB 252 252 6.52818
    EP1940.PDB 253 253 8.41266
    EP1940.PDB 254 254 14.1372
    EP1940.PDB 255 255 10.5523
    EP1940.PDB 256 256 5.54926
    EP1940.PDB 257 257 12.544
    EP1940.PDB 258 258 13.0304
    EP1940.PDB 259 259 6.3614
    EP1940.PDB 260 260 9.26743
    EP1940.PDB 261 261 14.8251
    EP1940.PDB 262 262 11.0243
    EP1940.PDB 263 263 6.09589
    EP1940.PDB 264 264 13.2229
    EP1940.PDB 265 265 14.4447
    EP1940.PDB 266 266 7.83723
    EP1940.PDB 267 267 10.0536
    EP1940.PDB 268 268 16.3468
    EP1940.PDB 269 269 13.4618
    EP1940.PDB 270 270 7.98931
    EP1940.PDB 271 271 14.8577
    EP1940.PDB 272 272 17.9952
    EP1940.PDB 273 273 12.2682
    EP1940.PDB 274 274 10.2391
    where the first column is the PDB ID, the second and third are the residue position, and the fourth is the distance.
    What I am trying to do is calculate the average distance for each residue position, along with the standard deviation (SD).
    For example, for residue position 250 the program should select all the distance values at residue number 250, calculate their average, and then calculate the SD.
    Finally, it should print the residue number, the average value, and the SD.

    I have written some code, but it is not able to select the specified residue and do the calculations.

    Code:
    #!/usr/bin/perl
    use strict;
    use warnings;
    
    my (%hash,$respos1,$respos2,$dist,$val,$line,@temp);
    my ($count,$dis) = 0;
    
    
    open (FH,"caca.dat") or die "Check the file";
    while (<FH>)
    {
        $line = $_;
        chomp $_;
        @temp = split (/\s/,$line);
        $respos1 = $temp[1];
        $respos2 = $temp[2];
        $dist    = $temp[3];
        $hash{$respos1} = $dist;
    }
    
        for ($respos1=250;$respos1<=274;$respos1++)
        {
            if ($respos1 == $respos2)
            {
                $dis = $dis + $dist;
                $count++;
            }
        }
    Since the average is not being calculated correctly, I have not attempted the SD part yet.

    Any directions will be helpful.

    Thanks
    Kumar
  • KevinADC
    Recognized Expert Specialist
    • Jan 2007
    • 4092

    #2
    I'm not sure I got this right, but it should help, or at least be easy enough to fix (I think).

    Code:
    #!/usr/bin/perl
    use strict;
    use warnings;
    
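    my %hash;
    
    # Columns: ID          r1  r2  distance
    # e.g.     EP1935.PDB  267 267 9.48097
    
    open(my $fh, '<', 'caca.dat') or die "Cannot open caca.dat: $!";
    while (my $line = <$fh>) {
        chomp $line;
        # split ' ' ignores leading whitespace and runs of spaces
        my ($r1, $dist) = (split ' ', $line)[1,3];
        next unless defined $dist;    # skip blank or malformed lines
        $hash{$r1}{'sum'} += $dist;   # running total of distances for this residue
        $hash{$r1}{'count'}++;        # number of distances seen for this residue
    }
    close $fh;
    foreach my $key (sort {$a <=> $b} keys %hash) {
       my $avg = sprintf "%.3f", $hash{$key}{'sum'} / $hash{$key}{'count'};
       print "The average distance for $key is $avg\n";
    }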
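
The running-total approach above can be extended to give the SD as well. Here is a minimal sketch, assuming the sample SD (n-1 divisor) is what is wanted: it keeps a count, a sum, and a sum of squares per residue position. The sub name `residue_stats` and the inline sample lines are illustrative assumptions; in a real script the lines would come from caca.dat. Note that the sum-of-squares shortcut can lose precision on large or tightly clustered values, so treat it as a sketch rather than a definitive implementation.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: takes data lines, returns { position => [mean, sd] }.
sub residue_stats {
    my @lines = @_;
    my %acc;
    for my $line (@lines) {
        # split ' ' skips leading whitespace and collapses runs of spaces
        my (undef, $pos, undef, $dist) = split ' ', $line;
        next unless defined $dist;            # skip blank or short lines
        $acc{$pos}{n}++;
        $acc{$pos}{sum}   += $dist;
        $acc{$pos}{sumsq} += $dist * $dist;
    }
    my %stats;
    for my $pos (keys %acc) {
        my ($n, $sum, $sumsq) = @{ $acc{$pos} }{qw(n sum sumsq)};
        my $mean = $sum / $n;
        # Sample variance from the running sums; defined as 0 for a single value
        my $var = $n > 1 ? ($sumsq - $n * $mean**2) / ($n - 1) : 0;
        $var = 0 if $var < 0;                 # guard against tiny negative rounding error
        $stats{$pos} = [ $mean, sqrt($var) ];
    }
    return \%stats;
}

# Three lines taken from the data file in the first post
my $stats = residue_stats(
    'EP1934.PDB 250 250 11.27',
    'EP1935.PDB 250 250 11.793',
    'EP1934.PDB 251 251 12.7332',
);
for my $pos (sort { $a <=> $b } keys %$stats) {
    printf "%d\t%.3f\t%.3f\n", $pos, @{ $stats->{$pos} };
}
```

Each output row is residue number, mean, and SD, separated by tabs.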

    • kumarboston
      New Member
      • Sep 2007
      • 55

      #3
      Hi All,
      Thanks for the reply. I tried to calculate the average and the SD, but something is wrong and I am not sure what.
      Here is my code:
      Code:
      #!/usr/bin/perl
      
      use strict;
      use warnings;
      
      my (@pos,@cadist,$mean,%hash,$respos1,$respos2,$cadist,$val,$line,@temp,@respos2);
      my ($cnt,$dis,$sum) = 0;
      
      open (FH,"caca.dat") or die "Check the file";
      while (<FH>)
      {
          $line = $_;chomp $_;
          @temp = split (/\s/,$line);
          $respos1   = $temp[1];
          $respos2   = $temp[2];
          $cadist    = $temp[3];
          for(my $i=250;$i<=274;$i++)
          {
              if ($i == $respos2)
              {
                  push (@cadist,$cadist);
                  push (@pos,$respos2);
                  $sum +=$cadist;
                  $cnt++;
              }
          }
      }
      
      $mean = $sum/$cnt;
      @cadist = ();
      
      my $summ = 0;
      my $deviation;
      
      foreach my $val(@cadist)
      {
          my $abar = (($val-$mean)**2);
          $summ   += $abar;
      }
      $deviation = sprintf "%.5f",sqrt($summ/($cnt-1));
      print "$pos[0] $mean $deviation\n";
      If I remove the for loop and simply put a fixed value in the if statement for comparison, everything works fine, but when I test the condition for every residue position the calculation goes wrong.
      Thanks
      Kumar

      • KevinADC
        Recognized Expert Specialist
        • Jan 2007
        • 4092

        #4
        Well, if everything else is correct in your code, this line needs to be removed:

        @cadist = (); (line 30)

        That empties the array, discarding all the values previously pushed onto it.
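
A small sketch of what that assignment does, using the first three residue-250 distances from the data file in the first post. The counter `$visited` is added purely for illustration and is not part of the posted script.

```perl
use strict;
use warnings;

# Three distances for residue 250 (EP1934, EP1935, EP1936)
my @cadist = (11.27, 11.793, 12.0315);

my $mean = 0;
$mean += $_ for @cadist;
$mean /= @cadist;                 # array in numeric context gives its length

@cadist = ();                     # the line in question: the array is now empty

my $summ    = 0;
my $visited = 0;
foreach my $val (@cadist) {       # nothing left to iterate over
    $summ += ($val - $mean)**2;
    $visited++;
}
print "elements visited: $visited, sum of squared deviations: $summ\n";
```

Because the loop body never runs, the sum of squared deviations stays at 0 and so does the SD computed from it.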

        • kumarboston
          New Member
          • Sep 2007
          • 55

          #5
          Thanks for the reply.
          I finally succeeded in calculating the values, but one thing still remains: because of the for loop in the code, the values for each residue are printed repeatedly, which is redundant.
          I am posting the code, which runs on the data file I posted earlier; you can see the results by running it.
          Code:
          #!/usr/bin/perl
          
          use strict;
          use warnings;
          
          my ($line,@temp,$respos1,$respos2,@respos1,@respos2,$cadis,@lstdist,$i,$j,@cadist,@result,$length,$ele,$val,$abar,$deviation);
          my ($cnt,$dis,$sum,$summ,$mean) = 0;
          
          open (FH,"caca.dat") or die "Check the file";
          while (<FH>)
          {
              $line = $_;chomp $_;
              @temp = split (/\s/,$line);
              $respos1   = $temp[1];
              $respos2   = $temp[2];
              $cadis     = $temp[3];
              push(@respos1,$respos1);push(@respos2,$respos2);push(@cadist,$cadis);
          }
          
          for($i=0;$i<@respos1;$i++)
          {
              @lstdist=();
              for($j=0;$j<@respos2;$j++)
              {
                  if ($respos1[$i] == $respos2[$j])
                  {
                      push (@lstdist,$cadist[$j]);
                  }
              }
              @result=&mean(@lstdist);
              print "$respos1[$i]\t$result[0]\t$result[1]\n";
          }
          sub mean
          {
              (@lstdist)=@_;
              $length=scalar(@lstdist);
              $sum=0;$mean=0;$summ=0;
              foreach $ele(@lstdist)
              {
                  $sum +=$ele;
              }
              $mean=$sum/$length;
              foreach $val(@lstdist)
              {
                  $abar=0;
                  $abar = (($val-$mean)**2);
                  $summ   += $abar;
              }
              $deviation = sqrt($summ/($length-1));
              return($mean,$deviation);
          }
          How can I print the values only once for each residue number?


          Thanks
          Kumar

          • nithinpes
            Recognized Expert Contributor
            • Dec 2007
            • 410

            #6
            You can store the results in a hash of arrays and print them after the loop to avoid duplicate output:
            Code:
            my %result=();  # result hash
            for($i=0;$i<@respos1;$i++)
            {
                @lstdist=();
                for($j=0;$j<@respos2;$j++)
                {
                    if ($respos1[$i] == $respos2[$j])
                    {
                        push(@lstdist,$cadist[$j]);
                    }
                }
                @result=&mean(@lstdist);
                $result{$respos1[$i]} = [$result[0],$result[1]]; # hash of arrays: position => [mean, SD]
            }
            
            # sort numerically so positions print in residue order
            foreach(sort { $a <=> $b } keys %result) {
                print "$_\t$result{$_}[0]\t$result{$_}[1]\n"; # display result
            }
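
An alternative worth considering is to avoid the nested loop altogether: collect the distances per residue into a hash of arrays while reading, then compute the mean and SD once per key. The sketch below is only one way to do it. `mean_sd` is a hypothetical rewrite of the `mean` sub from post #5 using lexical variables, `group_by_position` is an illustrative helper, and the inline lines stand in for caca.dat. Returning an SD of 0 for a single value is an assumption made here to avoid dividing by zero.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical lexical-variable version of the mean sub from post #5.
# Returns (mean, sample SD); SD is defined as 0 when only one value is given.
sub mean_sd {
    my @values = @_;
    my $n = @values;
    return (undef, undef) unless $n;
    my $sum = 0;
    $sum += $_ for @values;
    my $mean = $sum / $n;
    my $sq = 0;
    $sq += ($_ - $mean)**2 for @values;
    my $sd = $n > 1 ? sqrt($sq / ($n - 1)) : 0;
    return ($mean, $sd);
}

# Hypothetical helper: group the fourth column (distance) by the second (position).
sub group_by_position {
    my %by_pos;
    for my $line (@_) {
        my @f = split ' ', $line;
        next unless @f >= 4;                  # skip blank or short lines
        push @{ $by_pos{ $f[1] } }, $f[3];
    }
    return \%by_pos;
}

# A few lines from the data file in the first post
my @lines = (
    'EP1934.PDB 250 250 11.27',
    'EP1934.PDB 251 251 12.7332',
    'EP1935.PDB 250 250 11.793',
    'EP1935.PDB 251 251 13.2081',
    'EP1936.PDB 250 250 12.0315',
    'EP1936.PDB 251 251 13.6144',
);

my $by_pos = group_by_position(@lines);
for my $pos (sort { $a <=> $b } keys %$by_pos) {
    my ($mean, $sd) = mean_sd(@{ $by_pos->{$pos} });
    printf "%d\t%.4f\t%.4f\n", $pos, $mean, $sd;
}
```

Each residue number is a hash key, so it is printed exactly once, and the file is scanned only one time instead of once per line.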

            • kumarboston
              New Member
              • Sep 2007
              • 55

              #7
              Thanks, Nithinpes and all, for the suggestions. The program now works perfectly.

              Thanks
              Kumar
