Checking for bad dna sequences

  • adriaan
    New Member
    • Dec 2007
    • 7

    Checking for bad dna sequences

    hi all,
    I have an assignment where I have to check multiple human DNA sequences that look like this:

    ORIGIN
    Code:
    1 ttgctgcaga cgctcacccc agacactcac tgcaccggag tgagcgcgac catcatgtcc
    61 atgctcgtgg tctttctctt gctgtggggt gtcacctggg gcccagtgac agaagcagcc
    121 atattttatg agacgcagcc cagcctgtgg gcagagtccg aatcactgct gaaacccttg
    181 gccaatgtga cgctgacgtg ccaggcccac ctggagactc cagacttcca gctgttcaag
    241 aatggggtgg cccaggagcc tgtgcacctt gactcacctg ccatcaagca ccagttcctg
    301 ctgacgggtg acacccaggg ccgctaccgc tgccgctcgg gcttgtccac aggatggacc //
    The DNA starts at ORIGIN and ends at //.
    What I have to find are sequences that represent diseases, for example this one:
    gcttgtccac atattttatg agacgcagcc

    Now this isn't that hard, but the problem is that the sequences to be found can be spread over several lines, and I have to be able to return on which lines they were encountered and at which position.

    thanx in advance

    adriaan
    Last edited by eWish; Dec 15 '07, 10:01 PM. Reason: Added Code Tags
  • eWish
    Recognized Expert Contributor
    • Jul 2007
    • 973

    #2
    Welcome to TSDN!

    Is this homework? If so, please read our Posting Guidelines. Please post the code that you have tried. Also, have a look at CPAN for a module to assist with what you are doing.

    --Kevin
    Last edited by eWish; Dec 15 '07, 10:18 PM. Reason: Corrected Link


    • adriaan
      New Member
      • Dec 2007
      • 7

      #3
      Thanks for the reply.
      Your link doesn't seem to work, so I don't know what you mean by the module thing.
      It's not really homework; it's more of an exercise I was advised to try.
      This is the code I've written to get the evil sequences out of the database.
      I don't have any useful code yet for the analysis part, as I'm still trying to figure out how to do it.

      [CODE=perl]sub haalDataBaseOp
      {
          open(DataBase, "<", "database.txt") or die "Cannot open database.txt: $!";
          @data = <DataBase>;
          close(DataBase);

          foreach $ziekte (@data)
          {
              # collect the disease code: the first run of c/t/g/a plus the rest of the line
              if ($ziekte =~ m/(\b[ctga]+\b)(.*)/)
              {
                  $code = $1.$2;
                  # print $code."\n";
              }

              # extract the number and the name from the string
              if ($ziekte =~ m/(\d+)(.*?)(\b[gtac]+\b)/)
              {
                  # build a hash with the number as key to the disease name
                  $ziektenaam{$1} = $2;
                  print $1."\n";
                  print $code."\n";
                  print $2."\n";

                  # also build a hash in which the number refers to the disease code found
                  $ziektecode{$1} = $code;
              }
          }
      }[/CODE]
      Oh yes, I'm Belgian, so my native language is Dutch and I use it in my comments.


      • eWish
        Recognized Expert Contributor
        • Jul 2007
        • 973

        #4
        Sorry about the link to CPAN. I have corrected it. There are several bioinformatics modules available that are designed to handle your request. Also, check out BioPerl.org; in the long run it will be a better solution.

        --Kevin


        • nithinpes
          Recognized Expert Contributor
          • Dec 2007
          • 410

          #5
          In reply to your initial post, where you wanted to search for a pattern such as
          gcttgtccac atattttatg agacgcagcc, which can extend across multiple lines, and to return the line number and position, the following code works:
          [code=perl]
          $/ = "//"; ## input record separator: each sequence ends with //
          open(DB, "<", "database.txt") or die "sorry: $!";
          $pos = 0;
          $line = 1;
          while (<DB>)
          {
              ## between two blocks, allow whitespace and, after a line break,
              ## the number that starts the next line
              while (/\bgcttgtccac\s+(?:\d+\s+)?atattttatg\s+(?:\d+\s+)?agacgcagcc\b/g)
              {
                  $prev = $`; # text preceding the match
                  $line++ while ($prev =~ /\n/g); # increment whenever a newline occurs
                  @pos = split //, $prev;
                  foreach (@pos)
                  { $pos++ if (/[atgc]/); } # count the residues preceding the match
                  print "\n line:$line";
                  print "\n position: $pos";
                  $line = 1; $pos = 0; # reinitialize variables
              }
          }
          [/code]
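An alternative to spelling the line breaks out in the regex is to join all residues into one string first and remember which input line each residue came from. The following is only a minimal sketch of that idea, not a tested solution for real GenBank files: the helper name find_pattern, the hash keys, and the two-line sample are made up for illustration, and it assumes the sequence contains only a, c, g and t between ORIGIN and //.

```perl
use strict;
use warnings;

# Hypothetical helper: find every occurrence of $pattern in the ORIGIN
# block held in $text. Returns one hash per hit with the 1-based residue
# position and the first and last input line the hit touches.
sub find_pattern {
    my ($text, $pattern) = @_;
    (my $needle = lc $pattern) =~ s/[^acgt]//g;   # drop spaces between blocks
    return unless length $needle;

    my $seq = '';      # all residues joined into one string
    my @line_of;       # $line_of[$i] = input line number of residue $i
    my $line_no = 0;
    my $in_seq  = 0;
    for my $raw (split /\n/, $text) {
        my $line = $raw;
        $line_no++;
        if ($line =~ /^\s*ORIGIN/) { $in_seq = 1; next; }
        next unless $in_seq;
        my $end = ($line =~ s{//.*}{});           # "//" may share a line with data
        (my $residues = lc $line) =~ s/[^acgt]//g; # strips line number and spaces
        push @line_of, ($line_no) x length $residues;
        $seq .= $residues;
        last if $end;
    }

    my @hits;
    my $from = 0;
    while ((my $at = index($seq, $needle, $from)) >= 0) {
        push @hits, {
            position   => $at + 1,
            first_line => $line_of[$at],
            last_line  => $line_of[$at + length($needle) - 1],
        };
        $from = $at + 1;   # step by one so overlapping hits are found too
    }
    return @hits;
}

# Sample: the first two sequence lines from the original post.
my $text = join "\n",
    'ORIGIN',
    '1 ttgctgcaga cgctcacccc agacactcac tgcaccggag tgagcgcgac catcatgtcc',
    '61 atgctcgtgg tctttctctt gctgtggggt gtcacctggg gcccagtgac agaagcagcc //';

for my $hit (find_pattern($text, 'catcatgtcc atgctcgtgg')) {
    print "position $hit->{position}, lines $hit->{first_line}-$hit->{last_line}\n";
}
```

For this sample the pattern starts in the last block of the first sequence line and ends in the first block of the next one, so the script prints "position 51, lines 2-3" (the ORIGIN line counts as line 1). Because the search runs on the joined string, it does not matter how many line breaks a pattern crosses.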
          Regards,
          Nithin
          Last edited by numberwhun; Dec 24 '07, 02:11 PM. Reason: add code tags


          • numberwhun
            Recognized Expert Moderator Specialist
            • May 2007
            • 3467

            #6
            First, when posting code in the forums, please be sure to use the proper code tags. That way, we moderators don't have to clean up behind you and add them to what you just posted (as I have done here).

            Next, just out of curiosity, have you checked out the BioPerl website? I have seen this site recommended to others working with genomics and such, and they have always said it was very helpful.

            Regards,

            Jeff


            • nithinpes
              Recognized Expert Contributor
              • Dec 2007
              • 410

              #7
              Hi Jeff,

              I'm sorry about that. I have checked the BioPerl website; it is indeed very helpful in the long run.

              Regards,
              Nithin
