looping through a big file containing a set of files.

Collapse
X
 
  • Time
  • Show
Clear All
new posts
  • aboxylica
    New Member
    • Jul 2007
    • 111

    #31
    I dont seem to get a hang of it. it is not working for all the sequences and all the matrices.what do i do?
    waiting for your reply,
    cheers!

    Comment

    • bvdet
      Recognized Expert Specialist
      • Oct 2006
      • 2851

      #32
      Originally posted by aboxylica

      This is the code that works and calculates for a single sequence and a single matrix(containi ng 16 positions) I want to do it for many sequences and many matrices.I guess am clearer now.I have given how my sequences and matrices look like.I just need to generalize it.am i clearer now
      waiting for ur reply
      cheers!
      You have a mistake in the code where the data is normalized:
      Code:
      datadict1=datadict.copy()
      for [B][U]keymain[/U][/B] in datadict:
          for keysub in datadict[keymain]:
              datadict1[keymain][keysub] /= (sum(values[int(keysub)-1])+4)
      You had keysub where it should have been keymain. The division on the last line can be done with the division in place operator '/='.

      Comment

      • bvdet
        Recognized Expert Specialist
        • Oct 2006
        • 2851

        #33
        Originally posted by aboxylica
        Code:
        res=1
        part=""
        q=len(p)
        seqq=""
         
        value={"A":0.3,"T":0.3,"C":0.2,"G":0.2}
        for i in range(q-16):
            part=p[i:i+16]
            seqq=part
            res=1
            score=1
            for j in range(16):
                key=seqq[j]
                res=res*datadict1[key]["%02d"%(j+1)]
                #print res
            for key in seqq:
                score=score * value[key]
            #print score,"*******************",res
            log_ratio=log10(res/score)
            print i,log_ratio
        This is the code that works ......
        There are multiple sets of arrays and sequences in the data files you presented. By data set, I am referring to one set of data in each file, and assume that the first set of array data and the first set of sequence data should work together, and the second set of array data .........

        The code above is where I get lost. The sequence is 50 characters long. There are 16 sets of numbers for 'A', 'T', 'C', and 'G'. Why does the code generate 33 calculations?

        I am trying to understand the calculation you want.
        Example:
        sequence = 'ATTC.........'
        dataDict is defined with main keys of 'A', 'C', 'G', and 'T'
        Each key has 16 sub keys, 01-16 in the first set of data

        (dataDict['A']['01']/0.3) * (dataDict['T']['02']/0.3) * (dataDict['T']['03']/0.3) * (dataDict['C']['04']/0.2) * ............... ...

        What is a sequence exactly? This is the first set of sequence data in your file:
        Code:
        >CG9571_O-E|Drosophila melanogaster|CG9571|FBgn0031086|X:19926374..19927133
        CCAGTCCACCGGCCGCCGATCTATTTATACGAGAGGAAGAGGCTGAACTCGAGGATTACCCGTGTATCCTGGGACGCG
        GATTAGCGATCCATTCCCCTTTTAATCGCCGCGCAAACAGATTCATGAAAGCCTTCGGATTCATTCATTGATCCACAT
        CTACGGGAACGGGAGTCGCAAACGTTTTCGGATTAGCGCTGGACTAGCGGTTTCTAAATTGGATTATTTCTACCTGAC
        CCTGGAGCCATCGTCCTCGTCCTCCGTCCCTTAGCGCCTCCTGCATGGATGTCGTTTTTGGGTTTCATACCTTTTCAC
        ACTGGAAAAATACGGAATTTGTTGTAAGCCCTTTCAAGACGAATGGGATTTAGCTTCGGATGTCAACGTCACCATAAT
        CATATTAGGAATATTTCTACTCAATTGCAATATTGGTACTTTTCTGACTGTAAACGCGATGATAATTACAAATATGCC
        TAATTTGCTGTCTTTATAATCAAATGGAGTTCTTTATATTTCCAAAATATTGAAATTCCGATTCCCTAGAAAATAATA
        CGTTTTTCTGTTATTAATAAAAAACCAATAGGAAAGTTCTCAAAAATTACTCTGTTGTATTTGATCATTTCTTTTCCG
        GTATAATCTTTTATTTTAAGCATTCCCATGTGAATAAATTTCAGACTAATGTATTAATAAGATGTCGTGTTTTTCCAC
        TTACAAATTTCTCATACAGCTGGATATATACTACGAGTACTATACACATGCTCTGGG
        Is each line a separate sequence, or all the lines combined one sequence?

        Comment

        • aboxylica
          New Member
          • Jul 2007
          • 111

          #34
          This is my full code now.there seems to be problems.But I cant really figure it out!!

          Code:
          def parseArray(fn, dataset=1, key='PO', term='/'):
              '''
              Read a formatted data file in matrix format and
              compile data into a dictionary
              '''
              f = open(fn)
           
              # skip to required data set
              for _ in range(dataset):
                  try:
                      line = f.next()
                      while not line.startswith(key):
                          line = f.next()
                  except StopIteration, e:
                      print 'We have reached the end of the file!'
                      f.close()
                      return False
           
              headerList = line.strip().split()[1:]
              lineList = []
           
              line = f.next().strip()
              while not line.startswith(term):
                  if line != '':
                      lineList.append(line.strip().split())
                  line = f.next().strip()
           
              f.close()
           
              # Key list
              keys = [i[0] for i in lineList]
              print keys
              # Values list
              values = [[float(s) for s in item] for item in [j[1:] for j in lineList]]
           
              # Create a dictionary from keys and values
              lineDict = dict(zip(keys, values))
           
              dataDict = {}
           
              for i, item in enumerate(headerList):
                  dataDict[item] = {}
                  for key in lineDict:
                      dataDict[item][key] = lineDict[key][i]
           
              # Add 1.0 to every element in dataDict subdictionaries
              for keyMain in dataDict:
                  for keySub in dataDict[keyMain]:
                      dataDict[keyMain][keySub] += 1.0
           
              # Normalize original data (with 1 added) and update data
              valueSums = [sum(item)+4 for item in values]
              dataDict1=dataDict.copy()
           
              for keyMain in dataDict:
                  for keySub in dataDict[keyMain]:
                      dataDict1[keyMain][keySub] /= (sum(valueSums[int(keySub)-1]+4
           
                  return dataDict1
           
           
          def parseData(fn, dataset=1, key='>'):
              '''
              Read a formatted data file of alpha sequences
              Return a list of sequences
              '''    
              # initialize output list
              dataList = []
              
              # open file for reading
              f = open(fn)
              
              # skip to required data set
              for _ in range(dataset):
                  try:
                      s = f.next()
                      while not s.startswith(key):
                          s = f.next()
                  except StopIteration, e:
                      print 'We have reached the end of the file!'
                      f.close()
                      return False
           
              for line in f:
                  if not line.startswith(key):
                      dataList.append(line.strip())
                  else:
                      break
           
              f.close()
              return dataList
           
          if __name__ == '__main__':
              value={"A":0.3,"T":0.3,"C":0.2,"G":0.2}
              fnArray = 'all_redfly.transfac.txt'
              fnSeq = 'redfly_sequence.fasta'
              dataset = 3
              dataArray = parseArray(fnArray, dataset)
              dataSeq = parseData(fnSeq, dataset)
              seq = ''.join(dataSeq[1:])
              subKeys = dataArray['A'].keys()
              subKeys.sort()
           
              i,j = divmod(len(seq), len(subKeys))
              keys = subKeys*i + subKeys[:j]
           
              print dataSeq[0],
              outList = ['%s[%s]*%s = %0.4f' % (s, keys[i], s, dataArray[s][keys[i]]*value[s]) for i, s in enumerate(seq)]
              print '\n'.join(outList)
              print sum([float(s.split('=')[1]) for s in outList])
          waiting for ur reply,
          cheers!

          Comment

          • aboxylica
            New Member
            • Jul 2007
            • 111

            #35
            Originally posted by bvdet
            There are multiple sets of arrays and sequences in the data files you presented. By data set, I am referring to one set of data in each file, and assume that the first set of array data and the first set of sequence data should work together, and the second set of array data .........

            The code above is where I get lost. The sequence is 50 characters long. There are 16 sets of numbers for 'A', 'T', 'C', and 'G'. Why does the code generate 33 calculations?

            I am trying to understand the calculation you want.
            Example:
            sequence = 'ATTC.........'
            dataDict is defined with main keys of 'A', 'C', 'G', and 'T'
            Each key has 16 sub keys, 01-16 in the first set of data

            (dataDict['A']['01']/0.3) * (dataDict['T']['02']/0.3) * (dataDict['T']['03']/0.3) * (dataDict['C']['04']/0.2) * ............... ...

            What is a sequence exactly? This is the first set of sequence data in your file:
            Code:
            >CG9571_O-E|Drosophila melanogaster|CG9571|FBgn0031086|X:19926374..19927133
            CCAGTCCACCGGCCGCCGATCTATTTATACGAGAGGAAGAGGCTGAACTCGAGGATTACCCGTGTATCCTGGGACGCG
            GATTAGCGATCCATTCCCCTTTTAATCGCCGCGCAAACAGATTCATGAAAGCCTTCGGATTCATTCATTGATCCACAT
            CTACGGGAACGGGAGTCGCAAACGTTTTCGGATTAGCGCTGGACTAGCGGTTTCTAAATTGGATTATTTCTACCTGAC
            CCTGGAGCCATCGTCCTCGTCCTCCGTCCCTTAGCGCCTCCTGCATGGATGTCGTTTTTGGGTTTCATACCTTTTCAC
            ACTGGAAAAATACGGAATTTGTTGTAAGCCCTTTCAAGACGAATGGGATTTAGCTTCGGATGTCAACGTCACCATAAT
            CATATTAGGAATATTTCTACTCAATTGCAATATTGGTACTTTTCTGACTGTAAACGCGATGATAATTACAAATATGCC
            TAATTTGCTGTCTTTATAATCAAATGGAGTTCTTTATATTTCCAAAATATTGAAATTCCGATTCCCTAGAAAATAATA
            CGTTTTTCTGTTATTAATAAAAAACCAATAGGAAAGTTCTCAAAAATTACTCTGTTGTATTTGATCATTTCTTTTCCG
            GTATAATCTTTTATTTTAAGCATTCCCATGTGAATAAATTTCAGACTAATGTATTAATAAGATGTCGTGTTTTTCCAC
            TTACAAATTTCTCATACAGCTGGATATATACTACGAGTACTATACACATGCTCTGGG
            Is each line a separate sequence, or all the lines combined one sequence?
            Each sequence refers starts with
            >sign(list of A<T<G<C characters... and then second sequence starts with > sign followed by characters.. so ..on.. am i clearer??

            Comment

            • aboxylica
              New Member
              • Jul 2007
              • 111

              #36
              THIS IS THE FIRST SEQUENCE
              :>CG9571_O-E|Drosophila
              melanogaster|CG 9571|FBgn003108 6|X:19926374..1 99271 33
              CCAGTCCACCGGCCG CCGATCTATTTATAC GAGAGGAAGAGGCTG AACTC GAGGATTACCCGTGT ATCCTGGGACGCG
              GATTAGCGATCCATT CCCCTTTTAATCGCC GCGCAAACAGATTCA TGAAA GCCTTCGGATTCATT CATTGATCCACAT
              CTACGGGAACGGGAG TCGCAAACGTTTTCG GATTAGCGCTGGACT AGCGG TTTCTAAATTGGATT ATTTCTACCTGAC
              CCTGGAGCCATCGTC CTCGTCCTCCGTCCC TTAGCGCCTCCTGCA TGGAT GTCGTTTTTGGGTTT CATACCTTTTCAC
              ACTGGAAAAATACGG AATTTGTTGTAAGCC CTTTCAAGACGAATG GGATT TAGCTTCGGATGTCA ACGTCACCATAAT
              CATATTAGGAATATT TCTACTCAATTGCAA TATTGGTACTTTTCT GACTG TAAACGCGATGATAA TTACAAATATGCC
              TAATTTGCTGTCTTT ATAATCAAATGGAGT TCTTTATATTTCCAA AATAT TGAAATTCCGATTCC CTAGAAAATAATA
              CGTTTTTCTGTTATT AATAAAAAACCAATA GGAAAGTTCTCAAAA ATTAC TCTGTTGTATTTGAT CATTTCTTTTCCG
              GTATAATCTTTTATT TTAAGCATTCCCATG TGAATAAATTTCAGA CTAAT GTATTAATAAGATGT CGTGTTTTTCCAC
              TTACAAATTTCTCAT ACAGCTGGATATATA CTACGAGTACTATAC ACATG CTCTGGG
              # THIS IS THE SECOND SEQUENCE
              >Cp36_DRR|Droso phila melanogaster|Cp 36|FBgn0000359| X:8323349..8324 136
              AGTCGACCAGCACGA GATCTCACCTACCTT CTTTATAAGCGGGGT CTCTA GAAGCTAAATCCATG TCCACGTCAAACC
              AAAGACTTGCGGTCT CCAGACCATTGAGTT CTATAAATGGGACTG AGCCA CACCATACACCACAC ACCACACATACAC
              ACACGCCAACACATT ACACACAACACGAAC TACACAAACACTGAG ATTAA GGAAATTATTAAAAA AAATAATAAAATT
              AATACAAAAAAAATA TATATATATACAAAA ATTTGTTGTGTTTGA ATTGA ATTAAGAGCTTATCA AGAAAAAAATTTC
              AGTGACTCATAATAC ACTACTCTACAAGTT TAAATTGAATCAACA ATTTA ACTTTCATTGCTCAG GTTTTTAGTAACA
              ATGTTTATATAAGTT TAGGTATAACAAATG ATTTAAATATAAGAT ACTGT ATTTCACATTGAGAC GAAACAATCCACC
              GAAAATCATAAAATA TAAGAATGTTGCATT TTATTTTTAAAAATA AAGAT GCCTTTTAAGAGGAA TAACTTAAATGTC
              TTTAATACCTTTGAA TTTAATTATATGGCT AATAAACACAAACTT AAAGC TTAAAACTGCATCGA ATTGAATGCGGTT
              ATAAATGTACTTATA TATCTAATATAATCT GCTAATATGGTTTAC ATGGT ATATCTTTCTCGGAA ATTTTTACAAAAA
              TTATCTATTCATATA TCTCGAGCGTAAGAT ATTTATCAGTTTATA GATAA CATCTTTAAATTTGG GTGATTAAAAAAA
              AACATTG
              #THIS IS THE THIRD SEQUENCE
              >Cp36_PRR|Droso phila melanogaster|Cp 36|FBgn0000359| X:8324430..8324 513
              TCTAGAGATCTGGGC ACGATGGCGAGACAA AGATGCGGCGCAAAA TCGGA AATGGAGATGGATCA CGTAGCCGGCCAT
              GGCGG

              Comment

              • aboxylica
                New Member
                • Jul 2007
                • 111

                #37
                and one more thing.The calculation is not corresponding.
                not like first sequence and first dataset.
                to make it simpler
                just say i have first sequence I am going to find the log value using all the matrices..
                like I get a set of log values for matrix 1,then matrix 2,matrix3...... .so..on
                Now, just realize that there are many sequences too..
                so what il do is. for seqence one find all the values using all the matrices..
                then move to sequence 2..find all the values using all the matrices..
                then move to third sequence..find all the values for all the matrices..so on........
                I think its better now??

                Comment

                • bvdet
                  Recognized Expert Specialist
                  • Oct 2006
                  • 2851

                  #38
                  Originally posted by aboxylica
                  This is my full code now.there seems to be problems.But I cant really figure it out!!

                  Code:
                  def parseArray(fn, dataset=1, key='PO', term='/'):
                      '''
                      Read a formatted data file in matrix format and
                      compile data into a dictionary
                      '''
                      .....................................
                      .....................................
                      print '\n'.join(outList)
                      print sum([float(s.split('=')[1]) for s in outList])
                  waiting for ur reply,
                  cheers!
                  'parseArray' returns the array dictionary after adding 1 and normalizing each array element. 'parseData' returns a list of strings - the first string is the header and the rest comprise the sequence. Based on what you have communicated, this is part of the calculation printout:
                  Code:
                  >CG9571_O-E|Drosophila melanogaster|CG9571|FBgn0031086|X:19926374..19927133
                  C[01]/C = 1.922438822066
                  C[02]/C = 0.207382828702
                  A[03]/A = 0.939745715865
                  G[04]/G = 0.746578183326
                  T[05]/T = 1.200055302088
                  C[06]/C = 0.207296849088
                  C[07]/C = 0.327664869349
                  A[08]/A = 2.918567675930
                  C[09]/C = 0.217751970137
                  C[10]/C = 0.207382828702
                  G[11]/G = 0.207382828702
                  G[12]/G = 0.207382828702
                  C[13]/C = 0.279850746269
                  C[14]/C = 0.709249274160
                  G[15]/G = 3.334715885525
                  C[16]/C = 0.300705101618
                  C[01]/C = 1.922438822066
                  G[02]/G = 2.285358772294
                  A[03]/A = 0.939745715865
                  T[04]/T = 0.138255219135
                  C[05]/C = 0.207382828702
                  T[06]/T = 0.138197899392
                  A[07]/A = 0.145167980091
                  T[08]/T = 0.138255219135
                  T[09]/T = 0.138255219135
                  T[10]/T = 0.138255219135
                  A[11]/A = 0.138255219135
                  T[12]/T = 0.138255219135
                  A[13]/A = 2.209784411277
                  C[14]/C = 0.709249274160
                  G[15]/G = 3.334715885525
                  A[16]/A = 1.137840453477
                  G[01]/G = 0.207382828702
                  A[02]/A = 0.138255219135
                  G[03]/G = 0.207296849088
                  G[04]/G = 0.746578183326
                  In the first calculation, dataDict['C']['01'] is divided by value['C']. Second - dataDict['C']['02'] is divided by value['C']. When we get to the 17th calculation, we start over with dataDict['X']['01']. Is this right?

                  If we multiply each element together,
                  so it should be done like A[01]*T[02]*T[03]*.............. .../0.3*0.3*0.3.... ........
                  we get an incredibly small number like:
                  Code:
                  >>> result
                  2.4243160673268744e-210
                  >>>

                  Comment

                  • aboxylica
                    New Member
                    • Jul 2007
                    • 111

                    #39
                    yes, we have to multiply it together only
                    to be clearer
                    CCTATA..for simplicity let us store them seperately and then divide
                    numerator=C[01]*C{02]*c[01]..... this is for postion one
                    corresponding denominator will be C*C*T*A*T*T*A(t he standard values should me multiplied)
                    formula is log(numerator/denominator)
                    so it will be
                    FOR POSITION ONE -->log ((C[01]*C{02]*c[01]..... )/ C*C*T*A*T*T*A(t he standard values should me multiplied))
                    FOR POSITION 2--->log((C[01]*T[01]*A[01].../C*T*A
                    (the standard values here))
                    is it clearer?
                    waiting for ur reply,
                    cheers!!

                    Comment

                    • aboxylica
                      New Member
                      • Jul 2007
                      • 111

                      #40
                      we should not divide individualy!!we should calculatte for the entire seuence in the first position)
                      if this is the ccccctaaaaaaa
                      get the corresponding values for all the positions say that will be numerator
                      then get the standard values for the denominator multiply them
                      then do log(numerator/denominator)
                      ok??we are not dividing individually!!
                      am sorry for all the trouble but am finding it difficult to do..thats y am bugging u!!

                      Comment

                      • elbin
                        New Member
                        • Jul 2007
                        • 27

                        #41
                        Bvdet, to make it easier for you too, this is the code generating the calculation (log value)

                        Code:
                        q=len(p)
                        value={"A":0.3,"T":0.3,"C":0.2,"G":0.2}
                        for i in range(q-len(datadict['A'])):
                            part=p[i:i+len(datadict['A'])]
                            seqq=part
                            res=1
                            score=1
                            for j in range(len(datadict['A'])):
                                key=seqq[j]
                                res=res*datadict[key]["%02d"%(j+1)]
                            for key in seqq:
                                score=score * value[key]
                            log_ratio=log10(res/score)
                            print i,log_ratio
                        where datadict is the normalized matrix dictionary, and p is the WHOLE sequence between the header line starting with '>' and the next header, as a string. So, for a sequence of 50 chars length and matrix of 16 lines you get 50 - 16 = 34 log values - from 0 to 33 in the initial example. That is because you evaluate with the matrix a moving window of X characters, where X is the matrix length. And each step the window moves with 1 character forward. Then it should be done as aboxylica said - for each sequence and each matrix.

                        Hope this helps, as I don't have time to do the whole thing... although there's not much to be done.

                        Comment

                        • bvdet
                          Recognized Expert Specialist
                          • Oct 2006
                          • 2851

                          #42
                          Originally posted by aboxylica
                          yes, we have to multiply it together only
                          to be clearer
                          CCTATA..for simplicity let us store them seperately and then divide
                          numerator=C[01]*C{02]*c[01]..... this is for postion one
                          corresponding denominator will be C*C*T*A*T*T*A(t he standard values should me multiplied)
                          formula is log(numerator/denominator)
                          so it will be
                          FOR POSITION ONE -->log ((C[01]*C{02]*c[01]..... )/ C*C*T*A*T*T*A(t he standard values should me multiplied))
                          FOR POSITION 2--->log((C[01]*T[01]*A[01].../C*T*A
                          (the standard values here))
                          is it clearer?
                          waiting for ur reply,
                          cheers!!
                          Now I am confused even more. Your examples are not consistent. The result of calculating a long sequence of small numbers like 0.2*0.3*0.2.... ...... result in incredibly small numbers. The calculation result is 0.0/0.0!

                          For a sequence of CCTATA.........
                          Numerator = C[01]*C{02]*c[01]........ ?
                          Denominator = C*C*T*A*T*T*A.. ....... ?

                          Comment

                          • aboxylica
                            New Member
                            • Jul 2007
                            • 111

                            #43
                            oops... am sorry!!
                            Numerator = C[01]*C{02]*T[03]*A[04]*T[05]*A[06]........until the end of the sequence..we are adding one and then normalizing the matrix so the value will not be 0.0 at any cost..(the values which we are multiplying will be normalized values and not the original values in the matrix).
                            then for the same sequence we calulate the denominator which will be
                            0.2*0.2*0.3*0.3 *0.3*0.3=denomi nator(these are simple standard values)
                            log(numerator/denominator) this is for position one.. then we will do for position two.. then...so on..until there are enough alphabets to calculate(like say the matrix one has 16 postions)then we should stop in a position where the sequence length is less than 16.
                            ok?


                            Originally posted by bvdet
                            Now I am confused even more. Your examples are not consistent. The result of calculating a long sequence of small numbers like 0.2*0.3*0.2.... ...... result in incredibly small numbers. The calculation result is 0.0/0.0!

                            For a sequence of CCTATA.........
                            Numerator = C[01]*C{02]*c[01]........ ?
                            Denominator = C*C*T*A*T*T*A.. ....... ?

                            Comment

                            • elbin
                              New Member
                              • Jul 2007
                              • 27

                              #44
                              Originally posted by bvdet
                              The result of calculating a long sequence of small numbers like 0.2*0.3*0.2.... ...... result in incredibly small numbers. The calculation result is 0.0/0.0!
                              They are small, but divided they give a reasonable number between -5 and 5 (roughly). You have to use numbers like
                              Code:
                              >>> 0.3**16
                              4.3046720999999976e-009
                              but there is a result.

                              Comment

                              • bvdet
                                Recognized Expert Specialist
                                • Oct 2006
                                • 2851

                                #45
                                Originally posted by aboxylica
                                oops... am sorry!!
                                Numerator = C[01]*C{02]*T[03]*A[04]*T[05]*A[06]........until the end of the sequence..we are adding one and then normalizing the matrix so the value will not be 0.0 at any cost..(the values which we are multiplying will be normalized values and not the original values in the matrix).
                                then for the same sequence we calulate the denominator which will be
                                0.2*0.2*0.3*0.3 *0.3*0.3=denomi nator(these are simple standard values)
                                log(numerator/denominator) this is for position one.. then we will do for position two.. then...so on..until there are enough alphabets to calculate(like say the matrix one has 16 postions)then we should stop in a position where the sequence length is less than 16.
                                ok?
                                The first sequence in the sequence data file is 759 characters long. The first array in the array data file has 16 elements. How many numbers are multiplied together? 759 or 16?

                                Comment

                                Working...