looping through a big file containing a set of files.

  • aboxylica
    New Member
    • Jul 2007
    • 111

    #91
    I have changed the indentation of this part so that only whatever is >1 is printed. Is it correct? I don't know if it is checking the entire file!
    Code:
    for i, num in enumerate(numList):
            if (log10(num/denList[i]))>=2:
        	    outStr = '\n'.join(['Sequence = %s Calculation = %d' % (seqList[i], res) for i, res in enumerate(resultList)])
       
    	    return 'Array set # = %d\nSequence set # = %d\nSequence Header: %s\n%s' % (arraySet, seqSet, dataSeq[0], outStr)
    Will this check the entire sequence and all the weight matrices?

    • bvdet
      Recognized Expert Specialist
      • Oct 2006
      • 2851

      #92
      Obviously accumulating such a large list is taxing your system. A small adjustment in the code should help, and allow you to see the results for each sequence file.
      [code=Python]
      if __name__ == '__main__':

          import os

          fnArray = 'array.txt'
          outputfile = 'seq_array_output.txt'

          # user has multiple sequence files
          # and must iterate over all the files
          dir_name = r'H:\TEMP\temsys\seq_array'
          fileList = [os.path.join(dir_name, f) for f in os.listdir(dir_name) \
                      if os.path.isfile(os.path.join(dir_name, f))]

          for fnSeq in fileList:
              outList = []
              calcdata = 1
              arraySet = 1
              while calcdata:
                  seqSet = 1
                  while True:
                      calcdata = compileData(fnArray, fnSeq, arraySet, seqSet)
                      if calcdata:
                          outList.append(calcdata)
                          seqSet += 1
                      else:
                          break
                  arraySet += 1
              # write to output file for each sequence file
              f = open(outputfile, 'a')
              f.write('\n'.join(outList))
              f.close()
              print 'Data written for sequence file %s' % fnSeq
      [/code]
      You should check the validity of the calculations on a sample array file and two or three sample sequence files. To limit the output to calculation results that are greater than 1:
      [code=Python]
          # .... (preceding code in compileData omitted)
          resultList = []
          seqList1 = []

          for i, num in enumerate(numList):
              result = log10(num/denList[i])
              # limit output to results greater than 1
              if result > 1:
                  # print 'Seq = %s Result = %0.12f' % (seqList[i], result)
                  resultList.append(result)
                  seqList1.append(seqList[i])

          outStr = '\n'.join(['Sequence = %s Calculation = %0.12f' % \
                              (seqList1[i], res) for i, res in enumerate(resultList)])
          return 'Array set # = %d\nSequence set # = %d\nSequence Header: %s\n%s' % \
                 (arraySet, seqSet, dataSeq[0], outStr)
      [/code]

      • aboxylica
        New Member
        • Jul 2007
        • 111

        #93
        But when I change
        while not calcdata is None:
        to
        while calcdata:
        the array set does not iterate at all; the set stays at one throughout. Why is this happening, and what exactly is the difference between the two lines?

        Also, if any log value is greater than two, I am taking that value as one. Only for these sequences do I want to print the seqSet, arraySet and calculated value, and I want all of these written to a file.

        Will it take a long time to run?

        • bvdet
          Recognized Expert Specialist
          • Oct 2006
          • 2851

          #94
          You are correct. The code should be
          [code=Python]while not calcdata is None:[/code]
          I have one array file and two small sequence files for testing. It takes about 2 or 3 seconds.
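To illustrate the difference between the two loop conditions, here is a minimal sketch. It assumes a stand-in function, `fake_compile_data` (hypothetical, not part of the thread's code), that mimics the return values of `compileData` as posted in this thread: `None` when the array set is exhausted, `False` when the sequence set is exhausted, and a string otherwise. Because `False` is falsy but is not `None`, `while calcdata:` stops the outer loop as soon as the inner loop ends, while `while not calcdata is None:` carries on to the next array set.

```python
def fake_compile_data(array_set, seq_set, n_arrays=3, n_seqs=2):
    """Hypothetical stand-in for compileData.

    Returns None when the array set is exhausted, False when the
    sequence set is exhausted, and a result string otherwise.
    """
    if array_set > n_arrays:
        return None
    if seq_set > n_seqs:
        return False
    return 'array %d / seq %d' % (array_set, seq_set)


def collect(keep_going):
    """Run the nested loops; keep_going decides whether the outer loop continues."""
    out = []
    calcdata = 1
    array_set = 1
    while keep_going(calcdata):
        seq_set = 1
        while True:
            calcdata = fake_compile_data(array_set, seq_set)
            if calcdata:
                out.append(calcdata)
                seq_set += 1
            else:
                break
        array_set += 1
    return out


# Equivalent of "while calcdata:"
truthy_version = collect(lambda c: bool(c))
# Equivalent of "while not calcdata is None:"
not_none_version = collect(lambda c: c is not None)

print(len(truthy_version))    # results from array set 1 only
print(len(not_none_version))  # results from every array set
```

With the truthiness test, the `False` that ends the first run of the inner loop also ends the outer loop, which would explain why the array set never moves past one.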

          • aboxylica
            New Member
            • Jul 2007
            • 111

            #95
            I want the entire result, i.e. all the log values greater than two (I am taking these values as one; I don't care about the values lower than this). I want the entire thing written to a file. Will that be possible, or will there be a memory error?

            • bvdet
              Recognized Expert Specialist
              • Oct 2006
              • 2851

              #96
              Originally posted by aboxylica
              I want the entire result, i.e. all the log values greater than two (I am taking these values as one; I don't care about the values lower than this). I want the entire thing written to a file. Will that be possible, or will there be a memory error?
              Try and see. The modified code from my earlier post writes the data to file for each sequence file. That should solve your memory error, but the only way to find out for sure is to try it.

              • aboxylica
                New Member
                • Jul 2007
                • 111

                #97
                Yup, I got the output file for the last sequence file; it is about 87 KB. If I want all the values for all the 5000 files in my folder, it will take about 440 MB. Is it possible to write such a huge file? I need the entire output.

                • aboxylica
                  New Member
                  • Jul 2007
                  • 111

                  #98
                  Okay, I don't want to do that, because this is not the final result of my task, so I am not going to write it to a file.
                  What I should be doing is this:
                  I get the results that have log values of one, with four headers, like
                  >skud
                  >scer
                  >smik
                  >spar

                  If my result list has something like this:
                  Array set # = 1
                  Sequence set # = 3
                  Sequence Header: >Skud YPR204W c2068:3442..4218
                  Sequence = AACTGTACACT Calculation = 1
                  Sequence = GATTATAACAA Calculation = 1
                  Sequence Header: >Smik YPR204W c2854:249..1235
                  Sequence = GAAGAGATACTAACAA Calculation = 1
                  Sequence = AATGACCGCGGCTCTT Calculation = 1

                  Array set # = 3
                  Sequence set # = 3
                  Sequence Header: >Skud YPR204W c2068:3442..4218
                  Sequence = AACTGTACACTGATTA Calculation = 1
                  Sequence = TAACAAGAACGGTTCA Calculation = 1
                  Sequence = TCGGAGCCTCGACTAA Calculation = 1
                  Sequence = AGACACTTGACGGACT Calculation = 1
                  Sequence = CACTTCAGATTACTTG Calculation = 1

                  Array set # = 3
                  Sequence set # = 4


                  This is a part of the file.
                  What I should do is count the number of ones at every occurrence of a header, write the counts to a list, and find the average.
                  For example, skud = [2, 5] (at the first occurrence of skud there were 2 ones, then there were 5 ones) and smik = [2] (there is only one occurrence of this, with two ones calculated).
                  Now I should compute the average for skud, which is (5+2)/2 = 3.5 (approx. 3), and similarly for smik, 2/1 = 2.

                  Waiting for your reply,
                  cheers!
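The counting described in this post can be sketched as a single function that walks over the result lines. This is only one possible approach, not code from the thread: the function name `average_hits`, the choice to skip header occurrences that have no calculation lines, and the capitalised species names are all assumptions made for illustration. It takes any iterable of lines, so it could be fed lines from a file or from strings held in memory.

```python
from collections import defaultdict


def average_hits(lines):
    """Average number of 'Calculation' lines per header occurrence, by species.

    Header occurrences followed by no calculation lines are ignored
    (an assumption; count them as zero if that suits the task better).
    """
    counts = defaultdict(list)   # species -> one count per header occurrence
    current = None               # species whose header was seen most recently
    for line in lines:
        if 'Sequence Header:' in line:
            # the species name is the first word after '>'
            current = line.split('>', 1)[1].split()[0].capitalize()
            counts[current].append(0)
        elif 'Calculation' in line and current is not None:
            counts[current][-1] += 1
    averages = {}
    for species, per_occurrence in counts.items():
        nonzero = [n for n in per_occurrence if n > 0]
        if nonzero:
            averages[species] = sum(nonzero) / float(len(nonzero))
    return averages


sample = """\
Sequence Header: >Skud YPR204W c2068:3442..4218
Sequence = AACTGTACACT Calculation = 1
Sequence = GATTATAACAA Calculation = 1
Sequence Header: >Smik YPR204W c2854:249..1235
Sequence = GAAGAGATACTAACAA Calculation = 1
Sequence = AATGACCGCGGCTCTT Calculation = 1
Sequence Header: >Skud YPR204W c2068:3442..4218
Sequence = AACTGTACACTGATTA Calculation = 1
Sequence = TAACAAGAACGGTTCA Calculation = 1
Sequence = TCGGAGCCTCGACTAA Calculation = 1
Sequence = AGACACTTGACGGACT Calculation = 1
Sequence = CACTTCAGATTACTTG Calculation = 1
""".splitlines()

result = average_hits(sample)
print(result)
```

On the sample lines from this post, Skud gets counts of 2 and 5 and Smik a single count of 2, so the averages come out as 3.5 and 2.0.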

                  • aboxylica
                    New Member
                    • Jul 2007
                    • 111

                    #99
                    I have done this average calculation on a part of my output file, but I should be doing this on the result list directly. I don't know exactly how to do that. Here is the code for calculating the average for a small part of my output file:
                    Code:
                    # Open the file. The file is in the same directory as the source code.
                    f = file("result.txt")

                    # Read the contents of the file into list a
                    a = f.readlines()
                    f.close()

                    # Return the four-character yeast name that follows '>' in a header line,
                    # e.g. 'Sequence Header: >Skud YPR204W ...' returns 'Skud'
                    def readname(r):
                        indx = r.index('>')
                        return r[indx+1:indx+5]

                    # 'mylist' holds only the data we require:
                    # yeast names (taken from header lines) and calculation lines
                    mylist = []
                    for i in a:
                        if i.find('Header') != -1:
                            mylist.append(readname(i))
                        if i.find('Calculation') != -1:
                            mylist.append(i)

                    # Number of ones counted for each yeast. Every count starts at zero
                    # and is incremented for the corresponding calculation lines.
                    yeasts = {'Skud':0,'Smik':0,'Scer':0,'Spar':0,'Sbay':0}

                    # Number of header occurrences that are followed by at least one
                    # calculation line (valid yeasts). Blank occurrences are skipped.
                    valid = {'Skud':0,'Smik':0,'Scer':0,'Spar':0,'Sbay':0}

                    # Dictionary to store the averages
                    average = {}

                    # Count the ones for each species
                    current = None
                    for v in range(len(mylist)):
                        s = mylist[v]
                        if 'Calculation' in s:
                            yeasts[current] += 1
                        else:
                            current = s
                            if v+1 < len(mylist) and 'Calculation' in mylist[v+1]:
                                valid[current] += 1

                    # Calculate the averages for output
                    for keys in yeasts:
                        if yeasts[keys] == 0 or valid[keys] == 0:
                            print "Yeast %s has no occurrence" % keys
                        else:
                            # float() avoids integer division, so (2+5)/2 gives 3.5 rather than 3
                            average[keys] = float(yeasts[keys])/valid[keys]

                    print yeasts
                    print valid
                    for keys in average:
                        print "%s Average is : %0.2f" % (keys, average[keys])
                    I don't know exactly how to tie this into my main program. Should I put this under main or in a separate function, as this is calculating from what the code is computing?
                    Waiting for your reply,
                    cheers!

                    • aboxylica
                      New Member
                      • Jul 2007
                      • 111

                      I want to know whether I should write a separate function to compute the average, or whether I can do it in the main part.

                      • aboxylica
                        New Member
                        • Jul 2007
                        • 111

                        I want to print the header only if it satisfies the following condition.
                        How do I do it?
                        Code:
                        return 'Sequence Header: %s\n%s' % (dataSeq[0], outStr)if (log10(num/denList[i]))>=2

                        • bvdet
                          Recognized Expert Specialist
                          • Oct 2006
                          • 2851

                          Originally posted by aboxylica
                          I want to print the header only if it satisfies the following condition.
                          How do I do it?
                          Code:
                          return 'Sequence Header: %s\n%s' % (dataSeq[0], outStr)if (log10(num/denList[i]))>=2
                          [code=Python]
                          def compileData(fnArray, fnSeq, arraySet=1, seqSet=1):
                              # sequence factor dictionary
                              value = {"A":0.3, "T":0.3, "C":0.2, "G":0.2}

                              dataArray = parseArray(fnArray, arraySet)
                              if dataArray:
                                  dataSeq = parseData(fnSeq, seqSet)
                                  if not dataSeq:
                                      return False
                              else:
                                  return None

                              # This is the complete sequence
                              seq = ''.join(dataSeq[1:])
                              # These are the subkeys of dataArray - '01', '02', '03', ...
                              subKeys = dataArray['A'].keys()
                              subKeys.sort()

                              # Calculate num/den for each slice of sequence
                              # Each sequence slice length = length of subKeys
                              # Example:
                              # seq = 'ATCGATA'
                              # subKeys length = 3
                              # 'ATC', 'TCG', 'CGA', 'GAT', 'ATA'
                              numList = []
                              denList = []
                              seqList = []
                              for i in range(len(seq) - len(subKeys) + 1):
                                  subseq = seq[0:len(subKeys)]
                                  seqList.append(subseq)
                                  num, den = 1, 1
                                  for j, s in enumerate(subseq):
                                      num *= dataArray[s][subKeys[j]]
                                      den *= value[s]
                                  numList.append(num)
                                  denList.append(den)
                                  seq = seq[1:]

                              resultList = []
                              seqList1 = []

                              for i, num in enumerate(numList):
                                  result = log10(num/denList[i])
                                  # limit output to results greater than 2
                                  if result > 2:
                                      # print 'Seq = %s Result = %0.12f' % (seqList[i], result)
                                      resultList.append(result)
                                      seqList1.append(seqList[i])

                              outStr = '\n'.join(['Sequence = %s Calculation = %0.12f' % \
                                                  (seqList1[i], res) for i, res in enumerate(resultList)])
                              return 'Array set # = %d\nSequence set # = %d\nSequence Header: %s\n%s' % \
                                     (arraySet, seqSet, dataSeq[0], outStr)
                          [/code]
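The slicing and scoring steps in `compileData` can be tried out in isolation with a small sketch like the one below. The helper names `windows` and `log_odds` and the three-column `toy_matrix` are hypothetical and exist only for illustration; the background frequencies are copied from the `value` dictionary, and the matrix is indexed as `matrix[base][column]`, following how `dataArray` is used in the code above.

```python
from math import log10

# Background frequencies, as in the `value` dictionary in compileData
BACKGROUND = {"A": 0.3, "T": 0.3, "C": 0.2, "G": 0.2}


def windows(seq, width):
    """Return every overlapping slice of seq that is width characters long."""
    return [seq[i:i + width] for i in range(len(seq) - width + 1)]


def log_odds(subseq, matrix, columns):
    """log10 of (matrix probability / background probability) for one slice."""
    num, den = 1.0, 1.0
    for base, col in zip(subseq, columns):
        num *= matrix[base][col]
        den *= BACKGROUND[base]
    return log10(num / den)


# Hypothetical 3-column weight matrix, for illustration only
toy_matrix = {
    "A": {"01": 0.7, "02": 0.1, "03": 0.1},
    "T": {"01": 0.1, "02": 0.7, "03": 0.1},
    "C": {"01": 0.1, "02": 0.1, "03": 0.7},
    "G": {"01": 0.1, "02": 0.1, "03": 0.1},
}
columns = sorted(toy_matrix["A"])

slices = windows('ATCGATA', len(columns))
scores = [log_odds(s, toy_matrix, columns) for s in slices]
best = slices[scores.index(max(scores))]

print(slices)
print(best)
```

The slices match the example in the code comments ('ATC', 'TCG', 'CGA', 'GAT', 'ATA'), which gives a quick way to check the windowing before running it on real sequence files.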

                          • aboxylica
                            New Member
                            • Jul 2007
                            • 111

                            I am getting an error, and many sequence headers whose values are less than 2 are still printed!
                            But I want the headers only for those whose value is more than 2.
                            Traceback (most recent call last):
                              File "C:\Python25\this_final_1.py", line 165, in <module>
                                calcdata=compileData(fnArray,fnSeq,arraySet,seqSet)
                              File "C:\Python25\this_final_1.py", line 137, in compileD
                                seqList1.append(seqList[i])
                            IndexError: list index out of range

                            • bvdet
                              Recognized Expert Specialist
                              • Oct 2006
                              • 2851

                              I am baffled. The code works fine for me, and limits the output to values over 2. Add print statements to check the status of seqList and the calculation results.
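One possible reading of the "headers still printed" symptom, offered as a guess and not as a diagnosis of the IndexError: `compileData` as posted builds its return string even when no result passes the threshold, so a header can appear with no calculation lines under it. A minimal sketch of a formatter that suppresses such blocks follows; `format_block` and the demo headers are hypothetical names used only for illustration.

```python
def format_block(array_set, seq_set, header, seqs, results):
    """Return the text block, or '' when no result passed the threshold."""
    if not results:
        return ''
    body = '\n'.join('Sequence = %s Calculation = %0.12f' % (s, r)
                     for s, r in zip(seqs, results))
    return ('Array set # = %d\nSequence set # = %d\nSequence Header: %s\n%s'
            % (array_set, seq_set, header, body))


# Demo data, made up for illustration
blocks = [
    format_block(1, 1, '>Skud demo', ['AAC'], [2.5]),
    format_block(1, 2, '>Smik demo', [], []),
]
kept = [b for b in blocks if b != '']

print(len(kept))
```

Note that an empty string is falsy, so a main loop that tests `if calcdata:` would treat it like the end of the sequence sets. If this idea were adopted, the loop would need to break only on `False` or `None` and skip empty strings separately.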
