looping through a big file containing a set of files.

**aboxylica** · Dec 19 '07, 11:04 AM

I have changed the indentation of this part so that only values greater than 1 are printed. Is that correct? I don't know whether it is checking the entire file!

Code:

for i, num in enumerate(numList):
    if (log10(num/denList[i]))>=2:
        outStr = '\n'.join(['Sequence = %s Calculation = %d' % (seqList[i], res) for i, res in enumerate(resultList)])

        return 'Array set # = %d\nSequence set # = %d\nSequence Header: %s\n%s' % (arraySet, seqSet, dataSeq[0], outStr)

Will this check the entire sequence against all the weight matrices?

**bvdet** · Dec 19 '07, 12:56 PM

Obviously, accumulating such a large list is taxing your system. A small adjustment in the code should help, and allow you to see the results for each sequence file.
[code=Python]
if __name__ == '__main__':

    import os

    fnArray = 'array.txt'
    outputfile = 'seq_array_output.txt'

    # user has multiple sequence files
    # and must iterate over all the files
    dir_name = r'H:\TEMP\temsys\seq_array'
    fileList = [os.path.join(dir_name,f) for f in os.listdir(dir_name)\
                if os.path.isfile(os.path.join(dir_name,f))]

    for fnSeq in fileList:
        outList = []
        calcdata = 1
        arraySet = 1
        while calcdata:
            seqSet = 1
            while True:
                calcdata = compileData(fnArray, fnSeq, arraySet, seqSet)
                if calcdata:
                    outList.append(calcdata)
                    seqSet += 1
                else:
                    break
            arraySet += 1
        # write to output file for each sequence file
        f = open(outputfile, 'a')
        f.write('\n'.join(outList))
        f.close()
        print 'Data written for sequence file %s' % fnSeq
[/code]
You should check the validity of the calculations on a sample array file and two or three sample sequence files. To limit the output to calculation results that are greater than 1:
[code=Python]
    ....
    resultList = []
    seqList1 = []

    for i, num in enumerate(numList):
        result = log10(num/denList[i])
        # limit output to results greater than 1
        if result > 1:
            # print 'Seq = %s Result = %0.12f' % (seqList[i], result)
            resultList.append(result)
            seqList1.append(seqList[i])

    outStr = '\n'.join(['Sequence = %s Calculation = %0.12f' % \
                        (seqList1[i], res) for i, res in enumerate(resultList)])
    return 'Array set # = %d\nSequence set # = %d\nSequence Header: %s\n%s' % \
           (arraySet, seqSet, dataSeq[0], outStr)
[/code]
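The filtering step can also be tried in isolation. The following is only a sketch that runs on its own: the function name `filter_results`, its `threshold` parameter and the sample numbers are invented for illustration, and only the `log10(num/den)` test comes from the code above. It pairs each sequence with its values via `zip()` instead of keeping two index-aligned lists.

```python
from math import log10

def filter_results(seqList, numList, denList, threshold=1):
    """Return (sequence, log10 ratio) pairs whose ratio exceeds threshold.

    Hypothetical helper: the name and signature are assumptions made for
    this sketch, not part of the original program.
    """
    kept = []
    for seq, num, den in zip(seqList, numList, denList):
        result = log10(num / den)
        if result > threshold:
            kept.append((seq, result))
    return kept

# Made-up sample values: the ratios are 1000, 5 and 100
pairs = filter_results(['ATC', 'TCG', 'CGA'],
                       [1000.0, 5.0, 100.0],
                       [1.0, 1.0, 1.0])
outStr = '\n'.join('Sequence = %s Calculation = %0.12f' % p for p in pairs)
print(outStr)
```

With these sample values only `ATC` and `CGA` pass the filter, because log10(5) is below the threshold of 1.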

**aboxylica** · Dec 19 '07, 03:59 PM

But when I change
while not calcdata is None:
to
while calcdata:

the array set does not iterate at all; the set stays at one throughout. Why is this happening?
And what exactly is the difference between the two lines?
Also, if any log value is greater than two, I am taking that value as one. Only for these sequences do I want to print the seqSet, arraySet and calculated value.

I want all of these written to a file.

Will it take a long time to run?

**bvdet** · Dec 19 '07, 04:37 PM

You are correct. The code should be[code=Python]while not calcdata is None:[/code]I have one array file and two small sequence files for testing. It takes about 2 or 3 seconds.
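The difference between the two loop conditions comes down to how Python treats `False` and `None`. Both are falsy, so `while calcdata:` stops on either value, whereas `while not calcdata is None:` stops only on `None`. The sketch below assumes the return convention of the `compileData` function posted in this thread (`None` when there is no array set left, `False` when there is no sequence set left). `fake_compile`, `run` and the set counts are invented stand-ins, not the real program.

```python
def fake_compile(arraySet, seqSet, nArrays=3, nSeqs=2):
    """Hypothetical stand-in for compileData with the same return convention."""
    if arraySet > nArrays:
        return None      # no such array set: nothing left to do
    if seqSet > nSeqs:
        return False     # no such sequence set: move on to the next array set
    return 'Array set # = %d Sequence set # = %d' % (arraySet, seqSet)

def run(stop_on_false):
    """Drive the nested loops and return every result string collected."""
    out = []
    calcdata = 1
    arraySet = 1
    # stop_on_false=True mimics 'while calcdata:',
    # stop_on_false=False mimics 'while not calcdata is None:'
    while (calcdata if stop_on_false else calcdata is not None):
        seqSet = 1
        while True:
            calcdata = fake_compile(arraySet, seqSet)
            if calcdata:
                out.append(calcdata)
                seqSet += 1
            else:
                break
        arraySet += 1
    return out

print(len(run(stop_on_false=True)))
print(len(run(stop_on_false=False)))
```

With three array sets and two sequence sets, the first call collects 2 results and the second collects 6: the inner loop always ends on `False`, so the truthiness test ends the outer loop after the first array set.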

**aboxylica** · Dec 19 '07, 04:49 PM

I want the entire result, i.e. all the log values greater than two (I am taking these values as one; I don't care about the values lower than this). I want the entire thing written to a file. Will that be possible, or will there be a memory error?

**bvdet** · Dec 19 '07, 05:56 PM

Originally posted by aboxylica

I want the entire result, i.e. all the log values greater than two (I am taking these values as one; I don't care about the values lower than this). I want the entire thing written to a file. Will that be possible, or will there be a memory error?

Try and see. The modified code from my earlier post writes the data to file for each sequence file. That should solve your memory error, but the only way to find out for sure is to try it.
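A rough sketch of why writing per sequence file keeps memory use flat: each file's results are appended to the output and dropped before the next file is read, so the size of the output file is limited by disk space rather than by memory. `process_file` and `write_all` are hypothetical placeholders for whatever produces and stores the result strings, and the temporary file names are invented for the demonstration.

```python
import os
import tempfile

def process_file(path):
    """Hypothetical placeholder: yield one result line per input line."""
    with open(path) as fh:
        for line in fh:
            yield 'Result for %s' % line.strip()

def write_all(paths, outputfile):
    """Append each file's results to outputfile, one input file at a time."""
    for path in paths:
        with open(outputfile, 'a') as out:
            for result in process_file(path):
                out.write(result + '\n')   # written immediately, never accumulated

# Small demonstration with two temporary input files
workdir = tempfile.mkdtemp()
paths = []
for n in range(2):
    p = os.path.join(workdir, 'seq%d.txt' % n)
    with open(p, 'w') as fh:
        fh.write('AAA\nCCC\n')
    paths.append(p)

outputfile = os.path.join(workdir, 'seq_array_output.txt')
write_all(paths, outputfile)
with open(outputfile) as fh:
    written = fh.read().splitlines()
print(len(written))
```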

**aboxylica** · Dec 20 '07, 04:17 AM

Yup, I got the output file for the last sequence file; it is about 87 KB. If I want all the values for all the 5000 files in my folder, it will take about 440 MB. Is it possible to write such a huge file? I need the entire output.

**aboxylica** · Dec 20 '07, 09:55 AM

Okay, I don't want to do that, because this is not the final result of my task, so I am not going to write it to a file.
What I should be doing is this:
I get the results that have log values of one, with four headers, like
>skud
>scer
>smik
>spar

If my result list has something like this:
Array set # = 1

Sequence set # = 3

Sequence Header: >Skud YPR204W c2068:3442..4218

Sequence = AACTGTACACT Calculation = 1

Sequence = GATTATAACAA Calculation = 1

Sequence Header: >Smik YPR204W c2854:249..1235

Sequence = GAAGAGATACTAACAA Calculation = 1

Sequence = AATGACCGCGGCTCTT Calculation = 1

Array set # = 3

Sequence set # = 3

Sequence Header: >Skud YPR204W c2068:3442..4218

Sequence = AACTGTACACTGATTA Calculation = 1

Sequence = TAACAAGAACGGTTCA Calculation = 1

Sequence = TCGGAGCCTCGACTAA Calculation = 1

Sequence = AGACACTTGACGGACT Calculation = 1

Sequence = CACTTCAGATTACTTG Calculation = 1

Array set # = 3

Sequence set # = 4

This is a part of the file.
What I should do is count the number of ones at every occurrence of a header, write the counts to a list, and find the average.
For example, skud=[2,5] (at the first occurrence of skud there were 2 ones, then there were 5 ones) and smik=[2] (there is only one occurrence of this, with two ones calculated).
Now I should compute the average for skud, which is (5+2)/2 = 3.5 (approx. 3), and similarly for smik, 2/1 = 2.

Waiting for your reply,
cheers!
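The counting described in this post can be sketched as follows. It is an illustration only: `average_ones` is a made-up name, it assumes that every `Calculation` line under a header counts as a one, and it takes the species name to be the four characters after `>`. Skipping headers that have no calculation lines is an assumption about the intent.

```python
def average_ones(lines):
    """Return {species: average number of calculation lines per occurrence}."""
    counts = {}          # species -> list with one count per header occurrence
    current = None
    for line in lines:
        if 'Header' in line and '>' in line:
            start = line.index('>') + 1
            current = line[start:start + 4]
            counts.setdefault(current, []).append(0)
        elif 'Calculation' in line and current is not None:
            counts[current][-1] += 1
    averages = {}
    for species, per_occurrence in counts.items():
        # assumption: occurrences with no calculation lines are ignored
        nonzero = [c for c in per_occurrence if c > 0]
        if nonzero:
            averages[species] = sum(nonzero) / float(len(nonzero))
    return averages

sample = """Sequence Header: >Skud YPR204W c2068:3442..4218
Sequence = AACTGTACACT Calculation = 1
Sequence = GATTATAACAA Calculation = 1
Sequence Header: >Smik YPR204W c2854:249..1235
Sequence = GAAGAGATACTAACAA Calculation = 1
Sequence = AATGACCGCGGCTCTT Calculation = 1
Sequence Header: >Skud YPR204W c2068:3442..4218
Sequence = AACTGTACACTGATTA Calculation = 1
Sequence = TAACAAGAACGGTTCA Calculation = 1
Sequence = TCGGAGCCTCGACTAA Calculation = 1
Sequence = AGACACTTGACGGACT Calculation = 1
Sequence = CACTTCAGATTACTTG Calculation = 1
""".splitlines()

averages = average_ones(sample)
print(averages)
```

For the sample in this post the counts are [2, 5] for Skud and [2] for Smik, which gives averages of 3.5 and 2.0.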

**aboxylica** · Dec 20 '07, 12:17 PM

I have done this average calculation on a part of my output file, but I should be doing it on the result list directly, and I don't know exactly how to do that. Here is the code that calculates the average for a small part of my output file:

Code:

# Open the file. It is in the same directory as the source code.
f = open("result.txt")

# Read the contents of the file into the list a
a = f.readlines()
f.close()


# Function to read the name of the yeast from a header line.
# e.g. for a line containing '>Skud YPR204W ...' it returns 'Skud'
# (the four characters that follow '>')
def readname(r):
    indx = r.index(">")
    return r[indx+1:indx+5]


# mylist will contain only the data we require, i.e. the yeast name
# from each header line and every calculation line
mylist = []

for i in a:
    if i.find('Header') != -1:
        mylist.append(readname(i))
    if i.find('Calculation') != -1:
        mylist.append(i)

# Dictionary of all the yeasts in the text file. The values start at
# zero and are incremented for every one found for that yeast.
yeasts = {'Skud':0,'Smik':0,'Scer':0,'Spar':0,'Sbay':0}

# Dictionary to count only those occurrences of a yeast that have ones
# (valid occurrences). Blank occurrences are skipped.
valid = {'Skud':0,'Smik':0,'Scer':0,'Spar':0,'Sbay':0}

# Dictionary to store the averages
average = {}

# Count the ones for each yeast
name = None
for v in range(len(mylist)):
    s = mylist[v]
    if 'Calculation' in s:
        # every calculation line in this file records a one
        yeasts[name] += 1
    else:
        name = s
        # the occurrence is valid if a calculation line follows the header
        if v+1 < len(mylist) and 'Calculation' in mylist[v+1]:
            valid[name] += 1

# Calculate the average number of ones per valid occurrence
for keys in yeasts:
    if yeasts[keys] == 0 or valid[keys] == 0:
        print "Yeast %s has no occurrence" % keys
    else:
        # float() avoids integer division, which would turn 7/2 into 3
        average[keys] = float(yeasts[keys])/valid[keys]

print yeasts
print valid
for keys in average:
    print "%s average is: %.2f" % (keys, average[keys])

I don't know exactly how to tie this into my main program. Should I put it under main or in a separate function, since it calculates from what the code is computing?
Waiting for your reply,
cheers!

**aboxylica** · Dec 20 '07, 03:16 PM

I want to know whether I should write a separate function to compute the average, or whether I can do it in the main part.
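One possible arrangement, offered only as a sketch, is to keep the averaging in separate functions and call them from the main loop as each header's results become available, so nothing has to be written to a file and re-read. The names `update_counts` and `compute_averages` are hypothetical, and the sketch assumes the main loop can see the header and the number of results for each sequence set, which the posted `compileData` does not expose as written because it returns one formatted string.

```python
def update_counts(counts, header, n_results):
    """Record how many results one header occurrence produced (skip empty ones)."""
    species = header.lstrip('>')[:4]
    if n_results > 0:
        counts.setdefault(species, []).append(n_results)

def compute_averages(counts):
    """Return {species: mean number of results per recorded occurrence}."""
    return dict((sp, sum(c) / float(len(c))) for sp, c in counts.items())

# In the main loop this would be called once per header; here the calls
# are written out by hand with the counts from the example in this thread.
counts = {}
update_counts(counts, '>Skud YPR204W c2068:3442..4218', 2)
update_counts(counts, '>Smik YPR204W c2854:249..1235', 2)
update_counts(counts, '>Skud YPR204W c2068:3442..4218', 5)
update_counts(counts, '>Spar (hypothetical header with no results)', 0)
averages = compute_averages(counts)
print(averages)
```

Keeping the bookkeeping in functions like these leaves the main part responsible only for looping over files and sets, which makes each piece easier to test on a small sample.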

**aboxylica** · Dec 20 '07, 03:44 PM

I want to print the header only if it satisfies the following condition.
How do I do it?

Code:

return 'Sequence Header: %s\n%s' % (dataSeq[0], outStr)if (log10(num/denList[i]))>=2

**bvdet** · Dec 20 '07, 04:00 PM

Originally posted by aboxylica

I want to print the header only if it satisfies the following condition.
How do I do it?

Code:

return 'Sequence Header: %s\n%s' % (dataSeq[0], outStr)if (log10(num/denList[i]))>=2

[code=Python]
def compileData(fnArray, fnSeq, arraySet=1, seqSet=1):
    # sequence factor dictionary
    value={"A":0.3,"T":0.3,"C":0.2,"G":0.2}

    dataArray = parseArray(fnArray, arraySet)
    if dataArray:
        dataSeq = parseData(fnSeq, seqSet)
        if not dataSeq:
            return False
    else:
        return None

    # This is the complete sequence
    seq = ''.join(dataSeq[1:])
    # These are the subkeys of dataArray - '01', '02', '03',.............
    subKeys = dataArray['A'].keys()
    subKeys.sort()

    # Calculate num/den for each slice of sequence
    # Each sequence slice length = length of subKeys
    # Example:
    # seq = 'ATCGATA'
    # subKeys length = 3
    # 'ATC', 'TCG', 'CGA', 'GAT', 'ATA'
    numList = []
    denList = []
    seqList = []
    for i in range(len(seq) - len(subKeys) + 1):
        subseq = seq[0:len(subKeys)]
        seqList.append(subseq)
        num, den = 1, 1
        for j, s in enumerate(subseq):
            num *= dataArray[s][subKeys[j]]
            den *= value[s]
        numList.append(num)
        denList.append(den)
        seq = seq[1:]

    resultList = []
    seqList1 = []

    for i, num in enumerate(numList):
        result = log10(num/denList[i])
        # limit output to results greater than 2
        if result > 2:
            # print 'Seq = %s Result = %0.12f' % (seqList[i], result)
            resultList.append(result)
            seqList1.append(seqList[i])

    outStr = '\n'.join(['Sequence = %s Calculation = %0.12f' % \
                        (seqList1[i], res) for i, res in enumerate(resultList)])
    return 'Array set # = %d\nSequence set # = %d\nSequence Header: %s\n%s' % \
           (arraySet, seqSet, dataSeq[0], outStr)
[/code]
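The slicing loop in `compileData` walks a window of `len(subKeys)` characters along the sequence and scores each window. Below is a compact sketch of the same idea using slices and the example from the code comments. The helper names `windows` and `score` and the weight matrix are invented for illustration; only the background factors in `value` and the num/den formula come from the posted code. It uses `sorted()` in place of `keys()` followed by `sort()` so that it also runs on newer Python versions.

```python
from math import log10

def windows(seq, width):
    """Return every contiguous slice of seq that is width characters long."""
    return [seq[i:i + width] for i in range(len(seq) - width + 1)]

def score(subseq, dataArray, subKeys, value):
    """log10 of (product of matrix weights) / (product of background factors)."""
    num, den = 1.0, 1.0
    for j, s in enumerate(subseq):
        num *= dataArray[s][subKeys[j]]
        den *= value[s]
    return log10(num / den)

value = {"A": 0.3, "T": 0.3, "C": 0.2, "G": 0.2}   # background factors from the post
# Made-up weight matrix with three positions, for illustration only
dataArray = {
    'A': {'01': 0.9,   '02': 0.05,  '03': 0.05},
    'T': {'01': 0.05,  '02': 0.9,   '03': 0.05},
    'C': {'01': 0.025, '02': 0.025, '03': 0.8},
    'G': {'01': 0.025, '02': 0.025, '03': 0.1},
}
subKeys = sorted(dataArray['A'])

slices = windows('ATCGATA', len(subKeys))
scores = [score(s, dataArray, subKeys, value) for s in slices]
for s, sc in zip(slices, scores):
    print('Sequence = %s Calculation = %0.12f' % (s, sc))
```

The slices come out as 'ATC', 'TCG', 'CGA', 'GAT', 'ATA', matching the comment in the posted code. A matrix entry of zero would make `log10` raise an error, so real data may need a guard for that case.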

**aboxylica** · Dec 20 '07, 04:56 PM

I am getting an error, and many sequence headers whose values are less than 2 are still printed!
But I want the headers only for those whose value is more than 2.
Traceback (most recent call last):
  File "C:\Python25\this_final_1.py", line 165, in <module>
    calcdata=compileData(fnArray,fnSeq,arraySet,seqSet)
  File "C:\Python25\this_final_1.py", line 137, in compileD
    seqList1.append(seqList[i])
IndexError: list index out of range

**bvdet** · Dec 20 '07, 09:06 PM

I am baffled. The code works fine for me, and limits the output to values over 2. Add print statements to check the status of seqList and the calculation results.
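Two small checks may help narrow this down. Because the failing line indexes `seqList` with an index taken from `enumerate(numList)`, the IndexError means `seqList` is shorter than `numList` in the copy that was run, so comparing the list lengths is a reasonable first print statement. Separately, the posted `compileData` returns the header unconditionally, so a header comes back even when no result passed the filter, and the caller can test for that. Both helpers below are hypothetical names introduced for this sketch.

```python
def check_lengths(seqList, numList, denList):
    """Return a short report of the three list lengths and whether they match."""
    lengths = (len(seqList), len(numList), len(denList))
    aligned = lengths[0] == lengths[1] == lengths[2]
    return 'seqList=%d numList=%d denList=%d aligned=%s' % (lengths + (aligned,))

def block_has_results(calcdata):
    """True if a block returned by compileData contains at least one result line."""
    return 'Calculation =' in calcdata

# Made-up data: seqList is one item short, as the traceback suggests
report = check_lengths(['ATC', 'TCG'], [0.5, 0.2, 0.1], [0.01, 0.01, 0.01])
print(report)

empty_block = 'Array set # = 1\nSequence set # = 1\nSequence Header: >Skud\n'
full_block = empty_block + 'Sequence = ATC Calculation = 2.500000000000'
print(block_has_results(empty_block))
print(block_has_results(full_block))
```

In the main loop, appending `calcdata` to `outList` only when `block_has_results(calcdata)` is true would drop headers that have no values above the cutoff, while `seqSet` is still incremented for every block so the loop conditions are unaffected.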
