hey!
I have a program that takes two input files(one in the matrix form) and one in the sequence form.Now my problem is that i have to give the matrix file(containing many matrices) and sequence file containing many sequences and calculate the same log score as I did for one matrix file and one sequence file.
how it should exactly work is that. for every sequence it should calculate log values for all the weight matrices,then go to the second sequence and calculate all the log values using the matrices.
my matrix file is huge containing so many matrices. a part of it is here.
//
NA Abd-B
PO A C G T
01 10.19 0.00 10.65 6.24
02 5.79 0.67 10.50 10.11
03 4.50 0.00 0.00 22.57
04 0.00 0.00 0.00 27.08
05 0.00 0.00 0.00 27.08
06 0.00 0.00 0.00 27.08
07 27.08 0.00 0.00 0.00
08 0.00 2.83 0.00 24.25
09 0.00 0.00 24.45 2.62
10 19.33 0.00 4.34 3.41
11 0.31 12.28 3.39 11.09
//
//
NA Adf1
PO A C G T
01 0.71 0.08 26.02 1.55
02 3.03 23.00 1.24 1.09
03 0.26 10.50 3.29 14.31
04 0.00 0.06 28.23 0.07
05 0.12 27.27 0.06 0.91
06 1.44 20.36 0.37 6.19
07 5.35 0.28 21.49 1.24
08 7.81 16.10 3.81 0.63
09 0.51 17.77 0.45 9.63
10 0.00 0.14 28.21 0.00
11 0.00 25.69 0.20 2.46
12 0.48 9.98 0.07 17.82
13 1.27 0.00 27.01 0.07
14 15.59 7.98 2.92 1.87
15 4.28 22.37 0.00 1.70
16 0.18 0.77 22.70 4.70
//
//
NA Aef1
PO A C G T
01 0.00 0.06 12.49 0.00
02 3.80 0.17 0.00 8.57
03 0.87 0.06 0.00 11.62
04 0.06 9.76 2.32 0.41
05 9.82 0.00 2.73 0.00
06 9.76 0.00 0.00 2.78
07 3.80 0.31 0.00 8.43
08 0.00 0.00 0.00 12.54
09 0.00 6.53 5.85 0.17
10 0.00 12.38 0.17 0.00
11 2.73 1.02 8.80 0.00
12 5.85 0.00 6.70 0.00
13 1.02 5.96 0.00 5.57
14 0.00 5.16 4.66 2.73
15 1.03 7.55 3.97 0.00
16 4.82 5.00 2.73 0.00
//
//
NA Antp
PO A C G T
01 5.52 14.49 27.56 0.49
02 8.17 14.02 11.42 14.47
03 18.18 27.29 1.31 1.29
04 40.26 5.66 1.83 0.32
05 19.05 12.67 0.43 15.91
06 9.94 0.07 0.20 37.86
07 26.63 15.17 0.00 6.27
08 47.45 0.06 0.00 0.56
09 0.81 0.48 0.00 46.79
10 26.46 19.05 1.81 0.75
11 48.07 0.00 0.00 0.00
12 30.51 0.00 0.00 17.56
13 43.45 0.00 0.00 4.62
14 30.06 5.98 0.00 12.03
15 0.38 0.64 0.00 47.05
16 22.14 0.29 7.15 18.49
//
//
the sequence file is here( I mean this is also a part of my file)the actual file starts from "CC" the line before is just heading which we omit and this file is containg two sequences.
>CG9571_O-E|Drosophila melanogaster|CG 9571|FBgn003108 6|X:19926374..1 9927133
CCAGTCCACCGGCCG CCGATCTATTTATAC GAGAGGAAGAGGCTG AACTCGAGGATTACC CGTGTATCCTGGGAC GCG
GATTAGCGATCCATT CCCCTTTTAATCGCC GCGCAAACAGATTCA TGAAAGCCTTCGGAT TCATTCATTGATCCA CAT
CTACGGGAACGGGAG TCGCAAACGTTTTCG GATTAGCGCTGGACT AGCGGTTTCTAAATT GGATTATTTCTACCT GAC
CCTGGAGCCATCGTC CTCGTCCTCC
>Cp36_DRR|Droso phila melanogaster|Cp 36|FBgn0000359| X:8323349..8324 136
AGTCGACCAGCACGA GATCTCACCTACCTT CTTTATAAGCGGGGT CTCTAGAAGCTAAAT CCATGTCCACGTCAA ACC
AAAGACTTGCGGTCT CCAGACCATTGAGTT CTATAAATGGGACTG AGCCACACCATACAC CACACACCACACATA CAC
ACACGCCAACACATT ACACACAACACGAAC TACACAAACACTGAG ATTAAGGAAATTATT AAAAAAAATAATAAA ATT
AATACAAAAAAAATA TATATATATA
this is my code which works(prints the log value for one sequence and one matrix)
[CODE=python]
from math import *
import random
f=open("deeps1. txt","r")
line=f.next()
while not line.startswith ('PO'):
line=f.next()
headerlist=line .strip().split( )[1:]
linelist=[]
line=f.next().s trip()
while not line.startswith ('/'):
if line != '':
linelist.append (line.strip().s plit())
line=f.next().s trip()
keys=[i[0] for i in linelist]
values=[[float(s) for s in item] for item in [j[1:] for j in linelist]]
array={}
linedict=dict(z ip(keys,values) )
keys = linedict.keys()
keys.sort()
for key in keys:
array=[key,linedict[key]]
datadict={}
datadict1={}
for i,item in enumerate(heade rlist):
datadict[item]={}
for key_ in linedict:
datadict[item][key_]=linedict[key_][i]
for keymain in datadict:
for keysub in datadict[keymain]:
datadict[keymain][keysub]+=1.0
datadict1=datad ict.copy()
for keysub in datadict:
for keysub in datadict[keymain]:
datadict1[keymain][keysub]=datadict[keymain][keysub]/(sum(values[int(keysub)-1])+4)
def readfasta():
file1= open("chr011.py ",'r')
file_content=fi le1.readlines()
first=1
list1=""
for line in file_content:
if line[0]==">":
if first==0:
print "********** *"
list1+=sequence
print "********** *"
else:
first=0
sequence=""
seq=""
for i in range(0,len(lin e)-1):
seq+=line[i]
else:
for i in range(0,len(lin e)-1):
sequence+=line[i]
list1+=sequence
return list1
p=readfasta()
res=1
part=""
q=len(p)
seqq=""
value={"A":0.3, "T":0.3,"C":0.2 ,"G":0.2}
for i in range(q-16):
part=p[i:i+16]
seqq=part
res=1
score=1
for j in range(16):
key=seqq[j]
res=res*datadic t1[key]["%02d"%(j+1 )]
#print res
for key in seqq:
score=score * value[key]
#print score,"******** ***********",re s
log_ratio=log10 (res/score)
print i,log_ratio
[/CODE]
what changes should i make and how?/
waiting for your reply,
cheers!
I have a program that takes two input files(one in the matrix form) and one in the sequence form.Now my problem is that i have to give the matrix file(containing many matrices) and sequence file containing many sequences and calculate the same log score as I did for one matrix file and one sequence file.
how it should exactly work is that. for every sequence it should calculate log values for all the weight matrices,then go to the second sequence and calculate all the log values using the matrices.
my matrix file is huge containing so many matrices. a part of it is here.
//
NA Abd-B
PO A C G T
01 10.19 0.00 10.65 6.24
02 5.79 0.67 10.50 10.11
03 4.50 0.00 0.00 22.57
04 0.00 0.00 0.00 27.08
05 0.00 0.00 0.00 27.08
06 0.00 0.00 0.00 27.08
07 27.08 0.00 0.00 0.00
08 0.00 2.83 0.00 24.25
09 0.00 0.00 24.45 2.62
10 19.33 0.00 4.34 3.41
11 0.31 12.28 3.39 11.09
//
//
NA Adf1
PO A C G T
01 0.71 0.08 26.02 1.55
02 3.03 23.00 1.24 1.09
03 0.26 10.50 3.29 14.31
04 0.00 0.06 28.23 0.07
05 0.12 27.27 0.06 0.91
06 1.44 20.36 0.37 6.19
07 5.35 0.28 21.49 1.24
08 7.81 16.10 3.81 0.63
09 0.51 17.77 0.45 9.63
10 0.00 0.14 28.21 0.00
11 0.00 25.69 0.20 2.46
12 0.48 9.98 0.07 17.82
13 1.27 0.00 27.01 0.07
14 15.59 7.98 2.92 1.87
15 4.28 22.37 0.00 1.70
16 0.18 0.77 22.70 4.70
//
//
NA Aef1
PO A C G T
01 0.00 0.06 12.49 0.00
02 3.80 0.17 0.00 8.57
03 0.87 0.06 0.00 11.62
04 0.06 9.76 2.32 0.41
05 9.82 0.00 2.73 0.00
06 9.76 0.00 0.00 2.78
07 3.80 0.31 0.00 8.43
08 0.00 0.00 0.00 12.54
09 0.00 6.53 5.85 0.17
10 0.00 12.38 0.17 0.00
11 2.73 1.02 8.80 0.00
12 5.85 0.00 6.70 0.00
13 1.02 5.96 0.00 5.57
14 0.00 5.16 4.66 2.73
15 1.03 7.55 3.97 0.00
16 4.82 5.00 2.73 0.00
//
//
NA Antp
PO A C G T
01 5.52 14.49 27.56 0.49
02 8.17 14.02 11.42 14.47
03 18.18 27.29 1.31 1.29
04 40.26 5.66 1.83 0.32
05 19.05 12.67 0.43 15.91
06 9.94 0.07 0.20 37.86
07 26.63 15.17 0.00 6.27
08 47.45 0.06 0.00 0.56
09 0.81 0.48 0.00 46.79
10 26.46 19.05 1.81 0.75
11 48.07 0.00 0.00 0.00
12 30.51 0.00 0.00 17.56
13 43.45 0.00 0.00 4.62
14 30.06 5.98 0.00 12.03
15 0.38 0.64 0.00 47.05
16 22.14 0.29 7.15 18.49
//
//
the sequence file is here( I mean this is also a part of my file)the actual file starts from "CC" the line before is just heading which we omit and this file is containg two sequences.
>CG9571_O-E|Drosophila melanogaster|CG 9571|FBgn003108 6|X:19926374..1 9927133
CCAGTCCACCGGCCG CCGATCTATTTATAC GAGAGGAAGAGGCTG AACTCGAGGATTACC CGTGTATCCTGGGAC GCG
GATTAGCGATCCATT CCCCTTTTAATCGCC GCGCAAACAGATTCA TGAAAGCCTTCGGAT TCATTCATTGATCCA CAT
CTACGGGAACGGGAG TCGCAAACGTTTTCG GATTAGCGCTGGACT AGCGGTTTCTAAATT GGATTATTTCTACCT GAC
CCTGGAGCCATCGTC CTCGTCCTCC
>Cp36_DRR|Droso phila melanogaster|Cp 36|FBgn0000359| X:8323349..8324 136
AGTCGACCAGCACGA GATCTCACCTACCTT CTTTATAAGCGGGGT CTCTAGAAGCTAAAT CCATGTCCACGTCAA ACC
AAAGACTTGCGGTCT CCAGACCATTGAGTT CTATAAATGGGACTG AGCCACACCATACAC CACACACCACACATA CAC
ACACGCCAACACATT ACACACAACACGAAC TACACAAACACTGAG ATTAAGGAAATTATT AAAAAAAATAATAAA ATT
AATACAAAAAAAATA TATATATATA
this is my code which works(prints the log value for one sequence and one matrix)
[CODE=python]
from math import *
import random
f=open("deeps1. txt","r")
line=f.next()
while not line.startswith ('PO'):
line=f.next()
headerlist=line .strip().split( )[1:]
linelist=[]
line=f.next().s trip()
while not line.startswith ('/'):
if line != '':
linelist.append (line.strip().s plit())
line=f.next().s trip()
keys=[i[0] for i in linelist]
values=[[float(s) for s in item] for item in [j[1:] for j in linelist]]
array={}
linedict=dict(z ip(keys,values) )
keys = linedict.keys()
keys.sort()
for key in keys:
array=[key,linedict[key]]
datadict={}
datadict1={}
for i,item in enumerate(heade rlist):
datadict[item]={}
for key_ in linedict:
datadict[item][key_]=linedict[key_][i]
for keymain in datadict:
for keysub in datadict[keymain]:
datadict[keymain][keysub]+=1.0
datadict1=datad ict.copy()
for keysub in datadict:
for keysub in datadict[keymain]:
datadict1[keymain][keysub]=datadict[keymain][keysub]/(sum(values[int(keysub)-1])+4)
def readfasta():
file1= open("chr011.py ",'r')
file_content=fi le1.readlines()
first=1
list1=""
for line in file_content:
if line[0]==">":
if first==0:
print "********** *"
list1+=sequence
print "********** *"
else:
first=0
sequence=""
seq=""
for i in range(0,len(lin e)-1):
seq+=line[i]
else:
for i in range(0,len(lin e)-1):
sequence+=line[i]
list1+=sequence
return list1
p=readfasta()
res=1
part=""
q=len(p)
seqq=""
value={"A":0.3, "T":0.3,"C":0.2 ,"G":0.2}
for i in range(q-16):
part=p[i:i+16]
seqq=part
res=1
score=1
for j in range(16):
key=seqq[j]
res=res*datadic t1[key]["%02d"%(j+1 )]
#print res
for key in seqq:
score=score * value[key]
#print score,"******** ***********",re s
log_ratio=log10 (res/score)
print i,log_ratio
[/CODE]
what changes should i make and how?/
waiting for your reply,
cheers!
Comment