looping through a big file containing a set of files.

**aboxylica** · Jul 16 '07, 04:48 AM

I dont seem to get a hang of it. it is not working for all the sequences and all the matrices.what do i do?
waiting for your reply,
cheers!

**bvdet** · Jul 16 '07, 12:18 PM

Originally posted by aboxylica

This is the code that works and calculates for a single sequence and a single matrix(containi ng 16 positions) I want to do it for many sequences and many matrices.I guess am clearer now.I have given how my sequences and matrices look like.I just need to generalize it.am i clearer now
waiting for ur reply
cheers!

You have a mistake in the code where the data is normalized:

Code:

datadict1=datadict.copy()
for [B][U]keymain[/U][/B] in datadict:
    for keysub in datadict[keymain]:
        datadict1[keymain][keysub] /= (sum(values[int(keysub)-1])+4)

You had keysub where it should have been keymain. The division on the last line can be done with the division in place operator '/='.

**bvdet** · Jul 16 '07, 12:42 PM

Originally posted by aboxylica

Code:

res=1
part=""
q=len(p)
seqq=""
 
value={"A":0.3,"T":0.3,"C":0.2,"G":0.2}
for i in range(q-16):
    part=p[i:i+16]
    seqq=part
    res=1
    score=1
    for j in range(16):
        key=seqq[j]
        res=res*datadict1[key]["%02d"%(j+1)]
        #print res
    for key in seqq:
        score=score * value[key]
    #print score,"*******************",res
    log_ratio=log10(res/score)
    print i,log_ratio

This is the code that works ......

There are multiple sets of arrays and sequences in the data files you presented. By data set, I am referring to one set of data in each file, and assume that the first set of array data and the first set of sequence data should work together, and the second set of array data .........

The code above is where I get lost. The sequence is 50 characters long. There are 16 sets of numbers for 'A', 'T', 'C', and 'G'. Why does the code generate 33 calculations?

I am trying to understand the calculation you want.
Example:
sequence = 'ATTC.........'
dataDict is defined with main keys of 'A', 'C', 'G', and 'T'
Each key has 16 sub keys, 01-16 in the first set of data

(dataDict['A']['01']/0.3) * (dataDict['T']['02']/0.3) * (dataDict['T']['03']/0.3) * (dataDict['C']['04']/0.2) * ............... ...

What is a sequence exactly? This is the first set of sequence data in your file:

Code:

>CG9571_O-E|Drosophila melanogaster|CG9571|FBgn0031086|X:19926374..19927133
CCAGTCCACCGGCCGCCGATCTATTTATACGAGAGGAAGAGGCTGAACTCGAGGATTACCCGTGTATCCTGGGACGCG
GATTAGCGATCCATTCCCCTTTTAATCGCCGCGCAAACAGATTCATGAAAGCCTTCGGATTCATTCATTGATCCACAT
CTACGGGAACGGGAGTCGCAAACGTTTTCGGATTAGCGCTGGACTAGCGGTTTCTAAATTGGATTATTTCTACCTGAC
CCTGGAGCCATCGTCCTCGTCCTCCGTCCCTTAGCGCCTCCTGCATGGATGTCGTTTTTGGGTTTCATACCTTTTCAC
ACTGGAAAAATACGGAATTTGTTGTAAGCCCTTTCAAGACGAATGGGATTTAGCTTCGGATGTCAACGTCACCATAAT
CATATTAGGAATATTTCTACTCAATTGCAATATTGGTACTTTTCTGACTGTAAACGCGATGATAATTACAAATATGCC
TAATTTGCTGTCTTTATAATCAAATGGAGTTCTTTATATTTCCAAAATATTGAAATTCCGATTCCCTAGAAAATAATA
CGTTTTTCTGTTATTAATAAAAAACCAATAGGAAAGTTCTCAAAAATTACTCTGTTGTATTTGATCATTTCTTTTCCG
GTATAATCTTTTATTTTAAGCATTCCCATGTGAATAAATTTCAGACTAATGTATTAATAAGATGTCGTGTTTTTCCAC
TTACAAATTTCTCATACAGCTGGATATATACTACGAGTACTATACACATGCTCTGGG

Is each line a separate sequence, or all the lines combined one sequence?

**aboxylica** · Jul 16 '07, 12:50 PM

This is my full code now.there seems to be problems.But I cant really figure it out!!

Code:

def parseArray(fn, dataset=1, key='PO', term='/'):
    '''
    Read a formatted data file in matrix format and
    compile data into a dictionary
    '''
    f = open(fn)
 
    # skip to required data set
    for _ in range(dataset):
        try:
            line = f.next()
            while not line.startswith(key):
                line = f.next()
        except StopIteration, e:
            print 'We have reached the end of the file!'
            f.close()
            return False
 
    headerList = line.strip().split()[1:]
    lineList = []
 
    line = f.next().strip()
    while not line.startswith(term):
        if line != '':
            lineList.append(line.strip().split())
        line = f.next().strip()
 
    f.close()
 
    # Key list
    keys = [i[0] for i in lineList]
    print keys
    # Values list
    values = [[float(s) for s in item] for item in [j[1:] for j in lineList]]
 
    # Create a dictionary from keys and values
    lineDict = dict(zip(keys, values))
 
    dataDict = {}
 
    for i, item in enumerate(headerList):
        dataDict[item] = {}
        for key in lineDict:
            dataDict[item][key] = lineDict[key][i]
 
    # Add 1.0 to every element in dataDict subdictionaries
    for keyMain in dataDict:
        for keySub in dataDict[keyMain]:
            dataDict[keyMain][keySub] += 1.0
 
    # Normalize original data (with 1 added) and update data
    valueSums = [sum(item)+4 for item in values]
    dataDict1=dataDict.copy()
 
    for keyMain in dataDict:
        for keySub in dataDict[keyMain]:
            dataDict1[keyMain][keySub] /= (sum(valueSums[int(keySub)-1]+4
 
        return dataDict1
 
 
def parseData(fn, dataset=1, key='>'):
    '''
    Read a formatted data file of alpha sequences
    Return a list of sequences
    '''    
    # initialize output list
    dataList = []
    
    # open file for reading
    f = open(fn)
    
    # skip to required data set
    for _ in range(dataset):
        try:
            s = f.next()
            while not s.startswith(key):
                s = f.next()
        except StopIteration, e:
            print 'We have reached the end of the file!'
            f.close()
            return False
 
    for line in f:
        if not line.startswith(key):
            dataList.append(line.strip())
        else:
            break
 
    f.close()
    return dataList
 
if __name__ == '__main__':
    value={"A":0.3,"T":0.3,"C":0.2,"G":0.2}
    fnArray = 'all_redfly.transfac.txt'
    fnSeq = 'redfly_sequence.fasta'
    dataset = 3
    dataArray = parseArray(fnArray, dataset)
    dataSeq = parseData(fnSeq, dataset)
    seq = ''.join(dataSeq[1:])
    subKeys = dataArray['A'].keys()
    subKeys.sort()
 
    i,j = divmod(len(seq), len(subKeys))
    keys = subKeys*i + subKeys[:j]
 
    print dataSeq[0],
    outList = ['%s[%s]*%s = %0.4f' % (s, keys[i], s, dataArray[s][keys[i]]*value[s]) for i, s in enumerate(seq)]
    print '\n'.join(outList)
    print sum([float(s.split('=')[1]) for s in outList])

waiting for ur reply,
cheers!

**aboxylica** · Jul 16 '07, 12:54 PM

Originally posted by bvdet

There are multiple sets of arrays and sequences in the data files you presented. By data set, I am referring to one set of data in each file, and assume that the first set of array data and the first set of sequence data should work together, and the second set of array data .........

The code above is where I get lost. The sequence is 50 characters long. There are 16 sets of numbers for 'A', 'T', 'C', and 'G'. Why does the code generate 33 calculations?

I am trying to understand the calculation you want.
Example:
sequence = 'ATTC.........'
dataDict is defined with main keys of 'A', 'C', 'G', and 'T'
Each key has 16 sub keys, 01-16 in the first set of data

(dataDict['A']['01']/0.3) * (dataDict['T']['02']/0.3) * (dataDict['T']['03']/0.3) * (dataDict['C']['04']/0.2) * ............... ...

What is a sequence exactly? This is the first set of sequence data in your file:

Code:

>CG9571_O-E|Drosophila melanogaster|CG9571|FBgn0031086|X:19926374..19927133
CCAGTCCACCGGCCGCCGATCTATTTATACGAGAGGAAGAGGCTGAACTCGAGGATTACCCGTGTATCCTGGGACGCG
GATTAGCGATCCATTCCCCTTTTAATCGCCGCGCAAACAGATTCATGAAAGCCTTCGGATTCATTCATTGATCCACAT
CTACGGGAACGGGAGTCGCAAACGTTTTCGGATTAGCGCTGGACTAGCGGTTTCTAAATTGGATTATTTCTACCTGAC
CCTGGAGCCATCGTCCTCGTCCTCCGTCCCTTAGCGCCTCCTGCATGGATGTCGTTTTTGGGTTTCATACCTTTTCAC
ACTGGAAAAATACGGAATTTGTTGTAAGCCCTTTCAAGACGAATGGGATTTAGCTTCGGATGTCAACGTCACCATAAT
CATATTAGGAATATTTCTACTCAATTGCAATATTGGTACTTTTCTGACTGTAAACGCGATGATAATTACAAATATGCC
TAATTTGCTGTCTTTATAATCAAATGGAGTTCTTTATATTTCCAAAATATTGAAATTCCGATTCCCTAGAAAATAATA
CGTTTTTCTGTTATTAATAAAAAACCAATAGGAAAGTTCTCAAAAATTACTCTGTTGTATTTGATCATTTCTTTTCCG
GTATAATCTTTTATTTTAAGCATTCCCATGTGAATAAATTTCAGACTAATGTATTAATAAGATGTCGTGTTTTTCCAC
TTACAAATTTCTCATACAGCTGGATATATACTACGAGTACTATACACATGCTCTGGG

Is each line a separate sequence, or all the lines combined one sequence?

Each sequence refers starts with
>sign(list of A<T<G<C characters... and then second sequence starts with > sign followed by characters.. so ..on.. am i clearer??

**aboxylica** · Jul 16 '07, 12:57 PM

THIS IS THE FIRST SEQUENCE
:>CG9571_O-E|Drosophila
melanogaster|CG 9571|FBgn003108 6|X:19926374..1 99271 33
CCAGTCCACCGGCCG CCGATCTATTTATAC GAGAGGAAGAGGCTG AACTC GAGGATTACCCGTGT ATCCTGGGACGCG
GATTAGCGATCCATT CCCCTTTTAATCGCC GCGCAAACAGATTCA TGAAA GCCTTCGGATTCATT CATTGATCCACAT
CTACGGGAACGGGAG TCGCAAACGTTTTCG GATTAGCGCTGGACT AGCGG TTTCTAAATTGGATT ATTTCTACCTGAC
CCTGGAGCCATCGTC CTCGTCCTCCGTCCC TTAGCGCCTCCTGCA TGGAT GTCGTTTTTGGGTTT CATACCTTTTCAC
ACTGGAAAAATACGG AATTTGTTGTAAGCC CTTTCAAGACGAATG GGATT TAGCTTCGGATGTCA ACGTCACCATAAT
CATATTAGGAATATT TCTACTCAATTGCAA TATTGGTACTTTTCT GACTG TAAACGCGATGATAA TTACAAATATGCC
TAATTTGCTGTCTTT ATAATCAAATGGAGT TCTTTATATTTCCAA AATAT TGAAATTCCGATTCC CTAGAAAATAATA
CGTTTTTCTGTTATT AATAAAAAACCAATA GGAAAGTTCTCAAAA ATTAC TCTGTTGTATTTGAT CATTTCTTTTCCG
GTATAATCTTTTATT TTAAGCATTCCCATG TGAATAAATTTCAGA CTAAT GTATTAATAAGATGT CGTGTTTTTCCAC
TTACAAATTTCTCAT ACAGCTGGATATATA CTACGAGTACTATAC ACATG CTCTGGG
# THIS IS THE SECOND SEQUENCE
>Cp36_DRR|Droso phila melanogaster|Cp 36|FBgn0000359| X:8323349..8324 136
AGTCGACCAGCACGA GATCTCACCTACCTT CTTTATAAGCGGGGT CTCTA GAAGCTAAATCCATG TCCACGTCAAACC
AAAGACTTGCGGTCT CCAGACCATTGAGTT CTATAAATGGGACTG AGCCA CACCATACACCACAC ACCACACATACAC
ACACGCCAACACATT ACACACAACACGAAC TACACAAACACTGAG ATTAA GGAAATTATTAAAAA AAATAATAAAATT
AATACAAAAAAAATA TATATATATACAAAA ATTTGTTGTGTTTGA ATTGA ATTAAGAGCTTATCA AGAAAAAAATTTC
AGTGACTCATAATAC ACTACTCTACAAGTT TAAATTGAATCAACA ATTTA ACTTTCATTGCTCAG GTTTTTAGTAACA
ATGTTTATATAAGTT TAGGTATAACAAATG ATTTAAATATAAGAT ACTGT ATTTCACATTGAGAC GAAACAATCCACC
GAAAATCATAAAATA TAAGAATGTTGCATT TTATTTTTAAAAATA AAGAT GCCTTTTAAGAGGAA TAACTTAAATGTC
TTTAATACCTTTGAA TTTAATTATATGGCT AATAAACACAAACTT AAAGC TTAAAACTGCATCGA ATTGAATGCGGTT
ATAAATGTACTTATA TATCTAATATAATCT GCTAATATGGTTTAC ATGGT ATATCTTTCTCGGAA ATTTTTACAAAAA
TTATCTATTCATATA TCTCGAGCGTAAGAT ATTTATCAGTTTATA GATAA CATCTTTAAATTTGG GTGATTAAAAAAA
AACATTG
#THIS IS THE THIRD SEQUENCE
>Cp36_PRR|Droso phila melanogaster|Cp 36|FBgn0000359| X:8324430..8324 513
TCTAGAGATCTGGGC ACGATGGCGAGACAA AGATGCGGCGCAAAA TCGGA AATGGAGATGGATCA CGTAGCCGGCCAT
GGCGG

**aboxylica** · Jul 16 '07, 01:29 PM

and one more thing.The calculation is not corresponding.
not like first sequence and first dataset.
to make it simpler
just say i have first sequence I am going to find the log value using all the matrices..
like I get a set of log values for matrix 1,then matrix 2,matrix3...... .so..on
Now, just realize that there are many sequences too..
so what il do is. for seqence one find all the values using all the matrices..
then move to sequence 2..find all the values using all the matrices..
then move to third sequence..find all the values for all the matrices..so on........
I think its better now??

**bvdet** · Jul 16 '07, 02:03 PM

Originally posted by aboxylica

This is my full code now.there seems to be problems.But I cant really figure it out!!

Code:

def parseArray(fn, dataset=1, key='PO', term='/'):
    '''
    Read a formatted data file in matrix format and
    compile data into a dictionary
    '''
    .....................................
    .....................................
    print '\n'.join(outList)
    print sum([float(s.split('=')[1]) for s in outList])

waiting for ur reply,
cheers!

'parseArray' returns the array dictionary after adding 1 and normalizing each array element. 'parseData' returns a list of strings - the first string is the header and the rest comprise the sequence. Based on what you have communicated, this is part of the calculation printout:

Code:

>CG9571_O-E|Drosophila melanogaster|CG9571|FBgn0031086|X:19926374..19927133
C[01]/C = 1.922438822066
C[02]/C = 0.207382828702
A[03]/A = 0.939745715865
G[04]/G = 0.746578183326
T[05]/T = 1.200055302088
C[06]/C = 0.207296849088
C[07]/C = 0.327664869349
A[08]/A = 2.918567675930
C[09]/C = 0.217751970137
C[10]/C = 0.207382828702
G[11]/G = 0.207382828702
G[12]/G = 0.207382828702
C[13]/C = 0.279850746269
C[14]/C = 0.709249274160
G[15]/G = 3.334715885525
C[16]/C = 0.300705101618
C[01]/C = 1.922438822066
G[02]/G = 2.285358772294
A[03]/A = 0.939745715865
T[04]/T = 0.138255219135
C[05]/C = 0.207382828702
T[06]/T = 0.138197899392
A[07]/A = 0.145167980091
T[08]/T = 0.138255219135
T[09]/T = 0.138255219135
T[10]/T = 0.138255219135
A[11]/A = 0.138255219135
T[12]/T = 0.138255219135
A[13]/A = 2.209784411277
C[14]/C = 0.709249274160
G[15]/G = 3.334715885525
A[16]/A = 1.137840453477
G[01]/G = 0.207382828702
A[02]/A = 0.138255219135
G[03]/G = 0.207296849088
G[04]/G = 0.746578183326

In the first calculation, dataDict['C']['01'] is divided by value['C']. Second - dataDict['C']['02'] is divided by value['C']. When we get to the 17th calculation, we start over with dataDict['X']['01']. Is this right?

If we multiply each element together,

so it should be done like A[01]*T[02]*T[03]*.............. .../0.3*0.3*0.3.... ........

we get an incredibly small number like:

Code:

>>> result
2.4243160673268744e-210
>>>

**aboxylica** · Jul 16 '07, 02:19 PM

yes, we have to multiply it together only
to be clearer
CCTATA..for simplicity let us store them seperately and then divide
numerator=C[01]*C{02]*c[01]..... this is for postion one
corresponding denominator will be C*C*T*A*T*T*A(t he standard values should me multiplied)
formula is log(numerator/denominator)
so it will be
FOR POSITION ONE -->log ((C[01]*C{02]*c[01]..... )/ C*C*T*A*T*T*A(t he standard values should me multiplied))
FOR POSITION 2--->log((C[01]*T[01]*A[01].../C*T*A
(the standard values here))
is it clearer?
waiting for ur reply,
cheers!!

**aboxylica** · Jul 16 '07, 02:23 PM

we should not divide individualy!!we should calculatte for the entire seuence in the first position)
if this is the ccccctaaaaaaa
get the corresponding values for all the positions say that will be numerator
then get the standard values for the denominator multiply them
then do log(numerator/denominator)
ok??we are not dividing individually!!
am sorry for all the trouble but am finding it difficult to do..thats y am bugging u!!

**elbin** · Jul 16 '07, 03:01 PM

Bvdet, to make it easier for you too, this is the code generating the calculation (log value)

Code:

q=len(p)
value={"A":0.3,"T":0.3,"C":0.2,"G":0.2}
for i in range(q-len(datadict['A'])):
    part=p[i:i+len(datadict['A'])]
    seqq=part
    res=1
    score=1
    for j in range(len(datadict['A'])):
        key=seqq[j]
        res=res*datadict[key]["%02d"%(j+1)]
    for key in seqq:
        score=score * value[key]
    log_ratio=log10(res/score)
    print i,log_ratio

where datadict is the normalized matrix dictionary, and p is the WHOLE sequence between the header line starting with '>' and the next header, as a string. So, for a sequence of 50 chars length and matrix of 16 lines you get 50 - 16 = 34 log values - from 0 to 33 in the initial example. That is because you evaluate with the matrix a moving window of X characters, where X is the matrix length. And each step the window moves with 1 character forward. Then it should be done as aboxylica said - for each sequence and each matrix.

Hope this helps, as I don't have time to do the whole thing... although there's not much to be done.

**bvdet** · Jul 16 '07, 03:19 PM

Originally posted by aboxylica

yes, we have to multiply it together only
to be clearer
CCTATA..for simplicity let us store them seperately and then divide
numerator=C[01]*C{02]*c[01]..... this is for postion one
corresponding denominator will be C*C*T*A*T*T*A(t he standard values should me multiplied)
formula is log(numerator/denominator)
so it will be
FOR POSITION ONE -->log ((C[01]*C{02]*c[01]..... )/ C*C*T*A*T*T*A(t he standard values should me multiplied))
FOR POSITION 2--->log((C[01]*T[01]*A[01].../C*T*A
(the standard values here))
is it clearer?
waiting for ur reply,
cheers!!

Now I am confused even more. Your examples are not consistent. The result of calculating a long sequence of small numbers like 0.2*0.3*0.2.... ...... result in incredibly small numbers. The calculation result is 0.0/0.0!

For a sequence of CCTATA.........
Numerator = C[01]*C{02]*c[01]........ ?
Denominator = C*C*T*A*T*T*A.. ....... ?

**aboxylica** · Jul 16 '07, 03:40 PM

oops... am sorry!!
Numerator = C[01]*C{02]*T[03]*A[04]*T[05]*A[06]........until the end of the sequence..we are adding one and then normalizing the matrix so the value will not be 0.0 at any cost..(the values which we are multiplying will be normalized values and not the original values in the matrix).
then for the same sequence we calulate the denominator which will be
0.2*0.2*0.3*0.3 *0.3*0.3=denomi nator(these are simple standard values)
log(numerator/denominator) this is for position one.. then we will do for position two.. then...so on..until there are enough alphabets to calculate(like say the matrix one has 16 postions)then we should stop in a position where the sequence length is less than 16.
ok?

Originally posted by bvdet

Now I am confused even more. Your examples are not consistent. The result of calculating a long sequence of small numbers like 0.2*0.3*0.2.... ...... result in incredibly small numbers. The calculation result is 0.0/0.0!

For a sequence of CCTATA.........
Numerator = C[01]*C{02]*c[01]........ ?
Denominator = C*C*T*A*T*T*A.. ....... ?

**elbin** · Jul 16 '07, 03:46 PM

Originally posted by bvdet

The result of calculating a long sequence of small numbers like 0.2*0.3*0.2.... ...... result in incredibly small numbers. The calculation result is 0.0/0.0!

They are small, but divided they give a reasonable number between -5 and 5 (roughly). You have to use numbers like

Code:

>>> 0.3**16
4.3046720999999976e-009

but there is a result.

**bvdet** · Jul 16 '07, 04:11 PM

Originally posted by aboxylica

oops... am sorry!!
Numerator = C[01]*C{02]*T[03]*A[04]*T[05]*A[06]........until the end of the sequence..we are adding one and then normalizing the matrix so the value will not be 0.0 at any cost..(the values which we are multiplying will be normalized values and not the original values in the matrix).
then for the same sequence we calulate the denominator which will be
0.2*0.2*0.3*0.3 *0.3*0.3=denomi nator(these are simple standard values)
log(numerator/denominator) this is for position one.. then we will do for position two.. then...so on..until there are enough alphabets to calculate(like say the matrix one has 16 postions)then we should stop in a position where the sequence length is less than 16.
ok?

The first sequence in the sequence data file is 759 characters long. The first array in the array data file has 16 elements. How many numbers are multiplied together? 759 or 16?

looping through a big file containing a set of files.

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment