looping through a big file containing a set of files.

  • aboxylica
    New Member
    • Jul 2007
    • 111

    Hey!
    I have a program that takes two input files, one in matrix form and one in sequence form. My problem is that I now have to supply a matrix file containing many matrices and a sequence file containing many sequences, and calculate the same log score that I calculated for one matrix and one sequence.
    It should work like this: for the first sequence, calculate the log values for all the weight matrices; then go to the second sequence and calculate the log values for all the matrices again, and so on.
    My matrix file is huge and contains many matrices. Part of it is shown here:

    //
    NA Abd-B
    PO A C G T
    01 10.19 0.00 10.65 6.24
    02 5.79 0.67 10.50 10.11
    03 4.50 0.00 0.00 22.57
    04 0.00 0.00 0.00 27.08
    05 0.00 0.00 0.00 27.08
    06 0.00 0.00 0.00 27.08
    07 27.08 0.00 0.00 0.00
    08 0.00 2.83 0.00 24.25
    09 0.00 0.00 24.45 2.62
    10 19.33 0.00 4.34 3.41
    11 0.31 12.28 3.39 11.09
    //
    //
    NA Adf1
    PO A C G T
    01 0.71 0.08 26.02 1.55
    02 3.03 23.00 1.24 1.09
    03 0.26 10.50 3.29 14.31
    04 0.00 0.06 28.23 0.07
    05 0.12 27.27 0.06 0.91
    06 1.44 20.36 0.37 6.19
    07 5.35 0.28 21.49 1.24
    08 7.81 16.10 3.81 0.63
    09 0.51 17.77 0.45 9.63
    10 0.00 0.14 28.21 0.00
    11 0.00 25.69 0.20 2.46
    12 0.48 9.98 0.07 17.82
    13 1.27 0.00 27.01 0.07
    14 15.59 7.98 2.92 1.87
    15 4.28 22.37 0.00 1.70
    16 0.18 0.77 22.70 4.70
    //
    //
    NA Aef1
    PO A C G T
    01 0.00 0.06 12.49 0.00
    02 3.80 0.17 0.00 8.57
    03 0.87 0.06 0.00 11.62
    04 0.06 9.76 2.32 0.41
    05 9.82 0.00 2.73 0.00
    06 9.76 0.00 0.00 2.78
    07 3.80 0.31 0.00 8.43
    08 0.00 0.00 0.00 12.54
    09 0.00 6.53 5.85 0.17
    10 0.00 12.38 0.17 0.00
    11 2.73 1.02 8.80 0.00
    12 5.85 0.00 6.70 0.00
    13 1.02 5.96 0.00 5.57
    14 0.00 5.16 4.66 2.73
    15 1.03 7.55 3.97 0.00
    16 4.82 5.00 2.73 0.00
    //
    //
    NA Antp
    PO A C G T
    01 5.52 14.49 27.56 0.49
    02 8.17 14.02 11.42 14.47
    03 18.18 27.29 1.31 1.29
    04 40.26 5.66 1.83 0.32
    05 19.05 12.67 0.43 15.91
    06 9.94 0.07 0.20 37.86
    07 26.63 15.17 0.00 6.27
    08 47.45 0.06 0.00 0.56
    09 0.81 0.48 0.00 46.79
    10 26.46 19.05 1.81 0.75
    11 48.07 0.00 0.00 0.00
    12 30.51 0.00 0.00 17.56
    13 43.45 0.00 0.00 4.62
    14 30.06 5.98 0.00 12.03
    15 0.38 0.64 0.00 47.05
    16 22.14 0.29 7.15 18.49
    //
    //

    Here is part of the sequence file. The actual sequence starts at "CC"; the line before it is just a heading, which we omit. This excerpt contains two sequences.
    >CG9571_O-E|Drosophila melanogaster|CG9571|FBgn0031086|X:19926374..19927133
    CCAGTCCACCGGCCG CCGATCTATTTATAC GAGAGGAAGAGGCTG AACTCGAGGATTACC CGTGTATCCTGGGAC GCG
    GATTAGCGATCCATT CCCCTTTTAATCGCC GCGCAAACAGATTCA TGAAAGCCTTCGGAT TCATTCATTGATCCA CAT
    CTACGGGAACGGGAG TCGCAAACGTTTTCG GATTAGCGCTGGACT AGCGGTTTCTAAATT GGATTATTTCTACCT GAC
    CCTGGAGCCATCGTC CTCGTCCTCC
    >Cp36_DRR|Drosophila melanogaster|Cp36|FBgn0000359|X:8323349..8324136
    AGTCGACCAGCACGA GATCTCACCTACCTT CTTTATAAGCGGGGT CTCTAGAAGCTAAAT CCATGTCCACGTCAA ACC
    AAAGACTTGCGGTCT CCAGACCATTGAGTT CTATAAATGGGACTG AGCCACACCATACAC CACACACCACACATA CAC
    ACACGCCAACACATT ACACACAACACGAAC TACACAAACACTGAG ATTAAGGAAATTATT AAAAAAAATAATAAA ATT
    AATACAAAAAAAATA TATATATATA
    This is my code, which works for one sequence and one matrix (it prints the log values):
    [CODE=python]
    from math import *
    import random
    f=open("deeps1.txt","r")
    line=f.next()
    while not line.startswith('PO'):
        line=f.next()

    headerlist=line.strip().split()[1:]
    linelist=[]

    line=f.next().strip()
    while not line.startswith('/'):
        if line != '':
            linelist.append(line.strip().split())
        line=f.next().strip()

    keys=[i[0] for i in linelist]
    values=[[float(s) for s in item] for item in [j[1:] for j in linelist]]

    array={}
    linedict=dict(zip(keys,values))
    keys = linedict.keys()
    keys.sort()
    for key in keys:
        array=[key,linedict[key]]

    datadict={}
    datadict1={}
    for i,item in enumerate(headerlist):
        datadict[item]={}
        for key_ in linedict:
            datadict[item][key_]=linedict[key_][i]

    for keymain in datadict:
        for keysub in datadict[keymain]:
            datadict[keymain][keysub]+=1.0

    datadict1=datadict.copy()
    for keysub in datadict:
        for keysub in datadict[keymain]:
            datadict1[keymain][keysub]=datadict[keymain][keysub]/(sum(values[int(keysub)-1])+4)

    def readfasta():
        file1= open("chr011.py",'r')
        file_content=file1.readlines()
        first=1
        list1=""
        for line in file_content:
            if line[0]==">":
                if first==0:
                    print "***********"
                    list1+=sequence
                    print "***********"
                else:
                    first=0
                sequence=""
                seq=""
                for i in range(0,len(line)-1):
                    seq+=line[i]
            else:
                for i in range(0,len(line)-1):
                    sequence+=line[i]
        list1+=sequence
        return list1

    p=readfasta()

    res=1
    part=""
    q=len(p)
    seqq=""

    value={"A":0.3,"T":0.3,"C":0.2,"G":0.2}
    for i in range(q-16):
        part=p[i:i+16]
        seqq=part
        res=1
        score=1
        for j in range(16):
            key=seqq[j]
            res=res*datadict1[key]["%02d"%(j+1)]
            #print res
        for key in seqq:
            score=score * value[key]
        #print score,"*******************",res
        log_ratio=log10(res/score)
        print i,log_ratio
    [/CODE]
    What changes should I make, and how?
    Waiting for your reply,
    cheers!
    Last edited by bartonc; Jul 13 '07, 08:32 AM. Reason: Added =python to code tags
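The scoring loop in the code above hard-codes a window of 16 positions, which matches Adf1, Aef1 and Antp but not the 11-row Abd-B matrix. Below is a minimal sketch of the same calculation as a reusable function whose window width follows the matrix. It is written for Python 3 (the thread itself uses Python 2), and the names `BACKGROUND`, `to_probabilities` and `log_scores` are illustrative, not from the original post. It keeps the post's pseudocount scheme (add 1 to each count, divide by the row sum plus 4) and its background frequencies. Two differences are deliberate assumptions about the intent: it normalises all four base columns (as posted, the normalisation loop reuses `keymain` from the preceding loop, so it appears to touch only one column), and it also scores the final window, which `range(q-16)` skips.

```python
from math import log10

# Background base frequencies taken from the post's `value` dict.
BACKGROUND = {"A": 0.3, "T": 0.3, "C": 0.2, "G": 0.2}


def to_probabilities(rows, bases=("A", "C", "G", "T")):
    """Turn count rows into per-position probabilities with a pseudocount of 1."""
    probs = []
    for row in rows:
        total = sum(row) + len(bases)
        probs.append({b: (c + 1.0) / total for b, c in zip(bases, row)})
    return probs


def log_scores(sequence, probs, background=BACKGROUND):
    """Return (start, log10 ratio) for every window as wide as the matrix."""
    width = len(probs)
    results = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        res = 1.0    # probability of the window under the matrix
        score = 1.0  # probability of the window under the background
        for pos, base in enumerate(window):
            res *= probs[pos][base]
            score *= background[base]
        results.append((start, log10(res / score)))
    return results


# First three rows of the Abd-B matrix from the post (columns A C G T).
abd_b_rows = [
    [10.19, 0.00, 10.65, 6.24],
    [5.79, 0.67, 10.50, 10.11],
    [4.50, 0.00, 0.00, 22.57],
]
probs = to_probabilities(abd_b_rows)
for start, ratio in log_scores("CCAGTCC", probs):
    print(start, round(ratio, 3))
```

Any character other than A, C, G or T in the sequence raises a `KeyError` here, so spaces and line breaks need to be stripped before scoring.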
  • bartonc
    Recognized Expert Expert
    • Sep 2006
    • 6478

    #2
    Start with a list of input files (hopefully, generated automatically when there are many). Like:
    Code:
    matrixFileList = ['matfile1', 'matfile2']
    sequenceFileList = ['seqFile1', 'seqFile2']
    
    # use zip() to pair them
    filesList = zip(matrixFileList, sequenceFileList)
    
    # define your function to take file names
    def mainFunc(matFileName, seqFileName):
        print matFileName, seqFileName
    
    
    # call the function for every file pair in the list
    for pair in filesList:
        mainFunc(*pair)   # the * breaks the tuple into its parts.
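
The original question describes a single file holding many matrices rather than many separate files. Here is a minimal sketch of reading every matrix from such a file, assuming the layout shown in the first post: an `NA <name>` line, a `PO A C G T` header, numbered count rows, and `//` separators. It is written for Python 3 (the thread uses Python 2), and `parse_matrices` is a hypothetical helper name.

```python
import io


def parse_matrices(lines):
    """Return {name: [[A, C, G, T], ...]} from an iterable of text lines."""
    matrices = {}
    name = None
    rows = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("//"):
            # A separator closes the matrix collected so far, if any.
            if name is not None and rows:
                matrices[name] = rows
            name, rows = None, []
        elif line.startswith("NA"):
            name = line.split(None, 1)[1]
        elif line.startswith("PO"):
            continue  # column header; columns are assumed to be A C G T
        elif name is not None:
            fields = line.split()
            rows.append([float(x) for x in fields[1:]])  # drop the row number
    if name is not None and rows:
        matrices[name] = rows  # file ended without a closing separator
    return matrices


sample = io.StringIO("""//
NA Abd-B
PO A C G T
01 10.19 0.00 10.65 6.24
02 5.79 0.67 10.50 10.11
//
//
NA Adf1
PO A C G T
01 0.71 0.08 26.02 1.55
//
""")
matrices = parse_matrices(sample)
for name, rows in matrices.items():
    print(name, len(rows))
```

Passing an open file object instead of the `io.StringIO` sample works the same way, since both yield lines when iterated.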


    • aboxylica
      New Member
      • Jul 2007
      • 111

      #3
      My sequence file:
      >CG9571_O-E|Drosophila melanogaster|CG9571|FBgn0031086|X:19926374..19927133
      CCAGTCCACCGGCCG CCGATCTATTTATAC GAGAGGAAGAGGCTG AACTCGAGGATTACC CGTGTATCCTGGGAC GCG
      GATTAGCGATCCATT CCCCTTTTAATCGCC GCGCAAACAGATTCA TGAAAGCCTTCGGAT TCATTCATTGATCCA CAT
      CTACGGGAACGGGAG TCGCAAACGTTTTCG GATTAGCGCTGGACT AGCGGTTTCTAAATT GGATTATTTCTACCT GAC
      CCTGGAGCCATCGTC CTCGTCCTCCGTCCC TTAGCGCCTCCTGCA TGGATGTCGTTTTTG GGTTTCATACCTTTT CAC
      ACTGGAAAAATACGG AATTTGTTGTAAGCC CTTTCAAGACGAATG GGATTTAGCTTCGGA TGTCAACGTCACCAT AAT
      CATATTAGGAATATT TCTACTCAATTGCAA TATTGGTACTTTTCT GACTGTAAACGCGAT GATAATTACAAATAT GCC
      TAATTTGCTGTCTTT ATAATCAAATGGAGT TCTTTATATTTCCAA AATATTGAAATTCCG ATTCCCTAGAAAATA ATA
      CGTTTTTCTGTTATT AATAAAAAACCAATA GGAAAGTTCTCAAAA ATTACTCTGTTGTAT TTGATCATTTCTTTT CCG
      GTATAATCTTTTATT TTAAGCATTCCCATG TGAATAAATTTCAGA CTAATGTATTAATAA GATGTCGTGTTTTTC CAC
      TTACAAATTTCTCAT ACAGCTGGATATATA CTACGAGTACTATAC ACATGCTCTGGG
      >Cp36_DRR|Drosophila melanogaster|Cp36|FBgn0000359|X:8323349..8324136
      AGTCGACCAGCACGA GATCTCACCTACCTT CTTTATAAGCGGGGT CTCTAGAAGCTAAAT CCATGTCCACGTCAA ACC
      AAAGACTTGCGGTCT CCAGACCATTGAGTT CTATAAATGGGACTG AGCCACACCATACAC CACACACCACACATA CAC
      ACACGCCAACACATT ACACACAACACGAAC TACACAAACACTGAG ATTAAGGAAATTATT AAAAAAAATAATAAA ATT
      AATACAAAAAAAATA TATATATATACAAAA ATTTGTTGTGTTTGA ATTGAATTAAGAGCT TATCAAGAAAAAAAT TTC
      AGTGACTCATAATAC ACTACTCTACAAGTT TAAATTGAATCAACA ATTTAACTTTCATTG CTCAGGTTTTTAGTA ACA
      ATGTTTATATAAGTT TAGGTATAACAAATG ATTTAAATATAAGAT ACTGTATTTCACATT GAGACGAAACAATCC ACC
      GAAAATCATAAAATA TAAGAATGTTGCATT TTATTTTTAAAAATA AAGATGCCTTTTAAG AGGAATAACTTAAAT GTC
      TTTAATACCTTTGAA TTTAATTATATGGCT AATAAACACAAACTT AAAGCTTAAAACTGC ATCGAATTGAATGCG GTT
      ATAAATGTACTTATA TATCTAATATAATCT GCTAATATGGTTTAC ATGGTATATCTTTCT CGGAAATTTTTACAA AAA
      TTATCTATTCATATA TCTCGAGCGTAAGAT ATTTATCAGTTTATA GATAACATCTTTAAA TTTGGGTGATTAAAA AAA
      AACATTG
      >Cp36_PRR|Drosophila melanogaster|Cp36|FBgn0000359|X:8324430..8324513
      TCTAGAGATCTGGGC ACGATGGCGAGACAA AGATGCGGCGCAAAA TCGGAAATGGAGATG GATCACGTAGCCGGC CAT
      GGCGG
      >Him_distal|Drosophila melanogaster|Him|FBgn0030900|X:18039896..18043470
      GGTTTTCTGCGATGG CTTCCGCGCCAGCTG AAGTATCTGATTTGC TGCCTTGTTTTTGTT GATATTTCTGCGAAG GGA
      CTTGTGCTTTTCAAA TGGCCTTTTTTTGGG ATTACGGCAAGGGCG CGTTTCCCACGCTCG ATCCCCACTTACCAT TGG
      TGCACGCGATTGCGG CAAGCTGCTGAGGCA AGCTATTAAACGCCA CACTGGGCCGGGGGG CGGTACCGGTGGGCG TGG
      CAGGGGAGTCGACAC ATGTTGTGTGCCAGA GAACTTTGCTCCGAT CCCCAGATCATCAAA TAGTTGTCGCTGTCT GCT
      CGTGCGCAAATTGCA ATACTTTGCATACCC TTACTGCAGGGTATC TGAGCTTGGACTTTA AATAAGGGGGTATAA CAT
      AGCTTATACTCTCTA TCTCTGTTATAAAGT CAATTTTCCTTAGAT CTTTAGTACAGTGGG TAGTTAAGGAGACAT AAC
      TTCCAAAAAAAAAAA CTATAAAATTGCAAT AATTTATGCAAAATA TGTATTTTATTGAAT GGGATGAATAATTTA CCT
      TATACGACTGTAAAA CATTTCTAACGATTA AATGCACTTCTAAAA GTTTTCCCACAAGTA GGTGAGCTATTATGC TAA
      GCGTTCCATGACTTG GAATCTAAGATCTTG TTTTGATCTTCGCTG ATCTTTGAGAACTCG GGGATTACTTACACA TTT
      CTGGGCAGGCACAAG TGGGCCGAGGCAGTG TAGATTCATCACGTT TTCACTCAACACACG CAGCTCATTAACAGC CCC
      GCTGACAACTTGTCA GGACTTCCCCCTCGT GAATCCCCCTGCTAC GCAACCCCCATTCCC CGCCCATTCCAACAC TTC
      CCGCCGGGAGCGTGG GAAATTATGCGTGTT GGTGGGACGTCGGGC GGTGAAAATTGGCGC GCTCTTCGGGGGGCC ACA
      CCGCGTGGCATTGAC AACTCTTCCACATTT CGCGCCCAACGATGC GTTGGCATCAGTGGG TCACAGGGATTACGG CTG
      GCTGGGATTCCAGAG CCAGATCTTTTTCAG CCAAAACTTTCAGCT TTCGAAGACCTCAAG CGATAGGAGAGTGTC GGA
      AGTCCAGAAATAGAC GCGTAGCACATAAAT TATGGATCGTATCGA GTATCGATTAGCCCG GGACAAGCGAAGCGA TAG
      GGAGACATATTTTTA TTACCCTCTCGGGGA CCTGCACTTGTTGGC TTCGCTTCTATGAAA GATCCCTCTACCATA TCA
      CGTATGTGGGCTCCC CCAATCGAACCGAGT TGTGGGAAATGTTTT CCCAGGCCAACAGCT AATTGTCACTCCAAG GGT
      TGTCCCCGCAGCCCA GACGACAGATAAGCG GGCAAGTGAAGCCCA GCGATCTGAGTCAAG TGAAGGGCTTCAATT TCT
      TTCCCGAGTGGAACT GGGATATCGAAATTA CATTTGTAACAGACG TTTTAGTCCGCAATC CTCAGCTAATGGGAC TTA
      CGAACATATATTCAT CTGAAATTCAAGAAC ATGCGCACTTAAAGA GCAGGGAAGTCGCAC ACGCGCAAGTCAGGC GCT
      CAAAAAGGGATCTTC GGAGGTACAGTGGGC AAAAGACTGTAAATA AATAATATAAATAAA ATAATATTTAGCTCT ATG
      TGTTTATATAATCTA CAAAGTAGTTAACAA AAAATATAAAATGGA TATAAAAATACATCT TATATATCCCTATAA TAA
      GAAATAAATAATAAT TTTAGTAAATTAATT TTGTTACACAAAGTA CCTGTATTATTACCT CTTTTTTGTTGGTTG GTT
      CTTTTTTGATGTGGC CCCACTGTGCTCTCT TATCAGTGCGACAAT CAGGCATTGCCTTTC CCCATCGGGGGATTC TAA
      TTCCGTGGACGATGG GCCGAAACGCCTATA AAGTCGCTCATTAAA AATGTTTAATTATGG CCCATCTTGCATCTT GCA
      CCGATGTGGATGGGG TTTGTCGGCAATGAT TTACATTATAAAAAT GCCCGTTATCTGAGC ATTTTGTACGCTCCA CTC
      CCTCTTCCCCCCTCC AAAAAAAAAAAAAAC AGATATGTATATTCC CCGAGATATTCCCAA GCGGCCAAAAATAGA CGC
      AAATTGTAACGCACT TGAAGTGCACTCTGA AACATCTTGAAGTCC AAATAAAATAGCAGA GAGACCCACAATAAT ATA
      CGTTGATATACACAT GTATATATGTATGTA TGTACATAAAGGGCC AGGAGCAGGAACGTT AGGCATGCGGTGGTA CGA
      GCACCGTGGTGCGAG CGAGAGCGCTGTGCT GCCTGAGGGAGAGGT AGCGAGTGGGTTGCA TTGCGCACACAGAAC ATG
      TGAATGCAGAGTTCA AGTGCATGCCGTGAC ACAGACACGCACACA CACACACGCACACAC AGATGAGTAGCCGCT GCA
      AAGTGTTTTTTCCCA GGCGCTATTTATAAT ATGCATCCCGTCGCC GATCCGATCCGATCC AATCCAATCCGATTG GAT
      CCCATCTTGCGGCAC TACGATTATGACGCT CGACACGATGATGCA TTCGCAGAGTTTCCC GATCGCAGAGTACCC TGT
      ACTCGAGTAGTTTTT AGATGCAGTATTATT AAGTAGAAAATTGTA ACCGTATAATATTCC ATTATATTAAATATT TTT
      ATAGCACTAAAGAAA TAAAAGCCCATTTTA TAATTTATATTACAA AAATACTTAACCATA GAAACTTATGATATG ATA
      CCAATATTTAAGTTC CAAAAAATGTAGAAC ATTTTTAAGTATATA CTCGAAAATATTAAT TTTCAAAATTGATAT TCA
      AGAGATATTATAAAA AGATCCCCATTCTAA ATATCTAACATCATG CCATGCTTTCTAATG AGTATAGTATACCCC TGC
      TACCCTGTCAATCCG CAAAACAGGCGCCGA AACATGCGGTTTCTC GCAGCAGACTGCCAC GGGAAAAATTCGGTT CGA
      GATTTGGGAATGGAT GTATGACGGAGCAGA AGGAGCAGGACCCGG ATTTCGGATTTCGGA ATGGATATGGAAATG AAG
      ATGGAAATGGGACTT TGACTGCGCGACGGC CACATGCGCCGCTGG CGATGCCGCTGGATG TTGCATGTGGCAGCG GTC
      GGTGCAGCAGCGAAA GTGTTGCAGCTGTAT GAGAGGGTCTATTTT TGGGGCGATTGTGCG GCGCTGGTGCTGCCA CAT
      GTGTTCTGTGTTGGG CTGCTAAAAGGCATT GTAATGAGAGCAGAA AATAGAATTGACTCC ACTTGAGCAATGTCC CAT
      AAAGCGGGAGTTTCG AGTTTGGCGCGCAAT GTGCCGCACCAGCAA ACGAACAAAAGAAAA AAAAAAAAAAAAAAC ACA
      GCCAGTAACACATGG GCCCACGAGTTATGT TTTATTTTTAATCCC ACAAAGAGTCGATCT CCAAAACAAACCCGC AGA
      GAGCACATATAAAGA GACTCGGTGGACGAG TGGTTCGAAACAGTC TTCCGCCGCAGCTCG ACGCGCTCGCATATC GGG
      AATATATAGATCGGA GATATCGCAGGACCC ACAGCAGAGCAGAGC CGCAGAGCCACCAAC CTCG
      >Him_proximal|Drosophila melanogaster|Him|FBgn0030900|X:18041232..18043470
      GCCCAGACGACAGAT AAGCGGGCAAGTGAA GCCCAGCGATCTGAG TCAAGTGAAGGGCTT CAATTTCTTTCCCGA GTG
      GAACTGGGATATCGA AATTACATTTGTAAC AGACGTTTTAGTCCG CAATCCTCAGCTAAT GGGACTTACGAACAT ATA
      TTCATCTGAAATTCA AGAACATGCGCACTT AAAGAGCAGGGAAGT CGCACACGCGCAAGT CAGGCGCTCAAAAAG GGA
      TCTTCGGAGGTACAG TGGGCAAAAGACTGT AAATAAATAATATAA ATAAAATAATATTTA GCTCTATGTGTTTAT ATA
      ATCTACAAAGTAGTT AACAAAAAATATAAA ATGGATATAAAAATA CATCTTATATATCCC TATAATAAGAAATAA ATA
      ATAATTTTAGTAAAT TAATTTTGTTACACA AAGTACCTGTATTAT TACCTCTTTTTTGTT GGTTGGTTCTTTTTT GAT
      GTGGCCCCACTGTGC TCTCTTATCAGTGCG ACAATCAGGCATTGC CTTTCCCCATCGGGG GATTCTAATTCCGTG GAC
      GATGGGCCGAAACGC CTATAAAGTCGCTCA TTAAAAATGTTTAAT TATGGCCCATCTTGC ATCTTGCACCGATGT GGA
      TGGGGTTTGTCGGCA ATGATTTACATTATA AAAATGCCCGTTATC TGAGCATTTTGTACG CTCCACTCCCTCTTC CCC
      CCTCCAAAAAAAAAA AAAACAGATATGTAT ATTCCCCGAGATATT CCCAAGCGGCCAAAA ATAGACGCAAATTGT AAC
      GCACTTGAAGTGCAC TCTGAAACATCTTGA AGTCCAAATAAAATA GCAGAGAGACCCACA ATAATATACGTTGAT ATA
      CACATGTATATATGT ATGTATGTACATAAA GGGCCAGGAGCAGGA ACGTTAGGCATGCGG TGGTACGAGCACCGT GGT
      GCGAGCGAGAGCGCT GTGCTGCCTGAGGGA GAGGTAGCGAGTGGG TTGCATTGCGCACAC AGAACATGTGAATGC AGA
      GTTCAAGTGCATGCC GTGACACAGACACGC ACACACACACACGCA CACACAGATGAGTAG CCGCTGCAAAGTGTT TTT
      TCCCAGGCGCTATTT ATAATATGCATCCCG TCGCCGATCCGATCC GATCCAATCCAATCC GATTGGATCCCATCT TGC
      GGCACTACGATTATG ACGCTCGACACGATG ATGCATTCGCAGAGT TTCCCGATCGCAGAG TACCCTGTACTCGAG TAG
      TTTTTAGATGCAGTA TTATTAAGTAGAAAA TTGTAACCGTATAAT ATTCCATTATATTAA ATATTTTTATAGCAC TAA
      AGAAATAAAAGCCCA TTTTATAATTTATAT TACAAAAATACTTAA CCATAGAAACTTATG ATATGATACCAATAT TTA
      AGTTCCAAAAAATGT AGAACATTTTTAAGT ATATACTCGAAAATA TTAATTTTCAAAATT GATATTCAAGAGATA TTA
      TAAAAAGATCCCCAT TCTAAATATCTAACA TCATGCCATGCTTTC TAATGAGTATAGTAT ACCCCTGCTACCCTG TCA
      ATCCGCAAAACAGGC GCCGAAACATGCGGT TTCTCGCAGCAGACT GCCACGGGAAAAATT CGGTTCGAGATTTGG GAA
      TGGATGTATGACGGA GCAGAAGGAGCAGGA CCCGGATTTCGGATT TCGGAATGGATATGG AAATGAAGATGGAAA TGG
      GACTTTGACTGCGCG ACGGCCACATGCGCC GCTGGCGATGCCGCT GGATGTTGCATGTGG CAGCGGTCGGTGCAG CAG
      CGAAAGTGTTGCAGC TGTATGAGAGGGTCT ATTTTTGGGGCGATT GTGCGGCGCTGGTGC TGCCACATGTGTTCT GTG
      TTGGGCTGCTAAAAG GCATTGTAATGAGAG CAGAAAATAGAATTG ACTCCACTTGAGCAA TGTCCCATAAAGCGG GAG
      TTTCGAGTTTGGCGC GCAATGTGCCGCACC AGCAAACGAACAAAA GAAAAAAAAAAAAAA AAAACACAGCCAGTA ACA
      CATGGGCCCACGAGT TATGTTTTATTTTTA ATCCCACAAAGAGTC GATCTCCAAAACAAA CCCGCAGAGAGCACA TAT
      AAAGAGACTCGGTGG ACGAGTGGTTCGAAA CAGTCTTCCGCCGCA GCTCGACGCGCTCGC ATATCGGGAATATAT AGA
      TCGGAGATATCGCAG GACCCACAGCAGAGC AGAGCCGCAGAGCCA CCAACCTCG
      >Obp18a_prom|Drosophila melanogaster|Obp18a|FBgn0030985|X:18969778..18972746
      ATGGCGAAAATCTGT TTCCCAACTAACAAT GAGCGCATCATCACA GCTCTATATATATAA CCCATCGATTTGCTA ATT
      CAGCTCAAAAGTAGA CAGGAGATTTTAATT AAATAATTGGATGCT ACTTTACATTCGCCA CACACCAACAAATAA AGT
      CTATAATTGAAATTT TAAGCGCAGTTCCCG ATTATGAGCTACACG TATGTCGTATGCGCA ATATCTGCATTACAA TTG
      CCAATAGTAAATTAC CAACTTGGTTTTCTT CATATTTATTAAGAT AGAAAACATACAATT TTTGGCTTTTACACT CCA
      AGCATCTCTGAAGTT TAAACAAAAAACATA TGTGTAGCCTATCTA CTGTATTGGACTTTA TTCGTATATTTTATA TGG
      TTCATTAATATAGGT ATAAATACAAATTAT ATTCACGCTTTGCGA TTTGCAGCGAATATC ACATCTTATACACGA TGT
      AAAAAAAAAAAAAAT ATTTCGTCATGTTTT TAGGTTGGCCGCAGG CAGTGCTCACTGTAC CGCCACAATGTTTAT CGT
      TTTGCATTTTTTTTT TCTTTGTTTTCTTGC GGTTTCCCCTAATTA TCTTTAGTATAAACT TAGTCTACTGTCTTT TTT
      GGTAAGTATTTTCGT GATGGGCTCGTCTAT GCGAATTCCCATTTC CAATGAATAAATAAA GTAATTAGAACATTA AAA
      TTAGCAATAAAACAC GTACATTTAAAGCTG ACAACAAAAAAAAAA AGTATTCTTATGTTA AACTGTAGTATGTGC CTA
      TGCAATATTAAGAAC AATTAAATAAAATAG CATATTAACTTATGG CAGCACTTTGTTGCT ATGTTTATGTTTATG TTT
      ATGCACGCAGTTAGG CCAGGGCGGATGTAA CATGATCACCCACTC GAAGGCAAAAAGTAT AAGTGCATGGTCAGC ATT
      CACACGCCGACCAAA TACATATTACATACG TACATACATATCTCG CTCTCCCGATAAGCC TAGATATATAAGATA TAC
      ATAAGAACGCCGCTC CGCTGCTGGCGTACC CGGCAGCGCAGCTAC GCGGATTAGCCTAAG TCCAAATATATTAAA AAC
      TGTAAAATCAGAGAG ACTCTGTAGACGTTG AGCTGACAGAACCAT TTCTGCCTACTCTAA AATCAAAAGAAGAAA TTG
      AATAAATATATGTCA GCCCGACGGCTGCCT TCAACTTAAAACGGA CTTGTGTTCTGAATT GGAGTTCATCATTAC ATG
      GCGACCGTGACAGTC GTCCAACGCTGGACG AATTGACCAAAGCTG GTGAAAACAAAGGAA CAAAGGAACACTGGA CTG
      GAAGAAGACTGGACT AATTAAATGGAACTG CAAAAACCAAGGAAA AATCTGAGTGAGTAG AGTTCTATTGAGTAT GGG
      CAAACACCGTGGCGG TTTGAAAACTAAGCT GAATAAACGTATAGC CCACGTAAGGTGGCT AATATACGGTCAGCA AAC
      GCCACCGGTTTGGTC GAAAGCTCTAAAGCT ACATGCAGAGCTAGA CCACTTGTTGCAATA TCAGCAAGAATTAAA GAC
      CCATAAGCTCGAGAA AACTCACTCAGATAA TATTAAAAATATACC CACAATTAATGAAGT TCCAAAATACCAGGC ATG
      TCCAGCACCAGCACC AGCATTAACAAAACC AAAGAAGTCCTGCCC CCCTGGCTGCGAAGG AATCTGGAGTCCCCA CTG
      CCTGGGGACTTGTGA GCGACCATCGACGTC TTCAGCGGCGAAGAA ATAGACAGCAGCGAG GGAGTGTCAGCGTGC CAC
      CCCCGGCGACGCCCA GCTGACACCTGATGA GCATCATCAACAGCA GAATATAATAATAAA TATATATAAATATAA AGT
      AAATATAAAATATAT ATAGATAAGAAAAAT TGTAAGAAATATTGT AAAACGGAGCATATA CTATTATGCCCTGTT AAC
      CCAATATGGCCCGTG AAGCCATAGCTAGAA TCAGGCAGGCAACAA TGTAAAATACAATTT TTTTTTACTCTTGCG AAC
      ATTGAAAGATTTTAT AAATAGATAATTCCA AACATAAATGTCTAT AGAGACAAATGAAAT AAGTAAAACTGAAAA TAA
      AAGTATATACAAAGG AAATTTTCTATTCTA TTCTCCAAAATATAA AATTAGTATACCCAA AATGGGTCTAATAGA CAC
      TAAAACTGTGGACTC TACAGCCAATGTAAT AAATAAAGTAGAAGT CCAAAATGCAGACTT GTTCTGGATAACCAT AAT
      ACTAATTGTAATTGC ATTAATTATGGTATC CAATGCATTAATAAA AATATACAAACTGCA TAACAAGTGTCTTAA GAA
      ACGATACCGTAGCAC TGCTAACGGTATAGA TAATATTTAAGGAAG ATCTTTAATAAAGTC AATTATGAATGAAAA TAT
      GAGAAAAATTATATG AAAAAAAAAAAATAA TAAATAAAAAAAAAA ATATAAAACGTAATA TTGAATTTATCTACG TTA
      AAAAAAAAAATATAT ACAAATGAATAAATT TGAAGTTATGAGTAT ACCACAGCATGGACT GGGAAAAGCTTGTTG ATC
      AGATAAAAGATCAAA ATGAAAATTTCAGAA AATCCTATAAGTGCT TAACGCAAAACAGAT CAACACAAGCTGTAA CAA
      TCAATAGGAATGCCC AAGTCTTGGTAAATA GTTATAATGAAATCA GAGAGTTGATCCAAC AAAATAGAAAGAATT TGG
      AACGCAAACAGTGTG CTAAGGCTTTGAACC TACTGGTGACATTAA GAGAAAAATTAATAT TTATAAAAAATAAAT TCA
      GTCTCCAGATAGAAA TTCCAACCATAGTAA ACACCCCACTAAGAA TAAATTTGAATGAAG ACAGCACTAACTCTG ACG
      AGGAAGATAGGACTA TAGTCAAGGAAGACA TTAAAGAGGAAGATC TTCACGATCTAACTA TACCAGCAAAATTAA TGC
      TGAA
      >Obp19a_prom|Drosophila melanogaster|Obp19a|FBgn0031109|X:20223943..20226446
      CCACCTGCGAAATGG GTCATAGTATATGTA TTTGTAAAAAATGTA TGTAAAAAAATGTTA AATTAATAATTTTGA ATT
      TCAATTTGGAGCTGA AAATAATATTTTGTG TCCATCAACAGCTCC AAAGCGATGGTTCAT TTTATCTTGTGTGCG TTC
      AATAGAATCACTCTT ACGTTAGCGCGTCCA TTGATGGTTGTCCCA TTGAAGTACTTCTTA AAGCCGTCGGCCATT GCT
      ACTGGACTGGATCTG GAGATCTGGAGATCT GGATTTGGGGTCGGG TCCGGGTGAGAGCTG AGTGTGTTCTGCCTA TAG
      CTCCGAGCGAGAACC TAATGACAAGCAGCG AAGTGCAAAGCTCGG CCAACTAGATTACAA AGTCGATTCATTGGC AGG
      ATTCGATTTTTATTG ACTCAACGAGGTGGT ACATGAGTTTGGTCC CCAAGCCTTTAACTG TGGCATCGAGGACCG GAA
      AGGGGGTGCTGATTA TAAATAGTTATGGAT TGCTGACGGGTCGAA TGGGTCGGAGCGGTG GGGAGCCATGACTTC AAT
      GATTTGGCAGCATCG GCGCCCTAGCCATGG AGCATGGCCTGCTGG CAGCCCTTGCAGTAG AGCTTGGTCTCGCGC CGC
      TTCGTGTTGCGGCGG TGCATCTTGACCAGG ACGTAGACGAGTCCC AACGAGGCCCAGGTG GCCTTGGCTACCTGT GGG
      TTTCGGTGGCGTATT TGGGCGCATCTTGTG TACTGCCGTGTACTG AATCACTTACATTGG CGCGACCACGCATGG TCT
      GGCTGTTGAAGGCTT CGTTGAAGTTGAAAT GATCGGACATCTTTG GATCGTTGTTGACCG GATTGGCGTGGCTTT TAA
      CAAAAGATTAAAATT TGGATTCGATATTCG ACCTGTATTTTAGAC CGGGATTCGGATTGT GACTTTTAAACGTTC GAA
      ATGAAAGGAATGTTA CTGACAGTCGTCAAA GCCGACTCGGGTTTC CCAACTAGAGAGAAT GCTGAAGTCTAGTAC CGA
      CTAATGGGATACCCA TTAATTACTGCTTAA ATACTGTGATGAAAA TTGAGATATGCAAGA GGCAAATCGAAAGTT TTG
      GACATTTTCATATTG TACCTTTAACCAACT TCAGAATTCATTGAG CTAAATACCATTTAC AATTTTATGAAATTT TTA
      AGCATGTTACAGCTA TAACTATTTTTAAAC CAGTTACTAGATTCG TTGAAAATTGTATGT CACACAGAACTTCTT GCC
      ATCCTGGTCGGAATT AGGATCACTAGCCAA GCCGATATGGCTATG TCTGTCCGTATGAAA GTCTTGGAATCTGAT ATT
      AACATCGCATATCGA TCGACCATTATATAT CTAATATATCCTCTA CAAATGTATTTTATC ACCTAGCTAGCATGT AAA
      CATTCTGGCCTATTT AGCTGTACGCTTCAG TTATGCTAATGCAAA CATAAGCCTTTTGTG ATATTATAATTTACA TTT
      ATTATTTATTGCAGT TAGCTTTATCAGCGA TTTGGGCTCATGCCA CACGCAATACTACTT ATTTCAACGTCATCA GTT
      GTACTAAATGCACAA ATGAAATACATTTCG CCAAATAAATGCCAA CTTGCAACTAATTTG AATGCTAATCAAACC GAA
      CTACTCATTTGCATA CAAGGTAATAGGTGG TTAAAGTGAGTGTAA TGGACTTACTTAAGG GGTTACAAGGCTTAT ATT
      TAAAATGCCTGCCTT GTAATTAAATTTTTA AATATATTGGAAAAA AATGGCCACTTGTTA TGTGAGTCTCCAGAA AAA
      AAACAAAAAAACAGC AACCATCTGGTATGC AAAATATCTGGTGGT AGCAAAATATCTGGT GGTATCTGGTGGACT ATC
      AAAATATAAAAACTT TTTTTTCCAGATAGT ATATCTTAAAATCAG CATCTTGAAGGAGTA TATGTAAATAGCAAA CTA
      TTTGTAAAAATAGAT TTTATTTTATAATTT TTTAAGATATATACC AAACATTATTACCGA TTGTGATTATCTTTA CAT
      TGTTTGACCTCAAAA CGGAAAACTGGATGC GCGGTATCCATGCGA CCCTAACTCTGGAAC CGATTTTGGAACCGC CCC
      GTTAGATCTCAGATT GAAACCTTATTTGCA TTCGCATGATCGCTG ATGAACACTGGGGAA ATGCGGCCCAGCAAT GGG
      ATTGTCAACGCATCT CGGCCAGAATCGCGC CTCGCATGCCACCTC GCACGGTGACCACAT ACCTGTGTACACTGT CAA
      TTAACGTGGCAAGAT TATAGCCCGGCCAGA AAGTAATCCGCCCCA GGAACACCACCCACC GCCCGCCCATTTGGA TAT
      GGAAATGGGCAGTGG GGGCGGCGATTGGCG CTAACCCATAATTCC CACACCCACTTAGCG GTTCGATCGAACCAA TAT
      GAAGTCATTTGCATG TCGGGGGCCGTGTAT AAAAGGAGTCGCCGA TGGGTCTGGAGTCTG GAATCCGCCAAATCG TCT
      CGGAAAT
      >Obp19b_prom|Drosophila melanogaster|Obp19b|FBgn0031110|X:20224439..20227440
      ATTGCTGACGGGTCG AATGGGTCGGAGCGG TGGGGAGCCATGACT TCAATGATTTGGCAG CATCGGCGCCCTAGC CAT
      GGAGCATGGCCTGCT GGCAGCCCTTGCAGT AGAGCTTGGTCTCGC GCCGCTTCGTGTTGC GGCGGTGCATCTTGA CCA
      GGACGTAGACGAGTC CCAACGAGGCCCAGG TGGCCTTGGCTACCT GTGGGTTTCGGTGGC GTATTTGGGCGCATC TTG
      TGTACTGCCGTGTAC TGAATCACTTACATT GGCGCGACCACGCAT GGTCTGGCTGTTGAA GGCTTCGTTGAAGTT GAA
      ATGATCGGACATCTT TGGATCGTTGTTGAC CGGATTGGCGTGGCT TTTAACAAAAGATTA AAATTTGGATTCGAT ATT
      CGACCTGTATTTTAG ACCGGGATTCGGATT GTGACTTTTAAACGT TCGAAATGAAAGGAA TGTTACTGACAGTCG TCA
      AAGCCGACTCGGGTT TCCCAACTAGAGAGA ATGCTGAAGTCTAGT ACCGACTAATGGGAT ACCCATTAATTACTG CTT
      AAATACTGTGATGAA AATTGAGATATGCAA GAGGCAAATCGAAAG TTTTGGACATTTTCA TATTGTACCTTTAAC CAA
      CTTCAGAATTCATTG AGCTAAATACCATTT ACAATTTTATGAAAT TTTTAAGCATGTTAC AGCTATAACTATTTT TAA
      ACCAGTTACTAGATT CGTTGAAAATTGTAT GTCACACAGAACTTC TTGCCATCCTGGTCG GAATTAGGATCACTA GCC
      AAGCCGATATGGCTA TGTCTGTCCGTATGA AAGTCTTGGAATCTG ATATTAACATCGCAT ATCGATCGACCATTA TAT
      ATCTAATATATCCTC TACAAATGTATTTTA TCACCTAGCTAGCAT GTAAACATTCTGGCC TATTTAGCTGTACGC TTC
      AGTTATGCTAATGCA AACATAAGCCTTTTG TGATATTATAATTTA CATTTATTATTTATT GCAGTTAGCTTTATC AGC
      GATTTGGGCTCATGC CACACGCAATACTAC TTATTTCAACGTCAT CAGTTGTACTAAATG CACAAATGAAATACA TTT
      CGCCAAATAAATGCC AACTTGCAACTAATT TGAATGCTAATCAAA CCGAACTACTCATTT GCATACAAGGTAATA GGT
      GGTTAAAGTGAGTGT AATGGACTTACTTAA GGGGTTACAAGGCTT ATATTTAAAATGCCT GCCTTGTAATTAAAT TTT
      TAAATATATTGGAAA AAAATGGCCACTTGT TATGTGAGTCTCCAG AAAAAAAACAAAAAA ACAGCAACCATCTGG TAT
      GCAAAATATCTGGTG GTAGCAAAATATCTG GTGGTATCTGGTGGA CTATCAAAATATAAA AACTTTTTTTTCCAG ATA
      GTATATCTTAAAATC AGCATCTTGAAGGAG TATATGTAAATAGCA AACTATTTGTAAAAA TAGATTTTATTTTAT AAT
      TTTTTAAGATATATA CCAAACATTATTACC GATTGTGATTATCTT TACATTGTTTGACCT CAAAACGGAAAACTG GAT
      GCGCGGTATCCATGC GACCCTAACTCTGGA ACCGATTTTGGAACC GCCCCGTTAGATCTC AGATTGAAACCTTAT TTG
      CATTCGCATGATCGC TGATGAACACTGGGG AAATGCGGCCCAGCA ATGGGATTGTCAACG CATCTCGGCCAGAAT CGC
      GCCTCGCATGCCACC TCGCACGGTGACCAC ATACCTGTGTACACT GTCAATTAACGTGGC AAGATTATAGCCCGG CCA
      How can I make a list of the individual sequences? Each new sequence starts with the ">" symbol, so I want to put each one in its own list. I can't specify the number of sequences in the file in advance. How can I do this?


      • bvdet
        Recognized Expert Specialist
        • Oct 2006
        • 2851

        #4
        Originally posted by aboxylica
        My sequence file:
        >CG9571_O-E|Drosophila melanogaster|CG9571|FBgn0031086|X:19926374..19927133
        CCAGTCCACCGGCCG CCGATCTATTTATAC GAGAGGAAGAGGCTG AACTCGAGGATTACC CGTGTATCCTGGGAC GCG
        GATTAGCGATCCATT CCCCTTTTAATCGCC GCGCAAACAGATTCA TGAAAGCCTTCGGAT TCATTCATTGATCCA CAT
        ............... ..........
        GCGCGGTATCCATGC GACCCTAACTCTGGA ACCGATTTTGGAACC GCCCCGTTAGATCTC AGATTGAAACCTTAT TTG
        CATTCGCATGATCGC TGATGAACACTGGGG AAATGCGGCCCAGCA ATGGGATTGTCAACG CATCTCGGCCAGAAT CGC
        GCCTCGCATGCCACC TCGCACGGTGACCAC ATACCTGTGTACACT GTCAATTAACGTGGC AAGATTATAGCCCGG CCA
        How can I make a list of the individual sequences? Each new sequence starts with the ">" symbol, so I want to put each one in its own list. I can't specify the number of sequences in the file in advance. How can I do this?
        Python is great for this kind of stuff:
        [code=Python]
        def parseData(fn, dataset=1, key='>'):
            # initialize output list
            dataList = []

            # open file for reading
            f = open(fn)

            # skip to required data set
            for _ in range(dataset):
                try:
                    s = f.next()
                    while not s.startswith(key):
                        s = f.next()
                except StopIteration, e:
                    print 'We have reached the end of the file!'
                    f.close()
                    return False

            for line in f:
                if not line.startswith(key):
                    dataList.append(line.strip())
                else:
                    break

            f.close()
            return dataList

        fn = 'your_file'
        data = []
        i = 1
        while True:
            d = parseData(fn, i)
            if d:
                data.append(d)
            else:
                break
            i += 1

        for item in data:
            for i in item:
                print i
        [/code]


        • aboxylica
          New Member
          • Jul 2007
          • 111

          #5
          This is my code. It always seems to end up in the exception branch, and I don't know why.
          Code:
          from math import *
          import random
          f=open("deeps1.txt","r")
          line=f.next()
          while not line.startswith('PO'):
              line=f.next()
           
          headerlist=line.strip().split()[1:]
          linelist=[]
           
           
          line=f.next().strip()
          while not line.startswith('/'):
              if line != '':
                  linelist.append(line.strip().split())
              line=f.next().strip()
              
          keys=[i[0] for i in linelist]
          values=[[float(s) for s in item] for item in [j[1:] for j in linelist]]
           
          array={}
          linedict=dict(zip(keys,values))
          keys = linedict.keys()
          keys.sort()
          for key in keys:
              array=[key,linedict[key]]
           
          datadict={}
          datadict1={}
          for i,item in enumerate(headerlist):
              datadict[item]={}
              for key_ in linedict:
                  datadict[item][key_]=linedict[key_][i]
                  
           
          for keymain in datadict:
              for keysub in datadict[keymain]:
                  datadict[keymain][keysub]+=1.0
           
          datadict1=datadict.copy()
          for keysub in datadict:
              for keysub in datadict[keymain]:
                  datadict1[keymain][keysub]=datadict[keymain][keysub]/(sum(values[int(keysub)-1])+4)
             
           
           
          def readfasta(fn,dataset=1,key=">"):
              datalist=[]
              fn= open(fn)
              for _ in range(dataset):
                  try:
                      s=f.next()
                      while not s.startswith(key):
                          s=f.next()
                  except StopIteration,e:
                      print "we have reached the end of file!"
                      f.close()
                      return False
              for line in f:
                  if not line.startswith(key):
                      datalist.append(line.strip())
                  else:
                      break
              f.close()
              print datalist
              return datalist
          
          fn="redfly_sequence.fasta"
          data=[]
          i=1
          while True:
              d=readfasta(fn,i)
              print d
              if d:
                  data.append(d)
              else:
                  break
              i=i+1
          
          for item in data:
              for i in item:
                  print i
              
             
           
           
          #p=readfasta()
           
                  
           
           
           
          res=1
          part=""
          
          #q=len(d)
          #print q
          seqq=""
           
          #value={"A":0.3,"T":0.3,"C":0.2,"G":0.2}
          #for i in range(q-16):
           #   part=d[i:i+16]
            #  seqq=part
             # res=1
              #score=1
             # for j in range(16):
              #    key=seqq[j]
               #   res=res*datadict1[key]["%02d"%(j+1)]
                  #print res
             # for key in seqq:
              #    score=score * value[key]
              #print score,"*******************",res
              #log_ratio=log10(res/score)
             # print i,log_ratio
          waiting for your reply
          cheers!!


          • bartonc
            Recognized Expert Expert
            • Sep 2006
            • 6478

            #6
            If you mean this,
            [CODE=python]
            # using [CODE=python] tags
                s=f.next()
            except StopIteration,e:
                print "we have reached the end of file!"
            [/CODE]
            it is supposed to do that. Simply remove the print statement.

            Exceptions are often used to control program flow just like if, while, etc. That is the case here.
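
A possible additional cause, not raised in the thread: in post #5, `readfasta` assigns the opened FASTA file to `fn` but then reads from `f`, which at that point is still the matrix file `deeps1.txt` opened at the top of the script. That file contains no `>` lines, so the search runs off the end and the function returns False on the very first call. Below is a minimal sketch of the same function reading from its own local handle. It is written for Python 3 (the thread uses Python 2), `read_dataset` is an illustrative name, and it returns `None` rather than `False` at the end of the file.

```python
import os
import tempfile


def read_dataset(path, dataset=1, key=">"):
    """Return the lines of the Nth record (1-based), or None if there is none."""
    with open(path) as handle:  # local handle, independent of any global file
        # Skip forward to the header of the requested record.
        for _ in range(dataset):
            try:
                line = next(handle)
                while not line.startswith(key):
                    line = next(handle)
            except StopIteration:
                return None  # fewer than `dataset` records in the file
        # Collect lines until the next header or the end of the file.
        data = []
        for line in handle:
            if line.startswith(key):
                break
            data.append(line.strip())
        return data


# Small throwaway file so the sketch can be run as is.
with tempfile.NamedTemporaryFile("w", suffix=".fasta", delete=False) as tmp:
    tmp.write(">first\nCCAG\nGATT\n>second\nAGTC\n")

number = 1
while True:
    record = read_dataset(tmp.name, number)
    if record is None:
        break
    print(number, record)
    number += 1

os.remove(tmp.name)
```

Testing for `None` instead of truthiness means a header with no sequence lines under it does not end the loop early.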


            • aboxylica
              New Member
              • Jul 2007
              • 111

              #7
              Even if I do that, it still returns False. I want to check whether it is reading the entire file or not.


              • aboxylica
                New Member
                • Jul 2007
                • 111

                #8
                What I specifically want to do is store the sequences separately in lists and calculate the scores for each of them individually. How do I do it?


                • aboxylica
                  New Member
                  • Jul 2007
                  • 111

                  #9
                  I executed the code you gave me and it works, but it takes the entire file as a single list. I want to read the entire file but take the "individual files" as separate lists and calculate the scores for them. How do I do this?
                  waiting for your reply,
                  cheers!


                  • aboxylica
                    New Member
                    • Jul 2007
                    • 111

                    #10
                    Hey,
                    I'm stuck. Please help.


                    • bvdet
                      Recognized Expert Specialist
                      • Oct 2006
                      • 2851

                      #11
                      Originally posted by aboxylica
                      I executed the code you gave me and it works, but it takes the entire file as a single list. I want to read the entire file but take the "individual files" as separate lists and calculate the scores for them. How do I do this?
                      Waiting for your reply,
                      cheers!
                      The output is actually a list of lists. The StopIteration happens because there is no more data to read, and will happen EVERY TIME. This is necessary because we do not know how many sets of data are in there. The function was coded to read one data set only. You could rewrite the function to iterate only once and return the same list of lists. I will leave that up to you. Here is part of the output, printed in a different way:
                      [code=Python]>>> for i, item in enumerate(data):
                      ...     print 'Sequence number %d: %s' % (i, item)
                      ...
                      Sequence number 0: ['CCAGTCCACCGGCCGCCGATCTATTTATACGAGAGGAAGAGGCTGAACTCGAGGATTACCCGTGTATCCTGGGACGCG', 'GATTAGCGATCCATTCCCCTTTTAATCGCCGCGCAAACAGATTCATGAAAGCCTTCGGATTCATTCATTGATCCACAT', 'CTACGGGAACGGGAGTCGCAAACGTTTTCGGATTAGCGCTGGACTAGCGGTTTCTAAATTGGATTATTTCTACCTGAC', 'CCTGGAGCCATCGTCCTCGTCCTCCGTCCCTTAGCGCCTCCTGCATGGATGTCGTTTTTGGGTTTCATACCTTTTCAC', 'ACTGGAAAAATACGGAATTTGTTGTAAGCCCTTTCAAGACGAATGGGATTTAGCTTCGGATGTCAACGTCACCATAAT', 'CATATTAGGAATATTTCTACTCAATTGCAATATTGGTACTTTTCTGACTGTAAACGCGATGATAATTACAAATATGCC', 'TAATTTGCTGTCTTTATAATCAAATGGAGTTCTTTATATTTCCAAAATATTGAAATTCCGATTCCCTAGAAAATAATA', 'CGTTTTTCTGTTATTAATAAAAAACCAATAGGAAAGTTCTCAAAAATTACTCTGTTGTATTTGATCATTTCTTTTCCG', 'GTATAATCTTTTATTTTAAGCATTCCCATGTGAATAAATTTCAGACTAATGTATTAATAAGATGTCGTGTTTTTCCAC', 'TTACAAATTTCTCATACAGCTGGATATATACTACGAGTACTATACACATGCTCTGGG']
                      Sequence number 1: ['AGTCGACCAGCACGAGATCTCACCTACCTTCTTTATAAGCGGGGTCTCTAGAAGCTAAATCCATGTCCACGTCAAACC',
                      .........................[/code]
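For the "individual files as lists" request, one possible approach is to read the whole sequence file in a single pass and start a new inner list at every header line. This is only a sketch under assumptions: it takes headers to start with `>` (the same default `key` used in the parse function), the name `parse_all_sequences` is made up, and the sample records are invented stand-ins for the real file.

```python
import io

def parse_all_sequences(lines, key='>'):
    """Return a list of lists: one inner list of lines per sequence record."""
    records = []
    current = None
    for line in lines:
        line = line.strip()
        if line.startswith(key):
            # A header line opens a new record.
            current = []
            records.append(current)
        elif line and current is not None:
            # Non-empty lines after a header belong to the current record.
            current.append(line)
    return records

# Hypothetical two-record sample; a real call would pass open('seqdata.txt').
sample = io.StringIO(u">CG1\nACGT\nTTGA\n>CG2\nGGCC\n")
data = parse_all_sequences(sample)
joined = [''.join(rec) for rec in data]
```

Each entry of `data` can then be scored on its own, and `joined` holds every record as one continuous string in case the scoring needs that.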

                      Comment

                      • bvdet
                        Recognized Expert Specialist
                        • Oct 2006
                        • 2851

                        #12
                        aboxylica,

                        I have reworked the parse functions a bit for compatibility, and tested for a specific data set (set number 3). The output is a dictionary of the data set matrix after processing and a list of the corresponding(?) data set sequences. Can you show us what needs to be done from here?[code=Python]# Parse matrix data and sequence data files

                        def parseArray(fn, dataset=1, key='PO', term='/'):
                            '''
                            Read a formatted data file in matrix format and
                            compile data into a dictionary
                            '''
                            f = open(fn)

                            # skip to required data set
                            for _ in range(dataset):
                                try:
                                    line = f.next()
                                    while not line.startswith(key):
                                        line = f.next()
                                except StopIteration, e:
                                    print 'We have reached the end of the file!'
                                    f.close()
                                    return False

                            headerList = line.strip().split()[1:]
                            lineList = []

                            line = f.next().strip()
                            while not line.startswith(term):
                                if line != '':
                                    lineList.append(line.strip().split())
                                line = f.next().strip()

                            f.close()

                            # Key list
                            keys = [i[0] for i in lineList]
                            # Values list
                            values = [[float(s) for s in item] for item in [j[1:] for j in lineList]]

                            # Create a dictionary from keys and values
                            lineDict = dict(zip(keys, values))

                            dataDict = {}

                            for i, item in enumerate(headerList):
                                dataDict[item] = {}
                                for key in lineDict:
                                    dataDict[item][key] = lineDict[key][i]

                            # Add 1.0 to every element in dataDict subdictionaries
                            for keyMain in dataDict:
                                for keySub in dataDict[keyMain]:
                                    dataDict[keyMain][keySub] += 1.0

                            # Normalize original data (with 1 added) and update data
                            valueSums = [sum(item)+4 for item in values]

                            for keyMain in dataDict:
                                for keySub in dataDict[keyMain]:
                                    dataDict[keyMain][keySub] /= valueSums[int(keySub)-1]

                            return dataDict


                        def parseData(fn, dataset=1, key='>'):
                            '''
                            Read a formatted data file of alpha sequences
                            Return a list of sequences
                            '''
                            # initialize output list
                            dataList = []

                            # open file for reading
                            f = open(fn)

                            # skip to required data set
                            for _ in range(dataset):
                                try:
                                    s = f.next()
                                    while not s.startswith(key):
                                        s = f.next()
                                except StopIteration, e:
                                    print 'We have reached the end of the file!'
                                    f.close()
                                    return False

                            for line in f:
                                if not line.startswith(key):
                                    dataList.append(line.strip())
                                else:
                                    break

                            f.close()
                            return dataList

                        if __name__ == '__main__':
                            fnArray = 'matrixdata.txt'
                            fnSeq = 'seqdata.txt'
                            dataset = 3
                            dataArray = parseArray(fnArray, dataset)
                            dataSeq = parseData(fnSeq, dataset)

                        '''
                        >>> for key in dataArray:
                        ...     print '%s = %s' % (key, dataArray[key])
                        ...
                        A = {'02': 0.0054464669628780287, '03': 0.61075011944577162, '01': 0.20807453416149066, '06': 0.8383737040752951, '07': 0.38474033729874352, '04': 0.0057811753463927378, '05': 0.0047776025990158141}
                        C = {'02': 0.0073097319764941961, '03': 0.37754419493549923, '01': 0.046583850931677016, '06': 0.0093163250680808381, '07': 0.027232334814390139, '04': 0.0055900621118012426, '05': 0.063924322774831593}
                        T = {'02': 0.0047776025990158141, '03': 0.0064022933588150982, '01': 0.043669374104156715, '06': 0.14122593282690746, '07': 0.10338732024270221, '04': 0.98385093167701865, '05': 0.92446610290955999}
                        G = {'02': 0.98246619846161198, '03': 0.0053033922599139988, '01': 0.70167224080267565, '06': 0.01108403802971669, '07': 0.48464000764416415, '04': 0.0047778308647873869, '05': 0.0068319717165926134}
                        >>> for item in dataSeq:
                        ...     print item
                        ...
                        TCTAGAGATCTGGGCACGATGGCGAGACAAAGATGCGGCGCAAAATCGGAAATGGAGATGGATCACGTAGCCGGCCAT
                        GGCGG
                        >>>
                        '''[/code]
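Since the goal is to go through every matrix, it may be simpler to walk the matrix file once and hand back each matrix as it completes, rather than re-opening the file per data set. The following is a sketch, assuming the `//`, `NA`, `PO` layout shown in the first post. The names `iter_matrices` and `normalize` are hypothetical, and the sample text is invented. `normalize` applies the same add-one-then-divide-by-row-total step as `parseArray`, but returns a list of per-position dicts instead of the nested `dataDict`.

```python
import io

def iter_matrices(lines, name_key='NA', header_key='PO', term='/'):
    """Yield (name, header, rows) for every matrix found in lines."""
    name, header, rows = None, None, []
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if fields[0] == name_key:
            name = fields[1]
        elif fields[0] == header_key:
            header, rows = fields[1:], []
        elif fields[0].startswith(term):
            # A terminator line closes the matrix collected so far.
            if header and rows:
                yield name, header, rows
            name, header, rows = None, None, []
        elif header is not None:
            # Drop the position label, keep the counts as floats.
            rows.append([float(x) for x in fields[1:]])
    if header and rows:
        # The file ended without a final terminator line.
        yield name, header, rows

def normalize(header, rows):
    """Add a pseudocount of 1 per cell, then divide by the row total."""
    matrix = []
    for row in rows:
        total = sum(row) + len(row)
        matrix.append(dict((b, (v + 1.0) / total) for b, v in zip(header, row)))
    return matrix

# Invented sample in the same layout; a real call would pass open(filename).
text = u"""//
NA M1
PO A C G T
01 1.0 0.0 2.0 1.0
02 0.0 4.0 0.0 0.0
//
//
NA M2
PO A C G T
01 3.0 0.0 0.0 1.0
//
"""
matrices = [(name, normalize(header, rows))
            for name, header, rows in iter_matrices(io.StringIO(text))]
names = [name for name, _ in matrices]
m1 = matrices[0][1]
```

Keeping each matrix as a list with one dict per position means `len(matrix)` gives the matrix width directly, which is handy when the width differs from matrix to matrix.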

                        Comment

                        • elbin
                          New Member
                          • Jul 2007
                          • 27

                          #13
                          Just a suggestion: the sequence is probably needed as a single string, not a list of strings, and the same evaluation as in the matrix thread should be done, one 16-character fragment at a time, using the normalized matrix. Then the same thing should be done for all sequence/matrix pairs (I don't think they are corresponding)...

                          Is that so, aboxylica?

                          Comment

                          • bvdet
                            Recognized Expert Specialist
                            • Oct 2006
                            • 2851

                            #14
                            Originally posted by elbin
                            Just a suggestion: the sequence is probably needed as a single string, not a list of strings, and the same evaluation as in the matrix thread should be done, one 16-character fragment at a time, using the normalized matrix. Then the same thing should be done for all sequence/matrix pairs (I don't think they are corresponding)...

                            Is that so, aboxylica?
                            Not to be contrary, but the matrix data set #3 has only 7 elements for each of A, C, G, T. The sequence data set #3 has 78 characters on one line and 5 on another. The sequence can easily be joined if needed for evaluation:[code=Python]>>> ''.join(dataSeq)
                            'TCTAGAGATCTGGGCACGATGGCGAGACAAAGATGCGGCGCAAAATCGGAAATGGAGATGGATCACGTAGCCGGCCATGGCGG'
                            [/code]The sequence/matrix data may not be corresponding as you suggested.

                            Comment

                            • elbin
                              New Member
                              • Jul 2007
                              • 27

                              #15
                              Originally posted by bvdet
                              Not to be contrary, but the matrix data set #3 has only 7 elements for each of A, C, G, T. The sequence data set #3 has 78 characters on one line and 5 on another. The sequence can easily be joined if needed for evaluation:[code=Python]>>> ''.join(dataSeq)
                              'TCTAGAGATCTGGGCACGATGGCGAGACAAAGATGCGGCGCAAAATCGGAAATGGAGATGGATCACGTAGCCGGCCATGGCGG'
                              [/code]The sequence/matrix data may not be corresponding as you suggested.
                              Then I suppose the window size changes for each matrix... Those were just some thoughts on the task, because I am into genetics as well... Thanks for the clarification :), I overlooked that.
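If the window size does follow the matrix, the scan could look roughly like the sketch below. The exact log-score formula from the earlier matrix thread is not shown here, so this assumes the score of a window is the sum of the natural logs of the normalized matrix values for each base. The name `window_scores` is hypothetical, and the matrix is taken to be a list with one dict per position (not the nested dictionary returned by `parseArray`). The numbers are invented.

```python
import math

def window_scores(sequence, matrix):
    """Score every window of len(matrix) bases along sequence.

    matrix is a list of dicts, one per position, mapping base -> probability.
    Assumption: score = sum of log(probability) over the window.
    """
    width = len(matrix)
    scores = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        scores.append(sum(math.log(matrix[i][base])
                          for i, base in enumerate(window)))
    return scores

# Invented two-position matrix and a short sequence, for illustration only.
matrix = [
    {'A': 0.5, 'C': 0.25, 'G': 0.125, 'T': 0.125},
    {'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25},
]
scores = window_scores('ACG', matrix)   # windows 'AC' and 'CG'
```

Running this inside a loop over sequences, with an inner loop over matrices, would give every sequence/matrix pair. Note that a base outside A, C, G, T (an `N`, say) would raise a `KeyError` in this sketch, so real data may need a check first.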

                              Comment
