mapping fasta files into dictionary (to create non-redundant fasta file)

  • Elniunia
    New Member
    • Feb 2010
    • 1


    Hi,

    I am new to Python. I have to map a FASTA file into a dictionary. There are around 1000 sequences in my FASTA file. The problem is that some identical sequences appear under different sequence IDs. I can sort them out by accession number, which is unique. The first line of my FASTA file looks as follows:
    >seqId|GeneName|AccessionNumber|taxaNumber|OrganizmName|AdditionalInfo

    The next lines consist of amino acids.


    I need to make a non-redundant FASTA file for these sequences on the basis of the unique AccessionNumber. It was suggested that I create a dictionary, but I am not sure how to do that for this problem. Can someone help me, please?

    Many thanks,
    E.
  • bvdet
    Recognized Expert Specialist
    • Oct 2006
    • 2851

    #2
    Elniunia,

    Formatted data can be very simple to convert to a dictionary. Is your data delimited by the "|" character? It could be as simple as:
    Code:
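    with open("fasta.txt") as f:
        # The first line holds the field names
        headerList = f.readline().strip().split("|")
        dd = {}
        for line in f:
            lineList = line.strip().split("|")
            # Key on the third field (AccessionNumber); the remaining fields become the value
            dd[lineList.pop(2)] = lineList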
    Using the code above, this data:
    Code:
    seqId|GeneName|AccessionNumber|taxaNumber|OrganizmName|AdditionalInfo
    AAA|XYZ|0001|23658876|Bill|line 1
    CCC|D&HFREE|0002|99999931|John|line 2
    is converted to this dictionary:
    Code:
    >>> for key in dd:
    ...     print(key, dd[key])
    ...
    0001 ['AAA', 'XYZ', '23658876', 'Bill', 'line 1']
    0002 ['CCC', 'D&HFREE', '99999931', 'John', 'line 2']
    >>>
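For an actual FASTA file, where each record is a ">" header line followed by one or more lines of amino acids, the same keying idea might look like the sketch below. It assumes the accession number is the third "|"-separated field of the header, as in the question. The names `read_fasta_by_accession` and `write_fasta` are made up for illustration, and the first record seen for each accession is the one that is kept.

```python
import io


def read_fasta_by_accession(lines, field=2):
    """Map accession number -> (header, sequence), keeping the first record seen.

    `lines` is any iterable of text lines, for example an open file.
    `field` is the position of the accession number in the "|"-separated
    header; 2 matches the layout in the question (an assumption).
    """
    records = {}
    accession = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            accession = line[1:].split("|")[field].strip()
            if accession in records:
                accession = None  # duplicate accession: skip its sequence lines
            else:
                records[accession] = [line, []]
        elif accession is not None:
            records[accession][1].append(line)
    # Join the collected sequence lines into one string per record
    return {acc: (hdr, "".join(seq)) for acc, (hdr, seq) in records.items()}


def write_fasta(records, handle, width=60):
    """Write the records back out, wrapping sequences at `width` characters."""
    for header, sequence in records.values():
        handle.write(header + "\n")
        for start in range(0, len(sequence), width):
            handle.write(sequence[start:start + width] + "\n")


# Made-up example data: two records share the accession ACC1
example = """>s1|geneA|ACC1|9606|Homo sapiens|first copy
MKT
AYI
>s2|geneA|ACC1|9606|Homo sapiens|same accession
MKTAYI
>s3|geneB|ACC2|10090|Mus musculus|other
MVL
""".splitlines()

unique = read_fasta_by_accession(example)
out = io.StringIO()
write_fasta(unique, out)
```

With a real file, `open("fasta.txt")` could be passed in place of `example`, and an output file opened for writing in place of `out`.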


    • Glenton
      Recognized Expert Contributor
      • Nov 2008
      • 391

      #3
      Hi Elniunia

      It's possible that I don't understand your situation precisely, but perhaps it's similar to mine. I often have data files which have a header row, and then many lines of data.

      Eg
      Temperature, Voltage, Current, etc
      5.002, 1.32, 0.00032, etc
      6.003, 1.42, 0.00042, etc
      etc

      I then find it very convenient to make a dictionary of numpy arrays.
      I have this function which I use to create this dictionary of arrays:
      Code:
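      from numpy import array, concatenate

      def isNumber(s):
          #isNumber is used by MyOpen below but is not shown in this post;
          #this minimal version is an assumption: True if s parses as a float
          try:
              float(s)
              return True
          except ValueError:
              return False

      def MyOpen(myFile,textRow=0,dataStarts=1,hasHeadings=True,separater=None,appendWhenNotDigit=True,returnArray=True):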
          """Opens txt file (myFile), which has a standard format of
          text headings (with no space) separated by white space, followed
          by numbers separated in the same way.
          Output is a dictionary keyed by the headings in the first row, with
          one list per column (or one numpy array if returnArray is True).
          textRow is the row containing the headings.
          dataStarts is the first row containing the data, and must be bigger
          than textRow.
          If there are no text headings then set hasHeadings to
          False, and they'll be labelled in the dictionary by 'Col0' etc
          If appendWhenNotDigit=True (default), then all rows will be appended.
          Setting it to False, will mean that rows containing non-numeric values
          will not be appended"""
          f=open(myFile,'r')
          g=f.readlines()
          f.close()
          ###change to lists###
          h=[]
          for n,i in enumerate(g):
              #skip everything before the data except the heading row
              if n<dataStarts and n!=textRow: continue

              if separater is None:
                  temp1=i.split()
              else:
                  temp1=i.split(separater)
              temp2=[]
              myAppend=True
              for j in temp1:
                  #if j.isdigit():
                  #    temp2.append(int(j))
                  if isNumber(j.strip()):
                      temp2.append(float(j.strip()))
                  else:
                      temp2.append(j.strip())
                      if n!=textRow and not appendWhenNotDigit:
                          myAppend=False
                          break
              if myAppend: h.append(temp2)
          ###create dictionary
          d=dict([])
          if hasHeadings:
              for hi in h[0]:
                  d[hi]=[]
          else:
              for i in range(len(h[0])):
                  d["Col"+str(i)]=[]
          for i in range(hasHeadings,len(h)):
              for j in range(len(h[0])):
                  if hasHeadings:
                      d[h[0][j]].append(h[i][j])
                  else:
                      d["Col"+str(j)].append(h[i][j])
          if returnArray==True:
              e=dict([])
              for k in d.keys():
                  e[k]=array(d[k])
              return e
          return d
      There are several advantages to doing it this way.
      Firstly, if you need to calculate another set of results based on the data you've stored, it can be done like this:
      Code:
      def calc(a,d,A):
          """a is the array based dictionary from the raw data & it will return
          a dictionary where additional variables have been calculated"""
          T=a["T/K"]
          q=a["Theta"]
          Z=a["Z"]
          a["10/T"]=10/T
          a["T-0.5"]=T**(-0.5)
          return a
      But the other thing you can do is first sort your data by AccessionNumber with this function:
      Code:
      from bisect import bisect_left

      def sort(a,sortName="T/K"):
          """a is an array dictionary.  Sorts all arrays by one of them"""
          #use  list.insert(bisect_left(list,element),element) to create
          #a mask and apply it to all the elements
          mask=[]
          vals=[]
          for n,t in enumerate(a[sortName]):
              ins=bisect_left(vals,t)
              mask.insert(ins,n)
              vals.insert(ins,t)
          a2=dict()
          for k in a.keys():
              a2[k]=a[k][mask]
          return a2
      You just need to pass the dictionary you created to it and the name of the field you want to sort by.
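A shorter route to an equivalent ordering (apart from how ties are arranged) is to let `numpy.argsort` build the index list. This is only a sketch, assuming every value in the dictionary is a numpy array of the same length; the name `sort_by` and the sample data are made up for illustration.

```python
import numpy as np


def sort_by(a, sort_name):
    """Return a new dict with every array reordered by a[sort_name]."""
    # kind="stable" keeps rows with equal keys in their original order
    order = np.argsort(a[sort_name], kind="stable")
    return {k: v[order] for k, v in a.items()}


# Made-up sample data
data = {
    "AccessionNumber": np.array(["0003", "0001", "0002"]),
    "seqId": np.array(["CCC", "AAA", "BBB"]),
}
sorted_data = sort_by(data, "AccessionNumber")
```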

      Then I guess you want to remove duplicates. I haven't got a function for it, but something like this will do the job:
      Code:
      def removeDuplicates(a,sortName):
          """a is an array dictionary.  Sorts all arrays by sortName, then keeps
          only the first row for each distinct value of a[sortName]"""
          a=sort(a,sortName)
          #True wherever a value differs from the one before it
          mask=a[sortName][1:]!=a[sortName][:-1]
          #the first row is always kept
          mask=concatenate((array([True]),mask))
          a2=dict()
          for k in a.keys():
              a2[k]=a[k][mask]
          return a2
      I'm afraid I haven't had a chance to test this code.
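If the sorted order is not needed, `numpy.unique` with `return_index=True` reports the position of the first occurrence of each distinct value, which is enough to drop duplicates directly. The sketch below assumes the same dictionary-of-arrays layout; `drop_duplicates` and the sample data are illustrative names, not part of the functions above.

```python
import numpy as np


def drop_duplicates(a, key_name):
    """Keep the first row for each distinct value of a[key_name]."""
    # first[i] is the index where the i-th distinct value first appears
    _, first = np.unique(a[key_name], return_index=True)
    keep = np.sort(first)  # restore the original row order
    return {k: v[keep] for k, v in a.items()}


# Made-up sample data: accessions 0002 and 0001 each appear twice
data = {
    "AccessionNumber": np.array(["0002", "0001", "0002", "0001"]),
    "seqId": np.array(["CCC", "AAA", "DDD", "EEE"]),
}
unique_rows = drop_duplicates(data, "AccessionNumber")
```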

