Parsing a file, and matching numbers of a sequence (python)

Collapse
X
 
  • Time
  • Show
Clear All
new posts
  • pcamps
    New Member
    • Apr 2012
    • 3

    Parsing a file, and matching numbers of a sequence (python)

    I currently have a list of genes in a file. Each line has a chromosome with it's information. Such an entry appears as:

    NM_198212 chr7 + **115926679** **115935830** 115927071 11593344 2 *115926679*,'11 5933260', *115927221*,'11 5935830',

    The sequence for the chromosome starts at base **115926679** and continues up to(but not including) base **115935830**

    If we want the spliced sequence, we use the exons.The first extends from
    *115926679* to *155927221*, and the second goes from '115933260' to '115935830'

    However, I have run across a problem when on a complementary sequence such as:

    NM_001005286 chr1 - **245941755** ***245942680*** 245941755 245942680 1 *245941755*, '245942680'

    Since column 3 is a '-', these coordinates are in reference to the anti-sense strand (the complement to the strand). The first base (in bold) matches the last base on the sense strand (in italics). Since the file only has the sense stand, I need to try to translate coordinates on the anti-sense strand to the sense strand, pick out the right sequence and then reverse-complement it.

    That said, I have only been programming for about half a year and and not sure how to starts going about doing this.

    I have written a regular expression:

    '(NM_\d+)\s+(ch r\d+)([(\+)|(-)])\s+(\d+)\s+(\d +)\s+(\d+)\s+(\ d+)\s+(\d+)\s+( \d+),(\d+),s+(\ d+),(\d+),'

    but am now unsure as to how to start this function...
    If anyone can help me get started at all on this, perhaps making me see how to do this, I would very much appreciate it.

    OK: suppose this is chromsome 25:

    AAAAAAAAAACCCCC CCCCCTTTTTTTTTT GGGGGGGGGG

    (there are 10 of each character).

    Now: if I am looking for an unspliced gene on:
    <blah> chr25 + 10 20 <blah>

    Then the gene starts on position 10 (starting from 0), and goes up to but not including position 20. So its:

    CCCCCCCCCC

    This is easy. It matches python string slicing really well.


    Its more confusing if I give you:

    <blah> chr25 - 10 20 <blah>

    What you have is the positive strand. But this gene is on the negative (complementary) strand. Remember what the chromosome looks like as a souble-strand:

    AAAAAAAAAACCCCC CCCCCTTTTTTTTTT GGGGGGGGGG
    TTTTTTTTTTGGGGG GGGGGAAAAAAAAAA CCCCCCCCCC

    We are looking for the gene on the bottom strand. Meaning we count from 0 starting on the right. Number the top strand from the left, and the bottom strand from the right. So what I want here is AAAAAAAAAA.

    The catch is that I'm only giving you the top strand. I'm not giving you the bottom strand. (You could generate yourself from the top strand — but given how large it is, I advise against that.)

    So you need to convert coordinates. On the bottom strand, base 0 (the right-most C) is opposed to base 39 on the top strand. Base 1 is against base 38. Base 2 is against case 37. (Important point: notice what happens when you add these two numbers up — every time.) So base 10 is against base 29, and base 19 is against base 20.

    So: if I want to find base 10-20 on the bottom strand, I can look at base 20-29 on the top (and then reverse-complement it).


    I need to figure out how to translate to coordinates on the bottom strand to the equivalent coordinates on the bottom strand. Yes: it is very confusing

    I have tried:


    fields = line.split(' \t')
    geneID, chr, strand = fields[:2]
    start = int(fields[3])
    end = int(fields[4])
    if strand == '-':
    start,end = end,start



    which is on the right track, yet but its not enough. This would take the 10 and 20, and turn it into a 20 and 10.

    And I know I can reverse complement the string by doing this:

    r = s[::-1]
    bc = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}
    l = list(r)
    o = [bc[base] for base in l]
    return ''.join(o)


    But, from here I am lost.
  • dwblas
    Recognized Expert Contributor
    • May 2008
    • 626

    #2
    We are looking for the gene on the bottom strand. Meaning we count from 0 starting on the right.
    You can start from either the left or the right, assuming you already know how to extract the "+" or "-" and the begin and end postitions-->if not post back. Note that the smallest number is always first in the slice-->[:]
    Code:
    forward="AAAAAAAAAAXXXXXXXXXXTTTTTTTTTTGGGGGGGGGG"
    print forward[10:20]  ## prints "X"'s
    
    backward="AAAAAAAAAACCCCCCCCCCTTTTTTTTTTGGGGGGGGGGTTTTTTTTTTGGGGGGGGGGYYYYYYYYYYCCCCCCCCCC"
    print backward[-20:-10]  ## prints "Y"'s
    You can also split the string if you know that each sequence is always 10, otherwise you have to test for some break indicator.
    Code:
    backward="AAAAAAAAAACCCCCCCCCCTTTTTTTTTTGGGGGGGGGGTTTTTTTTTTGGGGGGGGGGAAAAAAAAAACCCCCCCCCC"
    
    list_of_seq=[backward[start:start+10] for start in range(0, len(backward), 10)]
    print list_of_seq
    
    for start in [10, -20, 30, -30, 70]:
        print list_of_seq[start/10]

    Comment

    • dwblas
      Recognized Expert Contributor
      • May 2008
      • 626

      #3
      fields = line.split(' \t')

      This may give you and extra field because of the newline=\n, or retain the newline in the last field which could also create problems. Instead strip the whitespace first:
      fields = line.strip().sp lit()

      Note that split() receives the return of strip() and the default will split on whitespace (any spaces, tabs, or newlines combination).

      Comment

      Working...