I currently have a list of genes in a file. Each line has a chromosome with it's information. Such an entry appears as:
NM_198212 chr7 + **115926679** **115935830** 115927071 11593344 2 *115926679*,'11 5933260', *115927221*,'11 5935830',
The sequence for the chromosome starts at base **115926679** and continues up to(but not including) base **115935830**
If we want the spliced sequence, we use the exons.The first extends from
*115926679* to *155927221*, and the second goes from '115933260' to '115935830'
However, I have run across a problem when on a complementary sequence such as:
NM_001005286 chr1 - **245941755** ***245942680*** 245941755 245942680 1 *245941755*, '245942680'
Since column 3 is a '-', these coordinates are in reference to the anti-sense strand (the complement to the strand). The first base (in bold) matches the last base on the sense strand (in italics). Since the file only has the sense stand, I need to try to translate coordinates on the anti-sense strand to the sense strand, pick out the right sequence and then reverse-complement it.
That said, I have only been programming for about half a year and and not sure how to starts going about doing this.
I have written a regular expression:
'(NM_\d+)\s+(ch r\d+)([(\+)|(-)])\s+(\d+)\s+(\d +)\s+(\d+)\s+(\ d+)\s+(\d+)\s+( \d+),(\d+),s+(\ d+),(\d+),'
but am now unsure as to how to start this function...
If anyone can help me get started at all on this, perhaps making me see how to do this, I would very much appreciate it.
OK: suppose this is chromsome 25:
AAAAAAAAAACCCCC CCCCCTTTTTTTTTT GGGGGGGGGG
(there are 10 of each character).
Now: if I am looking for an unspliced gene on:
<blah> chr25 + 10 20 <blah>
Then the gene starts on position 10 (starting from 0), and goes up to but not including position 20. So its:
CCCCCCCCCC
This is easy. It matches python string slicing really well.
Its more confusing if I give you:
<blah> chr25 - 10 20 <blah>
What you have is the positive strand. But this gene is on the negative (complementary) strand. Remember what the chromosome looks like as a souble-strand:
AAAAAAAAAACCCCC CCCCCTTTTTTTTTT GGGGGGGGGG
TTTTTTTTTTGGGGG GGGGGAAAAAAAAAA CCCCCCCCCC
We are looking for the gene on the bottom strand. Meaning we count from 0 starting on the right. Number the top strand from the left, and the bottom strand from the right. So what I want here is AAAAAAAAAA.
The catch is that I'm only giving you the top strand. I'm not giving you the bottom strand. (You could generate yourself from the top strand — but given how large it is, I advise against that.)
So you need to convert coordinates. On the bottom strand, base 0 (the right-most C) is opposed to base 39 on the top strand. Base 1 is against base 38. Base 2 is against case 37. (Important point: notice what happens when you add these two numbers up — every time.) So base 10 is against base 29, and base 19 is against base 20.
So: if I want to find base 10-20 on the bottom strand, I can look at base 20-29 on the top (and then reverse-complement it).
I need to figure out how to translate to coordinates on the bottom strand to the equivalent coordinates on the bottom strand. Yes: it is very confusing
I have tried:
fields = line.split(' \t')
geneID, chr, strand = fields[:2]
start = int(fields[3])
end = int(fields[4])
if strand == '-':
start,end = end,start
which is on the right track, yet but its not enough. This would take the 10 and 20, and turn it into a 20 and 10.
And I know I can reverse complement the string by doing this:
r = s[::-1]
bc = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}
l = list(r)
o = [bc[base] for base in l]
return ''.join(o)
But, from here I am lost.
NM_198212 chr7 + **115926679** **115935830** 115927071 11593344 2 *115926679*,'11 5933260', *115927221*,'11 5935830',
The sequence for the chromosome starts at base **115926679** and continues up to(but not including) base **115935830**
If we want the spliced sequence, we use the exons.The first extends from
*115926679* to *155927221*, and the second goes from '115933260' to '115935830'
However, I have run across a problem when on a complementary sequence such as:
NM_001005286 chr1 - **245941755** ***245942680*** 245941755 245942680 1 *245941755*, '245942680'
Since column 3 is a '-', these coordinates are in reference to the anti-sense strand (the complement to the strand). The first base (in bold) matches the last base on the sense strand (in italics). Since the file only has the sense stand, I need to try to translate coordinates on the anti-sense strand to the sense strand, pick out the right sequence and then reverse-complement it.
That said, I have only been programming for about half a year and and not sure how to starts going about doing this.
I have written a regular expression:
'(NM_\d+)\s+(ch r\d+)([(\+)|(-)])\s+(\d+)\s+(\d +)\s+(\d+)\s+(\ d+)\s+(\d+)\s+( \d+),(\d+),s+(\ d+),(\d+),'
but am now unsure as to how to start this function...
If anyone can help me get started at all on this, perhaps making me see how to do this, I would very much appreciate it.
OK: suppose this is chromsome 25:
AAAAAAAAAACCCCC CCCCCTTTTTTTTTT GGGGGGGGGG
(there are 10 of each character).
Now: if I am looking for an unspliced gene on:
<blah> chr25 + 10 20 <blah>
Then the gene starts on position 10 (starting from 0), and goes up to but not including position 20. So its:
CCCCCCCCCC
This is easy. It matches python string slicing really well.
Its more confusing if I give you:
<blah> chr25 - 10 20 <blah>
What you have is the positive strand. But this gene is on the negative (complementary) strand. Remember what the chromosome looks like as a souble-strand:
AAAAAAAAAACCCCC CCCCCTTTTTTTTTT GGGGGGGGGG
TTTTTTTTTTGGGGG GGGGGAAAAAAAAAA CCCCCCCCCC
We are looking for the gene on the bottom strand. Meaning we count from 0 starting on the right. Number the top strand from the left, and the bottom strand from the right. So what I want here is AAAAAAAAAA.
The catch is that I'm only giving you the top strand. I'm not giving you the bottom strand. (You could generate yourself from the top strand — but given how large it is, I advise against that.)
So you need to convert coordinates. On the bottom strand, base 0 (the right-most C) is opposed to base 39 on the top strand. Base 1 is against base 38. Base 2 is against case 37. (Important point: notice what happens when you add these two numbers up — every time.) So base 10 is against base 29, and base 19 is against base 20.
So: if I want to find base 10-20 on the bottom strand, I can look at base 20-29 on the top (and then reverse-complement it).
I need to figure out how to translate to coordinates on the bottom strand to the equivalent coordinates on the bottom strand. Yes: it is very confusing
I have tried:
fields = line.split(' \t')
geneID, chr, strand = fields[:2]
start = int(fields[3])
end = int(fields[4])
if strand == '-':
start,end = end,start
which is on the right track, yet but its not enough. This would take the 10 and 20, and turn it into a 20 and 10.
And I know I can reverse complement the string by doing this:
r = s[::-1]
bc = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}
l = list(r)
o = [bc[base] for base in l]
return ''.join(o)
But, from here I am lost.
Comment