Hi everyone, I'm new to Python and would like to split a FASTA (text) file into its separate genes (each one starts with a ">"), randomly sample a certain number of the sequences, and print the result. My program almost works, but text.split('>') removes every ">" from the file. If there's some way to either keep the ">" when splitting or add it back afterwards, that would be amazing. Here's my program so far. I've added the ">" character to the print line, but that only puts it at the beginning of the whole result; I want it at the beginning of every split.
My Program:
Code:
import random

fileobj = open("MyFile")
ignore = fileobj.read(1)
text = fileobj.read()
records = text.split('>')
NewLines = random.sample(records, 3)
print ">" + '\n'.join(NewLines)

Result:
>FLP3FBN01A85 QC length=268 xy=0397_0946 region=1 run=R_2008_12_0 9_13_51_01_
ACAGACCACTCACAT GCTGCCTCCCGTAGG AGTTTGGGCCGTGTC TCAGTCCCAATGTGG
CCGTTCACCCTCTCA GGCCGGCTACTGATC GTCGCCTTGGTAGGC CGTTACCCTACCAAC
AAGCTAATCAGACGC GGAGCCATCTTACAC CACCTCAGTTTTTCA CACCGGACCATGCGG
TCCTGTGCGCTTATG CGGTATTAGCACCTA TTTCTAAGTGTTATC CCCCTGTGTAAGGCA
GGTCCTCCACGCGTT ACTCACCCGTCCG
FLP3FBN01DH3NR length=257 xy=1319_0885 region=1 run=R_2008_12_0 9_13_51_01_
ACAGACCACTCACAT GCTGCCTCCCGTAGG AGTCTGGGCCGTGTC TCAGTCCCAATGTGG
CCGGTCACCCTCTCA GGTCGGCTACTGATC GTCGGCTTGGTGAGC CGTTACCTCACCAAC
TACCTAATCAGACGC GGGTCCATCTTGCAC CACCGGAGTTTTTCA CACTGTCCCATGCAG
GACCGTGCGCTTATG CGGTATTGCACCTAT TTCTAAGTGTTATCC CCCAGTGCAAGGCAG
GTTACCCACGCGTTA CT
FLP3FBN01D0219 length=268 xy=1535_1839 region=1 run=R_2008_12_0 9_13_51_01_
ACAGACCACTCACAT GCTGCCTCCCGTAGG AGTTTGGGCCGTGTC TCAGTCCCAATGTGG
CCGTCCACCCTCTCA GGCCGGCTACTGATC GTCGCCTTGGTGGGC CTTTACCCCGCCAAC
CAGCTAATCAGACGC GGGTCCATCTTGCAC CACCGGAGTTTTTCA CACTGTCCCATGCAG
GACCGTGCGCTTATG CGGTATTAGCACCTA TTTCTAAGTGTTATC CCCCAGTGCAAGGCA
GGTTACCCACGCGTT ACTCACCCGTCCG
So, basically all I need is the ">" at the beginning of each of the three paragraphs.
Thanks so much in advance!
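One possible way to get the ">" back on every record is to re-attach it to each piece right after splitting, instead of adding it once in the print line. The sketch below is only an illustration under a few assumptions: the helper name sample_fasta_records and the small in-memory example_text are made up for the example (they stand in for reading "MyFile"), and it assumes ">" only ever appears at the start of a header line. It uses the print() function form rather than the print statement.

```python
import random


def sample_fasta_records(text, k, rng=random):
    """Return k randomly chosen FASTA records, each starting with '>'.

    Hypothetical helper for illustration; not part of the original program.
    """
    # split('>') discards the '>' itself and yields an empty first piece when
    # the text begins with '>', so drop empty pieces and put '>' back on each.
    records = [">" + chunk.strip() for chunk in text.split(">") if chunk.strip()]
    return rng.sample(records, k)


# Made-up stand-in for the contents of "MyFile".
example_text = """>seq1 length=8
ACGTACGT
>seq2 length=4
TTGA
>seq3 length=6
CCGGAA
"""

chosen = sample_fasta_records(example_text, 2)
print("\n".join(chosen))
```

With a real file, the text could come from open("MyFile").read() instead of example_text. Because empty pieces are filtered out, the separate fileobj.read(1) step to skip the first ">" should no longer be needed.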