How do I split a text file without stripping the character I'm splitting at?

  • GeneticsJustin
    New Member
    • Nov 2010
    • 3

    Hi everyone, I'm new to Python and would like to split a FASTA (text) file into its individual gene records (each one starts with a ">"), randomly sample a certain number of the sequences, and print the result. My program almost works, but for some reason text.split('>') strips all of the ">"s from the file. If there's some way to either keep the ">" when splitting or add it back afterwards, that would be amazing. Here's my program so far. I've added the ">" character to the print line, but that only puts it at the beginning of the whole result; I want it at the beginning of every record.

    My Program:
    Code:
    import random
    fileobj = open("MyFile")
    ignore  = fileobj.read(1)
    text    = fileobj.read()
    records = text.split('>')
    NewLines = random.sample(records, 3)
    print ">" + '\n'.join(NewLines)
    Result:

    >FLP3FBN01A85 QC length=268 xy=0397_0946 region=1 run=R_2008_12_0 9_13_51_01_
    ACAGACCACTCACAT GCTGCCTCCCGTAGG AGTTTGGGCCGTGTC TCAGTCCCAATGTGG
    CCGTTCACCCTCTCA GGCCGGCTACTGATC GTCGCCTTGGTAGGC CGTTACCCTACCAAC
    AAGCTAATCAGACGC GGAGCCATCTTACAC CACCTCAGTTTTTCA CACCGGACCATGCGG
    TCCTGTGCGCTTATG CGGTATTAGCACCTA TTTCTAAGTGTTATC CCCCTGTGTAAGGCA
    GGTCCTCCACGCGTT ACTCACCCGTCCG

    FLP3FBN01DH3NR length=257 xy=1319_0885 region=1 run=R_2008_12_0 9_13_51_01_
    ACAGACCACTCACAT GCTGCCTCCCGTAGG AGTCTGGGCCGTGTC TCAGTCCCAATGTGG
    CCGGTCACCCTCTCA GGTCGGCTACTGATC GTCGGCTTGGTGAGC CGTTACCTCACCAAC
    TACCTAATCAGACGC GGGTCCATCTTGCAC CACCGGAGTTTTTCA CACTGTCCCATGCAG
    GACCGTGCGCTTATG CGGTATTGCACCTAT TTCTAAGTGTTATCC CCCAGTGCAAGGCAG
    GTTACCCACGCGTTA CT

    FLP3FBN01D0219 length=268 xy=1535_1839 region=1 run=R_2008_12_0 9_13_51_01_
    ACAGACCACTCACAT GCTGCCTCCCGTAGG AGTTTGGGCCGTGTC TCAGTCCCAATGTGG
    CCGTCCACCCTCTCA GGCCGGCTACTGATC GTCGCCTTGGTGGGC CTTTACCCCGCCAAC
    CAGCTAATCAGACGC GGGTCCATCTTGCAC CACCGGAGTTTTTCA CACTGTCCCATGCAG
    GACCGTGCGCTTATG CGGTATTAGCACCTA TTTCTAAGTGTTATC CCCCAGTGCAAGGCA
    GGTTACCCACGCGTT ACTCACCCGTCCG

    So, basically all I need is the ">" at the beginning of each of the three paragraphs.

    Thanks so much in advance!
    Last edited by bvdet; Nov 5 '10, 06:01 PM. Reason: Add code tags
  • bvdet
    Recognized Expert Specialist
    • Oct 2006
    • 2851

    #2
    split() always discards the separator it splits on, so use a list comprehension to add the ">" character back to each record. Example:
    Code:
    >>> records = ["123", "456", "789"]
    >>> new_records = [">%s" % (s) for s in records]
    >>> new_records
    ['>123', '>456', '>789']
    >>>
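
    To see why putting ">" in front of the joined string is not enough, here is a small sketch. It uses Python 3 syntax (the code in this thread is Python 2), and the sample text and variable names are made up for illustration:

```python
# Made-up two-record sample; real FASTA records would be much longer.
text = ">seq1\nACGT\n>seq2\nTTGA\n"

# split() discards the separator; the leading ">" also yields an empty first item.
records = text.split(">")
print(records)

# Prefixing the joined string marks only the first record.
joined_once = ">" + "\n".join(r for r in records if r)

# Prefixing each record inside a list comprehension marks all of them.
prefixed = [">%s" % r for r in records if r]
print("".join(prefixed))
```

    The `if r` filter skips the empty string that split() produces when the text itself begins with ">".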

    • GeneticsJustin
      New Member
      • Nov 2010
      • 3

      #3
      Would I have to convert my list to a string for this? Also, in your example, records = ["123", "456", "789"] is typed in by hand. How would I get Python to fill in those values automatically? The text file I'm using is hundreds of thousands of characters long, so I can't possibly do this manually.

      • bvdet
        Recognized Expert Specialist
        • Oct 2006
        • 2851

        #4
        No conversion is needed, since text.split('>') already gives you a list of strings. In your case it would be:
        Code:
        NewLines = [">%s" % (s) for s in random.sample(records, 3)]
        Or, to add the prefix to every record right after splitting:
        Code:
        records = [">%s" % (s) for s in text.split('>')]
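
        Putting the pieces together, a minimal sketch of the whole split, prefix, and sample step might look like this. It is written in Python 3 syntax, and the helper name sample_fasta_records, its seed parameter, and the demo text are hypothetical, not from the original program. It also assumes ">" appears only at the start of each record:

```python
import random


def sample_fasta_records(text, k, seed=None):
    """Return k randomly chosen records, each starting with '>'.

    Hypothetical helper: the name and the seed parameter are illustrative only.
    """
    # Split on ">", skip empty pieces (such as the one before the first ">"),
    # and put the ">" back on every remaining record.
    records = [">" + chunk.strip() for chunk in text.split(">") if chunk.strip()]
    rng = random.Random(seed)  # passing a seed makes the sample repeatable
    return rng.sample(records, k)


# Made-up stand-in for fileobj.read(); replace with the real file contents.
demo_text = ">a\nACGT\n>b\nGGCC\n>c\nTTAA\n>d\nCCGG\n"
picked = sample_fasta_records(demo_text, 3, seed=0)
print("\n".join(picked))
```

        Because every record now carries its own ">", the print line should no longer add one in front of the joined string. Filtering out empty pieces also means the first character of the file does not have to be read and ignored separately.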

        • GeneticsJustin
          New Member
          • Nov 2010
          • 3

          #5
          Thanks! That works perfectly!
