Parsing tab separated .txt files with common and distinct attributes

**bvdet** · Oct 11 '10, 01:45 PM

I am unclear about your end goal. It seems you want to read multiple files, read the first line of each file, split the line on the tab character, then write the first two elements to the output file. Would you please clarify what output you want?

**haobijam** · Oct 13 '10, 12:20 PM

Parsing headers with \n\n separated

Dear,

Please find the attached zip file. I would like to extract only the headers from each parsed file. The header in every file starts with Array Design Name, but it does not end with a fixed attribute. The headers are separated from the rest of the file by a blank line (\n\n), so I would like to extract everything up to that gap, as shown in the attached zip file: only the headers circled in RED. I would be glad of your support and cooperation.

With regards,
Haobijam

Attached Files

headers.zip (73.7 KB, 216 views)

**haobijam** · Oct 13 '10, 12:41 PM

Parsing headers with \n\n separated

I would like to extract only the headers from each parsed file. The header in every file starts with Array Design Name, but it does not end with a fixed attribute. The headers are separated from the rest of the file by a blank line (\n\n), so I would like to extract everything up to that gap, as shown in the attached zip file: only the headers circled in RED. I have attached the output of the script I have written. I would be glad of your support and cooperation.
The file can be found at this URL:
ftp://ftp.ebi.ac.uk/pub/databases/mi...FFY-10.adf.txt

The source code I have written is below:

Code:

#!/usr/bin/python
import glob

outfile = open('output_attri.txt' , 'w')
files = glob.glob('*.adf.txt')

for file in files:
    infile = open(file)
    
    for line in infile:
        line = line.replace('^' , '\n\n').replace('!' , '').replace('#' , '').replace('\n','')
        lineArray = line.split('%s\t')
        if line == '\n\n':
            outfile.write('')
            break;
        elif len(lineArray) > 2:            
            output = "%s\t%s\n"%(lineArray[0],lineArray[1])
            outfile.write(output)
        else:
            output = "%s\t\n"%(lineArray[0])
            outfile.write(output)
    infile.close()
outfile.close()

With regards,
Haobijam

Attached Files

output_attribute.zip (2.46 MB, 161 views)

**bvdet** · Oct 13 '10, 12:50 PM

Is there always a blank line separating the header info you want from the data you do not want? You only want the first two elements of each header line? Untested:

Code:

outFile = open(outFileName, 'w')
for fn in fileNameList:
    f = open(fn)
    output = []
    for line in f:
        line = line.strip().split("\t")
        if line:
            output.append("\t".join(line[:2]))
        else:
            outFile.write("\n".join(output))
            break
outFile.close()

**haobijam** · Oct 13 '10, 01:56 PM

Dear,
Yes, in all the files there is always a blank line separating the header information I want from the text data I do not want to extract.

Regards,
Haobijam

**haobijam** · Oct 13 '10, 02:31 PM

Dear,

What is wrong with this script? I could not print any output.

Code:

#!/usr/bin/python
import glob

outFile = open('output.txt', 'w')
fileNameList = glob.glob('*.adf.txt')
for file in fileNameList:
    f = open(file)
    output = []
    for line in f:
        line = line.strip().split("\t")
        #lineArray = line.split('\t')
        if line:
            #output = "%s\t%s\n"%(lineArray[0],lineArray[1])
            output.append("\t".join(line[:2]))
        else:
            outFile.write("\n".join(output))
            break
    f.close()
outFile.close()


**bvdet** · Oct 13 '10, 03:17 PM

PLEASE use code tags when posting code. That way I will not have to edit your post.

There are no print statements. Is there any content in the output file? Add print statements, as in print line, to see what is being read.

BV - Moderator
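
A note for readers on one likely reason the script above writes nothing: splitting an empty string on a tab returns a list holding one empty string, and a non-empty list is truthy, so the else branch that writes the output is never reached. The Python 3 sketch below shows that behaviour. The helper name is_blank is made up for illustration and is not part of the original script.

```python
# A blank line read from a file is "\n"; after strip() it is "".
blank = "\n".strip()

# Splitting an empty string on a tab yields [''], not [].
fields = blank.split("\t")
print(fields)        # ['']
print(bool(fields))  # True, so "if line:" passes even for a blank line


def is_blank(raw_line):
    """Hypothetical helper: test for blankness before splitting."""
    return not raw_line.strip()


print(is_blank("\n"))         # True
print(is_blank("Name\tX\n"))  # False
```

Testing the stripped string first, and splitting only afterwards, avoids the problem.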

**bvdet** · Oct 13 '10, 03:40 PM

This writes the header information to disk:

Code:

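import glob

outFileName = 'output.txt'             # assumed name, taken from the earlier post
fileNameList = glob.glob('*.adf.txt')  # assumed pattern, taken from the earlier post

outFile = open(outFileName, 'w')
for fn in fileNameList:
    f = open(fn)
    output = []
    for line in f:
        line = line.strip()
        if not line:
            # The first blank line marks the end of the header block.
            break
        output.append(line)
    f.close()
    # End with a newline so the headers of consecutive files do not run together.
    outFile.write("\n".join(output) + "\n")
outFile.close()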

**dwblas** · Oct 13 '10, 04:29 PM

Note that the test below will never be true, because the file is read one line at a time, so the two newlines arrive as two separate records. Test for len(line.strip()) == 0 instead to find an empty record.

Code:

        if '\n\n' in line:
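
To illustrate the point, here is a small Python 3 sketch. It uses io.StringIO as an in-memory stand-in for a real .adf.txt file, and the sample text is invented. It shows that iterating over a file yields one line at a time, so no single line can contain '\n\n', while a stripped-length test does find the blank record.

```python
import io

# Invented sample: two header lines, a blank line, then table data.
sample = "Array Design Name\tX\nVersion\t1\n\nReporter Name\tA\n"

lines = list(io.StringIO(sample))

# Each record ends with at most one newline, so this is never true.
print(any("\n\n" in line for line in lines))  # False

# An empty record is one whose stripped length is zero.
header = []
for line in lines:
    if len(line.strip()) == 0:
        break
    header.append(line.rstrip("\n"))

print(header)  # ['Array Design Name\tX', 'Version\t1']
```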

**haobijam** · Oct 14 '10, 05:11 AM

Parsing tab separated .txt files with distinct or unique attributes

Dear Sir,
I have written a script to extract the first line, which starts with Source Name and ends with Comment [ArrayExpress Data Retrieval URI], and that part works. However, I could not parse the distinct (unique) attributes, that is, those that are not repeated in every file. I would like to parse only the attributes in the first line, not the table values. Could you please correct this script? I have attached a zip file with all the sdrf.txt files. The file can be found at this URL:
ftp://ftp.ebi.ac.uk/pub/databases/mi...FMX-1.sdrf.txt

Regards,
Haobijam

**haobijam** · Oct 14 '10, 05:13 AM

Code:

#!/usr/bin/python
import glob
#import linecache
outfile = open('output_att.txt' , 'w')
files = glob.glob('*.sdrf.txt')
for file in files:
    infile = open(file)
    #count = 0
    for line in infile:
        
        lineArray = line.rstrip()
        if not line.startswith('Source Name') : continue
        #count = count + 1
        lineArray = line.split('%s\t')
        print lineArray[0]
        output = "%s\t\n"%(lineArray[0])
        outfile.write(output)
    infile.close()
outfile.close()
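
One possible reading of the problem with the script above: line.split('%s\t') looks for the literal text %s followed by a tab, which does not occur in the file, so the whole line comes back as a single element. Splitting on '\t' alone gives the individual attributes. The following is a hedged Python 3 sketch, not a definitive fix. The helper name header_attributes is hypothetical and the sample lines are invented.

```python
def header_attributes(lines, prefix="Source Name"):
    """Return the tab-separated attributes of the first line that
    starts with `prefix`, or an empty list if there is none."""
    for line in lines:
        if line.startswith(prefix):
            # Split on the tab character itself, not on '%s\t'.
            fields = line.rstrip("\n").split("\t")
            return [field.strip() for field in fields if field.strip()]
    return []


# Invented sample standing in for the contents of one sdrf.txt file.
sample = [
    "Source Name\tCharacteristics[Sex]\tArray Data File\n",
    "sample 1\tmale\tdata1.CEL\n",
]

print(header_attributes(sample))
# ['Source Name', 'Characteristics[Sex]', 'Array Data File']
```

With real files, an open file object can be passed in place of the list, since the function only iterates over lines and stops at the first match.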

**haobijam** · Oct 19 '10, 05:28 AM

Parsing attributes from sdrf.txt files and extracting unique terms for all sdrf.txt

Dear Sir,

I would like to extract only the terms that are unique across all the sdrf.txt files, but this Python code outputs the unique terms for each file individually. Terms such as Array Data File and Array Design REF are repeated in most sdrf.txt files, so I do not want to print them as unique terms. Could you also please tell me how to ignore case in Python? At the moment Characteristics[OrganismPart] is printed as a separate unique term from Characteristics[organism part], and likewise Characteristics[Sex] from Characteristics[sex]. I look forward to your reply.

Code:

#!/usr/bin/python
import glob
import string

outfile = open('output.txt' , 'w')
files = glob.glob('*.sdrf.txt')
previous = set()
for file in files:
    print('\n'+file)
    infile = open(file)
    #previous = set() # uncomment this if do not need to be unique between the files
    for line in infile:
        lineArray = line.rstrip()
        if not line.startswith('Source Name') : continue
        lineArray = line.split('%s\t')
        output = "%s\t\n"%(lineArray[0])
        outfile.write(output)
        uniqwords = set(word.strip() for word in lineArray[0].split('\t')
                        if word.strip() and word.strip() not in previous) 
        print('The %i unique terms are:\n\t%s' % (len(uniqwords),'\n\t'.join(sorted(uniqwords))))
        previous |=  uniqwords 
    infile.close()
outfile.close()
print('='*80)
print('The %i terms are:\n\t%s' % (len(previous),'\n\t'.join(sorted(previous))))
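
On the case-sensitivity question, one common approach is to compare terms by a normalised key rather than by their raw spelling. The Python 3 sketch below lower-cases each term and removes whitespace, so that Characteristics[OrganismPart] and Characteristics[organism part] count as the same term. The function names normalise and unique_terms are illustrative, and the normalisation rule is an assumption about which spellings should be treated as equal.

```python
def normalise(term):
    """Comparison key: lower-case, with all whitespace removed."""
    return "".join(term.lower().split())


def unique_terms(terms):
    """Keep the first spelling seen for each normalised key."""
    seen = {}
    for term in terms:
        term = term.strip()
        if term:
            seen.setdefault(normalise(term), term)
    return sorted(seen.values())


terms = [
    "Characteristics[OrganismPart]",
    "Characteristics[organism part]",
    "Characteristics[Sex]",
    "Characteristics[sex]",
    "Array Data File",
]

print(unique_terms(terms))
# ['Array Data File', 'Characteristics[OrganismPart]', 'Characteristics[Sex]']
```

In the script above, the same key could be used when testing membership in the previous set, while the original spelling is kept for printing.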

**haobijam** · Oct 28 '10, 04:57 AM

Parsing attributes and extracting unique terms from adf.txt

Dear Sir,

I have a query about parsing attributes and extracting unique terms from the adf.txt files from ArrayExpress [ftp://ftp.ebi.ac.uk/pub/databases/mi...y/data/array/]. The Python code below works for individual files whose header starts with the same term, but it is not practical for running over about 2270 adf.txt files at once. Could you please correct line 12 of this code (the startswith('Reporter Name') test), or suggest some tips for it? What I would like to do is parse the first line of every adf.txt file (2270 of them) and then extract the unique terms and the common terms. For your convenience I have attached a zip file of sample adf.txt files; more are available at the FTP site mentioned above. I would be glad of your support and cooperation.

With warm regards,
Haobijam

Code:

#!/usr/bin/python
import glob
import string
with open('output_Reporter Name.txt' , 'w') as outfile:
    files = glob.glob('*.adf.txt')
    uniqwords = set()
    previous = set()
    for file in files:
        with open(file) as infile:
            #previous = set() # uncomment this if do not need to be unique between the files
            for line in infile:
                if not line.startswith('Reporter Name') : continue ## change this line to deal with other form
                output = line
                uniqwords = set(word.strip() for word in line.rstrip().split('\t')
                                if word.strip() and word.strip() not in previous)
                previous |=  uniqwords
                print (output)
                outfile.write(output)
print('The %i unique terms are:\n\t%s' % (len(uniqwords),'\n\t'.join(sorted(uniqwords))))                  
print('='*80)
print('The %i terms are:\n\t%s' % (len(previous),'\n\t'.join(sorted(previous))))
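
A possible direction for the question above, offered as a sketch rather than a tested solution for the real ArrayExpress files. str.startswith accepts a tuple of alternatives, so the test on line 12 can cover several starting terms at once, and collections.Counter can record in how many files each term occurs. Terms found in every file are then the common ones, and terms found in exactly one file are the unique ones. The function names, the sample data, and the list of starting terms are all assumptions made for illustration.

```python
from collections import Counter


def header_terms(lines, prefixes):
    """Terms of the first line starting with any of `prefixes`."""
    for line in lines:
        # str.startswith accepts a tuple of alternatives.
        if line.startswith(prefixes):
            return {t.strip() for t in line.split("\t") if t.strip()}
    return set()


def common_and_unique(headers):
    """`headers` is a list of term sets, one set per file."""
    counts = Counter(term for terms in headers for term in terms)
    common = {t for t, n in counts.items() if n == len(headers)}
    unique = {t for t, n in counts.items() if n == 1}
    return common, unique


# Invented samples standing in for three adf.txt files.
files = [
    ["Reporter Name\tReporter Sequence\tComment[A]\n"],
    ["Composite Element Name\tReporter Name\n"],
    ["Reporter Name\tReporter Group\n"],
]
prefixes = ("Reporter Name", "Composite Element Name")  # assumed starting terms

headers = [header_terms(lines, prefixes) for lines in files]
common, unique = common_and_unique(headers)
print(sorted(common))  # ['Reporter Name']
print(sorted(unique))
```

With real data, each inner list would be replaced by an open file object for a name returned by glob.glob('*.adf.txt'). Because header_terms returns at the first matching line, each file is read only as far as its header.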

Attached Files

adf.zip (1.01 MB, 147 views)