Parsing tab separated .txt files with common and distinct attributes

  • haobijam

    Parsing tab separated .txt files with common and distinct attributes

    I would like to parse tab-separated .txt files and separate the common attributes from the distinct attributes. I want to parse only the attributes on the first line, not the values. Could you please help me fix this script? A sample file is available at this URL:
    ftp://ftp.ebi.ac.uk/pub/databases/mi...MX-10.sdrf.txt

    The source code I have written is below:
    Code:
    #!/usr/bin/python
    import glob
    outfile = open('output_attribute.txt' , 'w')
    files = glob.glob('*.sdrf.txt')
    for file in files:
        infile = open(file)
        #ret = False
        for line in infile:
            lineArray = line.split('\t')
            
            if '\n\n' in line:
                ret = false
                outfile.write('')
                break;
            elif len(lineArray) > 2:            
               output = "%s\t%s\n\n"%(lineArray[0],lineArray[1])
               outfile.write(output)
            else:
                output = "%s\t\n"%(lineArray[0])
                outfile.write(output)
        infile.close()
    outfile.close()
    Last edited by bvdet; Oct 11 '10, 01:29 PM. Reason: Please use code tags when posting code.
  • bvdet
    Recognized Expert Specialist
    • Oct 2006
    • 2851

    #2
    I am unclear about your end goal. It seems you want to read multiple files, read the first line of each file, split the line on the tab character, then write the first two elements to the output file. Would you please clarify what output you want?

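The reading described in #2 (open each file, take its first line, split on the tab character, write the result) can be sketched as follows. This is only a minimal illustration, assuming Python 3 and assuming the attribute names really are on the first line of each file. The helper names first_line_attributes and write_attributes are made up for this example, and it writes every attribute rather than only the first two.

```python
import glob


def first_line_attributes(path):
    """Return the tab-separated attribute names on the first line of path."""
    with open(path) as infile:
        first_line = infile.readline()
    # Strip only the line ending so that empty trailing columns stay visible.
    return first_line.rstrip("\r\n").split("\t")


def write_attributes(pattern, out_path):
    """Write one line of attribute names per matching file; return the file count."""
    paths = sorted(glob.glob(pattern))
    with open(out_path, "w") as outfile:
        for path in paths:
            outfile.write("\t".join(first_line_attributes(path)) + "\n")
    return len(paths)
```

With the file names from the first post, a call would look like write_attributes('*.sdrf.txt', 'output_attribute.txt').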

    • haobijam
      New Member
      • Oct 2010
      • 16

      #3
      Parsing headers separated by \n\n

      Dear,

      Please find the attached zip file. I would like to extract only the headers from each parsed file. The header in every file starts with Array Design Name, but it does not end with a fixed attribute. So I would like to extract the header block that is separated from the rest of the file by a blank line (\n\n), as shown in the attached zip file. I want only the headers circled in red. I would be grateful for your support and cooperation.

      With regards,
      Haobijam

      • haobijam
        New Member
        • Oct 2010
        • 16

        #4
        Parsing headers separated by \n\n

        I would like to extract only the headers from each parsed file. The header in every file starts with Array Design Name, but the attribute it ends with is not fixed. So I would like to extract the header block that is separated from the rest of the file by a blank line (\n\n), as shown in the attached zip file. I want only the headers circled in red. I have also attached the output of the script below. I would be grateful for your support and cooperation.
        A sample file is available at this URL:
        ftp://ftp.ebi.ac.uk/pub/databases/mi...FFY-10.adf.txt

        The source code I have written is below:

        Code:
        #!/usr/bin/python
        import glob
        
        outfile = open('output_attri.txt' , 'w')
        files = glob.glob('*.adf.txt')
        
        for file in files:
            infile = open(file)
            
            for line in infile:
                line = line.replace('^' , '\n\n').replace('!' , '').replace('#' , '').replace('\n','')
                lineArray = line.split('%s\t')
                if line == '\n\n':
                    outfile.write('')
                    break;
                elif len(lineArray) > 2:            
                    output = "%s\t%s\n"%(lineArray[0],lineArray[1])
                    outfile.write(output)
                else:
                    output = "%s\t\n"%(lineArray[0])
                    outfile.write(output)
            infile.close()
        outfile.close()

        With regards,
        Haobijam
        Last edited by bvdet; Oct 13 '10, 03:19 PM. Reason: Add code tags

        • bvdet
          Recognized Expert Specialist
          • Oct 2006
          • 2851

          #5
          Is there always a blank line separating the header info you want from the data you do not want? You only want the first two elements of each header line? Untested:
          Code:
          outFile = open(outFileName, 'w')
          for fn in fileNameList:
              f = open(fn)
              output = []
              for line in f:
                  line = line.strip().split("\t")
                  if line:
                      output.append("\t".join(line[:2]))
                  else:
                      outFile.write("\n".join(output))
                      break
          outFile.close()

          • haobijam
            New Member
            • Oct 2010
            • 16

            #6
            Dear,
            Yes, in all the files there is always a blank line separating the header information I want from the data I do not want to extract.

            Regards,
            Haobijam

            • haobijam
              New Member
              • Oct 2010
              • 16

              #7
              Dear,

              What is wrong with this script? It does not produce any output.
              Code:
              #!/usr/bin/python
              import glob
              
              outFile = open('output.txt', 'w')
              fileNameList = glob.glob('*.adf.txt')
              for file in fileNameList:
                  f = open(file)
                  output = []
                  for line in f:
                      line = line.strip().split("\t")
                      #lineArray = line.split('\t')
                      if line:
                          #output = "%s\t%s\n"%(lineArray[0],lineArray[1])
                          output.append("\t".join(line[:2]))
                      else:
                          outFile.write("\n".join(output))
                          break
                  f.close()
              outFile.close()
              Last edited by bvdet; Oct 13 '10, 03:14 PM. Reason: Please use code tags when posting code. [code]....code goes here....[/code]

              • bvdet
                Recognized Expert Specialist
                • Oct 2006
                • 2851

                #8
                PLEASE use code tags when posting code. That way I will not have to edit your post.

                There are no print statements. Is there any content in the output file? Add print statements, as in print line, to see what is being read.

                BV - Moderator

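One thing that print statements would reveal in the script from #7: the expression line.strip().split("\t") turns a blank line into a list holding one empty string, and a non-empty list is always true. The else branch that writes to the file therefore appears never to run. The short sketch below (Python 3; the helper names split_fields and is_blank are made up for this example) shows the difference between testing the list and testing the stripped string.

```python
def split_fields(line):
    """Reproduce the expression used in the script: strip, then split on tabs."""
    return line.strip().split("\t")


def is_blank(line):
    """Test the stripped string itself, before any splitting."""
    return not line.strip()


blank_fields = split_fields("\n")
print(repr(blank_fields))   # [''] : a list with one empty string
print(bool(blank_fields))   # True : so "if line:" succeeds even for a blank line
print(is_blank("\n"))       # True : the string-level test does detect the blank line
```

Testing the stripped string first and splitting afterwards avoids the problem.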

                • bvdet
                  Recognized Expert Specialist
                  • Oct 2006
                  • 2851

                  #9
                  This writes the header information to disk:
                  Code:
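                  import glob

                  # File names taken from post #7; adjust as needed.
                  outFileName = 'output.txt'
                  fileNameList = glob.glob('*.adf.txt')

                  outFile = open(outFileName, 'w')
                  for fn in fileNameList:
                      f = open(fn)
                      output = []
                      for line in f:
                          line = line.strip()
                          if not line:
                              # The first blank line ends the header block.
                              break
                          output.append(line)
                      f.close()
                      # Written outside the inner loop so a file without a blank
                      # line still produces output; the trailing blank line
                      # separates one file's header from the next.
                      outFile.write("\n".join(output) + "\n\n")
                  outFile.close()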

                  • dwblas
                    Recognized Expert Contributor
                    • May 2008
                    • 626

                    #10
                    Note that the test below will never succeed, because the two newlines are read as parts of two separate records. Test len(line.strip()) instead to find an empty record.
                    Code:
                            if '\n\n' in line:

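To illustrate the point in #10: iterating over a file yields one line at a time, so no single line can contain two newlines. The sketch below assumes Python 3 and uses io.StringIO with made-up sample text in place of a real adf.txt file; header_lines is a hypothetical helper name.

```python
import io


def header_lines(stream):
    """Collect stripped lines up to (not including) the first blank line."""
    header = []
    for line in stream:
        if len(line.strip()) == 0:   # empty record: the blank separator line
            break
        header.append(line.strip())
    return header


# Made-up sample: two header lines, a blank line, then a table heading.
sample = io.StringIO("Array Design Name\tX\nVersion\t1\n\nReporter Name\tY\n")
print(any("\n\n" in line for line in sample))   # False: the test never matches
sample.seek(0)
print(header_lines(sample))   # ['Array Design Name\tX', 'Version\t1']
```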

                    • haobijam
                      New Member
                      • Oct 2010
                      • 16

                      #11
                      Parsing tab separated .txt files with distinct or unique attributes

                      Dear Sir,
                      I have written a script that extracts the first line, which starts with Source Name and ends with Comment [ArrayExpress Data Retrieval URI]. That part works, but I could not extract the distinct (unique) attributes, that is, the ones that are not repeated in every file. I want to parse only the attributes on the first line, not the table values. Could you please help me fix this script? I have attached a zip file with all the sdrf.txt files. A sample file is available at this URL:
                      ftp://ftp.ebi.ac.uk/pub/databases/mi...FMX-1.sdrf.txt

                      Code:
                      
                      
                      Regards,
                      Haobijam

                      • haobijam
                        New Member
                        • Oct 2010
                        • 16

                        #12
                        Code:
                        #!/usr/bin/python
                        import glob
                        #import linecache
                        outfile = open('output_att.txt' , 'w')
                        files = glob.glob('*.sdrf.txt')
                        for file in files:
                            infile = open(file)
                            #count = 0
                            for line in infile:
                                
                                lineArray = line.rstrip()
                                if not line.startswith('Source Name') : continue
                                #count = count + 1
                                lineArray = line.split('%s\t')
                                print lineArray[0]
                                output = "%s\t\n"%(lineArray[0])
                                outfile.write(output)
                            infile.close()
                        outfile.close()

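A note on the script in #12: line.split('%s\t') uses the literal text '%s' followed by a tab as the separator. That text does not occur in the file, so no split happens and lineArray[0] is the whole line. The result of line.rstrip() is also overwritten before it is used. A minimal sketch of splitting the Source Name line into separate terms, assuming Python 3 (header_terms and find_header are hypothetical helper names):

```python
def header_terms(line):
    """Split a header line on tab characters and drop empty fields."""
    return [term.strip() for term in line.rstrip("\r\n").split("\t") if term.strip()]


def find_header(lines, first_term="Source Name"):
    """Return the terms of the first line that starts with first_term, else []."""
    for line in lines:
        if line.startswith(first_term):
            return header_terms(line)
    return []


line = "Source Name\tCharacteristics[sex]\tArray Data File\n"
print(line.split("%s\t"))    # one element: the whole line, nothing was split
print(header_terms(line))    # ['Source Name', 'Characteristics[sex]', 'Array Data File']
```

An open file object can be passed directly as the lines argument of find_header.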

                        • haobijam
                          New Member
                          • Oct 2010
                          • 16

                          #13
                          Parsing attributes from sdrf.txt files and extracting unique terms for all sdrf.txt

                          Dear Sir,

                          I would like to extract only the terms that are unique across all sdrf.txt files, but this Python code outputs the unique terms for each file individually. Terms such as Array Data File and Array Design REF are repeated in most sdrf.txt files, so I do not want to print them as unique terms. Could you also tell me how to ignore case in Python? Characteristics[OrganismPart] is printed as a term distinct from Characteristics[organism part], and likewise Characteristics[Sex] and Characteristics[sex]. I look forward to your support and reply.
                          Code:
                          #!/usr/bin/python
                          import glob
                          import string
                          
                          outfile = open('output.txt' , 'w')
                          files = glob.glob('*.sdrf.txt')
                          previous = set()
                          for file in files:
                              print('\n'+file)
                              infile = open(file)
                              #previous = set() # uncomment this if do not need to be unique between the files
                              for line in infile:
                                  lineArray = line.rstrip()
                                  if not line.startswith('Source Name') : continue
                                  lineArray = line.split('%s\t')
                                  output = "%s\t\n"%(lineArray[0])
                                  outfile.write(output)
                                  uniqwords = set(word.strip() for word in lineArray[0].split('\t')
                                                  if word.strip() and word.strip() not in previous) 
                                  print('The %i unique terms are:\n\t%s' % (len(uniqwords),'\n\t'.join(sorted(uniqwords))))
                                  previous |=  uniqwords 
                              infile.close()
                          outfile.close()
                          print('='*80)
                          print('The %i terms are:\n\t%s' % (len(previous),'\n\t'.join(sorted(previous))))

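On the case question in #13: lower-casing alone would not make Characteristics[OrganismPart] equal to Characteristics[organism part], because the second spelling also contains a space. One possible approach, sketched here under the assumption that whitespace inside a term carries no meaning, is to compare terms by a normalized key while keeping the first spelling seen for display. The names normalize and unique_terms are made up for this example (Python 3.7+ for ordered dicts).

```python
def normalize(term):
    """Key for comparing terms: lower-case with all whitespace removed."""
    return "".join(term.split()).lower()


def unique_terms(terms):
    """Keep the first spelling seen for each normalized key, in input order."""
    seen = {}
    for term in terms:
        seen.setdefault(normalize(term), term.strip())
    return list(seen.values())


print(normalize("Characteristics[OrganismPart]"))    # characteristics[organismpart]
print(normalize("Characteristics[organism part]"))   # characteristics[organismpart]
```

In the script above, storing normalize(word) in the previous set instead of word.strip() would treat the two spellings as the same term.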

                          • haobijam
                            New Member
                            • Oct 2010
                            • 16

                            #14
                            Parsing attributes and extracting unique terms from adf.txt

                            Dear Sir,

                            I have a question about parsing attributes and extracting unique terms from the adf.txt files in ArrayExpress [ftp://ftp.ebi.ac.uk/pub/databases/mi...y/data/array/]. The Python code below works for an individual file, or for files whose heading line starts with the same term, but it does not work when run on all of the roughly 2270 adf.txt files at once. Could you please correct line 12 of this code, or suggest how to improve it? I would like to parse the first line of every adf.txt file (2270 of them) and then extract the unique terms and the common terms. For your convenience I have attached a zip file of sample adf.txt files; more are available on the FTP site mentioned above. I would be grateful for your support and cooperation.

                            With warm regards,
                            Haobijam

                            Code:
                            #!/usr/bin/python
                            import glob
                            import string
                            with open('output_Reporter Name.txt' , 'w') as outfile:
                                files = glob.glob('*.adf.txt')
                                uniqwords = set()
                                previous = set()
                                for file in files:
                                    with open(file) as infile:
                                        #previous = set() # uncomment this if do not need to be unique between the files
                                        for line in infile:
                                            if not line.startswith('Reporter Name') : continue ## change this line to deal with other form
                                            output = line
                                            uniqwords = set(word.strip() for word in line.rstrip().split('\t')
                                                            if word.strip() and word.strip() not in previous)
                                            previous |=  uniqwords
                                            print (output)
                                            outfile.write(output)
                            print('The %i unique terms are:\n\t%s' % (len(uniqwords),'\n\t'.join(sorted(uniqwords))))                  
                            print('='*80)
                            print('The %i terms are:\n\t%s' % (len(previous),'\n\t'.join(sorted(previous))))

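For the question in #14, two ideas may help, offered as a sketch rather than a tested solution for the real ArrayExpress files. First, str.startswith accepts a tuple, so the test on line 12 can cover several possible starting terms at once. Second, counting in how many files each term occurs separates common terms (present in every file that has a heading line) from distinct terms (present in exactly one). The names terms_in_file and common_and_distinct are hypothetical, and Python 3 is assumed.

```python
from collections import Counter


def terms_in_file(path, first_terms):
    """Return the set of terms on the first line starting with any of first_terms."""
    with open(path) as infile:
        for line in infile:
            if line.startswith(first_terms):   # first_terms must be a str or a tuple
                return {t.strip() for t in line.rstrip("\r\n").split("\t") if t.strip()}
    return set()


def common_and_distinct(paths, first_terms=("Reporter Name",)):
    """Count the files each term occurs in; return (common, distinct) term lists."""
    counts = Counter()
    n_files = 0
    for path in paths:
        terms = terms_in_file(path, first_terms)
        if terms:                 # skip files without a matching heading line
            n_files += 1
            counts.update(terms)  # each term counted once per file
    common = sorted(t for t, n in counts.items() if n == n_files)
    distinct = sorted(t for t, n in counts.items() if n == 1)
    return common, distinct
```

A call such as common_and_distinct(glob.glob('*.adf.txt'), first_terms=('Reporter Name', 'Some Other Term')) would process all files in one pass; 'Some Other Term' is only a placeholder for whatever starting terms actually occur in the files. Note that with a single matching file every term counts as both common and distinct.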