Formatting text files

  • gurgietrueshot
    New Member
    • Jan 2007
    • 1

    Formatting text files

    I'm starting a new project where I have to take data from one text file and put it into another in a specified format (I have a 27-page document outlining that format). I am new to Python and was wondering if anyone had some tips on useful ways to parse text files using Python?
  • bvdet
    Recognized Expert Specialist
    • Oct 2006
    • 2851

    #2
    I'm no expert on parsing data, but regular expressions (module re) are a powerful tool.
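    As a rough illustration of what the re module can do, here is a minimal sketch. The line layout (a name followed by three numbers), the pattern, and the function name parse_point are all assumptions made up for the example; the real pattern would come from the format document.

```python
import re

# Hypothetical layout: a name, then three numbers, separated by whitespace.
NUMBER = r"-?\d+\.?\d*"
LINE_PATTERN = re.compile(
    r"(?P<name>\w+)\s+(?P<x>%s)\s+(?P<y>%s)\s+(?P<z>%s)" % (NUMBER, NUMBER, NUMBER)
)

def parse_point(line):
    """Return a dict of the fields in line, or None if the line does not match."""
    match = LINE_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    return {
        "name": fields["name"],
        "x": float(fields["x"]),
        "y": float(fields["y"]),
        "z": float(fields["z"]),
    }
```

    Named groups keep the parsing code readable when the format has many fields, and returning None for non-matching lines makes it easy to skip headers or comments.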


    • bartonc
      Recognized Expert Expert
      • Sep 2006
      • 6478

      #3
      Python has many powerful and easy to use tools for dealing with text. So many, in fact, that picking the tools is sometimes the hard part. After getting a grasp on the language syntax and structures, parsing text is fairly easy to implement.


      • badech
        New Member
        • Jan 2007
        • 16

        #4
        Hello,
        I hope this is useful.

        See also chapter 9 (page 112), but it's in French.


        • dshimer
          Recognized Expert New Member
          • Dec 2006
          • 136

          #5
          In my mind it also depends a great deal on the structure of the text. I do this task constantly and because of the setting I am in I deal primarily with two types of data files.

          One is text or numeric data that is fairly structured: each line may contain a variety of numbers or strings delimited by a character or white space. For example, to describe a point in space, a file may have hundreds of lines with
          Name,X,Y,Z,R1,R2,R3,Desc
          where Name is the name of a point, X, Y, Z are geographic coordinates, the R's are some kind of real numbers, and Desc is a text description. When working with these kinds of files I find it easiest to take the data from the input file and create a list of lists, then just loop over the lists outputting the data in the new format. I usually grab the whole file using readlines(), then run it through a function I call lines2lists, which works on data separated by whitespace but could easily be modified to add a delimiter variable.

          Code:
          def lines2lists(AListOfDataLines):
              '''
              readlines() returns an entire file as a list of strings, one per line.
              This function converts each string into a list of words, then returns
              a list of lists.
              Example:            lines2lists(['first line','the second line','or sentences'])
              Would return:       [['first','line'],['the','second','line'],['or','sentences']]
              '''
              DataList=[]
              for Line in AListOfDataLines:
                  # split() with no argument splits on any run of whitespace
                  DataList.append(Line.split())
              return DataList
          I usually combine it with readlines(), for example
          Code:
          Data=utilitymodule.lines2lists(AnOpenFile.readlines())
          Where Data is the resulting list in which the example above would look something like the following.
          [[Name,X,Y,Z,R1,R2,R3,Desc],[Name,X,Y,Z,R1,R2,R3,Desc],[Name,X,Y,Z,R1,R2,R3,Desc]]

          Then for each element in Data I format and output the elements of each coordinate (in this case).
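          A minimal sketch of the delimiter variant mentioned above, followed by the format-and-output step. The names lines_to_lists and format_point, and the output layout itself, are assumptions for illustration only; the real layout would come from the format specification.

```python
def lines_to_lists(lines, delimiter=None):
    """Split each line into fields.

    delimiter=None splits on runs of whitespace (like lines2lists);
    pass "," or another character for delimited files.
    """
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        rows.append([field.strip() for field in line.split(delimiter)])
    return rows


def format_point(row):
    """Format one Name,X,Y,Z,R1,R2,R3,Desc row in a made-up output layout."""
    name = row[0]
    x, y, z = float(row[1]), float(row[2]), float(row[3])
    desc = row[-1]
    return "%-8s %12.3f %12.3f %12.3f  %s" % (name, x, y, z, desc)


if __name__ == "__main__":
    sample = ["P1,1.0,2.0,3.0,0.1,0.2,0.3,corner\n"]
    for row in lines_to_lists(sample, ","):
        print(format_point(row))
```

          Note that with whitespace splitting, a description containing spaces would be broken into several fields, so a real delimiter is safer for files with free-text columns.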

          The other type of file I work with a lot is one in which there may be multiple elements of a particular dataset but they are on different lines of the file. Using the example above it might look like.
          Name1 aString
          Name2 aString
          X1 aNumber
          X2 aNumber
          Y1 aNumber
          Y2 aNumber
          and so on....

          In this case I still read in the lines, build lists from them, then use something like the count() function to test for a value in a particular list, if the test value exists, I grab the other member of the list which is the actual data then append it to the actual list of data. In pseudocode it would be something like

          If InputString.count(test value like name1) is true, then datastring.append(the value associated with name1). Then, when all the input strings have been parsed, for each element of datastring, format and output the values.
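          One way the pseudocode above could look in Python. This sketch swaps the count() test for a lookup on the first word of each line and collects the values in a dictionary rather than a list; the function name collect_values and the sample keys are made up for the example.

```python
def collect_values(lines, keys):
    """Map each wanted key to the value on the line that starts with that key."""
    found = {}
    for line in lines:
        parts = line.split(None, 1)  # the key, then the rest of the line
        if len(parts) == 2 and parts[0] in keys:
            found[parts[0]] = parts[1].strip()
    return found


if __name__ == "__main__":
    sample = ["Name1 alpha\n", "X1 10.5\n", "Y1 20\n"]
    values = collect_values(sample, ["Name1", "X1", "Y1"])
    # Output the collected values in whatever order the new format needs.
    print("%s %s %s" % (values["Name1"], values["X1"], values["Y1"]))
```

          Matching on the first word avoids false hits when a key such as X1 also appears inside another key or inside a value, which a plain count() test would not distinguish.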

          There are probably easier ways to do this, but most of the time I may only need to write a program every couple of weeks, may only have 5 minutes notice and need to have it done very quickly. Since there is so much power in lists I tend to stick with the functions I know and love and can hack together quickly.


          • bvdet
            Recognized Expert Specialist
            • Oct 2006
            • 2851

            #6
            dshimer,
            Whatever works for you in an efficient manner IS the easiest way. Here are two things that I learned on this forum:
            Code:
            f = open(dlg1.import_file, "r")
            # Files can be used as iterators. Internally, the for loop calls the
            # file's next() method (__next__() in Python 3).
            for item in f:
                pass  # ...do stuff...
            f.close()
            Code:
            import re
            # The "in" operator evaluates True if a sub-string is in a string
            if "subject_text" in item.lower():
                # Split the line on any of the characters : ; ,
                pt = re.split('[:;,]', item)
            Maybe you can use these sometime.
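            Put together, the two snippets might be used roughly like this. The function name split_matching_lines and the sample text are assumptions for the sketch, and io.StringIO stands in for a real open file so the example runs on its own.

```python
import io
import re


def split_matching_lines(lines, keyword):
    """Return the split fields of every line that contains keyword (case-insensitive)."""
    results = []
    for line in lines:  # works for an open file, a list, or any iterable of strings
        if keyword in line.lower():
            fields = re.split("[:;,]", line)
            results.append([field.strip() for field in fields])
    return results


if __name__ == "__main__":
    sample_file = io.StringIO("Subject_Text: a; b, c\nother line\n")
    for fields in split_matching_lines(sample_file, "subject_text"):
        print(fields)
```

            Iterating over the file directly reads one line at a time, which avoids loading a large file into memory the way readlines() does.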
