How to use Python to extract certain text from a file?

  • maximus tee
    New Member
    • Dec 2010
    • 30


    I want to extract certain sections of a text file. My input file:

    -- num cell port function safe [ccell disval rslt]
    "17 (BC_1, CLK, input, X)," &
    "16 (BC_1, OC_NEG, input, X), " &-- Merged input/
    " 8 (BC_1, D(8), input, X)," & -- cell 16 @ 1 -> Hi-Z
    " 7 (BC_1, Q(1), output3, X, 16, 1, Z)," &
    " 0 (BC_1, Q(8), output3, X, 16, 1, Z)";
    and I need the output to look like this:

    num cell port function safe ccell
    17 BC_1 CLK input X
    16 BC_1 OC_NEG input X
    16 BC_1 * control 1
    8 BC_1 D8 input X
    7 BC_1 Q1 output3 X 16 1
    0 BC_1 Q8 output3 X 16 1
    So far I have tried the code below, but it gives an IndexError. Please advise.

    Code:
    import re
    lines=open("input.txt",'r').readlines()
    
    for line in lines:
        a=re.findall(r'\w+',line)
        print re.findall(r'\w+',line)
        print a[0],a[1],a[2],a[3],a[4],a[5],a[6]
    I'm using Python 2.6.6 on Windows 7, and the output and error are as below:

    ['num', 'cell', 'port', 'function', 'safe', 'ccell', 'disval', 'rslt']
    num cell port function safe ccell disval
    ['17', 'BC_1', 'CLK', 'input', 'X']
    17 BC_1 CLK input X
    Traceback (most recent call last):
      File "C:\Users\ctee1\Desktop\pyparsing\outputparser.py", line 39, in <module>
        print a[0],a[1],a[2],a[3],a[4],a[5],a[6]
    IndexError: list index out of range

    Thanks, maximus
  • Mariostg
    Contributor
    • Sep 2010
    • 332

    #2
    I believe it is because there are only 5 elements in "17 BC_1 CLK input X" and you are trying to print 7 (a[0] to a[6]).
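To illustrate the point, here is a minimal sketch. It is written in Python 3 syntax (the thread itself uses Python 2.6), and the sample line is copied from the first post. It shows that the line yields only five tokens, and that a slice is one way to avoid the IndexError, since slicing a short list does not raise.

```python
import re

# Sample line copied from the input shown in the first post.
line = '"17 (BC_1, CLK, input, X)," &'

tokens = re.findall(r'\w+', line)
print(len(tokens))  # how many tokens this line actually has

# tokens[6] would raise IndexError here because the list is too short.
# A slice returns whatever is available instead of raising.
first_seven = tokens[:7]
print(' '.join(first_seven))
```

The slice keeps at most seven items, so longer lines are truncated and shorter lines pass through unchanged.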


    • maximus tee
      New Member
      • Dec 2010
      • 30

      #3
      Thanks, I will look into it again.


      • Glenton
        Recognized Expert Contributor
        • Nov 2008
        • 391

        #4
        An easy way to do this more safely would be something like:
        Code:
        import re
        lines=open("input.txt",'r').readlines()
         
        for line in lines:
            a=re.findall(r'\w+',line)
            print re.findall(r'\w+',line)
            #print a[0],a[1],a[2],a[3],a[4],a[5],a[6]
            for b in a:
                print b,
            print


        • maximus tee
          New Member
          • Dec 2010
          • 30

          #5
          Hi, thanks for your feedback.
          Is the bare print on the last line a typo? And inside the for loop there is a print with a trailing comma; what is that for?
          I was thinking of doing:
          Code:
          for line in lines:
              a=line.split('-')[0]
              print a
              for b in a: 
                  print b,
              print


          • Glenton
            Recognized Expert Contributor
            • Nov 2008
            • 391

            #6
            The "print b," is to print b without going to a new line. It's to do the equivalent of printing a[0], a[1], a[2],... for as many as are needed.

            The "print" at the end is to make a new line.
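For readers on Python 3, where print is a function and the trailing-comma form no longer exists, the same idea might look like the sketch below. The token list is just a sample taken from the output in the first post.

```python
tokens = ['17', 'BC_1', 'CLK', 'input', 'X']

# Python 3 spelling of the Python 2 loop "for b in a: print b,":
# end=' ' keeps the output on the same line after each item.
for token in tokens:
    print(token, end=' ')
print()  # the bare print that finishes the line

# The same text (without the trailing space) built in one step.
joined = ' '.join(tokens)
print(joined)
```

Either form works for any number of tokens, which is the point of replacing the fixed a[0] ... a[6] indexing.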

            I was assuming your regular expression was working, but perhaps it isn't. I can't imagine that your split expression would work either.

            Perhaps you can explain the logic of what you're trying to achieve. There are many ways that you could get that output given that input. But what are the more general rules? E.g., is the first line always in that format? Do you find that all lines are one of two formats? If you can provide more about what you're trying to achieve, then it will be easier to help.


            • maximus tee
              New Member
              • Dec 2010
              • 30

              #7
              Apologies for the confusion.
              General rules:
              1) The first line inside input.txt (as attached in the first post) is:
              -- num cell port function safe [ccell disval rslt]
              From it I just need: num cell port function safe ccell
              However, my script below couldn't get this, so I skip the line. It would be great if you could show me how.

              2) I'm trying to convert this line inside input.txt:
              "17 (BC_1, CLK, input, X)," &
              into: 17 BC_1 CLK input X
              Basically, I'm only extracting the columns for num, cell, port, function, safe and ccell. The rest are not needed.

              Likewise, " 7 (BC_1, Q(1), output3, X, 16, 1, Z)," &
              should become: 7 BC_1 Q1 output3 X 16 1

              So far my script is as follows; it produces the output.txt attached to the first post:

              Code:
              import re
              
              fileIn = open("input.txt", "rb")
              fileOut = open("output.txt", "w")
              
              for strData in fileIn:
                  strData = strData.split('-')[0] #this is to remove the first line
              
                  if("input" in strData):
                      a=re.split("\W+", strData)
                      #print a
                      #fileOut.write (' '.join(a[1:7]) )
                      fileOut.write(a[1]+' '+a[2]+' '+a[3]+' '+a[4]+' '+a[5]+' '+a[6]+'\n')
                
                  if("output" in strData):
                      a=re.split("\W+", strData)
                      #print a
                      fileOut.write(a[1]+' '+a[2]+' '+a[3]+' '+a[4]+' '+a[5]+' '+a[6]+' '+a[7]+'\n')
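On rule 1: split('-')[0] leaves nothing useful from a line that begins with "--", which is why the header gets skipped in the script above. One possible way to keep it is sketched below in Python 3 syntax. It assumes the header always lists the wanted column names first, as in the sample.

```python
import re

# Header line copied from the first post.
header = '-- num cell port function safe [ccell disval rslt]'

# \w+ skips the leading "--" and the square brackets, leaving bare words.
words = re.findall(r'\w+', header)

# Keep only the first six column names; disval and rslt are not wanted.
wanted = ' '.join(words[:6])
print(wanted)
```

If the header layout varies between files, the fixed count of six would need to be replaced by a lookup on the column names.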


              • Glenton
                Recognized Expert Contributor
                • Nov 2008
                • 391

                #8
                Regular expressions are what you need. They take a bit of getting used to, but work brilliantly once you get the hang of them.

                The lines in your input.txt file have differing numbers of fields, so the code below works for the earlier lines, which have the format you posted originally, but not for the later ones:

                Code:
                import re
                
                lines=open("input.txt","r")
                
                p=re.compile('   " *(.*) \((.*), (.*), (.*), (.*)\)," &.*')
                
                
                for line in lines:
                    m=p.match(line)
                    if not m: continue
                    for i in range(1,6):
                        print m.group(i),
                    print
                Gives
                Code:
                17 BC_1 CLK input X
                15 BC_1 D(1) input X
                14 BC_1 D(2) input X
                13 BC_1 D(3) input X
                12 BC_1 D(4) input X
                11 BC_1 D(5) input X
                10 BC_1 D(6) input X
                9 BC_1 D(7) input X
                8 BC_1 D(8) input X
                7 BC_1, Q(1), output3, X 16 1 Z
                6 BC_1, Q(2), output3, X 16 1 Z
                5 BC_1, Q(3), output3, X 16 1 Z
                4 BC_1, Q(4), output3, X 16 1 Z
                3 BC_1, Q(5), output3, X 16 1 Z
                2 BC_1, Q(6), output3, X 16 1 Z
                1 BC_1, Q(7), output3, X 16 1 Z


                • maximus tee
                  New Member
                  • Dec 2010
                  • 30

                  #9
                  Wow, only a few lines of code. I don't get regular expressions; they are hard.
                  I don't understand:
                  1) p=re.compile('   " *(.*) \((.*), (.*), (.*), (.*)\)," &.*')
                  2) m=p.match(line): what does matching a line mean?
                  3) m.group(i): what does it group?
                  I tried to print but it only print address.

                  Thanks


                  • Glenton
                    Recognized Expert Contributor
                    • Nov 2008
                    • 391

                    #10
                    You can look here to get more details on how regular expressions work.

                    I don't understand your statement "I tried to print but it only print address". What does this mean? The code should work to create the output that I gave.

                    re.compile makes a pattern that you can match text against. In this case it looks for the following:
                    '   "' is the leading spaces and the opening double quote at the start of the line

                    ' *' is zero or more spaces

                    '(.*)' is a string of any characters (.) of any length (*). The brackets say that this is one of the groups you want to find (so m.group(1) will return the string that's in there)

                    ' \(' finds the string " (". You need the escape character ('\') because ( is one of the special characters (see above)

                    '(.*), ' find the next string of characters (for group(2)) followed by a comma (,) and a space ( ).

                    etc. You probably get the idea by now.

                    Then m is a match object from matching the pattern (p) to line (which is a line from input.txt).

                    Then m.group(i) refers to the groups that you said should be selected by putting the brackets () around them.

                    Note that regular expressions are "greedy" in the sense that each group grabs the biggest string that still lets the pattern fit (starting from the left). Thus, for the output3 lines of input.txt that match, group(2) is a string like "BC_1, Q(1), output3, X", which I assume is not what you want.
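The greedy behaviour can be seen directly in the sketch below (Python 3 syntax). The first pattern is the one from post #8 with the leading spaces left out, since those depend on the file. The second pattern, named strict here, is not from the thread: it is one hypothetical alternative that uses [^,]* so that a field can never swallow a comma.

```python
import re

# Sample line copied from the first post.
line = '" 7 (BC_1, Q(1), output3, X, 16, 1, Z)," &'

# Pattern from post #8, minus the leading spaces.
greedy = re.compile(r'" *(.*) \((.*), (.*), (.*), (.*)\)," &.*')
greedy_groups = greedy.match(line).groups()
print(greedy_groups)  # group 2 swallows several comma-separated fields

# Hypothetical alternative: each field is "anything except a comma",
# so the first five fields come out one per group.
strict = re.compile(r'" *(\d+) \(([^,]*), ([^,]*), ([^,]*), ([^,)]*)')
strict_groups = strict.match(line).groups()
print(strict_groups)
```

The strict pattern only captures the first five fields; the optional ccell and disval fields on output3 lines would need extra optional groups or a split on commas.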


                    • maximus tee
                      New Member
                      • Dec 2010
                      • 30

                      #11
                      Thanks for your guidance. RE is one of the hardest topics to understand in Python.

                      Sorry for the confusion. I tried to understand your code by adding print statements, e.g.:
                      Code:
                      p=re.compile('   " *(.*) \((.*), (.*), (.*), (.*)\)," &.*') 
                      print p
                      which printed:
                      <_sre.SRE_Match object at 0x02B5C800>

                      and
                      Code:
                          m=p.match(line)
                          print m
                      which printed:
                      None
                      <_sre.SRE_Match object at 0x02B5C620>
                      17 BC_1 CLK input X
                      None
                      None
                      <_sre.SRE_Match object at 0x02B5C620>
                      15 BC_1 D(1) input X
                      <_sre.SRE_Match object at 0x02B5CBC0>
                      14 BC_1 D(2) input X
                      <_sre.SRE_Match object at 0x02B5C620>
                      13 BC_1 D(3) input X
                      <_sre.SRE_Match object at 0x02B5CBC0>
                      12 BC_1 D(4) input X
                      <_sre.SRE_Match object at 0x02B5C620>
                      11 BC_1 D(5) input X
                      <_sre.SRE_Match object at 0x02B5CBC0>
                      10 BC_1 D(6) input X
                      <_sre.SRE_Match object at 0x02B5C620>
                      9 BC_1 D(7) input X
                      <_sre.SRE_Match object at 0x02B5CBC0>
                      8 BC_1 D(8) input X
                      <_sre.SRE_Match object at 0x02B5C620>
                      7 BC_1, Q(1), output3, X 16 1 Z
                      <_sre.SRE_Match object at 0x02B5CBC0>
                      6 BC_1, Q(2), output3, X 16 1 Z
                      <_sre.SRE_Match object at 0x02B5C620>
                      5 BC_1, Q(3), output3, X 16 1 Z
                      <_sre.SRE_Match object at 0x02B5CBC0>
                      4 BC_1, Q(4), output3, X 16 1 Z
                      <_sre.SRE_Match object at 0x02B5C620>
                      3 BC_1, Q(5), output3, X 16 1 Z
                      <_sre.SRE_Match object at 0x02B5CBC0>
                      2 BC_1, Q(6), output3, X 16 1 Z
                      <_sre.SRE_Match object at 0x02B5C620>
                      1 BC_1, Q(7), output3, X 16 1 Z
                      None


                      • Glenton
                        Recognized Expert Contributor
                        • Nov 2008
                        • 391

                        #12
                        Yes, re objects don't print well, I'm afraid.

                        You need to use their attributes or methods. It can be a bit frustrating to debug and to understand. Try reading the docs link from my previous post. Good luck!
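As a small illustration of inspecting these objects through their attributes and methods rather than printing them directly, here is a sketch in Python 3 syntax. The short pattern is made up for the demonstration and is not the one from the earlier post.

```python
import re

# A short, made-up pattern: a number, a space, "(" and one word.
pattern = re.compile(r'(\d+) \((\w+)')
match = pattern.search('"17 (BC_1, CLK, input, X)," &')

print(pattern.pattern)  # the regex source text of the compiled pattern
print(match.group(0))   # the whole matched text
print(match.group(1))   # the first parenthesised group
print(match.groups())   # all groups as a tuple

# When nothing matches, the result is None rather than a match object.
no_match = pattern.search('no digits here')
print(no_match)
```

The None case is what produces the "None" lines in the printed output above: those are input lines the pattern did not match.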


                        • Glenton
                          Recognized Expert Contributor
                          • Nov 2008
                          • 391

                          #13
                          Alternatively, you can do it without re:
                          Code:
                          
                          lines=open("input.txt","r")
                          
                          
                          for line in lines:
                              l1=line.replace(" ","").replace('"','').split(",")  #Remove the spaces from the line and separate on ,
                              if len(l1)<2: continue   #to avoid lines that don't fit the general pattern
                              l2=l1[0].split("(")
                              l3=[l1[-2].replace(")","")]
                              l4=l2+l1[1:-2]+l3
                              print l4
                          gives this:
                          Code:
                          >>> 
                          ['17', 'BC_1', 'CLK', 'input', 'X']
                          ['16', 'BC_1', 'OC_NEG', 'input', 'X']
                          ['16', 'BC_1', '*', 'control', '1']
                          ['15', 'BC_1', 'D(1)', 'input', 'X']
                          ['14', 'BC_1', 'D(2)', 'input', 'X']
                          ['13', 'BC_1', 'D(3)', 'input', 'X']
                          ['12', 'BC_1', 'D(4)', 'input', 'X']
                          ['11', 'BC_1', 'D(5)', 'input', 'X']
                          ['10', 'BC_1', 'D(6)', 'input', 'X']
                          ['9', 'BC_1', 'D(7)', 'input', 'X']
                          ['8', 'BC_1', 'D(8)', 'input', 'X']
                          ['7', 'BC_1', 'Q(1)', 'output3', 'X', '16', '1', 'Z']
                          ['6', 'BC_1', 'Q(2)', 'output3', 'X', '16', '1', 'Z']
                          ['5', 'BC_1', 'Q(3)', 'output3', 'X', '16', '1', 'Z']
                          ['4', 'BC_1', 'Q(4)', 'output3', 'X', '16', '1', 'Z']
                          ['3', 'BC_1', 'Q(5)', 'output3', 'X', '16', '1', 'Z']
                          ['2', 'BC_1', 'Q(6)', 'output3', 'X', '16', '1', 'Z']
                          ['1', 'BC_1', 'Q(7)', 'output3', 'X', '16', '1', 'Z']
                          ['0', 'BC_1', 'Q(8)', 'output3', 'X', '16', '1']
                          Obviously, you can use the contents of the list as you wish.
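For example, to get from those lists to the output format requested in the first post, something like the sketch below might do (Python 3 syntax). The function name format_cell is made up for illustration. It assumes, going by the requested output, that the parentheses in port names are dropped (D(8) becomes D8) and that at most seven fields are kept, so the trailing rslt value is discarded. It also assumes the trailing comments contain no commas.

```python
def format_cell(line):
    """Return the wanted columns of one input line, or None if it has none."""
    parts = line.replace(' ', '').replace('"', '').split(',')
    if len(parts) < 2:
        return None  # header or other line without a field list
    fields = parts[0].split('(') + parts[1:-2] + [parts[-2].replace(')', '')]
    # Drop the parentheses inside port names such as D(8) -> D8.
    fields = [f.replace('(', '').replace(')', '') for f in fields]
    # Keep at most seven fields, which drops the rslt value when present.
    return ' '.join(fields[:7])


# Sample lines copied from the first post.
samples = [
    '-- num cell port function safe [ccell disval rslt]',
    '"17 (BC_1, CLK, input, X)," &',
    '" 8 (BC_1, D(8), input, X)," & -- cell 16 @ 1 -> Hi-Z',
    '" 7 (BC_1, Q(1), output3, X, 16, 1, Z)," &',
    '" 0 (BC_1, Q(8), output3, X, 16, 1, Z)";',
]

for sample in samples:
    result = format_cell(sample)
    if result is not None:
        print(result)
```

The header line is skipped here because it contains no commas; it would need separate handling if it should appear in the output.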
