how to use python to extract certain text in the file?

**Mariostg** · Jan 5 '12, 01:04 PM

I believe it is because there are only 5 elements in "17 BC_1 CLK input X" and you are trying to print 7 (a[0] to a[6]).
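
A minimal sketch of that failure, assuming the line has already been reduced to the five-word string quoted above. It is written for Python 3, whereas the code in this thread is Python 2, and the variable names are only illustrative. Indexing past the end of a list raises `IndexError`, while slicing simply returns whatever elements exist:

```python
import re

# Sample string quoted in the post above; it holds only five words.
line = "17 BC_1 CLK input X"
fields = re.findall(r"\w+", line)

try:
    fields[6]  # there is no seventh element, so this raises
    error_seen = False
except IndexError:
    error_seen = True

# Slicing never raises: it returns at most the elements that exist.
first_seven = fields[:7]
print(error_seen, " ".join(first_seven))
```

Joining a slice is one way to print "up to seven" fields without knowing in advance how many a line contains.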

**maximus tee** · Jan 6 '12, 01:07 AM

thanks will relook into it.

**Glenton** · Jan 6 '12, 02:25 AM

The easy way to do this more safely would be something like

Code:

import re
lines=open("input.txt",'r').readlines()
 
for line in lines:
    a=re.findall(r'\w+',line)
    print re.findall(r'\w+',line)
    #print a[0],a[1],a[2],a[3],a[4],a[5],a[6]
    for b in a:
        print b,
    print

**maximus tee** · Jan 6 '12, 02:37 AM

hi, thanks for your f/back.
would like to check whether the print on the last line is a typo? and also, in the for loop, there is a "print b," with a trailing comma?
i was thinking of doing:

Code:

for line in lines:
    a=line.split('-')[0]
    print a
    for b in a: 
        print b,
    print

**Glenton** · Jan 6 '12, 02:54 AM

The "print b," is to print b without going to a new line. It's to do the equivalent of printing a[0], a[1], a[2],... for as many as are needed.

The "print" at the end is to make a new line.
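
For readers on Python 3, where `print` is a function and the trailing-comma form no longer works, a rough equivalent might look like the sketch below. The list `words` is a made-up sample, not taken from input.txt:

```python
# Python 3 sketch of the same loop; `words` is a hypothetical sample list.
words = ["17", "BC_1", "CLK", "input", "X"]

for word in words:
    print(word, end=" ")  # stay on the same line, like "print b," in Python 2
print()                   # finish the line, like the bare "print"

# Building the whole line first is often simpler than printing piece by piece.
joined = " ".join(words)
print(joined)
```

The `end=" "` version leaves a trailing space at the end of the line, which the `join` version avoids.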

I was assuming your regular expression was working, but perhaps it isn't. I can't imagine that your split expression would work either.

Perhaps you can explain the logic of what you're trying to achieve. There are many ways that you could get that output given that input. But what are the more general rules? Eg is the first line always in that format? Do you find that all lines are one of two formats? If you can provide more about what you're trying to achieve, then it will be easier to help.

**maximus tee** · Jan 6 '12, 03:07 AM

apologies for confusion.
general rules:
1) the first line inside input.txt (as attached in the first post):
-- num cell port function safe [ccell disval rslt]
i just need num cell port function safe ccell
however my script below couldnt get this so i skip the line. it will be great if you can show me.

2) i'm trying to convert the line inside the input.txt
"17 (BC_1, CLK, input, X)," &
into 17 BC_1 CLK input X
basically i'm only extracting column for num, cell, port, function, safe and ccell. the rest are not needed.

and " 7 (BC_1, Q(1), output3, X, 16, 1, Z)," &
into 7 BC_1 Q1 output3 X 16 1

so far my script is as below, and it produces the output.txt (as attached in the first post):

Code:

import re

fileIn = open("input.txt", "rb")
fileOut = open("output.txt", "w")

for strData in fileIn:
    strData = strData.split('-')[0] #this is to remove the first line

    if("input" in strData):
        a=re.split("\W+", strData)
        #print a
        #fileOut.write (' '.join(a[1:7]) )
        fileOut.write(a[1]+' '+a[2]+' '+a[3]+' '+a[4]+' '+a[5]+' '+a[6]+'\n')
  
    if("output" in strData):
        a=re.split("\W+", strData)
        #print a
        fileOut.write(a[1]+' '+a[2]+' '+a[3]+' '+a[4]+' '+a[5]+' '+a[6]+' '+a[7]+'\n')

**Glenton** · Jan 6 '12, 03:51 AM

Regular expressions are what you need. They take a bit of getting used to, but work brilliantly once you have the hang of it.

The lines in your input.txt file don't all have the same number of fields, so the code below works for the first ones, which have the format you posted originally, but not for the later ones:

Code:

import re

lines=open("input.txt","r")

p=re.compile('   " *(.*) \((.*), (.*), (.*), (.*)\)," &.*')


for line in lines:
    m=p.match(line)
    if not m: continue
    for i in range(1,6):
        print m.group(i),
    print

Gives

Code:

17 BC_1 CLK input X
15 BC_1 D(1) input X
14 BC_1 D(2) input X
13 BC_1 D(3) input X
12 BC_1 D(4) input X
11 BC_1 D(5) input X
10 BC_1 D(6) input X
9 BC_1 D(7) input X
8 BC_1 D(8) input X
7 BC_1, Q(1), output3, X 16 1 Z
6 BC_1, Q(2), output3, X 16 1 Z
5 BC_1, Q(3), output3, X 16 1 Z
4 BC_1, Q(4), output3, X 16 1 Z
3 BC_1, Q(5), output3, X 16 1 Z
2 BC_1, Q(6), output3, X 16 1 Z
1 BC_1, Q(7), output3, X 16 1 Z

**maximus tee** · Jan 6 '12, 03:58 AM

wow, only a few lines of code. i don't get regular expressions, they are hard.
i don't understand:
1) p=re.compile('   " *(.*) \((.*), (.*), (.*), (.*)\)," &.*')
2) m=p.match(line), what does match line mean?
3) m.group(i), what does it group?
i tried to print but it only prints an address.

thanks

**Glenton** · Jan 6 '12, 05:07 AM

You can look here to get more details on how regular expressions work.

I don't understand your statement "I tried to print but it only print address". What does this mean? The code should work to create the output that I gave.

re.compile is to make a pattern that you can match some text against. In this case it looks for the following:
'   "' matches the three spaces and the double quote at the start of the line.

' *' is some number of spaces (greater than or equal to 0).

'(.*)' is a string of any characters (.) and any length (*). The brackets say that this is one of the groups you want to find (so m.group(1) will return the string that's in there).

' \(' finds the string " (". You need the escape character ('\') because ( is one of the special characters (see above).

'(.*), ' find the next string of characters (for group(2)) followed by a comma (,) and a space ( ).

etc. You probably get the idea by now.

Then m is a match object from matching the pattern (p) to line (which is a line from input.txt).

Then m.group(i) refers to the groups that you said should be selected by putting the brackets () around them.

Note that regular expressions are "greedy" in the sense that they find the biggest string that fits the pattern (starting from the left). Thus for the last 7 lines of input.txt, group(2) is the string "BC_1, Q(1), output3, X", which I assume is not what you want.
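
The greedy behaviour can be seen in a small sketch. It is written for Python 3, and the sample line is an assumption reconstructed from the posts above, since input.txt itself is not shown here. The second pattern is only one possible fix: it swaps `.*` for character classes that cannot cross a comma and captures just the first five fields.

```python
import re

# Assumed sample line, reconstructed from the posts above.
line = '   " 7 (BC_1, Q(1), output3, X, 16, 1, Z)," &'

# The pattern from this thread: each (.*) grabs as much as it can.
greedy = re.compile(r'   " *(.*) \((.*), (.*), (.*), (.*)\)," &.*')
greedy_groups = greedy.match(line).groups()
print(greedy_groups)  # one group swallows several comma-separated fields

# Illustrative alternative: [^,]* cannot run past a comma, so every
# group holds exactly one field. Only the first five fields are captured.
strict = re.compile(r'\s*"\s*(\d+) \(([^,]*), ([^,]*), ([^,]*), ([^,)]*)')
strict_groups = strict.match(line).groups()
print(strict_groups)
```

Because the stricter pattern does not anchor on the closing `),"`, it also matches the shorter five-field lines such as the CLK one.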

**maximus tee** · Jan 6 '12, 05:16 AM

thanks for your guidance. RE is one of the hardest topics to understand in python.

sorry for confusion. i tried to understand your code by doing a print. for eg:

Code:

p=re.compile('   " *(.*) \((.*), (.*), (.*), (.*)\)," &.*') 
print p

which printed:
<_sre.SRE_Match object at 0x02B5C800>

and

Code:

    m=p.match(line)
    print m

which printed:
None
<_sre.SRE_Match object at 0x02B5C620>
17 BC_1 CLK input X
None
None
<_sre.SRE_Match object at 0x02B5C620>
15 BC_1 D(1) input X
<_sre.SRE_Match object at 0x02B5CBC0>
14 BC_1 D(2) input X
<_sre.SRE_Match object at 0x02B5C620>
13 BC_1 D(3) input X
<_sre.SRE_Match object at 0x02B5CBC0>
12 BC_1 D(4) input X
<_sre.SRE_Match object at 0x02B5C620>
11 BC_1 D(5) input X
<_sre.SRE_Match object at 0x02B5CBC0>
10 BC_1 D(6) input X
<_sre.SRE_Match object at 0x02B5C620>
9 BC_1 D(7) input X
<_sre.SRE_Match object at 0x02B5CBC0>
8 BC_1 D(8) input X
<_sre.SRE_Match object at 0x02B5C620>
7 BC_1, Q(1), output3, X 16 1 Z
<_sre.SRE_Match object at 0x02B5CBC0>
6 BC_1, Q(2), output3, X 16 1 Z
<_sre.SRE_Match object at 0x02B5C620>
5 BC_1, Q(3), output3, X 16 1 Z
<_sre.SRE_Match object at 0x02B5CBC0>
4 BC_1, Q(4), output3, X 16 1 Z
<_sre.SRE_Match object at 0x02B5C620>
3 BC_1, Q(5), output3, X 16 1 Z
<_sre.SRE_Match object at 0x02B5CBC0>
2 BC_1, Q(6), output3, X 16 1 Z
<_sre.SRE_Match object at 0x02B5C620>
1 BC_1, Q(7), output3, X 16 1 Z
None

**Glenton** · Jan 6 '12, 06:05 AM

Yes, re objects don't print well, I'm afraid.

You need to use their attributes or methods. It can be a bit frustrating to debug and to understand. Try reading the docs link from my previous post. Good luck!
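
As a rough illustration of "use their attributes or methods", here is a small Python 3 sketch. The pattern and the sample strings are simplified stand-ins chosen for the example, not the ones from input.txt:

```python
import re

# Simplified, hypothetical pattern: a number, then two words inside "(".
pattern = re.compile(r'(\d+) \((\w+), (\w+)')
match = pattern.match("17 (BC_1, CLK, input, X)")

print(pattern.pattern)  # the source text of the regular expression
print(match.group(0))   # the whole piece of text that matched
print(match.group(1))   # the first parenthesised group
print(match.groups())   # all groups together, as a tuple
print(match.span())     # start and end offsets of the match

# A line that does not fit the pattern yields None rather than a match
# object, so test the result before calling .group() on it.
no_match = pattern.match("-- num cell port")
print(no_match)
```

This is also why the earlier loop uses `if not m: continue` before touching `m.group(i)`.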

**Glenton** · Jan 6 '12, 08:19 AM

Alternatively you can do it without re

Code:

lines=open("input.txt","r")   #no "import re" is needed for this version


for line in lines:
    l1=line.replace(" ","").replace('"','').split(",")  #Remove the spaces from the line and separate on ,
    if len(l1)<2: continue   #to avoid lines that don't fit the general pattern
    l2=l1[0].split("(")
    l3=[l1[-2].replace(")","")]
    l4=l2+l1[1:-2]+l3
    print l4

gives this:

Code:

>>> 
['17', 'BC_1', 'CLK', 'input', 'X']
['16', 'BC_1', 'OC_NEG', 'input', 'X']
['16', 'BC_1', '*', 'control', '1']
['15', 'BC_1', 'D(1)', 'input', 'X']
['14', 'BC_1', 'D(2)', 'input', 'X']
['13', 'BC_1', 'D(3)', 'input', 'X']
['12', 'BC_1', 'D(4)', 'input', 'X']
['11', 'BC_1', 'D(5)', 'input', 'X']
['10', 'BC_1', 'D(6)', 'input', 'X']
['9', 'BC_1', 'D(7)', 'input', 'X']
['8', 'BC_1', 'D(8)', 'input', 'X']
['7', 'BC_1', 'Q(1)', 'output3', 'X', '16', '1', 'Z']
['6', 'BC_1', 'Q(2)', 'output3', 'X', '16', '1', 'Z']
['5', 'BC_1', 'Q(3)', 'output3', 'X', '16', '1', 'Z']
['4', 'BC_1', 'Q(4)', 'output3', 'X', '16', '1', 'Z']
['3', 'BC_1', 'Q(5)', 'output3', 'X', '16', '1', 'Z']
['2', 'BC_1', 'Q(6)', 'output3', 'X', '16', '1', 'Z']
['1', 'BC_1', 'Q(7)', 'output3', 'X', '16', '1', 'Z']
['0', 'BC_1', 'Q(8)', 'output3', 'X', '16', '1']

Obviously you can use the contents of the list as you wish.
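
For instance, one way to get from those lists to the space-separated lines asked for earlier in the thread might be the sketch below. The helper name `format_fields` and its `keep` parameter are made up for illustration, and keeping the first seven fields is an assumption based on the "7 BC_1 Q1 output3 X 16 1" example:

```python
def format_fields(fields, keep=7):
    """Join the first `keep` fields with spaces, dropping any parentheses."""
    cleaned = [f.replace("(", "").replace(")", "") for f in fields[:keep]]
    return " ".join(cleaned)

# Two of the lists printed above, used here as sample data.
rows = [
    ['17', 'BC_1', 'CLK', 'input', 'X'],
    ['7', 'BC_1', 'Q(1)', 'output3', 'X', '16', '1', 'Z'],
]

lines_out = [format_fields(row) for row in rows]
for text in lines_out:
    print(text)
```

Slicing with `fields[:keep]` copes with both the five-field and the eight-field lines, so no separate "input" and "output" branches are needed.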
