On text processing

**bearophileHUGS@lycos.com** · Mar 23 '07, 11:05 PM

Re: On text processing

Daniel Nogradi:

Any elegant solution for this?

This is my first try:

ddata = {}

inside_matrix = False
for row in file("data.txt"):
    if row.strip():
        fields = row.split()
        if len(fields) == 2:
            inside_matrix = False
            ddata[fields[0]] = [fields[1]]
            lastkey = fields[0]
        else:
            if inside_matrix:
                ddata[lastkey][1].append(fields)
            else:
                ddata[lastkey].append([fields])
                inside_matrix = True

# This gives some output for testing only:
for k in sorted(ddata):
    print k, ddata[k]

Input file data.txt:

key1 value1
key2 value2
key3 value3

key4 value4
spec11 spec12 spec13 spec14
spec21 spec22 spec23 spec24
spec31 spec32 spec33 spec34

key5 value5
key6 value6

key7 value7
more11 more12 more13
more21 more22 more23

key8 value8

The output:

key1 ['value1']
key2 ['value2']
key3 ['value3']
key4 ['value4', [['spec11', 'spec12', 'spec13', 'spec14'], ['spec21',
'spec22', 'spec23', 'spec24'], ['spec31', 'spec32', 'spec33',
'spec34']]]
key5 ['value5']
key6 ['value6']
key7 ['value7', [['more11', 'more12', 'more13'], ['more21', 'more22',
'more23']]]
key8 ['value8']

If there are many simple keys, you can avoid creating a single-element
list for each of them, but then you have to tell the two cases apart
on the basis of the key (whereas now the presence of a second element
is enough to tell them apart). You can also use two different dicts to
keep the two kinds of data.
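
A minimal sketch of the first alternative, written for Python 3 rather than the Python 2 used above. The function name `parse` and the choice of a `(value, matrix)` tuple for matrix keys are assumptions made for illustration, and the input is assumed to be well formed (every matrix row follows some key/value line):

```python
def parse(lines):
    """Map each key to a bare string, or to (value, matrix) when rows follow."""
    data = {}
    lastkey = None
    for row in lines:
        fields = row.split()
        if not fields:
            continue                                   # blank lines only separate blocks
        if len(fields) == 2:
            lastkey = fields[0]
            data[lastkey] = fields[1]                  # simple case: no wrapper list
        elif isinstance(data[lastkey], tuple):
            data[lastkey][1].append(fields)            # later matrix rows
        else:
            data[lastkey] = (data[lastkey], [fields])  # first matrix row

    return data


sample = """\
key1 value1

key4 value4
spec11 spec12 spec13
spec21 spec22 spec23

key5 value5
""".splitlines()

result = parse(sample)
for key in sorted(result):
    entry = result[key]
    if isinstance(entry, tuple):   # the extra check the simple layout forces on the reader
        value, matrix = entry
        print(key, value, matrix)
    else:
        print(key, entry)
```

The `isinstance` test when reading the data back is the cost mentioned above: without the wrapper list, every consumer has to check which kind of entry it got.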

Bye,
bearophile

**Daniel Nogradi** · Mar 23 '07, 11:25 PM

Re: On text processing


Thanks very much, it's indeed quite simple. I was lost in the
itertools documentation :)
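
For comparison, here is a rough sketch of what an itertools-based version might look like, using `itertools.groupby` to split the lines into runs of key/value pairs and runs of everything else. It is written for Python 3, the name `parse_grouped` is made up for this example, and it assumes well-formed input where every matrix follows a key/value line:

```python
from itertools import groupby


def parse_grouped(lines):
    keyvalues, matrices = {}, {}
    lastkey = None
    split_lines = (ln.split() for ln in lines)
    # Consecutive lines are grouped by whether they have exactly two fields.
    for is_pair, group in groupby(split_lines, key=lambda f: len(f) == 2):
        if is_pair:
            for key, value in group:
                keyvalues[key] = value
                lastkey = key
        else:
            rows = [f for f in group if f]   # drop the blank separator lines
            if rows:
                matrices[lastkey] = rows
    return keyvalues, matrices


sample = [
    "key1 value1",
    "key2 value2",
    "",
    "key4 value4",
    "spec11 spec12 spec13",
    "spec21 spec22 spec23",
    "",
    "key5 value5",
]
keyvalues, matrices = parse_grouped(sample)
print(keyvalues)
print(matrices)
```

It works, but it is not obviously shorter or clearer than the plain loop with a flag.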

**Paddy** · Mar 24 '07, 12:45 AM

Re: On text processing

On Mar 23, 10:30 pm, "Daniel Nogradi" <nogr...@gmail.com> wrote:

> Hi list,
>
> I'm in the process of rewriting a bash/awk/sed script -- that grew too
> big -- in python. I can rewrite it in a simple line-by-line way but
> that results in ugly python code and I'm sure there is a simple
> pythonic way.
>
> The bash script processed text files of the form:
>
> ###############################
> key1 value1
> key2 value2
> key3 value3
>
> key4 value4
> spec11 spec12 spec13 spec14
> spec21 spec22 spec23 spec24
> spec31 spec32 spec33 spec34
>
> key5 value5
> key6 value6
>
> key7 value7
> more11 more12 more13
> more21 more22 more23
>
> key8 value8
> ###################################
>
> I guess you get the point. If a line has two entries it is a key/value
> pair which should end up in a dictionary. If a key/value pair is
> followed by consecutive lines with more than two entries, it is a
> matrix that should end up in a list of lists (matrix) that can be
> identified by the key preceding it. The empty line after the last
> line of a matrix signifies that the matrix is finished and we are back
> to a key/value situation. Note that a matrix is always preceded by a
> key/value pair so that it can really be identified by the key.
>
> Any elegant solution for this?

My solution expects correctly formatted input and parses it into
separate key/value and matrix holding dicts:

from StringIO import StringIO

fileText = '''\
key1 value1
key2 value2
key3 value3

key4 value4
spec11 spec12 spec13 spec14
spec21 spec22 spec23 spec24
spec31 spec32 spec33 spec34

key5 value5
key6 value6

key7 value7
more11 more12 more13
more21 more22 more23

key8 value8
'''
infile = StringIO(fileText)

keyvalues = {}
matrices = {}
for line in infile:
    fields = line.strip().split()
    if len(fields) == 2:
        keyvalues[fields[0]] = fields[1]
        lastkey = fields[0]
    elif fields:
        matrices.setdefault(lastkey, []).append(fields)

==============
Here is the sample output:

>>> from pprint import pprint as pp
>>> pp(keyvalues)
{'key1': 'value1',
 'key2': 'value2',
 'key3': 'value3',
 'key4': 'value4',
 'key5': 'value5',
 'key6': 'value6',
 'key7': 'value7',
 'key8': 'value8'}
>>> pp(matrices)
{'key4': [['spec11', 'spec12', 'spec13', 'spec14'],
          ['spec21', 'spec22', 'spec23', 'spec24'],
          ['spec31', 'spec32', 'spec33', 'spec34']],
 'key7': [['more11', 'more12', 'more13'], ['more21', 'more22', 'more23']]}
>>>
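
For readers on Python 3, the same two-dict approach might be sketched as below. The function name `parse_config` and the `ValueError` for a matrix row that appears before any key are additions for illustration and are not part of the code above, which simply expects correctly formatted input:

```python
from io import StringIO
from pprint import pprint


def parse_config(infile):
    """Return (keyvalues, matrices) parsed from an iterable of lines."""
    keyvalues = {}
    matrices = {}
    lastkey = None
    for line in infile:
        fields = line.split()
        if len(fields) == 2:
            lastkey = fields[0]
            keyvalues[lastkey] = fields[1]
        elif fields:
            # Any other non-blank line is treated as a matrix row.
            if lastkey is None:
                raise ValueError("matrix row before any key: %r" % line)
            matrices.setdefault(lastkey, []).append(fields)
    return keyvalues, matrices


text = """\
key1 value1

key4 value4
spec11 spec12 spec13 spec14
spec21 spec22 spec23 spec24

key8 value8
"""
keyvalues, matrices = parse_config(StringIO(text))
pprint(keyvalues)
pprint(matrices)
```

Because the function only iterates over its argument, an open file object can be passed in place of the `StringIO` wrapper.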

- Paddy.

**Paul McGuire** · Mar 24 '07, 02:35 AM

Re: On text processing

On Mar 23, 5:30 pm, "Daniel Nogradi" <nogr...@gmail.com> wrote:

> Hi list,
>
> I'm in the process of rewriting a bash/awk/sed script -- that grew too
> big -- in python. I can rewrite it in a simple line-by-line way but
> that results in ugly python code and I'm sure there is a simple
> pythonic way.
>
> The bash script processed text files of the form...
>
> Any elegant solution for this?

Is a parser overkill? Here's how you might use pyparsing for this
problem.

I just wanted to show that pyparsing's returned results can be
structured as more than just lists of tokens. Using pyparsing's Dict
class (or the dictOf helper that simplifies using Dict), you can
return results that can be accessed like a nested list, like a dict,
or like an instance with named attributes (see the last line of the
example).

You can adjust the syntax definition of keys and values to fit your
actual data, for instance, if the matrices are actually integers, then
define the matrixRow as:

matrixRow = Group( OneOrMore( Word(nums) ) ) + eol

-- Paul

from pyparsing import ParserElement, LineEnd, Word, alphas, alphanums, \
    Group, ZeroOrMore, OneOrMore, Optional, dictOf

data = """key1 value1
key2 value2
key3 value3

key4 value4
spec11 spec12 spec13 spec14
spec21 spec22 spec23 spec24
spec31 spec32 spec33 spec34

key5 value5
key6 value6

key7 value7
more11 more12 more13
more21 more22 more23

key8 value8
"""

# retain significant newlines (pyparsing reads over whitespace by default)
ParserElement.setDefaultWhitespaceChars(" \t")

eol = LineEnd().suppress()
elem = Word(alphas, alphanums)
key = elem
matrixRow = Group( elem + elem + OneOrMore(elem) ) + eol
matrix = Group( OneOrMore( matrixRow ) ) + eol
value = elem + eol + Optional( matrix ) + ZeroOrMore(eol)
parser = dictOf(key, value)

# parse the data
results = parser.parseString(data)

# access the results
# - like a dict
# - like a list
# - like an instance with keys for attributes
print results.keys()
print

for k in sorted(results.keys()):
    print k,
    if isinstance( results[k], basestring ):
        print results[k]
    else:
        print results[k][0]
        for row in results[k][1]:
            print " "," ".join(row)
print

print results.key3

Prints out:
['key8', 'key3', 'key2', 'key1', 'key7', 'key6', 'key5', 'key4']

key1 value1
key2 value2
key3 value3
key4 value4
  spec11 spec12 spec13 spec14
  spec21 spec22 spec23 spec24
  spec31 spec32 spec33 spec34
key5 value5
key6 value6
key7 value7
  more11 more12 more13
  more21 more22 more23
key8 value8

value3

**Daniel Nogradi** · Mar 24 '07, 07:45 AM

Re: On text processing


Paddy, thanks, this looks even better.
Paul, pyparsing looks like overkill; even the config parser module
is too complex for me for such a simple task. The text files are
actually input files to a program and will never be longer than
20-30 lines, so Paddy's solution is perfectly fine. In any case it's
good to know that there exists a module called pyparsing :)
