On text processing

  • Daniel Nogradi

    On text processing

    Hi list,

    I'm in the process of rewriting a bash/awk/sed script -- that grew too
    big -- in Python. I can rewrite it in a simple line-by-line way but
    that results in ugly Python code and I'm sure there is a simple
    pythonic way.

    The bash script processed text files of the form:

    ##################################
    key1 value1
    key2 value2
    key3 value3

    key4 value4
    spec11 spec12 spec13 spec14
    spec21 spec22 spec23 spec24
    spec31 spec32 spec33 spec34

    key5 value5
    key6 value6

    key7 value7
    more11 more12 more13
    more21 more22 more23

    key8 value8
    ##################################

    I guess you get the point. If a line has two entries it is a key/value
    pair which should end up in a dictionary. If a key/value pair is
    followed by consecutive lines with more than two entries, those lines
    form a matrix that should end up in a list of lists that can be
    identified by the key preceding it. The empty line after the last
    line of a matrix signifies that the matrix is finished and we are back
    to a key/value situation. Note that a matrix is always preceded by a
    key/value pair so that it can really be identified by the key.

    Any elegant solution for this?
  • bearophileHUGS@lycos.com

    #2
    Re: On text processing

    Daniel Nogradi:
    > Any elegant solution for this?

    This is my first try:

    ddata = {}

    inside_matrix = False
    for row in file("data.txt"):
        if row.strip():  # skip blank lines
            fields = row.split()
            if len(fields) == 2:
                # key/value pair: store the value in a one-element list
                inside_matrix = False
                ddata[fields[0]] = [fields[1]]
                lastkey = fields[0]
            else:
                if inside_matrix:
                    # later matrix rows extend the existing matrix
                    ddata[lastkey][1].append(fields)
                else:
                    # first matrix row: start the matrix as a second element
                    ddata[lastkey].append([fields])
                    inside_matrix = True

    # This gives some output for testing only:
    for k in sorted(ddata):
        print k, ddata[k]


    Input file data.txt:

    key1 value1
    key2 value2
    key3 value3

    key4 value4
    spec11 spec12 spec13 spec14
    spec21 spec22 spec23 spec24
    spec31 spec32 spec33 spec34

    key5 value5
    key6 value6

    key7 value7
    more11 more12 more13
    more21 more22 more23

    key8 value8


    The output:

    key1 ['value1']
    key2 ['value2']
    key3 ['value3']
    key4 ['value4', [['spec11', 'spec12', 'spec13', 'spec14'], ['spec21',
    'spec22', 'spec23', 'spec24'], ['spec31', 'spec32', 'spec33',
    'spec34']]]
    key5 ['value5']
    key6 ['value6']
    key7 ['value7', [['more11', 'more12', 'more13'], ['more21', 'more22',
    'more23']]]
    key8 ['value8']


    If there are many simple keys, you can avoid creating a single-element
    list for each of them, but then you have to tell the two cases apart
    on the basis of the key (as it stands, the presence of a second
    element is what distinguishes them). You can also use two different
    dicts to hold the two kinds of data.

    Bye,
    bearophile
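
The variant without single-element lists could look roughly like the sketch below. It is written for Python 3 and is not bearophile's code: the function name `parse` is hypothetical, the two cases are told apart by type (a plain string versus a `(value, matrix)` tuple) rather than by key, and the input is assumed to be well formed, so that every matrix row follows a key/value line.

```python
def parse(lines):
    """Map each key to a plain string, or to a (value, matrix) tuple.

    Assumes well-formed input: every matrix row follows a key/value line.
    """
    data = {}
    lastkey = None
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # blank lines only separate blocks
        if len(fields) == 2:
            # simple key/value pair: store the bare string
            data[fields[0]] = fields[1]
            lastkey = fields[0]
        else:
            entry = data[lastkey]
            if isinstance(entry, str):
                # first matrix row: promote the string to (value, matrix)
                entry = (entry, [])
                data[lastkey] = entry
            entry[1].append(fields)
    return data


sample = ["key1 value1", "", "key4 value4", "a b c", "d e f", "", "key5 value5"]
result = parse(sample)
```

A caller then checks `isinstance(result[key], tuple)` to find out whether a key carries a matrix.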


    • Daniel Nogradi

      #3
      Re: On text processing

      Thanks very much, it's indeed quite simple. I was lost in the
      itertools documentation :)
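
For comparison, here is a rough idea of what an itertools-based version might look like, using `itertools.groupby` to split the input into blank-line-separated blocks. This is only a sketch in Python 3 under the assumption of well-formed input; the name `parse_blocks` is made up for illustration, and it does not buy much over the plain loop.

```python
from itertools import groupby


def parse_blocks(lines):
    """Return (keyvalues, matrices) parsed from blank-line-separated blocks."""
    keyvalues, matrices = {}, {}
    # groupby yields runs of consecutive lines that share the same key;
    # the key here is simply "is this line non-blank?"
    for nonblank, block in groupby(lines, key=lambda line: bool(line.strip())):
        if not nonblank:
            continue  # a run of blank lines: nothing to store
        lastkey = None
        for line in block:
            fields = line.split()
            if len(fields) == 2:
                keyvalues[fields[0]] = fields[1]
                lastkey = fields[0]
            else:
                # assumed to be a matrix row belonging to the preceding key
                matrices.setdefault(lastkey, []).append(fields)
    return keyvalues, matrices


text = """key1 value1
key2 value2

key4 value4
s11 s12 s13
s21 s22 s23

key8 value8
"""
kv, m = parse_blocks(text.splitlines())
```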


      • Paddy

        #4
        Re: On text processing

        On Mar 23, 10:30 pm, "Daniel Nogradi" <nogr...@gmail.com> wrote:
        > Any elegant solution for this?

        My solution expects correctly formatted input and parses it into
        two separate dicts, one for the key/value pairs and one for the
        matrices:


        from StringIO import StringIO

        fileText = '''\
        key1 value1
        key2 value2
        key3 value3

        key4 value4
        spec11 spec12 spec13 spec14
        spec21 spec22 spec23 spec24
        spec31 spec32 spec33 spec34

        key5 value5
        key6 value6

        key7 value7
        more11 more12 more13
        more21 more22 more23

        key8 value8
        '''
        infile = StringIO(fileText)

        keyvalues = {}
        matrices = {}
        for line in infile:
            fields = line.strip().split()
            if len(fields) == 2:
                # two fields: a key/value pair
                keyvalues[fields[0]] = fields[1]
                lastkey = fields[0]
            elif fields:
                # any other non-blank line is a matrix row for the last key
                matrices.setdefault(lastkey, []).append(fields)

        ==============
        Here is the sample output:
        >>> from pprint import pprint as pp
        >>> pp(keyvalues)
        {'key1': 'value1',
         'key2': 'value2',
         'key3': 'value3',
         'key4': 'value4',
         'key5': 'value5',
         'key6': 'value6',
         'key7': 'value7',
         'key8': 'value8'}
        >>> pp(matrices)
        {'key4': [['spec11', 'spec12', 'spec13', 'spec14'],
                  ['spec21', 'spec22', 'spec23', 'spec24'],
                  ['spec31', 'spec32', 'spec33', 'spec34']],
         'key7': [['more11', 'more12', 'more13'], ['more21', 'more22', 'more23']]}
        >>>
        - Paddy.
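
Since the solution above expects correctly formatted input, a variant that rejects malformed files might be useful. The following is a sketch in Python 3, not Paddy's code: the function name `parse` and the error message are assumptions, and it treats a blank line as ending the current block, so a matrix row that does not directly follow a key/value block raises `ValueError`.

```python
import io


def parse(infile):
    """Parse key/value lines and matrices, raising ValueError on bad input."""
    keyvalues = {}
    matrices = {}
    lastkey = None
    for lineno, line in enumerate(infile, start=1):
        fields = line.split()
        if not fields:
            lastkey = None  # a blank line ends the current block
            continue
        if len(fields) == 2:
            keyvalues[fields[0]] = fields[1]
            lastkey = fields[0]
        elif lastkey is None or len(fields) < 2:
            # a matrix row with no preceding key, or a single stray field
            raise ValueError("line %d: unexpected %r" % (lineno, line.rstrip()))
        else:
            matrices.setdefault(lastkey, []).append(fields)
    return keyvalues, matrices


sample = "key1 value1\n\nkey4 value4\na b c\nd e f\n\nkey5 value5\n"
kv, m = parse(io.StringIO(sample))
```

Note that a matrix with exactly two columns cannot be distinguished from a run of key/value pairs in this format, so neither this sketch nor the original handles that case.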


        • Paul McGuire

          #5
          Re: On text processing

          On Mar 23, 5:30 pm, "Daniel Nogradi" <nogr...@gmail.com> wrote:
          > Hi list,
          >
          > I'm in the process of rewriting a bash/awk/sed script -- that grew too
          > big -- in Python. I can rewrite it in a simple line-by-line way but
          > that results in ugly Python code and I'm sure there is a simple
          > pythonic way.
          >
          > The bash script processed text files of the form...
          >
          > Any elegant solution for this?

          Is a parser overkill? Here's how you might use pyparsing for this
          problem.

          I just wanted to show that pyparsing's returned results can be
          structured as more than just lists of tokens. Using pyparsing's Dict
          class (or the dictOf helper that simplifies using Dict), you can
          return results that can be accessed like a nested list, like a dict,
          or like an instance with named attributes (see the last line of the
          example).

          You can adjust the syntax definition of keys and values to fit your
          actual data. For instance, if the matrix entries are actually
          integers, then define the matrixRow as:

          matrixRow = Group( OneOrMore( Word(nums) ) ) + eol


          -- Paul


          from pyparsing import ParserElement, LineEnd, Word, alphas, alphanums, \
              Group, ZeroOrMore, OneOrMore, Optional, dictOf

          data = """key1 value1
          key2 value2
          key3 value3


          key4 value4
          spec11 spec12 spec13 spec14
          spec21 spec22 spec23 spec24
          spec31 spec32 spec33 spec34


          key5 value5
          key6 value6


          key7 value7
          more11 more12 more13
          more21 more22 more23


          key8 value8
          """

          # retain significant newlines (pyparsing reads over whitespace by default)
          ParserElement.setDefaultWhitespaceChars(" \t")

          eol = LineEnd().suppress()
          elem = Word(alphas, alphanums)
          key = elem
          matrixRow = Group( elem + elem + OneOrMore(elem) ) + eol
          matrix = Group( OneOrMore( matrixRow ) ) + eol
          value = elem + eol + Optional( matrix ) + ZeroOrMore(eol)
          parser = dictOf(key, value)

          # parse the data
          results = parser.parseString(data)

          # access the results
          # - like a dict
          # - like a list
          # - like an instance with keys for attributes
          print results.keys()
          print

          for k in sorted(results.keys()):
              print k,
              if isinstance( results[k], basestring ):
                  print results[k]
              else:
                  print results[k][0]
                  for row in results[k][1]:
                      print " "," ".join(row)
          print

          print results.key3


          Prints out:
          ['key8', 'key3', 'key2', 'key1', 'key7', 'key6', 'key5', 'key4']

          key1 value1
          key2 value2
          key3 value3
          key4 value4
            spec11 spec12 spec13 spec14
            spec21 spec22 spec23 spec24
            spec31 spec32 spec33 spec34
          key5 value5
          key6 value6
          key7 value7
            more11 more12 more13
            more21 more22 more23
          key8 value8

          value3




          • Daniel Nogradi

            #6
            Re: On text processing

            Paddy, thanks, this looks even better.
            Paul, pyparsing looks like overkill here; even the config parser
            module is too complex for such a simple task. The text files are
            actually input files to a program and will never be longer than
            20-30 lines, so Paddy's solution is perfectly fine. In any case,
            it's good to know that a module called pyparsing exists :)
