Parsing a file with iterators

  • Luis Zarrabeitia

    Parsing a file with iterators


    I need to parse a text file. The format is something like this:

    TYPE1 metadata
    data line 1
    data line 2
    ....
    data line N
    TYPE2 metadata
    data line 1
    ....
    TYPE3 metadata
    ....

    And so on. The type and metadata determine how to parse the following data
    lines. When the parser fails to parse one of the lines, the next parser is
    chosen (or if there is no 'TYPE metadata' line there, an exception is thrown).

    This doesn't work:

    ===
    for line in input:
        parser = parser_from_string(line)
        parser(input)
    ===

    because when the parser iterates over the input, it can't know that it has finished
    processing the section until it reads the next "TYPE" line (actually, until it
    reads the first line that it cannot parse, which, if everything went well, should
    be the 'TYPE' line), but once it has read that line, it is no longer available to the
    outer loop. I wouldn't like to leak the internals of the parsers to the outside.

    What could I do?
    (to the curious: the format is a dialect of the E00 used in GIS)

    --
    Luis Zarrabeitia
    Facultad de Matemática y Computación, UH





  • Eddie Corns

    #2
    Re: Parsing a file with iterators

    Luis Zarrabeitia <kyrie@uh.cu> writes:

    >I need to parse a file, text file. The format is something like that:
    >TYPE1 metadata
    >data line 1
    >data line 2
    >...
    >data line N
    >TYPE2 metadata
    >data line 1
    >...
    >TYPE3 metadata
    >...
    >And so on. The type and metadata determine how to parse the following data
    >lines. When the parser fails to parse one of the lines, the next parser is
    >chosen (or if there is no 'TYPE metadata' line there, an exception is thrown).
    >This doesn't work:
    >===
    >for line in input:
    >    parser = parser_from_string(line)
    >    parser(input)
    >===
    >because when the parser iterates over the input, it can't know that it finished
    >processing the section until it reads the next "TYPE" line (actually, until it
    >reads the first line that it cannot parse, which if everything went well, should
    >be the 'TYPE'), but once it reads it, it is no longer available to the outer
    >loop. I wouldn't like to leak the internals of the parsers to the outside.
    >What could I do?
    >(to the curious: the format is a dialect of the E00 used in GIS)
    >
    >--
    >Luis Zarrabeitia
    >Facultad de Matemática y Computación, UH
    >http://profesores.matcom.uh.cu/~kyrie



    One simple way is to have your "input" iterator support pushing values
    back: as soon as a parser reads a line it can't handle, it pushes that line
    back onto the input stream, so the outer loop sees it again.

    See http://code.activestate.com/recipes/502304/ for an example.
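
A minimal sketch of that idea in Python 3. This is not the code from the linked recipe; `PushbackIterator`, `push`, `parse_numbers` and `parse_all` are names made up here, and the integer-only parser merely stands in for a real section parser:

```python
class PushbackIterator:
    """Iterator wrapper that lets a consumer hand items back to the stream."""

    def __init__(self, iterable):
        self._it = iter(iterable)
        self._pushed = []  # stack of items handed back by consumers

    def __iter__(self):
        return self

    def __next__(self):
        # Pushed-back items are served before the underlying iterator.
        if self._pushed:
            return self._pushed.pop()
        return next(self._it)

    def push(self, item):
        self._pushed.append(item)


def parse_numbers(metadata, lines):
    """Hypothetical section parser: reads integer lines, pushes back the rest."""
    values = []
    for line in lines:
        try:
            values.append(int(line))
        except ValueError:
            lines.push(line)  # not ours: hand it back to the outer loop
            break
    return metadata, values


def parse_all(raw_lines):
    lines = PushbackIterator(raw_lines)
    sections = []
    for line in lines:
        if not line.startswith("TYPE"):
            raise ValueError("expected a TYPE line, got %r" % line)
        type_id, metadata = line.split(" ", 1)
        # A real program would pick the parser from type_id here.
        sections.append(parse_numbers(metadata, lines))
    return sections
```

The parser only needs to know that its input has a `push` method; it never has to recognise a TYPE line itself, so nothing about the section parsers leaks into the outer loop.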


    • Marc 'BlackJack' Rintsch

      #3
      Re: Parsing a file with iterators

      On Fri, 17 Oct 2008 11:42:05 -0400, Luis Zarrabeitia wrote:
      > I need to parse a file, text file. The format is something like that:
      >
      > TYPE1 metadata
      > data line 1
      > data line 2
      > ...
      > data line N
      > TYPE2 metadata
      > data line 1
      > ...
      > TYPE3 metadata
      > ...
      > […]
      > because when the parser iterates over the input, it can't know that it
      > finished processing the section until it reads the next "TYPE" line
      > (actually, until it reads the first line that it cannot parse, which if
      > everything went well, should be the 'TYPE'), but once it reads it, it is
      > no longer available to the outer loop. I wouldn't like to leak the
      > internals of the parsers to the outside.
      >
      > What could I do?
      > (to the curious: the format is a dialect of the E00 used in GIS)

      Group the lines before processing and feed each group to the right parser:

      import sys
      from itertools import groupby, imap
      from operator import itemgetter


      def parse_a(metadata, lines):
          print 'parser a', metadata
          for line in lines:
              print 'a', line


      def parse_b(metadata, lines):
          print 'parser b', metadata
          for line in lines:
              print 'b', line


      def parse_c(metadata, lines):
          print 'parser c', metadata
          for line in lines:
              print 'c', line


      def test_for_type(line):
          return line.startswith('TYPE')


      def parse(lines):
          def tag():
              type_line = None
              for line in lines:
                  if test_for_type(line):
                      type_line = line
                  else:
                      yield (type_line, line)

          type2parser = {'TYPE1': parse_a,
                         'TYPE2': parse_b,
                         'TYPE3': parse_c }

          for type_line, group in groupby(tag(), itemgetter(0)):
              type_id, metadata = type_line.split(' ', 1)
              type2parser[type_id](metadata, imap(itemgetter(1), group))


      def main():
          parse(sys.stdin)


      • Paul McGuire

        #4
        Re: Parsing a file with iterators

        On Oct 17, 10:42 am, Luis Zarrabeitia <ky...@uh.cu> wrote:
        > I need to parse a file, text file. The format is something like that:
        >
        > TYPE1 metadata
        > data line 1
        > data line 2
        > ...
        > data line N
        > TYPE2 metadata
        > data line 1
        > ...
        > TYPE3 metadata
        > ...
        >
        > And so on. The type and metadata determine how to parse the following data
        > lines. When the parser fails to parse one of the lines, the next parser is
        > chosen (or if there is no 'TYPE metadata' line there, an exception is thrown).
        >
        <snip>

        Pyparsing will take care of this for you, if you define a set of
        alternatives and then parse/search for them. Here is an annotated
        example. Note the ability to attach names to different fields of the
        parser, and then how those fields are accessed after parsing.

        """
        TYPE1 metadata
        data line 1
        data line 2
        ....
        data line N
        TYPE2 metadata
        data line 1
        ....
        TYPE3 metadata
        ....
        """

        from pyparsing import *

        # define basic element types to be used in data formats
        integer = Word(nums)
        ident = Word(alphas) | quotedString.setParseAction(removeQuotes)
        zipcode = Combine(Word(nums,exact=5) + Optional("-" +
                          Word(nums,exact=4)))
        stateAbbreviation = oneOf("""AA AE AK AL AP AR AS AZ CA CO CT DC DE
            FL FM GA GU HI IA ID IL IN KS KY LA MA MD ME MH MI MN MO MP MS
            MT NC ND NE NH NJ NM NV NY OH OK OR PA PR PW RI SC SD TN TX UT
            VA VI VT WA WI WV WY""".split())

        # define data format for each type
        DATA = Suppress("data")
        type1dataline = Group(DATA + OneOrMore(integer))
        type2dataline = Group(DATA + delimitedList(ident))
        type3dataline = DATA + countedArray(ident)

        # define complete expressions for each type - note different types
        # may have different metadata
        type1data = "TYPE1" + ident("name") + \
            OneOrMore(type1dataline)("data")
        type2data = "TYPE2" + ident("name") + zipcode("zip") + \
            OneOrMore(type2dataline)("data")
        type3data = "TYPE3" + ident("name") + stateAbbreviation("state") + \
            OneOrMore(type3dataline)("data")

        # expression containing all different type alternatives
        data = type1data | type2data | type3data

        # search a test input string and dump the matched tokens by name
        testInput = """
        TYPE1 Abercrombie
        data 400 26 42 66
        data 1 1 2 3 5 8 13 21
        data 1 4 9 16 25 36
        data 1 2 4 8 16 32 64
        TYPE2 Benjamin 78704
        data Larry, Curly, Moe
        data Hewey,Dewey ,Louie
        data Tom , Dick, Harry, Fred
        data Thelma,Louise
        TYPE3 Christopher WA
        data 3 "Raspberry Red" "Lemon Yellow" "Orange Orange"
        data 7 Grumpy Sneezy Happy Dopey Bashful Sleepy Doc
        """
        for tokens in data.searchString(testInput):
            print tokens.dump()
            print tokens.name
            if tokens.state: print tokens.state
            for d in tokens.data:
                print " ",d
            print

        Prints:

        ['TYPE1', 'Abercrombie', ['400', '26', '42', '66'], ['1', '1', '2',
        '3', '5', '8', '13', '21'], ['1', '4', '9', '16', '25', '36'], ['1',
        '2', '4', '8', '16', '32', '64']]
        - data: [['400', '26', '42', '66'], ['1', '1', '2', '3', '5', '8',
        '13', '21'], ['1', '4', '9', '16', '25', '36'], ['1', '2', '4', '8',
        '16', '32', '64']]
        - name: Abercrombie
        Abercrombie
        ['400', '26', '42', '66']
        ['1', '1', '2', '3', '5', '8', '13', '21']
        ['1', '4', '9', '16', '25', '36']
        ['1', '2', '4', '8', '16', '32', '64']

        ['TYPE2', 'Benjamin', '78704', ['Larry', 'Curly', 'Moe'], ['Hewey',
        'Dewey', 'Louie'], ['Tom', 'Dick', 'Harry', 'Fred'], ['Thelma',
        'Louise']]
        - data: [['Larry', 'Curly', 'Moe'], ['Hewey', 'Dewey', 'Louie'],
        ['Tom', 'Dick', 'Harry', 'Fred'], ['Thelma', 'Louise']]
        - name: Benjamin
        - zip: 78704
        Benjamin
        ['Larry', 'Curly', 'Moe']
        ['Hewey', 'Dewey', 'Louie']
        ['Tom', 'Dick', 'Harry', 'Fred']
        ['Thelma', 'Louise']

        ['TYPE3', 'Christopher', 'WA', ['Raspberry Red', 'Lemon Yellow',
        'Orange Orange'], ['Grumpy', 'Sneezy', 'Happy', 'Dopey', 'Bashful',
        'Sleepy', 'Doc']]
        - data: [['Raspberry Red', 'Lemon Yellow', 'Orange Orange'],
        ['Grumpy', 'Sneezy', 'Happy', 'Dopey', 'Bashful', 'Sleepy', 'Doc']]
        - name: Christopher
        - state: WA
        Christopher
        WA
        ['Raspberry Red', 'Lemon Yellow', 'Orange Orange']
        ['Grumpy', 'Sneezy', 'Happy', 'Dopey', 'Bashful', 'Sleepy', 'Doc']


        More info on pyparsing at http://pyparsing.wikispaces.com.

        -- Paul


        • James Harris

          #5
          Re: Parsing a file with iterators

          On 17 Oct, 16:42, Luis Zarrabeitia <ky...@uh.cu> wrote:
          > I need to parse a file, text file. The format is something like that:
          >
          > TYPE1 metadata
          > data line 1
          > data line 2
          > ...
          > data line N
          > TYPE2 metadata
          > data line 1
          > ...
          > TYPE3 metadata
          > ...
          >
          > And so on. The type and metadata determine how to parse the following data
          > lines. When the parser fails to parse one of the lines, the next parser is
          > chosen (or if there is no 'TYPE metadata' line there, an exception is thrown).
          >
          > This doesn't work:
          >
          > ===
          > for line in input:
          >     parser = parser_from_string(line)
          >     parser(input)
          > ===
          >
          > because when the parser iterates over the input, it can't know that it finished
          > processing the section until it reads the next "TYPE" line (actually, until it
          > reads the first line that it cannot parse, which if everything went well, should
          > be the 'TYPE'), but once it reads it, it is no longer available to the outer
          > loop. I wouldn't like to leak the internals of the parsers to the outside.
          >
          > What could I do?
          > (to the curious: the format is a dialect of the E00 used in GIS)

          The main issue seems to be that you need to keep the 'current' line
          data when a parser has decided it doesn't understand it so it can
          still be used to select the next parser. The for loop in your example
          uses the next() method which only returns the next and never the
          current line. There are two easy options though:

          1. Wrap the input file with your own object.
          2. Use the linecache module and maintain a line number.
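
One possible shape for option 1, as a Python 3 sketch. `LineReader`, `current`, `advance`, `parse_pairs` and `parse_all` are all hypothetical names, and the `key=value` parser is only a placeholder for a real section parser:

```python
class LineReader:
    """Wraps an iterable of lines and always exposes the current one."""

    def __init__(self, lines):
        self._it = iter(lines)
        self.current = None
        self.advance()  # load the first line

    def advance(self):
        """Move to the next line; current becomes None at end of input."""
        self.current = next(self._it, None)
        return self.current


def parse_pairs(metadata, reader):
    """Hypothetical parser: consumes 'key=value' lines, stops at anything else."""
    pairs = {}
    while reader.current is not None and "=" in reader.current:
        key, value = reader.current.split("=", 1)
        pairs[key] = value
        reader.advance()
    # The line that stopped us is still in reader.current for the caller.
    return metadata, pairs


def parse_all(lines):
    reader = LineReader(lines)
    sections = []
    while reader.current is not None:
        header = reader.current
        if not header.startswith("TYPE"):
            raise ValueError("expected a TYPE line, got %r" % header)
        reader.advance()  # step past the header before handing over
        type_id, metadata = header.split(" ", 1)
        # A real program would choose the parser from type_id here.
        sections.append(parse_pairs(metadata, reader))
    return sections
```

Because a parser inspects `reader.current` before deciding to call `advance()`, the line it rejects is never consumed, and the outer loop can use it to select the next parser.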



          --
          HTH,
          James


          • George Sakkis

            #6
            Re: Parsing a file with iterators

            On Oct 17, 12:45 pm, Marc 'BlackJack' Rintsch <bj_...@gmx.net> wrote:
            > On Fri, 17 Oct 2008 11:42:05 -0400, Luis Zarrabeitia wrote:
            > > I need to parse a file, text file. The format is something like that:
            > >
            > > TYPE1 metadata
            > > data line 1
            > > data line 2
            > > ...
            > > data line N
            > > TYPE2 metadata
            > > data line 1
            > > ...
            > > TYPE3 metadata
            > > ...
            > > […]
            > > because when the parser iterates over the input, it can't know that it
            > > finished processing the section until it reads the next "TYPE" line
            > > (actually, until it reads the first line that it cannot parse, which if
            > > everything went well, should be the 'TYPE'), but once it reads it, it is
            > > no longer available to the outer loop. I wouldn't like to leak the
            > > internals of the parsers to the outside.
            > >
            > > What could I do?
            > > (to the curious: the format is a dialect of the E00 used in GIS)
            >
            > Group the lines before processing and feed each group to the right parser:
            >
            > import sys
            > from itertools import groupby, imap
            > from operator import itemgetter
            >
            > def parse_a(metadata, lines):
            >     print 'parser a', metadata
            >     for line in lines:
            >         print 'a', line
            >
            > def parse_b(metadata, lines):
            >     print 'parser b', metadata
            >     for line in lines:
            >         print 'b', line
            >
            > def parse_c(metadata, lines):
            >     print 'parser c', metadata
            >     for line in lines:
            >         print 'c', line
            >
            > def test_for_type(line):
            >     return line.startswith('TYPE')
            >
            > def parse(lines):
            >     def tag():
            >         type_line = None
            >         for line in lines:
            >             if test_for_type(line):
            >                 type_line = line
            >             else:
            >                 yield (type_line, line)
            >
            >     type2parser = {'TYPE1': parse_a,
            >                    'TYPE2': parse_b,
            >                    'TYPE3': parse_c }
            >
            >     for type_line, group in groupby(tag(), itemgetter(0)):
            >         type_id, metadata = type_line.split(' ', 1)
            >         type2parser[type_id](metadata, imap(itemgetter(1), group))
            >
            > def main():
            >     parse(sys.stdin)

            I like groupby and find it very powerful but I think it complicates
            things here instead of simplifying them. I would instead create a
            parser instance for every section as soon as the TYPE line is read and
            then feed it one data line at a time (or if all the data lines must or
            should be given at once, append them in a list and feed them all as
            soon as the next section is found), something like:

            class parse_a(object):
                def __init__(self, metadata):
                    print 'parser a', metadata
                def parse(self, line):
                    print 'a', line

            # similar for parse_b and parse_c
            # ...

            def parse(lines):
                parse = None
                for line in lines:
                    if test_for_type(line):
                        type_id, metadata = line.split(' ', 1)
                        parse = type2parser[type_id](metadata).parse
                    else:
                        parse(line)
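
The variant mentioned in parentheses above (collect the data lines in a list and hand them over once the next section starts) might look roughly like this in Python 3. The names `parse_buffered`, `flush` and `collect` are invented for this sketch, and here the parsers are plain callables taking `(metadata, lines)` rather than classes:

```python
def collect(metadata, lines):
    """Hypothetical parser: just returns what it was given."""
    return (metadata, list(lines))


def parse_buffered(lines, type2parser):
    results = []
    header = None   # TYPE line of the section being collected
    buffered = []   # data lines seen since that TYPE line

    def flush():
        # Hand the finished section, if any, to its parser.
        if header is not None:
            type_id, metadata = header.split(' ', 1)
            results.append(type2parser[type_id](metadata, buffered))

    for line in lines:
        if line.startswith('TYPE'):
            flush()
            header, buffered = line, []
        elif header is None:
            raise ValueError('data line before any TYPE line: %r' % line)
        else:
            buffered.append(line)
    flush()  # the last section has no following TYPE line to trigger it
    return results
```

Unlike the groupby version, a section without data lines still reaches its parser (with an empty list), and the final `flush()` is easy to forget because no TYPE line follows the last section.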

            George
