[pyparsing] make sure entire string was parsed

  • Steven Bethard

    [pyparsing] make sure entire string was parsed

    How do I make sure that my entire string was parsed when I call a
    pyparsing element's parseString method? Here's a dramatically
    simplified version of my problem:

    py> import pyparsing as pp
    py> match = pp.Word(pp.nums)
    py> def parse_num(s, loc, toks):
    ...     n, = toks
    ...     return int(n) + 10
    ...
    py> match.setParseAction(parse_num)
    W:(0123...)
    py> match.parseString('121abc')
    ([131], {})

    I want to know (somehow) that when I called match.parseString(), there
    was some of the string left over (in this case, 'abc') after the parse
    was complete. How can I do this? (I don't think I can do character
    counting; all my internal setParseAction() functions return non-strings).

    STeVe

    P.S. FWIW, I've included the real code below. I need to throw an
    exception when I call the parseString method of cls._root_node or
    cls._root_nodes and the entire string is not consumed.

    ----------------------------------------------------------------------
    # some character classes
    printables_trans = _pp.printables.translate
    word_chars = printables_trans(_id_trans, '()')
    syn_tag_chars = printables_trans(_id_trans, '()-=')
    func_tag_chars = printables_trans(_id_trans, '()-=0123456789')

    # basic tag components
    sep = _pp.Literal('-').leaveWhitespace()
    alt_sep = _pp.Literal('=').leaveWhitespace()
    special_word = _pp.Combine(sep + _pp.Word(syn_tag_chars) + sep)
    supp_sep = (alt_sep | sep).suppress()
    syn_word = _pp.Word(syn_tag_chars).leaveWhitespace()
    func_word = _pp.Word(func_tag_chars).leaveWhitespace()
    id_word = _pp.Word(_pp.nums).leaveWhitespace()

    # the different tag types
    special_tag = special_word.setResultsName('tag')
    syn_tag = syn_word.setResultsName('tag')
    func_tags = _pp.ZeroOrMore(supp_sep + func_word)
    func_tags = func_tags.setResultsName('funcs')
    id_tag = _pp.Optional(supp_sep + id_word).setResultsName('id')
    tags = special_tag | (syn_tag + func_tags + id_tag)
    def get_tag(orig_string, tokens_start, tokens):
        tokens = dict(tokens)
        tag = tokens.pop('tag')
        if tag == '-NONE-':
            tag = None
        functions = list(tokens.pop('funcs', []))
        id = tokens.pop('id', None)
        return [dict(tag=tag, functions=functions, id=id)]
    tags.setParseAction(get_tag)

    # node parentheses
    start = _pp.Literal('(').suppress()
    end = _pp.Literal(')').suppress()

    # words
    word = _pp.Word(word_chars).setResultsName('word')

    # leaf nodes
    leaf_node = tags + _pp.Optional(word)
    def get_leaf_node(orig_string, tokens_start, tokens):
        try:
            tag_dict, word = tokens
            word = cls._unescape(word)
        except ValueError:
            tag_dict, = tokens
            word = None
        return cls(word=word, **tag_dict)
    leaf_node.setParseAction(get_leaf_node)

    # node, recursive
    node = _pp.Forward()

    # branch nodes
    branch_node = tags + _pp.OneOrMore(node)
    def get_branch_node(orig_string, tokens_start, tokens):
        return cls(children=tokens[1:], **tokens[0])
    branch_node.setParseAction(get_branch_node)

    # node, recursive
    node << start + (branch_node | leaf_node) + end

    # root node may have additional parentheses
    cls._root_node = node | start + node + end
    cls._root_nodes = _pp.OneOrMore(cls._root_node)
  • Paul McGuire

    #2
    Re: make sure entire string was parsed

    Steven -

    Thanks for giving pyparsing a try! To see whether your input text
    consumes the whole string, add a StringEnd() element to the end of your
    BNF. Then if there is more text after the parsed text, parseString
    will throw a ParseException.
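
The StringEnd() advice can be sketched against the simplified example from the question. This is a minimal illustration, not the poster's real grammar: the helper names `parses_fully` and `parses_fully_flag` are made up here, and the `parseAll=True` keyword is an assumption about the installed version, since it only exists in pyparsing releases newer than the one discussed in this thread.

```python
import pyparsing as pp

number = pp.Word(pp.nums)

# Without StringEnd, trailing text is silently left unparsed.
partial = number.parseString("121abc").asList()

# With StringEnd appended, leftover text makes the parse fail.
strict = number + pp.StringEnd()


def parses_fully(text):
    """Hypothetical helper: True if `strict` consumes all of `text`."""
    try:
        strict.parseString(text)
        return True
    except pp.ParseException:
        return False


def parses_fully_flag(text):
    """Same check using the parseAll keyword (assumes a newer pyparsing)."""
    try:
        number.parseString(text, parseAll=True)
        return True
    except pp.ParseException:
        return False
```

Catching ParseException at the call site, as the helpers do, is one way to turn "input left over" into whatever error the surrounding code needs.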

    I notice you call leaveWhitespace on several of your parse elements, so
    you may have to rstrip() the input text before calling parseString. I
    am curious whether leaveWhitespace is really necessary for your
    grammar. If it is, you can usually just call leaveWhitespace on the
    root element, and this will propagate to all the sub elements.

    Lastly, you may get caught by operator precedence. I think your
    node assignment statement may need to change from
    node << start + (branch_node | leaf_node) + end
    to
    node << (start + (branch_node | leaf_node) + end)

    HTH,
    -- Paul

    • Steven Bethard

      #3
      Re: make sure entire string was parsed

      Paul McGuire wrote:
      > Thanks for giving pyparsing a try! To see whether your input text
      > consumes the whole string, add a StringEnd() element to the end of your
      > BNF. Then if there is more text after the parsed text, parseString
      > will throw a ParseException.

      Thanks, that's exactly what I was looking for.

      > I notice you call leaveWhitespace on several of your parse elements, so
      > you may have to rstrip() the input text before calling parseString. I
      > am curious whether leaveWhitespace is really necessary for your
      > grammar. If it is, you can usually just call leaveWhitespace on the
      > root element, and this will propagate to all the sub elements.

      Yeah, sorry, I was still messing around with that part of the code. My
      problem is that I have to differentiate between:

      (NP -x-y)

      and:

      (NP-x -y)

      I'm doing this now using Combine. Does that seem right?
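
A small sketch of what Combine does for this kind of distinction, using a toy grammar rather than the real tag grammar. The names `loose`, `tight`, and `matches` are made up for illustration.

```python
import pyparsing as pp

word = pp.Word(pp.alphas)
dash = pp.Literal("-")

# Plain And: whitespace between the pieces is skipped.
loose = word + dash + word

# Combine: the pieces must be adjacent, and they are joined into one token.
tight = pp.Combine(word + dash + word)


def matches(expr, text):
    """Hypothetical helper: True if `expr` parses the start of `text`."""
    try:
        expr.parseString(text)
        return True
    except pp.ParseException:
        return False
```

With these definitions, `tight` accepts "NP-x" but rejects "NP -x", while `loose` accepts both.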

      > Lastly, you may get caught up with operator precedence, I think your
      > node assignment statement may need to change from
      > node << start + (branch_node | leaf_node) + end
      > to
      > node << (start + (branch_node | leaf_node) + end)

      I think I'm okay:

      py> 2 << 1 + 2
      16
      py> (2 << 1) + 2
      6
      py> 2 << (1 + 2)
      16

      Thanks for the help!

      STeVe

      • Paul McGuire

        #4
        Re: make sure entire string was parsed

        Steve -

        >>I have to differentiate between:
        >> (NP -x-y)
        >>and:
        >> (NP-x -y)
        >>I'm doing this now using Combine. Does that seem right?

        If your word char set is just alphanums+"-", then this will work
        without doing anything unnatural with leaveWhitespace:

        from pyparsing import *

        thing = Word(alphanums+"-")
        LPAREN = Literal("(").suppress()
        RPAREN = Literal(")").suppress()
        node = LPAREN + OneOrMore(thing) + RPAREN

        print node.parseString("(NP -x-y)")
        print node.parseString("(NP-x -y)")

        will print:

        ['NP', '-x-y']
        ['NP-x', '-y']


        Your examples helped me to see what my operator precedence concern was.
        Fortunately, your usage was an And, composed using '+' operators. If
        your construct was a MatchFirst, composed using '|' operators, things
        aren't so pretty:

        print 2 << 1 | 3
        print 2 << (1 | 3)

        7
        16

        So I've just gotten into the habit of parenthesizing anything I load
        into a Forward using '<<'.
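
The parenthesizing habit can be sketched with a toy recursive grammar whose body is a MatchFirst. The grammar and the names `atom` and `nested` are made up for illustration and are not the grammar from this thread.

```python
import pyparsing as pp

LPAREN = pp.Literal("(").suppress()
RPAREN = pp.Literal(")").suppress()
atom = pp.Word(pp.alphas)

nested = pp.Forward()

# In Python, `nested << a | b` groups as `(nested << a) | b`, because `<<`
# binds tighter than `|`. The parentheses make sure the whole alternation
# is loaded into the Forward.
nested << (atom | pp.Group(LPAREN + pp.OneOrMore(nested) + RPAREN))

result = nested.parseString("(a (b c))").asList()
```

Without the parentheses, only `atom` would be loaded into `nested`, and the recursive alternative would be left outside it.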

        -- Paul

        • Steven Bethard

          #5
          Re: make sure entire string was parsed

          Paul McGuire wrote:
          >>>I have to differentiate between:
          >>> (NP -x-y)
          >>>and:
          >>> (NP-x -y)
          >>>I'm doing this now using Combine. Does that seem right?
          >
          > If your word char set is just alphanums+"-", then this will work
          > without doing anything unnatural with leaveWhitespace:
          >
          > from pyparsing import *
          >
          > thing = Word(alphanums+"-")
          > LPAREN = Literal("(").suppress()
          > RPAREN = Literal(")").suppress()
          > node = LPAREN + OneOrMore(thing) + RPAREN
          >
          > print node.parseString("(NP -x-y)")
          > print node.parseString("(NP-x -y)")
          >
          > will print:
          >
          > ['NP', '-x-y']
          > ['NP-x', '-y']

          I actually need to break these into:

          ['NP', '-x-y'] {'tag':'NP', 'word':'-x-y'}
          ['NP', 'x', 'y'] {'tag':'NP', 'functions':['x'], 'word':'y'}

          I know the dict syntax afterwards isn't quite what pyparsing would
          output, but hopefully my intent is clear. I need to use the dict-style
          results from setResultsName() calls because in the full grammar, I have
          a lot of optional elements. For example:

          (NP-1 -a)
          --> {'tag':'NP', 'id':'1', 'word':'-a'}
          (NP-x-2 -B)
          --> {'tag':'NP', 'functions':['x'], 'id':'2', 'word':'-B'}
          (NP-x-y=2-3 -4)
          --> {'tag':'NP', 'functions':['x', 'y'], 'coord':'2', 'id':'3',
          'word':'-4'}
          (-NONE- x)
          --> {'tag':None, 'word':'x'}



          STeVe

          P.S. In case you're curious, here's my current draft of the code:

          # some character classes
          printables_trans = _pp.printables.translate
          word_chars = printables_trans(_id_trans, '()')
          word_elem = _pp.Word(word_chars)
          syn_chars = printables_trans(_id_trans, '()-=')
          syn_word = _pp.Word(syn_chars)
          func_chars = printables_trans(_id_trans, '()-=0123456789')
          func_word = _pp.Word(func_chars)
          num_word = _pp.Word(_pp.nums)

          # tag separators
          dash = _pp.Literal('-')
          tag_sep = dash.suppress()
          coord_sep = _pp.Literal('=').suppress()

          # tag types (use Combine to guarantee no spaces)
          special_tag = _pp.Combine(dash + syn_word + dash)
          syn_tag = syn_word
          func_tags = _pp.ZeroOrMore(_pp.Combine(tag_sep + func_word))
          coord_tag = _pp.Optional(_pp.Combine(coord_sep + num_word))
          id_tag = _pp.Optional(_pp.Combine(tag_sep + num_word))

          # give tag types result names
          special_tag = special_tag.setResultsName('tag')
          syn_tag = syn_tag.setResultsName('tag')
          func_tags = func_tags.setResultsName('funcs')
          coord_tag = coord_tag.setResultsName('coord')
          id_tag = id_tag.setResultsName('id')

          # combine tag types into a tags element
          normal_tags = syn_tag + func_tags + coord_tag + id_tag
          tags = special_tag | _pp.Combine(normal_tags)
          def get_tag(orig_string, tokens_start, tokens):
              tokens = dict(tokens)
              tag = tokens.pop('tag')
              if tag == '-NONE-':
                  tag = None
              functions = list(tokens.pop('funcs', []))
              coord = tokens.pop('coord', None)
              id = tokens.pop('id', None)
              return [dict(tag=tag, functions=functions,
                           coord=coord, id=id)]
          tags.setParseAction(get_tag)

          # node parentheses
          start = _pp.Literal('(').suppress()
          end = _pp.Literal(')').suppress()

          # words
          word = word_elem.setResultsName('word')

          # leaf nodes
          leaf_node = tags + _pp.Optional(word)
          def get_leaf_node(orig_string, tokens_start, tokens):
              try:
                  tag_dict, word = tokens
                  word = cls._unescape(word)
              except ValueError:
                  tag_dict, = tokens
                  word = None
              return cls(word=word, **tag_dict)
          leaf_node.setParseAction(get_leaf_node)

          # node, recursive
          node = _pp.Forward()

          # branch nodes
          branch_node = tags + _pp.OneOrMore(node)
          def get_branch_node(orig_string, tokens_start, tokens):
              return cls(children=tokens[1:], **tokens[0])
          branch_node.setParseAction(get_branch_node)

          # node, recursive
          node << start + (branch_node | leaf_node) + end

          # root node may have additional parentheses
          root_node = node | start + node + end
          root_nodes = _pp.OneOrMore(root_node)

          # make sure nodes start and end string
          str_start = _pp.StringStart()
          str_end = _pp.StringEnd()
          cls._root_node = str_start + root_node + str_end
          cls._root_nodes = str_start + root_nodes + str_end

          • Steven Bethard

            #6
            Re: make sure entire string was parsed

            Steven Bethard wrote:
            > Paul McGuire wrote:
            >
            >>>> I have to differentiate between:
            >>>> (NP -x-y)
            >>>> and:
            >>>> (NP-x -y)
            >>>> I'm doing this now using Combine. Does that seem right?
            >>
            >>
            >> If your word char set is just alphanums+"-", then this will work
            >> without doing anything unnatural with leaveWhitespace:
            >>
            >> from pyparsing import *
            >>
            >> thing = Word(alphanums+"-")
            >> LPAREN = Literal("(").suppress()
            >> RPAREN = Literal(")").suppress()
            >> node = LPAREN + OneOrMore(thing) + RPAREN
            >>
            >> print node.parseString("(NP -x-y)")
            >> print node.parseString("(NP-x -y)")
            >>
            >> will print:
            >>
            >> ['NP', '-x-y']
            >> ['NP-x', '-y']
            >
            >
            > I actually need to break these into:
            >
            > ['NP', '-x-y'] {'tag':'NP', 'word':'-x-y'}
            > ['NP', 'x', 'y'] {'tag':'NP', 'functions':['x'], 'word':'y'}

            Oops, sorry, the last line should have been:

            ['NP', 'x', '-y'] {'tag':'NP', 'functions':['x'], 'word':'-y'}

            Sorry to introduce confusion into an already confusing parsing problem. ;)

            STeVe

            • Paul McGuire

              #7
              Re: make sure entire string was parsed

              Steve -

              Wow, this is a pretty dense pyparsing program. You are really pushing
              the envelope in your use of ParseResults, dicts, etc., but pretty much
              everything seems to be working.

              I still don't know the BNF you are working from, but here are some
              other "shots in the dark":

              1. I'm surprised func_word does not permit numbers anywhere in the
              body. Is this just a feature you have not implemented yet? As long as
              func_word does not start with a digit, you can still define one
              unambiguously to allow numbers after the first character if you define
              func_word as

              func_word = _pp.Word(func_chars, func_chars + _pp.nums)

              Perhaps similar for syn_word as well.

              2. Is coord an optional sub-element of a func? If so, you might want
              to group them so that they stay together, something like:

              coord_tag = _pp.Optional(_pp.Combine(coord_sep + num_word))
              func_tags = _pp.ZeroOrMore(_pp.Group(tag_sep + func_word + coord_tag))

              You might also add a default value for coord_tag if none is supplied,
              to simplify your parse action?

              coord_tag = _pp.Optional(_pp.Combine(coord_sep + num_word), None)

              Now the coords and funcs will be kept together.

              3. Of course, you are correct in using Combine to ensure that you only
              accept adjacent characters. But you only need to use it at the
              outermost level.

              4. You can use several dict-like functions directly on a ParseResults
              object, such as keys(), items(), values(), in, etc. Also, the []
              notation and the .attribute notation are nearly identical, except that
              [] refs on a missing element will raise a KeyError, .attribute will
              always return something. For instance, in your example, the get_tag()
              parse action uses dict.pop() to extract the 'coord' field. If coord is
              present, you could retrieve it using "tokens['coord']" or
              "tokens.coord". If coord is missing, "tokens['coord']" will raise a
              KeyError, but tokens.coord will return an empty string. If you need to
              "listify" a ParseResults, try calling asList().
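
The difference between the two lookup styles in point 4 can be sketched on a toy grammar. The expression and the helper name `lookup` are made up for illustration and are much simpler than the tag grammar in this thread.

```python
import pyparsing as pp

tag = pp.Word(pp.alphas).setResultsName("tag")
coord = pp.Optional(pp.Suppress("=") + pp.Word(pp.nums).setResultsName("coord"))
expr = tag + coord

with_coord = expr.parseString("NP=2")
without_coord = expr.parseString("NP")


def lookup(tokens, name):
    """Hypothetical helper: []-style lookup that reports a missing name."""
    try:
        return tokens[name]
    except KeyError:
        return "<missing>"
```

Here `without_coord.coord` gives an empty string, while `lookup(without_coord, "coord")` takes the KeyError branch. The `in` test and asList() work on the same objects.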


              It's not clear to me what if any further help you are looking for, now
              that your initial question (about StringEnd()) has been answered. But
              please let us know how things work out.

              -- Paul

              • Steven Bethard

                #8
                Re: make sure entire string was parsed

                Paul McGuire wrote:
                > I still don't know the BNF you are working from

                Just to satisfy any curiosity you might have, it's the Penn TreeBank
                format: http://www.cis.upenn.edu/~treebank/
                (Except that the actual Penn Treebank data unfortunately differs from
                the format spec in a few ways.)

                > 1. I'm surprised func_word does not permit numbers anywhere in the
                > body. Is this just a feature you have not implemented yet? As long as
                > func_word does not start with a digit, you can still define one
                > unambiguously to allow numbers after the first character if you define
                > func_word as
                >
                > func_word = _pp.Word(func_chars, func_chars + _pp.nums)

                Ahh, very nice. The spec's vague, but this is probably what I want to do.
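
A minimal sketch of the two-argument form of Word, where the first argument gives the allowed initial characters and the second the allowed body characters. It uses plain letters and digits instead of the thread's func_chars, and the helper `first_token` is made up for illustration.

```python
import pyparsing as pp

# First character must be a letter; later characters may be letters or digits.
func_word = pp.Word(pp.alphas, pp.alphanums)


def first_token(text):
    """Hypothetical helper: the first matched token, or None on failure."""
    try:
        return func_word.parseString(text)[0]
    except pp.ParseException:
        return None
```

A word like "SBJ2" matches as a whole, while "2SBJ" fails because its first character is a digit.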

                > 2. Is coord an optional sub-element of a func?

                No, functions, coord and id are optional sub-elements of the tags string.

                > You might also add a default value for coord_tag if none is supplied,
                > to simplify your parse action?

                Oh, that's nice. I missed that functionality.
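
The default-value form of Optional can be sketched as follows. The toy grammar and the placeholder string "none" are assumptions made for illustration; the suggestion quoted above passes None as the default instead.

```python
import pyparsing as pp

tag = pp.Word(pp.alphas)

# The second argument is the value emitted when the optional part is absent.
coord = pp.Optional(pp.Suppress("=") + pp.Word(pp.nums), "none")
expr = tag + coord

present = expr.parseString("NP=2").asList()
absent = expr.parseString("NP").asList()
```

Because the token list has the same length either way, a parse action can unpack it without checking whether the optional part matched.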

                > It's not clear to me what if any further help you are looking for, now
                > that your initial question (about StringEnd()) has been answered.

                Yes, thanks, you definitely answered the initial question. And your
                followup commentary was also very helpful. Thanks again!

                STeVe
