[pyparsing] make sure entire string was parsed

**Paul McGuire** · Sep 11 '05, 02:25 AM

Re: make sure entire string was parsed

Steven -

Thanks for giving pyparsing a try! To check whether the parser consumed
the whole input string, add a StringEnd() element to the end of your
BNF. Then if there is more text after the parsed text, parseString
will throw a ParseException.
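
A minimal sketch of the StringEnd() behaviour described above. The toy grammar (a single alphabetic word) and the variable names are made up for illustration and are not Steven's grammar. The parseAll=True keyword shown at the end is, as far as I know, available only in later pyparsing releases than the one discussed in this thread.

```python
from pyparsing import ParseException, StringEnd, Word, alphas

word = Word(alphas)            # hypothetical toy grammar: one word
strict = word + StringEnd()    # same grammar, but must reach end of input

# Without StringEnd, trailing text is silently left unparsed.
loose_tokens = word.parseString("hello 123").asList()

# With StringEnd, leftover text raises ParseException.
try:
    strict.parseString("hello 123")
    strict_rejects_trailing = False
except ParseException:
    strict_rejects_trailing = True

# Later pyparsing versions offer the same check as a keyword argument.
try:
    word.parseString("hello 123", parseAll=True)
    parse_all_rejects_trailing = False
except ParseException:
    parse_all_rejects_trailing = True

# StringEnd itself contributes no tokens to the result.
strict_tokens = strict.parseString("hello").asList()
```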

I notice you call leaveWhitespace on several of your parse elements, so
you may have to rstrip() the input text before calling parseString. I
am curious whether leaveWhitespace is really necessary for your
grammar. If it is, you can usually just call leaveWhitespace on the
root element, and this will propagate to all the sub elements.
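
Here is a small sketch of why rstrip() can matter once whitespace skipping is turned off. It assumes a toy one-word grammar rather than the real one, and it calls leaveWhitespace on the root expression only, relying on the propagation to sub elements mentioned above.

```python
from pyparsing import ParseException, StringEnd, Word, alphas

# Hypothetical toy grammar; leaveWhitespace is applied to the root only.
expr = (Word(alphas) + StringEnd()).leaveWhitespace()

text = "hello \n"   # trailing whitespace, e.g. from reading a line of a file

# With whitespace skipping disabled, StringEnd no longer skips the
# trailing blanks, so the raw text is rejected.
try:
    expr.parseString(text)
    raw_text_parses = True
except ParseException:
    raw_text_parses = False

# Stripping the right-hand side first lets the same grammar succeed.
stripped_tokens = expr.parseString(text.rstrip()).asList()
```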

Lastly, you may get caught out by operator precedence. I think your
node assignment statement may need to change from
node << start + (branch_node | leaf_node) + end
to
node << (start + (branch_node | leaf_node) + end)

HTH,
-- Paul

**Steven Bethard** · Sep 11 '05, 06:15 PM

Re: make sure entire string was parsed

Paul McGuire wrote:
> Thanks for giving pyparsing a try! To check whether the parser consumed
> the whole input string, add a StringEnd() element to the end of your
> BNF. Then if there is more text after the parsed text, parseString
> will throw a ParseException.

Thanks, that's exactly what I was looking for.

> I notice you call leaveWhitespace on several of your parse elements, so
> you may have to rstrip() the input text before calling parseString. I
> am curious whether leaveWhitespace is really necessary for your
> grammar. If it is, you can usually just call leaveWhitespace on the
> root element, and this will propagate to all the sub elements.

Yeah, sorry, I was still messing around with that part of the code. My
problem is that I have to differentiate between:

(NP -x-y)

and:

(NP-x -y)

I'm doing this now using Combine. Does that seem right?

> Lastly, you may get caught up with operator precedence, I think your
> node assignment statement may need to change from
> node << start + (branch_node | leaf_node) + end
> to
> node << (start + (branch_node | leaf_node) + end)

I think I'm okay:

py> 2 << 1 + 2
16
py> (2 << 1) + 2
6
py> 2 << (1 + 2)
16

Thanks for the help!

STeVe

**Paul McGuire** · Sep 11 '05, 10:05 PM

Re: make sure entire string was parsed

Steve -

>> I have to differentiate between:
>> (NP -x-y)
>> and:
>> (NP-x -y)
>> I'm doing this now using Combine. Does that seem right?

If your word char set is just alphanums+"-", then this will work
without doing anything unnatural with leaveWhitespace:

from pyparsing import *

thing = Word(alphanums + "-")
LPAREN = Literal("(").suppress()
RPAREN = Literal(")").suppress()
node = LPAREN + OneOrMore(thing) + RPAREN

print(node.parseString("(NP -x-y)"))
print(node.parseString("(NP-x -y)"))

will print:

['NP', '-x-y']
['NP-x', '-y']

Your examples helped me to see what my operator precedence concern was.
Fortunately, your usage was an And, composed using '+' operators, which
bind more tightly than '<<'. If your construct were a MatchFirst,
composed using '|' operators, which bind less tightly than '<<', things
would not be so pretty:

print(2 << 1 | 3)
print(2 << (1 | 3))

7
16

So I've just gotten into the habit of parenthesizing anything I load
into a Forward using '<<'.
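
To make the precedence point concrete without involving pyparsing at all, here is a small sketch that asks Python's own parser which operator ends up outermost in an expression. The helper name top_level_operator and the placeholder names inside the expression strings are made up for illustration; the strings are only parsed, never evaluated.

```python
import ast


def top_level_operator(source):
    """Return the name of the operator Python applies last in `source`."""
    tree = ast.parse(source, mode="eval")
    return type(tree.body.op).__name__


# '+' binds tighter than '<<', so the whole sum is loaded into the Forward.
plus_form = top_level_operator("node << start + body + end")

# '|' binds looser than '<<', so only `branch` is loaded and the result
# of the shift is then or-ed with `leaf`.
pipe_form = top_level_operator("node << branch | leaf")

# Parenthesizing the right-hand side restores the intended grouping.
safe_form = top_level_operator("node << (branch | leaf)")

print(plus_form, pipe_form, safe_form)  # LShift BitOr LShift
```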

-- Paul

**Steven Bethard** · Sep 12 '05, 03:25 PM

Re: make sure entire string was parsed

Paul McGuire wrote:
>>> I have to differentiate between:
>>> (NP -x-y)
>>> and:
>>> (NP-x -y)
>>> I'm doing this now using Combine. Does that seem right?
>
> If your word char set is just alphanums+"-", then this will work
> without doing anything unnatural with leaveWhitespace:
>
> from pyparsing import *
>
> thing = Word(alphanums + "-")
> LPAREN = Literal("(").suppress()
> RPAREN = Literal(")").suppress()
> node = LPAREN + OneOrMore(thing) + RPAREN
>
> print(node.parseString("(NP -x-y)"))
> print(node.parseString("(NP-x -y)"))
>
> will print:
>
> ['NP', '-x-y']
> ['NP-x', '-y']

I actually need to break these into:

['NP', '-x-y'] {'tag':'NP', 'word':'-x-y'}
['NP', 'x', 'y'] {'tag':'NP', 'functions':['x'], 'word':'y'}

I know the dict syntax afterwards isn't quite what pyparsing would
output, but hopefully my intent is clear. I need to use the dict-style
results from setResultsName() calls because in the full grammar, I have
a lot of optional elements. For example:

(NP-1 -a)
--> {'tag':'NP', 'id':'1', 'word':'-a'}
(NP-x-2 -B)
--> {'tag':'NP', 'functions':['x'], 'id':'2', 'word':'-B'}
(NP-x-y=2-3 -4)
--> {'tag':'NP', 'functions':['x', 'y'], 'coord':'2', 'id':'3', 'word':'-4'}
(-NONE- x)
--> {'tag':None, 'word':'x'}

STeVe

P.S. In case you're curious, here's my current draft of the code:

# some character classes
# (str.translate(table, deletechars) is the Python 2 form; _id_trans is
# presumably an identity translation table defined elsewhere)
printables_trans = _pp.printables.translate
word_chars = printables_trans(_id_trans, '()')
word_elem = _pp.Word(word_chars)
syn_chars = printables_trans(_id_trans, '()-=')
syn_word = _pp.Word(syn_chars)
func_chars = printables_trans(_id_trans, '()-=0123456789')
func_word = _pp.Word(func_chars)
num_word = _pp.Word(_pp.nums)

# tag separators
dash = _pp.Literal('-')
tag_sep = dash.suppress()
coord_sep = _pp.Literal('=').suppress()

# tag types (use Combine to guarantee no spaces)
special_tag = _pp.Combine(dash + syn_word + dash)
syn_tag = syn_word
func_tags = _pp.ZeroOrMore(_pp.Combine(tag_sep + func_word))
coord_tag = _pp.Optional(_pp.Combine(coord_sep + num_word))
id_tag = _pp.Optional(_pp.Combine(tag_sep + num_word))

# give tag types result names
special_tag = special_tag.setResultsName('tag')
syn_tag = syn_tag.setResultsName('tag')
func_tags = func_tags.setResultsName('funcs')
coord_tag = coord_tag.setResultsName('coord')
id_tag = id_tag.setResultsName('id')

# combine tag types into a tags element
normal_tags = syn_tag + func_tags + coord_tag + id_tag
tags = special_tag | _pp.Combine(normal_tags)
def get_tag(orig_string, tokens_start, tokens):
    tokens = dict(tokens)
    tag = tokens.pop('tag')
    if tag == '-NONE-':
        tag = None
    functions = list(tokens.pop('funcs', []))
    coord = tokens.pop('coord', None)
    id = tokens.pop('id', None)
    return [dict(tag=tag, functions=functions,
                 coord=coord, id=id)]
tags.setParseAction(get_tag)

# node parentheses
start = _pp.Literal('(').suppress()
end = _pp.Literal(')').suppress()

# words
word = word_elem.setResultsName('word')

# leaf nodes
leaf_node = tags + _pp.Optional(word)
def get_leaf_node(orig_string, tokens_start, tokens):
    try:
        tag_dict, word = tokens
        word = cls._unescape(word)
    except ValueError:
        tag_dict, = tokens
        word = None
    return cls(word=word, **tag_dict)
leaf_node.setParseAction(get_leaf_node)

# node, recursive
node = _pp.Forward()

# branch nodes
branch_node = tags + _pp.OneOrMore(node)
def get_branch_node(orig_string, tokens_start, tokens):
    return cls(children=tokens[1:], **tokens[0])
branch_node.setParseAction(get_branch_node)

# node, recursive
node << start + (branch_node | leaf_node) + end

# root node may have additional parentheses
root_node = node | start + node + end
root_nodes = _pp.OneOrMore(root_node)

# make sure nodes start and end string
str_start = _pp.StringStart()
str_end = _pp.StringEnd()
cls._root_node = str_start + root_node + str_end
cls._root_nodes = str_start + root_nodes + str_end

**Steven Bethard** · Sep 12 '05, 04:15 PM

Re: make sure entire string was parsed

Steven Bethard wrote:
> Paul McGuire wrote:
>
>>>> I have to differentiate between:
>>>> (NP -x-y)
>>>> and:
>>>> (NP-x -y)
>>>> I'm doing this now using Combine. Does that seem right?
>>
>> If your word char set is just alphanums+"-", then this will work
>> without doing anything unnatural with leaveWhitespace:
>>
>> from pyparsing import *
>>
>> thing = Word(alphanums + "-")
>> LPAREN = Literal("(").suppress()
>> RPAREN = Literal(")").suppress()
>> node = LPAREN + OneOrMore(thing) + RPAREN
>>
>> print(node.parseString("(NP -x-y)"))
>> print(node.parseString("(NP-x -y)"))
>>
>> will print:
>>
>> ['NP', '-x-y']
>> ['NP-x', '-y']
>
> I actually need to break these into:
>
> ['NP', '-x-y'] {'tag':'NP', 'word':'-x-y'}
> ['NP', 'x', 'y'] {'tag':'NP', 'functions':['x'], 'word':'y'}

Oops, sorry, the last line should have been:

['NP', 'x', '-y'] {'tag':'NP', 'functions':['x'], 'word':'-y'}

Sorry to introduce confusion into an already confusing parsing problem. ;)

STeVe

**Paul McGuire** · Sep 13 '05, 03:55 AM

Re: make sure entire string was parsed

Steve -

Wow, this is a pretty dense pyparsing program. You are really pushing
the envelope in your use of ParseResults, dicts, etc., but pretty much
everything seems to be working.

I still don't know the BNF you are working from, but here are some
other "shots in the dark":

1. I'm surprised func_word does not permit numbers anywhere in the
body. Is this just a feature you have not implemented yet? As long as
func_word does not start with a digit, you can still define one
unambiguously to allow numbers after the first character if you define
func_word as

func_word = _pp.Word(func_chars, func_chars + _pp.nums)

Perhaps similar for syn_word as well.

2. Is coord an optional sub-element of a func? If so, you might want
to group them so that they stay together, something like:

coord_tag = _pp.Optional(_pp.Combine(coord_sep + num_word))
func_tags = _pp.ZeroOrMore(_pp.Group(tag_sep + func_word + coord_tag))

You might also add a default value for coord_tag if none is supplied,
to simplify your parse action?

coord_tag = _pp.Optional(_pp.Combine(coord_sep + num_word), None)

Now the coords and funcs will be kept together.

3. Of course, you are correct in using Combine to ensure that you only
accept adjacent characters. But you only need to use it at the
outermost level.

4. You can use several dict-like functions directly on a ParseResults
object, such as keys(), items(), values(), in, etc. Also, the []
notation and the .attribute notation are nearly identical, except that
[] refs on a missing element will raise a KeyError, .attribute will
always return something. For instance, in your example, the get_tag()
parse action uses dict.pop() to extract the 'coord' field. If coord is
present, you could retrieve it using "tokens['coord']" or
"tokens.coord". If coord is missing, "tokens['coord']" will raise a
KeyError, but tokens.coord will return an empty string. If you need to
"listify" a ParseResults, try calling asList().
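
The points above can be sketched with a few toy expressions. Everything here is a simplified stand-in and not the Treebank grammar: the character sets, the "=" separator handling, the default value "0", and all variable names are assumptions chosen only to show the API behaviour.

```python
from pyparsing import (Combine, Optional, ParseException, Suppress, Word,
                       alphas, nums)


def fails(expr, text):
    """Return True if expr raises ParseException when parsing text."""
    try:
        expr.parseString(text)
    except ParseException:
        return True
    return False


# Point 1: separate initial and body character sets, so digits are
# allowed after the first character but not as the first character.
func_word = Word(alphas, alphas + nums)
func_tokens = func_word.parseString("x2y").asList()
func_rejects_leading_digit = fails(func_word, "2xy")

# Point 2 (default value): Optional's second argument is the token that
# is returned when the optional part is absent.
with_default = Word(alphas) + Optional(Word(nums), "0")
default_tokens = with_default.parseString("abc").asList()

# Point 3: a single outer Combine already forces adjacency inside it.
tag = Combine(Word(alphas) + "-" + Word(nums))
tag_tokens = tag.parseString("NP-1").asList()
tag_rejects_space = fails(tag, "NP -1")

# Point 4: dict-style access on ParseResults.
named = (Word(alphas).setResultsName("tag")
         + Optional(Suppress("=") + Word(nums).setResultsName("coord")))
present = named.parseString("NP=2")
missing = named.parseString("NP")

coord_by_key = present["coord"]      # item access
coord_by_attr = present.coord        # attribute access
missing_attr = missing.coord         # empty string, no exception
try:
    missing["coord"]
    missing_key_raises = False
except KeyError:
    missing_key_raises = True
as_plain_list = present.asList()     # "listify" the results
```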

It's not clear to me what if any further help you are looking for, now
that your initial question (about StringEnd()) has been answered. But
please let us know how things work out.

-- Paul

**Steven Bethard** · Sep 13 '05, 04:35 PM

Re: make sure entire string was parsed

Paul McGuire wrote:
> I still don't know the BNF you are working from

Just to satisfy any curiosity you might have, it's the Penn Treebank
format: http://www.cis.upenn.edu/~treebank/
(Except that the actual Penn Treebank data unfortunately differs from
the format spec in a few ways.)

> 1. I'm surprised func_word does not permit numbers anywhere in the
> body. Is this just a feature you have not implemented yet? As long as
> func_word does not start with a digit, you can still define one
> unambiguously to allow numbers after the first character if you define
> func_word as
>
> func_word = _pp.Word(func_chars, func_chars + _pp.nums)

Ahh, very nice. The spec's vague, but this is probably what I want to do.

> 2. Is coord an optional sub-element of a func?

No, functions, coord and id are optional sub-elements of the tags string.

> You might also add a default value for coord_tag if none is supplied,
> to simplify your parse action?

Oh, that's nice. I missed that functionality.

> It's not clear to me what if any further help you are looking for, now
> that your initial question (about StringEnd()) has been answered.

Yes, thanks, you definitely answered the initial question. And your
followup commentary was also very helpful. Thanks again!

STeVe
