regex for balanced parentheses?

**Paul McGuire** · Jun 27 '08, 04:28 PM

Re: regex for balanced parentheses?

On Jun 12, 6:06 am, David C. Ullrich <dullr...@spryn et.comwrote:

There's no regex that detects balanced parentheses,
or is there?
>
That is, search('and so ((x+y)+z) = (x+(y+z))')
should return '((x+y)+z)'.
>
Not just a theoretical question, I'm cleaning up
a large body of TeX code and a regex that did that
would be very convenient. (Yes, I know it's not
hard to detect balanced parentheses by hand...)
>
I don't know that stuff, but I seen to recall reading
that there's a theoretical notion of "regular expression"
in CS, and a regular expression in that sense cannot
do this. But in the same place I read that the actual
regexes in programming languages include things which
are not regular expressions in that theoretical sense.
>
David C. Ullrich

Pyparsing includes several helper methods for building common
expression patterns, such as delimitedList, oneOf, operatorPrecede nce,
countedArray - and a fairly recent addition, nestedExpr. nestedExpr
creates an expression for matching nested text within opening and
closing delimiters, such as ()'s, []'s, {}'s, etc. The default
delimiters are ()'s. You can also specify a content expression, so
that pyparsing will look for and construct meaningful results. The
default is to return any text nested within the delimiters, broken up
by whitespace.

Here is your sample string parsed using the default nestedExpr:

>>from pyparsing import nestedExpr
>>for e in nestedExpr().se archString('and so ((x+y)+z) = (x+(y+z))'):

... print e[0]
...
[['x+y'], '+z']
['x+', ['y+z']]

Pyparsing found 2 matches in your input string. Note that the parens
are gone from the results - nestedExpr returns a nested list
structure, with nesting corresponding to the ()'s found in the
original string.

Pyparsing supports parse-time callbacks, called 'parse actions', and
it comes with several commonly used methods, such as removeQuotes,
upcaseTokens, and keepOriginalTex t. The purpose of keepOriginalTex t
is to revert any structuring or parsing an expression or other parse
actions might do, and just return the originally matched text.

Here is how keepOriginalTex t gives you back just the nested
parenthetical expressions, without any additional processing or
grouping:

>>from pyparsing import keepOriginalTex t
>>matchedPare ns = nestedExpr().se tParseAction(ke epOriginalText)
>>for e in matchedParens.s earchString('an d so ((x+y)+z) = (x+(y+z))'):

... print e[0]
...
((x+y)+z)
(x+(y+z))

-- Paul

**David C. Ullrich** · Jun 27 '08, 04:28 PM

Re: regex for balanced parentheses?

On Thu, 12 Jun 2008 06:38:16 -0700 (PDT), Paul McGuire
<ptmcg@austin.r r.comwrote:

>On Jun 12, 6:06 am, David C. Ullrich <dullr...@spryn et.comwrote:

>There's no regex that detects balanced parentheses,
>or is there?
>>
>[...]

>
>Pyparsing includes several helper methods for building common
>expression patterns, such as delimitedList, oneOf, operatorPrecede nce,
>countedArray - and a fairly recent addition, nestedExpr. nestedExpr
>creates an expression for matching nested text within opening and
>closing delimiters, such as ()'s, []'s, {}'s, etc.

Keen. Howdya know I wanted that? Thanks.

TeX is one of the amazing things about free software. Knuth
is great in many ways. He totally blew it in one detail,
unfortunately one that comes up a lot: '$' is an opening
delimiter, for which the corresponding closing delimiter
is also '$'. Then better yet, '$$' is another double-duty
delimiter... what I've done with that is first split
on '$$', taking the odd-numbered bits to be the parts
enclosed in $$..$$, and then taking the remining
parts and splitting on $. Not hard but it gives me a
creepy feeling.

Hence the question: Can pyparsing tell the difference
between '$' and '$'? (heh-heh).

The default
>delimiters are ()'s. You can also specify a content expression, so
>that pyparsing will look for and construct meaningful results. The
>default is to return any text nested within the delimiters, broken up
>by whitespace.
>
>Here is your sample string parsed using the default nestedExpr:

>>>from pyparsing import nestedExpr
>>>for e in nestedExpr().se archString('and so ((x+y)+z) = (x+(y+z))'):

>... print e[0]
>...
>[['x+y'], '+z']
>['x+', ['y+z']]
>
>Pyparsing found 2 matches in your input string. Note that the parens
>are gone from the results - nestedExpr returns a nested list
>structure, with nesting corresponding to the ()'s found in the
>original string.
>
>Pyparsing supports parse-time callbacks, called 'parse actions', and
>it comes with several commonly used methods, such as removeQuotes,
>upcaseTokens , and keepOriginalTex t. The purpose of keepOriginalTex t
>is to revert any structuring or parsing an expression or other parse
>actions might do, and just return the originally matched text.
>
>Here is how keepOriginalTex t gives you back just the nested
>parenthetica l expressions, without any additional processing or
>grouping:

>>>from pyparsing import keepOriginalTex t
>>>matchedParen s = nestedExpr().se tParseAction(ke epOriginalText)
>>>for e in matchedParens.s earchString('an d so ((x+y)+z) = (x+(y+z))'):

>... print e[0]
>...
>((x+y)+z)
>(x+(y+z))
>
>-- Paul

David C. Ullrich

**Paul McGuire** · Jun 27 '08, 04:28 PM

Re: regex for balanced parentheses?

Parsing TeX is definitely not for the faint-of-heart! You might try
something like QuotedString('$ ', escQuote='$$') in pyparsing. (I've
not poked at TeX or its ilk since the mid-80's so my TeXpertise is
long rusted away.)

I know of two projects that have taken on the problem using pyparsing
- one is the mathtext module in John Hunter's matplotlib, and Tim
Arnold posted some questions on the subject a while back - try
googling for "pyparsing tex" for further leads.

-- Paul

**Tim Arnold** · Jun 27 '08, 04:28 PM

Re: regex for balanced parentheses?

"Paul McGuire" <ptmcg@austin.r r.comwrote in message
news:206a91d7-8251-49f6-b500-e47af4887a10@a1 g2000hsb.google groups.com...

Parsing TeX is definitely not for the faint-of-heart! You might try
something like QuotedString('$ ', escQuote='$$') in pyparsing. (I've
not poked at TeX or its ilk since the mid-80's so my TeXpertise is
long rusted away.)
>
I know of two projects that have taken on the problem using pyparsing
- one is the mathtext module in John Hunter's matplotlib, and Tim
Arnold posted some questions on the subject a while back - try
googling for "pyparsing tex" for further leads.
>
-- Paul

Definitely agree that TeX can get pretty complicated. My method (writing a
converter from one TeX tag system to another) was to pre-parse using string
match/replace for really simple stuff, regular expressions for the more
complex and pyparsing for the really tough stuff.

One thing that was surprisingly hard for me to figure out was filtering out
comments. I finally just looped through the file line by line, looking for a
'%' that wasn't in a verbatim environment and wasn't escaped, etc.
Funny how sometimes the simplest thing can be difficult to handle.

Definitely pyparsing made the job possible; I can't imagine that job without
it.

--Tim Arnold

**David C. Ullrich** · Jun 27 '08, 04:28 PM

Re: regex for balanced parentheses?

In article
<206a91d7-8251-49f6-b500-e47af4887a10@a1 g2000hsb.google groups.com>,
Paul McGuire <ptmcg@austin.r r.comwrote:

Parsing TeX is definitely not for the faint-of-heart! You might try
something like QuotedString('$ ', escQuote='$$') in pyparsing. (I've
not poked at TeX or its ilk since the mid-80's so my TeXpertise is
long rusted away.)

Thanks. Not actually parsing TeX, just going through and
making little changes to a few things. Easier since I wrote
the original crap so I know that various things don't come
up (for example _every_ % is the start of a comment and
_none_ of those comments is actually important, they're
all just old stuff commented out.)

I know of two projects that have taken on the problem using pyparsing
- one is the mathtext module in John Hunter's matplotlib, and Tim
Arnold posted some questions on the subject a while back - try
googling for "pyparsing tex" for further leads.
>
-- Paul

--
David C. Ullrich

regex for balanced parentheses?

regex for balanced parentheses?

Comment

Comment

Comment

Comment

Comment