regex for balanced parentheses?

Collapse
This topic is closed.
X
X
 
  • Time
  • Show
Clear All
new posts
  • David C. Ullrich

    regex for balanced parentheses?

    There's no regex that detects balanced parentheses,
    or is there?

    That is, search('and so ((x+y)+z) = (x+(y+z))')
    should return '((x+y)+z)'.

    Not just a theoretical question, I'm cleaning up
    a large body of TeX code and a regex that did that
    would be very convenient. (Yes, I know it's not
    hard to detect balanced parentheses by hand...)

    I don't know that stuff, but I seen to recall reading
    that there's a theoretical notion of "regular expression"
    in CS, and a regular expression in that sense cannot
    do this. But in the same place I read that the actual
    regexes in programming languages include things which
    are not regular expressions in that theoretical sense.


    David C. Ullrich
  • Paul McGuire

    #2
    Re: regex for balanced parentheses?

    On Jun 12, 6:06 am, David C. Ullrich <dullr...@spryn et.comwrote:
    There's no regex that detects balanced parentheses,
    or is there?
    >
    That is, search('and so ((x+y)+z) = (x+(y+z))')
    should return '((x+y)+z)'.
    >
    Not just a theoretical question, I'm cleaning up
    a large body of TeX code and a regex that did that
    would be very convenient. (Yes, I know it's not
    hard to detect balanced parentheses by hand...)
    >
    I don't know that stuff, but I seen to recall reading
    that there's a theoretical notion of "regular expression"
    in CS, and a regular expression in that sense cannot
    do this. But in the same place I read that the actual
    regexes in programming languages include things which
    are not regular expressions in that theoretical sense.
    >
    David C. Ullrich
    Pyparsing includes several helper methods for building common
    expression patterns, such as delimitedList, oneOf, operatorPrecede nce,
    countedArray - and a fairly recent addition, nestedExpr. nestedExpr
    creates an expression for matching nested text within opening and
    closing delimiters, such as ()'s, []'s, {}'s, etc. The default
    delimiters are ()'s. You can also specify a content expression, so
    that pyparsing will look for and construct meaningful results. The
    default is to return any text nested within the delimiters, broken up
    by whitespace.

    Here is your sample string parsed using the default nestedExpr:
    >>from pyparsing import nestedExpr
    >>for e in nestedExpr().se archString('and so ((x+y)+z) = (x+(y+z))'):
    ... print e[0]
    ...
    [['x+y'], '+z']
    ['x+', ['y+z']]

    Pyparsing found 2 matches in your input string. Note that the parens
    are gone from the results - nestedExpr returns a nested list
    structure, with nesting corresponding to the ()'s found in the
    original string.

    Pyparsing supports parse-time callbacks, called 'parse actions', and
    it comes with several commonly used methods, such as removeQuotes,
    upcaseTokens, and keepOriginalTex t. The purpose of keepOriginalTex t
    is to revert any structuring or parsing an expression or other parse
    actions might do, and just return the originally matched text.

    Here is how keepOriginalTex t gives you back just the nested
    parenthetical expressions, without any additional processing or
    grouping:
    >>from pyparsing import keepOriginalTex t
    >>matchedPare ns = nestedExpr().se tParseAction(ke epOriginalText)
    >>for e in matchedParens.s earchString('an d so ((x+y)+z) = (x+(y+z))'):
    ... print e[0]
    ...
    ((x+y)+z)
    (x+(y+z))

    -- Paul

    Comment

    • David C. Ullrich

      #3
      Re: regex for balanced parentheses?

      On Thu, 12 Jun 2008 06:38:16 -0700 (PDT), Paul McGuire
      <ptmcg@austin.r r.comwrote:
      >On Jun 12, 6:06 am, David C. Ullrich <dullr...@spryn et.comwrote:
      >There's no regex that detects balanced parentheses,
      >or is there?
      >>
      >[...]
      >
      >Pyparsing includes several helper methods for building common
      >expression patterns, such as delimitedList, oneOf, operatorPrecede nce,
      >countedArray - and a fairly recent addition, nestedExpr. nestedExpr
      >creates an expression for matching nested text within opening and
      >closing delimiters, such as ()'s, []'s, {}'s, etc.
      Keen. Howdya know I wanted that? Thanks.

      TeX is one of the amazing things about free software. Knuth
      is great in many ways. He totally blew it in one detail,
      unfortunately one that comes up a lot: '$' is an opening
      delimiter, for which the corresponding closing delimiter
      is also '$'. Then better yet, '$$' is another double-duty
      delimiter... what I've done with that is first split
      on '$$', taking the odd-numbered bits to be the parts
      enclosed in $$..$$, and then taking the remining
      parts and splitting on $. Not hard but it gives me a
      creepy feeling.

      Hence the question: Can pyparsing tell the difference
      between '$' and '$'? (heh-heh).
      The default
      >delimiters are ()'s. You can also specify a content expression, so
      >that pyparsing will look for and construct meaningful results. The
      >default is to return any text nested within the delimiters, broken up
      >by whitespace.
      >
      >Here is your sample string parsed using the default nestedExpr:
      >>>from pyparsing import nestedExpr
      >>>for e in nestedExpr().se archString('and so ((x+y)+z) = (x+(y+z))'):
      >... print e[0]
      >...
      >[['x+y'], '+z']
      >['x+', ['y+z']]
      >
      >Pyparsing found 2 matches in your input string. Note that the parens
      >are gone from the results - nestedExpr returns a nested list
      >structure, with nesting corresponding to the ()'s found in the
      >original string.
      >
      >Pyparsing supports parse-time callbacks, called 'parse actions', and
      >it comes with several commonly used methods, such as removeQuotes,
      >upcaseTokens , and keepOriginalTex t. The purpose of keepOriginalTex t
      >is to revert any structuring or parsing an expression or other parse
      >actions might do, and just return the originally matched text.
      >
      >Here is how keepOriginalTex t gives you back just the nested
      >parenthetica l expressions, without any additional processing or
      >grouping:
      >>>from pyparsing import keepOriginalTex t
      >>>matchedParen s = nestedExpr().se tParseAction(ke epOriginalText)
      >>>for e in matchedParens.s earchString('an d so ((x+y)+z) = (x+(y+z))'):
      >... print e[0]
      >...
      >((x+y)+z)
      >(x+(y+z))
      >
      >-- Paul
      David C. Ullrich

      Comment

      • Paul McGuire

        #4
        Re: regex for balanced parentheses?

        Parsing TeX is definitely not for the faint-of-heart! You might try
        something like QuotedString('$ ', escQuote='$$') in pyparsing. (I've
        not poked at TeX or its ilk since the mid-80's so my TeXpertise is
        long rusted away.)

        I know of two projects that have taken on the problem using pyparsing
        - one is the mathtext module in John Hunter's matplotlib, and Tim
        Arnold posted some questions on the subject a while back - try
        googling for "pyparsing tex" for further leads.

        -- Paul

        Comment

        • Tim Arnold

          #5
          Re: regex for balanced parentheses?

          "Paul McGuire" <ptmcg@austin.r r.comwrote in message
          news:206a91d7-8251-49f6-b500-e47af4887a10@a1 g2000hsb.google groups.com...
          Parsing TeX is definitely not for the faint-of-heart! You might try
          something like QuotedString('$ ', escQuote='$$') in pyparsing. (I've
          not poked at TeX or its ilk since the mid-80's so my TeXpertise is
          long rusted away.)
          >
          I know of two projects that have taken on the problem using pyparsing
          - one is the mathtext module in John Hunter's matplotlib, and Tim
          Arnold posted some questions on the subject a while back - try
          googling for "pyparsing tex" for further leads.
          >
          -- Paul
          Definitely agree that TeX can get pretty complicated. My method (writing a
          converter from one TeX tag system to another) was to pre-parse using string
          match/replace for really simple stuff, regular expressions for the more
          complex and pyparsing for the really tough stuff.

          One thing that was surprisingly hard for me to figure out was filtering out
          comments. I finally just looped through the file line by line, looking for a
          '%' that wasn't in a verbatim environment and wasn't escaped, etc.
          Funny how sometimes the simplest thing can be difficult to handle.

          Definitely pyparsing made the job possible; I can't imagine that job without
          it.

          --Tim Arnold



          Comment

          • David C. Ullrich

            #6
            Re: regex for balanced parentheses?

            In article
            <206a91d7-8251-49f6-b500-e47af4887a10@a1 g2000hsb.google groups.com>,
            Paul McGuire <ptmcg@austin.r r.comwrote:
            Parsing TeX is definitely not for the faint-of-heart! You might try
            something like QuotedString('$ ', escQuote='$$') in pyparsing. (I've
            not poked at TeX or its ilk since the mid-80's so my TeXpertise is
            long rusted away.)
            Thanks. Not actually parsing TeX, just going through and
            making little changes to a few things. Easier since I wrote
            the original crap so I know that various things don't come
            up (for example _every_ % is the start of a comment and
            _none_ of those comments is actually important, they're
            all just old stuff commented out.)
            I know of two projects that have taken on the problem using pyparsing
            - one is the mathtext module in John Hunter's matplotlib, and Tim
            Arnold posted some questions on the subject a while back - try
            googling for "pyparsing tex" for further leads.
            >
            -- Paul
            --
            David C. Ullrich

            Comment

            Working...