need some regular expression help

Collapse
This topic is closed.
X
X
 
  • Time
  • Show
Clear All
new posts
  • Chris

    need some regular expression help

    I need a pattern that matches a string that has the same number of '('
    as ')':
    findall( compile('...'), '42^((2x+2)sin( x)) + (log(2)/log(5))' ) = [
    '((2x+2)sin(x)) ', '(log(2)/log(5))' ]
    Can anybody help me out?

    Thanks for any help!

  • Diez B. Roggisch

    #2
    Re: need some regular expression help


    Chris wrote:
    I need a pattern that matches a string that has the same number of '('
    as ')':
    findall( compile('...'), '42^((2x+2)sin( x)) + (log(2)/log(5))' ) = [
    '((2x+2)sin(x)) ', '(log(2)/log(5))' ]
    Can anybody help me out?
    This is not possible with regular expressions - they can't "remember"
    how many parens they already encountered.

    You will need a real parser for this - pyparsing seems to be the most
    popular choice today, I personally like spark. I'm sure you find an
    example-grammar that will parse simple arithmetical expressions like
    the one above.

    Diez

    Comment

    • John Machin

      #3
      Re: need some regular expression help

      Chris wrote:
      I need a pattern that matches a string that has the same number of '('
      as ')':
      findall( compile('...'), '42^((2x+2)sin( x)) + (log(2)/log(5))' ) = [
      '((2x+2)sin(x)) ', '(log(2)/log(5))' ]
      Can anybody help me out?
      >
      No, there is so such pattern. You will have to code up a function.

      Consider what your spec really is: '42^((2x+2)sin( x)) +
      (log(2)/log(5))' has the same number of left and right parentheses; so
      does the zero-length string; so does ') + (' -- perhaps you need to add
      'and starts with a "("'

      Consider what you are going to do with input like this:

      print '(' + some_text + ')'

      Maybe you need to do some lexical analysis and work at the level of
      tokens rather than individual characters.

      Which then raises the usual question: you have a perception that
      regular expressions are the solution -- to what problem??

      HTH,
      John

      Comment

      • hanumizzle

        #4
        Re: need some regular expression help

        On 7 Oct 2006 15:00:29 -0700, Diez B. Roggisch <deets@web.dewr ote:
        >
        Chris wrote:
        I need a pattern that matches a string that has the same number of '('
        as ')':
        findall( compile('...'), '42^((2x+2)sin( x)) + (log(2)/log(5))' ) = [
        '((2x+2)sin(x)) ', '(log(2)/log(5))' ]
        Can anybody help me out?
        >
        This is not possible with regular expressions - they can't "remember"
        how many parens they already encountered.
        Remember that regular expressions are used to represent regular
        grammars. Most regex engines actually aren't regular in that they
        support fancy things like look-behind/ahead and capture groups...IIRC,
        these cannot be part of a true regular expression library.

        With that said, the quote-unquote regexes in Lua have a special
        feature that supports balanced expressions. I believe Python has a
        PCRE lib somewhere; you may be able to use the experimental ??{ }
        construct in that case.

        -- Theerasak

        Comment

        • Roy Smith

          #5
          Re: need some regular expression help

          In article <1160256609.555 007.83170@e3g20 00cwe.googlegro ups.com>,
          "Chris" <chrispatton@gm ail.comwrote:
          I need a pattern that matches a string that has the same number of '('
          as ')':
          findall( compile('...'), '42^((2x+2)sin( x)) + (log(2)/log(5))' ) = [
          '((2x+2)sin(x)) ', '(log(2)/log(5))' ]
          Can anybody help me out?
          >
          Thanks for any help!
          Why does it need to be a regex? There is a very simple and well-known
          algorithm which does what you want.

          Start with i=0. Walk the string one character at a time, incrementing i
          each time you see a '(', and decrementing it each time you see a ')'. At
          the end of the string, the count should be back to 0. If at any time
          during the process, the count goes negative, you've got mis-matched
          parentheses.

          The algorithm runs in O(n), same as a regex.

          Regex is a wonderful tool, but it's not the answer to all problems.

          Comment

          • Tim Chase

            #6
            Re: need some regular expression help

            Why does it need to be a regex? There is a very simple and well-known
            algorithm which does what you want.
            >
            Start with i=0. Walk the string one character at a time, incrementing i
            each time you see a '(', and decrementing it each time you see a ')'. At
            the end of the string, the count should be back to 0. If at any time
            during the process, the count goes negative, you've got mis-matched
            parentheses.
            >
            The algorithm runs in O(n), same as a regex.
            >
            Regex is a wonderful tool, but it's not the answer to all problems.
            Following Roy's suggestion, one could use something like:
            >>s = '42^((2x+2)sin( x)) + (log(2)/log(5))'
            >>d = {'(':1, ')':-1}
            >>sum(d.get(c , 0) for c in s)
            0


            If you get a sum() 0, then you have too many "(", and if you
            have sum() < 0, you have too many ")" characters. A sum() of 0
            means there's the same number of parens. It still doesn't solve
            the aforementioned problem of things like ')))(((' which is
            balanced, but psychotic. :)

            -tkc




            Comment

            • Diez B. Roggisch

              #7
              Re: need some regular expression help


              hanumizzle wrote:
              On 7 Oct 2006 15:00:29 -0700, Diez B. Roggisch <deets@web.dewr ote:

              Chris wrote:
              I need a pattern that matches a string that has the same number of '('
              as ')':
              findall( compile('...'), '42^((2x+2)sin( x)) + (log(2)/log(5))' ) = [
              '((2x+2)sin(x)) ', '(log(2)/log(5))' ]
              Can anybody help me out?
              This is not possible with regular expressions - they can't "remember"
              how many parens they already encountered.
              >
              Remember that regular expressions are used to represent regular
              grammars. Most regex engines actually aren't regular in that they
              support fancy things like look-behind/ahead and capture groups...IIRC,
              these cannot be part of a true regular expression library.
              Certainly true, and it always gives me a hard time because I don't know
              to which extend a regular expression nowadays might do the job because
              of these extensions. It was so much easier back in the old times....
              With that said, the quote-unquote regexes in Lua have a special
              feature that supports balanced expressions. I believe Python has a
              PCRE lib somewhere; you may be able to use the experimental ??{ }
              construct in that case.
              Even if it has - I'm not sure if it really does you good, for several
              reasons:

              - regexes - even enhanced ones - don't build trees. But that is what
              you ultimately want
              from an expression like sin(log(x))

              - even if they are more powerful these days, the theory of context
              free grammars still applies.
              so if what you need isn't LL(k) but LR(k), how do you specify that
              to the regex engine?

              - the regexes are useful because of their compact notations, parsers
              allow for better structured outcome


              Diez

              Comment

              • Theerasak Photha

                #8
                Re: need some regular expression help

                On 8 Oct 2006 01:49:50 -0700, Diez B. Roggisch <deets@web.dewr ote:
                Even if it has - I'm not sure if it really does you good, for several
                reasons:
                >
                - regexes - even enhanced ones - don't build trees. But that is what
                you ultimately want
                from an expression like sin(log(x))
                >
                - even if they are more powerful these days, the theory of context
                free grammars still applies.
                so if what you need isn't LL(k) but LR(k), how do you specify that
                to the regex engine?
                >
                - the regexes are useful because of their compact notations, parsers
                allow for better structured outcome
                Just wait for Perl 6 :D

                -- Theerasak

                Comment

                • bearophileHUGS@lycos.com

                  #9
                  Re: need some regular expression help

                  Tim Chase:
                  It still doesn't solve the aforementioned problem
                  of things like ')))(((' which is balanced, but psychotic. :)
                  This may solve the problem:

                  def balanced(txt):
                  d = {'(':1, ')':-1}
                  tot = 0
                  for c in txt:
                  tot += d.get(c, 0)
                  if tot < 0:
                  return False
                  return tot == 0

                  print balanced("42^(( 2x+2)sin(x)) + (log(2)/log(5))") # True
                  print balanced("42^(( 2x+2)sin(x) + (log(2)/log(5))") # False
                  print balanced("42^(( 2x+2)sin(x))) + (log(2)/log(5))") # False
                  print balanced(")))(( (") # False

                  A possibile alternative for Py 2.5. The dict solution looks better, but
                  this may be faster:

                  def balanced2(txt):
                  tot = 0
                  for c in txt:
                  tot += 1 if c=="(" else (-1 if c==")" else 0)
                  if tot < 0:
                  return False
                  return tot == 0

                  Bye,
                  bearophile

                  Comment

                  • Fredrik Lundh

                    #10
                    Re: need some regular expression help

                    bearophileHUGS@ lycos.com wrote:
                    The dict solution looks better, but this may be faster:
                    it's slightly faster, but both your alternatives are about 10x slower
                    than a straightforward :

                    def balanced(txt):
                    return txt.count("(") == txt.count(")")

                    </F>

                    Comment

                    • Mirco Wahab

                      #11
                      Re: need some regular expression help

                      Thus spoke Diez B. Roggisch (on 2006-10-08 10:49):
                      Certainly true, and it always gives me a hard time because I don't know
                      to which extend a regular expression nowadays might do the job because
                      of these extensions. It was so much easier back in the old times....
                      Right, in perl, this would be a no-brainer,
                      its documented all over the place, like:

                      my $re;

                      $re = qr{
                      (?:
                      (?[^\\()]+ | \\. )
                      |
                      \( (??{ $re }) \)
                      )*
                      }xs;

                      where you have a 'delayed execution'
                      of the

                      (??{ $re })

                      which in the end makes the whole a thing
                      recursive one, it gets expanded and
                      executed if the match finds its way
                      to it.

                      Above regex will match balanced parens,
                      as in:

                      my $good = 'a + (b / (c - 2)) * (d ^ (e+f)) ';
                      my $bad1 = 'a + (b / (c - 2) * (d ^ (e+f)) ';
                      my $bad2 = 'a + (b / (c - 2)) * (d) ^ (e+f) )';

                      if you do:

                      print "ok \n" if $good =~ /^$re$/;
                      print "ok \n" if $bad1 =~ /^$re$/;
                      print "ok \n" if $bad2 =~ /^$re$/;


                      This in some depth documented e.g. in

                      (topic: Recursive Regexes)

                      Regards

                      M.

                      Comment

                      • Diez B. Roggisch

                        #12
                        Re: need some regular expression help

                        Mirco Wahab schrieb:
                        Thus spoke Diez B. Roggisch (on 2006-10-08 10:49):
                        >Certainly true, and it always gives me a hard time because I don't know
                        >to which extend a regular expression nowadays might do the job because
                        >of these extensions. It was so much easier back in the old times....
                        >
                        Right, in perl, this would be a no-brainer,
                        its documented all over the place, like:
                        >
                        my $re;
                        >
                        $re = qr{
                        (?:
                        (?[^\\()]+ | \\. )
                        |
                        \( (??{ $re }) \)
                        )*
                        }xs;
                        >
                        where you have a 'delayed execution'
                        of the
                        >
                        (??{ $re })
                        >
                        which in the end makes the whole a thing
                        recursive one, it gets expanded and
                        executed if the match finds its way
                        to it.
                        >
                        Above regex will match balanced parens,
                        as in:
                        >
                        my $good = 'a + (b / (c - 2)) * (d ^ (e+f)) ';
                        my $bad1 = 'a + (b / (c - 2) * (d ^ (e+f)) ';
                        my $bad2 = 'a + (b / (c - 2)) * (d) ^ (e+f) )';
                        >
                        if you do:
                        >
                        print "ok \n" if $good =~ /^$re$/;
                        print "ok \n" if $bad1 =~ /^$re$/;
                        print "ok \n" if $bad2 =~ /^$re$/;
                        >
                        >
                        This in some depth documented e.g. in

                        (topic: Recursive Regexes)
                        That clearly is a recursive grammar rule, and thus it can't be regular
                        anymore :) But first of all, I find it ugly - the clean separation of
                        lexical and syntactical analysis is better here, IMHO - and secondly,
                        what are the properties of that parsing? Is it LL(k), LR(k), backtracking?

                        Diez

                        Comment

                        • bearophileHUGS@lycos.com

                          #13
                          Re: need some regular expression help

                          Fredrik Lundh wrote:
                          it's slightly faster, but both your alternatives are about 10x slower
                          than a straightforward :
                          def balanced(txt):
                          return txt.count("(") == txt.count(")")
                          I know, but if you read my post again you see that I have shown those
                          solutions to mark ")))(((" as bad expressions. Just counting the parens
                          isn't enough.

                          Bye,
                          bearophile

                          Comment

                          • Roy Smith

                            #14
                            Re: need some regular expression help

                            "Diez B. Roggisch" <deets@web.dewr ote:
                            Certainly true, and it always gives me a hard time because I don't know
                            to which extend a regular expression nowadays might do the job because
                            of these extensions. It was so much easier back in the old times....
                            What old times? I've been working with regex for mumble years and there's
                            always been the problem that every implementation supports a slightly
                            different syntax. Even back in the "good old days", grep, awk, sed, and ed
                            all had slightly different flavors.

                            Comment

                            • Theerasak Photha

                              #15
                              Re: need some regular expression help

                              On 10/8/06, Roy Smith <roy@panix.comw rote:
                              "Diez B. Roggisch" <deets@web.dewr ote:
                              Certainly true, and it always gives me a hard time because I don't know
                              to which extend a regular expression nowadays might do the job because
                              of these extensions. It was so much easier back in the old times....
                              >
                              What old times? I've been working with regex for mumble years and there's
                              always been the problem that every implementation supports a slightly
                              different syntax. Even back in the "good old days", grep, awk, sed, and ed
                              all had slightly different flavors.
                              Which grep? Which awk? :)

                              -- Theerasak

                              Comment

                              Working...