leftmost longest match (of disjunctions)

Collapse
This topic is closed.
X
X
 
  • Time
  • Show
Clear All
new posts
  • Joerg Schuster

    leftmost longest match (of disjunctions)

    Hello,

    The program given below returns the lines:

    a
    ab

    Is there a way to use python regular expressions such that the program
    would return the following lines?

    ab
    ab

    ############### ############### ############### ############### ############

    import re

    rx1 = re.compile("(a| ab)")
    rx2 = re.compile("(ab |a)")

    out1 = rx1.search("ab" )
    out2 = rx2.search("ab" )

    print out1.group(1)
    print out2.group(1)

    ############### ############### ############### ############### ############

    Jörg
  • Peter Hansen

    #2
    Re: leftmost longest match (of disjunctions)

    Joerg Schuster wrote:[color=blue]
    >
    > The program given below returns the lines:
    >
    > a
    > ab
    >
    > Is there a way to use python regular expressions such that the program
    > would return the following lines?
    >
    > ab
    > ab
    >
    > ############### ############### ############### ############### ############
    >
    > import re
    >
    > rx1 = re.compile("(a| ab)")
    > rx2 = re.compile("(ab |a)")[/color]

    Have you checked the documentation for "re"?

    It reads:
    "|" A|B, where A and B can be arbitrary REs, creates a regular expression
    that will match either A or B. An arbitrary number of REs can be separated
    by the "|" in this way. This can be used inside groups (see below) as well.
    As the target string is scanned, REs separated by "|" are tried from left to
    right. When one pattern completely matches, that branch is accepted. This
    means that once A matches, B will not be tested further, even if it would
    produce a longer overall match. In other words, the "|" operator is never
    greedy.

    ------

    Seems pretty clear and explicit to me. Your example is basically a working
    proof of the above code, so I'm not sure what you were expecting differently.

    -Peter

    Comment

    • Joerg Schuster

      #3
      Re: leftmost longest match (of disjunctions) ; greediness of "|&quot ;

      Peter Hansen <peter@engcorp. com> writes:
      [color=blue]
      > produce a longer overall match. In other words, the "|" operator is never
      > greedy.[/color]

      O.k. Thanks for pointing this out. Maybe I should have formulated my
      question differently: Is there a trick (be it dirty or not) to make
      "|" greedy in python?


      Jörg

      Comment

      • Joerg Schuster

        #4
        Re: leftmost longest match (of disjunctions)

        Peter Hansen <peter@engcorp. com> writes:
        [color=blue]
        > produce a longer overall match. In other words, the "|" operator is never
        > greedy.[/color]

        O.k. Thanks for pointing this out. Maybe I should have formulated my
        question differently: Is there a trick (be it dirty or not) to make
        "|" greedy in python?


        Jörg

        Comment

        • Padraig@Linux.ie

          #5
          Re: leftmost longest match (of disjunctions) ; greediness of &quot;|&quot ;

          Joerg Schuster wrote:[color=blue]
          > Peter Hansen <peter@engcorp. com> writes:
          >
          >[color=green]
          >>produce a longer overall match. In other words, the "|" operator is never
          >>greedy.[/color]
          >
          >
          > O.k. Thanks for pointing this out. Maybe I should have formulated my
          > question differently: Is there a trick (be it dirty or not) to make
          > "|" greedy in python?[/color]

          sort the re by size first?

          Pádraig.

          Comment

          • Joerg Schuster

            #6
            Re: leftmost longest match (of disjunctions) ; greediness of &quot;|&quot ;

            Padraig@Linux.i e writes:
            [color=blue]
            > Joerg Schuster wrote:[color=green]
            > > Peter Hansen <peter@engcorp. com> writes:
            > >[color=darkred]
            > >> produce a longer overall match. In other words, the "|" operator
            > >> is never greedy.[/color]
            > > O.k. Thanks for pointing this out. Maybe I should have formulated my
            > > question differently: Is there a trick (be it dirty or not) to make
            > > "|" greedy in python?[/color]
            >
            > sort the re by size first?[/color]

            The point is not to get the match of the longest part of the
            disjunction, but to get the match of that part of the disjunction
            which is the longest one. (The match of ".*" may be much longer than
            the match of "abc", although the latter regex contains more
            characters.)

            Jörg





            Comment

            • Joerg Schuster

              #7
              Re: leftmost longest match (of disjunctions) ; greediness of &quot;|&quot ;

              Joerg Schuster <js@cis.uni-muenchen.de> writes:
              [color=blue]
              > Padraig@Linux.i e writes:
              >[color=green]
              > > Joerg Schuster wrote:[color=darkred]
              > > > Peter Hansen <peter@engcorp. com> writes:
              > > >
              > > >> produce a longer overall match. In other words, the "|" operator
              > > >> is never greedy.
              > > > O.k. Thanks for pointing this out. Maybe I should have formulated my
              > > > question differently: Is there a trick (be it dirty or not) to make
              > > > "|" greedy in python?[/color]
              > >
              > > sort the re by size first?[/color]
              >
              > The point is not to get the match of the longest part of the
              > disjunction, but to get the match of that part of the disjunction
              > which is the longest one. (The match of ".*" may be much longer than
              > the match of "abc", although the latter regex contains more
              > characters.)
              >
              > Jörg[/color]

              I should have read your mail more carefully. You wrote about sorting
              the *re*, not about sorting the regex. Sorry.

              Jörg

              Comment

              • Diez B. Roggisch

                #8
                Re: leftmost longest match (of disjunctions) ; greediness of &quot;|&quot ;

                Joerg Schuster wrote:
                [color=blue]
                > Peter Hansen <peter@engcorp. com> writes:
                >[color=green]
                >> produce a longer overall match. In other words, the "|" operator is
                >> never greedy.[/color]
                >
                > O.k. Thanks for pointing this out. Maybe I should have formulated my
                > question differently: Is there a trick (be it dirty or not) to make
                > "|" greedy in python?[/color]

                I don't think so - as regexps are finite state automata, their capabilities
                are limited to what these machines can do. So they can't backtrack to a
                part of the disjunction that was a shorter match if the longer match
                suddenly will fail. Consider this example:

                rex: a|abc

                string: abd

                If the state-machine was greedy, it would follow the second path and fail.

                The only thing you can do to avoid is is to split your disjunctions and
                search separately, and afterwards takt the longest part. That could be done
                using a small grammar, that allows you to detect the disjunctions and
                create separate rexes for it.

                Diez

                Comment

                • Fredrik Lundh

                  #9
                  Re: leftmost longest match (of disjunctions) ; greediness of &quot;|&quot ;

                  Joerg Schuster wrote:
                  [color=blue]
                  > I should have read your mail more carefully. You wrote about sorting
                  > the *re*, not about sorting the regex. Sorry.[/color]

                  what's the difference?

                  </F>




                  Comment

                  • Fredrik Lundh

                    #10
                    Re: leftmost longest match (of disjunctions) ; greediness of &quot;|&quot ;

                    Joerg Schuster wrote:
                    [color=blue][color=green][color=darkred]
                    > > > O.k. Thanks for pointing this out. Maybe I should have formulated my
                    > > > question differently: Is there a trick (be it dirty or not) to make
                    > > > "|" greedy in python?[/color]
                    > >
                    > > sort the re by size first?[/color]
                    >
                    > The point is not to get the match of the longest part of the
                    > disjunction, but to get the match of that part of the disjunction
                    > which is the longest one. (The match of ".*" may be much longer
                    > than the match of "abc", although the latter regex contains more
                    > characters.)[/color]

                    you can use "sre_parse.pars e(x).getwidth() " on a subexpression, to
                    get the shortest/longest possible match.
                    [color=blue][color=green][color=darkred]
                    >>> from sre_parse import parse
                    >>> parse("a?").get width()[/color][/color][/color]
                    (0, 1)[color=blue][color=green][color=darkred]
                    >>> parse("ab").get width()[/color][/color][/color]
                    (2, 2)[color=blue][color=green][color=darkred]
                    >>> parse(".+").get width()[/color][/color][/color]
                    (1, 65535)

                    (where >=65535 should be interpreted as "any number")

                    </F>




                    Comment

                    • Joerg Schuster

                      #11
                      Re: leftmost longest match (of disjunctions) ; greediness of &quot;|&quot ;

                      "Fredrik Lundh" <fredrik@python ware.com> writes:
                      [color=blue]
                      > Joerg Schuster wrote:
                      >[color=green]
                      > > I should have read your mail more carefully. You wrote about sorting
                      > > the *re*, not about sorting the regex. Sorry.[/color]
                      >
                      > what's the difference?[/color]


                      One could suspect that re.search() first matches with a greedy "|", then
                      stores all the matches of all parts of the disjunction in the
                      re-object and then returns the first one.

                      Jörg

                      Comment

                      • Joerg Schuster

                        #12
                        Re: leftmost longest match (of disjunctions) ; greediness of &quot;|&quot ;

                        "Fredrik Lundh" <fredrik@python ware.com> writes:
                        [color=blue]
                        > you can use "sre_parse.pars e(x).getwidth() " on a subexpression, to
                        > get the shortest/longest possible match.
                        >[color=green][color=darkred]
                        > >>> from sre_parse import parse
                        > >>> parse("a?").get width()[/color][/color]
                        > (0, 1)[color=green][color=darkred]
                        > >>> parse("ab").get width()[/color][/color]
                        > (2, 2)[color=green][color=darkred]
                        > >>> parse(".+").get width()[/color][/color]
                        > (1, 65535)
                        >
                        > (where >=65535 should be interpreted as "any number")
                        >
                        > </F>[/color]

                        Thanks for the tip.

                        Comment

                        • Joerg Schuster

                          #13
                          Re: leftmost longest match (of disjunctions) ; greediness of &quot;|&quot ;

                          "Diez B. Roggisch" <deets_noospaam @web.de> writes:

                          [color=blue]
                          > The only thing you can do to avoid is is to split your disjunctions and
                          > search separately, and afterwards takt the longest part. That could be done
                          > using a small grammar, that allows you to detect the disjunctions and
                          > create separate rexes for it.
                          >
                          > Diez[/color]

                          Thanks for the tip.

                          Comment

                          Working...