A better RE?

  • Magnus Lycka

    A better RE?

    I want an re that matches strings like "21MAR06 31APR06 1236",
    where the last part is day numbers (1-7), i.e. it can contain
    the numbers 1-7, in order, only one of each, and at least one
    digit. I want it as three groups. I was thinking of

    r"(\d\d[A-Z]\d\d) (\d\d[A-Z]\d\d) (1?2?3?4?5?6?7?)"

    but that will match even if the third group is empty,
    right? Does anyone have good and not overly complex RE for
    this?

    P.S. I know the "now you have two problems" reply...
  • Fredrik Lundh

    #2
    Re: A better RE?

    Magnus Lycka wrote:
    > I want an re that matches strings like "21MAR06 31APR06 1236",
    > where the last part is day numbers (1-7), i.e. it can contain
    > the numbers 1-7, in order, only one of each, and at least one
    > digit. I want it as three groups. I was thinking of
    >
    > r"(\d\d[A-Z]\d\d) (\d\d[A-Z]\d\d) (1?2?3?4?5?6?7?)"
    >
    > but that will match even if the third group is empty,
    > right? Does anyone have good and not overly complex RE for
    > this?

    how about (untested)

    r"(\d\d[A-Z]{3}\d\d) (\d\d[A-Z]{3}\d\d) (?=[1234567])(1?2?3?4?5?6?7?)"

    where {3} means require three copies of the previous RE part, and
    (?=[1234567]) means require at least one of 1-7, but don't move
    forward if it matches.

    </F>
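The two constructs explained above can be checked in isolation. This is a minimal sketch, not part of the original post; the variable names are made up for illustration, and the `$` anchor is an addition so that the day-field patterns must match the whole string.

```python
import re

# A lookahead asserts that something follows without consuming it,
# so the match position stays at 0 here.
peek = re.match(r"(?=[1-7])", "36")
print(peek.end())

# An all-optional group on its own also matches the empty string ...
without_lookahead = re.match(r"(1?2?3?4?5?6?7?)$", "") is not None
# ... while the lookahead in front of it demands at least one digit 1-7.
with_lookahead = re.match(r"(?=[1-7])(1?2?3?4?5?6?7?)$", "") is not None
print(without_lookahead, with_lookahead)

# {3} repeats the preceding item exactly three times.
three_letters = re.match(r"[A-Z]{3}$", "MAR") is not None
two_letters = re.match(r"[A-Z]{3}$", "MA") is not None
print(three_letters, two_letters)
```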




    • Schüle Daniel

      #3
      Re: A better RE?

      Magnus Lycka wrote:
      > I want an re that matches strings like "21MAR06 31APR06 1236",
      > where the last part is day numbers (1-7), i.e. it can contain
      > the numbers 1-7, in order, only one of each, and at least one
      > digit. I want it as three groups. I was thinking of
      >
      > r"(\d\d[A-Z]\d\d) (\d\d[A-Z]\d\d) (1?2?3?4?5?6?7?)"
      >
      > but that will match even if the third group is empty,
      > right? Does anyone have good and not overly complex RE for
      > this?
      >
      > P.S. I know the "now you have two problems" reply...

      >>> import re
      >>> txt = "21MAR06 31APR06 1236"
      >>> m = '(?:JAN|FEB|MAR|APR|MAI|JUN|JUL|AUG|SEP|OCT|NOV|DEZ)'
      # non capturing group (?:)
      >>> p = re.compile(r"(\d\d%s\d\d) (\d\d%s\d\d) (?=[1234567])(1?2?3?4?5?6?7?)" % (m,m))
      >>> p.match(txt).group(1)
      '21MAR06'
      >>> p.match(txt).group(2)
      '31APR06'
      >>> p.match(txt).group(3)
      '1236'
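As the pattern grows, `re.VERBOSE` and named groups can make it easier to read. The sketch below is an editorial variation on the session above, not the poster's code: the names `MONTHS`, `ENTRY`, `start`, `end` and `days` are invented for illustration, the month list uses the English abbreviations MAY and DEC rather than MAI and DEZ, and a `\Z` anchor is added so trailing characters are rejected.

```python
import re

# Assumed English month abbreviations.
MONTHS = "JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC"

ENTRY = re.compile(
    r"""
    (?P<start>\d\d(?:%(m)s)\d\d)[ ]   # start date such as 21MAR06
    (?P<end>\d\d(?:%(m)s)\d\d)[ ]     # end date
    (?=[1-7])                         # at least one day digit must follow
    (?P<days>1?2?3?4?5?6?7?)\Z        # ascending digits, each at most once
    """ % {"m": MONTHS},
    re.VERBOSE,
)

match = ENTRY.match("21MAR06 31APR06 1236")
print(match.groupdict())

# An unknown month abbreviation no longer matches.
print(ENTRY.match("21XYZ06 31APR06 1236"))
```

In verbose mode whitespace in the pattern is ignored, which is why the literal space between the fields is written as `[ ]`.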


      • bruno at modulix

        #4
        Re: A better RE?

        Magnus Lycka wrote:
        > I want an re that matches strings like "21MAR06 31APR06 1236",
        > where the last part is day numbers (1-7), i.e. it can contain
        > the numbers 1-7, in order, only one of each, and at least one
        > digit. I want it as three groups. I was thinking of
        >
        > r"(\d\d[A-Z]\d\d) (\d\d[A-Z]\d\d) (1?2?3?4?5?6?7?)"
        >
        > but that will match even if the third group is empty,
        > right? Does anyone have good and not overly complex RE for
        > this?

        Simplest:

        >>> import re
        >>> s = "21MAR06 31APR06 1236"
        >>> exp = r"(\d{2}[A-Z]{3}\d{2}) (\d{2}[A-Z]{3}\d{2}) (\d+)"
        >>> re.match(exp, s).groups()
        ('21MAR06', '31APR06', '1236')

        but this could give you false positives, depending on the real data.

        If you want to be as strict as possible, this becomes a little bit hairy.
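To make the false-positive remark concrete, here is a small comparison between the loose `\d+` pattern and a stricter one built on the lookahead idea from earlier in the thread. It is a sketch only; `LOOSE`, `STRICT`, `samples` and `results` are names chosen for this example.

```python
import re

LOOSE = re.compile(r"(\d{2}[A-Z]{3}\d{2}) (\d{2}[A-Z]{3}\d{2}) (\d+)")
STRICT = re.compile(
    r"(\d{2}[A-Z]{3}\d{2}) (\d{2}[A-Z]{3}\d{2}) (?=[1-7])(1?2?3?4?5?6?7?)$"
)

samples = [
    "21MAR06 31APR06 1236",  # valid day field
    "21MAR06 31APR06 1362",  # digits out of order
    "21MAR06 31APR06 89",    # digits outside 1-7
    "21MAR06 31APR06 1123",  # repeated digit
]

# Map each line to (accepted by LOOSE, accepted by STRICT).
results = {
    line: (bool(LOOSE.match(line)), bool(STRICT.match(line)))
    for line in samples
}

for line, (loose_ok, strict_ok) in results.items():
    print(line, "loose:", loose_ok, "strict:", strict_ok)
```

Only the first sample passes both patterns; the other three are the kind of input the loose version lets through.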

        > P.S. I know the "now you have two problems" reply...

        !-)

        --
        bruno desthuilliers
        python -c "print '@'.join(['.'.join([w[::-1] for w in p.split('.')]) for
        p in 'onurb@xiludom.gro'.split('@')])"


        • Eddie Corns

          #5
          Re: A better RE?

          Magnus Lycka <lycka@carmen.se> writes:

          >I want an re that matches strings like "21MAR06 31APR06 1236",
          >where the last part is day numbers (1-7), i.e. it can contain
          >the numbers 1-7, in order, only one of each, and at least one
          >digit. I want it as three groups. I was thinking of

          Just a small point - what does "in order" mean here? If it means that e.g. 1362
          is not valid then you're stuck because it's context sensitive and hence not
          regular.

          I can't see how any of the fancy extensions could help here but maybe I'm just
          lacking insight.

          Now if "[\1-7]" worked you'd be home and dry.

          Eddie


          • Fredrik Lundh

            #6
            Re: A better RE?

            Eddie Corns wrote:

            > >I want an re that matches strings like "21MAR06 31APR06 1236",
            > >where the last part is day numbers (1-7), i.e. it can contain
            > >the numbers 1-7, in order, only one of each, and at least one
            > >digit. I want it as three groups. I was thinking of
            >
            > Just a small point - what does "in order" mean here? if it means that eg 1362
            > is not valid then you're stuck because it's context sensitive and hence not
            > regular.
            >
            > I can't see how any of the fancy extensions could help here but maybe I'm
            > just lacking insight.

            import re

            p = re.compile("(?=[1234567])(1?2?3?4?5?6?7?)$")

            def test(s):
                m = p.match(s)
                print(repr(s), "=>", m and m.groups() or "none")

            test("")
            test("1236")
            test("1362")
            test("12345678")

            prints

            '' => none
            '1236' => ('1236',)
            '1362' => none
            '12345678' => none

            </F>




            • Jim

              #7
              Re: A better RE?


              Eddie Corns wrote:
              > Just a small point - what does "in order" mean here? if it means that eg 1362
              > is not valid then you're stuck because it's context sensitive and hence not
              > regular.
              I'm not seeing that. Any finite language is regular -- as a last
              resort you could list all ascending sequences of 7 or fewer digits (but
              perhaps I misunderstood the original poster's requirements).

              Jim
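The "list all ascending sequences" fallback is small enough to write out: there are 2**7 - 1 = 127 non-empty ascending selections from the digits 1-7. The sketch below builds that explicit alternation with `itertools.combinations` and compares it with the compact optional-digit pattern from earlier in the thread. The names `valid`, `enumerated` and `compact` are chosen for this example only.

```python
import re
from itertools import combinations

DIGITS = "1234567"

# combinations() keeps the input order, so each joined tuple is an
# ascending sequence with no repeated digit.
valid = ["".join(c) for n in range(1, 8) for c in combinations(DIGITS, n)]
print(len(valid))

# The finite language as one explicit alternation, anchored at the end.
enumerated = re.compile(r"(?:%s)\Z" % "|".join(valid))

# The compact pattern from the thread, with the same end anchor.
compact = re.compile(r"(?=[1-7])1?2?3?4?5?6?7?\Z")

for s in ["1236", "1362", "", "7", "12345678"]:
    print(repr(s), bool(enumerated.match(s)), bool(compact.match(s)))
```

Both patterns accept exactly the same strings, which is the sense in which the day field is a regular language despite the ordering constraint.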


              • Eddie Corns

                #8
                Re: A better RE?

                "Fredrik Lundh" <fredrik@pythonware.com> writes:

                >Eddie Corns wrote:
                >
                >> >I want an re that matches strings like "21MAR06 31APR06 1236",
                >> >where the last part is day numbers (1-7), i.e. it can contain
                >> >the numbers 1-7, in order, only one of each, and at least one
                >> >digit. I want it as three groups. I was thinking of
                >>
                >> Just a small point - what does "in order" mean here? if it means that eg 1362
                >> is not valid then you're stuck because it's context sensitive and hence not
                >> regular.
                >>
                >> I can't see how any of the fancy extensions could help here but maybe I'm
                >> just lacking insight.
                >
                >import re
                >
                >p = re.compile("(?=[1234567])(1?2?3?4?5?6?7?)$")
                >
                >def test(s):
                >    m = p.match(s)
                >    print(repr(s), "=>", m and m.groups() or "none")
                >
                >test("")
                >test("1236")
                >test("1362")
                >test("12345678")
                >
                >prints
                >
                >'' => none
                >'1236' => ('1236',)
                >'1362' => none
                >'12345678' => none
                >
                ></F>

                I know I know! I cancelled the article about a minute after posting it.

                Eddie


                • Eddie Corns

                  #9
                  Re: A better RE?

                  "Jim" <jhefferon@smcvt.edu> writes:

                  >Eddie Corns wrote:
                  >> Just a small point - what does "in order" mean here? if it means that eg 1362
                  >> is not valid then you're stuck because it's context sensitive and hence not
                  >> regular.
                  >I'm not seeing that. Any finite language is regular -- as a last
                  >resort you could list all ascending sequences of 7 or fewer digits (but
                  >perhaps I misunderstood the original poster's requirements).

                  No, that's what I did. Just carelessness on my part, time I had a holiday!

                  Eddie


                  • Paul McGuire

                    #10
                    Re: A better RE?

                    "Magnus Lycka" <lycka@carmen.se> wrote in message
                    news:duq0cj$7ih$1@wake.carmen.se...
                    > I want an re that matches strings like "21MAR06 31APR06 1236",
                    > where the last part is day numbers (1-7), i.e. it can contain
                    > the numbers 1-7, in order, only one of each, and at least one
                    > digit. I want it as three groups. I was thinking of
                    >
                    > r"(\d\d[A-Z]\d\d) (\d\d[A-Z]\d\d) (1?2?3?4?5?6?7?)"
                    >
                    > but that will match even if the third group is empty,
                    > right? Does anyone have good and not overly complex RE for
                    > this?
                    >
                    > P.S. I know the "now you have two problems" reply...

                    For the pyparsing-inclined, here are two versions, along with several
                    examples of how to extract the fields from the returned ParseResults object.
                    The second version is more rigorous in enforcing the days-of-week rules on
                    the 3rd field.

                    Note that the month field is already limited to valid month abbreviations,
                    and the same technique used to validate the days-of-week field could be used
                    to ensure that the date fields are valid dates (no 31st of FEB, etc.), that
                    the second date is after the first, etc.

                    -- Paul
                    Download pyparsing at http://pyparsing.sourceforge.net.


                    data = "21MAR06 31APR06 1236"
                    data2 = "21MAR06 31APR06 1362"

                    from pyparsing import *

                    # define format of an entry
                    month = oneOf("JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC")
                    date = Combine( Word(nums,exact=2) + month + Word(nums,exact=2) )
                    daysOfWeek = Word("1234567")
                    entry = date.setResultsName("startDate") + \
                            date.setResultsName("endDate") + \
                            daysOfWeek.setResultsName("weekDays") + \
                            lineEnd

                    # extract entry data
                    e = entry.parseString(data)

                    # various ways to access the results
                    print(e.startDate, e.endDate, e.weekDays)
                    print("%(startDate)s : %(endDate)s : %(weekDays)s" % e)
                    print(e.asList())
                    print(e)
                    print()

                    # get more rigorous in testing for valid days of week field
                    def rigorousDayOfWeekTest(s,l,toks):
                        # remove duplicates from toks[0], sort, then compare to original
                        tmp = "".join(sorted( dict([(ll,0) for ll in toks[0]]).keys()))
                        if tmp != toks[0]:
                            raise ParseException(s,l,"Invalid days of week field")

                    daysOfWeek.setParseAction(rigorousDayOfWeekTest)
                    entry = date.setResultsName("startDate") + \
                            date.setResultsName("endDate") + \
                            daysOfWeek.setResultsName("weekDays") + \
                            lineEnd

                    print(entry.parseString(data))
                    print(entry.parseString(data2)) # <-- raises ParseException
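The remark above about rejecting impossible dates and checking that the second date follows the first can also be sketched with the standard library alone. This is an illustration under stated assumptions, not part of the pyparsing example: the helper names `parse_date` and `check_range` are hypothetical, two-digit years are assumed to mean 20YY, and equal start and end dates are allowed.

```python
import re
from datetime import date

# Month abbreviation -> month number (English abbreviations assumed).
MONTHS = {
    name: number
    for number, name in enumerate(
        "JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC".split(), start=1
    )
}

DATE_RE = re.compile(r"(\d\d)([A-Z]{3})(\d\d)\Z")

def parse_date(text, century=2000):
    """Convert e.g. '21MAR06' to a date; raise ValueError if impossible."""
    m = DATE_RE.match(text)
    if not m or m.group(2) not in MONTHS:
        raise ValueError("not a DDMMMYY date: %r" % text)
    day, month, year = int(m.group(1)), MONTHS[m.group(2)], int(m.group(3))
    # date() itself raises ValueError for days such as 31 April.
    return date(century + year, month, day)

def check_range(start, end):
    """Return (start_date, end_date), requiring start <= end."""
    first, last = parse_date(start), parse_date(end)
    if first > last:
        raise ValueError("end date precedes start date")
    return first, last

print(check_range("21MAR06", "30APR06"))

try:
    check_range("21MAR06", "31APR06")  # the sample string from the thread
except ValueError as exc:
    print("rejected:", exc)
```

Note that the thread's own sample end date, 31APR06, fails this check because April has only 30 days.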



                    • Magnus Lycka

                      #11
                      Re: A better RE?

                      Fredrik Lundh wrote:
                      > Magnus Lycka wrote:
                      > r"(\d\d[A-Z]{3}\d\d) (\d\d[A-Z]{3}\d\d) (?=[1234567])(1?2?3?4?5?6?7?)"
                      >

                      Thanks a lot. (I knew about {3} of course, I was in a hurry
                      when I posted since I was close to missing my train...)


                      • Magnus Lycka

                        #12
                        Re: A better RE?

                        Schüle Daniel wrote:
                        > >>> txt = "21MAR06 31APR06 1236"
                        > >>> m = '(?:JAN|FEB|MAR|APR|MAI|JUN|JUL|AUG|SEP|OCT|NOV|DEZ)'
                        > # non capturing group (?:)
                        > >>> p = re.compile(r"(\d\d%s\d\d) (\d\d%s\d\d) (?=[1234567])(1?2?3?4?5?6?7?)" % (m,m))
                        > >>> p.match(txt).group(1)
                        > '21MAR06'
                        > >>> p.match(txt).group(2)
                        > '31APR06'
                        > >>> p.match(txt).group(3)
                        > '1236'
                        >

                        Excellent. Thanks!
