Re: python regex character group matches

  • Fredrik Lundh

    christopher taylor wrote:
    > my issue, is that the pattern i used was returning:
    >
    > [ '\\uAD0X', '\\u1BF3', ... ]
    >
    > when i expected:
    >
    > [ '\\uAD0X\\u1BF3', ]
    >
    > the code looks something like this:
    >
    > pat = re.compile("(\\\u[0-9A-F]{4})+", re.UNICODE|re.LOCALE)
    > #print pat.findall(txt_line)
    > results = pat.finditer(txt_line)
    >
    > i ran the pattern through a couple of my colleagues and they were all
    > in agreement that my pattern should have matched correctly.

    First, [0-9A-F] cannot match an "X". Assuming that's a typo, your next
    problem is a precedence issue: (X)+ means "one or more (X)", not "one or
    more X inside parens". In other words, that pattern matches one or more
    X's and captures the last one.
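The capture behaviour described above can be checked with a short sketch. It assumes Python 3 and a made-up sample string; the names `text`, `capturing`, `whole_run`, `last_item`, and `found` are illustrative and do not come from the original post.

```python
import re

# Hypothetical sample text containing literal backslash-u escapes.
text = r"abcd\u1234\uAA99\u0BC4efg"

# A capturing group repeated with +: the group is overwritten on every
# repetition, so only the last repetition remains in group(1).
capturing = re.compile(r"(\\u[0-9A-F]{4})+")

match = capturing.search(text)
whole_run = match.group(0)   # everything the pattern matched
last_item = match.group(1)   # only the final repetition of the group

# With exactly one capturing group, findall() returns the group contents,
# not the whole match.
found = capturing.findall(text)

print(whole_run)
print(last_item)
print(found)
```

The whole run is still matched; it is only the group that holds a single escape, which is why group(0) and group(1) differ here.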

    Assuming that you want to find runs of \uXXXX escapes, simply use
    non-capturing parentheses:

    pat = re.compile(u"(?:\\\u[0-9A-F]{4})")

    and use group(0) instead of group(1) to get the match.

    </F>

  • Steven D'Aprano

    #2

    On Wed, 17 Sep 2008 15:56:31 +0200, Fredrik Lundh wrote:
    > Assuming that you want to find runs of \uXXXX escapes, simply use
    > non-capturing parentheses:
    >
    > pat = re.compile(u"(?:\\\u[0-9A-F]{4})")

    Doesn't work for me:

    >>> pat = re.compile(u"(?:\\\u[0-9A-F]{4})")
    UnicodeDecodeError: 'unicodeescape' codec can't decode bytes in position
    5-7: truncated \uXXXX escape
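The error comes from the string literal, not from the regex engine: after `\\` is read as one backslash, the remaining `\u[0-9` is taken as an incomplete \uXXXX escape. The sketch below reproduces this under the assumption of Python 3, where every ordinary string literal is Unicode and the failure surfaces as a SyntaxError at parse time rather than the Python 2 UnicodeDecodeError shown above. The helper `literal_compiles` and the variable names are hypothetical.

```python
# Source text of two assignments, held in raw strings so that nothing is
# decoded until compile() parses them.
non_raw_source = r'pattern = "(?:\\\u[0-9A-F]{4})"'
raw_source = r'pattern = r"(?:\\u[0-9A-F]{4})+"'


def literal_compiles(source):
    """Return True if Python can parse the source, False on a SyntaxError."""
    try:
        compile(source, "<demo>", "exec")
        return True
    except SyntaxError:
        return False


non_raw_ok = literal_compiles(non_raw_source)  # truncated \uXXXX escape
raw_ok = literal_compiles(raw_source)          # raw literal leaves \u alone

print(non_raw_ok, raw_ok)
```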


    Assuming that the OP is searching byte strings, I came up with this:
    >>> pat = re.compile('(\\\u[0-9A-F]{4})+')
    >>> pat.search('abcd\\u1234\\uAA99\\u0BC4efg').group(0)
    '\\u1234\\uAA99\\u0BC4'



    --
    Steven


    • Fredrik Lundh

      #3

      Steven D'Aprano wrote:
      >> Assuming that you want to find runs of \uXXXX escapes, simply use
      >> non-capturing parentheses:
      >>
      >> pat = re.compile(u"(?:\\\u[0-9A-F]{4})")
      >
      > Doesn't work for me:
      >
      > >>> pat = re.compile(u"(?:\\\u[0-9A-F]{4})")

      it helps if you cut and paste the right line... here's a better version:

      pat = re.compile(r"(?:\\u[0-9A-F]{4})+")
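As a quick check of the raw-string pattern, the following sketch collects every run of escapes from a made-up input. It assumes Python 3; the sample `text` and the names `runs` and `all_runs` are illustrative only.

```python
import re

# Hypothetical input with two separate runs of literal \uXXXX escapes.
text = r"abcd\u1234\uAA99\u0BC4efg and \u00E9"

# Non-capturing group, repeated: each match is one whole run.
pat = re.compile(r"(?:\\u[0-9A-F]{4})+")

# group(0) is the full match for each run found by finditer().
runs = [m.group(0) for m in pat.finditer(text)]

# With no capturing groups, findall() also returns the full matches.
all_runs = pat.findall(text)

print(runs)
print(all_runs)
```

Because the group is non-capturing, findall() and finditer() agree, which is the result the original poster expected.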

      </F>
