python regex: misbehaviour with "\r" (0x0D) as Newline character in Unicode Mode

  • Arian Sanusi

    Hi,

    According to Unicode, "\n", "\r" and "\r\n" (0x000A, 0x000D and
    0x000D+0x000A) should all be treated as newline characters.
    At least this is how I understand it:
    (http://en.wikipedia.org/wiki/Newline#Unicode)

    Apparently the re module does not care and, on Unix, only treats "\n"
    as a newline character:
    >>> import re
    >>> a = re.compile(u"^a", re.U|re.M)
    >>> a.search(u"bc\ra")
    >>> a.search(u"bc\na")
    <_sre.SRE_Match object at 0xb5908fa8>

    Same thing for $:
    >>> b = re.compile(u"c$", re.U|re.M)
    >>> b.search(u"bc\r\n")
    >>> b.search(u"abc")
    <_sre.SRE_Match object at 0xb5908f70>
    >>> b.search(u"bc\nde")
    <_sre.SRE_Match object at 0xb5908fa8>

    Is this a known bug in the re module? I couldn't find any matching
    issue in the bug tracker.
    Or is this a mistake on my side, and can you guys help me?

    arian

    P.S.: this happens in both Python 2.4 and 2.5.
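
A possible workaround, as a rough sketch only: since ^ and $ in re.M mode seem to look at "\n" alone, the line boundaries can be spelled out by hand with lookarounds. The names BOL and EOL below are made up for illustration and are not part of the re module, and the sketch assumes that only "\n", "\r" and "\r\n" count as line terminators (Unicode lists a few more).

```python
import re

# Illustrative helper patterns (hypothetical names, not provided by re).
# Assumption: only "\n", "\r" and "\r\n" are line terminators.

# Start of line: start of text, just after "\n", or just after a "\r"
# that is not the first half of a "\r\n" pair.
BOL = r"(?:\A|(?<=\n)|(?<=\r)(?!\n))"

# End of line: just before "\r\n", a lone "\r", or the end of text, or
# just before a "\n" that is not the second half of a "\r\n" pair.
EOL = r"(?:(?=\r\n|\r|\Z)|(?<!\r)(?=\n))"

a = re.compile(BOL + u"a")   # stands in for u"^a" with re.M
b = re.compile(u"c" + EOL)   # stands in for u"c$" with re.M

print(a.search(u"bc\ra") is not None)    # "\r" now starts a new line
print(b.search(u"bc\r\n") is not None)   # "c" now ends a line before "\r\n"
```

With these patterns the two searches that returned nothing above should both find a match, while the "\n" cases keep working as before. If pulling the lines apart first is acceptable, unicode.splitlines() followed by a plain search on each line may be simpler, though it also splits on some additional separators.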