Python and Cyrillic characters in regular expression

This topic is closed.
  • phasma

    Python and Cyrillic characters in regular expression

    Hi, I'm trying to extract all alphabetic characters from a string.

    reg = re.compile('(?u)([\w\s]+)', re.UNICODE)
    buf = reg.match(string)

    But it doesn't work. If the string starts with a Cyrillic character,
    everything works fine. But if the string starts with a Latin character,
    match returns only the Latin characters.

    Please, help.
  • Fredrik Lundh

    #2
    Re: Python and Cyrillic characters in regular expression

    phasma wrote:
    > Hi, I'm trying to extract all alphabetic characters from a string.
    >
    > reg = re.compile('(?u)([\w\s]+)', re.UNICODE)
    > buf = reg.match(string)
    >
    > But it doesn't work. If the string starts with a Cyrillic character,
    > everything works fine. But if the string starts with a Latin character,
    > match returns only the Latin characters.

    can you provide a few sample strings that show this behaviour?

    </F>


    • phasma

      #3
      Re: Python and Cyrillic characters in regular expression

      string = u"Привет"
      (u'\u041f\u0440\u0438\u0432\u0435\u0442',)

      string = u"Hi.Привет"
      (u'Hi',)

      On Sep 4, 9:53 pm, Fredrik Lundh <fred...@pythonware.com> wrote:
      > phasma wrote:
      > > Hi, I'm trying to extract all alphabetic characters from a string.
      > >
      > > reg = re.compile('(?u)([\w\s]+)', re.UNICODE)
      > > buf = reg.match(string)
      > >
      > > But it doesn't work. If the string starts with a Cyrillic character,
      > > everything works fine. But if the string starts with a Latin character,
      > > match returns only the Latin characters.
      >
      > can you provide a few sample strings that show this behaviour?
      >
      > </F>


      • MRAB

        #4
        Re: Python and Cyrillic characters in regular expression

        On Sep 5, 12:28 pm, phasma <xpa...@gmail.com> wrote:
        > string = u"Привет"

        All the characters are letters.

        > (u'\u041f\u0440\u0438\u0432\u0435\u0442',)
        >
        > string = u"Hi.Привет"

        The third character isn't a letter and isn't whitespace.

        > (u'Hi',)


        • Fredrik Lundh

          #5
          Re: Python and Cyrillic characters in regular expression

          phasma wrote:
          > string = u"Привет"
          > (u'\u041f\u0440\u0438\u0432\u0435\u0442',)
          >
          > string = u"Hi.Привет"
          > (u'Hi',)

          the [\w\s] pattern you used matches letters, numbers, underscore, and
          whitespace. "." doesn't fall into that category, so the "match" method
          stops when it gets to that character.
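
A minimal sketch of that behaviour, assuming Python 3 (where str patterns are Unicode-aware by default, so neither `(?u)` nor `re.UNICODE` is needed). The helper name `leading_run` is hypothetical and only for illustration; the character class is the one from the question.

```python
import re

# Same character class as in the question: word characters and whitespace.
pattern = re.compile(r"([\w\s]+)")

def leading_run(text):
    """Return the run matched at the very start of text, or None."""
    # match() only tries position 0 and stops at the first character
    # that is outside the class, such as ".".
    m = pattern.match(text)
    return m.group(1) if m else None

print(leading_run("Привет"))      # whole string: every character is a letter
print(leading_run("Hi.Привет"))   # only "Hi": the "." ends the match
print(leading_run("Hi Привет"))   # whole string: the space is in the class
```

With a space instead of a period, the match covers the whole string, which suggests the script of the first character is not what matters here.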

          maybe you could use re.sub or re.findall?

          >>> # replace all non-alphanumerics with the empty string
          >>> re.sub("(?u)\W+", "", string)
          u'Hi\u041f\u0440\u0438\u0432\u0435\u0442'
          >>> # find runs of alphanumeric characters
          >>> re.findall("(?u)\w+", string)
          [u'Hi', u'\u041f\u0440\u0438\u0432\u0435\u0442']
          >>> "".join(re.findall("(?u)\w+", string))
          u'Hi\u041f\u0440\u0438\u0432\u0435\u0442'

          (the "sub" example expects you to specify what characters you want to
          skip, while "findall" expects you to specify what you want to keep.)

          </F>
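
Since the original question asked for alphabetic characters only, and `\w` also matches digits and the underscore, the following is a hedged sketch of one possible variation, assuming Python 3. The function names are hypothetical. The class `[^\W\d_]` means word characters minus decimal digits and underscore, which is only an approximation of "letters" and is not something proposed in the thread.

```python
import re

def alphanumerics_only(text):
    # "Skip" approach from the thread: delete every run of non-word characters.
    return re.sub(r"\W+", "", text)

def letters_only(text):
    # "Keep" approach, narrowed: word characters that are neither
    # decimal digits nor the underscore (an approximation of letters).
    return "".join(re.findall(r"[^\W\d_]+", text))

sample = "Hi.Привет 2008_"
print(alphanumerics_only(sample))  # keeps letters, digits and underscore
print(letters_only(sample))        # drops the digits and underscore as well
```

If exact Unicode letter semantics matter, filtering with `str.isalpha()` character by character may be a simpler choice than a regular expression.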
