Python and Cyrillic characters in regular expression

**Fredrik Lundh** · Sep 4 '08, 05:59 PM

Re: Python and Cyrillic characters in regular expression

phasma wrote:

Hi, I'm trying to extract all alphabetic characters from a string.

reg = re.compile('(?u)([\w\s]+)', re.UNICODE)
buf = reg.match(string)

But it doesn't work. If the string starts with a Cyrillic character,
everything works fine. But if the string starts with a Latin character,
match returns only the Latin characters.

can you provide a few sample strings that show this behaviour?

</F>

**phasma** · Sep 5 '08, 11:35 AM

Re: Python and Cyrillic characters in regular expression

string = u"Привет"
(u'\u041f\u0440\u0438\u0432\u0435\u0442',)

string = u"Hi.Привет"
(u'Hi',)


**MRAB** · Sep 5 '08, 02:35 PM

Re: Python and Cyrillic characters in regular expression

On Sep 5, 12:28 pm, phasma <xpa...@gmail.com> wrote:

string = u"Привет"

All the characters are letters.

(u'\u041f\u0440\u0438\u0432\u0435\u0442',)

string = u"Hi.Привет"

The third character isn't a letter and isn't whitespace.

(u'Hi',)

**Fredrik Lundh** · Sep 5 '08, 05:55 PM

Re: Python and Cyrillic characters in regular expression

phasma wrote:

string = u"Привет"
(u'\u041f\u0440\u0438\u0432\u0435\u0442',)

string = u"Hi.Привет"
(u'Hi',)

the [\w\s] pattern you used matches letters, numbers, underscore, and
whitespace. "." doesn't fall into that category, so the "match" method
stops when it gets to that character.
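To see where the match stops, here is a minimal sketch in Python 3 syntax (the thread itself uses Python 2). The variable names are made up for illustration, and the sample strings are the two from the thread written with escapes. In Python 3, str patterns are Unicode-aware by default, so the re.UNICODE flag is kept only to mirror the original code.

```python
import re

# Same character class as in the question: word characters and whitespace.
pattern = re.compile(r"[\w\s]+", re.UNICODE)

cyrillic_only = "\u041f\u0440\u0438\u0432\u0435\u0442"   # "Привет"
mixed = "Hi." + cyrillic_only                             # "Hi.Привет"

# match() starts at position 0 and consumes characters only while
# each one belongs to the class.
m1 = pattern.match(cyrillic_only)
m2 = pattern.match(mixed)

print(m1.group() == cyrillic_only)  # True: every character is a letter
print(m2.group())                   # Hi
print(m2.end())                     # 2, the index of the "." that ended the match
```

So the script of the first character is not what matters; the match simply ends at the first character that is neither a word character nor whitespace.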

maybe you could use re.sub or re.findall?

>>> # replace all non-alphanumerics with the empty string
>>> re.sub("(?u)\W+", "", string)
u'Hi\u041f\u0440\u0438\u0432\u0435\u0442'

>>> # find runs of alphanumeric characters
>>> re.findall("(?u)\w+", string)
[u'Hi', u'\u041f\u0440\u0438\u0432\u0435\u0442']

>>> "".join(re.findall("(?u)\w+", string))
u'Hi\u041f\u0440\u0438\u0432\u0435\u0442'

(the "sub" example expects you to specify what characters you want to
skip, while "findall" expects you to specify what you want to keep.)

</F>
