Using re to find unicode ranges

**Paul McGuire** · Sep 29 '08, 02:55 PM

Re: Using re to find unicode ranges

On Sep 29, 8:17 am, Eric Abrahamsen <e...@ericabrahamsen.net> wrote:

> Is it possible to use the re module to find runs of characters within
> a certain Unicode range?
>
> I'm writing a Markdown extension to go over text and wrap blocks of
> consecutive Chinese characters in <span class="char"></span> tags for
> nice styling in an HTML page. The available hooks appear to be a
> pre-processor (which is a "for line in lines" situation) or an inline
> pattern (which uses regular expressions). The regular expression
> solution would be much simpler and faster, but something tells me
> there's no way to use a regex to find character ranges... Chinese
> characters appear to fall between 19968 and 40959 using ord(), and I
> suppose I can go that route if necessary, but I think it would be ugly.
>
> Any hints or suggestions would be appreciated!
>
> Eric

Eric -

This sounds similar to what zhpy
(http://pyparsing.wikispaces.com/WhosUsingPyparsing#Zhpy) does to extract
Chinese words from code, to generate executable English Python. You might
give that a look.

-- Paul

**Mark Tolonen** · Sep 29 '08, 03:05 PM

Re: Using re to find unicode ranges

"Eric Abrahamsen" <eric@ericabrahamsen.net> wrote in message
news:mailman.1674.1222694261.3487.python-list@python.org...

> Is it possible to use the re module to find runs of characters within a
> certain Unicode range?
>
> I'm writing a Markdown extension to go over text and wrap blocks of
> consecutive Chinese characters in <span class="char"></span> tags for
> nice styling in an HTML page. The available hooks appear to be a
> pre-processor (which is a "for line in lines" situation) or an inline
> pattern (which uses regular expressions). The regular expression solution
> would be much simpler and faster, but something tells me there's no way
> to use a regex to find character ranges... Chinese characters appear to
> fall between 19968 and 40959 using ord(), and I suppose I can go that
> route if necessary, but I think it would be ugly.

# coding: utf-8
import re
sample = u'My name is 马克. I am 美国人.'
for n in re.findall(ur'[\u4e00-\u9fff]+', sample):
    print n

output:

马克
美国人

--Mark
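
For anyone reading this on Python 3, where the `ur''` prefix and the `print` statement are gone, the same idea might look roughly like the sketch below. It also checks that the `ord()` bounds from the question (19968 and 40959) are the same code points as the `\u4e00-\u9fff` range in the pattern. The names `CJK_RUN` and `find_cjk_runs` are made up for illustration and are not from the thread. The range covers only the basic CJK Unified Ideographs block, so characters in the extension blocks would not match.

```python
import re

# Same range as the pattern above: U+4E00..U+9FFF (CJK Unified Ideographs).
# CJK_RUN and find_cjk_runs are hypothetical names used only in this sketch.
CJK_RUN = re.compile(r'[\u4e00-\u9fff]+')


def find_cjk_runs(text):
    """Return every run of consecutive characters in U+4E00..U+9FFF."""
    return CJK_RUN.findall(text)


# The ord() bounds quoted in the question are the same code points.
assert (0x4e00, 0x9fff) == (19968, 40959)

sample = 'My name is 马克. I am 美国人.'
for run in find_cjk_runs(sample):
    print(run)
# prints:
# 马克
# 美国人
```

In Python 3 every `str` is Unicode, so no `u` prefix or `# coding` line is needed for the pattern itself; the `re` module interprets the `\uXXXX` escapes inside the raw string.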

**Eric Abrahamsen** · Sep 30 '08, 03:55 AM

Re: Using re to find unicode ranges

On Sep 29, 11:03 pm, "Mark Tolonen" <M8R-yft...@mailinator.com> wrote:

> "Eric Abrahamsen" <e...@ericabrahamsen.net> wrote in message
> news:mailman.1674.1222694261.3487.python-list@python.org...
>
>> Is it possible to use the re module to find runs of characters within a
>> certain Unicode range?
>>
>> I'm writing a Markdown extension to go over text and wrap blocks of
>> consecutive Chinese characters in <span class="char"></span> tags for
>> nice styling in an HTML page. The available hooks appear to be a
>> pre-processor (which is a "for line in lines" situation) or an inline
>> pattern (which uses regular expressions). The regular expression solution
>> would be much simpler and faster, but something tells me there's no way
>> to use a regex to find character ranges... Chinese characters appear to
>> fall between 19968 and 40959 using ord(), and I suppose I can go that
>> route if necessary, but I think it would be ugly.
>
> # coding: utf-8
> import re
> sample = u'My name is 马克. I am 美国人.'
> for n in re.findall(ur'[\u4e00-\u9fff]+', sample):
>     print n

Of course! And obvious, once you point it out. Thanks for the help.

> This sounds similar to what zhpy
> (http://pyparsing.wikispaces.com/WhosUsingPyparsing#Zhpy) does to extract
> Chinese words from code, to generate executable English Python. You might
> give that a look.
> --Mark

Mark - not quite what I'm after here, but pretty interesting
nonetheless...

E
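
To connect the regex back to the original goal of wrapping each run in `<span class="char"></span>`, here is a minimal Python 3 sketch using plain `re.sub`. It is not the Markdown extension's inline-pattern hook, which is not shown in this thread; `wrap_cjk` and its `css_class` parameter are hypothetical names chosen for illustration, and the function assumes `css_class` is a trusted value that needs no HTML escaping.

```python
import re

# Same range as the pattern in the thread: U+4E00..U+9FFF.
CJK_RUN = re.compile(r'[\u4e00-\u9fff]+')


def wrap_cjk(text, css_class='char'):
    """Wrap each run of characters in U+4E00..U+9FFF in a <span> tag.

    Hypothetical helper for illustration; css_class is inserted as-is.
    """
    def _wrap(match):
        # match.group(0) is the whole run of consecutive matching characters.
        return '<span class="{}">{}</span>'.format(css_class, match.group(0))

    return CJK_RUN.sub(_wrap, text)


print(wrap_cjk('My name is 马克.'))
# prints: My name is <span class="char">马克</span>.
```

Because the `+` quantifier matches a whole run at once, each block of consecutive characters gets a single span rather than one span per character.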
