Identifying unicode punctuation characters with Python regex

**Martin v. Löwis** · Nov 14 '08, 10:35 AM

> I'm trying to build a regex in python to identify punctuation
> characters in all the languages. Some regex implementations support an
> extended syntax \p{P} that does just that. As far as I know, python re
> doesn't. Any idea of a possible alternative?

You should use character classes. You can generate them automatically
from the unicodedata module: check whether unicodedata.category(c)
starts with "P".

Regards,
Martin
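
A rough sketch of the category check described above (not Martin's own code). The helper names `is_punctuation` and `punctuation_chars` are made up for illustration, and scanning up to `sys.maxunicode` instead of only the Basic Multilingual Plane is an assumption of this sketch:

```python
import sys
import unicodedata


def is_punctuation(ch):
    """True if ch is in a punctuation category (Pc, Pd, Ps, Pe, Pi, Pf, Po)."""
    return unicodedata.category(ch).startswith("P")


def punctuation_chars(limit=sys.maxunicode + 1):
    """Collect every code point below `limit` whose category starts with 'P'."""
    return "".join(chr(i) for i in range(limit) if is_punctuation(chr(i)))


# Hypothetical name; raw characters, not yet safe to paste into a regex class.
PUNCT = punctuation_chars()

sample = "我是美國人。Hello, world!"
found = [ch for ch in sample if is_punctuation(ch)]
print(found)  # ['。', ',', '!']
```

The resulting string still contains characters such as `\` and `]` that are special inside a character class, so it needs escaping before being wrapped in `[...]`.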

**Shiao** · Nov 14 '08, 10:35 AM

On Nov 14, 11:27 am, "Martin v. Löwis" <mar...@v.loewis.de> wrote:

> > I'm trying to build a regex in python to identify punctuation
> > characters in all the languages. Some regex implementations support an
> > extended syntax \p{P} that does just that. As far as I know, python re
> > doesn't. Any idea of a possible alternative?
>
> You should use character classes. You can generate them automatically
> from the unicodedata module: check whether unicodedata.category(c)
> starts with "P".
>
> Regards,
> Martin

Thanks Martin. I'll do this.

**Mark Tolonen** · Nov 14 '08, 10:45 AM

"Shiao" <multiseed@gmail.com> wrote in message
news:3a95a51c-cc4f-45ff-ae4d-c596c7bfab72@l33g2000pri.googlegroups.com...

> Hello,
> I'm trying to build a regex in python to identify punctuation
> characters in all the languages. Some regex implementations support an
> extended syntax \p{P} that does just that. As far as I know, python re
> doesn't. Any idea of a possible alternative?
>
> Apart from manually including the punctuation character range for each
> and every language, I don't see how this can be done.
>
> Thanks in advance for any suggestions.
>
> John

You can always build your own pattern. Something like (Python 3.0rc2):

    >>> import unicodedata
    >>> Po=''.join(chr(x) for x in range(65536) if unicodedata.category(chr(x)) == 'Po')
    >>> import re
    >>> r=re.compile('['+Po+']')
    >>> x='我是美國人。'
    >>> x
    '我是美國人。'
    >>> r.findall(x)
    ['。']

-Mark

**Mark Tolonen** · Nov 14 '08, 11:35 AM

This was an interesting problem. I needed to escape \ and ] to find all the
punctuation correctly, and it turns out those characters are adjacent in
the Unicode character set, so ] was coincidentally escaped in my first
attempt.

IDLE 3.0rc2

    >>> import re
    >>> import unicodedata as u
    >>> A=''.join(chr(i) for i in range(65536))
    >>> P=''.join(chr(i) for i in range(65536) if u.category(chr(i))[0]=='P')
    >>> len(A)
    65536
    >>> len(P)
    491
    >>> len(re.findall('['+P+']',A)) # ] was naturally escaped
    490
    >>> set(P)-set(re.findall('['+P+']',A)) # so only missing \
    {'\\'}
    >>> P=P.replace('\\','\\\\').replace(']','\\]') # escape both of them.
    >>> len(re.findall('['+P+']',A))
    491

-Mark
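
The escaping effect described in this post can be reproduced on a tiny hand-written class. This is only an illustrative sketch: the three-character sample and the variable names are made up here, not taken from the session above.

```python
import re

text = "!\\]"  # three characters: ! \ ]

# Naive class: the backslash ends up escaping the ] that follows it,
# so the class contains ! and ] but not the backslash itself.
naive = "[" + "!\\]" + "]"  # pattern text: [!\]]
print(re.findall(naive, text))  # ['!', ']']

# Escape the backslash first, then the closing bracket.
members = "!\\]".replace("\\", "\\\\").replace("]", "\\]")
fixed = "[" + members + "]"  # pattern text: [!\\\]]
print(re.findall(fixed, text))  # ['!', '\\', ']']
```

The order of the two `replace` calls matters: escaping `]` first would add backslashes that the second call then doubles.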

**Shiao** · Nov 14 '08, 02:15 PM

Mark,
Many thanks. I feel almost ashamed I got away with it so easily :-)

**jhermann** · Nov 19 '08, 11:45 AM

> P=P.replace('\\','\\\\').replace(']','\\]') # escape both of them.

re.escape() does this without your code having to make any assumptions
about the regex implementation.
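
A minimal sketch of that suggestion, assuming the whole punctuation string is passed through `re.escape()` before being wrapped in brackets. The names `punct` and `PUNCT_RE` are hypothetical, and covering every code point up to `sys.maxunicode` is an assumption that goes beyond the `range(65536)` used earlier in the thread:

```python
import re
import sys
import unicodedata

# All code points whose general category starts with "P".
punct = "".join(
    chr(i)
    for i in range(sys.maxunicode + 1)
    if unicodedata.category(chr(i)).startswith("P")
)

# re.escape() backslash-escapes the characters that are special to re
# (including \, ], [ and -), so no manual replace() calls are needed.
PUNCT_RE = re.compile("[" + re.escape(punct) + "]")

print(PUNCT_RE.findall("我是美國人。 a\\b [c] d-e"))
# ['。', '\\', '[', ']', '-']
```

`re.escape()` is documented for escaping arbitrary literal text in a pattern; using its output inside a character class appears to work in current CPython versions, but that is an observation rather than a documented guarantee.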
