sre is broken in SuSE 9.2

**Serge Orlov** · Jul 18 '05, 09:25 PM

Re: sre is broken in SuSE 9.2

Denis S. Otkidach wrote:
> On all platforms \w matches all unicode letters when used with flag
> re.UNICODE, but this doesn't work on SuSE 9.2:
>
> Python 2.3.4 (#1, Dec 17 2004, 19:56:48)
> [GCC 3.3.4 (pre 3.3.5 20040809)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import re
> >>> re.compile(ur'\w+', re.U).match(u'\xe4')
> >>>
>
> BTW, unicodedata correctly recognizes this character as a lowercase letter:
> >>> import unicodedata
> >>> unicodedata.category(u'\xe4')
> 'Ll'
>
> I've looked through all SuSE patches applied, but found nothing related.
> What is the reason for broken behavior? Incorrect configure options?

I can get the same results on RedHat's python 2.2.3 if I pass re.L
option, it looks like this option is implicitly set in Suse.

Serge
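
For readers following along on a current interpreter, here is a minimal sketch of the difference between locale-driven and Unicode-driven `\w`. It is written for Python 3 (the thread itself uses Python 2), where `re.LOCALE` is only accepted with bytes patterns; the variable names are illustrative. It forces the C locale for the comparison and restores the previous setting afterwards.

```python
import locale
import re

# Remember the current LC_CTYPE setting so it can be restored afterwards.
saved = locale.setlocale(locale.LC_CTYPE)
locale.setlocale(locale.LC_CTYPE, "C")
try:
    # re.LOCALE: \w is decided by the C library for the current locale.
    # Python 3 only accepts this flag together with a bytes pattern.
    locale_match = re.match(rb"\w+", b"\xe4", re.LOCALE)
    # Default str matching: \w is decided by Python's own Unicode database.
    unicode_match = re.match(r"\w+", "\xe4")
finally:
    locale.setlocale(locale.LC_CTYPE, saved)

print("re.LOCALE in the C locale:", locale_match)
print("Unicode matching:", unicode_match)
```

If the locale-driven match fails while the Unicode-driven one succeeds, that mirrors the symptom described above: the result depends on the C library's view of the character rather than on the Unicode database.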

**Denis S. Otkidach** · Jul 18 '05, 09:25 PM

Re: sre is broken in SuSE 9.2

On 10 Feb 2005 03:59:51 -0800
"Serge Orlov" <Serge.Orlov@gmail.com> wrote:

> > On all platforms \w matches all unicode letters when used with flag
> > re.UNICODE, but this doesn't work on SuSE 9.2:
[...]
> I can get the same results on RedHat's python 2.2.3 if I pass re.L
> option, it looks like this option is implicitly set in Suse.

Looks like you are right:

>>> import re
>>> re.compile(ur'\w+', re.U).match(u'\xe4')
>>> from locale import *
>>> setlocale(LC_ALL, 'de_DE')
'de_DE'
>>> re.compile(ur'\w+', re.U).match(u'\xe4')
<_sre.SRE_Match object at 0x40375560>

But I see nothing related to implicit re.L option in their patches and
the sources themselves are the same as on other platforms. I'd prefer
to find the source of problem.

--
Denis S. Otkidach
http://www.python.ru/ [ru]

**Daniel Dittmar** · Jul 18 '05, 09:25 PM

Re: sre is broken in SuSE 9.2

Denis S. Otkidach wrote:

> On all platforms \w matches all unicode letters when used with flag
> re.UNICODE, but this doesn't work on SuSE 9.2:

I think Python on SuSE 9.2 uses UCS4 for unicode strings (as does
RedHat), check sys.maxunicode.

This is not an explanation, but perhaps a hint where to look.

Daniel
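
A quick way to run the check suggested above, as a small sketch (the helper name `unicode_build_width` is made up for illustration). On the Python 2 builds discussed here, `sys.maxunicode` is 0xFFFF for a UCS2 build and 0x10FFFF for a UCS4 build; Python 3.3 and later always report 0x10FFFF.

```python
import sys

def unicode_build_width():
    """Return 'UCS4' for wide builds and 'UCS2' for narrow builds."""
    return "UCS4" if sys.maxunicode > 0xFFFF else "UCS2"

print(hex(sys.maxunicode), unicode_build_width())
```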

**Denis S. Otkidach** · Jul 18 '05, 09:25 PM

Re: sre is broken in SuSE 9.2

On Thu, 10 Feb 2005 16:23:09 +0100
Daniel Dittmar <daniel.dittmar@sap.corp> wrote:

> Denis S. Otkidach wrote:
>
> > On all platforms \w matches all unicode letters when used with flag
> > re.UNICODE, but this doesn't work on SuSE 9.2:
>
> I think Python on SuSE 9.2 uses UCS4 for unicode strings (as does
> RedHat), check sys.maxunicode.
>
> This is not an explanation, but perhaps a hint where to look.

Yes, it uses UCS4. But debian build with UCS4 works fine, so this is
not a problem. Can --with-wctype-functions configure option be the
source of problem?

--
Denis S. Otkidach
http://www.python.ru/ [ru]

**Serge Orlov** · Jul 18 '05, 09:26 PM

Re: sre is broken in SuSE 9.2

Denis S. Otkidach wrote:
> On 10 Feb 2005 03:59:51 -0800
> "Serge Orlov" <Serge.Orlov@gmail.com> wrote:
>
> > > On all platforms \w matches all unicode letters when used with flag
> > > re.UNICODE, but this doesn't work on SuSE 9.2:
> [...]
> > I can get the same results on RedHat's python 2.2.3 if I pass re.L
> > option, it looks like this option is implicitly set in Suse.
>
> Looks like you are right:
>
> >>> import re
> >>> re.compile(ur'\w+', re.U).match(u'\xe4')
> >>> from locale import *
> >>> setlocale(LC_ALL, 'de_DE')
> 'de_DE'
> >>> re.compile(ur'\w+', re.U).match(u'\xe4')
> <_sre.SRE_Match object at 0x40375560>
>
> But I see nothing related to implicit re.L option in their patches
> and the sources themselves are the same as on other platforms. I'd
> prefer to find the source of problem.

I found that

print u'\xc4'.isalpha()
import locale
print locale.getlocale()

produces different results on Suse (python 2.3.3)

False
(None, None)

and RedHat (python 2.2.3)

1
(None, None)

Serge.
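
The single-character check above can be widened into a small scan. This is only a sketch in Python 3 syntax, and the helper name `isalpha_mismatches` is hypothetical: it lists every character in the Latin-1 upper half for which `isalpha()` disagrees with the general category reported by `unicodedata`. On an interpreter that uses its own Unicode database the list should be empty; a build like the one described in this thread would presumably report many entries.

```python
import unicodedata

def isalpha_mismatches(start=0x80, stop=0x100):
    """Return characters whose isalpha() result contradicts unicodedata."""
    mismatches = []
    for code in range(start, stop):
        ch = chr(code)
        # Letters have a general category starting with 'L' (Lu, Ll, Lt, Lm, Lo).
        in_database = unicodedata.category(ch).startswith("L")
        if ch.isalpha() != in_database:
            mismatches.append(ch)
    return mismatches

print(isalpha_mismatches())
```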

**Fredrik Lundh** · Jul 18 '05, 09:26 PM

Re: sre is broken in SuSE 9.2

Denis S. Otkidach wrote:

>> > On all platforms \w matches all unicode letters when used with flag
>> > re.UNICODE, but this doesn't work on SuSE 9.2:
>>
>> I think Python on SuSE 9.2 uses UCS4 for unicode strings (as does
>> RedHat), check sys.maxunicode.
>>
>> This is not an explanation, but perhaps a hint where to look.
>
> Yes, it uses UCS4. But debian build with UCS4 works fine, so this is
> not a problem. Can --with-wctype-functions configure option be the
> source of problem?

yes.

that option disables Python's own Unicode database, and relies on the C library's
wctype.h (iswalpha, etc) to behave properly for Unicode characters. this isn't true
for all environments.
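
One way to see whether the three layers agree is to put them side by side. The following is a rough sketch in Python 3 syntax, not a test from the Python source tree; the names `SAMPLE` and `report` are invented for illustration, and the sample characters are the ones used elsewhere in this thread. For each character it shows the category from the Unicode database, the result of `isalpha()`, and whether `\w` matches.

```python
import re
import unicodedata

# Characters that appear in the examples in this thread.
SAMPLE = "\xb5\xba\xe4\u0430"

def report(chars):
    """Return (code point, category, isalpha, matches \\w) for each character."""
    rows = []
    for ch in chars:
        rows.append((
            "U+%04X" % ord(ch),
            unicodedata.category(ch),
            ch.isalpha(),
            re.match(r"\w", ch) is not None,
        ))
    return rows

for row in report(SAMPLE):
    print(row)
```

If the category column says "letter" while the other two columns say False, the character classification is coming from somewhere other than the Unicode database.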

is this an official SuSE release? do they often release stuff that hasn't been tested
at all?

</F>

**Denis S. Otkidach** · Jul 18 '05, 09:26 PM

Re: sre is broken in SuSE 9.2

On Thu, 10 Feb 2005 17:46:06 +0100
"Fredrik Lundh" <fredrik@pythonware.com> wrote:

> > Can --with-wctype-functions configure option be the
> > source of problem?
>
> yes.
>
> that option disables Python's own Unicode database, and relies on the C library's
> wctype.h (iswalpha, etc) to behave properly for Unicode characters. this isn't true
> for all environments.
>
> is this an official SuSE release? do they often release stuff that hasn't been tested
> at all?

Yes, it's an official release:
# rpm -qi python
Name        : python                        Relocations: (not relocatable)
Version     : 2.3.4                         Vendor: SUSE LINUX AG, Nuernberg, Germany
Release     : 3                             Build Date: Tue Oct 5 02:28:25 2004
Install date: Fri Jan 28 13:53:49 2005      Build Host: gambey.suse.de
Group       : Development/Languages/Python  Source RPM: python-2.3.4-3.src.rpm
Size        : 15108594                      License: Artistic License, Other License(s), see package
Signature   : DSA/SHA1, Tue Oct 5 02:42:38 2004, Key ID a84edae89c800aca
Packager    : http://www.suse.de/feedback
URL         : http://www.python.org/
Summary     : Python Interpreter
<snip>

BTW, where have they found something with Artistic License in Python?

--
Denis S. Otkidach
http://www.python.ru/ [ru]

**Serge Orlov** · Jul 18 '05, 09:26 PM

Re: sre is broken in SuSE 9.2

Denis S. Otkidach wrote:
> On all platforms \w matches all unicode letters when used with flag
> re.UNICODE, but this doesn't work on SuSE 9.2:
>
> Python 2.3.4 (#1, Dec 17 2004, 19:56:48)
> [GCC 3.3.4 (pre 3.3.5 20040809)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import re
> >>> re.compile(ur'\w+', re.U).match(u'\xe4')
> >>>
>
> BTW, unicodedata correctly recognizes this character as a lowercase letter:
> >>> import unicodedata
> >>> unicodedata.category(u'\xe4')
> 'Ll'
>
> I've looked through all SuSE patches applied, but found nothing
> related. What is the reason for broken behavior? Incorrect
> configure options?

To summarize the discussion: either it's a bug in glibc or there is an
option to specify modern POSIX locale. POSIX locale consists of
characters from the portable character set, unicode is certainly
portable.

Serge.

**Peter Maas** · Jul 18 '05, 09:26 PM

Re: sre is broken in SuSE 9.2

Serge Orlov wrote:
> Denis S. Otkidach wrote:
> To summarize the discussion: either it's a bug in glibc or there is an
> option to specify modern POSIX locale. POSIX locale consists of
> characters from the portable character set, unicode is certainly
> portable.

What about the environment variable LANG? I have SuSE 9.1 and
LANG = de_DE.UTF-8. Your example is running well on my computer.

--
-------------------------------------------------------------------
Peter Maas, M+R Infosysteme, D-52070 Aachen, Tel +49-241-93878-0
E-mail 'cGV0ZXIubWFhc0BtcGx1c3IuZGU=\n'.decode('base64')
-------------------------------------------------------------------

**Serge Orlov** · Jul 18 '05, 09:26 PM

Re: sre is broken in SuSE 9.2

Peter Maas wrote:
> Serge Orlov wrote:
> > Denis S. Otkidach wrote:
> > To summarize the discussion: either it's a bug in glibc or there is an
> > option to specify modern POSIX locale. POSIX locale consists of
> > characters from the portable character set, unicode is certainly
> > portable.
>
> What about the environment variable LANG? I have SuSE 9.1 and
> LANG = de_DE.UTF-8. Your example is running well on my computer.

This thread is about problems only with LANG=C or LANG=POSIX, it's not
about other locales. Other locales are working as expected.

Serge.

**Fredrik Lundh** · Jul 18 '05, 09:26 PM

Re: sre is broken in SuSE 9.2

Peter Maas wrote:

>> To summarize the discussion: either it's a bug in glibc or there is an
>> option to specify modern POSIX locale. POSIX locale consists of
>> characters from the portable character set, unicode is certainly
>> portable.
>
> What about the environment variable LANG? I have SuSE 9.1 and
> LANG = de_DE.UTF-8. Your example is running well on my computer.

Python's Unicode subsystem shouldn't depend on the system's LANG
setting.

</F>

**Denis S. Otkidach** · Jul 18 '05, 09:27 PM

Re: sre is broken in SuSE 9.2

On 10 Feb 2005 11:49:33 -0800
"Serge Orlov" <Serge.Orlov@gmail.com> wrote:

> This thread is about problems only with LANG=C or LANG=POSIX, it's not
> about other locales. Other locales are working as expected.

You are not right. I have LANG=de_DE.UTF-8, and the Python test_re.py
doesn't pass. $LANG doesn't matter if I don't call setlocale.
Fortunately setting any non-C locale solves the problem for all (I
believe) unicode characters:

>>> re.compile(ur'\w+', re.U).findall(u'\xb5\xba\xe4\u0430')
[u'\xb5\xba\xe4\u0430']

--
Denis S. Otkidach
http://www.python.ru/ [ru]
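
Since $LANG only takes effect once `setlocale` is called, a program that wants the workaround has to opt in explicitly. Below is a minimal sketch of doing that defensively; the function name `activate_user_locale` is invented for illustration. It asks for the locale named by the environment and falls back to the current setting if that locale is not installed.

```python
import locale

def activate_user_locale():
    """Apply the locale from the environment and return the resulting setting."""
    try:
        # An empty string means "use whatever LANG / LC_* specify".
        return locale.setlocale(locale.LC_ALL, "")
    except locale.Error:
        # The environment names a locale that is not available here;
        # leave the current locale untouched and just report it.
        return locale.setlocale(locale.LC_ALL)

print(activate_user_locale())
```

This only works around the symptom on the affected build; it does not explain why Unicode character classification depends on the locale in the first place.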

**Serge Orlov** · Jul 18 '05, 09:28 PM

Re: sre is broken in SuSE 9.2

Denis S. Otkidach wrote:
> On 10 Feb 2005 11:49:33 -0800
> "Serge Orlov" <Serge.Orlov@gmail.com> wrote:
>
> > This thread is about problems only with LANG=C or LANG=POSIX, it's not
> > about other locales. Other locales are working as expected.
>
> You are not right. I have LANG=de_DE.UTF-8, and the Python test_re.py
> doesn't pass.

I meant "only with C or POSIX locales" when I wrote "only with LANG=C
or LANG=POSIX". My bad.

> $LANG doesn't matter if I don't call setlocale.

Sure.

> Fortunately setting any non-C locale solves the problem for all (I
> believe) unicode characters:
>
> >>> re.compile(ur'\w+', re.U).findall(u'\xb5\xba\xe4\u0430')
> [u'\xb5\xba\xe4\u0430']

I can't find the strict definition of isalpha, but I believe an average
C program shouldn't care about the current locale alphabet, so isalpha
is a union of all supported characters in all alphabets.

Serge.

**Fredrik Lundh** · Jul 18 '05, 09:29 PM

Re: sre is broken in SuSE 9.2

Serge Orlov wrote:

>> >>> re.compile(ur'\w+', re.U).findall(u'\xb5\xba\xe4\u0430')
>> [u'\xb5\xba\xe4\u0430']
>
> I can't find the strict definition of isalpha, but I believe an average
> C program shouldn't care about the current locale alphabet, so isalpha
> is a union of all supported characters in all alphabets.

btw, what does isalpha have to do with this example?

</F>
