sre is broken in SuSE 9.2

  • Denis S. Otkidach

    sre is broken in SuSE 9.2

    On all platforms, \w matches all Unicode letters when used with the
    re.UNICODE flag, but this doesn't work on SuSE 9.2:

    Python 2.3.4 (#1, Dec 17 2004, 19:56:48)
    [GCC 3.3.4 (pre 3.3.5 20040809)] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import re
    >>> re.compile(ur'\w+', re.U).match(u'\xe4')
    >>>

    BTW, unicodedata correctly recognizes this character as a lowercase letter:
    >>> import unicodedata
    >>> unicodedata.category(u'\xe4')
    'Ll'

    I've looked through all the SuSE patches applied, but found nothing related.
    What is the reason for the broken behavior? Incorrect configure options?
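
For anyone who wants to repeat the check on another build, here is a minimal sketch that bundles the two tests above. It assumes a current Python 3 interpreter (the session above is Python 2.3, where the literals would need the u/ur prefixes), and check_word_match is a hypothetical helper name, not something from the report.

```python
import re
import unicodedata


def check_word_match(ch):
    """Return (matched, category) for a single character.

    matched  -- True if a \\w+ pattern compiled with re.UNICODE matches it
    category -- general category from Python's own Unicode database
    """
    matched = re.compile(r'\w+', re.UNICODE).match(ch) is not None
    return matched, unicodedata.category(ch)


# U+00E4 LATIN SMALL LETTER A WITH DIAERESIS, the character from the report.
print(check_word_match('\xe4'))
```

On a build that behaves as described for the other platforms, the first element should be True; on the affected build it would be False while the category is still 'Ll'.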

    --
    Denis S. Otkidach
    http://www.python.ru/ [ru]
  • Serge Orlov

    #2
    Re: sre is broken in SuSE 9.2

    Denis S. Otkidach wrote:
    > On all platforms, \w matches all Unicode letters when used with the
    > re.UNICODE flag, but this doesn't work on SuSE 9.2:
    >
    > Python 2.3.4 (#1, Dec 17 2004, 19:56:48)
    > [GCC 3.3.4 (pre 3.3.5 20040809)] on linux2
    > Type "help", "copyright", "credits" or "license" for more information.
    > >>> import re
    > >>> re.compile(ur'\w+', re.U).match(u'\xe4')
    > >>>
    >
    > BTW, unicodedata correctly recognizes this character as a lowercase letter:
    > >>> import unicodedata
    > >>> unicodedata.category(u'\xe4')
    > 'Ll'
    >
    > I've looked through all the SuSE patches applied, but found nothing related.
    > What is the reason for the broken behavior? Incorrect configure options?

    I can get the same results on RedHat's Python 2.2.3 if I pass the re.L
    option; it looks like this option is implicitly set on SuSE.

    Serge

    • Denis S. Otkidach

      #3
      Re: sre is broken in SuSE 9.2

      On 10 Feb 2005 03:59:51 -0800
      "Serge Orlov" <Serge.Orlov@gmail.com> wrote:

      > > On all platforms, \w matches all Unicode letters when used with the
      > > re.UNICODE flag, but this doesn't work on SuSE 9.2:
      [...]
      > I can get the same results on RedHat's Python 2.2.3 if I pass the re.L
      > option; it looks like this option is implicitly set on SuSE.

      Looks like you are right:

      >>> import re
      >>> re.compile(ur'\w+', re.U).match(u'\xe4')
      >>> from locale import *
      >>> setlocale(LC_ALL, 'de_DE')
      'de_DE'
      >>> re.compile(ur'\w+', re.U).match(u'\xe4')
      <_sre.SRE_Match object at 0x40375560>

      But I see nothing related to an implicit re.L option in their patches,
      and the sources themselves are the same as on other platforms. I'd
      prefer to find the source of the problem.

      --
      Denis S. Otkidach
      http://www.python.ru/ [ru]

      • Daniel Dittmar

        #4
        Re: sre is broken in SuSE 9.2

        Denis S. Otkidach wrote:

        > On all platforms, \w matches all Unicode letters when used with the
        > re.UNICODE flag, but this doesn't work on SuSE 9.2:

        I think Python on SuSE 9.2 uses UCS4 for Unicode strings (as does
        RedHat); check sys.maxunicode.

        This is not an explanation, but perhaps a hint where to look.
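
A quick way to run that check, as a rough sketch: on Python 2 builds sys.maxunicode is 65535 for a narrow (UCS2) build and 1114111 for a wide (UCS4) one. Note that since Python 3.3 the value is always 1114111, so the check is only informative on older interpreters. unicode_build_width is a made-up helper name for illustration.

```python
import sys


def unicode_build_width():
    """Classify the interpreter's Unicode storage from sys.maxunicode."""
    # 0xFFFF means a narrow (UCS2) build; anything larger means UCS4.
    return 'UCS4' if sys.maxunicode > 0xFFFF else 'UCS2'


print(sys.maxunicode, unicode_build_width())
```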

        Daniel

        • Denis S. Otkidach

          #5
          Re: sre is broken in SuSE 9.2

          On Thu, 10 Feb 2005 16:23:09 +0100
          Daniel Dittmar <daniel.dittmar@sap.corp> wrote:

          > Denis S. Otkidach wrote:
          >
          > > On all platforms, \w matches all Unicode letters when used with the
          > > re.UNICODE flag, but this doesn't work on SuSE 9.2:
          >
          > I think Python on SuSE 9.2 uses UCS4 for Unicode strings (as does
          > RedHat); check sys.maxunicode.
          >
          > This is not an explanation, but perhaps a hint where to look.

          Yes, it uses UCS4. But the Debian build with UCS4 works fine, so this is
          not the problem. Could the --with-wctype-functions configure option be
          the source of the problem?

          --
          Denis S. Otkidach
          http://www.python.ru/ [ru]

          • Serge Orlov

            #6
            Re: sre is broken in SuSE 9.2

            Denis S. Otkidach wrote:
            > On 10 Feb 2005 03:59:51 -0800
            > "Serge Orlov" <Serge.Orlov@gmail.com> wrote:
            >
            > > > On all platforms, \w matches all Unicode letters when used with the
            > > > re.UNICODE flag, but this doesn't work on SuSE 9.2:
            > [...]
            > > I can get the same results on RedHat's Python 2.2.3 if I pass the re.L
            > > option; it looks like this option is implicitly set on SuSE.
            >
            > Looks like you are right:
            >
            > >>> import re
            > >>> re.compile(ur'\w+', re.U).match(u'\xe4')
            > >>> from locale import *
            > >>> setlocale(LC_ALL, 'de_DE')
            > 'de_DE'
            > >>> re.compile(ur'\w+', re.U).match(u'\xe4')
            > <_sre.SRE_Match object at 0x40375560>
            >
            > But I see nothing related to an implicit re.L option in their patches,
            > and the sources themselves are the same as on other platforms. I'd
            > prefer to find the source of the problem.

            I found that

            print u'\xc4'.isalpha()
            import locale
            print locale.getlocale()

            produces different results on Suse (python 2.3.3)

            False
            (None, None)


            and RedHat (python 2.2.3)

            1
            (None, None)

            Serge.

            • Fredrik Lundh

              #7
              Re: sre is broken in SuSE 9.2

              Denis S. Otkidach wrote:

              >> > On all platforms, \w matches all Unicode letters when used with the
              >> > re.UNICODE flag, but this doesn't work on SuSE 9.2:
              >>
              >> I think Python on SuSE 9.2 uses UCS4 for Unicode strings (as does
              >> RedHat); check sys.maxunicode.
              >>
              >> This is not an explanation, but perhaps a hint where to look.
              >
              > Yes, it uses UCS4. But the Debian build with UCS4 works fine, so this is
              > not the problem. Could the --with-wctype-functions configure option be
              > the source of the problem?

              yes.

              that option disables Python's own Unicode database, and relies on the C library's
              wctype.h (iswalpha, etc) to behave properly for Unicode characters. this isn't true
              for all environments.
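
One way to see whether a given build was configured with that option, sketched under assumptions: the sysconfig module used here is from later Python versions (on the 2.3 build in this thread the equivalent would be distutils.sysconfig), and built_with_wctype_functions is a hypothetical helper name. It only inspects the configure arguments recorded at build time.

```python
import sysconfig


def built_with_wctype_functions():
    """Return True if the recorded configure arguments mention the option."""
    # CONFIG_ARGS holds the arguments passed to ./configure on POSIX builds;
    # it can be None on platforms that are not built with configure.
    config_args = sysconfig.get_config_var('CONFIG_ARGS') or ''
    return '--with-wctype-functions' in config_args


print(built_with_wctype_functions())
```

A False result only says the option was not recorded; it does not by itself prove the build uses Python's own Unicode database.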

              is this an official SuSE release? do they often release stuff that hasn't been tested
              at all?

              </F>



              • Denis S. Otkidach

                #8
                Re: sre is broken in SuSE 9.2

                On Thu, 10 Feb 2005 17:46:06 +0100
                "Fredrik Lundh" <fredrik@pythonware.com> wrote:

                > > Can --with-wctype-functions configure option be the
                > > source of problem?
                >
                > yes.
                >
                > that option disables Python's own Unicode database, and relies on the C library's
                > wctype.h (iswalpha, etc) to behave properly for Unicode characters. this isn't true
                > for all environments.
                >
                > is this an official SuSE release? do they often release stuff that hasn't been tested
                > at all?

                Yes, it's an official release:
                # rpm -qi python
                Name        : python                         Relocations: (not relocatable)
                Version     : 2.3.4                          Vendor: SUSE LINUX AG, Nuernberg, Germany
                Release     : 3                              Build Date: Tue Oct 5 02:28:25 2004
                Install date: Fri Jan 28 13:53:49 2005       Build Host: gambey.suse.de
                Group       : Development/Languages/Python   Source RPM: python-2.3.4-3.src.rpm
                Size        : 15108594                       License: Artistic License, Other License(s), see package
                Signature   : DSA/SHA1, Tue Oct 5 02:42:38 2004, Key ID a84edae89c800aca
                Packager    : http://www.suse.de/feedback
                URL         : http://www.python.org/
                Summary     : Python Interpreter
                <snip>

                BTW, where did they find anything under the Artistic License in Python?

                --
                Denis S. Otkidach
                http://www.python.ru/ [ru]

                • Serge Orlov

                  #9
                  Re: sre is broken in SuSE 9.2

                  Denis S. Otkidach wrote:
                  > On all platforms, \w matches all Unicode letters when used with the
                  > re.UNICODE flag, but this doesn't work on SuSE 9.2:
                  >
                  > Python 2.3.4 (#1, Dec 17 2004, 19:56:48)
                  > [GCC 3.3.4 (pre 3.3.5 20040809)] on linux2
                  > Type "help", "copyright", "credits" or "license" for more information.
                  > >>> import re
                  > >>> re.compile(ur'\w+', re.U).match(u'\xe4')
                  > >>>
                  >
                  > BTW, unicodedata correctly recognizes this character as a lowercase letter:
                  > >>> import unicodedata
                  > >>> unicodedata.category(u'\xe4')
                  > 'Ll'
                  >
                  > I've looked through all the SuSE patches applied, but found nothing
                  > related. What is the reason for the broken behavior? Incorrect
                  > configure options?

                  To summarize the discussion: either it's a bug in glibc or there is an
                  option to specify a modern POSIX locale. The POSIX locale consists of
                  characters from the portable character set, and Unicode is certainly
                  portable.

                  Serge.

                  • Peter Maas

                    #10
                    Re: sre is broken in SuSE 9.2

                    Serge Orlov wrote:
                    > Denis S. Otkidach wrote:
                    > To summarize the discussion: either it's a bug in glibc or there is an
                    > option to specify a modern POSIX locale. The POSIX locale consists of
                    > characters from the portable character set, and Unicode is certainly
                    > portable.

                    What about the environment variable LANG? I have SuSE 9.1 and
                    LANG = de_DE.UTF-8. Your example is running well on my computer.

                    --
                    -------------------------------------------------------------------
                    Peter Maas, M+R Infosysteme, D-52070 Aachen, Tel +49-241-93878-0
                    E-mail 'cGV0ZXIubWFhc0BtcGx1c3IuZGU=\n'.decode('base64')
                    -------------------------------------------------------------------

                    • Serge Orlov

                      #11
                      Re: sre is broken in SuSE 9.2

                      Peter Maas wrote:
                      > Serge Orlov wrote:
                      > > Denis S. Otkidach wrote:
                      > > To summarize the discussion: either it's a bug in glibc or there is an
                      > > option to specify a modern POSIX locale. The POSIX locale consists of
                      > > characters from the portable character set, and Unicode is certainly
                      > > portable.
                      >
                      > What about the environment variable LANG? I have SuSE 9.1 and
                      > LANG = de_DE.UTF-8. Your example is running well on my computer.

                      This thread is about problems only with LANG=C or LANG=POSIX; it's not
                      about other locales. Other locales are working as expected.

                      Serge.

                      • Fredrik Lundh

                        #12
                        Re: sre is broken in SuSE 9.2

                        Peter Maas wrote:

                        >> To summarize the discussion: either it's a bug in glibc or there is an
                        >> option to specify a modern POSIX locale. The POSIX locale consists of
                        >> characters from the portable character set, and Unicode is certainly
                        >> portable.
                        >
                        > What about the environment variable LANG? I have SuSE 9.1 and
                        > LANG = de_DE.UTF-8. Your example is running well on my computer.

                        Python's Unicode subsystem shouldn't depend on the system's LANG
                        setting.

                        </F>



                        • Denis S. Otkidach

                          #13
                          Re: sre is broken in SuSE 9.2

                          On 10 Feb 2005 11:49:33 -0800
                          "Serge Orlov" <Serge.Orlov@gmail.com> wrote:

                          > This thread is about problems only with LANG=C or LANG=POSIX; it's not
                          > about other locales. Other locales are working as expected.

                          You are not right. I have LANG=de_DE.UTF-8, and the Python test_re.py
                          doesn't pass. $LANG doesn't matter if I don't call setlocale.
                          Fortunately, setting any non-C locale solves the problem for all (I
                          believe) Unicode characters:

                          >>> re.compile(ur'\w+', re.U).findall(u'\xb5\xba\xe4\u0430')
                          [u'\xb5\xba\xe4\u0430']
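
As a script, the workaround might look roughly like the sketch below. Assumptions: it is written for a current Python 3 interpreter, it passes '' to setlocale so the locale is taken from the environment (the session earlier in the thread used 'de_DE' explicitly), and findall_words is a hypothetical helper. On an unaffected build the setlocale call should make no difference to the result.

```python
import locale
import re


def findall_words(text):
    """Return all runs of \\w characters, matched with re.UNICODE."""
    return re.compile(r'\w+', re.UNICODE).findall(text)


try:
    # '' selects the locale named by the environment (LANG, LC_ALL, ...).
    locale.setlocale(locale.LC_ALL, '')
except locale.Error:
    # The environment names a locale that is not installed; keep the default.
    pass

# Micro sign, masculine ordinal, a-umlaut and Cyrillic small a from the post.
print(findall_words('\xb5\xba\xe4\u0430'))
```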

                          --
                          Denis S. Otkidach
                          http://www.python.ru/ [ru]

                          • Serge Orlov

                            #14
                            Re: sre is broken in SuSE 9.2

                            Denis S. Otkidach wrote:
                            > On 10 Feb 2005 11:49:33 -0800
                            > "Serge Orlov" <Serge.Orlov@gmail.com> wrote:
                            >
                            > > This thread is about problems only with LANG=C or LANG=POSIX; it's not
                            > > about other locales. Other locales are working as expected.
                            >
                            > You are not right. I have LANG=de_DE.UTF-8, and the Python test_re.py
                            > doesn't pass.

                            I meant "only with C or POSIX locales" when I wrote "only with LANG=C
                            or LANG=POSIX". My bad.

                            > $LANG doesn't matter if I don't call setlocale.

                            Sure.

                            > Fortunately, setting any non-C locale solves the problem for all (I
                            > believe) Unicode characters:
                            >
                            > >>> re.compile(ur'\w+', re.U).findall(u'\xb5\xba\xe4\u0430')
                            > [u'\xb5\xba\xe4\u0430']

                            I can't find a strict definition of isalpha, but I believe the average
                            C program shouldn't care about the current locale's alphabet, so isalpha
                            should be a union of all supported characters in all alphabets.

                            Serge.

                            • Fredrik Lundh

                              #15
                              Re: sre is broken in SuSE 9.2

                              Serge Orlov wrote:

                              >> >>> re.compile(ur'\w+', re.U).findall(u'\xb5\xba\xe4\u0430')
                              >> [u'\xb5\xba\xe4\u0430']
                              >
                              > I can't find a strict definition of isalpha, but I believe the average
                              > C program shouldn't care about the current locale's alphabet, so isalpha
                              > should be a union of all supported characters in all alphabets.

                              btw, what does isalpha have to do with this example?

                              </F>


