convert Unicode to lower/uppercase?

  • Hallvard B Furuseth

    convert Unicode to lower/uppercase?

    Has someone got a Python routine or module which converts Unicode
    strings to lowercase (or uppercase)?

    What I actually need to do is to compare a number of strings in a
    case-insensitive manner, so I assume it's simplest to convert to
    lower/upper first.

    Possibly all strings will be from the latin-1 character set, so I could
    convert to 8-bit latin-1, map to lowercase, and convert back, but that
    seems rather cumbersome.

    --
    Hallvard
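
A minimal sketch of the case-insensitive comparison asked about above, assuming Python 3.3 or later (where `str.casefold()` exists); the thread itself predates that release. The helper name `same_ignoring_case` is made up for illustration.

```python
def same_ignoring_case(a, b):
    """Compare two text strings case-insensitively using full case folding."""
    # casefold() is a more aggressive lower(): it also folds "ß" to "ss".
    return a.casefold() == b.casefold()


print(same_ignoring_case("Straße", "STRASSE"))
print(same_ignoring_case("ÄÖÜ", "äöü"))
```

Folding both sides avoids the round trip through latin-1, and it does not require the two strings to have the same length.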
  • Peter Otten

    #2
    Re: convert Unicode to lower/uppercase?

    nospam wrote:
    > Has someone got a Python routine or module which converts Unicode
    > strings to lowercase (or uppercase)?

    Toiled and came up with:

    >>> print u"abcäöüß".upper()
    ABCÄÖÜß
    >>> u"ABCÄÖÜ".lower()
    u'abc\xe4\xf6\xfc'

    Peter

    • Hallvard B Furuseth

      #3
      Re: convert Unicode to lower/uppercase?

      Thanks!

      --
      Hallvard

      • jallan

        #4
        Re: convert Unicode to lower/uppercase?

        Peter Otten <__peter__@web.de> wrote in message news:<bkepb9$6a4$01$1@news.t-online.com>...
        > nospam wrote:
        >
        > > Has someone got a Python routine or module which converts Unicode
        > > strings to lowercase (or uppercase)?
        >
        > Toiled and came up with:
        >
        > >>> print u"abcäöüß".upper()
        > ABCÄÖÜß
        >
        > >>> u"ABCÄÖÜ".lower()
        > u'abc\xe4\xf6\xfc'
        >
        > Peter

        But that really doesn't work properly. According to Unicode specs and
        German usage the uppercase of "ß" is actually "SS", that is the single
        character "ß" should uppercase to two characters.

        Jim Allan

        • Martin v. Löwis

          #5
          Re: convert Unicode to lower/uppercase?

          jallan wrote:
          > But that really doesn't work properly. According to Unicode specs and
          > German usage the uppercase of "ß" is actually "SS", that is the single
          > character "ß" should uppercase to two characters.

          Can you cite exact chapter and verse of the Unicode specs that say so?
          According to the Unicode database,

          http://www.unicode.org/Public/UNIDATA/UnicodeData.txt

          has neither an uppercase mapping, nor a lowercase mapping.

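          Also, in German, the uppercase mapping of ß is a matter of ongoing debate.
          For example, the Duden from 1919 says

          | Für ß wird in großer Schrift SZ angewandt [...]. Die Verwendung
          | _zweier_ Buchstaben für _einen_ Laut ist nur ein Notbehelf, der
          | aufhören muß, sobald ein geeigneter Druckbuchstabe für das
          | große ß geschaffen ist.

          (In English: "For ß, SZ is used in capital lettering [...]. The use of
          _two_ letters for _one_ sound is only a stopgap, which must end as
          soon as a suitable printed letter for the capital ß has been created.")

          The usage of SZ has only been eliminated in the recent change of
          the amtliche Rechtschreibung (the official orthography).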

          Regards,
          Martin

          • Asun Friere

            #6
            Re: convert Unicode to lower/uppercase?

            "Martin v. Löwis" <martin@v.loewis.de> wrote in message news:<bkkusk$pvi$05$1@news.t-online.com>...
            > The usage of SZ has only been eliminated in the recent change of
            > the amtliche Rechtschreibung.

            And replaced with what? I.e., is there now a single capital for SZ?

            • Gerhard Häring

              #7
              Re: convert Unicode to lower/uppercase?

              Asun Friere wrote:
              > "Martin v. Löwis" <martin@v.loewis.de> wrote in message news:<bkkusk$pvi$05$1@news.t-online.com>...
              >> The usage of SZ has only been eliminated in the recent change of
              >> the amtliche Rechtschreibung.
              >
              > And replaced with what? I.e., is there now a single capital for SZ?

              ß (sz) has not been completely eliminated. After *short* vowels it has
              been replaced with ss (Kuß => Kuss, Fluß => Fluss). But after *long*
              vowels, it is still used (Maß, Gruß, ...).

              -- Gerhard

              PS: I was quite disappointed with the reform of German orthography. I'd
              have favoured much more radical steps, like elimination of
              capitalization of nouns.

              • Peter Otten

                #8
                Re: convert Unicode to lower/uppercase?

                "Martin v. Löwis" wrote:
                > jallan wrote:
                >
                >> But that really doesn't work properly. According to Unicode specs and
                >> German usage the uppercase of "ß" is actually "SS", that is the single
                >> character "ß" should uppercase to two characters.
                >
                > Can you cite exact chapter and verse of the Unicode specs that say so?
                > According to the Unicode database,
                >
                > http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
                >
                > has neither an uppercase mapping, nor a lowercase mapping.

                It seems like UnicodeData.txt does not give the full story. Quoting from
                http://www.unicode.org/Public/UNIDAT...ialCasing.txt:

                [...]
                # (For compatibility, the UnicodeData.txt file only contains case mappings for
                # characters where they are 1-1, and does not have locale-specific mappings.)
                [...]
                # <code>; <lower> ; <title> ; <upper> ; (<condition_list> ;)? # <comment>
                [...]
                # The German es-zed is special--the normal mapping is to SS.
                # Note: the titlecase should never occur in practice. It is equal to
                # titlecase(uppercase(<es-zed>))

                00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S
                [...]

                Thus, to comply with the standard, "ß".upper() --> "SS" is required.
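
The quoted SpecialCasing.txt entry can be read mechanically. The following is a minimal sketch, assuming Python 3 and the field order `<code>; <lower>; <title>; <upper>` given in the quoted header; the function name `parse_special_casing` is hypothetical, and the optional condition list is ignored.

```python
def parse_special_casing(line):
    """Parse one SpecialCasing.txt-style data line into a dict of strings."""
    # Drop the trailing "# comment", then split the semicolon-separated fields.
    data = line.split("#", 1)[0]
    fields = [field.strip() for field in data.split(";")]
    code, lower, title, upper = fields[:4]

    def to_text(field):
        # Each field is a space-separated list of hexadecimal code points.
        return "".join(chr(int(cp, 16)) for cp in field.split())

    return {
        "char": to_text(code),
        "lower": to_text(lower),
        "title": to_text(title),
        "upper": to_text(upper),
    }


entry = parse_special_casing(
    "00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S"
)
print(entry)
```

For this line the uppercase field decodes to the two characters "SS" and the titlecase field to "Ss", which is the one-to-many mapping under discussion.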

                > Also, in German, the uppercase mapping of ß is of ongoing debate.

                My personal impression is that, even before the orthography reform in 1998,
                the SZ variant was seldom used.
                For the "official" rule see http://www.ids-mannheim.de/reform/a2-3.html.

                Peter

                • jallan

                  #9
                  Re: convert Unicode to lower/uppercase?

                  Peter Otten <__peter__@web.de> wrote in message news:<bkm919$ast$01$1@news.t-online.com>...
                  > "Martin v. Löwis" wrote:
                  >
                  > > jallan wrote:
                  > >
                  > >> But that really doesn't work properly. According to Unicode specs and
                  > >> German usage the uppercase of "ß" is actually "SS", that is the single
                  > >> character "ß" should uppercase to two characters.
                  > >
                  > > Can you cite exact chapter and verse of the Unicode specs that say so?
                  > > According to the Unicode database,
                  > >
                  > > http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
                  > >
                  > > has neither an uppercase mapping, nor a lowercase mapping.
                  >
                  > It seems like UnicodeData.txt does not give the full story. Quoting from
                  > http://www.unicode.org/Public/UNIDAT...ialCasing.txt:
                  >
                  > [...]
                  > # (For compatibility, the UnicodeData.txt file only contains case mappings for
                  > # characters where they are 1-1, and does not have locale-specific mappings.)
                  > [...]
                  > # <code>; <lower> ; <title> ; <upper> ; (<condition_list> ;)? # <comment>
                  > [...]
                  > # The German es-zed is special--the normal mapping is to SS.
                  > # Note: the titlecase should never occur in practice. It is equal to
                  > # titlecase(uppercase(<es-zed>))
                  >
                  > 00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S
                  > [...]
                  >
                  > Thus, to comply with the standard, "ß".upper() --> "SS" is required.

                  Yes.

                  Also the Unicode main charts in the annotation for 00DF state:

                  uppercase is "SS"

                  See http://www.unicode.org/charts/PDF/U0080.pdf

                  This note on the character first appeared in Unicode 1.0 (published in
                  1991) and has been in every revision.

                  Unicode 1.0, Volume One also lists this in the lower case to upper
                  case casing tables on page 453.

                  There is nothing new about this casing requirement.

                  A further mention occurs in the Unicode 4.0 specifications in Table
                  4-1 in section 4.2 Case--Normative. See


                  This contains the warning:

                  << Only legacy implementations that cannot handle case mappings that
                  increase string lengths should use UnicodeData case mappings alone. The
                  single-character mappings are insufficient for languages such as
                  German. >>

                  So is Python just another shit legacy implementation?

                  Jim Allan

                  • Martin v. Löwis

                    #10
                    Re: convert Unicode to lower/uppercase?

                    afriere@yahoo.co.uk (Asun Friere) writes:

                    > > The usage of SZ has only been eliminated in the recent change of
                    > > the amtliche Rechtschreibung.
                    >
                    > And replaced with what? I.e., is there now a single capital for SZ?

                    Unfortunately, I don't have a current Duden here, but I *think* you
                    now have to write double-S. There is, of course, the old MASSE vs
                    MASZE issue - I don't know whether this is considered relevant, as
                    capitalization is rare, anyway, and ambiguities can be clarified from
                    the context.

                    Regards,
                    Martin

                    • Martin v. Löwis

                      #11
                      Re: convert Unicode to lower/uppercase?

                      Peter Otten <__peter__@web.de> writes:

                      > # The German es-zed is special--the normal mapping is to SS.
                      > # Note: the titlecase should never occur in practice. It is equal to
                      > # titlecase(uppercase(<es-zed>))
                      >
                      > 00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S
                      > [...]
                      >
                      > Thus, to comply with the standard, "ß".upper() --> "SS" is required.

                      No. It would be required if .upper would claim to implement
                      SpecialCasing - but it makes no such claim.

                      > My personal impression is that, even before the orthography reform in 1998,
                      > the SZ variant was seldom used.

                      There is, of course, the famous "MASSE oder MASZE" example, in particular
                      in the form "WIR TRINKEN BIER IN MASSEN".

                      Regards,
                      Martin

                      • Martin v. Löwis

                        #12
                        Re: convert Unicode to lower/uppercase?

                        jallan@smrtytrek.com (jallan) writes:

                        > So is Python just another shit legacy implementation?

                        Yes :-)

                        Regards,
                        Martin

                        • Asun Friere

                          #13
                          Re: convert Unicode to lower/uppercase?

                          Gerhard Häring <gh@ghaering.de> wrote in message news:<mailman.1064213550.26639.python-list@python.org>...

                          > PS: I was quite disappointed with the reform of German orthography. I'd
                          > have favoured much more radical steps, like elimination of
                          > capitalization of the noun.

                          As an English speaker, who occasionally finds himself trying to
                          decipher German text, let me tell you that little flags like that
                          --"pick me! I'm a noun!" --are actually quite useful.

                          • jallan

                            #14
                            Re: convert Unicode to lower/uppercase?

                            martin@v.loewis.de (Martin v. Löwis) wrote in message news:<m3smmo5zx6.fsf@mira.informatik.hu-berlin.de>...
                            > Peter Otten <__peter__@web.de> writes:
                            >
                            > > # The German es-zed is special--the normal mapping is to SS.
                            > > # Note: the titlecase should never occur in practice. It is equal to
                            > > # titlecase(uppercase(<es-zed>))
                            > >
                            > > 00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S
                            > > [...]
                            > >
                            > > Thus, to comply with the standard, "ß".upper() --> "SS" is required.
                            >
                            > No. It would be required if .upper would claim to implement
                            > SpecialCasing - but it makes no such claim.

                            Of course not. From http://www.python.org/doc/current/li....html#l2h-203:

                            <<
                            upper()
                            Return a copy of the string converted to uppercase.
                            >>

                            This makes no claim about how the magic is done. But there is
                            certainly an implied claim that it is done correctly.

                            Unicode specifications are easily available at
                            http://www.unicode.org/versions/Unicode4.0.0/.

                            At 3.13 is indicated:

                            << The full case mappings for Unicode characters are obtained by using
                            the mappings from SpecialCasing.txt _plus_ the mappings from
                            UnicodeData.txt, excluding any latter mappings that would conflict. >>

                            Case mappings for Unicode require use of SpecialCasing otherwise the
                            results are not in accord with the Unicode standard.
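
The lookup order described in section 3.13 (special mappings first, one-to-one mappings otherwise) can be sketched as follows. This assumes Python 3; `SPECIAL_UPPER` is a tiny hand-written excerpt rather than the real table, and `simple_upper` is a hypothetical stand-in for a UnicodeData.txt-style 1:1 lookup.

```python
# Illustrative subset of one-to-many uppercase mappings (not the full file).
SPECIAL_UPPER = {
    "\u00df": "SS",  # LATIN SMALL LETTER SHARP S
    "\ufb02": "FL",  # LATIN SMALL LIGATURE FL
}


def simple_upper(ch):
    """Emulate a 1:1 mapping: leave ch unchanged if its uppercase would expand."""
    up = ch.upper()
    return up if len(up) == 1 else ch


def full_upper(text):
    """Special mappings take precedence; fall back to the 1:1 mapping."""
    return "".join(SPECIAL_UPPER.get(ch, simple_upper(ch)) for ch in text)


print(simple_upper("\u00df"))  # 1:1 data alone leaves the sharp s as it is
print(full_upper("Gruß"))
```

With only the 1:1 data the sharp s passes through unchanged, which matches the output shown earlier in the thread; adding the special table produces the longer result.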

                            At 4.2 is found:

                            << Only legacy implementations that cannot handle case mappings that
                            increase string lengths should use UnicodeData case mappings alone.
                            The single-character mappings are insufficient for languages such as
                            German >>

                            I don't see any particular reason why Python "cannot handle case
                            mappings that increase string lengths".

                            Unicode again warns that using UnicodeData.txt alone is not
                            sufficient.

                            The text continues on "SpecialCasing.txt":

                            << Contains additional case mappings that map to more than one
                            character, such as "ß" to "SS". >>

                            Section 5.18 Case Mappings goes into further detail about casing
                            issues and specifically mentions:

                            << Case mappings may produce strings of different length than the
                            original. For example the German character U+00DF ß LATIN SMALL LETTER
                            SHARP S expands when uppercased to the sequence of two characters "SS".
                            This also occurs where there is no precomposed character corresponding
                            to a case mapping, such as with U+0149 ŉ LATIN SMALL LETTER N
                            PRECEDED BY APOSTROPHE. >>

                            See also http://www.unicode.org/faq/casemap_charprop-old.html for the
                            Unicode FAQ which contains:

                            <<
                            Q: Why is there no upper-case SHARP S (ß)?

                            A: There are 139 lower-case letters in Unicode 2.1 that have no direct
                            uppercase equivalent. Should there be introduced new bogus characters
                            for all of them, so that when you see an "fl" ligature you can
                            uppercase it to "FL" without expanding anything? Of course not.

                            Note that case conversion is inherently language-sensitive, notably in
                            the case of IPA, which needs to be left strictly alone even when
                            embedded in another language which is being case converted. The best
                            you can get is an approximate fit. [JC]

                            Q: Is all of the Unicode case mapping information in UnicodeData.txt?

                            A: No. The UnicodeData.txt file includes all of the 1:1 case mappings,
                            but doesn't include 1:many mappings such as the one needed for
                            uppercasing ß. Since many parsers now expect this file to have at most
                            single characters in the case mapping fields, an additional file
                            (SpecialCasing.txt) was added to provide the 1:many mappings. For more
                            information, see UTR #21- Case Mappings [MD]
                            >>

                            Python specifications make an implied claim of full support for
                            Unicode and an implied claim that the function upper() uppercases a
                            string properly.

                            The implied combined claim is that Python supports Unicode and
                            supports proper casing in Unicode.

                            This implied claim is false.

                            Truly accurate documentation for upper() should say that it uppercases
                            a string except for those characters where uppercasing would expand a
                            character to more than one character in which circumstance that
                            character is not uppercased or uppercased with loss of data.

                            Python specifications need not say how casing is done, whether by
                            using Unicode tables directly or by using its own methods that
                            accomplish the same results.

                            Users should not have to know such details. They may wish to know
                            where a particular function does not do what might be expected of it.

                            Jim Allan

                            • Peter Otten

                              #15
                              Re: convert Unicode to lower/uppercase?

                              jallan wrote:
                              > I don't see any particular reason why Python "cannot handle case
                              > mappings that increase string lengths".

                              Now that's a long post. I think it essentially boils down to the above
                              statement.

                              Looking into stringobject.c (judging from a first impression,
                              unicodeobject.c has essentially the same algorithm, but with a few
                              indirections):

                              static PyObject *
                              string_upper(PyStringObject *self)
                              {
                                  char *s = PyString_AS_STRING(self), *s_new;
                                  int i, n = PyString_GET_SIZE(self);
                                  PyObject *new;

                                  new = PyString_FromStringAndSize(NULL, n);
                                  if (new == NULL)
                                      return NULL;
                                  s_new = PyString_AsString(new);
                                  for (i = 0; i < n; i++) {
                                      int c = Py_CHARMASK(*s++);
                                      if (islower(c)) {
                                          *s_new = toupper(c);
                                      } else
                                          *s_new = c;
                                      s_new++;
                                  }
                                  return new;
                              }

                              The whole routine builds on the assumption that len(s) == len(s.upper()) and
                              nothing short of a complete rewrite will fix that. But if you volunteer...
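
One possible shape for such a rewrite is a two-pass routine: measure the result first, allocate a buffer of that size, then fill it. The sketch below shows the idea in Python 3 rather than C; `EXPANSIONS`, `map_upper` and `upper_two_pass` are names invented for illustration, and the table holds only a few entries.

```python
# Hypothetical one-to-many table with a few illustrative entries.
EXPANSIONS = {"\u00df": "SS", "\ufb01": "FI", "\ufb02": "FL"}


def map_upper(ch):
    """Return the uppercase form of one character, possibly longer than one."""
    if ch in EXPANSIONS:
        return EXPANSIONS[ch]
    up = ch.upper()
    # Anything not in the table is kept to a 1:1 mapping.
    return up if len(up) == 1 else ch


def upper_two_pass(s):
    # Pass 1: compute the output length instead of assuming it equals len(s).
    size = sum(len(map_upper(ch)) for ch in s)
    out = [""] * size  # preallocated buffer, analogous to the C allocation
    pos = 0
    # Pass 2: write each mapped character into the buffer.
    for ch in s:
        for mapped in map_upper(ch):
            out[pos] = mapped
            pos += 1
    return "".join(out)


print(upper_two_pass("Fluß"))
```

The extra pass costs a second scan of the input but keeps the single-allocation structure of the original loop.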

                              Personally, I think it's a long way to go for a little s, sharp as it may be
                              :-)

                              Peter
