convert Unicode to lower/uppercase?

**Peter Otten** · Jul 18 '05, 02:44 AM

Re: convert Unicode to lower/uppercase?

nospam wrote:
> Has someone got a Python routine or module which converts Unicode
> strings to lowercase (or uppercase)?

Toiled and came up with:
>>> print u"abcäöüß".upper()
ABCÄÖÜß
>>> u"ABCÄÖÜ".lower()
u'abc\xe4\xf6\xfc'

Peter

**Hallvard B Furuseth** · Jul 18 '05, 02:44 AM

Re: convert Unicode to lower/uppercase?

Thanks!

--
Hallvard

**jallan** · Jul 18 '05, 02:48 AM

Re: convert Unicode to lower/uppercase?

Peter Otten <__peter__@web.de> wrote in message news:<bkepb9$6a4$01$1@news.t-online.com>...
> nospam wrote:
>
> > Has someone got a Python routine or module which converts Unicode
> > strings to lowercase (or uppercase)?
>
> Toiled and came up with:
>
> >>> print u"abcäöüß".upper()
> ABCÄÖÜß
>
> >>> u"ABCÄÖÜ".lower()
> u'abc\xe4\xf6\xfc'
>
> Peter

But that really doesn't work properly. According to Unicode specs and
German usage the uppercase of "ß" is actually "SS", that is the single
character "ß" should uppercase to two characters.

Jim Allan

**Martin v. Löwis** · Jul 18 '05, 02:48 AM

Re: convert Unicode to lower/uppercase?

jallan wrote:
> But that really doesn't work properly. According to Unicode specs and
> German usage the uppercase of "ß" is actually "SS", that is the single
> character "ß" should uppercase to two characters.

Can you cite exact chapter and verse of the Unicode specs that say so?
According to the Unicode database,

http://www.unicode.org/Public/UNIDATA/UnicodeData.txt

ß has neither an uppercase mapping nor a lowercase mapping.

Also, in German, the uppercase mapping of ß is a matter of ongoing debate.
For example, the Duden from 1919 says

| In capital letters, SZ is used for ß [...]. Using _two_ letters
| for _one_ sound is only a stopgap, which must end as soon as a
| suitable printing type for the capital ß has been created.

The usage of SZ has only been eliminated in the recent change of
the amtliche Rechtschreibung (the official orthography).

Regards,
Martin

**Asun Friere** · Jul 18 '05, 02:49 AM

Re: convert Unicode to lower/uppercase?

"Martin v. Löwis" <martin@v.loewis.de> wrote in message news:<bkkusk$pvi$05$1@news.t-online.com>...
> The usage of SZ has only been eliminated in the recent change of
> the amtliche Rechtschreibung.

And replaced with what? ie. is there now a single capital for SZ?

**Gerhard Häring** · Jul 18 '05, 02:49 AM

Re: convert Unicode to lower/uppercase?

Asun Friere wrote:
> "Martin v. Löwis" <martin@v.loewis.de> wrote in message news:<bkkusk$pvi$05$1@news.t-online.com>...
>> The usage of SZ has only been eliminated in the recent change of
>> the amtliche Rechtschreibung.
>
> And replaced with what? ie. is there now a single capital for SZ?

ß (sz) has not been completely eliminated. After *short* vowels it has
been replaced with ss (Kuß => Kuss, Fluß => Fluss). But after *long*
vowels, it is still used (Maß, Gruß, ...).

-- Gerhard

PS: I was quite disappointed with the reform of German orthography. I'd
have favoured much more radical steps, like eliminating the
capitalization of nouns.

**Peter Otten** · Jul 18 '05, 02:49 AM

Re: convert Unicode to lower/uppercase?

"Martin v. Löwis" wrote:
> jallan wrote:
>
>> But that really doesn't work properly. According to Unicode specs and
>> German usage the uppercase of "ß" is actually "SS", that is the single
>> character "ß" should uppercase to two characters.
>
> Can you cite exact chapter and verse of the Unicode specs that say so?
> According to the Unicode database,
>
> http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
>
> has neither an uppercase mapping, nor a lowercase mapping.

It seems like UnicodeData.txt does not give the full story. Quoting from

http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt:

[...]
# (For compatibility, the UnicodeData.txt file only contains case mappings for
# characters where they are 1-1, and does not have locale-specific mappings.)
[...]
# <code>; <lower> ; <title> ; <upper> ; (<condition_list> ;)? # <comment>
[...]
# The German es-zed is special--the normal mapping is to SS.
# Note: the titlecase should never occur in practice. It is equal to titlecase(uppercase(<es-zed>))

00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S
[...]

Thus, to comply with the standard, "ß".upper() --> "SS" is required.
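
As a rough illustration of how the quoted entry could be applied, here is a minimal sketch in Python 3 syntax. The helper names `parse_special_casing_line` and `full_upper` are hypothetical, only the single entry quoted above is loaded, and conditional (locale-specific) entries are not handled.

```python
# Sketch only: parse one unconditional SpecialCasing.txt-style entry
# and consult it before falling back to the built-in mapping.
# parse_special_casing_line and full_upper are hypothetical names.

ENTRY = "00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S"


def parse_special_casing_line(line):
    """Return (char, lower, title, upper) as strings."""
    data = line.split("#", 1)[0]  # drop the trailing comment
    fields = [field.strip() for field in data.split(";")]
    code, lower, title, upper = fields[:4]

    def decode(seq):
        # Each field is a space-separated list of hex code points.
        return "".join(chr(int(cp, 16)) for cp in seq.split())

    return decode(code), decode(lower), decode(title), decode(upper)


def full_upper(text, special_upper):
    # Use the 1:many table first, then the built-in per-character mapping.
    return "".join(special_upper.get(ch, ch.upper()) for ch in text)


char, lower, title, upper = parse_special_casing_line(ENTRY)
special_upper = {char: upper}
print(full_upper("Ma\u00df", special_upper))  # prints MASS
```

The table is a plain dict keyed by character, so more entries from the file could be added with the same parser.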
> Also, in German, the uppercase mapping of ß is of ongoing debate.

My personal impression is that, even before the orthography reform in 1998,
the SZ variant was seldom used.
For the "official" rule see http://www.ids-mannheim.de/reform/a2-3.html.

Peter

**jallan** · Jul 18 '05, 02:49 AM

Re: convert Unicode to lower/uppercase?

Peter Otten <__peter__@web.de> wrote in message news:<bkm919$ast$01$1@news.t-online.com>...
> "Martin v. Löwis" wrote:
>
> > jallan wrote:
> >
> >> But that really doesn't work properly. According to Unicode specs and
> >> German usage the uppercase of "ß" is actually "SS", that is the single
> >> character "ß" should uppercase to two characters.
> >
> > Can you cite exact chapter and verse of the Unicode specs that say so?
> > According to the Unicode database,
> >
> > http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
> >
> > has neither an uppercase mapping, nor a lowercase mapping.
>
> It seems like UnicodeData.txt does not give the full story. Quoting from
> http://www.unicode.org/Public/UNIDAT...ialCasing.txt:
>
> [...]
> # (For compatibility, the UnicodeData.txt file only contains case mappings for
> # characters where they are 1-1, and does not have locale-specific mappings.)
> [...]
> # <code>; <lower> ; <title> ; <upper> ; (<condition_list> ;)? # <comment>
> [...]
> # The German es-zed is special--the normal mapping is to SS.
> # Note: the titlecase should never occur in practice. It is equal to titlecase(uppercase(<es-zed>))
>
> 00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S
> [...]
>
> Thus, to comply with the standard, "ß".upper() --> "SS" is required.

Yes.

Also the Unicode main charts in the annotation for 00DF state:

uppercase is "SS"

See http://www.unicode.org/charts/PDF/U0080.pdf

This note on the character first appeared in Unicode 1.0 (published in
1991) and has been in every revision.

Unicode 1.0, Volume One also lists this in the lower case to upper
case casing tables on page 453.

There is nothing new about this casing requirement.

A further mention occurs in the Unicode 4.0 specifications in Table
4-1 in section 4.2 Case--Normative. See

http://www.unicode.org/versions/Unicode4.0.0/ch04.pdf

This contains the warning:

<< Only legacy implementations that cannot handle case mappings that
increase string lengths should use UnicodeData case mappings alone. The
single-character mappings are insufficient for languages such as
German. >>

So is Python just another shit legacy implementation?

Jim Allan

**Martin v. Löwis** · Jul 18 '05, 02:50 AM

Re: convert Unicode to lower/uppercase?

afriere@yahoo.co.uk (Asun Friere) writes:

> > The usage of SZ has only been eliminated in the recent change of
> > the amtliche Rechtschreibung.
>
> And replaced with what? ie. is there now a single capital for SZ?

Unfortunately, I don't have a current Duden here, but I *think* you
now have to write double-S. There is, of course, the old MASSE vs
MASZE issue - I don't know whether this is considered relevant, as
capitalization is rare, anyway, and ambiguities can be clarified from
the context.

Regards,
Martin

**Martin v. Löwis** · Jul 18 '05, 02:50 AM

Re: convert Unicode to lower/uppercase?

Peter Otten <__peter__@web.de> writes:

> # The German es-zed is special--the normal mapping is to SS.
> # Note: the titlecase should never occur in practice. It is equal to titlecase(uppercase(<es-zed>))
>
> 00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S
> [...]
>
> Thus, to comply with the standard, "ß".upper() --> "SS" is required.

No. It would be required if .upper would claim to implement
SpecialCasing - but it makes no such claim.
> My personal impression is that, even before the orthography reform in 1998,
> the SZ variant was seldom used.

There is, of course, the famous "MASSE oder MASZE" example, in particular
in the form "WIR TRINKEN BIER IN MASSEN".

Regards,
Martin

**Martin v. Löwis** · Jul 18 '05, 02:50 AM

Re: convert Unicode to lower/uppercase?

jallan@smrtytrek.com (jallan) writes:

> So is Python just another shit legacy implementation?

Yes :-)

Regards,
Martin

**Asun Friere** · Jul 18 '05, 02:51 AM

Re: convert Unicode to lower/uppercase?

Gerhard Häring <gh@ghaering.de> wrote in message news:<mailman.1064213550.26639.python-list@python.org>...

> PS: I was quite disappointed with the reform of German orthography. I'd
> have favoured much more radical steps, like elimination of
> capitalization of the noun.

As an English speaker, who occasionally finds himself trying to
decipher German text, let me tell you that little flags like that
--"pick me! I'm a noun!" --are actually quite useful.

**jallan** · Jul 18 '05, 02:52 AM

Re: convert Unicode to lower/uppercase?

martin@v.loewis.de (Martin v. Löwis) wrote in message news:<m3smmo5zx6.fsf@mira.informatik.hu-berlin.de>...
> Peter Otten <__peter__@web.de> writes:
>
> > # The German es-zed is special--the normal mapping is to SS.
> > # Note: the titlecase should never occur in practice. It is equal to titlecase(uppercase(<es-zed>))
> >
> > 00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S
> > [...]
> >
> > Thus, to comply with the standard, "ß".upper() --> "SS" is required.
>
> No. It would be required if .upper would claim to implement
> SpecialCasing - but it makes no such claim.

Of course not. From http://www.python.org/doc/current/li....html#l2h-203:

<<
*upper( )*
Return a copy of the string converted to uppercase.
>>

This makes no claim about how the magic is done. But there is
certainly an implied claim that it is done correctly.

Unicode specifications are easily available at
http://www.unicode.org/versions/Unicode4.0.0/.

At 3.13 is indicated:

<< The full case mappings for Unicode characters are obtained by using
the mappings from SpecialCasing.txt _plus_ the mappings from
UnicodeData.txt, excluding any latter mappings that would conflict. >>

Case mappings for Unicode require use of SpecialCasing; otherwise the
results are not in accord with the Unicode standard.
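
The combination rule from 3.13 can be sketched roughly as below, in Python 3 syntax. The two dicts are tiny hand-written stand-ins for tables that would really be parsed from UnicodeData.txt and SpecialCasing.txt, and `build_full_upper` and `upper_full` are hypothetical names.

```python
# Sketch only: merge simple (1:1) and special (1:many) uppercase tables.
# Both tables are illustrative stand-ins, not the real data files.

SIMPLE_UPPER = {
    "a": "A", "b": "B", "c": "C",
    "\u00e4": "\u00c4",  # ä -> Ä
    "\u00f6": "\u00d6",  # ö -> Ö
    "\u00fc": "\u00dc",  # ü -> Ü
}

SPECIAL_UPPER = {
    "\u00df": "SS",  # ß -> SS
    "\ufb02": "FL",  # fl ligature -> FL
}


def build_full_upper(simple, special):
    full = dict(simple)
    # Special entries win, which drops any conflicting simple mapping.
    full.update(special)
    return full


FULL_UPPER = build_full_upper(SIMPLE_UPPER, SPECIAL_UPPER)


def upper_full(text):
    # Characters with no entry in either table are left unchanged.
    return "".join(FULL_UPPER.get(ch, ch) for ch in text)


print(upper_full("abc\u00e4\u00f6\u00fc\u00df"))  # prints ABCÄÖÜSS
```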

At 4.2 is found:

<< Only legacy implementations that cannot handle case mappings that
increase string lengths should use UnicodeData case mappings alone.
The single-character mappings are insufficient for languages such as
German >>

I don't see any particular reason why Python "cannot handle case
mappings that increase string lengths".

Unicode again warns that using UnicodeData.txt alone is not
sufficient.

The text continues on "SpecialCasing.txt":

<< Contains additional case mappings that map to more than one
character, such as "ß" to "SS". >>

Section 5.18 Case Mappings goes into further detail about casing
issues and specifically mentions:

<< Case mappings may produce strings of different length than the
original. For example the German character U+00DF ß LATIN SMALL LETTER
SHARP S expands when uppercased to the sequence of two characters "SS".
This also occurs where there is no precomposed character corresponding
to a case mapping, such as with U+0149 ŉ LATIN SMALL LETTER N
PRECEDED BY APOSTROPHE. >>
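
For anyone who wants to see the length change directly, here is a small check. It assumes a Python 3.3 or later interpreter, where `str.upper()` applies the full case mappings; the interpreter discussed in this thread does not behave this way.

```python
# Assumes Python 3.3+, where str.upper() uses full case mappings.
# Each sample is one code point whose uppercase form is two.
samples = [
    "\u00df",  # LATIN SMALL LETTER SHARP S
    "\u0149",  # LATIN SMALL LETTER N PRECEDED BY APOSTROPHE
    "\ufb02",  # LATIN SMALL LIGATURE FL
]

for ch in samples:
    up = ch.upper()
    # Report the length before and after uppercasing.
    print("U+%04X: %d -> %d code point(s)" % (ord(ch), len(ch), len(up)))
```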

See also http://www.unicode.org/faq/casemap_charprop-old.html for the
Unicode FAQ which contains:

<<
Q: Why is there no upper-case SHARP S (ß)?

A: There are 139 lower-case letters in Unicode 2.1 that have no direct
uppercase equivalent. Should there be introduced new bogus characters
for all of them, so that when you see an "fl" ligature you can
uppercase it to "FL" without expanding anything? Of course not.

Note that case conversion is inherently language-sensitive, notably in
the case of IPA, which needs to be left strictly alone even when
embedded in another language which is being case converted. The best
you can get is an approximate fit. [JC]

Q: Is all of the Unicode case mapping information in UnicodeData.txt?

A: No. The UnicodeData.txt file includes all of the 1:1 case mappings,
but doesn't include 1:many mappings such as the one needed for
uppercasing ß. Since many parsers now expect this file to have at most
single characters in the case mapping fields, an additional file
(SpecialCasing.txt) was added to provide the 1:many mappings. For more
information, see UTR #21- Case Mappings [MD]
>>

Python specifications make an implied claim of full support for
Unicode and an implied claim that the function upper() uppercases a
string properly.

The implied combined claim is that Python supports Unicode and
supports proper casing in Unicode.

This implied claim is false.

Truly accurate documentation for upper() should say that it uppercases
a string except for those characters where uppercasing would expand a
character to more than one character, in which case that character is
either left as it is or uppercased with loss of data.

Python specifications need not say how casing is done, whether by
using Unicode tables directly or by using its own methods that
accomplish the same results.

Users should not have to know such details. They may wish to know
where a particular function does not do what might be expected of it.

Jim Allan

**Peter Otten** · Jul 18 '05, 02:53 AM

Re: convert Unicode to lower/uppercase?

jallan wrote:
> I don't see any particular reason why Python "cannot handle case
> mappings that increase string lengths".

Now that's a long post. I think it essentially boils down to the above
statement.

Looking into stringobject.c (judging from a first impression,
unicodeobject.c has essentially the same algorithm, but with a few
indirections):

    static PyObject *
    string_upper(PyStringObject *self)
    {
        char *s = PyString_AS_STRING(self), *s_new;
        int i, n = PyString_GET_SIZE(self);
        PyObject *new;

        new = PyString_FromStringAndSize(NULL, n);
        if (new == NULL)
            return NULL;
        s_new = PyString_AsString(new);
        for (i = 0; i < n; i++) {
            int c = Py_CHARMASK(*s++);
            if (islower(c)) {
                *s_new = toupper(c);
            } else
                *s_new = c;
            s_new++;
        }
        return new;
    }

The whole routine builds on the assumption that len(s) == len(s.upper()) and
nothing short of a complete rewrite will fix that. But if you volunteer...
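
To make the shape of such a rewrite concrete, here is a minimal sketch in Python 3 syntax rather than C. It is not the CPython code, only an illustration of one possible approach: a first pass computes the output length, so the buffer can still be allocated once, and a second pass fills it. `SPECIAL_UPPER`, `map_upper` and `upper_expanding` are hypothetical names, and the table holds a single illustrative entry.

```python
# Sketch only: two-pass uppercasing that tolerates 1:many mappings.
# SPECIAL_UPPER is an illustrative table, not the full SpecialCasing data.
SPECIAL_UPPER = {"\u00df": "SS"}


def map_upper(ch):
    # The built-in per-character mapping stands in for the simple table.
    return SPECIAL_UPPER.get(ch, ch.upper())


def upper_expanding(s):
    # Pass 1: the output may be longer than the input, so measure it.
    n_out = sum(len(map_upper(ch)) for ch in s)
    out = [None] * n_out  # preallocated once, as in the C routine
    j = 0
    # Pass 2: write every mapped character into the buffer.
    for ch in s:
        for mapped in map_upper(ch):
            out[j] = mapped
            j += 1
    return "".join(out)


print(upper_expanding("Gru\u00df"))  # prints GRUSS
```

The price is that every character is mapped twice; growing the buffer on demand would be the other obvious option.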

Personally, I think it's a long way to go for a little s, sharp as it may be :-)

Peter
