Re: std::string vs. Unicode UTF-8
On Wed, 28 Sep 2005 08:28:13 +0200, Mirek Fidler <cxl@volny.cz> wrote:
>> Well, 'wchar_t' and 'wstring' were designed at a time when Unicode
>> was still pretending that they use 16-bit characters and that each
>> Unicode character consists of a single 16-bit character. Neither of
>> these two properties holds: Unicode is [currently] a 20-bit encoding
>> and a Unicode character can consist of multiple such 20-bit entities
>                                        ^^^^^^^^^^^^^^^
>
> 16-bit?
From the Unicode Technical Introduction:
"In all, the Unicode Standard, Version 4.0 provides codes for 96,447
characters from the world's alphabets, ideograph sets, and symbol
collections...The majority of common-use characters fit into the first 64K
code points, an area of the codespace that is called the basic multilingual
plane, or BMP for short. There are about 6,300 unused code points for future
expansion in the BMP, plus over 870,000 unused supplementary code points on
the other planes...The Unicode Standard also reserves code points for private
use. Vendors or end users can assign these internally for their own characters
and symbols, or use them with specialized fonts. There are 6,400 private use
code points on the BMP and another 131,068 supplementary private use code
points, should 6,400 be insufficient for particular applications."
Given that the code space for Unicode is clearly larger than 16 bits, the
following statement confirms that a 32-bit integer is more than enough to
represent any Unicode character:
"UTF-32 is popular where memory space is no concern, but fixed width, single
code unit access to characters is desired. Each Unicode character is encoded
in a single 32-bit code unit when using UTF-32."
-dr