std::string vs. Unicode UTF-8

This topic is closed.
  • Dave Rahardja

    #16
    Re: std::string vs. Unicode UTF-8

    On Wed, 28 Sep 2005 08:28:13 +0200, Mirek Fidler <cxl@volny.cz > wrote:

    >> Well, 'wchar_t' and 'wstring' were designed at a time when Unicode
    >> was still pretending that they use 16-bit characters and that each
    >> Unicode character consists of a single 16-bit character. Neither of
    >> these two properties holds: Unicode is [currently] a 20-bit encoding
    >> and a Unicode character can consist of multiple such 20-bit entities
    > ^^^^^^^^^^^^^^^ ^
    >
    > 16-bit?

    From the Unicode Technical Introduction:

    "In all, the Unicode Standard, Version 4.0 provides codes for 96,447
    characters from the world's alphabets, ideograph sets, and symbol
    collections... The majority of common-use characters fit into the first 64K
    code points, an area of the codespace that is called the basic multilingual
    plane, or BMP for short. There are about 6,300 unused code points for future
    expansion in the BMP, plus over 870,000 unused supplementary code points on
    the other planes...The Unicode Standard also reserves code points for private
    use. Vendors or end users can assign these internally for their own characters
    and symbols, or use them with specialized fonts. There are 6,400 private use
    code points on the BMP and another 131,068 supplementary private use code
    points, should 6,400 be insufficient for particular applications."

    Despite the indication that the code space for Unicode is larger than
    16 bits can address, the following statement suggests that a 32-bit integer
    is more than enough to represent all Unicode characters:

    "UTF-32 is popular where memory space is no concern, but fixed width, single
    code unit access to characters is desired. Each Unicode character is encoded
    in a single 32-bit code unit when using UTF-32."



    -dr


    • Pete Becker

      #17
      Re: std::string vs. Unicode UTF-8

      Dietmar Kuehl wrote:
      > Pete Becker wrote:
      >
      >> That's unfortunate, since it's exactly what wchar_t and wstring were
      >> designed for. What is your objection to them?
      >
      > Well, 'wchar_t' and 'wstring' were designed at a time when Unicode
      > was still pretending that they use 16-bit characters and that each
      > Unicode character consists of a single 16-bit character. Neither of
      > these two properties holds: Unicode is [currently] a 20-bit encoding
      > and a Unicode character can consist of multiple such 20-bit entities
      > for combining characters.

      Well, true, but wchar_t can certainly be large enough to hold 20 bits.
      And the claim from the Unicode folks is that that's all you need.

      --

      Pete Becker
      Dinkumware, Ltd. (http://www.dinkumware.com)

      ---
      [ comp.std.c++ is moderated. To submit articles, try just posting with ]
      [ your news-reader. If that fails, use mailto:std-c++@ncar.ucar.edu ]
      [ --- Please see the FAQ before posting. --- ]
      [ FAQ: http://www.jamesd.demon.co.uk/csc/faq.html ]


      • Jonathan Coxhead

        #18
        Re: std::string vs. Unicode UTF-8

        Pete Becker wrote:
        > Dietmar Kuehl wrote:
        >
        >> Pete Becker wrote:
        >>
        >>> That's unfortunate, since it's exactly what wchar_t and wstring were
        >>> designed for. What is your objection to them?
        >>
        >> Well, 'wchar_t' and 'wstring' were designed at a time when Unicode
        >> was still pretending that they use 16-bit characters and that each
        >> Unicode character consists of a single 16-bit character. Neither of
        >> these two properties holds: Unicode is [currently] a 20-bit encoding
        >> and a Unicode character can consist of multiple such 20-bit entities
        >> for combining characters.
        >
        > Well, true, but wchar_t can certainly be large enough to hold 20 bits.
        > And the claim from the Unicode folks is that that's all you need.

        Actually, you need 21 bits. There are 0x11 planes with 0x10000 characters in
        each, so 0x110000 characters. This space is completely flat, though it has
        holes. Or, you can use UTF-16, where a character is encoded as 1 or 2 16-bit
        values, so in C counts as neither a wide-character encoding nor a multibyte
        encoding. (It might be a "multishort" encoding, if such a thing existed.) Or you
        can use UTF-8, which is a true multibyte encoding. The translation between these
        representations is purely algorithmic.

        Anyway, 20 bits: not enough.


        • kanze

          #19
          Re: std::string vs. Unicode UTF-8

          Pete Becker wrote:
          > Dietmar Kuehl wrote:
          > > Pete Becker wrote:
          > >> That's unfortunate, since it's exactly what wchar_t and
          > >> wstring were designed for. What is your objection to them?
          > > Well, 'wchar_t' and 'wstring' were designed at a time when
          > > Unicode was still pretending that they use 16-bit characters
          > > and that each Unicode character consists of a single 16-bit
          > > character. Neither of these two properties holds: Unicode is
          > > [currently] a 20-bit encoding and a Unicode character can
          > > consist of multiple such 20-bit entities for combining
          > > characters.

          (If you have 20 or more bits, there's no need for the combining
          characters; they're only present to allow representing character
          codes larger than 0xFFFF as two 16-bit characters.)

          > Well, true, but wchar_t can certainly be large enough to hold
          > 20 bits. And the claim from the Unicode folks is that that's
          > all you need.

          I think the point is that when wchar_t was introduced, it wasn't
          obvious that Unicode was the solution, and Unicode at the time
          was only 16 bits anyway. Given that, vendors have defined
          wchar_t in a variety of ways. And given that vendors want to
          support their existing code bases, that really won't change,
          regardless of what the standard says.

          Given this, there is definite value in leaving wchar_t as it is
          (which is pretty unusable in portable code), and defining a new
          type which is guaranteed to be Unicode. (This is, I believe,
          the route C is taking; there's probably some value in remaining
          C compatible here as well.)

          --
          James Kanze GABI Software
          Conseils en informatique orientée objet/
          Beratung in objektorientierter Datenverarbeitung
          9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34



          • Dave Rahardja

            #20
            Re: std::string vs. Unicode UTF-8

            On Fri, 30 Sep 2005 23:41:35 CST, "kanze" <kanze@gabi-soft.fr> wrote:
            >> Well, true, but wchar_t can certainly be large enough to hold
            >> 20 bits. And the claim from the Unicode folks is that that's
            >> all you need.
            >
            > I think the point is that when wchar_t was introduced, it wasn't
            > obvious that Unicode was the solution, and Unicode at the time
            > was only 16 bits anyway. Given that, vendors have defined
            > wchar_t in a variety of ways. And given that vendors want to
            > support their existing code bases, that really won't change,
            > regardless of what the standard says.
            >
            > Given this, there is definite value in leaving wchar_t as it is
            > (which is pretty unusable in portable code), and defining a new
            > type which is guaranteed to be Unicode. (This is, I believe,
            > the route C is taking; there's probably some value in remaining
            > C compatible here as well.)

            I think wchar_t is fine the way it is defined:

            (3.9.1/5)
            Type wchar_t is a distinct type whose values can represent distinct codes for
            all members of the largest extended character set specified among the
            supported locales (22.1.1). Type wchar_t shall have the same size, signedness,
            and alignment requirements (3.9) as one of the other integral types, called
            its underlying type.

            What we need is a Unicode locale! ;-)

            -dr


            • Richard Kettlewell

              #21
              Re: std::string vs. Unicode UTF-8

              "kanze" <kanze@gabi-soft.fr> writes:[color=blue]
              > (If you have 20 or more bits, there's no need for the combining
              > characters; there only present to allow representing character codes
              > larger than 0xFFFF as two 16 bit characters.)[/color]

              I believe you are thinking of surrogates, rather than combining
              characters, here. The need (or otherwise) for the latter is
              independent of representation.

              --
              The front cover to my personal web site.



              • P.J. Plauger

                #22
                Re: std::string vs. Unicode UTF-8

                "kanze" <kanze@gabi-soft.fr> wrote in message
                news:1127985061.409082.75870@g49g2000cwa.googlegroups.com...

                > I think the point is that when wchar_t was introduced, it wasn't
                > obvious that Unicode was the solution, and Unicode at the time
                > was only 16 bits anyway. Given that, vendors have defined
                > wchar_t in a variety of ways. And given that vendors want to
                > support their existing code bases, that really won't change,
                > regardless of what the standard says.
                >
                > Given this, there is definite value in leaving wchar_t as it is
                > (which is pretty unusable in portable code), and defining a new
                > type which is guaranteed to be Unicode. (This is, I believe,
                > the route C is taking; there's probably some value in remaining
                > C compatible here as well.)

                Right, there's a (non-normative) Technical Report that defines
                16- and 32-bit character types independent of wchar_t. We'll
                be shipping it as part of our next release, along with a slew
                of code conversions you can use with these new types.

                P.J. Plauger
                Dinkumware, Ltd.




                • kanze

                  #23
                  Re: std::string vs. Unicode UTF-8

                  Richard Kettlewell wrote:
                  > "kanze" <kanze@gabi-soft.fr> writes:
                  > > (If you have 20 or more bits, there's no need for the
                  > > combining characters; they're only present to allow
                  > > representing character codes larger than 0xFFFF as two
                  > > 16-bit characters.)

                  > I believe you are thinking of surrogates, rather than
                  > combining characters, here. The need (or otherwise) for the
                  > latter is independent of representation.

                  I was definitely talking about surrogates. And it is possible to
                  represent any Unicode character in UTF-32 without the use of
                  surrogates; they are only necessary in UTF-16.

                  --
                  James Kanze GABI Software
                  Conseils en informatique orientée objet/
                  Beratung in objektorientierter Datenverarbeitung
                  9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34



                  • Sean Parent

                    #24
                    Re: std::string vs. Unicode UTF-8

                    A few comments on this thread -

                    Unicode has been 21 bits since its inception, at least it was 21 bits by
                    the time Unicode 1.0 came out (I worked with Eric Mader, Dave Opstad, and
                    Mark Davis at Apple <http://www.unicode.org/history/>). Although I've heard
                    grumblings that people would like to extend it to include pages for more
                    dead languages.

                    UCS-2 is a subset of Unicode that fits in 16 bits without double word
                    encoding. It is part of ISO 10646, which also defines UCS-4, which for all
                    practical purposes is the same encoding as UTF-32 (there's a document on the
                    relationship on the unicode.org site). UTF-16 and UTF-32 both have endian
                    variants.

                    Operations such as "the number of characters in a string" has very little
                    meaning - there is no direct relationship between characters and glyphs,
                    there are combining characters (not the same as a multi-byte or word
                    encoding). Even if defined as the number of Unicode code points in a string,
                    it isn't particularly interesting.

                    Operations such as string catenation, sub-string searching, upper-case to
                    lower-case conversion, and collation are all non-trivial on a Unicode string
                    regardless of the encoding.

                    I think the current string classes and codecvt functionality in the language
                    is pretty decent (I would have preferred if wchar_t had been nailed to 32
                    bits, or even 16 bits... but that will be somewhat addressed). I'd like to
                    see the complexity of the current string classes specified - and I think a
                    lightweight copy (constant time) is needed - but I think move semantics will
                    address this. I also think it would be good to mark strings with their
                    encoding, because it is too easy to end up with Mojibake
                    <http://en.wikipedia.org/wiki/Mojibake>, but I don't think this requires a
                    whole new string class (I honestly don't think there is such a thing as a
                    one-size-fits-all string class).

                    I'd love to see the functionality of the IBM ICU libraries
                    <http://www-306.ibm.com/software/globalization/icu/index.jsp>, although I'm
                    not a fan of the ICU C++ interface (as I mentioned above, I don't see a
                    need for a new string class; I'd like ICU rethought as generic algorithms
                    that work regardless of the string representation).

                    Beyond that, I'd like to work towards a standard markup - strings require
                    more information than just their encoding to really be handled properly. You
                    need to know which sections of a string are in which language (which can't
                    be determined completely from the characters used) - items such as gender,
                    plurality, and formal forms all play a part in doing proper operations such
                    as replacements. The ASL xstring glossary library is a step in this
                    direction: <http://opensource.adobe.com/group__asl__xstring.html>.

                    --
                    Sean Parent
                    Sr. Engineering Manager
                    Software Technology Lab
                    Adobe Systems Incorporated
                    sparent@adobe.com


                    • Niklas Matthies

                      #25
                      Re: std::string vs. Unicode UTF-8

                      On 2005-10-04 04:00, kanze wrote:
                      > I was definitely talking about surrogates. And it is possible to
                      > represent any Unicode character in UTF-32 without the use of
                      > surrogates;

                      It's even necessary, because surrogate code points outside of UTF-16
                      are non-conformant and cause the corresponding byte or code point
                      sequences to be ill-formed.

                      -- Niklas Matthies


                      • kuyper@wizard.net

                        #26
                        Re: std::string vs. Unicode UTF-8

                        kanze wrote:
                        > Richard Kettlewell wrote:
                        > > "kanze" <kanze@gabi-soft.fr> writes:
                        > > > (If you have 20 or more bits, there's no need for the
                        > > > combining characters; they're only present to allow
                        > > > representing character codes larger than 0xFFFF as two
                        > > > 16-bit characters.)
                        >
                        > > I believe you are thinking of surrogates, rather than
                        > > combining characters, here. The need (or otherwise) for the
                        > > latter is independent of representation.
                        >
                        > I was definitely talking about surrogates. And it is possible to
                        > represent any Unicode character in UTF-32 without the use of
                        > surrogates; they are only necessary in UTF-16.

                        As the Unicode documents themselves point out, what a reader would
                        consider to be a single character is often represented in Unicode as
                        the combination of several Unicode characters. Can an implementation
                        use UTF-32 encoding for wchar_t, and meet all of the requirements of
                        the C standard with respect to wchar_t, when combined characters are
                        involved? I think you can meet those requirements only by interpreting
                        every reference in the C standard to a wide "character" as referring to
                        a "Unicode character" rather than to what end users would
                        consider a character.

                        If search_string ends with an uncombined character, and target_string
                        contains the exact same sequence of wchar_t values followed by one or
                        more combining characters, I believe that wcsstr(search_string,
                        target_string) is supposed to report a match. That strikes me as
                        problematic.


                        • kuyper@wizard.net

                          #27
                          Re: std::string vs. Unicode UTF-8

                          Sean Parent wrote:
                          > I think the current string classes and codecvt functionality in the language
                          > is pretty decent (I would have preferred if wchar_t had been nailed to 32
                          > bits, or even 16 bits... But that will be somewhat addressed). I'd like to

                          Requiring wchar_t to have more than 8 bits is pointless in itself. If
                          an implementor would have chosen to make wchar_t 8 bits without that
                          requirement, forcing the implementor to use 16 bits will merely
                          encourage definition of a 16-bit type that contains the same range of
                          values as his 8 bit type would have had. In the process, you'll be
                          making his implementation marginally more complicated and inefficient.

                          What might be worthwhile is to require some actual support for Unicode.
                          I'm not sure it's a good idea to impose such a requirement; there's a
                          real advantage to giving implementors the freedom to not support
                          Unicode if they know that their particular customer base has no need
                          for it. However, such a requirement would at least guarantee some
                          benefit to some users, which requiring wchar_t to be at least 16 bits
                          would NOT do.


                          • Lance Diduck

                            #28
                            Re: std::string vs. Unicode UTF-8

                            This was a great overview. Thanks!
                            > I think the current string classes and codecvt functionality in the language
                            > is pretty decent (I would have preferred if wchar_t had been nailed to 32
                            > bits, or even 16 bits...
                            Of the four platforms that I regularly code for, two are 32-bit and
                            two are 16-bit for wchar_t. And of each variety, two are big endian
                            (AIX and Solaris), and two are little endian (Linux and Microsoft) (I
                            haven't researched Cygwin, which would be interesting to see). This is
                            four different encodings. Any comparisons involving literals are suspect,
                            not to mention "binary support."
                            Message catalogs help -- and the diversity there is off topic, but it is
                            far, far more non-standard and uneven than wchar_t support.
                            Given that most localization is done in a GUI framework rather than
                            through IOstreams, it would help if automatic invocation of codecvt
                            were placed in something like stringstream. But as it is, codecvt is only
                            invoked automatically in things that don't write to memory. And except
                            perhaps for CGI calls, there is little demand for "console mode"
                            internationalized applications.
                            > I'd love to see the functionality of the IBM ICU libraries
                            > <http://www-306.ibm.com/software/globalization/icu/index.jsp> although I'm
                            > not a fan of the ICU C++ interface (as I mentioned above - I don't see a
                            > need for a new string class,
                            The ICU C++ string uses -- and I'm not kidding -- "bogus semantics."
                            http://icu.sourceforge.net/apiref/ic...tring.html#a82 You
                            check the validity of your string by calling the isBogus
                            method. Additionally, every ICU class inherits from UMemory, and can
                            only change the heap manager by redefining this base class and
                            redeploying the library.
                            The ICU looks like a port from Java, and has a very Java feel to it. I
                            believe it is a great starting point though.

                            Other than string literals and the lack of character iterators, the
                            main problem with the C++ string and Unicode is the compare function.
                            To get a true comparison one would really use the locale compare
                            function, mapped to some normalization and collation algorithm, and not
                            string compare, which is more or less memcmp. The interface for string
                            compare can only compare using the number of bytes in the smaller of
                            the strings to be compared -- so even if you did manage somehow to cram
                            normalization into a char_traits class, the traits::compare interface
                            requires truncating the larger of the two strings.
                            This works great for backward compatibility, though.

                            > Beyond that, I'd like to work towards a standard markup -
                            But wouldn't that depend on the renderer? Adoption of XSL-FO may be
                            a good start. However, RIM devices etc. would barely be able to fit such
                            a renderer.





                            • Sean Parent

                              #29
                              Re: std::string vs. Unicode UTF-8




                              in article 1128564183.576203.88670@g49g2000...legroups.com, Lance
                              Diduck at lancediduck@nyc.rr.com wrote on 10/5/05 11:22 PM:

                              >> Beyond that, I'd like to work towards a standard markup -
                              > But wouldn't that depend on the renderer? Adoption of XSL-FO may be
                              > a good start. However, RIM devices etc. would barely be able to fit such
                              > a renderer.

                              I should have clarified - I'm not looking at markup for rendering intents
                              (that's a separate but important issue), rather for semantic intents -
                              marking substrings with their language, gender, plurality, and locale, as
                              well as alternates (alternate languages, alternate forms such as
                              formal/casual). These are important attributes for string processing. More
                              RDF than XSL-FO.

                              Sean


                              • Simon Bone

                                #30
                                Re: std::string vs. Unicode UTF-8

                                On Thu, 06 Oct 2005 00:20:59 -0600, kuyper wrote:

                                > What might be worthwhile is to require some actual support for Unicode.
                                > I'm not sure it's a good idea to impose such a requirement; there's a
                                > real advantage to giving implementors the freedom to not support
                                > Unicode if they know that their particular customer base has no need
                                > for it. However, such a requirement would at least guarantee some
                                > benefit to some users, which requiring wchar_t to be at least 16 bits
                                > would NOT do.

                                Like the freedom not to implement export because no-one in their customer
                                base needs it? ;-)

                                I think standard Unicode support would be more widely appreciated than
                                export. If some vendors continue to decide not to quite finish their
                                implementations, so what? The world has not stopped turning while we wait
                                for more C++98 implementations to become strictly complete. I also expect
                                most C++ implementors would provide Unicode support following the
                                standard, if it was included.

                                Simon Bone
