std::string vs. Unicode UTF-8

This topic is closed.
  • Dave Rahardja

    #16
    Re: std::string vs. Unicode UTF-8

    On Wed, 28 Sep 2005 08:28:13 +0200, Mirek Fidler <cxl@volny.cz > wrote:

    >> Well, 'wchar_t' and 'wstring' were designed at a time when Unicode
    >> was still pretending that they use 16-bit characters and that each
    >> Unicode character consists of a single 16-bit character. Neither of
    >> these two properties holds: Unicode is [currently] a 20-bit encoding
    >> and a Unicode character can consist of multiple such 20-bit entities
    > ^^^^^^^^^^^^^^^ ^
    >
    > 16-bit?

    From the Unicode Technical Introduction:

    "In all, the Unicode Standard, Version 4.0 provides codes for 96,447
    characters from the world's alphabets, ideograph sets, and symbol
    collections... The majority of common-use characters fit into the first 64K
    code points, an area of the codespace that is called the basic multilingual
    plane, or BMP for short. There are about 6,300 unused code points for future
    expansion in the BMP, plus over 870,000 unused supplementary code points on
    the other planes...The Unicode Standard also reserves code points for private
    use. Vendors or end users can assign these internally for their own characters
    and symbols, or use them with specialized fonts. There are 6,400 private use
    code points on the BMP and another 131,068 supplementary private use code
    points, should 6,400 be insufficient for particular applications."

    Despite the indication that the code space for Unicode is larger than
    16 bits can address, the following statement suggests that a 32-bit integer
    is more than enough to represent all Unicode characters:

    "UTF-32 is popular where memory space is no concern, but fixed width, single
    code unit access to characters is desired. Each Unicode character is encoded
    in a single 32-bit code unit when using UTF-32."



    -dr


    • Pete Becker

      #17
      Re: std::string vs. Unicode UTF-8

      Dietmar Kuehl wrote:
      > Pete Becker wrote:
      >
      >> That's unfortunate, since it's exactly what wchar_t and wstring were
      >> designed for. What is your objection to them?
      >
      > Well, 'wchar_t' and 'wstring' were designed at a time when Unicode
      > was still pretending that they use 16-bit characters and that each
      > Unicode character consists of a single 16-bit character. Neither of
      > these two properties holds: Unicode is [currently] a 20-bit encoding
      > and a Unicode character can consist of multiple such 20-bit entities
      > for combining characters.

      Well, true, but wchar_t can certainly be large enough to hold 20 bits.
      And the claim from the Unicode folks is that that's all you need.

      --

      Pete Becker
      Dinkumware, Ltd. (http://www.dinkumware.com)

      ---
      [ comp.std.c++ is moderated. To submit articles, try just posting with ]
      [ your news-reader. If that fails, use mailto:std-c++@ncar.ucar.edu ]
      [ --- Please see the FAQ before posting. --- ]
      [ FAQ: http://www.jamesd.demon.co.uk/csc/faq.html ]


      • Jonathan Coxhead

        #18
        Re: std::string vs. Unicode UTF-8

        Pete Becker wrote:
        > Dietmar Kuehl wrote:
        >
        >> Pete Becker wrote:
        >>
        >>> That's unfortunate, since it's exactly what wchar_t and wstring were
        >>> designed for. What is your objection to them?
        >>
        >> Well, 'wchar_t' and 'wstring' were designed at a time when Unicode
        >> was still pretending that they use 16-bit characters and that each
        >> Unicode character consists of a single 16-bit character. Neither of
        >> these two properties holds: Unicode is [currently] a 20-bit encoding
        >> and a Unicode character can consist of multiple such 20-bit entities
        >> for combining characters.
        >
        > Well, true, but wchar_t can certainly be large enough to hold 20 bits.
        > And the claim from the Unicode folks is that that's all you need.

        Actually, you need 21 bits. There are 0x11 planes with 0x10000 characters in
        each, so 0x110000 characters. This space is completely flat, though it has
        holes. Or, you can use UTF-16, where a character is encoded as 1 or 2 16-bit
        values, so in C counts as neither a wide-character encoding nor a multibyte
        encoding. (It might be a "multishort" encoding, if such a thing existed.) Or you
        can use UTF-8, which is a true multibyte encoding. The translation between these
        representations is purely algorithmic.

        Anyway, 20 bits: not enough.


        • kanze

          #19
          Re: std::string vs. Unicode UTF-8

          Pete Becker wrote:
          > Dietmar Kuehl wrote:
          > > Pete Becker wrote:
          > >> That's unfortunate, since it's exactly what wchar_t and
          > >> wstring were designed for. What is your objection to them?
          > > Well, 'wchar_t' and 'wstring' were designed at a time when
          > > Unicode was still pretending that they use 16-bit characters
          > > and that each Unicode character consists of a single 16-bit
          > > character. Neither of these two properties holds: Unicode is
          > > [currently] a 20-bit encoding and a Unicode character can
          > > consist of multiple such 20-bit entities for combining
          > > characters.

          (If you have 20 or more bits, there's no need for the combining
          characters; they're only present to allow representing character
          codes larger than 0xFFFF as two 16-bit characters.)

          > Well, true, but wchar_t can certainly be large enough to hold
          > 20 bits. And the claim from the Unicode folks is that that's
          > all you need.

          I think the point is that when wchar_t was introduced, it wasn't
          obvious that Unicode was the solution, and Unicode at the time
          was only 16 bits anyway. Given that, vendors have defined
          wchar_t in a variety of ways. And given that vendors want to
          support their existing code bases, that really won't change,
          regardless of what the standard says.

          Given this, there is definite value in leaving wchar_t as it is
          (which is pretty unusable in portable code), and defining a new
          type which is guaranteed to be Unicode. (This is, I believe,
          the route C is taking; there's probably some value in remaining
          C compatible here as well.)

          --
          James Kanze GABI Software
          Conseils en informatique orientée objet/
          Beratung in objektorientierter Datenverarbeitung
          9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34



          • Dave Rahardja

            #20
            Re: std::string vs. Unicode UTF-8

            On Fri, 30 Sep 2005 23:41:35 CST, "kanze" <kanze@gabi-soft.fr> wrote:
            >> Well, true, but wchar_t can certainly be large enough to hold
            >> 20 bits. And the claim from the Unicode folks is that that's
            >> all you need.
            >
            > I think the point is that when wchar_t was introduced, it wasn't
            > obvious that Unicode was the solution, and Unicode at the time
            > was only 16 bits anyway. Given that, vendors have defined
            > wchar_t in a variety of ways. And given that vendors want to
            > support their existing code bases, that really won't change,
            > regardless of what the standard says.
            >
            > Given this, there is definite value in leaving wchar_t as it is
            > (which is pretty unusable in portable code), and defining a new
            > type which is guaranteed to be Unicode. (This is, I believe,
            > the route C is taking; there's probably some value in remaining
            > C compatible here as well.)

            I think wchar_t is fine the way it is defined:

            (3.9.1/5)
            Type wchar_t is a distinct type whose values can represent distinct codes for
            all members of the largest extended character set specified among the
            supported locales (22.1.1). Type wchar_t shall have the same size, signedness,
            and alignment requirements (3.9) as one of the other integral types, called
            its underlying type.

            What we need is a Unicode locale! ;-)

            -dr


            • Richard Kettlewell

              #21
              Re: std::string vs. Unicode UTF-8

              "kanze" <kanze@gabi-soft.fr> writes:[color=blue]
              > (If you have 20 or more bits, there's no need for the combining
              > characters; there only present to allow representing character codes
              > larger than 0xFFFF as two 16 bit characters.)[/color]

              I believe you are thinking of surrogates, rather than combining
              characters, here. The need (or otherwise) for the latter is
              independent of representation.

              --
              The front cover to my personal web site.



              • P.J. Plauger

                #22
                Re: std::string vs. Unicode UTF-8

                "kanze" <kanze@gabi-soft.fr> wrote in message
                news:1127985061.409082.75870@g49g2000cwa.googlegroups.com...

                > I think the point is that when wchar_t was introduced, it wasn't
                > obvious that Unicode was the solution, and Unicode at the time
                > was only 16 bits anyway. Given that, vendors have defined
                > wchar_t in a variety of ways. And given that vendors want to
                > support their existing code bases, that really won't change,
                > regardless of what the standard says.
                >
                > Given this, there is definite value in leaving wchar_t as it is
                > (which is pretty unusable in portable code), and defining a new
                > type which is guaranteed to be Unicode. (This is, I believe,
                > the route C is taking; there's probably some value in remaining
                > C compatible here as well.)

                Right, there's a (non-normative) Technical Report that defines
                16- and 32-bit character types independent of wchar_t. We'll
                be shipping it as part of our next release, along with a slew
                of code conversions you can use with these new types.

                P.J. Plauger
                Dinkumware, Ltd.




                • kanze

                  #23
                  Re: std::string vs. Unicode UTF-8

                  Richard Kettlewell wrote:
                  > "kanze" <kanze@gabi-soft.fr> writes:
                  > > (If you have 20 or more bits, there's no need for the
                  > > combining characters; they're only present to allow
                  > > representing character codes larger than 0xFFFF as two
                  > > 16-bit characters.)

                  > I believe you are thinking of surrogates, rather than
                  > combining characters, here. The need (or otherwise) for the
                  > latter is independent of representation.

                  I was definitely talking about surrogates. And it is possible to
                  represent any Unicode character in UTF-32 without the use of
                  surrogates; they are only necessary in UTF-16.

                  --
                  James Kanze GABI Software
                  Conseils en informatique orientée objet/
                  Beratung in objektorientierter Datenverarbeitung
                  9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34



                  • Sean Parent

                    #24
                    Re: std::string vs. Unicode UTF-8

                    A few comments on this thread -

                    Unicode has been 21 bits since its inception, at least it was 21 bits by
                    the time Unicode 1.0 came out (I worked with Eric Mader, Dave Opstad, and
                    Mark Davis at Apple <http://www.unicode.org/history/>). Although I've heard
                    grumblings that people would like to extend it to include pages for more
                    dead languages.

                    UCS-2 is a subset of Unicode that fits in 16 bits without double word
                    encoding. It is part of ISO 10646, which also defines UCS-4, which for all
                    practical purposes is the same encoding as UTF-32 (there's a document on the
                    relationship on the unicode.org site). UTF-16 and UTF-32 both have endian
                    variants.

                    Operations such as "the number of characters in a string" has very little
                    meaning - there is no direct relationship between characters and glyphs,
                    there are combining characters (not the same as a multi-byte or word
                    encoding). Even if defined as the number of Unicode code points in a string,
                    it isn't particularly interesting.

                    Operations such as string catenation, sub-string searching, upper-case to
                    lower-case conversion, and collation are all non-trivial on a Unicode string
                    regardless of the encoding.

                    I think the current string classes and codecvt functionality in the language
                    is pretty decent (I would have preferred if wchar_t had been nailed to 32
                    bits, or even 16 bits... but that will be somewhat addressed). I'd like to
                    see the complexity of the current string classes specified - and I think a
                    lightweight copy (constant time) is needed - but I think move semantics will
                    address this. I also think it would be good to mark strings with their
                    encoding, because it is too easy to end up with Mojibake
                    <http://en.wikipedia.org/wiki/Mojibake>, but I don't think this requires a
                    whole new string class (I honestly don't think there is such a thing as a
                    one-size-fits-all string class).

                    I'd love to see the functionality of the IBM ICU libraries
                    <http://www-306.ibm.com/software/globalization/icu/index.jsp>, although I'm
                    not a fan of the ICU C++ interface (as I mentioned above, I don't see a
                    need for a new string class; I'd like ICU rethought as generic algorithms
                    that work regardless of the string representation).

                    Beyond that, I'd like to work towards a standard markup - strings require
                    more information than just their encoding to really be handled properly. You
                    need to know which sections of a string are in which language (which can't
                    be determined completely from the characters used) - items such as gender,
                    plurality, and formal forms all play a part in doing proper operations such
                    as replacements. The ASL xstring glossary library is a step in this
                    direction: <http://opensource.adobe.com/group__asl__xstring.html>.

                    --
                    Sean Parent
                    Sr. Engineering Manager
                    Software Technology Lab
                    Adobe Systems Incorporated
                    sparent@adobe.com


                    • Niklas Matthies

                      #25
                      Re: std::string vs. Unicode UTF-8

                      On 2005-10-04 04:00, kanze wrote:
                      > I was definitely talking about surrogates. And it is possible to
                      > represent any Unicode character in UTF-32 without the use of
                      > surrogates;

                      It's even necessary, because surrogate code points outside of UTF-16
                      are non-conformant and cause the corresponding byte or code point
                      sequences to be ill-formed.

                      -- Niklas Matthies


                      • kuyper@wizard.net

                        #26
                        Re: std::string vs. Unicode UTF-8

                        kanze wrote:
                        > Richard Kettlewell wrote:
                        > > "kanze" <kanze@gabi-soft.fr> writes:
                        > > > (If you have 20 or more bits, there's no need for the
                        > > > combining characters; they're only present to allow
                        > > > representing character codes larger than 0xFFFF as two
                        > > > 16-bit characters.)
                        >
                        > > I believe you are thinking of surrogates, rather than
                        > > combining characters, here. The need (or otherwise) for the
                        > > latter is independent of representation.
                        >
                        > I was definitely talking about surrogates. And it is possible to
                        > represent any Unicode character in UTF-32 without the use of
                        > surrogates; they are only necessary in UTF-16.

                        As the Unicode documents themselves point out, what a reader would
                        consider to be a single character is often represented in Unicode as
                        the combination of several Unicode characters. Can an implementation
                        use UTF-32 encoding for wchar_t, and meet all of the requirements of
                        the C standard with respect to wchar_t, when combined characters are
                        involved? I think you can meet those requirements only by interpreting
                        every reference in the C standard to a wide "character" as referring to
                        a "Unicode character" rather than to what end users would
                        consider a character.

                        If search_string ends with an uncombined character, and target_string
                        contains the exact same sequence of wchar_t values followed by one or
                        more combining characters, I believe that wcsstr(search_string,
                        target_string) is supposed to report a match. That strikes me as
                        problematic.


                        • kuyper@wizard.net

                          #27
                          Re: std::string vs. Unicode UTF-8

                          Sean Parent wrote:
                          > I think the current string classes and codecvt functionality in the language
                          > is pretty decent (I would have preferred if wchar_t had been nailed to 32
                          > bits, or even 16 bits... But that will be somewhat addressed). I'd like to

                          Requiring wchar_t to have more than 8 bits is pointless in itself. If
                          an implementor would have chosen to make wchar_t 8 bits without that
                          requirement, forcing the implementor to use 16 bits will merely
                          encourage definition of a 16-bit type that contains the same range of
                          values as his 8 bit type would have had. In the process, you'll be
                          making his implementation marginally more complicated and inefficient.

                          What might be worthwhile is to require some actual support for Unicode.
                          I'm not sure it's a good idea to impose such a requirement; there's a
                          real advantage to giving implementors the freedom to not support
                          Unicode if they know that their particular customer base has no need
                          for it. However, such a requirement would at least guarantee some
                          benefit to some users, which requiring wchar_t to be at least 16 bits
                          would NOT do.


                          • Lance Diduck

                            #28
                            Re: std::string vs. Unicode UTF-8

                            This was a great overview. Thanks!
                            > I think the current string classes and codecvt functionality in the language
                            > is pretty decent (I would have preferred if wchar_t had been nailed to 32
                            > bits, or even 16 bits...
                            Of the four platforms that I regularly code for, two are 32-bit and
                            two are 16-bit for wchar_t. And of each variety, two are big endian
                            (AIX and Solaris), and two are little endian (Linux and Microsoft) (I
                            haven't researched Cygwin, which would be interesting to see). This is
                            four different encodings. Any comparisons involving literals are suspect,
                            not to mention "binary support."
                            Message catalogs help -- and the diversity there is off topic, but it is
                            far, far more non-standard and uneven than wchar_t support.
                            Given that most localization is done in a GUI framework rather than
                            through IOstreams, it would help if automatic invocation of codecvt
                            were placed in something like stringstream. But as it is, codecvt is only
                            invoked automatically in things that don't write to memory. And except
                            perhaps for CGI calls, there is little demand for "console mode"
                            internationalized applications.
                            > I'd love to see the functionality of the IBM ICU libraries
                            > <http://www-306.ibm.com/software/globalization/icu/index.jsp> although I'm
                            > not a fan of the ICU C++ interface (as I mentioned above - I don't see a
                            > need for a new string class,
                            The ICU C++ string uses -- and I'm not kidding -- "bogus semantics."
                            http://icu.sourceforge.net/apiref/ic...tring.html#a82 You
                            check the validity of your string by calling the isBogus
                            method. Additionally, every ICU class inherits from UMemory, and can
                            only change the heap manager by redefining this base class and
                            redeploying the library.
                            The ICU looks like a port from Java, and has a very Java feel to it. I
                            believe it is a great starting point though.

                            Other than string literals and the lack of character iterators, the
                            main problem with the C++ string and Unicode is the compare function.
                            To get a true comparison one would really use the locale compare
                            function, mapped to some normalization and collation algorithm, and not
                            string compare, which is more or less memcmp. The interface for string
                            compare can only compare using the number of bytes in the smaller of
                            the strings to be compared -- so even if you did manage somehow to cram
                            normalization into a char_traits class, the traits::compare interface
                            requires truncating the larger of the two strings.
                            This works great for backward compatibility, though.

                            > Beyond that, I'd like to work towards a standard markup -
                            But wouldn't that depend on the renderer? Adoption of XSL-FO may be
                            a good start. However, RIM devices etc. would barely be able to fit such
                            a renderer.





                            • Sean Parent

                              #29
                              Re: std::string vs. Unicode UTF-8




                              in article 1128564183.576203.88670@g49g2000...legroups.com, Lance
                              Diduck at lancediduck@nyc.rr.com wrote on 10/5/05 11:22 PM:

                              >> Beyond that, I'd like to work towards a standard markup -
                              > But wouldn't that depend on the renderer? Adoption of XSL-FO may be
                              > a good start. However, RIM devices etc. would barely be able to fit such
                              > a renderer.

                              I should have clarified - I'm not looking at markup for rendering intents
                              (that's a separate but important issue), rather for semantic intents -
                              marking substrings with their language, gender, plurality, and locale, as
                              well as alternates (alternate languages, alternate forms such as
                              formal/casual). These are important attributes for string processing. More
                              RDF than XSL-FO.

                              Sean


                              • Simon Bone

                                #30
                                Re: std::string vs. Unicode UTF-8

                                On Thu, 06 Oct 2005 00:20:59 -0600, kuyper wrote:

                                > What might be worthwhile is to require some actual support for Unicode.
                                > I'm not sure it's a good idea to impose such a requirement; there's a
                                > real advantage to giving implementors the freedom to not support
                                > Unicode if they know that their particular customer base has no need
                                > for it. However, such a requirement would at least guarantee some
                                > benefit to some users, which requiring wchar_t to be at least 16 bits
                                > would NOT do.

                                Like the freedom not to implement export because no-one in their customer
                                base needs it? ;-)

                                I think standard Unicode support would be more widely appreciated than
                                export. If some vendors continue to decide not to quite finish their
                                implementations, so what? The world has not stopped turning while we wait
                                for more C++98 implementations to become strictly complete. I also expect
                                most C++ implementors would provide Unicode support following the
                                standard, if it was included.

                                Simon Bone
