Converting between Unicode and default locale

  • Keith MacDonald

    Converting between Unicode and default locale

    Hello,

    Is there a portable (at least for VC.Net and g++) method to convert text
    between
    wchar_t and char, using the standard library? I may have missed something
    obvious, but the section on codecvt, in Josuttis' "The Standard C++
    Library", did not help, and I'm still awaiting delivery of Langer's
    "Standard C++ IOStreams and Locales".

    Thanks,
    Keith MacDonald
    [snip, before replying directly]


  • Mike Wahler

    #2
    Re: Converting between Unicode and default locale


    "Keith MacDonald" <keith@text-snip-pad.com> wrote in message
    news:bl274i$9h8$1$8302bc10@news.demon.co.uk...
    > Hello,
    >
    > Is there a portable (at least for VC.Net and g++) method to convert text
    > between
    > wchar_t and char, using the standard library? I may have missed something
    > obvious, but the section on codecvt, in Josuttis' "The Standard C++
    > Library", did not help, and I'm still awaiting delivery of Langer's
    > "Standard C++ IOStreams and Locales".

    I read in my copy of L&K that there is no built-in support
    for wide character streams. Type 'wchar_t' is only used
    to implement multibyte stream i/o.

    Also note that depending upon your platform's byte size,
    not all Unicode values will necessarily fit into type
    'char'.

    -Mike



    • Ron Natalie

      #3
      Re: Converting between Unicode and default locale


      "Mike Wahler" <mkwahler@mkwahler.net> wrote in message news:ok1db.5400$NX3.2298@newsread3.news.pas.earthlink.net...

      > I read in my copy of L&K that there is no built-in support
      > for wide character streams. Type 'wchar_t' is only used
      > to implement multibyte stream i/o.

      Multibyte is using more than one char to encode a character.
      wchar_t is fixed size wide characters. But I knew what you
      meant.

      Yes, it's a major defect in the internationalization support.
      I have lobbied in comp.std.C++ to fix this (adding wchar_t
      interfaces to the few places that are sorely lacking it
      like the filenames in fstreams, etc...). Unfortunately,
      I get a lot of bitching and moaning from the rest of the
      standard community who haven't seriously dealt with
      some of the more problematic character encodings such as Japanese.



      • Aaron Isotton

        #4
        Re: Converting between Unicode and default locale

        On Fri, 26 Sep 2003 21:21:38 +0100, Keith MacDonald wrote:
        > Hello,
        >
        > Is there a portable (at least for VC.Net and g++) method to convert text
        > between
        > wchar_t and char, using the standard library? I may have missed something
        > obvious, but the section on codecvt, in Josuttis' "The Standard C++
        > Library", did not help, and I'm still awaiting delivery of Langer's
        > "Standard C++ IOStreams and Locales".

        Try mbstowcs/wcstombs.
        --
        Aaron Isotton
        Welcome to isotton.com, Aaron Isotton's website.
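
A minimal sketch of the mbstowcs/wcstombs suggestion above, for readers who want to see it spelled out. The helper names `to_wide` and `to_narrow` are hypothetical, not part of any library. The sketch assumes the input contains no embedded null characters and that the program has already selected the locale it wants; without a call such as `std::setlocale(LC_ALL, "")` the "C" locale is in effect and only the basic character set is guaranteed to convert.

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical helper: multibyte (current C locale) -> wide characters.
// Returns an empty string if the input holds an invalid multibyte sequence.
std::wstring to_wide(const std::string& bytes)
{
    // A wide string never needs more elements than the input has bytes.
    std::vector<wchar_t> buffer(bytes.size() + 1);
    std::size_t written = std::mbstowcs(buffer.data(), bytes.c_str(), buffer.size());
    if (written == static_cast<std::size_t>(-1))
        return std::wstring();
    return std::wstring(buffer.data(), written);
}

// Hypothetical helper: wide characters -> multibyte (current C locale).
// Returns an empty string if a character has no representation in the locale.
std::string to_narrow(const std::wstring& wide)
{
    // MB_CUR_MAX is the largest number of bytes one character can need.
    std::vector<char> buffer(wide.size() * MB_CUR_MAX + 1);
    std::size_t written = std::wcstombs(buffer.data(), wide.c_str(), buffer.size());
    if (written == static_cast<std::size_t>(-1))
        return std::string();
    return std::string(buffer.data(), written);
}
```

Both functions depend on global locale state, so the result for non-ASCII text differs between platforms and locale settings; that is the portability limit the rest of the thread goes on to discuss.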



        • Gianni Mariani

          #5
          Re: Converting between Unicode and default locale

          Ron Natalie wrote:
          > "Mike Wahler" <mkwahler@mkwahler.net> wrote in message news:ok1db.5400$NX3.2298@newsread3.news.pas.earthlink.net...
          >
          >> I read in my copy of L&K that there is no built-in support
          >> for wide character streams. Type 'wchar_t' is only used
          >> to implement multibyte stream i/o.
          >
          > Multibyte is using more than one char to encode a character.
          > wchar_t is fixed size wide characters. But I knew what you
          > meant.
          >
          > Yes, it's a major defect in the internationalization support.
          > I have lobbied in comp.std.C++ to fix this (adding wchar_t
          > interfaces to the few places that are sorely lacking it
          > like the filenames in fstreams, etc...). Unfortunately,
          > I get a lot of bitching and moaning from rest of the
          > standard community who haven't seriously dealt with
          > some of the more problematic character encodings such as Japanese.

          Except that some vendors use utf-16 and some use ucs-4 as their wchar_t
          type. UTF-16 usually breaks a whole bunch of assumptions on what a
          wchar_t type is supposed to be.

          On platforms that use utf-16, the complexity of processing ucs-4 or
          utf-16 characters is equivalent so it makes sense to only support utf-8.

          If you know your code is ONLY dealing with utf-8 characters, you can
          make processing utf-8 characters very efficient by inlining some of the
          code that deals with utf-8.
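
As an illustration of the kind of small UTF-8 routine that inlines well, here is a sketch. The function names are hypothetical and this is not the poster's code. It assumes the input is already well-formed UTF-8 and does not reject overlong forms such as lead bytes 0xC0 and 0xC1.

```cpp
#include <cstddef>
#include <string>

// Length in bytes of the UTF-8 sequence introduced by this lead byte,
// or 0 if the byte is a continuation byte or matches no lead-byte pattern.
inline std::size_t utf8_sequence_length(unsigned char lead)
{
    if (lead < 0x80) return 1;            // 0xxxxxxx: ASCII
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;
}

// Counts code points by counting every byte that is not a
// continuation byte (10xxxxxx). No validation is performed.
inline std::size_t utf8_code_point_count(const std::string& text)
{
    std::size_t count = 0;
    for (std::string::size_type i = 0; i < text.size(); ++i)
        if ((static_cast<unsigned char>(text[i]) & 0xC0) != 0x80)
            ++count;
    return count;
}
```

Because every decision is a mask and a compare on a single byte, a compiler can fold these helpers into the calling loop, which is presumably the efficiency gain being described.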



          • Mike Wahler

            #6
            Re: Converting between Unicode and default locale

            "Ron Natalie" <ron@sensor.com> wrote in message
            news:3f74a4e1$0$143$9a6e19ea@news.newshosting.com...
            >
            > "Mike Wahler" <mkwahler@mkwahler.net> wrote in message
            > news:ok1db.5400$NX3.2298@newsread3.news.pas.earthlink.net...
            >
            > > I read in my copy of L&K that there is no built-in support
            > > for wide character streams. Type 'wchar_t' is only used
            > > to implement multibyte stream i/o.
            >
            > Multibyte is using more than one char to encode a character.

            Right.

            > wchar_t is fixed size wide characters.

            Right.

            > But I knew what you
            > meant.

            I meant what I said. (Actually I suppose L&K meant it,
            I'm only repeating it).

            What they were explaining is that of course a multibyte
            file's contents cannot be stored with type 'char' objects
            without losing information, so the multibyte characters
            are converted (via a facet) to/from a wide character
            encoding internally to the stream. The transport
            layer actually accesses the file in 'char'-size
            objects.

            Ref: Langer & Kreft 2.3, p 113

            If you feel I'm misunderstanding, please do clarify.


            > Yes, it's a major defect in the internationalization support.

            Yes, I agree. Didn't folks work hard to create a
            standard character set which could accommodate virtually
            all written languages?

            > I have lobbied in comp.std.C++ to fix this (adding wchar_t
            > interfaces to the few places that are sorely lacking it
            > like the filenames in fstreams, etc...). Unfortunately,
            > I get a lot of bitching and moaning from rest of the
            > standard community who haven't seriously dealt with
            > some of the more problematic character encodings such as Japanese.

            I haven't had to deal with international issues yet, but I
            know that it's only a matter of time, and I'd sure like
            some Unicode support so I can practice ahead of time.

            Any time I spend more than a few minutes with my nose
            inside the L&K book, I come away with my head swimming. :-)

            -Mike



            • Ron Natalie

              #7
              Re: Converting between Unicode and default locale


              "Aaron Isotton" <aaron@isotton.com> wrote in message news:pan.2003.09.26.20.44.53.739654@isotton.com...

              > > Is there a portable (at least for VC.Net and g++) method to convert text
              > > between
              > > wchar_t and char, using the standard library? I may have missed something
              > > obvious, but the section on codecvt, in Josuttis' "The Standard C++
              > > Library", did not help, and I'm still awaiting delivery of Langer's
              > > "Standard C++ IOStreams and Locales".
              >
              > Try mbstowcs/wcstombs.
              > --
              Unfortunately that is not adequate for the Windows environment.
              In actuality, it is impossible to properly use Unicode filenames with
              the standard C++ library on Windows.

              I have not been able to make any inroads with the standardization people
              about doing something about this.



              • Ron Natalie

                #8
                Re: Converting between Unicode and default locale


                "Gianni Mariani" <gi2nospam@mariani.ws> wrote in message news:bl29u5$45b@dispatch.concentric.net...

                > Except that some vendors use utf-16 and some use ucs-4 as their wchar_t
                > type. UTF-16 usually breaks a whole bunch of assumptions on what a
                > wchar_t type is supposed to be.

                Immaterial to the problem. The standard library is broken even if your
                wchar_t is 32 bits.

                > On platforms that use utf-16, the complexity of processing ucs-4 or
                > utf-16 characters is equivalent so it makes sense to only support utf-8.

                I do not agree. And Windows doesn't provide an implicit char to wchar_t
                translation in the system interfaces (UTF-8 or otherwise). It's immaterial
                to the fact that wchar_t might become a multi-wide-byte encoding. The
                standard library does not provide the hooks necessary to fully support
                wchar_t such as you might have.

                > If you know your code is ONLY dealing with utf-8 characters, you can
                > make processing utf-8 characters very efficient by inlining some of the
                > code that deals with utf-8.

                The WIN32 interfaces do not support utf-8. You have to feed them the
                16 bit values if you want to use other than the base codetable. We've
                had to write our own bloody fstreams that do a UTF-8 to wchar_t
                conversion (essentially reimplementing fstream to work properly)
                but that ought not to be necessary. It's a defect in the language.
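
The conversion step described here, UTF-8 in and 16-bit units out, can be sketched portably as below. This is an illustration only, not the poster's implementation; the function name `utf8_to_utf16` is hypothetical, and `std::u16string` (C++11) stands in for a 16-bit `wchar_t` string so the example behaves the same on every platform. Overlong encodings are not rejected, which a production decoder should do.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Decodes UTF-8 and re-encodes each code point as UTF-16,
// using a surrogate pair for code points above U+FFFF.
std::u16string utf8_to_utf16(const std::string& utf8)
{
    std::u16string out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        unsigned char lead = static_cast<unsigned char>(utf8[i]);
        unsigned long cp;
        std::size_t extra;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (extra >= utf8.size() - i)
            throw std::invalid_argument("truncated UTF-8 sequence");
        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(utf8[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        i += extra + 1;

        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("code point not representable in UTF-16");
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

On a platform whose `wchar_t` is 16 bits, the resulting units could be copied into a `std::wstring` and handed to the wide-character system interfaces; how a file name then reaches an fstream is exactly the gap in the standard library that this post complains about.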



                • Ron Natalie

                  #9
                  Re: Converting between Unicode and default locale


                  "Mike Wahler" <mkwahler@mkwahler.net> wrote in message news:oX1db.5447$NX3.1617@newsread3.news.pas.earthlink.net...

                  > What they were explaining is that of course a multibyte
                  > file's contents cannot be stored with type 'char' objects
                  > without losing information, so the multibyte characters
                  > are converted (via a facet) to/from a wide character
                  > encoding internally to the stream. The transport
                  > layer actually accesses the file in 'char'-size
                  > objects.

                  I'm not understanding what you are saying. There's no reason
                  why a multibyte (in char) encoding of a wchar_t loses any information.
                  UTF-8 will encode 32 bit UNICODE in some number between 1 and
                  6 chars.
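
To make the "no information is lost" point concrete, here is a sketch of a UTF-8 encoder; the name `encode_utf8` is hypothetical. One caveat on the figures: the 1 to 6 byte range matches the original ISO 10646 definition of UTF-8, whereas RFC 3629 restricts UTF-8 to code points up to U+10FFFF and therefore to at most 4 bytes. The sketch follows the RFC 3629 limit.

```cpp
#include <stdexcept>
#include <string>

// Encodes one Unicode code point as a UTF-8 byte sequence of 1 to 4 bytes.
std::string encode_utf8(unsigned long cp)
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        throw std::invalid_argument("not a Unicode scalar value");

    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);                          // 0xxxxxxx
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));            // 110xxxxx
        out += static_cast<char>(0x80 | (cp & 0x3F));          // 10xxxxxx
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));           // 1110xxxx
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));           // 11110xxx
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

Every payload bit of the code point lands in exactly one output byte, so the mapping is reversible; the cost is only that characters occupy a varying number of `char` units.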

                  > Ref: Langer & Kreft 2.3, p 113

                  I don't have the book.

                  Don't even get me started that the "basic character type" and
                  the "smallest addressable unit of storage" really should be
                  distinct types and not overloaded on char. This is the
                  price we pay for working in an American-centric industry
                  I guess.



                  • Mike Wahler

                    #10
                    Re: Converting between Unicode and default locale

                    "Ron Natalie" <ron@sensor.com> wrote in message
                    news:3f74ad55$0$175$9a6e19ea@news.newshosting.com...
                    >
                    > "Mike Wahler" <mkwahler@mkwahler.net> wrote in message
                    > news:oX1db.5447$NX3.1617@newsread3.news.pas.earthlink.net...
                    >
                    > > What they were explaining is that of course a multibyte
                    > > file's contents cannot be stored with type 'char' objects
                    > > without losing information, so the multibyte characters
                    > > are converted (via a facet) to/from a wide character
                    > > encoding internally to the stream. The transport
                    > > layer actually accesses the file in 'char'-size
                    > > objects.
                    >
                    > I'm not understanding what you are saying.

                    I'm not sure I'm conveying the info correctly.
                    I've included a quote from L&K below.

                    > There's no reason
                    > why a multibyte (in char) encoding of a wchar_t loses any information.
                    > UTF-8 will encode 32 bit UNICODE in some number between 1 and
                    > 6 chars.
                    >
                    > > Ref: Langer & Kreft 2.3, p 113
                    >
                    > I don't have the book.

                    Angelika Langer & Klaus Kreft,
                    "Standard C++ IOStreams and Locales,"
                    Chapter 2, "The Architecture of IOStreams"
                    Section 2.3, "Character Types and Character Traits",
                    page 113:

                    <quote>

                    MULTIBYTE FILES

                    CHARACTER TYPE. Multibyte files contain characters in a
                    multibyte encoding. Different from one-byte or wide-character
                    encodings, multibyte characters do not have the same size.
                    A single multibyte character can have a length of 1, 2, 3, or
                    more bytes. Obviously, none of the built-in character types,
                    char or wchar_t, is large enough to hold any character of a
                    given multibyte encoding. For this reason, multibyte characters
                    contained in a multibyte file are chopped into units of one
                    byte each. The wide-character file stream extracts data from
                    the multibyte file byte by byte, interprets the byte sequence,
                    finds out which and how many bytes form a multibyte character,
                    identifies the character, and translates it to a wide-character <<===
                    encoding.

                    Due to the decomposition of the multibytes into one-byte
                    units, the type of characters exchanged between the transport
                    layer and a multibyte file is char.

                    CHARACTER ENCODING. The encoding of characters exchanged
                    between the transport layer and a multibyte file can be any
                    multibyte encoding. It depends wholly on the content of the
                    multibyte file. As wide-character file streams internally
                    represent characters as units of type wchar_t encoded in the
                    programming environment's wide-character encoding, a code
                    conversion is always necessary. The code conversion is
                    performed by the stream buffer's code conversion facet. There
                    is no default conversion defined. It all depends on the code
                    conversion facet contained in the stream buffer's locale object,
                    which initially is the current global locale.

                    In sum, the external character representation of
                    wide-character file streams is that of the units transferred to and
                    from a multibyte file. Its character type is char, and the
                    encoding depends on the stream's code conversion facet.
                    </quote>


                    The above implies to me that in order to access a multibyte
                    file, one needs to use a basic_(i/o)stream<wchar_t>. Am I
                    missing something or assuming too much?
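
The code conversion facet that the quoted passage describes can also be called directly, without a stream. The sketch below is one way to do that under stated assumptions: the helper name `decode_with_locale` is hypothetical, the whole input is converted in one call, and any result other than `ok` (including a multibyte sequence cut off at the end of the input) is treated as an error.

```cpp
#include <cwchar>
#include <locale>
#include <stdexcept>
#include <string>
#include <vector>

// Converts external (char) data to wide characters using the
// codecvt<wchar_t, char, mbstate_t> facet of the given locale,
// which is the facet a wide file stream buffer consults.
std::wstring decode_with_locale(const std::string& bytes, const std::locale& loc)
{
    typedef std::codecvt<wchar_t, char, std::mbstate_t> facet_type;
    const facet_type& facet = std::use_facet<facet_type>(loc);

    if (bytes.empty())
        return std::wstring();

    std::mbstate_t state = std::mbstate_t();   // initial conversion state
    std::vector<wchar_t> wide(bytes.size());   // at most one wchar_t per input byte
    const char* from_next = 0;
    wchar_t* to_next = 0;

    std::codecvt_base::result r = facet.in(
        state,
        bytes.data(), bytes.data() + bytes.size(), from_next,
        wide.data(), wide.data() + wide.size(), to_next);

    if (r != std::codecvt_base::ok)
        throw std::runtime_error("code conversion failed or was incomplete");
    return std::wstring(wide.data(), to_next);
}
```

Which external encodings are understood depends entirely on the locale passed in: `std::locale::classic()` only promises the basic character set, while `std::locale("")` selects the user's environment and has implementation-defined behaviour, so results for non-ASCII text vary between VC++ and g++.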

                    > Don't even get me started that the "basic character type" and
                    > the "smallest addressable unit of storage"

                    I don't think that's part of this issue. They describe
                    abstract 'character types', about which a stream obtains
                    pertinent information via 'character traits' types.

                    > really should be
                    > distinct types and not overloaded on char.

                    I don't know what you mean here. I don't see L&K
                    mention either "basic character type" or "smallest
                    addressable unit of storage," or "overloading on char."
                    They talk about how iostreams is templatized on a
                    'character type', which can be either of the built-in
                    types char or wchar_t, or some other invented character
                    type which meets the requirements imposed by iostreams
                    (defines EOF value, etc).

                    > This is the
                    > price we pay for working in an American-centric industry
                    > I guess.

                    What about this do you feel is "American-centric"?

                    Thanks for your input.

                    -Mike



                    • Gianni Mariani

                      #11
                      Re: Converting between Unicode and default locale

                      Ron Natalie wrote:
                      > "Gianni Mariani" <gi2nospam@mariani.ws> wrote in message news:bl29u5$45b@dispatch.concentric.net...
                      >
                      > ...
                      > The WIN32 interfaces do not support utf-8. You have to feed them the
                      > 16 bit values if you want to use other than the base codetable. We've
                      > had to write our own bloody fstreams that does a UTF-8 to wchar_t
                      > conversion (essentially reimplementing fstream to work properly)
                      > but that ought not to be necessary. It's a defect in the language.


                      Did you consider just implementing a utf-8 specific string library as an
                      alternative ?









                      • Mike Wahler

                        #12
                        Re: Converting between Unicode and default locale


                        "Gianni Mariani" <gi2nospam@mariani.ws> wrote in message
                        news:bl2hp0$452@dispatch.concentric.net...
                        > Ron Natalie wrote:
                        > > "Gianni Mariani" <gi2nospam@mariani.ws> wrote in message
                        > > news:bl29u5$45b@dispatch.concentric.net...
                        > >
                        > ...
                        > > The WIN32 interfaces do not support utf-8. You have to feed them the
                        > > 16 bit values if you want to use other than the base codetable. We've
                        > > had to write our own bloody fstreams that does a UTF-8 to wchar_t
                        > > conversion (essentially reimplementing fstream to work properly)
                        > > but that ought not to be necessary. It's a defect in the language.
                        >
                        >
                        > Did you consider just implementing a utf-8 specific string library as an
                        > alternative?

                        How would that enable streaming of the characters?

                        -Mike



                        • Gianni Mariani

                          #13
                          Re: Converting between Unicode and default locale

                          Mike Wahler wrote:
                          > "Gianni Mariani" <gi2nospam@mariani.ws> wrote in message
                          > ...
                          >
                          > How would that enable streaming of the characters?

                          Explain how it does not.




                          • Mike Wahler

                            #14
                            Re: Converting between Unicode and default locale


                            "Gianni Mariani" <gi2nospam@mariani.ws> wrote in message
                            news:bl2j1f$45c@dispatch.concentric.net...
                            > Mike Wahler wrote:
                            > > "Gianni Mariani" <gi2nospam@mariani.ws> wrote in message
                            > ...
                            > >
                            > > How would that enable streaming of the characters?
                            >
                            > Explain how it does not.

                            I asked you first. :-)

                            Internationalization and character sets are not
                            something I claim expertise in. I'm in this
                            thread to try to learn a thing or two myself.
                            Toward that end, I offered a quote from L&K
                            with my interpretation, and asked that any
                            misconceptions be pointed out.

                            -Mike



                            • Gianni Mariani

                              #15
                              Re: Converting between Unicode and default locale

                              Mike Wahler wrote:
                              > "Gianni Mariani" <gi2nospam@mariani.ws> wrote in message
                              > news:bl2j1f$45c@dispatch.concentric.net...
                              >
                              >> Mike Wahler wrote:
                              >>
                              >>> "Gianni Mariani" <gi2nospam@mariani.ws> wrote in message
                              >>
                              >> ...
                              >>
                              >>> How would that enable streaming of the characters?
                              >>
                              >> Explain how it does not.
                              >
                              >
                              > I asked you first. :-)
                              >
                              > Internationalization and character sets are not
                              > something I claim expertise in. I'm in this
                              > thread to try to learn a thing or two myself.
                              > Toward that end, I offered a quote from L&K
                              > with my interpretation, and asked that any
                              > misconceptions be pointed out.

                              Well, I don't really know what you're trying to do.

                              However, I can say that Unicode is a very complex beast.

                              You have issues like:

                              Composed characters

                              Bidirectional strings

                              Multiple representations of the same characters

                              Language tags

                              Ligatures

                              Private use characters

                              ++ more

                              Let's compare:

                              issue           | utf-8 | utf-16  | utf-32
                              ----------------+-------+---------+----------
                              endian          | no    | yes     | yes
                              ascii-is-ascii  | yes   | no      | no
                              is-multi-*unit* | yes   | yes     | mostly no
                              is compact      | yes   | kind of | no
                              is stateful     | no    | yes     | yes
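
The "endian" and "ascii-is-ascii" rows of the comparison can be illustrated with a few lines of code. The helper name `utf16_bytes` is hypothetical; it serializes a single 16-bit code unit in a chosen byte order.

```cpp
#include <string>

// Returns the two bytes of one UTF-16 code unit in the requested byte order.
std::string utf16_bytes(char16_t unit, bool big_endian)
{
    char hi = static_cast<char>((unit >> 8) & 0xFF);
    char lo = static_cast<char>(unit & 0xFF);
    std::string out;
    if (big_endian) { out += hi; out += lo; }
    else            { out += lo; out += hi; }
    return out;
}
```

For the letter 'A' (U+0041), UTF-8 is the single byte 0x41 regardless of platform, identical to ASCII. The same character in UTF-16 is either 0x00 0x41 or 0x41 0x00 depending on byte order, and either form contains a zero byte that legacy `char`-based code would read as a string terminator.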

                              The problem is that a true internationalization library is far more
                              complex than what is provided by the C++ standard and on top of that,
                              you don't really need to deal with all these issues all the time.

                              For most uninternationalized applications, not even touching the legacy
                              single byte code and pushing multibyte data through it will render
                              exactly the right results.

                              So, the real question is: what kind of issues are you dealing with that
                              require internationalization?



