8 bit character string to 16 bit character string

  • Brand Bogard

    8 bit character string to 16 bit character string

    Does the C standard include a library function to convert an 8 bit character
    string to a 16 bit character string?


  • Walter Roberson

    #2
    Re: 8 bit character string to 16 bit character string

    In article <e54og7$t0c$1@newshost.mot.com>,
    Brand Bogard <brand.bogard@motorola.com> wrote:
    >Does the C standard include a library function to convert an 8 bit character
    >string to a 16 bit character string?

    No. All that the C standard knows about char is that it is a -minimum-
    of 8 bits long.

    What might interest you, however, is:

    wchar_t is a value superset of char, so if you have an array of wchar_t
    and copy each member of a char array into the corresponding position in
    it, the result will be a valid wchar_t string representing the same text.

    Once you have a wchar_t string, you can use wcstombs() to convert
    it into a locale-dependent multibyte string (cf. LC_CTYPE). If
    your locale has been set up properly, this should do the transformation
    you want.

    By itself, "16 bit character string" is not specific enough: you
    need to know which encoding you are using, such as UTF-16.

    --
    Prototypes are supertypes of their clones. -- maplesoft


    • those who know me have no need of my name

      #3
      Re: 8 bit character string to 16 bit character string

      in comp.lang.c i read:
      >In article <e54og7$t0c$1@newshost.mot.com>,
      >Brand Bogard <brand.bogard@motorola.com> wrote:
      >
      >>Does the C standard include a library function to convert an 8 bit
      >>character string to a 16 bit character string?
      >
      >No. All that the C standard knows about char is that it is a -minimum-
      >of 8 bits long.

      i.e., if you must work with 8 and 16 bit character strings you will need
      custom routines if you want much portability.
      >What might interest you, however is:
      >
      >wchar_t is a value superset of char, so if you have an array of wchar_t
      >and copy each member of a char array in the corresponding position in
      >it, the result will be a valid wchar_t string representing the same text.

      also, if setlocale() has been appropriately used then mbstowcs or mbsrtowcs
      will convert a string (a sequence of char terminated by a null byte '\0',
      where each char may be part of a multi-byte sequence) into a wide-character
      string (a sequence of wchar_t terminated by a null wide character L'\0').
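
A minimal sketch of the mbstowcs route described above. The helper name to_wide is hypothetical, and the buffer is sized on the assumption that a multibyte string never yields more wide characters than it has bytes, so strlen(s) + 1 elements is always enough.

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: convert a multibyte string to a newly allocated
   wide string using the current LC_CTYPE locale. Returns NULL on an
   invalid multibyte sequence or allocation failure; the caller frees. */
wchar_t *to_wide(const char *s)
{
    size_t max = strlen(s) + 1;     /* wide chars never outnumber bytes */
    wchar_t *w = malloc(max * sizeof *w);

    if (w == NULL)
        return NULL;

    if (mbstowcs(w, s, max) == (size_t)-1) {
        free(w);                    /* invalid sequence for this locale */
        return NULL;
    }
    return w;                       /* terminated, since result < max */
}
```

A caller would normally run setlocale(LC_CTYPE, "") once at startup so that the conversion follows the user's environment; without that call the program stays in the "C" locale.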

      --
      a signature


      • Haider

        #4
        Re: 8 bit character string to 16 bit character string

        Try mbstowcs; it will work.


        • Brand Bogard

          #5
          Re: 8 bit character string to 16 bit character string

          "Haider" <hmminto@yahoo. com> wrote in message
          news:1148646763.134013.253570@i40g2000cwc.googlegroups.com...
          > Try mbstowcs it will work.
          >
          mbstowcs isn't in our environment, but mbtowc is. Thanks.
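
Where only mbtowc is available, a string conversion can be built from it one character at a time. The following is a sketch under that assumption; my_mbstowcs is a hypothetical name, and its return convention is modelled on mbstowcs rather than taken from any particular library.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for mbstowcs built on mbtowc. Stores at most
   max wide characters from s into dst; the terminator is written only
   if there is room for it. Returns the number of wide characters stored
   (excluding the terminator), or (size_t)-1 on an invalid sequence. */
size_t my_mbstowcs(wchar_t *dst, const char *s, size_t max)
{
    size_t count = 0;

    mbtowc(NULL, NULL, 0);              /* reset any shift state */

    while (count < max) {
        int len = mbtowc(&dst[count], s, MB_CUR_MAX);

        if (len < 0)
            return (size_t)-1;          /* invalid multibyte sequence */
        if (len == 0)
            return count;               /* stored the terminating L'\0' */
        s += len;
        count++;
    }
    return count;
}
```

As with mbstowcs, the result depends on the LC_CTYPE locale in effect at the time of the call.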



          • Walter Roberson

            #6
            Re: 8 bit character string to 16 bit character string

            In article <e57gg4$6ov$1@newshost.mot.com>,
            Brand Bogard <brand.bogard@motorola.com> wrote:

            >mbstowcs isn't in our environment, but mbtowc is.

            mbstowcs() is part of the C89 standard, and so should be available
            in any hosted environment. I suggest you check <stdlib.h> to see if
            it is declared there.

            mbstowcs() is for converting multibyte character strings into
            wide character strings. Multibyte character strings are not
            necessarily "16 bit characters"; for example, the encoding used might
            normally represent ISO 8859-1 characters as single bytes, only
            shifting into 16+ bit representations when necessary to encode
            characters from other character sets. In some cases, a character
            that requires multiple bytes in the multibyte string might
            convert into a value that fits within a standard (narrow) char.
            The detailed representations of characters in multibyte strings
            are outside the purview of the C standard (other than a constraint
            put upon the null character).

            If you have a (narrow) char string, you cannot convert it to
            a wchar_t string by setting your locale to "C" and then passing
            the string through mbstowcs(). That's because the "C" locale specifies
            a -particular- character encoding, and that encoding might not match
            the encoding of the execution character set, so mbstowcs() might
            map the characters to something unexpected, or could even fail
            (if the execution character set happened to use encodings that
            were incompatible with the encoding structure for the C locale
            character set.)

            Thus, in order to convert a char string into a wider string, you
            have to copy the chars one by one into an array of wchar_t.
            If you need to work with Unicode or UTF-16 or whatever after that,
            then wcstombs() is what you should look at.
            --
            Programming is what happens while you're busy making other plans.


            • Simon Biber

              #7
              Re: 8 bit character string to 16 bit character string

              Walter Roberson wrote:
              > In article <e54og7$t0c$1@newshost.mot.com>,
              > Brand Bogard <brand.bogard@motorola.com> wrote:
              >
              >>Does the C standard include a library function to convert an 8 bit character
              >>string to a 16 bit character string?
              >
              >
              > No. All that the C standard knows about char is that it is a -minimum-
              > of 8 bits long.
              >
              > What might interest you, however is:
              >
              > wchar_t is a value superset of char, so if you have an array of wchar_t
              > and copy each member of a char array in the corresponding position in
              > it, the result will be a valid wchar_t string representing the same text.

              No, in the general case it is not!

              On most of the Linux systems that I admin, wchar_t is UTF-32 and char is
              UTF-8. In that case, if you simply copy each member of a char array in
              the corresponding position to a wchar_t array, it will not be a valid
              wchar_t string representing the same text!

              The same is true for any encoding of the char array apart from ISO-8859-1.

              The standard only guarantees the "value superset" semantics for the
              _basic character set_. (Ref: C99 7.17 paragraph 2)

              Assuming that wchar_t is either UTF-16 or UTF-32, there is only
              one case where char arrays containing characters outside the basic
              character set can be copied wholesale into wchar_t arrays: where the
              encoding of the char array is ISO-8859-1.
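
To make the difference concrete, here is a small sketch of the element-by-element copy. The helper name widen_bytes is hypothetical; it goes through unsigned char so that bytes above 0x7F do not turn into negative values on platforms where char is signed.

```c
#include <stddef.h>

/* Hypothetical helper: copy each char into the corresponding wchar_t
   slot. dst must have room for strlen(src) + 1 elements. Returns the
   number of elements copied, excluding the terminator. */
size_t widen_bytes(wchar_t *dst, const char *src)
{
    size_t i = 0;

    while (src[i] != '\0') {
        dst[i] = (wchar_t)(unsigned char)src[i];
        i++;
    }
    dst[i] = L'\0';
    return i;
}
```

Given the ISO-8859-1 string "\xE9" (e with acute accent), this yields the single element 0xE9, which is also the Unicode code point U+00E9. Given the UTF-8 encoding of the same character, "\xC3\xA9", it yields two elements, 0xC3 and 0xA9, which is not the same text if wchar_t holds Unicode code points.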

              --
              Simon.


              • Stephen Sprunk

                #8
                Re: 8 bit character string to 16 bit character string

                "Walter Roberson" <roberson@ibd.n rc-cnrc.gc.ca> wrote in message
                news:e57imu$cj9$1@canopus.cc.umanitoba.ca...
                > If you have a (narrow) char string, you cannot convert it to
                > a wchar_t string by setting your locale to "C" and then passing
                > the string through mbstowcs(). That's because the "C" locale specifies
                > a -particular- character encoding, and that encoding might not match
                > the encoding of the execution character set, so mbstowcs() might
                > map the characters to something unexpected, or could even fail
                > (if the execution character set happened to use encodings that
                > were incompatible with the encoding structure for the C locale
                > character set.)
                >
                > Thus, in order to convert a char string into a wider string, you
                > have to copy the chars one by one into an array of wchar_t.
                > If you need to work with Unicode or UTF-16 or whatever after that,
                > then wcstombs() is what you should look at.

                Please pardon the tangent...

                Does anyone have a reference to _how to actually use_ the multi-byte / wide
                functions in a real program? I've studied the documentation available, and
                I can't make heads or tails of them or figure out how to do what I want.

                Specifically, I'm looking for a way to read from a text file that is in one
                multibyte encoding, manipulate the contents as wide chars, then write to a
                text file that is in a _different_ multibyte encoding. I'm sure it's
                simple, but I can't find any examples of code using the standard C
                functions, just stuff like <OT>libiconv</OT>.

                S

                --
                Stephen Sprunk    "Stupid people surround themselves with smart
                CCIE #3723        people. Smart people surround themselves with
                K5SSS             smart people who disagree with them." --Aaron Sorkin



                • those who know me have no need of my name

                  #9
                  using mbcs and wide functions (was: Re: 8 bit character string to 16 bit character string)

                  in comp.lang.c i read:

                  >Does anyone have a reference to _how to actually use_ the multi-byte /
                  >wide functions in a real program?

                  the main issue is that it is something of a portability nightmare, at least
                  without resorting to facilities beyond those in the c standard.

                  >Specifically, I'm looking for a way to read from a text file that is
                  >in one multibyte encoding, manipulate the contents as wide chars, then
                  >write to a text file that is in a _different_ multibyte encoding.

                  the main issue is setting the locales properly. since there are few
                  standards for the meaning of the names, and what few exist don't tend to be
                  strict, this means much guessing and potential failures. sometimes this is
                  a non-issue, as a single known (and working) locale is involved for input
                  and output.

                  secondarily is library conformance; specifically whether it supports amd1
                  or c99, vs plain old c89. without amd1 or later you need to read a string
                  then use mbstowcs to convert to a wide string, at which point you can
                  manipulate the various wchar_t. character by character is not possible
                  using just c89 facilities (unless you want to go into the business of
                  decoding character encodings yourself).

                  a program that counts upper-case characters looks nearly the same when
                  insensitive to locale:

                  #include <stdio.h>
                  #include <ctype.h>

                  int main(void)
                  {
                      unsigned long upper = 0;
                      int c;

                      /* no setlocale() call, so isupper() uses the "C" locale */
                      while (EOF != (c = getc(stdin)))
                          if (isupper(c))
                              upper++;

                      printf("There were %lu upper-case characters.\n", upper);

                      return 0;
                  }

                  as when sensitive (w/amd1 or c99 conformance):

                  #include <stdio.h>
                  #include <stdlib.h>
                  #include <locale.h>
                  #include <wchar.h>
                  #include <wctype.h>

                  int main(void)
                  {
                      unsigned long upper = 0;
                      wint_t c;

                      if (0 ==
                          setlocale(LC_CTYPE, "")) /* environment specified locale */
                      {
                          fputs("your locale is invalid, the world ends\n", stderr);
                          abort();
                      }

                      /* getwc() decodes stdin according to the locale set above */
                      while (WEOF != (c = getwc(stdin)))
                          if (iswupper(c))
                              upper++;

                      wprintf(L"There were %lu upper-case characters.\n", upper);

                      return 0;
                  }

                  but your desire for a different locale on output makes it tricky. worse,
                  switching between locales can have issues, so best to get everything done
                  with one locale before moving to the next. you might let the user specify
                  each, and pray they supply valid names:

                  #include <stdio.h>
                  #include <stdlib.h>
                  #include <locale.h>
                  #include <wchar.h>
                  #include <wctype.h>

                  int main(int argc, char *argv[])
                  {
                      unsigned long upper = 0;
                      wint_t c;

                      if (3 != argc)
                      {
                          fputs("incorrect number of arguments\n", stderr);
                          fputs("supply input and output locale names\n", stderr);
                          abort();
                      }

                      if (0 ==
                          setlocale(LC_CTYPE, argv[1])) /* user specified input locale */
                      {
                          fputs("input locale is invalid, the world ends\n", stderr);
                          abort();
                      }
                      /* finish all input under the input locale before switching */
                      while (WEOF != (c = getwc(stdin)))
                          if (iswupper(c))
                              upper++;

                      if (0 ==
                          setlocale(LC_ALL, argv[2])) /* user specified output locale */
                      {
                          fputs("output locale is invalid, the world ends\n", stderr);
                          abort();
                      }
                      wprintf(L"There were %lu upper-case characters.\n", upper);

                      return 0;
                  }
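
The program above only counts characters. For the case asked about (one multibyte encoding in, a different one out), the same "finish with one locale before moving to the next" approach might be sketched on in-memory strings as follows. The helper name transcode and its parameters are hypothetical, the locale names a caller passes are system-specific, and the output buffer size assumes a stateless encoding.

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: re-encode the multibyte string `in` from the
   encoding of locale `from` to that of locale `to`. Returns a malloc'd
   string, or NULL on failure. Leaves LC_CTYPE set to whichever locale
   was selected last. */
char *transcode(const char *in, const char *from, const char *to)
{
    size_t max = strlen(in) + 1;    /* wide chars never outnumber bytes */
    size_t n = 0, outmax, r;
    wchar_t *wide;
    char *out = NULL;

    wide = malloc(max * sizeof *wide);
    if (wide == NULL)
        return NULL;

    /* Step 1: decode under the input locale. */
    if (setlocale(LC_CTYPE, from) == NULL
        || (n = mbstowcs(wide, in, max)) == (size_t)-1) {
        free(wide);
        return NULL;
    }

    /* ... manipulate wide[0] to wide[n - 1] here ... */

    /* Step 2: encode under the output locale. */
    if (setlocale(LC_CTYPE, to) != NULL) {
        outmax = n * MB_CUR_MAX + 1;        /* MB_CUR_MAX of locale `to` */
        out = malloc(outmax);
        if (out != NULL) {
            r = wcstombs(out, wide, outmax);
            if (r == (size_t)-1 || r == outmax) {
                free(out);          /* unrepresentable char or truncation */
                out = NULL;
            }
        }
    }
    free(wide);
    return out;
}
```

A real program would read the file contents into a buffer first, call something like this once, and write the result out, rather than switching locales repeatedly.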

                  though i've used wide string literals, and associated output functions, i
                  haven't actually shown anything that would make them useful, because the
                  form is implementation defined so anything outside the basic character set
                  may not be portable. wonderful, huh? now that isn't to say there is no
                  way to handle it; most people would use a localization (l10n) mechanism
                  like catgets or gettext so that the strings would be fetched from an
                  external resource which is aligned with the implementation requirements.
                  c99 provides a (somewhat clumsy) way to use iso-10646 characters in wide
                  string literals, which increases source portability -- i could have used
                  them here, though that would just make the "c99 isn't real" people come out
                  of the woodwork.
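
For reference, the c99 mechanism mentioned above is the universal character name. A minimal illustration follows; the identifier cafe is only an example, and a c99 or later compiler is assumed.

```c
#include <wchar.h>

/* C99 universal character names: \u followed by four hex digits (or \U
   followed by eight) names an ISO 10646 character independently of the
   encoding of the source file. U+00E9 is LATIN SMALL LETTER E WITH
   ACUTE. The identifier `cafe` is just an illustration. */
static const wchar_t cafe[] = L"caf\u00e9";
```

The literal has four wide characters plus the terminator, whatever encoding the source file itself is saved in.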

                  --
                  a signature
