Multi-byte chars

  • Bill Cunningham

    Multi-byte chars

    I've been reading the C standard online and I'm puzzled as to what multibyte
    chars are. Wide chars I believe would be characters for languages such as
    Cantonese or Japanese. I know the ASCII character set specifies that each
    character such as 'b' or 'B' is an 8 bit character. So what's a multibyte
    character?
    Also how would you use the function parameter main (char argc, char
    **argv) if that's correct?

    Bill





  • Richard Heathfield

    #2
    Re: Multi-byte chars

    Bill Cunningham wrote:
    > I've been reading the C standard online and I'm puzzled as to what
    > multibyte chars are.

    A multibyte character is a "sequence of one or more bytes representing a
    member of the extended character set of either the source or the execution
    environment", if I have the quote from 3.7.2 right.
    > Wide chars I believe would be characters for
    > languages such as Cantonese or Japanese.

    C isn't as specific as that. See 3.7.3.
    > I know the ASCII character set
    > specifies that each character such as 'b' or 'B' is an 8 bit character.

    7 bits, not 8. ASCII is a 7-bit code.

    <snip>

    --
    Richard Heathfield : binary@eton.powernet.co.uk
    "Usenet is a strange place." - Dennis M Ritchie, 29 July 1999.
    C FAQ: http://www.eskimo.com/~scs/C-faq/top.html
    K&R answers, C books, etc: http://users.powernet.co.uk/eton

    • lawrence.jones@eds.com

      #3
      Re: Multi-byte chars

      Bill Cunningham <some@some.net> wrote:
      >
      > I've been reading the C standard online and I'm puzzled as to what multibyte
      > chars are. Wide chars I believe would be characters for languages such as
      > Cantonese or Japanese. I know the ASCII character set specifies that each
      > character such as 'b' or 'B' is an 8 bit character. So what's a multibyte
      > character?

      A single logical character that requires more than one byte to express.
      For example, consider the UTF-8 encoding format for ISO 10646: normal
      ASCII characters (between \x00 and \x7f) are encoded as a single byte
      with the same value. Other characters are encoded as multiple bytes,
      each of which has the top bit set; the first byte is in the range \xc0
      to \xfd and indicates the number of bytes that follow, and subsequent
      bytes are in the range \x80 to \xbf. As originally defined, UTF-8
      encoded characters can be any length between one and six bytes; RFC 3629
      later restricted them to at most four bytes, with first bytes from \xc2
      to \xf4. So 'A' is encoded as \x41 but '©' (the copyright sign) is
      encoded as \xc2\xa9.

      Multibyte encodings can be very space efficient, but they are difficult
      to process since different characters have different lengths. Wide
      characters, on the other hand, are intended to be efficient for
      processing, but not necessarily space efficient. Wide characters are
      integers that are large enough so that every logical character can be
      represented in just one wide character.

      -Larry Jones

      If I get a bad grade, it'll be YOUR fault for not doing the work for me!
      -- Calvin

      • Jun Woong

        #4
        Re: Multi-byte chars


        <lawrence.jones@eds.com> wrote in message news:nvn9eb.8g.ln@cvg-65-27-189-87.cinci.rr.com...
        > Bill Cunningham <some@some.net> wrote:
        > >
        > > I've been reading the C standard online and I'm puzzled as to what multibyte
        > > chars are. Wide chars I believe would be characters for languages such as
        > > Cantonese or Japanese. I know the ASCII character set specifies that each
        > > character such as 'b' or 'B' is an 8 bit character. So what's a multibyte
        > > character?
        >
        > A single logical character that requires more than one byte to express.
        > For example, consider the UTF-8 encoding format for ISO 10646: normal
        > ASCII characters (between \x00 and \x7f) are encoded as a single byte
        > with the same value.

        My understanding is that the standard requires 'A' == L'A' by the fact
        that the basic character set must be a subset of the extended
        character set. Do this and what you mentioned above mean that a
        character set whose code values differ from ASCII's can't be the basic
        set on an implementation where code values of Unicode are used as those
        of the extended set?


        --
        Jun, Woong (mycoboco@hanmail.net)
        Dept. of Physics, Univ. of Seoul



        • Dan Pop

          #5
          Re: Multi-byte chars

          In <bebmda$1ho$1@news.hananet.net> "Jun Woong" <mycoboco@hanmail.net> writes:

          ><lawrence.jones@eds.com> wrote in message news:nvn9eb.8g.ln@cvg-65-27-189-87.cinci.rr.com...
          >> Bill Cunningham <some@some.net> wrote:
          >> >
          >> > I've been reading the C standard online and I'm puzzled as to what multibyte
          >> > chars are. Wide chars I believe would be characters for languages such as
          >> > Cantonese or Japanese. I know the ASCII character set specifies that each
          >> > character such as 'b' or 'B' is an 8 bit character. So what's a multibyte
          >> > character?
          >>
          >> A single logical character that requires more than one byte to express.
          >> For example, consider the UTF-8 encoding format for ISO 10646: normal
          >> ASCII characters (between \x00 and \x7f) are encoded as a single byte
          >> with the same value.
          >
          >My understanding is that the standard requires 'A' == L'A' by the fact
          >that the basic character set must be a subset of the extended
          >character set.

          Non sequitur. The fact that A belongs to the basic character set has
          no relevance on the value of L'A', AFAICT. All the standard has to say
          on the issue is:

          11 A wide character constant has type wchar_t, an integer type
          defined in the <stddef.h> header. The value of a wide character
          constant containing a single multibyte character that maps to
          a member of the extended execution character set is the wide
          character corresponding to that multibyte character, as defined
          by the mbtowc function, with an implementation-defined current
          locale.
          >Do this and what you mentioned above mean that a
          >character set whose code values differ from ASCII's can't be the basic
          >set on an implementation where code values of Unicode are used as those
          >of the extended set?

          Nope, he was merely describing what happens on an implementation using
          ASCII for normal characters and UCS for wide characters (therefore UTF-8
          for multi-byte characters).

          There is nothing preventing an implementation from using EBCDIC for
          normal characters and UCS for wide characters, in which case it is foolish
          to expect 'A' == L'A'.

          Furthermore, there is nothing preventing an implementation from using
          ASCII for normal characters and EBCDIC for wide characters (or vice
          versa). The fact that C99 supports UCNs in source code means nothing WRT
          the execution character set (whose extended version need not contain any
          additional characters).

          Dan
          --
          Dan Pop
          DESY Zeuthen, RZ group
          Email: Dan.Pop@ifh.de

          • Jun Woong

            #6
            Re: Multi-byte chars


            "Dan Pop" <Dan.Pop@cern.ch> wrote in message news:bebts2$kf2$2@sunnews.cern.ch...
            > In <bebmda$1ho$1@news.hananet.net> "Jun Woong" <mycoboco@hanmail.net> writes:
            >
            > ><lawrence.jones@eds.com> wrote in message news:nvn9eb.8g.ln@cvg-65-27-189-87.cinci.rr.com...
            > >> Bill Cunningham <some@some.net> wrote:
            > >> >
            > >> > I've been reading the C standard online and I'm puzzled as to what multibyte
            > >> > chars are. Wide chars I believe would be characters for languages such as
            > >> > Cantonese or Japanese. I know the ASCII character set specifies that each
            > >> > character such as 'b' or 'B' is an 8 bit character. So what's a multibyte
            > >> > character?
            > >>
            > >> A single logical character that requires more than one byte to express.
            > >> For example, consider the UTF-8 encoding format for ISO 10646: normal
            > >> ASCII characters (between \x00 and \x7f) are encoded as a single byte
            > >> with the same value.
            > >
            > >My understanding is that the standard requires 'A' == L'A' by the fact
            > >that the basic character set must be a subset of the extended
            > >character set.
            >
            > Non sequitur. The fact that A belongs to the basic character set has
            > no relevance on the value of L'A', AFAICT. All the standard has to say
            > on the issue is:
            >
            > 11 A wide character constant has type wchar_t, an integer type
            > defined in the <stddef.h> header. The value of a wide character
            > constant containing a single multibyte character that maps to
            > a member of the extended execution character set is the wide
            > character corresponding to that multibyte character, as defined
            > by the mbtowc function, with an implementation-defined current
            > locale.

            And in 7.17p2:

            wchar_t

            which is an integer type whose range of values can represent
            distinct codes for all members of the largest extended character
            set specified among the supported locales; the null character
            shall have the code value zero and each member of the basic
            character set shall have a code value equal to its value when used
            as the lone character in an integer character constant.


            --
            Jun, Woong (mycoboco@hanmail.net)
            Dept. of Physics, Univ. of Seoul



            • lawrence.jones@eds.com

              #7
              Re: Multi-byte chars

              Jun Woong <mycoboco@hanmail.net> wrote:
              >
              > My understanding is that the standard requires 'A' == L'A' by the fact
              > that the basic character set must be a subset of the extended
              > character set. Do this and what you mentioned above mean that a
              > character set whose code values differ from ASCII's can't be the basic
              > set on an implementation where code values of Unicode are used as those
              > of the extended set?

              Yes, but. That requirement is a hold-over from the very earliest days of
              extended character set support, before there were functions to convert
              between wide and narrow characters. Now that those functions exist,
              there is no longer any reason for the requirement, and the committee has
              voted to remove it. See the committee's response to DR #279:

              <http://std.dkuug.dk/JTC1/SC22/WG14/www/docs/dr_279.htm>

              -Larry Jones

              Somebody's always running my life. I never get to do what I want to do.
              -- Calvin

              • Dan Pop

                #8
                Re: Multi-byte chars

                In <bec2kb$gjc$1@news.hananet.net> "Jun Woong" <mycoboco@hanmail.net> writes:

                >And in 7.17p2:
                >
                > wchar_t
                >
                > which is an integer type whose range of values can represent
                > distinct codes for all members of the largest extended character
                > set specified among the supported locales; the null character
                > shall have the code value zero and each member of the basic
                > character set shall have a code value equal to its value when used
                > as the lone character in an integer character constant.

                This requirement, carried on from C89, is simply broken: implementations
                that don't use ASCII for normal characters wouldn't be able to use *any*
                of the ASCII extensions (UCS, most importantly) for wide characters.

                Dan
                --
                Dan Pop
                DESY Zeuthen, RZ group
                Email: Dan.Pop@ifh.de

                • Jun Woong

                  #9
                  Re: Multi-byte chars


                  "Dan Pop" <Dan.Pop@cern.ch> wrote in message news:bec92v$p1g$1@sunnews.cern.ch...
                  > In <bec2kb$gjc$1@news.hananet.net> "Jun Woong" <mycoboco@hanmail.net> writes:
                  >
                  > >And in 7.17p2:
                  > >
                  > > wchar_t
                  > >
                  > > which is an integer type whose range of values can represent
                  > > distinct codes for all members of the largest extended character
                  > > set specified among the supported locales; the null character
                  > > shall have the code value zero and each member of the basic
                  > > character set shall have a code value equal to its value when used
                  > > as the lone character in an integer character constant.
                  >
                  > This requirement, carried on from C89, is simply broken: implementations
                  > that don't use ASCII for normal characters wouldn't be able to use *any*
                  > of the ASCII extensions (UCS, most importantly) for wide characters.

                  Then, the proper answer to my previous question should be mention of
                  the DR in process, not citation of an irrelevant wording.


                  --
                  Jun, Woong (mycoboco@hanmail.net)
                  Dept. of Physics, Univ. of Seoul



                  • Jun Woong

                    #10
                    Re: Multi-byte chars


                    <lawrence.jones@eds.com> wrote in message news:732ceb.07f.ln@cvg-65-27-189-87.cinci.rr.com...
                    [...]
                    >
                    > Yes, but. That requirement is a hold-over from the very earliest days of
                    > extended character set support, before there were functions to convert
                    > between wide and narrow characters. Now that those functions exist,
                    > there is no longer any reason for the requirement,

                    Weren't there some conversion functions between wide and multibyte
                    characters in C90? Do you mean that the wording in question was
                    written before the C89 committee decided to put those functions into
                    the standard, or that now we have a more complete set of functions to
                    deal with wide and multibyte characters so don't need the requirement
                    any more?


                    --
                    Jun, Woong (mycoboco@hanmail.net)
                    Dept. of Physics, Univ. of Seoul



                    • lawrence.jones@eds.com

                      #11
                      Re: Multi-byte chars

                      Jun Woong <mycoboco@hanmail.net> wrote:
                      >
                      > Weren't there some conversion functions between wide and multibyte
                      > characters in C90? Do you mean that the wording in question was
                      > written before the C89 committee decided to put those functions into
                      > the standard, or that now we have a more complete set of functions to
                      > deal with wide and multibyte characters so don't need the requirement
                      > any more?

                      There were conversions between wide characters and multibyte *strings*,
                      but there weren't any conversions dealing with single byte characters
                      until btowc() and wctob() were added in NA1.

                      -Larry Jones

                      Oh yeah? You just wait! -- Calvin

                      • Dan Pop

                        #12
                        Re: Multi-byte chars

                        In <becb37$n94$1@news.hananet.net> "Jun Woong" <mycoboco@hanmail.net> writes:

                        >"Dan Pop" <Dan.Pop@cern.ch> wrote in message news:bec92v$p1g$1@sunnews.cern.ch...
                        >> In <bec2kb$gjc$1@news.hananet.net> "Jun Woong" <mycoboco@hanmail.net> writes:
                        >>
                        >> >And in 7.17p2:
                        >> >
                        >> > wchar_t
                        >> >
                        >> > which is an integer type whose range of values can represent
                        >> > distinct codes for all members of the largest extended character
                        >> > set specified among the supported locales; the null character
                        >> > shall have the code value zero and each member of the basic
                        >> > character set shall have a code value equal to its value when used
                        >> > as the lone character in an integer character constant.
                        >>
                        >> This requirement, carried on from C89, is simply broken: implementations
                        >> that don't use ASCII for normal characters wouldn't be able to use *any*
                        >> of the ASCII extensions (UCS, most importantly) for wide characters.
                        >
                        >Then, the proper answer to my previous question should be mention of
                        >the DR in process, not citation of an irrelevant wording.

                        I have quoted the *relevant* wording. The library clause has no business
                        defining the semantics of wide characters, which are a language issue.

                        Dan
                        --
                        Dan Pop
                        DESY Zeuthen, RZ group
                        Email: Dan.Pop@ifh.de

                        • Jun Woong

                          #13
                          Re: Multi-byte chars


                          <lawrence.jones@eds.com> wrote in message news:stleeb.j0s.ln@cvg-65-27-189-87.cinci.rr.com...
                          > Jun Woong <mycoboco@hanmail.net> wrote:
                          > >
                          > > Weren't there some conversion functions between wide and multibyte
                          > > characters in C90? Do you mean that the wording in question was
                          > > written before the C89 committee decided to put those functions into
                          > > the standard, or that now we have a more complete set of functions to
                          > > deal with wide and multibyte characters so don't need the requirement
                          > > any more?
                          >
                          > There were conversions between wide characters and multibyte *strings*,
                          > but there weren't any conversions dealing with single byte characters
                          > until btowc() and wctob() were added in NA1.

                          Oh, now I see your point, thank you. I was thinking of it from the
                          viewpoint of an implementer, who has full access to the internal
                          state for the conversion.


                          --
                          Jun, Woong (mycoboco@hanmail.net)
                          Dept. of Physics, Univ. of Seoul



                          • Jun Woong

                            #14
                            Re: Multi-byte chars


                            "Dan Pop" <Dan.Pop@cern.ch> wrote in message news:beer3s$6c5$3@sunnews.cern.ch...
                            > In <becb37$n94$1@news.hananet.net> "Jun Woong" <mycoboco@hanmail.net> writes:
                            [...]
                            > >
                            > >Then, the proper answer to my previous question should be mention of
                            > >the DR in process, not citation of an irrelevant wording.
                            >
                            > I have quoted the *relevant* wording. The library clause has no business
                                                                    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                            > defining the semantics of wide characters, which are a language issue.
                              ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

                            Sorry, but this makes me feel that it's not worth discussing this
                            problem with you any more. Some implementations of the standard
                            library depended on that '%' == L'%' with the requirement of C90,
                            and it was a reliable choice in practice *at that time*.


                            --
                            Jun, Woong (mycoboco@hanmail.net)
                            Dept. of Physics, Univ. of Seoul



                            • Dan Pop

                              #15
                              Re: Multi-byte chars

                              In <beg43f$se3$1@news.hananet.net> "Jun Woong" <mycoboco@hanmail.net> writes:

                              >"Dan Pop" <Dan.Pop@cern.ch> wrote in message news:beer3s$6c5$3@sunnews.cern.ch...
                              >> In <becb37$n94$1@news.hananet.net> "Jun Woong" <mycoboco@hanmail.net> writes:
                              >[...]
                              >> >
                              >> >Then, the proper answer to my previous question should be mention of
                              >> >the DR in process, not citation of an irrelevant wording.
                              >>
                              >> I have quoted the *relevant* wording. The library clause has no business
                              >                                        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                              >> defining the semantics of wide characters, which are a language issue.
                              >  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                              >
                              >Sorry, but this makes me feel that it's not worth discussing this
                              >problem with you any more.

                              As I've already told you, you're always welcome to ignore my posts.
                              The text you've underlined makes perfect sense to me (otherwise I
                              wouldn't have written it in the first place).
                              >Some implementations of the standard
                              >library depended on that '%' == L'%' with the requirement of C90,
                              >and it was a reliable choice in practice *at that time*.

                              The implementor can depend on *anything* he wants, because he has full
                              control over the implementation, he doesn't need any guarantees from the
                              standard about the relationship between normal characters and wide
                              characters because he knows *exactly* what this relationship is on that
                              particular implementation.

                              I thought this was obvious to you...

                              Dan
                              --
                              Dan Pop
                              DESY Zeuthen, RZ group
                              Email: Dan.Pop@ifh.de
