Multi-byte chars

**Richard Heathfield** · Nov 13 '05, 03:23 AM

Re: Multi-byte chars

Bill Cunningham wrote:
> I've been reading the C standard online and I'm puzzled as to what
> multibyte chars are.

A multibyte character is a "sequence of one or more bytes representing a
member of the extended character set of either the source or the execution
environment", if I have the quote from 3.7.2 right.

> Wide chars I believe would be characters for
> languages such as cantonese or Japanese.

C isn't as specific as that. See 3.7.3.

> I know the ASCII character set
> specifies that each character such as 'b' or 'B' is an 8 bit character.

7 bits, not 8. ASCII is a 7-bit code.

<snip>

--
Richard Heathfield : binary@eton.powernet.co.uk
"Usenet is a strange place." - Dennis M Ritchie, 29 July 1999.
C FAQ: http://www.eskimo.com/~scs/C-faq/top.html
K&R answers, C books, etc: http://users.powernet.co.uk/eton

**lawrence.jones@eds.com** · Nov 13 '05, 03:23 AM

Re: Multi-byte chars

Bill Cunningham <some@some.net> wrote:
>
> I've been reading the C standard online and I'm puzzled as to what multibyte
> chars are. Wide chars I believe would be characters for languages such as
> cantonese or Japanese. I know the ASCII character set specifies that each
> character such as 'b' or 'B' is an 8 bit character. So what's a multibyte
> character?

A single logical character that requires more than one byte to express.
For example, consider the UTF-8 encoding format for ISO 10646: normal
ASCII characters (between \x00 and \x7f) are encoded as a single byte
with the same value. Other characters are encoded as multiple bytes,
each of which has the top bit set; the first byte is in the range \xc0
to \xfd and indicates the number of bytes that follow; subsequent bytes
are in the range \x80 to \xbf. As originally specified, UTF-8 encoded
characters can be any length between one and six bytes (RFC 3629 later
limited this to four bytes). So 'A' is encoded as \x41 but '©'
(the copyright sign) is encoded as \xc2\xa9.
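
As a rough illustration of the encoding rules described above, here is a minimal sketch of an encoder in C. The function name `utf8_encode` is made up for this example, and the sketch stops at four bytes (the limit in RFC 3629) rather than the six allowed by the original definition.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one code point as UTF-8 into out, which
 * must have room for at least 4 bytes. Returns the number of bytes
 * written, or 0 if cp is a surrogate or lies above U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char *out)
{
    assert(out != NULL);

    if (cp < 0x80) {                 /* ASCII: one byte, same value */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {              /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {            /* 11110xxx followed by three 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Called with 0x41 ('A') it writes the single byte \x41; called with 0xA9 (the copyright sign) it writes \xc2\xa9, matching the two examples in the text.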

Multibyte encodings can be very space efficient, but they are difficult
to process since different characters have different lengths. Wide
characters, on the other hand, are intended to be efficient for
processing, but not necessarily space efficient. Wide characters are
integers that are large enough so that every logical character can be
represented in just one wide character.
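
To make the trade-off concrete, here is a small sketch assuming the multibyte encoding is UTF-8. The helper `utf8_count` is hypothetical and does no validation. It has to walk the bytes to find out how many characters there are, whereas a wide string holds one element per character.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: count logical characters in a UTF-8 string by
 * skipping continuation bytes (those of the form 10xxxxxx).
 * Assumes the input is valid UTF-8. */
size_t utf8_count(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    return n;
}
```

For the string "A\xc2\xa9" (an 'A' followed by the copyright sign), `strlen` reports 3 bytes while `utf8_count` reports 2 characters; the wide string L"A\u00a9" has a `wcslen` of 2 directly.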

-Larry Jones

If I get a bad grade, it'll be YOUR fault for not doing the work for me!
-- Calvin

**Jun Woong** · Nov 13 '05, 03:24 AM

Re: Multi-byte chars

<lawrence.jones@eds.com> wrote in message news:nvn9eb.8g.ln@cvg-65-27-189-87.cinci.rr.com...
> Bill Cunningham <some@some.net> wrote:
> >
> > I've been reading the C standard online and I'm puzzled as to what multibyte
> > chars are. Wide chars I believe would be characters for languages such as
> > cantonese or Japanese. I know the ASCII character set specifies that each
> > character such as 'b' or 'B' is an 8 bit character. So what's a multibyte
> > character?
>
> A single logical character that requires more than one byte to express.
> For example, consider the UTF-8 encoding format for ISO 10646: normal
> ASCII characters (between \x00 and \x7f) are encoded as a single byte
> with the same value.

My understanding is that the standard requires 'A' == L'A' by the fact
that the basic character set must be a subset of the extended
character set. Do this and what you mentioned above mean that a
character set whose code values differ from ASCII's can't be the basic
set on an implementation where code values of Unicode is used as those
of the extended set?

--
Jun, Woong (mycoboco@hanmail.net)
Dept. of Physics, Univ. of Seoul

**Dan Pop** · Nov 13 '05, 03:24 AM

Re: Multi-byte chars

In <bebmda$1ho$1@news.hananet.net> "Jun Woong" <mycoboco@hanmail.net> writes:

><lawrence.jones@eds.com> wrote in message news:nvn9eb.8g.ln@cvg-65-27-189-87.cinci.rr.com...
>> Bill Cunningham <some@some.net> wrote:
>> >
>> > I've been reading the C standard online and I'm puzzled as to what multibyte
>> > chars are. Wide chars I believe would be characters for languages such as
>> > cantonese or Japanese. I know the ASCII character set specifies that each
>> > character such as 'b' or 'B' is an 8 bit character. So what's a multibyte
>> > character?
>>
>> A single logical character that requires more than one byte to express.
>> For example, consider the UTF-8 encoding format for ISO 10646: normal
>> ASCII characters (between \x00 and \x7f) are encoded as a single byte
>> with the same value.
>
>My understanding is that the standard requires 'A' == L'A' by the fact
>that the basic character set must be a subset of the extended
>character set.

Non sequitur. The fact that A belongs to the basic character set has
no relevance on the value of L'A', AFAICT. All the standard has to say
on the issue is:

11 A wide character constant has type wchar_t, an integer type
defined in the <stddef.h> header. The value of a wide character
constant containing a single multibyte character that maps to
a member of the extended execution character set is the wide
character corresponding to that multibyte character, as defined
by the mbtowc function, with an implementation-defined current
locale.

>Do this and what you mentioned above mean that a
>character set whose code values differ from ASCII's can't be the basic
>set on an implementation where code values of Unicode is used as those
>of the extended set?

Nope, he was merely describing what happens on an implementation using
ASCII for normal characters and UCS for wide characters (therefore UTF-8
for multi-byte characters).

There is nothing preventing an implementation from using EBCDIC for
normal characters and UCS for wide characters, in which case it is foolish
to expect 'A' == L'A'.

Furthermore, there is nothing preventing an implementation from using
ASCII for normal characters and EBCDIC for wide characters (or vice
versa). The fact that C99 supports UCNs in source code means nothing WRT
the execution character set (whose extended version need not contain any
additional characters).
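
A small sketch may help show why the two code sets cannot agree in the EBCDIC-plus-UCS case. The table below is illustrative only: the struct and function names are made up, and the EBCDIC values are those of the common Latin code pages (code page 037, for example).

```c
#include <stddef.h>

/* Illustrative code values for a few basic characters. For these
 * characters the UCS value is the same as the ASCII value. */
struct code_pair {
    char          name;     /* the character, for reference only */
    unsigned char ebcdic;   /* value as a normal character on EBCDIC */
    unsigned long ucs;      /* value as a wide character in UCS */
};

static const struct code_pair samples[] = {
    { 'A', 0xC1, 0x41 },
    { 'a', 0x81, 0x61 },
    { '0', 0xF0, 0x30 },
    { ' ', 0x40, 0x20 },
};

/* Number of sampled characters whose EBCDIC and UCS values agree. */
size_t count_agreeing(void)
{
    size_t i, n = 0;

    for (i = 0; i < sizeof samples / sizeof samples[0]; i++)
        if (samples[i].ebcdic == samples[i].ucs)
            n++;
    return n;
}
```

None of the sampled values agree, so on such an implementation 'A' would be 0xC1 while L'A' would be 0x41.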

Dan
--
Dan Pop
DESY Zeuthen, RZ group
Email: Dan.Pop@ifh.de

**Jun Woong** · Nov 13 '05, 03:24 AM

Re: Multi-byte chars

"Dan Pop" <Dan.Pop@cern.ch> wrote in message news:bebts2$kf2$2@sunnews.cern.ch...
> In <bebmda$1ho$1@news.hananet.net> "Jun Woong" <mycoboco@hanmail.net> writes:
>
> ><lawrence.jones@eds.com> wrote in message news:nvn9eb.8g.ln@cvg-65-27-189-87.cinci.rr.com...
> >> Bill Cunningham <some@some.net> wrote:
> >> >
> >> > I've been reading the C standard online and I'm puzzled as to what multibyte
> >> > chars are. Wide chars I believe would be characters for languages such as
> >> > cantonese or Japanese. I know the ASCII character set specifies that each
> >> > character such as 'b' or 'B' is an 8 bit character. So what's a multibyte
> >> > character?
> >>
> >> A single logical character that requires more than one byte to express.
> >> For example, consider the UTF-8 encoding format for ISO 10646: normal
> >> ASCII characters (between \x00 and \x7f) are encoded as a single byte
> >> with the same value.
> >
> >My understanding is that the standard requires 'A' == L'A' by the fact
> >that the basic character set must be a subset of the extended
> >character set.
>
> Non sequitur. The fact that A belongs to the basic character set has
> no relevance on the value of L'A', AFAICT. All the standard has to say
> on the issue is:
>
> 11 A wide character constant has type wchar_t, an integer type
> defined in the <stddef.h> header. The value of a wide character
> constant containing a single multibyte character that maps to
> a member of the extended execution character set is the wide
> character corresponding to that multibyte character, as defined
> by the mbtowc function, with an implementation-defined current
> locale.

And in 7.17p2:

wchar_t

which is an integer type whose range of values can represent
distinct codes for all members of the largest extended character
set specified among the supported locales; the null character
shall have the code value zero and each member of the basic
character set shall have a code value equal to its value when used
as the lone character in an integer character constant.
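
A quick way to see whether a given implementation satisfies this wording is to compare a few narrow and wide constants. This is only a sketch: the struct and function names are made up, and it samples a handful of basic characters rather than the whole set. Later revisions of the standard (C11, for instance) let an implementation define the macro `__STDC_MB_MIGHT_NEQ_WC__` to say that the equality is not guaranteed, so the sketch reports that as well.

```c
#include <stddef.h>

/* One basic character, written as a narrow and as a wide constant. */
struct char_pair {
    char    narrow;
    wchar_t wide;
};

/* Count sampled basic characters for which 'c' != L'c'. */
size_t count_mismatches(void)
{
    static const struct char_pair samples[] = {
        { 'A', L'A' }, { 'z', L'z' }, { '0', L'0' },
        { '%', L'%' }, { ' ', L' ' }, { '\n', L'\n' }
    };
    size_t i, n = 0;

    for (i = 0; i < sizeof samples / sizeof samples[0]; i++)
        if ((wchar_t)samples[i].narrow != samples[i].wide)
            n++;
    return n;
}

/* 1 if the implementation announces that 'c' == L'c' may not hold. */
int mb_might_neq_wc(void)
{
#ifdef __STDC_MB_MIGHT_NEQ_WC__
    return 1;
#else
    return 0;
#endif
}
```

If `mb_might_neq_wc()` returns 0, the quoted requirement applies and `count_mismatches()` should be 0 on a conforming implementation.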

--
Jun, Woong (mycoboco@hanmail.net)
Dept. of Physics, Univ. of Seoul

**lawrence.jones@eds.com** · Nov 13 '05, 03:24 AM

Re: Multi-byte chars

Jun Woong <mycoboco@hanmail.net> wrote:
>
> My understanding is that the standard requires 'A' == L'A' by the fact
> that the basic character set must be a subset of the extended
> character set. Do this and what you mentioned above mean that a
> character set whose code values differ from ASCII's can't be the basic
> set on an implementation where code values of Unicode is used as those
> of the extended set?

Yes, but. That requirement is a hold-over from the very earliest days of
extended character set support, before there were functions to convert
between wide and narrow characters. Now that those functions exist,
there is no longer any reason for the requirement, and the committee has
voted to remove it. See the committee's response to DR #279:

<http://std.dkuug.dk/JTC1/SC22/WG14/www/docs/dr_279.htm>

-Larry Jones

Somebody's always running my life. I never get to do what I want to do.
-- Calvin

**Dan Pop** · Nov 13 '05, 03:24 AM

Re: Multi-byte chars

In <bec2kb$gjc$1@news.hananet.net> "Jun Woong" <mycoboco@hanmail.net> writes:

>And in 7.17p2:
>
> wchar_t
>
> which is an integer type whose range of values can represent
> distinct codes for all members of the largest extended character
> set specified among the supported locales; the null character
> shall have the code value zero and each member of the basic
> character set shall have a code value equal to its value when used
> as the lone character in an integer character constant.

This requirement, carried on from C89, is simply broken: implementations
that don't use ASCII for normal characters wouldn't be able to use *any*
of the ASCII extensions (UCS, most importantly) for wide characters.

Dan
--
Dan Pop
DESY Zeuthen, RZ group
Email: Dan.Pop@ifh.de

**Jun Woong** · Nov 13 '05, 03:24 AM

Re: Multi-byte chars

"Dan Pop" <Dan.Pop@cern.ch> wrote in message news:bec92v$p1g$1@sunnews.cern.ch...
> In <bec2kb$gjc$1@news.hananet.net> "Jun Woong" <mycoboco@hanmail.net> writes:
>
>
> >And in 7.17p2:
> >
> > wchar_t
> >
> > which is an integer type whose range of values can represent
> > distinct codes for all members of the largest extended character
> > set specified among the supported locales; the null character
> > shall have the code value zero and each member of the basic
> > character set shall have a code value equal to its value when used
> > as the lone character in an integer character constant.
>
> This requirement, carried on from C89, is simply broken: implementations
> that don't use ASCII for normal characters wouldn't be able to use *any*
> of the ASCII extensions (UCS, most importantly) for wide characters.
>

Then, the proper answer to my previous question should be mention of
the DR in process, not citation of an irrelevant wording.

--
Jun, Woong (mycoboco@hanmail.net)
Dept. of Physics, Univ. of Seoul

**Jun Woong** · Nov 13 '05, 03:25 AM

Re: Multi-byte chars

<lawrence.jones@eds.com> wrote in message news:732ceb.07f.ln@cvg-65-27-189-87.cinci.rr.com...
[...]
>
> Yes, but. That requirement is a hold-over from the very earliest days of
> extended character set support, before there were functions to convert
> between wide and narrow characters. Now that those functions exist,
> there is no longer any reason for the requirement,

Weren't there some conversion functions between wide and multibyte
characters in C90? Do you mean that the wording in question was
written before the C89 committee decided to put those functions into
the standard, or that now we have more complete set of functions to
deal with wide and multibyte characters so don't need the requirement
any more?

--
Jun, Woong (mycoboco@hanmail.net)
Dept. of Physics, Univ. of Seoul

**lawrence.jones@eds.com** · Nov 13 '05, 03:26 AM

Re: Multi-byte chars

Jun Woong <mycoboco@hanmail.net> wrote:
>
> Weren't there some conversion functions between wide and multibyte
> characters in C90? Do you mean that the wording in question was
> written before the C89 committee decided to put those functions into
> the standard, or that now we have more complete set of functions to
> deal with wide and multibyte characters so don't need the requirement
> any more?

There were conversions between wide characters and multibyte *strings*,
but there weren't any conversions dealing with single byte characters
until btowc() and wctob() were added in NA1.
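
For readers who have not used them, here is a minimal sketch of the two functions working together. The wrapper name `roundtrip_byte` is made up for this example; `btowc()` and `wctob()` themselves are declared in <wchar.h> and use the current locale (the "C" locale unless the program calls `setlocale`).

```c
#include <stdio.h>
#include <wchar.h>

/* Hypothetical helper: convert a single-byte character to a wide
 * character and back. Returns EOF if c is EOF or is not a valid
 * single-byte character in the initial shift state. */
int roundtrip_byte(int c)
{
    wint_t wc = btowc(c);      /* single byte -> wide character */

    if (wc == WEOF)
        return EOF;
    return wctob(wc);          /* wide character -> single byte */
}
```

With these functions a program can ask the library for the wide form of 'A' instead of assuming that 'A' == L'A'.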

-Larry Jones

Oh yeah? You just wait! -- Calvin

**Dan Pop** · Nov 13 '05, 03:26 AM

Re: Multi-byte chars

In <becb37$n94$1@news.hananet.net> "Jun Woong" <mycoboco@hanmail.net> writes:

>"Dan Pop" <Dan.Pop@cern.ch> wrote in message news:bec92v$p1g$1@sunnews.cern.ch...
>> In <bec2kb$gjc$1@news.hananet.net> "Jun Woong" <mycoboco@hanmail.net> writes:
>>
>>
>> >And in 7.17p2:
>> >
>> > wchar_t
>> >
>> > which is an integer type whose range of values can represent
>> > distinct codes for all members of the largest extended character
>> > set specified among the supported locales; the null character
>> > shall have the code value zero and each member of the basic
>> > character set shall have a code value equal to its value when used
>> > as the lone character in an integer character constant.
>>
>> This requirement, carried on from C89, is simply broken: implementations
>> that don't use ASCII for normal characters wouldn't be able to use *any*
>> of the ASCII extensions (UCS, most importantly) for wide characters.
>
>Then, the proper answer to my previous question should be mention of
>the DR in process, not citation of an irrelevant wording.

I have quoted the *relevant* wording. The library clause has no business
defining the semantics of wide characters, which are a language issue.

Dan
--
Dan Pop
DESY Zeuthen, RZ group
Email: Dan.Pop@ifh.de

**Jun Woong** · Nov 13 '05, 03:27 AM

Re: Multi-byte chars

<lawrence.jones@eds.com> wrote in message news:stleeb.j0s.ln@cvg-65-27-189-87.cinci.rr.com...
> Jun Woong <mycoboco@hanmail.net> wrote:
> >
> > Weren't there some conversion functions between wide and multibyte
> > characters in C90? Do you mean that the wording in question was
> > written before the C89 committee decided to put those functions into
> > the standard, or that now we have more complete set of functions to
> > deal with wide and multibyte characters so don't need the requirement
> > any more?
>
> There were conversions between wide characters and multibyte *strings*,
> but there weren't any conversions dealing with single byte characters
> until btowc() and wctob() were added in NA1.
>

Oh, now I see your point, thank you. I was thinking of it from the
viewpoint of an implementer, who has full access to the internal state
for the conversion.

--
Jun, Woong (mycoboco@hanmail.net)
Dept. of Physics, Univ. of Seoul

**Jun Woong** · Nov 13 '05, 03:27 AM

Re: Multi-byte chars

"Dan Pop" <Dan.Pop@cern.ch> wrote in message news:beer3s$6c5$3@sunnews.cern.ch...
> In <becb37$n94$1@news.hananet.net> "Jun Woong" <mycoboco@hanmail.net> writes:
[...]
> >
> >Then, the proper answer to my previous question should be mention of
> >the DR in process, not citation of an irrelevant wording.
>
> I have quoted the *relevant* wording. The library clause has no business
                                        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> defining the semantics of wide characters, which are a language issue.
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>

Sorry, but this makes me feel that it's not worth discussing this
problem with you any more. Some implementations of the standard
library depended on that '%' == L'%' with the requirement of C90,
and it was a reliable choice in practice *at that time*.

--
Jun, Woong (mycoboco@hanmail.net)
Dept. of Physics, Univ. of Seoul

**Dan Pop** · Nov 13 '05, 03:27 AM

Re: Multi-byte chars

In <beg43f$se3$1@news.hananet.net> "Jun Woong" <mycoboco@hanmail.net> writes:

>"Dan Pop" <Dan.Pop@cern.ch> wrote in message news:beer3s$6c5$3@sunnews.cern.ch...
>> In <becb37$n94$1@news.hananet.net> "Jun Woong" <mycoboco@hanmail.net> writes:
>[...]
>> >
>> >Then, the proper answer to my previous question should be mention of
>> >the DR in process, not citation of an irrelevant wording.
>>
>> I have quoted the *relevant* wording. The library clause has no business
>                                        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> defining the semantics of wide characters, which are a language issue.
>  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
>Sorry, but this makes me feel that it's not worth discussing this
>problem with you any more.

As I've already told you, you're always welcome to ignore my posts.
The text you've underlined makes perfect sense to me (otherwise I
wouldn't have written it in the first place).

>Some implementations of the standard
>library depended on that '%' == L'%' with the requirement of C90,
>and it was a reliable choice in practice *at that time*.

The implementor can depend on *anything* he wants, because he has full
control over the implementation, he doesn't need any guarantees from the
standard about the relationship between normal characters and wide
characters because he knows *exactly* what this relationship is on that
particular implementation.

I thought this was obvious to you...

Dan
--
Dan Pop
DESY Zeuthen, RZ group
Email: Dan.Pop@ifh.de
