Multi-byte chars

**Dan Pop** · Nov 13 '05, 03:28 AM

Re: Multi-byte chars

In <beguk0$fu4$1@news.hananet.net> "Jun Woong" <mycoboco@hanmail.net> writes:

>"Dan Pop" <Dan.Pop@cern.ch> wrote in message news:begm13$kq9$1@sunnews.cern.ch...
>> In <beg43f$se3$1@news.hananet.net> "Jun Woong" <mycoboco@hanmail.net> writes:
>[...]
>> >>
>> >> I have quoted the *relevant* wording. The library clause has no business
>> >                                        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> >> defining the semantics of wide characters, which are a language issue.
>> >  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> >
>[...]
>>
>> The text you've underlined makes perfect sense to me (otherwise I
>> wouldn't have written it in the first place).
>
>According to your logic, the following program is not s.c. even in

Don't invoke my logic, since you're obviously unable to understand it.

>C90, which is perfectly incorrect thought. Is this what you are
>saying?
>
> #include <stdio.h>
>
> int main(void)
> {
>     if ('a' == L'a') puts("okay");
>
>     return 0;
> }

Nope, what I'm saying is that C90 is broken by making this program
strictly conforming: what are the choices for wide characters of an
EBCDIC-based implementation? Remove the broken text from the library
clause and C90 becomes more sensible. Ditto about C99, which contains
the same text.

>> >Some implementations of the standard
>> >library depended on that '%' == L'%' with the requirement of C90,
>> >and it was a reliable choice in practice *at that time*.
>>
>> The implementor can depend on *anything* he wants, because he has full
>> control over the implementation, he doesn't need any guarantees from the
>> standard about the relationship between normal characters and wide
>> characters because he knows *exactly* what this relationship is on that
>> particular implementation.
>
>The story changes if the implementer wants to make as many parts of
>his library conform to the standard as possible.

The standard contains no requirement that the standard library is
implemented in C in the first place. A library implementation conforms
to the standard if it follows the standard specification for the library,
no matter in what language it is written or how portable or non-portable
its code is. Ideally, all the parts of the library should conform to the
library specification, not only "as many parts as possible" ;-)

Assuming that you're talking about implementing the library in portable
C (which is definitely NOT what you wrote above), I fail to see how the
assumption 'a' == L'a' can make the code more portable.

Dan
--
Dan Pop
DESY Zeuthen, RZ group
Email: Dan.Pop@ifh.de

**lawrence.jones@eds.com** · Nov 13 '05, 03:29 AM

Re: Multi-byte chars

Dan Pop <Dan.Pop@cern.ch> wrote:
>
> Nope, what I'm saying is that C90 is broken by making this program
> strictly conforming: what are the choices for wide characters of an
> EBCDIC-based implementation?

I wouldn't call it broken, just overly restrictive. Until very
recently, no one with an EBCDIC implementation wanted the wchar_t
encoding to be anything other than IBM's DBCS (Double Byte Character
Set), which has the same relation to EBCDIC that Unicode/ISO 10646 has
to ASCII.

-Larry Jones

He doesn't complain, but his self-righteousness sure gets on my nerves.
-- Calvin

**Jun Woong** · Nov 13 '05, 03:29 AM

Re: Multi-byte chars

"Dan Pop" <Dan.Pop@cern.ch> wrote in message news:behk0b$4jn$4@sunnews.cern.ch...
> In <beguk0$fu4$1@news.hananet.net> "Jun Woong" <mycoboco@hanmail.net> writes:
[...]
> >
> >According to your logic, the following program is not s.c. even in
>
> Don't invoke my logic, since you're obviously unable to understand it.

Sorry, your logic is too foolish for me to understand.

>
> >C90, which is perfectly incorrect thought. Is this what you are
> >saying?
> >
> > #include <stdio.h>
> >
> > int main(void)
> > {
> >     if ('a' == L'a') puts("okay");
> >
> >     return 0;
> > }
>
> Nope, what I'm saying is that C90 is broken by making this program
> strictly conforming: what are the choices for wide characters of an
> EBCDIC-based implementation? Remove the broken text from the library
> clause and C90 becomes more sensible.

This is completely your personal opinion, which is completely
different from the text of C90 exactly says; please don't force others
to follow your poor opinion as did in "return; in main()" discussion.

I've never thought that it was broken, considering that we didn't have
enough support for multibyte and wide characters in C90, it was rather
very restrictive. The only problem I can see about this is that the
committee should have removed it when drafting C99, since we already
had lots of support for the characters then.

[...]
> >
> >The story changes if the implementer wants to make as many parts of
> >his library conform to the standard as possible.
>
> The standard contains no requirement that the standard library is
> implemented in C in the first place. A library implementation conforms
> to the standard if it follows the standard specification for the library,
> no matter in what language it is written or how portable or non-portable
> its code is. Ideally, all the parts of the library should conform to the
> library specification, not only "as many parts as possible" ;-)

Sorry for my poor wording.

>
> Assuming that you're talking about implementing the library in portable
> C (which is definitely NOT what you wrote above), I fail to see how the
> assumption 'a' == L'a' can make the code more portable.
>

Try to implement one of the printf() family in C90 (excluding NA1).

--
Jun, Woong (mycoboco@hanmail.net)
Dept. of Physics, Univ. of Seoul

**Jun Woong** · Nov 13 '05, 03:30 AM

Re: Multi-byte chars

"Dan Pop" <Dan.Pop@cern.ch> wrote in message news:bejlm8$ra0$3@sunnews.cern.ch...
> In <beitfd$rrt$1@news.hananet.net> "Jun Woong" <mycoboco@hanmail.net> writes:
[...]
> Rudeness works both ways ;-)

It's fortune that you know it.

> >
> >This is completely your personal opinion, which is completely
> >different from the text of C90 exactly says;
>
> Nope, it isn't, because it's my opinion about what C90 says.

Yes, it's just your opinion, not what C90 says, which is what I said.
So what?

> I'm not
> denying that it says what it says, merely claiming that what it says is
> wrong. For reasons I have clearly explained.

I don't think so. It's very restrictive rather than broken at that
time; read Larry's posting on this.

>
> >please don't force others
> >to follow your poor opinion as did in "return; in main()" discussion.
>
> Are you a complete idiot or what? I didn't force anyone to adopt any of
> my opinions in any discussion (how could I do that, assuming that I wanted
> to?).

You said it's broken. I said it's not broken, just very restrictive.
But what C90 says doesn't change regardless of whatever we think about
it. The standards, C90 and C99 as the current state, explicitly
guarantees that 'a' == L'a'. What's the problem with this? What
justifies you to say:

The fact that A belongs to the basic character set has
no relevance on the value of L'A'

?

If you meant to say that the wording in the standard should be revised
or will be revised, then you should have done so (as Larry did), not
given me the poor explanation above.

>
> >I've never thought that it was broken, considering that we didn't have
> >enough support for multibyte and wide characters in C90,
>
> Why wasn't the support enough? And if it wasn't enough, why didn't the
> committee add the missing bits, instead of breaking the standard?

Read the book, "The Standard C Library" by PJ Plauger, <locale.h>
section, IIRC.

>
> >it was rather
> >very restrictive. The only problem I can see about this is that the
                     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> >committee should have removed it when drafting C99, since we already
   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> >had lots of support for the characters then.
   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> Since both standards say the same thing, your argument about not enough
> support in C90 is completely unsupported. Try something better.

Read the underlined wording.

>
> >
> >Try to implement one of the printf() family in C90 (excluding NA1).
>
> Convert the format string to wide characters and use only wide character
> constants in the implementation of printf. Generate the output as wide
> characters and convert them to multibyte characters before actually
> outputting them. Where is the portability problem? Which of these
> conversions isn't supported by C89?
>
> The thing I can't figure out is how to generate a multibyte format string
> in C89, as a string literal. The only solution is to start with a wide
> string literal and convert it to a multibyte character string.

The multibyte character sequence given to printf() by user can have
redundant shift characters which can make the resulting mb characters
from the wide characters differ from the original. The guarantee that
'%' == L'%' can make it easy to write a code to scan the conversion
specifier from the mb character sequence, despite lack of support for
conversion between characters; of course, there was a more complicated
way to do it not depending on the fact.

--
Jun, Woong (mycoboco@hanmail.net)
Dept. of Physics, Univ. of Seoul

**Dan Pop** · Nov 13 '05, 03:30 AM

Re: Multi-byte chars

In <bejqr9$ejg$1@news.hananet.net> "Jun Woong" <mycoboco@hanmail.net> writes:

>"Dan Pop" <Dan.Pop@cern.ch> wrote in message news:bejlm8$ra0$3@sunnews.cern.ch...
>> In <beitfd$rrt$1@news.hananet.net> "Jun Woong" <mycoboco@hanmail.net> writes:
>[...]
>
>It's fortune that you know it.

Could you please be a little more careful when writing English text?

>> >This is completely your personal opinion, which is completely
>> >different from the text of C90 exactly says;
>>
>> Nope, it isn't, because it's my opinion about what C90 says.
>
>Yes, it's just your opinion, not what C90 says, which is what I said.
>So what?

I am perfectly entitled to my opinion. Just like anyone else.

>> I'm not
>> denying that it says what it says, merely claiming that what it says is
>> wrong. For reasons I have clearly explained.
>
>I don't think so. It's very restrictive rather than broken at that
>time; read Larry's posting on this.

I have: it didn't sound very convincing to someone inclined to use his
own judgement instead of blindly believing everything said by a committee
member.

A standard that prevents mixing, say, EBCDIC (characters) and UCS (wide
characters), for NO good reason, is downright broken in my book. And both
C89 and C99 do that.

>> >please don't force others
>> >to follow your poor opinion as did in "return; in main()" discussion.
>>
>> Are you a complete idiot or what? I didn't force anyone to adopt any of
>> my opinions in any discussion (how could I do that, assuming that I wanted
>> to?).
>
>You said it's broken. I said it's not broken, just very restrictive.
>But what C90 says doesn't change regardless of whatever we think about
>it. The standards, C90 and C99 as the current state, explicitly
>guarantees that 'a' == L'a'. What's the problem with this? What
>justifies you to say:
>
> The fact that A belongs to the basic character set has
> no relevance on the value of L'A'

I have already explained what. And I agree that the standard provides
this guarantee. What's the problem with this? ;-)

>> >I've never thought that it was broken, considering that we didn't have
>> >enough support for multibyte and wide characters in C90,
>>
>> Why wasn't the support enough? And if it wasn't enough, why didn't the
>> committee add the missing bits, instead of breaking the standard?
>
>Read the book, "The Standard C Library" by PJ Plauger, <locale.h>
>section, IIRC.

Quote the relevant paragraphs.

>> >it was rather
>> >very restrictive. The only problem I can see about this is that the
>                     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> >committee should have removed it when drafting C99, since we already
>   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> >had lots of support for the characters then.
>   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>
>> Since both standards say the same thing, your argument about not enough
>> support in C90 is completely unsupported. Try something better.
>
>Read the underlined wording.

Does it change the fact that both standards say the same thing? If not,
the underlined text doesn't prove anything at all.

>> >Try to implement one of the printf() family in C90 (excluding NA1).
>>
>> Convert the format string to wide characters and use only wide character
>> constants in the implementation of printf. Generate the output as wide
>> characters and convert them to multibyte characters before actually
>> outputting them. Where is the portability problem? Which of these
>> conversions isn't supported by C89?
>>
>> The thing I can't figure out is how to generate a multibyte format string
>> in C89, as a string literal. The only solution is to start with a wide
>> string literal and convert it to a multibyte character string.
>
>The multibyte character sequence given to printf() by user can have
>redundant shift characters which can make the resulting mb characters
>from the wide characters differ from the original.

Differ in what sense? Are the semantics of the text preserved or not?

>The guarantee that
>'%' == L'%' can make it easy to write a code to scan the conversion
>specifier from the mb character sequence,

Nope, it cannot: you cannot process multibyte characters *before*
converting them to wide characters, because the standard does NOT
specify the encoding mechanism. Keep in mind that characters from the
base character set preserve their single byte values *only* in the initial
shift state (whatever that is):

    While in the
    initial shift state, all single-byte characters retain their usual
    interpretation and do not alter the shift state. The interpretation
                                                     ^^^^^^^^^^^^^^^^^^
    for subsequent bytes in the sequence is a function of the current
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    shift state.
    ^^^^^^^^^^^^

>despite lack of support for
>conversion between characters; of course, there was a more complicated
>way to do it not depending on the fact.

There is no other way, without making assumptions about how mb characters
are encoded (see the quote above). And if you make such assumptions,
your code is no longer portable. There is no easy way to tell whether
a byte you read from the string corresponds to a single byte character
or is a shift state changer or is the first character of a multibyte
character.
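
A minimal sketch of the point above, assuming nothing beyond standard C and whatever LC_CTYPE locale is current: a portable program does not classify bytes itself but lets `mblen` report where each multibyte character ends. The function name `count_mb_chars` is hypothetical and only serves the illustration.

```c
#include <stdlib.h>   /* mblen */
#include <string.h>   /* strlen */

/* Count the multibyte characters in s under the current LC_CTYPE locale.
 * Returns -1 if an invalid multibyte sequence is encountered.
 * (Hypothetical helper, not part of any library.) */
long count_mb_chars(const char *s)
{
    size_t left = strlen(s);
    long count = 0;

    (void)mblen(NULL, 0);           /* put mblen in the initial shift state */
    while (left > 0) {
        int n = mblen(s, left);     /* bytes making up the next character */
        if (n < 0)
            return -1;              /* not a valid multibyte character */
        if (n == 0)
            break;                  /* reached the null character */
        s += n;
        left -= (size_t)n;
        count++;
    }
    return count;
}
```

The loop never inspects a byte directly, so it makes no assumption about shift sequences or lead bytes; the cost is that every character goes through the conversion machinery.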

Dan
--
Dan Pop
DESY Zeuthen, RZ group
Email: Dan.Pop@ifh.de

**Randy Howard** · Nov 13 '05, 03:31 AM

Re: Multi-byte chars

In article <beitfd$rrt$1@news.hananet.net>, mycoboco@hanmail.net
says...
> "Dan Pop" <Dan.Pop@cern.ch> wrote in message news:behk0b$4jn$4@sunnews.cern.ch...
> > Don't invoke my logic, since you're obviously unable to understand it.
>
> Sorry, your logic is too foolish for me to understand.

Can the two of you go off privately somewhere and beat each other to
a pulp? Watching it here doesn't seem very productive.

--
Randy Howard
remove the obvious bits from my address to reply.

**Jun Woong** · Nov 13 '05, 03:31 AM

Re: Multi-byte chars

"Jun Woong" <mycoboco@hanmail.net> wrote in message news:bem51q$3u2$1@news.hananet.net...
[...]
>
>     char foo[] = "\x70\x70\x01\x02";
>     char bar[MB_CUR_MAX];
>
> Assuming that str[] contains a valid multibyte character sequence,
> '\x70' is a shift character and redundant shift characters are
> allowed,
>
>     mbtowc(&wc, str, sizeof(str)-1);

Sorry. Two occurrences of "str" should be replaced with "foo".
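
A minimal sketch of the round trip being discussed, with the correction applied. It assumes only standard C; the helper name `roundtrip_first_char` is hypothetical, and the destination buffer is sized with `MB_LEN_MAX` here because `MB_CUR_MAX` is not a constant expression. Whether the bytes written match the input depends on the locale's encoding, which is the point under dispute.

```c
#include <limits.h>   /* MB_LEN_MAX */
#include <stddef.h>   /* wchar_t, size_t */
#include <stdlib.h>   /* mbtowc, wctomb */

/* Convert the first multibyte character of src (looking at no more than
 * n bytes) to a wide character and back again.
 * dst must have room for at least MB_LEN_MAX bytes.
 * Returns the number of bytes written to dst, or -1 on failure.
 * (Hypothetical helper for illustration only.) */
int roundtrip_first_char(const char *src, size_t n, char *dst)
{
    wchar_t wc;
    int in_len;

    (void)mbtowc(NULL, NULL, 0);    /* reset mbtowc to the initial shift state */
    (void)wctomb(NULL, 0);          /* reset wctomb to the initial shift state */

    in_len = mbtowc(&wc, src, n);
    if (in_len <= 0)
        return -1;                  /* invalid sequence or null character */
    return wctomb(dst, wc);         /* may emit fewer bytes than were consumed */
}
```

With an encoding that tolerates redundant shift sequences, `in_len` can be larger than the value returned by `wctomb`, so comparing the two byte strings is not a reliable equality test for the characters they represent.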

--
Jun, Woong (mycoboco@hanmail.net)
Dept. of Physics, Univ. of Seoul

**Dan Pop** · Nov 13 '05, 03:31 AM

Re: Multi-byte chars

In <kpjkeb.2m9.ln@cvg-65-27-189-87.cinci.rr.com> lawrence.jones@eds.com writes:

>Dan Pop <Dan.Pop@cern.ch> wrote:
>>
>> I am perfectly entitled to my opinion. Just like anyone else.
>
>Indeed you are, as am I.
>
>> A standard that prevents mixing, say, EBCDIC (characters) and UCS (wide
>> characters), for NO good reason, is downright broken in my book. And both
>> C89 and C99 do that.
>
>My opinion is that your opinion is downright broken. ;-)
>
>There were very good reasons for the restriction in C89.

This statement is worth zilch without an enumeration of the "very good
reasons". Unlike JW, I'm completely immune to the "magister dixit" style
of argumentation.

AFAICT, there was NO good reason for this restriction in C89. Due to the
shift state issue, it provided no help when dealing with mb character
strings.

Dan
--
Dan Pop
DESY Zeuthen, RZ group
Email: Dan.Pop@ifh.de

**lawrence.jones@eds.com** · Nov 13 '05, 03:32 AM

Re: Multi-byte chars

Dan Pop <Dan.Pop@cern.ch> wrote [quoting me]:
>>
>>There were very good reasons for the restriction in C89.
>
> This statement is worth zilch without an enumeration of the "very good
> reasons". Unlike JW, I'm completely immune to the "magister dixit" style
> of argumentation.

Bully for you. This isn't my area of expertise, thus the appeal to
authority. P. J. Plauger alludes to the kinds of problems it was
intended to address in his discussion of the _Printf function in "The
Standard C Library".

The fundamental issue is how to recognize a "%" in the format string.
As you've said, it is necessary to convert the format string to a
sequence of wide characters and look for one corresponding to a percent
sign. But what is the wide character code for a percent sign? It's
tempting to say that it's L'%', but remember that the wide character
encoding is allowed to be locale-specific, and the user is allowed to
change the current locale at any time, so that doesn't work without
something like the restriction under discussion. (With the restriction,
of course, you don't even need to use a wide character constant, '%' is
sufficient).
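
A minimal sketch of the scan described above, written as ordinary user code under the assumption that the guarantee holds, so the wide percent sign can be named at compile time. The function name `find_percent` is hypothetical; a real library could not visibly call `mbtowc` like this, since the implementation must behave as if no library function does.

```c
#include <stddef.h>   /* wchar_t, size_t */
#include <stdlib.h>   /* mbtowc */
#include <string.h>   /* strlen */

/* Return the byte offset of the first percent sign in the multibyte
 * string fmt, or -1 if there is none or the string is invalid.
 * Relies on the guarantee that the wide percent sign equals L'%'
 * in every locale. (Hypothetical helper for illustration only.) */
long find_percent(const char *fmt)
{
    size_t left = strlen(fmt);
    long offset = 0;
    wchar_t wc;

    (void)mbtowc(NULL, NULL, 0);        /* start in the initial shift state */
    while (left > 0) {
        int n = mbtowc(&wc, fmt, left); /* decode one multibyte character */
        if (n <= 0)
            return -1;                  /* invalid sequence or stray null */
        if (wc == L'%')
            return offset;              /* start of a conversion specifier */
        fmt += n;
        left -= (size_t)n;
        offset += n;
    }
    return -1;
}
```

The comparison is made on the decoded wide character, never on raw bytes, so a `%` byte that is merely part of a longer multibyte character is not mistaken for a directive.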

Without it, you'd be forced to call mbtowc on "%" every time to get the
current encoding, but the implementation must behave as if no library
function calls mbtowc, so you'd also have to save and restore its state
around the call. That was considered to be unacceptable overhead to
require, thus the restriction. (Which, as I've said before, was
innocuous at the time since no one was even contemplating an
implementation where it did not hold.)

-Larry Jones

I stand FIRM in my belief of what's right! I REFUSE to
compromise my principles! -- Calvin

**lawrence.jones@eds.com** · Nov 13 '05, 03:32 AM

Re: Multi-byte chars

Dan Pop <Dan.Pop@cern.ch> wrote:
>
> The work on Unicode started in 1986, which is a good three years before
> the adoption of C89.

But it hadn't gotten very far by the time C89 was finished (which was,
remember, a year before it was published due to procedural snafus). The
16-bit camp and the 32-bit camp were both deeply entrenched and fighting
with each other, leading to the eventual schism between the ISO 10646
folks and the Unicode folks that wasn't reconciled until fairly
recently. There wasn't even consensus among the masses that a universal
character set was practical, achievable, or even desirable.

-Larry Jones

Everything's gotta have rules, rules, rules! -- Calvin

**Kevin Easton** · Nov 13 '05, 03:32 AM

Re: Multi-byte chars

lawrence.jones@eds.com wrote:
> Dan Pop <Dan.Pop@cern.ch> wrote [quoting me]:
>>>
>>>There were very good reasons for the restriction in C89.
>>
>> This statement is worth zilch without an enumeration of the "very good
>> reasons". Unlike JW, I'm completely immune to the "magister dixit" style
>> of argumentation.
>
> Bully for you. This isn't my area of expertise, thus the appeal to
> authority. P. J. Plauger alludes to the kinds of problems it was
> intended to address in his discussion of the _Printf function in "The
> Standard C Library".
>
> The fundamental issue is how to recognize a "%" in the format string.
> As you've said, it is necessary to convert the format string to a
> sequence of wide characters and look for one corresponding to a percent
> sign. But what is the wide character code for a percent sign? It's
> tempting to say that it's L'%', but remember that the wide character
> encoding is allowed to be locale-specific, and the user is allowed to
> change the current locale at any time, so that doesn't work without
> something like the restriction under discussion. (With the restriction,
> of course, you don't even need to use a wide character constant, '%' is
> sufficient).
>
> Without it, you'd be forced to call mbtowc on "%" every time to get the
> current encoding, but the implementation must behave as if no library
> function calls mbtowc, so you'd also have to save and restore its state
> around the call. That was considered to be unacceptable overhead to
> require, thus the restriction. (Which, as I've said before, was
> innocuous at the time since no one was even contemplating an
> implementation where it did not hold.)

Why can't the implementation provide, for its own use, a lookup table
of what_percent_looks_like_in_this_locale[] - after all, mbtowc clearly
has this information available.
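
A rough user-level approximation of that idea, assuming only standard C: the wide value of the percent sign is looked up once per LC_CTYPE locale and cached under the locale name. The names `wide_percent`, `cached_locale`, and `cached_percent` are hypothetical, and a real implementation would use its internal locale tables rather than call `mbtowc`, which disturbs the conversion state visible to the program.

```c
#include <locale.h>   /* setlocale, LC_CTYPE */
#include <stddef.h>   /* wchar_t */
#include <stdlib.h>   /* mbtowc */
#include <string.h>   /* strcmp, strcpy, strlen */

/* One-entry cache keyed by the LC_CTYPE locale name (hypothetical). */
static char cached_locale[128];
static wchar_t cached_percent;
static int cache_valid = 0;

/* Store the current locale's wide percent sign in *out.
 * Returns 0 on success, -1 on failure. */
int wide_percent(wchar_t *out)
{
    const char *name = setlocale(LC_CTYPE, NULL);   /* query, no change */

    if (name == NULL || strlen(name) >= sizeof cached_locale)
        return -1;
    if (!cache_valid || strcmp(name, cached_locale) != 0) {
        wchar_t wc;

        (void)mbtowc(NULL, NULL, 0);    /* initial shift state */
        if (mbtowc(&wc, "%", 1) != 1)   /* '%' is one byte in that state */
            return -1;
        strcpy(cached_locale, name);
        cached_percent = wc;
        cache_valid = 1;
    }
    *out = cached_percent;
    return 0;
}
```

The conversion then runs only when the locale name changes, not on every `printf` call, at the price of one string comparison per lookup.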

- Kevin.

**Jun Woong** · Nov 13 '05, 03:32 AM

Re: Multi-byte chars

"Kevin Easton" <kevin@-nospam-pcug.org.au> wrote in message news:newscache$3c1whh$7h6$1@tomato.pcug.org.au...
> lawrence.jones@eds.com wrote:
> >
> > Bully for you. This isn't my area of expertise, thus the appeal to
> > authority. P. J. Plauger alludes to the kinds of problems it was
> > intended to address in his discussion of the _Printf function in "The
> > Standard C Library".
> >
> > The fundamental issue is how to recognize a "%" in the format string.
> > As you've said, it is necessary to convert the format string to a
> > sequence of wide characters and look for one corresponding to a percent
> > sign. But what is the wide character code for a percent sign? It's
> > tempting to say that it's L'%', but remember that the wide character
> > encoding is allowed to be locale-specific, and the user is allowed to
> > change the current locale at any time, so that doesn't work without
> > something like the restriction under discussion. (With the restriction,
> > of course, you don't even need to use a wide character constant, '%' is
> > sufficient).
> >
> > Without it, you'd be forced to call mbtowc on "%" every time to get the
> > current encoding, but the implementation must behave as if no library
> > function calls mbtowc, so you'd also have to save and restore its state
> > around the call. That was considered to be unacceptable overhead to
> > require, thus the restriction. (Which, as I've said before, was
> > innocuous at the time since no one was even contemplating an
> > implementation where it did not hold.)
>
> Why can't the implementation provide, for its own use, a lookup table
> of what_percent_looks_like_in_this_locale[] - after all, mbtowc clearly
> has this information available.
>

One reason I can think of is portability. An easier (but not portable)
way than the one you describe is to take advantage of internal access to
the state of the conversion.

--
Jun, Woong (mycoboco@hanmail.net)
Dept. of Physics, Univ. of Seoul

**Kevin Easton** · Nov 13 '05, 03:32 AM

Re: Multi-byte chars

Jun Woong <mycoboco@hanmail.net> wrote:
>
> "Kevin Easton" <kevin@-nospam-pcug.org.au> wrote in message news:newscache$3c1whh$7h6$1@tomato.pcug.org.au...
>> lawrence.jones@eds.com wrote:
[ ...implementing _Printf, and '%' == L'%'... ]
>> > Without it, you'd be forced to call mbtowc on "%" every time to get the
>> > current encoding, but the implementation must behave as if no library
>> > function calls mbtowc, so you'd also have to save and restore its state
>> > around the call. That was considered to be unacceptable overhead to
>> > require, thus the restriction. (Which, as I've said before, was
>> > innocuous at the time since no one was even contemplating an
>> > implementation where it did not hold.)
>>
>> Why can't the implementation provide, for its own use, a lookup table
>> of what_percent_looks_like_in_this_locale[] - after all, mbtowc clearly
>> has this information available.
>>
>
> One reason I can think of is portability. An easier (but not portable)
> way than the one you describe is to take advantage of internal access to
> the state of the conversion.

There are plenty of library functions that have unacceptable overheads
when implemented in a portable manner, but can usually be efficiently
implemented in a non-portable way. In particular, strcmp() comes to
mind - so I don't think the possibility of a portable implementation
suffering unacceptable overhead when a non-portable implementation
wouldn't is sufficient reason to add the restriction.

- Kevin.

**Jun Woong** · Nov 13 '05, 03:32 AM

Re: Multi-byte chars

"Dan Pop" <Dan.Pop@cern.ch> wrote in message news:bemibc$bu2$3@sunnews.cern.ch...
> In <bem51q$3u2$1@news.hananet.net> "Jun Woong" <mycoboco@hanmail.net> writes:
> >
> >When C90 was the current standard, was there UCS?
>
> UCS did exist when C99 was drafted, yet the broken text is still there.

I've already said that I agree with your position that C99 shouldn't
have had the text. I guess it was a mistake.

> The work on Unicode started in 1986, which is a good three years before
> the adoption of C89.

Its publication was certainly after C90's.

> >
> >I also agree with that C99 (or C90+NA1) should have been revised to
> >get rid of the wording in question, but never do about C90.
>
> What *exactly* was it buying to C90?

The text in C90 didn't cause a major problem in practice at that time.

[...]
> >
> >PJ Plauger describes the history about NA1 in that section, which is
> >reasonably long. IIRC when C90 was published, the committee already
> >knew that C90's support for some features like the wide characters was
> >not enough. But because the committee promised later supplement (which
> >was NA1) to members who objected approval of the standard, we were able
> >to have C90 at that time.
>
> This doesn't explain anything at all about the necessity of having
> 'a' == L'a', does it?

Read in context, please.

> >
> >     char foo[] = "\x70\x70\x01\x02";
> >     char bar[MB_CUR_MAX];
> >
> >Assuming that str[] contains a valid multibyte character sequence,
> >'\x70' is a shift character and redundant shift characters are
> >allowed,
> >
> >     mbtowc(&wc, str, sizeof(str)-1);
> >     wctomb(bar, wc);
> >
> >the sequence in bar[] can be "\x70\x01\x02". Is this wrong?
>
> I can't see anything wrong with that. Where is the problem?

DP> Convert the format string to wide characters and use only wide character
                                                         ~~~~~~~~~~~~~~~~~~~
DP> constants in the implementation of printf. Generate the output as wide
    ~~~~~~~~~                                  ~~~~~~~~~~~~~~~~~~~~~~~~~~~
DP> characters and convert them to multibyte characters before actually
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
DP> outputting them. [...]
    ~~~~~~~~~~~~~~~

> >
> >Correct, but this is not what I said. What I said is,
> >
> >     while(mbtowc(&wc, fmtstr, len) > 0) {
> >         if (wc == '%') /* conversion specifier */
> >
> >(Sure, the implementation is allowed to use mbtowc for this purpose).
> >This construct depends on the guarantee that '%' == L'%'.
>
> And what the hell is wrong with
>
>     if (wc == L'%') /* conversion specifier */
>
> which does NOT depend on that guarantee and is what I have suggested as
> the portable solution to your problem?

Nope, it still depends on the guarantee. If there is no guarantee like
that, wc can have a different value from L'%' depending on locales,
even if wc contains a wide percent character in that locale.

> >
> >Misunderstanding here. What I had in my mind (and used before) needs
> >an internal access to the state for the character conversion, which is
> >non-portable, of course.
>
> Then, why did you invoke *portability* arguments for the usefulness of
> the guarantee under discussion?

See above. And the reason I mentioned the other way is to say that an
implementer can rely on the implementation details if he doesn't care
about portability.

>
> Nope, the code was equally easy to write in pure C89, without relying on
> the guarantee, as demonstrated above.

In an incorrect way.

--
Jun, Woong (mycoboco@hanmail.net)
Dept. of Physics, Univ. of Seoul

**Jun Woong** · Nov 13 '05, 03:32 AM

Re: Multi-byte chars

"Kevin Easton" <kevin@-nospam-pcug.org.au> wrote in message news:newscache$lu2whh$7h6$1@tomato.pcug.org.au...
> Jun Woong <mycoboco@hanmail.net> wrote:
[...]
> >
> > One reason I can think of is portability. An easier (but not portable)
> > way than the one you describe is to take advantage of internal access to
> > the state of the conversion.
>
> There are plenty of library functions that have unacceptable overheads
> when implemented in a portable manner, but can usually be efficiently
> implemented in a non-portable way. In particular, strcmp() comes to
> mind - so I don't think the possibility of a portable implementation
> suffering unacceptable overhead when a non-portable implementation
> wouldn't is sufficient reason to add the restriction.
>

The story can change if the committee considered the possibility that
users would want to write similar code in a portable way.
Without such a guarantee, the only way you, as a user of an
implementation who doesn't know the implementation details, can
write similar code is to use a technique that's somewhat complicated
and has overhead.

--
Jun, Woong (mycoboco@hanmail.net)
Dept. of Physics, Univ. of Seoul
