Multi-byte chars

  • Dan Pop

    #16
    Re: Multi-byte chars

    In <beguk0$fu4$1@news.hananet.net> "Jun Woong" <mycoboco@hanmail.net> writes:

    >"Dan Pop" <Dan.Pop@cern.ch> wrote in message news:begm13$kq9$1@sunnews.cern.ch...
    >> In <beg43f$se3$1@news.hananet.net> "Jun Woong" <mycoboco@hanmail.net> writes:
    >[...]
    >> >>
    >> >> I have quoted the *relevant* wording. The library clause has no business
    >> >                                        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    >> >> defining the semantics of wide characters, which are a language issue.
    >> >  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    >> >
    >[...]
    >>
    >> The text you've underlined makes perfect sense to me (otherwise I
    >> wouldn't have written in the first place).
    >
    >According to your logic, the following program is not s.c. even in

    Don't invoke my logic, since you're obviously unable to understand it.

    >C90, which is perfectly incorrect thought. Is this what you are
    >saying?
    >
    > #include <stdio.h>
    >
    > int main(void)
    > {
    >     if ('a' == L'a') puts("okay");
    >
    >     return 0;
    > }

    Nope, what I'm saying is that C90 is broken by making this program
    strictly conforming: what are the choices for wide characters of an
    EBCDIC-based implementation? Remove the broken text from the library
    clause and C90 becomes more sensible. Ditto about C99, which contains
    the same text.
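
To see what the disputed guarantee amounts to on a given implementation, one can compare a handful of basic characters with their wide spellings. This is only a sketch: the name `count_basic_mismatches` and the choice of sample characters are assumptions made for illustration and do not come from the standard or from this thread.

```c
#include <stddef.h>

/* Narrow and wide spellings of a few members of the basic character set.
   The selection is illustrative, not the complete basic character set. */
static const char    narrow_set[] = "aAzZ09%\\ ";
static const wchar_t wide_set[]   = L"aAzZ09%\\ ";

/* Count the positions where the narrow code and the wide code differ.
   The result is 0 on an implementation that honours the guarantee. */
size_t count_basic_mismatches(void)
{
    size_t i;
    size_t mismatches = 0;

    for (i = 0; narrow_set[i] != '\0'; i++) {
        /* Basic characters are non-negative as char, so the cast is safe. */
        if ((wchar_t)narrow_set[i] != wide_set[i])
            mismatches++;
    }
    return mismatches;
}
```

On an ASCII host with a Unicode-based wchar_t the count should be zero. An implementation pairing EBCDIC characters with UCS wide characters would report every sampled character as a mismatch, which is the case being argued about here.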

    >> >Some implementations of the standard
    >> >library depended on that '%' == L'%' with the requirement of C90,
    >> >and it was a reliable choice in practice *at that time*.
    >>
    >> The implementor can depend on *anything* he wants, because he has full
    >> control over the implementation, he doesn't need any guarantees from the
    >> standard about the relationship between normal characters and wide
    >> characters because he knows *exactly* what this relationship is on that
    >> particular implementation.
    >
    >The story changes if the implementer wants to make as many parts of
    >his library conform to the standard as possible.

    The standard contains no requirement that the standard library is
    implemented in C in the first place. A library implementation conforms
    to the standard if it follows the standard specification for the library,
    no matter in what language it is written or how portable or non-portable
    its code is. Ideally, all the parts of the library should conform to the
    library specification, not only "as many parts as possible" ;-)

    Assuming that you're talking about implementing the library in portable
    C (which is definitely NOT what you wrote above), I fail to see how the
    assumption 'a' == L'a' can make the code more portable.

    Dan
    --
    Dan Pop
    DESY Zeuthen, RZ group
    Email: Dan.Pop@ifh.de

    • lawrence.jones@eds.com

      #17
      Re: Multi-byte chars

      Dan Pop <Dan.Pop@cern.ch> wrote:
      >
      > Nope, what I'm saying is that C90 is broken by making this program
      > strictly conforming: what are the choices for wide characters of an
      > EBCDIC-based implementation?

      I wouldn't call it broken, just overly restrictive. Until very
      recently, no one with an EBCDIC implementation wanted the wchar_t
      encoding to be anything other than IBM's DBCS (Double Byte Character
      Set), which has the same relation to EBCDIC that Unicode/ISO 10646 has
      to ASCII.

      -Larry Jones

      He doesn't complain, but his self-righteousness sure gets on my nerves.
      -- Calvin

      • Jun Woong

        #18
        Re: Multi-byte chars


        "Dan Pop" <Dan.Pop@cern.ch> wrote in message news:behk0b$4jn$4@sunnews.cern.ch...
        > In <beguk0$fu4$1@news.hananet.net> "Jun Woong" <mycoboco@hanmail.net> writes:
        [...]
        > >
        > >According to your logic, the following program is not s.c. even in
        >
        > Don't invoke my logic, since you're obviously unable to understand it.

        Sorry, your logic is too foolish for me to understand.

        >
        > >C90, which is perfectly incorrect thought. Is this what you are
        > >saying?
        > >
        > > #include <stdio.h>
        > >
        > > int main(void)
        > > {
        > >     if ('a' == L'a') puts("okay");
        > >
        > >     return 0;
        > > }
        >
        > Nope, what I'm saying is that C90 is broken by making this program
        > strictly conforming: what are the choices for wide characters of an
        > EBCDIC-based implementation? Remove the broken text from the library
        > clause and C90 becomes more sensible.

        This is completely your personal opinion, which is completely
        different from the text of C90 exactly says; please don't force others
        to follow your poor opinion as did in "return; in main()" discussion.

        I've never thought that it was broken, considering that we didn't have
        enough support for multibyte and wide characters in C90, it was rather
        very restrictive. The only problem I can see about this is that the
        committee should have removed it when drafting C99, since we already
        had lots of support for the characters then.

        [...]
        > >
        > >The story changes if the implementer wants to make as many parts of
        > >his library conform to the standard as possible.
        >
        > The standard contains no requirement that the standard library is
        > implemented in C in the first place. A library implementation conforms
        > to the standard if it follows the standard specification for the library,
        > no matter in what language it is written or how portable or non-portable
        > its code is. Ideally, all the parts of the library should conform to the
        > library specification, not only "as many parts as possible" ;-)

        Sorry for my poor wording.

        >
        > Assuming that you're talking about implementing the library in portable
        > C (which is definitely NOT what you wrote above), I fail to see how the
        > assumption 'a' == L'a' can make the code more portable.
        >

        Try to implement one of the printf() family in C90 (excluding NA1).


        --
        Jun, Woong (mycoboco@hanmail.net)
        Dept. of Physics, Univ. of Seoul



        • Jun Woong

          #19
          Re: Multi-byte chars


          "Dan Pop" <Dan.Pop@cern.ch> wrote in message news:bejlm8$ra0$3@sunnews.cern.ch...
          > In <beitfd$rrt$1@news.hananet.net> "Jun Woong" <mycoboco@hanmail.net> writes:
          [...]
          > Rudeness works both ways ;-)

          It's fortune that you know it.

          > >
          > >This is completely your personal opinion, which is completely
          > >different from the text of C90 exactly says;
          >
          > Nope, it isn't, because it's my opinion about what C90 says.

          Yes, it's just your opinion, not what C90 says, which is what I said.
          So what?

          > I'm not
          > denying that it says what it says, merely claiming that what it says is
          > wrong. For reasons I have clearly explained.

          I don't think so. It's very restrictive rather than broken at that
          time; read Larry's posting on this.

          >
          > >please don't force others
          > >to follow your poor opinion as did in "return; in main()" discussion.
          >
          > Are you a complete idiot or what? I didn't force anyone to adopt any of
          > my opinions in any discussion (how could I do that, assuming that I wanted
          > to?).

          You said it's broken. I said it's not broken, just very restrictive.
          But what C90 says doesn't change regardless of whatever we think about
          it. The standards, C90 and C99 as the current state, explicitly
          guarantees that 'a' == L'a'. What's the problem with this? What
          justifies you to say:

          The fact that A belongs to the basic character set has
          no relevance on the value of L'A'

          ?

          If you meant to say that the wording in the standard should be revised
          or will be revised, then you should have done so (as Larry did), not
          given me the poor explanation above.

          >
          > >I've never thought that it was broken, considering that we didn't have
          > >enough support for multibyte and wide characters in C90,
          >
          > Why wasn't the support enough? And if it wasn't enough, why didn't the
          > committee add the missing bits, instead of breaking the standard?

          Read the book, "The Standard C Library" by PJ Plauger, <locale.h>
          section, IIRC.

          >
          > >it was rather
          > >very restrictive. The only problem I can see about this is that the
                               ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
          > >committee should have removed it when drafting C99, since we already
             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
          > >had lots of support for the characters then.
             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
          >
          > Since both standards say the same thing, your argument about not enough
          > support in C90 is completely unsupported. Try something better.

          Read the underlined wording.

          >
          > >
          > >Try to implement one of the printf() family in C90 (excluding NA1).
          >
          > Convert the format string to wide characters and use only wide character
          > constants in the implementation of printf. Generate the output as wide
          > characters and convert them to multibyte characters before actually
          > outputting them. Where is the portability problem? Which of these
          > conversions isn't supported by C89?
          >
          > The thing I can't figure out is how to generate a multibyte format string
          > in C89, as a string literal. The only solution is to start with a wide
          > string literal and convert it to a multibyte character string.

          The multibyte character sequence given to printf() by user can have
          redundant shift characters which can make the resulting mb characters
          from the wide characters differ from the original. The guarantee that
          '%' == L'%' can make it easy to write a code to scan the conversion
          specifier from the mb character sequence, despite lack of support for
          conversion between characters; of course, there was a more complicated
          way to do it not depending on the fact.
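
A rough way to test whether a multibyte string survives the trip through wide characters byte for byte is sketched below with `mbstowcs` and `wcstombs`, both of which C90 provides. The function name `roundtrip_same` and the `MAX_CHARS` capacity are assumptions made for this illustration only.

```c
#include <limits.h>
#include <stdlib.h>
#include <string.h>

#define MAX_CHARS 128   /* capacity chosen for this sketch */

/* Convert `in` to wide characters and back again.
   Returns 1 if the result is byte-for-byte identical to the input,
   0 if it differs, and -1 on a conversion error or lack of room. */
int roundtrip_same(const char *in)
{
    wchar_t wide[MAX_CHARS];
    char back[MAX_CHARS * MB_LEN_MAX];
    size_t nwide;
    size_t nback;

    nwide = mbstowcs(wide, in, MAX_CHARS);
    if (nwide == (size_t)-1 || nwide == MAX_CHARS)
        return -1;              /* invalid input, or no room for L'\0' */

    nback = wcstombs(back, wide, sizeof back);
    if (nback == (size_t)-1 || nback == sizeof back)
        return -1;              /* unrepresentable character, or truncated */

    return strcmp(in, back) == 0;
}
```

In the "C" locale a plain format string such as "%5.2f\n" should come back unchanged. In a state-dependent encoding with redundant shift sequences the function could return 0 even though the text means the same thing, which is the difference being described here.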


          --
          Jun, Woong (mycoboco@hanmail.net)
          Dept. of Physics, Univ. of Seoul



          • Dan Pop

            #20
            Re: Multi-byte chars

            In <bejqr9$ejg$1@news.hananet.net> "Jun Woong" <mycoboco@hanmail.net> writes:

            >"Dan Pop" <Dan.Pop@cern.ch> wrote in message news:bejlm8$ra0$3@sunnews.cern.ch...
            >> In <beitfd$rrt$1@news.hananet.net> "Jun Woong" <mycoboco@hanmail.net> writes:
            >[...]
            >
            >It's fortune that you know it.

            Could you please be a little more careful when writing English text?

            >> >This is completely your personal opinion, which is completely
            >> >different from the text of C90 exactly says;
            >>
            >> Nope, it isn't, because it's my opinion about what C90 says.
            >
            >Yes, it's just your opinion, not what C90 says, which is what I said.
            >So what?

            I am perfectly entitled to my opinion. Just like anyone else.

            >> I'm not
            >> denying that it says what it says, merely claiming that what it says is
            >> wrong. For reasons I have clearly explained.
            >
            >I don't think so. It's very restrictive rather than broken at that
            >time; read Larry's posting on this.

            I have: it didn't sound very convincing to someone inclined to use his
            own judgement instead of blindly believing everything said by a committee
            member.

            A standard that prevents mixing, say, EBCDIC (characters) and UCS (wide
            characters), for NO good reason, is downright broken in my book. And both
            C89 and C99 do that.
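
As a concrete picture of the mix described above, here is a small table sketch in C. It assumes EBCDIC code page 037 for the narrow codes and UCS for the wide codes; the names `code_pair` and `count_matching_codes` are made up for the example.

```c
#include <stddef.h>

/* One basic character with its code in EBCDIC (code page 037, an
   assumption of this sketch) and its code in UCS. */
struct code_pair {
    char          glyph;
    unsigned char ebcdic;
    unsigned long ucs;
};

static const struct code_pair pairs[] = {
    { 'a', 0x81, 0x0061 },
    { 'A', 0xC1, 0x0041 },
    { '0', 0xF0, 0x0030 },
    { '%', 0x6C, 0x0025 }
};

/* Count the entries whose narrow code equals the wide code. */
size_t count_matching_codes(void)
{
    size_t i;
    size_t same = 0;

    for (i = 0; i < sizeof pairs / sizeof pairs[0]; i++) {
        if (pairs[i].ebcdic == pairs[i].ucs)
            same++;
    }
    return same;
}
```

None of the sampled codes coincide, so an implementation built this way could not satisfy 'a' == L'a' for these characters.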

            >> >please don't force others
            >> >to follow your poor opinion as did in "return; in main()" discussion.
            >>
            >> Are you a complete idiot or what? I didn't force anyone to adopt any of
            >> my opinions in any discussion (how could I do that, assuming that I wanted
            >> to?).
            >
            >You said it's broken. I said it's not broken, just very restrictive.
            >But what C90 says doesn't change regardless of whatever we think about
            >it. The standards, C90 and C99 as the current state, explicitly
            >guarantees that 'a' == L'a'. What's the problem with this? What
            >justifies you to say:
            >
            > The fact that A belongs to the basic character set has
            > no relevance on the value of L'A'

            I have already explained what. And I agree that the standard provides
            this guarantee. What's the problem with this? ;-)

            >> >I've never thought that it was broken, considering that we didn't have
            >> >enough support for multibyte and wide characters in C90,
            >>
            >> Why wasn't the support enough? And if it wasn't enough, why didn't the
            >> committee add the missing bits, instead of breaking the standard?
            >
            >Read the book, "The Standard C Library" by PJ Plauger, <locale.h>
            >section, IIRC.

            Quote the relevant paragraphs.

            >> >it was rather
            >> >very restrictive. The only problem I can see about this is that the
            >                     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
            >> >committee should have removed it when drafting C99, since we already
            >   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
            >> >had lots of support for the characters then.
            >   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
            >>
            >> Since both standards say the same thing, your argument about not enough
            >> support in C90 is completely unsupported. Try something better.
            >
            >Read the underlined wording.

            Does it change the fact that both standards say the same thing? If not,
            the underlined text doesn't prove anything at all.

            >> >Try to implement one of the printf() family in C90 (excluding NA1).
            >>
            >> Convert the format string to wide characters and use only wide character
            >> constants in the implementation of printf. Generate the output as wide
            >> characters and convert them to multibyte characters before actually
            >> outputting them. Where is the portability problem? Which of these
            >> conversions isn't supported by C89?
            >>
            >> The thing I can't figure out is how to generate a multibyte format string
            >> in C89, as a string literal. The only solution is to start with a wide
            >> string literal and convert it to a multibyte character string.
            >
            >The multibyte character sequence given to printf() by user can have
            >redundant shift characters which can make the resulting mb characters
            >from the wide characters differ from the original.

            Differ in what sense? Are the semantics of the text preserved or not?

            >The guarantee that
            >'%' == L'%' can make it easy to write a code to scan the conversion
            >specifier from the mb character sequence,

            Nope, it cannot: you cannot process multibyte characters *before*
            converting them to wide characters, because the standard does NOT
            specify the encoding mechanism. Keep in mind that characters from the
            base character set preserve their single byte values *only* in the initial
            shift state (whatever that is):

            While in the
            initial shift state, all single-byte characters retain their usual
            interpretation and do not alter the shift state. The interpretation
                                                             ^^^^^^^^^^^^^^^^^^
            for subsequent bytes in the sequence is a function of the current
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
            shift state.
            ^^^^^^^^^^^^

            >despite lack of support for
            >conversion between characters; of course, there was a more complicated
            >way to do it not depending on the fact.

            There is no other way, without making assumptions about how mb characters
            are encoded (see the quote above). And if you make such assumptions,
            your code is no longer portable. There is no easy way to tell whether
            a byte you read from the string corresponds to a single byte character
            or is a shift state changer or is the first character of a multibyte
            character.
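
The point about not inspecting raw bytes can be sketched as follows: the scanner below never looks at the bytes itself and lets `mbtowc` do all the decoding. The name `count_percent_mb` is hypothetical. Note that the comparison against L'%' assumes the wide code of '%' in the current locale is the same as the constant L'%', which is the very assumption this thread is debating.

```c
#include <stdlib.h>
#include <string.h>

/* Count '%' characters in a multibyte string by decoding it one
   character at a time. Returns -1 on an invalid multibyte sequence. */
long count_percent_mb(const char *s)
{
    size_t left = strlen(s);
    long count = 0;
    wchar_t wc;

    (void)mbtowc(NULL, NULL, 0);        /* reset to the initial shift state */
    while (left > 0) {
        int n = mbtowc(&wc, s, left);   /* bytes used by the next character */

        if (n < 0)
            return -1;                  /* invalid or truncated sequence */
        if (n == 0)
            break;                      /* reached a null character */
        if (wc == L'%')
            count++;
        s += n;
        left -= (size_t)n;
    }
    return count;
}
```

A byte equal to '%' that appears in a non-initial shift state, or as a trailing byte of a longer character, is not counted, because `mbtowc` consumes it as part of some other character.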

            Dan
            --
            Dan Pop
            DESY Zeuthen, RZ group
            Email: Dan.Pop@ifh.de

            • Randy Howard

              #21
              Re: Multi-byte chars

              In article <beitfd$rrt$1@news.hananet.net>, mycoboco@hanmail.net
              says...
              > "Dan Pop" <Dan.Pop@cern.ch> wrote in message news:behk0b$4jn$4@sunnews.cern.ch...
              > > Don't invoke my logic, since you're obviously unable to understand it.
              >
              > Sorry, your logic is too foolish for me to understand.

              Can the two of you go off privately somewhere and beat each other to
              a pulp? Watching it here doesn't seem very productive.

              --
              Randy Howard
              remove the obvious bits from my address to reply.

              • Jun Woong

                #22
                Re: Multi-byte chars


                "Jun Woong" <mycoboco@hanmail.net> wrote in message news:bem51q$3u2$1@news.hananet.net...
                [...]
                >
                > char foo[] = "\x70\x70\x01\x02";
                > char bar[MB_CUR_MAX];
                >
                > Assuming that str[] contains a valid multibyte character sequence,
                > '\x70' is a shift character and redundant shift characters are
                > allowed,
                >
                > mbtowc(&wc, str, sizeof(str)-1);

                Sorry. Two occurrences of "str" should be replaced with "foo".
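
With that correction applied, the quoted fragment might be filled out roughly as below. This is a guess at the intent, since the rest of the original article is not shown here: the function name `reencode_delta` is invented, and the buffer is sized with MB_LEN_MAX because MB_CUR_MAX is not a constant expression in C90.

```c
#include <limits.h>
#include <stdlib.h>

/* Decode the first multibyte character of `mb` (at most `n` bytes),
   encode it again, and return bytes consumed minus bytes produced.
   Returns INT_MIN if either conversion fails. */
int reencode_delta(const char *mb, size_t n)
{
    char bar[MB_LEN_MAX];
    wchar_t wc;
    int used;
    int made;

    (void)mbtowc(NULL, NULL, 0);   /* put both converters into */
    (void)wctomb(NULL, 0);         /* the initial shift state  */

    used = mbtowc(&wc, mb, n);
    if (used < 0)
        return INT_MIN;

    made = wctomb(bar, wc);
    if (made < 0)
        return INT_MIN;

    return used - made;
}
```

In the "C" locale '\x70' is just the letter p, so the call `reencode_delta(foo, sizeof(foo)-1)` gives 0. Under the hypothetical encoding in the quoted text, where '\x70' is a shift character and redundant shifts are allowed, a positive result would show that the re-encoded character is shorter than the original sequence.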


                --
                Jun, Woong (mycoboco@hanmail.net)
                Dept. of Physics, Univ. of Seoul



                • Dan Pop

                  #23
                  Re: Multi-byte chars

                  In <kpjkeb.2m9.ln@cvg-65-27-189-87.cinci.rr.com> lawrence.jones@eds.com writes:

                  >Dan Pop <Dan.Pop@cern.ch> wrote:
                  >>
                  >> I am perfectly entitled to my opinion. Just like anyone else.
                  >
                  >Indeed you are, as am I.
                  >
                  >> A standard that prevents mixing, say, EBCDIC (characters) and UCS (wide
                  >> characters), for NO good reason, is downright broken in my book. And both
                  >> C89 and C99 do that.
                  >
                  >My opinion is that your opinion is downright broken. ;-)
                  >
                  >There were very good reasons for the restriction in C89.

                  This statement is worth zilch without an enumeration of the "very good
                  reasons". Unlike JW, I'm completely immune to the "magister dixit" style
                  of argumentation.

                  AFAICT, there was NO good reason for this restriction in C89. Due to the
                  shift state issue, it provided no help when dealing with mb character
                  strings.

                  Dan
                  --
                  Dan Pop
                  DESY Zeuthen, RZ group
                  Email: Dan.Pop@ifh.de

                  • lawrence.jones@eds.com

                    #24
                    Re: Multi-byte chars

                    Dan Pop <Dan.Pop@cern.ch> wrote [quoting me]:
                    >>
                    >>There were very good reasons for the restriction in C89.
                    >
                    > This statement is worth zilch without an enumeration of the "very good
                    > reasons". Unlike JW, I'm completely immune to the "magister dixit" style
                    > of argumentation.

                    Bully for you. This isn't my area of expertise, thus the appeal to
                    authority. P. J. Plauger alludes to the kinds of problems it was
                    intended to address in his discussion of the _Printf function in "The
                    Standard C Library".

                    The fundamental issue is how to recognize a "%" in the format string.
                    As you've said, it is necessary to convert the format string to a
                    sequence of wide characters and look for one corresponding to a percent
                    sign. But what is the wide character code for a percent sign? It's
                    tempting to say that it's L'%', but remember that the wide character
                    encoding is allowed to be locale-specific, and the user is allowed to
                    change the current locale at any time, so that doesn't work without
                    something like the restriction under discussion. (With the restriction,
                    of course, you don't even need to use a wide character constant, '%' is
                    sufficient).

                    Without it, you'd be forced to call mbtowc on "%" every time to get the
                    current encoding, but the implementation must behave as if no library
                    function calls mbtowc, so you'd also have to save and restore its state
                    around the call. That was considered to be unacceptable overhead to
                    require, thus the restriction. (Which, as I've said before, was
                    innocuous at the time since no one was even contemplating an
                    implementation where it did not hold.)
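
For comparison, here is a sketch of asking the current locale for the wide code of "%" without disturbing the hidden state of mbtowc. It relies on `mbrtowc` and `mbstate_t` from Amendment 1, so it falls outside the "C90 excluding NA1" setting under discussion; the function name `percent_in_current_locale` is an assumption of the example.

```c
#include <string.h>
#include <wchar.h>

/* Return the wide-character code that the current locale assigns to
   '%', or L'\0' if the conversion fails. A private mbstate_t is used,
   so the internal state of mbtowc is left alone. */
wchar_t percent_in_current_locale(void)
{
    mbstate_t st;
    wchar_t wc = L'\0';
    size_t n;

    memset(&st, 0, sizeof st);      /* zeroed object = initial shift state */
    n = mbrtowc(&wc, "%", 1, &st);  /* decode the single byte "%" */

    return (n == 1) ? wc : L'\0';
}
```

With the restriction in place the result must equal '%' in every locale, which is why a printf implementation could skip this call altogether.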

                    -Larry Jones

                    I stand FIRM in my belief of what's right! I REFUSE to
                    compromise my principles! -- Calvin

                    • lawrence.jones@eds.com

                      #25
                      Re: Multi-byte chars

                      Dan Pop <Dan.Pop@cern.ch> wrote:
                      >
                      > The work on Unicode started in 1986, which is a good three years before
                      > the adoption of C89.

                      But it hadn't gotten very far by the time C89 was finished (which was,
                      remember, a year before it was published due to procedural snafus). The
                      16-bit camp and the 32-bit camp were both deeply entrenched and fighting
                      with each other, leading to the eventual schism between the ISO 10646
                      folks and the Unicode folks that wasn't reconciled until fairly
                      recently. There wasn't even consensus among the masses that a universal
                      character set was practical, achievable, or even desirable.

                      -Larry Jones

                      Everything's gotta have rules, rules, rules! -- Calvin

                      • Kevin Easton

                        #26
                        Re: Multi-byte chars

                        lawrence.jones@eds.com wrote:
                        > Dan Pop <Dan.Pop@cern.ch> wrote [quoting me]:
                        >>>
                        >>>There were very good reasons for the restriction in C89.
                        >>
                        >> This statement is worth zilch without an enumeration of the "very good
                        >> reasons". Unlike JW, I'm completely immune to the "magister dixit" style
                        >> of argumentation.
                        >
                        > Bully for you. This isn't my area of expertise, thus the appeal to
                        > authority. P. J. Plauger alludes to the kinds of problems it was
                        > intended to address in his discussion of the _Printf function in "The
                        > Standard C Library".
                        >
                        > The fundamental issue is how to recognize a "%" in the format string.
                        > As you've said, it is necessary to convert the format string to a
                        > sequence of wide characters and look for one corresponding to a percent
                        > sign. But what is the wide character code for a percent sign? It's
                        > tempting to say that it's L'%', but remember that the wide character
                        > encoding is allowed to be locale-specific, and the user is allowed to
                        > change the current locale at any time, so that doesn't work without
                        > something like the restriction under discussion. (With the restriction,
                        > of course, you don't even need to use a wide character constant, '%' is
                        > sufficient).
                        >
                        > Without it, you'd be forced to call mbtowc on "%" every time to get the
                        > current encoding, but the implementation must behave as if no library
                        > function calls mbtowc, so you'd also have to save and restore its state
                        > around the call. That was considered to be unacceptable overhead to
                        > require, thus the restriction. (Which, as I've said before, was
                        > innocuous at the time since no one was even contemplating an
                        > implementation where it did not hold.)

                        Why can't the implementation provide, for its own use, a lookup table
                        of what_percent_looks_like_in_this_locale[] - after all, mbtowc clearly
                        has this information available.
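
One way to picture such a table is a one-entry cache keyed on the locale name, sketched below. A real implementation would presumably read its own internal tables instead; this version uses only standard calls (`setlocale` as a query, plus `mbrtowc` from Amendment 1), and every name in it (`lookup_percent`, `lookup_refreshes`, `LOCALE_NAME_MAX`) is invented for the illustration.

```c
#include <locale.h>
#include <string.h>
#include <wchar.h>

#define LOCALE_NAME_MAX 64   /* capacity chosen for this sketch */

static char    cached_locale[LOCALE_NAME_MAX] = "";
static wchar_t cached_percent = L'\0';
static int     refresh_count = 0;   /* how often the entry was rebuilt */

/* Return the wide code of '%' for the current LC_CTYPE locale, and
   recompute it only when the locale name differs from the cached one.
   Returns L'\0' if the conversion fails. */
wchar_t lookup_percent(void)
{
    const char *name = setlocale(LC_CTYPE, NULL);   /* query only */

    if (name == NULL)
        name = "";
    if (refresh_count == 0 || strcmp(name, cached_locale) != 0) {
        mbstate_t st;
        wchar_t wc = L'\0';

        memset(&st, 0, sizeof st);                  /* initial shift state */
        if (mbrtowc(&wc, "%", 1, &st) != 1)
            return L'\0';
        cached_percent = wc;
        strncpy(cached_locale, name, LOCALE_NAME_MAX - 1);
        cached_locale[LOCALE_NAME_MAX - 1] = '\0';
        refresh_count++;
    }
    return cached_percent;
}

/* Report how many times the cached entry has been rebuilt. */
int lookup_refreshes(void)
{
    return refresh_count;
}
```

Calling `lookup_percent()` twice without changing the locale rebuilds the entry only once, so the per-call cost is a string comparison and not a conversion.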

                        - Kevin.

                        • Jun Woong

                          #27
                          Re: Multi-byte chars


                          "Kevin Easton" <kevin@-nospam-pcug.org.au> wrote in message news:newscache$3c1whh$7h6$1@tomato.pcug.org.au...
                          > lawrence.jones@eds.com wrote:
                          > >
                          > > Bully for you. This isn't my area of expertise, thus the appeal to
                          > > authority. P. J. Plauger alludes to the kinds of problems it was
                          > > intended to address in his discussion of the _Printf function in "The
                          > > Standard C Library".
                          > >
                          > > The fundamental issue is how to recognize a "%" in the format string.
                          > > As you've said, it is necessary to convert the format string to a
                          > > sequence of wide characters and look for one corresponding to a percent
                          > > sign. But what is the wide character code for a percent sign? It's
                          > > tempting to say that it's L'%', but remember that the wide character
                          > > encoding is allowed to be locale-specific, and the user is allowed to
                          > > change the current locale at any time, so that doesn't work without
                          > > something like the restriction under discussion. (With the restriction,
                          > > of course, you don't even need to use a wide character constant, '%' is
                          > > sufficient).
                          > >
                          > > Without it, you'd be forced to call mbtowc on "%" every time to get the
                          > > current encoding, but the implementation must behave as if no library
                          > > function calls mbtowc, so you'd also have to save and restore its state
                          > > around the call. That was considered to be unacceptable overhead to
                          > > require, thus the restriction. (Which, as I've said before, was
                          > > innocuous at the time since no one was even contemplating an
                          > > implementation where it did not hold.)
                          >
                          > Why can't the implementation provide, for its own use, a lookup table
                          > of what_percent_looks_like_in_this_locale[] - after all, mbtowc clearly
                          > has this information available.
                          >

                          One reason I can think is portability. One easier (but not portable)
                          way than you said is to take advantage of an internal access to the
                          state of the conversion.


                          --
                          Jun, Woong (mycoboco@hanmail.net)
                          Dept. of Physics, Univ. of Seoul




                          • Kevin Easton

                            #28
                            Re: Multi-byte chars

                            Jun Woong <mycoboco@hanmail.net> wrote:[color=blue]
                            >
                            > "Kevin Easton" <kevin@-nospam-pcug.org.au> wrote in message news:newscache$3c1whh$7h6$1@tomato.pcug.org.au...[color=green]
                            >> lawrence.jones@eds.com wrote:[/color][/color]
                            [ ...implementing _Printf, and '%' == L'%'... ][color=blue][color=green][color=darkred]
                            >> > Without it, you'd be forced to call mbtowc on "%" every time to get the
                            >> > current encoding, but the implementation must behave as if no library
                            >> > function calls mbtowc, so you'd also have to save and restore its state
                            >> > around the call. That was considered to be unacceptable overhead to
                            >> > require, thus the restriction. (Which, as I've said before, was
                            >> > innocuous at the time since no one was even contemplating an
                            >> > implementation where it did not hold.)[/color]
                            >>
                            >> Why can't the implementation provide, for its own use, a lookup table
                            >> of what_percent_looks_like_in_this_locale[] - after all, mbtowc clearly
                            >> has this information available.
                            >>[/color]
                            >
                            > One reason I can think of is portability. An easier (but non-portable)
                            > way than the one you describe is to take advantage of internal access
                            > to the conversion state.[/color]

                            There are plenty of library functions that have unacceptable overheads
                            when implemented in a portable manner, but can usually be efficiently
                            implemented in a non-portable way. In particular, strcmp() comes to
                            mind - so I don't think the possibility of a portable implementation
                            suffering unacceptable overhead when a non-portable implementation
                            wouldn't is sufficient reason to add the restriction.

                            - Kevin.


                            • Jun Woong

                              #29
                              Re: Multi-byte chars


                              "Dan Pop" <Dan.Pop@cern.ch> wrote in message news:bemibc$bu2$3@sunnews.cern.ch...[color=blue]
                              > In <bem51q$3u2$1@news.hananet.net> "Jun Woong" <mycoboco@hanmail.net> writes:[color=green]
                              > >
                              > >When C90 was the current standard, was there UCS?[/color]
                              >
                              > UCS did exist when C99 was drafted, yet the broken text is still there.[/color]

                              I've already said that I agree with your position that C99 shouldn't
                              have had the text. I guess it was a mistake.
                              [color=blue]
                              > The work on Unicode started in 1986, which is a good three years before
                              > the adoption of C89.[/color]

                              Its publication was certainly after C90's.
                              [color=blue][color=green]
                              > >
                              > >I also agree that C99 (or C90+NA1) should have been revised to
                              > >get rid of the wording in question, but I don't agree about C90.[/color]
                              >
                              > What *exactly* was it buying to C90?[/color]

                              The text in C90 didn't cause any major problem in practice at that time.

                              [...][color=blue][color=green]
                              > >
                              > >PJ Plauger describes the history of NA1 in that section, which is
                              > >reasonably long. IIRC when C90 was published, the committee already
                              > >knew that C90's support for some features like wide characters was
                              > >not enough. But because the committee promised a later supplement (which
                              > >was NA1) to members who objected to approval of the standard, we were able
                              > >to have C90 at that time.[/color]
                              >
                              > This doesn't explain anything at all about the necessity of having
                              > 'a' == L'a', does it?[/color]

                              Read in context, please.
                              [color=blue][color=green]
                              > >
                              > > char str[] = "\x70\x70\x01\x02";
                              > > char bar[MB_CUR_MAX];
                              > >
                              > >Assuming that str[] contains a valid multibyte character sequence,
                              > >'\x70' is a shift character and redundant shift characters are
                              > >allowed,
                              > >
                              > > mbtowc(&wc, str, sizeof(str)-1);
                              > > wctomb(bar, wc);
                              > >
                              > >the sequence in bar[] can be "\x70\x01\x02". Is this wrong?[/color]
                              >
                              > I can't see anything wrong with that. Where is the problem?[/color]


                              DP> Convert the format string to wide characters and use only wide character
                                                                                                       ~~~~~~~~~~~~~~~~~~~
                              DP> constants in the implementation of printf. Generate the output as wide
                                  ~~~~~~~~~                                  ~~~~~~~~~~~~~~~~~~~~~~~~~~~
                              DP> characters and convert them to multibyte characters before actually
                                  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                              DP> outputting them. [...]
                                  ~~~~~~~~~~~~~~~

                              [color=blue][color=green]
                              > >
                              > >Correct, but this is not what I said. What I said is,
                              > >
                              > > while(mbtowc(&wc, fmtstr, len) > 0) {
                              > > if (wc == '%') /* conversion specifier */
                              > >
                              > >(Sure, the implementation is allowed to use mbtowc for this purpose).
                              > >This construct depends on the guarantee that '%' == L'%'.[/color]
                              >
                              > And what the hell is wrong with
                              >
                              > if (wc == L'%') /* conversion specifier */
                              >
                              > which does NOT depend on that guarantee and is what I have suggested as
                              > the portable solution to your problem?[/color]

                              Nope, it still depends on the guarantee. Without such a guarantee,
                              wc can have a different value from L'%' depending on the locale,
                              even if wc contains that locale's wide percent character.
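To make the point concrete, here is a minimal sketch of the scanning loop under discussion. It is not taken from any real printf implementation; `count_percent` and its `percent` parameter are hypothetical names. The function converts a multibyte format string with mbtowc and compares each wide character against whatever code the caller supplies for the percent sign.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: count the characters in the multibyte string fmt
   whose wide-character code equals `percent`. Scanning stops early if
   mbtowc reports an invalid sequence. */
int count_percent(const char *fmt, wchar_t percent)
{
    size_t len;
    int count = 0;
    int n;
    wchar_t wc;

    assert(fmt != NULL);
    len = strlen(fmt);

    (void)mbtowc(NULL, NULL, 0);        /* start from the initial shift state */
    while (len > 0 && (n = mbtowc(&wc, fmt, len)) > 0) {
        if (wc == percent)              /* correct only if `percent` is the
                                           current locale's code for '%' */
            count++;
        fmt += n;
        len -= (size_t)n;
    }
    return count;
}
```

Passing `'%'` or `L'%'` as `percent` gives the right answer only where the current locale converts "%" to that same value, which is exactly the guarantee being argued about. Without it, the caller would have to obtain the value at run time.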
                              [color=blue][color=green]
                              > >
                              > >Misunderstanding here. What I had in my mind (and used before) needs
                              > >an internal access to the state for the character conversion, which is
                              > >non-portable, of course.[/color]
                              >
                              > Then, why did you invoke *portability* arguments for the usefulness of
                              > the guarantee under discussion?[/color]

                              See above. And the reason I mentioned the other way is to say that an
                              implementer can rely on the implementation details if he doesn't care
                              about portability.
                              [color=blue]
                              >
                              > Nope, the code was equally easy to write in pure C89, without relying on
                              > the guarantee, as demonstrated above.[/color]

                              In an incorrect way.


                              --
                              Jun, Woong (mycoboco@hanmail.net)
                              Dept. of Physics, Univ. of Seoul




                              • Jun Woong

                                #30
                                Re: Multi-byte chars


                                "Kevin Easton" <kevin@-nospam-pcug.org.au> wrote in message news:newscache$lu2whh$7h6$1@tomato.pcug.org.au...[color=blue]
                                > Jun Woong <mycoboco@hanmail.net> wrote:[/color]
                                [...][color=blue][color=green]
                                > >
                                > > One reason I can think of is portability. An easier (but non-portable)
                                > > way than the one you describe is to take advantage of internal access
                                > > to the conversion state.[/color]
                                >
                                > There are plenty of library functions that have unacceptable overheads
                                > when implemented in a portable manner, but can usually be efficiently
                                > implemented in a non-portable way. In particular, strcmp() comes to
                                > mind - so I don't think the possibility of a portable implementation
                                > suffering unacceptable overhead when a non-portable implementation
                                > wouldn't is sufficient reason to add the restriction.
                                >[/color]

                                The story can change if the committee considered the possibility that
                                users would want to write similar code in a portable way.
                                Without such a guarantee, the only way you, as a user of an
                                implementation who doesn't know the implementation details, can
                                write similar code is to use a technique that's somewhat complicated
                                and has overhead.
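As an illustration of that more complicated technique, the following sketch asks the current locale at run time how "%" converts to a wide character. The name `wide_percent` is hypothetical. Note that it resets mbtowc's shift state rather than saving and restoring it, since portable code has no way to do the latter; that is part of the overhead and intrusiveness mentioned above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: store the current locale's wide-character code for
   the percent sign in *out. Returns 0 on success, -1 if the conversion
   fails. Side effect: mbtowc's internal shift state is reset, so any
   conversion the caller had in progress is disturbed. */
int wide_percent(wchar_t *out)
{
    int n;

    assert(out != NULL);
    (void)mbtowc(NULL, NULL, 0);    /* reset to the initial shift state */
    n = mbtowc(out, "%", 1);        /* '%' is a single byte in that state */
    (void)mbtowc(NULL, NULL, 0);    /* reset again for later callers */
    return n == 1 ? 0 : -1;
}
```

A caller would have to invoke this again after every setlocale call that might change LC_CTYPE, or before each scan if it cannot know when the locale changes. With the guarantee, a plain comparison against `'%'` replaces all of it.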


                                --
                                Jun, Woong (mycoboco@hanmail.net)
                                Dept. of Physics, Univ. of Seoul


