Re: Named vs. numerical entities
"Alan J. Flavell" <flavell@ph.gla.ac.uk> wrote in message
news:Pine.LNX.4.53.0407161707500.7333@ppepc56.ph.gla.ac.uk...
> On Fri, 16 Jul 2004, Harlan Messinger wrote:
>
> > Sorry, I missed it somehow. I intend to read it later,
>
> Call back here when you have done?
>
> > 1. There's nothing any more nonsensical about the concept of a Unicode
> > encoding for the Unicode character set than there is about ASCII
> > encoding for the ASCII character set,
>
> Actually there are substantial differences. And you see this also
> with that MIME parameter which is (mis)named "charset" - but specifies
> what we now would call a "character encoding scheme".
>
> Back when 7 or 8 bits were sufficient to represent all of the
> characters of a repertoire, it was quasi-obvious that the "coded
> character set" was defined by assigning numbers (0-127 or 0-255 as the
> case may be) to the characters of the repertoire, and then to lay out
> the fonts according to that scheme, and to transmit the characters by
> means of bytes having that value.
>
> Consequently, back then it looked as if the things that we now call
> "coded character set", "character encoding" and "font arrangement"
> were just different names for the same thing. Of course, you needed a
> different font for each "charset" (i.e. character encoding), which got
> to be a considerable drag.
>
> Nowadays these concepts have to be disambiguated. Unicode characters
> are designated by a code point which can, in principle, go up to 2**31
> (it hasn't got that far yet). Those numbers then have to be
> represented in a way which is convenient for transmission and/or
> storage (different design criteria apply for different purposes).
>
> > 2. EBCDIC and ASCII define the same characters, IIRC;
>
> Actually not. But discussing that would be a pointless digression, so
> let's move on.
>
> > So why are the UTF-* encoding, "encodings of the Unicode character
> > set"?
>
> It's not practical, for various reasons, to transmit characters as
> 32-bit units. For one thing, it's very wasteful. For another,
> there's no unique byte-ordering, hence all this fuss about endian-ness
> when units of 16 or 32 bits are involved.
>
> There's also the question of representing Unicode characters in a
> mail-safe context (hence utf-7). That will fade with time, but even
> 8-bit-safe mail formats ban null bytes, which means that utf-16 or
> utf-32/ucs-4 representations cannot be used without a further layer of
> encoding.
>
> > Is it because they are closely related to the Unicode character set
>
> Is it because you won't read the tutorial before asking further
> questions?
No, it's because sometimes questions can be satisfied by relatively simple
answers without requiring one to read a whole tutorial (though sometimes
not). Sometimes a tutorial or textbook will tell you the way things are
without explaining why they aren't some other way (though sometimes not).
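For reference, the distinction drawn in the quoted text between a character's
code point and the bytes used to transmit it can be sketched in a few lines of
Python. This is only an illustration, assuming Python 3 and its built-in
codecs; the sample character (the euro sign) and the variable names are
arbitrary choices, not anything from the thread.

```python
# One abstract character, one code point, several byte representations.
ch = "\u20ac"  # EURO SIGN, picked arbitrarily for illustration

# The coded character set assigns the character a number (its code point).
code_point = ord(ch)
print(f"code point: U+{code_point:04X}")

# The encodings turn that number into bytes for storage or transmission.
encodings = ["utf-7", "utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le"]
encoded = {name: ch.encode(name) for name in encodings}
for name, data in encoded.items():
    print(f"{name:10} {len(data)} byte(s): {data.hex()}")

# Byte order matters once the unit is wider than one byte:
# the big-endian and little-endian forms are byte-reversed.
assert encoded["utf-16-be"] == encoded["utf-16-le"][::-1]

# A plain ASCII letter picks up null bytes in the 16- and 32-bit forms,
# which is why those forms need a further layer of encoding in mail.
has_null = {name: b"\x00" in "A".encode(name) for name in encodings}
print(has_null)
```

All six byte sequences decode back to the same code point; only the
representation on the wire differs.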