Named vs. numerical entities

**Andreas Prilop** · Jul 20 '05, 07:21 PM

Re: Named vs. numerical entities

On Fri, 16 Jul 2004, Pierre Goiffon wrote:

>> And in any case, probably the best choice (if no other constraints
>> apply) of Unicode encoding scheme for HTML used in a WWW context is
>> utf-8, not utf-16LE/BE.
>
> Do you mean, when using a vast majority of Latin characters?
> If not, wouldn't the file get very large?

Not bigger than a simple image.

> Wouldn't it be better to use UTF-16?

Only if you prefer not to be indexed by Google correctly.
<http://www.google.com/search?q=%22UTF-1+6%22>

--
Top-posting.
What's the most irritating thing on Usenet?

**Nick Kew** · Jul 20 '05, 07:21 PM

Re: Named vs. numerical entities

In article <Pine.LNX.4.53.0407160944450.7114@ppepc56.ph.gla.ac.uk>,
"Alan J. Flavell" <flavell@ph.gla.ac.uk> writes:

> There are third-party Apache modules which take care of this "on the
> fly",

mod_deflate is standard. No need for third-party modules.
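
As a rough illustration of what DEFLATE-style compression buys on markup, here is a minimal Python sketch using the standard `zlib` module. It is not Apache configuration and says nothing about how mod_deflate is set up; the HTML fragment is a made-up, deliberately repetitive example.

```python
import zlib

# Hypothetical, deliberately repetitive HTML fragment with numeric
# character references (not taken from any real page).
row = "<tr><td>&#945;</td><td>&#946;</td><td>&#947;</td></tr>\n"
html = ("<table>\n" + row * 50 + "</table>\n").encode("ascii")

# Compress with zlib's default settings (DEFLATE inside a zlib wrapper).
compressed = zlib.compress(html)
print(len(html), "octets before,", len(compressed), "octets after compression")

# Round-trip check: decompression restores the original bytes exactly.
assert zlib.decompress(compressed) == html
```

Repeated tags and entity references compress well, so a few extra octets per entity tend to matter little once compression is applied in transit.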

--
Nick Kew

**Alan J. Flavell** · Jul 20 '05, 07:21 PM

Re: Named vs. numerical entities

On Fri, 16 Jul 2004, Pierre Goiffon wrote:

> "Alan J. Flavell" <flavell@ph.gla.ac.uk> wrote in message
> news:Pine.LNX.4.53.0407161334000.7123@ppepc56.ph.gla.ac.uk
> > And in any case, probably the best choice (if no other constraints
> > apply) of Unicode encoding scheme for HTML used in a WWW context is
> > utf-8, not utf-16LE/BE.
>
> Do you mean, when using a vast majority of Latin characters?

Not necessarily: Greek, Cyrillic, Arabic, Hebrew are all represented
by 2 octets in utf-8. Armenian, Syriac and Coptic too, hmmm. The
cutoff (IINM) is U+07FF.
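
These octet counts are easy to check. The following is a minimal Python sketch; the sample letters are picked here purely for illustration and do not come from the thread.

```python
# One sample letter per script; the choice of letters is arbitrary.
samples = {
    "Latin A": "A",
    "Greek alpha": "\u03b1",
    "Cyrillic zhe": "\u0436",
    "Hebrew alef": "\u05d0",
    "Arabic beh": "\u0628",
    "Armenian ayb": "\u0561",
    "Syriac alaph": "\u0710",
    "Devanagari ka": "\u0915",
    "CJK ideograph": "\u4e2d",
}


def utf8_len(ch):
    """Number of octets needed to encode one character in UTF-8."""
    return len(ch.encode("utf-8"))


for name, ch in samples.items():
    print(f"U+{ord(ch):04X} {name}: {utf8_len(ch)} octet(s)")

# The two-octet range ends at U+07FF; U+0800 already needs three octets.
print(utf8_len("\u07ff"), utf8_len("\u0800"))
```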

CJK scripts are a different matter, but AFAICS they are still usually
represented in one of their traditional encodings, rather than in a
Unicode-based scheme.

Indic scripts will also need 3 octets per character in utf-8 (and in
this case AIUI the use of unicode-based encodings is very beneficial,
since there /was/ no widely accepted pre-unicode scheme: I'm told that
in order to read Indian newspapers on the web, pretty much each
newspaper needed a different "font", i.e. in effect was using its own
private character encoding. But I'm no expert in that field, so the
information is only second-hand).

> If not, wouldn't the file get very large? Wouldn't it be
> better to use UTF-16?

I haven't widely tested browser compatibility for utf-16 encodings, so
I can't comment on that aspect. But keep in mind that the markup,
styles, etc. etc. are expressed by ASCII characters, and by using
utf-16 you're going to double the size of *those* as compared with
utf-8.

But yes, if your material is such that most of the data characters
need 3 octets in utf-8, and you've decided to use a unicode scheme,
then utf-16 could well be more compact, you're right.
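
The trade-off between 3-octet text and ASCII markup can be seen in a small Python sketch. The page fragment is hypothetical, and the sizes are measured without a byte order mark.

```python
# Hypothetical page fragment: ASCII markup wrapped around Devanagari text.
text = "\u0928\u092e\u0938\u094d\u0924\u0947"  # six Devanagari code points
page = '<p class="greeting" lang="hi">' + text + "</p>"


def sizes(s):
    """Return (utf-8 size, utf-16 size) in octets, without a byte order mark."""
    return len(s.encode("utf-8")), len(s.encode("utf-16-le"))


print("text only:  ", sizes(text))
print("with markup:", sizes(page))
```

On the bare text utf-16 is the smaller of the two, but once the ASCII markup is counted the balance can tip back towards utf-8. Where the break-even lies depends on the ratio of markup to text in the actual document.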

**Alan J. Flavell** · Jul 20 '05, 07:21 PM

Re: Named vs. numerical entities

On Fri, 16 Jul 2004, Nick Kew wrote:

> "Alan J. Flavell" <flavell@ph.gla.ac.uk> writes:
>
> > There are third-party Apache modules which take care of this "on the
> > fly",
>
> mod_deflate is standard. No need for third-party modules.

Thanks for the information!

**Andreas Prilop** · Jul 20 '05, 07:21 PM

Re: Named vs. numerical entities

On Fri, 16 Jul 2004, Alan J. Flavell wrote:

> Indic scripts will also need 3 octets per character in utf-8 (and in
> this case AIUI the use of unicode-based encodings is very beneficial,
> since there /was/ no widely accepted pre-unicode scheme: I'm told that
> in order to read Indian newspapers on the web, pretty much each
> newspaper needed a different "font", i.e. in effect was using its own
> private character encoding.

But there's also <http://www.bbc.co.uk/hindi/>
and <http://www.bbc.co.uk/tamil/> .

--
Top-posting.
What's the most irritating thing on Usenet?

**Brian** · Jul 20 '05, 07:21 PM

Re: Named vs. numerical entities

Jonas Smithson wrote:

> However, some of my pages have numerous character entities on
> them... let's say up to fifty on a page, perhaps; if they each
> entailed an extra six bytes (for example) over some alternate
> method, then that might add up to an extra 300 bytes. What does
> that equal in download time? How many bytes of difference do *you*
> think would make a "noticeable difference" between two documents...
> say, to a user on a 56K modem?

Well, do the math. 300 bytes is 2400 bits, and 2400/56000 of a second is
not very significant. I suppose 2400/~33000 is a more accurate comparison,
but even there, it's nothing to worry about. Spending time tuning one image
on a page will likely have a greater impact than encoding will.
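
Spelled out as a small Python sketch, with the caveat that this is an idealised figure ignoring latency, modem compression and protocol overhead (the function name is made up for this example):

```python
def transfer_seconds(extra_bytes, bits_per_second):
    """Idealised transfer time for extra_bytes at a given line rate."""
    return extra_bytes * 8 / bits_per_second


# 300 extra bytes at a nominal 56K rate and at roughly 33K.
for rate in (56_000, 33_000):
    millis = transfer_seconds(300, rate) * 1000
    print(f"{rate} bit/s: about {millis:.0f} ms")
```

Either way the extra 300 bytes cost well under a tenth of a second.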

You should only worry about encoding if it causes rendering problems.

--
Brian (remove ".invalid" to email me)

http://www.tsmchughs.com/

**Brian** · Jul 20 '05, 07:21 PM

Re: Named vs. numerical entities

Alan J. Flavell wrote:

> On Fri, 16 Jul 2004, Brian wrote:
>
>> UTF-8 is an 8-bit character set
>
> No, utf-8 isn't a "character set" at all (that MIME "charset"
> parameter denotes what we nowadays call a "character encoding
> scheme").

Cripes, I cannot keep the terminology straight. I wish they had called
that thing by its name, charenc or something. Yes, utf-8 is an encoding.

--
Brian (remove ".invalid" to email me)

http://www.tsmchughs.com/

**Harlan Messinger** · Jul 20 '05, 07:21 PM

Re: Named vs. numerical entities

"Andreas Prilop" <nhtcapri@rrzn-user.uni-hannover.de> wrote in message
news:Pine.GSO.4.44.0407161425010.9642-100000@s5b003...
> On Fri, 16 Jul 2004, Harlan Messinger wrote:
>
> > Why can't a document be encoded (and transmitted) in Unicode?
>
> It cannot be "in Unicode" but UTF-8, UTF-16, or UTF-32;
> and in addition in different byte order for UTF-16 and UTF-32.
> <http://www.unicode.org/unicode/faq/utf_bom.html>
>
> > If
> > Windows Notepad lets you save a text file as Unicode (big- or
> > little-endian), isn't that the same thing?
>
> "Big- or little-endian" rules out UTF-8, so probably it's UTF-16.
> UTF-32 isn't used in MS Windows AFAIK.

I'm really interested in what the distinction is. I admit I don't know what
UTF-16 is or why it's different from what I would call "Unicode encoding", but
why wouldn't a fixed 16-bit encoding scheme where "A" is encoded as 0041, an
em-dash is encoded as 2014, a katakana "pu" is encoded as 30D7, and so forth
be "Unicode encoding"?

Is it that this encoding scheme already existed and had the name "UTF-16"
before the term "Unicode" was coined? So that the reason we don't call it
"Unicode encoding" is simply that it already has another name?

**Andreas Prilop** · Jul 20 '05, 07:21 PM

Re: Named vs. numerical entities

On Fri, 16 Jul 2004, Harlan Messinger wrote:

>> <http://www.unicode.org/unicode/faq/utf_bom.html>
>
> I'm really interested in what the distinction is. I admit I don't know what
> UTF-16 is or why it's different from what I would call "Unicode encoding", [...]

Err, did you read the page above, which I cited with reason?

--
Top-posting.
What's the most irritating thing on Usenet?

**Harlan Messinger** · Jul 20 '05, 07:21 PM

Re: Named vs. numerical entities

"Andreas Prilop" <nhtcapri@rrzn-user.uni-hannover.de> wrote in message
news:Pine.GSO.4.44.0407161705220.11169-100000@s5b003...
> On Fri, 16 Jul 2004, Harlan Messinger wrote:
>
> >> <http://www.unicode.org/unicode/faq/utf_bom.html>
> >
> > I'm really interested in what the distinction is. I admit I don't know what
> > UTF-16 is or why it's different from what I would call "Unicode encoding", [...]
>
> Err, did you read the page above, which I cited with reason?

Sorry, I missed it somehow. I intend to read it later, but from glancing at
it, I have the following thoughts:

1. There's nothing any more nonsensical about the concept of a Unicode
encoding for the Unicode character set than there is about ASCII encoding
for the ASCII character set, but for whatever reasons (I assume efficiency
has something to do with it) it's not *used*.

2. EBCDIC and ASCII define the same characters, IIRC; but as character sets
they just number them differently. A document could be encoded in EBCDIC
just as easily as in ASCII. It wouldn't make any sense to speak of an EBCDIC
encoding of an ASCII document or an ASCII encoding of an EBCDIC document:
each is a separate encoding of a document based on the representations of
the document's characters in the respective character sets.

So why are the UTF-* encodings "encodings of the Unicode character set"? Is
it because they are closely related to the Unicode character set by virtue
of the fact that there is a mapping from UCS to UTF-* produced by applying a
small set of simple functions?


**Andreas Prilop** · Jul 20 '05, 07:21 PM

Re: Named vs. numerical entities

On Fri, 16 Jul 2004, Harlan Messinger wrote:

> 1. There's nothing any more nonsensical about the concept of a Unicode
> encoding for the Unicode character set than there is about ASCII encoding
> for the ASCII character set,

Maybe I could understand this sentence with fewer negatives :-)

> 2. EBCDIC and ASCII define the same characters, IIRC;

ASCII is a coded character set of 128 characters defined in ANSI X3.4
and ISO 646.
EBCDIC is a generic term for several (many?) coded character sets of
256 characters defined by IBM. Just four of them are listed here:
<http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/EBCDIC/>

> So why are the UTF-* encodings "encodings of the Unicode character set"?

Think of Unicode as assigning characters to natural numbers - currently
from 0 to x10FFFF = 1114111. For example, number 945 = x3B1 means
the Greek small letter alpha.

The UTFs define how these numbers are represented by _byte_ sequences
(in a computer or on the Internet).
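
A minimal Python sketch of that distinction, using the alpha example above: the code point is one number, and each UTF turns it into a different byte sequence. The codec names are the ones Python's standard library uses.

```python
alpha = "\u03b1"  # GREEK SMALL LETTER ALPHA

# Unicode assigns the character a number (its code point).
print(ord(alpha), hex(ord(alpha)))

# Each UTF represents that same number as a different byte sequence.
for codec in ("utf-8", "utf-16-be", "utf-16-le", "utf-32-be"):
    print(f"{codec:10} {alpha.encode(codec).hex(' ')}")

# The highest code point mentioned above, x10FFFF, as a decimal number.
print(0x10FFFF)
```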

--
Top-posting.
What's the most irritating thing on Usenet?

**C A Upsdell** · Jul 20 '05, 07:21 PM

Re: Named vs. numerical entities

> ASCII is a coded character set of 128 characters defined in ANSI X3.4
> and ISO 646.

Not quite. You are thinking of US-ASCII. There are a variety of national
ASCII character sets.

**Alan J. Flavell** · Jul 20 '05, 07:21 PM

Re: Named vs. numerical entities

On Fri, 16 Jul 2004, Harlan Messinger wrote:

> Sorry, I missed it somehow. I intend to read it later,

Call back here when you have done?

> 1. There's nothing any more nonsensical about the concept of a Unicode
> encoding for the Unicode character set than there is about ASCII encoding
> for the ASCII character set,

Actually there are substantial differences. And you see this also
with that MIME parameter which is (mis)named "charset" - but specifies
what we now would call a "character encoding scheme".

Back when 7 or 8 bits were sufficient to represent all of the
characters of a repertoire, it was quasi-obvious that the "coded
character set" was defined by assigning numbers (0-127 or 0-255 as the
case may be) to the characters of the repertoire, and then to lay out
the fonts according to that scheme, and to transmit the characters by
means of bytes having that value.

Consequently, back then it looked as if the things that we now call
"coded character set", "character encoding" and "font arrangement"
were just different names for the same thing. Of course, you needed a
different font for each "charset" (i.e. character encoding), which got
to be a considerable drag.

Nowadays these concepts have to be disambiguated. Unicode characters
are designated by a code point which can, in principle, go up to 2**31
(it hasn't got that far yet). Those numbers then have to be
represented in a way which is convenient for transmission and/or
storage (different design criteria apply for different purposes).

> 2. EBCDIC and ASCII define the same characters, IIRC;

Actually not. But discussing that would be a pointless digression, so
let's move on.

> So why are the UTF-* encodings "encodings of the Unicode character set"?

It's not practical, for various reasons, to transmit characters as
32-bit units. For one thing, it's very wasteful. For another,
there's no unique byte-ordering, hence all this fuss about endian-ness
when units of 16 or 32 bits are involved.

There's also the question of representing unicode characters in a
mail-safe context (hence utf-7). That will fade with time, but even
8-bit-safe mail formats ban null bytes, which means that utf-16 or
utf-32/ucs-4 representations cannot be used without a further layer of
encoding.
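
The byte-order and null-byte points can both be seen in a short Python sketch. The sample word is an arbitrary choice, and `utf-7` here is simply the codec shipped with Python's standard library.

```python
word = "na\u00efve"  # one non-ASCII letter among ASCII ones

for codec in ("utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-7"):
    data = word.encode(codec)
    has_null = b"\x00" in data          # would a null-free channel reject it?
    is_7bit = all(b < 0x80 for b in data)  # does it fit in 7-bit octets?
    print(f"{codec:10} {len(data):2} octets  null bytes: {has_null}  7-bit only: {is_7bit}")

# The same text yields different bytes depending on the byte order chosen.
print(word.encode("utf-16-be") != word.encode("utf-16-le"))
```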

> Is it because they are closely related to the Unicode character set

Is it because you won't read the tutorial before asking further
questions?

ttfn

**Lars Eighner** · Jul 20 '05, 07:21 PM

Re: Named vs. numerical entities

In our last episode,
<BjTJc.185939$rCA1.116992@news01.bloor.is.net.cable.rogers.com>,
the lovely and talented C A Upsdell
broadcast on comp.infosystems.www.authoring.html:

>> ASCII is a coded character set of 128 characters defined in ANSI X3.4
>> and ISO 646.

> Not quite. You are thinking of US-ASCII. There are a variety of national
> ASCII character sets.

No. There is only one ASCII. It is a 7-bit code with 128 characters.
Think: what does the A in ASCII stand for?

--
Lars Eighner -finger for geek code- eighner@io.com http://www.io.com/~eighner/
If it wasn't for muscle spasms, I wouldn't get any exercise at all.

**Andreas Prilop** · Jul 20 '05, 07:21 PM

Re: Named vs. numerical entities

On Fri, 16 Jul 2004, C A Upsdell wrote:

>> ASCII is a coded character set of 128 characters defined in ANSI X3.4
>> and ISO 646.
>
> Not quite. You are thinking of US-ASCII.

ASCII and US-ASCII are synonyms.
<http://www.iana.org/assignments/character-sets>

> There are a variety of national ASCII character sets.

No, they are called "7-bit codes" or "7-bit coded character sets"
as defined in ISO 646. <http://www.itscj.ipsj.or.jp/ISO-IR/>
