Named vs. numerical entities

**Brian** · Jul 20 '05, 07:21 PM

Re: Named vs. numerical entities

Jonas Smithson wrote:
> I recently read the claim somewhere that numerical entities (such
> as &#8212;) have a speed advantage over the equivalent named
> entities (such as &mdash;) because the numerical entity requires
> just a single byte to be downloaded to the browser, while the named
> entity requires one byte for each letter.

My, that was a load of poppycock you were told.

> I found this claim a little surprising

That's being too kind.

> I would have thought *each* numeral in the numerical entity would
> require one byte.

That depends on the encoding. You'd best consult the guides if you
want to know more. I wish I understood it all better. I don't, despite
reading **numerous** posts from folks here who are quite well-versed.
If you're interested, Google the group for "Alan Flavell encoding" or
"Andreas Prilop charset". That'll turn up lots of posts. I'd suggest
you read what they say carefully; read those who argue with them, at
least on character encoding issues, with a grain of salt.

> Also, which form of the entity enjoys wider browser support? They
> both seem to work with modern browsers... but what about older or
> very buggy browsers?

Again, A. Flavell is your man. Brace yourself for some heavy reading:

http://ppewww.ph.gla.ac.uk/~flavell/charset/checklist

--
Brian (remove ".invalid" to email me)

http://www.tsmchughs.com/

**Stan Brown** · Jul 20 '05, 07:21 PM

Re: Named vs. numerical entities

"Jonas Smithson" <smithsonNOSPAM@REMOVETHISboardermail.com> wrote in
comp.infosystems.www.authoring.html:
>I recently read the claim somewhere that numerical entities (such as
>&#8212;) have a speed advantage over the equivalent named entities
>(such as &mdash;) because the numerical entity requires just a single
>byte to be downloaded to the browser, while the named entity requires
>one byte for each letter. (So in this case, it would presumably be one
>byte vs. seven bytes.) I found this claim a little surprising -- I
>would have thought *each* numeral in the numerical entity would require
>one byte.

It does.

Where the difference arises is if you actually create your document
in Unicode instead of an 8-bit character set. If the document is
actually composed in Unicode, and transmitted in Unicode, then there
is an advantage of the actual 8212 character because it needs only
two bytes whereas &mdash; is 7 characters. (I can't remember whether
that's 7*2=14 bytes or some compression goes on, but it's certainly
more than 2 bytes.)

--
Stan Brown, Oak Road Systems, Tompkins County, New York, USA

http://OakRoadSystems.com/

HTML 4.01 spec: http://www.w3.org/TR/html401/
validator: http://validator.w3.org/
CSS 2 spec: http://www.w3.org/TR/REC-CSS2/
2.1 changes: http://www.w3.org/TR/CSS21/changes.html
validator: http://jigsaw.w3.org/css-validator/

**Brian** · Jul 20 '05, 07:21 PM

Re: Named vs. numerical entities

Jonas Smithson wrote:

> I recently read the claim somewhere that numerical entities (such
> as &#8212;) have a speed advantage over the equivalent named
> entities (such as &mdash;) because the numerical entity requires
> just a single byte to be downloaded to the browser, while the named
> entity requires one byte for each letter. (So in this case, it
> would presumably be one byte vs. seven bytes.)

BTW, did the person whose work you read actually claim that there
would be a noticeable difference in 2 documents, where document (a)
had 6 (or 12, or, heck, even 60) bytes more than document (b)?

--
Brian (remove ".invalid" to email me)

http://www.tsmchughs.com/

**Brian** · Jul 20 '05, 07:21 PM

Re: Named vs. numerical entities

Stan Brown wrote:

> Where the difference arises is if you actually create your document
> in Unicode

I'm not sure what you mean by this. Unicode is a character set, not an
encoding. AIUI, all HTML documents are presumed to be written in
Unicode, although that's an awkward thing to say.

> instead of an 8-bit character set. If the document is actually
> composed in Unicode, and transmitted in Unicode,

There's no such thing as "transmitted in Unicode". You mean
encoded in UTF-8? But UTF-8 is an 8-bit character set (hence the name).

> then there is an advantage of the actual 8212 character because it
> needs only two bytes whereas &mdash; is 7 characters.

The only sense I can make of this is that if you use an encoding that
permits a direct representation of a character instead of requiring an
entity you'll save a few bytes. So, in UTF-8, the letter A requires 1
byte where &#65; would require 5. Is that what you meant?
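
The byte counts being argued over here can be checked directly. A minimal Python sketch, assuming the page is served as UTF-8 or UTF-16LE; the helper name `encoded_size` is made up for illustration:

```python
# Byte cost of four ways to write an em dash (U+2014) in HTML source.
EM_DASH = "\u2014"

forms = {
    "named reference": "&mdash;",
    "decimal reference": "&#8212;",
    "hex reference": "&#x2014;",
    "literal character": EM_DASH,
}


def encoded_size(text: str, encoding: str) -> int:
    """Return the number of bytes `text` occupies in `encoding`."""
    return len(text.encode(encoding))


for label, text in forms.items():
    print(f"{label:18} utf-8: {encoded_size(text, 'utf-8')} bytes, "
          f"utf-16-le: {encoded_size(text, 'utf-16-le')} bytes")
```

Both reference forms are plain ASCII text, so each of their characters costs at least one byte in any of these encodings; only the literal character changes size with the encoding.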

--
Brian (remove ".invalid" to email me)

http://www.tsmchughs.com/

**Jonas Smithson** · Jul 20 '05, 07:21 PM

Re: Named vs. numerical entities

Brian wrote:

> BTW, did the person whose work you read actually claim that there
> would be a noticeable difference in 2 documents, where document (a)
> had 6 (or 12, or, heck, even 60) bytes more than document (b)?

No, he didn't put the remark in context, as I recall... although I
don't even remember whether I read it online or in some computer book,
and the whole subject of encodings is totally confusing to me so I
probably misunderstood whatever context there may have been.

However, some of my pages have numerous character entities on them...
let's say up to fifty on a page, perhaps; if they each entailed an
extra six bytes (for example) over some alternate method, then that
might add up to an extra 300 bytes. What does that equal in download
time? How many bytes of difference do *you* think would make a
"noticeable difference" between two documents... say, to a user on a
56K modem?
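
As a rough back-of-the-envelope sketch of that question, the Python below converts extra bytes into transfer time. It assumes the nominal 56,000 bits per second line rate and ignores modem compression, protocol overhead, and latency, so real figures will differ; the function name is made up for illustration.

```python
def transfer_seconds(extra_bytes: int, bits_per_second: float) -> float:
    """Time to move `extra_bytes` over a link running at `bits_per_second`."""
    return extra_bytes * 8 / bits_per_second


# Assumed nominal rate of a "56K" modem; actual throughput is usually lower.
MODEM_BPS = 56_000

for extra in (6, 60, 300):
    milliseconds = transfer_seconds(extra, MODEM_BPS) * 1000
    print(f"{extra:4d} extra bytes: about {milliseconds:.1f} ms")
```

Under those assumptions, the 300-byte case works out to a few hundredths of a second.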

**Alan J. Flavell** · Jul 20 '05, 07:21 PM

Re: Named vs. numerical entities

On Fri, 16 Jul 2004, Brian wrote:

> There's no such thing as "transmitted in Unicode".

Agreed.

> You mean encoded in UTF-8? But UTF-8 is an 8-bit character set

No, utf-8 isn't a "character set" at all (that MIME "charset"
parameter denotes what we nowadays call a "character encoding
scheme").

> (hence the name).

The utf-8 scheme is built with 8-bit units, indeed, but characters are
represented by variable numbers of those units. (As you obviously
know).
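
To illustrate the "variable numbers of units" point, a small Python sketch that prints how many 8-bit units utf-8 uses for a few characters. The sample characters are arbitrary picks, not taken from the thread:

```python
# One character from each utf-8 length class:
# U+0041 LATIN CAPITAL LETTER A, U+00E9 e-acute,
# U+2014 EM DASH, U+1D11E MUSICAL SYMBOL G CLEF.
samples = ["A", "\u00e9", "\u2014", "\U0001d11e"]

for ch in samples:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X}: {len(encoded)} unit(s): {encoded.hex(' ')}")
```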

cheers

**Alan J. Flavell** · Jul 20 '05, 07:21 PM

Re: Named vs. numerical entities

On Thu, 15 Jul 2004, Brian wrote:

> > I found this claim a little surprising
>
> That's being too kind.

;-)

If the hon Usenaut is worried about the size of their HTML documents,
it may be worth noting that most current browsers are happy to accept
gzip-compressed HTML. At least for documents which are in a Latin
base-language, this can make far more difference to total size than
worrying about the difference between a few &-notations and utf-8
encoding.
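
A minimal Python sketch of that comparison, using the standard `gzip` module. The sample paragraph is invented, and because it is simply repeated fifty times it compresses far better than real prose would, so treat the numbers as illustrative only:

```python
import gzip

# Invented sample text with two em dash references per paragraph.
PARAGRAPH = "<p>Entities &mdash; named or numeric &mdash; cost a few bytes each.</p>\n"


def sizes(html: str) -> tuple[int, int]:
    """Return (raw, gzip-compressed) byte counts for `html` encoded as utf-8."""
    raw = html.encode("utf-8")
    return len(raw), len(gzip.compress(raw))


with_entities = PARAGRAPH * 50
with_literals = with_entities.replace("&mdash;", "\u2014")

for label, html in (("&mdash; references", with_entities),
                    ("literal characters", with_literals)):
    raw, packed = sizes(html)
    print(f"{label:20} raw: {raw} bytes, gzip: {packed} bytes")
```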

But it's probably not worth doing this until the individual HTML items
are significantly larger than the amount of HTTP red-tape involved in
retrieving the item. More than a few kBytes each, let's say.

For extra brownie points, the server can be set to honour the
browser's Accept-encoding header, sending gzip-compressed format to
those who say they accept it, and straight HTML to any who don't.
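
The negotiation can be sketched in a few lines of Python. This is a simplification under stated assumptions, not how Apache or any particular server implements it: the function names are hypothetical, and the `*` wildcard is ignored.

```python
def accepts_gzip(accept_encoding: str) -> bool:
    """Return True if the header value lists gzip with a non-zero q-value."""
    for item in accept_encoding.split(","):
        coding, _, params = item.strip().partition(";")
        if coding.strip().lower() not in ("gzip", "x-gzip"):
            continue
        q = 1.0  # default weight when no q parameter is given
        for param in params.split(";"):
            name, _, value = param.strip().partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed weight as a refusal
        return q > 0
    return False


def choose_body(accept_encoding: str, plain: bytes, gzipped: bytes):
    """Pick the stored variant to send, plus the response headers it needs."""
    headers = {"Vary": "Accept-Encoding"}
    if accepts_gzip(accept_encoding):
        headers["Content-Encoding"] = "gzip"
        return headers, gzipped
    return headers, plain
```

The `Vary` header is included so that caches keep the two variants apart.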

There are third-party Apache modules which take care of this "on the
fly", but it can be done more simply (i.e. with MultiViews) if one is
willing to store both versions on the server. Disk space is cheap
nowadays, after all.

good luck

**Alan J. Flavell** · Jul 20 '05, 07:21 PM

Re: Named vs. numerical entities

On Fri, 16 Jul 2004, Jonas Smithson wrote:

> I recently read the claim somewhere that numerical entities (such as
> &#8212;) have a speed advantage over the equivalent named entities

Others have rightly explained what nonsense that is...

> Also, which form of the entity enjoys wider browser support?

You've been given the URL of my checklist for the wider picture, but
to summarise the relevant points:

- utf-8 encoding is widely supported and a compact representation; its
problem is more the possibility of mishandling in the hands of authors
who are not yet familiar with it.

- The Latin-1 named entities (those proposed in the appendix to
RFC1866/HTML2.0) are very well supported

- Generally speaking the entities introduced in HTML4 are now
supported, but there are still browsers around (e.g. NN4.*) that don't
understand them. For almost all of these characters, I'd still say
that the &#number; representation is somewhat more widely supported.

It's best, of course, if your HTML authoring software takes care of
the details for you, according to some options which you can set.

&euro; is widely recognised, and at least still comprehensible in
browsers which don't implement it (since browsers usually display
character entities literally if they don't understand them).

> They both seem to work with modern browsers... but what about older
> or very buggy browsers?

The checklist does its best to take that into account and choose best
compromises depending on the character repertoire which you need.

WebTV seemed to be hopeless with anything outside of a subset of
Windows-1252 repertoire. If you have anything more challenging as
your content, then you'd basically have to write it off. I hear that
they're working on it.

**Harlan Messinger** · Jul 20 '05, 07:21 PM

Re: Named vs. numerical entities

"Alan J. Flavell" <flavell@ph.gla.ac.uk> wrote:

>On Fri, 16 Jul 2004, Brian wrote:
>
>> There's no such thing as "transmitted in Unicode".
>
>Agreed.

Why can't a document be encoded (and transmitted) in Unicode? If
Windows Notepad lets you save a text file as Unicode (big- or
little-endian), isn't that the same thing?

--
Harlan Messinger
Remove the first dot from my e-mail address.
Veuillez ôter le premier point de mon adresse de courriel.

**Andreas Prilop** · Jul 20 '05, 07:21 PM

Re: Named vs. numerical entities

On Fri, 16 Jul 2004, Harlan Messinger wrote:

> Why can't a document be encoded (and transmitted) in Unicode?

It cannot be "in Unicode" but UTF-8, UTF-16, or UTF-32;
and in addition in different byte order for UTF-16 and UTF-32.
<http://www.unicode.org/unicode/faq/utf_bom.html>

> If
> Windows Notepad lets you save a text file as Unicode (big- or
> little-endian), isn't that the same thing?

"Big- or little-endian" rules out UTF-8, so probably it's UTF-16.
UTF-32 isn't used in MS Windows AFAIK.
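
A short Python sketch of what byte order means in practice, using the em dash from earlier in the thread. The last lines assume, as guessed above, that a "Unicode" save is UTF-16 little-endian with a leading byte order mark; that is an assumption, not something verified against Notepad:

```python
import codecs

EM_DASH = "\u2014"

# Same character, different byte sequences depending on scheme and byte order.
for name in ("utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le"):
    print(f"{name:10} {EM_DASH.encode(name).hex(' ')}")

# A byte order mark (U+FEFF) at the start lets a reader detect the order used.
assumed_notepad_bytes = codecs.BOM_UTF16_LE + EM_DASH.encode("utf-16-le")
print("with BOM  ", assumed_notepad_bytes.hex(' '))

# The generic "utf-16" decoder reads the mark and strips it.
print(assumed_notepad_bytes.decode("utf-16") == EM_DASH)
```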

--
Top-posting.
What's the most irritating thing on Usenet?

**Andreas Prilop** · Jul 20 '05, 07:21 PM

Re: Named vs. numerical entities

On Fri, 16 Jul 2004, Jonas Smithson wrote:

> I recently read the claim somewhere that numerical entities (such as
> &#8212;) have a speed advantage over the equivalent named entities
> (such as &mdash;) because the numerical entity requires just a single
> byte to be downloaded to the browser, while the named entity requires
> one byte for each letter.

Others told you already that isn't true. But even if it were true,
a single image is usually bigger than your source text. So length
doesn't really matter. [ Oops, what did I write :-) ]

But as <http://ppewww.ph.gla.ac.uk/~flavell/charset/checklist.html#s6>
explains, decimal references are somewhat better supported among
(older) browsers than hexadecimal references or entities.

--
Top-posting.
What's the most irritating thing on Usenet?

**Neal** · Jul 20 '05, 07:21 PM

Re: Named vs. numerical entities

On Fri, 16 Jul 2004 05:34:11 GMT, Jonas Smithson
<smithsonNOSPAM@REMOVETHISboardermail.com> wrote:

> Brian wrote:
>
>> BTW, did the person whose work you read actually claim that there
>> would be a noticeable difference in 2 documents, where document (a)
>> had 6 (or 12, or, heck, even 60) bytes more than document (b)?
>
> No, he didn't put the remark in context, as I recall... although I
> don't even remember whether I read it online or in some computer book,
> and the whole subject of encodings is totally confusing to me so I
> probably misunderstood whatever context there may have been.
>
> However, some of my pages have numerous character entities on them...
> let's say up to fifty on a page, perhaps; if they each entailed an
> extra six bytes (for example) over some alternate method, then that
> might add up to an extra 300 bytes. What does that equal in download
> time? How many bytes of difference do *you* think would make a
> "noticeable difference" between two documents... say, to a user on a
> 56K modem?

Negligible. Probably most pages have that much deletable/editable crap in
them plus some...

**Alan J. Flavell** · Jul 20 '05, 07:21 PM

Re: Named vs. numerical entities

On Fri, 16 Jul 2004, Harlan Messinger wrote:

> Why can't a document be encoded (and transmitted) in Unicode?

Because "Unicode" is not the name of an encoding scheme.

> If Windows Notepad lets you save a text file as Unicode (big- or
> little-endian), isn't that the same thing?

You're talking about just two of the possible encoding schemes for
Unicode. MS using baby-talk is maybe "good enough for government
work", but this here is a technical forum. What MS's terms are
denoting are utf-16LE and utf-16BE encoding schemes.

And in any case, probably the best choice (if no other constraints
apply) of Unicode encoding scheme for HTML used in a WWW context is
utf-8, not utf-16LE/BE.

**Pierre Goiffon** · Jul 20 '05, 07:21 PM

Re: Named vs. numerical entities

"Alan J. Flavell" <flavell@ph.gla.ac.uk> wrote in message
news:Pine.LNX.4.53.0407161334000.7123@ppepc56.ph.gla.ac.uk
> And in any case, probably the best choice (if no other constraints
> apply) of Unicode encoding scheme for HTML used in a WWW context is
> utf-8, not utf-16LE/BE.

Do you mean when the text consists mostly of Latin characters?
If not, wouldn't the file get very large? Wouldn't it be better to use
UTF-16?
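
One way to get a feel for this question is to measure a few samples. The Python sketch below compares UTF-8 and UTF-16 sizes for short invented snippets, with and without HTML markup; the snippets and the `sizes` helper are made up for illustration, and real pages carry far more ASCII markup than these do:

```python
def sizes(text: str) -> tuple[int, int]:
    """Return (utf-8 bytes, utf-16 bytes without BOM) for `text`."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


# Invented sample snippets: the same greeting as bare text and inside markup.
samples = {
    "English text": "Hello, world",
    "English HTML": "<p>Hello, world</p>",
    "Russian text": "Привет, мир",
    "Russian HTML": "<p>Привет, мир</p>",
    "Japanese text": "こんにちは世界",
    "Japanese HTML": "<p>こんにちは世界</p>",
}

for label, text in samples.items():
    utf8, utf16 = sizes(text)
    print(f"{label:14} utf-8: {utf8:3d} bytes   utf-16: {utf16:3d} bytes")
```

Markup is ASCII, which costs one byte per character in UTF-8 and two in UTF-16, so the balance depends on how much of the file is tags and how much is non-Latin text.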
