Named vs. numerical entities

**C A Upsdell** · Jul 20 '05, 07:22 PM

Re: Named vs. numerical entities

"Alan J. Flavell" <flavell@ph.gla.ac.uk> wrote in message
news:Pine.LNX.4.53.0407161959060.7123@ppepc56.ph.gla.ac.uk...
> On Fri, 16 Jul 2004, C A Upsdell wrote:
>
> > Standards written later appear to have disassociated the term ASCII
> > from the national variants
>
> Uh-uh, it's an international conspiracy to hide the origin of these
> codes, is it? You don't seriously believe that the US American
> national standards body would go making national character codes for
> other countries, do you?

I generally respect what you say, even when I disagree with you. But a
paragraph like this is unworthy of you. International conspiracy? ISO an
American standards body? Standards being set by one national standards body
without consulting with other nations? You speak as if the US were the only
legitimate country in the world! Surely you are not (gasp!) a US
Republican!

> > and extended sets,
>
> At this point nobody's arguing about "extended sets". It's about national
> variants based on the 7-bit code called ASCII.

And as I said before, there were 8-bit ASCII sets, sometimes called extended
ASCII: 7 bits are not adequate to code characters for most European
languages, or for specialized character sets.

I do wish I had never discarded the manuals I used 3 decades ago. And I
wish that people would refuse to believe that information does not exist if
it does not make its way to the Internet. I have used computers, languages,
operating systems, tools, and manuals that have long been extinct. E.g.,
how many remember 8080 assembly programming using Intel MDS Development
Systems running the ISIS-II operating system? Or my favourite programmer's
editor, the Sage Professional Editor for Windows and OS/2? Or how to
program Intel's 8259A UART for either 7- or 8-bit serial communications?
Sigh.

**Stan Brown** · Jul 20 '05, 07:22 PM

Re: Named vs. numerical entities

"Jonas Smithson" <smithsonNOSPAM@REMOVETHISboardermail.com> wrote in
comp.infosystems.www.authoring.html:
>But I got the core information I needed: there's no
>speed advantage of — over —.

It's true that there's no speed advantage.

There is another advantage, however, one that I have not seen
mentioned in this thread: Netscape 4 understands — but does
not understand —. That might weigh in your decision.

--
Stan Brown, Oak Road Systems, Tompkins County, New York, USA

http://OakRoadSystems.com/

HTML 4.01 spec: http://www.w3.org/TR/html401/
validator: http://validator.w3.org/
CSS 2 spec: http://www.w3.org/TR/REC-CSS2/
2.1 changes: http://www.w3.org/TR/CSS21/changes.html
validator: http://jigsaw.w3.org/css-validator/

**Tim** · Jul 20 '05, 07:22 PM

Re: Named vs. numerical entities

On Fri, 16 Jul 2004, Pierre Goiffon wrote:
>> If not, wouldn't the file get very large? Wouldn't it be
>> better to use UTF-16?

"Alan J. Flavell" <flavell@ph.gla.ac.uk> posted:

> I haven't widely tested browser compatibility for utf-16 encodings, so
> I can't comment on that aspect.

Not that long ago I tried utf-16 on several different (and *current*
versions of) web browsers. Only some could use it.

I know that's vague, and I'm not inclined to run all the tests before I
post this response. But it was enough to convince *me* that it was a bad
idea.

--
If you insist on e-mailing me, use the reply-to address (it's real but
temporary). But please reply to the group, like you're supposed to.

This message was sent without a virus, please delete some files yourself.

**Leif K-Brooks** · Jul 20 '05, 07:22 PM

Re: Named vs. numerical entities

Jonas Smithson wrote:
> I can't find anything in the editor's preferences or
> dialogs about "utf-8". When they say "Save as Unicode", is it likely
> they mean the same thing you mean by "save in utf-8 format"?

I have way too much time on my hands, so I think I'll write a
(hopefully) easy to understand explanation of this stuff. I'm not an
expert, and I'm sure one will correct me on some of the finer points,
but I should at least be able to give you a good enough idea of this stuff.

Computers store things in bytes, which are numbers between 0 and 255.
This system works great for numbers, since you can use multiple bytes to
store numbers larger than 255, but text is a bit problematic when all
you have to work with is numbers.

Enter character sets and encodings. A character set is just that: a set
of characters. An encoding is a way to convert characters in a character
set into a series of bytes. Some simple character sets which define 256
characters or less can also be considered encodings, since nothing
special is required to convert them into bytes.

The first character set, which was also an encoding because it defined
only 128 characters, was called ASCII. It was fine for early computers,
but there was a problem: it only defined the Latin alphabet, digits, and
a few simple symbols. Countries which needed accented letters had
trouble, and countries which had entirely different alphabets couldn't
use ASCII at all.

In an attempt to fix all of those problems, the International
Organization for Standardization and others defined encodings which kept
the 128 ASCII characters, but also used the other 128 integers in a byte
for other characters. Unfortunately, there were more than 128 characters
needed for other alphabets, so several incompatible encodings defining
different characters were created instead of just one. That worked for a
while, but the incompatibility of the different encodings stopped
characters from different alphabets from being used in the same
document, which some people needed to do.
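A minimal Python sketch of that incompatibility. The three ISO 8859 parts and the byte value 0xE9 are my own picks for illustration, not something named in the thread: the same byte above 127 decodes to a different character in each encoding, while the ASCII range stays shared.

```python
# One byte value, decoded under three of the incompatible 8-bit encodings.
raw = bytes([0xE9])

decoded = {
    name: raw.decode(name)
    for name in ("iso8859_1", "iso8859_5", "iso8859_7")
}
for name, char in decoded.items():
    print(f"{name}: U+{ord(char):04X} {char}")

# The lower 128 values are plain ASCII in all three.
ascii_bytes = b"HTML"
ascii_shared = all(ascii_bytes.decode(name) == "HTML" for name in decoded)
print(ascii_shared)
```

A document can only declare one of these encodings at a time, which is why mixing, say, accented Latin and Greek letters in one file was not possible with them.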

The most important character set today is called Unicode. It currently
defines 96000 characters, and reserves the right to define a total of
1114112 characters in the future. It has Latin, Greek, Chinese, and
everything in between; hopefully enough for anyone.

Note that I said Unicode is a character set, not an encoding. It has
three different encodings: UTF-8, UTF-16, and UTF-32. UTF-8 is probably
the most used; it uses a different number of bytes (between 1 and 4) for
different characters, and all ASCII text is also valid UTF-8 text.
UTF-16 also uses a variable number of bytes: 2 or 4 in this case. UTF-32 is
the simplest for programs to process; it uses 4 bytes for every character.
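The byte counts can be checked with a short Python sketch. The sample characters are arbitrary choices of mine, and the `-le` codec variants are used only so that no byte-order mark inflates the numbers.

```python
samples = {
    "LATIN CAPITAL A": "A",
    "e with acute": "\u00e9",
    "em dash": "\u2014",
    "musical G clef (beyond the 16-bit range)": "\U0001D11E",
}

def encoded_sizes(char):
    """Return the byte length of one character in UTF-8, UTF-16 and UTF-32."""
    return tuple(
        len(char.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")
    )

sizes = {label: encoded_sizes(char) for label, char in samples.items()}
for label, (u8, u16, u32) in sizes.items():
    print(f"{label}: UTF-8 {u8}, UTF-16 {u16}, UTF-32 {u32} bytes")
```

Only characters beyond U+FFFF take four bytes in UTF-16; everything else in that encoding takes two.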

As to whether your editor means UTF-8 by Unicode, I'm not sure. It
doesn't really mean Unicode, but whether it means UTF-8, UTF-16, or
UTF-32 is difficult to say.

> If I were working and saving in unicode, would that mean (for example)
> that I could type an emdash the way we Mac users do it
> (command-option-hyphen) and that would actually work in the HTML
> document on other platforms, without my using any character entity (or
> character reference or whatever it's called)?

Yes. I believe Mac OS X handles these things very nicely, so you
shouldn't have any trouble.

> And would the emdash character then be more
> "compact" (smaller download) than the character reference (—)
> I've been using?

Yes. The character reference is 7 bytes in UTF-8, but the emdash itself
encoded in UTF-8 is only three bytes.
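A quick way to count the bytes in Python, assuming the references in question are the named form `&mdash;` and the numeric form `&#8212;` (the archived thread no longer shows which was typed):

```python
em_dash = "\u2014"
named_ref = "&mdash;"
numeric_ref = "&#8212;"

# Sizes of each form once the document is saved as UTF-8.
literal_size = len(em_dash.encode("utf-8"))
named_size = len(named_ref.encode("utf-8"))
numeric_size = len(numeric_ref.encode("utf-8"))

print(literal_size, named_size, numeric_size)
```

Both references consist of seven ASCII characters, so they cost the same; the literal character is the smaller option either way.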

> But...um... didn't I read somewhere that unicode
> documents are much larger than... the other kind... (what's a
> 'non-Unicode' document called?) and so should only be used if you need
> support for large character sets like Chinese etc...?

Yes and no. UTF-8 documents are the same size as iso-8859-1 documents,
but UTF-16 and UTF-32 documents are larger.

> And then, of course, there's the whole other issue that my FTP program
> automatically converts code to iso-8859-1 charset when you upload it,
> unless you tell it not to, and when BBEdit talks directly to the FTP
> server I don't know what it does.

My advice would be to replace your FTP client if it's that broken, but
you might be able to fix it by uploading in binary mode instead of text.
As for what BBEdit does, my guess would be that it does the right
thing if it has an option for Unicode when saving.

> And if I did save a text file as unicode, when I opened it later in a
> text editor (perhaps even a different one), would I be able to tell
> what it was saved as?

Not unless your text editor told you, which it might.
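One hint an editor may leave behind is a byte-order mark (BOM) at the start of the file. The thread does not go into BOMs, so take this as a side sketch: the helper name `sniff_bom` is hypothetical, and a file without a BOM simply gives no answer.

```python
import codecs

# Longer marks first: the UTF-32 LE mark begins with the UTF-16 LE mark.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_bom(data):
    """Return the encoding implied by a leading byte-order mark, or None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None

# Python's plain "utf-16" codec writes a BOM; plain "utf-8" does not.
print(sniff_bom("\u2014".encode("utf-16")))
print(sniff_bom("\u2014".encode("utf-8")))
```

In practice you would pass the first few bytes of the file, read in binary mode, to the helper.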

**Alan J. Flavell** · Jul 20 '05, 07:22 PM

Re: Named vs. numerical entities

On Sat, 17 Jul 2004, Jonas Smithson wrote:
> The editor (an old version of BBEdit) gives me two options for the
> document while I'm working on it: "Encode as Unicode" and, if that's
> enabled, the option to "Swap Bytes".

Feel free to play around with this stuff and see what happens. E.g.,
put some interesting characters into a file, save it with the various
options, open the file in a unicode-capable web browser and play with
its view> character encoding options (whatever it calls them) till the
result makes sense. Then you'll have a better idea of what you've
got. View the source to make sure you're getting coded characters
instead of &-notations.

My hunch is that your editor is talking about the forerunner of utf-16
which was called ucs-2, back when the Unicode range could all be
represented in two bytes. For this subset of characters, you may be
able to treat utf-16 and ucs-2 as effectively synonymous.
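A small Python sketch of both points, on the assumption (not confirmed for BBEdit) that "Swap Bytes" selects between the two possible byte orders of a 16-bit encoding. The sample characters are my own.

```python
em_dash = "\u2014"       # inside the original 16-bit range
g_clef = "\U0001D11E"    # outside it, so ucs-2 could not represent it

# The same 16-bit unit written in the two byte orders.
big_endian = em_dash.encode("utf-16-be")
little_endian = em_dash.encode("utf-16-le")
print(big_endian.hex(), little_endian.hex())

# Beyond U+FFFF, utf-16 uses a surrogate pair: two 16-bit units, four bytes.
pair = g_clef.encode("utf-16-be")
print(len(pair), pair.hex())
```

For text that stays within the 16-bit range, the utf-16 bytes are what a ucs-2 editor would have written, which is the sense in which the two can be treated as synonymous.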

My reading of Alan Wood's pages on editors (please consult them) is
that current versions of BBEdit support utf-8:

http://www.alanwood.net/unicode/utilities_editors_mac.html#bbedit

> (It also gives me a choice of Macintosh, Unix, or DOS line breaks,
> which I assume wouldn't affect the HTML display.)

Agreed

> If I were working and saving in unicode, would that mean (for example)
> that I could type an emdash the way we Mac users do it
> (command-option-hyphen) and that would actually work in the HTML
> document on other platforms, without my using any character entity

Right

> And would the emdash character then be more "compact" (smaller
> download) than the character reference (—) I've been using?

Yes

> But...um... didn't I read somewhere that unicode
> documents are much larger than... the other kind...

utf-8 is a good compromise for western writing systems. We've
discussed some of the issues elsewhere on this thread.

> And then, of course, there's the whole other issue that my FTP program
> automatically converts code to iso-8859-1 charset when you upload it,
> unless you tell it not to, and when BBEdit talks directly to the FTP
> server I don't know what it does.

This is a detail which you'd need to get a grasp on, right.

But play around a bit, and read around a bit, so that competences and
understanding stay reasonably in step. In the end, it's all much
simpler and more straightforward than it might have seemed at the outset.
But if your software doesn't properly support what you're trying to do,
then you're confronted with extra difficulties. So do take a look at
Alan Wood's overview as it relates to your particular platform(s) and
pick something that appeals to you, at least for the first steps.
Then you'd be able to assess whether the software that you're already
using is actually capable of what you need.

**Alan J. Flavell** · Jul 20 '05, 07:22 PM

Re: Named vs. numerical entities

On Sat, 17 Jul 2004, Leif K-Brooks wrote:
> Yes and no. UTF-8 documents are the same size as iso-8859-1 documents,

Er, no. The characters in the upper half of iso-8859-1 need two bytes
per character in utf-8; only one in iso-8859-1.

> My advice would be to replace your FTP client if it's that broken,

Cue A.Prilop and the anti-Pirard league (that's an in-joke, don't
worry about it). The FTP software is not "broken", it's got extra
functionality, for mapping between traditional MacRoman encoding and
iso-8859-1. That function needs to be off when the material isn't
encoded in MacRoman.
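A rough Python sketch of why the mapping must be switched off for other material. It assumes the client's text mode behaves like a straight MacRoman to iso-8859-1 transcoding, which is my simplification; the sample string is also mine.

```python
text = "caf\u00e9 \u2014 na\u00efve"

# The file as a classic Mac editor would have stored it.
mac_bytes = text.encode("mac_roman")

# What the mapping is meant to do. The em dash has no slot in
# iso-8859-1, so it is replaced with "?" here.
latin1_bytes = mac_bytes.decode("mac_roman").encode("iso8859_1", errors="replace")

# The same mapping applied to a file that is really utf-8 corrupts it.
utf8_bytes = text.encode("utf-8")
mangled = utf8_bytes.decode("mac_roman").encode("iso8859_1", errors="replace")

print(mac_bytes.hex(" "))
print(latin1_bytes.hex(" "))
print(mangled != utf8_bytes)
```

Uploading in binary mode sidesteps the problem because the bytes are then sent untouched.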

**Andy Dingley** · Jul 20 '05, 07:22 PM

Re: Named vs. numerical entities

On Fri, 16 Jul 2004 17:50:35 GMT, "C A Upsdell"
<cupsdell0311XXX@-@-@XXXrogers.com> wrote:

>I routinely worked with
>various 7- and 8-bit ASCII character sets (in addition to EBCDIC and Gray
>codes).

Gray codes are a red herring here. They've nothing to do with
character encodings.

**Brian** · Jul 20 '05, 07:22 PM

Re: Named vs. numerical entities

C A Upsdell wrote:
> "Alan J. Flavell" wrote...
>
>> C A Upsdell wrote:
>>
>>> Standards written later appear to have disassociated the term
>>> ASCII from the national variants
>>
>> Uh-uh, it's an international conspiracy to hide the origin of
>> these codes, is it? You don't seriously believe that the US
>> American national standards body would go making national
>> character codes for other countries, do you?
>
> a paragraph like this is unworthy of you. International
> conspiracy?

"Don't you know sarcasm when you hear it?!" [1]

> ISO an American standards body?

Not *quite* what he was saying. ;-)

> Standards being set by one national standards body without
> consulting with other nations? You speak as if the US were the
> only legitimate country in the world!

Hard to imagine how you could have misread that post more than you did.

--
Brian (remove ".invalid" to email me)

http://www.tsmchughs.com/

**Alan J. Flavell** · Jul 20 '05, 07:22 PM

Re: Named vs. numerical entities

On Sat, 17 Jul 2004, Brian wrote [to C A Upsdell ]:
> Hard to imagine how you could have misread that post more than you did.

It's comforting to know that someone could perceive the
discrepancy ;-)

I don't think it's worth my while to even start on responding to the
various non-sequiturs. Suffice it to say that I'm well near the front
in the crabby old b*gger stakes, I met my first computer in 1958 and
some of my early programs are for converting between different
character encodings. I've had an interest in character
representation, specifications, standards, usage and terminology in
this field ever since.

Oh, and ASCII is a 7-bit code.

all the best.

**Brian** · Jul 20 '05, 07:22 PM

Re: Named vs. numerical entities

Brian wrote:
> "Don't you know sarcasm when you hear it?!" [1]

That note marker was meant to be followed by a citation. Here it is: I
lifted that from one Charles Brown.

--
Brian (remove ".invalid" to email me)

http://www.tsmchughs.com/

**Leif K-Brooks** · Jul 20 '05, 07:22 PM

Re: Named vs. numerical entities

Alan J. Flavell wrote:
> On Sat, 17 Jul 2004, Leif K-Brooks wrote:
>
>>Yes and no. UTF-8 documents are the same size as iso-8859-1 documents,
>
> Er, no. The characters in the upper half of iso-8859-1 need two bytes
> per character in utf-8; only one in iso-8859-1.

Darn, you're right. Could've sworn I read that somewhere, even though it
doesn't make any sense; I guess this is why one shouldn't make Usenet
posts after midnight.
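The size difference is easy to confirm in Python. The sample sentence is an arbitrary one of mine with four accented letters.

```python
text = "Na\u00efve caf\u00e9 d\u00e9j\u00e0 vu"

latin1_size = len(text.encode("iso8859_1"))
utf8_size = len(text.encode("utf-8"))

# Characters from the upper half of iso-8859-1 (code points above 127).
upper_half = sum(1 for ch in text if ord(ch) > 127)

print(latin1_size, utf8_size, upper_half)
```

Each upper-half character costs exactly one extra byte in utf-8, so a file of pure ASCII text is the only case where the two encodings give the same size.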

**Andy Dingley** · Jul 20 '05, 07:22 PM

Re: Named vs. numerical entities

On Fri, 16 Jul 2004 19:57:52 GMT, Jonas Smithson
<smithsonNOSPAM@REMOVETHISboardermail.com> wrote:

>But I got the core information I needed: there's no
>speed advantage of — over —.

...I've never understood encodings or entities either....

How portable is —, as a very general thing, relative to say,
&#160; ?

I'd always assumed that both were effectively portable, but just this
week I've been having trouble with a system (Vodafone's PartnerML)
that can't handle apostrophes from M$oft Word, that appear as ’

**Jonas Smithson** · Jul 20 '05, 07:22 PM

Re: Named vs. numerical entities

Well, thanks again to all of you; you've given me a good starting point
for figuring out at least the basics of this stuff, and you've been
very patient with me (although not, I think, with each other!).

I realize now that some of what I've been reading in books has been
misleading or simply wrong; it's odd that a Usenet newsgroup could be
more reliable than some books from reputable publishers, but that seems
to be the case... which makes it hard to know how to "filter"
information as I go forward. In fact, much as I dislike the combative
or sneering tone that many Usenetters adopt (unnecessarily, I think), I
see that the contentiousness does serve one useful purpose -- when I'm
reading a book that contains misinformation, it would be useful if a
critic could be there to step in with a demurral!

Jonas

**Alan J. Flavell** · Jul 20 '05, 07:22 PM

Re: Named vs. numerical entities

On Sat, 17 Jul 2004, Andy Dingley wrote:
> How portable is —, as a very general thing,

Portable? Utterly: it's a string of seven ASCII characters, after
all; they're unlikely to come to any harm in transit. Compatible with
all browsers and client agents? No, but it's been clear enough since
RFC1866/HTML2.0 that this was where HTML would be heading; RFC2070
actually codified it, and HTML4.0 put it into a W3C version of HTML.
That's quite a little while back now, as you may recall.

> relative to say, &#160; ?

That notation is technically meaningless (in HTML) and AFAIK illegal
in XHTML. So by definition it's not compatible with anything. Sure,
it happens to pick out the displayable characters of the Windows-1252
code on a rather popular majority platform; and other browser makers
may have considered that they couldn't afford to not copy that
behaviour, no matter what the specifications said. So it gives the
visual result that the author intended; but to call that "working"
would be stretching things.

> I'd always assumed that both were effectively portable,

But what do you really mean by "portable"? They are notations
constructed of strings of ASCII characters. They will certainly
-reach- every client agent in that form. If you really mean "will
client agents render them?" why not ask that question? Most will;
some won't. At least if you use ’ then by definition any client
agent which doesn't render them, doesn't support HTML4. If you use
numbers between 128 and 159 respectively, then you're not really
writing HTML, but some kind of quasi-MSHTML which even MS are weaning
themselves off now.

> but just this week I've been having trouble with a system
> (Vodafone's PartnerML) that can't handle apostrophes from M$oft
> Word, that appear as ’

AFAIK, neither does WebTV. Works great in Lynx, of course.

If you would at least code them as 8-bit characters, instead of
&#number; notations, and send them as charset=windows-1252, then you
would at least be both (a) honest and (b) protocol-conforming. It's
not my top recommendation - far from it, but see the discussion:

http://ppewww.ph.gla.ac.uk/~flavell/charset/checklist.html#s3a

hope that helps

**Alan J. Flavell** · Jul 20 '05, 07:22 PM

Re: Named vs. numerical entities

On Sat, 17 Jul 2004, Alan J. Flavell wrote:
> On Sat, 17 Jul 2004, Andy Dingley wrote:
>
> > How portable is —, as a very general thing,
[...]
> > relative to say, &#160; ?
>
> That notation is technically meaningless (in HTML) and AFAIK illegal
> in XHTML.

Hah! You caught me out well and truly there!!

There's nothing wrong with 160, it's a no-break space.

The windows-1252 code for your em dash would be 151. And there I was,
posting on autopilot, assuming that's what you had typed. Well, hit
me down with a clue by four...

But the rest of what I wrote was, at least, what I intended. Sorry
about that.

http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT
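The three numbers discussed in this exchange can be compared in Python. This is only an illustration: `html.unescape` follows the later HTML5 parsing rules, which write the windows-1252 fallback for 128 to 159 into the standard, so its result for 151 shows what browsers do rather than what HTML 4 defined.

```python
import html

# 160 is a genuine Latin-1 code point: the no-break space.
nbsp = html.unescape("&#160;")

# 8212 is the Unicode code point of the em dash.
em_dash = html.unescape("&#8212;")

# 151 is the byte value windows-1252 assigns to its em dash.
cp1252_dash = bytes([151]).decode("cp1252")

# Read as a Unicode code point, 151 is a C1 control character instead.
control = chr(151)

# HTML5-style handling quietly remaps the reference to the em dash.
remapped = html.unescape("&#151;")

print(repr(nbsp), repr(em_dash), repr(cp1252_dash), repr(control), repr(remapped))
```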
