Named vs. numerical entities

**Harlan Messinger** · Jul 20 '05, 07:21 PM

Re: Named vs. numerical entities

"Alan J. Flavell" <flavell@ph.gla.ac.uk> wrote in message
news:Pine.LNX.4.53.0407161707500.7333@ppepc56.ph.gla.ac.uk...
> On Fri, 16 Jul 2004, Harlan Messinger wrote:
>
> > Sorry, I missed it somehow. I intend to read it later,
>
> Call back here when you have done?
>
> > 1. There's nothing any more nonsensical about the concept of a Unicode
> > encoding for the Unicode character set than there is about ASCII
> > encoding for the ASCII character set,
>
> Actually there are substantial differences. And you see this also
> with that MIME parameter which is (mis)named "charset" - but specifies
> what we now would call a "character encoding scheme".
>
> Back when 7 or 8 bits were sufficient to represent all of the
> characters of a repertoire, it was quasi-obvious that the "coded
> character set" was defined by assigning numbers (0-127 or 0-255 as the
> case may be) to the characters of the repertoire, and then to lay out
> the fonts according to that scheme, and to transmit the characters by
> means of bytes having that value.
>
> Consequently, back then it looked as if the things that we now call
> "coded character set", "character encoding" and "font arrangement"
> were just different names for the same thing. Of course, you needed a
> different font for each "charset" (i.e character encoding), which got
> to be a considerable drag.
>
> Nowadays these concepts have to be disambiguated. Unicode characters
> are designated by a code point which can, in principle, go up to 2**31
> (it hasn't got that far yet). Those numbers then have to be
> represented in a way which is convenient for transmission and/or
> storage (different design criteria apply for different purposes).
>
> > 2. EBCDIC and ASCII define the same characters, IIRC;
>
> Actually not. But discussing that would be a pointless digression, so
> let's move on.
>
> > So why are the UTF-* encoding, "encodings of the Unicode character set"?
>
> It's not practical, for various reasons, to transmit characters as
> 32-bit units. For one thing, it's very wasteful. For another,
> there's no unique byte-ordering, hence all this fuss about endian-ness
> when units of 16 or 32 bits are involved.
>
> There's also the question of representing unicode characters in a
> mail-safe context (hence utf-7). That will fade with time, but even
> 8-bit-safe mail formats ban null bytes, which means that utf-16 or
> utf-32/ucs-4 representations cannot be used without a further layer of
> encoding.
>
> > Is it because they are closely related to the Unicode character set
>
> Is it because you won't read the tutorial before asking further
> questions?

No, it's because sometimes questions can be satisfied by relatively simple
answers without requiring one to read a whole tutorial (though sometimes
not). Sometimes a tutorial or textbook will tell you the way things are
without explaining why they aren't some other way (though sometimes not).

**Alan J. Flavell** · Jul 20 '05, 07:21 PM

Re: Named vs. numerical entities

On Fri, 16 Jul 2004, C A Upsdell wrote:

> Not quite. You are thinking of US-ASCII. There are a variety of
> national ASCII character sets.

That's sloppy terminology. There's a variety of 7-bit national
character sets which are patterned on ASCII (US-ASCII is a more
accurate name, since - contrary to widespread belief amongst some
parties - America doesn't consist solely of the United States).

But those national character sets were mostly codified under ISO-646.

I give you my old page

http://ppewww.ph.gla.ac.uk/~flavell/iso8859/digress.html#national

and particularly the links to
http://kanji.zinbun.kyoto-u.ac.jp/~yasuoka/CJK.html and

http://www.terena.nl/library/multiling/euroml/section04.html

This was a relevant topic in the early days of the WWW, since the code
positions which were set aside for national variations in iso-646 were
for example the basis for some of the "unsafe character" exclusions in
URLs.

Btw, I see there's a lovely comment in that Terena web page:

It will be clear that so-called "de facto standards" are related to
those discussed above as Monopoly banknotes to real money, valuable
as long as the game goes on.

**C A Upsdell** · Jul 20 '05, 07:21 PM

Re: Named vs. numerical entities

"Andreas Prilop" <nhtcapri@rrzn-user.uni-hannover.de> wrote in message
news:Pine.GSO.4.44.0407161833140.11334-100000@s5b003...
> On Fri, 16 Jul 2004, C A Upsdell wrote:
>
> >> ASCII is a coded character set of 128 characters defined in ANSI X3.4
> >> and ISO 646.
> >
> > Not quite. You are thinking of US-ASCII.
>
> ASCII and US-ASCII are synonyms.
> <http://www.iana.org/assignments/character-sets>

NOT TRUE!!!! Read the IANA page: "These names are expressed in
ANSI_X3.4-1968 which is commonly called US-ASCII or simply ASCII. The
character set most commonly use in the Internet and used especially in
protocol standards is US-ASCII, this is strongly encouraged. The use of the
name US-ASCII is also encouraged." This says that US-ASCII is commonly
called ASCII. It does not say that US-ASCII is ASCII.

Also, when I started developing software in the early 1970's -- before the
Internet, before PCs, before microprocessors -- I routinely worked with
various 7- and 8-bit ASCII character sets (in addition to EBCDIC and Gray
codes). I find many Internet references denying the existence of 8-bit
ASCII, but I can attest that, in the early 1970s, multiple 7- and 8-bit sets
were alive and well.

**Harlan Messinger** · Jul 20 '05, 07:21 PM

Re: Named vs. numerical entities

"C A Upsdell" <cupsdell0311XXX@-@-@XXXrogers.com> wrote in message
news:LzUJc.1$CmC1.0@news04.bloor.is.net.cable.rogers.com...
> "Andreas Prilop" <nhtcapri@rrzn-user.uni-hannover.de> wrote in message
> news:Pine.GSO.4.44.0407161833140.11334-100000@s5b003...
> > On Fri, 16 Jul 2004, C A Upsdell wrote:
> >
> > >> ASCII is a coded character set of 128 characters defined in ANSI X3.4
> > >> and ISO 646.
> > >
> > > Not quite. You are thinking of US-ASCII.
> >
> > ASCII and US-ASCII are synonyms.
> > <http://www.iana.org/assignments/character-sets>
>
> NOT TRUE!!!! Read the IANA page: "These names are expressed in
> ANSI_X3.4-1968 which is commonly called US-ASCII or simply ASCII. The
> character set most commonly use in the Internet and used especially in
> protocol standards is US-ASCII, this is strongly encouraged. The use of the
> name US-ASCII is also encouraged." This says that US-ASCII is commonly
> called ASCII. It does not say that US-ASCII is ASCII.

Uh, yeah, it does, unless the implication is along the lines of "... called
US-ASCII, or often simply ASCII, although this is technically incorrect
because ASCII properly refers to a different character set". But that isn't
the implication and the statement is saying that US-ASCII, ASCII, and
ANSI_X3.4-1968 are all names for the same thing--which is the same as saying
that each of them is also each of the others.

> Also, when I started developing software in the early 1970's -- before the
> Internet, before PCs, before microprocessors -- I routinely worked with
> various 7- and 8-bit ASCII character sets (in addition to EBCDIC and Gray
> codes). I find many Internet references denying the existence of 8-bit
> ASCII, but I can attest that, in the early 1970s, multiple 7- and 8-bit sets
> were alive and well.

Multiple 7- and 8-bit sets were alive and well. But they were not ASCII.
They may have been ASCII extensions, but they were not ASCII.

**C A Upsdell** · Jul 20 '05, 07:21 PM

Re: Named vs. numerical entities

"Harlan Messinger" <h.messinger@comcast.net> wrote in message
news:2lqjqsFfb5rcU1@uni-berlin.de...
> > Also, when I started developing software in the early 1970's -- before the
> > Internet, before PCs, before microprocessors -- I routinely worked with
> > various 7- and 8-bit ASCII character sets (in addition to EBCDIC and Gray
> > codes). I find many Internet references denying the existence of 8-bit
> > ASCII, but I can attest that, in the early 1970s, multiple 7- and 8-bit sets
> > were alive and well.
>
> Multiple 7- and 8-bit sets were alive and well. But they were not ASCII.
> They may have been ASCII extensions, but they were not ASCII.

Indeed they were ASCII. Standards written later appear to have
disassociated the term ASCII from the national variants and extended sets,
preferring to give them numbered ANSI designations, but in the early 1970s
they were ASCII. National variants which I personally worked with included
French, German, and Italian ASCII sets, and one of my co-workers worked with
the Portuguese set. US-ASCII is the preferred term now to avoid confusion
with the other ASCII sets.

**Alan J. Flavell** · Jul 20 '05, 07:21 PM

Re: Named vs. numerical entities

On Fri, 16 Jul 2004, Harlan Messinger wrote:

[after a bout of over-enthusiastic quoting]

> > Is it because you won't read the tutorial before asking further
> > questions?
>
> No, it's because sometimes questions can be satisfied by relatively
> simple answers without requiring one to read a whole tutorial

And it's because often, the relatively simple answers don't make any
sense until you've done the groundwork first so that you can
understand the answers (or even better - ask the right questions).

Your attention was directed to the tutorial for a constructive reason:
someone who knew the subject believed that it would be of genuine
benefit to you, it would position you better for the subsequent
discussion. As it happens, that is also my own opinion.

> Sometimes a tutorial or textbook will tell you the way things are
> without explaining why they aren't some other way (though sometimes not).

You'll be able to tell us how it was when you've tried it, OK? That
is, if I haven't lost patience by then and put you back into the
killfile...

**Harlan Messinger** · Jul 20 '05, 07:21 PM

Re: Named vs. numerical entities

"C A Upsdell" <cupsdell0311XXX@-@-@XXXrogers.com> wrote in message
news:Z7VJc.1$DgD1.0@news04.bloor.is.net.cable.rogers.com...
> "Harlan Messinger" <h.messinger@comcast.net> wrote in message
> news:2lqjqsFfb5rcU1@uni-berlin.de...
> > > Also, when I started developing software in the early 1970's -- before the
> > > Internet, before PCs, before microprocessors -- I routinely worked with
> > > various 7- and 8-bit ASCII character sets (in addition to EBCDIC and Gray
> > > codes). I find many Internet references denying the existence of 8-bit
> > > ASCII, but I can attest that, in the early 1970s, multiple 7- and 8-bit
> > > sets were alive and well.
> >
> > Multiple 7- and 8-bit sets were alive and well. But they were not ASCII.
> > They may have been ASCII extensions, but they were not ASCII.
>
> Indeed they were ASCII. Standards written later appear to have
> disassociated the term ASCII from the national variants and extended sets,
> preferring to give them numbered ANSI designations, but in the early 1970s
> they were ASCII. National variants which I personally worked with included
> French, German, and Italian ASCII sets, and one of my co-workers worked with
> the Portuguese set.

You and they called them "ASCII" informally, or do you have a citation to
show that ASCII was officially regarded as the proper name for these sets?

> US-ASCII is the preferred term now to avoid confusion
> with the other ASCII sets.

**Harlan Messinger** · Jul 20 '05, 07:21 PM

Re: Named vs. numerical entities

"Alan J. Flavell" <flavell@ph.gla.ac.uk> wrote in message
news:Pine.LNX.4.53.0407161933330.7123@ppepc56.ph.gla.ac.uk...
> On Fri, 16 Jul 2004, Harlan Messinger wrote:
>
> [after a bout of over-enthusiastic quoting]
>
> > > Is it because you won't read the tutorial before asking further
> > > questions?
> >
> > No, it's because sometimes questions can be satisfied by relatively
> > simple answers without requiring one to read a whole tutorial
>
> And it's because often, the relatively simple answers don't make any
> sense until you've done the groundwork first so that you can
> understand the answers (or even better - ask the right questions).
>
> Your attention was directed to the tutorial for a constructive reason:
> someone who knew the subject believed that it would be of genuine
> benefit to you, it would position you better for the subsequent
> discussion. As it happens, that is also my own opinion.
>
> > Sometimes a tutorial or textbook will tell you the way things are
> > without explaining why they aren't some other way (though sometimes not).
>
> You'll be able to tell us how it was when you've tried it, OK? That
> is, if I haven't lost patience by then and put you back into the
> killfile...

Oh, good grief, go ahead and get it over with. One would think I'd said
something simply terrible to you, instead of just asking questions and then
saying why I thought it was reasonable to do so.

**C A Upsdell** · Jul 20 '05, 07:21 PM

Re: Named vs. numerical entities

> > Indeed they were ASCII. Standards written later appear to have
> > disassociated the term ASCII from the national variants and extended sets,
> > preferring to give them numbered ANSI designations, but in the early 1970s
> > they were ASCII. National variants which I personally worked with included
> > French, German, and Italian ASCII sets, and one of my co-workers worked with
> > the Portuguese set.
>
> You and they called them "ASCII" informally, or do you have a citation to
> show that ASCII was officially regarded as the proper name for these sets?

I do not have any of the manuals etc. that I used 30+ years ago. Otherwise
I could show you.

**Alan J. Flavell** · Jul 20 '05, 07:21 PM

Re: Named vs. numerical entities

On Fri, 16 Jul 2004, C A Upsdell wrote:

> Indeed they were ASCII.

No, they may have been *based* on ASCII, they may have been informally
referred to as "national ASCII", but they were not literally the
"American Standard Code for Information Interchange".

> Standards written later appear to have disassociated the term ASCII
> from the national variants

Uh-uh, it's an international conspiracy to hide the origin of these
codes, is it? You don't seriously believe that the US American
national standards body would go making national character codes for
other countries, do you?

> and extended sets,

At this point nobody's arguing about "extended sets". It's about
national variants based on the 7-bit code called ASCII.

> preferring to give them numbered ANSI designations,

There you go again. ANSI (the later name of the US American national
standards body) had no jurisdiction over other national variants; only
over the (US-)American one. The British national variant based on
ASCII was a British Standard designation, BS4370; other national
variants would have had designations under their respective standards
bodies (DIN in Germany, and so on).

Later these 7-bit codes were codified into ISO-646 under the auspices
of the international standards body.

> but in the early 1970s they were ASCII.

I've been interested in character coding issues since before then, and
I say you are mistaken, or confusing loose everyday terms and formal
specifications. Not that any of this is relevant to authoring HTML for
the WWW, so I shan't keep this sub-thread going.

**Jonas Smithson** · Jul 20 '05, 07:21 PM

Re: Named vs. numerical entities

My thanks to all the respondents. I've been sitting here reading this
thread with my jaw dropped open -- people not only discussing the
arcane nuances of encoding methods, but flaming each other over it!
This thread was so far over my head that (for my purposes) it might as
well have been written in ancient Greek (say, is that a possible
encoding method?). But I got the core information I needed: there's no
speed advantage of &#8212; over &mdash;. I wish I could remember where
I read that nonsense so that (if it was in a book, which I suspect it
was) I could warn people about the title.

I guess now my decision comes down to this: named entities are more
intuitive (I can remember them while I type without looking at a
chart), but Netscape 4 doesn't understand them, and makes the text look
like junk -- but it does understand numerical entities, which I can't
remember. So which do I care more about, my convenience in writing code
or the <0.5% of NS4 users? (That's a subjective question to myself, of
course; I don't expect an answer here.) Or maybe I'll type the named
entities and then do a bulk search & replace to numeric ones before
uploading the pages...
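
A bulk replacement like the one described above could be scripted instead of done by hand in the editor. The following is only a rough sketch, assuming Python 3 and its standard `html.entities` table; the function name `named_to_numeric` is made up for the example and is not from anyone's post. It deliberately leaves `&amp;`, `&lt;`, `&gt;` and `&quot;` alone (a design choice, on the assumption that old browsers handle those four) and passes unknown names through unchanged.

```python
import re
from html.entities import name2codepoint

# Matches things shaped like &name; (letters/digits only, so &#8212; is skipped).
_NAMED = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")

# Markup-significant references that are left as they are (assumed safe everywhere).
_KEEP = {"amp", "lt", "gt", "quot"}


def named_to_numeric(html_text: str) -> str:
    """Rewrite named references such as &mdash; as decimal ones such as &#8212;."""

    def repl(match: re.Match) -> str:
        name = match.group(1)
        if name in _KEEP or name not in name2codepoint:
            return match.group(0)  # leave untouched
        return f"&#{name2codepoint[name]};"

    return _NAMED.sub(repl, html_text)


if __name__ == "__main__":
    print(named_to_numeric("<p>Wait&mdash;what? Caf&eacute; &amp; bar</p>"))
```

Run over a page before upload, this keeps the source readable while typing and sends only numeric references to the server.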

Alan Flavell wrote:
> It's best, of course, if your HTML authoring software takes
> care of the details for you, according to some options which
> you can set.

My "HTML authoring software" is a simple text editor; I don't care for
the so-called WYSIWYG editors so I have to make decisions like this for
myself.

> utf-8 encoding is widely supported and a compact representation; its
> problem is more the possibility of mishandling in the hands of
> authors who are not yet familiar with it.

How would I, for example, type an emdash in utf-8 code? (I'm pretty
sure I just asked something totally clueless, like "which hand does a
cow use to play the accordion?" Oh, well... in for a dime, in for a
dollar, as they say...)

By the way, I occasionally see garbage characters even on the big news
sites -- where it looks like they meant to insert some kind of
punctuation mark but instead I see something that looks like a Chinese
character. I'm pretty sure they're not seeing that on their end, or
they would have fixed it; and I've searched through my preference
settings (in Windows Explorer 6) but couldn't find anything that seemed
relevant in terms of character encodings. Any guess as to why I'm
seeing scattered Chinese characters (it happens fairly rarely,
actually) and the site coders (presumably) aren't?

**Matt** · Jul 20 '05, 07:21 PM

Re: Named vs. numerical entities

Jonas Smithson wrote:

> My thanks to all the respondents. I've been sitting here reading this
> thread with my jaw dropped open -- people not only discussing the
> arcane nuances of encoding methods, but flaming each other over it!

I prefer the term "heated discussion" :)

> This thread was so far over my head that (for my purposes) it might as
> well have been written in ancient Greek (say, is that a possible
> encoding method?).

Use a Greek encoding or Unicode :). AIUI, and I never took Ancient Greek
very far at school, it uses letters all found in modern Greek.

>> utf-8 encoding is widely supported and a compact representation; its
>> problem is more the possibility of mishandling in the hands of
>> authors who are not yet familiar with it.
>
> How would I, for example, type an emdash in utf-8 code? (I'm pretty
> sure I just asked something totally clueless, like "which hand does a
> cow use to play the accordion?" Oh, well... in for a dime, in for a
> dollar, as they say...)

Set your text editor to UTF-8 encoding, and input the character. You can
copy/paste it from anywhere (e.g. character map, a handy web page) or use
your keyboard -- I edited my keyboard layout to give me lots of useful
symbols. For instance, ndash – and mdash — are AltGr+hyphen and
Shift+AltGr+hyphen now.[1]

> By the way, I occasionally see garbage characters even on the big news
> sites -- where it looks like they meant to insert some kind of
> punctuation mark but instead I see something that looks like a Chinese
> character. I'm pretty sure they're not seeing that on their end, or
> they would have fixed it; and I've searched through my preference
> settings (in Windows Explorer 6) but couldn't find anything that seemed
> relevant in terms of character encodings. Any guess as to why I'm
> seeing scattered Chinese characters (it happens fairly rarely,
> actually) and the site coders (presumably) aren't?

Someone's character encoding is not set correctly. If you've set the
encoding selection in IE (View, Encoding) to auto-select, maybe theirs is
set wrongly.

[1] US layouts don't have AltGr, so I'd have to use Ctrl+Shift. You can
make a keyboard layout for Windows 2k,XP,2003 with this tool:
<http://www.microsoft.com/downloads/details.aspx?FamilyID=fb7b3dcd-d4c1-4943-9c74-d8df57ef19d7&displaylang=en>
Much, much faster for typing things like ‘ ’ “ ” · ¼ ½ ¾ ©

--
Matt


**Eric B. Bednarz** · Jul 20 '05, 07:21 PM

Re: Named vs. numerical entities

Jonas Smithson <smithsonNOSPAM@REMOVETHISboardermail.com> writes:

> [...] people not only discussing [...] but flaming each other over it!

> [...] numerical entities, [...]

If you call *character references* 'numerical entities' one more time,
you ain't seen nothing yet! ;-)

Entity references are an entirely different syntactical construct.
You are excused because the WWW is cluttered with disinformation, but
before you go to sleep you really gotta write down 100 times:

'&#' is _*/NOT/*_ an ERO delimiter

Append exclamation marks in amounts you see fit.

--
| ) 111010111011 | http://bednarz.nl/
-(
| ) Distribute me: http://binaries.bednarz.nl/mp3/aicha

**Alan J. Flavell** · Jul 20 '05, 07:21 PM

Re: Named vs. numerical entities

On Fri, 16 Jul 2004, Jonas Smithson wrote:

> My thanks to all the respondents. I've been sitting here reading this
> thread with my jaw dropped open -- people not only discussing the
> arcane nuances of encoding methods, but flaming each other over it!

Welcome to usenet. Gene Spafford had already said it in 1992
(google for usenet and "herd of performing elephants").

> My "HTML authoring software" is a simple text editor;

But /how/ simple? Come back to that in a moment...

> I don't care for the so-called WYSIWYG editors

I'm right with you there. But it isn't a binary choice between
type-every-character-by-hand or point-and-drool-and-never-see-any-HTML

> How would I, for example, type an emdash in utf-8 code?

That's a non-sequitur: your keyboard doesn't generate "in" us-ascii or
iso-8859-1 or utf-8 code, it generates keyboard codes: it's the job of
input methods to turn keypresses into actual stored characters.

If your editor is sufficiently unicode-aware, then you can type-in an
emdash character (by some combination of keypressings), and when
you're done authoring, you can say save-As and tell the dialog to save
in utf-8 format. Or you can copy/paste characters from a menu, or use
a character picker utility or whatever. The key issue is that the
editor can store and work with these characters, and save them to file
in an encoding that you like (probably utf-8).
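
To make the "save in utf-8" step concrete, here is a small sketch in Python 3 (an assumption of this example; the thread itself only talks about editors). The helper name `save_utf8` and the file name `page.html` are hypothetical. It writes a string containing an em dash (U+2014) with an explicit UTF-8 encoding and reads the raw bytes back, showing that the single character is stored as the three bytes E2 80 94.

```python
import tempfile
from pathlib import Path

EM_DASH = "\u2014"  # the em dash, written as an escape so this source stays ASCII


def save_utf8(path, text: str) -> bytes:
    """Write text as UTF-8 and return the bytes that actually landed on disk."""
    target = Path(path)
    target.write_text(text, encoding="utf-8")
    return target.read_bytes()


with tempfile.TemporaryDirectory() as tmp:
    raw = save_utf8(Path(tmp) / "page.html", f"<p>wait{EM_DASH}what</p>")

print(raw)  # the em dash appears as \xe2\x80\x94 between the ASCII bytes
```

The page still has to be served or labelled as utf-8 for a browser to read those three bytes back as one em dash.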

Recent versions of even such a "simple" editor as Notepad can do this
(in win2k, xp). Older ones can't, so you'd need to look for a
unicode-capable editor.

You could use the source-view mode of Mozilla Composer, for that
matter. A good choice, as it offers an immediate preview and various
other conveniences, such as translating &-notation to and from coded
characters.

> (I'm pretty sure I just asked something totally clueless, like
> "which hand does a cow use to play the accordion?" Oh, well... in
> for a dime, in for a dollar, as they say...)

You recognise the problem, and that's well over half way to a
solution. Believe me, it's much harder to explain anything to people
who are convinced they already understand 90% of it (just that what
they think they understand is wrong!).

You could try Alan Wood's overview at

http://www.alanwood.net/unicode/utilities_editors.html

although it's a bit of a mix of text editors, word processors and
web-page extruders all in the same bucket, so be selective.

Or google for unicode editors (and related terms) and see if you care
for anything you get.

> By the way, I occasionally see garbage characters even on the big news
> sites -- where it looks like they meant to insert some kind of
> punctuation mark but instead I see something that looks like a Chinese
> character. I'm pretty sure they're not seeing that on their end,

This can happen if they fail to specify a character encoding, and the
browser is set to auto-guess the encoding. Or various related errors.
I don't think there's a single right answer to your question. Given a
specific instance, it might be possible to deduce what had gone wrong.
Sometimes they got a news feed in one encoding, and accidentally
incorporated it into a page in a different encoding (news sites are
done from content management systems, the pages aren't produced
individually by hand).
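
The kind of mismatch described above can be reproduced in a few lines. This is an illustrative sketch only, assuming Python 3; `misdecode` is a made-up helper, and the particular encodings (windows-1252 and GBK) are just plausible stand-ins for whatever a given site and browser actually disagreed about. It encodes text one way and decodes the same bytes another way.

```python
def misdecode(text: str, actual: str = "utf-8", assumed: str = "cp1252") -> str:
    """Encode text as `actual`, then read the bytes back as if they were `assumed`."""
    return text.encode(actual).decode(assumed, errors="replace")


if __name__ == "__main__":
    # A UTF-8 em dash read as windows-1252 turns into three unrelated characters.
    print(misdecode("\u2014"))

    # A windows-1252 curly apostrophe plus the following letter, read as GBK,
    # is swallowed into a two-byte sequence and typically shows up as one
    # Chinese-looking character.
    print(misdecode("it\u2019s", actual="cp1252", assumed="gbk"))
```

The second case is one way a stray punctuation mark can come out looking like a CJK character when a browser guesses the encoding.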

hope this helps.

**Jonas Smithson** · Jul 20 '05, 07:22 PM

Re: Named vs. numerical entities

Alan J. Flavell wrote:

> If your editor is sufficiently unicode-aware, then you can type-in an
> emdash character (by some combination of keypressings), and when
> you're done authoring, you can say save-As and tell the dialog to save
> in utf-8 format....

The editor (an old version of BBEdit) gives me two options for the
document while I'm working on it: "Encode as Unicode" and, if that's
enabled, the option to "Swap Bytes". Whether or not I chose those
options, when I go to save the document, I have the further options to
"Save as Unicode" and, if that's enabled, to "Swap Bytes". (It also
gives me a choice of Macintosh, Unix, or DOS line breaks, which I
assume wouldn't affect the HTML display.) The "unicode/swap bytes"
choices, of course, mean nothing to me, and I've always left them off
(the default). I can't find anything in the editor's preferences or
dialogs about "utf-8". When they say "Save as Unicode", is it likely
they mean the same thing you mean by "save in utf-8 format"?

If I were working and saving in unicode, would that mean (for example)
that I could type an emdash the way we Mac users do it
(command-option-hyphen) and that would actually work in the HTML
document on other platforms, without my using any character entity (or
character reference or whatever it's called)? (I have a PC too so I
guess I could test that.) And would the emdash character then be more
"compact" (smaller download) than the character reference (—)
I've been using? But...um... didn't I read somewhere that unicode
documents are much larger than... the other kind... (what's a
'non-Unicode' document called?) and so should only be used if you need
support for large character sets like Chinese etc...? Or maybe they
were referring to something else... wait, I think it was called
"double-byte encoding" or something. Excuse me, my brain is exploding.
:)
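
The size question can be checked by counting bytes. The sketch below assumes Python 3; `encoded_size` is a hypothetical helper named for this example. It compares a literal em dash in UTF-8 with the two character-reference spellings, and plain English text in UTF-8 with the same text in a 16-bit ("double-byte") Unicode encoding.

```python
EM_DASH = "\u2014"


def encoded_size(text: str, encoding: str) -> int:
    """Number of bytes the text occupies in the given encoding."""
    return len(text.encode(encoding))


if __name__ == "__main__":
    print("literal em dash, utf-8:", encoded_size(EM_DASH, "utf-8"))      # 3
    print("&#8212; as ascii:      ", encoded_size("&#8212;", "ascii"))    # 7
    print("&mdash; as ascii:      ", encoded_size("&mdash;", "ascii"))    # 7

    sample = "plain text"
    print("ascii:    ", encoded_size(sample, "ascii"))      # 10
    print("utf-8:    ", encoded_size(sample, "utf-8"))      # 10
    print("utf-16-le:", encoded_size(sample, "utf-16-le"))  # 20
```

So for mostly-ASCII pages UTF-8 costs nothing extra, and a literal em dash is smaller than either reference; the doubling applies to 16-bit encodings, not to UTF-8.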

And then, of course, there's the whole other issue that my FTP program
automatically converts code to iso-8859-1 charset when you upload it,
unless you tell it not to, and when BBEdit talks directly to the FTP
server I don't know what it does.

And if I did save a text file as unicode, when I opened it later in a
text editor (perhaps even a different one), would I be able to tell
what it was saved as?
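
One partial answer: some Unicode files begin with a byte order mark (BOM), which identifies both the encoding and the byte order (the thing a "Swap Bytes" option would flip in a 16-bit encoding). A minimal sketch, assuming Python 3 and its `codecs` constants; `sniff_bom` is a made-up name, and a file saved without a BOM simply returns `None` here, so this is a hint rather than a reliable detector.

```python
import codecs

# UTF-32 is checked before UTF-16 because the UTF-32-LE mark starts with the
# same two bytes as the UTF-16-LE mark.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return the encoding named by a leading byte order mark, or None if absent."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


if __name__ == "__main__":
    # The same em dash in the two 16-bit byte orders: the bytes are swapped.
    print("\u2014".encode("utf-16-be"))
    print("\u2014".encode("utf-16-le"))
    print(sniff_bom(codecs.BOM_UTF16_BE + "\u2014".encode("utf-16-be")))
    print(sniff_bom(b"<p>no mark here</p>"))
```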

(That's a lot of questions, and I'm sure I phrased this all wrong, but
maybe you can guess what I mean or what the stuff I've been reading
meant?)
