Simple high-ascii character encoding

**Jukka K. Korpela** · Aug 25 '05, 11:15 AM

Re: Simple high-ascii character encoding

chandy@totalise.co.uk wrote:

> I have an Html document that declares that it uses the utf-8 character
> set.

Does it do that properly? Prove it, show us the URL! :-)

> As this document is editable via a web interface I need to make
> sure that high-ascii characters that may be accidentally entered are
> properly represented when the document is served.

There are no high-ascii characters. Ascii stops at 127, has always
stopped, and will always stop.

If your document is adequately UTF-8 encoded, then form data sent via a
form on the page will appear as UTF-8 encoded, too, though naturally it
will _also_ be encoded as specified for form data encoding in general.

> My programming
> language allows me to get the ascii value for any individual character
> so what I am doing when a change is saved is to look at each character
> in the content and if the ascii value for a character > 127 then I
> replace 'character' with '&#AsciiValue;' .

Why would you do that, given the fact that there are no Ascii values
greater than 127 and the fact that your form data handler gets the data
in UTF-8 encoding? What would be the point in replacing it by a
character reference, when the page itself is UTF-8 encoded?
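On the earlier point that form data arrives UTF-8 encoded and is _also_ form-encoded, here is a rough Python sketch of the two layers. The field name `body` and the sample text are made up for illustration, and it assumes the usual `application/x-www-form-urlencoded` submission.

```python
from urllib.parse import parse_qs, urlencode

# Hypothetical form field and sample value, for illustration only.
FIELD = "body"
TEXT = "café ™"


def encode_form(field: str, text: str) -> str:
    """Serialise the text as UTF-8, then percent-encode it as form data."""
    return urlencode({field: text}, encoding="utf-8")


def decode_form(payload: str, field: str) -> str:
    """Undo both layers: percent-decoding first, then UTF-8 decoding."""
    return parse_qs(payload, encoding="utf-8", errors="strict")[field][0]


if __name__ == "__main__":
    payload = encode_form(FIELD, TEXT)
    print(payload)                      # body=caf%C3%A9+%E2%84%A2
    print(decode_form(payload, FIELD))  # café ™
```

Once the handler has decoded the submission this way, the text can be written back into the UTF-8 page as it stands, with no character references needed.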

**Alan J. Flavell** · Aug 25 '05, 11:35 AM

Re: Simple high-ascii character encoding

On Thu, 25 Aug 2005 chandy@totalise.co.uk wrote under the
heading:

> Simple high-ascii character encoding

Hmmm. What's that supposed to mean in an HTML context?

> I have an Html document that declares that it uses the utf-8 character
> set.

Terminology again! utf-8 is not a "character set", but a character
encoding scheme of unicode. I can't help it that, way back, MIME chose
the attribute name of "charset=" for this, which in current terminology
is very misleading, but utf-8 still isn't a "character set".

> As this document is editable via a web interface I need to make
> sure that high-ascii characters that may be accidentally entered

I think you'd benefit from getting rid of this obsolete term
"high-ascii". ASCII is a 7-bit code, containing a mere 95 displayable
characters, whereas the document character set of HTML is Unicode,
containing vastly more characters than ASCII.
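The figure of 95 is easy to check. A small Python sketch, where the helper names are only illustrative:

```python
def printable_ascii() -> str:
    """Collect the code numbers 0..127 that are displayable characters."""
    return "".join(chr(n) for n in range(128) if chr(n).isprintable())


def is_ascii(text: str) -> bool:
    """True only if every character has a code number below 128."""
    return all(ord(ch) < 128 for ch in text)


if __name__ == "__main__":
    print(len(printable_ascii()))  # 95 (space through tilde)
    print(is_ascii("plain text"))  # True
    print(is_ascii("café"))        # False: é is not an ASCII character
```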

Modern OSes often define input methods for wide ranges of these
non-ASCII characters...

> are properly represented when the document is served.

Details depend on your OS and editing application, but modern OSes don't
mind storing utf-8, and serving them out as such.

> My programming language allows me to get the ascii value for any
> individual character

But most of the characters aren't in ASCII, so how could they have
an "ascii value"? Character representation in HTML isn't hard, but
you *do* have to use the terms with some care, if you want to make
sense.

> so what I am doing when a change is saved is to look at each character
> in the content and if the ascii value for a character > 127

There ARE no ASCII characters with a value above 127!

> then I replace 'character' with '&#AsciiValue;' .

There *are* no ASCII values greater than 127.

Representing non-ASCII characters as &#number; , using their character
number in Unicode, is a feasible approach - but rather voluminous if you
have many of them.
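As a sketch of that approach (Python is assumed here purely for illustration, since the original poster's language isn't named), the standard `xmlcharrefreplace` error handler emits exactly these `&#number;` references, using Unicode character numbers:

```python
import html


def to_char_refs(text: str) -> str:
    """Replace each non-ASCII character with a decimal &#number; reference."""
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


def from_char_refs(markup: str) -> str:
    """Turn character references back into the characters they denote."""
    return html.unescape(markup)


if __name__ == "__main__":
    refs = to_char_refs("® © ™ é")
    print(refs)                  # &#174; &#169; &#8482; &#233;
    print(from_char_refs(refs))  # ® © ™ é
```

Note that the number for ™ is 8482, its Unicode character number. A lookup in a Windows code table would give 153 instead, which is not the Unicode number for that character.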

I have a checklist that's been quite widely peer-reviewed: I'd
recommend that you work your way down the scenarios, and pick one that
seems to fit your needs.

http://ppewww.ph.gla.ac.uk/~flavell/charset/checklist

Hope this helps a bit.

**Chandy** · Aug 25 '05, 02:15 PM

Re: Simple high-ascii character encoding

Yep, clearly I make no sense to people who understand this better than
I do :) Okay, the language returns integer values for the standard as
well as 'extended' ascii characters (as detailed, for example, on
http://www.asciitable.com/). My document is not public but starts
with:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

The system is publishing content in English to the web but is
potentially for world-wide consumption. Generally the extra characters
I have to represent will be items like ®, © and ™ and
some accented letters, but I was wanting to avoid having to have a
lookup of ascii value->Html Entity by just changing the character for
&#Value; when it seemed to have a value that put it outwith the
standard ascii range. I'll re-ask the question 'is this sensible'
while I read through the document you referred to.

Thanks!

Chandy

**Andreas Prilop** · Aug 25 '05, 02:45 PM

Re: Simple high-ascii character encoding

On 25 Aug 2005, Chandy wrote:

> (as detailed, for example, on http://www.asciitable.com/).

| Not Found
| The requested URL /). was not found on this server.

Did you mean http://www.asciitable.com/ ? This is just bullshit!
Please refer to

ISO 646 (Good old ASCII)

http://czyborra.com/charsets/iso646.html

Codepage & Co.

http://czyborra.com/charsets/codepages.html

ISO 8859 Alphabet Soup

http://czyborra.com/charsets/iso8859.html

for reliable information.

**Harlan Messinger** · Aug 25 '05, 03:45 PM

Re: Simple high-ascii character encoding

Chandy wrote:

> Yep, clearly I make no sense to people who understand this better than
> I do :) Okay, the language returns integer values for the standard as
> well as 'extended' ascii characters (as detailed, for example, on
> http://www.asciitable.com/).

As that page itself says, "it took a while to get a single standard for
these extra characters and hence there are few varying 'extended' sets.
The most popular is presented below." This is all self-contradictory.
The point is there is no character set correctly called "extended
ASCII". Anyone using that term to refer to *a* mapping of a collection
of characters to codes 128-255 is using it because either:

(a) He thinks that "ASCII" itself refers to the numeric range 0-127, and
that "extended ASCII" therefore means, unambiguously, the range 128-255.
This is incorrect, because "ASCII" doesn't in the first place refer to
the range of numbers, it refers to a very specific set of characters (or
control codes) and its *assignment* to those numbers.

(b) He got the impression somewhere that the particular set of
characters he's seen assigned to the range 128-255 is *the* set of
characters so assigned, and that that particular set is known as
"extended ASCII". So what was introduced to MS-DOS users as "extended
ASCII" was the set of Microsoft line draw characters. Many other users
think it refers to the Western Europe (Latin-1) extension. And so forth.
The term is used by different people to refer to what they think it
means. In other words, it doesn't technically mean anything. It's a
misconception.

**Harlan Messinger** · Aug 25 '05, 03:55 PM

Re: Simple high-ascii character encoding

Harlan Messinger wrote:

> As that page itself says, "it took a while to get a single standard for
> these extra characters and hence there are few varying 'extended' sets.
> The most popular is presented below." This is all self-contradictory.
> The point is there is no character set correctly called "extended
> ASCII". Anyone using that term to refer to *a* mapping of a collection
> of characters to codes 128-255 is using it because either:

[snip]

To be fair, *any* of the character sets of which ASCII is a subset can
legitimately be called *an* "extension of ASCII". Latin-1 is an ASCII
extension, as is Unicode. But still, it makes no sense to speak of
"extended ASCII characters".

First, a given character may appear in one or more of these schemes and
*not* appear in one or more others. Would that character be an "extended
ASCII" character or not? The answer is that it's a character in some of
those character sets or that's represented in some of those encodings,
and not in others. The question of whether it's an "extended ASCII"
character is meaningless.

Second, a given character may appear in two different character sets but
mapped to different codes. What's the "extended ASCII code" for an em
dash? Well, under the standard Windows character set, an em-dash is
character 151; if you're using Unicode, it's character 8212; and if
you're using ISO-8859-1, it isn't anything at all because the em dash
isn't part of that character set. In other words, again, it's
meaningless to talk about a character's extended ASCII code.
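The three cases can be checked in a few lines of Python. This is only a sketch: it assumes that "the standard Windows character set" means windows-1252 (`cp1252` in Python's codec names), and `code_in` is an illustrative helper.

```python
from typing import List, Optional

EM_DASH = "\u2014"


def code_in(encoding: str, ch: str) -> Optional[List[int]]:
    """Byte values of ch in the given encoding, or None if it has no code there."""
    try:
        return list(ch.encode(encoding))
    except UnicodeEncodeError:
        return None


if __name__ == "__main__":
    print(ord(EM_DASH))                    # 8212, the Unicode character number
    print(code_in("cp1252", EM_DASH))      # [151]
    print(code_in("iso-8859-1", EM_DASH))  # None: not in that character set
    print(code_in("utf-8", EM_DASH))       # [226, 128, 148]
```

The last line is a reminder that even "character 8212" turns into different bytes again once a particular encoding of Unicode is chosen.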

**Guy Macon** · Aug 25 '05, 04:05 PM

Re: Simple high-ascii character encoding

Harlan Messinger wrote:

>To be fair, *any* of the character sets of which ASCII is a subset can
>legitimately be called *an* "extension of ASCII". Latin-1 is an ASCII
>extension, as is Unicode. But still, it makes no sense to speak of
>"extended ASCII characters".

I was about to ask if anyone had bothered to list all the
different character sets that are identical to ASCII in the
first 127 characters, but perhaps it is easier to simply ask
if there are any character sets that are *not* identical to
ASCII in the first 127 characters...

**Andreas Prilop** · Aug 25 '05, 04:15 PM

Re: Simple high-ascii character encoding

On Thu, 25 Aug 2005, it was written:

> but perhaps it is easier to simply ask
> if there are any character sets that are *not* identical to
> ASCII in the first 127 characters...
                       ^
(Characters 0 to 127 are the first 128 characters.)

All of these

Index of /Public/MAPPINGS/VENDORS/MICSFT/EBCDIC

http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/EBCDIC/

ISO 646 (Good old ASCII)

http://czyborra.com/charsets/iso646.html#EBCDIC
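A quick way to test any one candidate, sketched in Python. Here `cp037` is simply one EBCDIC code page that ships with Python's standard codecs, chosen for illustration, and `differing_positions` is an illustrative helper.

```python
from typing import List


def differing_positions(encoding: str) -> List[int]:
    """Code numbers 0..127 whose character differs from the ASCII assignment."""
    ascii_chars = bytes(range(128)).decode("ascii")
    other_chars = bytes(range(128)).decode(encoding)
    return [n for n in range(128) if ascii_chars[n] != other_chars[n]]


if __name__ == "__main__":
    print(differing_positions("iso-8859-1"))  # []: ASCII is its lower half
    print(len(differing_positions("cp037")))  # many positions differ
    print("A".encode("ascii"))                # b'A', i.e. code 65
    print("A".encode("cp037"))                # b'\xc1', i.e. code 193
```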

**Alan J. Flavell** · Aug 25 '05, 04:25 PM

Re: Simple high-ascii character encoding

On Thu, 25 Aug 2005, Harlan Messinger wrote:

> To be fair, *any* of the character sets of which ASCII is a subset
> can legitimately be called *an* "extension of ASCII".

It could - but it's not a particularly informative statement, as I
hope you'd agree.

> Latin-1 is an ASCII extension,

To be pedantic, "Latin-1" defines a repertoire of characters:
CP-1047 is the "EBCDIC Latin-1 character encoding". When you
said Latin-1, I suspect you really meant iso-8859-1, which indeed
has ASCII as its lower half.

> as is Unicode.

Indeed.

> But still, it makes no sense to speak of "extended ASCII
> characters".

Right!

> Second, a given character may appear in two different character sets
> but mapped to different codes. What's the "extended ASCII code" for
> an em dash? Well, under the standard Windows character set, an
> em-dash is character 151; if you're using Unicode, it's character
> 8212; and if you're using ISO-8859-1, it isn't anything at all
> because the em dash isn't part of that character set. In other
> words, again, it's meaningless to talk about a character's extended
> ASCII code.

Right!!

And even in MS-DOS land, which is where this unfortunate phrase
"extended ASCII" seems to have grown, there's a bushel of different
encodings: CP-437 for the USans, CP-850 for "multinational" use (which
contains approximately an MS-DOS encoding of the Latin-1 repertoire,
but organised completely differently than iso-8859-1), plus loads of
national-specific code pages too. I've got an MS-DOS version 6 manual
somewhere which lists page after page of the wretched things.
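To make that concrete, here is a small Python sketch using the `cp437` and `cp850` codecs from the standard library. The helper name and the choice of sample characters are only for illustration.

```python
from typing import Optional


def byte_for(ch: str, encoding: str) -> Optional[int]:
    """Code number of ch in a single-byte encoding, or None if it isn't there."""
    try:
        encoded = ch.encode(encoding)
    except UnicodeEncodeError:
        return None
    return encoded[0]


if __name__ == "__main__":
    for enc in ("iso-8859-1", "cp437", "cp850"):
        # Each row shows the code numbers for é and ã in one encoding.
        print(enc, byte_for("é", enc), byte_for("ã", enc))
    # é is 233 in iso-8859-1 but 130 in both DOS code pages;
    # ã has a code in cp850 and iso-8859-1, and none at all in cp437.
```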

Thank goodness we rarely have to go there these days (except where
some user has blundered and converted DOS to Windows where they ought
not, or failed to do so when they should've).

best

**Alan J. Flavell** · Aug 25 '05, 04:35 PM

Re: Simple high-ascii character encoding

On Thu, 25 Aug 2005, Chandy wrote:

> http://www.asciitable.com/

Bleagh.

On cursory inspection, this appears to be the US-National MS-DOS code
page, CP-437. Utterly useless in the modern world: it's absolute
nonsense for them to claim that it's the "most popular", as indeed is
their claim that "it took a while to get a single standard", since
there never *has* been a "single" standard of the kind that they are
talking about. Possibly in the distant future, when this babel of
8-bit character codes has been forgotten, Unicode *will* be that
"single standard". Possibly.

Ho hum

**Harlan Messinger** · Aug 25 '05, 04:55 PM

Re: Simple high-ascii character encoding

Guy Macon wrote:

> Harlan Messinger wrote:
>
>>To be fair, *any* of the character sets of which ASCII is a subset can
>>legitimately be called *an* "extension of ASCII". Latin-1 is an ASCII
>>extension, as is Unicode. But still, it makes no sense to speak of
>>"extended ASCII characters".
>
> I was about to ask if anyone had bothered to list all the
> different character sets that are identical to ASCII in the
> first 127 characters, but perhaps it is easier to simply ask
> if there are any character sets that are *not* identical to
> ASCII in the first 127 characters...

EBCDIC, for starters.

Then there are all the non-standard arrangements that font designers
used in the past to map alphabets and symbol sets other than the basic
English one to the sub-128 positions so that foreign text and special
symbols could be rendered before more sophisticated means became
available. For example, the various Symbols and Wingdings fonts.

**Harlan Messinger** · Aug 25 '05, 05:05 PM

Re: Simple high-ascii character encoding

Alan J. Flavell wrote:

> On Thu, 25 Aug 2005, Harlan Messinger wrote:
>
>>To be fair, *any* of the character sets of which ASCII is a subset
>>can legitimately be called *an* "extension of ASCII".
>
> It could - but it's not a particularly informative statement, as I
> hope you'd agree.

Yes. Still, it's been convenient that for purposes of composing in
English most people (pre-Unicode) haven't had to worry about whether
their editor supported a particular encoding because it hasn't mattered
with respect to the common ASCII subset.

>>Latin-1 is an ASCII extension,
>
> To be pedantic, "Latin-1" defines a repertoire of characters:
> CP-1047 is the "EBCDIC Latin-1 character encoding". When you
> said Latin-1, I suspect you really meant iso-8859-1, which indeed
> has ASCII as its lower half.

I did, and thanks for the adjustment. I'm trying really hard to stop
mixing up character sets and encodings. (By the way--is a "repertoire"
different from a "set"?)

**Alan J. Flavell** · Aug 25 '05, 06:05 PM

Re: Simple high-ascii character encoding

On Thu, 25 Aug 2005, Harlan Messinger wrote:

> > CP-1047 is the "EBCDIC Latin-1 character encoding". When you said
> > Latin-1, I suspect you really meant iso-8859-1, which indeed has
> > ASCII as its lower half.
>
> I did, and thanks for the adjustment. I'm trying really hard to stop
> mixing up character sets and encodings.

"character sets" versus "encodings" is yet another layer! - although
that's hardly noticeable with the old 8-bit codings, it gets quite
critical with encodings of Unicode.

> (By the way--is a "repertoire" different from a "set"?)

Well, the term "character set" is usually understood to define not
only a particular repertoire of characters, but also the assignment of
each character to a "small" integer number. This assignment is of
course different in EBCDIC-based codings from what it is in
ASCII-based codings, to take the obvious example.

As such, I'd tend to avoid the use of the term "set" to refer to a
character repertoire, if I'm trying to avoid implying a particular
ordering of the characters or their assignment to "small" integers.
The "repertoire" is the unordered selection of characters, without
reference to one or other "character sets" which might be defined
comprising that repertoire.
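The distinction can be modelled roughly in Python: a repertoire as a plain set of characters, a character set as a mapping from character to code number. This is only a sketch. The helper names are made up, and `cp037` stands in for an EBCDIC coding of the Latin-1 repertoire because, as far as I know, CP-1047 is not among Python's bundled codecs.

```python
from typing import Dict, FrozenSet


def coded_set(encoding: str) -> Dict[str, int]:
    """Character -> code number for a single-byte encoding (a "character set")."""
    table = {}
    for n in range(256):
        try:
            table[bytes([n]).decode(encoding)] = n
        except UnicodeDecodeError:
            pass  # this code number is unassigned in the encoding
    return table


def repertoire(encoding: str) -> FrozenSet[str]:
    """Just the characters, with no ordering or numbering implied."""
    return frozenset(coded_set(encoding))


if __name__ == "__main__":
    latin1 = coded_set("iso-8859-1")
    ebcdic = coded_set("cp037")
    print(latin1["A"], ebcdic["A"])  # 65 193: same character, different numbers
    # Compare the two repertoires, ignoring how they are numbered.
    print(repertoire("iso-8859-1") == repertoire("cp037"))
```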

hope that helps.

Btw, recall that after a certain point, the Latin-x repertoire is
encoded by the iso-8859-y character code, where x is no longer equal
to y. This is because some of the intervening codes weren't for Latin
at all, but for Greek, Arabic, Cyrillic, Hebrew etc. So, for example,
iso-8859-15 is the ISO encoding for Latin-9.
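Python's codec aliases happen to record this numbering, which makes for a quick check. A sketch only: the list of pairs below is a small selection, and `same_codec` is an illustrative helper.

```python
import codecs

# (Latin-x alias, iso-8859-y name) pairs; x and y part ways after Latin-4.
PAIRS = [
    ("latin1", "iso-8859-1"),
    ("latin2", "iso-8859-2"),
    ("latin5", "iso-8859-9"),
    ("latin9", "iso-8859-15"),
]


def same_codec(first: str, second: str) -> bool:
    """True if both names resolve to the same registered codec."""
    return codecs.lookup(first).name == codecs.lookup(second).name


if __name__ == "__main__":
    for latin, iso in PAIRS:
        print(latin, iso, same_codec(latin, iso))
    # One visible difference between Latin-9 and Latin-1: the euro sign.
    print("€".encode("iso-8859-15"))  # b'\xa4'
```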

**RobG** · Aug 25 '05, 10:55 PM

Re: Simple high-ascii character encoding

Harlan Messinger wrote:
[...]

> Then there are all the non-standard arrangements that font designers
> used in the past to map alphabets and symbol sets other than the basic
> English one to the sub-128 positions so that foreign text

While we're being pedantic about words, should the phrase 'foreign text'
be 'non-English text'? Or in the context of ASCII, are the two terms
identical?

[...]

--
Rob
