Character encodings and invalid characters

**Alan J. Flavell** · Jul 20 '05, 07:04 PM

Re: Character encodings and invalid characters

On Mon, 14 Jun 2004, Safalra wrote:

> Questions:
>
> 1) I presume Java will correctly identify UTF-16BE and UTF-16LE from
> the byte order markers. How does it identify other encodings?

[I can't answer that, but the use of a BOM is permissible in utf-8
although it's not required. Actually, if I may be pedantic for a
moment, utf-16BE and utf-16LE don't use a BOM - the endianness is
specified by the name of the encoding; utf-16 uses a BOM and by
looking at the BOM you work out for yourself whether it's LE or BE.]
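
To make the BOM point concrete, here is a minimal Java sketch that looks at the first few octets of a file. The class `BomSniffer` and its `detect` method are hypothetical names invented for this illustration, not a library API, and the sketch only recognises the UTF-8 and UTF-16 marks.

```java
import java.util.Optional;

/** Hypothetical helper: guesses a Unicode encoding from a leading byte order mark. */
final class BomSniffer {

    /** Returns the charset name implied by a BOM at the start of b, if there is one. */
    static Optional<String> detect(byte[] b) {
        if (b.length >= 3 && (b[0] & 0xFF) == 0xEF
                && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF) {
            return Optional.of("UTF-8");     // permitted but not required in utf-8
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) {
            return Optional.of("UTF-16BE");  // "utf-16" data that turns out to be big-endian
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) {
            // Limitation of this sketch: FF FE 00 00 would be a UTF-32LE mark instead.
            return Optional.of("UTF-16LE");  // "utf-16" data that turns out to be little-endian
        }
        return Optional.empty();             // no BOM: the encoding has to come from elsewhere
    }

    public static void main(String[] args) {
        byte[] sample = {(byte) 0xFE, (byte) 0xFF, 0x00, 0x41};
        System.out.println(detect(sample).orElse("no BOM"));
    }
}
```

When `detect` returns an empty result, the caller still needs some external statement of the encoding; the absence of a BOM proves nothing.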

Coming back to utf-8: unless it's entirely us-ascii in which case you
can't tell the difference, there are validity criteria, and the more
of it you get which meet the criteria, the more confident you can be
that it really is utf-8. Just one single violation of the criteria is
enough to rule that possibility out, and the Unicode rules *mandate*
refusing to process the document further, for security reasons.
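
One way to apply the "single violation rules it out" test in Java is to run a strict decoder over the octets. This is only a sketch: `Utf8Check` and `isValidUtf8` are names made up for the example, and a `true` result for pure us-ascii input says nothing either way, as noted above.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: strict check of the utf-8 validity criteria. */
final class Utf8Check {

    /** True if the octets decode as utf-8 with no malformed sequence anywhere. */
    static boolean isValidUtf8(byte[] octets) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)       // fail instead of substituting U+FFFD
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(octets));
            return true;
        } catch (CharacterCodingException e) {
            return false;                                             // one violation is enough
        }
    }

    public static void main(String[] args) {
        byte[] good = {(byte) 0xC3, (byte) 0xA9};  // U+00E9 encoded as utf-8
        byte[] bad = {(byte) 0xE9};                // the same letter as a single iso-8859-1 octet
        System.out.println(isValidUtf8(good) + " " + isValidUtf8(bad));  // prints "true false"
    }
}
```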

> Will it just assume the system default encoding until it finds bytes
> that imply UTF-8? The program will mainly deal with UTF-16, UTF-8,
> ISO-8859-1 and US-ASCII, but others may occur.

Right, but define "others". Are you going to deal with any character
encodings which define characters that don't exist in Unicode - e.g.
Klingon?

You certainly aren't going to be able to guess 8-bit character
encodings just by looking at them - you absolutely do, in general,
need some external source of wisdom on what character coding you are
dealing with. *Some* character encodings can be guessed, at least on
plausibility grounds.

> 2) I'm slightly confused by the HTML specification - are the valid
> characters precisely those that are defined in Unicode?

With the greatest of respect, you seem to be putting the cart before
the horse. First you say you intend to remove invalid characters, and
then it becomes clear that you're not sure how to define what they
are. :-}

I'm assuming that there's some substantive issue behind your problem,
but I'm afraid you're not expressing it in terms that I can be
confident that I understand what you're trying to achieve. Recall
that there are in general three ways of representing characters in
HTML:

1. coded characters in the appropriate character encoding
2. numerical character references &#number; or &#xhexnum;
3. character entity references &name; for those characters which have
them.

Can you address what you propose to do with each of these when you
find them?

> (I'm ignoring at this point characters that in HTML need escaping.)

Hmmm? Are you referring to the use of &-notations here, or something
else?

> 3) If it fails on esoteric character encodings, how badly is it likely
> to fail? Will it totally trash the HTML?

Best answer I can give to that is that the HTML markup itself uses
nothing more than plain us-ascii repertoire. If you can't recognise
at least that repertoire in the original encoding, then you're going
to do worse than trash only the HTML, no?

good luck

**Roedy Green** · Jul 20 '05, 07:04 PM

Re: Character encodings and invalid characters

On 14 Jun 2004 09:48:55 -0700, usenet@safalra.com (Safalra) wrote or
quoted :

>1) I presume Java will correctly identify UTF-16BE and UTF-16LE from
>the byte order markers. How does it identify other encodings?

You have to ask the user. You can find out the default encoding on his
machine, but that's as good as it gets. People never thought to mark
documents with the encoding or record it in a resource fork.

You can take the same document and interpret it many ways. It would
require almost AI to figure out which was the most likely encoding.

You could do it by comparing letter frequencies to averages of
samples.

--
Canadian Mind Products, Roedy Green.
Coaching, problem solving, economical contract programming.
See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.

**Safalra** · Jul 20 '05, 07:05 PM

Re: Character encodings and invalid characters

"Alan J. Flavell" <flavell@ph.gla.ac.uk> wrote in message news:<Pine.LNX.4.53.0406141757110.8374@ppepc56.ph.gla.ac.uk>...
> On Mon, 14 Jun 2004, Safalra wrote:
> > 2) I'm slightly confused by the HTML specification - are the valid
> > characters precisely those that are defined in Unicode?
>
> With the greatest of respect, you seem to be putting the cart before
> the horse. First you say you intend to remove invalid characters, and
> then it becomes clear that you're not sure how to define what they
> are. :-}
>
> I'm assuming that there's some substantive issue behind your problem,
> but I'm afraid you're not expressing it in terms that I can be
> confident that I understand what you're trying to achieve.

Okay, I guess I should have given more detail:

I wrote my dissertation on the subject of automated neatening of HTML.
As part of this I wrote a Java program to demonstrate what could be
done. It removed or replaced invalid characters, attributes and
elements, turned presentation elements and attributes to CSS, and
replaced many tables used for layout purposes (and some framesets)
with divs and CSS. It worked surprisingly well, but I only had to test
it on ISO-8859-1 documents. I worked out the invalid characters just
by feeding them into the W3C Validator, and for the ones that were
invalid but rendered under Windows (like smartquotes) I replaced those
with valid equivalents.

Once I've worked the program into a more presentable state, I'd like
to release it (GPL'd, of course). The problem is, I've got no idea
what would happen if, say, a Japanese person runs it on some Japanese
HTML source on their harddisk - I've never used a foreign character
encoding, so I don't even know how their text editors figure out the
encoding. I was wondering if Java assumes it's the system default
(unless it encounters unicode), and hence the program would still
work. (I assume that people would usually use the same character
encoding for their system and their HTML?)

> Recall
> that there are in general three ways of representing characters in
> HTML:
>
> 1. coded characters in the appropriate character encoding
> 2. numerical character references &#number; or &#xhexnum;
> 3. character entity references &name; for those characters which have
> them.
>
> Can you address what you propose to do with each of these when you
> find them?

1. That's the one I'm asking about. :)

Assuming I can get around character encoding problems:

2. If I understand the specification correctly, these refer to UCS
code positions, so I just need to check whether the position is defined
in Unicode.
3. I just need to check whether these are defined in the
specification.

If occurrences of (2) and (3) are valid, they'll just be output by
the program in the same form.

> > (I'm ignoring at this point characters that in HTML need escaping.)
>
> Hmmm? Are you referring to the use of &-notations here,

Yes, but now we've discussed them above...

--
Safalra (Stephen Morley)

http://www.safalra.com/

**Michael Borgwardt** · Jul 20 '05, 07:05 PM

Re: Character encodings and invalid characters

Safalra wrote:
> to release it (GPL'd, of course). The problem is, I've got no idea
> what would happen if, say, a Japanese person runs it on some Japanese
> HTML source on their harddisk - I've never used a foreign character
> encoding, so I don't even know how their text editors figure out the
> encoding.

They assume it by convention, usually. This can (and does) go wrong.

> I was wondering if Java assumes it's the system default
> (unless it encounters unicode)

Java *always* assumes text is the system default encoding unless given an
explicit encoding. Unicode does not play into it.
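
A short Java sketch of the difference, under the assumption that the caller already knows the encoding from somewhere. `ExplicitCharset` and `firstLine` are hypothetical names for this example; the same octets are decoded twice to show that the choice of charset, not the data, decides the result.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: decode input with a charset named by the caller. */
final class ExplicitCharset {

    /** Reads the first line of the data, decoded with the given charset. */
    static String firstLine(byte[] data, Charset cs) {
        // Passing cs explicitly keeps the platform default out of the picture.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(data), cs))) {
            return in.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = {(byte) 0xC3, (byte) 0xA9, '\n'};
        // What a Reader built without a charset argument would use
        // (UTF-8 on JDK 18 and later unless overridden, locale-dependent before that).
        System.out.println("default: " + Charset.defaultCharset());
        System.out.println(firstLine(data, StandardCharsets.UTF_8).length());       // one character
        System.out.println(firstLine(data, StandardCharsets.ISO_8859_1).length());  // two characters
    }
}
```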

Also, do remember that in theory, all HTML documents should declare
their encoding explicitly, or have it supplied by the server in
the header. In XHTML, the explicit declaration is in fact mandatory.

But overall, text encoding is a horribly complex, muddled mess of
legacy conventions, incompatibilities, hacks and workarounds. Most
of the time, it breaks down horribly as soon as you cross a language
barrier.

**Alan J. Flavell** · Jul 20 '05, 07:05 PM

Re: Character encodings and invalid characters

On Tue, 15 Jun 2004, Safalra wrote:

> I wrote my dissertation on the subject of automated neatening of HTML.
[...]
> with divs and CSS. It worked surprisingly well, but I only had to test
> it on ISO-8859-1 documents. I worked out the invalid characters just
> by feeding them into the W3C Validator,

I think I'm going to have to stand firm, and say that you really need
to make the effort and cross the threshold of understanding the HTML
character model in order to grasp what's behind this, otherwise you'd
risk blundering on in a heuristic fashion without a robust mental
picture of what's involved.

This note makes no attempt to be a full tutorial on that, but just
races through some key headings to see whether you can be persuaded to
read the background and get up to speed.

All of the characters from 0 to 31 decimal, and all of the characters
from 127(sic) to 159 decimal, in the Document Character Set, are
defined to be control characters, and almost all of them are excluded
from use in HTML. These are the characters which are declared to be
"invalid" by the specification (and by the validator).
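
The range test described above could be sketched in Java roughly as follows. The names `HtmlCharCheck`, `isExcludedControl` and `firstExcluded` are invented for the example, and the choice of tab, line feed and carriage return as the permitted exceptions is an assumption based on the usual reading of the HTML 4 SGML declaration; check it against the spec before relying on it. The test applies to characters of the Document Character Set, that is, after the external encoding has been decoded.

```java
/** Hypothetical helper: flags control characters that HTML 4 excludes. */
final class HtmlCharCheck {

    /** True for the control characters 0-31 and 127-159, minus the assumed exceptions. */
    static boolean isExcludedControl(int cp) {
        if (cp == 0x09 || cp == 0x0A || cp == 0x0D) {
            return false;                           // tab, LF, CR: assumed to be allowed
        }
        return cp <= 31 || (cp >= 127 && cp <= 159);
    }

    /** Returns the index of the first excluded character in the decoded text, or -1. */
    static int firstExcluded(String text) {
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (isExcludedControl(cp)) {
                return i;
            }
            i += Character.charCount(cp);           // step over surrogate pairs correctly
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(firstExcluded("a\tb\n"));     // prints -1
        System.out.println(firstExcluded("ab\u0097c"));  // prints 2
    }
}
```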

What's the "Document Character Set"? Well, in HTML2 it was
iso-8859-1, and in HTML4 it was defined to be iso-10646 as amended.
Loosely, you can read "iso-10646 as amended" as being the character
model of Unicode. As far as the values from 0 to 255 are concerned,
iso-8859-1 and iso-10646 are identical.

How is this related to the external character encoding? Well, the
character model that was introduced in RFC2070 and embodied in HTML4
is based on the concept that the external encoding is converted into
iso-10646/unicode prior to any other processing being done. It
doesn't require implementations to work in that way internally, but it
_does_ mandate that they give that impression externally (black box
model).

So from HTML's point of view, if you have a document which is coded in
say Windows-1252, including those pretty quotes, then (as long as the
recipient consents - see the HTTP Accept-charset) it's perfectly
legal. All you need to do is apply the appropriate code mapping that
you find at the Unicode site, and get the resulting Unicode character.

Resources at http://www.unicode.org/Public/MAPPINGS/ , in this case

http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT
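
In Java the mapping table does not have to be applied by hand, since the runtime normally ships a decoder for it. The sketch below assumes the JDK in use provides a charset named "windows-1252" (common, but not one of the charsets every Java platform is required to support); `Cp1252Demo` and `decode` are hypothetical names.

```java
import java.nio.charset.Charset;

/** Hypothetical demo: the same octets under two different encoding labels. */
final class Cp1252Demo {

    /** Decodes the octets with the named charset. */
    static String decode(byte[] octets, String charsetName) {
        return new String(octets, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        byte[] quoted = {(byte) 0x93, 'H', 'i', (byte) 0x94};  // "smart quotes" around Hi

        // Labelled correctly, octet 0x93 maps to a printable character.
        String right = decode(quoted, "windows-1252");
        System.out.println(Integer.toHexString(right.charAt(0)));  // prints 201c

        // Labelled as iso-8859-1, the same octet becomes a C1 control character.
        String wrong = decode(quoted, "ISO-8859-1");
        System.out.println(Integer.toHexString(wrong.charAt(0)));  // prints 93
    }
}
```

The octets are identical in both cases; only the declared encoding decides whether the result is a legal quotation mark or an excluded control character.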


> and for the ones that were invalid but rendered under Windows (like
> smartquotes) I replaced those with valid equivalents.

What you're talking about here is probably a document which in reality
is coded in Windows-1252 but erroneously claims to be - or is
mistakenly presumed to be - iso-8859-1 (or its equivalent in other
locales).

There's nothing inherently wrong with these particular octet values
(128-159 decimal) *in those codings which assign them to printable
characters* (that's not only all of the Windows-125x codings, but also
koi-8r and some other less-usual codings).

What's wrong is when those octet values occur in codings which define
them to be control characters which are not used in HTML.

> Once I've worked the program into a more presentable state, I'd like
> to release it (GPL'd, of course). The problem is, I've got no idea
> what would happen if, say, a Japanese person runs it on some Japanese
> HTML source on their harddisk - I've never used a foreign character
> encoding, so I don't even know how their text editors figure out the
> encoding.

Sadly, quite a number of language locales simply *assume* that their
local coding applies. Try looking at such a file on a system that's
set for a different locale, and you'll get rubbish. Although it's
sometimes possible to guess (look at the automatic charset selection
in, say, Mozilla for examples of what can be done heuristically).

OK, I've done the HTML part of this. I'm not a regular Java user so
I'm leaving that to others.

> > Recall
> > that there are in general three ways of representing characters in
> > HTML:
> >
> > 1. coded characters in the appropriate character encoding
> > 2. numerical character references &#number; or &#xhexnum;
> > 3. character entity references &name; for those characters which have
> > them.
> >
> > Can you address what you propose to do with each of these when you
> > find them?
>
> 1. That's the one I'm asking about. :)

Thanks - I did want to be sure about that first.

[Don't make the mistake of confusing an 8-bit character of value 151
decimal (in some specified 8-bit encoding), on the one hand, with the
undefined(HTML)/illegal(XML) notation &#151; on the other hand.]

> 2. If I understand the specification correctly, these refer to UCS
> code positions,

basically yes, modulo some possible nit picking about high/low
surrogates and stuff, that I don't want to go into here.

> so I just need to check whether the position is defined
> in Unicode.

Er, not quite. Those control characters are certainly *defined*, but
they are excluded from use in HTML by the "SGML declaration for HTML",
and from XHTML by the rules of XML.

And on the other hand I don't think an as-yet-unassigned Unicode code
point is actually invalid for use in (X)HTML. Try it and see what the
validator says?
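
A rough Java sketch of how a numeric character reference might be classified along these lines. `NumericRefs`, `codePointOf` and `isUsable` are hypothetical names, the tab/LF/CR exceptions are an assumption drawn from the HTML 4 SGML declaration, and unassigned code points are deliberately not rejected.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: inspects &#number; and &#xhexnum; references. */
final class NumericRefs {

    private static final Pattern NCR =
            Pattern.compile("&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));");

    /** Returns the code position named by the reference, or -1 if it is not a numeric reference. */
    static int codePointOf(String ref) {
        Matcher m = NCR.matcher(ref);
        if (!m.matches()) {
            return -1;
        }
        try {
            return m.group(1) != null
                    ? Integer.parseInt(m.group(1), 16)
                    : Integer.parseInt(m.group(2));
        } catch (NumberFormatException e) {
            return -1;                                   // far too many digits to be a code position
        }
    }

    /** False for excluded controls, surrogates and out-of-range values; "defined" is not the test. */
    static boolean isUsable(int cp) {
        if (cp == 0x09 || cp == 0x0A || cp == 0x0D) {
            return true;                                 // assumed exceptions among the controls
        }
        if (cp < 0x20 || (cp >= 0x7F && cp <= 0x9F)) {
            return false;                                // control characters (also catches -1)
        }
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            return false;                                // surrogate code points
        }
        return cp <= 0x10FFFF;
    }

    public static void main(String[] args) {
        System.out.println(isUsable(codePointOf("&#151;")));    // prints false
        System.out.println(isUsable(codePointOf("&#x2014;")));  // prints true
    }
}
```

So `&#151;` is rejected even though octet 151 in Windows-1252 is a perfectly good dash: the reference names a code position in the Document Character Set, not an octet in the external encoding.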

hope this helps a bit. The writeup of the HTML character model in the
relevant part of the HTML4 spec and/or RFC2070 is not bad, I'd suggest
giving it a try. There's also some material at
http://ppewww.ph.gla.ac.uk/~flavell/charset/ which some folks have
found helpful.

**Roedy Green** · Jul 20 '05, 07:05 PM

Re: Character encodings and invalid characters

On Mon, 14 Jun 2004 20:38:09 GMT, Roedy Green
<look-on@mindprod.com.invalid> wrote or quoted :

>You have to ask the user. You can find out the default encoding on his
>machine, but that's as good as it gets. People never thought to mark
>documents with the encoding or record it in a resource fork.

for more info see

http://mindprod.com/jgloss/encoding.html#IDENTIFICATION

I am working up a student project to solve this problem.

--
Canadian Mind Products, Roedy Green.
Coaching, problem solving, economical contract programming.
See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.

**Roedy Green** · Jul 20 '05, 07:05 PM

Re: Character encodings and invalid characters

On Tue, 15 Jun 2004 21:59:54 GMT, Roedy Green
<look-on@mindprod.com.invalid> wrote or quoted :

>>You have to ask the user. You can find out the default encoding on his
>>machine, but that's as good as it gets. People never thought to mark
>>documents with the encoding or record it in a resource fork.
>
>for more info see
>http://mindprod.com/jgloss/encoding.html#IDENTIFICATION
>
>I am working up a student project to solve this problem.

see http://mindprod.com/projects/encodin...ification.html

--
Canadian Mind Products, Roedy Green.
Coaching, problem solving, economical contract programming.
See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.

**Safalra** · Jul 20 '05, 07:07 PM

Re: Character encodings and invalid characters

[newsgroups trimmed - this no longer relates to Java]

"Alan J. Flavell" <flavell@ph.gla.ac.uk> wrote in message news:<Pine.LNX.4.53.0406151306120.10311@ppepc56.ph.gla.ac.uk>...
> And on the other hand I don't think an as-yet-unassigned Unicode code
> point is actually invalid for use in (X)HTML. Try it and see what the
> validator says?

They're valid.

Incidentally, I've found another strange 'feature' of Internet
Explorer. (The Cambridge Linux service was down, so I was forced into
using Windows.) When IE uploaded the UTF-16 file to the validator, it
strangely sent it as application/octet-stream rather than text/html,
which it does for ISO-8859-1.

--
Safalra (Stephen Morley)

http://www.safalra.com/

**Jukka K. Korpela** · Jul 20 '05, 07:07 PM

Re: Character encodings and invalid characters

usenet@safalra.com (Safalra) wrote:

> When IE uploaded the UTF-16 file to the validator, it
> strangely sent it as application/octet-stream rather than text/html,
> which it does for ISO-8859-1.

IE treats file upload weirdly.

Recently there was some discussion in the www-validator list about a
problem that seems to have resulted from IE's odd way of sending,
in file upload, an XHTML document as text/xml with no charset parameter
when the document lacks the <?xml ...> prologue. For details see

http://lists.w3.org/Archives/Public/www-validator/2004Jun/0130.html

In general, when you upload a file using IE, then you can expect the data
itself to be sent properly but should assume that everything else is
wrong until proven correct. In fact, maybe it's not _only_ IE's fault. A
browser is expected to include a Content-Type header (which in turn may
allow or even require a charset parameter, depending on the type).
How is it expected to perform this, for files in general? It's guesswork
at best _until_ someone creates a file system that contains media type
information (in MIME terms) in its control data. (Just dreaming aloud.)

--
Yucca, http://www.cs.tut.fi/~jkorpela/
Pages about Web authoring: http://www.cs.tut.fi/~jkorpela/www.html
