Judge the encode systm used by the file.

**Richard Tobin** · Oct 30 '08, 03:55 PM

Re: Judge the encode systm used by the file.

In article <K9K4Kz.BBC@cwi.nl>, Dik T. Winter <Dik.Winter@cwi.nl> wrote:

>5×½.
>Just to keep it for the future, this article is one such file ;-).

A good example. Though it turns out that the UTF-8 interpretation
does not correspond to an existing character, being U+05FD which
is an unused code point in the Hebrew range.
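
To see where U+05FD comes from: in ISO-8859-1 the characters × and ½ are the bytes D7 and BD, and a UTF-8 reader takes D7 as the lead byte of a two-byte sequence with BD as its continuation. A minimal C sketch of that arithmetic follows; it is an editorial illustration, and the function name `decode_utf8_2` is made up here rather than taken from the thread.

```c
#include <assert.h>

/* Decode one two-byte UTF-8 sequence.
 * Valid lead bytes are 0xC2..0xDF (0xC0 and 0xC1 would be overlong);
 * the continuation byte must look like 10xxxxxx.
 * Returns the code point, or -1 if the pair is not a valid sequence. */
long decode_utf8_2(unsigned char lead, unsigned char cont)
{
    long cp;

    if (lead < 0xC2 || lead > 0xDF)
        return -1;
    if ((cont & 0xC0) != 0x80)
        return -1;

    /* 5 payload bits from the lead byte, 6 from the continuation byte */
    cp = ((long)(lead & 0x1F) << 6) | (long)(cont & 0x3F);
    assert(cp >= 0x80 && cp <= 0x7FF);   /* range of two-byte sequences */
    return cp;
}
```

Calling `decode_utf8_2(0xD7, 0xBD)` gives 0x05FD, the unassigned Hebrew-range code point mentioned above.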

-- Richard
--
Please remember to mention me / in tapes you leave behind.

**George** · Oct 31 '08, 02:15 AM

Re: Judge the encode systm used by the file.

On Wed, 29 Oct 2008 09:52:51 GMT, James Kuyper wrote:

>George wrote:
>
>>On Wed, 29 Oct 2008 20:44:09 +1300, Ian Collins wrote:
>>
>>>Hongyi Zhao wrote:
>>>
>>>...
>>>
>>>>I want to judge the file's encoding system correctly, i.e., belong to
>>>>utf-8, ansi, gbk, gb2312, gb18030, or iso-8859-a, and so on.
>>>
>>>...
>>
>>What's the gb* stuff that he refers to?
>
><http://en.wikipedia.org/wiki/GB18030>

I thought there were two things that did not look like the others, the
wonderful Sesame Street game. I've come to appreciate better children's
programming now that I'm basically the babysitter.

One was 'ansi'; the other was gb*. I thought it had something to do with
Great Britain.

Now that I realize that gb* would ordinarily be something I would guess a
Chinese professor would know better than I, that leaves 'ansi'.

As a mainland chinese addressing an American, professor Zhao might think
that I would know things about this encoding, but I don't.

Richard Heathfield's your man here. These encodings are usually
unter-syntactic. He posted a link recently which I bookmarked and left my
premises with an angry ex girlfriend.

How to call from C and differentiate one from the other is best done
pairwise. You have two encodings and call both to compare. You use a
well-known sentence, and encode it in two differing schemes:

"Now is the time for all good chinese to come to the aid of fair elections
in the US."

Call, compare results.
--
George

Freedom itself was attacked this morning by a faceless coward, and freedom
will be defended.
George W. Bush

**James Kuyper** · Oct 31 '08, 10:15 AM

Re: Judge the encode systm used by the file.

George wrote:

>On Wed, 29 Oct 2008 09:52:51 GMT, James Kuyper wrote:
>
>>George wrote:
>>
>>>On Wed, 29 Oct 2008 20:44:09 +1300, Ian Collins wrote:
>>>
>>>>Hongyi Zhao wrote:
>>>>
>>>>...
>>>>
>>>>>I want to judge the file's encoding system correctly, i.e., belong to
>>>>>utf-8, ansi, gbk, gb2312, gb18030, or iso-8859-a, and so on.
>
>...
>
>I thought there were two things that did not look like the others, the
>wonderful Sesame Street game. I've come to appreciate better children's
>programming now that I'm basically the babysitter.
>
>One was 'ansi'; the other was gb*. I thought it had something to do with
>Great Britain.
>
>Now that I realize that gb* would ordinarily be something I would guess a
>Chinese professor would know better than I, that leaves 'ansi'.

ASCII was developed by the American Standards Association, which
eventually became the American National Standards Institute, or ANSI. I
can't be sure, but I suspect that Hongyi Zhao is using "ansi" as a
synonym for ASCII.

**Richard Bos** · Oct 31 '08, 11:05 AM

Re: Judge the encode systm used by the file.

richard@cogsci.ed.ac.uk (Richard Tobin) wrote:

>Richard Bos <rlb@hoekstra-uitgeverij.nl> wrote:
>
>>Possibly, but are you willing to rely on this, given the thousands of
>>languages out there, most of them, _unlike_ English, written in a Latin
>>script which uses diacritics to a greater or smaller degree?
>
>Yes. It's very unlikely that all the sequences of 8859 characters used
>in such a document will be legal UTF-8.
>
>The heuristic is: if the file contains bytes >= 128, and it would be
>legal UTF-8, then it's very likely that it *is* UTF-8. As I said,
>I would be interested if you can come up with any real document for
>which this heuristic fails.

*Shrug* You speak English, and you're willing to take that risk. I speak
a language which _does_ use diacritics, and I'm not.

Richard

**Richard Tobin** · Oct 31 '08, 11:35 AM

Re: Judge the encode systm used by the file.

In article <kPAOk.1702$225.265@nwrddc02.gnilink.net>,
James Kuyper <jameskuyper@verizon.net> wrote:

>ASCII was developed by the American Standards Association, which
>eventually became the American National Standards Institute, or ANSI. I
>can't be sure, but I suspect that Hongyi Zhao is using "ansi" as a
>synonym for ASCII.

Rather bizarrely, the term "ansi" is often used to refer to the Microsoft
encoding "windows-1252", which is ISO-8859-1 with a completely random
bunch of characters replacing the C1 controls.

[I suspect the reason for a Microsoft encoding being called "ansi" is
similar to that for Edinburgh having streets called "London Rd" and
London having streets called "Edinburgh Rd". That is, if you start
from Microsoft it's in the direction of ANSI.]

-- Richard
--
Please remember to mention me / in tapes you leave behind.

**Richard Tobin** · Oct 31 '08, 11:45 AM

Re: Judge the encode systm used by the file.

In article <490ae4b8.604086120@news.xs4all.nl>,
Richard Bos <rlb@hoekstra-uitgeverij.nl> wrote:

>>The heuristic is: if the file contains bytes >= 128, and it would be
>>legal UTF-8, then it's very likely that it *is* UTF-8. As I said,
>>I would be interested if you can come up with any real document for
>>which this heuristic fails.
>
>*Shrug* You speak English, and you're willing to take that risk. I speak
>a language which _does_ use diacritics, and I'm not.

As Dik Winter's (constructed) example indicates, the chance of error
is probably higher for English documents than for ones with a lot
of diacritics. The more non-ASCII characters you have, the lower
the chance of them accidentally being legal UTF-8.

-- Richard
--
Please remember to mention me / in tapes you leave behind.

**James Kuyper** · Oct 31 '08, 01:05 PM

Re: Judge the encode systm used by the file.

Richard Tobin wrote:

>In article <kPAOk.1702$225.265@nwrddc02.gnilink.net>,
>James Kuyper <jameskuyper@verizon.net> wrote:
>
>>ASCII was developed by the American Standards Association, which
>>eventually became the American National Standards Institute, or ANSI. I
>>can't be sure, but I suspect that Hongyi Zhao is using "ansi" as a
>>synonym for ASCII.
>
>Rather bizarrely, the term "ansi" is often used to refer to the Microsoft
>encoding "windows-1252", which is ISO-8859-1 with a completely random
>bunch of characters replacing the C1 controls.

That's because those code pages were submitted to ANSI for
standardization. ANSI turned them down, but Microsoft continued to refer
to them as "ANSI" pages.

**Ben Bacarisse** · Oct 31 '08, 03:45 PM

Re: Judge the encode systm used by the file.

richard@cogsci.ed.ac.uk (Richard Tobin) writes:

>In article <490ae4b8.604086120@news.xs4all.nl>,
>Richard Bos <rlb@hoekstra-uitgeverij.nl> wrote:
>
>>>The heuristic is: if the file contains bytes >= 128, and it would be
>>>legal UTF-8, then it's very likely that it *is* UTF-8. As I said,
>>>I would be interested if you can come up with any real document for
>>>which this heuristic fails.
>>
>>*Shrug* You speak English, and you're willing to take that risk. I speak
>>a language which _does_ use diacritics, and I'm not.
>
>As Dik Winter's (constructed) example indicates, the chance of error
>is probably higher for English documents than for ones with a lot
>of diacritics. The more non-ASCII characters you have, the lower
>the chance of them accidentally being legal UTF-8.

It is not that hard to work out what is permitted and what is not.
For a file that uses an 8-bit single-byte encoding to look like valid
UTF-8 it must consist of sequences made up of the following patterns:

[01234567]x
[CD]x [89AB]x
Ex [89AB]x [89AB]x
F[01234567] [89AB]x [89AB]x [89AB]x

(this is a sort of made-up hex pattern notation).

For example, if any of the 8 characters F0 to F7 appears, it must be
followed by exactly three characters in the range 80 to BF. Any of
the 32 characters C0 to DF must be followed by exactly one such
character. These "follow-on" characters come to our aid, since half
of them are very rarely used control characters and the others are all
less than common (they are not letters for example).
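
The heuristic under discussion (some byte >= 128, and the whole file is legal UTF-8) can be sketched in C roughly as follows. This is an editorial illustration under stated assumptions, not code from any poster: the names `utf8_seq_len` and `looks_like_utf8` are made up, and the check is somewhat stricter than the hex patterns above, since it also rejects the lead bytes C0, C1 and F5 to F7, overlong forms and surrogates, as the current UTF-8 definition requires.

```c
#include <assert.h>
#include <stddef.h>

/* Length (1..4) of the valid UTF-8 sequence starting at p, or 0 if the
 * bytes at p do not start a valid sequence. avail must be at least 1. */
static size_t utf8_seq_len(const unsigned char *p, size_t avail)
{
    unsigned char b = p[0];
    unsigned char lo = 0x80, hi = 0xBF; /* allowed range for first follow-on byte */
    size_t need, i;

    if (b < 0x80) {
        return 1;                       /* plain ASCII */
    } else if (b >= 0xC2 && b <= 0xDF) {
        need = 1;
    } else if (b >= 0xE0 && b <= 0xEF) {
        need = 2;
        if (b == 0xE0) lo = 0xA0;       /* exclude overlong forms */
        if (b == 0xED) hi = 0x9F;       /* exclude surrogates */
    } else if (b >= 0xF0 && b <= 0xF4) {
        need = 3;
        if (b == 0xF0) lo = 0x90;       /* exclude overlong forms */
        if (b == 0xF4) hi = 0x8F;       /* nothing above U+10FFFF */
    } else {
        return 0;                       /* stray 80..BF, C0, C1, F5..FF */
    }

    if (avail < need + 1)
        return 0;                       /* sequence truncated by end of data */
    if (p[1] < lo || p[1] > hi)
        return 0;
    for (i = 2; i <= need; i++)
        if ((p[i] & 0xC0) != 0x80)
            return 0;
    return need + 1;
}

/* Returns 1 if buf contains at least one byte >= 128 and is entirely
 * legal UTF-8; returns 0 otherwise (including for pure ASCII data). */
int looks_like_utf8(const unsigned char *buf, size_t len)
{
    size_t i = 0;
    int saw_high = 0;

    assert(buf != NULL || len == 0);
    while (i < len) {
        size_t n = utf8_seq_len(buf + i, len - i);
        if (n == 0)
            return 0;
        if (n > 1)
            saw_high = 1;
        i += n;
    }
    return saw_high;
}
```

Fed the ISO-8859-1 bytes of Dik Winter's "5×½." (35 D7 BD 2E) this returns 1, which is exactly the false positive he constructed; an ISO-8859-1 "café" (ending in a lone E9) returns 0.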

Taking ISO-8859-1 as an example, the document can't include (anywhere)
thorn, small o with a slash, small u with any accent, or small y with
either an acute or a diaeresis (the F8 to FF range). In addition it
can't have any accented letter followed by either another one or by
any "plain" character whatsoever. Every small accented a, e or i (the
Ex range) must be followed by exactly two of the rather odd bunch like
pilcrow, micro, plus/minus etc. None of the "matching pairs" like « and
», ¿ and ? can appear in a normal position (preceded by a space,
newline or tab for example). The best real-world use case I can see is
a word that has one and only one final accented character followed by
something like the registered symbol, the copyright symbol or maybe a
superscript number.

Other encodings (like the multi-byte Chinese ones) might well have
patterns of use that do fit the requirements of the UTF-8 scheme, but
it is not likely to be common for the 8859 family.

--
Ben.

**Dik T. Winter** · Oct 31 '08, 04:15 PM

Re: Judge the encode systm used by the file.

In article <87zlkkbwnu.fsf@bsb.me.uk> Ben Bacarisse <ben.usenet@bsb.me.uk> writes:
...

>For example, if any of the 8 characters F0 to F7 appears, it must be
>followed by exactly three characters in the range 80 to BF. Any of
>the 32 characters C0 to DF must be followed by exactly one such
>character. These "follow-on" characters come to our aid, since half
>of them are very rarely used control characters and the others are all
>less than common (they are not letters for example).

In other 8859's than 8859-1 there *are* letters in that range. For instance,
in 8859-2 of the 32 symbols in the range A0 to BF, 20 are letters, in 8859-3
there are 14 letters, in 8859-4 21 letters, and in 8859-5 all but two are
letters. Especially in 8859-5 the common letters are encoded in the range
B0-EF with letters used in specific languages in A0-AF and F0-FF. Do not
base your knowledge on 8859-1.
--
dik t. winter, cwi, kruislaan 413, 1098 sj amsterdam, nederland, +31205924131
home: bovenover 215, 1025 jn amsterdam, nederland; http://www.cwi.nl/~dik/

**Stephen Sprunk** · Oct 31 '08, 06:05 PM

Re: Judge the encode systm used by the file.

Hongyi Zhao wrote:

>I want to judge the file's encoding system correctly, i.e., belong to
>utf-8, ansi, gbk, gb2312, gb18030, or iso-8859-a, and so on.

It is trivial to detect a BOM at the beginning of a UTF-8, UTF-16LE, or
UTF-16BE file. If the file is in another encoding, or does not start
with a BOM, there is no reliable way to tell what encoding is used
because many files will be equally valid (to a computer, at least) using
several different encodings. You may be able to eliminate some of the
multi-byte encodings by looking for "invalid" sequences, but you can't
eliminate most of the single-byte ones.
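
The BOM check really is only a few lines of C. The following is a minimal sketch rather than a complete detector: the function name `bom_encoding` is invented for illustration, and it assumes the caller has already read the first few bytes of the file into a buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Inspect the first bytes of a file for a byte order mark.
 * Returns "UTF-8", "UTF-16LE" or "UTF-16BE" if the matching BOM is
 * present, or NULL if there is none (which says nothing about the
 * actual encoding).
 * Caveat: FF FE 00 00 is also the UTF-32LE BOM; this sketch does not
 * try to tell that case apart from UTF-16LE. */
const char *bom_encoding(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);

    if (len >= 3 && memcmp(buf, "\xEF\xBB\xBF", 3) == 0)
        return "UTF-8";
    if (len >= 2 && memcmp(buf, "\xFF\xFE", 2) == 0)
        return "UTF-16LE";
    if (len >= 2 && memcmp(buf, "\xFE\xFF", 2) == 0)
        return "UTF-16BE";
    return NULL;
}
```

A NULL result only means "no BOM found"; everything after that point is guesswork of the kind described in the rest of this post.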

Web browsers provide a perfect example of how difficult the problem is.
If a page is not explicitly marked with an encoding, they will either
use the user's default, which is often wrong, or use heuristics to
guess, which is also often wrong. I run into dozens of pages _per day_
that my browser can't correctly guess the encoding of.

(Browsers' heuristics often use character frequency, after markup is
removed, to determine the language and/or encoding in use. However,
short or unusual documents will often lead to an incorrect result.)

>Who can give me some hints on the Fortran implementation of this
>issue?

If you want help with Fortran, ask in a Fortran newsgroup; in
comp.lang.c, we discuss the C language.

S

**Ben Bacarisse** · Nov 1 '08, 12:05 AM

Re: Judge the encode systm used by the file.

"Dik T. Winter" <Dik.Winter@cwi.nl> writes:

>In article <87zlkkbwnu.fsf@bsb.me.uk> Ben Bacarisse <ben.usenet@bsb.me.uk> writes:
>...
>
>>For example, if any of the 8 characters F0 to F7 appears, it must be
>>followed by exactly three characters in the range 80 to BF. Any of
>>the 32 characters C0 to DF must be followed by exactly one such
>>character. These "follow-on" characters come to our aid, since half
>>of them are very rarely used control characters and the others are all
>>less than common (they are not letters for example).
>
>In other 8859's than 8859-1 there *are* letters in that range. For instance,
>in 8859-2 of the 32 symbols in the range A0 to BF, 20 are letters, in 8859-3
>there are 14 letters, in 8859-4 21 letters, and in 8859-5 all but two are
>letters. Especially in 8859-5 the common letters are encoded in the range
>B0-EF with letters used in specific languages in A0-AF and F0-FF.

True. I don't know the languages covered by these sets well enough to
say if the resulting combinations are likely. The one most likely to
result in confusion seems to be 8859-5 since certain runs of two or
three capital letters would be valid UTF-8 sequences.

--
Ben.

**George** · Nov 2 '08, 01:25 AM

Re: Judge the encode systm used by the file.

On Fri, 31 Oct 2008 12:58:09 GMT, James Kuyper wrote:

>Richard Tobin wrote:
>
>>In article <kPAOk.1702$225.265@nwrddc02.gnilink.net>,
>>James Kuyper <jameskuyper@verizon.net> wrote:
>>
>>>ASCII was developed by the American Standards Association, which
>>>eventually became the American National Standards Institute, or ANSI. I
>>>can't be sure, but I suspect that Hongyi Zhao is using "ansi" as a
>>>synonym for ASCII.
>>
>>Rather bizarrely, the term "ansi" is often used to refer to the Microsoft
>>encoding "windows-1252", which is ISO-8859-1 with a completely random
>>bunch of characters replacing the C1 controls.
>
>That's because those code pages were submitted to ANSI for
>standardization. ANSI turned them down, but Microsoft continued to refer
>to them as "ANSI" pages.

Interesting. I hope Dr. Zhao got what he needed.
--
George

This was not an act of terrorism, but it was an act of war.
George W. Bush

Picture of the Day http://apod.nasa.gov/apod/
