Judge the encode systm used by the file.

  • Richard Tobin

    #16
    Re: Judge the encode systm used by the file.

    In article <K9K4Kz.BBC@cwi.nl>, Dik T. Winter <Dik.Winter@cwi.nl> wrote:
    >5×½.
    >Just to keep it for the future, this article is one such file ;-).
    A good example. Though it turns out that the UTF-8 interpretation
    does not correspond to an existing character, being U+05FD which
    is an unused code point in the Hebrew range.
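
As a worked check of that code point (a sketch, not part of the original post): the × and ½ in the quoted line are bytes 0xD7 and 0xBD in ISO-8859-1, and read as a two-byte UTF-8 sequence they combine into U+05FD. The helper name `decode_utf8_pair` is made up for illustration.

```c
#include <stdio.h>

/* Decode a two-byte UTF-8 sequence: lead byte 110xxxxx, then one
   continuation byte 10xxxxxx. Returns the code point, or -1 if the
   two bytes do not have that shape. (Hypothetical helper.) */
static long decode_utf8_pair(unsigned char lead, unsigned char cont)
{
    if ((lead & 0xE0) != 0xC0 || (cont & 0xC0) != 0x80)
        return -1;
    /* 5 payload bits from the lead byte, 6 from the continuation byte */
    return ((long)(lead & 0x1F) << 6) | (long)(cont & 0x3F);
}

/* Print a code point in the usual U+XXXX notation. */
static void print_code_point(long cp)
{
    printf("U+%04lX\n", (unsigned long)cp);
}
```

With the bytes from the example, `decode_utf8_pair(0xD7, 0xBD)` computes (0x17 << 6) | 0x3D, which is 0x5FD.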

    -- Richard
    --
    Please remember to mention me / in tapes you leave behind.

    • George

      #17
      Re: Judge the encode systm used by the file.

      On Wed, 29 Oct 2008 09:52:51 GMT, James Kuyper wrote:
      George wrote:
      >On Wed, 29 Oct 2008 20:44:09 +1300, Ian Collins wrote:
      >>
      >>Hongyi Zhao wrote:
      ...
      >>>I want to judge the file's encoding system correctly, i.e., belong to
      >>>utf-8, ansi, gbk, gb2312, gb18030, or iso-8859-a, and so on.
      ...
      >What's the gb* stuff that he refers to?
      >
      <http://en.wikipedia.org/wiki/GB18030>
      I thought there were two things that did not look like the others, as in
      the wonderful Sesame Street game. I've come to appreciate better children's
      programming now that I'm basically the babysitter.

      One was 'ansi'; the other was gb*. I thought it had something to do with
      Great Britain.

      Now that I realize that gb* would ordinarily be something I would guess a
      Chinese professor would know better than I, that leaves 'ansi'.

      As a mainland Chinese addressing an American, Professor Zhao might think
      that I would know things about this encoding, but I don't.

      Richard Heathfield's your man here. These encodings are usually
      unter-syntactic. He posted a link recently which I bookmarked and left my
      premises with an angry ex girlfriend.

      How to call from C and differentiate one from the other is best done
      pairwise. You have two encodings and call both to compare. You use a
      well-known sentence, and encode it in two differing schemes:

      "Now is the time for all good chinese to come to the aid of fair elections
      in the US."

      Call, compare results.
      --
      George

      Freedom itself was attacked this morning by a faceless coward, and freedom
      will be defended.
      George W. Bush

      • James Kuyper

        #18
        Re: Judge the encode systm used by the file.

        George wrote:
        On Wed, 29 Oct 2008 09:52:51 GMT, James Kuyper wrote:
        >
        >George wrote:
        >>On Wed, 29 Oct 2008 20:44:09 +1300, Ian Collins wrote:
        >>>
        >>>Hongyi Zhao wrote:
        >...
        >>>>I want to judge the file's encoding system correctly, i.e., belong to
        >>>>utf-8, ansi, gbk, gb2312, gb18030, or iso-8859-a, and so on.
        ....
        I thought there were two things that did not look like the others, as in
        the wonderful Sesame Street game. I've come to appreciate better children's
        programming now that I'm basically the babysitter.
        >
        One was 'ansi'; the other was gb*. I thought it had something to do with
        Great Britain.
        >
        Now that I realize that gb* would ordinarily be something I would guess a
        Chinese professor would know better than I, that leaves 'ansi'.
        ASCII was developed by the American Standards Association, which
        eventually became the American National Standards Institute, or ANSI. I
        can't be sure, but I suspect that Hongyi Zhao is using "ansi" as a
        synonym for ASCII.

        • Richard Bos

          #19
          Re: Judge the encode systm used by the file.

          richard@cogsci.ed.ac.uk (Richard Tobin) wrote:
          Richard Bos <rlb@hoekstra-uitgeverij.nl> wrote:
          >
          Possibly, but are you willing to rely on this, given the thousands of
          languages out there, most of them, _unlike_ English, written in a Latin
          script which uses diacritics to a greater or smaller degree?
          >
          Yes. It's very unlikely that all the sequences of 8859 characters used
          in such a document will be legal UTF-8.
          >
          The heuristic is: if the file contains bytes >= 128, and it would be
          legal UTF-8, then it's very likely that it *is* UTF-8. As I said,
          I would be interested if you can come up with any real document for
          which this heuristic fails.
          *Shrug* You speak English, and you're willing to take that risk. I speak
          a language which _does_ use diacritics, and I'm not.

          Richard

          • Richard Tobin

            #20
            Re: Judge the encode systm used by the file.

            In article <kPAOk.1702$225.265@nwrddc02.gnilink.net>,
            James Kuyper <jameskuyper@verizon.net> wrote:
            >ASCII was developed by the American Standards Association, which
            >eventually became the American National Standards Institute, or ANSI. I
            >can't be sure, but I suspect that Hongyi Zhao is using "ansi" as a
            >synonym for ASCII.
            Rather bizarrely, the term "ansi" is often used to refer to the Microsoft
            encoding "windows-1252", which is ISO-8859-1 with a completely random
            bunch of characters replacing the C1 controls.

            [I suspect the reason for a Microsoft encoding being called "ansi" is
            similar to that for Edinburgh having streets called "London Rd" and
            London having streets called "Edinburgh Rd". That is, if you start
            from Microsoft it's in the direction of ANSI.]

            -- Richard
            --
            Please remember to mention me / in tapes you leave behind.

            • Richard Tobin

              #21
              Re: Judge the encode systm used by the file.

              In article <490ae4b8.604086120@news.xs4all.nl>,
              Richard Bos <rlb@hoekstra-uitgeverij.nl> wrote:
              >The heuristic is: if the file contains bytes >= 128, and it would be
              >legal UTF-8, then it's very likely that it *is* UTF-8. As I said,
              >I would be interested if you can come up with any real document for
              >which this heuristic fails.
              >*Shrug* You speak English, and you're willing to take that risk. I speak
              >a language which _does_ use diacritics, and I'm not.
              As Dik Winter's (constructed) example indicates, the chance of error
              is probably higher for English documents than for ones with a lot
              of diacritics. The more non-ASCII characters you have, the lower
              the chance of them accidentally being legal UTF-8.

              -- Richard
              --
              Please remember to mention me / in tapes you leave behind.

              • James Kuyper

                #22
                Re: Judge the encode systm used by the file.

                Richard Tobin wrote:
                In article <kPAOk.1702$225.265@nwrddc02.gnilink.net>,
                James Kuyper <jameskuyper@verizon.net> wrote:
                >
                >ASCII was developed by the American Standards Association, which
                >eventually became the American National Standards Institute, or ANSI. I
                >can't be sure, but I suspect that Hongyi Zhao is using "ansi" as a
                >synonym for ASCII.
                >
                Rather bizarrely, the term "ansi" is often used to refer to the Microsoft
                encoding "windows-1252", which is ISO-8859-1 with a completely random
                bunch of characters replacing the C1 controls.
                That's because those code pages were submitted to ANSI for
                standardization. ANSI turned them down, but Microsoft continued to refer
                to them as "ANSI" pages.

                • Ben Bacarisse

                  #23
                  Re: Judge the encode systm used by the file.

                  richard@cogsci.ed.ac.uk (Richard Tobin) writes:
                  In article <490ae4b8.604086120@news.xs4all.nl>,
                  Richard Bos <rlb@hoekstra-uitgeverij.nl> wrote:
                  >
                  >>The heuristic is: if the file contains bytes >= 128, and it would be
                  >>legal UTF-8, then it's very likely that it *is* UTF-8. As I said,
                  >>I would be interested if you can come up with any real document for
                  >>which this heuristic fails.
                  >
                  >>*Shrug* You speak English, and you're willing to take that risk. I speak
                  >>a language which _does_ use diacritics, and I'm not.
                  >
                  As Dik Winter's (constructed) example indicates, the chance of error
                  is probably higher for English documents than for ones with a lot
                  of diacritics. The more non-ASCII characters you have, the lower
                  the chance of them accidentally being legal UTF-8.
                  It is not that hard to work out what is permitted and what is not.
                  For a file that uses an 8-bit single-byte encoding to look like valid
                  UTF-8 it must consist of sequences made up of the following patterns:

                  [01234567]x
                  [CD]x [89AB]x
                  Ex [89AB]x [89AB]x
                  F[01234567] [89AB]x [89AB]x [89AB]x

                  (this is a sort of made-up hex pattern notation).
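
A minimal C sketch of a check against the four patterns above, combined with the "contains bytes >= 128 and is legal UTF-8" heuristic quoted earlier. The function names are made up for illustration. It tests only the byte shapes listed in the patterns, so it is looser than a strict UTF-8 validator, which would also reject overlong forms, surrogates and values above U+10FFFF.

```c
#include <stddef.h>

/* Returns 1 if buf[0..len) is built only from the four byte patterns
   listed above, 0 otherwise. *has_high is set to 1 if a byte >= 0x80
   was seen before the scan ended. (Hypothetical helper.) */
static int fits_utf8_patterns(const unsigned char *buf, size_t len,
                              int *has_high)
{
    size_t i = 0;

    *has_high = 0;
    while (i < len) {
        unsigned char b = buf[i];
        size_t follow;          /* number of 80..BF bytes required next */
        size_t k;

        if (b < 0x80)
            follow = 0;         /* [01234567]x */
        else if (b >= 0xC0 && b <= 0xDF)
            follow = 1;         /* [CD]x */
        else if (b >= 0xE0 && b <= 0xEF)
            follow = 2;         /* Ex */
        else if (b >= 0xF0 && b <= 0xF7)
            follow = 3;         /* F[01234567] */
        else
            return 0;           /* stray 80..BF, or F8..FF */

        if (b >= 0x80)
            *has_high = 1;
        if (len - i - 1 < follow)
            return 0;           /* sequence cut off by end of data */
        for (k = 1; k <= follow; k++)
            if ((buf[i + k] & 0xC0) != 0x80)
                return 0;       /* follow-on byte not in 80..BF */
        i += follow + 1;
    }
    return 1;
}

/* The heuristic from the thread: non-ASCII bytes are present and the
   whole buffer has the UTF-8 shape. */
static int probably_utf8(const unsigned char *buf, size_t len)
{
    int has_high;

    return fits_utf8_patterns(buf, len, &has_high) && has_high;
}
```

Dik Winter's four-byte example (35 D7 BD 2E) passes this test even though it was written as ISO-8859-1, while a Latin-1 "café" (63 61 66 E9) fails because E9 is not followed by two bytes in the range 80 to BF.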

                  For example, if any of the 8 characters F0 to F7 appears, it must be
                  followed by exactly three characters in the range 80 to BF. Any of
                  the 16 characters C0 to DF must be followed by exactly one such
                  character. These "follow-on" characters come to our aid, since half
                  of them are very rarely used control characters and the others are all
                  less than common (they are not letters for example).

                  Taking ISO-8859-1 as an example, the document can't include (anywhere)
                  thorn, small o with a slash, small y with either an acute or diaeresis,
                  or small u with any accent. In addition it can't have any accented
                  letter followed by either another one or by any "plain" character
                  whatsoever. Every small accented a, e or i (the Ex range) must be
                  followed by exactly two of the rather odd bunch like pilcrow, micro,
                  plus/minus etc. None of the "matching pairs" like « and », ¿ and ?
                  can appear in a normal position (preceded by a space, newline or
                  tab for example). The best real-world use case I can see is a word
                  that has one and only one final accented character followed by
                  something like the registered symbol, the copyright symbol or maybe a
                  superscript number.

                  Other encodings (like the Chinese ones, which are multi-byte) might well have
                  patterns of use that do fit the requirements of the UTF-8 scheme, but
                  it is not likely to be common for the 8859 family.

                  --
                  Ben.

                  • Dik T. Winter

                    #24
                    Re: Judge the encode systm used by the file.

                    In article <87zlkkbwnu.fsf@bsb.me.uk> Ben Bacarisse <ben.usenet@bsb.me.uk> writes:
                    ....
                    For example, if any of the 8 characters F0 to F7 appears, it must be
                    followed by exactly three characters in the range 80 to BF. Any of
                    the 16 characters C0 to DF must be followed by exactly one such
                    character. These "follow-on" characters come to our aid, since half
                    of them are very rarely used control characters and the others are all
                    less than common (they are not letters for example).
                    In 8859 parts other than 8859-1 there *are* letters in that range. For instance,
                    in 8859-2, of the 32 symbols in the range A0 to BF, 20 are letters; in 8859-3
                    there are 14 letters, in 8859-4 21 letters, and in 8859-5 all but two are
                    letters. In 8859-5 especially, the common letters are encoded in the range
                    B0-EF, with letters used in specific languages in A0-AF and F0-FF. Do not
                    base your knowledge on 8859-1.
                    --
                    dik t. winter, cwi, kruislaan 413, 1098 sj amsterdam, nederland, +31205924131
                    home: bovenover 215, 1025 jn amsterdam, nederland; http://www.cwi.nl/~dik/

                    • Stephen Sprunk

                      #25
                      Re: Judge the encode systm used by the file.

                      Hongyi Zhao wrote:
                      I want to judge the file's encoding system correctly, i.e., belong to
                      utf-8, ansi, gbk, gb2312, gb18030, or iso-8859-a, and so on.
                      It is trivial to detect a BOM at the beginning of a UTF-8, UTF-16LE, or
                      UTF-16BE file. If the file is in another encoding, or does not start
                      with a BOM, there is no reliable way to tell what encoding is used
                      because many files will be equally valid (to a computer, at least) using
                      several different encodings. You may be able to eliminate some of the
                      multi-byte encodings by looking for "invalid" sequences, but you can't
                      eliminate most of the single-byte ones.
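
A minimal C sketch of the BOM test described above, assuming the first few bytes of the file have already been read into a buffer. The function name `detect_bom` is made up for illustration. Note that FF FE followed by 00 00 is also the UTF-32LE BOM, which this sketch does not distinguish.

```c
#include <stddef.h>

/* Returns the name of the encoding whose byte order mark starts the
   buffer, or NULL if none of the three BOMs is present.
   (Hypothetical helper.) */
static const char *detect_bom(const unsigned char *buf, size_t len)
{
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return "UTF-8";         /* U+FEFF encoded as EF BB BF */
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
        return "UTF-16LE";      /* U+FEFF, low byte first */
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
        return "UTF-16BE";      /* U+FEFF, high byte first */
    return NULL;                /* no BOM: encoding still unknown */
}
```

A NULL result says nothing about the encoding; it only means the file has to be examined by other means, such as the validity checks and heuristics discussed in this thread.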

                      Web browsers provide a perfect example of how difficult the problem is.
                      If a page is not explicitly marked with an encoding, they will either
                      use the user's default, which is often wrong, or use heuristics to
                      guess, which is also often wrong. I run into dozens of pages _per day_
                      that my browser can't correctly guess the encoding of.

                      (Browsers' heuristics often use character frequency, after markup is
                      removed, to determine the language and/or encoding in use. However,
                      short or unusual documents will often lead to an incorrect result.)
                      Who can give me some hints on the Fortran implementation of this
                      issue?
                      If you want help with Fortran, ask in a Fortran newsgroup; in
                      comp.lang.c, we discuss the C language.

                      S

                      • Ben Bacarisse

                        #26
                        Re: Judge the encode systm used by the file.

                        "Dik T. Winter" <Dik.Winter@cwi.nl> writes:
                        In article <87zlkkbwnu.fsf@bsb.me.uk> Ben Bacarisse <ben.usenet@bsb.me.uk> writes:
                        ...
                        For example, if any of the 8 characters F0 to F7 appears, it must be
                        followed by exactly three characters in the range 80 to BF. Any of
                        the 16 characters C0 to DF must be followed by exactly one such
                        character. These "follow-on" characters come to our aid, since half
                        of them are very rarely used control characters and the others are all
                        less than common (they are not letters for example).
                        >
                        In 8859 parts other than 8859-1 there *are* letters in that range. For instance,
                        in 8859-2, of the 32 symbols in the range A0 to BF, 20 are letters; in 8859-3
                        there are 14 letters, in 8859-4 21 letters, and in 8859-5 all but two are
                        letters. In 8859-5 especially, the common letters are encoded in the range
                        B0-EF, with letters used in specific languages in A0-AF and F0-FF.
                        True. I don't know the languages covered by these sets well enough to
                        say if the resulting combinations are likely. The one most likely to
                        result in confusion seems to be 8859-5 since certain runs of two or
                        three capital letters would be valid UTF-8 sequences.

                        --
                        Ben.

                        • George

                          #27
                          Re: Judge the encode systm used by the file.

                          On Fri, 31 Oct 2008 12:58:09 GMT, James Kuyper wrote:
                          Richard Tobin wrote:
                          >In article <kPAOk.1702$225.265@nwrddc02.gnilink.net>,
                          >James Kuyper <jameskuyper@verizon.net> wrote:
                          >>
                          >>ASCII was developed by the American Standards Association, which
                          >>eventually became the American National Standards Institute, or ANSI. I
                          >>can't be sure, but I suspect that Hongyi Zhao is using "ansi" as a
                          >>synonym for ASCII.
                          >>
                          >Rather bizarrely, the term "ansi" is often used to refer to the Microsoft
                          >encoding "windows-1252", which is ISO-8859-1 with a completely random
                          >bunch of characters replacing the C1 controls.
                          >
                          That's because those code pages were submitted to ANSI for
                          standardization. ANSI turned them down, but Microsoft continued to refer
                          to them as "ANSI" pages.
                          Interesting. I hope Dr. Zhao got what he needed.
                          --
                          George

                          This was not an act of terrorism, but it was an act of war.
                          George W. Bush

                          Picture of the Day http://apod.nasa.gov/apod/
