Character encodings and invalid characters

  • Safalra

    Character encodings and invalid characters

    [Crossposted as the questions to each group might sound a little
    strange without context; trim groups if necessary]

    The idea here is relatively simple: a Java program (I'm using JDK1.4
    if that makes a difference) that loads an HTML file, removes invalid
    characters (or replaces them in the case of common ones like
    Microsoft's 'smartquotes'), and outputs the file.

    The problem is these files will be on disk, so the program won't have
    the character encoding information from the server.

    Questions:

    1) I presume Java will correctly identify UTF-16BE and UTF-16LE from
    the byte order markers. How does it identify other encodings? Will it
    just assume the system default encoding until it finds bytes that
    imply UTF-8? The program will mainly deal with UTF-16, UTF-8,
    ISO-8859-1 and US-ASCII, but others may occur.

    2) I'm slightly confused by the HTML specification - are the valid
    characters precisely those that are defined in Unicode? (Java
    internally works with 16-bit characters.) (I'm ignoring at this point
    characters that in HTML need escaping.)

    3) If it fails on esoteric character encodings, how badly is it likely
    to fail? Will it totally trash the HTML?

    --
    Safalra (Stephen Morley)

  • Alan J. Flavell

    #2
    Re: Character encodings and invalid characters

    On Mon, 14 Jun 2004, Safalra wrote:

    > Questions:
    >
    > 1) I presume Java will correctly identify UTF-16BE and UTF-16LE from
    > the byte order markers. How does it identify other encodings?

    I can't answer that, but the use of a BOM is permissible in utf-8
    although it's not required. Actually, if I may be pedantic for a
    moment, utf-16BE and utf-16LE don't use a BOM - the endianness is
    specified by the name of the encoding; utf-16 uses a BOM and by
    looking at the BOM you work out for yourself whether it's LE or BE.

    Coming back to utf-8: unless it's entirely us-ascii in which case you
    can't tell the difference, there are validity criteria, and the more
    of it you get which meet the criteria, the more confident you can be
    that it really is utf-8. Just one single violation of the criteria is
    enough to rule that possibility out, and the Unicode rules *mandate*
    refusing to process the document further, for security reasons.
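
A minimal Java sketch of the two checks described above: look for a byte order mark first, then fall back to testing whether the bytes are strictly valid UTF-8. The class and method names (`EncodingSniffer`, `guess`, `isStrictUtf8`) are hypothetical, the code uses `StandardCharsets` and so assumes Java 7 or later rather than the JDK 1.4 mentioned in the thread, and a `null` result only means "no hint found", not "definitely some 8-bit encoding".

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: guesses an encoding from a BOM or from UTF-8 validity. */
final class EncodingSniffer {

    /** Returns a charset name, or null when the bytes themselves give no usable hint. */
    static String guess(byte[] b) {
        // A BOM, when present, is the strongest hint available inside the file.
        if (b.length >= 3 && (b[0] & 0xFF) == 0xEF && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        // FE FF or FF FE: "UTF-16" with a BOM; Java's UTF-16 decoder reads the BOM itself.
        // (Assumption: UTF-32, whose little-endian BOM also starts FF FE, is not expected here.)
        if (b.length >= 2) {
            int b0 = b[0] & 0xFF, b1 = b[1] & 0xFF;
            if ((b0 == 0xFE && b1 == 0xFF) || (b0 == 0xFF && b1 == 0xFE)) {
                return "UTF-16";
            }
        }
        // Pure 7-bit data cannot be told apart from UTF-8 or ISO-8859-1.
        boolean ascii = true;
        for (byte x : b) {
            if (x < 0) { ascii = false; break; }
        }
        if (ascii) {
            return "US-ASCII";
        }
        return isStrictUtf8(b) ? "UTF-8" : null;
    }

    /** True only if every byte sequence is well-formed UTF-8; one violation rules it out. */
    static boolean isStrictUtf8(byte[] b) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(b));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

The longer the non-ASCII input that passes `isStrictUtf8`, the more confidence the "UTF-8" answer deserves; a short file with a single accented letter is much weaker evidence.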

    > Will it just assume the system default encoding until it finds bytes
    > that imply UTF-8? The program will mainly deal with UTF-16, UTF-8,
    > ISO-8859-1 and US-ASCII, but others may occur.

    Right, but define "others". Are you going to deal with any character
    encodings which define characters that don't exist in Unicode - e.g.
    Klingon?

    You certainly aren't going to be able to guess 8-bit character
    encodings just by looking at them - you absolutely do, in general,
    need some external source of wisdom on what character coding you are
    dealing with. *Some* character encodings can be guessed, at least on
    plausibility grounds.

    > 2) I'm slightly confused by the HTML specification - are the valid
    > characters precisely those that are defined in Unicode?

    With the greatest of respect, you seem to be putting the cart before
    the horse. First you say you intend to remove invalid characters, and
    then it becomes clear that you're not sure how to define what they
    are. :-}

    I'm assuming that there's some substantive issue behind your problem,
    but I'm afraid you're not expressing it in terms that I can be
    confident that I understand what you're trying to achieve. Recall
    that there are in general three ways of representing characters in
    HTML:

    1. coded characters in the appropriate character encoding
    2. numerical character references &#number; or &#xhexnum;
    3. character entity references &name; for those characters which have
    them.

    Can you address what you propose to do with each of these when you
    find them?

    > (I'm ignoring at this point characters that in HTML need escaping.)

    Hmmm? Are you referring to the use of &-notations here, or something
    else?

    > 3) If it fails on esoteric character encodings, how badly is it likely
    > to fail? Will it totally trash the HTML?

    Best answer I can give to that is that the HTML markup itself uses
    nothing more than plain us-ascii repertoire. If you can't recognise
    at least that repertoire in the original encoding, then you're going
    to do worse than trash only the HTML, no?

    good luck

    • Roedy Green

      #3
      Re: Character encodings and invalid characters

      On 14 Jun 2004 09:48:55 -0700, usenet@safalra.com (Safalra) wrote or
      quoted :

      >1) I presume Java will correctly identify UTF-16BE and UTF-16LE from
      >the byte order markers. How does it identify other encodings?

      You have to ask the user. You can find out the default encoding on his
      machine, but that's as good as it gets. People never thought to mark
      documents with the encoding or record it in a resource fork.

      You can take the same document and interpret it many ways. It would
      require almost AI to figure out which was the most likely encoding.

      You could do it by comparing letter frequencies to averages of
      samples.


      --
      Canadian Mind Products, Roedy Green.
      Coaching, problem solving, economical contract programming.
      See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.

      • Safalra

        #4
        Re: Character encodings and invalid characters

        "Alan J. Flavell" <flavell@ph.gla.ac.uk> wrote in message news:<Pine.LNX.4.53.0406141757110.8374@ppepc56.ph.gla.ac.uk>...
        > On Mon, 14 Jun 2004, Safalra wrote:
        > > 2) I'm slightly confused by the HTML specification - are the valid
        > > characters precisely those that are defined in Unicode?
        >
        > With the greatest of respect, you seem to be putting the cart before
        > the horse. First you say you intend to remove invalid characters, and
        > then it becomes clear that you're not sure how to define what they
        > are. :-}
        >
        > I'm assuming that there's some substantive issue behind your problem,
        > but I'm afraid you're not expressing it in terms that I can be
        > confident that I understand what you're trying to achieve.

        Okay, I guess I should have given more detail:

        I wrote my dissertation on the subject of automated neatening of HTML.
        As part of this I wrote a Java program to demonstrate what could be
        done. It removed or replaced invalid characters, attributes and
        elements, turned presentation elements and attributes to CSS, and
        replaced many tables used for layout purposes (and some framesets)
        with divs and CSS. It worked surprisingly well, but I only had to test
        it on ISO-8859-1 documents. I worked out the invalid characters just
        by feeding them into the W3C Validator, and for the ones that were
        invalid but rendered under Windows (like smartquotes) I replaced those
        with valid equivalents.

        Once I've worked the program into a more presentable state, I'd like
        to release it (GPL'd, of course). The problem is, I've got no idea
        what would happen if, say, a Japanese person runs it on some Japanese
        HTML source on their harddisk - I've never used a foreign character
        encoding, so I don't even know how their text editors figure out the
        encoding. I was wondering if Java assumes it's the system default
        (unless it encounters unicode), and hence the program would still
        work. (I assume that people would usually use the same character
        encoding for their system and their HTML?)

        > Recall
        > that there are in general three ways of representing characters in
        > HTML:
        >
        > 1. coded characters in the appropriate character encoding
        > 2. numerical character references &#number; or &#xhexnum;
        > 3. character entity references &name; for those characters which have
        > them.
        >
        > Can you address what you propose to do with each of these when you
        > find them?

        1. That's the one I'm asking about. :)

        Assuming I can get around character encoding problems:

        2. If I understand the specification correctly, these refer to UCS
        code positions, so I just need to check whether the position is defined
        in Unicode.
        3. I just need to check whether these are defined in the
        specification.

        If occurrences of (2) and (3) are valid, they'll just be output by
        the program in the same form.

        > > (I'm ignoring at this point characters that in HTML need escaping.)
        >
        > Hmmm? Are you referring to the use of &-notations here,

        Yes, but now we've discussed them above...

        --
        Safalra (Stephen Morley)

        • Michael Borgwardt

          #5
          Re: Character encodings and invalid characters

          Safalra wrote:
          > to release it (GPL'd, of course). The problem is, I've got no idea
          > what would happen if, say, a Japanese person runs it on some Japanese
          > HTML source on their harddisk - I've never used a foreign character
          > encoding, so I don't even know how their text editors figure out the
          > encoding.

          They assume it by convention, usually. This can (and does) go wrong.

          > I was wondering if Java assumes it's the system default
          > (unless it encounters unicode)

          Java *always* assumes text is the system default encoding unless given an
          explicit encoding. Unicode does not play into it.
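
A small Java sketch of what "given an explicit encoding" can look like in practice: the charset is passed to `InputStreamReader` instead of relying on the platform default. `ExplicitDecode` and `readAll` are hypothetical names, and the try-with-resources form assumes Java 7 or later (on JDK 1.4 the same constructor accepts a charset name as a `String`, with manual `close()` calls).

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

/** Hypothetical helper: decodes a byte stream with a caller-supplied charset. */
final class ExplicitDecode {

    static String readAll(InputStream in, String charsetName) {
        // Naming the charset here is what stops Java falling back to the platform default.
        try (Reader reader = new InputStreamReader(in, Charset.forName(charsetName))) {
            StringBuilder text = new StringBuilder();
            char[] buffer = new char[4096];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                text.append(buffer, 0, n);
            }
            return text.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The same two bytes are one character in UTF-8 and two in ISO-8859-1.
        byte[] bytes = {(byte) 0xC3, (byte) 0xA9};
        System.out.println("platform default: " + Charset.defaultCharset());
        System.out.println(readAll(new ByteArrayInputStream(bytes), "UTF-8").length());      // prints 1
        System.out.println(readAll(new ByteArrayInputStream(bytes), "ISO-8859-1").length()); // prints 2
    }
}
```

Neither call fails, which is the practical danger: decoding with the wrong 8-bit charset usually produces wrong characters silently rather than an error.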

          Also, do remember that in theory, all HTML documents should declare
          their encoding explicitly, or have it supplied by the server in
          the header. In XHTML, the explicit declaration is in fact mandatory.

          But overall, text encoding is a horribly complex, muddled mess of
          legacy conventions, incompatibilities, hacks and workarounds. Most
          of the time, it breaks down horribly as soon as you cross a language
          barrier.

          • Alan J. Flavell

            #6
            Re: Character encodings and invalid characters

            On Tue, 15 Jun 2004, Safalra wrote:

            > I wrote my dissertation on the subject of automated neatening of HTML.
            [...]
            > with divs and CSS. It worked surprisingly well, but I only had to test
            > it on ISO-8859-1 documents. I worked out the invalid characters just
            > by feeding them into the W3C Validator,

            I think I'm going to have to stand firm, and say that you really need
            to make the effort and cross the threshold of understanding the HTML
            character model in order to grasp what's behind this, otherwise you'd
            risk blundering on in a heuristic fashion without a robust mental
            picture of what's involved.

            This note makes no attempt to be a full tutorial on that, but just
            races through some key headings to see whether you can be persuaded to
            read the background and get up to speed.

            All of the characters from 0 to 31 decimal, and all of the characters
            from 127(sic) to 159 decimal, in the Document Character Set, are
            defined to be control characters, and almost all of them are excluded
            from use in HTML. These are the characters which are declared to be
            "invalid" by the specification (and by the validator).

            What's the "Document Character Set"? Well, in HTML2 it was
            iso-8859-1, and in HTML4 it was defined to be iso-10646 as amended.
            Loosely, you can read "iso-10646 as amended" as being the character
            model of Unicode. As far as the values from 0 to 255 are concerned,
            iso-8859-1 and iso-10646 are identical.
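
The control ranges described above can be sketched as a check on document character set code points, that is, on the text after it has been decoded to Unicode. `HtmlCharCheck` and its methods are hypothetical names. Which controls count as the permitted exceptions is an assumption on my part: the sketch lets tab (9), line feed (10) and carriage return (13) through, which is my reading of "almost all of them are excluded".

```java
/** Hypothetical helper: flags control characters that HTML 4 excludes. */
final class HtmlCharCheck {

    /** True when the code point lies in 0..31 or 127..159 and is not tab, LF or CR. */
    static boolean isExcludedControl(int codePoint) {
        // Assumption: these three controls are the ones HTML still permits.
        if (codePoint == 0x09 || codePoint == 0x0A || codePoint == 0x0D) {
            return false;
        }
        return (codePoint >= 0 && codePoint <= 31) || (codePoint >= 127 && codePoint <= 159);
    }

    /** Counts excluded controls in already-decoded text. */
    static int countExcluded(String decodedText) {
        int count = 0;
        for (int i = 0; i < decodedText.length(); ) {
            int cp = decodedText.codePointAt(i);
            if (isExcludedControl(cp)) {
                count++;
            }
            i += Character.charCount(cp); // step over surrogate pairs correctly
        }
        return count;
    }
}
```

The check only makes sense once the bytes have been decoded with the right charset: byte 147 is a problem in a document that really is ISO-8859-1, and an ordinary quotation mark in one that is Windows-1252.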

            How is this related to the external character encoding? Well, the
            character model that was introduced in RFC2070 and embodied in HTML4
            is based on the concept that the external encoding is converted into
            iso-10646/unicode prior to any other processing being done. It
            doesn't require implementations to work in that way internally, but it
            _does_ mandate that they give that impression externally (black box
            model).

            So from HTML's point of view, if you have a document which is coded in
            say Windows-1252, including those pretty quotes, then (as long as the
            recipient consents - see the HTTP Accept-charset) it's perfectly
            legal. All you need to do is apply the appropriate code mapping that
            you find at the Unicode site, and get the resulting Unicode character.

            Resources at http://www.unicode.org/Public/MAPPINGS/ , in this case



            > and for the ones that were invalid but rendered under Windows (like
            > smartquotes) I replaced those with valid equivalents.

            What you're talking about here is probably a document which in reality
            is coded in Windows-1252 but erroneously claims to be - or is
            mistakenly presumed to be - iso-8859-1 (or its equivalent in other
            locales).

            There's nothing inherently wrong with these particular octet values
            (128-159 decimal) *in those codings which assign them to printable
            characters* (that's not only all of the Windows-125x codings, but also
            koi-8r and some other less-usual codings).

            What's wrong is when those octet values occur in codings which define
            them to be control characters which are not used in HTML.
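
To make the distinction concrete, here is a short Java sketch that decodes the same octet under two charsets. `SmartQuoteDemo` and `firstCodePoint` are hypothetical names, and the sketch assumes the running JDK ships a `windows-1252` charset (common in practice, but not one of the charsets every Java platform is required to support).

```java
import java.nio.charset.Charset;

/** Hypothetical demo: one octet, two meanings, depending on the declared encoding. */
final class SmartQuoteDemo {

    /** Decodes the bytes with the named charset and returns the first code point. */
    static int firstCodePoint(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName)).codePointAt(0);
    }

    public static void main(String[] args) {
        byte[] octet = {(byte) 0x93}; // 147 decimal, a Windows "smart quote"
        // Windows-1252 assigns this octet to a printable character.
        System.out.printf("windows-1252: U+%04X%n", firstCodePoint(octet, "windows-1252")); // prints U+201C
        // ISO-8859-1 maps it to a C1 control, which HTML excludes.
        System.out.printf("ISO-8859-1:   U+%04X%n", firstCodePoint(octet, "ISO-8859-1"));   // prints U+0093
    }
}
```

So a cleanup tool that finds code points 128-159 after decoding as ISO-8859-1 has probably found a mislabelled Windows-1252 document, and re-decoding it is likely a better repair than deleting the characters.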

            > Once I've worked the program into a more presentable state, I'd like
            > to release it (GPL'd, of course). The problem is, I've got no idea
            > what would happen if, say, a Japanese person runs it on some Japanese
            > HTML source on their harddisk - I've never used a foreign character
            > encoding, so I don't even know how their text editors figure out the
            > encoding.

            Sadly, quite a number of language locales simply *assume* that their
            local coding applies. Try looking at such a file on a system that's
            set for a different locale, and you'll get rubbish. Although it's
            sometimes possible to guess (look at the automatic charset selection
            in, say, Mozilla for examples of what can be done heuristically).

            OK, I've done the HTML part of this. I'm not a regular Java user so
            I'm leaving that to others.

            > > Recall
            > > that there are in general three ways of representing characters in
            > > HTML:
            > >
            > > 1. coded characters in the appropriate character encoding
            > > 2. numerical character references &#number; or &#xhexnum;
            > > 3. character entity references &name; for those characters which have
            > > them.
            > >
            > > Can you address what you propose to do with each of these when you
            > > find them?
            >
            > 1. That's the one I'm asking about. :)

            Thanks - I did want to be sure about that first.

            [Don't make the mistake of confusing an 8-bit character of value 151
            decimal (in some specified 8-bit encoding), on the one hand, with the
            undefined(HTML)/illegal(XML) notation &#151; on the other hand.]

            > 2. If I understand the specification correctly, these refer to UCS
            > code positions,

            basically yes, modulo some possible nit picking about high/low
            surrogates and stuff, that I don't want to go into here.

            > so I just need to check whether the position is defined
            > in Unicode.

            Er, not quite. Those control characters are certainly *defined*, but
            they are excluded from use in HTML by the "SGML declaration for HTML",
            and from XHTML by the rules of XML.

            And on the other hand I don't think an as-yet-unassigned Unicode code
            point is actually invalid for use in (X)HTML. Try it and see what the
            validator says?
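
A hedged Java sketch of that rule for numeric character references: a reference is flagged when it points at an excluded control character, not when it points at a code point that merely has no assigned character yet. `NumericRefCheck` and `suspectRefs` are hypothetical names. Treating tab, LF and CR as permitted, and flagging values above U+10FFFF, are my assumptions rather than anything stated in the thread; surrogates are deliberately left alone.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: lists numeric character references that name excluded controls. */
final class NumericRefCheck {

    // Matches &#123; (decimal) and &#x7B; (hexadecimal) references.
    private static final Pattern REF = Pattern.compile("&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));");

    static List<String> suspectRefs(String html) {
        List<String> suspects = new ArrayList<>();
        Matcher m = REF.matcher(html);
        while (m.find()) {
            int cp;
            try {
                cp = (m.group(1) != null)
                        ? Integer.parseInt(m.group(1), 16)
                        : Integer.parseInt(m.group(2));
            } catch (NumberFormatException e) {
                suspects.add(m.group()); // too large for an int, so not a code point
                continue;
            }
            boolean permittedControl = cp == 9 || cp == 10 || cp == 13; // assumption
            boolean control = cp <= 31 || (cp >= 127 && cp <= 159);
            if ((control && !permittedControl) || cp > 0x10FFFF) {
                suspects.add(m.group());
            }
        }
        return suspects;
    }
}
```

For example, `suspectRefs("a&#151;b&#8212;c")` flags only `&#151;`: the reference names code point 151, a control character, whereas `&#8212;` names the em dash that the author presumably wanted.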

            hope this helps a bit. The writeup of the HTML character model in the
            relevant part of the HTML4 spec and/or RFC2070 is not bad, I'd suggest
            giving it a try. There's also some material at
            http://ppewww.ph.gla.ac.uk/~flavell/charset/ which some folks have
            found helpful.

            • Roedy Green

              #7
              Re: Character encodings and invalid characters

              On Mon, 14 Jun 2004 20:38:09 GMT, Roedy Green
              <look-on@mindprod.com.invalid> wrote or quoted :

              >You have to ask the user. You can find out the default encoding on his
              >machine, but that's as good as it gets. People never thought to mark
              >documents with the encoding or record it in a resource fork.

              for more info see
              http://mindprod.com/jgloss/encoding.html#IDENTIFICATION

              I am working up a student project to solve this problem.

              --
              Canadian Mind Products, Roedy Green.
              Coaching, problem solving, economical contract programming.
              See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.

              • Roedy Green

                #8
                Re: Character encodings and invalid characters

                On Tue, 15 Jun 2004 21:59:54 GMT, Roedy Green
                <look-on@mindprod.com.invalid> wrote or quoted :

                >>You have to ask the user. You can find out the default encoding on his
                >>machine, but that's as good as it gets. People never thought to mark
                >>documents with the encoding or record it in a resource fork.
                >
                >for more info see
                >http://mindprod.com/jgloss/encoding.html#IDENTIFICATION
                >
                >I am working up a student project to solve this problem.

                see http://mindprod.com/projects/encodin...ification.html

                --
                Canadian Mind Products, Roedy Green.
                Coaching, problem solving, economical contract programming.
                See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.

                • Safalra

                  #9
                  Re: Character encodings and invalid characters

                  [newsgroups trimmed - this no longer relates to Java]

                  "Alan J. Flavell" <flavell@ph.gla.ac.uk> wrote in message news:<Pine.LNX.4.53.0406151306120.10311@ppepc56.ph.gla.ac.uk>...
                  > And on the other hand I don't think an as-yet-unassigned Unicode code
                  > point is actually invalid for use in (X)HTML. Try it and see what the
                  > validator says?

                  They're valid.

                  Incidentally, I've found another strange 'feature' of Internet
                  Explorer. (The Cambridge Linux service was down, so I was forced into
                  using Windows.) When IE uploaded the UTF-16 file to the validator, it
                  strangely sent it as application/octet-stream rather than text/html,
                  which it does for ISO-8859-1.

                  --
                  Safalra (Stephen Morley)

                  • Jukka K. Korpela

                    #10
                    Re: Character encodings and invalid characters

                    usenet@safalra.com (Safalra) wrote:

                    > When IE uploaded the UTF-16 file to the validator, it
                    > strangely sent it as application/octet-stream rather than text/html,
                    > which it does for ISO-8859-1.

                    IE treats file upload weirdly.

                    Recently there was some discussion in the www-validator list about a
                    problem that seems to have resulted from IE's odd way of sending,
                    in file upload, an XHTML document as text/xml with no charset parameter
                    when the document lacks the <?xml ...> prologue. For details see


                    In general, when you upload a file using IE, then you can expect the data
                    itself to be sent properly but should assume that everything else is
                    wrong until proven correct. In fact, maybe it's not _only_ IE's fault. A
                    browser is expected to include a Content-Type header (which in turn may
                    allow or even require a charset parameter, depending on the type).
                    How is it expected to perform this, for files in general? It's guesswork
                    at best _until_ someone creates a file system that contains media type
                    information (in MIME terms) in its control data. (Just dreaming aloud.)

                    --
                    Yucca, http://www.cs.tut.fi/~jkorpela/
                    Pages about Web authoring: http://www.cs.tut.fi/~jkorpela/www.html
