application/xhtml+xml in IE

  • Alan J. Flavell

    #16
    Re: application/xhtml+xml in IE

    On Fri, 2 Sep 2005, while I was getting myself up to speed with
    another followup, I now see that Gustaf wrote:
    > If the encoding declaration is omitted, XML only admits UTF-8 or UTF-16.
    >
    > http://www.w3.org/TR/REC-xml/#charencoding

    Thanks - I see that this actually confirms what I had just posted:

    ___
    /
    utf-16 with BOM (where the byte ordering is discerned by reading the
    BOM), utf-16LE and utf-16BE (where the byte ordering is laid down by
    the name of the encoding scheme). I rather suspect that only the
    first of those three schemes is legal XML without using the ?xml
    thingy to specify the encoding (scheme!).
    \___
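
A minimal sketch of the three UTF-16 schemes described above, using Python's built-in codecs as an assumed stand-in for whatever tool writes the file. The helper name `starts_with_utf16_bom` is hypothetical. It only shows that the plain "utf-16" scheme carries a BOM while "utf-16-le" and "utf-16-be" do not, so a reader of the latter two gets no byte-order hint from the data itself.

```python
import codecs

text = "<r/>"

with_bom = text.encode("utf-16")     # Python prepends a BOM in native byte order
little = text.encode("utf-16-le")    # byte order fixed by the scheme name, no BOM
big = text.encode("utf-16-be")       # byte order fixed by the scheme name, no BOM


def starts_with_utf16_bom(data: bytes) -> bool:
    """Return True if the data begins with either UTF-16 byte order mark."""
    return data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))


for label, data in (("utf-16", with_bom), ("utf-16-le", little), ("utf-16-be", big)):
    print(label, starts_with_utf16_bom(data), data)
```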

    cheers


    • Guy Macon

      #17
      Re: application/xhtml+xml in IE




      Gustaf wrote:
      >
      >Guy Macon wrote:
      >
      >> "According to the rules of XML, skipping the XML declaration
      >> is okay only when using either UTF-8 or UTF-16 as character
      >> encoding in the document."
      >
      >> I was under the impression that US-ASCII was OK as well.
      >> Does anyone have a reference for the above?
      >
      >If the encoding declaration is omitted, XML only admits UTF-8 or UTF-16.
      >
      > http://www.w3.org/TR/REC-xml/#charencoding
      >
      >But you can write pure ASCII documents just fine, since characters in
      >the ASCII range are encoded the same in UTF-8. That is, you don't need
      >an XML declaration for pure ASCII documents, since they will be treated
      >as UTF-8 documents.

      On my website (http://www.guymacon.com/) I set my .htaccess so that
      the server response includes:

      Content-Type: text/html; charset=us-ascii

      Then I wrote my markup with no XML declaration (to avoid triggering
      quirks mode in the brain-dead MS browser):

      <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
      <html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
      <head>
      <meta http-equiv="Content-Type" content="text/html; charset=US-ASCII" />
      ...
      (The meta http-equiv is useless, but I don't think it hurts anything.)

      So, unless I am mistaken, I am using US-ASCII character encoding,
      and I have skipped the XML declaration, and yet I am within the
      rules of XML. So would it be fair to say that...

      "According to the rules of XML, skipping the XML declaration
      is okay only when using either UTF-8 or UTF-16 as character
      encoding in the document."

      ....should be changed to...

      "According to the rules of XML, skipping the XML declaration
      is okay only when using US-ASCII, UTF-8 or UTF-16 as the
      character encoding in the document."

      ....?

      Or am I missing something? (It wouldn't be the first time...)



      • Alan J. Flavell

        #18
        Re: application/xhtml+xml in IE

        On Fri, 2 Sep 2005, Guy Macon wrote:

        > > http://www.w3.org/TR/REC-xml/#charencoding
        >
        > On my website (http://www.guymacon.com/) I set my .htaccess so that
        > the server response includes:
        >
        > Content-Type: text/html; charset=us-ascii
        >
        > Then I wrote my markup with no XML declaration

        Read the cited reference carefully again - it includes the following
        at 4.3.3, third paragraph:

        In the absence of external character encoding information (such as
        MIME headers), parsed entities which are stored in an encoding other
        than UTF-8 or UTF-16 MUST begin with a text declaration [...]
        containing an encoding declaration.

        But you *are* providing "external character encoding information", so
        you don't need the ?xml thingy (ahem, the xml "text declaration").

        So, what you are doing is fine, but your explanation of why you are
        doing it was adrift, and your proposed correction to the spec was
        off-beam. Even if the HTTP Content-type had advertised
        charset=iso-8859-2, or Big5 or whatever, you *still* would not have
        needed the ?xml thingy. (Of course, this only works for encoding
        schemes which are supported by the XML processor in question, but
        that applies whichever way you communicate the character encoding to
        them!)
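
        A rough sketch of the precedence described above, not a real XML processor: external information (the HTTP charset) wins, then a UTF-16 BOM, then the encoding declaration, and only then the UTF-8 default. The function name `effective_encoding` is hypothetical, and the declaration regex is a simplification that only works for ASCII-compatible encodings.

```python
import re
from typing import Optional

# Simplified matcher for an encoding declaration at the very start of the entity.
_DECL = re.compile(rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']')


def effective_encoding(http_charset: Optional[str], body: bytes) -> str:
    """Pick the encoding a processor would assume under the rule in XML 4.3.3."""
    if http_charset:
        # External character encoding information: no declaration needed.
        return http_charset.lower()
    if body.startswith((b"\xff\xfe", b"\xfe\xff")):
        return "utf-16"
    if body.startswith(b"\xef\xbb\xbf"):
        body = body[3:]  # skip a UTF-8 BOM before looking for the declaration
    match = _DECL.match(body)
    if match:
        return match.group(1).decode("ascii").lower()
    return "utf-8"


print(effective_encoding("us-ascii", b"<html/>"))
print(effective_encoding(None, b'<?xml version="1.0" encoding="Big5"?><html/>'))
print(effective_encoding(None, b"<html/>"))
```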

        good luck


        • Lachlan Hunt

          #19
          Re: application/xhtml+xml in IE

          Alan J. Flavell wrote:
          > [1] Amusingly, HTTP rules say that the default is iso-8859-1.

          Only for text/* media types (including text/xml). application/*
          (including application/xml and application/xhtml+xml) do not have a
          default charset defined by HTTP rules.

          > So you can present us-ascii to XML, and allow it to assume that it's
          > utf-8, and at the same time present it to HTTP and allow -it- to
          > assume that it's iso-8859-1. and *both of them are correct*, in this
          > special case ;-)

          That's why RFC3023 enforces a US-ASCII default for text/xml, since it is
          a subset of both.
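
A small sketch of the "subset of both" point, using Python's codecs as an assumed illustration: bytes that are pure US-ASCII decode to the same text under us-ascii, utf-8 and iso-8859-1, while any byte outside that range breaks the agreement. The helper name `decodes_identically` is hypothetical.

```python
def decodes_identically(data: bytes,
                        encodings=("us-ascii", "utf-8", "iso-8859-1")) -> bool:
    """Return True if every listed encoding decodes the bytes to the same text."""
    results = set()
    for name in encodings:
        try:
            results.add(data.decode(name))
        except UnicodeDecodeError:
            return False
    return len(results) == 1


print(decodes_identically(b"<p>hello</p>"))               # pure ASCII bytes
print(decodes_identically("\u00e9".encode("iso-8859-1")))  # single byte 0xE9
print(decodes_identically("\u00e9".encode("utf-8")))       # two bytes 0xC3 0xA9
```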

          --
          Lachlan Hunt

          http://GetFirefox.com/ Rediscover the Web
          http://GetThunderbird.com/ Reclaim your Inbox


          • Henri Sivonen

            #20
            Re: application/xhtml+xml in IE

            In article <Pine.LNX.4.62.0509021109370.17651@ppepc56.ph.gla.ac.uk>,
            "Alan J. Flavell" <flavell@ph.gla.ac.uk> wrote:

            > But the character encoding scheme which is advertised from an HTTP
            > server via the MIME "charset=" is authoritative, according to RFC2616,

            Yes.

            > and this attribute should not be omitted according to security alert
            > CA-2000-02,

            That alert did not convince me. Do you have concrete threat scenarios?
            Do they involve encodings that do not map Basic Latin to US-ASCII bytes?

            > so the <?xml thingy should really only be getting *used*
            > in non-HTTP contexts (e.g reading a local file):

            I strongly disagree. There is no security risk in using application/*
            types and internal character encoding information. Even the TAG
            disagrees with RFC 3023.

            The TAG has found "Thus there is no ambiguity when the charset is
            omitted, and the STRONGLY RECOMMENDED injunction [of RFC 3023] to use
            the charset is misplaced for application/xml and for non-text "+xml"
            types." (http://www.w3.org/2001/tag/2004/0430...char-encoding).

            RFC 3023's insistence on declaring the encoding authoritatively outside
            the XML byte stream itself is, in my opinion, as silly as insisting on
            declaring the compression method of a zip archive authoritatively on the
            HTTP level instead of using the information stored in the file.

            > if the ?xml thingy
            > specified an encoding in an HTTP context, that evidently needs to be
            > consistent with what the server's HTTP Content-type header says.[2]

            Yes, if the Content-Type says something about the encoding, it had
            better be consistent with the document.
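
            One way to sketch that consistency check, assuming the header value and the document bytes are both at hand. The function names are hypothetical; the standard library's `email.message.Message` is used only to parse the charset parameter, and encoding aliases (for example "utf8" versus "utf-8") are deliberately not normalized.

```python
import re
from email.message import Message
from typing import Optional


def header_charset(content_type: str) -> Optional[str]:
    """Extract the charset parameter from a Content-Type value, if any."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()  # lower-cased name, or None when absent


def declared_encoding(body: bytes) -> Optional[str]:
    """Extract the encoding from a leading XML declaration (simplified regex)."""
    match = re.match(rb'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']', body)
    return match.group(1).decode("ascii").lower() if match else None


def is_consistent(content_type: str, body: bytes) -> bool:
    """True unless both sources name an encoding and the names differ."""
    http = header_charset(content_type)
    decl = declared_encoding(body)
    return http is None or decl is None or http == decl


print(is_consistent("application/xhtml+xml; charset=utf-8",
                    b'<?xml version="1.0" encoding="UTF-8"?><html/>'))
print(is_consistent("text/html; charset=us-ascii",
                    b'<?xml version="1.0" encoding="iso-8859-1"?><html/>'))
```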

            > [1] Amusingly, HTTP rules say that the default is iso-8859-1.

            But that does not apply to application/* types. For text/xml there is
            RFC 3023, which says US-ASCII. For text/html, there is reality, which
            disagrees. For text/css, the CSS WG has more practical rules.
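
            The defaults as stated in this thread could be summarized in a sketch like the following. The function name `default_charset` is hypothetical, and the table only encodes what is claimed here (RFC 2616 for text/*, RFC 3023 for text/xml, nothing at the HTTP level for application/*); it ignores the text/html and text/css caveats just mentioned.

```python
from typing import Optional


def default_charset(media_type: str) -> Optional[str]:
    """Return the protocol-level default charset when charset= is omitted."""
    mt = media_type.strip().lower()
    if mt == "text/xml":
        return "us-ascii"      # per RFC 3023, as stated in the thread
    if mt.startswith("text/"):
        return "iso-8859-1"    # per RFC 2616, as stated in the thread
    return None                # application/*: no HTTP default; XML's own rules apply


for mt in ("text/xml", "text/plain", "application/xml", "application/xhtml+xml"):
    print(mt, default_charset(mt))
```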

            --
            Henri Sivonen
            hsivonen@iki.fi

            Mozilla Web Author FAQ: http://mozilla.org/docs/web-developer/faq.html


            • Alan J. Flavell

              #21
              Re: application/xhtml+xml in IE

              On Sat, 3 Sep 2005, Lachlan Hunt wrote:

              > Alan J. Flavell wrote:
              > > [1] Amusingly, HTTP rules say that the default is iso-8859-1.
              >
              > Only for text/* media types (including text/xml). application/* (including
              > application/xml and application/xhtml+xml) do not have a default charset
              > defined by HTTP rules.

              Oh yes, you're right: thanks for the correction.

              > > So you can present us-ascii to XML, and allow it to assume that
              > > it's utf-8, and at the same time present it to HTTP and allow -it-
              > > to assume that it's iso-8859-1. and *both of them are correct*, in
              > > this special case ;-)
              >
              > That's why RFC3023 enforces a US-ASCII default for text/xml, since
              > it is a subset of both.

              Yes, I think you'll find that RFC2616 was anomalous in defining that
              default of iso-8859-1 for text/* media types: the normal MIME default
              for text/* was indeed us-ascii, as set out in RFC2046 (which is part
              of the MIME specifications).


              • Lachlan Hunt

                #22
                Re: application/xhtml+xml in IE

                Gustaf wrote:
                > For those interested, I wrote a bit on how to write conformant XHTML 1.1
                > documents (the URL is temporary). Enjoy. :-)
                >
                > http://gusgus.cn/www/xhtml/authoringxhtml11.html

                You have made a number of mistakes in that document.

                1. Triggering standards mode
                Firstly, it's called a DOCTYPE declaration, not a "DOCTYPE tag".

                Secondly, while you are correct that standards/quirks mode will usually
                be triggered by the presence (or absence) of various DOCTYPEs, all
                browsers that support XHTML will use standards mode when the document is
                served as XML, regardless of the DOCTYPE used (even if it's omitted).


                2. The XML declaration

                It is correct that the XML declaration will trigger quirks mode in IE,
                however (as already pointed out in this thread) IE does not support
                application/xhtml+xml (although, it seems that it will sometimes use
                content sniffing if the file has a .html extension). Basically, if
                you're going to serve XHTML with the correct MIME type, IE bugs are
                irrelevant.


                3. Choice of character encoding

                | According to the rules of XML, skipping the XML declaration is okay
                | only when using either UTF-8 or UTF-16 as character encoding in the
                | document.

                That's only true when the encoding is not specified in a higher level
                protocol, such as the HTTP content-type header.


                4. Setting Content-Type in the HTTP headers

                You suggest that authors send this:

                Content-Type: application/xhtml+xml; charset=utf-8

                However, the W3C TAG WG disagree:

                # Good practice: XML and character encodings
                #
                # In general, a representation provider SHOULD NOT specify the character
                # encoding for XML data in protocol headers since the data is
                # self-describing.

                (Note: that only really applies to application/*+xml, not text/xml,
                which they recommend should be avoided)




                5. Setting Content-Type in the meta element

                <meta http-equiv="Content-Type" content="application/xhtml+xml;
                charset=utf-8"/>

                That's completely useless for determining the MIME type, even if the
                file is read from the local file system. Typically, when read from a
                local file system, files with a .htm or .html file extension are
                processed as text/html and files with .xht or .xhtml are processed as
                application/xhtml+xml.

                When it's processed as text/html, UAs are lenient enough with their
                parsing to determine the character encoding, but will not begin
                processing the document as XML. When it's parsed as XML, that's
                completely useless and UAs will obey the XML declaration (if present) or
                default to UTF-8/UTF-16, as defined by the XML rec.


                6. Saving documents

                | it's best to avoid the BOM in UTF-8, because its presence is not
                | supported everywhere.
                | ...
                | 2. UTF-8 documents must not be saved with a BOM.

                That is not true for XML documents. XML processors are required to
                fully support UTF-8 and UTF-16 (including the BOM). That guideline is
                only relevant for serving UTF-8 encoded files as text/html to obsolete
                browsers.
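
As a quick check of that claim against one parser, here is a sketch using Python's standard `xml.etree.ElementTree` (backed by expat), which is only an assumed example of a conforming processor. The same document is parsed as UTF-8 without a BOM, UTF-8 with a BOM, and UTF-16 with a BOM, with no XML declaration in any of them.

```python
import codecs
import xml.etree.ElementTree as ET

doc = "<greeting>h\u00e9llo</greeting>"

utf8_plain = doc.encode("utf-8")
utf8_bom = codecs.BOM_UTF8 + utf8_plain   # same bytes with a leading UTF-8 BOM
utf16_bom = doc.encode("utf-16")          # Python adds the UTF-16 BOM itself

# Each variant should yield the same element text if the BOM is handled.
texts = [ET.fromstring(data).text for data in (utf8_plain, utf8_bom, utf16_bom)]
print(texts)
```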

                By the way, SuperEdi is another good editor that supports Unicode, and
                even includes an option to omit the BOM.

                --
                Lachlan Hunt

                http://GetFirefox.com/ Rediscover the Web
                http://GetThunderbird.com/ Reclaim your Inbox


                • Alan J. Flavell

                  #23
                  Re: application/xhtml+xml in IE

                  On Sat, 3 Sep 2005, Henri Sivonen wrote:

                  > "Alan J. Flavell" <flavell@ph.gla.ac.uk> wrote:
                  >
                  > > and this attribute should not be omitted according to security alert
                  > > CA-2000-02,
                  >
                  > That alert did not convince me. Do you have concrete threat scenarios?

                  Sorry, I can't offer anything specific, beyond what it says at
                  http://www.cert.org/tech_tips/malici...igation.html#3

                  I've seen a case where dangerous markup was sneaked into a web page by
                  a technique of this kind, but I can no longer give you chapter and
                  verse, sorry. I don't think it's become a popular attack scenario.

                  I recognised that it's a complex issue, and that the Apache folks
                  responded by related changes in their default configurations, and
                  concluded it would be best to follow their advice as far as possible.
                  But this is for text/* MIME types.

                  > > so the <?xml thingy should really only be getting *used*
                  > > in non-HTTP contexts (e.g reading a local file):
                  >
                  > I strongly disagree.

                  Well, I've been caught-out on the difference between text/* and
                  application/* data types, so I'm in no position to argue... But just
                  to clarify what I was trying to say there:

                  * if the ?xml encoding specifier is /present/, (which you presumably
                  favour), it will be overridden when the HTTP Content-type also
                  specifies the charset= attribute. Then, the ?xml encoding specifier
                  can do nothing better than to repeat what is already known and
                  authoritative from HTTP: in that sense, it is not actually /used/,
                  even though it's present.

                  > There is no security risk in using application/*
                  > types and internal character encoding information.

                  I can't see any reason to dispute that. The CERT CA alert relates
                  specifically to text/html, and at its widest to text/* MIME types.

                  Thanks for the corrections.


                  • Henri Sivonen

                    #24
                    Re: application/xhtml+xml in IE

                    In article <Pine.LNX.4.62.0509031109120.9224@ppepc56.ph.gla.ac.uk>,
                    "Alan J. Flavell" <flavell@ph.gla.ac.uk> wrote:

                    > Sorry, I can't offer anything specific, beyond what it says at
                    > http://www.cert.org/tech_tips/malici...igation.html#3

                    Thanks.

                    It seems to me that server-side programs including tainted snippets of
                    text on the byte level are a major part of the problem and could be avoided if
                    the server-side programs operated on the character level internally (in
                    which case they'd have to perform a bytes to characters conversion of
                    the tainted text before doing anything with it).
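
                    A minimal sketch of that idea, assuming the page's charset is known on the server: decode the tainted bytes into characters first, escape at the character level, and only then encode the page again. The function name `render_comment` and the surrounding `<p>` markup are hypothetical.

```python
import html


def render_comment(tainted: bytes, charset: str = "utf-8") -> bytes:
    """Insert user-supplied bytes into markup after a bytes-to-characters step."""
    # Strict decoding rejects byte sequences that are invalid in the page's charset.
    text = tainted.decode(charset, errors="strict")
    # Escaping happens on characters, so markup-significant ones cannot slip through.
    safe = html.escape(text, quote=True)
    page = "<p>" + safe + "</p>"
    return page.encode(charset)


print(render_comment(b"<script>alert(1)</script>"))
```

                    Input that is not valid in the declared charset raises `UnicodeDecodeError` here instead of being passed through untouched.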

                    --
                    Henri Sivonen
                    hsivonen@iki.fi

                    Mozilla Web Author FAQ: http://mozilla.org/docs/web-developer/faq.html
