application/xhtml+xml in IE

**Alan J. Flavell** · Sep 2 '05, 10:45 AM

Re: application/xhtml+xml in IE

On Fri, 2 Sep 2005, while I was getting myself up to speed with
another followup, I now see that Gustaf wrote:
> If the encoding declaration is omitted, XML only admits UTF-8 or UTF-16.
>
> http://www.w3.org/TR/REC-xml/#charencoding

Thanks - I see that this actually confirms what I had just posted:

___
/
utf-16 with BOM (where the byte ordering is discerned by reading the
BOM), utf-16LE and utf-16BE (where the byte ordering is laid down by
the name of the encoding scheme). I rather suspect that only the
first of those three schemes is legal XML without using the ?xml
thingy to specify the encoding (scheme!).
\___

cheers

**Guy Macon** · Sep 2 '05, 11:05 AM

Re: application/xhtml+xml in IE

Gustaf wrote:
>
>Guy Macon wrote:
>
>> "According to the rules of XML, skipping the XML declaration
>> is okay only when using either UTF-8 or UTF-16 as character
>> encoding in the document."
>
>> I was under the impression that US-ASCII was OK as well.
>> Does anyone have a reference for the above?
>
>If the encoding declaration is omitted, XML only admits UTF-8 or UTF-16.
>
> http://www.w3.org/TR/REC-xml/#charencoding
>
>But you can write pure ASCII documents just fine, since characters in
>the ASCII range are encoded the same in UTF-8. That is, you don't need
>an XML declaration for pure ASCII documents, since they will be treated
>as UTF-8 documents.

On my website (http://www.guymacon.com/) I set my .htaccess so that
the server response includes:

Content-Type: text/html; charset=us-ascii

Then I wrote my markup with no XML declaration (to avoid triggering
quirks mode in the brain-dead MS browser):

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=US-ASCII" />
...
(The meta http-equiv is useless, but I don't think it hurts anything.)

So, unless I am mistaken, I am using US-ASCII character encoding,
and I have skipped the XML declaration, and yet I am within the
rules of XML. So would it be fair to say that...

"According to the rules of XML, skipping the XML declaration
is okay only when using either UTF-8 or UTF-16 as character
encoding in the document."

...should be changed to...

"According to the rules of XML, skipping the XML declaration
is okay only when using US-ASCII, UTF-8 or UTF-16 as the
character encoding in the document."

...?

Or am I missing something? (It wouldn't be the first time...)

**Alan J. Flavell** · Sep 2 '05, 11:55 AM

Re: application/xhtml+xml in IE

On Fri, 2 Sep 2005, Guy Macon wrote:
> > http://www.w3.org/TR/REC-xml/#charencoding
>
> On my website (http://www.guymacon.com/) I set my .htaccess so that
> the server response includes:
>
> Content-Type: text/html; charset=us-ascii
>
> Then I wrote my markup with no XML declaration

Read the cited reference carefully again - it includes the following
at 4.3.3, third paragraph:

In the absence of external character encoding information (such as
MIME headers), parsed entities which are stored in an encoding other
than UTF-8 or UTF-16 MUST begin with a text declaration [...]
containing an encoding declaration.

But you *are* providing "external character encoding information", so
you don't need the ?xml thingy (ahem, the xml "text declaration").

So, what you are doing is fine, but your explanation of why you are
doing it was adrift, and your proposed correction to the spec was
off-beam. Even if the HTTP Content-type had advertised
charset=iso-8859-2, or Big5 or whatever, you *still* would not have
needed the ?xml thingy. (Of course, this only works for encoding
schemes which are supported by the XML processor in question, but
that applies whichever way you communicate the character encoding to
them!)
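
The precedence described above can be sketched in Python. This is only an illustration under stated assumptions: `pick_encoding` and `XML_DECL_ENCODING` are hypothetical names, the sketch ignores BOM sniffing and any per-media-type defaults, and it is not a substitute for a real XML processor's detection logic.

```python
import re
from email.message import Message

# Hypothetical pattern: pulls the encoding name out of a leading <?xml ...?>.
XML_DECL_ENCODING = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def pick_encoding(content_type, body):
    """Return (encoding, source) for an XML entity delivered with a MIME header.

    Simplified: no BOM handling, no UTF-16 detection.
    """
    header = Message()
    header["Content-Type"] = content_type
    charset = header.get_content_charset()  # None when no charset= is given
    if charset:
        # External information is present, so no declaration is required.
        return charset, "external"
    match = XML_DECL_ENCODING.match(body)
    if match:
        return match.group(1).decode("ascii").lower(), "declaration"
    # Neither external information nor a declaration: assume UTF-8
    # (UTF-16 would be signalled by a BOM, which this sketch ignores).
    return "utf-8", "default"
```

With `charset=iso-8859-2` in the header, the function reports that encoding as "external" whether or not the body carries a declaration, which is the point being made about the proposed correction.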

good luck

**Lachlan Hunt** · Sep 3 '05, 06:25 AM

Re: application/xhtml+xml in IE

Alan J. Flavell wrote:
> [1] Amusingly, HTTP rules say that the default is iso-8859-1.

Only for text/* media types (including text/xml). application/*
(including application/xml and application/xhtml+xml) do not have a
default charset defined by HTTP rules.

> So you can present us-ascii to XML, and allow it to assume that it's
> utf-8, and at the same time present it to HTTP and allow -it- to
> assume that it's iso-8859-1. and *both of them are correct*, in this
> special case ;-)

That's why RFC3023 enforces a US-ASCII default for text/xml, since it is
a subset of both.
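
The "subset of both" point can be checked with a few lines of Python. This is a minimal sketch using only the standard codecs; the sample string is an arbitrary assumption chosen for illustration.

```python
text = "<p>plain ASCII markup</p>"
raw = text.encode("us-ascii")

# The same bytes decode to the same characters under all three labels.
decoded = {name: raw.decode(name) for name in ("us-ascii", "utf-8", "iso-8859-1")}
all_same = all(value == text for value in decoded.values())

# Outside the ASCII range the encodings stop agreeing.
e_acute_utf8 = "\u00e9".encode("utf-8")         # two bytes
e_acute_latin1 = "\u00e9".encode("iso-8859-1")  # one byte
```

As long as a document stays within ASCII, a recipient assuming UTF-8 and one assuming ISO-8859-1 both read it correctly; the first non-ASCII character ends that agreement.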

--
Lachlan Hunt

http://lachy.id.au/

http://GetFirefox.com/ Rediscover the Web
http://GetThunderbird.com/ Reclaim your Inbox

**Henri Sivonen** · Sep 3 '05, 09:15 AM

Re: application/xhtml+xml in IE

In article <Pine.LNX.4.62.0509021109370.17651@ppepc56.ph.gla.ac.uk>,
"Alan J. Flavell" <flavell@ph.gla.ac.uk> wrote:

> But the character encoding scheme which is advertised from an HTTP
> server via the MIME "charset=" is authoritative, according to RFC2616,

Yes.

> and this attribute should not be omitted according to security alert
> CA-2000-02,

That alert did not convince me. Do you have concrete threat scenarios?
Do they involve encodings that do not map Basic Latin to US-ASCII bytes?

> so the <?xml thingy should really only be getting *used*
> in non-HTTP contexts (e.g reading a local file):

I strongly disagree. There is no security risk in using application/*
types and internal character encoding information. Even the TAG
disagrees with RFC 3023.

The TAG has found "Thus there is no ambiguity when the charset is
omitted, and the STRONGLY RECOMMENDED injunction [of RFC 3023] to use
the charset is misplaced for application/xml and for non-text "+xml"
types." (http://www.w3.org/2001/tag/2004/0430...char-encoding).

RFC 3023's insistence on declaring the encoding authoritatively outside
the XML byte stream itself is, in my opinion, as silly as insisting on
declaring the compression method of a zip archive authoritatively on the
HTTP level instead of using the information stored in the file.

> if the ?xml thingy
> specified an encoding in an HTTP context, that evidently needs to be
> consistent with what the server's HTTP Content-type header says.[2]

Yes, if the Content-Type says something about the encoding, it had
better be consistent with the document.

> [1] Amusingly, HTTP rules say that the default is iso-8859-1.

But that does not apply to application/* types. For text/xml there is
RFC 3023, which says US-ASCII. For text/html, there is reality, which
disagrees. For text/css, the CSS WG has more practical rules.
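
The defaults mentioned in this thread can be summarised as a small lookup. This is a sketch only: `default_charset` is a hypothetical helper, it covers just the cases discussed here (text/xml per RFC 3023, other text/* per RFC 2616, application/* with no HTTP-level default), and it says nothing about what real browsers do with text/html or text/css.

```python
def default_charset(media_type):
    """Charset implied when the Content-Type header has no charset parameter.

    Returns None when no protocol-level default applies and the XML
    processor's own detection (BOM, XML declaration, UTF-8) takes over.
    """
    media_type = media_type.strip().lower()
    if media_type == "text/xml":
        return "us-ascii"    # RFC 3023
    if media_type.startswith("text/"):
        return "iso-8859-1"  # RFC 2616 default for text/* (on paper)
    return None              # e.g. application/xml, application/xhtml+xml
```

For example, `default_charset("application/xhtml+xml")` returns `None`, reflecting that the HTTP default of iso-8859-1 does not apply to application/* types.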

--
Henri Sivonen
hsivonen@iki.fi

http://hsivonen.iki.fi/

Mozilla Web Author FAQ: http://mozilla.org/docs/web-developer/faq.html

**Alan J. Flavell** · Sep 3 '05, 10:15 AM

Re: application/xhtml+xml in IE

On Sat, 3 Sep 2005, Lachlan Hunt wrote:
> Alan J. Flavell wrote:
> > [1] Amusingly, HTTP rules say that the default is iso-8859-1.
>
> Only for text/* media types (including text/xml). application/* (including
> application/xml and application/xhtml+xml) do not have a default charset
> defined by HTTP rules.

Oh yes, you're right: thanks for the correction.

> > So you can present us-ascii to XML, and allow it to assume that
> > it's utf-8, and at the same time present it to HTTP and allow -it-
> > to assume that it's iso-8859-1. and *both of them are correct*, in
> > this special case ;-)
>
> That's why RFC3023 enforces a US-ASCII default for text/xml, since
> it is a subset of both.

Yes, I think you'll find that RFC2616 was anomalous in defining that
default of iso-8859-1 for text/* media types: the normal MIME default
for text/* was indeed us-ascii, as set out in RFC2046 (which is part
of the MIME specifications).

**Lachlan Hunt** · Sep 3 '05, 10:45 AM

Re: application/xhtml+xml in IE

Gustaf wrote:
> For those interested, I wrote a bit on how to write conformant XHTML 1.1
> documents (the URL is temporary). Enjoy. :-)
>
> http://gusgus.cn/www/xhtml/authoringxhtml11.html

You have made a number of mistakes in that document.

1. Triggering standards mode
Firstly, it's called a DOCTYPE declaration, not a "DOCTYPE tag".

Secondly, while you are correct that standards/quirks mode will usually
be triggered by the presence (or absence) of various DOCTYPEs, all
browsers that support XHTML will use standards mode when the document is
served as XML, regardless of the DOCTYPE used (even if it's omitted).

2. The XML declaration

It is correct that the XML declaration will trigger quirks mode in IE,
however (as already pointed out in this thread) IE does not support
application/xhtml+xml (although, it seems that it will sometimes use
content sniffing if the file has a .html extension). Basically, if
you're going to serve XHTML with the correct MIME type, IE bugs are
irrelevant.

3. Choice of character encoding

| According to the rules of XML, skipping the XML declaration is okay
| only when using either UTF-8 or UTF-16 as character encoding in the
| document.

That's only true when the encoding is not specified in a higher level
protocol, such as the HTTP content-type header.

4. Setting Content-Type in the HTTP headers

You suggest that authors send this:

Content-Type: application/xhtml+xml; charset=utf-8

However, the W3C TAG WG disagree:

# Good practice: XML and character encodings
#
# In general, a representation provider SHOULD NOT specify the character
# encoding for XML data in protocol headers since the data is
# self-describing.

(Note: that only really applies to application/*+xml, not text/xml,
which they recommend should be avoided)

Architecture of the World Wide Web, Volume One

http://www.w3.org/TR/2004/REC-webarch-20041215/#xml-media-types

5. Setting Content-Type in the meta element

<meta http-equiv="Content-Type" content="application/xhtml+xml;
charset=utf-8"/>

That's completely useless for determining the MIME type, even if the
file is read from the local file system. Typically, when read from a
local file system, files with a .htm or .html file extension are
processed as text/html and files with .xht or .xhtml are processed as
application/xhtml+xml.

When it's processed as text/html, UAs are lenient enough with their
parsing to determine the character encoding, but will not begin
processing the document as XML. When it's parsed as XML, that's
completely useless and UAs will obey the XML declaration (if present) or
default to UTF-8/UTF-16, as defined by the XML rec.
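
The XML side of this can be observed with any conforming parser. Below is a minimal sketch using Python's standard `xml.etree.ElementTree` (an expat-based parser, not a browser); the sample documents are assumptions made up for the illustration.

```python
import xml.etree.ElementTree as ET

LATIN1_E_ACUTE = b"\xe9"  # e-acute in ISO-8859-1; not valid UTF-8 on its own

# Same payload, once with an encoding declaration and once without.
declared = (
    b'<?xml version="1.0" encoding="iso-8859-1"?><p>' + LATIN1_E_ACUTE + b"</p>"
)
undeclared = b"<p>" + LATIN1_E_ACUTE + b"</p>"

# With the declaration, the parser decodes the byte as ISO-8859-1.
declared_text = ET.fromstring(declared).text

# Without it, the parser assumes UTF-8 and rejects the stray byte.
try:
    ET.fromstring(undeclared)
    undeclared_ok = True
except ET.ParseError:
    undeclared_ok = False
```

No meta element plays any part here: the parser looks only at the bytes and the XML declaration.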

6. Saving documents

| it's best to avoid the BOM in UTF-8, because its presence is not
| supported everywhere.
| ...
| 2. UTF-8 documents must not be saved with a BOM.

That is not true for XML documents. XML processors are required to
fully support UTF-8 and UTF-16 (including the BOM). That guideline is
only relevant for serving UTF-8 encoded files as text/html to obsolete
browsers.
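
A quick way to see BOM handling on the XML side is to feed BOM-prefixed bytes to a parser. This sketch again uses Python's `xml.etree.ElementTree` as a stand-in for "an XML processor"; the sample content is an assumption for illustration.

```python
import codecs
import xml.etree.ElementTree as ET

# UTF-8 with an explicit BOM, and UTF-16 (Python's "utf-16" codec
# prepends a BOM automatically). Neither document has an XML declaration.
utf8_with_bom = codecs.BOM_UTF8 + "<p>caf\u00e9</p>".encode("utf-8")
utf16_with_bom = "<p>caf\u00e9</p>".encode("utf-16")

utf8_text = ET.fromstring(utf8_with_bom).text
utf16_text = ET.fromstring(utf16_with_bom).text
```

Both documents parse to the same text, so for XML consumers the BOM is a non-issue; the caution about BOMs applies to the text/html path.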

By the way, SuperEdi is another good editor that supports Unicode, and
even includes an option to omit the BOM.

--
Lachlan Hunt

http://lachy.id.au/

http://GetFirefox.com/ Rediscover the Web
http://GetThunderbird.com/ Reclaim your Inbox

**Alan J. Flavell** · Sep 3 '05, 10:45 AM

Re: application/xhtml+xml in IE

On Sat, 3 Sep 2005, Henri Sivonen wrote:
> "Alan J. Flavell" <flavell@ph.gla.ac.uk> wrote:
>
> > and this attribute should not be omitted according to security alert
> > CA-2000-02,
>
> That alert did not convince me. Do you have concrete threat scenarios?

Sorry, I can't offer anything specific, beyond what it says at

http://www.cert.org/tech_tips/malicious_code_mitigation.html#3

I've seen a case where dangerous markup was sneaked into a web page by
a technique of this kind, but I can no longer give you chapter and
verse, sorry. I don't think it's become a popular attack scenario.

I recognised that it's a complex issue, and that the Apache folks
responded by related changes in their default configurations, and
concluded it would be best to follow their advice as far as possible.
But this is for text/* MIME types.

> > so the <?xml thingy should really only be getting *used*
> > in non-HTTP contexts (e.g reading a local file):
>
> I strongly disagree.

Well, I've been caught-out on the difference between text/* and
application/* data types, so I'm in no position to argue... But just
to clarify what I was trying to say there:

* if the ?xml encoding specifier is /present/ (which you presumably
favour), it will be overridden when the HTTP Content-type also
specifies the charset= attribute. Then, the ?xml encoding specifier
can do nothing better than to repeat what is already known and
authoritative from HTTP: in that sense, it is not actually /used/,
even though it's present.

> There is no security risk in using application/*
> types and internal character encoding information.

I can't see any reason to dispute that. The CERT CA alert relates
specifically to text/html, and at its widest to text/* MIME types.

Thanks for the corrections.

**Henri Sivonen** · Sep 4 '05, 08:35 AM

Re: application/xhtml+xml in IE

In article <Pine.LNX.4.62.0509031109120.9224@ppepc56.ph.gla.ac.uk>,
"Alan J. Flavell" <flavell@ph.gla.ac.uk> wrote:

> Sorry, I can't offer anything specific, beyond what it says at
> http://www.cert.org/tech_tips/malici...igation.html#3

Thanks.

It seems to me that server-side programs including tainted snippets of
text on the byte level are a major part of the problem, which could be
avoided if the server-side programs operated on the character level
internally (in which case they'd have to perform a bytes-to-characters
conversion of the tainted text before doing anything with it).
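
To illustrate the byte-level versus character-level distinction, here is a small Python sketch. The choice of UTF-7 is an assumption made for the example (the thread does not name a specific encoding), and the payload is made up; the point is only that a byte-level scan can miss markup that appears once the bytes are decoded.

```python
import html

# Hypothetical tainted input: plain ASCII bytes with no "<" anywhere.
tainted = b"+ADw-script+AD4-alert(1)+ADw-/script+AD4-"

# A byte-level check for "<" sees nothing suspicious.
byte_level_safe = b"<" not in tainted

# Converting bytes to characters first (here assuming the recipient
# would interpret the bytes as UTF-7) reveals the markup.
as_chars = tainted.decode("utf-7")
char_level_safe = "<" not in as_chars

# Escaping at the character level neutralises it.
escaped = html.escape(as_chars)
```

Operating on decoded characters, and then emitting output in a single declared encoding, removes the ambiguity that the byte-level approach leaves open.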

--
Henri Sivonen
hsivonen@iki.fi

http://hsivonen.iki.fi/

Mozilla Web Author FAQ: http://mozilla.org/docs/web-developer/faq.html
