xml parsing escape characters

**Martin v. Löwis** · Jul 18 '05, 08:27 PM

Re: xml parsing escape characters

Luis P. Mendes wrote:
> I get the following result:
>
> <?xml version="1.0" encoding="utf-8"?>
> <string xmlns="http://www......">&lt;DataSet&gt;
> ~ &lt;Order&gt;

Most likely, this result is correct, and your document
really does contain

&lt;Order&gt;

> I don't get any elements. But, if I access the same url via a browser,
> the result in the browser window is something like:
>
> <string xmlns="http://www......">
> ~ <DataSet>

Most likely, your browser is incorrect (or at least confusing), and
renders &lt; as "<", even though this is not markup.

> I already browsed the web, I know it's about the escape characters, but
> I didn't find a simple solution for this.

Not sure what "this" is. AFAICT, everything works correctly.

Regards,
Martin

**Luis P. Mendes** · Jul 18 '05, 08:28 PM

Re: xml parsing escape characters

this is the xml document:

<?xml version="1.0" encoding="utf-8"?>
<string xmlns="http://www......">&lt;DataSet&gt;
~ &lt;Order&gt;
~ &lt;Customer&gt;439&lt;/Customer&gt;
(... others ...)
~ &lt;/Order&gt;
&lt;/DataSet&gt;</string>

When I do:

print xmldoc.toxml()

it prints:
<?xml version="1.0" ?>
<string xmlns="http://www...">&lt;DataSet&gt;
~ &lt;Order&gt;
~ &lt;Customer&gt;439&lt;/Customer&gt;

~ &lt;/Order&gt;
&lt;/DataSet&gt;</string>

____________________________________________________________
with: stringNode = xmldoc.childNodes[0]
print stringNode.toxml()
I get:
<string xmlns="http://www.......">&lt;DataSet&gt;
~ &lt;Order&gt;
~ &lt;Customer&gt;439&lt;/Customer&gt;

~ &lt;/Order&gt;
&lt;/DataSet&gt;</string>
____________________________________________________________

with: DataSetNode = stringNode.childNodes[0]
print DataSetNode.toxml()

I get:

&lt;DataSet&gt;
~ &lt;Order&gt;
~ &lt;Customer&gt;439&lt;/Customer&gt;

~ &lt;/Order&gt;
&lt;/DataSet&gt;
____________________________________________________________

so far so good, but when I issue the command:

print DataSetNode.childNodes[0]

I get:
IndexError: tuple index out of range

Why the error, and why does it return a tuple?
Why doesn't it return:
&lt;Order&gt;
&lt;Customer&gt;439&lt;/Customer&gt;

&lt;/Order&gt;
??

**Kent Johnson** · Jul 18 '05, 08:28 PM

Re: xml parsing escape characters

Luis P. Mendes wrote:
> this is the xml document:
>
> <?xml version="1.0" encoding="utf-8"?>
> <string xmlns="http://www......">&lt;DataSet&gt;
> ~ &lt;Order&gt;
> ~ &lt;Customer&gt;439&lt;/Customer&gt;
> (... others ...)
> ~ &lt;/Order&gt;
> &lt;/DataSet&gt;</string>

This is an XML document containing a single tag, <string>, whose content is text containing
entity-escaped XML.

This is *not* an XML document containing tags <DataSet>, <Order>, <Customer>, etc.

All the behaviour you are seeing is a consequence of this. You need to unescape the contents of the
<string> tag to be able to treat it as structured XML.

Kent

**Irmen de Jong** · Jul 18 '05, 08:28 PM

Re: xml parsing escape characters

Kent Johnson wrote:
[...]
> This is an XML document containing a single tag, <string>, whose content
> is text containing entity-escaped XML.
>
> This is *not* an XML document containing tags <DataSet>, <Order>,
> <Customer>, etc.
>
> All the behaviour you are seeing is a consequence of this. You need to
> unescape the contents of the <string> tag to be able to treat it as
> structured XML.

The unescaping is usually done for you by the xml parser that you use.

--Irmen

**Kent Johnson** · Jul 18 '05, 08:28 PM

Re: xml parsing escape characters

Irmen de Jong wrote:
> Kent Johnson wrote:
> [...]
>
>> This is an XML document containing a single tag, <string>, whose
>> content is text containing entity-escaped XML.
>>
>> This is *not* an XML document containing tags <DataSet>, <Order>,
>> <Customer>, etc.
>>
>> All the behaviour you are seeing is a consequence of this. You need to
>> unescape the contents of the <string> tag to be able to treat it as
>> structured XML.
>
>
> The unescaping is usually done for you by the xml parser that you use.

Yes, so if your XML contains for example
<stuff>&lt;not a tag&gt;</stuff>

and you parse this and ask for the *text* content of the <stuff> tag, you will get the string
"<not a tag>"

but it's still *not* a tag. If you try to get child elements of the <stuff> element there will be none.

This is exactly the confusion the OP has.
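
A minimal sketch of the `<stuff>` example above, assuming Python 3 and the standard library's `xml.dom.minidom` (the thread itself uses Python 2 `print` statements). It only illustrates the point being made: the parser unescapes the text, but the result is character data, not an element.

```python
from xml.dom import minidom

# The document stores "<not a tag>" as escaped character data.
doc = minidom.parseString("<stuff>&lt;not a tag&gt;</stuff>")
stuff = doc.documentElement

# The parser unescapes the entities, but what comes back is a Text node.
text_node = stuff.firstChild
is_text = text_node.nodeType == text_node.TEXT_NODE
print(is_text)
print(text_node.data)

# Looking for child *elements* under <stuff> finds nothing.
child_elements = [n for n in stuff.childNodes if n.nodeType == n.ELEMENT_NODE]
print(child_elements)
```
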
>
> --Irmen

**Martin v. Löwis** · Jul 18 '05, 08:28 PM

Re: xml parsing escape characters

Luis P. Mendes wrote:
> with: DataSetNode = stringNode.childNodes[0]
> print DataSetNode.toxml()
>
> I get:
>
> &lt;DataSet&gt;
> ~ &lt;Order&gt;
> ~ &lt;Customer&gt;439&lt;/Customer&gt;
>
> ~ &lt;/Order&gt;
> &lt;/DataSet&gt;
> ____________________________________________________________
>
> so far so good, but when I issue the command:
>
> print DataSetNode.childNodes[0]
>
> I get:
> IndexError: tuple index out of range
>
> Why the error, and why does it return a tuple?

The DataSetNode has no children, because it is not
an Element node, but a Text node. In XML, an element
is denoted by

<DataSet>...</DataSet>

and *not* by

&lt;DataSet&gt;...&lt;/DataSet&gt;

The latter is just a single string, represented
in XML as a Text node. It does not give you any
hierarchy whatsoever.

As a text node does not have any children, its
childNodes member is an empty tuple; indexing
that tuple gives you an IndexError.
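
The behaviour described above can be reproduced with a short sketch. It assumes Python 3 and `xml.dom.minidom`; the namespace URL is a placeholder, since the real one is elided in the thread, and the variable names simply mirror the OP's.

```python
from xml.dom import minidom

# Placeholder namespace; the original URL is not given in the thread.
doc = minidom.parseString(
    '<string xmlns="http://example.invalid/ns">'
    "&lt;DataSet&gt;&lt;/DataSet&gt;</string>"
)
string_node = doc.childNodes[0]             # the <string> element
data_set_node = string_node.childNodes[0]   # despite the name, a Text node

is_text = data_set_node.nodeType == data_set_node.TEXT_NODE
child_count = len(data_set_node.childNodes)
print(is_text, child_count)

# Indexing the empty childNodes sequence raises IndexError.
try:
    data_set_node.childNodes[0]
    raised = False
except IndexError as exc:
    raised = True
    print("IndexError:", exc)
```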

Regards,
Martin

**Martin v. Löwis** · Jul 18 '05, 08:28 PM

Re: xml parsing escape characters

Irmen de Jong wrote:
> The unescaping is usually done for you by the xml parser that you use.

Usually, but not in this case. If you have a text that looks like
XML, and you want to put it into an XML element, the XML file uses
&lt; and &gt;. The XML parser unescapes that as < and >. However, it
does not then consider the < and > as markup, and it shouldn't.

Regards,
Martin

**Irmen de Jong** · Jul 18 '05, 08:29 PM

Re: xml parsing escape characters

Martin v. Löwis wrote:
> Irmen de Jong wrote:
>
>> The unescaping is usually done for you by the xml parser that you use.
>
>
> Usually, but not in this case. If you have a text that looks like
> XML, and you want to put it into an XML element, the XML file uses
> &lt; and &gt;. The XML parser unescapes that as < and >. However, it
> does not then consider the < and > as markup, and it shouldn't.

That's also what I said?

The unescaping of the XML entities in the contents of the OP's
<string> element is done for you by the parser,
so you will get a text node with the <, >, &, whatever in there.
The OP probably wants to feed that to a new xml parser instance
to process it as markup.
Or perhaps the way the original XML document is constructed is
flawed.

--Irmen

**Martin v. Löwis** · Jul 18 '05, 08:29 PM

Re: xml parsing escape characters

Irmen de Jong wrote:
>> Usually, but not in this case. If you have a text that looks like
>> XML, and you want to put it into an XML element, the XML file uses
>> &lt; and &gt;. The XML parser unescapes that as < and >. However, it
>> does not then consider the < and > as markup, and it shouldn't.
>
>
> That's also what I said?

You said it in response to
>>> All the behaviour you are seeing is a consequence of this. You need
>>> to unescape the contents of the <string> tag to be able to treat it
>>> as structured XML.

In that context, I interpreted
>> The unescaping is usually done for you by the xml parser that you
>> use.

as "The parser should have done what you want; if the parser didn't,
that is a bug in the parser".

> The OP probably wants to feed that to a new xml parser instance
> to process it as markup.
> Or perhaps the way the original XML document is constructed is
> flawed.

Either of these, indeed - probably the latter.

Regards,
Martin

**Luis P. Mendes** · Jul 18 '05, 08:29 PM

Re: xml parsing escape characters

I would like to thank everyone for your answers, but I'm not seeing the
light yet!

When I access the url via the Firefox browser and look into the source
code, I also get:

<?xml version="1.0" encoding="utf-8"?>
<string xmlns="http.... ............">&lt;DataSet&gt;
~ &lt;Order&gt;
~ &lt;Customer&gt;439&lt;/Customer&gt;
~ &lt;/Order&gt;
&lt;/DataSet&gt;</string>

Should I take the contents of the string tag, which is text, replace
all '&lt' with '<' and '&gt' with '>', and then read it with
xml.dom.minidom? How do I do that?

Or should I use another parser that accomplishes the task with no need
to replace the escaped characters?

**Martin v. Löwis** · Jul 18 '05, 08:29 PM

Re: xml parsing escape characters

Luis P. Mendes wrote:
> When I access the url via the Firefox browser and look into the source
> code, I also get:
>
> <?xml version="1.0" encoding="utf-8"?>
> <string xmlns="http.... ............">&lt;DataSet&gt;
> ~ &lt;Order&gt;
> ~ &lt;Customer&gt;439&lt;/Customer&gt;
> ~ &lt;/Order&gt;
> &lt;/DataSet&gt;</string>

Please do try to understand what you are seeing. This is crucial for
understanding what happens.

You may have the understanding that XML can be represented as a tree.
This would be good - if not, please read a book that explains why
XML can be considered as a tree.

In the tree, you have inner nodes, and leaf nodes. For example,
the document

<a>
<b>Hello</b>
<c>World</c>
</a>

has 5 nodes (ignoring whitespace content):

Element:a ---- Element:b ---- Text:"Hello"
          |
          \-- Element:c ---- Text:"World"

So the leaf nodes are typically Text nodes (unless you
have an empty element). Your document has this structure:

Element:string ---- Text:"""<DataSet>
<Order>
<Customer>439</Customer>
</Order>
</DataSet>"""

So the ***TEXT*** contains the letter "<", just like it contains
the letters "O" and "r". There IS no element Order in your document,
no matter how hard you look.
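
To make the two tree shapes concrete, here is a small sketch, assuming Python 3 and `xml.dom.minidom`. The `dump` helper is hypothetical (written for this illustration, not part of any library), and the escaped sample is a shortened stand-in for the OP's document.

```python
from xml.dom import minidom

def dump(node, depth=0):
    """Return one line per Element or non-blank Text node, indented by depth."""
    lines = []
    indent = "  " * depth
    if node.nodeType == node.ELEMENT_NODE:
        lines.append(f"{indent}Element:{node.tagName}")
        for child in node.childNodes:
            lines.extend(dump(child, depth + 1))
    elif node.nodeType == node.TEXT_NODE and node.data.strip():
        lines.append(f"{indent}Text:{node.data.strip()!r}")
    return lines

# Real markup: three elements and two text nodes.
plain = minidom.parseString("<a><b>Hello</b><c>World</c></a>")
plain_lines = dump(plain.documentElement)

# Escaped markup: one element holding one text node.
escaped = minidom.parseString(
    "<string>&lt;DataSet&gt;&lt;Order&gt;&lt;/Order&gt;&lt;/DataSet&gt;</string>"
)
escaped_lines = dump(escaped.documentElement)

print("\n".join(plain_lines))
print("\n".join(escaped_lines))
```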

If you want a DataSet *element* in your document, it should
read

<string xmlns="...">
<DataSet>
<Order>
<Customer>439</Customer>
</Order>
</DataSet>
</string>

As this is the document you apparently want to process, complain
to whoever gave you that other document.

> should I take the contents of the string tag that is text and replace
> all '&lt' with '<' and '&gt' with '>' and then read it with xml.dom.minidom?

No. We still don't know what you want to achieve, so it is difficult to
advise you what to do. My best advice is that whoever generates the XML
document should fix it.

> or should I use another parser that accomplishes the task with no need
> to replace the escaped characters?

No. The parser is working correctly.

The document you got can also be interpreted as containing another
XML document as a text. This is evil, but apparently people are doing
it, anyway. If you really want that embedded document, you need
first to extract it.

To see what I mean, do

print DataSetNode.data

The .data attribute gives you the string contents of
a text node. You could use this as an XML document, and
pass it again to an XML parser. This would be ugly,
but might be your only choice if the producer of the
document is unwilling to adjust.
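
A rough sketch of that two-pass approach, assuming Python 3 and `xml.dom.minidom`. The namespace URL is a placeholder (the real one is elided in the thread) and the sample document is cut down to one order. No manual replacement of `&lt;` or `&gt;` is involved: the first parse already unescapes the text.

```python
from xml.dom import minidom

# Stand-in for the document the OP receives; the namespace is a placeholder.
outer_source = (
    b'<?xml version="1.0" encoding="utf-8"?>'
    b'<string xmlns="http://example.invalid/ns">'
    b"&lt;DataSet&gt;&lt;Order&gt;&lt;Customer&gt;439&lt;/Customer&gt;"
    b"&lt;/Order&gt;&lt;/DataSet&gt;</string>"
)

# First pass: parse the outer document and collect the text under <string>.
outer = minidom.parseString(outer_source)
string_node = outer.documentElement
inner_source = "".join(
    n.data for n in string_node.childNodes if n.nodeType == n.TEXT_NODE
)

# Second pass: the extracted text is itself an XML document.
inner = minidom.parseString(inner_source)
order = inner.documentElement.getElementsByTagName("Order")[0]
customer = order.getElementsByTagName("Customer")[0]
customer_id = customer.firstChild.data
print(customer_id)
```

Note that the inner document is parsed on its own, so its elements do not inherit the namespace declared on the outer `<string>` element.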

Regards,
Martin

**Jeremy Bowers** · Jul 18 '05, 08:29 PM

Re: xml parsing escape characters

On Thu, 20 Jan 2005 21:54:30 +0100, Martin v. Löwis wrote:

> Luis P. Mendes wrote:
>> When I access the url via the Firefox browser and look into the source
>> code, I also get:
>>
>> <?xml version="1.0" encoding="utf-8"?>
>> <string xmlns="http.... ............">&lt;DataSet&gt;
>> ~ &lt;Order&gt;
>> ~ &lt;Customer&gt;439&lt;/Customer&gt;
>> ~ &lt;/Order&gt;
>> &lt;/DataSet&gt;</string>
>
> Please do try to understand what you are seeing. This is crucial for
> understanding what happens.

From extremely painful and lengthy personal experience, Luis, I
***extremely*** strongly recommend taking the time to nail this down until
you really, really, really understand what is going on. Until you can
explain it to somebody else coherently, ideally.

Mixing escaping levels like this absolutely, positively *must* be done
correctly, or extremely-painful-to-debug problems will result.

(My painful experience was layering an RPC implementation in plain text on
top of IM messages, where I was dealing with everything from the socket
level up except the XML parser. Ultimately it turned out there was a
problem in the XML parser: it rendered "&amp;amp;" as "&", which is wrong
wrong wrong. But that took a *long* time to find, especially as I had
other bugs in the way.)

Since you're layering XML in XML, test &amp; and &amp;amp; to make
sure they work correctly; those usually show encoding errors. And, given
your current understanding of the issue, do not write your own decoding
function unless you absolutely can't avoid it.
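
One way to run that kind of test is a round trip through both layers. This is a sketch under stated assumptions: Python 3, `xml.dom.minidom` for parsing, and `xml.sax.saxutils.escape` for writing. The `wrap` and `unwrap` helpers and the tag names are hypothetical, made up for the illustration.

```python
from xml.dom import minidom
from xml.sax.saxutils import escape

def wrap(text, tag):
    """Embed text as escaped character data inside a single element."""
    return f"<{tag}>{escape(text)}</{tag}>"

def unwrap(source):
    """Parse source and return the text content of its root element."""
    return minidom.parseString(source).documentElement.firstChild.data

# Strings that tend to expose a layer that decodes too much or too little.
samples = ["&", "&amp;", "&amp;amp;", "a < b"]

results = []
for original in samples:
    inner_doc = wrap(original, "inner")    # first level of escaping
    outer_doc = wrap(inner_doc, "outer")   # XML inside XML: second level
    results.append(unwrap(unwrap(outer_doc)))

# Each layer must undo exactly one level of escaping.
for original, recovered in zip(samples, results):
    print(repr(original), "->", repr(recovered))
```

If any recovered value differs from its original, one of the layers is escaping or unescaping the wrong number of times.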

**Luis P. Mendes** · Jul 18 '05, 08:31 PM

Re: xml parsing escape characters

From your experience, do you think that if this wrong XML code is
meant to be read only by some kind of Microsoft parser, the error will
not occur?

I'll try to explain:

The XML producer writes the code on a Windows platform and 'thinks' that every
client will read/parse the code with a specific Windows parser. Could
that (wrong) XML code parse correctly in that kind of specific Windows
client?

Or in other words:

Do you know any Windows parser that could turn that erroneous encoding
into an XML tree, with four or five inner levels of tags?

I'd like to thank everyone for taking the time to answer me.

Luis
