xml parsing escape characters

  • Luis P. Mendes

    xml parsing escape characters


    Hi,

    I only know a little bit of XML and I'm trying to parse an XML document
    in order to save its elements in a file (dictionaries inside a list).

    When I access a URL from Python 2.3.3 running on Linux with the
    following lines:
    resposta = urllib.urlopen(url)
    xmldoc = minidom.parse(resposta)
    resposta.close()

    I get the following result:

    <?xml version="1.0" encoding="utf-8"?>
    <string xmlns="http://www......">&lt;DataSet&gt;
    ~ &lt;Order&gt;
    ~ &lt;Customer&gt;439&lt;/Customer&gt;
    (... others ...)
    ~ &lt;/Order&gt;
    &lt;/DataSet&gt;</string>
    _______________ _______________ _______________ _______________ _

    In the lines below, I try to get all the child nodes from string, first
    by counting them, and then ignoring the \n ones:

    stringNode = xmldoc.childNodes[0]
    print stringNode.toxml()
    dataSetNode = stringNode.childNodes[0]
    numNos = len(dataSetNode.childNodes)
    todosNos = {}
    for no in range(numNos):
        todosNos[no] = dataSetNode.childNodes[no].toxml()
    posicaoXml = [no for no in todosNos.keys() if len(todosNos[no]) > 4]
    print posicaoXml

    (I'm almost sure there's a simpler way to do this...)
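
One possibly simpler approach is to filter `childNodes` by `nodeType` rather than by the length of `toxml()`. The sketch below is written for a current Python 3 rather than the Python 2.3 used in the thread. The sample document and its namespace URI are hypothetical stand-ins for a response that contains real markup, and `child_elements` is an illustrative helper, not part of minidom.

```python
from xml.dom import minidom

# Hypothetical sample standing in for a response that contains real markup.
SAMPLE = """<string xmlns="http://example.invalid/ns">
  <DataSet>
    <Order>
      <Customer>439</Customer>
    </Order>
  </DataSet>
</string>"""


def child_elements(node):
    """Return only the Element children, skipping whitespace Text nodes."""
    return [child for child in node.childNodes
            if child.nodeType == child.ELEMENT_NODE]


doc = minidom.parseString(SAMPLE)
string_node = doc.documentElement
data_set = child_elements(string_node)[0]
orders = child_elements(data_set)
print([order.tagName for order in orders])
```

This only helps when the elements really are markup; it finds nothing in the escaped response shown above.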
    _______________ _______________ _______________ _______________ _

    I don't get any elements. But, if I access the same url via a browser,
    the result in the browser window is something like:

    <string xmlns="http://www......">
    ~ <DataSet>
    ~ <Order>
    ~ <Customer>439</Customer>
    (... others ...)
    ~ </Order>
    ~ </DataSet>
    </string>

    and the lines I posted work as intended.

    I already browsed the web, I know it's about the escape characters, but
    I didn't find a simple solution for this.

    I tried to use LL2XML.py and an unescape function with a simple replace
    text = text.replace("&lt;", "<")
    but I had to convert the XML document to a string, and then I could not
    work out how to convert it back to an XML object.

    How can I solve this? Please explain it bearing in mind that I'm just
    beginning with XML and I'm not very experienced in Python either.


    Luis
  • Martin v. Löwis

    #2
    Re: xml parsing escape characters

    Luis P. Mendes wrote:
    > I get the following result:
    >
    > <?xml version="1.0" encoding="utf-8"?>
    > <string xmlns="http://www......">&lt;DataSet&gt;
    > ~ &lt;Order&gt;

    Most likely, this result is correct, and your document
    really does contain

    &lt;Order&gt;

    > I don't get any elements. But, if I access the same url via a browser,
    > the result in the browser window is something like:
    >
    > <string xmlns="http://www......">
    > ~ <DataSet>

    Most likely, your browser is incorrect (or at least confusing), and
    renders &lt; as "<", even though this is not markup.

    > I already browsed the web, I know it's about the escape characters, but
    > I didn't find a simple solution for this.

    Not sure what "this" is. AFAICT, everything works correctly.

    Regards,
    Martin


    • Luis P. Mendes

      #3
      Re: xml parsing escape characters


      This is the XML document:

      <?xml version="1.0" encoding="utf-8"?>
      <string xmlns="http://www......">&lt;DataSet&gt;
      ~ &lt;Order&gt;
      ~ &lt;Customer&gt;439&lt;/Customer&gt;
      (... others ...)
      ~ &lt;/Order&gt;
      &lt;/DataSet&gt;</string>

      When I do:

      print xmldoc.toxml()

      it prints:
      <?xml version="1.0" ?>
      <string xmlns="http://www...">&lt;DataSet&gt;
      ~ &lt;Order&gt;
      ~ &lt;Customer&gt;439&lt;/Customer&gt;

      ~ &lt;/Order&gt;
      &lt;/DataSet&gt;</string>

      _______________ _______________ _______________ _____________
      with: stringNode = xmldoc.childNodes[0]
      print stringNode.toxml()
      I get:
      <string xmlns="http://www.......">&lt;DataSet&gt;
      ~ &lt;Order&gt;
      ~ &lt;Customer&gt;439&lt;/Customer&gt;

      ~ &lt;/Order&gt;
      &lt;/DataSet&gt;</string>
      _______________ _______________ _______________ _______________ __________

      with: DataSetNode = stringNode.childNodes[0]
      print DataSetNode.toxml()

      I get:

      &lt;DataSet&gt;
      ~ &lt;Order&gt;
      ~ &lt;Customer&gt;439&lt;/Customer&gt;

      ~ &lt;/Order&gt;
      &lt;/DataSet&gt;
      _______________ _______________ _______________ _______________ ___-

      So far so good, but when I issue the command:

      print DataSetNode.childNodes[0]

      I get:
      IndexError: tuple index out of range

      Why the error, and why does it return a tuple?
      Why doesn't it return:
      &lt;Order&gt;
      &lt;Customer&gt;439&lt;/Customer&gt;

      &lt;/Order&gt;
      ??


      • Kent Johnson

        #4
        Re: xml parsing escape characters

        Luis P. Mendes wrote:
        > this is the xml document:
        >
        > <?xml version="1.0" encoding="utf-8"?>
        > <string xmlns="http://www......">&lt;DataSet&gt;
        > ~ &lt;Order&gt;
        > ~ &lt;Customer&gt;439&lt;/Customer&gt;
        > (... others ...)
        > ~ &lt;/Order&gt;
        > &lt;/DataSet&gt;</string>

        This is an XML document containing a single tag, <string>, whose content is text containing
        entity-escaped XML.

        This is *not* an XML document containing tags <DataSet>, <Order>, <Customer>, etc.

        All the behaviour you are seeing is a consequence of this. You need to unescape the contents of the
        <string> tag to be able to treat it as structured XML.

        Kent
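
A minimal sketch of this point, assuming a current Python 3 and `xml.dom.minidom`: parsing an abbreviated stand-in for the escaped response yields one element with a single Text child and no `Order` element anywhere. The namespace URI is hypothetical, since the real one is elided in the thread.

```python
from xml.dom import minidom

# Abbreviated stand-in for the escaped response; the namespace URI is hypothetical.
ESCAPED = (
    '<string xmlns="http://example.invalid/ns">'
    '&lt;DataSet&gt;&lt;Order&gt;&lt;Customer&gt;439&lt;/Customer&gt;'
    '&lt;/Order&gt;&lt;/DataSet&gt;</string>'
)

doc = minidom.parseString(ESCAPED)
root = doc.documentElement

# The root element has exactly one child, and it is a Text node.
only_child = root.childNodes[0]
print(root.tagName, len(root.childNodes))
print(only_child.nodeType == only_child.TEXT_NODE)

# No DataSet, Order or Customer elements exist anywhere in the tree.
order_count = len(doc.getElementsByTagName("Order"))
print(order_count)
```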


        • Irmen de Jong

          #5
          Re: xml parsing escape characters

          Kent Johnson wrote:
          [...]
          > This is an XML document containing a single tag, <string>, whose content
          > is text containing entity-escaped XML.
          >
          > This is *not* an XML document containing tags <DataSet>, <Order>,
          > <Customer>, etc.
          >
          > All the behaviour you are seeing is a consequence of this. You need to
          > unescape the contents of the <string> tag to be able to treat it as
          > structured XML.

          The unescaping is usually done for you by the xml parser that you use.

          --Irmen


          • Kent Johnson

            #6
            Re: xml parsing escape characters

            Irmen de Jong wrote:
            > Kent Johnson wrote:
            > [...]
            >
            >> This is an XML document containing a single tag, <string>, whose
            >> content is text containing entity-escaped XML.
            >>
            >> This is *not* an XML document containing tags <DataSet>, <Order>,
            >> <Customer>, etc.
            >>
            >> All the behaviour you are seeing is a consequence of this. You need to
            >> unescape the contents of the <string> tag to be able to treat it as
            >> structured XML.
            >
            > The unescaping is usually done for you by the xml parser that you use.

            Yes, so if your XML contains, for example,
            <stuff>&lt;not a tag&gt;</stuff>

            and you parse this and ask for the *text* content of the <stuff> tag, you will get the string
            "<not a tag>"

            but it's still *not* a tag. If you try to get child elements of the <stuff> element there will be none.

            This is exactly the confusion the OP has.
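
The `<stuff>` example can be checked directly. This is a small sketch using `xml.etree.ElementTree` from the Python 3 standard library, which is a different parser from the minidom used in the original question.

```python
import xml.etree.ElementTree as ET

stuff = ET.fromstring("<stuff>&lt;not a tag&gt;</stuff>")

# The parser has already unescaped the entities in the text content...
text_content = stuff.text
print(repr(text_content))

# ...but the result is character data, so there are no child elements.
child_count = len(stuff)
print(child_count)
```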

            > --Irmen


            • Martin v. Löwis

              #7
              Re: xml parsing escape characters

              Luis P. Mendes wrote:
              > with: DataSetNode = stringNode.childNodes[0]
              > print DataSetNode.toxml()
              >
              > I get:
              >
              > &lt;DataSet&gt;
              > ~ &lt;Order&gt;
              > ~ &lt;Customer&gt;439&lt;/Customer&gt;
              >
              > ~ &lt;/Order&gt;
              > &lt;/DataSet&gt;
              > _______________ _______________ _______________ _______________ ___-
              >
              > so far so good, but when I issue the command:
              >
              > print DataSetNode.childNodes[0]
              >
              > I get:
              > IndexError: tuple index out of range
              >
              > Why the error, and why does it return a tuple?

              The DataSetNode has no children, because it is not
              an Element node, but a Text node. In XML, an element
              is denoted by

              <DataSet>...</DataSet>

              and *not* by

              &lt;DataSet&gt;...&lt;/DataSet&gt;

              The latter is just a single string, represented
              in XML as a Text node. It does not give you any
              hierarchy whatsoever.

              As a text node does not have any children, its
              childNodes member is an empty tuple; indexing
              that tuple gives you an IndexError.
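
The behaviour described above can be reproduced with a short sketch, assuming Python 3 and `xml.dom.minidom`. The input is a shortened stand-in for the escaped document, and the exact wording of the exception message is not relied on.

```python
from xml.dom import minidom

doc = minidom.parseString("<string>&lt;DataSet&gt;...&lt;/DataSet&gt;</string>")
text_node = doc.documentElement.childNodes[0]

# The only child of <string> is a Text node, and Text nodes have no children.
print(text_node.nodeType == text_node.TEXT_NODE)
print(len(text_node.childNodes))

raised_index_error = False
try:
    text_node.childNodes[0]
except IndexError as exc:
    raised_index_error = True
    print("IndexError:", exc)
```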

              Regards,
              Martin


              • Martin v. Löwis

                #8
                Re: xml parsing escape characters

                Irmen de Jong wrote:
                > The unescaping is usually done for you by the xml parser that you use.

                Usually, but not in this case. If you have a text that looks like
                XML, and you want to put it into an XML element, the XML file uses
                &lt; and &gt;. The XML parser unescapes that as < and >. However, it
                does not then consider the < and > as markup, and it shouldn't.

                Regards,
                Martin


                • Irmen de Jong

                  #9
                  Re: xml parsing escape characters

                  Martin v. Löwis wrote:
                  > Irmen de Jong wrote:
                  >
                  >> The unescaping is usually done for you by the xml parser that you use.
                  >
                  > Usually, but not in this case. If you have a text that looks like
                  > XML, and you want to put it into an XML element, the XML file uses
                  > &lt; and &gt;. The XML parser unescapes that as < and >. However, it
                  > does not then consider the < and > as markup, and it shouldn't.

                  That's also what I said?

                  The unescaping of the XML entities in the contents of the OP's
                  <string> element is done for you by the parser,
                  so you will get a text node with the <, >, &, whatever in there.
                  The OP probably wants to feed that to a new xml parser instance
                  to process it as markup.
                  Or perhaps the way the original XML document is constructed is
                  flawed.

                  --Irmen


                  • Martin v. Löwis

                    #10
                    Re: xml parsing escape characters

                    Irmen de Jong wrote:
                    >> Usually, but not in this case. If you have a text that looks like
                    >> XML, and you want to put it into an XML element, the XML file uses
                    >> &lt; and &gt;. The XML parser unescapes that as < and >. However, it
                    >> does not then consider the < and > as markup, and it shouldn't.
                    >
                    > That's also what I said?

                    You said it in response to

                    >>> All the behaviour you are seeing is a consequence of this. You need
                    >>> to unescape the contents of the <string> tag to be able to treat it
                    >>> as structured XML.

                    In that context, I interpreted

                    >> The unescaping is usually done for you by the xml parser that you
                    >> use.

                    as "The parser should have done what you want; if the parser didn't,
                    that is a bug in the parser".

                    > The OP probably wants to feed that to a new xml parser instance
                    > to process it as markup.
                    > Or perhaps the way the original XML document is constructed is
                    > flawed.

                    Either of these, indeed - probably the latter.

                    Regards,
                    Martin


                    • Luis P. Mendes

                      #11
                      Re: xml parsing escape characters


                      I would like to thank everyone for your answers, but I'm not seeing the
                      light yet!

                      When I access the url via the Firefox browser and look into the source
                      code, I also get:

                      <?xml version="1.0" encoding="utf-8"?>
                      <string xmlns="http.... ............">&lt;DataSet&gt;
                      ~ &lt;Order&gt;
                      ~ &lt;Customer&gt;439&lt;/Customer&gt;
                      ~ &lt;/Order&gt;
                      &lt;/DataSet&gt;</string>

                      Should I take the contents of the string tag, which is text, replace
                      all '&lt;' with '<' and '&gt;' with '>', and then read it with xml.minidom?
                      How do I do that?

                      Or should I use another parser that accomplishes the task with no need
                      to replace the escaped characters?


                      • Martin v. Löwis

                        #12
                        Re: xml parsing escape characters

                        Luis P. Mendes wrote:
                        > When I access the url via the Firefox browser and look into the source
                        > code, I also get:
                        >
                        > <?xml version="1.0" encoding="utf-8"?>
                        > <string xmlns="http.... ............">&lt;DataSet&gt;
                        > ~ &lt;Order&gt;
                        > ~ &lt;Customer&gt;439&lt;/Customer&gt;
                        > ~ &lt;/Order&gt;
                        > &lt;/DataSet&gt;</string>

                        Please do try to understand what you are seeing. This is crucial for
                        understanding what happens.

                        You may have the understanding that XML can be represented as a tree.
                        This would be good - if not, please read a book that explains why
                        XML can be considered as a tree.

                        In the tree, you have inner nodes, and leaf nodes. For example,
                        the document

                        <a>
                        <b>Hello</b>
                        <c>World</c>
                        </a>

                        has 5 nodes (ignoring whitespace content):

                        Element:a ---- Element:b ---- Text:"Hello"
                                  |
                                  \-- Element:c ---- Text:"World"

                        So the leaf nodes are typically Text nodes (unless you
                        have an empty element). Your document has this structure:

                        Element:string ---- Text:"""<DataSet>
                        <Order>
                        <Customer>439</Customer>
                        </Order>
                        </DataSet>"""

                        So the ***TEXT*** contains the letter "<", just like it contains
                        the letters "O" and "r". There IS no element Order in your document,
                        no matter how hard you look.
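
To see both trees side by side, here is a rough sketch (Python 3, `xml.dom.minidom`) that walks each document and lists its Element and non-whitespace Text nodes. `walk` and `describe` are illustrative helpers, and the escaped document is an abbreviated stand-in without the elided namespace.

```python
from xml.dom import minidom

TREE_DOC = "<a>\n<b>Hello</b>\n<c>World</c>\n</a>"
# Abbreviated stand-in for the escaped document discussed in this thread.
ESCAPED_DOC = (
    "<string>&lt;DataSet&gt;&lt;Order&gt;&lt;Customer&gt;439"
    "&lt;/Customer&gt;&lt;/Order&gt;&lt;/DataSet&gt;</string>"
)


def walk(node, depth=0):
    """Yield (depth, label) for each Element and non-whitespace Text node."""
    if node.nodeType == node.ELEMENT_NODE:
        yield depth, "Element:" + node.tagName
        for child in node.childNodes:
            yield from walk(child, depth + 1)
    elif node.nodeType == node.TEXT_NODE and node.data.strip():
        yield depth, "Text:" + repr(node.data)


def describe(xml_text):
    """Parse xml_text and return the list of (depth, label) pairs."""
    return list(walk(minidom.parseString(xml_text).documentElement))


tree_nodes = describe(TREE_DOC)
escaped_nodes = describe(ESCAPED_DOC)
for depth, label in tree_nodes + escaped_nodes:
    print("  " * depth + label)
print(len(tree_nodes), len(escaped_nodes))
```

The first document produces five nodes; the escaped one produces only two, the `string` element and one Text node.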

                        If you want a DataSet *element* in your document, it should
                        read

                        <string xmlns="...">
                        <DataSet>
                        <Order>
                        <Customer>439</Customer>
                        </Order>
                        </DataSet>
                        </string>

                        As this is the document you apparently want to process, complain
                        to whoever gave you that other document.

                        > should I take the contents of the string tag that is text and replace
                        > all '&lt;' with '<' and '&gt;' with '>' and then read it with xml.minidom?

                        No. We still don't know what you want to achieve, so it is difficult to
                        advise you what to do. My best advice is that whoever generates the XML
                        document should fix it.

                        > or should I use another parser that accomplishes the task with no need
                        > to replace the escaped characters?

                        No. The parser is working correctly.

                        The document you got can also be interpreted as containing another
                        XML document as a text. This is evil, but apparently people are doing
                        it, anyway. If you really want that embedded document, you need
                        first to extract it.

                        To see what I mean, do

                        print DataSetNode.data

                        The .data attribute gives you the string contents of
                        a text node. You could use this as an XML document, and
                        pass it to an XML parser again. This would be ugly,
                        but might be your only choice if the producer of the
                        document is unwilling to adjust.
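
A minimal sketch of that two-pass approach, assuming Python 3 and `xml.dom.minidom`: parse the outer document, take the text of the `string` element, and parse that text as a second document. The input is an abbreviated stand-in with a hypothetical namespace URI, and it assumes the embedded text is itself well-formed XML.

```python
from xml.dom import minidom

# Abbreviated, hypothetical stand-in for the escaped response.
OUTER = (
    '<string xmlns="http://example.invalid/ns">'
    '&lt;DataSet&gt;&lt;Order&gt;&lt;Customer&gt;439&lt;/Customer&gt;'
    '&lt;/Order&gt;&lt;/DataSet&gt;</string>'
)

outer_doc = minidom.parseString(OUTER)

# First pass: the parser unescapes the entities. Join every Text child
# in case the text is delivered as more than one node.
inner_xml = "".join(
    child.data
    for child in outer_doc.documentElement.childNodes
    if child.nodeType == child.TEXT_NODE
)

# Second pass: the unescaped text is now parsed as markup.
inner_doc = minidom.parseString(inner_xml)
customers = [
    element.firstChild.data
    for element in inner_doc.getElementsByTagName("Customer")
]
print(customers)
```

No manual `replace` of `&lt;` or `&gt;` is involved here; the first parse already removes exactly one level of escaping.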

                        Regards,
                        Martin



                        • Jeremy Bowers

                          #13
                          Re: xml parsing escape characters

                          On Thu, 20 Jan 2005 21:54:30 +0100, Martin v. Löwis wrote:

                          > Luis P. Mendes wrote:
                          >> When I access the url via the Firefox browser and look into the source
                          >> code, I also get:
                          >>
                          >> <?xml version="1.0" encoding="utf-8"?>
                          >> <string xmlns="http.... ............">&lt;DataSet&gt;
                          >> ~ &lt;Order&gt;
                          >> ~ &lt;Customer&gt;439&lt;/Customer&gt;
                          >> ~ &lt;/Order&gt;
                          >> &lt;/DataSet&gt;</string>
                          >
                          > Please do try to understand what you are seeing. This is crucial for
                          > understanding what happens.

                          From extremely painful and lengthy personal experience, Luis, I
                          ***extremely*** strongly recommend taking the time to nail this down until
                          you really, really, really understand what is going on. Until you can
                          explain it to somebody else coherently, ideally.

                          Mixing escaping levels like this absolutely, positively *must* be done
                          correctly, or extremely-painful-to-debug problems will result.

                          (My painful experience was layering an RPC implementation in plain text on
                          top of IM messages, where I was dealing with everything from the socket
                          level up except the XML parser. Ultimately it turned out there was a
                          problem in the XML parser, it rendered "&amp;amp;" as "&", which is wrong
                          wrong wrong. But that took a *long* time to find, especially as I had
                          other bugs in the way.)

                          Since you're layering XML in XML, test &amp;amp; and &amp;amp;amp; to make
                          sure they work correctly; those usually show encoding errors. And, given
                          your current understanding of the issue, do not write your own decoding
                          function unless you absolutely can't avoid it.
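
As a rough illustration of that test (Python 3, standard library only), the sketch below uses `escape` and `unescape` from `xml.sax.saxutils` to add and remove one escaping level at a time, then checks that minidom also removes exactly one level. The sample strings are made up for the example.

```python
from xml.dom import minidom
from xml.sax.saxutils import escape, unescape

original = "Tom & Jerry <cartoon>"

# Each layer of embedding adds exactly one level of escaping.
once = escape(original)
twice = escape(once)
print(once)
print(twice)

# Each unescape should remove exactly one level, no more.
back_to_once = unescape(twice)
back_to_original = unescape(once)

# A parser should likewise remove exactly one level from element content.
doc = minidom.parseString("<s>&amp;amp; and &amp;amp;amp;</s>")
decoded = doc.documentElement.firstChild.data
print(decoded)
```

A decoder that turned `&amp;amp;` straight into `&` would be removing two levels at once, which is the kind of bug described above.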


                          • Luis P. Mendes

                            #14
                            Re: xml parsing escape characters


                            From your experience, do you think that if this wrong XML code is
                            meant to be read only by some kind of Microsoft parser, the error will
                            not occur?

                            I'll try to explain:

                            The XML producer writes the code on a Windows platform and 'thinks' that every
                            client will read/parse the code with a specific Windows parser. Could
                            that (wrong) XML code parse correctly in that kind of specific Windows
                            client?

                            Or in other words:

                            Do you know of any Windows parser that could turn that erroneous encoding
                            into an XML tree, with four or five inner levels of tags?

                            I'd like to thank everyone for taking the time to answer me.


                            Luis
