Elementtree and CDATA handling

  • alainpoint@yahoo.fr

    Elementtree and CDATA handling

    I am experimenting with ElementTree and I came across some
    (apparently) weird behaviour.
    I would expect a piece of XML to be read, parsed and written back
    without corruption (except for the comments and PIs, which have purposely
    been left out). That is not the case, however, when it comes to CDATA
    handling.
    I have the following code:
    text = """<html><head>
    <title>Document </title>
    </head>
    <body>
    <script type="text/javascript">
    //<![CDATA[
    function matchwo(a,b)
    {
    if (a < b && a > 0) then
    {
    return 1
    }
    }
    //]]>
    </script>
    </body>
    </html>
    """

    # standalone package of the time; later Pythons ship it as xml.etree.ElementTree
    from elementtree import ElementTree
    tree = ElementTree.fromstring(text)
    ElementTree.dump(tree)

    Running the above piece of code yields the following:

    <html><head>
    <title>Document </title>
    </head>
    <body>
    <script type="text/javascript">
    //
    function matchwo(a,b)
    {
    if (a &lt; b &amp;&amp; a &gt; 0) then
    {
    return 1
    }
    }
    //
    </script>
    </body>
    </html>

    There are two problems: the //<![CDATA[ has disappeared and the <, >
    and && have been replaced by their equivalent entities (CDATA should
    have prevented that).
    I am no XML/HTML expert, so I might be doing something wrong...
    Thank you for helping.

    Alain

  • Fredrik Lundh

    #2
    Re: Elementtree and CDATA handling

    alainpoint@yahoo.fr wrote:

    > There are two problems: the //<![CDATA[ has disappeared and the <, >
    > and && have been replaced by their equivalent entities (CDATA should
    > have prevented that).
    > I am no XML/HTML expert, so I might be doing something wrong...

    you're confusing the external representation of something with the internal
    data model.

    consider this:

    >>> "hello"
    >>> 'hello'
    >>> "hell\x6f"
    >>> "hell\157"
    >>> "hell" + "o"
    >>> 'h' 'e' 'l' 'l' 'o'

    the above are six ways to write the same string literal in Python. all these result
    in a five-character string containing the letters "h", "e", "l", "l", and "o". if you type
    the above at a python prompt, you'll find that Python echoes the strings back as
    'hello' in all six cases.

    in XML, entities, character references, and CDATA sections are three different
    ways to represent reserved characters. once you've loaded the file, they all
    "disappear".
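
A minimal sketch of this point, assuming the `xml.etree.ElementTree` module from the current standard library rather than the standalone `elementtree` package used in the thread. The element name `s` and the sample expression are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Three external spellings of the same element content.
as_entities = "<s>a &lt; b &amp;&amp; a &gt; 0</s>"
as_char_refs = "<s>a &#60; b &#38;&#38; a &#62; 0</s>"
as_cdata = "<s><![CDATA[a < b && a > 0]]></s>"

# After parsing, only the characters remain; the spelling is gone.
texts = [ET.fromstring(doc).text for doc in (as_entities, as_char_refs, as_cdata)]
print(texts)

# On output the serializer has to pick one spelling; here it uses entities.
serialized = ET.tostring(ET.fromstring(as_cdata), encoding="unicode")
print(serialized)
```

All three documents load to the same text, so the tree has nothing left that could tell the serializer which form the input used.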

    </F>




    • Terry Reedy

      #3
      Re: Elementtree and CDATA handling


      "Fredrik Lundh" <fredrik@pythonware.com> wrote in message
      news:d7kdam$71c$1@sea.gmane.org...
      > you're confusing the external representation of something with the
      > internal data model.
      >
      > consider this:
      >
      > >>> "hello"
      > >>> 'hello'
      > >>> "hell\x6f"
      > >>> "hell\157"
      > >>> "hell" + "o"
      > >>> 'h' 'e' 'l' 'l' 'o'
      >
      > the above are six ways to write the same string literal in Python.

      Minor nit: I believe 'hell' + 'o' is two string literals and a runtime
      concatenation operation. Perhaps you meant 'hell' 'o', without the '+',
      which I believe is joined to one literal at parsing or compilation time.
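
One way to look at this distinction, sketched with the standard `ast` module and assuming Python 3.8 or later (where string literals parse to `ast.Constant`). The variable names are illustrative only.

```python
import ast

# Adjacent literals are joined while parsing: the tree holds one constant.
adjacent = ast.parse("'hell' 'o'", mode="eval").body

# With '+', the tree holds a binary operation on two separate constants.
added = ast.parse("'hell' + 'o'", mode="eval").body

print(type(adjacent).__name__, repr(adjacent.value))
print(type(added).__name__)
```

An implementation may still fold the `+` form into a single constant when compiling, so the difference shows up in the syntax tree rather than in the resulting string.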

      > all these result
      > in a five-character string containing the letters "h", "e", "l", "l", and
      > "o".
      > if you type the above at a python prompt,
      > you'll find that Python echoes the strings back as
      > 'hello' in all six cases.

      Nit aside, this is a valuable point that bears repeating. Another example
      of one internal form versus multiple external forms that confuses many is
      the following:
      1.1 == 1.1000000000000001 # True

      The mapping of external to internal is many-to-one for both strings and
      floats and therefore *cannot* be exactly inverted! (Or round-tripped.) So
      Python has to somehow choose one of the possible external forms that would
      generate the internal form.
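
A small sketch of the float case, assuming a Python version whose `repr` picks the shortest decimal string that round-trips (Python 3.1 and later). The names `a` and `b` are illustrative.

```python
# Two different external forms...
a = float("1.1")
b = float("1.1000000000000001")

# ...map to the same internal value.
print(a == b)

# Going back out, Python has to choose a single external form.
shown = repr(b)
print(shown)

# The chosen form still regenerates the same internal value.
round_trips = float(shown) == b
print(round_trips)
```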

      Terry J. Reedy





      • and-google@doxdesk.com

        #4
        Re: Elementtree and CDATA handling

        Alain <alainpoint@yahoo.fr> wrote:

        > I would expect a piece of XML to be read, parsed and written back
        > without corruption [...]. It isn't however the case when it comes
        > to CDATA handling.

        This is not corruption, exactly. For most intents and purposes, CDATA
        sections should behave identically to normal character data. In a real
        XML-based browser (such as Mozilla in application/xhtml+xml mode), this
        line of script would actually work fine:

        > if (a &lt; b &amp;&amp; a &gt; 0) {

        The problem is you're (presumably) producing output that you want to be
        understood by things that are not XML parsers, namely legacy-HTML web
        browsers, which have special exceptions-to-the-rule like "<script>
        elements don't contain markup" that are not present in XML.

        ElementTree is a data binding that strives to simplify the XML
        processing experience, and as such it folds CDATA sections down to
        plain characters - this is usually easier for programmers to deal with.
        Such a feature is considered normal in XML processing, and is the
        default for, e.g., DOM Level 3 implementations.

        If, instead, you want to keep track of where the CDATA sections are,
        and output them again without change, you'll need to use an
        XML-handling interface that supports this feature. Typically, DOM
        implementations do - the default Python minidom does, as does pxdom.
        DOM is a more comprehensive but less friendly/Python-like interface for
        XML processing.
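
A minimal sketch of the minidom behaviour described above, assuming the standard-library `xml.dom.minidom` with its default settings. The sample document is a made-up one-line script element, not the poster's full page.

```python
from xml.dom import minidom, Node

source = "<script><![CDATA[if (a < b && a > 0) { return 1; }]]></script>"
doc = minidom.parseString(source)

# minidom records the CDATA section as its own node type
# instead of folding it into a plain text node.
child = doc.documentElement.firstChild
is_cdata = child.nodeType == Node.CDATA_SECTION_NODE
print(is_cdata, repr(child.data))

# Serializing the element writes the section back out as CDATA.
serialized = doc.documentElement.toxml()
print(serialized)
```

The cost is the one mentioned in the post: code that walks the tree now has to handle text nodes and CDATA section nodes separately.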

        There are a few other obstacles you may meet if you are outputting XML
        for use by a non-XML parser (legacy browsers):

        - entity references - &eacute; etc. The HTML entities are not
        built into XML so to read them at all you'll need a parser that
        reads the external DTD subset (and a suitable !DOCTYPE). Even then
        they'll be converted to text, if that matters. (pxdom, optionally,
        can keep them as entity references regardless of whether their
        content is known);

        - empty elements - <img/> etc. An XML serialiser won't know how to
        output this in a browser-compatible way. (The next release of pxdom
        has an option to do so.)

        If you're generating output for legacy browsers, you might want to just
        use a 'real' HTML serialiser.
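
As one possible illustration of an HTML serialiser, the ElementTree now bundled with Python accepts `method="html"` in `tostring`. This postdates the thread, so treat it as an assumption about a current Python 3 environment and not as something the posters had available. The sample markup and the `go()` call are made up.

```python
import xml.etree.ElementTree as ET

body = ET.fromstring(
    '<body><img src="x.png"/>'
    "<script>if (a &lt; b &amp;&amp; a &gt; 0) { go(); }</script></body>"
)

# XML output: script text is escaped and the empty element is self-closed.
as_xml = ET.tostring(body, encoding="unicode")
print(as_xml)

# HTML output: script text is written raw and <img> gets no closing slash.
as_html = ET.tostring(body, encoding="unicode", method="html")
print(as_html)
```

The HTML form is what a legacy browser expects, but it is no longer well-formed XML, so it should only be used as a final output step.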

        --
        Andrew Clover
        mailto:and@doxdesk.com



        • Fredrik Lundh

          #5
          Re: Elementtree and CDATA handling

          Terry Reedy wrote:

          >> the above are six ways to write the same string literal in Python.
          >
          > Minor nit: I believe 'hell' + 'o' is two string literals and a runtime concatenation operation.

          I guess I should have written "constant".

          on the other hand, while the difference might matter for current python
          interpreter implementations, it doesn't really matter for the user. after
          the operation, they end up with a string object that doesn't contain what
          they wrote.

          </F>


