XML CDATA special characters

This topic is closed.
  • John van Terheijden

    XML CDATA special characters

    Hi.

    I'm trying to develop a program that uses XML files to store data. I'm using
    Windows XP, Apache 1.3.29 and PHP 4.3.4.

    Right now the XML file is read using the xml_parser_create(),
    xml_set_element_handler() etc. functions. I have difficulties with special
    characters in the data.

    I found information on "<![CDATA[ special chars here ]]>", UTF-8, XML DOM,
    htmlentities(), and more, but I'm confused with all these terms and their
    meaning.

    I think I should use CDATA sections anyhow, right? Or is this UTF-8 a way to
    use special characters without bothering the XML parser?

    Long ago I used a DOM in Perl and liked it, is it hard to use the PHP XML
    DOM and is it (part of a) solution to my problem?

    Right now (with the xml_parser_ functions) my program outputs something like
    <img alt="Data from XML file, sometimes with "quotes".">
    to the browser, which isn't right because of the early end-quote. Where and
    how should I avoid this? This is where htmlentities fits in, right? And I
    once read something about PHP settings dealing with HTML characters.

    It's not that I'm lazy, but there's a lot of information on a lot of
    interrelated subjects. Who can help me out here please?

    Regards,

    - John van Terheijden, the Netherlands


  • Terence

    #2
    Re: XML CDATA special characters

    For a start, if you are "creating" XML content, then you need to use the
    DOM API and not the SAX API. As far as I am aware, the SAX API is just
    for "reading" XML data and not writing to it. Someone please correct me
    if I am wrong.

    The DOM API will conveniently do all special character escaping for you
    so you don't have to worry about using functions *like* htmlentities().
    On that point, basic XML only has 5 pre-defined default entities. And
    off the top of my head, I think they are:

    > -- &gt;
    < -- &lt;
    " -- &quot;
    & -- &amp;
    [insert fifth one here]

    The other one escapes me (no pun intended). If you try to use HTML
    entities, then you will likely create invalid XML documents because HTML
    has entities that are "undefined" in the default XML set.

    When you use an XML parser (be it SAX, or DOM) to get the data back from
    the XML storage files, everything (including entities) will be converted
    back (un-escaped). So you really do not need to use CDATA sections.
    CDATA sections do have their uses, but their absolute necessity is
    limited to a very few cases.

    SPECIAL NOTE ON XSL STYLESHEETS:
    If you are using XSL templates to extract HTML markup contained
    (escaped) in the XML storage files, use the disable-output-escaping
    attribute of the value-of directive to disable output escaping. This is
    useful if you have done something like this...
    $element->set_content($htmlSource);
    and you wish the output tree to contain unescaped HTML.

    As for character encoding (UTF-8 etc.), it depends on what sort of data
    you are putting in there. Odds are you needn't concern yourself with
    this unless you know that your source data is UTF-16 or something. Just
    try using the DOM XML functions and see how you go.
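
    To make the point about automatic escaping concrete, here is a minimal
    sketch. Note the assumptions: it uses the DOMDocument class from the DOM
    extension bundled with PHP 5 and later, not the PHP 4 DOM XML functions
    discussed in this thread, and the element name "item" and the sample
    string are made up for illustration.

    ```php
    <?php
    // Sketch only: DOMDocument (PHP 5+), not the PHP 4 DOM XML API.
    // The element name and the sample data are hypothetical.
    $raw = 'Data from XML file, sometimes with "quotes", <tags> & ampersands.';

    // Writing: build the tree and let the serializer do the escaping.
    $doc  = new DOMDocument('1.0', 'UTF-8');
    $item = $doc->createElement('item');
    $item->appendChild($doc->createTextNode($raw)); // text content, no manual escaping
    $item->setAttribute('alt', $raw);               // attribute value, no manual escaping
    $doc->appendChild($item);
    $xml = $doc->saveXML();
    echo $xml;

    // Reading: the parser hands back the original, un-escaped string.
    $back = new DOMDocument();
    $back->loadXML($xml);
    $read = $back->documentElement;
    var_dump($read->textContent === $raw);
    var_dump($read->getAttribute('alt') === $raw);
    ```

    The serialized XML contains entity references such as &lt; and &amp;, but
    the strings read back are identical to the input, which is why neither
    htmlentities() nor a CDATA section is needed for storage.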


    • John Dunlop

      #3
      Re: XML CDATA special characters

      Terence wrote:
      > On that point, basic XML only has 5 pre-defined default entities. And
      > off the top of my head, I think they are:
      >
      > > -- &gt;
      > < -- &lt;
      > " -- &quot;
      > & -- &amp;
      > [insert fifth one here]
      >
      > The other one escapes me (no pun intended).

      The other one was introduced by XML 1.0, and isn't defined in any HTML
      version up to 4.01. It's U+0027 APOSTROPHE ("'"), with an entity reference
      of &apos;, a decimal character reference of &#39;, and a hexadecimal
      character reference of &#x27; (XML 1.0 sec. 4.6).

      > If you try and use HTML entities, then you will likely create invalid
      > XML documents because HTML has entities that are "undefined" in the
      > default XML set.

      OK.

      On the other hand, htmlspecialchars() converts, at most, just those
      five characters to their respective entity references (or decimal
      character reference in the case of the ASCII apostrophe, since there
      is no entity reference defined for it in HTML up to 4.01). The
      ENT_QUOTES mode converts both single and double quotes; ENT_COMPAT,
      the default mode in PHP 4, converts only double quotes.
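
      Applied to the <img alt="..."> problem from the original post, that
      could look like the sketch below. The sample string is hypothetical and
      stands in for whatever the parser returned from the XML file.
      ENT_QUOTES is passed explicitly so the result does not depend on the
      default mode of the PHP version in use.

      ```php
      <?php
      // $alt stands in for a string read from the XML file (hypothetical data).
      $alt = 'Data from XML file, sometimes with "quotes" & \'apostrophes\'.';

      // Escape at output time, when the value is placed into HTML.
      // ENT_QUOTES converts both double and single quotes.
      $safe = htmlspecialchars($alt, ENT_QUOTES, 'UTF-8');

      echo '<img alt="' . $safe . '">', "\n";
      ```

      The data stays un-escaped in the XML file and in PHP variables; it is
      escaped only at the moment it is written into an HTML attribute.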



      --
      Jock


      • John van Terheijden

        #4
        Re: XML CDATA special characters

        I didn't mention SAX, is that the standard PHP parser I'm using now? I
        thought it was Expat. Thanks for making this even more confusing ;)

        Ok, I'll just dive into DOM now and see where this will all end up. I'll
        probably come across all the terms again, in time. B.t.w. I don't understand
        much of your XSL note, probably because I know very little about XSL. I'm
        using XML to store data while avoiding databases.

        Thanks!




        • Terence

          #5
          Re: XML CDATA special characters

          John van Terheijden wrote:
          > I didn't mention SAX, is that the standard PHP parser I'm using now? I
          > thought it was Expat. Thanks for making this even more confusing ;)

          Yeah, it's a bit like that. I didn't want to include too much
          explanation, or else I'd be in danger of writing a huge article. Trust me,
          restraint is a good thing for me. When you're on the newbie end of a
          technology, it's best just to pretend you never read or heard the
          stuff that confused you (initially, of course).

          Simple API for XML (SAX) is indeed the model behind what PHP inadequately
          names the "XML extension". And yes, that extension is based on Expat, an
          event-driven parser in the SAX style. SAX is a (de facto) standard API;
          Expat is a product that implements that style of parsing.

          DOM is a standard, PHP uses the libxml product which implements that
          standard. PHP5 is slated to use libxml2 which is very exciting indeed :)

          If you don't know anything about XSLT, then ignore the tip I gave to
          XSLT users who might take my advice on the [no need to use] CDATA issue.
          XSLT is a whole new kettle of fish, don't go there until you have a firm
          grasp on XML.

          I recommend familiarising yourself with the XML "infoset". You will find
          the "infoset" standard on the W3C website. Do not panic, it is a
          relatively short document that can be skimmed quite readily. Don't get
          depressed if it doesn't all stick the first time. At least *familiarise*
          yourself with the *concept* of the infoset. There should be an
          introduction/primer type article there.
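
          For the event-driven (Expat-style) side, here is a rough sketch of
          reading data back with the xml_parser_ functions you are already
          using. The <items>/<item> document is made up, and the handlers are
          written as anonymous functions, which need PHP 5.3 or later; on
          PHP 4 you would pass the names of ordinary functions instead.

          ```php
          <?php
          // Two hypothetical records: one uses entity references, one a CDATA section.
          $xml = '<items>'
               . '<item>Fish &amp; "chips" &lt;hot&gt;</item>'
               . '<item><![CDATA[Fish & "chips" <hot>]]></item>'
               . '</items>';

          $items   = array();
          $current = null; // text of the <item> being read, or null outside one

          $parser = xml_parser_create('UTF-8');
          // Keep element names as written instead of upper-casing them.
          xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

          xml_set_element_handler(
              $parser,
              function ($p, $name, $attrs) use (&$current) {
                  if ($name === 'item') { $current = ''; }
              },
              function ($p, $name) use (&$items, &$current) {
                  if ($name === 'item') { $items[] = $current; $current = null; }
              }
          );
          // Character data may arrive in several pieces, so append rather than assign.
          xml_set_character_data_handler($parser, function ($p, $data) use (&$current) {
              if ($current !== null) { $current .= $data; }
          });

          xml_parse($parser, $xml, true);

          print_r($items);
          ```

          Both records should come out as the same plain string, which is the
          sense in which entity references and CDATA sections are
          interchangeable ways of writing the same character data.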



          • John van Terheijden

            #6
            Re: XML CDATA special characters

            Thanks for the reply.

            "Terence" <tk.lists@fastmail.fm> wrote in message
            news:3fbaada8$1@herald...
            > John van Terheijden wrote:
            >
            > > I didn't mention SAX, is that the standard PHP parser I'm using now? I
            > > thought it was Expat. Thanks for making this even more confusing ;)
            >
            > Yeah, it's a bit like that. I didn't want to include too much
            > explanations else I'd be in danger of writing a huge article. Trust me,
            > restraint is a good thing for me. When you're on the newbie end of a
            > technology, then it's best just to pretend you never read/heard the
            > stuff that confused you (initially of course).

            I agree. It's always hard to choose between learning by reading or by
            practice. Most of the times, "the other one" would have been faster.

            > Simple API for XML (SAX) is indeed what PHP's inadequately named the
            > "XML extension". And yes, it is based on the Expat (product name)
            > implementation of SAX. SAX is a standard, Expat is a product that
            > implements that standard.
            >
            > DOM is a standard, PHP uses the libxml product which implements that
            > standard. PHP5 is slated to use libxml2 which is very exciting indeed :)

            Thanks for clearing that up! Btw, I think I like how DOM works better than
            how SAX works. However, I believe that depends very much on the type of
            XML data involved.

            > If you don't know anything about XSLT, then ignore the tip I gave to
            > XSLT users who might take my advice on the [no need to use] CDATA issue.
            > XSLT is a whole new kettle of fish, don't go there until you have a firm
            > grasp on XML.

            ok :)

            > I recommend familiarising yourself with the XML "infoset". You will find
            > the "infoset" standard on the W3C website. Do not panic, it is a
            > relatively short document that can be skimmed quite readily. Don't get
            > depressed if it all doesn't stick the first time. At least *familiarise*
            > yourself with the *concept* of the infoset. There should be an
            > introduction/primer type article there.

            I had a quick look and will read it.

            Thanks!



            • R. Rajesh Jeba Anbiah

              #7
              Re: XML CDATA special characters

              "John van Terheijden" <john-foobar-nl> wrote in message news:<3fb955fb$0$31392$edd6591c@news.versatel.net>...
              > Hi.
              >
              > I'm trying to develop a program that uses XML files store data. I'm using
              > Windows XP, Apache 1.3.29 and PHP 4.3.4.

              I can't understand why people mess with XML when PHP with a
              simple database (like MySQL, PostgreSQL or SQLite) can do the job
              better.

              XML can be used effectively to share data between two domains.
              But there are people who plow through XML within their own domain; I have
              also seen a number of people dump their data from the DB into XML and then
              mess with the XML.

              There are also some people who still fight against PHP's cool
              short tag on behalf of messy XML.

              ---
              "One who mix sports and patriotism is a barbarian"
              Email: rrjanbiah-at-Y!com
