problem parsing XML files with < and > in cdata-section

  • wenke

    problem parsing XML files with < and > in cdata-section

    Hi,

    I am using the following code (see below) from php.net
    (http://www.php.net/manual/en/ref.xml.php, example 1) to parse an XML
    file (encoded in UTF-8). I changed the code slightly so that the cdata
    sections are echoed instead of the element names as in the original
    example.

    In the cdata sections of my XML file I have terms like this:

    Cap&lt;Finanzinstrument&gt;

    The parser echoes them as follows (echo $data . "<br>";):

    Cap
    <
    Finanzinstrument
    >

    Can anyone explain this to me? Why does the parser split a cdata
    section that has &lt; and &gt; in it? Is there any way to avoid
    this?

    Thanks very much in advance,

    greetings, wenke

    --------------------------------------------

    <?php
    $file = "ck_bsp.xml";
    $depth = array();

    function startElement($parser, $name, $attrs)
    {
        global $depth;
        for ($i = 0; $i < $depth[$parser]; $i++) {
            echo " ";
        }
        //echo "$name\n";
        $depth[$parser]++;
    }

    function endElement($parser, $name)
    {
        global $depth;
        $depth[$parser]--;
    }

    function characterData($parser, $data)
    {
        echo $data . "<br>";
    }

    $xml_parser = xml_parser_create();
    xml_set_element_handler($xml_parser, "startElement", "endElement");
    xml_set_character_data_handler($xml_parser, "characterData");
    if (!($fp = fopen($file, "r"))) {
        die("could not open XML input");
    }

    while ($data = fread($fp, 4096)) {
        if (!xml_parse($xml_parser, $data, feof($fp))) {
            die(sprintf("XML error: %s at line %d",
                xml_error_string(xml_get_error_code($xml_parser)),
                xml_get_current_line_number($xml_parser)));
        }
    }
    xml_parser_free($xml_parser);
    ?>
    --------------------------------------------
  • Eric Bohlman

    #2
    Re: problem parsing XML files with &lt; and &gt; in cdata-section

    wenkeroeper@gmx.de (wenke) wrote in
    news:a642a16e.0403030637.284b954e@posting.google.com:

    > In the cdata sections of my XML file I have terms like this:
    >
    > Cap&lt;Finanzinstrument&gt;
    >
    > The parser echoes them as following (echo $data . "<br>";):
    >
    > Cap
    > <
    > Finanzinstrument
    > >
    >
    > Can anyone explain this to me? Why does the parser split the
    > cdata-section with &lt; and &gt; in it? Is there any way to avoid
    > this?

    "Stream-oriented" XML parsers (like expat, which is what PHP uses) are
    almost never guaranteed to return maximum-length pieces of character data,
    because doing so requires some rather complicated internal buffering that
    slows them down. In particular, they usually stop a chunk at the
    beginning of an entity reference. You simply have to be prepared for
    consecutive calls to your character data handler.
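
To make the point above concrete, here is a minimal sketch that records every piece the character data handler receives. The sample document and the `<term>` element name are made up for illustration. How many pieces arrive depends on the underlying parser, so the only thing the script relies on is that the pieces, joined in order, give back the full text.

```php
<?php
// Hypothetical sample document; the <term> element is invented for this sketch.
$xml = '<term>Cap&lt;Finanzinstrument&gt;</term>';

$chunks = [];  // every piece passed to the character data handler, in order

$parser = xml_parser_create('UTF-8');
xml_set_character_data_handler(
    $parser,
    function ($parser, string $data) use (&$chunks): void {
        $chunks[] = $data;  // one call may carry only part of the text
    }
);

if (!xml_parse($parser, $xml, true)) {
    throw new RuntimeException(sprintf(
        "XML error: %s at line %d",
        xml_error_string(xml_get_error_code($parser)),
        xml_get_current_line_number($parser)
    ));
}

// The number of chunks is not guaranteed; their concatenation is the text.
$text = implode('', $chunks);
echo count($chunks), " chunk(s), joined: ", $text, "\n";
```

Echoing each piece with its own `<br>`, as in the original script, is what makes the splits visible on the page.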


    • wenke

      #3
      Re: problem parsing XML files with &lt; and &gt; in cdata-section

      Eric Bohlman <ebohlman@earthlink.net> wrote in message news:<Xns94A1CB4338120ebohlmanomsdevcom@130.133.1.4>...
      > wenkeroeper@gmx.de (wenke) wrote in
      > news:a642a16e.0403030637.284b954e@posting.google.com:
      >
      > > In the cdata sections of my XML file I have terms like this:
      > >
      > > Cap&lt;Finanzinstrument&gt;
      > >
      > > The parser echoes them as following (echo $data . "<br>";):
      > >
      > > Cap
      > > <
      > > Finanzinstrument
      > > >
      > >
      > > Can anyone explain this to me? Why does the parser split the
      > > cdata-section with &lt; and &gt; in it? Is there any way to avoid
      > > this?
      >
      > "Stream-oriented" XML parsers (like expat, which is what PHP uses) are
      > almost never guaranteed to return maximum-length pieces of character data,
      > because doing so requires some rather complicated internal buffering that
      > slows them down. In particular, they usually stop a chunk at the
      > beginning of an entity reference. You simply have to be prepared for
      > consecutive calls to your character data handler.

      Could you please explain this more precisely? How do I know whether
      the output the parser gives me still belongs to the previous cdata
      section or to a new one (especially if the structure of the data
      might vary)?
      Thanks!


      • Eric Bohlman

        #4
        Re: problem parsing XML files with &lt; and &gt; in cdata-section

        wenkeroeper@gmx.de (wenke) wrote in
        news:a642a16e.0403080153.6e023138@posting.google.com:

        > Eric Bohlman <ebohlman@earthlink.net> wrote in message
        > news:<Xns94A1CB4338120ebohlmanomsdevcom@130.133.1.4>...
        >> "Stream-oriented" XML parsers (like expat, which is what PHP uses)
        >> are almost never guaranteed to return maximum-length pieces of
        >> character data, because doing so requires some rather complicated
        >> internal buffering that slows them down. In particular, they usually
        >> stop a chunk at the beginning of an entity reference. You simply
        >> have to be prepared for consecutive calls to your character data
        >> handler.
        >
        > Could you please render this more precisely? How do I know if the
        > output the parser is giving me still belongs to the prior or a new
        > cdata section (especially if the structure of the data might vary) ??

        If there were no intervening start-element or end-element events, then two
        character-data events are referring to consecutive parts of the same text.
        The usual trick is to clear out a text buffer at the end of the code for
        each start-element or end-element event (the code would have made use of
        anything that was previously in the buffer), and simply append the text to
        it in character-data events.
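
The buffering trick described above could look roughly like the sketch below. The sample XML, the element names, and the `$flush` helper are assumptions made for illustration; they are not from the original script. Text is only appended in the character data handler, and the buffer is used and cleared whenever a start-element or end-element event arrives.

```php
<?php
// Hypothetical sample input; the element names are invented for this sketch.
$xml = '<terms><term>Cap&lt;Finanzinstrument&gt;</term><term>Floor</term></terms>';

$buffer = '';  // text collected since the last element event
$texts  = [];  // one entry per uninterrupted run of text

// Use whatever is in the buffer, then clear it.
$flush = function () use (&$buffer, &$texts): void {
    if (trim($buffer) !== '') {  // skip whitespace-only runs between elements
        $texts[] = $buffer;
    }
    $buffer = '';
};

$parser = xml_parser_create('UTF-8');
xml_set_element_handler(
    $parser,
    function ($parser, string $name, array $attrs) use ($flush): void {
        $flush();  // a start tag ends the current run of text
    },
    function ($parser, string $name) use ($flush): void {
        $flush();  // so does an end tag
    }
);
xml_set_character_data_handler(
    $parser,
    function ($parser, string $data) use (&$buffer): void {
        $buffer .= $data;  // append only; one call may be a fragment
    }
);

if (!xml_parse($parser, $xml, true)) {
    throw new RuntimeException(sprintf(
        "XML error: %s at line %d",
        xml_error_string(xml_get_error_code($parser)),
        xml_get_current_line_number($parser)
    ));
}

foreach ($texts as $t) {
    // Escape for HTML so a literal "<" is shown instead of being read as a tag.
    echo htmlspecialchars($t), "<br>\n";
}
```

With this arrangement it no longer matters how the parser splits the text: each entry in `$texts` is complete by the time an element boundary is reached.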
