Weird loadHTML behaviour

This topic is closed.
  • monochromec@gmail.com

    Weird loadHTML behaviour

    Hi all,

    I'm in the process of setting up a PHP script that reads an HTML file,
    does a character conversion and then displays the contents of a single
    HTML tag as follows:


    $str = mb_convert_encoding (file_get_contents ('aktuel.htm'),
    'HTML-ENTITIES', 'ISO-8859-1');

    file_put_contents ('dmp.htm', $str);

    $dom = DOMDocument::loadHTML ($str);
    $elem = $dom->getElementsByTagName ('h5');
    if ($elem->length) {
        $n = $elem->item (0)->nodeValue;
        var_dump (bin2hex ($n));
    }

    What's interesting is that the source HTML file is properly ISO-8859-1
    encoded (which the contents of "dmp.htm" verify). The trouble starts
    when I retrieve the contents of the first <h5> tag that has an umlaut
    in it. In this case, the umlaut is mangled: what used to be a
    "Ü" (capital U umlaut, ISO-8859-1 0xdc) has now become the two bytes
    0xc3 0x9c (as the var_dump confirms). Two things surprise me: that
    the character changes at all, and that the umlaut is not HTML-encoded
    as HTML-ENTITIES would suggest. I use PHP version 5.2.1 on a Linux
    box.

    Any thoughts?

    Cheers, Christoph

  • monochromec@gmail.com

    #2
    Re: Weird loadHTML behaviour


    After some :-) research, it turns out that the encoding of the
    contents of the first <h5> tag has actually changed to UTF-8, hence
    the strange byte sequence (0xc3 0x9c is the UTF-8 encoding of "Ü").
    This raises the question of whether the default encoding for parsed
    HTML strings in the DOM package is UTF-8 (given that the input was
    HTML-ENTITIES encoded to begin with). Is this a bug in DOMDocument
    or a feature?

    Cheers, Christoph
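
    The behaviour can be reproduced in isolation with a short sketch. It
    assumes the DOM and mbstring extensions are loaded. The inline
    "&Uuml;" heading is a hypothetical stand-in for what the
    HTML-ENTITIES conversion of aktuel.htm would contain, and the
    variable names are illustrative only. It suggests that the parser
    resolves the entity and hands node values back as UTF-8, so
    converting the value back to ISO-8859-1 by hand is one possible
    workaround:

```php
<?php
// Stand-in for the entity-encoded input (assumption: the real file
// contains an <h5> heading whose text starts with a capital U umlaut).
$html = '<html><body><h5>&Uuml;</h5></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);

// The entity is resolved during parsing; nodeValue comes back as UTF-8.
$value = $dom->getElementsByTagName('h5')->item(0)->nodeValue;
var_dump(bin2hex($value));   // string(4) "c39c"

// Convert the UTF-8 string back to ISO-8859-1 if that is what the
// rest of the script expects.
$latin1 = mb_convert_encoding($value, 'ISO-8859-1', 'UTF-8');
var_dump(bin2hex($latin1));  // string(2) "dc"
```

    If the page has to be emitted with entities again, the UTF-8 value
    could instead be passed through htmlentities() with 'UTF-8' as the
    charset argument.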
