Problem loading html containing scripts using Dom LoadHTML

  • loretta

    Problem loading html containing scripts using Dom LoadHTML

    This code just reads HTML and prints it; eventually I want to modify
    the HTML. However, the original HTML contains JavaScript, and the
    output HTML contains tags that are not in the original.

    $url = "http://www.something.com";
    $doc = new DOMDocument();
    $doc->loadHTMLFile($url);
    print $doc->saveHTML();

    Original html snippet:
    function exampleFunction () {
    var doc = '<html><head>' ;
    doc += '<title>Title</title>';
    doc += '</head>';
    doc += '<body onload="self.focus();">';
    doc += '</body></html>';
    }

    Html after saveHTML:
    function exampleFunction () {
    ('about:blank', 'imagemanagerpopup',settings);
    var doc = '<html><head>' ;
    doc += '<title>Title</title>';
    doc += '</script>
    </head>
    <body>
    <p>';
    doc += '</body>
    </html><html><body>
    <p>';
    }

    Extra tags that end the script and head and begin a new body are being
    added before the </body> tag and after the <body onload="self.focus();">
    tag in the JS variable. Is there a way for the DOM to leave the
    JavaScript as is, without trying to 'fix' the HTML? The changes being
    made are causing a JavaScript error.
    Thanks

  • shimmyshack

    #2
    Re: Problem loading html containing scripts using Dom LoadHTML

    Start off with XHTML, so it can be loaded with no errors; search Google
    for how to add JavaScript in a way that is compliant with XML standards.

    • loretta

      #3
      Re: Problem loading html containing scripts using Dom LoadHTML

      The HTML I am retrieving has an XHTML doctype. I also have no control
      over the original webpage. The original webpage loads with no errors
      in both IE and FF.

      • Jerry Stuckle

        #4
        Re: Problem loading html containing scripts using Dom LoadHTML

        loretta wrote:
        > The HTML I am retrieving has an XHTML doctype. I also have no control
        > over the original webpage. The original webpage loads with no errors
        > in both IE and FF.

        But does it validate (http://validator.w3.org)? Pages can load in
        browsers without error and still not validate. The browsers are very
        forgiving, and make a "best guess" as to what the page creator wanted.

        --
        ==================
        Remove the "x" from my email address
        Jerry Stuckle
        JDS Computer Training Corp.
        jstucklex@attglobal.net
        ==================

        • shimmyshack

          #5
          Re: Problem loading html containing scripts using Dom LoadHTML

          This is what I found on Google:
          http://developer.mozilla.org/en/docs..._and_JavaScrip...

          Use <![CDATA[ ... ]]>, or the "XHTML" document is no such thing. By the
          way, it should not just claim to be XHTML but should properly validate
          as such, including the content type text/xml+xhtml (served as .xhtml).
          Once you have obtained the webpage and parsed it, adding the right
          instructions for the XML parser, all should work, if indeed the rest
          of the document is valid XML.

          • shimmyshack

            #6
            Re: Problem loading html containing scripts using Dom LoadHTML

            Oops, application/xhtml+xml of course.

            • Toby A Inkster

              #7
              Re: Problem loading html containing scripts using Dom LoadHTML

              Jerry Stuckle wrote:
              > But does it validate (http://validator.w3.org)? Pages can load in
              > browsers without error and still not validate. The browsers are very
              > forgiving, and make a "best guess" as to what the page creator wanted.

              From the excerpts posted, no. Javascript blocks in XHTML must be entity
              encoded -- that is:

              '&' => '&amp;'
              '<' => '&lt;'

              at a minimum. If not, then the document is not valid.

              If a document is not valid, then DOMDocument might not be able to load it
              correctly. Or rather, "correctly" is not defined, so DOMDocument is free
              to interpret it however it likes!
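
A minimal sketch of how one might see what the parser objects to, assuming PHP 7+ with the DOM and libxml extensions. The `$html` sample is made up for illustration; `libxml_use_internal_errors()`, `libxml_get_errors()` and `libxml_clear_errors()` are standard PHP functions. Which messages appear, if any, depends on the libxml2 version installed.

```php
<?php
// Made-up sample with a list that is closed twice.
$html = '<html><body><ul><li>one</li></ul></ul><p>after</p></body></html>';

libxml_use_internal_errors(true);   // buffer parser messages instead of emitting PHP warnings
$doc = new DOMDocument();
$doc->loadHTML($html);

$errors = libxml_get_errors();      // array of LibXMLError objects (possibly empty)
libxml_clear_errors();

foreach ($errors as $e) {
    // Each entry carries the source line and the parser's message.
    printf("line %d: %s\n", $e->line, trim($e->message));
}
```

Comparing the reported lines with the place where the output goes wrong can help tell unrelated validation errors apart from the one that actually changes the tree.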

              --
              Toby A Inkster BSc (Hons) ARCS
              http://tobyinkster.co.uk/

              Geek of ~ HTML/SQL/Perl/PHP/Python/Apache/Linux

              • shimmyshack

                #8
                Re: Problem loading html containing scripts using Dom LoadHTML

                Using a CDATA block means that the parser won't be tripped up by < and
                so forth.
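
A rough illustration of the CDATA point, assuming the page is well-formed XML and is loaded with `DOMDocument::loadXML()` rather than the HTML loader. The `$xhtml` sample below is invented for the example; a page with other markup errors would be rejected by the XML parser outright.

```php
<?php
// Made-up, well-formed XHTML sample; the script body sits inside a CDATA section.
$xhtml = <<<'XHTML'
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Title</title>
<script type="text/javascript">
//<![CDATA[
function exampleFunction() {
    var doc = '<html><head>';
    doc += '</head><body onload="self.focus();"></body></html>';
}
//]]>
</script>
</head>
<body><p>Hello</p></body>
</html>
XHTML;

$doc = new DOMDocument();
$loaded = $doc->loadXML($xhtml);   // XML parser: returns false on malformed input

$script = $doc->getElementsByTagName('script')->item(0);
$js = $script->textContent;        // CDATA content comes back as plain text

$out = $doc->saveXML();            // the CDATA section is written back as a CDATA section
print $out;
```

Inside the CDATA section the parser treats `<` and `&` as ordinary characters, so the tags in the JavaScript strings never become elements.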

                • loretta

                  #9
                  Re: Problem loading html containing scripts using Dom LoadHTML

                  The webpage does not validate; however, the errors are nowhere near the
                  extra tags being inserted into the JavaScript at the head tag: there is
                  an unordered list somewhere in the HTML that is closed twice,
                  and an incorrect checkbox attribute. The page validates in Tidy, with
                  warnings only. There is this CDATA block around all the JavaScript
                  functions, in a comment:
                  //<![CDATA[
                  //]]>

                  It seems to me that the parser is seeing the '</head>' tag in the
                  JavaScript variable and putting in the end script tag and body tags.
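
If the page cannot be changed, one possible workaround is to hide the script bodies from the HTML parser and put them back after serializing. This is only a sketch under assumptions: PHP 7.1+, and the helper names `protect_scripts()` and `restore_scripts()` are hypothetical, not part of the DOM extension. The regular expression would be defeated by a script whose own text contains a literal `</script>`.

```php
<?php
// Hypothetical helpers; only DOMDocument, libxml_* and the preg_*/str* functions are real PHP APIs.

/**
 * Swap the body of every <script> element for a short placeholder comment,
 * so the HTML parser never sees markup-like text inside JavaScript strings.
 * Returns [html with placeholders, map of placeholder => original body].
 */
function protect_scripts(string $html): array
{
    $bodies = [];
    $protected = preg_replace_callback(
        '~(<script\b[^>]*>)(.*?)(</script\s*>)~is',
        function (array $m) use (&$bodies): string {
            $token = '/*SCRIPT_PLACEHOLDER_' . count($bodies) . '*/';
            $bodies[$token] = $m[2];
            return $m[1] . $token . $m[3];
        },
        $html
    );
    return [$protected, $bodies];
}

/** Put the original script bodies back after serialization. */
function restore_scripts(string $html, array $bodies): string
{
    return $bodies ? strtr($html, $bodies) : $html;
}

// Made-up input modelled on the snippet from the first post.
$html = '<html><head><title>Title</title><script type="text/javascript">'
      . "var doc = '</head><body onload=\"self.focus();\"></body></html>';"
      . '</script></head><body><p>Hello</p></body></html>';

[$protected, $bodies] = protect_scripts($html);

libxml_use_internal_errors(true);   // collect parser messages quietly
$doc = new DOMDocument();
$doc->loadHTML($protected);
libxml_clear_errors();

// ... modify $doc here ...

$output = restore_scripts($doc->saveHTML(), $bodies);
print $output;
```

The placeholder is a JavaScript comment containing no characters the serializer would escape, so it survives `saveHTML()` unchanged and can be replaced with a plain string substitution.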

                  • Jerry Stuckle

                    #10
                    Re: Problem loading html containing scripts using Dom LoadHTML

                    loretta wrote:
                    > It seems to me that the parser is seeing the '</head>' tag in the
                    > JavaScript variable and putting in the end script tag and body tags

                    Since you haven't told us the page you're trying to load, we can't see
                    what the problem is.

                    And BTW - instead of using "something.com", which is a valid domain, you
                    should use "example.com" - which is reserved just for such use.

                    --
                    ==================
                    Remove the "x" from my email address
                    Jerry Stuckle
                    JDS Computer Training Corp.
                    jstucklex@attglobal.net
                    ==================
