Get all document contents

  • Christopher Benson-Manica

    Get all document contents

    Is there a way to get the entire contents of the current document as a
    string? I want to send the document contents to a markup validation
    service.

    --
    Christopher Benson-Manica | I *should* know what I'm talking about - if I
    ataru(at)cyberspace.org | don't, I need to know. Flames welcome.
  • Martin Honnen

    #2
    Re: Get all document contents



    Christopher Benson-Manica wrote:

    > Is there a way to get the entire contents of the current document as a
    > string?

    Some browsers, such as IE and Opera, allow you to serialize an element, so
    you can use
    document.documentElement.outerHTML
    to get the serialized markup of the <HTML> element.
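
A minimal sketch of that suggestion, assuming a browser that exposes outerHTML on the root element. The helper name getSerializedMarkup is made up for illustration, and the doc parameter stands in for the global document object:

```javascript
// Hypothetical helper: returns the browser's own serialization of the
// root element, or null where outerHTML is not available on it.
function getSerializedMarkup(doc) {
  var root = doc && doc.documentElement;
  if (root && typeof root.outerHTML === "string") {
    return root.outerHTML;
  }
  return null;
}

// In a page this would be called as:
//   var markup = getSerializedMarkup(document);
```

The null return lets the caller fall back to another approach in browsers that do not support the property.
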

    > I want to send the document contents to a markup validation
    > service.

    Send them the URL then, so that they can fetch the contents themselves.
    outerHTML will hardly do for validation, because browsers apply their own
    serialization: your source might be XHTML with lower case tag names, yet
    the outerHTML might contain tags in upper case letters.

    Of course there are also browser-dependent methods to get the source of
    the page, see
    http://jibbering.com/faq/#FAQ4_38
    but XMLHttpRequest's responseText is known, for instance, not to handle
    ISO-8859-x encodings properly.


    Martin Honnen


    • Christopher Benson-Manica

      #3
      Re: Get all document contents

      Martin Honnen <mahotrash@yahoo.de> spoke thus:

      > Send them the URL then, that way they can fetch the contents.

      Obviously that would be the easy solution, but the pages I'd like to
      do this with aren't accessible to the validator (users must be logged
      in to view these pages).

      > outerHTML
      > will hardly do for validation as browsers apply their own serialization
      > and that way while your source might be XHTML with lower case tag names
      > the outerHTML might contain tags in upper case letters.

      Hm, I see the problem. For the purposes of validation, though, it
      should be possible to clean up the string without too much trouble
      (convert all characters to lowercase to take care of the tags)
      although it seems that attributes lose their enclosing double quotes
      as well, which is unfortunate.
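
Lowercasing every character would also change text content and attribute values, so a rough sketch that lowercases only the element names might look like the following. The function name lowercaseTagNames is hypothetical, and the regular expression is an assumption that ignores comments, scripts, and "<" characters inside attribute values, so it is no substitute for a real parser. It leaves attribute names alone and does not restore the missing quotes.

```javascript
// Hypothetical clean-up step: lowercase element names in opening and
// closing tags, leaving attributes and text content untouched.
function lowercaseTagNames(markup) {
  return markup.replace(
    /<(\/?)([A-Za-z][A-Za-z0-9]*)/g,
    function (match, slash, name) {
      return "<" + slash + name.toLowerCase();
    }
  );
}

// Example call:
//   lowercaseTagNames("<P CLASS=Foo>Hello <B>World</B></P>");
```
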

      > Of course there are also browser-dependent methods to get the source of
      > the page, see
      > http://jibbering.com/faq/#FAQ4_38
      > but XMLHttpRequest's responseText is known for instance to not handle
      > ISO-8859-x encodings properly.

      In what way does it fail to handle such encodings? I'll look into
      something like this and see if I can make it work. Thanks.

      --
      Christopher Benson-Manica | I *should* know what I'm talking about - if I
      ataru(at)cyberspace.org | don't, I need to know. Flames welcome.


      • Matt Kruse

        #4
        Re: Get all document contents

        Christopher Benson-Manica wrote:
        > Hm, I see the problem. For the purposes of validation, though, it
        > should be possible to clean up the string without too much trouble
        > (convert all characters to lowercase to take care of the tags)
        > although it seems that attributes lose their enclosing double quotes
        > as well, which is unfortunate.

        In addition to that, browsers will add tags and content that are not
        in the source, for example <tbody> tags in tables.

        Examining the browser's internal representation of your source is inadequate
        for validation.

        --
        Matt Kruse




        • RobG

          #5
          Re: Get all document contents

          Christopher Benson-Manica wrote:
          > Martin Honnen <mahotrash@yahoo.de> spoke thus:
          >
          >> Send them the URL then, that way they can fetch the contents.
          >
          > Obviously that would be the easy solution, but the pages I'd like to
          > do this with aren't accessible to the validator (users must be logged
          > in to view these pages).

          You can install the W3C validator locally.

          Allowing a browser to parse the HTML first and then send it to the
          validator will effectively invalidate your validation. AFAIK (but I
          may well be wrong), you can't get the doctype declaration which is
          fundamental to validating the page.

          --
          Rob


          • Richard Cornford

            #6
            Re: Get all document contents

            RobG wrote:
            <snip>
            > ... . AFAIK (but I may well be wrong),
            > you can't get the doctype declaration which is
            > fundamental to validating the page.

            On Mozilla and Opera (recent versions):-

            document.doctype (object)
            document.doctype.publicId (string)
            document.doctype.systemId (string)

            - could be used to re-produce it.

            The other issues raised about the likely validity of a serialised DOM
            make doing so pointless in this context, but where a serialised DOM has
            other uses the doctype can be employed to make the results more complete
            (along with maybe iterating the attributes collection of the
            documentElement in order to supplement innerHTML with accurate HTML tags).
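
As a rough sketch, the declaration could be rebuilt from those properties along these lines. The function name buildDoctypeDeclaration is hypothetical, and the use of the name property is an addition here (it is part of the DOM DocumentType interface but is not listed above). The output is a reconstruction and may not match the source declaration character for character.

```javascript
// Hypothetical helper: rebuilds a <!DOCTYPE ...> declaration from a
// DocumentType-like object (name, publicId, systemId).
// Returns an empty string when no doctype object is available.
function buildDoctypeDeclaration(doctype) {
  if (!doctype) {
    return "";
  }
  var decl = "<!DOCTYPE " + doctype.name;
  if (doctype.publicId) {
    decl += ' PUBLIC "' + doctype.publicId + '"';
    if (doctype.systemId) {
      decl += ' "' + doctype.systemId + '"';
    }
  } else if (doctype.systemId) {
    decl += ' SYSTEM "' + doctype.systemId + '"';
  }
  return decl + ">";
}

// In a page this would be called as:
//   var declaration = buildDoctypeDeclaration(document.doctype);
```

Passing the doctype object as a parameter, rather than reading document.doctype inside the function, keeps the helper usable in browsers where the property is missing: the caller simply gets an empty string.
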

            Richard.

