Saving a web page

  • alexey_r@mail.ru

    Saving a web page

    Using HttpWebRequest and HttpWebResponse to retrieve a webpage seems
    clear enough.

    But unless I am missing something, this will only give me the HTML
    source of the webpage requested, and not all the images, stylesheets and
    so on. Is there a simple way to get the entire webpage?

    The alternatives I see now:
    1. Get a WebBrowser in the background to do it, but this seems very nasty.
       There _has_ to be a better way. Besides, how would I select the correct
       file type and enter the file name in the background?
    2. Interop with mshtml.dll. See above.
    3. After getting the HTML file, I could iterate through the images, etc.
       and request each of them separately.

    Thank you in advance!

  • Andy

    #2
    Re: Saving a web page

    You'll have to get the img tags and download them manually; basically,
    you write the code that a browser would normally run for you.

    So, parse the <img> tags (and <a> tags, if you like), then use
    HttpWebRequest to get the images.

    HTH
    Andy



    • Tom Spink

      #3
      Re: Saving a web page

      Hi,

      Unfortunately, there isn't a simple way. The way web browsers (usually)
      work is that they start rendering the page and download the
      images/stylesheets/whatnot as they need them. They parse the HTML,
      find an <img> tag or a <link> tag, and decide to download the file
      that the tag is referencing.

      You'll need to do the same; i.e. analyse the HTML you've received and
      decide what needs to be downloaded by looking at the tags.
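
      The download step for each file you decide you need could look roughly
      like this (untested sketch; the URL is a placeholder and would really
      come from the tags you've parsed):

      using System;
      using System.IO;
      using System.Net;

      class ResourceFetcher
      {
          // Download one referenced file (an image, a stylesheet, ...)
          // and write the raw bytes to disk.
          static void Save(Uri uri, string path)
          {
              HttpWebRequest request = (HttpWebRequest)WebRequest.Create(uri);
              using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
              using (Stream input = response.GetResponseStream())
              using (FileStream output = File.Create(path))
              {
                  byte[] buffer = new byte[8192];
                  int read;
                  while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
                      output.Write(buffer, 0, read);
              }
          }

          static void Main()
          {
              // Placeholder address; in practice this comes from the parsed tags.
              Save(new Uri("http://example.com/style.css"), "style.css");
          }
      }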

      --
      Hope this helps,
      Tom Spink


      • Michael Nemtsev

        #4
        Re: Saving a web page

        Hello alexey_r@mail.ru,

        I'd save the page as MHT (web archive) and then parse it to get the images.
        BTW, the images are encoded inside the MHT.

        PS: This lib could be used for parsing: http://www.codeproject.com/csharp/mime_project.asp
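
        To create the MHT in the first place, one route is the CDO COM library
        (a rough, untested sketch; it assumes COM references to "Microsoft CDO
        for Windows 2000 Library" and "Microsoft ActiveX Data Objects", and the
        URL and file path are placeholders):

        class MhtSaver
        {
            static void Main()
            {
                // CDO fetches the page plus the files it references and packs
                // everything into one MIME (MHT) document.
                CDO.Message msg = new CDO.Message();
                msg.CreateMHTMLBody("http://example.com/page.html",
                                    CDO.CdoMHTMLFlags.cdoSuppressNone, "", "");

                // The finished message is exposed as an ADODB stream; save it to disk.
                ADODB.Stream stream = msg.GetStream();
                stream.SaveToFile(@"C:\page.mht",
                                  ADODB.SaveOptionsEnum.adSaveCreateOverWrite);
            }
        }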
        ---
        WBR,
        Michael Nemtsev :: blog: http://spaces.msn.com/laflour

        "At times one remains faithful to a cause only because its opponents do not
        cease to be insipid." (c) Friedrich Nietzsche



        • alexey_r@mail.ru

          #5
          Re: Saving a web page


          Michael Nemtsev wrote:
          > I'd save the page as MHT (web archive) and then parse it to get the images.
          > BTW, the images are encoded inside the MHT.

          Ah, thank you. But how do I save it as MHT?


          • alexey_r@mail.ru

            #6
            Re: Saving a web page


            Tom Spink wrote:
            > Unfortunately, there isn't a simple way. [...] You'll need to analyse
            > the HTML you've received and decide what needs to be downloaded by
            > looking at the tags.

            Thank you.


            • Michael Nemtsev

              #7
              Re: Saving a web page

              Hello alexey_r@mail.ru,

              alexey_r@mail.ru wrote:
              > Michael Nemtsev wrote:
              > > I'd save the page as MHT (web archive) and then parse it to get the images.
              > > BTW, the images are encoded inside the MHT.
              >
              > Ah, thank you. But how do I save it as MHT?
              ---
              WBR,
              Michael Nemtsev :: blog: http://spaces.msn.com/laflour

              "At times one remains faithful to a cause only because its opponents do not
              cease to be insipid." (c) Friedrich Nietzsche



              • alexey_r@mail.ru

                #8
                Re: Saving a web page


                Thank you again! Looks like it won't work for websites protected by
                a password, so I am back to plan A.


                • Michael Nemtsev

                  #9
                  Re: Saving a web page

                  Hello alexey_r@mail.ru,

                  What do you mean by "websites protected by password"?
                  Any example?
                  Have you tried to save those sites to MHT via IE?
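
                  If it's plain HTTP authentication, plan A can still work by
                  attaching credentials to the request, roughly like this
                  (untested sketch; the URL, user name and password are
                  placeholders):

                  using System;
                  using System.IO;
                  using System.Net;

                  class AuthenticatedFetch
                  {
                      static void Main()
                      {
                          HttpWebRequest request = (HttpWebRequest)WebRequest.Create(
                              "http://example.com/protected/page.html");

                          // Works for HTTP Basic/Digest/NTLM authentication.
                          // A forms-based login would instead need a CookieContainer
                          // and a POST to the site's login page.
                          request.Credentials = new NetworkCredential("user", "password");

                          using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
                          using (StreamReader reader = new StreamReader(response.GetResponseStream()))
                          {
                              string html = reader.ReadToEnd();
                              Console.WriteLine(html.Length);
                          }
                      }
                  }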
                  ---
                  WBR,
                  Michael Nemtsev :: blog: http://spaces.msn.com/laflour

                  "At times one remains faithful to a cause only because its opponents do not
                  cease to be insipid." (c) Friedrich Nietzsche

