HELP: strange php behavior downloading html

  • Chuck Renner

    HELP: strange php behavior downloading html

    Please help!

    This MIGHT even be a bug in PHP!

    I'll provide version numbers and site specific information (browser, OS,
    and kernel versions) if others cannot reproduce this problem.

    I'm running into some PHP behavior that I do not understand in PHP 5.1.2.

    I need to parse the HTML from the following carefully constructed URI:
    http://crenner.smugmug.com/homepage/templatechange.mg?TemplateID=7&origin=http://crenner.smugmug.com/gallery/1960121


    The problem is that when PHP downloads the HTML using file_get_contents(),
    or any other method of opening a remote file in PHP that I have tried,
    it gives me the wrong page!

    This URI is supposed to yield the HTML from the page at
    http://crenner.smugmug.com/gallery/1960121 , but with the "allthumbs"
    version of the page, selectable from the dropdown box at the top of the
    page.

    The correct page is downloaded in IE, SeaMonkey, and in wget!

    But when downloading in PHP, I get the HTML from the page at
    http://crenner.smugmug.com/gallery/1960121 , but with the "smugmug
    small" version of the page, selectable from the dropdown box at the top
    of the page.

    Please note that the templatechange.mg page is merely a server-side
    script that takes the arguments passed to it (TemplateID and origin),
    and redirects the browser to the correct version of the page at
    "origin", based on the "TemplateID".

    Here is how to reproduce the problem:
    * Download the page with wget so that you have a copy of the correct
    results:

    --commandline start here--
    wget "http://crenner.smugmug.com/homepage/templatechange.mg?TemplateID=7&origin=http://crenner.smugmug.com/gallery/1960121" -O correct.html
    --commandline end here--

    * Download the same page with php 5.1.2:

    --file incorrect.php start here--
    <?php
    print(file_get_contents("http://crenner.smugmug.com/homepage/templatechange.mg?TemplateID=7&origin=http://crenner.smugmug.com/gallery/1960121"));
    ?>
    --file incorrect.php end here--

    --commandline start here--
    php incorrect.php > incorrect.html
    --commandline end here--

    * You should now have two very different HTML files (correct.html and
    incorrect.html), even though both were downloaded using the same URI!

    * Open correct.html in a web browser. You will see a thumbnails-only
    ("allthumbs") version of a smugmug.com picture gallery.

    * Open incorrect.html in a web browser. You will see a paginated
    version of the same smugmug.com picture gallery ("smugmug small"), with
    a larger image on the right.

    I know that I could make a workaround by having my PHP scripts call wget
    instead of using intrinsic functions to download the HTML. This is not
    practical for me for a number of reasons, including code portability and
    streamlining.

    Can anyone help me with this? I know that templatechange.mg uses a
    302 to redirect the browser, based on the output I get from wget. I
    also know that the redirect is happening in PHP (even if it is happening
    incorrectly), because I'm not getting the contents of the
    templatechange.mg file, but a different version of the gallery itself.

    This is driving me crazy. I can find no logical reason why PHP would
    yield different results for the same URI than I get from three other
    HTTP clients (SeaMonkey, IE, and wget).

    I have also attached the results pages and the PHP script (correct.html,
    incorrect.html, and incorrect.php) in php_download_strangeness.tar.bz2
    (a bzip2-compressed tar archive).

    - Chuck Renner



  • Rik

    #2
    Re: strange php behavior downloading html

    Chuck Renner wrote:
    <snip: file_get_contents() behaves unexpectedly for the OP>
    First of all: no binaries with your post please.

    Second:
    HTTP/1.1 302 Found
    Date: Fri, 27 Oct 2006 10:17:31 GMT
    Server: Apache
    X-Powered-By: smugmug/1.2.0
    Set-Cookie: SMSESS=879c1d8a0378b8304671becdf6ff28c8; path=/; domain=.smugmug.com
    Cache-Control: private, max-age=1, must-revalidate
    Pragma:
    Set-Cookie: Template=7; expires=Sun, 26-Nov-2006 10:17:31 GMT; path=/; domain=.smugmug.com

    file_get_contents() will NOT honour these Set-Cookie headers whatsoever:
    it follows the redirect, but it does not send the cookies back with the
    second request. It isn't meant to do that.

    If you want to do this, use cURL.
    This is not a bug, this is a documented limitation.
    --
    Rik Wasmus
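
To illustrate the limitation described above, here is a minimal sketch of doing by hand what a cookie-aware client does: stop at the 302, read its Set-Cookie headers, and send them back when requesting the redirect target. It assumes PHP 7 or later (the `follow_location` context option it relies on does not exist in PHP 5.1.2) and an absolute URL in the Location header. The helper names `cookie_header_from_response()` and `fetch_after_cookie_redirect()` are hypothetical, not part of PHP.

```php
<?php
// Hypothetical helper: build a Cookie request header value from the
// Set-Cookie lines of a response. Cookie attributes (path, expires, ...)
// are ignored; a later cookie with the same name replaces an earlier one.
function cookie_header_from_response(array $headers): string
{
    $cookies = [];
    foreach ($headers as $line) {
        if (preg_match('/^Set-Cookie:\s*([^=;\s]+)=([^;]*)/i', $line, $m)) {
            $cookies[$m[1]] = trim($m[2]);
        }
    }
    $pairs = [];
    foreach ($cookies as $name => $value) {
        $pairs[] = $name . '=' . $value;
    }
    return implode('; ', $pairs);
}

// Hypothetical helper: fetch $uri without following its redirect, then
// fetch the redirect target with the cookies the first response set.
// Returns the final body, or false on failure.
function fetch_after_cookie_redirect(string $uri)
{
    // Stop at the 302 so that its headers can be inspected.
    $noFollow = stream_context_create(['http' => ['follow_location' => 0]]);
    $stream = @fopen($uri, 'r', false, $noFollow);
    if ($stream === false) {
        return false;
    }
    // For the HTTP wrapper, wrapper_data holds the response header lines.
    $headers = stream_get_meta_data($stream)['wrapper_data'];
    fclose($stream);

    $location = null;
    foreach ($headers as $line) {
        if (preg_match('/^Location:\s*(\S+)/i', $line, $m)) {
            $location = $m[1];
        }
    }
    if ($location === null) {
        return false; // no redirect to follow
    }

    // Second request: send the cookies back, as a browser or wget would.
    $withCookies = stream_context_create(['http' => [
        'header' => 'Cookie: ' . cookie_header_from_response($headers) . "\r\n",
    ]]);
    return @file_get_contents($location, false, $withCookies);
}
```

For the two Set-Cookie headers shown above, `cookie_header_from_response()` produces `SMSESS=879c1d8a0378b8304671becdf6ff28c8; Template=7`. cURL's built-in cookie handling remains the simpler route for anything beyond a single redirect.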



    • Chuck Renner

      #3
      Re: strange php behavior downloading html

      Rik wrote:
      > First of all: no binaries with your post please.

      sorry...

      > Second:
      > HTTP/1.1 302 Found
      > Date: Fri, 27 Oct 2006 10:17:31 GMT
      > Server: Apache
      > X-Powered-By: smugmug/1.2.0
      > Set-Cookie: SMSESS=879c1d8a0378b8304671becdf6ff28c8; path=/; domain=.smugmug.com
      > Cache-Control: private, max-age=1, must-revalidate
      > Pragma:
      > Set-Cookie: Template=7; expires=Sun, 26-Nov-2006 10:17:31 GMT; path=/; domain=.smugmug.com
      >
      > file_get_contents() will NOT honour these Set-Cookie headers whatsoever.
      > It isn't meant to do that.
      >
      > If you want to do this, use cURL.
      > This is not a bug, this is a documented limitation.

      Thanks. I had already spent hours in Google and the PHP documentation
      before posting, and had not found that. I did not find any PHP
      documentation on file_get_contents() limitations regarding Set-Cookie.
      I'll start looking for cURL documentation now.

      Thanks again for pointing me in the right direction.

      - Chuck Renner


      • Chuck Renner

        #4
        Re: HELP: strange php behavior downloading html

        Thanks Rik for pointing out that the HTTP headers on that redirected
        page were setting and using cookies and for pointing me in the right
        direction with cURL.

        I had a correctly working solution to my HTML downloading problem in
        less than an hour, using cURL with PHP.

        With the function below, I call tempnam() to get a temporary
        filename, call my function with the URI and that filename, and then
        read the file with file_get_contents(). I can then delete the file
        with unlink().

        Here is the function I wrote to download a uri into a file (following
        all redirects, ignoring old cookies, and passing set cookies to redirects):
        <?php
        function uri_download($uri, $fileName) {
            // use cURL to download uri
            // make a curl resource, setting the uri as its target to open
            $curl = curl_init($uri);
            // make a file resource and create/empty the file for writing
            $hFile = fopen($fileName, "w+");
            // set curl options
            // set the file resource that curl will write to
            curl_setopt($curl, CURLOPT_FILE, $hFile);
            // do not let curl output the HTTP headers
            curl_setopt($curl, CURLOPT_HEADER, false);
            // let curl follow redirects
            curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
            // set a location for curl to handle cookies
            curl_setopt($curl, CURLOPT_COOKIEJAR, "/tmp");
            // tell curl to mark this as a new cookie session
            curl_setopt($curl, CURLOPT_COOKIESESSION, true);
            // execute curl (download the uri to the temp file)
            curl_exec($curl);
            // close the curl resource
            curl_close($curl);
            // unset the curl resource
            unset($curl);
            // close the temp file and file resource
            fclose($hFile);
            // unset the file resource
            unset($hFile);
        }
        ?>
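
The temporary-file round trip described above can probably be avoided. What follows is a minimal sketch of a variant, assuming the cURL extension is loaded; `uri_fetch()` is a hypothetical name. It swaps `CURLOPT_FILE` for `CURLOPT_RETURNTRANSFER`, so the body comes back as a string, and swaps `CURLOPT_COOKIEJAR` for `CURLOPT_COOKIEFILE` with an empty string, which turns on cookie handling in memory without reading or writing a cookie file. The redirect cap of 10 is an arbitrary choice.

```php
<?php
// Hypothetical variant of uri_download(): returns the downloaded body as a
// string, or false on failure, instead of writing it to a file.
function uri_fetch(string $uri)
{
    $curl = curl_init($uri);
    if ($curl === false) {
        return false;
    }
    // return the body from curl_exec() instead of printing it
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    // do not include the HTTP headers in the returned string
    curl_setopt($curl, CURLOPT_HEADER, false);
    // follow redirects, but give up after 10 hops (arbitrary limit)
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($curl, CURLOPT_MAXREDIRS, 10);
    // an empty string enables the cookie engine in memory only, so cookies
    // set by one response are sent with the redirected request
    curl_setopt($curl, CURLOPT_COOKIEFILE, '');
    // curl_exec() returns the body on success and false on failure;
    // the handle is released when $curl goes out of scope
    return curl_exec($curl);
}
```

Under these assumptions, `$html = uri_fetch($uri);` would replace the tempnam(), uri_download(), file_get_contents(), and unlink() sequence; check the result against `false` before parsing it.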

