urllib.urlretrieve problem

  • Ritesh Raj Sarraf

    urllib.urlretrieve problem

    -----BEGIN PGP SIGNED MESSAGE-----
    Hash: SHA1

    Hello Everybody,

    I've got a small problem with urlretrieve.
    Even passing a bad URL to urlretrieve doesn't raise an exception. Or does
    it?

    If yes, which exception is it, and how do I use it in my program? I've
    searched a lot but haven't found anything helpful.

    Example:

        import urllib

        try:
            urllib.urlretrieve("http://security.debian.org/pool/updates/main/p/perl/libparl5.6_5.6.1-8.9_i386.deb")
        except IOError, X:
            DoSomething(X)
        except OSError, X:
            DoSomething(X)

    urllib.urlretrieve doesn't raise an exception even though there is no
    package named libparl5.6

    Please Help!

    rrs
    - --
    Ritesh Raj Sarraf
    RESEARCHUT -- http://www.researchut.com
    Gnupg Key ID: 04F130BC
    "Stealing logic from one person is plagiarism, stealing from many is
    research".
    -----BEGIN PGP SIGNATURE-----
    Version: GnuPG v1.2.5 (GNU/Linux)

    iD8DBQFCRcCk4Rh i6gTxMLwRAlb2AJ 0fB3V5ZpwdAiCxf l/rGBWU92YBEACdFY IJ
    8bGZMJ5nuKAqvjO 0KEAylUg=
    =eaHC
    -----END PGP SIGNATURE-----

  • Larry Bates

    #2
    Re: urllib.urlretrieve problem

    I noticed you hadn't gotten a reply. When I execute this it puts the following
    in the retrieved file:

    <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
    <HTML><HEAD>
    <TITLE>404 Not Found</TITLE>
    </HEAD><BODY>
    <H1>Not Found</H1>
    The requested URL /pool/updates/main/p/perl/libparl5.6_5.6.1-8.9_i386.deb was
    not found on this server.<P>
    </BODY></HTML>

    You will probably need to use something else to first determine if the URL
    actually exists.

    Larry Bates




    • gene.tani@gmail.com

      #3
      Re: urllib.urlretrieve problem

      Mertz's "Text Processing in Python" book had a good discussion about
      trapping 403s and 404s.





      • Ritesh Raj Sarraf

        #4
        Re: urllib.urlretrieve problem

        -----BEGIN PGP SIGNED MESSAGE-----
        Hash: SHA1

        Larry Bates wrote:

        > I noticed you hadn't gotten a reply.  When I execute this it puts the
        > following in the retrieved file:
        >
        > <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
        > <HTML><HEAD>
        > <TITLE>404 Not Found</TITLE>
        > </HEAD><BODY>
        > <H1>Not Found</H1>
        > The requested URL /pool/updates/main/p/perl/libparl5.6_5.6.1-8.9_i386.deb
        > was not found on this server.<P>
        > </BODY></HTML>
        >
        > You will probably need to use something else to first determine if the URL
        > actually exists.

        I'm happy that at least someone responded as this was my first post to the
        python mailing list.

        I'm coding a program for offline package management.
        The link that I provided could have been made obsolete by newer packages.
        That is where my problem is. I wanted to know how to get an exception
        raised here so that my program can act depending on the type of exception.

        For example, for a temporary name resolution failure, Python raises an
        exception, which I handle fine. The problem lies with obsolete URLs,
        where no exception is raised and I end up with a 404 error page as my
        data.

        Can we have an exception for that? Or can urllib.urlretrieve report some
        kind of exit status, so that I know whether it downloaded the desired file?
        I think my problem is fixable with urllib.urlopen; I just find
        urllib.urlretrieve more convenient and want to know if it can be done with
        it.

        Thanks for responding.

        rrs
        - --
        Ritesh Raj Sarraf
        RESEARCHUT -- http://www.researchut.com
        Gnupg Key ID: 04F130BC
        "Stealing logic from one person is plagiarism, stealing from many is
        research".
        -----BEGIN PGP SIGNATURE-----
        Version: GnuPG v1.2.5 (GNU/Linux)

        iD8DBQFCSuYS4Rh i6gTxMLwRAu0FAJ 9R0s4TyB7zHcvDF TflOp2joVkErQCf U4vG
        8U0Ah5WTdTQHKRk mPsZsHdE=
        =OMub
        -----END PGP SIGNATURE-----


        • Diez B. Roggisch

          #5
          Re: urllib.urlretrieve problem

          > I'm coding a program for offline package management.
          > The link that I provided could be obsolete by newer packages. That is
          > where my problem is. I wanted to know how to raise an exception here so
          > that depending on the type of exception I could make my program function.
          >
          > For example, for Temporary Name Resolution Failure, python raises an
          > exception which I've handled well. The problem lies with obsolete urls
          > where no exception is raised and I end up having a 404 error page as my
          > data.
          >
          > Can we have an exception for that ? Or can we have the exit status of
          > urllib.urlretrieve to know if it downloaded the desired file.
          > I think my problem is fixable in urllib.urlopen, I just find
          > urllib.urlretrieve more convenient and want to know if it can be done with
          > it.

          It makes no sense for urllib to raise an exception in such a case. From
          its point of view, things worked perfectly: it got a result, and there was
          no network error whatsoever.

          It's your application that is not happy with the result, but it has to
          figure that out by itself.

          You could, for instance, inspect the result with the Unix file command;
          it will tell you that you received an HTML file, not a deb.

          Or check the MIME type returned: it's text/html in your error case,
          and most probably something like application/octet-stream otherwise.

          Regards,

          Diez
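
The MIME type check suggested above could look roughly like this. It is a minimal sketch in Python 3 syntax, not code from this thread: `is_html_response` is a hypothetical helper name, and the headers object is assumed to be the second element of the `(filename, headers)` tuple that `urlretrieve` returns.

```python
from email.message import Message


def is_html_response(headers):
    """Return True if the response headers declare an HTML body.

    `headers` is expected to behave like email.message.Message, which is
    what Python 3's urllib.request.urlretrieve hands back for HTTP URLs.
    """
    return headers.get_content_type() == "text/html"


# Intended use with a real download (not executed here):
#   filename, headers = urllib.request.urlretrieve(url)
#   if is_html_response(headers):
#       ...treat the download as failed...

# Offline demonstration with hand-built headers.
error_page = Message()
error_page["Content-Type"] = "text/html; charset=iso-8859-1"

package = Message()
package["Content-Type"] = "application/octet-stream"

print(is_html_response(error_page), is_html_response(package))
```

The check only looks at what the server claims to be sending, so it is a cheap first filter rather than proof that the file is a valid package.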


          • Skip Montanaro

            #6
            Re: urllib.urlretrieve problem

            >> For example, for Temporary Name Resolution Failure, python raises an
            >> exception which I've handled well. The problem lies with obsolete
            >> urls where no exception is raised and I end up having a 404 error
            >> page as my data.

            Diez> It makes no sense having urllib generating exceptions for such a
            Diez> case. From its point of view, things work perfectly - it got a
            Diez> result. No network error or whatsoever.

            You can subclass FancyURLopener and define a method to handle 404s, 403s,
            401s, etc. There should be no need to resort to grubbing around with file
            extensions and such.

            Skip


            • gene.tani@gmail.com

              #7
              Re: urllib.urlretrieve problem

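              from urllib2 import urlopen

              # someURL is a placeholder for the URL you want to test.
              try:
                  urlopen(someURL)
              except IOError, errobj:
                  # URLError sets 'reason' (no connection); HTTPError sets 'code' (e.g. 404).
                  if hasattr(errobj, 'reason'):
                      print "server doesn't exist, is down, DNS problem, or we don't have an internet connection"
                  if hasattr(errobj, 'code'):
                      print errobj.code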
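
For readers on Python 3, the same idea might be sketched as follows. There, `urllib.request.urlretrieve` raises `urllib.error.HTTPError` for a 404, so the failure can be classified by exception type. The helper names `describe_failure` and `fetch` are hypothetical and only illustrate the pattern.

```python
import urllib.error
import urllib.request


def describe_failure(exc):
    """Turn a urllib exception into a short, human-readable message."""
    # HTTPError is a subclass of URLError, so it has to be tested first.
    if isinstance(exc, urllib.error.HTTPError):
        return "server answered with HTTP status %d" % exc.code
    if isinstance(exc, urllib.error.URLError):
        return "could not reach server: %s" % (exc.reason,)
    return "unexpected error: %r" % (exc,)


def fetch(url, filename):
    """Download url to filename; return None on success, else a message."""
    try:
        urllib.request.urlretrieve(url, filename)
    except urllib.error.URLError as exc:
        return describe_failure(exc)
    return None
```

Checking the exception type rather than using `hasattr` avoids reporting an HTTP error twice, because in Python 3 an `HTTPError` carries both a `code` and a `reason`.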


              • Wade

                #8
                Re: urllib.urlretrieve problem


                Diez B. Roggisch wrote:
                > It makes no sense having urllib generating exceptions for such a case.
                > From its point of view, things work perfectly - it got a result. No
                > network error or whatsoever.
                >
                > It's your application that is not happy with the result - but it has to
                > figure that out by itself.
                >
                > You could for instance try and see what kind of result you got using the
                > unix file command - it will tell you that you received a html file, not a
                > deb.
                >
                > Or check the mimetype returned - it's text/html in the error case of yours,
                > and most probably something like application/octet-stream otherwise.
                >
                > Regards,
                >
                > Diez

                Also be aware that many webservers (especially IIS ones) are configured
                to return some kind of custom page instead of a stock 404, and you
                might be getting a 200 status code even though the page you requested
                is not there. So depending on what site you are scraping, you might
                have to read the page you got back to figure out if it's what you
                wanted.

                -- Wade Leftwich
                Ithaca, NY
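
Reading back what was downloaded can be done portably by looking at the first bytes of the file. The sketch below (Python 3 syntax) relies on the fact that a .deb package is an `ar` archive, which starts with the eight bytes `!<arch>` plus a newline. The function names are hypothetical, and the check only rules out obvious error pages; it does not validate the package.

```python
AR_MAGIC = b"!<arch>\n"  # global header of an ar archive, the container used by .deb


def has_ar_magic(data):
    """Return True if the given bytes start with the ar archive signature."""
    return data.startswith(AR_MAGIC)


def looks_like_deb(path):
    """Return True if the file at `path` starts with the ar signature."""
    with open(path, "rb") as fh:
        return has_ar_magic(fh.read(len(AR_MAGIC)))


# Offline demonstration on in-memory data.
print(has_ar_magic(b"!<arch>\ndebian-binary   "))
print(has_ar_magic(b"<!DOCTYPE HTML PUBLIC"))
```

Because it uses only built-in file I/O, this works the same on Windows, Mac OS X and Linux, and it still catches a custom "not found" page served with a 200 status.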


                • Ritesh Raj Sarraf

                  #9
                  Re: urllib.urlretrieve problem

                  -----BEGIN PGP SIGNED MESSAGE-----
                  Hash: SHA1

                  Diez B. Roggisch wrote:

                  > You could for instance try and see what kind of result you got using the
                  > unix file command - it will tell you that you received a html file, not a
                  > deb.
                  >
                  > Or check the mimetype returned - it's text/html in the error case of yours,
                  > and most probably something like application/octet-stream otherwise.

                  Using the Unix file command is not possible at all. The whole goal of the
                  program is to help people get their packages downloaded from some other
                  (high speed) machine which could be running Windows, Mac OS X, Linux, et
                  cetera. That is why I'm sticking strictly to Python libraries.

                  The second suggestion sounds good. I'll look into that.

                  Thanks,

                  rrs
                  - --
                  Ritesh Raj Sarraf
                  RESEARCHUT -- http://www.researchut.com
                  Gnupg Key ID: 04F130BC
                  "Stealing logic from one person is plagiarism, stealing from many is
                  research".
                  -----BEGIN PGP SIGNATURE-----
                  Version: GnuPG v1.2.5 (GNU/Linux)

                  iD8DBQFCTDhV4Rh i6gTxMLwRAi2BAJ 4zp7IsQNMZ1zqpF/hGUAjUyYwKigCeK aqO
                  FbGuuFOIHawZ8y/ICf87wOI=
                  =btA5
                  -----END PGP SIGNATURE-----
