urllib2 - iteration over non-sequence

  • rplobue@gmail.com

    urllib2 - iteration over non-sequence

    I'm trying to get urllib2 to work on my server, which runs Python
    2.2.1. When I run the following code:


    import urllib2
    for line in urllib2.urlopen('www.google.com'):
        print line


    I always get the error:
    Traceback (most recent call last):
    File "<stdin>", line 1, in ?
    TypeError: iteration over non-sequence


    Anyone have any answers?

  • Larry Bates

    #2
    Re: urllib2 - iteration over non-sequence

    rplobue@gmail.com wrote:
    > I'm trying to get urllib2 to work on my server which runs python
    > 2.2.1. When I run the following code:
    >
    > import urllib2
    > for line in urllib2.urlopen('www.google.com'):
    >     print line
    >
    > I will always get the error:
    > Traceback (most recent call last):
    > File "<stdin>", line 1, in ?
    > TypeError: iteration over non-sequence
    >
    > Anyone have any answers?
    I ran your code:
    >>> import urllib2
    >>> urllib2.urlopen('www.google.com')
    Traceback (most recent call last):
      File "<interactive input>", line 1, in <module>
      File "C:\Python25\lib\urllib2.py", line 121, in urlopen
        return _opener.open(url, data)
      File "C:\Python25\lib\urllib2.py", line 366, in open
        protocol = req.get_type()
      File "C:\Python25\lib\urllib2.py", line 241, in get_type
        raise ValueError, "unknown url type: %s" % self.__original
    ValueError: unknown url type: www.google.com

    Note the traceback.

    You need to call it with the scheme (type) in front of the URL:
    >>> import urllib2
    >>> urllib2.urlopen('http://www.google.com')
    <addinfourl at 27659320 whose fp = <socket._fileobject object at 0x01A51F48>>

    Python's interactive mode is very useful for tracking down this type
    of problem.

    -Larry
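A quick sketch of that idea: if a URL may or may not carry a scheme, you can prepend one before calling urlopen. (ensure_scheme is a name invented for this post, and the sketch uses the modern urllib.parse spelling for illustration; under Python 2 the same urlparse function lives in the urlparse module.)

```python
from urllib.parse import urlparse  # Python 2: from urlparse import urlparse

def ensure_scheme(url, default='http'):
    # Prepend a scheme only when the URL does not already have one,
    # so urlopen() will not raise "unknown url type".
    if urlparse(url).scheme:
        return url
    return '%s://%s' % (default, url)

print(ensure_scheme('www.google.com'))      # http://www.google.com
print(ensure_scheme('http://example.com'))  # http://example.com
```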


    • rplobue@gmail.com

      #3
      Re: urllib2 - iteration over non-sequence

      Thanks for the reply Larry, but I am still having trouble. If I
      understand you correctly, you are just suggesting that I add an http://
      in front of the address? However, when I run this:
      >>> import urllib2
      >>> site = urllib2.urlopen('http://www.google.com')
      >>> for line in site:
      ...     print line
      I am still getting the message:

      TypeError: iteration over non-sequence
      File "<stdin>", line 1
      TypeError: iteration over non-sequence


      • Gary Herron

        #4
        Re: urllib2 - iteration over non-sequence

        rplobue@gmail.com wrote:
        > Thanks for the reply Larry, but I am still having trouble. If I
        > understand you correctly, you are just suggesting that I add an http://
        > in front of the address? However, when I run this:
        >
        > >>> import urllib2
        > >>> site = urllib2.urlopen('http://www.google.com')
        > >>> for line in site:
        > ...     print line
        >
        > I am still getting the message:
        >
        > TypeError: iteration over non-sequence
        > File "<stdin>", line 1
        > TypeError: iteration over non-sequence
        Newer versions of Python implement an iterator that *reads* the
        contents of a file object and supplies the lines to you
        one-by-one in a loop. However, you explicitly said which version of
        Python you are using, and it predates generators/iterators.

        So... You must explicitly read the contents of the file-like object
        yourself, and loop through the lines yourself. However, fear not --
        it's easy. The socket._fileobject object provides a method "readlines"
        that reads the *entire* contents of the object and returns a list of
        lines. And you can iterate through that list of lines. Like this:

        import urllib2
        url = urllib2.urlopen('http://www.google.com')
        for line in url.readlines():
            print line
        url.close()


        Gary Herron







        • Erik Max Francis

          #5
          Re: urllib2 - iteration over non-sequence

          Gary Herron wrote:
          > So... You must explicitly read the contents of the file-like object
          > yourself, and loop through the lines yourself. However, fear not --
          > it's easy. The socket._fileobject object provides a method "readlines"
          > that reads the *entire* contents of the object, and returns a list of
          > lines. And you can iterate through that list of lines. Like this:
          >
          > import urllib2
          > url = urllib2.urlopen('http://www.google.com')
          > for line in url.readlines():
          >     print line
          > url.close()
          This is really wasteful, as there's no point in reading in the whole
          file before iterating over it. To get the same effect as file iteration
          in later versions, use the .xreadlines method::

          for line in aFile.xreadlines():
              ...

          --
          Erik Max Francis && max@alcyone.com && http://www.alcyone.com/max/
          San Jose, CA, USA && 37 20 N 121 53 W && AIM, Y!M erikmaxfrancis
          If you flee from terror, then terror continues to chase you.
          -- Benjamin Netanyahu

          Comment

          • Paul Rubin

            #6
            Re: urllib2 - iteration over non-sequence

            Erik Max Francis <max@alcyone.com> writes:
            > This is really wasteful, as there's no point in reading in the whole
            > file before iterating over it. To get the same effect as file
            > iteration in later versions, use the .xreadlines method::
            >
            > for line in aFile.xreadlines():
            >     ...
            Ehhh, a heck of a lot of web pages don't have any newlines, so you end
            up getting the whole file anyway, with that method. Something like

            for line in iter(lambda: aFile.read(4096), ''): ...

            may be best.
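To see concretely what the two-argument iter() form does: iter(callable, sentinel) keeps calling the callable until it returns the sentinel, so read(4096) runs until EOF produces the empty string. A runnable sketch (io.StringIO stands in for the urlopen() response here, and iter_chunks is just a name coined for illustration):

```python
import io

def iter_chunks(fileobj, size=4096):
    # iter(callable, sentinel): calls fileobj.read(size) repeatedly,
    # stopping as soon as it returns '' (end of file).
    return iter(lambda: fileobj.read(size), '')

f = io.StringIO('x' * 10000)  # stand-in for a urlopen() response
sizes = [len(chunk) for chunk in iter_chunks(f)]
print(sizes)  # [4096, 4096, 1808]
```

This keeps at most one chunk in memory at a time, regardless of page size.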


            • Gary Herron

              #7
              Re: urllib2 - iteration over non-sequence

              Paul Rubin wrote:
              > Erik Max Francis <max@alcyone.com> writes:
              >> This is really wasteful, as there's no point in reading in the whole
              >> file before iterating over it. To get the same effect as file
              >> iteration in later versions, use the .xreadlines method::
              >>
              >> for line in aFile.xreadlines():
              >>     ...
              >
              > Ehhh, a heck of a lot of web pages don't have any newlines, so you end
              > up getting the whole file anyway, with that method. Something like
              >
              > for line in iter(lambda: aFile.read(4096), ''): ...
              >
              > may be best.
              Certainly there are cases where xreadlines or read(bytecount) are
              reasonable, but only if the total page size is *very* large. But for
              most web pages, you guys are just nit-picking (or showing off) to
              suggest that the full read implemented by readlines is wasteful.
              Moreover, the original problem was with sockets -- which don't have
              xreadlines. That seems to be a method on regular file objects.

              For simplicity, I'd still suggest my original use of readlines. If
              and when you find you are downloading web pages with sizes that are
              putting a serious strain on your memory footprint, then one of the other
              suggestions might be indicated.

              Gary Herron





              • Paul Rubin

                #8
                Re: urllib2 - iteration over non-sequence

                Gary Herron <gherron@islandtraining.com> writes:
                > For simplicity, I'd still suggest my original use of readlines. If
                > and when you find you are downloading web pages with sizes that are
                > putting a serious strain on your memory footprint, then one of the other
                > suggestions might be indicated.
                If you know in advance that the page you're retrieving will be
                reasonable in size, then using readlines is fine. If you don't know
                in advance what you're retrieving (e.g. you're working on a crawler)
                you have to assume that you'll hit some very large pages with
                difficult construction.



                • Erik Max Francis

                  #9
                  Re: urllib2 - iteration over non-sequence

                  Gary Herron wrote:
                  > Certainly there are cases where xreadlines or read(bytecount) are
                  > reasonable, but only if the total page size is *very* large. But for
                  > most web pages, you guys are just nit-picking (or showing off) to
                  > suggest that the full read implemented by readlines is wasteful.
                  > Moreover, the original problem was with sockets -- which don't have
                  > xreadlines. That seems to be a method on regular file objects.
                  >
                  > For simplicity, I'd still suggest my original use of readlines. If
                  > and when you find you are downloading web pages with sizes that are
                  > putting a serious strain on your memory footprint, then one of the other
                  > suggestions might be indicated.
                  It isn't nitpicking to point out that you're making something that will
                  consume vastly more memory than it could possibly need. And insisting
                  that pages aren't _always_ huge is just a silly cop-out; of course pages
                  get very large.

                  There is absolutely no reason to read the entire file into memory (which
                  is what you're doing) before processing it. This is a good example of
                  the principle that there is one obvious right way to do it -- and it
                  isn't to read the whole thing in first for no reason whatsoever other
                  than to avoid an `x`.

                  --
                  Erik Max Francis && max@alcyone.com && http://www.alcyone.com/max/
                  San Jose, CA, USA && 37 20 N 121 53 W && AIM, Y!M erikmaxfrancis
                  The more violent the love, the more violent the anger.
                  -- _Burmese Proverbs_ (tr. Hla Pe)


                  • Erik Max Francis

                    #10
                    Re: urllib2 - iteration over non-sequence

                    Paul Rubin wrote:
                    > If you know in advance that the page you're retrieving will be
                    > reasonable in size, then using readlines is fine. If you don't know
                    > in advance what you're retrieving (e.g. you're working on a crawler)
                    > you have to assume that you'll hit some very large pages with
                    > difficult construction.
                    And that's before you even mention the point that, depending on the
                    application, you could easily open yourself up to a DoS attack.

                    There's premature optimization, and then there's premature completely
                    obvious and pointless waste. This falls in the latter category.

                    Besides, someone was asking for/needing an older equivalent to iterating
                    over a file. That's obviously .xreadlines, not .readlines.

                    --
                    Erik Max Francis && max@alcyone.com && http://www.alcyone.com/max/
                    San Jose, CA, USA && 37 20 N 121 53 W && AIM, Y!M erikmaxfrancis
                    The more violent the love, the more violent the anger.
                    -- _Burmese Proverbs_ (tr. Hla Pe)


                    • Gabriel Genellina

                      #11
                      Re: urllib2 - iteration over non-sequence

                      On Sun, 10 Jun 2007 02:54:47 -0300, Erik Max Francis <max@alcyone.com>
                      wrote:
                      > Gary Herron wrote:
                      >> Certainly there are cases where xreadlines or read(bytecount) are
                      >> reasonable, but only if the total page size is *very* large. But for
                      >> most web pages, you guys are just nit-picking (or showing off) to
                      >> suggest that the full read implemented by readlines is wasteful.
                      >> Moreover, the original problem was with sockets -- which don't have
                      >> xreadlines. That seems to be a method on regular file objects.
                      >
                      > There is absolutely no reason to read the entire file into memory (which
                      > is what you're doing) before processing it. This is a good example of
                      > the principle of there is one obvious right way to do it -- and it isn't
                      > to read the whole thing in first for no reason whatsoever other than to
                      > avoid an `x`.
                      The problem is (and you appear not to have noticed this) that the object
                      returned by urlopen does NOT have an xreadlines() method; and even if it
                      did, a lot of pages don't contain any '\n', so using xreadlines would
                      read the whole page into memory anyway.

                      Python 2.2 (the version the OP is using) did include an xreadlines
                      module (now defunct), but in this case it is painfully slooooooooow --
                      perhaps it tries to read the source one character at a time.

                      So the best way would be to use (as Paul Rubin already said):

                      for line in iter(lambda: f.read(4096), ''): print line
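If line-at-a-time processing is still wanted on top of the bounded reads, the chunked idiom can be wrapped in a small buffering generator. This is a sketch only (iter_lines is a name invented here, written in modern Python; the buffering logic itself is version-independent):

```python
import io

def iter_lines(fileobj, size=4096):
    # Read in bounded chunks, buffering until complete lines appear.
    buf = ''
    for chunk in iter(lambda: fileobj.read(size), ''):
        buf += chunk
        while '\n' in buf:
            line, buf = buf.split('\n', 1)
            yield line
    if buf:
        yield buf  # trailing data with no final newline

f = io.StringIO('alpha\nbeta\ngamma')
print(list(iter_lines(f)))  # ['alpha', 'beta', 'gamma']
```

Note the caveat above still applies: a page with no '\n' at all accumulates entirely in the buffer before anything is yielded.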

                      --
                      Gabriel Genellina
