read all available pages on a Website

  • Brad Tilley

    read all available pages on a Website

    Is there a way to make urllib or urllib2 read all of the pages on a Web
    site? For example, say I wanted to read each page of www.python.org into
    separate strings (a string for each page). The problem is that I don't
    know how many pages are at www.python.org. How can I handle this?

    Thanks,

    Brad
  • Tim Roberts

    #2
    Re: read all available pages on a Website

    Brad Tilley <bradtilley@usa.net> wrote:

    > Is there a way to make urllib or urllib2 read all of the pages on a Web
    > site? For example, say I wanted to read each page of www.python.org into
    > separate strings (a string for each page). The problem is that I don't
    > know how many pages are at www.python.org. How can I handle this?

    You have to parse the HTML to pull out all the links and images and fetch
    them, one by one. sgmllib can help with the parsing. You can multithread
    this, if performance is an issue.
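    A minimal sketch of that link-extraction step. (sgmllib was removed in
    Python 3; the sketch below uses html.parser, its modern stdlib
    counterpart, which works the same way for this purpose.)

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag and the src of every <img> tag."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            self.links.append(attrs["href"])
        elif tag == "img" and "src" in attrs:
            self.links.append(attrs["src"])

# A small stand-in page; a real crawler would feed the body of each
# urllib response here instead.
page = '<a href="/about/">About</a> <img src="/images/python-logo.gif">'
parser = LinkExtractor()
parser.feed(page)
print(parser.links)  # ['/about/', '/images/python-logo.gif']
```

    Each extracted link would then be fetched in turn (resolving relative
    URLs against the page's own URL first).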

    By the way, there are many web sites for which this sort of behavior is not
    welcome.
    --
    - Tim Roberts, timr@probo.com
    Providenza & Boekelheide, Inc.


    • Leif K-Brooks

      #3
      Re: read all available pages on a Website

      Tim Roberts wrote:
      > Brad Tilley <bradtilley@usa.net> wrote:
      >
      >> Is there a way to make urllib or urllib2 read all of the pages on a Web
      >> site?
      > By the way, there are many web sites for which this sort of behavior is not
      > welcome.

      Any site that didn't want to be crawled would most likely use a
      robots.txt file, so you could check that before doing the crawl.
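      Checking robots.txt is straightforward with the stdlib robotparser
      module (urllib.robotparser in Python 3). A sketch, parsing an example
      rules file inline rather than fetching a real one:

```python
from urllib.robotparser import RobotFileParser

# Normally you would call rp.set_url("http://www.python.org/robots.txt")
# followed by rp.read(); here the rules are supplied inline for illustration.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

print(rp.can_fetch("*", "http://www.python.org/about/"))     # True
print(rp.can_fetch("*", "http://www.python.org/private/x"))  # False
```

      A polite crawler calls can_fetch() before every request and skips any
      URL the rules disallow.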


      • Alex Martelli

        #4
        Re: read all available pages on a Website

        Leif K-Brooks <eurleif@ecritters.biz> wrote:

        > Tim Roberts wrote:
        > > Brad Tilley <bradtilley@usa.net> wrote:
        > >
        > >> Is there a way to make urllib or urllib2 read all of the pages on a Web
        > >> site?
        > > By the way, there are many web sites for which this sort of behavior is not
        > > welcome.
        >
        > Any site that didn't want to be crawled would most likely use a
        > robots.txt file, so you could check that before doing the crawl.

        Python's Tools/webchecker/ directory has just the code you need for all
        of this. The directory is part of the Python source distribution, but
        it's all pure Python code, so, if your distribution is binary and omits
        that directory, just download the Python source distribution, unpack it,
        and there you are.


        Alex


        • Carlos Ribeiro

          #5
          Re: read all available pages on a Website

          Brad,

          Just to clarify something other posters have said. Automatic crawling
          of websites is not welcome primarily because of performance concerns.
          It also may be regarded by some webmasters as a kind of abuse, because
          the crawler is doing 'hits' and copying material for unknown reasons,
          but is not seeing any ads or generating revenue. Some sites even go to
          the extent of blocking access from your IP, or even from your entire IP
          range, when they detect this type of behavior. Because of this, there
          is a very simple protocol involving a file called "robots.txt". Whenever
          your robot first enters a site, it must check this file and follow
          the instructions there. It will tell you what you can do on that
          website.

          There are also a few other catches that you need to be aware of. First,
          some sites don't have links pointing to all their pages, so it's never
          possible to be completely sure about having read *all* pages. Also,
          some sites have links embedded in scripts. It's not a recommended
          practice, but it's common on some sites, and it may cause you
          problems. And finally, there are situations where your robot may get
          stuck in an "infinite site"; that's because some sites generate
          pages dynamically, and your robot may end up fetching page after page
          and never get out of the site. So, if you want a generic solution to
          crawl any site you desire, you have to check out these issues.
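          A visited set plus a hard page limit are the usual defenses against
          both re-fetching and infinite sites. A sketch, with a stand-in
          fetch/parse function over an in-memory "site" (a real crawler would
          fetch each URL with urllib and extract its links):

```python
def crawl(start, get_links, max_pages=100):
    """Breadth-first crawl that never revisits a URL and stops at max_pages."""
    visited = []
    queue = [start]
    while queue and len(visited) < max_pages:
        url = queue.pop(0)
        if url in visited:
            continue
        visited.append(url)
        for link in get_links(url):
            if link not in visited:
                queue.append(link)
    return visited

# A toy in-memory "site" standing in for real HTTP fetches; note the
# cycle between /a and /b, which the visited bookkeeping breaks.
site = {"/": ["/a", "/b"], "/a": ["/b"], "/b": ["/a"]}
print(crawl("/", lambda u: site.get(u, [])))  # ['/', '/a', '/b']
```

          The max_pages cap is what saves you on a dynamically generated
          "infinite site": the crawl simply stops instead of fetching forever.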


          Best regards,


          --
          Carlos Ribeiro
          Consultoria em Projetos
          blog: http://rascunhosrotos.blogspot.com
          blog: http://pythonnotes.blogspot.com
          mail: carribeiro@gmail.com
          mail: carribeiro@yahoo.com


          • Michael Foord

            #6
            Re: read all available pages on a Website

            Brad Tilley <bradtilley@usa.net> wrote in message news:<ci2qnl$2jq$1@solaris.cc.vt.edu>...

            > Is there a way to make urllib or urllib2 read all of the pages on a Web
            > site? For example, say I wanted to read each page of www.python.org into
            > separate strings (a string for each page). The problem is that I don't
            > know how many pages are at www.python.org. How can I handle this?
            >
            > Thanks,
            >
            > Brad

            I can highly recommend the BeautifulSoup parser for helping you to
            extract all the links - should make it a doddle. (You want to check
            that you only follow links that are in www.python.org of course - the
            standard library urlparse should help with that.)
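            That urlparse check amounts to comparing hostnames; urljoin also
            resolves the relative links a parser hands you. A sketch (the
            module is urllib.parse in Python 3):

```python
from urllib.parse import urljoin, urlparse

base = "http://www.python.org/doc/"

def same_site(link):
    """Resolve link against base and keep it only if it stays on the same host."""
    absolute = urljoin(base, link)
    return urlparse(absolute).netloc == urlparse(base).netloc

print(same_site("/about/"))              # True  (relative link, same host)
print(same_site("http://example.com/"))  # False (different host)
```

            Filtering links this way before queueing them keeps the crawl
            from wandering off www.python.org.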

            Regards,


            Fuzzy



            • Brad Tilley

              #7
              Re: read all available pages on a Website

              Alex Martelli wrote:
              > Leif K-Brooks <eurleif@ecritters.biz> wrote:
              >
              >> Tim Roberts wrote:
              >>
              >>> Brad Tilley <bradtilley@usa.net> wrote:
              >>>
              >>>> Is there a way to make urllib or urllib2 read all of the pages on a Web
              >>>> site?
              >>>
              >>> By the way, there are many web sites for which this sort of behavior is not
              >>> welcome.
              >>
              >> Any site that didn't want to be crawled would most likely use a
              >> robots.txt file, so you could check that before doing the crawl.
              >
              > Python's Tools/webchecker/ directory has just the code you need for all
              > of this. The directory is part of the Python source distribution, but
              > it's all pure Python code, so, if your distribution is binary and omits
              > that directory, just download the Python source distribution, unpack it,
              > and there you are.
              >
              > Alex

              Thank you, this is ideal.
