read all available pages on a Website

  • Brad Tilley

    read all available pages on a Website

    Is there a way to make urllib or urllib2 read all of the pages on a Web
    site? For example, say I wanted to read each page of www.python.org into
    separate strings (a string for each page). The problem is that I don't
    know how many pages are at www.python.org. How can I handle this?

    Thanks,

    Brad
  • Tim Roberts

    #2
    Re: read all available pages on a Website

    Brad Tilley <bradtilley@usa.net> wrote:

    > Is there a way to make urllib or urllib2 read all of the pages on a Web
    > site? For example, say I wanted to read each page of www.python.org into
    > separate strings (a string for each page). The problem is that I don't
    > know how many pages are at www.python.org. How can I handle this?

    You have to parse the HTML to pull out all the links and images and fetch
    them, one by one. sgmllib can help with the parsing. You can multithread
    this, if performance is an issue.
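    A minimal sketch of that link-extraction step. (sgmllib was removed in
    Python 3; the sketch below uses html.parser, its modern stdlib
    counterpart, which works the same way for this purpose.)

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag and the src of every <img> tag."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            self.links.append(attrs["href"])
        elif tag == "img" and "src" in attrs:
            self.links.append(attrs["src"])

# A small stand-in page; a real crawler would feed the body of each
# urllib response here instead.
page = '<a href="/about/">About</a> <img src="/images/python-logo.gif">'
parser = LinkExtractor()
parser.feed(page)
print(parser.links)  # ['/about/', '/images/python-logo.gif']
```

    Each extracted link would then be fetched in turn (resolving relative
    URLs against the page's own URL first).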

    By the way, there are many web sites for which this sort of behavior is not
    welcome.
    --
    - Tim Roberts, timr@probo.com
    Providenza & Boekelheide, Inc.


    • Leif K-Brooks

      #3
      Re: read all available pages on a Website

      Tim Roberts wrote:
      > Brad Tilley <bradtilley@usa.net> wrote:
      >
      >> Is there a way to make urllib or urllib2 read all of the pages on a Web
      >> site?
      > By the way, there are many web sites for which this sort of behavior is not
      > welcome.

      Any site that didn't want to be crawled would most likely use a
      robots.txt file, so you could check that before doing the crawl.
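      Checking robots.txt is straightforward with the stdlib robotparser
      module (urllib.robotparser in Python 3). A sketch, parsing an example
      rules file inline rather than fetching a real one:

```python
from urllib.robotparser import RobotFileParser

# Normally you would call rp.set_url("http://www.python.org/robots.txt")
# followed by rp.read(); here the rules are supplied inline for illustration.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

print(rp.can_fetch("*", "http://www.python.org/about/"))     # True
print(rp.can_fetch("*", "http://www.python.org/private/x"))  # False
```

      A polite crawler calls can_fetch() before every request and skips any
      URL the rules disallow.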


      • Alex Martelli

        #4
        Re: read all available pages on a Website

        Leif K-Brooks <eurleif@ecritters.biz> wrote:

        > Tim Roberts wrote:
        > > Brad Tilley <bradtilley@usa.net> wrote:
        > >
        > >> Is there a way to make urllib or urllib2 read all of the pages on a Web
        > >> site?
        > > By the way, there are many web sites for which this sort of behavior is not
        > > welcome.
        >
        > Any site that didn't want to be crawled would most likely use a
        > robots.txt file, so you could check that before doing the crawl.

        Python's Tools/webchecker/ directory has just the code you need for all
        of this. The directory is part of the Python source distribution, but
        it's all pure Python code, so, if your distribution is binary and omits
        that directory, just download the Python source distribution, unpack it,
        and there you are.


        Alex


        • Carlos Ribeiro

          #5
          Re: read all available pages on a Website

          Brad,

          Just to clarify something other posters have said. Automatic crawling
          of websites is not welcome primarily because of performance concerns.
          It also may be regarded by some webmasters as a kind of abuse, because
          the crawler is doing 'hits' and copying material for unknown reasons,
          but is not seeing any ads or generating revenue. Some sites even go to
          the extent of blocking access from your IP, or even from your entire IP
          range, when they detect this type of behavior. Because of this, there
          is a very simple protocol involving a file called "robots.txt". Whenever
          your robot first enters a site, it must check this file and follow
          the instructions there. It will tell you what you can do on that
          website.

          There are also a few other catches that you need to be aware of. First,
          some sites don't have links pointing to all their pages, so it's never
          possible to be completely sure about having read *all* pages. Also,
          some sites have links embedded in scripts. It's not a recommended
          practice, but it's common on some sites, and it may cause you
          problems. And finally, there are situations where your robot may get
          stuck in an "infinite site"; that's because some sites generate
          pages dynamically, and your robot may end up fetching page after page
          and never get out of the site. So, if you want a generic solution to
          crawl any site you desire, you have to check out these issues.
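          A visited set plus a hard page limit are the usual defenses against
          both re-fetching and infinite sites. A sketch, with a stand-in
          fetch/parse function over an in-memory "site" (a real crawler would
          fetch each URL with urllib and extract its links):

```python
def crawl(start, get_links, max_pages=100):
    """Breadth-first crawl that never revisits a URL and stops at max_pages."""
    visited = []
    queue = [start]
    while queue and len(visited) < max_pages:
        url = queue.pop(0)
        if url in visited:
            continue
        visited.append(url)
        for link in get_links(url):
            if link not in visited:
                queue.append(link)
    return visited

# A toy in-memory "site" standing in for real HTTP fetches; note the
# cycle between /a and /b, which the visited bookkeeping breaks.
site = {"/": ["/a", "/b"], "/a": ["/b"], "/b": ["/a"]}
print(crawl("/", lambda u: site.get(u, [])))  # ['/', '/a', '/b']
```

          The max_pages cap is what saves you on a dynamically generated
          "infinite site": the crawl simply stops instead of fetching forever.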


          Best regards,


          --
          Carlos Ribeiro
          Consultoria em Projetos
          blog: http://rascunhosrotos.blogspot.com
          blog: http://pythonnotes.blogspot.com
          mail: carribeiro@gmail.com
          mail: carribeiro@yahoo.com


          • Michael Foord

            #6
            Re: read all available pages on a Website

            Brad Tilley <bradtilley@usa.net> wrote in message news:<ci2qnl$2jq$1@solaris.cc.vt.edu>...

            > Is there a way to make urllib or urllib2 read all of the pages on a Web
            > site? For example, say I wanted to read each page of www.python.org into
            > separate strings (a string for each page). The problem is that I don't
            > know how many pages are at www.python.org. How can I handle this?
            >
            > Thanks,
            >
            > Brad

            I can highly recommend the BeautifulSoup parser for helping you to
            extract all the links - should make it a doddle. (You want to check
            that you only follow links that are in www.python.org of course - the
            standard library urlparse should help with that.)
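            That urlparse check amounts to comparing hostnames; urljoin also
            resolves the relative links a parser hands you. A sketch (the
            module is urllib.parse in Python 3):

```python
from urllib.parse import urljoin, urlparse

base = "http://www.python.org/doc/"

def same_site(link):
    """Resolve link against base and keep it only if it stays on the same host."""
    absolute = urljoin(base, link)
    return urlparse(absolute).netloc == urlparse(base).netloc

print(same_site("/about/"))              # True  (relative link, same host)
print(same_site("http://example.com/"))  # False (different host)
```

            Filtering links this way before queueing them keeps the crawl
            from wandering off www.python.org.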

            Regards,


            Fuzzy



            • Brad Tilley

              #7
              Re: read all available pages on a Website

              Alex Martelli wrote:
              > Leif K-Brooks <eurleif@ecritters.biz> wrote:
              >
              >> Tim Roberts wrote:
              >>
              >>> Brad Tilley <bradtilley@usa.net> wrote:
              >>>
              >>>> Is there a way to make urllib or urllib2 read all of the pages on a Web
              >>>> site?
              >>>
              >>> By the way, there are many web sites for which this sort of behavior is not
              >>> welcome.
              >>
              >> Any site that didn't want to be crawled would most likely use a
              >> robots.txt file, so you could check that before doing the crawl.
              >
              > Python's Tools/webchecker/ directory has just the code you need for all
              > of this. The directory is part of the Python source distribution, but
              > it's all pure Python code, so, if your distribution is binary and omits
              > that directory, just download the Python source distribution, unpack it,
              > and there you are.
              >
              > Alex

              Thank you, this is ideal.
