Re: which datastructure for fast sorted insert?

**Gabriel Genellina** · Jun 27 '08, 04:26 PM

On Sun, 25 May 2008 22:42:06 -0300, <notnorwegian@yahoo.se> wrote:

    def joinSets(set1, set2):
        for i in set2:
            set1.add(i)
        return set1

Use the | operator (union), or |= to update the set in place.
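For reference, a small sketch of how the two operators differ. The set contents are made up for illustration, and `alias` is only there to show that `|=` keeps the same object.

```python
set1 = set(["a", "b"])
set2 = set(["b", "c"])

# | builds a new set and leaves both operands untouched
merged = set1 | set2

# |= updates set1 in place, which is what joinSets did by hand
alias = set1
set1 |= set2
```

After this runs, `merged` and `set1` hold the same three elements, and `alias` still refers to the updated `set1`.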

    Traceback (most recent call last):
      File "C:/Python25/Progs/WebCrawler/spider2.py", line 47, in <module>
        x = scrapeSites("http://www.yahoo.com")
      File "C:/Python25/Progs/WebCrawler/spider2.py", line 31, in scrapeSites
        site = iterator.next()
    RuntimeError: Set changed size during iteration

You will need two sets: the one you're iterating over, and another collecting new URLs. Once you finish iterating over the first, continue with the new ones; stop when no new URLs are left.

    def scrapeSites(startAddress):
        site = startAddress
        sites = set()
        iterator = iter(sites)
        pos = 0
        while pos < 10:#len(sites):
            newsites = scrapeSite(site)
            joinSets(sites, newsites)
            pos += 1
            site = iterator.next()
        return sites

Try this (untested):

    def scrapeSites(startAddress):
        allsites = set() # all links found so far
        pending = set([startAddress]) # pending sites to examine
        while pending:
            newsites = set() # new links
            for site in pending:
                newsites |= scrapeSite(site)
            pending = newsites - allsites
            allsites |= newsites
        return allsites
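To try the two-set loop without touching the network, here is a sketch that runs it against a tiny in-memory link graph. `LINKS` and this stub `scrapeSite` are assumptions standing in for the real scraper, which is not shown in the thread.

```python
# Hypothetical link graph standing in for the web: page -> links found on it
LINKS = {
    "a": set(["b", "c"]),
    "b": set(["a", "d"]),
    "c": set(["d"]),
    "d": set(),
}

def scrapeSite(site):
    # Stand-in for the real scraper; returns the set of links on `site`
    return set(LINKS.get(site, ()))

def scrapeSites(startAddress):
    allsites = set()                 # all links found so far
    pending = set([startAddress])    # sites still to examine
    while pending:
        newsites = set()
        for site in pending:         # pending is never modified inside this loop
            newsites |= scrapeSite(site)
        pending = newsites - allsites
        allsites |= newsites
    return allsites

found = scrapeSites("a")
```

Note that the start address is not in `allsites` at first, so it appears in the result only if some page links back to it, and in that case it is scraped a second time.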

wtf? im not multithreading or anything so how can the size change here?

You modified the set you were iterating over. Another example of the same problem:

    d = {'a': 1, 'b': 2, 'c': 3}
    for key in d:
        d[key+key] = 0
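A sketch of what happens when that loop runs, plus one common way around it: iterate over a snapshot of the keys. The `except ... as` syntax assumes Python 2.6 or later, and the `error` variable is just for illustration.

```python
d = {'a': 1, 'b': 2, 'c': 3}

error = None
try:
    for key in d:
        d[key + key] = 0        # adds a key while the loop is running
except RuntimeError as exc:
    error = str(exc)            # the interpreter refuses to continue iterating

# Iterating over a snapshot of the keys avoids the problem,
# because the list is not affected by later insertions into d
d = {'a': 1, 'b': 2, 'c': 3}
for key in list(d):
    d[key + key] = 0
```

The same applies to sets: copy first (`list(s)` or `set(s)`), or collect the additions in a separate set and merge them after the loop.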

--
Gabriel Genellina

**I V** · Jun 27 '08, 04:26 PM

On Sun, 25 May 2008 18:42:06 -0700, notnorwegian wrote:

    def scrapeSites(startAddress):
        site = startAddress
        sites = set()
        iterator = iter(sites)
        pos = 0
        while pos < 10:#len(sites):
            newsites = scrapeSite(site)
            joinSets(sites, newsites)

You change the size of "sites" here, which means you can no longer use
the iterator.

wtf? im not multithreading or anything so how can the size change here?

How about:

    def scrape_sites(start_address):
        sites = set([start_address])
        sites_scraped = set()
        # This will run until it doesn't find any new sites, which may be
        # a long time
        while len(sites) > 0:
            new_sites = set()
            for site in sites:
                new_sites.update(scrape_site(site))
            sites_scraped.update(sites)
            sites = new_sites.difference(sites_scraped)
        # sites is empty once the loop ends, so return the scraped set
        return sites_scraped
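Since an unbounded crawl may run for a long time, one possible variation caps the number of pages, much like the `pos < 10` limit in the original attempt. This is a sketch using a queue instead of the two sets above, not the code from either reply. `max_pages`, `LINKS`, and this stub `scrape_site` are hypothetical names introduced for the example.

```python
from collections import deque

# Hypothetical link graph used in place of real HTTP requests
LINKS = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["d", "e"],
    "d": [],
    "e": ["f"],
    "f": [],
}

def scrape_site(site):
    # Stand-in scraper: returns the set of links found on `site`
    return set(LINKS.get(site, ()))

def scrape_sites(start_address, max_pages=10):
    seen = set([start_address])      # every address ever queued
    queue = deque([start_address])   # addresses waiting to be scraped
    scraped = []                     # addresses scraped, in visit order
    while queue and len(scraped) < max_pages:
        site = queue.popleft()
        scraped.append(site)
        # sorted() only makes the visit order reproducible
        for link in sorted(scrape_site(site)):
            if link not in seen:     # queue each address at most once
                seen.add(link)
                queue.append(link)
    return scraped

first_three = scrape_sites("a", max_pages=3)
everything = scrape_sites("a")
```

Because new addresses go onto the queue rather than into a set that is being iterated, nothing changes size under a live iterator, and each address is scraped at most once.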
