Re: which datastructure for fast sorted insert?

  • notnorwegian@yahoo.se

    Traceback (most recent call last):
      File "C:/Python25/Progs/WebCrawler/spider2.py", line 47, in <module>
        x = scrapeSites("http://www.yahoo.com")
      File "C:/Python25/Progs/WebCrawler/spider2.py", line 31, in scrapeSites
        site = iterator.next()
    RuntimeError: Set changed size during iteration


    def joinSets(set1, set2):
        for i in set2:
            set1.add(i)
        return set1

    def scrapeSites(startAddress):
        site = startAddress
        sites = set()
        iterator = iter(sites)
        pos = 0
        while pos < 10:#len(sites):
            newsites = scrapeSite(site)
            joinSets(sites, newsites)
            pos += 1
            site = iterator.next()
        return sites

    def scrapeSite(address):
        toSet = set()
        site = urllib.urlopen(address)
        for row in site:
            obj = url.search(row)
            if obj != None:
                toSet.add(obj.group())
        return toSet


    wtf? im not multithreading or anything so how can the size change here?
  • Gabriel Genellina

    #2
    On Sun, 25 May 2008 22:42:06 -0300, <notnorwegian@yahoo.se> wrote:
    > def joinSets(set1, set2):
    >     for i in set2:
    >         set1.add(i)
    >     return set1
    Use the | operator, or |= to update the set in place.
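A minimal sketch of that suggestion, using two small sets of made-up URLs (the names `found` and `new_links` are illustrative and not from the original script). `a | b` builds a new set, while `a |= b` updates `a` in place, which is what the `joinSets` loop does by hand.

```python
# Hypothetical data, just for illustration.
found = set(["http://a.example/", "http://b.example/"])
new_links = set(["http://b.example/", "http://c.example/"])

# `|` returns a new set and leaves both operands untouched.
combined = found | new_links

# `|=` updates the left-hand set in place, like the joinSets loop.
found |= new_links
```

With `|=` there is no need for a helper function or a return value; the left-hand set simply grows.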
    > Traceback (most recent call last):
    >   File "C:/Python25/Progs/WebCrawler/spider2.py", line 47, in <module>
    >     x = scrapeSites("http://www.yahoo.com")
    >   File "C:/Python25/Progs/WebCrawler/spider2.py", line 31, in scrapeSites
    >     site = iterator.next()
    > RuntimeError: Set changed size during iteration
    You will need two sets: the one you're iterating over, and another collecting new URLs. Once you finish iterating over the first, continue with the new ones; stop when there are no new ones left.
    > def scrapeSites(startAddress):
    >     site = startAddress
    >     sites = set()
    >     iterator = iter(sites)
    >     pos = 0
    >     while pos < 10:#len(sites):
    >         newsites = scrapeSite(site)
    >         joinSets(sites, newsites)
    >         pos += 1
    >         site = iterator.next()
    >     return sites
    Try this (untested):

    def scrapeSites(startAddress):
        allsites = set() # all links found so far
        pending = set([startAddress]) # pending sites to examine
        while pending:
            newsites = set() # new links
            for site in pending:
                newsites |= scrapeSite(site)
            pending = newsites - allsites
            allsites |= newsites
        return allsites
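
To try the loop above without touching the network, here is a sketch that swaps `scrapeSite` for a lookup in a small hand-made link graph. `LINKS` and its page names are hypothetical; only the two-set loop comes from the post.

```python
# Hypothetical link graph: page -> set of pages it links to.
LINKS = {
    "a": set(["b", "c"]),
    "b": set(["c", "d"]),
    "c": set(["a"]),
    "d": set(),
}

def scrapeSite(address):
    # Stand-in for the real fetch-and-parse step.
    return set(LINKS.get(address, ()))

def scrapeSites(startAddress):
    allsites = set()               # all links found so far
    pending = set([startAddress])  # sites still to examine
    while pending:
        newsites = set()           # links found in this round
        for site in pending:       # pending is never modified inside the loop
            newsites |= scrapeSite(site)
        pending = newsites - allsites
        allsites |= newsites
    return allsites

result = scrapeSites("a")
```

One thing this makes visible: the start page only enters `allsites` when some other page links to it, so it can be fetched a second time. Seeding `allsites` with the start address would probably avoid that, at the cost of always including it in the result.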
    > wtf? im not multithreading or anything so how can the size change here?
    You modified the set you were iterating over. Another example of the same problem:

    d = {'a': 1, 'b': 2, 'c': 3}
    for key in d:
        d[key+key] = 0
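
Running that loop raises a `RuntimeError` because the dictionary grows while it is being iterated. A common workaround, sketched here, is to iterate over a snapshot of the keys so the dictionary itself is free to change; `snapshot` and `failed` are illustrative names.

```python
d = {'a': 1, 'b': 2, 'c': 3}

# Mutating while iterating fails on the next step of the loop.
try:
    for key in d:
        d[key + key] = 0
    failed = False
except RuntimeError:
    failed = True

# Iterating over a copy of the keys is safe.
d = {'a': 1, 'b': 2, 'c': 3}
snapshot = list(d)
for key in snapshot:
    d[key + key] = 0
```

The same idea applies to sets: loop over `list(s)` or a separate set, and add to the original inside the loop.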

    --
    Gabriel Genellina

    • I V

      #3
      On Sun, 25 May 2008 18:42:06 -0700, notnorwegian wrote:
      > def scrapeSites(startAddress):
      >     site = startAddress
      >     sites = set()
      >     iterator = iter(sites)
      >     pos = 0
      >     while pos < 10:#len(sites):
      >         newsites = scrapeSite(site)
      >         joinSets(sites, newsites)
      You change the size of "sites" here, which means you can no longer use
      the iterator.
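
This can be reproduced in a few lines. The sketch uses the built-in `next()` (Python 2.6 and later) rather than the `iterator.next()` method from the thread, and the set contents are made up.

```python
sites = set(["a", "b"])
iterator = iter(sites)
first = next(iterator)   # fine: the set has not changed yet

sites.add("c")           # the set grows while the iterator is still live

try:
    next(iterator)
    error = None
except RuntimeError as exc:
    error = str(exc)     # the iterator refuses to continue
```

No threads are involved; the iterator simply notices that the set's size differs from what it was when iteration started.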
      > wtf? im not multithreading or anything so how can the size change here?
      How about:

      def scrape_sites(start_address):
          sites = set([start_address])
          sites_scraped = set()
          # This will run until it doesn't find any new sites, which may be
          # a long time
          while len(sites) > 0:
              new_sites = set()
              for site in sites:
                  new_sites.update(scrape_site(site))
              sites_scraped.update(sites)
              sites = new_sites.difference(sites_scraped)
          # sites is empty once the loop ends, so return what was scraped
          return sites_scraped
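
Since an unbounded crawl may run for a long time, as the comment in the code says, one option is to cap the number of pages fetched, similar in spirit to the `pos < 10` limit in the original post. The sketch below is one way to do that; the `max_pages` parameter, the stubbed `scrape_site`, and the `LINKS` graph are all assumptions made for illustration.

```python
# Hypothetical link graph standing in for real pages.
LINKS = {
    "a": set(["b", "c"]),
    "b": set(["d"]),
    "c": set(["d", "e"]),
    "d": set(["a"]),
    "e": set(),
}

def scrape_site(address):
    # Stub: a real version would download and parse the page.
    return set(LINKS.get(address, ()))

def scrape_sites(start_address, max_pages=10):
    sites = set([start_address])    # frontier for the current round
    sites_scraped = set()           # pages already fetched
    while sites and len(sites_scraped) < max_pages:
        new_sites = set()
        for site in sorted(sites):  # sorted only to make the order repeatable
            if len(sites_scraped) >= max_pages:
                break
            new_sites.update(scrape_site(site))
            sites_scraped.add(site)
        sites = new_sites.difference(sites_scraped)
    return sites_scraped

everything = scrape_sites("a")
limited = scrape_sites("a", max_pages=3)
```

Here the function returns the pages actually fetched, so the caller gets a useful result whether the crawl ran to completion or stopped at the cap.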
