wgetting the crawled links only

  • ymic8
    New Member
    • Jul 2007
    • 8

    wgetting the crawled links only

    Hi everyone,
    this is my first thread since I just joined. Does anyone know how to crawl a particular URL using Python? I tried to build a breadth-first sort of crawler but have had little success.
    With wget, if you are more familiar with it than me, how can I get it to output the crawled links (the links themselves, not the actual HTML content) to a file?

    currently I have something like:
    wget -q -E -O outfile --proxy-user=username --proxy-password=mypassword -r http://www.museum.vic.gov.au

    but it outputs the actual crawled HTML content to 'outfile', when I only want the crawled links.

    Thank you in advance
  • bartonc
    Recognized Expert Expert
    • Sep 2006
    • 6478

    #2
    Originally posted by ymic8
    Hi everyone,
    this is my first thread since I just joined. Does anyone know how to crawl a particular URL using Python? I tried to build a breadth-first sort of crawler but have had little success.
    With wget, if you are more familiar with it than me, how can I get it to output the crawled links (the links themselves, not the actual HTML content) to a file?

    currently I have something like:
    wget -q -E -O outfile --proxy-user=username --proxy-password=mypassword -r http://www.museum.vic.gov.au

    but it outputs the actual crawled HTML content to 'outfile', when I only want the crawled links.

    Thank you in advance
    The closest thing that my searches turned up was this thread, but it didn't get very far along.

    If kudos were around, he'd know what to tell you.
    Sorry to not be of more help than that.

    Welcome to the Python Forum on TheScripts.

    Comment

    • Motoma
      Recognized Expert Specialist
      • Jan 2007
      • 3236

      #3
      You can use the urllib module to read the raw HTML from a page. You could then run a regex or a find to snag all of the href tags from pages and append them to your list of URLs to search.

      I am interested in what you are doing. Would you mind posting your code, or giving an explanation of what you are working on?
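Motoma's suggestion can be sketched in a few lines. This is a minimal illustration (using modern Python 3 names; in the 2007-era Python 2 used elsewhere in this thread the fetch would be urllib.urlopen), and the sample HTML here is made up rather than fetched from the museum site:

```python
import re

def extract_links(html):
    # Grab whatever sits inside href="..." or href='...'.
    # A regex is fragile on messy real-world HTML, but fine for building
    # a quick list of URLs to crawl.
    return re.findall(r'href=["\']([^"\']+)["\']', html, re.IGNORECASE)

# A made-up snippet standing in for a page fetched with urllib:
sample = '<a href="/exhibitions/">Exhibits</a> <A HREF="http://example.org/a.html">A</A>'
print(extract_links(sample))  # ['/exhibitions/', 'http://example.org/a.html']
```

The extracted relative links (like '/exhibitions/') still need to be resolved against the page's base URL before they can be fetched.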

      Comment

      • ymic8
        New Member
        • Jul 2007
        • 8

        #4
        Originally posted by Motoma
        You can use the urllib module to read the raw HTML from a page. You could then run a regex or a find to snag all of the href tags from pages and append them to your list of URLs to search.

        I am interested in what you are doing. Would you mind posting your code, or giving an explanation of what you are working on?
        Hi there my friends, thank you all for the replies.
        My current code actually uses urllib and a regex such as re.findall(r'<p .*p>', page) (which grabs all the paragraph content greedily, but somehow doesn't work for some HTML pages; I wonder why...).
        I am working on a year-long honours project in partnership with the Melbourne Museum, Australia. The project develops a software prototype that automatically and appropriately recommends exhibits for a user to visit next, if and only if the recommended exhibit is judged to match the user's personal needs and interests. Compared with the research proposal, the goals of this software have been refined as:
        1. To predict when and where a recommendation should be made. This is the automatic component of the software;
        2. To choose an exhibit that best represents or satisfies the user's information need at a given time, based on a dynamically determined profile for this particular user. This is the user-personalisation component of the software.
        ...
        I just copied and pasted some material from my progress report. For the web-crawling part, I just need to crawl the museum webspace and collect the museum links (maybe not all of them; ~5000 is enough) so I can extract the semantic content of the pages via these URLs and process that content in some way.

        wget won't give me the links; it only gives me the content, which I don't actually want at this stage. I only want the links, so that I can first filter out the .mp3s, .exes, .phps, etc.
        w3mir doesn't get me much further either, because every time I try to get the links it gives me a 'connection timeout'.
        ...
        that's why I want to use my own 'personalised' crawler. It works fine, but it's not the centre of my project; it's still interesting, though...
        thank you for your attention to a newbie like me :-)

        Comment

        • ymic8
          New Member
          • Jul 2007
          • 8

          #5
          my crawler is BFS, by the way; my initial version was DFS, which got caught in a spider trap on the web.
          There are still some minor modifications to make; after that, if anyone wants, I will post it here.
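For what it's worth, breadth-first order plus a visited set is exactly what keeps a crawler out of spider traps on cyclic link graphs. A minimal sketch (the toy graph and the get_outlinks callback here are made up for illustration; they stand in for real pages and a real link extractor):

```python
from collections import deque

def bfs_crawl(start, get_outlinks, max_pages=100):
    # The 'seen' set is what prevents the spider trap: any cycle in the
    # link graph is entered at most once, regardless of traversal order.
    seen = {start}
    queue = deque([start])
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)
        for child in get_outlinks(url):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return order

# Toy cyclic link graph standing in for the web:
graph = {'a': ['b', 'c'], 'b': ['a', 'c'], 'c': ['a']}
print(bfs_crawl('a', lambda u: graph.get(u, [])))  # ['a', 'b', 'c']
```

Note that a DFS with the same visited set would also terminate; the trap in the original DFS version was most likely the absence of such a set rather than the traversal order itself.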

          Comment

          • bartonc
            Recognized Expert Expert
            • Sep 2006
            • 6478

            #6
            Originally posted by ymic8
            my crawler is BFS, by the way; my initial version was DFS, which got caught in a spider trap on the web.
            There are still some minor modifications to make; after that, if anyone wants, I will post it here.
            Yes, please do. Many good things may come from sharing ideas and experience in this way.

            Thank you for the wonderful details of your project, too.

            Comment

            • ymic8
              New Member
              • Jul 2007
              • 8

              #7
              my amateur code + one more regex question

              no worries; as I said, I am a newbie, and I am sure many of you can see that I have a lot of redundant or clumsy code. If you know some easy, built-in Python methods I could have used, please feel free to point them out, and I thank you in advance.

              actually, before the code, can I just ask something else?
              I want to grab the textual content of http://www.museum.vic.gov.au/dinosau...immensity.html
              now, i tried
              Code:
              page = urlopen('http://www.museum.vic.gov.au/dinosaurs/time-immensity.html').read()
              print re.findall(r'<p >.*p>', page)

              so I want it to give me all the content (including some tagged stuff) between the leftmost <p> and rightmost </p>. Now, this works for some URLs, but not the above one, can anyone please suggest why and give me a better regex? thx
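A likely culprit (guessing without seeing that exact page): by default `.` does not match newlines, so the pattern fails whenever the first `<p>` and the last `</p>` end up on different lines, which is common on hand-written pages. Passing re.DOTALL fixes that. A small self-contained demonstration on made-up HTML:

```python
import re

# Made-up page where every paragraph spans more than one line:
page = "<html>\n<p>First\nparagraph.</p>\n<p>Second\none.</p>\n</html>"

# Without re.DOTALL, '.' stops at each newline, so nothing matches here:
print(re.findall(r'<p>.*</p>', page))  # []

# With re.DOTALL, the greedy '.*' spans from the leftmost <p> to the
# rightmost </p>, which is the behaviour the post asks for:
print(re.findall(r'<p>.*</p>', page, re.DOTALL))
```

If you instead want each paragraph separately, use the non-greedy form `r'<p>.*?</p>'` together with re.DOTALL.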

              This is museum_crawler.py: please comment on my amateur code, thx
              Code:
              import re, string, os, sys
              from nltk_lite.corpora import brown, extract, stopwords
              from nltk_lite.wordnet import *
              from nltk_lite.probability import FreqDist
              from urllib import urlopen
              
              docs = {}
              pagerank = {}
              MAX_DOCS = 3000 
              MIN_CHILDLINK_LEN = 8
              irresites = ["mvmembers","e-news", "education", "scienceworks", "immigration", "tenders", "http://www.museum.vic.gov.au/about","bfa","search", "ed-online", "whatson","whats_on","privacy", "siteindex", "rights", "disclaimer", "contact", "volunteer"]
              mvic = 'http://www.museum.vic.gov.au'
              mmel = 'http://melbourne.museum.vic.gov.au'
              
              """
              only concern the contentual links related to exhibits
              """
              def purify_irr(link):
                  for term in irresites:
                      if term in link:
                          return False
                  return True
              
              """
              checks if the word has a punctuation char in it
              """
              def haspunc(word):
                  offset = 0
                  for char in word:
                      if char in string.punctuation:
                          return offset
                      offset += 1
                  return None
              
              """
              returns the position of the char that is not punctuation
              """
              def nextnonpunc(word): 
                  for i in range(len(word)):
                      if word[i] not in string.punctuation:
                          return i
                  return 100
              
              """
              don't want URLs that have 'irrelevant' meanings and extensions (types)
              """
              def valid_page(url):
                  link = url.lower()
                  if purify_irr(link):
                      for ext in ['.gif','.pdf', '.css', '.jpg', '.ram', '.mp3','.exe','.sit','.php']:
                          if ext in link:
                              return False
                      return True
                  return False
              
              """
              don't want URLs that refer to the central-information sites
              """
              def informative_page(link, page):
                  if 'page not found' not in page.lower() and valid_page(link):
                      if not (link in [mvic, mvic+'/', mvic+'/index.asp']) and \
                         not (link in [mmel, mmel+'/', mmel+'/index.asp']):
                          
                          return True
                  return False 
              
              """
              count the number of occurrences of 'char' in 'word'
              """
              def count(word, char):
                  count = 0
                  for ch in word:
                      if ch == char:
                          count += 1
                  return count
              
              """
              force an URL to be truncated as 'directory-URL', i.e. ends with '/'
              """
              def rootify(link):
                  root = link
                  if link[-1] != '/':
                      if count(link, '/') < 3 and 'http' not in link:
                          return
                      for i in range(len(link)):
                          if count(link, '/') < 3:
                              return link+'/'
                          if link[-i-1] == '/':
                              root = link[:-i]
                              break;
                  else:
                      root = link
                  return root
              
              """
              gets the end-string of a URL after the rightmost '/'
              tail('aaa/bbb/ccc') => 'ccc'
              """
              def tail(st):
                  if st[-1] == '/':
                      st = st[:-1]
                  for i in range(len(st)):
                      if st[-i] == '/':
                          break;
                  return st[-i+1:]
              
              """
              get the content of the page and check for its type
              """
              def getContent(link):
                  try:
                      if not valid_page(link):
                          return None
                      page = urlopen(link).read()
                      return page
                  except IOError:
                      return None
              
              """
              returns the outlinks (if relative links, then the full URLs of these links) of
              the 'link'.
              """
              def outlinks(link):
                  print '=>', link
                  givup = False
                  temp = []
                  r = link
                  link = link[link.index('http'):]
                  if link[-1] == '"':
                      link = link[:-1]
              	
                  root = rootify(link)
                  page = getContent(link)
                  if (page == None):
                      return None
                  if page not in docs.values() and informative_page(link, page):
                      docs[link] = page
                      temp.append(link)
                  outlinks = re.findall(r'href=.+?"', page)
                  outlinks = [link for link in outlinks if valid_page(link)]
                  
                  for link in outlinks:
                      com = None
                      if 'http' in link:
                          link = link[link.index('http'):]
                          if link[-1] == '"':
                              link = link[:-1]
                          if 'museum' in link.lower():
                              page = getContent(link)
                              if (page == None):
                                  return None
                              if page not in docs.values():
                                  if informative_page(link, page):
                                      temp.append(link)
                                      docs[link] = page
              
                      elif len(link) < MIN_CHILDLINK_LEN:
                          continue
                      else:
                          if link[6:-1][0] == '/':
                               rest = link[7:-1]
                          else:
                              rest = link[6:-1]
                          com = rest
                          start = nextnonpunc(rest)
                          if start == 100:
                              continue;
                          link_option = ''
                          rest = rest[start:]
                          if '/' in rest and '/' in root:
                              child_first_comp = rest[:rest.index('/')]
                              parent_last_comp = tail(root)
                              if child_first_comp.lower() == parent_last_comp.lower():
                              # if the relative link has an overlapping component with root
                              # e.g. /CSIRAC/history... and /history should be the same 
                              # relative links, but they result in diff URLs, thus need to
                              # have a 'backup' link to be urlopened in some cases
                                  link_option = root+rest[rest.index('/')+1:]    
                          link = root+rest
                          if not givup and 'museum' in link.lower():
                              page = getContent(link)
                              if (page != None) and page not in docs.values():
                                  if informative_page(link, page):
                                          
                                          temp.append(link)
                                          docs[link] = page
                              else:
                                  if link_option != '':
                                      page = getContent(link_option)
                                      if (page != None) and page not in docs.values():
                                          if informative_page(link, page):
                                              temp.append(link_option)
                                              docs[link] = page
                  pagerank[root] = temp
                  for link in temp:
                      print "  --  ", link
                  return temp
              
                  
              def crawler(link):
                  link = link[link.index('http'):]
                  if link[-1] == '"':
                      link = link[:-1]
                  root = link
                  if root[-1] != '/':
                      if count(root, '/') < 3:
                          return
                      for i in range(len(root)):
                          if root[-i-1] == '/':
                              root = root[:-i]
                              break;
              
                  page = getContent(link)
                  if (page == None):
                      return
                  if page not in docs.values():
                      docs[link] = page
                  to_scan = [root]
                  while len(to_scan) > 0 and len(docs) < MAX_DOCS:
                      childlinks = []
                      for parent in to_scan:
                          out = outlinks(parent)
                          if out:			
                              # just in case if some links are dead or invalid
                              childlinks.extend(out)
                      
                      to_scan = list(set(childlinks))
              
              if __name__ == '__main__':
                  crawler("http://melbourne.museum.vic.gov.au/exhibitions/")
                  fp = open('docs.py', 'a')
                  for key in docs:
                      fp.write(key+'\n')
                  fp.close()
                  fp = open('pagerank.py', 'a')
                  for key in pagerank:
                      fp.write(key+'\n')
                      for outlink in pagerank[key]:
                           fp.write('    '+outlink+'\n')
                   fp.close()
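One comment on the listing above: most of the rootify/tail/prefix-splicing logic for relative links can be replaced by the standard library's urljoin, which resolves an href against the page it came from. A sketch (Python 3 spelling; in the Python 2 used above the same function lives in the urlparse module, and the paths below are just illustrative):

```python
from urllib.parse import urljoin

base = 'http://melbourne.museum.vic.gov.au/exhibitions/index.asp'

# Relative link on the page -> resolved against the base's directory:
print(urljoin(base, 'CSIRAC/history.asp'))
# Root-relative link -> resolved against the host:
print(urljoin(base, '/about/index.asp'))
# Absolute link -> returned unchanged:
print(urljoin(base, 'http://www.museum.vic.gov.au/dinosaurs/'))
```

This also handles the '/CSIRAC/history' vs '/history' ambiguity the comments in outlinks() work around, since the resolution rule is taken from RFC-standard URL semantics rather than string splicing.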

              Comment

              • ymic8
                New Member
                • Jul 2007
                • 8

                #8
                Originally posted by ymic8
                no worries, as i said, I am a newbie, and I am sure many of you could see that I have many redundant or clumsy codes ... This is the museum_crawler.py : please comment on my amateur code thx ...
                how come it doesn't show up? maybe my last thread is too long?

                Comment

                • ymic8
                  New Member
                  • Jul 2007
                  • 8

                  #9
                  Originally posted by bartonc
                  Yes, please do. Many good things may come from sharing ideas and experience in this way.

                  Thank you for the wonderful details of your project, too.
                  why doesn't my thread show up?

                  Comment

                  • ymic8
                    New Member
                    • Jul 2007
                    • 8

                    #10
                    Originally posted by bartonc
                    Yes, please do. Many good things may come from sharing ideas and experience in this way.

                    Thank you for the wonderful details of your project, too.
                    (ha, so i missed a [/CODE] and it refuses to post my thread, lol)
                    no worries, as i said, I am a newbie, and I am sure many of you could see that I have many redundant or clumsy codes, or if you know some easy/Python-built-in methods that I could've used, please feel free to point them out, and i will thank you in advance

                    actually before the code, can i just ask something else?:
                    I want to grab the textual content of http://www.museum.vic. gov.au/dinosaurs/time-immensity.html
                    now, i tried
                    Code:
                    page =urlopen('http://www.museum.vic.gov.au/dinosaurs/time-immensity.html').read()
                    print re.findall(r'<p>.*p>', page)
                    so I want it to give me all the content (including some tagged stuff) between the leftmost <p> and rightmost </p>. Now, this works for some URLs, but not the above one, can anyone please suggest why and give me a better regex? thx

                    This is the museum_crawler. py : please comment on my amateur code thx
                    Code:
                    import re, string, os, sys
                    from nltk_lite.corpora import brown, extract, stopwords
                    from nltk_lite.wordnet import *
                    from nltk_lite.probability import FreqDist
                    from urllib import urlopen
                    
                    docs = {}
                    pagerank = {}
                    MAX_DOCS = 3000 
                    MIN_CHILDLINK_LEN = 8
                    irresites = ["mvmembers","e-news", "education", "scienceworks", "immigration", "tenders", "http://www.museum.vic.gov.au/about","bfa","search", "ed-online", "whatson","whats_on","privacy", "siteindex", "rights", "disclaimer", "contact", "volunteer"]
                    mvic = 'http://www.museum.vic.gov.au'
                    mmel = 'http://melbourne.museum.vic.gov.au'
                    
                    """
                    only concern the contentual links related to exhibits
                    """
                    def purify_irr(link):
                        for term in irresites:
                            if term in link:
                                return False
                        return True
                    
                    """
                    checks if the word has a punctuation char in it
                    """
                    def haspunc(word):
                        offset = 0
                        for char in word:
                            if char in string.punctuation:
                                return offset
                            offset += 1
                        return None
                    
                    """
                    returns the position of the char that is not punctuation
                    """
                    def nextnonpunc(word): 
                        for i in range(len(word)):
                            if word[i] not in string.punctuation:
                                return i
                        return 100  # sentinel meaning 'all punctuation'; callers check for 100
                    
                    """
                    don't want URLs that have 'irrelevant' meanings and extensions (types)
                    """
                    def valid_page(url):
                        link = url.lower()
                        if purify_irr(link):
                            for ext in ['.gif','.pdf', '.css', '.jpg', '.ram', '.mp3','.exe','.sit','.php']:
                                if ext in link:
                                    return False
                            return True
                        return False
                    
                    """
                    don't want URLs that refer to the central-information sites
                    """
                    def informative_page(link, page):
                        if 'page not found' not in page.lower() and valid_page(link):
                            if not (link in [mvic, mvic+'/', mvic+'/index.asp']) and \
                               not (link in [mmel, mmel+'/', mmel+'/index.asp']):
                                
                                return True
                        return False 
                    
                    """
                    count the number of occurrences of 'char' in 'word'
                    """
                    def count(word, char):
                        count = 0
                        for ch in word:
                            if ch == char:
                                count += 1
                        return count
                    
                    """
                    force a URL to be truncated to a 'directory URL', i.e. one ending with '/'
                    """
                    def rootify(link):
                        root = link
                        if link[-1] != '/':
                            if count(link, '/') < 3 and 'http' not in link:
                                return
                            for i in range(len(link)):
                                if count(link, '/') < 3:
                                    return link+'/'
                                if link[-i-1] == '/':
                                    root = link[:-i]
                                    break;
                        else:
                            root = link
                        return root
                    
                    """
                    gets the end-string of a URL after the rightmost '/'
                    tail('aaa/bbb/ccc') => 'ccc'
                    """
                    def tail(st):
                        if st[-1] == '/':
                            st = st[:-1]
                        for i in range(len(st)):
                            if st[-i] == '/':
                                break;
                        return st[-i+1:]
                    
                    """
                    get the content of the page and check for its type
                    """
                    def getContent(link):
                        try:
                            if not valid_page(link):
                                return None
                            page = urlopen(link).read()
                            return page
                        except IOError:
                            return None
                    
                    """
                    returns the outlinks (if relative links, then the full URLs of these links) of
                    the 'link'.
                    """
                    def outlinks(link):
                        print '=>', link
                        givup = False
                        temp = []
                        r = link
                        link = link[link.index('http'):]
                        if link[-1] == '"':
                            link = link[:-1]

                        root = rootify(link)
                        page = getContent(link)
                        if (page == None):
                            return None
                        if page not in docs.values() and informative_page(link, page):
                            docs[link] = page
                            temp.append(link)
                        outlinks = re.findall(r'href=.+?"', page)
                        outlinks = [link for link in outlinks if valid_page(link)]
                        
                        for link in outlinks:
                            com = None
                            if 'http' in link:
                                link = link[link.index('http'):]
                                if link[-1] == '"':
                                    link = link[:-1]
                                if 'museum' in link.lower():
                                    page = getContent(link)
                                    if (page == None):
                                        return None
                                    if page not in docs.values():
                                        if informative_page(link, page):
                                            temp.append(link)
                                            docs[link] = page
                    
                            elif len(link) < MIN_CHILDLINK_LEN:
                                continue
                            else:
                                if link[6:-1][0] == '/':
                                    rest = link[7:-1]
                                else:
                                    rest = link[6:-1]
                                com = rest
                                start = nextnonpunc(rest)
                                if start == 100:
                                    continue;
                                link_option = ''
                                rest = rest[start:]
                                if '/' in rest and '/' in root:
                                    child_first_comp = rest[:rest.index('/')]
                                    parent_last_comp = tail(root)
                                    if child_first_comp.lower() == parent_last_comp.lower():
                                        # if the relative link shares its first component with the
                                        # root (e.g. /CSIRAC/history... vs /history), joining them
                                        # naively yields a different URL, so keep a 'backup' link
                                        # to urlopen in those cases
                                        link_option = root+rest[rest.index('/')+1:]    
                                link = root+rest
                                if not givup and 'museum' in link.lower():
                                    page = getContent(link)
                                    if (page != None) and page not in docs.values():
                                    if informative_page(link, page):
                                        temp.append(link)
                                        docs[link] = page
                                    else:
                                        if link_option != '':
                                            page = getContent(link_option)
                                            if (page != None) and page not in docs.values():
                                            if informative_page(link_option, page):
                                                temp.append(link_option)
                                                docs[link_option] = page
                        pagerank[root] = temp
                        for link in temp:
                            print "  --  ", link
                        return temp
                    
                        
                    def crawler(link):
                        link = link[link.index('http'):]
                        if link[-1] == '"':
                            link = link[:-1]
                        root = link
                        if root[-1] != '/':
                            if count(root, '/') < 3:
                                return
                            for i in range(len(root)):
                                if root[-i-1] == '/':
                                    root = root[:-i]
                                    break;
                    
                        page = getContent(link)
                        if (page == None):
                            return
                        if page not in docs.values():
                            docs[link] = page
                        to_scan = [root]
                        while len(to_scan) > 0 and len(docs) < MAX_DOCS:
                            childlinks = []
                            for parent in to_scan:
                                out = outlinks(parent)
                                if out:
                                    # skip parents whose pages were dead or invalid
                                    childlinks.extend(out)
                            
                            to_scan = list(set(childlinks))
                    
                    if __name__ == '__main__':
                        crawler("http://melbourne.museum.vic.gov.au/exhibitions/")
                        fp = open('docs.py', 'a')
                        for key in docs:
                            fp.write(key+'\n')
                        fp.close()
                        fp = open('pagerank.py', 'a')
                        for key in pagerank:
                            fp.write(key+'\n')
                            for outlink in pagerank[key]:
                                fp.write('    '+outlink+'\n')
                        fp.close()
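                    A side note on the relative-link arithmetic in rootify(), tail() and outlinks(): the standard library already covers these cases. urljoin() resolves absolute, root-relative and directory-relative hrefs against a base URL, which would remove most of the 'backup link' special-casing. A minimal sketch (the try/except handles the Python 2/3 module rename; the URLs are made up for illustration):

```python
import re

try:
    from urlparse import urljoin        # Python 2, as used in the post
except ImportError:
    from urllib.parse import urljoin    # Python 3

def resolve_links(base_url, page):
    """Return absolute URLs for every href="..." found in 'page'."""
    hrefs = re.findall(r'href="(.+?)"', page)
    # urljoin handles what rootify()/tail() do by hand: absolute URLs
    # pass through unchanged, relative paths resolve against base_url.
    return [urljoin(base_url, h) for h in hrefs]

html = ('<a href="/exhibitions/">'
        '<a href="dinosaurs.html">'
        '<a href="http://other.example.org/x">')
print(resolve_links('http://melbourne.museum.vic.gov.au/visiting/', html))
```

                    With this in place, outlinks() could treat every href uniformly instead of branching on whether it starts with 'http' or '/'.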

                    Comment

                    • ymic8
                      New Member
                      • Jul 2007
                      • 8

                      #11
                      Oh I see: the reason the re.findall business didn't work is that the page source contains \r, \n and \t characters, and '.' does not match newlines by default, so the regex skipped any <p>...</p> spanning several lines.
                      Now I am trying to find a succinct regex to strip all those whitespace escapes.
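                      If the goal is just to collapse the \r, \n and \t runs to single spaces (my assumption about what stripping "all the \x's" means), one re.sub call with \s+ does it:

```python
import re

# Illustrative string with the same whitespace escapes a real page has
raw = "<p>Time\r\n\tand   immensity</p>"

# \s matches space, \t, \r and \n; '+' collapses each run to one space
clean = re.sub(r'\s+', ' ', raw)
print(clean)    # <p>Time and immensity</p>
```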

                      By the way, my crawler code is not well commented or documented, so if you have trouble understanding it, I will document it properly and post it again soon; for now I need to get the other components working.

                      thank you all

                      Comment

                      • bartonc
                        Recognized Expert Expert
                        • Sep 2006
                        • 6478

                        #12
                        Originally posted by ymic8
                        why doesn't my thread show up?
                        Sorry. There's a bug in the way this site displays long code blocks. Putting the code block inside quotes seems to make it worse. Your most recent post seems to have worked out well. Thanks for being persistent with this sometimes finicky site.

                        Comment

                        Working...