need to write a simple web crawler

  • Pradeep Vasudevan
    New Member
    • Sep 2006
    • 1

    need to write a simple web crawler

    Hi,

    I am a student and need to write a simple web crawler in Python, and I need some guidance on how to start. I need to crawl web pages using both BFS and DFS: one using a stack and the other using a queue.

    I will try it on old web pages only, so that I can learn how it is done. I am taking a course called Search Engines and need some help with this.

    Help of any kind would be appreciated.

    Thank you.
  • kudos
    Recognized Expert New Member
    • Jul 2006
    • 127

    #2
    It's quite easy, actually. You need a way to parse an HTML page (there is one in the Python standard library) and, as you pointed out in your post, breadth-first search (BFS) or depth-first search (DFS). You also need some kind of structure to record whether you have visited a certain page before (a dictionary, say).

    Let's assume that we use BFS with a Python list as the queue, and that you start on a certain page (www.thescripts.com? :)

    Code:
    visited = {}
    queue = []
    queue.append("http://www.thescripts.com")

    while len(queue) > 0:
        currpage = queue.pop(0)      # pop from the front: a queue gives BFS
        visited[currpage] = 1        # mark it as visited
        links = findlinks(currpage)  # this method finds all the links on the page
        # here you can do whatever you need, like finding some text,
        # downloading some images, etc.
        # enqueue all the links
        for l in links:
            if l not in visited:
                queue.append(l)

    This is strictly pseudocode (findlinks is left unimplemented), since I haven't got a Python interpreter here. If you still need it, I could write you a simple crawler.

    -kudos
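    The queue-versus-stack point above can be sketched offline against a tiny in-memory link graph, so no network access is needed (the page names and the LINKS mapping below are invented for illustration):

```python
from collections import deque

# A tiny in-memory "web": page -> pages it links to.
# In a real crawler these links would come from fetching and parsing HTML.
LINKS = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d", "e"],
    "d": [],
    "e": ["a"],  # a back-link; the visited set prevents an endless loop
}

def crawl(start, bfs=True):
    """Return pages in visit order. bfs=True pops a queue, bfs=False a stack."""
    frontier = deque([start])
    visited = set()
    order = []
    while frontier:
        page = frontier.popleft() if bfs else frontier.pop()
        if page in visited:
            continue
        visited.add(page)
        order.append(page)
        for link in LINKS.get(page, []):
            if link not in visited:
                frontier.append(link)
    return order

print(crawl("a", bfs=True))   # breadth-first: level by level
print(crawl("a", bfs=False))  # depth-first: one branch at a time
```

    The only difference is which end of the frontier you pop from: the left end makes it a queue (BFS), the right end makes it a stack (DFS).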




    Comment

    • squzer
      New Member
      • Jun 2007
      • 3

      #3
      Hi friend, I am also involved in developing a crawler. Please share the ideas you have.

      Comment

      • kudos
        Recognized Expert New Member
        • Jul 2006
        • 127

        #4
        Originally posted by squzer
        Hi friend, I am also involved in developing a crawler. Please share the ideas you have.
        Hi, what do you want to get from your crawl?

        -kudos

        Comment

        • mike171562
          New Member
          • Aug 2007
          • 1

          #5
          I am looking for one that will read from a list of URLs, crawl them for certain text words, and then list the results.
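          A Python 3 sketch of that idea: scan a list of URLs for keywords and report which pages matched. fetch() is stubbed with canned text, and the URLs and page contents are invented; a real run would replace the stub with urllib.request.urlopen:

```python
# Scan a list of URLs for keywords and report which pages matched which words.
# fetch() is a stand-in for a real HTTP request; swap in
# urllib.request.urlopen(url).read().decode() for live pages.

PAGES = {
    "http://example.com/a": "Breaking news about python crawlers",
    "http://example.com/b": "Nothing relevant here",
}

def fetch(url):
    return PAGES[url]  # stubbed: returns canned page text

def scan(urls, words):
    results = {}
    for url in urls:
        text = fetch(url).lower()
        hits = [w for w in words if w.lower() in text]
        if hits:
            results[url] = hits
    return results

print(scan(PAGES.keys(), ["python", "crawler"]))
```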

          Comment

          • technoashis
            New Member
            • Jul 2007
            • 2

            #6
            I am also trying that, but my crawler takes a very long time to crawl. I wrote it in Python. Can you folks give me some clues?
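            Most of a crawler's time is spent waiting on the network, so fetching pages concurrently is usually the first speed-up to try. A minimal Python 3 sketch with a thread pool; fetch() is a stub that sleeps to imitate network latency, and the URLs are invented:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # stand-in for a slow HTTP request
    time.sleep(0.1)
    return "<html>%s</html>" % url

urls = ["http://example.com/page%d" % i for i in range(8)]

start = time.time()
with ThreadPoolExecutor(max_workers=8) as pool:
    pages = list(pool.map(fetch, urls))  # fetches run concurrently
elapsed = time.time() - start

print("fetched %d pages in %.2fs" % (len(pages), elapsed))
```

            Run serially, these eight 0.1 s fetches would take about 0.8 s; with eight workers they finish in roughly 0.1 s.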

            Comment

            • dazzler
              New Member
              • Nov 2007
              • 75

              #7
              I have also written a crawler, which parses URLs out of HTML. I think Python's HTML parser modules only work with clean and valid HTML code... and the net is full of dirty HTML! So get ready to write your own HTML parser. =)
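              For what it's worth, Python 3's html.parser (the successor of the old sgmllib/htmllib modules) is fairly forgiving: it keeps going past unclosed tags and unquoted attributes rather than giving up. A small sketch that pulls hrefs out of deliberately dirty HTML:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag, tolerating messy markup."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# unclosed tags and an unquoted attribute value: still parses
messy = '<p>hello <a href="http://example.com">link<p><a href=/relative>two'
parser = LinkExtractor()
parser.feed(messy)
print(parser.links)
```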

              Comment

              • heiro
                New Member
                • Jul 2007
                • 56

                #8

                I'm very interested in how a web crawler works. Would you mind if I ask for some sample code, so that I can study it and later write my own?

                Comment

                • helena pap
                  New Member
                  • Mar 2008
                  • 1

                  #9
                  Hi, I am trying to make a crawler and get the most frequent keywords from the pages of one site... any ideas?
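                  The counting part can be sketched in Python 3 with collections.Counter (the sample text and stopword list here are made up; real input would be the fetched pages with markup stripped):

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "of", "and", "to", "is", "on"}

def top_keywords(text, n=3):
    """Return the n most frequent non-stopword words in text."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return counts.most_common(n)

text = "The crawler visits pages. The crawler counts words on the pages."
print(top_keywords(text))
```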

                  Comment

                  • urgent
                    New Member
                    • Apr 2008
                    • 1

                    #10
                    Hi, I need to write a simple crawler too. It must be able to capture web pages from a certain site, for example www.cnn.com,

                    and it must also parse those HTML pages. I need any sample code, please, urgently, to help me with my project.

                    Comment

                    • chaosAD
                      New Member
                      • Feb 2008
                      • 9

                      #11
                      A simple HTML parser (using the Python 2 sgmllib module) that looks for thumbnail tags and prints the thumbnail information:

                      Code:
                      import urllib2, sgmllib
                      
                      
                      class ImageScraper(sgmllib.SGMLParser):
                      
                          def __init__(self):
                      
                              sgmllib.SGMLParser.__init__(self)
                              
                              self.href = ''
                      
                          def start_a(self, attrs):
                              for tag, value in attrs:
                                  if tag == 'href':
                                      self.href = value
                      
                          def end_a(self):
                              self.href = ''
                      
                          def start_img(self, attrs):
                              if self.href:
                                  print "#####################################"
                                  print "IMAGE URL: " + self.href
                                  for tag, value in attrs:
                                      if tag == 'src':
                                          print "THUMBNAIL SRC: " + value
                                      elif tag == "width":
                                          print "THUMBNAIL WIDTH: " + value
                                      elif tag == "height":
                                          print "THUMBNAIL HEIGHT: " + value
                                      elif tag == "alt":
                                          print "THUMBNAIL NAME: " + value
                                      elif tag == "border":
                                          print "THUMBNAIL BORDER: " + value
                                  print "#####################################\n"
                      
                      
                      url = "http://bytes.com/"
                      
                      sock = urllib2.urlopen(url)
                      
                      page = sock.read()
                      
                      sock.close()
                      
                      parser = ImageScraper()
                      
                      parser.feed(page)
                      
                      parser.close()

                      Comment

                      • varun1985
                        New Member
                        • Jul 2008
                        • 1

                        #12
                        Originally posted by kudos
                        Hi, what do you want to get from your crawl?

                        -kudos
                        hi kudos,

                        I want to write a crawler which will fetch data like company name, turnover, and the product they work on, and store it in my database.

                        Actually, I have to submit a project. I have made a simple HTML-tag-based crawler, but I want to make a dynamic simple web crawler.

                        Your help is required!

                        Thanks in advance!

                        Varun
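                        The storage side can be sketched with the standard library's sqlite3 module (the table layout follows the fields mentioned above; the rows are invented examples):

```python
import sqlite3

# in-memory DB for the sketch; pass a file path for a persistent one
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE company (name TEXT, turnover REAL, product TEXT)")

# in a real crawler these rows would come from the scraped pages
records = [
    ("Acme Corp", 1200000.0, "widgets"),
    ("Globex", 5400000.0, "gadgets"),
]
conn.executemany("INSERT INTO company VALUES (?, ?, ?)", records)
conn.commit()

for row in conn.execute("SELECT name, product FROM company ORDER BY name"):
    print(row)
```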

                        Comment

                        • kudos
                          Recognized Expert New Member
                          • Jul 2006
                          • 127

                          #13
                          OK, with web crawlers there are usually a lot of "ifs", but I have sketched out a very simple web crawler that illustrates the idea (with comments!):

                          Code:
                          # webcrawler
                          # this is basically a shell, illustrating the "breadth-first" type of web crawler
                          # you have to add the extraction of the actual info from each page yourself
                          # all it currently does is print the url of each page and the number of candidates to visit

                          import urllib
                          page = "http://bytes.com" # start page
                          queue = []
                          queue.append(page)
                          visit = {} # keeps track of pages that we visited, to avoid loops
                          stopvar = 5 # exit after visiting this many pages; obviously we do not want to visit every page on the internet :)

                          while stopvar >= 0 and len(queue) > 0:
                              stopvar -= 1
                              cpage = queue.pop(0) # pop from the front of the list, so this is breadth-first
                              f = urllib.urlopen(cpage)
                              html = f.read()
                              f.close()
                              sp = "a href=\""

                              # you would extract things from the html (images, text, etc.) around here
                              # the rest extracts the hyperlinks and puts them in the queue, so we can
                              # continue to visit pages

                              i = html.find(sp)
                              while i != -1:
                                  start = i + len(sp)
                                  end = html.find("\"", start)
                                  if end == -1:
                                      break
                                  url = html[start:end]
                                  # is our link a local link, or a global link? i leave local links as an exercise :)
                                  if url[0:4] == "http":
                                      if url not in visit:
                                          queue.append(url)
                                          visit[url] = 1
                                  i = html.find(sp, end)
                              print str(len(queue)) + " " + cpage

                          -kudos

                          Comment

                          • alidia45
                            New Member
                            • Nov 2009
                            • 1

                            #14
                            Try Scrapy, a very powerful (and well documented) framework for writing web crawlers (and screen scrapers) in Python.

                            Comment
