Web Crawler - Python or Perl?

  • disappearedng@gmail.com

    Web Crawler - Python or Perl?

    Hi all,
    I am currently planning to write my own web crawler. I know Python but
    not Perl, and I am interested in knowing which of these two is the
    better choice given the following scenario:

    1) I/O issues: my biggest resource constraint will be the bandwidth
    bottleneck.
    2) Efficiency issues: The crawlers have to be fast, robust and as
    "memory efficient" as possible. I am running all of my crawlers on
    cheap PCs with about 500 MB of RAM and P3 to P4 processors.
    3) Compatibility issues: Most of these crawlers will run on Unix
    (FreeBSD), so there should exist a pretty good compiler that can
    optimize my code under these environments.

    What are your opinions?
  • subeen

    #2
    Re: Web Crawler - Python or Perl?

    On Jun 9, 11:48 pm, disappeare...@gmail.com wrote:
    > Hi all,
    > I am currently planning to write my own web crawler. I know Python but
    > not Perl, and I am interested in knowing which of these two is the
    > better choice given the following scenario:
    >
    > 1) I/O issues: my biggest resource constraint will be the bandwidth
    > bottleneck.
    > 2) Efficiency issues: The crawlers have to be fast, robust and as
    > "memory efficient" as possible. I am running all of my crawlers on
    > cheap PCs with about 500 MB of RAM and P3 to P4 processors.
    > 3) Compatibility issues: Most of these crawlers will run on Unix
    > (FreeBSD), so there should exist a pretty good compiler that can
    > optimize my code under these environments.
    >
    > What are your opinions?

    It really doesn't matter whether you use Perl or Python for writing
    web crawlers. I have used both for writing crawlers. The scenarios you
    mentioned (I/O issues, efficiency, compatibility) don't differ too
    much between these two languages. Both languages have fast I/O. You
    can use the urllib2 module and/or Beautiful Soup for developing a
    crawler in Python. For Perl you can use the Mechanize or LWP modules.
    Both languages have good support for regular expressions. I have heard
    that Perl is slightly faster, though I haven't noticed the difference
    myself. Both are compatible with *nix. For writing a good crawler, the
    language is not important; it's the technology that matters.

    regards,
    Subeen.
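
A minimal sketch of the parsing half of a crawler, using only the Python standard library. It assumes Python 3, where urllib2 was merged into urllib.request and urllib.parse. The names LinkExtractor and extract_links are illustrative and do not come from any post in this thread.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collect absolute link targets from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                absolute = urljoin(self.base_url, value)
                # Drop the #fragment so the same page is not queued twice.
                self.links.append(urldefrag(absolute).url)


def extract_links(html, base_url):
    """Return every link found in html, resolved against base_url."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links
```

A crawler would call extract_links on each downloaded page and push the unseen results onto its queue of URLs to fetch.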


    • Stefan Behnel

      #3
      Re: Web Crawler - Python or Perl?

      disappearedng@gmail.com wrote:
      > 1) I/O issues: my biggest resource constraint will be the bandwidth
      > bottleneck.
      > 2) Efficiency issues: The crawlers have to be fast, robust and as
      > "memory efficient" as possible. I am running all of my crawlers on
      > cheap PCs with about 500 MB of RAM and P3 to P4 processors.
      > 3) Compatibility issues: Most of these crawlers will run on Unix
      > (FreeBSD), so there should exist a pretty good compiler that can
      > optimize my code under these environments.

      You should rethink your requirements. You expect to be I/O bound, so why do
      you require a good "compiler"? Especially when asking about two interpreted
      languages...

      Consider using lxml (with Python), it has pretty much everything you need for
      a web crawler, supports threaded parsing directly from HTTP URLs, and it's
      plenty fast and pretty memory efficient.



      Stefan


      • Stefan Behnel

        #4
        Re: Web Crawler - Python or Perl?

        subeen wrote:
        > can use urllib2 module and/or beautiful soup for developing crawler

        Not if you care about a) speed and/or b) memory efficiency.



        Stefan


        • subeen

          #5
          Re: Web Crawler - Python or Perl?

          On Jun 10, 12:15 am, Stefan Behnel <stefan...@behnel.de> wrote:
          > subeen wrote:
          >> can use urllib2 module and/or beautiful soup for developing crawler
          >
          > Not if you care about a) speed and/or b) memory efficiency.
          >
          > Stefan

          Yes, Beautiful Soup is slower, so it's better to use urllib2 for
          fetching data and regular expressions for parsing data.


          regards,
          Subeen.


          • Ray Cote

            #6
            Re: Web Crawler - Python or Perl?

            At 11:21 AM -0700 6/9/08, subeen wrote:
            >On Jun 10, 12:15 am, Stefan Behnel <stefan...@behnel.de> wrote:
            >> subeen wrote:
            >>> can use urllib2 module and/or beautiful soup for developing crawler
            >>
            >> Not if you care about a) speed and/or b) memory efficiency.
            >>
            >> Stefan
            >
            >ya, beautiful soup is slower. so it's better to use urllib2 for
            >fetching data and regular expressions for parsing data.
            >
            >regards,
            >Subeen.
            >http://love-python.blogspot.com/

            Beautiful Soup is a bit slower, but it will actually parse some of
            the bizarre HTML you'll download off the web. We've written a couple
            of crawlers to run over specific clients' sites (I note, we did _not_
            create the content on these sites).

            Expect to find HTML code that looks like this:

            <ul>
            <li>
            <form>
            </li>
            </form>
            </ul>
            [from a real example, and yes, it did indeed render in IE.]

            I don't know if some of the quicker parsers discussed require
            well-formed HTML since I've not used them. You may want to consider
            using one of the quicker HTML parsers and, when they throw a fit on
            the downloaded HTML, drop back to Beautiful Soup -- which usually
            gets _something_ useful off the page.

            --Ray
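
The fallback idea above can be sketched as follows, under stated assumptions. The strict parser here is the standard library's xml.etree.ElementTree, standing in for "one of the quicker parsers", and the tolerant fallback is the standard library's html.parser, standing in for Beautiful Soup. Neither is the parser discussed in this thread, and the function names are illustrative.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Tolerant fallback: records start tags in order and ignores bad nesting."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def strict_tags(markup):
    """Parse well-formed markup only; raises ET.ParseError otherwise."""
    root = ET.fromstring(markup)
    return [element.tag for element in root.iter()]


def tolerant_tags(markup):
    """Salvage whatever start tags can be found, even in malformed markup."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags


def parse_tags(markup):
    """Try the strict parser first and fall back to the tolerant one."""
    try:
        return strict_tags(markup), "strict"
    except ET.ParseError:
        return tolerant_tags(markup), "tolerant"
```

Fed the mis-nested list/form snippet above, the strict parser rejects the markup and the tolerant one still reports the three start tags it saw.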

            --

            Raymond Cote
            Appropriate Solutions, Inc.
            PO Box 458 ~ Peterborough, NH 03458-0458
            Phone: 603.924.6079 ~ Fax: 603.924.8668
            rgacote(at)AppropriateSolutions.com


            • Sebastian "lunar" Wiesner

              #7
              Re: Web Crawler - Python or Perl?

              subeen <tamim.shahriar@gmail.com> at Monday, 9 June 2008 20:21:
              > On Jun 10, 12:15 am, Stefan Behnel <stefan...@behnel.de> wrote:
              >> subeen wrote:
              >>> can use urllib2 module and/or beautiful soup for developing crawler
              >>
              >> Not if you care about a) speed and/or b) memory efficiency.
              >>
              >> http://blog.ianbicking.org/2008/03/3...r-performance/
              >>
              >> Stefan
              >
              > ya, beautiful soup is slower. so it's better to use urllib2 for
              > fetching data and regular expressions for parsing data.

              Beautiful Soup is itself implemented on top of regular expressions. I
              doubt that you can achieve a great performance gain by using plain regular
              expressions, and even if you could, the gain would certainly not be worth
              the effort. Parsing markup with regular expressions is hard, and the result
              will most likely not be as fast and as memory-efficient as lxml.html.

              I personally am absolutely happy with lxml.html. It's fast, memory
              efficient, yet powerful and easy to use.

              --
              Freedom is always the freedom of dissenters.
              (Rosa Luxemburg)


              • Nick Craig-Wood

                #8
                Re: Web Crawler - Python or Perl?

                disappearedng@gmail.com <disappearedng@gmail.com> wrote:
                > I am currently planning to write my own web crawler. I know Python but
                > not Perl, and I am interested in knowing which of these two is the
                > better choice given the following scenario:
                >
                > 1) I/O issues: my biggest resource constraint will be the bandwidth
                > bottleneck.
                > 2) Efficiency issues: The crawlers have to be fast, robust and as
                > "memory efficient" as possible. I am running all of my crawlers on
                > cheap PCs with about 500 MB of RAM and P3 to P4 processors.
                > 3) Compatibility issues: Most of these crawlers will run on Unix
                > (FreeBSD), so there should exist a pretty good compiler that can
                > optimize my code under these environments.
                >
                > What are your opinions?

                Use python with twisted.

                With a friend I wrote a crawler. Our first attempt was standard
                python. Our second attempt was with twisted. Twisted absolutely blew
                the socks off our first attempt - mainly because you can fetch 100s or
                1000s of pages simultaneously, without threads.

                Python with twisted will satisfy 1-3. You'll have to get your head
                around its asynchronous nature, but once you do you'll be writing a
                killer crawler ;-)
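
A rough sketch of the same single-threaded, many-requests-in-flight idea. It uses the standard library's asyncio rather than Twisted, so it is not the code described above. The fetch coroutine is passed in by the caller, and fake_fetch is a hypothetical stand-in for a real asynchronous HTTP request.

```python
import asyncio


async def crawl(urls, fetch, max_concurrent=100):
    """Fetch all urls concurrently in one thread.

    Returns a dict mapping each URL to its result, or to the exception
    that fetch raised for it.
    """
    # The semaphore caps how many requests are in flight at once.
    semaphore = asyncio.Semaphore(max_concurrent)

    async def fetch_one(url):
        async with semaphore:
            try:
                return url, await fetch(url)
            except Exception as error:
                # One failing page should not abort the whole crawl.
                return url, error

    pairs = await asyncio.gather(*(fetch_one(url) for url in urls))
    return dict(pairs)


async def fake_fetch(url):
    # Hypothetical stand-in: pretend the network takes 10 ms per page.
    await asyncio.sleep(0.01)
    return "<html>%s</html>" % url
```

Calling asyncio.run(crawl(urls, fake_fetch)) drives the whole batch from a single thread. Because the waits overlap, the total time is governed by the slowest responses and the concurrency cap, not by the sum of all response times.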

                As for Perl - once upon a time I would have done this with Perl, but I
                wouldn't go back now!

                --
                Nick Craig-Wood <nick@craig-wood.com> -- http://www.craig-wood.com/nick


                • Stefan Behnel

                  #9
                  Re: Web Crawler - Python or Perl?

                  Ray Cote wrote:
                  > Beautiful Soup is a bit slower, but it will actually parse some of the
                  > bizarre HTML you'll download off the web.
                  > [...]
                  > I don't know if some of the quicker parsers discussed require
                  > well-formed HTML since I've not used them. You may want to consider
                  > using one of the quicker HTML parsers and, when they throw a fit on the
                  > downloaded HTML, drop back to Beautiful Soup -- which usually gets
                  > _something_ useful off the page.

                  So does lxml.html. And if you still feel you need BS once in a while,
                  there's lxml.html.soupparser.



                  Stefan


                  • disappearedng@gmail.com

                    #10
                    Re: Web Crawler - Python or Perl?

                    As to why as opposed to what, I am attempting to build a search engine
                    right now that plans to crawl not just HTML but other things too.

                    I am open to learning, but I don't want to learn anything that doesn't
                    really contribute to building my search engine for the moment. Hence I
                    want to see whether learning Perl will be helpful for the later parts
                    of my search engine.

                    Victor


                    • Stefan Behnel

                      #11
                      Re: Web Crawler - Python or Perl?

                      disappearedng@gmail.com wrote:
                      > As to why as opposed to what, I am attempting to build a search engine
                      > right now that plans to crawl not just html but other things too.
                      >
                      > I am open to learning, and I don't want to learn anything that doesn't
                      > really contribute to building my search engine for the moment. Hence I
                      > want to see whether learning PERL will be helpful to the later parts
                      > of my search engine.

                      I honestly don't think there's anything useful in Perl that you can't do in
                      Python. There's tons of ugly ways to write unreadable code, though, so if you
                      prefer that, that's something that's harder to do in Python.

                      Stefan


                      • Chuck Rhode

                        #12
                        Re: Web Crawler - Python or Perl?

                        On Mon, 09 Jun 2008 10:48:03 -0700, disappearedng wrote:
                        > I know Python but not Perl, and I am interested in knowing which of
                        > these two is the better choice.

                        I'm partial to *Python*, but, the last time I looked, *urllib2* didn't
                        provide a time-out mechanism that worked under all circumstances. My
                        client-side scripts would usually hang when the server quit
                        responding, which happened a lot.

                        You can get around this by starting an *html* retrieval in its own
                        thread, giving it a deadline, and killing it if it doesn't finish
                        gracefully.

                        A quicker and considerably grittier solution is to supply timeout
                        parameters to the *curl* command through the shell. Execute the
                        command and retrieve its output through the *subprocess* module.
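
A minimal sketch of the thread-with-a-deadline workaround, assuming Python 3 and the standard concurrent.futures module. Python threads cannot be forcibly killed, so this version stops waiting for the worker rather than terminating it. The name run_with_deadline is illustrative.

```python
import concurrent.futures


def run_with_deadline(func, deadline_seconds, *args, default=None):
    """Run func(*args) in a worker thread and wait at most deadline_seconds.

    Returns func's result, or default if the deadline passes first. The
    worker is abandoned, not killed: it keeps running in the background
    until func returns on its own.
    """
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(func, *args)
    try:
        return future.result(timeout=deadline_seconds)
    except concurrent.futures.TimeoutError:
        return default
    finally:
        # Do not block here waiting for a worker that may be hung.
        executor.shutdown(wait=False)
```

In a crawler, func would be the blocking download call and the default would mark the URL as failed so it can be retried or skipped.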

                        --
                        ... Chuck Rhode, Sheboygan, WI, USA
                        ... 1979 Honda Goldwing GL1000 (Geraldine)
                        ... Weather: http://LacusVeris.com/WX
                        ... 64° — Wind SE 5 mph — Sky partly cloudy.


                        • subeen

                          #13
                          Re: Web Crawler - Python or Perl?

                          On Jun 13, 1:26 am, Chuck Rhode <CRh...@LacusVeris.com> wrote:
                          > On Mon, 09 Jun 2008 10:48:03 -0700, disappearedng wrote:
                          >> I know Python but not Perl, and I am interested in knowing which of
                          >> these two is the better choice.
                          >
                          > I'm partial to *Python*, but, the last time I looked, *urllib2* didn't
                          > provide a time-out mechanism that worked under all circumstances. My
                          > client-side scripts would usually hang when the server quit
                          > responding, which happened a lot.

                          You can avoid the problem using the following code:

                          import socket

                          timeout = 300 # seconds
                          socket.setdefaulttimeout(timeout)
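
As an alternative to a process-wide default, here is a sketch of a per-request timeout. It assumes Python 3's urllib.request, the successor to urllib2; urlopen has accepted a timeout argument since Python 2.6. The helper name fetch and its choice to return None on failure are illustrative.

```python
import urllib.request


def fetch(url, timeout=10.0):
    """Return the response body as bytes, or None on timeout or network error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except OSError:
        # URLError and socket timeouts are both subclasses of OSError.
        return None
```

The timeout applies to each blocking socket operation, not to the download as a whole, so a server that trickles data slowly can still hold a request open longer than the timeout value.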

                          regards,
                          Subeen.
