Async Client with 1K connections?

  • William Chang

    Async Client with 1K connections?

    Before I take the plunge, I'd appreciate any advice on the feasibility
    and degree of difficulty of the following...

    I need extremely efficient and robust _client_ software for some
    common protocols like HTTP and POP3, supporting 1,000 simultaneous
    independent connections and commensurate network throughput. The data
    get written to files or sockets, so no GUI needed.

    I am not a Python programmer :-( but I am a "fan" :-) and I have been
    reading about asyncore/Medusa/Twisted -- which would be my best bet?

    Any advantage to using a particular unix/version -- Linux 32/64bit?
    FreeBSD 4/5? Solaris Sun/Intel?

    If anyone who is expert in this area may be available, please contact
    me at "w c h a n g at a f f i n i dot com". (I'm in the SF Bay Area.)

    My background is C -- I was the principal author of Infoseek (RIP),
    including the Python modularization that was the core of Ultraseek aka
    Inktomi Enterprise Search aka Verity. (For those of you old enough to
    remember!) Unfortunately, I moved upstairs and never did much Python.

    Thanks in advance, --William
  • Paul Rubin

    #2
    Re: Async Client with 1K connections?

    williamichang@hotmail.com (William Chang) writes:
    > I need extremely efficient and robust _client_ software for some
    > common protocols like HTTP and POP3, supporting 1,000 simultaneous
    > independent connections and commensurate network throughput. The data
    > get written to files or sockets, so no GUI needed.

    You're writing a monstrous web spider in Python?

    > I am not a Python programmer :-( but I am a "fan" :-) and I have been
    > reading about asyncore/Medusa/Twisted -- which would be my best bet?

    With enough hardware, you can do practically anything. Some Apache
    servers fork off that many processes.


    • Paul Rubin

      #3
      Re: Async Client with 1K connections?

      williamichang@hotmail.com (William Chang) writes:
      > I need extremely efficient and robust _client_ software for some
      > common protocols like HTTP and POP3, supporting 1,000 simultaneous
      > independent connections and commensurate network throughput. The data
      > get written to files or sockets, so no GUI needed.
      >
      > I am not a Python programmer :-( but I am a "fan" :-) and I have been
      > reading about asyncore/Medusa/Twisted -- which would be my best bet?

      Seriously, I'd probably use asyncore since it's the simplest. Twisted
      is more flexible but maybe you don't need that.
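
      For a concrete sense of the asyncore route, here is a minimal sketch --
      one dispatcher object per connection, all multiplexed by a single
      asyncore.loop() call. The host, request line, and output files are
      placeholder assumptions, not tested spider code:

      import asyncore
      import socket

      class HttpGet(asyncore.dispatcher):
          def __init__(self, host, out):
              asyncore.dispatcher.__init__(self)
              self.out = out  # file (or socket) to stream the response into
              self.request = ("GET / HTTP/1.0\r\nHost: %s\r\n\r\n" % host).encode()
              self.create_socket(socket.AF_INET, socket.SOCK_STREAM)
              self.connect((host, 80))

          def handle_connect(self):
              pass

          def writable(self):
              # Only ask for write events while part of the request is unsent.
              return bool(self.request)

          def handle_write(self):
              sent = self.send(self.request)
              self.request = self.request[sent:]

          def handle_read(self):
              # Stream straight to the output instead of buffering in RAM.
              self.out.write(self.recv(8192))

          def handle_close(self):
              self.close()

      # One dispatcher per simultaneous connection; scale the range toward 1000.
      clients = [HttpGet("example.com", open("page%d.out" % i, "wb"))
                 for i in range(10)]
      asyncore.loop()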

      Why do you want to write this client in Python? What is it doing?

      Rather than going crazy tuning the software, you can parallelize it
      and run it on multiple boxes. Does that work for you?

      > Any advantage to using a particular unix/version -- Linux 32/64bit?
      > FreeBSD 4/5? Solaris Sun/Intel?

      Google has something like 8,000 servers in its farm, running 32-bit
      Linux, so they're probably onto something. Solaris is a lot slower.
      64-bit Linux is maybe too new to deploy in a big production system.


      • Bill Scherer

        #4
        Re: Async Client with 1K connections?

        [P&M]

        William Chang wrote:
        >Before I take the plunge, I'd appreciate any advice on the feasibility
        >and degree of difficulty of the following...
        >
        >I need extremely efficient and robust _client_ software for some
        >common protocols like HTTP and POP3, supporting 1,000 simultaneous
        >independent connections
        >
        I've got an httpd stress tool that uses asyncore. I can run up 1020
        independent simulated clients on my RH9 box (1x3GHz CPU, 1GB RAM),
        driving at over 600 requests per second against a modest (2x1GHz)
        webserver, just pulling a static page.

        >and commensurate network throughput.
        >
        That could vary a lot, couldn't it?

        >The data get written to files or sockets, so no GUI needed.
        >
        Writing to files could slow you down a lot, depending on how much needs
        to be written, how fast your disks are, how you go about getting the
        data from the async client to the file, etc. Much of the same goes for
        sockets, too.

        >I am not a Python programmer :-( but I am a "fan" :-) and I have been
        >reading about asyncore/Medusa/Twisted -- which would be my best bet?
        >
        I should think all can do the job for you, depending on the details
        which you haven't told us.

        >Any advantage to using a particular unix/version -- Linux 32/64bit?
        >FreeBSD 4/5? Solaris Sun/Intel?
        >
        >If anyone who is expert in this area may be available, please contact
        >me at "w c h a n g at a f f i n i dot com". (I'm in the SF Bay Area.)
        >
        >My background is C -- I was the principal author of Infoseek (RIP),
        >including the Python modularization that was the core of Ultraseek aka
        >Inktomi Enterprise Search aka Verity. (For those of you old enough to
        >remember!) Unfortunately, I moved upstairs and never did much Python.
        >
        >Thanks in advance, --William
        >



        • Michel Claveau/Hamster

          #5
          Re: Async Client with 1K connections?

          Hi!

          See Erlang: its sample web server can serve more than 50,000 connections
          on one standard CPU.





          • Dave Brueck

            #6
            Re: Async Client with 1K connections?

            > Before I take the plunge, I'd appreciate any advice on the feasibility
            > and degree of difficulty of the following...
            >
            > I need extremely efficient and robust _client_ software for some
            > common protocols like HTTP and POP3, supporting 1,000 simultaneous
            > independent connections and commensurate network throughput. The data
            > get written to files or sockets, so no GUI needed.

            1000+ connections is not a problem, although (on Linux at least, and probably
            others) you'll probably need to make sure your process is allowed to have
            more file descriptors open, especially if you're turning around and writing
            data to disk (since that uses file descriptors too). This is OS-specific and
            has nothing to do with Python, but IIRC you can do something like
            os.sysconf(os.sysconf_names['SC_OPEN_MAX']) to see how many fd's your process
            can have open.
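
            For instance, a small sketch of that check, plus raising the soft limit
            with the stdlib resource module (the 4096 target is an arbitrary
            assumption -- size it to cover your sockets plus output files):

            import os
            import resource

            # How many file descriptors may this process have open?
            print(os.sysconf(os.sysconf_names['SC_OPEN_MAX']))

            # On Unix, raise the soft limit toward the hard limit if needed.
            soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
            if soft < 4096 and (hard == resource.RLIM_INFINITY or hard >= 4096):
                resource.setrlimit(resource.RLIMIT_NOFILE, (4096, hard))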

            > I am not a Python programmer :-( but I am a "fan" :-) and I have been
            > reading about asyncore/Medusa/Twisted -- which would be my best bet?

            You're probably going to be ok either way, but what are your throughput
            requirements exactly? Are these connections pulling down HTML pages and small
            images or are they big, multi-megabyte downloads? How big is your connection?
            For 99% of uses asyncore or Twisted will be fine - but if you need very high
            numbers of new connections per second (hundreds) or throughput (hundreds of
            Mbps) then you might need to modify the framework or build your own - still in
            Python but more tailored to your specific needs - in order to get those levels
            of performance.

            -Dave



            • Peter Hansen

              #7
              Re: Async Client with 1K connections?

              Paul Rubin wrote:
              >
              > williamichang@hotmail.com (William Chang) writes:
              > > I need extremely efficient and robust _client_ software for some
              > > common protocols like HTTP and POP3, supporting 1,000 simultaneous
              > > independent connections and commensurate network throughput. The data
              > > get written to files or sockets, so no GUI needed.
              > >
              > > I am not a Python programmer :-( but I am a "fan" :-) and I have been
              > > reading about asyncore/Medusa/Twisted -- which would be my best bet?
              >
              > Seriously, I'd probably use asyncore since it's the simplest. Twisted
              > is more flexible but maybe you don't need that.

              I agree Twisted is more flexible, but having tried both I'd argue that
              it is also simpler. I was able to get farther, faster, just by following
              the simple examples (e.g. http://www.twistedmatrix.com/documents/howto/clients)
              on the web site than I was with asyncore. I also found the source
              _much_ cleaner and more readable when it came time to look there as well.
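
              For example, a client in the style of those howto pages might look
              roughly like this -- the host, port, raw HTTP/1.0 request, and output
              file are placeholder assumptions (a real spider would more likely use
              twisted.web.client):

              from twisted.internet import reactor, protocol

              class PageGetter(protocol.Protocol):
                  def connectionMade(self):
                      # Ask for one page; the server closes the connection when done.
                      self.transport.write(b"GET / HTTP/1.0\r\nHost: example.com\r\n\r\n")

                  def dataReceived(self, data):
                      # Stream to disk instead of accumulating in memory.
                      self.factory.out.write(data)

              class PageGetterFactory(protocol.ClientFactory):
                  protocol = PageGetter

                  def __init__(self, out):
                      self.out = out

                  def clientConnectionLost(self, connector, reason):
                      reactor.stop()

                  clientConnectionFailed = clientConnectionLost

              factory = PageGetterFactory(open("page.out", "wb"))
              reactor.connectTCP("example.com", 80, factory)
              reactor.run()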

              -Peter


              • Paul Rubin

                #8
                Re: Async Client with 1K connections?

                Bill Scherer <Bill.Scherer@verizonwireless.com> writes:
                > >The data get written to files or sockets, so no GUI needed.
                > >
                > Writing to files could slow you down a lot, depending on how much
                > needs to be written, how fast your disks are, how you go about
                > getting the data from the async client to the file, etc. Much of the
                > same goes for sockets, too.

                That's a good point; you should put everything into one file serially,
                then sort it afterwards to separate out data from individual
                connections.
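
                A rough sketch of that serial-log idea, with an assumed record
                format: every chunk is appended to one file tagged with its
                connection id, and a second pass regroups chunks by id:

                import struct

                def append_chunk(log, conn_id, data):
                    # 8-byte header: connection id and chunk length.
                    log.write(struct.pack("!II", conn_id, len(data)))
                    log.write(data)

                def read_chunks(log):
                    while True:
                        header = log.read(8)
                        if len(header) < 8:
                            return  # end of log
                        conn_id, size = struct.unpack("!II", header)
                        yield conn_id, log.read(size)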


                • William Chang

                  #9
                  Re: Async Client with 1K connections?

                  Thank you all for the discussion! Some additional information:

                  One of the intended uses is indeed a next-gen web spider. I did the
                  math, and yes I will need about 10 cutting-edge PCs to spider like
                  you-know-who. But I shouldn't need 100 -- and would rather not
                  spend money unnecessarily... Throughput per PC would be on
                  the order of 1MB/s assuming 200x5KB downloads/sec using 1-2000
                  simultaneous connections. (That's 17M pages per day per PC.)
                  My search & content engine can index and store at such a rate,
                  but can the spider initiate (at least) 200 new requests per second,
                  assuming each request lasts 5-10 seconds?
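
                  Checking those figures (by Little's law, connections in flight =
                  arrival rate x mean request duration), the numbers do hang together:

                  rate = 200        # new downloads per second
                  page = 5 * 1000   # ~5KB per page
                  secs = 7.5        # middle of the 5-10s request lifetime

                  print(rate * page)   # 1,000,000 bytes/s -> ~1MB/s per PC
                  print(rate * 86400)  # 17,280,000 -> ~17M pages per day
                  print(rate * secs)   # 1500.0 -> within the 1-2000 connections quoted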

                  Of course, that assumes the spider algorithm/coordinator is pretty
                  intelligent and well-engineered. And the hardware has to stay up, etc.
                  Managing storage is certainly nontrivial; at such a scale nothing is
                  to be taken for granted!

                  Nevertheless, it shouldn't cost millions. Maybe $100K :-)

                  Time for a sanity check? --William





                  • Paul Rubin

                    #10
                    Re: Async Client with 1K connections?

                    "William Chang" <williamichang@hotmail.com> writes:
                    > Thank you all for the discussion! Some additional information:
                    >
                    > One of the intended uses is indeed a next-gen web spider. I did the
                    > math, and yes I will need about 10 cutting-edge PCs to spider like
                    > you-know-who. But I shouldn't need 100 -- and would rather not
                    > spend money unnecessarily... Throughput per PC would be on
                    > the order of 1MB/s assuming 200x5KB downloads/sec using 1-2000
                    > simultaneous connections. (That's 17M pages per day per PC.)

                    That's orders of magnitude less than you-know-who. Also, don't forget
                    how many queries you have to take from users, and the number of disk
                    seeks needed for each one.

                    > Nevertheless, it shouldn't cost millions. Maybe $100K :-)

                    10 MB/s of internet connectivity is at least a few K$/month all by itself.


                    • Aahz

                      #11
                      Re: Async Client with 1K connections?

                      In article <zeWdnXYoFaHsSbTd4p2dnA@comcast.com>,
                      William Chang <williamichang@hotmail.com> wrote:
                      >
                      >One of the intended uses is indeed a next-gen web spider. I did the
                      >math, and yes I will need about 10 cutting-edge PCs to spider like
                      >you-know-who.

                      Note that while you-know-who makes extensive use of Python, I don't
                      think they're using it for spidering/searching. I do have some
                      background writing a spider in Python, using Verity's engine for
                      indexing/retrieval, but we were using threading rather than
                      asyncore-style operations.
                      --
                      Aahz (aahz@pythoncraft.com) <*> http://www.pythoncraft.com/

                      "Argue for your limitations, and sure enough they're yours." --Richard Bach


                      • William Chang

                        #12
                        Re: Async Client with 1K connections?

                        aahz@pythoncraft.com (Aahz) wrote:
                        > Note that while you-know-who makes extensive use of Python, I don't
                        > think they're using it for spidering/searching. I do have some
                        > background writing a spider in Python, using Verity's engine for
                        > indexing/retrieval, but we were using threading rather than
                        > asyncore-style operations.

                        Interesting -- did you try maxing out the number of threads/connections?
                        On an UltraSparc with hardware thread/lwp support, a thousand threads
                        can co-exist reliably, at least for computations and disk I/O. Linux
                        is another matter entirely.

                        --William


                        • William Chang

                          #13
                          Re: Async Client with 1K connections?

                          Paul Rubin <http://phr.cx@NOSPAM.invalid> wrote:
                          > "William Chang" <williamichang@hotmail.com> writes:
                          > > ... Throughput per PC would be on
                          > > the order of 1MB/s assuming 200x5KB downloads/sec using 1-2000
                          > > simultaneous connections. (That's 17M pages per day per PC.)
                          >
                          > That's orders of magnitude less than you-know-who.

                          Do you know how frequently you-know-who refreshes its entire index? A year
                          ago things were pretty dire, easily over 10% dead links, if I recall correctly.
                          10 PCs at 17M/day each will refresh 3B pages in 18 days, easily world-class.

                          > ... Also, don't forget
                          > how many queries you have to take from users, and the number of disk
                          > seeks needed for each one.

                          Sure, that's what I do. However, spidering and querying are independent tasks,
                          generally speaking.

                          > 10 MB/s of internet connectivity is at least a few K$/month all by itself.

                          Yes, $2500 to be specific.

                          There's no reason to be intimidated (if I may use that word) by you-know-who's
                          marketing message (80,000 machines). Back in '96 Infoseek could handle 10M
                          queries per day on a single Sun E4000 with 8 CPUs (<200MHz), 4GB, 20x4GB RAID.
                          Sure the WWW is much bigger now, but so are the disk drives!

                          -- William
