sk

  • smartestdesign@gmail.com

    sk

    I am developing a program to crawl a site (it looks like craigslist).
    Since it has more than 20,000 entries, I have to go to each category
    page, parse it with regular expressions, and extract the data to a
    database. This data will be updated every two days.

    The problem I am analyzing now is that I have a number of client
    sites running on the same machine, and if my program hogs the CPU
    (more than 80% usage) the web server might hang and won't accept any
    connections from outside until I reboot the server.

    I came up with an idea to reduce the processing overhead (rough
    sketch below):
    1. Go to the site and download all the pages without parsing.
    2. Once all the pages have been downloaded locally, start parsing.
    3. Save all the data in a database.
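
    Something like this is what I mean. A rough Python sketch only; the
    category URLs, the regular expression, and the sqlite table are
    placeholders for whatever the real site and database look like:

    import re
    import sqlite3
    import time
    import urllib.request

    CATEGORY_URLS = [
        "http://example.com/cat1/index.html",  # placeholder pages
        "http://example.com/cat2/index.html",
    ]

    # Phase 1: download everything first, pausing between requests so
    # the crawler never pushes the machine toward that 80% mark.
    pages = []
    for url in CATEGORY_URLS:
        with urllib.request.urlopen(url) as resp:
            pages.append(resp.read().decode("utf-8", errors="replace"))
        time.sleep(2)

    # Phase 2: parse the local copies; no network traffic from here on.
    ENTRY_RE = re.compile(r'<a href="(/post/\d+)">([^<]+)</a>')  # placeholder
    rows = []
    for html in pages:
        rows.extend(ENTRY_RE.findall(html))

    # Phase 3: save everything into the database in one pass.
    conn = sqlite3.connect("entries.db")
    conn.execute("CREATE TABLE IF NOT EXISTS entries (url TEXT, title TEXT)")
    conn.executemany("INSERT INTO entries VALUES (?, ?)", rows)
    conn.commit()
    conn.close()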

    If anyone has a better idea, let me know.

    SK

  • Jonathan

    #2
    Re: sk

    smartestdesign@gmail.com wrote:
    > I am developing a program to crawl a site (it looks like
    > craigslist). [...]
    If you can get access to their database, you are better off
    replicating it... but I guess that is not an option...
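
    If it were MySQL, for example, replication is mostly configuration.
    This is only a sketch; the server IDs and the database name are
    placeholders, and it assumes access you probably don't have:

    # my.cnf on their server (the source):
    [mysqld]
    server-id = 1
    log-bin = mysql-bin
    binlog-do-db = listings        # placeholder database name

    # my.cnf on your machine (the replica):
    [mysqld]
    server-id = 2
    replicate-do-db = listings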

    Jonathan


    • smartestdesign@gmail.com

      #3
      Re: sk

      Unfortunately not.


      • Richard Levasseur

        #4
        Re: sk


        smartestdesign@gmail.com wrote:
        > I am developing a program to crawl a site (it looks like
        > craigslist). [...]
        Try using the RSS feeds.
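
        For example, a quick Python sketch; the feed URL is a placeholder
        and it assumes the site publishes standard RSS 2.0:

        import urllib.request
        import xml.etree.ElementTree as ET

        FEED_URL = "http://example.com/category/index.rss"  # placeholder

        with urllib.request.urlopen(FEED_URL) as resp:
            tree = ET.parse(resp)

        # Feed items already carry structured fields, so there is no
        # HTML to untangle with regular expressions.
        for item in tree.iterfind(".//item"):
            print(item.findtext("title"), item.findtext("link"))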


        • smartestdesign@gmail.com

          #5
          Re: sk


          Richard,

          Thank you for sharing your idea.
          > Try using the RSS feeds.

          I don't know much about RSS feeds, but they have to be set up
          on the source site, right? Say I want to get some data from
          mysite.com; then mysite.com has to provide the RSS XML file,
          right?

          What I want to do is crawl the external sites and extract the
          data.

          SK


          • mootmail-googlegroups@yahoo.com

            #6
            Re: sk


            smartestdesign@gmail.com wrote:
            > The problem I am analyzing now is that I have a number of
            > client sites running on the same machine, and if my program
            > hogs the CPU (more than 80% usage) the web server might hang
            > and won't accept any connections from outside until I reboot
            > the server.

            Hard to tell from your post... is this script running through
            the web server as a page?

            If so, you might try converting it to a command-line script
            and running it via cron. Run that way it is a separate
            process, so the operating system can schedule it alongside
            the web server instead of letting it tie up a web server
            worker and lock the server up.
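
            For example, a crontab line like this (the script path and
            the schedule are placeholders; nice keeps the crawler at the
            lowest CPU priority so the web server is always scheduled
            ahead of it):

            # Run the crawler at 3 a.m. on every second day of the month.
            0 3 */2 * * nice -n 19 /usr/bin/python3 /home/sk/crawl.py >> /var/log/crawl.log 2>&1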


            • Carl Vondrick

              #7
              Re: sk

              smartestdesign@gmail.com wrote:
              > > Try using the RSS feeds.
              >
              > I don't know much about RSS feeds, but they have to be set
              > up on the source site, right? Say I want to get some data
              > from mysite.com; then mysite.com has to provide the RSS
              > XML file, right?
              Yes, the feed must come from the source.

              Carl
