Code For Five Threads To Process Multiple Files?

  • tdahsu@gmail.com

    Code For Five Threads To Process Multiple Files?

    All,

    I'd appreciate any help. I've got a list of files in a directory, and
    I'd like to iterate through that list and process each one. Rather
    than do that serially, I was thinking I should start five threads and
    process five files at a time.

    Is this a good idea? I picked the number five at random... I was
    thinking that I might check the number of processors and start a
    multiple of that, but then I remembered KISS and it seemed that that
    was too complicated.

    If it's not a horrible idea, would anyone be able to provide some
    quick code as to how to do that? Any and all help would be greatly
    appreciated!

    Thanks in advance!
  • A.T.Hofkamp

    #2
    Re: Code For Five Threads To Process Multiple Files?

    On 2008-05-21, tdahsu@gmail.com <tdahsu@gmail.com> wrote:
    > I'd appreciate any help. I've got a list of files in a directory, and
    > I'd like to iterate through that list and process each one. Rather
    > than do that serially, I was thinking I should start five threads and
    > process five files at a time.
    >
    > Is this a good idea? I picked the number five at random... I was

    Depends what you are doing.
    If you are mainly reading/writing files, there is not much to gain, since one
    process will already push the disk I/O system to its limit. If you do a lot of
    processing, then more threads than the number of processors is not much use. If
    you have more 'bursty' behavior (first a lot of reading, then a lot of
    processing, then reading again, etc.), then the system may be able to do some
    scheduling and keep both the processors and the file system busy.

    I cannot really give you advice on threading; I have never done that. You may
    want to consider an alternative, namely multi-tasking at OS level. If you can
    easily split the files over a number of OS processes (written in Python), you
    can make the Python program really simple, and let the OS handle the
    task-switching between the programs.

    Sincerely,
    Albert


    • tdahsu@gmail.com

      #3
      Re: Code For Five Threads To Process Multiple Files?

      Albert,

      Thanks for your response - I appreciate your time!

      I am mainly reading and writing files, so it seems like it might not
      be a good idea. What if I read the whole file into memory first, and
      operate on it there? They are not large files...

      Either way, I'd hope that someone might respond with an example, as
      then I could test and see which is faster!

      Thanks again.
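To "test and see which is faster", a timing harness along the following lines might help. It is only a sketch in modern Python 3 (`concurrent.futures` did not exist when this thread was written), and `process_file`, `make_sample_files`, and the sample file contents are hypothetical stand-ins for the real work and data.

```python
import tempfile
import time
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def process_file(path):
    """Hypothetical stand-in for the real work: read the file, count its lines."""
    with open(path) as handle:
        return len(handle.readlines())


def run_serial(paths):
    """Process the files one after another."""
    return [process_file(p) for p in paths]


def run_threaded(paths, workers=5):
    """Process the files with a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the same order as the input paths.
        return list(pool.map(process_file, paths))


def timed(func, *args):
    """Return (result, elapsed seconds) for one call of func."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


def make_sample_files(directory, count=20, lines=1000):
    """Create small throwaway text files so the comparison can run anywhere."""
    paths = []
    for i in range(count):
        path = Path(directory) / f"sample_{i}.txt"
        path.write_text("some text\n" * lines)
        paths.append(path)
    return paths


with tempfile.TemporaryDirectory() as tmp:
    sample_paths = make_sample_files(tmp)
    serial_result, serial_secs = timed(run_serial, sample_paths)
    threaded_result, threaded_secs = timed(run_threaded, sample_paths)
    print(f"serial:   {serial_secs:.4f}s")
    print(f"threaded: {threaded_secs:.4f}s")
```

The timings depend heavily on the disk, the operating system cache, and the amount of processing per file, so the numbers are only meaningful when measured on the real files.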


      • tdahsu@gmail.com

        #4
        Re: Code For Five Threads To Process Multiple Files?

        Ah, well, I didn't get any other responses, but here's what I've done:

        loopCount = 0
        for l in range(len(self.filesToProcess)):
            threads = []
            try:
                threads.append(threading.Thread(target=self.processFiles(self.filesToProcess[loopCount+l])))
                threads.append(threading.Thread(target=self.processFiles(self.filesToProcess[loopCount+2])))
                threads.append(threading.Thread(target=self.processFiles(self.filesToProcess[loopCount+3])))
                threads.append(threading.Thread(target=self.processFiles(self.filesToProcess[loopCount+4])))
                threads.append(threading.Thread(target=self.processFiles(self.filesToProcess[loopCount+5])))
                msg = "Processing file...\n"
                for thread in threads:
                    wx.CallAfter(self.textctrl03.write(msg), thread.start())
                for thread in threads:
                    thread.join()
                loopCount += 5
            except IndexError:
                pass

        It works, and it works well. It starts five threads, and processes
        five files at a time. (In the "self.processFiles" I read the whole
        file into memory using readlines(), which works well.)

        Of course, now the wx.CallAfter function doesn't work... I get
        "TypeError: 'NoneType' object is not callable" for every time it is
        run...


        • tdahsu@gmail.com

          #5
          Re: Code For Five Threads To Process Multiple Files?

          On May 23, 12:20 am, Dennis Lee Bieber <wlfr...@ix.netcom.com> wrote:
          On Thu, 22 May 2008 11:03:48 -0700 (PDT), tda...@gmail.com declaimed the
          following in comp.lang.python:
          >
          Ah, well, I didn't get any other responses, but here's what I've done:
          >
                  Apparently the direct email from my work address did not get through
          (I don't have group posting ability from work).
          >
                          loopCount = 0
                          for l in range(len(self.filesToProcess)):
                              threads = []
                              try:
          >
                                  threads.append(threading.Thread(target=self.processFiles(self.filesToProcess[loopCount+l])))
          >
                  Python lists index from 0... So this will be 0+0, first entry in the
          file list
          >
          >
          >
                                  threads.append(threading.Thread(target=self.processFiles(self.filesToProcess[loopCount+2])))
          >
                  This is 0+2, THIRD entry in the file list -- you've just skipped
          over the second entry...
          >
                                  threads.append(threading.Thread(target=self.processFiles(self.filesToProcess[loopCount+3])))
          >
                                  threads.append(threading.Thread(target=self.processFiles(self.filesToProcess[loopCount+4])))
          >
                                  threads.append(threading.Thread(target=self.processFiles(self.filesToProcess[loopCount+5])))
          >
                  Very ugly... Also going to fail for other reasons... Consider:
          >
          filestoprocess = [ 'file1', 'file2', 'file3' ]
          for jnk in range(len(filestoprocess)):  #this will loop three times!
                                                  #jnk = 0, 1, 2
          >
                  You proceed to create FIVE threads (or try to) when there are only
          THREE files... It will fail as soon as it tries loopCount+3 (fourth
          entry in a three element list)
          >
                                  msg = "Processing file....\n"
                                  for thread in threads:
                                      wx.CallAfter(self.textctrl03.write(msg), thread.start())
          >
                  Is this running as the main controller of some GUI? If so....
          >
                                  for thread in threads:
                                      thread.join()
          >
                  Your GUI will essentially freeze since it can't process events
          (including screen updates) until the entire function you are in returns
          to the event handler... But .join() blocks until the specified thread
          really finishes...
          >
                                  loopCount += 5
                              except IndexError:
                                  pass
          >
                  BAD style -- if you are going to trap an exception, you should do
          something with it... But then, the only reason you would GET this
          exception is because the preceding code is looping too many times
          relative to the number of files...
          >
                  As shown, with three files, you will create the first thread (0) for
          first file, skip the second file creating the second thread (1) for the
          third file, and raise an exception on trying to create the third thread
          (2) when you try to access a fourth file in the list.  The exception
          will be raised -- SKIPPING over the thread.start() calls, and skipping
          the thread.join() calls. You then ignore the error, and go back to the
          start of the loop where the index is now "1"... AND reset the thread
          list, so threads 0&1 are forgotten, never started, never joined, garbage
          collected...
          >
                  Again, you now create a thread (0) giving it the second file (since
          loopCount was never incremented, and the first thread is using loopCount
          + <loopindex>), create thread (1) giving it the third file, raise the
          exception... repeat
          >
          >
          >
          It works, and it works well.  It starts five threads, and processes
          five files at a time.  (In the "self.processFiles" I read the whole
          file into memory using readlines(), which works well.)
          >
                  It only works as long as loopCount+5 is less than the number of
          files in the list... AND at that, it skips one file and double processes
          another...
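One way to avoid both the skipped entry and the IndexError described above is to slice the list into batches rather than computing each index by hand. This is only a minimal sketch; the `batches` helper and the file names are illustrative and not part of the posted code.

```python
def batches(items, size=5):
    """Yield consecutive slices of at most `size` items.

    Slicing never raises IndexError, so a short final batch is handled
    without any try/except.
    """
    for start in range(0, len(items), size):
        yield items[start:start + size]


files_to_process = ["file1", "file2", "file3", "file4",
                    "file5", "file6", "file7"]

for batch in batches(files_to_process):
    # Each file appears in exactly one batch, in the original order.
    print(batch)
```

With seven names and a batch size of five, the loop body runs twice: once with the first five names and once with the remaining two.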
          >
          Of course, now the wx.CallAfter function doesn't work... I get
          "TypeError: 'NoneType' object is not callable" for every time it is
          run...
          >
                  Probably because it wants you to supply it with one or two
          /callable/ functions... but you are actually calling the functions and
          passing it the results of the called functions (and they aren't
          returning anything -- None).
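The difference between passing a callable and passing the result of a call can be shown without any GUI code. In the sketch below, `work` and `call_later` are hypothetical; `call_later` is only a stand-in for a deferred-call API that takes a callable plus its arguments, not the wxPython implementation.

```python
import threading

caller = threading.current_thread()
ran_in = {}   # records which thread each job actually ran in


def work(name):
    here = threading.current_thread()
    ran_in[name] = "caller" if here is caller else "worker"


# Mistake: work("a") executes immediately in the calling thread, and its
# return value (None) becomes the thread's target, so the thread does nothing.
wrong = threading.Thread(target=work("a"))
wrong.start()
wrong.join()

# Fix: pass the function itself and supply its arguments separately.
right = threading.Thread(target=work, args=("b",))
right.start()
right.join()


def call_later(func, *args):
    """Hypothetical stand-in for an API that expects a callable."""
    return func(*args)


log = []
try:
    # log.append("x") runs now and returns None, so call_later receives None.
    call_later(log.append("x"))
except TypeError as exc:
    error = str(exc)

# Correct form: the callable first, then its arguments.
call_later(log.append, "y")
```

Here `ran_in` shows that job "a" ran in the calling thread and only job "b" ran in a worker thread, and `error` holds the same "'NoneType' object is not callable" message reported above.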
          >
                  Ignoring GUI stuff... here is a simple one-job threadpool algorithm
          -- you have to plug in the file list and the actual processing work. It
          creates n threads, and those threads pull the work off of a common
          queue; the main program only has to fill the queue with the work to be
          done, and stuff a sentinel value onto the queue when it wants the
          threads to die -- which would be before shutdown of the program (create
          the pool at start-up, leave the threads blocked on the .get() until you
          need one to process)...
          >
          -=-=-=-=-=-=-=-
          #
          #       Example code for a pooled thread file processor
          #       NOT EXECUTABLE as is -- there is no code to obtain
          #       the list of files to be processed; and the processor
          #       just sleeps...
          #       (Python 2 syntax: Queue module, print statement)

          import threading
          import Queue
          import time         #just for demo sleep

          NUMTHREADS = 5
          SENTINEL = object()

          workQueue = Queue.Queue()

          def fileProc():         #function that handles processing of the files
              while True:
                  fname = workQueue.get()
                  if fname is SENTINEL:
                      workQueue.put(SENTINEL)    #recycle sentinel for next thread
                      break
                  print "Processing %s" % fname
                  time.sleep(3)   #replace with real file processing

          threadList = []
          for ti in range(NUMTHREADS):    #create worker threads
              t = threading.Thread(target=fileProc)
              t.start()
              threadList.append(t)

          for fn in listOfFiles:  #queue up the file names to be worked
              workQueue.put(fn)   #need to expand to include how names are
                                  #obtained

          workQueue.put(SENTINEL) #signal that no more files are to be worked

          for t in threadList:
              t.join()            #wait for each thread to exit (ensures main
                                  #doesn't exit before all threads finish
                                  #processing)
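For readers on Python 3, the same queue-and-sentinel pool might be sketched as follows. This is an adaptation under stated assumptions, not the code as posted: `run_pool`, `worker`, and `fake_process` are hypothetical names, the processing function is passed in as a parameter, and results are collected in a dictionary guarded by a lock.

```python
import queue
import threading

NUM_THREADS = 5
SENTINEL = object()   # unique marker meaning "no more work"


def worker(work_queue, process, results, lock):
    """Pull items off the shared queue until the sentinel appears."""
    while True:
        item = work_queue.get()
        if item is SENTINEL:
            work_queue.put(SENTINEL)   # recycle it so the next worker sees it
            break
        outcome = process(item)
        with lock:                     # protect the shared results dict
            results[item] = outcome


def run_pool(items, process, num_threads=NUM_THREADS):
    """Process every item with a fixed pool of threads; return the results."""
    work_queue = queue.Queue()
    results = {}
    lock = threading.Lock()
    threads = [
        threading.Thread(target=worker,
                         args=(work_queue, process, results, lock))
        for _ in range(num_threads)
    ]
    for t in threads:
        t.start()
    for item in items:
        work_queue.put(item)
    work_queue.put(SENTINEL)           # signal that no more work is coming
    for t in threads:
        t.join()                       # wait for every worker to exit
    return results


def fake_process(name):
    """Hypothetical placeholder for the real per-file processing."""
    return name.upper()


results = run_pool(["a.txt", "b.txt", "c.txt"], fake_process)
print(results)
```

Note that `target=worker` passes the function itself, with its arguments supplied separately through `args`, so the work really runs inside the worker threads.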
          >
          --
                  Wulfraed        Dennis Lee Bieber               KD6MOG
                  wlfr...@ix.netcom.com              wulfr...@bestiaria.com
                          HTTP://wlfraed.home.netcom.com/
                  (Bestiaria Support Staff:               web-a...@bestiaria.com)
                          HTTP://www.bestiaria.com/

          Thanks for the information! I can definitely see what you're talking
          about, and the Exception is only "pass" right now while I am working
          on the code.

          However, it does process every file (it doesn't skip the second one),
          and I'm guessing that this is because it loops so many times? I guess
          that means I am successful in spite of myself! ;-) (This wouldn't be
          the first time... ;-) )

          I REALLY appreciate your insights!!
