File processor

  • Peter Morris

    File processor

    Hi all

    This is a bit vague I suppose :-) Tomorrow I need to write a service which
    monitors two folders for new files and performs tasks appropriately. Some
    of these tasks are not too intensive and some are. Here are some scenarios:

    Event: \Incoming\SomeFile.txt
    Action: Copy to a backup folder. Move it elsewhere

    Event: \Incoming\SomeFile.zip
    Action: Copy to a backup folder. Unzip a file within it elsewhere

    Event: \Outgoing\SomeFile.txt
    Action: Copy to a backup folder. Move it elsewhere

    Event: \Outgoing\SomeFile.xml
    Action: Parse the XML, generate a binary file, zip the binary file, backup
    the zip file, copy the zip elsewhere.


    In most of these cases the task is quick, in the final case the task could
    take up to a couple of minutes. I really need to look into this in great
    detail in the morning, but I am hoping to get a bit of a head-start :-)


    01: Is there a class for monitoring new files in a folder and triggering an
    event or something with the name of the new file?

    02: I expect that once the event triggers I will stuff the filename into a
    thread-safe queue. If I have a thread pool for the quick tasks and queue
    tasks to perform I presume the thread automatically sleeps again once the
    task is complete, is that right?


    Thanks

    Pete
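The design described above (a watcher event pushes the filename into a thread-safe queue, and a worker drains it) can be sketched as follows. This is Python for illustration only, since the thread is about .NET, and `on_file_created` is a hypothetical stand-in for the real watcher event:

```python
import queue
import threading

# Thread-safe queue shared by the watcher callback and the worker thread.
work_queue = queue.Queue()
results = []

def on_file_created(path):
    # Hypothetical stand-in for the watcher's "file created" event:
    # all it does is hand the filename to the worker via the queue.
    work_queue.put(path)

def worker():
    while True:
        path = work_queue.get()  # blocks ("sleeps") while the queue is empty
        if path is None:         # sentinel value shuts the worker down
            break
        results.append("processed " + path)

t = threading.Thread(target=worker)
t.start()
on_file_created(r"\Incoming\SomeFile.txt")
on_file_created(r"\Incoming\SomeFile.zip")
work_queue.put(None)
t.join()
```

Because a single worker drains a FIFO queue, the files are processed in the order the events arrive.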

  • Peter Duniho

    #2
    Re: File processor

    On Wed, 24 Sep 2008 15:22:11 -0700, Peter Morris
    <mrpmorrisNO@spamgmail.com> wrote:

    >[...]
    >01: Is there a class for monitoring new files in a folder and triggering
    >an event or something with the name of the new file?

    FileSystemWatcher

    >02: I expect that once the event triggers I will stuff the filename into
    >a thread-safe queue. If I have a thread pool for the quick tasks and
    >queue tasks to perform I presume the thread automatically sleeps again
    >once the task is complete, is that right?

    Which thread? The thread pool thread? Yes, if there are no more thread
    pool tasks queued, a thread pool thread will simply enter a wait state
    until a new task is queued.

    Note that if your tasks are not i/o bound, and you expect there to be a
    large number of them queued in a short time, using the built-in thread
    pool is probably not a great idea, as your tasks will all wind up fighting
    each other for the CPU, wasting lots of time in the process.
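The point above that an idle pool thread simply enters a wait state and is reused, rather than exiting, can be observed directly. A minimal Python sketch (the mechanism is analogous in the .NET thread pool): with a single-thread pool, two consecutive tasks are served by the same thread.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# A pool with a single thread: two tasks submitted one after the other are
# served by the same thread, which waits idle between them instead of exiting.
with ThreadPoolExecutor(max_workers=1) as pool:
    first = pool.submit(threading.get_ident).result()
    second = pool.submit(threading.get_ident).result()

assert first == second  # the idle pool thread was reused, not recreated
```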

    Pete


    • Peter Morris

      #3
      Re: File processor

      >FileSystemWatcher

      Shame there wasn't a way of receiving notifications after the file is
      created and the file handle closed. I had to write something to handle this
      situation.
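One common workaround, sketched here in Python as a hypothetical helper (not part of any watcher API), is to treat a file as ready only once its size has stopped changing; on Windows one can instead try to open the file exclusively and retry on failure:

```python
import os
import time

def wait_until_stable(path, interval=0.05, checks=3, timeout=5.0):
    # Heuristic: consider the file "finished" once its size has been
    # unchanged for several consecutive checks. Returns False on timeout.
    deadline = time.monotonic() + timeout
    last_size = -1
    stable = 0
    while time.monotonic() < deadline:
        size = os.path.getsize(path)
        stable = stable + 1 if size == last_size else 0
        if stable >= checks:
            return True
        last_size = size
        time.sleep(interval)
    return False

# Demo: a file that has already been fully written is reported as stable.
import tempfile
fd, path = tempfile.mkstemp()
os.write(fd, b"payload")
os.close(fd)
ready = wait_until_stable(path)
os.remove(path)
```

The interval and check counts are arbitrary; tune them to how your producers write files.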

      >Which thread? The thread pool thread? Yes, if there are no more thread
      >pool tasks queued, a thread pool thread will simply enter a wait state
      >until a new task is queued.
      I decided against the pool thread. I have tasks which are immediate, short,
      long in duration. I didn't want them all in the same thread pool because a
      few long tasks would hog it. I'm going to have 3 threads, each with their
      own queue, and give them jobs to do. Adding to the queue will resume a
      thread, running out of jobs will suspend it.
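The three-threads-with-three-queues design reads roughly like this in Python (illustrative sketch only; a blocking `get()` on an empty queue gives the suspend-on-idle, resume-on-enqueue behaviour described above):

```python
import queue
import threading

# One dedicated worker per job class, each draining its own queue.
# Queue.get() blocks while the queue is empty, so an idle worker is
# effectively suspended and resumes as soon as a job is enqueued.
queues = {name: queue.Queue() for name in ("immediate", "short", "long")}
done = []

def worker(name, q):
    while True:
        job = q.get()    # suspends this thread when its queue is empty
        if job is None:  # sentinel value shuts the worker down
            return
        done.append((name, job))  # list.append is thread-safe in CPython

threads = [threading.Thread(target=worker, args=(n, q)) for n, q in queues.items()]
for t in threads:
    t.start()

queues["immediate"].put("move SomeFile.txt")
queues["long"].put("parse SomeFile.xml")
for q in queues.values():
    q.put(None)
for t in threads:
    t.join()
```

With this split, a long-running job only ever delays other long jobs, never the immediate ones.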


      Thanks for the info!


      Pete


      • Peter Duniho

        #4
        Re: File processor

        On Thu, 25 Sep 2008 11:52:36 -0700, Peter Morris
        <mrpmorrisNO@spamgmail.com> wrote:

        >FileSystemWatcher
        >
        >Shame there wasn't a way of receiving notifications after the file is
        >created and the file handle closed. I had to write something to handle
        >this situation.

        Yes, FileSystemWatcher is not a panacea. That said, it does provide you
        with enough basic information that the manual labor is reduced (presumably
        that's what you had to do in your own situation).
        >Which thread? The thread pool thread? Yes, if there are no more
        >thread pool tasks queued, a thread pool thread will simply enter a wait
        >state until a new task is queued.
        >
        >I decided against the pool thread. I have tasks which are immediate,
        >short, long in duration. I didn't want them all in the same thread pool
        >because a few long tasks would hog it.
        The built-in thread pool allows thousands of threads. Not necessarily the
        best approach for performance, but if the tasks are not CPU-bound, it
        would probably be fine. Even if the tasks are CPU-bound, as long as you
        never have a huge number of them running simultaneously, it would probably
        be fine. A few long-running tasks would not block other tasks.
        >I'm going to have 3 threads, each with their own queue, and give them
        >jobs to do. Adding to the queue will resume a thread, running out of
        >jobs will suspend it.
        Note that the number of threads should probably also relate to the number
        of CPUs on the system, not just the types of jobs you have. And of
        course, it depends on whether the tasks are CPU-bound versus i/o-bound.

        You haven't posted those details, so it's hard to provide specifics. But
        so far, I haven't seen anything that would suggest that using the built-in
        thread pool wouldn't be appropriate here.

        Pete


        • Brian Rasmussen [C# MVP]

          #5
          Re: File processor

          I'm sorry to bother you, but I'm a little confused about this statement and
          I was hoping you could clarify it for me.

          "Peter Duniho" <NpOeStPeAdM@nnowslpianmk.com> wrote in message
          news:op.uh0fq3tn8jd0ej@petes-computer.local...

          >Note that if your tasks are not i/o bound, and you expect there to be a
          >large number of them queued in a short time, using the built-in thread
          >pool is probably not a great idea, as your tasks will all wind up fighting
          >each other for the CPU, wasting lots of time in the process.
          Are you saying that queuing a lot of CPU-bound tasks on the thread pool is
          a bad idea?

          That's not generally my understanding. Unless the tasks are long running,
          the thread pool is well suited for this kind of task, and it is designed to
          provide good performance based on the number of available CPUs. The fact of
          the matter is that if you have many CPU-bound tasks, they will compete for
          CPU time no matter what kind of threading strategy you use.

          --
          Regards,
          Brian Rasmussen [C# MVP]






          • Peter Duniho

            #6
            Re: File processor

            On Thu, 25 Sep 2008 22:05:18 -0700, Brian Rasmussen [C# MVP]
            <brian@kodehoved.dk> wrote:

            >I'm sorry to bother you, but I'm a little confused about this statement
            >and I was hoping you could clarify it for me.
            >
            >"Peter Duniho" <NpOeStPeAdM@nnowslpianmk.com> wrote in message
            >news:op.uh0fq3tn8jd0ej@petes-computer.local...
            >>Note that if your tasks are not i/o bound, and you expect there to be a
            >>large number of them queued in a short time, using the built-in thread
            >>pool is probably not a great idea, as your tasks will all wind up
            >>fighting each other for the CPU, wasting lots of time in the process.
            >
            >Are you saying that queuing a lot of CPU-bound tasks on the thread pool
            >is a bad idea?
            It certainly can be.
            >That's not generally my understanding.

            Your understanding may be wrong, or simply incomplete. I'm not sure which.

            >Unless the tasks are long running, the thread pool is well suited for
            >this kind of task, and it is designed to provide good performance based
            >on the number of available CPUs.
            No, not really. The thread pool doesn't do anything in particular to
            match active threads with the CPU count. If the tasks are so short-lived,
            and queued so infrequently that one just naturally has relatively few
            threads competing with each other for the CPU, then that's fine. There's
            probably no need to go to the extra effort to limit the number of active
            threads at once.

            But unless you can guarantee that the tasks are all short-lived _and_ that
            there are only a few running at any given time, it pays to be more careful.
            >The fact of the matter is that if you have many CPU-bound tasks, they will
            >compete for CPU time no matter what kind of threading strategy you use.
            Define "compete". The fact is, there's a good way to compete and a bad
            way.

            First, let's ignore the system threads and assume that you have _only_
            your interesting CPU-bound threads. Now, as long as you only have at most
            one of these active for each CPU, then _none_ of the threads need ever
            yield the CPU. But as soon as you have more of these threads than you
            have CPUs, at least some will have to be round-robin-ed by the thread
            scheduler (and in practice, all probably will be).

            Interrupting a thread is very costly. Not only is there the immediate
            cost of the context switch, in which all of the state for one thread is
            saved and all of the state for another thread is restored, there is a
            serious risk of completely blowing the CPU caches (pipelines, L1 and L2
            cache, jump prediction, etc.).

            If you limit the number of active threads to the number of CPUs you have,
            then you maximize the probability that any given thread can run for an
            extended period of time without interruption, which in turn significantly
            improves the overall throughput of that thread. Conversely, when you
            create a situation in which it's assured that you have more active threads
            than CPUs, you ensure that you inject non-productive CPU cycles and
            disrupt the caching mechanisms in the CPU, all of which can significantly
            hurt performance.

            Even when you have exactly as many threads as CPUs, there are of course
            other threads in the system that may need to run from time to time. It's
            not a perfect system. But, those other threads are generally not going to
            be CPU-bound, and thus aren't going to present the same kind of constant
            competition for the CPU that your own CPU-bound worker threads would.

            The issue of being CPU-bound is important. An i/o-bound thread will in
            fact spend a lot of time in a non-runnable state and thus won't compete
            for the CPU (as much). You can have an awful lot of i/o-bound threads
            sitting idle without any significant cost, and in fact having many
            i/o operations all pending is a way to take advantage of some
            inherent parallelism that exists elsewhere in the hardware (this could go
            either way though...in some cases, having multiple i/o operations pending
            allows a particular device to retrieve data most efficiently, as in the
            case of a hard disk, and in other cases having multiple i/o operations
            pending just causes contention for the i/o device, which would be as
            counter-productive as over-competing for the CPU).

            There's no single right way to do threading. It does depend on
            your specific task. But for CPU-bound tasks, it is _definitely_
            counter-productive to simply queue a large number of tasks and let the
            thread pool sort it out. You can get much more efficient throughput by
            making sure you never have more runnable threads than you have CPUs.
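The sizing rule in the last paragraph, never more runnable CPU-bound threads than CPUs, amounts to capping a pool at the CPU count. A Python sketch (note that in CPython the GIL prevents true thread-level CPU parallelism, so a real CPU-bound version would use processes; the sizing idea is the same):

```python
import os
from concurrent.futures import ThreadPoolExecutor

def busy(n):
    # Stand-in for a CPU-bound task.
    total = 0
    for i in range(n):
        total += i * i
    return total

# Cap the pool at the CPU count so there are never more runnable
# CPU-bound worker threads than CPUs; extra tasks wait in the queue
# instead of fighting for the scheduler.
workers = os.cpu_count() or 1
with ThreadPoolExecutor(max_workers=workers) as pool:
    results = list(pool.map(busy, [10_000] * 8))
```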

            Pete


            • Registered User

              #7
              Re: File processor

              On Thu, 25 Sep 2008 19:52:36 +0100, "Peter Morris"
              <mrpmorrisNO@SPAMgmail.com> wrote:

              >FileSystemWatcher
              >
              >Shame there wasn't a way of receiving notifications after the file is
              >created and the file handle closed. I had to write something to handle this
              >situation.
              >
              >
              >Which thread? The thread pool thread? Yes, if there are no more thread
              >pool tasks queued, a thread pool thread will simply enter a wait state
              >until a new task is queued.
              >
              >I decided against the pool thread. I have tasks which are immediate, short,
              >long in duration. I didn't want them all in the same thread pool because a
              >few long tasks would hog it. I'm going to have 3 threads, each with their
              >own queue, and give them jobs to do. Adding to the queue will resume a
              >thread, running out of jobs will suspend it.
              >
              You might consider giving each task its own thread pool at the
              outgoing end of each task queue. I did something kinda similar a few
              years back where a worker thread would read its request queue, perform
              the task, and finally write to a parallel response queue.
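The request-queue / response-queue arrangement described above can be sketched like this in Python (illustrative only; the queue names and the upper-casing "task" are made up):

```python
import queue
import threading

# A worker reads requests from one queue, performs the task, and writes
# the result to a parallel response queue.
requests = queue.Queue()
responses = queue.Queue()

def worker():
    while True:
        req = requests.get()
        if req is None:  # sentinel value shuts the worker down
            return
        responses.put(("done", req.upper()))  # the "task" is just upper-casing

t = threading.Thread(target=worker)
t.start()
requests.put("task-a")
requests.put("task-b")
requests.put(None)
t.join()
collected = [responses.get_nowait() for _ in range(2)]
```

The response queue decouples the consumer of results from the worker, just as the request queue decouples it from the producer.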

              regards
              A.G.


              • Brian Rasmussen [C# MVP]

                #8
                Re: File processor

                Thanks for the reply - please see my comments below
                >>Unless the tasks are long running, the thread pool is well suited for
                >>this kind of task, and it is designed to provide good performance based
                >>on the number of available CPUs.
                >>
                >No, not really. The thread pool doesn't do anything in particular to
                >match active threads with the CPU count. If the tasks are so short-lived,
                >and queued so infrequently that one just naturally has relatively few
                >threads competing with each other for the CPU, then that's fine. There's
                >probably no need to go to the extra effort to limit the number of active
                >threads at once.
                According to the documentation
                (http://msdn.microsoft.com/en-us/libr...hreadpool.aspx)
                that's not entirely correct. The documentation says, "The thread pool
                maintains a minimum number of idle threads. For worker threads, the default
                value of this minimum is the number of processors." In other words: the
                thread pool tries to avoid creating redundant threads, based on the number
                of CPUs. As you point out, having more threads than CPUs is wasteful.
                >>The fact of the matter is that if you have many CPU-bound tasks, they will
                >>compete for CPU time no matter what kind of threading strategy you use.
                >>
                >Define "compete". The fact is, there's a good way to compete and a bad
                >way.
                By competing I mean that the scheduler will switch between all runnable
                threads with the highest priority. As the switching is expensive, it should
                be minimized.

                Anyway, I'm aware of all the stuff you go through about CPU threads vs. I/O
                threads and as far as I can tell, we have the same understanding of those
                issues. Given that, I'm confused that you end your post with the following:
                >There's no single right way to do threading. It does depend on
                >your specific task. But for CPU-bound tasks, it is _definitely_
                >counter-productive to simply queue a large number of tasks and let the
                >thread pool sort it out. You can get much more efficient throughput by
                >making sure you never have more runnable threads than you have CPUs.
                I agree that threading is hard and I certainly won't claim to be a master in
                the field. However, I cannot see why you would gain an advantage by doing
                what you describe here.

                Assume we have 10 CPU-bound tasks (non-blocking and short running) and 2
                available CPUs. In this case the thread pool will schedule the tasks to run
                on 2 CPUs and not create additional threads, thereby reducing the cost of
                switching between threads. On the other hand, if you create 10 threads and
                let each of them run one of the tasks, you not only pay the price of
                creating additional threads, you will also end up with a lot of context
                switches, which is pure overhead (assuming, of course, that the tasks
                cannot be completed within a single time slice).

                If the goal is to complete all tasks as fast as possible, it seems to me
                that the thread pool offers a pretty good deal.

                --
                Regards,
                Brian Rasmussen [C# MVP]




