Parallelising code

  • psaffrey@googlemail.com

    Parallelising code

    I have some file processing code that has to deal with quite a lot of
    data. I have a quad core machine, so I wondered whether I could take
    advantage of some parallelism.

    Essentially, I have a number of CSV files, let's say 100, each
    containing about 8000 data points. For each point, I need to look up
    some other data structures (generated in advance) and append the point
    to a relevant list. I wondered whether I could get each core to handle
    a few files each. I have a few questions:

    - Am I actually going to get any speed up from parallelism, or is it
    likely that most of my processing time is spent reading files? I guess
    I can profile for this?

    - Is list.append() thread-safe? (Not sure if that's the right term.)
    What I mean is: can two separate processors file a point in the same
    list at the same time without anything horrible happening? Do I need
    to do anything special (a mutex or whatever) to make this happen, or
    will it happen automatically?
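    In code, the pattern I have in mind for the second question looks
    roughly like this (a minimal sketch; the shared list and the explicit
    lock are illustrative):

```python
import threading

points = []                  # shared list of data points
lock = threading.Lock()

def file_point(value):
    # In CPython, list.append is atomic under the GIL, but an explicit
    # mutex makes the intent clear regardless of the interpreter.
    with lock:
        points.append(value)

# One thread per point, all appending to the same list.
threads = [threading.Thread(target=file_point, args=(i,)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```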

    Thanks in advance for any guidance,

    Peter
  • Dan Upton

    #2
    Re: Parallelising code

    > Essentially, I have a number of CSV files, let's say 100, each
    > containing about 8000 data points. For each point, I need to look up
    > some other data structures (generated in advance) and append the point
    > to a relevant list. I wondered whether I could get each core to handle
    > a few files each. I have a few questions:
    >
    > - Am I actually going to get any speed up from parallelism, or is it
    > likely that most of my processing time is spent reading files? I guess
    > I can profile for this?

    That probably depends both on how much data is involved in a "data
    point" (i.e., is it just one value, or are you parsing several fields
    from the CSV per record) and on how much processing each point
    involves. Profiling should enlighten you, yes. You may also run into
    I/O contention if you have lots of threads trying to read from disk at
    once, although I'm not sure how much of an impact that will have.

    > - Is list.append() thread safe? (not sure if this is the right term)
    > what I mean is, can two separate processors file a point in the same
    > list at the same time without anything horrible happening? Do I need
    > to do anything special (mutex or whatever) to make this happen, or
    > will it happen automatically?
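    One way to check where the time actually goes (a minimal sketch;
    load_points is a stand-in for the real per-file parsing):

```python
import cProfile
import csv
import io
import pstats

def load_points(text):
    # Hypothetical stand-in for parsing one CSV file of data points.
    return [row for row in csv.reader(io.StringIO(text))]

# Fake file contents: 1000 two-field records.
data = "\n".join("%d,%d" % (i, i * 2) for i in range(1000))

profiler = cProfile.Profile()
profiler.enable()
points = load_points(data)
profiler.disable()

# Sorting by cumulative time shows whether reading/parsing or the
# per-point work dominates; stats.print_stats(10) lists the top calls.
stats = pstats.Stats(profiler).sort_stats("cumulative")
```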




    • Larry Bates

      #3
      Re: Parallelising code

      psaffrey@googlemail.com wrote:
      > I have some file processing code that has to deal with quite a lot of
      > data. I have a quad core machine, so I wondered whether I could take
      > advantage of some parallelism.
      >
      > Essentially, I have a number of CSV files, let's say 100, each
      > containing about 8000 data points. For each point, I need to look up
      > some other data structures (generated in advance) and append the point
      > to a relevant list. I wondered whether I could get each core to handle
      > a few files each. I have a few questions:
      >
      > - Am I actually going to get any speed up from parallelism, or is it
      > likely that most of my processing time is spent reading files? I guess
      > I can profile for this?
      >
      > - Is list.append() thread safe? (not sure if this is the right term)
      > what I mean is, can two separate processors file a point in the same
      > list at the same time without anything horrible happening? Do I need
      > to do anything special (mutex or whatever) to make this happen, or
      > will it happen automatically?
      >
      > Thanks in advance for any guidance,
      >
      > Peter

      Put the data into a database first to see if it is actually too slow.
      If it is, take a look at an in-memory database, or perhaps something
      as simple as memcached could help.
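      For instance, a throwaway in-memory SQLite database (a minimal
      sketch; the table layout is illustrative):

```python
import sqlite3

# In-memory database as a first pass before reaching for memcached etc.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE points (fname TEXT, x REAL, y REAL)")

# Illustrative rows standing in for parsed CSV data points.
rows = [("data1.csv", 1.0, 2.0),
        ("data1.csv", 3.0, 4.0),
        ("data2.csv", 5.0, 6.0)]
conn.executemany("INSERT INTO points VALUES (?, ?, ?)", rows)

count = conn.execute("SELECT COUNT(*) FROM points").fetchone()[0]
```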

      -Larry


      • Mathieu Prevot

        #4
        Re: Parallelising code

        2008/9/15 psaffrey@googlemail.com <psaffrey@googlemail.com>:
        > I have some file processing code that has to deal with quite a lot of
        > data. I have a quad core machine, so I wondered whether I could take
        > advantage of some parallelism.
        >
        > Essentially, I have a number of CSV files, let's say 100, each
        > containing about 8000 data points. For each point, I need to look up
        > some other data structures (generated in advance) and append the point
        > to a relevant list. I wondered whether I could get each core to handle
        > a few files each. I have a few questions:
        >
        > - Am I actually going to get any speed up from parallelism, or is it
        > likely that most of my processing time is spent reading files? I guess
        > I can profile for this?
        >
        > - Is list.append() thread safe? (not sure if this is the right term)
        > what I mean is, can two separate processors file a point in the same
        > list at the same time without anything horrible happening? Do I need
        > to do anything special (mutex or whatever) to make this happen, or
        > will it happen automatically?

        You won't take advantage of your cores with a pure, single Python
        script: Python threads are useful for UIs, file operations and the
        like, but not for concurrent CPU-bound processing. The simplest way
        to do concurrent processing is to use Popen from subprocess, which
        creates new processes.

        Notice that you can call Python scripts from another one, e.g. one
        manager and as many workers as you want. IMO that is the simplest
        design and the least work for making concurrent processes.
        Ideally, make your workers not need to feed back anything more
        complex than a return value. Also, make sure they don't write to
        the same file; they can read the same file without problems.

        Remark that you can manage locks etc. from the manager script.

        I'm not sure Python semaphores allow interprocess communication
        the way C semaphores do [1]; check this. A workaround is to send
        tuples over stderr.
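        A minimal sketch of that manager/worker layout (the inline worker
        code here is a stand-in for a separate worker script):

```python
import subprocess
import sys

# Hypothetical list of CSV files; each worker handles one file.
files = ["data1.csv", "data2.csv", "data3.csv"]

# Stand-in worker: in practice this would be a separate script that
# parses the file named in sys.argv[1] and writes its own output file.
worker_code = "import sys; print('processed ' + sys.argv[1])"

# The manager launches one process per file (in practice, cap this at
# the number of cores) ...
procs = [subprocess.Popen([sys.executable, "-c", worker_code, name],
                          stdout=subprocess.PIPE)
         for name in files]

# ... then waits for each worker and collects its return value.
results = [p.communicate()[0].decode().strip() for p in procs]
```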

        HTH,
        Mathieu

        [1] Programming with POSIX Threads, David R. Butenhof, http://tinyurl.com/6hpkol


        • Michael Palmer

          #5
          Re: Parallelising code

          On Sep 15, 12:46 pm, "psaff...@googlemail.com"
          <psaff...@googlemail.com> wrote:
          > I have some file processing code that has to deal with quite a lot of
          > data. I have a quad core machine, so I wondered whether I could take
          > advantage of some parallelism.
          >
          > Essentially, I have a number of CSV files, let's say 100, each
          > containing about 8000 data points. For each point, I need to look up
          > some other data structures (generated in advance) and append the point
          > to a relevant list. I wondered whether I could get each core to handle
          > a few files each. I have a few questions:
          >
          > - Am I actually going to get any speed up from parallelism, or is it
          > likely that most of my processing time is spent reading files? I guess
          > I can profile for this?
          >
          > - Is list.append() thread safe? (not sure if this is the right term)
          > what I mean is, can two separate processors file a point in the same
          > list at the same time without anything horrible happening? Do I need
          > to do anything special (mutex or whatever) to make this happen, or
          > will it happen automatically?
          >
          > Thanks in advance for any guidance,
          >
          > Peter
          Look at http://pypi.python.org/pypi/processing
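          That package was later folded into the standard library as the
          multiprocessing module (Python 2.6+); a minimal sketch of farming
          the files out to a pool (process_file is a stand-in for the real
          per-file work):

```python
from multiprocessing import Pool

def process_file(name):
    # Stand-in for parsing one CSV file and building its point lists.
    return "done: " + name

if __name__ == "__main__":
    files = ["data1.csv", "data2.csv", "data3.csv"]
    # One worker per core on a quad-core machine; map distributes the
    # files across the workers and gathers the results in order.
    with Pool(4) as pool:
        results = pool.map(process_file, files)
```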


          • Paul Boddie

            #6
            Re: Parallelising code

            On 15 Sep, 18:46, "psaff...@googlemail.com" <psaff...@googlemail.com>
            wrote:
            > I have some file processing code that has to deal with quite a lot of
            > data. I have a quad core machine, so I wondered whether I could take
            > advantage of some parallelism.

            Take a look at this page for some solutions:



            In addition, Jython and IronPython provide the ability to use threads
            more effectively.

            > Essentially, I have a number of CSV files, let's say 100, each
            > containing about 8000 data points. For each point, I need to look up
            > some other data structures (generated in advance) and append the point
            > to a relevant list. I wondered whether I could get each core to handle
            > a few files each. I have a few questions:
            >
            > - Am I actually going to get any speed up from parallelism, or is it
            > likely that most of my processing time is spent reading files? I guess
            > I can profile for this?
            There are a few things to consider, and it is useful to see where most
            of the time is being spent. One interesting exercise called "Wide
            Finder 2", run by Tim Bray (see [1] for more details), investigated
            the benefits of log file processing using many concurrent processes,
            but it was often argued that the greatest speed-up over a naive serial
            implementation could be achieved by optimising the input and output
            and by choosing the right parsing strategy.

            Paul

            [1] http://www.tbray.org/ongoing/When/20.../Wide-Finder-2


            • psaffrey@googlemail.com

              #7
              Re: Parallelising code

              Many very helpful replies, which I will now mull over.

              Thanks,

              Peter
