Parallelising code

  • psaffrey@googlemail.com

    Parallelising code

    I have some file processing code that has to deal with quite a lot of
    data. I have a quad core machine, so I wondered whether I could take
    advantage of some parallelism.

    Essentially, I have a number of CSV files, let's say 100, each
    containing about 8000 data points. For each point, I need to look up
    some other data structures (generated in advance) and append the point
    to a relevant list. I wondered whether I could get each core to handle
    a few files each. I have a few questions:

    - Am I actually going to get any speed up from parallelism, or is it
    likely that most of my processing time is spent reading files? I guess
    I can profile for this?

    - Is list.append() thread-safe? (Not sure if that's the right term.)
    What I mean is: can two separate processors file a point in the same
    list at the same time without anything horrible happening? Do I need
    to do anything special (a mutex or whatever) to make this happen, or
    will it happen automatically?
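    In code, the pattern I have in mind for the second question looks
    roughly like this (a minimal sketch; the shared list and the explicit
    lock are illustrative):

```python
import threading

points = []                  # shared list of data points
lock = threading.Lock()

def file_point(value):
    # In CPython, list.append is atomic under the GIL, but an explicit
    # mutex makes the intent clear regardless of the interpreter.
    with lock:
        points.append(value)

# One thread per point, all appending to the same list.
threads = [threading.Thread(target=file_point, args=(i,)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```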

    Thanks in advance for any guidance,

    Peter
  • Dan Upton

    #2
    Re: Parallelising code

    > Essentially, I have a number of CSV files, let's say 100, each
    > containing about 8000 data points. For each point, I need to look up
    > some other data structures (generated in advance) and append the point
    > to a relevant list. I wondered whether I could get each core to handle
    > a few files each. I have a few questions:
    >
    > - Am I actually going to get any speed up from parallelism, or is it
    > likely that most of my processing time is spent reading files? I guess
    > I can profile for this?

    That probably depends both on how much data is involved in a "data
    point" (i.e., is it just one value, or are you parsing several fields
    from the CSV per record) and on how much processing each point
    involves. Profiling should enlighten you, yes. You may also run into
    I/O contention if you have lots of threads trying to read from disk at
    once, although I'm not sure how much of an impact that will have.

    > - Is list.append() thread safe? (not sure if this is the right term)
    > what I mean is, can two separate processors file a point in the same
    > list at the same time without anything horrible happening? Do I need
    > to do anything special (mutex or whatever) to make this happen, or
    > will it happen automatically?
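    One way to check where the time actually goes (a minimal sketch;
    load_points is a stand-in for the real per-file parsing):

```python
import cProfile
import csv
import io
import pstats

def load_points(text):
    # Hypothetical stand-in for parsing one CSV file of data points.
    return [row for row in csv.reader(io.StringIO(text))]

# Fake file contents: 1000 two-field records.
data = "\n".join("%d,%d" % (i, i * 2) for i in range(1000))

profiler = cProfile.Profile()
profiler.enable()
points = load_points(data)
profiler.disable()

# Sorting by cumulative time shows whether reading/parsing or the
# per-point work dominates; stats.print_stats(10) lists the top calls.
stats = pstats.Stats(profiler).sort_stats("cumulative")
```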




    • Larry Bates

      #3
      Re: Parallelising code

      psaffrey@googlemail.com wrote:
      > I have some file processing code that has to deal with quite a lot of
      > data. I have a quad core machine, so I wondered whether I could take
      > advantage of some parallelism.
      >
      > Essentially, I have a number of CSV files, let's say 100, each
      > containing about 8000 data points. For each point, I need to look up
      > some other data structures (generated in advance) and append the point
      > to a relevant list. I wondered whether I could get each core to handle
      > a few files each. I have a few questions:
      >
      > - Am I actually going to get any speed up from parallelism, or is it
      > likely that most of my processing time is spent reading files? I guess
      > I can profile for this?
      >
      > - Is list.append() thread safe? (not sure if this is the right term)
      > what I mean is, can two separate processors file a point in the same
      > list at the same time without anything horrible happening? Do I need
      > to do anything special (mutex or whatever) to make this happen, or
      > will it happen automatically?
      >
      > Thanks in advance for any guidance,
      >
      > Peter

      Put the data into a database first to see if it is actually too slow.
      If it is, take a look at an in-memory database, or perhaps something
      as simple as memcached could help.
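      For instance, a throwaway in-memory SQLite database (a minimal
      sketch; the table layout is illustrative):

```python
import sqlite3

# In-memory database as a first pass before reaching for memcached etc.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE points (fname TEXT, x REAL, y REAL)")

# Illustrative rows standing in for parsed CSV data points.
rows = [("data1.csv", 1.0, 2.0),
        ("data1.csv", 3.0, 4.0),
        ("data2.csv", 5.0, 6.0)]
conn.executemany("INSERT INTO points VALUES (?, ?, ?)", rows)

count = conn.execute("SELECT COUNT(*) FROM points").fetchone()[0]
```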

      -Larry


      • Mathieu Prevot

        #4
        Re: Parallelising code

        2008/9/15 psaffrey@googlemail.com <psaffrey@googlemail.com>:
        > I have some file processing code that has to deal with quite a lot of
        > data. I have a quad core machine, so I wondered whether I could take
        > advantage of some parallelism.
        >
        > Essentially, I have a number of CSV files, let's say 100, each
        > containing about 8000 data points. For each point, I need to look up
        > some other data structures (generated in advance) and append the point
        > to a relevant list. I wondered whether I could get each core to handle
        > a few files each. I have a few questions:
        >
        > - Am I actually going to get any speed up from parallelism, or is it
        > likely that most of my processing time is spent reading files? I guess
        > I can profile for this?
        >
        > - Is list.append() thread safe? (not sure if this is the right term)
        > what I mean is, can two separate processors file a point in the same
        > list at the same time without anything horrible happening? Do I need
        > to do anything special (mutex or whatever) to make this happen, or
        > will it happen automatically?

        You won't take advantage of your cores with a pure, single Python
        script: Python threads are useful for UIs, file operations and the
        like, but not for concurrent CPU-bound processing. The simplest way
        to do concurrent processing is to use Popen from subprocess, which
        creates new processes.

        Notice that you can call Python scripts from another one, e.g. one
        manager and as many workers as you want. IMO that is the simplest
        design and the least work for making concurrent processes.
        Ideally, make your workers not need to feed back anything more
        complex than a return value. Also, make sure they don't write to
        the same file; they can read the same file without problems.

        Remark that you can manage locks etc. from the manager script.

        I'm not sure Python semaphores allow interprocess communication
        the way C semaphores do [1]; check this. A workaround is to send
        tuples over stderr.
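        A minimal sketch of that manager/worker layout (the inline worker
        code here is a stand-in for a separate worker script):

```python
import subprocess
import sys

# Hypothetical list of CSV files; each worker handles one file.
files = ["data1.csv", "data2.csv", "data3.csv"]

# Stand-in worker: in practice this would be a separate script that
# parses the file named in sys.argv[1] and writes its own output file.
worker_code = "import sys; print('processed ' + sys.argv[1])"

# The manager launches one process per file (in practice, cap this at
# the number of cores) ...
procs = [subprocess.Popen([sys.executable, "-c", worker_code, name],
                          stdout=subprocess.PIPE)
         for name in files]

# ... then waits for each worker and collects its return value.
results = [p.communicate()[0].decode().strip() for p in procs]
```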

        HTH,
        Mathieu

        [1] Programming with POSIX Threads, David R. Butenhof, http://tinyurl.com/6hpkol


        • Michael Palmer

          #5
          Re: Parallelising code

          On Sep 15, 12:46 pm, "psaff...@googlemail.com"
          <psaff...@googlemail.com> wrote:
          > I have some file processing code that has to deal with quite a lot of
          > data. I have a quad core machine, so I wondered whether I could take
          > advantage of some parallelism.
          >
          > Essentially, I have a number of CSV files, let's say 100, each
          > containing about 8000 data points. For each point, I need to look up
          > some other data structures (generated in advance) and append the point
          > to a relevant list. I wondered whether I could get each core to handle
          > a few files each. I have a few questions:
          >
          > - Am I actually going to get any speed up from parallelism, or is it
          > likely that most of my processing time is spent reading files? I guess
          > I can profile for this?
          >
          > - Is list.append() thread safe? (not sure if this is the right term)
          > what I mean is, can two separate processors file a point in the same
          > list at the same time without anything horrible happening? Do I need
          > to do anything special (mutex or whatever) to make this happen, or
          > will it happen automatically?
          >
          > Thanks in advance for any guidance,
          >
          > Peter
          Look at http://pypi.python.org/pypi/processing
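          That package was later folded into the standard library as the
          multiprocessing module (Python 2.6+); a minimal sketch of farming
          the files out to a pool (process_file is a stand-in for the real
          per-file work):

```python
from multiprocessing import Pool

def process_file(name):
    # Stand-in for parsing one CSV file and building its point lists.
    return "done: " + name

if __name__ == "__main__":
    files = ["data1.csv", "data2.csv", "data3.csv"]
    # One worker per core on a quad-core machine; map distributes the
    # files across the workers and gathers the results in order.
    with Pool(4) as pool:
        results = pool.map(process_file, files)
```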


          • Paul Boddie

            #6
            Re: Parallelising code

            On 15 Sep, 18:46, "psaff...@googlemail.com" <psaff...@googlemail.com>
            wrote:
            > I have some file processing code that has to deal with quite a lot of
            > data. I have a quad core machine, so I wondered whether I could take
            > advantage of some parallelism.

            Take a look at this page for some solutions:



            In addition, Jython and IronPython provide the ability to use threads
            more effectively.

            > Essentially, I have a number of CSV files, let's say 100, each
            > containing about 8000 data points. For each point, I need to look up
            > some other data structures (generated in advance) and append the point
            > to a relevant list. I wondered whether I could get each core to handle
            > a few files each. I have a few questions:
            >
            > - Am I actually going to get any speed up from parallelism, or is it
            > likely that most of my processing time is spent reading files? I guess
            > I can profile for this?
            There are a few things to consider, and it is useful to see where most
            of the time is being spent. One interesting exercise called "Wide
            Finder 2", run by Tim Bray (see [1] for more details), investigated
            the benefits of log file processing using many concurrent processes,
            but it was often argued that the greatest speed-up over a naive serial
            implementation could be achieved by optimising the input and output
            and by choosing the right parsing strategy.

            Paul

            [1] http://www.tbray.org/ongoing/When/20.../Wide-Finder-2


            • psaffrey@googlemail.com

              #7
              Re: Parallelising code

              Many very helpful replies, which I will now mull over.

              Thanks,

              Peter
