parallel csv-file processing

  • Michel Albert

    parallel csv-file processing

    Currently I am faced with a large computation task which works on a
    huge CSV file. As a test I am working on a very small subset which
    already contains 2E6 records. The task itself allows the file to be
    split, however, as each computation only involves one line. The
    application performing the computation exists already, but it was
    never meant to run on such a big dataset.

    One thing that is clear is that it will take a while to compute all
    this, so a distributed approach is probably a good idea. There are a
    couple of options for this:

    Scenario A (file is split manually into smaller parts):
    1) Fire up an openmosix/kerrighed cluster and run one process for
    each file part.

    Scenario B (file is "split" using the application itself):
    2) Again with an openmosix/kerrighed cluster, but only one instance of
    the application is run, using parallelpython.
    3) Using parallelpython without a cluster, but running ppserver.py on
    each node (a minimal setup sketch follows below).
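    For option 3, a minimal setup could look like the following. This is
    only a sketch: the node names are made up, and ppserver.py is assumed
    to be running on each node already.

    import pp

    # connect to the ppserver.py instances on the worker nodes
    # (60000 is parallelpython's default port)
    ppservers = ("node1:60000", "node2:60000")
    job_server = pp.Server(ppservers=ppservers)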

    The second case looks most interesting as it is quite flexible. In
    that case, however, I would need to address subsets of the CSV file,
    and the default csv.reader class does not allow random access to the
    file (or jumping to a specific line).

    What would be the most efficient way to subset a CSV file? For
    example:

    f1 = job_server.submit(calc_scores, datafile[0:1000])
    f2 = job_server.submit(calc_scores, datafile[1000:2000])
    f3 = job_server.submit(calc_scores, datafile[2000:3000])
    ....

    and so on

    Obviously this won't work as you cannot access a slice of a csv-file.
    Would it be possible to subclass the csv.reader class in a way that
    you can somewhat efficiently access a slice? Jumping backwards is not
    really necessary, so it's not really random access.
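    Roughly what I imagine is something like this (untested; the class
    name and file name are made up):

    import csv
    import itertools

    class ForwardSliceReader:
        """Wraps a csv.reader so that rows[a:b] works, as long as
        slices are requested in increasing order."""
        def __init__(self, reader):
            self.reader = reader
            self.pos = 0
        def __getitem__(self, sl):
            if sl.start < self.pos:
                raise ValueError("can only slice forwards")
            skip = sl.start - self.pos       # rows to throw away
            count = sl.stop - sl.start       # rows to keep
            rows = list(itertools.islice(self.reader, skip, skip + count))
            self.pos = sl.stop
            return rows

    datafile = ForwardSliceReader(csv.reader(open('scores.csv', 'rb')))
    first = datafile[0:1000]      # records 0..999
    second = datafile[1000:2000]  # records 1000..1999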

    The obvious way is to do the following:

    buffer = []
    for line in reader:
        buffer.append(line)
        if len(buffer) == 1000:
            f = job_server.submit(calc_scores, buffer)
            buffer = []

    if buffer:  # don't submit an empty final chunk
        f = job_server.submit(calc_scores, buffer)

    but would this not kill my memory if I start loading bigger slices
    into the "buffer" variable?

  • Paul Rubin

    #2
    Re: parallel csv-file processing

    Michel Albert <exhuma@gmail.com> writes:

    > buffer = []
    > for line in reader:
    >     buffer.append(line)
    >     if len(buffer) == 1000:
    >         f = job_server.submit(calc_scores, buffer)
    >         buffer = []
    >
    > if buffer:  # don't submit an empty final chunk
    >     f = job_server.submit(calc_scores, buffer)
    >
    > but would this not kill my memory if I start loading bigger slices
    > into the "buffer" variable?
    Why not pass the disk offsets to the job server (untested):

    n = 1000
    for i, _ in enumerate(reader):
        if i % n == 0:
            # tell() lives on the underlying file object ("datafile"
            # above), not on the csv reader itself
            job_server.submit(calc_scores, datafile.tell(), n)

    The remote process seeks to the appropriate place and processes n
    lines starting from there.
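
    The receiving end could then look something like this (also
    untested; score_row stands in for the real per-record computation,
    and the filename is made up):

    import csv
    import itertools

    def score_row(row):
        return len(row)  # placeholder for the real computation

    def calc_scores(offset, n, filename='scores.csv'):
        # each worker opens its own handle, jumps to the byte offset
        # it was given and parses n records from there
        f = open(filename, 'rb')
        f.seek(offset)
        reader = csv.reader(f)
        return [score_row(row) for row in itertools.islice(reader, n)]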


    • Marc 'BlackJack' Rintsch

      #3
      Re: parallel csv-file processing

      On Fri, 09 Nov 2007 02:51:10 -0800, Michel Albert wrote:

      > Obviously this won't work as you cannot access a slice of a csv-file.
      > Would it be possible to subclass the csv.reader class in a way that
      > you can somewhat efficiently access a slice?

      An arbitrary slice? I guess not, as all the records before it must be
      read, because the lines are not all equally long.
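
      If one initial pass over the file is acceptable, you could of course
      record the byte offset of every line first; after that, any record
      is reachable with a single seek(). A rough sketch (the filename is
      made up):

      # note: this assumes one record per line, i.e. no quoted
      # fields containing embedded newlines
      index = []
      offset = 0
      for line in open('scores.csv', 'rb'):
          index.append(offset)   # byte offset where record i starts
          offset += len(line)

      datafile = open('scores.csv', 'rb')
      datafile.seek(index[2000])  # jump straight to record 2000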
      > The obvious way is to do the following:
      >
      > buffer = []
      > for line in reader:
      >     buffer.append(line)
      >     if len(buffer) == 1000:
      >         f = job_server.submit(calc_scores, buffer)
      >         buffer = []

      With `itertools.islice()` this can be written as:

      import itertools

      while True:
          buffer = list(itertools.islice(reader, 1000))
          if not buffer:
              break
          f = job_server.submit(calc_scores, buffer)
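
      This also keeps at most one 1000-record chunk alive at a time, so
      memory use stays bounded no matter how big the file gets.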


      • Paul Boddie

        #4
        Re: parallel csv-file processing

        On 9 Nov, 12:02, Paul Rubin <http://phr...@NOSPAM.invalid> wrote:
        >
        > Why not pass the disk offsets to the job server (untested):
        >
        > n = 1000
        > for i, _ in enumerate(reader):
        >     if i % n == 0:
        >         job_server.submit(calc_scores, datafile.tell(), n)
        >
        > The remote process seeks to the appropriate place and processes n
        > lines starting from there.
        This is similar to a lot of the smarter solutions for Tim Bray's
        "Wide Finder" problem, which is apparently in the same domain;
        that discussion covers a lot more than just parallel processing/
        programming, too.

        Paul

