How to remove subset from a file efficiently?

**Raymond Hettinger** · Jan 12 '06, 05:45 PM

Re: How to remove subset from a file efficiently?

[fynali][color=blue]
> I have two files:
>
> - PSP0000320.dat (quite a large list of mobile numbers),
> - CBR0000319.dat (a subset of the above, a list of barred bumbers)[/color]

# print all non-barred mobile phone numbers
barred = set(open('CBR00 00319.dat'))
for num in open('PSP000032 0.dat'):
if num not in barred:
print num,

Raymond

**AJL** · Jan 12 '06, 08:25 PM

Re: How to remove subset from a file efficiently?

On 12 Jan 2006 09:04:21 -0800
"fynali" <iladijas@gmail .com> wrote:
[color=blue]
> Hi all,
>
> I have two files:
>
> - PSP0000320.dat (quite a large list of mobile numbers),
> - CBR0000319.dat (a subset of the above, a list of barred bumbers)
>[/color]
....
[color=blue]
> Objective: to remove the numbers present in barred-list from the
> PSPfile.[/color]

How fast does this run?

a = set(file('PSP00 00320.dat'))
b = set(file('CBR00 00319.dat'))
file('PSP-CBR.dat', 'w').writelines (a.difference(b ))

AJL

**Mike Meyer** · Jan 12 '06, 11:05 PM

Re: How to remove subset from a file efficiently?

"fynali" <iladijas@gmail .com> writes:
[color=blue]
> Hi all,
>
> I have two files:[/color]

Others have pointed out the Python solution - use a set instead of a
list for membership testing. I want to point out a better Unix
solution ('cause I probably wouldn't have written a Python program to
do this):
[color=blue]
> Objective: to remove the numbers present in barred-list from the
> PSPfile.
>
> $ ls -lh PSP0000320.dat CBR0000319.dat
> ... 56M Dec 28 19:41 PSP0000320.dat
> ... 8.6M Dec 28 19:40 CBR0000319.dat
>
> $ wc -l PSP0000320.dat CBR0000319.dat
> 4,462,603 PSP0000320.dat
> 693,585 CBR0000319.dat
>
> I wrote the following in bash to do the same:
>
> #!/bin/bash
>
> ARGS=2
>
> if [ $# -ne $ARGS ] # takes two arguments
> then
> echo; echo "Usage: `basename $0` {PSPfile} {CBRfile}"
> echo; echo " eg.: `basename $0` PSP0000320.dat
> CBR0000319.dat" ; echo;
> echo "NOTE: first argument: PSP file, second: CBR file";
> echo " this script _does_ no_ input validation!"
> exit 1
> fi;
>
> # fix prefix; cost: 12.587 secs
> cat $1 | sed -e 's/^0*/966/' > $1.good
> cat $2 | sed -e 's/^0*/966/' > $2.good
>
> # sort/save files; for the 4,462,603 lines, cost: 36.589 secs
> sort $1.good > $1.sorted
> sort $2.good > $2.sorted
>
> # diff -y {PSP} {CBR}, grab the ones in PSPfile; cost: 31.817 secs
> diff -y $1.sorted $2.sorted | grep "<" > $1.filtered
>
> # remove trailing junk [spaces & <]; cost: 1 min 3 secs
> cat $1.filtered | sed -e 's/$[0-9]*$ *</\1/' > $1.cleaned
>
> # remove intermediate files, good, sorted, filtered
> rm -f *.good *.sorted *.filtered
> #:~
>
> ...but strangely though, there's a discrepancy, the reason for which I
> can't figure out![/color]

The above script can be shortened quite a bit:

#!/bin/sh

comm -23 <(sed 's/^0*/966/' $1 | sort) <(sed 's/^0*/966/ $2 | sort)

Will output only lines that occur in $1. It also runs the seds and
sorts in parallel, which can make a significant difference in the
clock time it takes to get the job done.

The Python version is probably faster, since it doesn't sort the
data.

<mike
--
Mike Meyer <mwm@mired.or g> http://www.mired.org/home/mwm/
Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.

**Christopher Weimann** · Jan 13 '06, 01:35 AM

Re: How to remove subset from a file efficiently?

On 01/12/2006-09:04AM, fynali wrote:[color=blue]
>
> - PSP0000320.dat (quite a large list of mobile numbers),
> - CBR0000319.dat (a subset of the above, a list of barred bumbers)
>[/color]

fgrep -x -v -f CBR0000319.dat PSP0000320.dat > PSP-CBR.dat

**Raymond Hettinger** · Jan 13 '06, 06:45 AM

Re: How to remove subset from a file efficiently?

AJL wrote:[color=blue]
> How fast does this run?
>
> a = set(file('PSP00 00320.dat'))
> b = set(file('CBR00 00319.dat'))
> file('PSP-CBR.dat', 'w').writelines (a.difference(b ))[/color]

Turning PSP into a set takes extra time, consumes unnecessary memory,
eliminates duplicates (possibly a bad thing), and loses the original
input ordering (probably a bad thing).

To jam the action into a couple lines, try this:

b = set(file('CBR00 00319.dat'))
file('PSP-CBR.dat','w').w ritelines(itert ools.ifilterfal se(b.__contains __,file('PSP000 0320.dat')))

Raymond

**AJL** · Jan 13 '06, 04:15 PM

Re: How to remove subset from a file efficiently?

On 12 Jan 2006 22:29:22 -0800
"Raymond Hettinger" <python@rcn.com > wrote:
[color=blue]
> AJL wrote:[color=green]
> > How fast does this run?
> >
> > a = set(file('PSP00 00320.dat'))
> > b = set(file('CBR00 00319.dat'))
> > file('PSP-CBR.dat', 'w').writelines (a.difference(b ))[/color]
>
> Turning PSP into a set takes extra time, consumes unnecessary memory,
> eliminates duplicates (possibly a bad thing), and loses the original
> input ordering (probably a bad thing).
>
> To jam the action into a couple lines, try this:
>
> b = set(file('CBR00 00319.dat'))
> file('PSP-CBR.dat','w').w ritelines(itert ools.ifilterfal se(b.__contains __,file('PSP000 0320.dat')))
>
> Raymond
>[/color]

The OP said "assume machine has plenty memory". ;)

I saw some solutions that used sets and was wondering why they stopped
at using a set for the first file and not the second when the problem is
really a set problem but I can see the reasoning behind it now.

AJL

**fynali** · Jan 14 '06, 06:25 AM

Re: How to remove subset from a file efficiently?

$ time fgrep -x -v -f CBR0000333 PSP0000333 > PSP-CBR.dat.fgrep

real 0m31.551s
user 0m16.841s
sys 0m0.912s

--
$ time ./cleanup.py

real 0m6.080s
user 0m4.836s
sys 0m0.408s

--
$ wc -l PSP-CBR.dat.fgrep PSP-CBR.dat.python
3872421 PSP-CBR.dat.fgrep
3872421 PSP-CBR.dat.python

Fantastic, at any rate the time is down from my initial ~4 min.!

Thank you Chris. The fgrep approach is clean and to the point; and one
more reason to love the *nix approach to handling everyday problems.

Fredrik's set|dict approach in Python above gives me one more reason to
love Python. And it is indeed fast, 5x!

Thank you all for all your help.

--
fynali

**fynali** · Jan 14 '06, 07:15 AM

Re: How to remove subset from a file efficiently?

$ cat cleanup_ray.py
#!/usr/bin/python
import itertools

b = set(file('/home/sajid/python/wip/stc/2/CBR0000333'))

file('PSP-CBR.dat,ray','w ').writelines(i tertools.ifilte rfalse(b.__cont ains__,file('/home/sajid/python/wip/stc/2/PSP0000333')))

--
$ time ./cleanup_ray.py

real 0m5.451s
user 0m4.496s
sys 0m0.428s

(-: Damn! That saves a bit more time! Bravo!

Thanks to you Raymond.

**bonono@gmail.com** · Jan 14 '06, 07:25 AM

Re: How to remove subset from a file efficiently?

fynali wrote:[color=blue]
> $ cat cleanup_ray.py
> #!/usr/bin/python
> import itertools
>
> b = set(file('/home/sajid/python/wip/stc/2/CBR0000333'))
>
> file('PSP-CBR.dat,ray','w ').writelines(i tertools.ifilte rfalse(b.__cont ains__,file('/home/sajid/python/wip/stc/2/PSP0000333')))
>
> --
> $ time ./cleanup_ray.py
>
> real 0m5.451s
> user 0m4.496s
> sys 0m0.428s
>
> (-: Damn! That saves a bit more time! Bravo!
>
> Thanks to you Raymond.[/color]
Have you tried the explicit loop variant with psyco ? My experience is
that psyco is pretty good at optimizing for loop which usually results
in faster code than even built-in map/filter variant.

Though it would just be 1 or 2 sec difference(give n what you already
have) so may not be important but could be fun.

**fynali** · Jan 14 '06, 07:35 AM

Re: How to remove subset from a file efficiently?

--
$ ./cleanup.py
Traceback (most recent call last):
File "./cleanup.py", line 3, in ?
import itertools
ImportError: No module named itertools

--
$ time ./cleanup.py
File "./cleanup.py", line 8
outfile.writeli nes(number for number in postpaid_file if number
not in barred)
^
SyntaxError: invalid syntax

The earlier results I posted were run on my workstation which has
Python 2.4.1,

$ uname -a && python -V
Linux sajid 2.6.13-15.7-smp #1 SMP
Tue Nov 29 14:32:29 UTC 2005 i686 i686 i386 GNU/Linux
Python 2.4.1

but the server on which the actual processing will be done has an older
version )-:

$ uname -a && python -V
Linux cactus 2.4.21-20.ELsmp #1 SMP
Wed Aug 18 20:46:40 EDT 2004 i686 i686 i386 GNU/Linux
Python 2.2.3

Is a rewrite possible of Raymond's or Fredrik's suggestions above which
will still give me the time saving made?

--
fynali

**fynali** · Jan 14 '06, 07:45 AM

Re: How to remove subset from a file efficiently?

[bonono][color=blue]
> Have you tried the explicit loop variant with psyco ?[/color]

Sure I wouldn't mind trying; can you suggest some code snippets along
the lines of which I should try...?

[fynali][color=blue]
> Needless to say, I'm utterly new to python and my programming
> skills & know-how are rudimentary.[/color]

(-:

--
fynali

**Fredrik Lundh** · Jan 14 '06, 08:05 AM

Re: How to remove subset from a file efficiently?

"fynali" wrote:
[color=blue]
> Is a rewrite possible of Raymond's or Fredrik's suggestions above which
> will still give me the time saving made?[/color]

Python 2.2 don't have a readymade set type (new in 2.3), and it doesn't
support generator expressions (the thing that caused the syntax error).

however, using a dictionary instead of the set

barred = {}
for number in open(open('/home/sjd/python/wip/CBR0000319.dat' )):
barred[number] = None # just add it as a key

and a list comprehension instead of the generator expression

outfile.writeli nes([number for number in infile if number not in barred])

(note the extra brackets)

should give you decent performance under 2.2.

</F>

**bonono@gmail.com** · Jan 14 '06, 08:05 AM

Re: How to remove subset from a file efficiently?

fynali wrote:[color=blue]
> [bonono][color=green]
> > Have you tried the explicit loop variant with psyco ?[/color]
>
> Sure I wouldn't mind trying; can you suggest some code snippets along
> the lines of which I should try...?
>
> [fynali][color=green]
> > Needless to say, I'm utterly new to python and my programming
> > skills & know-how are rudimentary.[/color]
>
> (-:
>
> --
> fynali[/color]

just add :

import psyco
psyco.full()

to the beginning of the code from Fredrik for 2.2(the one using dict
and list comprehension). You system may not have the psyco module
though.

**fynali** · Jan 14 '06, 08:55 AM

Re: How to remove subset from a file efficiently?

$ cat cleanup.py
#!/usr/bin/python

postpaid_file = open('/home/oracle/stc/test/PSP0000333')
outfile = open('/home/oracle/stc/test/PSP-CBR.dat', 'w')

barred = {}

for number in open('/home/oracle/stc/test/CBR0000333'):
barred[number] = None # just add it as a key

outfile.writeli nes([number for number in postpaid_file if number
not in barred])

postpaid_file.c lose(); outfile.close()

--
$ time ./cleanup.py

real 0m31.007s
user 0m24.660s
sys 0m3.550s

Can we say that using generators & newer Python _is_ faster?

--
fynali

How to remove subset from a file efficiently?

How to remove subset from a file efficiently?

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment