How to remove subset from a file efficiently?

Collapse
This topic is closed.
X
X
 
  • Time
  • Show
Clear All
new posts
  • fynali

    How to remove subset from a file efficiently?

    Hi all,

    I have two files:

    - PSP0000320.dat (quite a large list of mobile numbers),
    - CBR0000319.dat (a subset of the above, a list of barred bumbers)

    # head PSP0000320.dat CBR0000319.dat
    ==> PSP0000320.dat <==
    96653696338
    96653766996
    96654609431
    96654722608
    96654738074
    96655697044
    96655824738
    96656190117
    96656256762
    96656263751

    ==> CBR0000319.dat <==
    96651131135
    96651131135
    96651420412
    96651730095
    96652399117
    96652399142
    96652399142
    96652399142
    96652399160
    96652399271

    Objective: to remove the numbers present in barred-list from the
    PSPfile.

    $ ls -lh PSP0000320.dat CBR0000319.dat
    ... 56M Dec 28 19:41 PSP0000320.dat
    ... 8.6M Dec 28 19:40 CBR0000319.dat

    $ wc -l PSP0000320.dat CBR0000319.dat
    4,462,603 PSP0000320.dat
    693,585 CBR0000319.dat

    I wrote the following in python to do it:

    #: c01:rmcommon.py
    barredlist = open(r'/home/sjd/python/wip/CBR0000319.dat' , 'r')
    postlist = open(r'/home/sjd/python/wip/PSP0000320.dat' , 'r')
    outfile = open(r'/home/sjd/python/wip/PSP-CBR.dat', 'w')

    # reading it all in one go, so as to avoid frequent disk accesses
    (assume machine has plenty memory)
    barredlist.read ()
    postlist.read()

    #
    for number in postlist:
    if number in barrlist:
    pass
    else:
    outfile.write(n umber)

    barredlist.clos e(); postlist.close( ); outfile.close()
    #:~

    The above code simply takes too long to complete. If I were to do a
    diff -y PSP0000320.dat CBR0000319.dat, catch the '<' & clean it up with
    sed -e 's/\([0-9]*\) *</\1/' > PSP-CBR.dat it takes <4 minutes to
    complete.

    I wrote the following in bash to do the same:

    #!/bin/bash

    ARGS=2

    if [ $# -ne $ARGS ] # takes two arguments
    then
    echo; echo "Usage: `basename $0` {PSPfile} {CBRfile}"
    echo; echo " eg.: `basename $0` PSP0000320.dat
    CBR0000319.dat" ; echo;
    echo "NOTE: first argument: PSP file, second: CBR file";
    echo " this script _does_ no_ input validation!"
    exit 1
    fi;

    # fix prefix; cost: 12.587 secs
    cat $1 | sed -e 's/^0*/966/' > $1.good
    cat $2 | sed -e 's/^0*/966/' > $2.good

    # sort/save files; for the 4,462,603 lines, cost: 36.589 secs
    sort $1.good > $1.sorted
    sort $2.good > $2.sorted

    # diff -y {PSP} {CBR}, grab the ones in PSPfile; cost: 31.817 secs
    diff -y $1.sorted $2.sorted | grep "<" > $1.filtered

    # remove trailing junk [spaces & <]; cost: 1 min 3 secs
    cat $1.filtered | sed -e 's/\([0-9]*\) *</\1/' > $1.cleaned

    # remove intermediate files, good, sorted, filtered
    rm -f *.good *.sorted *.filtered
    #:~

    ....but strangely though, there's a discrepancy, the reason for which I
    can't figure out!

    Needless to say, I'm utterly new to python and my programming skills &
    know-how are rudimentary.

    Any help will be genuinely appreciated.

    --
    fynali

  • Raymond Hettinger

    #2
    Re: How to remove subset from a file efficiently?

    [fynali][color=blue]
    > I have two files:
    >
    > - PSP0000320.dat (quite a large list of mobile numbers),
    > - CBR0000319.dat (a subset of the above, a list of barred bumbers)[/color]

    # print all non-barred mobile phone numbers
    barred = set(open('CBR00 00319.dat'))
    for num in open('PSP000032 0.dat'):
    if num not in barred:
    print num,

    Raymond

    Comment

    • AJL

      #3
      Re: How to remove subset from a file efficiently?

      On 12 Jan 2006 09:04:21 -0800
      "fynali" <iladijas@gmail .com> wrote:
      [color=blue]
      > Hi all,
      >
      > I have two files:
      >
      > - PSP0000320.dat (quite a large list of mobile numbers),
      > - CBR0000319.dat (a subset of the above, a list of barred bumbers)
      >[/color]
      ....
      [color=blue]
      > Objective: to remove the numbers present in barred-list from the
      > PSPfile.[/color]

      How fast does this run?

      a = set(file('PSP00 00320.dat'))
      b = set(file('CBR00 00319.dat'))
      file('PSP-CBR.dat', 'w').writelines (a.difference(b ))

      AJL

      Comment

      • Mike Meyer

        #4
        Re: How to remove subset from a file efficiently?

        "fynali" <iladijas@gmail .com> writes:
        [color=blue]
        > Hi all,
        >
        > I have two files:[/color]

        Others have pointed out the Python solution - use a set instead of a
        list for membership testing. I want to point out a better Unix
        solution ('cause I probably wouldn't have written a Python program to
        do this):
        [color=blue]
        > Objective: to remove the numbers present in barred-list from the
        > PSPfile.
        >
        > $ ls -lh PSP0000320.dat CBR0000319.dat
        > ... 56M Dec 28 19:41 PSP0000320.dat
        > ... 8.6M Dec 28 19:40 CBR0000319.dat
        >
        > $ wc -l PSP0000320.dat CBR0000319.dat
        > 4,462,603 PSP0000320.dat
        > 693,585 CBR0000319.dat
        >
        > I wrote the following in bash to do the same:
        >
        > #!/bin/bash
        >
        > ARGS=2
        >
        > if [ $# -ne $ARGS ] # takes two arguments
        > then
        > echo; echo "Usage: `basename $0` {PSPfile} {CBRfile}"
        > echo; echo " eg.: `basename $0` PSP0000320.dat
        > CBR0000319.dat" ; echo;
        > echo "NOTE: first argument: PSP file, second: CBR file";
        > echo " this script _does_ no_ input validation!"
        > exit 1
        > fi;
        >
        > # fix prefix; cost: 12.587 secs
        > cat $1 | sed -e 's/^0*/966/' > $1.good
        > cat $2 | sed -e 's/^0*/966/' > $2.good
        >
        > # sort/save files; for the 4,462,603 lines, cost: 36.589 secs
        > sort $1.good > $1.sorted
        > sort $2.good > $2.sorted
        >
        > # diff -y {PSP} {CBR}, grab the ones in PSPfile; cost: 31.817 secs
        > diff -y $1.sorted $2.sorted | grep "<" > $1.filtered
        >
        > # remove trailing junk [spaces & <]; cost: 1 min 3 secs
        > cat $1.filtered | sed -e 's/\([0-9]*\) *</\1/' > $1.cleaned
        >
        > # remove intermediate files, good, sorted, filtered
        > rm -f *.good *.sorted *.filtered
        > #:~
        >
        > ...but strangely though, there's a discrepancy, the reason for which I
        > can't figure out![/color]

        The above script can be shortened quite a bit:

        #!/bin/sh

        comm -23 <(sed 's/^0*/966/' $1 | sort) <(sed 's/^0*/966/ $2 | sort)

        Will output only lines that occur in $1. It also runs the seds and
        sorts in parallel, which can make a significant difference in the
        clock time it takes to get the job done.

        The Python version is probably faster, since it doesn't sort the
        data.

        <mike
        --
        Mike Meyer <mwm@mired.or g> http://www.mired.org/home/mwm/
        Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.

        Comment

        • Christopher Weimann

          #5
          Re: How to remove subset from a file efficiently?

          On 01/12/2006-09:04AM, fynali wrote:[color=blue]
          >
          > - PSP0000320.dat (quite a large list of mobile numbers),
          > - CBR0000319.dat (a subset of the above, a list of barred bumbers)
          >[/color]

          fgrep -x -v -f CBR0000319.dat PSP0000320.dat > PSP-CBR.dat

          Comment

          • Raymond Hettinger

            #6
            Re: How to remove subset from a file efficiently?

            AJL wrote:[color=blue]
            > How fast does this run?
            >
            > a = set(file('PSP00 00320.dat'))
            > b = set(file('CBR00 00319.dat'))
            > file('PSP-CBR.dat', 'w').writelines (a.difference(b ))[/color]

            Turning PSP into a set takes extra time, consumes unnecessary memory,
            eliminates duplicates (possibly a bad thing), and loses the original
            input ordering (probably a bad thing).

            To jam the action into a couple lines, try this:

            b = set(file('CBR00 00319.dat'))
            file('PSP-CBR.dat','w').w ritelines(itert ools.ifilterfal se(b.__contains __,file('PSP000 0320.dat')))

            Raymond

            Comment

            • AJL

              #7
              Re: How to remove subset from a file efficiently?

              On 12 Jan 2006 22:29:22 -0800
              "Raymond Hettinger" <python@rcn.com > wrote:
              [color=blue]
              > AJL wrote:[color=green]
              > > How fast does this run?
              > >
              > > a = set(file('PSP00 00320.dat'))
              > > b = set(file('CBR00 00319.dat'))
              > > file('PSP-CBR.dat', 'w').writelines (a.difference(b ))[/color]
              >
              > Turning PSP into a set takes extra time, consumes unnecessary memory,
              > eliminates duplicates (possibly a bad thing), and loses the original
              > input ordering (probably a bad thing).
              >
              > To jam the action into a couple lines, try this:
              >
              > b = set(file('CBR00 00319.dat'))
              > file('PSP-CBR.dat','w').w ritelines(itert ools.ifilterfal se(b.__contains __,file('PSP000 0320.dat')))
              >
              > Raymond
              >[/color]

              The OP said "assume machine has plenty memory". ;)

              I saw some solutions that used sets and was wondering why they stopped
              at using a set for the first file and not the second when the problem is
              really a set problem but I can see the reasoning behind it now.

              AJL

              Comment

              • fynali

                #8
                Re: How to remove subset from a file efficiently?

                $ time fgrep -x -v -f CBR0000333 PSP0000333 > PSP-CBR.dat.fgrep

                real 0m31.551s
                user 0m16.841s
                sys 0m0.912s

                --
                $ time ./cleanup.py

                real 0m6.080s
                user 0m4.836s
                sys 0m0.408s

                --
                $ wc -l PSP-CBR.dat.fgrep PSP-CBR.dat.python
                3872421 PSP-CBR.dat.fgrep
                3872421 PSP-CBR.dat.python

                Fantastic, at any rate the time is down from my initial ~4 min.!

                Thank you Chris. The fgrep approach is clean and to the point; and one
                more reason to love the *nix approach to handling everyday problems.

                Fredrik's set|dict approach in Python above gives me one more reason to
                love Python. And it is indeed fast, 5x!

                Thank you all for all your help.

                --
                fynali

                Comment

                • fynali

                  #9
                  Re: How to remove subset from a file efficiently?

                  $ cat cleanup_ray.py
                  #!/usr/bin/python
                  import itertools

                  b = set(file('/home/sajid/python/wip/stc/2/CBR0000333'))

                  file('PSP-CBR.dat,ray','w ').writelines(i tertools.ifilte rfalse(b.__cont ains__,file('/home/sajid/python/wip/stc/2/PSP0000333')))

                  --
                  $ time ./cleanup_ray.py

                  real 0m5.451s
                  user 0m4.496s
                  sys 0m0.428s

                  (-: Damn! That saves a bit more time! Bravo!

                  Thanks to you Raymond.

                  Comment

                  • bonono@gmail.com

                    #10
                    Re: How to remove subset from a file efficiently?


                    fynali wrote:[color=blue]
                    > $ cat cleanup_ray.py
                    > #!/usr/bin/python
                    > import itertools
                    >
                    > b = set(file('/home/sajid/python/wip/stc/2/CBR0000333'))
                    >
                    > file('PSP-CBR.dat,ray','w ').writelines(i tertools.ifilte rfalse(b.__cont ains__,file('/home/sajid/python/wip/stc/2/PSP0000333')))
                    >
                    > --
                    > $ time ./cleanup_ray.py
                    >
                    > real 0m5.451s
                    > user 0m4.496s
                    > sys 0m0.428s
                    >
                    > (-: Damn! That saves a bit more time! Bravo!
                    >
                    > Thanks to you Raymond.[/color]
                    Have you tried the explicit loop variant with psyco ? My experience is
                    that psyco is pretty good at optimizing for loop which usually results
                    in faster code than even built-in map/filter variant.

                    Though it would just be 1 or 2 sec difference(give n what you already
                    have) so may not be important but could be fun.

                    Comment

                    • fynali

                      #11
                      Re: How to remove subset from a file efficiently?

                      --
                      $ ./cleanup.py
                      Traceback (most recent call last):
                      File "./cleanup.py", line 3, in ?
                      import itertools
                      ImportError: No module named itertools

                      --
                      $ time ./cleanup.py
                      File "./cleanup.py", line 8
                      outfile.writeli nes(number for number in postpaid_file if number
                      not in barred)
                      ^
                      SyntaxError: invalid syntax

                      The earlier results I posted were run on my workstation which has
                      Python 2.4.1,

                      $ uname -a && python -V
                      Linux sajid 2.6.13-15.7-smp #1 SMP
                      Tue Nov 29 14:32:29 UTC 2005 i686 i686 i386 GNU/Linux
                      Python 2.4.1

                      but the server on which the actual processing will be done has an older
                      version )-:

                      $ uname -a && python -V
                      Linux cactus 2.4.21-20.ELsmp #1 SMP
                      Wed Aug 18 20:46:40 EDT 2004 i686 i686 i386 GNU/Linux
                      Python 2.2.3

                      Is a rewrite possible of Raymond's or Fredrik's suggestions above which
                      will still give me the time saving made?

                      --
                      fynali

                      Comment

                      • fynali

                        #12
                        Re: How to remove subset from a file efficiently?

                        [bonono][color=blue]
                        > Have you tried the explicit loop variant with psyco ?[/color]

                        Sure I wouldn't mind trying; can you suggest some code snippets along
                        the lines of which I should try...?

                        [fynali][color=blue]
                        > Needless to say, I'm utterly new to python and my programming
                        > skills & know-how are rudimentary.[/color]

                        (-:

                        --
                        fynali

                        Comment

                        • Fredrik Lundh

                          #13
                          Re: How to remove subset from a file efficiently?

                          "fynali" wrote:
                          [color=blue]
                          > Is a rewrite possible of Raymond's or Fredrik's suggestions above which
                          > will still give me the time saving made?[/color]

                          Python 2.2 don't have a readymade set type (new in 2.3), and it doesn't
                          support generator expressions (the thing that caused the syntax error).

                          however, using a dictionary instead of the set

                          barred = {}
                          for number in open(open('/home/sjd/python/wip/CBR0000319.dat' )):
                          barred[number] = None # just add it as a key

                          and a list comprehension instead of the generator expression

                          outfile.writeli nes([number for number in infile if number not in barred])

                          (note the extra brackets)

                          should give you decent performance under 2.2.

                          </F>



                          Comment

                          • bonono@gmail.com

                            #14
                            Re: How to remove subset from a file efficiently?


                            fynali wrote:[color=blue]
                            > [bonono][color=green]
                            > > Have you tried the explicit loop variant with psyco ?[/color]
                            >
                            > Sure I wouldn't mind trying; can you suggest some code snippets along
                            > the lines of which I should try...?
                            >
                            > [fynali][color=green]
                            > > Needless to say, I'm utterly new to python and my programming
                            > > skills & know-how are rudimentary.[/color]
                            >
                            > (-:
                            >
                            > --
                            > fynali[/color]

                            just add :

                            import psyco
                            psyco.full()

                            to the beginning of the code from Fredrik for 2.2(the one using dict
                            and list comprehension). You system may not have the psyco module
                            though.

                            Comment

                            • fynali

                              #15
                              Re: How to remove subset from a file efficiently?

                              $ cat cleanup.py
                              #!/usr/bin/python

                              postpaid_file = open('/home/oracle/stc/test/PSP0000333')
                              outfile = open('/home/oracle/stc/test/PSP-CBR.dat', 'w')

                              barred = {}

                              for number in open('/home/oracle/stc/test/CBR0000333'):
                              barred[number] = None # just add it as a key

                              outfile.writeli nes([number for number in postpaid_file if number
                              not in barred])

                              postpaid_file.c lose(); outfile.close()

                              --
                              $ time ./cleanup.py

                              real 0m31.007s
                              user 0m24.660s
                              sys 0m3.550s

                              Can we say that using generators & newer Python _is_ faster?

                              --
                              fynali

                              Comment

                              Working...