Orders of magnitude

  • Robert Brewer

    Orders of magnitude

    x = [chr(a) + chr(b) for a in xrange(100) for b in xrange(100)]

    # list version: c = []
    for i in x:
        if i not in c:
            c.append(i)
    >>> t1.timeit(1)
    2.0331145438875637

    # dict version: c = {}
    for i in x:
        if i not in c:
            c[i] = None
    >>> t2.timeit(1)
    0.0067952770534134288

    # bsddb version: c = bsddb.btopen(None)
    for i in x:
        if i not in c:
            c[i] = None
    >>> t3.timeit(1)
    0.18430750276922936

    Wow. Dicts are *fast*.

    I'm dedup'ing a 10-million-record dataset, trying different approaches
    for building indexes. The in-memory dicts are clearly faster, but I get
    Memory Errors (Win2k, 512 MB RAM, 4 G virtual). Any recommendations on
    other ways to build a large index without slowing down by a factor of
    25?


    Robert Brewer
    MIS
    Amor Ministries
    fumanchu@amor.org

  • Paul Rubin

    #2
    Re: Orders of magnitude

    "Robert Brewer" <fumanchu@amor. org> writes:[color=blue]
    > I'm dedup'ing a 10-million-record dataset, trying different approaches
    > for building indexes. The in-memory dicts are clearly faster, but I get
    > Memory Errors (Win2k, 512 MB RAM, 4 G virtual). Any recommendations on
    > other ways to build a large index without slowing down by a factor of
    > 25?

    Sort, then remove dups.


    • Josiah Carlson

      #3
      Re: Orders of magnitude

      > Sort, then remove dups.

      A list of 10 million integers sucks up ~160 megs of memory with Python. I
      doubt the strings would fit even then.

      I would suggest a multi-phase method. Break the sequence up into 100k-element
      blocks and remove duplicates within each block using a dictionary. Sort each
      block individually and write each to its own file.

      Merge the 100 files by using a heapq.
      #heap is structured like...
      [("next string in file a", file_handle_a),
      ("next string in file b", file_handle_b), ...]

      If you keep using heappop and heappush from heapq, along with a single
      string of memory, you can remove duplicates quite easily, and certainly
      won't run out of memory.
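
      A minimal sketch of the multi-phase method described above (assuming the
      keys arrive as an iterable of newline-free strings; the block and output
      file names are placeholders):

      import heapq

      def write_sorted_blocks(keys, block_size=100000):
          # Phase 1: dedup each block with a dict, sort it, write it to its own file.
          names, block = [], {}
          for key in keys:
              block[key] = None
              if len(block) >= block_size:
                  names.append(dump_block(block, len(names)))
                  block = {}
          if block:
              names.append(dump_block(block, len(names)))
          return names

      def dump_block(block, n):
          name = 'block%03d.txt' % n
          f = open(name, 'w')
          for key in sorted(block):
              f.write(key + '\n')
          f.close()
          return name

      def merge_unique(names, outname):
          # Phase 2: heap-merge the sorted block files; only one string of state is kept.
          files = [open(name) for name in names]
          heap = []
          for i, f in enumerate(files):
              line = f.readline()
              if line:
                  heapq.heappush(heap, (line, i))
          out = open(outname, 'w')
          last = None                         # the single string we remember
          while heap:
              line, i = heapq.heappop(heap)
              if line != last:                # equal keys pop out adjacent, so skip repeats
                  out.write(line)
                  last = line
              nxt = files[i].readline()
              if nxt:
                  heapq.heappush(heap, (nxt, i))
          out.close()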

      - Josiah


      • PF

        #4
        Re: Orders of magnitude



        It all boils down to how much space your keys take.
        When you look for dupes, you must hold only the keys in memory, not the
        data (it'll be a lot faster this way).

        I'd say create a bsddb with btree sorting to hold all your keys. Should take
        about 20 minutes to fill it. Then scan it in sorted key order, and
        duplicates will appear next to each other.
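
        A minimal sketch of that approach (bsddb as shipped with Python 2.x;
        record_keys() is a hypothetical generator over the raw keys, and the file
        names are placeholders):

        import bsddb

        db = bsddb.btopen('keys.db', 'n')   # on-disk btree index
        for key in record_keys():           # keys must be strings
            db[key] = ''                    # duplicate keys simply overwrite

        # A btree hands its keys back in sorted order; for 10M keys the cursor
        # interface (db.first()/db.next()) avoids building the whole key list.
        out = open('unique_keys.txt', 'w')
        for key in db.keys():
            out.write(key + '\n')
        out.close()
        db.close()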


        • Paul Rubin

          #5
          Re: Orders of magnitude

          Josiah Carlson <jcarlson@nospam.uci.edu> writes:
          > > Sort, then remove dups.
          >
          > A list of 10 million integers sucks up ~160 megs of memory with Python. I
          > doubt the strings would fit even then.

          I meant sort externally. I thought these records were in a disk file
          or something.


          • Peter Otten

            #6
            Re: Orders of magnitude

            Robert Brewer wrote:
            > # bsddb version: c = bsddb.btopen(None)
            > for i in x:
            >     if i not in c:
            >         c[i] = None

            You might try a minor change:

            for i in x:
                c[i] = None

            Dropping the membership test saves one lookup per key at the cost of
            rewriting keys that are already present, so the fewer the redundancies the
            larger the speed gain (I suppose; that really needs testing to be sure).

            Peter


            • Michel Claveau/Hamster

              #7
              Re: Orders of magnitude

              Hello!

              Dictionaries are faster for lookups ("... in ...") because they use the key.
              With a list, the search is sequential.

              For membership tests (... in ...):
              - a dictionary uses a key (hash) lookup
              - a list uses a sequential search
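
              A quick illustrative timing of the difference (snippet assumes nothing
              beyond the standard library):

              import timeit

              setup = "keys = [str(i) for i in range(100000)]; d = dict.fromkeys(keys)"
              print timeit.Timer("'99999' in keys", setup).timeit(100)  # list: sequential scan
              print timeit.Timer("'99999' in d", setup).timeit(100)     # dict: key (hash) lookup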




              @-salutations
              --
              Michel Claveau



              • Buck Nuggets

                #8
                Re: Orders of magnitude

                "Robert Brewer" <fumanchu@amor. org> wrote in message news:<mailman.3 8.1080542935.20 120.python-list@python.org >...[color=blue]
                > I'm dedup'ing a 10-million-record dataset, trying different approaches
                > for building indexes. The in-memory dicts are clearly faster, but I get
                > Memory Errors (Win2k, 512 MB RAM, 4 G virtual). Any recommendations on
                > other ways to build a large index without slowing down by a factor of
                > 25?

                In case you are interested in alternative approaches... here's how I
                typically do this:

                step 1: sort the file using a separate sort utility (unix sort, cygwin
                sort, etc)

                step 2: have a python program read in rows,
                compare each row to the prior,
                write out only one row for each set
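
                A minimal sketch of step 2 (assuming one record per line in the
                already-sorted file; the file names are placeholders):

                prev = None
                out = open('deduped.txt', 'w')
                for row in open('sorted.txt'):
                    if row != prev:          # sorted input keeps duplicates adjacent
                        out.write(row)
                        prev = row
                out.close()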

                ks


                • Christian Tismer

                  #9
                  Re: Orders of magnitude

                  Buck Nuggets wrote:
                  > "Robert Brewer" <fumanchu@amor.org> wrote in message news:<mailman.38.1080542935.20120.python-list@python.org>...
                  >
                  >>I'm dedup'ing a 10-million-record dataset, trying different approaches
                  >>for building indexes. The in-memory dicts are clearly faster, but I get
                  >>Memory Errors (Win2k, 512 MB RAM, 4 G virtual). Any recommendations on
                  >>other ways to build a large index without slowing down by a factor of
                  >>25?
                  >
                  >
                  > In case you are interested in alternative approaches... here's how I
                  > typically do this:
                  >
                  > step 1: sort the file using a separate sort utility (unix sort, cygwin
                  > sort, etc)
                  >
                  > step 2: have a python program read in rows,
                  > compare each row to the prior,
                  > write out only one row for each set

                  Good solution, but wayyyy too much effort.
                  You probably know it:
                  if you are looking for duplicates and do it by
                  complete ordering, then you are throwing lots of information
                  away, since you are not actually interested in neighborship, right?
                  That clearly means: it must be inefficient.

                  No offense, just trying to get you on the right track!

                  ciao - chris
                  --
                  Christian Tismer :^) <mailto:tismer@stackless.com>
                  Mission Impossible 5oftware : Have a break! Take a ride on Python's
                  Johannes-Niemeyer-Weg 9a : *Starship* http://starship.python.net/
                  14109 Berlin : PGP key -> http://wwwkeys.pgp.net/
                  work +49 30 89 09 53 34 home +49 30 802 86 56 mobile +49 173 24 18 776
                  PGP 0x57F3BF04 9064 F4E1 D754 C2FF 1619 305B C09C 5A3B 57F3 BF04
                  whom do you want to sponsor today? http://www.stackless.com/



                  • Stefan Axelsson

                    #10
                    Re: Orders of magnitude

                    Buck Nuggets wrote:
                    > step 1: sort the file using a separate sort utility (unix sort, cygwin
                    > sort, etc)
                    >
                    > step 2: have a python program read in rows,
                    > compare each row to the prior,
                    > write out only one row for each set

                    It might be worth mentioning that if your idea of 'sameness' is
                    identical textual representation, the '-u' flag of 'sort' does step 2
                    for you. Won't work in the more general case, but then 'sort' might not
                    work either. YMMV.

                    Stefan,
                    --
                    Stefan Axelsson (email at http://www.cs.chalmers.se/~sax)


                    • Buck Nuggets

                      #11
                      Re: Orders of magnitude

                      Christian Tismer <tismer@stackless.com> wrote in message news:<mailman.86.1080611520.20120.python-list@python.org>...
                      > Buck Nuggets wrote:
                      >
                      > > "Robert Brewer" <fumanchu@amor.org> wrote in message news:<mailman.38.1080542935.20120.python-list@python.org>...
                      > >
                      > > In case you are interested in alternative approaches... here's how I
                      > > typically do this:
                      > >
                      > > step 1: sort the file using a separate sort utility (unix sort, cygwin
                      > > sort, etc)
                      > >
                      > > step 2: have a python program read in rows,
                      > > compare each row to the prior,
                      > > write out only one row for each set
                      >
                      > Good solution, but wayyyy too much effort.
                      > You probably know it:
                      > if you are looking for duplicates and do it by
                      > complete ordering, then you are throwing lots of information
                      > away, since you are not actually interested in neighborship, right?
                      > That clearly means: it must be inefficient.
                      >
                      > No offense, just trying to get you on the right track!

                      Ha, that's ok. I've been doing exactly this kind of thing for over
                      twenty years (crusty old database developer). I think that you will
                      find that it is more efficient in both development and run time. And
                      it's simple enough that once you start down this path you won't need
                      to brainstorm on how to get it to work.

                      Rather than taking 2-18 hours with the previously mentioned solutions
                      (which require index-building and 10 million index lookups), you'll
                      probably do the entire thing in about 10 minutes (9 minutes to sort
                      the file + 1 minute to check dups).


                      • Dang Griffith

                        #12
                        Re: Orders of magnitude

                        On 30 Mar 2004 06:57:16 -0800, bucknuggets@yahoo.com (Buck Nuggets)
                        wrote:
                        >Christian Tismer <tismer@stackless.com> wrote in message news:<mailman.86.1080611520.20120.python-list@python.org>...
                        >> Buck Nuggets wrote:
                        >>
                        >> > "Robert Brewer" <fumanchu@amor.org> wrote in message news:<mailman.38.1080542935.20120.python-list@python.org>...
                        >> >
                        >> > In case you are interested in alternative approaches... here's how I
                        >> > typically do this:
                        >> >
                        >> > step 1: sort the file using a separate sort utility (unix sort, cygwin
                        >> > sort, etc)
                        >> >
                        >> > step 2: have a python program read in rows,
                        >> > compare each row to the prior,
                        >> > write out only one row for each set
                        >>
                        >> Good solution, but wayyyy too much effort.
                        >> You probably know it:
                        >> if you are looking for duplicates and do it by
                        >> complete ordering, then you are throwing lots of information
                        >> away, since you are not actually interested in neighborship, right?
                        >> That clearly means: it must be inefficient.
                        >>
                        >> No offense, just trying to get you on the right track!
                        >
                        >Ha, that's ok. I've been doing exactly this kind of thing for over
                        >twenty years (crusty old database developer). I think that you will
                        >find that it is more efficient in both development and run time. And
                        >it's simple enough that once you start down this path you won't need
                        >to brainstorm on how to get it to work.
                        >
                        >Rather than taking 2-18 hours with the previously mentioned solutions
                        >(which require index-building and 10 million index lookups), you'll
                        >probably do the entire thing in about 10 minutes (9 minutes to sort
                        >the file + 1 minute to check dups).
                        From a crusty old unix developer to a crusty old database developer...
                        Part 2 can be done by piping the output of sort to the 'uniq' program
                        (available in cygwin and mingw also, I think).

                        And it's no effort, if it fits the bill. It may be inefficient with
                        regards to sorting algorithms, but extremely efficient in terms of
                        system and developer resources.

                        --dang


                        • Buck Nuggets

                          #13
                          Re: Orders of magnitude

                          Dang Griffith <noemail@noemail4u.com> wrote in message news:<8ea19805001359792a4e52ab984a1e8c@news.teranews.com>...
                          > From a crusty old unix developer to a crusty old database developer...
                          > Part 2 can be done by piping the output of sort to the 'uniq' program
                          > (available in cygwin and mingw also, I think).
                          >
                          > And it's no effort, if it fits the bill. It may be inefficient with
                          > regards to sorting algorithms, but extremely efficient in terms of
                          > system and developer resources.

                          ha, yep - forgot to mention that you can drop dupes right in some sort
                          utilities or via a program like uniq.

                          I suppose the only reason to use python after the sort would be if you
                          have some unusual rules to include in the delta calculation (don't
                          include this record if status-field='del', etc).
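
                          A minimal sketch of folding such a rule into the dedup pass (the
                          delimiter, column position, and status value are hypothetical):

                          prev = None
                          out = open('deduped.txt', 'w')
                          for row in open('sorted.txt'):
                              fields = row.rstrip('\n').split('|')
                              if fields[2] == 'del':       # hypothetical rule: skip deleted records
                                  continue
                              if row != prev:              # sorted input keeps duplicates adjacent
                                  out.write(row)
                                  prev = row
                          out.close()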

                          ks


                          • Paul Rubin

                            #14
                            Re: Orders of magnitude

                            Dang Griffith <noemail@noemail4u.com> writes:
                            > From a crusty old unix developer to a crusty old database developer...
                            > Part 2 can be done by piping the output of sort to the 'uniq' program
                            > (available in cygwin and mingw also, I think).

                            From a maybe even crustier unix developer: "sort -u" strips dups.
