x = [chr(a) + chr(b) for a in xrange(100) for b in xrange(100)]
# list version
c = []
for i in x:
    if i not in c:
        c.append(i)

>>> t1.timeit(1)
2.0331145438875637
# dict version
c = {}
for i in x:
    if i not in c:
        c[i] = None

>>> t2.timeit(1)
0.0067952770534134288
# bsddb version
c = bsddb.btopen(None)
for i in x:
    if i not in c:
        c[i] = None

>>> t3.timeit(1)
0.18430750276922936
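The post doesn't show how the timers t1–t3 were built. A minimal sketch of the setup, assuming `timeit.Timer` with the list comprehension as the shared setup string (written for a modern Python, so `range` replaces `xrange` and the bsddb case is omitted):

```python
import timeit

# Shared setup: 10,000 unique two-character strings.
setup = "x = [chr(a) + chr(b) for a in range(100) for b in range(100)]"

# List-based dedup: O(n) membership scan per item.
list_stmt = """
c = []
for i in x:
    if i not in c:
        c.append(i)
"""

# Dict-based dedup: O(1) average membership test per item.
dict_stmt = """
c = {}
for i in x:
    if i not in c:
        c[i] = None
"""

t1 = timeit.Timer(list_stmt, setup)
t2 = timeit.Timer(dict_stmt, setup)
print("list:", t1.timeit(1))
print("dict:", t2.timeit(1))
```

The gap comes from the membership test: `i not in c` scans the whole list but does a single hash lookup in the dict.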
Wow. Dicts are *fast*.
I'm dedup'ing a 10-million-record dataset, trying different approaches
for building indexes. The in-memory dicts are clearly faster, but I get
MemoryErrors (Win2k, 512 MB RAM, 4 GB virtual). Any recommendations on
other ways to build a large index without slowing down by a factor of
25?
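One way to bound memory (not from the post; the function name here is hypothetical) is to hash-partition the keys into buckets and dedup each bucket separately, so only one bucket's keys ever need an in-memory set at a time. A minimal sketch:

```python
# Hypothetical sketch: memory-bounded dedup by hash partitioning.
# Identical records always hash to the same bucket, so dedup'ing each
# bucket independently is equivalent to a global dedup. Note that the
# bucketing pass does not preserve the original record order.

def dedup_partitioned(records, num_buckets=16):
    # Pass 1: route each record to a bucket by hash. In practice each
    # bucket would be a temp file on disk; lists keep the sketch short.
    buckets = [[] for _ in range(num_buckets)]
    for r in records:
        buckets[hash(r) % num_buckets].append(r)

    # Pass 2: dedup one bucket at a time with a small in-memory set.
    for bucket in buckets:
        seen = set()
        for r in bucket:
            if r not in seen:
                seen.add(r)
                yield r
```

With disk-backed buckets, peak memory is roughly one bucket's worth of unique keys, i.e. about 1/16th of the single-dict approach for 16 buckets, at the cost of an extra pass over the data.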
Robert Brewer
MIS
Amor Ministries
fumanchu@amor.org