Re: PEP 372 -- Adding an ordered directory to collections

**bearophileHUGS@lycos.com** · Jun 27 '08, 04:29 PM

Re: PEP 372 -- Adding an ordered directory to collections

Martin v. L.:

For this API, I think it's important to make some performance guarantees.

I may appreciate them for all Python collections :-)

It seems fairly difficult to make byindex O(1), and
simultaneously also make insertion/deletion better than O(n).

It may be possible to make both of them O(log n), but I may appreciate
byindex O(n) and insertion/deletion O(1).

Bye,
bearophile

**Paul McGuire** · Jun 27 '08, 04:29 PM

Re: PEP 372 -- Adding an ordered directory to collections

On Jun 16, 5:24 pm, "Martin v. Löwis" <mar...@v.loewi s.dewrote:

``odict.byindex (index)``

>

Index-based lookup is supported by ``byindex()`` which returns
the key/value pair for an index, that is, the "position"o f a
key in the ordered dict. 0 is the first key/value pair, -1
the last.

>

>>d.byindex(2 )
('foo', 'bar')

>
For this API, I think it's important to make some performance
guarantees. It seems fairly difficult to make byindex O(1), and
simultaneously also make insertion/deletion better than O(n).
>
IOW, the PEP should somehow specify which operations are efficient,
and which ones aren't.
>
Regards,
Martin

My guess is that the two main memory allocate/deallocate cases are 1)
appending a new item to the end, and 2) GC'ing the entire data
structure. I would optimize these 2 at the expense of all others.

-- Paul

**=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=** · Jun 27 '08, 04:29 PM

Re: PEP 372 -- Adding an ordered directory to collections

>For this API, I think it's important to make some performance guarantees.

>
I may appreciate them for all Python collections :-)

See

TimeComplexity

http://wiki.python.org/moin/TimeComplexity

>It seems fairly difficult to make byindex O(1), and
>simultaneous ly also make insertion/deletion better than O(n).

>
It may be possible to make both of them O(log n), but I may appreciate
byindex O(n) and insertion/deletion O(1).

If so, what's the advantage of using that method over d.items[n]?

Regards,
Martin

**=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=** · Jun 27 '08, 04:29 PM

Re: PEP 372 -- Adding an ordered directory to collections

My guess is that the two main memory allocate/deallocate cases are 1)

appending a new item to the end, and 2) GC'ing the entire data
structure. I would optimize these 2 at the expense of all others.

Does that include dictionary lookups?

Regards,
Martin

**Paul McGuire** · Jun 27 '08, 04:29 PM

Re: PEP 372 -- Adding an ordered directory to collections

On Jun 17, 12:28 am, "Martin v. Löwis" <mar...@v.loewi s.dewrote:

My guess is that the two main memory allocate/deallocate cases are 1)
appending a new item to the end, and 2) GC'ing the entire data
structure. I would optimize these 2 at the expense of all others.

>
Does that include dictionary lookups?
>
Regards,
Martin

Well, I did (try to) qualify my guess as pertaining specifically to
memory allocate/deallocate cases, the other topics in the nearby posts
were talking about updates to the odict, so I hope I can be forgiven
this oversight. But you're right, the keyed access is one of the main
reasons this is being done with an odict instead of just a list of
tuples.

The point I was trying to make was that sometimes, the GC case occurs
far more frequently than other update methods, yet is overlooked
because it happens invisibly to most Python users.

I worked on a project a while ago where there was a huge performance
penalty in releasing a list structure, because each node was freed
individually, causing surrounding freelist entries to be updated for
every node. In typical worst-case scenario fashion, a second list was
built at the same time as the first, and the default system memory
allocator interleaved the list nodes. The truly frustrating part was
that at this point in the program, we were trying to free *all* the
nodes in *both* lists, and would have preferred to just release the
whole chunk of memory, if it had just been allocated in a large chunk
instead of a node at a time. I implemented a zone-based memory
allocator, so that nodes were allocated within a memory zone, but when
we were done, we just released the entire zone, with tremendous
improvement in system performance.

But what kind of keyed access is likely to be used on an odict? For
those cases where all of the items are desired, then I would recommend
that the odict iterator be preferred, for this type of retrieval:

for k,v in odictvar.iterit ems():
# do something with key k and value v

and discourage this usage:

for k in odictvar.keys() : # or iterkeys()
# do something with key k and value odictvar[k]

How about this as a simple first approach? Let's assume that the
nominal usage pattern for an odict is 1) create all entries in the
odict, probably using append, 2) access some or all of the entries,
either by iterating over the keys or values, or by keyed access to
specific values, and 3) delete the odict. Have the odict keep an
internal dict representing the keyed lookups, and have this internal
dict set to null whenever the odict is updated. When the odict is
accessed by specific key, the internal dict is used - if it null, it
is built using dict(odict.iter items()). In the nominal usage, this
dict will only be built once, since the keyed accesses wont begin
until after all of the key-value pairs have been added to the odict.

Or as a first-first approach (to avoid premature optimization), just
implement the odict as a list of tuples, and do linear search for
matching key if accessed by key.

I would bias any performance choices about keyed access toward cases
where the queried key exists in the odict - I think that in those
cases where odicts are used, such as ORMs and XML, the keys for a
particular odict are known and expected to exist. Those applications
where the keys are not known are likely to be meta-type apps, like
debuggers and introspection GUIs. I contend that the meat-and-
potatoes production apps will be those like database queries - when I
get an odict back after querying an EMPLOYEES table, I really should
reasonably expect odictvar["EMPNO"] to exist.

And what would be the conditions of an abusive form of odict? How
about an application log kept in an odict, keyed by timestamp? This
could have many 1000's of entries, and yet, I would guess that an
odict would be the wrong data structure for this, and that a list of
tuples or LogMessage objects would be more appropriate. But someone
is likely to do it - if they do, what will happen? Perhaps trying to
anticipate "abusive" or degenerate uses of an odict might shed some
light on ways to avoid, discourage, or maybe even accommodate these
uses in odict.

Well, this has become something of a rant, and a speculative one at
that, but I think the "delete the entire data structure" memory case
should be given reasonably high design attention. (And maybe this is
already the norm in Python development - I'm really quite ignorant of
this process, not being in the Python development circle.)

Always nice to hear from you Martin - cheers!

-- Paul

**bearophileHUGS@lycos.com** · Jun 27 '08, 04:29 PM

Re: PEP 372 -- Adding an ordered directory to collections

Martin v. L.:

http://wiki.python.org/moin/TimeComplexity

Thank you, I think that's not a list of guarantees, while a list of
how things are now in CPython.

If so, what's the advantage of using that method over d.items[n]?

I think I have lost the thread here, sorry. So I explain again what I
mean. I think for this data structure it's important to keep all the
normal dict operations at the same speed. If you use a C
implementation vaguely similar to my pure python recipe you can
perform the del in O(1) too, because pairs are joined in (double)
linked list. But such data structure is O(n) to find the n-th item
inserted into the sequence.

Bye,
bearophile

**=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=** · Jun 27 '08, 04:29 PM

Re: PEP 372 -- Adding an ordered directory to collections

I think I have lost the thread here, sorry. So I explain again what I

mean. I think for this data structure it's important to keep all the
normal dict operations at the same speed. If you use a C
implementation vaguely similar to my pure python recipe you can
perform the del in O(1) too, because pairs are joined in (double)
linked list. But such data structure is O(n) to find the n-th item
inserted into the sequence.

Right. So byindex(n) would be O(n) then, right? If so, what's the
purpose of having that method in the first place?

The PEP doesn't give a rationale, but just proposes that the method
be there. My guess is that it includes it for performance reasons.
However, I think the PEP (author) is misguided in assuming that
making byindex() a method of odict, you get better performance than
directly doing .items()[n] - which, as you say, you won't.

Regards,
Martin

**bearophileHUGS@lycos.com** · Jun 27 '08, 04:29 PM

Re: PEP 372 -- Adding an ordered directory to collections

Martin v. L.:

However, I think the PEP (author) is misguided in assuming that
making byindex() a method of odict, you get better performance than
directly doing .items()[n] - which, as you say, you won't.

In Python 2.5 .items()[n] creates a whole list, and then takes one
item of such list.
An O(n) byindex() just scans the items to return the n-th.
So while being both O(n) in time, the .items()[n] may allocate quite
more memory.

Bye,
bearophile

**=?UTF-8?B?Ik1hcnRpbiB2LiBMw7Z3aXMi?=** · Jun 27 '08, 04:29 PM

Re: PEP 372 -- Adding an ordered directory to collections

What's the purpose of having list.insert?

It's a convenience function: you don't have to write a loop to move all
items to a later index. Any reformulation of it is easy to get wrong,
and difficult to read.

One creates tons of unnecessary method calls, the other creates a full
blown list object just to throw it away later. Both less than optimal
solutions that can be implemented in a more efficient way on the C
layer where one only has to iterate over the linked list offset times
and return the item. And iteration for that linked list is most likely
something like "for (n = 0; n != offset; ++n) iter = iter->next".

Ok, so it is about performance, and intended to provide a speedup by
a constant factor (over the trivial reformulation without it).

What are the use cases for this function? I.e. in what specific
applications did you use it, or would you want to use it?

In these applications, could you instead also have used a (hypothetical)
function

def nth(iter, n):
while n:
iter.next()
n-=1
return iter.next()

instead?

Regards,
Martin

Re: PEP 372 -- Adding an ordered directory to collections

Re: PEP 372 -- Adding an ordered directory to collections

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment