Re: PEP 372 -- Adding an ordered directory to collections

Collapse
This topic is closed.
X
X
 
  • Time
  • Show
Clear All
new posts
  • =?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=

    Re: PEP 372 -- Adding an ordered directory to collections

    ``odict.byindex (index)``
    >
    Index-based lookup is supported by ``byindex()`` which returns
    the key/value pair for an index, that is, the "position" of a
    key in the ordered dict. 0 is the first key/value pair, -1
    the last.
    >
    >>d.byindex(2 )
    ('foo', 'bar')
    For this API, I think it's important to make some performance
    guarantees. It seems fairly difficult to make byindex O(1), and
    simultaneously also make insertion/deletion better than O(n).

    IOW, the PEP should somehow specify which operations are efficient,
    and which ones aren't.

    Regards,
    Martin
  • bearophileHUGS@lycos.com

    #2
    Re: PEP 372 -- Adding an ordered directory to collections

    Martin v. L.:
    For this API, I think it's important to make some performance guarantees.
    I may appreciate them for all Python collections :-)

    It seems fairly difficult to make byindex O(1), and
    simultaneously also make insertion/deletion better than O(n).
    It may be possible to make both of them O(log n), but I may appreciate
    byindex O(n) and insertion/deletion O(1).

    Bye,
    bearophile

    Comment

    • Paul McGuire

      #3
      Re: PEP 372 -- Adding an ordered directory to collections

      On Jun 16, 5:24 pm, "Martin v. Löwis" <mar...@v.loewi s.dewrote:
          ``odict.byindex (index)``
      >
              Index-based lookup is supported by ``byindex()`` which returns
              the key/value pair for an index, that is, the "position"o f a
              key in the ordered dict.  0 is the first key/value pair, -1
              the last.
      >
              >>d.byindex(2 )
              ('foo', 'bar')
      >
      For this API, I think it's important to make some performance
      guarantees. It seems fairly difficult to make byindex O(1), and
      simultaneously also make insertion/deletion better than O(n).
      >
      IOW, the PEP should somehow specify which operations are efficient,
      and which ones aren't.
      >
      Regards,
      Martin
      My guess is that the two main memory allocate/deallocate cases are 1)
      appending a new item to the end, and 2) GC'ing the entire data
      structure. I would optimize these 2 at the expense of all others.

      -- Paul

      Comment

      • =?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=

        #4
        Re: PEP 372 -- Adding an ordered directory to collections

        >For this API, I think it's important to make some performance guarantees.
        >
        I may appreciate them for all Python collections :-)
        See


        >It seems fairly difficult to make byindex O(1), and
        >simultaneous ly also make insertion/deletion better than O(n).
        >
        It may be possible to make both of them O(log n), but I may appreciate
        byindex O(n) and insertion/deletion O(1).
        If so, what's the advantage of using that method over d.items[n]?

        Regards,
        Martin

        Comment

        • =?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=

          #5
          Re: PEP 372 -- Adding an ordered directory to collections

          My guess is that the two main memory allocate/deallocate cases are 1)
          appending a new item to the end, and 2) GC'ing the entire data
          structure. I would optimize these 2 at the expense of all others.
          Does that include dictionary lookups?

          Regards,
          Martin

          Comment

          • Paul McGuire

            #6
            Re: PEP 372 -- Adding an ordered directory to collections

            On Jun 17, 12:28 am, "Martin v. Löwis" <mar...@v.loewi s.dewrote:
            My guess is that the two main memory allocate/deallocate cases are 1)
            appending a new item to the end, and 2) GC'ing the entire data
            structure.  I would optimize these 2 at the expense of all others.
            >
            Does that include dictionary lookups?
            >
            Regards,
            Martin

            Well, I did (try to) qualify my guess as pertaining specifically to
            memory allocate/deallocate cases, the other topics in the nearby posts
            were talking about updates to the odict, so I hope I can be forgiven
            this oversight. But you're right, the keyed access is one of the main
            reasons this is being done with an odict instead of just a list of
            tuples.

            The point I was trying to make was that sometimes, the GC case occurs
            far more frequently than other update methods, yet is overlooked
            because it happens invisibly to most Python users.

            I worked on a project a while ago where there was a huge performance
            penalty in releasing a list structure, because each node was freed
            individually, causing surrounding freelist entries to be updated for
            every node. In typical worst-case scenario fashion, a second list was
            built at the same time as the first, and the default system memory
            allocator interleaved the list nodes. The truly frustrating part was
            that at this point in the program, we were trying to free *all* the
            nodes in *both* lists, and would have preferred to just release the
            whole chunk of memory, if it had just been allocated in a large chunk
            instead of a node at a time. I implemented a zone-based memory
            allocator, so that nodes were allocated within a memory zone, but when
            we were done, we just released the entire zone, with tremendous
            improvement in system performance.

            But what kind of keyed access is likely to be used on an odict? For
            those cases where all of the items are desired, then I would recommend
            that the odict iterator be preferred, for this type of retrieval:

            for k,v in odictvar.iterit ems():
            # do something with key k and value v

            and discourage this usage:

            for k in odictvar.keys() : # or iterkeys()
            # do something with key k and value odictvar[k]

            How about this as a simple first approach? Let's assume that the
            nominal usage pattern for an odict is 1) create all entries in the
            odict, probably using append, 2) access some or all of the entries,
            either by iterating over the keys or values, or by keyed access to
            specific values, and 3) delete the odict. Have the odict keep an
            internal dict representing the keyed lookups, and have this internal
            dict set to null whenever the odict is updated. When the odict is
            accessed by specific key, the internal dict is used - if it null, it
            is built using dict(odict.iter items()). In the nominal usage, this
            dict will only be built once, since the keyed accesses wont begin
            until after all of the key-value pairs have been added to the odict.

            Or as a first-first approach (to avoid premature optimization), just
            implement the odict as a list of tuples, and do linear search for
            matching key if accessed by key.

            I would bias any performance choices about keyed access toward cases
            where the queried key exists in the odict - I think that in those
            cases where odicts are used, such as ORMs and XML, the keys for a
            particular odict are known and expected to exist. Those applications
            where the keys are not known are likely to be meta-type apps, like
            debuggers and introspection GUIs. I contend that the meat-and-
            potatoes production apps will be those like database queries - when I
            get an odict back after querying an EMPLOYEES table, I really should
            reasonably expect odictvar["EMPNO"] to exist.

            And what would be the conditions of an abusive form of odict? How
            about an application log kept in an odict, keyed by timestamp? This
            could have many 1000's of entries, and yet, I would guess that an
            odict would be the wrong data structure for this, and that a list of
            tuples or LogMessage objects would be more appropriate. But someone
            is likely to do it - if they do, what will happen? Perhaps trying to
            anticipate "abusive" or degenerate uses of an odict might shed some
            light on ways to avoid, discourage, or maybe even accommodate these
            uses in odict.

            Well, this has become something of a rant, and a speculative one at
            that, but I think the "delete the entire data structure" memory case
            should be given reasonably high design attention. (And maybe this is
            already the norm in Python development - I'm really quite ignorant of
            this process, not being in the Python development circle.)

            Always nice to hear from you Martin - cheers!

            -- Paul

            Comment

            • bearophileHUGS@lycos.com

              #7
              Re: PEP 372 -- Adding an ordered directory to collections

              Martin v. L.:Thank you, I think that's not a list of guarantees, while a list of
              how things are now in CPython.

              If so, what's the advantage of using that method over d.items[n]?
              I think I have lost the thread here, sorry. So I explain again what I
              mean. I think for this data structure it's important to keep all the
              normal dict operations at the same speed. If you use a C
              implementation vaguely similar to my pure python recipe you can
              perform the del in O(1) too, because pairs are joined in (double)
              linked list. But such data structure is O(n) to find the n-th item
              inserted into the sequence.

              Bye,
              bearophile

              Comment

              • =?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=

                #8
                Re: PEP 372 -- Adding an ordered directory to collections

                I think I have lost the thread here, sorry. So I explain again what I
                mean. I think for this data structure it's important to keep all the
                normal dict operations at the same speed. If you use a C
                implementation vaguely similar to my pure python recipe you can
                perform the del in O(1) too, because pairs are joined in (double)
                linked list. But such data structure is O(n) to find the n-th item
                inserted into the sequence.
                Right. So byindex(n) would be O(n) then, right? If so, what's the
                purpose of having that method in the first place?

                The PEP doesn't give a rationale, but just proposes that the method
                be there. My guess is that it includes it for performance reasons.
                However, I think the PEP (author) is misguided in assuming that
                making byindex() a method of odict, you get better performance than
                directly doing .items()[n] - which, as you say, you won't.

                Regards,
                Martin

                Comment

                • bearophileHUGS@lycos.com

                  #9
                  Re: PEP 372 -- Adding an ordered directory to collections

                  Martin v. L.:
                  However, I think the PEP (author) is misguided in assuming that
                  making byindex() a method of odict, you get better performance than
                  directly doing .items()[n] - which, as you say, you won't.
                  In Python 2.5 .items()[n] creates a whole list, and then takes one
                  item of such list.
                  An O(n) byindex() just scans the items to return the n-th.
                  So while being both O(n) in time, the .items()[n] may allocate quite
                  more memory.

                  Bye,
                  bearophile

                  Comment

                  • =?UTF-8?B?Ik1hcnRpbiB2LiBMw7Z3aXMi?=

                    #10
                    Re: PEP 372 -- Adding an ordered directory to collections

                    What's the purpose of having list.insert?

                    It's a convenience function: you don't have to write a loop to move all
                    items to a later index. Any reformulation of it is easy to get wrong,
                    and difficult to read.
                    One creates tons of unnecessary method calls, the other creates a full
                    blown list object just to throw it away later. Both less than optimal
                    solutions that can be implemented in a more efficient way on the C
                    layer where one only has to iterate over the linked list offset times
                    and return the item. And iteration for that linked list is most likely
                    something like "for (n = 0; n != offset; ++n) iter = iter->next".
                    Ok, so it is about performance, and intended to provide a speedup by
                    a constant factor (over the trivial reformulation without it).

                    What are the use cases for this function? I.e. in what specific
                    applications did you use it, or would you want to use it?

                    In these applications, could you instead also have used a (hypothetical)
                    function

                    def nth(iter, n):
                    while n:
                    iter.next()
                    n-=1
                    return iter.next()

                    instead?

                    Regards,
                    Martin

                    Comment

                    Working...