Adding tuples to a dictionary

  • Maciej Bliziński

    Adding tuples to a dictionary

    Hi Pythonistas!

    I've got a question about storing tuples in a dictionary. First, a
    small test case which creates a list of dictionaries:

    import time

    list_of_dicts = []
    keys = [str(x) for x in range(200000)]
    prev_clk = time.clock()
    for i in range(20):
        my_dict = {}
        for key in keys:
            my_dict[key] = key
            list_of_dicts.append(my_dict)
        new_clk = time.clock()
        print i, new_clk - prev_clk
        prev_clk = new_clk

    It creates dictionaries and stores them in a list, printing out
    execution times. The size of each dictionary is constant, so is the
    execution time for each iteration.

    0 0.1
    1 0.1
    2 0.1
    3 0.08
    4 0.09

    ...and so on.

    Then, just one line is changed:
    my_dict[key] = key
    into:
    my_dict[key] = (key, key)

    Full code:

    list_of_dicts = []
    keys = [str(x) for x in range(200000)]
    prev_clk = time.clock()
    for i in range(20):
        my_dict = {}
        for key in keys:
            my_dict[key] = (key, key)
            list_of_dicts.append(my_dict)
        new_clk = time.clock()
        print i, new_clk - prev_clk
        prev_clk = new_clk

    The difference is that tuples are added to the dictionary instead of
    single values. When the program is run again:

    0 0.27
    1 0.37
    2 0.49
    3 0.6
    ....
    16 2.32
    17 2.45
    18 2.54
    19 2.68

    The execution time is rising linearly with every new dictionary
    created.

    Next experiment: dictionaries are not stored in a list, they are just
    left out when an iteration has finished. It's done by removing two
    lines:

    list_of_dicts = []

    and

    list_of_dicts.append(my_dict)

    Full code:

    keys = [str(x) for x in range(200000)]
    prev_clk = time.clock()
    for i in range(20):
        my_dict = {}
        for key in keys:
            my_dict[key] = (key, key)
        new_clk = time.clock()
        print i, new_clk - prev_clk
        prev_clk = new_clk

    The time is constant again:

    0 0.28
    1 0.28
    2 0.28
    3 0.26
    4 0.26

    I see no reason for this kind of performance problem, really. It
    happens when both things are true: dictionaries are kept in a list (or
    more generally, in memory) and they store tuples.

    As this goes beyond my understanding of Python internals, I would like
    to kindly ask if anyone has an idea about how to create this data
    structure (a list of dictionaries of tuples, assuming that the size of
    all dictionaries is the same) in constant time?
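    For anyone re-running this test case today, here is a rough modern
    sketch of the same measurement. Note the assumptions: time.clock() was
    removed in Python 3.8, so this uses time.perf_counter() instead, the
    sizes are reduced, and the append is placed once per outer iteration.

```python
import time

keys = [str(x) for x in range(20000)]   # smaller than the original 200000
list_of_dicts = []
timings = []
prev = time.perf_counter()              # time.clock() is gone in 3.8+
for i in range(5):
    my_dict = {}
    for key in keys:
        my_dict[key] = (key, key)       # a fresh tuple per key
    list_of_dicts.append(my_dict)       # once per outer iteration
    now = time.perf_counter()
    timings.append(now - prev)
    prev = now

print(timings)
```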

    Regards,
    Maciej

  • Peter Otten

    #2
    Re: Adding tuples to a dictionary

    Maciej Bliziński wrote:
    [snip]
    >
    As this goes beyond my understanding of Python internals, I would like
    to kindly ask, if anyone has an idea about how to create this data
    structure (list of dictionaries of tuples, assuming that size of all
    dictionaries is the same), in constant time?
    Disable garbage collection:

    import gc
    gc.disable()

    It uses inappropriate heuristics in this case.
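    A minimal sketch of that fix, at a reduced size, with a manual collection
    every ten iterations to keep memory in check (the helper name build is
    mine, not from the thread):

```python
import gc

def build(n_dicts=20, n_keys=20000):
    # Same structure as the test case, at a reduced size.
    keys = [str(x) for x in range(n_keys)]
    list_of_dicts = []
    for i in range(n_dicts):
        my_dict = {}
        for key in keys:
            my_dict[key] = (key, key)
        list_of_dicts.append(my_dict)
        if i % 10 == 9:
            gc.collect()        # collect manually, on our own schedule
    return list_of_dicts

gc.disable()                    # stop the cyclic collector's heuristics
try:
    list_of_dicts = build()
finally:
    gc.enable()                 # restore normal behaviour afterwards
```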

    Peter


    • bvukov@teletrader.com

      #3
      Re: Adding tuples to a dictionary

      On May 31, 8:30 pm, Maciej Bliziński <maciej.blizin...@gmail.com>
      wrote:
      [snip]
      >
      As this goes beyond my understanding of Python internals, I would like
      to kindly ask, if anyone has an idea about how to create this data
      structure (list of dictionaries of tuples, assuming that size of all
      dictionaries is the same), in constant time?
      Let me comment on what happens in your code.
      The places where you create new objects are:

      keys = [str(x) for x in range(200000)]  # creates 200000 strings,
                                              # later reused by reference
      and

      my_dict[key] = (key, key)  # creates a new 2-element tuple (both
                                 # elements are references to the existing
                                 # key object)

      The tricky part is where you wrote:

      for key in keys:
          my_dict[key] = (key, key)
          list_of_dicts.append(my_dict)  # note that the append is inside
                                         # the inner loop! check upstairs!

      This means that the my_dict reference will be stored 200000 times, and
      it won't be released. The statement

      my_dict = {}

      will always create a new dictionary (20 iterations means 20 new
      dictionaries) and start over. Although Python caches freed
      dictionaries (they're used everywhere), reuse won't happen here, and
      memory will have to be allocated again.
      Lists are internally like arrays: when there is not enough space for
      the next element, the pointer array is doubled, so there is no huge
      penalty in the append function. Lists are also reused from an internal
      cache.
      Dictionaries have a similar growth function: when there is no space
      for the next key, the internal hash table doubles.
      The reason the time keeps growing is that memory allocation takes
      place instead of objects being reused by reference. Check the memory
      usage, and you'll see that the test time is pretty much proportional
      to overall memory usage.
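      The allocation behaviour described above is easy to check: two
      (key, key) tuples compare equal but are two distinct allocations,
      while their elements really are shared by reference:

```python
key = "some key"
t1 = (key, key)
t2 = (key, key)

assert t1 == t2          # equal contents...
assert t1 is not t2      # ...but two separate tuple objects
assert t1[0] is key      # the elements are shared by reference,
assert t2[1] is key      # not copied
```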

      Regards, Bosko


      • Nick Craig-Wood

        #4
        Re: Adding tuples to a dictionary

        bvukov@teletrader.com wrote:
        [snip]
        The reason why you have a growing time comes from the fact that memory
        allocation takes place instead of object
        being used by reference. Check the memory usage, and you'll see that
        test time is pretty much proportional to overall memory usage.
        That agrees with my analysis also - it is all about memory usage due
        to the duplicated (key, key) tuples. In fact it puts my machine into
        swap at the larger iteration numbers!

        Here is a version that stores them in a cache, which runs nice and
        fast!

        import time

        list_of_dicts = []
        keys = [str(x) for x in range(200000)]
        tuple_cache = {}
        prev_clk = time.clock()
        for i in range(20):
            my_dict = {}
            for key in keys:
                value = tuple_cache.setdefault((key, key), (key, key))
                my_dict[key] = value
            list_of_dicts.append(my_dict)
            new_clk = time.clock()
            print i, new_clk - prev_clk
            prev_clk = new_clk

        0 1.0
        1 0.39
        2 0.41
        3 0.39
        4 0.39
        5 0.4
        6 0.41
        7 0.39
        8 0.4
        9 0.4
        10 0.39
        11 0.4
        12 0.4
        13 0.39
        14 0.4
        15 0.4
        16 0.39
        17 0.4
        18 0.39
        19 0.41

        Note the first iteration is slower as it builds the tuple cache
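        The caching trick relies on dict.setdefault: it stores the tuple the
        first time it is seen and returns the already-stored one afterwards,
        so equal tuples collapse into a single shared object. A small
        stand-alone demonstration (the helper name cached_pair is mine):

```python
tuple_cache = {}

def cached_pair(key):
    # Returns the cached tuple if one equal to (key, key) is already
    # stored; otherwise stores this one and returns it.
    return tuple_cache.setdefault((key, key), (key, key))

a = cached_pair("7")
b = cached_pair("7")

assert a == ("7", "7")
assert a is b                   # one shared object, not two copies
assert len(tuple_cache) == 1    # only a single tuple is kept alive
```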

        --
        Nick Craig-Wood <nick@craig-wood.com> -- http://www.craig-wood.com/nick


        • Maciej Bliziński

          #5
          Re: Adding tuples to a dictionary

          Thank you for all the answers! My problem is solved even better than I
          expected!

          @Peter: Yes, the garbage collector was causing the slowdown. Switching
          it off sped the program up; each iteration was taking the same amount
          of time. I ran collection manually every 10 iterations to control
          memory usage.

          @Bosko: Right, the example code had a bug; appending the dictionary
          to the list was meant to be one nesting level up.

          @Nick: Using a tuple cache is great advice! My real program's tuples
          are even better suited to caching, as there are only about 10
          different tuples used as values. The tuple cache reduced memory
          consumption by about 3-4 times (judging by the gkrellm display).

          Thanks!
          Maciej
