Defaultdict and speed

Collapse
This topic is closed.
X
X
 
  • Time
  • Show
Clear All
new posts
  • bearophileHUGS@lycos.com

    Defaultdict and speed

    This post sums some things I have written in another Python newsgroup.
    More than 40% of the times I use defaultdict like this, to count
    things:
    >>from collections import defaultdict as DD
    >>s = "abracadabr a"
    >>d = DD(int)
    >>for c in s: d[c] += 1
    ....
    >>d
    defaultdict(<ty pe 'int'>, {'a': 5, 'r': 2, 'b': 2, 'c': 1, 'd': 1})

    But I have seen that if keys are quite sparse, and int() becomes called
    too much often, then code like this is faster:
    >>d = {}
    >>for c in s:
    .... if c in d: d[c] += 1
    .... else: d[c] = 1
    ....
    >>d
    {'a': 5, 'r': 2, 'b': 2, 'c': 1, 'd': 1}

    So to improve the speed for such special but common situation, the
    defaultdict can manage the case with default_factory =int in a different
    and faster way.

    Bye,
    bearophile

  • Klaas

    #2
    Re: Defaultdict and speed

    bearophileHUGS@ lycos.com wrote:
    This post sums some things I have written in another Python newsgroup.
    More than 40% of the times I use defaultdict like this, to count
    things:
    >
    >from collections import defaultdict as DD
    >s = "abracadabr a"
    >d = DD(int)
    >for c in s: d[c] += 1
    ...
    >d
    defaultdict(<ty pe 'int'>, {'a': 5, 'r': 2, 'b': 2, 'c': 1, 'd': 1})
    >
    But I have seen that if keys are quite sparse, and int() becomes called
    too much often, then code like this is faster:
    >
    >d = {}
    >for c in s:
    ... if c in d: d[c] += 1
    ... else: d[c] = 1
    ...
    >d
    {'a': 5, 'r': 2, 'b': 2, 'c': 1, 'd': 1}
    >
    So to improve the speed for such special but common situation, the
    defaultdict can manage the case with default_factory =int in a different
    and faster way.
    Benchmarks? I doubt it is worth complicating defaultdict's code (and
    slowing down other uses of the class) for this improvement...
    especially when the faster alternative is so easy to code. If that
    performance difference matters, you would likely find more fruitful
    gains in coding it in c, using PyDict_SET.

    -Mike

    Comment

    • bearophileHUGS@lycos.com

      #3
      Re: Defaultdict and speed

      Klaas wrote:
      Benchmarks?
      There is one (fixed in a succesive post) in the original thread I was
      referring to:

      If you want I can give more of them (and a bit less silly, with strings
      too, etc).

      def ddict(n):
      t = clock()
      d = defaultdict(int )
      for i in xrange(n):
      d[i] += 1
      print round(clock()-t, 2)

      def ndict(n):
      t = clock()
      d = {}
      for i in xrange(n):
      if i in d:
      d[i] += 1
      else:
      d[i] = 1
      print round(clock()-t, 2)

      ddict(300000)
      ndict(300000)

      (and slowing down other uses of the class)
      All it has to do is to cheek if the default_factory is an int, it's
      just an "if" done only once, so I don't think it slows down the other
      cases significantly.

      especially when the faster alternative is so easy to code.
      The faster alternative is easy to create, but the best faster
      alternative can't be coded, because if you code it in Python you need
      two hash accesses, while the defaultdict can require only one of them:

      if n in d:
      d[n] += 1
      else:
      d[n] = 1

      >If that performance difference matters,
      With Python it's usually difficult to tell if some performance
      difference matters. Probably in some programs it may matter, but in
      most other programs it doesn't matter. This is probably true for all
      the performance tweaks I may invent in the future too.

      you would likely find more fruitful
      gains in coding it in c, using PyDict_SET
      I've just started creating a C lib for related purposes, I'd like to
      show it to you all on c.l.p, but first I have to find a place to put it
      on :-) (It's not easy to find a suitable place, it's a python + c +
      pyd, and it's mostly an exercise).

      Bye,
      bearophile

      Comment

      • Klaas

        #4
        Re: Defaultdict and speed

        bearophileHUGS@ lycos.com wrote:
        Klaas wrote:
        Benchmarks?
        >
        There is one (fixed in a succesive post) in the original thread I was
        referring to:

        If you want I can give more of them (and a bit less silly, with strings
        too, etc).
        <>

        Sorry, I didn't see any numbers. I ran it myself and found the
        defaultdict version to be approximately twice as slow. This, as you
        suggest, is the worst case, as you are using integers as hash keys
        (essentially no hashing cost) and are accessing each key exactly once.
        >
        (and slowing down other uses of the class)
        >
        All it has to do is to cheek if the default_factory is an int, it's
        just an "if" done only once, so I don't think it slows down the other
        cases significantly.
        Once it makes that check, surely it must check a flag or some such
        every time it is about to invoke the key constructor function?
        especially when the faster alternative is so easy to code.
        >
        The faster alternative is easy to create, but the best faster
        alternative can't be coded, because if you code it in Python you need
        two hash accesses, while the defaultdict can require only one of them:
        >
        if n in d:
        d[n] += 1
        else:
        d[n] = 1
        How do you think that defaultdict is implemented? It must perform the
        dictionary access to determine that the value is missing. It must then
        go through the method dispatch machinery to look for the __missing__
        method, and execute it. If you _really_ want to make this fast, you
        should write a custom distionary subclass which accepts an object (not
        function) as default value, and assigns it directly.
        If that performance difference matters,
        >
        With Python it's usually difficult to tell if some performance
        difference matters. Probably in some programs it may matter, but in
        most other programs it doesn't matter. This is probably true for all
        the performance tweaks I may invent in the future too.
        In general, I agree, but in this case it is quite clear. The only
        possible speed up is for defaultdict(int ). The re-write using regular
        dicts is trivial, hence, for given piece of code is it quite clear
        whether the performance gain is important. This is not an
        interpreter-wide change, after all.

        Consider also that the performance gains would be relatively
        unsubstantial when more complicated keys and a more realistic data
        distribution is used. Consider further that the __missing__ machinery
        would still be called. Would the resulting construct be faster than
        the use of a vanilla dict? I doubt it.

        But you can prove me wrong by implementing it and benchmarking it.
        you would likely find more fruitful
        gains in coding it in c, using PyDict_SET
        >
        I've just started creating a C lib for related purposes, I'd like to
        show it to you all on c.l.p, but first I have to find a place to put it
        on :-) (It's not easy to find a suitable place, it's a python + c +
        pyd, and it's mostly an exercise).
        Would suggesting a webpage be too trite?

        -Mike

        Comment

        Working...