dict.get and str.xsplit

Collapse
This topic is closed.
X
X
 
  • Time
  • Show
Clear All
new posts
  • bearophileHUGS@lycos.com

    dict.get and str.xsplit

    When I use dictionaries I find the dict.get() method very handy, and
    I'd like to use it often, but I think it's quite slow. This is a small
    benchmark:

    from random import randrange
    from timeit import default_timer as clock

    def main(N, M):
    keys = list(set(randra nge(2*N) for n in xrange(N)))
    n2 = len(keys) / 2
    keys1, keys2 = keys[:n2], keys[n2:]
    adict = dict.fromkeys(k eys1)

    t = clock()
    for _ in xrange(M):
    for k in keys1:
    r = adict.get(k, None)
    for k in keys2:
    r = adict.get(k, None)
    print round(clock() - t, 2), "s"

    t = clock()
    for _ in xrange(M):
    for k in keys1:
    if k in adict:
    r = adict[k]
    else:
    r = None
    for k in keys2:
    if k in adict:
    r = adict[k]
    else:
    r = None
    print round(clock() - t, 2), "s"

    main(40000, 30)


    On my old PC the output is:
    2.36 s
    1.59 s
    (Your timings may be 4-5 times smaller, so you can tweak N and M to
    have significant results).
    And note that the if/then version is faster even if it calls the hash
    function two times, while get() is supposed to do it once (in this
    case the hashing function is very simple and fast. If the keys are
    long strings the results of a similar benchmark may be different,
    because the computing of the hashing function may become the slowest
    part of that code).

    This is a real difference, that has real impact on the programs I
    write, so I often use the if/else approach, despite the dict.get()
    method being semantically fitter and shorter.
    So can the dict.get() method be speed up? And if not, why?

    ------------------

    Another method I'd like to see, is the str.xsplit() I was talking
    about time ago too. It's like str.split() but it's lazy. It avoids to
    build a potentially huge list. In real D programs I have seen such
    string function may speed up heavy-string-processing code (the kind of
    code you may want to use Python/Perl/Ruby for) up to 2-3 times. Seeing
    how such kind of processing is common in Python I think this isn't an
    optimization to ignore.
    (I am now getting better in writing C code, so in few months I may
    even become able to submit a xsplit patch myself. I think it's not too
    much complex to write, it can borrow lot of code from the str.split
    method).

    Bye, and thank you for your kind attention,
    bearophile
  • Marc 'BlackJack' Rintsch

    #2
    Re: dict.get and str.xsplit

    On Tue, 26 Feb 2008 06:02:12 -0800, bearophileHUGS wrote:
    This is a real difference, that has real impact on the programs I
    write, so I often use the if/else approach, despite the dict.get()
    method being semantically fitter and shorter.
    So can the dict.get() method be speed up? And if not, why?
    I guess it's the method lookup that's the slow part. Factor it out of the
    loop and measure again::

    adict_get = adict.get
    for _ in xrange(M):
    for k in keys1:
    r = adict_get(k, None)
    for k in keys2:
    r = adict_get(k, None)

    Ciao,
    Marc 'BlackJack' Rintsch

    Comment

    • Steve Holden

      #3
      Re: dict.get and str.xsplit

      Marc 'BlackJack' Rintsch wrote:
      On Tue, 26 Feb 2008 06:02:12 -0800, bearophileHUGS wrote:
      >
      >This is a real difference, that has real impact on the programs I
      >write, so I often use the if/else approach, despite the dict.get()
      >method being semantically fitter and shorter.
      >So can the dict.get() method be speed up? And if not, why?
      >
      I guess it's the method lookup that's the slow part. Factor it out of the
      loop and measure again::
      >
      adict_get = adict.get
      for _ in xrange(M):
      for k in keys1:
      r = adict_get(k, None)
      for k in keys2:
      r = adict_get(k, None)
      And think about using the timeit module, since it's been included among
      the batteries for your convenience, and it removes some common pitfalls
      with cross-platform timings.

      regards
      Steve
      --
      Steve Holden +1 571 484 6266 +1 800 494 3119
      Holden Web LLC http://www.holdenweb.com/

      Comment

      • bearophileHUGS@lycos.com

        #4
        Re: dict.get and str.xsplit

        Marc 'BlackJack' Rintsch:
        I guess it's the method lookup that's the slow part. Factor it out of the
        loop and measure again::
        I did know of that optimization, but sometimes I "forget" about it...
        The new timings:

        Output timings, best of 3, unloaded CPU:
        2.32 s with adict.get
        1.56 s with "in" + if/then/else
        1.78 s with adict_get

        It's slower still, despite doing the lookup two times half of the
        times (half keys are present in the test set). The "in" is an
        operator, it's faster than a method call, but I don't understand the
        other details. Now the difference between 1.78 and 1.56 is small
        enough, so probably now it's not much noticeable in practice in real
        programs, so my original argument is mostly dead now :-) In the code
        points where speed matters instead of a single line with a get() I can
        use two lines to create a local adict_get.

        Bye and thank you,
        bearophile

        Comment

        • marek.rocki@wp.pl

          #5
          Re: dict.get and str.xsplit

          bearophileH...@ lycos.com napisa³(a):
          It's slower still, despite doing the lookup two times half of the
          times (half keys are present in the test set). The "in" is an
          operator, it's faster than a method call, but I don't understand the
          other details. Now the difference between 1.78 and 1.56 is small
          enough, so probably now it's not much noticeable in practice in real
          programs, so my original argument is mostly dead now :-) In the code
          points where speed matters instead of a single line with a get() I can
          use two lines to create a local adict_get.
          AFAIK, method/function call is generally slow in Python (similarly to
          the dot attribute lookup). As for the original prooblem, why not use
          defaultdict? I think it's the most idiomatic way to achieve what we
          want. And also the fastest one, according to my quick-and-dirty tests:

          from collections import defaultdict

          def main(N, M):

          # <snip>

          addict = defaultdict(lam bda: None, adict)
          t = clock()
          for _ in xrange(M):
          for k in keys1:
          r = addict[k]
          for k in keys2:
          r = addict[k]
          print round(clock() - t, 2), "s"

          adict.get: 3.12 s
          adict_get: 2.24 s
          in: 1.62 s
          defaultdict: 1.05 s

          Comment

          • bearophileHUGS@lycos.com

            #6
            Re: dict.get and str.xsplit

            marek.ro...@wp. pl:
            As for the original prooblem, why not use
            defaultdict? I think it's the most idiomatic way to achieve what we
            want. And also the fastest one, according to my quick-and-dirty tests:
            It adds the new keys, I can't accept that:
            >>from collections import defaultdict as dd
            >>adict = {1:2, 3:4}
            >>addict = dd(lambda: None, adict)
            >>r = addict[10]
            >>addict
            defaultdict(<fu nction <lambdaat 0x008E6FB0>, {1: 2, 10: None, 3: 4})

            Bye,
            bearophile

            Comment

            • marek.rocki@wp.pl

              #7
              Re: dict.get and str.xsplit

              bearophileH...@ lycos.com napisa³(a):
              marek.ro...@wp. pl:
              As for the original prooblem, why not use
              defaultdict? I think it's the most idiomatic way to achieve what we
              want. And also the fastest one, according to my quick-and-dirty tests:
              >
              It adds the new keys, I can't accept that:
              Right, my fault not noticing that. I experimented further with
              __missing__ method and with "ask forgiveness" idiom (try... except
              KeyError) and they were both noticeably slower. So checking with "in"
              may be the most efficient way we have.

              Regards,
              Marek

              Comment

              Working...