Speed: bytecode vs C API calls

  • Jacek Generowicz

    Speed: bytecode vs C API calls

    I have a program in which I make very good use of a memoizer:

    def memoize(callable):
        cache = {}
        def proxy(*args):
            try: return cache[args]
            except KeyError: return cache.setdefault(args, callable(*args))
        return proxy

    which is functionally equivalent to

    class memoize:

        def __init__(self, callable):
            self.cache = {}
            self.callable = callable

        def __call__(self, *args):
            try: return self.cache[args]
            except KeyError:
                return self.cache.setdefault(args, self.callable(*args))

    though the latter is about twice as slow.

    I've got to the stage where my program is still not fast enough, and
    calls to the memoizer proxy are topping the profiler output table. So I
    thought I'd try to see whether I can speed it up by recoding it in C.

    The closure-based version seems impossible to recode in C (anyone know
    a way?), so I decided to write an extension type equivalent to "class
    memoize" above. This seems to run about 10% faster than the pure
    Python version ... which is still a lot slower than the pure Python
    closure-based version.

    I was expecting C extensions which make lots of calls to Python C API
    functions not to be spectacularly fast, but I'm still a little
    disappointed with what I've got. Does this sort of speedup (or rather,
    lack of it) seem right, to those of you experienced with this sort of
    thing? Or does it look like I'm doing it wrong?

    Could anyone suggest how I could squeeze more speed out of the
    memoizer? (I'll include the core of my memoize extension type below.)

    [What's the current state of the art wrt profiling Python, and its
    extension modules? I've tried using hotshot (though not extensively
    yet), but it seems to show even less information than profile, at
    first blush.]


    The memoize extension type is based around the following:

    typedef struct {
        PyObject_HEAD
        PyObject* x_attr;
        PyObject* cache;
        PyObject* fn;
    } memoizeObject;


    static int
    memoize_init(memoizeObject* self, PyObject* args, PyObject* kwds) {
        PyArg_ParseTuple(args, "O", &(self->fn));
        Py_INCREF(self->fn);
        self->cache = PyDict_New();
        return 0;
    }

    static PyObject*
    memoize_call(memoizeObject* self, PyObject* args) {
        PyObject* value = PyDict_GetItem(self->fn, args);
        if (! value) {
            PyDict_SetItem(self->cache, args,
                           PyObject_CallObject(self->fn, args));
            value = PyDict_GetItem(self->cache, args);
        }
        // Py_INCREF(value);
        return value;
    }



    Thanks,

  • Michael Hudson

    #2
    Re: Speed: bytecode vs C API calls

    Jacek Generowicz <jacek.generowicz@cern.ch> writes:
    > Could anyone suggest how I could squeeze more speed out of the
    > memoizer? (I'll include the core of my memoize extension type
    > below.)

    You *might* get a speed bump by making your memoized callable a method
    of the memoizer object rather than implementing tp_call.
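    (A rough Python-level sketch of that idea -- expose an ordinary method
    instead of __call__, so the C type would grow a method slot rather
    than tp_call; the names Memoizer and lookup are illustrative, not from
    the original post.)

    class Memoizer:
        def __init__(self, fn):
            self.fn = fn
            self.cache = {}
        def lookup(self, *args):
            # plain method in place of __call__/tp_call
            try:
                return self.cache[args]
            except KeyError:
                return self.cache.setdefault(args, self.fn(*args))

    # call sites then say m.lookup(x) rather than m(x)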

    Cheers,
    mwh

    --
    M-x psych[TAB][RETURN]
    -- try it


    • Jacek Generowicz

      #3
      Re: Speed: bytecode vs C API calls

      Jacek Generowicz <jacek.generowicz@cern.ch> writes:
      > I was expecting C extensions which make lots of calls to Python C API
      > functions not to be spectacularly fast, but I'm still a little
      > disappointed with what I've got. Does this sort of speedup (or rather,
      > lack of it) seem right, to those of you experienced with this sort of
      > thing? Or does it look like I'm doing it wrong?

      > static int
      > memoize_init(memoizeObject* self, PyObject* args, PyObject* kwds) {
      >     PyArg_ParseTuple(args, "O", &(self->fn));

      Obvious bug: &(self->cache) !!
      > Py_INCREF(self->fn);
      > self->cache = PyDict_New();
      > return 0;
      > }

      Fixing the bug gives me a factor of 3 speedup over the class-based
      version.

      That's a bit more like what I was expecting.


      • Terry Reedy

        #4
        Re: Speed: bytecode vs C API calls


        "Jacek Generowicz" <jacek.generowi cz@cern.ch> wrote in message
        news:tyf7k171jb q.fsf@pcepsft00 1.cern.ch...[color=blue]
        > I have a program in which I make very good use of a memoizer:
        >
        > def memoize(callabl e):
        > cache = {}
        > def proxy(*args):
        > try: return cache[args]
        > except KeyError: return cache.setdefaul t(args, callable(*args) )
        > return proxy[/color]
        ....[color=blue]
        > I've got to the stage where my program is still not fast enough, and
        > calls to the memoizer proxy are topping profiler output table. So I
        > thought I'd try to see whether I can speed it up by recoding it in C.[/color]

        Have you tried psyco on the above? Or rather, on the proxies?
        Can any of your callables be memoized by list rather than dict?
        (ie, any count or restricted int args?)

        If you have just a few wrapped functions, faster, special-cased proxies
        (with exact arg lists) would be feasible and possibly easier for psyco.
        Special-casing would include inlining the callable when possible
        -- or rather, inlining the memoization within the function.
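        (A minimal sketch of such a special-cased proxy, assuming a
        two-argument callable; the names are hypothetical:)

        def memoize_two_args(fn):
            cache = {}
            def proxy(a, b):
                # exact signature, so psyco could specialise on the
                # concrete argument types
                try:
                    return cache[a, b]
                except KeyError:
                    return cache.setdefault((a, b), fn(a, b))
            return proxy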

        TJR



        • Aahz

          #5
          Re: Speed: bytecode vs C API calls

          In article <tyf7k171jbq.fsf@pcepsft001.cern.ch>,
          Jacek Generowicz <jacek.generowicz@cern.ch> wrote:
          >
          >I have a program in which I make very good use of a memoizer:
          >
          > def memoize(callable):
          >     cache = {}
          >     def proxy(*args):
          >         try: return cache[args]
          >         except KeyError: return cache.setdefault(args, callable(*args))
          >     return proxy
          >
          >I've got to the stage where my program is still not fast enough, and
          >calls to the memoizer proxy are topping the profiler output table. So I
          >thought I'd try to see whether I can speed it up by recoding it in C.

          I'm wondering whether you're tackling something the wrong way or perhaps
          you need to throw hardware at the problem. Dicts are one of the most
          highly optimized parts of Python, and if your cache hit rate is higher
          than 90%, you'll be dealing with exceptions only rarely, so the speed of
          cache.setdefault() isn't where your time is going. I'm wondering whether
          you're hitting some kind of memory issue or something, or possibly you're
          having cache effects (in CPU/memory), or maybe even your key is poorly
          constructed so that you're getting hash duplicates.

          OTOH, if the cache hit rate is smaller than that, you'll be paying the
          penalty for raised exceptions *and* the callable() call, in which case
          redesigning that part will pay the highest dividends.
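          (If misses were common, one way to avoid paying for raised
          exceptions is a sentinel plus dict.get(); this sketch is
          illustrative, not from the original post:)

          _missing = object()

          def memoize_get(fn):
              cache = {}
              def proxy(*args):
                  value = cache.get(args, _missing)
                  if value is _missing:
                      value = cache[args] = fn(*args)
                  return value
              return proxy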
          --
          Aahz (aahz@pythoncraft.com) <*> http://www.pythoncraft.com/

          Weinberg's Second Law: If builders built buildings the way programmers wrote
          programs, then the first woodpecker that came along would destroy civilization.


          • Jacek Generowicz

            #6
            Re: Speed: bytecode vs C API calls

            "Terry Reedy" <tjreedy@udel.e du> writes:
            > "Jacek Generowicz" <jacek.generowicz@cern.ch> wrote in message
            > news:tyf7k171jbq.fsf@pcepsft001.cern.ch...
            > > I have a program in which I make very good use of a memoizer:
            > >
            > > def memoize(callable):
            > >     cache = {}
            > >     def proxy(*args):
            > >         try: return cache[args]
            > >         except KeyError: return cache.setdefault(args, callable(*args))
            > >     return proxy
            > ...
            > > I've got to the stage where my program is still not fast enough, and
            > > calls to the memoizer proxy are topping the profiler output table. So I
            > > thought I'd try to see whether I can speed it up by recoding it in C.
            >
            > Have you tried psyco on the above?

            Nope, but psyco would be politically unacceptable for my users, in
            this case. OTOH, I've been wanting to play with psyco for ages ...
            > Can any of your callables be memoized by list rather than dict?
            > (ie, any count or restricted int args?)

            You've lost me there.


            • Jacek Generowicz

              #7
              Re: Speed: bytecode vs C API calls

              aahz@pythoncraft.com (Aahz) writes:
              > In article <tyf7k171jbq.fsf@pcepsft001.cern.ch>,
              > Jacek Generowicz <jacek.generowicz@cern.ch> wrote:
              > >
              > >I have a program in which I make very good use of a memoizer:
              > >
              > > def memoize(callable):
              > >     cache = {}
              > >     def proxy(*args):
              > >         try: return cache[args]
              > >         except KeyError: return cache.setdefault(args, callable(*args))
              > >     return proxy
              > >
              > >I've got to the stage where my program is still not fast enough, and
              > >calls to the memoizer proxy are topping the profiler output table. So I
              > >thought I'd try to see whether I can speed it up by recoding it in C.
              >
              > I'm wondering whether you're tackling something the wrong way

              There _is_ always that possibility, but I don't think that's what my
              problem is here (it may well be elsewhere in my program).
              > or perhaps you need to throw hardware at the problem.

              The definition of "not fast enough" in this case, is "much slower than
              a competing C++ implementation", so throwing hardware at it advances
              me not one bit, I'm afraid.
              > Dicts are one of the most highly optimized parts of Python,

              Which is exactly why I'm using them extensively.
              > and if your cache hit rate is higher than 90%, you'll be dealing
              > with exceptions only rarely,

              My cache hit rate is well over 90%.
              > so the speed of cache.setdefault() isn't where your time is going.
              > I'm wondering whether you're hitting some kind of memory issue or
              > something, or possibly you're having cache effects (in CPU/memory),
              > or maybe even your key is poorly constructed so that you're getting
              > hash duplicates.

              No, I think that my memoizer works just fine. The "problem" is that
              I've memoized a lot of functions, and quite a few of them are being
              called in my inner loops. If I want to compete on speed with C++, I'm
              simply going to have to eliminate Python bytecode from my inner loops[*].

              [*] Unless my overall algorithm is completely wrong, of course.


              • Jacek Generowicz

                #8
                Re: Speed: bytecode vs C API calls

                Jacek Generowicz <jacek.generowicz@cern.ch> writes:

                > Jacek Generowicz <jacek.generowicz@cern.ch> writes:
                >
                > > static int
                > > memoize_init(memoizeObject* self, PyObject* args, PyObject* kwds) {
                > >     PyArg_ParseTuple(args, "O", &(self->fn));
                >
                > Obvious bug: &(self->cache) !!
                >
                > > Py_INCREF(self->fn);
                > > self->cache = PyDict_New();
                > > return 0;

                Apologies, I misreported the bug. self->fn is appropriate here
                ... using self->fn (instead of self->cache) as the dictionary in the
                other function is what was wrong, of course.
                > Fixing the bug gives me a factor of 3 speedup over the class-based
                > version.

                This is still true.


                • Aahz

                  #9
                  Re: Speed: bytecode vs C API calls

                  In article <tyfu14az6so.fsf@pcepsft001.cern.ch>,
                  Jacek Generowicz <jacek.generowicz@cern.ch> wrote:
                  >aahz@pythoncraft.com (Aahz) writes:
                  >> In article <tyf7k171jbq.fsf@pcepsft001.cern.ch>,
                  >> Jacek Generowicz <jacek.generowicz@cern.ch> wrote:
                  >>>
                  >>>I have a program in which I make very good use of a memoizer:
                  >>>
                  >>> def memoize(callable):
                  >>>     cache = {}
                  >>>     def proxy(*args):
                  >>>         try: return cache[args]
                  >>>         except KeyError: return cache.setdefault(args, callable(*args))
                  >>>     return proxy
                  >>>
                  >>>I've got to the stage where my program is still not fast enough, and
                  >>>calls to the memoizer proxy are topping the profiler output table. So I
                  >>>thought I'd try to see whether I can speed it up by recoding it in C.
                  >>
                  >> I'm wondering whether you're tackling something the wrong way
                  >
                  >There _is_ always that possibility, but I don't think that's what my
                  >problem is here (it may well be elsewhere in my program).
                  >
                  >> or perhaps you need to throw hardware at the problem.
                  >
                  >The definition of "not fast enough" in this case, is "much slower than
                  >a competing C++ implementation", so throwing hardware at it advances
                  >me not one bit, I'm afraid.
                  >
                  >> Dicts are one of the most highly optimized parts of Python,
                  >
                  >Which is exactly why I'm using them extensively.
                  >
                  >> and if your cache hit rate is higher than 90%, you'll be dealing
                  >> with exceptions only rarely,
                  >
                  >My cache hit rate is well over 90%.

                  All your comments taken as a set make no sense. There's just no way
                  that a bunch of dict accesses in Python can be much slower than a C++
                  program. (C program, *maybe* -- not C++)
                  >> so the speed of cache.setdefault() isn't where your time is going.
                  >> I'm wondering whether you're hitting some kind of memory issue or
                  >> something, or possibly you're having cache effects (in CPU/memory),
                  >> or maybe even your key is poorly constructed so that you're getting
                  >> hash duplicates.
                  >
                  >No, I think that my memoizer works just fine. The "problem" is that
                  >I've memoized a lot of functions, and quite a few of them are being
                  >called in my inner loops. If I want to compete on speed with C++, I'm
                  >simply going to have to eliminate Python bytecode from my inner loops[*].
                  >
                  >[*] Unless my overall algorithm is completely wrong, of course.

                  Waitasec .... okay, I see what your problem is, assuming you're
                  describing it correctly. The problem is that you're calling Python
                  functions in an inner loop. I think you're going to need to write a
                  solution that doesn't call functions (function unrolling, so to speak).
                  Instead of returning a memoizer function, just use a dict of dicts.
                  --
                  Aahz (aahz@pythoncraft.com) <*> http://www.pythoncraft.com/

                  Weinberg's Second Law: If builders built buildings the way programmers wrote
                  programs, then the first woodpecker that came along would destroy civilization.


                  • Michael Hudson

                    #10
                    Re: Speed: bytecode vs C API calls

                    "Terry Reedy" <tjreedy@udel.e du> writes:
                    > "Jacek Generowicz" <jacek.generowicz@cern.ch> wrote in message
                    > news:tyf7k171jbq.fsf@pcepsft001.cern.ch...
                    > > I have a program in which I make very good use of a memoizer:
                    > >
                    > > def memoize(callable):
                    > >     cache = {}
                    > >     def proxy(*args):
                    > >         try: return cache[args]
                    > >         except KeyError: return cache.setdefault(args, callable(*args))
                    > >     return proxy
                    > ...
                    > > I've got to the stage where my program is still not fast enough, and
                    > > calls to the memoizer proxy are topping the profiler output table. So I
                    > > thought I'd try to see whether I can speed it up by recoding it in C.
                    >
                    > Have you tried psyco on the above? Or rather, on the proxies?

                    I doubt that would help -- the largest overhead is likely the function
                    call, and I think calling across the psyco/Python boundary is
                    expensive.

                    Cheers,
                    mwh

                    --
                    Screaming 14-year-old boys attempting to prove to each other that
                    they are more 3133t than j00.
                    -- Reason #8 for quitting slashdot today, from


                    • Jacek Generowicz

                      #11
                      Re: Speed: bytecode vs C API calls

                      aahz@pythoncraft.com (Aahz) writes:
                      > Waitasec .... okay, I see what your problem is, assuming you're
                      > describing it correctly. The problem is that you're calling Python
                      > functions in an inner loop.

                      ... indeedy ...
                      > I think you're going to need to write a solution that doesn't call
                      > functions (function unrolling, so to speak). Instead of returning a
                      > memoizer function, just use a dict of dicts.

                      Sounds interesting. But how do I get the dictionary to use the
                      function which it is memoizing, to calculate values which haven't been
                      cached yet?

                      Hmm ... maybe by subclassing dict, I can add the necessary
                      functionality ... but even then I don't see how to avoid wrapping the
                      dictionary access in a function call ... and then we're back to square
                      one.


                      • Peter Otten

                        #12
                        Re: Speed: bytecode vs C API calls

                        Jacek Generowicz wrote:
                        > aahz@pythoncraft.com (Aahz) writes:
                        >
                        >> Waitasec .... okay, I see what your problem is, assuming you're
                        >> describing it correctly. The problem is that you're calling Python
                        >> functions in an inner loop.
                        >
                        > ... indeedy ...
                        >
                        >> I think you're going to need to write a solution that doesn't call
                        >> functions (function unrolling, so to speak). Instead of returning a
                        >> memoizer function, just use a dict of dicts.
                        >
                        > Sounds interesting. But how do I get the dictionary to use the
                        > function which it is memoizing, to calculate values which haven't been
                        > cached yet?

                        Guessing into the blue:

                        try:
                            result = cache[fun, args]
                        except KeyError:
                            result = cache[fun, args] = fun(*args)

                        should make a fast inner loop if cache misses are rare.
                        > Hmm ... maybe by subclassing dict, I can add the necessary
                        > functionality ... but even then I don't see how to avoid wrapping the
                        > dictionary access in a function call ... and then we're back to square
                        > one.

                        Subclassing dict will not make it faster. Maybe it's time to show some
                        inner-loop lookalike to the wider public for more to-the-point
                        suggestions...

                        Peter


                        • Terry Reedy

                          #13
                          Re: Speed: bytecode vs C API calls


                          "Jacek Generowicz" <jacek.generowi cz@cern.ch> wrote in message
                          news:tyfy8tmz7h q.fsf@pcepsft00 1.cern.ch...[color=blue]
                          > "Terry Reedy" <tjreedy@udel.e du> writes:[color=green]
                          > > Can any of your callables be memoized by list rather than dict?
                          > > (ie, any count or restricted int args?)[/color]
                          >
                          > You've lost me there.[/color]

                          Lookup tables. Dicts can memoize any function. Lists can memoize or
                          partially memoize functions with one or more periodic domains. Perhaps I am
                          being too obvious or simple for *you*, but I have seen people memoize fib()
                          or fac() with a generic dict-based wrapper, which is overkill compared to a
                          simple list.
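                          (A minimal sketch of the list idea for fib(), whose domain is the
                          non-negative integers; the example is illustrative:)

                          def fib(n, _cache=[0, 1]):
                              # the list index itself is the key; grow the table on demand
                              while len(_cache) <= n:
                                  _cache.append(_cache[-2] + _cache[-1])
                              return _cache[n]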

                          In another post, you said

                          > If I want to compete on speed with C++, I'm
                          > simply going to have to eliminate Python bytecode from my inner loops[*].
                          > [*] Unless my overall algorithm is completely wrong, of course.

                          For speed, you need to eliminate *executed* and *expensive* bytecodes, not
                          static, possibly skipped-over, bytecodes. I'll take that as what you meant.
                          It is certainly what our comments are aimed at. Hence my other suggestion
                          to eliminate a layer of (expensive) function calls by combining value
                          calculation and storage in one function (more space, less time). More
                          specific suggestions will require more specific examples.
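                          (A sketch of that combination -- cache storage inlined in the
                          function itself, so a hit costs one dict lookup and no extra call
                          layer; the function norm and its body are hypothetical:)

                          def norm(v, _cache={}):
                              # v is assumed hashable, e.g. a tuple of numbers
                              try:
                                  return _cache[v]
                              except KeyError:
                                  result = _cache[v] = sum([x * x for x in v]) ** 0.5
                                  return result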

                          Terry J. Reedy



                          • Aahz

                            #14
                            Function unrolling (was Re: Speed: bytecode vs C API calls)

                            In article <tyfd6ayylhy.fsf@pcepsft001.cern.ch>,
                            Jacek Generowicz <jacek.generowicz@cern.ch> wrote:
                            >aahz@pythoncraft.com (Aahz) writes:
                            >>
                            >> I think you're going to need to write a solution that doesn't call
                            >> functions (function unrolling, so to speak). Instead of returning a
                            >> memoizer function, just use a dict of dicts.
                            >
                            >Sounds interesting. But how do I get the dictionary to use the
                            >function which it is memoizing, to calculate values which haven't been
                            >cached yet?

                            Same way you decide which function to call currently. Essentially, you
                            do something like this:

                            def foo():
                                pass

                            def bar():
                                pass

                            lookup = {
                                'foo': {'func': foo, 'cache': {}},
                                'bar': {'func': bar, 'cache': {}},
                            }

                            for operation, args in WorkQueue:
                                curr = lookup[operation]
                                try:
                                    result = curr['cache'][args]
                                except KeyError:
                                    result = curr['cache'].setdefault(args, curr['func'](*args))

                            (Note: untested; you'll have to fill in the blanks)
                            --
                            Aahz (aahz@pythoncraft.com) <*> http://www.pythoncraft.com/

                            Weinberg's Second Law: If builders built buildings the way programmers wrote
                            programs, then the first woodpecker that came along would destroy civilization.


                            • Jacek Generowicz

                              #15
                              Re: Speed: bytecode vs C API calls

                              "Terry Reedy" <tjreedy@udel.e du> writes:
                              > "Jacek Generowicz" <jacek.generowicz@cern.ch> wrote in message
                              > news:tyfy8tmz7hq.fsf@pcepsft001.cern.ch...
                              > > "Terry Reedy" <tjreedy@udel.edu> writes:
                              > > > Can any of your callables be memoized by list rather than dict?
                              > > > (ie, any count or restricted int args?)
                              > >
                              > > You've lost me there.
                              >
                              > Lookup tables. Dicts can memoize any function. Lists can memoize or
                              > partially memoize functions with one or more periodic domains. Perhaps I am
                              > being too obvious or simple for *you*, but I have seen people memoize fib()
                              > or fac() with a generic dict-based wrapper, which is overkill compared to a
                              > simple list.

                              The arguments passed to the functions I have memoized are usually
                              types, so, no, I cannot memoize them by list.
                              > In another post, you said
                              >
                              > > If I want to compete on speed with C++, I'm
                              > > simply going to have to eliminate Python bytecode from my inner loops[*].
                              > > [*] Unless my overall algorithm is completely wrong, of course.
                              >
                              > For speed, you need to eliminate *executed* and *expensive* bytecodes, not
                              > static, possibly skipped-over, bytecodes. I'll take that as what you meant.
                              > It is certainly what our comments are aimed at. Hence my other suggestion
                              > to eliminate a layer of (expensive) function calls by combining value
                              > calculation and storage in one function (more space, less time).

                              You are suggesting that I hand-memoize my functions, rather than using
                              the generic memoizer I posted at the top of the thread, in order to
                              save, in the case of cache misses, the function call overhead to the
                              memoized function? If I had a high proportion of cache misses, then I
                              suspect that would be worthwhile. In my case, however, I doubt it
                              would have any appreciable effect. Almost all the calls to the
                              memoized functions are cache hits, so effectively there is no value
                              calculation ... it's pretty much all lookup, so my "expensive"
                              function is *effectively*:


                              def proxy(*args):
                                  try:
                                      return cache[args]
                                  except KeyError: pass

                              (where proxy is a closure over cache). Yes, there is something other
                              than "pass" in the handler, but we (essentially) never go there.

                              There's really not much left to play with (though I'm open to
                              suggestions), which is why I wanted to try recoding in C.
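                              (One way to go further without C is to hoist the cache out and
                              index it directly in the inner loop, so that a hit involves no
                              Python-level call at all; compute() and data here are
                              hypothetical:)

                              def compute(x, y):
                                  return x * y        # stand-in for the real callable

                              data = [(1, 2), (3, 4), (1, 2)]
                              cache = {}
                              for args in data:
                                  try:
                                      value = cache[args]
                                  except KeyError:
                                      value = cache[args] = compute(*args)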

