Feature suggestion: sum() ought to use a compensated summation algorithm

Collapse
This topic is closed.
X
X
 
  • Time
  • Show
Clear All
new posts
  • =?ISO-8859-1?Q?Szabolcs_Horv=E1t?=

    Feature suggestion: sum() ought to use a compensated summation algorithm


    I did the following calculation: Generated a list of a million random
    numbers between 0 and 1, constructed a new list by subtracting the mean
    value from each number, and then calculated the mean again.

    The result should be 0, but of course it will differ from 0 slightly
    because of rounding errors.

    However, I noticed that the simple Python program below gives a result
    of ~ 10^-14, while an equivalent Mathematica program (also using double
    precision) gives a result of ~ 10^-17, i.e. three orders of magnitude
    more precise.

    Here's the program (pardon my style, I'm a newbie/occasional user):

    from random import random

    data = [random() for x in xrange(1000000)]

    mean = sum(data)/len(data)
    print sum(x - mean for x in data)/len(data)

    A little research shows that Mathematica uses a "compensate d summation"
    algorithm. Indeed, using the algorithm described at

    gives us a result around ~ 10^-17:

    def compSum(arr):
    s = 0.0
    c = 0.0
    for x in arr:
    y = x-c
    t = s+y
    c = (t-s) - y
    s = t
    return s

    mean = compSum(data)/len(data)
    print compSum(x - mean for x in data)/len(data)


    I thought that it would be very nice if the built-in sum() function used
    this algorithm by default. Has this been brought up before? Would this
    have any disadvantages (apart from a slight performance impact, but
    Python is a high-level language anyway ...)?

    Szabolcs Horvát
  • Arnaud Delobelle

    #2
    Re: Feature suggestion: sum() ought to use a compensated summation algorithm

    Szabolcs Horvát <szhorvat@gmail .comwrites:

    [...]
    A little research shows that Mathematica uses a "compensate d
    summation" algorithm. Indeed, using the algorithm described at

    gives us a result around ~ 10^-17:
    >
    def compSum(arr):
    s = 0.0
    c = 0.0
    for x in arr:
    y = x-c
    t = s+y
    c = (t-s) - y
    s = t
    return s
    >
    mean = compSum(data)/len(data)
    print compSum(x - mean for x in data)/len(data)
    >
    >
    I thought that it would be very nice if the built-in sum() function
    used this algorithm by default. Has this been brought up before?
    Would this have any disadvantages (apart from a slight performance
    impact, but Python is a high-level language anyway ...)?
    >
    Szabolcs Horvát
    sum() works for any sequence of objects with an __add__ method, not
    just floats! Your algorithm is specific to floats.

    --
    Arnaud

    Comment

    • Ivan Illarionov

      #3
      Re: Feature suggestion: sum() ought to use a compensated summationalgori thm

      On Sat, 03 May 2008 18:50:34 +0200, Szabolcs Horvát wrote:
      I did the following calculation: Generated a list of a million random
      numbers between 0 and 1, constructed a new list by subtracting the mean
      value from each number, and then calculated the mean again.
      >
      The result should be 0, but of course it will differ from 0 slightly
      because of rounding errors.
      >
      However, I noticed that the simple Python program below gives a result
      of ~ 10^-14, while an equivalent Mathematica program (also using double
      precision) gives a result of ~ 10^-17, i.e. three orders of magnitude
      more precise.
      >
      Here's the program (pardon my style, I'm a newbie/occasional user):
      >
      from random import random
      >
      data = [random() for x in xrange(1000000)]
      >
      mean = sum(data)/len(data)
      print sum(x - mean for x in data)/len(data)
      >
      A little research shows that Mathematica uses a "compensate d summation"
      algorithm. Indeed, using the algorithm described at
      http://en.wikipedia.org/wiki/Kahan_summation_algorithm gives us a result
      around ~ 10^-17:
      >
      def compSum(arr):
      s = 0.0
      c = 0.0
      for x in arr:
      y = x-c
      t = s+y
      c = (t-s) - y
      s = t
      return s
      >
      mean = compSum(data)/len(data)
      print compSum(x - mean for x in data)/len(data)
      >
      >
      I thought that it would be very nice if the built-in sum() function used
      this algorithm by default. Has this been brought up before? Would this
      have any disadvantages (apart from a slight performance impact, but
      Python is a high-level language anyway ...)?
      >
      Szabolcs Horvát
      Built-in sum should work with everything, not just floats. I think it
      would be useful addition to standard math module.

      If you know C you could write a patch to mathmodule.c and submit it to
      Python devs.

      --
      Ivan

      Comment

      • Nick Craig-Wood

        #4
        Re: Feature suggestion: sum() ought to use a compensated summation algorithm

        Szabolcs Horvát <szhorvat@gmail .comwrote:
        I did the following calculation: Generated a list of a million random
        numbers between 0 and 1, constructed a new list by subtracting the mean
        value from each number, and then calculated the mean again.
        >
        The result should be 0, but of course it will differ from 0 slightly
        because of rounding errors.
        >
        However, I noticed that the simple Python program below gives a result
        of ~ 10^-14, while an equivalent Mathematica program (also using double
        precision) gives a result of ~ 10^-17, i.e. three orders of magnitude
        more precise.
        >
        Here's the program (pardon my style, I'm a newbie/occasional user):
        >
        from random import random
        >
        data = [random() for x in xrange(1000000)]
        >
        mean = sum(data)/len(data)
        print sum(x - mean for x in data)/len(data)
        >
        A little research shows that Mathematica uses a "compensate d summation"
        algorithm. Indeed, using the algorithm described at

        gives us a result around ~ 10^-17:
        I was taught in my numerical methods course (about 20 years ago now!)
        that the way to do this sum with most accuracy is to sum from the
        smallest magnitude to the largest magnitude.

        Eg
        >>from random import random
        >>data = [random() for x in xrange(1000000)]
        >>mean = sum(data)/len(data)
        >>print sum(x - mean for x in data)/len(data)
        1.64152134108e-14
        >>mean = sum(sorted(data ))/len(data)
        >>print sum(x - mean for x in data)/len(data)
        5.86725534824e-15
        >>>
        It isn't as good as the compensated sum but it is easy!
        I thought that it would be very nice if the built-in sum() function used
        this algorithm by default. Has this been brought up before? Would this
        have any disadvantages (apart from a slight performance impact, but
        Python is a high-level language anyway ...)?
        sum() gets used for any numerical types not just floats...

        --
        Nick Craig-Wood <nick@craig-wood.com-- http://www.craig-wood.com/nick

        Comment

        • =?ISO-8859-1?Q?Szabolcs_Horv=E1t?=

          #5
          Re: Feature suggestion: sum() ought to use a compensated summationalgori thm

          Arnaud Delobelle wrote:
          >
          sum() works for any sequence of objects with an __add__ method, not
          just floats! Your algorithm is specific to floats.
          This occurred to me also, but then I tried

          sum(['abc', 'efg'], '')

          and it did not work. Or is this just a special exception to prevent the
          misuse of sum to join strings? (As I said, I'm only an occasional user.)

          Generally, sum() seems to be most useful for numeric types (i.e. those
          that form a group with respect to __add__ and __neg__/__sub__), which
          may be either exact (e.g. integers) or inexact (e.g. floating point
          types). For exact types it does not make sense to use compensated
          summation (though it wouldn't give an incorrect answer, either), and
          sum() cannot decide whether a user-defined type is exact or inexact.

          But I guess that it would still be possible to make sum() use
          compensated summation for built-in floating point types (float/complex).

          Or, to go further, provide a switch to choose between the two methods,
          and make use compensated summation for float/complex by default. (But
          perhaps people would consider this last alternative a bit too messy.)

          (Just some thoughts ...)

          Comment

          • hdante

            #6
            Re: Feature suggestion: sum() ought to use a compensated summationalgori thm

            On May 3, 3:44 pm, Szabolcs Horvát <szhor...@gmail .comwrote:
            Arnaud Delobelle wrote:
            >
            sum() works for any sequence of objects with an __add__ method, not
            just floats!  Your algorithm is specific to floats.
            >
            This occurred to me also, but then I tried
            >
            sum(['abc', 'efg'], '')
            >
            and it did not work.  Or is this just a special exception to prevent the
              misuse of sum to join strings?  (As I said, I'm only an occasional user.)
            >
            Generally, sum() seems to be most useful for numeric types (i.e. those
            that form a group with respect to __add__ and __neg__/__sub__), which
            may be either exact (e.g. integers) or inexact (e.g. floating point
            types).  For exact types it does not make sense to use compensated
            summation (though it wouldn't give an incorrect answer, either), and
            sum() cannot decide whether a user-defined type is exact or inexact.
            >
            But I guess that it would still be possible to make sum() use
            compensated summation for built-in floating point types (float/complex).
            >
            Or, to go further, provide a switch to choose between the two methods,
            and make use compensated summation for float/complex by default.  (But
            perhaps people would consider this last alternative a bit too messy.)
            >
            (Just some thoughts ...)
            The benefits should be weighted considering the common case. For
            example, do you find an error of 10^-14 unbearable ? How many times
            someone will add 1.000.000 numbers in a typical scientific application
            written in python ?

            I believe that moving this to third party could be better. What about
            numpy ? Doesn't it already have something similar ?

            Comment

            • Arnaud Delobelle

              #7
              Re: Feature suggestion: sum() ought to use a compensated summation algorithm

              Szabolcs Horvát <szhorvat@gmail .comwrites:
              Arnaud Delobelle wrote:
              >>
              >sum() works for any sequence of objects with an __add__ method, not
              >just floats! Your algorithm is specific to floats.
              >
              This occurred to me also, but then I tried
              >
              sum(['abc', 'efg'], '')
              >
              and it did not work. Or is this just a special exception to prevent
              the misuse of sum to join strings? (As I said, I'm only an occasional
              user.)
              I think that's right: anything with an __add__ method, apart from
              string, can be sum()ed.

              --
              Arnaud

              Comment

              • Erik Max Francis

                #8
                Re: Feature suggestion: sum() ought to use a compensated summationalgori thm

                Szabolcs Horvát wrote:
                Arnaud Delobelle wrote:
                >>
                >sum() works for any sequence of objects with an __add__ method, not
                >just floats! Your algorithm is specific to floats.
                >
                This occurred to me also, but then I tried
                >
                sum(['abc', 'efg'], '')
                >
                and it did not work. Or is this just a special exception to prevent the
                misuse of sum to join strings? (As I said, I'm only an occasional user.)
                What you wrote is nonsensical there, no different from 'a' + 1 -- which
                is why it quite rightly raises a TypeError.

                You're trying to add a list to a string, which is nonsensical. You add
                strings to strings, or lists to lists, but mixing them up doesn't make
                sense. Python can't guess what you mean when you write something like
                ['abc', 'def'] + '' -- which is the functional equivalent of your call
                to sum -- and so doesn't try. It indicates this by raising a TypeError.

                --
                Erik Max Francis && max@alcyone.com && http://www.alcyone.com/max/
                San Jose, CA, USA && 37 18 N 121 57 W && AIM, Y!M erikmaxfrancis

                Comment

                • Torsten Bronger

                  #9
                  Re: Feature suggestion: sum() ought to use a compensated summation algorithm

                  Hallöchen!

                  Erik Max Francis writes:
                  Szabolcs Horvát wrote:
                  >
                  >Arnaud Delobelle wrote:
                  >>>
                  >>sum() works for any sequence of objects with an __add__ method, not
                  >>just floats! Your algorithm is specific to floats.
                  >>
                  >This occurred to me also, but then I tried
                  >>
                  >sum(['abc', 'efg'], '')
                  >>
                  >and it did not work. Or is this just a special exception to
                  >prevent the misuse of sum to join strings? (As I said, I'm only
                  >an occasional user.)
                  >
                  What you wrote is nonsensical there, no different from 'a' + 1 --
                  which is why it quite rightly raises a TypeError.
                  No, the above expression should yield ''+'abc'+'efg', look for the
                  signature of sum in the docs.

                  Tschö,
                  Torsten.

                  --
                  Torsten Bronger, aquisgrana, europa vetus
                  Jabber ID: bronger@jabber. org
                  (See http://ime.webhop.org for further contact info.)

                  Comment

                  • Erik Max Francis

                    #10
                    Re: Feature suggestion: sum() ought to use a compensated summationalgori thm

                    Torsten Bronger wrote:
                    No, the above expression should yield ''+'abc'+'efg', look for the
                    signature of sum in the docs.
                    You're absolutely right, I misread it. Sorry about that.

                    --
                    Erik Max Francis && max@alcyone.com && http://www.alcyone.com/max/
                    San Jose, CA, USA && 37 18 N 121 57 W && AIM, Y!M erikmaxfrancis

                    Comment

                    • Ivan Illarionov

                      #11
                      Re: Feature suggestion: sum() ought to use a compensated summationalgori thm

                      On Sat, 03 May 2008 20:44:19 +0200, Szabolcs Horvát wrote:
                      Arnaud Delobelle wrote:
                      >>
                      >sum() works for any sequence of objects with an __add__ method, not
                      >just floats! Your algorithm is specific to floats.
                      >
                      This occurred to me also, but then I tried
                      >
                      sum(['abc', 'efg'], '')
                      Interesting, I always thought that sum is like shortcut of
                      reduce(operator .add, ...), but I was mistaken.

                      reduce() is more forgiving:
                      reduce(operator .add, ['abc', 'efg'], '' ) # it works
                      'abcefg'

                      --
                      Ivan

                      Comment

                      • sturlamolden

                        #12
                        Re: Feature suggestion: sum() ought to use a compensated summationalgori thm

                        On May 3, 10:13 pm, hdante <hda...@gmail.c omwrote:
                        I believe that moving this to third party could be better. What about
                        numpy ? Doesn't it already have something similar ?
                        Yes, Kahan summation makes sence for numpy arrays. But the problem
                        with this algorithm is optimizing compilers. The programmer will be
                        forced to use tricks like inline assembly to get around the optimizer.
                        If not, the optimizer would remove the desired features of the
                        algorithm. But then we have a serious portability problem.

                        Comment

                        • Thomas Dybdahl Ahle

                          #13
                          Re: Feature suggestion: sum() ought to use a compensated summationalgori thm


                          On Sat, 2008-05-03 at 21:37 +0000, Ivan Illarionov wrote:
                          On Sat, 03 May 2008 20:44:19 +0200, Szabolcs Horvát wrote:
                          >
                          Arnaud Delobelle wrote:
                          >
                          sum() works for any sequence of objects with an __add__ method, not
                          just floats! Your algorithm is specific to floats.
                          This occurred to me also, but then I tried

                          sum(['abc', 'efg'], '')
                          >
                          Interesting, I always thought that sum is like shortcut of
                          reduce(operator .add, ...), but I was mistaken.
                          >
                          reduce() is more forgiving:
                          reduce(operator .add, ['abc', 'efg'], '' ) # it works
                          'abcefg'
                          Hm, it works for lists:
                          sum(([1], [2]), [])
                          [1, 2]

                          However I find the seccond argument hack ugly.
                          Does the sum way have any performance advantages over the reduce way?

                          --
                          Best Regards,
                          Med Venlig Hilsen,
                          Thomas

                          Comment

                          • Ivan Illarionov

                            #14
                            Re: Feature suggestion: sum() ought to use a compensated summationalgori thm

                            On Sun, 04 May 2008 00:31:01 +0200, Thomas Dybdahl Ahle wrote:
                            On Sat, 2008-05-03 at 21:37 +0000, Ivan Illarionov wrote:
                            >On Sat, 03 May 2008 20:44:19 +0200, Szabolcs Horvát wrote:
                            >>
                            Arnaud Delobelle wrote:
                            >>
                            >sum() works for any sequence of objects with an __add__ method, not
                            >just floats! Your algorithm is specific to floats.
                            >
                            This occurred to me also, but then I tried
                            >
                            sum(['abc', 'efg'], '')
                            >>
                            >Interesting, I always thought that sum is like shortcut of
                            >reduce(operato r.add, ...), but I was mistaken.
                            >>
                            >reduce() is more forgiving:
                            >reduce(operato r.add, ['abc', 'efg'], '' ) # it works 'abcefg'
                            >
                            Hm, it works for lists:
                            sum(([1], [2]), [])
                            [1, 2]
                            >
                            However I find the seccond argument hack ugly. Does the sum way have any
                            performance advantages over the reduce way?
                            Yes, sum() is faster:

                            $ python -m timeit "" "sum([[1], [2], [3, 4]], [])"
                            100000 loops, best of 3: 6.16 usec per loop

                            $ python -m timeit "import operator" \
                            "reduce(operato r.add, [[1], [2], [3, 4]])"
                            100000 loops, best of 3: 11.9 usec per loop

                            --
                            Ivan

                            Comment

                            • George Sakkis

                              #15
                              Re: Feature suggestion: sum() ought to use a compensated summationalgori thm

                              On May 3, 7:12 pm, Ivan Illarionov <ivan.illario.. .@gmail.comwrot e:
                              On Sun, 04 May 2008 00:31:01 +0200, Thomas Dybdahl Ahle wrote:
                              On Sat, 2008-05-03 at 21:37 +0000, Ivan Illarionov wrote:
                              On Sat, 03 May 2008 20:44:19 +0200, Szabolcs Horvát wrote:
                              >
                              Arnaud Delobelle wrote:
                              >
                              sum() works for any sequence of objects with an __add__ method, not
                              just floats!  Your algorithm is specific to floats.
                              >
                              This occurred to me also, but then I tried
                              >
                              sum(['abc', 'efg'], '')
                              >
                              Interesting, I always thought that sum is like shortcut of
                              reduce(operator .add, ...), but I was mistaken.
                              >
                              reduce() is more forgiving:
                              reduce(operator .add, ['abc', 'efg'], '' ) # it works 'abcefg'
                              Hm, it works for lists:
                              sum(([1], [2]), [])
                              [1, 2]
                              So it's not reduce() that is more forgiving, it's sum() that makes an
                              exception for strings only.

                              However I find the seccond argument hack ugly. Does the sum way have any
                              performance advantages over the reduce way?
                              >
                              Yes, sum() is faster:
                              >
                              $ python -m timeit "" "sum([[1], [2], [3, 4]], [])"
                              100000 loops, best of 3: 6.16 usec per loop
                              >
                              $ python -m timeit "import operator" \
                              "reduce(operato r.add, [[1], [2], [3, 4]])"
                              100000 loops, best of 3: 11.9 usec per loop
                              Not really; you measure the import and the attribute access time in
                              the second case. sum() is 2x faster for adding numbers only:

                              # Adding integers
                              python -mtimeit --setup="x=[1]*100" "sum(x)"
                              100000 loops, best of 3: 4.87 usec per loop
                              python -mtimeit --setup="from operator import add; x=[1]*100"
                              "reduce(add ,x)"
                              100000 loops, best of 3: 10.1 usec per loop

                              # Adding floats
                              python -mtimeit --setup="x=[1.0]*100" "sum(x)"
                              100000 loops, best of 3: 5.14 usec per loop
                              python -mtimeit --setup="from operator import add; x=[1.0]*100"
                              "reduce(add ,x)"
                              100000 loops, best of 3: 10.1 usec per loop

                              # Adding tuples
                              python -mtimeit --setup="x=[(1,)]*100" "sum(x,())"
                              10000 loops, best of 3: 61.6 usec per loop
                              python -mtimeit --setup="from operator import add; x=[(1,)]*100"
                              "reduce(add,x,( ))"
                              10000 loops, best of 3: 66.3 usec per loop

                              # Adding lists
                              python -mtimeit --setup="x=[[1]]*100" "sum(x,[])"
                              10000 loops, best of 3: 79.9 usec per loop
                              python -mtimeit --setup="from operator import add; x=[[1]]*100"
                              "reduce(add ,x,[])"
                              10000 loops, best of 3: 84.3 usec per loop

                              George

                              Comment

                              Working...