problem returning floating point

**Richard Heathfield** · Nov 6 '08, 07:55 AM

Re: problem returning floating point

Peter said:

I have written this small app to explain an issue I'm having with a
larger program.
>
In the following code I'm taking 10 ints from the keyboard. In the call
to average() these 10 ints are then added and Divided by the number of
ints to return the average, a float.
>
Can someone see why I get for example: if the total is 227 then divided
by 10 I get a return of 22.00000 instead of the correct answer 22.7.

227.0 / 10.0 is indeed 22.7, within reason - the nature of floating point
is such that you can't always get an answer as precisely as you'd like.

So:

227.0 / 10.0 is 22.7 (ish)
227.0 / 10 is 22.7 (ish)
227 / 10.0 is 22.7 (ish)
227 / 10 is 22 (precisely)

Spot the difference?

<snip>

One modification will suffice to fix the above problem:

float average(int count, int numbers[])
{
int total=0;

Change this to:

float total=0;

--
Richard Heathfield <http://www.cpax.org.uk >
Email: -http://www. +rjh@
Google users: <http://www.cpax.org.uk/prg/writings/googly.php>
"Usenet is a strange place" - dmr 29 July 1999

**viza** · Nov 6 '08, 12:25 PM

Re: problem returning floating point

Hi

On Thu, 06 Nov 2008 07:51:03 +0000, Richard Heathfield wrote:

Peter said:
>

>float average(int count, int numbers[])
>{
> int total=0;
> int i;
> float ans;
>>
> for(i=0; i<count; i++)
> {
> total +=numbers[i];
> ans = total/count;
> }
> return ans;
>>}

>
One modification will suffice to fix the above problem:
>

>int total=0;

>
Change this to:
>
float total=0;

Not strictly wrong, but bad advice.

As long as the size of the integer used is big enough to hold the sum, it
is both faster and more accurate to do the addition in an integer.

A much better change to make is:

ans = (double)total / (double)count;

Also, it is massively stupid to calculate the division inside each loop,
although a half-decent compiler will realise this and optimise it away.

My version of this function would look like:

float average( int count, int *numbers ){
int total= 0;
int *stop= numbers + count;

while( numbers < stop )
total+= *numbers++;

return (double)total / count;
}

I have removed the subscripting in favour of comparing pointers, to
improve speed, in case this is performance critical.

In the case where an int couldn't hold the maximum possible sum of the
ten inputs, I would use a long or long long. In the case where that was
still not enough, I would use two nested loops, the inner one adding up
using integers, and the outer one using doubles. Again, this would give
both increased accuracy and increased speed.

**Richard Tobin** · Nov 6 '08, 12:35 PM

Re: problem returning floating point

In article <BhBQk.2959$i%. 2137@newsfe25.a ms2>,
viza <tom.viza@gm-il.com.obviousc hange.invalidwr ote:

>In the case where an int couldn't hold the maximum possible sum of the
>ten inputs, I would use a long or long long. In the case where that was
>still not enough, I would use two nested loops, the inner one adding up
>using integers, and the outer one using doubles. Again, this would give
>both increased accuracy and increased speed.

Loss of precision in a case like this arises when a small number is
added to a large one. A reasonable approach therefore is to add up
each half of the numbers and then add these together, using the same
method recursively to add up the halves. Something like this
(untested):

double sum_ints(int count, int *numbers)
{
if(count == 0)
return 0.0;
else if(count == 1)
return (double)numbers[0];
else return sum_ints(count/2, numbers) +
sum_ints(count-count/2, numbers+count/2);
}

-- Richard
--
Please remember to mention me / in tapes you leave behind.

**Nick Keighley** · Nov 6 '08, 12:45 PM

Re: problem returning floating point

On 6 Nov, 12:19, viza <tom.v...@gm-il.com.obviousc hange.invalid>
wrote:

I have removed the subscripting in favour of comparing pointers, to
improve speed, in case this is performance critical.

do you have access to a compiler where this makes any difference?

--
Nick Keighley

**viza** · Nov 6 '08, 02:25 PM

Re: problem returning floating point

Hi

On Thu, 06 Nov 2008 04:39:57 -0800, Nick Keighley wrote:

On 6 Nov, 12:19, viza wrote:
>

>I have removed the subscripting in favour of comparing pointers, to
>improve speed, in case this is performance critical.

>
do you have access to a compiler where this makes any difference?

No, not noticeably, if at all. The use of one less variable makes the C
look tidier to me, quite aside from what the compiler outputs.

viza

**Nick Keighley** · Nov 6 '08, 02:35 PM

Re: problem returning floating point

On 6 Nov, 14:20, viza <tom.v...@gm-il.com.obviousc hange.invalid>
wrote:

Hi
>
On Thu, 06 Nov 2008 04:39:57 -0800, Nick Keighley wrote:

On 6 Nov, 12:19, viza wrote:

>

I have removed the subscripting in favour of comparing pointers, to
improve speed, in case this is performance critical.

>

do you have access to a compiler where this makes any difference?

>
No, not noticeably, if at all. The use of one less variable makes the C
look tidier to me, quite aside from what the compiler outputs.

I don't see how you are saving a variable. You wrote
(I fiddled with layout a bit)

float average( int count, int *numbers )
{
int total= 0;
int *stop= numbers + count;

while( numbers < stop )
total+= *numbers++;

return (double)total / count;
}

I'd write

float average (int count, int *numbers)
{
int total = 0;
int i;

for (i = 0; i < count; count++)
total += numbers [i];

return (double)total / count;
}

I don't find your version significantly easier to understand.
The termination condition seems a little obscure.

--
Nick Keighley

**Richard Heathfield** · Nov 6 '08, 04:05 PM

Re: problem returning floating point

viza said:

Hi
>
On Thu, 06 Nov 2008 07:51:03 +0000, Richard Heathfield wrote:

>Peter said:
>>

>>float average(int count, int numbers[])
>>{
>> int total=0;
>> int i;
>> float ans;
>>>
>> for(i=0; i<count; i++)
>> {
>> total +=numbers[i];
>> ans = total/count;
>> }
>> return ans;
>>>}

>>
>One modification will suffice to fix the above problem:
>>

>>int total=0;

>>
>Change this to:
>>
>float total=0;

>
Not strictly wrong, but bad advice.

I disagree, but let's see what your grounds are.

As long as the size of the integer used is big enough to hold the sum,

We don't know this, so you're already on dodgy ground.

it is both faster and more accurate to do the addition in an integer.

Chapter and verse, please. (Math co-processors are commonplace nowadays.)

>
A much better change to make is:
>
ans = (double)total / (double)count;

Introducing two spurious casts, and risking overflow on the total.

Also, it is massively stupid to calculate the division inside each loop,
although a half-decent compiler will realise this and optimise it away.

Fine, but he wasn't asking about performance - he was asking about
precision.

My version of this function would look like:
>
float average( int count, int *numbers ){
int total= 0;
int *stop= numbers + count;
>
while( numbers < stop )
total+= *numbers++;
>
return (double)total / count;
}

/* test driver */
#include <stdio.h>
#include <math.h>
#include <limits.h>

int main(void)
{
int data[] = { INT_MAX, INT_MAX - 1, INT_MAX - 2 };
float expected = INT_MAX - 1;
float actual = average(3, data);
printf("Expecte d: %f\n", expected);
printf("Actual: %f\n", actual);
printf("Error: %f\n", fabs(expected - actual));
return 0;
}

Result (on my system):

Expected: 2147483648.0000 00
Actual: 715827904.00000 0
Error: 1431655744.0000 00

Curiously, on my system, the expected result isn't that hot - but look at
the actual result!

If you plug in doubles instead, the expected result is far better:

Expected: 2147483646.0000 00
Actual: 715827880.66666 7
Error: 1431655765.3333 33

but the error is still colossal.

I have removed the subscripting in favour of comparing pointers, to
improve speed, in case this is performance critical.

In these days of optimising compilers it's unlikely to make any difference
at all, and it's certainly not guaranteed not to be slower.

In the case where an int couldn't hold the maximum possible sum of the
ten inputs, I would use a long

Might not help - didn't in the above test.

or long long.

Might not be available. And of course might not help even if it is.

In the case where that was
still not enough, I would use two nested loops, the inner one adding up
using integers, and the outer one using doubles. Again, this would give
both increased accuracy and increased speed.

I doubt it very much. But feel free to demonstrate.

--
Richard Heathfield <http://www.cpax.org.uk >
Email: -http://www. +rjh@
Google users: <http://www.cpax.org.uk/prg/writings/googly.php>
"Usenet is a strange place" - dmr 29 July 1999

**viza** · Nov 6 '08, 05:15 PM

Re: problem returning floating point

Hi

On Thu, 06 Nov 2008 16:05:28 +0000, Richard Heathfield wrote:

viza said:

>On Thu, 06 Nov 2008 07:51:03 +0000, Richard Heathfield wrote:

>As long as the size of the integer used is big enough to hold the sum,

>
We don't know this, so you're already on dodgy ground.

We don't know that float is either. As soon as you add two of anything
in C you make an assumption that the result will fit in the type.

>it is both faster and more accurate to do the addition in an integer.

>
Chapter and verse, please. (Math co-processors are commonplace
nowadays.)

I did test specifically this function when implementing several image
shrinking functions. Integer was faster on both x86 and x86-64, which
was all I had to hand. 64-bit integers were even faster then float or
double on one 32 bit cpu that I tested.

>A much better change to make is:
>>
> ans = (double)total / (double)count;

>
Introducing two spurious casts, and risking overflow on the total.

Your method fails first. The range where all integers can be represented
in a float is much smaller than the range of the same size int. Once you
pass that limit then if some of your input ints are small they will be
completely lost and if many of them are small (wrt the eventual mean)
then the output of the float method will be less precise than in the
integer method.

int data[] = { INT_MAX, INT_MAX - 1, INT_MAX - 2 };

>or long long.

Might not be available.

Ok, my method simply cannot handle integers of the largest type available
on the system which are close to their maximum value. In many situations
it is possible to know a priori that this will not be the case, and
achieve a more accurate result more quickly by using my method.

By your method the OP would discard accuracy when that may not have been
necessary or even acceptable, and would have run slower on many systems.

Have a nice day.

viza

**Richard Heathfield** · Nov 6 '08, 05:25 PM

Re: problem returning floating point

viza said:

Hi
>
On Thu, 06 Nov 2008 16:05:28 +0000, Richard Heathfield wrote:

>viza said:

<snip>

>>A much better change to make is:
>>>
>> ans = (double)total / (double)count;

>>
>Introducing two spurious casts, and risking overflow on the total.

>
Your method fails first.

Not according to my testing.

--
Richard Heathfield <http://www.cpax.org.uk >
Email: -http://www. +rjh@
Google users: <http://www.cpax.org.uk/prg/writings/googly.php>
"Usenet is a strange place" - dmr 29 July 1999

**user923005** · Nov 6 '08, 09:15 PM

Re: problem returning floating point

On Nov 6, 4:19 am, viza <tom.v...@gm-il.com.obviousc hange.invalid>
wrote:

Hi
>
>
>
>
>
On Thu, 06 Nov 2008 07:51:03 +0000, Richard Heathfield wrote:

Peter said:

>

float average(int count, int numbers[])
{
int total=0;
int i;
float ans;

>

for(i=0; i<count; i++)
{
total +=numbers[i];
ans = total/count;
}
return ans;
>}

>

One modification will suffice to fix the above problem:

>

int total=0;

>

Change this to:

>

float total=0;

>
Not strictly wrong, but bad advice.
>
As long as the size of the integer used is big enough to hold the sum, it
is both faster and more accurate to do the addition in an integer.

Utter poppycock. As long as the floating point type has enough
precision to hold the totals, a floating point type is exactly as
precise as an integral type.
So for up to 6 digits with 4 byte float or 15 digits with 8 byte
float, the floating point value will hold the exact answer just like
an integer would.

Also, floating point operations are not appreciably slower and many
modern systems have special hardware for floating point to accelerate
it.
See, for example:

Fractal eXtreme: x86 and Infinity

http://www.cygnus-software.com/papers/x86andinfinity.html

One floating point addition per cycle is going to be hard to beat.

A much better change to make is:
>
ans = (double)total / (double)count;
>
Also, it is massively stupid to calculate the division inside each loop,
although a half-decent compiler will realise this and optimise it away.
>
My version of this function would look like:
>
float average( int count, int *numbers ){
int total= 0;
int *stop= numbers + count;
>
while( numbers < stop )
total+= *numbers++;
>
return (double)total / count;
>
}
>
I have removed the subscripting in favour of comparing pointers, to
improve speed, in case this is performance critical.

More nonsense.

In the case where an int couldn't hold the maximum possible sum of the
ten inputs, I would use a long or long long. In the case where that was
still not enough, I would use two nested loops, the inner one adding up
using integers, and the outer one using doubles. Again, this would give
both increased accuracy and increased speed.

All solutions are going to be O(n) and these little tweaky things you
are suggesting are for the most part utter nonsense.

However, your accumulation only and single division is the right way
to do it unless there is a need for a running sum (e.g. if you need
the current average after each insertion).

**viza** · Nov 7 '08, 05:25 PM

Re: problem returning floating point

Hi

On Thu, 06 Nov 2008 13:13:01 -0800, user923005 wrote:

On Nov 6, 4:19Â am, viza wrote:

>As long as the size of the integer used is big enough to hold the sum,
>it is both faster and more accurate to do the addition in an integer.

Also, floating point operations are not appreciably slower and many
modern systems have special hardware for floating point to accelerate
it.
See, for example:

Fractal eXtreme: x86 and Infinity

http://www.cygnus-software.com/papers/x86andinfinity.html

>
One floating point addition per cycle is going to be hard to beat.

Ok, if you use simd it gets much faster but most compilers can't (or
don't by default) use those instructions in a loop like this, so if you
are only writing standard C and using a common compiler then integer
addition is still much faster.

I take your point that both ints and floats can store ints exactly over a
certain range; that range is larger for ints though.

>I have removed the subscripting in favour of comparing pointers, to
>improve speed, in case this is performance critical.

>
More nonsense.

The compiler will probably do that for you but it can't hurt.

viza

problem returning floating point

problem returning floating point

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment