32-bit IEEE float multiplication

  • Eric Sosman

    #16
    Re: 32-bit IEEE float multiplication

    "Dik T. Winter" wrote:[color=blue]
    >
    > In article <aed59298.03120 40655.43db93e4@ posting.google. com> bikejog@hotmail .com (Andy) writes:[color=green]
    > > But here's a real question, given the code below
    > >
    > > unsigned long ticks = 0; /* 32-bit */
    > > float fVal; /* 32-bit IEEE-754 */
    > >
    > > while(++ticks) {
    > > fVal = (float)ticks * 0.004;
    > > }
    > >
    > [...]
    > Another problem with your code is that when ticks exceeds 2**24 the
    > number is no longer exactly representable as a floating point number,
    > so all bets are off.

    Not "all" bets, but it certainly scuttles any hope of
    strict monotonicity.

    Here's the scenario: `ticks' counts up to a large power
    of two, namely 2**24, since an IEEE single-precision
    `float' carries a 24-bit significand (23 stored bits plus
    the implicit leading 1).
    Up to this point, the expression `(float)ticks' has produced
    an exact conversion: the result is exactly equal to the
    original value of `ticks'.

    But at the next upward count a problem arises: `ticks'
    now has one more significant bit than a `float' can handle.
    (Imagine counting upwards in decimal arithmetic with three
    significant digits: 999 is fine, 1000==1e3 is fine, but
    1001 has too many digits.) So the conversion is inexact,
    and if "round to even" is in effect the result will be a
    hair too small -- in fact, it will be the same result as was
    obtained from the preceding exact conversion. That is, `ticks'
    increased but `(float)ticks' did not.

    On the next count, the problem disappears momentarily:
    the low-order bit of `ticks' is now a zero, so the number of
    significant bits is once again within the capacity of `float'.
    The conversion is again exact -- but look what's happened: the
    value `(float)ticks' has advanced two steps at once. You've
    entered a regime where `(float)ticks' "sticks" at one value
    for two ticks before advancing to the correct result; it
    "increments every other time."

    As you go higher still, `ticks' will eventually attain
    two more significant bits than a `float' can handle, and
    will "stick" at one value for four cycles before increasing.
    And then you'll get to three bits too many and an eight-
    cycle plateau, and so on. (Consider the decimal analogy
    again: after you reach 1000000==100e4, you've got to count
    upward many more times before reaching 101e4==1010000. )

    However, the cure for your case seems obvious: Whether
    you know it or not, you're actually employing `double'
    arithmetic in this expression because the constant `0.004'
    has type `double'. So why convert to `float', losing a few
    bits in the process, only to go ahead and re-convert that
    mangled value to a `double'? Just make `fVal' a `double'
    to begin with and get rid of the `(float)' cast, and you
    should be immune to this effect.

    There may be other problems elsewhere, of course, but
    this problem, at least, will cease to bother you.

    --
    Eric.Sosman@sun.com


    • Dan Pop

      #17
      Re: 32-bit IEEE float multiplication

      In <3FCF60D4.3F1C6406@sun.com> Eric Sosman <Eric.Sosman@sun.com> writes:
      > However, the cure for your case seems obvious: Whether
      >you know it or not, you're actually employing `double'
      >arithmetic in this expression because the constant `0.004'
      >has type `double'. So why convert to `float', losing a few
      >bits in the process, only to go ahead and re-convert that
      >mangled value to a `double'? Just make `fVal' a `double'
      >to begin with and get rid of the `(float)' cast, and you
      >should be immune to this effect.

      Given its target platform, what he may really want to do is to replace
      0.004 by 0.004f so that double-precision arithmetic is avoided
      entirely. That is, unless his application has plenty of spare CPU
      cycles to burn...

      Single precision IEEE-754 floating point is already painfully slow
      on an 8-bit micro with no hardware floating point support. Any trick
      that allows avoiding floating point completely is a big win (and a big
      saver of ROM memory space).

      Dan
      --
      Dan Pop
      DESY Zeuthen, RZ group
      Email: Dan.Pop@ifh.de


      • Eric Sosman

        #18
        Re: 32-bit IEEE float multiplication

        Dan Pop wrote:
        >
        > In <3FCF60D4.3F1C6406@sun.com> Eric Sosman <Eric.Sosman@sun.com> writes:
        >
        > > However, the cure for your case seems obvious: Whether
        > >you know it or not, you're actually employing `double'
        > >arithmetic in this expression because the constant `0.004'
        > >has type `double'. So why convert to `float', losing a few
        > >bits in the process, only to go ahead and re-convert that
        > >mangled value to a `double'? Just make `fVal' a `double'
        > >to begin with and get rid of the `(float)' cast, and you
        > >should be immune to this effect.
        >
        > Given its target platform, what he may really want to do is to replace
        > 0.004 by 0.004f so that double precision arithmetic is completely
        > avoided. That is, unless his application has plenty of spare cpu cycles
        > to burn...
        >
        > Single precision IEEE-754 floating point is already painfully slow
        > on an 8-bit micro with no hardware floating point support. Any trick
        > that allows avoiding floating point completely is a big win (and a big
        > saver of ROM memory space).

        I'm not familiar with his platform; maybe `double'
        is out of the question. If so, I don't see how he can
        avoid the "stair-step" problem that occurs with large
        counts, except possibly by breaking the count into two
        pieces and extracting two `float' quantities instead
        of one. E.g.,

        float hi, lo;
        lo = (ticks & 0xFFFF) * 0.004f;
        hi = (ticks >> 16) * (0.004f * 65536);

        Of course, then he's stuck with two `float' values and the
        necessity to handle them both, more or less doubling the
        amount of work that needs to be done with them elsewhere.
        `double' might, perhaps, turn out to be cheaper after all.

        To the O.P.: What is the purpose of this floating-point
        result? What do you do with it; what decisions are based
        upon it? Perhaps we can come up with a way to avoid floating-
        point altogether, and stay strictly in integer-land.

        --
        Eric.Sosman@sun.com


        • Christian Bau

          #19
          Re: 32-bit IEEE float multiplication

          In article <aed59298.0312040655.43db93e4@posting.google.com>,
          bikejog@hotmail.com (Andy) wrote:
          > The compiler is Keil for Intel 8051 and derivatives. It's
          > an embedded compiler. It was actually my mistake. I think the
          > code is actually working, but the test code I wrote had a race
          > condition on the variable gcGCLK with the ISR that actually
          > increments this variable. The actual code is appended at the
          > end of my message.
          > But here's a real question, given the code below
          >
          > unsigned long ticks = 0; /* 32-bit */
          > float fVal; /* 32-bit IEEE-754 */
          >
          > while(++ticks) {
          > fVal = (float)ticks * 0.004;
          > }
          >
          > 1. Can I expect fVal to always not decreasing as ticks is
          > incremented? (until of course when ticks wraps around
          > to 0)
          > 2. Does fVal covers all the integral values? ie, could it
          > go from say 56 to 60 skipping 57, 58, and 59?
          > 3. In this example, each integral value equals to 250
          > ticks. Are all intervals between any two consecutive
          > integral values of fVal always 250 ticks? (within tolerance
          > of course). ie, between fVal of 250 and 251, there're
          > 250 ticks. Is this true for all integral intervals of fVal?
          > 3. How about for a smaller number like 0.00004.

          You will have a problem with large numbers. An IEEE 32-bit float
          has 24 bits of mantissa, including the leading "1" bit, which is
          never stored.

          So for floating point numbers 1 <= x < 2, the resolution is 2^-23:
          the difference between x and the next larger floating point number
          is 2^-23. For 2^23 <= x < 2^24 the resolution is 1. For 2^31 <= x
          < 2^32 the resolution is 2^8 = 256, so consecutive floating point
          numbers there are 256 apart. Since 256 * 0.004 = 1.024 > 1, you
          will indeed miss some integral values when x is large.

          Is there a good reason why you don't write

          ticks / 250

          ?


          • Dik T. Winter

            #20
            Re: 32-bit IEEE float multiplication

            In article <3FCF84EE.53A476A4@sun.com> Eric.Sosman@Sun.COM writes:
            > Dan Pop wrote:
            ...
            > > Single precision IEEE-754 floating point is already painfully slow
            > > on an 8-bit micro with no hardware floating point support. Any trick
            > > that allows avoiding floating point completely is a big win (and a big
            > > saver of ROM memory space).
            >
            > I'm not familiar with his platform; maybe `double'
            > is out of the question.

            Yup. Actually normal fp is also out of the question...
            > If so, I don't see how he can
            > avoid the "stair-step" problem that occurs with large
            > counts, except possibly by breaking the count into two
            > pieces and extracting two `float' quantities instead
            > of one. E.g.,
            >
            > float hi, lo;
            > lo = (ticks & 0xFFFF) * 0.004f;
            > hi = (ticks >> 16) * (0.004f * 65536);

            This ignores something he wants. When ticks is a multiple of 250
            the exact integer value should be produced as a floating point number.
            What I wrote in a previous post still stands:
            (float)(ticks / 250) + (float)(ticks % 250) * 0.004f.
            I am looking for a way to do "ticks / 250" faster on an 8-bit micro
            than just division (which, again, is slow).
            --
            dik t. winter, cwi, kruislaan 413, 1098 sj amsterdam, nederland, +31205924131
            home: bovenover 215, 1025 jn amsterdam, nederland; http://www.cwi.nl/~dik/


            • Andy

              #21
              Re: 32-bit IEEE float multiplication

              Thanks for your great answers. As I understand it, as the tick
              gets greater than 2^24, all I'm losing are the lower 8 bits
              of the tick. Then since 256 ticks is only about 1 second, the
              worst thing that happens is that the result will be rounded to
              the nearest second or a little more than that. I won't lose
              anything close to a minute or an hour. Would I?

              TIA
              Andy


              "Dik T. Winter" <Dik.Winter@cwi .nl> wrote in message news:<HpDME9.1H G@cwi.nl>...[color=blue]
              > In article <aed59298.03120 40655.43db93e4@ posting.google. com> bikejog@hotmail .com (Andy) writes:[color=green]
              > > But here's a real question, given the code below
              > >
              > > unsigned long ticks = 0; /* 32-bit */
              > > float fVal; /* 32-bit IEEE-754 */
              > >
              > > while(++ticks) {
              > > fVal = (float)ticks * 0.004;
              > > }
              > >
              > > 1. Can I expect fVal to always not decreasing as ticks is
              > > incremented? (until of course when ticks wraps around
              > > to 0)
              >
              > Yes.
              >[color=green]
              > > 2. Does fVal covers all the integral values? ie, could it
              > > go from say 56 to 60 skipping 57, 58, and 59?
              >
              > No. It will not go from 56 to 60, but it may go from slightly less than
              > 59 to slightly more than 59, skipping 59 itself.
              >[color=green]
              > > 3. In this example, each integral value equals to 250
              > > ticks. Are all intervals between any two consecutive
              > > integral values of fVal always 250 ticks? (within tolerance
              > > of course). ie, between fVal of 250 and 251, there're
              > > 250 ticks. Is this true for all integral intervals of fVal?
              >
              > Because of the above observation, no, not necessarily.
              >[color=green]
              > > 3. How about for a smaller number like 0.00004.
              >
              > Similar answer. 0.004 and 0.00004 are not exactly representable as
              > floating point numbers, so rounding occurs both when the representation
              > of those numbers is created and when this value is used in the
              > multiplication. A better way would be to calculate:
              > (float)ticks / 250.0
              > but that may be slower on your system. (250.0 is exactly representable,
              > so IEEE mandates that when ticks is an integer that is a multiple of
              > 250 the result should be exact.)
              >
              > Another problem with your code is that when ticks exceeds 2**24 the
              > number is no longer exactly represenatable as a floating point number,
              > so all bets are off.
              >
              > To get the best possible answer you need something like:
              > (float)(ticks / 250) + (float)(ticks % 250)/250.0


              • Andy

                #22
                Re: 32-bit IEEE float multiplication

                Using double is pretty much out of the question. I can tolerate
                a stair-step of, say, less than 1 minute when the number gets huge,
                as long as the error is a relatively small percentage of the
                actual count. Is the worst-case error 250/(2^32)? Or how do
                you calculate the worst-case error?
                By the way, I'm kinda slow replying to the posts because
                I'm using the google newsfeed service. It has a response time
                of about 9 hours...

                TIA
                Andy


                "Dik T. Winter" <Dik.Winter@cwi .nl> wrote in message news:<HpEMGz.3C E@cwi.nl>...[color=blue]
                > In article <3FCF84EE.53A47 6A4@sun.com> Eric.Sosman@Sun .COM writes:[color=green]
                > > Dan Pop wrote:[/color]
                > ...[color=green][color=darkred]
                > > > Single precision IEEE-754 floating point is already painfully slow
                > > > on an 8-bit micro with no hardware floating point support. Any trick
                > > > that allows avoiding floating point completely is a big win (and a big
                > > > saver of ROM memory space).
                > >
                > > I'm not familiar with his platform; maybe `double'
                > > is out of the question.
                >
                > Yup. Actually normal fp is also out of the question...
                >[color=green]
                > > If so, I don't see how he can
                > > avoid the "stair-step" problem that occurs with large
                > > counts, except possibly by breaking the count into two
                > > pieces and extracting two `float' quantities instead
                > > of one. E.g.,
                > >
                > > float hi, lo;
                > > lo = (ticks & 0xFFFF) * 0.004f;
                > > hi = (ticks >> 16) * (0.004f * 65536);
                >
                > This ignores something he wants. When ticks is a multiple of 250
                > the exact integer value should be produced as a floating point number.
                > What I wrote in a previous post still stands:
                > (float)(ticks / 250) + (float)(ticks % 250) * 0.004f.
                > I am looking for a way to do "ticks / 250" faster on an 8-bit micro
                > than just division (which, again, is slow).


                • Andy

                  #23
                  Re: 32-bit IEEE float multiplication

                  Yes. I do not want to waste CPU cycles. My intent is not really
                  to cover all the integral values when the number gets huge. If I
                  only lose one second for anything greater than 2^24 (that's >18 hours
                  BTW), then that's ok. With 32 bits, I should be able to cover
                  something like 198 days, and if the error is even one minute out
                  of 180 days, then that's fine, but one day is not. What's the
                  maximum error I can expect?

                  TIA
                  Andy

                  Christian Bau <christian.bau@cbau.freeserve.co.uk> wrote in message news:<christian.bau-2F3655.22593304122003@slb-newsm1.svr.pol.co.uk>...
                  > when x is large.
                  >
                  > Is there a good reason why you don't write
                  >
                  > ticks / 250
                  >
                  > ?


                  • Eric Sosman

                    #24
                    Re: 32-bit IEEE float multiplication

                    Andy wrote:
                    >
                    > Thanks for your great answers. As I understand it, as the tick
                    > gets greater than 2^24, all I'm losing are the lower 8 bits
                    > of the tick. Then since 256 ticks is only about 1 second, the
                    > worst thing that happens is that the result will be rounded to
                    > the nearest second or a little more than that. I won't lose
                    > anything close to a minute or an hour. Would I?

                    For the benefit of those who (like me) do not entirely
                    understand exactly what you're trying to do, could you
                    describe what these "ticks" are supposed to be and what
                    you are trying to do with them?

                    ... and if you're trying to convert a 256 Hz "tick" to
                    seconds, multiplying by a floating-point value is surely a
                    poor way to proceed. Even an integer division is overkill;
                    a simple shift-and-mask will do all that's necessary. If
                    you're willing to think about fixed-point arithmetic, even
                    that tiny amount of work is more than required!

                    So, what's the goal?

                    --
                    Eric.Sosman@sun.com


                    • Christian Bau

                      #25
                      Re: 32-bit IEEE float multiplication

                      In article <aed59298.0312050738.38f13268@posting.google.com>,
                      bikejog@hotmail.com (Andy) wrote:
                      > Yes. I do not want to wast CPU cycles. My intend is not really
                      > to cover all the integral values when the number gets huge. If I
                      > only loose one second for anything greater than 2^24 (that's >18 hours
                      > BTW), then that's ok. With 32-bits, I should be able to cover
                      > something like 198 days and if the error is even one minute out
                      > of 180 days, then that's fine, but one day is not. What's the
                      > maximum error I can expect?

                      Couldn't you just use two separate counters for seconds and ticks?

                      You are multiplying ticks by 0.004, so every 250 times you would add a
                      second. You could do something like this:

                      static unsigned long whole_seconds = 0;
                      static unsigned int sub_seconds = 0;
                      static unsigned long last_ticks;

                      Set last_ticks to ticks when you start. Then whenever you check ticks,
                      you do the following:


                      ticks = <calculate current time>
                      while (last_ticks != ticks) {
                          ++last_ticks;
                          if (++sub_seconds == 250) { sub_seconds = 0; ++whole_seconds; }
                      }

                      No floating point arithmetic; that should be a bit faster on an 8051.


                      • J. J. Farrell

                        #26
                        Re: 32-bit IEEE float multiplication

                        bikejog@hotmail.com (Andy) wrote in message news:<aed59298.0312050738.38f13268@posting.google.com>...
                        > Christian Bau <christian.bau@cbau.freeserve.co.uk> wrote in message news:<christian.bau-2F3655.22593304122003@slb-newsm1.svr.pol.co.uk>...
                        > >
                        > > Is there a good reason why you don't write
                        > >
                        > > ticks / 250
                        >
                        > Yes. I do not want to waste CPU cycles. My intent is not really
                        > to cover all the integral values when the number gets huge. If I
                        > only lose one second for anything greater than 2^24 (that's >18 hours
                        > BTW), then that's ok. With 32 bits, I should be able to cover
                        > something like 198 days, and if the error is even one minute out
                        > of 180 days, then that's fine, but one day is not. What's the
                        > maximum error I can expect?

                        Excuse me if this point is stupid, as I know next to nothing
                        about FP arithmetic. I don't see how ((float)ticks * 0.004f) can
                        possibly use fewer cycles than (float)(ticks / 250), particularly
                        on a system without hardware floating point arithmetic.

                        I confess I'm puzzled by this whole discussion. What are you trying
                        to achieve? Why are you using FP arithmetic instead of integer -
                        particularly if CPU cycles are important? Unless I'm really missing
                        something, you'd get perfect accuracy up to 136 years with integer.

                        I guess I'm missing something obvious ...


                        • Andy

                          #27
                          Re: 32-bit IEEE float multiplication

                          Eric,
                          My goal is to provide a generic timing device which would
                          provide accuracy (though not exactness) from days down to
                          milliseconds. The idea is to have a free-running 32-bit
                          timer (tick) that all others compare against for timing. I'm
                          using multiplication because the idea is to not limit the
                          free-running tick counter to one frequency (as in my previous
                          examples, 4 milliseconds). Maybe a concrete example will help.

                          typedef unsigned long GCLK_T;
                          GCLK_T gzFreeTicks;  /* this gets incremented in an ISR */
                          GCLK_T GCLK(void);   /* provides atomic read of gzFreeTicks */

                          #define SECS_PER_TICK 0.004 /* seconds for each tick */
                          #define MSECS_TO_TICKS(MSECS) xxxx /* converting milliseconds to ticks */

                          GCLK_T ElapsedTicks(GCLK_T ticks) {
                              return (GCLK() - ticks);
                          }

                          unsigned long ElapsedSecs(GCLK_T ticks) {
                              return ((float)ElapsedTicks(ticks) * (float)(SECS_PER_TICK));
                          }

                          /* has an endless loop with one 100-millisecond task */
                          void main(void) {
                              GCLK_T zMyTicks;

                              zMyTicks = GCLK();
                              while (1) {
                                  /* this provides a very fast compare */
                                  if (ElapsedTicks(zMyTicks) > MSECS_TO_TICKS(100)) {
                                      Do100MillisecondTask();
                                      zMyTicks = GCLK();
                                  }
                              }
                          }

                          Hope this helps.
                          Andy

                          Eric Sosman <Eric.Sosman@sun.com> wrote in message news:<3FD0A771.36064FA3@sun.com>...
                          > For the benefit of those who (like me) do not entirely
                          > understand exactly what you're trying to do, could you
                          > describe what these "ticks" are supposed to be and what
                          > you are trying to do with them?
                          >
                          > ... and if you're trying to convert a 256 Hz "tick" to
                          > seconds, multiplying by a floating-point value is surely a
                          > poor way to proceed. Even an integer division is overkill;
                          > a simple shift-and-mask will do all that's necessary. If
                          > you're willing to think about fixed-point arithmetic, even
                          > that tiny amount of work is more than required!
                          >
                          > So, what's the goal?


                          • Andy

                            #28
                            Re: 32-bit IEEE float multiplication

                            Keeping a seconds counter is out of the question, since then
                            you're forced to increment the ticks at a frequency exactly
                            divisible by one second. Please see my previous reply to
                            Eric and maybe you will get a good idea of what I'm trying to
                            accomplish. But the basic idea is to allow the ticks to
                            be incremented at any frequency because, so often, timers
                            are hard to come by. I do not want to dedicate an entire
                            timer just for this.
                            The equation

                            unsigned long elapsedSeconds;

                            seconds = (float)elapsedSeconds * 0.004;

                            will always yield a valid number for all 32 bits of
                            elapsedSeconds, right? I mean it won't give a number that's
                            hours, days, or years away from the actual value when
                            elapsedSeconds is greater than 24 bits, right?
                            If that's the case, then I think I'm happy with it.

                            Andy


                            Christian Bau <christian.bau@cbau.freeserve.co.uk> wrote in message news:<christian.bau-F6ADC9.21522605122003@slb-newsm1.svr.pol.co.uk>...
                            >
                            > Couldn't you just use two separate counters for seconds and ticks?
                            >
                            > You are multiplying ticks by 0.004, so every 250 times you would add a
                            > second. You could do something like this:
                            >
                            > static unsigned long whole_seconds = 0;
                            > static unsigned int sub_seconds = 0;
                            > static unsigned long last_ticks;
                            >
                            > Set last_ticks to ticks when you start. Then whenever you check ticks,
                            > you do the following:
                            >
                            >
                            > ticks = <calculate current time>
                            > while (last_ticks != ticks) {
                            > ++last_ticks;
                            > if (++sub_seconds == 250) { sub_seconds = 0; ++whole_seconds; }
                            > }
                            >
                            > No floating point arithmetic; that should be a bit faster on an 8051.


                            • Andy

                              #29
                              Re: 32-bit IEEE float multiplication

                              I'm sorry, I was thinking of this equation

                              (float)(ticks / 250) + (float)(ticks % 250)/250.0

                              posted by another person when I wrote the reply. Please see my
                              reply to Eric to get an idea of what I'm trying to accomplish.

                              TIA
                              Andy


                              jjf@bcs.org.uk (J. J. Farrell) wrote in message news:<5c04bc56.0312051449.5d69bb22@posting.google.com>...
                              > bikejog@hotmail.com (Andy) wrote in message news:<aed59298.0312050738.38f13268@posting.google.com>...
                              > > Christian Bau <christian.bau@cbau.freeserve.co.uk> wrote in message news:<christian.bau-2F3655.22593304122003@slb-newsm1.svr.pol.co.uk>...
                              > > >
                              > > > Is there a good reason why you don't write
                              > > >
                              > > > ticks / 250
                              > >
                              > > Yes. I do not want to waste CPU cycles. My intent is not really
                              > > to cover all the integral values when the number gets huge. If I
                              > > only lose one second for anything greater than 2^24 (that's >18 hours
                              > > BTW), then that's ok. With 32 bits, I should be able to cover
                              > > something like 198 days, and if the error is even one minute out
                              > > of 180 days, then that's fine, but one day is not. What's the
                              > > maximum error I can expect?
                              >
                              > Excuse me if this point is stupid, as I know next to nothing
                              > about FP arithmetic. I don't see how ((float)ticks * 0.004f) can
                              > possibly use fewer cycles than (float)(ticks / 250), particularly
                              > on a system without hardware floating point arithmetic.
                              >
                              > I confess I'm puzzled by this whole discussion. What are you trying
                              > to achieve? Why are you using FP arithmetic instead of integer -
                              > particularly if CPU cycles are important? Unless I'm really missing
                              > something, you'd get perfect accuracy up to 136 years with integer.
                              >
                              > I guess I'm missing something obvious ...


                              • Andy

                                #30
                                Re: [FAQ] Re: 32-bit IEEE float multiplication

                                How about

                                float a,b,c;
                                assert (a > 1.0);
                                assert (a < pow(2,32)); /* full range (excluding 0) of unsigned long */
                                assert (b != 0); /* b is some small but representable non-zero value */
                                assert (b < 1.0);
                                c = a*b;
                                assert (c != 0);
                                assert (!isnan(c));

                                Please note float instead of double. And also, what kind of
                                error can I expect from the product a*b?

                                TIA
                                Andy

                                "Arthur J. O'Dwyer" <ajo@nospam.and rew.cmu.edu> wrote in message news:<Pine.LNX. 4.58-035.03120321273 60.31331@unix48 .andrew.cmu.edu >...[color=blue]
                                > On Thu, 4 Dec 2003, John Smith wrote:[color=green]
                                > >
                                > > Arthur J. O'Dwyer wrote:[color=darkred]
                                > > >[/color][/color]
                                >
                                > First, as several others have pointed out, if we have
                                >
                                > double a,b,c; /* initialized somehow */
                                > assert(a > 1.0);
                                > assert(b > 0.0);
                                > assert(b < 1.0);
                                > c = a*b;
                                >
                                > then it is always the case that
                                >
                                > assert(c != 0.0);
                                >
                                > However, it is plausible that the OP might have initialized
                                > 'b' in such a way as to make him *think* it was a small positive
                                > value, while in fact it had already been corrupted by round-off
                                > error. For example, I think
                                >
                                >

                                <snip..>
                                > HTH,
                                > -Arthur
