weird benchmarks with c arrays

Collapse
X
 
  • Time
  • Show
Clear All
new posts
  • Emiurgo
    New Member
    • Aug 2008
    • 2

    weird benchmarks with c arrays

    Hi there to everyone!
    I've got a problem which may be interesting to some of you, and I'd be very grateful if there is someone who can give me some advice (or maybe redirect me to some other place).

    Very brief introduction.
    Since I'm learning vectorization with SSE (I use gcc 4.1.3 on Ubuntu 7.10) I've built a simple test program in C for comparing running times. I wanted to see how much improvement could give vectorization to simple iterated operations. So I made also some tests with standard arrays, pointers, and such (with and without SSE), and I stumbled upon some strange facts I cannot naively explain.

    The (apparently) weird thing.
    Let me show you an example. One of the test is the multiplication of a random-filled array of doubles for a fixed scalar.
    The array has length L and the multiplication is repeated R times. (The repetition is the outer loop, while the array operation is in the inner loop). For example, the code for the indexed iteration is:
    Code:
    for (r = 0; r < repeat; r++) for (i = 0; i < length; i++) v[i] *= m;
    Amongst the benchmark I've tried, there is one test in which I modify the values of R and L keeping the product R * L constant.
    These are the execution times in seconds (on a AMD processor). The program has been compiled with optimization -O2 and SSE flags on.

    operation - repeat - array length -- SSE / indexed loop / pointer loop
    sclmul 5000000 32 -- 0.19 0.31 0.24
    sclmul 2500000 64 -- 0.16 0.29 0.22
    sclmul 1250000 128 -- 0.15 0.28 0.21
    sclmul 500000 320 -- 0.14 0.27 0.20
    sclmul 250000 640 -- 0.14 0.27 0.20
    sclmul 125000 1280 -- 0.13 0.26 0.20
    sclmul 50000 3200 -- 0.13 0.27 0.20
    sclmul 25000 6400 -- 0.14 0.27 0.20
    sclmul 12500 12800 -- 0.24 0.35 0.32
    sclmul 5000 32000 -- 0.24 0.35 0.32
    sclmul 2500 64000 -- 0.25 0.36 0.32
    sclmul 1250 128000 -- 0.44 0.55 0.51
    sclmul 1000 160000 -- 0.60 0.71 0.65
    sclmul 800 200000 -- 0.72 0.84 0.78
    sclmul 500 320000 -- 0.72 0.84 0.77
    sclmul 250 640000 -- 0.72 0.84 0.78
    sclmul 125 1280000 -- 0.72 0.84 0.78
    sclmul 62 2560000 -- 0.73 0.86 0.79
    sclmul 50 3200000 -- 0.72 0.84 0.78

    The strange fact is that as the array length increases, the execution time worsens a lot. In some other benchmarks (with copy operations), it feels like there is a cutoff above which the execution goes almost ten times slower (from 0.40 to 3.26)!
    The fact is not linked to optimization: it happens even with no optimization on. And it is not linked to a specific processor, since I've run it on three different machines and I found similar results (with different "cutoffs", but there is always an array length for which the phenomenon occurs).
    After some tinkering, I've found that _adding_ an extra extern loop -- which in fact "segments" the array as if it was the union of smaller arrays -- the slowness vanishes.

    Code:
    for (k = 0; k < kmax; k++) {
    start = k*length/kmax;
    for (r = 0; r < repeat; r++) for (i = start; i < start + length/kmax; i++) v[i] *= m; 
    }
    (The piece of code above is awkward but it is actually up to ten time faster than a normal array sweep for very very large arrays -- kmax must be fine-tuned.)
    Actually, I'm not able to explain it with a violation of locality of reference or other simple optimization techniques, since we are talking about a very simple operation which is simply iterated in a (very very) long array (remember that this happens even with optimization turned off).
    I feel like I'm missing something fundamental.

    Any idea?
    Thank you for your time!

    Luigi
  • newb16
    Contributor
    • Jul 2008
    • 687

    #2
    (6400-12800 )*sizeof(double ) looks like L1 cache size, and next pit is like to be about L2 cache size.

    Comment

    • weaknessforcats
      Recognized Expert Expert
      • Mar 2007
      • 9214

      #3
      Put your array on the heap rather than the stack.

      Also, I hope this code:
      Originally posted by Emiurgo
      for (k = 0; k < kmax; k++) {
      start = k*length/kmax;
      for (r = 0; r < repeat; r++) for (i = start; i < start + length/kmax; i++) v[i] *= m;
      }
      does not have k, kmax, r and repeat as doubles. Operators like < don't work as expected for floating point. That could cause loop problems.

      Also, avoid calculations in the condition part of a loop since these will be redone on each cycle. Better to calculate once, put the result in a variable and use the variable in the loop.

      Comment

      • newb16
        Contributor
        • Jul 2008
        • 687

        #4
        So when you process array slice-by-slice it is loaded into cache only once per each k iteration. Google for 'cache trashing'

        Comment

        • Emiurgo
          New Member
          • Aug 2008
          • 2

          #5
          Originally posted by newb16
          (6400-12800 )*sizeof(double ) looks like L1 cache size, and next pit is like to be about L2 cache size.
          Touché. Yes, that must be the answer, thank you.
          Is there a way to know cache sizes at runtime? (I need to run the program on different machines.)
          Originally posted by weaknessforcats
          Put your array on the heap rather than the stack.
          I'm dynamically allocating arrays with malloc (actually, with a function which calls malloc, since I need to manually align the addresses to 16 bytes, to use them with SSE registers). So, I think that in my implementation (gcc) they are already on the heap, why do you guess they are on the stack?

          Comment

          • weaknessforcats
            Recognized Expert Expert
            • Mar 2007
            • 9214

            #6
            Originally posted by Emiurgo
            why do you guess they are on the stack?
            You would be surprised the number o C programmers that can't use malloc. Then they create huge arrays on the stack, exceed stack memory limits and die with all sorts of strange errors.

            Comment

            • newb16
              Contributor
              • Jul 2008
              • 687

              #7
              Originally posted by Emiurgo
              Is there a way to know cache sizes at runtime? (I need to run the program on different machines.)
              1) CPUID command. or 2) run several attempts at startup with different array sizes measuring performance.

              Comment

              Working...