2.2.2 Estimating Performance with Cache Misses

Cache Optimizations

Writing efficient programs requires an understanding of how contemporary multiprocessor architectures work, in particular how they move data between memory and the processor.

The performance of compiled code depends mainly on three factors:

Floating Point Unit (FPU) Arithmetic Rate: How many floating point operations per second the FPUs can sustain.
Memory System Data Transfer Rate (Bandwidth): How quickly data can be moved between memory and the processor.
Floating Point Intensity: The ratio of double precision operations performed to the volume of data transferred between memory and registers.

Consider a loop that sums the N elements of an array A:

for (i=0; i<N; i++)  
    x += A[i];

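Without cache misses, at 0.5 ns per iteration and N = 10^6 (the value implied by the totals):

Total Time = N * 0.5 ns = 0.5 ms

With cache misses, assuming one miss per cache line of 8 elements and a 50 ns penalty per miss:

Total Time = N * 0.5 ns + (N/8) * 50 ns = 6.75 ms

Cache misses can make this loop more than ten times slower (13.5 times in this estimate).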
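The estimate above can be reproduced with a small model. This is a minimal sketch, not a measurement: the function name estimate_time_ns and its parameters are hypothetical, and N = 10^6 is inferred from the 0.5 ms total rather than stated in the notes.

```c
#include <assert.h>

/* Estimated time in nanoseconds to sum n array elements.
 * op_ns            : cost of one iteration when the data is in cache (0.5 ns in the text)
 * miss_ns          : penalty per cache miss (50 ns in the text)
 * elems_per_line   : array elements per cache line (8 in the text)
 * Assumes exactly one miss per cache line touched. */
double estimate_time_ns(long n, double op_ns, double miss_ns, long elems_per_line)
{
    assert(elems_per_line > 0);
    double compute_ns = (double)n * op_ns;
    double misses = (double)(n / elems_per_line);
    return compute_ns + misses * miss_ns;
}
```

Calling estimate_time_ns(1000000, 0.5, 50.0, 8) gives 6750000 ns (6.75 ms), and setting miss_ns to 0 gives the 0.5 ms figure for the case without misses.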

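Compare two loop bodies that load the same data:

Code 1: x += A[i]; (1 FP op/load)
Code 2: s += A[i] * A[i]; (2 FP ops/load; better efficiency due to higher arithmetic intensity, since the same memory traffic supports twice as many FP operations.)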
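Written out as complete functions, the two loop bodies look as follows. This is an illustrative sketch; the function names sum and sum_of_squares are hypothetical. Both functions read each element of A exactly once, so they generate the same memory traffic, while the second performs two FP operations per element instead of one.

```c
#include <assert.h>
#include <stddef.h>

/* Code 1: one FP add per element loaded. */
double sum(const double *A, size_t n)
{
    assert(A != NULL || n == 0);
    double x = 0.0;
    for (size_t i = 0; i < n; i++)
        x += A[i];
    return x;
}

/* Code 2: one FP multiply and one FP add per element loaded. */
double sum_of_squares(const double *A, size_t n)
{
    assert(A != NULL || n == 0);
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += A[i] * A[i];
    return s;
}
```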

for (i = 0; i < N; i++) {
    x += A[i];                        /* running sum */
    max = A[i] > max ? A[i] : max;    /* running maximum */
}

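The combined loop is more efficient than two separate loops over A because each element is loaded once rather than twice, which reduces cache misses. The gain comes from reduced memory traffic, not from performing more FP operations.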
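For comparison, the sketch below places a two-pass version next to the combined loop. It assumes that "combined" refers to merging a separate sum loop and a separate max loop; the function names are hypothetical. Both versions compute the same results, but the two-pass version streams A through the cache twice.

```c
#include <assert.h>
#include <stddef.h>

/* Two passes: every element of A is loaded twice. */
void sum_max_two_pass(const double *A, size_t n, double *sum, double *max)
{
    assert(n > 0);
    double x = 0.0;
    for (size_t i = 0; i < n; i++)
        x += A[i];
    double m = A[0];
    for (size_t i = 1; i < n; i++)
        m = A[i] > m ? A[i] : m;
    *sum = x;
    *max = m;
}

/* One combined pass: each A[i] is loaded once and used for both results. */
void sum_max_fused(const double *A, size_t n, double *sum, double *max)
{
    assert(n > 0);
    double x = 0.0;
    double m = A[0];
    for (size_t i = 0; i < n; i++) {
        x += A[i];
        m = A[i] > m ? A[i] : m;
    }
    *sum = x;
    *max = m;
}
```

When A is too large to stay in cache between the two passes, the two-pass version incurs roughly twice as many misses as the combined one.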

To improve performance at a given arithmetic intensity, focus on reducing cache misses.

Example Analysis:

for(i=0;i<N;i++)  
    for(j=0;j<M;j++)  
        x += A[j][i];

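Assumption: the cache holds fewer than N * w words, where w is the number of words per cache line, and M = N. C stores arrays in row-major order, so the inner loop over j jumps a full row between consecutive accesses, and each cache line is evicted before its remaining words are used. Every access is therefore a miss, giving N^2 cache misses in total.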

Reorder loops:

for (j = 0; j < M; j++)  
    for (i = 0; i < N; i++)  
        x += A[j][i];

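With this ordering the inner loop walks through consecutive memory addresses, so each cache line is loaded once and all w of its words are used. The number of cache misses drops to roughly N * M / w (N^2 / w when M = N).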
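The effect of the loop order can be checked with a toy cache model. The sketch below is an assumption-laden illustration, not a model of any real processor: it simulates a direct-mapped cache, and the function count_misses and all of its parameters are hypothetical. It counts misses for a row-major M x N array traversed in either loop order.

```c
#include <assert.h>
#include <stdlib.h>

/* Count misses in a toy direct-mapped cache with num_lines lines of
 * line_words array elements each, for a row-major M x N array A.
 * j_inner != 0 : walk A[j][i] with j as the inner loop (original order).
 * j_inner == 0 : walk A[j][i] with i as the inner loop (reordered).
 * Returns -1 if the tag table cannot be allocated. */
long count_misses(long M, long N, long num_lines, long line_words, int j_inner)
{
    assert(M > 0 && N > 0 && num_lines > 0 && line_words > 0);
    long *tags = malloc((size_t)num_lines * sizeof *tags);
    if (tags == NULL)
        return -1;
    for (long k = 0; k < num_lines; k++)
        tags[k] = -1;                          /* cache starts empty */

    long misses = 0;
    long outer = j_inner ? N : M;
    long inner = j_inner ? M : N;
    for (long a = 0; a < outer; a++) {
        for (long b = 0; b < inner; b++) {
            long j = j_inner ? b : a;          /* row index */
            long i = j_inner ? a : b;          /* column index */
            long line = (j * N + i) / line_words;  /* memory line holding A[j][i] */
            long slot = line % num_lines;      /* direct-mapped placement */
            if (tags[slot] != line) {          /* line not resident: miss */
                tags[slot] = line;
                misses++;
            }
        }
    }
    free(tags);
    return misses;
}
```

For M = N = 64, 8 elements per line, and a 32-line cache (256 elements, fewer than N * w = 512), the model reports 4096 misses (N^2) with j as the inner loop and 512 misses (N^2 / w) after reordering.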