2.2.2-Estimating-Performance-with-Cache-Misses
Cache Optimizations
Understanding Multiprocessor Programming
Effective programming requires knowledge of contemporary multiprocessor architecture and development.
Key Determinants of Sequential Performance
When evaluating performance after code compilation, consider the following:
Floating Point Units (FPU) Arithmetic Rate: The effective functioning of FPUs influences overall performance.
Memory System Data Transfer Rate (Bandwidth): Efficiency in data movement between memory and processors affects performance.
Floating Point Intensity: This is the ratio of double precision operations to the data volume transferred between memory and registers.
Performance Calculation Using Example Code
Consider the code:
for (i=0; i<N; i++)
x += A[i];Assumptions:
Clock Rate: 2 GHz (0.5 ns per cycle)
Floating-point Operations Per Cycle: 1 (improvable to 2 with FMAD)
N = 1,000,000
Cache Line Size: 64 bytes
Data Type Size: double (8 bytes each)
Cache Miss Penalty: 50 ns
Without Cache Misses:
Total Time = N * 0.5 ns = 0.5 ms
With Cache Misses:
Total Time = N * 0.5 ns + (N/8) * 50 ns = 6.75 msCache misses can make performance over ten times slower.
Arithmetic Intensity Calculations
Example Code for Various Loops:
Loop 1:
x += A[i];Arithmetic Intensity: 1 FP/word = 0.125Loop 2:
x += A[i] * A[i];Arithmetic Intensity: 2/8 = 0.25Loop 3:
x += A[i] * A[i] * A[i];Arithmetic Intensity: 3/8 = 0.375
Improving Arithmetic Intensity
Example 2:
Code 1:
x += A[i];(1 FP op/load)Code 2:
s += A[i] * A[i];(2 FP ops/load; better efficiency due to higher arithmetic intensity.)
Example 3: Analyzing Loops
Code 1:
x += A[i];Code 2:
max = A[i] > max ? A[i] : max;Combined Code:
for(i=0;i<N;i++)
{
x += A[i];
max = A[i] > max ? A[i] : max;
}The combined code is more efficient due to reduced cache misses despite lesser FP operations.
Cache Based Optimizations
To improve performance while sustaining arithmetic intensity, the focus should be on reducing cache misses.
Doubly Nested Loop Problem
Example Analysis:
for(i=0;i<N;i++)
for(j=0;j<M;j++)
x += A[j][i];Assumption: Cache size < N * w (words per cache line). Each access leads to N^2 cache misses.
Optimizing Access Pattern:
Reorder loops:
for (j = 0; j < M; j++)
for (i = 0; i < N; i++)
x += A[j][i];This adjustment significantly decreases cache misses.