2.2.2-Estimating-Performance-with-Cache-Misses

Cache Optimizations

Understanding Multiprocessor Programming

Effective programming requires knowledge of contemporary multiprocessor architecture and development.

Key Determinants of Sequential Performance

When evaluating performance after code compilation, consider the following:

  1. Floating Point Units (FPU) Arithmetic Rate: The effective functioning of FPUs influences overall performance.

  2. Memory System Data Transfer Rate (Bandwidth): Efficiency in data movement between memory and processors affects performance.

  3. Floating Point Intensity: This is the ratio of double precision operations to the data volume transferred between memory and registers.

Performance Calculation Using Example Code

Consider the code:

for (i=0; i<N; i++)  
    x += A[i];

Assumptions:

  • Clock Rate: 2 GHz (0.5 ns per cycle)

  • Floating-point Operations Per Cycle: 1 (improvable to 2 with FMAD)

  • N = 1,000,000

  • Cache Line Size: 64 bytes

  • Data Type Size: double (8 bytes each)

  • Cache Miss Penalty: 50 ns

Without Cache Misses:

Total Time = N * 0.5 ns = 0.5 ms

With Cache Misses:

Total Time = N * 0.5 ns + (N/8) * 50 ns = 6.75 msCache misses can make performance over ten times slower.

Arithmetic Intensity Calculations

Example Code for Various Loops:

  1. Loop 1: x += A[i];Arithmetic Intensity: 1 FP/word = 0.125

  2. Loop 2: x += A[i] * A[i];Arithmetic Intensity: 2/8 = 0.25

  3. Loop 3: x += A[i] * A[i] * A[i];Arithmetic Intensity: 3/8 = 0.375

Improving Arithmetic Intensity

Example 2:

  • Code 1: x += A[i]; (1 FP op/load)

  • Code 2: s += A[i] * A[i]; (2 FP ops/load; better efficiency due to higher arithmetic intensity.)

Example 3: Analyzing Loops

  • Code 1: x += A[i];

  • Code 2: max = A[i] > max ? A[i] : max;

  • Combined Code:

for(i=0;i<N;i++)  
    {  
      x += A[i];  
      max = A[i] > max ? A[i] : max;  
    }

The combined code is more efficient due to reduced cache misses despite lesser FP operations.

Cache Based Optimizations

To improve performance while sustaining arithmetic intensity, the focus should be on reducing cache misses.

Doubly Nested Loop Problem

Example Analysis:

for(i=0;i<N;i++)  
    for(j=0;j<M;j++)  
        x += A[j][i];

Assumption: Cache size < N * w (words per cache line). Each access leads to N^2 cache misses.

Optimizing Access Pattern:

Reorder loops:

for (j = 0; j < M; j++)  
    for (i = 0; i < N; i++)  
        x += A[j][i];

This adjustment significantly decreases cache misses.