2.2.4-Matric-Vector-Multiplication

Overview of Cache Optimizations in Matrix Vector Multiplication

General Context

Focus on optimizing cache usage in matrix-vector multiplication operations. Importance highlighted due to constraints where cache size is smaller than the size of the matrix/vector (denoted as N).

Cache Behavior in Matrix Vector Multiply

Assumptions:
- Cache size is smaller than N words.
Memory Misses:
- Matrices A and C incur only compulsory misses, calculated as:
  - A: N^2 / w (where w = word size)
  - C: N / w
- Matrix B, however, suffers from multiple loads leading to considerable misses, calculated as:
  - B: N^2 / w misses.
Access Pattern: Each row of A accesses B once, but previous data from B may be evicted from cache before the next row is accessed.

Optimization Strategy for B Reuse

Objective: Increase reuse of matrix B to reduce the number of cache misses.
Proposed Solution:
- Loop Interchange: Modify loops to allow the use of values from B multiple times during computation.
- Calculation Dependency: Understanding which calculations require specific elements of B to optimize load strategy.
Challenges: While loop interchange optimizes access to B (e.g., reusing B[0]), access patterns for A may degrade cache performance.

Implementation of Column Order Traversal

Column Order Traversal:
- By restructuring the loop, for certain rows X, B is accessible multiple times.

New loop structure:

for (i = 0; i < N; i++)  
    for (j = 0; j < N; j++)  
        C[i] += A[i][j] * B[j]

Enhances cache utilization as successive accesses to B remain in cache for longer.

Improved Reuse Model

Assumptions Extended: Still assume cache is smaller than N words but optimized for reduced misses:
- A and C continues to experience compulsory misses (N^2/w for A, N/w for C).
- B now is reused X times, leading to a total of:
  - N^2 / (X * w) misses.

Revised Loop Structure:

Employ an outer loop for rows of A with the inner traverse for each row of B:

for (i = 0; i < N; i += X)  
    for (j = 0; j < N; j++)  
        for (k = 0; k < X; k++)  
            C[i+k] += A[i+k][j] * B[j]

This methodology combines the benefits of reusing already loaded values while minimizing cache misses.