2.2.4-Matric-Vector-Multiplication
Overview of Cache Optimizations in Matrix Vector Multiplication
General Context
Focus on optimizing cache usage in matrix-vector multiplication operations. Importance highlighted due to constraints where cache size is smaller than the size of the matrix/vector (denoted as N).
Cache Behavior in Matrix Vector Multiply
Assumptions:
Cache size is smaller than N words.
Memory Misses:
Matrices A and C incur only compulsory misses, calculated as:
A: N^2 / w (where w = word size)
C: N / w
Matrix B, however, suffers from multiple loads leading to considerable misses, calculated as:
B: N^2 / w misses.
Access Pattern: Each row of A accesses B once, but previous data from B may be evicted from cache before the next row is accessed.
Optimization Strategy for B Reuse
Objective: Increase reuse of matrix B to reduce the number of cache misses.
Proposed Solution:
Loop Interchange: Modify loops to allow the use of values from B multiple times during computation.
Calculation Dependency: Understanding which calculations require specific elements of B to optimize load strategy.
Challenges: While loop interchange optimizes access to B (e.g., reusing B[0]), access patterns for A may degrade cache performance.
Implementation of Column Order Traversal
Column Order Traversal:
By restructuring the loop, for certain rows X, B is accessible multiple times.
New loop structure:
for (i = 0; i < N; i++) for (j = 0; j < N; j++) C[i] += A[i][j] * B[j]Enhances cache utilization as successive accesses to B remain in cache for longer.
Improved Reuse Model
Assumptions Extended: Still assume cache is smaller than N words but optimized for reduced misses:
A and C continues to experience compulsory misses (N^2/w for A, N/w for C).
B now is reused X times, leading to a total of:
N^2 / (X * w) misses.
Revised Loop Structure:
Employ an outer loop for rows of A with the inner traverse for each row of B:
for (i = 0; i < N; i += X) for (j = 0; j < N; j++) for (k = 0; k < X; k++) C[i+k] += A[i+k][j] * B[j]This methodology combines the benefits of reusing already loaded values while minimizing cache misses.