2.2.4-Matric-Vector-Multiplication

Overview of Cache Optimizations in Matrix Vector Multiplication

General Context

Focus on optimizing cache usage in matrix-vector multiplication operations. Importance highlighted due to constraints where cache size is smaller than the size of the matrix/vector (denoted as N).

Cache Behavior in Matrix Vector Multiply

  • Assumptions:

    • Cache size is smaller than N words.

  • Memory Misses:

    • Matrices A and C incur only compulsory misses, calculated as:

      • A: N^2 / w (where w = word size)

      • C: N / w

    • Matrix B, however, suffers from multiple loads leading to considerable misses, calculated as:

      • B: N^2 / w misses.

  • Access Pattern: Each row of A accesses B once, but previous data from B may be evicted from cache before the next row is accessed.

Optimization Strategy for B Reuse

  • Objective: Increase reuse of matrix B to reduce the number of cache misses.

  • Proposed Solution:

    • Loop Interchange: Modify loops to allow the use of values from B multiple times during computation.

    • Calculation Dependency: Understanding which calculations require specific elements of B to optimize load strategy.

  • Challenges: While loop interchange optimizes access to B (e.g., reusing B[0]), access patterns for A may degrade cache performance.

Implementation of Column Order Traversal

  • Column Order Traversal:

    • By restructuring the loop, for certain rows X, B is accessible multiple times.

  • New loop structure:

    for (i = 0; i < N; i++)  
        for (j = 0; j < N; j++)  
            C[i] += A[i][j] * B[j]  
  • Enhances cache utilization as successive accesses to B remain in cache for longer.

Improved Reuse Model

  • Assumptions Extended: Still assume cache is smaller than N words but optimized for reduced misses:

    • A and C continues to experience compulsory misses (N^2/w for A, N/w for C).

    • B now is reused X times, leading to a total of:

      • N^2 / (X * w) misses.

  • Revised Loop Structure:

    • Employ an outer loop for rows of A with the inner traverse for each row of B:

    for (i = 0; i < N; i += X)  
        for (j = 0; j < N; j++)  
            for (k = 0; k < X; k++)  
                C[i+k] += A[i+k][j] * B[j]  
  • This methodology combines the benefits of reusing already loaded values while minimizing cache misses.