Vectorization

  • A technique for parallelizing data-parallel applications.

  • Targets computations that apply the same operation to different portions of vectors.

  • Relies on specific hardware support.

Review of Last Lesson (Using awk)

  • awk is a line-oriented text processor.

  • Command structure:

    • awk '{commands}'

    • Commands within curly brackets are executed for each line.

    • END {commands}: Commands executed at the end of processing all lines.

  • Example:

    • Objective: Calculate the average of a specific field in a set of experiments.

    • awk '{sum += $4} END {print sum / NR}'

      • $4: Refers to the fourth field in the line.

      • sum: A variable to accumulate the sum.

      • NR: The number of records (lines) read so far; in the END block, the total line count.

  • Corrected mistake: using $ before variable names.

    • sum is a variable and must not be prefixed with $. Only field references ($1, $2, …) take the $ sign.

SIMD Extensions and Assembly Language

  • Vectorization relies on SIMD (Single Instruction, Multiple Data) extensions in the assembly instruction set.

  • Assembly language:

    • Low-level programming language for processors.

    • Uses registers (e.g., r0, r1, r2).

    • Instructions operate on registers (add, subtract, load, store, etc.).

    • Example (ARM assembly):

      • add r0, r1, r2 (r0 = r1 + r2: in ARM syntax the first operand is the destination; other assembly dialects differ).

    • Memory access:

      • load (memory -> register), store (register -> memory).

      • Memory addresses can be specified using registers as base addresses with offsets.

      • Example: 8(r1) in MIPS-style syntax, or [r1, #8] in ARM syntax (memory address = contents of r1 + 8).

    • Control instructions: jump, branch, conditional branches (e.g., jump if equal).

  • Traditional assembly instructions operate on word-length registers (e.g., 64 bits on a 64-bit architecture).

SIMD: Single Instruction, Multiple Data

  • SIMD extensions use longer registers to process multiple data items with a single instruction.

  • Example:

    • AVX2 extension with 256-bit registers.

    • A 256-bit register can be seen as:

      • One 256-bit value

      • Two 128-bit values

      • Four 64-bit values

      • Eight 32-bit values

  • Instructions specify the element width used within the register (e.g., half-word = 32 bits when the word length is 64 bits).

  • Example:

    • add.h l0, l1, l2 (illustrative syntax: add the half-word lanes of l1 and l2 element-wise, storing the results in l0).

    • The operation is performed in parallel on the corresponding data elements within the registers.

  • Requires replicated hardware (ALUs) to perform parallel operations.

Assembly Code Generation

  • Compiling C/C++ code with the -S flag generates assembly code.

  • Assembly code varies depending on the target processor architecture.

  • Example: Compiling the same C++ code on an Intel processor vs. a Raspberry Pi (ARM) produces different assembly instructions.

Vectorization Example

  • Simple loop example:

    • Initialize vectors x, y, and z.

    • Compute z[i] = x[i] + y[i] in a loop.

    • Sum all elements of z to prevent the compiler from optimizing out the loop.

  • Compiling without optimization (e.g., -O0) results in traditional assembly instructions operating on individual data elements.

  • Compiling with optimization (e.g., -O3) enables automatic vectorization.

  • Vectorization loads multiple data elements into longer registers, performs the operation in parallel, and stores the results.


Performance Comparison: Vectorization vs. Scalar operations

  • Scalar Operations (without vectorization):

    • Involve separate instructions for loading, adding, and storing each data element.

    • Require multiple clock cycles per data element.

  • Vectorized Operations:

    • Load multiple data elements into vector registers.

    • Perform a single instruction on multiple data elements simultaneously.

    • Reduce the number of clock cycles required.

  • Significant speedup can be achieved through vectorization.

Loop Unrolling and Vectorization

  • Loop unrolling is a loop transformation technique that duplicates the loop body multiple times, reducing loop overhead and exposing more opportunities for vectorization.

  • When loop unrolling is combined with vectorization, the compiler can process multiple iterations of the loop in parallel, resulting in significant performance improvements.

Compiler Feedback and Optimization Flags

  • The -fopt-info-vec-missed flag provides feedback on loops that were not vectorized.

  • Reasons for missed vectorization:

    • Unsupported data types (e.g., double on SIMD units without double-precision lanes).

    • Data dependencies.

    • Misaligned memory accesses.

  • The -funsafe-math-optimizations flag allows the compiler to perform aggressive optimizations that may sacrifice some accuracy.

  • Enabling unsafe math optimizations can sometimes enable vectorization of loops that would otherwise not be vectorized.

Case Study: Psec Code Optimization

  • Experimented with the Psec code for computing the value of Pi.

  • Compiled the code with and without vectorization.

  • Measured the execution time.

  • Observed a significant speedup when vectorization was enabled.

  • Used the -funsafe-math-optimizations flag to further improve performance.

SIMD Extensions Supported

  • MMX

  • SSE2

  • AVX

  • AVX2

  • AVX-512

AVX2 Extension Details

  • AVX2 operates on 256-bit registers.

  • Can process eight 32-bit values or four 64-bit values in parallel.

  • Potential for a significant reduction in the number of iterations.