Vectorization
Vectorization
A technique used to parallelize data parallel applications.
Aimed at computations operating on different portions of vectors while performing the same operations.
Relies on specific hardware.
Review of Last Lesson (Using awk)
awkis a line-oriented text processor.Command structure:
awk '{commands}'Commands within curly brackets are executed for each line.
END {commands}: Commands executed at the end of processing all lines.
Example:
Objective: Calculate the average of a specific field in a set of experiments.
awk '{sum += $4} END {print sum / NR}'$4: Refers to the fourth field in the line.sum: A variable to accumulate the sum.NR: Number of rows processed.
Corrected mistake: Using
$before variable names.sumis a variable and should not be prefixed with$. Only the fields in the line (numerical values) must have the$sign in front.
SIMD Extensions and Assembly Language
Vectorization relies on SIMD (Single Instruction, Multiple Data) extensions in the assembly instruction set.
Assembly language:
Low-level programming language for processors.
Uses registers (e.g.,
r0,r1,r2).Instructions operate on registers (add, subtract, load, store, etc.).
Example (ARM assembly):
add r0, r1, r2(r0 + r1 -> r2, or r2 = r0 + r1, different syntax may apply).
Memory access:
load(memory -> register),store(register -> memory).Memory addresses can be specified using registers as base addresses with offsets.
Example:
8(r1)(memory address = contents ofr1+ 8).
Control instructions:
jump,branch, conditional branches (e.g.,jump if equal).
Traditional assembly instructions operate on word-length registers (e.g., 64 bits on a 64-bit architecture).
SIMD: Single Instruction, Multiple Data
SIMD extensions use longer registers to process multiple data items with a single instruction.
Example:
AVX2 extension with 256-bit registers.
A 256-bit register can be seen as:
One 256-bit value
Two 128-bit values
Four 64-bit values
Eight 32-bit values
Instructions specify the data type to be used within the register (e.g., half-word = 32 bits).
Example:
add.h l0, l1, l2(add half-words in registers l0 and l1, store the result in l2).The operation is performed in parallel on the corresponding data elements within the registers.
Requires replicated hardware (ALUs) to perform parallel operations.
Assembly Code Generation
Compiling C/C++ code with the
-Sflag generates assembly code.Assembly code varies depending on the target processor architecture.
Example: Compiling the same C++ code on an Intel processor vs. a Raspberry Pi (ARM) produces different assembly instructions.
Vectorization Example
Simple loop example:
Initialize vectors
x,y, andz.Compute
z[i] = x[i] + y[i]in a loop.Sum all elements of
zto prevent the compiler from optimizing out the loop.
Compiling without optimization (e.g.,
-O0) results in traditional assembly instructions operating on individual data elements.Compiling with optimization (e.g.,
-O3) enables automatic vectorization.Vectorization loads multiple data elements into longer registers, performs the operation in parallel, and stores the results.
Compiler flag
-O3(enables level three optimization, including automatic vectorization).
Performance Comparison: Vectorization vs. Scalar operations
Scalar Operations (without vectorization):
Involve separate instructions for loading, adding, and storing each data element.
Require multiple clock cycles per data element.
Vectorized Operations:
Load multiple data elements into vector registers.
Perform a single instruction on multiple data elements simultaneously.
Reduce the number of clock cycles required.
Significant speedup can be achieved through vectorization.
Loop Unrolling and Vectorization
Loop unrolling is a loop transformation technique that duplicates the loop body multiple times, reducing loop overhead and exposing more opportunities for vectorization.
When loop unrolling is combined with vectorization, the compiler can process multiple iterations of the loop in parallel, resulting in significant performance improvements.
Compiler Feedback and Optimization Flags
The
-fopt-info-vec-missedflag provides feedback on loops that were not vectorized.Reasons for missed vectorization:
Unsupported data types (e.g.,
double).Data dependencies.
Misaligned memory accesses.
The
-funsafe-math-optimizationsflag allows the compiler to perform aggressive optimizations that may sacrifice some accuracy.Enabling unsafe math optimizations can sometimes enable vectorization of loops that would otherwise not be vectorized.
Case Study: Psec Code Optimization
Experimented with the
Pseccode for computing the value of Pi.Compiled the code with and without vectorization.
Measured the execution time.
Observed a significant speedup when vectorization was enabled.
Used the
-f unsafe-math-optimizationsflag to further improve performance.
SIMD Extensions Supported
MMX
SSE2
AVX
AVX2
AVX-512
AVX2 Extension Details
AVX2 operates on 256-bit registers.
Can process eight 32-bit values or four 64-bit values in parallel.
Potential for a significant reduction in the number of iterations.