Vectorization

  • A technique for parallelizing data-parallel applications.

  • Targets computations that apply the same operation to different portions of vectors.

  • Relies on specific hardware support.

Review of Last Lesson (Using awk)

  • awk is a line-oriented text processor.

  • Command structure:

    • awk '{commands}'

    • Commands within curly brackets are executed for each line.

    • END {commands}: Commands executed at the end of processing all lines.

  • Example:

    • Objective: Calculate the average of a specific field in a set of experiments.

    • awk '{sum += $4} END {print sum / NR}'

      • $4: Refers to the fourth field in the line.

      • sum: A variable to accumulate the sum.

      • NR: The number of records (lines) read so far; in the END block, the total line count.

  • Corrected mistake: using $ before variable names.

    • sum is a variable and must not be prefixed with $. Only field references ($1, $2, …) take the $ sign.

SIMD Extensions and Assembly Language

  • Vectorization relies on SIMD (Single Instruction, Multiple Data) extensions in the assembly instruction set.

  • Assembly language:

    • Low-level programming language for processors.

    • Uses registers (e.g., r0, r1, r2).

    • Instructions operate on registers (add, subtract, load, store, etc.).

    • Example (ARM assembly):

      • add r0, r1, r2 (r0 = r1 + r2: in ARM syntax the first operand is the destination; other assembly dialects differ).

    • Memory access:

      • load (memory -> register), store (register -> memory).

      • Memory addresses can be specified using registers as base addresses with offsets.

      • Example: 8(r1) in MIPS-style syntax, or [r1, #8] in ARM syntax (memory address = contents of r1 + 8).

    • Control instructions: jump, branch, conditional branches (e.g., jump if equal).

  • Traditional assembly instructions operate on word-length registers (e.g., 64 bits on a 64-bit architecture).

SIMD: Single Instruction, Multiple Data

  • SIMD extensions use longer registers to process multiple data items with a single instruction.

  • Example:

    • AVX2 extension with 256-bit registers.

    • A 256-bit register can be seen as:

      • One 256-bit value

      • Two 128-bit values

      • Four 64-bit values

      • Eight 32-bit values

  • Instructions specify the element width used within the register (e.g., half-word = 32 bits when the word length is 64 bits).

  • Example:

    • add.h l0, l1, l2 (illustrative syntax: add the half-word lanes of l1 and l2 element-wise, storing the results in l0).

    • The operation is performed in parallel on the corresponding data elements within the registers.

  • Requires replicated hardware (ALUs) to perform parallel operations.

Assembly Code Generation

  • Compiling C/C++ code with the -S flag generates assembly code.

  • Assembly code varies depending on the target processor architecture.

  • Example: Compiling the same C++ code on an Intel processor vs. a Raspberry Pi (ARM) produces different assembly instructions.

Vectorization Example

  • Simple loop example:

    • Initialize vectors x, y, and z.

    • Compute z[i] = x[i] + y[i] in a loop.

    • Sum all elements of z to prevent the compiler from optimizing out the loop.

  • Compiling without optimization (e.g., -O0) results in traditional assembly instructions operating on individual data elements.

  • Compiling with optimization (e.g., -O3) enables automatic vectorization.

  • Vectorization loads multiple data elements into longer registers, performs the operation in parallel, and stores the results.


Performance Comparison: Vectorization vs. Scalar operations

  • Scalar Operations (without vectorization):

    • Involve separate instructions for loading, adding, and storing each data element.

    • Require multiple clock cycles per data element.

  • Vectorized Operations:

    • Load multiple data elements into vector registers.

    • Perform a single instruction on multiple data elements simultaneously.

    • Reduce the number of clock cycles required.

  • Significant speedup can be achieved through vectorization.

Loop Unrolling and Vectorization

  • Loop unrolling is a loop transformation technique that duplicates the loop body multiple times, reducing loop overhead and exposing more opportunities for vectorization.

  • When loop unrolling is combined with vectorization, the compiler can process multiple iterations of the loop in parallel, resulting in significant performance improvements.

Compiler Feedback and Optimization Flags

  • The -fopt-info-vec-missed flag provides feedback on loops that were not vectorized.

  • Reasons for missed vectorization:

    • Unsupported data types (e.g., double on SIMD units without double-precision lanes).

    • Data dependencies.

    • Misaligned memory accesses.

  • The -funsafe-math-optimizations flag allows the compiler to perform aggressive optimizations that may sacrifice some accuracy.

  • Enabling unsafe math optimizations can sometimes enable vectorization of loops that would otherwise not be vectorized.

Case Study: Psec Code Optimization

  • Experimented with the Psec code for computing the value of Pi.

  • Compiled the code with and without vectorization.

  • Measured the execution time.

  • Observed a significant speedup when vectorization was enabled.

  • Used the -funsafe-math-optimizations flag to further improve performance.

SIMD Extensions Supported

  • MMX

  • SSE2

  • AVX

  • AVX2

  • AVX-512

AVX2 Extension Details

  • AVX2 operates on 256-bit registers.

  • Can process eight 32-bit values or four 64-bit values in parallel.

  • Potential for a significant reduction in the number of iterations.