Arm SIMD
Implemented through co-processor extensions: configurable interfaces that allow extra blocks to be added to the core.
SIMD (NEON) and SVE are extensions defined by Arm
SIMD Concepts
Have vector registers, which group multiple items together.
Items are processed in lanes (e.g. a pixel with four 8-bit components occupies 4 lanes of 8 bits each; see the sketch after this list)
Compute on whole vectors at once, with a single instruction, instead of element by element
Better usage of memory and ALUs.
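A minimal sketch of the lane idea using NEON intrinsics in C (the function name is illustrative, not from the lecture): two RGBA pixels pack into one 64-bit register as eight 8-bit lanes, and a single instruction adds every lane.

#include <arm_neon.h>
#include <stdint.h>

/* Two RGBA pixels = eight 8-bit lanes in one 64-bit vector register.
   One vadd_u8 adds all eight lanes with a single instruction. */
uint8x8_t add_two_pixels(uint8x8_t pixels_a, uint8x8_t pixels_b) {
    return vadd_u8(pixels_a, pixels_b);
}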
Arm SIMD Registers / VFP (vector floating point)
64-bit or 128-bit SIMD registers (D and Q registers)
Arm SIMD support
Supports integer, fixed-point, and floating-point data; fixed point is useful for particular signal-processing applications
Arm SIMD uses
integers, multimedia, signal processing (video, graphics, voice processing, image processing)
VFP is special and can be used for 3D graphics, games consoles, etc.
Arm SIMD extras
Only included when needed; supports unaligned data access; has powerful load/store instructions that can interleave and de-interleave data (see the sketch below)
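A sketch of the interleaving loads/stores, assuming the standard <arm_neon.h> intrinsics (function name is mine): VLD3 de-interleaves packed RGB bytes into one register per channel, and VST3 re-interleaves them on the way out.

#include <arm_neon.h>
#include <stdint.h>

/* De-interleave 16 packed RGB pixels: vld3q_u8 splits R, G and B into
   separate registers; vst3q_u8 interleaves them again when storing. */
void brighten_red(uint8_t *rgb /* 48 bytes: 16 RGB pixels */) {
    uint8x16x3_t px = vld3q_u8(rgb);                    /* px.val[0]=R, [1]=G, [2]=B */
    px.val[0] = vqaddq_u8(px.val[0], vdupq_n_u8(16));   /* saturating add to the R channel */
    vst3q_u8(rgb, px);
}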
SIMD mnemonics
Mnemonics on instructions indicate what type of data can be found in a SIMD register, e.g. VADD.I16
If the output is a different size from the input this can be handled, e.g.
VMULL.S16 Q0, D2, D3
Multiplying two 16-bit values can produce results up to 32 bits wide, so the results go into a larger (128-bit Q) register
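The same widening multiply as a C intrinsic, assuming the standard <arm_neon.h> naming (function name is illustrative): four 16-bit lanes in, four 32-bit lanes out.

#include <arm_neon.h>

/* Widening multiply: 4 x 16-bit inputs produce 4 x 32-bit results,
   so the destination is a 128-bit Q register rather than a 64-bit D register. */
int32x4_t widening_mul(int16x4_t a, int16x4_t b) {
    return vmull_s16(a, b);   /* maps to VMULL.S16 Qd, Dn, Dm */
}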
Common SIMD instructions
Conversion, comparison, arithmetic, Newton-Raphson reciprocal estimation, square root, saturating arithmetic, polynomial arithmetic, and instructions for specific decoding tasks (two are sketched below)
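Two of these sketched with NEON intrinsics (function names are mine): a saturating add, and a reciprocal built from the hardware's Newton-Raphson estimate plus refinement steps.

#include <arm_neon.h>

/* Saturating add: results clamp to INT16_MAX/INT16_MIN instead of wrapping. */
int16x8_t sat_add(int16x8_t a, int16x8_t b) {
    return vqaddq_s16(a, b);
}

/* Newton-Raphson reciprocal: start from the hardware estimate and refine. */
float32x4_t reciprocal(float32x4_t x) {
    float32x4_t r = vrecpeq_f32(x);          /* coarse 1/x estimate */
    r = vmulq_f32(r, vrecpsq_f32(x, r));     /* one refinement step */
    r = vmulq_f32(r, vrecpsq_f32(x, r));     /* second step for more accuracy */
    return r;
}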
How to use SIMD in programs
Intrinsics, or automatic vectorisation by the compiler
Intrinsics
A way of specifying SIMD behaviour from a high-level language
These map onto SIMD instructions directly, so the compiler does not have to guess
C++ can implement this with operator overloading (a C sketch follows below)
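A minimal sketch of the intrinsics style in C, assuming <arm_neon.h> and a length that is a multiple of 4 (names are illustrative); each intrinsic maps almost one-to-one onto a SIMD instruction, so the compiler has nothing to guess.

#include <arm_neon.h>

/* Add two float arrays four elements at a time.
   Assumes n is a multiple of 4 to keep the sketch short. */
void add_arrays(float *dst, const float *a, const float *b, int n) {
    for (int i = 0; i < n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);      /* load 4 floats */
        float32x4_t vb = vld1q_f32(b + i);
        vst1q_f32(dst + i, vaddq_f32(va, vb));  /* add 4 lanes, store 4 floats */
    }
}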
Automatic
Compilers can detect vectorisable loops, but they often need hints so they know vectorisation will be safe
The hints (e.g. pragmas) generate no code themselves; they just give the compiler that guarantee (a sketch follows below)
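A sketch of the hint style, assuming a compiler that understands OpenMP SIMD pragmas (GCC/Clang need -fopenmp or -fopenmp-simd for the pragma to take effect); the pragma produces no code itself, it only asserts the loop is safe to vectorise.

/* restrict promises the arrays don't alias; the pragma asserts the loop has
   no cross-iteration dependences, so the compiler may vectorise it. */
void scale(float *restrict dst, const float *restrict src, float k, int n) {
    #pragma omp simd
    for (int i = 0; i < n; i++)
        dst[i] = k * src[i];
}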
Arm SVE
Vectors are 128 to 2048 bits long (in increments of 128 bits)
Agnostic to vector lengths
Nice to compile to
Lots of support for predication
Can vectorise loops whose trip count is not an exact multiple of the vector width without needing peel loops; predicates indicate which lanes are inactive (see the sketch below)
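A sketch of that loop structure with the SVE C intrinsics (ACLE, <arm_sve.h>); names are illustrative. whilelt rebuilds the predicate every iteration, so the final partial iteration simply runs with some lanes switched off instead of needing a scalar peel loop.

#include <arm_sve.h>
#include <stdint.h>

/* Add two float arrays of any length n; no peel loop needed. */
void add_arrays_sve(float *dst, const float *a, const float *b, int64_t n) {
    for (int64_t i = 0; i < n; i += svcntw()) {           /* svcntw() = 32-bit lanes per vector */
        svbool_t pg = svwhilelt_b32_s64(i, n);            /* active lanes for this iteration */
        svfloat32_t va = svld1_f32(pg, a + i);            /* inactive lanes load as zero */
        svfloat32_t vb = svld1_f32(pg, b + i);
        svst1_f32(pg, dst + i, svadd_f32_m(pg, va, vb));  /* only active lanes are stored */
    }
}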
Vector registers
32 vector registers (each LEN x 128 bits long)
DP & SP Floating Point
64, 32, 16, 8 bit integers
Predicate registers
8 lane-mask predicate registers (each LEN x 16 bits)
8 more predicate registers for manipulation
FFR: the First Fault Register, records which lanes of a speculative load completed without faulting
Control registers
Registers to control the vector length, one per privilege level
SVE Predicates
Used to drive loop control.
Overloads the usual NZCV condition flags
N = first element is active
Z = no element is active
C = last element is not active
V = scalarised loop state, else zero
Use of predicates
If the next predicate has Z set (no active lanes), or C set, that tells us not to keep looping
We can branch based on these predicate flags (see the sketch below)
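The same idea at the C level, as a sketch with illustrative names: svptest_first corresponds to testing the N flag (first lane active), and an all-false predicate from whilelt (Z set) ends the loop.

#include <arm_sve.h>
#include <stdint.h>

/* Loop control driven purely by the predicate: keep going while the predicate
   produced by whilelt still has its first lane active (the N-flag condition). */
int64_t vector_iterations(int64_t n) {
    int64_t i = 0, iters = 0;
    svbool_t pg = svwhilelt_b32_s64(i, n);
    while (svptest_first(svptrue_b32(), pg)) {   /* branch on the predicate */
        iters++;
        i += svcntw();
        pg = svwhilelt_b32_s64(i, n);            /* all-false (Z set) ends the loop */
    }
    return iters;
}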
Vector partitioning
Use predication to allow speculative vectorisation
operate on a partition of elements that are “safe” according to dynamic conditions and the predicate
partitions are inherited by nested conditions and loops
Uncounted loops, data-dependent exits
operations with side-effects following a break must not be architecturally performed
operate on a before-break partition, then exit loop if break is detected
Speculative load errors
loads required to detect break condition may fault
operate on a before-fault partition, then iterate until a break is detected
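A sketch that combines the before-break partition and first-fault loads, assuming the SVE ACLE intrinsics (function name is mine): a vectorised strlen can read past the terminator speculatively because the FFR records which lanes actually loaded.

#include <arm_sve.h>
#include <stddef.h>
#include <stdint.h>

/* Vectorised strlen: loads beyond the terminator are speculative.
   The FFR marks which lanes loaded safely; we only operate on that partition. */
size_t sve_strlen(const char *s) {
    size_t i = 0;
    svbool_t all = svptrue_b8();
    for (;;) {
        svsetffr();                                        /* reset the first-fault register */
        svuint8_t v   = svldff1_u8(all, (const uint8_t *)s + i);
        svbool_t  ffr = svrdffr();                         /* lanes that loaded without faulting */
        svbool_t  nul = svcmpeq_n_u8(ffr, v, 0);           /* NUL bytes within the safe partition */
        if (svptest_any(ffr, nul)) {                       /* break condition detected */
            svbool_t before = svbrkb_b_z(ffr, nul);        /* lanes before the first NUL */
            return i + svcntp_b8(ffr, before);
        }
        i += svcntp_b8(all, ffr);                          /* advance past the safely loaded lanes */
    }
}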
Length agnosticism
also uses partitions
partition is defined by dynamic vector length
SVE speedup
Not in all test cases, but in some cases the speedups over NEON are very large
x86 SIMD
Not as prevalent, but there is a wide range of instructions
256-bit vectors of 8- to 64-bit integers and floats
String manipulation, CRC, popcount, unaligned loads, AI/ML instructions
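A minimal AVX sketch in C, assuming <immintrin.h> and a length that is a multiple of 8 (names are illustrative): a 256-bit vector holds eight floats, and the *_loadu_* forms handle unaligned pointers.

#include <immintrin.h>

/* Add two float arrays eight elements at a time with 256-bit AVX vectors.
   Assumes n is a multiple of 8; loadu/storeu tolerate unaligned pointers. */
void add_arrays_avx(float *dst, const float *a, const float *b, int n) {
    for (int i = 0; i < n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(dst + i, _mm256_add_ps(va, vb));
    }
}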
Intel AVX
From Skylake onwards, 512-bit vectors are available (AVX-512)
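AVX-512 also adds mask registers, which play a role similar to SVE predicates; a sketch (illustrative names, assuming <immintrin.h>) of handling the loop tail with a mask instead of a scalar remainder loop:

#include <immintrin.h>

/* 512-bit vectors hold 16 floats; the final partial iteration is handled
   with a mask (similar in spirit to an SVE predicate) rather than a peel loop. */
void scale_avx512(float *dst, const float *src, float k, int n) {
    __m512 vk = _mm512_set1_ps(k);
    int i = 0;
    for (; i + 16 <= n; i += 16)
        _mm512_storeu_ps(dst + i, _mm512_mul_ps(vk, _mm512_loadu_ps(src + i)));
    int rem = n - i;
    if (rem > 0) {
        __mmask16 m = (__mmask16)((1u << rem) - 1);          /* low 'rem' lanes active */
        __m512 v = _mm512_maskz_loadu_ps(m, src + i);        /* masked-off lanes read as 0 */
        _mm512_mask_storeu_ps(dst + i, m, _mm512_mul_ps(vk, v));
    }
}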