Arm SIMD
Implemented through co-processor extensions: configurable interfaces that allow extra blocks to be added to the core.
SIMD (NEON) and SVE are extensions defined by Arm
SIMD Concepts
Have vector registers, which group multiple items together.
Items are processed in lanes (e.g. a pixel with four 8-bit components occupies 4 lanes of 8 bits each; see the sketch after this list)
Compute on whole vectors at once, with a single instruction, instead of element by element
Better usage of memory and ALUs.
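A minimal sketch of the lane idea using NEON intrinsics in C (the function name is illustrative, not from the lecture): two RGBA pixels pack into one 64-bit register as eight 8-bit lanes, and a single instruction adds every lane.

#include <arm_neon.h>
#include <stdint.h>

/* Two RGBA pixels = eight 8-bit lanes in one 64-bit vector register.
   One vadd_u8 adds all eight lanes with a single instruction. */
uint8x8_t add_two_pixels(uint8x8_t pixels_a, uint8x8_t pixels_b) {
    return vadd_u8(pixels_a, pixels_b);
}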
Arm SIMD Registers / VFP (vector floating point)
64-bit or 128-bit SIMD registers (D and Q registers)
Arm SIMD support
Supports integer, fixed-point, and floating-point data; fixed point is useful for particular signal-processing applications
Arm SIMD uses
integers, multimedia, signal processing (video, graphics, voice processing, image processing)
VFP is special and can be used for 3D graphics, games consoles, etc.
Arm SIMD extras
Only included when needed; supports unaligned data access; has powerful load/store instructions that can interleave and de-interleave data (see the sketch below)
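A sketch of the interleaving loads/stores, assuming the standard <arm_neon.h> intrinsics (function name is mine): VLD3 de-interleaves packed RGB bytes into one register per channel, and VST3 re-interleaves them on the way out.

#include <arm_neon.h>
#include <stdint.h>

/* De-interleave 16 packed RGB pixels: vld3q_u8 splits R, G and B into
   separate registers; vst3q_u8 interleaves them again when storing. */
void brighten_red(uint8_t *rgb /* 48 bytes: 16 RGB pixels */) {
    uint8x16x3_t px = vld3q_u8(rgb);                    /* px.val[0]=R, [1]=G, [2]=B */
    px.val[0] = vqaddq_u8(px.val[0], vdupq_n_u8(16));   /* saturating add to the R channel */
    vst3q_u8(rgb, px);
}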
SIMD mnemonics
Mnemonics on instructions indicate what type of data can be found in a SIMD register, e.g. VADD.I16
If the output is a different size from the input this can be handled, e.g.
VMULL.S16 Q0, D2, D3
Multiplying two 16-bit values can produce results up to 32 bits wide, so the results go into a larger (128-bit Q) register
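The same widening multiply as a C intrinsic, assuming the standard <arm_neon.h> naming (function name is illustrative): four 16-bit lanes in, four 32-bit lanes out.

#include <arm_neon.h>

/* Widening multiply: 4 x 16-bit inputs produce 4 x 32-bit results,
   so the destination is a 128-bit Q register rather than a 64-bit D register. */
int32x4_t widening_mul(int16x4_t a, int16x4_t b) {
    return vmull_s16(a, b);   /* maps to VMULL.S16 Qd, Dn, Dm */
}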
Common SIMD instructions
Conversion, comparison, arithmetic, Newton-Raphson reciprocal estimation, square root, saturating arithmetic, polynomial arithmetic, and instructions for specific decoding tasks (two are sketched below)
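Two of these sketched with NEON intrinsics (function names are mine): a saturating add, and a reciprocal built from the hardware's Newton-Raphson estimate plus refinement steps.

#include <arm_neon.h>

/* Saturating add: results clamp to INT16_MAX/INT16_MIN instead of wrapping. */
int16x8_t sat_add(int16x8_t a, int16x8_t b) {
    return vqaddq_s16(a, b);
}

/* Newton-Raphson reciprocal: start from the hardware estimate and refine. */
float32x4_t reciprocal(float32x4_t x) {
    float32x4_t r = vrecpeq_f32(x);          /* coarse 1/x estimate */
    r = vmulq_f32(r, vrecpsq_f32(x, r));     /* one refinement step */
    r = vmulq_f32(r, vrecpsq_f32(x, r));     /* second step for more accuracy */
    return r;
}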
How to use SIMD in programs
Intrinsics, or automatic vectorisation by the compiler
Intrinsics
A way of specifying SIMD behaviour from a high-level language
These map onto SIMD instructions directly, so the compiler does not have to guess
C++ can implement this with operator overloading (a C sketch follows below)
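A minimal sketch of the intrinsics style in C, assuming <arm_neon.h> and a length that is a multiple of 4 (names are illustrative); each intrinsic maps almost one-to-one onto a SIMD instruction, so the compiler has nothing to guess.

#include <arm_neon.h>

/* Add two float arrays four elements at a time.
   Assumes n is a multiple of 4 to keep the sketch short. */
void add_arrays(float *dst, const float *a, const float *b, int n) {
    for (int i = 0; i < n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);      /* load 4 floats */
        float32x4_t vb = vld1q_f32(b + i);
        vst1q_f32(dst + i, vaddq_f32(va, vb));  /* add 4 lanes, store 4 floats */
    }
}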
Automatic
Compilers can detect vectorisable loops, but they often need hints so they know vectorisation will be safe
The hints (e.g. pragmas) generate no code themselves; they just give the compiler that guarantee (a sketch follows below)
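A sketch of the hint style, assuming a compiler that understands OpenMP SIMD pragmas (GCC/Clang need -fopenmp or -fopenmp-simd for the pragma to take effect); the pragma produces no code itself, it only asserts the loop is safe to vectorise.

/* restrict promises the arrays don't alias; the pragma asserts the loop has
   no cross-iteration dependences, so the compiler may vectorise it. */
void scale(float *restrict dst, const float *restrict src, float k, int n) {
    #pragma omp simd
    for (int i = 0; i < n; i++)
        dst[i] = k * src[i];
}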
Arm SVE
Vectors are 128 to 2048 bits long (in increments of 128 bits)
Agnostic to vector lengths
Nice to compile to
Lots of support for predication
Can vectorise loops whose trip count is not an exact multiple of the vector width without needing peel loops; predicates indicate which lanes are inactive (see the sketch below)
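A sketch of that loop structure with the SVE C intrinsics (ACLE, <arm_sve.h>); names are illustrative. whilelt rebuilds the predicate every iteration, so the final partial iteration simply runs with some lanes switched off instead of needing a scalar peel loop.

#include <arm_sve.h>
#include <stdint.h>

/* Add two float arrays of any length n; no peel loop needed. */
void add_arrays_sve(float *dst, const float *a, const float *b, int64_t n) {
    for (int64_t i = 0; i < n; i += svcntw()) {           /* svcntw() = 32-bit lanes per vector */
        svbool_t pg = svwhilelt_b32_s64(i, n);            /* active lanes for this iteration */
        svfloat32_t va = svld1_f32(pg, a + i);            /* inactive lanes load as zero */
        svfloat32_t vb = svld1_f32(pg, b + i);
        svst1_f32(pg, dst + i, svadd_f32_m(pg, va, vb));  /* only active lanes are stored */
    }
}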
Vector registers
32 vector registers (each LEN x 128 bits long)
DP & SP Floating Point
64, 32, 16, 8 bit integers
Predicate registers
8 lane-mask predicate registers (each LEN x 16 bits)
8 more predicate registers for manipulation
FFR: the First Fault Register, records which lanes of a speculative load completed without faulting
Control registers
Registers to control the vector length, one per privilege level
SVE Predicates
Used to drive loop control.
Overloads the usual NZCV condition flags
N = first element is active
Z = no element is active
C = last element is not active
V = scalarised loop state, else zero
Use of predicates
If the next predicate has Z set (no active lanes), or C set, that tells us not to keep looping
We can branch based on these predicate flags (see the sketch below)
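The same idea at the C level, as a sketch with illustrative names: svptest_first corresponds to testing the N flag (first lane active), and an all-false predicate from whilelt (Z set) ends the loop.

#include <arm_sve.h>
#include <stdint.h>

/* Loop control driven purely by the predicate: keep going while the predicate
   produced by whilelt still has its first lane active (the N-flag condition). */
int64_t vector_iterations(int64_t n) {
    int64_t i = 0, iters = 0;
    svbool_t pg = svwhilelt_b32_s64(i, n);
    while (svptest_first(svptrue_b32(), pg)) {   /* branch on the predicate */
        iters++;
        i += svcntw();
        pg = svwhilelt_b32_s64(i, n);            /* all-false (Z set) ends the loop */
    }
    return iters;
}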
Vector partitioning
Use predication to allow speculative vectorisation
operate on a partition of elements that are “safe” according to dynamic conditions and the predicate
partitions are inherited by nested conditions and loops
Uncounted loops, data-dependent exits
operations with side-effects following a break must not be architecturally performed
operate on a before-break partition, then exit loop if break is detected
Speculative load errors
loads required to detect break condition may fault
operate on a before-fault partition, then iterate until a break is detected
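A sketch that combines the before-break partition and first-fault loads, assuming the SVE ACLE intrinsics (function name is mine): a vectorised strlen can read past the terminator speculatively because the FFR records which lanes actually loaded.

#include <arm_sve.h>
#include <stddef.h>
#include <stdint.h>

/* Vectorised strlen: loads beyond the terminator are speculative.
   The FFR marks which lanes loaded safely; we only operate on that partition. */
size_t sve_strlen(const char *s) {
    size_t i = 0;
    svbool_t all = svptrue_b8();
    for (;;) {
        svsetffr();                                        /* reset the first-fault register */
        svuint8_t v   = svldff1_u8(all, (const uint8_t *)s + i);
        svbool_t  ffr = svrdffr();                         /* lanes that loaded without faulting */
        svbool_t  nul = svcmpeq_n_u8(ffr, v, 0);           /* NUL bytes within the safe partition */
        if (svptest_any(ffr, nul)) {                       /* break condition detected */
            svbool_t before = svbrkb_b_z(ffr, nul);        /* lanes before the first NUL */
            return i + svcntp_b8(ffr, before);
        }
        i += svcntp_b8(all, ffr);                          /* advance past the safely loaded lanes */
    }
}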
Length agnosticism
also uses partitions
partition is defined by dynamic vector length
SVE speedup
Not in all test cases, but in some cases the speedups over NEON are very large
x86 SIMD
Not as prevalent, but there is a wide range of instructions
256-bit vectors of 8- to 64-bit integers and floats
String manipulation, CRC, popcount, unaligned loads, AI/ML instructions
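A minimal AVX sketch in C, assuming <immintrin.h> and a length that is a multiple of 8 (names are illustrative): a 256-bit vector holds eight floats, and the *_loadu_* forms handle unaligned pointers.

#include <immintrin.h>

/* Add two float arrays eight elements at a time with 256-bit AVX vectors.
   Assumes n is a multiple of 8; loadu/storeu tolerate unaligned pointers. */
void add_arrays_avx(float *dst, const float *a, const float *b, int n) {
    for (int i = 0; i < n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(dst + i, _mm256_add_ps(va, vb));
    }
}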
Intel AVX
From Skylake onwards, 512-bit vectors are available (AVX-512)
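AVX-512 also adds mask registers, which play a role similar to SVE predicates; a sketch (illustrative names, assuming <immintrin.h>) of handling the loop tail with a mask instead of a scalar remainder loop:

#include <immintrin.h>

/* 512-bit vectors hold 16 floats; the final partial iteration is handled
   with a mask (similar in spirit to an SVE predicate) rather than a peel loop. */
void scale_avx512(float *dst, const float *src, float k, int n) {
    __m512 vk = _mm512_set1_ps(k);
    int i = 0;
    for (; i + 16 <= n; i += 16)
        _mm512_storeu_ps(dst + i, _mm512_mul_ps(vk, _mm512_loadu_ps(src + i)));
    int rem = n - i;
    if (rem > 0) {
        __mmask16 m = (__mmask16)((1u << rem) - 1);          /* low 'rem' lanes active */
        __m512 v = _mm512_maskz_loadu_ps(m, src + i);        /* masked-off lanes read as 0 */
        _mm512_mask_storeu_ps(dst + i, m, _mm512_mul_ps(vk, v));
    }
}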