What is the advantage of also quantizing activations (inputs) in addition to the weights of a model?
All operations are then in int8.
GPU can leverage custom integer kernels.
What are disadvantages of quantizing activations (inputs)?
Activations change with every new prompt → quantization scale has to be set for every new input.
Dynamic vs. Static quantization of activations (inputs)
Dynamic:
Learn the scale for every input on the fly.
Accurate but slow.
Static:
Use a calibration dataset to learn static scale.
Fast but less accurate.
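A minimal numpy sketch of the difference (illustrative only; the function names and absmax scaling are my own assumptions, not a specific library's API):

```python
import numpy as np

def absmax_scale(x, n_bits=8):
    # Symmetric scale: map the largest |value| onto the largest int level.
    return np.max(np.abs(x)) / (2 ** (n_bits - 1) - 1)

def quantize(x, scale, n_bits=8):
    qmax = 2 ** (n_bits - 1) - 1
    return np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)

# Dynamic: recompute the scale for every incoming activation tensor.
def dynamic_quantize(activations):
    scale = absmax_scale(activations)
    return quantize(activations, scale), scale

# Static: fix the scale once, estimated on a calibration dataset.
def calibrate_static_scale(calibration_batches):
    return max(absmax_scale(x) for x in calibration_batches)

calib = [np.random.randn(16, 64) for _ in range(8)]
static_scale = calibrate_static_scale(calib)

x_new = np.random.randn(16, 64)
q_dyn, _ = dynamic_quantize(x_new)         # accurate, but extra work per input
q_stat = quantize(x_new, static_scale)     # fast, but the scale may not fit x_new
```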
What is the idea of Mixed Precision Quantization?
Problem: Models with more than ~3B parameters contain outlier values that take up valuable quantization range.
Do not quantize the outliers (less than 1% of weights).
Compute them separately in higher precision.
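A rough numpy sketch of the idea, loosely in the spirit of LLM.int8() (the threshold value and function name are assumptions): outlier feature dimensions stay in float, the rest go through an int8 matmul.

```python
import numpy as np

def mixed_precision_matmul(X, W, outlier_threshold=6.0):
    # Feature dimensions of X containing any value above the threshold are
    # treated as outliers and computed in float; all others in int8.
    outlier_cols = np.any(np.abs(X) > outlier_threshold, axis=0)

    # int8 path for the "normal" dimensions (symmetric absmax quantization).
    Xn, Wn = X[:, ~outlier_cols], W[~outlier_cols, :]
    sx = np.max(np.abs(Xn)) / 127
    sw = np.max(np.abs(Wn)) / 127
    Xq = np.round(Xn / sx).astype(np.int8)
    Wq = np.round(Wn / sw).astype(np.int8)
    int8_part = (Xq.astype(np.int32) @ Wq.astype(np.int32)) * (sx * sw)

    # float path for the (few) outlier dimensions.
    fp_part = X[:, outlier_cols] @ W[outlier_cols, :]
    return int8_part + fp_part
```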
What is the idea behind GPTQ?
Minimize loss between output of quantized and unquantized matmul.
Quantizes to fewer than 8 bits, e.g. 4 or 2 bits.
Needs a calibration dataset.
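GPTQ quantizes a layer column by column and compensates the error in the remaining weights using second-order (Hessian) information; the sketch below only shows the layer-wise objective it minimizes, with plain round-to-nearest standing in for the actual quantizer.

```python
import numpy as np

def layerwise_objective(W, X, n_bits=4):
    # Objective: || X @ W - X @ W_q ||^2 over calibration activations X.
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.max(np.abs(W)) / qmax
    W_q = np.round(W / scale) * scale            # naive round-to-nearest baseline
    return np.linalg.norm(X @ W - X @ W_q) ** 2

X = np.random.randn(128, 64)                     # calibration activations
W = np.random.randn(64, 32)
print(layerwise_objective(W, X, n_bits=4))       # GPTQ searches for a W_q that lowers this
```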
What is the idea behind SmoothQuant?
Migrate outliers between activations (many outliers) and weights (few outliers) to use quantization space more efficiently.
Needs a calibration dataset.
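The core trick is a per-channel rescaling X' = X / s, W' = s · W that leaves the matmul result unchanged but moves activation outliers into the weights. A small numpy sketch (the alpha-balanced scale follows the SmoothQuant formulation; the rest is illustrative):

```python
import numpy as np

def smooth(X, W, alpha=0.5):
    # Per input channel: balance activation range against weight range.
    act_max = np.max(np.abs(X), axis=0)       # per-channel activation magnitude
    w_max = np.max(np.abs(W), axis=1)         # per-channel weight magnitude
    s = act_max ** alpha / (w_max ** (1 - alpha) + 1e-12)
    X_smooth = X / s                          # activations get easier to quantize
    W_smooth = W * s[:, None]                 # weights absorb the scale
    return X_smooth, W_smooth

X = np.random.randn(8, 64) * np.random.uniform(0.1, 20, size=64)  # channel outliers
W = np.random.randn(64, 32)
Xs, Ws = smooth(X, W)
print(np.allclose(Xs @ Ws, X @ W))            # True: the output is unchanged
```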
What is the idea behind AWQ?
Leave important weights in bf16 - only quantize the others.
Scale weights and activations before quantization.
Learn quantization parameters by solving an optimization problem over data.
Needs a calibration dataset.
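A toy numpy sketch of the "keep the important weights in high precision" part (the importance criterion and keep_frac are simplifications; AWQ itself learns per-channel scales rather than keeping a float copy):

```python
import numpy as np

def awq_style_quantize(W, X_calib, keep_frac=0.01, n_bits=4):
    # Rank input channels by average activation magnitude on calibration data;
    # the top keep_frac "important" rows of W stay in float, the rest get quantized.
    importance = np.mean(np.abs(X_calib), axis=0)
    n_keep = max(1, int(keep_frac * W.shape[0]))
    keep = np.argsort(importance)[-n_keep:]

    quantize_mask = np.ones(W.shape[0], dtype=bool)
    quantize_mask[keep] = False

    qmax = 2 ** (n_bits - 1) - 1
    scale = np.max(np.abs(W[quantize_mask])) / qmax
    W_out = W.copy()
    W_out[quantize_mask] = np.round(W[quantize_mask] / scale) * scale
    return W_out                               # rows in `keep` left untouched
```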
What are 6 key considerations in quantization?
Do we need a calibration dataset?
Is the quantization static / dynamic?
Quantize only weights? Or activations too?
Learn quantization parameters from data or make heuristic choices?
Does my strategy have hardware support?
Space saving ≠ speed up.
What is Magnitude Pruning?
Prune (set to 0) all weights whose magnitude is below a threshold.
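In code this is essentially a one-liner; a numpy sketch (threshold chosen arbitrarily):

```python
import numpy as np

def magnitude_prune(W, threshold=0.05):
    # Set every weight with |w| < threshold to exactly 0.
    return np.where(np.abs(W) < threshold, 0.0, W)

W = np.random.randn(4, 4)
print(magnitude_prune(W, threshold=0.5))
```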
What is WandA (weights and activation pruning)?
Prune weights with low importance according to the |W| · ‖X‖₂ metric (weight magnitude times the ℓ₂ norm of the corresponding input activations).
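A simplified numpy sketch of the criterion (global ranking for brevity; the paper ranks weights per output row):

```python
import numpy as np

def wanda_prune(W, X, sparsity=0.5):
    # Importance of W[i, j] = |W[i, j]| * ||X[:, i]||_2, i.e. weight magnitude
    # times the L2 norm of its input activation channel over calibration tokens.
    act_norm = np.linalg.norm(X, axis=0)
    importance = np.abs(W) * act_norm[:, None]
    k = int(sparsity * W.size)
    cutoff = np.partition(importance.ravel(), k)[k]
    return np.where(importance < cutoff, 0.0, W)

X = np.random.randn(256, 64)                   # calibration activations
W = np.random.randn(64, 32)
print((wanda_prune(W, X) == 0).mean())         # ≈ 0.5
```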
What is structured pruning?
Prune a fixed ratio of weights (e.g. 2:4, i.e. 2 out of every 4) within each small block of weights.
Can leverage hardware acceleration.
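A numpy sketch of the 2:4 pattern (keep the 2 largest magnitudes in every group of 4 consecutive weights); assumes the weight count is divisible by 4:

```python
import numpy as np

def prune_2_of_4(W):
    # In every consecutive group of 4 weights, zero the 2 smallest magnitudes.
    groups = W.reshape(-1, 4)
    smallest = np.argsort(np.abs(groups), axis=1)[:, :2]
    pruned = groups.copy()
    np.put_along_axis(pruned, smallest, 0.0, axis=1)
    return pruned.reshape(W.shape)

W = np.random.randn(8, 16)
print((prune_2_of_4(W) != 0).mean())           # exactly 0.5, in a hardware-friendly pattern
```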
How is a 32-bit float structured?
1 bit for sign (+,-)
8-bit biased exponent => range ≈ [10⁻³⁸, 10³⁸].
23-bit fraction (plus an implicit leading bit) => ~7 decimal digits of precision.
=> 32 bits is too large for modern LLMs
How is a 16-bit float structured?
1 bit for sign (+,-)
5-bit biased exponent => range ≈ [10⁻⁴, 10⁴].
10-bit fraction => ~3 decimal digits of precision.
=> Range is too small for LLMs
How is a bfloat16 structured?
Idea: Less bits for precision, more for range
1 bit for sign (+,-)
8-bit biased exponent => range ≈ [10⁻³⁸, 10³⁸].
7-bit fraction => ~2 decimal digits of precision.
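Because bfloat16 keeps exactly fp32's sign and exponent bits, converting amounts to dropping the lower 16 fraction bits. A small numpy bit-level sketch (simple truncation, no rounding):

```python
import numpy as np

def float32_to_bfloat16_bits(x):
    # Keep the upper 16 bits: sign (1) + exponent (8) + fraction (7).
    bits = np.float32(x).view(np.uint32)
    return np.uint16(bits >> np.uint32(16))

def bfloat16_bits_to_float32(b):
    return (np.uint32(b) << np.uint32(16)).view(np.float32)

x = np.float32(3.14159265)
print(bfloat16_bits_to_float32(float32_to_bfloat16_bits(x)))   # ≈ 3.14 (~2 digits kept)
```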
What is the idea behind Quantizing weights?
Map float values to int8 (2⁸ = 256 distinct values).
Uses half as much space as bfloat16.
int8 operations can be computed much faster (hardware acceleration).
→ Results in some errors, but no big difference in performance.
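A quick numpy sketch of the space/accuracy trade-off (absmax scaling; matrix size arbitrary):

```python
import numpy as np

W = np.random.randn(1024, 1024).astype(np.float32)

scale = np.max(np.abs(W)) / 127
W_int8 = np.round(W / scale).astype(np.int8)      # 1 byte per weight
W_back = W_int8.astype(np.float32) * scale        # dequantized for comparison

print(W_int8.nbytes / (W.size * 2))               # 0.5 -> half of bfloat16's 2 bytes/weight
print(np.max(np.abs(W - W_back)))                 # error bounded by ~scale/2
```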
Symmetric Quantization vs. Asymmetric Quantization
Symmetric Quantization:
Zero point of the original range and the quantized range coincide (float 0 maps to int 0).
Min / max of the range are negatives of each other.
Asymmetric Quantization:
Zero points do not coincide; a zero-point offset is stored.
Uses the available range more efficiently → more precision than symmetric.
=> Both have problems with outliers. Can be solved by clipping weights to a pre-determined range.
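A numpy sketch of both schemes (8-bit, absmax/min-max scaling; the optional clip argument illustrates the outlier fix mentioned above):

```python
import numpy as np

def quant_symmetric(x, n_bits=8, clip=None):
    # Zero maps to zero; the range is [-absmax, +absmax], optionally clipped.
    qmax = 2 ** (n_bits - 1) - 1
    absmax = clip if clip is not None else np.max(np.abs(x))
    scale = absmax / qmax
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return q, scale                                # dequantize: q * scale

def quant_asymmetric(x, n_bits=8):
    # Uses the full [min, max] range; a zero point shifts the integer grid,
    # so no levels are wasted when the data is not centred around 0.
    lo, hi = np.min(x), np.max(x)
    scale = (hi - lo) / (2 ** n_bits - 1)
    zero_point = np.round(-lo / scale)
    q = np.clip(np.round(x / scale) + zero_point, 0, 2 ** n_bits - 1).astype(np.uint8)
    return q, scale, zero_point                    # dequantize: (q - zero_point) * scale

x = np.random.randn(1024) + 2.0                    # skewed, not centred at 0
q_s, s_s = quant_symmetric(x)
q_a, s_a, zp = quant_asymmetric(x)
print(np.abs(q_s * s_s - x).mean(), np.abs((q_a - zp) * s_a - x).mean())
```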
What is the idea behind Mixture of Experts (MoE)?
Parameters of the model are split into disjoint blocks → “experts”.
Each token (input) is passed through a single (or a few) expert block(s).
A router decides which block(s) should process the token.
=> No memory saving (all experts need to be on GPU), but better parallelization = faster inference.
How does Routing work in Mixture of Experts (MoE)?
Router is a small fully connected NN.
Outputs softmax probabilities for all experts.
Mixtral: weigh each selected expert's output by its softmax probability.
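A single-token numpy sketch of top-k routing with softmax-weighted expert outputs (matrix shapes and top_k=2 are assumptions in the style of Mixtral):

```python
import numpy as np

def moe_layer(x, experts, router_W, top_k=2):
    # Router: linear layer + softmax over experts.
    logits = x @ router_W
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    chosen = np.argsort(probs)[-top_k:]              # top-k experts for this token
    gate = probs[chosen] / probs[chosen].sum()       # renormalized gating weights

    # Only the chosen experts run; their outputs are mixed by the gate weights.
    return sum(g * (x @ experts[e]) for g, e in zip(gate, chosen))

d, n_experts = 64, 8
x = np.random.randn(d)                               # one token embedding
experts = [np.random.randn(d, d) for _ in range(n_experts)]
router_W = np.random.randn(d, n_experts)
print(moe_layer(x, experts, router_W).shape)         # (64,)
```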
What is the idea behind Sparse Attention?
Reduce the O(N²) attention cost.
Only compute attention for specific tokens.
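A toy numpy sketch of one sparse pattern (a local sliding window); for clarity the full score matrix is still built and then masked, whereas a real implementation would skip the masked blocks entirely:

```python
import numpy as np

def local_window_attention(Q, K, V, window=4):
    # Each query attends only to keys within `window` positions of itself.
    N, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)
    pos = np.arange(N)
    scores[np.abs(pos[:, None] - pos[None, :]) > window] = -np.inf
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ V

Q = K = V = np.random.randn(16, 8)
print(local_window_attention(Q, K, V).shape)         # (16, 8)
```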
What is the H2O eviction strategy for KV-Caching?
Only a few tokens are responsible for most of the attention score → Can drop / zero-out most previous tokens from cache.
Important tokens can be found through an Attention Sparsity Score over the LLM's vocabulary.
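A rough numpy sketch of the eviction decision (the budget, recency window, and accumulated-score criterion are simplifications of the H2O heuristic):

```python
import numpy as np

def h2o_keep_indices(attn, cache_budget=64, recent=8):
    # attn: (n_queries, n_keys) attention weights observed so far.
    # Heavy hitters = keys with the largest accumulated attention mass.
    accumulated = attn.sum(axis=0)
    n_keys = attn.shape[1]
    heavy = np.argsort(accumulated)[::-1][:cache_budget]
    latest = np.arange(max(0, n_keys - recent), n_keys)   # always keep recent tokens
    return np.union1d(heavy, latest)                      # evict all other KV entries
```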
What are Attention Sinks?
Tokens at the beginning receive large attention scores, even if they are semantically unimportant.
Happens because the softmax in attention must sum to 1 and the initial tokens are visible from every other token (under causal masking).
=> Takeaway: Don’t evict initial tokens.
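In practice (as in StreamingLLM-style caches) this means always keeping a few initial "sink" tokens plus a sliding window of recent tokens; a tiny sketch of that keep-set:

```python
import numpy as np

def sink_plus_window_keep(n_cached, n_sink=4, window=512):
    # Keep the first n_sink tokens (attention sinks) and the most recent `window`
    # tokens; everything in between may be evicted from the KV cache.
    sinks = np.arange(min(n_sink, n_cached))
    recent = np.arange(max(n_sink, n_cached - window), n_cached)
    return np.concatenate([sinks, recent])
```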
What is FlashAttention?
Hardware-aware optimization technique.
Use Fused Kernels to perform all attention operations at once → no back and forth between SRAM and HBM.
Matrix Tiling → load only parts of Q, K and V into SRAM.
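A numpy sketch of the tiling idea with an online softmax: K/V are processed tile by tile and running max/sum statistics are kept per query, so the full N×N score matrix is never stored (real FlashAttention does this inside a fused GPU kernel with Q tiled as well):

```python
import numpy as np

def tiled_attention(Q, K, V, tile=64):
    N, d = Q.shape
    out = np.zeros_like(Q)
    row_max = np.full(N, -np.inf)            # running max of scores per query
    row_sum = np.zeros(N)                    # running softmax denominator

    for start in range(0, N, tile):
        Kt, Vt = K[start:start + tile], V[start:start + tile]
        S = Q @ Kt.T / np.sqrt(d)                        # scores for this tile only

        new_max = np.maximum(row_max, S.max(axis=1))
        correction = np.exp(row_max - new_max)           # rescale earlier partial sums
        P = np.exp(S - new_max[:, None])

        out = out * correction[:, None] + P @ Vt
        row_sum = row_sum * correction + P.sum(axis=1)
        row_max = new_max

    return out / row_sum[:, None]

Q, K, V = (np.random.randn(256, 32) for _ in range(3))
S = Q @ K.T / np.sqrt(32)
P = np.exp(S - S.max(axis=1, keepdims=True))
reference = (P / P.sum(axis=1, keepdims=True)) @ V
print(np.allclose(tiled_attention(Q, K, V), reference))  # True
```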