3-5. Efficient Inference with Transformers


22 Terms

1

What is the advantage of also quantizing activations (inputs) in addition to the weights of a model?

  • All operations are then in int8.

  • GPU can leverage custom integer kernels.

2

What are disadvantages of quantizing activations (inputs)?

  • Activations change with every new prompt → quantization scale has to be set for every new input.

3

Dynamic vs. Static quantization of activations (inputs)

Dynamic:

  • Learn the scale for every input on the fly.

  • Accurate but slow.

Static:

  • Use a calibration dataset to learn static scale.

  • Fast but less accurate.
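
A minimal PyTorch sketch of the two options, assuming symmetric per-tensor int8 scales (the function names and the [-127, 127] range are illustrative, not from a specific library):

```python
import torch

def dynamic_scale(x: torch.Tensor) -> torch.Tensor:
    # Dynamic: compute the scale from the current activation tensor itself.
    # Accurate for this exact input, but adds work to every forward pass.
    return x.abs().max() / 127.0

def static_scale(calibration_batches) -> torch.Tensor:
    # Static: estimate one fixed scale from a calibration set ahead of time.
    # Cheap at inference time, but may clip or waste range on unseen inputs.
    return max(x.abs().max() for x in calibration_batches) / 127.0

def quantize_int8(x: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return torch.clamp((x / scale).round(), -127, 127).to(torch.int8)
```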

4

What is the idea of Mixed Precision Quantization?

Problem: Models with more than ~3B parameters have outlier values that take up valuable quantization range.

  • Do not quantize outliers (less than 1% of weights).

  • Compute them separately (kept in the original float precision).

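A rough sketch of the split, loosely in the spirit of LLM.int8() (the threshold of 6.0 and the column-wise outlier criterion are assumptions; the "int8 path" is shown in float for brevity):

```python
import torch

def mixed_precision_matmul(x, w, threshold=6.0):
    # Feature dimensions whose activations exceed the threshold are treated as
    # outliers and kept in the original float precision (< 1% of the matrix).
    outlier_cols = (x.abs() > threshold).any(dim=0)

    # Regular dimensions: in a real implementation this matmul runs in int8.
    y_regular = x[:, ~outlier_cols] @ w[~outlier_cols, :]

    # Outlier dimensions: computed separately in float.
    y_outlier = x[:, outlier_cols] @ w[outlier_cols, :]

    return y_regular + y_outlier
```
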
5

What is the idea behind GPTQ?

  • Minimize the error between the outputs of the quantized and unquantized matmul.

  • Quantizes to fewer than 8 bits, e.g. 4 or 2 bits.

  • Needs a calibration dataset.

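The objective as a sketch (this only states what GPTQ minimizes over a calibration batch; the actual column-by-column, Hessian-based quantization procedure is omitted):

```python
import torch

def reconstruction_error(W, W_q, X_calib):
    # GPTQ target: argmin_{W_q} || W X - W_q X ||_F^2 over calibration inputs,
    # i.e. match the layer *output*, not the weights themselves.
    return ((W @ X_calib - W_q @ X_calib) ** 2).sum()
```
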
6

What is the idea behind SmoothQuant?

  • Migrate outliers between activations (many outliers) and weights (few outliers) to use quantization space more efficiently.

  • Needs a calibration dataset.

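A minimal sketch of the migration step (per-channel scales with an α exponent, as in the SmoothQuant paper; α = 0.5 is an assumption):

```python
import torch

def smooth(X, W, alpha=0.5):
    # Per input channel j: s_j = max|X_j|^alpha / max|W_j|^(1 - alpha).
    # Dividing X by s and multiplying W by s leaves X @ W unchanged, but moves
    # activation outliers partly into the (easier to quantize) weights.
    s = X.abs().amax(dim=0) ** alpha / W.abs().amax(dim=1) ** (1 - alpha)
    return X / s, W * s.unsqueeze(1)        # X: (tokens, d_in), W: (d_in, d_out)
```
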
7

What is the idea behind AWQ?

  • Leave important weights in bf16 - only quantize the others.

  • Scale weights and activations before quantization.

  • Learn quantization parameters by solving an optimization problem over data.

  • Needs a calibration dataset.

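A rough sketch of the "learn the scales from data" part, loosely in the spirit of AWQ (the grid search over α and the simple symmetric int8 round trip are simplifying assumptions, not the paper's exact procedure):

```python
import torch

def int8_roundtrip(W):
    scale = W.abs().max() / 127.0
    return torch.clamp((W / scale).round(), -127, 127) * scale

def search_channel_scales(X, W, grid=(0.0, 0.25, 0.5, 0.75, 1.0)):
    # Important input channels (large activations) get scaled up before
    # quantization; pick the exponent that best preserves the layer output.
    act_mag = X.abs().amax(dim=0).clamp(min=1e-5)
    best_err, best_s = float("inf"), None
    for alpha in grid:
        s = act_mag ** alpha
        W_q = int8_roundtrip(W * s.unsqueeze(1)) / s.unsqueeze(1)
        err = ((X @ W - X @ W_q) ** 2).sum()
        if err < best_err:
            best_err, best_s = err, s
    return best_s
```
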
8

What are 6 key considerations in quantization?

  • Do we need a calibration dataset?

  • Is the quantization static / dynamic?

  • Quantize only weights? Or activations too?

  • Learn quantization parameters from data or make heuristic choices?

  • Does my strategy have hardware support?

  • Space saving ≠ speed up.

9

What is Magnitude Pruning?

Prune (set to 0) weights whose magnitude is below a threshold.

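As a one-liner (the threshold value is arbitrary):

```python
import torch

def magnitude_prune(W, threshold=1e-2):
    # Zero out every weight whose absolute value is below the threshold.
    return torch.where(W.abs() < threshold, torch.zeros_like(W), W)
```
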
10

What is WandA (weights and activation pruning)?

Prune weights with low importance according to the |W_ij| · ‖X_j‖₂ metric (weight magnitude times the ℓ₂ norm of the corresponding input feature).

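A minimal sketch of the metric, pruning each output row to a target sparsity (the 50% sparsity and the per-row ranking are illustrative assumptions):

```python
import torch

def wanda_prune(W, X, sparsity=0.5):
    # Importance of W[i, j] = |W[i, j]| * ||X_j||_2, where X_j is the j-th
    # input feature collected over the calibration tokens.
    importance = W.abs() * X.norm(dim=0)        # W: (d_out, d_in), X: (tokens, d_in)
    k = int(W.shape[1] * sparsity)
    idx = importance.argsort(dim=1)[:, :k]      # least important weights per row
    return W.clone().scatter_(1, idx, 0.0)
```
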
11

What is structured pruning?

  • Prune a fixed ratio of weights (e.g. 2 out of every 4) within each group/structure.

  • Can leverage hardware acceleration.

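A sketch of 2:4 pruning: keep the 2 largest-magnitude weights in every group of 4 along the input dimension (assumes that dimension is divisible by 4):

```python
import torch

def prune_2_of_4(W):
    out_dim, in_dim = W.shape
    groups = W.reshape(out_dim, in_dim // 4, 4)
    keep = groups.abs().topk(k=2, dim=-1).indices          # 2 largest per group of 4
    mask = torch.zeros_like(groups).scatter_(-1, keep, 1.0)
    return (groups * mask).reshape(out_dim, in_dim)
```
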
12

How is a 32-bit float structured?

  • 1 bit for sign (+,-)

  • 8-bit biased exponent => range ≈ [10⁻³⁸, 10³⁸].

  • 23-bit fraction (24-bit significand with the implicit leading 1) => ~7 decimal digits of precision.

=> 32 bits is too large for modern LLMs

13

How is a 16-bit float structured?

  • 1 bit for sign (+,-)

  • 5-bit biased exponent => range ≈ [10⁻⁴, 10⁴].

  • 10-bit fraction => ~3 decimal digits of precision.

=> Range is too small for LLMs

14

How is a bfloat16 structured?

Idea: fewer bits for precision, more bits for range

  • 1 bit for sign (+,-)

  • 8-bit biased exponent => range ≈ [10⁻³⁸, 10³⁸].

  • 7-bit fraction => ~2 decimal digits of precision.

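The three formats can be compared directly in PyTorch, which roughly confirms the ranges and precision above:

```python
import torch

for dtype in (torch.float32, torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    # bits: total width, max: largest representable value,
    # eps: gap between 1.0 and the next representable number (precision)
    print(f"{dtype}: bits={info.bits}, max={info.max:.2e}, eps={info.eps:.1e}")
```
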
15

What is the idea behind Quantizing weights?

  • Map float values to int8 (256 distinct values).

  • Uses half as much space as bfloat16.

  • int8 operations can be computed much faster (hardware acceleration).

→ Results in some errors, but no big difference in performance.

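A minimal symmetric int8 round trip (per-tensor scale; real libraries usually use per-channel or per-group scales):

```python
import torch

def quantize_weights(W):
    scale = W.abs().max() / 127.0                      # map [-max|W|, max|W|] onto [-127, 127]
    W_int8 = torch.clamp((W / scale).round(), -127, 127).to(torch.int8)
    return W_int8, scale

def dequantize(W_int8, scale):
    return W_int8.float() * scale                      # a small rounding error remains

W = torch.randn(1024, 1024) * 0.02
W_int8, scale = quantize_weights(W)
print((W - dequantize(W_int8, scale)).abs().max())     # error on the order of the scale
```
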
16

Symmetric Quantization vs. Asymmetric Quantization

Symmetric Quantization:

  • Zero points of the original and quantized ranges match (float 0.0 maps to integer 0).

  • Min / Max are negatives of each other.

Asymmetric Quantization:

  • Zero points do not match (a zero-point offset is stored).

  • More precision than symmetric, since the full integer range is used even for skewed value distributions.

=> Both have problems with outliers. Can be solved by clipping weights to a pre-determined range.

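A sketch of the asymmetric variant with optional clipping (the zero-point formula is the common one; the clip range itself would be chosen beforehand, e.g. from calibration data):

```python
import torch

def asymmetric_quantize(x, clip_min=None, clip_max=None):
    # Optional clipping to a predetermined range limits the influence of outliers.
    if clip_min is not None and clip_max is not None:
        x = x.clamp(clip_min, clip_max)
    x_min, x_max = x.min(), x.max()
    scale = (x_max - x_min) / 255.0                    # use the full int8 range [-128, 127]
    zero_point = (-128 - x_min / scale).round()        # integer that float 0.0 maps to
    q = torch.clamp((x / scale + zero_point).round(), -128, 127).to(torch.int8)
    return q, scale, zero_point                        # dequantize: (q - zero_point) * scale
```
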
17

What is the idea behind Mixture of Experts (MoE)?

  • Parameters of the model are split into disjoint blocks → “experts”.

  • Each token (input) is passed through only a single (or a few) expert block(s).

  • A router decides which block should process the token.

=> No memory saving (all experts need to be on GPU), but better parallelization = faster inference.

18

How does Routing work in Mixture of Experts (MoE)?

  • The router is a fully connected NN (typically a single linear layer).

  • Outputs softmax probabilities over all experts.

  • Mistral (Mixtral): weigh each selected expert's output by its softmax probability.

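A minimal top-2 router sketch (linear gate + softmax weighting of the selected experts' outputs, Mixtral-style; the dimensions, top_k=2 and the renormalization are illustrative choices):

```python
import torch
import torch.nn.functional as F

class TopKRouter(torch.nn.Module):
    def __init__(self, d_model=4096, n_experts=8, top_k=2):
        super().__init__()
        self.gate = torch.nn.Linear(d_model, n_experts, bias=False)
        self.top_k = top_k

    def forward(self, x, experts):
        # x: (tokens, d_model); experts: list of n_experts feed-forward modules
        probs = F.softmax(self.gate(x), dim=-1)            # routing probabilities
        weights, idx = probs.topk(self.top_k, dim=-1)      # top-k experts per token
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(experts):
                sel = idx[:, slot] == e                    # tokens routed to expert e
                if sel.any():
                    out[sel] += weights[sel, slot].unsqueeze(-1) * expert(x[sel])
        return out
```
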
19

What is the idea behind Sparse Attention?

  • Reduce the O(N²) attention cost.

  • Only compute attention for specific tokens.

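One common pattern is a causal sliding-window mask; a toy sketch below (the window size is arbitrary, and unlike real sparse-attention kernels this version still materializes the full N×N score matrix):

```python
import torch

def local_mask(n, window=64):
    # Token i may attend only to tokens j with i - window < j <= i.
    i = torch.arange(n).unsqueeze(1)
    j = torch.arange(n).unsqueeze(0)
    return (j <= i) & (j > i - window)

def local_attention(q, k, v, window=64):
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    scores = scores.masked_fill(~local_mask(q.shape[-2], window), float("-inf"))
    return torch.softmax(scores, dim=-1) @ v
```
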
20

What is the H2O eviction strategy for KV-Caching?

  • Only a few tokens are responsible for most of the attention score → Can drop / zero-out most previous tokens from cache.

  • Important tokens can be found through an Attention Sparsity Score over the LLM's vocabulary.

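A rough sketch of score-based eviction (the cache budget and the use of accumulated attention scores follow the H2O idea; the exact bookkeeping is an assumption):

```python
import torch

def evict(k_cache, v_cache, acc_attn, budget=256):
    # acc_attn: (cached_len,) attention mass each cached token has received so far.
    if k_cache.shape[0] <= budget:
        return k_cache, v_cache, acc_attn
    keep = acc_attn.topk(budget).indices.sort().values     # keep "heavy hitters", preserve order
    return k_cache[keep], v_cache[keep], acc_attn[keep]
```
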
21

What are Attention Sinks?

  • Tokens in the beginning take up large attention scores, even if they are semantically unimportant.

  • Happens because the attention softmax needs to sum to 1 and the initial token is visible from every other (later) token in causal attention.

=> Takeaway: Don’t evict initial tokens.

22

What is FlashAttention?

  • Hardware-aware optimization technique.

  • Use Fused Kernels to perform all attention operations at once → no back and forth between SRAM and HBM.

  • Matrix Tiling → load only parts of Q, K and V into SRAM.

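In practice you rarely write this yourself: PyTorch exposes fused attention kernels through torch.nn.functional.scaled_dot_product_attention (whether a FlashAttention kernel is actually picked depends on dtype, shapes, hardware and PyTorch version; the example assumes a CUDA GPU):

```python
import torch
import torch.nn.functional as F

q = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.bfloat16)   # (batch, heads, seq, head_dim)
k, v = torch.randn_like(q), torch.randn_like(q)

# Dispatches to a fused (FlashAttention-style) kernel when one is available,
# so the full seq x seq attention matrix never has to round-trip through HBM.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```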