3-5. Efficient Inference with Transformers


22 Terms

1

What is the advantage of also quantizing activations (inputs) in addition to the weights of a model?

  • All operations are then in int8.

  • GPU can leverage custom integer kernels.

2

What are disadvantages of quantizing activations (inputs)?

  • Activations change with every new prompt → quantization scale has to be set for every new input.

3

Dynamic vs. Static quantization of activations (inputs)

Dynamic:

  • Learn the scale for every input on the fly.

  • Accurate but slow.

Static:

  • Use a calibration dataset to learn static scale.

  • Fast but less accurate.
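
A minimal PyTorch sketch of the two options, assuming symmetric per-tensor int8 scales (the function names and the [-127, 127] range are illustrative, not from a specific library):

```python
import torch

def dynamic_scale(x: torch.Tensor) -> torch.Tensor:
    # Dynamic: compute the scale from the current activation tensor itself.
    # Accurate for this exact input, but adds work to every forward pass.
    return x.abs().max() / 127.0

def static_scale(calibration_batches) -> torch.Tensor:
    # Static: estimate one fixed scale from a calibration set ahead of time.
    # Cheap at inference time, but may clip or waste range on unseen inputs.
    return max(x.abs().max() for x in calibration_batches) / 127.0

def quantize_int8(x: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return torch.clamp((x / scale).round(), -127, 127).to(torch.int8)
```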

4

What is the idea of Mixed Precision Quantization?

Problem: Models with more than ~3B parameters have outlier values that take up valuable quantization range.

  • Do not quantize outliers (less than 1% of weights).

  • Compute them separately (kept in the original float precision).

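A rough sketch of the split, loosely in the spirit of LLM.int8() (the threshold of 6.0 and the column-wise outlier criterion are assumptions; the "int8 path" is shown in float for brevity):

```python
import torch

def mixed_precision_matmul(x, w, threshold=6.0):
    # Feature dimensions whose activations exceed the threshold are treated as
    # outliers and kept in the original float precision (< 1% of the matrix).
    outlier_cols = (x.abs() > threshold).any(dim=0)

    # Regular dimensions: in a real implementation this matmul runs in int8.
    y_regular = x[:, ~outlier_cols] @ w[~outlier_cols, :]

    # Outlier dimensions: computed separately in float.
    y_outlier = x[:, outlier_cols] @ w[outlier_cols, :]

    return y_regular + y_outlier
```
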
5

What is the idea behind GPTQ?

  • Minimize the error between the outputs of the quantized and unquantized matmul.

  • Quantizes to fewer than 8 bits, e.g. 4 or 2 bits.

  • Needs a calibration dataset.

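The objective as a sketch (this only states what GPTQ minimizes over a calibration batch; the actual column-by-column, Hessian-based quantization procedure is omitted):

```python
import torch

def reconstruction_error(W, W_q, X_calib):
    # GPTQ target: argmin_{W_q} || W X - W_q X ||_F^2 over calibration inputs,
    # i.e. match the layer *output*, not the weights themselves.
    return ((W @ X_calib - W_q @ X_calib) ** 2).sum()
```
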
6

What is the idea behind SmoothQuant?

  • Migrate outliers between activations (many outliers) and weights (few outliers) to use quantization space more efficiently.

  • Needs a calibration dataset.

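A minimal sketch of the migration step (per-channel scales with an α exponent, as in the SmoothQuant paper; α = 0.5 is an assumption):

```python
import torch

def smooth(X, W, alpha=0.5):
    # Per input channel j: s_j = max|X_j|^alpha / max|W_j|^(1 - alpha).
    # Dividing X by s and multiplying W by s leaves X @ W unchanged, but moves
    # activation outliers partly into the (easier to quantize) weights.
    s = X.abs().amax(dim=0) ** alpha / W.abs().amax(dim=1) ** (1 - alpha)
    return X / s, W * s.unsqueeze(1)        # X: (tokens, d_in), W: (d_in, d_out)
```
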
7

What is the idea behind AWQ?

  • Leave important weights in bf16 - only quantize the others.

  • Scale weights and activations before quantization.

  • Learn quantization parameters by solving an optimization problem over data.

  • Needs a calibration dataset.

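A rough sketch of the "learn the scales from data" part, loosely in the spirit of AWQ (the grid search over α and the simple symmetric int8 round trip are simplifying assumptions, not the paper's exact procedure):

```python
import torch

def int8_roundtrip(W):
    scale = W.abs().max() / 127.0
    return torch.clamp((W / scale).round(), -127, 127) * scale

def search_channel_scales(X, W, grid=(0.0, 0.25, 0.5, 0.75, 1.0)):
    # Important input channels (large activations) get scaled up before
    # quantization; pick the exponent that best preserves the layer output.
    act_mag = X.abs().amax(dim=0).clamp(min=1e-5)
    best_err, best_s = float("inf"), None
    for alpha in grid:
        s = act_mag ** alpha
        W_q = int8_roundtrip(W * s.unsqueeze(1)) / s.unsqueeze(1)
        err = ((X @ W - X @ W_q) ** 2).sum()
        if err < best_err:
            best_err, best_s = err, s
    return best_s
```
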
8

What are 6 key considerations in quantization?

  • Do we need a calibration dataset?

  • Is the quantization static / dynamic?

  • Quantize only weights? Or activations too?

  • Learn quantization parameters from data or make heuristic choices?

  • Does my strategy have hardware support?

  • Space saving ≠ speed up.

9

What is Magnitude Pruning?

Prune (set to 0) weights whose magnitude is below a threshold.

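As a one-liner (the threshold value is arbitrary):

```python
import torch

def magnitude_prune(W, threshold=1e-2):
    # Zero out every weight whose absolute value is below the threshold.
    return torch.where(W.abs() < threshold, torch.zeros_like(W), W)
```
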
10

What is WandA (weights and activation pruning)?

Prune weights with low importance according to the |W_ij| · ‖X_j‖₂ metric (weight magnitude times the ℓ₂ norm of the corresponding input feature).

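A minimal sketch of the metric, pruning each output row to a target sparsity (the 50% sparsity and the per-row ranking are illustrative assumptions):

```python
import torch

def wanda_prune(W, X, sparsity=0.5):
    # Importance of W[i, j] = |W[i, j]| * ||X_j||_2, where X_j is the j-th
    # input feature collected over the calibration tokens.
    importance = W.abs() * X.norm(dim=0)        # W: (d_out, d_in), X: (tokens, d_in)
    k = int(W.shape[1] * sparsity)
    idx = importance.argsort(dim=1)[:, :k]      # least important weights per row
    return W.clone().scatter_(1, idx, 0.0)
```
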
11

What is structured pruning?

  • Prune a fixed ratio of weights (e.g. 2 out of every 4) within each group/structure.

  • Can leverage hardware acceleration.

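A sketch of 2:4 pruning: keep the 2 largest-magnitude weights in every group of 4 along the input dimension (assumes that dimension is divisible by 4):

```python
import torch

def prune_2_of_4(W):
    out_dim, in_dim = W.shape
    groups = W.reshape(out_dim, in_dim // 4, 4)
    keep = groups.abs().topk(k=2, dim=-1).indices          # 2 largest per group of 4
    mask = torch.zeros_like(groups).scatter_(-1, keep, 1.0)
    return (groups * mask).reshape(out_dim, in_dim)
```
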
12

How is a 32-bit float structured?

  • 1 bit for sign (+,-)

  • 8-bit biased exponent => range ≈ [10⁻³⁸, 10³⁸].

  • 23-bit fraction (24-bit significand with the implicit leading 1) => ~7 decimal digits of precision.

=> 32 bits is too large for modern LLMs

13

How is a 16-bit float structured?

  • 1 bit for sign (+,-)

  • 5-bit biased exponent => range ≈ [10⁻⁴, 10⁴].

  • 10-bit fraction => ~3 decimal digits of precision.

=> Range is too small for LLMs

14

How is a bfloat16 structured?

Idea: fewer bits for precision, more bits for range

  • 1 bit for sign (+,-)

  • 8-bit biased exponent => range ≈ [10⁻³⁸, 10³⁸].

  • 7-bit fraction => ~2 decimal digits of precision.

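The three formats can be compared directly in PyTorch, which roughly confirms the ranges and precision above:

```python
import torch

for dtype in (torch.float32, torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    # bits: total width, max: largest representable value,
    # eps: gap between 1.0 and the next representable number (precision)
    print(f"{dtype}: bits={info.bits}, max={info.max:.2e}, eps={info.eps:.1e}")
```
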
15

What is the idea behind Quantizing weights?

  • Map float values to int8 (256 distinct values).

  • Uses half as much space as bfloat16.

  • int8 operations can be computed much faster (hardware acceleration).

→ Results in some errors, but no big difference in performance.

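A minimal symmetric int8 round trip (per-tensor scale; real libraries usually use per-channel or per-group scales):

```python
import torch

def quantize_weights(W):
    scale = W.abs().max() / 127.0                      # map [-max|W|, max|W|] onto [-127, 127]
    W_int8 = torch.clamp((W / scale).round(), -127, 127).to(torch.int8)
    return W_int8, scale

def dequantize(W_int8, scale):
    return W_int8.float() * scale                      # a small rounding error remains

W = torch.randn(1024, 1024) * 0.02
W_int8, scale = quantize_weights(W)
print((W - dequantize(W_int8, scale)).abs().max())     # error on the order of the scale
```
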
16

Symmetric Quantization vs. Asymmetric Quantization

Symmetric Quantization:

  • Zero points of the original and quantized ranges match (float 0.0 maps to integer 0).

  • Min / Max are negatives of each other.

Asymmetric Quantization:

  • Zero points do not match (a zero-point offset is stored).

  • More precision than symmetric, since the full integer range is used even for skewed value distributions.

=> Both have problems with outliers. Can be solved by clipping weights to a pre-determined range.

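A sketch of the asymmetric variant with optional clipping (the zero-point formula is the common one; the clip range itself would be chosen beforehand, e.g. from calibration data):

```python
import torch

def asymmetric_quantize(x, clip_min=None, clip_max=None):
    # Optional clipping to a predetermined range limits the influence of outliers.
    if clip_min is not None and clip_max is not None:
        x = x.clamp(clip_min, clip_max)
    x_min, x_max = x.min(), x.max()
    scale = (x_max - x_min) / 255.0                    # use the full int8 range [-128, 127]
    zero_point = (-128 - x_min / scale).round()        # integer that float 0.0 maps to
    q = torch.clamp((x / scale + zero_point).round(), -128, 127).to(torch.int8)
    return q, scale, zero_point                        # dequantize: (q - zero_point) * scale
```
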
17

What is the idea behind Mixture of Experts (MoE)?

  • Parameters of the model are split into disjoint blocks → “experts”.

  • Each token (input) is passed through only a single (or a few) expert block(s).

  • A router decides which block should process the token.

=> No memory saving (all experts need to be on GPU), but better parallelization = faster inference.

18

How does Routing work in Mixture of Experts (MoE)?

  • The router is a fully connected NN (typically a single linear layer).

  • Outputs softmax probabilities over all experts.

  • Mistral (Mixtral): weigh each selected expert's output by its softmax probability.

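A minimal top-2 router sketch (linear gate + softmax weighting of the selected experts' outputs, Mixtral-style; the dimensions, top_k=2 and the renormalization are illustrative choices):

```python
import torch
import torch.nn.functional as F

class TopKRouter(torch.nn.Module):
    def __init__(self, d_model=4096, n_experts=8, top_k=2):
        super().__init__()
        self.gate = torch.nn.Linear(d_model, n_experts, bias=False)
        self.top_k = top_k

    def forward(self, x, experts):
        # x: (tokens, d_model); experts: list of n_experts feed-forward modules
        probs = F.softmax(self.gate(x), dim=-1)            # routing probabilities
        weights, idx = probs.topk(self.top_k, dim=-1)      # top-k experts per token
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(experts):
                sel = idx[:, slot] == e                    # tokens routed to expert e
                if sel.any():
                    out[sel] += weights[sel, slot].unsqueeze(-1) * expert(x[sel])
        return out
```
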
19

What is the idea behind Sparse Attention?

  • Reduce the O(N²) attention cost.

  • Only compute attention for specific tokens.

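One common pattern is a causal sliding-window mask; a toy sketch below (the window size is arbitrary, and unlike real sparse-attention kernels this version still materializes the full N×N score matrix):

```python
import torch

def local_mask(n, window=64):
    # Token i may attend only to tokens j with i - window < j <= i.
    i = torch.arange(n).unsqueeze(1)
    j = torch.arange(n).unsqueeze(0)
    return (j <= i) & (j > i - window)

def local_attention(q, k, v, window=64):
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    scores = scores.masked_fill(~local_mask(q.shape[-2], window), float("-inf"))
    return torch.softmax(scores, dim=-1) @ v
```
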
20

What is the H2O eviction strategy for KV-Caching?

  • Only a few tokens are responsible for most of the attention score → Can drop / zero-out most previous tokens from cache.

  • Important tokens can be found through an Attention Sparsity Score over the LLM's vocabulary.

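A rough sketch of score-based eviction (the cache budget and the use of accumulated attention scores follow the H2O idea; the exact bookkeeping is an assumption):

```python
import torch

def evict(k_cache, v_cache, acc_attn, budget=256):
    # acc_attn: (cached_len,) attention mass each cached token has received so far.
    if k_cache.shape[0] <= budget:
        return k_cache, v_cache, acc_attn
    keep = acc_attn.topk(budget).indices.sort().values     # keep "heavy hitters", preserve order
    return k_cache[keep], v_cache[keep], acc_attn[keep]
```
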
21

What are Attention Sinks?

  • Tokens in the beginning take up large attention scores, even if they are semantically unimportant.

  • Happens because the attention softmax needs to sum to 1 and the initial token is visible from every other (later) token in causal attention.

=> Takeaway: Don’t evict initial tokens.

22

What is FlashAttention?

  • Hardware-aware optimization technique.

  • Use Fused Kernels to perform all attention operations at once → no back and forth between SRAM and HBM.

  • Matrix Tiling → load only parts of Q, K and V into SRAM.

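In practice you rarely write this yourself: PyTorch exposes fused attention kernels through torch.nn.functional.scaled_dot_product_attention (whether a FlashAttention kernel is actually picked depends on dtype, shapes, hardware and PyTorch version; the example assumes a CUDA GPU):

```python
import torch
import torch.nn.functional as F

q = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.bfloat16)   # (batch, heads, seq, head_dim)
k, v = torch.randn_like(q), torch.randn_like(q)

# Dispatches to a fused (FlashAttention-style) kernel when one is available,
# so the full seq x seq attention matrix never has to round-trip through HBM.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```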