How is a 32-bit float structured?
1 bit for sign (+/-)
8-bit biased exponent => range ≈ [10^-38, 10^38].
23-bit fraction (24-bit significand with an implicit leading 1) => ~7 decimal digits of precision.
=> 32 bits per weight takes too much memory for modern LLMs.
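The bit layout above can be inspected directly. A minimal Python sketch (function name is my own) that splits a float32 into the three fields:

```python
import struct

def float32_bits(x: float) -> tuple[int, int, int]:
    """Split a float32 into its sign, biased exponent, and fraction fields."""
    # Pack as a big-endian float32, then reinterpret as a 32-bit unsigned int.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                # 1 bit
    exponent = (bits >> 23) & 0xFF   # 8 bits, biased by 127
    fraction = bits & 0x7FFFFF       # 23 stored bits (implicit leading 1)
    return sign, exponent, fraction

# -1.5 = (-1)^1 * 1.1b * 2^0 -> sign=1, exponent=127 (bias 127), fraction=0b100...0
print(float32_bits(-1.5))
```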
How is a 16-bit float structured?
1 bit for sign (+/-)
5-bit biased exponent => range ≈ [10^-4, 10^4].
10-bit fraction => ~3 decimal digits of precision.
=> Range is too small for LLMs (gradients/activations can over- or underflow).
How is a bfloat16 structured?
Idea: fewer bits for precision, more for range.
1 bit for sign (+/-)
8-bit biased exponent => range ≈ [10^-38, 10^38] (same as float32).
7-bit fraction => ~2 decimal digits of precision.
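Because bfloat16 keeps float32's sign and exponent fields, a bfloat16 is just the top 16 bits of a float32. A small sketch (helper names are my own) showing the truncation and the resulting ~2-digit precision:

```python
import struct

def to_bfloat16_bits(x: float) -> int:
    """Truncate a float32 to bfloat16 by keeping its top 16 bits."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits >> 16  # sign (1) + exponent (8) + fraction (7)

def from_bfloat16_bits(b: int) -> float:
    """Widen bfloat16 bits back to a float32 value (low 16 bits are zero)."""
    return struct.unpack(">f", struct.pack(">I", b << 16))[0]

# 3.14159 survives only to ~2 decimal digits after the round trip.
print(from_bfloat16_bits(to_bfloat16_bits(3.14159)))
```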
What is the idea behind quantizing weights?
Map float values to int8 (256 distinct values; symmetric schemes typically use [-127, 127]).
Uses half as much space as bfloat16.
int8 operations can be computed much faster (hardware acceleration).
→ Introduces some rounding error, but little difference in model quality.
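A minimal sketch of the mapping (my own simplified version, not a production scheme): scale each weight by the maximum absolute value so it lands on [-127, 127], round to an integer, and keep the scale for dequantization.

```python
def quantize_int8(values):
    """Symmetric int8 quantization: scale by max |value| onto [-127, 127]."""
    scale = max(abs(v) for v in values) / 127
    q = [round(v / scale) for v in values]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float values from int8 codes and the scale."""
    return [v * scale for v in q]

weights = [0.3, -1.2, 0.05, 0.9]
q, scale = quantize_int8(weights)
# The largest-magnitude weight (-1.2) maps to -127; the rest round in between,
# so each dequantized value is within one scale step of the original.
print(q, dequantize(q, scale))
```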
Symmetric Quantization vs. Asymmetric Quantization
Symmetric Quantization:
Zero points of the float and quantized ranges coincide (float 0.0 maps to int 0).
Min / max of the quantized range are negatives of each other.
Asymmetric Quantization:
Zero points do not coincide; a zero-point offset is stored.
Uses the full int range, so more precision for skewed value distributions.
=> Both struggle with outliers, which stretch the scale. Can be mitigated by clipping weights to a pre-determined range.
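The asymmetric scheme with clipping can be sketched as follows (function and parameter names are my own; this maps onto an unsigned 0..255 range with a stored zero point):

```python
def quantize_asymmetric(values, clip_min=None, clip_max=None):
    """Asymmetric 8-bit quantization onto [0, 255] with optional outlier clipping."""
    lo = clip_min if clip_min is not None else min(values)
    hi = clip_max if clip_max is not None else max(values)
    scale = (hi - lo) / 255
    zero_point = round(-lo / scale)  # integer code where float 0.0 lands
    # Values outside [lo, hi] (outliers) are clamped to the edges of the range.
    q = [min(255, max(0, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point

# Without clipping, one outlier (40.0) stretches the scale and crushes precision;
# clipping to a pre-determined range keeps resolution for the typical weights.
print(quantize_asymmetric([-0.5, 0.0, 1.5, 40.0]))
print(quantize_asymmetric([-0.5, 0.0, 1.5, 40.0], clip_min=-0.5, clip_max=1.5))
```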