2a
Floating-point Numbers Overview
Presented by Troels Henriksen, based on slides by Randal E. Bryant and David R. O’Hallaron.
Agenda
Why are numbers exciting?
Preliminaries: Biased numbers
Floating-point arithmetic
Background: Fractional binary numbers
IEEE floating-point standard
Examples and properties
Rounding, addition, and multiplication
Floating-point in C
Summary
Learning Objectives
Understand:
Non-uniform distribution of numbers.
Consequences of roundoff.
Difficulty in performing numerical operations accurately.
Kerbal Space Program Example
Physics simulation of rocket parts.
Connected parts affected by forces from engines.
Players traveling far from launch site face craft fragility issues.
Understanding the underlying numerical issues leads to resolution.
Representation of Biased Numbers
Biased Numbers:
Raw bits interpreted as unsigned; a bias is subtracted.
Unsigned Representation:
Formula:
$\mathrm{Bits2N}(X) = \sum_{i=0}^{w-1} x_i \cdot 2^i$
Two’s Complement:
Formula:
$\mathrm{TC2Int}(X) = -x_{w-1} \cdot 2^{w-1} + \sum_{i=0}^{w-2} x_i \cdot 2^i$
Biased Representation:
Formula:
$\mathrm{B2Int}(X) = \mathrm{Bits2N}(X) - b$, with typical bias $b = 2^{w-1} - 1$.
Example for w = 8, b = 127:
⟨00000000⟩ encodes 0 − 127 = −127, and ⟨01111111⟩ encodes 127 − 127 = 0.
Integral Binary Numbers
Interpretation of a binary numeral such as 10010101 as decimal 149.
Fractional Numbers Representation
Decimal number expressed as:
123.456 = 1 · 10^2 + 2 · 10^1 + 3 · 10^0 + 4 · 10^(-1) + 5 · 10^(-2) + 6 · 10^(-3).
General formula:
$a_{m-1} \cdots a_0 . a_{-1} \cdots a_{-n} = \sum_{i=-n}^{m-1} a_i \cdot 10^i$
Fractional Binary Numbers
Equivalent structure for radix r:
$a_{m-1} \cdots a_0 . a_{-1} \cdots a_{-n} = \sum_{i=-n}^{m-1} a_i \cdot r^i$
Weights and representations of binary digits:
Digits to the right of the binary point have weights $b_{-1} \cdot 2^{-1}, b_{-2} \cdot 2^{-2}, \ldots$
Limitations of Representable Numbers
Can only represent fractions of the form x/2^k exactly.
With w bits and a fixed position for the binary point, the representable range is limited.
Fixed-point Dilemma
Example with w = 8:
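With only 1 bit for the fraction, the smallest step is 0.5, so precision near zero is poor; moving the binary point left improves precision but shrinks the range.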
IEEE Floating-Point Standard 754
Established in 1985 for uniform floating-point representation.
Supported by all major CPUs; designed with numerical concerns such as rounding and underflow in mind.
Floating-Point Representation Formulation
Numerical form:
(-1)^s · m · 2^e, where:
Sign bit s
Significand m (typically in [1, 2))
Exponent e
Different precision formats:
32-bit single precision: float
64-bit double precision: double
Normalisation and Denormalisation
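Normalised values: the exponent field is neither all 0s nor all 1s, and the significand has an implicit leading 1.
Denormalised values: the exponent field is all 0s and the implicit leading 1 is dropped, which allows zero and tiny values close to zero to be represented.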
Special Values in Floating-Point
Cases with the exponent field all 1s:
Represents ±∞ (e.g. on overflow) when the fraction field is zero, or NaN (for undefined numeric results) when it is nonzero.
Rounding Techniques
Several rounding modes exist; the default is round to nearest, with ties going to the even value.
The rounding mode affects precision, and modes other than round-to-nearest-even introduce a statistical bias in floating-point arithmetic.
Floating-Point Arithmetic
Operations guidelines:
Addition: align the exponents, add the significands, normalise the result, then round.
Multiplication: XOR the signs, multiply the significands, add the exponents, then normalise and round.
Challenges in Floating-Point Operations
Associativity and distributivity not guaranteed due to rounding errors.
Example cases demonstrating NaN and overflow conditions.
Floating-point in C
C provides two main floating-point types (float, double) with specific behaviours on casting and conversion.
Notable case: the Ariane 5 failure, caused by an improper conversion of a floating-point number to an integer that raised an exception.
Summary
IEEE floating-point has precisely defined properties, but they can conflict with intuition: precision is limited, results are rounded, and numeric computations inherit these limitations.