2a

Floating-point Numbers Overview

Presented by Troels Henriksen, based on slides by Randal E. Bryant and David R. O’Hallaron.

Agenda

Why are numbers exciting?
Preliminaries: Biased numbers
Floating-point arithmetic
Background: Fractional binary numbers
IEEE floating-point standard
Examples and properties
Rounding, addition, and multiplication
Floating-point in C
Summary

Learning Objectives

Understand:
- Non-uniform distribution of numbers.
- Consequences of roundoff.
- Difficulty in performing numerical operations accurately.

Kerbal Space Program Example

Physics simulation of rocket parts.
Connected parts affected by forces from engines.
Players who travel far from the launch site find that their craft become fragile.
Understanding the underlying numerical issues leads to the fix.
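The slides do not spell out the numerical issue here. A likely candidate, stated as an assumption, is that the gap between adjacent floating-point values grows with magnitude, so positions far from the origin are stored less precisely. The sketch below measures that gap for a single-precision value. The helper name gap_above is hypothetical, and the code assumes float is an IEEE 754 32-bit value.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Distance from a positive finite float to the next larger float.
   Assumption: float is IEEE 754 single precision, so adding 1 to the
   bit pattern of a positive value yields the next representable value. */
float gap_above(float x)
{
    uint32_t bits;
    float next;

    assert(x > 0.0f);
    memcpy(&bits, &x, sizeof bits);   /* read the raw bit pattern */
    bits += 1;                        /* step to the neighbouring value */
    memcpy(&next, &bits, sizeof next);
    return next - x;
}
```

Under these assumptions, gap_above(1.0f) is 2^(-23), while gap_above(1000000.0f) is 0.0625: the same format is far coarser a million units from the origin.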

Representation of Biased Numbers

Biased Numbers:
- Raw bits interpreted as unsigned; a bias is subtracted.
- Unsigned Representation:
  - Formula: Bits2N(X) = Σ_{i=0}^{w−1} x_i · 2^i
- Two’s Complement:
  - Formula: TC2Int(X) = −x_{w−1} · 2^(w−1) + Σ_{i=0}^{w−2} x_i · 2^i
- Biased Representation:
  - Formula: B2Int(X) = Bits2N(X) − b, with typical bias b = 2^(w−1) − 1.
- Example for w = 8, b = 127:
  - ⟨00000000⟩ encodes −127 and ⟨01111111⟩ encodes 0.
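A minimal sketch of the biased interpretation, assuming the typical bias b = 2^(w−1) − 1. The function name biased_to_int is hypothetical and only serves as illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Interpret the low w bits of `bits` as an unsigned number, then
   subtract the typical bias b = 2^(w-1) - 1. */
int64_t biased_to_int(uint32_t bits, unsigned w)
{
    assert(w >= 1 && w <= 32);
    uint64_t mask = (UINT64_C(1) << w) - 1;        /* keep only w bits */
    int64_t bias = (INT64_C(1) << (w - 1)) - 1;    /* b = 2^(w-1) - 1 */
    return (int64_t)(bits & mask) - bias;
}
```

For w = 8, biased_to_int(0x00, 8) gives −127, biased_to_int(0x7F, 8) gives 0, and biased_to_int(0xFF, 8) gives 128.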

Integral Binary Numbers

Interpretation of binary like 10010101 as decimal 149.

Fractional Numbers Representation

Decimal number expressed as:
- 123.456 = 1 · 10^2 + 2 · 10^1 + 3 · 10^0 + 4 · 10^(-1) + 5 · 10^(-2) + 6 · 10^(-3).
General formula:
- a_{m−1} ... a_0 . a_{−1} ... a_{−n} = Σ_{i=−n}^{m−1} a_i · 10^i

Fractional Binary Numbers

Equivalent structure for radix r:
- a_{m−1} ... a_0 . a_{−1} ... a_{−n} = Σ_{i=−n}^{m−1} a_i · r^i
Weights and representations of binary digits:
- b_{−1} · 2^(-1), b_{−2} · 2^(-2), ... for bits to the right of the binary point.
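The weighting of bits on either side of the binary point can be sketched as a small evaluator. The function name parse_binary_fraction is hypothetical, and the input is assumed to contain only the characters '0', '1' and at most one '.'.

```c
/* Evaluate a string such as "101.11" as a fractional binary number.
   Assumption: the string holds only '0', '1' and at most one '.'. */
double parse_binary_fraction(const char *s)
{
    double value = 0.0;
    double weight = 0.5;      /* weight of the first fraction bit: 2^(-1) */
    int seen_point = 0;

    for (; *s != '\0'; s++) {
        if (*s == '.') {
            seen_point = 1;
        } else if (!seen_point) {
            value = value * 2.0 + (*s - '0');   /* integer part */
        } else {
            value += (*s - '0') * weight;       /* fraction part */
            weight /= 2.0;                      /* next weight: 2^(-2), ... */
        }
    }
    return value;
}
```

With this sketch, "101.11" evaluates to 5.75 and "10010101" evaluates to 149.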

Limitations of Representable Numbers

Can only represent fractions of the form x/2^k.
With w bits and a fixed position for the binary point, the range of representable values is limited.
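A small illustration of the x/2^k restriction, assuming double is an IEEE 754 64-bit value: 0.125 = 1/2^3 is stored exactly, whereas 0.1 is not of that form and is stored as a nearby value. The helper name repeated_sum is hypothetical.

```c
/* Add x to an accumulator n times. Every addition may round. */
double repeated_sum(double x, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        sum += x;
    }
    return sum;
}
```

Under that assumption, repeated_sum(0.125, 8) equals 1.0 exactly, while repeated_sum(0.1, 10) does not compare equal to 1.0.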

Fixed-point Dilemma

Example with w = 8:
- With only 1 bit for the fraction, the resolution is 0.5, so precision is lost for values near zero.
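The trade-off can be made concrete with a sketch that assumes an unsigned 8-bit fixed-point format with f fraction bits (the slides do not state signedness, so that part is an assumption). Both function names are hypothetical.

```c
#include <assert.h>

/* Smallest step between two values when f of the 8 bits hold the fraction. */
double fixed8_resolution(unsigned f)
{
    assert(f <= 8);
    return 1.0 / (double)(1u << f);
}

/* Largest representable value: bit pattern 11111111 scaled by 2^(-f). */
double fixed8_max(unsigned f)
{
    assert(f <= 8);
    return 255.0 / (double)(1u << f);
}
```

With f = 1 the step is 0.5 and the maximum is 127.5. With f = 7 the step shrinks to 0.0078125, but the maximum drops to 1.9921875. No single choice of f gives both range and precision.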

IEEE Floating-Point Standard 754

Established in 1985 for uniform floating-point representation.
Supported by all major CPUs. The design is driven by numerical concerns such as rounding and underflow.

Floating-Point Representation Formulation

Numerical form: (-1)^s · m · 2^e, where:
- Sign bit s
- Significand m (typically [1, 2))
- Exponent e
Different precision formats:
- 32-bit single precision: float
- 64-bit double precision: double
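A sketch of how the three fields can be pulled out of a float. It assumes float is IEEE 754 single precision (1 sign bit, 8 exponent bits, 23 fraction bits), which the C standard does not guarantee. The struct and function names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

struct float_fields {
    uint32_t sign;       /* 1 bit */
    uint32_t exponent;   /* 8 bits, stored in biased form */
    uint32_t fraction;   /* 23 bits */
};

/* Split a float into its raw bit fields. */
struct float_fields decode_float(float x)
{
    uint32_t bits;
    struct float_fields f;

    memcpy(&bits, &x, sizeof bits);   /* well-defined way to read the bits */
    f.sign = bits >> 31;
    f.exponent = (bits >> 23) & 0xFF;
    f.fraction = bits & 0x7FFFFF;
    return f;
}
```

For 1.0f this yields sign 0, exponent field 127 and fraction 0. For -2.5f = −1.25 · 2^1 it yields sign 1, exponent field 128 and fraction 0x200000.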

Normalisation and Denormalisation

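Normalised values: the exponent field is neither all 0s nor all 1s, and the significand has an implicit leading 1.
Denormal values: the exponent field is all 0s and there is no implicit leading 1, which allows zero and tiny values near zero to be represented.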
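The two cases can be sketched by rebuilding a value from a raw 32-bit pattern. The sketch assumes the IEEE 754 single-precision layout (8 exponent bits with bias 127, 23 fraction bits). The function name value_from_fields is hypothetical.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>

/* Rebuild the numeric value of a single-precision bit pattern.
   Special values (exponent field all 1s) are excluded by the assert. */
double value_from_fields(uint32_t bits)
{
    uint32_t sign = bits >> 31;
    int exp_field = (int)((bits >> 23) & 0xFF);
    uint32_t frac = bits & 0x7FFFFF;
    double m;
    int e;

    assert(exp_field != 0xFF);
    if (exp_field == 0) {
        m = frac / 8388608.0;          /* denormal: no implicit leading 1 */
        e = 1 - 127;                   /* exponent fixed at 1 - bias */
    } else {
        m = 1.0 + frac / 8388608.0;    /* normalised: implicit leading 1 */
        e = exp_field - 127;           /* exponent field minus bias */
    }
    return sign ? -ldexp(m, e) : ldexp(m, e);   /* m * 2^e */
}
```

The pattern 0x00000001 is the smallest positive denormal, 2^(-149), and 0x00800000 is the smallest positive normalised value, 2^(-126).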

Special Values in Floating-Point

Cases where the exponent field is all 1s:
- Fraction all 0s: ±∞, e.g. the result of overflow.
- Fraction nonzero: NaN, for undefined numeric results.
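A sketch that recognises the special values directly from the bit pattern, assuming float is IEEE 754 single precision. The function name classify_special is hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Classify a float by inspecting its exponent and fraction fields. */
const char *classify_special(float x)
{
    uint32_t bits;

    memcpy(&bits, &x, sizeof bits);
    if (((bits >> 23) & 0xFF) != 0xFF) {
        return "finite";                     /* exponent field not all 1s */
    }
    if ((bits & 0x7FFFFF) != 0) {
        return "nan";                        /* nonzero fraction */
    }
    return (bits >> 31) ? "-inf" : "+inf";   /* zero fraction: infinity */
}
```

Under that assumption, a product that overflows, such as 1e38f * 10.0f, is classified as "+inf", and subtracting that infinity from itself is classified as "nan".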

Rounding Techniques

Several rounding modes exist; the default is round-to-nearest, with ties going to the even value.
The rounding mode affects precision and can introduce bias in floating-point arithmetic.
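Ties-to-even can be observed without any library support by converting integers just above 2^24 to float, where only every second integer is representable. This is a sketch that assumes IEEE 754 single precision and the default rounding mode; the C standard itself leaves the rounding direction of this conversion implementation-defined. The function name round_trip_through_float is hypothetical.

```c
#include <stdint.h>

/* Convert an integer to float and back, exposing how it was rounded. */
int32_t round_trip_through_float(int32_t n)
{
    float f = (float)n;    /* rounds when n needs more than 24 bits */
    return (int32_t)f;
}
```

Under these assumptions, 16777217 lies exactly halfway between 16777216 and 16777218 and rounds down to 16777216, while 16777219 lies halfway between 16777218 and 16777220 and rounds up to 16777220. In both cases the result is the neighbour whose last significand bit is 0.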

Floating-Point Arithmetic

Outline of the operations:
- Addition: align the exponents, add the significands, normalise the result, then round.
- Multiplication: multiply the significands and add the exponents, then normalise and round.
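The two outlines can be sketched with the standard frexp and ldexp functions, which split a double into significand and exponent and put them back together. This is an illustration of the idea, not how hardware implements it: the inner + and * still rely on the built-in operations for rounding, and frexp returns a significand in [0.5, 1) rather than [1, 2). Both function names are hypothetical.

```c
#include <math.h>

/* Multiplication: multiply significands, add exponents. */
double multiply_by_parts(double x, double y)
{
    int ex, ey;
    double mx = frexp(x, &ex);    /* x = mx * 2^ex */
    double my = frexp(y, &ey);    /* y = my * 2^ey */
    return ldexp(mx * my, ex + ey);
}

/* Addition: align both significands to the larger exponent, then add. */
double add_by_parts(double x, double y)
{
    int ex, ey;
    double mx = frexp(x, &ex);
    double my = frexp(y, &ey);
    int e = ex > ey ? ex : ey;                           /* common exponent */
    double sum = ldexp(mx, ex - e) + ldexp(my, ey - e);  /* aligned add */
    return ldexp(sum, e);                                /* rescale */
}
```

Alignment explains why a small operand can vanish: in add_by_parts(1e20, 3.14) the significand of 3.14 is shifted so far right that it is rounded away, and the result is 1e20.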

Challenges in Floating-Point Operations

Associativity and distributivity are not guaranteed, because every operation may round.
Example cases also demonstrate NaN and overflow conditions.
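A sketch of both effects, assuming double is an IEEE 754 64-bit value and the compiler does not reorder floating-point operations (no fast-math style options). The function names are hypothetical.

```c
/* The same three operands, grouped in two different ways. */
double sum_left(double a, double b, double c)
{
    return (a + b) + c;
}

double sum_right(double a, double b, double c)
{
    return a + (b + c);
}

double prod_left(double a, double b, double c)
{
    return (a * b) * c;
}

double prod_right(double a, double b, double c)
{
    return a * (b * c);
}
```

With the operands 3.14, 1e20 and -1e20, sum_left gives 0.0 because 3.14 is rounded away in the first addition, while sum_right gives 3.14. With 1e200, 1e200 and 1e-200, prod_left overflows to infinity in the first multiplication, while prod_right stays finite.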

Floating-point in C

C provides two main floating-point types (float, double), with specific behaviours when casting and converting between them and the integer types.
Notable case: the Ariane 5 failure, caused by an unhandled conversion of a 64-bit floating-point value to a 16-bit signed integer that overflowed and raised an exception.
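In C, converting a floating-point value to an integer type truncates toward zero, and the behaviour is undefined when the truncated value does not fit. One defensive option, shown here as a sketch and not as what any real system does, is to saturate at the limits of the target type. The function name saturate_to_int16 and the choice to map NaN to 0 are assumptions made for illustration.

```c
#include <stdint.h>

/* Convert a double to int16_t, clamping out-of-range values.
   Assumption: mapping NaN to 0 is acceptable for the caller. */
int16_t saturate_to_int16(double x)
{
    if (x != x) {
        return 0;              /* NaN compares unequal to itself */
    }
    if (x >= INT16_MAX) {
        return INT16_MAX;      /* too large: clamp */
    }
    if (x <= INT16_MIN) {
        return INT16_MIN;      /* too small: clamp */
    }
    return (int16_t)x;         /* in range: truncates toward zero */
}
```

With this sketch, 3.99 converts to 3, -3.99 converts to -3, and 1e10 is clamped to 32767 instead of triggering undefined behaviour.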

Summary

IEEE floating-point has precisely defined properties, but they can conflict with intuition: precision is finite, results are rounded, and numeric computations have limits that must be kept in mind.