2a

Floating-point Numbers Overview

  • Presented by Troels Henriksen, based on slides by Randal E. Bryant and David R. O’Hallaron.

Agenda

  • Why are numbers exciting?

  • Preliminaries: Biased numbers

  • Floating-point arithmetic

  • Background: Fractional binary numbers

  • IEEE floating-point standard

  • Examples and properties

  • Rounding, addition, and multiplication

  • Floating-point in C

  • Summary

Learning Objectives

  • Understand:

    • Non-uniform distribution of numbers.

    • Consequences of roundoff.

    • Difficulty in performing numerical operations accurately.

Kerbal Space Program Example

  • Physics simulation of rocket parts.

  • Connected parts affected by forces from engines.

  • Players traveling far from launch site face craft fragility issues.

  • Understanding the underlying numerical issues leads to resolution.

Representation of Biased Numbers

  • Biased Numbers:

    • Raw bits interpreted as unsigned; a bias is subtracted.

    • Unsigned Representation:

      • Formula: Bits2N(X) = wX−1 * Σ xi·2i

    • Two’s Complement:

      • Formula: TC2Int(X) = −xw−1·2w−1 + Σ xi·2i

    • Biased Representation:

      • Formula: B2Int(X) = Bits2N(x)−b with typical bias b = 2^(w−1) − 1.

    • Example for w = 8, b = 127:

      • Encoding of basic numbers like ⟨00000000⟩, ⟨01111111⟩

Integral Binary Numbers

  • Interpretation of binary like 10010101 as decimal 149.

Fractional Numbers Representation

  • Decimal number expressed as:

    • 123.456 = 1 · 10^2 + 2 · 10^1 + 3 · 10^0 + 4 · 10^(-1) + 5 · 10^(-2) + 6 · 10^(-3).

  • General formula:

    • am−1 · ... · a0.a−1 · ... a−n = Σ ai · 10^i

Fractional Binary Numbers

  • Equivalent structure for radix r:

    • am−1 · ... · a0.a−1 · ... a−n = Σ ai · r^i

  • Weights and representations of binary digits:

    • b−1 · 2^(-1), b−2 · 2^(-2), ... for right of binary point.

Limitations of Representable Numbers

  • Can only represent fractions of the form x/2^k.

  • Limited binary point setting within w bits restricting numerical range.

Fixed-point Dilemma

  • Example with w = 8:

    • 1 bit for fraction leads to precision loss near zero.

IEEE Floating-Point Standard 754

  • Established in 1985 for uniform floating-point representation.

  • Supported by major CPUs, emphasizing rounding and underflow concerns.

Floating-Point Representation Formulation

  • Numerical form: (-1)^s · m · 2^e, where:

    • Sign bit s

    • Significand m (typically [1, 2))

    • Exponent e

  • Different precision formats:

    • 32-bit single precision: float

    • 64-bit double precision: double

Normalisation and Denormalisation

  • Normalised values when exponents follow specific conditions.

  • Denormal values occur when exponents are zero, leading to representation of tiny values.

Special Values in Floating-Point

  • Cases with exponent E as all 1s:

    • Represents ±∞ for overflow or NaN for undefined numeric results.

Rounding Techniques

  • Distinction between various rounding modes, predominantly towards nearest even (default).

  • It affects precision and can lead to biases for floating-point arithmetic.

Floating-Point Arithmetic

  • Operations guidelines:

    • Addition focuses on aligning exponents before adding significands, normalizing afterward, and ensuring rounding.

    • Multiplication includes deterministically creating a new significand and exponent.

Challenges in Floating-Point Operations

  • Associativity and distributivity not guaranteed due to rounding errors.

  • Example cases demonstrating NaN and overflow conditions.

Floating-point in C

  • C programming provides two main types (float, double) with specific behaviors on casting and conversions.

  • Notable case: Ariane 5 failure due to improper conversion of floating-point to integers causing exceptions.

Summary

  • IEEE floating-point has defined properties that might conflict with intuitive interpretations, focusing on precision and rounding while acknowledging limitations in numeric computations.