Lecture Notes: Integer Representations, Overflow, and IEEE 754 Floating Point

Signed and unsigned integers; two's complement and overflow

Interpreting numbers in two's complement uses a fixed bit width; the leftmost bit (most significant bit) serves as the sign bit. If this bit is 1, the value is negative; if 0, the value is nonnegative. The choice of fixed bit width is crucial because it inherently defines the range of representable values; consequently, some values cannot be represented if they fall outside this range.
Example Concept: In a fixed width system, a value like 24 may not be representable in just 4 bits. This limitation arises not only due to the sign bit's role but primarily because there are simply not enough unique bit patterns in 4 bits to encode all possible two's complement values, positive or negative.
4-bit Illustration: In two's complement with 4 bits, the representable range for signed integers is given by the formula $-2^{n-1} \le V \le 2^{n-1}-1$ , where $n$ is the number of bits. For $n=4$ , this translates to $-2^{3} \le V \le 2^{3}-1$ , meaning $-8 \le V \le 7$ . The bit pattern 1000 represents $-8$ when interpreted as a signed value, leveraging the sign bit. However, the exact same bit pattern 1000 would represent 8 if interpreted as an unsigned integer.
Unsigned Literals: In programming languages, you can explicitly mark an integer literal as unsigned using a suffix, such as 24u or 249u. This tells the compiler to treat the value as an unsigned integer, affecting how it's stored and how arithmetic operations are performed.
Overflow Semantics:
- For unsigned arithmetic: Overflow is well-defined and wraps around modulo the width of the type. For example, in n bits, the result of an addition $a + b$ is equivalent to $(a + b) \bmod 2^n$ . This means that if an operation exceeds the maximum representable value, it "wraps around" to the smallest values in the range.
- For signed arithmetic: Overflow is not necessarily defined by the language standard in all languages (e.g., C/C++ can exhibit undefined behavior). This means the result of a signed overflow operation can be unpredictable and may vary between compilers or execution environments. However, the lecture emphasizes that overflow itself is not an error in the sense of a program crash, but rather a logical inconsistency. Developers can detect it semantically, such as observing that adding two positive numbers yields a negative result, which is a clear indicator of signed overflow. There is no automatic "clamping" or correction; you must explicitly check for overflow conditions after an operation if you need to handle it.
Why this matters for conditionals and loops: If an expression whose value could overflow is used in a conditional or loop termination check, it is critical to ensure that the condition evaluates correctly under all intended circumstances. Failure to account for overflow can lead to subtle logic errors, incorrect program behavior, or even infinite loops, particularly with unsigned types that wrap unexpectedly.
- Example discussed: Using unsigned variables in a loop where a decrement might cause underflow can lead to the value wrapping to a very large positive number. This unexpected wrap-around can cause serious issues like out-of-bounds memory access if used as an array index, as the large wrapped value will point far beyond the array's legitimate memory region.

Practical example: unsigned versus signed in a loop; wrap-around behavior

Scenario: Consider a loop that uses an unsigned counter I and decrements it. If I reaches 0 and is then decremented by one, instead of becoming -1 (which is not representable as unsigned), it will wrap to the maximum value representable by its unsigned type (e.g., $2^{32}-1$ for a 32-bit unsigned integer), not stopping as expected.
Consequence: If the loop termination condition (e.g., I >= 0 for I being unsigned) is not carefully designed to handle this wrap-around, the loop could continue indefinitely or, more critically, access memory beyond valid array bounds if I is used as an index into an array. For instance, accessing array[MAX_UNSIGNED_INT] would likely result in a segmentation fault or corrupted data.
Suggested Remedy from the discussion: To prevent such issues, a common approach is to adjust the loop's starting and stopping conditions. For example, one might start the counter at a safe value (e.g., count - 2 instead of count - 1) and design the loop to stop when I is less than a specific threshold (e.g., I < 1). This way, the loop condition explicitly prevents reaching a state where unsigned underflow could cause incorrect or dangerous behavior.
The broader point: A significant number of real-world software bugs, including security vulnerabilities, arise from neglecting to consider overflow or underflow conditions, especially in unsigned arithmetic. Relying on bounds-checked constructs or carefully validated loop logic is essential to mitigate these risks.

Word size, unsigned types, and size_t

Word size: This refers to the number of bits that a processor's central processing unit (CPU) can process at one time, often defining the native width of registers and memory addresses. It frequently dictates the natural width of integer types (e.g., int, long) and thus determines the maximum value they can hold without overflow.
size_t: This is a fundamental data type in C and C++ primarily designed to represent sizes of objects, counts of items, and array indices. It is always an unsigned integer type. Crucially, size_t is guaranteed to be large enough to represent the size of any object in memory on the given platform, making it ideal for memory-related operations and indexing large data structures. Its width is tied to the machine’s word size and address bus width, but it is not freely configurable by the programmer in typical languages.
Questions addressed in the transcript:
- Is size_t always equal to the computer’s word size? Generally, yes, by design, as it needs to be able to address all memory. However, programmers should not rely on a particular bit width for size_t in portable code. For situations where an explicit width is required for interoperability or specific data representations, fixed-width types like uint32_t, uint64_t, etc., available in <stdint.h>, should be used.
- Can you set size_t to a different width? Typically, a programmer cannot directly configure or set size_t's width. Its size is determined by the compiler and the target architecture. If a specific integer width is needed, explicit-width types (e.g., uint16_t for 16 bits) are the correct and portable solution.
Practical note: size_t is the canonical type for array indexing, loop counters that iterate over array elements, and the return type of sizeof and malloc because it is guaranteed to be large enough to index any possible object on the given platform without overflow issues related to positive sizes.

Bit-twiddling and compiler optimizations: shifts vs multiplication

The compiler optimizes certain arithmetic by using bit shifts when multiplying or dividing an integer by powers of two. This optimization is possible because a left bit shift by $N$ positions is equivalent to multiplication by $2^N$ , and a right bit shift by $N$ positions is equivalent to integer division by $2^N$ . Bit shift operations are typically much faster at the hardware level than generic multiplication or division circuits.
Example: Multiplying an integer x by eight (x * 8) can be efficiently implemented by the compiler as a left shift by three bits (x \ll 3), as $2^3 = 8$ .
Decision Logic (as discussed): The compiler employs sophisticated heuristics to weigh the cost of a sequence of shift and add/subtract operations versus using a dedicated hardware multiplication/division instruction. For simple expressions or specific hardware architectures, a shift-based implementation can be significantly cheaper in terms of CPU cycles. However, for more complex or longer arithmetic expressions, or on modern CPUs with highly optimized integer ALUs (Arithmetic Logic Units), the compiler might prefer to use a standard multiply instruction.
Takeaway: When writing code, trust the optimizer to make the best decision. For micro-optimizations, it is generally advised to write clear, readable code and let the compiler handle the low-level efficiency. Only resort to manual bit-twiddling (e.g., x \ll 3 instead of x * 8) after profiling has identified a specific performance bottleneck, and always consider the trade-offs in terms of readability, maintainability, and portability.

Floating point overview: fixed binary fractions, normalization, and the IEEE 754 standard

Historical Context: Before 1985, floating-point formats were largely hardware-specific and inconsistent. Different computer vendors implemented their own unique encodings for real numbers, leading to significant portability issues where numerical calculations could yield different results across machines. This made it difficult to write reliable, cross-platform scientific and engineering software.
Why Standardization Mattered: The introduction of the IEEE 754 standard for floating-point arithmetic provided a common, hardware-independent representation. This standardization was crucial because it ensures that software behaves consistently across diverse hardware platforms, greatly facilitating software portability and simplifying compiler design by providing a universal target for floating-point operations.
Core Idea: Floating-point numbers represent real numbers in a format analogous to scientific notation, specifically in the form: $\text{value} = (-1)^{\text{sign}} \times (\text{mantissa}) \times 2^{\text{exponent}}$
- In IEEE 754 normalized form, the mantissa (or significand) is implicitly structured to be 1.fraction (e.g., 1.xxxxxxxxx). This

Signed and Unsigned Integers; Two's Complement & Overflow
- Two's complement interprets numbers with a fixed bit width; the leftmost bit is the sign bit (1 for negative, 0 for non-negative).
- The representable range for $n$ -bit signed integers is $-2^{n-1} \le V \le 2^{n-1}-1$ . (e.g., 4 bits: -8 to 7).
- A bit pattern like 1000 is -8 signed, 8 unsigned in a 4-bit system.
- Unsigned integer literals use a u suffix (e.g., 24u).
- Overflow Semantics:
- Unsigned arithmetic: Overflow is well-defined and wraps around modulo the type's width ((a + b) mod 2^n).
- Signed arithmetic: Overflow results in undefined behavior in languages like C/C++; it's a logical inconsistency rather than a crash, discernible if, for example, two positive numbers add to a negative.
- Overflow in conditionals or loops can lead to logic errors, incorrect program behavior, or infinite loops, particularly with unsigned types wrapping unexpectedly (e.g., unsigned underflow to a large positive number).
Practical Example: Unsigned vs. Signed in a Loop
- An unsigned counter decrementing past 0 wraps to its maximum value, leading to potential infinite loops or out-of-bounds memory access if used as an array index.
- Remedy: Adjust loop conditions to prevent unsigned underflow (e.g., start at count - 2, stop when I < 1).
- Many software bugs and security flaws stem from neglecting overflow/underflow, especially in unsigned arithmetic.
Word Size, Unsigned Types, and size_t
- Word size: The number of bits a CPU processes simultaneously, defining native integer width.
- size_t: An unsigned integer type in C/C++ for object sizes and array indices, guaranteed to be large enough for any memory object on the platform.
- Programmers should not rely on a particular bit width for size_t; use fixed-width types (e.g., uint32_t) when specific widths are needed.
- size_t is the canonical type for array indexing, loop counters that iterate over array elements, and return types of sizeof and malloc.
Bit-Twiddling and Compiler Optimizations
- Compilers optimize multiplication/division by powers of two into faster bit shift operations (e.g., x * 8 becomes x << 3 since $8 = 2^3$ ).
- The compiler uses heuristics to decide between shift-based and dedicated multiplication/division instructions.
- Takeaway: Write clear, readable code and trust the compiler's optimizer; manual bit-twiddling should only be used after profiling reveals a specific bottleneck.
Floating Point Overview: IEEE 754 Standard
- Historical Context: Before 1985, inconsistent floating-point formats caused portability issues.
- IEEE 754 Standard: Introduced for consistent, hardware-independent representation, ensuring reliable numerical calculations across platforms.
- Core Idea: Floating-point numbers represent real numbers as $\text{value} = (-1)^{\text{sign}} \times (\text{mantissa}) \times 2^{\text{exponent}}$ .
- In IEEE 754 normalized form, the mantissa is implicitly 1.fraction.

This structure allows for a wide dynamic range and precision, enabling efficient storage and arithmetic with real numbers in various computing environments.