Computer Organisation and Architecture Notes

Positive and Negative Numbers

The discussion focuses on the binary system.
Computer systems use the Most Significant Bit (MSB) to represent the sign of an integer.
- MSB = 0 indicates a positive number.
- MSB = 1 indicates a negative number.
The remaining bits represent the value, which is interpreted differently depending on the number format.
Signed binary integers can be expressed using different number formats; the course will cover the most commonly used format: Two’s Complement Representation.

Two’s Complement Representation

Positive numbers are represented as normal binary numbers.
Negative numbers are created through negation:
- Invert all bits of the number (flip the bits, also known as 1’s complement).
- Add one to the inverted result, ignoring any carry out of the MSB.
The MSB indicates the sign:
- MSB = 0 for positive numbers.
- MSB = 1 for negative numbers.
Negation works in both directions: applying the same two steps to a negative number yields its positive equivalent.
Example:
- +106 is represented as 0 1 1 0 1 0 1 0.
- Invert all bits: 1 0 0 1 0 1 0 1
- Add 1: 1 0 0 1 0 1 1 0
- -106 is 1 0 0 1 0 1 1 0.
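The two negation steps can be sketched in Python as follows. This is a minimal illustration, not part of the course material: the function name `negate` and the default width of 8 bits are assumptions made for the example.

```python
def negate(value, bits=8):
    """Return the two's complement negation of a bit pattern (hypothetical helper)."""
    mask = (1 << bits) - 1          # e.g. 0xFF for 8 bits
    inverted = ~value & mask        # step 1: flip every bit (one's complement)
    return (inverted + 1) & mask    # step 2: add one, discarding any carry out of the MSB


plus_106 = 0b01101010
minus_106 = negate(plus_106)
print(f"{minus_106:08b}")           # 10010110
print(f"{negate(minus_106):08b}")   # 01101010, negating again restores +106
```

Masking with `mask` is what keeps the result within the chosen width, since Python integers are otherwise unbounded.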

Two’s Complement to Decimal Conversion

Each bit position has a weight.
Table of weights for an 8-bit number (MSB on the left, matching the order in which the bits are written):

-2^7	2^6	2^5	2^4	2^3	2^2	2^1	2^0
-128	64	32	16	8	4	2	1

The most significant bit has a negative weight.
To convert from two’s complement to decimal:
- Multiply each bit by its weight and sum the products.
Example:
- Convert 10001110 to decimal.
- $-128 + 0 + 0 + 0 + 8 + 4 + 2 + 0 = -114$
- Therefore, the decimal representation of 10001110 is -114.
The MSB is not just a sign bit; it also has a negative weight.
If the MSB is ‘1’, the final value will always be negative.
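As a rough sketch of the weighted-sum conversion, the Python function below walks the bits from MSB to LSB and gives the MSB a negative weight. The name `twos_complement_to_int` is an assumption for illustration only.

```python
def twos_complement_to_int(bit_string):
    """Convert a two's complement bit string to its decimal value (hypothetical helper)."""
    n = len(bit_string)
    total = 0
    for position, bit in enumerate(bit_string):
        weight = 1 << (n - 1 - position)   # 2^(n-1) for the first bit, down to 2^0
        if position == 0:
            weight = -weight               # the MSB carries a negative weight
        total += int(bit) * weight
    return total


print(twos_complement_to_int("10001110"))  # -114
print(twos_complement_to_int("01101010"))  # 106
```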

Carry vs Overflow

The Carry Flag is set when a ‘1’ is carried out of the MSB of the result.
Example: 48 – 19 = 48 + (-19) = 29
- First, negate 19 (00010011 -> 11101101).
- Then add the two numbers and discard any carry out of the high-order bit: 00110000 + 11101101 = 1 00011101, which leaves 00011101 (+29) with the carry set.
In signed number systems, a set carry bit does not necessarily indicate an error or overflow; the result here is correct.
Overflow must be checked separately to determine whether a signed result is in error.
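The 48 – 19 example can be reproduced with a small Python sketch of an 8-bit adder. The helper name `add_with_carry` and the constants `BITS` and `MASK` are assumptions chosen for the illustration.

```python
BITS = 8
MASK = (1 << BITS) - 1


def add_with_carry(a, b):
    """Add two 8-bit patterns; return (8-bit result, carry out of the MSB)."""
    total = a + b
    return total & MASK, total >> BITS


minus_19 = (~19 + 1) & MASK                  # negate 19: invert, add one, keep 8 bits
result, carry = add_with_carry(48, minus_19)
print(f"{minus_19:08b}")                     # 11101101
print(f"{result:08b}", result, carry)        # 00011101 29 1
```

The carry is 1 even though the signed result (+29) is correct, which is why carry alone cannot be used to detect signed errors.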

Detecting Overflow in Two’s Complement Numbers

Overflow can be detected by checking the MSB of the operands and the result.
Conditions for overflow in addition (Result = A + B):
- If MSB(A) = MSB(B) and MSB(Result) ≠ MSB(A), then overflow occurs.
Conditions for overflow in subtraction (Result = A – B):
- If MSB(A) ≠ MSB(B) and MSB(Result) ≠ MSB(A), then overflow occurs.
Overflow cases (≥ 0 means MSB = 0; < 0 means MSB = 1):

Operation	Operands	Result when overflow occurs
A + B	A ≥ 0, B ≥ 0	< 0
A + B	A < 0, B < 0	≥ 0
A – B	A ≥ 0, B < 0	< 0
A – B	A < 0, B ≥ 0	≥ 0

If the result is not feasible (e.g., adding two positive numbers gives a negative result), the overflow flag is set. This indicates an error only when the operands are interpreted as signed numbers.
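The MSB-based rules for addition and subtraction can be sketched as below, assuming 8-bit operands passed as raw bit patterns. The function names `msb`, `add_overflows` and `sub_overflows` are made up for this example.

```python
BITS = 8
MASK = (1 << BITS) - 1


def msb(pattern):
    """Return the most significant bit of an 8-bit pattern."""
    return (pattern >> (BITS - 1)) & 1


def add_overflows(a, b):
    """Overflow in A + B: operands share a sign and the result's sign differs."""
    result = (a + b) & MASK
    return msb(a) == msb(b) and msb(result) != msb(a)


def sub_overflows(a, b):
    """Overflow in A - B: operands differ in sign and the result's sign differs from A."""
    result = (a - b) & MASK
    return msb(a) != msb(b) and msb(result) != msb(a)


print(add_overflows(0b01100100, 0b00110010))  # True:  100 + 50 does not fit in 8 bits
print(add_overflows(0b00110000, 0b11101101))  # False: 48 + (-19) = 29
print(sub_overflows(0b01100100, 0b10011100))  # True:  100 - (-100) does not fit
```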

Carry vs. Overflow (Recap)

Unsigned Numbers:
- Carry = 1 after an addition always indicates an overflow (the result is too large to be stored in the given number of bits).
- The overflow flag means nothing in the context of unsigned numbers.
Signed Numbers:
- Overflow = 1 indicates an overflow.
- The carry flag can be set for signed numbers, but this does not necessarily mean an overflow has occurred.
Examples:

Expression	Result	Carry?	Overflow?	Correct Result?
0100 (+4) + 0010 (+2)	0110 (+6)	No	No	Yes
0100 (+4) + 0110 (+6)	1010 (-6)	No	Yes	No
1100 (-4) + 1110 (-2)	1010 (-6)	Yes	No	Yes
1100 (-4) + 1010 (-6)	0110 (+6)	Yes	Yes	No
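The four rows of the table can be recomputed with a short Python sketch. Here overflow is found by comparing the wrapped 4-bit result against the true signed sum, which is a different check from the MSB rule but should agree with it. The names `to_signed` and `add_flags` are assumptions for the example.

```python
BITS = 4
MASK = (1 << BITS) - 1


def to_signed(pattern):
    """Interpret a 4-bit pattern as a two's complement value."""
    return pattern - (1 << BITS) if pattern >> (BITS - 1) else pattern


def add_flags(a, b):
    """Return (4-bit result, carry flag, overflow flag) for a + b."""
    total = a + b
    result = total & MASK
    carry = total > MASK                                  # a '1' left the MSB
    overflow = to_signed(a) + to_signed(b) != to_signed(result)
    return result, carry, overflow


for a, b in [(0b0100, 0b0010), (0b0100, 0b0110), (0b1100, 0b1110), (0b1100, 0b1010)]:
    result, carry, overflow = add_flags(a, b)
    print(f"{a:04b} + {b:04b} = {result:04b}  carry={carry}  overflow={overflow}")
```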

Sign Extension

In two’s complement, sign extension is needed to convert a smaller-size operand to a larger-size operand without changing its value.
Sign extension copies the sign bit (MSB) into all of the added higher-order bits.
For example, when a number stored in an 8-bit variable is assigned to a 16-bit variable, sign extension ensures that both variables hold the same value.
Examples:
- 8-bit: 10011010 (-102)
- 16-bit Sign Extended: 1111111110011010 (-102)
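A minimal Python sketch of sign extension is shown below. The function name `sign_extend` and its parameters are hypothetical, chosen only to mirror the 8-bit to 16-bit example.

```python
def sign_extend(pattern, from_bits, to_bits):
    """Widen a two's complement pattern by copying its sign bit into the new upper bits."""
    sign = (pattern >> (from_bits - 1)) & 1
    if sign:
        # Build a run of ones covering bit positions from_bits .. to_bits-1.
        extension = ((1 << (to_bits - from_bits)) - 1) << from_bits
        return pattern | extension
    return pattern                     # positive values are padded with zeros


print(f"{sign_extend(0b10011010, 8, 16):016b}")  # 1111111110011010
print(f"{sign_extend(0b01101010, 8, 16):016b}")  # 0000000001101010
```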

Multi-Precision Arithmetic

If operands are larger than 32 bits (e.g., 64-bit operands) and we have only a single 32-bit ALU:
- Reuse the 32-bit adder for multi-precision addition.
Multi-precision arithmetic is computation on numbers that are wider than the largest processor register (single precision).
Example (64-bit addition using a 32-bit ALU):

ADDS R0, R0, R1  ; add lower words and update the carry flag (C)
ADC  R2, R2, R3  ; add upper words plus the carry from the lower words

The ARM registers are 32 bits wide, so each register can hold only 32 bits of data at a time.
To add two 64-bit operands, we add the lower-order words first, followed by the higher-order words.
When adding the higher-order words, the carry bit (C bit) has to be added as well.
The carry bit is set only if a ‘1’ is carried out of the MSB of the lower-order word sum, and only if the lower-word addition updates the flags.
The ARM assembly instruction that adds the operands and the carry bit is ADC.
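The same two-step addition can be modelled in Python to see how the carry links the two words. This is only a sketch of the idea: `add32` and `add64` are hypothetical helpers, and each 64-bit operand is assumed to be passed as a (high word, low word) pair.

```python
WORD_MASK = 0xFFFFFFFF


def add32(a, b, carry_in=0):
    """Model a 32-bit adder: return (32-bit sum, carry out)."""
    total = a + b + carry_in
    return total & WORD_MASK, total >> 32


def add64(a_hi, a_lo, b_hi, b_lo):
    """Add two 64-bit values held as 32-bit word pairs; return (high word, low word)."""
    lo, carry = add32(a_lo, b_lo)        # lower words first, recording the carry
    hi, _ = add32(a_hi, b_hi, carry)     # upper words plus the carry, as ADC does
    return hi, lo


hi, lo = add64(0x00000001, 0xFFFFFFFF, 0x00000000, 0x00000001)
print(f"{hi:08X} {lo:08X}")              # 00000002 00000000
```

Any carry out of the upper-word addition is dropped here, just as a 64-bit result would wrap.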

Fixed and Floating Point Number System

The two main types of number systems are fixed-point and floating-point.

Range and Precision

Range: Interval between the smallest (max-) and largest (max+) representable number.
- Example: Range of an N-bit two’s complement number is $-(2^{N-1})$ to $(2^{N-1} - 1)$
Precision: Amount of information used to represent each number.
- Example: 1.666 has higher precision than 1.67.
Precision corresponds to the interval between adjacent tick marks on the number line. Each tick mark corresponds to a representable number in the number system used.
In a binary number system, precision corresponds to the value represented by the LSB.

Fixed-Point Representation

Fixed-point format can represent integer and/or fractional values.
Like the decimal system, the binary system has a radix point, known as the binary point.
A fixed-point system has a fixed resolution because the position of its radix point is fixed.
A 4-bit unsigned integer has a range of 0 to 15.
Bits to the right of the binary point have fractional weights ($2^{-1}$, $2^{-2}$, …).
The smallest step that the system can represent determines its precision, and this corresponds to the weight of its LSB.
Limitations:
- The total number of bits allocated for a number is fixed.
- Range and precision therefore compete for the same bits.
- If you allocate more bits for the integer part, the range will increase but the precision will reduce.
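The trade-off between range and precision can be made concrete with a small Python sketch for unsigned fixed-point numbers. The function name `fixed_point_limits` and the 4-bit width are assumptions for illustration.

```python
def fixed_point_limits(total_bits, fraction_bits):
    """Return (largest value, resolution) of an unsigned fixed-point format."""
    resolution = 2.0 ** -fraction_bits                 # weight of the LSB
    largest = ((1 << total_bits) - 1) * resolution     # all bits set to 1
    return largest, resolution


for fraction_bits in (0, 2, 4):
    largest, resolution = fixed_point_limits(4, fraction_bits)
    print(f"{4 - fraction_bits} integer bits, {fraction_bits} fraction bits: "
          f"max = {largest}, step = {resolution}")
```

With 4 bits in total, moving the binary point two places to the left shrinks the maximum from 15 to 3.75 while refining the step from 1 to 0.25.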

Floating Point Representation

Three main fields:
- Sign: positive/negative number
- Mantissa: base value
- Exponent: specifies position of radix point
Example: Simple decimal floating-point format:
- Sign: ±
- Mantissa: 9.99
- Exponent: ±99
By representing a number in this manner, we can ‘move’ the position of the radix point by changing the exponent value.
Unlike a fixed-point number, a floating-point number is not a purely positional representation: its actual value is calculated by evaluating the sign, mantissa and exponent together.

The size of the exponent determines the range.
The size of the mantissa and the value of the exponent determine the precision.
Small numbers can be represented with good precision.
Large numbers may sacrifice precision to achieve a greater range.
Density of floating-point numbers is not uniform.
Floating point representation can represent values across a wide range ($-9.99 \times 10^{99}$ to $+9.99 \times 10^{99}$ in the simple decimal format).

Normalisation

Multiple representations can exist for the same value, so normalisation is necessary to avoid synonymous representations. It does this by keeping exactly one non-zero digit before the radix point.
In a decimal number, this digit can be from 1 to 9.
In a binary number, this digit must be 1.
Normalisation maximises the number of bits of precision, since the number of leading zeroes is minimised.
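As a rough illustration of binary normalisation, the sketch below rewrites a non-zero value as m x 2^e with 1 ≤ |m| < 2. It relies on Python's `math.frexp`, which returns a mantissa in the range [0.5, 1), so the result is shifted by one binary place. The function name `normalise_binary` is an assumption.

```python
import math


def normalise_binary(value):
    """Return (mantissa, exponent) with 1 <= |mantissa| < 2 for a non-zero value."""
    mantissa, exponent = math.frexp(value)   # value == mantissa * 2**exponent, 0.5 <= |mantissa| < 1
    return mantissa * 2, exponent - 1        # shift one place to get the 1.xxx form


print(normalise_binary(0.15625))   # (1.25, -3): 0.00101 in binary becomes 1.01 x 2^-3
print(normalise_binary(6.0))       # (1.5, 2):   110 in binary becomes 1.1 x 2^2
```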

Underflow

Normalisation creates an underflow region.
The underflow region lies close to zero and cannot be represented by a normalised floating point number.
Underflow occurs when the magnitude of a value is too small to be represented.
Floating-point overflow and underflow can cause programs to crash or produce wrong results if not handled properly.
In the simple decimal format above:
- Smallest positive normalised number: $+1.00 \times 10^{-99}$
- Smallest-magnitude negative normalised number: $-1.00 \times 10^{-99}$

IEEE 754

The IEEE 754 floating point standard is widely used in the industry.
It is also the standard behind the floating point data types you declare in programming languages such as C.

IEEE 754 Floating Point Standard

Found in virtually every computer since 1980.
Simplified porting of floating-point numbers.
Unified the development of floating-point algorithms.
Single Precision (32-bits):
- 1-bit sign + 8-bit exponent + 23-bit fraction.
Double Precision (64-bits):
- 1-bit sign + 11-bit exponent + 52-bit fraction.
Note, however, that the ‘Exponent’ and ‘Fraction’ fields specified in the IEEE 754 format are not the same as the exponent and mantissa values of a floating point number.

IEEE 754 Normalised Numbers

Sign bit:
- S = 0 (positive); S = 1 (negative).
Fraction:
- Assumes hidden 1 (not stored) for normalised numbers.
- Value of a normalised floating point number is: $(-1)^S \times (1 + f_1 \times 2^{-1} + f_2 \times 2^{-2} + f_3 \times 2^{-3} + f_4 \times 2^{-4} + \dots) \times 2^{E - Bias}$
Exponent:
- Biased representation (00000001 to 11111110 for Single Precision).
- Value of exponent = E - Bias
- Bias = 127 (Single Precision) and 1023 (Double Precision).
The mantissa is derived by attaching a leading ‘1’ to the left of the fraction bits, giving the expression 1.F.
The actual exponent value is E – Bias, where E is the value stored in the Exponent field of an IEEE 754 number.
The E field is stored as an unsigned number; subtracting the Bias allows the actual exponent to span both positive and negative values.

Converting Single Precision to Decimal

Example 1:
- 0 10110010 11100000000000000000000
- Sign = 0 (positive)
- Exponent = $10110010_2$ = 178; E – Bias = 178 – 127 = 51
- 1 + Fraction = $(1.111)_2$ = 1 + $2^{-1}$ + $2^{-2}$ + $2^{-3}$ = 1.875
- Value in decimal = +1.875 x $2^{51}$
Example 2:
- 1 00001100 01010000000000000000000
- Sign = 1 (negative)
- Exponent = $00001100_2$ = 12; E – Bias = 12 – 127 = -115
- 1 + Fraction = $(1.0101)_2$ = 1 + $2^{-2}$ + $2^{-4}$ = 1.3125
- Value in decimal = -1.3125 x $2^{-115}$
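Both conversions can be checked with a short Python sketch that decodes the three fields by hand and then compares the answer with the interpretation given by the standard `struct` module. The function names are hypothetical, and the manual decoder assumes a normalised input (exponent field neither all zeros nor all ones).

```python
import struct


def decode_normalised_single(bits):
    """Decode a 32-bit pattern ('S EEEEEEEE FFF...') using (-1)^S x 1.F x 2^(E - 127)."""
    bits = bits.replace(" ", "")
    if len(bits) != 32:
        raise ValueError("expected 32 bits")
    sign = int(bits[0])
    e_field = int(bits[1:9], 2)
    fraction = int(bits[9:], 2) / 2**23      # F as a value in [0, 1)
    return (-1) ** sign * (1 + fraction) * 2.0 ** (e_field - 127)


def decode_with_struct(bits):
    """Let the platform interpret the same 32 bits as an IEEE 754 single."""
    raw = int(bits.replace(" ", ""), 2).to_bytes(4, "big")
    return struct.unpack(">f", raw)[0]


example_1 = "0 10110010 11100000000000000000000"
example_2 = "1 00001100 01010000000000000000000"
print(decode_normalised_single(example_1) == 1.875 * 2.0**51)      # True
print(decode_normalised_single(example_2) == -1.3125 * 2.0**-115)  # True
print(decode_with_struct(example_1) == decode_normalised_single(example_1))  # True
```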

Representable Range for Normalised Single Precision

Smallest-magnitude normalised number:
- X 00000001 00000000000000000000000 (X = sign bit, 0 or 1)
- Exponent = $00000001_2$ = 1; E – Bias = 1 – 127 = -126
- 1 + Fraction = $(1.000…000)_2$ = 1
- Value in decimal = 1 x $2^{-126}$, i.e. a magnitude of $2^{-126}$
Largest-magnitude normalised number:
- X 11111110 11111111111111111111111 (X = sign bit, 0 or 1)
- Exponent = $11111110_2$ = 254; E – Bias = 254 – 127 = 127
- 1 + Fraction = $(1.111…111)_2$ ≈ 2
- Value in decimal ≈ 2 x $2^{127}$ = $2^{128}$, i.e. a magnitude just below $2^{128}$
In normalised mode, the exponent field ranges from 00000001 to 11111110.
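The two limits can be confirmed by assembling the bit fields and letting Python's `struct` module interpret them as a single-precision value. The helper name `single_from_fields` is an assumption for this sketch.

```python
import struct


def single_from_fields(sign, exponent_field, fraction_field):
    """Pack sign (1 bit), exponent (8 bits) and fraction (23 bits) into an IEEE 754 single."""
    word = (sign << 31) | (exponent_field << 23) | fraction_field
    return struct.unpack(">f", word.to_bytes(4, "big"))[0]


smallest = single_from_fields(0, 0b00000001, 0)
largest = single_from_fields(0, 0b11111110, (1 << 23) - 1)

print(smallest == 2.0**-126)                    # True
print(largest == (2 - 2.0**-23) * 2.0**127)     # True: just under 2 x 2^127
print(largest < 2.0**128)                       # True
```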

IEEE 754 Encoding Mode

Encoding Mode	Sign	Exponent	Fraction
Normalized	1 / 0	00000001 to 11111110	Anything
Denormalized	1 / 0	00000000	Non zero
Zero	1 / 0	00000000	0000 … 0000
Infinity	1 / 0	11111111	0000 … 0000
Not a Number (NaN)	1 / 0	11111111	Non zero

In normalised mode, the E field ranges from 0000 0001 to 1111 1110 (1 to 254) while the fraction can take on any value.
IEEE 754 has a special denormalised mode that allows it to represent numbers in its underflow region.
In denormalised mode, the leading bit of the mantissa is 0 and not 1, i.e. 0.F instead of 1.F. The Exponent field is all zeros and the Fraction can be any non-zero value.
Note, however, that the scale factor used in denormalised mode is $2^{-126}$ and not $2^{-127}$, so the value is $(-1)^S \times 0.F \times 2^{-126}$.
Zero, Infinity and NaN are special cases and are represented by reserved patterns in the E and Fraction fields.
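The encoding table maps directly onto a few comparisons on the Exponent and Fraction fields. The sketch below assumes a single-precision value given as a 32-bit integer; the function names `classify_single` and `denormalised_value` are made up for illustration.

```python
def classify_single(word):
    """Return the IEEE 754 encoding mode of a 32-bit single-precision word."""
    exponent = (word >> 23) & 0xFF
    fraction = word & 0x7FFFFF
    if exponent == 0xFF:
        return "Infinity" if fraction == 0 else "NaN"
    if exponent == 0:
        return "Zero" if fraction == 0 else "Denormalized"
    return "Normalized"


def denormalised_value(word):
    """Evaluate (-1)^S x 0.F x 2^-126 for a denormalised word."""
    sign = word >> 31
    fraction = (word & 0x7FFFFF) / 2**23
    return (-1) ** sign * fraction * 2.0**-126


for word in (0x3F800000, 0x00000001, 0x80000000, 0x7F800000, 0x7FC00000):
    print(f"{word:08X}: {classify_single(word)}")

print(denormalised_value(0x00000001) == 2.0**-149)   # True: smallest positive denormalised value
```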

Fixed Point vs Floating Point Number System

Consider the same number of bits used to represent a value, e.g. 32 bits.
Floating Point (IEEE 754)
- Max Range ≈ $2 \times 2^{128}$ (about $-2^{128}$ to $+2^{128}$).
- Max Precision (near zero) finer than $2^{-126}$
Fixed Point
- Max Range (radix point to the right of the LSB, unsigned) ≈ $2^{32}$
- Max Precision (radix point to the left of the MSB) $2^{-32}$
For the same number of bits, floating point yields a larger range and better precision for small numbers.
One usually needs the best precision when the numbers are small.
However, a fixed-point number has the advantage of uniform precision across its entire range.
A floating-point number’s precision changes across the range, and the very coarse precision at the two ends of the range may not be acceptable for a given algorithm.
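The non-uniform precision of floating point can be seen by measuring the gap between a single-precision value and the next representable one. The sketch below uses Python's `struct` module to step the bit pattern by one; the names `next_single_up` and `spacing` are assumptions, and the helper only handles positive, finite, non-maximal inputs.

```python
import struct


def next_single_up(x):
    """Return the next larger single-precision value after a positive finite x."""
    word = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", word + 1))[0]


def spacing(x):
    """Gap between x (rounded to single precision) and its upper neighbour."""
    x_single = struct.unpack(">f", struct.pack(">f", x))[0]
    return next_single_up(x_single) - x_single


for exponent in (-100, 0, 100):
    x = 2.0**exponent
    print(f"spacing near 2^{exponent}: {spacing(x)}")

# A 32-bit fixed-point format, by contrast, has one constant step everywhere,
# e.g. 2^-32 when the radix point is to the left of the MSB.
```

The gap grows by the same factor as the value itself, which is what the coarse precision at the ends of the range refers to.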