Comprehensive Notes on Number Bases, Conversions, Two's Complement, Floating Point, Modulo, and Set Theory

Number Bases and Systems

Introduction to Number Bases
  • Concept of Grouping: Numbers are represented by grouping quantities. The base of a number system, also known as its radix, defines the size of these groups and the number of unique digits available in that system.

    • In Base 10 (decimal), we use ten unique digits (0-9). Each digit's position corresponds to a power of 10. For example, in the number $22_{10}$, the leftmost '2' represents $2 \times 10^1$ (twenty), and the rightmost '2' represents $2 \times 10^0$ (two).

  • Example of place value, where digits represent multiples of powers of the base.
    - $20 + 2 = 22$ (Base 10)
    - $100 + 4 = 104$ (Base 10)
    - $40 + 19 = 59$ (Base 10)

  • Conversion via Grouping (Example: $250_{10}$ to Base-6): This method involves conceptually forming groups of the target base's powers. Instead of repeatedly dividing by the base, you directly determine the coefficient for each power, working from the largest power down.

    • Determine how many groups of $\text{base}^n$ fit into the number, progressing from the highest power of the base down to $\text{base}^0$.

    • Step 1: Find the largest power of 6 less than or equal to 250. $6^3 = 216 \le 250$ and $6^4 = 1296 > 250$, so we start with $6^3$.

      • How many groups of $6^3 = 216$ in 250? ($250 \div 216 = 1$ with remainder $34$).

      • The first digit (coefficient for $6^3$) is 1.

    • Step 2: With the remainder (34), how many groups of $6^2 = 36$?

      • ($34 \div 36 = 0$ with remainder $34$).

      • The second digit (coefficient for $6^2$) is 0.

    • Step 3: With the remainder (34), how many groups of $6^1 = 6$?

      • ($34 \div 6 = 5$ with remainder $4$).

      • The third digit (coefficient for $6^1$) is 5.

    • Step 4: With the remainder (4), how many groups of $6^0 = 1$?

      • ($4 \div 1 = 4$ with remainder $0$).

      • The fourth digit (coefficient for $6^0$) is 4.

    • Result: Combining the coefficients, $250_{10} = 1054_6$. Check: $1 \cdot 216 + 0 \cdot 36 + 5 \cdot 6 + 4 \cdot 1 = 250$. Every digit is at most 5, as base 6 requires.
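
The grouping method can be sketched in Python as follows. This is a minimal illustration, not part of the original notes: the function name `to_base_by_grouping` is made up for the example, and it assumes a non-negative integer and a base of at least 2.

```python
def to_base_by_grouping(value, base):
    """Return the digits of `value` in `base`, most significant first."""
    if value < 0 or base < 2:
        raise ValueError("expects value >= 0 and base >= 2")
    # Find the largest power of the base that still fits into the value.
    power = 1
    while power * base <= value:
        power *= base
    digits = []
    remainder = value
    while power >= 1:
        # How many groups of this power fit into what is left?
        count, remainder = divmod(remainder, power)
        digits.append(count)
        power //= base
    return digits


print(to_base_by_grouping(250, 6))  # [1, 0, 5, 4]
```

For 250 in base 6 the digits come out as 1, 0, 5, 4, which matches 216 + 0 + 30 + 4 = 250.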

Common Number Bases
  • Base 10 (Decimal): Uses digits 0-9. Our everyday number system. Each position represents a power of 10. Primarily used for human calculations and general measurement.

  • Base 2 (Binary): Uses digits 0-1. Fundamental for computers and digital electronics, where information is represented as electrical signals (on/off, high/low voltage).

  • Base 8 (Octal): Uses digits 0-7. Sometimes used as a compact representation of binary numbers (three binary bits map to one octal digit), particularly in older computing systems.

  • Base 16 (Hexadecimal): Uses digits 0-9 and letters A-F (where A=10, B=11, C=12, D=13, E=14, F=15). Widely used in computing for representing memory addresses, color codes (e.g., #FFFFFF), and other binary data in a more human-readable form, as four binary bits map to one hexadecimal digit.

    • Often denoted with the prefix 0x or the suffix h (e.g., 0xA1 or A1h).

Core Rules and Constraints
  • Largest Digit Allowed: In any column (place value), the largest digit permitted is one less than the base. This is because once you reach a value equal to the base, you form a group of that base and carry it over to the next higher place value.

    • Formula: $c = b - 1$ (where $c$ is the largest valid digit in a given base and $b$ is the base itself).

  • Column Values (Place Values): Each position in a number represents a power of the base. Moving from right to left, the power of the base increases by one for each position. For fractional parts, moving right from the radix point (the "decimal point" of the base), the power of the base decreases by one.

    • Base 2: $\ldots, 2^3, 2^2, 2^1, 2^0$ (and for fractions, $2^{-1}, 2^{-2}, \ldots$)

    • Base 8: $\ldots, 8^3, 8^2, 8^1, 8^0$

    • Base 16: $\ldots, 16^3, 16^2, 16^1, 16^0$

  • Memorization: It is required to memorize the hexadecimal and binary equivalents for decimal numbers 0-15. This is crucial for rapid conversion between these common bases.

    Decimal | Binary | Hexadecimal
    ---|---|---
    0 | 0000 | 0
    1 | 0001 | 1
    2 | 0010 | 2
    3 | 0011 | 3
    4 | 0100 | 4
    5 | 0101 | 5
    6 | 0110 | 6
    7 | 0111 | 7
    8 | 1000 | 8
    9 | 1001 | 9
    10 | 1010 | A
    11 | 1011 | B
    12 | 1100 | C
    13 | 1101 | D
    14 | 1110 | E
    15 | 1111 | F

Number Base Conversions
1. Converting from Base-N to Decimal (Base-10)
  • Method: To convert any number from an arbitrary base-N to decimal, you expand the number using the definition of place values. Each digit is multiplied by its corresponding column value (power of the base), and these products are then summed.

    • Formula: For a number $d_{k-1}d_{k-2}\ldots d_1 d_0$ in base $b$, its decimal value is: $\text{value} = \sum_{i=0}^{k-1} d_i \cdot b^i$. For numbers with a fractional part, the formula extends to negative powers: $d_2 b^2 + d_1 b^1 + d_0 b^0 + d_{-1} b^{-1} + d_{-2} b^{-2} + \ldots$

  • Steps:

    1. Write down the number from left to right.

    2. Assign column values (powers of the base) to each digit, starting from $b^0$ for the rightmost integer digit and increasing the exponent by 1 for each position to the left. For fractional parts, start with $b^{-1}$ for the first digit after the radix point, decreasing the exponent by 1 for each position to the right.

    3. Multiply each digit by its respective column value.

    4. Sum all the products to get the decimal equivalent.

  • Examples:

    • Convert $325_6$ to Decimal:
      $3 \cdot 6^2 + 2 \cdot 6^1 + 5 \cdot 6^0 = 3 \cdot 36 + 2 \cdot 6 + 5 \cdot 1 = 108 + 12 + 5 = 125_{10}$

    • Convert $11001_2$ to Decimal:
      $1 \cdot 2^4 + 1 \cdot 2^3 + 0 \cdot 2^2 + 0 \cdot 2^1 + 1 \cdot 2^0 = 16 + 8 + 0 + 0 + 1 = 25_{10}$

    • Convert $67_8$ to Decimal:
      $6 \cdot 8^1 + 7 \cdot 8^0 = 48 + 7 = 55_{10}$

    • Convert $\text{3CD}_{16}$ (also written 0x3CD) to Decimal: (Remember C=12, D=13)
      $3 \cdot 16^2 + 12 \cdot 16^1 + 13 \cdot 16^0 = 3 \cdot 256 + 12 \cdot 16 + 13 \cdot 1 = 768 + 192 + 13 = 973_{10}$
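
The place-value expansion above can be written as a short Python sketch. The names `DIGITS` and `to_decimal` are illustrative only, and the sketch assumes integer digit strings in bases 2 through 16.

```python
DIGITS = "0123456789ABCDEF"


def to_decimal(text, base):
    """Sum digit * base**position over the digits of `text`."""
    total = 0
    # Reverse so that position 0 is the rightmost digit (base**0).
    for position, char in enumerate(reversed(text.upper())):
        digit = DIGITS.index(char)  # raises ValueError for unknown characters
        if digit >= base:
            raise ValueError(f"digit {char!r} is not valid in base {base}")
        total += digit * base ** position
    return total


print(to_decimal("325", 6))   # 125
print(to_decimal("3CD", 16))  # 973
```

Python's built-in `int("3CD", 16)` performs the same conversion. The explicit loop is only there to mirror the steps.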

2. Converting from Decimal (Base-10) to Base-N
  • Method (Division Method for Integers): This method involves repeatedly dividing the decimal number by the target base and noting the remainders. This works because each remainder is the coefficient of a power of the base, starting from $b^0$ (the rightmost digit).

  • Steps:

    1. Divide the decimal number by the target base.

    2. Record the remainder. This remainder will be the rightmost digit (least significant digit) in the new base.

    3. Use the quotient from the previous step as the new number to divide.

    4. Repeat steps 1-3 until the quotient is 0.

    5. The new base representation is the sequence of remainders, read from bottom to top (last remainder to first, which corresponds to the most significant digit to the least significant digit).

  • Examples:

    • Convert $753_{10}$ to Base-6:

      • $753 \div 6 = 125$ R $3$

      • $125 \div 6 = 20$ R $5$

      • $20 \div 6 = 3$ R $2$

      • $3 \div 6 = 0$ R $3$

      • Result (read remainders upwards): $3253_6$

    • Convert $200_{10}$ to Binary:

      • $200 \div 2 = 100$ R $0$

      • $100 \div 2 = 50$ R $0$

      • $50 \div 2 = 25$ R $0$

      • $25 \div 2 = 12$ R $1$

      • $12 \div 2 = 6$ R $0$

      • $6 \div 2 = 3$ R $0$

      • $3 \div 2 = 1$ R $1$

      • $1 \div 2 = 0$ R $1$

      • Result: $11001000_2$

    • Convert $198_{10}$ to Base-7:

      • $198 \div 7 = 28$ R $2$

      • $28 \div 7 = 4$ R $0$

      • $4 \div 7 = 0$ R $4$

      • Result: $402_7$
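
The division method can be sketched in Python like this. The helper name `to_base` is hypothetical, and the sketch assumes a non-negative integer and a base from 2 to 16.

```python
DIGITS = "0123456789ABCDEF"


def to_base(value, base):
    """Convert a non-negative integer by repeated division."""
    if value == 0:
        return "0"
    remainders = []
    while value > 0:
        # The remainder is the next digit, least significant first.
        value, remainder = divmod(value, base)
        remainders.append(DIGITS[remainder])
    # Read the remainders from last to first.
    return "".join(reversed(remainders))


print(to_base(753, 6))  # 3253
print(to_base(200, 2))  # 11001000
print(to_base(198, 7))  # 402
```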

3. Converting Decimal Fractions to Another Base
  • Method (Multiplication Method for Fractions): This method involves repeatedly multiplying the fractional part of the decimal number by the target base. Each integer part generated becomes the next digit in the new base's fractional representation. This process extracts the digits from most significant to least significant after the decimal point.

  • Steps:

    1. Multiply the decimal fractional part by the target base.

    2. The integer part of the result is the next digit in the new base (most-significant first).

    3. Take the remaining fractional part of the product and use it for the next step.

    4. Repeat until the fractional part is 0 or the desired level of precision is reached. Be aware that some decimal fractions may result in non-terminating (repeating) fractions in other bases.

  • Examples:

    • Convert $0.12_{10}$ to Binary (first nine binary places):

      • $0.12 \times 2 = 0.24$ (integer part: 0)

      • $0.24 \times 2 = 0.48$ (integer part: 0)

      • $0.48 \times 2 = 0.96$ (integer part: 0)

      • $0.96 \times 2 = 1.92$ (integer part: 1)

      • $0.92 \times 2 = 1.84$ (integer part: 1)

      • $0.84 \times 2 = 1.68$ (integer part: 1)

      • $0.68 \times 2 = 1.36$ (integer part: 1)

      • $0.36 \times 2 = 0.72$ (integer part: 0)

      • $0.72 \times 2 = 1.44$ (integer part: 1)

      • (The process never reaches a fractional part of 0: $0.12 = 3/25$, and 25 is not a power of 2, so the binary expansion is non-terminating and eventually repeats.)

      • Result: $0.000111101\ldots_2$

    • Convert $0.125_{10}$ to Base-5:

      • $0.125 \times 5 = 0.625$ (integer part: 0)

      • $0.625 \times 5 = 3.125$ (integer part: 3)

      • $0.125 \times 5 = 0.625$ (integer part: 0)

      • (The fractional part is back to 0.125, so the digits 0, 3 repeat forever.)

      • Result: $0.030303\ldots_5 = 0.\overline{03}_5$
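
The multiplication method is easy to get wrong with binary floats, so this sketch uses exact rational arithmetic from Python's `fractions` module. The function name `fraction_to_base` is an assumption for the example, and the input is assumed to lie in [0, 1).

```python
from fractions import Fraction


def fraction_to_base(fraction, base, places):
    """Return up to `places` fractional digits of `fraction` in `base`."""
    digits = []
    for _ in range(places):
        fraction *= base
        digit = int(fraction)  # the integer part is the next digit
        digits.append(digit)
        fraction -= digit      # carry only the fractional part forward
        if fraction == 0:
            break              # the expansion terminates
    return digits


print(fraction_to_base(Fraction(12, 100), 2, 9))  # [0, 0, 0, 1, 1, 1, 1, 0, 1]
print(fraction_to_base(Fraction(1, 8), 5, 6))     # [0, 3, 0, 3, 0, 3]
```

Passing a `Fraction` rather than a float keeps each product exact, so a repeating pattern shows up cleanly instead of being blurred by rounding error.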

4. Converting Fractional Binary to Decimal
  • Method: Similar to converting integers, multiply each fractional digit by its corresponding negative power of 2 and sum the products.

  • Formula: A fractional digit $d_{-i}$ in binary represents $d_{-i} \cdot 2^{-i}$.

  • Example: Convert $101.011_2$ to Decimal:
    $1 \cdot 2^2 + 0 \cdot 2^1 + 1 \cdot 2^0 + 0 \cdot 2^{-1} + 1 \cdot 2^{-2} + 1 \cdot 2^{-3}$
    $= 4 + 0 + 1 + 0 + 0.25 + 0.125 = 5.375_{10}$
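
A small Python sketch of the same weighted sum, covering both sides of the binary point. The function name `binary_to_decimal` is illustrative, and the input is assumed to be a string of 0s and 1s with at most one point.

```python
def binary_to_decimal(text):
    """Evaluate a binary string such as '101.011' by place value."""
    integer_part, _, fraction_part = text.partition(".")
    total = 0.0
    # Integer digits: weights 2**0, 2**1, ... from right to left.
    for position, bit in enumerate(reversed(integer_part)):
        total += int(bit) * 2 ** position
    # Fractional digits: weights 2**-1, 2**-2, ... from left to right.
    for position, bit in enumerate(fraction_part, start=1):
        total += int(bit) * 2 ** -position
    return total


print(binary_to_decimal("101.011"))  # 5.375
```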

5. Shortcut Conversions (Binary, Octal, Hexadecimal)
  • These bases are related by powers of 2, allowing for direct grouping of bits, which significantly simplifies conversions without needing to go through decimal.

    • Binary and Hexadecimal are related:

    • One hexadecimal digit represents exactly four binary bits ($2^4 = 16$). This relationship makes conversions very fast.

    • Binary to Hex: To convert, group the binary digits into sets of four, starting from the right for the integer part and from the left for the fractional part (pad with leading/trailing zeros if necessary to complete a group of four). Then convert each group to its single hex equivalent using the memorized table.

      • Example: $1010000100111111_2 \rightarrow 1010_2 \text{ (A) } 0001_2 \text{ (1) } 0011_2 \text{ (3) } 1111_2 \text{ (F)} \rightarrow \text{A13F}_{16}$

    • Hex to Binary: Convert each hex digit into its four-bit binary equivalent from the memorized table. Concatenate these binary groups to form the full binary number.

      • Example: $\text{1ACB}_{16} \rightarrow 0001_2 \text{ } 1010_2 \text{ } 1100_2 \text{ } 1011_2 \rightarrow 0001101011001011_2$

    • Binary and Octal are related:

    • One octal digit represents exactly three binary bits ($2^3 = 8$). This similarly allows for direct grouping.

    • Binary to Octal: Group binary digits into sets of three, starting from the right for the integer part and from the left for the fractional part (pad with leading/trailing zeros if necessary). Convert each group to its octal equivalent.

      • Example: $011010111_2 \rightarrow 011_2 \text{ (3) } 010_2 \text{ (2) } 111_2 \text{ (7)} \rightarrow 327_8$

    • Octal to Binary: Convert each octal digit into its three-bit binary equivalent. Concatenate these binary groups.

      • Example: $327_8 \rightarrow 011_2 \text{ } 010_2 \text{ } 111_2 \rightarrow 011010111_2$
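
The bit-grouping shortcut can be sketched in Python for the integer part. Both function names are hypothetical, and `group_size` is 4 for hexadecimal and 3 for octal.

```python
def regroup_binary(bits, group_size):
    """Binary string -> hex (group_size=4) or octal (group_size=3) string."""
    # Pad on the left so the length is a multiple of the group size.
    padded_length = -(-len(bits) // group_size) * group_size
    bits = bits.zfill(padded_length)
    groups = [bits[i:i + group_size] for i in range(0, len(bits), group_size)]
    return "".join("0123456789ABCDEF"[int(group, 2)] for group in groups)


def expand_to_binary(digits, group_size):
    """Hex or octal string -> binary string, one fixed-width group per digit."""
    return "".join(format(int(d, 16), f"0{group_size}b") for d in digits)


print(regroup_binary("1010000100111111", 4))  # A13F
print(regroup_binary("011010111", 3))         # 327
print(expand_to_binary("1ACB", 4))            # 0001101011001011
```

Neither function goes through the decimal value of the whole number. Each group is handled independently, which is the point of the shortcut.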

Arithmetic in Different Bases
1. Addition in Different Bases
  • Rule: Perform addition column by column, starting from the rightmost digit, just as you would in base 10. If the sum in a column equals or exceeds the base, you perform a carry. The value carried to the next column is exactly 1 (representing one group of the base), and the digit in the current column is the remainder after subtracting the base from the sum.

  • Example (Conceptual): In base 6, add $5_6 + 3_6$:

    • $5 + 3 = 8$ in decimal.

    • Since $8$ is greater than or equal to the base (6), we subtract the base: $8 - 6 = 2$.

    • We carry 1 to the next column (which doesn't exist here, so it becomes the leading digit).

    • The result is $12_6$ (carry 1, current digit 2).

  • Example: Add $452_6 + 133_6$:

    • Rightmost column ($6^0$): $2_6 + 3_6 = 5_6$. (No carry)

    • Middle column ($6^1$): $5_6 + 3_6 = 8_{10} = 12_6$. Write down 2, carry 1.

    • Leftmost column ($6^2$): $4_6 + 1_6 + 1 \text{ (carry)} = 6_{10} = 10_6$. Write down 0, carry 1.

    • Result: The final carry has no column to land in, so it becomes a new leading digit: $1025_6$. Check in decimal: $176 + 57 = 233 = 1 \cdot 216 + 0 \cdot 36 + 2 \cdot 6 + 5$.
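
Column-by-column addition with carries can be sketched as below. The function name and the representation (a list of digit values, most significant first) are assumptions made for the example.

```python
def add_in_base(a_digits, b_digits, base):
    """Add two digit lists column by column, carrying into the next column."""
    a = list(reversed(a_digits))  # index 0 is now the rightmost column
    b = list(reversed(b_digits))
    result = []
    carry = 0
    for i in range(max(len(a), len(b))):
        column = carry
        column += a[i] if i < len(a) else 0
        column += b[i] if i < len(b) else 0
        # Split the column sum into the carry and the digit that stays.
        carry, digit = divmod(column, base)
        result.append(digit)
    if carry:
        result.append(carry)  # a final carry becomes a new leading digit
    return list(reversed(result))


print(add_in_base([4, 5, 2], [1, 3, 3], 6))  # [1, 0, 2, 5]
```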

2. Subtraction in Different Bases
  • Rule: Perform subtraction column by column, starting from the rightmost digit. If a digit in the top number is too small to subtract from, you must borrow from the next higher column. When you borrow 1 from a column with place value $b^x$, it adds a value equal to the base ($b$) to the current column (place value $b^{x-1}$). The digit in the column from which you borrowed is reduced by 1.

  • Example (Conceptual): In base 6, if you need to subtract $5_6$ from a top digit of $2_6$ in a column:

    • You would borrow 1 from the digit to its left.

    • That borrowed 1 (from the $6^1$ position) becomes $6_{10}$ in the current $6^0$ column.

    • So, in the current column, you now effectively have $(2 + 6)_{10} = 8_{10}$.

    • Then, you subtract: $8_{10} - 5_{10} = 3_{10} = 3_6$.

    • The digit in the column from which you borrowed would be reduced by 1.

  • Binary Subtraction: The fundamental rules are simplified due to only two digits:

    • $0 - 0 = 0$

    • $1 - 0 = 1$

    • $1 - 1 = 0$

    • $0 - 1 = 1$ (This requires a borrow from the left. When you borrow 1 from the next position, it becomes $2_{10}$ in the current position. So, $2_{10} - 1 = 1$.)
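
Borrowing can be sketched the same way in Python. The function name is hypothetical, digits are lists with the most significant digit first, and the sketch assumes the first number is at least as large as the second.

```python
def subtract_in_base(a_digits, b_digits, base):
    """Compute a - b column by column, assuming a >= b."""
    if len(b_digits) > len(a_digits):
        raise ValueError("this sketch assumes a >= b without leading zeros")
    a = list(reversed(a_digits))  # index 0 is the rightmost column
    b = list(reversed(b_digits))
    result = []
    borrow = 0
    for i in range(len(a)):
        column = a[i] - borrow - (b[i] if i < len(b) else 0)
        if column < 0:
            column += base  # the borrowed 1 is worth `base` in this column
            borrow = 1      # the next column to the left is reduced by 1
        else:
            borrow = 0
        result.append(column)
    if borrow:
        raise ValueError("a must be >= b")
    while len(result) > 1 and result[-1] == 0:
        result.pop()        # drop leading zeros
    return list(reversed(result))


print(subtract_in_base([4, 2], [5], 6))  # [3, 3]
print(subtract_in_base([1, 0], [1], 2))  # [1]
```

In the first call the rightmost column is 2 minus 5, which needs a borrow: (2 + 6) - 5 = 3, and the 4 to its left drops to 3.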

Modulo Operation (a mod m)
  • Definition: The modulo operation, $a \text{ mod } m$, calculates the remainder when an integer $a$ (the dividend) is divided by another integer $m$ (the divisor), where $m$ must be greater than 0. The result $r$ is always non-negative and less than $m$.

  • Formula (Relationship): The division algorithm states that for any integers $a$ and $m$ with $m > 0$, there exist unique integers $q$ (quotient) and $r$ (remainder) such that $a = m \cdot q + r$, where $0 \le r < m$. The modulo operation finds this $r$, so $r = a \text{ mod } m$.

  • Applications: Detecting divisibility (if $a \text{ mod } m = 0$, then $a$ is divisible by $m$), cyclic counting (e.g., hours on a clock, days of the week), modular arithmetic in cryptography, hashing functions, and understanding fixed-width arithmetic (how numbers 'wrap around' on overflow).

  • Examples:

    • $12 \text{ mod } 5 = 2$ (because $12 = 5 \cdot 2 + 2$)

    • $135 \text{ mod } 2 = 1$ (Useful for checking if a number is odd (1) or even (0)).

    • $-13 \text{ mod } 6 = 5$ (because $-13 = -3 \cdot 6 + 5$; the remainder must be non-negative).

    • $(5 \times 7) \text{ mod } 9 = 35 \text{ mod } 9 = 8$ (because $35 = 9 \cdot 3 + 8$)
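
The definition can be checked directly in Python, whose `%` operator and `divmod` function return a non-negative remainder whenever the divisor is positive. The wrapper name `mod` is illustrative.

```python
def mod(a, m):
    """Return r with a == m*q + r and 0 <= r < m, for m > 0."""
    if m <= 0:
        raise ValueError("m must be greater than 0")
    q, r = divmod(a, m)  # floor division, so r is never negative here
    assert a == m * q + r and 0 <= r < m
    return r


print(mod(12, 5))      # 2
print(mod(-13, 6))     # 5
print(divmod(-13, 6))  # (-3, 5)
```

Some other languages truncate toward zero instead; C's `%`, for example, gives -1 for -13 and 6. Check the language's rule before relying on the sign of the result for a negative dividend.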

Powers of Two and Metric Prefixes
  • Powers of 2 to Memorize: These are fundamental in computer science and often appear in various contexts related to memory, data storage, and processing.

    • $2^0 = 1$

    • $2^1 = 2$

    • $2^2 = 4$

    • $2^3 = 8$

    • $2^4 = 16$

    • $2^5 = 32$

    • $2^6 = 64$

    • $2^7 = 128$

    • $2^8 = 256$

    • $2^9 = 512$

    • $2^{10} = 1024$

  • Metric Prefixes for Powers of 2 (approximate powers of 10): In computing, the metric prefixes (kilo, mega, giga, ...) are often used loosely for powers of 2, because each of these powers of 2 is close to the corresponding power of 10. Strictly speaking, the metric prefixes denote powers of 10; the IEC binary prefixes (kibi, mebi, gibi, ...) are the unambiguous names for the powers of 2.

    • $2^{10} = 1024$ (often referred to as 1 Kilobyte (KB), or more precisely 1 Kibibyte (KiB); approximately $10^3$)

    • $2^{20} = 1,048,576$ (often referred to as 1 Megabyte (MB), or more precisely 1 Mebibyte (MiB); approximately $10^6$)

    • $2^{30} = 1,073,741,824$ (often referred to as 1 Gigabyte (GB), or more precisely 1 Gibibyte (GiB); approximately $10^9$)

    • $2^{40} = 1,099,511,627,776$ (often referred to as 1 Terabyte (TB), or more precisely 1 Tebibyte (TiB); approximately $10^{12}$)

    • $2^{50} = 1,125,899,906,842,624$ (often referred to as 1 Petabyte (PB), or more precisely 1 Pebibyte (PiB); approximately $10^{15}$)

  • Shortcut for large powers: To quickly estimate large powers of 2, you can use the relationship with the kilo, mega, giga prefixes:
    $2^{n} = 2^{(n \text{ mod } 10)} \times 2^{10 \cdot \lfloor n/10 \rfloor}$ (e.g., $2^{13} = 2^3 \times 2^{10} = 8 \times 1\text{K} = 8\text{K}$; $2^{22} = 2^2 \times 2^{20} = 4 \times 1\text{M} = 4\text{M}$)
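
The shortcut amounts to splitting the exponent into tens and ones. A minimal Python sketch follows; the names `PREFIXES` and `power_of_two_label` are made up for the example, and K, M, G, T, P stand for the binary values $2^{10}$ through $2^{50}$.

```python
PREFIXES = ["", "K", "M", "G", "T", "P"]  # 2**0, 2**10, 2**20, ...


def power_of_two_label(n):
    """Label 2**n as a small power of two followed by a binary prefix."""
    if n < 0:
        raise ValueError("n must be non-negative")
    tens, ones = divmod(n, 10)  # n = 10 * tens + ones
    if tens >= len(PREFIXES):
        raise ValueError("exponent too large for this prefix table")
    return f"{2 ** ones}{PREFIXES[tens]}"


print(power_of_two_label(13))  # 8K
print(power_of_two_label(22))  # 4M
```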

Binary Number Representation and Limits
1. Unsigned Binary Numbers
  • Definition: In unsigned binary representation, all bits in a number are used to represent the magnitude of the value. This means numbers are always considered non-negative and can only represent zero or positive integers.

  • Storage Limits for $n$ bits:

    • Number of distinct values: $2^n$ (each bit can be 0 or 1, and there are $n$ bits).

    • Smallest value: $0$ (all bits are 0).

    • Largest value: $2^n - 1$ (all bits are 1).

  • Example: With 8 bits:

    • Range: $0$ to $2^8 - 1 = 255$.

    • So, an 8-bit unsigned integer can represent any integer from 0 to 255, inclusive.

2. Signed Binary Numbers (Two's Complement)
  • Definition: Two's Complement is the most common and efficient method used by modern computers to represent signed integers (positive, negative, and zero). Its key advantages include a single representation for zero and simplified arithmetic operations (addition and subtraction can be performed using the same hardware).

  • The most significant bit (MSB) acts as the sign bit: 0 for non-negative numbers (zero and positive) and 1 for negative numbers. However, for negative numbers, the remaining bits are not simply the magnitude of the number.

  • Storage Limits for $n$ bits:

    • Number of distinct values: $2^n$ (same as unsigned).

    • Smallest (most negative) value: $-2^{n-1}$ (when the MSB is 1 and all other bits are 0).

    • Largest (most positive) value: $2^{n-1} - 1$ (when the MSB is 0 and all other bits are 1).

  • Example: With 8 bits:

    • Range: $-2^{8-1}$ to $2^{8-1} - 1$, that is, $-2^7$ to $2^7 - 1$, or $-128$ to $127$.
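
The unsigned and two's complement limits for a given width can be computed directly from the formulas. The two function names below are illustrative only.

```python
def unsigned_range(n):
    """Smallest and largest values of an n-bit unsigned integer."""
    return 0, 2 ** n - 1


def signed_range(n):
    """Smallest and largest values of an n-bit two's complement integer."""
    return -(2 ** (n - 1)), 2 ** (n - 1) - 1


print(unsigned_range(8))  # (0, 255)
print(signed_range(8))    # (-128, 127)
```

Both ranges contain exactly $2^n$ values. The signed range is asymmetric because zero takes one of the non-negative bit patterns.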

3. Converting Decimal to Two's Complement
  • For Positive Numbers: Convert the positive decimal number to its binary representation normally. Then, pad with leading zeros to meet the specified number of bits ($n$).

    • Example: $+2_{10}$ in 8 bits is $00000010_2$.

    • Example: $+79_{10}$ in 8 bits is $01001111_2$.

  • For Negative Numbers (Steps): This is a three-step process:

    1. Take the absolute value of the decimal number and convert it to its binary representation using the specified number of bits ($n$).

    2. Invert all the bits (0s become 1s, and 1s become 0s). This result is known as the one's complement.

    3. Add 1 to the inverted result. Any carry-out from the most significant bit is discarded.

  • Example: Convert $-79_{10}$ to 8-bit Two's Complement:

    1. Absolute value: $|-79_{10}| = 79_{10}$, which in 8-bit binary is $01001111_2$.

    2. Invert all bits (one's complement): $10110000_2$.

    3. Add 1: $10110000_2 + 1_2 = 10110001_2$.

    • Result: $-79_{10} = 10110001_2$
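
The invert-and-add-one steps can be sketched in Python on bit strings. The function name `to_twos_complement` is an assumption for the example, and values outside the representable range are rejected.

```python
def to_twos_complement(value, bits):
    """Encode an integer as a two's complement bit string of width `bits`."""
    low, high = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    if not low <= value <= high:
        raise ValueError(f"{value} does not fit in {bits} bits")
    if value >= 0:
        return format(value, f"0{bits}b")  # plain binary, zero-padded
    # Step 1: binary of the absolute value, padded to the full width.
    magnitude = format(-value, f"0{bits}b")
    # Step 2: invert every bit (one's complement).
    inverted = "".join("1" if b == "0" else "0" for b in magnitude)
    # Step 3: add 1 and discard any carry out of the top bit.
    total = (int(inverted, 2) + 1) % (1 << bits)
    return format(total, f"0{bits}b")


print(to_twos_complement(79, 8))   # 01001111
print(to_twos_complement(-79, 8))  # 10110001
```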

4. Converting Two's Complement to Decimal
  • For Positive Numbers (Leading bit is 0): If the most significant bit is 0, the number is positive. Convert the binary number to decimal normally, just as you would for an unsigned binary number.

    • Example: $01101100_2 = 64 + 32 + 8 + 4 = 108_{10}$.

  • For Negative Numbers (Leading bit is 1): If the most significant bit is 1, the number is negative. Use one of two methods:

    1. Method 1 (Invert and Add 1):
      a. Invert all the bits (take the one's complement).
      b. Add 1 to the inverted result.
      c. Convert the resulting positive binary number to decimal.
      d. Place a negative sign in front of the decimal value to get the final answer.

    2. Method 2 (Weighted Sum with Negative MSB): Treat the MSB's place value as negative. For an $n$-bit two's complement number $d_{n-1}d_{n-2}\ldots d_0$, the decimal value is $(-d_{n-1} \cdot 2^{n-1}) + (d_{n-2} \cdot 2^{n-2}) + \ldots + (d_0 \cdot 2^0)$.

  • Example: Convert $10001011_2$ (8-bit Two's Complement) to Decimal (using Method 1):

    1. Invert bits: $01110100_2$

    2. Add 1: $01110100_2 + 1_2 = 01110101_2$

    3. Convert to decimal: $01110101_2 = (0 \cdot 128) + (1 \cdot 64) + (1 \cdot 32) + (1 \cdot 16) + (0 \cdot 8) + (1 \cdot 4) + (0 \cdot 2) + (1 \cdot 1) = 64 + 32 + 16 + 4 + 1 = 117_{10}$.

    4. Place a negative sign: $-117_{10}$.

    • Result: $10001011_2 = -117_{10}$
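
Both decoding methods can be sketched side by side in Python. The function names are hypothetical, and the input is assumed to be a bit string whose length is the word size.

```python
def decode_invert_add_one(bits):
    """Method 1: invert the bits, add 1, then negate (for a leading 1)."""
    if bits[0] == "0":
        return int(bits, 2)
    inverted = "".join("1" if b == "0" else "0" for b in bits)
    return -(int(inverted, 2) + 1)


def decode_weighted_sum(bits):
    """Method 2: ordinary place values, except the MSB weight is negative."""
    n = len(bits)
    total = -int(bits[0]) * 2 ** (n - 1)
    for i, bit in enumerate(bits[1:], start=1):
        total += int(bit) * 2 ** (n - 1 - i)
    return total


print(decode_invert_add_one("10001011"))  # -117
print(decode_weighted_sum("10001011"))    # -117
```

With Method 2 the same example reads as -128 + 8 + 2 + 1 = -117, which is a quick way to cross-check Method 1.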

Floating-Point Representation (IEEE 754 Standard)

Floating-point representation is a method to encode real numbers (numbers with fractional parts) within a fixed number of bits, allowing for a wide dynamic range at the cost of precision. The IEEE 754 standard is the most common standard for floating-point arithmetic.

1. General Structure

Floating-point numbers (real numbers) are typically stored using three main parts:

  1. Sign Bit (S): A single bit that determines whether the number is positive (0) or negative (1).

  2. Exponent (E): Represents the magnitude of the number. To handle both positive and negative exponents conveniently without needing a separate sign bit for the exponent, a bias is added to the actual exponent. The stored exponent is therefore the true exponent plus a fixed bias. For single-precision (32-bit) floats, the bias is 127; for double-precision (64-bit), it's 1023.

  3. Mantissa / Significand (M): Holds the precision bits of the number. For normal numbers in IEEE 754, the significand is normalized to the form $1.f$, where $f$ is the stored fractional part. The leading '1' before the binary point is implicit (not stored), which saves a bit; denormalized numbers are the exception.

2. Format Details (IEEE 754)
  • Single-Precision (32-bit floating-point number):

    • 1 bit for the Sign (S)

    • 8 bits for the Exponent (E), with a bias of $2^{8-1} - 1 = 127$

    • 23 bits for the Mantissa/Significand (M), providing a precision equivalent to about 7 decimal digits.

    • For normal numbers, the value is calculated as: $(-1)^S \times 2^{(\text{Exponent} - \text{Bias})} \times (1 + \text{Mantissa})$, where Mantissa is the stored fraction bits read as a binary fraction $0.f$.

  • Double-Precision (64-bit floating-point number):

    • 1 bit for the Sign (S)

    • 11 bits for the Exponent (E), with a bias of $2^{11-1} - 1 = 1023$

    • 52 bits for the Mantissa/Significand (M), providing a precision equivalent to about 15-17 decimal digits.

    • For normal numbers, the value is calculated as: $(-1)^S \times 2^{(\text{Exponent} - \text{Bias})} \times (1 + \text{Mantissa})$, where Mantissa is the stored fraction bits read as a binary fraction $0.f$.
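
To see the three fields of a single-precision number, the bit pattern can be extracted with Python's standard `struct` module. This is a sketch under stated assumptions: the function names are made up for the example, and `normal_value` only handles normal numbers (stored exponent between 1 and 254).

```python
import struct


def decode_float32(x):
    """Round x to single precision and split it into (sign, exponent, fraction)."""
    (raw,) = struct.unpack(">I", struct.pack(">f", x))
    sign = raw >> 31                # 1 bit
    exponent = (raw >> 23) & 0xFF   # 8 bits, stored with bias 127
    fraction = raw & 0x7FFFFF       # 23 bits
    return sign, exponent, fraction


def normal_value(sign, exponent, fraction, bias=127, fraction_bits=23):
    """Apply (-1)^S * 2^(E - bias) * (1 + M) for a normal number."""
    mantissa = fraction / (1 << fraction_bits)  # 0.f as a number in [0, 1)
    return (-1) ** sign * 2.0 ** (exponent - bias) * (1 + mantissa)


fields = decode_float32(-6.25)
print(fields)                 # (1, 129, 4718592)
print(normal_value(*fields))  # -6.25
```

Here $6.25 = 110.01_2 = 1.1001_2 \times 2^2$, so the stored exponent is $2 + 127 = 129$ and the fraction bits begin 1001.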

3. Special Cases

The IEEE 754 standard defines specific combinations of the exponent and mantissa to represent special values:

  • Zero (0): Represented by an exponent of all zeros and a mantissa of all zeros. Both +0 and -0 exist.

  • Infinity ($\infty$): Represented by an exponent of all ones and a mantissa of all zeros. Both $+\infty$ and $-\infty$ exist, selected by the sign bit. Used for results that overflow the representable range and for dividing a nonzero number by zero.

  • Denormalized Numbers: Represented by an exponent of all zeros and a non-zero mantissa. They allow for representing numbers even closer to zero than normalized numbers, filling the underflow gap. The implicit leading '1' is dropped, and the exponent is fixed at the minimum valid exponent ($1 - \text{Bias}$).

  • Not a Number (NaN): Represented by an exponent of all ones and a non-zero mantissa. Used to represent results of invalid or unrepresentable operations (e.g., $0/0$, $\sqrt{-1}$).
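
The special cases follow mechanically from the exponent and mantissa fields. The Python sketch below classifies a raw 32-bit single-precision pattern; the function names are hypothetical, and the input is assumed to be an integer in the range 0 to $2^{32} - 1$.

```python
import struct


def classify_float32(raw):
    """Classify a 32-bit pattern by its exponent and fraction fields."""
    exponent = (raw >> 23) & 0xFF
    fraction = raw & 0x7FFFFF
    if exponent == 0:
        return "zero" if fraction == 0 else "denormalized"
    if exponent == 0xFF:
        return "infinity" if fraction == 0 else "nan"
    return "normal"


def bits_to_float(raw):
    """Reinterpret a 32-bit pattern as a single-precision value."""
    return struct.unpack(">f", struct.pack(">I", raw))[0]


print(classify_float32(0x80000000))  # zero (this pattern is -0)
print(classify_float32(0x7F800000))  # infinity
print(classify_float32(0x00000001))  # denormalized
print(classify_float32(0x7FC00000))  # nan
```

The pattern `0x00000001` is the smallest positive denormalized value, $2^{-23} \times 2^{1 - 127} = 2^{-149}$.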

This detailed structure allows for efficient and standardized representation and arithmetic of a wide range of real numbers in computing systems.