Module 2: Control Unit Design & ALU Operations - Comprehensive Notes

Single Bus Organization

  • MDR (Memory Data Register) has 2 inputs and 2 outputs. Data may be loaded into the MDR either from memory bus (external) or from processor bus (internal).

  • MAR (Memory Address Register) input is connected to the internal bus, and MAR's output is connected to the external bus.

  • Instruction-decoder and control-unit are responsible for issuing signals that control the operation of all units inside the processor (and for interacting with the memory bus) and for implementing the actions specified by the instruction (loaded in the IR).

  • Registers R0 through R(n−1) are provided for general-purpose use by the programmer.

  • Additional internal storage registers Y, Z, and TEMP are used by the processor for temporary storage during the execution of some instructions. These are transparent to the programmer and are never referenced explicitly by any instruction.

  • A multiplexer (MUX) selects either the output of Y or a constant value 4 (used to increment the PC) as input A to the ALU. The B input of the ALU is obtained directly from the processor bus.

  • As instructions execute, data are transferred from one register to another, often passing through the ALU to perform arithmetic or logic operations.

  • An instruction can be executed by performing one or more of the following operations:
    1) Transfer a word of data from one processor register to another or to the ALU.
    2) Perform an arithmetic or logic operation and store the result in a processor register.
    3) Fetch the contents of a given memory location and load them into a processor register.
    4) Store a word of data from a processor register into a given memory location.

Observations and Insights

  • The internal processor bus and the external memory bus can be operated independently (concurrently) because of MAR and MDR separation.
  • Independent operations enable parallel steps, e.g., memory access can occur while the PC is incremented, or instruction decoding occurs while source registers are read.
  • During memory access, the processor waits for MFC (Memory Function Completed). There is nothing to do but wait for a few cycles. MFC indicates that the requested operation has completed.

Multiple-Bus Organization

  • A major disadvantage of the single-bus organization is that only one data item can be transferred internally at a time.

  • In the register-file, all general-purpose registers are combined into a single block with 3 ports.

    • There are 2 outputs allowing contents of two different registers to be simultaneously placed on buses A and B.
    • The third input-port (bus C) allows data on bus C to be loaded into a third register during the same clock cycle.
  • In a multiple-bus system:

    • Bus A and B are used to transfer the source operands to the ALU inputs A and B.
    • The ALU performs the operation, and the result is transferred to the destination over bus C.
  • The design allows:

    • The contents of two different registers to be accessed simultaneously and placed on buses A and B.
    • The data on bus C to be loaded into a third register in the same clock cycle.
    • An incrementer unit is available.
    • The ALU can simply pass one of its two input operands unmodified to bus C (control signals: R=A or R=B).
  • An incrementer (used to increment the PC by 4) eliminates the need to add the constant value 4 to the PC via the main ALU. The source for the constant 4 at the ALU input multiplexer can be used to increment other addresses (e.g., load/multiple and store/multiple operations).

  • Example control sequence for an instruction in a three-bus organization (Figure 7.9 reference):

    • Step 1: PC out → MAR in, Read; IncPC (increment PC)
    • Step 2: WMF C (write memory function complete) // memory read completes
    • Step 3: MDR out B, R=B, IR in
    • Step 4: R4 out A, R5 out B, SelectA, Add, R6 in // example path
  • Practical execution with a three-bus system shows how the processor fetches an instruction and then decodes/executed it with parallelism across buses.

UNIT - IV: Basic Processing Unit

Overview

  • Instruction Set Processor (ISP) / Central Processing Unit (CPU): A typical computing task is executed as a sequence of machine instructions that specify rudimentary steps.
  • An instruction is executed by carrying out a sequence of smaller operations (micro-operations).

Some Fundamental Concepts

  • The processor fetches one instruction at a time and performs the operation specified.
  • Instructions are fetched from successive memory locations until a branch or a jump occurs.
  • The Program Counter (PC) keeps track of the memory location containing the next instruction to be fetched.
  • The Instruction Register (IR) holds the current instruction to be executed.

Executing an Instruction

  • Fetch the contents of the memory location pointed to by the PC and load them into the IR (fetch phase): IR ← [PC].
  • Increment the PC by 4 (assuming byte-addressable memory): PC ← [PC] + 4.
  • Carry out the actions specified by the instruction in the IR (execution phase).

Processor Organization: Datapath

  • The datapath comprises: ALU, registers for temporary storage, and various digital circuits (gates, multiplexers, decoders, counters) to execute micro-operations.
  • The internal path for movement of data between the ALU and registers is implemented by gating circuits and a common bus path.
  • Driver circuits transmit signals to external units; receiver circuits handle incoming signals from external units.

Internal Registers and Data Path

  • PC keeps track of program execution: memory address of the next instruction.
  • MAR holds the address of the location to be accessed; MAR input is on the internal bus; MAR output drives the external bus.
  • MDR contains data to be written into or read from the addressed location; MDR has 2 inputs and 2 outputs, with loadable data from either the memory bus or the internal processor bus.
  • R0 to Rn−1: general-purpose registers.
  • Y, Z, TEMP: temporary registers used during instruction execution; not visible to the programmer.
  • A MUX can select either Y or constant 4 as input A to the ALU; B input to ALU is from the processor bus.
  • Ri input/output gating: Each Ri has Rin and Riout control signals to load from the common bus or place its contents on the bus, respectively.

Example: Data Transfers and Arithmetic in the Data Path

  • To transfer data from R1 to R4:

    • Enable R1out to place R1 on the bus, enable R4in to load the data into R4.
  • For an arithmetic operation, the ALU takes two operands from MUX and the bus, computes the result, and stores it in Z; then Zout loads the result into the destination register.

  • Example sequence to add R1 and R2 and store in R3:

    • Step 1: R1out, Yin
    • Step 2: R2out, SelectY, Add, Zin
    • Step 3: Zout, R3in
  • Fetching a Word from Memory:

    • Address loaded into MAR; issue Read; data loaded into MDR.
    • The processor waits for WMFC (Memory-Function Completed).
    • MDR out, R2 in (data moved from MDR to a register).
  • Control and timing in memory operations: The timing steps include asserting Rin, Read, WMFC, and then loading MDR and transferring to the destination register.

Execution of a Complete Instruction

  • Example: Add (R3), R1
    • Fetch the instruction
    • Fetch the first operand (contents of memory location pointed to by R3)
    • Perform the addition
    • Load the result into R1

Branch Instructions

  • Unconditional Branch: The PC is replaced with the branch target address, typically obtained by adding an offset X in the branch instruction. The offset X is the difference between the branch target and the address following the branch instruction.
  • Control steps for unconditional branch include transferring PC to MAR, performing a memory read, streaming into IR, computing target address, and updating PC.

Control Unit Implementation

  • The CPU, memory, and I/O are the major digital hardware functional units.
  • The control unit initiates sequences of micro-operations.
  • There are two methods to implement the control unit: hardwired control and microprogrammed control.

Hardwired Control Unit

  • Control signals are generated by hardware using fixed logic (and/or arrays, decoders, etc.).
  • Key characteristics:
    • High-speed operation
    • Expensive and relatively complex
    • No flexibility to add new instructions easily
  • Examples of CPUs with hardwired control: Intel 8085, Motorola 6802, Zilog Z80, and many RISC CPUs.

Microprogrammed Control

  • An alternative design where control signals are generated by a microprogram stored in a fast memory called the control store.
  • The control program (microprogram) determines the control signal settings for each execution step.
  • The microprogram store contains control words (CW), each representing a specific combination of control signals (e.g., Add, End, Zin).
  • Microinstructions form micro-routines, sequences of micro-instructions that implement instruction execution.
  • A microprogram counter (μPC) reads control words sequentially; when a new instruction is loaded into IR, the starting address generator loads the starting address into μPC, and μPC is incremented to fetch successive microinstructions.
  • Control signals are generated by a programmatic set of micro-instructions, analogous to a separate instruction stream embedded in the CPU.

Microprogrammed Control: Details and Architecture

Microinstruction Organization

  • A straightforward microinstruction structure assigns one bit to each control signal. This is inefficient due to long microinstructions and sparse bit usage.
  • Length can be reduced by exploiting mutual exclusivity and grouping signals that are not needed simultaneously.
  • Examples of groupings:
    • ALU operations: only one ALU function is active at a time.
    • Data transfer sources: only one source is active for a given transfer.
    • Read/Write memory cannot be active simultaneously.
  • A binary coding scheme represents signals within a group; at most one micro-operation per group.

Microinstruction Formats: Horizontal vs Vertical

  • Horizontal organization (wide, high parallelism):
    • Large words, minimal decoding logic, fast.
  • Vertical organization (narrow, sequential):
    • Small control store, more decoding logic, potentially slower.
  • The trade-off: speed vs. control-store size and decoding complexity.

Microprogram Sequencing

  • If most microprograms require straightforward sequential execution with occasional branches, a μPC-driven flow is efficient.
  • Disadvantages of per-instruction microprogram routines:
    • Large total microinstruction count and control store size.
    • Potentially longer execution due to branching overhead.
  • Example: Add src, Rdst with multiple addressing modes (register, autoincrement, autodecrement, indexed, and indirect forms).

Conditional Branching in Microprograms

  • To support conditional branches, microinstructions can examine external inputs or condition codes and decide whether to branch to alternate microroutines.
  • Structure includes:
    • Starting and branch-address generator. It loads a new μPC value when instructed.
    • μPC increments automatically on fetching microinstructions unless a new instruction is loaded or a branch is taken.
    • Branch conditions can be checked via condition-code inputs and IR contents.

Microinstruction and Control Store Details

  • Microinstructions are stored as control words (CW) in the control store.
  • Each CW can include bits for various control signals (e.g., ALU operation, register transfers, memory control signals).
  • The microprogram memory (control store) is read by μPC, and the generated microinstructions drive the datapath.
  • A starting address generator supplies the μPC starting address for each machine instruction; μPC increments to read subsequent microinstructions.

Microinstruction Encoding and Optimization

  • Microinstructions can be optimized by grouping mutually exclusive signals and using binary coding within groups.
  • A side approach to efficiency is bit-ORing: a technique to bypass certain microinstructions by directly encoding a desired next microinstruction address via bit-level manipulation.
  • Microinstructions with a Next-Address Field: each microinstruction includes an address field to indicate the next microinstruction location, reducing the need for separate branch microinstructions.
    • Benefits: less branch microinstructions, more flexible addressing.
    • Cost: additional bits needed for the next-address field.

Implementation of Microroutines

  • A practical example demonstrates bit-ORing and the use of next-address fields to streamline microroutines (e.g., for Add (Rsrc), Rdst).
  • The microcode example shows how microinstructions transition from one stage to another, including cases where the addressing mode bypasses certain microinstructions.

Control Circuitry for Microprogrammed Control

  • The control circuitry includes: IR, mPC, a microinstruction decoder, and sequencing logic that selects the next microinstruction based on the current microinstruction, addressing mode, and branch decisions.

Arithmetic: Addition, Subtraction, and Carry Mechanics

Addition and Subtraction of n-bit Numbers

  • Addition of two n-bit numbers can be built from n full-adders (FA) cascaded in a ripple-carry adder.
  • Each FA computes a sum bit and a carry-out, with carries rippling through the chain.
  • Subtraction can be performed by adding the two's complement of the subtrahend:
    • X − Y = X + (−Y) = X + Y + 1 (in two's complement)
  • For add/sub control in an ALU: add/sub control = 0 implies addition; add/sub control = 1 implies subtraction.

Carry-Lookahead Adder (CLA)

  • To speed up addition, carries can be computed using carry-lookahead logic.
  • Define Generate and Propagate signals for each bit position i:
    • G<em>i=x</em>iyiG<em>i = x</em>i y_i
    • P<em>i=x</em>i+yiP<em>i = x</em>i + y_i
  • Carry lookahead recurrence:
    • c<em>i+1=G</em>i+P<em>ic</em>ic<em>{i+1} = G</em>i + P<em>i c</em>i
  • A full derivation shows that carries can be computed in parallel, abstracting the dependency on previous carries, yielding much faster adders.
  • For an n-bit adder, a single 4-bit CLA block is practical; for longer adders, cascade 4-bit CLA blocks to form a Blocked Carry-Lookahead Adder.
  • The carry-out of a block can be expressed as a function of the Gi and Pi of that block and preceding blocks, enabling fast hierarchical carry computation.

4-bit CLA and Blocked CLA

  • A 4-bit CLA uses a B-cell structure to generate local carries and sums in parallel.
  • Blocking four bits at a time provides manageable fan-in and wiring while maintaining high speed for longer operands.

Multiplication of Unsigned Numbers

Overview

  • The product of two n-bit unsigned numbers is up to a 2n-bit number.
  • View multiplication as the addition of shifted versions of the multiplicand when the multiplier bits are 1.
  • A standard method uses an array of partial products and adds them (shifted multiplicands) to accumulate the final product.

Shifts and Partial Products

  • If the i-th bit of the multiplier is 1, shift the multiplicand left by i and add it to the current partial product.
  • A common approach stores partial products and uses an adder tree to accumulate the sum.

Multiplication Cells and Architectures

  • A typical multiplication cell includes a partial-product bit, an input carry, and an adder cell to accumulate contributions.
  • Two broad techniques:
    • Combinatorial array multiplier: very fast for small word lengths but high gate count for large widths.
    • Sequential multiplier: uses less combinational logic, trading speed for area/complexity.

Booth Algorithm (Signed and Unsigned Extensions)

  • Booth multiplier recoding table helps handle signed operands efficiently.
  • For Booth, the multiplier version of the multiplicand is selected by bit i of the multiplier and bit i−1, enabling recoding to minimize adds/subtracts.
  • Booth’s algorithm is particularly useful for signed multiplication and can reduce the number of partial products.

Division: Restoring and Non-Restoring Techniques

Longhand Division (Binary)

  • Binary longhand division proceeds by aligning divisor and dividend, performing successive subtractions, and generating quotient bits as you go.
  • If the partial remainder is nonnegative, the quotient bit is 1; otherwise, the divisor is added back (restored) and the next bit is processed.

Restoring Division

  • Steps in restoring division:
    • Shift A and Q left by one bit.
    • Subtract M from A; if A is negative, restore A by adding M back and set q0 accordingly.
    • Repeat for n iterations.

Non-restoring Division

  • Avoid restoring after a nonzero negative remainder by adjusting the operation (add or subtract M) based on the sign of A at each step.
  • After n steps, finalize by adjusting the quotient bit based on the final sign of A.

Floating Point Arithmetic and IEEE Representation

Fractions and Fixed-Point Background

  • A binary vector b31..b0 with an implicit binary point can be interpreted as an unsigned integer:
    • V(b) = b{31} 2^{-1} + b{30} 2^{-2} +
      … + b1 2^{-31} + b0 2^{-32}.
  • The implicit binary point can be moved left or right, producing a normalized scientific form.

Range of Fractions

  • For an n-bit binary fraction (with an implicit point to the left of the vector):
    • 0 ≤ V(b) ≤ 1 − 2^{−n}.

Scientific Notation and Floating Point Components

  • Scientific representation: ± m1 . m2 m3 m4 … × b^e, where b is the base (2 for binary).
  • Mantissa (significand), exponent, and base provide a wide dynamic range.
  • Normalization ensures the most significant 1 of the mantissa is in the MSB position.

IEEE Floating Point Notation (Standard Representations)

  • There are two common precisions: Single and Double.
    • Both use base 2 and normalized mantissas with a hidden leading 1 (except for denormals).
  • IEEE single precision (32-bit): 1 sign bit, 8-bit exponent (excess-127), 23-bit mantissa.
  • IEEE double precision (64-bit): 1 sign bit, 11-bit exponent (excess-1023), 52-bit mantissa.
  • General form: for a nonzero number, x = (−1)^s × 1.mantissa × 2^{(E − bias)}, where bias is 127 for single and 1023 for double.
  • The sign bit determines the overall sign, the exponent is stored in excess notation, and the mantissa is normalized with an implicit leading 1 (hidden bit).

Exponent Encoding and Excess Notation

  • Exponent fields are stored in excess notation to simplify comparisons: E’ = E_true + bias.
  • Example: In single precision, an exponent field stores E’ with bias 127; in double precision, bias is 1023.
  • This representation helps with comparing the magnitudes of floating point numbers.

IEEE Exponent Edge Conditions

  • The exponent field encodes special values (e.g., 0 and 255 for single precision) representing exact zero, infinity, and NaNs (not detailed here but noted as special cases).

Floating Point Arithmetic Rules

  • Addition/Subtraction:
    • Align exponents by shifting the mantissa of the number with the smaller exponent to the right.
    • Add/subtract mantissas, determine the sign, normalize if needed, and then truncate/round to the required mantissa precision.
    • Choose the number with the smaller exponent as the one to be shifted.
  • Multiplication:
    • Add exponents; subtract the bias; multiply mantissas; determine sign; normalize; truncate/round mantissa.
  • Division:
    • Subtract exponents; add the bias; divide mantissas; determine sign; normalize; truncate/round mantissa.
  • Guard bits: When aligning mantissas, extra bits are kept to prevent loss of precision during right shifts; after operations, mantissas are normalized and truncated/rounded to the target mantissa width.
  • Truncation/Rounding: Various strategies exist; IEEE generally uses rounding to nearest with tie-breaking rules (guard bits used to decide rounding).

Normalization, Overflow, and Underflow

  • Normalization adjusts the mantissa so that the MSB is 1, shifting the mantissa and adjusting the exponent accordingly.
  • Overflow occurs when the exponent exceeds the representable range; underflow occurs when the exponent becomes too small (too negative).
  • The normalization process may cause overflow or underflow depending on how far the exponent must shift.

Exponent Representation and Scaling Issues

  • Excess notation (bias) is used to simplify comparisons and arithmetic of exponents.
  • When multiplying numbers with exponents in excess form, the effective exponent addition may appear as x+y+2p if the simple interpretation is used; proper normalization and bias handling is essential.

Division: Restoring and Non-Restoring (Binary) – Worked Examples

  • Restoring division uses a sequence of shifts and subtract-then-possibly-restore steps; non-restoring avoids the explicit restoration step by adjusting based on the sign of the partial remainder A.
  • Binary division circuits implement these steps with a sequence of add/subtract operations and test signs to determine quotient bits.

Addition, Subtraction, and Overflow Detection in Binary Arithmetic

  • Overflow occurs when the sign of the operands is the same but the sign of the result is different. For n-bit numbers, overflow can be detected by the XOR of the carry-in and carry-out at the MSB or by the sign bits of the operands and result:
    • Overflow = x{n-1} y{n-1} s{n-1} + x{n-1} y{n-1} s{n-1}
    • Another common form: Overflow = cn ⊕ c{n-1}
  • For multi-bit adders, carry-lookahead logic reduces the time to compute carries, which in turn reduces the overall add time significantly.

Booleans, Gates, and Micro-Operation Timings

  • Inside arithmetic units, gate delays determine the overall latency of operations. For example, an n-bit ripple-carry adder has carry propagation delays that increase with n, whereas a carry-lookahead adder reduces these delays significantly (constant or near-constant for carries, depending on implementation).

Practical Multiplication and Division Timings

  • In multiplication, the time complexity is influenced by the number of partial products and the depth of the adder tree. Booth encoding reduces the number of partial products for certain operands.
  • In division, both restoring and non-restoring techniques require a number of iterations proportional to the number of bits in the dividend or divisor; digital circuits implement these steps with a control unit driving the sequence of shifts and adds/subtracts.

Floating Point: Guard Bits, Rounding, and Normalization (Further Details)

  • Guard bits are extra precision bits kept during intermediate steps to prevent loss of information when shifting mantissas to align exponents.
  • After the arithmetic operation, mantissas are normalized and rounded to the target precision (e.g., 23 bits for single precision).
  • Rounding to nearest may occasionally require a renormalization step if the rounding causes a carry that shifts the mantissa.
  • IEEE rounding rules are used (commonly round-to-nearest-even in many implementations).

Quick References and Terminology

  • MFC: Memory Function Completed, the signal indicating memory operation completion.
  • MAR: Memory Address Register, holds the address for memory access.
  • MDR: Memory Data Register, holds data to be written to or read from memory.
  • IR: Instruction Register, holds the current instruction to execute.
  • PC: Program Counter, points to the next instruction to fetch.
  • Y, Z, TEMP: internal temporary registers in datapath; not visible to software.
  • Riout / Rin: gates controlling whether a register places its value on the common bus or loads data from the bus.
  • μPC: microprogram counter in a microprogrammed control unit; reads microinstructions from control store.
  • Control store: the memory storing microinstructions (CW) for microprogrammed control.
  • Start address generator: provides the starting μPC value for a new machine instruction.
  • μPC: increments unless a branch or new instruction resets it.
  • Horizontal vs Vertical microarchitecture: different styles for encoding and decoding microinstructions.
  • Bit-ORing: a technique used to bypass certain microinstructions by combining signals at the encoding/branch level.
  • Next-Address Field: a microinstruction field that specifies the next microinstruction address, reducing explicit branching microinstructions.

Connections to Prior Concepts and Real-World Relevance

  • The single-bus vs multi-bus discussion mirrors modern CPU datapath design trade-offs between simplicity and throughput. Multi-bus designs enable higher instruction throughput by allowing concurrency among data transfers and ALU operations.
  • The distinction between hardwired and microprogrammed control highlights performance vs flexibility trade-offs. Hardwired control is fast but inflexible; microprogrammed control offers software-like flexibility for instruction set changes and extended feature support.
  • The use of carry-lookahead adders demonstrates a fundamental approach to improving adder speed, which is critical for instruction throughput in modern CPUs.
  • Floating point representations and IEEE formats are essential for scientific computing, graphics, and simulations, underscoring the need for robust handling of normalization, rounding, overflow/underflow, and special values (NaN, infinity).
  • Booth’s algorithm and the array/microarchitectural approaches to multiplication reflect practical strategies to optimize hardware efficiency for signed and unsigned arithmetic.
  • Division algorithms (restoring and non-restoring) illustrate the complexity of division in hardware and why hardware designers often prefer iterative or approximate methods depending on application domains.

Mathematical Highlights (LaTeX)

  • Carry Lookahead Adder (c{i+1} = Gi + Pi ci):
    c<em>i+1=G</em>i+P<em>ic</em>ic<em>{i+1} = G</em>i + P<em>i c</em>i
  • Generalized carry lookahead for an n-bit adder (block):
    c<em>i+1=G</em>i+P<em>iG</em>i1+P<em>iP</em>i1G<em>i2+..+P</em>iP<em>i1P</em>1G<em>0+P</em>iP<em>i1P</em>0c0c<em>{i+1} = G</em>i + P<em>i G</em>{i-1} + P<em>i P</em>{i-1} G<em>{i-2} + \,..\, + P</em>i P<em>{i-1} \dots P</em>1 G<em>0 + P</em>i P<em>{i-1} \dots P</em>0 c_0
  • IEEE single-precision floating point value (nonzero):
    x=(1)s×(1.m)×2(E127)x = (-1)^s \times (1.m) \times 2^{(E - 127)}
  • IEEE double-precision floating point value (nonzero):
    x=(1)s×(1.m)×2(E1023)x = (-1)^s \times (1.m) \times 2^{(E - 1023)}
  • Floating point normalization concept: the mantissa is shifted to position the most significant 1 in the leading bit, adjusting the exponent accordingly.

End of Notes