Unit-4: Basic Processing Unit, Control Strategies, and Pipelining – Comprehensive Notes

Instruction-Set Processor (ISP) and CPU Overview

Computers break down tasks into tiny steps called machine instructions, which make up a program.
Each machine instruction is run by even simpler steps, called micro-operations, handled by the processor’s main parts (datapath) and its control unit.
An Instruction-Set Processor (ISP) executes programs written in its own instruction set: it fetches one machine instruction at a time and performs the operation that instruction specifies.

Role of the Control Unit

The control unit is like the CPU's brain, telling all the other parts what to do and when. It reads the current instruction, then sends out specific signals that control data movement, tell the math unit (ALU) what calculation to do, start memory actions, and update the program counter.

Fundamental Concepts of Instruction Execution

Fetch–Execute Cycle
1. Fetch: Get the instruction from the memory location the Program Counter (PC) is pointing to, and put it into the Instruction Register (IR): $IR \leftarrow [[PC]]$
2. Increment PC: Move the PC address to point to the next instruction (for memory where each instruction takes 4 bytes): $PC \leftarrow [PC] + 4$
3. Decode and Execute: Figure out what the instruction in the IR means, and then perform the small steps (micro-operations) needed to carry it out.
Key architectural state
• Program Counter (PC) – Stores the address of the next instruction to fetch.
• Instruction Register (IR) – Holds the instruction currently being executed.
• Condition-code flags – Special bits that remember results from calculations (like if a number was zero, positive, or caused an overflow) and affect how 'branch' instructions work.
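
The fetch–execute cycle above can be sketched as a small loop. This is only an illustration: the tuple-based instruction encoding and the opcodes `LOADI`, `ADD`, and `HALT` are hypothetical, chosen to keep the example short, and operands are written source first, destination last.

```python
# Toy fetch-execute loop. The instruction encoding is hypothetical.
def run(memory, pc=0, max_steps=100):
    """Run until HALT; return (registers, final PC)."""
    regs = {}
    for _ in range(max_steps):
        ir = memory[pc]          # 1. Fetch: IR <- [[PC]]
        pc += 4                  # 2. Increment PC: PC <- [PC] + 4
        op, *args = ir           # 3. Decode ...
        if op == "HALT":         # ... and execute
            break
        if op == "LOADI":        # LOADI value, Rdst
            value, dst = args
            regs[dst] = value
        elif op == "ADD":        # ADD Rsrc, Rdst  (Rdst <- Rdst + Rsrc)
            src, dst = args
            regs[dst] = regs[dst] + regs[src]
        else:
            raise ValueError(f"unknown opcode {op!r}")
    return regs, pc

program = {
    0: ("LOADI", 5, "R1"),
    4: ("LOADI", 7, "R2"),
    8: ("ADD", "R1", "R2"),
    12: ("HALT",),
}
regs, pc = run(program)
```

Note that the PC is already advanced by the time the instruction executes, which is why branch offsets are measured from the updated PC.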

Single-Bus Datapath Organization

Uses one shared internal pathway (a tri-state bus) to connect all the registers, the ALU, and the memory connection parts (MAR, MDR).
This design is simpler and cheaper, but it means only one piece of data can move at a time, making the steps for an instruction take longer.

Main Components

General-Purpose Registers $R0 \dots R(n-1)$ – These are the storage spots visible to the programmer.
Special Registers – Like the PC, stack pointer, and index registers, which have specific jobs.
Temporary Registers – Y, Z, TEMP are hidden storage used only by the hardware during complicated operations.
Memory Address Register (MAR) – Holds the address of the memory location we want to access.
Memory Data Register (MDR) – Acts as a two-way buffer, temporarily holding data going to or from memory.
ALU – The Arithmetic Logic Unit does calculations. It takes one input from a selector (either the number $4$ or from register Y) and another directly from the main bus. Its result goes into register Z.
Multiplexer before ALU – A switch that chooses between register Y or the constant number $4$ (often used to increase the PC).
Driver/Receiver circuits – Bridges that connect the internal parts of the processor to the external memory system.

Control of Register Transfers

Each register $Ri$ has two control signals:
• $Ri^{in}$: Tells the register to load data from the bus when the clock ticks.
• $Ri^{out}$: Tells the register to put its contents onto the bus.
All processor actions happen in sync with clock signals; sometimes, multiple clock signals are needed if certain types of storage (latches) are used.

Example – Copy $R1 \rightarrow R4$

Turn on $R1^{out}=1$ (data from R1 goes onto the bus).
Turn on $R4^{in}=1$ (data from the bus is stored into R4). The control unit turns on $R1^{out}$ and $R4^{in}$ at the same time to make the transfer happen.

Example – Addition $R3 \leftarrow R1 + R2$

Turn on $R1^{out}$ and $Y^{in}$ — this moves the first number from R1 into Y.
Turn on $R2^{out}$ , select Y as ALU input (SelectY), tell ALU to Add, and turn on $Z^{in}$ — this moves the second number from R2 to the bus, does the addition, and puts the result into Z.
Turn on $Z^{out}$ and $R3^{in}$ — this moves the final result from Z into R3.
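
The three steps above can be traced with a tiny model of the single bus. This is a sketch, not a hardware description: each step is reduced to a hypothetical `(out_register, in_register, alu_op)` triple, and only the `Add` operation with `SelectY` is modeled.

```python
def run_steps(regs, steps):
    """Apply single-bus control steps of the form (out_reg, in_reg, alu_op)."""
    for out_reg, in_reg, alu_op in steps:
        bus = regs[out_reg]            # exactly one register drives the bus
        if alu_op == "Add":            # SelectY: the ALU adds Y to the bus value
            value = regs["Y"] + bus
        else:                          # plain transfer, the ALU is not involved
            value = bus
        regs[in_reg] = value           # the register with its "in" signal on loads
    return regs

regs = {"R1": 10, "R2": 32, "R3": 0, "Y": 0, "Z": 0}
steps = [
    ("R1", "Y", None),     # R1_out, Y_in
    ("R2", "Z", "Add"),    # R2_out, SelectY, Add, Z_in
    ("Z", "R3", None),     # Z_out, R3_in
]
run_steps(regs, steps)
```

Three steps are needed because the bus carries only one value per step, so Y and Z have to hold the first operand and the result.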

Memory Read Sequence (single-bus)

Instruction: Move (R1), R2 (Reads data from memory address in R1 and puts it into R2)

$R1^{out}, MAR^{in}, Read$ (Put R1's content into MAR, start a memory read).
Wait for Memory-Function-Complete (WMFC). (Wait until memory is done).
$MDR^{out}, R2^{in}$ (Move data from MDR to R2).

Memory Write Sequence

Instruction: Move R2, (R1) (Writes data from R2 to memory address in R1)

$R1^{out}, MAR^{in}$ (Put R1's content into MAR).
$R2^{out}, MDR^{in}, Write$ (Put R2's content into MDR, start a memory write).
WMFC. (Wait until the data has moved from MDR to memory).
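
Both memory sequences can be mimicked in a few lines. In this sketch memory is a plain dictionary and WMFC is modeled as an immediate transfer, which hides the real waiting time. The class and method names are hypothetical.

```python
class SingleBusCPU:
    """Minimal model of MAR/MDR-based memory access (no timing)."""

    def __init__(self, memory):
        self.memory = memory
        self.regs = {"R1": 0, "R2": 0, "MAR": 0, "MDR": 0}

    def move_mem_to_reg(self, addr_reg, dst):
        """Move (addr_reg), dst"""
        self.regs["MAR"] = self.regs[addr_reg]             # addr_reg_out, MAR_in, Read
        self.regs["MDR"] = self.memory[self.regs["MAR"]]   # WMFC: memory fills MDR
        self.regs[dst] = self.regs["MDR"]                  # MDR_out, dst_in

    def move_reg_to_mem(self, src, addr_reg):
        """Move src, (addr_reg)"""
        self.regs["MAR"] = self.regs[addr_reg]             # addr_reg_out, MAR_in
        self.regs["MDR"] = self.regs[src]                  # src_out, MDR_in, Write
        self.memory[self.regs["MAR"]] = self.regs["MDR"]   # WMFC: memory takes MDR

cpu = SingleBusCPU({100: 7, 200: 0})
cpu.regs["R1"] = 100
cpu.move_mem_to_reg("R1", "R2")    # Move (R1), R2
cpu.regs["R1"] = 200
cpu.move_reg_to_mem("R2", "R1")    # Move R2, (R1)
```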

Execution of a Complete Example Instruction

ADD (R3), R1 (add memory content at address R3 to R1)

Key points of the step-by-step micro-operation sequence:

The instruction goes through a sequence: fetch, decode, get the numbers, do the math, store the result.
The constant $4$ is used in the selector before the ALU to update the PC during the fetch phase.
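
One conventional seven-step control sequence for this instruction is sketched below. It is an assumption in the sense that these notes do not list the steps themselves, but it agrees with the hardwired-control equations later in the notes ($Z^{in}$ active in T1 and T6, END in T7). Memory is a dictionary and WMFC is modeled as an immediate read.

```python
def add_indirect(r, memory):
    """ADD (R3), R1 on the single-bus datapath; r is the register dict."""
    # T1: PC_out, MAR_in, Read, Select4, Add, Z_in
    r["MAR"] = r["PC"]
    r["Z"] = r["PC"] + 4
    # T2: Z_out, PC_in, Y_in, WMFC
    r["PC"] = r["Z"]
    r["Y"] = r["Z"]
    r["MDR"] = memory[r["MAR"]]
    # T3: MDR_out, IR_in
    r["IR"] = r["MDR"]
    # T4: R3_out, MAR_in, Read
    r["MAR"] = r["R3"]
    # T5: R1_out, Y_in, WMFC
    r["Y"] = r["R1"]
    r["MDR"] = memory[r["MAR"]]
    # T6: MDR_out, SelectY, Add, Z_in
    r["Z"] = r["Y"] + r["MDR"]
    # T7: Z_out, R1_in, End
    r["R1"] = r["Z"]
    return r

memory = {0: "ADD (R3), R1", 400: 37}
regs = {"PC": 0, "R1": 5, "R3": 400, "MAR": 0, "MDR": 0, "IR": None, "Y": 0, "Z": 0}
add_indirect(regs, memory)
```

Steps T1 to T3 are the fetch phase shared by every instruction. Only T4 to T7 are specific to this ADD.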

Branch Instruction Execution (Single-Bus)

A taken branch replaces the PC with the branch target address, which is computed as the updated PC plus an offset stored in the instruction.
The offset is usually calculated as target address – (PC after the branch instruction). This makes the target relative to the instruction following the branch.
For an unconditional branch (always jump), the new PC is calculated early using the ALU.
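
A short worked example of the offset arithmetic, assuming 4-byte instructions as in the fetch step. The addresses are made up for illustration.

```python
INSTRUCTION_BYTES = 4  # assumed instruction size, matching PC <- [PC] + 4

def branch_offset(branch_addr, target_addr):
    """Offset stored in the branch: target minus the already-updated PC."""
    updated_pc = branch_addr + INSTRUCTION_BYTES
    return target_addr - updated_pc

def branch_target(branch_addr, offset):
    """What the ALU computes when the branch is taken."""
    return branch_addr + INSTRUCTION_BYTES + offset

forward = branch_offset(1000, 1040)    # branch at 1000 jumping ahead to 1040
backward = branch_offset(1000, 984)    # branch at 1000 jumping back to 984
```

Here `forward` is 36 and `backward` is -20, so a backward jump (a loop) is stored as a negative offset.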

Multi-Bus (Three-Bus) Datapath Organization

This setup has three internal pathways (buses A and B for source operands, and C for the result), plus a dedicated incrementer for the PC. Two operands can be read and one result written in the same clock cycle.
It uses a Register File that can read from two registers (A, B) and write to one (C) at the same time.
This design removes the need for temporary registers Y and Z, which were used in the single-bus system.
The ALU can also pass numbers through without changing them (like just sending A or B out) for simple moves.
The constant $4$ is still available for adding to addresses.

Example – Addition $R6 \leftarrow R4 + R5$ (multibus)

$PC^{out}\,(B), MAR^{in}, Read, IncPC$ – The PC value goes to bus B, that value goes into MAR, a memory read starts, and the PC increases, all at once.
WMFC. (Wait for memory to finish).
$MDR^{out}\,(B), IR^{in}$ . (Data from memory (MDR) goes to bus B, then into the Instruction Register (IR)).
$R4^{out}\,(A), R5^{out}\,(B), Add, R6^{in}$ — R4 goes to bus A, R5 to bus B, the ALU adds them, and the result is stored in R6, all in a single clock cycle.
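
The execute step of this example can be contrasted with the single-bus version in a few lines. This is a sketch with a hypothetical helper name. The step counts are taken from the two addition examples in these notes (three register-transfer steps on one bus, one step on three buses).

```python
def three_bus_add(regs, src_a, src_b, dst):
    """One clock cycle: two register-file reads, ALU add, one write."""
    bus_a = regs[src_a]       # read port A drives bus A
    bus_b = regs[src_b]       # read port B drives bus B
    bus_c = bus_a + bus_b     # ALU output drives bus C
    regs[dst] = bus_c         # write port loads the destination
    return regs

SINGLE_BUS_ADD_STEPS = 3      # via Y and Z, one bus transfer per step
THREE_BUS_ADD_STEPS = 1       # no temporary registers needed

regs = three_bus_add({"R4": 3, "R5": 4, "R6": 0}, "R4", "R5", "R6")
```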

Control Unit Strategies

Hardwired Control

How it works: The control signals are created directly by physical logic circuits (like gates and switches) that are permanently wired together. It's like a complex machine where every action is a direct result of its fixed connections.
Control signals are made by logic based on:
• The current step (like $T1, T2, \text{etc.}$).
• The instruction's operation code (from the IR).
• The flag bits (like Z, C, N, V).
• Outside inputs (like 'Memory-Function-Complete').
This is designed as a finite-state machine; it's very fast but cannot be easily changed.
Advantages:
- Speed: Super fast because signals are generated directly by hardware.
- Efficiency: Can be very hardware-efficient for simpler instruction sets.
Disadvantages:
- Complexity: Extremely hard to design and fix for complex instruction sets.
- Inflexibility: Any change to instructions or how they run means completely re-designing the hardware. It's rigid and hard to update.
- Design time: Takes a long time to design complex CPUs this way.
Example logic equation: $Z^{in} = T1 + (T6 \cdot ADD) + (T4 \cdot BR) + \dots$, where $+$ means OR and $\cdot$ means AND (meaning, the $Z^{in}$ signal is turned on if it's step $T1$, OR if it's step $T6$ and the instruction is ADD, OR if it's step $T4$ and the instruction is BR).
An END signal resets the step counter to start fetching the next instruction: $END = (T7 \cdot ADD) + (T5 \cdot BR) + \dots$
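
The two equations above translate directly into boolean functions of the step counter and the decoded opcode. The sketch covers only the terms that are written out. The terms hidden behind the dots are not modeled, and the function names are hypothetical.

```python
def z_in(step, opcode):
    """Z_in = T1 + T6.ADD + T4.BR (only the terms listed in the notes)."""
    return (step == 1
            or (step == 6 and opcode == "ADD")
            or (step == 4 and opcode == "BR"))

def end(step, opcode):
    """END = T7.ADD + T5.BR (only the terms listed in the notes)."""
    return (step == 7 and opcode == "ADD") or (step == 5 and opcode == "BR")

def steps_until_end(opcode, max_steps=16):
    """Count control steps until END resets the step counter."""
    for step in range(1, max_steps + 1):
        if end(step, opcode):
            return step
    raise ValueError(f"no END term for opcode {opcode!r}")
```

In real hardware each function is a block of gates evaluated in parallel every clock cycle. Adding an instruction means adding terms to many such equations, which is where the inflexibility comes from.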

Microprogrammed Control

How it works: The control signals are stored as codes (called Control Words, CWs) in a special memory called the Control Store. Each CW is a microinstruction, and the sequence of microinstructions that carries out one machine instruction is called a microroutine.
A special counter (µPC - microprogram counter) steps through this control memory.
A Starting & Branch Address Generator decides where the µPC should go next, based on the instruction's operation code, condition flags, and external events. This lets the microcode itself branch or loop, for example to wait for memory (WMFC) or to check flags.
The size of microinstructions can be made smaller by field encoding: related control signals are grouped and given binary codes (e.g., a 4-bit code can represent 16 ALU operations).
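
A quick calculation shows how much field encoding can save. The signal groups below are hypothetical, and the sketch assumes that at most one signal per group is active in any microinstruction, which is what makes encoding a group as a binary code possible.

```python
def unencoded_bits(group_sizes):
    """One bit per control signal."""
    return sum(group_sizes)

def encoded_bits(group_sizes):
    """One binary field per group, wide enough to name each signal in it."""
    return sum(max(1, (n - 1).bit_length()) for n in group_sizes)

# Hypothetical grouping: 16 ALU operations, 8 register-out and 8 register-in signals.
groups = [16, 8, 8]
wide = unencoded_bits(groups)
narrow = encoded_bits(groups)
```

With these groups the microinstruction shrinks from 32 bits to 10 bits (4 + 3 + 3), at the cost of a decoder per field.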
Advantages:
- Flexibility: Easier to design, find errors, and change. If instructions need updates or bugs need fixing, you just change the code in the control store (like a software update).
- Modularity: Complex instructions can be broken down into simpler, manageable sets of micro-operations, making design easier.
- Cost-effective: Can be cheaper for very complex instruction sets because it uses less complex physical wiring and more memory.
Disadvantages:
- Slower: Generally not as fast as hardwired control because it takes extra time to read microinstructions from memory.
- More hardware: Requires extra parts like the microprogram counter, the control store memory, and address generating logic.
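
The microprogrammed scheme described in this section can be sketched as a lookup-and-step loop. The control store contents assume the conventional seven-step sequence for ADD (R3), R1, which these notes do not spell out, and the starting-address table is hypothetical.

```python
CONTROL_STORE = [
    # addresses 0-2: fetch microroutine shared by all instructions
    {"PC_out", "MAR_in", "Read", "Select4", "Add", "Z_in"},
    {"Z_out", "PC_in", "Y_in", "WMFC"},
    {"MDR_out", "IR_in"},
    # addresses 3-6: microroutine for ADD (R3), R1
    {"R3_out", "MAR_in", "Read"},
    {"R1_out", "Y_in", "WMFC"},
    {"MDR_out", "SelectY", "Add", "Z_in"},
    {"Z_out", "R1_in", "End"},
]
START_ADDRESS = {"ADD": 3}   # starting-address generator, reduced to a table

def run_microroutine(opcode):
    """Return the sequence of control-store addresses the uPC visits."""
    trace = []
    upc = 0
    while True:
        control_word = CONTROL_STORE[upc]
        trace.append(upc)
        if "End" in control_word:       # microroutine finished
            break
        if "IR_in" in control_word:     # instruction just loaded: branch on opcode
            upc = START_ADDRESS[opcode]
        else:
            upc += 1                    # default: next microinstruction
    return trace

trace = run_microroutine("ADD")
```

Changing how an instruction behaves means editing `CONTROL_STORE`, not the loop, which mirrors the flexibility advantage listed above.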

Pipelining Fundamentals

Goal: To increase how many instructions a processor can handle per second by doing parts of multiple instructions at the same time.
Instruction-level parallelism (ILP): Overlapping the execution of instructions, for example by starting the next instruction before the current one is completely finished.

Two-Stage Pipeline (Fetch & Execute)

Has separate units for 'fetching' instructions and 'executing' them, with a buffer ($B_1$) in between.
After an initial warm-up time, one instruction finishes every clock cycle.

Four-Stage Pipeline (F, D, E, W)

Uses four dedicated hardware units: Fetch, Decode (and get numbers), Execute, and Write-back. Buffers ($B_1$, $B_2$, $B_3$) hold intermediate results between stages.
Ideally, the speed gain is equal to the number of stages (i.e., $S = n$ ), assuming each stage takes the same amount of time and there are no delays.
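
To see why the speedup only approaches $n$, count cycles. The sketch assumes every stage takes one cycle, an unpipelined instruction takes $n$ cycles, and there are no stalls. The first instruction needs $n$ cycles to fill the pipeline, and each later one finishes one cycle after its predecessor.

```python
def pipelined_cycles(n_stages, n_instructions):
    """Fill the pipeline once, then complete one instruction per cycle."""
    return n_stages + n_instructions - 1

def ideal_speedup(n_stages, n_instructions):
    """Sequential time (n cycles per instruction) over pipelined time."""
    sequential = n_stages * n_instructions
    return sequential / pipelined_cycles(n_stages, n_instructions)

short_run = ideal_speedup(4, 4)        # 16 / 7
long_run = ideal_speedup(4, 10**6)     # close to, but below, 4
```

For a single instruction there is no gain at all, and the speedup climbs toward the stage count as the instruction stream gets longer.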

Role of Cache

To keep the pipeline running smoothly, instructions and data must be available very quickly (in one clock cycle per stage).
On-chip caches (small, fast memories close to the CPU) can usually supply instructions and data within a single clock cycle, so the pipeline only has to wait for memory on a cache miss.

Pipeline Performance Metrics

Throughput (IPT): The rate at which instructions are completed, which increases to about one instruction per clock cycle after the pipeline is full.
Speedup with a perfect pipeline: $\text{Speedup} = \frac{\text{Time for sequential execution}}{\text{Time for pipeline execution}} \approx n$
Real-world speedup is reduced by pipeline hazards, which cause empty slots (bubbles) or delays (stalls).
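
The effect of stalls on these metrics can be estimated by adding the stall cycles to the cycle count. The numbers below are invented for illustration, and the model assumes one-cycle stages and an unpipelined cost of $n$ cycles per instruction.

```python
def pipeline_stats(n_stages, n_instructions, stall_cycles):
    """Return (total cycles, instructions per cycle, speedup over unpipelined)."""
    cycles = n_stages + n_instructions - 1 + stall_cycles
    ipc = n_instructions / cycles
    speedup = (n_stages * n_instructions) / cycles
    return cycles, ipc, speedup

ideal = pipeline_stats(4, 100, 0)      # no hazards
stalled = pipeline_stats(4, 100, 25)   # 25 bubble cycles in total
```

In this example 25 stall cycles over 100 instructions pull the throughput down to about 0.78 instructions per cycle and the speedup from roughly 3.9 to about 3.1.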

Pipeline Hazards

Pipeline hazards are situations where an instruction in the pipeline can't move forward in its scheduled clock cycle, leading to a delay or a 'stall'. There are three main types:

Structural Hazards – occur when two or more instructions try to use the same piece of hardware at the same time (e.g., one memory unit for both instructions and data).
- Resolution: Either make an instruction wait (stall) or duplicate the conflicting hardware.
Data Hazards – happen when instructions depend on the data produced by a previous instruction that hasn't finished yet. This creates a problem if an instruction tries to read data before another instruction has written it.
- RAW (Read After Write) is the most common data hazard: an instruction tries to read a register/memory location before a previous instruction has written to it.
- Mitigation: Often handled by forwarding or bypassing, which sends the result of an operation directly to where it's needed in an earlier pipeline stage, instead of waiting for it to be written back to a register. Even with forwarding, a 'load-use' data hazard (when a value is loaded from memory and then immediately used by the very next instruction) often causes a one-cycle delay.
Control (Instruction) Hazards – occur when a branch instruction (a jump) or an interrupt changes the Program Counter. Until the CPU knows where the jump goes, it might fetch the wrong instructions.
- Branch penalty = The number of cycles wasted when the branch direction is guessed incorrectly, or when the pipeline has to wait for the branch to be resolved.

Example Load–Use Hazard

If instruction 1 is LD 0(R2), R1 (Load data from memory at R2+0 into R1) and instruction 2 immediately after it is DSUB R4,R1,R5 (Subtract content of R1 from R4 and put into R5), this needs a stall. Why? Because DSUB needs the value in R1 that LD is supposed to produce, but LD only makes that value available after its 'Memory access' (MEM) stage, which is usually later in the pipeline than when DSUB would need it for its 'Execute' (EX) stage.
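
The stall decision in this example can be written as a small check. The sketch assumes a five-stage pipeline with full forwarding, so only a load followed immediately by a consumer of its result costs a cycle. Instructions are hypothetical dictionaries using the source-first, destination-last operand order of these notes.

```python
def stall_cycles(producer, consumer):
    """Stalls between two back-to-back instructions, assuming full forwarding."""
    if producer["dst"] not in consumer["srcs"]:
        return 0                  # no RAW dependence
    if producer["op"] == "LD":
        return 1                  # load-use: value only exists after MEM
    return 0                      # ALU result is forwarded straight to EX

ld = {"op": "LD", "srcs": ["R2"], "dst": "R1"}            # LD 0(R2), R1
dsub = {"op": "DSUB", "srcs": ["R4", "R1"], "dst": "R5"}  # DSUB R4, R1, R5
dadd = {"op": "DADD", "srcs": ["R6", "R7"], "dst": "R1"}  # ALU producer of R1
other = {"op": "DSUB", "srcs": ["R8", "R9"], "dst": "R10"}
```

A compiler can often hide the one-cycle stall by moving an independent instruction, such as `other`, between the load and its consumer.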

Techniques to Mitigate Hazards

Forwarding – Sending the results from the ALU or memory directly to an earlier pipeline stage that needs them, instead of waiting for them to be written into registers.
Stalling (Bubble insertion) – Holding an instruction (and any instructions behind it) in the early stages of the pipeline while a 'bubble' (a do-nothing cycle) moves through the later stages, giving the instruction it depends on time to produce its result.
Instruction scheduling (compiler) – The compiler rearranges instructions in the code so that waiting times (delay slots) are filled with useful operations.
Delayed Branch – A design where the instruction immediately following a branch (in the 'delay slot') is always executed, even if the branch is taken. The compiler tries to put a useful instruction there.
Branch Prediction – Guessing which way a branch will go (taken or not taken) and pre-fetching instructions along the guessed path. If the guess is wrong, the pre-fetched instructions are discarded (flushed).
• Static prediction (done by the compiler) vs. Dynamic prediction (done by hardware using historical data).

Two-State Predictor

States: LT (likely taken) / LNT (likely not taken).

If a branch is actually taken, the predictor moves to (or stays in) the 'likely taken' state; if it is not taken, the predictor moves to (or stays in) 'likely not taken'.
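
A minimal sketch of this predictor, assuming the usual reading that the LT state predicts "taken" and the LNT state predicts "not taken". The class name, the starting state, and the loop-shaped outcome sequence are made up for illustration.

```python
class TwoStatePredictor:
    """One bit of history per branch: remember what the branch did last time."""

    def __init__(self, state="LNT"):
        self.state = state              # "LT" or "LNT"

    def predict(self):
        return self.state == "LT"       # True means "predict taken"

    def update(self, taken):
        self.state = "LT" if taken else "LNT"

def count_mispredictions(outcomes, predictor):
    """Run the predictor over actual branch outcomes and count wrong guesses."""
    misses = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            misses += 1
        predictor.update(taken)
    return misses

# A loop branch: taken three times, then falls through, executed twice.
outcomes = [True, True, True, False] * 2
misses = count_mispredictions(outcomes, TwoStatePredictor())
```

With this sequence the predictor is wrong twice per pass through the loop (once on loop exit and once on re-entry), which is the known weakness of a single-bit scheme.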
It predicts