Latency
The time it takes for one instruction to finish.
Throughput
The number of instructions completed per unit time.
Increasing CPU performance methods
Increase clock frequency, add multiple cores, use pipelining, or add dedicated circuitry.
Problem with increasing clock frequency
Power wall — higher frequency leads to more power consumption and heat.
Problem with multiple cores
Requires parallel programming to be effective.
Problem with pipelining
Limited throughput improvement due to dependencies.
Problem with dedicated circuitry
Only improves one specific operation (one-trick pony).
Pipelined execution
Technique that overlaps instruction execution to increase throughput.
4-stage pipeline stages
Fetch → Decode → Execute → Write result.
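A minimal sketch (assuming a single-issue pipeline, one stage per cycle, and no stalls; the numbers n = 10 and k = 4 are arbitrary) of how overlapping stages changes the total cycle count:

```c
#include <stdio.h>

/* Sketch: cycle counts for n instructions on a k-stage pipeline,
   assuming one stage per cycle and no stalls (idealized). */
int main(void) {
    int k = 4;   /* stages: Fetch, Decode, Execute, Write result */
    int n = 10;  /* number of instructions (arbitrary choice)    */

    int sequential = n * k;      /* each instruction runs start to finish alone */
    int pipelined  = k + n - 1;  /* first result after k cycles, then 1 per cycle */

    printf("sequential: %d cycles\n", sequential);  /* 40 */
    printf("pipelined:  %d cycles\n", pipelined);   /* 13 */
    return 0;
}
```

Latency per instruction is unchanged (still k cycles); only throughput improves.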
Interstage buffers
Buffers that separate pipeline stages to allow independent operation (e.g., RA, RB, RM, RY, RZ).
B1 Fetch-decode buffer
Holds instruction.
B2 Decode-execute buffer
Holds source operands, operation, destination operand.
B3 Execute-write buffer
Holds result and destination operand.
Pipeline logic
Transfers data, operand specifiers, and control signals between stages.
Pipeline hazards
Events that cause stalls in the pipeline.
Common pipeline hazards
Cache miss, data dependency, branching, long operations (e.g., division).
Data dependency
When one instruction depends on the result of a previous one.
Pipeline stall due to dependency
CPU delays the dependent instruction until the required data is available.
Data forwarding
Sends ALU result directly to next instruction input, bypassing register file to reduce stalls.
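A small illustration (variables stand in for registers; the names are hypothetical, not from the cards) of the read-after-write dependency that forwarding resolves:

```c
/* Illustration only: a RAW (read-after-write) dependency. */
int raw_example(int a, int b) {
    int r1 = a + b;   /* result leaves the ALU in the Execute stage        */
    int r2 = r1 * 2;  /* consumes r1 right away: without forwarding, the
                         pipeline stalls until r1 reaches the register
                         file; with forwarding, the ALU output of the
                         first instruction is routed directly into this
                         instruction's ALU input, avoiding the stall.      */
    return r2;
}
```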
Branching
Alters the normal sequence of instruction execution.
Unconditional branching
Always jumps to a new program location (e.g., jmp .loop).
Unconditional branch penalty
1 cycle (target computed in decode stage).
Conditional branching
Jumps to a new location only if a condition is true.
Conditional branch penalty
2 cycles (condition computed in execute stage).
Reducing branch penalty
Move the comparator to the decode stage, reducing the penalty to 1 cycle.
Branch prediction
Technique to predict the outcome of branch instructions to minimize pipeline stalls.
Static branch prediction
Based on fixed rules; assumes backward branches are taken, forward branches are not.
Static prediction accuracy
Around 80%.
Dynamic branch prediction
Uses previous execution outcomes to predict next branch.
Dynamic prediction states
LT (likely taken), LNT (likely not taken).
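A sketch of the common 2-bit saturating-counter version of this idea (an assumption extending the card: states 0-1 predict not taken, 2-3 predict taken; the middle states correspond to LT/LNT):

```c
#include <stdbool.h>

typedef unsigned char counter2;  /* holds a state in 0..3 */

/* States 2..3 predict taken; 0..1 predict not taken. */
bool predict_taken(counter2 c) { return c >= 2; }

/* Update on the actual outcome: one mispredict in a stable pattern
   nudges the state but does not immediately flip the prediction. */
counter2 update(counter2 c, bool taken) {
    if (taken)  { if (c < 3) c++; }  /* move toward strongly taken     */
    else        { if (c > 0) c--; }  /* move toward strongly not taken */
    return c;
}
```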
Branch target buffer (BTB)
Hardware table storing addresses and outcomes of recent branch instructions.
Purpose of BTB
Identifies branches and predicts target addresses for faster fetching.
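A hypothetical layout of one BTB entry (field names are assumptions, not a real design): the fetch stage looks up the current PC, and on a hit it can fetch from the predicted target in the next cycle instead of waiting for decode.

```c
#include <stdbool.h>
#include <stdint.h>

struct btb_entry {
    uint64_t branch_pc;  /* address of the branch instruction      */
    uint64_t target_pc;  /* predicted branch target address        */
    uint8_t  state;      /* recent outcome / prediction state bits */
    bool     valid;      /* entry currently in use                 */
};
```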
Pipeline performance formula variables
N = number of instructions, S = average cycles per instruction (stalls included), R = clock rate; execution time T = N * S / R, throughput = R / S.
Branch misprediction penalty formula
Sbranch = Branch_proportion * (1 - accuracy) * penalty.
Cache miss penalty formula
Smiss = (Fetch_miss_rate + Load_store_fraction * Data_miss_rate) * Miss_penalty.
Total cycles per instruction
S = 1 + Smiss + Sbranch.
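A worked example plugging assumed numbers (not from the cards) into the three formulas above:

```c
#include <stdio.h>

/* Worked example with assumed rates, using
   S = 1 + Smiss + Sbranch and throughput = R / S. */
int main(void) {
    double fetch_miss   = 0.05;  /* instruction-fetch miss rate     */
    double ls_fraction  = 0.30;  /* fraction of loads/stores        */
    double data_miss    = 0.10;  /* data miss rate for loads/stores */
    double miss_penalty = 17.0;  /* cycles per cache miss           */

    double branch_frac       = 0.20;  /* fraction of branches       */
    double accuracy          = 0.90;  /* prediction accuracy        */
    double mispredict_penalty = 2.0;  /* cycles per misprediction   */

    double s_miss   = (fetch_miss + ls_fraction * data_miss) * miss_penalty;
    double s_branch = branch_frac * (1.0 - accuracy) * mispredict_penalty;
    double S        = 1.0 + s_miss + s_branch;  /* cycles/instruction */

    double R = 1e9;                             /* 1 GHz clock        */
    printf("S = %.2f cycles/instruction\n", S); /* 1 + 1.36 + 0.04 = 2.40 */
    printf("throughput = %.2e instructions/s\n", R / S);
    return 0;
}
```

Note how cache misses dominate here: even a modest miss rate costs far more than branch mispredictions.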
Effect of n-stage pipeline
Ideally multiplies throughput by n, but too many stages increase stalls (more hazards and larger penalties per stall).
Superscalar processor
CPU with multiple parallel execution units, each possibly pipelined.
Goal of superscalar architecture
Increase instruction throughput via parallelism.
Superscalar operation steps
Fetch multiple instructions → Queue them → Dispatch to multiple execution units.
Types of execution units
Arithmetic unit, load/store unit, write unit.
Execution types
Parallel execution, in-order issue, out-of-order execution.
Program-order completion
Results are committed in program order (in-order retirement), even if instructions execute out of order internally.
Speculative execution
Executes instructions before knowing whether they are needed (e.g., past a predicted branch); results are rolled back if the speculation was wrong.
Speculative execution use
Works with branch prediction to minimize stalls.
Runahead execution
Keeps executing instructions during a long stall (e.g., a cache miss), mainly to prefetch data; the speculative results are discarded.
Register renaming
Eliminates false dependencies (WAR, WAW) by mapping architectural registers onto distinct physical registers.
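An illustration (variables stand in for architectural registers; names are hypothetical) of a false dependency that renaming removes:

```c
/* Illustration only: r1 is reused, creating WAR/WAW hazards. */
int renaming_example(int a, int b, int c, int d) {
    int r1 = a + b;  /* writes r1                                     */
    int r2 = r1 + c; /* reads r1: a true RAW dependency, must wait    */
    r1 = c + d;      /* reuses r1: WAR with the read above, WAW with
                        the first write. With renaming, this write
                        gets its own physical register, so it can
                        execute in parallel with the line above.      */
    return r2 + r1;
}
```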
Simultaneous Multithreading (SMT) / Hyper-Threading
Allows multiple hardware threads per core to increase utilization.
Processor structure
Contains multiple cores (1-192); each core may support multiple hardware threads (1-4).
Parallelization
Increases throughput and reduces program runtime by running tasks simultaneously.
Amdahl's Law formula
Ps = Ts * (Fs + Fp / p), where Fs = sequential fraction, Fp = parallel fraction (Fs + Fp = 1), and p = number of processors.
Speedup formula
S = Ts / Ps = 1 / (Fs + Fp / p).
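A worked example with assumed numbers (Fs = 0.1, Fp = 0.9; not from the cards) plugging into the two formulas above:

```c
#include <stdio.h>

/* Amdahl's Law: Ps = Ts * (Fs + Fp / p), speedup S = Ts / Ps. */
int main(void) {
    double Ts = 100.0;  /* sequential runtime (arbitrary units) */
    double Fs = 0.10;   /* sequential fraction                  */
    double Fp = 0.90;   /* parallel fraction (Fs + Fp = 1)      */

    for (int p = 1; p <= 64; p *= 2) {
        double Ps = Ts * (Fs + Fp / p);
        printf("p = %2d: Ps = %6.2f, speedup = %.2f\n", p, Ps, Ts / Ps);
    }
    /* As p grows, speedup approaches 1 / Fs = 10: the sequential
       fraction bounds the achievable speedup. */
    return 0;
}
```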
Flynn's taxonomy
Classification of computer architectures by instruction and data streams.
SISD
Single Instruction, Single Data — conventional sequential processor.
SIMD
Single Instruction, Multiple Data — same instruction operates on multiple data streams.
MIMD
Multiple Instruction, Multiple Data — multiple instruction and data streams.
MISD
Multiple Instruction, Single Data — rarely used architecture.
Unified memory architecture
All processors share a single memory space.
Distributed memory architecture
Each processor has its own private memory.