1/34
Covers Branch Prediction, Measurement (Amdahl's Law, processor performance equation), Dynamic Pipelines (Tomasulo's Algorithm), Hardware Based Speculation, Memory Hierarchy, Cache (Tag, Index, Block offset, Hit time, Miss time, Miss penalty), Temporal Locality, Spatial Locality
Name | Mastery | Learn | Test | Matching | Spaced | Call with Kai |
|---|
No analytics yet
Send a link to your students to track their progress
Amdahl’s Law for finding New Execution Time
ExTime old * [(1-Fractionenhanced)+(Fraction enhanced / Speedup enhanced)]
Amdahl’s Law for finding Speedup Overall
ExTimeold / ExTimenew = 1 / [(1-Fractionenhanced) + (Fractionenhanced/ Speedupenhanced)]
Best you can get from Amdahl’s law
Speedupmaximum = 1 / (1 - Fractionenhanced)
Processor Performance Equation
CPU time = seconds / program = [(instructions / program) (cycles / instruction) * (seconds / cycle)]
Measuring performance
1 / execution time
Finding the enhanced time with Amdahl’s Law
Te = T0(1-Fe) + (T0Fe /Se)
Arithmetic Mean (usual mean)
(x1 + x2 + … + xn) / n
Harmonic Mean
n / (1/x1 + 1/x2 + … + 1/xn)
Geometric Mean
nthroot(X1 x X2 x… x Xn)
Dynamic Scheduling
hardware rearranges instruction execution to reduce stalls (maintains data flow and exception behavior)
creates WAR and WAW hazards
Hardware Speculation
built on dynamic scheduling, offers more performance advantages
makes predictions - allowing processor to continue past branch
Tomasulo’s Algorithm
Uses reservation stations to accomplish dynamic scheduling. When an operands value is unknown, the instruction waits until the value is produced and sent out on the common data bus(CDB). Uses register renaming to avoid WAR and WAW hazards
three reservation stations for addition
two reservation stations for multiplication
Can overlap loops because of register renaming
Reservation Station Components
Op: operation to perform on unit
Vj, Vk: Value of source operands
Qj, Qk: Reservation stations producing the value that goes in a source register
if == 0, the value is ready
Busy: indicates this reservation station or FU(functional unit) is busy
Register result status: indicates which FU will write each register
Stages of Tomaulo’s Algorithm
Issue - get instruction from queue, if reservation station free, issue instruction
Execution - operate on the operands
Write Result - write on Common Data Bus to all waiting units and mark station as not busy
Reorder Buffer (ROB)
Array with head and tail pointer where speculative instructions are kept
When the instruction at the head is completed it becomes non speculative
If a prediction was wrong, the tail is moved to the head, ‘erasing’ the bad instructions
fields are instruction, destination and value
Map Table
Used for renaming registers
Temporal Locality
If an item is referenced, it will tend to be referenced again soon (loops, reuse)
Spatial Locality
If an item is referenced, items whose addresses are close by tend to be referenced soon (arrays)
Fully Associative Cache
Blocks can go anywhere
Size of index is 0
Direct Mapped Cache
Must go into one specific block (usually calculated by modulo)
Index is large
Set Associative Cache
Block can go in a space within a set (set calculated by modulo)
Typically 2, 4 or 8 way
Reduces conflict misses
Write Strategy: Write Through
on write cache and lower level updated
Write Strategy: Write Back
on write only cache updated
when block is replaced → write to lower level
has a bit marking dirty(written)/not
Block to be replaced on a miss
Random Block (large associativity)
Least Recently Used (smaller associativity)
What happens on cache access
Index is used to select a block/set
Tag is compared to see if match
If match - data is gotten from location at the block offset
If miss - data fetched from lower levels and then retrieved from block offset
Compulsory Miss
First access to a block
Capacity Miss
Cache cannot contain all the blocks needed during execution of a program
Blocks are discarded and later retrieved again
Conflict Miss
In set associative/direct mapped, block can be discarded and later retrieved if too many instructions map to the same block/set
Victim Cache
smaller cache that holds recently discarded blocks, offers “associativity as needed”
Pseudo-Associativity
divide cache, on miss flip bit to check other half of cache
typically only done in 2 way, 4 way can be slow since checks are sequential and not parallel like set associative
Complier Optimizations to Reduce Misses
Merging Arrays
Loop Interchange
Loop Fusion
Blocking
Average Memory Access Time
Hit time + Miss Rate * Miss Penalty
Early Restart
when requested word from block arrives - send to CPU so it can continue
Critical Word First
Retrieves missed word first, CPU can continue, then rest of block is retrieved
Calculating Cache Tag, Index and Offset
Offset = # words per block
Index = n; where 2n = # lines/sets in cache
Tag = what is left