Why have a hierarchy?
Main memory is very slow, but on-chip memory is expensive, so having a smaller on-chip memory is a reasonable trade-off. Even if not everything fits in it, we can do quite well at keeping the right things there by exploiting locality
Two types of locality
Spatial locality - programs use data in regions, so related data is usually clumped together (e.g. arrays)
Temporal locality - if you’ve used data recently, you’re likely to use it again
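Both kinds show up in ordinary code. A minimal sketch (the function below is purely illustrative, not from these notes):

```c
#include <stddef.h>

/* Spatial locality: a[i] and a[i+1] are adjacent in memory, so walking the
 * array in order keeps reusing the cache line that was just fetched.
 * Temporal locality: `sum` is touched on every iteration, so it stays in
 * the fastest storage available (a register or L1). */
long sum_array(const long *a, size_t n)
{
    long sum = 0;                 /* reused every iteration: temporal locality */
    for (size_t i = 0; i < n; i++)
        sum += a[i];              /* sequential accesses: spatial locality */
    return sum;
}
```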
Layers of cache
L1, L2, L3
L1 cache
one per core, 10s of KB, 1-4 cycles to access
L2 cache
one per core, 100s of KB, 8-12 cycles to access
L3 cache
shared between cores, 10s of MB, 20-30 cycles to access
Main Memory
10s of GB, 200-400 cycles to access
Cache properties
Organised into cache lines, each of which stores multiple words
Each line is commonly 32 or 64 bytes long
Allocation policy is how memory gets put into cache
Replacement policy is how we decide what gets kicked out
Cache line extra info
Cache lines include - a tag to indicate where in memory the line came from, a valid bit to indicate whether it holds data, and a dirty bit to indicate whether it has been written to
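A hypothetical sketch of the bookkeeping for one cache line (the field widths and the 64-byte line size are illustrative assumptions):

```c
#include <stdint.h>
#include <stdbool.h>

#define LINE_SIZE 64              /* bytes per line, as above */

/* One cache line: the stored data plus the extra info listed above. */
struct cache_line {
    uint64_t tag;                 /* which part of memory this line came from */
    bool     valid;               /* does the line hold real data? */
    bool     dirty;               /* has it been written since it was loaded? */
    uint8_t  data[LINE_SIZE];     /* the cached bytes themselves */
};
```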
Direct Mapped Cache
Allocation - each line in memory maps to one specific location in cache (usually a simple mod on the address)
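A sketch of that mapping, assuming 64-byte lines and a power-of-two number of lines (the sizes and helper names are made up for illustration):

```c
#include <stdint.h>

#define LINE_SIZE 64u             /* bytes per line (illustrative) */
#define NUM_LINES 512u            /* e.g. 32 KB / 64 B (illustrative) */

/* Direct mapped: the index is (address / line size) mod the number of
 * lines, so each address can live in exactly one slot in the cache. */
static inline uint32_t dm_index(uint64_t addr)
{
    return (uint32_t)((addr / LINE_SIZE) % NUM_LINES);
}

/* The tag is everything above the index and offset bits; it is stored in
 * the line so we can tell which address is actually cached there. */
static inline uint64_t dm_tag(uint64_t addr)
{
    return addr / (LINE_SIZE * NUM_LINES);
}
```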
Advantages
simple
little hardware
fast, small, low power
easy to understand
Disadvantages
suffers from lots of collisions
unnecessary data eviction (cache thrashing)
highest miss rate
behaviour can be difficult to understand
Fully associative cache
Allocation - each line in memory can map to anywhere in cache
Advantages
most efficient use of space
relatively easy to understand
Disadvantages
lots of hardware needed
Need a CAM (content-addressable memory) or similar
largest performance overhead, hard to make it fast
evictions become difficult, lots of options
Set associative cache
A line at address A can only map to one set, but could map to any of the S lines (ways) in that set (the set is again chosen with A mod (N/S))
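A sketch of the lookup this implies, assuming 64-byte lines, N = 512 lines and S = 8 ways (all of these numbers and names are illustrative):

```c
#include <stdint.h>
#include <stdbool.h>

#define LINE_SIZE 64u
#define NUM_LINES 512u                     /* N: total lines in the cache */
#define WAYS      8u                       /* S: lines per set */
#define NUM_SETS  (NUM_LINES / WAYS)       /* N / S sets */

struct line { uint64_t tag; bool valid; };

/* Address A may only live in set (A / LINE_SIZE) mod (N / S), but within
 * that set it may sit in any of the S ways, so a lookup checks all of them. */
bool lookup(struct line cache[NUM_SETS][WAYS], uint64_t addr, uint32_t *way_out)
{
    uint32_t set = (uint32_t)((addr / LINE_SIZE) % NUM_SETS);
    uint64_t tag = addr / (LINE_SIZE * NUM_SETS);

    for (uint32_t w = 0; w < WAYS; w++) {
        if (cache[set][w].valid && cache[set][w].tag == tag) {
            *way_out = w;                  /* hit in this way */
            return true;
        }
    }
    return false;                          /* miss: allocate/replace a way */
}
```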
Advantages:
best trade-off
performs well, not too hard to implement, good efficiency
Disadvantages
harder to understand
Current convention
4, 8, 16 way set associativity
Types of cache misses
Compulsory, Capacity, Conflict, (coherence)
Compulsory
Line has not been brought into the cache before; these would be misses even in an infinite cache
Capacity
Cache is not large enough, so some lines are discarded and later retrieved - these would still happen in a fully associative cache
Conflict
In a set associative or direct mapped cache, lines are discarded but later needed because too many lines mapped into the same set. Also called collision misses or interference misses. Could happen in any N-way set associative cache
Coherence miss
happens in multi-core systems when one core makes an item in another core’s cache stale
Cache trends as size +
Compulsory misses become insignificant, and capacity misses shrink
2-1 rule
miss rate for a 1-way set associative (direct mapped) cache of size X ~= miss rate for a 2-way set associative cache of size X/2
Replacement policies
least recently used, least recently replaced, random
LRU
if it’s not been used in a while, then get rid of it
Least recently replaced
oldest line in the set; it could still be in use, but this is simpler to implement
Random
simplest, fairly effective, ideally pseudo-random
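A minimal sketch of the LRU policy within one set, using an age counter per way (real hardware typically approximates this with a few bits of state; the names here are illustrative):

```c
#include <stdint.h>

#define WAYS 8u

/* Per-way "last used" timestamps for one set. A hit refreshes that way's
 * timestamp; on a miss we evict the way with the oldest timestamp. */
struct lru_set {
    uint64_t last_used[WAYS];
    uint64_t clock;                        /* monotonically increasing counter */
};

void lru_touch(struct lru_set *s, uint32_t way)
{
    s->last_used[way] = ++s->clock;        /* mark as most recently used */
}

uint32_t lru_victim(const struct lru_set *s)
{
    uint32_t victim = 0;
    for (uint32_t w = 1; w < WAYS; w++)
        if (s->last_used[w] < s->last_used[victim])
            victim = w;                    /* least recently used so far */
    return victim;
}
```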
Cache consistency
What happens when data is modified in a lower cache? How does that change get back to main memory?
Two main approaches are write-through and write-back
Write-through
the write is passed on to the next level when it happens (and will then propagate)
Write-back
Data is only updated in the next level when that line is evicted
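A rough sketch of the difference, with a hypothetical next_level_write standing in for whatever hands a line's data to the next level:

```c
#include <stdint.h>
#include <stdbool.h>
#include <string.h>

#define LINE_SIZE 64

struct line { uint64_t tag; bool valid; bool dirty; uint8_t data[LINE_SIZE]; };

/* Hypothetical stand-in for passing a line's data on to the next level. */
static void next_level_write(uint64_t addr, const uint8_t *data)
{
    (void)addr; (void)data;                /* a real cache forwards this on */
}

/* Write-through: every write is passed straight on to the next level. */
void write_through(struct line *l, uint64_t addr, const uint8_t *src)
{
    memcpy(l->data, src, LINE_SIZE);
    next_level_write(addr, l->data);       /* propagates immediately */
}

/* Write-back: just mark the line dirty; the next level only sees the new
 * data when this line is eventually evicted. */
void write_back(struct line *l, const uint8_t *src)
{
    memcpy(l->data, src, LINE_SIZE);
    l->dirty = true;
}

void on_evict(struct line *l, uint64_t addr)
{
    if (l->valid && l->dirty)
        next_level_write(addr, l->data);   /* deferred propagation */
    l->valid = false;
}
```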
Inclusive vs exclusive cache
Inclusive, L3 contains all of L1 and L2, for example.
Exclusive, it doesn’t
Advantages of inclusive
Benefits for spatial and temporal locality. If it was in L1 but isn't any more, it's likely to be in L2 or L3.
Easy to check if a core has a copy of data - just need to check its highest level of private cache (L2).
Disadvantages of inclusive
Duplication of data
reduced unique capacity at level n+1
expensive to maintain for shared caches with lots of cores
Trends in this?
Trending towards exclusive, especially as core counts increase. If you have lots of different L2 caches, an inclusive L3 ends up holding mostly just L2 contents
Cache coherency
If multiple cores share an array, where does that live?
Whichever core most recently modified the data has the up to date copy
When other cores want to access this they need to discover the most up-to-date version
Two classes of coherence protocols
Directory based, snooping
Directory Based
Sharing status for a block is kept in one place, the directory.
Can be centralised (SMP) - one central directory, associated with e.g. main memory, or the L3 cache (for a single-chip multi-core)
For a multi-chip system we have a distributed directory, which is more complicated
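A hypothetical sketch of what one directory entry might hold; the state names and sharer bitmask are illustrative, not a specific protocol from these notes:

```c
#include <stdint.h>

/* One directory entry per memory block: its state and who is sharing it. */
enum dir_state { DIR_UNCACHED, DIR_SHARED, DIR_MODIFIED };

struct dir_entry {
    enum dir_state state;
    uint64_t sharers;          /* bit i set => core i holds a copy of the block */
};

/* On a write by `core`, the directory knows exactly which caches need an
 * invalidation: every sharer other than the writer. */
uint64_t cores_to_invalidate(const struct dir_entry *e, unsigned core)
{
    return e->sharers & ~(1ull << core);
}
```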
Snooping
Caches handle coherence individually. They track the sharing status of the blocks they hold. Memory requests are broadcast on a shared bus, and cache controllers snoop these requests to know, for example, when their data has been updated elsewhere.
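A very small sketch of the snooping side: when a controller sees another core's write to an address it caches, it invalidates its own copy (the sizes and names are illustrative):

```c
#include <stdint.h>
#include <stdbool.h>

#define LINE_SIZE 64u
#define NUM_SETS  64u
#define WAYS      8u

struct line { uint64_t tag; bool valid; };

/* Called for every write this controller observes on the shared bus. If we
 * hold the line being written, our copy is now stale, so drop it. */
void snoop_remote_write(struct line cache[NUM_SETS][WAYS], uint64_t addr)
{
    uint32_t set = (uint32_t)((addr / LINE_SIZE) % NUM_SETS);
    uint64_t tag = addr / (LINE_SIZE * NUM_SETS);

    for (uint32_t w = 0; w < WAYS; w++)
        if (cache[set][w].valid && cache[set][w].tag == tag)
            cache[set][w].valid = false;   /* invalidate the stale copy */
}
```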
BC4 Cache
35 MB L3, 256 KB L2 per core, 32 KB data cache per core, 32 KB instruction cache per core.
8-way set associative, write-back
The L3 is distributed around the CPU in a ring, with slices close to each core
Cache trends
On-chip cache capacity has grown!
Cache optimisations
Hardware prefetch
Hit under miss (while waiting for a miss, do something else)
Critical word first (when loading a line, bring the word we care about first)
Merging write buffers (multiple updates happen together)
Compiler optimisations make better use of caches too
Critical Word first
request the missed word from memory, send it on as soon as it arrives, then load the rest of the line
Early restart
request the line as normal, but as soon as the requested data arrives send it on and allow execution to continue
Hardware prefetching
Maybe when you fetch a line, also fetch the next line, especially if there are special ISA hints telling you to
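Compilers also expose prefetching to software; for example GCC and Clang provide __builtin_prefetch, which hints that an address will be needed soon. A rough sketch (the prefetch distance here is an arbitrary illustrative choice, not a tuned value):

```c
#include <stddef.h>

/* Sum an array while hinting that data some way ahead should be pulled into
 * cache before it is needed. 64 longs = 512 bytes = 8 cache lines ahead
 * here, purely as an illustrative distance. */
long sum_with_prefetch(const long *a, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + 64 < n)
            __builtin_prefetch(&a[i + 64], 0, 1);  /* read, low temporal reuse */
        sum += a[i];
    }
    return sum;
}
```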