Why have a hierarchy?
Main memory is very slow, but on-chip memory is expensive, so having a smaller on-chip memory is a reasonable trade-off. Even if not everything fits in it, we can do quite well at keeping the right things there by exploiting locality
Two types of locality
Spatial locality - programs use data in regions, so related data is usually clumped together (e.g. arrays)
Temporal locality - if you’ve used data recently, you’re likely to use it again
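Both kinds show up in ordinary code. A minimal sketch (the function below is purely illustrative, not from these notes):

```c
#include <stddef.h>

/* Spatial locality: a[i] and a[i+1] are adjacent in memory, so walking the
 * array in order keeps reusing the cache line that was just fetched.
 * Temporal locality: `sum` is touched on every iteration, so it stays in
 * the fastest storage available (a register or L1). */
long sum_array(const long *a, size_t n)
{
    long sum = 0;                 /* reused every iteration: temporal locality */
    for (size_t i = 0; i < n; i++)
        sum += a[i];              /* sequential accesses: spatial locality */
    return sum;
}
```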
Layers of cache
L1, L2, L3
L1 cache
one per core, 10s of KB, 1-4 cycles to access
L2 cache
one per core, 100s of KB, 8-12 cycles to access
L3 cache
shared between cores, 10s of MB, 20-30 cycles to access
Main Memory
10s of GB, 200-400 cycles to access
Cache properties
Organised into cache lines, each of which stores multiple words
Each line is commonly 32 or 64 bytes long
Allocation policy is how memory gets put into cache
Replacement policy is how we decide what gets kicked out
Cache line extra info
Cache lines include - a tag to indicate where in memory the line came from, a valid bit to indicate whether it holds data, and a dirty bit to indicate whether it has been written to
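A hypothetical sketch of the bookkeeping for one cache line (the field widths and the 64-byte line size are illustrative assumptions):

```c
#include <stdint.h>
#include <stdbool.h>

#define LINE_SIZE 64              /* bytes per line, as above */

/* One cache line: the stored data plus the extra info listed above. */
struct cache_line {
    uint64_t tag;                 /* which part of memory this line came from */
    bool     valid;               /* does the line hold real data? */
    bool     dirty;               /* has it been written since it was loaded? */
    uint8_t  data[LINE_SIZE];     /* the cached bytes themselves */
};
```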
Direct Mapped Cache
Allocation - each line in memory maps to one specific location in cache (usually a simple mod on the address)
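A sketch of that mapping, assuming 64-byte lines and a power-of-two number of lines (the sizes and helper names are made up for illustration):

```c
#include <stdint.h>

#define LINE_SIZE 64u             /* bytes per line (illustrative) */
#define NUM_LINES 512u            /* e.g. 32 KB / 64 B (illustrative) */

/* Direct mapped: the index is (address / line size) mod the number of
 * lines, so each address can live in exactly one slot in the cache. */
static inline uint32_t dm_index(uint64_t addr)
{
    return (uint32_t)((addr / LINE_SIZE) % NUM_LINES);
}

/* The tag is everything above the index and offset bits; it is stored in
 * the line so we can tell which address is actually cached there. */
static inline uint64_t dm_tag(uint64_t addr)
{
    return addr / (LINE_SIZE * NUM_LINES);
}
```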
Advantages
simple
little hardware
fast, small, low power
easy to understand
Disadvantages
suffers from lots of collisions
unnecessary data eviction (cache thrashing)
highest miss rate
behaviour can be difficult to understand
Fully associative cache
Allocation - each line in memory can map to anywhere in cache
Advantages
most efficient use of space
relatively easy to understand
Disadvantages
lots of hardware needed
Need a CAM (content-addressable memory) or similar
largest performance overhead, hard to make it fast
evictions become difficult, lots of options
Set associative cache
A line at address A can only map to one set, but could map to any of the S lines (ways) in that set (the set is again chosen with A mod (N/S))
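A sketch of the lookup this implies, assuming 64-byte lines, N = 512 lines and S = 8 ways (all of these numbers and names are illustrative):

```c
#include <stdint.h>
#include <stdbool.h>

#define LINE_SIZE 64u
#define NUM_LINES 512u                     /* N: total lines in the cache */
#define WAYS      8u                       /* S: lines per set */
#define NUM_SETS  (NUM_LINES / WAYS)       /* N / S sets */

struct line { uint64_t tag; bool valid; };

/* Address A may only live in set (A / LINE_SIZE) mod (N / S), but within
 * that set it may sit in any of the S ways, so a lookup checks all of them. */
bool lookup(struct line cache[NUM_SETS][WAYS], uint64_t addr, uint32_t *way_out)
{
    uint32_t set = (uint32_t)((addr / LINE_SIZE) % NUM_SETS);
    uint64_t tag = addr / (LINE_SIZE * NUM_SETS);

    for (uint32_t w = 0; w < WAYS; w++) {
        if (cache[set][w].valid && cache[set][w].tag == tag) {
            *way_out = w;                  /* hit in this way */
            return true;
        }
    }
    return false;                          /* miss: allocate/replace a way */
}
```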
Advantages:
best trade-off
performs well, not too hard to implement, good efficiency
Disadvantages
harder to understand
Current convention
4, 8, 16 way set associativity
Types of cache misses
Compulsory, Capacity, Conflict, (coherence)
Compulsory
Line has not been brought into the cache before; these would be misses even in an infinite cache
Capacity
Cache is not large enough, so some lines are discarded and later retrieved - these would still happen in a fully associative cache
Conflict
In a set associative or direct mapped cache, lines are discarded but later needed because too many lines mapped into the same set. Also called collision misses or interference misses. Could happen in any N-way set associative cache
Coherence miss
happens in multi-core systems when one core makes an item in another core’s cache stale
Cache trends as size +
Compulsory misses become insignificant, and capacity misses shrink
2-1 rule
miss rate for a 1-way set associative (direct mapped) cache of size X ~= miss rate for a 2-way set associative cache of size X/2
Replacement policies
least recently used, least recently replaced, random
LRU
if it’s not been used in a while, then get rid of it
Least recently replaced
oldest line in the set; it could still be in use, but this is simpler to implement
Random
simplest, fairly effective, ideally pseudo-random
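A minimal sketch of the LRU policy within one set, using an age counter per way (real hardware typically approximates this with a few bits of state; the names here are illustrative):

```c
#include <stdint.h>

#define WAYS 8u

/* Per-way "last used" timestamps for one set. A hit refreshes that way's
 * timestamp; on a miss we evict the way with the oldest timestamp. */
struct lru_set {
    uint64_t last_used[WAYS];
    uint64_t clock;                        /* monotonically increasing counter */
};

void lru_touch(struct lru_set *s, uint32_t way)
{
    s->last_used[way] = ++s->clock;        /* mark as most recently used */
}

uint32_t lru_victim(const struct lru_set *s)
{
    uint32_t victim = 0;
    for (uint32_t w = 1; w < WAYS; w++)
        if (s->last_used[w] < s->last_used[victim])
            victim = w;                    /* least recently used so far */
    return victim;
}
```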
Cache consistency
What happens when data is modified in a lower cache? How does that change get back to main memory?
Two main approaches are write-through and write-back
Write-through
the write is passed on to the next level when it happens (and will then propagate)
Write-back
Data is only updated in the next level when that line is evicted
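A rough sketch of the difference, with a hypothetical next_level_write standing in for whatever hands a line's data to the next level:

```c
#include <stdint.h>
#include <stdbool.h>
#include <string.h>

#define LINE_SIZE 64

struct line { uint64_t tag; bool valid; bool dirty; uint8_t data[LINE_SIZE]; };

/* Hypothetical stand-in for passing a line's data on to the next level. */
static void next_level_write(uint64_t addr, const uint8_t *data)
{
    (void)addr; (void)data;                /* a real cache forwards this on */
}

/* Write-through: every write is passed straight on to the next level. */
void write_through(struct line *l, uint64_t addr, const uint8_t *src)
{
    memcpy(l->data, src, LINE_SIZE);
    next_level_write(addr, l->data);       /* propagates immediately */
}

/* Write-back: just mark the line dirty; the next level only sees the new
 * data when this line is eventually evicted. */
void write_back(struct line *l, const uint8_t *src)
{
    memcpy(l->data, src, LINE_SIZE);
    l->dirty = true;
}

void on_evict(struct line *l, uint64_t addr)
{
    if (l->valid && l->dirty)
        next_level_write(addr, l->data);   /* deferred propagation */
    l->valid = false;
}
```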
Inclusive vs exclusive cache
Inclusive, L3 contains all of L1 and L2, for example.
Exclusive, it doesn’t
Advantages of inclusive
Benefits for spatial and temporal locality. If it was in L1 but isn't any more, it's likely to be in L2 or L3.
Easy to check if a core has a copy of data - just need to check its highest level of private cache (L2).
Disadvantages of inclusive
Duplication of data
reduced unique capacity at level n+1
expensive to maintain for shared caches with lots of cores
Trends in this?
Trending towards exclusive, especially as core counts increase. If you have lots of different L2 caches, an inclusive L3 ends up holding mostly just L2 contents
Cache coherency
If multiple cores share an array, where does that live?
Whichever core most recently modified the data has the up to date copy
When other cores want to access this they need to discover the most up-to-date version
Two classes of coherence protocols
Directory based, snooping
Directory Based
Sharing status for a block is kept in one place, the directory.
Can be centralised (SMP) - one central directory, associated with e.g. main memory, or the L3 cache (for a single-chip multi-core)
For a multi-chip system we have a distributed directory, which is more complicated
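A hypothetical sketch of what one directory entry might hold; the state names and sharer bitmask are illustrative, not a specific protocol from these notes:

```c
#include <stdint.h>

/* One directory entry per memory block: its state and who is sharing it. */
enum dir_state { DIR_UNCACHED, DIR_SHARED, DIR_MODIFIED };

struct dir_entry {
    enum dir_state state;
    uint64_t sharers;          /* bit i set => core i holds a copy of the block */
};

/* On a write by `core`, the directory knows exactly which caches need an
 * invalidation: every sharer other than the writer. */
uint64_t cores_to_invalidate(const struct dir_entry *e, unsigned core)
{
    return e->sharers & ~(1ull << core);
}
```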
Snooping
Caches handle coherence individually. They track the sharing status of the blocks they hold. Memory requests are broadcast on a shared bus, and cache controllers snoop these requests to know, for example, when their data has been updated elsewhere.
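A very small sketch of the snooping side: when a controller sees another core's write to an address it caches, it invalidates its own copy (the sizes and names are illustrative):

```c
#include <stdint.h>
#include <stdbool.h>

#define LINE_SIZE 64u
#define NUM_SETS  64u
#define WAYS      8u

struct line { uint64_t tag; bool valid; };

/* Called for every write this controller observes on the shared bus. If we
 * hold the line being written, our copy is now stale, so drop it. */
void snoop_remote_write(struct line cache[NUM_SETS][WAYS], uint64_t addr)
{
    uint32_t set = (uint32_t)((addr / LINE_SIZE) % NUM_SETS);
    uint64_t tag = addr / (LINE_SIZE * NUM_SETS);

    for (uint32_t w = 0; w < WAYS; w++)
        if (cache[set][w].valid && cache[set][w].tag == tag)
            cache[set][w].valid = false;   /* invalidate the stale copy */
}
```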
BC4 Cache
35 MB L3, 256 KB L2 per core, 32 KB data cache per core, 32 KB instruction cache per core.
8-way set associative, write-back
The L3 is distributed around the CPU in a ring, with slices close to each core
Cache trends
On-chip cache capacity has grown!
Cache optimisations
Hardware prefetch
Hit under miss (while waiting for a miss, do something else)
Critical word first (when loading a line, bring the word we care about first)
Merging write buffers (multiple updates happen together)
Compiler optimisations make better use of caches too
Critical Word first
request the missed word from memory, send it on as soon as it arrives, then load the rest of the line
Early restart
request the line as normal, but as soon as the requested data arrives send it on and allow execution to continue
Hardware prefetching
Maybe when you fetch a line, also fetch the next line, especially if there are special ISA hints telling you to
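Compilers also expose prefetching to software; for example GCC and Clang provide __builtin_prefetch, which hints that an address will be needed soon. A rough sketch (the prefetch distance here is an arbitrary illustrative choice, not a tuned value):

```c
#include <stddef.h>

/* Sum an array while hinting that data some way ahead should be pulled into
 * cache before it is needed. 64 longs = 512 bytes = 8 cache lines ahead
 * here, purely as an illustrative distance. */
long sum_with_prefetch(const long *a, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + 64 < n)
            __builtin_prefetch(&a[i + 64], 0, 1);  /* read, low temporal reuse */
        sum += a[i];
    }
    return sum;
}
```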