Why are caches needed, and what is the basic idea behind them?
Accessing main memory is slow
Reasons:
memory is shared between processors/cores → congestion
main memory uses cheap but slower DRAM
memory chips run at a lower frequency than the CPU
Caches are one of the most important performance optimizations in modern systems
Core idea: build a storage hierarchy and keep frequently used data close to the processor
This reduces memory access time
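The effect of keeping frequently used data close can be made concrete with the standard average-memory-access-time (AMAT) formula; the cycle counts below are illustrative assumptions, not values from the lecture:

```python
# AMAT = hit_time + miss_rate * miss_penalty
# Assumed numbers: cache hit in 2 cycles, main memory access in 100 cycles.
def amat(hit_time, miss_rate, miss_penalty):
    """Average cycles per memory access for a single cache level."""
    return hit_time + miss_rate * miss_penalty

no_cache = 100                                            # every access goes to DRAM
with_cache = amat(hit_time=2, miss_rate=0.05, miss_penalty=100)
print(no_cache, with_cache)                               # 100 vs 7.0 cycles on average
```

Even with a modest 95% hit rate, the average access time drops by more than an order of magnitude, which is the whole point of the hierarchy.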
What is the memory hierarchy, and how do speed, cost, capacity, and volatility change across it?
Typical hierarchy:
Registers
L1 cache
L2 cache
L3 cache
Main memory
Disk/Drive
L1, L2, L3 form the main cache hierarchy
Cache memory is usually volatile SRAM
Main memory is usually volatile DRAM
Disk/Drive (like HDD or SSD) is non-volatile
Going up the hierarchy:
faster
more expensive per bit
smaller capacity
Going down the hierarchy:
slower / higher latency
larger capacity
cheaper per bit
How is a high-level cache hierarchy organized in a multi-core processor?
Each core can have its own nearby caches
Often the L1 cache is split into:
L1-I = instruction cache
L1-D = data cache
Lower levels such as L2 may be unified
The last-level cache (often L3) is often shared between cores
Important cache design choices mentioned on the slide:
Split vs. unified
Inclusive vs. exclusive
Non-exclusive
Meanings:
Split cache: separate instruction and data caches
Unified cache: one cache for both instructions and data
Inclusive: data held in the per-core caches is also kept in the larger shared cache, so copies exist at multiple levels
Exclusive: data is kept in only one level as much as possible
Non-exclusive: mixed behavior
Cache latency is shown in processor cycles
Example values on the slide are rough orders of magnitude:
split caches: about 10 cycles
unified cache: about 100 cycles
shared L3: about 1000 cycles
Example note: at 1 GHz, 1 cycle = 1 ns
What are split caches and unified caches?
Split caches use separate caches for:
instructions
data
Example: L1-I for instructions and L1-D for data
Unified caches store both instructions and data in the same cache
So the main difference is:
split = separate instruction/data caches
unified = one common cache for both
What are inclusive, exclusive, and non-exclusive caches?
Inclusive caches:
data is stored in both the core’s private/exclusive caches and the shared caches
so the same data can be present in multiple cache levels at once
Exclusive caches:
data is stored in either the core’s private/exclusive caches or the shared caches
the goal is to avoid duplication across levels
Non-exclusive caches (or partially inclusive caches):
data may be stored in both private and shared caches
but it does not have to be
So:
inclusive = always duplicated across levels
exclusive = only one level
non-exclusive = duplication is allowed but not required
What is the main content of the “Type of Cache” part in this chapter?
This part explains important cache types / design choices, especially:
split vs. unified caches
inclusive vs. exclusive vs. non-exclusive caches
These describe how instructions and data are organized and whether data is duplicated across cache levels
What is cache associativity, and how do direct-mapped, set-associative, and fully associative caches differ?
Cache associativity describes how many places a memory block is allowed to go inside the cache
In a set-associative cache, a memory block maps to one set, but can be placed in any of the X ways inside that set
In a fully associative cache, a memory block can be placed in any cache line in the whole cache
you can think of it as 1 set with many ways
it reduces cache misses, but needs more complex hardware and more power
mainly used in very small caches close to the CPU
In a direct-mapped cache, a memory block can go to only one exact cache line
you can think of it as 1 way per set
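The three placement schemes can be sketched as one function, where a direct-mapped cache is the 1-way case and a fully associative cache is the single-set case; the cache sizes below are assumptions for illustration:

```python
# Where can a block go? A cache with `lines` total lines and `ways` lines per set
# has lines // ways sets; the block maps to set (block_addr mod num_sets) and may
# occupy any of that set's ways.
def candidate_lines(block_addr, lines, ways):
    num_sets = lines // ways              # fully associative: ways == lines -> 1 set
    s = block_addr % num_sets
    return [s * ways + w for w in range(ways)]

block = 13
print(candidate_lines(block, lines=8, ways=1))  # direct-mapped: exactly one line
print(candidate_lines(block, lines=8, ways=4))  # 4-way: 4 candidates in one set
print(candidate_lines(block, lines=8, ways=8))  # fully associative: any of the 8 lines
```

The number of candidate positions is exactly what "placement flexibility" means in the comparison that follows.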
How do the three associativity types compare in flexibility, miss rate, hardware cost, and typical use?
Direct-mapped cache
placement flexibility: very low
each block has exactly one location
cache miss rate: high because of many conflict misses
hardware and energy cost: very low
typical use: very small/simple caches, cost- and energy-efficient systems
Set-associative cache
placement flexibility: medium
each block has X possible locations inside one set
cache miss rate: low
hardware and energy cost: medium
typical use: standard in modern CPUs, especially L1, L2, L3
Fully associative cache
placement flexibility: very high
block can go into any cache line
cache miss rate: very low
hardware and energy cost: very high
typical use: very small special-purpose caches, e.g. TLB
Easy summary:
direct-mapped = simplest but most conflicts
fully associative = fewest conflicts but most expensive
set-associative = practical compromise
How is a physical memory address divided in cache architecture, and what do tag, index, and offset do?
A physical memory address is divided into:
Tag
Index
Offset
Index bits decide which cache set is selected
After the set is selected, all ways in that set are checked
The tag stored in each cache line is compared with the address tag
If the tag matches and the line is valid, it is a cache hit
If there is no match, or the entry is invalid, it is a cache miss
The offset bits select the exact byte inside the data block/cache line
A cache line contains at least:
a valid bit
a tag
the data block
In the shown example, the cache is 4-way set-associative with 8 sets
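For a 4-way cache with 8 sets, the address split can be sketched as follows; the 64-byte line size is an assumption, since the line size is not given here:

```python
# Split a physical address into tag / index / offset.
# 8 sets -> 3 index bits; an assumed 64-byte line -> 6 offset bits.
OFFSET_BITS = 6
INDEX_BITS = 3

def split_address(addr):
    offset = addr & ((1 << OFFSET_BITS) - 1)                # byte within the line
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1) # which set to search
    tag = addr >> (OFFSET_BITS + INDEX_BITS)                # identifies the block
    return tag, index, offset

tag, index, offset = split_address(0x1A2B3C)
print(hex(tag), index, offset)
```

Note that associativity does not appear in the split: the 4 ways of the selected set are searched in parallel by tag comparison, not addressed by extra bits.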
How does cache lookup work in a cache line, and what does PIPT mean?
The cache controller uses the index to choose the correct cache set
Then it compares the address tag bits with the tags of all ways in that set that have the valid bit set
If a tag matches, the data is already in the cache → cache hit
If no tag matches, the data is not there or is invalid → cache miss
On a hit, the offset bits are used to locate the correct byte inside the cache line’s data block
The slide shows PIPT = Physically Indexed, Physically Tagged
both set selection and tag comparison use the physical address
It also mentions other designs:
VIVT = Virtually Indexed, Virtually Tagged
VIPT = Virtually Indexed, Physically Tagged
What is PIPT cache addressing, and what are its advantages and disadvantages?
PIPT = Physically Indexed, Physically Tagged
Both the index bits and the tag bits are taken from the physical memory address
This means the processor must first do address translation from virtual address → physical address before accessing the cache
Because of that, cache access gets extra latency
So PIPT is usually not used in the caches closest to the processor
In the comparison table:
index derived from: physical address
tag derived from: physical address
TLB lookup required before cache: yes
cache access latency: higher
homonym problem: no
synonym problem: no
design complexity: low
typical use: L2 / L3 caches
What is VIVT cache addressing, and what are its advantages and disadvantages?
VIVT = Virtually Indexed, Virtually Tagged
Both the index bits and the tag bits are taken from the virtual memory address
No address translation is needed before cache lookup
So cache access can be very fast / very low latency
But VIVT has two important problems because virtual addresses are not globally unique:
Homonym problem: the same virtual address in different processes may refer to different physical addresses
Synonym problem: different virtual addresses may refer to the same physical address
The OS or cache controller must handle these problems
In the comparison table:
index derived from: virtual address
tag derived from: virtual address
TLB lookup required before cache: no
cache access latency: very low
homonym problem: yes
synonym problem: yes
design complexity: high
typical use: mostly historical
What is VIPT cache addressing, and why is it commonly used?
VIPT = Virtually Indexed, Physically Tagged
The index bits come from the virtual address
The tag bits come from the physical address
This allows the processor to do:
cache indexing and
address translation
at the same time
So translation can happen in parallel with the cache request
This makes cache access faster than PIPT, but safer than VIVT
The physical tag prevents homonym problems
But synonym problems are still possible
In the comparison table:
index derived from: virtual address
tag derived from: physical address
TLB lookup required before cache: no, because it is done in parallel
cache access latency: low
homonym problem: no
synonym problem: possible
design complexity: medium
typical use: standard L1 caches
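Why the synonym problem is only "possible" can be sketched with a standard rule of thumb: if all index and offset bits fall inside the page offset, the virtual and physical index are identical, so synonyms always land in the same set and can be detected by the physical tag. The page and line sizes below are assumptions:

```python
# VIPT aliasing check (sketch). Assumed: 4 KiB pages, 64-byte cache lines.
PAGE_SIZE = 4096
LINE_SIZE = 64

def vipt_alias_free(cache_size, ways):
    sets = cache_size // (LINE_SIZE * ways)
    index_plus_offset_bits = (sets * LINE_SIZE).bit_length() - 1
    page_offset_bits = PAGE_SIZE.bit_length() - 1
    return index_plus_offset_bits <= page_offset_bits   # index fits in page offset?

print(vipt_alias_free(32 * 1024, ways=8))   # 32 KiB 8-way: no aliasing possible
print(vipt_alias_free(64 * 1024, ways=8))   # 64 KiB 8-way: synonyms can alias
```

This constraint (cache size per way at most one page) is one reason L1 caches grow by adding ways rather than sets.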
How do PIPT, VIVT, and VIPT compare?
PIPT
safest and simple
no homonym or synonym problem
but slower because translation must happen first
common for L2/L3 caches
VIVT
fastest cache lookup
no translation needed before access
but has both homonym and synonym problems
mostly historical
VIPT
compromise between speed and correctness
indexing uses virtual address, tag uses physical address
translation and cache lookup happen in parallel
no homonym problem, but synonym problem can still happen
standard for L1 caches
What is the basic cache control flow for a read or write request?
A memory read request or memory write request comes in
The cache controller:
selects the cache set
compares the tag bits of all valid cache lines in that set
If there is a read/write hit:
the correct data is selected using the offset bits
then the cache either returns data or updates data
The slide also points ahead to write behavior:
write-back vs. write-through
What is the main point of the “Cache Policies” section starting here?
The next topic is cache policies
From the control-flow slide, an important policy question is how writes are handled:
write-back
write-through
So this section will explain the rules the cache uses when data is read, written, replaced, or missed
What are the main write-hit policies: write-back and write-through?
A write hit means the data being written is already in the cache
Write-back:
data is changed only in the cache line, not immediately in main memory
this makes cache and main memory temporarily inconsistent
therefore each cache line needs a dirty bit
the dirty bit shows that the cache line was modified and main memory does not yet contain the newest value
if such a line is later invalidated/evicted, the changed data must first be written back to memory, otherwise the changes are lost
pros from the summary table: high performance
cons: needs a dirty bit
typical use: modern CPUs
Write-through:
data is written at the same time to the cache and to main memory
so cache and memory stay consistent
no dirty bit is needed
pros: simple, consistent
cons: high memory traffic
typical use: simple systems
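The traffic difference can be sketched with a minimal model (an assumption for illustration, not the lecture's code): ten stores hitting the same line cause ten memory writes under write-through, but only one deferred write under write-back:

```python
# Minimal write-hit model: write-through sends every store to memory;
# write-back only sets the dirty bit and pays one write at eviction time.
class Line:
    def __init__(self):
        self.dirty = False

def write_hit(line, policy, mem_writes):
    if policy == "write-through":
        mem_writes += 1          # cache and memory updated together
    else:                        # write-back
        line.dirty = True        # memory updated later, on eviction
    return mem_writes

line, traffic = Line(), 0
for _ in range(10):              # ten stores hitting the same line
    traffic = write_hit(line, "write-through", traffic)
print(traffic)                   # 10 memory writes

line2, traffic2 = Line(), 0
for _ in range(10):
    traffic2 = write_hit(line2, "write-back", traffic2)
traffic2 += 1 if line2.dirty else 0   # single write-back when the line is evicted
print(traffic2)                  # 1 memory write
```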
What are the main write-miss policies: write-allocate and no-write-allocate?
A write miss means the data being written is not currently in the cache
Write-allocate:
on a write miss, the system first loads the block into the cache from memory
then the write is performed in the cache
usually combined with write-back
pros: good locality
cons: needs an extra read first
No-write-allocate:
on a write miss, the cache is not filled with that block
the write goes directly to main memory
no data is cached in this case
usually combined with write-through
pros: simple
cons: can be slow for writes
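The two miss policies can be sketched side by side in a minimal model (assumed for illustration); the bus-operation strings are hypothetical labels:

```python
# Minimal write-miss model: write-allocate fetches the block first, then writes
# into the cache; no-write-allocate bypasses the cache entirely.
def write_miss(cache, addr, policy):
    """Return the list of bus operations performed for this miss."""
    ops = []
    if policy == "write-allocate":
        ops.append("read block from memory")    # the extra read mentioned above
        cache.add(addr)
        ops.append("write into cache")          # future accesses now hit
    else:                                       # no-write-allocate
        ops.append("write directly to memory")  # cache contents unchanged
    return ops

cache = set()
print(write_miss(cache, 0x40, "write-allocate"))     # block 0x40 is now cached
print(write_miss(cache, 0x80, "no-write-allocate"))  # block 0x80 stays uncached
print(sorted(cache))
```

The "good locality" benefit shows up on the next access: 0x40 would hit, 0x80 would miss again.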
How does the cache control flow work for hits, misses, and replacement?
A memory read or memory write request enters the cache
The cache controller:
selects the cache set
compares the tag bits of valid cache lines
If there is a read/write hit:
the cache uses the offset bits to select the correct data
then it either returns data or updates data
the exact write behavior depends on write-back vs. write-through
If there is a read miss:
the request is relayed to the next cache level or main memory until the data is found
If there is a write miss:
behavior depends on write-allocate vs. no-write-allocate
When a new block must be inserted:
first check if the set contains an empty cache line, i.e. one whose valid bit is not set
if yes, insert there
if not, one line must be evicted using a replacement policy
Replacement examples mentioned:
LRU
LFU
MRU
What are the main cache replacement policies: LRU, LFU, and MRU?
Replacement policy decides which cache line is evicted when the cache/set is full
LRU (Least Recently Used):
evicts the cache line with the oldest age
meaning the line that has not been used for the longest time
usually gives a good hit rate
but storing exact age causes hardware overhead
for caches with many ways, systems often use pseudo-LRU to reduce overhead
in the summary table: standard replacement policy
LFU (Least Frequently Used):
evicts the line that was used the least often
works well for stable access patterns
but adapts slowly if the pattern changes
usage is rare
MRU (Most Recently Used):
evicts the most recently used cache line
can help in some streaming situations
but is usually bad for locality
usage is very rare
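LRU bookkeeping can be sketched with an ordered map as a minimal model of one cache set (an illustration, not how the hardware tracks ages):

```python
from collections import OrderedDict

# Minimal one-set LRU sketch: a hit moves the line to the "youngest" end;
# a miss in a full set evicts the line at the "oldest" end first.
class LRUSet:
    def __init__(self, ways):
        self.ways = ways
        self.lines = OrderedDict()

    def access(self, tag):
        if tag in self.lines:
            self.lines.move_to_end(tag)     # hit: refresh the line's age
            return "hit"
        if len(self.lines) == self.ways:
            self.lines.popitem(last=False)  # evict the least recently used line
        self.lines[tag] = True
        return "miss"

s = LRUSet(ways=2)
results = [s.access(t) for t in ["A", "B", "A", "C", "B"]]
print(results)   # ['miss', 'miss', 'hit', 'miss', 'miss'] - C evicts B, B evicts A
```

Real hardware avoids this per-access reordering cost, which is exactly why pseudo-LRU exists for caches with many ways.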
What is the overall summary of cache policies from the control-flow table?
Write hit + write-through:
write immediately to next level
pros: simple, consistent
cons: high traffic
used in simple systems
Write hit + write-back:
update cache, write later
pros: high performance
cons: dirty bit needed
used in modern CPUs
Write miss + write-allocate:
load block first, then write
pros: good locality
cons: extra read
often used with write-back
Write miss + no-write-allocate:
write bypasses cache
pros: simple
cons: slow for writes
often used with write-through
Read miss:
fetch data from lower level
correctness is the main benefit
drawback is high latency
Replacement:
LRU = standard
LFU = rare
MRU = very rare
What is DMA and why is it useful?
DMA = Direct Memory Access
It is a method that allows peripherals/devices to transfer data directly to or from memory
This happens without continuous CPU intervention
Purpose / benefits of DMA:
improves efficiency of data transfer
frees the CPU for other tasks
speeds up I/O by avoiding the CPU bottleneck
What are the key characteristics and advantages of DMA?
DMA uses a DMA controller (DMAC) to manage data transfers
It can transfer large blocks of data with minimal overhead
Devices such as network cards, disk drives, and graphics cards use DMA to transfer data directly to/from memory with very little CPU involvement
Main advantages:
reduces CPU workload
makes I/O faster and more efficient
How does DMA basically work in the system diagram?
First, the CPU programs the DMA controller
The DMA controller stores information such as:
address
count
control
Then the device/controller sends a DMA request
Data is transferred between the device buffer and main memory over the bus
After the transfer is done, the DMA controller sends an interrupt to the CPU
So the CPU starts the transfer, but does not move every byte itself
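The program-transfer-interrupt sequence can be sketched as follows; the function and its parameters are hypothetical, a minimal model of the steps above:

```python
# DMA flow sketch (assumed, simplified):
# the CPU programs address/count/control once, the controller moves the data,
# and a single interrupt signals completion.
def dma_transfer(device_buffer, ram, dest, count, on_done):
    addr = dest                          # 1. CPU programmed destination + count
    for word in device_buffer[:count]:   # 2-4. controller moves data over the bus
        ram[addr] = word                 #      without the CPU copying each word
        addr += 1
    on_done()                            # 5. interrupt: transfer complete

ram = {}
done = []
dma_transfer(device_buffer=[10, 20, 30], ram=ram, dest=0x100, count=3,
             on_done=lambda: done.append("interrupt"))
print(ram)    # {256: 10, 257: 20, 258: 30}
print(done)   # the CPU is notified exactly once, at the end
```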
What is the real-life movie example of a DMA transfer?
Scenario: you want to watch a movie stored on your HDD or SSD
DMA is used to move the movie data from the storage device into RAM
This lets the CPU avoid handling every small data movement itself
The CPU can then focus more on other work, like later decoding and displaying the movie
What happens in DMA transfer Step 1?
Step 1 is preparation by the CPU
Example: you open a media player and press Play
The CPU sets up the DMA controller by specifying:
Source: where the movie file is on the hard drive
Destination: where in RAM the data should go
Size: how much data should be transferred
The CPU also tells the disk controller to start reading the movie data and place it into its internal buffer
What happens in DMA transfer Steps 2 and 3?
Step 2: DMA controller takes over
it sends a read request to the disk controller to start transferring data
Step 3: Data transfer
the disk controller reads the movie data from the hard drive and places it in its internal buffer
the DMA controller transfers chunks of data directly from the disk controller’s buffer into RAM
this happens without CPU involvement
the transfer uses the system bus
What happens in DMA transfer Step 4?
Step 4 is the repeat/loop phase until completion
For each chunk of data:
the DMA controller updates the destination address in RAM
it also reduces the remaining data size/count
meanwhile the disk controller keeps filling its internal buffer from the hard drive
This repeats until the required part of the movie, or the whole file, is in RAM
What happens in DMA transfer Steps 5 and 6?
Step 5: Completion notification
when all data has been transferred, the DMA controller sends an interrupt to the CPU
this tells the CPU that the movie data is now ready in RAM
Step 6: Playback
the CPU now focuses on decoding and displaying the movie
the media player accesses the data directly from RAM
What are the main DMA operating modes?
Word-at-a-time mode
also called cycle stealing
DMA occasionally “steals” the bus from the CPU for a few cycles to do short transfers
Block mode / burst mode
a whole series of transfers is done at once
can be more efficient
but long bursts can block the bus for the CPU or other devices for a long time
Fly-by mode
the DMA controller tells the device to store/read data directly to/from memory; the data does not pass through the DMA controller
in the alternative mode, the DMA controller itself reads each word and then writes it to the target
this second mode is slower per word, but supports device-to-device and memory-to-memory copies
Why is an internal device buffer required, and when is DMA not always useful?
An internal device buffer is needed because:
checks such as checksums may need to be verified before transferring data to memory
the device cannot wait for the memory bus every time, because data from the disk arrives at a steady rate
DMA is not always meaningful:
if the CPU is much faster than the DMA controller
in embedded devices, where reducing complexity and cost may be more important
What is the main summary of this whole topic?
Caches improve performance because they provide much faster access than main memory
The lecture covered basic cache architectures
It also covered cache policies and control flow
And it introduced DMA, which improves I/O by letting devices transfer data directly to/from memory with little CPU help
A processor uses an L1 cache with VIPT addressing. Two processes use the same virtual address, but it maps to different physical pages. Which issue is still possible without additional handling?
A. Homonym problem only
B. Synonym problem only
C. Both homonym and synonym problems
D. No correctness issues can occur
A. No
Homonym = same virtual address, different physical addresses
In VIPT, the tag is physical, so these are distinguished correctly
B. Yes
Synonym = different virtual addresses, same physical address
In VIPT, indexing is still virtual, so this problem can still happen
C. No
Homonym is handled by the physical tag
So not both remain possible
D. No
VIPT is not perfect
Synonym issues can still occur without extra handling
Correct: B
A system applies a write-back + write-allocate policy. A store instruction targets an address that is not currently cached. What sequence of actions is most consistent with this design?
A. Write directly to main memory and skip the cache
B. Load the cache line from memory, then update it and mark it dirty
C. Invalidate the corresponding cache set and retry
D. Forward the write immediately to the next cache level
B. Yes
Write-allocate on a write miss means: first load the cache line into the cache
Then the store updates the cached copy
Write-back means memory is not updated immediately, so the line is marked dirty
A. No
That is no-write-allocate behavior
C. No
A write miss does not mean the whole set is invalidated
D. No
Immediate forwarding to lower level matches write-through, not write-back
Correct: B
Two CPU cores share a last-level cache (L3). A workload shows frequent evictions in L3, while private L1 and L2 caches have low miss rates. What is the most plausible explanation?
A. L3 has lower latency than L2
B. L3 contention arises due to inter-core sharing
C. L1 caches are directly mapped
D. The replacement policy is irrelevant in shared caches
B. Yes
A shared last-level cache is a common resource: both cores insert lines into it
If the combined working sets exceed the L3 capacity, the cores evict each other's lines
This explains frequent L3 evictions even while each core's private L1/L2 behave well
A. No
L3 has higher latency than L2, and latency does not explain evictions anyway
C. No
The mapping of the private L1 caches does not cause evictions in the shared L3
D. No
The replacement policy still decides which line is evicted in a shared cache
Correct: B
A designer chooses VIVT addressing to avoid address-translation latency in the L1 cache. What trade-off is introduced by this choice?
A. Increased cache access latency
B. Potential synonym and homonym problems
C. Forced inclusiveness across cache levels
D. Mandatory dirty bits for all cache lines
B. Yes
VIVT avoids translation before cache access, so it gives very low latency
But it introduces both homonym and synonym problems
A. No
VIVT is chosen to reduce, not increase, cache access latency
C. No
VIVT does not force caches to be inclusive
D. No
Dirty bits depend on write-back policy, not on VIVT addressing
Correct: B
A cache uses write-through + no-write-allocate. A store instruction causes a write miss. Which action is most likely?
A. Fetch the cache line into cache, then write
B. Write to cache only and delay memory update
C. Write directly to main memory without caching
D. Allocate the line only in L3
C. Yes
No-write-allocate on a write miss means: do not load the block into the cache
Write-through means the write goes directly to main memory
A. No
That would be write-allocate
B. No
That is closer to write-back, not write-through
D. No
The given policy says nothing about allocating only in L3
Correct: C
Why is the dirty bit essential in write-back caches?
A. To speed up read hits
B. To ensure modified data is written back before eviction
C. To distinguish instruction and data caches
D. To avoid synonym problems
B. Yes
In write-back, changed data may exist only in the cache
The dirty bit shows that the line was modified
Before eviction, that modified data must be written back to memory
A. No
Dirty bits are not for speeding up read hits
C. No
That is about split vs. unified caches
D. No
Synonym problems are about addressing, not dirty bits
Correct: B
Compared to LRU, which access pattern is most negatively affected by an MRU replacement policy?
A. Repeated reuse of recently accessed data
B. Temporal locality with frequent re-access
C. One-pass streaming access with no reuse
D. Random access with uniform probability
B. Yes
MRU removes the most recently used line
That is especially bad when the program has temporal locality
Temporal locality means recently accessed data will likely be used again soon
LRU keeps such data, but MRU throws it away
A. Also close, but not the best choice
“Repeated reuse of recently accessed data” describes the same bad case
But B is the more general and standard formulation
C. No
In one-pass streaming, data is usually not reused
So removing the most recently used block is often not very harmful
MRU can even work reasonably well there
D. No
With random access, there is no strong recent-use pattern to exploit
So MRU is not especially worse for that specific pattern compared to locality-heavy patterns
Correct: B
Why do DMA-based devices typically require an internal buffer during data transfer?
A. Because the CPU must copy data into the buffer
B. Because memory bus availability may not match the device’s data rate
C. Because DMA cannot access main memory directly
D. Because interrupts replace buffering
B. Yes
The device may produce/consume data at a steady rate
But the memory bus is not always immediately available
So an internal buffer is needed to hold data temporarily
A. No
With DMA, the CPU does not copy every piece of data itself
C. No
DMA can access main memory directly
D. No
Interrupts tell the CPU that a transfer is done or needs attention
They do not replace buffering
Correct: B
“DMA always improves system performance.” Which argument best contradicts this statement?
A. DMA is slower than CPU in all systems
B. DMA may be inefficient if the CPU is much faster or in low-cost embedded systems
C. DMA works only with HDDs, not SSDs
D. DMA disables cache usage
B. Yes
DMA is not always useful
It may be inefficient when the CPU is much faster than the DMA controller
It may also be unattractive in low-cost embedded systems where simplicity matters more
A. No
DMA is not slower in all systems
C. No
DMA is used with many devices, not only HDDs
D. No
DMA does not generally disable cache usage
Correct: B