Why are caches needed, and what is the basic idea behind them?
Accessing main memory is slow
Reasons:
memory is shared between processors/cores → congestion
main memory uses cheap but slower DRAM
memory chips run at a lower frequency than the CPU
Caches are one of the most important performance optimizations in modern systems
Core idea: build a storage hierarchy and keep frequently used data close to the processor
This reduces memory access time
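The effect of keeping frequently used data close can be made concrete with the standard average-memory-access-time (AMAT) formula; the cycle counts below are illustrative assumptions, not values from the lecture:

```python
# AMAT = hit_time + miss_rate * miss_penalty
# Assumed numbers: cache hit in 2 cycles, main memory access in 100 cycles.
def amat(hit_time, miss_rate, miss_penalty):
    """Average cycles per memory access for a single cache level."""
    return hit_time + miss_rate * miss_penalty

no_cache = 100                                            # every access goes to DRAM
with_cache = amat(hit_time=2, miss_rate=0.05, miss_penalty=100)
print(no_cache, with_cache)                               # 100 vs 7.0 cycles on average
```

Even with a modest 95% hit rate, the average access time drops by more than an order of magnitude, which is the whole point of the hierarchy.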
What is the memory hierarchy, and how do speed, cost, capacity, and volatility change across it?
Typical hierarchy:
Registers
L1 cache
L2 cache
L3 cache
Main memory
Disk/Drive
L1, L2, L3 form the main cache hierarchy
Cache memory is usually volatile SRAM
Main memory is usually volatile DRAM
Disk/Drive (like HDD or SSD) is non-volatile
Going up the hierarchy:
faster
more expensive per bit
smaller capacity
Going down the hierarchy:
slower / higher latency
larger capacity
cheaper per bit
How is a high-level cache hierarchy organized in a multi-core processor?
Each core can have its own nearby caches
Often the L1 cache is split into:
L1-I = instruction cache
L1-D = data cache
Lower levels such as L2 may be unified
The last-level cache (often L3) is often shared between cores
Important cache design choices mentioned on the slide:
Split vs. unified
Inclusive vs. exclusive
Non-exclusive
Meanings:
Split cache: separate instruction and data caches
Unified cache: one cache for both instructions and data
Inclusive: data held in the per-core caches is also kept in the larger shared cache, so copies exist at multiple levels
Exclusive: data is kept in only one level as much as possible
Non-exclusive: mixed behavior
Cache latency is shown in processor cycles
Example values on the slide are rough orders of magnitude:
split caches: about 10 cycles
unified cache: about 100 cycles
shared L3: about 1000 cycles
Example note: at 1 GHz, 1 cycle = 1 ns
What are split caches and unified caches?
Split caches use separate caches for:
instructions
data
Example: L1-I for instructions and L1-D for data
Unified caches store both instructions and data in the same cache
So the main difference is:
split = separate instruction/data caches
unified = one common cache for both
What are inclusive, exclusive, and non-exclusive caches?
Inclusive caches:
data is stored in both the core’s private/exclusive caches and the shared caches
so the same data can be present in multiple cache levels at once
Exclusive caches:
data is stored in either the core’s private/exclusive caches or the shared caches
the goal is to avoid duplication across levels
Non-exclusive caches (or partially inclusive caches):
data may be stored in both private and shared caches
but it does not have to be
So:
inclusive = always duplicated across levels
exclusive = only one level
non-exclusive = duplication is allowed but not required
What is the main content of the “Type of Cache” part in this chapter?
This part explains important cache types / design choices, especially:
split vs. unified caches
inclusive vs. exclusive vs. non-exclusive caches
These describe how instructions and data are organized and whether data is duplicated across cache levels
What is cache associativity, and how do direct-mapped, set-associative, and fully associative caches differ?
Cache associativity describes how many places a memory block is allowed to go inside the cache
In a set-associative cache, a memory block maps to one set, but can be placed in any of the X ways inside that set
In a fully associative cache, a memory block can be placed in any cache line in the whole cache
you can think of it as 1 set with many ways
it reduces cache misses, but needs more complex hardware and more power
mainly used in very small caches close to the CPU
In a direct-mapped cache, a memory block can go to only one exact cache line
you can think of it as 1 way per set
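The three placement schemes can be sketched as one function, where a direct-mapped cache is the 1-way case and a fully associative cache is the single-set case; the cache sizes below are assumptions for illustration:

```python
# Where can a block go? A cache with `lines` total lines and `ways` lines per set
# has lines // ways sets; the block maps to set (block_addr mod num_sets) and may
# occupy any of that set's ways.
def candidate_lines(block_addr, lines, ways):
    num_sets = lines // ways              # fully associative: ways == lines -> 1 set
    s = block_addr % num_sets
    return [s * ways + w for w in range(ways)]

block = 13
print(candidate_lines(block, lines=8, ways=1))  # direct-mapped: exactly one line
print(candidate_lines(block, lines=8, ways=4))  # 4-way: 4 candidates in one set
print(candidate_lines(block, lines=8, ways=8))  # fully associative: any of the 8 lines
```

The number of candidate positions is exactly what "placement flexibility" means in the comparison that follows.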
How do the three associativity types compare in flexibility, miss rate, hardware cost, and typical use?
Direct-mapped cache
placement flexibility: very low
each block has exactly one location
cache miss rate: high because of many conflict misses
hardware and energy cost: very low
typical use: very small/simple caches, cost- and energy-efficient systems
Set-associative cache
placement flexibility: medium
each block has X possible locations inside one set
cache miss rate: low
hardware and energy cost: medium
typical use: standard in modern CPUs, especially L1, L2, L3
Fully associative cache
placement flexibility: very high
block can go into any cache line
cache miss rate: very low
hardware and energy cost: very high
typical use: very small special-purpose caches, e.g. TLB
Easy summary:
direct-mapped = simplest but most conflicts
fully associative = fewest conflicts but most expensive
set-associative = practical compromise
How is a physical memory address divided in cache architecture, and what do tag, index, and offset do?
A physical memory address is divided into:
Tag
Index
Offset
Index bits decide which cache set is selected
After the set is selected, all ways in that set are checked
The tag stored in each cache line is compared with the address tag
If the tag matches and the line is valid, it is a cache hit
If there is no match, or the entry is invalid, it is a cache miss
The offset bits select the exact byte inside the data block/cache line
A cache line contains at least:
a valid bit
a tag
the data block
In the shown example, the cache is 4-way set-associative with 8 sets
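For a 4-way cache with 8 sets, the address split can be sketched as follows; the 64-byte line size is an assumption, since the line size is not given here:

```python
# Split a physical address into tag / index / offset.
# 8 sets -> 3 index bits; an assumed 64-byte line -> 6 offset bits.
OFFSET_BITS = 6
INDEX_BITS = 3

def split_address(addr):
    offset = addr & ((1 << OFFSET_BITS) - 1)                # byte within the line
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1) # which set to search
    tag = addr >> (OFFSET_BITS + INDEX_BITS)                # identifies the block
    return tag, index, offset

tag, index, offset = split_address(0x1A2B3C)
print(hex(tag), index, offset)
```

Note that associativity does not appear in the split: the 4 ways of the selected set are searched in parallel by tag comparison, not addressed by extra bits.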
How does cache lookup work in a cache line, and what does PIPT mean?
The cache controller uses the index to choose the correct cache set
Then it compares the address tag bits with the tags of all ways in that set that have the valid bit set
If a tag matches, the data is already in the cache → cache hit
If no tag matches, the data is not there or is invalid → cache miss
On a hit, the offset bits are used to locate the correct byte inside the cache line’s data block
The slide shows PIPT = Physically Indexed, Physically Tagged
both set selection and tag comparison use the physical address
It also mentions other designs:
VIVT = Virtually Indexed, Virtually Tagged
VIPT = Virtually Indexed, Physically Tagged
What is PIPT cache addressing, and what are its advantages and disadvantages?
PIPT = Physically Indexed, Physically Tagged
Both the index bits and the tag bits are taken from the physical memory address
This means the processor must first do address translation from virtual address → physical address before accessing the cache
Because of that, cache access gets extra latency
So PIPT is usually not used in the caches closest to the processor
In the comparison table:
index derived from: physical address
tag derived from: physical address
TLB lookup required before cache: yes
cache access latency: higher
homonym problem: no
synonym problem: no
design complexity: low
typical use: L2 / L3 caches
What is VIVT cache addressing, and what are its advantages and disadvantages?
VIVT = Virtually Indexed, Virtually Tagged
Both the index bits and the tag bits are taken from the virtual memory address
No address translation is needed before cache lookup
So cache access can be very fast / very low latency
But VIVT has two important problems because virtual addresses are not globally unique:
Homonym problem: the same virtual address in different processes may refer to different physical addresses
Synonym problem: different virtual addresses may refer to the same physical address
The OS or cache controller must handle these problems
In the comparison table:
index derived from: virtual address
tag derived from: virtual address
TLB lookup required before cache: no
cache access latency: very low
homonym problem: yes
synonym problem: yes
design complexity: high
typical use: mostly historical
What is VIPT cache addressing, and why is it commonly used?
VIPT = Virtually Indexed, Physically Tagged
The index bits come from the virtual address
The tag bits come from the physical address
This allows the processor to do:
cache indexing and
address translation
at the same time
So translation can happen in parallel with the cache request
This makes cache access faster than PIPT, but safer than VIVT
The physical tag prevents homonym problems
But synonym problems are still possible
In the comparison table:
index derived from: virtual address
tag derived from: physical address
TLB lookup required before cache: no, because it is done in parallel
cache access latency: low
homonym problem: no
synonym problem: possible
design complexity: medium
typical use: standard L1 caches
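Why the synonym problem is only "possible" can be sketched with a standard rule of thumb: if all index and offset bits fall inside the page offset, the virtual and physical index are identical, so synonyms always land in the same set and can be detected by the physical tag. The page and line sizes below are assumptions:

```python
# VIPT aliasing check (sketch). Assumed: 4 KiB pages, 64-byte cache lines.
PAGE_SIZE = 4096
LINE_SIZE = 64

def vipt_alias_free(cache_size, ways):
    sets = cache_size // (LINE_SIZE * ways)
    index_plus_offset_bits = (sets * LINE_SIZE).bit_length() - 1
    page_offset_bits = PAGE_SIZE.bit_length() - 1
    return index_plus_offset_bits <= page_offset_bits   # index fits in page offset?

print(vipt_alias_free(32 * 1024, ways=8))   # 32 KiB 8-way: no aliasing possible
print(vipt_alias_free(64 * 1024, ways=8))   # 64 KiB 8-way: synonyms can alias
```

This constraint (cache size per way at most one page) is one reason L1 caches grow by adding ways rather than sets.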
How do PIPT, VIVT, and VIPT compare?
PIPT
safest and simple
no homonym or synonym problem
but slower because translation must happen first
common for L2/L3 caches
VIVT
fastest cache lookup
no translation needed before access
but has both homonym and synonym problems
mostly historical
VIPT
compromise between speed and correctness
indexing uses virtual address, tag uses physical address
translation and cache lookup happen in parallel
no homonym problem, but synonym problem can still happen
standard for L1 caches
What is the basic cache control flow for a read or write request?
A memory read request or memory write request comes in
The cache controller:
selects the cache set
compares the tag bits of all valid cache lines in that set
If there is a read/write hit:
the correct data is selected using the offset bits
then the cache either returns data or updates data
The slide also points ahead to write behavior:
write-back vs. write-through
What is the main point of the “Cache Policies” section starting here?
The next topic is cache policies
From the control-flow slide, an important policy question is how writes are handled:
write-back
write-through
So this section will explain the rules the cache uses when data is read, written, replaced, or missed
What are the main write-hit policies: write-back and write-through?
A write hit means the data being written is already in the cache
Write-back:
data is changed only in the cache line, not immediately in main memory
this makes cache and main memory temporarily inconsistent
therefore each cache line needs a dirty bit
the dirty bit shows that the cache line was modified and main memory does not yet contain the newest value
if such a line is later invalidated/evicted, the changed data must first be written back to memory, otherwise the changes are lost
pros from the summary table: high performance
cons: needs a dirty bit
typical use: modern CPUs
Write-through:
data is written at the same time to the cache and to main memory
so cache and memory stay consistent
no dirty bit is needed
pros: simple, consistent
cons: high memory traffic
typical use: simple systems
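The traffic difference can be sketched with a minimal model (an assumption for illustration, not the lecture's code): ten stores hitting the same line cause ten memory writes under write-through, but only one deferred write under write-back:

```python
# Minimal write-hit model: write-through sends every store to memory;
# write-back only sets the dirty bit and pays one write at eviction time.
class Line:
    def __init__(self):
        self.dirty = False

def write_hit(line, policy, mem_writes):
    if policy == "write-through":
        mem_writes += 1          # cache and memory updated together
    else:                        # write-back
        line.dirty = True        # memory updated later, on eviction
    return mem_writes

line, traffic = Line(), 0
for _ in range(10):              # ten stores hitting the same line
    traffic = write_hit(line, "write-through", traffic)
print(traffic)                   # 10 memory writes

line2, traffic2 = Line(), 0
for _ in range(10):
    traffic2 = write_hit(line2, "write-back", traffic2)
traffic2 += 1 if line2.dirty else 0   # single write-back when the line is evicted
print(traffic2)                  # 1 memory write
```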
What are the main write-miss policies: write-allocate and no-write-allocate?
A write miss means the data being written is not currently in the cache
Write-allocate:
on a write miss, the system first loads the block into the cache from memory
then the write is performed in the cache
usually combined with write-back
pros: good locality
cons: needs an extra read first
No-write-allocate:
on a write miss, the cache is not filled with that block
the write goes directly to main memory
no data is cached in this case
usually combined with write-through
pros: simple
cons: can be slow for writes
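The two miss policies can be sketched side by side in a minimal model (assumed for illustration); the bus-operation strings are hypothetical labels:

```python
# Minimal write-miss model: write-allocate fetches the block first, then writes
# into the cache; no-write-allocate bypasses the cache entirely.
def write_miss(cache, addr, policy):
    """Return the list of bus operations performed for this miss."""
    ops = []
    if policy == "write-allocate":
        ops.append("read block from memory")    # the extra read mentioned above
        cache.add(addr)
        ops.append("write into cache")          # future accesses now hit
    else:                                       # no-write-allocate
        ops.append("write directly to memory")  # cache contents unchanged
    return ops

cache = set()
print(write_miss(cache, 0x40, "write-allocate"))     # block 0x40 is now cached
print(write_miss(cache, 0x80, "no-write-allocate"))  # block 0x80 stays uncached
print(sorted(cache))
```

The "good locality" benefit shows up on the next access: 0x40 would hit, 0x80 would miss again.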
How does the cache control flow work for hits, misses, and replacement?
A memory read or memory write request enters the cache
The cache controller:
selects the cache set
compares the tag bits of valid cache lines
If there is a read/write hit:
the cache uses the offset bits to select the correct data
then it either returns data or updates data
the exact write behavior depends on write-back vs. write-through
If there is a read miss:
the request is relayed to the next cache level or main memory until the data is found
If there is a write miss:
behavior depends on write-allocate vs. no-write-allocate
When a new block must be inserted:
first check if the set contains an empty cache line, i.e. one whose valid bit is not set
if yes, insert there
if not, one line must be evicted using a replacement policy
Replacement examples mentioned:
LRU
LFU
MRU
What are the main cache replacement policies: LRU, LFU, and MRU?
Replacement policy decides which cache line is evicted when the cache/set is full
LRU (Least Recently Used):
evicts the cache line with the oldest age
meaning the line that has not been used for the longest time
usually gives a good hit rate
but storing exact age causes hardware overhead
for caches with many ways, systems often use pseudo-LRU to reduce overhead
in the summary table: standard replacement policy
LFU (Least Frequently Used):
evicts the line that was used the least often
works well for stable access patterns
but adapts slowly if the pattern changes
usage is rare
MRU (Most Recently Used):
evicts the most recently used cache line
can help in some streaming situations
but is usually bad for locality
usage is very rare
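LRU bookkeeping can be sketched with an ordered map as a minimal model of one cache set (an illustration, not how the hardware tracks ages):

```python
from collections import OrderedDict

# Minimal one-set LRU sketch: a hit moves the line to the "youngest" end;
# a miss in a full set evicts the line at the "oldest" end first.
class LRUSet:
    def __init__(self, ways):
        self.ways = ways
        self.lines = OrderedDict()

    def access(self, tag):
        if tag in self.lines:
            self.lines.move_to_end(tag)     # hit: refresh the line's age
            return "hit"
        if len(self.lines) == self.ways:
            self.lines.popitem(last=False)  # evict the least recently used line
        self.lines[tag] = True
        return "miss"

s = LRUSet(ways=2)
results = [s.access(t) for t in ["A", "B", "A", "C", "B"]]
print(results)   # ['miss', 'miss', 'hit', 'miss', 'miss'] - C evicts B, B evicts A
```

Real hardware avoids this per-access reordering cost, which is exactly why pseudo-LRU exists for caches with many ways.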
What is the overall summary of cache policies from the control-flow table?
Write hit + write-through:
write immediately to next level
pros: simple, consistent
cons: high traffic
used in simple systems
Write hit + write-back:
update cache, write later
pros: high performance
cons: dirty bit needed
used in modern CPUs
Write miss + write-allocate:
load block first, then write
pros: good locality
cons: extra read
often used with write-back
Write miss + no-write-allocate:
write bypasses cache
pros: simple
cons: slow for writes
often used with write-through
Read miss:
fetch data from lower level
correctness is the main benefit
drawback is high latency
Replacement:
LRU = standard
LFU = rare
MRU = very rare
What is DMA and why is it useful?
DMA = Direct Memory Access
It is a method that allows peripherals/devices to transfer data directly to or from memory
This happens without continuous CPU intervention
Purpose / benefits of DMA:
improves efficiency of data transfer
frees the CPU for other tasks
speeds up I/O by avoiding the CPU bottleneck
What are the key characteristics and advantages of DMA?
DMA uses a DMA controller (DMAC) to manage data transfers
It can transfer large blocks of data with minimal overhead
Devices such as network cards, disk drives, and graphics cards use DMA to transfer data directly to/from memory with very little CPU involvement
Main advantages:
reduces CPU workload
makes I/O faster and more efficient
How does DMA basically work in the system diagram?
First, the CPU programs the DMA controller
The DMA controller stores information such as:
address
count
control
Then the device/controller sends a DMA request
Data is transferred between the device buffer and main memory over the bus
After the transfer is done, the DMA controller sends an interrupt to the CPU
So the CPU starts the transfer, but does not move every byte itself
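The program-transfer-interrupt sequence can be sketched as follows; the function and its parameters are hypothetical, a minimal model of the steps above:

```python
# DMA flow sketch (assumed, simplified):
# the CPU programs address/count/control once, the controller moves the data,
# and a single interrupt signals completion.
def dma_transfer(device_buffer, ram, dest, count, on_done):
    addr = dest                          # 1. CPU programmed destination + count
    for word in device_buffer[:count]:   # 2-4. controller moves data over the bus
        ram[addr] = word                 #      without the CPU copying each word
        addr += 1
    on_done()                            # 5. interrupt: transfer complete

ram = {}
done = []
dma_transfer(device_buffer=[10, 20, 30], ram=ram, dest=0x100, count=3,
             on_done=lambda: done.append("interrupt"))
print(ram)    # {256: 10, 257: 20, 258: 30}
print(done)   # the CPU is notified exactly once, at the end
```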
What is the real-life movie example of a DMA transfer?
Scenario: you want to watch a movie stored on your HDD or SSD
DMA is used to move the movie data from the storage device into RAM
This lets the CPU avoid handling every small data movement itself
The CPU can then focus more on other work, like later decoding and displaying the movie
What happens in DMA transfer Step 1?
Step 1 is preparation by the CPU
Example: you open a media player and press Play
The CPU sets up the DMA controller by specifying:
Source: where the movie file is on the hard drive
Destination: where in RAM the data should go
Size: how much data should be transferred
The CPU also tells the disk controller to start reading the movie data and place it into its internal buffer
What happens in DMA transfer Steps 2 and 3?
Step 2: DMA controller takes over
it sends a read request to the disk controller to start transferring data
Step 3: Data transfer
the disk controller reads the movie data from the hard drive and places it in its internal buffer
the DMA controller transfers chunks of data directly from the disk controller’s buffer into RAM
this happens without CPU involvement
the transfer uses the system bus
What happens in DMA transfer Step 4?
Step 4 is the repeat/loop phase until completion
For each chunk of data:
the DMA controller updates the destination address in RAM
it also reduces the remaining data size/count
meanwhile the disk controller keeps filling its internal buffer from the hard drive
This repeats until the required part of the movie, or the whole file, is in RAM
What happens in DMA transfer Steps 5 and 6?
Step 5: Completion notification
when all data has been transferred, the DMA controller sends an interrupt to the CPU
this tells the CPU that the movie data is now ready in RAM
Step 6: Playback
the CPU now focuses on decoding and displaying the movie
the media player accesses the data directly from RAM
What are the main DMA operating modes?
Word-at-a-time mode
also called cycle stealing
DMA occasionally “steals” the bus from the CPU for a few cycles to do short transfers
Block mode / burst mode
a whole series of transfers is done at once
can be more efficient
but long bursts can block the bus for the CPU or other devices for a long time
Fly-by mode
the DMA controller tells the device to store/read data directly to/from memory; the data does not pass through the DMA controller
in the alternative mode, the DMA controller itself reads each word and then writes it to the target
this second mode is slower per word, but supports device-to-device and memory-to-memory copies
Why is an internal device buffer required, and when is DMA not always useful?
An internal device buffer is needed because:
checks such as checksums may need to be verified before transferring data to memory
the device cannot wait for the memory bus every time, because data from the disk arrives at a steady rate
DMA is not always meaningful:
if the CPU is much faster than the DMA controller
in embedded devices, where reducing complexity and cost may be more important
What is the main summary of this whole topic?
Caches improve performance because they provide much faster access than main memory
The lecture covered basic cache architectures
It also covered cache policies and control flow
And it introduced DMA, which improves I/O by letting devices transfer data directly to/from memory with little CPU help
A processor uses an L1 cache with VIPT addressing. Two processes use the same virtual address, but it maps to different physical pages. Which issue is still possible without additional handling?
A. Homonym problem only
B. Synonym problem only
C. Both homonym and synonym problems
D. No correctness issues can occur
A. No
Homonym = same virtual address, different physical addresses
In VIPT, the tag is physical, so these are distinguished correctly
B. Yes
Synonym = different virtual addresses, same physical address
In VIPT, indexing is still virtual, so this problem can still happen
C. No
Homonym is handled by the physical tag
So not both remain possible
D. No
VIPT is not perfect
Synonym issues can still occur without extra handling
Correct: B
A system applies a write-back + write-allocate policy. A store instruction targets an address that is not currently cached. What sequence of actions is most consistent with this design?
A. Write directly to main memory and skip the cache
B. Load the cache line from memory, then update it and mark it dirty
C. Invalidate the corresponding cache set and retry
D. Forward the write immediately to the next cache level
B. Yes
Write-allocate on a write miss means: first load the cache line into the cache
Then the store updates the cached copy
Write-back means memory is not updated immediately, so the line is marked dirty
A. No
That is no-write-allocate behavior
C. No
A write miss does not mean the whole set is invalidated
D. No
Immediate forwarding to lower level matches write-through, not write-back
Correct: B
Two CPU cores share a last-level cache (L3). A workload shows frequent evictions in L3, while private L1 and L2 caches have low miss rates. What is the most plausible explanation?
A. L3 has lower latency than L2
B. L3 contention arises due to inter-core sharing
C. L1 caches are directly mapped
D. The replacement policy is irrelevant in shared caches
B. Yes
A shared last-level cache is a common resource: both cores insert lines into it
If the combined working sets exceed the L3 capacity, the cores evict each other's lines
This explains frequent L3 evictions even while each core's private L1/L2 behave well
A. No
L3 has higher latency than L2, and latency does not explain evictions anyway
C. No
The mapping of the private L1 caches does not cause evictions in the shared L3
D. No
The replacement policy still decides which line is evicted in a shared cache
Correct: B
A designer chooses VIVT addressing to avoid address-translation latency in the L1 cache. What trade-off is introduced by this choice?
A. Increased cache access latency
B. Potential synonym and homonym problems
C. Forced inclusiveness across cache levels
D. Mandatory dirty bits for all cache lines
B. Yes
VIVT avoids translation before cache access, so it gives very low latency
But it introduces both homonym and synonym problems
A. No
VIVT is chosen to reduce, not increase, cache access latency
C. No
VIVT does not force caches to be inclusive
D. No
Dirty bits depend on write-back policy, not on VIVT addressing
Correct: B
A cache uses write-through + no-write-allocate. A store instruction causes a write miss. Which action is most likely?
A. Fetch the cache line into cache, then write
B. Write to cache only and delay memory update
C. Write directly to main memory without caching
D. Allocate the line only in L3
C. Yes
No-write-allocate on a write miss means: do not load the block into the cache
Write-through means the write goes directly to main memory
A. No
That would be write-allocate
B. No
That is closer to write-back, not write-through
D. No
The given policy says nothing about allocating only in L3
Correct: C
Why is the dirty bit essential in write-back caches?
A. To speed up read hits
B. To ensure modified data is written back before eviction
C. To distinguish instruction and data caches
D. To avoid synonym problems
B. Yes
In write-back, changed data may exist only in the cache
The dirty bit shows that the line was modified
Before eviction, that modified data must be written back to memory
A. No
Dirty bits are not for speeding up read hits
C. No
That is about split vs. unified caches
D. No
Synonym problems are about addressing, not dirty bits
Correct: B
Compared to LRU, which access pattern is most negatively affected by an MRU replacement policy?
A. Repeated reuse of recently accessed data
B. Temporal locality with frequent re-access
C. One-pass streaming access with no reuse
D. Random access with uniform probability
B. Yes
MRU removes the most recently used line
That is especially bad when the program has temporal locality
Temporal locality means recently accessed data will likely be used again soon
LRU keeps such data, but MRU throws it away
A. Also close, but not the best choice
“Repeated reuse of recently accessed data” describes the same bad case
But B is the more general and standard formulation
C. No
In one-pass streaming, data is usually not reused
So removing the most recently used block is often not very harmful
MRU can even work reasonably well there
D. No
With random access, there is no strong recent-use pattern to exploit
So MRU is not especially worse for that specific pattern compared to locality-heavy patterns
Correct: B
Why do DMA-based devices typically require an internal buffer during data transfer?
A. Because the CPU must copy data into the buffer
B. Because memory bus availability may not match the device’s data rate
C. Because DMA cannot access main memory directly
D. Because interrupts replace buffering
B. Yes
The device may produce/consume data at a steady rate
But the memory bus is not always immediately available
So an internal buffer is needed to hold data temporarily
A. No
With DMA, the CPU does not copy every piece of data itself
C. No
DMA can access main memory directly
D. No
Interrupts tell the CPU that a transfer is done or needs attention
They do not replace buffering
Correct: B
“DMA always improves system performance.” Which argument best contradicts this statement?
A. DMA is slower than CPU in all systems
B. DMA may be inefficient if the CPU is much faster or in low-cost embedded systems
C. DMA works only with HDDs, not SSDs
D. DMA disables cache usage
B. Yes
DMA is not always useful
It may be inefficient when the CPU is much faster than the DMA controller
It may also be unattractive in low-cost embedded systems where simplicity matters more
A. No
DMA is not slower in all systems
C. No
DMA is used with many devices, not only HDDs
D. No
DMA does not generally disable cache usage
Correct: B