Computer Architecture - Cache Organisation Part 1
Why Cache Organisation?
- Cache organization and design have a significant impact on the performance and functionality of modern computers.
- Caches are critical to system performance and a major area in contemporary machine design.
- Understanding caches allows optimization of application code to maximize performance and avoid performance problems. It also enables anticipating application performance problems by relating application design to cache organization.
- Understanding cache limitations helps in comparing the real performance of different machines when shopping for value-for-money hardware.
- The performance of a computer system depends on the CPU's performance and the storage hierarchy it uses.
- Ideally, using only the fastest RAM would allow the CPU to fetch instructions and data without delay, but this is very costly.
- Alternatively, building the storage hierarchy from economical DRAMs, SDRAMs, SSDs, and rotating disks is affordable, but these devices are much slower.
- If storage is significantly slower than the CPU, the CPU may stall, waiting for instructions from memory.
- The performance loss due to a CPU stall depends on the hardware. For example, if the CPU can execute an instruction every 2 nanoseconds but a DRAM access takes 40 nanoseconds, the CPU wastes roughly 38 nanoseconds on every memory access.
- The solution is to use a fast, small memory (cache) between the CPU and main memory.
Cache Memory Layout
- A cache is a small, fast memory between the CPU and main memory (DRAM).
- Registers are very fast, with access times around 1 nanosecond.
- Cache memory (SRAM) is moderately fast, with access times of a few nanoseconds (e.g., 5 ns) — slower than registers but much faster than main memory.
Spatial and Temporal Locality
- Spatial Locality: Accessing a memory location implies that nearby memory locations are likely to be accessed soon.
- Temporal Locality: Accessing a memory location implies it will likely be accessed again shortly.
Locality Example
- Consider a loop that writes to a data structure: the CPU fetches each instruction in the loop body once per iteration and writes one element of the structure per iteration.
- This exhibits both temporal and spatial locality.
- Programs access memory most frequently to fetch instructions, less frequently to read data, and least frequently to write data.
Caches
- Caches are small memories between the CPU and main memory, operating at speeds similar to CPU registers.
- Accessed instructions and data are stored in the cache for quick retrieval.
- Accessing instructions or data already in the cache minimizes time loss.
Issues in Caching
- How to find data in the cache.
- How to make the cache invisible to the programmer and OS.
- How to maximize the ratio of cache hits to misses.
- How to minimize cost and maximize cache speed.
- How to ensure consistency between the cache and main memory.
Directly Mapped Caches
- The cache contains a Cache Data Memory, which holds the cached contents; in this simple model, each cache line holds the contents of one memory location.
- Each entry has a Valid Bit, set when the entry contains valid data.
- Each entry has a Tag, stored in the Cache Tag Memory.
- A comparator compares tag values with address bits.
Directly Mapped Cache - Hit
- Low Order Address Bits (LOAB) index into the Cache Tag Memory.
- The comparator checks High Order Address Bits (HOAB) against the Cache Tag Memory value indexed by the LOAB.
- If the values match, it's a Cache Hit, and the value from Cache Data Memory is retrieved.
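The address split that drives the lookup can be sketched as a couple of shifts and masks. The bit widths below are illustrative assumptions (an 8-bit index into a 256-line cache), not fixed by the notes:

```python
def split_address(addr: int, index_bits: int = 8):
    """Split an address into High Order Address Bits (the tag) and
    Low Order Address Bits (the cache line index).
    index_bits = 8 is an assumed cache depth of 256 lines."""
    index = addr & ((1 << index_bits) - 1)  # LOAB: selects the cache line
    tag = addr >> index_bits                # HOAB: stored in / compared against the Cache Tag Memory
    return tag, index

tag, index = split_address(0x1234)
assert (tag, index) == (0x12, 0x34)  # 0x34 picks the line, 0x12 is the tag
```

On a hit, the tag stored at line `index` equals `tag`; the comparator performs exactly this equality test in hardware.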
Directly Mapped Cache - Miss #1
- If the cache line is empty (Valid Bit = 0).
- Low Order Address Bits (LOAB) index into the Cache Tag Memory.
- High Order Address Bits (HOAB) are compared against the value in the Cache Tag Memory location.
- Since the Valid Bit is zero, it's a Cache Miss, requiring the CPU to fetch data from memory.
Directly Mapped Cache - Miss #2
- Assume cache contains an entry with identical Low Order Address Bits (LOAB) but different High Order Address Bits (HOAB).
- Low Order Address Bits (LOAB) index into the Cache Tag Memory.
- The comparator tests High Order Address Bits (HOAB) against the value held in the Cache Tag Memory location indexed by the Low Order Address Bits.
- Since the values are different, it is a “Cache Miss”; the CPU must fetch the data from memory instead.
Directly Mapped Cache Operation
- Spatial locality suggests that accessing one location means nearby locations (with common High Order Address Bits) will likely be accessed.
- This cache can hold only one location for each set of Low Order Address Bits (LOAB) at a time.
- Increasing the number of cache lines (the cache depth) increases C, where the cache has 2^C lines; for N-bit addresses, this reduces the tag size, which is N − C bits.
Directly Mapped Cache Operation - Filling
- On every cache miss, data fetched from memory is saved in the cache.
- The number of consecutive instructions that fit in the cache, assuming one instruction per line, depends on the number of cache lines.
Directly Mapped Cache - Pros/Cons
- Advantage: simple design.
- Disadvantage: Frequent accesses to locations with the same Low Order Address Bits (LOAB) but different High Order Address Bits (HOAB) cause overwrites.
- Cache performance is measured by the hit ratio: Hit ratio = Hits / (Hits + Misses)
- A reduced hit ratio increases the average access time and reduces total performance.
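The effect of the hit ratio on average access time follows directly: a hit costs one cache access, while a miss costs the cache access plus the memory access. A minimal sketch, using illustrative timings (5 ns cache, 40 ns memory — assumptions, not figures from the notes):

```python
def average_access_time(hits: int, misses: int,
                        t_cache_ns: float = 5.0, t_mem_ns: float = 40.0) -> float:
    """Average access time in ns for a given hit/miss count.
    Timings are illustrative assumptions."""
    h = hits / (hits + misses)                 # hit ratio
    # Hit: cache access only.  Miss: cache access + memory access.
    return h * t_cache_ns + (1 - h) * (t_cache_ns + t_mem_ns)

# A 90% hit ratio already hides most of the 40 ns memory latency:
assert abs(average_access_time(90, 10) - 9.0) < 1e-9
```

Note how quickly performance degrades: dropping from a 90% to a 50% hit ratio nearly triples the average access time (9 ns to 25 ns) under these assumed timings.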
Direct Mapped Cache (a worked example)
- Read 0x12345: High Order Address Bits (HOAB) is 12, Low Order Address Bits (LOAB) is 34, byte-in-line is 5.
- The cache line indexed by 34 is checked; if the stored tag does not match 12 (or the line is invalid), it's a MISS!
- Fill the cache line: tag 12, data 003456729A7B89453400FFAABBCCDDEE. Re-reading 0x12345 is now a cache HIT!
- Reading 0x12340 will also be a HIT in the same cache line.
- Reading 0x22343 results in a MISS, filling the cache line with new data.
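The worked example can be reproduced with a minimal direct-mapped cache model. The address layout follows the example above: a 20-bit address split into an 8-bit tag (HOAB), an 8-bit line index (LOAB), and a 4-bit byte-in-line offset (16-byte lines); only tag matching is modelled, not the data itself:

```python
class DirectMappedCache:
    """Minimal direct-mapped cache model: 256 lines, each with a
    valid bit and an 8-bit tag, matching the worked example."""

    def __init__(self):
        self.valid = [False] * 256   # one Valid Bit per line
        self.tag = [0] * 256         # the Cache Tag Memory

    def access(self, addr: int) -> str:
        tag = addr >> 12             # HOAB, e.g. 0x12 for address 0x12345
        index = (addr >> 4) & 0xFF   # LOAB, e.g. 0x34 (bits 4..11)
        if self.valid[index] and self.tag[index] == tag:
            return "HIT"
        self.valid[index] = True     # miss: fill the line from memory
        self.tag[index] = tag
        return "MISS"

c = DirectMappedCache()
assert c.access(0x12345) == "MISS"  # cold miss, line 0x34 filled with tag 0x12
assert c.access(0x12345) == "HIT"   # same address again: hit
assert c.access(0x12340) == "HIT"   # same line, different byte: still a hit
assert c.access(0x22343) == "MISS"  # same index 0x34, tag 0x22: line overwritten
assert c.access(0x12345) == "MISS"  # the original tag 0x12 was evicted
```

The last two assertions show the direct-mapped weakness discussed above: two addresses sharing LOAB 0x34 keep evicting each other.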
Set Associative Caches
- Caches where two or more cache lines can have the same Low Order Address Bits (LOAB) but different High Order Address Bits (HOAB).
- They achieve good hit ratios even with a relatively small total cache size.
2-Way Set Associative Cache
- The simplest set associative cache.
- The total number of cache entries is split into two halves, each with its own comparator for testing the High Order Address Bits (HOAB).
- The tag field size must be increased by one bit.
- Typically achieves a higher hit ratio than a directly mapped cache of the same size.
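A sketch of the 2-way behaviour, showing that two addresses with the same LOAB no longer evict each other. Assumptions: the same illustrative 8-bit tag / 8-bit index / 4-bit offset address layout as the direct-mapped example, the index width kept unchanged for simplicity, and LRU replacement within each set (the notes do not specify a replacement policy):

```python
class TwoWaySetAssociativeCache:
    """Sketch of a 2-way set associative cache: each index selects a
    set of two lines, each with its own tag comparator.
    Replacement within a set is LRU (an assumed policy)."""

    def __init__(self):
        # Each set holds up to two tags, most-recently-used last.
        self.sets = [[] for _ in range(256)]

    def access(self, addr: int) -> str:
        tag = addr >> 12             # HOAB
        index = (addr >> 4) & 0xFF   # LOAB selects the set
        s = self.sets[index]
        if tag in s:
            s.remove(tag)
            s.append(tag)            # mark as most recently used
            return "HIT"
        if len(s) == 2:
            s.pop(0)                 # evict the least recently used way
        s.append(tag)                # fill from memory
        return "MISS"

c = TwoWaySetAssociativeCache()
assert c.access(0x12345) == "MISS"  # cold miss fills one way of set 0x34
assert c.access(0x22343) == "MISS"  # same index, different tag: fills the other way
assert c.access(0x12345) == "HIT"   # both tags now coexist — no overwrite
assert c.access(0x22343) == "HIT"
```

Compare the last two lines with the direct-mapped case, where the same pair of addresses would miss forever.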
N-Way Set Associative Caches : Operation
- Left Miss: the data is not found in the left side of the cache.
- Right Hit: the data is found in the right side of the cache.
- Double Miss: the data is not found in either side of the cache.
4-Way, 8-Way Set Associative Caches
- Higher orders of set associativity are possible but hardware costs increase.
- Improvement in hit ratios diminishes beyond 2-way set associativity.
- This is a “diminishing return on investment” as hardware costs increase faster than performance gains.
Fully Associative Caches
- The order of set associativity equals the number of cache lines.
- Every cache line has its own comparator.
- For N cache lines, there are N comparators, and the cache is split N-way.
- Since the depth of each section is 1, there's no need for the Low Order Address Bits (LOAB).
- Complexity and cost make it impractical except for very small caches.
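The fully associative case can be sketched the same way: with no index bits, the whole address above the byte offset is the tag, and any line may hold any address. The line count (4) and the LRU replacement policy are illustrative assumptions:

```python
class FullyAssociativeCache:
    """Sketch of a fully associative cache: no LOAB index — every
    line's tag is compared in parallel (one comparator per line).
    N = 4 lines and LRU replacement are assumed for illustration."""

    def __init__(self, n_lines: int = 4):
        self.n_lines = n_lines
        self.tags = []               # most-recently-used tag kept last

    def access(self, addr: int) -> str:
        tag = addr >> 4              # entire address above the 4-bit byte offset
        if tag in self.tags:         # models the N parallel comparators
            self.tags.remove(tag)
            self.tags.append(tag)
            return "HIT"
        if len(self.tags) == self.n_lines:
            self.tags.pop(0)         # evict the least recently used line
        self.tags.append(tag)
        return "MISS"

c = FullyAssociativeCache()
assert c.access(0x12345) == "MISS"
assert c.access(0x22345) == "MISS"  # would conflict in a direct-mapped cache
assert c.access(0x12345) == "HIT"   # here both stay resident
```

The `tag in self.tags` test hides the real cost: in hardware it is N simultaneous comparators, which is why this design only scales to very small caches.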
Cache Busses
- The speed of the bus connecting the CPU to the cache significantly impacts performance.
- Performance improves with a faster or wider cache bus.
- Putting the cache on the same chip as the CPU is optimal but limited by transistor count.
Cache Hierarchies
- Large caches are expensive due to the SRAMs used.
- A cache hierarchy balances cost and hit ratio.
- A 2-Level cache hierarchy uses a very small, fast Level 1 (L1) cache and a larger, slower Level 2 (L2) cache.
- 2-Level hierarchies are most common.
2-Level Cache Hierarchy Operation
- CPU attempts to read data.
- L1 cache is checked; if the data is present (L1 Hit), it's retrieved.
- If the data is not in L1 (L1 Miss), L2 cache is checked.
- If the data is not in L2 (L2 Miss), it's fetched from main memory.
- The data is then placed in both L1 and L2 caches (Fill).
- If the data is in L2 (L2 Hit), it's moved to L1 cache.
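The lookup-and-fill sequence above can be sketched as a short function. Plain dictionaries stand in for the two cache levels (an illustrative simplification — real caches have the tag/index structure described earlier):

```python
def two_level_read(addr, l1, l2, memory):
    """Sketch of a 2-level cache read. l1 and l2 are dicts mapping
    addresses to data (stand-ins for real cache structures);
    memory is the backing store."""
    if addr in l1:                  # L1 hit: fastest path
        return l1[addr], "L1 hit"
    if addr in l2:                  # L1 miss, L2 hit: promote into L1
        l1[addr] = l2[addr]
        return l1[addr], "L2 hit"
    data = memory[addr]             # double miss: fetch from main memory
    l2[addr] = data                 # fill both levels on the way back
    l1[addr] = data
    return data, "memory"

l1, l2, mem = {}, {}, {0x40: "value"}
assert two_level_read(0x40, l1, l2, mem) == ("value", "memory")
assert two_level_read(0x40, l1, l2, mem) == ("value", "L1 hit")
l1.clear()                          # simulate an L1 eviction
assert two_level_read(0x40, l1, l2, mem) == ("value", "L2 hit")
```

The last assertion shows why the hierarchy works: even after L1 eviction, the larger L2 usually still holds the data, avoiding the full trip to main memory.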
Examples of Cache Implementation
- Intel Pentium Pro (1996):
- 16 KB L1 on-chip, 256-512 KB L2 off-chip.
- L1 split into data and instruction caches (Harvard Architecture).
- L1 instruction cache is 4-way set associative with 8 KB.
- L1 data cache is 2-way set associative with 8 KB.
- L2 cache is 2-way set associative with 256 or 512 KB.
- 64-bit cache bus.
- Pentium-III Xeon (1998):
- L1 on-chip; L2 off-chip in 512 KB, 1 MB, or 2 MB sizes.
- L1 split into data and instruction caches (Harvard Architecture).
- L2 cache is 2-way set associative with 512 KB, 1 MB, or 2 MB.
- 64-bit cache bus.
- CPU clock speed up to 600 MHz.
- Pentium III (2000):
- L1 and L2 both on a single chip.
- L1 split into data and instruction caches (Harvard Architecture).
- L2 cache is 8-way set associative with 256 KB and error correction logic.
- 256-bit cache bus.
- CPU clock speed 800 MHz or better.
- L2 cache occupies about 30% of the total chip area.
- AMD K7/Athlon (2000):
- L1 on-chip, L2 off-chip.
- L1 split into data and instruction caches (Harvard Architecture).
- L1 is 2-way set associative, total 128 KB.
- L2 cache available in 512 KB, 1 MB, 2 MB, 4 MB, or 8 MB sizes.
- 72-bit cache bus (64 bits data, 8 bits error control).
- CPU clock speed up to 1.1 GHz.
- 2nd Gen Intel® Core™ i7:
- Four or two execution cores
- 32-KB instruction and 32-KB data first-level cache (L1) for each core
- 256-KB shared instruction/data second-level cache (L2) for each core
- Up to 8-MB shared instruction/data third-level cache (L3), shared among all cores
- L1 and L2 for each core, L3 is shared
- Apple M1 ARM SoC 2021:
- Four high performance cores, Four high efficiency cores
- HP Cores: 192 KB instruction cache and 128 KB data cache (L1) per core; 12 MB L2 cache shared among the HP cores.
- HE Cores: 128 KB instruction cache and 64 KB data cache (L1) per core; 4 MB L2 cache shared among the HE cores.
Summary
- Topics covered: Directly mapped caches, Set associative caches, and Cache hierarchies.
- Next: Cache management policies, Cache coherency, Cache performance, and Cache design tradeoffs.