Computer Architecture - Cache Organisation Part 1
Why Cache Organisation?
- Cache organization and design have a significant impact on the performance and functionality of modern computers.
- Caches are critical to system performance and a major area in contemporary machine design.
- Understanding caches allows optimization of application code to maximize performance and avoid performance problems. It also enables anticipating application performance problems by relating application design to cache organization.
- Understanding cache limitations helps in comparing the real performance of different machines when shopping for value-for-money hardware.
- The performance of a computer system depends on the CPU's performance and the storage hierarchy it uses.
- Ideally, using only the fastest RAM would allow the CPU to fetch instructions and data without delay, but this is very costly.
- Alternatively, building the storage hierarchy from economical DRAMs, SDRAMs, SSDs, and rotating disks is affordable, but these devices are much slower.
- If storage is significantly slower than the CPU, the CPU may stall, waiting for instructions from memory.
- The performance loss due to a CPU stall depends on the hardware. For example, if the CPU can execute an instruction every 2 nanoseconds but a DRAM access takes 40 nanoseconds, the CPU wastes roughly 38 nanoseconds on every memory access.
- The solution is to use a fast, small memory (cache) between the CPU and main memory.
Cache Memory Layout
- A cache is a small, fast memory between the CPU and main memory (DRAM).
- Registers are very fast, with access times around 1 nanosecond.
- Cache memory (SRAM) is moderately fast, with access times of a few nanoseconds (e.g., 5 ns) — slower than registers but much faster than main memory.
Spatial and Temporal Locality
- Spatial Locality: Accessing a memory location implies that nearby memory locations are likely to be accessed soon.
- Temporal Locality: Accessing a memory location implies it will likely be accessed again shortly.
Locality Example
- Consider a loop that writes to a data structure: the CPU fetches each instruction in the loop body once per iteration and writes one element of the structure per iteration.
- This exhibits both temporal and spatial locality.
- Programs access memory most frequently to fetch instructions, less frequently to read data, and least frequently to write data.
Caches
- Caches are small memories between the CPU and main memory, operating at speeds similar to CPU registers.
- Accessed instructions and data are stored in the cache for quick retrieval.
- Accessing instructions or data already in the cache minimizes time loss.
Issues in Caching
- How to find data in the cache.
- How to make the cache invisible to the programmer and OS.
- How to maximize the ratio of cache hits to misses.
- How to minimize cost and maximize cache speed.
- How to ensure consistency between the cache and main memory.
Directly Mapped Caches
- The cache contains a Cache Data Memory, which holds the cached contents; in this simple model, each cache line holds the contents of one memory location.
- Each entry has a Valid Bit, set when the entry contains valid data.
- Each entry has a Tag, stored in the Cache Tag Memory.
- A comparator compares tag values with address bits.
Directly Mapped Cache - Hit
- Low Order Address Bits (LOAB) index into the Cache Tag Memory.
- The comparator checks High Order Address Bits (HOAB) against the Cache Tag Memory value indexed by the LOAB.
- If the values match, it's a Cache Hit, and the value from Cache Data Memory is retrieved.
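The address split that drives the lookup can be sketched as a couple of shifts and masks. The bit widths below are illustrative assumptions (an 8-bit index into a 256-line cache), not fixed by the notes:

```python
def split_address(addr: int, index_bits: int = 8):
    """Split an address into High Order Address Bits (the tag) and
    Low Order Address Bits (the cache line index).
    index_bits = 8 is an assumed cache depth of 256 lines."""
    index = addr & ((1 << index_bits) - 1)  # LOAB: selects the cache line
    tag = addr >> index_bits                # HOAB: stored in / compared against the Cache Tag Memory
    return tag, index

tag, index = split_address(0x1234)
assert (tag, index) == (0x12, 0x34)  # 0x34 picks the line, 0x12 is the tag
```

On a hit, the tag stored at line `index` equals `tag`; the comparator performs exactly this equality test in hardware.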
Directly Mapped Cache - Miss #1
- If the cache line is empty (Valid Bit = 0).
- Low Order Address Bits (LOAB) index into the Cache Tag Memory.
- High Order Address Bits (HOAB) are compared against the value in the Cache Tag Memory location.
- Since the Valid Bit is zero, it's a Cache Miss, requiring the CPU to fetch data from memory.
Directly Mapped Cache - Miss #2
- Assume cache contains an entry with identical Low Order Address Bits (LOAB) but different High Order Address Bits (HOAB).
- Low Order Address Bits (LOAB) index into the Cache Tag Memory.
- The comparator tests High Order Address Bits (HOAB) against the value held in the Cache Tag Memory location indexed by the Low Order Address Bits.
- Since the values are different, it is a “Cache Miss”; the CPU must fetch the data from memory instead.
Directly Mapped Cache Operation
- Spatial locality suggests that accessing one location means nearby locations (with common High Order Address Bits) will likely be accessed.
- This cache can hold only one location for each set of Low Order Address Bits (LOAB) at a time.
- Increasing the number of cache lines (the cache depth) increases C, where the cache has 2^C lines; for N-bit addresses, this reduces the tag size, which is N − C bits.
Directly Mapped Cache Operation - Filling
- On every cache miss, data fetched from memory is saved in the cache.
- The number of consecutive instructions that fit in the cache, assuming one instruction per line, depends on the number of cache lines.
Directly Mapped Cache - Pros/Cons
- Advantage: simple design.
- Disadvantage: Frequent accesses to locations with the same Low Order Address Bits (LOAB) but different High Order Address Bits (HOAB) cause overwrites.
- Cache performance is measured by the hit ratio: Hit ratio = Hits / (Hits + Misses)
- A reduced hit ratio increases the average access time and reduces total performance.
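The effect of the hit ratio on average access time follows directly: a hit costs one cache access, while a miss costs the cache access plus the memory access. A minimal sketch, using illustrative timings (5 ns cache, 40 ns memory — assumptions, not figures from the notes):

```python
def average_access_time(hits: int, misses: int,
                        t_cache_ns: float = 5.0, t_mem_ns: float = 40.0) -> float:
    """Average access time in ns for a given hit/miss count.
    Timings are illustrative assumptions."""
    h = hits / (hits + misses)                 # hit ratio
    # Hit: cache access only.  Miss: cache access + memory access.
    return h * t_cache_ns + (1 - h) * (t_cache_ns + t_mem_ns)

# A 90% hit ratio already hides most of the 40 ns memory latency:
assert abs(average_access_time(90, 10) - 9.0) < 1e-9
```

Note how quickly performance degrades: dropping from a 90% to a 50% hit ratio nearly triples the average access time (9 ns to 25 ns) under these assumed timings.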
Direct Mapped Cache (a worked example)
- Read 0x12345: High Order Address Bits (HOAB) is 12, Low Order Address Bits (LOAB) is 34, byte-in-line is 5.
- The cache line indexed by 34 is checked; if the stored tag does not match 12 (or the line is invalid), it's a MISS!
- Fill the cache line: tag 12, data 003456729A7B89453400FFAABBCCDDEE. Re-reading 0x12345 is now a cache HIT!
- Reading 0x12340 will also be a HIT in the same cache line.
- Reading 0x22343 results in a MISS, filling the cache line with new data.
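The worked example can be reproduced with a minimal direct-mapped cache model. The address layout follows the example above: a 20-bit address split into an 8-bit tag (HOAB), an 8-bit line index (LOAB), and a 4-bit byte-in-line offset (16-byte lines); only tag matching is modelled, not the data itself:

```python
class DirectMappedCache:
    """Minimal direct-mapped cache model: 256 lines, each with a
    valid bit and an 8-bit tag, matching the worked example."""

    def __init__(self):
        self.valid = [False] * 256   # one Valid Bit per line
        self.tag = [0] * 256         # the Cache Tag Memory

    def access(self, addr: int) -> str:
        tag = addr >> 12             # HOAB, e.g. 0x12 for address 0x12345
        index = (addr >> 4) & 0xFF   # LOAB, e.g. 0x34 (bits 4..11)
        if self.valid[index] and self.tag[index] == tag:
            return "HIT"
        self.valid[index] = True     # miss: fill the line from memory
        self.tag[index] = tag
        return "MISS"

c = DirectMappedCache()
assert c.access(0x12345) == "MISS"  # cold miss, line 0x34 filled with tag 0x12
assert c.access(0x12345) == "HIT"   # same address again: hit
assert c.access(0x12340) == "HIT"   # same line, different byte: still a hit
assert c.access(0x22343) == "MISS"  # same index 0x34, tag 0x22: line overwritten
assert c.access(0x12345) == "MISS"  # the original tag 0x12 was evicted
```

The last two assertions show the direct-mapped weakness discussed above: two addresses sharing LOAB 0x34 keep evicting each other.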
Set Associative Caches
- Caches where two or more cache lines can have the same Low Order Address Bits (LOAB) but different High Order Address Bits (HOAB).
- They achieve good hit ratios even with a relatively small total cache size.
2-Way Set Associative Cache
- The simplest set associative cache.
- The total number of cache entries is split into two halves, each with its own comparator for testing the High Order Address Bits (HOAB).
- The tag field size must be increased by one bit.
- Typically achieves a higher hit ratio than a directly mapped cache of the same size.
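A sketch of the 2-way behaviour, showing that two addresses with the same LOAB no longer evict each other. Assumptions: the same illustrative 8-bit tag / 8-bit index / 4-bit offset address layout as the direct-mapped example, the index width kept unchanged for simplicity, and LRU replacement within each set (the notes do not specify a replacement policy):

```python
class TwoWaySetAssociativeCache:
    """Sketch of a 2-way set associative cache: each index selects a
    set of two lines, each with its own tag comparator.
    Replacement within a set is LRU (an assumed policy)."""

    def __init__(self):
        # Each set holds up to two tags, most-recently-used last.
        self.sets = [[] for _ in range(256)]

    def access(self, addr: int) -> str:
        tag = addr >> 12             # HOAB
        index = (addr >> 4) & 0xFF   # LOAB selects the set
        s = self.sets[index]
        if tag in s:
            s.remove(tag)
            s.append(tag)            # mark as most recently used
            return "HIT"
        if len(s) == 2:
            s.pop(0)                 # evict the least recently used way
        s.append(tag)                # fill from memory
        return "MISS"

c = TwoWaySetAssociativeCache()
assert c.access(0x12345) == "MISS"  # cold miss fills one way of set 0x34
assert c.access(0x22343) == "MISS"  # same index, different tag: fills the other way
assert c.access(0x12345) == "HIT"   # both tags now coexist — no overwrite
assert c.access(0x22343) == "HIT"
```

Compare the last two lines with the direct-mapped case, where the same pair of addresses would miss forever.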
N-Way Set Associative Caches : Operation
- Left Miss: the data is not found in the left side of the cache.
- Right Hit: the data is found in the right side of the cache.
- Double Miss: the data is not found in either side of the cache.
4-Way, 8-Way Set Associative Caches
- Higher orders of set associativity are possible but hardware costs increase.
- Improvement in hit ratios diminishes beyond 2-way set associativity.
- This is a “diminishing return on investment” as hardware costs increase faster than performance gains.
Fully Associative Caches
- The order of set associativity equals the number of cache lines.
- Every cache line has its own comparator.
- For N cache lines, there are N comparators, and the cache is split N-way.
- Since the depth of each section is 1, there's no need for the Low Order Address Bits (LOAB).
- Complexity and cost make it impractical except for very small caches.
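The fully associative case can be sketched the same way: with no index bits, the whole address above the byte offset is the tag, and any line may hold any address. The line count (4) and the LRU replacement policy are illustrative assumptions:

```python
class FullyAssociativeCache:
    """Sketch of a fully associative cache: no LOAB index — every
    line's tag is compared in parallel (one comparator per line).
    N = 4 lines and LRU replacement are assumed for illustration."""

    def __init__(self, n_lines: int = 4):
        self.n_lines = n_lines
        self.tags = []               # most-recently-used tag kept last

    def access(self, addr: int) -> str:
        tag = addr >> 4              # entire address above the 4-bit byte offset
        if tag in self.tags:         # models the N parallel comparators
            self.tags.remove(tag)
            self.tags.append(tag)
            return "HIT"
        if len(self.tags) == self.n_lines:
            self.tags.pop(0)         # evict the least recently used line
        self.tags.append(tag)
        return "MISS"

c = FullyAssociativeCache()
assert c.access(0x12345) == "MISS"
assert c.access(0x22345) == "MISS"  # would conflict in a direct-mapped cache
assert c.access(0x12345) == "HIT"   # here both stay resident
```

The `tag in self.tags` test hides the real cost: in hardware it is N simultaneous comparators, which is why this design only scales to very small caches.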
Cache Busses
- The speed of the bus connecting the CPU to the cache significantly impacts performance.
- Performance improves with a faster or wider cache bus.
- Putting the cache on the same chip as the CPU is optimal but limited by transistor count.
Cache Hierarchies
- Large caches are expensive due to the SRAMs used.
- A cache hierarchy balances cost and hit ratio.
- A 2-Level cache hierarchy uses a very small, fast Level 1 (L1) cache and a larger, slower Level 2 (L2) cache.
- 2-Level hierarchies are most common.
2-Level Cache Hierarchy Operation
- CPU attempts to read data.
- L1 cache is checked; if the data is present (L1 Hit), it's retrieved.
- If the data is not in L1 (L1 Miss), L2 cache is checked.
- If the data is not in L2 (L2 Miss), it's fetched from main memory.
- The data is then placed in both L1 and L2 caches (Fill).
- If the data is in L2 (L2 Hit), it's moved to L1 cache.
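The lookup-and-fill sequence above can be sketched as a short function. Plain dictionaries stand in for the two cache levels (an illustrative simplification — real caches have the tag/index structure described earlier):

```python
def two_level_read(addr, l1, l2, memory):
    """Sketch of a 2-level cache read. l1 and l2 are dicts mapping
    addresses to data (stand-ins for real cache structures);
    memory is the backing store."""
    if addr in l1:                  # L1 hit: fastest path
        return l1[addr], "L1 hit"
    if addr in l2:                  # L1 miss, L2 hit: promote into L1
        l1[addr] = l2[addr]
        return l1[addr], "L2 hit"
    data = memory[addr]             # double miss: fetch from main memory
    l2[addr] = data                 # fill both levels on the way back
    l1[addr] = data
    return data, "memory"

l1, l2, mem = {}, {}, {0x40: "value"}
assert two_level_read(0x40, l1, l2, mem) == ("value", "memory")
assert two_level_read(0x40, l1, l2, mem) == ("value", "L1 hit")
l1.clear()                          # simulate an L1 eviction
assert two_level_read(0x40, l1, l2, mem) == ("value", "L2 hit")
```

The last assertion shows why the hierarchy works: even after L1 eviction, the larger L2 usually still holds the data, avoiding the full trip to main memory.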
Examples of Cache Implementation
- Intel Pentium Pro (1996):
- 16 KB L1 on-chip, 256-512 KB L2 off-chip.
- L1 split into data and instruction caches (Harvard Architecture).
- L1 instruction cache is 4-way set associative with 8 KB.
- L1 data cache is 2-way set associative with 8 KB.
- L2 cache is 2-way set associative with 256 or 512 KB.
- 64-bit cache bus.
- Pentium-III Xeon (1998):
- L1 on-chip; L2 off-chip in 512 KB, 1 MB, or 2 MB sizes.
- L1 split into data and instruction caches (Harvard Architecture).
- L2 cache is 2-way set associative with 512 KB, 1 MB, or 2 MB.
- 64-bit cache bus.
- CPU clock speed up to 600 MHz.
- Pentium III (2000):
- L1 and L2 both on a single chip.
- L1 split into data and instruction caches (Harvard Architecture).
- L2 cache is 8-way set associative with 256 KB and error correction logic.
- 256-bit cache bus.
- CPU clock speed 800 MHz or better.
- L2 cache occupies about 30% of the total chip area.
- AMD K7/Athlon (2000):
- L1 on-chip, L2 off-chip.
- L1 split into data and instruction caches (Harvard Architecture).
- L1 is 2-way set associative, total 128 KB.
- L2 cache available in 512 KB, 1 MB, 2 MB, 4 MB, or 8 MB sizes.
- 72-bit cache bus (64 bits data, 8 bits error control).
- CPU clock speed up to 1.1 GHz.
- 2nd Gen Intel® Core™ i7:
- Four or two execution cores
- 32-KB instruction and 32-KB data first-level cache (L1) for each core
- 256-KB shared instruction/data second-level cache (L2) for each core
- Up to 8-MB shared instruction/data third-level cache (L3), shared among all cores
- L1 and L2 for each core, L3 is shared
- Apple M1 ARM SoC 2021:
- Four high performance cores, Four high efficiency cores
- HP Cores: 192 KB instruction cache and 128 KB data cache (L1) per core; 12 MB L2 cache shared among the HP cores.
- HE Cores: 128 KB instruction cache and 64 KB data cache (L1) per core; 4 MB L2 cache shared among the HE cores.
Summary
- Topics covered: Directly mapped caches, Set associative caches, and Cache hierarchies.
- Next: Cache management policies, Cache coherency, Cache performance, and Cache design tradeoffs.