Computer Architecture - Cache Organisation Part 1

Why Cache Organisation?

  • Cache organization and design have a significant impact on the performance and functionality of modern computers.
  • Caches are critical to system performance and a major area in contemporary machine design.
  • Understanding caches allows optimization of application code to maximize performance and avoid performance problems. It also enables anticipating application performance problems by relating application design to cache organization.
  • Understanding cache limitations helps in differentiating machine performance for value-for-money hardware shopping.

Storage Speed vs. Performance

  • The performance of a computer system depends on the CPU's performance and the storage hierarchy it uses.
  • Ideally, using only the fastest RAM would allow the CPU to fetch instructions and data without delay, but this is very costly.
  • Alternatively, cheaper storage (DRAMs, SDRAMs, SSDs, and rotating disks) is affordable but much slower.
  • If storage is significantly slower than the CPU, the CPU may stall, waiting for instructions from memory.
  • The performance loss due to a CPU stall depends on the hardware. For example, if the CPU executes an instruction every 2 nanoseconds but a DRAM access takes 40 nanoseconds, the CPU wastes 38 nanoseconds on every access that goes to memory.
  • The solution is to use a fast, small memory (cache) between the CPU and main memory.
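The cost of such a stall can be made concrete with a back-of-the-envelope calculation, using the illustrative timings from the example above (2 ns per instruction, 40 ns per DRAM access; real penalties depend on the memory controller and bus):

```python
# Illustrative timings from the notes above (assumptions, not measurements).
cpu_instr_time_ns = 2    # CPU completes one instruction every 2 ns
dram_access_ns = 40      # one DRAM access takes 40 ns

# Without a cache, every fetch from memory stalls the CPU for the full DRAM latency.
stall_ns = dram_access_ns - cpu_instr_time_ns   # time wasted per fetch
slowdown = dram_access_ns / cpu_instr_time_ns   # factor by which the CPU is held back

print(stall_ns, slowdown)  # 38 20.0
```

Even a modest fraction of accesses going to DRAM therefore dominates execution time, which is exactly the gap a cache is meant to close.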

Cache Memory Layout

  • A cache is a small, fast memory between the CPU and main memory (DRAM).
  • Registers are very fast, with access times around 1 nanosecond.
  • Cache memory is moderately fast, with access times of a few nanoseconds (e.g., 5 ns): slower than registers but much faster than main memory.

Spatial and Temporal Locality

  • Spatial Locality: Accessing a memory location implies that nearby memory locations are likely to be accessed soon.
  • Temporal Locality: Accessing a memory location implies it will likely be accessed again shortly.

Locality Example

  • Consider a loop writing to a data structure: the CPU fetches each instruction in the loop body once per iteration and writes one element of the structure per iteration.
  • This exhibits both temporal and spatial locality.
  • Programs access memory most frequently to fetch instructions, less frequently to read data, and least frequently to write data.

Caches

  • Caches are small memories between the CPU and main memory, operating at speeds similar to CPU registers.
  • Accessed instructions and data are stored in the cache for quick retrieval.
  • Accessing instructions or data already in the cache minimizes time loss.

Issues in Caching

  • How to find data in the cache.
  • How to make the cache invisible to the programmer and OS.
  • How to maximize the ratio of cache hits to misses.
  • How to minimize cost and maximize cache speed.
  • How to ensure consistency between the cache and main memory.

Directly Mapped Caches

  • The cache contains a Cache Data Memory that holds the cached contents; each line holds the content of one memory location.
  • Each entry has a Valid Bit, set when the entry contains valid data.
  • Each entry has a Tag, stored in the Cache Tag Memory.
  • A comparator compares tag values with address bits.

Directly Mapped Cache - Hit

  • Low Order Address Bits (LOAB) index into the Cache Tag Memory.
  • The comparator checks High Order Address Bits (HOAB) against the Cache Tag Memory value indexed by the LOAB.
  • If the values match, it's a Cache Hit, and the value from Cache Data Memory is retrieved.
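The hit check amounts to a simple address decomposition. The field widths below (4-bit byte-in-line, 8-bit LOAB) are assumptions chosen to match the worked example later in these notes:

```python
# Assumed geometry: 16-byte lines (4 offset bits) and 256 lines (8 index bits).
OFFSET_BITS = 4
INDEX_BITS = 8

def split_address(addr):
    offset = addr & ((1 << OFFSET_BITS) - 1)                  # byte within the line
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)   # LOAB: selects the cache line
    tag = addr >> (OFFSET_BITS + INDEX_BITS)                  # HOAB: compared with the Cache Tag Memory
    return tag, index, offset

tag, index, offset = split_address(0x12345)
print(hex(tag), hex(index), hex(offset))  # 0x12 0x34 0x5
```

On a hit, the tag stored at line `index` equals `tag` and the Valid Bit is set; the data is then read from the same line of the Cache Data Memory.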

Directly Mapped Cache - Miss #1

  • If the cache line is empty (Valid Bit = 0).
  • Low Order Address Bits (LOAB) index into the Cache Tag Memory.
  • High Order Address Bits (HOAB) are compared against the value in the Cache Tag Memory location.
  • Since the Valid Bit is zero, it's a Cache Miss, requiring the CPU to fetch data from memory.

Directly Mapped Cache - Miss #2

  • Assume cache contains an entry with identical Low Order Address Bits (LOAB) but different High Order Address Bits (HOAB).
  • Low Order Address Bits (LOAB) index into the Cache Tag Memory.
  • The comparator tests High Order Address Bits (HOAB) against the value held in the Cache Tag Memory location indexed by the Low Order Address Bits.
  • Since the values are different, it is a “Cache Miss”; the CPU must fetch the data from memory instead.

Directly Mapped Cache Operation

  • Spatial locality suggests that accessing one location means nearby locations (with common High Order Address Bits) will likely be accessed.
  • This cache can hold only one location for each set of Low Order Address Bits (LOAB) at a time.
  • Increasing the number of cache lines (the depth of the cache) increases the number of index bits C and reduces the tag size to N - C bits, where N is the total number of address bits.

Directly Mapped Cache Operation - Filling

  • On every cache miss, data fetched from memory is saved in the cache.
  • The number of consecutive instructions that fit in the cache, assuming one instruction per line, depends on the number of cache lines.

Directly Mapped Cache - Pros/Cons

  • Advantage: simple design.
  • Disadvantage: Frequent accesses to locations with the same Low Order Address Bits (LOAB) but different High Order Address Bits (HOAB) cause overwrites.
  • Cache performance is measured by the hit ratio: Hit ratio = Hits / (Hits + Misses).
  • A reduced hit ratio increases the average access time and reduces total performance.
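The effect of the hit ratio on the average access time can be sketched as follows; the 5 ns hit time and 40 ns miss penalty are assumed figures for illustration only:

```python
def average_access_time(hit_ratio, hit_ns=5, miss_penalty_ns=40):
    """Average time per access: a hit costs hit_ns, a miss costs hit_ns + miss_penalty_ns."""
    return hit_ratio * hit_ns + (1 - hit_ratio) * (hit_ns + miss_penalty_ns)

print(round(average_access_time(0.95), 2))  # 7.0  -- close to cache speed
print(round(average_access_time(0.80), 2))  # 13.0 -- almost twice as slow
```

A drop in hit ratio from 95% to 80% nearly doubles the average access time, which is why overwrites caused by conflicting LOABs matter so much.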

Direct Mapped Cache (a worked example)

  • Read 0x12345: the High Order Address Bits (HOAB, the tag) are 0x12, the Low Order Address Bits (LOAB, the index) are 0x34, and the byte-in-line is 0x5.
  • The cache line indexed by 0x34 is checked; if the stored tag does not match 0x12 (or the Valid Bit is 0), it is a MISS!
  • On the miss the line is filled: tag 0x12, data 003456729A7B89453400FFAABBCCDDEE. Reading 0x12345 again is now a Cache HIT!
  • Reading 0x12340 is also a HIT: it falls in the same cache line (spatial locality).
  • Reading 0x22343 is a MISS (same LOAB, different HOAB), so the cache line is refilled with new data.
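The worked example can be reproduced with a minimal direct-mapped cache model (a sketch; the 16-byte lines and 256-line depth are assumed from the example's field widths):

```python
class DirectMappedCache:
    """Toy direct-mapped cache: only tags are modelled, not the data itself."""

    def __init__(self, index_bits=8, offset_bits=4):
        self.index_bits, self.offset_bits = index_bits, offset_bits
        self.tags = {}  # index -> stored tag; absence models Valid Bit = 0

    def access(self, addr):
        index = (addr >> self.offset_bits) & ((1 << self.index_bits) - 1)
        tag = addr >> (self.offset_bits + self.index_bits)
        if self.tags.get(index) == tag:
            return "HIT"
        self.tags[index] = tag  # fill (or overwrite) the line on a miss
        return "MISS"

cache = DirectMappedCache()
print([cache.access(a) for a in (0x12345, 0x12340, 0x22343)])
# ['MISS', 'HIT', 'MISS'] -- 0x22343 evicts the 0x12345 line (same LOAB, different HOAB)
```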

Set Associative Caches

  • In a set associative cache, two or more cache lines can share the same Low Order Address Bits (LOAB) while holding different High Order Address Bits (HOAB).
  • Such caches achieve good hit ratios despite their smaller size.

2-Way Set Associative Cache

  • The simplest set associative cache.
  • The total number of cache entries is split into two halves, each with its own comparator for testing the High Order Address Bits (HOAB).
  • The tag field size must be increased by one bit.
  • Typically achieves a higher hit ratio than a directly mapped cache of the same size.

N-Way Set Associative Caches : Operation

  • Left Miss: the data is not found in the left side of the cache.
  • Right Hit: the data is found in the right side of the cache.
  • Double Miss: the data is not found in either side of the cache.
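The left/right behaviour above can be modelled with two tag memories probed in parallel. The LRU-style choice of which way to evict on a double miss is an assumption; the notes do not specify a replacement policy:

```python
class TwoWaySetAssociativeCache:
    """Toy 2-way set associative cache: one tag memory per way, LRU bit per set."""

    def __init__(self, index_bits=7, offset_bits=4):
        # One fewer index bit than the direct-mapped case of the same total size,
        # so the tag grows by one bit, as noted above.
        self.index_bits, self.offset_bits = index_bits, offset_bits
        self.ways = [{}, {}]   # "left" and "right" tag memories
        self.lru = {}          # index -> way to evict next

    def access(self, addr):
        index = (addr >> self.offset_bits) & ((1 << self.index_bits) - 1)
        tag = addr >> (self.offset_bits + self.index_bits)
        for w, way in enumerate(self.ways):
            if way.get(index) == tag:
                self.lru[index] = 1 - w   # the other way becomes the eviction candidate
                return "HIT"
        victim = self.lru.get(index, 0)
        self.ways[victim][index] = tag    # double miss: fill the victim way
        self.lru[index] = 1 - victim
        return "MISS"

cache = TwoWaySetAssociativeCache()
# Two addresses with the same LOAB but different HOABs now coexist:
print([cache.access(a) for a in (0x12345, 0x22345, 0x12340, 0x22343)])
# ['MISS', 'MISS', 'HIT', 'HIT']
```

The same access pattern on a direct-mapped cache would miss on every access, since each address would evict the other.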

4-Way, 8-Way Set Associative Caches

  • Higher orders of set associativity are possible but hardware costs increase.
  • Improvement in hit ratios diminishes beyond 2-way set associativity.
  • This is a “diminishing return on investment” as hardware costs increase faster than performance gains.

Fully Associative Caches

  • The order of set associativity equals the number of cache lines.
  • Every cache line has its own comparator.
  • For N cache lines, there are N comparators, and the cache is split N-way.
  • Since the depth of each section is 1, there's no need for the Low Order Address Bits (LOAB).
  • Complexity and cost make it impractical except for very small caches.

Cache Busses

  • The speed of the bus connecting the CPU to the cache significantly impacts performance.
  • Performance improves with a faster or wider cache bus.
  • Putting the cache on the same chip as the CPU is optimal but limited by transistor count.

Cache Hierarchies

  • Large caches are expensive due to the SRAMs used.
  • A cache hierarchy balances cost and hit ratio.
  • A 2-Level cache hierarchy uses a very small, fast Level 1 (L1) cache and a larger, slower Level 2 (L2) cache.
  • 2-Level hierarchies are most common.

2-Level Cache Hierarchy Operation

  • CPU attempts to read data.
  • L1 cache is checked; if the data is present (L1 Hit), it's retrieved.
  • If the data is not in L1 (L1 Miss), L2 cache is checked.
  • If the data is not in L2 (L2 Miss), it's fetched from main memory.
  • The data is then placed in both L1 and L2 caches (Fill).
  • If the data is in L2 (L2 Hit), it's moved to L1 cache.
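The read path above can be sketched with two sets standing in for the L1 and L2 tag memories (a deliberate simplification: no capacity limits or eviction policy is modelled):

```python
def read(addr, l1, l2):
    """Follow the 2-level lookup: L1 first, then L2, then main memory."""
    if addr in l1:              # L1 Hit: fastest path
        return "L1 hit"
    if addr in l2:              # L1 Miss, L2 Hit: promote the line into L1
        l1.add(addr)
        return "L2 hit"
    l1.add(addr)                # Double miss: fill both levels from main memory
    l2.add(addr)
    return "memory"

l1, l2 = set(), set()
print(read(0x100, l1, l2))  # memory
print(read(0x100, l1, l2))  # L1 hit
```

If a line is later evicted from L1 but survives in L2, the next read is an L2 hit and the line moves back into L1.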

Examples of Cache Implementation

  • Intel Pentium Pro (1996):
    • 16 KB L1 on-chip, 256-512 KB L2 off-chip.
    • L1 split into data and instruction caches (Harvard Architecture).
    • L1 instruction cache is 4-way set associative with 8 KB.
    • L1 data cache is 2-way set associative with 8 KB.
    • L2 cache is 2-way set associative with 256 or 512 KB.
    • 64-bit cache bus.
  • Pentium-III Xeon (1998):
    • L1 on-chip; L2 off-chip with 512 KB, 1 MB, or 2 MB.
    • L1 split into data and instruction caches (Harvard Architecture).
    • L2 cache is 2-way set associative with 512 KB, 1 MB, or 2 MB.
    • 64-bit cache bus.
    • CPU clock speed up to 600 MHz.
  • Pentium III (2000):
    • L1 and L2 both on a single chip.
    • L1 split into data and instruction caches (Harvard Architecture).
    • L2 cache is 8-way set associative with 256 KB and error correction logic.
    • 256-bit cache bus.
    • CPU clock speed 800 MHz or better.
    • L2 cache occupies about 30% of the total chip area.
  • AMD K7/Athlon (2000):
    • L1 on-chip, L2 off-chip.
    • L1 split into data and instruction caches (Harvard Architecture).
    • L1 is 2-way set associative, total 128 KB.
    • L2 cache with 512 KB, 1, 2, 4, 8 MB options.
    • 72-bit cache bus (64 bits data, 8 bits error control).
    • CPU clock speed up to 1.1 GHz.
  • 2nd Gen Intel® Core™ i7:
    • Four or two execution cores
    • 32-KB instruction and 32-KB data first-level cache (L1) for each core
    • 256-KB unified instruction/data second-level cache (L2) for each core
    • Up to 8-MB unified instruction/data third-level cache (L3), shared among all cores
  • Apple M1 ARM SoC 2021:
    • Four high performance cores, Four high efficiency cores
    • HP cores: 192 KB instruction cache and 128 KB data cache per core; 12 MB L2 cache shared among the HP cores.
    • HE cores: 128 KB instruction cache and 64 KB data cache per core; 4 MB L2 cache shared among the HE cores.

Summary

  • Topics covered: Directly mapped caches, Set associative caches, and Cache hierarchies.
  • Next: Cache management policies, Cache coherency, Cache performance, and Cache design tradeoffs.