RAM Caches Notes
Reading: Section 13.4 of Computer Organization and Design Fundamentals
Access Times
Registers: 1ns → 2ns
L1 Cache: 3ns → 10ns
L2 Cache: 25ns → 50ns
Main Memory: 30ns → 90ns
Hard Drive: 5ms → 20ms
CD/DVD-ROM/RW: 100ms → 5s
Tape Backup: 10s → 3min (if volume is mounted)
Cost: Closer (internal) memory is more costly per byte; distant (external) storage is less costly.
Memory Hierarchy Trade-offs
Amount: Software will ALWAYS fill available memory.
Speed: Memory should be able to keep up with the processor.
Cost: Whatever the market will bear.
Terminology
Block/Line: Basic unit of cache storage, containing multiple contiguous bytes (words) of data.
Cache Hit: When data is found in the cache.
Hit Rate: Fraction of hits / total accesses.
Cache Miss: When data is not found in the cache.
Miss Rate: Fraction of misses / total accesses (equal to 1 – hit rate).
Memory “Speed” from CPU Point of View
The effective memory access time from the CPU's perspective can be calculated from the hit rate and the access times of the two memory levels:

t_average = h × t_close + (1 − h) × t_far

where:
h is the hit rate (hits / # memory accesses).
(1 − h) is the miss rate (misses / # memory accesses).
t_close is the time to access closer (faster) memory.
t_far is the time to access distant (slower) memory.
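As a quick sketch (the class, method, and variable names here are mine, not from the text), the effective access time can be evaluated directly:

```java
// Sketch: effective memory access time from the CPU's point of view.
// Names (effectiveAccessTime, hitRate, tClose, tFar) are illustrative.
public class AccessTime {
    // tAvg = h * tClose + (1 - h) * tFar
    static double effectiveAccessTime(double hitRate, double tClose, double tFar) {
        return hitRate * tClose + (1.0 - hitRate) * tFar;
    }

    public static void main(String[] args) {
        // Example: 95% hit rate, 3 ns cache access, 30 ns main-memory access.
        double t = effectiveAccessTime(0.95, 3.0, 30.0);
        System.out.println(t + " ns"); // about 4.35 ns
    }
}
```

Note how a high hit rate keeps the average close to the cache's access time even though main memory is an order of magnitude slower.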
Principle of Locality
Caching works because programs exhibit locality of reference: recently accessed data and instructions tend to be accessed again soon (temporal locality), and nearby addresses tend to be accessed together (spatial locality). The Fibonacci loop below illustrates both.
public static long fibonacci(long number) {
    long prev = -1;
    long result = 1;
    long sum;
    // Temporal locality: prev, result, and sum are reused on every iteration.
    // Spatial locality: the loop body is a short, contiguous run of instructions.
    for (int i = 0; i < number; i++) {
        sum = result + prev;
        prev = result;
        result = sum;
    }
    return result;
}
Memory Addressing and Block Identification
[Figure: memory drawn as a column of gray blocks, one addressable location (word) each, grouped into numbered blocks; the highest address shown is 2^18 − 1 = 262,143.]
General Organization of a Cache
The cache consists of lines, each holding one block of words plus an associated tag (TAG0, TAG1, TAG2, TAG3, …, TAGn-1 for a cache of n lines). The number of words per block is set by the number of word ID bits, w, giving 2^w words per block.
Fully Associative Cache Organization
Memory Address is split into Tag and Word.
The tag is compared against all tags in the cache.
On a hit, the data from the corresponding cache line is retrieved.
On a miss, the data is fetched from main memory.
Fully Associative Cache Example
What value is stored in the cache for the following addresses?
Each row is one cache line, identified by its tag; the columns 000–111 are the word ID bits (in binary).

| Tag | 000 | 001 | 010 | 011 | 100 | 101 | 110 | 111 |
|---|---|---|---|---|---|---|---|---|
| 0110110110010 | 16 | 36 | 66 | 28 | A1 | 3B | D6 | 78 |
| 0100011010101 | 54 | C9 | 6A | 63 | 54 | 32 | 00 | D3 |
| 0001001101101 | B9 | 0C | 00 | BE | EE | 15 | 13 | 07 |
| 1010100101101 | F5 | A0 | 9D | A8 | 57 | 65 | 04 | 12 |
| 0001000111011 | 29 | 8C | ED | FD | 29 | 54 | 12 | F3 |
| 0001111001100 | 39 | FA | B5 | C1 | 33 | 9E | 33 | C8 |
| 1001100101101 | 23 | 4C | D2 | 40 | 6A | 76 | A3 | F0 |
a.) 0x1E65 → tag 0001111001100, word 101 → 0x9E
b.) 0xA96A → tag 1010100101101, word 010 → 0x9D
c.) 0x6D97 → tag 0110110110010, word 111 → 0x78
d.) 0x166B → tag 0001011001101 not in cache → Cache Miss
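The tag/word decomposition used in this example can be sketched in Java (class, method, and variable names are mine; a HashMap stands in for the parallel tag comparison that hardware performs):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a fully associative lookup with a 13-bit tag and 3 word ID bits,
// matching the example above. A HashMap stands in for the hardware's
// simultaneous comparison against every stored tag.
public class FullyAssociative {
    // tag -> block of 8 words (one per 3-bit word ID)
    static final Map<Integer, int[]> lines = new HashMap<>();

    static Integer lookup(int address) {
        int word = address & 0b111;   // low 3 bits select the word in the block
        int tag  = address >>> 3;     // remaining bits are the tag
        int[] block = lines.get(tag);
        return (block == null) ? null : block[word]; // null = cache miss
    }

    public static void main(String[] args) {
        lines.put(0b0001111001100,
                  new int[]{0x39, 0xFA, 0xB5, 0xC1, 0x33, 0x9E, 0x33, 0xC8});
        System.out.printf("%X%n", lookup(0x1E65)); // word 101 of that line
        System.out.println(lookup(0x166B));        // tag not stored: miss
    }
}
```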
Fully Associative Cache Summary
Memory address is interpreted as two parts:
Least significant bits = word position within block
Most significant bits = tag used to identify which block is stored in a particular line of cache
PROS:
A memory block can load into any line of the cache, reducing the chance of a miss
The algorithm for storing is independent of the size of the cache
CONS:
Every line's tag must be examined for a match → expensive and slow
Gets worse for larger caches
Replacement Algorithms
Least Recently Used (LRU): Replace the block that hasn't been touched in the longest period of time.
First In First Out (FIFO): Replace the block that has been in the cache the longest.
Least Frequently Used (LFU): Replace the block which has had the fewest hits.
Random: Just pick one. Only slight performance hit with respect to use-based algorithms LRU, FIFO, and LFU, but uses less power and fewer transistors.
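LRU can be sketched in software with Java's LinkedHashMap in access-order mode (a software analogy, not how hardware implements it; the class name and capacity are mine):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: LRU replacement via LinkedHashMap's access-order mode.
// removeEldestEntry evicts the least recently used entry once full.
public class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    public LruCache(int capacity) {
        super(16, 0.75f, true); // true = order entries by most recent access
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity; // evict the least recently used entry
    }

    public static void main(String[] args) {
        LruCache<String, Integer> cache = new LruCache<>(2);
        cache.put("A", 1);
        cache.put("B", 2);
        cache.get("A");    // touch A, so B becomes least recently used
        cache.put("C", 3); // evicts B
        System.out.println(cache.keySet()); // [A, C]
    }
}
```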
Line Status “Flag” Bits
Additional bits are maintained with each line of cache to identify “status”:
Dirty Bit: Indicates that one or more words in the line have been written to and need to be written to main memory.
Valid Bit: Indicates whether the line contains valid data loaded from memory.
Lock Bit: Prevents replacement with a new block from memory.
Type: In a unified (non-split) cache, identifies line contents as code or data.
Split Cache
A “split cache” is where the top-level cache (L1) is divided – one cache for instructions and one cache for data.
One cache feeds the instruction decoder, and one feeds data to the execution unit.
Supports pipelined architectures and other mechanisms intended to improve performance.
The instruction cache doesn't need to worry about writes.
The data cache doesn't need to worry about branch history or decoding information.
Inclusive and Exclusive Caches
Multi-processor systems now create competition between cores for space in shared caches.
Solutions:
Inclusive Cache: Any block present in a level of the cache must be present in all levels below it.
Exclusive Cache: If a block is present anywhere, it is only present in one level of the cache system.
Non-inclusive: Between inclusive and exclusive; a block may, but need not, be present in more than one level of the cache system.
Inclusive and Exclusive (continued)
Inclusive:
Simplifies cache coherence.
Performance is limited when a larger cache is not much bigger than the caches above it, since duplicated blocks waste capacity.
Exclusive:
More blocks can be stored.
The protocol necessary to support exclusive caches increases cache resource use and communication.
Write Policies
Research shows that 15% of memory accesses are writes.
Write Through: All writes go to main memory as well as the cache.
Simpler logic.
Slows down writes.
Increases bus traffic.
Multi-cores can monitor traffic to update their own caches.
Write Back: Updates are initially made in cache only.
Uses a “dirty” bit to know whether memory must be updated when the block is replaced.
More complex to implement.
Slows down loading of a new block into the cache.
Additional sync mechanism needed for multi-core systems.
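A minimal sketch of the write-back policy's dirty-bit bookkeeping for a single line (names are mine; a single int stands in for main memory):

```java
// Sketch: write-back behavior with a dirty bit, reduced to one cache line.
public class WriteBackLine {
    static int memory = 0x11;    // backing value in "main memory"
    static int data   = 0x11;    // cached copy of that value
    static boolean dirty = false;

    static void write(int value) {
        data = value;
        dirty = true;            // defer the memory update
    }

    static void evict() {
        if (dirty) {             // write back only if the line was modified
            memory = data;
            dirty = false;
        }
    }

    public static void main(String[] args) {
        write(0x22);             // hits only the cache; memory is stale
        evict();                 // dirty line is written back on replacement
        System.out.printf("0x%X%n", memory); // 0x22
    }
}
```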
Overall Memory Architecture
System Memory → Shared L2 Cache → per-core L1 Caches (instructions and data) → CPU Cores
Memory locations are 1 byte each and are grouped into contiguous blocks (…, Block N, Block N+1, Block N+2, …).
Data Structure Alignment
Modern memory busses are set up to be read as blocks, not individual bytes/words.
Memory elements that straddle two blocks force the processor to perform multiple reads, leading to slower operation.
Data structure alignment is used to limit straddling blocks.
Sometimes it is necessary to pad data in order to obey alignment requirements, even within a data structure.
Enforcing Data Structure Alignment
In general, k-byte operands must align (start at) an address that is a multiple of k.
Examples:
Two-byte values/instructions must start at an even address, i.e., the least significant bit of the address must be a 0 (Hex addresses can only end in 0, 2, 4, 6, 8, A, C, and E.)
Four-byte values/instructions must start at an address that is a multiple of 4, i.e., the least significant two bits of the address must be 00 (Hex addresses can only end in 0, 4, 8, and C.)
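The usual bit trick for rounding an address up to the next multiple of k (for k a power of two) can be sketched as follows (the class and method names are mine):

```java
// Sketch: align an address up to the next multiple of k, where k is a
// power of two. This is how padding to meet alignment rules is computed.
public class Align {
    static int alignUp(int addr, int k) {
        return (addr + k - 1) & ~(k - 1);
    }

    public static void main(String[] args) {
        // 0x1001 is not 4-byte aligned; the next 4-byte boundary is 0x1004.
        System.out.printf("0x%X%n", alignUp(0x1001, 4));
    }
}
```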
One More Note on Memory Storage
Endianness identifies the order in which a sequence of bytes is stored in memory: starting with either the most significant byte (big end) or the least significant byte (little end).
Example: How is the integer 0x12345678 stored in big endian and little endian starting at address 1000?
Big Endian
| Address | Contents |
|---|---|
| 1000 | 12 |
| 1001 | 34 |
| 1002 | 56 |
| 1003 | 78 |
Little Endian
| Address | Contents |
|---|---|
| 1000 | 78 |
| 1001 | 56 |
| 1002 | 34 |
| 1003 | 12 |
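The two layouts can be checked in Java with ByteBuffer (a software illustration of the same 0x12345678 example; the class and method names are mine):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Sketch: how the integer 0x12345678 lays out in memory under each byte order.
public class Endian {
    static byte[] bytes(ByteOrder order) {
        return ByteBuffer.allocate(4).order(order).putInt(0x12345678).array();
    }

    public static void main(String[] args) {
        for (byte b : bytes(ByteOrder.BIG_ENDIAN))    System.out.printf("%02X ", b); // 12 34 56 78
        System.out.println();
        for (byte b : bytes(ByteOrder.LITTLE_ENDIAN)) System.out.printf("%02X ", b); // 78 56 34 12
        System.out.println();
    }
}
```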
Next Time…
In our next class, we will examine the motivation behind and operation of virtual memory.