RAM Caches

Reading: Section 13.4 of Computer Organization and Design Fundamentals

Access Times

  • Registers: 1ns → 2ns

  • L1 Cache: 3ns → 10ns

  • L2 Cache: 25ns → 50ns

  • Main Memory: 30ns → 90ns

  • Hard Drive: 5ms → 20ms

  • CD/DVD-ROM/RW: 100ms → 5s

  • Tape Backup: 10s → 3min (if volume is mounted)

  • Cost: Per-bit cost decreases moving down this list, from fast memory near the CPU to slow external storage (far).

Memory Hierarchy Trade-offs

  • Amount: Software will ALWAYS fill available memory.

  • Speed: Memory should be able to keep up with the processor.

  • Cost: Whatever the market will bear.

Terminology

  • Block/Line: Basic unit of cache storage, containing multiple contiguous bytes (words) of data.

  • Cache Hit: When data is found in the cache.

  • Hit Rate: The fraction hits / total accesses.

  • Cache Miss: When data is not found in the cache.

  • Miss Rate: The fraction misses / total accesses (equal to 1 − hit rate).

Memory “Speed” from CPU Point of View

The effective (average) memory access time from the CPU's perspective can be calculated from the hit rate and the access times of the two memory levels:

t_avg = h · t1 + (1 − h) · t2

  • h is the hit rate (hits / # memory accesses).

  • (1 − h) is the miss rate (misses / # memory accesses).

  • t1 is the time to access the closer (faster) memory.

  • t2 is the time to access the more distant (slower) memory.
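For example, using the access times listed earlier and assuming a 95% hit rate in a 10 ns L1 cache backed by 90 ns main memory:

t_avg = 0.95 · 10 ns + 0.05 · 90 ns = 9.5 ns + 4.5 ns = 14 ns

so the cache makes memory appear roughly six times faster than it really is.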

Principle of Locality

The principle of locality is what caching exploits: programs tend to reuse recently accessed data and instructions (temporal locality) and to access addresses near recently accessed ones (spatial locality). The loop below illustrates both: the same few variables and instructions are touched on every iteration, and those instructions occupy consecutive memory locations.

public static long fibonacci(long number) {
    long prev = -1;
    long result = 1;
    long sum;
    // Temporal locality: prev, result, sum, and i are reused on every iteration.
    // Spatial locality: the loop body's instructions sit at consecutive
    // addresses, so after the first pass the whole loop fits in cache.
    for (int i = 0; i < number; i++) {
        sum = result + prev;
        prev = result;
        result = sum;
    }
    return result;
}

Memory Addressing and Block Identification

[Figure: memory shown as a grid of gray blocks, each an addressable location containing one word; with an 18-bit address, locations run from 0 to 2^18 − 1 = 262,143 and are grouped into numbered blocks.]

General Organization of a Cache

The cache consists of lines, each holding one block of words copied from memory. Each line stores a tag (TAG0, TAG1, TAG2, TAG3, …, TAGn-1) identifying which memory block it currently holds. With w word-ID bits, each block contains 2^w words.

Fully Associative Cache Organization

  • Memory Address is split into Tag and Word.

  • The tag is compared against all tags in the cache.

  • On a hit, the data from the corresponding cache line is retrieved.

  • On a miss, the data is fetched from main memory.
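A minimal sketch of that lookup in Java (class and field names are illustrative, not from the text; real hardware compares all tags in parallel rather than in a loop):

public class FullyAssociativeCache {
    private final int wordBits;     // w: bits selecting a word within a block
    private final long[] tags;      // one tag per cache line
    private final boolean[] valid;  // valid bit per line
    private final byte[][] data;    // each line holds a block of 2^w words (bytes here)

    public FullyAssociativeCache(int lines, int wordBits) {
        this.wordBits = wordBits;
        this.tags = new long[lines];
        this.valid = new boolean[lines];
        this.data = new byte[lines][1 << wordBits];
    }

    // Returns the cached byte, or null on a miss (caller then fetches from memory).
    public Byte read(long address) {
        long tag = address >>> wordBits;                     // most significant s bits
        int word = (int) (address & ((1 << wordBits) - 1));  // least significant w bits
        for (int line = 0; line < tags.length; line++) {     // a hit can be in ANY line
            if (valid[line] && tags[line] == tag) {
                return data[line][word];                     // cache hit
            }
        }
        return null;                                         // cache miss
    }
}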

Fully Associative Cache Example

What value is stored in the cache for the following addresses?

                     Word ID bits (in binary)
Tag              000   001   010   011   100   101   110   111
0110110110010     16    36    66    28    A1    3B    D6    78
0100011010101     54    C9    6A    63    54    32    00    D3
0001001101101     B9    0C    00    BE    EE    15    13    07
1010100101101     F5    A0    9D    A8    57    65    04    12
0001000111011     29    8C    ED    FD    29    54    12    F3
0001111001100     39    FA    B5    C1    33    9E    33    C8
1001100101101     23    4C    D2    40    6A    76    A3    F0

a.) 0x1E65   b.) 0xA96A   c.) 0x6D97   d.) 0x166B (d is a cache miss)
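Worked answers: each 16-bit address splits into a 13-bit tag and 3 word-ID bits.

  a.) 0x1E65 = 0001111001100 101 → row 0001111001100, word 101 → 0x9E

  b.) 0xA96A = 1010100101101 010 → 0x9D

  c.) 0x6D97 = 0110110110010 111 → 0x78

  d.) 0x166B = 0001011001101 011 → tag 0001011001101 matches no line → cache miss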

Fully Associative Cache Summary

  • Memory address is interpreted as two parts:

    • Least significant w bits = word position within block

    • Most significant s bits = tag used to identify which block is stored in a particular line of cache

  • PROS:

    • A memory block can be loaded into any line of the cache, reducing the chance of a miss

    • The algorithm for storing is independent of the size of the cache

  • CONS:

    • Every line's tag must be examined for a match → expensive and slow

    • Gets worse for larger caches

Replacement Algorithms

  • Least Recently Used (LRU): Replace the block that hasn't been touched in the longest period of time.

  • First In First Out (FIFO): Replace the block that has been in the cache the longest.

  • Least Frequently Used (LFU): Replace the block which has had the fewest hits.

  • Random: Just pick one. Only slight performance hit with respect to use-based algorithms LRU, FIFO, and LFU, but uses less power and fewer transistors.
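As a software illustration of LRU (not how cache hardware implements it), Java's LinkedHashMap can act as a small LRU cache when constructed in access order; the capacity and tag values below are made up for the example:

import java.util.LinkedHashMap;
import java.util.Map;

public class LruExample {
    public static void main(String[] args) {
        final int capacity = 4;
        // accessOrder=true keeps entries ordered by most recent use;
        // removeEldestEntry evicts the least recently used entry when full.
        Map<Long, byte[]> cache = new LinkedHashMap<Long, byte[]>(capacity, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Long, byte[]> eldest) {
                return size() > capacity;
            }
        };
        for (long tag : new long[]{1, 2, 3, 4, 1, 5}) { // re-touching tag 1 keeps it resident
            cache.putIfAbsent(tag, new byte[8]);        // load "block" on miss
        }
        System.out.println(cache.keySet());             // [3, 4, 1, 5]: tag 2 was evicted
    }
}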

Line Status “Flag” Bits

Additional bits are maintained with each line of cache to identify “status”:

  • Dirty Bit: Indicates that one or more words in the line have been written to and need to be written to main memory.

  • Valid Bit: Indicates whether the line contains valid data loaded from memory.

  • Lock Bit: Prevents replacement with a new block from memory.

  • Type: In a unified (non-split) cache, identifies line contents as code or data.

Split Cache

A “split cache” is where the top-level cache (L1) is divided – one cache for instructions and one cache for data.

  • One cache feeds the instruction decoder, and one feeds data to the execution unit.

  • Supports pipelined architectures and other mechanisms intended to improve performance.

  • The instruction cache doesn't need to worry about writes.

  • The data cache doesn't need to worry about branch history or decoding information.

Inclusive and Exclusive Caches

  • Multi-processor systems now create competition between cores for space in shared caches.

  • Solutions:

    • Inclusive Cache: Any block present in a level of the cache must be present in all levels below it.

    • Exclusive Cache: If a block is present anywhere, it is only present in one level of the cache system.

    • Non-inclusive: Falls between inclusive and exclusive; a block may be present in more than one level of the cache system, but is not required to be.

Inclusive and Exclusive (continued)

  • Inclusive:

    • Simplifies cache coherence.

    • Performance is limited when a larger cache is not much bigger than the caches above it, since much of its capacity is spent duplicating their contents.

  • Exclusive:

    • More blocks can be stored.

    • The protocol necessary to support exclusive caches increases cache resource use and communication.

Write Policies

Measurements of typical programs show that roughly 15% of memory accesses are writes.

  • Write Through: All writes go to main memory as well as the cache.

    • Simpler logic.

    • Slows down writes.

    • Increases bus traffic.

    • Multi-cores can monitor traffic to update their own caches.

  • Write Back: Updates are initially made in cache only.

    • Utilizes a “dirty” bit to know when to update memory on overwrite.

    • More complex to implement.

    • Slows down loading of a new block into cache (a dirty block must be written back first).

    • Additional sync mechanism needed for multi-core systems.
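A minimal sketch contrasting the two policies, assuming for brevity that a line holds a single byte and that lineAddress records where each line maps (all names here are illustrative):

class WritePolicySketch {
    byte[] mainMemory = new byte[1 << 16];
    byte[] cacheData = new byte[256];   // one byte per line, for brevity
    boolean[] dirty = new boolean[256]; // dirty bit per line
    int[] lineAddress = new int[256];   // memory address each line currently maps to

    // Write-through: every write updates the cache AND main memory.
    void writeThrough(int line, byte value) {
        cacheData[line] = value;
        mainMemory[lineAddress[line]] = value; // extra bus traffic on every write
    }

    // Write-back: write to the cache only and set the dirty bit;
    // memory is updated later, when the line is replaced.
    void writeBack(int line, byte value) {
        cacheData[line] = value;
        dirty[line] = true;
    }

    void evict(int line) {
        if (dirty[line]) { // only dirty lines cost a memory write on replacement
            mainMemory[lineAddress[line]] = cacheData[line];
            dirty[line] = false;
        }
    }
}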

Overall Memory Architecture

The overall architecture, from main memory down to each core:

  • System Memory

  • Shared L2 Cache

  • L1 Cache (Instructions and Data) for each CPU Core

  • CPU Core

[Figure: byte-addressable memory locations grouped into consecutive blocks N, N+1, and N+2; each location holds 1 byte.]

Data Structure Alignment

  • Modern memory buses are set up to transfer blocks, not individual bytes/words.

  • Memory elements that straddle two blocks force the processor to perform multiple reads, leading to slower operation.

  • Data structure alignment is used to limit straddling blocks.

  • Sometimes it is necessary to pad data in order to obey alignment requirements, even within a data structure.

Enforcing Data Structure Alignment

In general, a k-byte operand must be aligned to (start at) an address that is a multiple of k.

Examples:

  • Two-byte values/instructions must start at an even address, i.e., the least significant bit of the address must be a 0 (Hex addresses can only end in 0, 2, 4, 6, 8, A, C, and E.)

  • Four-byte values/instructions must start at an address that is a multiple of 4, i.e., the least significant two bits of the address must be 00 (Hex addresses can only end in 0, 4, 8, and C.)
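In code, these checks reduce to bit arithmetic, since a multiple of k (k a power of two) has its low log2(k) bits equal to zero. A minimal sketch in Java (helper names are illustrative):

class AlignmentSketch {
    // True if address is a multiple of k, i.e., its low bits are all zero.
    static boolean isAligned(long address, long k) {
        return (address & (k - 1)) == 0;
    }

    // Round address up to the next multiple of k (how padding is computed).
    static long alignUp(long address, long k) {
        return (address + k - 1) & ~(k - 1);
    }

    public static void main(String[] args) {
        System.out.println(isAligned(0x1004, 4));                 // true: ends in 4
        System.out.println(isAligned(0x1006, 4));                 // false: low two bits are 10
        System.out.println(Long.toHexString(alignUp(0x1006, 4))); // 1008
    }
}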

One More Note on Memory Storage

  • Endianness identifies the order in which a sequence of bytes is stored in memory – either starting with the most significant byte (big end) or the least significant byte (little end).

  • Example: How is the integer 0x12345678 stored in big endian and little endian starting at address 1000?

Big Endian

Address   Contents
1000      12
1001      34
1002      56
1003      78

Little Endian

Address   Contents
1000      78
1001      56
1002      34
1003      12
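To check this in code, Java's ByteBuffer lets the byte order be set explicitly (a small self-contained demo):

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class EndianDemo {
    public static void main(String[] args) {
        int value = 0x12345678;
        // Big endian (Java's default): most significant byte at the lowest address.
        byte[] big = ByteBuffer.allocate(4).order(ByteOrder.BIG_ENDIAN).putInt(value).array();
        // Little endian: least significant byte at the lowest address.
        byte[] little = ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN).putInt(value).array();
        for (byte b : big) System.out.printf("%02X ", b);    // 12 34 56 78
        System.out.println();
        for (byte b : little) System.out.printf("%02X ", b); // 78 56 34 12
        System.out.println();
    }
}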

Next Time…

In our next class, we will examine the motivation behind and operation of virtual memory.