RAM Caches

Reading: Section 13.4 of Computer Organization and Design Fundamentals

Access Times

  • Registers: 1ns → 2ns

  • L1 Cache: 3ns → 10ns

  • L2 Cache: 25ns → 50ns

  • Main Memory: 30ns → 90ns

  • Hard Drive: 5ms → 20ms

  • CD/DVD-ROM/RW: 100ms → 5s

  • Tape Backup: 10s → 3min (if volume is mounted)

  • Cost: Per-bit cost decreases moving down this list, from fast memory near the CPU to slow external storage (far).

Memory Hierarchy Trade-offs

  • Amount: Software will ALWAYS fill available memory.

  • Speed: Memory should be able to keep up with the processor.

  • Cost: Whatever the market will bear.

Terminology

  • Block/Line: Basic unit of cache storage, containing multiple contiguous bytes (words) of data.

  • Cache Hit: When data is found in the cache.

  • Hit Rate: The fraction hits / total accesses.

  • Cache Miss: When data is not found in the cache.

  • Miss Rate: The fraction misses / total accesses (equal to 1 − hit rate).

Memory “Speed” from CPU Point of View

The effective (average) memory access time from the CPU's perspective can be calculated from the hit rate and the access times of the two memory levels:

t_avg = h · t1 + (1 − h) · t2

  • h is the hit rate (hits / # memory accesses).

  • (1 − h) is the miss rate (misses / # memory accesses).

  • t1 is the time to access the closer (faster) memory.

  • t2 is the time to access the more distant (slower) memory.
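For example, using the access times listed earlier and assuming a 95% hit rate in a 10 ns L1 cache backed by 90 ns main memory:

t_avg = 0.95 · 10 ns + 0.05 · 90 ns = 9.5 ns + 4.5 ns = 14 ns

so the cache makes memory appear roughly six times faster than it really is.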

Principle of Locality

The principle of locality is what caching exploits: programs tend to reuse recently accessed data and instructions (temporal locality) and to access addresses near recently accessed ones (spatial locality). The loop below illustrates both: the same few variables and instructions are touched on every iteration, and those instructions occupy consecutive memory locations.

public static long fibonacci(long number) {
    long prev = -1;
    long result = 1;
    long sum;
    // Temporal locality: prev, result, sum, and i are reused on every iteration.
    // Spatial locality: the loop body's instructions sit at consecutive
    // addresses, so after the first pass the whole loop fits in cache.
    for (int i = 0; i < number; i++) {
        sum = result + prev;
        prev = result;
        result = sum;
    }
    return result;
}

Memory Addressing and Block Identification

[Figure: memory shown as a grid of gray blocks, each an addressable location containing one word; with an 18-bit address, locations run from 0 to 2^18 − 1 = 262,143 and are grouped into numbered blocks.]

General Organization of a Cache

The cache consists of lines, each holding one block of words copied from memory. Each line stores a tag (TAG0, TAG1, TAG2, TAG3, …, TAGn-1) identifying which memory block it currently holds. With w word-ID bits, each block contains 2^w words.

Fully Associative Cache Organization

  • Memory Address is split into Tag and Word.

  • The tag is compared against all tags in the cache.

  • On a hit, the data from the corresponding cache line is retrieved.

  • On a miss, the data is fetched from main memory.
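A minimal sketch of that lookup in Java (class and field names are illustrative, not from the text; real hardware compares all tags in parallel rather than in a loop):

public class FullyAssociativeCache {
    private final int wordBits;     // w: bits selecting a word within a block
    private final long[] tags;      // one tag per cache line
    private final boolean[] valid;  // valid bit per line
    private final byte[][] data;    // each line holds a block of 2^w words (bytes here)

    public FullyAssociativeCache(int lines, int wordBits) {
        this.wordBits = wordBits;
        this.tags = new long[lines];
        this.valid = new boolean[lines];
        this.data = new byte[lines][1 << wordBits];
    }

    // Returns the cached byte, or null on a miss (caller then fetches from memory).
    public Byte read(long address) {
        long tag = address >>> wordBits;                     // most significant s bits
        int word = (int) (address & ((1 << wordBits) - 1));  // least significant w bits
        for (int line = 0; line < tags.length; line++) {     // a hit can be in ANY line
            if (valid[line] && tags[line] == tag) {
                return data[line][word];                     // cache hit
            }
        }
        return null;                                         // cache miss
    }
}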

Fully Associative Cache Example

What value is stored in the cache for the following addresses?

                     Word ID bits (in binary)
Tag              000   001   010   011   100   101   110   111
0110110110010     16    36    66    28    A1    3B    D6    78
0100011010101     54    C9    6A    63    54    32    00    D3
0001001101101     B9    0C    00    BE    EE    15    13    07
1010100101101     F5    A0    9D    A8    57    65    04    12
0001000111011     29    8C    ED    FD    29    54    12    F3
0001111001100     39    FA    B5    C1    33    9E    33    C8
1001100101101     23    4C    D2    40    6A    76    A3    F0

a.) 0x1E65   b.) 0xA96A   c.) 0x6D97   d.) 0x166B (d is a cache miss)
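Worked answers: each 16-bit address splits into a 13-bit tag and 3 word-ID bits.

  a.) 0x1E65 = 0001111001100 101 → row 0001111001100, word 101 → 0x9E

  b.) 0xA96A = 1010100101101 010 → 0x9D

  c.) 0x6D97 = 0110110110010 111 → 0x78

  d.) 0x166B = 0001011001101 011 → tag 0001011001101 matches no line → cache miss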

Fully Associative Cache Summary

  • Memory address is interpreted as two parts:

    • Least significant w bits = word position within block

    • Most significant s bits = tag used to identify which block is stored in a particular line of cache

  • PROS:

    • A memory block can be loaded into any line of the cache, reducing the chance of a miss

    • The algorithm for storing is independent of the size of the cache

  • CONS:

    • Every line's tag must be examined for a match → expensive and slow

    • Gets worse for larger caches

Replacement Algorithms

  • Least Recently Used (LRU): Replace the block that hasn't been touched in the longest period of time.

  • First In First Out (FIFO): Replace the block that has been in the cache the longest.

  • Least Frequently Used (LFU): Replace the block which has had the fewest hits.

  • Random: Just pick one. Only slight performance hit with respect to use-based algorithms LRU, FIFO, and LFU, but uses less power and fewer transistors.
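As a software illustration of LRU (not how cache hardware implements it), Java's LinkedHashMap can act as a small LRU cache when constructed in access order; the capacity and tag values below are made up for the example:

import java.util.LinkedHashMap;
import java.util.Map;

public class LruExample {
    public static void main(String[] args) {
        final int capacity = 4;
        // accessOrder=true keeps entries ordered by most recent use;
        // removeEldestEntry evicts the least recently used entry when full.
        Map<Long, byte[]> cache = new LinkedHashMap<Long, byte[]>(capacity, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Long, byte[]> eldest) {
                return size() > capacity;
            }
        };
        for (long tag : new long[]{1, 2, 3, 4, 1, 5}) { // re-touching tag 1 keeps it resident
            cache.putIfAbsent(tag, new byte[8]);        // load "block" on miss
        }
        System.out.println(cache.keySet());             // [3, 4, 1, 5]: tag 2 was evicted
    }
}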

Line Status “Flag” Bits

Additional bits are maintained with each line of cache to identify “status”:

  • Dirty Bit: Indicates that one or more words in the line have been written to and need to be written to main memory.

  • Valid Bit: Indicates whether the line contains valid data loaded from memory.

  • Lock Bit: Prevents replacement with a new block from memory.

  • Type: In a unified (non-split) cache, identifies line contents as code or data.

Split Cache

A “split cache” is where the top-level cache (L1) is divided – one cache for instructions and one cache for data.

  • One cache feeds the instruction decoder, and one feeds data to the execution unit.

  • Supports pipelined architectures and other mechanisms intended to improve performance.

  • The instruction cache doesn't need to worry about writes.

  • The data cache doesn't need to worry about branch history or decoding information.

Inclusive and Exclusive Caches

  • Multi-processor systems now create competition between cores for space in shared caches.

  • Solutions:

    • Inclusive Cache: Any block present in a level of the cache must be present in all levels below it.

    • Exclusive Cache: If a block is present anywhere, it is only present in one level of the cache system.

    • Non-inclusive: Falls between inclusive and exclusive; a block may be present in more than one level of the cache system, but is not required to be.

Inclusive and Exclusive (continued)

  • Inclusive:

    • Simplifies cache coherence.

    • Performance is limited when a larger cache is not much bigger than the caches above it, since much of its capacity is spent duplicating their contents.

  • Exclusive:

    • More blocks can be stored.

    • The protocol necessary to support exclusive caches increases cache resource use and communication.

Write Policies

Measurements of typical programs show that roughly 15% of memory accesses are writes.

  • Write Through: All writes go to main memory as well as the cache.

    • Simpler logic.

    • Slows down writes.

    • Increases bus traffic.

    • Multi-cores can monitor traffic to update their own caches.

  • Write Back: Updates are initially made in cache only.

    • Utilizes a “dirty” bit to know when to update memory on overwrite.

    • More complex to implement.

    • Slows down loading of a new block into cache (a dirty block must be written back first).

    • Additional sync mechanism needed for multi-core systems.
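A minimal sketch contrasting the two policies, assuming for brevity that a line holds a single byte and that lineAddress records where each line maps (all names here are illustrative):

class WritePolicySketch {
    byte[] mainMemory = new byte[1 << 16];
    byte[] cacheData = new byte[256];   // one byte per line, for brevity
    boolean[] dirty = new boolean[256]; // dirty bit per line
    int[] lineAddress = new int[256];   // memory address each line currently maps to

    // Write-through: every write updates the cache AND main memory.
    void writeThrough(int line, byte value) {
        cacheData[line] = value;
        mainMemory[lineAddress[line]] = value; // extra bus traffic on every write
    }

    // Write-back: write to the cache only and set the dirty bit;
    // memory is updated later, when the line is replaced.
    void writeBack(int line, byte value) {
        cacheData[line] = value;
        dirty[line] = true;
    }

    void evict(int line) {
        if (dirty[line]) { // only dirty lines cost a memory write on replacement
            mainMemory[lineAddress[line]] = cacheData[line];
            dirty[line] = false;
        }
    }
}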

Overall Memory Architecture

The overall architecture, from main memory down to each core:

  • System Memory

  • Shared L2 Cache

  • L1 Cache (Instructions and Data) for each CPU Core

  • CPU Core

[Figure: byte-addressable memory locations grouped into consecutive blocks N, N+1, and N+2; each location holds 1 byte.]

Data Structure Alignment

  • Modern memory buses are set up to transfer blocks, not individual bytes/words.

  • Memory elements that straddle two blocks force the processor to perform multiple reads, leading to slower operation.

  • Data structure alignment is used to limit straddling blocks.

  • Sometimes it is necessary to pad data in order to obey alignment requirements, even within a data structure.

Enforcing Data Structure Alignment

In general, a k-byte operand must be aligned to (start at) an address that is a multiple of k.

Examples:

  • Two-byte values/instructions must start at an even address, i.e., the least significant bit of the address must be a 0 (Hex addresses can only end in 0, 2, 4, 6, 8, A, C, and E.)

  • Four-byte values/instructions must start at an address that is a multiple of 4, i.e., the least significant two bits of the address must be 00 (Hex addresses can only end in 0, 4, 8, and C.)
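In code, these checks reduce to bit arithmetic, since a multiple of k (k a power of two) has its low log2(k) bits equal to zero. A minimal sketch in Java (helper names are illustrative):

class AlignmentSketch {
    // True if address is a multiple of k, i.e., its low bits are all zero.
    static boolean isAligned(long address, long k) {
        return (address & (k - 1)) == 0;
    }

    // Round address up to the next multiple of k (how padding is computed).
    static long alignUp(long address, long k) {
        return (address + k - 1) & ~(k - 1);
    }

    public static void main(String[] args) {
        System.out.println(isAligned(0x1004, 4));                 // true: ends in 4
        System.out.println(isAligned(0x1006, 4));                 // false: low two bits are 10
        System.out.println(Long.toHexString(alignUp(0x1006, 4))); // 1008
    }
}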

One More Note on Memory Storage

  • Endianness identifies the order in which a sequence of bytes is stored in memory – either starting with the most significant byte (big end) or the least significant byte (little end).

  • Example: How is the integer 0x12345678 stored in big endian and little endian starting at address 1000?

Big Endian

Address   Contents
1000      12
1001      34
1002      56
1003      78

Little Endian

Address   Contents
1000      78
1001      56
1002      34
1003      12
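To check this in code, Java's ByteBuffer lets the byte order be set explicitly (a small self-contained demo):

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class EndianDemo {
    public static void main(String[] args) {
        int value = 0x12345678;
        // Big endian (Java's default): most significant byte at the lowest address.
        byte[] big = ByteBuffer.allocate(4).order(ByteOrder.BIG_ENDIAN).putInt(value).array();
        // Little endian: least significant byte at the lowest address.
        byte[] little = ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN).putInt(value).array();
        for (byte b : big) System.out.printf("%02X ", b);    // 12 34 56 78
        System.out.println();
        for (byte b : little) System.out.printf("%02X ", b); // 78 56 34 12
        System.out.println();
    }
}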

Next Time…

In our next class, we will examine the motivation behind and operation of virtual memory.