Memory System Design

Problem 1: If you put everything into a box, it will work, but, you don’t have any performance in your laptop. Speed

Problem 2: Flexibility. Your application has to be able to run on older laptops.

Processor is much faster than Main memory

Solution: insert fast cache memory between processor and main memory

Amount of main memory available in a machine varies; program/data might not fit into memory

Program contains hard-coded address; How to move a program to a different memory location?

Solution: Virtual Memory hides details of memory hierarchy providing the illusion of a flat address space

Component	CPU	Main Memory	Disk memory	tape memory
Access	Random = constant time to access any memory	Random	Direct	Sequential
Capacity	64-1024B	<512GB	>1TB	many TB
Latency	0.15-0.3ns	30-200ns	5ms	<10s
Block Size	CPU→Main memory = 1 word	Main Memory → Disk memory = 4kb	Disk Memory → tape memory = KB/MB
Bandwidth	100-1000GB/s	5-20GB/s	0.05-0.5GB/s	0.001GB/s
Cost	***** (high) (uses SRAMS and flip flops because they have to be “blazing fast”)	*** (Uses DRAM)	**	* (low)

The longer a signal has to travel, geographically, the longer the latency

how much information a network can transfer at a time = bandwidth

Main memory is 500x slower than cpu memory because it has to use DRAM

the further you get away from the processor, the larger and slower the memory

Your system is only as fast as your weakest link. This is why we have a problem if you put it all in a box. Your process can be however fast but it will have to wait on the memory.

Temporal and Spatial locality

Working set.

Many programs involve loops, as long as you are in a loop, the temporal locality for a variable or operation in that loop will be a 1. The same can be said for functions in loops

Instructions are stored sequentially in memory, so spatial locality for the next space above your current operand is high. Variables are also stored close in memory on the stack

then on average, our programs will have pretty high temporal and spatial locality

Working set - small area of memory cells that don’t change much over time. Program will be very fast as long as that working set fits into the Cache

MAIN MEMORY IS SLOW BECAUSE OF DRAM

Whether an item is in Cache or not is a Cache hit or miss (whether it is in cache or main memory)

If we have a cache hit, we have to translate the address on the address bus to the corresponding cache address

We have a MMU (Memory management unit) that decides whether we have a cache hit or cache miss → (miss) send the address the processor provides to the memory and send that to the cache. If a hit, MMU translates the address to the cache chip via mapping tables

If programs has a high temporal and spacial locality, the cache will speed everything up. Working set will also be small. If working set fits into cache, that will speed everything up.

PAGING: Block→lookup table→block

word→ word

Segmentation: Block→lookup table→base address + = primary address

word → ^

segmentation has an addition, so it takes longer than paging

paging is a little less flexible but it is a lot faster, so it’s used in cache

Segmentation will be used in your operating system

Hit: the word was found at the level from which it was requested

Miss: not found in cache

Hit ratio = h = num of hits / total num of references

Miss ratio: 1-hit ratio

tp = primary memory access time. ts = secondary memory access time

access time, Ta = h*tp + (1-h)*ts

Page: commonly, a disk book. Page fault: synonymous with a miss.

Demand paging: pages are moved from disk to main memory only when a word in the page is requested by the processor

Block placement and replacement decisions must be made each time a block is moved

Block-size: block transfer efficiency and miss ratio will be affected. The more size in a cache block, the less lines you will half. If you cut the block size in half, you will have more cache lines. Both decrease hit ratio. It’s a trade off

How did they get to 16 words? - simulations

THERE IS NO EMPTY MEMORY, JUST UNINITIALIZED MEMORY

Associative mapped cache model: any block from main memory can be put anywhere in the cache. Assume a 16 bit main memory

Main memory address: Tag:13, Byte:3

If the Tag is in tag memory with a valid bit, then you can find it in cache

cache memory is still regular memory, but the tag is CAM

Because any block can reside anywhere in the cache, an Associative memory (CAM: Content Addressable Memory) is used. All locations are searched simultaneously.

If the CAM match logic (XOR gates) produces a 1, and the valid bit is a 1, it will enable the corresponding tristate buffer and clock in the cache block (each cache block has its own tristate buffer). The byte will be the selector for what to grab out of the block.

Each Cache has a set amount of transistors. To implement the match logic, you would have to take some from the cache memory, leaving less available cache memory.

Highest flexibility (Good)

Highest hardware overhead (bad)

Direct Mapped Cache: All the Main memory blocks from a given group can go into only one location in the cache, corresponding to the group number

The main memory is chopped up to fit into one cache line. Group number determines which cache line it will be stored in. One cache line = 8 bytes

Cache address: | 8 | 3 |

Main memory address: | 5 | 8 | 3 |

| tag | group | byte |

Now the cache needs to only examine the single group that its reference specifies

255 cache lines

Low flexibility

low overhead hardware

If two blocks from the same group are frequently referenced, then the cache will “thrash”. This is, repeatedly bringing the two competing blocks into and out of the cache causing performance degration.

2-Way Set Associative Cache: Two cache lines, 127 groups. Tag memory is now 6 bits

Cache group address: | 7 | 3 |

Main Memory address: | 6 | 7 | 3 |

| tag | set | Byte |

Decisions in Designing a 2-level hierarchy

Cache hit

Write through: updates both cache and MM upon each write
- either one or two data movements
Write back - updates only cache. Updates MM only upon block removal
- “dirty bit” is set upon first write to indicate

UNICORE PROCESSOR:

Separate cache for instruction and data (Harvard architecture) - L1
Stored unified L2 Cache on the Processor chip. Bigger but not as fast
Unified L3 cache - fast memory on the processor board near MM

MULTICORE PROCESSOR:

multiple cores = multiple caches
both L2 caches combine into L3 cache
Cache coherence problem
- write through cache (L1 and L2) to update the copy in L3 cache (write back cache)
Cache snooping
- snoops on L3 cache to see if the shared variable has been updated
- Core now knows its copies are stale, so it just deletes the variable copies out of cache by setting the valid bit flag to 0

Virtual Memory

Amount of main memory available in a machine varies; program/data might not fit into memory

Program contains hard-coded addresses; how to move a program to a different memory locaiton

solution: Virtual memory hides details of memory hierarchy providing the illusion of a flat address space
mean memory acts as a cache for the disk
Multiprogramming shares the processor among independent programs that are resident in main memory and thus available for execution
The virtual address for any process starts at $0000 0000 (in general)

Cpu puts out logical address, then reinterpreted as a virtual address

logical address - an address computed by the processor while executing a program. Synonymous with effective address

virtual address - the address generated from the logical address

Processor is not producing any physical addresses, it is producing virtual addresses for this specific process. The virtual memory address starts at $0000 0000

1 page table per user per program unit
1 translation per memory access
potentially large page table

You will have as many page tables as processes that are pulled up

Page tables are stored in the os

From the logical address → page lookup → grab the byte out of main memory if valid hit

This is slow because you access the page table (in main memory so like 1000 clock ticks) just to grab from main memory again (another 1000 clock ticks)

Added data cache and instruction cache

We add a page table cache to the processor hardware so lookup is extremely fast

page table lookup is done whenever the processor needs anything

Logical address is the flat address given by cpu

Virtual address is reinterpreted to be offset in page and page number, it is technically the same number

problems

dont know how much memory we have in our box
hardcoded addresses

use the main memory as a cache for the disk, so it becomes transparent to the processor

now the processor is running in a virtual address space, getting rid of the hardcoded addresses in the executable

Page table does physical to virtual address translation. Each processes has its own page table

to access instruction data we have instruction cache and data cache. Added a page table cache for quick lookup (translation lookaside buffer tlb).

CPU → virtual address → search tlb → miss in tlb → search page table →if miss inpage table → page fault, get page from disk and update Main memory, put the real physical address into page table.

When we update page table the first time, we dont update the tlb. If we want to update the tlb cache, on the cpu) you would have to send it over the data bus, however, the data bus is already taken because we are sending the instruction to the instruction cache

on our second go: CPU → virtual address → search tlb → miss in tlb → search page table → hit in page table → generate physical address → update TLB and search cache → cache hit → return value from cache

load 4kb into main memory if you have a page miss. have one entry for that page in the page table in the tlb. only load 16 words into cache

Most recently used virtual to physical address

MEMORIZE THESE !!!!!!!

Cache summary

alleviates performance gap between fast processor and slow main memory
associative cache: most flexible, but highest hardware overhead
direct cache: least flexible, but lowest hardware overhead
set associative cache: compromise between associative and direct cache ( closer to direct than associative)

Virtual memory

program still runs even if not enough main memory available
program can be put anywhere in memory although hard-coded address in program (program always starts at fixed address
need of a page table per process IN MAIN MEMORY (SLOW)
- table holds virtual-to-physical address translation (memorize)
need of a translation lookaside buffer (TLB) on processor chip to speed-up address translation
- TLB holds RECENT virtual-to-physical address translations
- TLB lookup much faster than page table lookup because page table resides in main memory while TBL is cache