Memory System Design
Problem 1: If you put everything into a box, it will work, but, you don’t have any performance in your laptop. Speed
Problem 2: Flexibility. Your application has to be able to run on older laptops.
Processor is much faster than Main memory
Solution: insert fast cache memory between processor and main memory
Amount of main memory available in a machine varies; program/data might not fit into memory
Program contains hard-coded address; How to move a program to a different memory location?
Solution: Virtual Memory hides details of memory hierarchy providing the illusion of a flat address space
Component | CPU | Main Memory | Disk memory | tape memory |
Access | Random = constant time to access any memory | Random | Direct | Sequential |
Capacity | 64-1024B | <512GB | >1TB | many TB |
Latency | 0.15-0.3ns | 30-200ns | 5ms | <10s |
Block Size | CPU→Main memory = 1 word | Main Memory → Disk memory = 4kb | Disk Memory → tape memory = KB/MB | |
Bandwidth | 100-1000GB/s | 5-20GB/s | 0.05-0.5GB/s | 0.001GB/s |
Cost | ***** (high) (uses SRAMS and flip flops because they have to be “blazing fast”) | *** (Uses DRAM) | ** | * (low) |
The longer a signal has to travel, geographically, the longer the latency
how much information a network can transfer at a time = bandwidth
Main memory is 500x slower than cpu memory because it has to use DRAM
the further you get away from the processor, the larger and slower the memory
Your system is only as fast as your weakest link. This is why we have a problem if you put it all in a box. Your process can be however fast but it will have to wait on the memory.
Temporal and Spatial locality
Working set.
Many programs involve loops, as long as you are in a loop, the temporal locality for a variable or operation in that loop will be a 1. The same can be said for functions in loops
Instructions are stored sequentially in memory, so spatial locality for the next space above your current operand is high. Variables are also stored close in memory on the stack
then on average, our programs will have pretty high temporal and spatial locality
Working set - small area of memory cells that don’t change much over time. Program will be very fast as long as that working set fits into the Cache
MAIN MEMORY IS SLOW BECAUSE OF DRAM
Whether an item is in Cache or not is a Cache hit or miss (whether it is in cache or main memory)
If we have a cache hit, we have to translate the address on the address bus to the corresponding cache address
We have a MMU (Memory management unit) that decides whether we have a cache hit or cache miss → (miss) send the address the processor provides to the memory and send that to the cache. If a hit, MMU translates the address to the cache chip via mapping tables
If programs has a high temporal and spacial locality, the cache will speed everything up. Working set will also be small. If working set fits into cache, that will speed everything up.
PAGING: Block→lookup table→block
word→ word
Segmentation: Block→lookup table→base address + = primary address
word → ^
segmentation has an addition, so it takes longer than paging
paging is a little less flexible but it is a lot faster, so it’s used in cache
Segmentation will be used in your operating system
Hit: the word was found at the level from which it was requested
Miss: not found in cache
Hit ratio = h = num of hits / total num of references
Miss ratio: 1-hit ratio
tp = primary memory access time. ts = secondary memory access time
access time, Ta = h*tp + (1-h)*ts
Page: commonly, a disk book. Page fault: synonymous with a miss.
Demand paging: pages are moved from disk to main memory only when a word in the page is requested by the processor
Block placement and replacement decisions must be made each time a block is moved
Block-size: block transfer efficiency and miss ratio will be affected. The more size in a cache block, the less lines you will half. If you cut the block size in half, you will have more cache lines. Both decrease hit ratio. It’s a trade off
How did they get to 16 words? - simulations
THERE IS NO EMPTY MEMORY, JUST UNINITIALIZED MEMORY
Associative mapped cache model: any block from main memory can be put anywhere in the cache. Assume a 16 bit main memory
Main memory address: Tag:13, Byte:3
If the Tag is in tag memory with a valid bit, then you can find it in cache
cache memory is still regular memory, but the tag is CAM
Because any block can reside anywhere in the cache, an Associative memory (CAM: Content Addressable Memory) is used. All locations are searched simultaneously.
If the CAM match logic (XOR gates) produces a 1, and the valid bit is a 1, it will enable the corresponding tristate buffer and clock in the cache block (each cache block has its own tristate buffer). The byte will be the selector for what to grab out of the block.
Each Cache has a set amount of transistors. To implement the match logic, you would have to take some from the cache memory, leaving less available cache memory.
Highest flexibility (Good)
Highest hardware overhead (bad)
Direct Mapped Cache: All the Main memory blocks from a given group can go into only one location in the cache, corresponding to the group number
The main memory is chopped up to fit into one cache line. Group number determines which cache line it will be stored in. One cache line = 8 bytes
Cache address: | 8 | 3 |
Main memory address: | 5 | 8 | 3 |
| tag | group | byte |
Now the cache needs to only examine the single group that its reference specifies
255 cache lines
Low flexibility
low overhead hardware
If two blocks from the same group are frequently referenced, then the cache will “thrash”. This is, repeatedly bringing the two competing blocks into and out of the cache causing performance degration.
2-Way Set Associative Cache: Two cache lines, 127 groups. Tag memory is now 6 bits
Cache group address: | 7 | 3 |
Main Memory address: | 6 | 7 | 3 |
| tag | set | Byte |
Decisions in Designing a 2-level hierarchy
Cache hit
Write through: updates both cache and MM upon each write
either one or two data movements
Write back - updates only cache. Updates MM only upon block removal
“dirty bit” is set upon first write to indicate
UNICORE PROCESSOR:
Separate cache for instruction and data (Harvard architecture) - L1
Stored unified L2 Cache on the Processor chip. Bigger but not as fast
Unified L3 cache - fast memory on the processor board near MM
MULTICORE PROCESSOR:
multiple cores = multiple caches
both L2 caches combine into L3 cache
Cache coherence problem
write through cache (L1 and L2) to update the copy in L3 cache (write back cache)
Cache snooping
snoops on L3 cache to see if the shared variable has been updated
Core now knows its copies are stale, so it just deletes the variable copies out of cache by setting the valid bit flag to 0
Virtual Memory
Amount of main memory available in a machine varies; program/data might not fit into memory
Program contains hard-coded addresses; how to move a program to a different memory locaiton
solution: Virtual memory hides details of memory hierarchy providing the illusion of a flat address space
mean memory acts as a cache for the disk
Multiprogramming shares the processor among independent programs that are resident in main memory and thus available for execution
The virtual address for any process starts at $0000 0000 (in general)
Cpu puts out logical address, then reinterpreted as a virtual address
logical address - an address computed by the processor while executing a program. Synonymous with effective address
virtual address - the address generated from the logical address
Processor is not producing any physical addresses, it is producing virtual addresses for this specific process. The virtual memory address starts at $0000 0000
1 page table per user per program unit
1 translation per memory access
potentially large page table
You will have as many page tables as processes that are pulled up
Page tables are stored in the os
From the logical address → page lookup → grab the byte out of main memory if valid hit
This is slow because you access the page table (in main memory so like 1000 clock ticks) just to grab from main memory again (another 1000 clock ticks)
Added data cache and instruction cache
We add a page table cache to the processor hardware so lookup is extremely fast
page table lookup is done whenever the processor needs anything
Logical address is the flat address given by cpu
Virtual address is reinterpreted to be offset in page and page number, it is technically the same number
problems
dont know how much memory we have in our box
hardcoded addresses
use the main memory as a cache for the disk, so it becomes transparent to the processor
now the processor is running in a virtual address space, getting rid of the hardcoded addresses in the executable
Page table does physical to virtual address translation. Each processes has its own page table
to access instruction data we have instruction cache and data cache. Added a page table cache for quick lookup (translation lookaside buffer tlb).
CPU → virtual address → search tlb → miss in tlb → search page table →if miss inpage table → page fault, get page from disk and update Main memory, put the real physical address into page table.
When we update page table the first time, we dont update the tlb. If we want to update the tlb cache, on the cpu) you would have to send it over the data bus, however, the data bus is already taken because we are sending the instruction to the instruction cache
on our second go: CPU → virtual address → search tlb → miss in tlb → search page table → hit in page table → generate physical address → update TLB and search cache → cache hit → return value from cache
load 4kb into main memory if you have a page miss. have one entry for that page in the page table in the tlb. only load 16 words into cache
Most recently used virtual to physical address
MEMORIZE THESE !!!!!!!
Cache summary
alleviates performance gap between fast processor and slow main memory
associative cache: most flexible, but highest hardware overhead
direct cache: least flexible, but lowest hardware overhead
set associative cache: compromise between associative and direct cache ( closer to direct than associative)
Virtual memory
program still runs even if not enough main memory available
program can be put anywhere in memory although hard-coded address in program (program always starts at fixed address
need of a page table per process IN MAIN MEMORY (SLOW)
table holds virtual-to-physical address translation (memorize)
need of a translation lookaside buffer (TLB) on processor chip to speed-up address translation
TLB holds RECENT virtual-to-physical address translations
TLB lookup much faster than page table lookup because page table resides in main memory while TBL is cache