Computer Systems and Organization
Course Outline
- Introduction
- LEGV8 instruction set architecture
- LEGV8 microarchitecture
- Pipelining
- Computer memory
- Physical memory
- Virtual memory
- Cache
- Input/Output
- Parallel processors
Assessment
- 20% Practical Assignment
- 80% Examination
Introduction, Computer Abstractions and Technology
- Computers are Everywhere
- Most aspects of our lives are dependent on or controlled by computers such as:
- Office work
- Driving a car
- Searching and accessing information on the World Wide Web
- Keeping up-to-date with your family and friends using smart phones
- Finding vaccines for COVID-19
- First programmable general-purpose electronic computer - Electronic Numerical Integrator and Computer (ENIAC)
- 1946
- 500 floating point operations per second (FLOPS)
- Most powerful computer in 2022 - Hewlett Packard Enterprise Frontier
- 1,685.65 PetaFLOPS (1 PetaFLOPS = 10^15 FLOPS)
Moore's Law
- Moore's Law: The number of transistors on microchips doubles every two years
Computer Classes
- Personal Computers
- Good performance for one user at a time
- Relatively cheap price
- Servers
- Run more powerful software
- Multi-user and usually accessed via network
- Higher reliability
- Supercomputers
- Highest performance and highest cost
- Usually used for high-end scientific and engineering applications
- Embedded Computers
- Brains of various types of systems and devices (e.g. televisions, cars, etc.)
- Stringent power/performance/cost constraints
Post-PC Era
- Personal Mobile Device (PMD)
- Battery operated
- Typically depends on a permanent internet connection
- Hundreds of dollars
- Smart phones, tablets, electronic glasses
- Cloud computing
- Replacing Servers
- Warehouse Scale Computers (WSC)
- Software as a Service (SaaS)
- Portion of software run on PC or PMD and a portion run in the Cloud
- Amazon, Google and Microsoft
Understanding Performance
- Algorithm
- Determines number of operations executed to achieve desired outcomes
- Programming language, compiler, architecture
- Determine number of machine instructions executed per operation
- Processor and memory system
- Determine how fast instructions are executed
- I/O system (including OS)
- Determines how fast I/O operations are executed
From Application Software to Machine Code
- High-level Language
- Details of computer hardware implementation abstracted away
- Higher productivity
- Portability
- Assembly Language
- Specific to processor used
- Machine instructions in text form
- Machine code
- Specific processor instructions
- Instructions and data encoded into bits
- Difficult for human to understand
Organisation of a Computer
- Inputting data
- Data to be processed or stored
- Control of computer
- eg keyboard, mouse
- Outputting data
- Processed or stored data
- Control outside devices
- eg display, printer
- Some devices provide both input and output, eg storage devices such as a HDD
- Memory
- Fast storage of data (permanent and/or temporary)
- Processor (CPU)
- Datapath: includes arithmetic, logic, registers, buses
- Control unit: orchestrates operation of the processor
- Cache memory
Manufacturing Integrated Circuits (ICs)
- Yield is the proportion of working dies per wafer
2nd Generation Intel® Core™ i7 Wafer
- 300 mm wafer, 280 chips, 32 nm technology
- Each chip is 20.7 x 10.5 mm
Instruction Set Architecture (ISA)
- Defines interface between the hardware and the lowest-level software
- Also simply referred to as the architecture of the processor
- Includes
- Machine code instructions
- Registers
- Memory access
- I/O access
- The same architecture may be implemented in different ways (with different costs and performance); for example, x86-64 is implemented in different ways by AMD and Intel. This is another example of abstraction
- The application binary interface (ABI) is the combination of the ISA and the operating system
- There are different performance measures, depending on the needs
- Response time or execution time
- How long it takes to do a task from start to finish (including I/O)
- This metric is most relevant for individual using a computer
- Throughput or bandwidth
- Total amount of work done per unit of time
- This metric is most relevant in servers
- These two measures are interrelated but distinct
- Replacing the processor with a faster one decreases execution time and also increases throughput
- Adding more processors to a system may decrease execution time but will definitely increase throughput
- Initially we will focus on execution time
- Performance = 1 / ExecutionTime
Measuring Execution Time
- Elapsed time
- Wall clock time- total response time
- Processing
- I/O
- OS overhead
- Idle time waiting for resources
- Determines system performance
- CPU time
- Time spent by CPU processing a given task
- Does not include I/O time, time taken by other tasks, or idle time
- Includes user and system CPU time
- Determines the CPU performance
- Clock speed
- Clock cycles per instruction (CPI)
- Number of instructions required
CPU Clock
- The operation of a synchronous digital system is governed by a constant rate clock
- Clock period
- duration of a clock cycle
- eg, 250ps = 0.25×10^−9 s = 250×10^−12 s
- Clock frequency (rate): cycles per second
- eg. 4.0GHz = 4000MHz = 4.0×10^9 Hz
- ClockFrequency = 1 / ClockPeriod
CPU Time
- CPUTime = CPUClockCycles × ClockCycleTime
- CPUClockCycles = CPUTime / ClockCycleTime
- ClockRate = 1 / ClockCycleTime
- Performance is improved by
- Reducing number of clock cycles
- Increasing clock rate
- Hardware designer must often trade-off clock rate against cycle count
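The relations above can be checked with a small Python sketch (the helper name is illustrative, not from the slides):

```python
# CPU time = clock cycles x clock cycle time = clock cycles / clock rate
def cpu_time(clock_cycles, clock_rate_hz):
    return clock_cycles / clock_rate_hz

# A program needing 20x10^9 cycles on a 2 GHz clock takes 10 s.
print(cpu_time(20e9, 2e9))
```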
CPU Time Example
- Computer A: 2GHz clock, 10s CPU time
- Designing Computer B:
- Aim for 6s CPU time
- Can increase the clock speed, but this increases the clock cycle count by a factor of 1.2
- How fast must Computer B clock be?
- Solution:
- CPUTime_B = ClockCycles_B / ClockRate_B
- CPUTime_B = (1.2 × ClockCycles_A) / ClockRate_B
- ClockCycles_A = CPUTime_A × ClockRate_A = 10s × 2GHz = 20×10^9
- (1.2 × 20×10^9) / ClockRate_B = 6s
- ClockRate_B = (1.2 × 20×10^9) / 6s = (24×10^9) / 6s = 4GHz
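The worked example can be re-derived numerically; a sketch in Python with illustrative variable names:

```python
# Computer A: 2 GHz clock, 10 s CPU time.
cpu_time_a = 10.0      # seconds
clock_rate_a = 2e9     # Hz

# B needs 1.2x the cycles of A and must finish in 6 s.
clock_cycles_a = cpu_time_a * clock_rate_a   # 20x10^9 cycles
clock_cycles_b = 1.2 * clock_cycles_a        # 24x10^9 cycles
clock_rate_b = clock_cycles_b / 6.0          # required clock rate in Hz

print(round(clock_rate_b / 1e9, 6), "GHz")
```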
Instruction Count and CPI
- ClockCycles = InstructionCount × CyclesPerInstruction (CPI)
- CPUTime = InstructionCount × CPI × ClockCycleTime = (InstructionCount × CPI) / ClockRate
- Instruction Count for a program is determined by
- Type of program
- Algorithm used
- Implementation of the algorithm
- Instruction Set Architecture (ISA)
- Programming language
- Compiler
- Average number of cycles per instruction is determined by
- CPU hardware
- Instruction mix of the executed program, if different instructions have different CPIs
CPI Example
- Computer A: Clock Speed = 4GHz, CPI = 2.0
- Computer B: Cycle Time = 500ps, CPI = 1.2
- Same ISA
- Which computer is faster, and by how much?
- CPUTime_A = InstructionCount × CPI_A × CycleTime_A = InstructionCount × 2.0 × 1/(4×10^9) s = InstructionCount × 0.5×10^−9 s
- CPUTime_B = InstructionCount × CPI_B × CycleTime_B = InstructionCount × 1.2 × 500×10^−12 s = InstructionCount × 600×10^−12 s
- CPUTime_A = InstructionCount × 500ps
- CPUTime_B = InstructionCount × 600ps
- CPUTime_B / CPUTime_A = 600ps / 500ps = 1.2
- Therefore computer A is faster by a factor of 1.2
CPI when Different Instructions have Different Number of Cycles
- In some CPUs, different instructions may take a different number of clock cycles to execute
- ClockCycles = ∑(CPI_i × InstructionCount_i)
- Weighted average CPI
- CPI = ClockCycles / InstructionCount = ∑(CPI_i × InstructionCount_i) / InstructionCount = ∑(CPI_i × (InstructionCount_i / InstructionCount))
CPI Example
- Comparing average CPI for program compiled in two different ways
- Compiled program makes use of three different classes of instructions: A, B and C
| Instruction class | CPI for class | Instruction count in compilation 1 | Instruction count in compilation 2 |
|---|---|---|---|
| A | 1 | 2 | 4 |
| B | 2 | 1 | 1 |
| C | 3 | 2 | 1 |
- Compilation 1: Instruction Count=5
- Clock Cycles = 2×1 + 1×2 + 2×3 = 10
- Average CPI=10/5=2.0
- Compilation 2: Instruction Count=6
- Clock Cycles = 4×1 + 1×2 + 1×3 = 9
- Average CPI=9/6=1.5
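The weighted-average CPI calculation can be expressed directly; a small sketch (function and dictionary names are mine):

```python
# Weighted average CPI = sum(CPI_i x IC_i) / total instruction count
def average_cpi(cpi_by_class, instruction_counts):
    cycles = sum(cpi_by_class[c] * n for c, n in instruction_counts.items())
    return cycles / sum(instruction_counts.values())

cpi_by_class = {"A": 1, "B": 2, "C": 3}
compilation_1 = {"A": 2, "B": 1, "C": 2}   # 5 instructions, 10 cycles
compilation_2 = {"A": 4, "B": 1, "C": 1}   # 6 instructions, 9 cycles
print(average_cpi(cpi_by_class, compilation_1))   # 2.0
print(average_cpi(cpi_by_class, compilation_2))   # 1.5
```

Note that compilation 2 executes more instructions yet fewer total cycles, which is why CPI alone is not a performance measure.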
Limitations imposed by Power Dissipation
- In CMOS IC technology
- Power = CapacitiveLoad × Voltage^2 × Frequency
- Voltage has been reduced from 5V to about 1V to compensate for increases in frequency
- Constrained by power dissipation, instruction-level parallelism, memory latency
- Multicore microprocessors
- More than one processor per chip
- Motherboard may also contain multiple processors
- Improves throughput in the case of multiple users/tasks
- To reduce execution time requires the explicit use of parallel programming
- This is different from instruction level parallelism, where hardware overlaps the execution of multiple instructions together and is hidden from programmer
- Parallel programming is in general a difficult problem
- Inherently requires programmer to program for performance (not just correctness)
- Need to split execution of algorithm in parts with roughly equal execution time (load balancing)
- Avoiding waste of time in exchanging and synchronising data between processors
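The CMOS power equation earlier in this section can be checked numerically (the numbers below are illustrative, not measured values):

```python
# Dynamic power in CMOS: P = C x V^2 x f
def dynamic_power(capacitive_load, voltage, frequency):
    return capacitive_load * voltage ** 2 * frequency

# Dropping the supply voltage from 5 V to 1 V cuts power 25x at the
# same capacitance and frequency, because of the quadratic V term.
ratio = dynamic_power(1.0, 5.0, 1.0) / dynamic_power(1.0, 1.0, 1.0)
print(ratio)
```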
SPEC CPU Benchmark
- Need to compare performance of different computers
- Difficult to achieve since different workloads behave differently
- Standard Performance Evaluation Corp (SPEC) develops benchmarks for CPU, I/O, Web, Virtualisation, Storage etc
- SPEC CPU 2017
- Designed to provide performance measurements that can be used to compare compute-intensive workloads on different computer systems
- Contains 43 benchmarks organized into four suites
- SPEC speed 2017 Integer
- SPEC speed 2017 Floating Point
- SPECrate 2017 Integer
- SPECrate 2017 Floating Point
- Also includes an optional metric for measuring energy consumption
Amdahl's Law
- The overall performance improvement gained by optimising a single part of a system is limited by the fraction of time that the improved part is actually used
- T_improved = T_affected / ImprovementFactor + T_unaffected
- Example: In a program that executes for 100s, multiplication operations take a total of 80s. How much improvement is required in the multiplication operation to get an overall speed improvement of 5 times?
- Solution:
- For a five times speed improvement, T_improved = 100/5 = 20s
- T_affected = 80s and T_unaffected = 100 − 80 = 20s
- If the improvement factor required is n, then 20 = 80/n + 20
- This gives 80/n = 0, which is impossible: no matter how much the multiplication operation is improved, a 5× overall speed-up cannot be achieved
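Amdahl's Law is easy to explore numerically; a sketch with illustrative names:

```python
# Amdahl's Law: T_improved = T_affected / factor + T_unaffected
def improved_time(t_affected, t_unaffected, factor):
    return t_affected / factor + t_unaffected

# Even an enormous improvement of the 80 s multiplication part cannot
# push the total below the 20 s unaffected part, so a 5x overall
# improvement (100 s -> 20 s) is unreachable.
print(improved_time(80.0, 20.0, 4))     # 4x faster multiply: 40 s total
print(improved_time(80.0, 20.0, 1e9))   # asymptotic limit near 20 s
```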
Register Operands
- Arithmetic instructions use register operands or immediate operands only
- LEGV8 has a 32 x 64-bit register file
- Used for frequently accessed data
- 64-bit data is called a "double word"
- 32-bit data is called a "word"
- 31 x 64-bit general purpose registers (X0-X30)
- 31 x 32-bit general purpose subregisters (W0-W30)
- XZR (register 31) 64-bit register is always equal to 0
- Design Principle 2: Smaller is faster
- The register file is small compared to main memory (with millions or billions of locations)
LEGV8 Registers
- X0-X7: procedure arguments/results
- X8: indirect result location register
- X9-X15: temporary storage
- X16-X17 (IP0-IP1): may be used by linker as a scratch register, otherwise a temporary register
- X18: platform register for platform independent code; otherwise a temporary register
- X19-X27: saved
- X28 (SP): stack pointer
- X29 (FP): frame pointer
- X30 (LR): link register (return address)
- X31 (XZR): the constant value 0
Memory Addresses
- In LEGv8 (like most other modern CPUs) memory is byte-addressable
- Therefore, consecutive double words have an address that differs by 8
Memory Operands
- Main memory used for complex and large amount of data
- Arrays, structures, dynamic data
- To apply arithmetic operations on data in memory
- Load values from memory into registers
- Store result from register back to memory
- Memory is byte addressed
- Each address identifies an 8-bit byte
- LEGV8 does not require words to be aligned in memory, except for instructions and the stack
- Memory is accessed using a base address (stored in a register) and an offset
- This is called base addressing
LEGV8 Instructions to Access Memory
- LDUR - LoaD Unscaled Register
- Transfers data from memory to register
- STUR - STore Unscaled Register
- Both take three operands
- First operand is the destination/source register
- Second operand is the base address of the data structure in memory
- Third operand is the offset within the data structure of the required element
- A 64-bit double word is transferred in both cases
Memory Operand Example
- C code (array A contains 64-bit integers):
- A[12] = h + A[8];
- Compiled LEGv8 code:
- //h in X21, base address of A in X22
- LDUR X9, [X22, #64] // Offset is for double word (8 bytes), 8 x 8 = 64
- ADD X9, X21, X9 // Notice that arithmetic is still performed with registers
- STUR X9, [X22, #96] // 12 x 8=96
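The byte addressing in this example can be modelled in software; a sketch where the base address and the value of h are assumed for illustration:

```python
# A holds 64-bit integers, so element A[i] lives at base + 8 * i.
memory = {}
base_of_A = 0x1000   # assumed base address of A (the value in X22)
h = 5                # assumed value of h (the value in X21)

def stur(value, base, offset):
    memory[base + offset] = value   # store a double word

def ldur(base, offset):
    return memory[base + offset]    # load a double word

stur(7, base_of_A, 8 * 8)        # assume A[8] == 7 beforehand
x9 = ldur(base_of_A, 64)         # LDUR X9, [X22, #64]
x9 = h + x9                      # ADD  X9, X21, X9
stur(x9, base_of_A, 96)          # STUR X9, [X22, #96]
print(ldur(base_of_A, 96))       # A[12] == 12
```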
Registers vs Memory
- Registers are much faster to access than memory
- Operating on memory data requires loads and stores
- More instructions need to be executed
- Given the limited number of registers, memory still has to be used
- Compiler must use registers for variables as much as possible
- Only spill to memory for less frequently used variables
- Register optimization is important
- Constant data specified in an instruction
- ADDI X22, X22, #4 // X22 = X22 + 4
- SUBI X23, X22, #30 // X23 = X22 - 30
- This is called immediate addressing
- Design Principle 3: Make the common case fast
- Small constants are common
- Immediate operand avoids a load instruction from memory
Representing Instructions
- Instructions are encoded in binary
- Binary representation is called machine code
- LEGv8 instructions
- Encoded as 32-bit instruction words
- Small number of formats encoding operation code (opcode), register numbers,
- Instruction fields
- opcode: operation code (determines the type of operation to perform)
- Rm: the second register source operand
- shamt: shift amount (00000 for non-shift)
- Rn: the first register source operand
- Rd: the register destination
- Notice that the 5 bits allocated to the register fields can address the 2^5 = 32 registers
- ADD X9, X20, X21
- Binary: 10001011000101010000001010001001
- Load/store instructions
- Rn: base register
- address: constant offset from contents of base register
- ±2^8 byte offsets, i.e. ±2^5 double-word offsets
- giving a range of ±32 double-word offsets
- Rt: destination (load) or source (store) register number
- Design Principle 4: Good design demands good compromises
- Different formats complicate decoding, but allow 32-bit instructions uniformly
- Keep formats as similar as possible
- Immediate instructions
- Rn: source register
- Rd: destination register
- immediate: immediate value
- Immediate field is zero-extended
- The opcode value dictates which format the instruction belongs to
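As a check on the R-format example above, the fields can be packed in software. This sketch assumes the field order opcode/Rm/shamt/Rn/Rd with widths 11/5/6/5/5 bits, as in the ADD example:

```python
def encode_r(opcode, rm, shamt, rn, rd):
    # Pack R-format fields: opcode(11) | Rm(5) | shamt(6) | Rn(5) | Rd(5)
    return (opcode << 21) | (rm << 16) | (shamt << 10) | (rn << 5) | rd

# ADD X9, X20, X21 (ADD opcode is 10001011000)
word = encode_r(0b10001011000, 21, 0, 20, 9)
print(f"{word:032b}")
```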
Logical Operations
- Instructions for bitwise manipulation
| Operation | C | Java | LEGv8 |
|---|---|---|---|
| Shift left | << | << | LSL |
| Shift right | >> | >>> | LSR |
| Bit-by-bit AND | & | & | AND, ANDI |
| Bit-by-bit OR | \| | \| | ORR, ORRI |
| Bit-by-bit XOR | ^ | ^ | EOR, EORI |
- Useful for extracting, inserting and manipulating groups of bits in a word
Shift Operations
- shamt: how many positions to shift
- Logical shift left (LSL)
- Shift left and fill with 0 bits
- LSL by s is equivalent to a multiplication by 2^s
- Logical shift right (LSR)
- Shift right and fill with 0 bits
- LSR by s bits is equivalent to a division by 2^s (unsigned only)
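The shift/arithmetic equivalences can be verified directly (illustrative values):

```python
x = 13
assert x << 3 == x * 2 ** 3        # LSL by 3 multiplies by 2^3 = 8
assert 100 >> 2 == 100 // 2 ** 2   # LSR by 2 divides by 4 (unsigned only)
print(x << 3, 100 >> 2)
```

For signed negative values a logical right shift does not implement division, which is why the slide restricts the LSR equivalence to unsigned data.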
Logical AND Operations
- Useful for masking bits in a word
- Select some bits (where the mask is 1) and clear the others to 0
- AND X9, X10, X11 // X9 = X10 & X11 (X11 is the mask in this case)
Logical OR Operations
- Useful to set some bits to 1, leaving the others unchanged
- ORR X9, X10, X11 // X9 = X10 | X11
Logical XOR operations
- The EOR instruction performs a bit-wise XOR operation between two operands
- This instruction is useful to toggle some or all bits
- EOR X9, X10, X12 // X9 = X10 XOR X12
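The three logical operations above behave as follows on an illustrative 8-bit value:

```python
value = 0b10110110
assert value & 0b00001111 == 0b00000110   # AND: keep low nibble, clear rest
assert value | 0b00000001 == 0b10110111   # ORR: set bit 0, leave others
assert value ^ 0b11111111 == 0b01001001   # EOR: toggle all eight bits
print("ok")
```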
Conditional Operations
- Branch to a label if a condition is true, otherwise, continue sequentially
- Three conditional branch (CB) instructions are available
- CBZ register, L // if register == 0 branch to instruction labeled L
- CBNZ register, L // if register != 0 branch to instruction labeled L
- B.condition LL // if condition is true branch to instruction labeled LL
Subtract set Flags
- Condition codes, or flags, are set following an arithmetic instruction with an S suffix
- ADDS
- ADDIS
- ANDS
- ANDIS
- SUBS
- SUBIS
- Flags
- negative (N) result had 1 in MSB
- zero (Z) result was 0
- overflow (V) result overflowed
- carry (C) result had carryout from MSB
- If operation is ADDS or ANDS the test is simply on the result of the operation as compared to zero.
- For AND, C and V are always set to 0
Unconditional Branch Operations
- Three types of unconditional branch instructions
- B L // Branch to instruction labeled L
- BR register // Branch to address contained in register
- BL L1 // Branch with link: X30 = PC + 4 & PC = L1
- The BR and BL instructions are useful for procedure calls
- BL L1 calls the procedure labelled L1, making register X30 (also called the link register, LR) point to the next instruction following the procedure call
- To return from procedure, the instruction BR LR is used
- Note that this method alone cannot be used for nested procedure calls, since a nested BL would overwrite the return address in X30; the link register must be saved on the stack
Compiling If Statements
- C code
- if (i == j) f = g + h;
- else f = g - h;
- Compiled LEGv8 code:
- //f,g,h in X19, X20, X21, i,j in X22, X23 respectively
- SUB X9, X22, X23 // X9 = i - j
- CBNZ X9, Else // Branch if i != j
- ADD X19, X20, X21 // f = g+h
- B Exit // Skip the next instruction
- Else: SUB X19, X20, X21 // f = g-h
- Exit:
Compiling Loop Statements
- i = 0;
- while (A[i] != k) i = i + 1;
- Compiled LEGv8 code:
- //i in X22, k in X24, address of A in X25
- ADD X22, XZR, XZR // i = 0
- Loop: LSL X10, X22, #3 // Multiply i by 8 to address double words
- ADD X10, X25, X10 // X10 = address of A[i]
- LDUR X11, [X10, #0] // X11 = A[i]
- SUB X11, X11, X24 // X11 = A[i] - k
- CBNZ X11, Exit // if A[i] != k exit loop
- ADDI X22, X22, #1 // otherwise increment i
- B Loop // Loop back
- Exit:
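The compiled loop above implements the following search; an executable sketch (it assumes, like the C code, that k occurs somewhere in A):

```python
def find_index(A, k):
    i = 0
    while A[i] != k:   # CBNZ exits when A[i] - k == 0
        i = i + 1      # ADDI X22, X22, #1
    return i

print(find_index([4, 9, 7, 2], 7))   # 2
```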
Memory Layout
- Text: program code
- Static data: global variables
- eg static variables in C, constant arrays and strings
- Dynamic data (heap)
- Eg malloc in C, new in Java
- Stack: automatic storage
Non-Leaf Procedure Example
- The following example uses a recursive function to calculate the factorial of a number n
- C code
- int fact (int n) {
- if (n < 1)
- return 1;
- else
- return n * fact(n-1);
- }
- Argument n in X0
- Result in X1
Non-Leaf Procedure Example (Continued)
- LEGv8 code:
- fact:
- STP X0, LR, [SP, #-16]! // Push n and the link register onto the stack
- SUBIS XZR, X0, #1 // Test n < 1
- B.GE L1 // If n >= 1, go to the recursive case
- ADDI X1, XZR, #1 // Base case: result = 1
- ADDI SP, SP, #16 // Pop the stack
- BR LR // Return
- L1: SUBI X0, X0, #1 // Argument n - 1
- BL fact // Recursive call: X1 = fact(n - 1)
- LDP X0, LR, [SP], #16 // Restore n and the link register
- MUL X1, X0, X1 // Result = n * fact(n - 1)
- BR LR // Return
Basic Blocks
- A basic block is a sequence of instructions with
- No embedded branches (except at end)
- No branch targets (except at beginning)
- A compiler identifies basic blocks for optimization
- An advanced processor can accelerate execution of basic blocks
Byte/Halfword Operations
- Load byte
- LDURB Rt, [Rn, offset]
- Load the byte at Rn+offset and zero-extend it to fill Rt (the byte goes into the rightmost 8 bits of Rt)
- Store byte
- STURB Rt, [Rn, offset]
- Store just rightmost byte
- Load half word
- LDURH Rt, [Rn, offset]
- Load the half word at Rn+offset and zero-extend it to fill Rt (the half word goes into the rightmost 16 bits of Rt)
- Store half word
- STURH Rt, [Rn, offset]
- Store just rightmost half word
Constants larger than 12-bit
- Most constants are small
- A 12-bit immediate is normally sufficient
- For the occasional 32-bit (or larger) constant or address use:
- MOVZ: move wide with zeros (remaining bits are set to zero)
- MOVK: move wide with keep (remaining bits are kept the same)
- 16 bits of the constant are loaded into the register
- A second (shift) operand specifies where the 16 bits are loaded
- Allowed shift values are 0, 16, 32 or 48
- Multiple MOVZ/MOVK instructions can be used to load constants with more than 16 bits
Example: Loading a 64-bit Constant in a Register
- Suppose that we wish to load the constant 0xFF00 1020 3040 5060 into register X9
- To do this we need to split the constant into 16-bit numbers and load these separately using MOVZ and MOVK instructions
- MOVZ X9, #0x5060, LSL #0 // X9 = 0x5060
- MOVK X9, #0x3040, LSL #16 // X9 = 0x30405060
- MOVK X9, #0x1020, LSL #32 // X9 = 0x102030405060
- MOVK X9, #0xFF00, LSL #48 // X9 = 0xFF00102030405060
- This format is used for the MOVZ and MOVK instructions
- The LSL value is encoded as a 2-bit sub-field as part of the op-code
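The MOVZ/MOVK sequence can be emulated to confirm that each instruction deposits one 16-bit chunk (helper names are mine):

```python
MASK64 = (1 << 64) - 1

def movz(imm16, shift):
    return (imm16 << shift) & MASK64          # all other bits become zero

def movk(reg, imm16, shift):
    hole = ~(0xFFFF << shift) & MASK64        # clear the target 16 bits
    return (reg & hole) | (imm16 << shift)    # keep the remaining bits

x9 = movz(0x5060, 0)
x9 = movk(x9, 0x3040, 16)
x9 = movk(x9, 0x1020, 32)
x9 = movk(x9, 0xFF00, 48)
print(hex(x9))   # 0xff00102030405060
```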
Branch Addressing
- B-format (B - Branch)
- B 10500 // to location PC + 10500
- CB-format (CB-Conditional Branch)
- CBZ X19, Exit // go to Exit if (X19 == 0)
- Both addresses are PC-relative (an offset from the current program position)
- Branch address = PC + offset
- The offset is calculated by the assembler
Conditional Branch Instruction B.cond
LEGV8 Addressing Summary
Synchronization between two Processors
- When writing parallel programs, different processors must communicate results
- Synchronization is required between processors
- Two processors sharing an area of memory
- Process 1 (P1) writes and Process 2 (P2) reads
- Data race if P1 and P2 do not synchronize
- Hardware support required to synchronize processes
- Atomic read/write memory operation
- No other access to the memory location allowed between the read and write
- Could be a single instruction
- e.g. an atomic swap between a register and memory
- Or an atomic pair of instructions
Synchronization in LEGV8
- Load exclusive register: LDXR
- Store exclusive register: STXR
- To use:
- Execute LDXR then STXR with same address (not necessarily immediately one after the other)
- If there is an intervening change to the address, the store fails (communicated with an additional output parameter)
- Only use register instructions in between
- Any time a processor intervenes and modifies the value in memory between the LDXR and STXR instructions, the STXR returns 1 in the indicated register; otherwise it returns 0
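The LDXR/STXR retry idiom can be sketched in software. This is a deliberately simplified, single-threaded model (class and method names are mine): the store "fails" whenever the watched location's value changed after the load, which approximates but does not exactly match the hardware's address-monitor behaviour:

```python
class Memory:
    def __init__(self):
        self.data = {}
        self.reserved = None   # (address, value) watched since last ldxr

    def ldxr(self, addr):
        value = self.data.get(addr, 0)
        self.reserved = (addr, value)
        return value

    def stxr(self, addr, value):
        ok = self.reserved == (addr, self.data.get(addr, 0))
        if ok:
            self.data[addr] = value
        self.reserved = None
        return 0 if ok else 1   # 0 = success, like STXR's status output

mem = Memory()
old = mem.ldxr(0x20)
assert mem.stxr(0x20, old + 1) == 0   # no intervening write: succeeds

old = mem.ldxr(0x20)
mem.data[0x20] = 99                   # another processor intervenes
assert mem.stxr(0x20, old + 1) == 1   # store fails; software must retry
```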
Synchronization Example
- Example 1: Atomic swap. This process needs to swap value in X23 with a second processor using memory location pointed to by X20
- Example 2: Lock
Translation and Startup
LEGV8 Multiplication
- Three multiply instructions:
- MUL: multiply
- SMULH: signed multiply high
- uses the upper 64 bits of the 128-bit product; the operands are signed
- UMULH: unsigned multiply high
- LEGV8 multiply instructions do not set the overflow condition code
- it is up to the software to check if the product is too big to fit in 64 bits
LEGV8 Division
- Two instructions:
- SDIV (signed)
- UDIV (unsigned)
- Both instructions ignore overflow and division-by-zero
Floating Point Arithmetic Hardware
- Floating point adder
- Much more complex than integer adder or multiplier
- Floating point multiplier is of similar complexity
- Floating point arithmetic hardware usually does
- Addition, subtraction, multiplication, division, reciprocal square root
- Floating point to integer (and vice versa) conversion
- Doing it in one clock cycle would take too long
- Much longer than integer operations
- Slower clock would penalize all instructions
- Floating point operations therefore usually take multiple clock cycles