Computer Systems and Organization
Course Outline
- Introduction
- LEGV8 instruction set architecture
- LEGV8 microarchitecture
- Pipelining
- Computer memory
- Physical memory
- Virtual memory
- Cache
- Input/Output
- Parallel processors
Assessment
- 20% Practical Assignment
- 80% Examination
Introduction, Computer Abstractions and Technology
- Computers are Everywhere
- Most aspects of our lives are dependent on or controlled by computers such as:
- Office work
- Driving a car
- Searching and accessing information on the World Wide Web
- Keeping up-to-date with your family and friends using smart phones
- Finding vaccines for COVID-19
- First programmable general-purpose electronic computer - Electronic Numerical Integrator and Computer (ENIAC)
- 1946
- 500 floating point operations per second (FLOPS)
- Most powerful computer in 2022 - Hewlett Packard Enterprise Frontier
- 1,685.65 PetaFLOPS (1 PetaFLOPS = 10^15 FLOPS)
Moore's Law
- Moore's Law: The number of transistors on microchips doubles every two years
Computer Classes
- Personal Computers
- Good performance for one user at a time
- Relatively cheap price
- Servers
- Run more powerful software
- Multi-user and usually accessed via network
- Higher reliability
- Supercomputers
- Highest performance and highest cost
- Usually used for high-end scientific and engineering applications
- Embedded Computers
- Brains of various types of systems and devices (e.g. televisions, cars, etc.)
- Stringent power/performance/cost constraints
Post-PC Era
- Personal Mobile Device (PMD)
- Battery operated
- Typically depends on a permanent internet connection
- Hundreds of dollars
- Smart phones, tablets, electronic glasses
- Cloud computing
- Replacing Servers
- Warehouse Scale Computers (WSC)
- Software as a Service (SaaS)
- Portion of software run on PC or PMD and a portion run in the Cloud
- Amazon, Google and Microsoft
Understanding Performance
- Algorithm
- Determines number of operations executed to achieve desired outcomes
- Programming language, compiler, architecture
- Determine number of machine instructions executed per operation
- Processor and memory system
- Determine how fast instructions are executed
- I/O system (including OS)
- Determines how fast I/O operations are executed
From Application Software to Machine Code
- High-level Language
- Details of computer hardware implementation abstracted away
- Higher productivity
- Portability
- Assembly Language
- Specific to processor used
- Machine instructions in text form
- Machine code
- Specific processor instructions
- Instructions and data encoded into bits
- Difficult for human to understand
Organisation of a Computer
- Inputting data
- Data to be processed or stored
- Control of computer
- eg keyboard, mouse
- Outputting data
- Processed or stored data
- Control outside devices
- eg display, printer
- Some devices provide both input and output, eg storage devices such as a HDD
- Memory
- Fast storage of data (permanent and/or temporary)
- Processor (CPU)
- Datapath: includes arithmetic, logic, registers, buses
- Control unit: orchestrates operation of the processor
- Cache memory
Manufacturing Integrated Circuits (ICs)
- Yield is the proportion of working dies per wafer
2nd Generation Intel® Core™ i7 Wafer
- 300 mm wafer, 280 chips, 32 nm technology
- Each chip is 20.7 x 10.5 mm
Instruction Set Architecture (ISA)
- Defines interface between the hardware and the lowest-level software
- Also simply referred to as the architecture of the processor
- Includes
- Machine code instructions
- Registers
- Memory access
- I/O access
- The same architecture may be implemented in different ways (with different costs and performance); for example, x86-64 is implemented in different ways by AMD and Intel. This is another example of abstraction
- The application binary interface (ABI) is the combination of the ISA and the operating system
- There are different performance measures, depending on the needs
- Response time or execution time
- How long it takes to do a task from start to finish (including I/O)
- This metric is most relevant for individual using a computer
- Throughput or bandwidth
- Total amount of work done per unit of time
- This metric is most relevant in servers
- These two measures are interrelated but distinct
- Replacing the processor with a faster one decreases execution time and also increases throughput
- Adding more processors to a system may decrease execution time but will definitely increase throughput
- Initially we will focus on execution time
- Performance = 1 / ExecutionTime
Measuring Execution Time
- Elapsed time
- Wall clock time- total response time
- Processing
- I/O
- OS overhead
- Idle time waiting for resources
- Determines system performance
- CPU time
- Time spent by CPU processing a given task
- Does not include I/O time, time taken by other tasks, or idle time
- Includes user and system CPU time
- Determines the CPU performance
- Clock speed
- Clock cycles per instruction (CPI)
- Number of instructions required
CPU Clock
- The operation of a synchronous digital system is governed by a constant rate clock
- Clock period
- duration of a clock cycle
- eg, 250ps = 0.25×10^−9 s = 250×10^−12 s
- Clock frequency (rate): cycles per second
- eg. 4.0GHz = 4000MHz = 4.0×10^9 Hz
- ClockFrequency = 1 / ClockPeriod
CPU Time
- CPUTime = CPUClockCycles × ClockCycleTime
- CPUClockCycles = CPUTime / ClockCycleTime
- ClockRate = 1 / ClockCycleTime
- Performance is improved by
- Reducing number of clock cycles
- Increasing clock rate
- Hardware designer must often trade-off clock rate against cycle count
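The relations above can be checked with a small Python sketch (the helper name is illustrative, not from the slides):

```python
# CPU time = clock cycles x clock cycle time = clock cycles / clock rate
def cpu_time(clock_cycles, clock_rate_hz):
    return clock_cycles / clock_rate_hz

# A program needing 20x10^9 cycles on a 2 GHz clock takes 10 s.
print(cpu_time(20e9, 2e9))
```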
CPU Time Example
- Computer A: 2GHz clock, 10s CPU time
- Designing Computer B:
- Aim for 6s CPU time
- Can increase the clock speed, but this increases the clock cycle count by a factor of 1.2
- How fast must Computer B clock be?
- Solution:
- CPUTime_B = ClockCycles_B / ClockRate_B
- CPUTime_B = (1.2 × ClockCycles_A) / ClockRate_B
- ClockCycles_A = CPUTime_A × ClockRate_A = 10s × 2GHz = 20×10^9
- (1.2 × 20×10^9) / ClockRate_B = 6s
- ClockRate_B = (1.2 × 20×10^9) / 6s = (24×10^9) / 6s = 4GHz
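The worked example can be re-derived numerically; a sketch in Python with illustrative variable names:

```python
# Computer A: 2 GHz clock, 10 s CPU time.
cpu_time_a = 10.0      # seconds
clock_rate_a = 2e9     # Hz

# B needs 1.2x the cycles of A and must finish in 6 s.
clock_cycles_a = cpu_time_a * clock_rate_a   # 20x10^9 cycles
clock_cycles_b = 1.2 * clock_cycles_a        # 24x10^9 cycles
clock_rate_b = clock_cycles_b / 6.0          # required clock rate in Hz

print(round(clock_rate_b / 1e9, 6), "GHz")
```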
Instruction Count and CPI
- ClockCycles = InstructionCount × CyclesPerInstruction (CPI)
- CPUTime = InstructionCount × CPI × ClockCycleTime = (InstructionCount × CPI) / ClockRate
- Instruction Count for a program is determined by
- Type of program
- Algorithm used
- Implementation of the algorithm
- Instruction Set Architecture (ISA)
- Programming language
- Compiler
- Average number of cycles per instruction is determined by
- CPU hardware
- Instruction mix of the executed program, if different instructions have different CPIs
CPI Example
- Computer A: Clock Speed = 4GHz, CPI = 2.0
- Computer B: Cycle Time = 500ps, CPI = 1.2
- Same ISA
- Which computer is faster, and by how much?
- CPUTime_A = InstructionCount × CPI_A × CycleTime_A = InstructionCount × 2.0 × 1/(4×10^9) s = InstructionCount × 0.5×10^−9 s
- CPUTime_B = InstructionCount × CPI_B × CycleTime_B = InstructionCount × 1.2 × 500×10^−12 s = InstructionCount × 600×10^−12 s
- CPUTime_A = InstructionCount × 500ps
- CPUTime_B = InstructionCount × 600ps
- CPUTime_B / CPUTime_A = 600ps / 500ps = 1.2
- Therefore computer A is faster by a factor of 1.2
CPI when Different Instructions have Different Number of Cycles
- In some CPUs, different instructions may take a different number of clock cycles to execute
- ClockCycles = ∑(CPI_i × InstructionCount_i)
- Weighted average CPI
- CPI = ClockCycles / InstructionCount = ∑(CPI_i × InstructionCount_i) / InstructionCount = ∑(CPI_i × (InstructionCount_i / InstructionCount))
CPI Example
- Comparing average CPI for program compiled in two different ways
- Compiled program makes use of three different classes of instructions: A, B and C
| Instruction class | CPI for class | Instruction count in compilation 1 | Instruction count in compilation 2 |
|---|---|---|---|
| A | 1 | 2 | 4 |
| B | 2 | 1 | 1 |
| C | 3 | 2 | 1 |
- Compilation 1: Instruction Count=5
- Clock Cycles = 2×1 + 1×2 + 2×3 = 10
- Average CPI=10/5=2.0
- Compilation 2: Instruction Count=6
- Clock Cycles = 4×1 + 1×2 + 1×3 = 9
- Average CPI=9/6=1.5
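The weighted-average CPI calculation can be expressed directly; a small sketch (function and dictionary names are mine):

```python
# Weighted average CPI = sum(CPI_i x IC_i) / total instruction count
def average_cpi(cpi_by_class, instruction_counts):
    cycles = sum(cpi_by_class[c] * n for c, n in instruction_counts.items())
    return cycles / sum(instruction_counts.values())

cpi_by_class = {"A": 1, "B": 2, "C": 3}
compilation_1 = {"A": 2, "B": 1, "C": 2}   # 5 instructions, 10 cycles
compilation_2 = {"A": 4, "B": 1, "C": 1}   # 6 instructions, 9 cycles
print(average_cpi(cpi_by_class, compilation_1))   # 2.0
print(average_cpi(cpi_by_class, compilation_2))   # 1.5
```

Note that compilation 2 executes more instructions yet fewer total cycles, which is why CPI alone is not a performance measure.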
Limitations imposed by Power Dissipation
- In CMOS IC technology
- Power = CapacitiveLoad × Voltage^2 × Frequency
- Voltage has been reduced from 5V to about 1V to compensate for increases in frequency
- Constrained by power dissipation, instruction-level parallelism, memory latency
- Multicore microprocessors
- More than one processor per chip
- Motherboard may also contain multiple processors
- Improves throughput in the case of multiple users/tasks
- To reduce execution time requires the explicit use of parallel programming
- This is different from instruction level parallelism, where hardware overlaps the execution of multiple instructions together and is hidden from programmer
- Parallel programming is in general a difficult problem
- Inherently requires programmer to program for performance (not just correctness)
- Need to split execution of algorithm in parts with roughly equal execution time (load balancing)
- Avoiding waste of time in exchanging and synchronising data between processors
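The CMOS power equation earlier in this section can be checked numerically (the numbers below are illustrative, not measured values):

```python
# Dynamic power in CMOS: P = C x V^2 x f
def dynamic_power(capacitive_load, voltage, frequency):
    return capacitive_load * voltage ** 2 * frequency

# Dropping the supply voltage from 5 V to 1 V cuts power 25x at the
# same capacitance and frequency, because of the quadratic V term.
ratio = dynamic_power(1.0, 5.0, 1.0) / dynamic_power(1.0, 1.0, 1.0)
print(ratio)
```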
SPEC CPU Benchmark
- Need to compare performance of different computers
- Difficult to achieve since different workloads behave differently
- Standard Performance Evaluation Corp (SPEC) develops benchmarks for CPU, I/O, Web, Virtualisation, Storage etc
- SPEC CPU 2017
- Designed to provide performance measurements that can be used to compare compute-intensive workloads on different computer systems
- Contains 43 benchmarks organized into four suites
- SPEC speed 2017 Integer
- SPEC speed 2017 Floating Point
- SPECrate 2017 Integer
- SPECrate 2017 Floating Point
- Also includes an optional metric for measuring energy consumption
Amdahl's Law
- The overall performance improvement gained by optimising a single part of a system is limited by the fraction of time that the improved part is actually used
- T_improved = T_affected / ImprovementFactor + T_unaffected
- Example: In a program that executes for 100s, multiplication operations take a total of 80s. How much improvement is required in the multiplication operation to get an overall speed improvement of 5 times?
- Solution:
- For a five times speed improvement, T_improved = 100/5 = 20s
- T_affected = 80s and T_unaffected = 100 − 80 = 20s
- If the improvement factor required is n, then 20 = 80/n + 20
- This gives 80/n = 0, which is impossible: no matter how much the multiplication operation is improved, a 5× overall speed-up cannot be achieved
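Amdahl's Law is easy to explore numerically; a sketch with illustrative names:

```python
# Amdahl's Law: T_improved = T_affected / factor + T_unaffected
def improved_time(t_affected, t_unaffected, factor):
    return t_affected / factor + t_unaffected

# Even an enormous improvement of the 80 s multiplication part cannot
# push the total below the 20 s unaffected part, so a 5x overall
# improvement (100 s -> 20 s) is unreachable.
print(improved_time(80.0, 20.0, 4))     # 4x faster multiply: 40 s total
print(improved_time(80.0, 20.0, 1e9))   # asymptotic limit near 20 s
```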
Register Operands
- Arithmetic instructions use register operands or immediate operands only
- LEGV8 has a 32 x 64-bit register file
- Used for frequently accessed data
- 64-bit data is called a "double word"
- 32-bit data is called a "word"
- 31 x 64-bit general purpose registers (X0-X30)
- 31 x 32-bit general purpose subregisters (W0-W30)
- XZR (register 31) 64-bit register is always equal to 0
- Design Principle 2: Smaller is faster
- The register file is small compared to main memory (with millions or billions of locations)
LEGV8 Registers
- X0-X7: procedure arguments/results
- X8: indirect result location register
- X9-X15: temporary storage
- X16-X17 (IP0-IP1): may be used by linker as a scratch register, otherwise a temporary register
- X18: platform register for platform independent code; otherwise a temporary register
- X19-X27: saved
- X28 (SP): stack pointer
- X29 (FP): frame pointer
- X30 (LR): link register (return address)
- X31 (XZR): the constant value 0
Memory Addresses
- In LEGv8 (like most other modern CPUs) memory is byte-addressable
- Therefore, consecutive double words have an address that differs by 8
Memory Operands
- Main memory used for complex and large amount of data
- Arrays, structures, dynamic data
- To apply arithmetic operations on data in memory
- Load values from memory into registers
- Store result from register back to memory
- Memory is byte addressed
- Each address identifies an 8-bit byte
- LEGV8 does not require words to be aligned in memory, except for instructions and the stack
- Memory is accessed using a base address (stored in a register) and an offset
- This is called base addressing
LEGV8 Instructions to Access Memory
- LDUR - LoaD Unscaled Register
- Transfers data from memory to register
- STUR - STore Unscaled Register
- Both take three operands
- First operand is the destination/source register
- Second operand is the base address of the data structure in memory
- Third operand is the offset within the data structure of the required element
- A 64-bit double word is transferred in both cases
Memory Operand Example
- C code (array A contains 64-bit integers):
- A[12] = h + A[8];
- Compiled LEGv8 code:
- //h in X21, base address of A in X22
- LDUR X9, [X22, #64] // Offset is for double word (8 bytes), 8 x 8 = 64
- ADD X9, X21, X9 // Notice that arithmetic is still performed with registers
- STUR X9, [X22, #96] // 12 x 8=96
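The byte addressing in this example can be modelled in software; a sketch where the base address and the value of h are assumed for illustration:

```python
# A holds 64-bit integers, so element A[i] lives at base + 8 * i.
memory = {}
base_of_A = 0x1000   # assumed base address of A (the value in X22)
h = 5                # assumed value of h (the value in X21)

def stur(value, base, offset):
    memory[base + offset] = value   # store a double word

def ldur(base, offset):
    return memory[base + offset]    # load a double word

stur(7, base_of_A, 8 * 8)        # assume A[8] == 7 beforehand
x9 = ldur(base_of_A, 64)         # LDUR X9, [X22, #64]
x9 = h + x9                      # ADD  X9, X21, X9
stur(x9, base_of_A, 96)          # STUR X9, [X22, #96]
print(ldur(base_of_A, 96))       # A[12] == 12
```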
Registers vs Memory
- Registers are much faster to access than memory
- Operating on memory data requires loads and stores
- More instructions need to be executed
- Given the limited number of registers, memory still has to be used
- Compiler must use registers for variables as much as possible
- Only spill to memory for less frequently used variables
- Register optimization is important
- Constant data specified in an instruction
- ADDI X22, X22, #4 // X22 = X22 + 4
- SUBI X23, X22, #30 // X23 = X22 - 30
- This is called immediate addressing
- Design Principle 3: Make the common case fast
- Small constants are common
- Immediate operand avoids a load instruction from memory
Representing Instructions
- Instructions are encoded in binary
- Binary representation is called machine code
- LEGv8 instructions
- Encoded as 32-bit instruction words
- Small number of formats encoding operation code (opcode), register numbers,
- Instruction fields
- opcode: operation code (determines the type of operation to perform)
- Rm: the second register source operand
- shamt: shift amount (00000 for non-shift)
- Rn: the first register source operand
- Rd: the register destination
- Notice that the 5 bits allocated to the register fields can address the 2^5 = 32 registers
- ADD X9, X20, X21
- Binary: 10001011000101010000001010001001
- Load/store instructions
- Rn: base register
- address: constant offset from contents of base register
- ±2^8 byte offsets, i.e. ±2^5 double-word offsets
- giving a range of ±32 double-word offsets
- Rt: destination (load) or source (store) register number
- Design Principle 4: Good design demands good compromises
- Different formats complicate decoding, but allow 32-bit instructions uniformly
- Keep formats as similar as possible
- Immediate instructions
- Rn: source register
- Rd: destination register
- immediate: immediate value
- Immediate field is zero-extended
- The opcode value dictates which format the instruction belongs to
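As a check on the R-format example above, the fields can be packed in software. This sketch assumes the field order opcode/Rm/shamt/Rn/Rd with widths 11/5/6/5/5 bits, as in the ADD example:

```python
def encode_r(opcode, rm, shamt, rn, rd):
    # Pack R-format fields: opcode(11) | Rm(5) | shamt(6) | Rn(5) | Rd(5)
    return (opcode << 21) | (rm << 16) | (shamt << 10) | (rn << 5) | rd

# ADD X9, X20, X21 (ADD opcode is 10001011000)
word = encode_r(0b10001011000, 21, 0, 20, 9)
print(f"{word:032b}")
```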
Logical Operations
- Instructions for bitwise manipulation
| Operation | C | Java | LEGv8 |
|---|---|---|---|
| Shift left | << | << | LSL |
| Shift right | >> | >>> | LSR |
| Bit-by-bit AND | & | & | AND, ANDI |
| Bit-by-bit OR | \| | \| | ORR, ORRI |
| Bit-by-bit XOR | ^ | ^ | EOR, EORI |
- Useful for extracting, inserting and manipulating groups of bits in a word
Shift Operations
- shamt: how many positions to shift
- Logical shift left (LSL)
- Shift left and fill with 0 bits
- LSL by s is equivalent to a multiplication by 2^s
- Logical shift right (LSR)
- Shift right and fill with 0 bits
- LSR by s bits is equivalent to a division by 2^s (unsigned only)
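The shift/arithmetic equivalences can be verified directly (illustrative values):

```python
x = 13
assert x << 3 == x * 2 ** 3        # LSL by 3 multiplies by 2^3 = 8
assert 100 >> 2 == 100 // 2 ** 2   # LSR by 2 divides by 4 (unsigned only)
print(x << 3, 100 >> 2)
```

For signed negative values a logical right shift does not implement division, which is why the slide restricts the LSR equivalence to unsigned data.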
Logical AND Operations
- Useful for masking bits in a word
- Select some bits (where the mask is 1) and clear the others to 0
- AND X9, X10, X11 // X9 = X10 & X11 (X11 is the mask in this case)
Logical OR Operations
- Useful to set some bits to 1, leaving the others unchanged
- ORR X9, X10, X11 // X9 = X10 | X11
Logical XOR operations
- The EOR instruction performs a bit-wise XOR operation between two operands
- This instruction is useful to toggle some or all bits
- EOR X9, X10, X12 // X9 = X10 XOR X12
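The three logical operations above behave as follows on an illustrative 8-bit value:

```python
value = 0b10110110
assert value & 0b00001111 == 0b00000110   # AND: keep low nibble, clear rest
assert value | 0b00000001 == 0b10110111   # ORR: set bit 0, leave others
assert value ^ 0b11111111 == 0b01001001   # EOR: toggle all eight bits
print("ok")
```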
Conditional Operations
- Branch to a label if a condition is true, otherwise, continue sequentially
- Three conditional branch (CB) instructions are available
- CBZ register, L // if register == 0 branch to instruction labeled L
- CBNZ register, L // if register != 0 branch to instruction labeled L
- B.condition LL // if condition is true branch to instruction labeled LL
Subtract set Flags
- Condition codes, or flags, are set following an arithmetic instruction with an S suffix
- ADDS
- ADDIS
- ANDS
- ANDIS
- SUBS
- SUBIS
- Flags
- negative (N) result had 1 in MSB
- zero (Z) result was 0
- overflow (V) result overflowed
- carry (C) result had carryout from MSB
- If operation is ADDS or ANDS the test is simply on the result of the operation as compared to zero.
- For AND, C and V are always set to 0
Unconditional Branch Operations
- Three types of unconditional branch instructions
- B L // Branch to instruction labeled L
- BR register // Branch to address contained in register
- BL L1 // Branch with link: X30 = PC + 4 & PC = L1
- The BR and BL instructions are useful for procedure calls
- BL L1 calls the procedure labelled L1, making register X30 (also called the link register, LR) point to the next instruction following the procedure call
- To return from procedure, the instruction BR LR is used
- Note that this method alone cannot be used for nested procedure calls, since a nested BL would overwrite the return address in X30; the link register must be saved on the stack
Compiling If Statements
- C code
- if (i == j) f = g + h;
- else f = g - h;
- Compiled LEGv8 code:
- //f,g,h in X19, X20, X21, i,j in X22, X23 respectively
- SUB X9, X22, X23 // X9 = i - j
- CBNZ X9, Else // Branch if i != j
- ADD X19, X20, X21 // f = g+h
- B Exit // Skip the next instruction
- Else: SUB X19, X20, X21 // f = g-h
- Exit:
Compiling Loop Statements
- i = 0;
- while (A[i] != k) i = i + 1;
- Compiled LEGv8 code:
- //i in X22, k in X24, address of A in X25
- ADD X22, XZR, XZR // i = 0
- Loop: LSL X10, X22, #3 // Multiply i by 8 to address double words
- ADD X10, X25, X10 // X10 = address of A[i]
- LDUR X11, [X10, #0] // X11 = A[i]
- SUB X11, X11, X24 // X11 = A[i] - k
- CBNZ X11, Exit // if A[i] != k exit loop
- ADDI X22, X22, #1 // otherwise increment i
- B Loop // Loop back
- Exit:
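The compiled loop above implements the following search; an executable sketch (it assumes, like the C code, that k occurs somewhere in A):

```python
def find_index(A, k):
    i = 0
    while A[i] != k:   # CBNZ exits when A[i] - k == 0
        i = i + 1      # ADDI X22, X22, #1
    return i

print(find_index([4, 9, 7, 2], 7))   # 2
```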
Memory Layout
- Text: program code
- Static data: global variables
- eg static variables in C, constant arrays and strings
- Dynamic data (heap)
- Eg malloc in C, new in Java
- Stack: automatic storage
Non-Leaf Procedure Example
- The following example uses a recursive function to calculate the factorial of a number n
- C code
- int fact (int n) {
- if (n < 1)
- return 1;
- else
- return n * fact(n-1);
- }
- Argument n in X0
- Result in X1
Non-Leaf Procedure Example (Continued)
- LEGv8 code:
- fact:
- STP X0, LR, [SP, #-16]! // Push n and the link register onto the stack
- SUBIS XZR, X0, #1 // Test n < 1
- B.GE L1 // If n >= 1, go to the recursive case
- ADDI X1, XZR, #1 // Base case: result = 1
- ADDI SP, SP, #16 // Pop the stack
- BR LR // Return
- L1: SUBI X0, X0, #1 // Argument n - 1
- BL fact // Recursive call: X1 = fact(n - 1)
- LDP X0, LR, [SP], #16 // Restore n and the link register
- MUL X1, X0, X1 // Result = n * fact(n - 1)
- BR LR // Return
Basic Blocks
- A basic block is a sequence of instructions with
- No embedded branches (except at end)
- No branch targets (except at beginning)
- A compiler identifies basic blocks for optimization
- An advanced processor can accelerate execution of basic blocks
Byte/Halfword Operations
- Load byte
- LDURB Rt, [Rn, offset]
- Load the byte at Rn+offset and zero-extend it to fill Rt (the byte goes into the rightmost 8 bits of Rt)
- Store byte
- STURB Rt, [Rn, offset]
- Store just rightmost byte
- Load half word
- LDURH Rt, [Rn, offset]
- Load the half word at Rn+offset and zero-extend it to fill Rt (the half word goes into the rightmost 16 bits of Rt)
- Store half word
- STURH Rt, [Rn, offset]
- Store just rightmost half word
Constants larger than 12-bit
- Most constants are small
- A 12-bit immediate is normally sufficient
- For the occasional 32-bit (or larger) constant or address use:
- MOVZ: move wide with zeros (remaining bits are set to zero)
- MOVK: move wide with keep (remaining bits are kept the same)
- 16 bits of the constant are loaded into the register
- A second (shift) operand specifies where the 16 bits are loaded
- Allowed shift values are 0, 16, 32 or 48
- Multiple MOVZ/MOVK instructions can be used to load constants with more than 16 bits
Example: Loading a 64-bit Constant in a Register
- Suppose that we wish to load the constant 0xFF00 1020 3040 5060 into register X9
- To do this we need to split the constant into 16-bit numbers and load these separately using MOVZ and MOVK instructions
- MOVZ X9, #0x5060, LSL #0 // X9 = 0x5060
- MOVK X9, #0x3040, LSL #16 // X9 = 0x30405060
- MOVK X9, #0x1020, LSL #32 // X9 = 0x102030405060
- MOVK X9, #0xFF00, LSL #48 // X9 = 0xFF00102030405060
- This format is used for the MOVZ and MOVK instructions
- The LSL value is encoded as a 2-bit sub-field as part of the op-code
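The MOVZ/MOVK sequence can be emulated to confirm that each instruction deposits one 16-bit chunk (helper names are mine):

```python
MASK64 = (1 << 64) - 1

def movz(imm16, shift):
    return (imm16 << shift) & MASK64          # all other bits become zero

def movk(reg, imm16, shift):
    hole = ~(0xFFFF << shift) & MASK64        # clear the target 16 bits
    return (reg & hole) | (imm16 << shift)    # keep the remaining bits

x9 = movz(0x5060, 0)
x9 = movk(x9, 0x3040, 16)
x9 = movk(x9, 0x1020, 32)
x9 = movk(x9, 0xFF00, 48)
print(hex(x9))   # 0xff00102030405060
```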
Branch Addressing
- B-format (B - Branch)
- B 10500 // to location PC + 10500
- CB-format (CB-Conditional Branch)
- CBZ X19, Exit // go to Exit if (X19 == 0)
- Both addresses are PC-relative (an offset from the current program position)
- Branch address = PC + offset
- The offset is calculated by the assembler
Conditional Branch Instruction B.cond
LEGV8 Addressing Summary
Synchronization between two Processors
- When writing parallel programs, different processors must communicate results
- Synchronization is required between processors
- Two processors sharing an area of memory
- Process 1 (P1) writes and Process 2 (P2) reads
- Data race if P1 and P2 do not synchronize
- Hardware support required to synchronize processes
- Atomic read/write memory operation
- No other access to the memory location allowed between the read and write
- Could be a single instruction
- e.g. an atomic swap between a register and memory
- Or an atomic pair of instructions
Synchronization in LEGV8
- Load exclusive register: LDXR
- Store exclusive register: STXR
- To use:
- Execute LDXR then STXR with same address (not necessarily immediately one after the other)
- If there is an intervening change to the address, the store fails (communicated with an additional output parameter)
- Only use register instructions in between
- Any time a processor intervenes and modifies the value in memory between the LDXR and STXR instructions, the STXR returns 1 in the indicated register; otherwise it returns 0
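The LDXR/STXR retry idiom can be sketched in software. This is a deliberately simplified, single-threaded model (class and method names are mine): the store "fails" whenever the watched location's value changed after the load, which approximates but does not exactly match the hardware's address-monitor behaviour:

```python
class Memory:
    def __init__(self):
        self.data = {}
        self.reserved = None   # (address, value) watched since last ldxr

    def ldxr(self, addr):
        value = self.data.get(addr, 0)
        self.reserved = (addr, value)
        return value

    def stxr(self, addr, value):
        ok = self.reserved == (addr, self.data.get(addr, 0))
        if ok:
            self.data[addr] = value
        self.reserved = None
        return 0 if ok else 1   # 0 = success, like STXR's status output

mem = Memory()
old = mem.ldxr(0x20)
assert mem.stxr(0x20, old + 1) == 0   # no intervening write: succeeds

old = mem.ldxr(0x20)
mem.data[0x20] = 99                   # another processor intervenes
assert mem.stxr(0x20, old + 1) == 1   # store fails; software must retry
```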
Synchronization Example
- Example 1: Atomic swap. This process needs to swap value in X23 with a second processor using memory location pointed to by X20
- Example 2: Lock
Translation and Startup
LEGV8 Multiplication
- Three multiply instructions:
- MUL: multiply
- SMULH: signed multiply high
- uses the upper 64 bits of the 128-bit product; the operands are signed
- UMULH: unsigned multiply high
- LEGV8 multiply instructions do not set the overflow condition code
- it is up to the software to check if the product is too big to fit in 64 bits
LEGV8 Division
- Two instructions:
- SDIV (signed)
- UDIV (unsigned)
- Both instructions ignore overflow and division-by-zero
Floating Point Arithmetic Hardware
- Floating point adder
- Much more complex than integer adder or multiplier
- Floating point multiplier is of similar complexity
- Floating point arithmetic hardware usually does
- Addition, subtraction, multiplication, division, reciprocal square root
- Floating point to integer (and vice versa) conversion
- Doing it in one clock cycle would take too long
- Much longer than integer operations
- Slower clock would penalize all instructions
- Floating point operations therefore usually take multiple clock cycles