Computer Systems and Organization

Course Outline

  • Introduction
  • LEGv8 instruction set architecture
  • LEGv8 microarchitecture
  • Pipelining
  • Computer memory
    • Physical memory
    • Virtual memory
    • Cache
  • Input/Output
  • Parallel processors

Assessment

  • 20% Practical Assignment
  • 80% Examination

Introduction, Computer Abstractions and Technology

  • Computers are Everywhere
    • Most aspects of our lives are dependent on or controlled by computers such as:
      • Office work
      • Driving a car
      • Searching and accessing information on the World Wide Web
      • Keeping up-to-date with your family and friends using smart phones
      • Finding vaccines for COVID-19
    • First programmable general-purpose electronic computer - Electronic Numerical Integrator and Computer (ENIAC)
      • 1946
      • 500 floating point operations per second (FLOPS)
    • Most powerful computer in 2022 - Hewlett Packard Enterprise Frontier
      • 1,685.65 PetaFLOPS (1 PetaFLOPS = 10^15 FLOPS)

Moore's Law

  • Moore's Law: The number of transistors on microchips doubles every two years

Computer Classes

  • Personal Computers
    • Good performance for one user at a time
    • Relatively cheap price
  • Servers
    • Run more powerful software
    • Multi-user and usually accessed via network
    • Higher reliability
  • Supercomputers
    • Highest performance and highest cost
    • Usually used for high-end scientific and engineering applications
  • Embedded Computers
    • Brains of various types of systems and devices (e.g. televisions, cars, etc.)
    • Stringent power/performance/cost constraints

Post-PC Era

  • Personal Mobile Device (PMD)
    • Battery operated
    • Typically depends on a permanent internet connection
    • Hundreds of dollars
    • Smart phones, tablets, electronic glasses
  • Cloud computing
    • Replacing Servers
    • Warehouse Scale Computers (WSC)
    • Software as a Service (SaaS)
    • Portion of software run on PC or PMD and a portion run in the Cloud
    • Amazon, Google and Microsoft

What Determines the Performance of a Computer?

  • Algorithm
    • Determines number of operations executed to achieve desired outcomes
  • Programming language, compiler, architecture
    • Determine number of machine instructions executed per operation
  • Processor and memory system
    • Determine how fast instructions are executed
  • I/O system (including OS)
    • Determines how fast I/O operations are executed

From Application Software to Machine Code

  • High-level Language
    • Details of computer hardware implementation abstracted away
    • Higher productivity
    • Portability
  • Assembly Language
    • Specific to processor used
    • Machine instructions in text form
  • Machine code
    • Specific processor instructions
    • Instructions and data encoded into bits
    • Difficult for human to understand

Organisation of a Computer

  • Inputting data
    • Data to be processed or stored
    • Control of computer
    • eg keyboard, mouse
  • Outputting data
    • Processed or stored data
    • Control outside devices
    • eg display, printer
    • Some devices provide both input and output, eg storage devices such as a HDD
  • Memory
    • Fast storage of data (permanent and/or temporary)
  • Processor (CPU)
    • Datapath- includes arithmetic, logic, registers, buses
    • Control unit- orchestrates operation of processor
    • Cache memory

Manufacturing Integrated Circuits (ICs)

  • Yield is the proportion of working dies per wafer

2nd Generation Intel® Core™ i7 Wafer

  • 300 mm wafer, 280 chips, 32 nm technology
  • Each chip is 20.7 x 10.5 mm

Instruction Set Architecture (ISA)

  • Defines interface between the hardware and the lowest-level software
    • Also simply referred to as the architecture of the processor
  • Includes
    • Machine code instructions
    • Registers
    • Memory access
    • I/O access
    • The same architecture may be implemented in different ways (with different costs and performance) - e.g. x86-64 is implemented in different ways by AMD and Intel. This is another example of abstraction
    • The application binary interface (ABI) is the combination of the ISA and the operating system

Different Performance Measures

  • There are different performance measures, depending on the needs

Computer Performance

  • Response time or execution time
    • How long it takes to do a task from start to finish (including I/O)
    • This metric is most relevant for an individual using a computer
  • Throughput or bandwidth
    • Total amount of work done per unit of time
    • This metric is most relevant in servers
  • These two measures are interrelated but distinct
  • Replacing the processor with a faster one will decrease execution time and will also increase throughput
  • Adding more processors to a system may decrease execution time but will definitely increase throughput
  • Initially we will focus on execution time
  • Performance = 1 / Execution Time

Measuring Execution Time

  • Elapsed time
    • Wall clock time- total response time
      • Processing
      • I/O
      • OS overhead
      • Idle time waiting for resources
    • Determines system performance
  • CPU time
    • Time spent by CPU processing a given task
    • Does not include I/O time, time taken by other tasks, or idle time
    • Includes user and system CPU time
    • Determines the CPU performance

CPU Performance Metrics

  • Clock speed
  • Clock cycles per instruction (CPI)
  • Number of instructions required

CPU Clock

  • The operation of a synchronous digital system is governed by a constant rate clock
  • Clock period
    • duration of a clock cycle
    • eg, 250 ps = 0.25 × 10^-9 s = 250 × 10^-12 s
  • Clock frequency (rate): cycles per second
    • eg. 4.0 GHz = 4000 MHz = 4.0 × 10^9 Hz
  • Clock Frequency = 1 / Clock Period
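As a quick numeric check, this Python sketch converts the 250 ps period from the example above into its clock rate:

```python
# Clock frequency and clock period are reciprocals of each other.
period_s = 250e-12           # clock period of 250 ps
frequency_hz = 1 / period_s  # corresponding clock rate (4.0 GHz)
print(frequency_hz / 1e9, "GHz")
```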

CPU Time

  • CPU Time = CPU Clock Cycles × Clock Cycle Time
  • CPU Clock Cycles = CPU Time / Clock Cycle Time
  • Clock Rate = 1 / Clock Cycle Time
  • Performance is improved by
    • Reducing number of clock cycles
    • Increasing clock rate
  • Hardware designer must often trade-off clock rate against cycle count

CPU Time Example

  • Computer A: 2GHz clock, 10s CPU time
  • Designing Computer B:
    • Aim for 6s CPU time
    • Can use a faster clock, but this increases the clock cycle count by a factor of 1.2
    • How fast must Computer B clock be?
  • Solution:
    • Clock Cycles_B / Clock Rate_B = CPU Time_B
    • (1.2 × Clock Cycles_A) / Clock Rate_B = CPU Time_B
    • Clock Cycles_A = CPU Time_A × Clock Rate_A = 10 s × 2 GHz = 20 × 10^9
    • (1.2 × 20 × 10^9) / Clock Rate_B = 6 s
    • Clock Rate_B = (1.2 × 20 × 10^9) / 6 s = (24 × 10^9) / 6 s = 4 GHz
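The arithmetic above can be verified with a short Python sketch (all values taken from the example):

```python
# Computer A: 2 GHz clock, 10 s CPU time.
clock_rate_a = 2e9
cpu_time_a = 10.0

# Cycle count of A, and the 1.2x larger count required by B.
clock_cycles_a = cpu_time_a * clock_rate_a  # 20e9 cycles
clock_cycles_b = 1.2 * clock_cycles_a       # 24e9 cycles

# Clock rate B must run at to finish in 6 s.
clock_rate_b = clock_cycles_b / 6.0
print(clock_rate_b / 1e9, "GHz")
```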

Instruction Count and CPI

  • Clock Cycles = Instruction Count × Cycles per Instruction
  • CPU Time = Instruction Count × CPI × Clock Cycle Time = (Instruction Count × CPI) / Clock Rate
  • Instruction Count for a program is determined by
    • Type of program
    • Algorithm used
    • Implementation of the algorithm
    • Instruction Set Architecture (ISA)
    • Programming language
    • Compiler
  • Average number of cycles per instruction is determined by
    • CPU hardware
    • Instruction mix to execute program if different instructions have different CPI

CPI Example

  • Computer A: Clock Speed = 4GHz, CPI = 2.0
  • Computer B: Cycle Time = 500ps, CPI = 1.2
  • Same ISA
  • Which computer is faster, and by how much?
  • CPU Time_A = Instruction Count × CPI_A × Cycle Time_A = Instruction Count × 2.0 × 1/(4 × 10^9) = Instruction Count × 0.5 × 10^-9 s
  • CPU Time_B = Instruction Count × CPI_B × Cycle Time_B = Instruction Count × 1.2 × 500 ps = Instruction Count × 600 × 10^-12 s
  • CPU Time_A = Instruction Count × 500 ps
  • CPU Time_B = Instruction Count × 600 ps
  • CPU Time_B / CPU Time_A = 600 ps / 500 ps = 1.2
  • Therefore computer A is faster by a factor of 1.2
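A Python sketch of the same comparison, working with per-instruction times so that the (identical) instruction count cancels:

```python
# Computer A: 4 GHz clock, CPI = 2.0
time_per_instr_a = 2.0 * (1 / 4e9)   # 500 ps per instruction
# Computer B: cycle time 500 ps, CPI = 1.2
time_per_instr_b = 1.2 * 500e-12     # 600 ps per instruction

# B takes 1.2x as long per instruction, so A is faster by 1.2.
ratio = time_per_instr_b / time_per_instr_a
print(ratio)
```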

CPI when Different Instructions have Different Number of Cycles

  • In some CPUs, different instructions may take a different number of clock cycles to execute
  • Clock Cycles = Σ (CPI_i × Instruction Count_i)
  • Weighted average CPI
  • CPI = Clock Cycles / Instruction Count = Σ (CPI_i × Instruction Count_i) / Instruction Count = Σ (CPI_i × (Instruction Count_i / Instruction Count))

CPI Example

  • Comparing average CPI for program compiled in two different ways
  • Compiled program makes use of three different classes of instructions: A, B and C

    Instruction class   CPI   Count in compilation 1   Count in compilation 2
    A                   1     2                        4
    B                   2     1                        1
    C                   3     2                        1
  • Compilation 1: Instruction Count = 5
    • Clock Cycles = 2×1 + 1×2 + 2×3 = 10
    • Average CPI = 10/5 = 2.0
  • Compilation 2: Instruction Count = 6
    • Clock Cycles = 4×1 + 1×2 + 1×3 = 9
    • Average CPI = 9/6 = 1.5
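The weighted-average CPI calculation can be written directly from the formula above; this sketch uses the class CPIs and instruction counts from the example:

```python
cpi = {"A": 1, "B": 2, "C": 3}            # CPI per instruction class
compilation_1 = {"A": 2, "B": 1, "C": 2}  # instruction counts
compilation_2 = {"A": 4, "B": 1, "C": 1}

def average_cpi(counts):
    # Weighted average: total cycles divided by total instructions.
    cycles = sum(cpi[cls] * n for cls, n in counts.items())
    return cycles / sum(counts.values())

print(average_cpi(compilation_1))  # 2.0
print(average_cpi(compilation_2))  # 1.5
```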

Limitations imposed by Power Dissipation

  • In CMOS IC technology
  • Power = Capacitive Load × Voltage^2 × Frequency
  • Voltage has been reduced to 1V (from 5V) to compensate for increase in frequency

Uniprocessor Performance Limits

  • Constrained by power dissipation, instruction-level parallelism, memory latency

Increasing Performance by using Multiprocessors

  • Multicore microprocessors
    • More than one processor per chip
    • Motherboard may also contain multiple processors
  • Improves throughput in the case of multiple users/tasks
  • To reduce execution time requires the explicit use of parallel programming
    • This is different from instruction level parallelism, where hardware overlaps the execution of multiple instructions together and is hidden from programmer
  • Parallel programming is in general a difficult problem
    • Inherently requires programmer to program for performance (not just correctness)
    • Need to split execution of algorithm in parts with roughly equal execution time (load balancing)
    • Avoiding waste of time in exchanging and synchronising data between processors

SPEC CPU Benchmark

  • Need to compare performance of different computers
    • Difficult to achieve since different work loads behave differently
  • Standard Performance Evaluation Corp (SPEC) develops benchmarks for CPU, I/O, Web, Virtualisation, Storage etc
  • SPEC CPU 2017
    • Designed to provide performance measurements that can be used to compare compute-intensive workloads on different computer systems
    • Contains 43 benchmarks organized into four suites
      • SPEC speed 2017 Integer
      • SPEC speed 2017 Floating Point
      • SPECrate 2017 Integer
      • SPECrate 2017 Floating Point
    • Also includes an optional metric for measuring energy consumption

Amdahl's Law

  • The overall performance improvement gained by optimising a single part of a system is limited by the fraction of time that the improved part is actually used
  • T_improved = T_affected / Improvement Factor + T_unaffected
  • Example: In a program that executes for 100s, multiplication operations take a total of 80s. How much improvement is required in the multiplication operation to get an overall speed improvement of 5 times?
    • Solution:
      • For a five times speed improvement, T_improved = 100/5 = 20 s
      • T_affected = 80 s and T_unaffected = 100 − 80 = 20 s
      • If the improvement factor required is n, then 20 = 80/n + 20
      • This requires 80/n = 0, which is impossible: no improvement to the multiplication operation alone can give an overall 5× speedup
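Amdahl's Law makes the limit explicit: even with an unbounded improvement factor, T_improved never drops below T_unaffected. A small sketch using the numbers from the example:

```python
def improved_time(t_affected, t_unaffected, factor):
    # Amdahl's Law: only the affected part speeds up.
    return t_affected / factor + t_unaffected

# 80 s of multiplication, 20 s of everything else.
for factor in (2, 10, 1000, 1_000_000):
    print(factor, improved_time(80, 20, factor))
# The time approaches, but never reaches, 20 s - so an overall
# 5x speedup (100 s -> 20 s) is impossible.
```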

Register Operands

  • Arithmetic instructions use register operands or immediate operands only
  • LEGv8 has a 32 × 64-bit register file
    • Used for frequently accessed data
    • 64-bit data is called a "double word"
    • 32-bit data is called a "word"
    • 31 x 64-bit general purpose registers (X0-X30)
    • 31 × 32-bit general purpose subregisters (W0-W30)
    • XZR (register 31) 64-bit register is always equal to 0
  • Design Principle 2: Smaller is faster
  • The register file is small compared to main memory (with millions or billions of locations)

LEGv8 Registers

  • X0-X7: procedure arguments/results
  • X8: indirect result location register
  • X9-X15: temporary storage
  • X16-X17 (IP0-IP1): may be used by linker as a scratch register, otherwise a temporary register
  • X18: platform register for platform independent code; otherwise a temporary register
  • X19-X27: saved
  • X28 (SP): stack pointer
  • X29 (FP): frame pointer
  • X30 (LR): link register (return address)
  • X31 (XZR): the constant value 0

Memory Addresses

  • In LEGv8 (like most other modern CPUs) memory is byte-addressable
    • Therefore, consecutive double words have an address that differs by 8
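With byte addressing, the address of the i-th double word in an array is base + 8 × i; a tiny sketch (the base address 0x1000 is an arbitrary example, not from the notes):

```python
BASE = 0x1000            # hypothetical base address of an array of double words

def dword_address(i):
    # Each 64-bit double word occupies 8 bytes.
    return BASE + 8 * i

# Consecutive double words differ by 8 in their byte addresses.
assert dword_address(1) - dword_address(0) == 8
print(hex(dword_address(12)))  # element 12 lives at base + 96
```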

Memory Operands

  • Main memory used for complex and large amount of data
    • Arrays, structures, dynamic data
  • To apply arithmetic operations on data in memory
    • Load values from memory into registers
    • Store result from register back to memory
  • Memory is byte addressed
    • Each address identifies an 8-bit byte
  • LEGv8 does not require words to be aligned in memory, except for instructions and the stack
  • Memory is accessed using a base address (stored in a register) and an offset
    • This is called base addressing

LEGv8 Instructions to Access Memory

  • LDUR - LoaD Unscaled Register
    • Transfers data from memory to register
  • STUR - STore Unscaled Register
  • Both take three operands
    • First operand is the destination/source register
    • Second operand is the base address of the data structure in memory
    • Third operand is the offset within the data structure of the required element
    • A 64-bit double word is transferred in both cases

Memory Operand Example

  • C code (array A contains 64-bit integers):
    • A[12] = h + A[8];
  • Compiled LEGv8 code:
    • //h in X21, base address of A in X22
    • LDUR X9, [X22, #64] // Offset is for double word (8 bytes), 8 x 8 = 64
    • ADD X9, X21, X9 // Notice that arithmetic is still performed with registers
    • STUR X9, [X22, #96] // 12 x 8=96

Registers vs Memory

  • Registers are much faster to access than memory
  • Operating on memory data requires loads and stores
    • More instructions need to be executed
  • Given the limited number of registers, memory still has to be used
  • Compiler must use registers for variables as much as possible
    • Only spill to memory for less frequently used variables
  • Register optimization is important

Immediate Operands

  • Constant data specified in an instruction
    • ADD X22, X22, #4 // X22 = X22 + 4
    • SUB X23, X22, #30 // X23 = X22 - 30
  • This is called immediate addressing
  • Design Principle 3: Make the common case fast
    • Small constants are common
    • Immediate operand avoids a load instruction from memory

Representing Instructions

  • Instructions are encoded in binary
    • Binary representation is called machine code
  • LEGv8 instructions
    • Encoded as 32-bit instruction words
    • Small number of formats encoding operation code (opcode), register numbers,
      • Regularity

LEGv8 R-format Instructions (R - Register)

  • Instruction fields
    • opcode: operation code (determines the type of operation to perform)
    • Rm: the second register source operand
    • shamt: shift amount (00000 for non-shift)
    • Rn: the first register source operand
    • Rd: the register destination
  • Notice that the 5 bits allocated to the register fields can address the 2^5 = 32 registers

R-format Example

  • ADD X9, X20, X21
  • Binary: 10001011000 10101 000000 10100 01001 (opcode | Rm | shamt | Rn | Rd)
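The encoding can be reproduced by packing the R-format fields, whose widths are 11/5/6/5/5 bits for opcode/Rm/shamt/Rn/Rd; a sketch in Python:

```python
# R-format fields for ADD X9, X20, X21
opcode = 0b10001011000   # ADD (11 bits)
rm     = 21              # X21, second source register (5 bits)
shamt  = 0               # no shift (6 bits)
rn     = 20              # X20, first source register (5 bits)
rd     = 9               # X9, destination register (5 bits)

word = (opcode << 21) | (rm << 16) | (shamt << 10) | (rn << 5) | rd
assert format(word, "032b") == "10001011000101010000001010001001"
```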

LEGv8 D-format Instructions (D - Data)

  • Load/store instructions
  • Rn: base register
  • address: constant offset from contents of base register
    • ± 2^8 byte addresses (9-bit signed offset), i.e. ± 2^5 double-word addresses
    • giving a range of ± 32 double-word offsets
  • Rt: destination (load) or source (store) register number
  • Design Principle 4: Good design demands good compromises
    • Different formats complicate decoding, but allow 32-bit instructions uniformly
    • Keep formats as similar as possible

LEGv8 I-format Instructions (I - Immediate)

  • Immediate instructions
  • Rn: source register
  • Rd: destination register
  • immediate: immediate value
  • Immediate field is zero-extended

Distinguishing between the Different Instruction Formats

  • The opcode value dictates which format the instruction belongs to

Logical Operations

  • Instructions for bitwise manipulation

    Operation        C    Java   LEGv8
    Shift left       <<   <<     LSL
    Shift right      >>   >>>    LSR
    Bit-by-bit AND   &    &      AND, ANDI
    Bit-by-bit OR    |    |      ORR, ORRI
    Bit-by-bit XOR   ^    ^      EOR, EORI
  • Useful for extracting, inserting and manipulating groups of bits in a word

Shift Operations

  • shamt: how many positions to shift
  • Logical shift left (LSL)
    • Shift left and fill with 0 bits
    • LSL by s is equivalent to a multiplication by 2^s
  • Logical shift right (LSR)
    • Shift right and fill with 0 bits
    • LSR by s bits is equivalent to a division by 2^s (unsigned only)
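The multiply/divide equivalences above can be checked in any language; a short Python sketch:

```python
x = 13
# Logical shift left by s multiplies by 2^s.
assert x << 3 == x * 2**3        # 13 * 8 = 104
# Logical shift right by s divides by 2^s (unsigned / non-negative
# values only), discarding the remainder.
assert 104 >> 2 == 104 // 4      # 26
assert 13 >> 2 == 13 // 4        # 3, remainder discarded
```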

Logical AND Operations

  • Useful to mask bits in a word
    • Select some bits (where the mask is 1) and clear others to 0
  • AND X9, X10, X11 // X9 = X10 & X11 (X11 is the mask in this case)

Logical OR Operations

  • Useful to set some bits to 1, leaving the others unchanged
  • ORR X9, X10, X11 // X9 = X10 | X11

Logical XOR operations

  • The EOR instruction performs a bit-wise XOR operation between two operands
  • This instruction is useful to toggle some or all bits
  • EOR X9, X10, X12 // X9 = X10 XOR X12
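The three uses - clearing with AND, setting with ORR, toggling with EOR - can be demonstrated on a single byte (the values are arbitrary illustrations):

```python
value = 0b1011_0110
mask  = 0b0000_1111   # selects the low 4 bits

assert value & mask == 0b0000_0110   # AND: keep masked bits, clear the rest
assert value | mask == 0b1011_1111   # OR:  set the masked bits
assert value ^ mask == 0b1011_1001   # XOR: toggle the masked bits
```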

Conditional Operations

  • Branch to a label if a condition is true, otherwise, continue sequentially
  • Three conditional branch (CB) instructions are available
    • CBZ register, L // if register == 0 branch to instruction labeled L
    • CBNZ register, L // if register != 0 branch to instruction labeled L
    • B.condition LL // if condition is true branch to instruction labeled LL

Setting the Condition Flags

  • Condition codes, or flags, are set following arithmetic instructions with an S-suffix
    • ADDS
    • ADDIS
    • ANDS
    • ANDIS
    • SUBS
    • SUBIS
  • Flags
    • negative (N) result had 1 in MSB
    • zero (Z) result was 0
    • overflow (V) result overflowed
    • carry (C) result had carryout from MSB
  • If the operation is ADDS or ANDS, the test is simply a comparison of the result of the operation with zero
  • For ANDS and ANDIS, C and V are always set to 0
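As a sketch of how the flags are derived, this Python model computes N, Z, C and V for a 64-bit ADDS, treating register values as unsigned 64-bit integers (the function and representation are illustrative, not part of LEGv8):

```python
MASK = (1 << 64) - 1   # 64-bit registers

def adds_flags(a, b):
    """Model of ADDS: returns (result, N, Z, C, V)."""
    total = a + b
    result = total & MASK
    n = (result >> 63) & 1          # negative: MSB of the result is 1
    z = int(result == 0)            # zero: result was 0
    c = int(total > MASK)           # carry out of the MSB
    sa, sb, sr = a >> 63, b >> 63, result >> 63
    v = int(sa == sb and sr != sa)  # signed overflow
    return result, n, z, c, v

# Largest positive signed value + 1 overflows into the sign bit (N, V set).
assert adds_flags((1 << 63) - 1, 1) == (1 << 63, 1, 0, 0, 1)
# All-ones (-1 as a signed value) + 1 wraps to 0 with a carry out (Z, C set).
assert adds_flags(MASK, 1) == (0, 0, 1, 1, 0)
```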

Unconditional Branch Operations

  • Three types of unconditional branch instructions
    • B L // Branch to instruction labeled L
    • BR register // Branch to address contained in register
    • BL L1 // Branch with link: X30 = PC + 4 & PC = L1
  • The BR and BL instructions are useful for procedure calls
    • BL L1, calls procedure labelled L1, making register X30 (also called link register LR) point to the next instruction following the procedure call
    • To return from procedure, the instruction BR LR is used
    • Note that this method cannot be used for nested procedure calls unless X30 is first saved (e.g. on the stack)

Compiling If Statements

  • C code
    • if (i == j) f = g + h;
    • else f = g - h;
  • Compiled LEGv8 code:
    • //f,g,h in X19, X20, X21, i,j in X22, X23 respectively
    • SUB X9, X22, X23 // X9 = i - j
    • CBNZ X9, Else // Branch if i != j
    • ADD X19, X20, X21 // f = g+h
    • B Exit // Skip the next instruction
    • Else: SUB X19, X20, X21 // f = g-h
    • Exit:

Compiling Loop Statements

  • i = 0;
  • while (A[i] != k) i = i + 1;
  • Compiled LEGv8 code:
    • //i in X22, k in X24, address of A in X25
    • ADD X22, XZR, XZR // i = 0
    • Loop: LSL X10, X22, #3 // Multiply i by 8 to address double words
    • ADD X10, X25, X10 // X10 = address of A[i]
    • LDUR X11, [X10, #0] // X11 = A[i]
    • SUB X11, X11, X24 // X11 = A[i] - k
    • CBNZ X11, Exit // if A[i] != k exit loop
    • ADDI X22, X22, #1 // otherwise increment i
    • B Loop // Loop back
    • Exit:

Memory Layout

  • Text: program code
  • Static data: global variables
    • eg static variables in C, constant arrays and strings
  • Dynamic data (heap)
    • Eg malloc in C, new in Java
  • Stack: automatic storage

Non-Leaf Procedure Example

  • The following example uses a recursive function to calculate the factorial of a number n
  • C code
    • int fact (int n) {
    • if (n < 1)
    • return 1;
    • else
    • return n * fact(n-1);
    • }
    • Argument n in X0
    • Result in X1

Non-Leaf Procedure Example (Continued)

  • LEGv8 code:
    • fact:
      • SUBI SP, SP, #16 // Adjust stack for 2 items
      • STUR LR, [SP, #8] // Save the return address
      • STUR X0, [SP, #0] // Save the argument n
      • SUBIS XZR, X0, #1 // Test for n < 1
      • B.GE L1 // If n >= 1, go to recursive case
      • ADDI X1, XZR, #1 // Base case: result = 1
      • ADDI SP, SP, #16 // Pop 2 items from stack
      • BR LR // Return
      • L1: SUBI X0, X0, #1 // n = n - 1
      • BL fact // Recursive call with n - 1
      • // After the call: restore X0 and LR, pop the stack, MUL X1, X0, X1, then BR LR

Basic Blocks

  • A basic block is a sequence of instructions with
    • No embedded branches (except at end)
    • No branch targets (except at beginning)
  • A compiler identifies basic blocks for optimization
  • An advanced processor can accelerate execution of basic blocks

Byte/Halfword Operations

  • Load byte
    • LDURB Rt, [Rn, offset]
      • Load the byte at Rn+offset into the rightmost byte of Rt and zero-extend to fill Rt
  • Store byte
    • STURB Rt, [Rn, offset]
      • Store just rightmost byte
  • Load half word
    • LDURH Rt, [Rn, offset]
      • Load the half word at Rn+offset into the rightmost half word of Rt and zero-extend to fill Rt
  • Store half word
    • STURH Rt, [Rn, offset]
      • Store just rightmost half word

Constants Larger than 12 Bits

  • Most constants are small
    • A 12-bit immediate is normally sufficient
  • For the occasional 32-bit (or larger) constant or address use:
    • MOVZ: move wide with zeros (remaining bits are set to zero)
    • MOVK: move wide with keep (remaining bits are kept the same)
    • 16 bits of the constant are loaded into the register
    • A second (shift) operand specifies where the 16 bits are loaded
      • Allowed shift values are 0, 16, 32 or 48
    • Multiple MOVZ/MOVK instructions can be used to load constants with more than 16 bits

Example: Loading a 64-bit Constant in a Register

  • Suppose that we wish to load the constant 0xFF00 1020 3040 5060 into register X9
  • To do this we need to split the constant into 16-bit numbers and load these separately using MOVZ and MOVK instructions
    • MOVZ X9, #0x5060, LSL #0 // X9 = 0x5060
    • MOVK X9, #0x3040, LSL #16 // X9 = 0x30405060
    • MOVK X9, #0x1020, LSL #32 // X9 = 0x102030405060
    • MOVK X9, #0xFF00, LSL #48 // X9 = 0xFF00102030405060
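The MOVZ/MOVK semantics can be modeled in a few lines of Python (the helper names are illustrative, not LEGv8 mnemonics):

```python
def movz(imm16, shift):
    # MOVZ: place the 16-bit constant, zero the remaining bits.
    return (imm16 & 0xFFFF) << shift

def movk(reg, imm16, shift):
    # MOVK: replace one 16-bit field, keep the remaining bits.
    return (reg & ~(0xFFFF << shift)) | ((imm16 & 0xFFFF) << shift)

x9 = movz(0x5060, 0)        # X9 = 0x0000000000005060
x9 = movk(x9, 0x3040, 16)   # X9 = 0x0000000030405060
x9 = movk(x9, 0x1020, 32)   # X9 = 0x0000102030405060
x9 = movk(x9, 0xFF00, 48)   # X9 = 0xFF00102030405060
assert x9 == 0xFF00102030405060
```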

LEGv8 IW-format Instructions (IW - Immediate Wide)

  • This format is used for the MOVZ and MOVK instructions
  • The LSL value is encoded as a 2-bit sub-field as part of the op-code

LEGv8 B-format and CB-format Instructions

  • B-format (B- Branch)
    • B 10500 // to location PC + 10500
  • CB-format (CB-Conditional Branch)
    • CBZ X19, Exit // go to Exit if (X19 == 0)
  • Both addresses are PC-relative (an offset from the current program position)
    • Branch address = PC + offset
    • Offset is calculated by the assembler

Conditional Branch Instruction B.cond

LEGv8 Addressing Summary

LEGv8 Instruction Format Summary

Synchronization between two Processors

  • When writing parallel programs, different processors must communicate results
  • Synchronization is required between processors
    • Two processors sharing an area of memory
      • Process 1 (P1) writes and Process 2 (P2) reads
      • Data race if P1 and P2 do not synchronize
  • Hardware support required to synchronize processes
    • Atomic read/write memory operation
      • No other access to the memory location allowed between the read and write
      • Could be a single instruction
        • E.g. atomic swap of register and memory
      • Or an atomic pair of instructions

Synchronization in LEGv8

  • Load exclusive register: LDXR
  • Store exclusive register: STXR
  • To use:
    • Execute LDXR then STXR with the same address (not necessarily immediately one after the other)
    • If there is an intervening change to that address, the store fails (communicated through an additional output register)
    • Only use register instructions in between
    • Any time another processor intervenes and modifies the value in memory between the LDXR and STXR instructions, STXR returns 1 in the indicated register; otherwise it returns 0
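As a rough illustration of the retry pattern (not real LEGv8 - the Memory class and its version tag are inventions standing in for the hardware's exclusive monitor):

```python
class Memory:
    """Toy model of one memory location with an exclusive monitor."""
    def __init__(self, value):
        self.value = value
        self.version = 0              # bumped on every successful store

    def ldxr(self):
        # Load exclusive: return the value plus a tag for the later STXR.
        return self.value, self.version

    def stxr(self, new_value, version):
        # Store exclusive: fail (return 1) if the location changed
        # since the paired LDXR, otherwise store and return 0.
        if self.version != version:
            return 1
        self.value = new_value
        self.version += 1
        return 0

def atomic_swap(mem, reg_value):
    # Retry the LDXR/STXR pair until the store succeeds.
    while True:
        old, tag = mem.ldxr()
        if mem.stxr(reg_value, tag) == 0:
            return old

m = Memory(7)
assert atomic_swap(m, 42) == 7 and m.value == 42
```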

Synchronization Example

  • Example 1: Atomic swap. The processor needs to atomically swap the value in X23 with the memory location pointed to by X20, which is shared with a second processor
  • Example 2: Lock

Translation and Startup

LEGv8 Multiplication

  • Three multiply instructions:
    • MUL: multiply
    • SMULH: signed multiply high
      • use the upper 64 bits of the 128-bit product; the operands are signed
    • UMULH: unsigned multiply high
  • LEGv8 multiply instructions do not set the overflow condition code
    • it is up to the software to check if the product is too big to fit in 64 bits

LEGv8 Division

  • Two instructions:
    • SDIV (signed)
    • UDIV (unsigned)
  • Both instructions ignore overflow and division-by-zero

Floating Point Arithmetic Hardware

  • Floating point adder
    • Much more complex than integer adder or multiplier
    • Floating point multiplier is of similar complexity
  • Floating point arithmetic hardware usually does
    • Addition, subtraction, multiplication, division, reciprocal square root
    • Floating point to integer (and vice versa) conversion
  • Doing it in one clock cycle would take too long
    • Much longer than integer operations
    • Slower clock would penalize all instructions
    • Floating point operations usually take multiple clock cycles to complete