GPU Programming for Physics PhD Course Notes

Introduction to GPUs

GPUs (Graphics Processing Units) are used to generate 3D graphics in real-time.
Mid 1990s: First 3D accelerating Graphic Cards and fully 3D capable consoles.
2020s: General Purpose GPUs (GP-GPUs) are used for generic parallel calculations.
Since 2007 GPUs are used not only for graphics but also for generic parallel calculations.
Some GPUs don’t even implement a video output.
Cryptocurrencies boosted the interest in this topic (Bitcoin mining).

Why GPUs?

Geometrical operations on objects in 3D graphics can be computed efficiently in a parallel architecture:
- Sum of vectors: just sum vector components independently
- Rotations: compute each component of the rotated vector independently

Ray Tracing (ray tracing is a technique for modeling light transport for use in a wide variety of rendering algorithms for generating digital images)

Reflection, refraction and shadow effects.
Ideally each pixel could be processed independently.

CPUs vs GPUs

CPUs and GPUs are both necessary, they solve different problems and have different roles.
CPU: Central Processing Unit
- Runs the Operating System which manages all system resources
- Applications run on it, even if they take advantage of GPUs at some point
- Features multiple cores (approximately 10 in a typical laptop)
- Has a control unit that manages code execution and memory I/O
Moore’s Law: the number of transistors in an integrated circuit doubles every two years.
Current integration scale is in the order of nanometers.

Heterogeneous Architecture and GP-GPUs

Heterogeneous computing refers to systems that use more than one kind of processor or core.
Simple Use Case and Basic Concepts:
- A program exposes a parallelizable part: for example, all iterations of some cycle are independent from each other, so we can run them at the same time, one per GPU core.
- There are also sequential parts and they run on the CPU.
- The program has been written in a supported language with some GPU-related annotations and functions (e.g., CUDA C++) and compiled with a compiler (e.g., nvcc) that understands them and generates GPU-runnable code, along with the parts needed to run the other sequential parts on the CPU.
- Memory management is a problem that needs to be addressed: data required for computation needs to be copied from system RAM to GPU RAM, and results back from GPU RAM to system RAM when computation is done.

Gaming Laptop for Science

The CPU is the host, the GPU is the device.
They are connected via the PCI-Express bus.
The GPU can’t access the system RAM directly (unless it’s Unified).
Functions need to be called explicitly in the host code to allocate GPU memory and copy data to it (cudaMalloc, cudaMemcpy).
Asynchronous / Synchronous execution:
- CPU can submit tasks to the GPU and go on with its own host code.
- We need to tell explicitly in our host code to wait for the GPU to finish all jobs (cudaDeviceSynchronize).
CPU: up to 128 independent cores (let’s say 10 in a typical laptop machine) with a complex control unit optimized for sequential execution.
GPU: thousands of cores grouped in Stream Multiprocessors (SM).
Each SM has its own control unit.
Cores in the same SM execute the same instruction at the same time.

Flynn’s Taxonomy

In CUDA, we care about threads, they are executed in warps, and logically grouped into blocks in a grid.
SIMT: Single Instruction Multiple Threads.

CUDA

Compute Unified Device Architecture.
It enables general-purpose programming on NVIDIA GPUs.
Current version of CUDA Toolkit: 12.8.
Each Nvidia board has an architecture and a Compute Capability version associated.

CUDA Architecture

Scalable array of Streaming Multiprocessors (SM).
Each SM supports concurrent execution of hundreds of threads.
Threads are executed in batches of 32 known as warps.
All threads in a warp must execute the same instruction at the same time.
CUDA cores can process integer data as well as floating-point data.

CUDA API

CUDA exposes two APIs:
- CUDA Runtime API
- CUDA Driver API
Driver API is low level but gives more control.
Runtime API is built upon Driver API: less control but easier to use.
We will use Runtime API.

CUDA Program

A CUDA program contains two kinds of code:
- Host code
- Device code
CUDA compiler nvcc takes care of this separation.
Pure host code can be compiled separately with another compiler like gcc and linked to binaries produced by nvcc later.

Requirements

A shell on a machine equipped with an Nvidia GPU with CUDA toolkit installed.
An IDE to edit code (Visual Studio Code with necessary extensions).

Testing CUDA Installation

Use nvidia-smi to check GPU status and CUDA version.
Use nvcc --version to check the CUDA compiler version.

Hello World

Example code:

#include <stdio.h>

int main() {
  printf("Hello world!!!\n");
}

Compile and run using nvcc hello_world.cu -o hello.o and ./hello.o.
In this example, there is no involvement of the GPU; nvcc compiles host code that runs on the CPU.

First Kernel

__global__ identifies kernels: functions that can be launched by the CPU to be executed on the GPU.
Kernels must return void and can accept arguments.

Example:

__global__ void helloGPU() {
  printf("Hello world!!!\n");
}

int main() {
  printf("Launch kernel!\n");
  helloGPU<<<1,10>>>();
  cudaDeviceSynchronize();
  return 0;
}

<<<1,10>>>: launch 1 block of 10 threads of the helloGPU kernel.
cudaDeviceSynchronize() halts the host code flow until all kernels have returned.
Kernel launches are asynchronous with respect to the host.

Thread Index

Each thread has an index associated.

Example:

__global__ void helloGPU() {
  int index = threadIdx.x;
  printf("Hello world!!! thread Id: %d \n", index);
}

int main() {
  printf("Launch kernel!\n");
  helloGPU<<<1,10>>>();
  cudaDeviceSynchronize();
  return 0;
}

Thread Organization

Threads are organized as blocks in a grid.
Blocks can be 1D, 2D, or 3D.
Grids can be 1D, 2D, or 3D.

Blocks

Max # of threads per block: 1024. It’s important!

__global__ void helloGPU() {
  int index = threadIdx.x;
  printf("Hello world!!! thread Id: %d \n", index);
}

int main() {
  printf("Launch kernel!\n");
  helloGPU<<<1,1025>>>();
  cudaDeviceSynchronize();
  return 0;
}

Code fails silently!

Error Handling

Errors must be retrieved explicitly in code: no such thing as a standard error output exist.

CUDA code probably failed at some point and no logic for error retrieval was set in code!

__global__ void helloGPU() {
  int index = threadIdx.x;
  printf("Hello world!!! thread Id: %d \n", index);
}

int main() {
  printf("Launch kernel!\n");
  helloGPU<<<1,1025>>>();
  cudaDeviceSynchronize();
  cudaError_t err = cudaGetLastError();
  printf("%s\n", cudaGetErrorString(err));
  return 0;
}

cudaGetLastError() returns the error code of the most recent error in the pipeline. It resets the error status to CUDA_SUCCESS
cudaGetErrorString(err) returns a human-readable string describing the error.

Thread Blocks and Indexing

Threads in the same block can be synchronized and can access the block-local shared memory.
Threads are identified by two uint3 variables:
- blockIdx: identifies the thread block in the grid (x,y,z)
- threadIdx: identifies the thread in its block (x,y,z)
Two more dim3 variables are accessible directly:
- blockDim: size of block dimensions in threads (x,y,z)
- gridDim: size of grid dimensions in blocks (x,y,z)
All these variables are pre-initialized by the CUDA runtime when the kernel is launched.

uint3 and dim3

uint3 is defined as a struct of 3 unsigned ints named x, y, and z:
- threadIdx.x
- threadIdx.y
- threadIdx.z
- blockIdx.x
- blockIdx.y
- blockIdx.z
dim3 is defined as a struct of 3 unsigned ints named x, y, and z (same as uint3)
Example:

dim3 grid(128); // 128 x 1 x 1
dim3 block(256, 256); // 256 x 256 x 1
kernel<<<grid, block>>>();

How to Index

There is no real 2D or 3D (or 4D!) memory. Memory is 1D (linear).
We define multidimensional arrays as an abstraction over 1D memory and each cell has two addresses: one is the address in the multidimensional array (x, y, z, …) and one is the address in the 1D underlying memory (usually an offset with respect to a starting pointer).
These ideas apply also to threads in CUDA.

Column-Major and Row-Major Order

C: row-major order (lexicographical access order), zero-based indexing
- Address = x + N_x * y
- Access = A[y][x]
Fortran: column-major order (colexicographical access order), one-based indexing
- Address = y + N_y * (x-1)
- Access = A(y,x)

1D Grid and 1D Blocks Example

Thread 1D global index = blockIdx.x * blockDim.x + threadIdx.x

1D Grid and 2D Blocks Example

Thread 1D global index = blockIdx.x * blockDim.x * blockDim.y + threadIdx.y * blockDim.x + threadIdx.x

2D Grid 2D Blocks Matrix-Like Coordinates

X = blockIdx.x * blockDim.x + threadIdx.x
Y = blockIdx.y * blockDim.y + threadIdx.y

First Task: Sum of Two Vectors

Calculate the sums on the GPU (ideal problem).

Necessary Pieces for GPU Computation

cudaDeviceReset(): Resets the GPU: memory wipe + kill all threads.
int *d_a, *d_b, *d_c: Pointers to device memory locations.
cudaMalloc(&d_a,sizeof(int)*N): Allocate device global memory (the first argument is the address of the pointer!).
cudaMemcpy(d_a, h_a, sizeof(int)*N, cudaMemcpyHostToDevice): Copy data host->device.
sumVectors<<<1, N>>>(d_a, d_b, d_c, N): Call kernel.
cudaMemcpy(h_cgpu, d_c, sizeof(int)*N,cudaMemcpyDeviceToHost): Copy result device->host (I conveniently defined the h_cgpu buffer).
cudaFree(d_a): Free device memory.

Kernel for Vector Sum

__global__ void sumVectors(int* a, int* b, int* c, int maxN) {
  int i = threadIdx.x + blockDim.x * blockIdx.x;
  if(i >= maxN) return; // It's better to be safe than to be sorry
  c[i] = a[i] + b[i];
}

The index is calculated as explained before.
We need to be sure that no reads, or worse writes, happen out-of-bound.
Threads that would attempt to access out-of-bound elements return immediately.

Host and Device Pointers

Host pointers (that point to host memory) can be dereferenced only by host code (main in this case).
Device pointers (that point to device memory) can be dereferenced only by device code.

API Prototypes

All these functions have one thing in common: they return cudaError_t!

Error Macro

#define ERRCHK(error) {
  const cudaError_t e = error;
  if(e!=cudaSuccess) {
    printf("Cuda failure %s:%d: '%s'\n",__FILE__,__LINE__,cudaGetErrorString(e));
    exit(0);
  }
}

Wrap any CUDA api call you want to check with it.

Execution Time Measurement

typedef std::chrono::high_resolution_clock Clock;
typedef std::chrono::time_point<Clock> timePoint;
typedef std::chrono::duration<double, std::milli> msInterval;
#define BLOCKSIZE 64

timePoint start = Clock::now();
sumVectors<<<N/BLOCKSIZE, BLOCKSIZE>>>(d_a, d_b, d_c, N);
ERRCHK(cudaDeviceSynchronize()); // This is how we check kernels
timePoint stop = Clock::now();
msInterval interval = stop - start;
printf("GPU elapsed time: %f\n", interval.count());

By defining a BLOCKSIZE I can easily play with N without worrying about keeping the number of threads per block smaller than 1024.

Parallel Execution Model in CUDA

Threads are executed in parallel: one per core.
Threads are organized hierarchically in a grid of blocks.
We decide the block and grid size (max 1024 threads per block!).
Blocks and grids can be 1D, 2D, 3D.

Warps

Block execution happens in groups of 32 threads with consecutive IDs.
In a warp: same instruction at the same time.

Blocks and Stream Multiprocessors

Each SM has a certain amount of registers.
Each thread has its own associated registers when it’s executed.
Blocks are bound to one specific SM for their lifetime.
One SM can host the concurrent execution of several blocks.

Compute Capabilities and Limits

Reference the CUDA documentation for details on compute capabilities.

Warp Divergence

All threads in a warp must execute the same instruction at the same time.
This is a strong constraint and easily violated.
Think about if…then…else conditionals. Divergence can happen for any kind of condition (not necessarily related to thread indexing as in this example!!!)

Serialization in Divergent Warps

Worst case scenario: each thread diverges from each other, it means 32 different execution paths serialized.
If divergence is inevitable:
- Keep the divergent code sections small: a quick if…then…else doesn’t harm too much.
- Avoid nested divergence
- You won’t see switch statements in a CUDA kernel, probably

? : Operator

Variable = (condition) ? (Value if the condition is true) : (Value if the condition is false);

Standard Problem: Parallel Reduction

Problem of reduction: given a vector of numbers compute their sum (reduce it)
It’s a fundamental problem that exposes some parallelism:

What if we change the order of summation?

Results Floating point sums are not associative!
See IEEE 754 standard and related topics. This happens on CPUs and GPUs as well. Keep this in mind while debugging or assessing the correctness of your code by comparing results.

First Parallel Reduction Kernel

The number of elements (N) is a power of 2

__global__ void reduce0(double *x, int m) {
  int tid = blockDim.x*blockIdx.x + threadIdx.x;
  x[tid] += x[tid+m];
}

Introduction to Shared Memory and Thread Synchronization

Shared memory is used for data sharing and thread communication within a block

CUDA Memory Model

Local memory is used by threads if the compiler decides that registers are not enough (spilling).
Constant memory is fast, since it’s cached, and it’s best used if all threads in a warp access the same location (broadcast).
Textures are best used to access and interpolate data that have spatial locality (linear or surfaces)

Query your Device for Memory Features

Locate your CUDA installation and execute the program deviceQuery in its extras/demo_suite subfolder

Accessing Global Memory

The art of coalescence
Memory access is coalesced whenever it’s possible for the threads of a warp to combine reads and writes in as few as possible transactions.
Each coalesced transaction is 32 bytes wide

Introducing Shared Memory

Shared memory is a fast-access memory pool, typically 64 KB available per SM.
Two possible ways to allocate it:
- __shared__ double data[256]: Size defined at compile time. Static allocation
- extern __shared__ double data[]: Size defined at run time. Dynamic allocation

Shared Memory and Multiple Static Allocations

Multiple STATIC allocations of shared memory behave correctly: distinct arrays are available inside the kernel.
__syncthreads() We synchronize all non-exited threads in the block: they all must reach this instruction before any of them can go past it.

Shared Memory and Multiple Dynamic Allocations

Multiple DYNAMIC allocations of shared memory behave in a peculiar way.
All extern __shared__ pointer declared in a kernel will point to the same location: the beginning of the shared memory block allocated at kernel launch!

Reduction with Shared Memory

__syncthreads() is necessary anytime we need to access a shared memory location that has been written by another thread in the block.
We reuse the folding algorithm from reduce0 to reduce the partial sums stored in the block shared memory.

Deadlocks

Deadlocks happen when two processes wait for each other to perform some task (e.g. finish, reach some instruction, release a lock on some resource).
The program stalls.

Introduction to Textures and Interpolation

Images are 2D matrices of numbers.
Can we use the same hardware that rotates and scales them for scientific computing? Yes!

Texture Memory and API

CUDA supports a subset of the texturing hardware that the GPU uses for graphics to access texture memory.
Texture memory is read-only and cached.
Data retrieval from textures is known as texture fetch: a fetch costs one global memory read if requested data is not in cache (miss), a fast cache read otherwise (hit)

Applications of GPU Programming in Physics

Integrate differential equations
Monte Carlo simulations
Matrix multiplications (next Lesson!)

Derivatives on a Grid of Values

A function f(x,y) can be evaluated on a 2D grid of equally spaced points.
The resulting grid is a discrete approximation of the function.

Stencils

Derivatives in the past slide can be represented as stencils: operators that replace any element of the grid with a linear combination of their neighbors.

Laplace’s Equation on a Grid

We want to solve it numerically on a 2D square grid.
Boundary conditions: left and right edges set at 1, top and bottom edges set at 0.
We set the interior of the grid at 0, then we apply Laplace’s stencil to the whole grid, i.e. replacing each value of the interior of the grid with the average of the 4 surrounding values (horizontally and vertically).
Jacobi iteration

Implementation in CUDA

__global__ void stencil2D(double *d_a, double *d_b, int nx, int ny) {
  auto idx = [&nx] (int y, int x) { return y*nx + x; };
  int x = blockIdx.x*blockDim.x + threadIdx.x;
  int y = blockIdx.y*blockDim.y + threadIdx.y;
  if(x < 1  || y < 1  || x >= nx-1  || y >= ny-1) return;
  d_b[idx(y,x)] = 0.25*(d_a[idx(y,x+1)] + d_a[idx(y,x-1)] + d_a[idx(y+1,x)] + d_a[idx(y-1,x)]);
}

Embarrassingly Parallel Problems

Monte Carlo simulations
Independent tasks

Monte Carlo

Monte-Carlo methods rely on random sampling to obtain some numerical result.
Two well-known applications:
- Integration
- Simulation

Monte Carlo Integration

Let’s find an approximate value of p by measuring the area of one quarter of a circle.

Monte Carlo Integration with CUDA

CUDA includes the cuRAND Library to generate large numbers of random floats.
These numbers can be stored in device buffers allocated with cudaMalloc as usual.

Integrate with cuRAND Device API

We can move generation and integration to GPU using cuRAND device APIs, which can be called from kernels.