GPU Programming for Physics PhD Course Notes
Introduction to GPUs
GPUs (Graphics Processing Units) are used to generate 3D graphics in real-time.
Mid 1990s: First 3D accelerating Graphic Cards and fully 3D capable consoles.
2020s: General Purpose GPUs (GP-GPUs) are used for generic parallel calculations.
Since 2007 GPUs are used not only for graphics but also for generic parallel calculations.
Some GPUs don’t even implement a video output.
Cryptocurrencies boosted the interest in this topic (Bitcoin mining).
Why GPUs?
Geometrical operations on objects in 3D graphics can be computed efficiently in a parallel architecture:
Sum of vectors: just sum vector components independently
Rotations: compute each component of the rotated vector independently
Ray Tracing (ray tracing is a technique for modeling light transport for use in a wide variety of rendering algorithms for generating digital images)
Reflection, refraction and shadow effects.
Ideally each pixel could be processed independently.
CPUs vs GPUs
CPUs and GPUs are both necessary, they solve different problems and have different roles.
CPU: Central Processing Unit
Runs the Operating System which manages all system resources
Applications run on it, even if they take advantage of GPUs at some point
Features multiple cores (approximately 10 in a typical laptop)
Has a control unit that manages code execution and memory I/O
Moore’s Law: the number of transistors in an integrated circuit doubles every two years.
Current integration scale is in the order of nanometers.
Heterogeneous Architecture and GP-GPUs
Heterogeneous computing refers to systems that use more than one kind of processor or core.
Simple Use Case and Basic Concepts:
A program exposes a parallelizable part: for example, all iterations of some cycle are independent from each other, so we can run them at the same time, one per GPU core.
There are also sequential parts and they run on the CPU.
The program has been written in a supported language with some GPU-related annotations and functions (e.g., CUDA C++) and compiled with a compiler (e.g., nvcc) that understands them and generates GPU-runnable code, along with the parts needed to run the other sequential parts on the CPU.
Memory management is a problem that needs to be addressed: data required for computation needs to be copied from system RAM to GPU RAM, and results back from GPU RAM to system RAM when computation is done.
Gaming Laptop for Science
The CPU is the host, the GPU is the device.
They are connected via the PCI-Express bus.
The GPU can’t access the system RAM directly (unless it’s Unified).
Functions need to be called explicitly in the host code to allocate GPU memory and copy data to it (cudaMalloc, cudaMemcpy).
Asynchronous / Synchronous execution:
CPU can submit tasks to the GPU and go on with its own host code.
We need to tell explicitly in our host code to wait for the GPU to finish all jobs (cudaDeviceSynchronize).
CPU: up to 128 independent cores (let’s say 10 in a typical laptop machine) with a complex control unit optimized for sequential execution.
GPU: thousands of cores grouped in Stream Multiprocessors (SM).
Each SM has its own control unit.
Cores in the same SM execute the same instruction at the same time.
Flynn’s Taxonomy
In CUDA, we care about threads, they are executed in warps, and logically grouped into blocks in a grid.
SIMT: Single Instruction Multiple Threads.
CUDA
Compute Unified Device Architecture.
It enables general-purpose programming on NVIDIA GPUs.
Current version of CUDA Toolkit: 12.8.
Each Nvidia board has an architecture and a Compute Capability version associated.
CUDA Architecture
Scalable array of Streaming Multiprocessors (SM).
Each SM supports concurrent execution of hundreds of threads.
Threads are executed in batches of 32 known as warps.
All threads in a warp must execute the same instruction at the same time.
CUDA cores can process integer data as well as floating-point data.
CUDA API
CUDA exposes two APIs:
CUDA Runtime API
CUDA Driver API
Driver API is low level but gives more control.
Runtime API is built upon Driver API: less control but easier to use.
We will use Runtime API.
CUDA Program
A CUDA program contains two kinds of code:
Host code
Device code
CUDA compiler nvcc takes care of this separation.
Pure host code can be compiled separately with another compiler like gcc and linked to binaries produced by nvcc later.
Requirements
A shell on a machine equipped with an Nvidia GPU with CUDA toolkit installed.
An IDE to edit code (Visual Studio Code with necessary extensions).
Testing CUDA Installation
Use
nvidia-smito check GPU status and CUDA version.Use
nvcc --versionto check the CUDA compiler version.
Hello World
Example code:
#include <stdio.h> int main() { printf("Hello world!!!\n"); }Compile and run using
nvcc hello_world.cu -o hello.oand./hello.o.In this example, there is no involvement of the GPU;
nvcccompiles host code that runs on the CPU.
First Kernel
__global__identifies kernels: functions that can be launched by the CPU to be executed on the GPU.Kernels must return void and can accept arguments.
Example:
__global__ void helloGPU() { printf("Hello world!!!\n"); } int main() { printf("Launch kernel!\n"); helloGPU<<<1,10>>>(); cudaDeviceSynchronize(); return 0; }<<<1,10>>>: launch 1 block of 10 threads of thehelloGPUkernel.cudaDeviceSynchronize()halts the host code flow until all kernels have returned.Kernel launches are asynchronous with respect to the host.
Thread Index
Each thread has an index associated.
Example:
__global__ void helloGPU() { int index = threadIdx.x; printf("Hello world!!! thread Id: %d \n", index); } int main() { printf("Launch kernel!\n"); helloGPU<<<1,10>>>(); cudaDeviceSynchronize(); return 0; }
Thread Organization
Threads are organized as blocks in a grid.
Blocks can be 1D, 2D, or 3D.
Grids can be 1D, 2D, or 3D.
Blocks
Max # of threads per block: 1024. It’s important!
__global__ void helloGPU() { int index = threadIdx.x; printf("Hello world!!! thread Id: %d \n", index); } int main() { printf("Launch kernel!\n"); helloGPU<<<1,1025>>>(); cudaDeviceSynchronize(); return 0; }Code fails silently!
Error Handling
Errors must be retrieved explicitly in code: no such thing as a standard error output exist.
CUDA code probably failed at some point and no logic for error retrieval was set in code!
__global__ void helloGPU() { int index = threadIdx.x; printf("Hello world!!! thread Id: %d \n", index); } int main() { printf("Launch kernel!\n"); helloGPU<<<1,1025>>>(); cudaDeviceSynchronize(); cudaError_t err = cudaGetLastError(); printf("%s\n", cudaGetErrorString(err)); return 0; }cudaGetLastError()returns the error code of the most recent error in the pipeline. It resets the error status toCUDA_SUCCESScudaGetErrorString(err)returns a human-readable string describing the error.
Thread Blocks and Indexing
Threads in the same block can be synchronized and can access the block-local shared memory.
Threads are identified by two
uint3variables:blockIdx: identifies the thread block in the grid (x,y,z)threadIdx: identifies the thread in its block (x,y,z)
Two more
dim3variables are accessible directly:blockDim: size of block dimensions in threads (x,y,z)gridDim: size of grid dimensions in blocks (x,y,z)
All these variables are pre-initialized by the CUDA runtime when the kernel is launched.
uint3 and dim3
uint3is defined as a struct of 3 unsigned ints named x, y, and z:threadIdx.xthreadIdx.ythreadIdx.zblockIdx.xblockIdx.yblockIdx.z
dim3is defined as a struct of 3 unsigned ints named x, y, and z (same asuint3)
Example:
dim3 grid(128); // 128 x 1 x 1
dim3 block(256, 256); // 256 x 256 x 1
kernel<<<grid, block>>>();
How to Index
There is no real 2D or 3D (or 4D!) memory. Memory is 1D (linear).
We define multidimensional arrays as an abstraction over 1D memory and each cell has two addresses: one is the address in the multidimensional array (x, y, z, …) and one is the address in the 1D underlying memory (usually an offset with respect to a starting pointer).
These ideas apply also to threads in CUDA.
Column-Major and Row-Major Order
C: row-major order (lexicographical access order), zero-based indexing
Address =
x + N_x * yAccess =
A[y][x]
Fortran: column-major order (colexicographical access order), one-based indexing
Address =
y + N_y * (x-1)Access =
A(y,x)
1D Grid and 1D Blocks Example
Thread 1D global index =
blockIdx.x * blockDim.x + threadIdx.x
1D Grid and 2D Blocks Example
Thread 1D global index =
blockIdx.x * blockDim.x * blockDim.y + threadIdx.y * blockDim.x + threadIdx.x
2D Grid 2D Blocks Matrix-Like Coordinates
X = blockIdx.x * blockDim.x + threadIdx.xY = blockIdx.y * blockDim.y + threadIdx.y
First Task: Sum of Two Vectors
Calculate the sums on the GPU (ideal problem).
Necessary Pieces for GPU Computation
cudaDeviceReset(): Resets the GPU: memory wipe + kill all threads.int *d_a, *d_b, *d_c: Pointers to device memory locations.cudaMalloc(&d_a,sizeof(int)*N): Allocate device global memory (the first argument is the address of the pointer!).cudaMemcpy(d_a, h_a, sizeof(int)*N, cudaMemcpyHostToDevice): Copy data host->device.sumVectors<<<1, N>>>(d_a, d_b, d_c, N): Call kernel.cudaMemcpy(h_cgpu, d_c, sizeof(int)*N,cudaMemcpyDeviceToHost): Copy result device->host (I conveniently defined the h_cgpu buffer).cudaFree(d_a): Free device memory.
Kernel for Vector Sum
__global__ void sumVectors(int* a, int* b, int* c, int maxN) {
int i = threadIdx.x + blockDim.x * blockIdx.x;
if(i >= maxN) return; // It's better to be safe than to be sorry
c[i] = a[i] + b[i];
}
The index is calculated as explained before.
We need to be sure that no reads, or worse writes, happen out-of-bound.
Threads that would attempt to access out-of-bound elements return immediately.
Host and Device Pointers
Host pointers (that point to host memory) can be dereferenced only by host code (main in this case).
Device pointers (that point to device memory) can be dereferenced only by device code.
API Prototypes
All these functions have one thing in common: they return
cudaError_t!
Error Macro
#define ERRCHK(error) {
const cudaError_t e = error;
if(e!=cudaSuccess) {
printf("Cuda failure %s:%d: '%s'\n",__FILE__,__LINE__,cudaGetErrorString(e));
exit(0);
}
}
Wrap any CUDA api call you want to check with it.
Execution Time Measurement
typedef std::chrono::high_resolution_clock Clock;
typedef std::chrono::time_point<Clock> timePoint;
typedef std::chrono::duration<double, std::milli> msInterval;
#define BLOCKSIZE 64
timePoint start = Clock::now();
sumVectors<<<N/BLOCKSIZE, BLOCKSIZE>>>(d_a, d_b, d_c, N);
ERRCHK(cudaDeviceSynchronize()); // This is how we check kernels
timePoint stop = Clock::now();
msInterval interval = stop - start;
printf("GPU elapsed time: %f\n", interval.count());
By defining a
BLOCKSIZEI can easily play withNwithout worrying about keeping the number of threads per block smaller than 1024.
Parallel Execution Model in CUDA
Threads are executed in parallel: one per core.
Threads are organized hierarchically in a grid of blocks.
We decide the block and grid size (max 1024 threads per block!).
Blocks and grids can be 1D, 2D, 3D.
Warps
Block execution happens in groups of 32 threads with consecutive IDs.
In a warp: same instruction at the same time.
Blocks and Stream Multiprocessors
Each SM has a certain amount of registers.
Each thread has its own associated registers when it’s executed.
Blocks are bound to one specific SM for their lifetime.
One SM can host the concurrent execution of several blocks.
Compute Capabilities and Limits
Reference the CUDA documentation for details on compute capabilities.
Warp Divergence
All threads in a warp must execute the same instruction at the same time.
This is a strong constraint and easily violated.
Think about if…then…else conditionals. Divergence can happen for any kind of condition (not necessarily related to thread indexing as in this example!!!)
Serialization in Divergent Warps
Worst case scenario: each thread diverges from each other, it means 32 different execution paths serialized.
If divergence is inevitable:
Keep the divergent code sections small: a quick if…then…else doesn’t harm too much.
Avoid nested divergence
You won’t see switch statements in a CUDA kernel, probably
? : Operator
Variable = (condition) ? (Value if the condition is true) : (Value if the condition is false);
Standard Problem: Parallel Reduction
Problem of reduction: given a vector of numbers compute their sum (reduce it)
It’s a fundamental problem that exposes some parallelism:
What if we change the order of summation?
Results Floating point sums are not associative!
See IEEE 754 standard and related topics. This happens on CPUs and GPUs as well. Keep this in mind while debugging or assessing the correctness of your code by comparing results.
First Parallel Reduction Kernel
The number of elements (N) is a power of 2
__global__ void reduce0(double *x, int m) {
int tid = blockDim.x*blockIdx.x + threadIdx.x;
x[tid] += x[tid+m];
}
Introduction to Shared Memory and Thread Synchronization
Shared memory is used for data sharing and thread communication within a block
CUDA Memory Model
Local memory is used by threads if the compiler decides that registers are not enough (spilling).
Constant memory is fast, since it’s cached, and it’s best used if all threads in a warp access the same location (broadcast).
Textures are best used to access and interpolate data that have spatial locality (linear or surfaces)
Query your Device for Memory Features
Locate your CUDA installation and execute the program deviceQuery in its extras/demo_suite subfolder
Accessing Global Memory
The art of coalescence
Memory access is coalesced whenever it’s possible for the threads of a warp to combine reads and writes in as few as possible transactions.
Each coalesced transaction is 32 bytes wide
Introducing Shared Memory
Shared memory is a fast-access memory pool, typically 64 KB available per SM.
Two possible ways to allocate it:
__shared__ double data[256]: Size defined at compile time. Static allocationextern __shared__ double data[]: Size defined at run time. Dynamic allocation
Shared Memory and Multiple Static Allocations
Multiple STATIC allocations of shared memory behave correctly: distinct arrays are available inside the kernel.
__syncthreads()We synchronize all non-exited threads in the block: they all must reach this instruction before any of them can go past it.
Shared Memory and Multiple Dynamic Allocations
Multiple DYNAMIC allocations of shared memory behave in a peculiar way.
All
extern __shared__pointer declared in a kernel will point to the same location: the beginning of the shared memory block allocated at kernel launch!
Reduction with Shared Memory
__syncthreads()is necessary anytime we need to access a shared memory location that has been written by another thread in the block.We reuse the folding algorithm from reduce0 to reduce the partial sums stored in the block shared memory.
Deadlocks
Deadlocks happen when two processes wait for each other to perform some task (e.g. finish, reach some instruction, release a lock on some resource).
The program stalls.
Introduction to Textures and Interpolation
Images are 2D matrices of numbers.
Can we use the same hardware that rotates and scales them for scientific computing? Yes!
Texture Memory and API
CUDA supports a subset of the texturing hardware that the GPU uses for graphics to access texture memory.
Texture memory is read-only and cached.
Data retrieval from textures is known as texture fetch: a fetch costs one global memory read if requested data is not in cache (miss), a fast cache read otherwise (hit)
Applications of GPU Programming in Physics
Integrate differential equations
Monte Carlo simulations
Matrix multiplications (next Lesson!)
Derivatives on a Grid of Values
A function f(x,y) can be evaluated on a 2D grid of equally spaced points.
The resulting grid is a discrete approximation of the function.
Stencils
Derivatives in the past slide can be represented as stencils: operators that replace any element of the grid with a linear combination of their neighbors.
Laplace’s Equation on a Grid
We want to solve it numerically on a 2D square grid.
Boundary conditions: left and right edges set at 1, top and bottom edges set at 0.
We set the interior of the grid at 0, then we apply Laplace’s stencil to the whole grid, i.e. replacing each value of the interior of the grid with the average of the 4 surrounding values (horizontally and vertically).
Jacobi iteration
Implementation in CUDA
__global__ void stencil2D(double *d_a, double *d_b, int nx, int ny) {
auto idx = [&nx] (int y, int x) { return y*nx + x; };
int x = blockIdx.x*blockDim.x + threadIdx.x;
int y = blockIdx.y*blockDim.y + threadIdx.y;
if(x < 1 || y < 1 || x >= nx-1 || y >= ny-1) return;
d_b[idx(y,x)] = 0.25*(d_a[idx(y,x+1)] + d_a[idx(y,x-1)] + d_a[idx(y+1,x)] + d_a[idx(y-1,x)]);
}
Embarrassingly Parallel Problems
Monte Carlo simulations
Independent tasks
Monte Carlo
Monte-Carlo methods rely on random sampling to obtain some numerical result.
Two well-known applications:
Integration
Simulation
Monte Carlo Integration
Let’s find an approximate value of p by measuring the area of one quarter of a circle.
Monte Carlo Integration with CUDA
CUDA includes the cuRAND Library to generate large numbers of random floats.
These numbers can be stored in device buffers allocated with cudaMalloc as usual.
Integrate with cuRAND Device API
We can move generation and integration to GPU using cuRAND device APIs, which can be called from kernels.