CUDA Notes
CUDA – Compute Unified Device Architecture
CUDA stands for Compute Unified Device Architecture and is an extension of the C programming language created by NVIDIA. It allows programmers to utilize the parallel computing power of NVIDIA graphics cards for general-purpose computation.
Key Concepts
CUDA: A parallel computing platform and programming model by NVIDIA, implemented by their GPUs.
OpenCL: An open standard for programming CPUs, GPUs, and other devices from various vendors. It offers portability but may incur a performance penalty compared to CUDA.
Kernel: A piece of code that runs on the GPU.
CUDA vs. OpenCL
CUDA is proprietary to NVIDIA GPUs, while OpenCL is an open standard. OpenCL aims for portability across different hardware, but this generality can sometimes lead to reduced performance compared to CUDA, which is tailored specifically for NVIDIA's architecture.
CUDA's Massively Parallel Computation
CUDA enables massively parallel computations on graphics accelerators and was first available with NVIDIA's G8X line of graphics cards. Approximately 1 million CUDA-capable GPUs are shipped weekly, presenting a significant opportunity for developing widely deployed parallel applications.
Addressing the Power and Latency Walls
With the increasing challenges posed by the Power Wall and Latency Wall, CUDA provides a solution to prevent processor-intensive programs from slowing down. CUDA facilitates simulations of complex systems like networks of brain neurons and brings the potential of ubiquitous supercomputing to everyday computers.
CUDA Support and Architecture
CUDA is supported on NVIDIA's G8X and above graphics cards. The current CUDA GPU architecture is branded Tesla. 8-series GPUs offer 50-200 GFLOPS (floating-point operations per second).
CUDA Compilation
CUDA uses a set of extensions to ANSI C. CPU code is compiled by the host C compiler, while GPU code (kernels) is compiled by the CUDA compiler (cudacc). This process produces separate binaries.
The compilation process involves:
Integrated Source (
.c,.cufiles)cudacc: Front End and Global OptimizerGPU Assembly/Kernel Code (
.sfiles)Kernel Object Code Generator
CPU Host Code (
.cfiles)Host Compiler
Kernel Object Code (
.gpufiles)Executable Host Binary (
.ofiles)
CUDA Stack
The CUDA stack includes libraries (FFT, BLAS, etc.), example source code, the NVIDIA C Compiler, NVIDIA assembly, CPU host code for computing, the CUDA driver, debugger, profiler, a standard C compiler, and both GPU and CPU components.
Limitations of CUDA
Tesla GPUs do not fully support the IEEE standard for double-precision floating-point operations.
Code is only supported on NVIDIA hardware.
Recursive functions are not supported (though workarounds exist).
There is bus latency between the host CPU and the GPU.
Example: CUDA Speed vs. CPU Speed
This example demonstrates the performance difference between CPU and GPU using the PyTorch library.
import torch
import time
if torch.cuda.is_available():
device = torch.device("cuda")
else:
device = torch.device("cpu")
print("using", device, "device")
matrix_size = 32 * 512
x = torch.randn((matrix_size, matrix_size))
y = torch.randn((matrix_size, matrix_size))
print("************ CPU SPEED ****************")
start = time.time()
result = torch.matmul(x, y)
print(time.time() - start)
print("verify device:", result.device)
x_gpu = x.to(device)
y_gpu = y.to(device)
torch.cuda.synchronize()
for i in range(3):
print("************ GPU SPEED ****************")
start = time.time()
result_gpu = torch.matmul(x_gpu, y_gpu)
torch.cuda.synchronize()
print(time.time() - start)
print("verify device:", result_gpu.device)
Results:
CPU Speed:
GPU Speed (initial):
GPU Speed (subsequent): Approximately seconds
Data Parallelism - Vector Addition Example
Vector A:
Vector B:
Vector C:
CUDA/OpenCL - Execution Model
Illustrates a heterogeneous host+device application with serial parts in host C code and parallel parts in device SPMD kernel C code.
Serial Code (host)
Parallel Kernel (device)
KernelA<<< nBlk, nTid >>>(args);
Serial Code (host)
Parallel Kernel (device)
KernelB<<< nBlk, nTid >>>(args);
From Natural Language to Electrons
Describes the abstraction layers from natural language down to the physical execution by electrons:
Natural Language (e.g., English)
Algorithm
High-Level Language (C/C++…)
Compiler
Instruction Set Architecture
Microarchitecture
Circuits
Electrons
Instruction Set Architecture (ISA)
The ISA is a contract between the hardware and the software; it defines the set of instructions that the hardware can execute.
Von-Neumann Processor
Memory
Processing Unit
PC
Reg File
ALU
Control Unit
IR
I/O
Arrays of Parallel Threads
A CUDA kernel is executed by a grid (array) of threads.
All threads in a grid run the same kernel code (SPMD - Single Program, Multiple Data).
Each thread uses indexes to compute memory addresses and make control decisions.
i = blockldx.x * blockDim.x + threadIdx.x;
C[i] = A[i] + B[i];
Thread Blocks: Scalable Cooperation
Divide the thread array into multiple blocks.
Threads within a block cooperate via shared memory, atomic operations, and barrier synchronization.
Threads in different blocks do not interact.
i = blockldx.x * blockDim.x + threadIdx.x;
C[i] = A[i] + B[i];
Thread Indices
Each thread uses indices to decide what data to work on.
blockldx: 1D, 2D, or 3D (CUDA 4.0)threadldx: 1D, 2D, or 3DSimplifies memory addressing when processing multidimensional data (e.g., image processing, solving PDEs).
CUDA C Program Compilation
The host code in a CUDA C program is processed by a host C pre-processor and compiler/linker, while the device code is processed by the NVIDIA C compiler and a device JIT compiler, resulting in a heterogeneous computing platform.
Data Parallelism
Data parallelism involves performing many arithmetic operations simultaneously on data structures.
Matrix Multiplication: Each element of the product matrix is generated by performing a dot product between a row of input matrix and a column of input matrix .
For example, calculating involves a dot product of the th row of and the th column of .
Since dot product operations for different elements of can be performed simultaneously, matrix multiplication exhibits high data parallelism.
For large matrices, the number of dot products can be very large. For instance, a matrix multiplication involves independent dot products, each with 1000 multiply and 1000 accumulate operations.
By executing many dot products in parallel, a CUDA device can significantly accelerate the execution of matrix multiplication.
CUDA Program Structure
A CUDA program contains both host and device code. The NVIDIA C compiler (nvcc) separates the two during compilation.
Host code is compiled with standard C compilers and runs as a CPU process.
Device code (kernels) is compiled by
nvccand executed on a GPU device.Kernels generate a large number of threads to exploit data parallelism.
CUDA programmers assume that threads are lightweight and take very few cycles to generate and schedule.
CUDA groups threads into warps, blocks, and grids.
CUDA Execution Flow
Execution starts on the host (CPU). When a kernel function is invoked (launched), execution moves to the device (GPU), where a large number of threads are generated.
All threads generated by a kernel during an invocation form a grid.
When all threads of a kernel complete, the grid terminates, and execution returns to the host.
Examples: Vector Addition
CPU Only:
void vecAdd(float* h_A, float* h_B, float* h_C, int n) {
for (i = 0; i < n; i++)
h_C[i] = h_A[i] + h_B[i];
}
int main() {
float *h_A, *h_B, *h_C;
int n; // Specify size of Vectors
h_A = (float*)malloc(n * sizeof(float));
h_B = (float*)malloc(n * sizeof(float));
h_C = (float*)malloc(n * sizeof(float));
vecAdd(h_A, h_B, h_C, N);
}
CPU-GPU:
#include <cuda.h>
#include <cuda_runtime.h>
__global__ void vectorAdd(float*, float*, float*, int);
__global__ void vectorAdd(float *A, float *B, float* C, int n) {
int i = threadIdx.x + blockDim.x * blockIdx.x;
if (i < n)
C[i] = A[i] + B[i];
}
void vecAdd(float * h_A, float *h_B, float* h_C, int n) {
int size = n * sizeof(float);
float *d_A=NULL, *d_B=NULL, *d_C=NULL;
cudaError_t err = cudaSuccess;
// Device Memory Allocation
err = cudaMalloc((void**)&d_A, size);
if (err != cudaSuccess) { ... }
// Host to Device Data Transfer
err = cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
if (err != cudaSuccess) { ... }
// Kernel Launch
int threadsPerBlock = 256;
int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, n);
err = cudaGetLastError();
// Device to Host Memory Transfer
err = cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
if (err != cudaSuccess) { ... }
cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
}
Device Memory Allocation and Data Transfer
**cudaMalloc((void *) &d_A, size):* Allocates memory on the GPU from global memory.
The function expects a generic pointer (
void **).
cudaMemcpy(dA, hA, size, cudaMemcpyHostToDevice): Transfers data from CPU to GPU memory.
The pointer
d_Acannot be dereferenced in host code.
cudaMemcpy (hC, dC, size, cudaMemcpyDeviceToHost): Transfers data from GPU to CPU memory.
Data transfers can occur among different device memory locations as well.
Transfers can also occur from host to host, but transfers are not allowed among different GPU devices.
cudaFree(d_A): Frees GPU global memory.
CUDA Kernel Launch
The CUDA kernel launches multiple threads in a 2-level hierarchy.
Example:
vectorAdd <<<ceil(n/256), 256>>> (d_A, d_B, d_C, n)Specifies a grid of threads to be launched.
Organized hierarchically (number of blocks, number of threads per block).
All blocks contain the same number of threads (up to a maximum of 1024).
Blocks can be numbered as triplets (
_, _, _).
Kernel-Specific System Variables
gridDim: Number of blocks in the grid.gridDim.x: Number of blocks in dimension x of a multi-dimensional grid.
blockDim: Number of threads per block.blockDim.x: Number of threads per block in dimension x of a multi-dimensional block.
blockIdx.x: Block number for a thread.threadIdx.x: Thread number inside a block.
Function Declaration Keywords
CUDA uses keywords to define where functions are executed and from where they can be called.
Keyword | Executed on the | Only callable from the |
|---|---|---|
| device | device |
| device | host |
| host | host |
Matrix Multiplication Example (Traditional C Code)
void MatrixMultiplication(float* M, float* N, float* P, int Width) {
for (int i = 0; i < Width; ++i) {
for (int j = 0; j < Width; ++j) {
float sum = 0;
for (int k = 0; k < Width; ++k) {
float a = M[i * Width + k];
float b = N[k * Width + j];
sum += a * b;
}
P[i * Width + j] = sum;
}
}
}
Traditional Vector Addition (C Code), Revised CUDA Code
The example highlights the transformation from CPU-based code to CUDA-enabled code, involving memory allocation on the device, data transfer, and kernel launch.
#include <cuda.h>
void vecAdd(float* A, float* B, float* C, int n) {
int size = n * sizeof(float);
float *A_d, *B_d, *C_d;
cudaMalloc((void**)&A_d, size);
cudaMalloc((void**)&B_d, size);
cudaMalloc((void**)&C_d, size);
cudaMemcpy(A_d, A, size, cudaMemcpyHostToDevice);
cudaMemcpy(B_d, B, size, cudaMemcpyHostToDevice);
int threadsPerBlock = 256;
int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(A_d, B_d, C_d, n);
cudaMemcpy(C, C_d, size, cudaMemcpyDeviceToHost);
cudaFree(A_d);
cudaFree(B_d);
cudaFree(C_d);
}
Device Memories and Data Transfer
In CUDA, the host (CPU) and devices (GPUs) have separate memory spaces.
Devices have their own DRAM (Dynamic Random Access Memory).
cudaMalloc(): Allocates global memory on the device.
The first parameter is the address of a pointer that will point to the allocated memory.
The second parameter is the size of the object to be allocated (in bytes).
cudaFree(): Frees the allocated global memory specified by a pointer.
cudaMemcpy(): Transfers data between memories.
Takes four parameters:
Pointer to the destination location.
Pointer to the source data.
Number of bytes to be copied.
Type of memory transfer (host-to-host, host-to-device, device-to-host, device-to-device).
Cannot be used to copy between different GPUs in multi-GPU systems.
Kernel Functions and Threading
CUDA programming uses the Single-Program, Multiple-Data (SPMD) parallel programming style. The __global__ keyword indicates a CUDA kernel function that will execute on the device and can only be called from the host to generate a grid of threads.
Kernel Functions and Threading (Keywords)
Keyword | Executed on the | Only callable from the |
|---|---|---|
| device | device |
| device | host |
| host | host |
The device keyword indicates that the function being declared is a CUDA device function. A device function executes on a CUDA device and can only be called from a kernel function or another device function. Device functions can have neither recursive function calls nor indirect function calls through pointers in them.
The host keyword indicates that the function being declared is a CUDA host function. A host function is simply a traditional C function that executes on the host and can only be called from another host function.
GPU Memory Spaces
Each thread has local memory (thread lifetime).
Each block has shared memory (block lifetime).
Global memory is accessible to all threads (persists indefinitely).
Constant and texture memory spaces are also available (persists indefinitely).
CUDA Execution Model
Host code uses the GPU device as a coprocessor.
Transfers input data from the host to the device.
Launches code on the device.
Transfers output data from the device to the host.
Device code (kernel): A function executed by all threads (SIMT).
Accesses device resources directly: special data types, memory types (constant, texture, shared, global).
Interoperates with OpenGL and Direct3D graphics APIs.
Thread Index
Each thread has a unique thread index, accessible via the built-in
threadIdxvariable.
Built-In Variables
threadIdx: Index of a thread.blockIdx: Index of a block.gridDim: Dimension of a grid.blockDim: Dimension of a block.
Indexing Within Grid
threadIdxis unique only within its thread block.To determine the unique grid index of a thread:
i = threadIdx.x + blockIdx.x * blockDim.x;
Examples of Kernel Functions
__global__ void kernel( int* a ) {
int i = threadIdx.x + blockIdx.x * blockDim.x;
a[i] = blockDim.x;
}
__global__ void kernel( int* a ) {
int i = threadIdx.x + blockIdx.x * blockDim.x;
a[i] = threadIdx.x;
}
__global__ void kernel( int* a ) {
int i = threadIdx.x + blockIdx.x * blockDim.x;
a[i] = blockIdx.x;
}
__global__ void kernel( int* a ) {
int i = threadIdx.x + blockIdx.x * blockDim.x;
a[i] = i;
}
Threads Execute in Parallel
Threads can read a result before another thread writes to that address.
This can create a race condition.
Thread Synchronization via Explicit Barrier
Threads need to synchronize to avoid race conditions.
A barrier is a point where all threads stop and wait for others.
When all threads reach the barrier, they can proceed.
Barriers are implemented with
__syncthreads();
Syncthreads() Example
The purpose of synchronizing threads is to demonstrate how to safeguard against data races when cooperative execution requires multiple threads to access and modify shared data.
Thread-Memory Correspondence
Local Memory (and Registers): Private to its corresponding thread (thread lifetime).
Shared Memory: Every thread in the block has access (block lifetime).
Global Memory: Every thread in all grids has access (entire program lifetime).
Memory Model
Yellow rectangles represent Streaming Multiprocessors (SMs).
Memory located on SM is called "On-Chip" Device memory.
Memory not located on SM is called "Off-Chip" Device Memory.
Memory Speed
Relative speed of memory spaces:
Memory Space | Bandwidth | Latency |
|---|---|---|
Registers | ~8TB/s | ~1 clock |
Shared | ~1.5TB/s | ~32 clocks |
Global | ~200GB/s | ~800 clocks |
Host (PCIe) | ~5GB/s | N/A |
CUDA Device Memory Model
Device code can:
Read/Write per-thread registers.
Read/Write per-thread local memory.
Read/Write per-block shared memory.
Read/Write per-grid global memory.
Read-only per-grid constant memory.
Host code can:
Transfer data to/from per-grid global and constant memories.
CUDA Memory Efficiency
Simple CUDA kernels often achieve only a small fraction of the potential hardware speed, primarily due to the long access latencies associated with global memory (DRAM).
Key Factors:
Compute to global memory access (CGMA) ratio: This ratio is defined as the number of floating-point calculations performed for each access to the global memory.
To maximize CUDA memory efficiency, prioritize memory access patterns with a high compute-to-global memory access (CGMA) ratio. This means striving to perform more computations for each access to global memory, thereby increasing the overall efficiency of calculations. Implement optimizations that reduce the number of global memory accesses while maximizing the amount of computation using data already available in registers or shared memory.
Maximize CGMA ratio:
Achieve high execution speeds in kernels by maximizing the CGMA ratio.
CUDA Device Memory Types
The primary memory types in CUDA programming include global memory, constant memory, registers, and shared memory. Each serves a unique purpose and offers distinct advantages for different types of data access patterns.
Memory Type | Host Access | Device Access | Latency | Scope | Lifetime |
|---|---|---|---|---|---|
Global | Read/Write | Read/Write | High | Grid | Application |
Constant | Read/Write | Read-Only | Low (cached) | Grid | Application |
Registers | N/A | Read/Write | Very Low | Thread | Kernel |
Shared | N/A | Read/Write | Low | Block | Kernel |
Examples kernels
multiplyKernel_a<<<1,ha>>>(d_a, d_b, d_c, wa, wb);
__global__ void multiplyKernel_rowwise(int * a, int * b, int * c, int wa, int wb) {
int id = threadIdx.x;
int sum=0;
for(int i = 0; i < wb; i++) {
sum= 0;
for(int j = 0; j < wa; j++) {
sum += (a[id * wa + j] * b[j * wb + i]);
}
c[id * wb+ i] = sum;
}
}
multiplyKernel_b<<<1, wb>>>(d_a, d_b, d_c, ha,wa);
__global__ void multiplyKernel_colwise(int * a, int * b, int * c, int ha, int wa) {
int id = threadIdx.x;
int sum, i, j;
for(i = 0; i < ha; i++) {
sum = 0;
for( j = 0; j < wb; j++) {
sum += (a[i * wa + j] * b[j * wb + id]);
}
c[i * wb + id] =sum;
}
}
multiplyKernel_b<<<(1,1), (wb, ha)>>>(d_a, d_b, d_c, wa);
__global__ void multiplyKernel_elementwise(int * a, int * b, int * c, int wa) {
int rid = threadIdx.y;
int cid= threadIdx.x;
int sum, i, j;
sum = 0;
for( i = 0; i < wb; i++) {
sum += (a[rid * wa + i] * b[i * wb + cid]);
}
c[rid * wb + cid] =sum;
}
This comprehensive outline includes detailed explanations of CUDA concepts, memory management, and kernel execution, providing a thorough understanding of CUDA programming.