CUDA Notes

CUDA – Compute Unified Device Architecture

CUDA stands for Compute Unified Device Architecture and is an extension of the C programming language created by NVIDIA. It allows programmers to utilize the parallel computing power of NVIDIA graphics cards for general-purpose computation.

Key Concepts

CUDA: A parallel computing platform and programming model by NVIDIA, implemented by their GPUs.
OpenCL: An open standard for programming CPUs, GPUs, and other devices from various vendors. It offers portability but may incur a performance penalty compared to CUDA.
Kernel: A piece of code that runs on the GPU.

CUDA vs. OpenCL

CUDA is proprietary to NVIDIA GPUs, while OpenCL is an open standard. OpenCL aims for portability across different hardware, but this generality can sometimes lead to reduced performance compared to CUDA, which is tailored specifically for NVIDIA's architecture.

CUDA's Massively Parallel Computation

CUDA enables massively parallel computations on graphics accelerators and was first available with NVIDIA's G8X line of graphics cards. Approximately 1 million CUDA-capable GPUs are shipped weekly, presenting a significant opportunity for developing widely deployed parallel applications.

Addressing the Power and Latency Walls

With the increasing challenges posed by the Power Wall and Latency Wall, CUDA provides a solution to prevent processor-intensive programs from slowing down. CUDA facilitates simulations of complex systems like networks of brain neurons and brings the potential of ubiquitous supercomputing to everyday computers.

CUDA Support and Architecture

CUDA is supported on NVIDIA's G8X and above graphics cards. The current CUDA GPU architecture is branded Tesla. 8-series GPUs offer 50-200 GFLOPS (floating-point operations per second).

CUDA Compilation

CUDA uses a set of extensions to ANSI C. CPU code is compiled by the host C compiler, while GPU code (kernels) is compiled by the CUDA compiler (cudacc). This process produces separate binaries.

The compilation process involves:
- Integrated Source (.c, .cu files)
- cudacc: Front End and Global Optimizer
- GPU Assembly/Kernel Code (.s files)
- Kernel Object Code Generator
- CPU Host Code (.c files)
- Host Compiler
- Kernel Object Code (.gpu files)
- Executable Host Binary (.o files)

CUDA Stack

The CUDA stack includes libraries (FFT, BLAS, etc.), example source code, the NVIDIA C Compiler, NVIDIA assembly, CPU host code for computing, the CUDA driver, debugger, profiler, a standard C compiler, and both GPU and CPU components.

Limitations of CUDA

Tesla GPUs do not fully support the IEEE standard for double-precision floating-point operations.
Code is only supported on NVIDIA hardware.
Recursive functions are not supported (though workarounds exist).
There is bus latency between the host CPU and the GPU.

Example: CUDA Speed vs. CPU Speed

This example demonstrates the performance difference between CPU and GPU using the PyTorch library.

import torch
import time

if torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")

print("using", device, "device")

matrix_size = 32 * 512
x = torch.randn((matrix_size, matrix_size))
y = torch.randn((matrix_size, matrix_size))

print("************ CPU SPEED ****************")
start = time.time()
result = torch.matmul(x, y)
print(time.time() - start)
print("verify device:", result.device)

x_gpu = x.to(device)
y_gpu = y.to(device)

torch.cuda.synchronize()
for i in range(3):
    print("************ GPU SPEED ****************")
    start = time.time()
    result_gpu = torch.matmul(x_gpu, y_gpu)
    torch.cuda.synchronize()
    print(time.time() - start)
    print("verify device:", result_gpu.device)

Results:

CPU Speed: $8.460786581039429$
GPU Speed (initial): $1.4088332653045654$
GPU Speed (subsequent): Approximately $0.26$ seconds

Data Parallelism - Vector Addition Example

Vector A: $A[0], A[1], A[2], …, A[N-1]$
Vector B: $B[0], B[1], B[2], …, B[N-1]$
Vector C: $C[0], C[1], C[2], …, C[N-1]$

CUDA/OpenCL - Execution Model

Illustrates a heterogeneous host+device application with serial parts in host C code and parallel parts in device SPMD kernel C code.

Serial Code (host)
Parallel Kernel (device)
KernelA<<< nBlk, nTid >>>(args);
Serial Code (host)
Parallel Kernel (device)
KernelB<<< nBlk, nTid >>>(args);

From Natural Language to Electrons

Describes the abstraction layers from natural language down to the physical execution by electrons:

Natural Language (e.g., English)
Algorithm
High-Level Language (C/C++…)
Compiler
Instruction Set Architecture
Microarchitecture
Circuits
Electrons

Instruction Set Architecture (ISA)

The ISA is a contract between the hardware and the software; it defines the set of instructions that the hardware can execute.

Von-Neumann Processor

Memory
Processing Unit
- PC
- Reg File
- ALU
- Control Unit
- IR
- I/O

Arrays of Parallel Threads

A CUDA kernel is executed by a grid (array) of threads.

All threads in a grid run the same kernel code (SPMD - Single Program, Multiple Data).
Each thread uses indexes to compute memory addresses and make control decisions.

i = blockldx.x * blockDim.x + threadIdx.x;
C[i] = A[i] + B[i];

Thread Blocks: Scalable Cooperation

Divide the thread array into multiple blocks.

Threads within a block cooperate via shared memory, atomic operations, and barrier synchronization.
Threads in different blocks do not interact.

i = blockldx.x * blockDim.x + threadIdx.x;
C[i] = A[i] + B[i];

Thread Indices

Each thread uses indices to decide what data to work on.

blockldx: 1D, 2D, or 3D (CUDA 4.0)
threadldx: 1D, 2D, or 3D
Simplifies memory addressing when processing multidimensional data (e.g., image processing, solving PDEs).

CUDA C Program Compilation

The host code in a CUDA C program is processed by a host C pre-processor and compiler/linker, while the device code is processed by the NVIDIA C compiler and a device JIT compiler, resulting in a heterogeneous computing platform.

Data Parallelism

Data parallelism involves performing many arithmetic operations simultaneously on data structures.

Matrix Multiplication: Each element of the product matrix $P$ is generated by performing a dot product between a row of input matrix $M$ and a column of input matrix $N$ .
For example, calculating $P[i, j]$ involves a dot product of the $i$ th row of $M$ and the $j$ th column of $N$ .
Since dot product operations for different elements of $P$ can be performed simultaneously, matrix multiplication exhibits high data parallelism.
For large matrices, the number of dot products can be very large. For instance, a $1000 \times 1000$ matrix multiplication involves $1,000,000$ independent dot products, each with 1000 multiply and 1000 accumulate operations.
By executing many dot products in parallel, a CUDA device can significantly accelerate the execution of matrix multiplication.

CUDA Program Structure

A CUDA program contains both host and device code. The NVIDIA C compiler (nvcc) separates the two during compilation.

Host code is compiled with standard C compilers and runs as a CPU process.
Device code (kernels) is compiled by nvcc and executed on a GPU device.
Kernels generate a large number of threads to exploit data parallelism.
CUDA programmers assume that threads are lightweight and take very few cycles to generate and schedule.
CUDA groups threads into warps, blocks, and grids.

CUDA Execution Flow

Execution starts on the host (CPU). When a kernel function is invoked (launched), execution moves to the device (GPU), where a large number of threads are generated.

All threads generated by a kernel during an invocation form a grid.
When all threads of a kernel complete, the grid terminates, and execution returns to the host.

Examples: Vector Addition

CPU Only:

void vecAdd(float* h_A, float* h_B, float* h_C, int n) {
    for (i = 0; i < n; i++)
        h_C[i] = h_A[i] + h_B[i];
}

int main() {
    float *h_A, *h_B, *h_C;
    int n;  // Specify size of Vectors
    h_A = (float*)malloc(n * sizeof(float));
    h_B = (float*)malloc(n * sizeof(float));
    h_C = (float*)malloc(n * sizeof(float));
    vecAdd(h_A, h_B, h_C, N);
}

CPU-GPU:

#include <cuda.h>
#include <cuda_runtime.h>

__global__ void vectorAdd(float*, float*, float*, int);

__global__ void vectorAdd(float *A, float *B, float* C, int n) {
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n)
        C[i] = A[i] + B[i];
}

void vecAdd(float * h_A, float *h_B, float* h_C, int n) {
    int size = n * sizeof(float);
    float *d_A=NULL, *d_B=NULL, *d_C=NULL;
    cudaError_t err = cudaSuccess;

    // Device Memory Allocation
    err = cudaMalloc((void**)&d_A, size);
    if (err != cudaSuccess) { ... }

    // Host to Device Data Transfer
    err = cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    if (err != cudaSuccess) { ... }

    // Kernel Launch
    int threadsPerBlock = 256;
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, n);
    err = cudaGetLastError();

    // Device to Host Memory Transfer
    err = cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) { ... }

    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
}

Device Memory Allocation and Data Transfer

**cudaMalloc((void *) &d_A, size):* Allocates memory on the GPU from global memory.
- The function expects a generic pointer (void **).
cudaMemcpy(dA, hA, size, cudaMemcpyHostToDevice): Transfers data from CPU to GPU memory.
- The pointer d_A cannot be dereferenced in host code.
cudaMemcpy (hC, dC, size, cudaMemcpyDeviceToHost): Transfers data from GPU to CPU memory.
- Data transfers can occur among different device memory locations as well.
- Transfers can also occur from host to host, but transfers are not allowed among different GPU devices.
cudaFree(d_A): Frees GPU global memory.

CUDA Kernel Launch

The CUDA kernel launches multiple threads in a 2-level hierarchy.
Example: vectorAdd <<<ceil(n/256), 256>>> (d_A, d_B, d_C, n)
- Specifies a grid of threads to be launched.
  - Organized hierarchically (number of blocks, number of threads per block).
  - All blocks contain the same number of threads (up to a maximum of 1024).
  - Blocks can be numbered as triplets (_, _, _).

Kernel-Specific System Variables

gridDim: Number of blocks in the grid.
- gridDim.x: Number of blocks in dimension x of a multi-dimensional grid.
blockDim: Number of threads per block.
- blockDim.x: Number of threads per block in dimension x of a multi-dimensional block.
blockIdx.x: Block number for a thread.
threadIdx.x: Thread number inside a block.

Function Declaration Keywords

CUDA uses keywords to define where functions are executed and from where they can be called.

Keyword	Executed on the	Only callable from the
`__device__`	device	device
`__global__`	device	host
`__host__`	host	host

Matrix Multiplication Example (Traditional C Code)

void MatrixMultiplication(float* M, float* N, float* P, int Width) {
    for (int i = 0; i < Width; ++i) {
        for (int j = 0; j < Width; ++j) {
            float sum = 0;
            for (int k = 0; k < Width; ++k) {
                float a = M[i * Width + k];
                float b = N[k * Width + j];
                sum += a * b;
            }
            P[i * Width + j] = sum;
        }
    }
}

Traditional Vector Addition (C Code), Revised CUDA Code

The example highlights the transformation from CPU-based code to CUDA-enabled code, involving memory allocation on the device, data transfer, and kernel launch.

#include <cuda.h>

void vecAdd(float* A, float* B, float* C, int n) {
    int size = n * sizeof(float);
    float *A_d, *B_d, *C_d;

    cudaMalloc((void**)&A_d, size);
    cudaMalloc((void**)&B_d, size);
    cudaMalloc((void**)&C_d, size);

    cudaMemcpy(A_d, A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(B_d, B, size, cudaMemcpyHostToDevice);

    int threadsPerBlock = 256;
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(A_d, B_d, C_d, n);

    cudaMemcpy(C, C_d, size, cudaMemcpyDeviceToHost);

    cudaFree(A_d);
    cudaFree(B_d);
    cudaFree(C_d);
}

Device Memories and Data Transfer

In CUDA, the host (CPU) and devices (GPUs) have separate memory spaces.

Devices have their own DRAM (Dynamic Random Access Memory).
cudaMalloc(): Allocates global memory on the device.
- The first parameter is the address of a pointer that will point to the allocated memory.
- The second parameter is the size of the object to be allocated (in bytes).
cudaFree(): Frees the allocated global memory specified by a pointer.
cudaMemcpy(): Transfers data between memories.
- Takes four parameters:
  1. Pointer to the destination location.
  2. Pointer to the source data.
  3. Number of bytes to be copied.
  4. Type of memory transfer (host-to-host, host-to-device, device-to-host, device-to-device).
- Cannot be used to copy between different GPUs in multi-GPU systems.

Kernel Functions and Threading

CUDA programming uses the Single-Program, Multiple-Data (SPMD) parallel programming style. The __global__ keyword indicates a CUDA kernel function that will execute on the device and can only be called from the host to generate a grid of threads.

Kernel Functions and Threading (Keywords)

Keyword	Executed on the	Only callable from the
`__device__`	device	device
`__global__`	device	host
`__host__`	host	host

The device keyword indicates that the function being declared is a CUDA device function. A device function executes on a CUDA device and can only be called from a kernel function or another device function. Device functions can have neither recursive function calls nor indirect function calls through pointers in them.

The host keyword indicates that the function being declared is a CUDA host function. A host function is simply a traditional C function that executes on the host and can only be called from another host function.

GPU Memory Spaces

Each thread has local memory (thread lifetime).
Each block has shared memory (block lifetime).
Global memory is accessible to all threads (persists indefinitely).
Constant and texture memory spaces are also available (persists indefinitely).

CUDA Execution Model

Host code uses the GPU device as a coprocessor.
- Transfers input data from the host to the device.
- Launches code on the device.
- Transfers output data from the device to the host.
Device code (kernel): A function executed by all threads (SIMT).
- Accesses device resources directly: special data types, memory types (constant, texture, shared, global).
- Interoperates with OpenGL and Direct3D graphics APIs.

Thread Index

Each thread has a unique thread index, accessible via the built-in threadIdx variable.

Built-In Variables

threadIdx: Index of a thread.
blockIdx: Index of a block.
gridDim: Dimension of a grid.
blockDim: Dimension of a block.

Indexing Within Grid

threadIdx is unique only within its thread block.
To determine the unique grid index of a thread:
```
i = threadIdx.x + blockIdx.x * blockDim.x;
```

Examples of Kernel Functions

__global__ void kernel( int* a ) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    a[i] = blockDim.x;
}

__global__ void kernel( int* a ) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    a[i] = threadIdx.x;
}

__global__ void kernel( int* a ) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    a[i] = blockIdx.x;
}

__global__ void kernel( int* a ) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    a[i] = i;
}

Threads Execute in Parallel

Threads can read a result before another thread writes to that address.
This can create a race condition.

Thread Synchronization via Explicit Barrier

Threads need to synchronize to avoid race conditions.
A barrier is a point where all threads stop and wait for others.
When all threads reach the barrier, they can proceed.
Barriers are implemented with __syncthreads();

Syncthreads() Example

The purpose of synchronizing threads is to demonstrate how to safeguard against data races when cooperative execution requires multiple threads to access and modify shared data.

Thread-Memory Correspondence

Local Memory (and Registers): Private to its corresponding thread (thread lifetime).
Shared Memory: Every thread in the block has access (block lifetime).
Global Memory: Every thread in all grids has access (entire program lifetime).

Memory Model

Yellow rectangles represent Streaming Multiprocessors (SMs).
Memory located on SM is called "On-Chip" Device memory.
Memory not located on SM is called "Off-Chip" Device Memory.

Memory Speed

Relative speed of memory spaces:

Memory Space	Bandwidth	Latency
Registers	~8TB/s	~1 clock
Shared	~1.5TB/s	~32 clocks
Global	~200GB/s	~800 clocks
Host (PCIe)	~5GB/s	N/A

CUDA Device Memory Model

Device code can:
- Read/Write per-thread registers.
- Read/Write per-thread local memory.
- Read/Write per-block shared memory.
- Read/Write per-grid global memory.
- Read-only per-grid constant memory.
Host code can:
- Transfer data to/from per-grid global and constant memories.

CUDA Memory Efficiency

Simple CUDA kernels often achieve only a small fraction of the potential hardware speed, primarily due to the long access latencies associated with global memory (DRAM).

Key Factors:

Compute to global memory access (CGMA) ratio: This ratio is defined as the number of floating-point calculations performed for each access to the global memory.
- To maximize CUDA memory efficiency, prioritize memory access patterns with a high compute-to-global memory access (CGMA) ratio. This means striving to perform more computations for each access to global memory, thereby increasing the overall efficiency of calculations. Implement optimizations that reduce the number of global memory accesses while maximizing the amount of computation using data already available in registers or shared memory.
Maximize CGMA ratio:
- Achieve high execution speeds in kernels by maximizing the CGMA ratio.

CUDA Device Memory Types

The primary memory types in CUDA programming include global memory, constant memory, registers, and shared memory. Each serves a unique purpose and offers distinct advantages for different types of data access patterns.

Memory Type	Host Access	Device Access	Latency	Scope	Lifetime
Global	Read/Write	Read/Write	High	Grid	Application
Constant	Read/Write	Read-Only	Low (cached)	Grid	Application
Registers	N/A	Read/Write	Very Low	Thread	Kernel
Shared	N/A	Read/Write	Low	Block	Kernel

Examples kernels

multiplyKernel_a<<<1,ha>>>(d_a, d_b, d_c, wa, wb);
__global__ void multiplyKernel_rowwise(int * a, int * b, int * c, int wa, int wb) {
    int id = threadIdx.x;
    int sum=0;

    for(int i = 0; i < wb; i++) {
        sum= 0;
        for(int j = 0; j < wa; j++) {
            sum += (a[id * wa + j] * b[j * wb + i]);
        }
        c[id * wb+ i] = sum;
    }
}

multiplyKernel_b<<<1, wb>>>(d_a, d_b, d_c, ha,wa);
__global__ void multiplyKernel_colwise(int * a, int * b, int * c, int ha, int wa) {
    int id = threadIdx.x;
    int sum, i, j;

    for(i = 0; i < ha; i++) {
        sum = 0;
        for( j = 0; j < wb; j++) {
            sum += (a[i * wa + j] * b[j * wb + id]);
        }
        c[i * wb + id] =sum;
    }
}

multiplyKernel_b<<<(1,1), (wb, ha)>>>(d_a, d_b, d_c, wa);
__global__ void multiplyKernel_elementwise(int * a, int * b, int * c, int wa) {
    int rid = threadIdx.y;
    int cid= threadIdx.x;
    int sum, i, j;

    sum = 0;
    for( i = 0; i < wb; i++) {
        sum += (a[rid * wa + i] * b[i * wb + cid]);
    }
    c[rid * wb + cid] =sum;
}

This comprehensive outline includes detailed explanations of CUDA concepts, memory management, and kernel execution, providing a thorough understanding of CUDA programming.