A comprehensive set of vocabulary cards covering High Performance Computing (HPC) fundamentals, hardware, parallel programming models, and scaling concepts.
Moore's Law
A historical trend of exponential growth in transistor density, in which the number of transistors on an integrated circuit doubled roughly every 18 to 24 months.
Dennard scaling
A scaling principle stating that reducing supply voltage in proportion to transistor dimensions keeps electric fields, and therefore power density, constant and preserves device behavior; it broke down once voltage scaling slowed.
Pollack's Rule
The observation that processor performance increases roughly with the square root of transistor count or area.
Flynn's taxonomy
A classification system for computer architectures based on instruction and data streams, including SISD, SIMD, MISD, and MIMD.
SIMD
Single Instruction, Multiple Data; a form of data-level parallelism involving lockstep execution across vector lanes.
Slurm
A cluster workload manager and job scheduler used to run intensive tasks on compute nodes rather than on login nodes; commands include sinfo, sbatch, squeue, scancel, and srun.
DRAM
Dynamic Random Access Memory; the type of memory used for the main memory of a typical computing node.
SRAM
Static Random Access Memory; the type of memory used for high-speed caches.
Double precision
A numerical format using 64 bits, equivalent to 8 bytes per value.
L1 Cache
The cache closest to the ALU, typically around 30 KB in size with a latency of roughly 4 cycles.
L3 Cache
A larger cache level, typically around 30 MB in size with a latency of roughly 50 cycles.
Latency
The response time measured from the moment a data request is made until the data arrives.
Bandwidth
The rate at which data is transferred or requests are satisfied; also known as throughput.
AVX512
An instruction set with 512-bit vector registers, capable of holding 8 double-precision or 16 single-precision values.
Spatial locality
A memory access pattern where addresses close to a recently accessed address are likely to be accessed soon afterwards.
Temporal locality
A memory access pattern where the same data is reused within a short period.
Race condition
A situation where the outcome of a program depends on the specific timing or interleaving of thread execution.
Atomic operation
An operation that appears indivisible to other threads and cannot be interrupted, preventing exposure of intermediate states.
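The two cards above can be illustrated with a shared counter. This is a minimal sketch, not a hardware atomic: Python's standard library does not expose an atomic add, so the example swaps in a `threading.Lock` to make the read-modify-write of the counter appear indivisible to other threads. The function name `count_with_lock` and the thread and iteration counts are hypothetical.

```python
import threading


def count_with_lock(n_threads: int, n_increments: int) -> int:
    """Increment a shared counter from several threads under a lock."""
    counter = 0
    lock = threading.Lock()

    def worker() -> None:
        nonlocal counter
        for _ in range(n_increments):
            # The lock makes the read-modify-write of `counter` appear
            # indivisible, so no thread observes an intermediate state.
            with lock:
                counter += 1

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter


total = count_with_lock(4, 10_000)
print(total)
```

Without the lock, the final value could depend on how the threads interleave, which is the race condition described above.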
MPI Rank
A unique identifier assigned to each process within an MPI communicator.
MPI_COMM_WORLD
The default MPI communicator containing all processes in the current run.
MPI_Barrier
A collective operation that forces all ranks in a communicator to wait until every rank has reached the barrier.
Amdahl's Law
A formula used to predict speedup on p processors based on the serial fraction f: $S_p = \frac{1}{f + \frac{1-f}{p}}$.
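As a worked illustration, Amdahl's Law for serial fraction f and p processors, S_p = 1 / (f + (1 - f) / p), can be evaluated directly. The sketch below assumes a hypothetical serial fraction of 5%; the function name `amdahl_speedup` is an illustrative choice.

```python
def amdahl_speedup(f: float, p: int) -> float:
    """Predicted speedup on p processors when a fraction f of the work is serial."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("serial fraction f must lie in [0, 1]")
    if p < 1:
        raise ValueError("processor count p must be at least 1")
    return 1.0 / (f + (1.0 - f) / p)


# Illustrative values only: 5% serial work on a growing processor count.
for p in (1, 8, 64, 1024):
    print(p, round(amdahl_speedup(0.05, p), 2))
```

However large p becomes, the speedup stays below 1/f, which is 20 for the assumed 5% serial fraction.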
Strong scaling
A measure of parallel efficiency where problem size is fixed and processors are increased to reduce execution time.
Weak scaling
A measure of parallel efficiency where problem size is increased proportionally to the number of processors to maintain fixed work per processor.
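Both scaling measures are usually reported as an efficiency between 0 and 1. The sketch below uses the common conventions, stated here as assumptions: strong-scaling efficiency is T1 / (p * Tp) for a fixed problem, and weak-scaling efficiency is T1 / Tp when the problem grows with p. The timings are hypothetical, not measurements.

```python
def strong_scaling_efficiency(t1: float, tp: float, p: int) -> float:
    """Fixed total problem size: ideal runtime on p processors is t1 / p."""
    return t1 / (p * tp)


def weak_scaling_efficiency(t1: float, tp: float) -> float:
    """Fixed work per processor: ideal runtime on p processors stays at t1."""
    return t1 / tp


# Hypothetical timings in seconds.
strong = strong_scaling_efficiency(t1=100.0, tp=15.0, p=8)
weak = weak_scaling_efficiency(t1=100.0, tp=110.0)
print(f"strong: {strong:.2f}, weak: {weak:.2f}")
```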
NVIDIA warp
A fixed group of 32 threads that execute together on a GPU.
`__global__`
A CUDA C++ qualifier for a GPU kernel that can be called from the host and returns void.
`__device__`
A CUDA C++ qualifier for functions that execute on the GPU and can only be called from device code, that is, from kernels or other `__device__` functions.
Unified Memory
A managed memory system (e.g., cudaMallocManaged) that is accessible from both CPU and GPU via implicit page migration.
GEMM
General Matrix Multiplication; a Level 3 BLAS operation highly optimized for cache locality.
Block-cyclic distribution
A method of distributing matrices across nodes that balances computational load while keeping local blocks for efficient BLAS operations.
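To make the mapping concrete, the sketch below computes which process owns each matrix element under a 2D block-cyclic layout. It assumes square blocks of size `nb`, a process grid of `prow` by `pcol`, and that the first block lands on process (0, 0); all names and sizes are illustrative.

```python
from collections import Counter


def owner(i: int, j: int, nb: int, prow: int, pcol: int) -> tuple[int, int]:
    """Process-grid coordinates owning global element (i, j)."""
    # Block indices are dealt out round-robin along each grid dimension.
    return ((i // nb) % prow, (j // nb) % pcol)


# Hypothetical case: 8x8 matrix, 2x2 blocks, 2x2 process grid.
n, nb, prow, pcol = 8, 2, 2, 2
load = Counter(owner(i, j, nb, prow, pcol) for i in range(n) for j in range(n))
print(dict(load))
```

Each process ends up with the same number of elements, held as contiguous `nb` by `nb` blocks, which is what keeps both the load balanced and the local BLAS calls efficient.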
Ghost atoms
Copies of atoms belonging to neighboring ranks used in MPI molecular dynamics to compute interactions across domain boundaries.
Kokkos
A C++ performance-portability library designed to map parallel execution patterns to different backends like CUDA or OpenMP.
Operational intensity
A metric defined as floating-point operations per byte of DRAM traffic after cache filtering.
Ridge point
The operational intensity on a roofline plot calculated as $\frac{\text{peak FLOP/s}}{\text{peak memory bandwidth}}$, representing the threshold between memory-bound and compute-bound regimes.
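The ridge point is peak FLOP/s divided by peak memory bandwidth, and the roofline itself is the smaller of the compute ceiling and the bandwidth ceiling at a given operational intensity. The sketch below assumes a hypothetical machine with 2 TFLOP/s peak compute and 200 GB/s peak DRAM bandwidth; these figures do not describe any real system.

```python
def ridge_point(peak_flops: float, peak_bandwidth: float) -> float:
    """Operational intensity (FLOP/byte) where the two roofline ceilings meet."""
    return peak_flops / peak_bandwidth


def attainable_flops(intensity: float, peak_flops: float, peak_bandwidth: float) -> float:
    """Roofline bound: limited by bandwidth below the ridge, by compute above it."""
    return min(peak_flops, intensity * peak_bandwidth)


# Hypothetical machine parameters.
PEAK_FLOPS = 2.0e12      # FLOP/s
PEAK_BANDWIDTH = 2.0e11  # bytes/s

ridge = ridge_point(PEAK_FLOPS, PEAK_BANDWIDTH)
for oi in (0.25, ridge, 40.0):
    regime = "memory-bound" if oi < ridge else "compute-bound"
    print(oi, attainable_flops(oi, PEAK_FLOPS, PEAK_BANDWIDTH), regime)
```

A kernel at 0.25 FLOP/byte sits far to the left of the ridge on this assumed machine, so its bound is set by memory bandwidth and not by peak compute.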