High-Performance Computing (HPC) and Parallel Programming

Description and Tags

Comprehensive flashcards covering High-Performance Computing (HPC) fundamentals, machine architecture, parallel decomposition, Amdahl's Law, computer architecture basics, process/thread management, concurrency principles, MPI programming, and CUDA fundamentals.

Last updated 10:50 PM on 6/27/26

30 Terms

1

What are the three core components of HPC defined in the lecture?

The core components are Hardware Systems (Multi-core CPUs, GPUs, and clusters), Software Tools (Compilers, libraries, debuggers), and Programming Paradigms (Parallel programming frameworks like MPI, OpenMP, and CUDA).

2

As of November 2025, which supercomputer is ranked #1 on the TOP500 list, and what is its country of origin?

El Capitan, United States.

3

What characterizes Exascale computing today?

Systems capable of $10^{18}$ calculations per second.

4

What are the two primary concerns in Interconnection Topology design for HPC clusters?

1. Minimizing the number of connections per node as the cluster scales. 2. Ensuring that communication time and bandwidth remain constant regardless of the number of processing nodes.

5

How does a Torus topology differ from a standard Mesh topology?

The Torus topology extends the mesh by connecting the edges of the grid, creating a wrap-around effect to reduce communication distance.
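
The wrap-around effect can be illustrated with a small sketch. It assumes a 2D grid, nodes addressed as (row, column) pairs, and hop count as a stand-in for communication distance; the function names are illustrative and not taken from the lecture.

```python
def mesh_distance(a, b):
    """Hop count between nodes a and b on a 2D mesh (no wrap-around links)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def torus_distance(a, b, rows, cols):
    """Hop count on a 2D torus: each dimension may be traversed either way around."""
    dr = abs(a[0] - b[0])
    dc = abs(a[1] - b[1])
    return min(dr, rows - dr) + min(dc, cols - dc)


rows, cols = 4, 4
corner_a, corner_b = (0, 0), (3, 3)
print(mesh_distance(corner_a, corner_b))                # 6 hops on the mesh
print(torus_distance(corner_a, corner_b, rows, cols))   # 2 hops on the torus
```

On this 4x4 grid the opposite corners are six hops apart on the mesh but only two on the torus, because each dimension can take the wrap-around link.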

6

Describe the connection requirements and communication time scaling for a Hypercube topology.

It requires $\log_2(n)$ connections per node, with communication times scaling up to $\log_2(n)$, where $n = 2^k$.
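
A minimal sketch of why both quantities are logarithmic, assuming the usual labelling in which each of the $2^k$ nodes gets a k-bit ID and two nodes are linked when their IDs differ in exactly one bit. The helper names are hypothetical.

```python
def hypercube_neighbors(node, k):
    """Neighbors of `node` in a k-dimensional hypercube: flip one bit at a time."""
    return [node ^ (1 << bit) for bit in range(k)]


def hypercube_hops(src, dst):
    """Minimum hops between two nodes = number of differing bits (Hamming distance)."""
    return bin(src ^ dst).count("1")


k = 3                                   # n = 2**k = 8 nodes
print(hypercube_neighbors(0b000, k))    # [1, 2, 4]: log2(8) = 3 links per node
print(hypercube_hops(0b000, 0b111))     # 3: the worst case equals log2(n)
```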

7

What is the purpose of a Fat-Tree topology in modern supercomputers?

It is a version of the tree topology with additional bandwidth in higher levels of the hierarchy to prevent bottlenecks and maintain consistent bandwidth across all bisections.

8

Match the levels of parallel granularity (Fine, Medium, Coarse) to their respective categories.

Fine-grained: SIMD (operations on variables); Medium-grained: MIMD shared memory (thread level); Coarse-grained: MIMD distributed memory (process level).

9

State the formula for Amdahl's Law speedup including communication overhead.

$S_{overall} = \frac{1}{(1 - P) + C + \frac{P}{N}}$, where $P$ is the parallelizable fraction, $C$ is the communications overhead, and $N$ is the speedup factor.
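
The formula can be evaluated directly. The sketch below treats $N$ as the number of processors and $C$ as a fraction of the original run time; both readings are assumptions, and the function name is illustrative.

```python
def amdahl_speedup(p, n, c=0.0):
    """Overall speedup 1 / ((1 - p) + c + p / n).

    p: parallelizable fraction (0..1)
    n: speedup factor of the parallel part (assumed here: processor count)
    c: communication overhead as a fraction of the original run time
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie between 0 and 1")
    if n <= 0:
        raise ValueError("n must be positive")
    return 1.0 / ((1.0 - p) + c + p / n)


print(round(amdahl_speedup(0.9, 10), 2))          # 5.26 with no overhead
print(round(amdahl_speedup(0.9, 10, c=0.01), 2))  # 5.0 with 1% overhead
```

Even with 90% of the work parallelized over 10 processors, the serial 10% limits the speedup to about 5.3, and a 1% communication cost lowers it further.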

10

What is the Stored-Program Architecture, and who developed it?

The concept of representing instructions as data stored in main memory alongside other data, developed independently by John von Neumann and Alan Turing.

11

What are the two fundamental internal states a computer system alternates between?

1. Instruction Fetch (retrieving the next instruction from memory) and 2. Instruction Execution (decoding and carrying out the operation).

12

How is speedup (SS) calculated for an N-stage pipeline compared to a non-pipelined version?

$S = \frac{t_{\text{non-pipelined}}}{t_{\text{pipelined}}}$, where for a large number of objects, the speedup ideally approaches an $N$-fold improvement ($S \approx N$).
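
The limit can be checked numerically. The sketch assumes the common idealized timing model, which the card does not state: every stage takes the same time, a non-pipelined unit needs stages x items stage-times, and a pipeline needs stages + items - 1 stage-times (fill the pipe once, then finish one item per step).

```python
def pipeline_speedup(stages, items, stage_time=1.0):
    """Ideal speedup of an N-stage pipeline (assumed model: equal stage times, no stalls)."""
    t_non_pipelined = items * stages * stage_time
    t_pipelined = (stages + items - 1) * stage_time
    return t_non_pipelined / t_pipelined


for items in (1, 10, 1000):
    print(items, round(pipeline_speedup(5, items), 2))
# 1 1.0
# 10 3.57
# 1000 4.98
```

With five stages, the speedup climbs toward 5 as the number of objects grows, matching $S \approx N$.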

13

What is the purpose of the write-through policy in multicore cache systems?

When a write occurs, the new value is written through to the shared L3 cache and all other copies in L1 and L2 are marked as stale, so that subsequent reads trigger a transfer from L3, which holds the most recent version of the data.

14

Define a 'Process' in the context of computer architecture.

An active instance of a program, characterized by its address space, processor context, I/O context, and execution state.

15

Contrast User-Level Threads (ULTs) and Kernel-Level Threads (KLTs).

ULTs are managed by user libraries without kernel involvement but can block the whole process if one thread blocks. KLTs are managed directly by the kernel, allowing other threads to run if one blocks and enabling parallel execution on multicore processors.

16

What is a 'Race Condition'?

A scenario occurring when multiple threads access and modify shared data concurrently, so the final outcome depends on the timing or order of execution.
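
A small sketch of the usual remedy: the read-modify-write on the shared counter is wrapped in a lock so that only one thread performs it at a time. It uses Python's standard threading module; the function name and thread counts are illustrative.

```python
import threading


def locked_count(n_threads=4, increments=10_000):
    """Increment a shared counter from several threads, guarded by a lock."""
    counter = 0
    lock = threading.Lock()

    def worker():
        nonlocal counter
        for _ in range(increments):
            with lock:          # serialize the read-modify-write
                counter += 1

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter


print(locked_count())  # 40000
```

Without the lock, two threads may read the same old value and each write back old + 1, so updates can be lost depending on how the threads interleave.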

17

List the four necessary conditions for Deadlock to occur.

1. Mutual Exclusion, 2. Hold and Wait (Waiting with Retention), 3. No Preemption (Non-Liberation), and 4. Circular Wait (Vicious Circle).

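
Because all four conditions are necessary, removing any one of them prevents deadlock. The sketch below removes circular wait by making every thread acquire locks in the same global order. Ordering by `id()` is just one convenient choice for this illustration, and the names are hypothetical.

```python
import threading

lock_a = threading.Lock()
lock_b = threading.Lock()
log = []


def use_both(name, first, second):
    """Acquire two locks in one global order, regardless of argument order."""
    ordered = sorted((first, second), key=id)   # the same order for every thread
    with ordered[0]:
        with ordered[1]:
            log.append(name)


# The two threads request the locks in opposite orders, the classic deadlock setup.
t1 = threading.Thread(target=use_both, args=("t1", lock_a, lock_b))
t2 = threading.Thread(target=use_both, args=("t2", lock_b, lock_a))
t1.start()
t2.start()
t1.join()
t2.join()
print(sorted(log))  # ['t1', 't2']
```

If each thread instead locked `first` and then `second` as passed, one could hold `lock_a` while waiting for `lock_b` and the other the reverse, satisfying all four conditions.
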
18

What is 'Aging' in the context of resource allocation?

A policy to prevent starvation (indefinite postponement) by gradually increasing the priority of a waiting process over time.
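
A toy model of aging, under assumptions that are not from the lecture: larger numbers mean higher priority, and in every scheduling round a fresh high-priority job competes with one long-waiting job whose priority is raised by a fixed step per round.

```python
def rounds_until_scheduled(low_priority, high_priority, aging_step, max_rounds=1000):
    """Return the round in which the waiting job finally wins, or None if it starves."""
    effective = low_priority
    for round_no in range(1, max_rounds + 1):
        if effective > high_priority:
            return round_no
        effective += aging_step     # aging: each round spent waiting raises priority
    return None


print(rounds_until_scheduled(1, 10, aging_step=1))  # 11
print(rounds_until_scheduled(1, 10, aging_step=0))  # None (starvation)
```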

19

Compare Blocking vs. Non-blocking synchronization in message exchange.

Blocking (synchronous) operations halt the process until the message is received/arrives. Non-blocking (asynchronous) operations return immediately, requiring processes to manage synchronization themselves.
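
The difference can be sketched with Python's standard `queue.Queue` standing in for a message channel. This is an analogy to illustrate the two behaviours, not MPI itself.

```python
import queue
import threading
import time

mailbox = queue.Queue()

# Non-blocking receive: returns at once, so the caller must handle "nothing yet".
try:
    mailbox.get_nowait()
    got_early = True
except queue.Empty:
    got_early = False


def sender():
    time.sleep(0.1)             # the message arrives a little later
    mailbox.put("hello")


threading.Thread(target=sender).start()

# Blocking receive: the caller is suspended until a message arrives.
message = mailbox.get()
print(got_early, message)       # False hello
```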

20

Define MPI as per the transcript.

MPI (Message Passing Interface) is a library specification (not a language) for the message-passing programming model where data is explicitly transferred between processes through a standardized API.

21

What are the two mandatory MPI instructions for starting and ending an MPI program?

The first instruction must be MPI_Init(&argc, &argv) and the last must be MPI_Finalize().

22

What information is contained in the MPI Message Header (Envelope)?

Communication context (communicator), Source Identification (rank), Destination Identification (rank), and Tag (integer label).

23

Distinguish between MPI_Scatter and MPI_Scatterv.

MPI_Scatter distributes uniform, distinct parts of a message from a root to all processes. MPI_Scatterv is an extended version that supports non-uniform message sizes using an array of displacements (displs).
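
The sketch below shows how the per-process counts and displacements for an uneven split might be computed. It is plain Python that imitates the layout arrays MPI_Scatterv takes (sendcounts and displs) and makes no MPI calls. The splitting rule, which gives the first ranks one extra element, is an assumed convention.

```python
def scatterv_layout(total, n_procs):
    """Counts and displacements for splitting `total` items over `n_procs` ranks."""
    base, extra = divmod(total, n_procs)
    counts = [base + 1 if rank < extra else base for rank in range(n_procs)]
    displs = [sum(counts[:rank]) for rank in range(n_procs)]   # start offset per rank
    return counts, displs


data = list(range(10))
counts, displs = scatterv_layout(len(data), 4)
chunks = [data[d:d + c] for c, d in zip(counts, displs)]
print(counts)  # [3, 3, 2, 2]
print(displs)  # [0, 3, 6, 8]
print(chunks)  # [[0, 1, 2], [3, 4, 5], [6, 7], [8, 9]]
```

With MPI_Scatter every rank would need the same count, so 10 elements could not be spread over 4 processes without padding.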

24

How does MPI_Reduce function?

It performs a global element-wise computation (e.g., MPI_SUM, MPI_MAX) on value arrays from all processes and delivers the final result to a root process.
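
"Element-wise" means that position i of the result combines position i of every process's array. A plain-Python imitation of that semantics follows; it makes no MPI calls, and the function name and sample data are illustrative.

```python
from functools import reduce
import operator


def simulate_reduce(per_process_values, op):
    """Combine the i-th elements of every process's array with `op`."""
    return [reduce(op, column) for column in zip(*per_process_values)]


values = [
    [1, 5, 2],   # array held by rank 0
    [4, 0, 7],   # array held by rank 1
    [3, 9, 6],   # array held by rank 2
]
print(simulate_reduce(values, operator.add))  # [8, 14, 15], like MPI_SUM
print(simulate_reduce(values, max))           # [4, 9, 7], like MPI_MAX
```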

25

What are the two routines MPI provides to monitor non-blocking operations?

MPI_Wait (blocks until completion) and MPI_Test (checks completion status without blocking).

26

Explain the role of MPI_Barrier.

A collective synchronization mechanism where all participating processes block until every process in the communicator has reached the barrier.

27

What is the execution configuration syntax for launching a CUDA kernel?

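kernel<<<numBlocks, threadsPerBlock>>>(arguments), where the values inside the triple angle brackets give the number of blocks in the grid and the number of threads per block.

28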

Define a 'Warp' in CUDA programming.

A warp is a batch of 32 threads that the GPU executes concurrently in SIMD/SIMT (Single Instruction, Multiple Threads) fashion.
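
Because threads are issued in groups of 32, block sizes are usually chosen as multiples of the warp size. The sketch below works out the launch arithmetic in plain Python. The ceiling-division rule for the block count is the common convention, assumed here, and the function name is hypothetical.

```python
WARP_SIZE = 32


def launch_shape(n_elements, threads_per_block):
    """Return (blocks needed, warps per block, idle lanes in the last warp of a block)."""
    blocks = (n_elements + threads_per_block - 1) // threads_per_block   # ceiling division
    warps_per_block = (threads_per_block + WARP_SIZE - 1) // WARP_SIZE
    idle_lanes = warps_per_block * WARP_SIZE - threads_per_block
    return blocks, warps_per_block, idle_lanes


print(launch_shape(1000, 256))  # (4, 8, 0)
print(launch_shape(1000, 100))  # (10, 4, 28)
```

A block of 100 threads still occupies four warps, leaving 28 lanes of the last warp unused, whereas 256 threads fill eight warps exactly.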

29

What is Unified Memory (Managed Memory) in CUDA?

A single coherent memory space shared between the CPU and GPU that provides automatic data migration and simplifies programming by removing the need for manual memory copying.

30

What are the restrictions for CUDA Kernels?

They must return void, can only access device memory, cannot use static variables, and cannot use function pointers. Kernel launches are asynchronous (control returns to the host immediately).