ADL Module 1 & 2

Last updated 1:22 PM on 6/10/26

71 Terms

New cards

What are the three main areas that use deep learning today?

Generative models, large language models, and computer vision.

New cards

What is a deep network?

A really big differentiable function, like f(x), made by stacking layers of simple functions. It ends up being one large computation graph, trained with gradient descent and backpropagation.

New cards

Linear layers

Matrix multiplication and convolution. These layers hold the weights and perform very simple computation.

New cards

Nonlinear layers

Normalization, activation, attention, pooling. These have very few trainable parameters and much more complex computation. They are what makes stacking worthwhile: without them, a stack of linear layers collapses into a single linear function.

New cards

MLP

Multi-layer perceptron. It is made by stacking a linear layer, then an activation, over and over again.
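
The stacking pattern can be sketched in a few lines of plain Python. This is only an illustration: the helper names (`linear`, `relu`, `mlp`), the ReLU activation, and the weights are assumptions chosen for the example, not something the card specifies.

```python
# Minimal MLP forward pass: linear -> activation -> linear.
# ReLU and the weights below are illustrative assumptions.

def linear(x, weight, bias):
    """Compute y = W x + b for a single input vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]

def relu(x):
    """Elementwise max(0, v)."""
    return [max(0.0, v) for v in x]

def mlp(x, layers):
    """Apply each (weight, bias) layer, with a ReLU after all but the last."""
    for i, (weight, bias) in enumerate(layers):
        x = linear(x, weight, bias)
        if i < len(layers) - 1:
            x = relu(x)
    return x

layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),  # 2 inputs -> 2 hidden units
    ([[1.0, 2.0]], [0.1]),                    # 2 hidden units -> 1 output
]
print(mlp([1.0, 2.0], layers))
```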

New cards

Multi head attention

A linear layer, followed by several single-head attention layers side by side, whose outputs are combined and fed into a final linear layer.

New cards

Transformer

Combines multi-head attention with an MLP: in each block, the multi-head attention feeds into the MLP.

New cards

Convolution

Used on image-like data, such as data laid out over time or space. You slide a small window (a lens) across the data and compute on only that area at a time, instead of trying to process the entire input at once.
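
A minimal sketch of the sliding window on 1D data, assuming no padding and a stride of 1. The function name `conv1d` is hypothetical, and like most deep learning libraries it does not flip the kernel.

```python
def conv1d(signal, kernel):
    """Slide the kernel across the signal (no padding, stride 1).

    Each output value only looks at len(kernel) neighbouring inputs.
    """
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

# A [-1, 1] kernel measures the local change between neighbours.
print(conv1d([1, 2, 4, 7], [-1, 1]))
```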

New cards

What are the three types of loss functions?

Regression, classification, and embeddings.

New cards

What are the three parts of training a deep network?

Dataset, architecture, and optimizer/training objective (loss calculation)

New cards

What is the most common optimizer to use?

Adam/AdamW. They keep extra per-parameter state (running estimates of the gradient and its square) and adapt the step size for each weight, which makes them the best option in most scenarios.

New cards

Within a warp what happens to the instructions?

Every thread in the warp executes the same instruction, issued by the same scheduler. The threads also share memory and L1 cache with the other threads in the same warp.

New cards

What is the issue with nearly unlimited compute?

Memory bandwidth. Moving data to and from memory, not the arithmetic, becomes the bottleneck and slows everything down.

New cards

Float 32

1 bit sign, 8 bit exponent, 23 bit mantissa. About 1e-7 relative precision, meaning (x2 - x1)/x1 is about 1e-7 for neighbouring representable values x1 < x2.

New cards

Float 16

1 bit sign, 5 bit exponent, 10 bit mantissa. Roughly 1e-3 relative precision (2^-10 is about 9.8e-4).

New cards

Bfloat 16

1 bit sign, 8 bit exponent, 7 bit mantissa. Roughly 1e-2 relative precision (2^-7 is about 7.8e-3).

New cards

Why would you want to use BFloat16 over Float16 when training deep networks

BFloat16 can represent a much wider range of values at the cost of precision: its maximum and minimum values are many more orders of magnitude apart. It is less precise than Float32 but covers the same range of values (same 8 bit exponent), and it takes half the memory to represent.
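
The range versus precision trade-off follows directly from the bit layouts. The sketch below derives the spacing just above 1.0 and the largest finite value from the exponent and mantissa widths, assuming IEEE-style formats where the all-ones exponent is reserved for infinity and NaN. The helper name `float_format` is hypothetical.

```python
def float_format(exponent_bits, mantissa_bits):
    """Return (relative spacing near 1.0, largest finite value)."""
    bias = 2 ** (exponent_bits - 1) - 1       # also the largest usable exponent
    eps = 2.0 ** -mantissa_bits               # gap between 1.0 and the next value
    max_finite = (2.0 - eps) * 2.0 ** bias
    return eps, max_finite

for name, e_bits, m_bits in [("float32", 8, 23), ("float16", 5, 10), ("bfloat16", 8, 7)]:
    eps, max_finite = float_format(e_bits, m_bits)
    print(f"{name:9s} eps={eps:.3e} max={max_finite:.3e}")
```

Float16 tops out at 65504, while bfloat16 reaches about the same maximum as float32 with a much coarser spacing.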

New cards

Underflows

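This is what happens when the gradient you are computing is smaller than the smallest value the floating point format can represent, so it rounds to zero. The solution is to use a gradient scaler: scale the loss up by multiplying it by a large factor such as 2^16, then divide the gradients by the same factor before the weight update.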
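
A small demonstration using Python's `struct` module, whose `"e"` format code rounds a number to IEEE half precision. The gradient value 1e-8 and the helper name `to_fp16` are illustrative assumptions, and the unscaling step is done here in ordinary Python floats.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest float16 value."""
    return struct.unpack("e", struct.pack("e", x))[0]

grad = 1e-8            # a tiny gradient, below float16's smallest positive value
scale = 2.0 ** 16      # loss scale factor

naive = to_fp16(grad)            # underflows to zero
scaled = to_fp16(grad * scale)   # survives in float16
recovered = scaled / scale       # unscale in higher precision

print(naive, scaled, recovered)
```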

New cards

What does auto cast do for operations that don't work well in low bit regimes?

It casts them up to full precision and then down again after execution

New cards

Approximately how much can low precision training reduce memory requirements?

About one fourth, reducing from 16 N bytes to approximately 12 N bytes

New cards

Why don't certain operations like Normalizations work well with F16 or BF16 precision in neural networks?

They lack the precision required to compute normalization factors properly

New cards

Data parallelism solution

Solution used when a single GPU is not enough to train the model on all of the data. A single server acts as the source of truth for the model and sends a copy of the model to each GPU, and each GPU works on a different part of the data. After each pass, each GPU synchronizes with the server by sending up its gradients, and the server sends the new weights back for the next pass.

New cards

Issues with data parallelism solution for large model training

Once you need more than about 8 GPUs, network speed becomes the limiting factor: synchronization with the server takes too long.

New cards

Modern data parallelism solution

Spread the data evenly across the GPUs. Instead of synchronizing to a server, the GPUs synchronize among themselves using hardware-supported operations. Namely, all-reduce sums a set of values from every GPU and stores the result on each one. That relieves the network load, because no large amount of data has to leave the machine.

One pitfall is that the optimizer state on each GPU can begin to diverge, especially if you use different GPUs or need to update the optimizer depending on some factors.
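
What all-reduce computes can be sketched by treating each GPU as a Python list. This only simulates the result; real implementations use a communication library and never gather everything in one place. The function name `all_reduce_sum` is hypothetical.

```python
def all_reduce_sum(per_gpu_values):
    """Sum the values elementwise across GPUs and give every GPU the result."""
    summed = [sum(vals) for vals in zip(*per_gpu_values)]
    return [list(summed) for _ in per_gpu_values]

# Two GPUs, each holding gradients computed from its own slice of the batch.
grads = [[1.0, 2.0], [3.0, 4.0]]
synced = all_reduce_sum(grads)
averaged = [[g / len(grads) for g in gpu] for gpu in synced]
print(synced, averaged)
```

Because every GPU ends up with identical gradients, each one can apply the same weight update locally.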

New cards

model parallelism solution

There are two options. Option 1 is to put each layer of the pipeline on its own GPU, so that each GPU feeds its layer's output into the next GPU.

Option 2 is to split individual layers across GPUs. This can get very difficult to implement.

New cards

Pipeline parallelism

Put each layer of the pipeline on its own GPU, so that each GPU feeds its layer's output into the next GPU.
You generally need to split the model by hand so that the computation is more or less evenly balanced.

Issues: bubbles. Each GPU has to work through the entire batch at once, so the next GPU has to wait for the previous one to finish before it can begin work, and it waits longer if the layers differ in speed. The first GPU cannot start on the next pass until the last GPU is done with its computations.
The remedy for bubbles is microbatches: the first GPU gets through a smaller portion of the data more quickly and then passes it on.
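
How much microbatches help can be estimated with a simple count. The sketch assumes an idealized schedule in which every stage takes one time slot per microbatch, so a pipeline with p stages and m microbatches needs m + p - 1 slots while each GPU is busy for only m of them. The function name `bubble_fraction` is hypothetical.

```python
def bubble_fraction(stages, microbatches):
    """Fraction of time a GPU sits idle in an idealized pipeline schedule."""
    total_slots = microbatches + stages - 1   # time until the last stage finishes
    busy_slots = microbatches                 # slots in which one GPU does work
    return 1.0 - busy_slots / total_slots

for m in (1, 4, 16, 64):
    print(m, round(bubble_fraction(stages=4, microbatches=m), 3))
```

With 4 stages and a single batch each GPU is idle 75% of the time; more microbatches shrink that idle share.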

New cards

momentum

The optimization technique that speeds up gradient descent by using the velocity of past gradients. The optimizer keeps a running accumulation of past gradients (the velocity) and steps along it; that stored velocity is the momentum.
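
A minimal sketch of the update on a one-dimensional problem. The loss w^2, the learning rate, and the decay factor `beta` are illustrative assumptions, and `sgd_momentum` is a hypothetical name.

```python
def sgd_momentum(grad_fn, w, lr=0.1, beta=0.9, steps=300):
    """Gradient descent where each step follows a decaying sum of past gradients."""
    velocity = 0.0
    for _ in range(steps):
        velocity = beta * velocity + grad_fn(w)  # keep most of the old velocity
        w = w - lr * velocity                    # step along the velocity
    return w

# Minimize loss(w) = w**2, whose gradient is 2*w.
w_final = sgd_momentum(lambda w: 2.0 * w, w=5.0)
print(w_final)
```

The velocity is one extra number per weight, which is why momentum appears in the list of things that use memory during training.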

New cards

ZeRO-2 optimizer state partitioning

Distribute the gradients (along with the optimizer state) across all of the GPUs. Each GPU still computes the full forward and backward pass. Requires synchronization through reduce-scatter for the gradients and all-gather for the updated weights.
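
The collectives involved can be simulated with plain lists. Reduce-scatter leaves each GPU holding the sum for only its own shard, and an all-gather afterwards gives every GPU the full result, the same outcome an all-reduce would produce. The function names are hypothetical, and the sketch assumes the vector length divides evenly by the number of GPUs.

```python
def reduce_scatter(per_gpu):
    """GPU r ends up with the elementwise sum of shard r only."""
    n = len(per_gpu)
    size = len(per_gpu[0]) // n
    return [[sum(gpu[i] for gpu in per_gpu)
             for i in range(r * size, (r + 1) * size)]
            for r in range(n)]

def all_gather(shards):
    """Every GPU receives the concatenation of all shards."""
    full = [v for shard in shards for v in shard]
    return [list(full) for _ in shards]

grads = [[1, 2, 3, 4], [10, 20, 30, 40]]   # two GPUs, four gradients each
shards = reduce_scatter(grads)
print(shards, all_gather(shards))
```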

New cards

ZeRO-3 optimizer state partitioning

Each GPU keeps only a subset of the weights and a subset of the momentum. Each time the weights are required, they are synchronized between all the GPUs.
Trains models about 50x larger than otherwise possible. The synchronization is the bottleneck.

New cards

gradient

The derivative of the loss function with respect to each weight is the gradient for that weight, so there are a lot of these. It points in the direction that increases the loss, so each weight moves in the opposite direction to reduce the loss in the next pass.
Imagine you're blindfolded on a hilly landscape and want to reach the lowest valley (minimum loss). The gradient tells you the slope beneath your feet, and you step in the direction that goes downhill. That's gradient descent.
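
The downhill walk can be sketched on a one-weight problem. The loss (w - 3)^2 and the step size 0.1 are assumptions picked for illustration.

```python
def loss(w):
    return (w - 3.0) ** 2

def grad(w):
    """Derivative of the loss with respect to w."""
    return 2.0 * (w - 3.0)

w = 0.0
for _ in range(100):
    w -= 0.1 * grad(w)   # step against the slope

print(w, loss(w))
```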

New cards

Fully sharded data parallelism

Similar to ZeRO-3 with more optimizations. Synchronizes the gradients in groups to be more efficient, and schedules communication so that its latency is hidden.
Uses hybrid sharding.

New cards

Hybrid sharding

Rather than fully sharding everything around one single source node, you have multiple source nodes. Normal nodes synchronize to their source node, and the source nodes then communicate with each other.

New cards

What uses up memory in the training?

weights, gradients, momentum, activations

New cards

freeze backbone

Stop updating all or some of the pretrained (backbone) parameters, which means that we are now only learning the last layer or last few layers used for classification.

New cards

Low rank adapters

LoRA: keep the weights frozen and learn an adapter AB. W is the pretrained model's weights, which stay frozen; AB is what changes. The original weights were something like 4N, whereas this is about N in size.

New cards

Adapter

An adapter is a small, trainable module inserted into a pre-trained model that lets you fine-tune the model for a specific task without updating the original weights.
Imagine a brilliant expert (the pre-trained model) who speaks only English. You need them to work in France. Instead of re-educating the expert (expensive, risky), you give them a small earpiece translator: it listens to what they're about to say, quietly adapts it for the French context, and passes it along.

New cards

How do you implement a LoRA adapter?

Add a linear layer as the LoRA adapter.

New cards

Which initialization method is recommended for the second lower layer in a Lora linear implementation?

All zeros

New cards

How can you incorporate Lora adapters into an existing network model in PyTorch easily?

Copy the definition of that model and replace "Linear" with "LoraLinear"

New cards

Why do we divide by the rank when scaling factors are applied to matrix multiplications in LoRA adapters?

To ensure hyperparameters like learning rate or momentum remain independent of the rank
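
The LoRA cards above can be tied together in a small plain-Python sketch: a frozen weight matrix, two small adapter matrices, a zero-initialized second matrix, and an output scaled by alpha divided by the rank. This is not the PyTorch `LoraLinear` the cards refer to; the class below, its attribute names (`down`, `up`), and the `alpha` parameter are illustrative assumptions.

```python
import random

def matvec(matrix, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, x)) for row in matrix]

class LoraLinear:
    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        out_dim, in_dim = len(weight), len(weight[0])
        self.weight = weight                                   # frozen, pretrained
        self.down = [[rng.gauss(0.0, 0.02) for _ in range(in_dim)]
                     for _ in range(rank)]                     # trainable, rank x in
        self.up = [[0.0] * rank for _ in range(out_dim)]       # trainable, all zeros
        self.scale = alpha / rank                              # divide by the rank

    def __call__(self, x):
        base = matvec(self.weight, x)
        update = matvec(self.up, matvec(self.down, x))
        return [b + self.scale * u for b, u in zip(base, update)]

layer = LoraLinear([[1.0, 2.0], [3.0, 4.0]], rank=1)
print(layer([1.0, 1.0]))   # identical to the frozen layer at initialization
```

Because the second matrix starts at zero, the adapted layer initially reproduces the pretrained layer exactly, and training moves it away from that starting point only as needed.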

New cards

integer scale quantization

The process of converting a model's high-precision floating point weights (e.g. float32) into lower-precision integers (e.g. int8) to reduce memory and speed up inference.
You take the range of float values you want to keep and map it onto an integer range. Each value is divided by a scale and rounded to an integer; you save the integers and the scale, and multiply by the scale to get approximate floats back.
Integer math is faster than float math and more memory efficient, but it is lower precision and results in a slight degradation of accuracy.
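
A minimal sketch of scale quantization to signed 8-bit integers, assuming the scale is chosen from the largest absolute value and that at least one value is nonzero. The function names are hypothetical.

```python
def quantize_scale(values, bits=8):
    """Map floats onto signed integers using one shared scale."""
    qmax = 2 ** (bits - 1) - 1                      # 127 for int8
    scale = max(abs(v) for v in values) / qmax      # assumes a nonzero maximum
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale

def dequantize_scale(q, scale):
    """Recover approximate floats from the integers and the scale."""
    return [qi * scale for qi in q]

weights = [-1.0, -0.5, 0.0, 0.3, 1.0]
q, scale = quantize_scale(weights)
restored = dequantize_scale(q, scale)
print(q, restored)
```

The round trip is not exact: each restored value can be off by up to half a scale step.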

New cards

Integer affine quantization.

The same as integer scale quantization, but it adds an offset to shift the integer grid so that it lines up with the values you need. This avoids wasting your integer range on values you don't use. It suffers if some of the weights are outliers.
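
A sketch of the offset idea, assuming unsigned 8-bit integers and using the minimum value as the offset (one of several ways to store the shift). The function names are hypothetical, and the sketch assumes the values are not all equal.

```python
def quantize_affine(values, bits=8):
    """Map the interval [min, max] onto the integers 0 .. 2**bits - 1."""
    levels = 2 ** bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels                      # assumes hi > lo
    q = [min(levels, max(0, round((v - lo) / scale))) for v in values]
    return q, scale, lo

def dequantize_affine(q, scale, offset):
    """Undo the mapping: integer times scale, shifted back by the offset."""
    return [qi * scale + offset for qi in q]

weights = [10.0, 10.25, 10.5, 10.75, 11.0]          # all far from zero
q, scale, offset = quantize_affine(weights)
restored = dequantize_affine(q, scale, offset)
print(q, restored)
```

For values clustered in [10, 11], a scale-only scheme would spend most of its integer range on the unused interval around zero, while the offset version uses all 256 levels where the values actually are.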

New cards

Blockwise quantization

Compute a separate scale for each block of weights rather than one shared scale that they all use. An outlier then only hurts the precision of its own block. Fixes issues from both the affine and normal scale quantization.
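
The effect of an outlier can be seen in a small sketch. The data, the block size of 4, and the function names are illustrative assumptions; using a block size equal to the full length reproduces a single shared scale for comparison.

```python
def quantize_blockwise(values, block_size, bits=8):
    """Quantize each block of values with its own scale."""
    qmax = 2 ** (bits - 1) - 1
    blocks = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        scale = max(abs(v) for v in block) / qmax or 1.0   # avoid a zero scale
        blocks.append(([round(v / scale) for v in block], scale))
    return blocks

def dequantize_blockwise(blocks):
    return [qi * scale for q, scale in blocks for qi in q]

weights = [0.01, -0.02, 0.03, 0.015, 100.0, 0.02, -0.01, 0.005]  # one outlier
per_block = dequantize_blockwise(quantize_blockwise(weights, block_size=4))
one_scale = dequantize_blockwise(quantize_blockwise(weights, block_size=8))
print(per_block[:4], one_scale[:4])
```

With one shared scale the outlier forces every small weight to round to zero, while the block without the outlier keeps its small weights almost exactly.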

New cards

double quantization

Quantize the quantization factors themselves with a second round of scale factors. This is like the dictionary of dictionaries used for pages in RAM. It saves space at the cost of extra compute.

New cards

Stochastic rounding

A randomized rounding method where, instead of always rounding to the nearest integer, you randomly round up or down with a probability proportional to how close the value is to each option.
Imagine a man standing at 2.8 on a number line, trying to decide whether to step to 2 or 3:

- He's much closer to 3, so he leans heavily toward 3 (probability 0.8)
- But there's still a small chance he stumbles back to 2 (probability 0.2)
- Over many steps, his average position is 2.8
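
The number-line picture translates into a few lines of Python. The function name `stochastic_round` and the seeded generator are assumptions made so the example is repeatable.

```python
import math
import random

def stochastic_round(x, rng):
    """Round up with probability equal to the fractional part of x."""
    lower = math.floor(x)
    fraction = x - lower
    return lower + (1 if rng.random() < fraction else 0)

rng = random.Random(0)
samples = [stochastic_round(2.8, rng) for _ in range(100_000)]
mean = sum(samples) / len(samples)
print(mean)   # close to 2.8, even though every sample is 2 or 3
```

Round-to-nearest would return 3 every time, so repeated small updates would be biased; stochastic rounding is unbiased on average.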

New cards

QLora

Stores a quantized model and then learns a LoRA adapter as a small addition to those weights. Requires a pretrained model.

New cards

What format are the original weights stored in during training with Q-LoRA?

Quantized format (4 or 8 bits)

New cards

How does Q-LoRA affect the memory requirement for training large models?

Reduces memory requirements significantly

New cards

What type of memory is used to store gradients and momentum in the LoRA adapter?

Float32 or BFloat16 format

New cards

QLoRA tradeoffs

It is only meant for very specific tasks, requires fine tuning, may require large rank R

New cards

Backpropagation

Used to compute gradients for every weight in the network by applying the chain rule of calculus backwards through the layers.

Must store every layer's gradient.

New cards

memory efficient backpropagation

Instead of storing all activations during the forward pass, you store only a subset of checkpoints and recompute the rest on demand during the backward pass.

- Divide the network into segments
- Only store activations at segment boundaries (checkpoints)
- During the backward pass, recompute activations within each segment from the nearest checkpoint

Analogy: in a video game, instead of saving your state after every single room (expensive), you save only at boss checkpoints. If you die mid-section, you replay from the last checkpoint. You do a bit more work but use far less storage.
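
The segment idea can be checked on a toy network. The sketch below uses a chain of scalar layers tanh(w * x) as a stand-in for a real network and compares the gradient from ordinary backpropagation with the one obtained from checkpoints plus recomputation. All function names, weights, and the segment length are illustrative assumptions.

```python
import math

def run_segment(x, weights):
    """Forward through tanh(w * x) layers, returning every activation."""
    acts = [x]
    for w in weights:
        acts.append(math.tanh(w * acts[-1]))
    return acts

def backward_segment(acts, weights, grad_out):
    """Chain rule through one segment, using that segment's activations."""
    for w, out in zip(reversed(weights), reversed(acts[1:])):
        grad_out *= w * (1.0 - out * out)    # d tanh(w*x)/dx = w * (1 - tanh^2)
    return grad_out

def grad_full(x, weights):
    """Ordinary backprop: keep every activation from the forward pass."""
    acts = run_segment(x, weights)
    return backward_segment(acts, weights, 1.0), len(acts)

def grad_checkpointed(x, weights, segment):
    """Keep only segment inputs, then recompute each segment when needed."""
    checkpoints, h = {}, x
    for start in range(0, len(weights), segment):
        checkpoints[start] = h
        h = run_segment(h, weights[start:start + segment])[-1]
    grad = 1.0
    for start in sorted(checkpoints, reverse=True):
        seg_w = weights[start:start + segment]
        acts = run_segment(checkpoints[start], seg_w)   # recomputation
        grad = backward_segment(acts, seg_w, grad)
    return grad, len(checkpoints)

ws = [0.5, -1.2, 0.8, 1.5, -0.7, 0.9]
print(grad_full(0.3, ws), grad_checkpointed(0.3, ws, segment=2))
```

Both calls return the same gradient, but the checkpointed version keeps 3 stored values between passes instead of 7, at the cost of running each segment's forward pass a second time.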

New cards

GaLore (gradient low rank projection)

Rather than saving memory on activations, it saves memory on the optimizer states (which in Adam can be 2x the model size due to the first and second moment estimates).
It keeps the optimizer state in a low-rank projection of the gradient and recomputes the projection every 200 steps.
Doesn't really work on multiple GPUs.

New cards

Q-GaLore

The same as GaLore but applying quantization to those optimizer states. The steps, in order:

- Project the full gradient to a low-rank subspace (the GaLore step)
- Quantize the low-rank gradient and optimizer states to low precision (e.g. int8)
- Run the optimizer in the quantized low-rank space
- Dequantize the update
- Project back to the full parameter space

New cards

What does Q-GaLore do to manage memory requirements for projections?

Stores them in lower bit quantization

New cards

What is a benefit of GaLore over other methods?

It allows training all weights with minimal memory usage

New cards

What is a potential issue when weights are quantized in GaLore?

Gradients smaller than the quantization range may cause weights not to update

New cards

Backward pass nonlinear layer backpropagation

You must store the activations from the forward pass because they will be needed on the backward pass.

New cards

Backprop recompute activation

We could choose to recompute the forward pass when we run backpropagation rather than store the activations.
Downside is that it is slower and everything needs to take

New cards

Activation checkpointing

Store only some activations and recompute the layers in between on demand, when the backward pass needs their activations.

New cards

CPU Offloading

In the case that the batch sizes are inconsistent, you can send the overflow of a batch to the CPU. Very unlikely but useful for some cases.

New cards

Intermediate/scratch memory

New cards

Attention

the mechanism that lets every token in a sequence look at every other token and decide how much to "pay attention" to it. For each token (the "query"), you compute a similarity score against all other tokens (the "keys"), run that through softmax to get weights, then use those weights to blend the "values" together into a new representation.
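
A minimal single-query sketch of those steps in plain Python. Dot products are used as the similarity score and are divided by the square root of the dimension, which is the common scaled dot-product convention and an assumption here since the card does not name a scoring function. The function names are hypothetical.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1 (max subtracted for stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Blend the values using weights from query-key similarity."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    blended = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return blended, weights

# Two identical keys get equal weight, so the output is the mean of the values.
out, weights = attend([1.0, 0.0], [[1.0, 0.0], [1.0, 0.0]], [[0.0, 2.0], [4.0, 6.0]])
print(out, weights)
```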

New cards

softmax

an activation function that converts a vector of raw scores (called logits) into a probability distribution.

New cards

Flash attention

Same as normal attention but uses tiling — it processes small blocks of Q, K, V that fit entirely in SRAM (fast on-chip memory), computes partial softmax results using a running numerically-stable trick, and never writes the full N×N matrix to HBM at all.
Can be tricky to keep the numerical stability
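
The running softmax trick can be sketched for a single query with scalar values. Scores are processed block by block while tracking a running maximum, a running denominator, and a running weighted sum, and earlier partial results are rescaled whenever the maximum grows. This is a plain-Python illustration of the arithmetic only, not the GPU kernel; the function names and block size are assumptions.

```python
import math

def attention_reference(scores, values):
    """Ordinary softmax-weighted sum over all scores at once."""
    m = max(scores)
    p = [math.exp(s - m) for s in scores]
    return sum(pi * v for pi, v in zip(p, values)) / sum(p)

def attention_tiled(scores, values, block):
    """Same result, visiting the scores one block at a time."""
    running_max = float("-inf")
    running_sum = 0.0     # softmax denominator so far
    acc = 0.0             # unnormalized weighted sum of values so far
    for start in range(0, len(scores), block):
        s_blk = scores[start:start + block]
        v_blk = values[start:start + block]
        new_max = max(running_max, max(s_blk))
        correction = math.exp(running_max - new_max)   # rescale old partial results
        p = [math.exp(s - new_max) for s in s_blk]
        running_sum = running_sum * correction + sum(p)
        acc = acc * correction + sum(pi * v for pi, v in zip(p, v_blk))
        running_max = new_max
    return acc / running_sum

scores = [1.0, 3.0, -2.0, 0.5, 10.0, 2.0]
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(attention_reference(scores, values), attention_tiled(scores, values, block=2))
```

Only one block of scores is alive at any time, which is what lets the real kernel avoid materializing the full score matrix.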

New cards

flash attention 2

it reorganized the loop order (iterates over Q in the outer loop, K/V in the inner), reducing non-matmul FLOPs by ~2×, better parallelizes across both the sequence length dimension and the batch/head dimensions, and handles causal masking by skipping masked tiles entirely. Result: roughly 2× speedup over FA1, reaching ~70% of theoretical GPU FLOP utilization.

New cards

flash attention 3

software pipelining: while one tile is being computed on tensor cores, the next tile's data is already being asynchronously fetched from HBM into SRAM via TMA. On older GPUs these were sequential. On H100 they overlap — hiding memory latency almost entirely. FA3 also uses a smarter two-stage softmax that keeps the pipeline full. Result: ~1.5–2× over FA2, reaching ~75% FP16 utilization and ~2.6× FP8 speedup on H100.

New cards

torch.compile

Compiles your PyTorch code into optimized kernels for the specific hardware you run on, speeding up both training and inference.

New cards

When using chunking, what happens to the number of kernel calls?

The number of kernel calls increases, because the operation is launched once per chunk

New cards

What is the term for dividing an operation into smaller parts to reduce intermediate memory usage?

chunking
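
A toy illustration of the memory effect, using a sum of squares as the operation. The intermediate list of squared values stands in for a temporary buffer; the function name and the chunk size are assumptions.

```python
def sum_of_squares_chunked(data, chunk_size):
    """Compute sum(x*x) while only ever holding chunk_size intermediates."""
    total = 0.0
    peak_intermediate = 0
    for start in range(0, len(data), chunk_size):
        squared = [x * x for x in data[start:start + chunk_size]]  # temporary buffer
        peak_intermediate = max(peak_intermediate, len(squared))
        total += sum(squared)
    return total, peak_intermediate

data = [float(i) for i in range(1000)]
print(sum_of_squares_chunked(data, chunk_size=1000))  # one pass, 1000 intermediates
print(sum_of_squares_chunked(data, chunk_size=100))   # same total, 100 intermediates
```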

New cards

If operations in the forward or backward pass are using too much additional memory, what should be considered?

Specialized implementations and torch.compile