What are the three main areas that use deep learning today?
Generative models, large language models, and computer vision.
What is a deep network?
A very large differentiable function f(x), built by stacking layers of simple functions into a computation graph. Deep networks are trained with gradient descent and backpropagation.
Linear layers
Matrix multiplication and convolution. These layers hold the trainable weights and perform very simple computation.
Nonlinear layers
Normalization, activation, attention, pooling. These have very few trainable parameters but much more complex computation. Without them, a stack of linear layers would collapse into a single linear layer.
MLP
Multi-layer perceptron. Made by stacking a linear layer followed by an activation, over and over again.
Multi head attention
A linear layer feeds a set of side-by-side single-head attention layers, whose outputs are combined and fed into a final linear layer.
Transformer
Combines a multi-head attention layer that feeds into an MLP, usually with residual connections and normalization around each.
Convolution
Used on image-like data, such as data laid out over time or space. You slide a small window across the input and compute on only that local area, instead of trying to process the entire input at once.
What are the three types of loss functions?
Regression, classification, and embeddings.
What are the three parts of training a deep network?
Dataset, architecture, and optimizer/training objective (loss calculation)
What is the most common optimizer to use?
Adam/AdamW. They keep per-parameter first and second moment estimates of the gradient, which makes them a strong default in most scenarios.
Within a warp what happens to the instructions?
Every thread in the warp executes the same instruction, issued by the same scheduler. The threads also share the shared memory and L1 cache of their streaming multiprocessor.
What is the issue with nearly unlimited compute?
Memory bandwidth. Memory becomes the bottleneck and slows the computation down.
Float 32
1 bit sign, 8 bit exponent, 23 bit mantissa. About 1e-7 relative precision, meaning (x2 - x1)/x1 for adjacent representable values x1 < x2.
Float 16
1 bit sign, 5 bit exponent, 10 bit mantissa. About 1e-3 relative precision.
Bfloat 16
1 bit sign, 8 bit exponent, 7 bit mantissa.
Why would you want to use BFloat16 over Float16 when training deep networks
BFloat16 can represent a much wider range of values than Float16 at the cost of precision. Its maximum and minimum values are many more orders of magnitude apart. It is less precise than Float32 but covers the same range of values, while taking half the memory.
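The trade-off between range and precision follows directly from the bit layouts on these cards. The sketch below derives both from the exponent and mantissa widths, assuming IEEE-style normalized numbers; the function name `describe` and the dictionary keys are made up for illustration.

```python
# Bit layouts from the cards: (exponent bits, mantissa bits).
FORMATS = {
    "float32": (8, 23),
    "float16": (5, 10),
    "bfloat16": (8, 7),
}

def describe(exp_bits: int, man_bits: int) -> dict:
    """Derive precision and range of an IEEE-style format from its bit layout."""
    bias = 2 ** (exp_bits - 1) - 1
    eps = 2.0 ** -man_bits                 # relative gap between neighbours near 1.0
    return {
        "eps": eps,
        "max": (2.0 - eps) * 2.0 ** bias,  # largest finite value
        "min_normal": 2.0 ** (1 - bias),   # smallest positive normal value
    }

for name, (e, m) in FORMATS.items():
    info = describe(e, m)
    print(f"{name:9s} eps={info['eps']:.1e} "
          f"max={info['max']:.3e} min_normal={info['min_normal']:.1e}")
```

BFloat16 shares Float32's 8 exponent bits, so the two formats have the same smallest normal value and nearly the same maximum, while Float16 tops out at 65504.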
Underflows
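This is what happens when the gradient you are computing is smaller than the smallest value the floating point format can represent, so it rounds to zero. The solution is a gradient scaler: multiply the loss by a large factor such as 2^16 before the backward pass, then divide the gradients by the same factor before the optimizer step.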
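A small sketch of why scaling helps, using Python's `struct` module to round values to float16. The gradient value `1e-8` and the helper name `to_float16` are hypothetical, chosen only to show the effect. Scaling the loss scales every gradient by the same factor, because backpropagation is linear in the loss.

```python
import struct

def to_float16(x: float) -> float:
    """Round a Python float to the nearest float16 value and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

SCALE = 2.0 ** 16          # loss scale from the card
tiny_grad = 1e-8           # hypothetical gradient, below float16's smallest value

unscaled = to_float16(tiny_grad)           # underflows to zero
scaled = to_float16(tiny_grad * SCALE)     # survives in float16
recovered = scaled / SCALE                 # unscale in higher precision

print(unscaled, scaled, recovered)
```

The unscaled gradient is lost entirely, while the scaled one comes back within float16's relative precision of the true value.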
What does auto cast do for operations that don't work well in low bit regimes?
It casts them up to full precision and then down again after execution
Approximately how much can low precision training reduce memory requirements?
About one fourth, reducing from 16 N bytes to approximately 12 N bytes
Why don't certain operations like Normalizations work well with F16 or BF16 precision in neural networks?
They lack the precision required to compute normalization factors properly
Data parallelism solution
Used when one GPU cannot process the whole batch fast enough. A single server holds the model as the source of truth and sends a copy of the weights to each GPU, and each GPU works on a different part of the data. After each pass, every GPU sends its gradients up to the server, which updates the weights and sends the new weights back for the next pass.
Issues with data parallelism solution for large model training
Once you need more than 8 gpus, the network speed becomes too much of a limiting factor. Synchronization takes too long.
Modern data parallelism solution
Spread the data evenly across the GPUs. Instead of synchronizing to a server, the GPUs synchronize among themselves using hardware-supported operations. Namely, all-reduce sums a set of values across the GPUs and stores the result on each one. That solves the network load, because no large amount of data leaves the machine.
One pitfall is that the optimizer state on each GPU can begin to diverge, especially if you use different GPUs or need to update the optimizer conditionally.
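To make the all-reduce step concrete, here is a toy simulation in plain Python where each "GPU" is just a list of gradient values. The function name `all_reduce_sum` and the gradient numbers are illustrative assumptions; real training would use a collective-communication library rather than this loop.

```python
def all_reduce_sum(per_gpu_values):
    """Sum the vectors held by each simulated GPU and give every GPU the result."""
    total = [sum(column) for column in zip(*per_gpu_values)]
    return [list(total) for _ in per_gpu_values]

# Hypothetical gradients for a 3-parameter model on 4 simulated GPUs.
grads = [
    [0.1, 0.2, 0.3],
    [0.0, 0.2, 0.1],
    [0.3, 0.0, 0.1],
    [0.2, 0.2, 0.1],
]
synced = all_reduce_sum(grads)

# Dividing by the GPU count turns the summed gradient into the batch average.
averaged = [[g / len(grads) for g in gpu] for gpu in synced]
print(averaged[0])
```

After the call every GPU holds the same summed gradient, so each one can apply an identical optimizer step without a central server.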
model parallelism solution
There are two options. Option 1 is to put each layer of the pipeline on its own GPU, so that each GPU feeds its output into the next GPU.
Option 2 is to split individual layers across GPUs. This can get very difficult to implement.
Pipeline parallelism
Put each layer of the pipeline on its own GPU, so that each GPU feeds its output into the next GPU.
You generally need to split the model by hand so that the computation is more or less evenly balanced.
Issues: bubbles. One GPU needs to hold the entire batch at once, and if the layers differ in speed then the next GPU has to wait for the previous one to finish before it can begin work. The first GPU cannot start the next pass until the last GPU is done with its computations.
The bubble remedy is microbatches: the first GPU gets through a smaller portion of the data more quickly and then passes it on.
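A rough way to see why microbatches shrink the bubble: under the simplifying assumption that every stage takes one time slot per microbatch, a pipeline with p stages and m microbatches needs m + p - 1 slots, of which each GPU is idle for p - 1. The function names below are hypothetical, and the schedule covers only the forward pass.

```python
def bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    """Fraction of time slots a GPU sits idle in a simple pipeline schedule.

    Assumes every stage takes one time slot per microbatch.
    """
    total_slots = num_microbatches + num_stages - 1
    return (num_stages - 1) / total_slots

def forward_schedule(num_stages: int, num_microbatches: int):
    """Return schedule[t][s] = microbatch index on stage s at time t, or None if idle."""
    total_slots = num_microbatches + num_stages - 1
    return [
        [t - s if 0 <= t - s < num_microbatches else None for s in range(num_stages)]
        for t in range(total_slots)
    ]

for m in (1, 4, 12):
    print(f"4 stages, {m:2d} microbatches: idle fraction {bubble_fraction(4, m):.2f}")
```

With a single batch on four GPUs, three quarters of the slots are idle; splitting it into twelve microbatches brings that down to one fifth.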
momentum
The optimization technique that speeds up gradient descent by keeping a running accumulation of past gradients, known as the velocity, and stepping along the velocity instead of the raw gradient.
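A minimal sketch of the idea on a one-dimensional toy loss. The update form `velocity = beta * velocity + gradient` is one common convention, and the learning rate, `beta`, and step count are arbitrary illustrative choices.

```python
def sgd_momentum(grad_fn, w, lr=0.1, beta=0.9, steps=300):
    """Minimize a 1-D function with gradient descent plus momentum."""
    velocity = 0.0
    for _ in range(steps):
        g = grad_fn(w)
        velocity = beta * velocity + g   # accumulate past gradients
        w = w - lr * velocity            # step along the velocity, not the raw gradient
    return w

# Toy loss (w - 3)^2 with gradient 2 * (w - 3); its minimum is at w = 3.
w_final = sgd_momentum(lambda w: 2.0 * (w - 3.0), w=0.0)
print(w_final)
```

Setting `beta=0.0` recovers plain gradient descent, which makes it easy to compare the two on the same loss.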
ZeRO-2 optimizer state partitioning
Distribute the gradients (and the optimizer state) across all of the GPUs. Each GPU still computes the full forward and backward pass. Requires synchronization through reduce-scatter for the gradients and all-gather for the updated weights.
ZeRO-3 optimizer state partitioning
Each GPU keeps only a subset of the weights and a subset of the momentum. Each time the weights are required, they are synchronized between all the GPUs.
Trains models about 50x larger than otherwise possible. The synchronization is the bottleneck.
gradient
The derivative of the loss function with respect to each weight, so there is one value per weight and there are a lot of them. It points in the direction that increases the loss fastest, so each weight moves in the opposite direction to reduce the loss on the next pass.
Imagine you're blindfolded on a hilly landscape and want to reach the lowest valley (minimum loss). The gradient tells you the slope beneath your feet — you step in the direction that goes downhill. That's gradient descent.
Fully sharded data parallelism
Similar to ZeRO-3 with more optimizations. Synchronizes the gradients in groups to be more efficient, and schedules communication so that its latency is hidden behind computation.
Uses hybrid sharding.
Hybrid sharding
Rather than fully sharding everything around one single source node, you have multiple source nodes. Normal nodes synchronize to a source node, and the source nodes then communicate with each other.
What uses up memory in the training?
weights, gradients, momentum, activations
freeze backbone
Stop updating all or some of the backbone parameters (the early layers), which means we are now only learning the last layer or last few layers, such as the classification head.
Low rank adapters
LoRA: keep the weights frozen and learn an adapter AB. W holds the pretrained model weights and stays frozen; the low-rank product AB is added to it and is the only part that changes. The original weights were something like 4N in size, whereas this is about N.
Adapter
an adapter is a small, trainable module inserted into a pre-trained model that allows you to fine-tune the model for a specific task without updating the original weights
Imagine a brilliant expert (the pre-trained model) who speaks only English. You need them to work in France. Instead of re-educating the expert (expensive, risky), you give them a small earpiece translator — it listens to what they're about to say, quietly adapts it for the French context, and passes it along.
How do you implement a LoRA adapter?
Add a linear layer as the LoRA adapter.
Which initialization method is recommended for the second low-rank layer in a LoRA linear implementation?
All zeros
How can you incorporate Lora adapters into an existing network model in PyTorch easily?
Copy the definition of that model and replace "Linear" with "LoraLinear"
Why do we divide by the rank when scaling factors are applied to matrix multiplications in LoRA adapters?
To ensure hyperparameters like learning rate or momentum remain independent of the rank
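The LoRA cards above can be tied together in a small pure-Python sketch. The class name `LoraLinear` comes from the card; the attribute names `down` and `up`, the `alpha` parameter, and the tiny random initialization are assumptions for illustration. A real implementation would use a tensor library and mark the base weight as non-trainable.

```python
import random

def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

class LoraLinear:
    """Frozen weight W plus a trainable low-rank update scaled by alpha / rank."""

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        out_dim, in_dim = len(weight), len(weight[0])
        self.weight = weight                              # frozen pretrained weights
        # First low-rank layer: in_dim -> rank, small random values.
        self.down = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)] for _ in range(rank)]
        # Second low-rank layer: rank -> out_dim, all zeros so the adapter starts as a no-op.
        self.up = [[0.0] * rank for _ in range(out_dim)]
        self.scale = alpha / rank                         # keeps tuning independent of rank

    def forward(self, x):
        base = matvec(self.weight, x)
        update = matvec(self.up, matvec(self.down, x))
        return [b + self.scale * u for b, u in zip(base, update)]

layer = LoraLinear([[1.0, 2.0], [3.0, 4.0]], rank=1)
print(layer.forward([1.0, 1.0]))
```

Because the second layer starts at zero, the wrapped layer initially reproduces the pretrained output exactly, and training only has to learn the difference.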
integer scale quantization
The process of converting a model's high-precision floating point weights (e.g. float32) into lower-precision integers (e.g. int8) to reduce memory and speed up inference.
You take the range of float values you want to keep and map it onto an integer range. Each value is multiplied by the scale and rounded, and you save the integers together with the scale.
Integer arithmetic is faster than float arithmetic and more memory efficient, but the lower precision results in a slight degradation in accuracy.
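A minimal sketch of scale quantization to int8, assuming a symmetric range based on the largest absolute weight. The function names and the example weights are made up for illustration.

```python
def quantize_scale(values, bits=8):
    """Symmetric quantization: store one float scale plus small integers."""
    qmax = 2 ** (bits - 1) - 1                    # 127 for int8
    max_abs = max(abs(v) for v in values)
    scale = qmax / max_abs if max_abs > 0 else 1.0
    ints = [round(v * scale) for v in values]     # multiply by the scale, keep the integer
    return ints, scale

def dequantize_scale(ints, scale):
    """Recover approximate floats from the stored integers and scale."""
    return [i / scale for i in ints]

weights = [0.5, -1.0, 0.25, 0.0]                  # hypothetical float weights
ints, scale = quantize_scale(weights)
restored = dequantize_scale(ints, scale)
print(ints, restored)
```

Each restored value is off by at most half an integer step, which is the precision loss the card refers to.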
Integer affine quantization.
The same as integer scale quantization, but it adds an offset that shifts the integer grid to line up with the values you need. This avoids wasting your integer range on values you don't use. It suffers if some of the weights are outliers.
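A sketch of the affine variant, assuming the offset is simply the smallest weight and the integers are unsigned. Function names and weights are illustrative.

```python
def quantize_affine(values, bits=8):
    """Affine quantization: map [min, max] onto [0, 2^bits - 1] with a scale and offset."""
    qmax = 2 ** bits - 1                          # 255 for uint8
    lo, hi = min(values), max(values)
    scale = qmax / (hi - lo) if hi > lo else 1.0
    ints = [round((v - lo) * scale) for v in values]
    return ints, scale, lo

def dequantize_affine(ints, scale, offset):
    """Recover approximate floats from the stored integers, scale, and offset."""
    return [i / scale + offset for i in ints]

# Hypothetical weights that are all positive: a symmetric grid would waste half its range.
weights = [2.0, 2.5, 3.0, 2.25]
ints, scale, offset = quantize_affine(weights)
restored = dequantize_affine(ints, scale, offset)
print(ints, restored)
```

Here the full 0 to 255 range covers only the interval from 2.0 to 3.0, so the step size is far smaller than a grid centered on zero would give.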
Blockwise quantization
Compute a separate scale for each block of weights rather than one global scale that they all share. This limits the damage an outlier can do to its own block, which fixes issues with both affine and plain scale quantization.
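A short sketch of blockwise quantization, assuming symmetric int8 within each block. The block size and weights are hypothetical, picked so that one block contains an outlier.

```python
def quantize_blockwise(values, block_size=4, bits=8):
    """Quantize each block of values with its own scale."""
    qmax = 2 ** (bits - 1) - 1
    blocks = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        max_abs = max(abs(v) for v in block)
        scale = qmax / max_abs if max_abs > 0 else 1.0
        blocks.append(([round(v * scale) for v in block], scale))
    return blocks

def dequantize_blockwise(blocks):
    """Recover approximate floats block by block."""
    return [i / scale for ints, scale in blocks for i in ints]

# Hypothetical weights: the outlier 100.0 only hurts the precision of its own block.
weights = [0.01, -0.02, 0.03, 0.04, 100.0, 1.0, -2.0, 3.0]
restored = dequantize_blockwise(quantize_blockwise(weights))

# One global scale (a single block) rounds the small weights to zero.
restored_single = dequantize_blockwise(quantize_blockwise(weights, block_size=len(weights)))
print(restored[:4], restored_single[:4])
```

The first four weights survive with per-block scales but collapse to zero when the outlier sets the scale for everything.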
double quantization
This is like multi-level page tables in RAM, a dictionary of dictionaries. Quantize the quantization scale factors themselves with a second round of scale factors. This saves space at the cost of extra compute.
Stochastic rounding
a randomized rounding method where instead of always rounding to the nearest integer, you randomly round up or down with a probability proportional to how close the value is to each option
Imagine a man standing at 2.8 on a number line, trying to decide whether to step to 2 or 3:
He's much closer to 3, so he leans heavily toward 3
But there's still a small chance he stumbles back to 2
Over many steps, his average position is exactly 2.8
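The number-line picture above can be checked with a few lines of Python. The function name `stochastic_round` and the fixed seed are illustrative choices.

```python
import math
import random

def stochastic_round(x: float, rng: random.Random) -> int:
    """Round up with probability equal to the fractional part, otherwise round down."""
    floor = math.floor(x)
    return floor + (1 if rng.random() < x - floor else 0)

rng = random.Random(0)                       # fixed seed so the run is repeatable
samples = [stochastic_round(2.8, rng) for _ in range(100_000)]
mean = sum(samples) / len(samples)           # close to 2.8 on average
print(mean)
```

Round-to-nearest would return 3 every time, so the average would drift to 3.0; the randomized version keeps the expected value at 2.8.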
QLoRA
Stores a quantized model and then learns a LoRA adapter as a small addition to those weights. Requires a pretrained model.
What format are the original weights stored in during training with Q-LoRA?
Quantized format (4 or 8 bits)
How does Q-LoRA affect the memory requirement for training large models?
Reduces memory requirements significantly
What type of memory is used to store gradients and momentum in the LoRA adapter?
Float32 or BFloat16 format
QLoRA tradeoffs
It is only meant for very specific tasks, requires fine-tuning, and may require a large rank R.
Backpropagation
Used to compute gradients for every weight in the network by applying the chain rule of calculus backwards through the layers.
Must store every layer's gradient.
memory efficient backpropagation
Instead of storing all activations during the forward pass, you store only a subset of checkpoints and recompute the rest on demand during the backward pass.
Divide the network into segments
Only store activations at segment boundaries (checkpoints)
During backward pass, recompute activations within each segment from the nearest checkpoint
Instead of saving your state after every single room (expensive), you save only at boss checkpoints. If you die mid-section, you replay from the last checkpoint. You do a bit more work but use far less storage.
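The segment-and-recompute idea can be sketched with a toy "network" of scalar functions. All names here are hypothetical, and a real framework would recompute tensors inside the backward pass rather than on request like this.

```python
def forward_with_checkpoints(layers, x, segment_size):
    """Run the forward pass, keeping only the input to each segment."""
    checkpoints = {}
    for i, layer in enumerate(layers):
        if i % segment_size == 0:
            checkpoints[i] = x            # activation entering layer i
        x = layer(x)
    return x, checkpoints

def recompute_activation(layers, checkpoints, index, segment_size):
    """Rebuild the input to layer `index` from the nearest earlier checkpoint."""
    start = (index // segment_size) * segment_size
    x = checkpoints[start]
    for layer in layers[start:index]:
        x = layer(x)
    return x

# Hypothetical 8-layer network made of simple scalar functions.
layers = [lambda v, k=k: v * 1.5 + k for k in range(8)]
output, checkpoints = forward_with_checkpoints(layers, 1.0, segment_size=4)
print(len(checkpoints), recompute_activation(layers, checkpoints, 6, segment_size=4))
```

Only two activations are stored instead of eight, and any other activation costs at most three extra layer evaluations to rebuild.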
GaLore (gradient low rank projection)
rather than saving memory on activations, it saves memory on the optimizer states (which in Adam can be 2x the model size due to first and second moment estimates).
Stores the optimizer state in a low-rank projection of the gradient and recomputes the projection every 200 steps.
Doesn’t really work on multiple gpus
Q-GaLore
The same as GaLore but applying quantization to those optimizer states.
Project full gradient to low-rank subspace (GaLore step)
Quantize the low-rank gradient and optimizer states to low precision (e.g. int8)
Run optimizer in quantized low-rank space
Dequantize the update
Project back to full parameter space
What does Q-GaLore do to manage memory requirements for projections?
Stores them in lower bit quantization
What is a benefit of GaLore over other methods?
It allows training all weights with minimal memory usage
What is a potential issue when weights are quantized in GaLore?
Gradients smaller than the quantization range may cause weights not to update
Backward pass nonlinear layer backpropagation
You must store the activations from the forward pass because they will be needed on the backward pass.
Backprop recompute activation
We could choose to recompute the forward pass during backpropagation rather than store the activations.
Downside is that it is slower and everything needs to take
Activation checkpointing
Recomputing layers only on demand, when the backward pass needs their activations.
CPU Offloading
In the case that the batch sizes are inconsistent, you can send the overflow of a batch to the CPU. Very unlikely but useful for some cases.
Intermediate/scratch memory
Attention
the mechanism that lets every token in a sequence look at every other token and decide how much to "pay attention" to it. For each token (the "query"), you compute a similarity score against all other tokens (the "keys"), run that through softmax to get weights, then use those weights to blend the "values" together into a new representation.
softmax
an activation function that converts a vector of raw scores (called logits) into a probability distribution.
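The attention and softmax cards can be combined into one small sketch for a single query. Dividing the scores by the square root of the vector length is the standard scaled dot-product convention and is an addition not stated on the cards; the function names are illustrative.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities; subtracting the max avoids overflow."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Single-query scaled dot-product attention over lists of vectors."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)                     # how much to attend to each token
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]

# A query that matches both keys equally blends the two values half and half.
print(attend([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [0.0, 4.0]]))
```

A full attention layer runs this for every token as the query, which is where the N×N score matrix comes from.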
Flash attention
Same as normal attention but uses tiling — it processes small blocks of Q, K, V that fit entirely in SRAM (fast on-chip memory), computes partial softmax results using a running numerically-stable trick, and never writes the full N×N matrix to HBM at all.
Can be tricky to keep the numerical stability
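The running numerically-stable trick can be sketched for one query in plain Python. This is only the bookkeeping idea, not a GPU kernel: values are scalars for brevity, and the function names and block size are assumptions.

```python
import math

def attention_full(scores, values):
    """Reference: softmax over all scores at once, then a weighted sum of values."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    return sum(e * v for e, v in zip(exps, values)) / sum(exps)

def attention_tiled(scores, values, block_size):
    """Same result, visiting one block at a time with a running max and running sum."""
    running_max = float("-inf")
    running_sum = 0.0      # sum of exp(score - running_max) seen so far
    running_out = 0.0      # matching weighted sum of values
    for start in range(0, len(scores), block_size):
        block_scores = scores[start:start + block_size]
        block_values = values[start:start + block_size]
        new_max = max(running_max, max(block_scores))
        correction = math.exp(running_max - new_max)   # rescale earlier partial results
        exps = [math.exp(s - new_max) for s in block_scores]
        running_sum = running_sum * correction + sum(exps)
        running_out = running_out * correction + sum(
            e * v for e, v in zip(exps, block_values)
        )
        running_max = new_max
    return running_out / running_sum

scores = [0.5, 3.0, -1.0, 100.0, 2.0, 7.0]    # hypothetical query-key scores
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(attention_full(scores, values), attention_tiled(scores, values, block_size=2))
```

The tiled version never holds more than one block of scores at a time, which is what lets the real kernel avoid writing the full score matrix to slow memory.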
flash attention 2
it reorganized the loop order (iterates over Q in the outer loop, K/V in the inner), reducing non-matmul FLOPs by ~2×, better parallelizes across both the sequence length dimension and the batch/head dimensions, and handles causal masking by skipping masked tiles entirely. Result: roughly 2× speedup over FA1, reaching ~70% of theoretical GPU FLOP utilization.
flash attention 3
software pipelining: while one tile is being computed on tensor cores, the next tile's data is already being asynchronously fetched from HBM into SRAM via TMA. On older GPUs these were sequential. On H100 they overlap — hiding memory latency almost entirely. FA3 also uses a smarter two-stage softmax that keeps the pipeline full. Result: ~1.5–2× over FA2, reaching ~75% FP16 utilization and ~2.6× FP8 speedup on H100.
torch.compile
Compiles your model code into optimized kernels for the specific hardware it runs on.
When using chunking, what happens to the number of kernel calls?
The number of kernel calls increases, because each chunk launches its own kernels
What is the term for dividing an operation into smaller parts to reduce intermediate memory usage?
chunking
If operations in the forward or backward pass are using too much additional memory, what should be considered?
Specialized implementations and torch.compile