What is the core objective and underlying assumption of Fine-Grained (Magnitude) Pruning?
To isolate and eliminate individual elements within a weight tensor W that contribute the least to layer activations.
Assumption: Weights with an absolute value close to zero (∣W∣≈0) have a negligible impact on the network's final output.
Note: Tensor dimensions do not change.
How is the pruning threshold τ determined, and how is the binary mask matrix M generated for a target sparsity ratio s?
1. Threshold (τ): Calculate the absolute values ∣W∣ and take the s-quantile of those values (for example, s = 0.9 gives the 90th percentile).
2. Binary Mask (M): Create a matrix matching the shape of W using the rule:
$$M_{i,j,k,l} = \begin{cases} 1 & \text{if } |W_{i,j,k,l}| \ge \tau \\ 0 & \text{if } |W_{i,j,k,l}| < \tau \end{cases}$$
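A minimal NumPy sketch of the threshold and mask rule above. The helper name `magnitude_mask`, the random 4D tensor, and the choice s = 0.75 are illustrative assumptions, not part of the card.

```python
import numpy as np

def magnitude_mask(W, s):
    """Return (tau, M) for a target sparsity ratio s in [0, 1]."""
    tau = np.quantile(np.abs(W), s)          # s-quantile of |W|
    M = (np.abs(W) >= tau).astype(W.dtype)   # 1 = keep, 0 = prune
    return tau, M

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4, 3, 3))            # hypothetical conv weight tensor
tau, M = magnitude_mask(W, s=0.75)
sparsity = 1.0 - M.mean()                    # fraction of zeroed entries
```

The mask has the same shape as W, so the tensor dimensions are unchanged; only the values are zeroed.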
What are the mathematical operations for the Forward Pass and Backward Pass during fine-grained pruning fine-tuning?
Forward Pass (Inference): Uses an element-wise Hadamard product (⊙):
$$W_{\text{pruned}} = W \odot M$$
$$Y = X * W_{\text{pruned}} + b$$
Backward Pass (Retraining): Mask gradients to prevent zeroed-out parameters from waking up:
$$\nabla W_{\text{updated}} = \nabla W \odot M$$
$$W \leftarrow W - \eta \cdot \nabla W_{\text{updated}}$$
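The forward and backward rules can be sketched in NumPy as follows. Two simplifying assumptions: the layer is treated as dense (a matrix product instead of a convolution), and the upstream gradient is a stand-in array of ones. The pruned tensor is stored in place of W so the effect of the gradient mask is visible.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))
M = (np.abs(W) >= np.median(np.abs(W))).astype(W.dtype)  # 50% sparsity mask
X = rng.normal(size=(5, 4))
b = np.zeros(3)
eta = 0.1

# Forward pass: mask the weights with a Hadamard product.
W_pruned = W * M
Y = X @ W_pruned + b

# Backward pass: mask the gradient so pruned weights stay at zero.
grad_Y = np.ones_like(Y)            # stand-in upstream gradient (assumption)
grad_W = X.T @ grad_Y
grad_W_updated = grad_W * M
W_new = W_pruned - eta * grad_W_updated
```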
What is Channel (Structured) Pruning, and what metric is commonly used to score structural filters?
Instead of masking individual elements, it physically removes an entire 3D slice (a filter) from a convolutional layer, shrinking the physical tensor dimensions.
Scoring Metric: The L1-norm of each filter i, which sums the absolute values of all its parameters:
$$S_i = \|W_i\|_1 = \sum_{j=1}^{C} \sum_{k=1}^{H} \sum_{l=1}^{W} |W_{i,j,k,l}|$$
How does physical channel pruning affect the weight tensor of the current layer (L) and the immediate next layer (L+1)?
Layer L (Output Slicing): Sorts channels by score, keeps the surviving indices K, and slices the tensor along its output-channel axis:
$$W_{\text{new}} = W[K, :, :, :]$$
Layer L+1 (Downstream Dependency): Because the input channels have shrunk, Layer L+1 must immediately slice its input dimension to match:
$$W_{L+1,\text{new}} = W_{L+1}[:, K, :, :]$$
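A short NumPy sketch of L1-norm scoring followed by the two slicing steps. The tensor shapes (layout: out channels, in channels, height, width) and the number of kept filters `n_keep` are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
W_L = rng.normal(size=(8, 3, 3, 3))       # layer L: 8 filters
W_next = rng.normal(size=(16, 8, 3, 3))   # layer L+1: expects 8 input channels

scores = np.abs(W_L).sum(axis=(1, 2, 3))  # S_i = L1-norm of filter i
n_keep = 5                                # hypothetical pruning budget
K = np.sort(np.argsort(scores)[-n_keep:]) # surviving indices, original order

W_L_new = W_L[K, :, :, :]                 # slice output channels of layer L
W_next_new = W_next[:, K, :, :]           # slice input channels of layer L+1
```

In a real network, the bias of layer L and any BatchNorm parameters attached to it would need the same index slice.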
What is the mathematical objective of K-Means Weight Quantization, and what two structures represent the layer after convergence?
Objective: Group continuous weights into $K = 2^b$ clusters (for b bits) by minimizing the within-cluster sum of squares (WCSS):
$$\arg\min_{C,\mu} \sum_{k=1}^{K} \sum_{w \in C_k} |w - \mu_k|^2$$
Storage Structures:
Codebook/Lookup Table (μ): A 1D array of K 32-bit floating-point centroids.
Index Map (I): A discrete matrix matching the original shape of W, containing b-bit integer cluster labels.
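A minimal sketch of the two storage structures using a plain Lloyd-style 1D k-means in NumPy. The linear centroid initialization, the fixed iteration count, and the use of `uint8` for the index map (real b-bit packing is omitted) are all assumptions made for brevity.

```python
import numpy as np

def kmeans_quantize(W, b, n_iter=20):
    """Return (codebook mu, index map I) for b-bit weight clustering."""
    K = 2 ** b
    w = W.ravel()
    mu = np.linspace(w.min(), w.max(), K)   # linear init (assumption)
    for _ in range(n_iter):
        I = np.abs(w[:, None] - mu[None, :]).argmin(axis=1)  # assign step
        for k in range(K):
            if np.any(I == k):
                mu[k] = w[I == k].mean()                     # update step
    I = np.abs(w[:, None] - mu[None, :]).argmin(axis=1)
    return mu.astype(np.float32), I.reshape(W.shape).astype(np.uint8)

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 16)).astype(np.float32)
mu, I = kmeans_quantize(W, b=2)
W_quant = mu[I]                             # reconstruct weights by lookup
```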
Why is the Straight-Through Estimator (STE) needed during Quantization-Aware Training (QAT), and what is its mathematical assumption?
The Problem: Mapping weights to discrete indices is a step-function with a derivative of zero almost everywhere, which blocks backpropagation.
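The STE Solution: Assumes the derivative of the quantization function is exactly 1, passing the gradient completely unchanged to the underlying full-precision weights ($W_{\text{fp32}}$):
$$\frac{\partial L}{\partial W_{\text{fp32}}} \approx \frac{\partial L}{\partial W_{\text{quant}}}$$
Forward Pass: $W_{\text{quant}} = \mu[I]$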
How are the codebook centroids (μk) updated during the backward pass of Quantization-Aware Training (QAT)?
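The gradient for an individual centroid $\mu_k$ is calculated by accumulating the gradients of all weights assigned to that specific cluster $C_k$:
$$\nabla \mu_k = \sum_{(i,j) \in C_k} \left( \frac{\partial L}{\partial W_{\text{quant}}} \right)_{i,j}$$
$$\mu_k \leftarrow \mu_k - \eta \cdot \nabla \mu_k$$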
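The per-cluster accumulation can be sketched with `np.bincount`, which sums the gradient entries that share a cluster label. The codebook, index map, and gradient values below are made-up numbers for illustration.

```python
import numpy as np

mu = np.array([-1.0, 0.0, 1.0, 2.0])              # codebook (K = 4)
I = np.array([[0, 1, 1],
              [3, 2, 0]])                         # cluster label per weight
grad_W_quant = np.array([[0.5, -1.0, 2.0],
                         [0.1, 0.3, -0.5]])       # dL/dW_quant (made up)
eta = 0.1

# Sum gradients of all weights assigned to each centroid.
grad_mu = np.bincount(I.ravel(), weights=grad_W_quant.ravel(),
                      minlength=mu.size)
mu_new = mu - eta * grad_mu
```

Here cluster 0 receives 0.5 + (−0.5) = 0, so its centroid does not move, while cluster 1 receives −1.0 + 2.0 = 1.0.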
Compare the structures of FP32 vs. FP16 formats, and state which layers are explicitly excluded from FP16 downcasting in Vision Transformers (ViTs).
FP32: 1 sign bit, 8 exponent bits, 23 mantissa bits.
FP16: 1 sign bit, 5 exponent bits, 10 mantissa bits.
Forward Pass Downcasting: Linear projections and MLPs are executed in FP16:
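$$Y_{\text{fp16}} = X_{\text{fp16}} \cdot W_{\text{fp16}} + b_{\text{fp16}}$$
Exceptions: Softmax and LayerNorm stay in FP32 because their exponentials and mean/variance reductions are prone to overflow and underflow in FP16.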
Why is Gradient Scaling necessary in FP16 training, and how is it applied during the backward pass?
Why: FP16's restricted dynamic exponent range causes tiny transformer gradients to underflow to absolute zero (0.0).
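How: The loss L is multiplied by a large scalar factor S (S≫1) before backpropagation, which shifts the gradients up into FP16's representable range:
$$L_{\text{scaled}} = L \times S$$
$$\nabla W_{\text{scaled}} = \frac{\partial L_{\text{scaled}}}{\partial W_{\text{fp16}}}$$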
How are master weights updated using scaled gradients, and what happens if an inf or NaN value is encountered?
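Unscaling: Before updating the full-precision master weights ($W_{\text{fp32}}$), the scaled gradient is divided by S:
$$\nabla W_{\text{unscaled}} = \frac{\nabla W_{\text{scaled}}}{S}$$
$$W_{\text{fp32}} \leftarrow W_{\text{fp32}} - \eta \cdot \nabla W_{\text{unscaled}}$$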
Exception Handling: If inf or NaN is detected, the entire optimizer update step is skipped and the scale factor is halved: S←S×0.5.
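A small NumPy simulation of the scale, unscale, and skip logic. It is only a sketch: casting "true gradient × S" to `float16` stands in for running backpropagation in FP16 on the scaled loss, and the function name `scaled_step` and all numbers are hypothetical.

```python
import numpy as np

def scaled_step(w_fp32, grad_true, S, eta=0.1):
    """One update with loss scale S; returns (weights, new_S, skipped)."""
    with np.errstate(over="ignore"):
        grad_scaled = (grad_true * S).astype(np.float16)  # FP16 gradient
    if not np.all(np.isfinite(grad_scaled)):
        return w_fp32, S * 0.5, True           # skip step, halve the scale
    grad_unscaled = grad_scaled.astype(np.float32) / S
    return w_fp32 - eta * grad_unscaled, S, False

underflowed = np.float16(1e-8)                 # too small for FP16: becomes 0

w0 = np.zeros(1, dtype=np.float32)
tiny = np.array([1e-8], dtype=np.float32)
huge = np.array([100.0], dtype=np.float32)
w1, S1, skipped1 = scaled_step(w0, tiny, S=1024.0)  # survives thanks to S
w2, S2, skipped2 = scaled_step(w0, huge, S=1024.0)  # overflows to inf
```

Without scaling, the 1e-8 gradient rounds to zero in FP16; with S = 1024 it is representable and the master weight moves. The large gradient overflows, so that step is skipped and S drops to 512.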
What is a Convolution operation in CNNs?
A mathematical operation where a small matrix (kernel) slides across an input image, performing element-wise multiplication and summing the results. Its purpose is to extract local features like edges, textures, or shapes from the input.
For an input I and kernel K
$$(I * K)(i,j) = \sum_{m} \sum_{n} I(i-m,\, j-n) \cdot K(m,n)$$
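A minimal "valid" 2D convolution in NumPy that follows the formula literally by flipping the kernel before sliding it. Note that deep learning frameworks typically skip the flip (cross-correlation); since the kernel is learned, the difference does not matter in practice. The input and kernel values are arbitrary.

```python
import numpy as np

def conv2d_valid(I, K):
    """True 2D convolution over positions where the kernel fits entirely."""
    Kf = K[::-1, ::-1]                        # flip kernel on both axes
    kh, kw = K.shape
    oh, ow = I.shape[0] - kh + 1, I.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # element-wise multiply the window by the kernel, then sum
            out[i, j] = np.sum(I[i:i + kh, j:j + kw] * Kf)
    return out

I = np.arange(16, dtype=float).reshape(4, 4)
K = np.array([[1.0, 0.0],
              [0.0, -1.0]])
out = conv2d_valid(I, K)   # every entry is I(i,j) - I(i-1,j-1) = 5
```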
What is Pooling in CNNs?
A downsampling operation typically placed after convolutional layers. It reduces the spatial dimensions (width and height) of the feature maps, which cuts down computation, reduces memory usage, and helps prevent overfitting while maintaining the most important features.
What is the difference between Max Pooling and Average Pooling?
Max Pooling selects the maximum value from the covered region of the feature map, highlighting the most prominent features (like bright edges).
Average Pooling calculates the average of all values in the region, providing a smoother, more generalized downsampling.
$$\text{Max Pooling} = \max_{(i,j) \in R} X_{i,j}$$
$$\text{Average Pooling} = \frac{1}{|R|} \sum_{(i,j) \in R} X_{i,j}$$
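Both pooling rules can be sketched in NumPy for non-overlapping square regions (stride equal to the window size, which is an assumption of this sketch). The helper name `pool2d` and the sample feature map are illustrative.

```python
import numpy as np

def pool2d(X, size, mode="max"):
    """Non-overlapping pooling over size x size regions R."""
    h, w = X.shape
    blocks = X[:h - h % size, :w - w % size].reshape(
        h // size, size, w // size, size)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    return blocks.mean(axis=(1, 3))

X = np.array([[1., 2., 5., 6.],
              [3., 4., 7., 8.],
              [9., 10., 13., 14.],
              [11., 12., 15., 16.]])
max_out = pool2d(X, 2, "max")   # [[4, 8], [12, 16]]
avg_out = pool2d(X, 2, "avg")   # [[2.5, 6.5], [10.5, 14.5]]
```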
What is a Dense Layer?
A standard neural network layer where every neuron is connected to every single neuron in the previous layer. In CNNs, these are placed at the very end of the network to take the extracted 2D/3D features (flattened into a 1D vector) and map them to the final class probabilities or outputs.
y=σ(W⋅x+b)
(Where W is weights, x is input, b is bias, and σ is the activation function)
What are the standard layers in a CNN and their order?
1. Input Layer: Holds raw pixel values.
2. Convolutional Layer: Extracts feature maps using kernels.
3. Activation Layer (ReLU): Introduces non-linearity.
4. Pooling Layer: Downsamples spatial dimensions.
5. Fully Connected (Dense) Layer: Performs final classification based on extracted features.
What is a Kernel/Filter in a CNN?
A small, learnable matrix of weights (e.g., 3×3 or 5×5) that slides across the input data to detect specific patterns. Early layers learn simple filters (edges), while deeper layers learn complex filters (faces, objects).
Represented as a tensor $W \in \mathbb{R}^{k \times k \times c}$, where k is the kernel size and c is the number of input channels.
What are Padding and Stride in convolution?
Padding adds extra pixels (usually zeros) around the outer border of an input image so the kernel can overlap the edges, preventing the image from shrinking too fast.
Stride is the step size (number of pixels) the kernel shifts by as it slides across the image.
Equation (Output Spatial Size O):
$$O = \left\lfloor \frac{W - K + 2P}{S} \right\rfloor + 1$$
(Where W = input size, K = kernel size, P = padding, S = stride)
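The output-size formula as a one-line Python helper, with a few worked cases. The function name and the example layer configurations are illustrative.

```python
def conv_output_size(W, K, P=0, S=1):
    """O = floor((W - K + 2P) / S) + 1"""
    return (W - K + 2 * P) // S + 1

same_padding = conv_output_size(32, 3, P=1, S=1)   # 32: size preserved
no_padding = conv_output_size(32, 5, P=0, S=1)     # 28: shrinks by K - 1
strided = conv_output_size(224, 7, P=3, S=2)       # 112: halved by stride 2
```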
What is a Sobel Filter, and how do horizontal/vertical variants work?
An edge-detection filter that computes the gradient of image intensity. The Horizontal Sobel (Gx) detects vertical edges by looking for changes in intensity horizontally.
The Vertical Sobel (Gy) detects horizontal edges by looking for changes vertically.
$$G_x = \begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix}, \quad G_y = \begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{bmatrix}$$
What is Gradient Magnitude in image processing?
A measure computed after applying the horizontal (Gx) and vertical (Gy) Sobel filters that gives the overall strength or sharpness of an edge at a given pixel, combining the responses in both directions.
$$|G| = \sqrt{G_x^2 + G_y^2}$$
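A NumPy sketch that applies both Sobel kernels to a toy image containing one vertical edge and combines them into the gradient magnitude. The helper `correlate_valid` slides the kernel without flipping it (an implementation choice here), and the 5×6 test image is made up.

```python
import numpy as np

Gx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
Gy = np.array([[-1, -2, -1], [0, 0, 0], [1, 2, 1]], dtype=float)

def correlate_valid(img, k):
    """Slide kernel k over img (no padding, no kernel flip)."""
    kh, kw = k.shape
    oh, ow = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    return np.array([[np.sum(img[i:i + kh, j:j + kw] * k)
                      for j in range(ow)] for i in range(oh)])

img = np.zeros((5, 6))
img[:, 3:] = 1.0                      # dark left half, bright right half

gx = correlate_valid(img, Gx)         # responds to the vertical edge
gy = correlate_valid(img, Gy)         # no horizontal edges: all zeros
mag = np.sqrt(gx ** 2 + gy ** 2)      # each row is [0, 4, 4, 0]
```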
What is a Laplacian Filter?
A derivative filter used to find edges by calculating the second derivative of the image intensity. Unlike Sobel (which finds edge direction), the Laplacian detects rapid intensity changes in all directions at once and is highly sensitive to noise.
Equation (Standard 3×3 Kernel):
$$L = \begin{bmatrix} 0 & 1 & 0 \\ 1 & -4 & 1 \\ 0 & 1 & 0 \end{bmatrix} \quad \text{or} \quad \begin{bmatrix} 1 & 1 & 1 \\ 1 & -8 & 1 \\ 1 & 1 & 1 \end{bmatrix}$$
What is Weight Sharing in CNNs?
The concept that the same kernel (and its weights) is used to scan every part of an input image. If a feature (like a horizontal edge) is useful to find in the top-left corner, it is equally useful to find in the bottom-right. This drastically reduces the number of parameters compared to dense layers.
Instead of unique weights for every input-output pair ($W_{ij}$), a kernel weight $K_{m,n}$ is applied universally across all spatial locations.
What is Filter Hierarchy in a CNN?
The progression of feature complexity as you go deeper into a network. Early layers capture low-level features (edges, lines), middle layers combine these into mid-level features (textures, shapes), and deep layers combine those into high-level features (entire objects, faces).
What is a Residual Connection (Skip Connection)?
A structural mechanism that allows the input of a layer to bypass one or more intermediate layers and be added directly to the output. Its purpose is to make very deep networks (like ResNet) trainable: the identity path lets gradients flow backward without passing through the intermediate layers, which mitigates the vanishing gradient problem.
H(x)=F(x)+x
(Where x is input, F(x) is the learned layer transformation, and H(x) is the final output)
What is a Basic Block in ResNet?
The fundamental building block of ResNet architectures (specifically ResNet-18 and ResNet-34). It consists of two successive 3×3 convolutional layers, each followed by Batch Normalization, wrapped together by a residual (skip) connection. ReLU is applied after the first Batch Normalization and again after the skip addition.
What is Batch Normalization?
A technique that normalizes the inputs of each layer across a mini-batch during training. This stabilizes and accelerates neural network training by mitigating internal covariate shift, allowing for higher learning rates.
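$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \quad y_i = \gamma \hat{x}_i + \beta$$
(Where $\mu_B$ is the batch mean, $\sigma_B^2$ is the batch variance, $\epsilon$ is a small constant for numerical stability, and $\gamma$ and $\beta$ are learnable scale/shift parameters)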
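A training-mode sketch of the two equations in NumPy, normalizing each feature over the batch axis. The batch size, feature count, and ε = 1e-5 are assumptions; at inference time, implementations normally use running averages of the statistics, which this sketch omits.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then scale and shift."""
    mu_B = x.mean(axis=0)                      # batch mean per feature
    var_B = x.var(axis=0)                      # batch variance per feature
    x_hat = (x - mu_B) / np.sqrt(var_B + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(64, 4))   # 64 samples, 4 features
y = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
```

With γ = 1 and β = 0, each output feature has approximately zero mean and unit variance regardless of the input's offset and scale.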
What does Downsample mean in a ResNet block?
When spatial dimensions shrink (via a stride of 2) and the number of feature channels increases between ResNet stages, the dimensions of input x and output F(x) no longer match. A downsample operation (usually a 1×1 convolution with stride 2) is applied to x so it can be mathematically added to F(x).
$$H(x) = F(x) + W_s x$$
(Where $W_s$ is the downsampling/projection layer matrix)
What components make up a full ResNet network?
Initial Stage: A large conv layer (7×7) and Max Pooling to rapidly reduce resolution.
Stack of Residual Blocks: Groups of Basic Blocks or Bottleneck Blocks divided into stages.
Downsample Blocks: Transitions between stages that reduce spatial resolution while doubling channel depth.
Global Average Pooling (GAP): Averages each channel's spatial map down to a single value, producing one feature vector before classification.
Final Dense Layer: Maps features to final class output.
What is a LambdaLR Scheduler?
A learning rate scheduling technique in PyTorch where the learning rate is adjusted dynamically at every epoch based on a user-defined custom lambda function.
$$\text{LR}_{\text{epoch}} = \text{LR}_{\text{initial}} \times \lambda(\text{epoch})$$
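A pure-Python sketch of the formula itself, not of PyTorch's implementation (in PyTorch the scheduler is `torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)`). The helper `lambda_lr` and the exponential decay lambda are hypothetical.

```python
def lambda_lr(lr_initial, lr_lambda, epoch):
    """LR_epoch = LR_initial * lambda(epoch)"""
    return lr_initial * lr_lambda(epoch)

decay = lambda epoch: 0.95 ** epoch        # hypothetical user-defined lambda
schedule = [lambda_lr(0.1, decay, e) for e in range(3)]
# approximately [0.1, 0.095, 0.09025]
```

The lambda returns a multiplier applied to the initial learning rate, not the learning rate itself.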
What optimizer is traditionally used for training ResNet?
The original ResNet paper uses Stochastic Gradient Descent (SGD) with Momentum (typically momentum value of 0.9) combined with weight decay. While Adam can be used, SGD with momentum is generally preferred for ResNet because it yields better final generalization on datasets like ImageNet.
$$v_{t+1} = \beta v_t + \eta \nabla L(\theta_t)$$
$$\theta_{t+1} = \theta_t - v_{t+1}$$
(Where v is velocity, β is momentum, η is learning rate, and θ represents the weights)
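The two update equations in plain Python, applied to the toy loss L(θ) = θ² (gradient 2θ). The loss, the learning rate η = 0.1, and the step count are assumptions for illustration; weight decay is left out.

```python
def sgd_momentum_step(theta, v, grad, eta=0.1, beta=0.9):
    """v <- beta*v + eta*grad ; theta <- theta - v"""
    v_next = beta * v + eta * grad
    return theta - v_next, v_next

first_theta, first_v = sgd_momentum_step(1.0, 0.0, grad=2.0)  # (0.8, 0.2)

theta, v = 1.0, 0.0
for _ in range(200):
    theta, v = sgd_momentum_step(theta, v, grad=2.0 * theta)
# theta oscillates toward the minimum at 0
```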
L1 Regularization (Lasso)
L1 regularization adds a penalty equal to the sum of the absolute values of the weights to the data loss function. Statistically, applying an L1 penalty is equivalent to finding the Maximum A Posteriori (MAP) estimate of the parameters using a Laplace prior centered at zero.
$$\hat{w}_{\text{MAP}} = \arg\min_{w} \; \text{NLL}(w) + \lambda \|w\|_1$$
Where the L1 norm $\|w\|_1$ is defined as:
$$\|w\|_1 \triangleq \sum_{d=1}^{D} |w_d|$$
The hyperparameter $\lambda$ controls the regularization strength. A larger λ penalizes parameter size more aggressively.
When does a Residual Block use an Identity Shortcut vs. a Projection Shortcut, and what is the formula for the latter?
Identity Shortcut: Used when the dimensions of the input tensor x match the output dimensions of the residual function F(x). The shortcut performs a direct, parameter-free element-wise addition.
Projection Shortcut: Used when spatial dimensions shrink (due to stride) and channel dimensions increase. A linear projection matrix $W_s$ is applied to x to match the shapes:
$$H(x) = F(x) + W_s x$$
Describe how Multi-Head Self-Attention (MHSA) builds upon single scaled dot-product attention, including its projection and concatenation steps.
Instead of performing attention once on D-dimensional queries, keys, and values, MHSA linearly projects Q, K, and V h times (once per head) with different learnable projections to lower dimensions $d_k$, $d_k$, and $d_v$, respectively.
Formulation:
$$\text{MultiHead}(Q,K,V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h) W^O$$
where $\text{head}_i = \text{Attention}(Q W_i^Q, K W_i^K, V W_i^V)$
Advantage: Allows the model to jointly attend to information from different representation subspaces at different positions simultaneously.
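A compact NumPy sketch of the project, attend, concatenate, and output-project steps for the self-attention case (Q = K = V = X). The sizes (5 tokens, D = 8, h = 2), the random weights, and the function name are assumptions for illustration.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)     # stabilize the exponentials
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_self_attention(X, W_Q, W_K, W_V, W_O):
    heads = []
    for Wq, Wk, Wv in zip(W_Q, W_K, W_V):     # one projection set per head
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
        heads.append(A @ V)
    return np.concatenate(heads, axis=-1) @ W_O   # Concat(...) W^O

rng = np.random.default_rng(0)
n_tokens, D, h = 5, 8, 2
d_k = d_v = D // h
X = rng.normal(size=(n_tokens, D))
W_Q = rng.normal(size=(h, D, d_k))
W_K = rng.normal(size=(h, D, d_k))
W_V = rng.normal(size=(h, D, d_v))
W_O = rng.normal(size=(h * d_v, D))
out = multi_head_self_attention(X, W_Q, W_K, W_V, W_O)
```

Choosing d_k = d_v = D / h keeps the total cost close to that of single-head attention at full dimensionality.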
In the Self-Attention mechanism, what do the Queries (Q), Keys (K), and Values (V) conceptually represent?
Queries (Q): What a token is actively searching for in the sentence.
Keys (K): A label/index of what a token contains, matching itself against incoming queries.
Values (V): The actual informational content of the token that is extracted once a query matches a key.
State the complete mathematical formula for Scaled Dot-Product Attention and define each matrix variable.
$$\text{Attention}(Q,K,V) = \text{softmax}\left( \frac{Q K^T}{\sqrt{d_k}} \right) V$$
Q: Query matrix ($X W^Q$)
K: Key matrix ($X W^K$)
V: Value matrix ($X W^V$)
$d_k$: The channel dimensionality of the key vectors (used for scaling variance).
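The formula translates almost line for line into NumPy. In this sketch the token matrix X and the projection matrices are random placeholders, and the function also returns the attention weights so they can be inspected.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # stabilize the exponentials
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # scaled dot products
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                   # 5 tokens, 8 features
W_Q, W_K, W_V = (rng.normal(size=(8, 4)) for _ in range(3))
out, weights = attention(X @ W_Q, X @ W_K, X @ W_V)
```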
Why is the scale factor $\frac{1}{\sqrt{d_k}}$ explicitly required inside the Transformer self-attention formula?
As $d_k$ grows large, the dot products in $QK^T$ grow in magnitude: if the components of a query and a key are independent with zero mean and unit variance, their dot product has variance $d_k$. Such large scores push the softmax into its saturation regions, where its derivative is nearly zero. Dividing by $\sqrt{d_k}$ brings the variance back to roughly 1, preventing vanishing gradients during backpropagation.
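A quick numerical check of the variance argument, assuming query and key components drawn independently from a standard normal distribution. The values d_k = 64 and 20,000 samples are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k, n = 64, 20000
q = rng.normal(size=(n, d_k))
k = rng.normal(size=(n, d_k))

dots = (q * k).sum(axis=1)                    # n independent dot products
var_raw = dots.var()                          # close to d_k
var_scaled = (dots / np.sqrt(d_k)).var()      # close to 1
```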