Project: CNNs, Pruning, Quantization

Last updated 8:18 AM on 5/29/26


36 Terms

New cards

What is the core objective and underlying assumption of Fine-Grained (Magnitude) Pruning?

To isolate and eliminate individual elements within a weight tensor $W$ that contribute the least to layer activations.

Assumption: Weights with an absolute value close to zero ( $|W| \approx 0$ ) have a negligible impact on the network's final output.
Note: Tensor dimensions do not change.

New cards

How is the pruning threshold $\tau$ determined, and how is the binary mask matrix $M$ generated for a target sparsity ratio $s$ ?

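1. Threshold ( $\tau$ ): Compute the absolute values $|W|$ and take the $s$ -quantile, i.e. the $(100 \cdot s)$ -th percentile for $s \in [0, 1]$ , so that a fraction $s$ of the weights fall below $\tau$ .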

2. Binary Mask ( $M$ ): Create a matrix matching the shape of $W$ using the rule:

$M_{i,j,k,l} = \begin{cases} 1 & \text{if } |W_{i,j,k,l}| \ge \tau \\ 0 & \text{if } |W_{i,j,k,l}| < \tau \end{cases}$
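The two steps above can be sketched in NumPy as follows. This is a minimal illustration rather than a reference implementation: the function name `magnitude_mask`, the tensor shape, and the choice of `np.quantile` (treating the sparsity $s$ as a fraction in $[0, 1]$) are assumptions made for the example.

```python
import numpy as np

def magnitude_mask(W, sparsity):
    """Return (tau, M) so that roughly `sparsity` of the entries are masked out."""
    abs_w = np.abs(W)
    # tau is the s-quantile of |W| (the (100*s)-th percentile)
    tau = np.quantile(abs_w, sparsity)
    # M is 1 where |W| >= tau and 0 otherwise, same shape as W
    M = (abs_w >= tau).astype(W.dtype)
    return tau, M

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4, 3, 3))   # assumed layout: (out_channels, in_channels, kH, kW)
tau, M = magnitude_mask(W, sparsity=0.5)
W_pruned = W * M                    # element-wise product; tensor shape is unchanged
```

With a sparsity of 0.5, about half of the entries of `W_pruned` are zero while its shape stays identical to `W`.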

New cards

What are the mathematical operations for the Forward Pass and Backward Pass during fine-grained pruning fine-tuning?

Forward Pass (Inference): Uses an element-wise Hadamard product ( $\odot$ ):


$W_{\text{pruned}} = W \odot M$

$Y = X * W_{\text{pruned}} + b$

Backward Pass (Retraining): Mask gradients to prevent zeroed-out parameters from waking up:

$\nabla W_{\text{updated}} = \nabla W \odot M$

$W \leftarrow W - \eta \cdot \nabla W_{\text{updated}}$
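A small NumPy sketch of one masked update step, assuming a random stand-in for the gradient (a real gradient would come from backpropagation) and a hypothetical learning rate of 0.1:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))
# binary mask that keeps the larger half of the weights by magnitude
M = (np.abs(W) >= np.quantile(np.abs(W), 0.5)).astype(W.dtype)
W = W * M                            # start from the pruned weights

grad_W = rng.normal(size=W.shape)    # stand-in for dL/dW (assumption)
eta = 0.1                            # hypothetical learning rate

grad_masked = grad_W * M             # pruned positions receive zero gradient
W = W - eta * grad_masked            # pruned entries therefore stay at zero
```

Because the gradient is masked with the same $M$ used in the forward pass, the sparsity pattern is preserved across updates.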

New cards

What is Channel (Structured) Pruning, and what metric is commonly used to score structural filters?

Instead of masking individual elements, it physically removes an entire 3D slice (a filter) from a convolutional layer, shrinking the physical tensor dimensions.

Scoring Metric: The $L_1$ -norm of each filter $i$ , which sums the absolute values of all its parameters:
- $\mathcal{S}_i = \|W_i\|_1 = \sum_{j=1}^{C} \sum_{k=1}^{K_h} \sum_{l=1}^{K_w} |W_{i, j, k, l}|$ (where $C$ is the number of input channels and $K_h \times K_w$ is the kernel size)

New cards

How does physical channel pruning affect the weight tensor of the current layer ( $L$ ) and the immediate next layer ( $L+1$ )?

Layer $L$ (Output Slicing): Sorts channels by score, keeps the surviving indices $\mathcal{K}$ , and slices the tensor along its output-channel axis:
- $W_{\text{new}} = W[\mathcal{K}, :, :, :]$

Layer $L+1$ (Downstream Dependency): Because the input channels have shrunk, Layer $L+1$ must immediately slice its input dimension to match:
- $W_{L+1, \text{new}} = W_{L+1}[:, \mathcal{K}, :, :]$
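The scoring and the two slicing operations can be sketched with NumPy indexing. The layer shapes and the number of kept filters (`n_keep`) are hypothetical. A real network would also need to slice the bias and any Batch Normalization parameters of the dropped channels, which this sketch omits.

```python
import numpy as np

rng = np.random.default_rng(0)
W_L = rng.normal(size=(8, 3, 3, 3))      # layer L:   (out=8,  in=3, kH, kW)
W_next = rng.normal(size=(16, 8, 3, 3))  # layer L+1: (out=16, in=8, kH, kW)

# L1 norm of each filter of layer L: sum of |w| over (in, kH, kW)
scores = np.abs(W_L).sum(axis=(1, 2, 3))

n_keep = 5                                      # hypothetical number of survivors
keep = np.sort(np.argsort(scores)[-n_keep:])    # indices K, kept in original order

W_L_new = W_L[keep, :, :, :]        # fewer output channels in layer L
W_next_new = W_next[:, keep, :, :]  # matching input channels in layer L+1
```

After slicing, `W_L_new` has shape `(5, 3, 3, 3)` and `W_next_new` has shape `(16, 5, 3, 3)`, so the two layers remain compatible.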

New cards

What is the mathematical objective of K-Means Weight Quantization, and what two structures represent the layer after convergence?

Objective: Group continuous weights into $K = 2^b$ clusters (for $b$ -bits) by minimizing the within-cluster sum of squares (WCSS):

$\arg\min_{\mathcal{C}, \mu} \sum_{k=1}^{K} \sum_{w \in \mathcal{C}_k} |w - \mu_k|^2$
Storage Structures:
- Codebook/Lookup Table ( $\mu$ ): A 1D array of $K$ 32-bit floating-point centroids.
- Index Map ( $I$ ): A discrete matrix matching the original shape of $W$ , containing $b$ -bit integer cluster labels.
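As an illustration, a plain Lloyd-style k-means over the flattened weights produces both structures. The function name, the linear initialisation of the centroids, and the fixed iteration count are assumptions for this sketch. The indices are stored in `uint8` for simplicity rather than packed into $b$ bits.

```python
import numpy as np

def kmeans_quantize(W, bits, iters=20):
    """Cluster the weights of W into 2**bits centroids (simple Lloyd iterations)."""
    K = 2 ** bits
    w = W.ravel()
    # linear initialisation over the weight range (an assumption of this sketch)
    mu = np.linspace(w.min(), w.max(), K)
    for _ in range(iters):
        # assign every weight to its nearest centroid
        idx = np.abs(w[:, None] - mu[None, :]).argmin(axis=1)
        for k in range(K):
            members = w[idx == k]
            if members.size:          # empty clusters keep their old centroid
                mu[k] = members.mean()
    idx = np.abs(w[:, None] - mu[None, :]).argmin(axis=1)
    return mu.astype(np.float32), idx.reshape(W.shape).astype(np.uint8)

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 16)).astype(np.float32)
codebook, index_map = kmeans_quantize(W, bits=2)   # K = 4 centroids
W_quant = codebook[index_map]                      # reconstruct via table lookup
```

Here `codebook` plays the role of $\mu$ and `index_map` the role of $I$ . The reconstructed layer is the lookup `codebook[index_map]`.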

New cards

Why is the Straight-Through Estimator (STE) needed during Quantization-Aware Training (QAT), and what is its mathematical assumption?

The Problem: Mapping weights to discrete indices is a step-function with a derivative of zero almost everywhere, which blocks backpropagation.

The STE Solution: Assumes the derivative of the quantization function is exactly $1$ , passing the gradient completely unchanged to the underlying full-precision weights ( $W_{\text{fp32}}$ ):
- $\frac{\partial L}{\partial W_{\text{fp32}}} \approx \frac{\partial L}{\partial W_{\text{quant}}}$
Forward Pass: $W_{\text{quant}} = \mu[I]$

New cards

How are the codebook centroids ( $\mu_k$ ) updated during the backward pass of Quantization-Aware Training (QAT)?

The gradient for an individual centroid $\mu_k$ is calculated by accumulating the gradients of all weights assigned to that specific cluster $\mathcal{C}_k$ :

$\nabla \mu_k = \sum_{i,j \in \mathcal{C}_k} \left( \frac{\partial L}{\partial W_{\text{quant}}} \right)_{i,j}$
$\mu_k \leftarrow \mu_k - \eta \cdot \nabla \mu_k$
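The per-cluster accumulation can be written compactly with `np.bincount`. In this sketch the index map, the upstream gradient, and the learning rate are random or hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 4
mu = np.array([-1.0, -0.3, 0.3, 1.0])   # hypothetical codebook
I = rng.integers(0, K, size=(6, 6))     # index map: cluster label per weight
grad_Wq = rng.normal(size=I.shape)      # stand-in for dL/dW_quant (assumption)
eta = 0.01                              # hypothetical learning rate

# sum the gradients of all weights that share the same cluster label
grad_mu = np.bincount(I.ravel(), weights=grad_Wq.ravel(), minlength=K)
mu_new = mu - eta * grad_mu
```

Each entry `grad_mu[k]` equals the sum of `grad_Wq` over the positions where `I == k`.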

New cards

Compare the structures of FP32 vs. FP16 formats, and state which layers are explicitly excluded from FP16 downcasting in Vision Transformers (ViTs).

FP32: 1 sign bit, 8 exponent bits, 23 mantissa bits.

FP16: 1 sign bit, 5 exponent bits, 10 mantissa bits.
Forward Pass Downcasting: Linear projections and MLPs are executed in FP16:
- $Y_{\text{fp16}} = X_{\text{fp16}} \cdot W_{\text{fp16}} + b_{\text{fp16}}$
Exceptions: Softmax and LayerNorm stay in FP32, because their exponentials and mean/variance reductions can overflow or lose precision in FP16.

New cards

Why is Gradient Scaling necessary in FP16 training, and how is it applied during the backward pass?

Why: FP16's restricted dynamic exponent range causes tiny transformer gradients to underflow to absolute zero ( $0.0$ ).

How: The loss $L$ is multiplied by a large scalar factor $S$ ( $S \gg 1$ ) before backpropagation, which shifts small gradients into FP16's representable range:
- $L_{\text{scaled}} = L \times S$
- $\nabla W_{\text{scaled}} = \frac{\partial L_{\text{scaled}}}{\partial W_{\text{fp16}}}$

New cards

How are master weights updated using scaled gradients, and what happens if an inf or NaN value is encountered?

Unscaling: Before updating the continuous latent master weights ($W_{\text{fp32}}$), the gradient scale is inverted:

$\nabla W_{\text{unscaled}} = \frac{\nabla W_{\text{scaled}}}{S}$
$W_{\text{fp32}} \leftarrow W_{\text{fp32}} - \eta \cdot \nabla W_{\text{unscaled}}$

Exception Handling: If inf or NaN is detected, the entire optimizer update step is skipped, and the scale factor is shrunk: $S \leftarrow S \times 0.5$ .
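The underflow and the unscale-or-skip logic can be illustrated with NumPy's `float16`. The gradient value, the scale $S = 1024$ , the learning rate, and the helper name `step` are hypothetical choices for this sketch. Frameworks implement this inside their mixed-precision utilities.

```python
import numpy as np

true_grad = 1e-8                        # hypothetical tiny gradient value
S = 1024.0                              # hypothetical loss scale

naive = np.float16(true_grad)           # too small for FP16: underflows to 0.0
scaled = np.float16(true_grad * S)      # scaled value is representable in FP16
unscaled = np.float32(scaled) / np.float32(S)   # unscale in FP32

def step(w_fp32, grad_scaled_fp16, S, eta=0.1):
    """Return (new_weights, new_scale); skip the update on inf/NaN."""
    g = grad_scaled_fp16.astype(np.float32) / np.float32(S)
    if not np.all(np.isfinite(g)):
        return w_fp32, S * 0.5          # skip the step and halve the scale
    return w_fp32 - eta * g, S
```

In this example `naive` is exactly zero, while `unscaled` recovers the original gradient to within FP16 rounding error.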

New cards

What is a Convolution operation in CNNs?

A mathematical operation where a small matrix (kernel) slides across an input image, performing element-wise multiplication and summing the results. Its purpose is to extract local features like edges, textures, or shapes from the input.

For an input $I$ and kernel $K$

$(I * K)(i, j) = \sum_{m} \sum_{n} I(i-m, j-n) \cdot K(m, n)$
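A naive NumPy sketch of the sliding-window computation, assuming no padding and a stride of 1. Note that deep learning frameworks typically compute the un-flipped form (cross-correlation, indexing $I(i+m, j+n)$ ). The formula above corresponds to flipping the kernel first. Since kernels are learned, the difference does not matter in practice. The function names are illustrative.

```python
import numpy as np

def correlate2d_valid(I, K):
    """Slide K over I without flipping it (what CNN layers usually compute)."""
    kh, kw = K.shape
    oh, ow = I.shape[0] - kh + 1, I.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # element-wise multiply the window with the kernel, then sum
            out[i, j] = np.sum(I[i:i + kh, j:j + kw] * K)
    return out

def convolve2d_valid(I, K):
    """True convolution: flip the kernel on both axes, then correlate."""
    return correlate2d_valid(I, K[::-1, ::-1])

I = np.arange(16, dtype=float).reshape(4, 4)
K = np.array([[1.0, 0.0], [0.0, -1.0]])
corr = correlate2d_valid(I, K)   # every entry is I[i, j] - I[i+1, j+1] = -5
conv = convolve2d_valid(I, K)    # flipped kernel gives +5 everywhere
```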

New cards

What is Pooling in CNNs?

A downsampling operation typically placed after convolutional layers. It reduces the spatial dimensions (width and height) of the feature maps, which cuts down computation, reduces memory usage, and helps prevent overfitting while maintaining the most important features.

New cards

What is the difference between Max Pooling and Average Pooling?

Max Pooling selects the maximum value from the covered region of the feature map, highlighting the most prominent features (like bright edges).
Average Pooling calculates the average of all values in the region, providing a smoother, more generalized downsampling.

$\text{Max Pooling} = \max_{i,j \in R} (X_{i,j})$

$\text{Average Pooling} = \frac{1}{|R|} \sum_{i,j \in R} X_{i,j}$
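Both variants can be sketched in NumPy for the common case of non-overlapping windows (window size equal to stride). The helper name `pool2d` and the requirement that the input dimensions divide evenly are simplifications for this example.

```python
import numpy as np

def pool2d(X, size, mode="max"):
    """Non-overlapping pooling with window = stride = `size`."""
    h, w = X.shape
    assert h % size == 0 and w % size == 0, "sketch assumes divisible dims"
    # split the map into (h/size) x (w/size) blocks of size x size
    blocks = X.reshape(h // size, size, w // size, size)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    return blocks.mean(axis=(1, 3))

X = np.array([[1., 2., 5., 6.],
              [3., 4., 7., 8.],
              [0., 0., 1., 1.],
              [0., 4., 1., 1.]])
max_out = pool2d(X, 2, "max")   # [[4, 8], [4, 1]]
avg_out = pool2d(X, 2, "avg")   # [[2.5, 6.5], [1.0, 1.0]]
```

The bottom-left region shows the difference: a single value of 4 dominates the max, while the average is pulled down to 1.0 by the surrounding zeros.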

New cards

What is a Dense Layer?

A standard neural network layer where every neuron is connected to every single neuron in the previous layer. In CNNs, these are placed at the very end of the network to take the extracted 2D/3D features (flattened into a 1D vector) and map them to the final class probabilities or outputs.

$y = \sigma(W \cdot x + b)$

(Where $W$ is weights, $x$ is input, $b$ is bias, and $\sigma$ is the activation function)

New cards

What are the standard layers in a CNN and their order?

1. Input Layer: Holds raw pixel values.

2. Convolutional Layer: Extracts feature maps using kernels.

3. Activation Layer (ReLU): Introduces non-linearity.

4. Pooling Layer: Downsamples spatial dimensions.

5. Fully Connected (Dense) Layer: Performs final classification based on extracted features.

New cards

What is a Kernel/Filter in a CNN?

A small, learnable matrix of weights (e.g., $3 \times 3$ or $5 \times 5$ ) that slides across the input data to detect specific patterns. Early layers learn simple filters (edges), while deeper layers learn complex filters (faces, objects).

Represented as a matrix $W \in \mathbb{R}^{k \times k \times c}$ where $k$ is size and $c$ is input channels.

New cards

What are Padding and Stride in convolution?

Padding adds extra pixels (usually zeros) around the outer border of an input image so the kernel can overlap the edges, preventing the image from shrinking too fast.

Stride is the step size (number of pixels) the kernel shifts by as it slides across the image.

Equation (Output Spatial Size $O$ ):

$O = \left\lfloor \frac{W - K + 2P}{S} \right\rfloor + 1$
(Where $W$ = input size, $K$ = kernel size, $P$ = padding, $S$ = stride)
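The output-size formula is easy to check with a small helper. The function name and the example layer settings are chosen for illustration.

```python
def conv_output_size(w, k, p=0, s=1):
    """Output spatial size for input size w, kernel k, padding p, stride s."""
    return (w - k + 2 * p) // s + 1

same = conv_output_size(32, 3, p=1, s=1)     # 3x3 kernel, padding 1 keeps size 32
shrunk = conv_output_size(32, 3, p=0, s=1)   # no padding shrinks 32 to 30
halved = conv_output_size(224, 7, p=3, s=2)  # 7x7 kernel, stride 2: 224 to 112
```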

New cards

What is a Sobel Filter, and how do horizontal/vertical variants work?

An edge-detection filter that computes the gradient of image intensity. The Horizontal Sobel ( $G_x$ ) detects vertical edges by looking for changes in intensity horizontally.

The Vertical Sobel ( $G_y$ ) detects horizontal edges by looking for changes vertically.

$G_x = \begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix}, \quad G_y = \begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{bmatrix}$

New cards

What is Gradient Magnitude in image processing?

A measure used after applying horizontal ( $G_x$ ) and vertical ( $G_y$ ) Sobel filters to determine the overall strength or sharpness of an edge at any given pixel, combining both direction forces.

$|G| = \sqrt{G_x^2 + G_y^2}$
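A minimal NumPy sketch applying both Sobel kernels to a synthetic image with a vertical step edge, then combining the responses into the gradient magnitude. The helper `filter_valid` (sliding window, no padding, no kernel flip) and the test image are assumptions for the example.

```python
import numpy as np

Gx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
Gy = np.array([[-1, -2, -1], [0, 0, 0], [1, 2, 1]], dtype=float)

def filter_valid(img, kernel):
    """Slide the kernel over the image without padding or flipping."""
    kh, kw = kernel.shape
    out = np.empty((img.shape[0] - kh + 1, img.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

img = np.zeros((5, 6))
img[:, 3:] = 1.0                      # dark left half, bright right half

gx = filter_valid(img, Gx)            # responds to the vertical edge
gy = filter_valid(img, Gy)            # zero: no change along the vertical axis
mag = np.sqrt(gx ** 2 + gy ** 2)      # each row is [0, 4, 4, 0]
```

Only the horizontal kernel responds here, so the magnitude equals $|G_x|$ and peaks at the columns that straddle the edge.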

New cards

What is a Laplacian Filter?

A derivative filter used to find edges by calculating the second derivative of the image intensity. Unlike Sobel (which finds edge direction), the Laplacian detects rapid intensity changes in all directions at once and is highly sensitive to noise.

Equation (Standard $3 \times 3$ Kernel):

$L = \begin{bmatrix} 0 & 1 & 0 \\ 1 & -4 & 1 \\ 0 & 1 & 0 \end{bmatrix} \quad \text{or} \quad \begin{bmatrix} 1 & 1 & 1 \\ 1 & -8 & 1 \\ 1 & 1 & 1 \end{bmatrix}$

New cards

What is Weight Sharing in CNNs?

The concept that the same kernel (and its weights) is used to scan every part of an input image. If a feature (like a horizontal edge) is useful to find in the top-left corner, it is equally useful to find in the bottom-right. This drastically reduces the number of parameters compared to dense layers.

Instead of unique weights for every input-output pair ( $W_{ij}$ ), a kernel weight $K_{m,n}$ is applied universally across all spatial locations.

New cards

What is Filter Hierarchy in a CNN?

The progression of feature complexity as you go deeper into a network. Early layers capture low-level features (edges, lines), middle layers combine these into mid-level features (textures, shapes), and deep layers combine those into high-level features (entire objects, faces).

New cards

What is a Residual Connection (Skip Connection)?

A structural mechanism that allows the input of a layer to bypass one or more intermediate layers and be added directly to the output. Its purpose is to ease the optimization of very deep networks (like ResNet): the identity path gives gradients a direct route backward, which mitigates the vanishing gradient problem.

$H(x) = F(x) + x$
(Where $x$ is input, $F(x)$ is the learned layer transformation, and $H(x)$ is the final output)

New cards

What is a Basic Block in ResNet?

The fundamental building block of ResNet architectures (specifically ResNet-18 and ResNet-34). It consists of two successive $3 \times 3$ Convolutional layers, each followed by Batch Normalization, wrapped together by a residual (skip) connection. ReLU is applied after the first Batch Normalization and again after the skip connection has been added to the block's output.

New cards

What is Batch Normalization?

A technique that normalizes the inputs of each layer across a mini-batch during training. This stabilizes and accelerates neural network training by mitigating internal covariate shift, allowing for higher learning rates.

$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \quad y_i = \gamma \hat{x}_i + \beta$
(Where $\mu_B$ is batch mean, $\sigma_B^2$ is batch variance, $\gamma$ and $\beta$ are learnable scale/shift parameters)
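The training-time formula can be sketched in NumPy for a batch of feature vectors. The function name and the batch shape are illustrative. At inference time, frameworks use running estimates of the mean and variance instead of batch statistics, which this sketch does not cover.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch axis, then scale and shift."""
    mu = x.mean(axis=0)                    # batch mean per feature
    var = x.var(axis=0)                    # (biased) batch variance per feature
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(32, 4))   # 32 samples, 4 features
y = batch_norm_train(x, gamma=np.ones(4), beta=np.zeros(4))
```

With $\gamma = 1$ and $\beta = 0$ , each feature of `y` has approximately zero mean and unit standard deviation over the batch.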

New cards

What does Downsample mean in a ResNet block?

When spatial dimensions shrink (via a stride of 2) and the number of feature channels increases between ResNet stages, the dimensions of input $x$ and output $F(x)$ no longer match. A downsample operation (usually a $1 \times 1$ convolution with stride 2) is applied to $x$ so it can be mathematically added to $F(x)$ .

$H(x) = F(x) + W_s(x)$
(Where $W_s$ is the downsampling/projection layer)

New cards

What components make up a full ResNet network?

Initial Stage: A large conv layer ( $7 \times 7$ ) and Max Pooling to rapidly reduce resolution.

Stack of Residual Blocks: Groups of Basic Blocks or Bottleneck Blocks divided into stages.

Downsample Blocks: Transitions between stages that reduce spatial resolution while doubling channel depth.

Global Average Pooling (GAP): Averages each channel over its spatial dimensions, producing a single value per channel (one feature vector per image) before classification.

Final Dense Layer: Maps features to final class output.

New cards

What is a LambdaLR Scheduler?

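A learning rate scheduler in PyTorch that sets the learning rate to the initial learning rate multiplied by the value of a user-defined lambda function of the epoch index. The rate is updated on each call to the scheduler's step() method, typically once per epoch.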

$\text{LR}_{\text{epoch}} = \text{LR}_{\text{initial}} \times \lambda(\text{epoch})$
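A short PyTorch usage sketch. The model, the initial learning rate of 0.1, and the schedule (halve the rate every 10 epochs) are hypothetical choices for the example. `LambdaLR`, `get_last_lr`, and `step` are the library's own names.

```python
import torch

model = torch.nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# hypothetical schedule: multiply the initial LR by 0.5 every 10 epochs
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda epoch: 0.5 ** (epoch // 10)
)

lrs = []
for epoch in range(30):
    lrs.append(scheduler.get_last_lr()[0])   # LR used during this epoch
    optimizer.step()                         # one epoch of training would go here
    scheduler.step()                         # advance the schedule by one epoch
```

The recorded rates are 0.1 for epochs 0 to 9, 0.05 for epochs 10 to 19, and 0.025 for epochs 20 to 29.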

New cards

What optimizer is traditionally used for training ResNet?

The original ResNet paper uses Stochastic Gradient Descent (SGD) with Momentum (typically momentum value of 0.9) combined with weight decay. While Adam can be used, SGD with momentum is generally preferred for ResNet because it yields better final generalization on datasets like ImageNet.

$v_{t+1} = \beta v_t + \eta \nabla L(\theta_t)$
$\theta_{t+1} = \theta_t - v_{t+1}$

(Where $v$ is velocity, $\beta$ is momentum, $\eta$ is learning rate, and $\theta$ represents the weights)
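The two update equations can be traced on a one-dimensional toy problem. The quadratic loss $L(\theta) = \theta^2$ , the starting point, and the hyperparameter values are assumptions for this sketch. Weight decay is left out.

```python
def sgd_momentum_step(theta, v, grad, eta=0.1, beta=0.9):
    """One update: v <- beta*v + eta*grad, then theta <- theta - v."""
    v_next = beta * v + eta * grad
    return theta - v_next, v_next

# minimise L(theta) = theta**2, whose gradient is 2*theta
theta, v = 5.0, 0.0
first = sgd_momentum_step(theta, v, 2.0 * theta)   # (4.0, 1.0)
for _ in range(200):
    theta, v = sgd_momentum_step(theta, v, 2.0 * theta)
```

After 200 steps `theta` has oscillated its way close to the minimum at 0. The velocity term is what carries the update past small gradients.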

New cards

L1 Regularization (Lasso)

L1 regularization adds a penalty equal to the sum of the absolute values of the weights to the data loss function. Statistically, applying an L1 penalty is equivalent to finding the Maximum A Posteriori (MAP) estimate of the parameters using a Laplace prior centered at zero.

$\hat{w}_{\text{map}} = \arg\min_{w} \text{NLL}(w) + \lambda \|w\|_1$

Where the L1 norm $\|w\|_1$ is defined as:

$\|w\|_1 \triangleq \sum_{d=1}^{D} |w_d|$

The hyperparameter $\lambda$ controls the regularization strength. A larger $\lambda$ penalizes parameter size more aggressively.

New cards

When does a Residual Block use an Identity Shortcut vs. a Projection Shortcut, and what is the formula for the latter?

Identity Shortcut: Used when the dimensions of the input tensor $x$ match the output dimensions of the residual function $\mathcal{F}(x)$ . The shortcut performs a direct, parameter-free element-wise addition.

Projection Shortcut: Used when spatial dimensions shrink (due to stride) and channel dimensions increase. A linear projection matrix $W_s$ is applied to $x$ to match the shapes:

$\mathcal{H}(x) = \mathcal{F}(x) + W_s x$

New cards

Describe how Multi-Head Self-Attention (MHSA) builds upon single scaled dot-product attention, including its projection and concatenation steps.

Instead of performing attention once on $D$ -dimensional queries, keys, and values, MHSA linearly projects $Q$ , $K$ , and $V$ exactly $h$ times (heads) with different, learnable projections to lower dimensions $d_k, d_k,$ and $d_v$ .

Formulation:
$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)W^O$
$\text{where } \text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$
Advantage: Allows the model to jointly attend to information from different representation subspaces at different positions simultaneously.

New cards

In the Self-Attention mechanism, what do the Queries ( $Q$ ), Keys ( $K$ ), and Values ( $V$ ) conceptually represent?

Queries ( $Q$ ): What a token is actively searching for in the sentence.

Keys ( $K$ ): A label/index of what a token contains, matching itself against incoming queries.
Values ( $V$ ): The actual informational content of the token that is extracted once a query matches a key.

New cards

State the complete mathematical formula for Scaled Dot-Product Attention and define each matrix variable.

$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$

$Q$ : Query matrix ( $XW^Q$ )
$K$ : Key matrix ( $XW^K$ )
$V$ : Value matrix ( $XW^V$ )
$d_k$ : The channel dimensionality of the key vectors (used for scaling variance).
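A single-head NumPy sketch of the formula, assuming random inputs and projection matrices. The sizes `n`, `d_model`, `d_k`, and `d_v` are arbitrary example values.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # (n, n) similarity of queries to keys
    weights = softmax(scores, axis=-1)    # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d_model, d_k, d_v = 5, 16, 8, 8
X = rng.normal(size=(n, d_model))         # n tokens with d_model features each
W_Q, W_K, W_V = (rng.normal(size=(d_model, d)) for d in (d_k, d_k, d_v))
out, attn = scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V)
```

Each row of `attn` is a probability distribution over the tokens, and `out` holds one $d_v$ -dimensional mixture of value vectors per token.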

New cards

Why is the scale factor $\frac{1}{\sqrt{d_k}}$ explicitly required inside the Transformer self-attention formula?

As $d_k$ grows large, the entries of $QK^T$ grow in magnitude (their variance scales with $d_k$ ). This pushes the softmax activation function into its saturation regions where its derivative is nearly zero. Dividing by $\sqrt{d_k}$ scales the variance of the scores back down, preventing vanishing gradients during backpropagation.
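A short derivation sketch of the variance argument, under the assumption (not stated in the card) that the components $q_i$ and $k_i$ of a query and a key vector are independent with zero mean and unit variance:

$q \cdot k = \sum_{i=1}^{d_k} q_i k_i, \quad \mathbb{E}[q \cdot k] = 0$

$\text{Var}(q \cdot k) = \sum_{i=1}^{d_k} \mathbb{E}[q_i^2] \, \mathbb{E}[k_i^2] = d_k$

$\text{Var}\left(\frac{q \cdot k}{\sqrt{d_k}}\right) = 1$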