Project: CNNs, Pruning, Quantization



Last updated 8:18 AM on 5/29/26

36 Terms

1

What is the core objective and underlying assumption of Fine-Grained (Magnitude) Pruning?

To isolate and eliminate individual elements within a weight tensor $W$ that contribute the least to layer activations.

  • Assumption: Weights with an absolute value close to zero ($|W| \approx 0$) have a negligible impact on the network's final output.

  • Note: Tensor dimensions do not change.

2

How is the pruning threshold $\tau$ determined, and how is the binary mask matrix $M$ generated for a target sparsity ratio $s$?

1. Threshold ($\tau$): Calculate the absolute values $|W|$ and take their $s$-quantile (the $100s$-th percentile), so that a fraction $s$ of the weights falls below $\tau$.

2. Binary Mask ($M$): Create a matrix matching the shape of $W$ using the rule:

$M_{i,j,k,l} = \begin{cases} 1 & \text{if } |W_{i,j,k,l}| \ge \tau \\ 0 & \text{if } |W_{i,j,k,l}| < \tau \end{cases}$
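The threshold and mask rule can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the helper name `magnitude_mask`, the random 4D tensor, and the 0.5 sparsity target are all assumptions made for the example.

```python
import numpy as np

def magnitude_mask(W, sparsity):
    """Return (tau, M): the s-quantile of |W| and the binary keep-mask."""
    tau = np.quantile(np.abs(W), sparsity)
    M = (np.abs(W) >= tau).astype(W.dtype)   # 1 keeps a weight, 0 prunes it
    return tau, M

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4, 3, 3))            # hypothetical conv weight tensor
tau, M = magnitude_mask(W, sparsity=0.5)
achieved = 1.0 - M.mean()                    # fraction of weights masked out
```

The mask has the same shape as `W`, which matches the note that fine-grained pruning leaves tensor dimensions unchanged.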

3

What are the mathematical operations for the Forward Pass and Backward Pass during fine-grained pruning fine-tuning?

  • Forward Pass (Inference): Uses an element-wise Hadamard product ($\odot$) to apply the mask, followed by the layer's usual convolution ($*$):

$W_{\text{pruned}} = W \odot M$

$Y = X * W_{\text{pruned}} + b$

  • Backward Pass (Retraining): Mask the gradients so that zeroed-out parameters stay at zero:

$\nabla W_{\text{updated}} = \nabla W \odot M$

$W \leftarrow W - \eta \cdot \nabla W_{\text{updated}}$
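A small numeric sketch of the masked update, assuming a 2×2 weight matrix, a threshold of 0.5, and a stand-in gradient of all ones (none of these values come from the cards):

```python
import numpy as np

def masked_sgd_step(W, grad_W, M, lr):
    """One update W <- W - lr * (grad_W * M); pruned entries receive zero gradient."""
    return W - lr * (grad_W * M)

W = np.array([[0.9, -0.02], [0.01, -1.2]])
M = (np.abs(W) >= 0.5).astype(W.dtype)    # keep only the large-magnitude weights
W_pruned = W * M                           # weights used in the forward pass
grad_W = np.ones_like(W)                   # stand-in gradient (hypothetical)
W_next = masked_sgd_step(W_pruned, grad_W, M, lr=0.1)
```

The two pruned positions remain exactly zero after the step, while the surviving weights move by `lr * grad`.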

4

What is Channel (Structured) Pruning, and what metric is commonly used to score structural filters?

Instead of masking individual elements, it physically removes an entire 3D slice (a filter) from a convolutional layer, shrinking the physical tensor dimensions.

  • Scoring Metric: The $L_1$-norm of each filter $i$, which sums the absolute values of all its parameters:

    • $\mathcal{S}_i = \|W_i\|_1 = \sum_{j=1}^{C} \sum_{k=1}^{H} \sum_{l=1}^{W} |W_{i, j, k, l}|$

5

How does physical channel pruning affect the weight tensor of the current layer ($L$) and the immediate next layer ($L+1$)?

  • Layer $L$ (Output Slicing): Sorts filters by score, keeps the surviving indices $\mathcal{K}$, and slices the tensor along its output-channel axis:

    • $W_{\text{new}} = W[\mathcal{K}, :, :, :]$

  • Layer $L+1$ (Downstream Dependency): Because Layer $L$ now emits fewer channels, Layer $L+1$ must slice its input-channel dimension to match:

    • $W_{L+1, \text{new}} = W_{L+1}[:, \mathcal{K}, :, :]$
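The scoring and the two slicing steps can be sketched together in NumPy. The function name `prune_channels`, the `keep_ratio` parameter, and the tensor shapes are assumptions for illustration; the layout `(out_channels, in_channels, H, W)` follows the indexing used in the formulas.

```python
import numpy as np

def prune_channels(W_cur, W_next, keep_ratio):
    """Drop the lowest-L1 filters of W_cur and the matching inputs of W_next.

    W_cur:  (out_channels, in_channels, H, W) weights of layer L
    W_next: (next_out, out_channels, H, W) weights of layer L+1
    """
    scores = np.abs(W_cur).sum(axis=(1, 2, 3))       # L1 norm per filter
    n_keep = max(1, int(round(keep_ratio * W_cur.shape[0])))
    keep = np.sort(np.argsort(scores)[-n_keep:])     # surviving indices K
    return W_cur[keep], W_next[:, keep], keep

rng = np.random.default_rng(0)
W1 = rng.normal(size=(8, 3, 3, 3))
W1[[1, 4]] *= 1e-3                                   # make two filters near-zero
W2 = rng.normal(size=(16, 8, 3, 3))
W1_new, W2_new, keep = prune_channels(W1, W2, keep_ratio=0.75)
```

In a real network the bias and any normalization parameters tied to the removed channels would need the same slicing; the sketch leaves them out.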

6

What is the mathematical objective of K-Means Weight Quantization, and what two structures represent the layer after convergence?

Objective: Group continuous weights into $K = 2^b$ clusters (for $b$ bits) by minimizing the within-cluster sum of squares (WCSS):

  • $\arg\min_{\mathcal{C}, \mu} \sum_{k=1}^{K} \sum_{w \in \mathcal{C}_k} |w - \mu_k|^2$

  • Storage Structures:

    • Codebook/Lookup Table ($\mu$): A 1D array of $K$ 32-bit floating-point centroids.

    • Index Map ($I$): A discrete matrix matching the original shape of $W$, containing $b$-bit integer cluster labels.
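A compact sketch of the clustering and the two resulting structures, using Lloyd's algorithm on the flattened weights. The linear centroid initialization, the iteration count, and the `uint8` index type (valid only for $b \le 8$) are assumptions of this example rather than details stated on the card.

```python
import numpy as np

def kmeans_quantize(W, bits, n_iter=20):
    """Lloyd's algorithm on flattened weights; returns (codebook, index_map)."""
    K = 2 ** bits
    w = W.ravel()
    # Linear initialization over the weight range (one common choice, assumed here).
    codebook = np.linspace(w.min(), w.max(), K)
    for _ in range(n_iter):
        idx = np.abs(w[:, None] - codebook[None, :]).argmin(axis=1)
        for k in range(K):
            members = w[idx == k]
            if members.size:                     # leave empty clusters untouched
                codebook[k] = members.mean()
    idx = np.abs(w[:, None] - codebook[None, :]).argmin(axis=1)
    return codebook.astype(np.float32), idx.reshape(W.shape).astype(np.uint8)

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 16)).astype(np.float32)
codebook, index_map = kmeans_quantize(W, bits=2)
W_quant = codebook[index_map]                    # decode step: mu[I]
```

Decoding is a plain table lookup, so the layer can be stored as $K$ floats plus one small integer per weight.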

7

Why is the Straight-Through Estimator (STE) needed during Quantization-Aware Training (QAT), and what is its mathematical assumption?

The Problem: Mapping weights to discrete indices is a step-function with a derivative of zero almost everywhere, which blocks backpropagation.

  • The STE Solution: Assumes the derivative of the quantization function is exactly $1$, passing the gradient completely unchanged to the underlying full-precision weights ($W_{\text{fp32}}$):

    • $\frac{\partial L}{\partial W_{\text{fp32}}} \approx \frac{\partial L}{\partial W_{\text{quant}}}$

  • Forward Pass: $W_{\text{quant}} = \mu[I]$

8

How are the codebook centroids ($\mu_k$) updated during the backward pass of Quantization-Aware Training (QAT)?

The gradient for an individual centroid $\mu_k$ is calculated by accumulating the gradients of all weights assigned to that specific cluster $\mathcal{C}_k$:

  • $\nabla \mu_k = \sum_{(i,j) \in \mathcal{C}_k} \left( \frac{\partial L}{\partial W_{\text{quant}}} \right)_{i,j}$

  • $\mu_k \leftarrow \mu_k - \eta \cdot \nabla \mu_k$
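The per-cluster accumulation maps naturally onto `np.bincount` with weights. The sketch below assumes a three-entry codebook, a 2×2 index map, and a made-up gradient, chosen so the sums are easy to check by hand.

```python
import numpy as np

def centroid_step(codebook, index_map, grad_W_quant, lr):
    """Sum dL/dW_quant over each cluster, then take one gradient step on mu."""
    grad_mu = np.bincount(index_map.ravel(),
                          weights=grad_W_quant.ravel(),
                          minlength=codebook.size)
    return codebook - lr * grad_mu, grad_mu

codebook = np.array([-1.0, 0.0, 1.0])
index_map = np.array([[0, 2], [2, 1]])
grad_W_quant = np.array([[0.5, 1.0], [3.0, -2.0]])   # stand-in gradient (hypothetical)
new_codebook, grad_mu = centroid_step(codebook, index_map, grad_W_quant, lr=0.1)
```

Cluster 2 owns two weights, so its gradient is $1.0 + 3.0 = 4.0$ and its centroid moves from $1.0$ to $0.6$.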

9

Compare the structures of FP32 vs. FP16 formats, and state which layers are explicitly excluded from FP16 downcasting in Vision Transformers (ViTs).

FP32: 1 sign bit, 8 exponent bits, 23 mantissa bits.

  • FP16: 1 sign bit, 5 exponent bits, 10 mantissa bits.

  • Forward Pass Downcasting: Linear projections and MLPs are executed in FP16:

    • $Y_{\text{fp16}} = X_{\text{fp16}} \cdot W_{\text{fp16}} + b_{\text{fp16}}$

  • Exceptions: Softmax and LayerNorm stay in FP32, because their exponentials and mean/variance reductions are prone to overflow and underflow in FP16.

10

Why is Gradient Scaling necessary in FP16 training, and how is it applied during the backward pass?

Why: FP16's limited exponent range causes tiny transformer gradients to underflow to exactly zero ($0.0$).

  • The loss $L$ is multiplied by a large scalar factor $S$ ($S \gg 1$) before backpropagation:

    • $L_{\text{scaled}} = L \times S$

    • $\nabla W_{\text{scaled}} = \frac{\partial L_{\text{scaled}}}{\partial W_{\text{fp16}}}$

11

How are master weights updated using scaled gradients, and what happens if an inf or NaN value is encountered?

Unscaling: Before updating the continuous latent master weights ($W_{\text{fp32}}$), the gradient scale is inverted:

  • $\nabla W_{\text{unscaled}} = \frac{\nabla W_{\text{scaled}}}{S}$

  • $W_{\text{fp32}} \leftarrow W_{\text{fp32}} - \eta \cdot \nabla W_{\text{unscaled}}$

Exception Handling: If inf or NaN is detected, the entire optimizer update step is skipped, and the scale factor is shrunk: $S \leftarrow S \times 0.5$.
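The scale, unscale, and skip logic can be simulated in NumPy by casting gradients to `float16` by hand. This is only a sketch of the arithmetic: the function `scaled_update`, the scale of 1024, and the example gradient values are assumptions, and a real framework performs these steps inside its mixed-precision utilities.

```python
import numpy as np

def scaled_update(w_fp32, grad_fp32_true, scale, lr):
    """Simulate one loss-scaled step. Returns (new_weights, new_scale, applied)."""
    with np.errstate(over="ignore"):
        # Scaling the loss scales every gradient by `scale`; storing it in FP16
        # may overflow to inf (too large) or underflow to 0 (too small).
        grad_fp16 = (grad_fp32_true * scale).astype(np.float16)
    if not np.all(np.isfinite(grad_fp16)):
        return w_fp32, scale * 0.5, False        # skip the step and shrink S
    grad_unscaled = grad_fp16.astype(np.float32) / scale
    return w_fp32 - lr * grad_unscaled, scale, True

w = np.zeros(2, dtype=np.float32)
tiny = np.array([1e-8, 2e-8], dtype=np.float32)

lost = tiny.astype(np.float16)                   # no scaling: both values underflow
w_ok, s_ok, applied_ok = scaled_update(w, tiny, scale=1024.0, lr=1.0)
w_bad, s_bad, applied_bad = scaled_update(
    w, np.array([1e3, 1.0], dtype=np.float32), scale=1024.0, lr=1.0)
```

Without scaling the tiny gradients vanish in FP16; with $S = 1024$ they survive the cast and are recovered to within FP16 rounding error. The second call overflows, so the update is skipped and $S$ is halved.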

12

What is a Convolution operation in CNNs?

A mathematical operation where a small matrix (kernel) slides across an input image, performing element-wise multiplication and summing the results. Its purpose is to extract local features like edges, textures, or shapes from the input.

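For an input $I$ and kernel $K$:

  • $(I * K)(i, j) = \sum_{m} \sum_{n} I(i-m, j-n) \cdot K(m, n)$

  • (Deep learning libraries typically compute cross-correlation instead, $\sum_{m} \sum_{n} I(i+m, j+n) \cdot K(m, n)$, which skips the kernel flip. Because the kernel is learned, the two are interchangeable in practice.)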
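The sliding multiply-and-sum can be written directly as two loops. The sketch below assumes no padding and stride 1, and shows both the unflipped sliding sum that CNN layers usually compute (cross-correlation) and the flipped version that matches the formula; the function names and the 4×4 test image are illustrative.

```python
import numpy as np

def correlate2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1), multiply and sum."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w), dtype=float)
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def convolve2d_valid(image, kernel):
    """True convolution: the same sliding sum with the kernel flipped on both axes."""
    return correlate2d_valid(image, kernel[::-1, ::-1])

image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.array([[1.0, 0.0], [0.0, -1.0]])
corr = correlate2d_valid(image, kernel)   # image[i, j] - image[i+1, j+1]
conv = convolve2d_valid(image, kernel)    # sign flips because the kernel is flipped
```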

13

What is Pooling in CNNs?

A downsampling operation typically placed after convolutional layers. It reduces the spatial dimensions (width and height) of the feature maps, which cuts down computation, reduces memory usage, and helps prevent overfitting while maintaining the most important features.

14

What is the difference between Max Pooling and Average Pooling?

  • Max Pooling selects the maximum value from the covered region of the feature map, highlighting the most prominent features (like bright edges).

  • Average Pooling calculates the average of all values in the region, providing a smoother, more generalized downsampling.

$\text{Max Pooling} = \max_{(i,j) \in R} X_{i,j}$

$\text{Average Pooling} = \frac{1}{|R|} \sum_{(i,j) \in R} X_{i,j}$
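Both pooling rules can be sketched with a reshape trick, assuming non-overlapping windows and a feature map whose sides are divisible by the window size. The helper `pool2d` and the 4×4 input are made up for the example.

```python
import numpy as np

def pool2d(x, size, mode="max"):
    """Non-overlapping size x size pooling on a 2D map whose sides divide by `size`."""
    h, w = x.shape
    blocks = x.reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3)) if mode == "max" else blocks.mean(axis=(1, 3))

x = np.array([[1., 2., 0., 0.],
              [3., 4., 0., 8.],
              [5., 5., 1., 1.],
              [5., 5., 1., 9.]])
max_out = pool2d(x, 2, "max")   # keeps the strongest response per 2x2 region
avg_out = pool2d(x, 2, "avg")   # smooths each 2x2 region into its mean
```

The top-right region `[0, 0, 0, 8]` shows the difference clearly: max pooling reports 8, average pooling reports 2.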

15

What is a Dense Layer?

A standard neural network layer where every neuron is connected to every single neuron in the previous layer. In CNNs, these are placed at the very end of the network to take the extracted 2D/3D features (flattened into a 1D vector) and map them to the final class probabilities or outputs.

  • $y = \sigma(W \cdot x + b)$

(Where $W$ is weights, $x$ is input, $b$ is bias, and $\sigma$ is the activation function)

16

What are the standard layers in a CNN and their order?

1. Input Layer: Holds raw pixel values.

2. Convolutional Layer: Extracts feature maps using kernels.

3. Activation Layer (ReLU): Introduces non-linearity.

4. Pooling Layer: Downsamples spatial dimensions.

5. Fully Connected (Dense) Layer: Performs final classification based on extracted features.

17

What is a Kernel/Filter in a CNN?

A small, learnable matrix of weights (e.g., $3 \times 3$ or $5 \times 5$) that slides across the input data to detect specific patterns. Early layers learn simple filters (edges), while deeper layers learn complex filters (faces, objects).

  • Represented as a tensor $W \in \mathbb{R}^{k \times k \times c}$, where $k$ is the kernel size and $c$ is the number of input channels.

18

What are Padding and Stride in convolution?

Padding adds extra pixels (usually zeros) around the outer border of an input image so the kernel can overlap the edges, preventing the image from shrinking too fast.

Stride is the step size (number of pixels) the kernel shifts by as it slides across the image.

Equation (Output Spatial Size $O$):

  • $O = \left\lfloor \frac{W - K + 2P}{S} \right\rfloor + 1$

  • (Where $W$ = input size, $K$ = kernel size, $P$ = padding, $S$ = stride)
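The output-size formula is easy to check with a few worked cases. The function name and the example configurations are assumptions; the 224/7/3/2 case uses the padding and stride commonly paired with a $7 \times 7$ stem convolution, which the cards do not spell out.

```python
def conv_output_size(w, k, p=0, s=1):
    """Output spatial size O = floor((W - K + 2P) / S) + 1."""
    return (w - k + 2 * p) // s + 1

same_3x3 = conv_output_size(32, 3, p=1, s=1)      # padding 1 preserves the size
stem_7x7 = conv_output_size(224, 7, p=3, s=2)     # stride 2 roughly halves it
no_padding = conv_output_size(32, 5)              # no padding shrinks the map
```

These evaluate to 32, 112, and 28 respectively.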

19

What is a Sobel Filter, and how do horizontal/vertical variants work?

An edge-detection filter that computes the gradient of image intensity. The Horizontal Sobel ($G_x$) detects vertical edges by looking for changes in intensity horizontally.

The Vertical Sobel ($G_y$) detects horizontal edges by looking for changes vertically.

  • $G_x = \begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix}, \quad G_y = \begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{bmatrix}$

20

What is Gradient Magnitude in image processing?

A measure used after applying horizontal ($G_x$) and vertical ($G_y$) Sobel filters to determine the overall strength or sharpness of an edge at any given pixel, combining the horizontal and vertical gradient components.

  • $|G| = \sqrt{G_x^2 + G_y^2}$
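A worked example on a synthetic image with a single vertical edge shows how the two Sobel responses combine. The helper `filter3x3` (a plain sliding multiply-and-sum over interior pixels) and the 5×6 test image are assumptions for illustration.

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
SOBEL_Y = np.array([[-1, -2, -1], [0, 0, 0], [1, 2, 1]], dtype=float)

def filter3x3(image, kernel):
    """Apply a 3x3 kernel at every interior pixel (sliding multiply-and-sum)."""
    h, w = image.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            out[i, j] = np.sum(image[i:i + 3, j:j + 3] * kernel)
    return out

# 5x6 image: dark left half, bright right half, i.e. one vertical edge.
image = np.zeros((5, 6))
image[:, 3:] = 1.0

gx = filter3x3(image, SOBEL_X)            # responds to the horizontal change
gy = filter3x3(image, SOBEL_Y)            # no vertical change, so all zeros
magnitude = np.sqrt(gx ** 2 + gy ** 2)
```

The magnitude is 4 in the two columns whose window straddles the edge and 0 elsewhere, so the vertical edge is picked up entirely by $G_x$.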

21

What is a Laplacian Filter?

A derivative filter used to find edges by calculating the second derivative of the image intensity. Unlike Sobel (which finds edge direction), the Laplacian detects rapid intensity changes in all directions at once and is highly sensitive to noise.

  • Equation (Standard $3 \times 3$ Kernel):

  • $L = \begin{bmatrix} 0 & 1 & 0 \\ 1 & -4 & 1 \\ 0 & 1 & 0 \end{bmatrix} \quad \text{or} \quad \begin{bmatrix} 1 & 1 & 1 \\ 1 & -8 & 1 \\ 1 & 1 & 1 \end{bmatrix}$

22

What is Weight Sharing in CNNs?

The concept that the same kernel (and its weights) is used to scan every part of an input image. If a feature (like a horizontal edge) is useful to find in the top-left corner, it is equally useful to find in the bottom-right. This drastically reduces the number of parameters compared to dense layers.

  • Instead of unique weights for every input-output pair ($W_{ij}$), a kernel weight $K_{m,n}$ is applied universally across all spatial locations.

23

What is Filter Hierarchy in a CNN?

The progression of feature complexity as you go deeper into a network. Early layers capture low-level features (edges, lines), middle layers combine these into mid-level features (textures, shapes), and deep layers combine those into high-level features (entire objects, faces).

24

What is a Residual Connection (Skip Connection)?

A structural mechanism that allows the input of a layer to bypass one or more intermediate layers and be added directly to the output. Its purpose is to solve the vanishing gradient problem, allowing deep networks (like ResNet) to train efficiently by letting gradients flow backward unhindered.

  • $H(x) = F(x) + x$

  • (Where $x$ is input, $F(x)$ is the learned layer transformation, and $H(x)$ is the final output)

25

What is a Basic Block in ResNet?

The fundamental building block of ResNet architectures (specifically ResNet-18 and ResNet-34). It consists of two successive $3 \times 3$ Convolutional layers, each followed by Batch Normalization, wrapped together by a residual (skip) connection. ReLU is applied after the first Batch Normalization and again after the skip connection is added.

26

What is Batch Normalization?

A technique that normalizes the inputs of each layer across a mini-batch during training. This stabilizes and accelerates neural network training by mitigating internal covariate shift, allowing for higher learning rates.

  • $\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \quad y_i = \gamma \hat{x}_i + \beta$

  • (Where $\mu_B$ is batch mean, $\sigma_B^2$ is batch variance, $\epsilon$ is a small constant for numerical stability, and $\gamma$ and $\beta$ are learnable scale/shift parameters)
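The training-time computation can be sketched for a `(batch, features)` array. This covers only the normalization formula: the running statistics used at inference are omitted, and the function name, `eps` default, and input values are assumptions.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Normalize each feature (column) of a (batch, features) array, then scale and shift."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)                      # biased batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
y = batch_norm_train(x, gamma=np.array([1.0, 2.0]), beta=np.array([0.0, 5.0]))
```

After the transform each column has mean $\beta$ and standard deviation close to $\gamma$, regardless of the original scale of the feature.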

27

What does Downsample mean in a ResNet block?

When spatial dimensions shrink (via a stride of 2) and the number of feature channels increases between ResNet stages, the dimensions of input $x$ and output $F(x)$ no longer match. A downsample operation (usually a $1 \times 1$ convolution with stride 2) is applied to $x$ so it can be added element-wise to $F(x)$.

  • $H(x) = F(x) + W_s(x)$

  • (Where $W_s$ is the downsampling/projection layer)
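A shape-level sketch of the downsample path: a $1 \times 1$ convolution with stride 2 amounts to subsampling spatial positions and then mixing channels with a matrix. The channel counts (64 to 128), the 56×56 input, and the random stand-in for $F(x)$ are illustrative assumptions, and the normalization layer that usually follows the projection is left out.

```python
import numpy as np

def projection_shortcut(x, W_s, stride=2):
    """1x1 convolution with stride: subsample positions, then mix channels.

    x:   (C_in, H, W) feature map
    W_s: (C_out, C_in) weights of the 1x1 projection
    """
    x_sub = x[:, ::stride, ::stride]
    return np.einsum("oc,chw->ohw", W_s, x_sub)

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 56, 56))
F_x = rng.normal(size=(128, 28, 28))      # stand-in for the residual branch output
W_s = rng.normal(size=(128, 64)) * 0.1
H_x = F_x + projection_shortcut(x, W_s)   # shapes now agree, so the sum is valid
```

Adding `x` directly to `F_x` would fail here because the shapes `(64, 56, 56)` and `(128, 28, 28)` differ in every dimension.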

28

What components make up a full ResNet network?

Initial Stage: A large conv layer ($7 \times 7$) and Max Pooling to rapidly reduce resolution.

Stack of Residual Blocks: Groups of Basic Blocks or Bottleneck Blocks divided into stages.

Downsample Blocks: Transitions between stages that reduce spatial resolution while doubling channel depth.

Global Average Pooling (GAP): Averages each channel's spatial map down to a single value, producing one feature vector before classification.

Final Dense Layer: Maps features to final class output.

29

What is a LambdaLR Scheduler?

A learning rate scheduling technique in PyTorch where the learning rate is adjusted dynamically at every epoch based on a user-defined custom lambda function.

$\text{LR}_{\text{epoch}} = \text{LR}_{\text{initial}} \times \lambda(\text{epoch})$
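The multiplicative rule can be reproduced in plain Python without importing PyTorch. The helper `lambda_lr` only mimics the formula, and the warmup-then-decay schedule is a hypothetical example of a user-defined function.

```python
def lambda_lr(initial_lr, lr_lambda, num_epochs):
    """Return the learning rate for each epoch: initial_lr * lr_lambda(epoch)."""
    return [initial_lr * lr_lambda(epoch) for epoch in range(num_epochs)]

# Hypothetical schedule: 3 epochs of linear warmup, then halve the rate each epoch.
def warmup_then_decay(epoch, warmup=3):
    if epoch < warmup:
        return (epoch + 1) / warmup
    return 0.5 ** (epoch - warmup + 1)

schedule = lambda_lr(0.1, warmup_then_decay, num_epochs=6)
```

In PyTorch the same function would be passed as the `lr_lambda` argument of `torch.optim.lr_scheduler.LambdaLR`, and the factor is re-evaluated each time the scheduler's `step()` is called.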

30

What optimizer is traditionally used for training ResNet?

The original ResNet paper uses Stochastic Gradient Descent (SGD) with Momentum (typically momentum value of 0.9) combined with weight decay. While Adam can be used, SGD with momentum is generally preferred for ResNet because it yields better final generalization on datasets like ImageNet.

  • $v_{t+1} = \beta v_t + \eta \nabla L(\theta_t)$

  • $\theta_{t+1} = \theta_t - v_{t+1}$

(Where $v$ is velocity, $\beta$ is momentum, $\eta$ is learning rate, and $\theta$ represents the weights)
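The two update equations can be traced on a one-dimensional toy problem. The quadratic loss, the step count, and the learning rate of 0.1 are assumptions for the example, and weight decay is left out to keep the sketch close to the formulas.

```python
def sgd_momentum(grad_fn, theta, lr=0.1, beta=0.9, steps=200):
    """Iterate v <- beta * v + lr * grad(theta); theta <- theta - v."""
    v = 0.0
    for _ in range(steps):
        v = beta * v + lr * grad_fn(theta)
        theta = theta - v
    return theta

# Toy loss L(theta) = (theta - 3)^2 with gradient 2 * (theta - 3); minimum at 3.
theta_final = sgd_momentum(lambda t: 2.0 * (t - 3.0), theta=0.0)
```

The velocity carries over a fraction $\beta$ of the previous step, so the iterate overshoots and oscillates around the minimum before settling near $\theta = 3$.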

31

L1 Regularization (Lasso)

L1 regularization adds a penalty proportional to the sum of the absolute values of the weights to the data loss function. Statistically, applying an L1 penalty is equivalent to finding the Maximum A Posteriori (MAP) estimate of the parameters using a Laplace prior centered at zero.

  • $\hat{w}_{\text{map}} = \arg\min_{w} \text{NLL}(w) + \lambda \|w\|_1$

Where the L1 norm $\|w\|_1$ is defined as:

  • $\|w\|_1 \triangleq \sum_{d=1}^{D} |w_d|$

The hyperparameter $\lambda$ controls the regularization strength. A larger $\lambda$ penalizes parameter size more aggressively.

32

When does a Residual Block use an Identity Shortcut vs. a Projection Shortcut, and what is the formula for the latter?

  • Identity Shortcut: Used when the dimensions of the input tensor $x$ match the output dimensions of the residual function $\mathcal{F}(x)$. The shortcut performs a direct, parameter-free element-wise addition.

  • Projection Shortcut: Used when spatial dimensions shrink (due to stride) and channel dimensions increase. A linear projection matrix $W_s$ is applied to $x$ to match the shapes:

$\mathcal{H}(x) = \mathcal{F}(x) + W_s x$

33

Describe how Multi-Head Self-Attention (MHSA) builds upon single scaled dot-product attention, including its projection and concatenation steps.

Instead of performing attention once on $D$-dimensional queries, keys, and values, MHSA linearly projects $Q$, $K$, and $V$ exactly $h$ times (heads) with different, learnable projections to lower dimensions $d_k$, $d_k$, and $d_v$.

  • Formulation:

  • $\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)W^O$

  • $\text{where } \text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$

  • Advantage: Allows the model to jointly attend to information from different representation subspaces at different positions simultaneously.
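The project, attend, concatenate, and re-project steps can be sketched with batched `einsum` calls. Everything concrete here is an assumption for illustration: the function name, stacking the per-head projections along a leading axis, and the sizes $n = 6$, $D = 16$, $h = 4$ with $d_k = d_v = D / h$.

```python
import numpy as np

def multi_head_attention(X, W_Q, W_K, W_V, W_O):
    """Self-attention with h heads.

    X: (n, D); W_Q, W_K: (h, D, d_k); W_V: (h, D, d_v); W_O: (h * d_v, D).
    """
    Q = np.einsum("nd,hdk->hnk", X, W_Q)          # per-head queries
    K = np.einsum("nd,hdk->hnk", X, W_K)          # per-head keys
    V = np.einsum("nd,hdv->hnv", X, W_V)          # per-head values
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(Q.shape[-1])    # (h, n, n)
    scores -= scores.max(axis=-1, keepdims=True)  # stabilize the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    heads = weights @ V                           # (h, n, d_v)
    concat = heads.transpose(1, 0, 2).reshape(X.shape[0], -1)   # (n, h * d_v)
    return concat @ W_O

rng = np.random.default_rng(0)
n, D, h = 6, 16, 4
d_k = d_v = D // h
X = rng.normal(size=(n, D))
W_Q, W_K = rng.normal(size=(h, D, d_k)), rng.normal(size=(h, D, d_k))
W_V, W_O = rng.normal(size=(h, D, d_v)), rng.normal(size=(h * d_v, D))
Y = multi_head_attention(X, W_Q, W_K, W_V, W_O)
```

Choosing $d_k = d_v = D / h$ keeps the total cost close to that of a single full-width head while giving each head its own subspace.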

34

In the Self-Attention mechanism, what do the Queries ($Q$), Keys ($K$), and Values ($V$) conceptually represent?

  • Queries ($Q$): What a token is actively searching for in the sentence.

  • Keys ($K$): A label/index of what a token contains, matching itself against incoming queries.

  • Values ($V$): The actual informational content of the token that is extracted once a query matches a key.

35

State the complete mathematical formula for Scaled Dot-Product Attention and define each matrix variable.

$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$

  • $Q$: Query matrix ($XW^Q$)

  • $K$: Key matrix ($XW^K$)

  • $V$: Value matrix ($XW^V$)

  • $d_k$: The dimensionality of the key vectors (used to normalize the variance of the dot products).
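A single-head NumPy sketch of the formula, with $Q$, $K$, and $V$ built from the same input $X$ as in the definitions above. The sizes (5 tokens, width 8, $d_k = 4$) and the random projection matrices are assumptions for the example.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)      # subtract the max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for 2D arrays of shape (tokens, dim)."""
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                      # 5 tokens, model width 8
W_Q, W_K, W_V = (rng.normal(size=(8, 4)) for _ in range(3))
out, weights = scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V)
```

Each row of `weights` is a probability distribution over the 5 tokens, and each output row is the corresponding weighted average of the value vectors.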

36

Why is the scale factor $\frac{1}{\sqrt{d_k}}$ explicitly required inside the Transformer self-attention formula?

As $d_k$ grows large, the entries of $QK^T$ grow large in magnitude: for independent query and key components with zero mean and unit variance, each dot product has variance $d_k$. This pushes the softmax into its saturation regions, where its derivative is nearly zero. Dividing by $\sqrt{d_k}$ scales the variance back to about 1, preventing vanishing gradients during backpropagation.