Correct parameter-tuning protocol
Split data into train, validation, and test sets.
Tune on validation, and use the test set only once at the very end to report the final score.
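A minimal sketch of this protocol in NumPy (the 70/15/15 split and the dataset shapes are made up for illustration):

```python
import numpy as np

# Hypothetical dataset: 1000 examples, 32 features, 10 classes (made-up numbers).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 32))
y = rng.integers(0, 10, size=1000)

# Shuffle, then carve out 70% train / 15% validation / 15% test.
idx = rng.permutation(len(X))
n_train, n_val = int(0.7 * len(X)), int(0.15 * len(X))
train_idx = idx[:n_train]
val_idx = idx[n_train:n_train + n_val]
test_idx = idx[n_train + n_val:]

# Tune hyperparameters by evaluating on the validation split;
# evaluate on the test split exactly once, for the final reported score.
```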
Softmax loss
A function that measures how bad a model’s predictions are in a multi-class classification problem by comparing predicted probabilities to the actual class
Purpose of regularization in deep learning
To prevent the model from memorizing the training data (overfitting), helping it perform better on new, unseen data
Difference between L1 and L2 regularization
L1 regularization pushes some model weights to become exactly zero, effectively selecting important features
L2 regularization forces weights to be small but rarely exactly zero
Common regularization techniques (besides L1/L2)
Dropout (randomly turns off neurons during training)
Batch Normalization (stabilizes and speeds up training)
k-NN vs a linear classifier
Compared with a linear classifier, k-NN is slower at prediction time, requires no training, and uses more memory because it stores the entire training dataset
Numeric vs Analytic gradients
The numeric gradient is easy to implement but slow and only an approximation
The analytic gradient is fast and exact, but it is easier to make coding mistakes when deriving it
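A small sketch of a gradient check comparing the two, using the toy loss f(w) = Σ w², whose analytic gradient is 2w:

```python
import numpy as np

def f(w):
    return np.sum(w ** 2)

def numeric_gradient(f, w, h=1e-5):
    # Centered-difference approximation: simple but slow (two evaluations per weight).
    grad = np.zeros_like(w)
    for i in range(w.size):
        old = w.flat[i]
        w.flat[i] = old + h
        fp = f(w)
        w.flat[i] = old - h
        fm = f(w)
        w.flat[i] = old
        grad.flat[i] = (fp - fm) / (2 * h)
    return grad

w = np.random.randn(5)
analytic = 2 * w                            # exact gradient derived by hand
numeric = numeric_gradient(f, w)
print(np.max(np.abs(analytic - numeric)))   # tiny difference -> the analytic code is likely correct
```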
Learning-rate schedules
A pre-defined plan for how the learning rate changes during training, such as gradually decreasing it over time to improve convergence
Normalization Layers (BatchNorm, LayerNorm, InstanceNorm)
Techniques to standardize the inputs to a layer, which helps to speed up and stabilize the training of deep neural networks
Residual connections (ResNet)
A “shortcut” that skips some layers, allowing the model to easily learn to do nothing if a layer is not useful, which helps in training very deep networks
Problem with naive weight initialization
Initializing all weights to very small or very large random values can cause the signals and gradients to shrink or explode as they pass through the layers, making the network very difficult to train
Limitation of Xavier initialization
It assumes roughly linear (symmetric) activations, so with ReLU, which zeroes half of its inputs, the activation variance shrinks layer by layer and signals can effectively die out (outputs collapse toward 0) in deep networks
Best initialization for ReLU-based networks
Kaiming initialization
Because it is specifically designed to work with the properties of the ReLU activation function
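A sketch of Kaiming initialization for a fully connected ReLU layer (the layer sizes are made up); weights are scaled by √(2 / fan_in) to preserve activation variance under ReLU:

```python
import numpy as np

fan_in, fan_out = 512, 256                                    # hypothetical layer sizes
W = np.random.randn(fan_in, fan_out) * np.sqrt(2.0 / fan_in)  # Kaiming (He) scaling
print(W.std())   # roughly sqrt(2 / 512) ≈ 0.0625, the ReLU-aware standard deviation
```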
Hinge loss
A loss function used for training classifiers that aims to ensure correct predictions are made with a confident margin
Effective receptive field of three 3×3 conv layers
The same as a single 7×7 convolution layer
Why stack small 3×3 convolutions?
It uses fewer parameters than a single large kernel (like 7×7) and allows for more non-linear activation functions, making the network more powerful and efficient
Role of a loss function
To measure how far off the model’s predictions are from the correct answers, guiding the model on how to adjust its weights
Purpose of a non-linear activation function
It allows the neural network to learn complex patterns and relationships in the data that a simple linear model cannot
Benefit of pooling layer in CNNs
It reduces the size of the feature maps, which makes the computation faster and helps the network become more robust to the exact position of objects in an image
Two common types of pooling
Max Pooling (takes maximum value in a window)
Average Pooling (takes average value)
Final layer in a classification CNN
A fully connected layer is typically added at the end to take the high-level features learned by the CNN and use them to make the final classification
Weight update rule
An algorithm, like gradient descent, that adjusts the weights of the network in the direction that reduces the loss function
Effect of a large learning rate
It can cause the optimization to overshoot the ideal solution and bounce around, possibly preventing the model from converging
Vanishing gradients
A problem in very deep networks where the gradient becomes extremely small, causing the weights in the early layers to stop updating, effectively halting learning
How to lessen vanishing gradients
Use architectures like ResNet with residual connections, employ proper weight initialization (like Kaiming), and use normalization layers (like BatchNorm)
Softmax Loss / Cross-Entropy Loss
L_i = -log( e^{s_{y_i}} / Σ_j e^{s_j} )
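A NumPy sketch of this loss for one example (the max-subtraction is only for numerical stability; the scores are toy values):

```python
import numpy as np

def softmax_loss(scores, y):
    # scores: raw class scores for one example; y: index of the correct class
    shifted = scores - np.max(scores)                       # stability trick
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))   # log of the softmax probabilities
    return -log_probs[y]                                    # L_i = -log p(correct class)

print(softmax_loss(np.array([3.2, 5.1, -1.7]), y=0))
```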
Hinge Loss (Multiclass SVM), single example
L_i = Σ_{j≠y_i} max(0, s_j - s_{y_i} + 1)
Hinge Loss Average
L = (1/N) Σ_{i=1}^{N} L_i
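A vectorized NumPy sketch of the averaged multiclass hinge loss (the score matrix below is a toy example):

```python
import numpy as np

def hinge_loss(scores, y):
    # scores: (N, C) class scores; y: (N,) indices of the correct classes
    N = scores.shape[0]
    correct = scores[np.arange(N), y][:, None]           # s_{y_i} for each example
    margins = np.maximum(0, scores - correct + 1.0)      # margin of 1
    margins[np.arange(N), y] = 0                         # do not count the correct class
    return margins.sum() / N                             # L = (1/N) Σ_i L_i

scores = np.array([[3.2, 5.1, -1.7],
                   [1.3, 4.9,  2.0]])
print(hinge_loss(scores, y=np.array([0, 1])))            # 1.45 for these toy scores
```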
Total Squared Error
E_total = Σ ½ (target - output)²
Convolution Layer Parameters (weights)
K × K × C_in × C_out
K: kernel size
C_in: number of input channels
C_out: number of output channels (number of filters)
FC Layer Params
Input size × output size (weights) + output size (biases)
Sequential Stacking of Convs (e.g., three 3×3 layers)
If every layer maps C → C channels, the total weight count is:
3 × (3 × 3 × C × C) = 27C²
ResNet (Residual Block Relation)
y = F(x) + x
output = a transformation of the input + the original input
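A minimal sketch of the residual relation, with F(x) standing in for a small ReLU layer (the weights here are made up); when F(x) ≈ 0 the block reduces to the identity:

```python
import numpy as np

def residual_block(x, W):
    F = np.maximum(0, x @ W)   # stand-in transformation F(x): ReLU of a linear map
    return F + x               # y = F(x) + x

x = np.random.randn(4, 8)
W = np.zeros((8, 8))                           # F(x) = 0 everywhere
print(np.allclose(residual_block(x, W), x))    # True: the block simply passes x through
```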
ReLU Activation
a = max(0, z)
z: pre-activation input
Weight Update Rule (Gradient Descent)
W_new = W_old - η ∂E/∂W
η: learning rate
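A sketch of this update on the toy loss E(W) = ||W||², whose gradient is 2W (the learning rate and starting point are arbitrary):

```python
import numpy as np

eta = 0.1
W = np.array([3.0, -2.0])
for step in range(100):
    grad = 2 * W            # ∂E/∂W for E(W) = ||W||^2
    W = W - eta * grad      # step in the direction that reduces E
print(W)                    # approaches [0, 0], the minimizer of this toy loss
```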
Effective Receptive Field
For a stack of three 3×3 convolutions with stride 1, the effective receptive field is 7×7
(for L layers: 1 + L(K - 1), where K = 3 for a 3×3 kernel)
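A quick check of the formula; the two calls below match the two cases covered in this set:

```python
def receptive_field(L, K=3):
    # effective receptive field of L stacked stride-1 convolutions with kernel size K
    return 1 + L * (K - 1)

print(receptive_field(L=3, K=3))   # 7 -> three 3x3 layers see a 7x7 region
print(receptive_field(L=1, K=7))   # 7 -> one 7x7 layer sees the same region
```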
L1 vs L2 regularization specifics
L1: loss + λ Σ_{i=1}^{n} |W_i|
L2: loss + λ Σ_{i=1}^{n} W_i²
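A sketch of adding these penalties to a data loss (λ is written as lam; the weight matrix and the loss value are placeholders):

```python
import numpy as np

def l1_penalty(W, lam):
    return lam * np.sum(np.abs(W))   # pushes some weights to exactly zero

def l2_penalty(W, lam):
    return lam * np.sum(W ** 2)      # shrinks weights toward zero, rarely exactly zero

W = np.random.randn(4, 3)
data_loss = 1.0                      # placeholder for the unregularized loss
total_l1 = data_loss + l1_penalty(W, lam=1e-3)
total_l2 = data_loss + l2_penalty(W, lam=1e-3)
```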
Correct statements about learning-rate schedules
Warm-up increases LR linearly from a small value at the start
Exponential decay multiplies the LR by a fixed factor every epoch
Time-based decay reduces LR gradually as num epochs increases
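A sketch of the three schedules as plain functions of the epoch (the base LR, warm-up length, decay factor, and k are made-up constants):

```python
base_lr = 0.1

def warmup(epoch, warmup_epochs=5):
    # linear warm-up from a small value up to base_lr over the first few epochs
    return base_lr * min(1.0, (epoch + 1) / warmup_epochs)

def exponential_decay(epoch, factor=0.95):
    # multiply the LR by a fixed factor every epoch
    return base_lr * (factor ** epoch)

def time_based_decay(epoch, k=0.1):
    # LR shrinks gradually as the number of epochs grows
    return base_lr / (1.0 + k * epoch)

for e in range(3):
    print(warmup(e), exponential_decay(e), time_based_decay(e))
```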
Why normalization layers are needed
They reduce internal covariate shift (the changing distribution of layer inputs during training) and allow the use of higher learning rates
BatchNorm2D Axes
Computes statistics over
N (batch), H (height), and W (width) axes
Normalizes
each channel C independently, using statistics from the entire batch
LayerNorm Axes
Computes statistics over
C (channel), H (height), W (width)
Normalizes
within a single sample
InstanceNorm Axes
Computes statistics over
H (height) and W (width) axes
Normalizes
within a single sample and single channel
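A NumPy sketch of the reduction axes for an NCHW tensor, matching the three cards above (the tensor shape is made up):

```python
import numpy as np

x = np.random.randn(8, 3, 16, 16)   # (N, C, H, W)

# BatchNorm2d: statistics over N, H, W -> one mean/var per channel
bn_mean = x.mean(axis=(0, 2, 3), keepdims=True)   # shape (1, C, 1, 1)

# LayerNorm: statistics over C, H, W -> one mean/var per sample
ln_mean = x.mean(axis=(1, 2, 3), keepdims=True)   # shape (N, 1, 1, 1)

# InstanceNorm: statistics over H, W -> one mean/var per sample and channel
in_mean = x.mean(axis=(2, 3), keepdims=True)      # shape (N, C, 1, 1)
```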
Effective receptive field of single 7×7 conv
7×7
Parameter Count: Three 3×3 stack (C to C)
3 × (3 × 3 × C × C) = 27C²
Parameter Count: Single 7×7 conv (C to C)
1 × (7 × 7 × C × C) = 49C²
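The arithmetic for a hypothetical C = 64:

```python
C = 64
three_3x3 = 3 * (3 * 3 * C * C)    # 27 * C^2 = 110,592 weights
single_7x7 = 1 * (7 * 7 * C * C)   # 49 * C^2 = 200,704 weights
print(three_3x3, single_7x7)       # the 3x3 stack needs far fewer parameters for the same 7x7 receptive field
```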
Usefulness of a pooling layer
Reduces feature map size
Lowers computational cost and memory
Provides a degree of translation invariance
Cause of Vanishing Gradients
Repeated multiplication of small gradients through many layers during backpropagation