Flashcards on optimizers, loss functions, and learning rates in deep learning.
What is a Loss Function?
A differentiable, bounded-below measure (L >= 0) of prediction quality. Technically, it quantifies how well a model's predictions match the actual values; it must be differentiable so we know how to adjust the model, and bounded below so that minimizing it is a well-defined goal. Analogy: Like measuring the distance between a map (prediction) and the actual terrain (reality). We want this distance to be as small as possible.
What is Mean Squared Error (MSE) Loss?
L_MSE(ŷ, y) = (1/N) ∑ᵢ (ŷᵢ − yᵢ)², used for regression tasks. Technically, it calculates the average of the squared errors between predicted and actual values. The squaring emphasizes larger errors and ensures the loss is always positive. Analogy: It measures the average squared 'miss' distance in a series of shots.
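A minimal NumPy sketch of the formula above (the function name and example values are illustrative, not from the card):

```python
import numpy as np

def mse_loss(y_pred, y_true):
    """Average of the squared differences between predictions and targets."""
    return np.mean((y_pred - y_true) ** 2)

y_true = np.array([3.0, -0.5, 2.0])
y_pred = np.array([2.5, 0.0, 2.0])
loss = mse_loss(y_pred, y_true)  # (0.25 + 0.25 + 0.0) / 3
```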
What is Cross-Entropy Loss (Log Loss)?
L_log(ŷ, y) = −∑ᵢ ∑ₖ y_{i,k} log(ŷ_{i,k}), used for classification tasks. Technically, it quantifies the difference between two probability distributions (predicted and actual). The closer the predicted probability is to the actual value, the lower the loss. Analogy: It's like measuring how surprised you are when the actual outcome differs from your predicted probabilities.
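A NumPy sketch of the double sum, here additionally averaged over the batch (an assumed but common convention; the clipping is just to avoid log(0)):

```python
import numpy as np

def cross_entropy_loss(y_pred, y_true, eps=1e-12):
    """Negative log-likelihood of the true classes, averaged over samples.
    y_pred: (N, K) predicted probabilities; y_true: (N, K) one-hot labels."""
    y_pred = np.clip(y_pred, eps, 1.0)  # guard against log(0)
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))

y_true = np.array([[1.0, 0.0], [0.0, 1.0]])
y_pred = np.array([[0.9, 0.1], [0.2, 0.8]])
loss = cross_entropy_loss(y_pred, y_true)  # -(log 0.9 + log 0.8) / 2
```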
What is Gradient Descent?
θn+1 = θn − η∇L(θn), where η is the learning rate. Technically, it's an iterative optimization algorithm to find the minimum of a function by taking steps proportional to the negative of the gradient of the function at the current point. Analogy: Like rolling a ball down a hill; it takes steps in the direction of the steepest descent.
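The update rule as a tiny loop (the test function L(θ) = (θ − 3)² is an illustrative choice):

```python
def gradient_descent(grad, theta, eta=0.1, steps=100):
    """Repeatedly step against the gradient: theta <- theta - eta * grad(theta)."""
    for _ in range(steps):
        theta = theta - eta * grad(theta)
    return theta

# Minimize L(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3).
theta_min = gradient_descent(lambda t: 2 * (t - 3), theta=0.0)
```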
What is Momentum Optimization?
Keeping track of past gradients to improve gradient descent. Technically, it adds a fraction of the previous update to the current update, smoothing the optimization process and helping to accelerate convergence in the relevant direction. Analogy: Like a ball rolling down a hill gains momentum, helping it overcome small bumps.
What is the Momentum update rule?
mn = βmn−1 − η∇L(θn), then θn+1 = θn + mn, where β is the momentum. The parameter update adds the velocity vector mn, which carries a fraction of the previous update. Technically, it incorporates historical gradients into the current update direction. Analogy: It builds inertia into the gradient descent process.
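A sketch of one momentum update, applied to the same illustrative quadratic L(θ) = (θ − 3)² used above:

```python
def momentum_step(theta, m, grad, eta=0.1, beta=0.9):
    """One update: fold the new gradient into the velocity m, then move by m."""
    m = beta * m - eta * grad(theta)
    return theta + m, m

theta, m = 0.0, 0.0
for _ in range(200):
    theta, m = momentum_step(theta, m, lambda t: 2 * (t - 3))
```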
What is Momentum?
A hyperparameter in momentum optimization (β ∈ [0, 1]) that controls how much past gradient information is retained. The closer to 1, the more is retained, i.e. the less friction. Analogy: The higher the momentum, the less friction the ball experiences rolling down the hill.
What is Nesterov Accelerated Gradient?
Compute the gradient slightly ahead when using momentum. Technically, it evaluates the gradient at the 'anticipated' future position to make corrections before actually getting there. Analogy: Instead of blindly following the current gradient (like in standard momentum), NAG looks ahead in the direction where momentum is carrying us to correct its course.
What is Nesterov accelerated gradient update?
mn = βmn−1 − η∇L(θn + βmn−1), then θn+1 = θn + mn. The gradient is evaluated at the 'anticipated' future position θn + βmn−1. Technically, this adjusts the gradient calculation to account for where the momentum is already carrying the parameters. Analogy: It corrects its trajectory before getting there.
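The same sketch as plain momentum, but with the gradient taken at the look-ahead point (illustrative quadratic again):

```python
def nag_step(theta, m, grad, eta=0.1, beta=0.9):
    """Nesterov: evaluate the gradient at the look-ahead point theta + beta * m."""
    m = beta * m - eta * grad(theta + beta * m)
    return theta + m, m

# Minimize L(theta) = (theta - 3)^2 with gradient 2 * (theta - 3).
theta, m = 0.0, 0.0
for _ in range(200):
    theta, m = nag_step(theta, m, lambda t: 2 * (t - 3))
```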
What is AdaGrad (Adaptive Gradient)?
It introduces an adaptive learning rate, adjusted independently for each parameter by scaling with that parameter's accumulated sum of squared past gradients. It adapts the learning rate to each parameter, giving larger updates to infrequently updated parameters and smaller updates to frequently updated ones. Analogy: Like having individual volume knobs for each instrument in an orchestra, automatically adjusting to balance the sound.
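A per-parameter sketch: the two coordinates of the illustrative objective below see very different gradient scales, so AdaGrad gives them different effective learning rates (η = 0.5 is an arbitrary demo value):

```python
import numpy as np

def adagrad_step(theta, s, grad, eta=0.5, eps=1e-8):
    """Accumulate squared gradients in s; each parameter's step shrinks as its s grows."""
    g = grad(theta)
    s = s + g ** 2
    return theta - eta * g / (np.sqrt(s) + eps), s

# Minimize (theta0 - 3)^2 + 10 * (theta1 - 1)^2.
grad = lambda t: np.array([2 * (t[0] - 3), 20 * (t[1] - 1)])
theta, s = np.zeros(2), np.zeros(2)
for _ in range(500):
    theta, s = adagrad_step(theta, s, grad)
```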
What is RMSProp (Root Mean Square Propagation)?
Improves on AdaGrad by exponentially decaying the weight of old gradients. Technically, it addresses AdaGrad's ever-diminishing learning rates by discounting the influence of very early gradients. Analogy: It only considers recent gradients, like focusing on recent traffic conditions.
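The only change from the AdaGrad sketch is that the accumulator becomes an exponential moving average (ρ = 0.9 and η = 0.05 are arbitrary demo values):

```python
import numpy as np

def rmsprop_step(theta, s, grad, eta=0.05, rho=0.9, eps=1e-8):
    """Like AdaGrad, but s is a moving average, so old gradients fade out."""
    g = grad(theta)
    s = rho * s + (1 - rho) * g ** 2
    return theta - eta * g / (np.sqrt(s) + eps), s

# Minimize L(theta) = (theta - 3)^2.
theta, s = 0.0, 0.0
for _ in range(500):
    theta, s = rmsprop_step(theta, s, lambda t: 2 * (t - 3))
```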
What is Adam (Adaptive Moment Estimation)?
Momentum + RMSProp + some technical details (notably bias correction for the zero-initialized moving averages). Technically, it computes adaptive learning rates for each parameter, incorporating both momentum and scaling of gradients. It's computationally efficient and well-suited for problems with large parameter spaces. Analogy: Like a car with both acceleration (momentum) and traction control (RMSProp).
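A sketch combining the two previous pieces, with the standard bias correction (the hyperparameter values are common defaults, except the demo-sized η):

```python
import numpy as np

def adam_step(theta, m, v, t, grad, eta=0.05, beta1=0.9, beta2=0.999, eps=1e-8):
    """Momentum (m) + RMSProp-style scaling (v), with bias correction for the
    zero-initialized averages; t counts steps starting at 1."""
    g = grad(theta)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return theta - eta * m_hat / (np.sqrt(v_hat) + eps), m, v

# Minimize L(theta) = (theta - 3)^2.
theta, m, v = 0.0, 0.0, 0.0
for t in range(1, 1001):
    theta, m, v = adam_step(theta, m, v, t, lambda x: 2 * (x - 3))
```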
What is AdaMax?
Scale the past gradients differently: a variant of Adam that uses the infinity norm. Instead of averaging the squared gradients, it normalizes by a running maximum of past gradient magnitudes. Analogy: Consider only the absolute largest past gradients.
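A one-step sketch of the change relative to Adam: v is replaced by a decayed running max u, which needs no bias correction (hyperparameter values are common defaults):

```python
import numpy as np

def adamax_step(theta, m, u, t, grad, eta=0.002, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam with the second moment replaced by an infinity-norm accumulator u."""
    g = grad(theta)
    m = beta1 * m + (1 - beta1) * g
    u = np.maximum(beta2 * u, np.abs(g))  # decayed running max of |g|
    return theta - (eta / (1 - beta1 ** t)) * m / (u + eps), m, u

# One step on L(theta) = (theta - 3)^2 starting from theta = 0: the first
# gradient is -6, so u becomes 6 and the normalized step has size eta.
theta, m, u = adamax_step(0.0, 0.0, 0.0, 1, lambda x: 2 * (x - 3))
```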
What is Nadam?
Add Nesterov momentum. It combines Adam with Nesterov Accelerated Gradient for faster convergence. Analogy: combining both acceleration and pre-emptive correction
What is AdamW?
Add weight decay (related to L2 regularization), decoupled from the gradient-based optimization step: the decay is applied directly to the weights rather than folded into the loss gradient. It helps prevent overfitting by penalizing large weights, promoting simpler models. Analogy: Add the stabilizing regularization separately from the loss.
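A sketch of the decoupling: the decay term sits outside the adaptive scaling, so the only difference from an Adam step is a direct shrink of the weights (wd = 0.01 is an arbitrary demo value):

```python
import numpy as np

def adamw_step(theta, m, v, t, grad, eta=0.05, beta1=0.9, beta2=0.999,
               eps=1e-8, wd=0.01):
    """Adam, plus weight decay applied directly to the parameters,
    outside the adaptive gradient scaling."""
    g = grad(theta)  # gradient of the loss only; decay is NOT folded in here
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return theta - eta * (m_hat / (np.sqrt(v_hat) + eps) + wd * theta), m, v

# Two single steps on L(theta) = (theta - 3)^2 from theta = 2: they differ
# only by the decoupled decay term eta * wd * theta = 0.05 * 0.01 * 2 = 0.001.
grad = lambda x: 2 * (x - 3)
with_decay, _, _ = adamw_step(2.0, 0.0, 0.0, 1, grad, wd=0.01)
no_decay, _, _ = adamw_step(2.0, 0.0, 0.0, 1, grad, wd=0.0)
```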
What is the Learning Rate?
Affects training progress by determining the step size during optimization. If it's too high, updates might overshoot the minimum; if too low, training might take too long. Analogy: Like adjusting your stride length while walking.
What is Learning Rate Scheduling?
Adjust the learning rate during training, e.g., reduce it when progress plateaus. This can improve convergence. Analogy: Slowing down as you approach the destination.
What is Exponential Decay?
Gradually reduce η at each step; a common method of learning rate scheduling. Analogy: Gradually reducing the throttle.
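One common parameterization as a sketch (the function name and the 0.96/100 decay constants are illustrative): the rate is multiplied by decay_rate once every decay_steps steps.

```python
def exponential_decay(eta0, step, decay_rate=0.96, decay_steps=100):
    """eta(step) = eta0 * decay_rate ** (step / decay_steps)."""
    return eta0 * decay_rate ** (step / decay_steps)

eta_start = exponential_decay(0.1, 0)    # still 0.1
eta_later = exponential_decay(0.1, 100)  # 0.1 * 0.96
```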
What is Custom Learning Rate Schedule?
Change η by some other rule of your choosing. Analogy: A dynamic throttle.