Loss Functions
A loss function, L, quantifies the discrepancy between predicted values and actual values. It should be differentiable (at least almost everywhere) to allow gradient-based optimization, and bounded below (typically L \geq 0) so that minimization is well defined and training remains stable.
Mean Squared Error (MSE) for regression:
L_{MSE}(\hat{y}, y) = \frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2
MSE calculates the average of the squared differences between predicted and actual values. It is suitable for regression problems where the goal is to minimize the error between continuous values.
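As a concrete illustration, here is a minimal NumPy sketch of MSE (the function name and test values below are illustrative, not from the text):

import numpy as np

def mse(y_pred, y_true):
    # Average of the squared residuals over all N samples.
    return np.mean((y_pred - y_true) ** 2)

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.1, 1.9, 3.2])
print(mse(y_pred, y_true))  # (0.01 + 0.01 + 0.04) / 3 = 0.02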
Cross-entropy loss for classification:
L_{log}(\hat{y}, y) = - \frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} y_{i,k} \log(\hat{y}_{i,k})
Cross-entropy loss measures the dissimilarity between predicted probability distributions and actual distributions. It is commonly used in classification problems where the goal is to predict the correct class label.
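A minimal sketch of the same idea, assuming one-hot targets and rows of predicted class probabilities (the eps constant is an illustrative guard against log(0)):

import numpy as np

def cross_entropy(y_prob, y_onehot, eps=1e-12):
    # Average negative log-likelihood of the true classes.
    y_prob = np.clip(y_prob, eps, 1.0)  # guard against log(0)
    return -np.mean(np.sum(y_onehot * np.log(y_prob), axis=1))

y_onehot = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
y_prob = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
print(cross_entropy(y_prob, y_onehot))  # -(ln 0.7 + ln 0.8) / 2 ≈ 0.290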
Gradient Descent
An iterative optimization algorithm that moves toward a minimum of the loss by updating parameters in the direction opposite to the gradient:
\theta_{n+1} = \theta_n - \eta \nabla L(\theta_n)
Learning rate \eta is a crucial hyperparameter that determines the step size during optimization. Too small a learning rate slows convergence; too large a rate can overshoot the minimum and make the loss oscillate or diverge.
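To make the update rule concrete, here is a sketch of gradient descent on the toy loss L(\theta) = \theta^2, whose gradient is 2\theta (the starting point and \eta below are illustrative choices):

theta = 5.0  # initial parameter
eta = 0.1    # learning rate

for step in range(50):
    grad = 2.0 * theta          # gradient of L(theta) = theta^2
    theta = theta - eta * grad  # theta_{n+1} = theta_n - eta * grad

print(theta)  # close to the minimum at theta = 0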
Local Minima
Gradient descent may get trapped in local minima, which are suboptimal solutions. Stochastic gradient descent introduces randomness by estimating the gradient on random minibatches, and momentum adds inertia; both can help the optimizer escape shallow local minima.
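For example, here is a minibatch SGD sketch for linear regression; each step computes the gradient on a random subset of the data, so the updates are noisy (all data, names, and values are illustrative):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=1000)

theta = np.zeros(3)
eta, batch = 0.05, 32
for step in range(500):
    idx = rng.integers(0, len(X), size=batch)      # sample a random minibatch
    Xb, yb = X[idx], y[idx]
    grad = 2.0 * Xb.T @ (Xb @ theta - yb) / batch  # MSE gradient on the batch
    theta -= eta * grad

print(theta)  # approaches the true weights [2.0, -1.0, 0.5]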
Momentum Optimization
Incorporates past gradients to accelerate convergence and dampen oscillations:
m_n = \beta m_{n-1} - \eta \nabla L(\theta_n)
\theta_{n+1} = \theta_n + m_n
Hyperparameter \beta controls the momentum, determining the contribution of past gradients to the current update. A higher \beta lets the optimizer build up speed through flat regions and roll past small bumps in the loss surface.
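A sketch of the momentum update on the toy loss L(\theta) = \theta^2 (the values of \eta and \beta are illustrative):

theta, m = 5.0, 0.0
eta, beta = 0.1, 0.9

for step in range(200):
    grad = 2.0 * theta         # gradient of L(theta) = theta^2
    m = beta * m - eta * grad  # m_n = beta * m_{n-1} - eta * gradient
    theta = theta + m          # theta_{n+1} = theta_n + m_n

print(theta)  # close to 0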
Nesterov Accelerated Gradient
Computes the gradient slightly ahead to improve convergence:
m_n = \beta m_{n-1} - \eta \nabla L(\theta_n + \beta m_{n-1})
\theta_{n+1} = \theta_n + m_n
By evaluating the gradient at a point slightly ahead in the direction of momentum, Nesterov Accelerated Gradient can make more informed updates and converge faster than standard momentum optimization.
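The same toy example with the look-ahead gradient (values illustrative):

theta, m = 5.0, 0.0
eta, beta = 0.1, 0.9

def grad(t):
    return 2.0 * t  # gradient of L(theta) = theta^2

for step in range(200):
    m = beta * m - eta * grad(theta + beta * m)  # gradient at the look-ahead point
    theta = theta + m

print(theta)  # close to 0, typically a bit faster than plain momentum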
AdaGrad (Adaptive Gradient)
Adjusts the learning rate for each parameter based on the accumulated history of squared gradients: parameters whose gradients have been consistently large receive smaller updates, while parameters with small or infrequent gradients receive relatively larger ones. This evens out progress across parameters and damps oscillations, although the accumulated sum only grows, so the effective learning rate can eventually become too small.
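A sketch of the AdaGrad update on a two-parameter quadratic whose directions have very different curvatures (all values illustrative):

import numpy as np

theta = np.array([5.0, 5.0])
s = np.zeros(2)      # accumulated squared gradients
eta, eps = 1.0, 1e-8

for step in range(500):
    grad = np.array([20.0 * theta[0], 0.2 * theta[1]])  # steep vs. flat direction
    s += grad ** 2                            # accumulation never decays
    theta -= eta * grad / (np.sqrt(s) + eps)  # per-parameter step size

print(theta)  # both coordinates approach 0 despite the different scales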
RMSProp (Root Mean Square Propagation)
Improves on AdaGrad by replacing the ever-growing sum with an exponentially decaying average of squared gradients. Because old gradients fade out, the effective learning rate does not shrink toward zero, which helps maintain progress throughout training.
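The same sketch with RMSProp's exponentially decaying average (rho is the decay rate; values illustrative):

import numpy as np

theta = np.array([5.0, 5.0])
s = np.zeros(2)
eta, rho, eps = 0.1, 0.9, 1e-8

for step in range(500):
    grad = np.array([20.0 * theta[0], 0.2 * theta[1]])
    s = rho * s + (1.0 - rho) * grad ** 2     # old gradients fade exponentially
    theta -= eta * grad / (np.sqrt(s) + eps)  # effective step size stops shrinking

print(theta)  # hovers near the minimum instead of stalling far from it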
Adam (and friends)
Combines momentum (an exponentially decaying average of past gradients) with RMSProp-style scaling (an exponentially decaying average of past squared gradients), plus bias correction of both estimates; a sketch of the update follows the list of variants below.
AdaMax: Replaces the L2 norm used to scale past gradients with the infinity norm (a running maximum), which can be more robust to outlier gradients.
Nadam: Adds Nesterov momentum to Adam for improved convergence.
AdamW: Decouples weight decay from the adaptive gradient update, rather than folding L2 regularization into the gradients, to regularize more effectively and help prevent overfitting.
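A sketch of the core Adam update under the same illustrative setup (beta1, beta2, and eps are the conventional hyperparameter names):

import numpy as np

theta = np.array([5.0, 5.0])
m, v = np.zeros(2), np.zeros(2)
eta, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8

for t in range(1, 501):
    grad = np.array([20.0 * theta[0], 0.2 * theta[1]])
    m = beta1 * m + (1 - beta1) * grad       # first moment (momentum-like)
    v = beta2 * v + (1 - beta2) * grad ** 2  # second moment (RMSProp-like)
    m_hat = m / (1 - beta1 ** t)             # bias correction for the zero init
    v_hat = v / (1 - beta2 ** t)
    theta -= eta * m_hat / (np.sqrt(v_hat) + eps)

print(theta)  # both coordinates end up near 0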
Learning Rate and Loss Curves
Choosing an appropriate learning rate \eta is crucial for successful training. Plotting the loss over training steps shows whether \eta is too large (the loss oscillates or diverges), too small (the loss decreases very slowly), or about right, and comparing training and validation loss curves helps identify overfitting or underfitting.
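A sketch of recording and plotting a loss curve, assuming matplotlib is available (the training loop is the earlier toy example):

import matplotlib.pyplot as plt

losses = []
theta, eta = 5.0, 0.1
for step in range(100):
    grad = 2.0 * theta
    theta -= eta * grad
    losses.append(theta ** 2)  # record L(theta) = theta^2 after each update

plt.plot(losses)
plt.xlabel("step")
plt.ylabel("loss")
plt.show()  # a smooth, steady decrease suggests eta is well chosen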
Learning Rate Scheduling
Strategies to adjust the learning rate during training (a sketch follows this list):
Reduce \eta when learning plateaus (performance scheduling) to fine-tune the model and improve performance.
Gradually reduce \eta at each step (e.g., exponential decay) to transition smoothly from exploration to exploitation.
Change \eta by some other rule suited to the characteristics of the problem.
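A sketch of the first two strategies, with hypothetical function names and constants of our own choosing:

def exponential_decay(eta0, step, decay=0.96, every=100):
    # Smoothly scale eta0 down by one factor of decay per 'every' steps.
    return eta0 * decay ** (step / every)

def reduce_on_plateau(eta, loss_history, patience=5, factor=0.5):
    # Reduce eta when the recent best loss is no better than the earlier best.
    if len(loss_history) > patience:
        recent_best = min(loss_history[-patience:])
        earlier_best = min(loss_history[:-patience])
        if recent_best >= earlier_best:
            return eta * factor
    return eta

print(exponential_decay(0.1, 1000))  # 0.1 * 0.96**10 ≈ 0.0665
print(reduce_on_plateau(0.1, [1.0, 0.9, 0.85, 0.86, 0.86, 0.86, 0.86, 0.86]))  # 0.05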