Loss Functions
A loss function, L, quantifies the discrepancy between predicted values and actual values. It should be differentiable (at least almost everywhere) to allow gradient-based optimization, and bounded below (typically L \geq 0) so that minimization is well defined and training remains stable.
Mean Squared Error (MSE) for regression:
L_{MSE}(\hat{y}, y) = \frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2
MSE calculates the average of the squared differences between predicted and actual values. It is suitable for regression problems where the goal is to minimize the error between continuous values.
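As a concrete illustration, here is a minimal NumPy sketch of MSE (the function name and test values below are illustrative, not from the text):

import numpy as np

def mse(y_pred, y_true):
    # Average of the squared residuals over all N samples.
    return np.mean((y_pred - y_true) ** 2)

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.1, 1.9, 3.2])
print(mse(y_pred, y_true))  # (0.01 + 0.01 + 0.04) / 3 = 0.02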
Cross-entropy loss for classification:
L_{log}(\hat{y}, y) = - \frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} y_{i,k} \log(\hat{y}_{i,k})
Cross-entropy loss measures the dissimilarity between predicted probability distributions and actual distributions. It is commonly used in classification problems where the goal is to predict the correct class label.
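A minimal sketch of the same idea, assuming one-hot targets and rows of predicted class probabilities (the eps constant is an illustrative guard against log(0)):

import numpy as np

def cross_entropy(y_prob, y_onehot, eps=1e-12):
    # Average negative log-likelihood of the true classes.
    y_prob = np.clip(y_prob, eps, 1.0)  # guard against log(0)
    return -np.mean(np.sum(y_onehot * np.log(y_prob), axis=1))

y_onehot = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
y_prob = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
print(cross_entropy(y_prob, y_onehot))  # -(ln 0.7 + ln 0.8) / 2 ≈ 0.290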
Gradient Descent
An iterative optimization algorithm that moves toward a minimum of the loss by updating parameters in the direction opposite to the gradient:
\theta_{n+1} = \theta_n - \eta \nabla L(\theta_n)
Learning rate \eta is a crucial hyperparameter that determines the step size during optimization. Too small a learning rate slows convergence; too large a rate can overshoot the minimum and make the loss oscillate or diverge.
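To make the update rule concrete, here is a sketch of gradient descent on the toy loss L(\theta) = \theta^2, whose gradient is 2\theta (the starting point and \eta below are illustrative choices):

theta = 5.0  # initial parameter
eta = 0.1    # learning rate

for step in range(50):
    grad = 2.0 * theta          # gradient of L(theta) = theta^2
    theta = theta - eta * grad  # theta_{n+1} = theta_n - eta * grad

print(theta)  # close to the minimum at theta = 0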
Local Minima
Gradient descent may get trapped in local minima, which are suboptimal solutions. Stochastic gradient descent introduces randomness by estimating the gradient on random minibatches, and momentum adds inertia; both can help the optimizer escape shallow local minima.
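For example, here is a minibatch SGD sketch for linear regression; each step computes the gradient on a random subset of the data, so the updates are noisy (all data, names, and values are illustrative):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=1000)

theta = np.zeros(3)
eta, batch = 0.05, 32
for step in range(500):
    idx = rng.integers(0, len(X), size=batch)      # sample a random minibatch
    Xb, yb = X[idx], y[idx]
    grad = 2.0 * Xb.T @ (Xb @ theta - yb) / batch  # MSE gradient on the batch
    theta -= eta * grad

print(theta)  # approaches the true weights [2.0, -1.0, 0.5]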
Momentum Optimization
Incorporates past gradients to accelerate convergence and dampen oscillations:
m_n = \beta m_{n-1} - \eta \nabla L(\theta_n)
\theta_{n+1} = \theta_n + m_n
Hyperparameter \beta controls the momentum, determining the contribution of past gradients to the current update. A higher \beta lets the optimizer build up speed through flat regions and roll past small bumps in the loss surface.
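A sketch of the momentum update on the toy loss L(\theta) = \theta^2 (the values of \eta and \beta are illustrative):

theta, m = 5.0, 0.0
eta, beta = 0.1, 0.9

for step in range(200):
    grad = 2.0 * theta         # gradient of L(theta) = theta^2
    m = beta * m - eta * grad  # m_n = beta * m_{n-1} - eta * gradient
    theta = theta + m          # theta_{n+1} = theta_n + m_n

print(theta)  # close to 0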
Nesterov Accelerated Gradient
Computes the gradient slightly ahead to improve convergence:
m_n = \beta m_{n-1} - \eta \nabla L(\theta_n + \beta m_{n-1})
\theta_{n+1} = \theta_n + m_n
By evaluating the gradient at a point slightly ahead in the direction of momentum, Nesterov Accelerated Gradient can make more informed updates and converge faster than standard momentum optimization.
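The same toy example with the look-ahead gradient (values illustrative):

theta, m = 5.0, 0.0
eta, beta = 0.1, 0.9

def grad(t):
    return 2.0 * t  # gradient of L(theta) = theta^2

for step in range(200):
    m = beta * m - eta * grad(theta + beta * m)  # gradient at the look-ahead point
    theta = theta + m

print(theta)  # close to 0, typically a bit faster than plain momentum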
AdaGrad (Adaptive Gradient)
Adjusts the learning rate for each parameter based on the accumulated history of squared gradients: parameters whose gradients have been consistently large receive smaller updates, while parameters with small or infrequent gradients receive relatively larger ones. This evens out progress across parameters and damps oscillations, although the accumulated sum only grows, so the effective learning rate can eventually become too small.
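A sketch of the AdaGrad update on a two-parameter quadratic whose directions have very different curvatures (all values illustrative):

import numpy as np

theta = np.array([5.0, 5.0])
s = np.zeros(2)      # accumulated squared gradients
eta, eps = 1.0, 1e-8

for step in range(500):
    grad = np.array([20.0 * theta[0], 0.2 * theta[1]])  # steep vs. flat direction
    s += grad ** 2                            # accumulation never decays
    theta -= eta * grad / (np.sqrt(s) + eps)  # per-parameter step size

print(theta)  # both coordinates approach 0 despite the different scales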
RMSProp (Root Mean Square Propagation)
Improves on AdaGrad by replacing the ever-growing sum with an exponentially decaying average of squared gradients. Because old gradients fade out, the effective learning rate does not shrink toward zero, which helps maintain progress throughout training.
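The same sketch with RMSProp's exponentially decaying average (rho is the decay rate; values illustrative):

import numpy as np

theta = np.array([5.0, 5.0])
s = np.zeros(2)
eta, rho, eps = 0.1, 0.9, 1e-8

for step in range(500):
    grad = np.array([20.0 * theta[0], 0.2 * theta[1]])
    s = rho * s + (1.0 - rho) * grad ** 2     # old gradients fade exponentially
    theta -= eta * grad / (np.sqrt(s) + eps)  # effective step size stops shrinking

print(theta)  # hovers near the minimum instead of stalling far from it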
Adam (and friends)
Combines momentum (an exponentially decaying average of past gradients) with RMSProp-style scaling (an exponentially decaying average of past squared gradients), plus bias correction of both estimates; a sketch of the update follows the list of variants below.
AdaMax: Replaces the L2 norm used to scale past gradients with the infinity norm (a running maximum), which can be more robust to outlier gradients.
Nadam: Adds Nesterov momentum to Adam for improved convergence.
AdamW: Decouples weight decay from the adaptive gradient update, rather than folding L2 regularization into the gradients, to regularize more effectively and help prevent overfitting.
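A sketch of the core Adam update under the same illustrative setup (beta1, beta2, and eps are the conventional hyperparameter names):

import numpy as np

theta = np.array([5.0, 5.0])
m, v = np.zeros(2), np.zeros(2)
eta, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8

for t in range(1, 501):
    grad = np.array([20.0 * theta[0], 0.2 * theta[1]])
    m = beta1 * m + (1 - beta1) * grad       # first moment (momentum-like)
    v = beta2 * v + (1 - beta2) * grad ** 2  # second moment (RMSProp-like)
    m_hat = m / (1 - beta1 ** t)             # bias correction for the zero init
    v_hat = v / (1 - beta2 ** t)
    theta -= eta * m_hat / (np.sqrt(v_hat) + eps)

print(theta)  # both coordinates end up near 0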
Learning Rate and Loss Curves
Choosing an appropriate learning rate \eta is crucial for successful training. Plotting the loss over training steps shows whether \eta is too large (the loss oscillates or diverges), too small (the loss decreases very slowly), or about right, and comparing training and validation loss curves helps identify overfitting or underfitting.
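A sketch of recording and plotting a loss curve, assuming matplotlib is available (the training loop is the earlier toy example):

import matplotlib.pyplot as plt

losses = []
theta, eta = 5.0, 0.1
for step in range(100):
    grad = 2.0 * theta
    theta -= eta * grad
    losses.append(theta ** 2)  # record L(theta) = theta^2 after each update

plt.plot(losses)
plt.xlabel("step")
plt.ylabel("loss")
plt.show()  # a smooth, steady decrease suggests eta is well chosen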
Learning Rate Scheduling
Strategies to adjust the learning rate during training (a sketch follows this list):
Reduce \eta when learning plateaus (performance scheduling) to fine-tune the model and improve performance.
Gradually reduce \eta at each step (e.g., exponential decay) to transition smoothly from exploration to exploitation.
Change \eta by some other rule suited to the characteristics of the problem.
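A sketch of the first two strategies, with hypothetical function names and constants of our own choosing:

def exponential_decay(eta0, step, decay=0.96, every=100):
    # Smoothly scale eta0 down by one factor of decay per 'every' steps.
    return eta0 * decay ** (step / every)

def reduce_on_plateau(eta, loss_history, patience=5, factor=0.5):
    # Reduce eta when the recent best loss is no better than the earlier best.
    if len(loss_history) > patience:
        recent_best = min(loss_history[-patience:])
        earlier_best = min(loss_history[:-patience])
        if recent_best >= earlier_best:
            return eta * factor
    return eta

print(exponential_decay(0.1, 1000))  # 0.1 * 0.96**10 ≈ 0.0665
print(reduce_on_plateau(0.1, [1.0, 0.9, 0.85, 0.86, 0.86, 0.86, 0.86, 0.86]))  # 0.05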