Week 3
Model Parameters vs Hyper-Parameters
In deep learning, we train a model to find the optimal set of weights that meets our performance target
These weights are called model parameters
The model parameters are functions of training data, model architecture, and many other factors and cannot be determined manually.
To find the optimal model parameters, we often need to decide on things like which activation functions to use, the number of neurons, the number of layers, the learning rate, the dropout rate, L1/L2 regularization, the number of epochs, etc.
These are called hyper-parameters and are determined manually by the model builder
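As a minimal sketch (assuming TensorFlow/Keras; the input size, layer sizes, and rates below are placeholder choices), the hyper-parameters are set by hand, while the model parameters (weights and biases) are learned by fit():

```python
from tensorflow import keras

# Hyper-parameters: chosen manually by the model builder (example values)
n_hidden_layers = 2
n_neurons = 64
learning_rate = 1e-3
n_epochs = 20

# Model architecture built from the hyper-parameters
model = keras.Sequential([keras.Input(shape=(784,))])
for _ in range(n_hidden_layers):
    model.add(keras.layers.Dense(n_neurons, activation="relu"))
model.add(keras.layers.Dense(10, activation="softmax"))

model.compile(optimizer=keras.optimizers.Adam(learning_rate=learning_rate),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Model parameters (weights/biases) would be learned from the data here:
# model.fit(X_train, y_train, epochs=n_epochs, validation_data=(X_valid, y_valid))
```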
Activation functions
There are some general guidelines on which of these to use
Generally, we avoid using sigmoid/tanh in hidden layers
ReLU is a good default, but we can experiment with Leaky ReLU, ELU and SELU
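A small sketch of these guidelines (Keras assumed; layer sizes are placeholders): ReLU as the default hidden activation, with Leaky ReLU, ELU, and SELU as drop-in alternatives to experiment with:

```python
from tensorflow import keras

model = keras.Sequential([
    keras.Input(shape=(784,)),                      # placeholder input size
    keras.layers.Dense(128, activation="relu"),     # ReLU: a good default
    keras.layers.Dense(128),
    keras.layers.LeakyReLU(),                       # Leaky ReLU applied after the linear layer
    keras.layers.Dense(128, activation="elu"),      # ELU
    keras.layers.Dense(128, activation="selu",      # SELU pairs with LeCun initialization
                       kernel_initializer="lecun_normal"),
    keras.layers.Dense(10, activation="softmax"),   # softmax/sigmoid stays in the output layer
])
```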
Overfitting
A model experiences this when it performs well on training data but generalizes poorly to unseen data samples.
Limiting the depth and width may help prevent this, but it is not always the best approach. In many cases it is better to use a large neural network and apply appropriate regularization strategies.
Regularization strategies
Early stopping
L1 Parameter Regularization
L2 Parameter Regularization
Dropouts
Early Stopping
Does not require a change to the loss function or network setup
Use a validation data set to compute the loss after every epoch and stop training when loss value stops improving.
Can be efficient since we stop training early instead of running all of the planned epochs
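A hedged sketch of early stopping using the Keras EarlyStopping callback (the patience value is just an example):

```python
from tensorflow import keras

early_stopping = keras.callbacks.EarlyStopping(
    monitor="val_loss",          # validation loss computed after every epoch
    patience=5,                  # stop after 5 epochs without improvement
    restore_best_weights=True)   # roll back to the best weights seen so far

# model.fit(X_train, y_train, epochs=100,
#           validation_data=(X_valid, y_valid),
#           callbacks=[early_stopping])
```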
L1/L2 Regularization
Works by penalizing larger weight values
Smaller weights imply a simpler network
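For example, in Keras the penalty is attached per layer via kernel_regularizer (the 0.01 factors are placeholder values):

```python
from tensorflow import keras

l2_layer = keras.layers.Dense(64, activation="relu",
                              kernel_regularizer=keras.regularizers.l2(0.01))
l1_layer = keras.layers.Dense(64, activation="relu",
                              kernel_regularizer=keras.regularizers.l1(0.01))
# Both penalties at once:
l1_l2_layer = keras.layers.Dense(64, activation="relu",
                                 kernel_regularizer=keras.regularizers.l1_l2(l1=0.01, l2=0.01))
```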
Dropouts
One of the most effective and commonly used regularization techniques for neural networks
Applied to layers; you can choose which layers to apply dropout to
During training:
Each forward pass, it will randomly select some of the neurons and set their outputs to 0
Weight updates on the backward pass are also not applied to the selected neurons
Forces a neural network to learn with many different random subsets of the network.
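A minimal Keras sketch: Dropout layers inserted after the hidden layers we choose to regularize (the rates and layer sizes are example values):

```python
from tensorflow import keras

model = keras.Sequential([
    keras.Input(shape=(784,)),
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dropout(0.5),   # ~50% of this layer's outputs zeroed on each training pass
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dropout(0.3),   # a lower rate on the second hidden layer
    keras.layers.Dense(10, activation="softmax"),
])
# Dropout is only active during training; at inference time all neurons are used.
```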
Batch Normalization
Can also act as a regularizer, reducing the need for other regularization techniques
Can be applied before or after each layer’s activation function
Model will generally converge faster using this
Larger learning rates can be used with this
Can be used as a replacement for input normalization
Use after the Convolutional layer instead of the Dropout layer
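A sketch of both placements in a small Keras CNN (layer sizes are placeholders): batch normalization before the activation in the first block, after the activation in the second, used after the convolutional layers instead of Dropout:

```python
from tensorflow import keras

model = keras.Sequential([
    keras.Input(shape=(28, 28, 1)),
    keras.layers.Conv2D(32, 3, padding="same"),
    keras.layers.BatchNormalization(),        # before the activation
    keras.layers.Activation("relu"),
    keras.layers.Conv2D(64, 3, padding="same", activation="relu"),
    keras.layers.BatchNormalization(),        # after the activation
    keras.layers.MaxPooling2D(),
    keras.layers.Flatten(),
    keras.layers.Dense(10, activation="softmax"),
])
```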
Momentum Optimization
Stochastic gradient descent with momentum can help overcome local minima and reach the global minimum faster
A variation of this is Nesterov Accelerated Gradient, which measures the gradient of the loss function not at the current position but slightly ahead in the direction of the momentum
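In Keras this corresponds to the momentum and nesterov options of SGD (the values shown are typical examples):

```python
from tensorflow import keras

sgd_momentum = keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)
sgd_nesterov = keras.optimizers.SGD(learning_rate=0.01, momentum=0.9, nesterov=True)
# model.compile(optimizer=sgd_nesterov, loss="sparse_categorical_crossentropy")
```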
RMSProp
In RMSProp, the decay in the learning rate is faster for steeper dimensions than for dimensions with a gentler slope.
It accumulates only the gradients from the most recent iterations
It helps point the resulting updates more directly toward the global optimum
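In Keras (example values), rho controls how quickly older squared gradients are forgotten, i.e. how much weight the most recent iterations get:

```python
from tensorflow import keras

rmsprop = keras.optimizers.RMSprop(learning_rate=0.001, rho=0.9)
# model.compile(optimizer=rmsprop, loss="sparse_categorical_crossentropy")
```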
Adam (Adaptive Momentum Estimation)
Takes the best of both worlds from Momentum and RMSProp
Gets the speed from momentum and the ability to adapt gradients in different directions from RMSProp
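In Keras (default-style values shown), beta_1 plays the momentum role and beta_2 the RMSProp-style gradient-scaling role:

```python
from tensorflow import keras

adam = keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)
# model.compile(optimizer=adam, loss="sparse_categorical_crossentropy")
```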
Choosing Optimizers
Different optimizers differ in convergence speed and convergence quality
Adaptive optimization methods (including RMSProp and Adam) adapt the gradient descent rate as they move towards optimum points
Learning Rate Schedule
When training deep neural networks, it is often useful to reduce the learning rate as the training progresses
We can use pre-defined learning rate schedules together with SGD (see the sketch after the list below)
Common learning rate schedules
Exponential Decay
Polynomial Decay
Piecewise Constant Decay
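A sketch of an exponential-decay schedule passed to SGD via the Keras schedules API (rates and step counts are placeholder values); PolynomialDecay and PiecewiseConstantDecay are used the same way:

```python
from tensorflow import keras

lr_schedule = keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.01,
    decay_steps=10_000,   # decay applied every 10,000 training steps
    decay_rate=0.96)

optimizer = keras.optimizers.SGD(learning_rate=lr_schedule)
# model.compile(optimizer=optimizer, loss="sparse_categorical_crossentropy")
```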
Finding the Best Hyper-Parameters (Manual)
Select a few hyper-parameters to try
Tedious, slow and non-scalable
Finding the Best Hyper-Parameters (Grid search)
Try all combinations of specified hyper-parameter values
Brute-force method that does not take advantage of experience from previous trials
Optimal values may not be present in the search range
Finding the Best Hyper-Parameters (Random Search)
Randomly sample the values of each hyper-parameter to try
Faster than grid search, but may miss out on the optimal values for the hyper-parameters
Finding the Best Hyper-Parameters (Bayesian Optimization)
Aims to train the model as few times as possible to avoid costs
Spends more time exploring the path that seems promising (learns from previous trials)
Finding the Best Hyper-Parameters (Hyperband)
Compares the performances of different trials
Terminates badly performing trials early, while running the good trials for longer
Faster than Bayesian Optimization
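A sketch using Keras Tuner (one of the tools listed below); Hyperband terminates weak trials early and allocates more epochs to promising ones. The search space, max_epochs, directory, and project name here are placeholder choices:

```python
import keras_tuner
from tensorflow import keras

def build_model(hp):
    # Hyper-parameters sampled by the tuner
    units = hp.Int("units", min_value=32, max_value=256, step=32)
    lr = hp.Choice("learning_rate", values=[1e-2, 1e-3, 1e-4])

    model = keras.Sequential([
        keras.Input(shape=(784,)),
        keras.layers.Dense(units, activation="relu"),
        keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=lr),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

tuner = keras_tuner.Hyperband(build_model, objective="val_accuracy",
                              max_epochs=20, directory="tuning_dir",
                              project_name="week3_demo")
# tuner.search(X_train, y_train, validation_data=(X_valid, y_valid))
# best_model = tuner.get_best_models(num_models=1)[0]
```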
Hyper-parameters Tuning Tools
Ray Tune
Optuna
Hyperopt
Keras Tuner
Google Vizier
AWS SageMaker