Week 3
Model Parameters vs Hyper-Parameters
In deep learning, we train a model to find the optimal set of weights that meets our performance target
These weights are called model parameters
The model parameters are functions of training data, model architecture, and many other factors and cannot be determined manually.
To find the optimal model parameters, we often need to decide on things like which activation functions to use, the number of neurons, the number of layers, the learning rate, the dropout rate, L1/L2 regularization, the number of epochs, etc.
These are called hyper-parameters and are determined manually by the model builder
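As a minimal sketch (assuming TensorFlow/Keras; the input size, layer sizes, and rates below are placeholder choices), the hyper-parameters are set by hand, while the model parameters (weights and biases) are learned by fit():

```python
from tensorflow import keras

# Hyper-parameters: chosen manually by the model builder (example values)
n_hidden_layers = 2
n_neurons = 64
learning_rate = 1e-3
n_epochs = 20

# Model architecture built from the hyper-parameters
model = keras.Sequential([keras.Input(shape=(784,))])
for _ in range(n_hidden_layers):
    model.add(keras.layers.Dense(n_neurons, activation="relu"))
model.add(keras.layers.Dense(10, activation="softmax"))

model.compile(optimizer=keras.optimizers.Adam(learning_rate=learning_rate),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Model parameters (weights/biases) would be learned from the data here:
# model.fit(X_train, y_train, epochs=n_epochs, validation_data=(X_valid, y_valid))
```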
Activation functions
There are some general guidelines on which of these to use
Generally, we avoid using sigmoid/tanh in hidden layers
ReLU is a good default, but we can experiment with Leaky ReLU, ELU and SELU
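A small sketch of these guidelines (Keras assumed; layer sizes are placeholders): ReLU as the default hidden activation, with Leaky ReLU, ELU, and SELU as drop-in alternatives to experiment with:

```python
from tensorflow import keras

model = keras.Sequential([
    keras.Input(shape=(784,)),                      # placeholder input size
    keras.layers.Dense(128, activation="relu"),     # ReLU: a good default
    keras.layers.Dense(128),
    keras.layers.LeakyReLU(),                       # Leaky ReLU applied after the linear layer
    keras.layers.Dense(128, activation="elu"),      # ELU
    keras.layers.Dense(128, activation="selu",      # SELU pairs with LeCun initialization
                       kernel_initializer="lecun_normal"),
    keras.layers.Dense(10, activation="softmax"),   # softmax/sigmoid stays in the output layer
])
```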
Overfitting
A model experiences this when it performs well on training data but generalizes poorly to unseen data samples.
Limiting the depth and width may help prevent this, but it is not always the best approach. In many cases it is better to use a large neural network and apply appropriate regularization strategies.
Regularization strategies
Early stopping
L1 Parameter Regularization
L2 Parameter Regularization
Dropouts
Early Stopping
Does not require a change to the loss function or network setup
Use a validation data set to compute the loss after every epoch and stop training when loss value stops improving.
Can be efficient since we stop training early instead of running all of the planned epochs
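A hedged sketch of early stopping using the Keras EarlyStopping callback (the patience value is just an example):

```python
from tensorflow import keras

early_stopping = keras.callbacks.EarlyStopping(
    monitor="val_loss",          # validation loss computed after every epoch
    patience=5,                  # stop after 5 epochs without improvement
    restore_best_weights=True)   # roll back to the best weights seen so far

# model.fit(X_train, y_train, epochs=100,
#           validation_data=(X_valid, y_valid),
#           callbacks=[early_stopping])
```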
L1/L2 Regularization
Works by penalizing larger weight values
Smaller weights imply a simpler network
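For example, in Keras the penalty is attached per layer via kernel_regularizer (the 0.01 factors are placeholder values):

```python
from tensorflow import keras

l2_layer = keras.layers.Dense(64, activation="relu",
                              kernel_regularizer=keras.regularizers.l2(0.01))
l1_layer = keras.layers.Dense(64, activation="relu",
                              kernel_regularizer=keras.regularizers.l1(0.01))
# Both penalties at once:
l1_l2_layer = keras.layers.Dense(64, activation="relu",
                                 kernel_regularizer=keras.regularizers.l1_l2(l1=0.01, l2=0.01))
```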
Dropouts
One of the most effective and commonly used regularization techniques for neural networks
Applied to layers; you can choose which layers to apply dropout to
During training:
Each forward pass, it will randomly select some of the neurons and set their outputs to 0
Weight updates on the backward pass are also not applied to the selected neurons
Forces a neural network to learn with many different random subsets of the network.
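A minimal Keras sketch: Dropout layers inserted after the hidden layers we choose to regularize (the rates and layer sizes are example values):

```python
from tensorflow import keras

model = keras.Sequential([
    keras.Input(shape=(784,)),
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dropout(0.5),   # ~50% of this layer's outputs zeroed on each training pass
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dropout(0.3),   # a lower rate on the second hidden layer
    keras.layers.Dense(10, activation="softmax"),
])
# Dropout is only active during training; at inference time all neurons are used.
```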
Batch Normalization
Can also act as a regularizer, reducing the need for other regularization techniques
Can be applied before or after each layer’s activation function
Model will generally converge faster using this
Larger learning rates can be used with this
Can be used as a replacement for input normalization
Use after the Convolutional layer instead of the Dropout layer
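A sketch of both placements in a small Keras CNN (layer sizes are placeholders): batch normalization before the activation in the first block, after the activation in the second, used after the convolutional layers instead of Dropout:

```python
from tensorflow import keras

model = keras.Sequential([
    keras.Input(shape=(28, 28, 1)),
    keras.layers.Conv2D(32, 3, padding="same"),
    keras.layers.BatchNormalization(),        # before the activation
    keras.layers.Activation("relu"),
    keras.layers.Conv2D(64, 3, padding="same", activation="relu"),
    keras.layers.BatchNormalization(),        # after the activation
    keras.layers.MaxPooling2D(),
    keras.layers.Flatten(),
    keras.layers.Dense(10, activation="softmax"),
])
```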
Momentum Optimization
Stochastic gradient descent with momentum can help overcome local minima and reach the global minimum faster
A variation of this is Nesterov Accelerated Gradient, which measures the gradient of the loss function not at the current position but slightly ahead in the direction of the momentum
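In Keras this corresponds to the momentum and nesterov options of SGD (the values shown are typical examples):

```python
from tensorflow import keras

sgd_momentum = keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)
sgd_nesterov = keras.optimizers.SGD(learning_rate=0.01, momentum=0.9, nesterov=True)
# model.compile(optimizer=sgd_nesterov, loss="sparse_categorical_crossentropy")
```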
RMSProp
In RMSProp, the decay in the learning rate is faster for steeper dimensions than for dimensions with a gentler slope.
It accumulates only the gradients from the most recent iterations
It helps point the resulting updates more directly toward the global optimum
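In Keras (example values), rho controls how quickly older squared gradients are forgotten, i.e. how much weight the most recent iterations get:

```python
from tensorflow import keras

rmsprop = keras.optimizers.RMSprop(learning_rate=0.001, rho=0.9)
# model.compile(optimizer=rmsprop, loss="sparse_categorical_crossentropy")
```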
Adam (Adaptive Momentum Estimation)
Takes the best of both worlds from Momentum and RMSProp
Gets the speed from momentum and the ability to adapt gradients in different directions from RMSProp
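In Keras (default-style values shown), beta_1 plays the momentum role and beta_2 the RMSProp-style gradient-scaling role:

```python
from tensorflow import keras

adam = keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)
# model.compile(optimizer=adam, loss="sparse_categorical_crossentropy")
```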
Choosing Optimizers
Different optimizers differ in convergence speed and convergence quality
Adaptive optimization methods (including RMSProp and Adam) adapt the gradient descent rate as they move towards optimum points
Learning Rate Schedule
When training deep neural networks, it is often useful to reduce the learning rate as the training progresses
We can use pre-defined learning rate schedules together with SGD (see the sketch after the list below)
Common learning rate schedules
Exponential Decay
Polynomial Decay
Piecewise Constant Decay
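A sketch of an exponential-decay schedule passed to SGD via the Keras schedules API (rates and step counts are placeholder values); PolynomialDecay and PiecewiseConstantDecay are used the same way:

```python
from tensorflow import keras

lr_schedule = keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.01,
    decay_steps=10_000,   # decay applied every 10,000 training steps
    decay_rate=0.96)

optimizer = keras.optimizers.SGD(learning_rate=lr_schedule)
# model.compile(optimizer=optimizer, loss="sparse_categorical_crossentropy")
```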
Finding the Best Hyper-Parameters (Manual)
Select a few hyper-parameters to try
Tedious, slow and non-scalable
Finding the Best Hyper-Parameters (Grid search)
Try all combinations of specified hyper-parameter values
Brute-force method that does not take advantage of experience from previous trials
Optimal values may not be present in the search range
Finding the Best Hyper-Parameters (Random Search)
Randomly sample the values of each hyper-parameter to try
Faster than grid search, but may miss out on the optimal values for the hyper-parameters
Finding the Best Hyper-Parameters (Bayesian Optimization)
Aims to train the model as few times as possible to avoid costs
Spends more time exploring the path that seems promising (learns from previous trials)
Finding the Best Hyper-Parameters (Hyperband)
Compares the performances of different trials
Terminates badly performing trials early, while running the good trials for longer
Faster than Bayesian Optimization
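A sketch using Keras Tuner (one of the tools listed below); Hyperband terminates weak trials early and allocates more epochs to promising ones. The search space, max_epochs, directory, and project name here are placeholder choices:

```python
import keras_tuner
from tensorflow import keras

def build_model(hp):
    # Hyper-parameters sampled by the tuner
    units = hp.Int("units", min_value=32, max_value=256, step=32)
    lr = hp.Choice("learning_rate", values=[1e-2, 1e-3, 1e-4])

    model = keras.Sequential([
        keras.Input(shape=(784,)),
        keras.layers.Dense(units, activation="relu"),
        keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=lr),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

tuner = keras_tuner.Hyperband(build_model, objective="val_accuracy",
                              max_epochs=20, directory="tuning_dir",
                              project_name="week3_demo")
# tuner.search(X_train, y_train, validation_data=(X_valid, y_valid))
# best_model = tuner.get_best_models(num_models=1)[0]
```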
Hyper-parameters Tuning Tools
Ray Tune
Optuna
Hyperopt
Keras Tuner
Google Vizier
AWS SageMaker