Improving ML Model Performance (Hyper-parameter Tuning)

Week 3

20 Terms

1

Model Parameters vs Hyper-Parameters

  • In deep learning, we train a model to find the optimal set of weights that meets our performance target

    • These weights are called model parameters

    • The model parameters are functions of the training data, the model architecture, and many other factors, and cannot be determined manually.

  • To find the optimal model parameters, we often need to decide on things like which activation functions to use, the number of neurons, the number of layers, the learning rate, the dropout rate, L1/L2 regularization, the number of epochs, etc.

    • These are called hyper-parameters and are determined manually by the model builder
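A minimal sketch of where these choices appear in code, assuming a TensorFlow/Keras workflow (the 20-feature input and all literal values are illustrative): everything passed in below is a hyper-parameter chosen by the model builder, while the weights that fit() learns are the model parameters.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),                      # 20 input features (illustrative)
    tf.keras.layers.Dense(128, activation="relu"),    # number of neurons, activation function
    tf.keras.layers.Dropout(0.3),                     # dropout rate
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),   # optimizer choice, learning rate
    loss="binary_crossentropy",
    metrics=["accuracy"],
)
# model.fit(x_train, y_train, epochs=20)              # the number of epochs is a hyper-parameter too
```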

2

Activation functions

  • There are some general guidelines on which of these to use

  • Generally, we avoid using sigmoid/tanh in hidden layers

  • ReLU is a good default, but you can experiment with Leaky ReLU, ELU and SELU
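A short sketch of these options in Keras (assumed framework): ReLU as the default hidden-layer activation, with LeakyReLU and ELU as alternatives to try.

```python
import tensorflow as tf

relu_layer = tf.keras.layers.Dense(64, activation="relu")      # good default for hidden layers
leaky_block = tf.keras.Sequential([
    tf.keras.layers.Dense(64),
    tf.keras.layers.LeakyReLU(),                                # keeps a small gradient for negative inputs
])
elu_layer = tf.keras.layers.Dense(64, activation="elu")         # "selu" can be passed the same way
```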

3

Overfitting

  • A model experiences this when it performs well on training data, but generalizes poorly to unseen data samples.

  • Limiting the depth and width of the network may prevent this, but it is not always the best approach. In many cases it is better to use a large neural network with appropriate regularization strategies applied.

4

Regularization strategies

  • Early stopping

  • L1 Parameter Regularization

  • L2 Parameter Regularization

  • Dropouts

5

Early Stopping

  • Does not require a change to the loss function or network setup

  • Use a validation data set to compute the loss after every epoch and stop training when the loss stops improving.

  • Can save compute, since training stops early instead of running the full number of epochs
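A sketch of early stopping as a Keras callback (assumed framework); the patience value is illustrative.

```python
import tensorflow as tf

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",            # loss computed on the validation set after every epoch
    patience=5,                    # tolerate 5 epochs without improvement before stopping
    restore_best_weights=True,     # roll back to the best weights seen so far
)
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=100, callbacks=[early_stop])
```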

6

L1/L2 Regularization

  • Works by penalizing larger weight values

    • Smaller weights imply a simpler network
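A sketch of attaching the penalty to a layer in Keras (assumed framework); the regularization strength 1e-4 is illustrative.

```python
import tensorflow as tf

layer = tf.keras.layers.Dense(
    64,
    activation="relu",
    kernel_regularizer=tf.keras.regularizers.l2(1e-4),   # adds a penalty on large weights to the loss
)
# tf.keras.regularizers.l1(1e-4) or l1_l2(l1=1e-5, l2=1e-4) give the L1 / combined variants
```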

7

Dropouts

  • One of the most effective and commonly used regularization techniques for neural networks

  • Applied to layers; you can choose which layers to apply dropout to

  • During training:

    • On each forward pass, it randomly selects some of the neurons and sets their outputs to 0

    • Weight updates on the backward pass likewise do not apply to the selected neurons

  • Forces a neural network to learn with many different random subsets of the network.
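A sketch of dropout placed after a chosen hidden layer in Keras (assumed framework); the 0.3 rate and layer sizes are illustrative.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.3),                     # zeroes 30% of the previous layer's outputs on each training pass
    tf.keras.layers.Dense(10, activation="softmax"),  # dropout is inactive at inference time
])
```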

8

Batch Normalization

  • Can also act as a regularizer, making other regularization redundant

  • Can be applied before or after each layer’s activation function

  • Model will generally converge faster using this

  • Larger learning rates can be used with this

  • Can be used as a replacement for input normalization

  • In convolutional networks, use it after the convolutional layer instead of a Dropout layer
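A sketch of one common placement in Keras (assumed framework): batch normalization after a convolutional layer and before its activation; the shapes are illustrative.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28, 1)),          # illustrative image shape
    tf.keras.layers.Conv2D(32, 3),
    tf.keras.layers.BatchNormalization(),       # here: after the conv, before the activation
    tf.keras.layers.Activation("relu"),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),
])
```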

9

Momentum Optimization

  • Stochastic gradient descent with momentum can help overcome local minima and reach the global minimum faster

  • A variation of this is called the Nesterov Accelerated Gradient, which measures the gradient of the loss function, not at the current position, but slightly ahead in the direction of momentum
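A sketch of both variants in Keras (assumed framework); the learning rate and momentum values are illustrative.

```python
import tensorflow as tf

sgd_momentum = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)
sgd_nesterov = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9, nesterov=True)  # Nesterov Accelerated Gradient
```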

10

RMSProp

  • In ___, the decay in the learning rate is faster for steeper dimensions than for dimensions with a gentler slope.

  • It accumulates only the gradients from the most recent iterations

  • It helps point the resulting updates more directly toward the global optimum

11

Adam (Adaptive Momentum Estimation)

  • Takes the best of both worlds from Momentum and RMSProp

  • Gets the speed from momentum and the ability to adapt gradients in different directions from RMSProp
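A sketch of the RMSProp and Adam optimizers in Keras (assumed framework); the values shown are the usual defaults, not tuned recommendations.

```python
import tensorflow as tf

rmsprop = tf.keras.optimizers.RMSprop(learning_rate=0.001, rho=0.9)              # rho: decay of the running gradient average
adam = tf.keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)   # momentum plus RMSProp-style scaling
# model.compile(optimizer=adam, loss="sparse_categorical_crossentropy", metrics=["accuracy"])
```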

12

Choosing Optimizers

  • Different optimizers differ in convergence speed and convergence quality

  • Adaptive optimization methods (including RMSProp and Adam) adapt the gradient descent rate as they move towards optimum points

13

Learning Rate Schedule

  • When training deep neural networks, it is often useful to reduce the learning rate as the training progresses

  • We can use pre-defined ____s together with SGD

14

Common learning rate schedules

  • Exponential Decay

  • Polynomial Decay

  • Piecewise Constant Decay
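A sketch of a pre-defined exponential-decay schedule used with SGD in Keras (assumed framework); PolynomialDecay and PiecewiseConstantDecay live in the same module. The decay values are illustrative.

```python
import tensorflow as tf

schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.1,
    decay_steps=10_000,     # number of steps over which one decay factor is applied
    decay_rate=0.96,
)
optimizer = tf.keras.optimizers.SGD(learning_rate=schedule)   # learning rate shrinks as training progresses
```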

15

Finding the Best Hyper-Parameters (Manual)

  • Manually select a few hyper-parameter values to try

  • Tedious, slow and non-scalable

16

Finding the Best Hyper-Parameters (Grid search)

  • Try all combinations of specified hyper-parameter values

  • Brute-force method that does not take advantage of experience from previous trials

  • Optimal values may not be present in the search range
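A plain-Python sketch of the brute-force idea; build_and_evaluate is a hypothetical helper that trains a model with the given hyper-parameters and returns a validation score.

```python
import itertools

grid = {
    "learning_rate": [1e-3, 1e-2, 1e-1],
    "dropout_rate": [0.2, 0.5],
    "units": [64, 128],
}

best_score, best_params = float("-inf"), None
for values in itertools.product(*grid.values()):       # every combination is tried
    params = dict(zip(grid.keys(), values))
    score = build_and_evaluate(**params)               # hypothetical train-and-score helper
    if score > best_score:
        best_score, best_params = score, params
```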

17

Finding the Best Hyper-Parameters (Random Search)

  • Randomly sample the values of each hyper-parameter to try

  • Faster than grid search, but may miss out on the optimal values for the hyper-parameters
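A sketch of the same search done by random sampling, with a fixed trial budget instead of full enumeration; build_and_evaluate is the same hypothetical helper as in the grid-search sketch.

```python
import random

best_score, best_params = float("-inf"), None
for _ in range(20):                                     # fixed trial budget
    params = {
        "learning_rate": 10 ** random.uniform(-4, -1),  # log-uniform sample
        "dropout_rate": random.uniform(0.1, 0.6),
        "units": random.choice([32, 64, 128, 256]),
    }
    score = build_and_evaluate(**params)                # hypothetical train-and-score helper
    if score > best_score:
        best_score, best_params = score, params
```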

18

Finding the Best Hyper-Parameters (Bayesian Optimization)

  • Aims to train the model as few times as possible, since each trial is costly

  • Spends more time exploring the path that seems promising (learns from previous trials)
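A sketch using Optuna, whose default TPE sampler is one such learn-from-previous-trials method; train_and_score is a hypothetical helper that trains with the suggested values and returns a validation metric.

```python
import optuna

def objective(trial):
    lr = trial.suggest_float("learning_rate", 1e-4, 1e-1, log=True)
    dropout = trial.suggest_float("dropout_rate", 0.1, 0.6)
    units = trial.suggest_categorical("units", [64, 128, 256])
    return train_and_score(lr, dropout, units)          # hypothetical train-and-score helper

study = optuna.create_study(direction="maximize")       # past trials guide the next suggestions
study.optimize(objective, n_trials=25)
print(study.best_params)
```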

19

Finding the Best Hyper-Parameters (Hyperband)

  • Compares the performances of different trials

  • Terminates poorly performing trials early, while letting the good trials run for longer

  • Faster than Bayesian Optimization
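A sketch using Keras Tuner's Hyperband tuner; build_model(hp) is a hypothetical function that builds and compiles a Keras model from a HyperParameters object.

```python
import keras_tuner as kt

tuner = kt.Hyperband(
    build_model,                 # hypothetical hp -> compiled Keras model function
    objective="val_accuracy",
    max_epochs=30,               # largest epoch budget any single trial can receive
    factor=3,                    # reduction factor: roughly the best 1/3 of trials advance each round
)
# tuner.search(x_train, y_train, validation_data=(x_val, y_val))
# best_hps = tuner.get_best_hyperparameters(1)[0]
```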

20

Hyper-parameter Tuning Tools

  • Ray Tune

  • Optuna

  • Hyperopt

  • Keras Tuner

  • Google Vizier

  • AWS SageMaker