Learning Rate and Overfitting/dropout

From slide 166 to slide 189

Last updated 1:41 PM on 3/27/26
15 Terms

1. Importance of calibrating learning rate

If too small: slow convergence.

If too large: oscillations and/or divergence.

The optimal value depends on the network architecture, the loss function, etc. If you start from a known architecture already applied by others to similar problems, start from the recommended values and try increasing/decreasing them while monitoring convergence and generalization.

Attention: the fact that the loss decreases without oscillating does not mean that the value is optimal (especially when fine-tuning pre-trained networks): monitor accuracy on the validation set.

Adaptive learning rate techniques can help.
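The effect of the learning rate can be seen even on a toy problem. A minimal sketch, using plain gradient descent on f(x) = x² with illustrative values (not recommendations for real networks):

```python
# Plain gradient descent on f(x) = x^2, whose gradient is 2x.
# Shows the three regimes: too small (slow), good, too large (divergent).

def gradient_descent(lr, x0=1.0, steps=50):
    x = x0
    for _ in range(steps):
        x = x - lr * 2 * x  # gradient of x^2 is 2x
    return x

too_small = abs(gradient_descent(lr=0.001))  # barely moved toward 0
good = abs(gradient_descent(lr=0.1))         # converges close to 0
too_large = abs(gradient_descent(lr=1.1))    # overshoots and diverges
```

With lr=0.001 each step shrinks x by only 0.2%, so after 50 steps it is still far from the minimum; with lr=1.1 each step overshoots so badly that |x| grows without bound.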

2. Learning rate annealing

(flashcard image)
3. Cyclic learning rate (what is it and why it works)

(flashcard image)
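A cyclic learning rate can be sketched with Smith's `triangular` policy, where the LR oscillates linearly between a lower and an upper bound. The `base_lr`, `max_lr`, and `step_size` values below are illustrative:

```python
import math

def triangular_clr(step, base_lr=1e-4, max_lr=1e-2, step_size=2000):
    # Triangular cyclic learning rate: the LR ramps linearly from base_lr
    # up to max_lr and back down, with a full cycle of 2 * step_size steps.
    cycle = math.floor(1 + step / (2 * step_size))
    x = abs(step / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1 - x)
```

At step 0 the LR is `base_lr`, at step `step_size` it peaks at `max_lr`, and at step `2 * step_size` it is back at `base_lr`, ready for the next cycle.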
4. Cyclic learning rate: thresholds

(flashcard image)
5. CLR according to Smith

(flashcard image)
6. Stochastic Gradient Descent with Warm Restarts

(flashcard image)
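Warm restarts are typically combined with cosine annealing: within each cycle the LR decays from a maximum to a minimum along a half cosine, then jumps back up ("restarts"). A minimal sketch with illustrative hyperparameters and fixed-length cycles (the original schedule also allows growing cycle lengths):

```python
import math

def sgdr_lr(epoch, lr_max=0.1, lr_min=0.001, cycle_len=10):
    # Cosine annealing with warm restarts: decay lr_max -> lr_min over
    # cycle_len epochs following a half cosine, then restart at lr_max.
    t = epoch % cycle_len
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t / cycle_len))
```

At epoch 0 the LR is `lr_max`; it decays through the cycle, and at epoch 10 it restarts at `lr_max` again.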
7. Momentum

SGD with mini-batches can produce a “zigzag gradient” that slows convergence and makes it less stable.

You can think of gradient descent as a ball rolling down a valley: we want it to settle in the deepest point of the landscape, but it is easy to see that things can go wrong as gravity pulls the ball toward the nearest low point.

In physics, momentum produces motion without sudden oscillations or changes of direction.

In SGD, this physical behavior can be emulated to avoid oscillations: the last update of each parameter is saved, and the new update is computed as a linear combination of the previous update (which gives stability) and the current gradient (which corrects the direction).

During training, the update direction tends to resist change when momentum is added to the update scheme. When the network approaches a shallow local minimum, it is as if brakes were applied, but not strongly enough to instantly change the update direction and magnitude. Networks trained this way therefore overshoot small local minima and stop only in a deeper minimum.

Thus momentum helps neural networks escape local minima so that a deeper, more important minimum can be found. Too much momentum can also cause problems: an unstable system can develop oscillations that grow in magnitude, in which case decay terms must be added. It is simply physics applied to neural network training and numerical optimization.
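The update rule described above, "previous update plus current gradient", can be sketched as follows on f(w) = w² (illustrative hyperparameters; `mu` is the momentum coefficient):

```python
def sgd_momentum_step(w, grad, velocity, lr=0.01, mu=0.9):
    # New update = linear combination of the previous update (stability)
    # and the current gradient (direction correction).
    velocity = mu * velocity - lr * grad
    return w + velocity, velocity

# Minimize f(w) = w^2 starting from w = 5.
w, v = 5.0, 0.0
for _ in range(100):
    grad = 2 * w  # gradient of w^2
    w, v = sgd_momentum_step(w, grad, v)
# w is now close to the minimum at 0
```

With `mu=0` this reduces to plain SGD; larger `mu` gives more inertia and smoother, less zigzagging trajectories.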

8. One cycle learning rate

(flashcard image)
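The one-cycle policy uses a single cycle over the whole training run: warm up from a low LR to a maximum, then anneal down to a value well below the starting point. A minimal linear-ramp sketch (illustrative parameters; common implementations also use cosine annealing and cycle the momentum in the opposite direction):

```python
def one_cycle_lr(step, total_steps, lr_max=0.1, div=25.0, final_div=1e4):
    # Warm up linearly from lr_max/div to lr_max over the first half,
    # then anneal linearly down to lr_max/final_div.
    lr_start = lr_max / div
    lr_end = lr_max / final_div
    up = total_steps // 2
    if step <= up:
        return lr_start + (lr_max - lr_start) * step / up
    frac = (step - up) / (total_steps - up)
    return lr_max + (lr_end - lr_max) * frac
```

Over 100 steps this starts at 0.004, peaks at 0.1 at step 50, and ends at 1e-5.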
9. Superconvergence

(flashcard image)
10. Adam

The intuition behind Adam is that we don’t want to move so fast that we jump over the minimum; we want to decrease the velocity a little for a more careful search.

Adam tends to converge faster, while SGD often converges to better solutions. SGD’s high variance is a disadvantage that Adam corrects.

To summarize, Adam converges rapidly to a “sharp” minimum, whereas SGD is computationally heavier and converges to a “flat” minimum, which tends to perform better on test data.

Adam is useful when the computation time of SGD is unfeasible (or very expensive). Moreover, it is the optimizer usually applied in transformers.
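The "decrease the velocity" idea shows up in the update rule: Adam keeps a momentum-like first moment of the gradient and divides by the square root of a second moment (the running average of squared gradients), so large recent gradients shrink the effective step. A minimal sketch with standard default coefficients (illustrative, not a library implementation):

```python
import math

def adam_step(w, grad, m, v, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    # First moment: momentum-like running average of the gradient.
    m = beta1 * m + (1 - beta1) * grad
    # Second moment: running average of the squared gradient; dividing by
    # its square root damps the step where gradients have been large.
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)  # bias correction for zero-initialized moments
    v_hat = v / (1 - beta2 ** t)
    return w - lr * m_hat / (math.sqrt(v_hat) + eps), m, v

# Minimize f(w) = w^2 starting from w = 1.
w, m, v = 1.0, 0.0, 0.0
for t in range(1, 501):
    grad = 2 * w  # gradient of w^2
    w, m, v = adam_step(w, grad, m, v, t)
# w is now close to the minimum at 0
```

Note the step size is roughly bounded by `lr` regardless of the gradient scale, which is part of why Adam needs less learning-rate tuning than SGD.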

11. diffGrad

(flashcard image)
12. Check list for convergence problems

(flashcard image)
13. Overfitting revisited: bias-variance tradeoff

(flashcard image)
14. Robustness through perturbations

(flashcard image)
15. Dropout

Dropout is a technique where randomly selected neurons are ignored during training: they are “dropped out” at random. Their contribution to the activation of downstream neurons is temporarily removed on the forward pass, and no weight updates are applied to them on the backward pass.

As a neural network learns, neuron weights settle into their context within the network. Weights of neurons are tuned for specific features, providing some specialization. Neighboring neurons come to rely on this specialization, which, if taken too far, can result in a fragile model too specialized to the training data. This reliance of a neuron on its context during training is referred to as complex co-adaptation.

If neurons are randomly dropped out of the network during training, other neurons have to step in and handle the representation needed to make predictions in place of the missing neurons. This is believed to result in the network learning multiple independent internal representations.

The effect is that the network becomes less sensitive to the specific weights of individual neurons. This in turn yields a network that generalizes better and is less likely to overfit the training data.
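The mechanism can be sketched as "inverted" dropout on a plain list of activations (illustrative; real frameworks apply it as a layer over tensors):

```python
import random

def dropout_forward(activations, p_drop=0.5, training=True):
    # Inverted dropout: during training, zero each activation with
    # probability p_drop and rescale the survivors by 1/(1 - p_drop)
    # so the expected activation is unchanged; at test time, pass through.
    if not training:
        return list(activations)
    keep = 1.0 - p_drop
    return [a / keep if random.random() < keep else 0.0 for a in activations]
```

Because the rescaling is done at training time, nothing special is needed at inference; this is the scheme most frameworks implement (e.g. PyTorch's `nn.Dropout`).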

