Comparison of Optimization Algorithms in Machine Learning

Description

20 question-and-answer flashcards covering update equations, properties, and practical applications of common optimization algorithms used in machine learning and deep learning.

20 Terms

1

What is the parameter update rule for basic Gradient Descent (GD)?

β_{t+1} = β_t − α · g_t, where g_t is the gradient of the full loss with respect to β.
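
A minimal NumPy sketch of this rule, assuming an illustrative least-squares loss and a made-up learning rate (neither comes from the card):

import numpy as np

# Assumed example loss: R(beta) = 0.5 * ||X @ beta - y||^2
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y = np.array([1.0, 2.0, 3.0])
beta = np.zeros(2)
alpha = 0.01  # learning rate (illustrative value)

for t in range(100):
    g_t = X.T @ (X @ beta - y)   # gradient of the full loss w.r.t. beta
    beta = beta - alpha * g_t    # beta_{t+1} = beta_t - alpha * g_t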

2

On what data sample is the gradient computed in Stochastic Gradient Descent (SGD)?

On a single training example, so g_t = ∂R_i(β)/∂β for one sample i.

3

How does Mini-Batch SGD compute its gradient g_t?

By averaging gradients over a small batch of n examples: g_t = (1/n) Σ_{i=1}^{n} ∂R_i(β)/∂β.
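
A sketch contrasting the single-sample gradient from the previous card with the mini-batch average; the least-squares loss, batch size, and data are all illustrative assumptions:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))   # illustrative data
y = rng.normal(size=100)
beta = np.zeros(3)

def grad(X_part, y_part, beta):
    # Gradient of an assumed least-squares loss on the given rows.
    return X_part.T @ (X_part @ beta - y_part) / len(y_part)

i = int(rng.integers(len(y)))
g_sgd = grad(X[i:i+1], y[i:i+1], beta)        # SGD: gradient of R_i on a single example i

batch = rng.choice(len(y), size=16, replace=False)
g_minibatch = grad(X[batch], y[batch], beta)  # Mini-batch SGD: average over n = 16 examples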

4

Write the velocity update equation used in SGD with Momentum.

v_{t+1} = γ v_t + α · g_t.

5

Within what range does the momentum coefficient γ typically fall?

Between 0 and 1 (γ ∈ (0,1)).

6

After computing the velocity, how are parameters updated in Momentum SGD?

β_{t+1} = β_t − v_{t+1}.
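
Cards 4–6 together describe one Momentum SGD step; a minimal sketch, with γ, α, and the gradient values assumed purely for illustration:

import numpy as np

beta = np.zeros(2)        # parameters
v = np.zeros_like(beta)   # velocity, initialized to zero
gamma, alpha = 0.9, 0.01  # momentum coefficient and learning rate (assumed values)

def momentum_step(beta, v, g_t):
    v_next = gamma * v + alpha * g_t   # v_{t+1} = gamma * v_t + alpha * g_t
    beta_next = beta - v_next          # beta_{t+1} = beta_t - v_{t+1}
    return beta_next, v_next

g_t = np.array([0.5, -1.0])            # example gradient (made up)
beta, v = momentum_step(beta, v, g_t)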

7

How is the accumulated squared gradient G_t updated in Adagrad?

G_t = G_{t−1} + g_t^2, where the square is applied element-wise so each parameter keeps its own accumulator.

8

Why is Adagrad especially useful for sparse data problems?

Because it adapts the learning rate for each parameter, giving infrequently updated (sparse) parameters relatively larger steps.
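A sketch combining the accumulator from card 7 with the standard Adagrad step β_{t+1} = β_t − (α / √(G_t + ε)) · g_t (the step equation and the α, ε values are not in the cards and are added here as assumptions); the sparse gradient illustrates the per-parameter behaviour card 8 describes:

import numpy as np

beta = np.zeros(2)
G = np.zeros_like(beta)   # accumulated squared gradients, one entry per parameter
alpha, eps = 0.1, 1e-8    # assumed learning rate and stability constant

def adagrad_step(beta, G, g_t):
    G = G + g_t**2                                # G_t = G_{t-1} + g_t^2 (element-wise)
    beta = beta - alpha / np.sqrt(G + eps) * g_t  # standard Adagrad parameter update
    return beta, G

g_t = np.array([1.0, 0.0])            # sparse gradient: only the first parameter is active
beta, G = adagrad_step(beta, G, g_t)  # the rarely-updated second parameter keeps a small G, so its future steps stay relatively large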

9

What is a commonly used decay rate ρ in RMSprop’s exponential average?

ρ ≈ 0.9.

10

How does RMSprop update parameters once the running average S_t is computed?

β_{t+1} = β_t − (α / √(S_{t+1} + ε)) · g_t.
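
A sketch of one RMSprop step; the exponential-average update for S is the standard rule (not spelled out in the cards), and α, ε, and the gradient are assumed:

import numpy as np

beta = np.zeros(2)
S = np.zeros_like(beta)              # running average of squared gradients
alpha, rho, eps = 0.001, 0.9, 1e-8   # rho = 0.9 as in card 9; alpha and eps are assumed

def rmsprop_step(beta, S, g_t):
    S = rho * S + (1 - rho) * g_t**2              # S_{t+1} = rho * S_t + (1 - rho) * g_t^2 (standard form)
    beta = beta - alpha / np.sqrt(S + eps) * g_t  # beta_{t+1} = beta_t - (alpha / sqrt(S_{t+1} + eps)) * g_t
    return beta, S

g_t = np.array([0.5, -1.0])          # example gradient (illustrative)
beta, S = rmsprop_step(beta, S, g_t)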

11

What two moment estimates does the Adam optimizer maintain?

The first moment (m_t, the mean of the gradients) and the second moment (v_t, the uncentered variance of the gradients).

12

How does Adam correct the bias in its moment estimates?

By computing bias-corrected values m̂_t = m_t/(1 − μ_1^t) and v̂_t = v_t/(1 − μ_2^t).
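
Cards 11 and 12 together give the core of Adam; a minimal sketch, where the moment updates and the final parameter step follow the standard Adam rule and α, μ_1, μ_2, ε are commonly cited defaults rather than values from the cards:

import numpy as np

beta = np.zeros(2)
m = np.zeros_like(beta)   # first moment estimate (mean of gradients)
v = np.zeros_like(beta)   # second moment estimate (uncentered variance)
alpha, mu1, mu2, eps = 0.001, 0.9, 0.999, 1e-8  # assumed, commonly cited defaults

def adam_step(beta, m, v, g_t, t):
    m = mu1 * m + (1 - mu1) * g_t         # first-moment update
    v = mu2 * v + (1 - mu2) * g_t**2      # second-moment update
    m_hat = m / (1 - mu1**t)              # bias correction: m_hat = m_t / (1 - mu1^t)
    v_hat = v / (1 - mu2**t)              # bias correction: v_hat = v_t / (1 - mu2^t)
    beta = beta - alpha * m_hat / (np.sqrt(v_hat) + eps)  # standard Adam parameter step
    return beta, m, v

g_t = np.array([0.5, -1.0])                   # example gradient (illustrative)
beta, m, v = adam_step(beta, m, v, g_t, t=1)  # t starts at 1 so the corrections are well defined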

13

Which optimizer eliminates the need to manually set a global learning rate?

Adadelta.
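
A sketch of why no global learning rate is needed: Adadelta's standard update (not spelled out in the cards) scales each step by the ratio of two running averages, with ρ, ε, and the gradient assumed for illustration:

import numpy as np

beta = np.zeros(2)
E_g2 = np.zeros_like(beta)    # running average of squared gradients
E_dx2 = np.zeros_like(beta)   # running average of squared parameter updates
rho, eps = 0.95, 1e-6         # assumed decay rate and stability constant

def adadelta_step(beta, E_g2, E_dx2, g_t):
    E_g2 = rho * E_g2 + (1 - rho) * g_t**2
    delta = -np.sqrt(E_dx2 + eps) / np.sqrt(E_g2 + eps) * g_t  # step size derived from past updates; no global alpha
    E_dx2 = rho * E_dx2 + (1 - rho) * delta**2
    return beta + delta, E_g2, E_dx2

g_t = np.array([0.5, -1.0])   # example gradient (illustrative)
beta, E_g2, E_dx2 = adadelta_step(beta, E_g2, E_dx2, g_t)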

14

For what size of datasets is full-batch Gradient Descent most suitable?

Small datasets that can be fully processed at every update without prohibitive computational cost.

15

Which optimizer is typically chosen for real-time, large-scale systems such as recommendation engines?

Stochastic Gradient Descent (SGD).

16

Mini-Batch SGD balances the trade-off between which two factors?

Computational efficiency (speed) and convergence stability (reduced gradient variance).

17

Which optimizer is commonly applied to RNNs handling time-series or sequence data?

RMSprop.

18

Name some state-of-the-art tasks that often rely on Adam.

Training GANs, transformer models like BERT, and advanced computer-vision networks.

19

In optimization landscapes with high curvature or saddle points, which optimizer helps reduce oscillations?

SGD with Momentum.

20

Why might Adadelta be preferred in speech recognition and NLP tasks with rapidly changing gradients?

Its adaptive update rule scales steps based on past gradients, removing the need for manual learning-rate tuning and coping well with changing gradient magnitudes.