Simulated annealing (most directly) trades off:
Exploration and exploitation.
Fill in the blanks. Genetic algorithms compute iterates that mutate towards optima by blending population traits via _________; ________ explores the input space and _______ exploits it.
Crossover | mutation | crossover.
Fill in the blanks. Though they sound similar, stochastic optimization differs from randomized optimization. The former __________, whereas the latter __________.
Optimizes an expectation of a random variable | uses randomness in the algorithm to optimize a deterministic function.
Gradient descent follows the gradient of an entire dataset towards a (local) minimum. Stochastic gradient descent follows the gradient of _________ with a _________ learning rate.
Randomly sampled minibatch | decreasing.
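The contrast above fits in a few lines. Everything here (the toy linear objective, `minibatch_grad`, the 1/t-style decay schedule) is illustrative, not any particular library's API:

```python
import random

random.seed(0)

# Toy objective: mean squared error of w*x against y over the dataset.
data = [(x, 3.0 * x) for x in range(1, 11)]  # true slope w = 3

def minibatch_grad(w, batch):
    # Gradient of the mean of (w*x - y)^2 over the sampled minibatch.
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

w = 0.0
for t in range(200):
    lr = 0.01 / (1 + 0.01 * t)          # decreasing learning rate
    batch = random.sample(data, 4)      # randomly sampled minibatch
    w -= lr * minibatch_grad(w, batch)  # step on the minibatch gradient
```

Full-batch gradient descent would replace `batch` with all of `data`; SGD trades exact gradients for cheap noisy ones, and the decaying learning rate damps that noise over time.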
Randomized hill climbing is a derivative-free method that can operate using only objective function evaluations.
True
Simulated annealing can accept worse moves early in the search to escape local optima, but becomes greedier as the temperature decreases.
True
Which statements are true about randomized hill climbing? (Select ALL that apply)
Uses objective function values to decide whether to accept a new candidate.
Can get stuck in local optima because it is greedy about improvements.
Does not require gradients.
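All three properties show up in a tiny sketch; the loop below is generic hill climbing, and the names in it (`neighbor`, the quadratic toy objective) are made up for illustration:

```python
import random

def randomized_hill_climb(f, x0, neighbor, iters=2000, seed=0):
    """Greedy search using only objective-function values -- no gradients."""
    rng = random.Random(seed)
    x, fx = x0, f(x0)
    for _ in range(iters):
        cand = neighbor(x, rng)
        fc = f(cand)
        if fc > fx:          # accept only strict improvements (hence greedy)
            x, fx = cand, fc
    return x, fx

# Unimodal toy: the climber reaches the single maximum at x = 4.
best_x, best_f = randomized_hill_climb(
    f=lambda x: -(x - 4) ** 2,
    x0=0.0,
    neighbor=lambda x, rng: x + rng.uniform(-0.5, 0.5),
)
```

On a multimodal objective the same loop stalls at whichever local optimum it climbs first, which is exactly the "can get stuck" caveat above.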
Which statements are true about simulated annealing? (Select ALL that apply)
Uses a temperature parameter to control acceptance of worse moves.
Starts more exploratory and becomes more exploitative as it cools.
Can help escape local optima compared to pure hill climbing.
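A compact sketch of all three behaviors; the objective, cooling rate, and neighbor step are arbitrary illustration choices:

```python
import math
import random

def simulated_annealing(f, x0, t0=3.0, cooling=0.997, iters=4000, seed=0):
    rng = random.Random(seed)
    x, fx = x0, f(x0)
    t = t0
    for _ in range(iters):
        cand = x + rng.uniform(-0.3, 0.3)
        fc = f(cand)
        # Temperature controls acceptance of worse moves: while t is large,
        # exp((fc - fx) / t) is close to 1 and exploration dominates; as t
        # cools the rule approaches pure greedy hill climbing.
        if fc >= fx or rng.random() < math.exp((fc - fx) / t):
            x, fx = cand, fc
        t *= cooling
    return x, fx

# Bumpy objective: local maxima near x = +/-2.1 (f ~ -2.4), global max f = 2 at x = 0.
bumpy = lambda x: -x * x + 2 * math.cos(3 * x)
best_x, best_f = simulated_annealing(bumpy, x0=3.0)
```

Starting at x = 3, pure hill climbing would stop at the nearby local maximum; the hot early phase typically lets annealing cross the valley toward the global one.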
Which are standard components or operators in genetic algorithms? (Select ALL that apply)
Selection (choosing fitter individuals to reproduce).
Crossover (recombining traits from parents).
Mutation (random perturbations to maintain diversity).
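The three operators compose into a minimal loop. The OneMax fitness (count of 1-bits), population size, and rates below are arbitrary illustration values:

```python
import random

rng = random.Random(0)
N_BITS, POP, GENS = 20, 30, 60
fitness = sum  # OneMax: fitness is the number of 1-bits

def select(pop):
    # Selection: tournament of two, keep the fitter individual.
    a, b = rng.sample(pop, 2)
    return a if fitness(a) >= fitness(b) else b

def crossover(p1, p2):
    # Crossover: recombine parent traits around a random cut point.
    cut = rng.randrange(1, N_BITS)
    return p1[:cut] + p2[cut:]

def mutate(bits, rate=0.02):
    # Mutation: rare random bit flips maintain diversity.
    return [b ^ 1 if rng.random() < rate else b for b in bits]

pop = [[rng.randint(0, 1) for _ in range(N_BITS)] for _ in range(POP)]
for _ in range(GENS):
    pop = [mutate(crossover(select(pop), select(pop))) for _ in range(POP)]

best = max(pop, key=fitness)
```

Selection exploits what the population already knows, mutation explores, and crossover blends traits, matching the fill-in-the-blank answer above.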
In Adam, why are the m_hat_t and v_hat_t terms used?
To correct the early-step underestimation in moving averages due to zero initialization.
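A quick numeric check of that answer: with a constant gradient of 1.0, the raw exponential moving averages badly underestimate at first, while the corrected versions recover the true moment exactly. (Standard Adam notation assumed; this is not any library's implementation.)

```python
beta1, beta2 = 0.9, 0.999
g = 1.0                      # pretend every gradient is exactly 1.0

m = v = 0.0                  # zero initialization is the source of the bias
for t in range(1, 4):
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)   # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)   # bias-corrected second moment
    # At t = 1 the raw m is only 0.1 even though the true mean gradient
    # is 1.0; m_hat and v_hat divide out exactly that shrinkage.
```

Because the shrinkage factor (1 − β^t) approaches 1 as t grows, the correction matters most in the first few steps, matching the True/False item below.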
Compared to “Adam + L2”, AdamW’s update:
Multiplies parameters by (1 − αλ) outside the adaptive step, restoring uniform decay.
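A one-parameter sketch of that update; the names `alpha`, `lam`, etc. follow the usual AdamW notation and are assumptions, not a library API:

```python
import math

def adamw_step(w, g, m, v, t, alpha=0.01, lam=0.01,
               b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    # Decoupled weight decay: multiply w by (1 - alpha*lam) OUTSIDE the
    # adaptive step, so the decay is uniform regardless of gradient scale.
    w = w * (1 - alpha * lam) - alpha * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

# With a zero gradient, only the decay acts: w shrinks by exactly (1 - alpha*lam).
w, m, v = adamw_step(w=1.0, g=0.0, m=0.0, v=0.0, t=1)
```

"Adam + L2" would instead add λw into g, pushing the decay through the 1/√v preconditioner and making it coordinate-scaled, which is the difference the multi-select below spells out.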
Empirically, moving from Adam(+L2) to AdamW tends to:
Decouple α and λ, producing a more separable hyperparameter landscape.
Which statement about learning-rate schedules and Adam/AdamW is best supported by the readings?
Cosine annealing often improves Adam/AdamW and can widen performance gaps over L2 coupling.
In Adam, bias correction is most important in early steps because the moving averages start at zero.
True
AdamW applies weight decay independently of the adaptive gradient scaling, which makes decay more uniform across parameters.
True
(Multi-select) Which statements correctly describe why L2 and weight decay differ under Adam?
With Adam, L2 regularization is applied through the adaptive preconditioner, so decay is coordinate-scaled.
Weight decay (in AdamW) is applied directly to parameters, not through the adaptive moment terms.
Under plain SGD (uniform scaling), L2 and weight decay can be equivalent.
(Multi-select) Which hyperparameters are primarily involved in the AdamW update?
α (learning rate)
λ (weight decay coefficient)
β1 (first-moment momentum coefficient)
β2 (second-moment coefficient)
Entropy depends on:
A random variable’s distribution.
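Concretely, entropy is a functional of the distribution alone; a minimal sketch in bits:

```python
import math

def entropy(p):
    """Shannon entropy in bits of a discrete distribution p."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

uniform = entropy([0.25, 0.25, 0.25, 0.25])  # maximal for 4 outcomes: 2 bits
peaked = entropy([0.97, 0.01, 0.01, 0.01])   # predictable source, low entropy
```

The same function applied to any two distributions over the same outcomes can differ, while relabeling the outcomes changes nothing, which is the point of the answer above.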
Mutual information is the KL divergence between:
The joint distribution and the product distribution.
Kullback–Leibler divergence is almost a metric. What properties does it lack?
Symmetry and triangle inequality.
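Both missing properties are easy to demonstrate numerically; the asymmetry in particular:

```python
import math

def kl(p, q):
    """D(p || q) in bits (assumes q_i > 0 wherever p_i > 0)."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p, q = [0.5, 0.5], [0.9, 0.1]
forward = kl(p, q)   # D(p || q) ~ 0.74 bits
reverse = kl(q, p)   # D(q || p) ~ 0.53 bits -- not equal, so not symmetric
```

It does satisfy the other metric axioms in spirit: D(p || q) ≥ 0, with equality exactly when p = q.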
In a machine learning problem, minimizing the KL divergence D(p || q):
Finds a distribution q that resembles the target distribution p.
Conditioning increases entropy: H(X | Y) > H(X).
False
If two random variables are independent, their mutual information is zero.
True
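That fact drops straight out of the KL definition of mutual information; the helper below is an illustrative sketch, not a library function:

```python
import math

def mutual_information(joint, px, py):
    """I(X;Y): KL divergence between the joint and the product of marginals, in bits."""
    return sum(
        joint[i][j] * math.log2(joint[i][j] / (px[i] * py[j]))
        for i in range(len(px))
        for j in range(len(py))
        if joint[i][j] > 0
    )

px = py = [0.5, 0.5]
independent = [[0.25, 0.25], [0.25, 0.25]]    # joint == product of marginals
perfectly_coupled = [[0.5, 0.0], [0.0, 0.5]]  # X always equals Y

i_zero = mutual_information(independent, px, py)
i_one = mutual_information(perfectly_coupled, px, py)
```

When the joint equals the product of marginals, every log ratio is log(1) = 0, so the divergence, and hence the mutual information, is zero.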
KL divergence D(p || q) is symmetric: D(p || q) = D(q || p).
False
What is the name of the relationship used in the denominator of Bayes’ theorem?
Law of total probability.
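A short worked example (the disease-test numbers are invented for illustration): the denominator P(E) is expanded with the law of total probability, which is also what normalizes the posterior.

```python
prior = {"disease": 0.01, "healthy": 0.99}
likelihood = {"disease": 0.95, "healthy": 0.05}  # P(positive | hypothesis)

# Law of total probability: P(positive) = sum over h of P(positive | h) P(h)
evidence = sum(likelihood[h] * prior[h] for h in prior)

# Bayes' theorem: P(h | positive) = P(positive | h) P(h) / P(positive)
posterior = {h: likelihood[h] * prior[h] / evidence for h in prior}
```

Even with a fairly accurate test, the low base rate keeps the posterior probability of disease modest, a classic consequence of the prior term.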
We can interpret maximum likelihood estimation from a Bayesian perspective. Specifically, it uses:
A noninformative (i.e., flat) prior.
Which statements about entropy are correct? (Select ALL that apply)
Lower entropy generally means the source is more predictable and easier to compress.
Entropy depends on the probability distribution over outcomes.
Uniform distributions (over a fixed finite set) have maximal entropy.
What is the “evidence” (marginal likelihood) term used for in Bayes’ theorem?
To normalize the posterior so it sums/integrates to 1.
The MAP estimate is:
The parameter value that maximizes the posterior density.
A 95% Bayesian credible interval means:
There is a 95% chance the true parameter lies in the interval (given data and prior).
Posterior predictive inference produces:
A distribution over future observations by integrating over parameter uncertainty.
With a uniform (flat) prior, the MAP estimate equals the MLE.
True
A conjugate prior guarantees the posterior will be in the same distribution family as the prior.
True
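Beta-Binomial is the classic instance of conjugacy, and it also illustrates the flat-prior fact above (the helper name is illustrative):

```python
def beta_binomial_update(a, b, heads, tails):
    # Beta(a, b) prior + binomial data -> Beta(a + heads, b + tails) posterior,
    # in closed form: that is exactly what conjugacy buys.
    return a + heads, b + tails

a, b = beta_binomial_update(1.0, 1.0, heads=7, tails=3)  # Beta(1, 1) = flat prior

posterior_mode = (a - 1) / (a + b - 2)  # MAP estimate: mode of Beta(8, 4)
mle = 7 / 10                            # MLE of the coin bias
```

With the uniform Beta(1, 1) prior, the posterior mode lands exactly on the MLE, and no integral was needed anywhere, sidestepping the difficulties listed in the next question.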
Which are common reasons posterior inference can be hard in practice? (Select ALL that apply)
The normalizing constant (evidence) may require an intractable integral.
High-dimensional parameter spaces make exact integration difficult.
Non-conjugate models often lack closed-form posteriors.
Which methods are commonly used for approximate Bayesian inference? (Select ALL that apply)
Markov chain Monte Carlo (MCMC).
Variational inference (VI).
Laplace approximation.
The posterior becomes less sensitive to the prior as the amount of data grows large.
True