ML: Optimization Learning

43 Terms

1

Simulated annealing (most directly) trades off:

Exploration and exploitation.

2

Fill in the blanks. Genetic algorithms compute iterates that mutate towards optima by blending population traits via _________; ________ explores the input space and _______ exploits it.

Crossover | mutation | crossover.

3

Fill in the blanks. Though they sound similar, stochastic optimization differs from randomized optimization: the former __________, whereas the latter __________.

Optimizes an expectation of a random variable | optimizes a deterministic function.

4

Gradient descent follows the gradient of an entire dataset towards a (local) minimum. Stochastic gradient descent follows the gradient of _________ with a _________ learning rate.

Randomly sampled minibatch | decreasing.
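
A minimal sketch of that idea, not part of the deck: minibatch SGD on a synthetic least-squares problem with a decreasing (1/sqrt(t)) learning rate. All names and values below are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                 # synthetic dataset
w_true = rng.normal(size=5)
y = X @ w_true + 0.1 * rng.normal(size=1000)

w = np.zeros(5)
for t in range(1, 2001):
    idx = rng.choice(len(X), size=32, replace=False)    # randomly sampled minibatch
    grad = X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)  # minibatch gradient of squared error
    w -= (0.1 / np.sqrt(t)) * grad                      # decreasing learning rate
print("distance to w_true:", np.linalg.norm(w - w_true))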

5

Randomized hill climbing is a derivative-free method that can operate using only objective function evaluations.

True
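
A small illustration, not from the deck, of that point: the search below uses only evaluations of a toy objective f, never its derivative (the function, step size, and iteration count are arbitrary choices).

import random

def f(x):
    return -(x - 3.0) ** 2        # toy objective to maximize; no gradients used

rng = random.Random(0)
x, best = -5.0, f(-5.0)
for _ in range(2000):
    candidate = x + rng.uniform(-0.5, 0.5)   # random neighbor of the current point
    value = f(candidate)                     # objective function evaluation only
    if value > best:                         # greedy: keep the move only if it improves
        x, best = candidate, value
print(x, best)                               # ends near x = 3 for this toy function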

6

Simulated annealing can accept worse moves early in the search to escape local optima, but becomes greedier as the temperature decreases.

True
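
A sketch of the acceptance rule behind cards 1 and 6, using an invented one-dimensional objective and cooling schedule: worse moves are accepted with probability exp(-delta / T), so the search explores while T is high and becomes nearly greedy as T shrinks.

import math, random

def f(x):
    return x**4 - 3 * x**2 + x     # toy multimodal objective to minimize

rng = random.Random(0)
x, T = 4.0, 5.0
for _ in range(5000):
    candidate = x + rng.uniform(-0.5, 0.5)
    delta = f(candidate) - f(x)
    # Always accept improvements; accept worse moves with probability exp(-delta / T).
    if delta < 0 or rng.random() < math.exp(-delta / T):
        x = candidate
    T = max(1e-3, T * 0.999)       # geometric cooling: greedier as temperature decreases
print(x, f(x))                     # usually ends near the global minimum around x = -1.3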

7

Which statements are true about randomized hill climbing? (Select ALL that apply)

Uses objective function values to decide whether to accept a new candidate.

Can get stuck in local optima because it is greedy about improvements.

Does not require gradients.

8

Which statements are true about simulated annealing? (Select ALL that apply)

Uses a temperature parameter to control acceptance of worse moves.

Starts more exploratory and becomes more exploitative as it cools.

Can help escape local optima compared to pure hill climbing.

9

Which are standard components or operators in genetic algorithms? (Select ALL that apply)

Selection (choosing fitter individuals to reproduce).

Crossover (recombining traits from parents).

Mutation (random perturbations to maintain diversity).
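
A compact sketch of those three operators on the classic OneMax bitstring problem; the population size, rates, and tournament scheme are arbitrary choices for illustration.

import random

rng = random.Random(0)
L, POP, GENS = 20, 30, 60

def fitness(bits):
    return sum(bits)               # OneMax: maximize the number of 1s

population = [[rng.randint(0, 1) for _ in range(L)] for _ in range(POP)]
for _ in range(GENS):
    next_gen = []
    for _ in range(POP):
        # Selection: size-2 tournaments pick fitter parents.
        p1 = max(rng.sample(population, 2), key=fitness)
        p2 = max(rng.sample(population, 2), key=fitness)
        # Crossover: recombine parent traits at a random cut point.
        cut = rng.randrange(1, L)
        child = p1[:cut] + p2[cut:]
        # Mutation: rare random bit flips maintain diversity.
        child = [b ^ 1 if rng.random() < 0.02 else b for b in child]
        next_gen.append(child)
    population = next_gen
print(max(fitness(ind) for ind in population))   # approaches L = 20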

10

In Adam, why are the m_hat_t and v_hat_t terms used?

To correct the early-step underestimation in the moving averages due to zero initialization.
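
A tiny numerical illustration (not from the deck) of the first-moment case, assuming a constant gradient of 1.0: the raw average m_t starts badly low because it was initialized at zero, while m_hat_t = m_t / (1 − β1^t) recovers the true value.

beta1, g = 0.9, 1.0               # beta1 and a constant gradient, for illustration
m = 0.0                           # first moment initialized at zero
for t in range(1, 6):
    m = beta1 * m + (1 - beta1) * g
    m_hat = m / (1 - beta1 ** t)  # bias-corrected estimate
    print(t, round(m, 4), round(m_hat, 4))
# m climbs slowly from 0.1 (an underestimate); m_hat equals 1.0 at every step.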

11

Compared to “Adam + L2”, AdamW’s update:

Multiplies parameters by (1 − αλ) outside the adaptive step, restoring uniform decay.
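
A one-parameter sketch contrasting the two updates; the moment values and hyperparameters below are made up for illustration.

import math

alpha, lam, eps = 1e-3, 1e-2, 1e-8   # learning rate, weight decay, epsilon
m_hat, v_hat = 0.5, 4.0              # bias-corrected moments (illustrative values)
theta = 1.0

# Adam + L2: the decay term lam * theta is added to the gradient, so it gets
# rescaled by the adaptive preconditioner 1 / (sqrt(v_hat) + eps).
theta_adam_l2 = theta - alpha * (m_hat + lam * theta) / (math.sqrt(v_hat) + eps)

# AdamW: decay multiplies the parameter outside the adaptive step,
# i.e. theta <- (1 - alpha * lam) * theta - alpha * m_hat / (sqrt(v_hat) + eps).
theta_adamw = (1 - alpha * lam) * theta - alpha * m_hat / (math.sqrt(v_hat) + eps)

print(theta_adam_l2, theta_adamw)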

12

Empirically, moving from Adam(+L2) to AdamW tends to:

Decouple α and λ, producing a more separable hyperparameter landscape.

13

Which statement about learning-rate schedules and Adam/AdamW is best supported by the readings?

Cosine annealing often improves Adam/AdamW and can widen the performance gap over L2 coupling.

14

In Adam, bias correction is most important in early steps because the moving averages start at zero.

True

15

AdamW applies weight decay independently of the adaptive gradient scaling, which makes decay more uniform across parameters.

True

16

(Multi-select) Which statements correctly describe why L2 and weight decay differ under Adam?

With Adam, L2 regularization is applied through the adaptive preconditioner, so decay is coordinate-scaled.

Weight decay (in AdamW) is applied directly to parameters, not through the adaptive moment terms.

Under plain SGD (uniform scaling), L2 and weight decay can be equivalent.

17

(Multi-select) Which hyperparameters are primarily involved in the AdamW update?

α (learning rate)

λ (weight decay coefficient)

β1 (first-moment momentum coefficient)

β2 (second-moment coefficient)

18

Entropy depends on:

A random variable’s distribution.

19

Mutual information is the KL divergence between

The joint distribution and the product of the marginal distributions.
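
A quick numerical check of that identity on a made-up 2×2 joint distribution.

import numpy as np

p_xy = np.array([[0.30, 0.20],          # invented joint distribution P(X, Y)
                 [0.10, 0.40]])
p_x = p_xy.sum(axis=1, keepdims=True)   # marginal P(X), as a column
p_y = p_xy.sum(axis=0, keepdims=True)   # marginal P(Y), as a row

# I(X; Y) = D( P(X, Y) || P(X) P(Y) )
mi = float(np.sum(p_xy * np.log(p_xy / (p_x * p_y))))
print(mi)   # positive here, since this X and Y are dependent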

20

Kullback–Leibler divergence is almost a metric. What properties does it lack?

Symmetry and triangle inequality.
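
A small numeric example of the missing symmetry, using two arbitrary discrete distributions.

import numpy as np

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.4, 0.4, 0.2])

def kl(a, b):
    return float(np.sum(a * np.log(a / b)))   # D(a || b), assuming full support

print(kl(p, q), kl(q, p))   # the two values differ, so KL is not symmetric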

21

In a machine learning problem, minimizing the KL divergence D(p || q):

Finds a distribution q that resembles the target distribution p.

22

In Adam, bias correction is most important in early steps because the moving averages start at zero.

True

23

AdamW applies weight decay independently of adaptive gradient scaling, making decay more uniform.

True

24

Conditioning increases entropy: H(X | Y) > H(X).

False

25

If two random variables are independent, their mutual information is zero.

True

26

KL divergence D(p || q) is symmetric: D(p || q) = D(q || p).

False

27

What is the name of the relationship used in the denominator of Bayes’ theorem?

Law of total probability
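
A tiny worked example of that law in the denominator, with an invented two-hypothesis setup (it also shows the normalization role asked about in card 35).

# Priors over two hypotheses and the likelihood of one observation x under each.
prior = {"h1": 0.3, "h2": 0.7}
likelihood = {"h1": 0.9, "h2": 0.2}          # P(x | h)

# Law of total probability: P(x) = sum_h P(x | h) P(h), the Bayes denominator.
evidence = sum(prior[h] * likelihood[h] for h in prior)

posterior = {h: prior[h] * likelihood[h] / evidence for h in prior}
print(evidence, posterior)                   # the posterior sums to 1 after normalization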

28

We can interpret maximum likelihood estimation from a Bayesian perspective. Specifically, it uses:

A noninformative (i.e., flat) prior.

29

Kullback–Leibler divergence is almost a metric. What properties does it lack?

Symmetry and triangle inequality.

30

In a machine learning problem, minimizing the KL divergence D(p || q):

Finds a distribution q that resembles the target distribution p.

31

Conditioning increases entropy: H(X | Y) > H(X).

False

32

If two random variables are independent, their mutual information is zero.

True

33

KL divergence D(p || q) is symmetric.

False

34

(Multi-select) Which statements about entropy are correct? (Select ALL that apply)

Lower entropy generally means the source is more predictable and easier to compress.

Entropy depends on the probability distribution over outcomes.

Uniform distributions (over a fixed finite set) have maximal entropy.
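
A quick check of those statements with two arbitrary four-outcome distributions.

import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    return float(-np.sum(p * np.log2(p)))    # in bits; assumes strictly positive entries

print(entropy([0.25, 0.25, 0.25, 0.25]))     # uniform over 4 outcomes: 2.0 bits (maximal)
print(entropy([0.90, 0.05, 0.03, 0.02]))     # skewed, more predictable: lower entropy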

35

What is the “evidence” (marginal likelihood) term used for in Bayes’ theorem?

To normalize the posterior so it sums/integrates to 1.

36

The MAP estimate is:

The parameter value that maximizes the posterior density.

37

A 95% Bayesian credible interval means:

There is a 95% chance the true parameter lies in the interval (given the data and prior).

38

Posterior predictive inference produces:

A distribution over future observations, obtained by integrating over parameter uncertainty.

39

With a uniform (flat) prior, the MAP estimate equals the MLE.

True

40

A conjugate prior guarantees the posterior will be in the same distribution family as the prior.

True
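
A small Beta–Bernoulli sketch tying cards 39 and 40 together, with invented coin-flip counts: the Beta prior is conjugate to the Bernoulli likelihood, and a flat Beta(1, 1) prior makes the MAP estimate coincide with the MLE.

heads, tails = 7, 3                           # invented coin-flip data

def beta_posterior(a, b):
    # Conjugacy: Beta(a, b) prior + Bernoulli data -> Beta(a + heads, b + tails).
    return a + heads, b + tails

def beta_mode(a, b):
    return (a - 1) / (a + b - 2)              # MAP estimate when a, b > 1

mle = heads / (heads + tails)                 # 0.7
print(beta_mode(*beta_posterior(1, 1)), mle)  # flat prior: MAP equals the MLE
print(beta_mode(*beta_posterior(5, 5)))       # informative prior pulls MAP toward 0.5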

41

Which are common reasons posterior inference can be hard in practice? (Select ALL that apply)

The normalizing constant (evidence) may require an intractable integral.

High-dimensional parameter spaces make exact integration difficult.

Non-conjugate models often lack closed-form posteriors.

42

Which methods are commonly used for approximate Bayesian inference? (Select ALL that apply)

Markov chain Monte Carlo (MCMC)

Variational inference (VI)

Laplace approximation
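
A minimal random-walk Metropolis sketch, illustrating the first of those methods; the unnormalized target and proposal scale are arbitrary choices, and none of this code comes from the deck.

import math, random

def log_unnorm_target(x):
    # Unnormalized log-density of a standard normal; the normalizing constant
    # (the hard "evidence" integral) is never needed.
    return -0.5 * x * x

rng = random.Random(0)
x, samples = 0.0, []
for _ in range(20000):
    proposal = x + rng.gauss(0.0, 1.0)                   # random-walk proposal
    log_ratio = log_unnorm_target(proposal) - log_unnorm_target(x)
    if rng.random() < math.exp(min(0.0, log_ratio)):     # Metropolis acceptance
        x = proposal
    samples.append(x)

kept = samples[2000:]                                    # drop burn-in
print(sum(kept) / len(kept))                             # sample mean near 0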

43

The posterior becomes less sensitive to the prior as the amount of data grows large.

True