Simulated annealing (most directly) trades off:
Exploration and exploitation.
Fill in the blanks. Genetic algorithms compute iterates that mutate towards optima by blending population traits via _________; ________ explores the input space and _______ exploits it.
Crossover | mutation | crossover.
Fill in the blanks. Though they sound similar, stochastic optimization differs from randomized optimization. The former __________, whereas the latter __________.
Optimizes an expectation of a random variable | uses randomness in the algorithm to optimize a deterministic function.
Gradient descent follows the gradient of an entire dataset towards a (local) minimum. Stochastic gradient descent follows the gradient of _________ with a _________ learning rate.
Randomly sampled minibatch | decreasing.
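The contrast above fits in a few lines. Everything here (the toy linear objective, `minibatch_grad`, the 1/t-style decay schedule) is illustrative, not any particular library's API:

```python
import random

random.seed(0)

# Toy objective: mean squared error of w*x against y over the dataset.
data = [(x, 3.0 * x) for x in range(1, 11)]  # true slope w = 3

def minibatch_grad(w, batch):
    # Gradient of the mean of (w*x - y)^2 over the sampled minibatch.
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

w = 0.0
for t in range(200):
    lr = 0.01 / (1 + 0.01 * t)          # decreasing learning rate
    batch = random.sample(data, 4)      # randomly sampled minibatch
    w -= lr * minibatch_grad(w, batch)  # step on the minibatch gradient
```

Full-batch gradient descent would replace `batch` with all of `data`; SGD trades exact gradients for cheap noisy ones, and the decaying learning rate damps that noise over time.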
Randomized hill climbing is a derivative-free method that can operate using only objective function evaluations.
True
Simulated annealing can accept worse moves early in the search to escape local optima, but becomes greedier as the temperature decreases.
True
Which statements are true about randomized hill climbing? (Select ALL that apply)
Uses objective function values to decide whether to accept a new candidate.
Can get stuck in local optima because it is greedy about improvements.
Does not require gradients.
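All three properties show up in a tiny sketch; the loop below is generic hill climbing, and the names in it (`neighbor`, the quadratic toy objective) are made up for illustration:

```python
import random

def randomized_hill_climb(f, x0, neighbor, iters=2000, seed=0):
    """Greedy search using only objective-function values -- no gradients."""
    rng = random.Random(seed)
    x, fx = x0, f(x0)
    for _ in range(iters):
        cand = neighbor(x, rng)
        fc = f(cand)
        if fc > fx:          # accept only strict improvements (hence greedy)
            x, fx = cand, fc
    return x, fx

# Unimodal toy: the climber reaches the single maximum at x = 4.
best_x, best_f = randomized_hill_climb(
    f=lambda x: -(x - 4) ** 2,
    x0=0.0,
    neighbor=lambda x, rng: x + rng.uniform(-0.5, 0.5),
)
```

On a multimodal objective the same loop stalls at whichever local optimum it climbs first, which is exactly the "can get stuck" caveat above.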
Which statements are true about simulated annealing? (Select ALL that apply)
Uses a temperature parameter to control acceptance of worse moves.
Starts more exploratory and becomes more exploitative as it cools.
Can help escape local optima compared to pure hill climbing.
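A compact sketch of all three behaviors; the objective, cooling rate, and neighbor step are arbitrary illustration choices:

```python
import math
import random

def simulated_annealing(f, x0, t0=3.0, cooling=0.997, iters=4000, seed=0):
    rng = random.Random(seed)
    x, fx = x0, f(x0)
    t = t0
    for _ in range(iters):
        cand = x + rng.uniform(-0.3, 0.3)
        fc = f(cand)
        # Temperature controls acceptance of worse moves: while t is large,
        # exp((fc - fx) / t) is close to 1 and exploration dominates; as t
        # cools the rule approaches pure greedy hill climbing.
        if fc >= fx or rng.random() < math.exp((fc - fx) / t):
            x, fx = cand, fc
        t *= cooling
    return x, fx

# Bumpy objective: local maxima near x = +/-2.1 (f ~ -2.4), global max f = 2 at x = 0.
bumpy = lambda x: -x * x + 2 * math.cos(3 * x)
best_x, best_f = simulated_annealing(bumpy, x0=3.0)
```

Starting at x = 3, pure hill climbing would stop at the nearby local maximum; the hot early phase typically lets annealing cross the valley toward the global one.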
Which are standard components or operators in genetic algorithms? (Select ALL that apply)
Selection (choosing fitter individuals to reproduce).
Crossover (recombining traits from parents).
Mutation (random perturbations to maintain diversity).
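The three operators compose into a minimal loop. The OneMax fitness (count of 1-bits), population size, and rates below are arbitrary illustration values:

```python
import random

rng = random.Random(0)
N_BITS, POP, GENS = 20, 30, 60
fitness = sum  # OneMax: fitness is the number of 1-bits

def select(pop):
    # Selection: tournament of two, keep the fitter individual.
    a, b = rng.sample(pop, 2)
    return a if fitness(a) >= fitness(b) else b

def crossover(p1, p2):
    # Crossover: recombine parent traits around a random cut point.
    cut = rng.randrange(1, N_BITS)
    return p1[:cut] + p2[cut:]

def mutate(bits, rate=0.02):
    # Mutation: rare random bit flips maintain diversity.
    return [b ^ 1 if rng.random() < rate else b for b in bits]

pop = [[rng.randint(0, 1) for _ in range(N_BITS)] for _ in range(POP)]
for _ in range(GENS):
    pop = [mutate(crossover(select(pop), select(pop))) for _ in range(POP)]

best = max(pop, key=fitness)
```

Selection exploits what the population already knows, mutation explores, and crossover blends traits, matching the fill-in-the-blank answer above.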
In Adam, why are the m_hat_t and v_hat_t terms used?
To correct the early-step underestimation in moving averages due to zero initialization.
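A quick numeric check of that answer: with a constant gradient of 1.0, the raw exponential moving averages badly underestimate at first, while the corrected versions recover the true moment exactly. (Standard Adam notation assumed; this is not any library's implementation.)

```python
beta1, beta2 = 0.9, 0.999
g = 1.0                      # pretend every gradient is exactly 1.0

m = v = 0.0                  # zero initialization is the source of the bias
for t in range(1, 4):
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)   # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)   # bias-corrected second moment
    # At t = 1 the raw m is only 0.1 even though the true mean gradient
    # is 1.0; m_hat and v_hat divide out exactly that shrinkage.
```

Because the shrinkage factor (1 − β^t) approaches 1 as t grows, the correction matters most in the first few steps, matching the True/False item below.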
Compared to “Adam + L2”, AdamW’s update:
Multiplies parameters by (1 − αλ) outside the adaptive step, restoring uniform decay.
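A one-parameter sketch of that update; the names `alpha`, `lam`, etc. follow the usual AdamW notation and are assumptions, not a library API:

```python
import math

def adamw_step(w, g, m, v, t, alpha=0.01, lam=0.01,
               b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    # Decoupled weight decay: multiply w by (1 - alpha*lam) OUTSIDE the
    # adaptive step, so the decay is uniform regardless of gradient scale.
    w = w * (1 - alpha * lam) - alpha * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

# With a zero gradient, only the decay acts: w shrinks by exactly (1 - alpha*lam).
w, m, v = adamw_step(w=1.0, g=0.0, m=0.0, v=0.0, t=1)
```

"Adam + L2" would instead add λw into g, pushing the decay through the 1/√v preconditioner and making it coordinate-scaled, which is the difference the multi-select below spells out.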
Empirically, moving from Adam(+L2) to AdamW tends to:
Decouple α and λ, producing a more separable hyperparameter landscape.
Which statement about learning-rate schedules and Adam/AdamW is best supported by the readings?
Cosine annealing often improves Adam/AdamW and can widen performance gaps over L2 coupling.
In Adam, bias correction is most important in early steps because the moving averages start at zero.
True
AdamW applies weight decay independently of the adaptive gradient scaling, which makes decay more uniform across parameters.
True
(Multi-select) Which statements correctly describe why L2 and weight decay differ under Adam?
With Adam, L2 regularization is applied through the adaptive preconditioner, so decay is coordinate-scaled.
Weight decay (in AdamW) is applied directly to parameters, not through the adaptive moment terms.
Under plain SGD (uniform scaling), L2 and weight decay can be equivalent.
(Multi-select) Which hyperparameters are primarily involved in the AdamW update?
α (learning rate)
λ (weight decay coefficient)
β1 (first-moment momentum coefficient)
β2 (second-moment coefficient)
Entropy depends on:
A random variable’s distribution.
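Concretely, entropy is a functional of the distribution alone; a minimal sketch in bits:

```python
import math

def entropy(p):
    """Shannon entropy in bits of a discrete distribution p."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

uniform = entropy([0.25, 0.25, 0.25, 0.25])  # maximal for 4 outcomes: 2 bits
peaked = entropy([0.97, 0.01, 0.01, 0.01])   # predictable source, low entropy
```

The same function applied to any two distributions over the same outcomes can differ, while relabeling the outcomes changes nothing, which is the point of the answer above.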
Mutual information is the KL divergence between:
The joint distribution and the product distribution.
Kullback–Leibler divergence is almost a metric. What properties does it lack?
Symmetry and triangle inequality.
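Both missing properties are easy to demonstrate numerically; the asymmetry in particular:

```python
import math

def kl(p, q):
    """D(p || q) in bits (assumes q_i > 0 wherever p_i > 0)."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p, q = [0.5, 0.5], [0.9, 0.1]
forward = kl(p, q)   # D(p || q) ~ 0.74 bits
reverse = kl(q, p)   # D(q || p) ~ 0.53 bits -- not equal, so not symmetric
```

It does satisfy the other metric axioms in spirit: D(p || q) ≥ 0, with equality exactly when p = q.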
In a machine learning problem, minimizing the KL divergence D(p || q):
Finds a distribution q that resembles the target distribution p.
Conditioning increases entropy: H(X | Y) > H(X).
False
If two random variables are independent, their mutual information is zero.
True
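That fact drops straight out of the KL definition of mutual information; the helper below is an illustrative sketch, not a library function:

```python
import math

def mutual_information(joint, px, py):
    """I(X;Y): KL divergence between the joint and the product of marginals, in bits."""
    return sum(
        joint[i][j] * math.log2(joint[i][j] / (px[i] * py[j]))
        for i in range(len(px))
        for j in range(len(py))
        if joint[i][j] > 0
    )

px = py = [0.5, 0.5]
independent = [[0.25, 0.25], [0.25, 0.25]]    # joint == product of marginals
perfectly_coupled = [[0.5, 0.0], [0.0, 0.5]]  # X always equals Y

i_zero = mutual_information(independent, px, py)
i_one = mutual_information(perfectly_coupled, px, py)
```

When the joint equals the product of marginals, every log ratio is log(1) = 0, so the divergence, and hence the mutual information, is zero.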
KL divergence D(p || q) is symmetric: D(p || q) = D(q || p).
False
What is the name of the relationship used in the denominator of Bayes’ theorem?
Law of total probability.
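A short worked example (the disease-test numbers are invented for illustration): the denominator P(E) is expanded with the law of total probability, which is also what normalizes the posterior.

```python
prior = {"disease": 0.01, "healthy": 0.99}
likelihood = {"disease": 0.95, "healthy": 0.05}  # P(positive | hypothesis)

# Law of total probability: P(positive) = sum over h of P(positive | h) P(h)
evidence = sum(likelihood[h] * prior[h] for h in prior)

# Bayes' theorem: P(h | positive) = P(positive | h) P(h) / P(positive)
posterior = {h: likelihood[h] * prior[h] / evidence for h in prior}
```

Even with a fairly accurate test, the low base rate keeps the posterior probability of disease modest, a classic consequence of the prior term.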
We can interpret maximum likelihood estimation from a Bayesian perspective. Specifically, it uses:
A noninformative (i.e., flat) prior.
Which statements about entropy are correct? (Select ALL that apply)
Lower entropy generally means the source is more predictable and easier to compress.
Entropy depends on the probability distribution over outcomes.
Uniform distributions (over a fixed finite set) have maximal entropy.
What is the “evidence” (marginal likelihood) term used for in Bayes’ theorem?
To normalize the posterior so it sums/integrates to 1.
The MAP estimate is:
The parameter value that maximizes the posterior density.
A 95% Bayesian credible interval means:
There is a 95% chance the true parameter lies in the interval (given data and prior).
Posterior predictive inference produces:
A distribution over future observations by integrating over parameter uncertainty.
With a uniform (flat) prior, the MAP estimate equals the MLE.
True
A conjugate prior guarantees the posterior will be in the same distribution family as the prior.
True
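Beta-Binomial is the classic instance of conjugacy, and it also illustrates the flat-prior fact above (the helper name is illustrative):

```python
def beta_binomial_update(a, b, heads, tails):
    # Beta(a, b) prior + binomial data -> Beta(a + heads, b + tails) posterior,
    # in closed form: that is exactly what conjugacy buys.
    return a + heads, b + tails

a, b = beta_binomial_update(1.0, 1.0, heads=7, tails=3)  # Beta(1, 1) = flat prior

posterior_mode = (a - 1) / (a + b - 2)  # MAP estimate: mode of Beta(8, 4)
mle = 7 / 10                            # MLE of the coin bias
```

With the uniform Beta(1, 1) prior, the posterior mode lands exactly on the MLE, and no integral was needed anywhere, sidestepping the difficulties listed in the next question.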
Which are common reasons posterior inference can be hard in practice? (Select ALL that apply)
The normalizing constant (evidence) may require an intractable integral.
High-dimensional parameter spaces make exact integration difficult.
Non-conjugate models often lack closed-form posteriors.
Which methods are commonly used for approximate Bayesian inference? (Select ALL that apply)
Markov chain Monte Carlo (MCMC).
Variational inference (VI).
Laplace approximation.
The posterior becomes less sensitive to the prior as the amount of data grows large.
True