ML MASTERS


Last updated 4:04 PM on 4/21/26

190 Terms

1
New cards

What is the primary conceptual difference between a deterministic function and a non-deterministic function in supervised learning?

"In a non-deterministic function

2
New cards

"When deriving a Maximum Likelihood (ML) hypothesis for binary non-deterministic data ($0$ or $1$)

what type of error function is typically used instead of sum of squared errors?"

3
New cards

How does a standard neural network using sum of squared errors behave when trained on non-deterministic data where the same $x$ has different $y$ values?

"The network will attempt to predict the mean value of $y$ for each $x$

4
New cards

What is the structural limitation of a single-layer perceptron regarding boolean functions like XOR ($A \oplus B$)?

A single-layer perceptron can only implement linearly separable functions and cannot solve XOR because no single line can separate the outputs.

5
New cards

What is a major advantage of the gradient descent training rule over the perceptron training rule for neural networks?

Gradient descent allows for training multi-layer networks by utilizing differentiable error functions and the chain rule for backpropagation.

6
New cards

"In decision tree regression using squared error

what value is predicted at a leaf node?"

7
New cards

How does a 'lazy' version of the ID3 decision tree algorithm differ from the standard 'eager' version?

"The lazy algorithm delays building the tree until a prediction is needed for a specific test instance

8
New cards

"Between Decision Trees and Nearest-Neighbor learning

which is generally more efficient for a target function defined by a diagonal line in a 2D plane?"

9
New cards

What is the VC dimension of an origin-centered circle in a 2D plane?

The VC dimension is 2 because it can shatter any 2 points at different distances from the origin, but fails on 3 points when the labels alternate with distance from the origin (e.g., positive, negative, positive).

10
New cards

How is entropy defined in the context of information theory?

Entropy is a measure of the average uncertainty or randomness associated with a random variable's possible outcomes.
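The definition can be made concrete in a few lines of Python (illustrative helper, not part of the deck); it also shows why a fair coin carries a full bit of uncertainty while a biased one carries less:

```python
import math

def entropy(probs):
    """Shannon entropy in bits: H = -sum(p * log2(p)), skipping zero-probability outcomes."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))  # fair coin: 1.0 bit
print(entropy([0.9, 0.1]))  # biased coin: about 0.469 bits
print(entropy([1.0]))       # certain outcome: 0 bits
```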

11
New cards

Under what condition is the K-means clustering procedure considered a special case of the Expectation-Maximization (EM) algorithm?

"K-means is a special case of EM when the model assumes a mixture of Gaussians with equal

12
New cards

What does the first Principal Component in PCA represent in terms of data variance?

The first Principal Component is the direction in the feature space along which the data exhibits the maximum possible variance.

13
New cards

Which hierarchical clustering linkage method is most likely to produce 'chaining' effects, where clusters are merged based on single close points?

Single-link clustering is most prone to chaining because it defines distance as the minimum distance between any two points in different clusters.

14
New cards

"In Markov Decision Processes (MDPs)

what is the core purpose of the Bellman Equation?"

15
New cards

What distinguishes Policy Iteration from Value Iteration in Reinforcement Learning?

"Policy Iteration alternates between explicitly evaluating a fixed policy and improving it

16
New cards

What is the conceptual definition of a Nash Equilibrium in game theory?

It is a strategy profile where no player can increase their payoff by unilaterally changing their own strategy while others keep theirs fixed.
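The definition can be checked by brute force on a small game. A sketch using classic Prisoner's Dilemma payoffs (numbers assumed for illustration, not from the deck):

```python
# Brute-force pure-strategy Nash check for a 2-player game.
# Payoff tuples are (row player, column player); classic Prisoner's Dilemma values.
C, D = 0, 1  # cooperate, defect
payoff = {
    (C, C): (-1, -1),
    (C, D): (-3,  0),
    (D, C): ( 0, -3),
    (D, D): (-2, -2),
}

def is_nash(a, b):
    """True if neither player can improve by unilaterally deviating."""
    row_ok = all(payoff[(a, b)][0] >= payoff[(a2, b)][0] for a2 in (C, D))
    col_ok = all(payoff[(a, b)][1] >= payoff[(a, b2)][1] for b2 in (C, D))
    return row_ok and col_ok

equilibria = [(a, b) for a in (C, D) for b in (C, D) if is_nash(a, b)]
print(equilibria)  # [(1, 1)] -> mutual defection is the only pure Nash equilibrium
```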

17
New cards

Why is pruning applied to decision trees during the training process?

"Pruning is used to reduce overfitting by removing branches that offer little predictive power

18
New cards

What is a primary goal of cross-validation in regression analysis?

Cross-validation is used to estimate the generalization performance of the model and assist in hyperparameter tuning.

19
New cards

"In neural network training

what does the 'vanishing gradient problem' refer to?"

20
New cards

What characteristic makes instance-based learning (like KNN) sensitive to irrelevant features?

"Because it relies on similarity measures in the full feature space

21
New cards

How do ensemble methods like Random Forests or Boosting improve model generalization?

"They combine multiple diverse models to mitigate individual model weaknesses

22
New cards

What is the 'kernel trick' in Support Vector Machines (SVMs)?

The kernel trick implicitly maps data into a higher-dimensional space to find a linear separating hyperplane for non-linearly separable data.

23
New cards

"In the PAC learning framework

what does the 'confidence parameter' represent?"

24
New cards

What is the relationship between a model's VC dimension and its potential to overfit?

"A high VC dimension indicates a more complex hypothesis class with a greater capacity to fit noise

25
New cards

How can randomized optimization algorithms escape local optima?

They introduce probabilistic or random moves that allow the search to transition to inferior states temporarily to find better global solutions.
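One such scheme, random-restart hill climbing, can be sketched on a toy objective (all names and numbers assumed for illustration): a single greedy climb gets stuck on the nearer bump, while restarts find the global one.

```python
import random

def f(x):
    # Toy multimodal objective: local max f=1 at x=1, global max f=2 at x=4.
    return max(0.0, 1 - (x - 1) ** 2) + max(0.0, 2 - 2 * (x - 4) ** 2)

def hill_climb(x, step=0.1, iters=200):
    # Greedy local search: move to the best neighbor; stalls at a local optimum.
    for _ in range(iters):
        x = max([x - step, x, x + step], key=f)
    return x

stuck = hill_climb(0.8)  # climbs the nearby bump and gets stuck near x=1
random.seed(0)
restarts = [hill_climb(random.uniform(0, 5)) for _ in range(20)]
best = max(restarts, key=f)  # some restart lands in the global basin near x=4
print(round(stuck, 1), round(best, 1))
```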

26
New cards

What is the primary innovation of AdamW compared to the standard Adam optimizer?

AdamW decouples weight decay from the gradient-based adaptive update to ensure the regularization effect remains consistent regardless of the learning rate.

27
New cards

"According to Information Theory

how does the probability of an event relate to the information it carries?"

28
New cards

What does the 'likelihood' represent in Bayesian Learning?

The likelihood quantifies the probability of observing the specific training data given that a particular hypothesis is true.

29
New cards

Why is Bayesian inference considered a 'principled' way to update beliefs?

It uses probability theory to mathematically integrate prior knowledge with new evidence to form a posterior probability distribution.

30
New cards

Why is feature scaling critical for K-Means clustering?

"K-Means uses distance calculations (like Euclidean distance)

31
New cards

What is a significant risk when performing feature selection using a simple filter method?

Filter methods may discard features that appear irrelevant individually but are highly predictive when combined with other features.

32
New cards

How does UMAP differ from t-SNE in its approach to manifold learning?

"UMAP better preserves global structure and provides a faster

33
New cards

"In feature transformation

how does Independent Component Analysis (ICA) differ from PCA?"

34
New cards

What is the definition of a 'greedy' policy in Reinforcement Learning?

A greedy policy is one where the agent always selects the action that maximizes the immediate expected reward or state-action value.

35
New cards

"In MDPs

what does a discount factor ($\gamma$) near 0 imply about the agent's behavior?"

36
New cards

What is the purpose of the 'Experience Replay' buffer in Deep Q-Networks (DQN)?

"It stores past transitions to allow the model to sample randomly

37
New cards

How do Dueling Networks improve the architecture of a standard DQN?

"They split the network into two streams to separately estimate the state value and the advantage of each action

38
New cards

What is the function of 'NoisyNets' in the Rainbow DQN framework?

NoisyNets inject learned noise into the network weights to encourage state-dependent exploration without relying on a fixed $\epsilon$-greedy schedule.

39
New cards

Why might a general-sum stochastic game lack a unique $Q^*$ solution?

"Unlike zero-sum games

40
New cards

"In the context of clustering

what is the 'complete link' criteria?"

41
New cards

What does Mutual Information measure between two random variables?

"It measures the amount of information obtained about one random variable by observing the other

42
New cards

How does prioritized experience replay change the sampling process in reinforcement learning?

"It samples transitions with high temporal-difference (TD) error more frequently

43
New cards

What is the 'out-of-sample mapping' problem in manifold learning?

It is the difficulty of determining where a new data point should be placed in a low-dimensional embedding without re-running the entire algorithm.

44
New cards

"In Bayesian Learning

what is the Maximum A Posteriori (MAP) hypothesis?"

45
New cards

What is 'Sample Complexity' in Computational Learning Theory?

Sample complexity is the number of training examples required for a learning algorithm to achieve a specified level of accuracy with high confidence.

46
New cards

Why can Decision Trees be used for classification even if they are 'axis-aligned'?

"They can approximate complex boundaries by creating a large number of hierarchical

47
New cards

What is the 'Advantage' function in Dueling DQN architectures?

The advantage function measures the relative importance of a specific action compared to the average value of all actions in that state.

48
New cards

"In K-Means clustering

what does 'proper initialization' help to avoid?"

49
New cards

How does 'weight decay' act as a form of regularization in neural networks?

"It penalizes large weights by adding their magnitude to the loss function

50
New cards

What is the primary conceptual use of KL Divergence in machine learning?

"KL Divergence is used to measure the difference or 'distance' between two probability distributions

51
New cards
What is the primary function of elitism in Genetic Algorithms (GAs)?
It copies the top performing individuals unchanged to the next generation to preserve the best-found fitness.
52
New cards
"In Genetic Algorithms
what is a potential consequence of excessive selection pressure?"
53
New cards
True or False: Genetic Algorithms are strictly limited to binary encodings.
False; real-valued representations are valid and frequently used in GA implementations.
54
New cards
How does the MIMIC algorithm learn to generate new candidate solutions?
It fits a probabilistic model (often a tree-structured distribution) to a selected set of elite samples.
55
New cards
Under what condition does the MIMIC algorithm reduce to an Estimation of Distribution Algorithm (EDA) with factorized marginals?
This occurs when full independence across all variables is assumed.
56
New cards
"In Simulated Annealing
what is the acceptance probability for a 'downhill' move that decreases cost ($\Delta E < 0$)?"
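The Metropolis-style acceptance rule behind this card can be sketched as follows (helper name assumed; uphill moves accepted with probability $e^{-\Delta E / T}$):

```python
import math
import random

def accept(delta_e, T):
    """Simulated annealing acceptance: downhill moves always accepted,
    uphill moves accepted with probability exp(-delta_e / T)."""
    if delta_e < 0:  # cost decreases: always accept
        return True
    return random.random() < math.exp(-delta_e / T)

# High temperature: uphill moves are likely accepted; low temperature: rarely.
print(math.exp(-1 / 10.0))  # acceptance prob for dE=1 at T=10: about 0.905
print(math.exp(-1 / 0.1))   # acceptance prob for dE=1 at T=0.1: about 4.5e-5
```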
57
New cards
How does the temperature $T$ in Simulated Annealing influence the acceptance of 'uphill' moves?
"High temperatures increase the probability of accepting uphill moves to encourage exploration
58
New cards
What is a key conceptual difference between Heavy-ball momentum and Nesterov Accelerated Gradient (NAG)?
"Heavy-ball uses the current gradient
59
New cards
Why is 'bias correction' used in the Adam optimizer?
It counters the zero-initialization bias of the first and second moment moving averages during early training steps.
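The effect of bias correction can be seen numerically: with a constant gradient, the raw moving average starts far below the true value, while the corrected estimate recovers it from the first step (toy values assumed):

```python
# Adam's first-moment estimate with a constant gradient g = 1.0.
beta1 = 0.9
m = 0.0   # moving average initialized at zero -> biased low early on
g = 1.0

for t in range(1, 4):
    m = beta1 * m + (1 - beta1) * g
    m_hat = m / (1 - beta1 ** t)  # bias-corrected estimate
    print(t, round(m, 3), round(m_hat, 6))  # m starts at 0.1; m_hat stays at ~1.0
```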
60
New cards
How does AdamW differ from standard Adam regarding weight decay?
AdamW decouples weight decay from the adaptive gradient updates to ensure it behaves consistently regardless of the learning rate.
61
New cards
"In Information Theory
how does the entropy of a biased coin compare to that of a fair coin?"
62
New cards
Under what condition does the mutual information $I(X; Y)$ in a Binary Symmetric Channel become zero?
It becomes zero when the bit-flip probability $\epsilon$ is exactly 0.5 and $X$ is uniform.
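This can be checked numerically: for a Binary Symmetric Channel with uniform input, $I(X;Y) = 1 - H(\epsilon)$, where $H$ is the binary entropy (function names my own):

```python
import math

def h2(p):
    """Binary entropy in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def bsc_mutual_information(eps):
    """I(X;Y) for a Binary Symmetric Channel with uniform input X: I = 1 - H(eps)."""
    return 1.0 - h2(eps)

print(bsc_mutual_information(0.0))  # 1.0: noiseless channel carries a full bit
print(bsc_mutual_information(0.5))  # 0.0: output is independent of input
```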
63
New cards
"In Bayesian Learning
what occurs to the Maximum A Posteriori (MAP) estimate if the prior $P(h)$ is uniform over all hypotheses?"
64
New cards
What defines the Bayes-optimal decision boundary when class-conditional distributions are Gaussians with a shared covariance matrix?
The resulting decision boundary is linear in the feature space $x$.
65
New cards
How does increasing the prior probability $P(y=1)$ for a class affect its posterior probability $P(y=1 | x)$?
The posterior probability increases monotonically as the prior for that class increases.
66
New cards
What is the core assumption of the Naive Bayes classifier regarding features?
It assumes all features are conditionally independent given the class label.
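Under that assumption the joint likelihood factorizes into a product of per-feature terms. A toy sketch (all numbers and names assumed for illustration):

```python
from math import prod

# Naive Bayes: posterior proportional to prior * product of per-feature likelihoods.
priors = {'spam': 0.4, 'ham': 0.6}
# P(feature present | class), one entry per feature -- conditional independence assumed.
likelihoods = {
    'spam': [0.8, 0.6],
    'ham':  [0.1, 0.3],
}

def posterior_scores(cls_priors, cls_likelihoods):
    scores = {c: cls_priors[c] * prod(cls_likelihoods[c]) for c in cls_priors}
    z = sum(scores.values())          # normalize so posteriors sum to 1
    return {c: s / z for c, s in scores.items()}

post = posterior_scores(priors, likelihoods)
print(max(post, key=post.get))  # 'spam': 0.4*0.8*0.6 beats 0.6*0.1*0.3
```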
67
New cards
"In a Markov Decision Process (MDP)
what does the transition function $P(s' | s
68
New cards
What is a primary advantage of model-based reinforcement learning over model-free methods?
Model-based methods are generally more sample efficient because they can plan using a learned model of the environment.
69
New cards
Why might a model-based RL agent perform poorly compared to a model-free agent in a complex environment?
Its performance is strictly limited by the accuracy of its learned internal model of the environment.
70
New cards
How do SARSA and Q-Learning differ in their update rules?
"SARSA is on-policy (updates based on the actual next action)
71
New cards
What is the 'Folk Theorem' in repeated games?
"It suggests that many outcomes
72
New cards
"In Game Theory
how does a 'Grim Trigger' strategy function?"
73
New cards
"What is the 'Pavlov' (Win-Stay
Lose-Shift) strategy in a repeated game?"
74
New cards
"In the ID3 algorithm
what does a leaf node with zero entropy signify?"
75
New cards
How does L2 regularization (Ridge Regression) help mitigate multicollinearity?
"It penalizes large weights
76
New cards
"In classification
why is accuracy often a misleading metric for imbalanced datasets?"
77
New cards
What is the conceptual relationship between $k$ in $k$-Nearest Neighbors and model bias/variance?
"Small $k$ values result in low bias but high variance
78
New cards
Why can a single-layer perceptron not solve the XOR problem?
"The XOR function is not linearly separable
79
New cards
What is required in a neural network to represent non-linear functions like XOR?
The network must have at least one hidden layer with a non-linear activation function.
80
New cards
What does PAC (Probably Approximately Correct) learning guarantee with probability $1 - \delta$?
It guarantees that the learned hypothesis will have true (generalization) error of at most $\epsilon$.
81
New cards
How does the expressiveness of a hypothesis class $\mathcal{H}$ affect sample complexity in PAC learning?
More expressive hypothesis classes (larger $|\mathcal{H}|$ or higher VC dimension) require more training samples to achieve the same error bounds.
82
New cards
What is the VC dimension of a hypothesis class?
It is the size of the largest set of points that can be 'shattered' (perfectly classified in all possible ways) by the class.
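Shattering can be checked by brute force. A sketch for 1-D threshold classifiers, $h(x) = 1$ iff $x > \theta$, whose VC dimension is 1 (candidate thresholds are chosen to cover all distinct labelings of these well-separated points; helper name my own):

```python
from itertools import product

def can_shatter(points):
    """True if threshold classifiers h(x) = 1 iff x > t realize every labeling."""
    # One threshold below all points plus one just above each point covers
    # every distinct behavior on well-separated inputs.
    thresholds = [min(points) - 1] + [p + 0.5 for p in points]
    for labels in product([0, 1], repeat=len(points)):
        if not any(all((1 if x > t else 0) == y for x, y in zip(points, labels))
                   for t in thresholds):
            return False
    return True

print(can_shatter([0.0]))       # True: one point can be labeled either way
print(can_shatter([0.0, 1.0]))  # False: labeling (1, 0) is unachievable -> VC dim is 1
```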
83
New cards
Why might K-Means clustering be biased toward larger clusters?
"It uses Euclidean distance and minimizes the sum-of-squares
84
New cards
How do Gaussian Mixture Models (GMMs) handle cluster sizes better than K-Means?
"GMMs learn separate covariance structures and mixing coefficients
85
New cards
What is the 'chaining' property in single-linkage hierarchical clustering?
"It refers to the tendency of the algorithm to merge clusters that are connected by a series of points
86
New cards
"In feature selection
what is the difference between 'relevance' and 'usefulness'?"
87
New cards
Compare 'Filter' and 'Wrapper' methods for feature selection.
"Filters rank features based on statistical properties (fast)
88
New cards
What is a significant risk of using 'Wrapper' methods on small datasets?
They are highly prone to overfitting because they optimize the feature subset specifically for the training/validation performance.
89
New cards
What is the primary goal of Principal Component Analysis (PCA)?
To reduce dimensionality by transforming data into orthogonal components that capture the maximum variance.
90
New cards
How does Independent Component Analysis (ICA) differ from PCA in terms of the components it seeks?
"PCA seeks orthogonal components that maximize variance
91
New cards
"In manifold learning
how does Isomap approximate 'geodesic' distance?"
92
New cards
Why is t-SNE generally considered poor for quantitative retrieval tasks in production?
"It distorts global geometry and cluster areas
93
New cards
What is a major conceptual advantage of UMAP over t-SNE?
UMAP generally preserves more of the global structure and is theoretically more grounded in manifold topology.
94
New cards
What does 'subgame perfection' require in a multi-stage game?
"It requires that the strategy forms a Nash equilibrium in every possible subgame
95
New cards
"In the context of AdamW
why is 'weight decay' applied directly to the weights rather than the gradient?"
96
New cards
How does the 'epsilon' parameter in RMSprop/Adam affect the update when placed outside the square root?
Placing it outside helps stabilize the step size when the squared gradient accumulator $v_t$ is very small.
97
New cards
"In PAC learning
what is the impact of reducing the confidence parameter $\delta$ (delta)?"
98
New cards
What is the VC dimension of linear hyperplanes in $\mathbb{R}^{10}$?
"The VC dimension is $d + 1$
99
New cards
Why might density-based clustering like DBSCAN fail on datasets with clusters of varying densities?
"DBSCAN uses a global density threshold (epsilon and MinPts)
100
New cards
What is 'embedded' feature selection?
"Feature selection that occurs automatically during the model training process