What is the primary conceptual difference between a deterministic function and a non-deterministic function in supervised learning?
In a non-deterministic function, …
When deriving a Maximum Likelihood (ML) hypothesis for binary non-deterministic data ($0$ or $1$), what type of error function is typically used instead of sum of squared errors?
How does a standard neural network using sum of squared errors behave when trained on non-deterministic data where the same $x$ has different $y$ values?
The network will attempt to predict the mean value of $y$ for each $x$, …
What is the structural limitation of a single-layer perceptron regarding boolean functions like XOR ($A \oplus B$)?
A single-layer perceptron can only implement linearly separable functions and cannot solve XOR because no single line can separate the outputs.
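The limitation can be checked empirically. The following is a minimal sketch, assuming a two-input threshold unit trained with the perceptron rule; the function name `train_perceptron` and the epoch limit are illustrative choices, not part of the card.

```python
def train_perceptron(samples, epochs=100, lr=1.0):
    """Perceptron training rule on 2-input data; returns (weights, bias, converged)."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for (x1, x2), target in samples:
            output = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
            delta = target - output
            if delta != 0:
                # Move the decision line toward the misclassified point.
                w[0] += lr * delta * x1
                w[1] += lr * delta * x2
                b += lr * delta
                errors += 1
        if errors == 0:
            return w, b, True  # a full pass with no mistakes
    return w, b, False


AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

_, _, and_converged = train_perceptron(AND)
_, _, xor_converged = train_perceptron(XOR)
print("AND converged:", and_converged)
print("XOR converged:", xor_converged)
```

AND is linearly separable, so the rule reaches an error-free pass; for XOR no weight setting classifies all four points, so the loop runs out of epochs.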
What is a major advantage of the gradient descent training rule over the perceptron training rule for neural networks?
Gradient descent allows for training multi-layer networks by utilizing differentiable error functions and the chain rule for backpropagation.
In decision tree regression using squared error, what value is predicted at a leaf node?
How does a 'lazy' version of the ID3 decision tree algorithm differ from the standard 'eager' version?
The lazy algorithm delays building the tree until a prediction is needed for a specific test instance, …
Between Decision Trees and Nearest-Neighbor learning, which is generally more efficient for a target function defined by a diagonal line in a 2D plane?
What is the VC dimension of an origin-centered circle in a 2D plane?
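The VC dimension is 2, assuming the hypothesis may label either the inside or the outside of the circle as positive: any 2 points at distinct radii can be shattered, but no 3 points can, because a labeling that alternates with distance from the origin (for example, inner and outer points positive, middle point negative) cannot be produced.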
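A brute-force check can make the shattering argument concrete. This sketch assumes the hypothesis class is "origin-centered circle with either side labeled positive", so only each point's distance from the origin matters; the helper names are hypothetical and the radii are assumed to be positive.

```python
from itertools import product


def achievable_labelings(radii):
    """Labelings of points (given by positive distance from the origin) that an
    origin-centered circle can produce when either side may be the positive one."""
    labelings = set()
    # One candidate circle radius per distinct way of cutting the points.
    for circle_radius in [0.0] + list(radii):
        inside_positive = tuple(r <= circle_radius for r in radii)
        labelings.add(inside_positive)
        labelings.add(tuple(not label for label in inside_positive))
    return labelings


def is_shattered(radii):
    every_labeling = set(product([False, True], repeat=len(radii)))
    return achievable_labelings(radii) == every_labeling


print(is_shattered([1.0, 2.0]))
print(is_shattered([1.0, 2.0, 3.0]))
```

For three points the alternating labelings are missing from the achievable set, and points at equal radii can never be labeled differently, so no three-point set is shattered under this assumption.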
How is entropy defined in the context of information theory?
Entropy is a measure of the average uncertainty or randomness associated with a random variable's possible outcomes.
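For a discrete variable this is usually computed as $H(X) = -\sum_x p(x) \log_2 p(x)$, measured in bits. A small sketch, with example distributions chosen only for illustration:

```python
import math


def entropy(probabilities):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


fair_coin = entropy([0.5, 0.5])    # maximal uncertainty for two outcomes
biased_coin = entropy([0.9, 0.1])  # more predictable, so lower entropy
certain = entropy([1.0])           # no uncertainty at all
print(fair_coin, biased_coin, certain)
```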
Under what condition is the K-means clustering procedure considered a special case of the Expectation-Maximization (EM) algorithm?
K-means is a special case of EM when the model assumes a mixture of Gaussians with equal, …
What does the first Principal Component in PCA represent in terms of data variance?
The first Principal Component is the direction in the feature space along which the data exhibits the maximum possible variance.
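One way to see this is to compute the leading eigenvector of the covariance matrix. The sketch below assumes 2D data and uses power iteration; the sample points and the function name are made up for illustration.

```python
import math


def first_principal_component(points, iterations=100):
    """Unit vector of maximum variance for 2D points, via power iteration
    on the 2x2 sample covariance matrix."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]
    sxx = sum(x * x for x, _ in centered) / (n - 1)
    sxy = sum(x * y for x, y in centered) / (n - 1)
    syy = sum(y * y for _, y in centered) / (n - 1)
    v = (1.0, 0.5)  # arbitrary starting direction
    for _ in range(iterations):
        v = (sxx * v[0] + sxy * v[1], sxy * v[0] + syy * v[1])
        norm = math.hypot(v[0], v[1])
        v = (v[0] / norm, v[1] / norm)
    return v


# Hypothetical points scattered around the line y = x.
points = [(0.0, 0.1), (1.0, 0.9), (2.0, 2.1), (3.0, 2.9)]
pc1 = first_principal_component(points)
print(pc1)
```

The returned direction lies close to the diagonal, which is where these points vary the most.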
Which hierarchical clustering linkage method is most likely to produce 'chaining' effects where clusters are merged based on single close points?
Single-link clustering is most prone to chaining because it defines distance as the minimum distance between any two points in different clusters.
In Markov Decision Processes (MDPs), what is the core purpose of the Bellman Equation?
What distinguishes Policy Iteration from Value Iteration in Reinforcement Learning?
Policy Iteration alternates between explicitly evaluating a fixed policy and improving it, …
What is the conceptual definition of a Nash Equilibrium in game theory?
It is a strategy profile where no player can increase their payoff by unilaterally changing their own strategy while others keep theirs fixed.
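The definition translates directly into a check over unilateral deviations. This sketch uses a Prisoner's Dilemma with hypothetical payoff numbers and only looks for pure-strategy equilibria.

```python
from itertools import product

# payoffs[(row_action, col_action)] = (row_payoff, col_payoff); illustrative values.
payoffs = {
    ("cooperate", "cooperate"): (-1, -1),
    ("cooperate", "defect"): (-3, 0),
    ("defect", "cooperate"): (0, -3),
    ("defect", "defect"): (-2, -2),
}
actions = ["cooperate", "defect"]


def is_pure_nash(row, col):
    """True if neither player gains by changing only their own action."""
    row_ok = all(payoffs[(row, col)][0] >= payoffs[(alt, col)][0] for alt in actions)
    col_ok = all(payoffs[(row, col)][1] >= payoffs[(row, alt)][1] for alt in actions)
    return row_ok and col_ok


equilibria = [pair for pair in product(actions, actions) if is_pure_nash(*pair)]
print(equilibria)  # [('defect', 'defect')]
```

Mutual cooperation pays both players more, yet it is not an equilibrium because each player can gain by defecting alone.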
Why is pruning applied to decision trees during the training process?
Pruning is used to reduce overfitting by removing branches that offer little predictive power, …
What is a primary goal of cross-validation in regression analysis?
Cross-validation is used to estimate the generalization performance of the model and assist in hyperparameter tuning.
In neural network training, what does the 'vanishing gradient problem' refer to?
What characteristic makes instance-based learning (like KNN) sensitive to irrelevant features?
Because it relies on similarity measures in the full feature space, …
How do ensemble methods like Random Forests or Boosting improve model generalization?
They combine multiple diverse models to mitigate individual model weaknesses, …
What is the 'kernel trick' in Support Vector Machines (SVMs)?
The kernel trick implicitly maps data into a higher-dimensional space to find a linear separating hyperplane for non-linearly separable data.
In the PAC learning framework, what does the 'confidence parameter' represent?
What is the relationship between a model's VC dimension and its potential to overfit?
A high VC dimension indicates a more complex hypothesis class with a greater capacity to fit noise, …
How can randomized optimization algorithms escape local optima?
They introduce probabilistic or random moves that allow the search to transition to inferior states temporarily to find better global solutions.
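Simulated annealing is one example of this idea. The sketch below compares it with plain hill climbing on a hypothetical 1D landscape; the landscape, temperature schedule, and seed are all assumptions made for illustration.

```python
import math
import random

# Hypothetical fitness landscape: local peak at index 2, global peak at index 7.
FITNESS = [0, 1, 2, 1, 2, 3, 4, 5]


def hill_climb(start):
    """Greedy ascent: stops as soon as no neighbor is strictly better."""
    current = start
    while True:
        neighbors = [i for i in (current - 1, current + 1) if 0 <= i < len(FITNESS)]
        best = max(neighbors, key=lambda i: FITNESS[i])
        if FITNESS[best] <= FITNESS[current]:
            return current
        current = best


def simulated_annealing(start, steps=2000, temperature=1.0, cooling=0.999, seed=0):
    """Accepts worse neighbors with probability exp(delta / temperature);
    returns the best state visited."""
    rng = random.Random(seed)
    current = best = start
    for _ in range(steps):
        candidate = current + rng.choice((-1, 1))
        if not 0 <= candidate < len(FITNESS):
            continue
        delta = FITNESS[candidate] - FITNESS[current]
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            current = candidate
        if FITNESS[current] > FITNESS[best]:
            best = current
        temperature *= cooling
    return best


greedy_result = hill_climb(0)
annealed_result = simulated_annealing(0)
print(FITNESS[greedy_result], FITNESS[annealed_result])
```

Hill climbing halts on the first peak, while the occasional accepted downhill step lets the annealing search cross the dip and continue toward the higher peak.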
What is the primary innovation of AdamW compared to the standard Adam optimizer?
AdamW decouples weight decay from the gradient-based adaptive update: the decay is applied directly to the weights rather than added to the gradient, so it is not rescaled by Adam's per-parameter second-moment normalization.
According to Information Theory, how does the probability of an event relate to the information it carries?
What does the 'likelihood' represent in Bayesian Learning?
The likelihood quantifies the probability of observing the specific training data given that a particular hypothesis is true.
Why is Bayesian inference considered a 'principled' way to update beliefs?
It uses probability theory to mathematically integrate prior knowledge with new evidence to form a posterior probability distribution.
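A worked update shows prior, likelihood, and posterior fitting together through Bayes' rule, $P(h \mid D) \propto P(D \mid h)\,P(h)$. The two coin hypotheses and their numbers are invented for illustration.

```python
def posterior(prior, likelihood):
    """Bayes' rule over a finite hypothesis set: normalize prior * likelihood."""
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    evidence = sum(unnormalized.values())
    return {h: value / evidence for h, value in unnormalized.items()}


# Hypothetical hypotheses about a coin, with P(heads | h) for each.
prior = {"fair": 0.5, "biased": 0.5}
p_heads = {"fair": 0.5, "biased": 0.8}

# Observed data: three heads in a row, so P(D | h) = P(heads | h) ** 3.
likelihood = {h: p_heads[h] ** 3 for h in prior}
post = posterior(prior, likelihood)
print(post)
```

After three heads the belief in the biased coin rises from 0.5 to roughly 0.8, while the two posterior probabilities still sum to 1.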
Why is feature scaling critical for K-Means clustering?
K-Means uses distance calculations (like Euclidean distance), …
What is a significant risk when performing feature selection using a simple filter method?
Filter methods may discard features that appear irrelevant individually but are highly predictive when combined with other features.
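An XOR-style label is the classic case. In this sketch, which assumes mutual information as the filter score and uses a tiny made-up dataset, each feature scores zero alone while the pair determines the label completely.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information in bits between two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    return sum(
        (count / n) * math.log2((count / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), count in joint.items()
    )


rows = [(0, 0), (0, 1), (1, 0), (1, 1)]
feature_a = [a for a, _ in rows]
feature_b = [b for _, b in rows]
label = [a ^ b for a, b in rows]  # label is the XOR of the two features

mi_a = mutual_information(feature_a, label)
mi_b = mutual_information(feature_b, label)
mi_joint = mutual_information(rows, label)
print(mi_a, mi_b, mi_joint)
```

A filter that scores features one at a time would drop both, even though together they carry a full bit of information about the label.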
How does UMAP differ from t-SNE in its approach to manifold learning?
UMAP better preserves global structure and provides a faster, …
In feature transformation, how does Independent Component Analysis (ICA) differ from PCA?
What is the definition of a 'greedy' policy in Reinforcement Learning?
A greedy policy is one where the agent always selects the action that maximizes the immediate expected reward or state-action value.
In MDPs, what does a discount factor ($\gamma$) near 0 imply about the agent's behavior?
What is the purpose of the 'Experience Replay' buffer in Deep Q-Networks (DQN)?
It stores past transitions to allow the model to sample randomly, …
How do Dueling Networks improve the architecture of a standard DQN?
They split the network into two streams to separately estimate the state value and the advantage of each action, …
What is the function of 'NoisyNets' in the Rainbow DQN framework?
NoisyNets inject learned noise into the network weights to encourage state-dependent exploration without relying on a fixed $\epsilon$-greedy schedule.
Why might a general-sum stochastic game lack a unique $Q^*$ solution?
Unlike zero-sum games, …
In the context of clustering, what is the 'complete link' criterion?
What does Mutual Information measure between two random variables?
It measures the amount of information obtained about one random variable by observing the other, …
How does prioritized experience replay change the sampling process in reinforcement learning?
It samples transitions with high temporal-difference (TD) error more frequently, …
What is the 'out-of-sample mapping' problem in manifold learning?
It is the difficulty of determining where a new data point should be placed in a low-dimensional embedding without re-running the entire algorithm.
In Bayesian Learning, what is the Maximum A Posteriori (MAP) hypothesis?
What is 'Sample Complexity' in Computational Learning Theory?
Sample complexity is the number of training examples required for a learning algorithm to achieve a specified level of accuracy with high confidence.
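For a concrete number, one standard bound for a consistent learner over a finite hypothesis class $H$ is $m \ge \frac{1}{\epsilon}\left(\ln|H| + \ln\frac{1}{\delta}\right)$, where $\epsilon$ is the error tolerance and $\delta$ the allowed failure probability. That bound is brought in here for illustration and is not stated on the card; the function name is hypothetical.

```python
import math


def sample_bound(hypothesis_count, epsilon, delta):
    """Examples sufficient for a consistent learner over a finite hypothesis
    class to reach error <= epsilon with probability >= 1 - delta."""
    return math.ceil((math.log(hypothesis_count) + math.log(1 / delta)) / epsilon)


m = sample_bound(hypothesis_count=2 ** 10, epsilon=0.1, delta=0.05)
m_tighter = sample_bound(hypothesis_count=2 ** 10, epsilon=0.01, delta=0.05)
print(m, m_tighter)
```

Tightening the error tolerance by a factor of ten raises the requirement roughly tenfold, while the hypothesis count and the confidence level enter only logarithmically.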
Why can Decision Trees be used for classification even if they are 'axis-aligned'?
They can approximate complex boundaries by creating a large number of hierarchical, …
What is the 'Advantage' function in Dueling DQN architectures?
The advantage function measures the relative importance of a specific action compared to the average value of all actions in that state.
In K-Means clustering, what does 'proper initialization' help to avoid?
How does 'weight decay' act as a form of regularization in neural networks?
It penalizes large weights by adding their magnitude to the loss function, …
What is the primary conceptual use of KL Divergence in machine learning?
KL Divergence is used to measure the difference or 'distance' between two probability distributions, …