75 practice flashcards covering the core concepts of CSCI-B 455: Machine Learning, focused on SVMs, Ensemble Learning, CNNs, Transformers, and Unsupervised Learning.
What characterizes Supervised Learning systems?
The training data includes labels (desired outputs), and the algorithm learns to map inputs to outputs.
Which ML system type discovers hidden structures in training data without provided labels?
Unsupervised Learning.
How does Reinforcement Learning operate?
An agent receives rewards or penalties for its actions and learns to maximize total reward.
What is batch learning?
Trains on the entire dataset at once.
What is online learning?
Trains incrementally on individual instances or mini-batches; the learning rate parameter controls how quickly it adapts.
What is instance-based learning?
Memorizes training examples and generalizes to new cases using a similarity measure
How can Overfitting be fixed?
By using regularization, a simpler model, or more training data.
How is a confusion matrix laid out, reading down each column (first column, then second)?
Column 1: True Positives, False Positives. Column 2: False Negatives, True Negatives.
What is the formula for Precision in classification metrics?
TP / (TP + FP)
What is the formula for Recall?
TP / (TP + FN)
What is the formula for F1 score?
2 × (precision × recall) / (precision + recall)
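These three metric formulas can be sanity-checked with a few lines of Python (the TP/FP/FN counts below are invented for illustration):

```python
# Precision, recall, and F1 from confusion-matrix counts.
# tp, fp, fn are hypothetical counts chosen for illustration.
tp, fp, fn = 8, 2, 4

precision = tp / (tp + fp)   # 8 / 10 = 0.8
recall = tp / (tp + fn)      # 8 / 12 ≈ 0.667
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.8 0.667 0.727
```

Because F1 is a harmonic mean, it sits closer to the smaller of precision and recall.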
What does ROC curve plot?
True Positive Rate (Recall) vs. False Positive Rate at various thresholds
What is the importance of AUC?
Calculates the area under the ROC curve. AUC of 1.0 is perfect, 0.5 is random
One-vs-Rest (OvR)
Train one binary classifier per class, each distinguishing class N from all other classes (N classifiers total).
One-vs-One (OvO)
Train one binary classifier for every pair of classes.
What is the normal equation (linear regression)?
θ = (X^T X)^(-1) X^T y
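A minimal NumPy sketch of the normal equation, using a synthetic dataset (y = 4 + 3x, chosen so the recovered parameters are easy to check):

```python
import numpy as np

# Normal equation: theta = (X^T X)^(-1) X^T y
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 4 + 3 * x                    # synthetic noise-free targets
X = np.c_[np.ones_like(x), x]    # prepend a bias column of ones

theta = np.linalg.inv(X.T @ X) @ X.T @ y
print(theta)   # ≈ [4. 3.] — intercept and slope
```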
What is Gradient Descent?
An iterative algorithm that "walks downhill" on the cost function surface by following the negative gradient. The learning rate parameter controls the size of each step.
What happens if the learning rate (η) is set too large in Gradient Descent?
The algorithm may overshoot the minimum and diverge.
Stochastic Gradient Descent
The gradient is computed on one random instance at each step.
Polynomial Regression
Fitting nonlinear data using a linear model by adding polynomial features.
What is the primary risk of using high-degree polynomials in Polynomial Regression?
They overfit the training data easily.
What is Ridge Regularization (L2)?
Adds a penalty proportional to the squares of the coefficients, shrinking them toward zero.
What is Lasso Regularization (L1)?
Adds a penalty on the absolute values of the coefficients; tends to zero out the least important features (feature selection).
What is Elastic Net Regularization?
A weighted combination of the Ridge and Lasso penalties.
What is Logistic Regression?
Estimates the probability that an instance belongs to one of two classes.
What is Cross Entropy in logistic regression?
A function that measures how well predicted probabilities match true labels: if the true label is 1 the loss is −log(ŷ); if the true label is 0 the loss is −log(1−ŷ).
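The two branches of the per-instance loss can be sketched in plain Python (the probabilities below are invented for illustration):

```python
import math

# Per-instance log loss: -log(y_hat) if y == 1, else -log(1 - y_hat).
def log_loss_one(y, y_hat):
    return -math.log(y_hat) if y == 1 else -math.log(1 - y_hat)

print(log_loss_one(1, 0.9))   # ≈ 0.105 — confident and correct: small loss
print(log_loss_one(0, 0.9))   # ≈ 2.303 — confident and wrong: large loss
```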
Is the Log Loss cost function for Logistic Regression convex or non-convex?
It is convex - Gradient Descent is guaranteed to find the global minimum.
Softmax Regression
Generalizes Logistic Regression to support multiple classes
What ensures that all probabilities produced by the Softmax function sum to 1?
Each class score is exponentiated and divided by the sum of all the exponentiated scores (normalization).
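A small NumPy sketch of the softmax normalization (the scores are invented for illustration):

```python
import numpy as np

# Softmax: exponentiate each score, then divide by the sum of exponentials.
# Subtracting the max first is a common numerical-stability trick.
def softmax(scores):
    exps = np.exp(scores - np.max(scores))
    return exps / exps.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
print(p, p.sum())   # three probabilities that sum to 1
```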
What is a Support Vector Machine (SVM)?
Finding the decision boundary that separates two classes while maximizing the margin width between them.
What determines the decision boundary in an SVM model?
The support vectors (the instances sitting exactly on the margin boundaries).
In SVM models, how does a large hyperparameter C affect the margin?
It results in a narrower margin with fewer violations, which can lead to overfitting.
What is a Hard Margin in SVM?
Requires all instances to be correctly classified and outside the margin; only possible when the data is linearly separable.
What is a Soft Margin in SVM?
Allows some margin violations (instances inside the margin or misclassified) in exchange for a wider margin.
What is the 'Kernel Trick' in the context of SVMs?
Computing dot products as if the data had been mapped to a high-dimensional feature space, without ever performing the mapping - saves computational cost.
What is the Gaussian RBF kernel?
Measures how close two points are and enables nonlinear decision boundaries by giving higher values to nearby points.
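The kernel fits in one line of NumPy (the gamma value and points are assumed for illustration):

```python
import numpy as np

# Gaussian RBF kernel: K(a, b) = exp(-gamma * ||a - b||^2).
def rbf_kernel(a, b, gamma=1.0):
    return np.exp(-gamma * np.sum((a - b) ** 2))

x = np.array([0.0, 0.0])
print(rbf_kernel(x, x))                      # 1.0 — identical points
print(rbf_kernel(x, np.array([3.0, 4.0])))   # ≈ 1.4e-11 — distant points
```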
What is SVM Regression?
Fit as many instances as possible inside the margin
What is a Decision Tree?
Makes predictions by asking a sequence of yes/no questions about features
What is Gini impurity (decision trees)?
Quantifies how "mixed" a set of classes is: 0 for a pure node, higher for more mixed nodes.
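Gini impurity is simple enough to compute by hand; a short Python check (the class counts are invented):

```python
# Gini impurity: 1 minus the sum of squared class proportions.
def gini(counts):
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

print(gini([10, 0]))   # 0.0 — a pure node
print(gini([5, 5]))    # 0.5 — maximally mixed for two classes
```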
How does a Decision Tree calculate the probability that an instance belongs to a specific class?
By computing the fraction of training instances of that class in the leaf node where the instance ends up (k/n, the class count over the leaf total).
Why are decision trees unstable?
Their decision boundaries are orthogonal (axis-aligned), making them sensitive to rotations of the data; small changes in the training data can also produce very different trees.
What is the CART cost function in decision trees?
A greedy criterion: for regression it minimizes the sum of squared errors within each region (for classification, the weighted impurity of the two subsets).
What is ensemble learning?
Combining multiple ML models to produce a single stronger model
What is the difference between Hard Voting and Soft Voting in ensemble learning?
Hard Voting uses a majority win based on class labels, while Soft Voting averages predicted probabilities and picks the highest average.
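The difference is easy to see on a single instance; the three classifier probabilities below are invented for illustration:

```python
import numpy as np

# Predicted probabilities of class 1 from three hypothetical classifiers.
probs = np.array([0.45, 0.45, 0.90])

# Hard voting: each classifier casts a 0/1 vote at threshold 0.5.
hard = int(np.round(probs).sum() > len(probs) / 2)   # votes 0, 0, 1 → class 0

# Soft voting: average the probabilities, then threshold.
soft = int(probs.mean() > 0.5)                        # mean 0.6 → class 1

print(hard, soft)   # 0 1 — the two schemes can disagree
```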
What is bagging in Ensemble Learning?
Training each predictor on a random subset of the data sampled with replacement; aggregating their predictions reduces variance and overfitting.
How are predictors and weights handled in AdaBoost?
Each new predictor corrects its predecessor by increasing the weights of the training instances the predecessor misclassified; each predictor's vote is then weighted by its accuracy.
What is 'Gradient' referring to in the Gradient Boosting algorithm?
Gradient descent in function space, where each new predictor is fit to the residual errors
What is Stacking?
Training a meta-learning model to aggregate the predictions of a set of predictors
What is the 'Curse of Dimensionality'?
As dimensionality increases, data becomes more spread out (sparse), so more training data is needed to reliably learn patterns.
What is the Projection approach to dimensionality reduction?
Project the data onto a lower-dimensional subspace.
What is the Manifold Learning approach to dimensionality reduction?
Models the lower-dimensional manifold that the data lies on.
What is PCA?
Finds the axes (principal components) that account for the maximum variance in the data, then projects the data onto those axes.
What mathematical technique does PCA use to find principal components?
Singular Value Decomposition (SVD)
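A minimal NumPy sketch of PCA via SVD; the synthetic data lies along the direction (3, 1), so the first principal component should recover it:

```python
import numpy as np

# PCA via SVD: center the data, decompose, and the rows of Vt are the
# principal components (directions of maximum variance).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1)) @ np.array([[3.0, 1.0]])   # data on a line
X += 0.01 * rng.normal(size=X.shape)                      # small noise
X_centered = X - X.mean(axis=0)

U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
pc1 = Vt[0]                        # first principal component
X_proj = X_centered @ Vt[:1].T     # project onto the top component

print(pc1)   # ≈ ±(3, 1) / sqrt(10), the true direction of the data
```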
What is the Explained Variance Ratio (EVR)?
Each principal component has an EVR: the proportion of total dataset variance captured by that component.
What is Kernel PCA?
Apply the kernel trick (same as SVM) to PCA to perform nonlinear dimensionality reduction
What is Locally Linear Embedding (LLE)?
nonlinear dimensionality reduction method - reduces data dimensions while preserving local relationships - not projection
What is t-SNE?
Keeps similar instances close, dissimilar ones far.
What is Isomap
preserves geodesic distances along the manifold.
What is K-Means clustering?
Groups similar instances into clusters without labels.
How does K-Means work?
Initialize K centroid positions, assign each point to the nearest centroid, update each centroid to be the mean of its assigned points, and repeat until the clusters stop changing.
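Those steps fit in a short NumPy function; the two synthetic blobs below are chosen so the result is easy to verify:

```python
import numpy as np

# Minimal K-Means: assign points to the nearest centroid, recompute
# centroids as cluster means, repeat.
def kmeans(X, k, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None] - centroids[None, :], axis=2)
        labels = dists.argmin(axis=1)
        # Keep the old centroid if a cluster ends up empty.
        centroids = np.array([X[labels == j].mean(axis=0)
                              if np.any(labels == j) else centroids[j]
                              for j in range(k)])
    return centroids, labels

# Two well-separated blobs at (0, 0) and (5, 5).
X = np.vstack([np.zeros((10, 2)), np.full((10, 2), 5.0)])
centroids, labels = kmeans(X, 2)
print(sorted(centroids[:, 0].tolist()))   # [0.0, 5.0]
```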
Hard vs Soft Clustering
Hard clustering assigns each point to one cluster, soft assigns a score per point for each cluster
What is inertia in K-Means?
The mean squared distance from each instance to its closest centroid; it always decreases as the number of clusters increases.
What is the elbow point in inertia/K-Means clustering?
The point where adding more clusters yields only small improvements in inertia.
What is the Silhouette Coefficient?
Measures how well a data point fits its assigned cluster compared to other clusters; values range from −1 to 1, where higher values mean better-defined clusters.
How does the K-Means++ initialization strategy differ from standard K-Means?
It picks initial centroids based on a probability proportional to their squared distance from existing centroids
What is DBSCAN?
A density-based algorithm that finds clusters of arbitrary shape.
What is a Gaussian Mixture Model (GMM)?
A probabilistic model that assumes data was generated from a mixture of K Gaussian (Normal) distributions with unknown parameters.
What is the EM algorithm?
used to estimate unknown parameters in models with hidden variables - alternating between estimating missing data (Expectation step) and optimizing parameters given those estimates (Maximization step).
Which information criterion penalizes complex GMM models more heavily for the number of parameters?
BIC (Bayesian Information Criterion).
How is Anomaly Detection performed using a fitted GMM?
By identifying instances located in low-density regions that fall below a density threshold.
Why are non-linear activation functions necessary in Deep Neural Networks?
Without them, stacking multiple layers is equivalent to a single linear layer
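This collapse is easy to demonstrate numerically (random weights, no activation between layers):

```python
import numpy as np

# Two stacked linear layers are equivalent to one linear layer with
# weight matrix W1 @ W2 - no extra expressive power without nonlinearity.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))     # a batch of 4 inputs with 3 features
W1 = rng.normal(size=(3, 5))    # "hidden layer" weights
W2 = rng.normal(size=(5, 2))    # "output layer" weights

two_layers = x @ W1 @ W2
one_layer = x @ (W1 @ W2)
print(np.allclose(two_layers, one_layer))   # True
```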
What is Backpropagation?
computes the gradient of the loss with respect to every weight in the network
Steps for Backpropagation
Compute the prediction layer by layer, measure how wrong it is, propagate the error backward, and update the weights in the proper direction.
Why should weights never be initialized to zero in a neural network?
Zero initialization makes every neuron in a layer compute the same output and receive the same gradient updates, preventing the network from learning distinct features (the 'symmetry' problem).
What is the simplest Keras API?
The Sequential API.
Which Keras API provides the most flexibility?
The Subclassing API.
What is a Regression MLP (Multilayer Perceptron)?
A neural network with hidden layers that learns nonlinear relationships and outputs continuous values - one neuron per predicted value.
What is a Classification MLP?
Learns nonlinear patterns in data and outputs class probabilities. Binary: one sigmoid output neuron. Multiclass: a softmax output layer with one neuron per class.
What are Convolutional Neural Networks?
Designed for grid-like data such as images - they learn spatial features using convolutional filters.
What is a convolutional layer?
Applies filters (kernels) across the input to detect local patterns.
What is the primary function of a Pooling layer in a CNN?
To reduce spatial dimensions, decrease computational cost, and create invariance.
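A 2×2 max-pooling pass with stride 2 can be sketched with a NumPy reshape trick (the input values are invented):

```python
import numpy as np

# 2x2 max pooling, stride 2: keep the largest value in each window,
# halving both spatial dimensions.
def max_pool_2x2(img):
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

img = np.array([[1, 2, 5, 6],
                [3, 4, 7, 8],
                [9, 1, 2, 3],
                [2, 8, 4, 4]])
print(max_pool_2x2(img))
# [[4 8]
#  [9 4]]
```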
How do Skip Connections in ResNet solve the vanishing gradient problem?
adding input directly to the output of a residual block - allowing signals to pass through if layers haven't learned anything yet
What is the innovation of Scaled Dot-Product Attention in Transformers?
Dividing the dot products by √d_k (the square root of the key dimension) keeps the softmax from saturating, preventing gradients from becoming too small.
What is 'Multi-Head Attention' in the Transformer architecture?
Running multiple attention computations in parallel to capture different aspects (syntactic, semantic) of the sequence simultaneously.
How do Transformers handle sequence order since they lack recurrence?
By adding Positional Encodings to the input embeddings - position information
How do you calculate the number of parameters in a convolutional layer?
num_filters × (filter_height × filter_width × input_channels) + num_filters bias terms
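Counting the weights plus one bias per filter, the formula can be checked against a familiar case (32 filters of size 3×3 over an RGB input, which Keras reports as 896 parameters):

```python
# Parameters in a convolutional layer:
# num_filters * (filter_h * filter_w * input_channels) weights,
# plus one bias per filter.
def conv_params(num_filters, filter_h, filter_w, channels):
    return num_filters * (filter_h * filter_w * channels + 1)

# 32 filters of size 3x3 over an RGB (3-channel) input:
print(conv_params(32, 3, 3, 3))   # 32 * (27 + 1) = 896
```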
What is padding?
Valid padding: no padding, so the output is smaller than the input. Same padding: the input is padded so the output has the same spatial size.
Sigmoid activation function
Outputs values between 0 and 1 - good for distinguishing between two classes (binary classification).
Softmax Activation function
Output layer for multi-class classification (more than 2 classes)
Which autoencoders are used to generate new images?
Variational autoencoders