75 practice flashcards covering the core concepts of CSCI-B 455: Machine Learning, focused on SVMs, Ensemble Learning, CNNs, Transformers, and Unsupervised Learning.
What characterizes Supervised Learning systems?
The training data includes labels (desired outputs), and the algorithm learns to map inputs to outputs.
Which ML system type discovers hidden structures in training data without provided labels?
Unsupervised Learning.
How does Reinforcement Learning operate?
An agent receives rewards or penalties for its actions and learns to maximize total reward.
What is batch learning?
Trains on the entire dataset at once.
What is online learning?
Trains incrementally on individual instances or mini-batches; the learning rate parameter controls how quickly it adapts.
What is instance-based learning?
Memorizes training examples and generalizes to new cases using a similarity measure
How can Overfitting be fixed?
By using regularization, a simpler model, or more training data.
How is a confusion matrix laid out, reading down each column (first column, then second)?
Column 1: True Positives, False Positives. Column 2: False Negatives, True Negatives.
What is the formula for Precision in classification metrics?
TP / (TP + FP)
What is the formula for Recall?
TP / (TP + FN)
What is the formula for F1 score?
2 × (precision × recall) / (precision + recall)
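These three metric formulas can be sanity-checked with a few lines of Python (the TP/FP/FN counts below are invented for illustration):

```python
# Precision, recall, and F1 from confusion-matrix counts.
# tp, fp, fn are hypothetical counts chosen for illustration.
tp, fp, fn = 8, 2, 4

precision = tp / (tp + fp)   # 8 / 10 = 0.8
recall = tp / (tp + fn)      # 8 / 12 ≈ 0.667
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.8 0.667 0.727
```

Because F1 is a harmonic mean, it sits closer to the smaller of precision and recall.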
What does ROC curve plot?
True Positive Rate (Recall) vs. False Positive Rate at various thresholds
What is the importance of AUC?
Calculates the area under the ROC curve. AUC of 1.0 is perfect, 0.5 is random
One-vs-Rest (OvR)
Train one binary classifier per class, each distinguishing class N from all other classes (N classifiers total).
One-vs-One (OvO)
Train one binary classifier for every pair of classes.
What is the normal equation (linear regression)?
θ = (X^T X)^(-1) X^T y
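A minimal NumPy sketch of the normal equation, using a synthetic dataset (y = 4 + 3x, chosen so the recovered parameters are easy to check):

```python
import numpy as np

# Normal equation: theta = (X^T X)^(-1) X^T y
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 4 + 3 * x                    # synthetic noise-free targets
X = np.c_[np.ones_like(x), x]    # prepend a bias column of ones

theta = np.linalg.inv(X.T @ X) @ X.T @ y
print(theta)   # ≈ [4. 3.] — intercept and slope
```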
What is Gradient Descent?
An iterative algorithm that "walks downhill" on the cost function surface by following the negative gradient. The learning rate parameter controls the size of each step.
What happens if the learning rate (η) is set too large in Gradient Descent?
The algorithm may overshoot the minimum and diverge.
Stochastic Gradient Descent
The gradient is computed on one random instance at each step.
Polynomial Regression
Fitting nonlinear data using a linear model by adding polynomial features.
What is the primary risk of using high-degree polynomials in Polynomial Regression?
They overfit the training data easily.
What is Ridge Regularization (L2)?
Adds a penalty proportional to the squares of the coefficients, shrinking them toward zero.
What is Lasso Regularization (L1)?
Adds a penalty on the absolute values of the coefficients; tends to zero out the least important features (feature selection).
What is Elastic Net Regularization?
A weighted combination of the Ridge and Lasso penalties.
What is Logistic Regression?
Estimates the probability that an instance belongs to one of two classes.
What is Cross Entropy in logistic regression?
A function that measures how well predicted probabilities match true labels: if the true label is 1 the loss is −log(ŷ); if the true label is 0 the loss is −log(1−ŷ).
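The two branches of the per-instance loss can be sketched in plain Python (the probabilities below are invented for illustration):

```python
import math

# Per-instance log loss: -log(y_hat) if y == 1, else -log(1 - y_hat).
def log_loss_one(y, y_hat):
    return -math.log(y_hat) if y == 1 else -math.log(1 - y_hat)

print(log_loss_one(1, 0.9))   # ≈ 0.105 — confident and correct: small loss
print(log_loss_one(0, 0.9))   # ≈ 2.303 — confident and wrong: large loss
```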
Is the Log Loss cost function for Logistic Regression convex or non-convex?
It is convex - Gradient Descent is guaranteed to find the global minimum.
Softmax Regression
Generalizes Logistic Regression to support multiple classes
What ensures that all probabilities produced by the Softmax function sum to 1?
Each class score is exponentiated and divided by the sum of all the exponentiated scores (normalization).
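A small NumPy sketch of the softmax normalization (the scores are invented for illustration):

```python
import numpy as np

# Softmax: exponentiate each score, then divide by the sum of exponentials.
# Subtracting the max first is a common numerical-stability trick.
def softmax(scores):
    exps = np.exp(scores - np.max(scores))
    return exps / exps.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
print(p, p.sum())   # three probabilities that sum to 1
```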
What is a Support Vector Machine (SVM)?
Finding the decision boundary that separates two classes while maximizing the margin width between them.
What determines the decision boundary in an SVM model?
The support vectors (the instances sitting exactly on the margin boundaries).
In SVM models, how does a large hyperparameter C affect the margin?
It results in a narrower margin with fewer violations, which can lead to overfitting.
What is a Hard Margin in SVM?
Requires all instances to be correctly classified and outside the margin; only possible when the data is linearly separable.
What is a Soft Margin in SVM?
Allows some margin violations (instances inside the margin or misclassified) in exchange for a wider margin.
What is the 'Kernel Trick' in the context of SVMs?
Computing dot products as if the data had been mapped to a high-dimensional feature space, without ever performing the mapping - saves computational cost.
What is the Gaussian RBF kernel?
Measures how close two points are and enables nonlinear decision boundaries by giving higher values to nearby points.
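The kernel fits in one line of NumPy (the gamma value and points are assumed for illustration):

```python
import numpy as np

# Gaussian RBF kernel: K(a, b) = exp(-gamma * ||a - b||^2).
def rbf_kernel(a, b, gamma=1.0):
    return np.exp(-gamma * np.sum((a - b) ** 2))

x = np.array([0.0, 0.0])
print(rbf_kernel(x, x))                      # 1.0 — identical points
print(rbf_kernel(x, np.array([3.0, 4.0])))   # ≈ 1.4e-11 — distant points
```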
What is SVM Regression?
Fit as many instances as possible inside the margin
What is a Decision Tree?
Makes predictions by asking a sequence of yes/no questions about features
What is Gini impurity (decision trees)?
Quantifies how "mixed" a set of classes is: 0 for a pure node, higher for more mixed nodes.
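Gini impurity is simple enough to compute by hand; a short Python check (the class counts are invented):

```python
# Gini impurity: 1 minus the sum of squared class proportions.
def gini(counts):
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

print(gini([10, 0]))   # 0.0 — a pure node
print(gini([5, 5]))    # 0.5 — maximally mixed for two classes
```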
How does a Decision Tree calculate the probability that an instance belongs to a specific class?
By computing the fraction of training instances of that class in the leaf node where the instance ends up (k/n, the class count over the leaf total).
Why are decision trees unstable?
Their decision boundaries are orthogonal (axis-aligned), making them sensitive to rotations of the data; small changes in the training data can also produce very different trees.
What is the CART cost function in decision trees?
A greedy criterion: for regression it minimizes the sum of squared errors within each region (for classification, the weighted impurity of the two subsets).
What is ensemble learning?
Combining multiple ML models to produce a single stronger model
What is the difference between Hard Voting and Soft Voting in ensemble learning?
Hard Voting uses a majority win based on class labels, while Soft Voting averages predicted probabilities and picks the highest average.
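The difference is easy to see on a single instance; the three classifier probabilities below are invented for illustration:

```python
import numpy as np

# Predicted probabilities of class 1 from three hypothetical classifiers.
probs = np.array([0.45, 0.45, 0.90])

# Hard voting: each classifier casts a 0/1 vote at threshold 0.5.
hard = int(np.round(probs).sum() > len(probs) / 2)   # votes 0, 0, 1 → class 0

# Soft voting: average the probabilities, then threshold.
soft = int(probs.mean() > 0.5)                        # mean 0.6 → class 1

print(hard, soft)   # 0 1 — the two schemes can disagree
```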
What is bagging in Ensemble Learning?
Training each predictor on a random subset of the data sampled with replacement; aggregating their predictions reduces variance and overfitting.
How are predictors and weights handled in AdaBoost?
Each new predictor corrects its predecessor by increasing the weights of the training instances the predecessor misclassified; each predictor's vote is then weighted by its accuracy.
What is 'Gradient' referring to in the Gradient Boosting algorithm?
Gradient descent in function space, where each new predictor is fit to the residual errors
What is Stacking?
Training a meta-learning model to aggregate the predictions of a set of predictors
What is the 'Curse of Dimensionality'?
As dimensionality increases, data becomes more spread out (sparse), so more training data is needed to reliably learn patterns.
What is the Projection approach to dimensionality reduction?
Project the data onto a lower-dimensional subspace.
What is the Manifold Learning approach to dimensionality reduction?
Models the lower-dimensional manifold that the data lies on.
What is PCA?
Finds the axes (principal components) that account for the maximum variance in the data, then projects the data onto those axes.
What mathematical technique does PCA use to find principal components?
Singular Value Decomposition (SVD)
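A minimal NumPy sketch of PCA via SVD; the synthetic data lies along the direction (3, 1), so the first principal component should recover it:

```python
import numpy as np

# PCA via SVD: center the data, decompose, and the rows of Vt are the
# principal components (directions of maximum variance).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1)) @ np.array([[3.0, 1.0]])   # data on a line
X += 0.01 * rng.normal(size=X.shape)                      # small noise
X_centered = X - X.mean(axis=0)

U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
pc1 = Vt[0]                        # first principal component
X_proj = X_centered @ Vt[:1].T     # project onto the top component

print(pc1)   # ≈ ±(3, 1) / sqrt(10), the true direction of the data
```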
What is the Explained Variance Ratio (EVR)?
Each principal component has an EVR: the proportion of total dataset variance captured by that component.
What is Kernel PCA?
Apply the kernel trick (same as SVM) to PCA to perform nonlinear dimensionality reduction
What is Locally Linear Embedding (LLE)?
nonlinear dimensionality reduction method - reduces data dimensions while preserving local relationships - not projection
What is t-SNE?
Keeps similar instances close, dissimilar ones far.
What is Isomap
preserves geodesic distances along the manifold.
What is K-Means clustering?
Groups similar instances into clusters without labels.
How does K-Means work?
Initialize K centroid positions, assign each point to the nearest centroid, update each centroid to be the mean of its assigned points, and repeat until the clusters stop changing.
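Those steps fit in a short NumPy function; the two synthetic blobs below are chosen so the result is easy to verify:

```python
import numpy as np

# Minimal K-Means: assign points to the nearest centroid, recompute
# centroids as cluster means, repeat.
def kmeans(X, k, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None] - centroids[None, :], axis=2)
        labels = dists.argmin(axis=1)
        # Keep the old centroid if a cluster ends up empty.
        centroids = np.array([X[labels == j].mean(axis=0)
                              if np.any(labels == j) else centroids[j]
                              for j in range(k)])
    return centroids, labels

# Two well-separated blobs at (0, 0) and (5, 5).
X = np.vstack([np.zeros((10, 2)), np.full((10, 2), 5.0)])
centroids, labels = kmeans(X, 2)
print(sorted(centroids[:, 0].tolist()))   # [0.0, 5.0]
```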
Hard vs Soft Clustering
Hard clustering assigns each point to one cluster, soft assigns a score per point for each cluster
What is inertia in K-Means?
The mean squared distance from each instance to its closest centroid; it always decreases as the number of clusters increases.
What is the elbow point in inertia/K-Means clustering?
The point where adding more clusters yields only small improvements in inertia.
What is the Silhouette Coefficient?
Measures how well a data point fits its assigned cluster compared to other clusters; values range from −1 to 1, where higher values mean better-defined clusters.
How does the K-Means++ initialization strategy differ from standard K-Means?
It picks initial centroids based on a probability proportional to their squared distance from existing centroids
What is DBSCAN?
A density-based algorithm that finds clusters of arbitrary shape.
What is a Gaussian Mixture Model (GMM)?
A probabilistic model that assumes data was generated from a mixture of K Gaussian (Normal) distributions with unknown parameters.
What is the EM algorithm?
used to estimate unknown parameters in models with hidden variables - alternating between estimating missing data (Expectation step) and optimizing parameters given those estimates (Maximization step).
Which information criterion penalizes complex GMM models more heavily for the number of parameters?
BIC (Bayesian Information Criterion).
How is Anomaly Detection performed using a fitted GMM?
By identifying instances located in low-density regions that fall below a density threshold.
Why are non-linear activation functions necessary in Deep Neural Networks?
Without them, stacking multiple layers is equivalent to a single linear layer
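This collapse is easy to demonstrate numerically (random weights, no activation between layers):

```python
import numpy as np

# Two stacked linear layers are equivalent to one linear layer with
# weight matrix W1 @ W2 - no extra expressive power without nonlinearity.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))     # a batch of 4 inputs with 3 features
W1 = rng.normal(size=(3, 5))    # "hidden layer" weights
W2 = rng.normal(size=(5, 2))    # "output layer" weights

two_layers = x @ W1 @ W2
one_layer = x @ (W1 @ W2)
print(np.allclose(two_layers, one_layer))   # True
```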
What is Backpropagation?
computes the gradient of the loss with respect to every weight in the network
Steps for Backpropagation
Compute the prediction layer by layer, measure how wrong it is, propagate the error backward, and update the weights in the proper direction.
Why should weights never be initialized to zero in a neural network?
Zero initialization makes every neuron in a layer compute the same output and receive the same gradient updates, preventing the network from learning distinct features (the 'symmetry' problem).
What is the simplest Keras API?
The Sequential API.
Which Keras API provides the most flexibility?
The Subclassing API.
What is a Regression MLP (Multilayer Perceptron)?
A neural network with hidden layers that learns nonlinear relationships and outputs continuous values - one neuron per predicted value.
What is a Classification MLP?
Learns nonlinear patterns in data and outputs class probabilities. Binary: one sigmoid output neuron. Multiclass: a softmax output layer with one neuron per class.
What are Convolutional Neural Networks?
Designed for grid-like data such as images - they learn spatial features using convolutional filters.
What is a convolutional layer?
Applies filters (kernels) across the input to detect local patterns.
What is the primary function of a Pooling layer in a CNN?
To reduce spatial dimensions, decrease computational cost, and create invariance.
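A 2×2 max-pooling pass with stride 2 can be sketched with a NumPy reshape trick (the input values are invented):

```python
import numpy as np

# 2x2 max pooling, stride 2: keep the largest value in each window,
# halving both spatial dimensions.
def max_pool_2x2(img):
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

img = np.array([[1, 2, 5, 6],
                [3, 4, 7, 8],
                [9, 1, 2, 3],
                [2, 8, 4, 4]])
print(max_pool_2x2(img))
# [[4 8]
#  [9 4]]
```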
How do Skip Connections in ResNet solve the vanishing gradient problem?
adding input directly to the output of a residual block - allowing signals to pass through if layers haven't learned anything yet
What is the innovation of Scaled Dot-Product Attention in Transformers?
Dividing the dot products by √d_k (the square root of the key dimension) keeps the softmax from saturating, preventing gradients from becoming too small.
What is 'Multi-Head Attention' in the Transformer architecture?
Running multiple attention computations in parallel to capture different aspects (syntactic, semantic) of the sequence simultaneously.
How do Transformers handle sequence order since they lack recurrence?
By adding Positional Encodings to the input embeddings - position information
How do you calculate the number of parameters in a convolutional layer?
num_filters × (filter_height × filter_width × input_channels) + num_filters bias terms
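Counting the weights plus one bias per filter, the formula can be checked against a familiar case (32 filters of size 3×3 over an RGB input, which Keras reports as 896 parameters):

```python
# Parameters in a convolutional layer:
# num_filters * (filter_h * filter_w * input_channels) weights,
# plus one bias per filter.
def conv_params(num_filters, filter_h, filter_w, channels):
    return num_filters * (filter_h * filter_w * channels + 1)

# 32 filters of size 3x3 over an RGB (3-channel) input:
print(conv_params(32, 3, 3, 3))   # 32 * (27 + 1) = 896
```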
What is padding?
Valid padding: no padding, so the output is smaller than the input. Same padding: the input is padded so the output has the same spatial size.
Sigmoid activation function
Outputs values between 0 and 1 - good for distinguishing between two classes (binary classification).
Softmax Activation function
Output layer for multi-class classification (more than 2 classes)
Which autoencoders are used to generate new images?
Variational autoencoders