MLE FINAL


Last updated 6:28 AM on 4/21/26


148 Terms

New cards

supervised learning

learning from labeled data (each example or instance in the dataset has a corresponding label)

classification vs regression

New cards

unsupervised learning - def

learning from unlabeled data (must discover patterns in the data)

clustering (K-means)
dimensionality reduction

problems/tasks:

density estimation
outlier detection
clustering
dimensionality reduction (& visualization)
association rule mining

New cards

semi-supervised learning

learning from partially labeled data

New cards

reinforcement learning

there is an agent that can interact with its environment and perform actions and get rewards

learning policy (strategy for which actions to take) to get the most rewards over time

New cards

transfer learning

learning to repurpose an existing model for a new task

New cards

density estimation

model the (unknown) probability distribution function (pdf) from which data is drawn

New cards

outlier detection

identify data points that do not belong in the training data

New cards

clustering

goal: group similar data points together in “clusters”

dataset: matrix X (n x m) — no labels
clustering algorithms:
- k-means clustering
- DBSCAN
- hierarchical clustering
Need a holdout set: the clustering could overfit
- Use a train-test split or a train-validation-test split

New cards

dimensionality reduction (& visualization)

transform dataset into a low(er)-dimensional representation in a way that preserves useful information

New cards

association rule mining

discover interesting relationships between variables in the data

New cards

k-means clustering

greedy iterative algorithm to assign points to clusters
input: dataset X, integer k (hyperparameter — number of clusters)

New cards

k-means — objective function

minimize ∑_j ||x_j - c(x_j)||²

c(x_j) is the centroid of the cluster that x_j is assigned to

New cards

k-means algorithm

Initialization: random centroid for each cluster

Do until no change in clustering:

assignment: assign each data point to the cluster with the closest centroid
update: recalculate centroids from points assigned to each cluster
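
The assignment/update loop above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the names `k_means`, `closest`, and `objective` are made up for this example, and picking k random data points as the initial centroids is just one common way to do the "random centroid" initialization.

```python
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def closest(point, centroids):
    # Index of the centroid nearest to the point.
    return min(range(len(centroids)), key=lambda c: squared_distance(point, centroids[c]))

def k_means(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)   # assumption: start from k random data points
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster with the closest centroid.
        new_assignment = [closest(p, centroids) for p in points]
        if new_assignment == assignment:   # no change in clustering: stop
            break
        assignment = new_assignment
        # Update step: recompute each centroid as the mean of its points.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:   # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assignment

def objective(points, centroids, assignment):
    # Sum of squared distances from each point to its assigned centroid.
    return sum(squared_distance(p, centroids[a]) for p, a in zip(points, assignment))

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, assignment = k_means(points, k=2)
print(assignment, objective(points, centroids, assignment))
```

On this toy dataset the two well-separated groups end up in different clusters. On harder data the greedy loop can stop in a poor local optimum, so it is usually rerun with several seeds.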

New cards

k-means — determine k

guess or try different options
Heuristics: silhouette method or elbow method
tune it as you would any other hyperparameter (if you have an objective metric to evaluate the quality of a clustering)

New cards

dataset has millions of features

training would be slow and model performance could suffer

New cards

curse of dimensionality

as dimensions increase, data becomes more sparse, making it harder to find meaningful patterns and causing distance-based algorithms to struggle.

as the number of features increases, the amount of data required to generalize accurately grows exponentially, making data points appear as isolated islands

New cards

dimensionality of data is artificial

can reduce dimension without losing information

New cards

dimensionality reduction techniques

projections (PCA)
manifold learning (LLE, Isomap)

New cards

principal component analysis (PCA)

method for linear transformation onto a new system of coordinates
- the transformation is such that the principal components (coordinate vectors) capture the greatest amount of variance given the previous components
linear decomposition/transformation

New cards

PCA — algorithm

given data matrix X (n x m)
- Mean-center it: subtract the mean of each feature
compute the covariance matrix X^T X / (n - 1) (m x m)
eigendecomposition gives principal components
- use Singular Value Decomposition (SVD)
- Matrix W of eigenvectors is the transformation matrix (i^th column is the i^th principal component)
- Eigenvalue λ_i gives the variance of the i^th principal component
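
A small worked sketch of these steps, under the assumption that the data has exactly two features so the eigendecomposition of the 2 x 2 covariance matrix can be written in closed form (no SVD needed). The function name `pca_2d` and the toy data are made up for illustration, and the covariance is scaled by 1/(n - 1).

```python
import math

def pca_2d(rows):
    """PCA for a dataset with exactly two features (m = 2)."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    centered = [(x - mean_x, y - mean_y) for x, y in rows]   # mean-center

    # Entries of the symmetric covariance matrix [[a, b], [b, c]].
    a = sum(x * x for x, _ in centered) / (n - 1)
    b = sum(x * y for x, y in centered) / (n - 1)
    c = sum(y * y for _, y in centered) / (n - 1)

    # Closed-form eigenvalues of a symmetric 2 x 2 matrix, largest first.
    mid = (a + c) / 2
    half_gap = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    eigenvalues = [mid + half_gap, mid - half_gap]

    if b == 0:
        # Features are uncorrelated: the coordinate axes are the components.
        components = [(1.0, 0.0), (0.0, 1.0)] if a >= c else [(0.0, 1.0), (1.0, 0.0)]
    else:
        components = []
        for lam in eigenvalues:
            vx, vy = b, lam - a            # solves (a - lam) * vx + b * vy = 0
            length = math.hypot(vx, vy)
            components.append((vx / length, vy / length))
    return centered, eigenvalues, components

rows = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0), (5.0, 10.0)]
centered, eigenvalues, components = pca_2d(rows)
w1 = components[0]                                   # first principal component
z1 = [x * w1[0] + y * w1[1] for x, y in centered]    # projection onto it (k = 1)
explained = eigenvalues[0] / sum(eigenvalues)        # variance explained by k = 1
print(eigenvalues, w1, explained)
```

Because the toy points lie exactly on the line y = 2x, the first component points along that line and explains all of the variance.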

New cards

PCA note

can do PCA on the correlation matrix instead of the covariance matrix

New cards

PCA — inverse transformation

transform data back to the original space

X’ = Z_k W_k^T
- W_k^T is a k x m matrix
- X’ is a (n x m) matrix

New cards

PCA — transformed data

Z_k = X W_k

W_k is the transformation matrix with only the first k columns (m x k)

New cards

PCA — reduce dimensionality

to reduce dimensionality, keep only the first k principal components

New cards

PCA - variance explained

(λ₁ + λ₂ + ... + λ_k) / ∑_i λ_i

New cards

Kernel PCA

for non-linear dimensionality reduction
use the kernel trick (same one used for SVM)

New cards

Manifold Learning

Find mapping of dataset X into dataset Z embedded in R^p (z_i ∈ R^p) for some integer p < m

Such that (informally) Z preserves the local geometry of X
- For example, if x_i and x_j are close (according to some distance metric), then z_i and z_j are also close
- Want to preserve neighborhood structure

New cards

multidimensional scaling (MDS)

compute Euclidean distance d_ij between any two points x_i and x_j

New cards

manifold learning — algorithms

different algorithms aim to preserve different “local” properties

multidimensional scaling (MDS)
locally linear embedding (LLE)
Isomap
t-distributed stochastic neighbor embedding (t-SNE)

New cards

locally linear embedding (LLE)

express each data point as a linear combination of its closest neighbors

New cards

multidimensional scaling

preserve distances between points

New cards

isomap

form a graph where points are connected to their closest neighbors
- aims to preserve geodesic distance (i.e., shortest path distance)

New cards

t-distributed stochastic neighbor embedding (t-SNE)

tries to keep similar data points close together, dissimilar data points far apart
mostly used for visualization

New cards

model formula

h_{w,b}(x) = f(w · x + b)

New cards

single neuron/unit

New cards

NN — types of layers

dense (fully connected)
convolutional
recurrent

New cards

NN - activation functions

identity / linear
sigmoid
tanh
ReLU
Softmax

New cards

NN — identity activation function

f(z) = z

or none

New cards

NN — sigmoid activation function

f(z) = 1/(1 + e^-z)

New cards

NN — TanH activation function

f(z) = (e^z - e^-z) / (e^z + e^-z)

New cards

NN — ReLU activation function

f(z) = max(0, z)

New cards

NN — Softmax activation function

f(z_j) = exp(z_j / T) / ∑_i exp(z_i / T)

note: in this case, the activation function is over an entire layer, not a single unit
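
A minimal sketch of softmax over a whole layer, with T as the temperature from the formula above. Subtracting the maximum before exponentiating is a common numerical-stability trick added here; it does not change the result.

```python
import math

def softmax(z, temperature=1.0):
    scaled = [v / temperature for v in z]
    shift = max(scaled)                       # stability: avoids overflow in exp
    exps = [math.exp(v - shift) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [1.0, 2.0, 3.0]
print(softmax(logits))                    # probabilities that sum to 1
print(softmax(logits, temperature=10.0))  # higher T gives a flatter distribution
```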

New cards

NN — Loss

Make sure the loss function and activation function of the output layer are consistent with each other

New cards

Single Neuron - Linear Regression

one layer NN with a single neuron:

Activation function: Linear
Loss function: MSE

New cards

single neuron (binary) logistic regression

one layer NN with a single neuron

Activation function: sigmoid
Loss function: binary cross-entropy
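
As a rough sketch, a single neuron computes f(w · x + b); with a sigmoid activation and binary cross-entropy loss this is logistic regression. The helper names below are made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neuron(x, w, b, activation):
    # Weighted sum of the inputs plus bias, passed through the activation.
    return activation(sum(wi * xi for wi, xi in zip(w, x)) + b)

def binary_cross_entropy(y_true, y_pred):
    return -(y_true * math.log(y_pred) + (1 - y_true) * math.log(1 - y_pred))

x = [1.0, 2.0]
p = neuron(x, w=[0.5, -0.25], b=0.0, activation=sigmoid)
print(p, binary_cross_entropy(1, p))
```

Swapping `sigmoid` for the identity function and the loss for MSE would give the linear regression neuron instead.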

New cards

Multi-Layer Perceptron (MLP)

input layer (passthrough)
one or more hidden layers
output layer

New cards

backpropagation — overview

reverse pass to measure error and propagate error gradient backwards in the network

adjust the weights (parameters) to decrease the loss

New cards

backpropagation

how to compute gradients efficiently

New cards

gradient descent

how to update the parameters to minimize the loss given the function

New cards

backprop algorithm

compute the forward pass for the mini-batch B, saving the intermediate results at each layer
compute the loss on the mini-batch B (compares output of network to labels/targets -> error)
backwards pass: computes the per-weight gradients (error contribution) layer by layer
- done using chain rule
Stochastic gradient descent: update the weights based on the gradients
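
The four steps can be traced by hand on a deliberately tiny network. The architecture here is an assumption chosen for illustration: one input, one tanh hidden unit, one linear output, squared-error loss, and a "mini-batch" of a single example.

```python
import math

def forward(x, params):
    w1, b1, w2, b2 = params
    a = w1 * x + b1          # hidden pre-activation (saved for the backward pass)
    h = math.tanh(a)         # hidden activation
    y_hat = w2 * h + b2      # linear output
    return a, h, y_hat

def loss(x, y, params):
    return (forward(x, params)[2] - y) ** 2

def backward(x, y, params):
    w1, b1, w2, b2 = params
    a, h, y_hat = forward(x, params)
    d_yhat = 2 * (y_hat - y)         # dL/dy_hat
    d_w2 = d_yhat * h                # chain rule through y_hat = w2 * h + b2
    d_b2 = d_yhat
    d_h = d_yhat * w2
    d_a = d_h * (1 - h ** 2)         # derivative of tanh(a) is 1 - tanh(a)^2
    d_w1 = d_a * x                   # chain rule through a = w1 * x + b1
    d_b1 = d_a
    return [d_w1, d_b1, d_w2, d_b2]

def sgd_step(params, grads, lr=0.1):
    return [p - lr * g for p, g in zip(params, grads)]

params = [0.5, -0.2, 1.5, 0.1]
x, y = 0.7, 1.0
new_params = sgd_step(params, backward(x, y, params))
print(loss(x, y, params), loss(x, y, new_params))   # loss before and after one step
```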

New cards

chain rule

if z depends on y and y depends on x:

dz/dx = dz/dy · dy/dx
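
A quick numeric check of the chain rule, using the made-up example z = sin(y) with y = x²:

```python
import math

def dz_dx(x):
    y = x ** 2                # y depends on x
    dz_dy = math.cos(y)       # z = sin(y)
    dy_dx = 2 * x
    return dz_dy * dy_dx      # dz/dx = dz/dy * dy/dx

def numeric_dz_dx(x, eps=1e-6):
    # Central finite difference on z(x) = sin(x^2), for comparison.
    return (math.sin((x + eps) ** 2) - math.sin((x - eps) ** 2)) / (2 * eps)

print(dz_dx(1.3), numeric_dz_dx(1.3))   # the two values should agree closely
```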

New cards

backprop — illustration

New cards

Use NN

complex learning task or complex data
for many problems, NN provide the best performance
- image classification/captioning tasks
- speech recognition
- natural language modeling

New cards

don't use NN

can solve problem with simple model
don’t have a lot of data
- NN require a lot of data to achieve good performance
need an explainable/interpretable model
- there are some techniques to explain decisions from NN

New cards

universal approximation theorems

(feed-forward) NN can approximately represent any function

New cards

arbitrary width; bounded depth

true even if we have a single hidden layer as long as it can have arbitrarily many units

New cards

bounded width; arbitrary depth

true even if we have layers of bounded width, as long as the network can have arbitrarily many layers

New cards

NN architecture disadvantage

inconsistent activation function of output layer with the loss function

New cards

multiclass classification with cross-entropy loss, softmax activation for output layers

okay

New cards

regression with MSE as loss, tanh activation for output layer

fail (tanh bounds the output to (-1, 1), so targets outside that range cannot be predicted)

New cards

regression with MSE as loss, linear activation for output layer

okay

New cards

funnel trick

for supervised learning, we typically have large input feature vectors and small output vectors

should make the network look like a funnel

New cards

multiclass classification with 10 classes and m = 100 input features

(input, hidden layer 1, hidden layer 2, hidden layer 3, output layer)

100, 64, 32, 16, 10

Activations:

output: softmax
elsewhere: ReLU
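
To get a feel for the size of this funnel, here is a small sketch that counts the weights and biases of the dense layers. The helper name is made up, and it assumes every layer is fully connected with one bias per unit.

```python
def dense_parameter_count(layer_sizes):
    # Each dense layer has n_in * n_out weights plus n_out biases.
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

funnel = [100, 64, 32, 16, 10]
print(dense_parameter_count(funnel))  # 9242
```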

New cards

deep NN

any NN with two or more hidden layers

New cards

Deep NN (DNN) challenges

endless options for network architecture/topology

# of layers, units per layer, connections between units, activation functions, weight initialization method

New cards

DNN - hyperparameters related to learning

optimizer
learning rate
decay/momentum
(mini) batch size
# of epochs

New cards

DNN - Number of Hidden layers

Deep > shallow: for the same number of parameters, more hidden layers is better than wider layers

parameter efficiency

New cards

number of units in each layer - funnel approach

make the network look like a funnel

New cards

number of units in each layer - “stretch pants”

make hidden layers wider than what you need and then regularize (ex. dropout)

New cards

DNN - activation function

hidden layers — ReLU or ReLU variants

faster to compute than alternatives
GD less likely to get stuck

output layer

multiclass classification: softmax
binary classification or multilabel: sigmoid
regression: linear (no activation function)

New cards

DNN learning rate

start with a low value (0.00001), then multiply by 10 each time and train for a few epochs

once training diverges — have gone too far
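
The sweep can be illustrated on a toy problem. This sketch assumes a one-parameter loss L(w) = 2w² (gradient 4w) instead of a real network, and treats "final loss higher than initial loss" as the sign of divergence; both are simplifications made for illustration.

```python
def final_loss(lr, steps=20, w=1.0):
    for _ in range(steps):
        w -= lr * 4 * w          # gradient descent step; gradient of 2*w**2 is 4*w
    return 2 * w ** 2

def sweep(learning_rates, initial_loss=2.0):
    last_good = None
    for lr in learning_rates:
        if final_loss(lr) > initial_loss:   # training diverged: gone too far
            return last_good, lr
        last_good = lr
    return last_good, None

rates = [10 ** e for e in range(-5, 1)]     # 0.00001, 0.0001, ..., 0.1, 1
print(sweep(rates))  # (0.1, 1)
```

Here the largest rate that still trains is 0.1, and 1 is the first one that diverges.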

New cards

training diverges

the model isn’t settling toward a stable solution — it’s going in the wrong direction instead of improving

loss/error starts to increase instead of decreasing

New cards

DNN - optimizer

use Adam or SGD

New cards

DNN — batch size

small batch approach: ex. 32, 64, 100
large batch approach: the largest size that fits your GPU’s RAM and use learning rate warmup

New cards

DNN — number of epochs/iterations

use early stopping

stop training before the model starts getting worse on new (unseen) data
stop at the point where validation performance is best, not where training loss is lowest
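
One common way to implement this is with a "patience" counter: stop once the validation loss has not improved for a set number of epochs, and keep the best epoch seen. The patience parameter and the loss values below are assumptions for illustration, not part of the card.

```python
def early_stopping_epoch(val_losses, patience=2):
    best_epoch, best_loss = 0, float("inf")
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss     # new best: remember it
        elif epoch - best_epoch >= patience:
            break                                   # no improvement for too long
    return best_epoch, best_loss

val_losses = [0.9, 0.7, 0.6, 0.65, 0.7, 0.5]
print(early_stopping_epoch(val_losses))  # (2, 0.6)
```

Note the trade-off: with a patience of 2, training stops before the later improvement to 0.5 is ever seen.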

New cards

vanishing gradient

gradient vector becomes very small during backprop

difficult to update weights of lower/earlier layers — training does not converge

New cards

exploding gradient

gradient vector becomes very large during backprop

weight updates become very large and unstable: training does not converge (it may diverge)

New cards

training converges

with GD, the process has found (or gotten very close to) a minimum of the loss function: each additional step makes little to no difference

New cards

unstable gradients

layers of a DNN learn at very different rates

New cards

exploding/vanishing gradient mitigations

weight initialization method
non-saturating activation functions
batch normalization
gradient clipping (for exploding gradient)
skip-connections (CNNs)
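
Of these mitigations, gradient clipping is the easiest to show in a few lines. This sketch clips by norm (rescaling the whole gradient vector when it is too long); clipping each component by value is another option. The function name and threshold are illustrative.

```python
import math

def clip_by_norm(grad, max_norm):
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)                          # small enough: leave unchanged
    return [g * max_norm / norm for g in grad]     # same direction, norm = max_norm

print(clip_by_norm([3.0, 4.0], max_norm=1.0))   # direction kept, length reduced to 1
```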

New cards

Glorot initialization (Xavier initialization)

Let n_in: number of inputs, n_out: number of outputs, n_avg = (n_in + n_out)/2
Gaussian (for sigmoid activation): mean 0, variance σ² = 1/n_avg
Uniform (for sigmoid activation): in [-r, r] where r = sqrt(3/n_avg)

note: biases are initialized to 0
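
A rough sketch of both variants for one dense layer, following the formulas on this card. The function names and the layer size (100 inputs, 64 outputs) are made up for illustration.

```python
import math
import random

def glorot_normal(n_in, n_out, rng):
    n_avg = (n_in + n_out) / 2
    std = math.sqrt(1 / n_avg)                  # variance 1 / n_avg
    return [[rng.gauss(0.0, std) for _ in range(n_out)] for _ in range(n_in)]

def glorot_uniform(n_in, n_out, rng):
    n_avg = (n_in + n_out) / 2
    r = math.sqrt(3 / n_avg)                    # uniform on [-r, r]
    return [[rng.uniform(-r, r) for _ in range(n_out)] for _ in range(n_in)]

rng = random.Random(0)
weights_normal = glorot_normal(100, 64, rng)
weights_uniform = glorot_uniform(100, 64, rng)
biases = [0.0] * 64                             # biases start at 0
```

The two variants target the same spread: a uniform distribution on [-r, r] has variance r²/3, which equals 1/n_avg for this choice of r.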

New cards

He initialization

Gaussian (for ReLU): mean 0, variance σ² = 2/n_in (n_in: number of inputs)

note: biases are initialized to 0

New cards

non-saturating activation functions

ReLU + ReLU variants

New cards

dying ReLUs

a neuron can die when the weighted sums of its input are negative (for all examples in the training data)

New cards

ReLU variants

Leaky ReLU
ELU
SELU (Scaled ELU)

these will not let neurons die because they can output negative values

New cards

Leaky ReLU

LeakyReLU_a(z) = max{az, z}

e.g.: a = 0.01

New cards

ELU

ELU_a(z) = z if z ≥ 0 and a (e^z - 1) otherwise (z < 0)

e.g., a = 1
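
The ReLU variants on these cards can be compared side by side in a short sketch, using the example values a = 0.01 for Leaky ReLU and a = 1 for ELU:

```python
import math

def relu(z):
    return max(0.0, z)

def leaky_relu(z, a=0.01):
    return max(a * z, z)          # small slope a for negative inputs

def elu(z, a=1.0):
    return z if z >= 0 else a * (math.exp(z) - 1)   # smoothly approaches -a

for z in (-2.0, 0.0, 2.0):
    print(z, relu(z), leaky_relu(z), elu(z))
```

For negative z, ReLU outputs 0 (and has zero gradient), while the other two still produce a negative output, which is why their neurons are less prone to dying.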

New cards

activation functions — rating

ELU > leaky ReLU > ReLU > Tanh > Sigmoid

use ReLU in most cases; use ELU if there is time to spare, because it makes the network slower

New cards

bias

error due to incorrect assumptions in the model

inability to capture the true relationship

New cards

variance

sensitivity to small variations in the training data

New cards

strategies to lower bias

increase model complexity
use more features

New cards

strategies to lower variance

reduce model complexity
use more training data

New cards

DNN reg technique — early stopping

stop once the validation loss is at its minimum (before it starts to go back up)

New cards

DNN reg technique — L₁ or L₂ reg

L₁: penalty term λ∑_i|w_i|
L₂: penalty term λ∑_i |w_i|²
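
A minimal sketch of the two penalty terms, which would be added to the data loss during training. The weights and λ below are made-up example values.

```python
def l1_penalty(weights, lam):
    return lam * sum(abs(w) for w in weights)      # lambda * sum of |w_i|

def l2_penalty(weights, lam):
    return lam * sum(w * w for w in weights)       # lambda * sum of |w_i|^2

weights = [1.0, -2.0, 3.0]
print(l1_penalty(weights, lam=0.1), l2_penalty(weights, lam=0.1))
```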

New cards

DNN reg technique — max-norm reg

the norm of weights incoming to each neuron is at most r (hyperparameter): ||w||₂ ≤ r

note: this is not added to the loss; after each training step, the weights are rescaled if needed to ensure ||w||₂ ≤ r
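
The rescaling step might look like the following sketch, applied to the incoming weight vector of one neuron after each training step. The function name is made up for illustration.

```python
import math

def apply_max_norm(weights, r):
    norm = math.sqrt(sum(w * w for w in weights))
    if norm <= r:
        return list(weights)                    # constraint already satisfied
    return [w * r / norm for w in weights]      # scale down so the norm equals r

print(apply_max_norm([3.0, 4.0], r=2.0))   # norm 5 is scaled down to 2
```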

New cards

DNN reg technique — drop out

idea: during training, each neuron has a probability p of being dropped out (it will be ignored for this step)

hyperparameter p is called the dropout rate
after training: we do not drop any neurons anymore (but need to adjust connection weights)
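
A small sketch of the idea, applied to a layer's activations. The post-training adjustment is modeled here by multiplying the activations by the keep probability (1 - p), which has the same effect as scaling the outgoing connection weights. Many libraries instead rescale during training ("inverted dropout"); that variant is not shown.

```python
import random

def dropout_train(activations, p, rng):
    # Each neuron is dropped (output set to 0) with probability p.
    return [0.0 if rng.random() < p else a for a in activations]

def dropout_inference(activations, p):
    # No neurons are dropped; scale by the keep probability instead.
    return [a * (1 - p) for a in activations]

rng = random.Random(0)
layer = [1.0] * 10000
train_mean = sum(dropout_train(layer, 0.5, rng)) / len(layer)
test_mean = sum(dropout_inference(layer, 0.5)) / len(layer)
print(train_mean, test_mean)   # the two averages should be close
```

The scaling keeps the expected input to the next layer the same at training and inference time.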

New cards

generalization error

aka out-of-sample error or risk

prediction error on unseen data
related to overfitting: if the model overfits, then the generalization error will be large

bias² + variance + irreducible error

New cards

bias-variance tradeoff

increasing model complexity — lower bias
decreasing model complexity — lower variance

New cards

double descent

unexpected and sudden change in test error as a function of model complexity (# of parameters)
- Note: not all settings/models/data yield exactly two descents (e.g., sometimes there are three or more)


New cards

grokking

NN suddenly learns to generalize, long after the network has overfitted

note: this occurs in very specific settings (specific tasks, small “algorithmic datasets”, complex neural networks)