MLE FINAL


Last updated 6:28 AM on 4/21/26

148 Terms

1
New cards

supervised learning

learning from labeled data (each example or instance in the dataset has a corresponding label)

  • classification vs regression

2
New cards

unsupervised learning - def

learning from unlabeled data (must discover patterns in the data)

  • clustering (K-means)

  • dimensionality reduction

problems/tasks:

  • density estimation

  • outlier detection

  • clustering

  • dimensionality reduction (& visualization)

  • association rule mining

3
New cards

semi-supervised learning

learning from partially labeled data

4
New cards

reinforcement learning

there is an agent that can interact with its environment and perform actions and get rewards

  • learning policy (strategy for which actions to take) to get the most rewards over time

5
New cards

transfer learning

learning to repurpose an existing model for a new task

6
New cards

density estimation

model the (unknown) probability distribution function (pdf) from which data is drawn

7
New cards

outlier detection

identify data points that do not belong in the training data

8
New cards

clustering

goal: group similar data points together in “clusters”

  • dataset: matrix X (n x m) — no labels

  • clustering algorithms:

    • k-means clustering

    • DBSCAN

    • hierarchical clustering

  • Need a holdout set — the clustering could overfit

    • Should use a train-test split or train-test-val split

9
New cards

dimensionality reduction (& visualization)

transform dataset into a low(er)-dimensional representation in a way that preserves useful information

10
New cards

association rule mining

discover interesting relationships between variables in the data

11
New cards

k-means clustering

  • greedy iterative algorithm to assign points to clusters

  • input: dataset X, integer k (hyperparameter — number of clusters)

12
New cards

k-means — objective function

minimize J = ∑j ||xj − c(xj)||²

  • c(xj) is the centroid of the cluster that xj is assigned to
13
New cards

k-means algorithm

Initialization: random centroid for each cluster

Do until no change in clustering:

  • assignment: assign each data point to the cluster with the closest centroid

  • update: recalculate centroids from points assigned to each cluster
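The loop above can be sketched in NumPy (a minimal sketch: the function name, seed, and toy data are illustrative, and empty clusters are not handled):

```python
import numpy as np

def kmeans(X, k, seed=0, max_iter=100):
    """Greedy k-means: random initial centroids from the data, then
    alternate assignment and update steps until nothing changes."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # assignment: each point goes to the cluster with the closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update: recalculate centroids from the points assigned to each cluster
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```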

14
New cards

k-means — determine k

  • guess or try different options

  • Heuristics: silhouette method or elbow method

  • tune it as you would any other hyperparameter (if you have an objective metric to evaluate the quality of a clustering)

15
New cards

dataset has millions of features

training would be slow and model performance could suffer

16
New cards

curse of dimensionality

as dimensions increase, data becomes more sparse, making it harder to find meaningful patterns and causing distance-based algorithms to struggle.

  • as the number of features increases, the amount of data required to generalize accurately grows exponentially, making data points appear as isolated islands
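A quick hypothetical experiment makes the sparsity concrete: as dimensionality grows, random pairwise distances concentrate around their mean, so "near" and "far" neighbors become nearly indistinguishable.

```python
import numpy as np

rng = np.random.default_rng(0)

def distance_spread(m, n=200):
    """Relative spread (std/mean) of n random pairwise distances
    between uniform points in the m-dimensional unit cube."""
    X = rng.random((2 * n, m))
    d = np.linalg.norm(X[:n] - X[n:], axis=1)
    return d.std() / d.mean()

# spread shrinks sharply as dimensionality grows
low, high = distance_spread(2), distance_spread(1000)
```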

17
New cards

dimensionality of data is artificial

can reduce dimension without losing information

18
New cards

dimensionality reduction techniques

  • projections (PCA)

  • manifold learning (LLE, Isomap)

19
New cards

principal component analysis (PCA)

  • method for linear transformation onto a new system of coordinates

    • the transformation is such that the principal components (coordinate vectors) capture the greatest amount of variance given the previous components

  • linear decomposition/transformation

20
New cards

PCA — algorithm

  • given data matrix X (n x m)

    • Mean-center it: subtract the mean of each feature

  • compute covariance matrix XᵀX (m x m)

  • eigendecomposition gives principal components

    • use Singular Value Decomposition (SVD)

    • Matrix W of eigenvectors is the transformation matrix (ith column is the ith principal component)

    • Eigenvalue λi gives the variance of the ith principal component

21
New cards

PCA note

can do PCA on the correlation matrix instead of the covariance matrix

22
New cards

PCA — inverse transformation

transform data back to the original space

  • X′ = Zk Wkᵀ

    • Wkᵀ is a k x m matrix

    • X′ is an n x m matrix

23
New cards

PCA — transformed data

Zk = X Wk

  • Wk is the transformation matrix with only the first k columns (m x k)

24
New cards

PCA — reduce dimensionality

to reduce dimensionality, keep only the first k principal components

25
New cards

PCA - variance explained

(λ1 + λ2 + ... + λk) / ∑i λi
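In code, this ratio is a cumulative sum over the eigenvalues (the eigenvalues below are made up for illustration):

```python
import numpy as np

# hypothetical PCA eigenvalues, sorted in decreasing order
eigvals = np.array([4.0, 2.0, 1.0, 1.0])
# explained[k-1] = fraction of variance captured by the first k components
explained = np.cumsum(eigvals) / eigvals.sum()
# first two components explain (4 + 2) / 8 = 75% of the variance
```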

26
New cards

Kernel PCA

  • for non-linear dimensionality reduction

  • use kernel trick (same one used for SVMs)

27
New cards

Manifold Learning

Find mapping of dataset X into dataset Z embedded in Rp (zi ∈ Rp) for some integer p < m

  • Such that (informally) Z preserves the local geometry of X

    • For example, if xi and xj are close (according to some distance metric), then zi and zj are also close

    • Want to preserve neighborhood structure

28
New cards

multidimensional scaling (MDS)

  • compute euclidean distance dij between any two points xi and xj

29
New cards

manifold learning — algorithms

different algorithms aim to preserve different “local” properties

  • multidimensional scaling (MDS)

  • locally linear embedding (LLE)

  • Isomap

  • t-distributed stochastic neighbor embedding (t-SNE)

30
New cards

locally linear embedding (LLE)

express each data point as a linear combination of its closest neighbors

31
New cards

multidimensional scaling

preserve distances between points

32
New cards

isomap

  • form a graph where points are connected to their closest neighbors

    • aims to preserve geodesic distance (i.e., shortest-path distance)

33
New cards

t-distributed stochastic neighbor embedding (t-SNE)

  • tries to keep similar data points close together, dissimilar data points far apart

  • mostly used for visualization

34
New cards

model formula

hw,b(x) = f(w · x + b)

35
New cards

single neuron/unit

(image: diagram of a single neuron)
36
New cards

NN — types of layers

  • dense (fully connected)

  • convolutional

  • recurrent

37
New cards

NN - activation functions

  • identity / linear

  • sigmoid

  • tanh

  • ReLU

  • Softmax

38
New cards

NN — identity activation function

f(z) = z

  • or none

39
New cards

NN — sigmoid activation function

f(z) = 1 / (1 + e⁻ᶻ)

40
New cards

NN — TanH activation function

f(z) = (eᶻ − e⁻ᶻ) / (eᶻ + e⁻ᶻ)

41
New cards

NN — ReLU activation function

f(z) = max(0, z)
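The four per-unit activation functions defined in the cards above can be sketched directly in NumPy (function names are illustrative):

```python
import numpy as np

def identity(z):
    return z                              # f(z) = z

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))       # f(z) = 1 / (1 + e^-z)

def tanh(z):
    return np.tanh(z)                     # (e^z - e^-z) / (e^z + e^-z)

def relu(z):
    return np.maximum(0.0, z)             # f(z) = max(0, z)
```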

42
New cards

NN — Softmax activation function

f(zj) = exp(zj / T) / ∑i exp(zi / T )

  • note: in this case, the activation function is over an entire layer, not a single unit
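A sketch of the layer-wide softmax with temperature T as defined above (subtracting max(z) first is the usual numerical-stability trick; it does not change the result):

```python
import numpy as np

def softmax(z, T=1.0):
    """Softmax over a whole layer's raw outputs z, with temperature T."""
    e = np.exp((z - z.max()) / T)   # shift by max(z) for numerical stability
    return e / e.sum()              # normalize so the outputs sum to 1
```

Higher T flattens the distribution toward uniform; lower T sharpens it toward the argmax.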

43
New cards

NN — Loss

Make sure the loss function and activation function of the output layer are consistent with each other

44
New cards

Single Neuron — Linear Regression

one layer NN with a single neuron:

  • Activation function: Linear

  • Loss function: MSE

45
New cards

single neuron — (binary) logistic regression

one layer NN with a single neuron

  • Activation function: sigmoid

  • Loss function: binary cross-entropy

46
New cards

Multi-Layer Perceptron (MLP)

  • input layer (passthrough)

  • one or more hidden layers

  • output layer

47
New cards

backpropagation — overview

reverse pass to measure error and propagate error gradient backwards in the network

  • adjust the weights (parameters) to decrease the loss

48
New cards

backpropagation

how to compute gradients efficiently

49
New cards

gradient descent

how to update the parameters to minimize the loss given the function

50
New cards

backprop algorithm

  • compute the forward pass for the mini-batch B saving the intermediate results at each layer

  • compute the loss on the mini-batch B (compares output of network to labels/targets → error)

  • backwards pass: computes the per-weight gradients (error contribution) layer by layer

    • done using chain rule

  • Stochastic gradient descent: update the weights based on the gradients
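Specialized to a single sigmoid neuron with binary cross-entropy, those steps fit in a few lines (a toy sketch; the data and learning rate are made up):

```python
import numpy as np

# toy 1-feature dataset: x >= 2 means class 1
X = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.0, 0.0, 1.0, 1.0])
w, b, lr = 0.0, 0.0, 0.5

for _ in range(500):
    # forward pass, saving the intermediates z and a
    z = w * X + b
    a = 1.0 / (1.0 + np.exp(-z))
    # backward pass: for sigmoid + cross-entropy the chain rule
    # collapses to dL/dz = a - y (averaged over the batch)
    dz = (a - y) / len(y)
    dw, db = (dz * X).sum(), dz.sum()
    # gradient descent update
    w, b = w - lr * dw, b - lr * db
```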

51
New cards

chain rule

if z depends on y and y depends on x:

dz/dx = dz/dy · dy/dx
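A tiny numeric check of the rule, using made-up functions:

```python
# z = y**2 and y = 3*x, so dz/dx = (dz/dy) * (dy/dx) = 2*y * 3 = 18*x
x = 2.0
y = 3.0 * x                 # y = 6
dz_dx = (2.0 * y) * 3.0     # chain rule: 36
```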

52
New cards

backprop — illustration

(image: backprop illustration)
53
New cards

Use NN

  • complex learning task or complex data

  • for many problems, NN provide the best performance

    • image classification/captioning tasks

    • speech recognition

    • natural language modeling

54
New cards

dont use NN

  • can solve problem with simple model

  • don’t have a lot of data

    • NN require a lot of data to achieve good performance

  • need an explainable/interpretable model

    • there are some techniques to explain decisions from NN

55
New cards

universal approximation theorems

(feed-forward) NN can approximately represent any function

56
New cards

arbitrary width; bounded depth

true even if we have a single hidden layer as long as it can have arbitrarily many units

57
New cards

bounded width; arbitrary depth

true even if we have layers of bounded width, as long as the network can have arbitrarily many layers

58
New cards

NN architecture disadvantage

an output-layer activation function that is inconsistent with the loss function

59
New cards

multiclass classification with cross-entropy loss, softmax activation for output layers

okay

60
New cards

regression with MSE as loss, tanh activation for output layer

Fail — tanh output is bounded to (−1, 1), so it cannot represent arbitrary regression targets

61
New cards

regression with MSE as loss, linear activation for output layer

okay

62
New cards

funnel trick

for supervised learning, we typically have large input feature vectors and small output vectors

  • should make the network look like a funnel

63
New cards

multiclass classification with 10 classes and m = 100 input features

(input, hidden layer 1, hidden layer 2, hidden layer 3, output layer)

  • 100, 64, 32, 16, 10

Activations:

  • output: softmax

  • elsewhere: ReLU

64
New cards

deep NN

any NN with two or more hidden layers

65
New cards

Deep NN (DNN) challenges

endless options for network architecture/topology

  • # of layers, units per layer, connections between units, activation functions, weight initialization method

66
New cards

DNN - hyperparameters related to learning

  • optimizer

  • learning rate

  • decay/momentum

  • (mini) batch size

  • # of epochs

67
New cards

DNN - Number of Hidden layers

Deep > shallow: for the same number of parameters, more hidden layers is better than wider layers

  • parameter efficiency

68
New cards

number of units in each layer - funnel approach

make the network look like a funnel

69
New cards

number of units in each layer - “stretch pants”

make hidden layers wider than what you need and then regularize (ex. dropout)

70
New cards

DNN - activation function

hidden layers — ReLU or ReLU variants

  • faster to compute than alternatives

  • GD less likely to get stuck

output layer

  • multiclass classification: softmax

  • binary classification or multilabel: sigmoid

  • regression: linear (no activation function)

71
New cards

DNN learning rate

start with a low value (e.g., 0.00001), then multiply by 10 each time and train for a few epochs

  • once training diverges — have gone too far

72
New cards

training diverges

the model isn’t settling toward a stable solution — it’s going in the wrong direction instead of improving

  • loss/error starts to increase instead of decreasing

73
New cards

DNN - optimizer

use Adam or SGD

74
New cards

DNN — batch size

  • small batch approach: ex. 32, 64, 100

  • large batch approach: the largest size that fits your GPU’s RAM and use learning rate warmup

75
New cards

DNN — number of epochs/iterations

use early stopping

  • stop training before the model starts getting worse on new (unseen) data

  • stop at the point where validation performance is best, not where training loss is lowest
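A common way to implement this is a patience counter over the validation-loss history (a sketch; the loss values and the `patience` setting are made up, and a real loop would checkpoint the weights at the best epoch):

```python
# hypothetical validation loss per epoch: improves, then starts rising
val_losses = [0.9, 0.7, 0.6, 0.55, 0.56, 0.58, 0.60]

patience = 2                       # epochs to wait without improvement
best, best_epoch, wait = float("inf"), -1, 0
for epoch, loss in enumerate(val_losses):
    if loss < best:
        best, best_epoch, wait = loss, epoch, 0   # checkpoint weights here
    else:
        wait += 1
        if wait >= patience:
            break                  # stop: validation loss stopped improving
```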

76
New cards

vanishing gradient

gradient vector becomes very small during backprop

  • difficult to update weights of lower/earlier layers — training does not converge

77
New cards

exploding gradient

gradient vector becomes very large during backprop

  • difficult to update weights of lower/earlier layers — training does not converge

78
New cards

training converges

with GD, the process has found (or gotten very close) to a minimum of the loss function — each additional step makes little to no difference

79
New cards

unstable gradients

layers of a dnn learn at very different rates

80
New cards

exploding/vanishing gradient mitigations

  • weight initialization method

  • non-saturating activation functions

  • batch normalization

  • gradient clipping (for exploding gradient)

  • skip-connections (CNNs)

81
New cards

Glorot initialization (Xavier initialization)

  • Let nin = number of inputs, nout = number of outputs, navg = (nin + nout)/2

  • Gaussian (for sigmoid activation): mean 0, variance σ² = 1/navg

  • Uniform (for sigmoid activation): in [-r, r] where r = √(3/navg)

note: biases are initialized to 0

82
New cards

He initialization

Gaussian (for ReLU): mean 0, variance σ² = 2/navg

  • note: biases are initialized to 0
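A sketch of this initialization for one dense layer, following the card's 2/navg convention (note the original He et al. paper uses 2/nin; the function name and seed are illustrative):

```python
import numpy as np

def he_init(n_in, n_out, seed=0):
    """He initialization per this card: Gaussian, mean 0, variance 2/n_avg."""
    rng = np.random.default_rng(seed)
    n_avg = (n_in + n_out) / 2
    W = rng.normal(0.0, np.sqrt(2.0 / n_avg), size=(n_in, n_out))
    b = np.zeros(n_out)   # biases are initialized to 0
    return W, b
```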

83
New cards

non-saturating activation functions

ReLU + ReLU variants

84
New cards

dying ReLUs

  • a neuron can die when the weighted sums of its input are negative (for all examples in the training data)

85
New cards

ReLU variants

  • Leaky ReLU

  • ELU

  • SELU (Scaled ReLU)

these will not let neurons die because they can output negative values

86
New cards

Leaky ReLU

LeakyReLUa(z) = max{az, z}

e.g.: a = 0.01

87
New cards

ELU

ELUa(z) = z if z ≥ 0 and a(eᶻ − 1) otherwise (z < 0)

e.g., a = 1
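Both variants from the two cards above, sketched in NumPy (function names are illustrative):

```python
import numpy as np

def leaky_relu(z, a=0.01):
    """max{az, z}: a small negative slope keeps 'dead' neurons trainable."""
    return np.maximum(a * z, z)

def elu(z, a=1.0):
    """z for z >= 0, a*(e^z - 1) for z < 0: smooth, saturates at -a."""
    return np.where(z >= 0, z, a * (np.exp(z) - 1.0))
```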

88
New cards

activation functions — rating

ELU > leaky ReLU > ReLU > Tanh > Sigmoid

use ReLU in most cases, ELU if there is extra time because network will be slower

89
New cards

bias

error due to incorrect assumptions in the model

  • inability to capture the true relationship

90
New cards

variance

sensitivity to small variations in the training data

91
New cards

strategies to lower bias

  • increase model complexity

  • use more features

92
New cards

strategies to lower variance

  • reduce model complexity

  • use more training data

93
New cards

DNN reg technique — early stopping

stop once the validation loss is at its minimum (before it starts to go back up)

94
New cards

DNN reg technique — L1 or L2 reg

  • L1: penalty term λ ∑i |wi|

  • L2: penalty term λ ∑i wi²
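A quick numeric sketch of the two penalty terms (the weight vector and λ are made up):

```python
import numpy as np

w = np.array([0.5, -1.0, 2.0])        # hypothetical weights
lam = 0.1                             # hypothetical regularization strength

l1_penalty = lam * np.abs(w).sum()    # 0.1 * (0.5 + 1.0 + 2.0) = 0.35
l2_penalty = lam * (w ** 2).sum()     # 0.1 * (0.25 + 1.0 + 4.0) = 0.525
```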

95
New cards

DNN reg technique — max-norm reg

the norm of the weights incoming to each neuron is at most r (hyperparameter): ||w||₂ ≤ r

note: this is not added to the loss — after each training step, the weights are rescaled to ensure ||w||₂ ≤ r

96
New cards

DNN reg technique — dropout

idea: during training, each neuron has a probability p of being dropped out (it will be ignored for this step)

  • hyperparameter p is called the dropout rate

  • after training: we do not drop any neurons anymore (but need to adjust connection weights)
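A sketch of "inverted" dropout, the common modern variant: instead of adjusting connection weights after training, it scales the surviving activations by 1/(1−p) during training, so inference needs no adjustment at all (function name and signature are assumptions):

```python
import numpy as np

def dropout(a, p, rng, training=True):
    """Inverted dropout: zero each activation with probability p during
    training; scale survivors so the expected activation is unchanged."""
    if not training:
        return a                       # no neurons are dropped at inference
    mask = rng.random(a.shape) >= p    # keep each unit with probability 1 - p
    return a * mask / (1.0 - p)
```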

97
New cards

generalization error

aka out-of-sample error or risk

  • prediction error on unseen data

  • related to overfitting: if the model overfits, then the generalization error will be large

bias² + variance + irreducible error

98
New cards

bias-variance tradeoff

  • increasing model complexity — lower bias

  • decreasing model complexity — lower variance

99
New cards

double descent

  • unexpected and sudden change in test error as a function of model complexity (# of parameters)

    • Note: not all settings/models/data yield exactly two descents (e.g., sometimes there are three or more)

100
New cards

grokking

NN learns to generalize suddenly well after the network has overfitted

note: this occurs in very specific settings (specific tasks, small “algorithmic datasets”, complex neural networks)