Support Vector Machine (SVM)
A supervised learning model that finds a decision boundary maximizing the margin between classes
Margin
Distance between decision boundary and closest data points
Support vectors
Points closest to the decision boundary that determine the margin
Linear separability
Data can be separated by a linear boundary
Convex optimization
Optimization where any local minimum is global
Hard margin SVM
SVM with no tolerance for misclassification
Soft margin SVM
SVM allowing violations using slack variables
Slack variable (ξᵢ)
Measures margin violation for a data point
Regularization parameter (C)
Controls tradeoff between margin size and violations
Large C
Low tolerance for errors, smaller margin
Small C
Higher tolerance for errors, larger margin
Hard margin objective
min (1/2)||w||²
Hard margin constraint
yᵢ(wᵀxᵢ) ≥ 1
Soft margin objective
min (1/2)||w||² + (C/N)Σξᵢ
Soft margin constraint
yᵢ(wᵀxᵢ) ≥ 1 − ξᵢ and ξᵢ ≥ 0
Hinge loss
max(0, 1 − yᵢwᵀxᵢ), penalizes points inside margin or misclassified
SVM loss
(1/2)||w||² + (C/N)Σmax(0, 1 − yᵢwᵀxᵢ)
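A minimal NumPy sketch of the hinge loss and soft-margin SVM objective from the cards above; the data, weight vector, and C value are made up for illustration (bias absorbed into w):

```python
import numpy as np

# Toy linearly-scored dataset; labels in {-1, +1}. One point ([0.3, 0.2])
# sits inside the margin so the hinge loss is nonzero there.
X = np.array([[2.0, 1.0], [1.0, 2.0], [0.3, 0.2], [-1.0, -1.0], [-2.0, -0.5]])
y = np.array([1, 1, 1, -1, -1])
w = np.array([1.0, 1.0])   # illustrative weight vector (bias absorbed)
C = 1.0
N = len(y)

margins = y * (X @ w)                      # y_i * w^T x_i
hinge = np.maximum(0.0, 1.0 - margins)     # max(0, 1 - y_i w^T x_i)
objective = 0.5 * (w @ w) + (C / N) * hinge.sum()
```

Only the point inside the margin contributes to the hinge term; the rest of the objective is the L2 penalty (1/2)||w||².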
L2 regularization
Penalty discouraging large weights
Logistic loss
log(1 + e^{−y wᵀx})
Hinge vs logistic loss
Hinge is piecewise linear, logistic is smooth
Feature mapping (ϕ(x))
Transforms data to higher dimension
Kernel function
K(x,z) = ϕ(x)ᵀϕ(z), computes inner product in feature space without explicit mapping
Kernel trick
Computes inner products in feature space via K(x,z), without ever computing ϕ(x) explicitly
Gaussian kernel
exp(−||x−y||² / (2σ²))
Polynomial kernel
(xᵀy + c)^d
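A small sketch of the two kernels above, plus a check of the kernel trick: for the degree-2 polynomial kernel with c = 0 in 2D, the explicit feature map ϕ(x) = [x₁², √2·x₁x₂, x₂²] satisfies ϕ(x)ᵀϕ(z) = (xᵀz)². The vectors and σ are illustrative:

```python
import numpy as np

x = np.array([1.0, 2.0])
z = np.array([3.0, 1.0])
sigma = 1.0

# Gaussian (RBF) kernel: exp(-||x - z||^2 / (2 sigma^2))
gauss = np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

# Polynomial kernel, degree 2, c = 0
poly = (x @ z) ** 2

def phi(v):
    # Explicit feature map whose inner product equals the degree-2 kernel.
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

# Kernel trick check: same number, without the explicit mapping.
assert np.isclose(phi(x) @ phi(z), poly)
```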
Dual form SVM
Optimization in terms of α variables
Dual solution
w = Σαᵢ yᵢ xᵢ (for linear SVM after solving dual problem)
Why kernels matter
Allow nonlinear boundaries efficiently
Parametric model
Model with fixed number of parameters independent of dataset size
Nonparametric model
Model whose complexity grows with dataset size
Examples of parametric models
Linear regression, logistic regression, linear SVM
Examples of nonparametric models
KNN, decision trees
K-Nearest Neighbors (KNN)
Model that predicts using nearby data points
KNN training
Stores all training data
KNN prediction
Uses majority vote (classification) or average (regression)
KNN classification formula
ŷ(x) = sign(Σ_{i ∈ N_k(x)} yᵢ)
KNN regression formula
f(x) = (1/k)Σyᵢ
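The two prediction formulas above can be sketched directly in NumPy; the toy 1-D training set and k = 3 are illustrative:

```python
import numpy as np

# Toy training set: KNN "training" is just storing the data.
X_train = np.array([[0.0], [1.0], [2.0], [10.0], [11.0]])
y_train = np.array([1, 1, 1, -1, -1])   # labels in {-1, +1}

def knn_classify(x, k=3):
    # Majority vote among the k nearest training points.
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    return np.sign(y_train[nearest].sum())

def knn_regress(x, k=3):
    # Average of the k nearest targets.
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    return y_train[nearest].mean()
```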
KNN hyperparameter
K (number of neighbors)
Small K
Flexible, high variance, overfitting
Large K
Smooth, high bias, underfitting
Distance metric
Measure of similarity between points
Euclidean distance
||x − x′||₂ = √(Σ (xᵢ − x′ᵢ)²)
Hamming distance
Counts mismatched entries
Jaccard distance
1 − (|A ∩ B| / |A ∪ B|)
Edit distance
Minimum edits to transform one string to another
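The distance metrics above on small made-up examples (Euclidean on vectors, Hamming on equal-length strings, Jaccard on sets):

```python
import numpy as np

# Euclidean: ||x - z||_2
x = np.array([3.0, 4.0])
z = np.array([0.0, 0.0])
euclid = np.sqrt(np.sum((x - z) ** 2))

# Hamming: count of mismatched positions in equal-length sequences
a, b = "karol", "carol"
hamming = sum(c1 != c2 for c1, c2 in zip(a, b))

# Jaccard distance: 1 - |A ∩ B| / |A ∪ B|
A = {1, 2, 3}
B = {2, 3, 4}
jaccard = 1 - len(A & B) / len(A | B)
```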
Weighted KNN
Neighbors weighted by distance
Kernel regression
Generalization of weighted KNN
Decision tree
Model that splits data using feature-based rules
Internal node
Test on a feature
Branch
Outcome of a test
Leaf node
Final prediction
Decision tree property
Interpretable and nonlinear
CART
Classification and Regression Trees algorithm
Tree building strategy
Greedy top-down splitting
Node impurity
Measure of how mixed labels are
Gini impurity
1 − Σ pᵢ², measures probability of misclassification in a node
Gini range
0 (pure) to 0.5 (max impurity in binary)
Best split
Minimizes the weighted average Gini impurity of the child nodes
Tree pruning
Reducing size to prevent overfitting
Stopping conditions
Pure node, no features, or too few samples
Decision tree hyperparameters
Depth, min samples, etc.
Discriminative model
Learns P(y|x)
Generative model
Learns P(x,y)
Key difference
Generative models the joint P(x,y), so it can sample new data; discriminative only models P(y|x)
Examples discriminative
Logistic regression, SVM
Examples generative
Naive Bayes, GMM
Naive Bayes
Generative classifier using Bayes rule
Bayes rule
P(A|B) = P(B|A)P(A) / P(B)
Naive assumption
Features are conditionally independent
Naive Bayes formula
P(y|x) ∝ P(y) Π P(xᵢ|y), denominator P(x) is ignored since it is constant
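The formula above, as a toy spam example with made-up probability tables (all numbers are illustrative assumptions):

```python
# Priors P(y) and per-word likelihoods P(word | y); the naive assumption
# is that words are conditionally independent given the class.
prior = {"spam": 0.4, "ham": 0.6}
likelihood = {
    "spam": {"free": 0.8, "meeting": 0.1},
    "ham":  {"free": 0.2, "meeting": 0.7},
}

def nb_score(words, cls):
    # Unnormalized P(y) * prod_i P(x_i | y); P(x) is dropped (constant in y).
    score = prior[cls]
    for w in words:
        score *= likelihood[cls][w]
    return score

words = ["free"]
scores = {c: nb_score(words, c) for c in prior}
prediction = max(scores, key=scores.get)
```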
Advantage of NB
Simple and efficient
Limitation of NB
Independence assumption unrealistic
Supervised learning
Uses labeled data (x,y)
Unsupervised learning
Uses unlabeled data (x only)
Clustering
Grouping similar data points
K-means clustering
Partitions data into K clusters
K-means objective
Minimize Σ min_k ||xᵢ − μ_k||² (assign each point to nearest cluster center)
Cluster center (μₖ)
Mean of points in cluster
Assignment step
Assign to nearest center
Update step
Recompute centers
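The assignment and update steps above can be sketched as one Lloyd-style loop; the 1-D data, K = 2, and initial centers are illustrative:

```python
import numpy as np

X = np.array([[0.0], [1.0], [9.0], [10.0]])
mu = np.array([[0.5], [8.0]])   # illustrative initial cluster centers

for _ in range(10):
    # Assignment step: each point goes to its nearest center.
    d = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
    assign = d.argmin(axis=1)
    # Update step: each center becomes the mean of its assigned points.
    mu = np.array([X[assign == k].mean(axis=0) for k in range(len(mu))])
```

On this toy data the loop converges to centers 0.5 and 9.5 (this sketch omits the empty-cluster check a robust implementation would need).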
K-means limitation
Objective is non-convex, so the algorithm may converge to a local minimum
Initialization sensitivity
Different starts give different results
K-means++
Better initialization strategy
Choosing K
Use elbow method
Elbow method
Find point where error reduction slows
Kernel K-means
Uses kernel trick for nonlinear clustering
Non-spherical clusters
Problem for standard K-means
Dimensionality reduction
Reduce number of features
Principal Component Analysis (PCA)
Find directions of max variance
Principal component
Direction capturing most variance
Covariance matrix
Measures feature relationships
First principal component
Eigenvector with largest eigenvalue
PCA objective
Maximize variance of projections subject to ||v|| = 1
Projection
Mapping data onto lower dimension
Orthogonality in PCA
Components are perpendicular
Eigenvalue meaning
Amount of variance captured
Eigenvector meaning
Direction of variance
Top-K PCA
Use top K eigenvectors
Purpose of PCA
Compression and noise reduction
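The PCA cards above can be sketched as an eigendecomposition of the covariance matrix; the synthetic data (points along one direction plus small noise) is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1)) @ np.array([[3.0, 1.0]])  # data along [3, 1]
X += 0.1 * rng.normal(size=X.shape)                     # small noise
Xc = X - X.mean(axis=0)                                 # center the data

cov = Xc.T @ Xc / (len(Xc) - 1)         # covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
pc1 = eigvecs[:, -1]                    # eigenvector with largest eigenvalue

projection = Xc @ pc1                   # 1-D projection onto the first PC
```

The first principal component recovers (up to sign) the direction [3, 1] the data was generated along, and the top eigenvalue carries almost all the variance.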
w (weight vector)
A vector of parameters that defines the model decision boundary