Why do we use logs when working with probabilities?
Multiplying many small probabilities causes floating-point underflow (computer rounds to zero). Logs convert multiplication to addition: ln(a×b) = ln(a) + ln(b), keeping values manageable. That's why loss functions use log probabilities.
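A minimal sketch of the underflow problem and the log fix (pure Python, toy probabilities chosen for illustration):

```python
import math

# Multiplying 1000 probabilities of 0.01 underflows to exactly 0.0.
probs = [0.01] * 1000
product = 1.0
for p in probs:
    product *= p
print(product)  # 0.0 — the true value 1e-2000 is below float range

# Summing logs keeps the same information in a manageable range.
log_product = sum(math.log(p) for p in probs)
print(log_product)  # ≈ -4605.17, i.e. 1000 * ln(0.01)
```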
Key log/exponent values to memorize
ln(1)=0, ln(2)≈0.693, ln(5)≈1.609, e^0=1, ln(e^x)=x, e^ln(x)=x
What does the sigmoid function do?
f(x) = 1/(1+e^(-x)). Maps any real number to a value between 0 and 1 (S-shaped curve). Very negative → near 0, very positive → near 1, zero → exactly 0.5. Used in logistic regression for binary classification.
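A sketch of sigmoid in plain Python, using the standard numerically stable split to avoid overflow in exp:

```python
import math

def sigmoid(x: float) -> float:
    # Stable form: never call exp on a large positive argument.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    return math.exp(x) / (1.0 + math.exp(x))

print(sigmoid(0))    # 0.5
print(sigmoid(10))   # ≈ 0.99995 — very positive input saturates near 1
print(sigmoid(-10))  # ≈ 0.000045 — very negative input saturates near 0
```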
What does softmax do?
Exponentiates each logit (making everything positive), then divides by the sum. Output: every value between 0 and 1, all summing to 1 — a valid probability distribution over classes. Used for multi-class classification.
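A sketch of softmax in plain Python; subtracting the max logit first is the standard trick to avoid overflow in exp (it doesn't change the result):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]  # all positive
    total = sum(exps)
    return [e / total for e in exps]          # sums to 1

probs = softmax([1.0, 2.0, 3.0])
print(probs)       # each value in (0, 1), largest logit gets largest prob
print(sum(probs))  # 1.0
```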
What are logits?
Raw, unbounded numbers output by a neural network's final layer. They have no probabilistic meaning until converted by sigmoid (binary) or softmax (multi-class).
Why does CrossEntropyLoss take raw logits, not probabilities?
CrossEntropyLoss applies softmax internally. If you apply softmax yourself first, you're softmaxing twice, which squashes gradients and hurts training.
Why use one-hot encoding instead of integer labels?
Integers (Apple=1, Banana=2, Cherry=3) imply false numerical relationships — the model might think Cherry is 3× Apple or Banana is halfway between. One-hot encoding ([1,0,0], [0,1,0], [0,0,1]) treats classes as independent dimensions with no ordering. Pairs naturally with softmax and cross-entropy loss.
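A minimal one-hot encoder illustrating the idea (class indices here start at 0, as is conventional in code):

```python
def one_hot(label: int, num_classes: int) -> list[int]:
    vec = [0] * num_classes
    vec[label] = 1
    return vec

# Apple=0, Banana=1, Cherry=2 — the vectors imply no ordering or magnitude.
print(one_hot(0, 3))  # [1, 0, 0]
print(one_hot(2, 3))  # [0, 0, 1]
```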
What is NLL and why is it better than MSE for classification?
NLL = -ln(p_true). When the model is confidently wrong (p_true near 0), NLL's gradient becomes enormous, forcing fast correction. MSE's gradient stays mild in that situation, so learning is slower. Minimizing NLL = maximizing likelihood of correct labels.
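The gradient contrast can be checked directly: d/dp of -ln(p) is -1/p (unbounded as p → 0), while d/dp of (p − 1)² is 2(p − 1) (magnitude never above 2 for p in [0, 1]):

```python
def nll_grad(p):
    # Derivative of -ln(p) w.r.t. p: blows up as p -> 0.
    return -1.0 / p

def mse_grad(p):
    # Derivative of (p - 1)^2 w.r.t. p: magnitude bounded by 2 on [0, 1].
    return 2.0 * (p - 1.0)

for p in (0.5, 0.1, 0.001):
    print(p, abs(nll_grad(p)), abs(mse_grad(p)))
# At p = 0.001 (confidently wrong) NLL's gradient magnitude is ~1000; MSE's < 2.
```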
What does nn.CrossEntropyLoss combine?
Softmax + Negative Log Likelihood in one function. Input: raw logits and integer labels. Do NOT apply softmax yourself first.
When to use nn.BCELoss instead of nn.CrossEntropyLoss?
For binary or multi-label classification — one sigmoid output per label, so items can belong to multiple classes simultaneously. Note BCELoss expects probabilities (apply sigmoid first, or use nn.BCEWithLogitsLoss on raw logits). CrossEntropyLoss is for single-label multi-class problems.
What are the 4 steps of the PyTorch training loop (in order)?
1) optimizer.zero_grad() — clear old gradients. 2) Forward pass — compute predictions and loss. 3) loss.backward() — compute gradients. 4) optimizer.step() — update parameters.
What happens if you skip optimizer.zero_grad()?
Gradients accumulate across batches (PyTorch default behavior). Parameter updates become nonsensically large and loss explodes/diverges.
What three methods must a custom PyTorch Dataset implement?
__init__: set up data (e.g., convert NumPy arrays to tensors). __len__: return the number of data entries. __getitem__: return the (feature, label) pair at a given index.
What two methods must a custom nn.Module implement?
__init__(): set up layers (e.g., nn.Linear). forward(): define how data flows through the layers and return the prediction. PyTorch auto-tracks all parameters for the optimizer.
What does nn.Linear(784, 10) do?
An affine map (matrix multiplication plus a bias vector) from a 784-dimensional input (flattened 28×28 image) to 10 output values (one per class).
SGD vs. Adam: what's the difference?
SGD: same learning rate for every parameter. Simple, memory-efficient, needs careful LR tuning. Adam: tracks per-parameter running averages of gradient mean (momentum) and gradient magnitude (adaptive scaling). Less sensitive to LR choice, converges faster, but uses ~3× memory and can sometimes generalize worse.
When to use Adam vs. SGD?
Adam: default choice for fast convergence, prototyping, NLP, sparse data. SGD: when you can afford careful LR tuning/scheduling; sometimes generalizes better for vision models.
What does momentum do in optimization?
Each update = weighted combination of current gradient + previous update vector. If gradients point the same direction for several steps, effective step size grows. Accelerates convergence, reduces oscillation/noise, helps escape small local minima.
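A sketch of the momentum update rule (hypothetical values: lr = 0.1, beta = 0.9 — the typical default mentioned on the next card):

```python
def momentum_step(grad, velocity, lr=0.1, beta=0.9):
    # New update = beta * previous update + lr * current gradient.
    return beta * velocity + lr * grad

# Gradients that keep pointing the same way: the effective step grows
# toward the limit lr / (1 - beta) = 1.0, a 10x amplification here.
v = 0.0
for _ in range(20):
    v = momentum_step(grad=1.0, velocity=v)
print(v)  # ≈ 0.88 after 20 steps, approaching 1.0
```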
Downsides of momentum?
Accumulated velocity can overshoot past a minimum. Adds a hyperparameter (coefficient, typically 0.9). Already built into Adam.
Epoch vs. Batch — definitions
Epoch: one full pass over the entire training dataset. Batch: a subset processed in one forward/backward pass.
How to calculate total parameter updates
Updates per epoch = dataset_size / batch_size. Total updates = (dataset_size / batch_size) × num_epochs. Example: 5000 samples, batch=100, 10 epochs → 50 × 10 = 500 updates.
Batch size = 1 vs. small batch vs. full dataset
Batch=1 (Stochastic GD): very noisy, slow. Small batch (e.g., 100, Mini-Batch SGD): balance of speed and noise, most common. Batch=dataset (Full-Batch GD): smooth gradients but expensive and memory-heavy.
What is backtracking line search?
After computing gradient direction, start with a large step size α, evaluate loss, shrink α (e.g., ×0.5) if loss didn't decrease enough, repeat. Used in classical optimization, not deep learning (too expensive per iteration).
What is the Armijo condition?
The actual loss decrease must be at least a fraction c₁ of what the gradient predicted (sufficient decrease). It's the criterion backtracking line search checks at each try. Also called the 1st Wolfe condition.
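A 1-D sketch of backtracking line search with the Armijo check, on a toy quadratic (c₁ = 1e-4 and the ×0.5 shrink factor are conventional choices, not from the cards):

```python
def backtracking(f, grad, x, alpha=1.0, c1=1e-4, shrink=0.5):
    # Descent direction is -g; shrink alpha until the Armijo
    # (sufficient decrease) condition holds:
    #   f(x - alpha*g) <= f(x) - c1 * alpha * g^2
    g = grad(x)
    while f(x - alpha * g) > f(x) - c1 * alpha * g * g:
        alpha *= shrink
    return alpha

f = lambda x: x * x          # toy loss
grad = lambda x: 2 * x       # its gradient

alpha = backtracking(f, grad, x=3.0)
print(alpha)                         # 0.5 — alpha=1.0 overshot, one shrink fixed it
print(f(3.0 - alpha * grad(3.0)))    # 0.0 — lands exactly on the minimum here
```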
What does logistic regression do differently from linear regression?
Wraps the linear output in a sigmoid, bounding predictions to (0, 1). Still a linear model (straight decision boundary), but outputs are valid probabilities. SKLearn: SGDClassifier(loss='log_loss').
How do SVMs work?
Find the hyperplane that maximizes the margin (distance to nearest points of each class). Those nearest points = support vectors. Kernel trick enables non-linear boundaries by implicitly mapping to higher dimensions.
When to use SVMs and what are their downsides?
Use for: small-to-medium datasets, high-dimensional data, text classification. Downsides: training scales O(n²)–O(n³), impractical for large datasets. No native probability output. Sensitive to feature scaling. Multi-class needs wrappers.
How do decision trees work?
Recursively split feature space by choosing the feature and threshold that best separates classes (Gini impurity or information gain) at each node. Prediction: follow yes/no splits to a leaf.
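Gini impurity, the split criterion named above, is simple to compute (a sketch):

```python
def gini(labels):
    # Gini impurity: 1 - sum of squared class proportions.
    # 0 = pure node; higher = more mixed.
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

print(gini(["A", "A", "A", "A"]))  # 0.0 — pure node, nothing to split
print(gini(["A", "A", "B", "B"]))  # 0.5 — maximally mixed for 2 classes
```

At each node, the tree picks the feature/threshold whose split minimizes the weighted Gini of the two child nodes.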
Decision trees: strengths and weaknesses?
Strengths: interpretable (readable rules), fast, handles mixed features. Weaknesses: high variance (small data changes → different trees), overfit easily. Mitigate with max_depth, min samples per leaf, pruning. Usually outperformed by ensembles.
How do Random Forests work?
Train N decision trees independently on random bootstrap samples; each split uses a random feature subset. Combine via majority vote (bagging). Reduces variance/overfitting.
Random Forests: when to use and downsides?
Use: strong default for tabular data, minimal tuning needed, handles mixed features. Downsides: more memory, slower prediction than single tree, less interpretable. Doesn't reduce bias — if trees are systematically wrong the same way, forest will be too.
How does Boosting (XGBoost) work?
Builds shallow trees sequentially. Tree 1 fits data. Tree 2 fits residual errors of tree 1. Tree 3 fits residual errors of trees 1+2. Final prediction = weighted sum of all trees' outputs.
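A toy sketch of the residual-fitting mechanism on a 1-D regression problem. Each weak learner here is a depth-1 decision stump, and the 0.5 learning rate and 30 rounds are illustrative choices, not from the cards:

```python
X = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]

def fit_stump(X, r):
    # Pick the threshold minimizing squared error on the residuals r;
    # predict the residual mean on each side of the split.
    best = None
    for t in X:
        left = [ri for xi, ri in zip(X, r) if xi <= t]
        right = [ri for xi, ri in zip(X, r) if xi > t]
        lm = sum(left) / len(left) if left else 0.0
        rm = sum(right) / len(right) if right else 0.0
        sse = sum((ri - (lm if xi <= t else rm)) ** 2
                  for xi, ri in zip(X, r))
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    return best[1:]

def predict(stumps, x, lr):
    # Final prediction = (learning-rate-weighted) sum of all stumps.
    return sum(lr * (lm if x <= t else rm) for t, lm, rm in stumps)

stumps, lr = [], 0.5
for _ in range(30):
    # Each new stump fits the residual errors of everything built so far.
    r = [yi - predict(stumps, xi, lr) for xi, yi in zip(X, y)]
    stumps.append(fit_stump(X, r))

preds = [predict(stumps, xi, lr) for xi in X]
print(preds)  # close to [3, 5, 7, 9]
```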
Boosting: when to use and downsides?
Use: maximum accuracy on tabular data, dominates competitions. Downsides: sensitive to outliers (keeps chasing errors), more prone to overfitting than RF, sequential (can't parallelize), more hyperparameters to tune.
Random Forests vs. Boosting — key differences
RF: trees built independently in parallel, combined by majority vote, reduces variance. Boosting: trees built sequentially (each corrects previous errors), combined by weighted sum, reduces bias. RF is more robust to outliers; Boosting is more accurate but riskier.
What is a GMM?
Gaussian Mixture Model: models data as coming from K overlapping Gaussian distributions, each with its own mean, covariance, and mixing weight. It's a generative model — can synthesize new data by sampling from the learned distribution.
What happens in the E-step of EM for a GMM?
Expectation step: given current Gaussian parameters, compute the probability each data point belongs to each Gaussian component (soft assignments — e.g., a point might be 70% cluster A, 30% cluster B).
What happens in the M-step of EM for a GMM?
Maximization step: given soft assignments from the E-step, update each Gaussian's mean and covariance as a weighted average of the data points. Then repeat E-step. Guaranteed to improve or stay the same each iteration.
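The two steps above can be sketched end-to-end for a 1-D, K=2 mixture (toy data and starting parameters chosen for illustration; a real GMM would also fit full covariances):

```python
import math

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Two well-separated 1-D clusters, around 1.0 and 5.0.
data = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
mus, vars_, weights = [0.0, 6.0], [1.0, 1.0], [0.5, 0.5]

for _ in range(50):
    # E-step: soft assignment of each point to each component.
    resp = []
    for x in data:
        p = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, mus, vars_)]
        s = sum(p)
        resp.append([pi / s for pi in p])
    # M-step: responsibility-weighted mean, variance, and mixing weight.
    for k in range(2):
        rk = [r[k] for r in resp]
        nk = sum(rk)
        mus[k] = sum(r * x for r, x in zip(rk, data)) / nk
        vars_[k] = sum(r * (x - mus[k]) ** 2
                       for r, x in zip(rk, data)) / nk + 1e-6
        weights[k] = nk / len(data)

print(mus)  # ≈ [1.0, 5.0] — each Gaussian has found one cluster
```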
What is image quantization / posterization with a GMM?
Fit GMM (K=8) to all pixels (each pixel = 3D RGB point). Each Gaussian captures a color cluster. Replace every pixel with its cluster's mean color. Reduces millions of colors to K colors = compression + flat poster-like artistic effect.
What does PCA do before a GMM?
Finds directions of greatest variance, projects data onto fewer dimensions while preserving most information. PCA(0.99) keeps 99% of variance. Reduces a 784-dim image to ~100–200 dims, making GMM fitting faster and more stable.
What does whiten=True do in PCA?
Rescales each principal component to unit variance so no single direction dominates the GMM's distance calculations. All directions contribute equally to fitting.
Downsides of PCA?
Linear — misses non-linear structure. Variance threshold is somewhat arbitrary. Principal components are linear combos of all features, so individual feature interpretability is lost.
ROC curve vs. Precision-Recall curve
ROC: TPR vs FPR, good for balanced data. Can be misleading on imbalanced data (high TNs hide low precision). PR: Precision vs Recall, focuses on the positive class, much more informative for imbalanced data. Rule: balanced → ROC, imbalanced → PR.
What does a confusion matrix show?
Grid with true classes as rows, predicted classes as columns. Diagonal = correct predictions. Off-diagonal = specific confusions (e.g., model calls 7s "1s"). Shows not just how often the model is wrong, but how it's wrong.
What is L1 (Lasso) regularization?
Adds sum of absolute weight values to the loss during training. Pushes small weights to exactly zero → sparse models, automatic feature selection. Downside: non-differentiable at zero (corners in loss surface).
What is L2 (Ridge) regularization?
Adds sum of squared weight values to the loss during training. Shrinks all weights toward zero but rarely to exactly zero → smoother, more stable models. Downside: doesn't eliminate irrelevant features.
When to use L1 vs. L2?
L1: when many features are likely irrelevant (want sparsity/feature selection). L2: when most features contribute somewhat (want to prevent any single weight from dominating).
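The two penalty terms side by side (lam = 0.1 and the weights are arbitrary illustrative values):

```python
def l1_penalty(weights, lam):
    # Sum of absolute values — its gradient pushes small weights to exactly 0.
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam):
    # Sum of squares — shrinks all weights proportionally, rarely to exactly 0.
    return lam * sum(w * w for w in weights)

w = [0.5, -2.0, 0.0, 3.0]
base_loss = 1.0  # placeholder for the data loss
print(base_loss + l1_penalty(w, 0.1))  # 1.0 + 0.1 * 5.5   = 1.55
print(base_loss + l2_penalty(w, 0.1))  # 1.0 + 0.1 * 13.25 = 2.325
```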
What is AIC?
Akaike Information Criterion. AIC = 2k - 2·ln(L), where k = number of parameters, L = maximum likelihood. Penalty term is 2k (fixed, doesn't depend on dataset size). Lower = better. Used after training to compare models.
What is BIC?
Bayesian Information Criterion. BIC = k·ln(n) - 2·ln(L), where k = number of parameters, n = number of data points, L = maximum likelihood. Penalty term is k·ln(n). Lower = better. Used after training to compare models.
AIC vs. BIC — key difference
BIC's penalty grows with dataset size (k·ln(n)), so for large datasets BIC penalizes complexity more heavily and selects simpler models. AIC's penalty (2k) is fixed regardless of dataset size.
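Both formulas in code, showing the penalty difference directly (log-likelihood of −100 and k = 10 are made-up numbers for illustration):

```python
import math

def aic(k, log_likelihood):
    return 2 * k - 2 * log_likelihood

def bic(k, n, log_likelihood):
    return k * math.log(n) - 2 * log_likelihood

# Same fit quality (ln L = -100), 10 parameters:
print(aic(10, -100))         # 220.0 — identical regardless of dataset size
print(bic(10, 100, -100))    # 10*ln(100)   + 200 ≈ 246.05
print(bic(10, 10000, -100))  # 10*ln(10000) + 200 ≈ 292.10 — penalty grew with n
```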
AIC/BIC vs. L1/L2 — fundamental difference
L1/L2 operate DURING training — modify the loss function to constrain weight sizes. AIC/BIC are used AFTER training — evaluate and compare already-trained models without changing their weights. Both address overfitting but from different angles: L1/L2 prevent it, AIC/BIC detect it.
Undersampling: what, pro, con?
Remove majority-class examples until balanced. Pro: faster training, less memory. Con: loses potentially useful information from discarded examples.
Oversampling: what, pro, con?
Duplicate minority examples or generate synthetic ones (SMOTE interpolates between minority neighbors). Pro: preserves all original data. Cons: overfitting (model memorizes repeated examples), SMOTE can add noise near decision boundary, slower training, more memory.
When to use undersampling vs. oversampling?
Oversampling when dataset is small (can't afford to lose data). Undersampling when dataset is large and losing majority examples doesn't reduce meaningful coverage.
Image types: uncompressed, lossless, lossy
Uncompressed (.bmp, .ppm): every pixel stored raw, large files. Lossless (.png, .tiff, .gif): compressed (e.g., DEFLATE for PNG, LZW for GIF), perfectly reconstructible. Lossy (.jpg): discards barely-visible details, smaller files, quality loss is permanent.
MNIST dataset specs
60,000 training + 10,000 test images. Each is 28×28 grayscale pixels (784 values flattened). 10 classes (digits 0–9).
What does argmax do?
Returns the index of the largest value in a vector. That index = predicted class. Example: [-1.2, 0.5, 4.2, 2.1] → argmax = 2 (because 4.2 is largest at index 2).
Softmax + NLL worked example
Logits [ln(2), ln(3), ln(5)] → exponentiate: 2, 3, 5 → sum=10 → probabilities: 0.2, 0.3, 0.5. If true class is C: NLL = -ln(0.5) = ln(2) ≈ 0.693.
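The worked example above, verified in code:

```python
import math

logits = [math.log(2), math.log(3), math.log(5)]
exps = [math.exp(z) for z in logits]   # [2.0, 3.0, 5.0]
total = sum(exps)                      # 10.0
probs = [e / total for e in exps]      # [0.2, 0.3, 0.5]
nll = -math.log(probs[2])              # true class is C (index 2)
print(probs, nll)  # nll = ln(2) ≈ 0.693
```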
Mini-batch SGD calculation example
Dataset=5000, batch=100, epochs=10. Batches/epoch = 5000/100 = 50. Total updates = 50 × 10 = 500. If batch=5000: full-batch gradient descent.
cross_validate() in SKLearn
Evaluates multiple metrics at once. Example: scoring=['accuracy', 'roc_auc', 'f1']. Provides a more complete picture of model performance than a single metric.
Abstract Expressionism (lecture topic)
Spontaneous, intuitive art using bold brushstrokes and large canvases (Pollock, Rothko). Connection to the CIA is a historical/cultural topic from lectures.