Sigmoid Function
f(x) = 1 / (1 + e^(-x)). Maps any real number to a value between 0 and 1 (S-shaped curve). Used in logistic regression for binary classification.
Softmax Function
Exponentiates each logit and normalizes so all outputs sum to 1.0, producing a valid probability distribution over classes. Formula: softmax(z_i) = e^(z_i) / Σe^(z_j)
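Both activations can be sketched in a few lines of plain Python (function names are illustrative; for very large logits, subtract max(logits) before exponentiating for numerical stability):

```python
import math

def sigmoid(x: float) -> float:
    """Map a real number to (0, 1) along the S-shaped curve."""
    return 1.0 / (1.0 + math.exp(-x))

def softmax(logits: list[float]) -> list[float]:
    """Exponentiate each logit and normalize so the outputs sum to 1."""
    exps = [math.exp(z) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]
```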
Relationship between logs and exponents
ln(e^x) = x and e^(ln(x)) = x. Logs convert multiplication to addition, avoiding underflow with small probabilities.
Key ln values to memorize
ln(1) = 0, ln(2) ≈ 0.693, ln(5) ≈ 1.609, e^0 = 1
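These values can be spot-checked with Python's math module:

```python
import math

# Spot-check the values worth memorizing.
assert abs(math.log(1) - 0.0) < 1e-12   # ln(1) = 0
assert abs(math.log(2) - 0.693) < 1e-3  # ln(2) ≈ 0.693
assert abs(math.log(5) - 1.609) < 1e-3  # ln(5) ≈ 1.609
assert math.exp(0) == 1.0               # e^0 = 1
# Logs turn products into sums: ln(2) + ln(5) = ln(10)
assert abs(math.log(2) + math.log(5) - math.log(10)) < 1e-12
```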
MNIST dataset size
60,000 training images and 10,000 test images. Each image is a 28×28 grayscale pixel grid.
PyTorch training loop (4 steps in order)
1) optimizer.zero_grad() — clear old gradients. 2) prediction = model(x) — forward pass. 3) loss.backward() — compute gradients. 4) optimizer.step() — update parameters.
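The four steps as a minimal runnable sketch, assuming a toy linear model and random data (all names and values here are illustrative):

```python
import torch
import torch.nn as nn

# Toy setup: a linear model, an SGD optimizer, and dummy data.
model = nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(8, 4)          # batch of 8 examples, 4 features each
y = torch.randint(0, 2, (8,))  # integer class labels

for _ in range(5):
    optimizer.zero_grad()      # 1) clear old gradients
    logits = model(x)          # 2) forward pass
    loss = loss_fn(logits, y)  #    raw logits go straight into the loss
    loss.backward()            # 3) compute gradients
    optimizer.step()           # 4) update parameters
```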
optimizer.zero_grad()
Clears old gradients so they don't accumulate across batches. Without it, gradients from previous batches pile up, producing oversized updates that can make the loss explode/diverge.
loss.backward()
Computes the gradients of the loss with respect to all model parameters (backpropagation).
optimizer.step()
Updates model parameters using the gradients computed by loss.backward().
torch.utils.data.Dataset
Base class for custom datasets. Must implement __init__ (set up the data), __len__ (number of entries), and __getitem__ (return the feature and label at a given index).
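A toy sketch of the three required methods (class name and data are hypothetical):

```python
import torch
from torch.utils.data import Dataset

class PairDataset(Dataset):
    """Toy dataset wrapping parallel lists of features and labels."""

    def __init__(self, features, labels):
        self.features = features           # setup: store the data
        self.labels = labels

    def __len__(self):
        return len(self.features)          # number of entries

    def __getitem__(self, idx):
        # return the (feature, label) pair at the given index
        return self.features[idx], self.labels[idx]

ds = PairDataset([torch.tensor([0.0]), torch.tensor([1.0])], [0, 1])
```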
nn.Module
Base class for all PyTorch neural networks. Registers weights automatically. Requires __init__() to set up layers and forward() to process input and return a prediction.
nn.Linear
Implements a linear (affine) layer: matrix multiplication plus a bias term. E.g., nn.Linear(784, 10) maps a flattened 28×28 image to 10 output classes.
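A minimal sketch combining nn.Module and nn.Linear for the 784 → 10 MNIST shape (class name is illustrative):

```python
import torch
import torch.nn as nn

class MNISTClassifier(nn.Module):
    """Minimal classifier: flattened 28x28 image -> 10 class logits."""

    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(784, 10)  # weight matrix (10x784) plus bias

    def forward(self, x):
        return self.layer(x)             # raw logits; no softmax here

model = MNISTClassifier()
logits = model(torch.randn(1, 784))      # one fake flattened image
```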
torch.optim.SGD
Standard stochastic gradient descent optimizer.
torch.optim.Adam
Optimizer that combines momentum with adaptive per-parameter learning rates. Generally better than vanilla SGD.
Logits vs. Probabilities
Models output raw scores called logits. Sigmoid (binary) or Softmax (multi-class) converts these to probabilities.
One-Hot Encoding
Represents classes as binary vectors (e.g., the second of 4 classes = [0,1,0,0]). Prevents the model from assuming a numeric ordering between categories. Works naturally with Softmax and Cross-Entropy Loss.
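A minimal sketch with 0-indexed classes (function name is illustrative):

```python
def one_hot(index: int, num_classes: int) -> list[int]:
    """Binary vector with a 1 at the class index and 0 elsewhere."""
    return [1 if i == index else 0 for i in range(num_classes)]
```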
Negative Log Likelihood (NLL)
NLL = -ln(p_true). Measures how well predicted probability matches the true label. Produces very large gradients when p_true is near 0, driving faster correction. Minimizing NLL = maximizing likelihood.
nn.CrossEntropyLoss
Combines Softmax + NLL into one PyTorch function. Takes raw logits as input — do NOT apply softmax yourself first.
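A small check that CrossEntropyLoss on raw logits matches the explicit log-softmax + NLL decomposition (the logit values are arbitrary):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

logits = torch.tensor([[0.2, 1.5, -0.3]])  # raw scores, no softmax applied
target = torch.tensor([1])                 # true class index

ce = nn.CrossEntropyLoss()(logits, target)
# Equivalent decomposition: log-softmax followed by NLL.
manual = F.nll_loss(F.log_softmax(logits, dim=1), target)
```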
nn.BCELoss
Binary Cross Entropy Loss. Used for binary classification and for multi-label problems where an item can belong to several classes simultaneously. Expects probabilities (apply sigmoid first); nn.BCEWithLogitsLoss takes raw logits instead.
Epoch
One full pass over the entire training dataset.
Batch
A subset of the dataset processed simultaneously in one forward/backward pass.
Mini-Batch SGD update formula
Total parameter updates = (dataset_size / batch_size) × num_epochs
Full-Batch Gradient Descent
When batch size equals the entire dataset, the model uses all data for each update. This is full-batch gradient descent.
Momentum
Adds a fraction of the previous update step to the current gradient. Accelerates convergence, reduces oscillation/noise, and helps escape small local minima.
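A scalar sketch of one momentum update (hypothetical values; beta is the momentum coefficient):

```python
def momentum_step(w, grad, velocity, lr=0.1, beta=0.9):
    """One SGD-with-momentum update on a scalar parameter.

    The velocity keeps a fraction `beta` of the previous step, so
    consistent gradients accelerate while oscillating ones cancel out.
    """
    velocity = beta * velocity + grad
    w = w - lr * velocity
    return w, velocity
```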
Adam Optimizer
Improves standard SGD by combining momentum with adaptive per-parameter learning rates.
Backtracking Line Search
Starts with a large step size and iteratively shrinks it until a sufficient-decrease criterion is met, so each accepted step makes real progress without the cost of an exact line search.
Armijo Condition (1st Wolfe Condition)
Ensures the chosen step size provides "sufficient decrease" in the objective function.
Logistic Regression
Uses sigmoid to map logits to probabilities for binary classification. Keeps predictions bounded between 0 and 1. SKLearn: SGDClassifier(loss='log_loss')
Support Vector Machine (SVM / SVC)
Finds the optimal hyperplane that maximizes the margin of separation between two classes. Can use the kernel trick for non-linear data. SKLearn: SVC(kernel='linear', probability=True)
Decision Trees
Non-linear models that split data based on sequential thresholds. Intuitive but prone to overfitting without pruning or depth limits. SKLearn: DecisionTreeClassifier(max_depth=10)
Random Forests
Ensemble method that builds many decision trees independently and in parallel (bagging), then combines predictions via majority voting. Reduces variance and overfitting.
Boosting (e.g., XGBoost)
Ensemble method that builds trees sequentially, where each new tree focuses on correcting errors made by previous trees. Extremely accurate but sensitive to outliers.
Random Forests vs. Boosting
Random Forests: trees built independently in parallel. Boosting: trees built sequentially, each correcting previous errors.
GMMs as Generative Models
Can synthesize new images/data by sampling from a learned probability distribution of pixels.
PCA before GMMs
PCA(0.99, whiten=True) keeps enough components for 99% of variance and normalizes variance in all directions. Reduces dimensionality before fitting GMM.
EM Algorithm — E-step (Expectation)
Compute the probability that each data point belongs to each Gaussian component (soft assignments).
EM Algorithm — M-step (Maximization)
Update Gaussian parameters (means, covariances) using the soft assignments from the E-step.
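A one-dimensional E-step sketch in plain Python (hypothetical two-component setup); the M-step would then re-estimate each mean as a responsibility-weighted average of the points:

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian at x."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def e_step(x, means, variances, weights):
    """Soft assignments: P(component k | x) for each Gaussian component."""
    likelihoods = [w * gaussian_pdf(x, m, v)
                   for w, m, v in zip(weights, means, variances)]
    total = sum(likelihoods)
    return [l / total for l in likelihoods]
```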
Image Quantization / Posterization with GMM
Fit GMM (e.g., K=8) to all pixels, replace each pixel with its cluster mean. Reduces number of colors = image compression. The artistic effect is called posterization.
ROC Curve
Plots True Positive Rate vs. False Positive Rate. Can be misleading on imbalanced datasets.
Precision-Recall (PR) Curve
Plots Precision vs. Recall. More sensitive and informative than ROC on imbalanced datasets.
ROC vs PR on imbalanced data
ROC can be misleading because high true negatives hide low precision. PR curves are preferred for imbalanced datasets.
Confusion Matrix
Visual grid showing which classes are predicted correctly and which are confused for one another.
cross_validate()
SKLearn function that evaluates multiple scoring metrics at once, e.g., scoring=['accuracy', 'roc_auc', 'f1'].
L1/L2 Regularization
Directly penalize large weight values inside the loss function during training. L1 (sum of |w|) encourages sparse weights; L2 (sum of w²) shrinks all weights toward zero.
AIC / BIC
Measure goodness-of-fit with a built-in penalty for model complexity (number of parameters). Lower scores = better. Evaluate models after training, unlike L1/L2 which constrain during training.
Oversampling
Duplicates or generates synthetic minority class examples (e.g., SMOTE). Risk: overfitting if same examples repeated too often; uses more memory; slower training.
Undersampling
Randomly deletes samples from the majority class. Trains faster but loses information.
Argmax for predictions
torch.argmax(logits) returns the index of the largest value = predicted class. E.g., [-1.2, 0.5, 4.2, 2.1] → predicted class = 2.
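The same example in code:

```python
import torch

logits = torch.tensor([-1.2, 0.5, 4.2, 2.1])
predicted_class = torch.argmax(logits).item()  # index of the largest logit
```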
Abstract Expressionism
Spontaneous, intuitive creation using bold brushstrokes and large canvases (Pollock, Rothko). Has a historical connection to the CIA.
Image pixel structure
Images are structured grids of pixel intensity values. Color = RGB (3 channels), grayscale = 1 channel.
Uncompressed image formats
.bmp, .ppm — store every pixel; large file sizes.
Lossless compression formats
.png, .tiff, .gif — compressed losslessly (e.g., DEFLATE for PNG, LZW for GIF/TIFF) and perfectly reconstructible.
Lossy compression formats
.jpg — smaller files, but some quality is permanently lost through quantization of frequency-domain (DCT) coefficients.
Softmax+NLL example: z=[ln2,ln3,ln5], true=C
e^z = (2,3,5), sum=10. Probs: A=0.2, B=0.3, C=0.5. NLL = -ln(0.5) = ln(2) ≈ 0.693
Softmax+NLL example: z=[ln2,0,ln7], true=A
e^z = (2,1,7), sum=10. Probs: A=0.2, B=0.1, C=0.7. NLL = -ln(0.2) = ln(5) ≈ 1.609
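Both worked examples can be verified in plain Python (helper name is illustrative):

```python
import math

def softmax_nll(logits, true_index):
    """Return (softmax probabilities, NLL of the true class)."""
    exps = [math.exp(z) for z in logits]
    probs = [e / sum(exps) for e in exps]
    return probs, -math.log(probs[true_index])

# Example 1: z = [ln2, ln3, ln5], true class C (index 2)
probs1, nll1 = softmax_nll([math.log(2), math.log(3), math.log(5)], 2)
# Example 2: z = [ln2, 0, ln7], true class A (index 0)
probs2, nll2 = softmax_nll([math.log(2), 0.0, math.log(7)], 0)
```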
SGD calc: 5000 images, batch=100, 10 epochs
Batches/epoch = 50. Total updates = 50 × 10 = 500.
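The same arithmetic as a helper function (name is illustrative; assumes batch size divides the dataset evenly):

```python
def total_updates(dataset_size: int, batch_size: int, num_epochs: int) -> int:
    """Total parameter updates = batches per epoch x number of epochs."""
    return (dataset_size // batch_size) * num_epochs
```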