Conceptual Quiz 4: Machine Learning

Last updated 6:38 PM on 4/9/26
60 Terms

1

Why do we use logs when working with probabilities?

Multiplying many small probabilities causes floating-point underflow (computer rounds to zero). Logs convert multiplication to addition: ln(a×b) = ln(a) + ln(b), keeping values manageable. That's why loss functions use log probabilities.
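
A quick plain-Python sketch (illustrative, not from the lecture code) showing the underflow and the log fix:

```python
import math

# 1000 independent events, each with probability 0.01
probs = [0.01] * 1000

# Naive product underflows: 0.01^1000 = 1e-2000 is far below
# the smallest representable float64, so it rounds to zero
product = 1.0
for p in probs:
    product *= p
print(product)  # 0.0 (underflow)

# Sum of logs stays a perfectly ordinary number
log_product = sum(math.log(p) for p in probs)
print(log_product)  # -4605.17..., i.e. 1000 * ln(0.01)
```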

2

Key log/exponent values to memorize

ln(1)=0, ln(2)≈0.693, ln(5)≈1.609, e^0=1, ln(e^x)=x, e^ln(x)=x

3

What does the sigmoid function do?

f(x) = 1/(1+e^(-x)). Maps any real number to a value between 0 and 1 (S-shaped curve). Very negative → near 0, very positive → near 1, zero → exactly 0.5. Used in logistic regression for binary classification.
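
The formula above, written out as a minimal stdlib-only sketch:

```python
import math

def sigmoid(x: float) -> float:
    """Map any real number into (0, 1) along an S-shaped curve."""
    return 1.0 / (1.0 + math.exp(-x))

print(sigmoid(0))    # 0.5, exactly the midpoint
print(sigmoid(10))   # near 1 for very positive inputs
print(sigmoid(-10))  # near 0 for very negative inputs
```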

4

What does softmax do?

Exponentiates each logit (making everything positive), then divides by the sum. Output: every value between 0 and 1, all summing to 1 — a valid probability distribution over classes. Used for multi-class classification.
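
A stdlib-only sketch of the recipe above (the max-subtraction line is a standard numerical-stability trick, not part of the card):

```python
import math

def softmax(logits):
    """Turn raw logits into a probability distribution."""
    # Subtracting the max avoids overflow in exp(); it changes nothing
    # else because softmax is invariant to shifting all logits equally.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs)       # every value strictly between 0 and 1
print(sum(probs))  # 1.0, a valid distribution over classes
```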

5

What are logits?

Raw, unbounded numbers output by a neural network's final layer. They have no probabilistic meaning until converted by sigmoid (binary) or softmax (multi-class).

6

Why does CrossEntropyLoss take raw logits, not probabilities?

CrossEntropyLoss applies softmax internally. If you apply softmax yourself first, you're softmaxing twice, which squashes gradients and hurts training.

7

Why use one-hot encoding instead of integer labels?

Integers (Apple=1, Banana=2, Cherry=3) imply false numerical relationships — the model might think Cherry is 3× Apple or Banana is halfway between. One-hot encoding ([1,0,0], [0,1,0], [0,0,1]) treats classes as independent dimensions with no ordering. Pairs naturally with softmax and cross-entropy loss.
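
A tiny sketch of the encoding (using 0-indexed classes, the usual convention in code, rather than the card's Apple=1 numbering):

```python
def one_hot(label: int, num_classes: int) -> list[int]:
    """Encode a class index as a vector with a single 1."""
    vec = [0] * num_classes
    vec[label] = 1
    return vec

# Apple=0, Banana=1, Cherry=2: each class gets its own dimension,
# so no class is "bigger" than or "between" any others.
print(one_hot(0, 3))  # [1, 0, 0]
print(one_hot(2, 3))  # [0, 0, 1]
```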

8

What is NLL and why is it better than MSE for classification?

NLL = -ln(p_true). When the model is confidently wrong (p_true near 0), NLL's gradient becomes enormous, forcing fast correction. MSE's gradient stays mild in that situation, so learning is slower. Minimizing NLL = maximizing likelihood of correct labels.
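
The gradient claim can be checked directly. Treating the predicted probability itself as the variable (a simplification that skips the full chain rule through the network):

```python
def nll_grad(p_true: float) -> float:
    # d/dp of -ln(p) is -1/p: explodes as p approaches 0
    return -1.0 / p_true

def mse_grad(p_true: float) -> float:
    # d/dp of (1 - p)^2 is -2(1 - p): magnitude never exceeds 2
    return -2.0 * (1.0 - p_true)

# Confidently wrong: the true class was given probability 0.001
print(nll_grad(0.001))  # about -1000, an enormous corrective signal
print(mse_grad(0.001))  # about -2, a mild signal, so learning is slow
```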

9

What does nn.CrossEntropyLoss combine?

Softmax + Negative Log Likelihood in one function. Input: raw logits and integer labels. Do NOT apply softmax yourself first.

10

When to use nn.BCELoss instead of nn.CrossEntropyLoss?

For multi-label classification — when items can belong to multiple classes simultaneously, with an independent sigmoid per class. Note nn.BCELoss expects probabilities (apply sigmoid first); nn.BCEWithLogitsLoss takes raw logits and is more numerically stable. CrossEntropyLoss is for single-label multi-class problems.

11

What are the 4 steps of the PyTorch training loop (in order)?

1. optimizer.zero_grad() — clear old gradients. 2. prediction = model(x) — forward pass. 3. loss.backward() — backward pass (compute gradients). 4. optimizer.step() — update parameters. Order is load-bearing.

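
A hand-rolled stand-in for the four steps (plain Python, no autograd: the gradient of squared error for y = w·x is tracked by hand, where PyTorch would use loss.backward()):

```python
# Fit y = w * x to data generated with w_true = 3
data = [(x, 3.0 * x) for x in [1.0, 2.0, 3.0, 4.0]]
w, lr = 0.0, 0.01

for epoch in range(200):
    grad = 0.0                      # 1. zero_grad: clear old gradients
    for x, y in data:
        pred = w * x                # 2. forward pass
        # 3. backward: d/dw of (pred - y)^2, summed over the batch
        grad += 2.0 * (pred - y) * x
    w -= lr * grad                  # 4. step: update parameters

print(round(w, 4))  # 3.0 (converged to the true weight)
```

Skipping step 1 here would let grad carry over between epochs, the same divergence described in the next card.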
12

What happens if you skip optimizer.zero_grad()?

Gradients accumulate across batches (PyTorch default behavior). Parameter updates become nonsensically large and loss explodes/diverges.

13

What three methods must a custom PyTorch Dataset implement?

__init__: set up data (e.g., convert numpy arrays to tensors). __len__: return the number of data entries. __getitem__: return the (feature, label) pair at a given index.

14

What two methods must a custom nn.Module implement?

__init__(): set up layers (e.g., nn.Linear). forward(): define how data flows through the layers and return the prediction. PyTorch auto-tracks all parameters for the optimizer.

15

What does nn.Linear(784, 10) do?

An affine transformation (matrix multiplication plus a learned bias) mapping a 784-dimensional input (flattened 28×28 image) to 10 output values (one per class).

16

SGD vs. Adam: what's the difference?

SGD: same learning rate for every parameter. Simple, memory-efficient, needs careful LR tuning. Adam: tracks per-parameter running averages of gradient mean (momentum) and gradient magnitude (adaptive scaling). Less sensitive to LR choice, converges faster, but uses ~3× memory and can sometimes generalize worse.

17

When to use Adam vs. SGD?

Adam: default choice for fast convergence, prototyping, NLP, sparse data. SGD: when you can afford careful LR tuning/scheduling; sometimes generalizes better for vision models.

18

What does momentum do in optimization?

Each update = weighted combination of current gradient + previous update vector. If gradients point the same direction for several steps, effective step size grows. Accelerates convergence, reduces oscillation/noise, helps escape small local minima.

19

Downsides of momentum?

Accumulated velocity can overshoot past a minimum. Adds a hyperparameter (coefficient, typically 0.9). Already built into Adam.

20

Epoch vs. Batch — definitions

Epoch: one full pass over the entire training dataset. Batch: a subset processed in one forward/backward pass.

21

How to calculate total parameter updates

Updates per epoch = dataset_size / batch_size. Total updates = (dataset_size / batch_size) × num_epochs. Example: 5000 samples, batch=100, 10 epochs → 50 × 10 = 500 updates.

22

Batch size = 1 vs. small batch vs. full dataset

Batch=1 (Stochastic GD): very noisy, slow. Small batch (e.g., 100, Mini-Batch SGD): balance of speed and noise, most common. Batch=dataset (Full-Batch GD): smooth gradients but expensive and memory-heavy.

23

What is backtracking line search?

After computing gradient direction, start with a large step size α, evaluate loss, shrink α (e.g., ×0.5) if loss didn't decrease enough, repeat. Used in classical optimization, not deep learning (too expensive per iteration).

24

What is the Armijo condition?

The actual loss decrease must be at least a fraction c₁ of what the gradient predicted (sufficient decrease). It's the criterion backtracking line search checks at each try. Also called the 1st Wolfe condition.
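
The two cards above combine into one short routine. A 1-D sketch (in n dimensions the products below become dot products; c1, shrink, and alpha0 are typical illustrative defaults, not values from the lecture):

```python
def backtracking_line_search(f, grad_f, x, c1=1e-4, shrink=0.5, alpha0=1.0):
    """Shrink the step size until the Armijo (sufficient decrease) holds:
    f(x + alpha*p) <= f(x) + c1 * alpha * grad_f(x) * p."""
    g = grad_f(x)
    p = -g                   # steepest-descent direction
    alpha = alpha0
    while f(x + alpha * p) > f(x) + c1 * alpha * g * p:
        alpha *= shrink      # not enough decrease: halve the step and retry
    return alpha

# Minimize f(x) = x^2 starting from x = 5: a full step of 1.0 overshoots
# to x = -5 (no decrease), so the search halves it once and accepts 0.5.
f = lambda x: x * x
grad_f = lambda x: 2 * x
alpha = backtracking_line_search(f, grad_f, 5.0)
print(alpha)  # 0.5
```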

25

What does logistic regression do differently from linear regression?

Wraps the linear output in a sigmoid, bounding predictions to (0, 1). Still a linear model (straight decision boundary) but outputs are valid probabilities. SKLearn: SGDClassifier(loss='log_loss').

26

How do SVMs work?

Find the hyperplane that maximizes the margin (distance to nearest points of each class). Those nearest points = support vectors. Kernel trick enables non-linear boundaries by implicitly mapping to higher dimensions.

27

When to use SVMs and what are their downsides?

Use for: small-to-medium datasets, high-dimensional data, text classification. Downsides: training scales O(n²)–O(n³), impractical for large datasets. No native probability output. Sensitive to feature scaling. Multi-class needs wrappers.

28

How do decision trees work?

Recursively split feature space by choosing the feature and threshold that best separates classes (Gini impurity or information gain) at each node. Prediction: follow yes/no splits to a leaf.
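
The Gini impurity mentioned above is easy to compute by hand; a minimal sketch:

```python
def gini(labels):
    """Gini impurity: chance two random draws from the node disagree."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

print(gini(["A", "A", "A", "A"]))  # 0.0, a pure node (ideal leaf)
print(gini(["A", "A", "B", "B"]))  # 0.5, maximally mixed for 2 classes
# A candidate split is scored by the size-weighted Gini of its children;
# the tree greedily picks the split that lowers it most.
```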

29

Decision trees: strengths and weaknesses?

Strengths: interpretable (readable rules), fast, handles mixed features. Weaknesses: high variance (small data changes → different trees), overfit easily. Mitigate with max_depth, min samples per leaf, pruning. Usually outperformed by ensembles.

30

How do Random Forests work?

Train N decision trees independently on random bootstrap samples; each split uses a random feature subset. Combine via majority vote (bagging). Reduces variance/overfitting.

31

Random Forests: when to use and downsides?

Use: strong default for tabular data, minimal tuning needed, handles mixed features. Downsides: more memory, slower prediction than single tree, less interpretable. Doesn't reduce bias — if trees are systematically wrong the same way, forest will be too.

32

How does Boosting (XGBoost) work?

Builds shallow trees sequentially. Tree 1 fits data. Tree 2 fits residual errors of tree 1. Tree 3 fits residual errors of trees 1+2. Final prediction = weighted sum of all trees' outputs.

33

Boosting: when to use and downsides?

Use: maximum accuracy on tabular data, dominates competitions. Downsides: sensitive to outliers (keeps chasing errors), more prone to overfitting than RF, sequential (can't parallelize), more hyperparameters to tune.

34

Random Forests vs. Boosting — key differences

RF: trees built independently in parallel, combined by majority vote, reduces variance. Boosting: trees built sequentially (each corrects previous errors), combined by weighted sum, reduces bias. RF is more robust to outliers; Boosting is more accurate but riskier.

35

What is a GMM?

Gaussian Mixture Model: models data as coming from K overlapping Gaussian distributions, each with its own mean, covariance, and mixing weight. It's a generative model — can synthesize new data by sampling from the learned distribution.

36

What happens in the E-step of EM for a GMM?

Expectation step: given current Gaussian parameters, compute the probability each data point belongs to each Gaussian component (soft assignments — e.g., a point might be 70% cluster A, 30% cluster B).

37

What happens in the M-step of EM for a GMM?

Maximization step: given soft assignments from the E-step, update each Gaussian's mean and covariance as a weighted average of the data points. Then repeat E-step. Guaranteed to improve or stay the same each iteration.
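
The E/M alternation from the two cards above, as a simplified 1-D sketch (variances and mixing weights held fixed for brevity; the full M-step updates those too):

```python
import math

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Two well-separated 1-D clusters, with deliberately bad initial means
data = [0.9, 1.0, 1.1, 4.9, 5.0, 5.1]
mu = [0.0, 6.0]
var = [1.0, 1.0]
weight = [0.5, 0.5]

for _ in range(20):
    # E-step: soft assignment (responsibility) of each point to each Gaussian
    resp = []
    for x in data:
        p = [weight[k] * normal_pdf(x, mu[k], var[k]) for k in range(2)]
        s = sum(p)
        resp.append([pk / s for pk in p])
    # M-step: each mean becomes a responsibility-weighted average of the data
    for k in range(2):
        total = sum(r[k] for r in resp)
        mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / total

print([round(m, 2) for m in mu])  # means settle near the true centers 1 and 5
```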

38

What is image quantization / posterization with a GMM?

Fit GMM (K=8) to all pixels (each pixel = 3D RGB point). Each Gaussian captures a color cluster. Replace every pixel with its cluster's mean color. Reduces millions of colors to K colors = compression + flat poster-like artistic effect.

39

What does PCA do before a GMM?

Finds directions of greatest variance, projects data onto fewer dimensions while preserving most information. PCA(0.99) keeps 99% of variance. Reduces a 784-dim image to ~100–200 dims, making GMM fitting faster and more stable.

40

What does whiten=True do in PCA?

Rescales each principal component to unit variance so no single direction dominates the GMM's distance calculations. All directions contribute equally to fitting.

41

Downsides of PCA?

Linear — misses non-linear structure. Variance threshold is somewhat arbitrary. Principal components are linear combos of all features, so individual feature interpretability is lost.

42

ROC curve vs. Precision-Recall curve

ROC: TPR vs FPR, good for balanced data. Can be misleading on imbalanced data (high TNs hide low precision). PR: Precision vs Recall, focuses on the positive class, much more informative for imbalanced data. Rule: balanced → ROC, imbalanced → PR.

43

What does a confusion matrix show?

Grid with true classes as rows, predicted classes as columns. Diagonal = correct predictions. Off-diagonal = specific confusions (e.g., model calls 7s "1s"). Shows not just how often the model is wrong, but how it's wrong.

44

What is L1 (Lasso) regularization?

Adds sum of absolute weight values to the loss during training. Pushes small weights to exactly zero → sparse models, automatic feature selection. Downside: non-differentiable at zero (corners in loss surface).

45

What is L2 (Ridge) regularization?

Adds sum of squared weight values to the loss during training. Shrinks all weights toward zero but rarely to exactly zero → smoother, more stable models. Downside: doesn't eliminate irrelevant features.

46

When to use L1 vs. L2?

L1: when many features are likely irrelevant (want sparsity/feature selection). L2: when most features contribute somewhat (want to prevent any single weight from dominating).
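
The two penalty terms from the cards above, as plain functions added onto a data loss (lam is the regularization strength, a hyperparameter; 0.1 below is just for illustration):

```python
def l1_penalty(weights, lam):
    """Lasso: lam * sum of absolute values; its gradient has constant
    magnitude, so it can push small weights all the way to zero."""
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam):
    """Ridge: lam * sum of squares; its gradient shrinks with the weight,
    so weights approach zero but rarely reach it."""
    return lam * sum(w * w for w in weights)

w = [0.5, -2.0, 0.0, 3.0]
# Regularized loss = data_loss + penalty; only the penalty term differs.
print(l1_penalty(w, 0.1))  # 0.1 * (0.5 + 2.0 + 0.0 + 3.0) = 0.55
print(l2_penalty(w, 0.1))  # 0.1 * (0.25 + 4.0 + 0.0 + 9.0) = 1.325
```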

47

What is AIC?

Akaike Information Criterion. AIC = 2k - 2·ln(L), where k = number of parameters, L = maximum likelihood. Penalty term is 2k (fixed, doesn't depend on dataset size). Lower = better. Used after training to compare models.

48

What is BIC?

Bayesian Information Criterion. BIC = k·ln(n) - 2·ln(L), where k = number of parameters, n = number of data points, L = maximum likelihood. Penalty term is k·ln(n). Lower = better. Used after training to compare models.

49

AIC vs. BIC — key difference

BIC's penalty grows with dataset size (k·ln(n)), so for large datasets BIC penalizes complexity more heavily and selects simpler models. AIC's penalty (2k) is fixed regardless of dataset size.
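
The formulas from the three cards above, computed directly (the numbers for k and log-likelihood are made up for illustration):

```python
import math

def aic(k, log_likelihood):
    # Penalty 2k is fixed, regardless of dataset size
    return 2 * k - 2 * log_likelihood

def bic(k, n, log_likelihood):
    # Penalty k * ln(n) grows with dataset size
    return k * math.log(n) - 2 * log_likelihood

# Same fit quality and parameter count, growing dataset:
k, logL = 10, -500.0
print(aic(k, logL))          # 1020.0, unchanged by n
print(bic(k, 100, logL))     # about 1046, modest extra penalty
print(bic(k, 100000, logL))  # about 1115, complexity costs more at scale
```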

50

AIC/BIC vs. L1/L2 — fundamental difference

L1/L2 operate DURING training — modify the loss function to constrain weight sizes. AIC/BIC are used AFTER training — evaluate and compare already-trained models without changing their weights. Both address overfitting but from different angles: L1/L2 prevent it, AIC/BIC detect it.

51

Undersampling: what, pro, con?

Remove majority-class examples until balanced. Pro: faster training, less memory. Con: loses potentially useful information from discarded examples.

52

Oversampling: what, pro, con?

Duplicate minority examples or generate synthetic ones (SMOTE interpolates between minority neighbors). Pro: preserves all original data. Cons: overfitting (model memorizes repeated examples), SMOTE can add noise near decision boundary, slower training, more memory.

53

When to use undersampling vs. oversampling?

Oversampling when dataset is small (can't afford to lose data). Undersampling when dataset is large and losing majority examples doesn't reduce meaningful coverage.

54

Image types: uncompressed, lossless, lossy

Uncompressed (.bmp, .ppm): every pixel stored raw, large files. Lossless (.png, .tiff, .gif): compressed (DEFLATE for PNG, LZW for GIF), perfectly reconstructible. Lossy (.jpg): discards barely-visible details, smaller files, quality loss is permanent.

55

MNIST dataset specs

60,000 training + 10,000 test images. Each is 28×28 grayscale pixels (784 values flattened). 10 classes (digits 0–9).

56

What does argmax do?

Returns the index of the largest value in a vector. That index = predicted class. Example: [-1.2, 0.5, 4.2, 2.1] → argmax = 2 (because 4.2 is largest at index 2).

57

Softmax + NLL worked example

Logits [ln(2), ln(3), ln(5)] → exponentiate: 2, 3, 5 → sum=10 → probabilities: 0.2, 0.3, 0.5. If true class is C: NLL = -ln(0.5) = ln(2) ≈ 0.693.
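
The worked example above, verified in a few lines of stdlib Python:

```python
import math

logits = [math.log(2), math.log(3), math.log(5)]
exps = [math.exp(z) for z in logits]  # recovers 2.0, 3.0, 5.0
total = sum(exps)                     # 10.0
probs = [e / total for e in exps]     # 0.2, 0.3, 0.5

nll = -math.log(probs[2])             # true class is C (index 2)
print([round(p, 1) for p in probs])   # [0.2, 0.3, 0.5]
print(round(nll, 3))                  # 0.693, which is ln(2)
```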

58

Mini-batch SGD calculation example

Dataset=5000, batch=100, epochs=10. Batches/epoch = 5000/100 = 50. Total updates = 50 × 10 = 500. If batch=5000: full-batch gradient descent.

59

cross_validate() in SKLearn

Evaluates multiple metrics at once. Example: scoring=['accuracy', 'roc_auc', 'f1']. Provides a more complete picture of model performance than a single metric.

60

Abstract Expressionism (lecture topic)

Spontaneous, intuitive art using bold brushstrokes and large canvases (Pollock, Rothko). Connection to the CIA is a historical/cultural topic from lectures.