Chapter 6
What is true of linear models with more terms (predictors/coefficients)?
They are more flexible and will therefore have lower training RSS and higher training R²
Function of Mallow's Cp, AIC, BIC, and adjusted R²?
They are adjustments to the training error that account for models with more terms. They help us balance accuracy and complexity when choosing a good model
Four measures that adjust the training error to account for models with more predictors?
Mallow's Cp, AIC, BIC, and adjusted R²
Mallow's Cp
Cp = (1/n)(RSS + 2dσ̂²); adds a penalty of 2dσ̂² to the RSS to account for the fact that RSS decreases as the number of parameters d increases. Can be seen as an estimate of the test error. Small Cp is ideal
AIC (Akaike Information Criterion)
AIC = (1/(nσ̂²))(RSS + 2dσ̂²); proportional to Mallow's Cp for least squares models. Technically only defined for maximum likelihood models. Can be seen as an estimate of the test error. Small AIC is ideal
BIC (Bayesian Information Criterion)
BIC = (1/n)(RSS + log(n)dσ̂²); BIC applies a harsher penalty than Cp or AIC (since log(n) > 2 for n > 7) and therefore tends to prefer smaller (simpler) models. Can be seen as an estimate of the test error. Small BIC is ideal
Adjusted R²
Adjusted R² = 1 − (RSS/(n − d − 1))/(TSS/(n − 1)); only increases when adding a predictor meaningfully improves model accuracy. Large adjusted R² is ideal
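As a concrete illustration, here is a minimal numpy sketch (synthetic data; variable names are my own) that computes all four criteria for a single OLS fit using the formulas above:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.normal(size=(n, d))
y = X @ np.array([2.0, 0.0, -1.0]) + rng.normal(size=n)

Xd = np.column_stack([np.ones(n), X])           # add an intercept column
coef, *_ = np.linalg.lstsq(Xd, y, rcond=None)
resid = y - Xd @ coef
rss = float(resid @ resid)
tss = float(((y - y.mean()) ** 2).sum())
sigma2 = rss / (n - d - 1)                      # estimate of Var(eps)

cp = (rss + 2 * d * sigma2) / n                 # Mallow's Cp
aic = (rss + 2 * d * sigma2) / (n * sigma2)     # AIC (proportional to Cp here)
bic = (rss + np.log(n) * d * sigma2) / n        # BIC: harsher penalty, log(n) > 2
adj_r2 = 1 - (rss / (n - d - 1)) / (tss / (n - 1))
```

Note that BIC exceeds Cp for the same fit whenever n > 7, which is why it prefers smaller models.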
Cross-Validation (CV) vs model complexity
Also estimates the test error, but by directly measuring held-out accuracy rather than by penalizing complexity. In exchange, it can be applied to a broader range of model types.
Is each adjustment measure (Mallow's Cp, AIC, BIC, adjusted R²) guaranteed to select the same model?
No, but usually won’t choose drastically different models
What are the three subset selection methods?
Best Subset Selection
Forward Stepwise Selection (FSS)
Backward Stepwise Selection (BSS)
What do best subset selection methods do?
Identify the best model of each size (number of predictors in the model), then choose the best size overall
Best Subset Selection (Best SS)
Fits every possible model using every possible combination of the p predictors (2ᵖ models), identifies the best model of each size using RSS or R², then selects the overall best using Mallow's Cp, AIC, BIC, CV, or adjusted R².
Guaranteed to find the best model among all subsets, but computationally very expensive.
Has high DoF: low bias but high variance
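A hedged sketch of Best Subset Selection on a toy problem, assuming only numpy and the standard library; picking the winner with BIC is one illustrative choice among the criteria listed above:

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
n, p = 60, 4
X = rng.normal(size=(n, p))
y = 3 * X[:, 0] - 2 * X[:, 2] + rng.normal(size=n)   # only predictors 0 and 2 matter

def rss_of(cols):
    Xs = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    coef, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    r = y - Xs @ coef
    return float(r @ r)

sigma2 = rss_of(range(p)) / (n - p - 1)              # noise estimate from the full model

# Step 1: best model of each size by RSS (all nonempty subsets of the p predictors)
best_per_size = {k: min(itertools.combinations(range(p), k), key=rss_of)
                 for k in range(1, p + 1)}

# Step 2: pick the overall winner with BIC (Cp, AIC, CV, or adjusted R² also work)
def bic(cols):
    return (rss_of(cols) + np.log(n) * len(cols) * sigma2) / n

best = min(best_per_size.values(), key=bic)
```

With strong signals on predictors 0 and 2, the selected subset contains exactly those predictors.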
Forward Stepwise Selection (FSS)
Starts with the null model M₀ and greedily adds one predictor at a time (always the one that most improves the fit) until all p predictors are included, producing a nested sequence M₀, M₁, …, Mₚ. The best model in the sequence is then chosen using Cp, AIC, BIC, CV, or adjusted R².
Not guaranteed to find the best model, but much more computationally efficient (fits 1 + p(p + 1)/2 models instead of 2ᵖ).
Has high DoF: low bias but high variance
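The greedy path M₀ ⊂ M₁ ⊂ … ⊂ Mₚ can be sketched in a few lines of numpy (synthetic data, illustrative names):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 80, 5
X = rng.normal(size=(n, p))
y = 4 * X[:, 1] + 2 * X[:, 3] + rng.normal(size=n)   # true signal on predictors 1 and 3

def rss_of(cols):
    Xs = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    coef, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    r = y - Xs @ coef
    return float(r @ r)

# Greedily add the predictor that lowers RSS the most at each step
path, chosen = [], []
for _ in range(p):
    cand = [j for j in range(p) if j not in chosen]
    nxt = min(cand, key=lambda j: rss_of(chosen + [j]))
    chosen = chosen + [nxt]
    path.append(tuple(chosen))          # M1, M2, ..., Mp (nested by construction)
```

The best model along `path` would then be picked with Cp, AIC, BIC, CV, or adjusted R², as the card above describes.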
Backward Stepwise Selection (BSS)
Starts with the full model containing all p predictors and removes one at a time (always the one whose removal least worsens the fit). Like FSS, the best model of each size is compared using the selection criteria. Not guaranteed to find the best model, but more computationally efficient. Requires n > p so the full model can be fit.
Would Best SS, FSS, and BSS produce the same models?
No; they are not guaranteed to and usually don't
Function of Ridge Regression and Lasso
When there are many predictors, OLS coefficient estimates can be large and unstable. These methods shrink the coefficients toward zero, making the model more stable and less sensitive to the training data (better test error)
Regularization
constraining model complexity by adding a penalty term to the loss function (RSS). In linear regression, we penalize the size of the coefficients to reduce variance (at the cost of some bias).
Ordinary Least Squares (OLS)
fitting a linear model by just minimizing RSS. Produces unbiased estimates but can have high variance, especially when p is large relative to n.
Ridge Regression
A regularization method that minimizes RSS + λΣβⱼ² (the L2 penalty). Penalizes the magnitude of the coefficients and shrinks them toward zero but never sets any to exactly zero, so all predictors remain in the model. Reduces variance a lot at the cost of a little bias. Requires standardizing the predictors before fitting.
Tuning Parameter (λ) in Ridge Regression
A positive value that controls how strongly the L2 penalty (λΣβᵢ²) is applied during fitting.
When λ = 0, the penalty disappears and the estimates equal OLS: little bias but possibly high variance.
As λ increases, the penalty becomes larger, so coefficient estimates shrink toward zero more (but never exactly to zero), so all predictors remain in the model.
Larger λ = more rigid, less flexible model with higher bias but lower variance.
λ is chosen using cross-validation
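Ridge has a closed-form solution, so the shrinkage effect of λ is easy to see in a short numpy sketch (synthetic data; a fixed grid of λ values stands in for cross-validation):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 50, 10
X = rng.normal(size=(n, p))
X = (X - X.mean(0)) / X.std(0)         # standardize predictors first (required)
y = X @ rng.normal(size=p) + rng.normal(size=n)
y = y - y.mean()                        # center y so no intercept term is needed

def ridge(lam):
    # closed-form ridge solution: (X'X + lam*I)^-1 X'y
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# coefficients shrink toward zero (but never exactly to zero) as lambda grows
norms = [float(np.linalg.norm(ridge(lam))) for lam in (0.0, 1.0, 10.0, 100.0)]
```

At λ = 0 this recovers the OLS fit exactly, and the coefficient norm decreases monotonically as λ increases.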
When might we expect ridge regression (with λ > 0) to do well?
When the OLS estimates have high variance
Lasso (Least Absolute Shrinkage and Selection Operator)
A regularization method that minimizes RSS + λΣ|βⱼ| (the L1 penalty). Like ridge, it shrinks coefficients, but unlike ridge, it sets many coefficients exactly to zero, removing them from the model. It effectively performs automatic variable selection, producing a sparse model (with only a subset of the predictors). Good when p > n, or when OLS estimates have high variance. Predictors must be standardized first. Has low DoF.
Tuning Parameter (λ) in Lasso
A positive value that controls how strongly the L1 penalty (λΣ|βᵢ|) is applied during fitting. When λ = 0, estimates = OLS, and larger λ means more shrinkage, setting many coefficient estimates to exactly zero, removing those predictors from the model entirely.
λ controls the degree of shrinkage and the sparsity of the model (i.e. how many predictors are kept).
Larger λ = fewer predictors in final model.
λ is chosen via cross-validation.
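A minimal coordinate descent sketch of the lasso (standardized predictors, a single fixed λ in place of cross-validation, synthetic data), showing the exact zeros that make the model sparse:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 100, 8
X = rng.normal(size=(n, p))
X = (X - X.mean(0)) / X.std(0)          # unit-variance columns simplify the update
y = 3 * X[:, 0] - 2 * X[:, 4] + rng.normal(size=n)   # only predictors 0 and 4 matter
y = y - y.mean()

def soft(z, g):
    # soft-thresholding operator: shrinks toward zero, sets small values to exactly 0
    return np.sign(z) * max(abs(z) - g, 0.0)

def lasso(lam, sweeps=200):
    # minimize (1/2n)||y - Xb||^2 + lam * ||b||_1 by cyclic coordinate descent
    b = np.zeros(p)
    for _ in range(sweeps):
        for j in range(p):
            r = y - X @ b + X[:, j] * b[j]        # partial residual excluding x_j
            b[j] = soft(X[:, j] @ r / n, lam)     # valid because columns have variance 1
    return b

b = lasso(0.5)
```

With λ = 0.5 the two true predictors survive (shrunk toward zero) and all noise coefficients are set to exactly zero.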
When does Lasso outperform ridge regression?
When there are a lot of predictors, not all of which are important for predicting the response
Standardization (of predictors)
The process of scaling predictor variables so they have mean 0 and standard deviation 1 before fitting. Required for ridge and lasso regression because the penalty terms are sensitive to the scale of the predictors (unlike OLS).
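In numpy, standardization is one line per moment (toy matrix of my own choosing):

```python
import numpy as np

# Columns on very different scales: without standardizing, the penalty
# would punish the small-scale predictor's (necessarily large) coefficient
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])
Z = (X - X.mean(axis=0)) / X.std(axis=0)   # each column: mean 0, sd 1
```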
Dimension Reduction
Transforms the original p predictors into a smaller set of m new predictors (m < p) that are linear combinations of the originals, then fits a linear model on those transformed predictors. Goal is better fit with fewer effective variables.
Two ways to decide which transformations to use to create the new, better predictors?
Principal Component Analysis (PCA)
Partial Least Squares (PLS)
Principal Component Analysis (PCA)
A dimension reduction technique that constructs new predictors (principal components) as linear combinations of the originals, ordered by the amount of variance in the data they explain. The first component captures the most variance; each subsequent component captures the most remaining variance (every component is orthogonal to all previous ones).
Downside to PCA (Principal Component Analysis)
We can lose some interpretability, and we assume that the directions of maximum variation in the data are also the most informative about the response Y
Principal Component
Each new transformed predictor Zᵢ produced by PCA: the direction in predictor space along which the data varies the most (given that all previous components have been accounted for).
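PCA can be sketched with one SVD in numpy (synthetic data with deliberately unequal column scales, so the variance ordering is visible):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200
# four independent predictors with decreasing scales 3, 2, 1, 0.5
X = rng.normal(size=(n, 4)) @ np.diag([3.0, 2.0, 1.0, 0.5])
Xc = X - X.mean(0)                      # center before PCA

# SVD: rows of Vt are the principal directions, ordered by variance explained
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt.T                           # principal component scores Z_1..Z_4
var_explained = s**2 / (s**2).sum()     # decreasing, sums to 1
```

The component scores are mutually orthogonal, and here the first direction loads mostly on the largest-scale column.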
Partial Least Squares (PLS)
A supervised dimension reduction method similar to PCA, but the new predictors are chosen based on both maximizing variance in X and correlation with the response Y. Gives more weight to predictors that are more strongly correlated with Y. Can lose some interpretability. Can reduce bias relative to PCA but may increase variance.
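The first PLS direction can be sketched directly: weight each (standardized) predictor by its covariance with y, so response-relevant predictors dominate the component. A numpy sketch under synthetic data, with names of my own:

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 150, 5
X = rng.normal(size=(n, p))
X = (X - X.mean(0)) / X.std(0)
y = 2 * X[:, 0] + rng.normal(size=n)    # only predictor 0 drives the response
y = y - y.mean()

# First PLS weight vector: proportional to the covariances X'y, so predictors
# strongly correlated with y get the largest weights (unlike PCA, which ignores y)
w = X.T @ y
w = w / np.linalg.norm(w)
z1 = X @ w                              # first PLS component
```

The dominant weight lands on the predictor actually correlated with Y, and the resulting component tracks the response closely.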
Big Data
The dataset is large in some way: lots of observations (large n), lots of predictors (large p), or both. Generally not a problem in itself
High-Dimensional Data
When the number of predictors p is large relative to the number of observations n (p ≈ n, p > n, or p >> n).
This is a problem: OLS breaks down (there is no longer a unique least squares solution when p > n), training R² → 1 and RSS → 0 artificially, and multicollinearity (predictors correlated with each other) is effectively guaranteed.
You can't really ever claim to have found the "best" set of predictors
In general, as the dimension (p) grows, so does the test error
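The artificial fit is easy to demonstrate: in this numpy sketch y is pure noise, yet with p = n − 1 predictors OLS drives training R² to 1:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 30
y = rng.normal(size=n)                    # pure noise: no real signal at all

def train_r2(p):
    X = rng.normal(size=(n, p))           # p completely irrelevant predictors
    Xd = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    r = y - Xd @ coef
    return 1 - float(r @ r) / float(((y - y.mean()) ** 2).sum())

r2_small = train_r2(2)                    # p << n: honest (tiny) training R^2
r2_big = train_r2(n - 1)                  # p = n - 1: perfect in-sample fit
```

The high-dimensional fit interpolates the noise exactly; the low-dimensional fit does not.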
n (size of data) vs p (# of predictors)
For a fixed p, a larger n is always preferred as long as the bigger dataset is of the same quality. But for a fixed n, having a larger p is problematic.
Solution for when p > n
Regularization methods provide a solution when p > n where OLS does not, but those solutions will never be as good as if we had started with the "right" (smaller) set of predictors to begin with
Signal-to-Noise Ratio (SNR)
Defined as Var(f(x)) / Var(ε); the ratio of the true signal variance to the noise variance. High SNR means the true relationship is easy to detect (even bad methods can do well); low SNR means the signal is hard to distinguish from noise.
Percent of Variance Explained (PVE)
Measures how much of the variability in Y is captured by the model estimate f̂
Relaxed Lasso
A weighted average of the lasso coefficients and the OLS-after-lasso coefficients: β̂relaxed = αβ̂lasso + (1 − α)β̂OLS|lasso. The weight α is chosen via cross-validation. Less biased than the pure lasso, less variable than pure OLS.
Degrees of Freedom (DoF)
A measure of model flexibility that captures how much each fitted value depends on its own observed response. More dependence = more DoF. Higher DoF = more flexible/complex model = lower bias, higher variance. Best SS and FSS have high DoF; the lasso's DoF is roughly the number of nonzero coefficients.
OLS-after-Lasso (β̂OLS|Lasso)
When lasso is first used to select which predictors to include, then OLS is fit on only those selected predictors.
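This card and the relaxed lasso card fit together in one short sketch: run the lasso, refit OLS on its support, then blend. All data is synthetic, λ and α are fixed for illustration (both would normally come from cross-validation):

```python
import numpy as np

rng = np.random.default_rng(8)
n, p = 120, 6
X = rng.normal(size=(n, p))
X = (X - X.mean(0)) / X.std(0)
y = 3 * X[:, 1] + rng.normal(size=n)      # only predictor 1 matters
y = y - y.mean()

def soft(z, g):
    return np.sign(z) * max(abs(z) - g, 0.0)

# Step 1: plain lasso by coordinate descent (lambda fixed for illustration)
lam = 0.5
b_lasso = np.zeros(p)
for _ in range(200):
    for j in range(p):
        r = y - X @ b_lasso + X[:, j] * b_lasso[j]   # partial residual
        b_lasso[j] = soft(X[:, j] @ r / n, lam)

# Step 2: OLS refit on the lasso-selected support undoes the shrinkage bias
support = np.flatnonzero(b_lasso)
coef, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
b_ols = np.zeros(p)
b_ols[support] = coef                      # beta_OLS|Lasso

# Step 3: relaxed lasso blends the two estimates
alpha = 0.5
b_relaxed = alpha * b_lasso + (1 - alpha) * b_ols
```

The refit coefficient is larger than the shrunken lasso one, and the relaxed estimate sits between them.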
SNR vs DoF
At low SNRs, procedures with high DoF tend to do poorly (they overfit the noise). At high SNRs, all methods begin to do well, with higher-DoF procedures doing best; lower-DoF methods like the lasso deliberately add bias for stability, which pays off most at low SNR.