Chapter 6
What is true of linear models with more terms (predictors/coefficients)?
They are more flexible and will therefore have lower training RSS and higher training R²
Function of Mallow's Cp, AIC, BIC, and adjusted R²?
They are adjustments to the training error that account for models with more terms. They help us balance accuracy and complexity when choosing a good model
Four measures that adjust the training error to account for models with more predictors?
Mallow's Cp, AIC, BIC, and adjusted R²
Mallow's Cp
Cp = (1/n)(RSS + 2dσ̂²); adds a penalty of 2dσ̂² to the RSS to account for the fact that RSS decreases as the number of parameters d increases. Can be seen as an estimate of the test error. Small Cp is ideal
AIC (Akaike Information Criterion)
AIC = (1/(nσ̂²))(RSS + 2dσ̂²); proportional to Mallow's Cp for least squares models. Technically only defined for maximum likelihood models. Can be seen as an estimate of the test error. Small AIC is ideal
BIC (Bayesian Information Criterion)
BIC = (1/n)(RSS + log(n)dσ̂²); BIC applies a harsher penalty than Cp or AIC (since log(n) > 2 for n > 7) and therefore tends to prefer smaller (simpler) models. Can be seen as an estimate of the test error. Small BIC is ideal
Adjusted R²
Adjusted R² = 1 − (RSS/(n − d − 1))/(TSS/(n − 1)); only increases when adding a predictor meaningfully improves model accuracy. Large adjusted R² is ideal
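As a concrete illustration, here is a minimal numpy sketch (synthetic data; variable names are my own) that computes all four criteria for a single OLS fit using the formulas above:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.normal(size=(n, d))
y = X @ np.array([2.0, 0.0, -1.0]) + rng.normal(size=n)

Xd = np.column_stack([np.ones(n), X])           # add an intercept column
coef, *_ = np.linalg.lstsq(Xd, y, rcond=None)
resid = y - Xd @ coef
rss = float(resid @ resid)
tss = float(((y - y.mean()) ** 2).sum())
sigma2 = rss / (n - d - 1)                      # estimate of Var(eps)

cp = (rss + 2 * d * sigma2) / n                 # Mallow's Cp
aic = (rss + 2 * d * sigma2) / (n * sigma2)     # AIC (proportional to Cp here)
bic = (rss + np.log(n) * d * sigma2) / n        # BIC: harsher penalty, log(n) > 2
adj_r2 = 1 - (rss / (n - d - 1)) / (tss / (n - 1))
```

Note that BIC exceeds Cp for the same fit whenever n > 7, which is why it prefers smaller models.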
Cross-Validation (CV) vs model complexity
Also estimates the test error, but by directly measuring held-out accuracy rather than by penalizing complexity. In exchange, it can be applied to a broader range of model types.
Is each adjustment measure (Mallow's Cp, AIC, BIC, adjusted R²) guaranteed to select the same model?
No, but usually won’t choose drastically different models
What are the three subset selection methods?
Best Subset Selection
Forward Stepwise Selection (FSS)
Backward Stepwise Selection (BSS)
What do best subset selection methods do?
Identify the best model of each size (number of predictors in the model), then choose the best size overall
Best Subset Selection (Best SS)
Fits every possible model using every possible combination of the p predictors (2ᵖ models), identifies the best model of each size using RSS or R², then selects the overall best using Mallow's Cp, AIC, BIC, CV, or adjusted R².
Guaranteed to find the best model among all subsets, but computationally very expensive.
Has high DoF: low bias but high variance
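A hedged sketch of Best Subset Selection on a toy problem, assuming only numpy and the standard library; picking the winner with BIC is one illustrative choice among the criteria listed above:

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
n, p = 60, 4
X = rng.normal(size=(n, p))
y = 3 * X[:, 0] - 2 * X[:, 2] + rng.normal(size=n)   # only predictors 0 and 2 matter

def rss_of(cols):
    Xs = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    coef, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    r = y - Xs @ coef
    return float(r @ r)

sigma2 = rss_of(range(p)) / (n - p - 1)              # noise estimate from the full model

# Step 1: best model of each size by RSS (all nonempty subsets of the p predictors)
best_per_size = {k: min(itertools.combinations(range(p), k), key=rss_of)
                 for k in range(1, p + 1)}

# Step 2: pick the overall winner with BIC (Cp, AIC, CV, or adjusted R² also work)
def bic(cols):
    return (rss_of(cols) + np.log(n) * len(cols) * sigma2) / n

best = min(best_per_size.values(), key=bic)
```

With strong signals on predictors 0 and 2, the selected subset contains exactly those predictors.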
Forward Stepwise Selection (FSS)
Starts with the null model M₀ and greedily adds one predictor at a time (always the one that most improves the fit) until all p predictors are included, producing a nested sequence M₀, M₁, …, Mₚ. The best model in the sequence is then chosen using Cp, AIC, BIC, CV, or adjusted R².
Not guaranteed to find the best model, but much more computationally efficient (fits 1 + p(p + 1)/2 models instead of 2ᵖ).
Has high DoF: low bias but high variance
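The greedy path M₀ ⊂ M₁ ⊂ … ⊂ Mₚ can be sketched in a few lines of numpy (synthetic data, illustrative names):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 80, 5
X = rng.normal(size=(n, p))
y = 4 * X[:, 1] + 2 * X[:, 3] + rng.normal(size=n)   # true signal on predictors 1 and 3

def rss_of(cols):
    Xs = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    coef, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    r = y - Xs @ coef
    return float(r @ r)

# Greedily add the predictor that lowers RSS the most at each step
path, chosen = [], []
for _ in range(p):
    cand = [j for j in range(p) if j not in chosen]
    nxt = min(cand, key=lambda j: rss_of(chosen + [j]))
    chosen = chosen + [nxt]
    path.append(tuple(chosen))          # M1, M2, ..., Mp (nested by construction)
```

The best model along `path` would then be picked with Cp, AIC, BIC, CV, or adjusted R², as the card above describes.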
Backward Stepwise Selection (BSS)
Starts with the full model containing all p predictors and removes one at a time (always the one whose removal least worsens the fit). Like FSS, the best model of each size is compared using the selection criteria. Not guaranteed to find the best model, but more computationally efficient. Requires n > p so the full model can be fit.
Would Best SS, FSS, and BSS produce the same models?
No; they are not guaranteed to and usually don't
Function of Ridge Regression and Lasso
When there are many predictors, OLS coefficient estimates can be large and unstable. These methods shrink the coefficients toward zero, making the model more stable and less sensitive to the training data (better test error)
Regularization
constraining model complexity by adding a penalty term to the loss function (RSS). In linear regression, we penalize the size of the coefficients to reduce variance (at the cost of some bias).
Ordinary Least Squares (OLS)
fitting a linear model by just minimizing RSS. Produces unbiased estimates but can have high variance, especially when p is large relative to n.
Ridge Regression
A regularization method that minimizes RSS + λΣβⱼ² (the L2 penalty). Penalizes the magnitude of the coefficients and shrinks them toward zero but never sets any to exactly zero, so all predictors remain in the model. Reduces variance a lot at the cost of a little bias. Requires standardizing the predictors before fitting.
Tuning Parameter (λ) in Ridge Regression
A positive value that controls how strongly the L2 penalty (λΣβᵢ²) is applied during fitting.
When λ = 0, the penalty disappears and the estimates equal OLS: little bias but possibly high variance.
As λ increases, the penalty becomes larger, so coefficient estimates shrink toward zero more (but never exactly to zero), so all predictors remain in the model.
Larger λ = more rigid, less flexible model with higher bias but lower variance.
λ is chosen using cross-validation
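Ridge has a closed-form solution, so the shrinkage effect of λ is easy to see in a short numpy sketch (synthetic data; a fixed grid of λ values stands in for cross-validation):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 50, 10
X = rng.normal(size=(n, p))
X = (X - X.mean(0)) / X.std(0)         # standardize predictors first (required)
y = X @ rng.normal(size=p) + rng.normal(size=n)
y = y - y.mean()                        # center y so no intercept term is needed

def ridge(lam):
    # closed-form ridge solution: (X'X + lam*I)^-1 X'y
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# coefficients shrink toward zero (but never exactly to zero) as lambda grows
norms = [float(np.linalg.norm(ridge(lam))) for lam in (0.0, 1.0, 10.0, 100.0)]
```

At λ = 0 this recovers the OLS fit exactly, and the coefficient norm decreases monotonically as λ increases.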
When might we expect ridge regression (with λ > 0) to do well?
When the OLS estimates have high variance
Lasso (Least Absolute Shrinkage and Selection Operator)
A regularization method that minimizes RSS + λΣ|βⱼ| (the L1 penalty). Like ridge, it shrinks coefficients, but unlike ridge, it sets many coefficients exactly to zero, removing them from the model. It effectively performs automatic variable selection, producing a sparse model (with only a subset of the predictors). Good when p > n, or when OLS estimates have high variance. Predictors must be standardized first. Has low DoF.
Tuning Parameter (λ) in Lasso
A positive value that controls how strongly the L1 penalty (λΣ|βᵢ|) is applied during fitting. When λ = 0, estimates = OLS, and larger λ means more shrinkage, setting many coefficient estimates to exactly zero, removing those predictors from the model entirely.
λ controls the degree of shrinkage and the sparsity of the model (i.e. how many predictors are kept).
Larger λ = fewer predictors in final model.
λ is chosen via cross-validation.
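A minimal coordinate descent sketch of the lasso (standardized predictors, a single fixed λ in place of cross-validation, synthetic data), showing the exact zeros that make the model sparse:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 100, 8
X = rng.normal(size=(n, p))
X = (X - X.mean(0)) / X.std(0)          # unit-variance columns simplify the update
y = 3 * X[:, 0] - 2 * X[:, 4] + rng.normal(size=n)   # only predictors 0 and 4 matter
y = y - y.mean()

def soft(z, g):
    # soft-thresholding operator: shrinks toward zero, sets small values to exactly 0
    return np.sign(z) * max(abs(z) - g, 0.0)

def lasso(lam, sweeps=200):
    # minimize (1/2n)||y - Xb||^2 + lam * ||b||_1 by cyclic coordinate descent
    b = np.zeros(p)
    for _ in range(sweeps):
        for j in range(p):
            r = y - X @ b + X[:, j] * b[j]        # partial residual excluding x_j
            b[j] = soft(X[:, j] @ r / n, lam)     # valid because columns have variance 1
    return b

b = lasso(0.5)
```

With λ = 0.5 the two true predictors survive (shrunk toward zero) and all noise coefficients are set to exactly zero.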
When does Lasso outperform ridge regression?
When there are a lot of predictors, not all of which are important for predicting the response
Standardization (of predictors)
The process of scaling predictor variables so they have mean 0 and standard deviation 1 before fitting. Required for ridge and lasso regression because the penalty terms are sensitive to the scale of the predictors (unlike OLS).
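In numpy, standardization is one line per moment (toy matrix of my own choosing):

```python
import numpy as np

# Columns on very different scales: without standardizing, the penalty
# would punish the small-scale predictor's (necessarily large) coefficient
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])
Z = (X - X.mean(axis=0)) / X.std(axis=0)   # each column: mean 0, sd 1
```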
Dimension Reduction
Transforms the original p predictors into a smaller set of m new predictors (m < p) that are linear combinations of the originals, then fits a linear model on those transformed predictors. Goal is better fit with fewer effective variables.
Two ways to decide which transformations to use to create the new, better predictors?
Principal Component Analysis (PCA)
Partial Least Squares (PLS)
Principal Component Analysis (PCA)
A dimension reduction technique that constructs new predictors (principal components) as linear combinations of the originals, ordered by the amount of variance in the data they explain. The first component captures the most variance; each subsequent component captures the most remaining variance (every component is orthogonal to all previous ones).
Downside to PCA (Principal Component Analysis)
We can lose some interpretability, and we assume that the directions of maximum variation in the data are also the most informative about the response Y
Principal Component
Each new transformed predictor Zᵢ produced by PCA: the direction in predictor space along which the data varies the most (given that all previous components have been accounted for).
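PCA can be sketched with one SVD in numpy (synthetic data with deliberately unequal column scales, so the variance ordering is visible):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200
# four independent predictors with decreasing scales 3, 2, 1, 0.5
X = rng.normal(size=(n, 4)) @ np.diag([3.0, 2.0, 1.0, 0.5])
Xc = X - X.mean(0)                      # center before PCA

# SVD: rows of Vt are the principal directions, ordered by variance explained
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt.T                           # principal component scores Z_1..Z_4
var_explained = s**2 / (s**2).sum()     # decreasing, sums to 1
```

The component scores are mutually orthogonal, and here the first direction loads mostly on the largest-scale column.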
Partial Least Squares (PLS)
A supervised dimension reduction method similar to PCA, but the new predictors are chosen based on both maximizing variance in X and correlation with the response Y. Gives more weight to predictors that are more strongly correlated with Y. Can lose some interpretability. Can reduce bias relative to PCA but may increase variance.
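The first PLS direction can be sketched directly: weight each (standardized) predictor by its covariance with y, so response-relevant predictors dominate the component. A numpy sketch under synthetic data, with names of my own:

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 150, 5
X = rng.normal(size=(n, p))
X = (X - X.mean(0)) / X.std(0)
y = 2 * X[:, 0] + rng.normal(size=n)    # only predictor 0 drives the response
y = y - y.mean()

# First PLS weight vector: proportional to the covariances X'y, so predictors
# strongly correlated with y get the largest weights (unlike PCA, which ignores y)
w = X.T @ y
w = w / np.linalg.norm(w)
z1 = X @ w                              # first PLS component
```

The dominant weight lands on the predictor actually correlated with Y, and the resulting component tracks the response closely.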
Big Data
The dataset is large in some way: lots of observations (large n), lots of predictors (large p), or both. Generally not a problem in itself
High-Dimensional Data
When the number of predictors p is large relative to the number of observations n (p ≈ n, p > n, or p >> n).
This is a problem: OLS breaks down (there is no longer a unique least squares solution when p > n), training R² → 1 and RSS → 0 artificially, and multicollinearity (predictors correlated with each other) is effectively guaranteed.
You can't really ever claim to have found the "best" set of predictors
In general, as the dimension (p) grows, so does the test error
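The artificial fit is easy to demonstrate: in this numpy sketch y is pure noise, yet with p = n − 1 predictors OLS drives training R² to 1:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 30
y = rng.normal(size=n)                    # pure noise: no real signal at all

def train_r2(p):
    X = rng.normal(size=(n, p))           # p completely irrelevant predictors
    Xd = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    r = y - Xd @ coef
    return 1 - float(r @ r) / float(((y - y.mean()) ** 2).sum())

r2_small = train_r2(2)                    # p << n: honest (tiny) training R^2
r2_big = train_r2(n - 1)                  # p = n - 1: perfect in-sample fit
```

The high-dimensional fit interpolates the noise exactly; the low-dimensional fit does not.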
n (size of data) vs p (# of predictors)
For a fixed p, a larger n is always preferred as long as the bigger dataset is of the same quality. But for a fixed n, having a larger p is problematic.
Solution for when p > n
Regularization methods provide a solution when p > n where OLS does not, but those solutions will never be as good as if we had started with the "right" (smaller) set of predictors to begin with
Signal-to-Noise Ratio (SNR)
Defined as Var(f(x)) / Var(ε); the ratio of the true signal variance to the noise variance. High SNR means the true relationship is easy to detect (even bad methods can do well); low SNR means the signal is hard to distinguish from noise.
Percent of Variance Explained (PVE)
Measures how much of the variability in Y is captured by the model estimate f̂
Relaxed Lasso
A weighted average of the lasso coefficients and the OLS-after-lasso coefficients: β̂relaxed = αβ̂lasso + (1 − α)β̂OLS|lasso. The weight α is chosen via cross-validation. Less biased than the pure lasso, less variable than pure OLS.
Degrees of Freedom (DoF)
A measure of model flexibility that captures how much each fitted value depends on its own observed response. More dependence = more DoF. Higher DoF = more flexible/complex model = lower bias, higher variance. Best SS and FSS have high DoF; the lasso's DoF is roughly the number of nonzero coefficients.
OLS-after-Lasso (β̂OLS|Lasso)
When lasso is first used to select which predictors to include, then OLS is fit on only those selected predictors.
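This card and the relaxed lasso card fit together in one short sketch: run the lasso, refit OLS on its support, then blend. All data is synthetic, λ and α are fixed for illustration (both would normally come from cross-validation):

```python
import numpy as np

rng = np.random.default_rng(8)
n, p = 120, 6
X = rng.normal(size=(n, p))
X = (X - X.mean(0)) / X.std(0)
y = 3 * X[:, 1] + rng.normal(size=n)      # only predictor 1 matters
y = y - y.mean()

def soft(z, g):
    return np.sign(z) * max(abs(z) - g, 0.0)

# Step 1: plain lasso by coordinate descent (lambda fixed for illustration)
lam = 0.5
b_lasso = np.zeros(p)
for _ in range(200):
    for j in range(p):
        r = y - X @ b_lasso + X[:, j] * b_lasso[j]   # partial residual
        b_lasso[j] = soft(X[:, j] @ r / n, lam)

# Step 2: OLS refit on the lasso-selected support undoes the shrinkage bias
support = np.flatnonzero(b_lasso)
coef, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
b_ols = np.zeros(p)
b_ols[support] = coef                      # beta_OLS|Lasso

# Step 3: relaxed lasso blends the two estimates
alpha = 0.5
b_relaxed = alpha * b_lasso + (1 - alpha) * b_ols
```

The refit coefficient is larger than the shrunken lasso one, and the relaxed estimate sits between them.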
SNR vs DoF
At low SNRs, procedures with high DoF tend to do poorly (they overfit the noise). At high SNRs, all methods begin to do well, with higher-DoF procedures doing best; lower-DoF methods like the lasso deliberately add bias for stability, which pays off most at low SNR.