Linear Model Selection and Regularization
Subset Selection
Reduce the number of predictors by selecting a subset of them and comparing how models built on that subset perform relative to models built on other subsets
Best Subset Selection
Start with the null model M0, which contains no predictors and simply predicts the sample mean for each observation
Then fit every possible combination of predictors and keep the one with the smallest RSS (or largest R^2)
Total of 2^p possible models, so the search is computationally intensive
If p is large (high-dimensional data), some of the fitted models may suffer from statistical problems such as overfitting
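The brute-force search can be written in a few lines. A minimal sketch in Python, assuming a NumPy feature matrix X (n × p) and response y; the helper name best_subset and the use of scikit-learn's LinearRegression are illustrative choices, not the chapter's code:

```python
from itertools import combinations

import numpy as np
from sklearn.linear_model import LinearRegression


def best_subset(X, y):
    """For each subset size k = 1..p, return the predictor subset with the smallest RSS."""
    n, p = X.shape
    best_per_size = {}
    for k in range(1, p + 1):
        best_rss, best_cols = np.inf, None
        for cols in combinations(range(p), k):   # all C(p, k) subsets of size k
            cols = list(cols)
            fit = LinearRegression().fit(X[:, cols], y)
            rss = np.sum((y - fit.predict(X[:, cols])) ** 2)
            if rss < best_rss:
                best_rss, best_cols = rss, cols
        best_per_size[k] = (best_cols, best_rss)
    return best_per_size  # compare across sizes afterwards with Cp/AIC/BIC or CV
```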
Forward Stepwise Selection
Test all predictors separately
Find the best single predictor, then add further predictors one at a time
At each step, add only the predictor that improves the model the most
Not guaranteed to find the best possible combination of predictors
Can be used with high-dimensional data (n < p) with special considerations (do not go past the model M_{n-1})
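A sketch of forward stepwise selection using scikit-learn's SequentialFeatureSelector; the synthetic data and the choice to keep 3 predictors are arbitrary illustrations:

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)
forward = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=3, direction="forward", cv=5
)
forward.fit(X, y)
print(forward.get_support())  # boolean mask of the predictors added greedily
```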
Backward Stepwise Selection
Start with the full model (all predictors included).
Iteratively remove the least useful predictor (based on some criterion) one at a time.
Stop when removing more predictors would make the model worse.
Select the single best model from the resulting sequence using an estimate of prediction error, Cp/AIC, BIC, or adjusted R^2
Requires n > p (the full model must be fit first), so it cannot be used with high-dimensional data
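The same idea run in reverse, again only as a sketch on synthetic data; direction="backward" starts from the full model and greedily drops the least useful predictor:

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)
backward = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=3, direction="backward", cv=5
)
backward.fit(X, y)
print(backward.get_support())  # mask of the predictors that survive the deletions
```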
How to choose the best model?
We can estimate this test error in two ways:
Indirectly - by adjusting the training error to account for the number of predictors
Directly - by using a test/validation set, or a k-fold/LOO cross-validation approach
Prediction focus? → Cp or AIC.
Simplicity & interpretability? → BIC.
Regression only, intuitive? → Adjusted R^2.
Cp Statistic
Adds a penalty to training RSS
Penalizes models with more predictors
Lower Cp → better model.
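Concretely, for a least squares model with d predictors and an estimate σ̂² of the error variance:

$$C_p = \frac{1}{n}\left(\mathrm{RSS} + 2\,d\,\hat{\sigma}^2\right)$$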
Akaike Information Criterion (AIC)
Adds a penalty to the training error; only defined for models fit by maximum likelihood
Lower AIC is better
Works when models are fit by maximum likelihood (e.g., regression, logistic regression)
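In its general form, with d fitted parameters and maximized likelihood L̂:

$$\mathrm{AIC} = 2d - 2\log\hat{L}$$

For least squares with Gaussian errors this is proportional to Cp, so the two criteria rank models the same way.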
Bayesian Information Criterion (BIC)
Heavier penalty when n is large.
Tends to select simpler models than AIC
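With the same notation:

$$\mathrm{BIC} = d\log(n) - 2\log\hat{L}$$

Since log(n) > 2 once n > 7, the BIC penalty per parameter exceeds AIC's, which is why it favors smaller models.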
Adjusted R^2
Intuition: the ideal model contains all of the correct variables and none of the noise variables
Unlike plain R^2, adjusted R^2 decreases when useless predictors are added
Higher adjusted R^2 → better model
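For a least squares model with d predictors:

$$\text{Adjusted } R^2 = 1 - \frac{\mathrm{RSS}/(n - d - 1)}{\mathrm{TSS}/(n - 1)}$$

Adding a useless predictor shrinks RSS only slightly but still costs a degree of freedom in n − d − 1, so the ratio can rise and adjusted R^2 can fall.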
Direct Measurement of Test Error
Evaluate the model on a held-out test/validation set of observations
Or perform cross-validation (either LOO or k-fold)
Advantage over indirect measurement: makes fewer assumptions about the true model
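A minimal sketch of the direct approach with k-fold cross-validation in scikit-learn; the 5 folds and the synthetic data are illustrative choices:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)
scores = cross_val_score(
    LinearRegression(), X, y, cv=5, scoring="neg_mean_squared_error"
)
print(-scores.mean())  # average held-out MSE, no assumptions about sigma^2 or d
```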
Shrinkage Methods
Fit a model with all p predictors, but “shrink” (regularize) the coefficient estimates toward zero
Shrinking a coefficient reduces that predictor's effect on the model
Ridge Regression
As λ → ∞ the shrinkage penalty grows and the coefficients are pushed toward 0; they equal 0 only when λ = ∞
In the coefficient-path plot, all of the coefficients shrink toward 0 together
Will always use all the predictors (but some may have small coefficients)
The penalty is not applied to the intercept β0
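Ridge minimizes RSS + λ Σ β_j². A sketch with scikit-learn, where λ is called alpha and is chosen by cross-validation over an illustrative grid; predictors are standardized first, as is conventional:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)
model = make_pipeline(StandardScaler(), RidgeCV(alphas=np.logspace(-3, 3, 50)))
model.fit(X, y)
print(model.named_steps["ridgecv"].alpha_)  # selected lambda
print(model.named_steps["ridgecv"].coef_)   # all coefficients shrunk, none exactly 0
```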
Lasso
Coefficient estimates are not only shrunk toward zero; some may be set exactly to zero and removed from the model entirely (variable selection)
In the coefficient-path plot, coefficients hit 0 at different values of λ
May end up using only a subset of the predictors (variable selection)
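The lasso minimizes RSS + λ Σ |β_j|. The same kind of sketch with LassoCV; the synthetic data with only 3 informative features is an illustrative choice that makes the zeroed-out coefficients visible:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)
model = make_pipeline(StandardScaler(), LassoCV(cv=5))
model.fit(X, y)
coefs = model.named_steps["lassocv"].coef_
print(np.sum(coefs == 0), "coefficients set exactly to zero")
```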
Which Shrinkage Method?
In general, ridge regression may perform better when all of the predictors contribute at least a little to the response; the lasso may perform better when we expect some coefficients to be essentially zero
Dimension Reduction
Instead of selecting among the original predictors, these methods transform the predictors and fit a model to the transformed variables
Principal Component Analysis
Reduce the number of predictors from p to M by applying mathematical transformations to the existing predictors
Transform the original correlated features into a set of uncorrelated variables, called principal components
Each additional component Z is automatically perpendicular (orthogonal) to the previous components
Unsupervised: the components are computed without using the response Y
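A minimal PCA sketch with scikit-learn; keeping M = 2 components is an arbitrary illustration:

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)
Z = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))
print(Z.shape)  # (200, 2): each Z is an orthogonal linear combination of the Xs
```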
Principal Components Regression
When predictors X_1, X_2, …, X_p are highly correlated, standard linear regression can become unstable.
Reduces the predictors to a smaller set of uncorrelated principal components (PCs) and then uses them as inputs for regression.
Doesn’t use the original predictors directly; it uses the top M components (usually the ones explaining most variance) as inputs for the regression.
Assumes that the principal components capturing most variance in X are also the ones that matter for predicting Y
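A sketch of principal components regression as a scikit-learn pipeline; M = 3 components and the synthetic data are illustrative choices, and in practice M would be chosen by cross-validation:

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)
pcr = make_pipeline(StandardScaler(), PCA(n_components=3), LinearRegression())
pcr.fit(X, y)
print(pcr.score(X, y))  # training R^2; use cross-validation to pick M
```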
Partial Least Squares
A supervised alternative: unlike PCR, it does use Y when computing the components Z_1, …, Z_M