Linear Model Selection and Regularization
Subset Selection
Reduce the number of predictors by selecting a subset of them and comparing how models built on that subset perform relative to models built on other subsets
Best Subset Selection
Start with the null model M0, which contains no predictors and simply predicts the sample mean for each observation
Then fit every possible combination of predictors and keep the one with the smallest RSS (or largest R^2)
Total of 2^p possible models, so the search is computationally intensive
If p is large (high-dimensional data), some of the fitted models may suffer from statistical problems such as overfitting
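The brute-force search can be written in a few lines. A minimal sketch in Python, assuming a NumPy feature matrix X (n × p) and response y; the helper name best_subset and the use of scikit-learn's LinearRegression are illustrative choices, not the chapter's code:

```python
from itertools import combinations

import numpy as np
from sklearn.linear_model import LinearRegression


def best_subset(X, y):
    """For each subset size k = 1..p, return the predictor subset with the smallest RSS."""
    n, p = X.shape
    best_per_size = {}
    for k in range(1, p + 1):
        best_rss, best_cols = np.inf, None
        for cols in combinations(range(p), k):   # all C(p, k) subsets of size k
            cols = list(cols)
            fit = LinearRegression().fit(X[:, cols], y)
            rss = np.sum((y - fit.predict(X[:, cols])) ** 2)
            if rss < best_rss:
                best_rss, best_cols = rss, cols
        best_per_size[k] = (best_cols, best_rss)
    return best_per_size  # compare across sizes afterwards with Cp/AIC/BIC or CV
```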
Forward Stepwise Selection
Test all predictors separately
Find the best single predictor, then add further predictors one at a time
At each step, add only the predictor that improves the model the most
Not guaranteed to find the best possible combination of predictors
Can be used with high-dimensional data (n < p) with special considerations (do not go past the model M_{n-1})
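A sketch of forward stepwise selection using scikit-learn's SequentialFeatureSelector; the synthetic data and the choice to keep 3 predictors are arbitrary illustrations:

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)
forward = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=3, direction="forward", cv=5
)
forward.fit(X, y)
print(forward.get_support())  # boolean mask of the predictors added greedily
```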
Backward Stepwise Selection
Start with the full model (all predictors included).
Iteratively remove the least useful predictor (based on some criterion) one at a time.
Stop when removing more predictors would make the model worse.
Select the single best model from the resulting sequence using an estimate of prediction error, Cp/AIC, BIC, or adjusted R^2
Requires n > p (the full model must be fit first), so it cannot be used with high-dimensional data
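The same idea run in reverse, again only as a sketch on synthetic data; direction="backward" starts from the full model and greedily drops the least useful predictor:

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)
backward = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=3, direction="backward", cv=5
)
backward.fit(X, y)
print(backward.get_support())  # mask of the predictors that survive the deletions
```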
How to choose the best model?
We can estimate this test error in two ways:
Indirectly - by adjusting the training error to account for the number of predictors
Directly - by using a test/validation set, or a k-fold/LOO cross-validation approach
Prediction focus? → Cp or AIC.
Simplicity & interpretability? → BIC.
Regression only, intuitive? → Adjusted R^2.
Cp Statistic
Adds a penalty to training RSS
Penalizes models with more predictors
Lower Cp → better model.
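Concretely, for a least squares model with d predictors and an estimate σ̂² of the error variance:

$$C_p = \frac{1}{n}\left(\mathrm{RSS} + 2\,d\,\hat{\sigma}^2\right)$$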
Akaike Information Criterion (AIC)
Adds a penalty to the training error; only defined for models fit by maximum likelihood
Lower AIC is better
Works when models are fit by maximum likelihood (e.g., regression, logistic regression)
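In its general form, with d fitted parameters and maximized likelihood L̂:

$$\mathrm{AIC} = 2d - 2\log\hat{L}$$

For least squares with Gaussian errors this is proportional to Cp, so the two criteria rank models the same way.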
Bayesian Information Criterion (BIC)
Heavier penalty when n is large.
Tends to select simpler models than AIC
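With the same notation:

$$\mathrm{BIC} = d\log(n) - 2\log\hat{L}$$

Since log(n) > 2 once n > 7, the BIC penalty per parameter exceeds AIC's, which is why it favors smaller models.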
Adjusted R^2
Intuition: the ideal model contains all of the correct variables and none of the noise variables
Unlike plain R^2, adjusted R^2 decreases when useless predictors are added
Higher adjusted R^2 → better model
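For a least squares model with d predictors:

$$\text{Adjusted } R^2 = 1 - \frac{\mathrm{RSS}/(n - d - 1)}{\mathrm{TSS}/(n - 1)}$$

Adding a useless predictor shrinks RSS only slightly but still costs a degree of freedom in n − d − 1, so the ratio can rise and adjusted R^2 can fall.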
Direct Measurement of Test Error
Evaluate the model on a held-out test/validation set of observations
Or perform cross-validation (either LOO or k-fold)
Advantage over indirect measurement: makes fewer assumptions about the true model
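A minimal sketch of the direct approach with k-fold cross-validation in scikit-learn; the 5 folds and the synthetic data are illustrative choices:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)
scores = cross_val_score(
    LinearRegression(), X, y, cv=5, scoring="neg_mean_squared_error"
)
print(-scores.mean())  # average held-out MSE, no assumptions about sigma^2 or d
```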
Shrinkage Methods
Fit a model with all p predictors, but “shrink” (regularize) the coefficient estimates toward zero
Shrinking a coefficient reduces that predictor's effect on the model
Ridge Regression
As λ → ∞ the shrinkage penalty grows and the coefficients are pushed toward 0; they equal 0 only when λ = ∞
In the coefficient-path plot, all of the coefficients shrink toward 0 together
Will always use all the predictors (but some may have small coefficients)
The penalty is not applied to the intercept β0
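Ridge minimizes RSS + λ Σ β_j². A sketch with scikit-learn, where λ is called alpha and is chosen by cross-validation over an illustrative grid; predictors are standardized first, as is conventional:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)
model = make_pipeline(StandardScaler(), RidgeCV(alphas=np.logspace(-3, 3, 50)))
model.fit(X, y)
print(model.named_steps["ridgecv"].alpha_)  # selected lambda
print(model.named_steps["ridgecv"].coef_)   # all coefficients shrunk, none exactly 0
```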
Lasso
Coefficient estimates are not only shrunk toward zero; some may be set exactly to zero and removed from the model entirely (variable selection)
In the coefficient-path plot, coefficients hit 0 at different values of λ
May end up using only a subset of the predictors (variable selection)
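The lasso minimizes RSS + λ Σ |β_j|. The same kind of sketch with LassoCV; the synthetic data with only 3 informative features is an illustrative choice that makes the zeroed-out coefficients visible:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)
model = make_pipeline(StandardScaler(), LassoCV(cv=5))
model.fit(X, y)
coefs = model.named_steps["lassocv"].coef_
print(np.sum(coefs == 0), "coefficients set exactly to zero")
```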
Which Shrinkage Method?
In general, ridge regression may perform better when all of the predictors contribute at least a little to the response; the lasso may perform better when we expect some coefficients to be essentially zero
Dimension Reduction
Instead of selecting among the original predictors, these methods transform the predictors and fit a model to the transformed variables
Principal Component Analysis
Reduce the number of predictors from p to M by applying mathematical transformations to the existing predictors
Transform the original correlated features into a set of uncorrelated variables, called principal components
Each additional component Z is automatically perpendicular (orthogonal) to the previous components
Unsupervised: the components are computed without using the response Y
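A minimal PCA sketch with scikit-learn; keeping M = 2 components is an arbitrary illustration:

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)
Z = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))
print(Z.shape)  # (200, 2): each Z is an orthogonal linear combination of the Xs
```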
Principal Components Regression
When predictors X_1, X_2, …, X_p are highly correlated, standard linear regression can become unstable.
Reduces the predictors to a smaller set of uncorrelated principal components (PCs) and then uses them as inputs for regression.
Doesn’t use the original predictors directly; it uses the top M components (usually the ones explaining most variance) as inputs for the regression.
Assumes that the principal components capturing most variance in X are also the ones that matter for predicting Y
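A sketch of principal components regression as a scikit-learn pipeline; M = 3 components and the synthetic data are illustrative choices, and in practice M would be chosen by cross-validation:

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)
pcr = make_pipeline(StandardScaler(), PCA(n_components=3), LinearRegression())
pcr.fit(X, y)
print(pcr.score(X, y))  # training R^2; use cross-validation to pick M
```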
Partial Least Squares
A supervised alternative: unlike PCR, it does use Y when computing the components Z_1, …, Z_M