Data Mining Quiz 4

Description and Tags

Linear Model Selection and Regularization


17 Terms

1

Subset Selection

Reducing the number of predictors by selecting a subset of them and evaluating that subset's performance against other candidate subsets of predictors

2

Best Subset Selection

  • Starts from the null model, which simply predicts the sample mean for each observation, then tries every combination of predictors to find the one with the smallest RSS or largest R² (a minimal sketch follows this card)

  • Total of 2^p possible models, so it is computationally intensive

  • If p is large (high-dimensional data), some models may suffer from statistical problems
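
A minimal sketch of the exhaustive search, assuming a small toy data set; the variable names and the helper rss() are illustrative, not from the cards.

```python
# Best subset selection by exhaustive search over all 2^p predictor combinations.
from itertools import combinations
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 4
X = rng.normal(size=(n, p))
y = 2 * X[:, 0] - X[:, 2] + rng.normal(size=n)   # only predictors 0 and 2 matter

def rss(cols):
    """RSS of a least-squares fit of y on the chosen columns plus an intercept."""
    A = np.column_stack([np.ones(n), X[:, list(cols)]])
    beta = np.linalg.lstsq(A, y, rcond=None)[0]
    resid = y - A @ beta
    return float(resid @ resid)

# For each model size k, keep the subset with the smallest RSS.
for k in range(1, p + 1):
    best = min(combinations(range(p), k), key=rss)
    print(f"best subset of size {k}: {best}, RSS = {rss(best):.2f}")
# The final choice among sizes would use Cp, AIC, BIC, or cross-validation,
# since RSS always favors the largest model.
```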

3

Forward Stepwise Selection

  • Start with no predictors and test each predictor separately

  • Find the best single predictor, then tack on additional predictors one at a time (see the sketch after this card)

  • At each step, only adds the predictor that most improves the model

  • Not guaranteed to find the best possible combination of predictors

  • Can be used on high-dimensional data (n < p) with special considerations (only submodels M0 through M(n−1) can be fit, so do not go past M(n−1))
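
A minimal sketch using scikit-learn's SequentialFeatureSelector; the toy data and the choice to stop at 2 selected predictors are illustrative assumptions.

```python
# Forward stepwise selection: greedily add one predictor at a time.
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 6))
y = 4 * X[:, 0] + 2 * X[:, 3] + rng.normal(size=80)

# At each step, add the predictor that most improves the cross-validated score.
sfs = SequentialFeatureSelector(LinearRegression(), n_features_to_select=2,
                                direction="forward", cv=5)
sfs.fit(X, y)
print("selected predictors:", np.flatnonzero(sfs.get_support()))
```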

4

Backward Stepwise Selection

  • Start with the full model (all predictors included).

  • Iteratively remove the least useful predictor (based on some criterion) one at a time.

  • Stop when removing more predictors would make the model worse.

  • Select the single best model using an estimated prediction error (validation set or cross-validation), or Cp, AIC, BIC, or adjusted R² (see the sketch below)

  • Requires n > p so that the full model can be fit; it cannot be used with high-dimensional data (p > n)
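
A matching sketch for the backward direction, again with illustrative toy data (note n = 80 > p = 6, so the full model can be fit first).

```python
# Backward stepwise selection: start from all predictors and drop the least useful.
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 6))
y = 4 * X[:, 0] + 2 * X[:, 3] + rng.normal(size=80)

# direction="backward": remove one predictor at a time based on the CV score.
sbs = SequentialFeatureSelector(LinearRegression(), n_features_to_select=2,
                                direction="backward", cv=5)
print("kept predictors:", np.flatnonzero(sbs.fit(X, y).get_support()))
```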

5

How to choose the best model?

Estimate the model's test error:

  • Indirectly - by adjusting training accuracy measurements

  • Directly - by using a test/validation set, or a k-fold / leave-one-out (LOO) cross-validation approach

  • Prediction focus? → Cp or AIC.

  • Simplicity & interpretability? → BIC.

  • Regression only, intuitive? → Adjusted R².

6

Cp Statistic

  • Adds a penalty to training RSS

  • Penalizes models with more predictors

  • Lower Cp → better model.
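
  • For reference (with d predictors in the model and σ̂² an estimate of the error variance): Cp = (1/n) · (RSS + 2 · d · σ̂²)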

7

Akaike Information Criterion (AIC)

  • Adds penalty, only defined for models fit by maximum likelihood

  • Lower AIC is better

  • Works when models are fit by maximum likelihood (e.g., regression, logistic regression)
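
  • For reference, the general form is AIC = 2d − 2·log(L̂), where L̂ is the maximized likelihood and d the number of fitted parameters; for least squares with Gaussian errors it is proportional to Cp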

8

Bayesian Information Criterion (BIC)

  • Heavier penalty when n is large.

  • Tends to select simpler models than AIC
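
  • For reference, BIC = d·log(n) − 2·log(L̂); since log(n) > 2 once n > 7, each added parameter is penalized more heavily than under AIC, which is why BIC favors smaller models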

9

Adjusted R²

  • Intuition: the model that maximizes adjusted R² should contain only the correct variables and no noise variables

  • Unlike plain R², adding useless predictors can decrease adjusted R²

  • Higher adjusted R² → better model
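
  • For reference (d predictors, TSS the total sum of squares): adjusted R² = 1 − [RSS / (n − d − 1)] / [TSS / (n − 1)]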

10

Direct Measurement of Test Error

  • Report the error on a test/holdout/validation set of observations

  • Or perform cross-validation (either LOO or k-fold); a minimal sketch follows this card

  • Advantage over indirect measurement: it makes fewer assumptions about the true model
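
A minimal sketch of measuring test error directly with k-fold cross-validation in scikit-learn; the toy data, the 5-fold choice, and the variable names are illustrative assumptions.

```python
# Direct estimate of test MSE via 5-fold cross-validation.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=100)

# scoring returns negative MSE, so flip the sign to report test MSE
scores = cross_val_score(LinearRegression(), X, y,
                         cv=5, scoring="neg_mean_squared_error")
print(f"estimated test MSE: {-scores.mean():.3f}")
```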

11

Shrinkage Methods

  • Aim to “shrink” (regularize) the coefficient estimates toward zero

  • Shrinking a coefficient reduces that predictor's effect on the model

12

Ridge Regression

  • As λ → ∞ the shrinkage penalty grows and the coefficient estimates shrink toward 0, but they only equal zero in the limit λ = ∞

  • All coefficient paths approach 0 together as λ increases

  • Always keeps all of the predictors (though some may have very small coefficients)

  • The penalty is not applied to the intercept β0 (see the objective below)
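
  • For reference, ridge regression chooses the coefficients that minimize RSS + λ · Σ βj², where the sum runs over β1, …, βp (the intercept β0 is excluded from the penalty)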

13

Lasso

  • Coefficient estimates not only shrink toward zero; some become exactly zero and are removed from the model entirely (variable selection)

  • Coefficient paths hit 0 at different values of λ

  • Can therefore end up using only a subset of the predictors (variable selection)
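
  • For reference, the lasso chooses the coefficients that minimize RSS + λ · Σ |βj|; the absolute-value (ℓ1) penalty is what allows coefficients to be set exactly to zero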

14

Which Shrinkage Method?

In general, ridge regression tends to do better when all of the predictors contribute to the response at least a little, while the lasso tends to do better when only some predictors matter and the remaining coefficients are essentially zero (see the comparison sketch below).
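
A minimal sketch contrasting the two penalties with scikit-learn; the toy data and the alpha values (scikit-learn's name for the tuning parameter λ) are illustrative assumptions.

```python
# Ridge keeps every coefficient (shrunk); lasso sets most of them exactly to zero.
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
y = 3 * X[:, 0] + 1.5 * X[:, 1] + rng.normal(size=200)   # only 2 true predictors

ridge = Ridge(alpha=10.0).fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)

print("ridge:", np.round(ridge.coef_, 2))   # all 10 coefficients nonzero but shrunk
print("lasso:", np.round(lasso.coef_, 2))   # most coefficients are exactly 0
```

In practice λ (alpha) would be chosen by cross-validation, for example with RidgeCV or LassoCV.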

15

Dimension Reduction

Instead of selecting a subset of the original predictors, transform the predictors into a smaller set of new variables and fit the regression using those transformed variables.

16

Principal Component Analysis

  • Reduce number of predictors from p to M, by performing mathematical transformations to the existing predictors

  • Transform the original correlated features into a set of uncorrelated variables, called principal components

  • Each added component Z is automatically perpendicular (orthogonal) to the previous components

  • Unsupervised: the components are computed without using the response Y (a brief sketch follows this card)
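
A minimal sketch with scikit-learn's PCA; the toy data and the choice of M = 2 components are illustrative assumptions.

```python
# Transform correlated predictors into uncorrelated principal components.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 6))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=100)   # make two predictors correlated

Z = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))
print(Z.shape)                        # (100, 2): p = 6 predictors reduced to M = 2
print(np.round(np.corrcoef(Z.T), 3))  # off-diagonals near 0: components are uncorrelated
```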

17

Principal Components Regression

  • When predictors X1, X2, …, Xp are highly correlated, standard linear regression can become unstable.

  • Reduces the predictors to a smaller set of uncorrelated principal components (PCs) and then uses them as inputs for regression.

  • Doesn't use the original predictors directly; it uses the top M components (usually the ones explaining the most variance) as inputs for the regression.

  • Assumes that the principal components capturing most of the variance in X are also the ones that matter for predicting Y.

  • Y is not used when the components Z1, …, ZM are identified (that step is unsupervised); Y only enters the regression on the chosen components (see the sketch below).
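
A minimal principal components regression sketch; the pipeline, the toy data, and M = 2 are illustrative assumptions.

```python
# PCR: standardize, keep the top M principal components, then regress y on them.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
X = rng.normal(size=(150, 8))
X[:, 1] = X[:, 0] + 0.05 * rng.normal(size=150)   # highly correlated predictors
y = 2 * X[:, 0] + rng.normal(size=150)

pcr = make_pipeline(StandardScaler(), PCA(n_components=2), LinearRegression())
pcr.fit(X, y)
print(f"training R^2: {pcr.score(X, y):.3f}")
```

In practice the number of components M would be chosen by cross-validation rather than fixed in advance.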