L5: Subset selection and shrinkage methods


1

Linear model

The linear model has advantages in interpretability and often shows good predictive performance

Advantages:

  • Inference and interpretability

  • Often competitive with non-linear models in prediction

2

Prediction accuracy

If the true relationship is linear, multiple linear regression (MLR) has low bias

The ratio of n (observations) to p (predictors) is important:

n ≫ p: low variance

n ~ p: high variance, overfitting

n < p: no unique least squares solution (variance is infinite)

3

Interpretability

Not all variables are associated with the response

By removing irrelevant features (setting their coefficients to zero), the model becomes more easily interpreted

→ feature selection

4

Three alternatives to least squares

Subset selection

Shrinkage

Dimension reduction

5

Subset selection

Identify a subset of predictors that best predicts a response. Use least squares to fit the model

6

Shrinkage

Fit a model containing all predictors using least squares, but shrink the coefficients toward zero. This reduces variance, and some shrinkage methods (the lasso) can also perform variable selection

7

How to select subsets

Best subset selection

Fit a separate least squares regression model

  • for each possible combination of predictors

    • i.e. models with 1, 2, …, p predictors (see the sketch below)
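
A minimal sketch of this procedure in Python (illustrative, not from the slides; assumes a NumPy matrix X of shape n × p and a response vector y):

```python
import itertools

import numpy as np
from sklearn.linear_model import LinearRegression

def best_subset(X, y):
    """Return the best (lowest-RSS) subset of each size k = 1, ..., p."""
    n, p = X.shape
    best = {}
    for k in range(1, p + 1):
        for subset in itertools.combinations(range(p), k):
            cols = list(subset)
            fit = LinearRegression().fit(X[:, cols], y)
            rss = np.sum((y - fit.predict(X[:, cols])) ** 2)
            if k not in best or rss < best[k][1]:
                best[k] = (cols, rss)
    return best  # then compare across sizes with CV, Cp, AIC, BIC, ...
```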

8

How to select single best model?

Select the best model out of the p+1 candidates (the best model of each size 0, 1, …, p)

RSS decreases monotonically and R² increases monotonically as variables are added

RSS/R² is therefore a bad choice

  • it would always end with the maximum number of variables in the model

RSS/R² measure training error

Choose the model with low test error instead

  • cross-validation or some other criterion

    • Cp, AIC, BIC and adjusted R²

  • best subset selection is infeasible for p > 40

The same logic for best subset selection extends to other models

  • e.g. logistic regression

9

Forward stepwise selection

Computationally efficient alternative to best subset selection

Start with an empty (null) model

  • then add the predictor that gives the greatest additional improvement, one at a time (see the sketch below)

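A minimal sketch with scikit-learn's SequentialFeatureSelector (note an assumption: scikit-learn scores each candidate addition by cross-validation rather than by training-set improvement as described above; X, y and the number of features to keep are illustrative):

```python
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

sfs = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=5,  # illustrative stopping point
    direction="forward",     # start empty, add one predictor per step
    cv=5,
)
sfs.fit(X, y)
print(sfs.get_support())     # boolean mask of the selected predictors
```
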
10

Cost - how many models

Best subset selection: 2^p models

Forward stepwise selection:

  • one null model

  • p − k models in the k-th iteration (k = 0, …, p−1)

  • in total 1 + p(p+1)/2 models (worked check below)

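A quick worked check of these counts, for p = 20: best subset selection must fit 2^20 = 1,048,576 models, while forward stepwise selection fits only 1 + 20·21/2 = 211.
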
11

Backward stepwise selection

Starts with the full least squares model containing all p predictors

Iteratively removes the least useful predictor, one at a time

Like forward stepwise selection, it fits 1 + p(p+1)/2 models

Requires n > p (so that the full model can be fit by least squares)

12

Hybrid approaches

Combination of forward selection and backward elimination

The idea is to remove a variable added by forward selection once it no longer improves the model fit

13

Choosing the best model

Select the best model amongst models with different numbers of predictors

Training error estimates (R², RSS) are not suited for this

Best model: the model with the smallest test error

Two common approaches:

  • make an adjustment to the training error to account for model size (Cp, AIC, BIC, adjusted R²)

  • estimate the test error directly with cross-validation

14

Cp, AIC, BIC and adjusted R²

Training MSE is an underestimate of the test MSE

Adjust the training error for model size

  • can be used to select between models with different numbers of variables

Four criteria (sketch below):

  • Mallows's Cp (Colin Lingwood Mallows)

  • AIC (Akaike information criterion, Hirotugu Akaike)

  • BIC (Bayesian information criterion)

  • Adjusted R²

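A minimal sketch of these adjustments (the standard ISLR formulas; assumes the RSS and TSS of a least squares model with d predictors, n observations, and an estimate sigma2 of the error variance):

```python
import numpy as np

def mallows_cp(rss, n, d, sigma2):
    # training RSS plus a penalty that grows with the number of predictors
    return (rss + 2 * d * sigma2) / n

def bic(rss, n, d, sigma2):
    # log(n) > 2 for n > 7, so BIC penalizes model size more heavily
    return (rss + np.log(n) * d * sigma2) / n

def adjusted_r2(rss, tss, n, d):
    # unlike R^2, this can decrease when a useless predictor is added
    return 1 - (rss / (n - d - 1)) / (tss / (n - 1))
```
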
15

Cross-validation

K-fold cross validation

Leave-one-out cross validation

Advantages over AIC, BIC, Cp and adjusted R²:

  • provides a direct estimate of the test error (see the sketch below)

  • fewer underlying assumptions

    • useful for a wider range of models, e.g. when the error variance is hard to estimate
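
A minimal sketch of the direct estimate with scikit-learn (assumes X_subset holds the columns of one candidate model):

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

scores = cross_val_score(
    LinearRegression(), X_subset, y,
    scoring="neg_mean_squared_error", cv=10,
)
cv_mse = -scores.mean()  # estimated test MSE for this candidate model
```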

16

One-standard-error-rule

Select the smallest model for which the estimated test error is within one standard error of the lowest point on the test-error curve

17

Shrinkage methods

Subset methods use least squares to fit a model on a subset of the predictors

Alternatively, we can fit a model on all predictors using a technique that constrains or regularizes the coefficient estimates

  • i.e. shrinks the estimates toward zero

Shrinking the coefficients reduces their variance

  • best-known methods: ridge regression and the lasso

18

Ridge regression

As with least squares, ridge regression seeks coefficient estimates that fit the data well by making the RSS small

However, the second term, the shrinkage penalty, is small when $\beta_1, \ldots, \beta_p$ are close to zero, and thus it has the effect of shrinking the estimates of $\beta_j$ toward zero (see the objective below)

The tuning parameter $\lambda$ serves to control the relative impact of these two terms on the regression coefficient estimates

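For reference, the objective the card refers to, in standard ISLR notation: ridge regression chooses the coefficients to minimize

$$\sum_{i=1}^{n}\Big(y_i-\beta_0-\sum_{j=1}^{p}\beta_j x_{ij}\Big)^{2}+\lambda\sum_{j=1}^{p}\beta_j^{2} \;=\; \mathrm{RSS}+\lambda\sum_{j=1}^{p}\beta_j^{2}, \qquad \lambda \ge 0$$
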
19

Lambda

When $\lambda = 0$, the penalty has no effect and ridge regression produces the least squares estimates

When $\lambda \to \infty$, the impact of the penalty grows and the coefficients approach zero

Selecting a good value for $\lambda$ is critical

  • cross-validation is used for this (see the sketch below)
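
A minimal sketch of this behaviour (scikit-learn calls lambda alpha; assumes standardized X):

```python
import numpy as np
from sklearn.linear_model import Ridge

for lam in [0.0, 0.1, 10.0, 1e6]:
    ridge = Ridge(alpha=lam).fit(X, y)
    # alpha=0 reproduces least squares; large alpha drives coefs toward 0
    print(lam, np.round(ridge.coef_, 3))
```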

20

scaling of predictors

The standard least squares coefficient estimates are scale equivariant

  • multiplying $X_j$ by a constant $c$ simply leads to a scaling of the least squares coefficient estimate by a factor of $1/c$

In contrast, the ridge regression coefficient estimates can change substantially when a given predictor is multiplied by a constant, due to the sum-of-squared-coefficients term in the penalty part of the ridge objective function

Therefore, it is best to apply ridge regression after standardizing the predictors
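
A minimal sketch of this advice (assumes X, y as before): put the standardization inside a pipeline so every predictor enters the penalty on the same scale:

```python
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# StandardScaler gives each predictor mean 0 and standard deviation 1
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
model.fit(X, y)
```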

21

Why does ridge improve over least squares

Ridge regression is an "improvement" because it accepts a tiny bit of bias (not fitting the training data perfectly) in exchange for a massive reduction in variance (making better predictions on new data). Variance is the error that comes from being too sensitive to small fluctuations in the training set.

In data science, we almost always prefer a model that is "mostly right all the time" over a model that is "perfectly right once but wrong everywhere else."

22

Drawback of ridge regression

It will include all p predictors, no selection

  • None of the regression coefficients will become exactly zero

  • Not a problem for model accuracy

    • A problem for model interpretation

23

The Lasso

The lasso also shrinks coefficients toward zero, but this time they can become exactly zero

Models have fewer predictors and are thus easier to interpret

The lasso performs variable selection (see the sketch below)

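A minimal sketch of the selection effect (the alpha value is illustrative; assumes standardized X):

```python
import numpy as np
from sklearn.linear_model import Lasso

lasso = Lasso(alpha=0.5).fit(X, y)  # alpha plays the role of lambda
print(lasso.coef_)                  # some entries are exactly 0.0
print(np.flatnonzero(lasso.coef_))  # indices of the kept predictors
```
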
24

Variable selection properties of Lasso

1. The Shapes: The "Constraint Regions"

The blue areas represent the "budget" or limit placed on the coefficients ($\beta_1$ and $\beta_2$). The penalty parameter $\lambda$ (or $s$ in this diagram) determines how small these shapes are.

  • Lasso (The Diamond): Because Lasso uses absolute values ($|\beta_1| + |\beta_2| \leq s$), the boundary is a diamond. In 2D space, absolute values create straight lines that meet at sharp corners on the axes.

  • Ridge (The Circle): Because Ridge uses squared values ($\beta_1^2 + \beta_2^2 \leq s$), the boundary is a circle. This is the standard equation for a circle centered at the origin.

2. The Red Ellipses: Residual Sum of Squares (RSS)

The red rings are like a "topographical map" for error.

  • The black dot ($\hat{\beta}$) in the center is the Least Squares estimate (the point of minimum error).

  • As you move away from that dot, the error (RSS) increases. Every point on a single red ring has the exact same error level.

25

What is better, ridge or lasso?

Depends on data characteristics

Lasso assumes some predictors are not related to the response

Ridge includes all predictors in the model

26

Selecting the tuning parameter

As for subset selection, for ridge regression and the lasso we require a method to determine which of the models under consideration is best

That is, we require a method for selecting a value of the tuning parameter lambda, or equivalently, the value of the constraint

Cross-validation provides a simple way to tackle this problem. We choose a grid of lambda values and compute the cross-validation error for each value of lambda.

We then select the tuning parameter value for which the cross-validation error is smallest

Finally, the model is re-fit using all of the available observations and the selected value of the tuning parameter
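
A minimal sketch of this whole procedure with scikit-learn's LassoCV (the grid bounds are illustrative; the same pattern works with RidgeCV):

```python
import numpy as np
from sklearn.linear_model import LassoCV

lambdas = np.logspace(-4, 2, 100)          # grid of candidate values
cv_model = LassoCV(alphas=lambdas, cv=10)  # 10-fold CV error per value
cv_model.fit(X, y)  # selects the best value and refits on all observations
print(cv_model.alpha_)                     # chosen tuning parameter
```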