Shrinkage Methods


6.2 Overview of Shrinkage Methods

  • Shrinkage methods are alternative approaches to subset selection when fitting linear models.

  • These methods involve constraining or regularizing coefficient estimates, effectively shrinking them towards zero.

  • This shrinkage can significantly reduce the variance of the coefficient estimates.

  • Two well-known techniques in shrinkage methods are:

    • Ridge Regression

    • Lasso

6.2.1 Ridge Regression

  • Basic Concept: Similar to least squares fitting but estimates coefficients by minimizing a different objective function.

  • Objective Function: The ridge regression coefficient estimates, denoted β̂^R_λ, minimize the following:
    RSS + λ Σ_{j=1}^{p} β_j²  (6.5)

  • Where:

    • RSS (Residual Sum of Squares) = Σ_{i=1}^{n} (y_i − β_0 − Σ_{j=1}^{p} β_j x_{ij})²

    • λ ≥ 0 is a tuning parameter that controls the strength of the shrinkage penalty applied to the coefficients.

    • The first term measures how well the coefficients fit the data, while the second term (the shrinkage penalty) is small when the coefficients are close to zero.

  • Interpretation:

    • When λ = 0: the penalty has no effect, and ridge regression produces the least squares estimates.

    • As λ approaches infinity, the coefficient estimates shrink towards zero (approaching the null model containing only the intercept).

    • Each value of λ yields a different set of coefficient estimates, β̂^R_λ, so selecting a good value of λ is critical.
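  The two limiting cases above can be checked numerically. The sketch below (a NumPy illustration on simulated data, not part of the text) solves the ridge objective in closed form, β̂^R = (XᵀX + λI)⁻¹Xᵀy; with λ = 0 it reproduces the least squares fit, while a very large λ shrinks the coefficients towards zero:

```python
import numpy as np

def ridge_coefs(X, y, lam):
    """Closed-form ridge solution (X'X + lam*I)^{-1} X'y.
    Assumes X carries no intercept column."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.standard_normal(100)

b_ols = ridge_coefs(X, y, 0.0)   # lam = 0 recovers the least squares estimates
b_big = ridge_coefs(X, y, 1e6)   # a huge lam drives the coefficients toward zero

print(np.allclose(b_ols, np.linalg.lstsq(X, y, rcond=None)[0]))  # True
print(np.linalg.norm(b_big) < np.linalg.norm(b_ols))             # True
```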

  • Scaling of Variables:

    • The shrinkage penalty is applied to β_1, …, β_p but not to the intercept β_0.

    • If the predictors are centered to have mean zero before ridge regression is performed, the estimated intercept takes the form β̂_0 = ȳ = Σ_{i=1}^{n} y_i / n. More generally, because the ridge penalty is not scale-invariant, it is best to standardize the predictors to a common scale before fitting.
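  A small scikit-learn sketch of the centering point (the simulated data and the penalty value are made-up assumptions, and sklearn's `alpha` parameter plays the role of λ): after centering each predictor to mean zero, the fitted ridge intercept equals ȳ.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 2)) + 5.0                     # uncentered predictors
y = 3.0 + X @ np.array([1.5, -2.0]) + rng.standard_normal(200)

Xc = X - X.mean(axis=0)                                     # center each column to mean zero
model = Ridge(alpha=10.0).fit(Xc, y)

print(np.isclose(model.intercept_, y.mean()))               # intercept equals ybar: True
```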

  • Example from Credit Data:

    • Figure 6.4 illustrates the ridge regression coefficient estimates for the Credit data set, showing how varying λ affects the coefficient estimates for different predictors (income, limit, rating, student).

    • The plots indicate that as λ increases, the coefficient estimates shrink towards zero.

  • Norms:

    • The ℓ₂ norm of a coefficient vector β is defined as ‖β‖₂ = √(Σ_{j=1}^{p} β_j²).

    • The ratio ‖β̂^R_λ‖₂ / ‖β̂‖₂ quantifies how far the ridge estimates have been shrunken relative to the least squares estimates: it equals 1 when λ = 0 and decreases towards 0 as λ grows.

  • Bias-Variance Trade-off:

    • Ridge regression reduces variance at the expense of introducing some bias.

    • The variance of the ridge predictions decreases as λ increases, and ridge regression performs best in settings where the least squares estimates have high variance.

6.2.2 The Lasso

  • Ridge Regression Limitations: Although it is beneficial for reducing variance, ridge regression includes all predictors, which complicates model interpretation.

  • Lasso Overview: The lasso addresses this limitation; its coefficient estimates, β̂^L_λ, minimize:
    RSS + λ Σ_{j=1}^{p} |β_j|  (6.7)

  • Key Differences:

    • Penalty: The lasso uses an ℓ₁ penalty (sum of absolute values of the coefficients) instead of the ℓ₂ penalty (sum of squares) used in ridge regression.

    • Feature Selection: The lasso can set some coefficients to zero, effectively performing variable selection (sparse models), improving model interpretation.

  • Coefficient Behavior:

    • As λ increases, the lasso sets more coefficients exactly to zero, removing those variables from the model entirely, unlike ridge regression.

    • Figure 6.6 illustrates the coefficient behavior for the lasso on the Credit data set.
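  The variable-selection property can be seen directly with scikit-learn's `Lasso` (a simulated illustration, not the Credit data; sklearn's `alpha` corresponds to λ): predictors with no real effect receive coefficients of exactly zero.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
X = rng.standard_normal((200, 10))
# Only the first two predictors actually influence the response.
y = 4.0 * X[:, 0] - 3.0 * X[:, 1] + rng.standard_normal(200)

lasso = Lasso(alpha=0.5).fit(X, y)

print(lasso.coef_)
print(np.sum(lasso.coef_ == 0.0))   # most irrelevant coefficients are exactly zero
```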

6.2.3 Selecting the Tuning Parameter λ

  • Methods for Selection:

    • Cross-validation is used to select the optimal value of the tuning parameter λ.

    • A grid of λ values is tested, the cross-validation error is computed for each, and the value with the smallest error is selected.

    • Figures 6.12 and 6.13 display cross-validation applied to ridge regression and the lasso, respectively, indicating the importance of selecting the proper λ.
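  A minimal sketch of this procedure using scikit-learn's `LassoCV` (the grid and the simulated data are illustrative assumptions; sklearn names the tuning parameter `alpha` rather than λ):

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(3)
X = rng.standard_normal((300, 8))
y = 2.0 * X[:, 0] - 1.0 * X[:, 3] + rng.standard_normal(300)

# Try a grid of candidate tuning-parameter values and pick the one
# with the smallest 5-fold cross-validation error.
alphas = np.logspace(-3, 1, 50)
cv_fit = LassoCV(alphas=alphas, cv=5).fit(X, y)

print(cv_fit.alpha_)   # the value chosen by cross-validation
```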

6.3 Dimension Reduction Methods

  • Dimension Reduction Overview: Unlike shrinkage methods which modify existing predictors, dimension reduction approaches create new predictors from the original variables.

  • Linear Combinations: New variables Z_1, Z_2, …, Z_M (with M < p) are formed as linear combinations of the original predictors:
    Z_m = Σ_{j=1}^{p} φ_{jm} X_j  (6.16)
    where the φ_{jm} are constants determining the contribution of each original variable X_j to the new variable Z_m.
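  Principal components are one common way to choose the loadings in such linear combinations. The NumPy sketch below (simulated data, an assumed choice of M = 2) builds the loading vectors from the SVD of the centered predictor matrix and forms the new variables Z_m:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((150, 5))
Xc = X - X.mean(axis=0)          # center the predictors

# The right singular vectors of the centered design matrix give the
# principal-component loading vectors phi_1, ..., phi_M.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
M = 2
Phi = Vt[:M].T                   # p x M matrix whose columns are the phi_m
Z = Xc @ Phi                     # Z_m = sum_j phi_{jm} X_j

print(Z.shape)                   # (150, 2)
```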