Shrinkage Methods

6.2 Shrinkage Methods

  • Shrinkage methods are an alternative to subset selection for fitting linear models: rather than selecting a subset of predictors, they fit a model containing all p predictors.

  • These methods involve constraining or regularizing coefficient estimates, effectively shrinking them towards zero.

  • This shrinkage can significantly reduce the variance of the coefficient estimates.

  • Two well-known techniques in shrinkage methods are:

    • Ridge Regression

    • Lasso

6.2.1 Ridge Regression

  • Basic Concept: Similar to least squares fitting but estimates coefficients by minimizing a different objective function.

  • Objective Function: The ridge regression coefficient estimates, denoted $\hat{\beta}^{R}$, are the values that minimize
    $\mathrm{RSS} + \lambda \sum_{j=1}^{p} \beta_j^2$

  • Where:

    • RSS (Residual Sum of Squares) $= \sum_{i=1}^{n} \bigl( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \bigr)^2$

    • $\lambda \ge 0$ is a tuning parameter that controls the strength of the shrinkage penalty applied to the coefficients.

    • The first term measures how well the coefficients fit the data, while the second term, the shrinkage penalty, is small when the coefficients are close to zero and therefore pushes the estimates of $\beta_1, \ldots, \beta_p$ towards zero.

  • Interpretation:

    • When $\lambda = 0$: the penalty has no effect, and ridge regression produces the least squares estimates.

    • As $\lambda \to \infty$, the penalty dominates and the coefficient estimates approach zero (the null model containing only the intercept).

    • Each value of $\lambda$ leads to a different set of coefficient estimates, so ridge regression must be fit over a grid of $\lambda$ values (see the numerical sketch at the end of this subsection).

  • Scaling of Variables:

    • The shrinkage penalty is applied to $\beta_1, \ldots, \beta_p$ but not to the intercept $\beta_0$.

    • If the predictors are centered to have mean zero before ridge regression is performed, the estimated intercept is simply $\hat{\beta}_0 = \bar{y} = \sum_{i=1}^{n} y_i / n$. In addition, because the ridge estimates depend on the scale of the predictors, it is best to standardize the predictors (so that each has standard deviation one) before applying ridge regression.

  • Example from Credit Data:

    • Figure 6.4 illustrates the ridge regression coefficient estimates for the Credit data set, showing the impact of varying $\lambda$ on the coefficient estimates for different predictors (income, limit, rating, student).

    • The plots indicate that as $\lambda$ increases, the coefficient estimates shrink towards zero.

  • Norms:

    • The $\ell_2$ norm of a vector $\beta$ is defined as $\|\beta\|_2 = \sqrt{\sum_{j=1}^{p} \beta_j^2}$.

    • Figure 6.4 also displays the estimates as a function of $\|\hat{\beta}_{\lambda}^{R}\|_2 / \|\hat{\beta}\|_2$, the $\ell_2$ norm of the ridge estimates relative to that of the least squares estimates; this ratio decreases from 1 towards 0 as $\lambda$ increases.

  • Bias-Variance Trade-off:

    • Ridge regression reduces variance at the expense of introducing some bias.

    • The variance of the ridge predictions decreases as $\lambda$ increases (while the squared bias increases), so ridge regression works best in settings where the least squares estimates have high variance, such as when the number of predictors p is almost as large as the number of observations n.
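
  To make the ridge objective concrete, below is a minimal numerical sketch (not from the text): it fits ridge regression on simulated data via the closed-form solution on standardized predictors and a centered response, and shows how the $\ell_2$ norm of the coefficient estimates shrinks as $\lambda$ grows. The data and the grid of $\lambda$ values are made up for illustration.

  ```python
  import numpy as np

  rng = np.random.default_rng(0)

  # Simulated data (purely illustrative): n observations, p predictors.
  n, p = 100, 5
  X = rng.normal(size=(n, p))
  true_beta = np.array([3.0, -2.0, 0.5, 0.0, 1.0])
  y = X @ true_beta + rng.normal(size=n)

  # Standardize the predictors and center the response; with centered data
  # the intercept estimate is simply the mean of y (beta_0_hat = y_bar).
  Xs = (X - X.mean(axis=0)) / X.std(axis=0)
  yc = y - y.mean()

  def ridge_coefficients(Xs, yc, lam):
      """Minimize RSS + lam * sum(beta_j ** 2) via the closed-form solution
      beta_hat = (X'X + lam * I)^(-1) X'y (intercept handled by centering)."""
      return np.linalg.solve(Xs.T @ Xs + lam * np.eye(Xs.shape[1]), Xs.T @ yc)

  # lam = 0 reproduces least squares; as lam grows, the l2 norm of the
  # ridge estimates shrinks towards zero (the null model in the limit).
  beta_ls = ridge_coefficients(Xs, yc, 0.0)
  for lam in [0.0, 1.0, 10.0, 100.0, 1000.0]:
      beta_r = ridge_coefficients(Xs, yc, lam)
      ratio = np.linalg.norm(beta_r) / np.linalg.norm(beta_ls)
      print(f"lambda = {lam:7.1f}   ||beta_R||_2 / ||beta_LS||_2 = {ratio:.3f}")
  print("intercept estimate (mean of y):", y.mean())
  ```

  The printed ratio mirrors the norm ratio discussed above: it equals 1 at $\lambda = 0$ and decreases towards 0 as the penalty strengthens.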

6.2.2 The Lasso

  • Ridge Regression Limitations: Although it is beneficial for reducing variance, ridge regression includes all p predictors in the final model, which complicates model interpretation when p is large.

  • Lasso Overview: The lasso addresses this limitation by estimating coefficients $\hat{\beta}^{L}_{\lambda}$ that minimize
    $\mathrm{RSS} + \lambda \sum_{j=1}^{p} |\beta_j| \qquad (6.7)$

  • Key Differences:

    • Penalty: The lasso uses an $\ell_1$ penalty (the sum of the absolute values of the coefficients) instead of the $\ell_2$ penalty (the sum of the squared coefficients) used in ridge regression.

    • Feature Selection: The lasso can set some coefficients to zero, effectively performing variable selection (sparse models), improving model interpretation.

  • Coefficient Behavior:

    • As $\lambda$ increases, the lasso sets some coefficient estimates exactly to zero and so removes the corresponding variables from the model entirely, unlike ridge regression (see the sketch after this list).

    • Figure 6.6 illustrates the coefficient behavior for the lasso on the Credit data set.
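
  As a companion to the ridge sketch, the snippet below (again on made-up data, and using scikit-learn's Lasso rather than anything from the text; scikit-learn calls the tuning parameter alpha and scales the RSS term slightly differently) shows the key qualitative difference: as the penalty grows, the lasso drives some coefficients exactly to zero, performing variable selection.

  ```python
  import numpy as np
  from sklearn.linear_model import Lasso

  rng = np.random.default_rng(1)

  # Simulated data: only the first three predictors are truly relevant.
  n, p = 200, 10
  X = rng.normal(size=(n, p))
  true_beta = np.zeros(p)
  true_beta[:3] = [4.0, -3.0, 2.0]
  y = X @ true_beta + rng.normal(size=n)

  # As the penalty (alpha here, lambda in the notes) increases, more
  # coefficients are set exactly to zero, unlike ridge regression,
  # which only shrinks them towards zero.
  for alpha in [0.01, 0.1, 0.5, 1.0, 2.0]:
      fit = Lasso(alpha=alpha).fit(X, y)
      n_nonzero = np.sum(fit.coef_ != 0)
      print(f"alpha = {alpha:<5}  nonzero coefficients: {n_nonzero} of {p}")
  ```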

6.2.3 Selecting the Tuning Parameter λ

  • Methods for Selection:

    • Cross-validation is used to select a value for the tuning parameter $\lambda$.

    • A grid of $\lambda$ values is chosen, the cross-validation error is computed for each value, and the $\lambda$ with the smallest cross-validation error is selected; the model is then re-fit on all of the available observations using that value (see the sketch after this list).

    • Figures 6.12 and 6.13 display cross-validation applied to ridge regression and the lasso, respectively, indicating the importance of selecting a proper $\lambda$.
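
  A minimal sketch of this selection procedure, using scikit-learn's cross-validated estimators (which call the tuning parameter alpha); the simulated data and the candidate grid are assumptions made for illustration only.

  ```python
  import numpy as np
  from sklearn.linear_model import LassoCV, RidgeCV

  rng = np.random.default_rng(2)
  n, p = 200, 10
  X = rng.normal(size=(n, p))
  y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=n)

  # Candidate values of the tuning parameter (lambda in the notes, alpha here).
  grid = np.logspace(-3, 2, 50)

  # 10-fold cross-validation: the CV error is estimated for each candidate
  # value, the value with the smallest error is retained, and the model is
  # re-fit on the full data set using that value.
  lasso_cv = LassoCV(alphas=grid, cv=10).fit(X, y)
  ridge_cv = RidgeCV(alphas=grid, cv=10).fit(X, y)

  print("lasso: selected alpha =", lasso_cv.alpha_)
  print("ridge: selected alpha =", ridge_cv.alpha_)
  ```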

6.3 Dimension Reduction Methods

  • Dimension Reduction Overview: Unlike shrinkage methods, which fit a model using the original predictors and shrink their coefficients, dimension reduction approaches transform the predictors and then fit a least squares model using the transformed variables.

  • Linear Combinations: New variables $Z_1, Z_2, \ldots, Z_M$, with $M < p$, are formed as linear combinations of the original p predictors: $Z_m = \sum_{j=1}^{p} \phi_{jm} X_j$ (6.16), where the constants $\phi_{jm}$ determine the contribution of each original variable to the new variable (see the sketch below for one common choice of these constants).
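
  To illustrate equation (6.16), here is a small sketch (not from the text) that forms M new variables as linear combinations of the original predictors, using the principal components of X as one common choice of the constants $\phi_{jm}$; the data are simulated for illustration.

  ```python
  import numpy as np

  rng = np.random.default_rng(3)
  n, p, M = 100, 6, 2
  X = rng.normal(size=(n, p))

  # Center the columns of X, then take the top-M right singular vectors as
  # the weights phi_{jm}; this is the principal-components choice of the
  # linear combinations Z_m = sum_j phi_{jm} X_j.
  Xc = X - X.mean(axis=0)
  U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
  phi = Vt[:M].T           # p x M matrix of weights phi_{jm}
  Z = Xc @ phi             # n x M matrix of new variables Z_1, ..., Z_M

  print("phi shape:", phi.shape)   # (p, M)
  print("Z shape:  ", Z.shape)     # (n, M)
  # A least squares model can now be fit using Z in place of the original X.
  ```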