6.2 Shrinkage Methods
Shrinkage methods are alternative approaches to subset selection when fitting linear models.
These methods involve constraining or regularizing coefficient estimates, effectively shrinking them towards zero.
This shrinkage can significantly reduce the variance of the coefficient estimates.
The two best-known shrinkage techniques are:
Ridge Regression
Lasso
6.2.1 Ridge Regression
Basic Concept: Similar to least squares fitting but estimates coefficients by minimizing a different objective function.
Objective Function: The ridge regression coefficient estimates, denoted $\hat{\beta}^R$, are the values that minimize
$$\text{RSS} + \lambda \sum_{j=1}^{p} \beta_j^2$$
Where:
RSS (Residual Sum of Squares) $= \sum_{i=1}^{n} \big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \big)^2$
λ ≥ 0 is a tuning parameter that determines the strength of the shrinkage penalty applied to the coefficients.
The first term (RSS) measures how well the coefficients fit the data, while the second term, the shrinkage penalty, is small when the coefficients are close to zero and therefore shrinks the estimates of $\beta_j$ towards zero.
Interpretation:
When λ = 0: the penalty has no effect and ridge regression produces the least squares estimates.
As λ approaches infinity, the penalty dominates and the coefficient estimates approach zero (the null model containing only the intercept).
Each value of λ produces a different set of coefficient estimates, so ridge regression yields a whole family of fits rather than a single model (see the sketch below).
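A minimal sketch of this behavior, assuming scikit-learn and synthetic data in place of the Credit predictors (scikit-learn's Ridge estimator names the tuning parameter alpha rather than λ):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))                    # stand-in for p = 4 predictors
y = X @ np.array([3.0, -2.0, 0.5, 0.0]) + rng.normal(size=100)

print("least squares:", np.round(LinearRegression().fit(X, y).coef_, 3))

for lam in [0.01, 1.0, 100.0, 1e5]:              # sklearn calls the tuning parameter alpha
    ridge = Ridge(alpha=lam).fit(X, y)
    print(f"lambda = {lam:>8}:", np.round(ridge.coef_, 3))
# small lambda gives estimates close to least squares;
# large lambda drives all coefficients towards zero
```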
Scaling of Variables:
The shrinkage penalty is applied to $\beta_1, \ldots, \beta_p$ but not to the intercept $\beta_0$.
If the predictors are centered (mean zero) before ridge regression is applied, the estimated intercept is simply $\hat{\beta}_0 = \bar{y} = \sum_{i=1}^{n} y_i / n$. Because the ridge solutions also depend on the scale of each predictor, it is standard practice to standardize the predictors so that each has standard deviation one before fitting.
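A sketch of this preprocessing, again assuming scikit-learn and made-up data; StandardScaler centers and scales the predictors, and Ridge leaves the intercept unpenalized:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
# predictors on very different scales (e.g. income in thousands vs. a 0/1 dummy)
X = rng.normal(size=(100, 4)) * np.array([1.0, 10.0, 100.0, 1000.0])
y = X @ np.array([3.0, -0.2, 0.01, 0.001]) + rng.normal(size=100)

# Standardize each predictor (mean 0, sd 1) so the penalty treats all
# coefficients on a common scale; the intercept itself is not penalized.
model = make_pipeline(StandardScaler(), Ridge(alpha=10.0)).fit(X, y)
ridge = model.named_steps["ridge"]
print(ridge.intercept_, np.round(ridge.coef_, 3))   # intercept equals the mean of y
```

Without the scaling step, the penalty would punish coefficients on small-scale variables far more heavily than those on large-scale variables.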
Example from Credit Data:
Figure 6.4 illustrates the ridge regression coefficient estimates for the Credit data set, showing the impact of varying λ on the coefficient estimates for different predictors (income, limit, rating, student).
The plots indicate that as λ increases, the coefficient estimates shrink towards zero.
Norms:
The ℓ2 norm of a coefficient vector β is defined as $\|\beta\|_2 = \sqrt{\sum_{j=1}^{p} \beta_j^2}$.
Figure 6.4 also plots the estimates against $\|\hat{\beta}_\lambda^R\|_2 / \|\hat{\beta}\|_2$, the ratio of the ℓ2 norm of the ridge estimates to that of the least squares estimates; this ratio equals 1 when λ = 0 and decreases towards 0 as λ grows (a small computation of the ratio follows below).
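A quick computation of this norm ratio on synthetic data (a sketch, not the Credit analysis itself):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = X @ np.array([3.0, -2.0, 0.5, 0.0]) + rng.normal(size=100)

beta_ols = LinearRegression().fit(X, y).coef_
for lam in [0.1, 10.0, 1000.0]:
    beta_ridge = Ridge(alpha=lam).fit(X, y).coef_
    print(lam, np.linalg.norm(beta_ridge) / np.linalg.norm(beta_ols))
# the ratio starts near 1 for small lambda and falls towards 0 as lambda grows
```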
Bias-Variance Trade-off:
Ridge regression reduces variance at the expense of introducing some bias.
As λ increases, the variance of the ridge predictions decreases while the bias increases; consequently ridge regression works best in settings where the least squares estimates have high variance, such as when the number of predictors p is close to the number of observations n (illustrated by the simulation sketch below).
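The following small simulation, assuming scikit-learn and a made-up high-variance setting (p = 30, n = 40), illustrates the variance reduction by refitting both estimators on many training sets and comparing the variability of their predictions at one fixed test point:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(1)
p, n = 30, 40                              # p close to n: least squares has high variance
beta = rng.normal(size=p)
x_test = rng.normal(size=(1, p))           # one fixed test point

preds_ols, preds_ridge = [], []
for _ in range(200):                       # 200 independent training sets
    X = rng.normal(size=(n, p))
    y = X @ beta + rng.normal(size=n)
    preds_ols.append(LinearRegression().fit(X, y).predict(x_test)[0])
    preds_ridge.append(Ridge(alpha=10.0).fit(X, y).predict(x_test)[0])

print("variance of OLS predictions:  ", np.var(preds_ols))
print("variance of ridge predictions:", np.var(preds_ridge))
# ridge accepts a little bias in exchange for a much smaller prediction variance
```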
6.2.2 The Lasso
Ridge Regression Limitations: Although it is beneficial for reducing variance, ridge regression includes all predictors, which complicates model interpretation.
Lasso Overview: The lasso addresses this limitation; its coefficient estimates minimize
$$\text{RSS} + \lambda \sum_{j=1}^{p} |\beta_j| \qquad (6.7)$$
Key Differences:
Penalty: The lasso uses an L1 penalty (sum of absolute values) instead of the L2 penalty (sum of squares) used in ridge regression.
Feature Selection: The lasso can set some coefficients to zero, effectively performing variable selection (sparse models), improving model interpretation.
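A brief illustration of this sparsity, assuming scikit-learn's Lasso and Ridge estimators on synthetic data where only two of ten predictors carry signal:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=200)   # only 2 predictors matter

lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty: some coefficients become exactly 0
ridge = Ridge(alpha=0.1).fit(X, y)   # L2 penalty: coefficients shrink but stay nonzero

print("lasso nonzero coefficients:", int(np.sum(lasso.coef_ != 0)))
print("ridge nonzero coefficients:", int(np.sum(ridge.coef_ != 0)))
```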
Coefficient Behavior:
As λ increases, the lasso sets some coefficient estimates exactly to zero, removing the corresponding variables from the model entirely, unlike ridge regression.
Figure 6.6 illustrates the coefficient behavior for the lasso on the Credit data set.
6.2.3 Selecting the Tuning Parameter λ
Methods for Selection:
Cross-validation is used to select the value of the tuning parameter λ.
A grid of λ values is tested, the cross-validation error is computed for each, and the value with the smallest error is selected; the model is then refit on all available observations using that λ.
Figures 6.12 and 6.13 display cross-validation applied to ridge regression and the lasso, respectively, indicating the importance of selecting a proper λ.
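A sketch of this grid-search procedure using scikit-learn's built-in cross-validated estimators (RidgeCV and LassoCV); the data and grid here are illustrative assumptions, not the setup behind the figures:

```python
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=200)

lambdas = np.logspace(-3, 3, 50)                      # grid of candidate tuning parameters
ridge_cv = RidgeCV(alphas=lambdas).fit(X, y)          # efficient leave-one-out CV by default
lasso_cv = LassoCV(alphas=lambdas, cv=10).fit(X, y)   # 10-fold cross-validation

print("selected lambda (ridge):", ridge_cv.alpha_)
print("selected lambda (lasso):", lasso_cv.alpha_)
```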
6.3 Dimension Reduction Methods
Dimension Reduction Overview: Unlike shrinkage methods, which fit a model using the original predictors, dimension reduction approaches first transform the predictors into a smaller set of new variables and then fit a least squares model using those new variables.
Linear Combinations: The new variables $Z_1, Z_2, \ldots, Z_M$ (with M < p) are formed as linear combinations of the original predictors: $Z_m = \sum_{j=1}^{p} \ell_{jm} x_j$ (6.16), where the $\ell_{jm}$ are constants determining the contribution of each original variable to the new variable.
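One concrete way to construct such Z_m is principal components analysis, as used in principal components regression; the sketch below assumes scikit-learn, where PCA's components_ attribute plays the role of the loadings ℓ_{jm}:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 10))                   # p = 10 original predictors
y = X @ rng.normal(size=10) + rng.normal(size=200)

M = 3                                            # number of new predictors Z_1, ..., Z_M
pcr = make_pipeline(PCA(n_components=M), LinearRegression()).fit(X, y)

loadings = pcr.named_steps["pca"].components_    # row m holds the coefficients l_jm
Z = pcr.named_steps["pca"].transform(X)          # Z_m = sum_j l_jm * x_j (after centering X)
print(loadings.shape, Z.shape)                   # (M, p) and (n, M)
```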