
R-squared in Linear Regression (Notes)

In linear regression, the R-squared (or coefficient of determination) statistic quantifies the proportion of the variance in the dependent variable (y) that is predictable from the independent variable(s) in the model, indicating how well the model fits the observed data. For ordinary least squares regression with an intercept, its value ranges from 0 to 1, where 0 means the model explains none of the variability in y and 1 means it explains all of it.

R-squared is fundamentally defined as a ratio of the explained variation to the total variation:

R^2 = \frac{SS_{explained}}{SS_{tot}}

where:

  • SS_{explained} (Sum of Squares Explained) represents the variation in y that is captured by the model.

  • SS_{tot} (Total Sum of Squares) represents the total variation in y from its mean.

Alternatively, the explained variation can be expressed as the total variation minus the unexplained variation. This leads to an equivalent and often more practical formula:

R^2 = 1 - \frac{SS_{res}}{SS_{tot}}

where:

  • SS_{res} = \sum_{i=1}^n (y_i - \hat{y}_i)^2 is the unexplained variation, also known as the Sum of Squared Residuals (SSR) or Sum of Squared Errors (SSE). This term measures the aggregate discrepancy between the observed values (y_i) and the values predicted by the model (\hat{y}_i).

  • SS_{tot} = \sum_{i=1}^n (y_i - \bar{y})^2 is the total variation of y from its mean. This term measures the total variability inherent in the dependent variable.

  • \bar{y} = \frac{1}{n} \sum_{i=1}^n y_i is the mean of the observed dependent variable values.
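
As a concrete illustration of these formulas, here is a minimal Python sketch (the helper name r_squared is illustrative, not a library function; numpy is assumed) that computes R^2 from a vector of observed values and a vector of model predictions:

import numpy as np

def r_squared(y, y_hat):
    """Coefficient of determination: R^2 = 1 - SS_res / SS_tot."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    ss_res = np.sum((y - y_hat) ** 2)     # unexplained variation
    ss_tot = np.sum((y - y.mean()) ** 2)  # total variation about the mean
    return 1.0 - ss_res / ss_tot

For example, r_squared([5, 6, 6, 9], [4.7, 5.9, 7.1, 8.3]) returns 0.8 (up to floating-point rounding), matching the worked example later in these notes.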

Unexplained and total variation; the role of the residuals

To delve deeper, the unexplained variation (or residual sum of squares) is specifically calculated as the sum of the squared differences between each observed value (y_i) and its corresponding predicted value (\hat{y}_i) from the regression line:

SS_{res} = \sum_{i=1}^n (y_i - \hat{y}_i)^2

This quantity represents the portion of the variability in y that the model could NOT explain; it is often referred to as the error (or residual) sum of squares.

Conversely, the total variation (or total sum of squares) is computed as the sum of the squared differences between each observed value (y_i) and the mean of the dependent variable (\bar{y}):

SS_{tot} = \sum_{i=1}^n (y_i - \bar{y})^2

This quantity represents the overall variability of the dependent variable regardless of the model.

Note on Variance Factors: Variance is typically defined with a factor of \frac{1}{n} (population variance) or \frac{1}{n-1} (sample variance), and these factors matter when estimating population parameters. For the R-squared statistic, however, they are irrelevant: because R-squared is a ratio, any common scaling factor applied to both SS_{res} and SS_{tot} cancels out. The formula therefore yields the same value whether the sums of squares are left as sums or scaled into variances:

R^2 = 1 - \frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2}{\sum_{i=1}^n (y_i - \bar{y})^2}

This demonstrates that R-squared is a dimensionless proportion.
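
A quick numerical check of this cancellation (a small sketch with arbitrary made-up data; none of these values come from the notes):

import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=50)                     # arbitrary observed values
y_hat = y + rng.normal(scale=0.5, size=50)  # imperfect "predictions", purely for illustration
n = len(y)

ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)

# R^2 from raw sums of squares vs. from 1/(n-1)-scaled ("variance-like") quantities:
r2_sums = 1.0 - ss_res / ss_tot
r2_scaled = 1.0 - (ss_res / (n - 1)) / (ss_tot / (n - 1))
print(np.isclose(r2_sums, r2_scaled))  # True: the scaling factors cancel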

Why the mean of prediction errors does not appear

A critical property of Ordinary Least Squares (OLS) regression, particularly when the model includes an intercept term, is that the sum (and thus the mean) of the prediction errors (or residuals, e_i = y_i - \hat{y}_i) is always zero. This is a direct consequence of the first-order conditions used to derive the OLS estimators.

Mathematically, for a model with an intercept, \sum_{i=1}^n e_i = \sum_{i=1}^n (y_i - \hat{y}_i) = 0.

Because the residuals always sum to zero, their mean is also zero. This "centering" of the residuals means that there is no 'average' error that needs to be accounted for in the R-squared calculation beyond their squared sum. This fundamental property underpins the validity of decomposing the total variation (SS_{tot}) into explained variation (SS_{explained}) and unexplained variation (SS_{res}) in the manner shown by the R-squared formulas.
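
This zero-sum property is easy to verify numerically. The sketch below fits an OLS line with an intercept via numpy's least-squares solver on arbitrary made-up data and checks that the residuals sum to (numerically) zero:

import numpy as np

# Arbitrary illustrative data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Design matrix with an intercept column of ones.
X = np.column_stack([np.ones_like(x), x])

# OLS fit: beta = [intercept, slope].
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta

print(residuals.sum())  # ~0 (zero up to floating-point error)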

Worked example

Consider a tiny dataset with four observations: x = 1, 2, 3, 4 and y = 5, 6, 6, 9. A fitted model yields

\hat{y} = 3.5 + 1.2 x.

Using the x-values, the predicted y-values are:

  • For x = 1: \hat{y} = 3.5 + 1.2 \cdot 1 = 4.7

  • For x = 2: \hat{y} = 3.5 + 1.2 \cdot 2 = 5.9

  • For x = 3: \hat{y} = 3.5 + 1.2 \cdot 3 = 7.1

  • For x = 4: \hat{y} = 3.5 + 1.2 \cdot 4 = 8.3

The prediction errors are:

  • e_1 = 5 - 4.7 = 0.3

  • e_2 = 6 - 5.9 = 0.1

  • e_3 = 6 - 7.1 = -1.1

  • e_4 = 9 - 8.3 = 0.7

The squared prediction errors sum to:

SS_{res} = \sum e_i^2 = 0.09 + 0.01 + 1.21 + 0.49 = 1.80.

Next, compute the total variation of y. The mean of y is

\bar{y} = \frac{5 + 6 + 6 + 9}{4} = 6.5,

and

SS_{tot} = \sum (y_i - \bar{y})^2 = (5-6.5)^2 + (6-6.5)^2 + (6-6.5)^2 + (9-6.5)^2 = (-1.5)^2 + (-0.5)^2 + (-0.5)^2 + 2.5^2 = 9.

Therefore the R-squared is

R^2 = 1 - \frac{SS_{res}}{SS_{tot}} = 1 - \frac{1.80}{9} = 0.80.

Interpretation: 80% of the variability in y is explained by the model, while the remaining 20% is unexplained and attributable to factors not captured by the model.
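
These hand calculations can be reproduced in a few lines of Python (a small sketch; the printed values match the ones above up to floating-point rounding):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([5.0, 6.0, 6.0, 9.0])

# Fitted model from the example: y_hat = 3.5 + 1.2 * x
y_hat = 3.5 + 1.2 * x

ss_res = np.sum((y - y_hat) ** 2)     # approx. 1.80
ss_tot = np.sum((y - y.mean()) ** 2)  # 9.0
r2 = 1.0 - ss_res / ss_tot            # approx. 0.80

print(ss_res, ss_tot, r2)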

Key takeaways
  • Definition: R-squared (R^2), also known as the coefficient of determination, is a crucial statistic in linear regression. It represents the proportion (or percentage) of the total variance in the dependent variable (y) that can be explained by the independent variable(s) included in the regression model.

  • Calculation: It is calculated as the ratio of explained variation to total variation, or more commonly, as R^2 = 1 - \frac{SS_{res}}{SS_{tot}}.

    • SS_{res} = \sum_{i=1}^n (y_i - \hat{y}_i)^2 quantifies the unexplained variation (sum of squared residuals), representing the errors of the model.

    • SS_{tot} = \sum_{i=1}^n (y_i - \bar{y})^2 quantifies the total variation in y from its mean, representing the inherent variability in the dependent variable.

  • Residual Properties: A fundamental property of Ordinary Least Squares (OLS) regression models (with an intercept) is that the sum, and thus the mean, of the prediction errors (residuals) is always zero. This ensures that the SS_{res} accurately reflects the unexplained variance without needing to center the residuals.

  • Variance Scaling: Any scaling factors (like \frac{1}{n} or \frac{1}{n-1} for variance computations) cancel out in the R-squared ratio, meaning its value is independent of whether SS_{res} and SS_{tot} are expressed as sums of squares or averages of squares.

Practical Implications and Considerations
  • Model Fit: A higher R^2 value (closer to 1) generally indicates that the model provides a better fit to the data, as a larger proportion of the variability in y is accounted for by the model.

  • Limitations:

    • Causation: A high R^2 does not imply causation between the independent and dependent variables. Correlation is not causation.

    • Overfitting: Adding independent variables never decreases R^2, so it can rise even when the new variables do not genuinely improve the model's predictive power. Chasing a higher R^2 this way leads to overfitting, where the model fits the training data too closely but performs poorly on new, unseen data.

    • Adjusted R-squared: For multiple regression, the "Adjusted R-squared" is often preferred as it accounts for the number of predictors in the model and penalizes the inclusion of unnecessary variables. This helps mitigate the issue of overfitting when comparing models (see the sketch after this list).

    • Context Matters: The "goodness" of an R^2 value depends heavily on the field of study. In some fields (e.g., precise physical sciences), a very high R^2 (above 0.9) might be expected, while in others (e.g., social sciences), a much lower R^2 (around 0.2) may still be considered meaningful. Always interpret R^2 in the context of the specific domain and research question.
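
As a rough illustration of the adjusted R-squared mentioned above, the sketch below uses the common formula \bar{R}^2 = 1 - (1 - R^2)\frac{n-1}{n-p-1}, where n is the number of observations and p the number of predictors; the function name adjusted_r_squared is illustrative, not a library API:

def adjusted_r_squared(r2: float, n: int, p: int) -> float:
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1).

    n: number of observations; p: number of predictors (excluding the intercept).
    """
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# With R^2 = 0.80 from the worked example (n = 4 observations, p = 1 predictor):
print(adjusted_r_squared(0.80, n=4, p=1))  # approx. 0.7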