Linear Regression Notes
Linear Regression Overview
Definition: Linear regression is a statistical method to model the relationship between a dependent variable (e.g., mouse size) and one or more independent variables (e.g., mouse weight).
Alternative Name: General Linear Models, part one.
Importance: Linear regression is a powerful and widely-used technique in statistics and data analysis.
Key Concepts in Linear Regression
Least Squares Method
Purpose: To fit a line to the data.
Process:
Calculate the vertical distance from each data point to the line; these distances are called residuals.
Square each residual and sum them up to obtain the total sum of squares of residuals.
We square residuals instead of taking absolute values because:
Squared errors are differentiable, giving a simple closed-form solution.
Squaring penalizes large mistakes more, which is usually desirable.
Squared error corresponds to the normal distribution, the foundation for classical regression inference.
Repeat by adjusting the line’s angle (rotating) and calculating new residuals and their squared sums.
Plot the sum of squared residuals against the line's rotations to find the position that minimizes this sum.
Result: The line with the minimum sum of squared residuals is chosen as the best fit line.
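The least-squares procedure above can be sketched in a few lines of NumPy. The mouse weight/size numbers here are hypothetical, invented only for illustration:

```python
import numpy as np

# Hypothetical mouse data: weight (x) used to predict size (y)
weight = np.array([2.0, 2.5, 3.0, 3.5, 4.0, 4.5])
size   = np.array([1.4, 1.9, 2.1, 2.6, 2.8, 3.3])

def ssr(slope, intercept, x, y):
    """Sum of squared residuals for a candidate line."""
    residuals = y - (slope * x + intercept)
    return np.sum(residuals ** 2)

# Closed-form least-squares solution: the line that minimizes SSR
slope, intercept = np.polyfit(weight, size, deg=1)
best = ssr(slope, intercept, weight, size)

# Rotating the line away from the optimum (changing the slope)
# always increases the sum of squared residuals
assert ssr(slope + 0.1, intercept, weight, size) > best
assert ssr(slope - 0.1, intercept, weight, size) > best
```

Instead of literally rotating the line and replotting, `np.polyfit` jumps straight to the minimizer, but the assertions confirm the same picture: any other slope gives a larger sum of squared residuals.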
R-squared (R²)
Function: A statistical measure that represents the proportion of the variance for a dependent variable that's explained by an independent variable (or variables) in a regression model.
Calculation:
Total Sum of Squares Formula:
SS_{mean} = \text{Sum of Squares around the Mean} = \sum(\text{Data} - \text{Mean})^2
Variation around the Mean: \text{Variance} = \frac{SS_{mean}}{n}
Sum of Squares for the Fit:
SS_{fit} = \text{Sum of Squares around the Fit} = \sum(\text{Data} - \text{Fitted Line})^2
Variation around the Fit: \text{Variance around Fit} = \frac{SS_{fit}}{n}
R² Formula:
R^2 = \frac{SS_{mean} - SS_{fit}}{SS_{mean}}
Interpretation:
High R² indicates that a significant amount of variance in the dependent variable is accounted for by the independent variable(s).
Example Results:
If R² = 0.6, it means 60% of the variance in mouse size can be explained by mouse weight.
If R² = 1, mouse weight explains 100% of the variance in mouse size.
If R² = 0, mouse weight does not explain any variance.
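The R² formula above translates directly into code. A minimal sketch, using hypothetical mouse data:

```python
import numpy as np

def r_squared(y, y_pred):
    """R^2 = (SS_mean - SS_fit) / SS_mean."""
    ss_mean = np.sum((y - np.mean(y)) ** 2)
    ss_fit = np.sum((y - y_pred) ** 2)
    return (ss_mean - ss_fit) / ss_mean

# Hypothetical example: size predicted from weight by a fitted line
weight = np.array([2.0, 2.5, 3.0, 3.5, 4.0, 4.5])
size = np.array([1.4, 1.9, 2.1, 2.6, 2.8, 3.3])
slope, intercept = np.polyfit(weight, size, deg=1)

r2 = r_squared(size, slope * weight + intercept)
# r2 is close to 1 here because this made-up data is nearly linear;
# a perfect prediction gives exactly 1, a useless one gives 0
```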
Calculating R-squared Examples
Example Calculations:
Given:
Variation around the Mean = 11.1
Variation around the Fit = 4.4
Calculation: R^2 = \frac{11.1 - 4.4}{11.1} \approx 0.60
Interpretation: 60% reduction in variance upon accounting for weight.
Bias–Variance vs. R²: These Measure Different Things
Low variance and low bias do NOT mean you want a small R². This is a common point of confusion.
Bias–variance
A property of the model class and the training process
About how well your method generalizes to unseen data
High variance = overfits
High bias = underfits
R² (coefficient of determination)
A measure of how well your model fits the existing training data
Does NOT measure generalization
Can be high even when the model has terrible variance
Can be low even when the model generalizes perfectly (e.g., very noisy data)
So the two concepts are not aligned.
Misconception: “Since we want low variance, maybe a small R² is better.”
No, and here is why.
If R² is small
It usually means:
The model is underfitting: high bias, too simple a model, real patterns left uncaptured.
R² close to 0 usually means the model learned almost nothing.
It can also mean the data is inherently noisy:
R² cannot be improved no matter what, and the bias–variance tradeoff has nothing to do with it.
If R² is large
It usually means:
The model fits the training data well, and the independent variables explain a large proportion of the variance in the dependent variable.
But beware:
R² can be high because of overfitting
A neural network can achieve R² = 0.999 on training data but generalize terribly
This is why ML uses test sets, cross-validation, and regularization.
So what SHOULD you optimize?
Not R².
You optimize test error, like:
MAE
RMSE
MSE
Cross-entropy
Accuracy
Your validation/test performance tells you whether bias and variance are balanced.
R² is just descriptive — it doesn’t guide ML model selection.
Clean 15-second explanation (interview-ready)
R² only measures how well the model fits the training data.
A small R² usually means high bias (underfitting).
A high R² can still mean high variance (overfitting).
Bias–variance tradeoff is evaluated on validation/test error, not R².
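The point that a high R² can hide overfitting is easy to demonstrate with a deliberately overfit polynomial. A sketch on synthetic data (all numbers here are arbitrary choices for the demo):

```python
import numpy as np

rng = np.random.default_rng(0)

def r_squared(y, y_pred):
    ss_mean = np.sum((y - np.mean(y)) ** 2)
    ss_fit = np.sum((y - y_pred) ** 2)
    return (ss_mean - ss_fit) / ss_mean

# Truly linear data plus noise, split into train and test halves
x = rng.uniform(0, 1, 40)
y = 2 * x + 1 + rng.normal(0, 0.3, 40)
x_tr, y_tr, x_te, y_te = x[:20], y[:20], x[20:], y[20:]

# A degree-9 polynomial has enough flexibility to chase the noise
coeffs = np.polyfit(x_tr, y_tr, deg=9)
r2_train = r_squared(y_tr, np.polyval(coeffs, x_tr))
r2_test = r_squared(y_te, np.polyval(coeffs, x_te))
# r2_train is very high, yet r2_test comes out much lower:
# the fit memorized training noise instead of the true line
```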
Additional Concepts and Scenarios in Linear Regression
When knowing mouse weight allows perfect predictions, R² would be 1 (100% explained variance).
If knowing mouse weight does not provide predictive power, R² would be 0.
Even complex equations can use R-squared, which depends solely on comparing the sum of squares around the mean and fit.
Multi-Variable Linear Regression
Modeling with Multiple Predictors
Scenario: Predicting body length using both mouse weight and tail length.
Visualization: Use a 3D graph to plot weight, tail length, and body length.
Fitting Process: The same least-squares procedure applies, but with two predictors it fits a plane instead of a line.
Result: Adding predictors can only maintain or increase R² on the data used for fitting, because least squares never produces a worse fit when given more parameters.
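This "more predictors never lower R²" behavior can be verified directly: add a pure-noise predictor and the training R² still does not drop. A sketch with made-up data and variable names:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 30
weight = rng.uniform(2, 5, n)
body_len = 0.7 * weight + rng.normal(0, 0.2, n)
noise_feature = rng.normal(size=n)  # predictor unrelated to body_len

def fit_r2(X, y):
    """Least-squares fit with intercept; returns R^2 on the fitting data."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    ss_fit = np.sum((y - A @ beta) ** 2)
    ss_mean = np.sum((y - y.mean()) ** 2)
    return (ss_mean - ss_fit) / ss_mean

r2_one = fit_r2(weight.reshape(-1, 1), body_len)
r2_two = fit_r2(np.column_stack([weight, noise_feature]), body_len)
assert r2_two >= r2_one - 1e-12  # extra predictor never reduces training R^2
```

This is exactly why adjusted R² exists: plain R² rewards adding even useless predictors.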
Parameters and Adjusted R-squared
Definition: Adjusted R² adjusts the R² value based on the number of parameters in the model to prevent overfitting by adding unnecessary predictors.
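The standard adjusted R² formula, with n observations and p predictors (not counting the intercept), can be sketched as:

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1).

    n: number of observations, p: number of predictors
    (not counting the intercept).
    """
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Same raw fit quality, more predictors -> lower adjusted R^2
print(adjusted_r2(0.80, n=30, p=1))  # ≈ 0.793
print(adjusted_r2(0.80, n=30, p=5))  # ≈ 0.758
```

Unlike plain R², this value goes down when a new predictor fails to improve the fit enough to justify the extra parameter.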
Evaluating R-squared Significance
Role of P-value
Importance: Indicates whether the R² value is statistically significant or due to random chance.
Calculation:
Derived from the F-statistic F, the ratio of the variance explained by the model to the variance not explained, which indicates the quality of the fit.
F = \frac{\text{Variation explained by the fit}}{\text{Variation not explained by the fit}}
Degrees of freedom are used to adjust F-values into a standardized format for significance testing.
P-value Computation Steps
Generate random data and calculate the mean and the sums of squares around it.
Compute the F-statistic for each random dataset and plot the resulting values in a histogram.
The p-value is the proportion of randomly generated F-values that exceed the F-value computed from the original data.
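The simulation steps above amount to a permutation test. A rough sketch for simple (one-predictor) regression; the data values and the number of permutations are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(2)

def f_statistic(x, y):
    """F = (variation explained / df_explained) / (variation unexplained / df_residual)."""
    n = len(y)
    slope, intercept = np.polyfit(x, y, deg=1)
    ss_mean = np.sum((y - y.mean()) ** 2)
    ss_fit = np.sum((y - (slope * x + intercept)) ** 2)
    # simple regression: 1 df explained by the slope, n - 2 residual df
    return ((ss_mean - ss_fit) / 1) / (ss_fit / (n - 2))

# Synthetic data with a real relationship
x = rng.uniform(2, 5, 20)
y = 0.7 * x + rng.normal(0, 0.3, 20)
f_obs = f_statistic(x, y)

# Shuffle y so any x-y relationship is destroyed, refit each time,
# and count how often a random F beats the observed one
f_null = np.array([f_statistic(x, rng.permutation(y)) for _ in range(2000)])
p_value = np.mean(f_null >= f_obs)
```

In practice the p-value comes from the theoretical F-distribution with the appropriate degrees of freedom rather than from simulation, but the permutation version shows what the p-value means: how often random data beats your fit.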
Final Takeaways
Necessity of R² to quantify explained variance in regression analysis.
Importance of p-value for establishing the reliability of the R² value.
Ideal outcome in regression analysis is both a high R² (large) and a low p-value (small).
Conclusion
Linear regression is a fundamental tool in statistics for quantifying relationships between variables, and it requires careful consideration of R² and p-values for valid interpretations.
Importance of understanding these concepts for effective data analysis and drawing accurate conclusions in research and applied statistics.