Notes on Linear Regression, Residuals, R-Squared, P-Values, and Feature Engineering
Linear Regression Foundations
Purpose of linear regression in this course: a regression model is one where the label (the target) is a numerical value.
Distinction clarified: regression-style models exist for classification tasks in general, but when this lecturer uses the term regression, he means linear regression specifically (a model for predicting a numeric label).
Basic idea: use one or more features (variables) to predict a numeric label.
Correlation context: correlation is a number between -1 and +1 that measures the strength and direction of a linear relationship; linear regression exploits such a relationship to make predictions.
Simple linear regression form (example): \hat{y} = \beta_0 + \beta_1 x_1
Where (\hat{y}) is the predicted label, (\beta_0) is the intercept, and (\beta_1) is the slope for feature (x_1).
In a concrete example given in class: (\hat{y} = 0.1 + 0.78 x).
Interpretation: for every one-unit increase in (x), (\hat{y}) increases by 0.78 units (assuming other factors constant).
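A minimal sketch of fitting such a model in Python (made-up data; scikit-learn is an assumption, not the tool used in the lecture):
```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up training data: one feature x, one numeric label y.
x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0.9, 1.7, 2.4, 3.2, 4.0])

model = LinearRegression().fit(x, y)

# Learned parameters: intercept (beta_0) and slope (beta_1).
print(f"intercept beta_0: {model.intercept_:.2f}")
print(f"slope     beta_1: {model.coef_[0]:.2f}")

# Prediction for a new instance: y_hat = beta_0 + beta_1 * x
print("prediction for x=6:", model.predict([[6.0]])[0])
```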
Model parameters vs hyperparameters:
Parameters: the learned coefficients (e.g., (\beta_0, \beta_1, \beta_2, \ldots)) that the model fits from the training data.
Hyperparameters: settings you control that govern the learning process (e.g., the number of leaves in a decision tree, the regularization strength). Example given: hyperparameters control how the model is learned, while the learned coefficients are the model's parameters.
“Base case” intuition: if you have no information about a new instance, you might predict the average value of the training labels (the baseline). Knowledge about features (e.g., gender) can shift predictions away from the average in a sensible way (e.g., male vs female, age, etc.).
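A tiny sketch of the baseline idea (illustrative numbers, not from the lecture):
```python
import numpy as np

# With no information about a new instance, predict the average label.
y_train = np.array([3.0, 5.0, 4.0, 6.0, 7.0])   # illustrative labels
print("baseline prediction:", y_train.mean())    # 5.0

# Knowing a feature (e.g., gender) shifts the prediction to a group mean.
gender = np.array(["m", "f", "m", "f", "f"])
for g in ("m", "f"):
    print(g, y_train[gender == g].mean())
```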
Regression vs classification terminology recall:
For regression, we predict a numerical value ((y) is continuous).
For model evaluation in regression, we use metrics like SSE, MSE, RMSE, and R-squared, not plain accuracy.
Important visuals concept (residuals): the vertical distance between the observed value and the regression line is a residual (error).
Data and Model Structure
Regression goal: use features to predict a numeric label (the target).
In the lecture example, data examples are (feature, label) pairs, such as mouse weight (feature) predicting mouse size (label) in a toy illustration.
Strong positive correlation intuition: as you move right along the X-axis (feature increasing), Y tends to increase (the data points align along an upward trend).
Negative correlation intuition: if a dataset showed a negative correlation, the data would trend downward as X increases.
Perfect correlation concept:
Perfect correlation corresponds to data points lying exactly on a straight line (very rare in real data).
Perfect correlation occurs when the correlation coefficient is exactly 1 or -1.
Prediction strategy when nothing is known about an instance outside the dataset:
Baseline approach: predict the average label of the training data.
When some information is known (e.g., gender), models can adjust predictions using that information (e.g., separate group means, or a gender-based predictor).
Residuals, Loss, and Error Metrics
Residuals: the vertical distances between observed values and the model’s predictions. They are the error terms for regression.
Goal: minimize the total prediction error across all data points.
Why we square residuals:
To ensure all residuals contribute positively (avoid cancellation of positive and negative errors).
To penalize larger errors more heavily.
Sum of Squared Residuals (SSR or SSE):
\text{SSE} = \sum_{i=1}^n (y_i - \hat{y}_i)^2
Mean Squared Error (MSE):
\text{MSE} = \frac{\text{SSE}}{n}
Root Mean Squared Error (RMSE):
\text{RMSE} = \sqrt{\text{MSE}} = \sqrt{\frac{\text{SSE}}{n}}
Explained and total deviation intuition (R-squared will come later):
Total sum of squares (SST):
\text{SST} = \sum_{i=1}^n (y_i - \bar{y})^2
The residual sum of squares (SSE) is the unexplained part.
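These metrics follow directly from the formulas above; a sketch with illustrative arrays:
```python
import numpy as np

y     = np.array([3.0, 5.0, 4.0, 6.0, 7.0])   # observed labels
y_hat = np.array([2.8, 4.6, 4.3, 6.2, 6.9])   # model predictions

residuals = y - y_hat                  # vertical distances to the line
sse  = np.sum(residuals ** 2)          # sum of squared residuals
mse  = sse / len(y)                    # mean squared error
rmse = np.sqrt(mse)                    # root mean squared error
sst  = np.sum((y - y.mean()) ** 2)     # total deviation around the mean

print(f"SSE={sse:.3f}  MSE={mse:.3f}  RMSE={rmse:.3f}  SST={sst:.3f}")
```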
Regression line shaping (the bowl-curve concept): as you tilt the line (adjust the slope), SSE changes; plotted against the slope, SSE traces a bowl-shaped curve with a minimum at the best slope, increasing again if you tilt too far in either direction.
Intercept vs slope:
Slope ((\beta_1)) controls how steeply the line rises or falls with X.
Intercept ((\beta_0)) shifts the line up or down.
Data sensitivity:
Linear regression is sensitive to the training data used. Different training subsets can yield different regression lines, illustrating the impact of data selection on the model.
Extreme values (outliers):
Extreme data points can disproportionately affect the fitted line in linear regression.
Linear regression is less tolerant of extreme values; outlier handling or robust methods may be needed.
By contrast, some other models (e.g., decision trees) may tolerate extreme values better.
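A small demonstration of this sensitivity (made-up data): injecting one extreme label visibly changes the fitted slope.
```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Ten points lying near the line y = 2x, plus small noise.
x = np.arange(1, 11, dtype=float).reshape(-1, 1)
noise = np.array([0.2, -0.1, 0.3, 0.0, -0.2, 0.1, -0.3, 0.2, 0.0, 0.1])
y = 2.0 * x.ravel() + noise

clean = LinearRegression().fit(x, y)

y_outlier = y.copy()
y_outlier[-1] = 60.0                       # inject one extreme label
contaminated = LinearRegression().fit(x, y_outlier)

print("slope without outlier:", round(clean.coef_[0], 3))
print("slope with outlier:   ", round(contaminated.coef_[0], 3))
```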
Model Evaluation: R-Squared and Related Metrics
R-squared (R^2) intuition:
R^2 is the proportion of variance in the dependent variable that is predictable from the independent variables.
Range: 0 to 1. Higher is better (closer to 1 means the model explains more variance).
Formal definition:
R^2 = 1 - \frac{\text{SSE}}{\text{SST}} = 1 - \frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2}{\sum_{i=1}^n (y_i - \bar{y})^2}
Example interpretations from the lecture:
An R^2 of about 0.91 means the model explains 91% of the variation in the label around its mean.
In the example, an R^2 of 0.92 was observed after adding an interaction term (synergy), indicating the model explains about 92% of the variance.
Adjusted R-squared:
Adjusted R^2 accounts for the number of predictors and helps prevent automatic increases in R^2 as features are added.
Formula (conceptual): adjusted R^2 penalizes the number of parameters; the exact form a given tool uses may vary, and the AI Studio build used in this course lacked an explicit adjusted R^2 metric.
Important caveat: R^2 tends to increase with more features even if they are not truly informative; adjusted R^2 mitigates this.
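A sketch computing both metrics; the adjusted form shown is the standard textbook one, 1 - (1 - R^2)(n - 1)/(n - p - 1), which may differ from what any particular tool reports:
```python
import numpy as np

def r_squared(y, y_hat):
    # R^2 = 1 - SSE / SST
    sse = np.sum((y - y_hat) ** 2)
    sst = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - sse / sst

def adjusted_r_squared(y, y_hat, p):
    # Textbook adjustment for p predictors and n observations.
    n = len(y)
    r2 = r_squared(y, y_hat)
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

y     = np.array([3.0, 5.0, 4.0, 6.0, 7.0])
y_hat = np.array([2.8, 4.6, 4.3, 6.2, 6.9])
print(r_squared(y, y_hat), adjusted_r_squared(y, y_hat, p=2))
```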
R-squared vs model complexity comparison:
When comparing models with exactly the same features, higher R^2 indicates a better fit.
If features differ, adjusted R^2 is a more reliable comparison metric.
Coefficients, Hypothesis Testing, and P-Values
Coefficient interpretation:
Each coefficient (e.g., (\beta_1)) represents the expected change in the label per unit change in the corresponding feature, holding other features constant.
Example from the advertising data: the TV coefficient was ~0.056; a one-unit increase in TV spending increases sales by ~0.056 units (the unit depends on the data scaling).
Sign of coefficients:
Positive coefficient: as the feature increases, the label tends to increase.
Negative coefficient: as the feature increases, the label tends to decrease.
Hypothesis testing for coefficients:
Null hypothesis: (H_0: \beta_j = 0) (the feature has no impact on the label).
Alternative: (H_a: \beta_j \neq 0) (the feature has an impact).
p-value interpretation: the probability of observing data at least as extreme as what was observed, assuming the null hypothesis is true.
In class example: TV and radio coefficients had very small p-values (almost zero), indicating strong evidence that they have nonzero effects on Sales.
Newspaper coefficient had a relatively large p-value (e.g., 0.44), implying uncertainty about its impact.
Intercept p-value:
The intercept’s p-value is not typically used to draw conclusions about the importance of variables; it often has limited practical interpretive value.
Practical interpretation guideline:
Prefer small p-values for coefficients you want to keep in the model; coefficients with large p-values are candidates for removal, especially when comparing models.
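A hedged sketch of how these coefficient p-values could be reproduced outside AI Studio, using statsmodels (the file and column names advertising.csv, TV, Radio, Newspaper, Sales are assumptions based on the lecture's dataset):
```python
import pandas as pd
import statsmodels.formula.api as smf

# Assumed file/column names; adjust to your file's actual headers.
df = pd.read_csv("advertising.csv")

model = smf.ols("Sales ~ TV + Radio + Newspaper", data=df).fit()

# summary() reports, per coefficient: the estimate, its standard error,
# the t-statistic for H_0: beta_j = 0, and the p-value.
print(model.summary())
print(model.pvalues)   # small p-values -> strong evidence beta_j != 0
```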
Feature Engineering and Interaction Terms
Interaction terms (synergy) in marketing context:
Interaction term example: an engineered feature Synergy = (TV spending) × (Radio spending).
This captures the idea that the combined effect of TV and Radio might be more than the sum of their individual effects.
In the lecture, adding the synergy feature reduced reliance on Newspaper and increased the model’s explanatory power (p-value of the synergy term was near zero, i.e., significant).
Generating features before data splitting:
Feature engineering should occur before splitting the data into training/testing to ensure the new feature appears in both sets.
In the software workflow shown (Turbo Prep), a new column synergy was generated and then added to the model-building process.
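A pandas analogue of that Turbo Prep step (file and column names assumed as above):
```python
import pandas as pd

df = pd.read_csv("advertising.csv")  # assumed file/column names

# Engineer the interaction feature BEFORE any train/test split,
# so the new column exists in both partitions.
df["synergy"] = df["TV"] * df["Radio"]
print(df.head())
```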
Feature selection algorithms:
Several options exist to remove uninformative features (e.g., M5 Prime, t-test-based methods).
When newspaper contributed nothing, M5 Prime removed it, leaving TV and Radio in the model; p-values for retained features remained informative.
If many features exist (e.g., 100s), automatic feature selection helps avoid manual, one-by-one p-value inspection.
Why feature engineering can help:
Adding meaningful interactions can raise R^2 and reveal dependencies that simple features miss.
However, adding features can also shift p-values for existing features; interpretation must consider the chosen features together.
Data Preparation, Splitting, and Model Training Workflow
Data split choices:
Common splits discussed: 50-50, 90-10. With 200 data points, a 50-50 split yields 100 training and 100 testing examples; a 90-10 split yields 180 training and 20 testing examples.
Larger test sets give more robust evaluation; smaller test sets can lead to unstable estimates of performance.
Reproducibility and seeds:
A local random seed was used (e.g., 2) to ensure the same split across runs for consistency in examination or demonstrations.
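A sketch of a reproducible split with scikit-learn (the 200-row frame is a stand-in for the lecture's dataset):
```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"x": range(200), "y": range(200)})  # stand-in, 200 rows

# test_size=0.5 -> 100 train / 100 test; test_size=0.1 -> 180 / 20.
# A fixed random_state (the "local random seed" from the lecture, e.g., 2)
# reproduces the same split across runs.
train, test = train_test_split(df, test_size=0.5, random_state=2)
print(len(train), len(test))  # 100 100
```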
Model configuration in the software:
Use linear regression model; ensure feature selection is set to None initially to include all features.
Run the model and inspect coefficients and p-values.
If you anticipate nonlinearity or interactions, enable feature engineering (interaction terms) and re-evaluate.
Practical workflow for linear regression in this course:
Data import and variable selection (label = Sales; features = TV, Radio, Newspaper, etc.).
Check distributions (e.g., Newspaper is highly skewed; others roughly normal-ish).
Split data (e.g., 50-50 with a fixed seed).
Train linear regression with all features; inspect coefficients, p-values, and R^2.
If needed, perform feature selection (e.g., M5 Prime) and re-train.
Consider feature engineering (e.g., synergy) and re-evaluate.
Practical interpretation of an example model (advertising dataset):
Intercept: approximately 4.87 (base sales when all spend terms are zero).
TV coefficient: about 0.056 and highly significant (p-value near zero).
Radio coefficient: significant (p-value near zero).
Newspaper coefficient: negative or near zero with p-value around 0.44, suggesting little or no reliable impact in the model.
Model terminology recap in this workflow:
Parameters: the learned coefficients (e.g., 0.056 for TV, 0.04–0.05 for Radio, etc.).
Hyperparameters: settings controlling the modeling process (e.g., feature selection method, whether to include interaction terms, etc.).
Performance metrics: RMSE, R^2, and, for a single model, the corresponding p-values for coefficients.
Practical Examples and Takeaways from the Advertising Dataset
Dataset specifics and initial observations:
Dataset: advertising.csv with features TV, Radio, Newspaper and target Sales.
Observations: most features are roughly normally distributed; Newspaper is right-skewed, which may affect calculations if not addressed.
Sample size mentioned: about 200 data points.
Initial modeling approach:
Start with all three features (TV, Radio, Newspaper) and a linear regression model.
Intercept ((\beta_0)) ~ 4.87; TV and Radio coefficients positive with p-values near zero; Newspaper coefficient negative with p-value ~ 0.44.
Feature selection outcomes:
Using M5 Prime (feature selection) tends to drop Newspaper entirely, leaving TV and Radio in the model.
P-values for retained features remain informative (near zero for TV and Radio).
If Newspaper occasionally reappears due to interactions or model changes, re-check p-values and model fit.
Interaction (synergy) exploration:
Creating a synergy feature TV × Radio sometimes improves the model’s R^2 (e.g., R^2 increasing from ~0.91 to ~0.92).
In synergy-enabled models, the p-value for the synergy term can be near zero, indicating a significant interaction effect.
The coefficient for the synergy term can be small in absolute value yet still important: the product TV × Radio takes large numeric values, so even a small coefficient can contribute a meaningful effect, and the p-value reflects how reliably the coefficient differs from zero, not how large it is.
Model comparison guidance:
When comparing models with the same features, higher R^2 indicates a better fit.
When comparing different feature sets, consider adjusted R^2 to account for the number of features.
Decision guidance and practical interpretation:
A significant positive TV coefficient implies that increasing TV spending tends to raise Sales.
A significant Radio coefficient implies a positive relationship as well, though the magnitude matters for budgeting decisions.
Newspaper often shows weak or non-significant impact in this dataset, potentially due to diminishing marginal returns or changes in consumption patterns (e.g., newspaper readership decline).
Final notes on metrics and interpretation:
Root Mean Squared Error (RMSE) and R^2 are your primary regression evaluation metrics.
R^2 ranges [0,1]; higher is better; values around 0.9+ indicate a strong model, acknowledging some noise in real data.
Adjusted R^2 helps guard against artificially high R^2s as more features are added.
The learning process involves balancing feature inclusion, interaction terms, and data preprocessing to maximize explained variance while avoiding overfitting.
Data Distribution, Preprocessing, and Practical Considerations
Distribution notes:
Newspaper feature shows strong skewness; other features are closer to normal distribution.
Skewness can affect regression calculations; future lessons may cover transformations to address skew.
Data preparation best practices:
Do feature engineering (e.g., interaction terms) before data splitting.
Use a fixed seed for splits to maintain reproducibility in demonstrations and exams.
Apply automated feature selection techniques to handle many features when necessary.
Calibration and diagnostics:
After fitting, inspect residuals to assess model fit and potential nonlinearity or heteroscedasticity.
Compare models using R^2, RMSE, and consider adjusted R^2 when adding features.
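A sketch of a residuals-vs-fitted diagnostic plot, reusing the assumed advertising.csv setup from earlier:
```python
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

df = pd.read_csv("advertising.csv")  # assumed file/column names
model = smf.ols("Sales ~ TV + Radio", data=df).fit()

# A structureless horizontal band suggests a good linear fit; curvature
# hints at nonlinearity, a funnel shape at heteroscedasticity.
plt.scatter(model.fittedvalues, model.resid, alpha=0.6)
plt.axhline(0, color="gray", linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()
```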
Exam-oriented takeaways:
You may be asked to interpret a regression output: identify the meaning of coefficients, intercept, p-values, and the R^2 value.
You may be asked to explain why a feature is removed by a feature selection method and how interaction terms affect model performance.
You may be asked to explain why linear regression is sensitive to outliers and how to address them.
Summary definitions (quick reference):
Coefficients: (\beta_j) values learned by the model.
Predictions: \hat{y} = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p
SSE: \text{SSE} = \sum_{i=1}^n (y_i - \hat{y}_i)^2
MSE: \text{MSE} = \frac{\text{SSE}}{n}
RMSE: \text{RMSE} = \sqrt{\text{MSE}}
R-squared: R^2 = 1 - \frac{\text{SSE}}{\text{SST}} with \text{SST} = \sum_{i=1}^n (y_i - \bar{y})^2
Adjusted R-squared: (conceptual adjustment for the number of features; not always available in all tools)
Hypothesis testing for coefficients: H_0: (\beta_j = 0); p-values indicate significance; small p-values imply the coefficient is significantly different from zero.
Interaction term: product of two features (e.g., TV × Radio) to capture synergy; can improve model fit even if the individual coefficients are small.
Quick exam-style recap
If given a regression output: identify the model equation, interpret the intercept and slope(s), discuss significance via p-values, and describe the model’s explanatory power with R^2 and (if provided) adjusted R^2.
Explain why residuals are squared and how SSE and RMSE relate to model quality.
Describe why outliers can distort linear regression results and what steps you might take to handle them.
Explain the difference between parameters and hyperparameters with examples from regression and other models.
Discuss the rationale for feature engineering (including interaction terms) before splitting data and how this can impact p-values and R^2.
Compare models using R^2 vs adjusted R^2 and explain why the latter can be preferable when comparing models with different feature sets.
Summarize how to interpret a regression coefficient in a real-world marketing context (e.g., spending on TV or Radio and its effect on Sales).