Notes on Linear Regression, Residuals, R-Squared, P-Values, and Feature Engineering

Linear Regression Foundations

  • Purpose of linear regression in this course: predict a label (the target) that is a numerical value.

  • Distinction clarified: the word “regression” also appears in other contexts (e.g., logistic regression, which is used for classification), but when this lecturer uses the term regression, he means linear regression specifically (a model for predicting a numeric label).

  • Basic idea: use one or more features (variables) to predict a numeric label.

  • Correlation context: the correlation coefficient is a number between -1 and +1 that measures the strength of a linear relationship. Linear regression exploits such a relationship to make predictions.

  • Simple linear regression form (example): \hat{y} = \beta_0 + \beta_1 x_1

    • Where (\hat{y}) is the predicted label, (\beta_0) is the intercept, and (\beta_1) is the slope for feature (x_1).

    • In a concrete example given in class: (\hat{y} = 0.1 + 0.78 x). A runnable sketch of this fitting step appears at the end of this section.

    • Interpretation: for every one-unit increase in (x), (\hat{y}) increases by 0.78 units (assuming other factors constant).

  • Model parameters vs hyperparameters:

    • Parameters: the learned coefficients (e.g., (\beta_0, \beta_1, \beta_2, \ldots)) that the model fits from the training data.

    • Hyperparameters: settings you control that govern the learning process (e.g., the number of leaves in a decision tree, the regularization strength). In short: hyperparameters control the learning process, while the learned coefficients are the model's parameters.

  • “Base case” intuition: if you have no information about a new instance, you might predict the average of the training labels (the baseline). Knowing feature values (e.g., gender or age) can shift the prediction away from this average in a sensible way (e.g., different predictions for male vs. female).

  • Regression vs classification terminology recall:

    • For regression, we predict a numerical value ((y) is continuous).

    • For model evaluation in regression, we use metrics like SSE, MSE, RMSE, and R-squared, not plain accuracy.

  • Important visuals concept (residuals): the vertical distance between the observed value and the regression line is a residual (error).
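
  As a concrete illustration of the (\hat{y} = \beta_0 + \beta_1 x_1) form above, here is a minimal sketch that fits a simple linear regression by least squares with NumPy; the data values below are made up for illustration (the lecture's 0.1 + 0.78x example came from its own toy dataset):

```python
import numpy as np

# Toy data (hypothetical values chosen for illustration).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.9, 1.7, 2.4, 3.2, 4.0])

# Fit y = beta_0 + beta_1 * x by least squares.
# np.polyfit returns coefficients from highest degree down.
beta_1, beta_0 = np.polyfit(x, y, deg=1)
print(f"intercept beta_0 = {beta_0:.3f}, slope beta_1 = {beta_1:.3f}")

# Predict for a new instance: y_hat = beta_0 + beta_1 * x_new.
x_new = 2.5
y_hat = beta_0 + beta_1 * x_new
print(f"prediction at x = {x_new}: {y_hat:.3f}")
```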

Data and Model Structure

  • Regression goal: use features to predict a numeric label (the target).

  • In the lecture example, the data are (feature, label) pairs, such as mouse weight (feature) predicting mouse size (label) in a toy illustration.

  • Strong positive correlation intuition: as you move right along the X-axis (feature increasing), Y tends to increase (the data points align along an upward trend).

  • Negative correlation intuition: if a dataset showed a negative correlation, the data would trend downward as X increases.

  • Perfect correlation concept:

    • Perfect correlation corresponds to data points lying exactly on a straight line (very rare in real data).

    • Perfect correlation occurs when the correlation coefficient is exactly 1 or -1.

  • Prediction strategy when nothing is known about a new instance:

    • Baseline approach: predict the average label of the training data (computed in the sketch at the end of this section).

  • When some information is known (e.g., gender), models can adjust predictions using that information (e.g., separate group means, or a gender-based predictor).
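
  A small sketch of two ideas from this section, correlation strength and the mean-label baseline, using hypothetical (weight, size) pairs in the spirit of the mouse example:

```python
import numpy as np

# Hypothetical (feature, label) pairs, e.g., mouse weight -> mouse size.
weight = np.array([2.1, 2.8, 3.3, 3.9, 4.6, 5.2])
size   = np.array([1.4, 1.9, 2.3, 2.6, 3.1, 3.4])

# Correlation coefficient: a number between -1 and +1.
r = np.corrcoef(weight, size)[0, 1]
print(f"correlation r = {r:.3f}")  # strongly positive for this toy data

# Baseline prediction when nothing is known about a new instance:
# just predict the average label of the training data.
baseline = size.mean()
print(f"baseline prediction = {baseline:.3f}")
```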

Residuals, Loss, and Error Metrics

  • Residuals: the vertical distances between observed values and the model’s predictions. They are the error terms for regression.

  • Goal: minimize the total prediction error across all data points.

  • Why we square residuals:

    • To ensure all residuals contribute positively (avoid cancellation of positive and negative errors).

    • To penalize larger errors more heavily.

  • Sum of Squared Residuals (SSR or SSE):
    \text{SSE} = \sum_{i=1}^n (y_i - \hat{y}_i)^2

  • Mean Squared Error (MSE):
    \text{MSE} = \frac{\text{SSE}}{n}

  • Root Mean Squared Error (RMSE):
    \text{RMSE} = \sqrt{\text{MSE}} = \sqrt{\frac{\text{SSE}}{n}}

  • Explained and total deviation intuition (R-squared will come later):

    • Total sum of squares (SST):
      \text{SST} = \sum_{i=1}^n (y_i - \bar{y})^2

    • Residual sum of squares (SSE) is the unexplained part.

  • Regression line shaping (the bowl-curve concept): as you tilt the line (adjust the slope), SSE changes; it decreases to a minimum and then increases again if you tilt too far, tracing a bowl-shaped curve. The sketch at the end of this section sweeps the slope to show this, and also computes the error metrics above.

  • Intercept vs slope:

    • Slope ((\beta_1)) controls how steeply the line rises or falls with X.

    • Intercept ((\beta_0)) shifts the line up or down.

  • Data sensitivity:

    • Linear regression is sensitive to the training data used. Different training subsets can yield different regression lines, illustrating the impact of data selection on the model.

  • Extreme values (outliers):

    • Extreme data points can disproportionately affect the fitted line in linear regression.

    • Linear regression is less tolerant of extreme values; outlier handling or robust methods may be needed.

    • By contrast, some other models (e.g., decision trees) may tolerate extreme values better.
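
  The sketch below computes the residual-based metrics defined in this section, then sweeps the slope around its fitted value to show the bowl-shaped SSE curve; the toy data are invented for illustration:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.9, 1.7, 2.4, 3.2, 4.0])

beta_1, beta_0 = np.polyfit(x, y, deg=1)
y_hat = beta_0 + beta_1 * x

# Residuals: vertical distances between observed and predicted values.
residuals = y - y_hat

sse  = np.sum(residuals ** 2)        # sum of squared residuals
mse  = sse / len(y)                  # mean squared error
rmse = np.sqrt(mse)                  # root mean squared error
sst  = np.sum((y - y.mean()) ** 2)   # total sum of squares
print(f"SSE={sse:.4f}  MSE={mse:.4f}  RMSE={rmse:.4f}  SST={sst:.4f}")

# The "bowl curve": sweep the slope around the fitted value and watch
# SSE fall to a minimum, then rise again as the line tilts too far.
for slope in np.linspace(beta_1 - 0.4, beta_1 + 0.4, 5):
    sse_s = np.sum((y - (beta_0 + slope * x)) ** 2)
    print(f"slope={slope:.2f} -> SSE={sse_s:.4f}")
```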

Model Evaluation: R-Squared and Related Metrics

  • R-squared (R^2) intuition:

    • R^2 is the proportion of variance in the dependent variable that is predictable from the independent variables.

    • Range: 0 to 1. Higher is better (closer to 1 means the model explains more variance).

  • Formal definition:
    R^2 = 1 - \frac{\text{SSE}}{\text{SST}} = 1 - \frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2}{\sum_{i=1}^n (y_i - \bar{y})^2}

  • Example interpretations from the lecture:

    • An R^2 of about 0.91 means the model explains about 91% of the variation in the label around its mean.

    • In the example, an R^2 of 0.92 was observed after adding an interaction term (synergy), indicating the model explains about 92% of the variance.

  • Adjusted R-squared:

    • Adjusted R^2 accounts for the number of predictors and helps prevent automatic increases in R^2 as features are added.

    • Formula (conceptual): penalizes the number of predictors; the AI Studio software used in this course does not expose an explicit adjusted R^2 metric. (A standard textbook form is computed in the sketch at the end of this section.)

    • Important caveat: R^2 tends to increase with more features even if they are not truly informative; adjusted R^2 mitigates this.

  • R-squared vs model complexity comparison:

    • When comparing models with exactly the same features, higher R^2 indicates a better fit.

    • If features differ, adjusted R^2 is a more reliable comparison metric.
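
  A minimal sketch of both metrics, assuming the standard textbook form of adjusted R^2 (since AI Studio does not expose one); all input values are made up:

```python
import numpy as np

def r_squared(y, y_hat):
    """R^2 = 1 - SSE/SST: share of variation around the mean explained."""
    sse = np.sum((y - y_hat) ** 2)
    sst = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - sse / sst

def adjusted_r_squared(r2, n, p):
    """Standard textbook adjustment for p predictors and n observations."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# Toy usage with made-up values:
y     = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_hat = np.array([1.1, 1.9, 3.2, 3.8, 5.1])
r2 = r_squared(y, y_hat)
print(f"R^2 = {r2:.3f}")
print(f"adjusted R^2 (n=5, p=1) = {adjusted_r_squared(r2, n=5, p=1):.3f}")
```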

Coefficients, Hypothesis Testing, and P-Values

  • Coefficient interpretation:

    • Each coefficient (e.g., (\beta_1)) represents the expected change in the label per unit change in the corresponding feature, holding other features constant.

    • Example from the advertising data: the TV coefficient was ~0.056; a one-unit increase in TV spending is associated with an increase in Sales of ~0.056 units, holding the other features constant (the unit depends on the data scaling).

  • Sign of coefficients:

    • Positive coefficient: as the feature increases, the label tends to increase.

    • Negative coefficient: as the feature increases, the label tends to decrease.

  • Hypothesis testing for coefficients:

    • Null hypothesis: (H_0: \beta_j = 0) (the feature has no impact on the label).

    • Alternative: (H_a: \beta_j \neq 0) (the feature has an impact).

    • p-value interpretation: the probability of observing a result at least as extreme as the one seen, assuming the null hypothesis is true. (The sketch at the end of this section reproduces this on synthetic data.)

    • In class example: TV and radio coefficients had very small p-values (almost zero), indicating strong evidence that they have nonzero effects on Sales.

    • Newspaper coefficient had a relatively large p-value (e.g., 0.44), implying uncertainty about its impact.

  • Intercept p-value:

    • The intercept’s p-value is not typically used to draw conclusions about the importance of variables; it often has limited practical interpretive value.

  • Practical interpretation guideline:

    • Prefer small p-values for coefficients you want to keep in the model; coefficients with large p-values are candidates for removal, especially when comparing models.
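
  A sketch of coefficient hypothesis testing using statsmodels on synthetic, advertising-shaped data; the generating coefficients below are chosen to mimic the lecture's pattern (TV and Radio matter, Newspaper does not) and are not the real dataset:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Synthetic data shaped like the advertising example. All numbers made up.
rng = np.random.default_rng(2)
n = 200
df = pd.DataFrame({
    "TV":        rng.uniform(0, 300, n),
    "Radio":     rng.uniform(0, 50, n),
    "Newspaper": rng.uniform(0, 100, n),
})
df["Sales"] = 4.9 + 0.056 * df["TV"] + 0.10 * df["Radio"] + rng.normal(0, 1.5, n)

# Fit OLS; add_constant supplies the intercept column.
X = sm.add_constant(df[["TV", "Radio", "Newspaper"]])
results = sm.OLS(df["Sales"], X).fit()

# Coefficients and their p-values (H_0: beta_j = 0).
print(results.params)    # expect TV ~ 0.056, Radio ~ 0.10, Newspaper ~ 0
print(results.pvalues)   # tiny p-values for TV/Radio, large for Newspaper
```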

Feature Engineering and Interaction Terms

  • Interaction terms (synergy) in marketing context:

    • Interaction term example: an engineered feature Synergy = (TV spending) Ă— (Radio spending).

    • This captures the idea that the combined effect of TV and Radio might be more than the sum of their individual effects.

    • In the lecture, adding the synergy feature reduced reliance on Newspaper and increased the model’s explanatory power (p-value of the synergy term was near zero, i.e., significant).

  • Generating features before data splitting:

    • Feature engineering should occur before splitting the data into training/testing to ensure the new feature appears in both sets.

    • In the software workflow shown (Turbo Prep), a new synergy column was generated and then added to the model-building process. (A pandas equivalent is sketched at the end of this section.)

  • Feature selection algorithms:

    • Several options exist to remove uninformative features (e.g., M5 Prime, t-test-based methods).

    • When Newspaper contributed nothing, M5 Prime removed it, leaving TV and Radio in the model; the p-values for the retained features remained informative.

    • If many features exist (e.g., 100s), automatic feature selection helps avoid manual, one-by-one p-value inspection.

  • Why feature engineering can help:

    • Adding meaningful interactions can raise R^2 and reveal dependencies that simple features miss.

    • However, adding features can also shift p-values for existing features; interpretation must consider the chosen features together.
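
  A pandas sketch of engineering the synergy column before the split, assuming a local advertising.csv with the column names used in the lecture:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Advertising-style frame with TV, Radio, Newspaper, Sales columns assumed.
df = pd.read_csv("advertising.csv")

# Engineer the interaction BEFORE splitting, so the new column
# exists in both the training and the test partitions.
df["synergy"] = df["TV"] * df["Radio"]

train, test = train_test_split(df, test_size=0.5, random_state=2)
print(train.columns.tolist())  # synergy is present in both splits
```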

Data Preparation, Splitting, and Model Training Workflow

  • Data split choices:

    • Common splits discussed: 50-50, 90-10. With 200 data points, a 50-50 split yields 100 training and 100 testing examples; a 90-10 split yields 180 training and 20 testing examples.

    • Larger test sets give more robust evaluation; smaller test sets can lead to unstable estimates of performance.

  • Reproducibility and seeds:

    • A local random seed was used (e.g., 2) to ensure the same split across runs for consistency in examination or demonstrations.

  • Model configuration in the software:

    • Use the linear regression model; initially set feature selection to none so that all features are included.

    • Run the model and inspect coefficients and p-values.

    • If you anticipate nonlinearity or interactions, enable feature engineering (interaction terms) and re-evaluate.

  • Practical workflow for linear regression in this course:

    • Data import and variable selection (label = Sales; features = TV, Radio, Newspaper, etc.).

    • Check distributions (e.g., Newspaper is highly skewed; the other features are roughly normal).

    • Split data (e.g., 50-50 with a fixed seed).

    • Train linear regression with all features; inspect coefficients, p-values, and R^2.

    • If needed, perform feature selection (e.g., M5 Prime) and re-train.

    • Consider feature engineering (e.g., synergy) and re-evaluate. (An end-to-end sketch of this workflow appears at the end of this section.)

  • Practical interpretation of an example model (advertising dataset):

    • Intercept: approximately 4.87 (base sales when all spend terms are zero).

    • TV coefficient: about 0.056 and highly significant (p-value near zero).

    • Radio coefficient: significant (p-value near zero).

    • Newspaper coefficient: negative or near zero with p-value around 0.44, suggesting little or no reliable impact in the model.

  • Model terminology recap in this workflow:

    • Parameters: the learned coefficients (e.g., 0.056 for TV, 0.04–0.05 for Radio, etc.).

    • Hyperparameters: settings controlling the modeling process (e.g., feature selection method, whether to include interaction terms, etc.).

    • Performance metrics: RMSE, R^2, and, for a single model, the corresponding p-values for coefficients.
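
  An end-to-end sketch of this workflow with scikit-learn, assuming a local advertising.csv with the lecture's column names; the 50-50 split and fixed seed mirror the lecture's settings:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

df = pd.read_csv("advertising.csv")  # TV, Radio, Newspaper, Sales assumed

X = df[["TV", "Radio", "Newspaper"]]
y = df["Sales"]

# 50-50 split with a fixed seed for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=2
)

model = LinearRegression().fit(X_train, y_train)
print("intercept:", model.intercept_)  # base Sales when all spends are 0
print("coefficients:", dict(zip(X.columns, model.coef_)))

pred = model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, pred))
print(f"RMSE = {rmse:.3f}, R^2 = {r2_score(y_test, pred):.3f}")
```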

Practical Examples and Takeaways from the Advertising Dataset

  • Dataset specifics and initial observations:

    • Dataset: advertising.csv with features TV, Radio, Newspaper and target Sales.

    • Observations: most features are roughly normally distributed, but Newspaper is right-skewed, which may affect the fit if not addressed.

    • Sample size mentioned: about 200 data points.

  • Initial modeling approach:

    • Start with all three features (TV, Radio, Newspaper) and a linear regression model.

    • Intercept ((\beta_0)) ~ 4.87; TV and Radio coefficients positive with p-values near zero; Newspaper coefficient negative with p-value ~ 0.44.

  • Feature selection outcomes:

    • Using M5 Prime (feature selection) tends to drop Newspaper entirely, leaving TV and Radio in the model.

    • P-values for retained features remain informative (near zero for TV and Radio).

    • If Newspaper occasionally reappears due to interactions or model changes, re-check p-values and model fit.

  • Interaction (synergy) exploration:

    • Creating a synergy feature TV Ă— Radio sometimes improves the model’s R^2 (e.g., R^2 increasing from ~0.91 to ~0.92).

    • In synergy-enabled models, the p-value for the synergy term can be near zero, indicating a significant interaction effect.

    • The coefficient of the synergy term can be small in absolute value while its p-value still indicates importance: because the raw product TV Ă— Radio takes large values, even a small coefficient can represent a meaningful multiplicative effect. (The comparison sketch at the end of this section contrasts models with and without the synergy term.)

  • Model comparison guidance:

    • When comparing models with the same features, higher R^2 indicates a better fit.

    • When comparing different feature sets, consider adjusted R^2 to account for the number of features.

  • Decision guidance and practical interpretation:

    • A significant positive TV coefficient implies that increasing TV spending tends to raise Sales.

    • A significant Radio coefficient implies a positive relationship as well, though the magnitude matters for budgeting decisions.

    • Newspaper often shows weak or non-significant impact in this dataset, potentially due to diminishing marginal returns or changes in consumption patterns (e.g., newspaper readership decline).

  • Final notes on metrics and interpretation:

    • Root Mean Squared Error (RMSE) and R^2 are your primary regression evaluation metrics.

    • R^2 ranges [0,1]; higher is better; values around 0.9+ indicate a strong model, acknowledging some noise in real data.

    • Adjusted R^2 helps guard against artificially high R^2s as more features are added.

    • The learning process involves balancing feature inclusion, interaction terms, and data preprocessing to maximize explained variance while avoiding overfitting.
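
  A comparison sketch contrasting feature sets with and without the synergy term, using the textbook adjusted R^2 formula; it assumes the same local advertising.csv, and the exact scores will depend on the data:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

df = pd.read_csv("advertising.csv")  # columns assumed as in the lecture
df["synergy"] = df["TV"] * df["Radio"]
y = df["Sales"]

def fit_and_score(features):
    """Fit on the given features; return in-sample R^2 and adjusted R^2."""
    X = df[features]
    pred = LinearRegression().fit(X, y).predict(X)
    r2 = r2_score(y, pred)
    n, p = len(y), len(features)
    adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)
    return r2, adj

for features in (["TV", "Radio"], ["TV", "Radio", "synergy"]):
    r2, adj = fit_and_score(features)
    print(features, f"R^2={r2:.3f}", f"adjusted R^2={adj:.3f}")
```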

Data Distribution, Preprocessing, and Practical Considerations

  • Distribution notes:

    • Newspaper feature shows strong skewness; other features are closer to normal distribution.

    • Skewness can affect regression calculations; future lessons may cover transformations to address skew.

  • Data preparation best practices:

    • Do feature engineering (e.g., interaction terms) before data splitting.

    • Use a fixed seed for splits to maintain reproducibility in demonstrations and exams.

    • Apply automated feature selection techniques to handle many features when necessary.

  • Calibration and diagnostics:

    • After fitting, inspect residuals to assess model fit and to spot potential nonlinearity or heteroscedasticity (see the diagnostic sketch at the end of this section).

    • Compare models using R^2, RMSE, and consider adjusted R^2 when adding features.

  • Exam-oriented takeaways:

    • You may be asked to interpret a regression output: identify the meaning of coefficients, intercept, p-values, and the R^2 value.

    • You may be asked to explain why a feature is removed by a feature selection method and how interaction terms affect model performance.

    • You may be asked to explain why linear regression is sensitive to outliers and how to address them.

  • Summary definitions (quick reference):

    • Coefficients: (\beta_j) values learned by the model.

    • Predictions: \hat{y} = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p

    • SSE: \text{SSE} = \sum_{i=1}^n (y_i - \hat{y}_i)^2

    • MSE: \text{MSE} = \frac{\text{SSE}}{n}

    • RMSE: \text{RMSE} = \sqrt{\text{MSE}}

    • R-squared: R^2 = 1 - \frac{\text{SSE}}{\text{SST}} with \text{SST} = \sum_{i=1}^n (y_i - \bar{y})^2

    • Adjusted R-squared: (conceptual adjustment for number of features; not always available in all tools)

    • Hypothesis testing for coeffs: (H_0: \beta_j = 0); p-values indicate significance; small p-values imply the coefficient is significantly different from zero.

    • Interaction term: product of two features (e.g., TV Ă— Radio) to capture synergy; can improve model fit even if the individual coefficients are small.
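
  A diagnostic sketch covering two points from this section: a skewness check on the features and a residuals-versus-predictions plot; it assumes the same local advertising.csv:

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

df = pd.read_csv("advertising.csv")  # columns assumed as in the lecture

# Skewness check: Newspaper tends to be right-skewed in this dataset.
print(df[["TV", "Radio", "Newspaper"]].skew())

X, y = df[["TV", "Radio", "Newspaper"]], df["Sales"]
pred = LinearRegression().fit(X, y).predict(X)
residuals = y - pred

# Residuals vs. predictions: a patternless horizontal band suggests a
# reasonable fit; curvature hints at nonlinearity, a funnel shape at
# heteroscedasticity.
plt.scatter(pred, residuals, alpha=0.6)
plt.axhline(0, color="red")
plt.xlabel("predicted Sales")
plt.ylabel("residual")
plt.show()
```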

Quick exam-style recap

  • If given a regression output: identify the model equation, interpret the intercept and slope(s), discuss significance via p-values, and describe the model’s explanatory power with R^2 and (if provided) adjusted R^2.

  • Explain why residuals are squared and how SSE and RMSE relate to model quality.

  • Describe why outliers can distort linear regression results and what steps you might take to handle them.

  • Explain the difference between parameters and hyperparameters with examples from regression and other models.

  • Discuss the rationale for feature engineering (including interaction terms) before splitting data and how this can impact p-values and R^2.

  • Compare models using R^2 vs adjusted R^2 and explain why the latter can be preferable when comparing models with different feature sets.

  • Summarize how to interpret a regression coefficient in a real-world marketing context (e.g., spending on TV or Radio and its effect on Sales).