Regression Analysis and Linear Regression

Modeling Basics

  • Response variable (Y) and predictive variables (Xs) are combined into a model.
  • Modeling involves extra steps like variable selection.
  • Complex models like machine learning can handle many variables, but regression models require careful selection.
  • Too many variables in regression can reduce the model's interpretability.
  • Calibration and validation are crucial steps in real-life settings.
  • Train the model on a subset of the data and test its generalizability on the remaining data (see the sketch below).
  • A good model should be generalizable to new data, not just fit the training data well.
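
A minimal sketch of this train/test idea in R, assuming a data frame called galton_data with columns parent and child (the dataset used later in this lecture):

    set.seed(42)                                              # make the random split reproducible
    n <- nrow(galton_data)
    train_rows <- sample(seq_len(n), size = round(0.7 * n))   # ~70% of rows for training
    train <- galton_data[train_rows, ]
    test  <- galton_data[-train_rows, ]

    model <- lm(child ~ parent, data = train)                 # fit on the training set
    test_pred <- predict(model, newdata = test)               # predictions on the held-out set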

Course Overview

  • The lecture provides a refresher of material from ENVX1002.
  • It aims to provide necessary knowledge even for those without prior statistics coursework.
  • Today's lecture serves as a cheat sheet for regression.

Regression Fundamentals

  • Regression involves one response variable (Y) and one or more predictive variables (Xs).
  • Relationships between variables can be linear or nonlinear.
  • The course focuses on simple linear regression (one predictor) for theoretical understanding.
  • The majority of the course covers multiple linear regression (multiple predictors).

Reasons for Regression

  1. Determine if a relationship exists between predictors and response.
  2. Estimate new values of the response based on existing data.
    • Example: Estimate cow weights based on existing data from 100 cows.
  3. Test hypotheses about relationships between variables.
    • Test if the model is significant.

The Essence of Regression

  • Regression fits a line of best fit to the data points.
  • Technically, the goal is to minimize the sum of squared residuals.

Lecture Outline: Model Development

  1. Developing the Model
    • Visualizing and summarizing data.
    • Fitting a simple model.
    • Checking assumptions.
    • Interpreting the output.
    • Applying transformations if necessary.

Historical Context

  • Galton and others developed the theory of least squares.
  • Residual: The distance between the observed data point and the regression line.
  • Sum of Squared Residuals: Squaring the residuals to eliminate negative values and summing them.
  • The goal is to minimize the sum of squared residuals to get the best fit.

Simple Linear Regression

  • Involves one predictive variable and one response variable.
  • Example: Galton's data on parent and child heights.

Galton's Data

  • Measures the height of parents (average of mother and father) and their children.
  • Investigates the influence of parent height on child height.
  • Each pair of parents had approximately four children.
  • Height is measured in inches.
  • Data is binned into 0.5-inch size classes.

Checking for Linearity

  • Visually inspect the data to see if the relationship appears linear.

  • Use Pearson's correlation coefficient r to quantify the linear relationship.

  • Formula for Pearson's correlation coefficient (will be provided in exams):

    • r = \frac{\sum{(x_i - \bar{x})(y_i - \bar{y})}}{\sqrt{\sum{(x_i - \bar{x})^2}\sum{(y_i - \bar{y})^2}}}
  • r ranges from -1 to +1.

    • Positive values indicate a positive relationship (as one increases, the other increases).
    • Negative values indicate a negative relationship (as one increases, the other decreases).
    • The absolute value indicates the strength of the relationship (1 is a perfect linear relationship).
  • Caveat: Pearson's correlation coefficient doesn't distinguish patterns well.
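
Pearson's r can be computed in R with cor(); for the Galton data (assuming columns parent and child), a quick check might look like:

    cor(galton_data$parent, galton_data$child)   # returns a value between -1 and +1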

Examples of Correlation Coefficients

  • A coefficient of 0.8 can occur in very different data patterns.
    • Nonlinear data.
    • Data with outliers.
    • Quadratic relationships.
    • Linear data.
  • Always check the plot in addition to the correlation coefficient.

Equation

  • y = \beta_0 + \beta_1 x + \epsilon
  • y is the response variable.
  • \beta_0 is the intercept.
  • \beta_1 is the slope.
  • x is the predictor variable.
  • \epsilon is the error term (residuals).

Understanding the Equation

  • y is the result of the equation.
  • \beta_0 and \beta_1 are constants (fixed once known).
  • x is a known value.
  • The error term is associated with the residual.

Model Fitting

  • \hat{y} represents the predicted value.
  • \hat{y} = \beta_0 + \beta_1 x
  • Residual is the difference between the actual y and the predicted \hat{y}.
  • Residual = y - \hat{y}
  • Residual = y - (\beta_0 + \beta_1 x)
  • The method of least squares minimizes the sum of the squared residuals.
  • \text{Minimize } \sum{(y - \hat{y})^2}

Analytical vs. Numerical Fitting

  • Analytical: Solving for \beta_1 and \beta_0 using equations.
    • \beta_1 = \frac{\sum{(x_i - \bar{x})(y_i - \bar{y})}}{\sum{(x_i - \bar{x})^2}}
    • \beta_0 = \bar{y} - \beta_1\bar{x}
  • Numerical: Estimating \beta_1 and \beta_0 iteratively, tweaking them to reduce the sum of squared residuals.
  • Computers use numerical fitting, especially for large datasets.
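
The analytical solution is easy to write out in R; a short sketch, again assuming galton_data with columns parent (x) and child (y):

    x <- galton_data$parent
    y <- galton_data$child

    beta1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope
    beta0 <- mean(y) - beta1 * mean(x)                                  # intercept

    c(beta0, beta1)   # should match coef(lm(child ~ parent, data = galton_data))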

Model Fitting in R

  • Use the lm function for simple and multiple linear regression.
  • lm(y ~ x, data = galton_data)
  • y is the response variable (child height).
  • x is the predictor variable (average parent height).
  • galton_data is the dataset.
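
For the Galton example, the call would look something like this (column names assumed):

    model <- lm(child ~ parent, data = galton_data)   # child height modelled by mid-parent height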

Hypothesis Testing

  • Revisiting the hypothesis testing process (HATPC).
  • H: Hypothesis
  • A: Assumptions
  • T: Test
  • P: p-values
  • C: Conclusions

Hypothesis for Simple Linear Regression

  • Null hypothesis (H_0): The slope is equal to zero (\beta_1 = 0).
  • Alternative hypothesis (H_1): The slope is not equal to zero (\beta_1 \neq 0).
  • If the slope is zero, the fitted model is simply a horizontal line at the mean of the response variable.
  • Simple linear regression therefore tests whether the model with a slope fits better than just using the mean of the response.

Assumptions of Linear Regression (LINE)

  • L: Linearity
  • I: Independence
  • N: Normality
  • E: Equal Variance (Homoscedasticity)
  • The assumptions apply to the residuals, not the data itself.
  • If the data is non-linear, that non-linearity will show up in the residuals.

Checking Assumptions

  • In base R, use par(mfrow = c(2, 2)) to arrange plots.
  • Use the plot(model) function to generate diagnostic plots.

Diagnostic Plots

  1. Residuals vs. Fitted: Points should be randomly scattered around the line.
  2. QQ Plot: Residuals should follow the one-to-one line.
  3. Scale-Location: Red line should be flat, and points should be evenly scattered.
  4. Residuals vs. Leverage: Check for points outside Cook's distance lines.

Alternative Packages for Checking Assumptions

  • ggfortify package: Provides prettier plots but lacks Cook's distance lines.
  • performance package: Provides instructions for each assumption plot.
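
A possible call using the performance package (model is the lm fit from earlier):

    library(performance)
    check_model(model)   # one diagnostic panel per assumption, with guidance text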

Linearity Assessment

  • Assess whether residuals are linear.
  • Check if the green reference line is flat and horizontal near zero.
  • If violated, consider transforming the data or using a non-linear model.

Independence

  • Addressed during experimental design.
  • Violations can occur with paired data, time series data (autocorrelation), or similar predictive variables (multicollinearity).

Multicollinearity

Definition
  • High correlation between predictive variables (e.g., > 0.9).
Effects
  • Unstable model results.
  • Difficulty in determining the importance of individual predictors.
Solutions
  • Remove one of the highly correlated predictors (usually the less important one).
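
As an illustrative sketch only (y, x1, x2 and my_data are hypothetical, since multicollinearity requires more than one predictor), the performance package can flag collinear predictors:

    model_multi <- lm(y ~ x1 + x2, data = my_data)   # hypothetical multiple regression
    library(performance)
    check_collinearity(model_multi)                  # reports variance inflation factors (VIFs)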

Normality

  • Residuals should be normally distributed at every point along the fitted line.
  • Problems occur with skewed distributions or fanning data.
  • Use histogram and QQ plot from the performance package.
  • Histogram: The bars should follow the overlaid (green) normal curve.
  • QQ Plot: Points should follow the straight reference line.
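
The same checks can be done on the residuals with base R, for example:

    res <- resid(model)

    hist(res)      # should look roughly symmetric and bell-shaped
    qqnorm(res)    # points should follow the reference line
    qqline(res)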

Skewness

Heavy vs. Light Tails
  • Heavy tails: Very skewed.
  • Light tails: Slightly skewed.
Left vs. Right Skew
  • Left skewed: Tail on the left.
  • Right skewed: Tail on the right.
  • Interpret the skew based on the shape of the distribution.

Equal Variances (Homoscedasticity)

  • Check for fanning in the data.
  • Use the homogeneity of variance plot.
  • Green line should be relatively flat, with evenly scattered points.
  • Standardized residuals should mostly lie between -2 and +2.
  • Leverage can distort the model fit: a single point far from the rest of the data can pull the fitted line towards it.

Standardized Residuals

  • Residual divided by the standard error of the residual.
  • Mean of residuals should be zero.
  • Values above 2 (positive or negative) indicate potential outliers.
  • Be cautious if variance is not constant.
  • A distinct pattern in the residuals, such as a U or W shape, indicates non-linearity and makes the residuals questionable.
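
Standardized residuals can be extracted from a fitted lm object with rstandard(); for example:

    std_res <- rstandard(model)
    which(abs(std_res) > 2)   # observations whose standardized residual exceeds 2 in size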

Model Fit and ANOVA

  • ANOVA (Analysis of Variance) is a variation of linear regression.
  • Both partition the total variance into sums of squares for the model and for the residuals (error).
  • These components are combined to form the F-statistic.

ANOVA Table and Regression Output

  • Can generate ANOVA table from a linear regression model.
  • anova(model)
  • Outputs an F-value and a p-value.
  • F-value: \frac{\text{Mean square for the parent variable}}{\text{Mean square of the residuals}} (with a single predictor, its mean square equals its sum of squares).

Interpreting the Regression Output

  • Use summary(model) to get detailed regression output.
  • The F-statistic and p-value in the regression output are the same as in the ANOVA table.
  • The F-statistic reflects how much of the variation in the child's height is explained by the parent's height.
  • Its significance (p-value) is the same value reported in the ANOVA table.

Writing out the Equation

  • Using the estimates from summary(model), we can write out the fitted equation.
  • \hat{y} = \text{intercept} + \text{slope} \times \text{parent}
  • That is, \hat{y} = \beta_0 + \beta_1 \times \text{parent}, where \beta_0 is the intercept estimate and \beta_1 is the slope estimate.
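
The two estimates can also be pulled straight from the fitted model with coef():

    coef(model)   # first element is the intercept (beta_0), second is the slope (beta_1)
    # fitted equation: child_hat = coef(model)[1] + coef(model)[2] * parent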

Interpreting the Relationship

  • For every one-unit change in the predictor (parent height), we expect a change of \beta_1 units in the response (child height).

R-squared Values

  • Two types: multiple R-squared and adjusted R-squared.
  • Multiple R-squared: Use when only one predictor variable is available.
  • Adjusted R-squared: Use when multiple predictors are available; it applies a penalty for each extra variable kept in the model.
  • Adjusted R-squared is always lower than the multiple R-squared.
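
Both values are printed by summary(model) and can be extracted directly if needed:

    summary(model)$r.squared       # multiple R-squared
    summary(model)$adj.r.squared   # adjusted R-squared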

Making Predictions

  • Plug any value of the predictor variable into the model fitted to the training data to estimate the response.
  • Use the fitted equation directly or the predict() function.
  • predict(model, newdata = data.frame(parent = 70))
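
predict() can also return an interval around the estimate; for example, a 95% confidence interval for the mean child height at a mid-parent height of 70 inches:

    predict(model, newdata = data.frame(parent = 70), interval = "confidence")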

Generalizability

  • Assess whether the model can apply to a new data set.
  • Galton's data may not generalize to modern populations, e.g. because of improved nutrition.

Transformations

  • Apply transformations when assumptions are not met, particularly linearity, equal variance, and normality.
  • Transform either the response (y) or predictor (x) variable.
  • Transforming the response (y) is preferable when the non-linearity appears across all predictor variables.
  • It is easier to transform individual x variables when only some predictors show a problem.
  • Beware: the fitted equation becomes harder to interpret once variables are transformed.
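
A sketch of a common case, log-transforming the response (y, x and my_data are placeholders; the response must be strictly positive):

    model_log <- lm(log(y) ~ x, data = my_data)   # transform y inside the formula
    # predictions are on the log scale, so back-transform with exp(predict(model_log, ...))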

Example: Air Quality Data

  • Measure the amount of ozone in New York in 1973.
  • Fit a simple linear regression with temperature as the predictor.
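
R's built-in airquality dataset holds these 1973 New York measurements, so the fit might look like:

    ozone_model <- lm(Ozone ~ Temp, data = airquality)   # ozone (ppb) as a function of temperature (°F)
    summary(ozone_model)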

Checking Assumptions

  • Checking with the performance package shows that the assumptions still may not be fully met.
  • The assessment is somewhat subjective, but the fit can be judged good enough to pass.
  • The residuals may still show some waviness, with a few standardized residuals beyond ±2.