Regression Analysis and Linear Regression
Modeling Basics
- Response variable (Y) and predictive variables (Xs) are combined into a model.
- Modeling involves extra steps like variable selection.
- Complex models like machine learning can handle many variables, but regression models require careful selection.
- Too many variables in regression can reduce the model's interpretability.
- Calibration and validation are crucial steps in real-life settings.
- Train the model on a subset of the data and test its generalizability on the remaining data.
- A good model should be generalizable to new data, not just fit the training data well.
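As a rough illustration of calibration and validation, here is a minimal train/test split sketch in R, assuming a generic data frame `dat` with response `y` and predictor `x` (placeholder names, not from the lecture):

```r
set.seed(42)                                   # reproducible split
n <- nrow(dat)
train_idx <- sample(n, size = floor(0.7 * n))  # 70% of rows for training
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

model <- lm(y ~ x, data = train)               # calibrate on the training data
preds <- predict(model, newdata = test)        # validate on the held-out data
sqrt(mean((test$y - preds)^2))                 # RMSE as a generalizability check
```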
Course Overview
- The lecture provides a refresher for ENVX1002.
- It aims to provide necessary knowledge even for those without prior statistics coursework.
- Today's lecture serves as a cheat sheet for regression.
Regression Fundamentals
- Regression involves one response variable (Y) and one or more predictive variables (Xs).
- Relationships between variables can be linear or nonlinear.
- The course focuses on simple linear regression (one predictor) for theoretical understanding.
- The majority of the course covers multiple linear regression (multiple predictors).
Reasons for Regression
- Determine if a relationship exists between predictors and response.
- Estimate new values of the response based on existing data.
- Example: Estimate cow weights based on existing data from 100 cows.
- Test hypotheses about relationships between variables.
- Test if the model is significant.
The Essence of Regression
- Regression fits a line of best fit to the data points.
- Technically, the goal is to minimize the sum of squared residuals.
Lecture Outline: Model Development
- Developing the Model
- Visualizing and summarizing data.
- Fitting a simple model.
- Checking assumptions.
- Interpreting the output.
- Applying transformations if necessary.
Historical Context
- Galton and others developed the theory of least squares.
- Residual: The distance between the observed data point and the regression line.
- Sum of Squared Residuals: Squaring the residuals to eliminate negative values and summing them.
- The goal is to minimize the sum of squared residuals to get the best fit.
Simple Linear Regression
- Involves one predictive variable and one response variable.
- Example: Galton's data on parent and child heights.
Galton's Data
- Measures the height of parents (average of mother and father) and their children.
- Investigates the influence of parent height on child height.
- Each pair of parents had approximately four children.
- Height is measured in inches.
- Data is binned into 0.5-inch size classes.
Checking for Linearity
Visually inspect the data to see if the relationship appears linear.
Use Pearson's correlation coefficient r to quantify the linear relationship.
Formula for Pearson's correlation coefficient (will be provided in exams):
- r = \frac{\sum{(x_i - \bar{x})(y_i - \bar{y})}}{\sqrt{\sum{(x_i - \bar{x})^2} \sum{(y_i - \bar{y})^2}}}
r ranges from -1 to +1.
- Positive values indicate a positive relationship (as one increases, the other increases).
- Negative values indicate a negative relationship (as one increases, the other decreases).
- The absolute value indicates the strength of the relationship (1 is a perfect linear relationship).
Caveat: the same Pearson's correlation coefficient can arise from very different data patterns.
Examples of Correlation Coefficients
- Coefficient of 0.8 can occur in various data patterns.
- Nonlinear data.
- Data with outliers.
- Quadratic relationships.
- Linear data.
- Always check the plot in addition to the correlation coefficient.
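In R, Pearson's r is computed with `cor()`; a short sketch with made-up height vectors for illustration only:

```r
# Pearson's correlation in R (toy numbers, not Galton's data)
parent <- c(64, 66, 68, 70, 72)
child  <- c(65, 66, 67, 69, 71)

cor(parent, child)   # defaults to method = "pearson"
plot(parent, child)  # always inspect the scatterplot as well
```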
Equation
- y = \beta_0 + \beta_1 x + \epsilon
- y is the response variable.
- \beta_0 is the intercept.
- \beta_1 is the slope.
- x is the predictor variable.
- \epsilon is the error term (residuals).
Understanding the Equation
- y is the result of the equation.
- \beta_0 and \beta_1 are constants (fixed once known).
- x is a known value.
- The error term is associated with the residual.
Model Fitting
- \hat{y} represents the predicted value.
- \hat{y} = \beta_0 + \beta_1 x
- Residual is the difference between the actual y and the predicted \hat{y}.
- Residual = y - \hat{y}
- Residual = y - (\beta_0 + \beta_1 x)
- The method of least squares minimizes the sum of squared residuals.
- \text{Minimize } \sum{(y - \hat{y})^2}
Analytical vs. Numerical Fitting
- Analytical: solving for \beta_1 and \beta_0 using closed-form equations.
- \beta_1 = \frac{\sum{(x_i - \bar{x})(y_i - \bar{y})}}{\sum{(x_i - \bar{x})^2}}
- \beta_0 = \bar{y} - \beta_1\bar{x}
- Numerical: estimating \beta_1 and \beta_0 iteratively, tweaking them to reduce the sum of squared residuals.
- Computers use numerical fitting, especially for large datasets.
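A small sketch (with made-up vectors) showing that the analytical formulas reproduce what `lm()` finds:

```r
# Analytical least squares vs lm(), using toy data
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

beta1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
beta0 <- mean(y) - beta1 * mean(x)
c(beta0, beta1)  # analytical intercept and slope

coef(lm(y ~ x))  # should match the analytical solution
```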
Model Fitting in R
- Use the `lm()` function for simple and multiple linear regression.
- `lm(y ~ x, data = galton_data)`
- `y` is the response variable (child height).
- `x` is the predictor variable (average parent height).
- `galton_data` is the dataset.
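A runnable sketch of the fit, assuming the `Galton` data from the HistData package as a stand-in for the lecture's `galton_data` object:

```r
# install.packages("HistData")  # if not already installed
library(HistData)

model <- lm(child ~ parent, data = Galton)  # child height ~ mid-parent height
model                                       # prints the intercept and slope
```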
Hypothesis Testing
- Revisiting the hypothesis testing process (HATPC).
- H: Hypothesis
- A: Assumptions
- T: Test
- P: p-values
- C: Conclusions
Hypothesis for Simple Linear Regression
- Null hypothesis (H_0): the slope is equal to zero (\beta_1 = 0).
- Alternative hypothesis (H_1): the slope is not equal to zero (\beta_1 \neq 0).
- If the slope is zero, the fitted model is a horizontal line at the mean of the response variable.
- The test therefore asks whether a model with a slope fits the data better than the mean of the response alone.
Assumptions of Linear Regression (LINE)
- L: Linearity
- I: Independence
- N: Normality
- E: Equal Variance (Homoscedasticity)
- The assumptions apply to the residuals, not the data itself.
- If the data is non-linear, the residuals will also be non-linear.
Checking Assumptions
- In base R, use `par(mfrow = c(2, 2))` to arrange the plots in a 2 × 2 grid.
- Use the `plot(model)` function to generate the diagnostic plots.
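For example, continuing with the fitted `model` from above:

```r
par(mfrow = c(2, 2))  # 2 x 2 grid so all four diagnostic plots display together
plot(model)           # residuals vs fitted, QQ, scale-location, residuals vs leverage
par(mfrow = c(1, 1))  # reset the plotting layout afterwards
```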
Diagnostic Plots
- Residuals vs. Fitted: Points should be randomly scattered around the line.
- QQ Plot: Residuals should follow the one-to-one line.
- Scale-Location: Red line should be flat, and points should be evenly scattered.
- Residuals vs. Leverage: Check for points outside Cook's distance lines.
Alternative Packages for Checking Assumptions
- `ggfortify` package: provides prettier plots but lacks Cook's distance lines.
- `performance` package: provides instructions for each assumption plot.
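A sketch of both alternatives on the same fitted `model` (both packages would need to be installed first):

```r
library(ggfortify)
autoplot(model)     # ggplot2 versions of the base R diagnostic plots

library(performance)
check_model(model)  # assumption panels, each annotated with what to look for
```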
Linearity Assessment
- Assess whether residuals are linear.
- Check if the green reference line is flat and horizontal near zero.
- If violated, consider transforming the data or using a non-linear model.
Independence
- Addressed during experimental design.
- Violations can occur with paired data, time series data (autocorrelation), or similar predictive variables (multicollinearity).
Multicollinearity
Definition
- High correlation between predictive variables (e.g., > 0.9).
Effects
- Unstable model results.
- Difficulty in determining the importance of individual predictors.
Solutions
- Remove one of the highly correlated predictors (usually the less important one).
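A hedged sketch of how collinearity might be checked in R, using the built-in `mtcars` data rather than the lecture's example; `check_collinearity()` from the performance package reports variance inflation factors (VIFs):

```r
library(performance)

m <- lm(mpg ~ disp + wt + hp, data = mtcars)  # several correlated predictors
cor(mtcars[, c("disp", "wt", "hp")])          # pairwise correlations (> 0.9 is a red flag)
check_collinearity(m)                         # high VIF values flag multicollinearity
```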
Normality
- Residuals should be normally distributed at every point along the fitted line.
- Problems occur with skewed distributions or fanning data.
- Use the histogram and QQ plot from the `performance` package.
- Histogram: points should follow the green line.
- QQ Plot: Points should follow the flat line.
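A sketch of the normality checks on the fitted `model`, in base R and with the performance package:

```r
hist(residuals(model))    # histogram of the residuals
qqnorm(residuals(model))  # base R QQ plot
qqline(residuals(model))  # reference line the points should follow

library(performance)
check_normality(model)    # prints a plain-language verdict
```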
Skewness
Heavy vs. Light Tails
- Heavy tails: very skewed, with many extreme residuals.
- Light tails: slightly skewed, with few extreme residuals.
Left vs. Right Skew
- Left skewed: Tail on the left.
- Right skewed: Tail on the right.
- Interpret the skew based on the shape of the distribution.
Equal Variances (Homoscedasticity)
- Check for fanning in the data.
- Use the homogeneity of variance plot.
- Green line should be relatively flat, with evenly scattered points.
- Standardized residuals should be less than 2.
- Leverage can distort the model fit.
- A single point far from the rest of the data can pull the fitted line towards it.
Standardized Residuals
- Residual divided by the standard error of the residual.
- Mean of residuals should be zero.
- Values above 2 (positive or negative) indicate potential outliers.
- Be cautious if variance is not constant.
- A distinct shape in the residuals, such as a U or a W, indicates non-linearity and makes the residuals questionable.
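In R, standardized residuals come from `rstandard()`; a short sketch of flagging potential outliers on the fitted `model`:

```r
z <- rstandard(model)   # residuals divided by their estimated standard error
mean(residuals(model))  # raw residuals average to (essentially) zero
which(abs(z) > 2)       # observations beyond +/- 2: potential outliers
```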
Model Fit and ANOVA
- ANOVA (Analysis of Variance) is a variation of linear regression.
- Both partition the total variance into sums of squares for the model and for the residual (error).
- These components form the F-statistic.
ANOVA Table and Regression Output
- An ANOVA table can be generated from a linear regression model with `anova(model)`.
- The output includes an F-value and a p-value.
- F-value: \frac{\text{mean square for the parent variable}}{\text{mean square of the residuals}}
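A sketch of reproducing the F-value by hand from the ANOVA table of the fitted `model`:

```r
tab <- anova(model)                  # columns: Df, Sum Sq, Mean Sq, F value, Pr(>F)
tab
tab$`Mean Sq`[1] / tab$`Mean Sq`[2]  # MS(parent) / MS(residuals) = the F-value
```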
Interpreting the Regression Output
- Use `summary(model)` to get the detailed regression output.
- The F-statistic and its p-value in the regression output are the same values as in the ANOVA table.
- The F-statistic indicates how much the parents' height explains the child's height.
Writing out the Equation
- With the `summary()` function, we can write out the fitted equation.
- \hat{y} = \beta_0 + \beta_1 \times \text{parent}
- That is, predicted child height = intercept (estimate) + slope (\beta_1) × parent height.
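A sketch of pulling the estimates out of the fitted `model` to assemble the equation:

```r
b <- coef(model)  # named vector: "(Intercept)" and "parent" (the slope)
b
cat("child =", round(b[1], 2), "+", round(b[2], 2), "* parent\n")
```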
Interpreting the Relationship
- For every one-unit change in the predictor (parent height), we expect the response (child height) to change by the slope, \beta_1.
R-squared Values
- Two types: multiple R-squared and adjusted R-squared.
- Multiple R-squared: use when there is only one predictor variable.
- Adjusted R-squared: use when there are multiple predictors; it applies a penalty for keeping extra variables in the model.
- Adjusted R-squared is always lower than multiple R-squared.
Making Predictions
- Plug any value of the predictor variable into the fitted equation to estimate the response.
- Use the equation directly or the `predict()` function:
- `predict(model, newdata = data.frame(parent = 70))`
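The same call also returns interval estimates if requested; a short sketch:

```r
predict(model, newdata = data.frame(parent = 70))  # point estimate

predict(model, newdata = data.frame(parent = 70),
        interval = "confidence")                   # adds lower/upper bounds
```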
Generalizability
- Assess whether the model can apply to a new data set.
- Galton's study may not relate to more modern data due to improved nutrition and other changes.
Transformations
- Apply transformations when assumptions are not met, particularly linearity, equal variance, and normality.
- Transform either the response (y) or predictor (x) variable.
- Transforming y is easier when all predictor variables show a non-linear relationship.
- Transforming individual x variables is easier when only some predictors are non-linear.
- Beware: the equation becomes harder to read when transformed.
Example: Air Quality Data
- Measurements of the amount of ozone in New York in 1973.
- Fit a simple linear regression with temperature as the predictor.
Checking Assumptions
- The `performance` package diagnostics may show that some assumptions are still not fulfilled.
- Even if the judgement is somewhat subjective, the fit can be good enough to pass.
- The residuals may still exhibit waviness and some standardized residuals beyond the usual threshold of 2.
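A runnable sketch of this example, assuming the lecture uses R's built-in `airquality` data (daily New York measurements, May to September 1973); the log transformation at the end is one common remedy:

```r
model_oz <- lm(Ozone ~ Temp, data = airquality)  # rows with NA are dropped by lm()

library(performance)
check_model(model_oz)   # diagnostics often look questionable for this fit

# Transforming the response can improve linearity and equal variance
model_log <- lm(log(Ozone) ~ Temp, data = airquality)
check_model(model_log)  # usually closer to passing, if still a bit wavy
summary(model_log)      # but the slope now acts on log(ozone), harder to read
```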