Regression Analysis and Linear Regression
Modeling Basics
- Response variable (Y) and predictor variables (Xs) are combined into a model.
- Modeling involves extra steps like variable selection.
- Complex models like machine learning can handle many variables, but regression models require careful selection.
- Too many variables in regression can reduce the model's interpretability.
- Calibration and validation are crucial steps in real-life settings.
- Train the model on a subset of the data and test its generalizability on the remaining data.
- A good model should be generalizable to new data, not just fit the training data well.
Course Overview
- The lecture provides a refresher of ENVX1002.
- It aims to provide necessary knowledge even for those without prior statistics coursework.
- Today's lecture serves as a cheat sheet for regression.
Regression Fundamentals
- Regression involves one response variable (Y) and one or more predictor variables (Xs).
- Relationships between variables can be linear or nonlinear.
- The course focuses on simple linear regression (one predictor) for theoretical understanding.
- The majority of the course covers multiple linear regression (multiple predictors).
Reasons for Regression
- Determine if a relationship exists between predictors and response.
- Estimate new values of the response based on existing data.
- Example: Estimate cow weights based on existing data from 100 cows.
- Test hypotheses about relationships between variables.
- Test if the model is significant.
The Essence of Regression
- Regression fits a line of best fit to the data points.
- Technically, the goal is to minimize the sum of squared residuals.
Lecture Outline: Model Development
- Developing the Model
- Visualizing and summarizing data.
- Fitting a simple model.
- Checking assumptions.
- Interpreting the output.
- Applying transformations if necessary.
Historical Context
- Galton and others developed the theory of least squares.
- Residual: The distance between the observed data point and the regression line.
- Sum of Squared Residuals: Squaring the residuals to eliminate negative values and summing them.
- The goal is to minimize the sum of squared residuals to get the best fit.
Simple Linear Regression
- Involves one predictive variable and one response variable.
- Example: Galton's data on parent and child heights.
Galton's Data
- Measures the height of parents (average of mother and father) and their children.
- Investigates the influence of parent height on child height.
- Each pair of parents had approximately four children.
- Height is measured in inches.
- Data is binned into 0.5-inch size classes.
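As a sketch of how this data might be loaded and inspected in R (assuming the `HistData` package, which ships a version of Galton's dataset; the course's `galton_data` may differ):

```r
# Galton's parent/child height data (inches) from the HistData package
# install.packages("HistData")  # if not already installed
library(HistData)

galton_data <- Galton   # columns: parent (mid-parent height), child
str(galton_data)        # inspect the structure
summary(galton_data)    # five-number summaries of both variables
plot(child ~ parent, data = galton_data,
     xlab = "Mid-parent height (in)", ylab = "Child height (in)")
```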
Checking for Linearity
- Visually inspect the data to see if the relationship appears linear.
- Use Pearson's correlation coefficient to quantify the linear relationship.
- Formula for Pearson's correlation coefficient (will be provided in exams):
- $r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}$
- r ranges from -1 to +1.
- Positive values indicate a positive relationship (as one increases, the other increases).
- Negative values indicate a negative relationship (as one increases, the other decreases).
- The absolute value indicates the strength of the relationship (1 is a perfect linear relationship).
- Caveat: Pearson's correlation coefficient doesn't distinguish between data patterns; very different patterns can produce the same value.
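A minimal sketch of computing Pearson's r in R, assuming the `galton_data` object from the example above:

```r
# Pearson's correlation between parent and child heights
r <- cor(galton_data$parent, galton_data$child)  # method = "pearson" is the default
r  # between -1 and +1; roughly 0.46 for Galton's data
```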
Examples of Correlation Coefficients
- A coefficient of 0.8 can occur in various data patterns:
  - Nonlinear data.
  - Data with outliers.
  - Quadratic relationships.
  - Linear data.
- Always check the plot in addition to the correlation coefficient.
Equation
- $y = \beta_0 + \beta_1 x + \epsilon$
- y is the response variable.
- $\beta_0$ is the intercept.
- $\beta_1$ is the slope.
- x is the predictor variable.
- $\epsilon$ is the error term (residuals).
Understanding the Equation
- y is the result of the equation.
- $\beta_0$ and $\beta_1$ are constants (fixed once known).
- x is a known value.
- The error term $\epsilon$ is associated with the residual.
Model Fitting
- $\hat{y}$ represents the predicted value.
- Residual is the difference between the observed y and the predicted $\hat{y}$.
- The method of least squares minimizes the sum of squared residuals.
Analytical vs. Numerical Fitting
- Analytical: Solving for $\beta_0$ and $\beta_1$ directly using closed-form equations.
- Numerical: Estimating $\beta_0$ and $\beta_1$ iteratively, tweaking them to reduce the residuals.
- Computers use numerical fitting, especially for large datasets.
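A minimal sketch of the analytical (closed-form) solution, using simulated data so it runs on its own:

```r
# Simulate data with a known intercept and slope
set.seed(42)
x <- runif(100, 0, 10)
y <- 2 + 0.5 * x + rnorm(100, sd = 1)

# Closed-form least-squares estimates
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope
b0 <- mean(y) - b1 * mean(x)                                     # intercept

# These match the coefficients lm() reports
c(b0, b1)
coef(lm(y ~ x))
```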
Model Fitting in R
- Use the `lm()` function for simple and multiple linear regression: `lm(y ~ x, data = galton_data)`.
- `y` is the response variable (child height).
- `x` is the predictor variable (average parent height).
- `galton_data` is the dataset.
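Putting this together as a runnable sketch, again assuming the `galton_data` object from the `HistData` example:

```r
# Fit a simple linear regression of child height on mid-parent height
model <- lm(child ~ parent, data = galton_data)
model  # prints the estimated intercept and slope
```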
Hypothesis Testing
- Revisiting the hypothesis testing process (HATPC).
- H: Hypothesis
- A: Assumptions
- T: Test
- P: p-values
- C: Conclusions
Hypothesis for Simple Linear Regression
- Null hypothesis ($H_0$): The slope is equal to zero ($\beta_1 = 0$).
- Alternative hypothesis ($H_1$): The slope is not equal to zero ($\beta_1 \neq 0$).
- If the slope is zero, the model reduces to a horizontal line at the mean of the response variable y.
- Simple linear regression therefore tests whether a model with a slope fits better than the mean of the response.
Assumptions of Linear Regression (LINE)
- L: Linearity
- I: Independence
- N: Normality
- E: Equal Variance (Homoscedasticity)
- The assumptions apply to the residuals, not the data itself.
- If the data is non-linear, the residuals will also be non-linear.
Checking Assumptions
- In base R, use `par(mfrow = c(2, 2))` to arrange plots.
- Use the `plot(model)` function to generate diagnostic plots.
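As a runnable sketch, assuming `model` from the fitting step above:

```r
# Arrange the four diagnostic plots in a 2 x 2 grid
par(mfrow = c(2, 2))
plot(model)           # residuals vs fitted, QQ, scale-location, residuals vs leverage
par(mfrow = c(1, 1))  # reset the plotting layout
```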
Diagnostic Plots
- Residuals vs. Fitted: Points should be randomly scattered around the line.
- QQ Plot: Residuals should follow the one-to-one line.
- Scale-Location: Red line should be flat, and points should be evenly scattered.
- Residuals vs. Leverage: Check for points outside Cook's distance lines.
Alternative Packages for Checking Assumptions
- `ggfortify` package: Provides prettier plots but lacks Cook's distance lines.
- `performance` package: Provides instructions for each assumption plot.
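A hedged sketch of both alternatives, assuming the packages are installed (exact plot appearance depends on package versions):

```r
# ggfortify: ggplot2-styled versions of the base diagnostic plots
library(ggfortify)
autoplot(model)

# performance: annotated assumption checks with guidance on each panel
library(performance)
check_model(model)
```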
Linearity Assessment
- Assess whether residuals are linear.
- Check if the green reference line is flat and horizontal near zero.
- If violated, consider transforming the data or using a non-linear model.
Independence
- Addressed during experimental design.
- Violations can occur with paired data, time series data (autocorrelation), or similar predictive variables (multicollinearity).
Multicollinearity
Definition
- High correlation between predictive variables (e.g., > 0.9).
Effects
- Unstable model results.
- Difficulty in determining the importance of individual predictors.
Solutions
- Remove one of the highly correlated predictors (usually the less important one).
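A minimal sketch of checking for multicollinearity, using simulated predictors (hypothetical here, since the Galton model has only one predictor):

```r
# Simulate two highly correlated predictors
set.seed(1)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.1)   # nearly a copy of x1
y  <- 1 + 2 * x1 + rnorm(100)

cor(x1, x2)   # pairwise correlation (> 0.9 is a red flag)

m <- lm(y ~ x1 + x2)
performance::check_collinearity(m)  # variance inflation factors (VIFs)
```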
Normality
- Residuals should be normally distributed at every point along the fitted line.
- Problems occur with skewed distributions or fanning data.
- Use the histogram and QQ plot from the `performance` package.
- Histogram: Points should follow the green line.
- QQ Plot: Points should follow the flat line.
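A sketch of checking normality in base R and with `performance`, assuming `model` from above:

```r
# Base R: histogram and QQ plot of the residuals
hist(residuals(model), main = "Histogram of residuals")
qqnorm(residuals(model))
qqline(residuals(model))   # points should hug this reference line

# performance: a formal check with a plain-language message
performance::check_normality(model)
```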
Skewness
Heavy vs. Light Tails
- Heavy tails: Very skewed.
- Light tails: Slightly skewed.
Left vs. Right Skew
- Left skewed: Tail on the left.
- Right skewed: Tail on the right.
- Interpret the skew based on the shape of the distribution.
Equal Variances (Homoscedasticity)
- Check for fanning in the data.
- Use the homogeneity of variance plot.
- Green line should be relatively flat, with evenly scattered points.
- Standardized residuals should be within ±2.
- Leverage can distort the model fit.
- A single point far from the others can shift the fitted line.
Standardized Residuals
- Residual divided by the standard error of the residual.
- Mean of residuals should be zero.
- Values above 2 (positive or negative) indicate potential outliers.
- Be cautious if variance is not constant.
- A distinct shape, such as a U or a W, indicates non-linearity and questionable residuals.
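A minimal sketch of inspecting standardized residuals in R, assuming `model` from above:

```r
# Standardized residuals: residual divided by its estimated standard error
z <- rstandard(model)

mean(residuals(model))   # raw residuals average to (essentially) zero
which(abs(z) > 2)        # observations flagged as potential outliers
```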
Model Fit and ANOVA
- ANOVA (Analysis of Variance) is a variation of linear regression.
- Both partition variance into sums of squares for the model (regression) and the residual (error).
- These components form the F-statistic.
ANOVA Table and Regression Output
- An ANOVA table can be generated from a linear regression model with `anova(model)`.
- The table outputs an F-value and a p-value.
- F-value: the ratio of the variance explained by the model to the residual variance.
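A sketch showing where the F-value comes from, assuming `model` from above:

```r
aov_tab <- anova(model)   # ANOVA table for the regression
aov_tab

# The F-value is the model mean square over the residual mean square
ms_model    <- aov_tab$"Mean Sq"[1]
ms_residual <- aov_tab$"Mean Sq"[2]
ms_model / ms_residual    # matches the F value in the table
```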
Interpreting the Regression Output
- Use `summary(model)` to get detailed regression output.
- The F-statistic and p-value in the regression output are the same as in the ANOVA table.
- The F-statistic tells how much the parents' height explains the child's height.
- The significance (p-value) of the F-statistic is the same value in both outputs.
Writing out the Equation
- With the `summary()` output, we can write out the fitted equation.
- child = intercept + slope × parent
- In symbols: $\hat{y} = \beta_0 + \beta_1 \times \text{parent}$, using the estimates from the output.
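A sketch of pulling the estimates out of the fitted model, assuming `model` from above:

```r
coef(model)   # intercept (beta_0) and slope (beta_1)
b <- coef(model)

# Write out the fitted equation: child = b0 + b1 * parent
cat("child =", round(b[1], 2), "+", round(b[2], 2), "* parent\n")
```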
Interpreting the Relationship
- For every one-unit change in the predictor (parent height), we expect a change of $\beta_1$ (the slope) in the response (child height).
R-squared Values
- Two types: multiple R-squared and adjusted R-squared.
- Multiple R-squared: Use when only one predictor variable is available.
- Adjusted R-squared: Use when multiple predictors are available; it applies a penalty for each extra variable kept in the model.
- Adjusted R-squared is always lower than multiple R-squared.
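Both values can be read directly off the `summary()` object, as sketched below (assuming `model` from above):

```r
s <- summary(model)
s$r.squared       # multiple R-squared
s$adj.r.squared   # adjusted R-squared (penalised; lower than the above)
```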
Making Predictions
- Plug any value of the predictor variable into the fitted equation to estimate the response.
- Use the equation or the `predict()` function: `predict(model, newdata = data.frame(parent = 70))`.
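A sketch of predicting by hand and with `predict()`, assuming `model` from above; the `interval = "confidence"` argument adds a confidence interval around the estimate:

```r
# By hand, from the fitted equation
b <- coef(model)
b[1] + b[2] * 70   # predicted child height for a mid-parent height of 70 in

# With predict(), including a confidence interval
predict(model, newdata = data.frame(parent = 70), interval = "confidence")
```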
Generalizability
- Assess whether the model can apply to a new data set.
- Galton's study may not relate to more modern data due to improved nutrition and other changes.
Transformations
- Apply transformations when assumptions are not met, particularly linearity, equal variance, and normality.
- Transform either the response (y) or predictor (x) variable.
- Transforming y is easier when all predictor variables show non-linearity.
- When only some predictors are non-linear, it is easier to transform just those individual x variables.
- Beware: the equation becomes harder to read when transformed.
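A minimal sketch of a log transformation on simulated data (assumed for illustration, not from the lecture), showing both options:

```r
# Simulate a response that grows exponentially with x
set.seed(7)
df <- data.frame(x = runif(100, 1, 10))
df$y <- exp(0.5 + 0.3 * df$x + rnorm(100, sd = 0.2))

m_raw  <- lm(y ~ x, data = df)        # violates linearity
m_logy <- lm(log(y) ~ x, data = df)   # transforming the response linearises it
m_logx <- lm(y ~ log(x), data = df)   # or transform an individual predictor

# Note: the transformed model now predicts log(y), not y:
# log(y) = b0 + b1 * x, so y = exp(b0 + b1 * x)
```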
Example: Air Quality Data
- Measures the amount of ozone in New York in 1973.
- Fit a simple linear regression with temperature as the predictor.
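The `airquality` dataset ships with base R (New York, May to September 1973), so this step can be sketched directly:

```r
# Simple linear regression of ozone on temperature
model_oz <- lm(Ozone ~ Temp, data = airquality)  # rows with NA are dropped
summary(model_oz)
```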
Checking Assumptions
- With the `performance` package, some assumptions may still not be fully met.
- Even if the assessment is somewhat subjective, a near-miss can be good enough to pass.
- The residuals may still exhibit waviness and standardized values beyond ±2.