Regression Analysis and Linear Regression
Modeling Basics
- Response variable (Y) and predictor variables (Xs) are combined into a model.
- Modeling involves extra steps like variable selection.
- Complex models like machine learning can handle many variables, but regression models require careful selection.
- Too many variables in regression can reduce the model's interpretability.
- Calibration and validation are crucial steps in real-life settings.
- Train the model on a subset of the data and test its generalizability on the remaining data.
- A good model should be generalizable to new data, not just fit the training data well.
Course Overview
- The lecture provides a refresher of ENVX1002.
- It aims to provide necessary knowledge even for those without prior statistics coursework.
- Today's lecture serves as a cheat sheet for regression.
Regression Fundamentals
- Regression involves one response variable (Y) and one or more predictor variables (Xs).
- Relationships between variables can be linear or nonlinear.
- The course focuses on simple linear regression (one predictor) for theoretical understanding.
- The majority of the course covers multiple linear regression (multiple predictors).
Reasons for Regression
- Determine if a relationship exists between predictors and response.
- Estimate new values of the response based on existing data.
- Example: Estimate cow weights based on existing data from 100 cows.
- Test hypotheses about relationships between variables.
- Test if the model is significant.
The Essence of Regression
- Regression fits a line of best fit to the data points.
- Technically, the goal is to minimize the sum of squared residuals.
Lecture Outline: Model Development
- Developing the Model
- Visualizing and summarizing data.
- Fitting a simple model.
- Checking assumptions.
- Interpreting the output.
- Applying transformations if necessary.
Historical Context
- Galton and others developed the theory of least squares.
- Residual: The distance between the observed data point and the regression line.
- Sum of Squared Residuals: Squaring the residuals to eliminate negative values and summing them.
- The goal is to minimize the sum of squared residuals to get the best fit.
Simple Linear Regression
- Involves one predictive variable and one response variable.
- Example: Galton's data on parent and child heights.
Galton's Data
- Measures the height of parents (average of mother and father) and their children.
- Investigates the influence of parent height on child height.
- Each pair of parents had approximately four children.
- Height is measured in inches.
- Data is binned into 0.5-inch size classes.
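As a sketch of how this data might be loaded and inspected in R (assuming the `HistData` package, which ships a version of Galton's dataset; the course's `galton_data` may differ):

```r
# Galton's parent/child height data (inches) from the HistData package
# install.packages("HistData")  # if not already installed
library(HistData)

galton_data <- Galton   # columns: parent (mid-parent height), child
str(galton_data)        # inspect the structure
summary(galton_data)    # five-number summaries of both variables
plot(child ~ parent, data = galton_data,
     xlab = "Mid-parent height (in)", ylab = "Child height (in)")
```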
Checking for Linearity
- Visually inspect the data to see if the relationship appears linear.
- Use Pearson's correlation coefficient to quantify the linear relationship.
- Formula for Pearson's correlation coefficient (will be provided in exams):
- $r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}$
- r ranges from -1 to +1.
- Positive values indicate a positive relationship (as one increases, the other increases).
- Negative values indicate a negative relationship (as one increases, the other decreases).
- The absolute value indicates the strength of the relationship (1 is a perfect linear relationship).
- Caveat: Pearson's correlation coefficient doesn't distinguish between data patterns; very different patterns can produce the same value.
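A minimal sketch of computing Pearson's r in R, assuming the `galton_data` object from the example above:

```r
# Pearson's correlation between parent and child heights
r <- cor(galton_data$parent, galton_data$child)  # method = "pearson" is the default
r  # between -1 and +1; roughly 0.46 for Galton's data
```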
Examples of Correlation Coefficients
- A coefficient of 0.8 can occur in various data patterns:
  - Nonlinear data.
  - Data with outliers.
  - Quadratic relationships.
  - Linear data.
- Always check the plot in addition to the correlation coefficient.
Equation
- $y = \beta_0 + \beta_1 x + \epsilon$
- y is the response variable.
- $\beta_0$ is the intercept.
- $\beta_1$ is the slope.
- x is the predictor variable.
- $\epsilon$ is the error term (residuals).
Understanding the Equation
- y is the result of the equation.
- $\beta_0$ and $\beta_1$ are constants (fixed once known).
- x is a known value.
- The error term $\epsilon$ is associated with the residual.
Model Fitting
- $\hat{y}$ represents the predicted value.
- Residual is the difference between the observed y and the predicted $\hat{y}$.
- The method of least squares minimizes the sum of squared residuals.
Analytical vs. Numerical Fitting
- Analytical: Solving for $\beta_0$ and $\beta_1$ directly using closed-form equations.
- Numerical: Estimating $\beta_0$ and $\beta_1$ iteratively, tweaking them to reduce the residuals.
- Computers use numerical fitting, especially for large datasets.
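A minimal sketch of the analytical (closed-form) solution, using simulated data so it runs on its own:

```r
# Simulate data with a known intercept and slope
set.seed(42)
x <- runif(100, 0, 10)
y <- 2 + 0.5 * x + rnorm(100, sd = 1)

# Closed-form least-squares estimates
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope
b0 <- mean(y) - b1 * mean(x)                                     # intercept

# These match the coefficients lm() reports
c(b0, b1)
coef(lm(y ~ x))
```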
Model Fitting in R
- Use the `lm()` function for simple and multiple linear regression: `lm(y ~ x, data = galton_data)`.
- `y` is the response variable (child height).
- `x` is the predictor variable (average parent height).
- `galton_data` is the dataset.
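Putting this together as a runnable sketch, again assuming the `galton_data` object from the `HistData` example:

```r
# Fit a simple linear regression of child height on mid-parent height
model <- lm(child ~ parent, data = galton_data)
model  # prints the estimated intercept and slope
```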
Hypothesis Testing
- Revisiting the hypothesis testing process (HATPC).
- H: Hypothesis
- A: Assumptions
- T: Test
- P: p-values
- C: Conclusions
Hypothesis for Simple Linear Regression
- Null hypothesis ($H_0$): The slope is equal to zero ($\beta_1 = 0$).
- Alternative hypothesis ($H_1$): The slope is not equal to zero ($\beta_1 \neq 0$).
- If the slope is zero, the model reduces to a horizontal line at the mean of the response variable y.
- Simple linear regression therefore tests whether a model with a slope fits better than the mean of the response.
Assumptions of Linear Regression (LINE)
- L: Linearity
- I: Independence
- N: Normality
- E: Equal Variance (Homoscedasticity)
- The assumptions apply to the residuals, not the data itself.
- If the data is non-linear, the residuals will also be non-linear.
Checking Assumptions
- In base R, use `par(mfrow = c(2, 2))` to arrange plots.
- Use the `plot(model)` function to generate diagnostic plots.
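As a runnable sketch, assuming `model` from the fitting step above:

```r
# Arrange the four diagnostic plots in a 2 x 2 grid
par(mfrow = c(2, 2))
plot(model)           # residuals vs fitted, QQ, scale-location, residuals vs leverage
par(mfrow = c(1, 1))  # reset the plotting layout
```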
Diagnostic Plots
- Residuals vs. Fitted: Points should be randomly scattered around the line.
- QQ Plot: Residuals should follow the one-to-one line.
- Scale-Location: Red line should be flat, and points should be evenly scattered.
- Residuals vs. Leverage: Check for points outside Cook's distance lines.
Alternative Packages for Checking Assumptions
- `ggfortify` package: Provides prettier plots but lacks Cook's distance lines.
- `performance` package: Provides instructions for each assumption plot.
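A hedged sketch of both alternatives, assuming the packages are installed (exact plot appearance depends on package versions):

```r
# ggfortify: ggplot2-styled versions of the base diagnostic plots
library(ggfortify)
autoplot(model)

# performance: annotated assumption checks with guidance on each panel
library(performance)
check_model(model)
```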
Linearity Assessment
- Assess whether residuals are linear.
- Check if the green reference line is flat and horizontal near zero.
- If violated, consider transforming the data or using a non-linear model.
Independence
- Addressed during experimental design.
- Violations can occur with paired data, time series data (autocorrelation), or similar predictive variables (multicollinearity).
Multicollinearity
Definition
- High correlation between predictive variables (e.g., > 0.9).
Effects
- Unstable model results.
- Difficulty in determining the importance of individual predictors.
Solutions
- Remove one of the highly correlated predictors (usually the less important one).
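A minimal sketch of checking for multicollinearity, using simulated predictors (hypothetical here, since the Galton model has only one predictor):

```r
# Simulate two highly correlated predictors
set.seed(1)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.1)   # nearly a copy of x1
y  <- 1 + 2 * x1 + rnorm(100)

cor(x1, x2)   # pairwise correlation (> 0.9 is a red flag)

m <- lm(y ~ x1 + x2)
performance::check_collinearity(m)  # variance inflation factors (VIFs)
```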
Normality
- Residuals should be normally distributed at every point along the fitted line.
- Problems occur with skewed distributions or fanning data.
- Use the histogram and QQ plot from the `performance` package.
- Histogram: Points should follow the green line.
- QQ Plot: Points should follow the flat line.
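A sketch of checking normality in base R and with `performance`, assuming `model` from above:

```r
# Base R: histogram and QQ plot of the residuals
hist(residuals(model), main = "Histogram of residuals")
qqnorm(residuals(model))
qqline(residuals(model))   # points should hug this reference line

# performance: a formal check with a plain-language message
performance::check_normality(model)
```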
Skewness
Heavy vs. Light Tails
- Heavy tails: Very skewed.
- Light tails: Slightly skewed.
Left vs. Right Skew
- Left skewed: Tail on the left.
- Right skewed: Tail on the right.
- Interpret the skew based on the shape of the distribution.
Equal Variances (Homoscedasticity)
- Check for fanning in the data.
- Use the homogeneity of variance plot.
- Green line should be relatively flat, with evenly scattered points.
- Standardized residuals should be within ±2.
- Leverage can distort the model fit.
- A single point far from the others can shift the fitted line.
Standardized Residuals
- Residual divided by the standard error of the residual.
- Mean of residuals should be zero.
- Values above 2 (positive or negative) indicate potential outliers.
- Be cautious if variance is not constant.
- A distinct shape, such as a U or a W, indicates non-linearity and questionable residuals.
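A minimal sketch of inspecting standardized residuals in R, assuming `model` from above:

```r
# Standardized residuals: residual divided by its estimated standard error
z <- rstandard(model)

mean(residuals(model))   # raw residuals average to (essentially) zero
which(abs(z) > 2)        # observations flagged as potential outliers
```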
Model Fit and ANOVA
- ANOVA (Analysis of Variance) is a variation of linear regression.
- Both partition variance into sums of squares for the model (regression) and the residual (error).
- These components form the F-statistic.
ANOVA Table and Regression Output
- An ANOVA table can be generated from a linear regression model with `anova(model)`.
- The table outputs an F-value and a p-value.
- F-value: the ratio of the variance explained by the model to the residual variance.
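A sketch showing where the F-value comes from, assuming `model` from above:

```r
aov_tab <- anova(model)   # ANOVA table for the regression
aov_tab

# The F-value is the model mean square over the residual mean square
ms_model    <- aov_tab$"Mean Sq"[1]
ms_residual <- aov_tab$"Mean Sq"[2]
ms_model / ms_residual    # matches the F value in the table
```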
Interpreting the Regression Output
- Use `summary(model)` to get detailed regression output.
- The F-statistic and p-value in the regression output are the same as in the ANOVA table.
- The F-statistic tells how much the parents' height explains the child's height.
- The significance (p-value) of the F-statistic is the same value in both outputs.
Writing out the Equation
- With the `summary()` output, we can write out the fitted equation.
- child = intercept + slope × parent
- In symbols: $\hat{y} = \beta_0 + \beta_1 \times \text{parent}$, using the estimates from the output.
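A sketch of pulling the estimates out of the fitted model, assuming `model` from above:

```r
coef(model)   # intercept (beta_0) and slope (beta_1)
b <- coef(model)

# Write out the fitted equation: child = b0 + b1 * parent
cat("child =", round(b[1], 2), "+", round(b[2], 2), "* parent\n")
```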
Interpreting the Relationship
- For every one-unit change in the predictor (parent height), we expect a change of $\beta_1$ (the slope) in the response (child height).
R-squared Values
- Two types: multiple R-squared and adjusted R-squared.
- Multiple R-squared: Use when only one predictor variable is available.
- Adjusted R-squared: Use when multiple predictors are available; it applies a penalty for each extra variable kept in the model.
- Adjusted R-squared is always lower than multiple R-squared.
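Both values can be read directly off the `summary()` object, as sketched below (assuming `model` from above):

```r
s <- summary(model)
s$r.squared       # multiple R-squared
s$adj.r.squared   # adjusted R-squared (penalised; lower than the above)
```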
Making Predictions
- Plug any value of the predictor variable into the fitted equation to estimate the response.
- Use the equation or the `predict()` function: `predict(model, newdata = data.frame(parent = 70))`.
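A sketch of predicting by hand and with `predict()`, assuming `model` from above; the `interval = "confidence"` argument adds a confidence interval around the estimate:

```r
# By hand, from the fitted equation
b <- coef(model)
b[1] + b[2] * 70   # predicted child height for a mid-parent height of 70 in

# With predict(), including a confidence interval
predict(model, newdata = data.frame(parent = 70), interval = "confidence")
```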
Generalizability
- Assess whether the model can apply to a new data set.
- Galton's study may not relate to more modern data due to improved nutrition and other changes.
Transformations
- Apply transformations when assumptions are not met, particularly linearity, equal variance, and normality.
- Transform either the response (y) or predictor (x) variable.
- Transforming y is easier when all predictor variables show non-linearity.
- When only some predictors are non-linear, it is easier to transform just those individual x variables.
- Beware: the equation becomes harder to read when transformed.
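A minimal sketch of a log transformation on simulated data (assumed for illustration, not from the lecture), showing both options:

```r
# Simulate a response that grows exponentially with x
set.seed(7)
df <- data.frame(x = runif(100, 1, 10))
df$y <- exp(0.5 + 0.3 * df$x + rnorm(100, sd = 0.2))

m_raw  <- lm(y ~ x, data = df)        # violates linearity
m_logy <- lm(log(y) ~ x, data = df)   # transforming the response linearises it
m_logx <- lm(y ~ log(x), data = df)   # or transform an individual predictor

# Note: the transformed model now predicts log(y), not y:
# log(y) = b0 + b1 * x, so y = exp(b0 + b1 * x)
```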
Example: Air Quality Data
- Measures the amount of ozone in New York in 1973.
- Fit a simple linear regression with temperature as the predictor.
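The `airquality` dataset ships with base R (New York, May to September 1973), so this step can be sketched directly:

```r
# Simple linear regression of ozone on temperature
model_oz <- lm(Ozone ~ Temp, data = airquality)  # rows with NA are dropped
summary(model_oz)
```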
Checking Assumptions
- With the `performance` package, some assumptions may still not be fully met.
- Even if the assessment is somewhat subjective, a near-miss can be good enough to pass.
- The residuals may still exhibit waviness and standardized values beyond ±2.