In-Depth Notes on Linear Regression
Linear Regression Overview
- Course: Biol 350 - Biostatistics, Spring 2025
- Learning Objectives:
- Predict Y based on X using linear regression.
- Test the null hypothesis of zero slope.
- Display uncertainty in predictions from a linear model.
- Use scatterplots to evaluate data assumptions for regression.
Key Concepts
Correlation vs. Regression
Correlation:
- Assesses association between variables: direction and strength.
- Does not fit a line; does not imply causality.
- Useful in observational studies with multiple factors.
Regression:
- Predicts values of Y from values of X by fitting a line.
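- Supports conclusions about causality only when X is manipulated in a controlled experiment; fitting a line does not by itself establish causality.
- Involves a response variable (Y) and an explanatory variable (X), often in experimental studies.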
Definitions
Correlation Coefficient ($r$):
- Ranges from -1 to 1 (unitless).
- No distinction between explanatory and response variables.
- Measures how strongly X and Y change together.
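The properties above can be checked numerically. The following is a minimal sketch on simulated data; the variables `x` and `y` are hypothetical and are not the course dataset.

```R
# Hypothetical simulated data (an assumption for illustration only).
set.seed(1)
x <- rnorm(50)
y <- 2 * x + rnorm(50)

# Pearson correlation is symmetric: swapping X and Y gives the same r.
r_xy <- cor(x, y)
r_yx <- cor(y, x)

# r is unitless: rescaling X (e.g., changing its units) leaves r unchanged.
r_rescaled <- cor(100 * x, y)

print(c(r_xy = r_xy, r_yx = r_yx, r_rescaled = r_rescaled))
```

All three printed values should agree, which is why correlation needs no distinction between explanatory and response variables.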
Regression Line ($Y = a + bX$):
- Predicts Y from X, where $a$ is the intercept and $b$ is the slope:
- $a$ is the predicted value of Y when $X = 0$.
- $b$ indicates the change in Y for a unit increase in X.
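As a sketch of where $a$ and $b$ come from, assuming the line is fitted by ordinary least squares (which is what R's `lm()` does), the slope and intercept can be computed by hand and compared with `lm()`. The data are simulated and hypothetical.

```R
# Hypothetical simulated data (an assumption for illustration only).
set.seed(2)
x <- runif(40, min = 0, max = 10)
y <- 1.5 + 0.8 * x + rnorm(40)

# Least-squares slope: sum of cross-products divided by sum of squares of X.
b <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)

# Least-squares intercept: the line passes through (mean(x), mean(y)).
a <- mean(y) - b * mean(x)

# Compare with the coefficients estimated by lm().
fit <- lm(y ~ x)
print(c(a = a, b = b))
print(coef(fit))
```

The hand-computed values should match `coef(fit)` up to floating-point rounding.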
Residuals
- Measure the difference between each observed ($Y_i$) and predicted ($\tilde{Y}_i$) value: $e_i = Y_i - \tilde{Y}_i$.
- Represent the variation in Y not explained by the regression line.
- Visualized as vertical deviations of the observed points from the fitted line.
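A minimal sketch of computing residuals, using simulated (hypothetical) data: each residual is the observed value minus the value predicted by the fitted line, and R's `resid()` should return the same numbers.

```R
# Hypothetical simulated data (an assumption for illustration only).
set.seed(3)
x <- runif(30, min = 0, max = 10)
y <- 2 + 0.5 * x + rnorm(30)

fit <- lm(y ~ x)

# Predicted value of Y for each observed X.
y_pred <- fitted(fit)

# Residual = observed Y minus predicted Y.
res_manual <- y - y_pred

# The same quantity extracted directly from the model object.
res_builtin <- resid(fit)

# For a least-squares line with an intercept, residuals sum to (nearly) zero.
print(sum(res_manual))
```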
Regression Equation
- General Form: $Y = a + bX + e$
- Where:
- $a$ = intercept.
- $b$ = slope.
- $e$ = residual error.
Assumptions of Linear Regression
- Random sample of Y for each value of X.
- Linear relationship between X and Y.
- Normal distribution of possible Y-values at each X.
- Constant variance of Y-values across all X.
Assessing Assumptions
- Use scatterplots to check for a linear relationship, normality, and equal variances:
- Look for patterns in the residual plots: curvature suggests a nonlinear relationship, and a funnel shape suggests unequal variance.
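One way to draw these diagnostic plots is sketched below with base R graphics on simulated (hypothetical) data. A normal quantile plot is added here as an assumption about how normality might be checked; the notes themselves only mention scatterplots and residual plots.

```R
# Hypothetical simulated data (an assumption for illustration only).
set.seed(4)
x <- runif(60, min = 0, max = 10)
y <- 3 + 1.2 * x + rnorm(60)

fit <- lm(y ~ x)

# Residuals vs. fitted values: points should scatter evenly around zero,
# with no curve (nonlinearity) and no funnel shape (unequal variance).
plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Normal quantile plot of residuals: points close to the reference line
# are consistent with normally distributed residuals.
qqnorm(resid(fit))
qqline(resid(fit))
```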
Regression Analysis in R
Steps to Perform Regression:
1. **Creating a Scatterplot:**
```R
library(ggplot2)
# Plot offspring telomere length against father telomere length
ggplot(telomere_data, aes(x = father_telomere_length, y = offspring_telomere_length)) +
  geom_point() +
  theme_minimal()
```
2. **Creating the Linear Model:**
```R
# Regress offspring telomere length (Y) on father telomere length (X)
telomere_regr <- lm(offspring_telomere_length ~ father_telomere_length, data = telomere_data)
```
3. **Interpreting Results:**
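- `summary(telomere_regr)` reports the distribution of the residuals (minimum, quartiles, maximum) and the coefficients with their standard errors, t-statistics, and p-values.
- A p-value below 0.05 for the slope leads to rejecting the null hypothesis of zero slope, indicating a significant predictive relationship.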
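The slope test can be sketched as follows on simulated (hypothetical) data. The coefficient table from `summary()` is a matrix, so the slope's estimate, standard error, t-statistic, and p-value can be pulled out by name; the manual calculation assumes the usual t-test with $n - 2$ degrees of freedom.

```R
# Hypothetical simulated data (an assumption for illustration only).
set.seed(5)
x <- runif(40, min = 0, max = 10)
y <- 1 + 0.6 * x + rnorm(40)

fit <- lm(y ~ x)

# Coefficient table: one row per term, columns for estimate, SE, t, and p.
coefs <- summary(fit)$coefficients
slope_est <- coefs["x", "Estimate"]
slope_se  <- coefs["x", "Std. Error"]
slope_t   <- coefs["x", "t value"]
slope_p   <- coefs["x", "Pr(>|t|)"]

# Manual test of H0: slope = 0, using t = estimate / SE with n - 2 df.
t_manual <- slope_est / slope_se
p_manual <- 2 * pt(abs(t_manual), df = df.residual(fit), lower.tail = FALSE)

print(c(t = slope_t, p = slope_p))

# 95% confidence interval for the slope.
print(confint(fit, "x", level = 0.95))
```

If the confidence interval for the slope excludes zero, the corresponding two-sided test rejects the null hypothesis at the 0.05 level.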
Output interpretation:
- Coefficients:
- Intercept and slope give the regression equation to predict Y based on X.
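- Example equation (with $a$ and $b$ taken from the `Estimate` column of the summary): `offspring_telomere_length = a + b * father_telomere_length`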
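To illustrate predicting Y from the fitted equation and displaying the uncertainty in those predictions, here is a minimal sketch using `predict()` on simulated data. The data frame `dat` and the new X values are hypothetical.

```R
# Hypothetical simulated data (an assumption for illustration only).
set.seed(6)
dat <- data.frame(x = runif(50, min = 0, max = 10))
dat$y <- 2 + 0.7 * dat$x + rnorm(50)

fit <- lm(y ~ x, data = dat)

# New X values at which to predict Y (arbitrary illustrative choices).
new_x <- data.frame(x = c(2, 5, 8))

# Prediction by hand from the regression equation Y = a + bX.
y_manual <- coef(fit)[["(Intercept)"]] + coef(fit)[["x"]] * new_x$x

# Confidence band: uncertainty in the mean of Y at each X.
conf_band <- predict(fit, newdata = new_x, interval = "confidence", level = 0.95)

# Prediction interval: uncertainty for a single new observation at each X.
pred_band <- predict(fit, newdata = new_x, interval = "prediction", level = 0.95)

print(conf_band)
print(pred_band)
```

The prediction interval is always wider than the confidence band because it also includes the scatter of individual points around the line.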
Goodness of Fit
- R-squared ($R^2$):
- Indicates the proportion of the variance in Y that the model explains.
- Ranges from 0 to 1; higher values denote a better fit (e.g., $R^2 = 0.32$ means 32% of the variance in Y is explained by X).
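A short sketch of how $R^2$ relates to the residuals, on simulated (hypothetical) data: it is one minus the ratio of the residual sum of squares to the total sum of squares, and for a regression with a single X it equals the squared correlation coefficient.

```R
# Hypothetical simulated data (an assumption for illustration only).
set.seed(7)
x <- runif(50, min = 0, max = 10)
y <- 1 + 0.4 * x + rnorm(50, sd = 2)

fit <- lm(y ~ x)

# Residual sum of squares: variation left unexplained by the line.
ss_res <- sum(resid(fit)^2)

# Total sum of squares: variation of Y around its mean.
ss_tot <- sum((y - mean(y))^2)

# R-squared by hand, from summary(), and as the squared correlation.
r2_manual  <- 1 - ss_res / ss_tot
r2_summary <- summary(fit)$r.squared
r2_cor     <- cor(x, y)^2

print(c(manual = r2_manual, summary = r2_summary, cor_squared = r2_cor))
```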
Key Study Areas
- Distinguish between correlation and regression.
- Understand regression line calculation methods.
- Interpret scatterplots and identify violations of assumptions.
- Analyze p-values and R-squared.
- Practice using R functions like `lm()` and `summary()` to derive and interpret regression results.