In-Depth Notes on Linear Regression

Linear Regression Overview

  • Course: Biol 350 - Biostatistics, Spring 2025
  • Learning Objectives:
    • Predict Y based on X using linear regression.
    • Test the null hypothesis of zero slope.
    • Display uncertainty in predictions from a linear model.
    • Use scatterplots to evaluate data assumptions for regression.

Key Concepts

Correlation vs. Regression
  • Correlation:

    • Assesses association between variables: direction and strength.
    • Does not fit a line; does not imply causality.
    • Useful in observational studies with multiple factors.
  • Regression:

    • Predicts values of Y from values of X by fitting a line.
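    • Supports conclusions about causality only when X is manipulated experimentally; a fitted line alone does not establish cause.
    • Involves a response variable (Y) and an explanatory variable (X), typically in experimental studies.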
Definitions
  • Correlation Coefficient ($r$):

    • Ranges from -1 to 1 (unitless).
    • No distinction between explanatory and response variables.
    • Measures how strongly X and Y change together.
  • Regression Line ($Y = a + bX$):

    • Predicts Y from X, where $a$ is the intercept and $b$ is the slope:
    • $a = Y$ when $X = 0$.
    • $b$ indicates the change in Y for a unit increase in X.
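The two definitions above can be compared side by side in R. This is a minimal sketch on made-up toy vectors (`x` and `y` are hypothetical, not course data): `cor()` returns the same value whichever variable comes first, while `lm()` treats one variable as the response and returns an intercept and slope that can be used for prediction.

```R
# Hypothetical toy data, invented for illustration (not the course dataset)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Correlation coefficient: unitless and symmetric in x and y
r <- cor(x, y)
r_swapped <- cor(y, x)  # same value: no explanatory/response distinction

# Regression line Y = a + bX: y is the response, x is explanatory
fit <- lm(y ~ x)
a <- unname(coef(fit)[1])  # intercept: predicted Y when X = 0
b <- unname(coef(fit)[2])  # slope: change in Y per unit increase in X

# Predict Y at a new X value directly from the equation
x_new <- 6
y_pred <- a + b * x_new
```

Swapping the roles in `lm()` (fitting `x ~ y`) would generally give a different line, which is the practical meaning of the explanatory/response distinction.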
Residuals
  • Measure the difference between each observed ($Y_i$) and predicted ($\tilde{Y}_i$) value:
    • $e_i = Y_i - \tilde{Y}_i$
  • Represent the variation in Y that the regression line does not explain.
  • Visualized as the vertical deviations of the observed Y-values from the fitted line across X.
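As a small illustration of the residual definition (observed minus predicted), the sketch below computes residuals by hand and checks them against R's built-in `residuals()`. The data vectors are hypothetical toy values, not course data.

```R
# Hypothetical toy data, invented for illustration
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

fit <- lm(y ~ x)

# Predicted values on the fitted line, one per observation
y_pred <- fitted(fit)

# Residual = observed Y minus predicted Y
e <- y - y_pred

# Should agree with R's own residuals
matches_builtin <- isTRUE(all.equal(unname(e), unname(residuals(fit))))

# For a least-squares line with an intercept, residuals sum to (numerically) zero
residual_sum <- sum(e)
```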

Regression Equation

  • General Form: $Y = a + bX + e$
    • Where:
    • $a$ = intercept.
    • $b$ = slope.
    • $e$ = residual error.
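The notes do not state how $a$ and $b$ are chosen. Assuming ordinary least squares, which is the method R's lm() uses, the estimates are the values that minimize the sum of squared residuals $\sum_i e_i^2$. They have the standard closed form below, where $\bar{X}$ and $\bar{Y}$ (notation introduced here) are the sample means of X and Y:

$$
b = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_i (X_i - \bar{X})^2}, \qquad a = \bar{Y} - b\,\bar{X}
$$

One consequence of the second formula is that the fitted line always passes through the point $(\bar{X}, \bar{Y})$.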
Assumptions of Linear Regression
  1. Random sample of Y for each value of X.
  2. Linear relationship between X and Y.
  3. Normal distribution of possible Y-values at each X.
  4. Constant variance of Y-values across all X.
Assessing Assumptions
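  • Use scatterplots and residual plots to check for a linear relationship, normality, and equal variances:
    • Look for patterns in the residual plots: curvature suggests a non-linear relationship, and a funnel shape suggests unequal variance.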
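One possible way to draw these diagnostic plots is sketched below. It uses base R graphics rather than ggplot2 so that it runs without extra packages, and the data are simulated (an assumption made for illustration, not the course dataset).

```R
# Simulated data that satisfy the assumptions (hypothetical, for illustration)
set.seed(1)
x <- runif(50, min = 0, max = 10)
y <- 2 + 0.5 * x + rnorm(50, sd = 1)

fit <- lm(y ~ x)
res <- residuals(fit)

# Residuals vs. X: a patternless horizontal band around zero is the goal.
# Curvature hints at non-linearity; a funnel shape hints at unequal variance.
plot(x, res, xlab = "X", ylab = "Residual")
abline(h = 0, lty = 2)

# Normal Q-Q plot: points close to the line suggest roughly normal residuals
qqnorm(res)
qqline(res)
```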

Regression Analysis in R

Steps to Perform Regression:
  1. Creating a Scatterplot:
    ```R
    library(ggplot2)
    # Scatterplot of offspring vs. father telomere length
    ggplot(telomere_data, aes(x = father_telomere_length, y = offspring_telomere_length)) +
      geom_point() +
      theme_minimal()
    ```
  2. Creating the Linear Model:
    ```R
    # Fit offspring telomere length as a linear function of father telomere length
    telomere_regr <- lm(offspring_telomere_length ~ father_telomere_length, data = telomere_data)
    ```
  3. Interpreting Results:
    • summary(telomere_regr) reports the spread of the residuals (min, quartiles, max) and the coefficients with their standard errors and p-values.
    • The example results indicate a significant predictive relationship when the slope's p-value is < 0.05.
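The interpretation step can be sketched as follows. Because the course data file is not included in these notes, the data frame below is a simulated stand-in that only reuses the column names; its values, and any numbers it produces, are hypothetical and will not match the course results.

```R
# Simulated stand-in for telomere_data (hypothetical values, same column names)
set.seed(42)
telomere_data <- data.frame(father_telomere_length = runif(40, 0.5, 1.5))
telomere_data$offspring_telomere_length <-
  0.3 + 0.9 * telomere_data$father_telomere_length + rnorm(40, sd = 0.2)

telomere_regr <- lm(offspring_telomere_length ~ father_telomere_length,
                    data = telomere_data)

# Full printed summary: residual quantiles, coefficient table, R-squared
regr_summary <- summary(telomere_regr)
print(regr_summary)

# Coefficient table with columns Estimate, Std. Error, t value, Pr(>|t|)
coef_table <- coef(regr_summary)

# Test of the null hypothesis that the slope is zero
slope_p <- coef_table["father_telomere_length", "Pr(>|t|)"]
reject_null <- slope_p < 0.05
```

Pulling the p-value out of the coefficient table, rather than reading it off the printout, makes the zero-slope test easy to reuse in later code.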
Output interpretation:
  • Coefficients:
    • Intercept and slope give the regression equation to predict Y based on X.
    • Example equation: $\text{Offspring telomere length} = 0.2252 + 0.9792 \times (\text{father telomere length})$
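To make the example equation concrete, the sketch below plugs a father telomere length of 1.0 (a hypothetical input chosen for illustration) into the quoted coefficients. It then shows one way to attach uncertainty to predictions with `predict()`, using simulated data because the course dataset is not included here; the names `sim`, `fit`, and `new_x` are illustrative.

```R
# Coefficients quoted in the example equation
a <- 0.2252
b <- 0.9792
father_length <- 1.0                      # hypothetical input value
offspring_pred <- a + b * father_length   # 0.2252 + 0.9792 = 1.2044

# Uncertainty in predictions, shown on simulated data (hypothetical values)
set.seed(42)
sim <- data.frame(x = runif(40, 0.5, 1.5))
sim$y <- 0.3 + 0.9 * sim$x + rnorm(40, sd = 0.2)
fit <- lm(y ~ x, data = sim)

new_x <- data.frame(x = c(0.8, 1.0, 1.2))

# Confidence interval: uncertainty in the mean Y at each X
conf_band <- predict(fit, newdata = new_x, interval = "confidence")

# Prediction interval: uncertainty for a single new Y at each X (always wider)
pred_band <- predict(fit, newdata = new_x, interval = "prediction")
```

Both calls return a matrix with columns `fit`, `lwr`, and `upr`, which can be drawn as bands around the fitted line.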
Goodness of Fit
  • R-squared ($R^2$):
    • Indicates the proportion of the variance in Y that the model explains.
    • Higher values denote a better fit (e.g., R-squared = 0.32 indicates 32% of the variance is explained).
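As a check on what $R^2$ measures, the sketch below computes it by hand as one minus the ratio of residual to total sum of squares, then compares it with the value reported by `summary()`. The data are hypothetical toy values, not course data.

```R
# Hypothetical toy data, invented for illustration
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

fit <- lm(y ~ x)

ss_res <- sum(residuals(fit)^2)    # variation left unexplained by the line
ss_tot <- sum((y - mean(y))^2)     # total variation in Y
r_squared <- 1 - ss_res / ss_tot   # proportion of variance explained

# Same quantity as reported by summary()
r_squared_builtin <- summary(fit)$r.squared

# With a single explanatory variable, R-squared equals the squared correlation
r_squared_from_cor <- cor(x, y)^2
```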

Key Study Areas

  • Distinguish between correlation and regression.
  • Understand regression line calculation methods.
  • Interpret scatterplots and identify violations of assumptions.
  • Analyze p-values and R-squared.
  • Practice using R functions like lm() and summary() to derive and interpret regression results.