Chapter 6: Simple Linear Regression

6.1 Examining Scatterplots

  • Visualization of Relationships: The relationship between two numerical variables can be visualized using a scatterplot in the xy-plane.

  • Axes:

    • Horizontal Axis: Predictor or explanatory variable.

    • Vertical Axis: Response variable.

  • Purpose of This Chapter: This chapter explores simple linear regression, a technique for estimating a straight line that best fits data on a scatterplot.

    • Functions as a linear model for both prediction and inference.

    • Only applicable to data showing linear or approximately linear relationships.

  • Example of Linear Relationship: NHANES data illustrates a linear relationship between height and weight.

    • Height serves as a predictor of weight.

    • Regression techniques can predict an individual’s weight based on height.

  • Non-Linear Relationships: Not all data relationships are linear. Example: annual per capita income vs. life expectancy.

    • Strong relationships show clear patterns in scatterplots.

    • Weak relationships appear diffuse and unclear.

  • Effect of Measurement Scale Changes: Changing the scale of measurement of one or both variables alters the axes but does not change the nature of the relationship.
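The scale-invariance point above can be checked directly: rescaling a variable (e.g., centimeters to inches) relabels the axis but leaves the correlation, and hence the strength of the linear relationship, unchanged. The height/weight data below are simulated stand-ins, not the actual NHANES values:

```python
import numpy as np

# Simulated height/weight data with a linear relationship plus noise
# (illustrative only -- not the NHANES dataset from the chapter).
rng = np.random.default_rng(3)
height_cm = rng.normal(170, 10, 200)
weight_kg = 0.9 * height_cm - 90 + rng.normal(0, 8, 200)

# Correlation before and after converting height from cm to inches.
r_original = np.corrcoef(height_cm, weight_kg)[0, 1]
r_inches = np.corrcoef(height_cm / 2.54, weight_kg)[0, 1]

# The two correlations agree: changing units does not change the
# nature (or strength) of the relationship.
print(r_original, r_inches)
```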

6.2 Estimating a Regression Line Using Least Squares

  • Adding a Regression Line: A least squares regression line minimizes the sum of the squared residuals (differences between observed and estimated values).

  • Residual Definition: The residual for the i-th observation

    • Formula: e_i = y_i - \hat{y}_i
      where:

    • y_i = observed value

    • \hat{y}_i = predicted value from the regression line

  • Least Squares Regression Line:

    • Minimizes the sum of squared residuals: e_1^2 + e_2^2 + \dots + e_n^2

  • Population Regression Model:

    • Formula: y = \beta_0 + \beta_1 x + \varepsilon

    • \varepsilon: Normally distributed error term with mean 0 and standard deviation \sigma.

    • Expected value (mean) represented as E(Y|x) = \beta_0 + \beta_1 x.

  • Fitting the Model:

    • Example for the PREVEND data predicting RFFT score from age.

    • Regression parameters:

    • \beta_0: Intercept (value when x=0)

    • \beta_1: Slope (change in y for a 1-unit change in x)

  • Estimating Parameters:

    • Sample estimates: b_0 (intercept) and b_1 (slope).

    • Calculating slope: b_1 = r \frac{s_y}{s_x}
      where:

    • r = correlation coefficient

    • s_x, s_y = standard deviations of x and y, respectively.

    • Intercept calculation: b_0 = \bar{y} - b_1 \bar{x}.

Example 6.5: Parameter Calculation
  • From summary statistics for PREVEND data:

    • Slope: b_1 = r \frac{s_y}{s_x}

    • Intercept: b_0 = \bar{y} - b_1 \bar{x}, resulting in the fitted equation \widehat{RFFT} = 137.55 - 1.26(age).
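The slope and intercept formulas above can be sketched numerically. The small dataset here is illustrative, not the actual PREVEND data, so the resulting coefficients differ from those in Example 6.5:

```python
import numpy as np

# Illustrative (age, score) pairs -- not the real PREVEND observations.
x = np.array([40, 45, 50, 55, 60, 65, 70], dtype=float)
y = np.array([95, 88, 80, 74, 62, 55, 50], dtype=float)

r = np.corrcoef(x, y)[0, 1]              # correlation coefficient
b1 = r * y.std(ddof=1) / x.std(ddof=1)   # slope: b1 = r * (s_y / s_x)
b0 = y.mean() - b1 * x.mean()            # intercept: b0 = ybar - b1 * xbar

print(b0, b1)
```

The same line can be obtained from a library least-squares fit (e.g., `np.polyfit(x, y, 1)`), which is a useful cross-check on the hand computation.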

6.3 Interpreting a Linear Model

  • Mathematical Interpretation: Slope denotes change in y for a 1-unit increase in x.

    • Example: In the PREVEND data, for each additional year of age, the predicted RFFT score decreases by 1.26 points.

  • Causation vs. Correlation: Avoid claiming causality from observational study data.

    • Linear model does not imply one variable causes the change in another.

  • Intercept Meaning: Represents the expected outcome (RFFT score) when the explanatory variable is zero.

    • Often not meaningful in biological contexts, since x = 0 (e.g., age 0) typically lies outside the range of the observed data.

  • Extrapolation Warning: Predictions should only be made within the observed range of data, particularly for extreme values.

  • Checking Model Validity: Key assumptions include:

    1. Linearity: The relationship should show a linear trend.

    2. Constant variability: Variability around the regression line should be consistent.

    3. Independent observations: Each pair of observations should be independent.

    4. Normally distributed residuals: Check after fitting the model.

  • Residual Analysis: Plot residuals to evaluate model fit. A good model has residuals scattered around zero without patterns.

    • Outliers or non-constant variance can indicate issues with model fit.
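One defining property of a least squares fit is that the residuals sum to (numerically) zero and are uncorrelated with the predictor, so a residual plot for a well-specified model shows only unstructured scatter around zero. A minimal sketch with simulated data (the coefficients borrowed from Example 6.5 are used only to generate the example, not to reproduce PREVEND):

```python
import numpy as np

# Simulate data from a linear model with constant-variance noise.
rng = np.random.default_rng(0)
x = rng.uniform(30, 75, 100)
y = 137.55 - 1.26 * x + rng.normal(0, 10, 100)

# Least squares fit; np.polyfit returns (slope, intercept) for degree 1.
b1, b0 = np.polyfit(x, y, 1)
residuals = y - (b0 + b1 * x)

# For an OLS fit with an intercept, the residuals sum to zero and are
# orthogonal to x -- deviations from this in a plot signal a poor fit.
print(residuals.sum())
```

In practice one would plot `residuals` against `x` (or against the fitted values) and look for curvature, fanning, or outliers.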

6.4 Statistical Inference with Regression

  • Purpose: Use regression for statistical inference about the population, primarily concerning the slope parameter \beta_1.

  • Sampling Distributions:

    • Under the population model, the estimated slope b_1 has an approximately normal sampling distribution.

  • Null Hypothesis Testing: The null hypothesis H_0: \beta_1 = 0 states that there is no association between the variables.

    • Calculate the test statistic t = \frac{b_1 - 0}{\text{s.e.}(b_1)} and compare it to a t-distribution with n - 2 degrees of freedom.

  • Confidence Intervals: Construct a confidence interval for the slope as b_1 \pm t^{\star}_{df} \times \text{s.e.}(b_1), with df = n - 2.
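The slope test and confidence interval can be sketched with `scipy.stats.linregress`, which reports the slope, its standard error, and the two-sided p-value for H_0: \beta_1 = 0. The data below are simulated with an arbitrary true slope of 0.5:

```python
import numpy as np
from scipy import stats

# Simulated data: true intercept 2.0, true slope 0.5 (illustrative values).
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 50)
y = 2.0 + 0.5 * x + rng.normal(0, 1.0, 50)

res = stats.linregress(x, y)
n = len(x)

# Test statistic for H0: beta1 = 0, on n - 2 degrees of freedom.
t = (res.slope - 0) / res.stderr
p = 2 * stats.t.sf(abs(t), df=n - 2)   # matches res.pvalue

# 95% confidence interval for the slope: b1 +/- t*_{df} * s.e.(b1).
t_star = stats.t.ppf(0.975, df=n - 2)
ci = (res.slope - t_star * res.stderr, res.slope + t_star * res.stderr)

print(t, p, ci)
```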

Example 6.17: Association Between Variables
  • Analyzing the association of doctors per 100,000 population with infant mortality using regression analysis.

6.5 Interval Estimates with Regression

  • Confidence Intervals: Build around estimates from the regression line.

    • Example for estimated mean RFFT score at age 60:

    • 61.95 \pm (1.96)(1.14) = (59.72, 64.18).

  • Prediction Intervals: Wider than confidence intervals, accounting for variability and prediction uncertainty.

  • Confidence intervals vs. Prediction intervals:

    • Confidence intervals provide mean estimates, while prediction intervals account for individual variability.
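The contrast between the two interval types can be made concrete with the standard formulas: the standard error for the mean response at x_0 is s\sqrt{1/n + (x_0 - \bar{x})^2 / S_{xx}}, while the prediction interval adds 1 under the square root for the variability of a single new observation. A sketch on simulated data (the Example 6.5 coefficients are reused only to generate the data; the numbers produced will not match the 61.95 example above):

```python
import numpy as np
from scipy import stats

# Simulated (age, score)-style data -- illustrative, not PREVEND.
rng = np.random.default_rng(2)
x = rng.uniform(40, 80, 60)
y = 137.55 - 1.26 * x + rng.normal(0, 10, 60)

n = len(x)
b1, b0 = np.polyfit(x, y, 1)
resid = y - (b0 + b1 * x)
s = np.sqrt(np.sum(resid**2) / (n - 2))      # residual standard error
sxx = np.sum((x - x.mean())**2)

x0 = 60.0                                     # point of interest
se_mean = s * np.sqrt(1/n + (x0 - x.mean())**2 / sxx)      # CI for E(Y|x0)
se_pred = s * np.sqrt(1 + 1/n + (x0 - x.mean())**2 / sxx)  # PI for new y at x0
t_star = stats.t.ppf(0.975, df=n - 2)

y0 = b0 + b1 * x0
ci = (y0 - t_star * se_mean, y0 + t_star * se_mean)
pi = (y0 - t_star * se_pred, y0 + t_star * se_pred)
print(ci, pi)
```

Because of the extra 1 under the square root, the prediction interval is always wider than the confidence interval at the same x_0.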

6.6 Notes

  • Nonlinearity Assessment: Visually assessing whether a fitted regression line is appropriate for the data is critical; a line can be computed for any dataset, even a clearly nonlinear one.

  • Role of Variability: R^2 indicates the strength of the fit (the proportion of response variability explained by the model) but is influenced by the variability of the data.

  • Model Validation: Essential before drawing conclusions; a small p-value alone must be cautiously interpreted and does not guarantee the model is appropriate.

6.7 Exercises

  • Exercises 6.1-6.32: Identify relationships in scatterplots, fit linear models, interpret regression outputs, assess residuals, and conduct hypothesis tests in various scenarios.