Chapter 6: Simple Linear Regression

6.1 Examining Scatterplots
  • Visualization of Relationships: The relationship between two numerical variables can be visualized using a scatterplot in the xy-plane.

  • Axes:

    • Horizontal Axis: Predictor or explanatory variable.

    • Vertical Axis: Response variable.

  • Purpose of This Chapter: This chapter explores simple linear regression, a technique for estimating a straight line that best fits data on a scatterplot.

    • Functions as a linear model for both prediction and inference.

    • Only applicable to data showing linear or approximately linear relationships.

  • Example of Linear Relationship: NHANES data illustrates a linear relationship between height and weight.

    • Height serves as a predictor of weight.

    • Regression techniques can predict an individual’s weight based on height.

  • Non-Linear Relationships: Not all data relationships are linear. Example: annual per capita income vs. life expectancy.

    • Strong relationships show clear patterns in scatterplots.

    • Weak relationships appear diffuse and unclear.

  • Effect of Measurement Scale Changes: Changing the scale of measurement of one or both variables alters the axes but does not change the nature of the relationship.
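The point about measurement scale can be checked numerically. The sketch below uses made-up height and weight values (illustrative only, not NHANES data) and a hand-written `correlation` helper. It suggests that converting centimeters to inches and kilograms to pounds leaves the correlation coefficient unchanged.

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson correlation coefficient r of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Made-up illustrative values (assumption: NOT taken from NHANES).
height_cm = [152.0, 160.0, 165.0, 171.0, 178.0, 185.0]
weight_kg = [51.0, 58.0, 66.0, 70.0, 79.0, 88.0]

# The same measurements expressed in different units.
height_in = [h / 2.54 for h in height_cm]
weight_lb = [w * 2.20462 for w in weight_kg]

r_metric = correlation(height_cm, weight_kg)
r_imperial = correlation(height_in, weight_lb)

# The two values agree up to floating-point rounding.
print(round(r_metric, 6), round(r_imperial, 6))
```

Only the axis labels and tick values of the scatterplot would change; the shape of the point cloud stays the same.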

6.2 Estimating a Regression Line Using Least Squares
  • Adding a Regression Line: A least squares regression line minimizes the sum of the squared residuals (differences between observed and estimated values).

  • Residual Definition: The residual for the i-th observation is the difference between the observed value and the value predicted by the regression line.

    • Formula: $e_i = y_i - \hat{y}_i$
      where:

    • $y_i$ = observed value

    • $\hat{y}_i$ = predicted value from the regression line

  • Least Squares Regression Line:

    • Minimizes the sum of squared residuals: $e_1^2 + e_2^2 + \dots + e_n^2$

  • Population Regression Model:

    • Formula: $y = \beta_0 + \beta_1 x + \varepsilon$

    • $\varepsilon$: Normally distributed error term with mean 0 and standard deviation $\sigma$.

    • Expected value (mean) represented as $E(Y \mid x) = \beta_0 + \beta_1 x$.

  • Fitting the Model:

    • Example for the PREVEND data predicting RFFT score from age.

    • Regression parameters:

    • $\beta_0$: Intercept (mean value of $y$ when $x = 0$)

    • $\beta_1$: Slope (change in the mean of $y$ for a 1-unit change in $x$)

  • Estimating Parameters:

    • Sample estimates: $b_0$ (intercept) and $b_1$ (slope).

    • Calculating slope: $b_1 = \frac{s_y}{s_x} r$
      where:

    • $r$ = correlation coefficient

    • $s_x, s_y$ = standard deviations of x and y respectively.

    • Intercept calculation: $b_0 = \bar{y} - b_1 \bar{x}$.
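The slope and intercept formulas can be tried on a small data set. The sketch below uses five made-up points and two hypothetical helpers, `fit_line` and `sum_sq_resid`. It computes the slope as r times s_y / s_x and the intercept as ybar minus slope times xbar, then checks that nudging either estimate increases the sum of squared residuals.

```python
from math import sqrt

def fit_line(xs, ys):
    """Return (b0, b1) using b1 = r * s_y / s_x and b0 = ybar - b1 * xbar."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    syy = sum((y - ybar) ** 2 for y in ys)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    s_x, s_y = sqrt(sxx / (n - 1)), sqrt(syy / (n - 1))  # sample std. deviations
    r = sxy / sqrt(sxx * syy)                            # correlation coefficient
    b1 = r * s_y / s_x
    b0 = ybar - b1 * xbar
    return b0, b1

def sum_sq_resid(xs, ys, b0, b1):
    """Sum of squared residuals e_i = y_i - (b0 + b1 * x_i)."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

# Made-up illustrative data (assumption: not from any study).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

b0, b1 = fit_line(xs, ys)
best = sum_sq_resid(xs, ys, b0, b1)
print(round(b0, 3), round(b1, 3))   # 2.2 0.6

# Any other line has a larger sum of squared residuals.
print(best < sum_sq_resid(xs, ys, b0 + 0.1, b1))
print(best < sum_sq_resid(xs, ys, b0, b1 - 0.1))
```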

Example 6.5: Parameter Calculation
  • From summary statistics for PREVEND data:

    • $b_1 = \frac{s_y}{s_x} r$

    • $b_0 = \bar{y} - b_1 \bar{x}$, resulting in the RFFT equation $\widehat{\text{RFFT}} = 137.55 - 1.26(\text{age})$.
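As a quick illustration of how the fitted line is used, the sketch below plugs an age into the equation from Example 6.5. The coefficients are the rounded values quoted above; the function name `predict_rfft` is hypothetical.

```python
# Rounded coefficients from the fitted equation in Example 6.5.
B0 = 137.55   # intercept
B1 = -1.26    # slope: change in predicted RFFT score per year of age

def predict_rfft(age):
    """Predicted mean RFFT score at a given age, according to the fitted line."""
    return B0 + B1 * age

print(round(predict_rfft(60), 2))   # 61.95
```

Because the coefficients are rounded, predictions computed this way may differ slightly from those obtained with unrounded estimates.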

6.3 Interpreting a Linear Model
  • Mathematical Interpretation: The slope is the estimated change in the mean of $y$ for a 1-unit increase in $x$.

    • Example: In PREVEND data, for each 1-year increase in age, the predicted RFFT score decreases by 1.26 points on average.

  • Causation vs. Correlation: Avoid claiming causality from observational study data.

    • Linear model does not imply one variable causes the change in another.

  • Intercept Meaning: Represents the expected outcome (RFFT score) when the explanatory variable is zero.

    • Often has no practical meaning in biological contexts, where a value of zero may be impossible or far outside the observed data.

  • Extrapolation Warning: Predictions should only be made within the observed range of the data; the linear trend may not hold beyond it, especially at extreme values.

  • Checking Model Validity: Key assumptions include:

    1. Linearity: The relationship should show a linear trend.

    2. Constant variability: Variability around the regression line should be consistent.

    3. Independent observations: Each pair of observations should be independent.

    4. Normally distributed residuals: Check after fitting the model.

  • Residual Analysis: Plot residuals to evaluate model fit. A good model has residuals scattered around zero without patterns.

    • Outliers or non-constant variance can indicate issues with model fit.
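A residual check can be sketched without any plotting library. The example below deliberately fits a straight line to made-up data that follow a curve (y = x squared), so the linearity assumption is violated. The helper `fit_and_residuals` is hypothetical.

```python
def fit_and_residuals(xs, ys):
    """Fit a least squares line; return (b0, b1, residuals)."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    resid = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
    return b0, b1, resid

# Made-up data with a curved (quadratic) relationship.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [x ** 2 for x in xs]

b0, b1, resid = fit_and_residuals(xs, ys)

# Least squares residuals always sum to (approximately) zero ...
print(round(sum(resid), 10))

# ... but here they are positive at both ends and negative in the middle,
# a systematic pattern that signals the straight line is a poor model.
for x, e in zip(xs, resid):
    print(f"x = {x:.0f}  residual = {e:+.1f}")
```

A residual plot for these data would show a U shape rather than a patternless band around zero, which is the kind of structure the residual analysis above is meant to catch.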

6.4 Statistical Inference with Regression
  • Purpose: Use regression for statistical inference about the population, primarily concerning the slope parameter $\beta_1$.

  • Sampling Distributions:

    • Under the population model, the estimated slope $b_1$ has an approximately normal sampling distribution centered at $\beta_1$.

  • Null Hypothesis Testing: Null hypothesis $H_0: \beta_1 = 0$ means no association between variables.

    • Calculate test statistic $t = \frac{b_1 - \beta_1^0}{\text{s.e.}(b_1)}$, where $\beta_1^0$ is the null value of the slope (0 under $H_0: \beta_1 = 0$), using a t-distribution with $n-2$ degrees of freedom.

  • Confidence Intervals: Construct confidence intervals for the slope with $b_1 \pm t^{\star}_{df} \times \text{s.e.}(b_1)$, where $t^{\star}_{df}$ is the critical value from the t-distribution with $df = n-2$.
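The test statistic and confidence interval for the slope can be sketched as follows. The notes do not give a formula for the standard error, so this example assumes the standard one: s.e.(b1) = s / sqrt(sum of (x - xbar)^2), where s = sqrt(SSR / (n - 2)) is the residual standard deviation. The helper `slope_inference` and the data are hypothetical, and the critical value is passed in by hand (3.182 is the 97.5th percentile of the t-distribution with 3 degrees of freedom, as found in a t-table).

```python
from math import sqrt

def slope_inference(xs, ys, t_star):
    """Return (b1, se_b1, t_stat, ci) for testing H0: beta_1 = 0."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    ssr = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    s = sqrt(ssr / (n - 2))          # residual standard deviation (assumed formula)
    se_b1 = s / sqrt(sxx)            # standard error of the slope (assumed formula)
    t_stat = (b1 - 0.0) / se_b1      # null value of the slope is 0
    ci = (b1 - t_star * se_b1, b1 + t_star * se_b1)
    return b1, se_b1, t_stat, ci

# Made-up illustrative data: n = 5, so df = n - 2 = 3.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

b1, se_b1, t_stat, ci = slope_inference(xs, ys, t_star=3.182)
print(round(b1, 3), round(se_b1, 3), round(t_stat, 3))
print(tuple(round(v, 3) for v in ci))
```

In this toy example the t-statistic is smaller than the critical value and the 95% interval contains 0, so the two approaches agree that the data do not rule out a zero slope.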

Example 6.17: Association Between Variables
  • Analyzing the association of doctors per 100,000 population with infant mortality using regression analysis.

6.5 Interval Estimates with Regression
  • Confidence Intervals: Build around estimates from the regression line.

    • Example for estimated mean RFFT score at age 60:

    • $61.95 \pm (1.96)(1.14) = (59.72, 64.18)$.

  • Prediction Intervals: Wider than confidence intervals because they account for both the uncertainty in the estimated mean and the variability of individual observations around that mean.

  • Confidence intervals vs. Prediction intervals:

    • Confidence intervals provide mean estimates, while prediction intervals account for individual variability.
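The arithmetic of the age-60 confidence interval, and the reason a prediction interval is wider, can be sketched as below. The estimate 61.95, multiplier 1.96, and standard error 1.14 come from the example above. The residual standard deviation `s_resid` is a made-up placeholder (the notes do not report it), and the prediction standard error uses the standard formula sqrt(s^2 + s.e.(mean)^2), which is an assumption not stated in the notes.

```python
from math import sqrt

def interval(estimate, critical_value, std_error):
    """Return (lower, upper) = estimate -/+ critical_value * std_error."""
    margin = critical_value * std_error
    return estimate - margin, estimate + margin

# Confidence interval for the mean RFFT score at age 60 (values from the notes).
ci = interval(61.95, 1.96, 1.14)
print(tuple(round(v, 2) for v in ci))   # (59.72, 64.18)

# Prediction interval for one individual aged 60.
s_resid = 20.0                              # HYPOTHETICAL residual std. deviation
se_pred = sqrt(s_resid ** 2 + 1.14 ** 2)    # assumed standard formula
pi = interval(61.95, 1.96, se_pred)

# The prediction interval is wider than the confidence interval.
print((pi[1] - pi[0]) > (ci[1] - ci[0]))
```

Whatever the true residual standard deviation is, the extra s^2 term makes the prediction standard error larger than the standard error of the mean, so the prediction interval is always the wider of the two.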

6.6 Notes
  • Nonlinearity Assessment: Visually assessing how well a fitted regression line matches the data is critical.

  • Role of Variability: $R^2$ indicates the strength of the linear fit but is influenced by the variability in the data.

  • Model Validation: Essential in changing contexts; small p-values must be cautiously interpreted.
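The note on $R^2$ and variability can be illustrated with a small sketch. It assumes the usual definition, $R^2 = 1 - SSR/SST$, which the notes do not spell out. The helper `r_squared` and the data are hypothetical. The second data set keeps the same fitted line but doubles every residual, so only the scatter around the line changes.

```python
def r_squared(xs, ys):
    """Return (b0, b1, R^2) for a least squares line, with R^2 = 1 - SSR/SST."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    ssr = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    sst = sum((y - ybar) ** 2 for y in ys)
    return b0, b1, 1.0 - ssr / sst

# Made-up illustrative data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
b0, b1, r2 = r_squared(xs, ys)

# Same underlying line, but every residual doubled (more scatter).
fitted = [b0 + b1 * x for x in xs]
ys_noisy = [f + 2.0 * (y - f) for f, y in zip(fitted, ys)]
b0_n, b1_n, r2_noisy = r_squared(xs, ys_noisy)

print(round(b1, 3), round(b1_n, 3))      # the slope is unchanged
print(round(r2, 3), round(r2_noisy, 3))  # R^2 drops as scatter increases
```

The fitted relationship is identical in both cases, yet $R^2$ falls when the scatter grows, which is why it should not be read as a measure of the slope's size or importance.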

6.7 Exercises
  • Exercises 6.1-6.32: Identify relationships in scatterplots, fit linear models, interpret regression outputs, assess residuals, and conduct hypothesis testing in various scenarios.