Chapter 6: Simple Linear Regression
6.1 Examining Scatterplots
Visualization of Relationships: The relationship between two numerical variables can be visualized using a scatterplot in the xy-plane.
Axes:
Horizontal Axis: Predictor or explanatory variable.
Vertical Axis: Response variable.
Purpose of This Chapter: This chapter explores simple linear regression, a technique for estimating a straight line that best fits data on a scatterplot.
Functions as a linear model for both prediction and inference.
Only applicable to data showing linear or approximately linear relationships.
Example of Linear Relationship: NHANES data illustrates a linear relationship between height and weight.
Height serves as a predictor of weight.
Regression techniques can predict an individual’s weight based on height.
Non-Linear Relationships: Not all data relationships are linear. Example: annual per capita income vs. life expectancy.
Strong relationships show clear patterns in scatterplots.
Weak relationships appear diffuse and unclear.
Effect of Measurement Scale Changes: Changing the scale of measurement of one or both variables alters the axes but does not change the nature of the relationship.
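The distinction between strong and weak relationships above can be sketched numerically: the correlation coefficient summarizes how tight the scatter is. A minimal sketch with hypothetical synthetic data (the variable names echo the NHANES height/weight example but the values are invented):

```python
import numpy as np

# Hypothetical synthetic data illustrating a strong vs. a weak linear
# relationship (illustrative values, not actual NHANES measurements).
rng = np.random.default_rng(0)
height = rng.normal(170, 10, 200)                            # predictor (horizontal axis)
weight_strong = 0.9 * height - 80 + rng.normal(0, 3, 200)    # tight linear pattern
weight_weak = 0.9 * height - 80 + rng.normal(0, 40, 200)     # diffuse, unclear pattern

# A scatterplot of each pair would show the contrast visually; the
# correlation coefficient quantifies it.
r_strong = np.corrcoef(height, weight_strong)[0, 1]
r_weak = np.corrcoef(height, weight_weak)[0, 1]
print(round(r_strong, 2), round(r_weak, 2))
```

Rescaling either variable (say, height in meters instead of centimeters) changes the axis labels but leaves these correlations unchanged.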
6.2 Estimating a Regression Line Using Least Squares
Adding a Regression Line: A least squares regression line minimizes the sum of the squared residuals (differences between observed and estimated values).
Residual Definition: The residual for the i-th observation:
Formula: e_i = y_i - \hat{y}_i
where: y_i = observed value
\hat{y}_i = predicted value from the regression line
Least Squares Regression Line:
Minimizes the sum of squared residuals: e_1^2 + e_2^2 + \dots + e_n^2
Population Regression Model:
Formula: y = \beta_0 + \beta_1 x + \varepsilon
\varepsilon: Normally distributed error term with mean 0 and standard deviation \sigma.
Expected value (mean) represented as E(Y|x) = \beta_0 + \beta_1 x.
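The population model above can be simulated directly, which makes the roles of \beta_0, \beta_1, and \varepsilon concrete. A sketch under illustrative assumptions (the parameter values borrow the chapter's fitted PREVEND numbers as if they were the true population values):

```python
import numpy as np

# Simulate the population model y = beta_0 + beta_1 * x + epsilon,
# treating the chapter's fitted values as illustrative "true" parameters.
rng = np.random.default_rng(1)
beta_0, beta_1, sigma = 137.55, -1.26, 10.0
x = rng.uniform(40, 80, 500)             # e.g., ages in years
epsilon = rng.normal(0, sigma, 500)      # error term: mean 0, sd sigma
y = beta_0 + beta_1 * x + epsilon

# E(Y | x) is the line itself; a least squares fit to the simulated
# sample should recover a slope close to beta_1.
b1_hat, b0_hat = np.polyfit(x, y, 1)
print(round(b1_hat, 2), round(b0_hat, 1))
```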
Fitting the Model:
Example for the PREVEND data predicting RFFT score from age.
Regression parameters:
\beta_0: Intercept (value when x=0)
\beta_1: Slope (change in y for a 1-unit change in x)
Estimating Parameters:
Sample estimates: b_0 (intercept) and b_1 (slope).
Calculating slope: b_1 = \frac{s_y}{s_x} r
where: r = correlation coefficient
s_x, s_y = standard deviations of x and y respectively.
Intercept calculation: b_0 = \bar{y} - b_1 \bar{x}.
Example 6.5: Parameter Calculation
From summary statistics for PREVEND data:
b_1 = \frac{s_y}{s_x} r
b_0 = \bar{y} - b_1 \bar{x}, resulting in the fitted equation RFFT = 137.55 - 1.26(age).
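The two formulas above (b_1 = r s_y / s_x and b_0 = \bar{y} - b_1 \bar{x}) can be verified directly against a least squares fit. A sketch on hypothetical data (not the actual PREVEND summary statistics, though the simulated line mimics the fitted RFFT equation):

```python
import numpy as np

# Compute b1 = r * (s_y / s_x) and b0 = ybar - b1 * xbar from summary
# statistics, on hypothetical data mimicking the RFFT-vs-age setting.
rng = np.random.default_rng(2)
x = rng.uniform(40, 80, 300)                       # ages (illustrative)
y = 137.55 - 1.26 * x + rng.normal(0, 25, 300)     # simulated RFFT scores

r = np.corrcoef(x, y)[0, 1]
b1 = r * (np.std(y, ddof=1) / np.std(x, ddof=1))
b0 = np.mean(y) - b1 * np.mean(x)

# Cross-check: the summary-statistic formulas agree with numpy's
# least squares fit exactly (up to floating point).
slope, intercept = np.polyfit(x, y, 1)
print(round(b1, 3), round(slope, 3))
```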
6.3 Interpreting a Linear Model
Mathematical Interpretation: Slope denotes change in y for a 1-unit increase in x.
Example: In the PREVEND data, for each one-year increase in age, the predicted RFFT score decreases by 1.26 points.
Causation vs. Correlation: Avoid claiming causality from observational study data.
Linear model does not imply one variable causes the change in another.
Intercept Meaning: Represents the expected outcome (RFFT score) when the explanatory variable is zero.
Often not meaningful in biological contexts (e.g., a value of zero for the explanatory variable may lie far outside the data).
Extrapolation Warning: Predictions should only be made within the observed range of data, particularly for extreme values.
Checking Model Validity: Key assumptions include:
Linearity: The relationship should show a linear trend.
Constant variability: Variability around the regression line should be consistent.
Independent observations: Each pair of observations should be independent.
Normally distributed residuals: Check after fitting the model.
Residual Analysis: Plot residuals to evaluate model fit. A good model has residuals scattered around zero without patterns.
Outliers or non-constant variance can indicate issues with model fit.
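The residual analysis described above can be sketched in code: after fitting a line, residuals should center on zero and show no systematic pattern against the predictor. A minimal example on synthetic data (illustrative values only):

```python
import numpy as np

# Residual diagnostics for a least squares fit: residuals average to zero
# and, for a good fit, show no trend against x (illustrative data).
rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 200)
y = 3.0 + 2.0 * x + rng.normal(0, 1.5, 200)    # genuinely linear data

b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x
residuals = y - y_hat

# Mean residual is zero by construction of least squares.
print(round(residuals.mean(), 6))
# Quick pattern check: residuals are uncorrelated with x for an OLS fit;
# a residual plot (residuals vs. x) would show random scatter around zero.
print(round(np.corrcoef(x, residuals)[0, 1], 6))
```

Curvature in the residual plot would flag nonlinearity; a funnel shape would flag non-constant variability.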
6.4 Statistical Inference with Regression
Purpose: Use regression for statistical inference about the population, primarily concerning the slope parameter \beta_1.
Sampling Distributions:
Under the population model, the estimated slope b_1 has an approximately normal sampling distribution.
Null Hypothesis Testing: The null hypothesis H_0: \beta_1 = 0 states that there is no linear association between the variables.
Calculate the test statistic t = \frac{b_1 - \beta_1^0}{s.e.(b_1)}, where \beta_1^0 is the null value of the slope (typically 0), using a t-distribution with n - 2 degrees of freedom.
Confidence Intervals: Construct a confidence interval for the slope as b_1 \pm t^\star_{df} \times s.e.(b_1).
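The test statistic and interval above can be computed by hand from the fitted residuals. A sketch on synthetic data, assuming scipy is available for t-distribution quantiles (values are illustrative, not from Example 6.17):

```python
import numpy as np
from scipy import stats

# Hand-computed t test and 95% CI for the slope: t = (b1 - 0) / s.e.(b1)
# with n - 2 degrees of freedom (synthetic illustrative data).
rng = np.random.default_rng(4)
n = 60
x = rng.uniform(0, 10, n)
y = 1.0 + 0.5 * x + rng.normal(0, 2.0, n)

b1, b0 = np.polyfit(x, y, 1)
residuals = y - (b0 + b1 * x)
s = np.sqrt(np.sum(residuals**2) / (n - 2))        # residual standard error
se_b1 = s / np.sqrt(np.sum((x - x.mean())**2))     # s.e. of the slope

t_stat = b1 / se_b1                                # H0: beta_1 = 0
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)    # two-sided p-value
t_star = stats.t.ppf(0.975, df=n - 2)
ci = (b1 - t_star * se_b1, b1 + t_star * se_b1)
print(round(t_stat, 2), round(p_value, 4), ci)
```

These hand computations match what `scipy.stats.linregress` reports for the same data.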
Example 6.17: Association Between Variables
Analyzing the association of doctors per 100,000 population with infant mortality using regression analysis.
6.5 Interval Estimates with Regression
Confidence Intervals: Build around estimates from the regression line.
Example for estimated mean RFFT score at age 60:
61.95 \pm (1.96)(1.14) = (59.72, 64.18).
Prediction Intervals: Wider than confidence intervals, accounting for variability and prediction uncertainty.
Confidence intervals vs. Prediction intervals:
Confidence intervals provide mean estimates, while prediction intervals account for individual variability.
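The contrast between the two intervals comes down to one extra variance term. A sketch under illustrative assumptions (synthetic data mimicking the RFFT-at-age-60 example; a t^\star quantile is used in place of the 1.96 in the text's calculation):

```python
import numpy as np
from scipy import stats

# Confidence interval for the mean response E(Y | x0) vs. prediction
# interval for a single new observation at x0: the PI's standard error
# carries an extra "+1" term for individual variability.
rng = np.random.default_rng(5)
n = 100
x = rng.uniform(40, 80, n)                         # ages (illustrative)
y = 137.55 - 1.26 * x + rng.normal(0, 25, n)       # simulated RFFT scores

b1, b0 = np.polyfit(x, y, 1)
residuals = y - (b0 + b1 * x)
s = np.sqrt(np.sum(residuals**2) / (n - 2))        # residual standard error
sxx = np.sum((x - x.mean())**2)
t_star = stats.t.ppf(0.975, df=n - 2)

x0 = 60
y0_hat = b0 + b1 * x0
se_mean = s * np.sqrt(1/n + (x0 - x.mean())**2 / sxx)       # CI for the mean
se_pred = s * np.sqrt(1 + 1/n + (x0 - x.mean())**2 / sxx)   # PI for a new y
ci = (y0_hat - t_star * se_mean, y0_hat + t_star * se_mean)
pi = (y0_hat - t_star * se_pred, y0_hat + t_star * se_pred)
print(ci, pi)
```

The prediction interval always contains the confidence interval, since se_pred > se_mean at every x0.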
6.6 Notes
Nonlinearity Assessment: Visual assessment of data fits with an applied regression line is critical.
Role of Variability: R^2 indicates the strength of a linear fit but is influenced by the variability of the data.
Model Validation: Essential in changing contexts; small p-values must be cautiously interpreted.
6.7 Exercises
Exercises 6.1-6.32: Identify relationships in scatterplots, fit linear models, interpret regression output, assess residuals, and conduct hypothesis tests in various scenarios.