Chapter 6: Simple Linear Regression
6.1 Examining Scatterplots
Visualization of Relationships: The relationship between two numerical variables can be visualized using a scatterplot in the xy-plane.
Axes:
Horizontal Axis: Predictor or explanatory variable.
Vertical Axis: Response variable.
Purpose of This Chapter: This chapter explores simple linear regression, a technique for estimating a straight line that best fits data on a scatterplot.
Functions as a linear model for both prediction and inference.
Only applicable to data showing linear or approximately linear relationships.
Example of Linear Relationship: NHANES data illustrates a linear relationship between height and weight.
Height serves as a predictor of weight.
Regression techniques can predict an individual’s weight based on height.
Non-Linear Relationships: Not all data relationships are linear. Example: annual per capita income vs. life expectancy.
Strong relationships show clear patterns in scatterplots.
Weak relationships appear diffuse and unclear.
Effect of Measurement Scale Changes: Changing the scale of measurement of one or both variables alters the axes but does not change the nature of the relationship.
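The distinction between strong and weak relationships above can be sketched numerically: the correlation coefficient summarizes how tight the scatter is. A minimal sketch with hypothetical synthetic data (the variable names echo the NHANES height/weight example but the values are invented):

```python
import numpy as np

# Hypothetical synthetic data illustrating a strong vs. a weak linear
# relationship (illustrative values, not actual NHANES measurements).
rng = np.random.default_rng(0)
height = rng.normal(170, 10, 200)                            # predictor (horizontal axis)
weight_strong = 0.9 * height - 80 + rng.normal(0, 3, 200)    # tight linear pattern
weight_weak = 0.9 * height - 80 + rng.normal(0, 40, 200)     # diffuse, unclear pattern

# A scatterplot of each pair would show the contrast visually; the
# correlation coefficient quantifies it.
r_strong = np.corrcoef(height, weight_strong)[0, 1]
r_weak = np.corrcoef(height, weight_weak)[0, 1]
print(round(r_strong, 2), round(r_weak, 2))
```

Rescaling either variable (say, height in meters instead of centimeters) changes the axis labels but leaves these correlations unchanged.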
6.2 Estimating a Regression Line Using Least Squares
Adding a Regression Line: A least squares regression line minimizes the sum of the squared residuals (differences between observed and estimated values).
Residual Definition: The residual for the i-th observation:
Formula: e_i = y_i - \hat{y}_i
where: y_i = observed value
\hat{y}_i = predicted value from the regression line
Least Squares Regression Line:
Minimizes the sum of squared residuals: e_1^2 + e_2^2 + \dots + e_n^2
Population Regression Model:
Formula: y = \beta_0 + \beta_1 x + \varepsilon
\varepsilon: Normally distributed error term with mean 0 and standard deviation \sigma.
Expected value (mean) represented as E(Y|x) = \beta_0 + \beta_1 x.
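The population model above can be simulated directly, which makes the roles of \beta_0, \beta_1, and \varepsilon concrete. A sketch under illustrative assumptions (the parameter values borrow the chapter's fitted PREVEND numbers as if they were the true population values):

```python
import numpy as np

# Simulate the population model y = beta_0 + beta_1 * x + epsilon,
# treating the chapter's fitted values as illustrative "true" parameters.
rng = np.random.default_rng(1)
beta_0, beta_1, sigma = 137.55, -1.26, 10.0
x = rng.uniform(40, 80, 500)             # e.g., ages in years
epsilon = rng.normal(0, sigma, 500)      # error term: mean 0, sd sigma
y = beta_0 + beta_1 * x + epsilon

# E(Y | x) is the line itself; a least squares fit to the simulated
# sample should recover a slope close to beta_1.
b1_hat, b0_hat = np.polyfit(x, y, 1)
print(round(b1_hat, 2), round(b0_hat, 1))
```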
Fitting the Model:
Example for the PREVEND data predicting RFFT score from age.
Regression parameters:
\beta_0: Intercept (value when x=0)
\beta_1: Slope (change in y for a 1-unit change in x)
Estimating Parameters:
Sample estimates: b_0 (intercept) and b_1 (slope).
Calculating slope: b_1 = \frac{s_y}{s_x} r
where: r = correlation coefficient
s_x, s_y = standard deviations of x and y respectively.
Intercept calculation: b_0 = \bar{y} - b_1 \bar{x}.
Example 6.5: Parameter Calculation
From summary statistics for PREVEND data:
b_1 = \frac{s_y}{s_x} r
b_0 = \bar{y} - b_1 \bar{x}, resulting in the fitted equation RFFT = 137.55 - 1.26(age).
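The two formulas above (b_1 = r s_y / s_x and b_0 = \bar{y} - b_1 \bar{x}) can be verified directly against a least squares fit. A sketch on hypothetical data (not the actual PREVEND summary statistics, though the simulated line mimics the fitted RFFT equation):

```python
import numpy as np

# Compute b1 = r * (s_y / s_x) and b0 = ybar - b1 * xbar from summary
# statistics, on hypothetical data mimicking the RFFT-vs-age setting.
rng = np.random.default_rng(2)
x = rng.uniform(40, 80, 300)                       # ages (illustrative)
y = 137.55 - 1.26 * x + rng.normal(0, 25, 300)     # simulated RFFT scores

r = np.corrcoef(x, y)[0, 1]
b1 = r * (np.std(y, ddof=1) / np.std(x, ddof=1))
b0 = np.mean(y) - b1 * np.mean(x)

# Cross-check: the summary-statistic formulas agree with numpy's
# least squares fit exactly (up to floating point).
slope, intercept = np.polyfit(x, y, 1)
print(round(b1, 3), round(slope, 3))
```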
6.3 Interpreting a Linear Model
Mathematical Interpretation: Slope denotes change in y for a 1-unit increase in x.
Example: In the PREVEND data, for each one-year increase in age, the predicted RFFT score decreases by 1.26 points.
Causation vs. Correlation: Avoid claiming causality from observational study data.
Linear model does not imply one variable causes the change in another.
Intercept Meaning: Represents the expected outcome (RFFT score) when the explanatory variable is zero.
Often not meaningful in biological contexts (e.g., a value of zero for the explanatory variable may lie far outside the data).
Extrapolation Warning: Predictions should only be made within the observed range of data, particularly for extreme values.
Checking Model Validity: Key assumptions include:
Linearity: The relationship should show a linear trend.
Constant variability: Variability around the regression line should be consistent.
Independent observations: Each pair of observations should be independent.
Normally distributed residuals: Check after fitting the model.
Residual Analysis: Plot residuals to evaluate model fit. A good model has residuals scattered around zero without patterns.
Outliers or non-constant variance can indicate issues with model fit.
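The residual analysis described above can be sketched in code: after fitting a line, residuals should center on zero and show no systematic pattern against the predictor. A minimal example on synthetic data (illustrative values only):

```python
import numpy as np

# Residual diagnostics for a least squares fit: residuals average to zero
# and, for a good fit, show no trend against x (illustrative data).
rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 200)
y = 3.0 + 2.0 * x + rng.normal(0, 1.5, 200)    # genuinely linear data

b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x
residuals = y - y_hat

# Mean residual is zero by construction of least squares.
print(round(residuals.mean(), 6))
# Quick pattern check: residuals are uncorrelated with x for an OLS fit;
# a residual plot (residuals vs. x) would show random scatter around zero.
print(round(np.corrcoef(x, residuals)[0, 1], 6))
```

Curvature in the residual plot would flag nonlinearity; a funnel shape would flag non-constant variability.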
6.4 Statistical Inference with Regression
Purpose: Use regression for statistical inference about the population, primarily concerning the slope parameter \beta_1.
Sampling Distributions:
Under the population model, the estimated slope b_1 has an approximately normal sampling distribution.
Null Hypothesis Testing: The null hypothesis H_0: \beta_1 = 0 states that there is no linear association between the variables.
Calculate the test statistic t = \frac{b_1 - \beta_1^0}{s.e.(b_1)}, where \beta_1^0 is the null value of the slope (typically 0), using a t-distribution with n - 2 degrees of freedom.
Confidence Intervals: Construct a confidence interval for the slope as b_1 \pm t^\star_{df} \times s.e.(b_1).
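The test statistic and interval above can be computed by hand from the fitted residuals. A sketch on synthetic data, assuming scipy is available for t-distribution quantiles (values are illustrative, not from Example 6.17):

```python
import numpy as np
from scipy import stats

# Hand-computed t test and 95% CI for the slope: t = (b1 - 0) / s.e.(b1)
# with n - 2 degrees of freedom (synthetic illustrative data).
rng = np.random.default_rng(4)
n = 60
x = rng.uniform(0, 10, n)
y = 1.0 + 0.5 * x + rng.normal(0, 2.0, n)

b1, b0 = np.polyfit(x, y, 1)
residuals = y - (b0 + b1 * x)
s = np.sqrt(np.sum(residuals**2) / (n - 2))        # residual standard error
se_b1 = s / np.sqrt(np.sum((x - x.mean())**2))     # s.e. of the slope

t_stat = b1 / se_b1                                # H0: beta_1 = 0
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)    # two-sided p-value
t_star = stats.t.ppf(0.975, df=n - 2)
ci = (b1 - t_star * se_b1, b1 + t_star * se_b1)
print(round(t_stat, 2), round(p_value, 4), ci)
```

These hand computations match what `scipy.stats.linregress` reports for the same data.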
Example 6.17: Association Between Variables
Analyzing the association of doctors per 100,000 population with infant mortality using regression analysis.
6.5 Interval Estimates with Regression
Confidence Intervals: Build around estimates from the regression line.
Example for estimated mean RFFT score at age 60:
61.95 \pm (1.96)(1.14) = (59.72, 64.18).
Prediction Intervals: Wider than confidence intervals, accounting for variability and prediction uncertainty.
Confidence intervals vs. Prediction intervals:
Confidence intervals provide mean estimates, while prediction intervals account for individual variability.
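The contrast between the two intervals comes down to one extra variance term. A sketch under illustrative assumptions (synthetic data mimicking the RFFT-at-age-60 example; a t^\star quantile is used in place of the 1.96 in the text's calculation):

```python
import numpy as np
from scipy import stats

# Confidence interval for the mean response E(Y | x0) vs. prediction
# interval for a single new observation at x0: the PI's standard error
# carries an extra "+1" term for individual variability.
rng = np.random.default_rng(5)
n = 100
x = rng.uniform(40, 80, n)                         # ages (illustrative)
y = 137.55 - 1.26 * x + rng.normal(0, 25, n)       # simulated RFFT scores

b1, b0 = np.polyfit(x, y, 1)
residuals = y - (b0 + b1 * x)
s = np.sqrt(np.sum(residuals**2) / (n - 2))        # residual standard error
sxx = np.sum((x - x.mean())**2)
t_star = stats.t.ppf(0.975, df=n - 2)

x0 = 60
y0_hat = b0 + b1 * x0
se_mean = s * np.sqrt(1/n + (x0 - x.mean())**2 / sxx)       # CI for the mean
se_pred = s * np.sqrt(1 + 1/n + (x0 - x.mean())**2 / sxx)   # PI for a new y
ci = (y0_hat - t_star * se_mean, y0_hat + t_star * se_mean)
pi = (y0_hat - t_star * se_pred, y0_hat + t_star * se_pred)
print(ci, pi)
```

The prediction interval always contains the confidence interval, since se_pred > se_mean at every x0.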
6.6 Notes
Nonlinearity Assessment: Visual assessment of data fits with an applied regression line is critical.
Role of Variability: R^2 indicates the strength of a linear fit but is influenced by the variability of the data.
Model Validation: Essential in changing contexts; small p-values must be cautiously interpreted.
6.7 Exercises
Exercises 6.1-6.32: Identify relationships in scatterplots, fit linear models, interpret regression output, assess residuals, and conduct hypothesis tests in various scenarios.