Linear Regression II

Overview

  • Course: Biostatistics 521: Applied Biostatistics

  • Instructor: Mousumi Banerjee

  • Topics Covered:

    • Various Regression Topics

    • Example: Poverty vs. Murder Rate

    • Confidence Intervals for Regression

    • Extrapolation

    • Centering Numerical Covariates

    • Transformations

    • Outlier Analysis

    • Linear Regression with Categorical Covariates

Example: Poverty vs. Murder Rate
Exercise 8.25

  • Objective: Predict annual murders per million from the percentage of individuals living in poverty in a random sample of 20 metropolitan areas.

  • Steps for Performing Simple Linear Regression:

    1. Assess whether the data points are independent.

    2. Create a scatterplot to evaluate if the relationship appears approximately linear.

    3. Fit the regression line to the data.

    4. Create residual plots and QQ plots.

    5. Confirm the linearity of the relationship.

    6. Check for constant variance of residuals.

    7. Check for normality of residuals.

    8. Perform inference on the fitted line.

    9. Compute predicted values for annual murders based on poverty levels.
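Steps 3 and 4 can be sketched in plain Python. This is a minimal illustration rather than the course's own code: the data values below are hypothetical (the 20 metropolitan areas are not reproduced in these notes), and the helper names `fit_line` and `residuals` are introduced here only for the example.

```python
# Hypothetical data for illustration only; NOT the sample from Exercise 8.25.
poverty = [14.0, 16.5, 18.2, 19.4, 20.1, 21.7, 23.3, 25.0]  # % living in poverty
murders = [6.1, 12.9, 14.8, 21.0, 19.2, 27.5, 28.9, 35.4]   # annual murders per million


def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line y = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def residuals(xs, ys, intercept, slope):
    """Return observed minus fitted values, one per data point."""
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


b0, b1 = fit_line(poverty, murders)
res = residuals(poverty, murders, b0, b1)
print(f"fitted line: M = {b0:.2f} + {b1:.3f} * P")
print("residuals:", [round(r, 2) for r in res])
```

Plotting `res` against `poverty` gives the residual plot used in steps 5 and 6. Least-squares residuals always sum to zero, which is a quick sanity check on the fit.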

Residual Analysis

  • Assessing Assumptions:

    • The absence of clear patterns in the residual plot suggests that the linearity and constant-variance assumptions are reasonable.

    • The QQ plot, which compares the observed residuals to theoretical normal quantiles, suggests a possible slight departure from normality.
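The comparison of observed residuals to theoretical quantiles can be sketched as follows. This is an illustrative sketch under stated assumptions: the residual values are hypothetical, the helper `qq_points` is a name introduced here, and the plotting positions (i − 0.5)/n are one common convention among several.

```python
from statistics import NormalDist, mean, stdev


def qq_points(res):
    """Pair each sorted standardized residual with a theoretical normal quantile."""
    n = len(res)
    m, s = mean(res), stdev(res)
    standardized = sorted((r - m) / s for r in res)
    # Plotting positions (i - 0.5) / n: an assumed convention, not from the notes.
    theoretical = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, standardized))


# Hypothetical residuals, for illustration only.
example_residuals = [-4.2, -1.3, 0.4, 2.9, -0.8, 1.7, -2.5, 3.8]

for theo, obs in qq_points(example_residuals):
    print(f"theoretical {theo:6.2f}   observed {obs:6.2f}")
```

If the residuals are roughly normal, the printed pairs fall close to the line observed = theoretical; systematic curvature at the ends points to skewness or heavy tails.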

Regression Inference
Key Question

  • Is poverty rate a significant predictor for annual murder rate?

Fitted Model

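  • M = −29.9 + 2.559 × P, where M is the annual number of murders per million and P is the percentage of individuals living in poverty.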

  • Hypothesis Testing:

    • Null Hypothesis (H0): \beta_P = 0 (no relationship between poverty and murder rate)

    • Alternative Hypothesis (H1): \beta_P \neq 0

  • Test Result:

    • p = 3.64 × 10^−4

    • Conclusion: Because the p-value is far below 0.05, reject the null hypothesis; poverty rate is a statistically significant predictor of the annual murder rate.
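The test of the slope can be sketched by forming the t statistic and comparing it to a critical value. Assumptions in this sketch: the standard error 0.390 is the value used in the confidence-interval example later in these notes, n = 20 comes from the exercise statement, and the critical value 2.101 is the two-sided 5% value for 18 degrees of freedom read from a standard t table. The variable names are illustrative.

```python
slope_estimate = 2.559   # fitted slope from the notes
slope_se = 0.390         # standard error, as used in the interval example in these notes
n = 20                   # sample size from the exercise statement
df = n - 2               # degrees of freedom for simple linear regression

# Two-sided 5% critical value for 18 df, taken from a standard t table (assumption).
T_CRIT_18 = 2.101

# Test statistic for H0: slope = 0.
t_stat = (slope_estimate - 0.0) / slope_se
reject_h0 = abs(t_stat) > T_CRIT_18

print(f"t = {t_stat:.2f} on {df} df; reject H0 at the 5% level: {reject_h0}")
```

Comparing the statistic to a tabulated critical value avoids needing a t-distribution routine; a statistics package would report the exact p-value instead.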

Interpretation of Regression Model
Fitted Model Interpretation

  • For each 1-unit increase in the percentage of individuals living in poverty, the expected annual number of murders per million increases by 2.559.

  • This relationship can be scaled up: a 10-percentage-point increase in poverty is associated with an expected increase of approximately 25.6 murders per million (10 × 2.559).

Confidence Intervals for Regression Slope Parameter
Construction of Confidence Intervals

  • Point Estimates:

    • \hat{\beta}_0 and \hat{\beta}_1 are the best guesses at the true population parameters \beta_0 and \beta_1.

  • Confidence Interval Formula:

    • The 100(1 − \alpha)% confidence interval for the population slope parameter \beta_1 is: \hat{\beta}_1 \pm t_{\alpha/2,\, n-2} \times S(\hat{\beta}_1)

    • Where n is the sample size and S(\hat{\beta}_1) is the standard error of the slope estimate.

  • When n is large, t can be replaced with z.

Example: Interpretation of Confidence Interval

  • 95% Confidence Interval for the effect of % in poverty on the annual murder rate:

  • 2.559 ± (1.96 × 0.390) = (1.79, 3.32)

  • Conclusion: We are 95% confident that the true slope, the change in expected annual murders per million for each one-point increase in the poverty percentage, is between 1.79 and 3.32.
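The interval arithmetic can be checked with a short sketch. The helper name `slope_ci` is introduced here for illustration. The second call is an addition, not part of the notes: since the sample has only n = 20 areas, it uses the t critical value for 18 degrees of freedom (2.101, read from a standard t table) in place of 1.96, which gives a slightly wider interval of roughly (1.74, 3.38).

```python
def slope_ci(estimate, se, critical_value):
    """Return (lower, upper) for estimate +/- critical_value * se."""
    margin = critical_value * se
    return estimate - margin, estimate + margin


# Large-sample (z) interval, as computed in the notes.
z_low, z_high = slope_ci(2.559, 0.390, 1.96)

# t-based interval for n = 20 (18 df); 2.101 is from a t table (assumption).
t_low, t_high = slope_ci(2.559, 0.390, 2.101)

print(f"z-based 95% CI: ({z_low:.2f}, {z_high:.2f})")
print(f"t-based 95% CI: ({t_low:.2f}, {t_high:.2f})")
```

Both intervals exclude zero, so the choice of critical value does not change the conclusion here.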

Prediction of Values
Example: Predicting Annual Murders

  • Question: What is the predicted annual murder rate for a city where 22% of the population lives in poverty?

  • Calculation:

    • M = −29.9 + 2.559 × 22

    • M = 26.398

    • Conclusion: The predicted murder rate for this city is approximately 26.4 per million.
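The prediction is a direct evaluation of the fitted line, sketched below. The function name `predict_murder_rate` is illustrative; the coefficients are those of the fitted model in these notes.

```python
INTERCEPT = -29.9   # fitted intercept from the notes
SLOPE = 2.559       # fitted slope from the notes


def predict_murder_rate(poverty_pct):
    """Predicted annual murders per million at a given poverty percentage."""
    return INTERCEPT + SLOPE * poverty_pct


predicted = predict_murder_rate(22)
print(f"predicted murders per million at 22% poverty: {predicted:.1f}")

# The difference between two predictions depends only on the slope.
difference = predict_murder_rate(22) - predict_murder_rate(12)
print(f"change for a 10-point difference in poverty: {difference:.2f}")
```

The second calculation shows why the slope can be scaled: the intercept cancels when two predictions are subtracted.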

Confidence Intervals for Predicted Values
Precision of Predicted Outcomes

Formula for Confidence Intervals at a Fixed Value

  • Confidence Interval Construction:

    • Predicted Value ± t_{n-2} × s_r

    • Where s_r is derived from the residuals and measures the precision of the predicted value.

Example Calculation

  • For a prediction of a car's fuel efficiency (mpg) when weight is fixed at 3.1K lbs:

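  • 20.71 ± 1.96 × 3.046

    • Result: 95% Confidence Interval for predicted mean mpg is (19.65, 21.77).

    • Note: the reported interval has a half-width of about 1.06, so the standard error actually applied is about 0.54 rather than 3.046. This is consistent with 3.046 being a residual standard deviation that is divided by √n for n of about 32, although the sample size is not stated here.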
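A confidence interval for the mean response at a fixed predictor value can be sketched with the standard textbook formula, in which the standard error is s × sqrt(1/n + (x0 − x̄)² / Sxx). This is an assumption about what the notes' s_r stands for, not a statement of the instructor's method. The data are hypothetical, the function name `mean_response_ci` is illustrative, and 2.447 is the two-sided 5% t value for 6 degrees of freedom from a t table.

```python
import math


def mean_response_ci(xs, ys, x0, t_crit):
    """Return (lower, fitted, upper) for the mean response at x0."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Residual standard deviation on n - 2 degrees of freedom.
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    s = math.sqrt(sse / (n - 2))
    # Standard error of the fitted mean; smallest when x0 equals x_bar.
    se_mean = s * math.sqrt(1 / n + (x0 - x_bar) ** 2 / sxx)
    fitted = intercept + slope * x0
    return fitted - t_crit * se_mean, fitted, fitted + t_crit * se_mean


# Hypothetical weights (in 1000 lbs) and mpg values, for illustration only.
weight = [2.2, 2.6, 2.9, 3.2, 3.4, 3.6, 3.8, 4.1]
mpg = [29.5, 25.1, 23.0, 21.2, 19.0, 18.3, 16.1, 14.9]

# n = 8 hypothetical points, so 6 df; 2.447 is from a t table (assumption).
low, fitted, high = mean_response_ci(weight, mpg, 3.1, 2.447)
print(f"fitted mean mpg at 3.1K lbs: {fitted:.2f}, 95% CI: ({low:.2f}, {high:.2f})")
```

When x0 is close to the sample mean of the predictor, the second term under the square root is near zero and the standard error is approximately s/√n.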

Extrapolation
Definition and Caution

  • Definition: Extrapolating refers to predicting outcomes for exposure levels outside the range of data that the regression model was fit on.

  • Caution: Extrapolation is discouraged as it can result in poor predictions due to potential nonlinear relationships outside the observed range.

Example Scenarios in Extrapolation

  • Case Study: A one-year-old's shoulder girth of 56 cm is outside the model's data range (85 to 135 cm).

  • Is it appropriate to use the model to predict for this child?

    • No. We have no data near 56 cm, so assuming that the fitted linear relationship holds there is speculative and could produce inaccurate predictions.
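A simple guard against extrapolation is to compare the requested predictor value with the observed range before predicting. The sketch below uses the range and the child's shoulder girth from the case study; the function name `check_in_range` and the constant names are illustrative.

```python
def check_in_range(x0, observed_min, observed_max):
    """Return True if x0 lies within the observed predictor range."""
    return observed_min <= x0 <= observed_max


GIRTH_MIN_CM, GIRTH_MAX_CM = 85, 135   # data range stated in the notes
child_girth_cm = 56                    # one-year-old from the case study

if not check_in_range(child_girth_cm, GIRTH_MIN_CM, GIRTH_MAX_CM):
    print(
        f"{child_girth_cm} cm is outside [{GIRTH_MIN_CM}, {GIRTH_MAX_CM}] cm: "
        "a prediction here would be an extrapolation."
    )
```

A check like this only flags values outside the range; it cannot say whether the linear form holds there, which is exactly what the data cannot tell us.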

Conclusion

  • To ensure accurate predictions, maintain analysis within established data ranges whenever possible, and always assess underlying assumptions in regression modeling.