Linear Regression II
Overview
Course: Biostatistics 521: Applied Biostatistics
Instructor: Mousumi Banerjee
Topics Covered:
Various Regression Topics
Example: Poverty vs. Murder Rate
Confidence Intervals for Regression
Extrapolation
Centering Numerical Covariates
Transformations
Outliers Analysis
Linear Regression with Categorical Covariates
Example: Poverty vs. Murder Rate
Exercise 8.25
Objective: Predict annual murders per million from the percentage of individuals living in poverty in a random sample of 20 metropolitan areas.
Steps for Performing Simple Linear Regression:
Assess whether the data points are independent.
Create a scatterplot to evaluate if the relationship appears approximately linear.
Fit the regression line to the data.
Create residual plots and QQ plots.
Confirm the linearity of the relationship.
Check for constant variance of residuals.
Check for normality of residuals.
Perform inference on the fitted line.
Compute predicted values for annual murders based on poverty levels.
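The fitting and residual steps above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function names and the six data points are made up for the example and are not the 20 metropolitan areas from the exercise.

```python
from statistics import mean

def fit_simple_regression(x, y):
    """Return (intercept, slope) of the least-squares line y = b0 + b1 * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

def residuals(x, y, intercept, slope):
    """Observed minus fitted values: the input to residual plots and QQ plots."""
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# Made-up illustrative data, NOT the data from Exercise 8.25.
poverty_pct = [14.0, 16.0, 18.0, 20.0, 22.0, 24.0]
murders_per_million = [6.0, 11.5, 15.0, 22.0, 26.5, 31.0]

b0, b1 = fit_simple_regression(poverty_pct, murders_per_million)
res = residuals(poverty_pct, murders_per_million, b0, b1)
print(f"intercept = {b0:.3f}, slope = {b1:.3f}")
print("residuals:", [round(r, 2) for r in res])
```

Plotting `res` against `poverty_pct` (residual plot) and against normal quantiles (QQ plot) would then cover the linearity, constant-variance, and normality checks.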
Residual Analysis
Assessing Assumptions:
The residual plot shows no clear pattern, which supports the linearity and constant-variance assumptions.
The QQ plot, which compares the observed residuals with theoretical normal quantiles, suggests a slight departure from normality.
Regression Inference
Key Question
Is poverty rate a significant predictor for annual murder rate?
Fitted Model
$M = -29.9 + 2.559 \times P$
Hypothesis Testing:
Null Hypothesis ($H_0$): $\beta_P = 0$ (no linear relationship between poverty and murder rate)
Alternative Hypothesis ($H_1$): $\beta_P \neq 0$
Test Result (p-value):
$p = 3.64 \times 10^{-4}$
Conclusion: Reject the null hypothesis, indicating a significant relationship.
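As a rough check on this conclusion, the t statistic can be recomputed from the fitted slope and its standard error (0.390, the value used in the confidence-interval example in these notes). The critical value 2.101 is the two-sided 5% cutoff for 18 degrees of freedom from a standard t table; it is an assumption brought in for this sketch and does not appear in the notes.

```python
slope_estimate = 2.559   # fitted slope from the notes
slope_se = 0.390         # standard error of the slope from the notes
n = 20                   # sample size: 20 metropolitan areas
df = n - 2               # degrees of freedom for simple linear regression

t_stat = slope_estimate / slope_se

# Two-sided 5% critical value for t with 18 df, from a standard t table
# (an assumption for this sketch, not a value given in the notes).
T_CRIT_18 = 2.101

reject_null = abs(t_stat) > T_CRIT_18
print(f"t = {t_stat:.2f} on {df} df; reject H0 at the 5% level: {reject_null}")
```

The statistic is far beyond the cutoff, which agrees with rejecting the null hypothesis.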
Interpretation of Regression Model
Fitted Model Interpretation
For each 1-percentage-point increase in the share of individuals living in poverty, the expected annual number of murders per million increases by 2.559.
This relationship can be scaled up: a 10-percentage-point increase in poverty is associated with an expected increase of about 25.6 murders per million ($10 \times 2.559$).
Confidence Intervals for Regression Slope Parameter
Construction of Confidence Intervals
Point Estimates:
$\hat{\beta}_0$ and $\hat{\beta}_1$ are the best guesses at the true population parameters $\beta_0$ and $\beta_1$.
Confidence Interval Formula:
The $(1-\alpha) \times 100\%$ confidence interval for the population slope parameter $\beta_1$ is: $\hat{\beta}_1 \pm t_{\alpha/2,\,n-2} \times S(\hat{\beta}_1)$
Where $n$ is the sample size.
When $n$ is large, $t$ can be replaced with $z$.
Example: Interpretation of Confidence Interval
95% Confidence Interval for the effect of % in poverty on the annual murder rate:
$2.559 \pm (1.96 \times 0.390) = (1.79, 3.32)$
Conclusion: We are 95% confident that the true slope, the expected change in annual murders per million for each 1-percentage-point increase in poverty, is between 1.79 and 3.32.
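The interval arithmetic can be reproduced with a short helper. The sketch below also shows what happens if the t critical value for $n - 2 = 18$ degrees of freedom (2.101, taken from a standard t table and not stated in the notes) is used in place of 1.96, since the sample here has only 20 observations. The function name is hypothetical.

```python
def slope_confidence_interval(estimate, std_error, critical_value):
    """Return (lower, upper) for estimate +/- critical_value * std_error."""
    margin = critical_value * std_error
    return estimate - margin, estimate + margin

SLOPE = 2.559      # fitted slope from the notes
SLOPE_SE = 0.390   # standard error of the slope from the notes

# Large-sample version used in the notes (z = 1.96).
z_low, z_high = slope_confidence_interval(SLOPE, SLOPE_SE, 1.96)

# Small-sample version: t critical value for 18 df from a t table (assumption).
t_low, t_high = slope_confidence_interval(SLOPE, SLOPE_SE, 2.101)

print(f"z-based 95% CI: ({z_low:.2f}, {z_high:.2f})")
print(f"t-based 95% CI: ({t_low:.2f}, {t_high:.2f})")
```

The t-based interval is slightly wider, but both exclude zero, so the conclusion about the slope does not change.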
Prediction of Values
Example: Predicting Annual Murders
Question: What is the predicted annual murder rate for a city where 22% of the population lives in poverty?
Calculation:
$M = -29.9 + 2.559 \times 22$
$M = 26.398$
Conclusion: The predicted murder rate for this city is approximately 26.4 per million.
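The same calculation as a small function, in case it helps to evaluate the fitted line at other poverty levels. The coefficients come from the fitted model in the notes; the function name is made up for this sketch.

```python
INTERCEPT = -29.9   # fitted intercept from the notes
SLOPE = 2.559       # fitted slope from the notes

def predict_murders_per_million(poverty_pct):
    """Evaluate the fitted line M = -29.9 + 2.559 * P at a poverty percentage."""
    return INTERCEPT + SLOPE * poverty_pct

prediction = predict_murders_per_million(22)
print(f"Predicted murders per million at 22% poverty: {prediction:.3f}")
```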
Confidence Intervals for Predicted Values
Precision of Predicted Outcomes
Formula for Confidence Intervals at a Fixed Value
Confidence Interval Construction:
$\text{Predicted Value} \pm t_{n-2} \times s_r$
Where $s_r$ is derived from residuals and provides a measure of precision around the predicted value.
Example Calculation
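For a prediction of a car's mean fuel efficiency (mpg) when weight is fixed at 3,100 lbs:
$20.71 \pm 1.96 \times s_r$
Here $s_r \approx 0.54$ is the standard error of the predicted mean. The value 3.046 appears to be the residual standard deviation, which must be scaled down before it is used as $s_r$; multiplying 1.96 by 3.046 directly would give a half-width of about 5.97 rather than 1.06.
Result: The 95% confidence interval for the predicted mean mpg is (19.65, 21.77).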
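One common textbook form for the standard error of the predicted mean at a fixed value $x_0$ is $s \sqrt{1/n + (x_0 - \bar{x})^2 / S_{xx}}$, where $s$ is the residual standard deviation. The sketch below assumes this is what $s_r$ stands for, which the notes do not state explicitly. The data are made up for illustration (they are not the car data behind the example), and 2.447 is the two-sided 5% t critical value for 6 degrees of freedom from a standard table.

```python
from math import sqrt
from statistics import mean

def mean_response_interval(x, y, x0, critical_value):
    """Confidence interval for the mean response at x0 in simple linear regression."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar

    # Residual standard deviation on n - 2 degrees of freedom.
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    s = sqrt(sse / (n - 2))

    # Standard error of the fitted mean at x0 (assumed meaning of s_r).
    se_mean = s * sqrt(1 / n + (x0 - x_bar) ** 2 / sxx)

    fitted = intercept + slope * x0
    return fitted - critical_value * se_mean, fitted + critical_value * se_mean

# Made-up illustrative data: weight in thousands of lbs, fuel efficiency in mpg.
weight_klbs = [2.2, 2.6, 2.9, 3.2, 3.4, 3.6, 4.0, 4.5]
mpg = [28.0, 25.5, 22.0, 21.0, 19.5, 18.0, 16.5, 13.0]

T_CRIT_6 = 2.447  # two-sided 5% t critical value for 8 - 2 = 6 df (t table)

low, high = mean_response_interval(weight_klbs, mpg, 3.1, T_CRIT_6)
print(f"95% CI for mean mpg at 3.1K lbs: ({low:.2f}, {high:.2f})")
```

Because of the $(x_0 - \bar{x})^2$ term, the interval is narrowest near the mean of the predictor and widens as $x_0$ moves away from it.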
Extrapolation
Definition and Caution
Definition: Extrapolation means predicting outcomes for exposure levels outside the range of the data to which the regression model was fitted.
Caution: Extrapolation is discouraged as it can result in poor predictions due to potential nonlinear relationships outside the observed range.
Example Scenarios in Extrapolation
Case Study: A one-year-old has a shoulder girth of 56 cm, which lies outside the range of the data used to fit the model (85 to 135 cm).
Is it appropriate to use the model to make a prediction for this child?
No. We have no data near 56 cm, so assuming that the fitted linear relationship continues to hold there is speculative and could lead to inaccurate predictions.
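One way to enforce this caution in practice is to have the prediction function refuse inputs outside the observed range. The sketch below uses the 85 to 135 cm range from the case study; the intercept and slope are hypothetical placeholders, since the notes do not give the fitted shoulder-girth model.

```python
def predict_within_range(x, intercept, slope, x_min, x_max):
    """Predict from a fitted line, but only inside the observed predictor range."""
    if not (x_min <= x <= x_max):
        raise ValueError(
            f"x = {x} is outside the observed range [{x_min}, {x_max}]; "
            "predicting here would be extrapolation"
        )
    return intercept + slope * x

# Hypothetical coefficients for illustration only (not from the notes).
HYP_INTERCEPT = 100.0
HYP_SLOPE = 0.6

# Observed shoulder-girth range from the case study, in cm.
GIRTH_MIN, GIRTH_MAX = 85, 135

print(predict_within_range(100, HYP_INTERCEPT, HYP_SLOPE, GIRTH_MIN, GIRTH_MAX))

try:
    predict_within_range(56, HYP_INTERCEPT, HYP_SLOPE, GIRTH_MIN, GIRTH_MAX)
except ValueError as err:
    print("Refused:", err)
```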
Conclusion
For reliable predictions, keep them within the observed range of the data whenever possible, and always check the underlying assumptions of the regression model.