Linear Regression
M BIOSTATISTICS
Week 9
Course: PUBHLTH 500 - Investigating Public Health Issues
Instructor: Jennifer Daniels
Correlation
Definition: Correlation measures the strength and direction of the linear relationship between two quantitative measurements taken on the same subject.
In R: Use the function cor(x, y) to calculate correlation.
Correlation Coefficient (\rho): A powerful statistic represented as follows:
\rho = \frac{\Sigma(x - \bar{x})(y - \bar{y})}{\sqrt{\Sigma(x - \bar{x})^2 \, \Sigma(y - \bar{y})^2}}
Unitless: Like a percentage, it does not depend on the units of x or y.
Range: -1 \leq \rho \leq 1
Negative Correlation: If \rho < 0, as X increases, Y tends to decrease (negative slope); if \rho > 0, Y tends to increase with X (positive slope).
Perfect Correlation: \rho = -1 or \rho = 1 indicates a perfect correlation (very unusual).
Strong Correlation: |\rho| > 0.75 is usually considered a strong correlation.
No Correlation: \rho = 0 indicates no linear relationship (also unusual). This alone does not guarantee that X and Y are independent, since a curved relationship can also give \rho = 0.
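As a quick check of the definition, the sketch below computes the correlation coefficient by hand (sum of cross-products divided by the square root of the product of the sums of squares) and compares it with cor(). It uses only the seven (Beers, BAC) rows shown in the data sample later in these notes, so the value is illustrative and not the correlation for the full dataset.

```r
# Seven (Beers, BAC) rows from the data sample in these notes (a subset only).
beers <- c(5, 2, 9, 7, 3, 3, 4)
bac   <- c(0.10, 0.03, 0.19, 0.095, 0.07, 0.02, 0.07)

# Deviations of each measurement from its sample mean
dx <- beers - mean(beers)
dy <- bac - mean(bac)

# Correlation coefficient computed directly from the definition
rho_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

# Built-in calculation; the two values should agree
rho_builtin <- cor(beers, bac)

print(c(manual = rho_manual, builtin = rho_builtin))
```

Because the result is unitless, rescaling either variable (for example, BAC in percent instead of a proportion) leaves it unchanged.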
Least Squares Regression (Line of Best Fit)
Analysis of Blood Alcohol Content (BAC) vs. Beer Consumption
Data Sample:
| Beers | BAC |
|---|---|
| 5 | 0.1 |
| 2 | 0.03 |
| 9 | 0.19 |
| 7 | 0.095 |
| 3 | 0.07 |
| 3 | 0.02 |
| 4 | 0.07 |
| … | … |
Graphical Representation: A line of best fit is plotted in the context of the data, representing the relationship visually.
Overview: Simple linear regression uses a single predictor X to model the response Y.
Equation of a Straight Line:
y = a + bx
Where:
b = slope: the amount by which y changes when x increases by one unit.
a = intercept: the value of y when x = 0.
Model Estimation:
Y = \beta_0 + \beta_1 X + \epsilon
Fitted Value:
\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X
Residual/Error:
\hat{\epsilon} = Y - \hat{Y}
Parameters:
\hat{\beta}_0 and \hat{\beta}_1 are the estimates of \beta_0 and \beta_1 calculated from sample data.
Least Squares Method: Uses calculus to choose the estimates that minimize the sum of squared residuals:
\Sigma \hat{\epsilon}^2 = \Sigma(Y - \hat{Y})^2
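For a single predictor, minimizing the sum of squared residuals has a well-known closed-form solution: the slope is the sum of cross-products divided by the sum of squared X deviations, and the intercept makes the line pass through the point of means. The sketch below checks this against lm() using only the seven rows shown in the data sample, so the estimates will differ from the full-data output reported later in these notes.

```r
# Seven (Beers, BAC) rows from the data sample in these notes (a subset only).
beers <- c(5, 2, 9, 7, 3, 3, 4)
bac   <- c(0.10, 0.03, 0.19, 0.095, 0.07, 0.02, 0.07)

# Closed-form least squares estimates for one predictor
slope     <- sum((beers - mean(beers)) * (bac - mean(bac))) /
             sum((beers - mean(beers))^2)
intercept <- mean(bac) - slope * mean(beers)

# The same estimates from R's linear model function
fit <- lm(bac ~ beers)
print(c(intercept = intercept, slope = slope))
print(coef(fit))

# Sum of squared residuals at the least squares line ...
sse_ls <- sum((bac - (intercept + slope * beers))^2)
# ... and at a line with a slightly different slope (always larger)
sse_other <- sum((bac - (intercept + (slope + 0.005) * beers))^2)
print(c(least_squares = sse_ls, nudged_slope = sse_other))
```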
Simple Linear Regression with R
Data Preparation: Import beer consumption data from beerdata.xlsx.
Model Fitting:
model <- lm(BAC ~ Beers, data = beerdata)
summary(model)
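The notes do not show the import step itself. One possible way to do it is sketched below, assuming the readxl package is installed and that the spreadsheet has a single sheet with columns named Beers and BAC (both are assumptions; only the file name and the model formula come from the notes). If the file or package is unavailable, the sketch falls back to the seven rows shown in the data sample so that it still runs.

```r
# ASSUMPTION: beerdata.xlsx has columns named Beers and BAC on its first sheet.
if (file.exists("beerdata.xlsx") && requireNamespace("readxl", quietly = TRUE)) {
  beerdata <- as.data.frame(readxl::read_excel("beerdata.xlsx"))
} else {
  # Fallback: only the seven rows shown in the data sample (not the full data)
  beerdata <- data.frame(
    Beers = c(5, 2, 9, 7, 3, 3, 4),
    BAC   = c(0.10, 0.03, 0.19, 0.095, 0.07, 0.02, 0.07)
  )
}

# Inspect the imported columns before modelling
str(beerdata)

# Fit the simple linear regression and show the coefficient table
model <- lm(BAC ~ Beers, data = beerdata)
print(coef(summary(model)))
```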
Model Output:
Equation:
BAC = -0.012701 + 0.017964 * X, where X is the number of beers.
p-value:
2.97e-06 for testing the null hypothesis H_0: \beta_1 = 0 vs. the alternative H_a: \beta_1 \neq 0.
If H_0 is not rejected, there is no evidence that the predictor is significant.
In this case, the p-value is far below 0.05, so H_0 is rejected in favor of H_a: the number of beers is a statistically significant predictor of BAC.
Model Summary:
Residuals:
Min: -0.027118, 1Q: -0.017350, Median: 0.001773, 3Q: 0.008623, Max: 0.041027
Coefficients:
Intercept: -0.012701 (p = 0.332)
Beers: 0.017964 (t = 7.480, p = 2.97e-06)
R-Squared Values:
Multiple R-squared: 0.7998
Adjusted R-squared: 0.7855
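The reported values in this summary are tied together by standard identities for simple linear regression, which makes a useful sanity check. The sketch below recomputes the slope's two-sided p-value from its t statistic, and R-squared and adjusted R-squared from t and the residual degrees of freedom. The degrees of freedom are not stated in these notes; 14 (that is, 16 observations) is an assumption. If it holds, the results should come out close to the reported values.

```r
t_beers  <- 7.480   # reported t statistic for the Beers slope
df_resid <- 14      # ASSUMPTION: 16 observations, so n - 2 = 14 residual df
n        <- df_resid + 2

# Two-sided p-value for H0: slope = 0
p_value <- 2 * pt(-abs(t_beers), df = df_resid)

# In simple linear regression, R-squared = t^2 / (t^2 + residual df)
r2 <- t_beers^2 / (t_beers^2 + df_resid)

# Adjusted R-squared penalizes for the number of estimated parameters
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_resid

print(signif(p_value, 3))
print(round(c(r_squared = r2, adjusted = adj_r2), 4))
```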
Calculation Example:
For 5 beers:
\hat{Y} = -0.012701 + 0.017964 * 5 = 0.077119
Residual Calculation:
\text{Residual} = 0.1 - 0.077119 = 0.022881
The positive residual indicates that the model underestimates the observed BAC for this subject.
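The same arithmetic can be reproduced in R from the reported coefficients. This is only a sketch of the calculation; with a fitted model object, fitted(model) and resid(model) return these quantities for every observation.

```r
# Reported coefficient estimates from the model summary
b0 <- -0.012701   # intercept
b1 <-  0.017964   # slope for Beers

# Subject who drank 5 beers and had an observed BAC of 0.1
beers        <- 5
observed_bac <- 0.1

fitted_bac <- b0 + b1 * beers            # predicted BAC from the fitted line
residual   <- observed_bac - fitted_bac  # observed minus fitted

print(c(fitted = fitted_bac, residual = residual))
```

A positive residual means the observed point lies above the fitted line; a negative residual means it lies below.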
Predictions with Regression
Extrapolation Warning: Predictions beyond the range of the data (e.g., 30 beers) may yield unreliable results.
Toxic BAC: For 30 beers the fitted equation predicts a BAC of about 0.526, a toxic level that lies far outside the range of the observed data.
Confidence Intervals:
predict(model, newdata = newbeer, interval = "confidence")
Example predictions output:
| fit | lwr | upr |
|---|---|---|
| 0.0232269 | 0.005060471 | 0.04139337 |
| 0.09508197 | 0.082530188 | 0.10763375 |
| 0.52621225 | 0.396005750 | 0.65641875 |
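The notes do not show how newbeer was built. The sketch below assumes it is a data frame with a Beers column holding 2, 6, and 30; those values are inferred because the reported coefficients reproduce the fit column at them, not because the notes state them. The model here is fitted to only the seven rows from the data sample, so its intervals will not match the table above, but the pattern is the same: the interval widens sharply for the extrapolated value.

```r
# Seven rows from the data sample in these notes (a subset only)
beerdata <- data.frame(
  Beers = c(5, 2, 9, 7, 3, 3, 4),
  BAC   = c(0.10, 0.03, 0.19, 0.095, 0.07, 0.02, 0.07)
)
model <- lm(BAC ~ Beers, data = beerdata)

# ASSUMPTION: hypothetical newbeer; the column name must match the predictor
newbeer <- data.frame(Beers = c(2, 6, 30))

# 95% confidence intervals for the mean BAC at each number of beers
ci <- predict(model, newdata = newbeer, interval = "confidence")
print(ci)

# Width of each interval; the extrapolated value (30 beers) is the widest
ci_width <- ci[, "upr"] - ci[, "lwr"]
print(ci_width)

# Fit column recomputed from the coefficients reported in the notes
reported_fit <- -0.012701 + 0.017964 * newbeer$Beers
print(round(reported_fit, 4))
```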
Four Key Assumptions for Valid Linear Models
Linear Relationship: Y and X should have a straight-line relationship.
Independence of Residuals: Residual vs. Fit Plot should show no discernable pattern.
Constant Variance (Homoscedasticity): Residuals should have a constant scatter.
Normality of Residuals: Residuals should follow a normal distribution.
Diagnostics for Validation
Plots:
Residual vs. Fit: Should exhibit random scatter.
Normality Testing: Shapiro-Wilk test and Q-Q plots can assess the normal distribution of residuals.
Testing Independence
Durbin-Watson test: Not covered here, but it is a technique for assessing the independence of residuals.
Constant Variance Check
Look for fan shapes in scatter plots, which indicate non-constant variance.
Normality Check of Residuals
Normality Tests:
The Shapiro-Wilk test checks the null hypothesis H_0 that the residuals are normally distributed.
Output from a sample analysis:
shapiro.test(model$residuals)
W = 0.95096, p-value = 0.505
Interpretation: Since the p-value > 0.05, do not reject H_0; there is no evidence that the residuals depart from normality.
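The diagnostics above can be run together in a few lines. This is a minimal sketch using base R only, fitted to the seven rows from the data sample, so the Shapiro-Wilk result will differ from the output quoted in the notes. The pdf(NULL) call is an addition so the sketch also runs non-interactively; drop it in RStudio to see the plots.

```r
# Seven rows from the data sample in these notes (a subset only)
beerdata <- data.frame(
  Beers = c(5, 2, 9, 7, 3, 3, 4),
  BAC   = c(0.10, 0.03, 0.19, 0.095, 0.07, 0.02, 0.07)
)
model <- lm(BAC ~ Beers, data = beerdata)

pdf(NULL)  # draw to a null device; remove this line to view the plots

# Residual vs. fit: look for random scatter around zero, no curve or fan shape
plot(fitted(model), resid(model),
     xlab = "Fitted BAC", ylab = "Residual", main = "Residual vs. Fit")
abline(h = 0, lty = 2)

# Normal Q-Q plot: points close to the line suggest normal residuals
qqnorm(resid(model))
qqline(resid(model))

invisible(dev.off())

# Shapiro-Wilk test: H0 is that the residuals are normally distributed
sw <- shapiro.test(resid(model))
print(sw)
```

With very small samples the Shapiro-Wilk test has little power to detect non-normality, so the plots deserve at least as much weight as the p-value.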
Outliers and Influential Points
Definition: An outlier is an observation that falls far from the overall pattern of the data; it can influence the regression line significantly, so monitoring is crucial.
Impact of Points: Adding an influential point to the data can markedly change the slope and R² values. The lecture includes an example scenario.
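To see how much a single point can matter, the sketch below fits the model to the seven rows from the data sample, then adds one made-up observation (15 beers with a BAC of 0.05) and refits. The added point is hypothetical and chosen only for illustration; it is not the scenario from the lecture.

```r
# Seven rows from the data sample in these notes (a subset only)
beerdata <- data.frame(
  Beers = c(5, 2, 9, 7, 3, 3, 4),
  BAC   = c(0.10, 0.03, 0.19, 0.095, 0.07, 0.02, 0.07)
)

# HYPOTHETICAL influential point: extreme in X and far below the trend
with_point <- rbind(beerdata, data.frame(Beers = 15, BAC = 0.05))

fit_without <- lm(BAC ~ Beers, data = beerdata)
fit_with    <- lm(BAC ~ Beers, data = with_point)

# Compare slope and R-squared before and after adding the point
comparison <- data.frame(
  model     = c("without point", "with point"),
  slope     = c(coef(fit_without)[["Beers"]], coef(fit_with)[["Beers"]]),
  r_squared = c(summary(fit_without)$r.squared, summary(fit_with)$r.squared)
)
print(comparison)
```

In this illustration both the slope and R² fall sharply once the extra point is included, which is why unusual observations should be checked before the fitted line is interpreted.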
Practice Problems
ERdata Dataset
Metadata: Recorded symptoms and severity scores for patients in an ER.
Key Questions:
Identify predictor (X) and response (Y).
Evaluate statistical significance via p-values.
Interpret regression outputs, including slope and intercept.
Which variable is the predictor (X) and which variable is the response (Y)?
At 95% confidence, is the number of symptoms a significant predictor of the severity score?
Interpret the slope for this model.
Interpret the intercept value for this model.
What is the residual (error) for the first patient?
What percent of the variation in score is explained by symptoms?
How many symptoms are required to reach the top-priority severity score threshold?
Discuss the implication of R² and unexplained variation.
Validate diagnostics by examining model fit and normality.
R Lab 3 Week 9
Objective: Apply simple linear regression techniques in lab work and carry the findings into the HDR paper.
EPID Quiz 2 Review and Details
Sessions: Provide key review sessions, participation requirements, and content to study in preparation.