Linear Regression

Course: PUBHLTH 500 - Investigating Public Health Issues

Instructor: Jennifer Daniels

Definition: Correlation measures the strength and direction of the linear relationship between two measurements taken on the same subject.
In R: Use the function cor(x, y) to calculate the correlation.
Correlation Coefficient ($\rho$): A summary statistic calculated as follows:
$\rho = \frac{\Sigma(x - \bar{x})(y - \bar{y})}{\sqrt{\Sigma(x - \bar{x})^2 \, \Sigma(y - \bar{y})^2}}$
Unitless: Like a percentage, it carries no units of measurement.
Range: $-1 \leq \rho \leq 1$
- Positive Correlation: If $\rho > 0$, as X increases, Y tends to increase (positive slope).
- Negative Correlation: If $\rho < 0$, as X increases, Y tends to decrease (negative slope).
- Perfect Correlation: $\rho = -1$ or $\rho = 1$ indicates a perfect correlation (very unusual).
- Strong Correlation: $|\rho| > 0.75$ is usually considered a strong correlation.
- No Correlation: $\rho = 0$ indicates no linear relationship (also unusual). Independent variables have $\rho = 0$, but $\rho = 0$ alone does not establish independence.
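
As a minimal sketch of what the correlation calculation involves, the following R code builds the coefficient from sums of deviations and compares it with the built-in cor(x, y). The vectors x and y are made-up toy values (not course data), and the names sxy, sxx, and syy are chosen only for illustration.

```r
# Made-up toy data (hypothetical; not from the course dataset)
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(0.01, 0.03, 0.04, 0.06, 0.10, 0.09, 0.11, 0.13)

# Numerator: sum of cross-products of deviations from the means
sxy <- sum((x - mean(x)) * (y - mean(y)))

# Denominator pieces: sums of squared deviations for x and for y
sxx <- sum((x - mean(x))^2)
syy <- sum((y - mean(y))^2)

# Correlation by hand, then with the built-in function
r_by_hand <- sxy / sqrt(sxx * syy)
r_builtin <- cor(x, y)

print(c(by_hand = r_by_hand, builtin = r_builtin))
```

The two values should agree. In practice cor() is the tool to use; the by-hand version only shows what it computes.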

Graphical Representation: A line of best fit is plotted in the context of the data, representing the relationship visually.
Overview: Simple linear regression uses a single predictor X to model the response Y.
Equation of a Straight Line:
$y = a + bx$
- Where:
- b = slope: the amount by which y changes when x increases by one unit.
- a = intercept: the value of y when x = 0.
Model Estimation:
$Y = \beta_0 + \beta_1 X + \epsilon$
- Fitted Value:
 $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X$
- Residual/Error:
 $\hat{\epsilon} = Y - \hat{Y}$
- Parameters:
 $\beta_0$ and $\beta_1$ are unknown parameters; their estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ are calculated using sample data.
Least Squares Method: Uses calculus to find the estimates that minimize the sum of squared residuals:
$\Sigma \hat{\epsilon}^2 = \Sigma(Y - \hat{Y})^2$
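
For a single predictor, minimizing the sum of squared residuals has a standard closed-form solution: the slope is the sum of cross-products divided by the sum of squared deviations of X, and the intercept is the mean of Y minus the slope times the mean of X. The sketch below checks that result against lm(). The data are made-up toy values, and the helper sse() is a hypothetical name used only for this illustration.

```r
# Made-up toy data (hypothetical; not from the course dataset)
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(0.01, 0.03, 0.04, 0.06, 0.10, 0.09, 0.11, 0.13)

# Closed-form least squares estimates for one predictor
b1_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0_hat <- mean(y) - b1_hat * mean(x)

# The same estimates from R's linear model function
fit <- lm(y ~ x)

# Hypothetical helper: sum of squared residuals for any candidate line
sse <- function(b0, b1) sum((y - (b0 + b1 * x))^2)

sse_ls    <- sse(b0_hat, b1_hat)          # at the least squares estimates
sse_other <- sse(b0_hat, b1_hat + 0.005)  # at a slightly different slope

print(c(b0_hat = b0_hat, b1_hat = b1_hat))
print(coef(fit))
print(c(sse_ls = sse_ls, sse_other = sse_other))
```

Any line other than the least squares line gives a larger sum of squared residuals, which is what sse_other illustrates for one alternative slope.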

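# Fit a simple linear regression of BAC on the number of beers
model <- lm(BAC ~ Beers, data = beerdata)
# Show coefficients, p-values, and R-squared
summary(model)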

Model Output:
- Equation:
 $\widehat{BAC} = -0.012701 + 0.017964 \times Beers$
- p-value:
 2.97e-06 for testing the null hypothesis $H_0: \beta_1 = 0$ vs. the alternative $H_a: \beta_1 \neq 0$
- If $H_0$ is not rejected, there is no evidence that the predictor is significant.
- In this case, the p-value is far below 0.05, so $H_0$ is rejected in favor of $H_a$: the number of beers is a statistically significant predictor of BAC.
Model Summary:
- Residuals:
 Min: -0.027118, 1Q: -0.017350, Median: 0.001773, 3Q: 0.008623, Max: 0.041027
- Coefficients:
 Intercept: -0.012701 (p = 0.332)
 Beers: 0.017964 (t = 7.480, p = 2.97e-06)
- R-Squared Values:
 Multiple R-squared: 0.7998
 Adjusted R-squared: 0.7855
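
The quantities in a model summary can also be pulled out of the summary object in code rather than read off the printout. The sketch below assumes a made-up data frame named toy (the course's beerdata is not reproduced here), so its numbers will not match the output above; only the extraction pattern is the point.

```r
# Made-up toy data (hypothetical; not the course's beerdata)
toy <- data.frame(
  Beers = c(1, 2, 3, 4, 5, 6, 7, 8),
  BAC   = c(0.01, 0.03, 0.04, 0.06, 0.10, 0.09, 0.11, 0.13)
)

fit <- lm(BAC ~ Beers, data = toy)
s   <- summary(fit)

# Coefficient table: columns Estimate, Std. Error, t value, Pr(>|t|)
coef_table <- coef(s)

# p-value for the slope, used to test H0: slope = 0
slope_p <- coef_table["Beers", "Pr(>|t|)"]

# Multiple and adjusted R-squared
r2     <- s$r.squared
adj_r2 <- s$adj.r.squared

alpha <- 0.05  # significance level for a 95% confidence decision
print(coef_table)
print(c(r.squared = r2, adj.r.squared = adj_r2))
print(slope_p < alpha)  # TRUE means H0 is rejected at the 5% level
```

With one predictor, the multiple R-squared equals the squared correlation between the predictor and the response.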
Calculation Example:
- For 5 beers:
  $\hat{Y} = -0.012701 + 0.017964 * 5 = 0.077119$
- Residual Calculation (observed BAC = 0.1):
  $\text{Residual} = 0.1 - 0.077119 = 0.022881$
- The residual is positive, so the model underestimates BAC for this subject.
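
The calculation example can be reproduced in R directly from the reported coefficients. This is a minimal sketch; the variable names are illustrative, and the observed BAC of 0.1 is the value used in the example above.

```r
# Coefficients as reported in the model summary
b0 <- -0.012701  # intercept
b1 <- 0.017964   # slope for Beers

beers        <- 5
observed_bac <- 0.1

# Fitted value from the regression equation
fitted_bac <- b0 + b1 * beers

# Residual = observed minus fitted
residual <- observed_bac - fitted_bac

print(fitted_bac)  # 0.077119
print(residual)    # 0.022881
```

A positive residual means the observed value lies above the fitted line.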

Extrapolation Warning: Predictions beyond the range of the data (e.g., 30 beers) may yield unreliable results.
Toxic BAC: For 30 beers, the fitted equation predicts a BAC of about 0.526, a highly concerning (toxic) level.
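
The extrapolation problem can be sketched with a small guard around the fitted equation. The function predict_bac() and its observed_range argument are hypothetical, and the range c(1, 9) is an assumed value for illustration only, since these notes do not state the observed range of Beers. The coefficients are the ones reported in the model summary.

```r
# Coefficients as reported in the model summary
b0 <- -0.012701
b1 <- 0.017964

# Hypothetical helper: predicts BAC and warns when the input is
# outside the range of beers seen in the data
predict_bac <- function(beers, observed_range) {
  if (beers < observed_range[1] || beers > observed_range[2]) {
    warning("beers is outside the observed range; this is an extrapolation")
  }
  b0 + b1 * beers
}

# observed_range = c(1, 9) is an assumption made for this sketch
bac_30 <- predict_bac(30, observed_range = c(1, 9))
print(bac_30)  # 0.526219
```

The arithmetic still returns a number, which is the danger: nothing in the equation itself signals that 30 beers is far outside what the model was fitted on.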
Confidence Intervals:

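newbeer <- data.frame(Beers = 5)  # example value (an assumption); the column name must match the predictor
predict(model, newdata = newbeer, interval = "confidence")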
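
As a self-contained sketch of what predict() returns, the code below fits a model to made-up toy data (hypothetical; not the course's beerdata) and requests intervals at 5 beers. It also shows interval = "prediction" for comparison, which is another option of the same function: the confidence interval is for the mean response, while the prediction interval is for a single new observation.

```r
# Made-up toy data (hypothetical; not the course's beerdata)
toy <- data.frame(
  Beers = c(1, 2, 3, 4, 5, 6, 7, 8),
  BAC   = c(0.01, 0.03, 0.04, 0.06, 0.10, 0.09, 0.11, 0.13)
)
fit <- lm(BAC ~ Beers, data = toy)

# New data must use the same column name as the predictor in the model
newbeer <- data.frame(Beers = 5)

# 95% confidence interval for the mean BAC at 5 beers
ci_mean <- predict(fit, newdata = newbeer, interval = "confidence", level = 0.95)

# 95% prediction interval for one new person's BAC at 5 beers
pi_new <- predict(fit, newdata = newbeer, interval = "prediction", level = 0.95)

# Each result is a matrix with columns fit, lwr, upr
print(ci_mean)
print(pi_new)
```

Both intervals are centered on the same fitted value, and the prediction interval is the wider of the two.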

Linear Relationship: Y and X should have a straight-line relationship.
Independence of Residuals: The residual vs. fit plot should show no discernible pattern.
Constant Variance (Homoscedasticity): Residuals should have a constant scatter.
Normality of Residuals: Residuals should follow a normal distribution.

Plots:
1. Residual vs. Fit: Should exhibit random scatter.
2. Normality Testing: Shapiro-Wilk test and Q-Q plots can assess the normal distribution of residuals.
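
The two plots listed above can be drawn with base R graphics. This is a minimal sketch on made-up toy data (hypothetical; not the course's beerdata), using only functions from base R and the stats package.

```r
# Made-up toy data (hypothetical; not the course's beerdata)
toy <- data.frame(
  Beers = c(1, 2, 3, 4, 5, 6, 7, 8),
  BAC   = c(0.01, 0.03, 0.04, 0.06, 0.10, 0.09, 0.11, 0.13)
)
fit <- lm(BAC ~ Beers, data = toy)

res      <- resid(fit)   # residuals: observed minus fitted
fit_vals <- fitted(fit)  # fitted values

# 1. Residual vs. fit: look for random scatter around the zero line
plot(fit_vals, res,
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs. Fit")
abline(h = 0, lty = 2)

# 2. Normal Q-Q plot: points close to the line suggest normal residuals
qqnorm(res)
qqline(res)
```

A funnel shape in the first plot suggests non-constant variance, and a curve suggests the relationship is not a straight line.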

Testing Independence

Durbin-Watson test: A technique for assessing the independence of residuals (not covered in this course).
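
For reference only, the Durbin-Watson statistic itself is simple to compute: it is the sum of squared differences between consecutive residuals divided by the sum of squared residuals. The sketch below uses made-up toy data (hypothetical) and assumes the rows are in a meaningful order, such as the order of collection; without such an order the statistic has little meaning.

```r
# Made-up toy data (hypothetical), assumed to be in collection order
toy <- data.frame(
  Beers = c(1, 2, 3, 4, 5, 6, 7, 8),
  BAC   = c(0.01, 0.03, 0.04, 0.06, 0.10, 0.09, 0.11, 0.13)
)
fit <- lm(BAC ~ Beers, data = toy)
e   <- resid(fit)

# Durbin-Watson statistic: values near 2 suggest no first-order
# autocorrelation; values near 0 or 4 suggest positive or negative
# autocorrelation, respectively
dw <- sum(diff(e)^2) / sum(e^2)
print(dw)

# A formal test with a p-value is available as dwtest() in the
# lmtest package, which is not loaded in this sketch
```

The statistic always lies between 0 and 4.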

Constant Variance Check

Normality Check of Residuals

Normality Tests:
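- The Shapiro-Wilk test checks the null hypothesis that the residuals are normally distributed.
- Output from a sample analysis:

shapiro.test(model$residuals)
  W = 0.95096, p-value = 0.505

- Interpretation: The p-value (0.505) is above 0.05, so the hypothesis of normally distributed residuals is not rejected.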

Definition: An outlier is an observation that falls far from the overall pattern of the data. It can influence the regression line significantly, so monitoring is crucial.
Impact of Points: Adding influential points to data can markedly change slope and R² values. An example scenario included.
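
The effect of a single influential point can be sketched by fitting the same model with and without it. Everything here is made up for illustration: the toy data are hypothetical, and the added point (20 beers, BAC 0.05) is chosen to sit far from the rest of the data. Cook's distance, computed with cooks.distance() from the stats package, is one common way to flag such points.

```r
# Made-up toy data (hypothetical; not the course's beerdata)
toy <- data.frame(
  Beers = c(1, 2, 3, 4, 5, 6, 7, 8),
  BAC   = c(0.01, 0.03, 0.04, 0.06, 0.10, 0.09, 0.11, 0.13)
)

# Same data plus one made-up point far from the overall pattern
toy_with <- rbind(toy, data.frame(Beers = 20, BAC = 0.05))

fit_base <- lm(BAC ~ Beers, data = toy)
fit_with <- lm(BAC ~ Beers, data = toy_with)

# Compare slope and R-squared before and after adding the point
slope_base <- unname(coef(fit_base)["Beers"])
slope_with <- unname(coef(fit_with)["Beers"])
r2_base <- summary(fit_base)$r.squared
r2_with <- summary(fit_with)$r.squared

print(c(slope_base = slope_base, slope_with = slope_with))
print(c(r2_base = r2_base, r2_with = r2_with))

# Cook's distance for each observation; the largest value marks the
# most influential row
cd <- cooks.distance(fit_with)
most_influential <- unname(which.max(cd))
print(most_influential)
```

In this toy case the added row both flattens the slope and lowers R-squared, and it has the largest Cook's distance.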

Metadata: Recorded symptoms and severity scores for patients in an ER.
Key Questions:
- Identify predictor (X) and response (Y).
- Evaluate statistical significance via p-values.
- Interpret regression outputs, including slope and intercept.

Which variable is the predictor (X) and which variable is the response (Y)?
At 95% confidence, is symptoms a significant predictor of the Severity Score?
Interpret the slope for this model.
Interpret the intercept value for this model.
What is the residual (error) for the first patient?
What percent of the variation in score is explained by symptoms?
How many symptoms are required to reach the top-priority severity score threshold?
Discuss the implication of R² and unexplained variation.
Validate diagnostics by examining model fit and normality.

Objective: Apply simple linear regression techniques in lab work and apply the findings to the HDR paper.

Sessions: Provide key review sessions, participation requirements, and content to study in preparation.