Simple Linear Regression: Model and Theory

Linear regression is used to summarize the relationship between two numerical variables.
Slope Parameter ($\beta_1$):
- Summary measure of the effect of exposure on outcome.
- A positive value for $\beta_1$ indicates a positive relationship; a negative value indicates a negative relationship.
- $\beta_1 = 0$ suggests no relationship.
Objectives:
- To perform statistical inference regarding the exposure-outcome relationship.
- Formal hypothesis testing:
  - Null Hypothesis ($H_0$): $\beta_1 = 0$
  - Alternative Hypothesis ($H_1$): $\beta_1 \neq 0$
- To review regression models and the assumptions necessary for statistical inference on linear regression parameters.

Relationships can be summarized graphically (e.g., scatterplots) and numerically (e.g., correlation coefficients).
Data Types: Numerical and Categorical.
- Method of summarization depends on the exposure and outcome data type.
- For numerical data: scatterplots and correlation measures can be applied.
- For categorical data: Different summarization methods apply.

Model Description:
- Linear regression describes the relationship between exposure and outcome using a linear equation:
  $y = \beta_0 + \beta_1 x$
- Example: for $y = -1 + 2x$, the parameters are:
  - $\beta_0 = -1$ (y-intercept): value where the line crosses the y-axis.
  - $\beta_1 = 2$ (slope): rate of change; $y$ increases by $\beta_1$ units for every 1-unit increase in $x$.
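The example line can be checked numerically. The sketch below is illustrative only; the function name `line` is hypothetical, and its defaults are the intercept and slope of $y = -1 + 2x$.

```python
def line(x, beta0=-1.0, beta1=2.0):
    """Evaluate y = beta0 + beta1 * x (defaults match y = -1 + 2x)."""
    return beta0 + beta1 * x

# The intercept is the value of y at x = 0.
intercept = line(0)

# The slope is the change in y for a 1-unit increase in x, from any starting x.
change_per_unit = line(4) - line(3)

print(intercept, change_per_unit)  # -1.0 2.0
```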

Linear regression is a widely used analysis tool in statistics.
Main focus: compute parameters and interpret the model: $y = \beta_0 + \beta_1 x$
- Where:
- $y$ = outcome variable
- $x$ = exposure variable
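Parameter estimates are computed as follows:
- Slope: $\hat{\beta_1} = r \times \frac{s_y}{s_x}$, where $r$ is the sample correlation coefficient and $s_x$, $s_y$ are the sample standard deviations of $x$ and $y$.
- Intercept: $\hat{\beta_0} = \bar{y} - \hat{\beta_1} \bar{x}$
Predicted outcome for a fixed exposure value $x$:
- $\hat{y} = \hat{\beta_0} + \hat{\beta_1} x$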
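A minimal sketch of these formulas using only the Python standard library. The `x` and `y` values are made up for illustration (they are not from any dataset discussed here), and `predict` is a hypothetical helper name.

```python
from statistics import mean, stdev

# Illustrative values only (not from the source data).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
x_bar, y_bar = mean(x), mean(y)
s_x, s_y = stdev(x), stdev(y)  # sample standard deviations (n - 1 denominator)

# Sample (Pearson) correlation coefficient r.
r = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / ((n - 1) * s_x * s_y)

# Slope and intercept estimates from the formulas above.
beta1_hat = r * s_y / s_x
beta0_hat = y_bar - beta1_hat * x_bar

def predict(x_new):
    """Predicted outcome for a fixed exposure value x_new."""
    return beta0_hat + beta1_hat * x_new

print(round(beta1_hat, 3), round(beta0_hat, 3), round(predict(6.0), 3))
```

Because the fitted line always passes through $(\bar{x}, \bar{y})$, `predict(x_bar)` returns `y_bar` up to floating-point error, which is a quick sanity check on any implementation.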

Intercept ($\beta_0$): Predicted outcome value when exposure is zero.
- $\hat{y} = \beta_0 + \beta_1 \cdot 0 = \beta_0$
Slope ($\beta_1$): Often referred to as the “effect size.”
- Indicates that the predicted mean outcome changes by $\beta_1$ units for each 1-unit increase in exposure.
- Interpretations of $\beta_1$:
 - $\beta_1 > 0$: Positive association
 - $\beta_1 < 0$: Negative association
 - $\beta_1 = 0$: No association
- Larger magnitudes of $\beta_1$ suggest greater changes in the outcome per change in the exposure.
- For estimated slope ($\hat{\beta_1}$):
 - No Association: $\hat{\beta_1} \approx 0$
 - Positive Association: $\hat{\beta_1} > 0$
 - Negative Association: $\hat{\beta_1} < 0$

Dataset Description:
- Classic mtcars dataset in R covering fuel consumption and 10 aspects of automobile design for 32 automobiles (1973–74).
Variables Considered:
- Weight (wt, in 1000s of lbs)
- Miles per gallon (mpg)
- Quarter mile time (qsec, in seconds)

Relationship Between Car Weight and Quarter Mile Time

Scatterplot indicates no clear relationship.
Correlation coefficient between weight and quarter mile time:
- $r = -0.174$
Conclusion: The linear regression model fits poorly.
- $R^2 = 0.03$: only 3% of the variation in quarter mile time is explained by the model.
Model equation:
$Qsec = 18.9 - 0.32 \cdot Weight$

Relationship Between Car Weight and Fuel Economy

The scatterplot shows a strong, negative, linear relationship.
Correlation coefficient:
- $r = -0.868$
Fitted linear model:
- $Mpg = 37.3 - 5.3 \cdot Weight$
Interpretation: A 1000 lb increase in weight is associated with a 5.3 mpg decrease in predicted fuel economy.
Model adequacy:
- $R^2 = 0.75$ indicates that 75% of the variation in fuel efficiency is explained by the linear relationship with weight.
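The reported numbers can be checked with a few lines of arithmetic. This sketch uses only the rounded coefficients and correlations quoted above; `predict_mpg` is a hypothetical helper name, and the 3,000 lb car is an arbitrary example within the range of typical car weights. It also relies on the fact that, in simple linear regression, $R^2$ equals the squared correlation coefficient.

```python
def predict_mpg(weight_1000_lbs):
    """Fitted model quoted in the text: Mpg = 37.3 - 5.3 * Weight."""
    return 37.3 - 5.3 * weight_1000_lbs

# Predicted fuel economy for a 3,000 lb car (Weight = 3.0).
mpg_at_3 = predict_mpg(3.0)

# A 1-unit (1000 lb) increase in weight changes the prediction by the slope.
diff = predict_mpg(4.0) - predict_mpg(3.0)

# In simple linear regression, R^2 is the square of the correlation r.
r_squared_mpg = (-0.868) ** 2
r_squared_qsec = (-0.174) ** 2

print(round(mpg_at_3, 1), round(diff, 1))                  # 21.4 -5.3
print(round(r_squared_mpg, 2), round(r_squared_qsec, 2))   # 0.75 0.03
```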

Goal: test whether there is sufficient evidence to conclude that $\beta_1 \neq 0$.

Population Parameters:
- The true relationship is denoted by $\beta_0$, $\beta_1$.
If $\beta_1 = 0$, the expected outcome is the same regardless of exposure value.
If $\beta_1 \neq 0$, a true linear relationship exists.
The focus is typically on assessing whether the slope parameter ($\beta_1$) is non-zero.

Population vs Sample Estimates

Null Hypothesis: No association between exposure and outcome ($H_0: \beta_1 = 0$)
Alternative Hypothesis: Some association between exposure and outcome ($H_1: \beta_1 \neq 0$)
Null model: under $H_0$, $y = \beta_0$ describes the data adequately, so exposure adds no predictive information.
Variation exists in estimates across independent random samples.

The estimate of the slope ($\hat{\beta_1}$) has a sampling distribution and a standard error.
Inference uses the point estimate $\hat{\beta_1}$ along with its standard error.
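A sketch of how the point estimate and its standard error combine into a test statistic. The formulas used here are the standard ones for simple linear regression and are not spelled out in these notes: $SE(\hat{\beta_1}) = \sqrt{\hat{\sigma}^2 / \sum (x_i - \bar{x})^2}$ with $\hat{\sigma}^2 = SSE / (n - 2)$, and $t = \hat{\beta_1} / SE(\hat{\beta_1})$. The data values are made up for illustration.

```python
from math import sqrt
from statistics import mean

# Illustrative values only (not from the source data).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)

x_bar, y_bar = mean(x), mean(y)
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Least squares estimates.
beta1_hat = s_xy / s_xx
beta0_hat = y_bar - beta1_hat * x_bar

# Sum of squared residuals and the residual variance estimate (n - 2 df).
sse = sum((yi - (beta0_hat + beta1_hat * xi)) ** 2 for xi, yi in zip(x, y))
sigma2_hat = sse / (n - 2)

# Standard error of the slope and the t statistic for H0: beta1 = 0.
se_beta1 = sqrt(sigma2_hat / s_xx)
t_stat = beta1_hat / se_beta1

print(round(se_beta1, 4), round(t_stat, 2))
```

Under $H_0$ and the model assumptions, `t_stat` is compared with a t distribution on $n - 2$ degrees of freedom. The p-value lookup is omitted here because it needs a statistics library beyond the standard library.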

Residual ($e_i$): the vertical difference between the observed value and the regression line for the $i^{th}$ data point, $e_i = y_i - \hat{y}_i$.
Least Squares Regression Line: the parameter estimates ($\hat{\beta_0}$ and $\hat{\beta_1}$) are the values that minimize the sum of squared residuals, defined mathematically as:
- $f(\beta_0, \beta_1) = \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{N} (y_i - \beta_0 - \beta_1 x_i)^2$
Use calculus to determine the parameter values that minimize this function.
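The minimizing property can be illustrated numerically. The sketch below computes the closed-form least squares estimates (the solution obtained by setting both partial derivatives of $f$ to zero) and confirms that nudging either parameter away from them increases the sum of squared residuals. The data values and the helper name `sse` are illustrative assumptions.

```python
from statistics import mean

# Illustrative values only (not from the source data).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

def sse(b0, b1):
    """Sum of squared residuals f(b0, b1) for the line y = b0 + b1 * x."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

# Closed-form least squares solution.
x_bar, y_bar = mean(x), mean(y)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
b0 = y_bar - b1 * x_bar
best = sse(b0, b1)

# Evaluate f at nearby parameter values; every one should be larger than best.
steps = (-0.1, 0.0, 0.1)
perturbed = [
    sse(b0 + d0, b1 + d1)
    for d0 in steps
    for d1 in steps
    if (d0, d1) != (0.0, 0.0)
]

print(best < min(perturbed))  # True
```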

The simple linear regression model is defined as:
$Y_i = \beta_0 + \beta_1 X_i + \epsilon_i$
The error terms are assumed to satisfy $\epsilon_i \sim N(0, \sigma^2)$: normally distributed with mean zero and constant variance $\sigma^2$.
The model captures the linear relationship between the outcome ($Y$) and the exposure/covariate ($X$).
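One way to build intuition for the model is to simulate from it and check that least squares recovers the parameters. In this sketch the true values `beta0 = 1.0`, `beta1 = 2.0`, `sigma = 1.0`, the sample size, and the uniform range for $X$ are all arbitrary assumptions chosen for illustration.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the simulation is reproducible

# Hypothetical true parameters (chosen for illustration only).
beta0, beta1, sigma = 1.0, 2.0, 1.0
n = 2000

# Generate data from Y_i = beta0 + beta1 * X_i + eps_i, with eps_i ~ N(0, sigma^2).
X = [random.uniform(0.0, 10.0) for _ in range(n)]
Y = [beta0 + beta1 * xi + random.gauss(0.0, sigma) for xi in X]

# Least squares estimates from the simulated sample.
x_bar, y_bar = mean(X), mean(Y)
b1_hat = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(X, Y)) / sum(
    (xi - x_bar) ** 2 for xi in X
)
b0_hat = y_bar - b1_hat * x_bar

print(round(b0_hat, 2), round(b1_hat, 2))
```

With a sample this large the estimates should land close to the true values. Rerunning with a different seed gives slightly different estimates, which is the sampling variation that the standard error quantifies.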

Assumptions for Inference:
Independence: Each data point ($Y_i, X_i$) is independent of the others.
Linearity: The mean of $Y_i$ is a linear function of $X_i$.
Normality of Errors: Errors $\epsilon_i$ follow a normal distribution.
Constant Variance: The variance of the errors is constant across all values of $X_i$ (homoscedasticity).

Verifying these assumptions is important for valid inference, though the checks are rarely reported in practice.
Independence: Addressed through data collection methods and study design.
Linearity Assessment: Use a scatterplot, with or without the fitted regression line, to check that the relationship looks linear.
Residual Analysis: Examine residuals for systematic patterns, which indicate lack of fit.
Normality Assessment of Residuals:
- Use histograms and QQ plots of the residuals to investigate normality.
Constant Variance Assessment: Check residuals for patterns indicating non-constant variance (e.g., a funnel shape in a plot of residuals against fitted values).
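The quantities behind these diagnostic plots can be computed directly. The sketch below, on made-up data, produces the residuals and fitted values (the inputs to a residuals-versus-fitted plot) and the pairs of theoretical and observed quantiles that a normal QQ plot would display. Scaling residuals by $\hat{\sigma}$ and using plotting positions $(i - 0.5)/n$ are common conventions assumed here, not choices stated in these notes. Plotting itself is left to whatever graphics tool is at hand.

```python
from math import sqrt
from statistics import NormalDist, mean

# Illustrative values only (not from the source data).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.3, 3.8, 6.1, 8.2, 9.7, 12.4, 13.9, 16.1]
n = len(x)

# Least squares fit.
x_bar, y_bar = mean(x), mean(y)
beta1_hat = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
beta0_hat = y_bar - beta1_hat * x_bar

# Fitted values and residuals: inputs to a residuals-vs-fitted plot.
fitted = [beta0_hat + beta1_hat * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

# Scale residuals by the residual standard deviation (n - 2 df) and sort them.
sigma_hat = sqrt(sum(e ** 2 for e in residuals) / (n - 2))
scaled = sorted(e / sigma_hat for e in residuals)

# Theoretical standard normal quantiles at plotting positions (i - 0.5) / n.
theoretical = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]

# Points of a normal QQ plot: roughly on a straight line if errors are normal.
qq_pairs = list(zip(theoretical, scaled))

for fi, e in zip(fitted, residuals):
    print(round(fi, 2), round(e, 3))
```

Least squares residuals always sum to zero when the model includes an intercept, so the informative features are trends, curvature, or changing spread across the fitted values rather than the overall average.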

Generally, assumptions for linear regression are rarely perfectly met.
Key focus areas in order of importance for inference validity:
1. Linearity
2. Independence
3. Constant Variance
4. Normality
Remember, "All models are wrong, but some are useful."