Simple Linear Regression: Model and Theory
Linear Regression Overview
Linear regression is used to summarize the relationship between two numerical variables.
Slope Parameter ($\beta_1$):
Summary measure of the effect of exposure on outcome.
A positive value for $\beta_1$ indicates a positive relationship; a negative value indicates a negative relationship.
$\beta_1 = 0$ suggests no relationship.
Objectives:
To perform statistical inference regarding the exposure-outcome relationship.
Formal hypothesis testing:
Null Hypothesis ($H_0$): $\beta_1 = 0$
Alternative Hypothesis ($H_1$): $\beta_1 \neq 0$
To review the regression model and the assumptions necessary for statistical inference on linear regression parameters.
Summarizing Relationships Between Variables
Relationships can be summarized graphically (e.g., scatterplots) and numerically (e.g., correlation coefficients).
Data Types: Numerical and Categorical.
Method of summarization depends on the exposure and outcome data type.
For numerical data: Scatterplots and Correlation measures can be applied.
For categorical data: Different summarization methods apply.
Linear Regression
Model Description:
Linear regression describes the relationship between exposure and outcome using a linear equation: $y = \beta_0 + \beta_1 x$
Examples:
For $y = -1 + 2x$, the parameters are:
$\beta_0 = -1$ (y-intercept): value where the line crosses the y-axis.
$\beta_1 = 2$ (slope): rate of change; y increases by $\beta_1$ units for every 1-unit increase in x.
Linear Regression Prediction
Linear regression is a widely used analysis tool in statistics.
Main focus: compute parameters and interpret the model: $y = \beta_0 + \beta_1 x$
Where:
$y$ = outcome variable
$x$ = exposure variable
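Parameter estimates are computed as follows: $\hat{\beta_1} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$ and $\hat{\beta_0} = \bar{y} - \hat{\beta_1} \bar{x}$, where $\bar{x}$ and $\bar{y}$ are the sample means.
Fitted values: $\hat{y}_i = \hat{\beta_0} + \hat{\beta_1} x_i$
Predicted outcome for a fixed exposure value $x$: $\hat{y} = \hat{\beta_0} + \hat{\beta_1} x$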
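As a minimal sketch of the closed-form least-squares computation, the snippet below fits a line to a small made-up dataset that lies exactly on $y = -1 + 2x$. The function name `fit_line` and the data are illustrative assumptions, not part of the original notes.

```python
from statistics import mean

def fit_line(x, y):
    """Return (b0, b1): the least-squares intercept and slope of y on x."""
    x_bar, y_bar = mean(x), mean(y)
    # Sum of cross-products and sum of squares about the sample means
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx          # slope estimate
    b0 = y_bar - b1 * x_bar   # intercept estimate
    return b0, b1

# Hypothetical data lying exactly on the line y = -1 + 2x
x = [0.0, 1.0, 2.0, 3.0]
y = [-1.0, 1.0, 3.0, 5.0]

b0, b1 = fit_line(x, y)
fitted = [b0 + b1 * xi for xi in x]   # fitted values at the observed exposures
pred_at_4 = b0 + b1 * 4.0             # predicted outcome at a new exposure x = 4
print(b0, b1, pred_at_4)
```

Because the toy data have no noise, the estimates recover the intercept and slope of the generating line; with real data the fitted values would differ from the observations.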
Interpretation of Linear Regression Parameters
Intercept ($\beta_0$): Predicted outcome value when exposure is zero.
Slope ($\beta_1$): Often referred to as the “effect size.”
Indicates that the predicted mean outcome changes by $\beta_1$ units for each 1-unit increase in exposure.
Interpretations of $\beta_1$:
$\beta_1 > 0$: Positive association
$\beta_1 < 0$: Negative association
$\beta_1 = 0$: No association
Larger magnitudes of $\beta_1$ suggest greater changes in the outcome per change in the exposure.
For estimated slope ($\hat{\beta_1}$):
No Association: $\hat{\beta_1} \approx 0$
Positive Association: $\hat{\beta_1} > 0$
Negative Association: $\hat{\beta_1} < 0$
Example Dataset: mtcars
Dataset Description:
The classic mtcars dataset in R covers fuel consumption and 10 aspects of automobile design for 32 automobiles (1973–74).
Variables Considered:
Weight (wt, in 1000s of lbs)
Miles per gallon (mpg)
Quarter mile time (qsec)
Relationship Between Car Weight and Quarter Mile Time
Scatterplot indicates no clear relationship.
Correlation coefficient between weight and quarter mile time:
Conclusion: The linear regression model performs poorly; $R^2 \approx 0.03$ indicates that only about 3% of the variation in quarter mile time is explained by the model.
Model equation:
Relationship Between Car Weight and Fuel Economy
The scatterplot shows a strong, negative, linear relationship.
Correlation coefficient:
Fitted linear model:
Interpretation: A 1000 lb increase in weight is associated with a predicted decrease in fuel economy of about 5.3 mpg.
Model adequacy: $R^2 \approx 0.75$ indicates that about 75% of the variation in fuel efficiency is explained by the linear relationship with weight.
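The two mtcars comparisons can be reproduced with a short sketch like the one below. The three vectors are transcribed by hand from R's built-in `mtcars` data (an assumption worth checking against the dataset itself), and `summarize` is a hypothetical helper name.

```python
from math import sqrt

# wt (1000s of lbs), mpg, and qsec (seconds) for the 32 cars,
# transcribed from R's mtcars; verify against the dataset before relying on them.
wt = [2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190,
      3.150, 3.440, 3.440, 4.070, 3.730, 3.780, 5.250, 5.424,
      5.345, 2.200, 1.615, 1.835, 2.465, 3.520, 3.435, 3.840,
      3.845, 1.935, 2.140, 1.513, 3.170, 2.770, 3.570, 2.780]
mpg = [21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4,
       22.8, 19.2, 17.8, 16.4, 17.3, 15.2, 10.4, 10.4,
       14.7, 32.4, 30.4, 33.9, 21.5, 15.5, 15.2, 13.3,
       19.2, 27.3, 26.0, 30.4, 15.8, 19.7, 15.0, 21.4]
qsec = [16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00,
        22.90, 18.30, 18.90, 17.40, 17.60, 18.00, 17.98, 17.82,
        17.42, 19.47, 18.52, 19.90, 20.01, 16.87, 17.30, 15.41,
        17.05, 18.90, 16.70, 16.90, 14.50, 15.50, 14.60, 18.60]

def summarize(x, y):
    """Return (intercept, slope, r, r_squared) for the least-squares line of y on x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    r = s_xy / sqrt(s_xx * s_yy)   # Pearson correlation coefficient
    return intercept, slope, r, r * r

for name, y in (("mpg", mpg), ("qsec", qsec)):
    b0, b1, r, r2 = summarize(wt, y)
    print(f"{name} ~ wt: intercept={b0:.2f}, slope={b1:.2f}, r={r:.2f}, R^2={r2:.2f}")
```

In simple linear regression $R^2$ equals the squared correlation coefficient, which is why a single helper can report both summaries.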
Goal: Evaluate Regressions using Inference Framework
Test whether there is sufficient evidence to conclude that $\beta_1 \neq 0$.
Statistical Inference for Linear Regression Models
Population Parameters:
True relationship denoted by $\beta_0$, $\beta_1$.
If $\beta_1 = 0$, the expected outcome is the same regardless of exposure value.
If $\beta_1 \neq 0$, a true linear relationship exists.
Focus typically on assessing if slope parameter ($\beta_1$) is non-zero.
Population vs Sample Estimates
True parameters: $\beta_0$, $\beta_1$
Estimated parameters: $\hat{\beta_0}$, $\hat{\beta_1}$ based on random sampling.
Hypothesis Testing in Linear Regression
Null Hypothesis: No association between exposure and outcome ($H_0: \beta_1 = 0$)
Alternative Hypothesis: Indicates some association ($H_1: \beta_1 \neq 0$)
Null model: Under $H_0$, the intercept-only model $y = \beta_0$ sufficiently describes the data.
Variation exists in estimates across independent random samples.
Variability and Sampling Distribution
The estimate of the slope ($\hat{\beta_1}$) has a sampling distribution and a standard error.
Inference uses the point estimate $\hat{\beta_1}$ together with its standard error.
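One common way to combine the point estimate with its standard error is a t statistic for $H_0: \beta_1 = 0$. The sketch below applies it to mpg versus weight. It assumes the usual standard error formula $\sqrt{\hat{\sigma}^2 / \sum (x_i - \bar{x})^2}$ with $n - 2$ degrees of freedom. The helper name `slope_inference` is hypothetical, the data are transcribed by hand from R's `mtcars`, and the critical value is taken from a standard t table.

```python
from math import sqrt

# wt (1000s of lbs) and mpg for the 32 cars, transcribed from R's mtcars
wt = [2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190,
      3.150, 3.440, 3.440, 4.070, 3.730, 3.780, 5.250, 5.424,
      5.345, 2.200, 1.615, 1.835, 2.465, 3.520, 3.435, 3.840,
      3.845, 1.935, 2.140, 1.513, 3.170, 2.770, 3.570, 2.780]
mpg = [21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4,
       22.8, 19.2, 17.8, 16.4, 17.3, 15.2, 10.4, 10.4,
       14.7, 32.4, 30.4, 33.9, 21.5, 15.5, 15.2, 13.3,
       19.2, 27.3, 26.0, 30.4, 15.8, 19.7, 15.0, 21.4]

def slope_inference(x, y):
    """Return (slope, standard_error, t_statistic, degrees_of_freedom)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    df = n - 2                                          # two estimated parameters
    sigma2_hat = sum(e * e for e in residuals) / df     # residual variance estimate
    se_b1 = sqrt(sigma2_hat / s_xx)                     # standard error of the slope
    t_stat = b1 / se_b1                                 # statistic for H0: beta_1 = 0
    return b1, se_b1, t_stat, df

b1, se_b1, t_stat, df = slope_inference(wt, mpg)
T_CRIT_30 = 2.042  # two-sided 5% critical value of t with 30 df (standard tables)
print(f"slope={b1:.3f}, SE={se_b1:.3f}, t={t_stat:.2f}, df={df}")
print("reject H0" if abs(t_stat) > T_CRIT_30 else "fail to reject H0")
```

A hard-coded critical value keeps the sketch dependency-free; a statistics package would normally report an exact p-value instead.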
Residuals
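Defined as the vertical distance between the observed value and the regression line for the $i^{th}$ data point: $e_i = y_i - (\hat{\beta_0} + \hat{\beta_1} x_i)$.
Linear regression parameters ($\hat{\beta_0}$ and $\hat{\beta_1}$) are obtained by minimizing the sum of squared residuals.
Least Squares Regression Line: Minimizes the sum of squared residuals, defined mathematically as $\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{\beta_0} - \hat{\beta_1} x_i)^2$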
Use calculus to determine parameter values that minimize this function.
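The calculus step can be sketched as follows. The notation $S(b_0, b_1)$ is introduced here only for illustration: it is the sum of squared residuals viewed as a function of a candidate intercept $b_0$ and slope $b_1$.

$$S(b_0, b_1) = \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2$$

Setting both partial derivatives to zero gives the normal equations:

$$\frac{\partial S}{\partial b_0} = -2 \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i) = 0, \qquad \frac{\partial S}{\partial b_1} = -2 \sum_{i=1}^{n} x_i (y_i - b_0 - b_1 x_i) = 0$$

The first equation yields $b_0 = \bar{y} - b_1 \bar{x}$. Substituting this into the second and solving for $b_1$ gives:

$$\hat{\beta_1} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad \hat{\beta_0} = \bar{y} - \hat{\beta_1} \bar{x}$$

Since $S$ is a convex quadratic in $(b_0, b_1)$, this stationary point is a minimum, provided the $x_i$ are not all equal.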
Statistical Model for Simple Linear Regression
Defined as: $Y_i = \beta_0 + \beta_1 X_i + \epsilon_i$
The error terms are assumed to be normally distributed with mean zero and constant variance: $\epsilon_i \sim N(0, \sigma^2)$.
The model captures the linear relationship between the outcome ($Y$) and the exposure/covariate ($X$).
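To make the model concrete, the following sketch simulates repeated samples from $Y_i = \beta_0 + \beta_1 X_i + \epsilon_i$ with normal errors and refits the slope each time. The parameter values, exposure grid, seed, and the helper name `simulate_slopes` are all arbitrary choices made for illustration.

```python
import random
from statistics import mean, stdev

def simulate_slopes(beta0, beta1, sigma, x, n_samples, seed=0):
    """Draw n_samples datasets from Y_i = beta0 + beta1*x_i + eps_i,
    with eps_i ~ N(0, sigma^2), and return the least-squares slope of each."""
    rng = random.Random(seed)          # fixed seed so the run is reproducible
    x_bar = mean(x)
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slopes = []
    for _ in range(n_samples):
        y = [beta0 + beta1 * xi + rng.gauss(0.0, sigma) for xi in x]
        y_bar = mean(y)
        s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
        slopes.append(s_xy / s_xx)
    return slopes

# Hypothetical setup: 20 fixed exposure values, true slope 2, error SD 3
x = [float(i) for i in range(1, 21)]
slopes = simulate_slopes(beta0=-1.0, beta1=2.0, sigma=3.0, x=x, n_samples=2000)

# Theoretical standard error of the slope: sigma / sqrt(sum (x_i - x_bar)^2)
theoretical_se = 3.0 / sum((xi - mean(x)) ** 2 for xi in x) ** 0.5
print(mean(slopes), stdev(slopes), theoretical_se)
```

The fitted slopes scatter around the true value, and their spread should sit close to the theoretical standard error. This is the sampling variability that inference on $\beta_1$ has to account for.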
Assumptions of Linear Regression
Independence: Each data point ($Y_i, X_i$) is independent of the others.
Linearity: The mean of $Y_i$ is a linear function of $X_i$.
Normality of Errors: Errors $\epsilon_i$ follow a normal distribution.
Constant Variance: The variance of the errors is constant across all values of $X_i$ (homoscedasticity).
Checking Assumptions
Verifying the assumptions is important for valid inference, though the checks are rarely reported in practice.
Independence: Addressed through data collection methods and study design.
Linearity Assessment: Inspect a scatterplot with the fitted regression line to judge whether the relationship is linear.
Residual Analysis: Examine residuals for systematic patterns that indicate lack of fit.
Normality Assessment of Residuals:
Use histograms and QQ plots of the residuals to investigate normality.
Constant Variance Assessment: Check whether the spread of the residuals changes with the exposure or the fitted values, which would indicate non-constant variance.
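Residual checks are usually done visually, but a rough numeric stand-in can illustrate the idea. The sketch below fits a line, then compares the residual variance in the upper half of the exposure values with that in the lower half. The data are made up so that the noise grows with $x$. The helper `residual_checks` and the variance-ratio summary are illustrative assumptions, not a formal test.

```python
from statistics import mean, pvariance

def residual_checks(x, y):
    """Fit y on x by least squares and return simple numeric residual diagnostics."""
    x_bar, y_bar = mean(x), mean(y)
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
    b0 = y_bar - b1 * x_bar
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    # Split residuals by exposure: lower half vs upper half of the x values
    order = sorted(range(len(x)), key=lambda i: x[i])
    half = len(order) // 2
    low = [residuals[i] for i in order[:half]]
    high = [residuals[i] for i in order[half:]]
    return {
        "residuals": residuals,
        # Always ~0 for least squares with an intercept, so not itself a diagnostic
        "mean_residual": mean(residuals),
        # A ratio far from 1 hints at non-constant variance
        "variance_ratio": pvariance(high) / pvariance(low),
    }

# Hypothetical data: y = 2x plus noise whose size grows with x
x = [float(i) for i in range(1, 11)]
noise = [0.1, -0.1, 0.2, -0.2, 0.3, -1.0, 1.5, -2.0, 2.5, -3.0]
y = [2.0 * xi + ei for xi, ei in zip(x, noise)]

result = residual_checks(x, y)
print(result["mean_residual"], result["variance_ratio"])
```

A ratio well above 1 here reflects the fan-shaped spread built into the toy data. With real data, a plot of residuals against fitted values remains the more informative check.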
Conclusion on Assumptions
In practice, the assumptions of linear regression are rarely met perfectly.
Key focus areas in order of importance for inference validity:
Linearity
Independence
Constant Variance
Normality
Remember, "All models are wrong, but some are useful."