Simple Linear Regression: Model and Theory

SIMPLE LINEAR REGRESSION
Linear Regression Overview
  • Linear regression is used to summarize the relationship between two numerical variables.

  • Slope Parameter ($\beta_1$):

    • Summary measure of the effect of exposure on outcome.

    • A positive value for $\beta_1$ indicates a positive relationship; a negative value indicates a negative relationship.

    • $\beta_1 = 0$ suggests no linear relationship.

  • Objectives:

    • To perform statistical inference regarding the exposure-outcome relationship.

    • Formal hypothesis testing:

      • Null Hypothesis ($H_0$): $\beta_1 = 0$

      • Alternative Hypothesis ($H_1$): $\beta_1 \neq 0$

  • Review of regression models and assumptions necessary for statistical inference on linear regression parameters.

Summarizing Relationships Between Variables
  • Relationships can be summarized graphically (e.g., scatterplots) and numerically (e.g., correlation coefficients).

  • Data Types: Numerical and Categorical.

    • Method of summarization depends on the exposure and outcome data type.

    • For numerical data: Scatterplots and Correlation measures can be applied.

    • For categorical data: Different summarization methods apply.

Linear Regression
  • Model Description:

    • Linear regression describes the relationship between exposure and outcome using a linear equation:
      $y = \beta_0 + \beta_1 x$

    • Examples:

    • For $y = -1 + 2x$, the parameters are:

      • $\beta_0$ (y-intercept): value of $y$ where the line crosses the y-axis ($x = 0$); here $\beta_0 = -1$.

      • $\beta_1$ (slope): rate of change; $y$ increases by $\beta_1$ units for every 1-unit increase in $x$; here $\beta_1 = 2$.
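
As a quick check of these two interpretations, the example line $y = -1 + 2x$ can be evaluated directly. This is a minimal sketch; the function name `line` is an illustrative choice, not part of the notes.

```python
def line(x, beta0=-1.0, beta1=2.0):
    """Evaluate y = beta0 + beta1 * x for the example line y = -1 + 2x."""
    return beta0 + beta1 * x

# The intercept is the value of y at x = 0.
intercept = line(0)

# The slope is the change in y for a 1-unit increase in x,
# and it is the same wherever on the line we start.
slope_at_3 = line(4) - line(3)
slope_at_10 = line(11) - line(10)

print(intercept, slope_at_3, slope_at_10)  # -1.0 2.0 2.0
```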

Linear Regression Prediction
  • Linear regression is a widely used analysis tool in statistics.

  • Main focus: compute parameters and interpret the model: $y = \beta_0 + \beta_1 x$

    • Where:

    • $y$ = outcome variable

    • $x$ = exposure variable

  • Parameter estimates are computed as follows:

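    • Least squares estimates, where $r$ is the sample correlation, $s_x$ and $s_y$ are the sample standard deviations, and $\bar{x}$ and $\bar{y}$ are the sample means:

    • $\hat{\beta_1} = r \times \frac{s_y}{s_x}$

    • $\hat{\beta_0} = \bar{y} - \hat{\beta_1} \bar{x}$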

  • Predicted outcome for a fixed exposure value $x$:

    • $\hat{y} = \hat{\beta_0} + \hat{\beta_1} x$
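
The estimation and prediction formulas in this section can be sketched in a few lines of Python using only the standard library. The data below are made up for illustration (they are not from any dataset in these notes), and the function names `fit_simple_regression` and `predict` are hypothetical.

```python
from statistics import mean, stdev

def fit_simple_regression(x, y):
    """Return (beta0_hat, beta1_hat) using beta1_hat = r * s_y / s_x."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    s_x, s_y = stdev(x), stdev(y)  # sample standard deviations (n - 1 denominator)
    # Sample correlation coefficient r = cov(x, y) / (s_x * s_y)
    cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
    r = cov_xy / (s_x * s_y)
    beta1_hat = r * s_y / s_x
    beta0_hat = y_bar - beta1_hat * x_bar
    return beta0_hat, beta1_hat

def predict(beta0_hat, beta1_hat, x_new):
    """Predicted outcome for a fixed exposure value x_new."""
    return beta0_hat + beta1_hat * x_new

# Made-up illustrative data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b0, b1 = fit_simple_regression(x, y)
print(round(b0, 3), round(b1, 3))    # 2.2 0.6
print(round(predict(b0, b1, 6), 3))  # 5.8
```

In practice a statistics package would be used instead; the hand-rolled version is only meant to make each term of the formulas visible.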

Interpretation of Linear Regression Parameters
  • Intercept ($\beta_0$): Predicted outcome value when exposure is zero.

    • $\hat{y} = \beta_0 + \beta_1 \cdot 0 = \beta_0$

  • Slope ($\beta_1$): Often referred to as the “effect size.”

    • Indicates the predicted mean outcome changes by $\beta_1$ units for each 1-unit increase in exposure.

    • Interpretations of $\beta_1$:

      • $\beta_1 > 0$: Positive association

      • $\beta_1 < 0$: Negative association

      • $\beta_1 = 0$: No association

    • Larger magnitudes of $\beta_1$ suggest greater changes in the outcome per change in the exposure.

    • For estimated slope ($\hat{\beta_1}$):

      • No Association: $\hat{\beta_1} \approx 0$

      • Positive Association: $\hat{\beta_1} > 0$

      • Negative Association: $\hat{\beta_1} < 0$

Example Dataset: mtcars
  • Dataset Description:

    • Classic mtcars dataset in R covering fuel consumption and 10 aspects of automobile design for 32 automobiles (1973–74).

  • Variables Considered:

    • Weight (1000s lbs)

    • Miles per gallon (mpg)

    • Quarter mile time (qsec)

Relationship Between Car Weight and Quarter Mile Time

  • Scatterplot indicates no clear relationship.

  • Correlation coefficient between weight and quarter mile time:

    • $r = -0.174$

  • Conclusion: The linear regression model fits the data poorly.

    • $R^2 = 0.03$: Only 3% of the variation in quarter mile time is explained by the model.

  • Model equation:
    $Qsec = 18.9 - 0.32 \cdot Weight$

Relationship Between Car Weight and Fuel Economy

  • Observations indicate a negative, strong, and linear relationship in the scatterplot.

  • Correlation coefficient:

    • $r = -0.868$

  • Fitted linear model:

    • $Mpg = 37.3 - 5.3 \cdot Weight$

  • Interpretation: A 1000 lb increase in weight is associated with a predicted decrease in fuel economy of 5.3 mpg.

  • Model adequacy:

    • $R^2 = 0.75$ indicates that 75% of the variation in fuel efficiency is explained by the linear relationship with weight.
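
To make the fitted equations concrete, the sketch below plugs hypothetical car weights into the two rounded mtcars equations reported in these notes, and checks that each reported $R^2$ matches the squared correlation (which holds for simple linear regression). The function names and the 3000 lb and 4000 lb example weights are illustrative assumptions.

```python
def predict_mpg(weight):
    """Rounded fitted line from the notes: Mpg = 37.3 - 5.3 * Weight (1000s of lbs)."""
    return 37.3 - 5.3 * weight

def predict_qsec(weight):
    """Rounded fitted line from the notes: Qsec = 18.9 - 0.32 * Weight (1000s of lbs)."""
    return 18.9 - 0.32 * weight

# Predicted fuel economy for hypothetical 3000 lb and 4000 lb cars
mpg_3 = predict_mpg(3.0)
mpg_4 = predict_mpg(4.0)
print(round(mpg_3, 1), round(mpg_4, 1), round(mpg_4 - mpg_3, 1))  # 21.4 16.1 -5.3

# In simple linear regression, R^2 equals the squared correlation r^2.
r_mpg, r_qsec = -0.868, -0.174
print(round(r_mpg ** 2, 2), round(r_qsec ** 2, 2))  # 0.75 0.03
```

The 1000 lb difference between the two hypothetical cars reproduces the slope, which is the sense in which the slope is the predicted change per unit of exposure.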

Goal: Evaluate Regressions using Inference Framework
  • Test whether there is sufficient evidence to conclude that $\beta_1 \neq 0$.

Statistical Inference for Linear Regression Models
  • Population Parameters:

    • True relationship denoted by $\beta_0$, $\beta_1$.

  • If $\beta_1 = 0$, the expected outcome is the same regardless of exposure value.

  • If $\beta_1 \neq 0$, a true linear relationship exists.

  • Focus typically on assessing if slope parameter ($\beta_1$) is non-zero.

Population vs Sample Estimates

  • True parameters: $\beta_0$, $\beta_1$

  • Estimated parameters: $\hat{\beta_0}$, $\hat{\beta_1}$ based on random sampling.

Hypothesis Testing in Linear Regression
  • Null Hypothesis: No association between exposure and outcome ($H_0: \beta_1 = 0$)

  • Alternative Hypothesis: Indicates some association ($H_1: \beta_1 \neq 0$)

  • Null model: under $H_0$, $y = \beta_0$ (a constant, with no dependence on $x$) is sufficient to predict the data.

  • Variation exists in estimates across independent random samples.

Variability and Sampling Distribution
  • The estimate of the slope ($\hat{\beta_1}$) has a sampling distribution and a standard error.

  • Inference uses point estimate $\hat{\beta_1}$ along with its standard error.
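
The notes do not give the standard error formula, so the sketch below assumes the usual textbook one: $SE(\hat{\beta_1}) = \sqrt{\hat{\sigma}^2 / \sum (x_i - \bar{x})^2}$ with $\hat{\sigma}^2 = \sum e_i^2 / (n - 2)$, and the test statistic $t = \hat{\beta_1} / SE(\hat{\beta_1})$, which is compared to a $t$ distribution with $n - 2$ degrees of freedom under $H_0: \beta_1 = 0$. The data are made up and the function name `slope_inference` is hypothetical.

```python
import math
from statistics import mean

def slope_inference(x, y):
    """Return (beta1_hat, standard error, t statistic, degrees of freedom)."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta1_hat = s_xy / s_xx
    beta0_hat = y_bar - beta1_hat * x_bar
    # Residual sum of squares and the usual estimate of sigma^2 (n - 2 denominator)
    sse = sum((yi - (beta0_hat + beta1_hat * xi)) ** 2 for xi, yi in zip(x, y))
    sigma2_hat = sse / (n - 2)
    se_beta1 = math.sqrt(sigma2_hat / s_xx)
    t_stat = beta1_hat / se_beta1  # test statistic for H0: beta1 = 0
    return beta1_hat, se_beta1, t_stat, n - 2

# Made-up illustrative data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

beta1_hat, se_beta1, t_stat, df = slope_inference(x, y)
print(round(beta1_hat, 3), round(se_beta1, 3), round(t_stat, 3), df)  # 0.6 0.283 2.121 3
```

Turning the statistic into a p-value needs the $t$ distribution's CDF, which the Python standard library does not provide, so that step is left to a statistics package.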

Residuals
  • Defined as the difference (signed vertical distance) between the observed value and the regression line for the $i^{th}$ data point: $e_i = y_i - \hat{y}_i$.

  • Linear regression parameters ($\hat{\beta_0}$ and $\hat{\beta_1}$) are obtained by minimizing the sum of squared residuals.

  • Least Squares Regression Line: Minimizes the sum of squared residuals, defined mathematically as:

    • $f(\beta_0, \beta_1) = \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$, where $\hat{y}_i = \beta_0 + \beta_1 x_i$

  • Use calculus to determine parameter values that minimize this function.
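
The calculus step can be sketched as follows: set both partial derivatives of $f$ to zero and solve. This is the standard derivation rather than one spelled out in these notes; the second result uses the first (substituting $\beta_0 = \bar{y} - \beta_1 \bar{x}$) and then the definitions of $r$, $s_x$, and $s_y$.

```latex
\begin{align*}
f(\beta_0, \beta_1) &= \sum_{i=1}^{N} (y_i - \beta_0 - \beta_1 x_i)^2 \\[4pt]
\frac{\partial f}{\partial \beta_0} &= -2 \sum_{i=1}^{N} (y_i - \beta_0 - \beta_1 x_i) = 0
  \;\Longrightarrow\; \hat{\beta_0} = \bar{y} - \hat{\beta_1} \bar{x} \\[4pt]
\frac{\partial f}{\partial \beta_1} &= -2 \sum_{i=1}^{N} x_i (y_i - \beta_0 - \beta_1 x_i) = 0
  \;\Longrightarrow\; \hat{\beta_1}
  = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{N} (x_i - \bar{x})^2}
  = r \times \frac{s_y}{s_x}
\end{align*}
```

The last equality follows because $r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{(N-1)\, s_x s_y}$ and $\sum (x_i - \bar{x})^2 = (N-1)\, s_x^2$.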

Statistical Model for Simple Linear Regression
  • Defined as:
    $Y_i = \beta_0 + \beta_1 X_i + \epsilon_i$

  • Assumes $\epsilon_i \sim N(0, \sigma^2)$: the error terms are normally distributed with mean zero and constant variance $\sigma^2$.

  • The model captures the linear relationship between outcomes (Y) and exposure/covariate (X).
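
One way to build intuition for this model is to simulate from it and see how closely least squares recovers the true parameters. The sketch below is illustrative only: the true values ($\beta_0 = 1$, $\beta_1 = 2$, $\sigma = 1$), the uniform distribution for $X$, and the function names are all assumptions chosen for the demonstration.

```python
import random
from statistics import mean

def simulate(n, beta0, beta1, sigma, seed=0):
    """Draw (x, y) from Y_i = beta0 + beta1 * X_i + eps_i with eps_i ~ N(0, sigma^2)."""
    rng = random.Random(seed)
    x = [rng.uniform(0, 10) for _ in range(n)]
    y = [beta0 + beta1 * xi + rng.gauss(0, sigma) for xi in x]
    return x, y

def least_squares(x, y):
    """Return (beta0_hat, beta1_hat) for simple linear regression."""
    x_bar, y_bar = mean(x), mean(y)
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta1_hat = s_xy / s_xx
    return y_bar - beta1_hat * x_bar, beta1_hat

# One large sample: the estimates should land near the true values 1.0 and 2.0.
x, y = simulate(n=5000, beta0=1.0, beta1=2.0, sigma=1.0)
b0, b1 = least_squares(x, y)
print(b0, b1)

# Many smaller samples: the slope estimate varies from sample to sample.
slopes = [least_squares(*simulate(200, 1.0, 2.0, 1.0, seed=s))[1] for s in range(20)]
print(min(slopes), max(slopes))
```

The spread of `slopes` across seeds is a small-scale picture of the sampling distribution of $\hat{\beta_1}$.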

Assumptions of Linear Regression
  1. Independence: Each data point ($Y_i, X_i$) is independent of the others.

  2. Linearity: The mean of $Y_i$ is a linear function of $X_i$.

  3. Normality of Errors: Errors $\epsilon_i$ follow a normal distribution.

  4. Constant Variance: The variance of the errors is constant across all values of $X_i$ (homoscedasticity).

Checking Assumptions
  • Verification is important for valid inference, though it is rarely reported in practice.

  • Independence: Addressed through data collection methods and study design.

  • Linearity Assessment: Scatterplot analysis or regression fit to observe relationships.

  • Residual Analysis: Examine residuals for patterns that indicate lack of fit.

  • Normality Assessment of Residuals:

    • Use histograms, QQ plots to investigate normality.

  • Constant Variance Assessment: Check residuals for patterns indicating non-constant variance.
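
These checks are normally done by eye with residual plots. As a rough numeric stand-in, the sketch below computes the residuals and compares their spread in the lower and upper halves of the exposure range; a large difference would hint at non-constant variance. The data are made up, the helper names are hypothetical, and the half-split comparison is an informal heuristic, not a formal test.

```python
from statistics import mean, stdev

def residuals(x, y):
    """Least squares residuals e_i = y_i - (beta0_hat + beta1_hat * x_i)."""
    x_bar, y_bar = mean(x), mean(y)
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    beta1_hat = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
    beta0_hat = y_bar - beta1_hat * x_bar
    return [yi - (beta0_hat + beta1_hat * xi) for xi, yi in zip(x, y)]

def spread_by_half(x, e):
    """Residual standard deviation for the lower and upper halves of x."""
    pairs = sorted(zip(x, e))
    half = len(pairs) // 2
    low = [ei for _, ei in pairs[:half]]
    high = [ei for _, ei in pairs[half:]]
    return stdev(low), stdev(high)

# Made-up illustrative data
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]

e = residuals(x, y)
sd_low, sd_high = spread_by_half(x, e)
print(abs(sum(e)) < 1e-9)  # least squares residuals sum to zero (up to rounding)
print(sd_low, sd_high)     # similar values are consistent with constant variance
```

A histogram or QQ plot of `e` would be the usual next step for the normality check.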

Conclusion on Assumptions
  • Generally, assumptions for linear regression are rarely perfectly met.

  • Key focus areas in order of importance for inference validity:

    1. Linearity

    2. Independence

    3. Constant Variance

    4. Normality

  • Remember, "All models are wrong, but some are useful."