# Linear Regression Notes

## Linear Regression

Used to predict a variable $y$ based on the value of another variable $x$.

- $x$: experimenter controls.
- $y$: experimenter measures.

### Learning Objectives

- Explain the motivation and practical applications of linear regression in data analysis.
- Define the residual as the difference between the observed and estimated values.
- Calculate the optimal estimates of the slope and intercept from given $x$ and $y$ data.
- Calculate and understand the relationship between SST, SSE, and SSR.
- Calculate and interpret the coefficient of determination and the correlation coefficient.
- Discuss possible explanations for the correlation between two variables.
- Understand how linear regression can be used to estimate nonlinear relationships.
- Calculate the MSE and the uncertainty in the estimated slope and intercept.
- Conduct a hypothesis test on the slope or intercept using confidence intervals.
- List and assess the assumptions of least-squares regression using graphical techniques.

## Linear Regression Model

Basic regression model that is linear and includes one predictor:
$$Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$$

where:

- $\beta_0$ is the intercept.
- $\beta_1$ is the slope.
- $\varepsilon_i \sim N(0, \sigma)$.

## Optimal Estimates for β0 and β1

Goal: find optimal estimates for $\beta_0$ and $\beta_1$ so that the estimated line is closer to the data than any other line.

Estimated line: $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$

### Method of Ordinary Least Squares (OLS)

Used to find the optimal $\beta_0$ and $\beta_1$. The measure of closeness is the error in each prediction: the difference between the observed value $y_i$ and the predicted value $\hat{y}_i$.

- Residuals: $e_i = y_i - \hat{y}_i$
- Square the residuals to avoid cancellation.
- Overall measure of closeness: the Sum of Squared Errors (SSE).
- Minimize $SSE$ to find the line closest to the data.
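The residual/SSE idea can be illustrated numerically; the data and candidate line below are made up for the sketch:

```python
# Toy data (hypothetical): x is controlled, y is measured.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]

# A candidate line y-hat = b0 + b1 * x (values chosen for illustration).
b0, b1 = 0.2, 1.9

# Residuals e_i = y_i - y-hat_i, squared and summed to get SSE.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
sse = sum(e ** 2 for e in residuals)  # ≈ 0.1 for this candidate line
```

OLS searches over all candidate $(b_0, b_1)$ pairs for the one that makes this sum as small as possible.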
The minimizing estimates are:

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$

$$\hat{\beta}_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}$$

Alternative formulas:

$$S_{XX} = \sum_i x_i^2 - n\bar{x}^2 \qquad S_{XY} = \sum_i x_i y_i - n\bar{x}\bar{y} \qquad \hat{\beta}_1 = \frac{S_{XY}}{S_{XX}}$$

### Method of Ordinary Least Squares (OLS): Summary

Given observations $x_1, y_1, \dots, x_n, y_n$, find optimal estimates for $\beta_0$ and $\beta_1$ so that the estimated line is closer to the data than any other line.

- For each $x_i$, the estimated value is $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$.
- This line has the lowest SSE for the given data.
- The quantities $e_i = y_i - \hat{y}_i$ are called residuals.

$$SSE = \sum_i e_i^2 = \sum_i (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2$$

## Quantifying Goodness-of-Fit

- $SSE$ (Sum of Squared Errors): remaining variation after accounting for the relationship between $X$ and $Y$.
- $SST$ (Total Sum of Squares): total variation if the relationship between $X$ and $Y$ is not accounted for.
- $SSR$ (Regression Sum of Squares): variation accounted for by the relationship between $X$ and $Y$.

$$SST = \sum_i (y_i - \bar{y})^2 \qquad SSE = \sum_i (y_i - \hat{y}_i)^2 \qquad SSR = \sum_i (\hat{y}_i - \bar{y})^2$$

### Coefficient of Determination

$r^2$: proportion of the total variation (SST) that is explained by the model (SSR).

$$r^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$$

Properties of $r^2$:

- $0 \le r^2 \le 1$
- Unitless.
- Larger values suggest a better fit.

### Coefficient of Determination vs. Correlation

The square root of the coefficient of determination is the sample correlation coefficient: $r = \pm\sqrt{r^2}$, where the sign matches the sign of the slope.

The notation is meaningful: $r$ is an estimate of $\rho$ (just like $s$ is an estimate of $\sigma$).

- $r$ is based on the observations $x_1, y_1, \dots, x_n, y_n$, which are a sample.
- $\rho$ is based on the random variables $X$ and $Y$, which are the population:

$$\rho = \frac{\operatorname{Cov}(X, Y)}{\sigma_X \sigma_Y}$$
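The OLS estimates and goodness-of-fit quantities above can be computed directly from the formulas; a sketch with made-up data:

```python
import math

# Hypothetical sample (x_i, y_i).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]

n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n

# S_XX = sum x_i^2 - n*xbar^2,  S_XY = sum x_i*y_i - n*xbar*ybar
sxx = sum(xi ** 2 for xi in x) - n * xbar ** 2
sxy = sum(xi * yi for xi, yi in zip(x, y)) - n * xbar * ybar

b1 = sxy / sxx         # slope estimate beta1-hat
b0 = ybar - b1 * xbar  # intercept estimate beta0-hat

# Decompose the variation: SST = SSE + SSR for the OLS line.
yhat = [b0 + b1 * xi for xi in x]
sst = sum((yi - ybar) ** 2 for yi in y)
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))
ssr = sum((yh - ybar) ** 2 for yh in yhat)

r2 = ssr / sst                         # coefficient of determination
r = math.copysign(math.sqrt(r2), b1)   # correlation takes the slope's sign
```

For this toy data the identity $SST = SSE + SSR$ holds (it always does for the least-squares line), and $r^2$ is close to 1 because the points lie near a line.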
## Correlation vs. Causation

Correlation does not imply causation. Possible explanations for a correlation between $X$ and $Y$:

- $X$ caused $Y$.
- $Y$ caused $X$.
- $X$ and $Y$ impact each other.
- A common third factor $Z$ influences $X$ and $Y$ simultaneously.
- Just a coincidence due to sampling variation.

## Correlation & Outliers

- Correlation can be significantly affected by outliers.
- Some outliers are caused by data-recording errors: these outliers can be corrected or deleted.
- Deleting outliers without justification is not appropriate.

## Nonlinear Models

Correlation measures linear association. Lack of correlation suggests a lack of linear association (there could still be a nonlinear association).

If we suspect a nonlinear relationship exists, then in some cases we can still use linear regression: apply linear regression to a transformed set of variables. For example, if $Y = \beta_0 + \beta_1 x^2$, regress $Y$ on the new variable $x' = x^2$; the model is linear in $x'$.

## Uncertainty in Regression Estimates

The least-squares values $\hat{\beta}_0$ and $\hat{\beta}_1$ are estimates of the true unknown parameters $\beta_0$ and $\beta_1$.
Treat $\hat{\beta}_0$ and $\hat{\beta}_1$ as random variables:

$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i \qquad \hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$$

Precision: the uncertainty in $\hat{\beta}_0$ and $\hat{\beta}_1$.

### Uncertainty in β̂0 and β̂1

Under the assumptions of linear regression (discussed later), the uncertainty in $\hat{\beta}_0$ and $\hat{\beta}_1$ as a function of $\sigma$ is

$$\operatorname{Var}(\hat{\beta}_0) = \sigma^2 \frac{\sum_i x_i^2}{n S_{XX}} \qquad \operatorname{Var}(\hat{\beta}_1) = \frac{\sigma^2}{S_{XX}}$$

where $S_{XX} = \sum_i x_i^2 - n\bar{x}^2$.

However, $\sigma^2 = \operatorname{Var}(\varepsilon_i)$ is a parameter and hence is typically unknown. Use the residuals $e_i$ to estimate $\sigma^2 = \operatorname{Var}(\varepsilon_i)$; this estimate is also called the Mean Squared Error (MSE).

### Uncertainty in β̂0 and β̂1: Summary

- $\hat{\beta}_0$ and $\hat{\beta}_1$ are estimators of the unknown parameters $\beta_0$ and $\beta_1$.
- The uncertainty in $\hat{\beta}_0$ and $\hat{\beta}_1$ is controlled by the variance of the error: $\operatorname{Var}(\varepsilon_i) = \sigma^2$, where $\varepsilon_i = y_i - \beta_0 - \beta_1 x_i$.
- The unbiased estimator of $\sigma^2$ is:

$$MSE = s^2 = \frac{SSE}{n - 2}$$

Then the uncertainty in $\hat{\beta}_0$ and $\hat{\beta}_1$ can be estimated as:

$$\widehat{\operatorname{Var}}(\hat{\beta}_0) = s^2 \frac{\sum_i x_i^2}{n S_{XX}} \qquad \widehat{\operatorname{Var}}(\hat{\beta}_1) = \frac{s^2}{S_{XX}}$$

## Confidence Intervals for β0 and β1

Two-sided $100(1 - \alpha)\%$ confidence intervals for $\beta_0$ and $\beta_1$ are given by:

$$\hat{\beta}_0 \pm t_{\alpha/2,\, n-2}\, s_{\hat{\beta}_0} \qquad \hat{\beta}_1 \pm t_{\alpha/2,\, n-2}\, s_{\hat{\beta}_1}$$

## Hypothesis Testing for β0 and β1

One use of linear regression is to test whether or not there is a significant relationship between two variables. Does strength $y$ depend on cement content $x$?
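The MSE, standard-error, and confidence-interval formulas above can be sketched in Python (the data are hypothetical; `scipy` supplies the t critical value):

```python
from scipy import stats

# Hypothetical sample.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 2.9, 4.1, 6.3, 7.2, 9.1]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

# OLS estimates via S_XX and S_XY.
sxx = sum(xi ** 2 for xi in x) - n * xbar ** 2
sxy = sum(xi * yi for xi, yi in zip(x, y)) - n * xbar * ybar
b1 = sxy / sxx
b0 = ybar - b1 * xbar

# MSE = s^2 = SSE / (n - 2): unbiased estimate of sigma^2.
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
mse = sse / (n - 2)

# Estimated standard errors of the intercept and slope.
se_b0 = (mse * sum(xi ** 2 for xi in x) / (n * sxx)) ** 0.5
se_b1 = (mse / sxx) ** 0.5

# Two-sided 95% confidence interval for the slope.
alpha = 0.05
tcrit = stats.t.ppf(1 - alpha / 2, df=n - 2)
ci_b1 = (b1 - tcrit * se_b1, b1 + tcrit * se_b1)
```

Note the $n - 2$ degrees of freedom: two parameters ($\beta_0$, $\beta_1$) were estimated from the data.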
If there is no relationship between strength and cement content, then $\beta_1 = 0$:

$$H_0: \beta_1 = 0 \qquad H_1: \beta_1 \neq 0$$

$$t\text{-score} = \frac{\hat{\beta}_1 - \beta_1}{s_{\hat{\beta}_1}}$$

$$p\text{-value} = P(t < -|t\text{-score}|) + P(t > |t\text{-score}|)$$

## Assumptions for Least Squares

Everything we discussed (variances of $\hat{\beta}_0$ and $\hat{\beta}_1$, confidence intervals, etc.) is only appropriate if the model assumptions are satisfied: the LINE conditions.

- **L**inear relationship between $X$ and $Y$.
- **I**ndependent error terms $\varepsilon_i$ (and therefore, independent observations).
- **N**ormally distributed error terms $\varepsilon_i$.
- **E**qual variance $\sigma^2$ of $\varepsilon_i$ along the regression line.

If the assumptions are satisfied, the residuals $e_i$ should reflect these properties.

### Checking Assumptions

- Linear relationship: plot the residuals $e_i$ against $x_i$ (or against the estimated values $\hat{y}_i$).
- Independent errors: plot the residuals $e_i$ against $x_i$ (or against $\hat{y}_i$).
- Normally distributed errors $\varepsilon_i$: plot a histogram or Q-Q plot of the residuals.
- Equal variance: plot the residuals $e_i$ against $x_i$ (or against $\hat{y}_i$).
- Outlier check: plot the residuals $e_i$ against $x_i$ (or against $\hat{y}_i$).
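The slope hypothesis test above ($H_0: \beta_1 = 0$) can be sketched with `scipy`; the data are made up for illustration:

```python
from scipy import stats

# Hypothetical strength (y) vs. cement content (x) data.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 2.9, 4.1, 6.3, 7.2, 9.1]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

# OLS slope and its estimated standard error.
sxx = sum(xi ** 2 for xi in x) - n * xbar ** 2
sxy = sum(xi * yi for xi, yi in zip(x, y)) - n * xbar * ybar
b1 = sxy / sxx
b0 = ybar - b1 * xbar
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
se_b1 = (sse / (n - 2) / sxx) ** 0.5

# t-score under H0: beta_1 = 0, with n - 2 degrees of freedom.
t_score = (b1 - 0) / se_b1

# Two-sided p-value: P(t < -|t-score|) + P(t > |t-score|).
p_value = 2 * stats.t.sf(abs(t_score), df=n - 2)
```

A small p-value (below the chosen $\alpha$) rejects $H_0$, i.e., the data support a relationship between the two variables.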
### Remedies

If a simple linear regression model is not appropriate for the data:

- Use a more appropriate model, or
- Apply a transformation to the data and then use linear regression.

Unequal variances and nonnormality of the errors frequently occur together. To remedy these violations, we need a transformation on $Y$. Such a transformation may at the same time also help to linearize a curvilinear relation. At other times, a simultaneous transformation on $X$ along with $Y$ may be needed.
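A sketch of one such remedy: when the data grow roughly exponentially (variance increasing with the mean), regressing $\log y$ on $x$ both stabilizes the variance and linearizes the relation. The data below are made up to follow $y \approx e^x$:

```python
import math

# Hypothetical data with a curvilinear (roughly exponential) trend.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.7, 7.5, 19.9, 55.0, 148.0]

# Transform Y, then apply ordinary linear regression to (x, log y).
logy = [math.log(yi) for yi in y]
n = len(x)
xbar, lbar = sum(x) / n, sum(logy) / n

sxx = sum(xi ** 2 for xi in x) - n * xbar ** 2
sxy = sum(xi * li for xi, li in zip(x, logy)) - n * xbar * lbar
b1 = sxy / sxx          # slope on the log scale
b0 = lbar - b1 * xbar   # intercept on the log scale

# Back on the original scale the fit is y ≈ exp(b0) * exp(b1 * x).
```

After the transformation, the LINE conditions should be re-checked on the residuals of the transformed model, not the original one.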