Linear Regression Notes

Linear Regression

  • Used to predict a variable $y$ based on the value of another variable $x$.
  • $x$: the variable the experimenter controls.
  • $y$: the variable the experimenter measures.

Learning Objectives

  1. Explain the motivation and practical applications of linear regression in data analysis.
  2. Define residual as the difference between the observed and estimated values.
  3. Calculate the optimal estimates of the slope and intercept from given x and y data.
  4. Calculate and understand the relationship between SST, SSE, and SSR.
  5. Calculate and interpret the coefficient of determination and correlation coefficient.
  6. Discuss possible explanations for the correlation between two variables.
  7. Understand how linear regression can be used to estimate nonlinear relationships.
  8. Calculate the MSE and the uncertainty in the estimated slope and intercept.
  9. Conduct a hypothesis test on the slope or intercept using confidence intervals.
  10. List and assess the assumptions of least-squares regression using graphical techniques.

Linear Regression Model

  • Basic regression model that is linear and includes one predictor: $Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$, where:
    • $\beta_0$ is the intercept.
    • $\beta_1$ is the slope.
    • $\varepsilon_i \sim N(0, \sigma)$ is the random error term.
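As a concrete running example, here is a minimal NumPy sketch that simulates data from this model; the parameter values $\beta_0 = 2$, $\beta_1 = 0.5$, and $\sigma = 1$ are arbitrary choices for illustration, not values from these notes:

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary "true" parameters for illustration
beta0, beta1, sigma = 2.0, 0.5, 1.0

x = np.linspace(0, 10, 30)                 # predictor the experimenter controls
eps = rng.normal(0.0, sigma, size=x.size)  # errors: eps_i ~ N(0, sigma)
y = beta0 + beta1 * x + eps                # response: Y_i = beta0 + beta1*x_i + eps_i
```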

Optimal Estimates for β0 and β1

  • Goal: Find optimal estimates for $\beta_0$ and $\beta_1$ so that the estimated line is closer to the data than any other line.
  • Estimated line: $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$

Method of Ordinary Least Squares (OLS)

  • Used to find the optimal $\beta_0$ and $\beta_1$.
  • Measure of closeness:
    • Error in the prediction: observed value minus estimated value.
    • Difference between the observed values $y_i$ and the predicted values $\hat{y}_i$.
    • Residuals: $e_i = y_i - \hat{y}_i$
  • Square the residuals to avoid cancellation.
  • Overall measure of closeness: Sum of Squared Errors (SSE).
  • Minimize $SSE$ to find the closest line to the data.

Formulas for β̂0 and β̂1

  • $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$
  • $\hat{\beta}_1 = \frac{\sum_{i} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i} (x_i - \bar{x})^2}$
  • Alternative formulas:
    • $S_{XX} = \sum_{i} x_i^2 - n\bar{x}^2$
    • $S_{XY} = \sum_{i} x_i y_i - n\bar{x}\bar{y}$
    • $\hat{\beta}_1 = \frac{S_{XY}}{S_{XX}}$
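A minimal NumPy sketch of these formulas (the helper name `ols_fit` is my own; `x` and `y` are the arrays from the earlier simulation, or any observed data):

```python
import numpy as np

def ols_fit(x, y):
    """Least-squares intercept and slope via the S_XX / S_XY formulas."""
    n = x.size
    x_bar, y_bar = x.mean(), y.mean()
    s_xx = np.sum(x**2) - n * x_bar**2        # S_XX
    s_xy = np.sum(x * y) - n * x_bar * y_bar  # S_XY
    beta1_hat = s_xy / s_xx                   # slope estimate
    beta0_hat = y_bar - beta1_hat * x_bar     # intercept estimate
    return beta0_hat, beta1_hat

beta0_hat, beta1_hat = ols_fit(x, y)
```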

Method of Ordinary Least Squares (OLS) Summary

  • Given observations $(x_1, y_1), \dots, (x_n, y_n)$, find optimal estimates for $\beta_0$ and $\beta_1$ so that the estimated line is closer to the data than any other line.
  • For each $x_i$, the estimated value is obtained by:
    • $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$
  • This line has the lowest SSE for the given data.
    • SSE: Sum of Squared Errors
  • The quantities $e_i = y_i - \hat{y}_i$ are called residuals.
    • $SSE = \sum_{i} e_i^2 = \sum_{i} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2$

Quantifying Goodness-of-Fit

  • $SSE$ (Sum of Squared Errors): Remaining variation after accounting for the relationship between $X$ and $Y$.
  • $SST$ (Total Sum of Squares): Total variation if the relationship between $X$ and $Y$ is not accounted for.
  • $SSR$ (Regression Sum of Squares): Variation accounted for by the relationship between $X$ and $Y$.
  • $SST = \sum_{i} (y_i - \bar{y})^2$
  • $SSE = \sum_{i} (y_i - \hat{y}_i)^2$
  • $SSR = \sum_{i} (\hat{y}_i - \bar{y})^2$
  • The three quantities decompose as $SST = SSR + SSE$.
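Continuing the running sketch (`ols_fit`, `x`, and `y` as defined above), the three sums of squares can be computed from the fitted line and the decomposition checked numerically:

```python
y_hat = beta0_hat + beta1_hat * x    # fitted values

sst = np.sum((y - y.mean())**2)      # total variation
sse = np.sum((y - y_hat)**2)         # remaining (unexplained) variation
ssr = np.sum((y_hat - y.mean())**2)  # variation explained by the line

assert np.isclose(sst, ssr + sse)    # SST = SSR + SSE
```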

Coefficient of Determination

  • $r^2$: Proportion of the total variation ($SST$) that is explained by the model ($SSR$).
    • $r^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$
  • Properties of $r^2$:
    • $0 \le r^2 \le 1$
    • Unitless.
    • Larger values suggest better fit.

Coefficient of Determination vs. Correlation

  • The square root of the coefficient of determination is the sample correlation coefficient:
  • $r = \pm \sqrt{r^2}$, where the sign of $r$ matches the sign of $\hat{\beta}_1$
  • The notation is meaningful: $r$ is an estimate of $\rho$ (just like $s$ is an estimate of $\sigma$)
  • $r$ is based on observations $(x_1, y_1), \dots, (x_n, y_n)$, which are a sample
  • $\rho$ is based on random variables $X$ and $Y$, which are the population
  • $\rho = \frac{\operatorname{Cov}(X, Y)}{\sigma_X \sigma_Y}$
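Continuing the sketch, $r^2$ and $r$ follow from the sums of squares, and $r$ can be cross-checked against NumPy's built-in sample correlation:

```python
r_squared = ssr / sst                        # coefficient of determination
r = np.sign(beta1_hat) * np.sqrt(r_squared)  # sign taken from the slope

# Cross-check against the sample correlation coefficient
assert np.isclose(r, np.corrcoef(x, y)[0, 1])
```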

Correlation vs. Causation

  • Correlation does not imply causation.
  • Possible explanations for a correlation between $X$ and $Y$:
    1. $X$ caused $Y$.
    2. $Y$ caused $X$.
    3. $X$ and $Y$ impact each other.
    4. A common third factor $Z$ influences $X$ and $Y$ simultaneously.
    5. Just a coincidence due to sampling variation.

Correlation & Outliers

  • Correlation can be significantly affected by outliers.
  • Some outliers are caused by data recording errors: these outliers can be corrected/deleted.
  • Deleting outliers without a justification is not appropriate.

Nonlinear Models

  • Correlation measures linear association.
  • Lack of correlation suggests lack of linear association (there could be a nonlinear association).
  • If we suspect a nonlinear relationship exists, then, in some cases, we can still use linear regression.
  • Apply linear regression on a transformed set of variables.
  • For example, if $Y = \beta_0 + \beta_1 x^2$, regress $y$ on the transformed predictor $u = x^2$; the model is linear in $u$, as in the sketch below.
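A minimal sketch of this idea, reusing the `ols_fit` helper from above (the quadratic data here are simulated only for illustration):

```python
# Suppose the true relationship is quadratic: y_q = 1 + 0.3 * x**2 + noise
y_q = 1.0 + 0.3 * x**2 + rng.normal(0.0, 1.0, size=x.size)

u = x**2                      # transformed predictor
b0_q, b1_q = ols_fit(u, y_q)  # ordinary linear regression on (u, y_q)
y_hat_q = b0_q + b1_q * u     # fitted curve on the original x scale
```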

Uncertainty in Regression Estimates

  • The least-squares values $\hat{\beta}_0$ and $\hat{\beta}_1$ are estimates of the true unknown parameters $\beta_0$ and $\beta_1$.
  • Treat $\hat{\beta}_0$ and $\hat{\beta}_1$ as random variables:
    • $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$
    • $\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$
  • Precision: uncertainty in $\hat{\beta}_0$ and $\hat{\beta}_1$.

Uncertainty in β̂0 and β̂1

  • Under the assumptions of linear regression (discussed later), the uncertainty in $\hat{\beta}_0$ and $\hat{\beta}_1$ as a function of $\sigma$ is:
  • $\operatorname{Var}(\hat{\beta}_0) = \sigma^2 \frac{\sum_{i} x_i^2}{n S_{XX}}$
  • $\operatorname{Var}(\hat{\beta}_1) = \frac{\sigma^2}{S_{XX}}$
    • where $S_{XX} = \sum_{i} x_i^2 - n\bar{x}^2$
  • However, $\sigma^2 = \operatorname{Var}(\varepsilon_i)$ is a parameter, and hence is typically unknown.
  • Use the residuals $e_i$ to estimate $\sigma^2 = \operatorname{Var}(\varepsilon_i)$.
    • This estimate is called the Mean Squared Error (MSE).

Uncertainty in β̂0 and β̂1: Summary

  • $\hat{\beta}_0$ and $\hat{\beta}_1$ are estimators of the unknown parameters $\beta_0$ and $\beta_1$.
  • The uncertainty in $\hat{\beta}_0$ and $\hat{\beta}_1$ is controlled by the variance of the error:
    • $\operatorname{Var}(\varepsilon_i) = \sigma^2$, where $\varepsilon_i = y_i - (\beta_0 + \beta_1 x_i)$
  • The unbiased estimator of $\sigma^2$ is:
    • $MSE = s^2 = \frac{SSE}{n - 2}$
  • Then, the uncertainty in $\hat{\beta}_0$ and $\hat{\beta}_1$ can be estimated as:
    • $\operatorname{Var}(\hat{\beta}_0) = s^2 \frac{\sum_{i} x_i^2}{n S_{XX}}$
    • $\operatorname{Var}(\hat{\beta}_1) = \frac{s^2}{S_{XX}}$
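Continuing the sketch, the MSE and the estimated standard errors $s_{\hat{\beta}_0}$ and $s_{\hat{\beta}_1}$ (the square roots of the estimated variances) follow directly from the residuals of the linear fit:

```python
n = x.size
resid = y - y_hat                 # residuals e_i from the linear fit
mse = np.sum(resid**2) / (n - 2)  # s^2 = SSE / (n - 2)

s_xx = np.sum(x**2) - n * x.mean()**2
se_beta1 = np.sqrt(mse / s_xx)                       # s_{beta1_hat}
se_beta0 = np.sqrt(mse * np.sum(x**2) / (n * s_xx))  # s_{beta0_hat}
```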

Confidence Intervals for β0 and β1

  • Two-sided $100(1 - \alpha)\%$ confidence intervals for $\beta_0$ and $\beta_1$ are given by:
    • $\hat{\beta}_0 \pm t_{\alpha/2,\, n-2} \, s_{\hat{\beta}_0}$
    • $\hat{\beta}_1 \pm t_{\alpha/2,\, n-2} \, s_{\hat{\beta}_1}$
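A minimal sketch using SciPy's t distribution, with `alpha = 0.05` for 95% intervals (standard errors from the block above):

```python
from scipy import stats

alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)  # t_{alpha/2, n-2}

ci_beta0 = (beta0_hat - t_crit * se_beta0, beta0_hat + t_crit * se_beta0)
ci_beta1 = (beta1_hat - t_crit * se_beta1, beta1_hat + t_crit * se_beta1)
```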

Hypothesis Testing for β0 and β1

  • One use of linear regression is to test whether or not there is a significant relationship between two variables.
    • Does strength $y$ depend on cement content $x$?
  • If there is no relationship between strength and cement content:
    • $H_0: \beta_1 = 0$
    • $H_1: \beta_1 \neq 0$
  • $t\text{-score} = \frac{\hat{\beta}_1 - \beta_1}{s_{\hat{\beta}_1}}$
    • $p\text{-value} = P(t < -|t\text{-score}|) + P(t > |t\text{-score}|)$
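Continuing the sketch, the test of $H_0: \beta_1 = 0$ uses the t-score and the t distribution with $n - 2$ degrees of freedom:

```python
t_score = (beta1_hat - 0.0) / se_beta1            # test statistic under H0: beta1 = 0
p_value = 2 * stats.t.sf(abs(t_score), df=n - 2)  # two-sided p-value

reject_h0 = p_value < alpha  # reject H0 at significance level alpha
```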

Assumptions For Least-Squares

  • Everything we discussed (variances of $\hat{\beta}_0$ and $\hat{\beta}_1$, confidence intervals, etc.) is only appropriate if the model assumptions are satisfied: the LINE conditions
    • Linear relationship between $X$ and $Y$
    • Independent error terms $\varepsilon_i$ (and therefore, independent observations)
    • Normally distributed error terms $\varepsilon_i$
    • Equal variance $\sigma^2$ of $\varepsilon_i$ along the regression line
  • If the assumptions are satisfied, the residuals, $e_i$, should reflect these properties

Checking Assumptions

  • Linear Relationship: Plot the residuals $e_i$ against $x_i$ (or against the estimated values $\hat{y}_i$).
  • Independent Errors: Plot the residuals $e_i$ against $x_i$ (or against the estimated values $\hat{y}_i$).
  • Normally Distributed Errors $\varepsilon_i$: Plot a histogram or Q-Q plot of the residuals.
  • Equal Variance: Plot the residuals $e_i$ against $x_i$ (or against the estimated values $\hat{y}_i$).
  • Outlier Check: Plot the residuals $e_i$ against $x_i$ (or against the estimated values $\hat{y}_i$).
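A minimal matplotlib/SciPy sketch of these diagnostics, using the residuals and fitted values from the running example:

```python
import matplotlib.pyplot as plt
from scipy import stats

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))

# Residuals vs. fitted values: checks linearity, equal variance, and outliers
axes[0].scatter(y_hat, resid)
axes[0].axhline(0.0, color="gray", linestyle="--")
axes[0].set(xlabel="fitted values", ylabel="residuals", title="Residuals vs. fitted")

# Histogram of residuals: rough check of normality
axes[1].hist(resid, bins="auto")
axes[1].set(xlabel="residuals", title="Histogram of residuals")

# Q-Q plot of residuals against the normal distribution
stats.probplot(resid, dist="norm", plot=axes[2])

plt.tight_layout()
plt.show()
```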

Remedial Measures

  • If a simple linear regression model is not appropriate for the data:
    • Use a more appropriate model
    • Apply transformation on the data and then use linear regression
    • Unequal variances and nonnormality of the errors frequently occur together.
      • To remedy these violations, we need a transformation on $Y$.
      • Such a transformation may also help to linearize a curvilinear relation at the same time.
      • At other times, a simultaneous transformation on $X$ along with $Y$ may be needed.