# Linear Regression Notes

## Linear Regression

Used to predict a variable $y$ based on the value of another variable $x$.

- $x$: experimenter controls.
- $y$: experimenter measures.

### Learning Objectives

- Explain the motivation and practical applications of linear regression in data analysis.
- Define the residual as the difference between the observed and estimated values.
- Calculate the optimal estimates of the slope and intercept from given $x$ and $y$ data.
- Calculate and understand the relationship between SST, SSE, and SSR.
- Calculate and interpret the coefficient of determination and the correlation coefficient.
- Discuss possible explanations for the correlation between two variables.
- Understand how linear regression can be used to estimate nonlinear relationships.
- Calculate the MSE and the uncertainty in the estimated slope and intercept.
- Conduct a hypothesis test on the slope or intercept using confidence intervals.
- List and assess the assumptions of least-squares regression using graphical techniques.

## Linear Regression Model

Basic regression model that is linear and includes one predictor:
$$Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$$

where:

- $\beta_0$ is the intercept.
- $\beta_1$ is the slope.
- $\varepsilon_i \sim N(0, \sigma)$.

## Optimal Estimates for β0 and β1

Goal: find optimal estimates for $\beta_0$ and $\beta_1$ so that the estimated line is closer to the data than any other line.

Estimated line: $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$

### Method of Ordinary Least Squares (OLS)

Used to find the optimal $\beta_0$ and $\beta_1$. The measure of closeness is the error in each prediction: the difference between the observed value $y_i$ and the predicted value $\hat{y}_i$.

- Residuals: $e_i = y_i - \hat{y}_i$
- Square the residuals to avoid cancellation.
- Overall measure of closeness: the Sum of Squared Errors (SSE).
- Minimize $SSE$ to find the line closest to the data.
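The residual/SSE idea can be illustrated numerically; the data and candidate line below are made up for the sketch:

```python
# Toy data (hypothetical): x is controlled, y is measured.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]

# A candidate line y-hat = b0 + b1 * x (values chosen for illustration).
b0, b1 = 0.2, 1.9

# Residuals e_i = y_i - y-hat_i, squared and summed to get SSE.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
sse = sum(e ** 2 for e in residuals)  # ≈ 0.1 for this candidate line
```

OLS searches over all candidate $(b_0, b_1)$ pairs for the one that makes this sum as small as possible.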
The minimizing estimates are:

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$

$$\hat{\beta}_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}$$

Alternative formulas:

$$S_{XX} = \sum_i x_i^2 - n\bar{x}^2 \qquad S_{XY} = \sum_i x_i y_i - n\bar{x}\bar{y} \qquad \hat{\beta}_1 = \frac{S_{XY}}{S_{XX}}$$

### Method of Ordinary Least Squares (OLS): Summary

Given observations $x_1, y_1, \dots, x_n, y_n$, find optimal estimates for $\beta_0$ and $\beta_1$ so that the estimated line is closer to the data than any other line.

- For each $x_i$, the estimated value is $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$.
- This line has the lowest SSE for the given data.
- The quantities $e_i = y_i - \hat{y}_i$ are called residuals.

$$SSE = \sum_i e_i^2 = \sum_i (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2$$

## Quantifying Goodness-of-Fit

- $SSE$ (Sum of Squared Errors): remaining variation after accounting for the relationship between $X$ and $Y$.
- $SST$ (Total Sum of Squares): total variation if the relationship between $X$ and $Y$ is not accounted for.
- $SSR$ (Regression Sum of Squares): variation accounted for by the relationship between $X$ and $Y$.

$$SST = \sum_i (y_i - \bar{y})^2 \qquad SSE = \sum_i (y_i - \hat{y}_i)^2 \qquad SSR = \sum_i (\hat{y}_i - \bar{y})^2$$

### Coefficient of Determination

$r^2$: proportion of the total variation (SST) that is explained by the model (SSR).

$$r^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$$

Properties of $r^2$:

- $0 \le r^2 \le 1$
- Unitless.
- Larger values suggest a better fit.

### Coefficient of Determination vs. Correlation

The square root of the coefficient of determination is the sample correlation coefficient: $r = \pm\sqrt{r^2}$, where the sign matches the sign of the slope.

The notation is meaningful: $r$ is an estimate of $\rho$ (just like $s$ is an estimate of $\sigma$).

- $r$ is based on the observations $x_1, y_1, \dots, x_n, y_n$, which are a sample.
- $\rho$ is based on the random variables $X$ and $Y$, which are the population:

$$\rho = \frac{\operatorname{Cov}(X, Y)}{\sigma_X \sigma_Y}$$
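The OLS estimates and goodness-of-fit quantities above can be computed directly from the formulas; a sketch with made-up data:

```python
import math

# Hypothetical sample (x_i, y_i).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]

n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n

# S_XX = sum x_i^2 - n*xbar^2,  S_XY = sum x_i*y_i - n*xbar*ybar
sxx = sum(xi ** 2 for xi in x) - n * xbar ** 2
sxy = sum(xi * yi for xi, yi in zip(x, y)) - n * xbar * ybar

b1 = sxy / sxx         # slope estimate beta1-hat
b0 = ybar - b1 * xbar  # intercept estimate beta0-hat

# Decompose the variation: SST = SSE + SSR for the OLS line.
yhat = [b0 + b1 * xi for xi in x]
sst = sum((yi - ybar) ** 2 for yi in y)
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))
ssr = sum((yh - ybar) ** 2 for yh in yhat)

r2 = ssr / sst                         # coefficient of determination
r = math.copysign(math.sqrt(r2), b1)   # correlation takes the slope's sign
```

For this toy data the identity $SST = SSE + SSR$ holds (it always does for the least-squares line), and $r^2$ is close to 1 because the points lie near a line.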
## Correlation vs. Causation

Correlation does not imply causation. Possible explanations for a correlation between $X$ and $Y$:

- $X$ caused $Y$.
- $Y$ caused $X$.
- $X$ and $Y$ impact each other.
- A common third factor $Z$ influences $X$ and $Y$ simultaneously.
- Just a coincidence due to sampling variation.

## Correlation & Outliers

- Correlation can be significantly affected by outliers.
- Some outliers are caused by data-recording errors: these outliers can be corrected or deleted.
- Deleting outliers without justification is not appropriate.

## Nonlinear Models

Correlation measures linear association. Lack of correlation suggests a lack of linear association (there could still be a nonlinear association).

If we suspect a nonlinear relationship exists, then in some cases we can still use linear regression: apply linear regression to a transformed set of variables. For example, if $Y = \beta_0 + \beta_1 x^2$, regress $Y$ on the new variable $x' = x^2$; the model is linear in $x'$.

## Uncertainty in Regression Estimates

The least-squares values $\hat{\beta}_0$ and $\hat{\beta}_1$ are estimates of the true unknown parameters $\beta_0$ and $\beta_1$.
Treat $\hat{\beta}_0$ and $\hat{\beta}_1$ as random variables:

$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i \qquad \hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$$

Precision: the uncertainty in $\hat{\beta}_0$ and $\hat{\beta}_1$.

### Uncertainty in β̂0 and β̂1

Under the assumptions of linear regression (discussed later), the uncertainty in $\hat{\beta}_0$ and $\hat{\beta}_1$ as a function of $\sigma$ is

$$\operatorname{Var}(\hat{\beta}_0) = \sigma^2 \frac{\sum_i x_i^2}{n S_{XX}} \qquad \operatorname{Var}(\hat{\beta}_1) = \frac{\sigma^2}{S_{XX}}$$

where $S_{XX} = \sum_i x_i^2 - n\bar{x}^2$.

However, $\sigma^2 = \operatorname{Var}(\varepsilon_i)$ is a parameter and hence is typically unknown. Use the residuals $e_i$ to estimate $\sigma^2 = \operatorname{Var}(\varepsilon_i)$; this estimate is also called the Mean Squared Error (MSE).

### Uncertainty in β̂0 and β̂1: Summary

- $\hat{\beta}_0$ and $\hat{\beta}_1$ are estimators of the unknown parameters $\beta_0$ and $\beta_1$.
- The uncertainty in $\hat{\beta}_0$ and $\hat{\beta}_1$ is controlled by the variance of the error: $\operatorname{Var}(\varepsilon_i) = \sigma^2$, where $\varepsilon_i = y_i - \beta_0 - \beta_1 x_i$.
- The unbiased estimator of $\sigma^2$ is:

$$MSE = s^2 = \frac{SSE}{n - 2}$$

Then the uncertainty in $\hat{\beta}_0$ and $\hat{\beta}_1$ can be estimated as:

$$\widehat{\operatorname{Var}}(\hat{\beta}_0) = s^2 \frac{\sum_i x_i^2}{n S_{XX}} \qquad \widehat{\operatorname{Var}}(\hat{\beta}_1) = \frac{s^2}{S_{XX}}$$

## Confidence Intervals for β0 and β1

Two-sided $100(1 - \alpha)\%$ confidence intervals for $\beta_0$ and $\beta_1$ are given by:

$$\hat{\beta}_0 \pm t_{\alpha/2,\, n-2}\, s_{\hat{\beta}_0} \qquad \hat{\beta}_1 \pm t_{\alpha/2,\, n-2}\, s_{\hat{\beta}_1}$$

## Hypothesis Testing for β0 and β1

One use of linear regression is to test whether or not there is a significant relationship between two variables. Does strength $y$ depend on cement content $x$?
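The MSE, standard-error, and confidence-interval formulas above can be sketched in Python (the data are hypothetical; `scipy` supplies the t critical value):

```python
from scipy import stats

# Hypothetical sample.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 2.9, 4.1, 6.3, 7.2, 9.1]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

# OLS estimates via S_XX and S_XY.
sxx = sum(xi ** 2 for xi in x) - n * xbar ** 2
sxy = sum(xi * yi for xi, yi in zip(x, y)) - n * xbar * ybar
b1 = sxy / sxx
b0 = ybar - b1 * xbar

# MSE = s^2 = SSE / (n - 2): unbiased estimate of sigma^2.
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
mse = sse / (n - 2)

# Estimated standard errors of the intercept and slope.
se_b0 = (mse * sum(xi ** 2 for xi in x) / (n * sxx)) ** 0.5
se_b1 = (mse / sxx) ** 0.5

# Two-sided 95% confidence interval for the slope.
alpha = 0.05
tcrit = stats.t.ppf(1 - alpha / 2, df=n - 2)
ci_b1 = (b1 - tcrit * se_b1, b1 + tcrit * se_b1)
```

Note the $n - 2$ degrees of freedom: two parameters ($\beta_0$, $\beta_1$) were estimated from the data.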
If there is no relationship between strength and cement content, then $\beta_1 = 0$:

$$H_0: \beta_1 = 0 \qquad H_1: \beta_1 \neq 0$$

$$t\text{-score} = \frac{\hat{\beta}_1 - \beta_1}{s_{\hat{\beta}_1}}$$

$$p\text{-value} = P(t < -|t\text{-score}|) + P(t > |t\text{-score}|)$$

## Assumptions for Least Squares

Everything we discussed (variances of $\hat{\beta}_0$ and $\hat{\beta}_1$, confidence intervals, etc.) is only appropriate if the model assumptions are satisfied: the LINE conditions.

- **L**inear relationship between $X$ and $Y$.
- **I**ndependent error terms $\varepsilon_i$ (and therefore, independent observations).
- **N**ormally distributed error terms $\varepsilon_i$.
- **E**qual variance $\sigma^2$ of $\varepsilon_i$ along the regression line.

If the assumptions are satisfied, the residuals $e_i$ should reflect these properties.

### Checking Assumptions

- Linear relationship: plot the residuals $e_i$ against $x_i$ (or against the estimated values $\hat{y}_i$).
- Independent errors: plot the residuals $e_i$ against $x_i$ (or against $\hat{y}_i$).
- Normally distributed errors $\varepsilon_i$: plot a histogram or Q-Q plot of the residuals.
- Equal variance: plot the residuals $e_i$ against $x_i$ (or against $\hat{y}_i$).
- Outlier check: plot the residuals $e_i$ against $x_i$ (or against $\hat{y}_i$).
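The slope hypothesis test above ($H_0: \beta_1 = 0$) can be sketched with `scipy`; the data are made up for illustration:

```python
from scipy import stats

# Hypothetical strength (y) vs. cement content (x) data.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 2.9, 4.1, 6.3, 7.2, 9.1]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

# OLS slope and its estimated standard error.
sxx = sum(xi ** 2 for xi in x) - n * xbar ** 2
sxy = sum(xi * yi for xi, yi in zip(x, y)) - n * xbar * ybar
b1 = sxy / sxx
b0 = ybar - b1 * xbar
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
se_b1 = (sse / (n - 2) / sxx) ** 0.5

# t-score under H0: beta_1 = 0, with n - 2 degrees of freedom.
t_score = (b1 - 0) / se_b1

# Two-sided p-value: P(t < -|t-score|) + P(t > |t-score|).
p_value = 2 * stats.t.sf(abs(t_score), df=n - 2)
```

A small p-value (below the chosen $\alpha$) rejects $H_0$, i.e., the data support a relationship between the two variables.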
### Remedies

If a simple linear regression model is not appropriate for the data:

- Use a more appropriate model, or
- Apply a transformation to the data and then use linear regression.

Unequal variances and nonnormality of the errors frequently occur together. To remedy these violations, we need a transformation on $Y$. Such a transformation may at the same time also help to linearize a curvilinear relation. At other times, a simultaneous transformation on $X$ along with $Y$ may be needed.
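A sketch of one such remedy: when the data grow roughly exponentially (variance increasing with the mean), regressing $\log y$ on $x$ both stabilizes the variance and linearizes the relation. The data below are made up to follow $y \approx e^x$:

```python
import math

# Hypothetical data with a curvilinear (roughly exponential) trend.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.7, 7.5, 19.9, 55.0, 148.0]

# Transform Y, then apply ordinary linear regression to (x, log y).
logy = [math.log(yi) for yi in y]
n = len(x)
xbar, lbar = sum(x) / n, sum(logy) / n

sxx = sum(xi ** 2 for xi in x) - n * xbar ** 2
sxy = sum(xi * li for xi, li in zip(x, logy)) - n * xbar * lbar
b1 = sxy / sxx          # slope on the log scale
b0 = lbar - b1 * xbar   # intercept on the log scale

# Back on the original scale the fit is y ≈ exp(b0) * exp(b1 * x).
```

After the transformation, the LINE conditions should be re-checked on the residuals of the transformed model, not the original one.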