Multiple regression

Multiple Linear Regression, Part 1
Outline
  • Multiple linear regression models
  • Least squares estimation
  • Fitted values, residuals, estimate of variance
  • Interpretation of regression coefficients
What are Multiple Linear Regression Models
Deterministic Models (No Errors)
  • Deterministic models describe perfect relationships between variables without errors: $Y = f(X_1, X_2, \ldots, X_p)$

    • Examples:

    • Newton’s second law of motion: $F = m \times a$ (force = mass × acceleration)

    • Ideal gas law: $PV = nRT$ (pressure × volume = amount of gas in moles × ideal gas constant × absolute temperature in K)

Timber Volume of Trees
  • Modeling timber volume of a tree as a function of its radius and height:

    • If the trunk is a cylinder: $volume = \pi r^2 h$, where r = radius, h = height
    • If the trunk is a cone: $volume = \frac{1}{3} \pi r^2 h$
    • Tree trunks are not exactly cylinders or cones, so the formulas are subject to error.
    • Model with error:

$volume = f(r, h) + \epsilon = \alpha r^2 h + \epsilon$, where $\alpha$ is a constant.

Statistical Models
  • A statistical model is a simple, low-dimensional summary of:

    • The relationship present in the data
    • The data-generation process
    • The relationship present in the population
  • Statistical models allow errors (uncertainty): $Y = f(X_1, X_2, \ldots, X_p) + \epsilon$

    • $Y$ = response
    • $f(X_1, X_2, \ldots, X_p)$ = deterministic function
    • $\epsilon$ = error (noise)
Linear Regression Models
  • Focus on linear regression models: $Y = f(X_1, X_2, \ldots, X_p) + \epsilon = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \epsilon$
    • Linearity means the model is linear in its parameters $\beta_0, \beta_1, \ldots, \beta_p$.
    • Examples of linear regression models:
    • $Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \epsilon$
    • $Y = \beta_0 + \beta_1 \log(X) + \epsilon$
    • These are linear models even though the relationship between Y and X is not linear.
Some Non-linear Models Can Be Turned Linear (1)
  • Ex 1: Reciprocal Transformation

    • Non-linear model: $Y = \frac{X}{\alpha X + \beta}$

    • Transformed:

$\frac{1}{Y} = \alpha + \beta\left(\frac{1}{X}\right)$

  • Linear model: $Y' = \alpha + \beta X'$, where $Y' = \frac{1}{Y}$, $X' = \frac{1}{X}$.
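A quick numerical check of the reciprocal transformation (in Python with NumPy rather than R, on made-up noiseless data): fitting a line to $Y' = 1/Y$ versus $X' = 1/X$ recovers $\alpha$ as the intercept and $\beta$ as the slope.

```python
import numpy as np

# Made-up example: generate data from Y = X / (alpha*X + beta), no noise.
alpha, beta = 2.0, 3.0
X = np.linspace(1.0, 10.0, 20)
Y = X / (alpha * X + beta)

Xp = 1.0 / X   # X' = 1/X
Yp = 1.0 / Y   # Y' = 1/Y = alpha + beta*(1/X)

# Least squares line of Y' on X'; np.polyfit returns (slope, intercept).
slope, intercept = np.polyfit(Xp, Yp, 1)
print(intercept, slope)  # approximately 2.0 and 3.0
```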
Some Non-linear Models Can Be Turned Linear (2)
  • Ex 2: Timber volume of trees $\approx c r^2 h$, or more generally, $\alpha r^{\beta_1} h^{\beta_2}$

    • Non-linear model: $Volume = \alpha \times r^{\beta_1} \times h^{\beta_2}$
    • Taking logarithms: $\log(Volume) = \log(\alpha) + \beta_1 \log(r) + \beta_2 \log(h)$
    • Linear model: $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2$, where $Y = \log(Volume)$, $X_1 = \log(r) = \log(radius)$, and $X_2 = \log(h) = \log(height)$.
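As a sketch of this log-linearization (in Python with NumPy, on made-up noiseless data; the constants 0.3, 2.0, 1.0 are illustrative), regressing $\log(Volume)$ on $\log(r)$ and $\log(h)$ recovers $\log(\alpha)$, $\beta_1$, and $\beta_2$:

```python
import numpy as np

# Made-up data from Volume = alpha * r^b1 * h^b2 (no noise, for simplicity).
rng = np.random.default_rng(0)
alpha, b1, b2 = 0.3, 2.0, 1.0
r = rng.uniform(0.2, 1.0, 50)    # radius
h = rng.uniform(5.0, 20.0, 50)   # height
vol = alpha * r**b1 * h**b2

# Design matrix for Y = beta0 + beta1*X1 + beta2*X2 with X1 = log r, X2 = log h.
X = np.column_stack([np.ones(50), np.log(r), np.log(h)])
Y = np.log(vol)
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(beta_hat)  # approximately [log(0.3), 2.0, 1.0]
```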
Some Non-linear Models Can Be Turned Linear (3)
  • Ex 3: Cobb-Douglas Production Function

    • In economics, the Cobb-Douglas production function is:

$V = \alpha K^{\beta_1} L^{\beta_2}$, where K = capital, L = labor

  • Represents the relationship between inputs (capital K and labor L) and output V.
  • Linear transformation:

$\log(V) = \log(\alpha) + \beta_1 \log(K) + \beta_2 \log(L)$.

Identifying Linear Models
  • Determining which of the following models are linear:

    • (a) $Y = \beta_0 + \beta X_1 + \epsilon$ (Linear)
    • (b) $Y = \beta_0 \beta^{X_1} \epsilon$ (Non-linear)
    • (c) $Y = \beta_0 + \beta_1 e^X + \epsilon$ (Linear)
    • (d) $Y = \beta_0 + \beta_1 X^2 + \beta_2 \log(X) + \epsilon$ (Linear)
Identifying Transformable Models
  • Determining which models can be turned linear after transformation:

    • (a) $Y = \beta_0 + \beta^{X_1} + \epsilon$
    • Not transformable
    • (b) $Y = \beta_0 \beta^{X_1} \epsilon$
    • Transformable
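To see why model (b) is transformable while (a) is not, take logarithms of both sides of (b); the product becomes a sum:

```latex
Y = \beta_0\,\beta^{X_1}\,\epsilon
\quad\Longrightarrow\quad
\log Y = \log\beta_0 + (\log\beta)\,X_1 + \log\epsilon
```

This is linear in the new parameters $\log\beta_0$ and $\log\beta$, with error $\log\epsilon$; the additive error in (a) blocks any such transformation.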
Data for Multiple Linear Regression Models
            SLR                  MLR
            X     Y              X1   X2   ...  Xp    Y
case 1:     x1    y1             x11  x12  ...  x1p   y1
case 2:     x2    y2             x21  x22  ...  x2p   y2
  ...
case n:     xn    yn             xn1  xn2  ...  xnp   yn
  • SLR observes pairs of data values.
  • MLR observes rows of data values.
  • Each row (or pair) is a case, a record, or a data point.
  • $y_i$ is the response (dependent variable) of the $i^{th}$ case.
  • There are p explanatory variables (predictors, covariates), and $x_{ik}$ is the value of the explanatory variable $X_k$ for the $i^{th}$ case.
Multiple Linear Regression Models

$y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \epsilon_i$

  • where:

    • $\epsilon_i$’s (errors, or noise) are i.i.d. $N(0, \sigma^2)$
    • Parameters include:
    • $\beta_0$ = intercept;
    • $\beta_k$ = regression coefficient (slope) for the $k^{th}$ explanatory variable, k = 1, . . . , p
    • $\sigma^2$ = Var($\epsilon_i$) = the variance of the errors
    • Observed (known): $y_i, x_{i1}, x_{i2}, \ldots, x_{ip}$
    • Unknown: $\beta_0, \beta_1, \ldots, \beta_p, \sigma^2, \epsilon_i$’s
    • Random: $\epsilon_i$’s, $y_i$’s
    • Constants (not random): $\beta_k$’s, $\sigma^2$, $x_{ik}$’s
Multiple Linear Regression Models in Matrix Notation
  • Matrix notation:

$Y_{n \times 1} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}$,

$X_{n \times (p+1)} = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 1 & x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{np} \end{bmatrix}$,

$\beta_{(p+1) \times 1} = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \vdots \\ \beta_p \end{bmatrix}$, $\epsilon_{n \times 1} = \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix}$

  • Model: $Y = X\beta + \epsilon$
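The matrix form can be sketched in a few lines (Python with NumPy, made-up numbers: n = 4 cases, p = 2 predictors, illustrative coefficients):

```python
import numpy as np

# Made-up data for the matrix form Y = X*beta + eps.
n, p = 4, 2
rng = np.random.default_rng(1)
predictors = rng.normal(size=(n, p))

# X gets a leading column of 1's for the intercept, so it is n x (p+1).
X = np.column_stack([np.ones(n), predictors])
beta = np.array([1.0, 2.0, -0.5])        # (beta0, beta1, beta2)
eps = rng.normal(scale=0.1, size=n)
Y = X @ beta + eps                       # the whole model in one line

print(X.shape, Y.shape)  # (4, 3) (4,)
```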
Least Squares Estimation
Fitting the Model
Least Squares Method (SLR)
  • The least squares estimate $(\hat{\beta}_0, \hat{\beta}_1)$ for $(\beta_0, \beta_1)$ is the intercept and slope of the straight line with the minimum sum of squared vertical distances to the data points:

$\sum_{i=1}^{n}(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2$

Least Squares Method (MLR)
  • The least squares estimate $(\hat{\beta}_0, \ldots, \hat{\beta}_p)$ for $(\beta_0, \ldots, \beta_p)$ is the intercept and slopes of the (hyper)plane with the minimum sum of squared vertical distances to the data points:

$\sum_{i=1}^{n}(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \cdots - \hat{\beta}_p x_{ip})^2$

The “Hat” Notation:
  • Differentiate:

    • Estimated coefficient $\hat{\beta}_j$ from
    • The actual unknown coefficient $\beta_j$
Least Squares Problem for SLR
  • Minimize $L(\hat{\beta}_0, \hat{\beta}_1) = \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2$
  • Set derivatives to 0:

$\frac{\partial L}{\partial \hat{\beta}_0} = -2\sum_{i=1}^{n}(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0$

$\frac{\partial L}{\partial \hat{\beta}_1} = -2\sum_{i=1}^{n} x_i(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0$

  • This results in the 2 equations below in 2 unknowns $\hat{\beta}_0$ and $\hat{\beta}_1$.

$n\hat{\beta}_0 + \hat{\beta}_1\sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i$

$\hat{\beta}_0\sum_{i=1}^{n} x_i + \hat{\beta}_1\sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} x_i y_i$

Solving for Estimates
  • From the normal equations:

$n\hat{\beta}_0 + \hat{\beta}_1 \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i$

$\hat{\beta}_0 \sum_{i=1}^{n} x_i + \hat{\beta}_1 \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} x_i y_i$

  • Rewrite using means:

$n\hat{\beta}_0 + \hat{\beta}_1 n\bar{x} = n\bar{y}$

$\hat{\beta}_0 n\bar{x} + \hat{\beta}_1 \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} x_i y_i$

  • Divide the first equation by n:

$\hat{\beta}_0 + \hat{\beta}_1 \bar{x} = \bar{y} \implies \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$

  • Substitute into the second equation:

$(\bar{y} - \hat{\beta}_1 \bar{x})n\bar{x} + \hat{\beta}_1 \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} x_i y_i$

  • Solve for $\hat{\beta}_1$: $\hat{\beta}_1 \left(\sum_{i=1}^{n} x_i^2 - n\bar{x}^2\right) = \sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}$

$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}$

Homework
  • Show that

$\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n}(x_i - \bar{x})y_i = \sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}$.

  • Show that

$\sum_{i=1}^{n}(x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - n\bar{x}^2$

  • Hence, there are 3 formulae for the LS estimate of the slope:
    • $\hat{\beta}_1 = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})y_i}{\sum_{i=1}^{n}(x_i - \bar{x})^2}$
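The three formulae can be checked numerically (Python with NumPy, on made-up data; the true intercept 1.5 and slope 2.0 are illustrative):

```python
import numpy as np

# Made-up SLR data.
rng = np.random.default_rng(2)
x = rng.normal(size=30)
y = 1.5 + 2.0 * x + rng.normal(size=30)
n, xbar, ybar = len(x), x.mean(), y.mean()

# The three formulae for the LS slope estimate.
b1_a = (np.sum(x * y) - n * xbar * ybar) / (np.sum(x**2) - n * xbar**2)
b1_b = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar)**2)
b1_c = np.sum((x - xbar) * y) / np.sum((x - xbar)**2)

print(np.allclose(b1_a, b1_b), np.allclose(b1_b, b1_c))  # True True
```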
Least Squares Problem for MLR
  • To find the $(\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_p)$ that minimize $L(\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_p) = \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \cdots - \hat{\beta}_p x_{ip})^2$
  • Set the derivatives of L with respect to each $\hat{\beta}_j$ to 0:

$\frac{\partial L}{\partial \hat{\beta}_0} = -2\sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \cdots - \hat{\beta}_p x_{ip})$

$\frac{\partial L}{\partial \hat{\beta}_k} = -2\sum_{i=1}^{n} x_{ik}(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \cdots - \hat{\beta}_p x_{ip}), \quad k = 1, 2, \ldots, p$

  • This results in a system of $(p + 1)$ equations in $(p + 1)$ unknowns (the normal equations).
Least Squares Problem for MLR: The Normal Equations
  • The least squares estimate $(\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_p)$ is the solution to the following system of equations, called the normal equations.

$\hat{\beta}_0 \cdot n + \hat{\beta}_1 \sum_{i=1}^{n} x_{i1} + \cdots + \hat{\beta}_p \sum_{i=1}^{n} x_{ip} = \sum_{i=1}^{n} y_i$

$\hat{\beta}_0 \sum_{i=1}^{n} x_{i1} + \hat{\beta}_1 \sum_{i=1}^{n} x_{i1}^2 + \cdots + \hat{\beta}_p \sum_{i=1}^{n} x_{i1} x_{ip} = \sum_{i=1}^{n} x_{i1} y_i$

$\vdots$

$\hat{\beta}_0 \sum_{i=1}^{n} x_{ik} + \hat{\beta}_1 \sum_{i=1}^{n} x_{ik} x_{i1} + \cdots + \hat{\beta}_p \sum_{i=1}^{n} x_{ik} x_{ip} = \sum_{i=1}^{n} x_{ik} y_i$

$\vdots$

$\hat{\beta}_0 \sum_{i=1}^{n} x_{ip} + \hat{\beta}_1 \sum_{i=1}^{n} x_{ip} x_{i1} + \cdots + \hat{\beta}_p \sum_{i=1}^{n} x_{ip}^2 = \sum_{i=1}^{n} x_{ip} y_i$

  • In matrix notation, the normal equations are $(X^T X)\hat{\beta} = X^T Y$, and the least squares estimate is $\hat{\beta} = (X^T X)^{-1}X^T Y$
  • Don’t worry about solving the equations by hand. R and other software can do the computation for us.
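As an illustration of the matrix solution (Python with NumPy, made-up data; in practice a QR-based solver such as `lstsq` is preferred over forming $X^TX$):

```python
import numpy as np

# Made-up MLR data: n = 50 cases, p = 3 predictors.
rng = np.random.default_rng(3)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
Y = X @ np.array([1.0, 0.5, -2.0, 3.0]) + rng.normal(size=n)

# Solve the normal equations (X^T X) beta_hat = X^T Y directly ...
beta_normal = np.linalg.solve(X.T @ X, X.T @ Y)
# ... and compare with a numerically stable least squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

print(np.allclose(beta_normal, beta_lstsq))  # True
```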
Parameters vs. Estimates
  • $\beta_i$’s are the coefficients of the MLR model, and $\hat{\beta}_i$’s are the estimates of the $\beta_i$’s.
  • For the SLR model:
    • $y = \beta_0 + \beta_1 x$ is the least squares line for the population.
    • $y = \hat{\beta}_0 + \hat{\beta}_1 x$ is the least squares line for a sample.
  • Population:
    • $y = \beta_0 + \beta_1 x$: least squares regression line of the population; fixed, unknown, the object of interest.
  • Sample:
    • $y = \hat{\beta}_0 + \hat{\beta}_1 x$: least squares regression line of the sample; random, changes from sample to sample, can be calculated from the sample.
Fitted Values, Residuals, Estimate of $\sigma^2$
Fitted Values
  • The fitted value or predicted value:

$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \cdots + \hat{\beta}_p x_{ip}$

  • Again, the “hat” notation is used.
    • $\hat{y}_i$ is the fitted value
    • $y_i$ is the actual observed value
Errors and Residuals
  • Errors cannot be directly computed:

$\epsilon_i = y_i - \beta_0 - \beta_1 x_{i1} - \cdots - \beta_p x_{ip}$

    since the coefficients $\beta_0, \beta_1, \ldots, \beta_p$ are unknown.

  • Errors are estimated by residuals:

$e_i = y_i - \hat{y}_i = y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \cdots + \hat{\beta}_p x_{ip})$

  • $e_i \approx \epsilon_i$ in general, since $\hat{\beta}_j \approx \beta_j$
Properties of Residuals
  • The LS estimate $(\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_p)$ satisfies the equations
    $\sum_{i=1}^{n}(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \cdots - \hat{\beta}_p x_{ip}) = \sum_{i=1}^{n} e_i = 0$
    and
    $\sum_{i=1}^{n} x_{ik}(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \cdots - \hat{\beta}_p x_{ip}) = 0, \quad k = 1, 2, \ldots, p$
  • The residuals $e_i$ hence have the properties
    $\sum_{i=1}^{n} e_i = 0$ (residuals add up to 0)
    $\sum_{i=1}^{n} x_{ik} e_i = 0, \quad k = 1, 2, \ldots, p$ (residuals are orthogonal to the predictors)
  • The two properties combined imply that the residuals have 0 correlation with each of the p predictors, since
  • $Cov(X_k, e) = \frac{1}{n-1} \left[\sum_{i=1}^{n} x_{ik} e_i - n\bar{x}_k \bar{e}\right] = 0$
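Both properties can be verified numerically (Python with NumPy, made-up data with illustrative coefficients):

```python
import numpy as np

# Made-up MLR data: n = 40 cases, p = 2 predictors.
rng = np.random.default_rng(4)
n, p = 40, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
Y = X @ np.array([2.0, 1.0, -1.0]) + rng.normal(size=n)

beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
e = Y - X @ beta_hat                      # residuals

print(np.isclose(e.sum(), 0.0))           # residuals add up to 0
print(np.allclose(X[:, 1:].T @ e, 0.0))   # orthogonal to each predictor
```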
Mean Square Error (MSE) — Estimate of $\sigma^2$
  • The variance $\sigma^2$ of the errors $\epsilon_i$’s is estimated by the mean square error (MSE), the sum of squared residuals divided by $n − p − 1$.

  • $MSE = \frac{\sum_{i=1}^{n} e_i^2}{n - p - 1} = \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{n - p - 1}$

  • Why divide by $n − p − 1$ instead of by n?

  • A simple reason is that it takes at least $p + 1$ observations to estimate $\beta_0, \beta_1, \ldots, \beta_p$.
    We need at least $p + 2$ observations to get non-zero residuals to determine the variability of the estimate.

  • We will show (in the next lecture) that MSE is an unbiased estimator for $\sigma^2$.
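A minimal sketch of the MSE computation (Python with NumPy, made-up data where the true error standard deviation is 0.5, so $\sigma^2 = 0.25$):

```python
import numpy as np

# Made-up MLR data: n = 100 cases, p = 2 predictors, sigma = 0.5.
rng = np.random.default_rng(5)
n, p = 100, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
Y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.5, size=n)

beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
e = Y - X @ beta_hat
mse = np.sum(e**2) / (n - p - 1)   # estimates sigma^2 = 0.25
print(mse)
```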

Example: The Auto Data
  • The Auto data set contains 9 variables on 392 car models from the 1980s.
  • The variables include
    • acceleration: Time to accelerate from 0 to 60 mph (in seconds)
    • horsepower: Engine horsepower
    • weight: Vehicle weight (lbs.)
How to Do Regression in R?
  • R code:
lm(acceleration ~ weight + horsepower, data=Auto)

Output:

Call:
lm(formula = acceleration ~ weight + horsepower, data = Auto)

Coefficients:
(Intercept)       weight   horsepower
   18.4358       0.0023      -0.0933
  • The lm() command above asks R to fit the model $acceleration = \beta_0 + \beta_1\,weight + \beta_2\,horsepower + \epsilon$
  • R gives us the regression equation

acceleration = 18.4358 + 0.0023 weight − 0.0933 horsepower
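The fitted equation can be used for prediction; for instance (computed here in Python for illustration, with a hypothetical car weighing 3000 lbs with 100 horsepower):

```python
# Coefficients from the fitted regression equation above.
b0, b_weight, b_hp = 18.4358, 0.0023, -0.0933

# Predicted time (seconds) to accelerate from 0 to 60 mph.
pred = b0 + b_weight * 3000 + b_hp * 100
print(round(pred, 4))  # 16.0058, i.e. about 16 seconds
```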

More R Commands
lm1 = lm(acceleration ~ weight + horsepower, data=Auto)
lm1$coef  # show the estimated beta's

Output:

(Intercept)       weight   horsepower
  18.435791     0.002302    -0.093313
lm1$fit # show the fitted values
lm1$res # show the residuals
plot(lm1$fit,lm1$res, xlab="Fitted Values", ylab="Residuals")

Interpretation of Regression Coefficients

Interpretation of the Intercept $\beta_0$
  • $\beta_0$ = intercept = the mean value of Y when all $X_j$’s are 0.
    • May have no practical meaning
    • e.g., $\beta_0$ is meaningless in the Auto model, as no car has 0 weight
Interpretation of the Regression Coefficient $\beta_j$
  • $\beta_j$, the regression coefficient for $X_j$, is the mean change in the response Y when $X_j$ is increased by one unit, holding the other $X_i$’s constant.

    • Also called partial regression coefficients because they are adjusted for the other covariates
    • The interpretation of $\beta_j$ depends on the presence of other predictors in the model
  • e.g., the two $\beta_1$’s in the two models below have different interpretations

    • Model 1: $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \epsilon$
    • Model 2: $Y = \beta_0 + \beta_1 X_1 + \epsilon$
Something Wrong?
lm(acceleration ~ weight, data=Auto)$coef

Output:

(Intercept)       weight
19.572666     -0.001354
lm(acceleration ~ weight + horsepower, data=Auto)$coef

Output:

(Intercept)       weight   horsepower
18.435791     0.002302      -0.093313
  • The coefficient $\hat{\beta}_1$ for weight is negative in the first model (SLR) but positive in the second model (MLR).
  • Do heavier cars require more or less time to accelerate from 0 to 60 mph?
Effect of weight Not Controlling for Other Predictors
library(ggplot2)
ggplot(Auto, aes(x=weight, y=acceleration)) + geom_point()
  • From the scatter plot above, are weight and acceleration positively or negatively associated?
  • Do heavier vehicles generally require more or less time to accelerate from 0 to 60 mph?
  • Is that reasonable?
Effect of weight Controlling for horsepower (1)
ggplot(Auto, aes(x=weight, y=acceleration, col=horsepower)) + geom_point()
ggplot(Auto, aes(x=weight, y=acceleration, col=horsepower)) + geom_point() + scale_color_gradientn(colours = rainbow(5))
Effect of weight Controlling for horsepower (2)
  • Consider car models of similar horsepower (similar color), are weight and acceleration positively or negatively correlated?
ggplot(Auto, aes(x=weight, y=acceleration, col=horsepower)) + geom_point() + scale_color_gradientn(colours = rainbow(5))
Effect of weight Controlling for horsepower (3)
R code for the plot on the previous page:
Auto$hp = cut(Auto$horsepower, breaks=c(45, 70, 80, 90, 100, 110, 130, 150, 230),
              labels=c("hp <= 70", "70 < hp <= 80", "80 < hp <= 90", "90 < hp <= 100", "100 < hp <= 110", "110 < hp <= 130", "130 < hp <= 150", "hp > 150"))
ggplot(Auto, aes(x=weight, y=acceleration, col=horsepower)) + geom_point() + scale_color_gradientn(colours = rainbow(5)) + facet_wrap(~hp, nrow=2) + theme(legend.position="top")

Example: Auto Data — Simpson’s Paradox

  • Why is the association between acceleration and weight flipped from positive to negative when horsepower is ignored?

  • Heavier vehicles (purple dots) tend to have more horsepower while lighter ones (red dots) tend to have less

  • Vehicles with more horsepower (purple dots) require less time to accelerate while those with less (red dots) require more

  • Hence, when ignoring horsepower, it looks like heavier vehicles require less time to accelerate, though heavier vehicles require more time to accelerate after the effect of horsepower is adjusted (which means considering only vehicles with similar horsepower).
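The same flip can be reproduced with synthetic data (Python with NumPy; all numbers are made up, with x2 playing the role of horsepower and x1 the role of weight): the marginal slope of y on x1 is negative, while the slope adjusted for x2 is positive.

```python
import numpy as np

# Synthetic Simpson's paradox: x1 rises with the confounder x2,
# and x2 pushes y down harder than x1 pushes it up.
rng = np.random.default_rng(6)
n = 500
x2 = rng.normal(size=n)                            # confounder
x1 = 2.0 * x2 + rng.normal(scale=0.3, size=n)      # correlated with x2
y = 1.0 * x1 - 3.0 * x2 + rng.normal(scale=0.3, size=n)

# Marginal (SLR) slope of y on x1:
slope_marginal = np.polyfit(x1, y, 1)[0]
# Adjusted (MLR) slope of y on x1, controlling for x2:
X = np.column_stack([np.ones(n), x1, x2])
slope_adjusted = np.linalg.lstsq(X, y, rcond=None)[0][1]

print(slope_marginal < 0, slope_adjusted > 0)  # True True
```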

What We Mean by “Adjusted for Other Covariates”?
  • For a multiple linear regression model with p predictors

  • $Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \epsilon$

  • $\beta_j$ represents the effect of $X_j$ on the response variable Y after it has been adjusted for all of $X_1, \ldots, X_p$ except $X_j$.

What does “adjusted for” mean?
What We Mean by “Adjusted for Other Covariates” (2)?
  • The LS estimate $\hat{\beta}_j$ for $\beta_j$ in the MLR model $Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \epsilon$ would be identical to the slope of the SLR model computed as follows.
    1. Regress Y on all the other $X_k$’s except $X_j$
    2. Regress $X_j$ on all the other $X_k$’s except $X_j$
    3. Fit an SLR model using the residuals from Step 1 as the response and the residuals from Step 2 as the predictor.
  • Moreover, the intercept obtained in Step 3 would be 0.
  • The proof of this result involves complicated matrix algebra and hence is omitted. We just illustrate with an example.
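The three-step recipe can be checked numerically (Python with NumPy rather than R, on made-up data with two predictors; the coefficients 1.0, 2.0, −1.0 are illustrative):

```python
import numpy as np

# Made-up data: x1 is correlated with x2, y depends on both.
rng = np.random.default_rng(7)
n = 200
x2 = rng.normal(size=n)
x1 = 0.5 * x2 + rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(size=n)

# Full MLR fit of y on (x1, x2):
X = np.column_stack([np.ones(n), x1, x2])
beta_full = np.linalg.lstsq(X, y, rcond=None)[0]

# Step 1: residuals of y on x2.  Step 2: residuals of x1 on x2.
Z = np.column_stack([np.ones(n), x2])
ry = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
rx = x1 - Z @ np.linalg.lstsq(Z, x1, rcond=None)[0]

# Step 3: SLR of ry on rx recovers the MLR slope for x1, with intercept 0.
slope, intercept = np.polyfit(rx, ry, 1)
print(np.isclose(slope, beta_full[1]), np.isclose(intercept, 0.0))  # True True
```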
Example for the Auto Data
  • Recall we have fitted the model

$acceleration = \beta_0 + \beta_1\,weight + \beta_2\,horsepower + \epsilon$

  • and obtained the estimate for $\beta_1$ to be $\hat{\beta}_1 = 0.0023$.
    1. Step 1. Regress acceleration on horsepower. Let RY be the residuals of this model.

```R
RY = lm(acceleration ~