Multiple regression

Multiple Linear Regression, Part 1
Outline
  • Multiple linear regression models
  • Least squares estimation
  • Fitted values, residuals, estimate of variance
  • Interpretation of regression coefficients
What are Multiple Linear Regression Models
Deterministic Models (No Errors)
  • Deterministic models describe perfect relationships between variables without errors: $Y = f(X_1, X_2, \ldots, X_p)$

    • Examples:

    • Newton’s second law of motion: $F = m \times a$ (force = mass × acceleration)

    • Ideal gas law: $PV = nRT$ (pressure × volume = amount of gas in moles × ideal gas constant × absolute temperature in K)

Timber Volume of Trees
  • Modeling timber volume of a tree as a function of its radius and height:

    • If the trunk is a cylinder: $volume = \pi r^2 h$, where r = radius, h = height
    • If the trunk is a cone: $volume = \frac{1}{3} \pi r^2 h$
    • Tree trunks are not exactly cylinders or cones, so the formulas are subject to error.
    • Model with error:

$volume = f(r, h) + \epsilon = \alpha r^2 h + \epsilon$, where $\alpha$ is a constant.

Statistical Models
  • A statistical model is a simple, low-dimensional summary of:

    • The relationship present in the data
    • The data-generation process
    • The relationship present in the population
  • Statistical models allow errors (uncertainty): $Y = f(X_1, X_2, \ldots, X_p) + \epsilon$

    • $Y$ = response
    • $f(X_1, X_2, \ldots, X_p)$ = deterministic function
    • $\epsilon$ = error (noise)
Linear Regression Models
  • Focus on linear regression models: $Y = f(X_1, X_2, \ldots, X_p) + \epsilon = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \epsilon$
    • Linearity means the model is linear in its parameters $\beta_0, \beta_1, \ldots, \beta_p$.
    • Examples of linear regression models:
    • $Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \epsilon$
    • $Y = \beta_0 + \beta_1 \log(X) + \epsilon$
    • These are linear models even though the relationship between Y and X is not linear.
Some Non-linear Models Can Be Turned Linear (1)
  • Ex 1: Reciprocal Transformation

    • Non-linear model: $Y = \frac{X}{\alpha X + \beta}$

    • Transformed:

$\frac{1}{Y} = \alpha + \beta\left(\frac{1}{X}\right)$

  • Linear model: $Y' = \alpha + \beta X'$, where $Y' = \frac{1}{Y}$, $X' = \frac{1}{X}$.
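A quick numerical check of the reciprocal transformation (in Python with NumPy rather than R, on made-up noiseless data): fitting a line to $Y' = 1/Y$ versus $X' = 1/X$ recovers $\alpha$ as the intercept and $\beta$ as the slope.

```python
import numpy as np

# Made-up example: generate data from Y = X / (alpha*X + beta), no noise.
alpha, beta = 2.0, 3.0
X = np.linspace(1.0, 10.0, 20)
Y = X / (alpha * X + beta)

Xp = 1.0 / X   # X' = 1/X
Yp = 1.0 / Y   # Y' = 1/Y = alpha + beta*(1/X)

# Least squares line of Y' on X'; np.polyfit returns (slope, intercept).
slope, intercept = np.polyfit(Xp, Yp, 1)
print(intercept, slope)  # approximately 2.0 and 3.0
```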
Some Non-linear Models Can Be Turned Linear (2)
  • Ex 2: Timber volume of trees $\approx c r^2 h$, or more generally, $\alpha r^{\beta_1} h^{\beta_2}$

    • Non-linear model: $Volume = \alpha \times r^{\beta_1} \times h^{\beta_2}$
    • Taking logarithms: $\log(Volume) = \log(\alpha) + \beta_1 \log(r) + \beta_2 \log(h)$
    • Linear model: $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2$, where $Y = \log(Volume)$, $X_1 = \log(r) = \log(radius)$, and $X_2 = \log(h) = \log(height)$.
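As a sketch of this log-linearization (in Python with NumPy, on made-up noiseless data; the constants 0.3, 2.0, 1.0 are illustrative), regressing $\log(Volume)$ on $\log(r)$ and $\log(h)$ recovers $\log(\alpha)$, $\beta_1$, and $\beta_2$:

```python
import numpy as np

# Made-up data from Volume = alpha * r^b1 * h^b2 (no noise, for simplicity).
rng = np.random.default_rng(0)
alpha, b1, b2 = 0.3, 2.0, 1.0
r = rng.uniform(0.2, 1.0, 50)    # radius
h = rng.uniform(5.0, 20.0, 50)   # height
vol = alpha * r**b1 * h**b2

# Design matrix for Y = beta0 + beta1*X1 + beta2*X2 with X1 = log r, X2 = log h.
X = np.column_stack([np.ones(50), np.log(r), np.log(h)])
Y = np.log(vol)
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(beta_hat)  # approximately [log(0.3), 2.0, 1.0]
```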
Some Non-linear Models Can Be Turned Linear (3)
  • Ex 3: Cobb-Douglas Production Function

    • In economics, the Cobb-Douglas production function is:

$V = \alpha K^{\beta_1} L^{\beta_2}$, where K = capital, L = labor

  • Represents the relationship between inputs (capital K and labor L) and output V.
  • Linear transformation:

$\log(V) = \log(\alpha) + \beta_1 \log(K) + \beta_2 \log(L)$.

Identifying Linear Models
  • Determining which of the following models are linear:

    • (a) $Y = \beta_0 + \beta X_1 + \epsilon$ (Linear)
    • (b) $Y = \beta_0 \beta^{X_1} \epsilon$ (Non-linear)
    • (c) $Y = \beta_0 + \beta_1 e^X + \epsilon$ (Linear)
    • (d) $Y = \beta_0 + \beta_1 X^2 + \beta_2 \log(X) + \epsilon$ (Linear)
Identifying Transformable Models
  • Determining which models can be turned linear after transformation:

    • (a) $Y = \beta_0 + \beta^{X_1} + \epsilon$
    • Not transformable
    • (b) $Y = \beta_0 \beta^{X_1} \epsilon$
    • Transformable
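To see why model (b) is transformable while (a) is not, take logarithms of both sides of (b); the product becomes a sum:

```latex
Y = \beta_0\,\beta^{X_1}\,\epsilon
\quad\Longrightarrow\quad
\log Y = \log\beta_0 + (\log\beta)\,X_1 + \log\epsilon
```

This is linear in the new parameters $\log\beta_0$ and $\log\beta$, with error $\log\epsilon$; the additive error in (a) blocks any such transformation.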
Data for Multiple Linear Regression Models
            SLR                  MLR
            X     Y              X1   X2   ...  Xp    Y
case 1:     x1    y1             x11  x12  ...  x1p   y1
case 2:     x2    y2             x21  x22  ...  x2p   y2
  ...
case n:     xn    yn             xn1  xn2  ...  xnp   yn
  • SLR observes pairs of data values.
  • MLR observes rows of data values.
  • Each row (or pair) is a case, a record, or a data point.
  • $y_i$ is the response (dependent variable) of the $i^{th}$ case.
  • There are p explanatory variables (predictors, covariates), and $x_{ik}$ is the value of the explanatory variable $X_k$ for the $i^{th}$ case.
Multiple Linear Regression Models

$y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \epsilon_i$

  • where:

    • $\epsilon_i$’s (errors, or noise) are i.i.d. $N(0, \sigma^2)$
    • Parameters include:
    • $\beta_0$ = intercept;
    • $\beta_k$ = regression coefficient (slope) for the $k^{th}$ explanatory variable, k = 1, . . . , p
    • $\sigma^2$ = Var($\epsilon_i$) = the variance of the errors
    • Observed (known): $y_i, x_{i1}, x_{i2}, \ldots, x_{ip}$
    • Unknown: $\beta_0, \beta_1, \ldots, \beta_p, \sigma^2, \epsilon_i$’s
    • Random: $\epsilon_i$’s, $y_i$’s
    • Constants (not random): $\beta_k$’s, $\sigma^2$, $x_{ik}$’s
Multiple Linear Regression Models in Matrix Notation
  • Matrix notation:

$Y_{n \times 1} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}$,

$X_{n \times (p+1)} = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 1 & x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{np} \end{bmatrix}$,

$\beta_{(p+1) \times 1} = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \vdots \\ \beta_p \end{bmatrix}$, $\epsilon_{n \times 1} = \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix}$

  • Model: $Y = X\beta + \epsilon$
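The matrix form can be sketched in a few lines (Python with NumPy, made-up numbers: n = 4 cases, p = 2 predictors, illustrative coefficients):

```python
import numpy as np

# Made-up data for the matrix form Y = X*beta + eps.
n, p = 4, 2
rng = np.random.default_rng(1)
predictors = rng.normal(size=(n, p))

# X gets a leading column of 1's for the intercept, so it is n x (p+1).
X = np.column_stack([np.ones(n), predictors])
beta = np.array([1.0, 2.0, -0.5])        # (beta0, beta1, beta2)
eps = rng.normal(scale=0.1, size=n)
Y = X @ beta + eps                       # the whole model in one line

print(X.shape, Y.shape)  # (4, 3) (4,)
```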
Least Squares Estimation
Fitting the Model
Least Squares Method (SLR)
  • The least squares estimate $(\hat{\beta}_0, \hat{\beta}_1)$ for $(\beta_0, \beta_1)$ is the intercept and slope of the straight line with the minimum sum of squared vertical distances to the data points:

$\sum_{i=1}^{n}(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2$

Least Squares Method (MLR)
  • The least squares estimate $(\hat{\beta}_0, \ldots, \hat{\beta}_p)$ for $(\beta_0, \ldots, \beta_p)$ is the intercept and slopes of the (hyper)plane with the minimum sum of squared vertical distances to the data points:

$\sum_{i=1}^{n}(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \cdots - \hat{\beta}_p x_{ip})^2$

The “Hat” Notation:
  • Differentiate:

    • Estimated coefficient $\hat{\beta}_j$ from
    • The actual unknown coefficient $\beta_j$
Least Squares Problem for SLR
  • Minimize $L(\hat{\beta}_0, \hat{\beta}_1) = \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2$
  • Set derivatives to 0:

$\frac{\partial L}{\partial \hat{\beta}_0} = -2\sum_{i=1}^{n}(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0$

$\frac{\partial L}{\partial \hat{\beta}_1} = -2\sum_{i=1}^{n} x_i(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0$

  • This results in the 2 equations below in 2 unknowns $\hat{\beta}_0$ and $\hat{\beta}_1$.

$n\hat{\beta}_0 + \hat{\beta}_1\sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i$

$\hat{\beta}_0\sum_{i=1}^{n} x_i + \hat{\beta}_1\sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} x_i y_i$

Solving for Estimates
  • From the normal equations:

$n\hat{\beta}_0 + \hat{\beta}_1 \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i$

$\hat{\beta}_0 \sum_{i=1}^{n} x_i + \hat{\beta}_1 \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} x_i y_i$

  • Rewrite using means:

$n\hat{\beta}_0 + \hat{\beta}_1 n\bar{x} = n\bar{y}$

$\hat{\beta}_0 n\bar{x} + \hat{\beta}_1 \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} x_i y_i$

  • Divide the first equation by n:

$\hat{\beta}_0 + \hat{\beta}_1 \bar{x} = \bar{y} \implies \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$

  • Substitute into the second equation:

$(\bar{y} - \hat{\beta}_1 \bar{x})n\bar{x} + \hat{\beta}_1 \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} x_i y_i$

  • Solve for $\hat{\beta}_1$: $\hat{\beta}_1 \left(\sum_{i=1}^{n} x_i^2 - n\bar{x}^2\right) = \sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}$

$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}$

Homework
  • Show that

$\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n}(x_i - \bar{x})y_i = \sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}$.

  • Show that

$\sum_{i=1}^{n}(x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - n\bar{x}^2$

  • Hence, there are 3 formulae for the LS estimate of the slope:
    • $\hat{\beta}_1 = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})y_i}{\sum_{i=1}^{n}(x_i - \bar{x})^2}$
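The three formulae can be checked numerically (Python with NumPy, on made-up data; the true intercept 1.5 and slope 2.0 are illustrative):

```python
import numpy as np

# Made-up SLR data.
rng = np.random.default_rng(2)
x = rng.normal(size=30)
y = 1.5 + 2.0 * x + rng.normal(size=30)
n, xbar, ybar = len(x), x.mean(), y.mean()

# The three formulae for the LS slope estimate.
b1_a = (np.sum(x * y) - n * xbar * ybar) / (np.sum(x**2) - n * xbar**2)
b1_b = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar)**2)
b1_c = np.sum((x - xbar) * y) / np.sum((x - xbar)**2)

print(np.allclose(b1_a, b1_b), np.allclose(b1_b, b1_c))  # True True
```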
Least Squares Problem for MLR
  • To find the $(\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_p)$ that minimize $L(\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_p) = \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \cdots - \hat{\beta}_p x_{ip})^2$
  • Set the derivatives of L with respect to each $\hat{\beta}_j$ to 0:

$\frac{\partial L}{\partial \hat{\beta}_0} = -2\sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \cdots - \hat{\beta}_p x_{ip})$

$\frac{\partial L}{\partial \hat{\beta}_k} = -2\sum_{i=1}^{n} x_{ik}(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \cdots - \hat{\beta}_p x_{ip}), \quad k = 1, 2, \ldots, p$

  • This results in a system of $(p + 1)$ equations in $(p + 1)$ unknowns (the normal equations).
Least Squares Problem for MLR: The Normal Equations
  • The least squares estimate $(\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_p)$ is the solution to the following system of equations, called the normal equations.

$\hat{\beta}_0 \cdot n + \hat{\beta}_1 \sum_{i=1}^{n} x_{i1} + \cdots + \hat{\beta}_p \sum_{i=1}^{n} x_{ip} = \sum_{i=1}^{n} y_i$

$\hat{\beta}_0 \sum_{i=1}^{n} x_{i1} + \hat{\beta}_1 \sum_{i=1}^{n} x_{i1}^2 + \cdots + \hat{\beta}_p \sum_{i=1}^{n} x_{i1} x_{ip} = \sum_{i=1}^{n} x_{i1} y_i$

$\vdots$

$\hat{\beta}_0 \sum_{i=1}^{n} x_{ik} + \hat{\beta}_1 \sum_{i=1}^{n} x_{ik} x_{i1} + \cdots + \hat{\beta}_p \sum_{i=1}^{n} x_{ik} x_{ip} = \sum_{i=1}^{n} x_{ik} y_i$

$\vdots$

$\hat{\beta}_0 \sum_{i=1}^{n} x_{ip} + \hat{\beta}_1 \sum_{i=1}^{n} x_{ip} x_{i1} + \cdots + \hat{\beta}_p \sum_{i=1}^{n} x_{ip}^2 = \sum_{i=1}^{n} x_{ip} y_i$

  • In matrix notation, the normal equations are $(X^T X)\hat{\beta} = X^T Y$, and the least squares estimate is $\hat{\beta} = (X^T X)^{-1}X^T Y$
  • Don’t worry about solving the equations by hand. R and other software can do the computation for us.
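As an illustration of the matrix solution (Python with NumPy, made-up data; in practice a QR-based solver such as `lstsq` is preferred over forming $X^TX$):

```python
import numpy as np

# Made-up MLR data: n = 50 cases, p = 3 predictors.
rng = np.random.default_rng(3)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
Y = X @ np.array([1.0, 0.5, -2.0, 3.0]) + rng.normal(size=n)

# Solve the normal equations (X^T X) beta_hat = X^T Y directly ...
beta_normal = np.linalg.solve(X.T @ X, X.T @ Y)
# ... and compare with a numerically stable least squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

print(np.allclose(beta_normal, beta_lstsq))  # True
```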
Parameters vs. Estimates
  • $\beta_i$’s are the coefficients of the MLR model, and $\hat{\beta}_i$’s are the estimates of the $\beta_i$’s.
  • For the SLR model:
    • $y = \beta_0 + \beta_1 x$ is the least squares line for the population.
    • $y = \hat{\beta}_0 + \hat{\beta}_1 x$ is the least squares line for a sample.
  • Population:
    • $y = \beta_0 + \beta_1 x$: least squares regression line of the population; fixed, unknown, the object of interest.
  • Sample:
    • $y = \hat{\beta}_0 + \hat{\beta}_1 x$: least squares regression line of the sample; random, changes from sample to sample, can be calculated from the sample.
Fitted Values, Residuals, Estimate of $\sigma^2$
Fitted Values
  • The fitted value or predicted value:

$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \cdots + \hat{\beta}_p x_{ip}$

  • Again, the “hat” notation is used.
    • $\hat{y}_i$ is the fitted value
    • $y_i$ is the actual observed value
Errors and Residuals
  • Errors cannot be directly computed:

$\epsilon_i = y_i - \beta_0 - \beta_1 x_{i1} - \cdots - \beta_p x_{ip}$

    since the coefficients $\beta_0, \beta_1, \ldots, \beta_p$ are unknown.

  • Errors are estimated by residuals:

$e_i = y_i - \hat{y}_i = y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \cdots + \hat{\beta}_p x_{ip})$

  • $e_i \approx \epsilon_i$ in general, since $\hat{\beta}_j \approx \beta_j$
Properties of Residuals
  • The LS estimate $(\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_p)$ satisfies the equations
    $\sum_{i=1}^{n}(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \cdots - \hat{\beta}_p x_{ip}) = \sum_{i=1}^{n} e_i = 0$
    and
    $\sum_{i=1}^{n} x_{ik}(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \cdots - \hat{\beta}_p x_{ip}) = 0, \quad k = 1, 2, \ldots, p$
  • The residuals $e_i$ hence have the properties
    $\sum_{i=1}^{n} e_i = 0$ (residuals add up to 0)
    $\sum_{i=1}^{n} x_{ik} e_i = 0, \quad k = 1, 2, \ldots, p$ (residuals are orthogonal to the predictors)
  • The two properties combined imply that the residuals have 0 correlation with each of the p predictors, since
  • $Cov(X_k, e) = \frac{1}{n-1} \left[\sum_{i=1}^{n} x_{ik} e_i - n\bar{x}_k \bar{e}\right] = 0$
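Both properties can be verified numerically (Python with NumPy, made-up data with illustrative coefficients):

```python
import numpy as np

# Made-up MLR data: n = 40 cases, p = 2 predictors.
rng = np.random.default_rng(4)
n, p = 40, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
Y = X @ np.array([2.0, 1.0, -1.0]) + rng.normal(size=n)

beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
e = Y - X @ beta_hat                      # residuals

print(np.isclose(e.sum(), 0.0))           # residuals add up to 0
print(np.allclose(X[:, 1:].T @ e, 0.0))   # orthogonal to each predictor
```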
Mean Square Error (MSE) — Estimate of $\sigma^2$
  • The variance $\sigma^2$ of the errors $\epsilon_i$’s is estimated by the mean square error (MSE), the sum of squared residuals divided by $n − p − 1$.

  • $MSE = \frac{\sum_{i=1}^{n} e_i^2}{n - p - 1} = \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{n - p - 1}$

  • Why divide by $n − p − 1$ instead of by n?

  • A simple reason is that it takes at least $p + 1$ observations to estimate $\beta_0, \beta_1, \ldots, \beta_p$.
    We need at least $p + 2$ observations to get non-zero residuals to determine the variability of the estimate.

  • We will show (in the next lecture) that MSE is an unbiased estimator for $\sigma^2$.
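A minimal sketch of the MSE computation (Python with NumPy, made-up data where the true error standard deviation is 0.5, so $\sigma^2 = 0.25$):

```python
import numpy as np

# Made-up MLR data: n = 100 cases, p = 2 predictors, sigma = 0.5.
rng = np.random.default_rng(5)
n, p = 100, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
Y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.5, size=n)

beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
e = Y - X @ beta_hat
mse = np.sum(e**2) / (n - p - 1)   # estimates sigma^2 = 0.25
print(mse)
```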

Example: The Auto Data
  • The Auto data set contains 9 variables on 392 car models from the 1980s.
  • The variables include
    • acceleration: Time to accelerate from 0 to 60 mph (in seconds)
    • horsepower: Engine horsepower
    • weight: Vehicle weight (lbs.)
How to Do Regression in R?
  • R code:
lm(acceleration ~ weight + horsepower, data=Auto)

Output:

Call:
lm(formula = acceleration ~ weight + horsepower, data = Auto)

Coefficients:
(Intercept)       weight   horsepower
   18.4358       0.0023      -0.0933
  • The lm() command above asks R to fit the model $acceleration = \beta_0 + \beta_1\,weight + \beta_2\,horsepower + \epsilon$
  • R gives us the regression equation

acceleration = 18.4358 + 0.0023 weight − 0.0933 horsepower
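The fitted equation can be used for prediction; for instance (computed here in Python for illustration, with a hypothetical car weighing 3000 lbs with 100 horsepower):

```python
# Coefficients from the fitted regression equation above.
b0, b_weight, b_hp = 18.4358, 0.0023, -0.0933

# Predicted time (seconds) to accelerate from 0 to 60 mph.
pred = b0 + b_weight * 3000 + b_hp * 100
print(round(pred, 4))  # 16.0058, i.e. about 16 seconds
```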

More R Commands
lm1 = lm(acceleration ~ weight + horsepower, data=Auto)
lm1$coef  # show the estimated beta's

Output:

(Intercept)       weight   horsepower
  18.435791     0.002302    -0.093313
lm1$fit # show the fitted values
lm1$res # show the residuals
plot(lm1$fit,lm1$res, xlab="Fitted Values", ylab="Residuals")

Interpretation of Regression Coefficients

Interpretation of the Intercept $\beta_0$
  • $\beta_0$ = intercept = the mean value of Y when all $X_j$’s are 0.
    • May have no practical meaning
    • e.g., $\beta_0$ is meaningless in the Auto model, as no car has 0 weight
Interpretation of the Regression Coefficient $\beta_j$
  • $\beta_j$, the regression coefficient for $X_j$, is the mean change in the response Y when $X_j$ is increased by one unit, holding the other $X_i$’s constant.

    • Also called partial regression coefficients because they are adjusted for the other covariates
    • The interpretation of $\beta_j$ depends on the presence of other predictors in the model
  • e.g., the two $\beta_1$’s in the two models below have different interpretations

    • Model 1: $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \epsilon$
    • Model 2: $Y = \beta_0 + \beta_1 X_1 + \epsilon$
Something Wrong?
lm(acceleration ~ weight, data=Auto)$coef

Output:

(Intercept)       weight
19.572666     -0.001354
lm(acceleration ~ weight + horsepower, data=Auto)$coef

Output:

(Intercept)       weight   horsepower
18.435791     0.002302      -0.093313
  • The coefficient $\hat{\beta}_1$ for weight is negative in the first model (SLR) but positive in the second model (MLR).
  • Do heavier cars require more or less time to accelerate from 0 to 60 mph?
Effect of weight Not Controlling for Other Predictors
library(ggplot2)
ggplot(Auto, aes(x=weight, y=acceleration)) + geom_point()
  • From the scatter plot above, are weight and acceleration positively or negatively associated?
  • Do heavier vehicles generally require more or less time to accelerate from 0 to 60 mph?
  • Is that reasonable?
Effect of weight Controlling for horsepower (1)
ggplot(Auto, aes(x=weight, y=acceleration, col=horsepower)) + geom_point()
ggplot(Auto, aes(x=weight, y=acceleration, col=horsepower)) + geom_point() + scale_color_gradientn(colours = rainbow(5))
Effect of weight Controlling for horsepower (2)
  • Consider car models of similar horsepower (similar color), are weight and acceleration positively or negatively correlated?
ggplot(Auto, aes(x=weight, y=acceleration, col=horsepower)) + geom_point() + scale_color_gradientn(colours = rainbow(5))
Effect of weight Controlling for horsepower (3)
R code for the plot on the previous page:
Auto$hp = cut(Auto$horsepower, breaks=c(45, 70, 80, 90, 100, 110, 130, 150, 230),
              labels=c("hp <= 70", "70 < hp <= 80", "80 < hp <= 90", "90 < hp <= 100", "100 < hp <= 110", "110 < hp <= 130", "130 < hp <= 150", "hp > 150"))
ggplot(Auto, aes(x=weight, y=acceleration, col=horsepower)) + geom_point() + scale_color_gradientn(colours = rainbow(5)) + facet_wrap(~hp, nrow=2) + theme(legend.position="top")

Example: Auto Data — Simpson’s Paradox

  • Why is the association between acceleration and weight flipped from positive to negative when horsepower is ignored?

  • Heavier vehicles (purple dots) tend to have more horsepower while lighter ones (red dots) tend to have less

  • Vehicles with more horsepower (purple dots) require less time to accelerate while those with less (red dots) require more

  • Hence, when ignoring horsepower, it looks like heavier vehicles require less time to accelerate, though heavier vehicles require more time to accelerate after the effect of horsepower is adjusted (which means considering only vehicles with similar horsepower).
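The same flip can be reproduced with synthetic data (Python with NumPy; all numbers are made up, with x2 playing the role of horsepower and x1 the role of weight): the marginal slope of y on x1 is negative, while the slope adjusted for x2 is positive.

```python
import numpy as np

# Synthetic Simpson's paradox: x1 rises with the confounder x2,
# and x2 pushes y down harder than x1 pushes it up.
rng = np.random.default_rng(6)
n = 500
x2 = rng.normal(size=n)                            # confounder
x1 = 2.0 * x2 + rng.normal(scale=0.3, size=n)      # correlated with x2
y = 1.0 * x1 - 3.0 * x2 + rng.normal(scale=0.3, size=n)

# Marginal (SLR) slope of y on x1:
slope_marginal = np.polyfit(x1, y, 1)[0]
# Adjusted (MLR) slope of y on x1, controlling for x2:
X = np.column_stack([np.ones(n), x1, x2])
slope_adjusted = np.linalg.lstsq(X, y, rcond=None)[0][1]

print(slope_marginal < 0, slope_adjusted > 0)  # True True
```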

What We Mean by “Adjusted for Other Covariates”?
  • For a multiple linear regression model with p predictors

  • $Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \epsilon$

  • $\beta_j$ represents the effect of $X_j$ on the response variable Y after it has been adjusted for all of $X_1, \ldots, X_p$ except $X_j$.

What does “adjusted for” mean?
What We Mean by “Adjusted for Other Covariates” (2)?
  • The LS estimate $\hat{\beta}_j$ for $\beta_j$ in the MLR model $Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \epsilon$ would be identical to the slope of the SLR model computed as follows.
    1. Regress Y on all the other $X_k$’s except $X_j$
    2. Regress $X_j$ on all the other $X_k$’s except $X_j$
    3. Fit an SLR model using the residuals from Step 1 as the response and the residuals from Step 2 as the predictor.
  • Moreover, the intercept obtained in Step 3 would be 0.
  • The proof of this result involves complicated matrix algebra and hence is omitted. We just illustrate with an example.
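The three-step recipe can be checked numerically (Python with NumPy rather than R, on made-up data with two predictors; the coefficients 1.0, 2.0, −1.0 are illustrative):

```python
import numpy as np

# Made-up data: x1 is correlated with x2, y depends on both.
rng = np.random.default_rng(7)
n = 200
x2 = rng.normal(size=n)
x1 = 0.5 * x2 + rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(size=n)

# Full MLR fit of y on (x1, x2):
X = np.column_stack([np.ones(n), x1, x2])
beta_full = np.linalg.lstsq(X, y, rcond=None)[0]

# Step 1: residuals of y on x2.  Step 2: residuals of x1 on x2.
Z = np.column_stack([np.ones(n), x2])
ry = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
rx = x1 - Z @ np.linalg.lstsq(Z, x1, rcond=None)[0]

# Step 3: SLR of ry on rx recovers the MLR slope for x1, with intercept 0.
slope, intercept = np.polyfit(rx, ry, 1)
print(np.isclose(slope, beta_full[1]), np.isclose(intercept, 0.0))  # True True
```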
Example for the Auto Data
  • Recall we have fitted the model

$acceleration = \beta_0 + \beta_1\,weight + \beta_2\,horsepower + \epsilon$

  • and obtained the estimate for $\beta_1$ to be $\hat{\beta}_1 = 0.0023$.
    1. Step 1. Regress acceleration on horsepower. Let RY be the residuals of this model.

```R
RY = lm(acceleration ~