Linear Regression Notes
Biostatistics and Algebra - FHMS 103: Linear Regression
Learning Objectives
Describe the Linear Regression Model
State the Regression Modeling Steps
Explain Ordinary Least Squares
Compute Regression Coefficients
Understand and check model assumptions
Predict Response Variable
Comments on R Output
Correlation Models
Link between a correlation model and a regression model
Test of coefficient of Correlation
Models
What is a Model?
Representation of Some Phenomenon (Non-Math/Stats Model)
What is a Math/Stats Model?
Often Describes Relationship between Variables
Types:
Deterministic Models (no randomness)
Probabilistic Models (with randomness)
Deterministic Models
Hypothesize Exact Relationships
Suitable When Prediction Error is Negligible
Example: Body mass index (BMI) is a measure of body fat based on:
Metric Formula: BMI = \frac{\text{Weight in Kilograms}}{(\text{Height in Meters})^2}
Non-metric Formula: BMI = \frac{\text{Weight (pounds)} \times 703}{(\text{Height in inches})^2}
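As a quick illustration, the two BMI formulas can be written as R functions; the function names here are illustrative, not part of the notes:

```r
# Deterministic model: BMI is computed exactly from weight and height,
# with no random error term.
bmi_metric <- function(weight_kg, height_m) {
  weight_kg / height_m^2
}
bmi_nonmetric <- function(weight_lb, height_in) {
  weight_lb * 703 / height_in^2
}

bmi_metric(70, 1.75)      # about 22.9
bmi_nonmetric(154, 69)    # about 22.7
```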
Probabilistic Models
Hypothesize 2 Components:
Deterministic
Random Error
Example: Systolic blood pressure of newborns Is 6 Times the Age in days + Random Error
SBP = 6 \times \text{age(d)} + \epsilon
Random Error May Be Due to Factors Other Than age in days (e.g. Birthweight)
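A sketch of simulating data from this probabilistic model in R; the error standard deviation of 5 is an assumed value chosen for illustration only:

```r
# Probabilistic model: SBP = 6 * age(days) + random error.
set.seed(1)                                    # reproducible random numbers
age <- 1:30                                    # age in days
sbp <- 6 * age + rnorm(30, mean = 0, sd = 5)   # deterministic part + error
head(round(sbp, 1))
```

Each simulated newborn's SBP scatters around the deterministic line 6 × age because of the random error component.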
Types of Probabilistic Models
Regression Models
Correlation Models
Other Models
Regression Models
Relationship between one dependent variable and explanatory variable(s)
Use equation to set up relationship
Numerical Dependent (Response) Variable
1 or More Numerical or Categorical Independent (Explanatory) Variables
Used Mainly for Prediction & Estimation
Regression Modeling Steps
Hypothesize Deterministic Component
Estimate Unknown Parameters
Specify Probability Distribution of Random Error Term
Estimate Standard Deviation of Error
Evaluate the fitted Model
Use Model for Prediction & Estimation
Model Specification
Specifying the deterministic component
Define the dependent variable and independent variable
Hypothesize Nature of Relationship
Expected Effects (i.e., Coefficients’ Signs)
Functional Form (Linear or Non-Linear)
Interactions
Model Specification Is Based on Theory
Theory of Field (e.g., Epidemiology)
Mathematical Theory
Previous Research
‘Common Sense’
Types of Regression Models
Simple
Multiple
Linear
Non-Linear
Linear Regression Model
Linear Equations: Y = mX + b where:
b = Y-intercept
m = \frac{\text{Change in Y}}{\text{Change in X}} = Slope
Linear Regression Model: Y_i = \beta_0 + \beta_1 X_i + \epsilon_i
Relationship Between Variables Is a Linear Function
Dependent (Response) Variable (e.g., CD4+ count)
Independent (Explanatory) Variable (e.g., years since seroconversion)
\beta_1 Population Slope
\beta_0 Population Y-Intercept
\epsilon_i Random Error
Population & Sample Regression Models
Population: Y_i = \beta_0 + \beta_1 X_i + \epsilon_i (Unknown Relationship)
Random Sample: Y_i = \hat{\beta}_0 + \hat{\beta}_1 X_i + \hat{\epsilon}_i
Population Linear Regression Model
Observed value: Y_i = \beta_0 + \beta_1 X_i + \epsilon_i
Mean response: E(Y) = \beta_0 + \beta_1 X_i
\epsilon_i = Random error
Sample Linear Regression Model
Observed value: Y_i = \hat{\beta}_0 + \hat{\beta}_1 X_i + \hat{\epsilon}_i
Predicted value: \hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i
\hat{\epsilon}_i = Residual (estimated random error)
Estimating Parameters: Least Squares Method
Scatter plot:
Plot of All (Xi, Yi) Pairs
Suggests How Well Model Will Fit
Least Squares
‘Best Fit’ Means the Differences Between Actual Y Values & Predicted Y Values Are at a Minimum. But Positive Differences Offset Negative Ones, So Square the Errors!
\sum_{i=1}^{n} \hat{\epsilon}_i^2 = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2
LS Minimizes the Sum of the Squared Differences (errors) (SSE)
Least Squares Graphically
LS minimizes \sum_{i=1}^{n} \hat{\epsilon}_i^2 = \hat{\epsilon}_1^2 + \hat{\epsilon}_2^2 + \hat{\epsilon}_3^2 + \hat{\epsilon}_4^2
Coefficient Equations
Prediction equation: \hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i
Sample slope: \hat{\beta}_1 = \frac{SS_{xy}}{SS_{xx}} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
Sample Y-intercept: \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}
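The two coefficient formulas translate directly into R; the small x and y vectors below are made-up values for illustration, not data from the notes:

```r
# Least-squares slope and intercept from the SS_xy / SS_xx formulas.
x <- c(2, 4, 6, 8)
y <- c(1, 3, 4, 7)
ss_xy <- sum((x - mean(x)) * (y - mean(y)))
ss_xx <- sum((x - mean(x))^2)
b1 <- ss_xy / ss_xx            # sample slope
b0 <- mean(y) - b1 * mean(x)   # sample Y-intercept
c(intercept = b0, slope = b1)  # -1.00, 0.95
```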
Derivation of Parameters (1)
Least Squares (LS): Minimize the sum of squared errors SSE = \sum_{i=1}^{n} \hat{\epsilon}_i^2
\frac{\partial SSE}{\partial \hat{\beta}_0} = \frac{\partial}{\partial \hat{\beta}_0} \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2
\frac{\partial SSE}{\partial \hat{\beta}_1} = \frac{\partial}{\partial \hat{\beta}_1} \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2
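Carrying the differentiation one step further, setting both partial derivatives to zero gives the normal equations, whose simultaneous solution is the slope and intercept formulas on the Coefficient Equations slide:

```latex
-2 \sum_{i=1}^{n} \left( y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i \right) = 0
\qquad
-2 \sum_{i=1}^{n} x_i \left( y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i \right) = 0
```

Solving these yields \hat{\beta}_1 = SS_{xy} / SS_{xx} and \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}.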
Computation Table
X_i | Y_i | X_i^2 | Y_i^2 | X_iY_i |
|---|---|---|---|---|
X_1 | Y_1 | X_1^2 | Y_1^2 | X_1Y_1 |
X_2 | Y_2 | X_2^2 | Y_2^2 | X_2Y_2 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
X_n | Y_n | X_n^2 | Y_n^2 | X_nY_n |
\sum X_i | \sum Y_i | \sum X_i^2 | \sum Y_i^2 | \sum X_iY_i |
Interpretation of Coefficients
Slope (\beta_1)
Estimated Y Changes by \beta_1 for Each 1 Unit Increase in X
If \beta_1 = 2, then Y Is Expected to Increase by 2 for Each 1 Unit Increase in X
Y-Intercept (\beta_0)
Average Value of Y When X = 0
If \beta_0 = 4, then Average Y Is Expected to Be 4 When X Is 0
Parameter Estimation Example
Obstetrics: What is the relationship between Mother’s Estriol level & Birthweight using the following data?
Estriol (mg/24h) | Birthweight (g/1000) |
|---|---|
1 | 1 |
2 | 1 |
3 | 2 |
4 | 2 |
5 | 4 |
Parameter Estimation Solution Table
X_i | Y_i | X_i^2 | Y_i^2 | X_iY_i |
|---|---|---|---|---|
1 | 1 | 1 | 1 | 1 |
2 | 1 | 4 | 1 | 2 |
3 | 2 | 9 | 4 | 6 |
4 | 2 | 16 | 4 | 8 |
5 | 4 | 25 | 16 | 20 |
15 | 10 | 55 | 26 | 37 |
Parameter Estimation Solution
\hat{\beta}_1 = \frac{n \sum XY - \sum X \sum Y}{n \sum X^2 - (\sum X)^2} = \frac{5(37) - 15(10)}{5(55) - (15)^2} = \frac{35}{50} = 0.70
\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X} = \frac{10}{5} - (0.70) \frac{15}{5} = 2 - (0.70)(3) = -0.10
Coefficient Interpretation Solution
Slope (\beta_1)
Birthweight (Y) Is Expected to Increase by .7 Units for Each 1 unit Increase in Estriol (X)
Intercept (\beta_0)
Average Birthweight (Y) Is -0.10 Units When Estriol Level (X) Is 0
Difficult to explain, since birthweight should always be positive
X = 0 lies outside the range of the observed data, so the intercept has no direct physical meaning
Parameter Estimation R codes
# Linear regression
# Birthweight and mother's estriol level example data
el <- c(1, 2, 3, 4, 5)  # mother's estriol level
bw <- c(1, 1, 2, 2, 4)  # child birthweight
mod <- lm(bw ~ el)      # fit the linear regression model
summary(mod)            # display the results from the model
Parameter Estimation R Computer Output
Variable | DF | Parameter Estimate | Standard Error | t Value | Pr > \|t\| |
|---|---|---|---|---|---|
Intercept | 1 | -0.10000 | 0.63509 | -0.16 | 0.8849 |
Estriol | 1 | 0.70000 | 0.19149 | 3.66 | 0.0354 |
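A sketch of how the numbers in this output table can be pulled out of the fitted model object; coef(), confint(), and summary() are standard R functions:

```r
# Refit the model from the earlier slide, then extract its results.
el <- c(1, 2, 3, 4, 5)   # mother's estriol level
bw <- c(1, 1, 2, 2, 4)   # child birthweight
mod <- lm(bw ~ el)

coef(mod)                                  # intercept -0.1, slope 0.7
summary(mod)$coefficients[, "Std. Error"]  # standard errors in the output
confint(mod)                               # 95% confidence intervals
```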
Parameter Estimation Thinking Challenge
You’re a Vet epidemiologist for the county cooperative. You gather the following data:
Food (lb.) | Milk yield (lb.) |
|---|---|
4 | 3.0 |
6 | 5.5 |
10 | 6.5 |
12 | 9.0 |
What is the relationship between cows’ food intake and milk yield?
Parameter Estimation Solution Table
X_i | Y_i | X_i^2 | Y_i^2 | X_iY_i |
|---|---|---|---|---|
4 | 3.0 | 16 | 9.00 | 12 |
6 | 5.5 | 36 | 30.25 | 33 |
10 | 6.5 | 100 | 42.25 | 65 |
12 | 9.0 | 144 | 81.00 | 108 |
32 | 24.0 | 296 | 162.50 | 218 |
Parameter Estimation Solution
\hat{\beta}_1 = \frac{n \sum XY - \sum X \sum Y}{n \sum X^2 - (\sum X)^2} = \frac{4(218) - 32(24)}{4(296) - (32)^2} = 0.65
\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X} = \frac{24}{4} - (0.65) \frac{32}{4} = 6 - (0.65)(8) = 0.80
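The hand computation can be checked with lm(); the variable names food and milk are assumed here, not given in the notes:

```r
# Verify the milk-yield solution with R's built-in least squares.
food <- c(4, 6, 10, 12)        # food intake (lb.)
milk <- c(3.0, 5.5, 6.5, 9.0)  # milk yield (lb.)
coef(lm(milk ~ food))          # intercept 0.80, slope 0.65
```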
Coefficient Interpretation Solution
Slope (\beta_1)
Milk Yield (Y) Is Expected to Increase by .65 lb. for Each 1 lb. Increase in Food intake (X)
Y-Intercept (\beta_0)
Average Milk yield (Y) Is Expected to Be 0.8 lb. When Food intake (X) Is 0
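The fitted line can then be used for prediction, one of the stated uses of regression models; the new food intake of 8 lb. is an assumed value for illustration:

```r
# Predict milk yield at a new food intake using the fitted model.
food <- c(4, 6, 10, 12)
milk <- c(3.0, 5.5, 6.5, 9.0)
fit  <- lm(milk ~ food)
predict(fit, newdata = data.frame(food = 8))  # 0.80 + 0.65 * 8 = 6.0
```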