Linear Regression Notes
Biostatistics and Algebra - FHMS 103: Linear Regression
Learning Objectives
Describe the Linear Regression Model
State the Regression Modeling Steps
Explain Ordinary Least Squares
Compute Regression Coefficients
Understand and check model assumptions
Predict Response Variable
Comments on R Output
Correlation Models
Link between a correlation model and a regression model
Test of coefficient of Correlation
Models
What is a Model?
Representation of Some Phenomenon (Non-Math/Stats Model)
What is a Math/Stats Model?
Often Describes Relationship between Variables
Types:
Deterministic Models (no randomness)
Probabilistic Models (with randomness)
Deterministic Models
Hypothesize Exact Relationships
Suitable When Prediction Error is Negligible
Example: Body mass index (BMI) is a measure of body fat based on:
Metric Formula: BMI = \frac{\text{Weight in Kilograms}}{(\text{Height in Meters})^2}
Non-metric Formula: BMI = \frac{\text{Weight (pounds)} \times 703}{(\text{Height in inches})^2}
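As a quick illustration, the two BMI formulas can be written as R functions; the function names here are illustrative, not part of the notes:

```r
# Deterministic model: BMI is computed exactly from weight and height,
# with no random error term.
bmi_metric <- function(weight_kg, height_m) {
  weight_kg / height_m^2
}
bmi_nonmetric <- function(weight_lb, height_in) {
  weight_lb * 703 / height_in^2
}

bmi_metric(70, 1.75)      # about 22.9
bmi_nonmetric(154, 69)    # about 22.7
```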
Probabilistic Models
Hypothesize 2 Components:
Deterministic
Random Error
Example: Systolic blood pressure of newborns Is 6 Times the Age in days + Random Error
SBP = 6 \times \text{age(d)} + \epsilon
Random Error May Be Due to Factors Other Than age in days (e.g. Birthweight)
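A sketch of simulating data from this probabilistic model in R; the error standard deviation of 5 is an assumed value chosen for illustration only:

```r
# Probabilistic model: SBP = 6 * age(days) + random error.
set.seed(1)                                    # reproducible random numbers
age <- 1:30                                    # age in days
sbp <- 6 * age + rnorm(30, mean = 0, sd = 5)   # deterministic part + error
head(round(sbp, 1))
```

Each simulated newborn's SBP scatters around the deterministic line 6 × age because of the random error component.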
Types of Probabilistic Models
Regression Models
Correlation Models
Other Models
Regression Models
Relationship between one dependent variable and explanatory variable(s)
Use equation to set up relationship
Numerical Dependent (Response) Variable
1 or More Numerical or Categorical Independent (Explanatory) Variables
Used Mainly for Prediction & Estimation
Regression Modeling Steps
Hypothesize Deterministic Component
Estimate Unknown Parameters
Specify Probability Distribution of Random Error Term
Estimate Standard Deviation of Error
Evaluate the fitted Model
Use Model for Prediction & Estimation
Model Specification
Specifying the deterministic component
Define the dependent variable and independent variable
Hypothesize Nature of Relationship
Expected Effects (i.e., Coefficients’ Signs)
Functional Form (Linear or Non-Linear)
Interactions
Model Specification Is Based on Theory
Theory of Field (e.g., Epidemiology)
Mathematical Theory
Previous Research
‘Common Sense’
Types of Regression Models
Simple
Multiple
Linear
Non-Linear
Linear Regression Model
Linear Equations: Y = mX + b where:
b = Y-intercept
m = \frac{\text{Change in Y}}{\text{Change in X}} = Slope
Linear Regression Model: Y_i = \beta_0 + \beta_1 X_i + \epsilon_i
Relationship Between Variables Is a Linear Function
Dependent (Response) Variable (e.g., CD4+ count)
Independent (Explanatory) Variable (e.g., years since seroconversion)
\beta_1 Population Slope
\beta_0 Population Y-Intercept
\epsilon_i Random Error
Population & Sample Regression Models
Population: Y_i = \beta_0 + \beta_1 X_i + \epsilon_i (Unknown Relationship)
Random Sample: Y_i = \hat{\beta}_0 + \hat{\beta}_1 X_i + \hat{\epsilon}_i
Population Linear Regression Model
Observed value: Y_i = \beta_0 + \beta_1 X_i + \epsilon_i
Mean response: E(Y) = \beta_0 + \beta_1 X_i
\epsilon_i = Random error
Sample Linear Regression Model
Observed value: Y_i = \hat{\beta}_0 + \hat{\beta}_1 X_i + \hat{\epsilon}_i
Predicted value: \hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i
\hat{\epsilon}_i = Residual (estimated random error)
Estimating Parameters: Least Squares Method
Scatter plot:
Plot of All (Xi, Yi) Pairs
Suggests How Well Model Will Fit
Least Squares
‘Best Fit’ Means the Differences Between Actual Y Values & Predicted Y Values Are at a Minimum. But Positive Differences Offset Negative Ones, So Square the Errors!
\sum_{i=1}^{n} \hat{\epsilon}_i^2 = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2
LS Minimizes the Sum of the Squared Differences (errors) (SSE)
Least Squares Graphically
LS minimizes \sum_{i=1}^{n} \hat{\epsilon}_i^2 = \hat{\epsilon}_1^2 + \hat{\epsilon}_2^2 + \hat{\epsilon}_3^2 + \hat{\epsilon}_4^2
Coefficient Equations
Prediction equation: \hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i
Sample slope: \hat{\beta}_1 = \frac{SS_{xy}}{SS_{xx}} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
Sample Y-intercept: \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}
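The two coefficient formulas translate directly into R; the small x and y vectors below are made-up values for illustration, not data from the notes:

```r
# Least-squares slope and intercept from the SS_xy / SS_xx formulas.
x <- c(2, 4, 6, 8)
y <- c(1, 3, 4, 7)
ss_xy <- sum((x - mean(x)) * (y - mean(y)))
ss_xx <- sum((x - mean(x))^2)
b1 <- ss_xy / ss_xx            # sample slope
b0 <- mean(y) - b1 * mean(x)   # sample Y-intercept
c(intercept = b0, slope = b1)  # -1.00, 0.95
```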
Derivation of Parameters (1)
Least Squares (LS): Minimize the sum of squared errors SSE = \sum_{i=1}^{n} \hat{\epsilon}_i^2
\frac{\partial SSE}{\partial \hat{\beta}_0} = \frac{\partial}{\partial \hat{\beta}_0} \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2
\frac{\partial SSE}{\partial \hat{\beta}_1} = \frac{\partial}{\partial \hat{\beta}_1} \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2
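Carrying the differentiation one step further, setting both partial derivatives to zero gives the normal equations, whose simultaneous solution is the slope and intercept formulas on the Coefficient Equations slide:

```latex
-2 \sum_{i=1}^{n} \left( y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i \right) = 0
\qquad
-2 \sum_{i=1}^{n} x_i \left( y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i \right) = 0
```

Solving these yields \hat{\beta}_1 = SS_{xy} / SS_{xx} and \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}.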
Computation Table
X_i | Y_i | X_i^2 | Y_i^2 | X_iY_i |
|---|---|---|---|---|
X_1 | Y_1 | X_1^2 | Y_1^2 | X_1Y_1 |
X_2 | Y_2 | X_2^2 | Y_2^2 | X_2Y_2 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
X_n | Y_n | X_n^2 | Y_n^2 | X_nY_n |
\sum X_i | \sum Y_i | \sum X_i^2 | \sum Y_i^2 | \sum X_iY_i |
Interpretation of Coefficients
Slope (\beta_1)
Estimated Y Changes by \beta_1 for Each 1 Unit Increase in X
If \beta_1 = 2, then Y Is Expected to Increase by 2 for Each 1 Unit Increase in X
Y-Intercept (\beta_0)
Average Value of Y When X = 0
If \beta_0 = 4, then Average Y Is Expected to Be 4 When X Is 0
Parameter Estimation Example
Obstetrics: What is the relationship between Mother’s Estriol level & Birthweight using the following data?
Estriol (mg/24h) | Birthweight (g/1000) |
|---|---|
1 | 1 |
2 | 1 |
3 | 2 |
4 | 2 |
5 | 4 |
Parameter Estimation Solution Table
X_i | Y_i | X_i^2 | Y_i^2 | X_iY_i |
|---|---|---|---|---|
1 | 1 | 1 | 1 | 1 |
2 | 1 | 4 | 1 | 2 |
3 | 2 | 9 | 4 | 6 |
4 | 2 | 16 | 4 | 8 |
5 | 4 | 25 | 16 | 20 |
15 | 10 | 55 | 26 | 37 |
Parameter Estimation Solution
\hat{\beta}_1 = \frac{n \sum XY - \sum X \sum Y}{n \sum X^2 - (\sum X)^2} = \frac{5(37) - 15(10)}{5(55) - (15)^2} = \frac{35}{50} = 0.70
\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X} = \frac{10}{5} - (0.70) \frac{15}{5} = 2 - (0.70)(3) = -0.10
Coefficient Interpretation Solution
Slope (\beta_1)
Birthweight (Y) Is Expected to Increase by .7 Units for Each 1 unit Increase in Estriol (X)
Intercept (\beta_0)
Average Birthweight (Y) Is -0.10 Units When Estriol Level (X) Is 0
Difficult to explain, since birthweight should always be positive
X = 0 lies outside the range of the observed data, so the intercept has no direct physical meaning
Parameter Estimation R codes
# Linear regression
# Birthweight and mother's estriol level example data
el <- c(1, 2, 3, 4, 5)  # mother's estriol level
bw <- c(1, 1, 2, 2, 4)  # child birthweight
mod <- lm(bw ~ el)      # fit the linear regression model
summary(mod)            # display the results from the model
Parameter Estimation R Computer Output
Variable | DF | Parameter Estimate | Standard Error | t Value | Pr > \|t\| |
|---|---|---|---|---|---|
Intercept | 1 | -0.10000 | 0.63509 | -0.16 | 0.8849 |
Estriol | 1 | 0.70000 | 0.19149 | 3.66 | 0.0354 |
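A sketch of how the numbers in this output table can be pulled out of the fitted model object; coef(), confint(), and summary() are standard R functions:

```r
# Refit the model from the earlier slide, then extract its results.
el <- c(1, 2, 3, 4, 5)   # mother's estriol level
bw <- c(1, 1, 2, 2, 4)   # child birthweight
mod <- lm(bw ~ el)

coef(mod)                                  # intercept -0.1, slope 0.7
summary(mod)$coefficients[, "Std. Error"]  # standard errors in the output
confint(mod)                               # 95% confidence intervals
```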
Parameter Estimation Thinking Challenge
You’re a Vet epidemiologist for the county cooperative. You gather the following data:
Food (lb.) | Milk yield (lb.) |
|---|---|
4 | 3.0 |
6 | 5.5 |
10 | 6.5 |
12 | 9.0 |
What is the relationship between cows’ food intake and milk yield?
Parameter Estimation Solution Table
X_i | Y_i | X_i^2 | Y_i^2 | X_iY_i |
|---|---|---|---|---|
4 | 3.0 | 16 | 9.00 | 12 |
6 | 5.5 | 36 | 30.25 | 33 |
10 | 6.5 | 100 | 42.25 | 65 |
12 | 9.0 | 144 | 81.00 | 108 |
32 | 24.0 | 296 | 162.50 | 218 |
Parameter Estimation Solution
\hat{\beta}_1 = \frac{n \sum XY - \sum X \sum Y}{n \sum X^2 - (\sum X)^2} = \frac{4(218) - 32(24)}{4(296) - (32)^2} = 0.65
\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X} = \frac{24}{4} - (0.65) \frac{32}{4} = 6 - (0.65)(8) = 0.80
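The hand computation can be checked with lm(); the variable names food and milk are assumed here, not given in the notes:

```r
# Verify the milk-yield solution with R's built-in least squares.
food <- c(4, 6, 10, 12)        # food intake (lb.)
milk <- c(3.0, 5.5, 6.5, 9.0)  # milk yield (lb.)
coef(lm(milk ~ food))          # intercept 0.80, slope 0.65
```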
Coefficient Interpretation Solution
Slope (\beta_1)
Milk Yield (Y) Is Expected to Increase by .65 lb. for Each 1 lb. Increase in Food intake (X)
Y-Intercept (\beta_0)
Average Milk yield (Y) Is Expected to Be 0.8 lb. When Food intake (X) Is 0
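The fitted line can then be used for prediction, one of the stated uses of regression models; the new food intake of 8 lb. is an assumed value for illustration:

```r
# Predict milk yield at a new food intake using the fitted model.
food <- c(4, 6, 10, 12)
milk <- c(3.0, 5.5, 6.5, 9.0)
fit  <- lm(milk ~ food)
predict(fit, newdata = data.frame(food = 8))  # 0.80 + 0.65 * 8 = 6.0
```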