Linear Regression Notes

Biostatistics and Algebra - FHMS 103: Linear Regression

Learning Objectives

  • Describe the Linear Regression Model

  • State the Regression Modeling Steps

  • Explain Ordinary Least Squares

  • Compute Regression Coefficients

  • Understand and check model assumptions

  • Predict Response Variable

  • Comment on R Output

  • Correlation Models

  • Link between a correlation model and a regression model

  • Test of coefficient of Correlation

Models

What is a Model?
  • Representation of Some Phenomenon (Non-Math/Stats Model)

What is a Math/Stats Model?
  • Often Describes Relationship between Variables

  • Types:

    • Deterministic Models (no randomness)

    • Probabilistic Models (with randomness)

Deterministic Models
  • Hypothesize Exact Relationships

  • Suitable When Prediction Error is Negligible

  • Example: Body mass index (BMI) is a measure of body fat based on:

    • Metric Formula: BMI = \frac{\text{Weight in Kilograms}}{(\text{Height in Meters})^2}

    • Non-metric Formula: BMI = \frac{\text{Weight (pounds)} \times 703}{(\text{Height in inches})^2}
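The two deterministic BMI formulas above can be expressed as a short code sketch (the function names are illustrative, not from the notes):

```python
def bmi_metric(weight_kg, height_m):
    # BMI = weight (kg) / height (m)^2
    return weight_kg / height_m ** 2

def bmi_nonmetric(weight_lb, height_in):
    # BMI = weight (lb) * 703 / height (in)^2
    return weight_lb * 703 / height_in ** 2

# Deterministic: the same inputs always give the same BMI, with no error term.
print(round(bmi_metric(70, 1.75), 1))  # 70 kg, 1.75 m -> 22.9
```

Note that the two formulas agree (up to rounding of the 703 conversion factor) when the same person is measured in metric and non-metric units.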

Probabilistic Models
  • Hypothesize 2 Components:

    • Deterministic

    • Random Error

  • Example: Systolic blood pressure of newborns is 6 times the age in days, plus random error

    • SBP = 6 \times \text{age(d)} + \epsilon

    • Random Error May Be Due to Factors Other Than age in days (e.g. Birthweight)
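A probabilistic model like this can be simulated by adding a random draw to the deterministic component; the error standard deviation below is an arbitrary value chosen only for illustration:

```python
import random

def simulated_sbp(age_days, error_sd=4.0):
    # Deterministic component (6 * age in days) plus a random error term
    return 6 * age_days + random.gauss(0, error_sd)

# Two newborns of the same age get different simulated SBP values,
# because the random error captures factors other than age:
random.seed(42)
print(simulated_sbp(10), simulated_sbp(10))
```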

Types of Probabilistic Models
  • Regression Models

  • Correlation Models

  • Other Models

Regression Models

  • Relationship between one dependent variable and explanatory variable(s)

  • Use equation to set up relationship

    • Numerical Dependent (Response) Variable

    • 1 or More Numerical or Categorical Independent (Explanatory) Variables

  • Used Mainly for Prediction & Estimation

Regression Modeling Steps
  1. Hypothesize Deterministic Component

    • Estimate Unknown Parameters

  2. Specify Probability Distribution of Random Error Term

    • Estimate Standard Deviation of Error

  3. Evaluate the fitted Model

  4. Use Model for Prediction & Estimation

Model Specification

Specifying the deterministic component
  1. Define the dependent variable and independent variable

  2. Hypothesize Nature of Relationship

    • Expected Effects (i.e., Coefficients’ Signs)

    • Functional Form (Linear or Non-Linear)

    • Interactions

Model Specification Is Based on Theory
  1. Theory of Field (e.g., Epidemiology)

  2. Mathematical Theory

  3. Previous Research

  4. ‘Common Sense’

Types of Regression Models

  • Simple

  • Multiple

  • Linear

  • Non-Linear

Linear Regression Model

  • Linear Equations: Y = mX + b where:

    • b = Y-intercept

    • m = \frac{\text{Change in Y}}{\text{Change in X}} = Slope

  • Linear Regression Model: Y_i = \beta_0 + \beta_1 X_i + \epsilon_i

    • Relationship Between Variables Is a Linear Function

      • Dependent (Response) Variable (e.g., CD4+ count)

      • Independent (Explanatory) Variable (e.g., years since seroconversion)

      • \beta_1 Population Slope

      • \beta_0 Population Y-Intercept

      • \epsilon_i Random Error

Population & Sample Regression Models

  • Population: Y_i = \beta_0 + \beta_1 X_i + \epsilon_i (unknown relationship)

  • Random Sample: Y_i = \hat{\beta}_0 + \hat{\beta}_1 X_i + \hat{\epsilon}_i

Population Linear Regression Model
  • Observed value: Y_i = \beta_0 + \beta_1 X_i + \epsilon_i

  • Mean response: E(Y_i) = \beta_0 + \beta_1 X_i

  • \epsilon_i = random error

Sample Linear Regression Model
  • Observed value: Y_i = \hat{\beta}_0 + \hat{\beta}_1 X_i + \hat{\epsilon}_i

  • Predicted value: \hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i

  • \hat{\epsilon}_i = residual (estimated random error)

Estimating Parameters: Least Squares Method

  • Scatter plot:

    • Plot of All (Xi, Yi) Pairs

    • Suggests How Well Model Will Fit

Least Squares
  • ‘Best Fit’ means the differences between actual Y values and predicted Y values are a minimum. But positive differences offset negative ones, so square the errors!

  • \sum_{i=1}^{n} \hat{\epsilon}_i^2 = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2

  • LS Minimizes the Sum of the Squared Differences (errors) (SSE)

Least Squares Graphically
  • LS minimizes \sum_{i=1}^{n} \hat{\epsilon}_i^2 = \hat{\epsilon}_1^2 + \hat{\epsilon}_2^2 + \hat{\epsilon}_3^2 + \hat{\epsilon}_4^2

Coefficient Equations
  • Prediction equation: \hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i

  • Sample slope: \hat{\beta}_1 = \frac{SS_{xy}}{SS_{xx}} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}

  • Sample Y-intercept: \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}
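The slope and intercept formulas translate directly into code; a minimal sketch (the function name is illustrative):

```python
def fit_line(x, y):
    # Slope = SS_xy / SS_xx; intercept = ybar - slope * xbar
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    ss_xy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    ss_xx = sum((xi - xbar) ** 2 for xi in x)
    b1 = ss_xy / ss_xx
    b0 = ybar - b1 * xbar
    return b0, b1

# Estriol/birthweight data used later in the notes:
b0, b1 = fit_line([1, 2, 3, 4, 5], [1, 1, 2, 2, 4])
print(b0, b1)  # approximately -0.10 and 0.70
```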

Derivation of Parameters (1)
  • Least Squares (LS): minimize the sum of squared errors, SSE = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2

  • \frac{\partial SSE}{\partial \beta_0} = \frac{\partial}{\partial \beta_0} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2 = 0

  • \frac{\partial SSE}{\partial \beta_1} = \frac{\partial}{\partial \beta_1} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2 = 0
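Setting each partial derivative to zero yields the normal equations; solving them simultaneously gives the coefficient formulas stated earlier:

```latex
\frac{\partial SSE}{\partial \beta_0} = -2 \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i) = 0
\;\Rightarrow\; \sum y_i = n \beta_0 + \beta_1 \sum x_i

\frac{\partial SSE}{\partial \beta_1} = -2 \sum_{i=1}^{n} x_i (y_i - \beta_0 - \beta_1 x_i) = 0
\;\Rightarrow\; \sum x_i y_i = \beta_0 \sum x_i + \beta_1 \sum x_i^2

\hat{\beta}_1 = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - (\sum x_i)^2},
\qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}
```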

Computation Table

  X_i      | Y_i      | X_i^2      | Y_i^2      | X_i Y_i
  X_1      | Y_1      | X_1^2      | Y_1^2      | X_1 Y_1
  X_2      | Y_2      | X_2^2      | Y_2^2      | X_2 Y_2
  ⋮        | ⋮        | ⋮          | ⋮          | ⋮
  X_n      | Y_n      | X_n^2      | Y_n^2      | X_n Y_n
  \sum X_i | \sum Y_i | \sum X_i^2 | \sum Y_i^2 | \sum X_i Y_i

Interpretation of Coefficients

  1. Slope (\beta_1)

    • Estimated Y Changes by \beta_1 for Each 1 Unit Increase in X

      • If \beta_1 = 2, then Y Is Expected to Increase by 2 for Each 1 Unit Increase in X

  2. Y-Intercept (\beta_0)

    • Average Value of Y When X = 0

      • If \beta_0 = 4, then Average Y Is Expected to Be 4 When X Is 0

Parameter Estimation Example

  • Obstetrics: What is the relationship between Mother’s Estriol level & Birthweight using the following data?

  Estriol (mg/24h) | Birthweight (g/1000)
  1                | 1
  2                | 1
  3                | 2
  4                | 2
  5                | 4

Parameter Estimation Solution Table

  X_i           | Y_i           | X_i^2           | Y_i^2           | X_i Y_i
  1             | 1             | 1               | 1               | 1
  2             | 1             | 4               | 1               | 2
  3             | 2             | 9               | 4               | 6
  4             | 2             | 16              | 4               | 8
  5             | 4             | 25              | 16              | 20
  \sum X_i = 15 | \sum Y_i = 10 | \sum X_i^2 = 55 | \sum Y_i^2 = 26 | \sum X_i Y_i = 37

Parameter Estimation Solution
  • \hat{\beta}_1 = \frac{n \sum XY - \sum X \sum Y}{n \sum X^2 - (\sum X)^2} = \frac{5(37) - 15(10)}{5(55) - (15)^2} = \frac{35}{50} = 0.70

  • \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X} = \frac{10}{5} - (0.70) \frac{15}{5} = 2 - (0.70)(3) = -0.10

Coefficient Interpretation Solution
  1. Slope (\beta_1)

    • Birthweight (Y) Is Expected to Increase by .7 Units for Each 1 unit Increase in Estriol (X)

  2. Intercept (\beta_0)

    • Average Birthweight (Y) Is -.10 Units When Estriol Level (X) Is 0

      • Difficult to interpret: birthweight must be positive

      • X = 0 lies outside the range of the observed data, so the intercept is an extrapolation

Parameter Estimation R Codes
# Linear regression
# Birthweight and mother's estriol level example data
el <- c(1, 2, 3, 4, 5)  # mother's estriol level
bw <- c(1, 1, 2, 2, 4)  # child birthweight
mod <- lm(bw ~ el)      # fit the linear regression model
summary(mod)            # display the results from the model
Parameter Estimation R Computer Output

  Variable  | DF | Parameter Estimate | Standard Error | t Value | Pr > |t|
  Intercept | 1  | -0.10000           | 0.63509        | -0.16   | 0.8849
  Estriol   | 1  | 0.70000            | 0.19149        | 3.66    | 0.0354
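The standard errors and t values in this output can be reproduced by hand. The sketch below assumes the usual simple-regression formulas s^2 = SSE/(n - 2), SE(\hat{\beta}_1) = s / \sqrt{SS_{xx}}, and SE(\hat{\beta}_0) = s \sqrt{1/n + \bar{x}^2 / SS_{xx}}:

```python
import math

x = [1, 2, 3, 4, 5]   # estriol level
y = [1, 1, 2, 2, 4]   # birthweight
n = len(x)
xbar = sum(x) / n

b0, b1 = -0.10, 0.70  # coefficients estimated above
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
sse = sum(e ** 2 for e in residuals)       # sum of squared errors
s2 = sse / (n - 2)                         # estimated error variance
ss_xx = sum((xi - xbar) ** 2 for xi in x)

se_b1 = math.sqrt(s2 / ss_xx)
se_b0 = math.sqrt(s2 * (1 / n + xbar ** 2 / ss_xx))
print(round(se_b0, 5), round(se_b1, 5))            # 0.63509 0.19149
print(round(b0 / se_b0, 2), round(b1 / se_b1, 2))  # -0.16 3.66
```

These match the Standard Error and t Value columns in the table above.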

Parameter Estimation Thinking Challenge

  • You’re a veterinary epidemiologist for the county cooperative. You gather the following data:

  Food (lb.) | Milk Yield (lb.)
  4          | 3.0
  6          | 5.5
  10         | 6.5
  12         | 9.0

  • What is the relationship between cows’ food intake and milk yield?

Parameter Estimation Solution Table

  X_i           | Y_i             | X_i^2            | Y_i^2               | X_i Y_i
  4             | 3.0             | 16               | 9.00                | 12
  6             | 5.5             | 36               | 30.25               | 33
  10            | 6.5             | 100              | 42.25               | 65
  12            | 9.0             | 144              | 81.00               | 108
  \sum X_i = 32 | \sum Y_i = 24.0 | \sum X_i^2 = 296 | \sum Y_i^2 = 162.50 | \sum X_i Y_i = 218

Parameter Estimation Solution
  • \hat{\beta}_1 = \frac{n \sum XY - \sum X \sum Y}{n \sum X^2 - (\sum X)^2} = \frac{4(218) - 32(24)}{4(296) - (32)^2} = \frac{104}{160} = 0.65

  • \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X} = \frac{24}{4} - (0.65) \frac{32}{4} = 6 - (0.65)(8) = 0.80
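The same arithmetic can be double-checked in code using the summary-sum form of the slope formula:

```python
x = [4, 6, 10, 12]        # food intake (lb.)
y = [3.0, 5.5, 6.5, 9.0]  # milk yield (lb.)
n = len(x)

sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))  # 218
sum_x2 = sum(xi ** 2 for xi in x)              # 296

b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b0 = sum_y / n - b1 * sum_x / n
print(b1, b0)  # approximately 0.65 and 0.80
```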

Coefficient Interpretation Solution
  1. Slope (\beta_1)

    • Milk Yield (Y) Is Expected to Increase by .65 lb. for Each 1 lb. Increase in Food intake (X)

  2. Y-Intercept (\beta_0)

    • Average Milk yield (Y) Is Expected to Be 0.8 lb. When Food intake (X) Is 0
