Linear Regression Notes

Biostatistics and Algebra - FHMS 103: Linear Regression

Learning Objectives

  • Describe the Linear Regression Model

  • State the Regression Modeling Steps

  • Explain Ordinary Least Squares

  • Compute Regression Coefficients

  • Understand and check model assumptions

  • Predict Response Variable

  • Comment on R Output

  • Correlation Models

  • Link between a correlation model and a regression model

  • Test of coefficient of Correlation

Models

What is a Model?
  • Representation of Some Phenomenon (Non-Math/Stats Model)

What is a Math/Stats Model?
  • Often Describes Relationship between Variables

  • Types:

    • Deterministic Models (no randomness)

    • Probabilistic Models (with randomness)

Deterministic Models
  • Hypothesize Exact Relationships

  • Suitable When Prediction Error is Negligible

  • Example: Body mass index (BMI) is a measure of body fat based on:

    • Metric Formula: $BMI = \frac{\text{Weight in Kilograms}}{(\text{Height in Meters})^2}$

    • Non-metric Formula: $BMI = \frac{\text{Weight (pounds)} \times 703}{(\text{Height in inches})^2}$
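
Because a deterministic model has no error term, both BMI formulas can be evaluated directly. The following is a minimal R sketch; the function names `bmi_metric` and `bmi_imperial` and the example person (70 kg, 1.75 m) are illustrative assumptions, not part of the notes.

```r
# Deterministic model: the same inputs always give the same output.
# Function names and example values are hypothetical.
bmi_metric <- function(weight_kg, height_m) {
  weight_kg / height_m^2
}

bmi_imperial <- function(weight_lb, height_in) {
  weight_lb * 703 / height_in^2
}

# Illustrative person: 70 kg, 1.75 m
bmi_m <- bmi_metric(70, 1.75)

# Same person in pounds and inches (1 kg = 2.20462 lb, 1 in = 0.0254 m)
bmi_i <- bmi_imperial(70 * 2.20462, 1.75 / 0.0254)

round(c(metric = bmi_m, imperial = bmi_i), 2)
```

The two results should agree up to the rounding built into the factor 703.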

Probabilistic Models
  • Hypothesize 2 Components:

    • Deterministic

    • Random Error

  • Example: Systolic blood pressure of newborns Is 6 Times the Age in days + Random Error

    • $SBP = 6 \times \text{age(d)} + \epsilon$

    • Random Error May Be Due to Factors Other Than age in days (e.g. Birthweight)
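
The two components of a probabilistic model can be seen by simulating the blood pressure example. This is only a sketch: the notes do not specify the error distribution, so the normal error with standard deviation 2 and the ages 1 to 10 days are assumptions made for illustration.

```r
# Probabilistic model: SBP = 6 * age + random error.
# Normal errors with sd = 2 and ages 1..10 are assumed for illustration only.
set.seed(1)
age_d <- 1:10
deterministic <- 6 * age_d                       # deterministic component
eps <- rnorm(length(age_d), mean = 0, sd = 2)    # random error component
sbp <- deterministic + eps                       # observed response

# Newborns of the same age share the deterministic part and differ only through eps
data.frame(age_d, deterministic, eps = round(eps, 2), sbp = round(sbp, 2))
```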

Types of Probabilistic Models
  • Regression Models

  • Correlation Models

  • Other Models

Regression Models

  • Relationship between one dependent variable and explanatory variable(s)

  • Use equation to set up relationship

    • Numerical Dependent (Response) Variable

    • 1 or More Numerical or Categorical Independent (Explanatory) Variables

  • Used Mainly for Prediction & Estimation

Regression Modeling Steps
  1. Hypothesize Deterministic Component

    • Estimate Unknown Parameters

  2. Specify Probability Distribution of Random Error Term

    • Estimate Standard Deviation of Error

  3. Evaluate the fitted Model

  4. Use Model for Prediction & Estimation
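
The four steps map onto a short R workflow. This sketch borrows the five estriol and birthweight pairs from the worked example in these notes; the straight-line form and the new estriol value of 3.5 are illustrative choices rather than anything the notes prescribe.

```r
# Step 1: hypothesize a straight-line deterministic component and estimate it
el <- c(1, 2, 3, 4, 5)   # mother's estriol level (mg/24h)
bw <- c(1, 1, 2, 2, 4)   # birthweight (g/1000)
mod <- lm(bw ~ el)
coef(mod)                # estimated intercept and slope

# Step 2: estimate the standard deviation of the random error
s <- sigma(mod)          # residual standard error, sqrt(SSE / (n - 2))

# Step 3: evaluate the fitted model (here via the coefficient of determination)
r2 <- summary(mod)$r.squared

# Step 4: use the model for prediction (estriol = 3.5 is a hypothetical input)
pred <- predict(mod, newdata = data.frame(el = 3.5))
```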

Model Specification

Specifying the deterministic component
  1. Define the dependent variable and independent variable

  2. Hypothesize Nature of Relationship

    • Expected Effects (i.e., Coefficients’ Signs)

    • Functional Form (Linear or Non-Linear)

    • Interactions

Model Specification Is Based on Theory
  1. Theory of Field (e.g., Epidemiology)

  2. Mathematical Theory

  3. Previous Research

  4. ‘Common Sense’

Types of Regression Models

  • Simple

  • Multiple

  • Linear

  • Non-Linear

Linear Regression Model

  • Linear Equations: $Y = mX + b$ where:

    • $b$ = Y-intercept

    • $m = \frac{\text{Change in Y}}{\text{Change in X}}$ = Slope

  • Linear Regression Model: $Y_i = \beta_0 + \beta_1 X_i + \epsilon_i$

    • Relationship Between Variables Is a Linear Function

      • Dependent (Response) Variable (e.g., CD+ c.)

      • Independent (Explanatory) Variable (e.g., Years s. serocon.)

      • $\beta_1$ Population Slope

      • $\beta_0$ Population Y-Intercept

      • $\epsilon_i$ Random Error

Population & Sample Regression Models

  • Population: $Y_i = \beta_0 + \beta_1 X_i + \epsilon_i$ (Unknown Relationship)

  • Random Sample: $Y_i = \hat{\beta_0} + \hat{\beta_1} X_i + \hat{\epsilon_i}$

Population Linear Regression Model
  • Observed value: $Y_i = \beta_0 + \beta_1 X_i + \epsilon_i$

  • $E(Y) = \beta_0 + \beta_1 X_i$

  • $\epsilon_i$ = Random error

Sample Linear Regression Model
  • Observed value: $Y_i = \hat{\beta_0} + \hat{\beta_1} X_i + \hat{\epsilon_i}$

  • Fitted value: $\hat{Y_i} = \hat{\beta_0} + \hat{\beta_1} X_i$

  • $\hat{\epsilon_i} = Y_i - \hat{Y_i}$ = Residual (estimate of the random error)

Estimating Parameters: Least Squares Method

  • Scatter plot:

    • Plot of All ($X_i$, $Y_i$) Pairs

    • Suggests How Well Model Will Fit

Least Squares
  • ‘Best Fit’ Means the Differences Between Actual Y Values & Predicted Y Values Are a Minimum. But Positive Differences Offset Negative Ones, So Square the Errors!

  • $\sum_{i=1}^{n} \hat{\epsilon_i}^2 = \sum_{i=1}^{n} (Y_i - \hat{Y_i})^2$

  • LS Minimizes the Sum of the Squared Differences (Errors) (SSE)

Least Squares Graphically
  • LS minimizes $\sum_{i=1}^{n} \hat{\epsilon_i}^2 = \hat{\epsilon_1}^2 + \hat{\epsilon_2}^2 + \hat{\epsilon_3}^2 + \hat{\epsilon_4}^2$
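
The minimizing property can be checked numerically by comparing the SSE of the least squares line with the SSE of other candidate lines. A small sketch in R, using the estriol and birthweight pairs from the worked example in these notes; the helper `sse` and the two competing lines are hypothetical choices for illustration.

```r
# Sum of squared errors for a candidate line y = b0 + b1 * x
sse <- function(b0, b1, x, y) {
  sum((y - (b0 + b1 * x))^2)
}

x <- c(1, 2, 3, 4, 5)
y <- c(1, 1, 2, 2, 4)

fit <- lm(y ~ x)
b <- unname(coef(fit))               # least squares intercept and slope

sse_ls <- sse(b[1], b[2], x, y)      # SSE of the least squares line
sse_other <- c(
  flatter = sse(0.5, 0.5, x, y),     # hypothetical competing lines
  steeper = sse(-1, 1.0, x, y)
)

sse_ls
sse_other
```

Any other intercept and slope pair would likewise give an SSE at least as large as `sse_ls`.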

Coefficient Equations
  • Prediction equation: $\hat{Y_i} = \hat{\beta_0} + \hat{\beta_1} X_i$

  • Sample slope: $\hat{\beta_1} = \frac{SS_{xy}}{SS_{xx}} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$

  • Sample Y-intercept: $\hat{\beta_0} = \bar{y} - \hat{\beta_1} \bar{x}$
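
These formulas translate almost symbol for symbol into R. The sketch below applies them to the estriol and birthweight pairs from the worked example in these notes; the variable names (`ss_xy`, `ss_xx`, `b0`, `b1`) are illustrative.

```r
x <- c(1, 2, 3, 4, 5)   # estriol (mg/24h)
y <- c(1, 1, 2, 2, 4)   # birthweight (g/1000)

ss_xy <- sum((x - mean(x)) * (y - mean(y)))  # SS_xy, sum of cross-products of deviations
ss_xx <- sum((x - mean(x))^2)                # SS_xx, sum of squared deviations of x

b1 <- ss_xy / ss_xx                 # sample slope
b0 <- mean(y) - b1 * mean(x)        # sample Y-intercept

y_hat <- b0 + b1 * x                # prediction equation at the observed x values
c(b0 = b0, b1 = b1)
```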

Derivation of Parameters (1)
  • Least Squares (L-S): Minimize squared error

  • $\frac{\partial \epsilon}{\partial \beta_0} = \frac{\partial}{\partial \beta_0} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$

  • $\frac{\partial \epsilon}{\partial \beta_1} = \frac{\partial}{\partial \beta_1} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$
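
Setting both partial derivatives to zero gives the two normal equations, and solving them yields the coefficient formulas. The following is a sketch of that algebra, writing $SSE$ for the sum of squares being minimized (the quantity differentiated above):

```latex
\begin{aligned}
\frac{\partial SSE}{\partial \beta_0}
  &= -2 \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i) = 0
  &&\Rightarrow\quad \hat{\beta_0} = \bar{y} - \hat{\beta_1} \bar{x} \\
\frac{\partial SSE}{\partial \beta_1}
  &= -2 \sum_{i=1}^{n} x_i (y_i - \beta_0 - \beta_1 x_i) = 0
  &&\Rightarrow\quad \sum_{i=1}^{n} x_i y_i = \hat{\beta_0} \sum_{i=1}^{n} x_i + \hat{\beta_1} \sum_{i=1}^{n} x_i^2
\end{aligned}
```

Substituting $\hat{\beta_0} = \bar{y} - \hat{\beta_1} \bar{x}$ into the second equation gives $\sum x_i y_i - n \bar{x} \bar{y} = \hat{\beta_1} (\sum x_i^2 - n \bar{x}^2)$, which is the same as $\hat{\beta_1} = SS_{xy} / SS_{xx}$.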

Computation Table

| $X_i$ | $Y_i$ | $X_i^2$ | $Y_i^2$ | $X_iY_i$ |
|---|---|---|---|---|
| $X_1$ | $Y_1$ | $X_1^2$ | $Y_1^2$ | $X_1Y_1$ |
| $X_2$ | $Y_2$ | $X_2^2$ | $Y_2^2$ | $X_2Y_2$ |
| : | : | : | : | : |
| $X_n$ | $Y_n$ | $X_n^2$ | $Y_n^2$ | $X_nY_n$ |
| $\sum X_i$ | $\sum Y_i$ | $\sum X_i^2$ | $\sum Y_i^2$ | $\sum X_iY_i$ |

Interpretation of Coefficients

  1. Slope ($\beta_1$)

    • Estimated Y Changes by $\beta_1$ for Each 1 Unit Increase in X

      • If $\beta_1$ = 2, then Y Is Expected to Increase by 2 for Each 1 Unit Increase in X

  2. Y-Intercept ($\beta_0$)

    • Average Value of Y When X = 0

      • If $\beta_0$ = 4, then Average Y Is Expected to Be 4 When X Is 0

Parameter Estimation Example

  • Obstetrics: What is the relationship between Mother’s Estriol level & Birthweight using the following data?

| Estriol (mg/24h) | Birthweight (g/1000) |
|---|---|
| 1 | 1 |
| 2 | 1 |
| 3 | 2 |
| 4 | 2 |
| 5 | 4 |

Parameter Estimation Solution Table

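| $X_i$ | $Y_i$ | $X_i^2$ | $Y_i^2$ | $X_iY_i$ |
|---|---|---|---|---|
| 1 | 1 | 1 | 1 | 1 |
| 2 | 1 | 4 | 1 | 2 |
| 3 | 2 | 9 | 4 | 6 |
| 4 | 2 | 16 | 4 | 8 |
| 5 | 4 | 25 | 16 | 20 |
| $\sum X_i = 15$ | $\sum Y_i = 10$ | $\sum X_i^2 = 55$ | $\sum Y_i^2 = 26$ | $\sum X_iY_i = 37$ |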

Parameter Estimation Solution
  • $\hat{\beta_1} = \frac{n \sum XY - \sum X \sum Y}{n \sum X^2 - (\sum X)^2} = \frac{5(37) - 15(10)}{5(55) - (15)^2} = \frac{35}{50} = 0.70$

  • $\hat{\beta_0} = \bar{Y} - \hat{\beta_1} \bar{X} = \frac{10}{5} - (0.70) \frac{15}{5} = 2 - (0.70)(3) = -0.10$
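
The hand calculation can be reproduced by building the computation table in R and plugging the column totals into the raw-sums formula. This is a sketch; the object names `tab` and `totals` are illustrative, and the final line simply cross-checks the result against `lm`.

```r
x <- c(1, 2, 3, 4, 5)   # estriol (mg/24h)
y <- c(1, 1, 2, 2, 4)   # birthweight (g/1000)
n <- length(x)

# Computation table: one row per observation, then column totals
tab <- data.frame(X = x, Y = y, X2 = x^2, Y2 = y^2, XY = x * y)
totals <- colSums(tab)  # 15, 10, 55, 26, 37

# Raw-sums formula for the slope, then the intercept
b1 <- (n * totals[["XY"]] - totals[["X"]] * totals[["Y"]]) /
      (n * totals[["X2"]] - totals[["X"]]^2)
b0 <- totals[["Y"]] / n - b1 * totals[["X"]] / n

# Cross-check against R's built-in least squares fit
fit <- lm(y ~ x)
rbind(by_hand = c(b0, b1), lm = unname(coef(fit)))
```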

Coefficient Interpretation Solution
  1. Slope ($\beta_1$)

    • Birthweight (Y) Is Expected to Increase by .7 Units for Each 1 unit Increase in Estriol (X)

  2. Intercept ($\beta_0$)

    • Average Birthweight (Y) Is -.10 Units When Estriol level (X) Is 0

      • Difficult to explain

      • The birthweight should always be positive

Parameter Estimation R Code
# Linear regression
# Birthweight and mother's estriol level example data
el <- c(1, 2, 3, 4, 5) # mother's estriol level (mg/24h)
bw <- c(1, 1, 2, 2, 4) # child's birthweight (g/1000)
mod <- lm(bw ~ el)     # fit the simple linear regression of bw on el
summary(mod)           # print estimates, standard errors, t values and p-values

Parameter Estimation R Computer Output

| Variable | DF | Parameter Estimate | Standard Error | t Value | Pr > \|t\| |
|---|---|---|---|---|---|
| Intercept | 1 | -0.10000 | 0.63509 | -0.16 | 0.8849 |
| Estriol | 1 | 0.70000 | 0.19149 | 3.66 | 0.0354 |
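
Each column of this output can be rebuilt from the residuals. The sketch below uses the standard simple-regression formulas for the standard errors, which these notes have not stated at this point, so treat them as an assumption brought in for illustration; the last line prints R's own coefficient table for comparison.

```r
el <- c(1, 2, 3, 4, 5)
bw <- c(1, 1, 2, 2, 4)
n <- length(el)
mod <- lm(bw ~ el)

sse <- sum(resid(mod)^2)                 # sum of squared residuals
s <- sqrt(sse / (n - 2))                 # estimated sd of the random error
ss_xx <- sum((el - mean(el))^2)

se_b1 <- s / sqrt(ss_xx)                         # standard error of the slope
se_b0 <- s * sqrt(1 / n + mean(el)^2 / ss_xx)    # standard error of the intercept

est <- unname(coef(mod))
t_val <- est / c(se_b0, se_b1)                   # t = estimate / standard error
p_val <- 2 * pt(abs(t_val), df = n - 2, lower.tail = FALSE)  # two-sided p-value

round(cbind(estimate = est, se = c(se_b0, se_b1), t = t_val, p = p_val), 5)
coef(summary(mod))   # the same four columns as computed by R
```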

Parameter Estimation Thinking Challenge

  • You’re a Vet epidemiologist for the county cooperative. You gather the following data:

| Food (lb.) | Milk yield (lb.) |
|---|---|
| 4 | 3.0 |
| 6 | 5.5 |
| 10 | 6.5 |
| 12 | 9.0 |

  • What is the relationship between cows’ food intake and milk yield?

Parameter Estimation Solution Table

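| $X_i$ | $Y_i$ | $X_i^2$ | $Y_i^2$ | $X_iY_i$ |
|---|---|---|---|---|
| 4 | 3.0 | 16 | 9.00 | 12 |
| 6 | 5.5 | 36 | 30.25 | 33 |
| 10 | 6.5 | 100 | 42.25 | 65 |
| 12 | 9.0 | 144 | 81.00 | 108 |
| $\sum X_i = 32$ | $\sum Y_i = 24.0$ | $\sum X_i^2 = 296$ | $\sum Y_i^2 = 162.50$ | $\sum X_iY_i = 218$ |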

Parameter Estimation Solution
  • $\hat{\beta_1} = \frac{n \sum XY - \sum X \sum Y}{n \sum X^2 - (\sum X)^2} = \frac{4(218) - 32(24)}{4(296) - (32)^2} = 0.65$

  • $\hat{\beta_0} = \bar{Y} - \hat{\beta_1} \bar{X} = \frac{24}{4} - (0.65) \frac{32}{4} = 6 - (0.65)(8) = 0.80$

Coefficient Interpretation Solution
  1. Slope ($\beta_1$)

    • Milk Yield (Y) Is Expected to Increase by .65 lb. for Each 1 lb. Increase in Food intake (X)

  2. Y-Intercept ($\beta_0$)

    • Average Milk yield (Y) Is Expected to Be 0.8 lb. When Food intake (X) Is 0
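
Both interpretations can be read off a fitted model in R. In this sketch the food intakes 8 and 9 lb. are arbitrary illustrative values, and the variable names are hypothetical; the point is that predictions 1 lb. apart differ by the slope, and the prediction at 0 lb. equals the intercept.

```r
food <- c(4, 6, 10, 12)         # food intake (lb.)
milk <- c(3.0, 5.5, 6.5, 9.0)   # milk yield (lb.)
fit <- lm(milk ~ food)
b <- unname(coef(fit))          # b[1] = intercept, b[2] = slope

# Slope: difference in predicted yield between two intakes 1 lb. apart
p <- predict(fit, newdata = data.frame(food = c(8, 9)))
slope_effect <- unname(p[2] - p[1])

# Intercept: predicted yield at zero intake, which lies outside
# the observed 4 to 12 lb. range and is therefore an extrapolation
at_zero <- unname(predict(fit, newdata = data.frame(food = 0)))

c(intercept = b[1], slope = b[2], slope_effect = slope_effect, at_zero = at_zero)
```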
