Exercises - OLS

Page 1

Course Overview

  • Title: Data Science Exercises

  • Institution: Hochschule Schmalkalden

  • Instructor: Prof. Diego d'Andria

  • Semester: WS 2024-2025

  • Context: University of Applied Sciences


Page 2

Instructions

  • Software used: R

  • Built-in R datasets are accessed via the data() command in the R console.

  • Uses standard R commands, with two exceptions (see the install note below):

    • Exercise 4: uses the sandwich and lmtest packages for robust standard errors.

    • Exercise 6: uses a dataset from the ggplot2 package.
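
  • If the sandwich, lmtest, or ggplot2 packages are not yet installed, they can be installed once (a minimal sketch):

    install.packages(c("sandwich", "lmtest", "ggplot2"))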


Page 3

Exercise 1: Describing Saving Rates Data

  • Dataset: LifeCycleSavings

    • Time period: 1960–1970

  • Tasks:

    1. Load the data and summarize each variable (min, mean, max).

    2. Find the variable with the largest range of values.

    3. Obtain a correlation matrix: identify pairs with highest and lowest correlations.


Page 4

Exercise 2: OLS on Saving Rates

  • Dataset: LifeCycleSavings

  • Tasks:

    1. Run OLS model with saving rate (sr) as dependent variable.

    2. Identify statistically significant regressors (p < 0.10).

    3. Create reduced model excluding insignificant regressors: assess if explanatory power improves.


Page 5

Exercise 3: Diagnosing Errors from OLS Model

  • Dataset: LifeCycleSavings

  • Tasks:

    1. Run the full model from Exercise 2 and plot a histogram of the residuals: assess whether they appear normally distributed.

    2. Conduct a Shapiro-Wilk normality test on the residuals: evaluate the reliability of the hypothesis test.


Page 6

Exercise 4: Obtaining Heteroskedasticity-Robust Standard Errors

  • Dataset: LifeCycleSavings

  • Tasks:

    1. Execute full model from Exercise 2.

    2. Obtain robust standard errors: investigate changes in p-values for variables pop15 and ddpi.


Page 7

Exercise 5: The Mass of Wood from Fallen Trees

  • Dataset: trees

    • Variables: Diameter (Girth), Height, Volume of cherry trees.

  • Tasks:

    1. Run OLS model with Volume ~ Girth + Height.

    2. Assess how well the regressors predict Volume.

    3. Evaluate normality of residuals.

    4. Run the log-transformed model: ln(Volume) ~ ln(Girth) + ln(Height). Compare significance and R² with the first OLS model.


Page 8

Exercise 6: Poverty in the US Midwest

  • Dataset: midwest from ggplot2

  • Tasks:

    1. Compute percadults as popadults/poptotal.

    2. Run the OLS model: percbelowpoverty ~ poptotal + popdensity + perchsd + percollege + percprof + percadults.

    3. Test whether all six regressors are necessary: omit one at a time and compare the model fit.


Page 9

Solutions

  • Overview of the solutions provided for exercises 1-6.


Page 10

Exercise 1 - Solution

  • Load dataset:

    data(LifeCycleSavings)
  • Descriptive statistics:

    summary(LifeCycleSavings)
  • Correlation matrix:

    cor(LifeCycleSavings)
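  • Largest range (Task 2; code not shown on the slide). A minimal sketch, assuming "range" means the difference between each variable's maximum and minimum:

    apply(LifeCycleSavings, 2, function(x) diff(range(x)))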

Page 11

Exercise 2 - Solution

  • Load dataset:

    data(LifeCycleSavings)
  • OLS model:

    model1 <- lm(sr ~ pop15 + pop75 + dpi + ddpi, data=LifeCycleSavings)
    summary(model1)
  • Significant variables: pop15 and ddpi.

  • Reduced model:

    model2 <- lm(sr ~ pop15 + ddpi, data=LifeCycleSavings)
    summary(model2)
  • Compare Adjusted R²:

    • Full model: 0.2797

    • Reduced model: 0.2575
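
  • The two nested models can also be compared with an F-test on the omitted regressors pop75 and dpi (a minimal sketch, not shown on the slides):

    anova(model2, model1)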


Page 12

Exercise 3 - Solution

  • Load dataset and run regression:

    data(LifeCycleSavings)
    model1 <- lm(sr ~ pop15 + pop75 + dpi + ddpi, data=LifeCycleSavings)
    summary(model1)
  • Histogram of residuals:

    hist(model1$residuals)
  • Shapiro-Wilk test:

    shapiro.test(rstandard(model1))
  • Interpretation: p-value = 0.91, so normality of the residuals is not rejected.
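
  • A normal Q-Q plot of the standardized residuals is a common complementary check (a minimal sketch, not part of the slide solution):

    qqnorm(rstandard(model1))
    qqline(rstandard(model1))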


Page 13

Exercise 4 - Solution

  • Load dataset and run regression:

    data(LifeCycleSavings)
    model1 <- lm(sr ~ pop15 + pop75 + dpi + ddpi, data=LifeCycleSavings)
    summary(model1)
  • Obtain robust errors:

    library(sandwich)
    library(lmtest)
    model1_robust <- coeftest(model1, vcov = vcovHC)
    print(model1_robust)
  • Interpretation: pop15 remains significant while ddpi loses significance.
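
  • To check whether heteroskedasticity is actually present, a Breusch-Pagan test from lmtest can be run (a minimal sketch, not required by the exercise):

    bptest(model1)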


Page 14

Exercise 5 - Solution

  • Load dataset:

    data(trees)
    trees <- data.frame(trees)
  • OLS model:

    reg <- lm(Volume ~ Girth + Height, data=trees)
    summary(reg)
  • Model R² = 0.948, F-statistic p-value < 0.001.

  • Shapiro-Wilk test:

    shapiro.test(rstandard(reg))
  • Normality implication: Cannot reject normality (p-value ~ 0.64).


Page 15

Exercise 5 - Solution (cont.)

  • Log transformed model:

    reg_ln <- lm(log(Volume) ~ log(Girth) + log(Height), data=trees)
    summary(reg_ln)
  • Outcomes: the significance of Height improves, and R² increases to 0.977.
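
  • Note that the two R² values are not directly comparable, since the dependent variables differ (Volume vs. ln(Volume)). One optional check on the original scale (a minimal sketch; the naive back-transform exp(fitted(reg_ln)) ignores retransformation bias):

    # root mean squared error of each model on the original Volume scale
    rmse_level <- sqrt(mean(residuals(reg)^2))
    rmse_log <- sqrt(mean((trees$Volume - exp(fitted(reg_ln)))^2))
    c(level = rmse_level, log = rmse_log)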


Page 16

Exercise 6 - Solution

  • Load dataset:

    library(ggplot2)
    data(midwest)
    midwest <- data.frame(midwest)
    midwest$percadults <- midwest$popadults/midwest$poptotal
  • Run full OLS model:

    reg <- lm(percbelowpoverty ~ poptotal + popdensity + perchsd + percollege + percprof + percadults, data=midwest)
    summary(reg)

Page 17

Exercise 6 - Solution (cont.)

  • Test each regressor's necessity:

    reg1 <- lm(percbelowpoverty ~ popdensity + perchsd + percollege + percprof + percadults, data=midwest)
    reg2 <- lm(percbelowpoverty ~ poptotal + perchsd + percollege + percprof + percadults, data=midwest)
    reg3 <- lm(percbelowpoverty ~ poptotal + popdensity + percollege + percprof + percadults, data=midwest)
    reg4 <- lm(percbelowpoverty ~ poptotal + popdensity + perchsd + percprof + percadults, data=midwest)
    reg5 <- lm(percbelowpoverty ~ poptotal + popdensity + perchsd + percollege + percadults, data=midwest)
    reg6 <- lm(percbelowpoverty ~ poptotal + popdensity + perchsd + percollege + percprof, data=midwest)
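  • Alternatively, the six reduced models can be generated programmatically with update(), dropping one regressor at a time (a minimal sketch, not the approach shown on the slides):

    regressors <- c("poptotal", "popdensity", "perchsd", "percollege", "percprof", "percadults")
    adj_r2 <- sapply(regressors, function(v) {
      # refit the full model without regressor v and store its adjusted R-squared
      reduced <- update(reg, as.formula(paste(". ~ . -", v)))
      summary(reduced)$adj.r.squared
    })
    adj_r2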

Page 18

Exercise 6 - Solution (cont.)

  • Compare Adjusted R² for models:

    summary(reg)$adj.r.squared
    summary(reg1)$adj.r.squared
    summary(reg2)$adj.r.squared
    summary(reg3)$adj.r.squared
    summary(reg4)$adj.r.squared
    summary(reg5)$adj.r.squared
    summary(reg6)$adj.r.squared
  • Findings: omitting poptotal yields the highest adjusted R² (0.441); omitting perchsd results in the lowest (0.129).


Page 19

R Code for Exercises

  • Exercise 1:

    data(LifeCycleSavings)
    summary(LifeCycleSavings)
    cor(LifeCycleSavings)
  • Exercise 2:

    model1 <- lm(sr ~ pop15 + pop75 + dpi + ddpi, data=LifeCycleSavings)
    summary(model1)
    model2 <- lm(sr ~ pop15 + ddpi, data=LifeCycleSavings)
    summary(model2)
  • Exercise 3:

    hist(model1$residuals)
    shapiro.test(rstandard(model1))
  • Exercise 4:

    library(sandwich)
    library(lmtest)
    model1_robust <- coeftest(model1, vcov = vcovHC)
    print(model1_robust)

Page 20

R Code for Exercises (cont.)

  • Exercise 5:

    trees <- data.frame(trees)
    reg <- lm(Volume ~ Girth + Height, data=trees)
    summary(reg)
    shapiro.test(rstandard(reg))
    reg_ln <- lm(log(Volume) ~ log(Girth) + log(Height), data=trees)
    summary(reg_ln)
  • Exercise 6:

    library(ggplot2)
    data(midwest)
    midwest <- data.frame(midwest)
    midwest$percadults <- midwest$popadults/midwest$poptotal
    reg <- lm(percbelowpoverty ~ poptotal + popdensity + perchsd + percollege + percprof + percadults, data=midwest)
    reg1 <- lm(percbelowpoverty ~ popdensity + perchsd + percollege + percprof + percadults, data=midwest)
    reg2 <- lm(percbelowpoverty ~ poptotal + perchsd + percollege + percprof + percadults, data=midwest)
    reg3 <- lm(percbelowpoverty ~ poptotal + popdensity + percollege + percprof + percadults, data=midwest)
    reg4 <- lm(percbelowpoverty ~ poptotal + popdensity + perchsd + percprof + percadults, data=midwest)
    reg5 <- lm(percbelowpoverty ~ poptotal + popdensity + perchsd + percollege + percadults, data=midwest)
    reg6 <- lm(percbelowpoverty ~ poptotal + popdensity + perchsd + percollege + percprof, data=midwest)
    summary(reg)$adj.r.squared
    summary(reg1)$adj.r.squared
    summary(reg2)$adj.r.squared
    summary(reg3)$adj.r.squared
    summary(reg4)$adj.r.squared
    summary(reg5)$adj.r.squared
    summary(reg6)$adj.r.squared