Title: Data Science Exercises
Institution: Hochschule Schmalkalden
Instructor: Prof. Diego d'Andria
Topic: OLS
Semester: WS 2024-2025
Context: University of Applied Sciences
Software used: R
Built-in datasets are accessed via the data() command in the R console.
Uses standard R commands, with two exceptions:
Exercise 4: uses the sandwich and lmtest packages for robust standard errors.
Exercise 6: uses a dataset from the ggplot2 package.
Exercise 1:
Dataset: LifeCycleSavings
Time period: 1960–1970
Tasks:
Load the data and summarize each variable (min, mean, max).
Find the variable with the largest range of values.
Obtain a correlation matrix: identify pairs with highest and lowest correlations.
Exercise 2:
Dataset: LifeCycleSavings
Tasks:
Run an OLS model with the saving rate (sr) as the dependent variable and the remaining variables as regressors.
Identify the statistically significant regressors (p < 0.10).
Create a reduced model that excludes the insignificant regressors and assess whether explanatory power improves.
Exercise 3:
Dataset: LifeCycleSavings
Tasks:
Run the full model from Exercise 2.
Plot a histogram of the residuals and assess whether they look normally distributed.
Conduct a Shapiro-Wilk normality test on the residuals: evaluate reliability of the hypothesis test.
Exercise 4:
Dataset: LifeCycleSavings
Tasks:
Run the full model from Exercise 2.
Obtain robust standard errors and investigate how the p-values for the variables pop15 and ddpi change.
Exercise 5:
Dataset: trees
Variables: diameter (Girth), Height, and Volume of cherry trees.
Tasks:
Run an OLS model with Volume ~ Girth + Height.
Assess how well the regressors predict Volume.
Evaluate the normality of the residuals.
Run a log-transformed model: ln(Volume) ~ ln(Girth) + ln(Height). Compare significance and R² with the first OLS model.
Exercise 6:
Dataset: midwest from the ggplot2 package.
Tasks:
Compute percadults as popadults/poptotal.
Run an OLS model: percbelowpoverty ~ poptotal + popdensity + perchsd + percollege + percprof + percadults.
Test the necessity of all six regressors: omit one at a time and compare model fit.
Overview of the solutions provided for exercises 1-6.
Exercise 1:
Load dataset:
data(LifeCycleSavings)
Descriptive statistics:
summary(LifeCycleSavings)
Correlation matrix:
cor(LifeCycleSavings)
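The summary and correlation matrix above still have to be read by eye to answer the range and correlation questions. The following is a minimal sketch of how those answers could be extracted programmatically; the object names (ranges, widest, pair_tbl) are illustrative and not part of the original solution. Because "lowest correlation" can mean either the most negative or the one closest to zero, the sketch prints both.

```r
# Load the built-in data set (5 numeric variables).
data(LifeCycleSavings)

# Range (max minus min) of each variable; the largest entry answers task 2.
ranges <- sapply(LifeCycleSavings, function(x) diff(range(x)))
widest <- names(which.max(ranges))

# Correlation matrix with the diagonal and upper triangle blanked out,
# so that every pair of variables appears exactly once.
cm <- cor(LifeCycleSavings)
cm[upper.tri(cm, diag = TRUE)] <- NA

# Reshape to a long table of pairs and drop the blanked cells.
pair_tbl <- na.omit(as.data.frame(as.table(cm)))
names(pair_tbl) <- c("var1", "var2", "r")

print(ranges)
print(widest)
print(pair_tbl[which.max(pair_tbl$r), ])       # most positive correlation
print(pair_tbl[which.min(pair_tbl$r), ])       # most negative correlation
print(pair_tbl[which.min(abs(pair_tbl$r)), ])  # weakest correlation (closest to zero)
```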
Exercise 2:
Load dataset:
data(LifeCycleSavings)
OLS model:
model1 <- lm(sr ~ pop15 + pop75 + dpi + ddpi, data=LifeCycleSavings)
summary(model1)
Significant variables (p < 0.10): pop15 and ddpi.
Reduced model:
model2 <- lm(sr ~ pop15 + ddpi, data=LifeCycleSavings)
summary(model2)
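Compare adjusted R²:
Full model: 0.2797
Reduced model: 0.2575
Adjusted R² is lower in the reduced model, so dropping pop75 and dpi does not improve explanatory power.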
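The adjusted R² values can also be pulled directly from the fitted models rather than read off the printed summaries. The sketch below does that and, as an addition that is not part of the original solution, runs a partial F-test with anova() on the two nested models; the names full, reduced and adj_r2 are illustrative.

```r
data(LifeCycleSavings)

# Full and reduced models from Exercise 2.
full    <- lm(sr ~ pop15 + pop75 + dpi + ddpi, data = LifeCycleSavings)
reduced <- lm(sr ~ pop15 + ddpi, data = LifeCycleSavings)

# Adjusted R-squared of both models, side by side.
adj_r2 <- c(full    = summary(full)$adj.r.squared,
            reduced = summary(reduced)$adj.r.squared)
print(round(adj_r2, 4))

# Partial F-test (an addition to the original solution):
# H0 is that the coefficients on pop75 and dpi are jointly zero.
print(anova(reduced, full))
```

A large p-value in the F-test would indicate that the two dropped regressors are not jointly significant, which is a separate question from whether adjusted R² goes up or down.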
Exercise 3:
Load dataset and run regression:
data(LifeCycleSavings)
model1 <- lm(sr ~ pop15 + pop75 + dpi + ddpi, data=LifeCycleSavings)
summary(model1)
Histogram of residuals:
hist(model1$residuals)
Shapiro-Wilk test:
shapiro.test(rstandard(model1))
Interpretation: p-value = 0.91, so normality of the residuals cannot be rejected.
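A visual check can complement the histogram and the Shapiro-Wilk test. The sketch below overlays a standard normal density on the histogram of standardized residuals and adds a normal Q-Q plot; the Q-Q plot and the overlay are additions that are not part of the original solution, and the names res_std and sw are illustrative.

```r
data(LifeCycleSavings)
model1 <- lm(sr ~ pop15 + pop75 + dpi + ddpi, data = LifeCycleSavings)

# Standardized residuals, as used in the Shapiro-Wilk test above.
res_std <- rstandard(model1)

# Histogram on the density scale with the standard normal curve overlaid.
hist(res_std, freq = FALSE, main = "Standardized residuals", xlab = "")
curve(dnorm(x), add = TRUE, lty = 2)

# Normal Q-Q plot: points close to the line suggest approximate normality.
qqnorm(res_std)
qqline(res_std)

# Shapiro-Wilk test; H0 is that the residuals are normally distributed.
sw <- shapiro.test(res_std)
print(sw$p.value)
```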
Exercise 4:
Load dataset and run regression:
data(LifeCycleSavings)
model1 <- lm(sr ~ pop15 + pop75 + dpi + ddpi, data=LifeCycleSavings)
summary(model1)
Obtain robust errors:
library(sandwich)
library(lmtest)
model1_robust <- coeftest(model1, vcov = vcovHC)
print(model1_robust)
Interpretation: pop15 remains significant, while ddpi loses significance.
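To see the change in p-values at a glance, the classical and robust results can be placed in one table. This is a minimal sketch; the object names classical, robust and comparison are illustrative. Calling coeftest() without a vcov argument uses the ordinary OLS covariance matrix, and vcovHC() applied to an lm object defaults to the HC3 estimator.

```r
library(sandwich)  # provides vcovHC()
library(lmtest)    # provides coeftest()

data(LifeCycleSavings)
model1 <- lm(sr ~ pop15 + pop75 + dpi + ddpi, data = LifeCycleSavings)

# Classical inference (same p-values as summary(model1)).
classical <- coeftest(model1)

# Heteroskedasticity-robust inference (HC3 by default).
robust <- coeftest(model1, vcov = vcovHC)

# One row per coefficient, one column per set of p-values.
comparison <- cbind(classical_p = classical[, "Pr(>|t|)"],
                    robust_p    = robust[, "Pr(>|t|)"])
print(round(comparison, 4))
```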
Exercise 5:
Load dataset:
data(trees)
trees <- data.frame(trees)
OLS model:
reg <- lm(Volume ~ Girth + Height, data=trees)
summary(reg)
Model R² = 0.948, F-statistic p-value < 0.001.
Shapiro-Wilk test:
shapiro.test(rstandard(reg))
Interpretation: normality of the residuals cannot be rejected (p-value ≈ 0.64).
Log transformed model:
reg_ln <- lm(log(Volume) ~ log(Girth) + log(Height), data=trees)
summary(reg_ln)
Outcomes: the significance of Height improved, and R² increased to 0.977.
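The comparison can be made explicit by collecting R² and the p-values for Height from both models. One caveat: the two R² values refer to different dependent variables (Volume versus log(Volume)), so they are not strictly comparable. As a rough cross-check that is not part of the original solution, the sketch below also back-transforms the fitted values of the log model and computes their squared correlation with Volume. All object names other than reg and reg_ln are illustrative.

```r
data(trees)

# Level and log-log models from Exercise 5.
reg    <- lm(Volume ~ Girth + Height, data = trees)
reg_ln <- lm(log(Volume) ~ log(Girth) + log(Height), data = trees)

# R-squared of each model, each on the scale of its own dependent variable.
r2 <- c(level = summary(reg)$r.squared,
        log   = summary(reg_ln)$r.squared)
print(round(r2, 3))

# p-values of the Height coefficient in both models.
p_height <- c(level = coef(summary(reg))["Height", "Pr(>|t|)"],
              log   = coef(summary(reg_ln))["log(Height)", "Pr(>|t|)"])
print(signif(p_height, 3))

# Rough cross-check on the original scale (an assumption of this sketch):
# squared correlation between Volume and the back-transformed log-model fit.
r2_backtransformed <- cor(trees$Volume, exp(fitted(reg_ln)))^2
print(round(r2_backtransformed, 3))
```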
Exercise 6:
Load dataset:
library(ggplot2)
data(midwest)
midwest <- data.frame(midwest)
midwest$percadults <- midwest$popadults/midwest$poptotal
Run full OLS model:
reg <- lm(percbelowpoverty ~ poptotal + popdensity + perchsd + percollege + percprof + percadults, data=midwest)
summary(reg)
Test each regressor's necessity:
reg1 <- lm(percbelowpoverty ~ popdensity + perchsd + percollege + percprof + percadults, data=midwest)
reg2 <- lm(percbelowpoverty ~ poptotal + perchsd + percollege + percprof + percadults, data=midwest)
reg3 <- lm(percbelowpoverty ~ poptotal + popdensity + percollege + percprof + percadults, data=midwest)
reg4 <- lm(percbelowpoverty ~ poptotal + popdensity + perchsd + percprof + percadults, data=midwest)
reg5 <- lm(percbelowpoverty ~ poptotal + popdensity + perchsd + percollege + percadults, data=midwest)
reg6 <- lm(percbelowpoverty ~ poptotal + popdensity + perchsd + percollege + percprof, data=midwest)
Compare Adjusted R² for models:
summary(reg)$adj.r.squared
summary(reg1)$adj.r.squared
summary(reg2)$adj.r.squared
summary(reg3)$adj.r.squared
summary(reg4)$adj.r.squared
summary(reg5)$adj.r.squared
summary(reg6)$adj.r.squared
Findings: omitting poptotal yields the highest adjusted R² (0.441); omitting perchsd yields the lowest adjusted R² (0.129).
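The six leave-one-out regressions above can also be generated in a loop. The sketch below uses update() to drop one regressor at a time; drop_one_adj_r2 is a hypothetical helper written for this illustration, not a function from any package and not part of the original solution. The midwest example only runs if ggplot2 is installed.

```r
# Hypothetical helper: refit `model` once per regressor, leaving that
# regressor out, and return the adjusted R-squared of each fit
# together with that of the full model.
drop_one_adj_r2 <- function(model) {
  regressors <- attr(terms(model), "term.labels")
  dropped <- sapply(regressors, function(term) {
    reduced <- update(model, as.formula(paste(". ~ . -", term)))
    summary(reduced)$adj.r.squared
  })
  c(full = summary(model)$adj.r.squared, dropped)
}

# Apply the helper to the Exercise 6 model.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  midwest <- as.data.frame(ggplot2::midwest)
  midwest$percadults <- midwest$popadults / midwest$poptotal
  reg <- lm(percbelowpoverty ~ poptotal + popdensity + perchsd +
              percollege + percprof + percadults, data = midwest)
  # Each name is the omitted regressor; "full" is the complete model.
  print(sort(drop_one_adj_r2(reg), decreasing = TRUE))
}
```

In the sorted output, a regressor whose omission leaves adjusted R² at or above the full-model value contributes little, while a large drop marks a regressor the model needs.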
Exercise 1:
data(LifeCycleSavings)
summary(LifeCycleSavings)
cor(LifeCycleSavings)
Exercise 2:
model1 <- lm(sr ~ pop15 + pop75 + dpi + ddpi, data=LifeCycleSavings)
summary(model1)
model2 <- lm(sr ~ pop15 + ddpi, data=LifeCycleSavings)
summary(model2)
Exercise 3:
hist(model1$residuals)
shapiro.test(rstandard(model1))
Exercise 4:
library(sandwich)
library(lmtest)
model1_robust <- coeftest(model1, vcov = vcovHC)
print(model1_robust)
Exercise 5:
trees <- data.frame(trees)
reg <- lm(Volume ~ Girth + Height, data=trees)
summary(reg)
shapiro.test(rstandard(reg))
reg_ln <- lm(log(Volume) ~ log(Girth) + log(Height), data=trees)
summary(reg_ln)
Exercise 6:
library(ggplot2)
data(midwest)
midwest <- data.frame(midwest)
midwest$percadults <- midwest$popadults/midwest$poptotal
reg <- lm(percbelowpoverty ~ poptotal + popdensity + perchsd + percollege + percprof + percadults, data=midwest)
reg1 <- lm(percbelowpoverty ~ popdensity + perchsd + percollege + percprof + percadults, data=midwest)
reg2 <- lm(percbelowpoverty ~ poptotal + perchsd + percollege + percprof + percadults, data=midwest)
reg3 <- lm(percbelowpoverty ~ poptotal + popdensity + percollege + percprof + percadults, data=midwest)
reg4 <- lm(percbelowpoverty ~ poptotal + popdensity + perchsd + percprof + percadults, data=midwest)
reg5 <- lm(percbelowpoverty ~ poptotal + popdensity + perchsd + percollege + percadults, data=midwest)
reg6 <- lm(percbelowpoverty ~ poptotal + popdensity + perchsd + percollege + percprof, data=midwest)
summary(reg)$adj.r.squared
summary(reg1)$adj.r.squared
summary(reg2)$adj.r.squared
summary(reg3)$adj.r.squared
summary(reg4)$adj.r.squared
summary(reg5)$adj.r.squared
summary(reg6)$adj.r.squared