MATH1041 Descriptive Statistics – Chapter 2 Lecture Notes (Lecturer Annotated)

Lecture 1 – Numerical & Graphical Summaries for One Variable

Administrative / Context

  • Chapter 2 of MATH1041 “Descriptive Statistics” → 4 lectures

  • Overarching aim of course: collecting, analysing, interpreting data

  • Textbook mapping: Moore et al., 2021, Sections 1.2–1.3

  • RStudio is the compulsory software (install + watch MarinStatsLectures videos). Other packages (SAS, SPSS, Excel, …) mentioned but not used.

Learning outcomes (Ch2-L1-O1 … O4)

  • Describe a variable numerically & graphically

  • Recognise that choice depends on variable type (categorical vs quantitative)

  • Comment on graphs (centre, spread, shape, outliers)

  • Detect outliers & understand their impact

Variable Types & Strategy Table (single variable perspective)

Categorical → table  + bar chart
Quantitative → mean/SD OR five-number summary + histogram/boxplot
  • Decide summary after checking type & number of variables

Statistical Distribution (Def 2.1)

  1. Set of possible values (or categories)

  2. Corresponding counts / frequencies / proportions (for quantitative: counts per class interval)

  • Replacing raw data with distribution loses labels/individual values but yields concise overview

Categorical Variables

Numerical Summary

  • Frequency table (Def 2.2): list of categories + counts/percentages

  • R: table(x); wrap with prop.table() and multiply by 100 for percentages

  • Examples: Watching Australian Open (Yes/No) ; Mode of transport ; Hemisphere of origin

Graphical Summary

  • Bar chart (vertical or horizontal). Order bars logically to aid readability (e.g., descending frequency)

  • R base: barplot(table(catVar))
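The table → prop.table → barplot workflow above can be sketched with made-up data (the transport values below are invented for illustration):

```r
# Hypothetical mode-of-transport responses (invented for illustration)
transport <- c("Bus", "Train", "Car", "Bus", "Walk", "Bus", "Train", "Car")

tab <- table(transport)            # frequency table (Def 2.2)
pct <- prop.table(tab) * 100       # convert counts to percentages

# Bar chart, bars ordered by descending frequency for readability
barplot(sort(tab, decreasing = TRUE))
```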

Quantitative Variables

Why not frequency table? (Slide 2.26)

  • Potentially infinite distinct values ⇒ table would be huge

Key Features to Capture

  • Location (centre)

  • Spread (variability)

  • Optional: shape (skew, modality)

Measures of Location

  • Mean \bar x = \frac{1}{n}\sum_{i=1}^{n} x_i

  • Sensitive to outliers (e.g. Gina Rinehart walks into a bar → average wealth skyrockets)

  • Median M = \begin{cases} x_{((n+1)/2)} & n \text{ odd} \\ \frac{x_{(n/2)} + x_{(n/2+1)}}{2} & n \text{ even} \end{cases} (Def 2.4)

  • Robust to outliers (e.g. engineer salary example)

  • Guidance: use median when distribution skewed / heavy-tailed

Measures of Spread

  • Quartiles (Def 2.5): Q1 & Q3 are medians of lower/upper halves (median included if n odd, consistent with R)

  • Inter-quartile range IQR = Q3-Q1 (Def 2.6) – robust

  • Standard deviation s = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar x)^2}{n-1}} (Def 2.7) – sensitive to outliers

  • Variance s^2 (SD squared)

Five-Number Summary

  1. Min

  2. Q1

  3. Median (Q2)

  4. Q3

  5. Max

  • R: summary(x)

  • Graphical counterpart: boxplot

Outliers

  • Suspected if below Q1 - 1.5\times IQR or above Q3 + 1.5\times IQR (1.5×IQR rule)

  • Show as points on “modified boxplot” (default R behaviour)
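The 1.5×IQR rule can be applied directly with the cheat-sheet functions; a minimal sketch with toy data (the values, including the extreme 30, are invented):

```r
# Toy data with one extreme value (invented for illustration)
x <- c(2, 3, 3, 4, 5, 5, 6, 30)

q     <- quantile(x, c(.25, .75))      # Q1 and Q3 (R's default quantile type)
iqr   <- IQR(x)
lower <- q[[1]] - 1.5 * iqr
upper <- q[[2]] + 1.5 * iqr

outliers <- x[x < lower | x > upper]   # flagged by the 1.5×IQR rule
# boxplot(x) draws these as individual points (modified boxplot, R's default)
```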

Graphical Tools

  • Histogram: good for shape; choose appropriate bin width (breaks=)

  • Boxplot: concise five-number; useful for comparing groups

  • Bar plot for discrete quantitative (narrow bars)

  • Comment on graph in terms of location, spread, shape, outliers

R Cheat-sheet (one variable)

# numbers
mean(x);  sd(x);  median(x)
quantile(x,c(.25,.5,.75));  IQR(x)
summary(x)
# graphs
hist(x); boxplot(x)

Lecture 2 – Relationships Between Two Variables

Learning outcomes (Ch2-L2-O1 – O3)

  1. Pick correct tools when ≥1 variable categorical

  2. Describe scatterplot (presence, strength, direction, type)

  3. Understand Pearson correlation, its derivation, pros/cons

Summary Table (two variables)

Both categorical → 2-way frequency table + clustered bar chart
Cat + Quant → 5-num (or mean/SD) per group + comparative boxplots / clustered bars
Both quantitative → correlation + scatterplot

Case 1: ≥1 Categorical Variable

  • Strategy: split data by categories, then apply single-variable methods within each subgroup, finally compare

  • Two-way table: table(cat1,cat2)

  • Comparative boxplots: boxplot(y ~ cat)
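The split-then-summarise strategy above, sketched with hypothetical data (scores, study modes, and Yes/No responses are all invented for illustration):

```r
# Hypothetical data: exam score (quantitative), study mode and attendance
# (categorical); all values invented for illustration
score      <- c(55, 60, 72, 80, 65, 58, 90, 70)
study_mode <- c("online", "campus", "campus", "online",
                "online", "campus", "campus", "online")
attends    <- c("Yes", "No", "Yes", "Yes", "No", "No", "Yes", "No")

by(score, study_mode, summary)                    # five-number summary per group
boxplot(score ~ study_mode)                       # comparative boxplots

tab2 <- table(study_mode, attends)                # two-way frequency table
barplot(tab2, beside = TRUE, legend.text = TRUE)  # clustered bar chart
```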

Case 2: Both Quantitative → Scatterplot

  • Axes: X – explanatory (if any), Y – response

  • Describe the nature of the relationship using 4 dichotomies:

    1. Existent ↔ non-existent

    2. Strong ↔ weak

    3. Increasing (+) ↔ decreasing (−)

    4. Linear ↔ non-linear
  • Outliers may signal data error or interesting structure (example: Palm Beach 2000 election)

Pearson Correlation (Def 2.8)

  • Standardise each variable: x^\bullet_i = \frac{x_i - \bar x}{s_x}, \; y^\bullet_i = \frac{y_i - \bar y}{s_y}

  • Coefficient r = \frac{\sum_{i=1}^{n} x^\bullet_i\, y^\bullet_i}{n-1}

  • Properties: -1\le r\le1; sign = direction; magnitude = linear strength

  • Cautions:

    • Measures linear association only (Anscombe, NBA points vs age)

    • Sensitive to outliers (interactive examples pushing r from 0.1 to 0.99)

  • Non-linear dependence? → use alternative measures (e.g. Hellinger correlation)
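The standardise-and-average construction of r above can be verified against R's built-in cor(); the data here are invented for illustration:

```r
# Invented paired data for illustration
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8, 12.0)

# Standardise each variable, then average the products (divisor n - 1)
xs <- (x - mean(x)) / sd(x)
ys <- (y - mean(y)) / sd(y)
r_manual <- sum(xs * ys) / (length(x) - 1)

r_manual - cor(x, y)    # essentially zero: the same quantity
```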

R Cheat-sheet (two vars)

plot(y ~ x)           # scatterplot
cor(x,y)              # Pearson r
by(y,cat,summary)     # num summaries by group
barplot(table(cat1,cat2),beside=TRUE)  # clustered bars

Lecture 3 – Least-Squares Regression (Simple Linear)

Learning outcomes (Ch2-L3-O1 – O4)

  • Distinguish explanatory (X) vs response (Y)

  • Understand construction & meaning of least-squares line

  • Use line for prediction; interpret slope & intercept

  • Recognise danger of extrapolation

Model Form

\hat y = b_0 + b_1 x

  • b_1 (slope): change in \hat y for one-unit increase in x

  • b_0 (intercept): predicted y at x=0 (may be nonsensical)

Least-Squares Criterion

  • Choose b_0, b_1 to minimise \sum_{i=1}^{n}(y_i - \hat y_i)^2 (sum of squared residuals)

  • Historical: Gauss/Laplace (~1800)
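The minimiser of the squared-residual sum has the standard closed form b_1 = r\,s_y/s_x, b_0 = \bar y - b_1\bar x; a sketch comparing it to lsfit(), with invented data:

```r
# Invented data for illustration
x <- c(1, 2, 3, 4, 5)
y <- c(2.0, 4.1, 5.9, 8.2, 9.8)

# Closed-form least-squares solution: b1 = r * sy/sx, b0 = ybar - b1 * xbar
b1 <- cor(x, y) * sd(y) / sd(x)
b0 <- mean(y) - b1 * mean(x)

fit <- lsfit(x, y)
coef(fit)    # matches c(b0, b1)
```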

Interpretation Examples

  • Ski-jump scores: \hat y = 110.44 + 1.16x ⇒ every extra point in round 1 adds ≈1.16 points to the predicted total

  • Snake length–weight example for visual intuition

Prediction & Extrapolation

  • Prediction: plug new x within data range → \hat y

  • Extrapolation: using x outside observed range; may produce impossible results (US farmer population → negative)

R Workflow

fit <- lsfit(x,y)        # returns list with coefficients
coef(fit)                # b0,b1
plot(x,y); abline(coef(fit))
# predict
x.new <- 137
y.pred <- coef(fit)[1] + coef(fit)[2]*x.new

Lecture 4 – Residuals & r^2

Learning outcomes (Ch2-L4-O1 – O4)

  1. Always visualise data (Anscombe’s quartet)

  2. Use residual plots to assess linear model assumptions

  3. Compute & interpret r^2 (strength of regression)

  4. Handle outliers; concepts of leverage & influence

Residuals (Def 2.9)

e_i = y_i - \hat y_i = y_i - (b_0 + b_1 x_i)

  • Provide diagnostic info; measured in same units as y

Residual Plot (Def 2.10)

  • Scatter of e_i vs x_i (or vs \hat y_i)

  • Ideal pattern: random cloud centred at 0 (no curvature, equal spread)

  • Systematic structure (arches, funnels) ⇒ violation of linearity/constant variance

  • Example: Janka hardness shows arch → switch to quadratic model
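A minimal residual-plot sketch using the cheat-sheet's lsfit()/residuals() workflow (data invented for illustration):

```r
# Invented, roughly linear data for illustration
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(1.2, 2.1, 2.9, 4.2, 4.8, 6.1, 7.0, 7.9)

fit <- lsfit(x, y)
e   <- residuals(fit)      # e_i = y_i - yhat_i

plot(x, e)                 # residual plot
abline(h = 0, lty = 2)     # ideal: random scatter around this line
```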

Coefficient of Determination r^2

  • For simple linear regression with intercept, the coefficient of determination equals r^2, the square of the Pearson correlation r

  • Interpretation: proportion of variance in y explained by model

    r^2 = \frac{\text{Var}(\hat y)}{\text{Var}(y)}

  • Range 0–1 (report as %). Example: temperature explains 81 % of variation in log electricity use

Alternative formula (standardised data)

r^2 = 1 - \frac{\sum_{i=1}^{n} (y^\bullet_i - \hat y^\bullet_i)^2}{n}

⇒ closer points to fitted line in standardised scale → higher r²
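Both readings of r² above — the square of the Pearson correlation and the variance ratio Var(\hat y)/Var(y) — can be checked numerically (data invented for illustration):

```r
# Invented data for illustration
x <- c(1, 2, 3, 4, 5, 6)
y <- c(1.9, 4.2, 5.8, 8.1, 9.7, 12.3)

fit  <- lsfit(x, y)
yhat <- y - residuals(fit)          # fitted values

r2_from_r   <- cor(x, y)^2          # square of Pearson correlation
r2_from_var <- var(yhat) / var(y)   # Var(yhat) / Var(y)

r2_from_r - r2_from_var             # essentially zero: same quantity
```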

Outliers, Leverage, Influence

  • Residual large → vertical outlier

  • Leverage: extreme x value

  • Influential: point whose inclusion markedly changes fit

  • Handling rules of thumb:

    • If the outlier doesn’t affect results → keep it

    • If it is a clear data error / atypical case → justify removal & report both analyses

  • Examples: Mitsubishi Lancer ad (high price), Lift capacity dataset
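A single high-leverage point far from the trend can flip the fitted slope; a sketch with invented data:

```r
# Invented, clearly linear data for illustration
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.0, 8.1, 9.9)
slope_without <- lsfit(x, y)$coefficients[[2]]

# Add one influential point: high leverage (extreme x) and far from the trend
x2 <- c(x, 20)
y2 <- c(y, 2)
slope_with <- lsfit(x2, y2)$coefficients[[2]]

c(slope_without, slope_with)   # the single point drags the slope dramatically
```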

Summary of Regression Diagnostics Workflow

  1. Plot data, compute r

  2. Fit least-squares line (within appropriate x-range)

  3. Plot residuals; look for non-random patterns & equal variance

  4. Calculate r^2; gauge practical usefulness

  5. Investigate outliers/leverage; consider alternative models if needed


Global R Command Reference

# One variable
mean(x); sd(x); median(x); IQR(x)
summary(x); table(cat)
barplot(table(cat)); hist(x); boxplot(x)

# Two variables
plot(y~x)                 # scatter
cor(x,y)                  # correlation r
lsfit(x,y)                # simple linear regression
residuals(lsfit(x,y))     # residuals
abline(lsfit(x,y))
by(y,cat,summary)         # summaries by category

# Misc
quantile(x,c(.25,.75))
IQR(x)

Key Terminology
  • Statistical distribution, frequency table

  • Mean \bar x, Median M, Quartiles Q1,Q3, IQR, SD s, Variance s^2

  • Five-number summary

  • Bar chart vs Histogram vs Boxplot

  • Shape: symmetric, left-skew, right-skew; modality (uni/bi)

  • Outlier (1.5×IQR rule)

  • Scatterplot, Positive/Negative association, Strength, Linearity

  • Pearson correlation r

  • Explanatory vs Response variable

  • Least-squares regression line (intercept b0, slope b1)

  • Prediction; Extrapolation

  • Residuals, Residual plot

  • r^2 (coefficient of determination)

  • Leverage, Influence


Ethical & Practical Considerations

  • Data collection quality dictates validity; flawed collection → flawed study

  • “Association ≠ Causation” (observational vs experimental designs)

  • Randomisation, comparison, replication principles for experiments

  • Always inspect data graphically before trusting numerical summaries

  • Report methodology transparently (e.g., treatment of outliers, model assumptions)