MATH1041 Descriptive Statistics – Chapter 2, Lecturer-Annotated Notes
Lecture 1 – Numerical & Graphical Summaries for One Variable
Administrative / Context
Chapter 2 of MATH1041 “Descriptive Statistics” → 4 lectures
Overarching aim of course: collecting, analysing, interpreting data
Textbook mapping: Moore et al., 2021, Sections 1.2–1.3
RStudio is the compulsory software (install + watch MarinStatsLectures videos). Other packages (SAS, SPSS, Excel, …) mentioned but not used.
Learning outcomes (Ch2-L1-O1 … O4)
Describe a variable numerically & graphically
Recognise that choice depends on variable type (categorical vs quantitative)
Comment on graphs (centre, spread, shape, outliers)
Detect outliers & understand their impact
Variable Types & Strategy Table (single variable perspective)
Categorical → table + bar chart
Quantitative → mean/SD OR five-number summary + histogram/boxplot
Decide summary after checking type & number of variables
Statistical Distribution (Def 2.1)
Set of possible values (or categories)
Corresponding counts / frequencies / proportions (for quantitative: counts per class interval)
Replacing raw data with distribution loses labels/individual values but yields concise overview
Categorical Variables
Numerical Summary
Frequency table (Def 2.2): list of categories + counts/percentages
R:
table(x) → optionally wrap with prop.table(table(x)) × 100 for percentages
Examples: Watching Australian Open (Yes/No); Mode of transport; Hemisphere of origin
Graphical Summary
Bar chart (vertical or horizontal). Order bars logically to aid readability (e.g., descending frequency)
R base:
barplot(table(catVar))
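A minimal worked sketch in R, using hypothetical transport-mode responses (the lecture's actual data are not reproduced here):

```r
# Hypothetical mode-of-transport responses (not the lecture dataset)
transport <- c("Bus", "Train", "Car", "Bus", "Walk", "Bus", "Train", "Car")
tab <- table(transport)              # frequency table (Def 2.2)
pct <- prop.table(tab) * 100         # counts converted to percentages
tab.sorted <- sort(tab, decreasing = TRUE)
barplot(tab.sorted, ylab = "Count")  # bars ordered by descending frequency
```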
Quantitative Variables
Why not frequency table? (Slide 2.26)
Potentially infinite distinct values ⇒ table would be huge
Key Features to Capture
Location (centre)
Spread (variability)
Optional: shape (skew, modality)
Measures of Location
Mean \bar x = \frac{1}{n}\sum_{i=1}^{n} x_i
Sensitive to outliers (e.g. Gina Rinehart walks into a bar → average wealth skyrockets)
Median M = \begin{cases} x_{((n+1)/2)} & n \text{ odd} \\ \frac{x_{(n/2)} + x_{(n/2+1)}}{2} & n \text{ even} \end{cases} (Def 2.4)
Robust to outliers (e.g. the engineer salary example)
Guidance: use median when distribution skewed / heavy-tailed
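The sensitivity/robustness contrast can be seen directly in R; the salary figures below are invented to mimic the lecture's billionaire-walks-into-a-bar example:

```r
# Toy salaries (in thousands); one extreme value mimics the billionaire effect
salaries <- c(60, 65, 70, 72, 75)
salaries.out <- c(salaries, 30000)   # add one billionaire-scale outlier
mean(salaries); median(salaries)     # both near 68-70
mean(salaries.out)                   # dragged far above typical values
median(salaries.out)                 # barely moves: robust
```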
Measures of Spread
Quartiles (Def 2.5): Q1 & Q3 are medians of lower/upper halves (median included if n odd, consistent with R)
Inter-quartile range IQR = Q3-Q1 (Def 2.6) – robust
Standard deviation s = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar x)^2}{n-1}} (Def 2.7) – sensitive to outliers
Variance s^2 (SD squared)
Five-Number Summary
Min
Q1
Median (Q2)
Q3
Max
R:
summary(x)
Graphical counterpart: boxplot
Outliers
Suspected if beyond Q1 − 1.5×IQR or Q3 + 1.5×IQR (the 1.5×IQR rule)
Show as points on “modified boxplot” (default R behaviour)
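A sketch of the 1.5×IQR rule in R on a small invented sample (note that R's quantile() defaults to its type-7 algorithm, which can differ slightly from the hand rule above):

```r
# 1.5*IQR fences, using R's default quantile algorithm (type 7)
x <- c(2, 4, 4, 5, 6, 7, 9, 25)      # hypothetical sample with one large value
q <- quantile(x, c(.25, .75))
fence.lo <- q[1] - 1.5 * IQR(x)
fence.hi <- q[2] + 1.5 * IQR(x)
x[x < fence.lo | x > fence.hi]       # suspected outliers
boxplot(x)                           # R plots them as points by default
```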
Graphical Tools
Histogram: good for shape; choose an appropriate bin width (breaks=)
Boxplot: concise five-number display; useful for comparing groups
Bar plot for discrete quantitative (narrow bars)
Comment on graph in terms of location, spread, shape, outliers
R Cheat-sheet (one variable)
# numbers
mean(x); sd(x); median(x)
quantile(x,c(.25,.5,.75)); IQR(x)
summary(x)
# graphs
hist(x); boxplot(x)
Lecture 2 – Relationships Between Two Variables
Learning outcomes (Ch2-L2-O1 – O3)
Pick correct tools when ≥1 variable categorical
Describe scatterplot (presence, strength, direction, type)
Understand Pearson correlation, its derivation, pros/cons
Summary Table (two variables)
Both categorical → 2-way frequency table + clustered bar chart
Cat + Quant → 5-num (or mean/SD) per group + comparative boxplots / clustered bars
Both quantitative → correlation + scatterplot
Case 1: ≥1 Categorical Variable
Strategy: split data by categories, then apply single-variable methods within each subgroup, finally compare
Two-way table: table(cat1, cat2)
Comparative boxplots: boxplot(y ~ cat)
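Putting the two sub-cases together on invented data (the variable names mode, watch, and time are hypothetical):

```r
# Hypothetical data: commute time (quantitative) plus two categorical variables
mode  <- factor(c("Bus", "Bus", "Car", "Car", "Train", "Train", "Train"))
watch <- c("Yes", "No", "Yes", "Yes", "No", "Yes", "No")
time  <- c(35, 40, 20, 25, 30, 32, 28)
table(mode, watch)                   # both categorical: two-way table
barplot(table(watch, mode), beside = TRUE, legend.text = TRUE)
by(time, mode, summary)              # cat + quant: summaries per group
boxplot(time ~ mode)                 # comparative boxplots
```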
Case 2: Both Quantitative → Scatterplot
Axes: X – explanatory (if any), Y – response
Describe the nature of the relationship using 4 dichotomies:
1. Existent vs non-existent
2. Strong vs weak
3. Increasing (+) vs decreasing (−)
4. Linear vs non-linear
Outliers may signal data error or interesting structure (example: Palm Beach 2000 election)
Pearson Correlation (Def 2.8)
Standardise each variable: x^\bullet_i = \frac{x_i - \bar x}{s_x}, \; y^\bullet_i = \frac{y_i - \bar y}{s_y}
Coefficient r = \frac{\sum_{i=1}^{n} x^\bullet_i\, y^\bullet_i}{n-1}
Properties: -1\le r\le1; sign = direction; magnitude = linear strength
Cautions:
Measures linear association only (Anscombe's quartet; NBA points vs age)
Sensitive to outliers (interactive examples pushing r from 0.1 to 0.99)
Non-linear dependence? → use alternative measures (e.g. Hellinger correlation)
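On invented data, the standardised-product definition above can be checked against R's built-in cor():

```r
# Verify the standardised-product formula against cor() (hypothetical data)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)
x.std <- (x - mean(x)) / sd(x)             # standardise x
y.std <- (y - mean(y)) / sd(y)             # standardise y
r.manual <- sum(x.std * y.std) / (length(x) - 1)
r.manual
cor(x, y)                                  # matches the manual value
```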
R Cheat-sheet (two vars)
plot(y ~ x) # scatterplot
cor(x,y) # Pearson r
by(y,cat,summary) # num summaries by group
barplot(table(cat1,cat2),beside=TRUE) # clustered bars
Lecture 3 – Least-Squares Regression (Simple Linear)
Learning outcomes (Ch2-L3-O1 – O4)
Distinguish explanatory (X) vs response (Y)
Understand construction & meaning of least-squares line
Use line for prediction; interpret slope & intercept
Recognise danger of extrapolation
Model Form
\hat y = b_0 + b_1 x
b_1 (slope): change in \hat y for a one-unit increase in x
b_0 (intercept): predicted y at x = 0 (may be nonsensical)
Least-Squares Criterion
Choose b_0, b_1 to minimise \sum_{i=1}^{n}(y_i - \hat y_i)^2 (sum of squared residuals)
Historical: Gauss/Laplace (~1800)
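A standard closed form for the minimiser is b_1 = r\,s_y/s_x and b_0 = \bar y - b_1\bar x; on invented data this agrees with lsfit():

```r
# Closed-form least-squares coefficients vs lsfit (hypothetical data)
x <- c(1, 2, 3, 4, 5)
y <- c(2.0, 4.1, 5.9, 8.2, 9.9)
b1 <- cor(x, y) * sd(y) / sd(x)   # slope: r * s_y / s_x
b0 <- mean(y) - b1 * mean(x)      # line passes through (xbar, ybar)
fit <- lsfit(x, y)
coef(fit)                         # matches c(b0, b1)
```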
Interpretation Examples
Ski-jump scores: \hat y = 110.44 + 1.16x ⇒ every extra point in round 1 adds ≈ 1.16 points to the predicted total score
Snake length–weight example for visual intuition
Prediction & Extrapolation
Prediction: plug new x within data range → \hat y
Extrapolation: using x outside observed range; may produce impossible results (US farmer population → negative)
R Workflow
fit <- lsfit(x,y) # returns list with coefficients
coef(fit) # b0,b1
plot(x,y); abline(coef(fit))
# predict
x.new <- 137
y.pred <- coef(fit)[1] + coef(fit)[2]*x.new
Lecture 4 – Residuals & r^2
Learning outcomes (Ch2-L4-O1 – O4)
Always visualise data (Anscombe’s quartet)
Use residual plots to assess linear model assumptions
Compute & interpret r^2 (strength of regression)
Handle outliers; concepts of leverage & influence
Residuals (Def 2.9)
e_i = y_i - \hat y_i = y_i - (b_0 + b_1 x_i)
Provide diagnostic info; measured in same units as y
Residual Plot (Def 2.10)
Scatter of e_i vs x_i (or vs \hat y_i)
Ideal pattern: random cloud centred at 0 (no curvature, equal spread)
Systematic structure (arches, funnels) ⇒ violation of linearity/constant variance
Example: Janka hardness shows arch → switch to quadratic model
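A sketch of the arch pattern on simulated quadratic data (invented, not the Janka dataset):

```r
# Fitting a straight line to curved data produces a systematic residual arch
set.seed(1)                            # reproducible noise
x <- seq(1, 10, by = 0.5)
y <- x^2 + rnorm(length(x), sd = 2)    # truly quadratic relationship
fit <- lsfit(x, y)
res <- residuals(fit)
plot(x, res, ylab = "Residual"); abline(h = 0)
# U-shaped pattern: positive residuals at both ends, negative in the middle
```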
Coefficient of Determination r^2
For simple linear regression with an intercept, the coefficient of determination equals the square of the Pearson correlation r
Interpretation: proportion of variance in y explained by model
r^2 = \frac{\text{Var}(\hat y)}{\text{Var}(y)}
Range 0–1 (report as %). Example: temperature explains 81 % of variation in log electricity use
Alternative formula (standardised data)
r^2 = 1 - \frac{\sum_{i=1}^{n} (y^\bullet_i - \hat y^\bullet_i)^2}{n-1}
⇒ the closer the standardised points lie to the fitted line, the higher r²
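The equivalence r² = Var(\hat y)/Var(y) = (Pearson r)² can be checked on invented data:

```r
# Two equivalent computations of r^2 for simple linear regression
x <- c(1, 2, 3, 4, 5, 6)
y <- c(1.2, 2.1, 2.8, 4.3, 4.9, 6.2)
fit <- lsfit(x, y)
y.hat <- y - residuals(fit)     # fitted values
r2.cor <- cor(x, y)^2           # square of Pearson correlation
r2.var <- var(y.hat) / var(y)   # explained / total variance
c(r2.cor, r2.var)               # agree to numerical precision
```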
Outliers, Leverage, Influence
Large residual → outlier in the y-direction
Leverage: extreme x value (far from the bulk of the data)
Influential: a point whose inclusion markedly changes the fit
Handling rules of thumb:
If the outlier doesn't materially affect results → keep it
If it is a clear data error or atypical case → justify removal & report both analyses
Examples: Mitsubishi Lancer ad (high price), Lift capacity dataset
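A minimal sketch of influence on invented data: a single high-leverage, off-trend point collapses the slope of a previously perfect fit:

```r
# One high-leverage point can swing the fitted slope dramatically
x <- c(1, 2, 3, 4, 5)
y <- c(1, 2, 3, 4, 5)            # perfect line, slope 1
x.i <- c(x, 20); y.i <- c(y, 2)  # add a far-out, off-trend point
coef(lsfit(x, y))[2]             # slope = 1
coef(lsfit(x.i, y.i))[2]         # slope collapses toward 0
```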
Summary of Regression Diagnostics Workflow
Plot data, compute r
Fit least-squares line (within appropriate x-range)
Plot residuals; look for non-random patterns & equal variance
Calculate r^2; gauge practical usefulness
Investigate outliers/leverage; consider alternative models if needed
Global R Command Reference
# One variable
mean(x); sd(x); median(x); IQR(x)
summary(x); table(cat)
barplot(table(cat)); hist(x); boxplot(x)
# Two variables
plot(y~x) # scatter
cor(x,y) # correlation r
lsfit(x,y) # simple linear regression
residuals(lsfit(x,y)) # residuals
abline(lsfit(x,y))
by(y,cat,summary) # summaries by category
# Misc
quantile(x,c(.25,.75))
IQR(x)
Key Terminology
Statistical distribution, frequency table
Mean \bar x, Median M, Quartiles Q1,Q3, IQR, SD s, Variance s^2
Five-number summary
Bar chart vs Histogram vs Boxplot
Shape: symmetric, left-skew, right-skew; modality (uni/bi)
Outlier (1.5×IQR rule)
Scatterplot, Positive/Negative association, Strength, Linearity
Pearson correlation r
Explanatory vs Response variable
Least-squares regression line (intercept b0, slope b1)
Prediction; Extrapolation
Residuals, Residual plot
r^2 (coefficient of determination)
Leverage, Influence
Ethical & Practical Considerations
Data collection quality dictates validity; flawed collection → flawed study
“Association ≠ Causation” (observational vs experimental designs)
Randomisation, comparison, replication principles for experiments
Always inspect data graphically before trusting numerical summaries
Report methodology transparently (e.g., treatment of outliers, model assumptions)