MATH1041 Descriptive Statistics – Chapter 2 Lecture Notes (Lecturer Annotated)

Lecture 1 – Numerical & Graphical Summaries for One Variable

Administrative / Context

  • Chapter 2 of MATH1041 “Descriptive Statistics” → 4 lectures

  • Overarching aim of course: collecting, analysing, interpreting data

  • Textbook mapping: Moore et al., 2021, Sections 1.2–1.3

  • RStudio is the compulsory software (install + watch MarinStatsLectures videos). Other packages (SAS, SPSS, Excel, …) mentioned but not used.

Learning outcomes (Ch2-L1-O1 … O4)

  • Describe a variable numerically & graphically

  • Recognise that choice depends on variable type (categorical vs quantitative)

  • Comment on graphs (centre, spread, shape, outliers)

  • Detect outliers & understand their impact

Variable Types & Strategy Table (single variable perspective)

Categorical → table  + bar chart
Quantitative → mean/SD OR five-number summary + histogram/boxplot
  • Decide summary after checking type & number of variables

Statistical Distribution (Def 2.1)

  1. Set of possible values (or categories)

  2. Corresponding counts / frequencies / proportions (for quantitative: counts per class interval)

  • Replacing raw data with distribution loses labels/individual values but yields concise overview

Categorical Variables

Numerical Summary

  • Frequency table (Def 2.2): list of categories + counts/percentages

  • R: table(x); wrap with prop.table() and multiply by 100 for percentages

  • Examples: Watching Australian Open (Yes/No) ; Mode of transport ; Hemisphere of origin

Graphical Summary

  • Bar chart (vertical or horizontal). Order bars logically to aid readability (e.g., descending frequency)

  • R base: barplot(table(catVar))
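The table → prop.table → barplot workflow above can be sketched with made-up data (the transport values below are invented for illustration):

```r
# Hypothetical mode-of-transport responses (invented for illustration)
transport <- c("Bus", "Train", "Car", "Bus", "Walk", "Bus", "Train", "Car")

tab <- table(transport)            # frequency table (Def 2.2)
pct <- prop.table(tab) * 100       # convert counts to percentages

# Bar chart, bars ordered by descending frequency for readability
barplot(sort(tab, decreasing = TRUE))
```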

Quantitative Variables

Why not frequency table? (Slide 2.26)

  • Potentially infinite distinct values ⇒ table would be huge

Key Features to Capture

  • Location (centre)

  • Spread (variability)

  • Optional: shape (skew, modality)

Measures of Location

  • Mean \bar x = \frac{1}{n}\sum_{i=1}^{n} x_i

  • Sensitive to outliers (e.g. Gina Rinehart walks into a bar → average wealth skyrockets)

  • Median M = \begin{cases} x_{((n+1)/2)} & n \text{ odd} \\ \frac{x_{(n/2)} + x_{(n/2+1)}}{2} & n \text{ even} \end{cases} (Def 2.4)

  • Robust to outliers (e.g. engineer salary example)

  • Guidance: use median when distribution skewed / heavy-tailed

Measures of Spread

  • Quartiles (Def 2.5): Q1 & Q3 are medians of lower/upper halves (median included if n odd, consistent with R)

  • Inter-quartile range IQR = Q3-Q1 (Def 2.6) – robust

  • Standard deviation s = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar x)^2}{n-1}} (Def 2.7) – sensitive to outliers

  • Variance s^2 (SD squared)

Five-Number Summary

  1. Min

  2. Q1

  3. Median (Q2)

  4. Q3

  5. Max

  • R: summary(x)

  • Graphical counterpart: boxplot

Outliers

  • Suspected if below Q1 - 1.5\times IQR or above Q3 + 1.5\times IQR (1.5×IQR rule)

  • Show as points on “modified boxplot” (default R behaviour)
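The 1.5×IQR rule can be applied directly with the cheat-sheet functions; a minimal sketch with toy data (the values, including the extreme 30, are invented):

```r
# Toy data with one extreme value (invented for illustration)
x <- c(2, 3, 3, 4, 5, 5, 6, 30)

q     <- quantile(x, c(.25, .75))      # Q1 and Q3 (R's default quantile type)
iqr   <- IQR(x)
lower <- q[[1]] - 1.5 * iqr
upper <- q[[2]] + 1.5 * iqr

outliers <- x[x < lower | x > upper]   # flagged by the 1.5×IQR rule
# boxplot(x) draws these as individual points (modified boxplot, R's default)
```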

Graphical Tools

  • Histogram: good for shape; choose appropriate bin width (breaks=)

  • Boxplot: concise five-number; useful for comparing groups

  • Bar plot for discrete quantitative (narrow bars)

  • Comment on graph in terms of location, spread, shape, outliers

R Cheat-sheet (one variable)

# numbers
mean(x);  sd(x);  median(x)
quantile(x,c(.25,.5,.75));  IQR(x)
summary(x)
# graphs
hist(x); boxplot(x)

Lecture 2 – Relationships Between Two Variables

Learning outcomes (Ch2-L2-O1 – O3)

  1. Pick correct tools when ≥1 variable categorical

  2. Describe scatterplot (presence, strength, direction, type)

  3. Understand Pearson correlation, its derivation, pros/cons

Summary Table (two variables)

Both categorical → 2-way frequency table + clustered bar chart
Cat + Quant → 5-num (or mean/SD) per group + comparative boxplots / clustered bars
Both quantitative → correlation + scatterplot

Case 1: ≥1 Categorical Variable

  • Strategy: split data by categories, then apply single-variable methods within each subgroup, finally compare

  • Two-way table: table(cat1,cat2)

  • Comparative boxplots: boxplot(y ~ cat)
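The split-then-summarise strategy above, sketched with hypothetical data (scores, study modes, and Yes/No responses are all invented for illustration):

```r
# Hypothetical data: exam score (quantitative), study mode and attendance
# (categorical); all values invented for illustration
score      <- c(55, 60, 72, 80, 65, 58, 90, 70)
study_mode <- c("online", "campus", "campus", "online",
                "online", "campus", "campus", "online")
attends    <- c("Yes", "No", "Yes", "Yes", "No", "No", "Yes", "No")

by(score, study_mode, summary)                    # five-number summary per group
boxplot(score ~ study_mode)                       # comparative boxplots

tab2 <- table(study_mode, attends)                # two-way frequency table
barplot(tab2, beside = TRUE, legend.text = TRUE)  # clustered bar chart
```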

Case 2: Both Quantitative → Scatterplot

  • Axes: X – explanatory (if any), Y – response

  • Describe the nature of the relationship using 4 dichotomies:

    1. Existent ↔ non-existent

    2. Strong ↔ weak

    3. Increasing (+) ↔ decreasing (−)

    4. Linear ↔ non-linear
  • Outliers may signal data error or interesting structure (example: Palm Beach 2000 election)

Pearson Correlation (Def 2.8)

  • Standardise each variable: x^\bullet_i = \frac{x_i - \bar x}{s_x}, \; y^\bullet_i = \frac{y_i - \bar y}{s_y}

  • Coefficient r = \frac{\sum_{i=1}^{n} x^\bullet_i\, y^\bullet_i}{n-1}

  • Properties: -1\le r\le1; sign = direction; magnitude = linear strength

  • Cautions:

    • Measures linear association only (Anscombe, NBA points vs age)

    • Sensitive to outliers (interactive examples pushing r from 0.1 to 0.99)

  • Non-linear dependence? → use alternative measures (e.g. Hellinger correlation)
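The standardise-and-average construction of r above can be verified against R's built-in cor(); the data here are invented for illustration:

```r
# Invented paired data for illustration
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8, 12.0)

# Standardise each variable, then average the products (divisor n - 1)
xs <- (x - mean(x)) / sd(x)
ys <- (y - mean(y)) / sd(y)
r_manual <- sum(xs * ys) / (length(x) - 1)

r_manual - cor(x, y)    # essentially zero: the same quantity
```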

R Cheat-sheet (two vars)

plot(y ~ x)           # scatterplot
cor(x,y)              # Pearson r
by(y,cat,summary)     # num summaries by group
barplot(table(cat1,cat2),beside=TRUE)  # clustered bars

Lecture 3 – Least-Squares Regression (Simple Linear)

Learning outcomes (Ch2-L3-O1 – O4)

  • Distinguish explanatory (X) vs response (Y)

  • Understand construction & meaning of least-squares line

  • Use line for prediction; interpret slope & intercept

  • Recognise danger of extrapolation

Model Form

\hat y = b_0 + b_1 x

  • b_1 (slope): change in \hat y for one-unit increase in x

  • b_0 (intercept): predicted y at x=0 (may be nonsensical)

Least-Squares Criterion

  • Choose b_0, b_1 to minimise \sum_{i=1}^{n}(y_i - \hat y_i)^2 (sum of squared residuals)

  • Historical: Gauss/Laplace (~1800)
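The minimiser of the squared-residual sum has the standard closed form b_1 = r\,s_y/s_x, b_0 = \bar y - b_1\bar x; a sketch comparing it to lsfit(), with invented data:

```r
# Invented data for illustration
x <- c(1, 2, 3, 4, 5)
y <- c(2.0, 4.1, 5.9, 8.2, 9.8)

# Closed-form least-squares solution: b1 = r * sy/sx, b0 = ybar - b1 * xbar
b1 <- cor(x, y) * sd(y) / sd(x)
b0 <- mean(y) - b1 * mean(x)

fit <- lsfit(x, y)
coef(fit)    # matches c(b0, b1)
```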

Interpretation Examples

  • Ski-jump scores: \hat y = 110.44 + 1.16x ⇒ every extra point in round 1 adds ≈1.16 points to the predicted total

  • Snake length–weight example for visual intuition

Prediction & Extrapolation

  • Prediction: plug new x within data range → \hat y

  • Extrapolation: using x outside observed range; may produce impossible results (US farmer population → negative)

R Workflow

fit <- lsfit(x,y)        # returns list with coefficients
coef(fit)                # b0,b1
plot(x,y); abline(coef(fit))
# predict
x.new <- 137
y.pred <- coef(fit)[1] + coef(fit)[2]*x.new

Lecture 4 – Residuals & r^2

Learning outcomes (Ch2-L4-O1 – O4)

  1. Always visualise data (Anscombe’s quartet)

  2. Use residual plots to assess linear model assumptions

  3. Compute & interpret r^2 (strength of regression)

  4. Handle outliers; concepts of leverage & influence

Residuals (Def 2.9)

e_i = y_i - \hat y_i = y_i - (b_0 + b_1 x_i)

  • Provide diagnostic info; measured in same units as y

Residual Plot (Def 2.10)

  • Scatter of e_i vs x_i (or vs \hat y_i)

  • Ideal pattern: random cloud centred at 0 (no curvature, equal spread)

  • Systematic structure (arches, funnels) ⇒ violation of linearity/constant variance

  • Example: Janka hardness shows arch → switch to quadratic model
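A minimal residual-plot sketch using the cheat-sheet's lsfit()/residuals() workflow (data invented for illustration):

```r
# Invented, roughly linear data for illustration
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(1.2, 2.1, 2.9, 4.2, 4.8, 6.1, 7.0, 7.9)

fit <- lsfit(x, y)
e   <- residuals(fit)      # e_i = y_i - yhat_i

plot(x, e)                 # residual plot
abline(h = 0, lty = 2)     # ideal: random scatter around this line
```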

Coefficient of Determination r^2

  • For simple linear regression with intercept, the coefficient of determination equals r^2, the square of the Pearson correlation r

  • Interpretation: proportion of variance in y explained by model

    r^2 = \frac{\text{Var}(\hat y)}{\text{Var}(y)}

  • Range 0–1 (report as %). Example: temperature explains 81 % of variation in log electricity use

Alternative formula (standardised data)

r^2 = 1 - \frac{\sum_{i=1}^{n} (y^\bullet_i - \hat y^\bullet_i)^2}{n}

⇒ closer points to fitted line in standardised scale → higher r²
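Both readings of r² above — the square of the Pearson correlation and the variance ratio Var(\hat y)/Var(y) — can be checked numerically (data invented for illustration):

```r
# Invented data for illustration
x <- c(1, 2, 3, 4, 5, 6)
y <- c(1.9, 4.2, 5.8, 8.1, 9.7, 12.3)

fit  <- lsfit(x, y)
yhat <- y - residuals(fit)          # fitted values

r2_from_r   <- cor(x, y)^2          # square of Pearson correlation
r2_from_var <- var(yhat) / var(y)   # Var(yhat) / Var(y)

r2_from_r - r2_from_var             # essentially zero: same quantity
```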

Outliers, Leverage, Influence

  • Residual large → vertical outlier

  • Leverage: extreme x value

  • Influential: point whose inclusion markedly changes fit

  • Handling rules of thumb:

    • If the outlier doesn’t affect results → keep it

    • If it is a clear data error / atypical case → justify removal & report both analyses

  • Examples: Mitsubishi Lancer ad (high price), Lift capacity dataset
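A single high-leverage point far from the trend can flip the fitted slope; a sketch with invented data:

```r
# Invented, clearly linear data for illustration
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.0, 8.1, 9.9)
slope_without <- lsfit(x, y)$coefficients[[2]]

# Add one influential point: high leverage (extreme x) and far from the trend
x2 <- c(x, 20)
y2 <- c(y, 2)
slope_with <- lsfit(x2, y2)$coefficients[[2]]

c(slope_without, slope_with)   # the single point drags the slope dramatically
```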

Summary of Regression Diagnostics Workflow

  1. Plot data, compute r

  2. Fit least-squares line (within appropriate x-range)

  3. Plot residuals; look for non-random patterns & equal variance

  4. Calculate r^2; gauge practical usefulness

  5. Investigate outliers/leverage; consider alternative models if needed


Global R Command Reference

# One variable
mean(x); sd(x); median(x); IQR(x)
summary(x); table(cat)
barplot(table(cat)); hist(x); boxplot(x)

# Two variables
plot(y~x)                 # scatter
cor(x,y)                  # correlation r
lsfit(x,y)                # simple linear regression
residuals(lsfit(x,y))     # residuals
abline(lsfit(x,y))
by(y,cat,summary)         # summaries by category

# Misc
quantile(x,c(.25,.75))
IQR(x)

Key Terminology
  • Statistical distribution, frequency table

  • Mean \bar x, Median M, Quartiles Q1,Q3, IQR, SD s, Variance s^2

  • Five-number summary

  • Bar chart vs Histogram vs Boxplot

  • Shape: symmetric, left-skew, right-skew; modality (uni/bi)

  • Outlier (1.5×IQR rule)

  • Scatterplot, Positive/Negative association, Strength, Linearity

  • Pearson correlation r

  • Explanatory vs Response variable

  • Least-squares regression line (intercept b0, slope b1)

  • Prediction; Extrapolation

  • Residuals, Residual plot

  • r^2 (coefficient of determination)

  • Leverage, Influence


Ethical & Practical Considerations

  • Data collection quality dictates validity; flawed collection → flawed study

  • “Association ≠ Causation” (observational vs experimental designs)

  • Randomisation, comparison, replication principles for experiments

  • Always inspect data graphically before trusting numerical summaries

  • Report methodology transparently (e.g., treatment of outliers, model assumptions)