Course: MATH1041 – Statistics for Life & Social Sciences
Delivered by the UNSW School of Mathematics & Statistics – Faculty of Science.
Lecture notes compiled by Y. Bunjamin, P. Lafaye de Micheaux, L. Helme-Guizon, J. Stocklosa, D. Warton & previous lecturers.
Teaching team: lecturers (Pierre, Bill, Yudhi, Larraine, Hilda, Jonathan, …), tutors, helpers, admin.
Platforms & tools:
Moodle (announcements, datasets, Chapter 0 resources, weekly Mobius lessons).
R/RStudio or Posit Cloud, R scripts (e.g., mindmap.R).
How-to guides: RStudio manual, dataset folder, PollEverywhere/Kahoot links.
Weekly announcements: exclusively on Moodle – not repeated in lectures.
Over-arching aim: Introduce Statistics as “the science of collecting, analysing & interpreting data”.
Master both “turning a story into maths formalism” and “translating mathematical results back into plain language”.
Develop statistical vocabulary, computing literacy with R, and deep conceptual understanding.
Chapter 1 is split over 4 lectures:
1. Introduction to Data Collection & Organisation
2. Sources of Data & Variable Types
3. Observational Studies vs Experiments
4. Experimental Designs
Direct mapping to textbook (Moore et al., 2021): Sections 1.1, 2.7, 3.1, 3.2.
Recurring learning cycle for each lecture: Learning Outcomes (O-tags) → Success Criteria → Think-Pair-Share / Polls → Kahoot / R demos → Reflection.
Learning Outcomes
O1: Recognise that data carry context & purpose – research question dictates what data are needed.
O2: Master vocabulary: data, data set, population, cases, labels, variables, number of variables (p), sample size (n).
O3: Realise data can be stored in multiple file formats (rectangular tables, “complex” non-tabular structures, etc.).
Key success checkpoints
Precisely phrase a research question.
Identify the population, required variables, sample, n & p.
Wall-of-Knowledge Prezi: positions stats as integrative layer connecting scientific disciplines.
Raw survey table (2019 T2, 60 rows shown) → visually overwhelming.
Statistics goal: transform a “bunch of numbers” into insight.
Human limitation: pattern-finding in large tables is hard – need summaries & visualisations.
Is MATH1041 a good course?
“Good” can mean enjoyable, prepares for future studies, teaches job-relevant basics, …
Each definition changes population, variables, and measurement method (e.g., satisfaction scores vs later research-project marks).
Flu epidemic scenario: students practise defining population, variables, etc.
Choose data based on research question.
Decide how to collect → design of experiments / observational studies / simulations.
Organise data (notebooks, files, databases, DNA storage!).
Describe (metadata + numerical / graphical summaries).
Analyse (relationships, inference).
Population: entire group of interest (e.g., all MATH1041 students in 2025 T2).
Cases / observational units: individual members of the population (e.g., one student with ≥1 mark).
Labels / IDs: unique identifiers (e.g., zID).
Sample: subset actually studied; size (n).
Variable: measurable attribute; total number (p).
Observation: vector of variable values for one case.
Example – marks file after Week 11:
Population: whole cohort.
Sample: students appearing in Excel file.
Variables: one per assessment (5).
Thus p = 5 and n = number of rows.
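The marks example can be sketched in R with a made-up data frame. The zIDs, column names and marks below are hypothetical, not course data. Rows are cases, so n is the number of rows; the zID column is a label rather than a variable, so p counts the remaining columns.

```r
# Hypothetical marks file: one row per student (case), one column per assessment.
marks <- data.frame(
  zID    = c("z0000001", "z0000002", "z0000003", "z0000004"),  # labels, not a variable
  quiz1  = c(8, 6, 9, 7),
  quiz2  = c(7, 5, 10, 6),
  assign = c(15, 12, 18, 14),
  lab    = c(9, 8, 10, 7),
  test   = c(22, 17, 28, 20)
)

n <- nrow(marks)        # sample size: number of cases
p <- ncol(marks) - 1    # number of variables: exclude the label column

# One observation = the vector of variable values for a single case.
obs <- marks[marks$zID == "z0000003", -1]

cat("n =", n, " p =", p, "\n")
print(obs)
```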
Common extensions this course: .txt, .csv, .dat, .xls, .xlsx, .RData.
Distinguish ASCII (plain text) vs binary files: open the file in Notepad to peek; readable characters indicate a text file.
Hands-on exercise: classify Moodle files (ageinc.dat, ApartmentList.txt, titanic.csv, Mondanat.img, Mondanat.hdr, 1041.RData).
RStudio demo: load MATH1041-2024T1.csv and RData via GUI & code:
survey.df <- data.frame(mget(ls()))
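A hedged sketch of the code route for loading data. It writes a tiny made-up CSV and .RData file to temporary paths instead of using the course files, and it assumes the .RData file stores each column as a separate object, which is the situation where collecting objects with mget(ls()) into a data frame makes sense. Loading into a fresh environment (the name e is arbitrary) keeps unrelated workspace objects out of the result.

```r
# Write a tiny made-up CSV to a temporary file so the sketch is self-contained.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("height,travel", "170,bus", "165,walk", "182,train"), csv_path)

survey_csv <- read.csv(csv_path)   # text (ASCII) file -> data frame
str(survey_csv)

# Save two separate vectors to a binary .RData file,
# as if each column had been stored as its own object (assumption).
height <- survey_csv$height
travel <- survey_csv$travel
rdata_path <- tempfile(fileext = ".RData")
save(height, travel, file = rdata_path)

# Load into a fresh environment so that only the file's objects are collected.
e <- new.env()
load(rdata_path, envir = e)
survey.df <- data.frame(mget(ls(e), envir = e))  # bind the loaded objects as columns
print(survey.df)
```

Calling data.frame(mget(ls())) at the console does the same thing for every object in the global workspace, so it only gives a sensible table when the workspace holds nothing but equal-length columns.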
Learning Outcomes
O1: Identify anecdotal, available, and self-collected/simulated data.
O2: Differentiate categorical vs quantitative variables.
O3: State units for quantitative variables.
Anecdotal: unsystematic, single experiences → prone to bias (e.g., “Coffee improved my attention last week.”).
Available: previously recorded for another purpose (e.g., ABS census tables, UNSW enrolment records).
Collect / simulate your own: design surveys, experiments or computer simulations (Honours thesis example).
Population vs Sample (Def 1.4).
Census = measure whole population (Def 1.5); often infeasible (time, cost, ethics).
Sample survey (Def 1.6) & voluntary response (Def 1.7) → bias risk: loud voices dominate.
Convenience sampling (Def 1.8): choose easiest units – low external validity.
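The sampling schemes above can be compared on a simulated population. Everything here is hypothetical: the population of 1000 satisfaction scores, the sample size of 50, and especially the reply probabilities used to mimic voluntary response, which are an assumption chosen only to make extreme opinions over-represented.

```r
set.seed(1041)  # arbitrary seed, for reproducibility only

# Hypothetical population: 1000 students with a satisfaction score from 0 to 10.
N <- 1000
population <- data.frame(
  id = seq_len(N),
  satisfaction = sample(0:10, N, replace = TRUE)
)

# Census: measure every unit.
census_mean <- mean(population$satisfaction)

# Simple random sample of n = 50: every unit has the same chance of selection.
srs <- population[sample(N, 50), ]

# Convenience sample: the 50 easiest units to reach (here, simply the first 50 rows).
convenience <- population[1:50, ]

# Voluntary response (assumed mechanism): extreme opinions are more likely to reply.
extreme   <- population$satisfaction <= 2 | population$satisfaction >= 9
p_reply   <- ifelse(extreme, 0.5, 0.05)
voluntary <- population[runif(N) < p_reply, ]

cat("Share of extreme scores, population:", mean(extreme), "\n")
cat("Share of extreme scores, voluntary :",
    mean(voluntary$satisfaction <= 2 | voluntary$satisfaction >= 9), "\n")
```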
Decision algorithm (Slides 1.50–1.51):
1. No order → categorical.
2. Ordered, continuous scale → continuous quantitative.
3. Countable integers with meaningful differences → discrete quantitative.
4. Ordered categories without equal spacing → ordinal (treated as categorical unless using special methods).
Examples:
Satisfaction 0–10 → quantitative (discrete).
Travel method → categorical.
Temperature (°C) → quantitative (continuous).
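One possible way to mirror these variable types in R, using made-up values. The year_level variable is a hypothetical ordinal example that does not come from the slides, and is_quantitative() is an illustrative helper, not a base R function.

```r
# Categorical (no order): stored as a factor.
travel <- factor(c("bus", "walk", "train", "bus"))

# Ordinal (hypothetical example): ordered factor, order known but spacing not.
year_level <- factor(c("first", "third", "second", "first"),
                     levels = c("first", "second", "third"),
                     ordered = TRUE)

# Discrete quantitative: integer scores on a 0-10 scale.
satisfaction <- c(7L, 9L, 4L, 10L)

# Continuous quantitative: temperature, units are degrees Celsius.
temperature_c <- c(21.4, 18.9, 25.1, 23.0)

# Illustrative helper: numeric vectors are treated as quantitative.
is_quantitative <- function(x) is.numeric(x)

print(sapply(
  list(travel = travel, year_level = year_level,
       satisfaction = satisfaction, temperature_c = temperature_c),
  is_quantitative
))
```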
Variable type determines:
Summary numbers (mean/SD vs frequency table).
Appropriate plots (histogram/boxplot vs bar chart).
Inference procedure (e.g., $\chi^2$ test vs two-sample t-test).
Course roadmap table provided linking variable combination to R functions & statistical tests:
One quantitative: $\bar x$, $s$, histogram; CI/test on $\mu$.
Two categorical: contingency table, $\chi^2$ independence.
Etc.
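The first two rows of the roadmap can be sketched on simulated data. The variables (hours, travel, works) and their distributions are hypothetical; the inference functions are only previewed here, since the tests themselves are covered later in the course.

```r
set.seed(1)  # arbitrary seed

# Hypothetical data for 80 students.
hours  <- round(rnorm(80, mean = 10, sd = 3), 1)                  # quantitative
travel <- sample(c("bus", "train", "walk"), 80, replace = TRUE)   # categorical
works  <- sample(c("yes", "no"), 80, replace = TRUE)              # categorical

# One quantitative variable: mean, SD, and a confidence interval for the mean.
xbar <- mean(hours)
s    <- sd(hours)
ci   <- t.test(hours)$conf.int
# hist(hours)   # histogram; run in an interactive session

# Two categorical variables: contingency table and chi-squared test of independence.
tab <- table(travel, works)
chi <- chisq.test(tab)

cat("mean =", xbar, " sd =", s, "\n")
print(ci)
print(tab)
print(chi)
```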
Learning Outcomes
O1: Distinguish study types.
O2: Identify explanatory (independent) vs response (dependent) variables.
O3: Recognise explanations for association (common response, causation, confounding).
Observational: measure variables as they occur, no intervention (e.g., hospital size vs stay length).
Experiment: researcher imposes treatment (coffee/no-coffee study).
Sample survey = special observational study.
Association is not causation: demonstrated via Spurious Correlations (Tyler Vigen), e.g., US science spending vs suicides by hanging, Nicolas Cage films vs drownings, cheese consumption vs bedsheet entanglements.
Explanatory diagrams: double-headed arrows = association, single-headed = causation.
Explanatory variable (x) vs Response (y) (Def 1.11).
Lurking variable (z) (Def 1.12): unmeasured but influential.
Confounding (Def 1.14): effects of two variables intermixed.
Example: Parent BMI & child diet on Child BMI; heredity vs environment conflated.
Common response: both variables respond to unseen cause (temperature → ice-cream & heat strokes).
Causation: moon gravity → tides (rarely provable without experiment).
Confounding: exercise vs fitness muddled by age.
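A small simulation can illustrate common response. All numbers are invented: temperature plays the unseen cause, and ice-cream sales and heat strokes each depend on it but not on each other. Correlating the residuals after regressing each variable on temperature is one simple way to see the association vanish once the common cause is accounted for.

```r
set.seed(42)  # arbitrary seed
days <- 365

# Hypothetical daily data: temperature is the lurking common cause (z).
temperature <- rnorm(days, mean = 22, sd = 6)
ice_cream   <- 50 + 4 * temperature + rnorm(days, sd = 10)   # responds to z only
heat_stroke <- 2 + 0.5 * temperature + rnorm(days, sd = 2)   # responds to z only

# Raw association between the two responses.
r_raw <- cor(ice_cream, heat_stroke)

# Remove the part of each variable explained by temperature,
# then correlate what is left over.
r_adjusted <- cor(resid(lm(ice_cream ~ temperature)),
                  resid(lm(heat_stroke ~ temperature)))

cat("raw correlation     :", round(r_raw, 2), "\n")
cat("adjusted correlation:", round(r_adjusted, 2), "\n")
```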
Bradford Hill criteria (selection): strength, consistency, dose–response, temporality, biological plausibility.
Smoking–cancer discussion: ethical limits prevent RCT; weight of evidence still establishes causation.
Learning Outcomes
O1: Appreciate experiments for causal insight.
O2: Describe subjects, factors, levels, treatments, response.
O3: Compare design types & evaluate efficiency.
Subjects / Experimental units.
Factor: manipulated explanatory variable.
Levels: categories/values of a factor.
Treatment: specific combo of factor levels.
Response variable: outcome measured post-treatment.
Example – Tennessee STAR:
Factor: class size (3 levels) → 3 treatments, one per level.
Response: standardised test scores.
Compare: include control (placebo) or baseline.
Randomise: allocate subjects to treatments randomly → balances confounders.
Repeat: apply each treatment to many subjects (reduce chance variation).
Replicate: redo entire experiment independently → confirm findings.
Randomised Comparative Experiment (Def 1.17): classic multi-arm RCT.
Matched Pairs (Def 1.18): pairs of similar subjects or before–after on same subject; higher precision.
Randomised Block: generalisation; subjects grouped into homogeneous blocks, treatments randomised within block.
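A minimal sketch of randomising within blocks. The 12 subjects, the three age-group blocks and the two treatment labels are hypothetical; the point is only that sample() is applied separately inside each block, so every block receives a balanced mix of treatments.

```r
set.seed(7)  # arbitrary seed

# Hypothetical subjects: 12 people in 3 homogeneous blocks of 4.
subjects <- data.frame(
  id    = 1:12,
  block = rep(c("young", "middle", "older"), each = 4)
)
treatments <- c("A", "B")

# Within each block, shuffle a balanced set of treatment labels.
subjects$treatment <- NA_character_
for (b in unique(subjects$block)) {
  rows <- which(subjects$block == b)
  subjects$treatment[rows] <- sample(rep(treatments, length.out = length(rows)))
}

# Each block should contain two subjects per treatment.
print(table(subjects$block, subjects$treatment))
```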
Example – Smartphone & driving simulator:
Design 1 (two independent groups) → Randomised Comparative.
Design 2 (each student both conditions) → Matched Pairs; requires fewer subjects, controls inter-individual variability.
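Why matched pairs can be more precise is easier to see in a simulation. The reaction times below are invented, with large person-to-person variation and an assumed phone effect of +40 ms. Analysing the same numbers as if they came from two independent groups is done purely to compare interval widths; it is not how data from Design 1 would actually arise.

```r
set.seed(2025)  # arbitrary seed
n <- 30

# Hypothetical reaction times (ms): each student has their own typical level.
baseline   <- rnorm(n, mean = 600, sd = 80)
no_phone   <- baseline + rnorm(n, sd = 15)
with_phone <- baseline + 40 + rnorm(n, sd = 15)   # assumed true effect: +40 ms

# Matched pairs: analyse the within-student differences.
paired <- t.test(with_phone, no_phone, paired = TRUE)

# Same numbers treated as two independent groups, for comparison only.
unpaired <- t.test(with_phone, no_phone)

# Width of a confidence interval from a t.test result.
ci_width <- function(tt) diff(tt$conf.int)

cat("CI width, paired  :", round(ci_width(paired), 1), "\n")
cat("CI width, unpaired:", round(ci_width(unpaired), 1), "\n")
```

Pairing removes the between-student variation from the comparison, which is why the paired interval is much narrower here.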
Advertising study: factor of interest = ad frequency; nuisance = ad duration.
Full factorial design: $3 \times 2 = 6$ treatments, 10 students each.
Ignoring duration could mask or even reverse the apparent effect of frequency (Simpson’s paradox demonstration).
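A sketch of how a 3 × 2 full factorial layout with 10 students per treatment could be set up. The level names a1–a3 and b1–b2 are placeholders, since the notes give only the layout and not which factor has three levels; the student IDs are likewise hypothetical.

```r
set.seed(3)  # arbitrary seed

# All combinations of factor levels: 3 x 2 = 6 treatments.
design <- expand.grid(
  factor_a = c("a1", "a2", "a3"),   # placeholder level names
  factor_b = c("b1", "b2")
)
design$treatment <- seq_len(nrow(design))

# 60 hypothetical students, 10 per treatment, allocated at random.
students <- data.frame(id = 1:60)
students$treatment <- sample(rep(design$treatment, each = 10))

# Attach the factor levels to each student's allocated treatment.
allocation <- merge(students, design, by = "treatment")

print(design)
print(table(allocation$factor_a, allocation$factor_b))
```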
Traditional: numbered balls in urn, coin flips.
Modern: R’s sample(); example code provided to assign 200 sailors.
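The course's own code for the 200 sailors is not reproduced in these notes, so the following is only a sketch under stated assumptions: two treatments of 100 sailors each, with hypothetical sailor labels and treatment names.

```r
set.seed(1747)  # arbitrary seed so the allocation can be reproduced

sailors <- sprintf("sailor_%03d", 1:200)   # hypothetical labels

# Random permutation of all sailors (sampling without replacement).
shuffled <- sample(sailors)

# Assumption: two treatments, 100 sailors each.
allocation <- data.frame(
  sailor    = shuffled,
  treatment = rep(c("citrus", "control"), each = 100)
)

print(head(allocation))
print(table(allocation$treatment))
```

Shuffling the whole list and then cutting it into groups guarantees equal group sizes, which independent coin flips would not.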
Controlled scurvy trial – James Lind (1747).
Early randomisation – Peirce & Jastrow (1880s); Fisher & Neyman (1920s agricultural experiments).
Persian physician al-Razi (9th cent.) used control group.
Ethical constraints dictate design feasibility (e.g., cannot force smoking).
Poor control choice → placebo needed.
Extraneous changes (pill colour).
Lack of blinding → introduce double-blind procedures.
Low realism → replicate real-world settings.
Insufficient sample size or lack of replication.
Anecdotal evidence, Available data, Census, Sample, Voluntary response, Convenience sample.
Observational Study, Experiment, Association, Causation, Common response, Confounding.
Subjects, Factor, Level, Treatment, Response variable.
Control group, Placebo, Double-blind, Randomised Comparative, Matched Pairs, Randomised Block.
Sample size: n
Number of variables: p
Body Mass Index: $\text{BMI} = \dfrac{\text{weight (kg)}}{\text{height (m)}^{2}}$
Five-number summary: $\min,\; Q_1,\; \text{median},\; Q_3,\; \max$
Correlation coefficient: $r$; regression slope $\beta_1$ (covered later).
Confidence interval examples: $\text{CI}_{\mu},\; \text{CI}_{p},\; \text{CI}_{\mu_1-\mu_2}$.
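The BMI formula and the five-number summary can be tried on a few made-up measurements. The weights and heights are hypothetical. Note that fivenum() uses Tukey's hinges for the quartiles, so its Q1 and Q3 can differ slightly from those returned by quantile() or summary().

```r
# Hypothetical measurements for five people.
weight_kg <- c(60, 72, 85, 54, 95)
height_m  <- c(1.65, 1.80, 1.75, 1.60, 1.90)

# BMI = weight (kg) / height (m)^2, computed element-wise.
bmi <- weight_kg / height_m^2

# Five-number summary: min, Q1, median, Q3, max.
five <- fivenum(bmi)
names(five) <- c("min", "Q1", "median", "Q3", "max")

print(round(bmi, 2))
print(round(five, 2))
```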
Read CSV: read.csv("file.csv") → data frame.
Inspect structure: str(df); open spreadsheet-like viewer by clicking in Environment.
Basic summaries: summary(x); frequencies table(x); grouped summary by(y, group, summary).
Plots: hist(x), boxplot(x), barplot(table(x)), plot(y ~ x).
Randomisation: sample(pop, size, replace = FALSE).
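A short run-through of the inspection, summary and plotting functions on a hypothetical six-row data frame (the coffee/none groups and scores are invented). tapply() is added here as a compact alternative to by() when only group means are wanted, and pdf(NULL) is used so that the plots are discarded when the script runs non-interactively; in RStudio that line and dev.off() can be dropped.

```r
# Hypothetical data: attention score for students with and without coffee.
df <- data.frame(
  group = c("coffee", "coffee", "coffee", "none", "none", "none"),
  score = c(7.5, 8.0, 6.5, 6.0, 7.0, 5.5)
)

str(df)                                   # structure: types and first values
print(summary(df$score))                  # numerical summary of a quantitative variable
print(table(df$group))                    # frequencies of a categorical variable
print(by(df$score, df$group, summary))    # grouped summary

group_means <- tapply(df$score, df$group, mean)
print(group_means)

pdf(NULL)                                 # discard plots in a non-interactive run
hist(df$score)
boxplot(score ~ group, data = df)
barplot(table(df$group))
invisible(dev.off())
```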
GIGO principle: “Garbage In – Garbage Out” → flawless analysis cannot rescue flawed data collection.
Experiments often limited by cost, time, ethics (e.g., cannot assign harmful smoking).
Importance of anonymisation (zID codes) & privacy in educational datasets.
Statistical thinking transcends subject matter – applies in science, social policy, medicine, business.
Chapter 1 lays foundation for later topics:
Numerical summaries, graphical methods (Week 2).
Probability & inference (later weeks).
Regression & ANOVA rely on clear distinction of explanatory/response variables and proper experimental design.
Upcoming preparatory tasks:
Install/verify RStudio via weekly Mobius lesson.
Begin personal summary notes early – cumulative advantage.
Moore, McCabe & Craig (2021) Introduction to the Practice of Statistics, 10th ed.
McKeachie & Svinicki (2014) Teaching Tips – used for Think-Pair-Share pedagogy.
Lafaye de Micheaux et al. (2021) – internal Rmarkdown slide-creation package.
Tyler Vigen – Spurious Correlations website.
Historical papers: Peirce & Jastrow (1884), Fisher & Neyman (1920s), Lind’s scurvy trial (1747).