MATH1041 Study Design and Data Collection - Vocabulary Flashcards
Course Context & Logistics
Course: MATH1041 – Statistics for Life & Social Sciences
Delivered by the UNSW School of Mathematics & Statistics – Faculty of Science.
Lecture notes compiled by Y. Bunjamin, P. Lafaye de Micheaux, L. Helme-Guizon, J. Stocklosa, D. Warton & previous lecturers.
Teaching team: lecturers (Pierre, Bill, Yudhi, Larraine, Hilda, Jonathan, …), tutors, helpers, admin.
Platforms & tools:
Moodle (announcements, datasets, Chapter 0 resources, weekly Mobius lessons).
R/RStudio or Posit Cloud, R scripts (e.g., mindmap.R).
How-to guides: RStudio manual, dataset folder, PollEverywhere/Kahoot links.
Weekly announcements: exclusively on Moodle – not repeated in lectures.
Over-arching aim: Introduce Statistics as “the science of collecting, analysing & interpreting data”.
Master both “turning a story into maths formalism” and “translating mathematical results back into plain language”.
Develop statistical vocabulary, computing literacy with R, and deep conceptual understanding.
Study Design – Big Picture
Chapter 1 is split over 4 lectures:
1. Introduction to Data Collection & Organisation
2. Sources of Data & Variable Types
3. Observational Studies vs Experiments
4. Experimental Designs
Direct mapping to textbook (Moore et al., 2021): Sections 1.1, 2.7, 3.1, 3.2.
Recurring learning cycle for each lecture: Learning Outcomes (O-tags) → Success Criteria → Think-Pair-Share / Polls → Kahoot / R demos → Reflection.
Lecture 1 – Introduction to Data Collection & Organisation
Learning Outcomes
O1: Recognise that data carry context & purpose – research question dictates what data are needed.
O2: Master vocabulary: data, data set, population, cases, labels, variables, number of variables (p), sample size (n).
O3: Realise data can be stored in multiple file formats (rectangular tables, “complex” non-tabular structures, etc.).
Key success checkpoints
Precisely phrase a research question.
Identify the population, required variables, sample, (n) & (p).
Wall-of-Knowledge Prezi: positions stats as integrative layer connecting scientific disciplines.
1 • From Raw Data to Information
Raw survey table (2019 T2, 60 rows shown) → visually overwhelming.
Statistics goal: transform a “bunch of numbers” into insight.
Human limitation: pattern-finding in large tables is hard – need summaries & visualisations.
2 • “What data to collect?” mini-cases
Is MATH1041 a good course?
“Good” can mean enjoyable, prepares for future studies, teaches job-relevant basics, …
Each definition changes population, variables, and measurement method (e.g., satisfaction scores vs later research-project marks).
Flu epidemic scenario: students practise defining population, variables, etc.
3 • Statistical Analysis Workflow
Choose data based on research question.
Decide how to collect → design of experiments / observational studies / simulations.
Organise data (notebooks, files, databases, DNA storage!).
Describe (metadata + numerical / graphical summaries).
Analyse (relationships, inference).
4 • Core Vocabulary (‘Data Sets III’ slides)
Population: entire group of interest (e.g., all MATH1041 students in 2025 T2).
Cases / observational units: individual members from population (e.g., one student with ≥1 mark).
Labels / IDs: unique identifiers (e.g., zID).
Sample: subset actually studied; size (n).
Variable: measurable attribute; total number (p).
Observation: vector of variable values for one case.
Example – marks file after Week 11:
Population: whole cohort.
Sample: students appearing in Excel file.
Variables: one per assessment (5).
Thus p = 5 and n = number of rows.
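A minimal sketch of reading n and p off a table in R. The marks table below is made up (the zIDs, column names and marks are hypothetical, not the real course file); only the idea "rows = cases, columns = variables" comes from the notes.

```r
# Hypothetical marks table: one row per student (case), one column per assessment.
marks <- data.frame(
  quiz1  = c(7, 9, 6),
  quiz2  = c(8, 5, 9),
  assign = c(15, 18, 12),
  lab    = c(4, 5, 3),
  exam   = c(55, 71, 48),
  row.names = c("z0000001", "z0000002", "z0000003")  # labels, not variables
)

n <- nrow(marks)  # sample size: number of cases (rows)
p <- ncol(marks)  # number of variables (columns)
```

Storing the labels as row names keeps them out of the variable count, so p stays at 5.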
5 • File Formats & Practicalities
Common extensions this course: .txt, .csv, .dat, .xls, .xlsx, .RData.
Distinguish ASCII vs binary (use Notepad peek).
Hands-on exercise: classify Moodle files (ageinc.dat, ApartmentList.txt, titanic.csv, Mondanat.img, Mondanat.hdr, 1041.RData).
RStudio demo: load MATH1041-2024T1.csv and RData via GUI & code:
survey.df <- data.frame(mget(ls()))
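A self-contained sketch of the text vs binary distinction, assuming a tiny made-up data set written to temporary files (the object and file names are hypothetical; the course files themselves are not used here).

```r
# Write a tiny hypothetical data set to temporary files so the example is self-contained.
toy <- data.frame(height = c(170, 165, 180), travel = c("bus", "walk", "bus"))
csv_path   <- tempfile(fileext = ".csv")
rdata_path <- tempfile(fileext = ".RData")
write.csv(toy, csv_path, row.names = FALSE)
save(toy, file = rdata_path)

# Text (ASCII) file: read.csv() parses it into a data frame.
from_csv <- read.csv(csv_path)

# Binary file: load() restores the saved objects; here into a separate environment
# so that nothing in the workspace is overwritten.
env <- new.env()
loaded_names <- load(rdata_path, envir = env)
from_rdata <- env$toy
```

load() returns the names of the restored objects, which is a quick way to see what an .RData file contained.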
Lecture 2 – Sources of Data & Variable Types
Learning Outcomes
O1: Identify anecdotal, available, and self-collected/simulated data.
O2: Differentiate categorical vs quantitative variables.
O3: State units for quantitative variables.
1 • Sources of Data – Definitions & Examples
Anecdotal: unsystematic, single experiences → prone to bias.
E.g.: “Coffee improved my attention last week.”
Available: previously recorded for another purpose (e.g., ABS census tables, UNSW enrolment records).
Collect / simulate your own: design surveys, experiments or computer simulations (Honours thesis example).
2 • Sampling & Surveys
Population vs Sample (Def 1.4).
Census = measure whole population (Def 1.5); often infeasible (time, cost, ethics).
Sample survey (Def 1.6) & voluntary response (Def 1.7) → bias risk: loud voices dominate.
Convenience sampling (Def 1.8): choose easiest units – low external validity.
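A small simulation sketch of why convenience sampling can mislead. Everything here is an assumption made for illustration: the population, the "study hours" variable, and the idea that the easiest-to-reach students are the keenest.

```r
set.seed(1041)  # arbitrary seed so the sketch is reproducible

# Hypothetical population: 1000 students, sorted so that keen students come first.
population <- data.frame(
  id    = 1:1000,
  hours = sort(round(rnorm(1000, mean = 6, sd = 2), 1), decreasing = TRUE)
)

# Convenience sample: the 50 easiest-to-reach units (here, simply the first 50 rows).
convenience <- population[1:50, ]

# Simple random sample: every student has the same chance of selection.
srs <- population[sample(nrow(population), size = 50), ]

mean(population$hours)   # population mean
mean(convenience$hours)  # systematically too high in this set-up
mean(srs$hours)          # typically close to the population mean
```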
3 • Variable Types
Decision algorithm (Slides 1.50–1.51):
1. No order → categorical.
2. Ordered, continuous scale → continuous quantitative.
3. Countable integers with meaningful differences → discrete quantitative.
4. Ordered categories without equal spacing → ordinal (treated as categorical unless using special methods).
Examples:
Satisfaction 0–10 → quantitative (discrete).
Travel method → categorical.
Temperature (°C) → quantitative (continuous).
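One possible way to store each variable type in R, sketched with made-up responses from four students (the variable names and values are hypothetical).

```r
# Hypothetical responses from four students.
travel       <- factor(c("bus", "walk", "train", "bus"))   # categorical (no order)
agreement    <- factor(c("low", "high", "medium", "high"),
                       levels = c("low", "medium", "high"),
                       ordered = TRUE)                      # ordinal
satisfaction <- c(7L, 9L, 4L, 8L)                           # discrete quantitative (0-10)
temperature  <- c(21.4, 19.8, 23.1, 20.0)                   # continuous quantitative, degrees C

is.factor(travel)         # TRUE
is.ordered(agreement)     # TRUE
is.integer(satisfaction)  # TRUE
is.numeric(temperature)   # TRUE
```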
4 • Why the distinction matters
Determines:
Summary numbers (mean/SD vs frequency table).
Appropriate plots (histogram/boxplot vs bar chart).
Inference procedure (e.g., \chi^2 test vs two-sample t-test).
Course roadmap table provided linking variable combination to R functions & statistical tests:
One quantitative: \bar x,\; s,\;\text{histogram}; CI/test on \mu.
Two categorical: contingency table, \chi^2 independence.
Etc.
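A sketch of how the variable type drives the choice of summary, using made-up data (variable names and values are hypothetical; this is not the course roadmap table itself).

```r
# Hypothetical data: one quantitative and two categorical variables.
height <- c(170, 165, 180, 175, 160, 172)
travel <- c("bus", "walk", "bus", "train", "walk", "bus")
campus <- c("on", "off", "off", "on", "off", "on")

# One quantitative variable: sample mean and standard deviation.
xbar <- mean(height)
s    <- sd(height)

# One categorical variable: frequency table.
freq <- table(travel)

# Two categorical variables: contingency table (input to a chi-squared test of independence).
tab <- table(travel, campus)
```

With enough observations, chisq.test(tab) would carry out the \chi^2 test; on a toy table this small R warns that the approximation may be poor.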
Lecture 3 – Observational Studies vs Experiments
Learning Outcomes
O1: Distinguish study types.
O2: Identify explanatory (independent) vs response (dependent) variables.
O3: Recognise explanations for association (common response, causation, confounding).
1 • Observational vs Experimental
Observational: measure variables as they occur, no intervention (e.g., hospital size vs stay length).
Experiment: researcher imposes treatment (coffee/no-coffee study).
Sample survey = special observational study.
2 • Association ≠ Causation
Demonstrated via Spurious Correlations (Tyler Vigen):
US science spending vs suicides by hanging, Nicolas Cage films vs drownings, cheese consumption vs bedsheet entanglements.
Explanatory diagrams: double-headed arrows = association, single-headed = causation.
3 • Key Concepts
Explanatory variable (x) vs Response (y) (Def 1.11).
Lurking variable (z) (Def 1.12): unmeasured but influential.
Confounding (Def 1.14): effects of two variables intermixed.
Example: Parent BMI & child diet on Child BMI; heredity vs environment conflated.
4 • Possible explanations for an observed link
Common response: both variables respond to unseen cause (temperature → ice-cream & heat strokes).
Causation: moon gravity → tides (in general, causation is rarely provable without an experiment).
Confounding: exercise vs fitness muddled by age.
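A simulation sketch of common response, assuming made-up coefficients: temperature drives both ice-cream sales and heat strokes, and neither affects the other. All numbers are hypothetical.

```r
set.seed(1)  # arbitrary seed for reproducibility

# Hypothetical daily data: temperature drives both outcomes; neither causes the other.
temperature <- runif(200, min = 15, max = 40)
ice_cream   <- 20 + 3.0 * temperature + rnorm(200, sd = 10)
heat_stroke <-  1 + 0.5 * temperature + rnorm(200, sd = 2)

# Strong association between the two responses ...
r_raw <- cor(ice_cream, heat_stroke)

# ... which largely disappears once temperature is accounted for
# (correlation of the residuals after regressing each on temperature).
r_adj <- cor(resid(lm(ice_cream ~ temperature)),
             resid(lm(heat_stroke ~ temperature)))
```

In real data the third variable is often lurking (unmeasured), so this adjustment is not available; that is why the design of the study matters.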
5 • Causal Inference without experiments
Bradford Hill criteria (among others): strength, consistency, dose–response, temporality, biological plausibility.
Smoking–cancer discussion: ethical limits prevent RCT; weight of evidence still establishes causation.
Lecture 4 – Experimental Designs
Learning Outcomes
O1: Appreciate experiments for causal insight.
O2: Describe subjects, factors, levels, treatments, response.
O3: Compare design types & evaluate efficiency.
1 • Vocabulary (Def 1.15)
Subjects / Experimental units.
Factor: manipulated explanatory variable.
Levels: categories/values of a factor.
Treatment: specific combo of factor levels.
Response variable: outcome measured post-treatment.
Example – Tennessee STAR:
Factor: class size (3 levels) → treatments.
Response: standardised test scores.
2 • Principles: “Compare, Randomise, Repeat, Replicate”
Compare: include control (placebo) or baseline.
Randomise: allocate subjects to treatments randomly → balances confounders.
Repeat: apply each treatment to many subjects (reduce chance variation).
Replicate: redo entire experiment independently → confirm findings.
3 • Design Types
Randomised Comparative Experiment (Def 1.17): classic multi-arm RCT.
Matched Pairs (Def 1.18): pairs of similar subjects or before–after on same subject; higher precision.
Randomised Block: generalisation; subjects grouped into homogeneous blocks, treatments randomised within block.
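A sketch of randomising within blocks, assuming 12 hypothetical subjects in three blocks of four and two treatments. The block labels and group sizes are made up for illustration.

```r
set.seed(2)  # arbitrary seed for reproducibility

# Hypothetical subjects: 12 people in 3 homogeneous blocks (e.g. age groups) of 4.
subjects <- data.frame(
  id    = 1:12,
  block = rep(c("young", "middle", "older"), each = 4)
)

# Within each block, randomly give 2 subjects the treatment and 2 the control.
subjects$treatment <- NA_character_
for (b in unique(subjects$block)) {
  rows <- which(subjects$block == b)
  subjects$treatment[rows] <-
    sample(rep(c("control", "treatment"), length.out = length(rows)))
}

table(subjects$block, subjects$treatment)  # 2 of each treatment in every block
```

Because the shuffle happens separately inside each block, the treatment groups are balanced on the blocking variable by construction.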
Example – Smartphone & driving simulator:
Design 1 (two independent groups) → Randomised Comparative.
Design 2 (each student both conditions) → Matched Pairs; requires fewer subjects, controls inter-individual variability.
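A simulation sketch comparing the two designs. All numbers are assumptions (reaction times in ms, a true slowdown of 40 ms, large differences between students, small trial-to-trial noise); none come from the course example.

```r
set.seed(3)  # arbitrary seed for reproducibility

# All numbers below are made up: reaction times in ms, true slowdown of 40 ms.
person_sd <- 80   # differences between students
noise_sd  <- 15   # trial-to-trial variation within a student

# Design 1: 40 students, 20 per condition (two independent groups).
d1_no_phone   <- rnorm(20, mean = 600, sd = person_sd) + rnorm(20, sd = noise_sd)
d1_with_phone <- rnorm(20, mean = 600, sd = person_sd) + 40 + rnorm(20, sd = noise_sd)
design1 <- t.test(d1_with_phone, d1_no_phone)

# Design 2: 20 students, each measured under both conditions (matched pairs).
baseline      <- rnorm(20, mean = 600, sd = person_sd)
d2_no_phone   <- baseline + rnorm(20, sd = noise_sd)
d2_with_phone <- baseline + 40 + rnorm(20, sd = noise_sd)
design2 <- t.test(d2_with_phone, d2_no_phone, paired = TRUE)

# Width of the 95% confidence interval for the mean difference under each design.
width1 <- diff(design1$conf.int)
width2 <- diff(design2$conf.int)
```

Under these assumptions the matched-pairs interval is much narrower even though it uses half as many students, because person-to-person differences cancel within each pair.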
4 • Handling Nuisance Factors
Advertising study: factor of interest = ad frequency; nuisance = ad duration.
Full factorial: 3 \times 2 = 6 treatments, 10 students each.
Ignoring duration could mask or even reverse the apparent effect of frequency (Simpson’s paradox demonstration).
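A sketch of laying out the 3 × 2 factorial and allocating 60 students at random, 10 per treatment. The level labels are hypothetical placeholders; the notes only give the number of levels.

```r
set.seed(4)  # arbitrary seed for reproducibility

# Levels are hypothetical placeholders; the notes only give 3 frequencies x 2 durations.
treatments <- expand.grid(
  frequency = c("1 showing", "3 showings", "5 showings"),
  duration  = c("30 s", "90 s")
)
nrow(treatments)  # 3 x 2 = 6 treatments

# Randomly allocate 60 students, 10 per treatment.
allocation <- data.frame(
  student   = 1:60,
  treatment = sample(rep(seq_len(nrow(treatments)), each = 10))
)
table(allocation$treatment)  # 10 students in each of the 6 treatments
```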
5 • Randomisation Mechanics
Traditional: numbered balls in urn, coin flips.
Modern: R’s sample(); example code provided to assign 200 sailors.
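The course code itself is not reproduced in these notes. A minimal sketch of what such an allocation might look like, assuming two equal groups of 100 (the group sizes and names are assumptions):

```r
set.seed(5)  # arbitrary seed so the allocation can be reproduced

sailors <- 1:200  # sailor labels

# Draw 100 labels without replacement for the treatment group; the rest are controls.
treatment_ids <- sample(sailors, size = 100, replace = FALSE)
control_ids   <- setdiff(sailors, treatment_ids)
```

Sampling without replacement guarantees that no sailor lands in both groups.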
6 • Historical & Ethical Notes
Controlled scurvy trial – James Lind (1747).
Early randomisation – Peirce & Jastrow (1880s); Fisher & Neyman (1920s ag-experiments).
Persian physician al-Razi (9th cent.) used control group.
Ethical constraints dictate design feasibility (e.g., cannot force smoking).
7 • Common Pitfalls & Mitigations
Poor control choice → placebo needed.
Extraneous changes (pill colour).
Lack of blinding → introduce double-blind procedures.
Low realism → replicate real-world settings.
Insufficient sample size or lack of replication.
Glossary of Key Terms
Anecdotal evidence, Available data, Census, Sample, Voluntary response, Convenience sample.
Observational Study, Experiment, Association, Causation, Common response, Confounding.
Subjects, Factor, Level, Treatment, Response variable.
Control group, Placebo, Double-blind, Randomised Comparative, Matched Pairs, Randomised Block.
Mathematical & Statistical Symbols Recap
Sample size: n
Number of variables: p
Body Mass Index: \text{BMI}=\dfrac{\text{weight (kg)}}{\text{height (m)}^{2}}
Five-number summary: \min,\; Q1,\; \text{median},\; Q3,\; \max
Correlation coefficient: r; regression slope \beta_1 (covered later).
Confidence interval examples: \text{CI}_{\mu},\;\text{CI}_{p},\;\text{CI}_{\mu_1-\mu_2}.
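A short worked sketch of two of the symbols above in R. The helper name bmi and the sample values are hypothetical; fivenum() is base R.

```r
# BMI = weight (kg) / height (m)^2
bmi <- function(weight_kg, height_m) weight_kg / height_m^2

bmi(70, 1.75)  # 70 / 3.0625, about 22.86

# Five-number summary of a small hypothetical sample.
x <- c(2, 4, 4, 5, 7, 9, 10)
fivenum(x)  # min, lower hinge, median, upper hinge, max: 2 4 5 8 10
```

fivenum() reports hinges, which can differ slightly from the quartiles returned by quantile() for some sample sizes; for this sample the two agree.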
R & Computing Cheat-Sheet (Chapter 1 Scope)
Read CSV: read.csv("file.csv") → data frame.
Inspect structure: str(df); open spreadsheet-like viewer by clicking in Environment.
Basic summaries: summary(x); frequencies table(x); grouped summary by(y, group, summary).
Plots: hist(x), boxplot(x), barplot(table(x)), plot(y ~ x).
Randomisation: sample(pop, size, replace = FALSE).
Ethical, Philosophical & Practical Considerations
GIGO principle: “Garbage In – Garbage Out” → flawless analysis cannot rescue flawed data collection.
Experiments often limited by cost, time, ethics (e.g., cannot assign harmful smoking).
Importance of anonymisation (zID codes) & privacy in educational datasets.
Statistical thinking transcends subject matter – applies in science, social policy, medicine, business.
Connections & Future Lectures
Chapter 1 lays foundation for later topics:
Numerical summaries, graphical methods (Week 2).
Probability & inference (later weeks).
Regression & ANOVA rely on clear distinction of explanatory/response variables and proper experimental design.
Upcoming preparatory tasks:
Install/verify RStudio via weekly Mobius lesson.
Begin personal summary notes early – cumulative advantage.
Further Reading & References
Moore, McCabe & Craig (2021) Introduction to the Practice of Statistics, 10th ed.
McKeachie & Svinicki (2014) Teaching Tips – used for Think-Pair-Share pedagogy.
Lafaye de Micheaux et al. (2021) – internal Rmarkdown slide-creation package.
Tyler Vigen – Spurious Correlations website.
Historical papers: Peirce & Jastrow (1884), Fisher & Neyman (1920s), Lind’s scurvy trial (1747).