MATH1041 Study Design and Data Collection - Vocabulary Flashcards

Course Context & Logistics

Course: MATH1041 – Statistics for Life & Social Sciences
- Delivered by the UNSW School of Mathematics & Statistics – Faculty of Science.
- Lecture notes compiled by Y. Bunjamin, P. Lafaye de Micheaux, L. Helme-Guizon, J. Stocklosa, D. Warton & previous lecturers.
Teaching team: lecturers (Pierre, Bill, Yudhi, Larraine, Hilda, Jonathan, …), tutors, helpers, admin.
Platforms & tools:
- Moodle (announcements, datasets, Chapter 0 resources, weekly Mobius lessons).
- R/RStudio or Posit Cloud, R scripts (e.g., mindmap.R).
- How-to guides: RStudio manual, dataset folder, PollEverywhere/Kahoot links.
Weekly announcements: exclusively on Moodle – not repeated in lectures.
Over-arching aim: Introduce Statistics as “the science of collecting, analysing & interpreting data”.
- Master both “turning a story into maths formalism” and “translating mathematical results back into plain language”.
- Develop statistical vocabulary, computing literacy with R, and deep conceptual understanding.

Study Design – Big Picture

Chapter 1 is split over 4 lectures:
1. Introduction to Data Collection & Organisation
2. Sources of Data & Variable Types
3. Observational Studies vs Experiments
4. Experimental Designs
Direct mapping to textbook (Moore et al., 2021): Sections 1.1, 2.7, 3.1, 3.2.
Recurring learning cycle for each lecture:
- Learning Outcomes (O-tags) → Success Criteria → Think-Pair-Share / Polls → Kahoot / R demos → Reflection.

Lecture 1 – Introduction to Data Collection & Organisation

Learning Outcomes
- O1: Recognise that data carry context & purpose – research question dictates what data are needed.
- O2: Master vocabulary: data, data set, population, cases, labels, variables, number of variables (p), sample size (n).
- O3: Realise data can be stored in multiple file formats (rectangular tables, “complex” non-tabular structures, etc.).
Key success checkpoints
- Precisely phrase a research question.
- Identify the population, required variables, sample, (n) & (p).
Wall-of-Knowledge Prezi: positions stats as integrative layer connecting scientific disciplines.

1 • From Raw Data to Information

Raw survey table (2019 T2, 60 rows shown) → visually overwhelming.
Statistics goal: transform a “bunch of numbers” into insight.
Human limitation: pattern-finding in large tables is hard – need summaries & visualisations.

2 • “What data to collect?” mini-cases

Is MATH1041 a good course?
- “Good” can mean enjoyable, prepares for future studies, teaches job-relevant basics, …
- Each definition changes population, variables, and measurement method (e.g., satisfaction scores vs later research-project marks).
Flu epidemic scenario: students practise defining population, variables, etc.

3 • Statistical Analysis Workflow

Choose data based on research question.
Decide how to collect → design of experiments / observational studies / simulations.
Organise data (notebooks, files, databases, DNA storage!).
Describe (metadata + numerical / graphical summaries).
Analyse (relationships, inference).

4 • Core Vocabulary (‘Data Sets III’ slides)

Population: entire group of interest (e.g., all MATH1041 students in 2025 T2).
Cases / observational units: individual members from population (e.g., one student with ≥1 mark).
Labels / IDs: unique identifiers (e.g., zID).
Sample: subset actually studied; size (n).
Variable: measurable attribute; total number (p).
Observation: vector of variable values for one case.

Example – marks file after Week 11:

Population: whole cohort.
Sample: students appearing in Excel file.
Variables: one per assessment (5).
Thus p = 5 and n = \text{number of rows}.
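
The vocabulary above maps directly onto a data frame in R. The following is a minimal sketch with an invented marks table; the column names, zIDs and values are all hypothetical and are not taken from the course file.

```r
# Hypothetical marks file: one row per student (case), one column per assessment (variable).
marks <- data.frame(
  quiz1      = c(7, 9, 6, 8),
  quiz2      = c(8, 7, 5, 9),
  assignment = c(15, 18, 12, 17),
  labtest    = c(11, 14, 9, 13),
  exam       = c(52, 61, 40, 58),
  row.names  = c("z0000001", "z0000002", "z0000003", "z0000004")  # labels (made-up zIDs)
)

n <- nrow(marks)  # sample size: number of cases (rows)
p <- ncol(marks)  # number of variables (columns)
cat("n =", n, " p =", p, "\n")

# One observation: the vector of variable values for a single case.
marks["z0000002", ]
```

Keeping the labels as row names rather than as a column means they are not counted among the p variables.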

5 • File Formats & Practicalities

Common extensions in this course: .txt, .csv, .dat, .xls, .xlsx, .RData.
Distinguish ASCII (plain-text) files from binary files (peek at the file in Notepad: plain text is readable, binary is not).
Hands-on exercise: classify Moodle files (ageinc.dat, ApartmentList.txt, titanic.csv, Mondanat.img, Mondanat.hdr, 1041.RData).
RStudio demo: load MATH1041-2024T1.csv and RData via GUI & code:

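# Collect every object currently in the workspace into one data frame (assumes equal-length vectors)
survey.df <- data.frame(mget(ls()))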
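
As a self-contained illustration of the two loading routes, the sketch below writes a tiny invented survey to temporary files and reads it back. The variable names and values are hypothetical; with the real course files you would pass their paths to read.csv() and load() instead.

```r
# Write a tiny made-up survey to a temporary CSV so the example is self-contained.
survey <- data.frame(height_cm = c(170, 182, 165),
                     travel    = c("bus", "walk", "train"))
csv_path <- tempfile(fileext = ".csv")
write.csv(survey, csv_path, row.names = FALSE)

# Route 1: a CSV is plain text; read.csv() returns a data frame.
survey.df <- read.csv(csv_path)
str(survey.df)

# Route 2: an .RData file is binary; save() writes named objects, load() restores them.
rdata_path <- tempfile(fileext = ".RData")
save(survey, file = rdata_path)
rm(survey)
restored <- load(rdata_path)  # load() returns the names of the restored objects
print(restored)
```

Note that load() puts objects back under their original names rather than returning the data, which is why its result is not assigned to a data frame.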

Lecture 2 – Sources of Data & Variable Types

Learning Outcomes
- O1: Identify anecdotal, available, and self-collected/simulated data.
- O2: Differentiate categorical vs quantitative variables.
- O3: State units for quantitative variables.

1 • Sources of Data – Definitions & Examples

Anecdotal: unsystematic, single experiences → prone to bias.
- E.g., “Coffee improved my attention last week.”
Available: previously recorded for another purpose (e.g., ABS census tables, UNSW enrolment records).
Collect / simulate your own: design surveys, experiments or computer simulations (Honours thesis example).

2 • Sampling & Surveys

Population vs Sample (Def 1.4).
Census = measure whole population (Def 1.5); often infeasible (time, cost, ethics).
Sample survey (Def 1.6) & voluntary response (Def 1.7) → bias risk: loud voices dominate.
Convenience sampling (Def 1.8): choose easiest units – low external validity.
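
A small simulation can make the bias of convenience sampling concrete. This is only a sketch: the population, the attendance rate and the satisfaction scores are all invented, and the assumption that lecture attendees are more satisfied is built in by hand.

```r
set.seed(1041)

# Simulated population (all numbers are invented for illustration).
N <- 5000
attends <- rbinom(N, size = 1, prob = 0.4) == 1       # attends lectures in person?
satisfaction <- rnorm(N, mean = ifelse(attends, 8, 6), sd = 1)

pop_mean <- mean(satisfaction)                        # what a census would report

# Simple random sample: every student is equally likely to be chosen.
srs_idx  <- sample(N, size = 200)
srs_mean <- mean(satisfaction[srs_idx])

# Convenience sample: the first 200 students found in the lecture theatre.
conv_idx  <- which(attends)[1:200]
conv_mean <- mean(satisfaction[conv_idx])

round(c(population = pop_mean, random = srs_mean, convenience = conv_mean), 2)
```

Under these assumptions the random sample lands close to the population mean, while the convenience sample overestimates it because it only reaches attendees.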

3 • Variable Types

Decision algorithm (Slides 1.50–1.51):
1. No order → categorical.
2. Ordered, continuous scale → continuous quantitative.
3. Countable integers with meaningful differences → discrete quantitative.
4. Ordered categories w/o equal spacing → ordinal (treated as categorical unless using special methods).
Examples:
- Satisfaction 0–10 → quantitative (discrete).
- Travel method → categorical.
- Temperature (°C) → quantitative (continuous).
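
In R these variable types correspond to different storage classes. The sketch below uses invented values; the `year_level` column is a hypothetical ordinal variable added for illustration and is not one of the examples above.

```r
survey <- data.frame(
  satisfaction = c(7L, 9L, 4L, 10L),                        # discrete quantitative (0-10 score)
  travel       = factor(c("bus", "walk", "bus", "train")),  # categorical
  temp_c       = c(21.4, 19.8, 23.1, 20.5),                 # continuous quantitative, units: degrees Celsius
  year_level   = factor(c("first", "third", "second", "first"),
                        levels  = c("first", "second", "third"),
                        ordered = TRUE)                     # ordinal (hypothetical variable)
)

str(survey)                                # shows the class of every column
sapply(survey, function(v) class(v)[1])    # first class label per column
```

Recording the unit in the column name (here `temp_c`) is one simple way to keep units attached to quantitative variables.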

4 • Why the distinction matters

Determines:
- Summary numbers (mean/SD vs frequency table).
- Appropriate plots (histogram/boxplot vs bar chart).
- Inference procedure (e.g., \chi^2 test vs two-sample t-test).
Course roadmap table provided, linking each variable combination to R functions & statistical tests:
- One quantitative: \bar x,\; s,\;\text{histogram}; CI/test on \mu.
- Two categorical: contingency table, \chi^2 independence.
- Etc.
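
The two roadmap rows listed above could look as follows in R. This is a sketch on invented data (the heights and the counts are made up), using standard base-R functions; it is not the course's own roadmap table.

```r
# One quantitative variable (made-up heights in cm).
heights <- c(158, 163, 167, 170, 171, 174, 176, 180, 183, 190)

x_bar <- mean(heights)               # sample mean
s     <- sd(heights)                 # sample standard deviation
ci    <- t.test(heights)$conf.int    # 95% confidence interval for mu

# Two categorical variables: counts in a contingency table (invented counts).
counts <- matrix(c(30, 20,
                   15, 35),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(travel = c("bus", "walk"),
                                 lives  = c("far", "near")))
chi <- chisq.test(counts)            # chi-squared test of independence
chi$p.value
```

The point is the pairing: a mean and standard deviation only make sense for the quantitative variable, while the categorical pair is summarised by counts.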

Lecture 3 – Observational Studies vs Experiments

Learning Outcomes
- O1: Distinguish study types.
- O2: Identify explanatory (independent) vs response (dependent) variables.
- O3: Recognise explanations for association (common response, causation, confounding).

1 • Observational vs Experimental

Observational: measure variables as they occur, no intervention (e.g., hospital size vs stay length).
Experiment: researcher imposes treatment (coffee/no-coffee study).
Sample survey = special observational study.

2 • Association ≠ Causation

Demonstrated via Spurious Correlations (Tyler Vigen):
- US science spending vs suicides by hanging, Nicolas Cage films vs drownings, cheese consumption vs bedsheet entanglements.
Explanatory diagrams: double-headed arrows = association, single-headed = causation.

3 • Key Concepts

Explanatory variable (x) vs Response (y) (Def 1.11).
Lurking variable (z) (Def 1.12): unmeasured but influential.
Confounding (Def 1.14): effects of two variables intermixed.
- Example: parent BMI and child diet both affect child BMI, so heredity and environment are conflated.

4 • Possible explanations for an observed link

Common response: both variables respond to an unseen cause (temperature → ice-cream sales & heatstroke cases).
Causation: x directly changes y (e.g., the Moon’s gravity → tides); rarely provable without an experiment.
Confounding: exercise vs fitness muddled by age.
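
Common response can be demonstrated with a short simulation. In this sketch the coefficients and noise levels are arbitrary choices: temperature drives both variables, and neither affects the other.

```r
set.seed(1)
n <- 1000

temperature <- rnorm(n, mean = 25, sd = 5)           # lurking variable z
ice_cream   <- 2 * temperature + rnorm(n, sd = 5)    # responds to z only
heat_stroke <- 3 * temperature + rnorm(n, sd = 5)    # responds to z only

# Strong association, even though neither variable influences the other.
r_raw <- cor(ice_cream, heat_stroke)

# Remove the part of each variable explained by temperature, then correlate what is left.
r_adjusted <- cor(resid(lm(ice_cream ~ temperature)),
                  resid(lm(heat_stroke ~ temperature)))

round(c(raw = r_raw, adjusted_for_temperature = r_adjusted), 2)
```

Once temperature is accounted for, the association largely disappears, which is what a common-response explanation predicts.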

5 • Causal Inference without experiments

Bradford Hill criteria include strength, consistency, dose–response, temporality and biological plausibility.
Smoking–cancer discussion: ethical limits prevent RCT; weight of evidence still establishes causation.

Lecture 4 – Experimental Designs

Learning Outcomes
- O1: Appreciate experiments for causal insight.
- O2: Describe subjects, factors, levels, treatments, response.
- O3: Compare design types & evaluate efficiency.

1 • Vocabulary (Def 1.15)

Subjects / Experimental units.
Factor: manipulated explanatory variable.
Levels: categories/values of a factor.
Treatment: specific combo of factor levels.
Response variable: outcome measured post-treatment.

Example – Tennessee STAR:

Factor: class size (3 levels) → treatments.
Response: standardised test scores.

2 • Principles: “Compare, Randomise, Repeat, Replicate”

Compare: include control (placebo) or baseline.
Randomise: allocate subjects to treatments randomly → balances confounders.
Repeat: apply each treatment to many subjects (reduce chance variation).
Replicate: redo entire experiment independently → confirm findings.

3 • Design Types

Randomised Comparative Experiment (Def 1.17): classic multi-arm RCT.
Matched Pairs (Def 1.18): pairs of similar subjects or before–after on same subject; higher precision.
Randomised Block: generalisation; subjects grouped into homogeneous blocks, treatments randomised within block.
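
Randomising within blocks can be sketched in a few lines of R. The blocks (age bands), the two treatment labels and the block size of four are all hypothetical choices for illustration.

```r
set.seed(1)

# Twelve hypothetical subjects grouped into three homogeneous blocks of four.
subjects <- data.frame(
  id    = 1:12,
  block = rep(c("young", "middle", "older"), each = 4)
)

# Within each block, shuffle a balanced set of treatment labels.
subjects$treatment <- NA_character_
for (b in unique(subjects$block)) {
  rows <- which(subjects$block == b)
  labels <- rep(c("control", "treatment"), length.out = length(rows))
  subjects$treatment[rows] <- sample(labels)
}

table(subjects$block, subjects$treatment)   # every block is split evenly
```

Because the shuffle happens separately inside each block, the treatment groups are guaranteed to be balanced on the blocking variable rather than balanced only on average.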

Example – Smartphone & driving simulator:

Design 1 (two independent groups) → Randomised Comparative.
Design 2 (each student both conditions) → Matched Pairs; requires fewer subjects, controls inter-individual variability.
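
The precision gain from matched pairs can be illustrated by simulation. Everything numeric here is an assumption made for the sketch: reaction times in milliseconds, a true slowdown of 20 ms, large differences between students and small within-student noise.

```r
set.seed(2)
n      <- 30
effect <- 20   # assumed true slowdown (ms) when using the phone

# Design 1: two independent groups of n students each.
control_grp <- rnorm(n, mean = 500, sd = 100)
phone_grp   <- rnorm(n, mean = 500, sd = 100) + effect
design1 <- t.test(phone_grp, control_grp)

# Design 2: the same n students are measured under both conditions.
baseline   <- rnorm(n, mean = 500, sd = 100)       # each student's own typical time
no_phone   <- baseline + rnorm(n, sd = 5)
with_phone <- baseline + effect + rnorm(n, sd = 5)
design2 <- t.test(with_phone, no_phone, paired = TRUE)

# Width of the 95% confidence interval for the effect under each design.
width1 <- diff(design1$conf.int)
width2 <- diff(design2$conf.int)
round(c(independent_groups = width1, matched_pairs = width2), 1)
```

Pairing removes the between-student variation from the comparison, so the interval for Design 2 is much narrower despite using half as many students.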

4 • Handling Nuisance Factors

Advertising study: factor of interest = ad frequency; nuisance = ad duration.
- Full factorial: 3 \times 2 = 6 treatments, 10 students each.
- Ignoring duration could mask or even reverse the apparent effect (Simpson’s paradox demonstration).
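
The 3 \times 2 layout can be enumerated and randomised in R. In this sketch the level names for frequency and duration are assumptions chosen for illustration; only the 3 \times 2 structure and the 10 students per treatment come from the notes.

```r
set.seed(3)

# All combinations of the two factors (level names are hypothetical).
treatments <- expand.grid(
  frequency = c("1 showing", "3 showings", "5 showings"),
  duration  = c("30 s", "90 s")
)
nrow(treatments)   # number of distinct treatments

# Randomly allocate 60 students so that each treatment gets exactly 10.
students   <- 1:60
allocation <- data.frame(
  student   = students,
  treatment = sample(rep(seq_len(nrow(treatments)), each = 10))
)
table(allocation$treatment)
```

Shuffling a vector that already contains each treatment number ten times guarantees equal group sizes, which independent coin flips would not.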

5 • Randomisation Mechanics

Traditional: numbered balls in urn, coin flips.
Modern: R’s sample(); example code provided to assign 200 sailors.
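
A possible version of such an allocation is sketched below. The notes do not say how the 200 sailors are divided, so the split into two equal groups and the group labels are assumptions.

```r
set.seed(2025)   # fixing the seed makes the allocation reproducible

sailors <- 1:200                          # sailor ID numbers
treated <- sample(sailors, size = 100)    # draws without replacement by default

# Everyone not drawn goes to the control group.
group <- ifelse(sailors %in% treated, "treatment", "control")
table(group)
```

This is the software equivalent of drawing 100 numbered balls from an urn of 200.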

6 • Historical & Ethical Notes

Controlled scurvy trial – James Lind (1747).
Early randomisation – Peirce & Jastrow (1880s); Fisher & Neyman (1920s ag-experiments).
Persian physician al-Razi (9th cent.) used control group.
Ethical constraints dictate design feasibility (e.g., cannot force smoking).

7 • Common Pitfalls & Mitigations

Poor control choice → placebo needed.
Extraneous changes (pill colour).
Lack of blinding → introduce double-blind procedures.
Low realism → replicate real-world settings.
Insufficient sample size or lack of replication.

Glossary of Key Terms

Anecdotal evidence, Available data, Census, Sample, Voluntary response, Convenience sample.
Observational Study, Experiment, Association, Causation, Common response, Confounding.
Subjects, Factor, Level, Treatment, Response variable.
Control group, Placebo, Double-blind, Randomised Comparative, Matched Pairs, Randomised Block.

Mathematical & Statistical Symbols Recap

Sample size: n
Number of variables: p
Body Mass Index: \text{BMI}=\dfrac{\text{weight (kg)}}{\text{height (m)}^{2}}
Five-number summary: \min,\; Q1,\; \text{median},\; Q3,\; \max
Correlation coefficient: r; regression slope \beta_1 (covered later).
Confidence interval examples: \text{CI}_{\mu},\;\text{CI}_{p},\; \text{CI}_{\mu_1-\mu_2}.
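
Two of the entries above can be computed directly in R. The sketch uses an invented person (70 kg, 1.75 m) and invented marks; note that fivenum() reports Tukey's hinges, which can differ slightly from the quartiles returned by quantile().

```r
# BMI = weight (kg) / height (m)^2, for a hypothetical person.
weight_kg <- 70
height_m  <- 1.75
bmi <- weight_kg / height_m^2
round(bmi, 1)

# Five-number summary of some invented marks.
marks <- c(42, 55, 61, 64, 70, 73, 78, 85, 91)
five  <- fivenum(marks)
names(five) <- c("min", "Q1", "median", "Q3", "max")
five
```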

R & Computing Cheat-Sheet (Chapter 1 Scope)

Read CSV: read.csv("file.csv") → data frame.
Inspect structure: str(df); open spreadsheet-like viewer by clicking in Environment.
Basic summaries: summary(x); frequencies table(x); grouped summary by(y, group, summary).
Plots: hist(x), boxplot(x), barplot(table(x)), plot(y ~ x).
Randomisation: sample(pop, size, replace = FALSE).
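
The calls above can be tried end to end on R's built-in `iris` data set, used here only as a stand-in because it ships with R; it is not one of the course data sets.

```r
df <- iris                                    # built-in data frame: 4 quantitative variables, 1 categorical

str(df)                                       # inspect structure
summary(df$Sepal.Length)                      # numerical summary of a quantitative variable
species_counts <- table(df$Species)           # frequencies of a categorical variable
by(df$Sepal.Length, df$Species, summary)      # grouped summary

pdf(NULL)   # null graphics device so the script also runs without a display; omit in RStudio
hist(df$Sepal.Length)                         # one quantitative variable
boxplot(Sepal.Length ~ Species, data = df)    # quantitative split by categorical
barplot(species_counts)                       # one categorical variable
plot(Petal.Length ~ Sepal.Length, data = df)  # two quantitative variables
invisible(dev.off())

set.seed(10)
picked <- sample(nrow(df), size = 10, replace = FALSE)  # 10 distinct rows chosen at random
df[picked, ]
```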

Ethical, Philosophical & Practical Considerations

GIGO principle: “Garbage In – Garbage Out” → flawless analysis cannot rescue flawed data collection.
Experiments often limited by cost, time, ethics (e.g., cannot assign harmful smoking).
Importance of anonymisation (zID codes) & privacy in educational datasets.
Statistical thinking transcends subject matter – applies in science, social policy, medicine, business.

Connections & Future Lectures

Chapter 1 lays foundation for later topics:
- Numerical summaries, graphical methods (Week 2).
- Probability & inference (later weeks).
- Regression & ANOVA rely on clear distinction of explanatory/response variables and proper experimental design.
Upcoming preparatory tasks:
- Install/verify RStudio via weekly Mobius lesson.
- Begin personal summary notes early – cumulative advantage.
