MATH1041 Study Design and Data Collection - Vocabulary Flashcards

Course Context & Logistics

  • Course: MATH1041 – Statistics for Life & Social Sciences

    • Delivered by the UNSW School of Mathematics & Statistics – Faculty of Science.

    • Lecture notes compiled by Y. Bunjamin, P. Lafaye de Micheaux, L. Helme-Guizon, J. Stocklosa, D. Warton & previous lecturers.

  • Teaching team: lecturers (Pierre, Bill, Yudhi, Larraine, Hilda, Jonathan, …), tutors, helpers, admin.

  • Platforms & tools:

    • Moodle (announcements, datasets, Chapter 0 resources, weekly Mobius lessons).

    • R/RStudio or Posit Cloud, R scripts (e.g., mindmap.R).

    • How-to guides: RStudio manual, dataset folder, PollEverywhere/Kahoot links.

  • Weekly announcements: exclusively on Moodle – not repeated in lectures.

  • Over-arching aim: Introduce Statistics as “the science of collecting, analysing & interpreting data”.

    • Master both “turning a story into maths formalism” and “translating mathematical results back into plain language”.

    • Develop statistical vocabulary, computing literacy with R, and deep conceptual understanding.

Study Design – Big Picture

  • Chapter 1 is split over 4 lectures:

    1. Introduction to Data Collection & Organisation

    2. Sources of Data & Variable Types

    3. Observational Studies vs Experiments

    4. Experimental Designs

  • Direct mapping to textbook (Moore et al., 2021): Sections 1.1, 2.7, 3.1, 3.2.

  • Recurring learning cycle for each lecture: Learning Outcomes (O-tags) → Success Criteria → Think-Pair-Share / Polls → Kahoot / R demos → Reflection.


Lecture 1 – Introduction to Data Collection & Organisation

  • Learning Outcomes

    • O1: Recognise that data carry context & purpose – research question dictates what data are needed.

    • O2: Master vocabulary: data, data set, population, cases, labels, variables, number of variables (p), sample size (n).

    • O3: Realise data can be stored in multiple file formats (rectangular tables, “complex” non-tabular structures, etc.).

  • Key success checkpoints

    • Precisely phrase a research question.

    • Identify the population, required variables, sample, (n) & (p).

  • Wall-of-Knowledge Prezi: positions stats as integrative layer connecting scientific disciplines.

1 • From Raw Data to Information
  • Raw survey table (2019 T2, 60 rows shown) → visually overwhelming.

  • Statistics goal: transform a “bunch of numbers” into insight.

  • Human limitation: pattern-finding in large tables is hard – need summaries & visualisations.

2 • “What data to collect?” mini-cases
  • Is MATH1041 a good course?

    • “Good” can mean enjoyable, prepares for future studies, teaches job-relevant basics, …

    • Each definition changes population, variables, and measurement method (e.g., satisfaction scores vs later research-project marks).

  • Flu epidemic scenario: students practise defining population, variables, etc.

3 • Statistical Analysis Workflow
  1. Choose data based on research question.

  2. Decide how to collect → design of experiments / observational studies / simulations.

  3. Organise data (notebooks, files, databases, DNA storage!).

  4. Describe (metadata + numerical / graphical summaries).

  5. Analyse (relationships, inference).

4 • Core Vocabulary (‘Data Sets III’ slides)
  • Population: entire group of interest (e.g., all MATH1041 students in 2025 T2).

  • Cases / observational units: individual members from population (e.g., one student with ≥1 mark).

  • Labels / IDs: unique identifiers (e.g., zID).

  • Sample: subset actually studied; size (n).

  • Variable: measurable attribute; total number (p).

  • Observation: vector of variable values for one case.

Example – marks file after Week 11:

  • Population: whole cohort.

  • Sample: students appearing in Excel file.

  • Variables: one per assessment (5).

  • Thus \(p = 5\) and \(n = \text{number of rows}\).
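A minimal R sketch of this bookkeeping, assuming a made-up marks table (the zIDs, column names and marks below are hypothetical, not course data). The label column identifies cases, so it is not counted among the variables.

```r
# Hypothetical marks file: one row per student (case), one column per assessment.
marks <- data.frame(
  zID        = c("z0000001", "z0000002", "z0000003"),  # labels, not variables
  quiz1      = c(7, 9, 6),
  quiz2      = c(8, 6, 9),
  lab        = c(15, 18, 12),
  assignment = c(20, 17, 22),
  midterm    = c(31, 27, 35)
)

n <- nrow(marks)      # sample size: number of cases (rows)
p <- ncol(marks) - 1  # number of variables, excluding the zID label column

cat("n =", n, " p =", p, "\n")
```

If the file stored labels as row names instead of a column, `ncol(marks)` alone would give p.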

5 • File Formats & Practicalities
  • Common extensions in this course: .txt, .csv, .dat, .xls, .xlsx, .RData.

  • Distinguish ASCII vs binary (use Notepad peek).

  • Hands-on exercise: classify Moodle files (ageinc.dat, ApartmentList.txt, titanic.csv, Mondanat.img, Mondanat.hdr, 1041.RData).

  • RStudio demo: load MATH1041-2024T1.csv and RData via GUI & code:

survey.df <- data.frame(mget(ls()))
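The one-liner above gathers every object in the workspace into a data frame, which only works when those objects are vectors of equal length. Below is a self-contained sketch of both loading routes. It assumes toy data written to temporary files rather than the actual Moodle files, and the names `height` and `travel` are invented for illustration.

```r
# Toy data standing in for the course survey (values are invented).
toy <- data.frame(height = c(170, 182, 165),
                  travel = c("bus", "walk", "train"))

# Route 1: a .csv file is plain text; read.csv() returns a data frame.
csv_path <- tempfile(fileext = ".csv")
write.csv(toy, csv_path, row.names = FALSE)
survey_csv <- read.csv(csv_path)

# Route 2: an .RData file is binary and restores R objects by name.
rdata_path <- tempfile(fileext = ".RData")
height <- toy$height
travel <- toy$travel
save(height, travel, file = rdata_path)

# Load into a separate environment so only the restored objects are combined.
loaded_env   <- new.env()
loaded_names <- load(rdata_path, envir = loaded_env)
survey.df    <- data.frame(mget(loaded_names, envir = loaded_env))

str(survey_csv)
str(survey.df)
```

Loading into a fresh environment is a design choice for this sketch: it avoids accidentally sweeping unrelated workspace objects into the data frame, which `mget(ls())` would do.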

Lecture 2 – Sources of Data & Variable Types

  • Learning Outcomes

    • O1: Identify anecdotal, available, and self-collected/simulated data.

    • O2: Differentiate categorical vs quantitative variables.

    • O3: State units for quantitative variables.

1 • Sources of Data – Definitions & Examples
  • Anecdotal: unsystematic, single experiences → prone to bias.

    • E.g., “Coffee improved my attention last week.”

  • Available: previously recorded for another purpose (e.g., ABS census tables, UNSW enrolment records).

  • Collect / simulate your own: design surveys, experiments or computer simulations (Honours thesis example).

2 • Sampling & Surveys
  • Population vs Sample (Def 1.4).

  • Census = measure whole population (Def 1.5); often infeasible (time, cost, ethics).

  • Sample survey (Def 1.6) & voluntary response (Def 1.7) → bias risk: loud voices dominate.

  • Convenience sampling (Def 1.8): choose easiest units – low external validity.
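To see why convenience sampling can mislead, here is a small simulation sketch. The population of commute times is invented, and taking the 50 shortest commutes is a hypothetical stand-in for "the easiest units to reach".

```r
set.seed(1041)  # arbitrary seed so the sketch is reproducible

# Invented population: commute times (minutes) for 1000 students,
# sorted so that students with short commutes come first.
population <- sort(rexp(1000, rate = 1 / 40))

# Convenience sample: the 50 easiest units to reach (shortest commutes).
convenience <- population[1:50]

# Simple random sample: every student is equally likely to be chosen.
srs <- sample(population, size = 50, replace = FALSE)

c(population  = mean(population),
  convenience = mean(convenience),
  random      = mean(srs))
```

The convenience mean is guaranteed to sit below the population mean here because of how the sample was picked, while the random sample has no such built-in direction of error.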

3 • Variable Types
  • Decision algorithm (Slides 1.50–1.51):

    1. No order → categorical.

    2. Ordered, continuous scale → continuous quantitative.

    3. Countable integers with meaningful differences → discrete quantitative.

    4. Ordered categories w/o equal spacing → ordinal (treated as categorical unless using special methods).

  • Examples:

    • Satisfaction 0–10 → quantitative (discrete).

    • Travel method → categorical.

    • Temperature (°C) → quantitative (continuous).
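In R, the same distinction shows up in how a column is stored. The sketch below uses an invented three-row survey (column names and values are hypothetical) and a rough rule of thumb: numeric columns are treated as quantitative, factors as categorical.

```r
# Invented mini-survey mirroring the three examples above.
survey <- data.frame(
  satisfaction  = c(7L, 9L, 4L),                    # 0-10 score, integers
  travel        = factor(c("bus", "walk", "bus")),  # unordered categories
  temperature_c = c(21.5, 18.2, 25.0)               # degrees Celsius
)

# Rough rule of thumb: numeric storage -> quantitative, otherwise categorical.
var_type <- sapply(survey, function(v) {
  if (is.numeric(v)) "quantitative" else "categorical"
})
print(var_type)
```

This check only reflects storage. A categorical variable coded as numbers (for example 1 = bus, 2 = walk) would be misread as quantitative, so the decision still has to come from the meaning of the variable.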

4 • Why the distinction matters
  • Determines:

    • Summary numbers (mean/SD vs frequency table).

    • Appropriate plots (histogram/boxplot vs bar chart).

    • Inference procedure (e.g., \(\chi^2\) test vs two-sample t-test).

  • Course roadmap table provided linking variable combination to R functions & statistical tests:

    • One quantitative: \(\bar{x}\), \(s\), histogram; CI/test on \(\mu\).

    • Two categorical: contingency table, \(\chi^2\) test of independence.

    • Etc.
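A short sketch of how the variable type drives the summary, assuming two invented vectors (`height_cm` and `travel` are hypothetical names and values):

```r
height_cm <- c(170, 182, 165, 174, 169)               # quantitative
travel    <- c("bus", "walk", "bus", "train", "bus")  # categorical

# Quantitative variable: centre and spread.
mean(height_cm)
sd(height_cm)

# Categorical variable: counts and proportions per category.
counts <- table(travel)
counts
prop.table(counts)
```

Calling `mean(travel)` would only return `NA` with a warning, which is R's way of saying the summary does not match the variable type.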


Lecture 3 – Observational Studies vs Experiments

  • Learning Outcomes

    • O1: Distinguish study types.

    • O2: Identify explanatory (independent) vs response (dependent) variables.

    • O3: Recognise explanations for association (common response, causation, confounding).

1 • Observational vs Experimental
  • Observational: measure variables as they occur, no intervention (e.g., hospital size vs stay length).

  • Experiment: researcher imposes treatment (coffee/no-coffee study).

  • Sample survey = special observational study.

2 • Association ≠ Causation
  • Demonstrated via Spurious Correlations (Tyler Vigen):

    • US science spending vs suicides by hanging, Nicolas Cage films vs drownings, cheese consumption vs bedsheet entanglements.

  • Explanatory diagrams: double-headed arrows = association, single-headed = causation.

3 • Key Concepts
  • Explanatory variable (x) vs Response (y) (Def 1.11).

  • Lurking variable (z) (Def 1.12): unmeasured but influential.

  • Confounding (Def 1.14): effects of two variables intermixed.

    • Example: parent BMI and child diet both influence child BMI, so the effects of heredity and environment cannot be separated.

4 • Possible explanations for an observed link
  1. Common response: both variables respond to unseen cause (temperature → ice-cream & heat strokes).

  2. Causation: moon gravity → tides (rarely provable without experiment).

  3. Confounding: exercise vs fitness muddled by age.
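The common-response case can be illustrated with a simulation sketch. All numbers are invented, the variables are on an arbitrary standardised scale, and `lm()` (regression, covered later in the course) is used here only as one possible way to adjust for the lurking variable.

```r
set.seed(1041)  # arbitrary seed for reproducibility
n <- 1000

temperature <- rnorm(n)                          # lurking variable z
ice_cream   <- temperature + rnorm(n, sd = 0.5)  # x responds to z
heat_stroke <- temperature + rnorm(n, sd = 0.5)  # y responds to z, not to x

# Strong association between x and y, although neither causes the other.
raw_cor <- cor(ice_cream, heat_stroke)

# Once temperature is accounted for, the ice-cream slope is close to zero.
fit       <- lm(heat_stroke ~ ice_cream + temperature)
adj_slope <- coef(fit)[["ice_cream"]]

c(raw_correlation = raw_cor, adjusted_slope = adj_slope)
```

In a real observational study the lurking variable is often unmeasured, so this adjustment is not available and the association alone cannot tell the three explanations apart.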

5 • Causal Inference without experiments
  • Bradford Hill criteria, including: strength, consistency, dose–response, temporality, biological plausibility.

  • Smoking–cancer discussion: ethical limits prevent RCT; weight of evidence still establishes causation.


Lecture 4 – Experimental Designs

  • Learning Outcomes

    • O1: Appreciate experiments for causal insight.

    • O2: Describe subjects, factors, levels, treatments, response.

    • O3: Compare design types & evaluate efficiency.

1 • Vocabulary (Def 1.15)
  • Subjects / Experimental units.

  • Factor: manipulated explanatory variable.

  • Levels: categories/values of a factor.

  • Treatment: specific combo of factor levels.

  • Response variable: outcome measured post-treatment.

Example – Tennessee STAR:

  • Factor: class size (3 levels) → treatments.

  • Response: standardised test scores.

2 • Principles: “Compare, Randomise, Repeat, Replicate”
  1. Compare: include control (placebo) or baseline.

  2. Randomise: allocate subjects to treatments randomly → balances confounders.

  3. Repeat: apply each treatment to many subjects (reduce chance variation).

  4. Replicate: redo entire experiment independently → confirm findings.

3 • Design Types
  • Randomised Comparative Experiment (Def 1.17): classic multi-arm RCT.

  • Matched Pairs (Def 1.18): pairs of similar subjects or before–after on same subject; higher precision.

  • Randomised Block: generalisation; subjects grouped into homogeneous blocks, treatments randomised within block.

Example – Smartphone & driving simulator:

  • Design 1 (two independent groups) → Randomised Comparative.

  • Design 2 (each student both conditions) → Matched Pairs; requires fewer subjects, controls inter-individual variability.
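The precision gain from matching can be sketched by simulation. Everything below is an assumption for illustration: reaction times in milliseconds, a true phone effect of 50 ms, large differences between students (SD 150) and small measurement noise (SD 20). The t-tests are only used to obtain confidence intervals for the mean difference.

```r
set.seed(1041)  # arbitrary seed for reproducibility
n      <- 20    # students per condition
effect <- 50    # assumed true slowdown (ms) when using a phone

# Design 1: two independent groups of n students each.
control_group <- rnorm(n, mean = 1000, sd = 150)
phone_group   <- rnorm(n, mean = 1000 + effect, sd = 150)
independent   <- t.test(phone_group, control_group)

# Design 2: the same n students are measured under both conditions.
baseline <- rnorm(n, mean = 1000, sd = 150)  # each student's typical time
no_phone <- baseline + rnorm(n, sd = 20)
phone    <- baseline + effect + rnorm(n, sd = 20)
paired   <- t.test(phone, no_phone, paired = TRUE)

# Width of the 95% confidence interval for the mean difference.
ci_width <- c(independent = diff(independent$conf.int),
              matched     = diff(paired$conf.int))
print(ci_width)
```

The matched-pairs interval is much narrower because each student acts as their own control, so between-student variability cancels out of the differences. It also uses 20 students in total rather than 40.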

4 • Handling Nuisance Factors
  • Advertising study: factor of interest = ad frequency; nuisance = ad duration.

    • Full factorial: \(3 \times 2 = 6\) treatments, 10 students each.

    • Ignoring duration could mask or even reverse the apparent effect (Simpson’s paradox demonstration).
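A sketch of how the six treatments and their allocation could be laid out in R. The notes do not say which factor has three levels or what the levels are, so the level names below are assumptions, as is the total of 60 students implied by 6 treatments with 10 students each.

```r
# All combinations of factor levels (level names are assumed for illustration).
treatments <- expand.grid(
  frequency = c("low", "medium", "high"),  # assumed 3 levels
  duration  = c("short", "long")           # assumed 2 levels
)
treatments$id <- seq_len(nrow(treatments))
print(treatments)

# Randomly allocate 60 students so that each treatment receives exactly 10.
set.seed(1041)  # arbitrary seed for reproducibility
allocation <- data.frame(
  student   = 1:60,
  treatment = sample(rep(treatments$id, each = 10))
)
table(allocation$treatment)
```

Shuffling a vector that already contains each treatment ten times keeps the group sizes balanced, which independent draws per student would not guarantee.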

5 • Randomisation Mechanics
  • Traditional: numbered balls in urn, coin flips.

  • Modern: R’s sample(); example code provided to assign 200 sailors.
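The lecture code itself is not reproduced in these notes. The following is a minimal sketch of one way to do it, assuming the 200 sailors are split evenly into two groups (the group names and the even split are assumptions).

```r
set.seed(1747)  # arbitrary seed so the allocation can be reproduced

sailors <- 1:200  # sailor ID numbers

# Draw 100 sailors without replacement for the treatment group.
treated <- sample(sailors, size = 100, replace = FALSE)

# Everyone not drawn goes to the control group.
group <- ifelse(sailors %in% treated, "treatment", "control")

assignment <- data.frame(sailor = sailors, group = group)
table(assignment$group)
head(assignment)
```

This is the software equivalent of drawing 100 numbered balls from an urn of 200.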

6 • Historical & Ethical Notes
  • Controlled scurvy trial – James Lind (1747).

  • Early randomisation – Peirce & Jastrow (1880s); Fisher & Neyman (1920s agricultural experiments).

  • Persian physician al-Razi (9th cent.) used control group.

  • Ethical constraints dictate design feasibility (e.g., cannot force smoking).

7 • Common Pitfalls & Mitigations
  • Poor control choice → placebo needed.

  • Extraneous changes (pill colour).

  • Lack of blinding → introduce double-blind procedures.

  • Low realism → replicate real-world settings.

  • Insufficient sample size or lack of replication.


Glossary of Key Terms

  • Anecdotal evidence, Available data, Census, Sample, Voluntary response, Convenience sample.

  • Observational Study, Experiment, Association, Causation, Common response, Confounding.

  • Subjects, Factor, Level, Treatment, Response variable.

  • Control group, Placebo, Double-blind, Randomised Comparative, Matched Pairs, Randomised Block.


Mathematical & Statistical Symbols Recap

  • Sample size: n

  • Number of variables: p

  • Body Mass Index: \(\text{BMI}=\dfrac{\text{weight (kg)}}{\text{height (m)}^{2}}\)

  • Five-number summary: \(\min,\; Q_1,\; \text{median},\; Q_3,\; \max\)

  • Correlation coefficient: \(r\); regression slope \(\beta_1\) (covered later).

  • Confidence interval examples: \(\text{CI}_{\mu},\;\text{CI}_{p},\;\text{CI}_{\mu_1-\mu_2}\).
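Two of these symbols can be computed directly in R. The sketch below assumes a hypothetical helper `bmi()` and invented data values.

```r
# Hypothetical helper implementing BMI = weight (kg) / height (m)^2.
bmi <- function(weight_kg, height_m) {
  weight_kg / height_m^2
}
bmi(70, 1.75)  # roughly 22.9

# Five-number summary of an invented sample.
x <- c(12, 15, 9, 20, 17, 14, 11)
fivenum(x)     # min, lower hinge, median, upper hinge, max
```

`fivenum()` reports Tukey's hinges, which can differ slightly from the quartiles returned by `quantile()` for some sample sizes.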


R & Computing Cheat-Sheet (Chapter 1 Scope)

  • Read CSV: read.csv("file.csv") → data frame.

  • Inspect structure: str(df); open spreadsheet-like viewer by clicking in Environment.

  • Basic summaries: summary(x); frequencies table(x); grouped summary by(y, group, summary).

  • Plots: hist(x), boxplot(x), barplot(table(x)), plot(y ~ x).

  • Randomisation: sample(pop, size, replace = FALSE).
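The calls above can be tried end to end on a data set that ships with R. This sketch uses the built-in `ToothGrowth` data rather than a course file (an assumption made so it runs anywhere), and `pdf(NULL)` is included only so the plotting calls also work without a display.

```r
df <- ToothGrowth  # built-in: len (numeric), supp (factor), dose (numeric)

str(df)                       # structure: variable names and types
summary(df$len)               # numerical summary of a quantitative variable
table(df$supp)                # frequencies of a categorical variable
by(df$len, df$supp, summary)  # summary of len within each supp group

pdf(NULL)  # discard plot output; omit this line and dev.off() in RStudio
hist(df$len)
boxplot(df$len)
barplot(table(df$supp))
plot(len ~ dose, data = df)
invisible(dev.off())

# Randomisation: pick 10 of the rows at random, without replacement.
set.seed(1041)  # arbitrary seed for reproducibility
chosen <- sample(seq_len(nrow(df)), size = 10, replace = FALSE)
df[chosen, ]
```

For a course file, the first line would be replaced by something like `df <- read.csv("file.csv")`, with the file name as given on Moodle.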


Ethical, Philosophical & Practical Considerations

  • GIGO principle: “Garbage In – Garbage Out” → flawless analysis cannot rescue flawed data collection.

  • Experiments often limited by cost, time, ethics (e.g., cannot assign harmful smoking).

  • Importance of anonymisation (zID codes) & privacy in educational datasets.

  • Statistical thinking transcends subject matter – applies in science, social policy, medicine, business.


Connections & Future Lectures

  • Chapter 1 lays foundation for later topics:

    • Numerical summaries, graphical methods (Week 2).

    • Probability & inference (later weeks).

    • Regression & ANOVA rely on clear distinction of explanatory/response variables and proper experimental design.

  • Upcoming preparatory tasks:

    • Install/verify RStudio via weekly Mobius lesson.

    • Begin personal summary notes early – cumulative advantage.


Further Reading & References

  • Moore, McCabe & Craig (2021) Introduction to the Practice of Statistics, 10th ed.

  • McKeachie & Svinicki (2014) Teaching Tips – used for Think-Pair-Share pedagogy.

  • Lafaye de Micheaux et al. (2021) – internal Rmarkdown slide-creation package.

  • Tyler Vigen – Spurious Correlations website.

  • Historical papers: Peirce & Jastrow (1884), Fisher & Neyman (1920s), Lind’s scurvy trial (1747).