Statistics Lecture – Data Sources, Sampling, Variable Types & Study Design (June 4)

Data Storage & File Formats

  • Computer files are recognised by their extensions (the characters after the final dot).
  • Common data-table extensions used in the course:
    • .txt – plain text
    • .csv – comma-separated values
    • Others: .xlsx, .sav, .json, etc.
  • Rectangular (row × column) tables dominate introductory statistics workflows.
  • Most course/lab datasets are supplied as CSV files via Moodle; ensure you know where the large “datasets” folder is located.
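
A minimal sketch of loading a rectangular CSV table, assuming Python's standard csv module. The column names and values are invented for illustration and do not come from a course dataset; with a real file you would pass an opened file handle instead of the in-memory text.

```python
import csv
import io

# Hypothetical file contents; the column names and values are made up.
raw_text = """student_id,hemisphere,prep_time_min
1,Southern,35
2,Northern,50
3,Southern,20
"""

def read_table(handle):
    """Return the rows of a CSV source as a list of dictionaries."""
    return list(csv.DictReader(handle))

rows = read_table(io.StringIO(raw_text))
column_names = list(rows[0].keys())

# csv reads every field as text, so quantitative columns need explicit conversion.
prep_times = [float(row["prep_time_min"]) for row in rows]

print(column_names)
print(len(rows), "rows; mean prep time (min):", sum(prep_times) / len(prep_times))
```

Each row is one case and each key is one variable, which is the row × column layout described above.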

Study-Design Workflow (recap from previous lecture)

  • 1 ▶ Identify the research question.
  • 2 ▶ Choose the population of interest.
  • 3 ▶ List the variables required to answer the question.
  • 4 ▶ Collect data on those variables following an appropriate design.
  • Bad data → bad analysis → bad conclusions (often undetectable by downstream readers!).
  • Consulting a statistician before data collection is crucial – collection is usually the most expensive stage.

Types of Data Sources

1. Anecdotal Data

  • Informal, casual observations or stories (single personal experiences).
  • Subjective; not reliable for large-scale decision-making.
  • Value: generates intuition/hypotheses (e.g. Newton’s apple, discovery of penicillin, microwave oven serendipity).
  • Example given: “I felt more focused after coffee before last week’s lecture – therefore coffee improves concentration.”

2. Available (Secondary) Data

  • Pre-existing data generated for purposes other than your current study.
  • Examples:
    • Australian Bureau of Statistics (ABS) census tables.
    • Bureau of Meteorology (BoM) weather archives.
    • University enrolment records.
    • Smart-watch step counts/time logs.
  • Advantage: inexpensive, instantly accessible.
  • Caveat: variables, definitions, and quality were chosen for someone else’s objectives.

3. New / Collected / Simulated Data

  • Gathered (or simulated) specifically for the research project.
  • Mechanisms:
    • Surveys & questionnaires (e.g. Moodle student questionnaire).
    • Direct measurement/counting ("how many moths in my living room"; traffic counts).
    • Instrumentation/IoT (Raspberry Pi projects, sensors).
    • Computer simulation.
  • Usually higher relevance & control, but costs time and money and may require ethics clearance.

Population, Sample & Census

  • Population: all units (people, objects, events) you want information about.
    • Example: all UNSW students when studying average preparation time each morning.
  • Sample: subset actually observed/measured.
    • Must be representative for valid inference.
  • Census: attempt to obtain data on every population member (Australian census ≈ every 5 years).
    • Expensive, time-consuming, often ethically or logistically impossible.

Sampling Methods (people-focused)

  • Sample Survey: ask questions of selected individuals.
  • Voluntary Response Sample: participants self-select.
    • Typically biased (e.g. only highly motivated individuals respond).
  • Convenience Sample: choose units that are easy to reach (e.g. stopping passers-by on Main Walkway).
    • May under-represent segments of the population.
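
The bias of a voluntary response sample can be illustrated with a small simulation. This is a sketch under invented assumptions: the population (weekly study hours), the response probabilities, and the seed are all hypothetical. The simple random sample used for comparison is not named in these notes.

```python
import random
from statistics import mean

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: weekly study hours for 10,000 students.
population = [rng.gauss(10, 3) for _ in range(10_000)]

def voluntary_response(units, rng):
    """Assumed behaviour: keen students (over 12 h/week) reply far more often."""
    return [x for x in units if rng.random() < (0.8 if x > 12 else 0.1)]

volunteers = voluntary_response(population, rng)
random_sample = rng.sample(population, 500)  # every student equally likely

population_mean = mean(population)
volunteer_mean = mean(volunteers)
random_mean = mean(random_sample)

print(f"population mean:      {population_mean:.2f}")
print(f"voluntary response:   {volunteer_mean:.2f}")
print(f"simple random sample: {random_mean:.2f}")
```

Under these assumptions the self-selected group overstates the average because motivated students are over-represented, while the random sample lands close to the population value.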

Reasons to Sample Instead of Census

  • Cost & time limitations.
  • Practical/physical impossibility.
  • Ethical/legal constraints (e.g. testing every student for tuberculosis; forcing smoking behaviour).
  • Well-designed samples often yield accurate estimates from a small fraction of the population (e.g. election “quick counts” project results without needing to survey anywhere near all voters).
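
A rough sketch of why a modest random sample can be accurate, in the spirit of a quick count. The true support (52 %), the sample size (2,000), and the seed are assumptions chosen for illustration. The margin-of-error formula is the usual large-sample approximation for a proportion and is not taken from the lecture.

```python
import math
import random

rng = random.Random(2024)

TRUE_SUPPORT = 0.52   # assumed true share voting for a candidate
SAMPLE_SIZE = 2_000   # assumed number of randomly selected voters

# Independent draws approximate random sampling from a very large electorate.
votes = [1 if rng.random() < TRUE_SUPPORT else 0 for _ in range(SAMPLE_SIZE)]
estimate = sum(votes) / SAMPLE_SIZE

# Approximate 95 % margin of error for a sample proportion.
margin = 1.96 * math.sqrt(estimate * (1 - estimate) / SAMPLE_SIZE)

print(f"estimated support: {estimate:.3f} ± {margin:.3f}")
```

The margin depends on the sample size rather than the population size, which is why a few thousand well-chosen voters can stand in for millions.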

Variable Types

Categorical (Qualitative)

  • Assigns cases to groups/levels.
  • Arithmetic on codes is meaningless.
  • Example: hemisphere {Northern, Southern}; file type; gender.

Quantitative (Numerical)

  • Takes numerical values where arithmetic operations make sense.
  • Must include units.
    • Example: temperature in °C, time in minutes, height in centimetres.

Pitfalls

  • Simply recoding categories as numbers (e.g. Hemisphere 1 & 2) does not make them quantitative; averages, sums, etc. are nonsensical.
  • A formal flowchart (omitted slide) asks: “Is there a natural order? Are arithmetic operations meaningful?”
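
The recoding pitfall can be seen in a few lines. This is a sketch with made-up data; the codes 1/2 and 10/20 are arbitrary labels chosen for illustration.

```python
from collections import Counter
from statistics import mean

codes = [1, 2, 2, 1, 2, 2]                # hemisphere stored as numeric codes
labels = {1: "Northern", 2: "Southern"}

mean_of_codes = mean(codes)               # runs without error, but means nothing

# Same data, different arbitrary codes: the "average" changes.
recoded = [{1: 10, 2: 20}[c] for c in codes]
mean_of_recoded = mean(recoded)

# Meaningful summaries for a categorical variable: counts and proportions.
counts = Counter(labels[c] for c in codes)
proportions = {name: n / len(codes) for name, n in counts.items()}

print(mean_of_codes, mean_of_recoded)
print(counts, proportions)
```

Because the "mean" depends entirely on which numbers were picked as labels, it carries no information about hemispheres; the counts and proportions do not change under relabelling.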

Observational Studies vs Experiments

Feature                             | Observational Study              | Experiment
Researcher intervention             | None – variables merely observed | Deliberate treatment imposed
Purpose                             | Identify association             | Provide evidence for causation
Key term to mention in explanations | “No treatment”                   | “Treatment/intervention”

Association ≠ Causation

  • Variables can be associated due to:
    1. Causation (A causes B).
    2. Common response to a third variable.
    3. Confounding with lurking variable(s).
  • Spurious correlations abound (e.g. US science funding vs. suicides; per-capita cheese consumption vs. deaths tangled in bedsheets).
  • Always ask: what other variable could explain both factors?
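
A common response to a third variable can be simulated directly. In this sketch temperature drives two outcomes that never influence each other; all coefficients, noise levels, and the 5,000 simulated days are invented for illustration.

```python
import random
from statistics import mean

rng = random.Random(1)

def pearson(x, y):
    """Pearson correlation coefficient, computed from its definition."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

# Third variable: daily temperature in °C (hypothetical).
temps = [rng.uniform(15, 40) for _ in range(5_000)]
# Both outcomes depend on temperature only, never on each other.
sales = [20 * t + rng.gauss(0, 60) for t in temps]
strokes = [0.5 * t + rng.gauss(0, 2) for t in temps]

r_all = pearson(sales, strokes)

# Hold temperature roughly fixed: keep only days between 25 and 27 °C.
band = [(s, h) for t, s, h in zip(temps, sales, strokes) if 25 <= t <= 27]
r_band = pearson([s for s, _ in band], [h for _, h in band])

print(f"all days: r = {r_all:.2f}; days at 25-27 °C: r = {r_band:.2f}")
```

The two outcomes are strongly correlated overall, yet the association largely disappears once temperature is held nearly constant, which is the signature of a third variable explaining both.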

Lurking & Confounding Variables

  • Lurking Variable: unobserved variable that influences interpretation; affects both explanatory & response variables.
  • Confounding Variables:
    • Two variables are confounded when their effects on the response cannot be separated.
    • A confounding variable is an unobserved factor whose influence on the response is indistinguishable from that of the explanatory variable.
  • Confounding is the harder concept; remember the two criteria: (i) the variable is unobserved, (ii) it affects both explanatory & response variables, making attribution difficult.

Classic Example (Fitness Study)

  • Explanatory: weekly exercise time.
  • Response: fitness level.
  • Possible lurking/confounding variables:
    • Age – older people may be less fit and have less time/ability to exercise.
    • Genetics – influences fitness potential and maybe motivation/ability to exercise.

Example Case Studies

  1. Hospital Size vs. Length of Stay

    • Large hospitals show longer average patient stays.
    • Likely explanation: severe/complex cases referred to large hospitals (confounding by illness severity).
  2. Marital Status vs. Income

    • Married/ever-married men earn more than never-married men.
    • Age acts as a confounder: older individuals more likely to be married and have higher income.
  3. Ice-Cream Sales vs. Heat Strokes

    • Both rise in hot weather.
    • Heat (temperature) is the lurking variable; the association arises via common response/confounding.
  4. Tidal Height vs. Moon Position

    • Clear causal mechanism: lunar gravitational pull → tide cycle.
    • Demonstrates genuine causation.
  5. Parent-Child Body Mass Index

    • Strong association.
    • Confounded by shared diet, lifestyle, environment, and genetics.
  6. Smoking & Lung Cancer

    • Observed association is strong; an experiment that forces people to smoke or to abstain would be unethical.
    • Genetics, stress, socioeconomic status = potential lurking/confounding variables.
    • Evidence for causation eventually established via long-term cohort studies, biological mechanisms, and advanced causal inference.
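
Case study 1 can be sketched with made-up records in which illness severity, not hospital size, determines length of stay. The numbers are invented so that the pattern is exact; real data would be noisier.

```python
from collections import defaultdict
from statistics import mean

# Made-up records: (hospital_size, severity, stay_days).
records = (
    [("large", "mild", 3)] * 2 + [("large", "severe", 10)] * 8
    + [("small", "mild", 3)] * 8 + [("small", "severe", 10)] * 2
)

def mean_stay(rows, keys):
    """Average stay grouped by the chosen fields (0 = size, 1 = severity)."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in keys)].append(row[2])
    return {group: mean(stays) for group, stays in groups.items()}

overall = mean_stay(records, keys=(0,))
by_severity = mean_stay(records, keys=(0, 1))

print(overall)       # large hospitals look worse overall
print(by_severity)   # within each severity level the stays are identical
```

Comparing like with like (same severity) removes the apparent effect of hospital size here, which is only possible because severity was recorded; an unobserved confounder could not be separated this way.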

Ethical & Practical Constraints in Experiments

  • Randomised controlled trials ideal for causation but sometimes impossible (e.g. assigning smoking, withholding medical treatment).
  • Alternative: prospective cohort studies, instrumental-variable techniques, natural experiments (field of causal inference).
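
Where an experiment is ethical, the defining step of a randomised controlled trial is that chance, not the researcher or the participant, decides who receives the treatment. A minimal sketch, with hypothetical participant IDs and a helper name (randomise) invented for this example:

```python
import random

def randomise(units, seed=None):
    """Shuffle the units and split them into two equal-sized groups."""
    rng = random.Random(seed)
    shuffled = list(units)      # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"treatment": shuffled[:half], "control": shuffled[half:]}

participants = [f"P{i:02d}" for i in range(1, 21)]   # hypothetical IDs P01..P20
groups = randomise(participants, seed=7)

print(groups["treatment"])
print(groups["control"])
```

Random assignment tends to balance lurking variables such as age or genetics across the two groups, which is why experiments can support causal claims that observational studies cannot.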

Key Quantitative Formulae Quoted

  • Body Mass Index:
    \text{BMI} = \frac{\text{weight (kg)}}{\left[\text{height (m)}\right]^2}
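
The formula translates directly into code. The function names and the example person (70 kg, 1.75 m) are illustrative assumptions; the unit conversion is included because height is often recorded in centimetres.

```python
def bmi(weight_kg, height_m):
    """Body Mass Index: weight in kilograms divided by height in metres squared."""
    if height_m <= 0:
        raise ValueError("height must be positive")
    return weight_kg / height_m ** 2

def bmi_from_cm(weight_kg, height_cm):
    """Same calculation when height is recorded in centimetres."""
    return bmi(weight_kg, height_cm / 100)

example = bmi(70.0, 1.75)            # 70 / 3.0625
same_person = bmi_from_cm(70.0, 175)

print(f"BMI = {example:.1f} kg/m^2")
```

Passing centimetres straight into bmi would shrink the result by a factor of 10,000, a reminder that units belong with every quantitative variable.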

Big Picture – Why All This Matters

  • Correct identification of variable types guides every later statistical method (see forthcoming “cheat-sheet” table of analyses).
  • Solid study design prevents garbage-in/garbage-out, saving money and reputation.
  • Understanding association, causation, lurking & confounding variables protects against invalid claims and misguided policy or personal decisions.

Practical Pointers for the Course

  • Whenever faced with data:
    1. Name the population and sample.
    2. Classify each variable (categorical vs. quantitative; explanatory vs. response).
    3. Decide whether the design is observational or experimental (look for “treatment”).
    4. Ask what lurking/confounding variables might exist.
  • Use units for every quantitative measurement.
  • Seek statistical advice before data collection.
  • Remember: well-constructed samples can yield high-accuracy estimates without a full census.