Statistics Lecture (June 4) – Data Sources, Sampling, Variable Types, & Study Design

Data Storage & File Formats

Computer files are recognised by their extensions (the characters after the final dot).
Common data-table extensions used in the course:
- .txt – plain text
- .csv – comma-separated values (CSV)
- Others: .xlsx, .sav, .json, etc.
Rectangular (row × column) tables dominate introductory statistics workflows.
Most course/lab datasets are supplied as CSV files via Moodle; ensure you know where the large “datasets” folder is located.
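A minimal sketch of how a CSV file maps onto a rectangular table, using Python's standard csv module. The column names and values below are hypothetical stand-ins, not an actual course dataset, and the text is held in a string so the example runs without a file on disk.

```python
import csv
import io

# Hypothetical CSV content standing in for a course dataset file.
CSV_TEXT = """student_id,hemisphere,prep_time_min
1,Southern,35
2,Northern,50
3,Southern,20
"""

def read_table(text):
    """Parse CSV text into a list of row dictionaries keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))

rows = read_table(CSV_TEXT)
columns = list(rows[0].keys())

print(columns)    # column names taken from the header line
print(len(rows))  # number of data rows (cases)
```

Every value arrives as text (for example "35"), so quantitative columns need converting to numbers before any arithmetic. With a real file, the same reader can be given a file object opened with open(path, newline="").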

Study-Design Workflow (recap from previous lecture)

1. Identify the research question.
2. Choose the population of interest.
3. List the variables required to answer the question.
4. Collect data on those variables following an appropriate design.
Bad data → bad analysis → bad conclusions (often undetectable by downstream readers!).
Consulting a statistician before data collection is crucial – collection is usually the most expensive stage.

Types of Data Sources

1. Anecdotal Data

Informal, casual observations or stories (single personal experiences).
Subjective; not reliable for large-scale decision-making.
Value: generates intuition/hypotheses (e.g. Newton’s apple, discovery of penicillin, microwave oven serendipity).
Example given: “I felt more focused after coffee before last week’s lecture – therefore coffee improves concentration.”

2. Available (Secondary) Data

Pre-existing data generated for purposes other than your current study.
Examples:
- Australian Bureau of Statistics (ABS) census tables.
- Bureau of Meteorology (BoM) weather archives.
- University enrolment records.
- Smart-watch step counts/time logs.
Advantage: inexpensive, instantly accessible.
Caveat: variables, definitions, and quality were chosen for someone else’s objectives.

3. New / Collected / Simulated Data

Gathered (or simulated) specifically for the research project.
Mechanisms:
- Surveys & questionnaires (e.g. Moodle student questionnaire).
- Direct measurement/counting ("how many moths in my living room"; traffic counts).
- Instrumentation/IoT (Raspberry Pi projects, sensors).
- Computer simulation.
Usually higher relevance & control, but costs time, money, ethics clearance.

Population, Sample & Census

Population: all units (people, objects, events) you want information about.
- Example: all UNSW students when studying average preparation time each morning.
Sample: subset actually observed/measured.
- Must be representative for valid inference.
Census: attempt to obtain data on every population member (Australian census ≈ every 5 years).
- Expensive, time-consuming, often ethically or logistically impossible.

Sampling Methods (people-focused)

Sample Survey: ask questions of selected individuals.
Voluntary Response Sample: participants self-select.
- Typically biased (e.g. only highly motivated individuals respond).
Convenience Sample: choose units that are easy to reach (e.g. stopping passers-by on Main Walkway).
- May under-represent segments of the population.
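A small simulation can illustrate why voluntary response samples tend to be biased. Everything here is an assumption made for illustration: a hypothetical satisfaction score from 1 to 5, and response probabilities under which dissatisfied people (scores 1 or 2) are ten times more likely to reply.

```python
import random
from statistics import mean

def simulate_voluntary_response(n_population=10_000, seed=1):
    """Return (population mean, mean among self-selected respondents)."""
    rng = random.Random(seed)
    # Hypothetical satisfaction scores, 1 (very unhappy) to 5 (very happy).
    population = [rng.randint(1, 5) for _ in range(n_population)]
    respondents = []
    for score in population:
        # Assumption: unhappy people are far more motivated to respond.
        p_respond = 0.50 if score <= 2 else 0.05
        if rng.random() < p_respond:
            respondents.append(score)
    return mean(population), mean(respondents)

true_mean, volunteer_mean = simulate_voluntary_response()
print(f"population mean:         {true_mean:.2f}")
print(f"voluntary-response mean: {volunteer_mean:.2f}")
```

Under these assumptions the self-selected mean falls well below the population mean, even though hundreds of people responded. A larger biased sample does not fix the problem.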

Reasons to Sample Instead of Census

Cost & time limitations.
Practical/physical impossibility.
Ethical/legal constraints (e.g. testing every student for tuberculosis; forcing smoking behaviour).
Well-designed samples often yield high-accuracy estimates from a small fraction of the population (e.g. election “quick counts”), without having to survey nearly every voter.
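The accuracy claim can be checked with a rough simulation of a simple random sample. The electorate size, the true vote share, and the sample size are all hypothetical values chosen for illustration.

```python
import random

def quick_count(n_voters=1_000_000, n_supporters=520_000,
                sample_size=2_000, seed=42):
    """Estimate candidate A's vote share from a simple random sample."""
    rng = random.Random(seed)
    # Voters are labelled 0 .. n_voters - 1; by construction, labels
    # below n_supporters are the voters who back candidate A.
    sampled = rng.sample(range(n_voters), sample_size)  # without replacement
    supporters_in_sample = sum(1 for voter in sampled if voter < n_supporters)
    true_share = n_supporters / n_voters
    return true_share, supporters_in_sample / sample_size

true_share, estimate = quick_count()
print(f"true share:      {true_share:.3f}")
print(f"sample estimate: {estimate:.3f}")
```

Here the sample covers only 0.2 % of voters, yet the estimate typically lands within a few percentage points of the true share. The key is that every voter has the same chance of selection.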

Variable Types

Categorical (Qualitative)

Assigns cases to groups/levels.
Arithmetic on codes is meaningless.
Example: hemisphere {Northern, Southern}; file type; gender.

Quantitative (Numerical)

Takes numerical values where arithmetic operations make sense.
Must include units.
- Example: temperature in $^{\circ}\text{C}$, time in minutes, height in centimetres.

Pitfalls

Simply recoding categories as numbers (e.g. Hemisphere 1 & 2) does not make them quantitative; averages, sums, etc. are nonsensical.
A formal flowchart (omitted slide) asks: “Is there a natural order? Are arithmetic operations meaningful?”
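The recoding pitfall can be seen in a short sketch. The coding 1 = Northern, 2 = Southern and the six observations are hypothetical.

```python
from collections import Counter
from statistics import mean

# Hypothetical coding of a categorical variable: 1 = Northern, 2 = Southern.
labels = {1: "Northern", 2: "Southern"}
codes = [1, 2, 2, 1, 2, 2]

# Python will happily average the codes, but the result is not a hemisphere.
meaningless_average = mean(codes)

# A sensible summary for a categorical variable is a count per category.
counts = Counter(labels[code] for code in codes)

print(meaningless_average)
print(counts.most_common())
```

The average is a number that corresponds to no category, whereas the counts describe the data directly.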

Observational Studies vs Experiments

Feature	Observational Study	Experiment
Researcher intervention	None – variables merely observed	Deliberate treatment imposed
Purpose	Identify association	Provide evidence for causation
Key term to mention in explanations	“No treatment”	“Treatment/intervention”

Association ≠ Causation

Variables can be associated due to:
1. Causation (A causes B).
2. Common response to a third variable.
3. Confounding with lurking variable(s).
Spurious correlations abound (e.g. US science funding vs. suicides; per-capita cheese consumption vs. deaths tangled in bedsheets).
Always ask: what other variable could explain both factors?
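A simulated sketch of common response: two variables A and B that never influence each other, but both depend on a third variable Z. All data are simulated, and the model (each variable equals Z plus independent noise) is an assumption made for illustration.

```python
import random
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

rng = random.Random(0)
z = [rng.gauss(0, 1) for _ in range(2000)]   # third variable
a = [zi + rng.gauss(0, 0.5) for zi in z]     # A depends on Z only
b = [zi + rng.gauss(0, 0.5) for zi in z]     # B depends on Z only

r_ab = pearson(a, b)

# Remove Z's contribution. In this simulation the effect of Z is known
# to be exactly +1 per unit, so subtracting z leaves only the noise.
a_rest = [ai - zi for ai, zi in zip(a, z)]
b_rest = [bi - zi for bi, zi in zip(b, z)]
r_rest = pearson(a_rest, b_rest)

print(f"correlation of A and B:       {r_ab:.2f}")
print(f"after removing Z's influence: {r_rest:.2f}")
```

A and B are strongly correlated, yet the association all but vanishes once Z is accounted for. With real data the effect of Z is unknown and would have to be estimated, for example by regression.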

Lurking & Confounding Variables

Lurking Variable: unobserved variable that influences interpretation; affects both explanatory & response variables.
Confounding Variables:
- Two variables are confounded when their effects on the response cannot be separated.
- A confounding variable is an unobserved factor whose influence on the response is indistinguishable from that of the explanatory variable.
This is a harder concept; remember the two criteria: (i) the variable is unobserved, and (ii) it affects both the explanatory and response variables, making attribution difficult.

Classic Example (Fitness Study)

Explanatory: weekly exercise time.
Response: fitness level.
Possible lurking/confounding variables:
- Age – older people may be less fit and have less time/ability to exercise.
- Genetics – influences fitness potential and maybe motivation/ability to exercise.

Example Case Studies

Hospital Size vs. Length of Stay
- Large hospitals show longer average patient stays.
- Likely explanation: severe/complex cases referred to large hospitals (confounding by illness severity).
Marital Status vs. Income
- Married/ever-married men earn more than never-married men.
- Age acts as a confounder: older individuals more likely to be married and have higher income.
Ice-Cream Sales vs. Heat Strokes
- Both rise in hot weather.
- Heat (temperature) is the lurking variable; the association arises via common response/confounding.
Tidal Height vs. Moon Position
- Clear causal mechanism: lunar gravitational pull → tide cycle.
- Demonstrates genuine causation.
Parent-Child Body Mass Index
- Strong association.
- Confounded by shared diet, lifestyle, environment, and genetics.
Smoking & Lung Cancer
- The observed association is strong, but an experiment that forces people to smoke (or not to smoke) would be unethical.
- Genetics, stress, and socioeconomic status are potential lurking/confounding variables.
- Evidence for causation eventually established via long-term cohort studies, biological mechanisms, and advanced causal inference.
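The hospital case can be sketched as a simulation in which stay length depends on illness severity only, while severe cases are referred to large hospitals more often. All numbers (referral rates, mean stays) are hypothetical assumptions, not real hospital data.

```python
import random
from statistics import mean

def simulate_hospitals(n_patients=20_000, seed=3):
    """Return a list of (is_large, is_severe, stay_days) records."""
    rng = random.Random(seed)
    records = []
    for _ in range(n_patients):
        is_large = rng.random() < 0.5
        # Assumption: severe cases are referred to large hospitals more often.
        is_severe = rng.random() < (0.7 if is_large else 0.2)
        # Assumption: stay length depends on severity, not hospital size.
        stay_days = rng.gauss(10 if is_severe else 3, 0.5)
        records.append((is_large, is_severe, stay_days))
    return records

def mean_stay(records, large, severe=None):
    """Mean stay for one hospital size, optionally within one severity level."""
    stays = [stay for is_large, is_severe, stay in records
             if is_large == large and (severe is None or is_severe == severe)]
    return mean(stays)

records = simulate_hospitals()
overall_gap = mean_stay(records, True) - mean_stay(records, False)
severe_gap = mean_stay(records, True, True) - mean_stay(records, False, True)
mild_gap = mean_stay(records, True, False) - mean_stay(records, False, False)

print(f"large minus small, all patients: {overall_gap:.2f} days")
print(f"large minus small, severe only:  {severe_gap:.2f} days")
print(f"large minus small, mild only:    {mild_gap:.2f} days")
```

Overall, large hospitals show much longer stays, but within each severity group the gap is close to zero. Comparing like with like (stratifying by the confounder) is only possible when the confounder has been measured.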

Ethical & Practical Constraints in Experiments

Randomised controlled trials are ideal for establishing causation but are sometimes impossible (e.g. assigning smoking, withholding medical treatment).
Alternatives: prospective cohort studies, instrumental-variable techniques, natural experiments (field of causal inference).
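A rough sketch of why random assignment helps: when chance decides who receives the treatment, a potential lurking variable such as age tends to balance out between groups. The participants' ages are simulated, and the function name randomise is a hypothetical helper.

```python
import random
from statistics import mean

def randomise(units, seed=7):
    """Randomly split units into two equal-sized groups (treatment, control)."""
    rng = random.Random(seed)
    shuffled = units[:]          # copy, so the input list is left unchanged
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Hypothetical participants, represented only by their ages in years.
age_rng = random.Random(0)
ages = [age_rng.randint(18, 80) for _ in range(1000)]

treatment, control = randomise(ages)
age_gap = abs(mean(treatment) - mean(control))

print(f"mean age, treatment group: {mean(treatment):.1f}")
print(f"mean age, control group:   {mean(control):.1f}")
```

The two groups end up with similar mean ages without age ever being used in the assignment. The same balancing applies, on average, to variables nobody thought to measure.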

Key Quantitative Formulae Quoted

Body Mass Index:
$BMI = \frac{\text{Weight (kg)}}{\text{Height (m)}^2}$
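The formula translates directly into a short function. This is a minimal sketch; the function name bmi and the input check are illustrative choices.

```python
def bmi(weight_kg, height_m):
    """Body Mass Index: weight in kilograms divided by height in metres squared."""
    if weight_kg <= 0 or height_m <= 0:
        raise ValueError("weight and height must be positive")
    return weight_kg / height_m ** 2

# Example: a 70 kg person who is 1.75 m tall.
print(round(bmi(70.0, 1.75), 1))  # 22.9
```

Units matter here: a height recorded in centimetres must be divided by 100 before it is passed in, otherwise the result is too small by a factor of 10,000.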

Big Picture – Why All This Matters

Correct identification of variable types guides every later statistical method (see forthcoming “cheat-sheet” table of analyses).
Solid study design prevents garbage-in/garbage-out, saving money and reputation.
Understanding association, causation, lurking & confounding protects against invalid claims and misguided policy or personal decisions.

Practical Pointers for the Course

Whenever faced with data:
1. Name the population and sample.
2. Classify each variable (categorical vs. quantitative; explanatory vs. response).
3. Decide whether the design is observational or experimental (look for “treatment”).
4. Ask what lurking/confounding variables might exist.
Use units for every quantitative measurement.
Seek statistical advice before data collection.
Remember: well-constructed samples can yield high-accuracy estimates without a full census.