Statistics Lecture – Data Sources, Sampling, Variable Types, & Study Design June 4
Data Storage & File Formats
- Computer files are recognised by their extensions (the characters after the final dot).
- Common data‐table extensions used in the course:
- .txt – plain text
- .csv – comma-separated values
- Others: .xlsx, .sav, .json, etc.
- Rectangular (row × column) tables dominate introductory statistics workflows.
- Most course/lab datasets are supplied as CSV files via Moodle; ensure you know where the large “datasets” folder is located.
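As a minimal sketch of what a CSV file holds, the snippet below parses a small rectangular table with Python's standard `csv` module. The column names and values are made up for illustration and are not a course dataset.

```python
import csv
import io

# Hypothetical CSV content (made up for illustration, not a course dataset).
raw = """student_id,hemisphere,height_cm
1,Southern,172.5
2,Northern,165.0
3,Southern,180.2
"""

def read_table(text):
    """Parse CSV text into a list of dicts, one dict per row."""
    return list(csv.DictReader(io.StringIO(text)))

rows = read_table(raw)
print(len(rows), "rows;", "columns:", list(rows[0].keys()))
```

Every value arrives as a string, so quantitative columns such as `height_cm` need converting (e.g. with `float`) before any arithmetic. To read a file from disk instead, pass `open(path, newline="")` to `csv.DictReader`.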
Study-Design Workflow (recap from previous lecture)
- 1 ▶ Identify the research question.
- 2 ▶ Choose the population of interest.
- 3 ▶ List the variables required to answer the question.
- 4 ▶ Collect data on those variables following an appropriate design.
- Bad data → bad analysis → bad conclusions (often undetectable by downstream readers!).
- Consulting a statistician before data collection is crucial – collection is usually the most expensive stage.
Types of Data Sources
1. Anecdotal Data
- Informal, casual observations or stories (single personal experiences).
- Subjective; not reliable for large-scale decision-making.
- Value: generates intuition/hypotheses (e.g. Newton’s apple, discovery of penicillin, microwave oven serendipity).
- Example given: “I felt more focused after coffee before last week’s lecture – therefore coffee improves concentration.”
2. Available (Secondary) Data
- Pre-existing data generated for purposes other than your current study.
- Examples:
- Australian Bureau of Statistics (ABS) census tables.
- Bureau of Meteorology (BoM) weather archives.
- University enrolment records.
- Smart-watch step counts/time logs.
- Advantage: inexpensive, instantly accessible.
- Caveat: variables, definitions, and quality were chosen for someone else’s objectives.
3. New / Collected / Simulated Data
- Gathered (or simulated) specifically for the research project.
- Mechanisms:
- Surveys & questionnaires (e.g. Moodle student questionnaire).
- Direct measurement/counting ("how many moths in my living room"; traffic counts).
- Instrumentation/IoT (Raspberry Pi projects, sensors).
- Computer simulation.
- Usually higher relevance & control, but costs time, money, ethics clearance.
Population, Sample & Census
- Population: all units (people, objects, events) you want information about.
- Example: all UNSW students, if the question concerns average preparation time each morning.
- Sample: subset actually observed/measured.
- Must be representative for valid inference.
- Census: attempt to obtain data on every population member (the Australian census is held every 5 years).
- Expensive, time-consuming, often ethically or logistically impossible.
Sampling Methods (people-focused)
- Sample Survey: ask questions of selected individuals.
- Voluntary Response Sample: participants self-select.
- Typically biased (e.g. only highly motivated individuals respond).
- Convenience Sample: choose units that are easy to reach (e.g. stopping passers-by on Main Walkway).
- May under-represent segments of the population.
Reasons to Sample Instead of Census
- Cost & time limitations.
- Practical/physical impossibility.
- Ethical/legal constraints (e.g. testing every student for tuberculosis; forcing smoking behaviour).
- Well-designed samples often yield highly accurate estimates (e.g. election “quick counts”) while surveying only a small fraction of voters.
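The claim that a well-designed sample can stand in for a census can be sketched with a small simulation. Everything here is assumed for illustration: the population is synthetic (normally distributed preparation times), and the sample is a simple random sample of 1 % of it.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

# Synthetic population: morning preparation time in minutes for 50,000 students.
population = [random.gauss(45, 15) for _ in range(50_000)]

def sample_mean(pop, n):
    """Mean of a simple random sample of size n drawn without replacement."""
    return statistics.mean(random.sample(pop, n))

census_mean = statistics.mean(population)
estimate = sample_mean(population, 500)  # 1 % of the population
print(f"census mean    = {census_mean:.1f} min")
print(f"sample (n=500) = {estimate:.1f} min")
```

The two printed values should be close because each unit had the same chance of selection. A voluntary-response or convenience sample gives no such guarantee, however large it is.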
Variable Types
Categorical (Qualitative)
- Assigns cases to groups/levels.
- Arithmetic on codes is meaningless.
- Example: hemisphere {Northern, Southern}; file type; gender.
Quantitative (Numerical)
- Takes numerical values where arithmetic operations make sense.
- Must include units.
- Example: temperature in °C, time in minutes, height in centimetres.
Pitfalls
- Simply recoding categories as numbers (e.g. Hemisphere 1 & 2) does not make them quantitative; averages, sums, etc. are nonsensical.
- A formal flowchart (omitted slide) asks: “Is there a natural order? Are arithmetic operations meaningful?”
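The recoding pitfall can be made concrete with a short sketch. The records below are hypothetical. The point is that software will happily average category codes even though the result has no meaning.

```python
from collections import Counter
import statistics

# Hypothetical records: hemisphere stored as a numeric code (1 = Northern, 2 = Southern).
hemisphere_code = [1, 2, 2, 1, 2, 2]
height_cm = [172.5, 165.0, 180.2, 158.9, 175.0, 169.4]

labels = {1: "Northern", 2: "Southern"}

# Categorical: count how many cases fall in each level.
counts = Counter(labels[c] for c in hemisphere_code)

# Quantitative: arithmetic is meaningful, and the result carries units (cm).
mean_height = statistics.mean(height_cm)

# This runs without error, but "hemisphere 1.67" describes nothing real.
meaningless = statistics.mean(hemisphere_code)

print(counts)
print(f"mean height = {mean_height:.1f} cm")
```

Counts (or proportions) are the appropriate summary for the categorical variable; the mean is reserved for the quantitative one.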
Observational Studies vs Experiments
| Feature | Observational Study | Experiment |
|---|---|---|
| Researcher intervention | None – variables merely observed | Deliberate treatment imposed |
| Purpose | Identify association | Provide evidence for causation |
| Key term to mention in explanations | “No treatment” | “Treatment/intervention” |
Association ≠ Causation
- Variables can be associated due to:
- Causation (A causes B).
- Common response to a third variable.
- Confounding with lurking variable(s).
- Spurious correlations abound (e.g. US science funding vs. suicides; per-capita cheese consumption vs. deaths tangled in bedsheets).
- Always ask: what other variable could explain both factors?
Lurking & Confounding Variables
- Lurking Variable: unobserved variable that affects both the explanatory & response variables, and so changes how their association should be interpreted.
- Confounding Variables:
- Two variables are confounded when their effects on the response cannot be separated.
- A confounding variable is an unobserved factor whose influence on the response is indistinguishable from that of the explanatory variable.
- Heavier concept – remember two criteria: (i) unobserved, (ii) impacts explanatory & response, making attribution difficult.
Classic Example (Fitness Study)
- Explanatory: weekly exercise time.
- Response: fitness level.
- Possible lurking/confounding variables:
- Age – older people may be less fit and have less time/ability to exercise.
- Genetics – influences fitness potential and maybe motivation/ability to exercise.
Example Case Studies
Hospital Size vs. Length of Stay
- Large hospitals show longer average patient stays.
- Likely explanation: severe/complex cases referred to large hospitals (confounding by illness severity).
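One way to see confounding by illness severity is to compare hospitals within each severity level. The sketch below uses made-up records, constructed so that stay length depends only on severity. It illustrates the idea and is not real hospital data.

```python
from collections import defaultdict
from statistics import mean

# Made-up records: (hospital size, illness severity, length of stay in days).
records = (
    [("large", "severe", 10)] * 8 + [("large", "mild", 2)] * 2 +
    [("small", "severe", 10)] * 2 + [("small", "mild", 2)] * 8
)

def mean_stay(rows, keys):
    """Average stay grouped by the given tuple positions (0 = size, 1 = severity)."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in keys)].append(row[2])
    return {group: mean(stays) for group, stays in groups.items()}

overall = mean_stay(records, keys=(0,))       # compares hospital sizes only
stratified = mean_stay(records, keys=(0, 1))  # compares sizes within a severity level
print(overall)
print(stratified)
```

In this constructed example the large hospital has the longer average stay overall, yet within each severity level the two hospitals are identical. The overall gap comes entirely from the different mix of cases.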
Marital Status vs. Income
- Married/ever-married men earn more than never-married men.
- Age acts as a confounder: older individuals more likely to be married and have higher income.
Ice-Cream Sales vs. Heat Strokes
- Both rise in hot weather.
- Heat (temperature) is the lurking variable; the association arises via common response/confounding.
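A common response can be simulated directly. In the sketch below, all numbers are synthetic and the coefficients are arbitrary assumptions. Temperature drives both outcomes, and neither outcome appears in the formula for the other, yet the two end up strongly correlated.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

def pearson_r(x, y):
    """Pearson correlation coefficient computed from its definition."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# Synthetic daily data: temperature drives both outcomes; neither outcome affects the other.
temperature = [random.uniform(15, 40) for _ in range(365)]
ice_cream_sales = [20 * t + random.gauss(0, 60) for t in temperature]
heat_strokes = [0.5 * t + random.gauss(0, 2) for t in temperature]

r = pearson_r(ice_cream_sales, heat_strokes)
print(f"r(ice cream, heat strokes) = {r:.2f}")
```

A reader who saw only the two outcome columns could not tell this apart from a causal link, which is why the notes insist on asking what third variable might explain both.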
Tidal Height vs. Moon Position
- Clear causal mechanism: lunar gravitational pull → tide cycle.
- Demonstrates genuine causation.
Parent-Child Body Mass Index
- Strong association.
- Confounded by shared diet, lifestyle, environment, and genetics.
Smoking & Lung Cancer
- Observed association is strong, but an experiment that assigns people to smoke or not smoke would be unethical.
- Genetics, stress, socioeconomic status = potential lurking/confounding variables.
- Evidence for causation eventually established via long-term cohort studies, biological mechanisms, and advanced causal inference.
Ethical & Practical Constraints in Experiments
- Randomised controlled trials ideal for causation but sometimes impossible (e.g. assigning smoking, withholding medical treatment).
- Alternatives: prospective cohort studies, instrumental-variable techniques, natural experiments (field of causal inference).
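Where an experiment is feasible, the "randomised" part of a randomised controlled trial can be sketched as below. The function name `randomise` and the participant IDs are hypothetical. The idea is that chance, not the researcher or the participant, decides who receives the treatment, so lurking variables tend to balance across groups on average.

```python
import random

def randomise(units, seed=None):
    """Randomly split units into treatment and control groups of (near-)equal size."""
    rng = random.Random(seed)
    shuffled = list(units)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"treatment": shuffled[:half], "control": shuffled[half:]}

# Hypothetical participant IDs.
participants = [f"P{i:02d}" for i in range(1, 21)]
groups = randomise(participants, seed=42)
print(groups["treatment"])
print(groups["control"])
```

Passing a seed makes the allocation reproducible for auditing. Real trials often use more elaborate schemes (e.g. blocking), which this sketch does not attempt.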
Key Quantitative Formulae Quoted
- Body Mass Index:
BMI = \frac{\text{Weight (kg)}}{\text{Height (m)}^2}
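A minimal sketch of the BMI formula in code, with the unit conversion made explicit since heights are often recorded in centimetres. The function name `bmi` and the example person (70 kg, 175 cm) are assumptions for illustration.

```python
def bmi(weight_kg, height_m):
    """Body Mass Index in kg/m^2 from weight in kilograms and height in metres."""
    if weight_kg <= 0 or height_m <= 0:
        raise ValueError("weight and height must be positive")
    return weight_kg / height_m ** 2

# Hypothetical person: 70 kg, 175 cm. Convert centimetres to metres first.
value = bmi(70, 175 / 100)
print(f"BMI = {value:.1f} kg/m^2")
```

Passing the height in centimetres by mistake would shrink the result by a factor of 10,000, a reminder of why every quantitative variable needs its units stated.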
Big Picture – Why All This Matters
- Correct identification of variable types guides every later statistical method (see forthcoming “cheat-sheet” table of analyses).
- Solid study design prevents garbage-in/garbage-out, saving money and reputation.
- Understanding association, causation, lurking & confounding protects against invalid claims and misguided policy or personal decisions.
Practical Pointers for the Course
- Whenever faced with data:
- Name the population and sample.
- Classify each variable (categorical vs. quantitative; explanatory vs. response).
- Decide whether the design is observational or experimental (look for “treatment”).
- Ask what lurking/confounding variables might exist.
- Use units for every quantitative measurement.
- Seek statistical advice before data collection.
- Remember: well-constructed samples can yield high-accuracy estimates without a full census.