Introduction to Data Collection – Comprehensive Study Notes
Course Logistics
- Welcome to PSTAT 5LS, Summer session.
- Today’s agenda:
- Intro to Data Collection (Slide Set 2)
- Preview of Simulation-Based Inference for p (coming next lecture).
- Tomorrow’s agenda: Full treatment of Simulation-Based Inference for p.
- Upcoming deadlines (all due at 11:59 PM):
- Homework 1: Tuesday, July 1
- Homework 2: Friday, July 4
- Instructor office hours: Tues/Thur 2–3 PM via Zoom.
Two Broad Study Types
- Observational Study
- Researchers record data without interfering in how data arise.
- Can detect associations but cannot confirm causation because of possible confounding variables (lurking factors that influence both variables of interest).
- Experiment
- Researchers randomly assign units to treatment conditions.
- Random assignment balances confounders (known and unknown) across groups on average ⇒ allows cause-and-effect conclusions.
Variables in a Study
- Explanatory variable: Candidate influencer.
- Response variable: Outcome we measure.
- Arrow diagram: explanatory→response
- Naming variables does not itself create causality; it only frames the question.
Example: Screens at Bedtime & Attention Span
- Observational design
- Sample two naturally occurring groups: those who choose to use screens at bedtime vs. those who do not.
- Compute & compare mean daytime attention spans.
- Vulnerable to confounders (e.g., sleep quality, caffeine use).
- Experimental design
- Sample participants, randomly assign them either to use screens or to abstain.
- Because random assignment balances confounders on average, a difference in attention span larger than chance alone would produce can plausibly be attributed to bedtime-screen use.
- Question: If a difference in means is observed, can we claim causation?
- Observational study ⇒ No (association only).
- Experiment ⇒ Yes, provided design quality (randomization, blinding, etc.) holds.
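The contrast between the two designs can be illustrated with a small simulation. This is only a sketch under invented assumptions: caffeine use is the sole confounder, bedtime-screen use has no true effect on attention span, and every number (probabilities, means, standard deviations) is made up for illustration.

```python
import random
import statistics

def simulate(design, n=20000, seed=1):
    """Return mean attention (screen users) minus mean attention (non-users).

    Illustrative assumptions: screens have NO true effect; caffeine lowers
    attention by 5 units and, in the observational design only, makes people
    more likely to choose screens at bedtime.
    """
    rng = random.Random(seed)
    screens, no_screens = [], []
    for _ in range(n):
        caffeine = rng.random() < 0.5
        if design == "observational":
            # People self-select: caffeine users pick screens more often.
            uses_screen = rng.random() < (0.8 if caffeine else 0.2)
        else:
            # Experiment: a coin flip decides, regardless of caffeine use.
            uses_screen = rng.random() < 0.5
        attention = rng.gauss(50, 5) - (5 if caffeine else 0)
        (screens if uses_screen else no_screens).append(attention)
    return statistics.mean(screens) - statistics.mean(no_screens)

obs_diff = simulate("observational")
exp_diff = simulate("experiment")
print(f"observational difference: {obs_diff:.2f}")
print(f"experimental difference:  {exp_diff:.2f}")
```

Under these made-up numbers, the observational difference should come out near -3 even though screens do nothing, while the experimental difference should sit close to 0. The gap in the observational design is produced entirely by the confounder.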
Quick Practice: Arial vs. Helvetica Reading Speed
- Volunteers are randomly assigned to read text in Arial or Helvetica, and the average reading speeds of the two groups are compared.
- Because of random assignment, this is an experiment (not merely observational).
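One way random assignment like this could be carried out in code is to shuffle the volunteers and deal them out to the two fonts in turn. The helper name `randomly_assign`, the volunteer IDs, and the group size of 20 are all hypothetical, chosen only for this sketch.

```python
import random

def randomly_assign(units, treatments, seed=None):
    """Shuffle the units, then deal them out to the treatments in turn."""
    rng = random.Random(seed)
    shuffled = list(units)          # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    groups = {t: [] for t in treatments}
    for i, unit in enumerate(shuffled):
        groups[treatments[i % len(treatments)]].append(unit)
    return groups

volunteers = [f"volunteer_{i}" for i in range(1, 21)]  # hypothetical IDs
groups = randomly_assign(volunteers, ["Arial", "Helvetica"], seed=42)
print({font: len(members) for font, members in groups.items()})
# {'Arial': 10, 'Helvetica': 10}
```

Dealing from a shuffled list keeps the group sizes equal, which independent coin flips for each volunteer would not guarantee.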
Why Not Take a Census?
- Cost & Time: Full enumeration requires vast resources.
- Accessibility: Some population members are hard to reach; their absence can bias results.
- Population Drift: By the time you finish, population characteristics may change.
Sampling Analogy: Tasting Soup
- Tasting one spoonful = sample.
- Declaring the whole soup salty/unsalty = inference.
- Representativeness: Stirring before tasting plays the role of random selection, which makes the spoonful (sample) tend to resemble the whole pot (population).
Representative Sampling
- Goal: Guarantee every individual has a known, non-zero chance of selection so that statistical inference is valid.
- Simple Random Sample (SRS)
- Each subset of size n has equal probability of selection.
- Individuals are not selected in pre-formed groups (e.g., households or classrooms); any combination of n individuals can occur.
- Other probability samples (not emphasized in PSTAT 5LS):
- Stratified, Cluster, Multistage.
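The defining property of an SRS (every subset of size n equally likely) can be checked empirically. The sketch below uses a hypothetical five-person population and Python's standard `random.sample`, which draws without replacement. The population labels, the sample size, and the number of repetitions are illustrative choices.

```python
import random
from collections import Counter
from itertools import combinations

population = ["A", "B", "C", "D", "E"]  # hypothetical sampling frame
rng = random.Random(0)

# One simple random sample of size n = 2, drawn without replacement.
srs = rng.sample(population, 2)
print("one SRS:", srs)

# Empirical check: over many draws, each of the 10 possible subsets
# of size 2 should turn up about equally often (share near 0.1).
draws = 100000
counts = Counter(frozenset(rng.sample(population, 2)) for _ in range(draws))
shares = {subset: counts[frozenset(subset)] / draws
          for subset in combinations(population, 2)}
for subset, share in shares.items():
    print(subset, round(share, 3))
```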
Experiments & Causal Questions
- If the research aim is cause-and-effect, choose an experimental framework.
- Validity & reliability hinge on four foundational principles.
Four Principles of Experimental Design
- Control – Compare against a baseline/control group.
- Randomize – Use chance to allocate treatments.
- Replicate – Large n or independent repetition improves precision & generalizability.
- Block – Group units by a known potential confounder (units in a block share its value), then randomize within each block.
Blocking Example: Learning R
- Objective: Compare a traditional lecture course vs. an interactive online course for learning R.
- Potential confounder: Prior coding experience.
- Design steps:
- Identify variables
- Explanatory: Course type.
- Response: Mastery of R (e.g., test score).
- Blocking variable: Prior experience (yes/no).
- Step 1 – Form blocks
- Block 1: Experienced coders;
- Block 2: Novices.
- Step 2 – Randomize within blocks
- A randomly chosen half of each block → lecture; the other half → online.
- Benefit: Differences in prior experience are neutralized, isolating course-type effect.
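The two design steps above can be sketched in code as follows. The function name `block_randomize`, the roster of 16 students, and the 6/10 split between experienced coders and novices are hypothetical; only the procedure (form blocks, then randomize within each block) comes from the notes.

```python
import random

def block_randomize(units, block_of, treatments, seed=None):
    """Randomly assign treatments separately within each block.

    units      : list of unit identifiers
    block_of   : dict mapping unit -> block label
    treatments : list of treatment names
    """
    rng = random.Random(seed)

    # Step 1: form blocks.
    blocks = {}
    for unit in units:
        blocks.setdefault(block_of[unit], []).append(unit)

    # Step 2: shuffle within each block, then alternate treatments.
    assignment = {}
    for members in blocks.values():
        rng.shuffle(members)
        for i, unit in enumerate(members):
            assignment[unit] = treatments[i % len(treatments)]
    return assignment

# Hypothetical roster: 6 experienced coders and 10 novices.
students = [f"s{i}" for i in range(16)]
experience = {s: ("experienced" if i < 6 else "novice")
              for i, s in enumerate(students)}

assignment = block_randomize(students, experience, ["lecture", "online"], seed=7)

for block in ("experienced", "novice"):
    in_block = [s for s in students if experience[s] == block]
    n_lecture = sum(assignment[s] == "lecture" for s in in_block)
    print(block, "lecture:", n_lecture, "online:", len(in_block) - n_lecture)
# experienced lecture: 3 online: 3
# novice lecture: 5 online: 5
```

Because the split is made inside each block, both course types receive the same mix of experienced and novice students no matter how the shuffle turns out.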
Blinding to Reduce Bias
- Single-blind: Participants unaware of treatment; researchers know.
- Double-blind: Neither participants nor data-collecting researchers know.
- Rationale:
- Participant expectations can alter behavior (placebo/nocebo effects).
- Researcher expectations can subtly influence observation or analysis.
- Blinding mitigates these biases, making observed effects more trustworthy.
Ethical, Practical & Philosophical Notes
- Researchers must weigh feasibility vs. rigor (e.g., fully blinding a pedagogy study may be impossible; partial masking or objective grading rubrics can help).
- Random assignment raises ethical issues when withholding beneficial treatments; often addressed via equipoise or delayed-treatment designs.
- Sampling fairness intersects with social equity: under-coverage of marginalized groups can skew policy decisions.
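Preview: Notation for Proportions
- Proportion parameter: p (the population proportion, usually unknown).
- Sample proportion: \hat{p} = \dfrac{\text{number of successes}}{n}, where n is the sample size.
- These feed into the simulation-based inference techniques you’ll see next.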
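As a small worked example of the sample proportion, the sketch below computes \hat{p} from a list of yes/no outcomes. The function name `sample_proportion` and the eight responses are hypothetical, invented only to show the arithmetic.

```python
def sample_proportion(outcomes):
    """Return p-hat: the fraction of outcomes that are successes (True)."""
    if not outcomes:
        raise ValueError("need at least one observation")
    return sum(outcomes) / len(outcomes)

# Hypothetical sample: did each of n = 8 respondents use a screen at bedtime?
responses = [True, False, True, True, False, True, False, True]
p_hat = sample_proportion(responses)
print(p_hat)  # 5 successes out of 8 -> 0.625
```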
Big-Picture Summary
- Observational studies detect association, not causation.
- Experiments with random assignment can establish causality.
- Random sampling legitimizes generalizing results from sample → population.
- Blocking controls for known confounders; blinding reduces bias from participant and researcher expectations.
- Strong study design = trustworthy conclusions, ethical integrity, and reduced error.