Introduction to Data Collection – Comprehensive Study Notes

Course Logistics

  • Welcome to PSTAT 5LS, Summer session.
  • Today’s agenda:
    • Intro to Data Collection (Slide Set 2)
    • Preview of Simulation-Based Inference for p (coming next lecture).
  • Tomorrow’s agenda: Full treatment of Simulation-Based Inference for p.
  • Upcoming deadlines (all due at 11:59 PM):
    • Homework 1: Tuesday, July 1
    • Homework 2: Friday, July 4
  • Instructor office hours: Tues/Thur 2-3 PM via Zoom.

Two Broad Study Types

  • Observational Study
    • Researchers record data without interfering in how data arise.
    • Can detect associations but cannot confirm causation because of possible confounding variables (lurking factors that influence both variables of interest).
  • Experiment
    • Researchers randomly assign units to treatment conditions.
    • Randomization balances confounders ⇒ allows cause-and-effect conclusions.

Variables in a Study

  • Explanatory variable: Candidate influencer.
  • Response variable: Outcome we measure.
  • Arrow diagram: explanatory → response
  • Naming variables does not itself create causality; it only frames the question.

Example: Screens at Bedtime & Attention Span

  • Observational design
    • Sample two naturally occurring groups: those who choose to use screens at bedtime vs. those who do not.
    • Compute & compare mean daytime attention spans.
    • Vulnerable to confounders (e.g., sleep quality, caffeine use).
  • Experimental design
    • Sample participants, randomly assign them either to use screens or to abstain.
    • Because random assignment balances confounders on average, a systematic difference in attention span (beyond chance variation) can plausibly be attributed to bedtime-screen use.
  • Question: If a difference in means is observed, can we claim causation?
    • Observational study ⇒ No (association only).
    • Experiment ⇒ Yes, provided design quality (randomization, blinding, etc.) holds.
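
The contrast between the two designs can be illustrated with a small simulation. This is only a sketch under invented assumptions: caffeine use is the sole confounder, it lowers attention span by 10 units, it makes bedtime-screen use more likely (80% vs. 20%), and screens themselves have no effect at all. None of these numbers come from the course; they are hypothetical and chosen to make the pattern visible.

```python
import random
from statistics import mean

def difference_in_means(n, randomized, seed=0):
    """Mean attention (screen users) minus mean attention (non-users).

    All parameters are hypothetical. The true effect of screens is zero.
    """
    rng = random.Random(seed)
    screen, no_screen = [], []
    for _ in range(n):
        caffeine = rng.random() < 0.5  # hypothetical confounder
        if randomized:
            # Experiment: a coin flip decides, ignoring caffeine.
            uses_screen = rng.random() < 0.5
        else:
            # Observational: caffeine users choose screens more often.
            uses_screen = rng.random() < (0.8 if caffeine else 0.2)
        # Attention depends on caffeine only, never on screen use.
        attention = 50 - (10 if caffeine else 0) + rng.gauss(0, 5)
        (screen if uses_screen else no_screen).append(attention)
    return mean(screen) - mean(no_screen)

observational = difference_in_means(20_000, randomized=False)
experimental = difference_in_means(20_000, randomized=True)
print(f"observational difference: {observational:.2f}")
print(f"experimental difference:  {experimental:.2f}")
```

Under these assumptions the observational comparison shows a sizable gap (about -6 in expectation) even though screens do nothing, while the randomized comparison stays close to zero.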

Quick Practice: Arial vs. Helvetica Reading Speed

  • Volunteers are randomly assigned to read text in Arial or Helvetica; average reading speeds are then compared.
  • Because of random assignment, this is an experiment (not merely observational).
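
One way random assignment might be carried out is to shuffle the volunteer list and split it in half. The sketch below assumes 20 volunteers with made-up labels; the function name assign_fonts and the seed are illustrative choices, not part of the course material.

```python
import random

def assign_fonts(volunteers, seed=None):
    """Randomly split volunteers into two font groups of (nearly) equal size."""
    rng = random.Random(seed)
    shuffled = list(volunteers)   # copy so the caller's list is untouched
    rng.shuffle(shuffled)         # chance alone decides the order
    half = len(shuffled) // 2
    return {"Arial": shuffled[:half], "Helvetica": shuffled[half:]}

# Hypothetical volunteer labels.
volunteers = [f"volunteer_{i}" for i in range(1, 21)]
groups = assign_fonts(volunteers, seed=42)
print(len(groups["Arial"]), len(groups["Helvetica"]))
```

Because the researcher, not the volunteer, determines who reads which font, the groups should be similar on average in every other respect.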

Why Not Take a Census?

  • Cost & Time: Full enumeration requires vast resources.
  • Accessibility: Some population members are hard to reach; their absence can bias results.
  • Population Drift: By the time you finish, population characteristics may change.

Sampling Analogy: Tasting Soup

  • Tasting one spoonful = sample.
  • Declaring the whole soup salty/unsalty = inference.
  • Representativeness demand: Stirring before tasting mimics using randomness to ensure sample ≈ population.

Representative Sampling

  • Goal: Guarantee every individual has a known, non-zero chance of selection so that statistical inference is valid.
  • Simple Random Sample (SRS)
    • Each subset of size n has equal probability of selection.
    • No built-in structure linking selected individuals.
  • Other probability samples (not emphasized in PSTAT 5LS):
    • Stratified, Cluster, Multistage.
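
A simple random sample can be drawn in Python with random.sample, which selects without replacement so that every subset of the requested size is equally likely. The sketch below assumes a hypothetical sampling frame of 100 ID numbers; the helper name and seed are illustrative.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw an SRS of size n (without replacement) from the population."""
    rng = random.Random(seed)
    return rng.sample(list(population), n)

roster = list(range(1, 101))   # hypothetical sampling frame: IDs 1..100
sample = simple_random_sample(roster, n=10, seed=1)
print(sorted(sample))
```

In practice the hard part is usually obtaining a complete sampling frame, not the draw itself.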

Experiments & Causal Questions

  • If the research aim is cause-and-effect, choose an experimental framework.
  • Validity & reliability hinge on four foundational principles.

Four Principles of Experimental Design

  1. Control – Compare against a baseline/control group.
  2. Randomize – Use chance to allocate treatments.
  3. Replicate – Large n or independent repetition improves precision & generalizability.
  4. Block – Group units that share a confounder, then randomize within each block.

Blocking Example: Learning R

  • Objective: Compare a traditional lecture course vs. an interactive online course for learning R.
  • Potential confounder: Prior coding experience.
  • Design steps:
    1. Identify variables
      • Explanatory: Course type.
      • Response: Mastery of R (e.g., test score).
      • Blocking variable: Prior experience (yes/no).
    2. Form blocks
      • Block 1: Experienced coders.
      • Block 2: Novices.
    3. Randomize within blocks
      • Half of each block → lecture; half → online.
  • Benefit: Differences in prior experience are neutralized, isolating course-type effect.
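
The blocking procedure above can be sketched in code: group students by prior experience, shuffle within each group, then alternate treatments so each block is split evenly. The eight student labels, the function name block_randomize, and the seed are hypothetical.

```python
import random
from collections import defaultdict

def block_randomize(block_of, treatments=("lecture", "online"), seed=None):
    """Randomly assign treatments within each block.

    block_of maps each unit to its block label (here, prior experience).
    """
    rng = random.Random(seed)
    blocks = defaultdict(list)
    for unit, block in block_of.items():
        blocks[block].append(unit)            # form the blocks
    assignment = {}
    for members in blocks.values():
        rng.shuffle(members)                  # randomize order within the block
        for i, unit in enumerate(members):
            # Alternating after a shuffle splits the block evenly.
            assignment[unit] = treatments[i % len(treatments)]
    return assignment

# Hypothetical roster: s1..s4 have coding experience, s5..s8 are novices.
students = {f"s{i}": ("experienced" if i <= 4 else "novice") for i in range(1, 9)}
assignment = block_randomize(students, seed=7)
for student, course in sorted(assignment.items()):
    print(student, students[student], course)
```

Each block ends up with two students per course type, so neither course gets more experienced coders than the other.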

Blinding to Reduce Bias

  • Single-blind: Participants unaware of treatment; researchers know.
  • Double-blind: Neither participants nor data-collecting researchers know.
  • Rationale:
    • Participant expectations can alter behavior (placebo/nocebo effects).
    • Researcher expectations can subtly influence observation or analysis.
    • Blinding mitigates these biases, making observed effects more trustworthy.

Ethical, Practical & Philosophical Notes

  • Researchers must weigh feasibility vs. rigor (e.g., fully blinding a pedagogy study may be impossible; partial masking or objective grading rubrics can help).
  • Random assignment raises ethical issues when withholding beneficial treatments; often addressed via equipoise or delayed-treatment designs.
  • Sampling fairness intersects with social equity: under-coverage of marginalized groups can skew policy decisions.

Statistical Symbols & Formulae Appearing in Later Units (Preview)

  • Proportion parameter: p
  • Sample proportion: \hat{p} = \dfrac{\text{number of successes}}{n}
  • These feed into simulation-based inference techniques you’ll see next.
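
As a quick numerical illustration of the sample proportion, the sketch below computes it for ten made-up yes/no outcomes (True counts as a success). The data and the helper name sample_proportion are hypothetical.

```python
def sample_proportion(outcomes):
    """Return p-hat: the number of successes divided by the sample size n."""
    n = len(outcomes)
    if n == 0:
        raise ValueError("sample must contain at least one observation")
    successes = sum(1 for outcome in outcomes if outcome)
    return successes / n

# Hypothetical sample of n = 10 with 4 successes.
outcomes = [True, False, True, True, False, False, True, False, False, False]
p_hat = sample_proportion(outcomes)
print(p_hat)  # 4 / 10 = 0.4
```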

Big-Picture Summary

  • Observational studies detect association, not causation.
  • Experiments with random assignment can establish causality.
  • Random sampling legitimizes generalizing results from sample → population.
  • Blocking combats confounding; blinding combats bias.
  • Strong study design = trustworthy conclusions, ethical integrity, and reduced error.