Introduction to Data Collection – Comprehensive Study Notes

Welcome to PSTAT $5LS$ , Summer session.
Today’s agenda:
- Intro to Data Collection (Slide Set $2$ )
- Preview of Simulation-Based Inference for $p$ (coming next lecture).
Tomorrow’s agenda: Full treatment of Simulation-Based Inference for $p$ .
Upcoming deadlines (all due at $11{:}59$ PM):
- Homework $1$ : $\text{Tuesday, July }1$
- Homework $2$ : $\text{Friday, July }4$
Instructor office hours: $\text{Tues/Thur }2{-}3$ PM via Zoom.

Observational Study
- Researchers record data without interfering in how data arise.
- Can detect associations but cannot confirm causation because of possible confounding variables (lurking factors that influence both variables of interest).
Experiment
- Researchers randomly assign units to treatment conditions.
- Randomization balances confounders across treatment groups on average ⇒ allows cause-and-effect conclusions.

Explanatory variable: The variable we suspect influences the outcome.
Response variable: The outcome we measure.
Arrow diagram: $\text{explanatory}\;\rightarrow\;\text{response}$
Naming variables does not itself create causality; it only frames the question.

Observational design
- Sample two naturally occurring groups: those who choose to use screens at bedtime vs. those who do not.
- Compute & compare mean daytime attention spans.
- Vulnerable to confounders (e.g., sleep quality, caffeine use).
Experimental design
- Sample participants, randomly assign them either to use screens or to abstain.
- Because assignment is random, systematic differences in attention span (beyond chance variation) can plausibly be attributed to bedtime-screen use.
Question: If a difference in means is observed, can we claim causation?
- Observational study ⇒ No (association only).
- Experiment ⇒ Yes, provided design quality (randomization, blinding, etc.) holds.
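
The contrast between the two designs can be illustrated with a small simulation. This is a minimal sketch under an assumed toy model, not data from any real study: screens are given no real effect on attention, caffeine lowers attention by 10 points, and caffeine users are more likely to choose screens at bedtime. The function name `simulate` and all numeric values are hypothetical.

```python
import random
import statistics

def simulate(n=20_000, randomized=False, seed=0):
    """Return mean attention of screen users minus mean attention of non-users.

    Assumed toy model: bedtime screens have NO real effect on attention.
    Caffeine (the confounder) lowers attention and, in the observational
    setting, also makes screen use more likely.
    """
    rng = random.Random(seed)
    users, non_users = [], []
    for _ in range(n):
        caffeine = rng.random() < 0.5
        if randomized:
            # Experiment: a coin flip decides who uses screens.
            uses_screen = rng.random() < 0.5
        else:
            # Observational: people choose, and caffeine drives the choice.
            uses_screen = rng.random() < (0.8 if caffeine else 0.2)
        attention = 50 - (10 if caffeine else 0) + rng.gauss(0, 5)
        (users if uses_screen else non_users).append(attention)
    return statistics.mean(users) - statistics.mean(non_users)

obs_diff = simulate(randomized=False)
exp_diff = simulate(randomized=True)
print(f"observational difference: {obs_diff:.2f}")
print(f"experimental difference:  {exp_diff:.2f}")
```

Under these assumptions the observational comparison shows a sizeable gap even though screens do nothing, while the randomized comparison lands near zero, because the coin flip breaks the link between caffeine and screen use.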

Volunteers are randomly assigned to read text in Arial or Helvetica, and the groups' average reading speeds are compared.
Because of random assignment, this is an experiment (not merely observational).
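
One way random assignment for the font example could be carried out is sketched below. The helper `randomly_assign` and the volunteer IDs are hypothetical; the notes do not specify how the assignment was actually done.

```python
import random

def randomly_assign(units, treatments=("Arial", "Helvetica"), seed=None):
    """Shuffle the units, then deal them out to the treatments in turn."""
    rng = random.Random(seed)
    shuffled = list(units)
    rng.shuffle(shuffled)
    groups = {t: [] for t in treatments}
    for i, unit in enumerate(shuffled):
        groups[treatments[i % len(treatments)]].append(unit)
    return groups

volunteers = [f"volunteer_{i}" for i in range(1, 11)]  # hypothetical IDs
groups = randomly_assign(volunteers, seed=42)
for font, members in groups.items():
    print(font, members)
```

Shuffling first and then dealing keeps the group sizes equal (or within one of each other) while leaving the decision of who lands where entirely to chance.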

$\textbf{Cost \& Time}$ : Full enumeration requires vast resources.
$\textbf{Accessibility}$ : Some population members are hard to reach; their absence can bias results.
$\textbf{Population Drift}$ : By the time you finish, population characteristics may change.

Tasting one spoonful = sample.
Declaring the whole soup salty/unsalty = inference.
Representativeness demand: Stirring before tasting mimics using randomness to ensure sample ≈ population.

Goal: Guarantee every individual has a known, non-zero chance of selection so that statistical inference is valid.
Simple Random Sample (SRS)
- Each subset of size $n$ has equal probability of selection.
- No built-in structure linking selected individuals.
Other probability samples (not emphasized in PSTAT $5LS$ ):
- Stratified, Cluster, Multistage.
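
A simple random sample can be drawn with Python's standard library. This is a minimal sketch; the sampling frame of 100 numbered IDs is a made-up example.

```python
import random

population = list(range(1, 101))  # hypothetical sampling frame: IDs 1..100
rng = random.Random(5)            # fixed seed only so the draw is repeatable

# random.sample draws without replacement, so every subset of size k
# is equally likely to be chosen, which is the defining property of an SRS.
srs = rng.sample(population, k=10)
print(sorted(srs))
```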

If the research aim is cause-and-effect, choose an experimental framework.
Validity & reliability hinge on four foundational principles.

$\textbf{Control}$ – Compare against a baseline/control group.
$\textbf{Randomize}$ – Use chance to allocate treatments.
$\textbf{Replicate}$ – Large $n$ or independent repetition improves precision & generalizability.
$\textbf{Block}$ – Group units that share a confounder, then randomize within each block.
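
The claim that larger $n$ improves precision can be checked with a quick simulation. This is a sketch under assumed conditions (independent draws from a standard normal population); the function name `sd_of_sample_means` is hypothetical.

```python
import random
import statistics

def sd_of_sample_means(n, reps=2000, seed=0):
    """Estimate how much the sample mean varies across repeated samples of size n."""
    rng = random.Random(seed)
    means = [
        statistics.mean(rng.gauss(0, 1) for _ in range(n))
        for _ in range(reps)
    ]
    return statistics.stdev(means)

spread = {n: sd_of_sample_means(n) for n in (25, 100)}
for n, s in spread.items():
    print(f"n = {n:3d}: spread of sample means = {s:.3f}")
```

For this assumed population the spread should be close to $1/\sqrt{n}$, so quadrupling the sample size roughly halves the variability of the estimate.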

Objective: Compare a traditional lecture course vs. an interactive online course for learning $R$ .
Potential confounder: Prior coding experience.
Design steps:
1. Identify variables
- Explanatory: Course type.
- Response: Mastery of $R$ (e.g., test score).
- Blocking variable: Prior experience (yes/no).
2. Form blocks
- Block $1$ : Experienced coders.
- Block $2$ : Novices.
3. Randomize within blocks
- Half of each block → lecture; half → online.
Benefit: Prior experience is balanced across the two course types, so it cannot explain a difference in scores; this isolates the course-type effect.
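
The blocked design above can be sketched in code as follows. The helper `block_randomize`, the student IDs, and the 8/12 split between experienced coders and novices are all hypothetical, chosen only to illustrate the procedure.

```python
import random
from collections import defaultdict

def block_randomize(units, block_of, treatments=("lecture", "online"), seed=None):
    """Group units into blocks, then randomize to treatments within each block."""
    rng = random.Random(seed)
    blocks = defaultdict(list)
    for unit in units:
        blocks[block_of(unit)].append(unit)
    assignment = {}
    for members in blocks.values():
        rng.shuffle(members)  # chance decides the order within the block
        for i, unit in enumerate(members):
            assignment[unit] = treatments[i % len(treatments)]
    return assignment

# Hypothetical roster: True = prior coding experience, False = novice.
students = {f"student_{i}": (i <= 8) for i in range(1, 21)}
assignment = block_randomize(students, block_of=students.get, seed=1)

for experienced in (True, False):
    in_block = [s for s, e in students.items() if e == experienced]
    n_lecture = sum(assignment[s] == "lecture" for s in in_block)
    print(f"experienced={experienced}: {n_lecture} lecture, "
          f"{len(in_block) - n_lecture} online")
```

Because the shuffle happens separately inside each block, both course types receive the same mix of experienced coders and novices.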

Single-blind: Participants unaware of treatment; researchers know.
Double-blind: Neither participants nor data-collecting researchers know.
Rationale:
- Participant expectations can alter behavior (placebo/nocebo effects).
- Researcher expectations can subtly influence observation or analysis.
- Blinding mitigates these biases, making observed effects more trustworthy.

Researchers must weigh feasibility vs. rigor (e.g., fully blinding a pedagogy study may be impossible; partial masking or objective grading rubrics can help).
Random assignment raises ethical issues when it means withholding a beneficial treatment; these are often addressed by requiring equipoise (genuine uncertainty about which treatment is better) or by using delayed-treatment designs.
Sampling fairness intersects with social equity: under-coverage of marginalized groups can skew policy decisions.

Observational studies detect $\textbf{association}$ , not causation.
Experiments with random assignment can establish $\textbf{causality}$ .
Random sampling legitimizes generalizing results from sample → population.
Blocking combats confounding; blinding combats bias.
Strong study design = trustworthy conclusions, ethical integrity, and reduced error.