AP Statistics Unit 3 Experimental Design: How to Collect Data for Causal Conclusions
Introduction to Experiments and Observational Studies
When you collect data, you’re not just gathering numbers—you’re choosing a method that determines what you’ll be allowed to conclude. In AP Statistics, the big divide is between observational studies (you watch what naturally happens) and experiments (you actively impose a treatment). That choice controls whether you can talk about association only, or whether you have a legitimate pathway to arguing cause and effect.
Variables, treatments, and the basic language of studies
A variable is any characteristic recorded on individuals (height, test score, blood pressure, brand preference). In questions about relationships, two roles show up repeatedly:
- Explanatory variable: the variable you suspect may help explain differences in another variable.
- Response variable: the outcome you measure to see the effect.
In an experimental setting, the explanatory variable is often called a factor—a variable the experimenter controls. Each factor has levels (specific values of the factor). A treatment is a specific combination of factor levels that is actually imposed on experimental units.
The people/objects you apply treatments to are experimental units. If the units are human, they’re often called subjects.
Why this vocabulary matters: AP exam questions frequently test whether you can correctly identify these parts from context. If you mix up “treatment” and “response,” your design and your conclusions will be backward.
Observational studies: measuring without intervening
An observational study records values of variables on individuals without assigning treatments. You might survey people about their sleep and GPA, or compare existing medical records of smokers and non-smokers.
Why it matters: Observational studies are often easier, cheaper, and sometimes the only ethical choice (you cannot randomly assign people to smoke). But because the groups may differ in many ways besides the explanatory variable, observational studies usually cannot support a strong cause-and-effect conclusion.
How it works (typical structure):
- Define a population or source of individuals.
- Measure an explanatory variable and a response variable on each individual.
- Analyze the association between them.
- Consider alternative explanations (especially confounding).
A key phrase for your AP mindset: observational studies can show association, not causation.
Experiments: imposing treatments to learn about cause
An experiment deliberately imposes one or more treatments on individuals to observe a response. The defining feature is random assignment (more on this soon): the experimenter uses a random process to decide which units receive which treatment(s).
Why it matters: Random assignment creates treatment groups that are, in expectation, similar in all other ways. That makes it much harder for outside variables to explain differences in outcomes. This is the core reason experiments (when well-designed) can justify cause-and-effect conclusions.
How it works (conceptual outline):
- Decide on treatments (what you will impose).
- Decide how you will assign units to treatments (ideally, randomly).
- Apply the treatments.
- Measure the response.
- Compare groups.
Notice what’s missing: you’re not just “finding a relationship.” You’re creating conditions to isolate the effect of the treatment.
Association vs causation: what you are allowed to say
- Association means two variables tend to vary together in the data (positive association, negative association, or no clear association).
- Causation means changing one variable produces a change in the other.
In AP Statistics, a safe rule is:
- Random assignment (in an experiment) supports causal language.
- Lack of random assignment (in observational studies) usually limits you to association.
A common trap is thinking “large sample size” automatically implies causation. A huge observational study can give very precise estimates of an association—but it doesn’t magically remove confounding.
Example: same research question, two different study types
Question: Does drinking caffeinated coffee improve short-term memory?
Observational approach: Survey students about coffee intake and give them a memory test. You might find coffee drinkers score higher.
- Problem: coffee drinkers may differ from non-drinkers in sleep habits, workload, stress, study time, and more.
- Conclusion allowed: an association between coffee drinking and memory score (if you find one).
Experimental approach: Recruit volunteers and randomly assign them to drink either caffeinated coffee or decaf (or a placebo beverage with similar taste), then administer the same memory test.
- Strength: random assignment helps balance sleep, workload, etc. across groups.
- Conclusion allowed: if the caffeinated group performs better, you can argue caffeine caused an improvement (within the study conditions).
Exam Focus
- Typical question patterns:
- “Is this an experiment or an observational study? Justify your answer.”
- “Identify the explanatory and response variables (and sometimes the experimental units).”
- “Can we conclude a cause-and-effect relationship? Explain.”
- Common mistakes:
- Confusing random sampling (how you select people) with random assignment (how you place them into groups).
- Claiming causation from an observational study because the association is “strong” or the sample is “large.”
- Calling any study with a survey “observational” without checking whether treatments were imposed.
Principles of Experimental Design (Control, Randomization, Replication)
A good experiment doesn’t happen by accident. AP Statistics emphasizes three core principles—control, randomization, and replication—because they work together to produce results that are both interpretable and trustworthy.
Control: creating a fair comparison
Control means you do your best to keep the groups comparable and isolate the effect of the treatment. In practice, “control” shows up in a few ways.
Control through comparison groups
A control group is a group that receives a standard treatment, no treatment, or a placebo, serving as a baseline for comparison.
Why it matters: Without a comparison group, you can’t tell whether changes in the response are due to the treatment or would have happened anyway.
For example, if you give a new study app to a class and their scores rise, that improvement could be due to extra teacher attention, familiarity with the test format, or just more time spent studying. A control group helps separate “app effect” from “everything else.”
Placebos and the placebo effect
A placebo is a treatment that looks like the real treatment but has no active ingredient (or no real “mechanism” expected to affect the response). The placebo effect is when subjects improve (or change) simply because they believe they are being treated.
Why it matters: If you don’t use a placebo when it’s appropriate, you might wrongly attribute the placebo effect to the actual treatment.
Blinding to reduce bias
- Blinding means subjects do not know which treatment they received.
- Double-blind means neither the subjects nor the people measuring/evaluating the response know which treatment each subject received.
Why it matters: Knowledge of treatment can change behavior (subjects) and can change measurement or evaluation (researchers). Even well-meaning researchers can subconsciously treat subjects differently or interpret ambiguous outcomes in biased ways.
A very AP-relevant nuance: blinding fights bias (systematic error), while randomization fights confounding (systematic differences between groups). They solve different problems.
Randomization: letting chance create comparable groups
Randomization means using a chance process to assign experimental units to treatments. The most common form is random assignment.
Why it matters: Random assignment tends to balance both known and unknown variables across groups. You usually can’t measure or control every relevant variable (sleep, motivation, genetics, prior knowledge). Randomization is your best defense against those variables becoming alternative explanations.
How it works (what AP expects): You should be able to describe a clear, chance-based method, such as:
- Label each subject from 01 to 60 and use a random number generator to choose 30 for Treatment A; the remaining 30 get Treatment B.
- Put 60 numbered slips in a hat and draw 30; those subjects get Treatment A, and the remaining 30 get Treatment B.
What’s not enough: “We split them evenly” or “We assigned them randomly” with no description. AP graders often want the mechanism.
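The label-and-draw method described above can be sketched in a few lines. This is just an illustration of the mechanism (the 60 subjects and 30-per-group split follow the example in the text; the fixed seed is only there to make the demonstration reproducible):

```python
import random

# Label the subjects 1 through 60 (the "slips in a hat").
subjects = list(range(1, 61))

random.seed(2024)          # fixed seed so the demonstration is reproducible
random.shuffle(subjects)   # shuffling the slips = drawing them at random

treatment_a = sorted(subjects[:30])   # first 30 draws get Treatment A
treatment_b = sorted(subjects[30:])   # remaining 30 get Treatment B

print(len(treatment_a), len(treatment_b))  # 30 30
```

On the exam you would describe this mechanism in words, but the key features are the same: every subject is labeled, a chance process decides the groups, and the group sizes are fixed in advance.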
Random assignment vs random sampling
This distinction is one of the most tested ideas in Unit 3.
- Random sampling (from a population) supports generalizing results to that population.
- Random assignment (to treatments) supports causal conclusions.
You can have one without the other:
- A carefully randomized experiment on volunteers: strong causation for those volunteers, but weaker generalization.
- A random sample survey with no treatments imposed: strong generalization about the population, but no causation.
The “gold standard” is a randomized experiment conducted on a random sample—often unrealistic, but that’s the ideal logic.
Replication: using enough units to see real effects
Replication means applying each treatment to multiple experimental units.
Why it matters: Individuals vary naturally. If you only test a treatment on one person (or one plant, or one class), you can’t tell whether the outcome was due to the treatment or just that unit being unusual. Replication reduces the influence of chance variation and makes patterns clearer.
Replication is closely connected to sample size within treatment groups:
- More units per group generally produces more stable, reliable comparisons.
- Replication also helps you detect smaller effects, because random variation averages out.
A subtle misconception: replication does not “fix” confounding caused by a bad design. If you systematically assign the treatment group to morning classes and the control group to afternoon classes, using more students just gives a more precise estimate of a biased comparison.
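The connection between replication and stable comparisons can be seen in a small simulation. All the numbers here are invented (scores drawn from a Normal distribution with mean 70 and SD 10); the point is only that the mean of a larger group bounces around less from experiment to experiment:

```python
import random
import statistics

random.seed(7)

def group_mean(n):
    """Mean response of n units drawn from the same population
    (simulated here as Normal(mean=70, sd=10) scores)."""
    return statistics.mean(random.gauss(70, 10) for _ in range(n))

def spread(n, reps=2000):
    """How much the group mean varies across many repeated experiments."""
    return statistics.stdev(group_mean(n) for _ in range(reps))

print(round(spread(5), 2), round(spread(50), 2))
# the means of 50-unit groups are far more stable than the means of 5-unit groups
```

More units per group means chance variation averages out, which is exactly why replication makes real treatment effects easier to see.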
Putting the principles together: common experimental designs
AP Statistics commonly focuses on three design structures: completely randomized design, randomized block design, and matched pairs design.
Completely randomized design
A completely randomized design assigns all experimental units to treatments entirely by chance, with no additional grouping.
When it works well: when experimental units are fairly similar already, or when you don’t have a strong reason to group them first.
Example (design description):
A company wants to compare two webpage layouts (A and B) on time spent on the site. They recruit 200 users and randomly assign 100 to layout A and 100 to layout B, then record time on site.
- Control: each group serves as a comparison for the other.
- Randomization: random assignment of users to A or B.
- Replication: 100 users per layout.
Randomized block design
A randomized block design first groups experimental units into blocks based on a variable that is expected to affect the response, then randomizes treatments within each block.
Why blocking matters: Blocking reduces variability by ensuring that comparisons between treatments are made among similar units. This can make treatment effects easier to detect.
Blocks are not treatments—they are groups formed from a pre-existing variable in order to control for variation.
Example (blocking in action):
Suppose you’re testing two study strategies (flashcards vs practice problems) on quiz scores. Prior math background is likely to matter. You could:
- Block students into “strong prior background” and “limited prior background.”
- Randomly assign half of each block to flashcards and half to practice problems.
- Compare strategies within each block, then combine evidence.
Common AP pitfall: Students sometimes propose blocking on a variable that is affected by the treatment (impossible, since blocks must be formed before treatments are assigned) or treat “block” as just another name for “group.” A block is chosen because it explains response variability and is known before assignment.
Matched pairs design
A matched pairs design is a special type of blocking where each block has size 2 (a pair) or where the same individual receives both treatments in a random order.
Two common forms:
- Pair-matching two similar units: You match subjects by similarity (twins, similar pretest scores), then randomly assign one in each pair to Treatment A and the other to Treatment B.
- Within-subject (repeated measures): Each subject receives both treatments (order randomized), and you compare the two responses within the same person.
Why it matters: Matched pairs can dramatically reduce variability because each comparison is made between very similar units (or within the same unit).
What can go wrong: In repeated-measures designs, you can get carryover effects (the first treatment affects response to the second). Randomizing order helps, and sometimes you also need a “washout” period.
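For the repeated-measures form, the key design step is randomizing the treatment order independently for each subject. A minimal sketch, with hypothetical subject labels and the caffeinated/decaf treatments from the earlier coffee example:

```python
import random

random.seed(11)

subjects = ["S01", "S02", "S03", "S04", "S05", "S06"]

# Repeated-measures matched pairs: every subject receives BOTH treatments,
# with the order randomized independently for each subject.
orders = {}
for s in subjects:
    order = ["caffeinated", "decaf"]
    random.shuffle(order)
    orders[s] = order

for s, order in orders.items():
    print(s, "->", " then ".join(order))
```

Each subject then serves as their own comparison: you analyze the difference between the two responses within each person.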
Worked example: choosing an appropriate design
Scenario: A school wants to test whether a new noise-reducing headphone policy improves students’ reading comprehension during independent work time.
Step 1: Identify variables.
- Explanatory variable (factor): headphone condition (policy vs no policy).
- Response variable: reading comprehension score (maybe on a common assessment).
- Experimental units: students (or possibly classrooms, depending on implementation).
Step 2: Choose a design with the principles.
A strong approach is a randomized block design by grade level:
- Block: 9th grade, 10th grade, 11th grade, 12th grade (grade likely affects reading level).
- Within each grade, randomly assign students to headphone policy or no policy.
Step 3: Add control features.
- Same reading passage and testing conditions across groups.
- If possible, the person scoring the comprehension assessment should be blinded to treatment.
Step 4: Replicate.
- Ensure enough students in each grade and each treatment to make comparisons meaningful.
If instead the school assigns all morning classes to the policy and all afternoon classes to no policy, that’s not random assignment—and time-of-day effects become a serious alternative explanation.
Exam Focus
- Typical question patterns:
- “Describe a randomized experiment to test whether ___ causes ___.”
- “Identify the treatments, experimental units, response variable, and whether the design is completely randomized, blocked, or matched pairs.”
- “Explain why blocking (or blinding, or a placebo) is helpful in this context.”
- Common mistakes:
- Saying “random” when the method is actually convenience or alternating assignment (not a chance process).
- Confusing blocks with treatments (blocks are based on an outside variable; treatments are imposed conditions).
- Believing replication means “repeat the whole experiment later” rather than “use many units per treatment now.”
Confounding Variables and Inference from Experiments
Even with careful planning, data collection can mislead you if another variable is mixed up with the effect you’re trying to measure. This is where confounding becomes the central threat—and where the logic of inference (what you can conclude) must be tied directly to the design.
Confounding variables: what they are and why they’re dangerous
A confounding variable is a variable that is related to both the explanatory variable and the response variable in such a way that its effects on the response cannot be separated from the effects of the explanatory variable.
In plain language: when a study finds a difference in outcomes between groups, a confounder offers an alternative story for that difference.
Why it matters: Confounding is the main reason observational studies struggle with causation. It’s also the main reason poorly designed “experiments” fail to support causal conclusions even if they look scientific.
Lurking variables vs confounders
You may also hear lurking variable: a variable not included in the study that influences the relationship between explanatory and response variables. A lurking variable can become a confounder if it is associated with both variables and helps explain the observed association.
AP questions often use these terms loosely; what matters is that you can:
- identify a plausible alternative variable,
- explain how it could create the observed association, and
- explain how design choices (especially random assignment or blocking) reduce its impact.
How confounding happens in practice
Confounding often arises from a lack of random assignment.
Example (classic pattern):
A researcher compares test scores of students who chose to attend after-school tutoring vs those who did not. If tutoring students score higher, you might think tutoring caused improvement.
But students who attend tutoring may differ systematically:
- motivation,
- parental support,
- baseline skill,
- available time.
Those differences are confounders because they are related to tutoring attendance (explanatory variable) and to test scores (response variable).
Random assignment as the primary defense against confounding
In a randomized experiment, random assignment tends to balance confounders across treatment groups.
Important nuance: Random assignment doesn’t guarantee perfectly equal groups every time; it makes large imbalances unlikely and allows you to attribute remaining differences to chance rather than bias. That’s why replication (enough units) matters—larger groups tend to balance better.
Blocking and matched pairs are additional tools to handle variables you already know will strongly affect the response.
Inference: what conclusions are justified by which design choices
In AP Statistics, your conclusions should match the design. Two key questions govern your inference:
- Can I conclude causation?
- Can I generalize to a population?
When can you infer cause and effect?
You can typically infer a cause-and-effect relationship when:
- the study is an experiment with random assignment to treatments, and
- the design controls major sources of bias (control group, consistent conditions, blinding when appropriate), and
- the difference in responses between groups is large enough to be convincing (formal inference comes later in the course, but the idea is that evidence should be strong relative to variability).
If the study is observational, you should default to association language and discuss possible confounding.
When can you generalize?
Generalization depends on random sampling from a population (or at least a sampling method that plausibly represents the population).
- Random assignment alone does not guarantee generalization.
- Convenience samples (volunteers, one class, one school) weaken external validity.
A helpful way to phrase it on AP-style responses:
- “Because subjects were randomly assigned, a cause-and-effect conclusion is reasonable.”
- “Because subjects were not randomly selected from the population, we should be cautious about generalizing beyond the study participants.”
Internal validity vs external validity (conceptual, not jargon-heavy)
You may see these ideas described as:
- Internal validity: Are we justified in claiming the treatment caused the observed difference for these units? Random assignment, control, and blinding support this.
- External validity: Are we justified in generalizing to a larger population? Random sampling supports this.
Even if your course doesn’t emphasize these terms, the reasoning is squarely in the AP Statistics skill set.
Examples: spotting confounding and fixing the design
Example 1: A confounded “experiment”
Claimed study: A teacher wants to test whether standing desks improve attention. She gives standing desks to her 1st period class and keeps traditional desks in her 6th period class, then compares attention ratings.
What went wrong: Treatment is confounded with time of day (and possibly class composition). Period itself could influence attention.
Fix: Use random assignment within the same period (randomly assign students to standing vs traditional desks) or rotate desks among students (a matched pairs or crossover style, if feasible and if carryover is not an issue).
Example 2: An observational association with a lurking variable
Observation: People who carry lighters have higher rates of lung disease.
Likely lurking/confounding variable: Smoking.
- Smoking is associated with carrying lighters.
- Smoking is associated with lung disease.
So the lighter is not plausibly causing disease; it’s a marker of smoking behavior.
This example is useful because it highlights that not every association is even plausibly causal, and a good explanation identifies a realistic mechanism.
How AP questions ask about confounding and inference
You are often asked to write short justifications. Strong AP answers tend to:
- explicitly mention random assignment when arguing causation,
- explicitly mention random sampling when arguing generalization,
- identify a specific confounder and explain the two-way connection (how it relates to both explanatory and response variables),
- propose a concrete design improvement (randomize, block, match, blind, add a control group).
Worked inference example: what can we conclude?
Scenario: Researchers recruit 80 volunteers from a gym and randomly assign 40 to take a new supplement and 40 to take a placebo for 8 weeks. They measure change in body fat percentage.
Causation: Because the volunteers were randomly assigned to supplement vs placebo, and the placebo creates a fair comparison (especially if blinded), differences in average body fat change can reasonably be attributed to the supplement for these volunteers.
Generalization: Because the subjects were volunteers from a gym (not a random sample of all adults), it is risky to generalize results to the broader population. Gym volunteers may differ in diet, exercise habits, and health.
Many students incorrectly say “random assignment means we can generalize.” This scenario is the clean counterexample.
Exam Focus
- Typical question patterns:
- “Identify a possible confounding variable and explain how it could affect the results.”
- “Is a cause-and-effect conclusion justified? Is generalization justified? Explain using random assignment vs random sampling.”
- “Describe how you would modify the design to reduce confounding/bias.”
- Common mistakes:
- Treating any variable that affects the response as a confounder (it must also be associated with the explanatory variable in a way that mixes effects).
- Using causal language (“leads to,” “results in”) for observational studies without acknowledging confounding.
- Forgetting that poor implementation (different conditions for groups, no blinding when measurement is subjective) can introduce bias even in a randomized experiment.