Unit 8 Inference for Categorical Data: Understanding and Using Chi-Square Procedures
Chi-Square Goodness of Fit Test
What it is
A chi-square goodness of fit test (often abbreviated GOF test) is a hypothesis test used when you have one categorical variable and you want to compare what you observed in a sample to what you would expect to see if a claimed distribution were true.
Typical situations:
- A company claims its customers are distributed across 5 age groups in certain percentages.
- A die is claimed to be fair (each face has probability 1/6).
- A genetics model predicts offspring colors in a particular ratio.
In all these cases, you have one variable (age group, die face, color) with multiple categories, and a model that specifies the expected proportions in each category.
Why it matters
Many real questions are about whether data follow a stated pattern. The GOF test gives you a principled way to decide whether differences between observed counts and expected counts are likely due to random sampling variation or whether they are too large to reasonably attribute to chance.
It’s also a foundational chi-square procedure: once you understand how “observed vs expected” works in GOF, the tests for homogeneity and independence feel like extensions where the expected counts come from a different source.
How it works (the mechanism)
The core idea is to measure how far the sample counts are from the model’s expected counts, across all categories, in a way that accounts for the fact that larger expected counts can naturally have larger raw deviations.
Step 1: State hypotheses
Let there be k categories.
- Null hypothesis: the population distribution matches the claimed proportions.
- Alternative hypothesis: the population distribution does not match the claimed proportions (at least one category proportion differs).
If the claimed proportions are p_1, p_2, \dots, p_k, the null can be written as:
H_0: p_1 = p_{1,0}, p_2 = p_{2,0}, \dots, p_k = p_{k,0}
and the alternative as:
H_a: \text{At least one } p_i \text{ differs from its claimed value}
Step 2: Compute expected counts
If your sample size is n, then the **expected count** in category i is:
E_i = n p_{i,0}
where p_{i,0} is the null (claimed) proportion.
A common mistake is to treat the expected counts as “what you hope to see.” They are not hopes; they are what the null model predicts, and they are essential for the test statistic.
Step 3: Compute the chi-square statistic
The chi-square statistic combines all category discrepancies:
\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}
- O_i is the observed count in category i.
- E_i is the expected count in category i.
Each term \frac{(O_i - E_i)^2}{E_i} is sometimes called that cell’s chi-square contribution. Bigger contributions point to categories where the observed data disagree most with the model.
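The statistic and its per-category contributions are easy to compute by hand or in a short script. Here is a minimal sketch in Python (the function names are my own, not from any library):

```python
def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def contributions(observed, expected):
    """Per-category chi-square contributions; larger values flag
    categories that disagree most with the model."""
    return [(o - e) ** 2 / e for o, e in zip(observed, expected)]

# A fair-die check: 60 rolls, expected count 10 per face.
rolls = [10, 8, 12, 9, 11, 10]
chi2 = chi_square_statistic(rolls, [10] * 6)  # 0 + 0.4 + 0.4 + 0.1 + 0.1 + 0 = 1.0
```

Scanning the list returned by `contributions` is exactly how you answer "which category contributes most" questions.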
Step 4: Degrees of freedom and p-value
For a GOF test with k categories:
df = k - 1
Then the p-value is the probability, under a chi-square distribution with df degrees of freedom, of seeing a statistic at least as large as the one observed:
p\text{-value} = P\left(\chi^2_{df} \ge \chi^2_{\text{observed}}\right)
Chi-square tests are right-tailed because larger \chi^2 means larger overall discrepancy.
Step 5: Check conditions (AP Stats standard)
You should verify conditions before trusting the chi-square approximation.
- Random: Data come from a random sample (or random assignment where relevant, though GOF settings usually involve sampling).
- Independence: Observations are independent (often supported by sampling no more than 10% of the population when sampling without replacement).
- Large expected counts: All expected counts are at least 5:
E_i \ge 5 \text{ for all } i
If expected counts are too small, the chi-square distribution may be a poor approximation. A common fix is to combine categories (if it makes sense contextually) to increase expected counts.
Step 6: Conclude in context
If the p-value is small (below your significance level \alpha), you have evidence the true distribution differs from the claimed one. If the p-value is large, you do not have convincing evidence against the claim.
Be careful: “fail to reject” is not the same as “prove the model is true.” It means the data do not provide strong evidence of a mismatch.
Worked example (GOF)
A vending company claims its machines sell snacks in these proportions: Chips 0.40, Candy 0.35, Granola 0.25. You randomly sample n=200 purchases and observe: Chips 90, Candy 60, Granola 50.
1) Hypotheses
- H_0: the purchase distribution is (0.40, 0.35, 0.25).
- H_a: the distribution differs from (0.40, 0.35, 0.25).
2) Expected counts
E_{\text{Chips}} = 200(0.40)=80
E_{\text{Candy}} = 200(0.35)=70
E_{\text{Granola}} = 200(0.25)=50
All expected counts are at least 5.
3) Test statistic
\chi^2 = \frac{(90-80)^2}{80} + \frac{(60-70)^2}{70} + \frac{(50-50)^2}{50}
Compute each contribution:
\frac{(10)^2}{80}=1.25
\frac{(-10)^2}{70}\approx 1.4286
\frac{0^2}{50}=0
So:
\chi^2 \approx 2.6786
4) Degrees of freedom
df = 3-1=2
5) P-value and conclusion
Using a chi-square distribution with df=2, \chi^2\approx 2.68 gives a p-value of about 0.26, well above 0.05. You would **fail to reject** H_0 and conclude the data do not provide convincing evidence that the true snack-purchase distribution differs from the company’s claim.
Interpreting what’s “driving” the statistic: The biggest mismatch is Candy (observed 60 vs expected 70) and Chips (90 vs 80), while Granola matches perfectly.
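As a quick check on the arithmetic above, the whole test fits in a few lines. This sketch uses the fact that for df = 2 the chi-square survival function is exactly e^{-x/2}, so no statistics library is needed:

```python
import math

observed = [90, 60, 50]            # Chips, Candy, Granola
claimed  = [0.40, 0.35, 0.25]
n = sum(observed)                  # 200 purchases

expected = [n * p for p in claimed]    # [80.0, 70.0, 50.0]
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# For df = 2, P(chi-square >= x) = exp(-x / 2) exactly.
p_value = math.exp(-chi2 / 2)      # about 0.26, well above 0.05
```

The p-value of roughly 0.26 confirms the fail-to-reject conclusion.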
Exam Focus
- Typical question patterns:
- “A distribution is claimed to be … Use a chi-square goodness of fit test to evaluate the claim.”
- “Compute expected counts and check conditions, then find \chi^2, df, and a p-value.”
- “Which category contributes most to \chi^2?”
- Common mistakes:
- Using percentages instead of counts for O_i and E_i (the formula is based on counts).
- Forgetting df=k-1 or using k instead.
- Concluding the null is “true” after failing to reject, rather than stating there isn’t convincing evidence against it.
Chi-Square Test for Homogeneity
What it is
A chi-square test for homogeneity compares the distribution of a categorical variable across two or more populations or treatments. You take separate random samples from each population (or assign subjects to treatments and then record a categorical response), and you ask whether the category proportions are the same across groups.
Example settings:
- Are political party affiliations distributed the same way across different states (samples taken in each state)?
- Do different marketing strategies lead to the same distribution of customer satisfaction ratings?
The key structural idea: you have one categorical response, but it’s measured across multiple groups.
Why it matters
Homogeneity tests let you compare multiple groups at once. Instead of doing many pairwise comparisons (which increases the chance of false positives), the chi-square homogeneity test provides a single overall test of whether the distributions look meaningfully different.
It also supports strong conclusions when your data come from well-designed sampling or experiments:
- If you truly randomly sample from each population, you can generalize to those populations.
- If you randomly assign treatments, you can talk about causation (treatment affects distribution of outcomes).
How it works (observed vs expected in a two-way table)
Homogeneity uses a two-way table (contingency table). Rows often represent groups (populations/treatments), columns represent response categories.
Step 1: Hypotheses
Suppose there are r groups and c response categories.
- Null hypothesis: the distribution of the response variable is the same in all groups.
- Alternative hypothesis: at least one group has a different distribution.
It’s important to phrase this as “distributions are the same/different,” not “means are the same/different,” because we are not working with numerical means.
Step 2: Expected counts under the null
Under the null, each group is assumed to share a common set of category proportions. The expected count in each cell is:
E = \frac{(\text{row total})(\text{column total})}{(\text{grand total})}
This formula is worth understanding conceptually: if the column total represents how common a category is overall, then a row with a larger sample size should have a larger expected count in that category, in proportion to its row total.
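The row-total-times-column-total rule is straightforward to code. A minimal sketch (my own helper, not a library function):

```python
def expected_counts(table):
    """Expected counts under the null hypothesis:
    (row total)(column total) / (grand total) for each cell.
    `table` is a list of rows of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    return [[rt * ct / grand for ct in col_totals] for rt in row_totals]

# Two groups of 50 with a 50/50 response overall: every cell expects 25.
expected_counts([[20, 30], [30, 20]])   # [[25.0, 25.0], [25.0, 25.0]]
```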
Step 3: Test statistic
You again add up contributions over all cells:
\chi^2 = \sum \frac{(O - E)^2}{E}
Step 4: Degrees of freedom
For an r \times c table:
df = (r-1)(c-1)
Step 5: Conditions
- Random: each sample is random from its population (or subjects are randomly assigned to treatments).
- Independence: observations are independent within and across samples (often supported by the 10% condition for each sample when sampling without replacement).
- Large expected counts: all expected cell counts are at least 5.
A subtle but common error is checking that all observed counts exceed 5. The condition is about expected counts.
Worked example (Homogeneity)
A researcher wants to know whether preferred study environment differs by class year. She takes random samples of students from each year and asks whether they prefer (A) quiet library, (B) dorm room, or (C) coffee shop.
Observed counts:
| Class year | Library | Dorm | Coffee | Total |
|---|---|---|---|---|
| First-year | 30 | 25 | 15 | 70 |
| Senior | 20 | 30 | 20 | 70 |
| Total | 50 | 55 | 35 | 140 |
1) Hypotheses
- H_0: the distribution of preferred environment is the same for first-years and seniors.
- H_a: the distributions differ.
2) Expected counts
For First-year & Library:
E = \frac{(70)(50)}{140}=25
First-year & Dorm:
E = \frac{(70)(55)}{140}=27.5
First-year & Coffee:
E = \frac{(70)(35)}{140}=17.5
Because both rows have the same total (70), the Senior row expected counts are the same: 25, 27.5, 17.5.
3) Compute \chi^2
Add contributions for all 6 cells:
\chi^2 = \frac{(30-25)^2}{25} + \frac{(25-27.5)^2}{27.5} + \frac{(15-17.5)^2}{17.5} + \frac{(20-25)^2}{25} + \frac{(30-27.5)^2}{27.5} + \frac{(20-17.5)^2}{17.5}
Compute contributions (rounded):
\frac{25}{25}=1
\frac{6.25}{27.5}\approx 0.2273
\frac{6.25}{17.5}\approx 0.3571
\frac{25}{25}=1
\frac{6.25}{27.5}\approx 0.2273
\frac{6.25}{17.5}\approx 0.3571
Sum:
\chi^2 \approx 3.1688
4) Degrees of freedom
df = (2-1)(3-1)=2
5) Conclude
With df=2, \chi^2\approx 3.17 gives a p-value of about 0.21, above 0.05. You would fail to reject H_0 and conclude the study does not provide convincing evidence that preference distributions differ between first-years and seniors.
Notice how the conclusion is about distributions of preferences, not about individual choices.
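The full homogeneity calculation above can be reproduced in a few lines, again exploiting the df = 2 shortcut P(\chi^2 \ge x) = e^{-x/2}:

```python
import math

observed = [[30, 25, 15],    # First-year: Library, Dorm, Coffee
            [20, 30, 20]]    # Senior
row_totals = [sum(r) for r in observed]          # [70, 70]
col_totals = [sum(c) for c in zip(*observed)]    # [50, 55, 35]
grand = sum(row_totals)                          # 140

expected = [[rt * ct / grand for ct in col_totals] for rt in row_totals]
chi2 = sum((o - e) ** 2 / e
           for orow, erow in zip(observed, expected)
           for o, e in zip(orow, erow))          # about 3.17

df = (len(observed) - 1) * (len(observed[0]) - 1)   # (2-1)(3-1) = 2
p_value = math.exp(-chi2 / 2)    # about 0.21; this formula is exact for df = 2
```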
Exam Focus
- Typical question patterns:
- “Two (or more) random samples were taken from different populations… Do the data provide evidence that the distributions are the same?”
- “Fill in expected counts for a two-way table and compute \chi^2 and df.”
- “Interpret a statistically significant result in context (what is different across groups?).”
- Common mistakes:
- Calling it “independence” when the design is multiple samples from multiple populations (homogeneity is the correct name).
- Not verifying randomness/independence for each sample separately.
- Giving a conclusion about individuals (“first-years prefer libraries”) instead of a population-level distribution statement.
Chi-Square Test for Independence
What it is
A chi-square test for independence evaluates whether two categorical variables are associated (related) in a single population. You collect one random sample (or use one observational dataset) and measure two categorical variables on each individual, then analyze the two-way table.
Example settings:
- In a random sample of voters, is voting method (mail vs in-person) related to party affiliation?
- In a sample of patients, is smoking status related to diagnosis category?
The key structural idea: one sample, two categorical variables measured on each subject.
Why it matters
Many questions in science and society are about relationships: whether one categorical characteristic tends to occur with another. The independence test gives you a formal method to decide if an observed pattern in a contingency table is likely to reflect a real association in the population.
A crucial interpretation point: independence tests detect association, not causation, unless the data come from a randomized experiment.
How it works (same math as homogeneity, different story)
Independence uses the same calculations as homogeneity. What changes is how you describe the design and the hypotheses.
Step 1: Hypotheses
For an r \times c table:
- Null hypothesis: the variables are independent in the population (no association).
- Alternative hypothesis: the variables are not independent (there is an association).
You can also phrase the null as “the distribution of one variable is the same across the categories of the other variable,” which connects directly to the homogeneity wording.
Step 2: Expected counts under independence
If variables are independent, the probability of being in a particular cell factors into a row proportion times a column proportion. That leads to the same expected count formula:
E = \frac{(\text{row total})(\text{column total})}{(\text{grand total})}
Step 3: Test statistic and degrees of freedom
Same as homogeneity:
\chi^2 = \sum \frac{(O - E)^2}{E}
df = (r-1)(c-1)
Step 4: Conditions
- Random: data from a random sample (or random assignment if an experiment).
- Independence: observations are independent; for sampling without replacement, the sample should be no more than 10% of the population.
- Large expected counts: all expected counts at least 5.
A common misconception is thinking “independence” in the condition means the variables must be independent. The condition is about independence of observations (one person’s outcome doesn’t affect another’s), not the conclusion you’re testing.
Seeing what drives significance: residuals (interpretation tool)
If the test is significant, you often want to describe how the variables are associated. The chi-square statistic itself doesn’t tell you direction; it tells you “there is a mismatch somewhere.”
You can look at cell-by-cell discrepancies. One useful quantity is the standardized residual (sometimes called the Pearson residual):
\text{standardized residual} = \frac{O - E}{\sqrt{E}}
- Large positive values mean “more observed than expected.”
- Large negative values mean “fewer observed than expected.”
On AP-style problems, you might not be required to compute these, but you should be able to interpret which cells contribute most to \chi^2 by comparing O and E.
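A quick sketch of computing these residuals cell by cell (the counts here are hypothetical, chosen for illustration):

```python
import math

def residuals(observed, expected):
    """Standardized residual (O - E) / sqrt(E) for each cell.
    The sign shows the direction of the mismatch."""
    return [[(o - e) / math.sqrt(e) for o, e in zip(orow, erow)]
            for orow, erow in zip(observed, expected)]

# Hypothetical 2x2 table with more observed than expected in the top-left cell.
r = residuals([[30, 10], [30, 50]],
              [[20, 20], [40, 40]])
# r[0][0] is positive (about +2.24): more observed there than independence predicts.
```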
Worked example (Independence)
A school takes a random sample of 120 students and records whether each student participates in a club (Yes/No) and whether they prefer online homework (Yes/No).
Observed counts:
| | Prefer online: Yes | Prefer online: No | Total |
|---|---|---|---|
| Club: Yes | 30 | 10 | 40 |
| Club: No | 30 | 50 | 80 |
| Total | 60 | 60 | 120 |
1) Hypotheses
- H_0: Club participation and online-homework preference are independent in the student population.
- H_a: They are associated.
2) Expected counts
Club Yes & Prefer Yes:
E = \frac{(40)(60)}{120}=20
Club Yes & Prefer No:
E = \frac{(40)(60)}{120}=20
Club No & Prefer Yes:
E = \frac{(80)(60)}{120}=40
Club No & Prefer No:
E = \frac{(80)(60)}{120}=40
All expected counts are at least 5.
3) Compute \chi^2
\chi^2 = \frac{(30-20)^2}{20} + \frac{(10-20)^2}{20} + \frac{(30-40)^2}{40} + \frac{(50-40)^2}{40}
Contributions:
\frac{100}{20}=5
\frac{100}{20}=5
\frac{100}{40}=2.5
\frac{100}{40}=2.5
So:
\chi^2 = 15
4) Degrees of freedom
df = (2-1)(2-1)=1
5) Conclude
With df=1, \chi^2=15 gives a very small p-value (well below 0.05). You would reject H_0 and conclude there is convincing evidence of an association between club participation and preference for online homework.
Interpretation of association: Club participants show more “Prefer Yes” than expected under independence (30 observed vs 20 expected), and fewer “Prefer No” (10 observed vs 20 expected). That suggests club participants are more likely to prefer online homework.
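To verify the numbers above in code: for df = 1, the chi-square survival function reduces to \text{erfc}(\sqrt{x/2}), which is available in Python's standard math module:

```python
import math

observed = [[30, 10],    # Club: Yes — Prefer online Yes / No
            [30, 50]]    # Club: No
expected = [[20, 20],    # (row total)(column total)/120 for each cell
            [40, 40]]

chi2 = sum((o - e) ** 2 / e
           for orow, erow in zip(observed, expected)
           for o, e in zip(orow, erow))   # 5 + 5 + 2.5 + 2.5 = 15.0

# For df = 1, P(chi-square >= x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi2 / 2))  # well below 0.001
```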
Connection to two-proportion z tests (important AP link)
When you have a 2 \times 2 table, the chi-square test (independence/homogeneity) is closely related to the two-proportion z test. For a two-sided z test on the same table,
\chi^2 = z^2
so the two procedures give the same p-value and the same conclusion about significance, though they emphasize slightly different framing.
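A numeric check of this identity, using the club/homework table from the previous example:

```python
import math

# Group 1 = club members, group 2 = non-members; "success" = prefers online.
x1, n1 = 30, 40
x2, n2 = 30, 80

p1, p2 = x1 / n1, x2 / n2                  # 0.75 and 0.375
p_pool = (x1 + x2) / (n1 + n2)             # 0.5 (pooled proportion)
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
z = (p1 - p2) / se                         # two-proportion z statistic

# z squared equals the chi-square statistic (15) computed from the same table.
```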
Exam Focus
- Typical question patterns:
- “A random sample was taken and two categorical variables were recorded… Test for association.”
- “Compute expected counts using the row-total times column-total rule, then compute \chi^2 and df.”
- “Describe which cells contribute most to the association.”
- Common mistakes:
- Interpreting a significant result as causation in an observational study.
- Mixing up independence vs homogeneity based on wording rather than the sampling design.
- Forgetting to state the conclusion in terms of an association between the two variables.
Selecting an Appropriate Inference Procedure for Categorical Data
Start with the real decision: what kind of data and question do you have?
Choosing the correct inference procedure is mostly about recognizing the structure of the data and the research question. For categorical data in AP Statistics, your choice usually comes down to:
- Inference about one proportion (one categorical variable with two outcomes)
- Inference about two proportions (two groups, binary outcome)
- Inference about multiple categories or two-way tables (chi-square methods)
A helpful mindset: first decide whether you’re comparing counts/proportions across categories or groups, and whether there is one variable or two variables involved.
When to use each chi-square test
All chi-square tests in this unit share the same test statistic form:
\chi^2 = \sum \frac{(O - E)^2}{E}
What changes is how you get the expected counts and how you describe the null hypothesis.
Goodness of fit: one variable, many categories
Use a chi-square goodness of fit test when:
- You have one categorical variable with k categories.
- The null model specifies the category probabilities p_{i,0}.
- You want to test whether the population follows that distribution.
Expected counts come from the claimed probabilities:
E_i = n p_{i,0}
Degrees of freedom:
df = k - 1
Typical phrasing clue: “fits a distribution,” “claimed proportions,” “matches the stated percentages.”
Homogeneity: multiple populations/treatments, one categorical response
Use a chi-square test for homogeneity when:
- You have two or more independent random samples from different populations (or subjects randomly assigned to treatments).
- You measure the same categorical response variable for each group.
- You want to know whether the distribution of the response is the same across groups.
Expected counts come from pooled totals:
E = \frac{(\text{row total})(\text{column total})}{(\text{grand total})}
Degrees of freedom:
df = (r-1)(c-1)
Typical phrasing clue: “compare distributions across several groups,” “samples from each of several populations.”
Independence: one population/sample, two categorical variables
Use a chi-square test for independence when:
- You have one random sample.
- You record two categorical variables per individual.
- You want to test whether the variables are associated.
Expected counts (same as homogeneity):
E = \frac{(\text{row total})(\text{column total})}{(\text{grand total})}
Degrees of freedom:
df = (r-1)(c-1)
Typical phrasing clue: “relationship between two categorical variables,” “association,” “independent.”
How to distinguish homogeneity vs independence (the most common confusion)
The calculations are identical, so the distinction is about study design and how you interpret the result.
- Homogeneity: separate samples from separate populations (or distinct treatment groups). You are comparing populations/treatments.
- Independence: one sample, two variables measured; you are checking association within one population.
A quick way to decide: ask yourself, “Did I take one sample and measure two things, or did I take multiple samples (one per group) and measure one thing?”
Chi-square vs z procedures for proportions
Sometimes a question could be approached with either a chi-square test (on a 2 \times 2 table) or a two-proportion z test. On the AP exam, the expected choice usually depends on how the question is framed:
- If the response is binary and you’re comparing two groups, a two-proportion z test is a natural choice.
- If the problem is presented as a contingency table (especially bigger than 2 \times 2), a chi-square test is the natural choice.
When it is 2 \times 2, the procedures align through:
\chi^2 = z^2
So the decision often comes down to communication: do you want to talk about a difference in proportions (z test) or an association between variables (chi-square test)?
Practical “procedure selection” examples
Example 1 (GOF): You record hair color (blonde/brown/black/red) in one random sample and compare to a claimed national distribution. That’s one variable with multiple categories and a claimed distribution, so it’s GOF.
Example 2 (Homogeneity): You take random samples of customers from three different store locations and record satisfaction level (low/medium/high). Multiple populations, one categorical response, so homogeneity.
Example 3 (Independence): You take one random sample of customers and record (a) whether they used a coupon and (b) whether they returned within a month. One sample, two categorical variables, so independence.
What goes wrong when you pick the wrong test
- If you call a procedure “independence” when the design is clearly multiple samples, your math might still be correct, but your hypotheses and conclusions will be framed incorrectly, which costs points.
- If you try to use a chi-square test when expected counts are too small, your p-value may not be reliable.
- If you treat categories as numerical (for example, averaging category codes), you leave the world of categorical inference and your analysis becomes meaningless.
Exam Focus
- Typical question patterns:
- “Identify the correct inference procedure and justify your choice.”
- “Is this a test for homogeneity or independence? Explain based on the sampling method.”
- “For a 2 \times 2 table, compare a chi-square test and a two-proportion z test.”
- Common mistakes:
- Deciding based on the table shape alone instead of the design (both homogeneity and independence can use an r \times c table).
- Writing hypotheses that talk about means or quantitative differences instead of distributions/association.
- Ignoring the expected-count condition or checking observed counts instead.