Inference for Distributions of Categorical Data: Chi-Square Test for Goodness of Fit

Inference for Distributions of Categorical Data

  • Inference for a categorical variable asks whether the observed results are consistent with a hypothesized distribution in one or more populations.

  • There are three primary types of chi-square tests used for categorical data, depending on the research question and data structure:
    - Goodness of Fit (G.O.F.) Test: Used to determine if a hypothesized distribution for a single categorical variable in a single population seems valid (e.g., used frequently in genetic research).
    - Chi-Square Test for Homogeneity: Used to determine whether the distribution of a single categorical variable differs across two or more populations or treatments. Data is typically organized in a two-way table.
    - Chi-Square Test for Association/Independence: Used to determine if there is convincing evidence of an association between two categorical variables in a population.

Chi-Square Test for Goodness of Fit (G.O.F.)

  • Definition: A goodness-of-fit test compares the distribution of a categorical variable in a sample to a claimed or hypothesized distribution in the population.

  • Stating Hypotheses:
    - $H_0$ (Null Hypothesis): The distribution of the categorical variable in the population of interest is the same as the claimed distribution.
    - $H_a$ (Alternative Hypothesis): The distribution of the categorical variable in the population of interest is different from the claimed distribution.
    - Symbolic Notation:
      - $H_0: p_1 = \text{value}_1, p_2 = \text{value}_2, \dots, p_k = \text{value}_k$
      - $H_a$: At least two of the $p_i$ values are incorrect.
    - Caution: Do not state $H_a$ in a way that suggests all proportions in the hypothesized distribution are wrong. $H_a$ only requires that at least two are incorrect (because the proportions must sum to 1, a single proportion cannot be wrong on its own).

  • Expected Counts:
    - The expected count for a specific category is calculated under the assumption that the null hypothesis is true.
    - Formula: $\text{Expected Count}_i = n \times p_i$
    - Where $n$ is the total sample size and $p_i$ is the relative frequency (probability) for category $i$ specified by $H_0$.

  • The Chi-Square ($\chi^2$) Test Statistic:
    - This statistic measures how far the observed counts ($O$) in a sample are from the expected counts ($E$).
    - Formula: $\chi^2 = \sum \frac{(O - E)^2}{E}$
    - The sum is across all $k$ categories of the variable.
    - Large values of $\chi^2$ provide stronger evidence against $H_0$.
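The two formulas above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the function names `expected_counts` and `chi_square_statistic` and the three-category data set are hypothetical and do not come from these notes.

```python
def expected_counts(n, proportions):
    """Expected count for each category under H0: n * p_i."""
    return [n * p for p in proportions]


def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E across all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical data: 60 observations in three categories,
# with H0 claiming proportions 0.3, 0.3, and 0.4.
observed = [18, 22, 20]
expected = expected_counts(sum(observed), [0.3, 0.3, 0.4])  # roughly 18, 18, 24
statistic = chi_square_statistic(observed, expected)
print(round(statistic, 3))
```

For this made-up sample the statistic is $0 + 16/18 + 16/24$, or about 1.556. The first category contributes nothing because its observed and expected counts match.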

Chi-Square Distributions and P-Values

  • Properties of the Chi-Square Distribution:
    - The distribution is defined by a density curve that takes only non-negative values.
    - It is skewed to the right.
    - A specific $\chi^2$ distribution is defined by its degrees of freedom ($df$).
    - As $df$ increases, the density curve becomes less skewed and begins to look more normal.
    - The mean of a $\chi^2$ distribution is equal to its $df$.
    - For $df > 2$, the mode (peak) of the density curve is located at $df - 2$.

  • Degrees of Freedom ($df$) for G.O.F.:
    - $df = k - 1$, where $k$ is the number of categories.

  • P-Values:
    - The P-value is the area to the right of the calculated $\chi^2$ statistic under the $\chi^2$ density curve with the appropriate $df$.
    - Caution: Failing to reject $H_0$ (when $P$ is large) does not mean the null hypothesis is definitely true; it means we lack convincing evidence that the distribution is different.
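The right-tail area can be computed without a statistics package when $df$ is a whole number. The sketch below is one possible approach, using only Python's standard library: it starts from the known closed forms for $df = 1$ and $df = 2$ and adds one term for each step of 2 degrees of freedom. The function name `chi2_sf` and the illustrative inputs are assumptions, not part of these notes.

```python
import math


def chi2_sf(x, df):
    """Area to the right of x under a chi-square curve with integer df >= 1."""
    if df < 1 or int(df) != df:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0  # the curve takes only non-negative values
    half_x = x / 2.0
    # Closed-form starting points: df = 1 (odd case) or df = 2 (even case).
    if df % 2 == 1:
        k, area = 1, math.erfc(math.sqrt(half_x))
    else:
        k, area = 2, math.exp(-half_x)
    # Moving from k to k + 2 degrees of freedom adds one term to the tail area.
    while k < df:
        area += half_x ** (k / 2.0) * math.exp(-half_x) / math.gamma(k / 2.0 + 1.0)
        k += 2
    return area


# Illustrative values only: a statistic of 6.0 with df = 2.
p_value = chi2_sf(6.0, 2)
print(round(p_value, 4))
```

With $df = 2$ the tail area reduces to $e^{-x/2}$, so the illustrative call returns $e^{-3}$, roughly 0.0498.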

Performing a Chi-Square Test (State, Plan, Do, Conclude)

  • Name of Test: Chi-Square Goodness of Fit Test.

  • Conditions for Inference:
    - Random: The data must come from a well-designed random sample from the population or a randomized experiment.
    - 10% Rule: When sampling without replacement, the sample size $n$ must be less than $10\%$ of the population size $N$ ($n < 0.10N$).
    - Large Counts: All expected counts must be at least 5 ($E_i \ge 5$ for all $i$).

  • AP Exam Tip: When checking the Large Counts condition, you must examine and explicitly label the expected counts, not the observed counts.

  • Calculator Usage (TI-84):
    - Input observed counts in List 1 (L1) and expected counts in List 2 (L2).
    - Select $\chi^2$GOF-Test from the Stat/Tests menu.
    - Individual terms in the $\chi^2$ calculation are stored in a list called CNTRB (contributions).
    - AP Exam Tip: Write out at least the first few terms of the $\chi^2$ summation by hand (e.g., $\dots + \dots + \dots$) to earn partial credit even if a calculation error occurs.
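The condition checks listed above can be expressed as a small helper. This is a sketch under stated assumptions: the function name `check_conditions`, its parameters, and the sample numbers are hypothetical, and whether the data are random has to be judged from the study design, so it is passed in as a flag.

```python
def check_conditions(n, expected, population_size=None, is_random=True):
    """Return a dict mapping each condition name to True or False."""
    results = {"random": bool(is_random)}
    # The 10% condition applies only when sampling without replacement
    # from a finite population of known size.
    if population_size is not None:
        results["ten_percent"] = n < 0.10 * population_size
    # Large Counts is checked on the EXPECTED counts, not the observed counts.
    results["large_counts"] = all(e >= 5 for e in expected)
    return results


# Hypothetical sample of 50 from a population of 2000.
print(check_conditions(50, [10, 15, 25], population_size=2000))
# A hypothetical case where one expected count is too small.
print(check_conditions(50, [3, 22, 25]))
```

In the second call the `large_counts` entry is `False` because one expected count is below 5, and no 10% entry appears because no population size was supplied.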

Case Study: M&M'S® Milk Chocolate Candies

  • Mars, Inc. Claimed Distribution (Hackettstown, NJ factory):
    - Brown: $12.5\%$
    - Red: $12.5\%$
    - Yellow: $12.5\%$
    - Green: $12.5\%$
    - Orange: $25.0\%$
    - Blue: $25.0\%$

  • Jerome’s Sample Analysis:
    - Sample size: $n = 60$.
    - Expected counts:
      - $60 \times 0.125 = 7.5$ (Brown, Red, Yellow, Green)
      - $60 \times 0.25 = 15.0$ (Orange, Blue)
    - Observed counts: Brown (12), Red (3), Yellow (7), Green (9), Orange (9), Blue (20).
    - $\chi^2$ calculation: $\chi^2 = \frac{(12-7.5)^2}{7.5} + \frac{(3-7.5)^2}{7.5} + \frac{(7-7.5)^2}{7.5} + \frac{(9-7.5)^2}{7.5} + \frac{(9-15)^2}{15} + \frac{(20-15)^2}{15} = 9.8$
    - $df = 6 - 1 = 5$.
    - P-value results (simulation): $87/1000 = 0.087$. Because $0.087 > 0.05$, we would fail to reject $H_0$ at significance level $\alpha = 0.05$.
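A simulation-based P-value like the one reported for Jerome's sample can be sketched as follows. This is only an illustration of the general idea and not necessarily how the figure in these notes was produced: draw many samples of 60 candies assuming the claimed distribution is true, compute $\chi^2$ for each, and record the fraction at least as large as the observed 9.8. The function names, the trial count, and the seed are assumptions.

```python
import random


def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E across all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def simulated_p_value(observed, proportions, trials=1000, seed=None):
    """Estimate the P-value by simulating samples under H0."""
    rng = random.Random(seed)
    n = sum(observed)
    k = len(proportions)
    expected = [n * p for p in proportions]
    observed_stat = chi_square_statistic(observed, expected)
    at_least_as_extreme = 0
    for _ in range(trials):
        # Draw n category labels with the probabilities claimed by H0.
        draws = rng.choices(range(k), weights=proportions, k=n)
        counts = [0] * k
        for category in draws:
            counts[category] += 1
        # A small tolerance guards against floating-point ties.
        if chi_square_statistic(counts, expected) >= observed_stat - 1e-9:
            at_least_as_extreme += 1
    return at_least_as_extreme / trials


# Order: Brown, Red, Yellow, Green, Orange, Blue.
observed = [12, 3, 7, 9, 9, 20]
claimed = [0.125, 0.125, 0.125, 0.125, 0.25, 0.25]
print(simulated_p_value(observed, claimed, trials=1000, seed=1))
```

The estimate changes from run to run and with the seed, so it should land near, but not exactly on, the 0.087 reported above.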

Example 1 & 3: Ceramic Six-Sided Die (Carrie)

  • Scenario: Carrie rolled a custom 6-sided die 90 times to test for fairness.

  • Hypotheses:
    - $H_0$: The sides of Carrie’s die are equally likely to show up ($p_1 = p_2 = p_3 = p_4 = p_5 = p_6 = 1/6$).
    - $H_a$: The sides of Carrie’s die are not equally likely to show up.

  • Observed Data: 1 (12), 2 (28), 3 (12), 4 (13), 5 (10), 6 (15).

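  • Calculations:
    - Expected count for each side: $90 \times (1/6) = 15.0$.
    - $\chi^2$ value: $\frac{(12-15)^2}{15} + \frac{(28-15)^2}{15} + \frac{(12-15)^2}{15} + \frac{(13-15)^2}{15} + \frac{(10-15)^2}{15} + \frac{(15-15)^2}{15} = \frac{216}{15} = 14.4$.
    - $df = 6 - 1 = 5$.
    - P-value (using technology): approximately $0.0133$.

  • Conclusion: Since the P-value (approximately $0.0133$) is less than $\alpha = 0.05$, reject $H_0$. There is convincing evidence that the sides of the die are not equally likely to show up.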

Example 4: Birthday Distributions of NHL Players (Malcolm Gladwell)

  • Topic: Whether a hockey player’s birth month, relative to the January 1 cut-off date, influences success.

  • Question: Are birthdays of NHL players uniformly distributed across the four quarters of the year?

  • Sample: $n = 80$ randomly selected NHL players.
    - Quarter 1 (Jan-Mar): 32 players.
    - Quarter 2 (Apr-Jun): 20 players.
    - Quarter 3 (Jul-Sep): 16 players.
    - Quarter 4 (Oct-Dec): 12 players.

  • Conditions:
    - Random: The data come from a random sample of 80 NHL players.
    - 10%: $80 < 10\%$ of all NHL players.
    - Large Counts: Each expected count is $80 \times (1/4) = 20$, which is $\ge 5$.

  • Results:
    - $\chi^2$ value: $\frac{(32-20)^2}{20} + \frac{(20-20)^2}{20} + \frac{(16-20)^2}{20} + \frac{(12-20)^2}{20} = 11.2$
    - $df = 4 - 1 = 3$.
    - P-value: $0.0107$.

  • Conclusion: Reject $H_0$. There is convincing evidence that NHL player birthdays are not uniformly distributed.
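The whole NHL calculation, from expected counts through the decision, can be chained together in one short sketch. The helper names `chi2_sf` and `goodness_of_fit_test` are hypothetical, and the tail-area routine is one standard-library approach that works for whole-number $df$, not a reference implementation.

```python
import math


def chi2_sf(x, df):
    """Right-tail area of a chi-square curve with integer df >= 1."""
    if x <= 0:
        return 1.0
    half_x = x / 2.0
    # Start from the closed form for df = 1 or df = 2, then step up by 2.
    if df % 2 == 1:
        k, area = 1, math.erfc(math.sqrt(half_x))
    else:
        k, area = 2, math.exp(-half_x)
    while k < df:
        area += half_x ** (k / 2.0) * math.exp(-half_x) / math.gamma(k / 2.0 + 1.0)
        k += 2
    return area


def goodness_of_fit_test(observed, proportions, alpha=0.05):
    """Run a chi-square goodness-of-fit test and return the key numbers."""
    n = sum(observed)
    expected = [n * p for p in proportions]
    if any(e < 5 for e in expected):
        raise ValueError("Large Counts condition fails: an expected count is below 5")
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - 1
    p_value = chi2_sf(statistic, df)
    return {"statistic": statistic, "df": df,
            "p_value": p_value, "reject_h0": p_value < alpha}


# NHL birthdays by quarter: Jan-Mar, Apr-Jun, Jul-Sep, Oct-Dec.
result = goodness_of_fit_test([32, 20, 16, 12], [0.25, 0.25, 0.25, 0.25])
print(result)
```

For these counts the sketch reproduces $\chi^2 = 11.2$ with $df = 3$ and a P-value of about 0.0107, which is below $\alpha = 0.05$. The Random and 10% conditions still have to be checked from the study description, since they cannot be inferred from the counts alone.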

Example 5: High School Lunch Sign-Outs

  • Scenario: A random sample of $n=100$ entries from a school lunch sign-out list.

  • Hypotheses:
    - $H_0$: The number of students leaving campus for lunch is uniformly distributed across the 5 days of the week.
    - $H_a$: The distribution is not uniform.

  • Calculations:
    - Expected counts: $100 \times (1/5) = 20$. All expected counts are $\ge 5$.
    - $df = 5 - 1 = 4$.
    - $\chi^2$ value: $4.8$.
    - P-value: $0.308$.

  • Conclusion: Since the P-value ($0.308$) is greater than $\alpha = 0.05$, fail to reject $H_0$. No convincing evidence that the distribution is not uniform.
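The P-value in this example can be checked by hand, because the right-tail area of a $\chi^2$ distribution has a simple closed form when $df = 4$. The worked line below is a verification sketch; the subscript notation $\chi^2_4$ for "chi-square with 4 degrees of freedom" is introduced here and is not used elsewhere in these notes.

```latex
P(\chi^2_4 \ge x) = e^{-x/2}\left(1 + \frac{x}{2}\right)
\quad\Longrightarrow\quad
P(\chi^2_4 \ge 4.8) = e^{-2.4}\,(1 + 2.4) \approx 0.0907 \times 3.4 \approx 0.308
```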

Example 6: Genetic Makeup of Tobacco Plants

  • Scenario: Pairs of tobacco plants with genetic makeup Gg are crossed (each parent carries one dominant gene, G, and one recessive gene, g, for color). The Punnett square predicts offspring in a 1:2:1 ratio (25% green, 50% yellow-green, 25% albino).

  • Observed Data: $n = 84$ offspring. Green (23), Yellow-Green (50), Albino (11).

  • Test results at $\alpha = 0.05$:
    - Expected counts: Green (21), Yellow-Green (42), Albino (21).
    - $\chi^2$ value: $6.48$.
    - $df = 3 - 1 = 2$.
    - P-value: $0.039$.

  • Conclusion: Reject $H_0$. There is convincing evidence that the distribution of offspring colors differs from the predicted 1:2:1 ratio.
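The tobacco results can be reproduced by hand from the observed and expected counts. The worked lines below are a check on the reported values; they use the fact that for $df = 2$ the right-tail area is exactly $e^{-x/2}$, and the subscript notation $\chi^2_2$ is introduced here for that distribution.

```latex
\chi^2 = \frac{(23-21)^2}{21} + \frac{(50-42)^2}{42} + \frac{(11-21)^2}{21}
       = \frac{4}{21} + \frac{64}{42} + \frac{100}{21}
       \approx 0.190 + 1.524 + 4.762 \approx 6.48

P(\chi^2_2 \ge 6.476) = e^{-6.476/2} = e^{-3.238} \approx 0.039
```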

Follow-Up Analysis

  • Purpose: Conducted when a $\chi^2$ test result is statistically significant, to identify which specific categories contribute most to the deviation from $H_0$.

  • Procedure:
    - Examine which categories show the largest deviations between observed and expected counts.
    - Analyze the individual components of the $\chi^2$ statistic: $\frac{(O - E)^2}{E}$.
    - Provide specific numbers and directions (more than expected or less than expected).

  • NHL Example Follow-up:
    - The categories contributing the most to $\chi^2 = 11.2$ were Jan-Mar and Oct-Dec.
    - Jan-Mar: 12 more players were born than expected (32 observed vs. 20 expected).
    - Oct-Dec: 8 fewer players were born than expected (12 observed vs. 20 expected).

  • Tobacco Plant Example Follow-up:
    - The largest contribution came from the Albino category ($4.762$).
    - Observed count for Albinos (11) was 10 less than the expected count (21).
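A follow-up analysis amounts to listing each category's signed deviation and its contribution to $\chi^2$, then ranking by contribution. The sketch below does this for the NHL data; the function name `follow_up` and the output layout are hypothetical choices made for illustration.

```python
def follow_up(categories, observed, expected):
    """Rank categories by their contribution (O - E)^2 / E to chi-square."""
    rows = []
    for name, o, e in zip(categories, observed, expected):
        rows.append({
            "category": name,
            "deviation": o - e,                 # positive: more than expected
            "contribution": (o - e) ** 2 / e,   # one term of the chi-square sum
        })
    return sorted(rows, key=lambda row: row["contribution"], reverse=True)


quarters = ["Jan-Mar", "Apr-Jun", "Jul-Sep", "Oct-Dec"]
rows = follow_up(quarters, [32, 20, 16, 12], [20, 20, 20, 20])
for row in rows:
    direction = "more" if row["deviation"] > 0 else "fewer"
    print(f'{row["category"]}: {abs(row["deviation"])} {direction} than expected, '
          f'contribution {row["contribution"]:.1f}')
```

Jan-Mar (contribution 7.2) and Oct-Dec (contribution 3.2) come out on top, matching the follow-up described above. The same function applied to the tobacco counts puts the Albino category first with a contribution of about 4.762.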