WK11: Statistical inference: Two categorical variables: Chi-square test for goodness of fit
Chi-Square Goodness of Fit Test
Used to test if a single categorical variable has a specific distribution.
Follows the standard hypothesis testing steps:
State null and alternative hypotheses & level of significance.
Check conditions.
Calculate test statistic.
Find p-value.
Make a decision and conclude in context.
Hypotheses
Distribution of a Categorical Variable: Lists categories and their proportions.
If there are K categories: $P_1, P_2, \ldots, P_K$ represent the proportions for each category.
$\sum_{i=1}^{K} P_i = 1$ (sum of proportions equals 1).
Null Hypothesis (H0): Specifies proportions for all categories.
$P_1 = P_{10}, P_2 = P_{20}, \ldots, P_K = P_{K0}$
Alternative Hypothesis (Ha): At least one proportion is different from what the null hypothesis states.
Not all $P_i$ are equal to $P_{i0}$.
Example: Births and Days of the Week
Claim: Births are not evenly distributed across days of the week.
Null Hypothesis (H0): Births are equally likely on all days of the week.
$P_1 = P_2 = \cdots = P_7 = \frac{1}{7}$
Alternative Hypothesis (Ha): Births are not equally likely on all days of the week.
Not all $P_i$ are equal to $\frac{1}{7}$.
This doesn't specify which days have different proportions.
Conditions for Chi-Square Goodness of Fit Test
Random Sample.
Expected counts under the null hypothesis must be:
At least 1 for each cell.
At least 80% of cells should have expected counts of at least 5.
Expected Counts Calculation: Multiply the proportion specified in the null hypothesis by the total sample size.
Example: Births and Days of the Week (Conditions)
Sample size = 700
Expected count for each day = $\frac{1}{7} \times 700 = 100$
Since the expected count is 100 (at least 5) for every day, the expected-count conditions are met.
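The expected-count check can be sketched in Python. This is a minimal illustration, not part of the original notes: the helper names `expected_counts` and `conditions_met` are hypothetical, and `Fraction` is used only so that 1/7 × 700 comes out as exactly 100.

```python
from fractions import Fraction


def expected_counts(null_proportions, n):
    """Expected count per category: null proportion times total sample size."""
    return [p * n for p in null_proportions]


def conditions_met(expected):
    """Check both expected-count conditions from the notes.

    1. Every expected count is at least 1.
    2. At least 80% of the cells have an expected count of at least 5.
    """
    all_at_least_1 = all(e >= 1 for e in expected)
    share_at_least_5 = sum(e >= 5 for e in expected) / len(expected)
    return all_at_least_1 and share_at_least_5 >= 0.8


# Births example: H0 says each of the 7 days has proportion 1/7, n = 700.
null_props = [Fraction(1, 7)] * 7
expected = expected_counts(null_props, 700)

print([float(e) for e in expected])
print(conditions_met(expected))
```

The random-sample condition cannot be checked from the counts; it depends on how the data were collected.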
Test Statistic
Chi-Square Statistic: Measures the difference between observed and expected counts.
Formula: $\chi^2 = \sum_{i=1}^{K} \frac{(O_i - E_i)^2}{E_i}$
$O_i$ = Observed count for category i
$E_i$ = Expected count for category i
Degrees of Freedom: Number of categories minus one (K - 1).
Example: Births and Days of the Week (Test Statistic)
$\chi^2 = 19.12$
Degrees of freedom = 7 - 1 = 6
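The formula translates directly into a few lines of Python. The observed birth counts are not listed in these notes, so this sketch uses the coin-spin counts from the later example instead (168 heads and 232 tails, with 200 expected in each category). The function name `chi_square_statistic` is a hypothetical helper.

```python
def chi_square_statistic(observed, expected):
    """Sum of (O_i - E_i)^2 / E_i over all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Coin-spin data from the later example in these notes.
observed = [168, 232]
expected = [200, 200]

chi2 = chi_square_statistic(observed, expected)
df = len(observed) - 1  # degrees of freedom: number of categories minus one

print(chi2, df)
```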
P-Value and Conclusion
P-value: Probability of observing a test statistic as extreme or more extreme than the one calculated, assuming the null hypothesis is true.
If p-value < level of significance ($\alpha$): Reject the null hypothesis.
The sample gives statistically significant evidence supporting the alternative.
If p-value ≥ level of significance ($\alpha$): Fail to reject the null hypothesis.
The sample does not give statistically significant evidence to support the alternative.
Example: Births and Days of the Week (Conclusion)
P-value is between 0.0025 and 0.005.
Level of significance ($\alpha$) = 0.05
Since p-value < 0.05, reject the null hypothesis.
Conclusion: At the 5% level of significance, the data gives statistically significant evidence that local births are not equally likely on all days of the week.
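The table-based bracket for the p-value can be cross-checked numerically. The sketch below assumes only the values stated in the example (a chi-square statistic of 19.12 with 6 degrees of freedom) and uses a closed form for the chi-square upper tail that is valid only for even degrees of freedom. The function name `chi2_sf_even_df` is hypothetical.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability P(X >= x) for a chi-square variable with even df.

    Uses exp(-x/2) * sum_{j=0}^{df/2 - 1} (x/2)^j / j!,
    which holds only when df is a positive even integer.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive even df")
    half = x / 2
    terms = (half ** j / math.factorial(j) for j in range(df // 2))
    return math.exp(-half) * sum(terms)


# Births example: statistic and degrees of freedom as given in the notes.
p_value = chi2_sf_even_df(19.12, 6)
alpha = 0.05

print(round(p_value, 4))
print("reject H0" if p_value < alpha else "fail to reject H0")
```

For odd degrees of freedom a different formula or a statistics library would be needed.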
Equivalence with Test for Proportion of Successes
When a categorical variable has only two categories, the chi-square goodness-of-fit test is equivalent to the test for the proportion of successes.
Both tests will provide the same decision and conclusion.
Example: Spinning a Coin
Data: 168 heads, 232 tails in 400 spins.
Test if this contradicts a 50/50 distribution at $\alpha = 0.05$.
Using Proportion Test
Success = Landing heads.
H0: $P = 0.5$
Ha: $P \neq 0.5$
Sample proportion: $\hat{p} = \frac{168}{400} = 0.42$
Test statistic: $Z = \frac{\hat{p} - P}{\sqrt{\frac{P(1-P)}{n}}} = \frac{0.42 - 0.5}{\sqrt{\frac{0.5(1-0.5)}{400}}} = -3.2$
P-value: 2 * 0.0007 = 0.0014
Conclusion: P-value < 0.05, so reject the null hypothesis. The sample gives statistically significant evidence contradicting the 50/50 distribution.
Using Chi-Square Goodness of Fit Test
Categories: Heads (H) and Tails (T)
H0: $P_H = 0.5, P_T = 0.5$
Ha: At least one is different from 0.5.
Expected counts: 200 for each category
Test statistic: $\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i} = \frac{(168 - 200)^2}{200} + \frac{(232 - 200)^2}{200} = 10.24$
Degrees of freedom: 2 - 1 = 1
P-value: Between 0.001 and 0.002
Conclusion: P-value < 0.05, so reject the null hypothesis. The sample gives statistically significant evidence that $P_H \neq 0.5$, which contradicts the 50/50 distribution.
Relationship between Z and Chi-Square
For a categorical variable with only two categories, $\chi^2 = Z^2$. In the coin example, $(-3.2)^2 = 10.24$.
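The equivalence can be checked numerically on the coin data. This is a sketch using only the Python standard library. It relies on two standard identities: the two-sided normal p-value equals `erfc(|z| / sqrt(2))`, and the upper tail of a chi-square with 1 degree of freedom equals `erfc(sqrt(x / 2))`.

```python
import math

heads, tails = 168, 232
n = heads + tails
p0 = 0.5  # proportion of heads under H0

# Proportion test: Z = (p_hat - P) / sqrt(P(1 - P) / n)
p_hat = heads / n
z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)

# Goodness-of-fit test: sum of (O - E)^2 / E over both categories
expected = [n * p0, n * (1 - p0)]
chi2 = sum((o - e) ** 2 / e for o, e in zip([heads, tails], expected))

# Two-sided normal p-value: 2 * P(Z >= |z|)
p_z = math.erfc(abs(z) / math.sqrt(2))
# Chi-square (1 df) p-value: P(X >= chi2)
p_chi2 = math.erfc(math.sqrt(chi2 / 2))

print(round(z, 2), round(z ** 2, 2), round(chi2, 2))
print(round(p_z, 4), round(p_chi2, 4))
```

Both p-values agree, which is why the two tests always lead to the same decision when there are only two categories.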