WK11: Statistical inference: Two categorical variables: Chi-square test for independence
Chi-Square Test for Independence
Overview
The chi-square test for independence is used to determine whether two categorical variables are related. It assesses whether the distribution of counts across the categories of one variable depends on the category of the other variable.
Example: Sickle Cell Trait and Malaria
Background
- Question: Is there a relationship between being a sickle cell carrier and protection against malaria?
- Study: 543 African children were checked for sickle cell trait and malaria infection.
- Variables:
- Sickle cell trait (presence or absence)
- Malaria (presence or absence of heavy infection)
Data
- Of the 136 children with sickle cell trait, 36 were heavily infected with malaria.
- Of the 407 children without sickle cell trait, 152 were heavily infected with malaria.
Contingency Table
| | Heavily Infected with Malaria | Not Heavily Infected with Malaria | Total |
|---|---|---|---|
| Sickle Cell Trait | 36 | 100 | 136 |
| No Sickle Cell Trait | 152 | 255 | 407 |
| Total | 188 | 355 | 543 |
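
As a quick check on the table, the margins can be recomputed from the four cell counts. This is a minimal sketch in plain Python; the variable names are illustrative and not part of the study.

```python
# Observed counts from the contingency table.
# Rows: sickle cell trait, no sickle cell trait.
# Columns: heavily infected, not heavily infected.
observed = [
    [36, 100],
    [152, 255],
]

row_totals = [sum(row) for row in observed]        # totals per trait group
col_totals = [sum(col) for col in zip(*observed)]  # totals per infection status
grand_total = sum(row_totals)

print(row_totals)   # [136, 407]
print(col_totals)   # [188, 355]
print(grand_total)  # 543
```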
Steps for Chi-Square Test
Hypotheses and Significance Level:
- Null Hypothesis ($H_0$): There is no relationship between the two categorical variables (they are independent).
- Example: There is no relationship between the presence of sickle cell trait and malaria.
- Alternative Hypothesis ($H_a$): The two categorical variables are dependent.
- Example: There is a relationship between the presence of sickle cell trait and malaria.
- Significance Level ($\alpha$): 5% (0.05) in this example.
Check Conditions for Use of the Test:
- Random Sample: The data should come from a random sample of the population.
- Expected Counts:
- All expected counts should be at least 1.
- At least 80% of the cells in the two-way table should have an expected count of at least 5.
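
The two expected-count conditions can be checked mechanically once the expected counts are known. The sketch below assumes expected counts are computed from the table margins (row total times column total, divided by the table total); the function names are hypothetical. The random-sample condition depends on the study design and cannot be checked from the counts.

```python
def expected_counts(observed):
    """Expected counts under independence: row total * column total / table total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def expected_count_conditions_met(expected):
    """True if every expected count is >= 1 and at least 80% of cells are >= 5."""
    cells = [e for row in expected for e in row]
    all_at_least_one = all(e >= 1 for e in cells)
    share_at_least_five = sum(e >= 5 for e in cells) / len(cells)
    return all_at_least_one and share_at_least_five >= 0.8


# Sickle cell example: every expected count is well above 5.
observed = [[36, 100], [152, 255]]
print(expected_count_conditions_met(expected_counts(observed)))  # True
```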
Calculate the Test Statistic:
- Observed Counts: The actual counts from the sample data.
- Expected Counts: The counts expected under the assumption of no relationship between the variables.
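- Calculated as: $E = \dfrac{\text{row total} \times \text{column total}}{\text{table total}}$
- For example, if $R_A$ is the total for category A of variable X, $C_B$ is the total for category B of variable Y, and $n$ is the table total, the expected count is $E_{AB} = \dfrac{R_A \times C_B}{n}$.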
- Example calculation for sickle cell trait and malaria:
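- Proportion of children with sickle cell trait: $136 / 543 \approx 0.2505$ (25.05%)
- Expected count for children with malaria and sickle cell trait: $\frac{136 \times 188}{543} \approx 47.09$
- Expected count for children without malaria but with sickle cell trait: $\frac{136 \times 355}{543} \approx 88.91$
- Expected count for children with malaria but without sickle cell trait: $\frac{407 \times 188}{543} \approx 140.91$
- Expected count for children without sickle cell trait and without malaria: $\frac{407 \times 355}{543} \approx 266.09$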
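- Chi-square Statistic ($\chi^2$):
- Formula: $\chi^2 = \sum \dfrac{(O - E)^2}{E}$, where $O$ is the observed count and $E$ is the expected count in a cell.
- The sum is calculated across all cells in the contingency table.
- Example result: $\chi^2 \approx 5.33$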
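
The calculation for the sickle cell table can be reproduced in a few lines. This is a sketch in plain Python with illustrative variable names: it computes each expected count as row total times column total over the table total, then sums the squared differences between observed and expected counts, each divided by the expected count.

```python
# Observed counts: rows are trait / no trait, columns are infected / not infected.
observed = [[36, 100], [152, 255]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count for each cell under independence.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Chi-square statistic: sum of (observed - expected)^2 / expected over all cells.
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

print([[round(e, 2) for e in row] for row in expected])
# [[47.09, 88.91], [140.91, 266.09]]
print(round(chi_square, 2))  # 5.33
```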
Find the P-value:
- The p-value is the probability of obtaining a test statistic as large as or larger than the calculated one, assuming the null hypothesis is true.
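- Degrees of Freedom (df): $df = (r - 1)(c - 1)$, where $r$ is the number of rows and $c$ is the number of columns in the table (excluding totals).
- In the sickle cell example: $df = (2 - 1)(2 - 1) = 1$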
- Use a Chi-square distribution table to find the p-value corresponding to the calculated statistic and degrees of freedom.
- Example: For $\chi^2 \approx 5.33$ and df = 1, the p-value is between 0.02 and 0.025.
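
As an alternative to the table lookup, the p-value for a 2 × 2 table can be computed directly. The sketch below relies on the fact that a chi-square variable with 1 degree of freedom is the square of a standard normal variable, so its upper-tail probability is `erfc(sqrt(x / 2))`. This shortcut holds only for df = 1; the function names are hypothetical, and the statistic 5.33 is the rounded value computed from the sickle cell table.

```python
import math


def degrees_of_freedom(n_rows, n_cols):
    """df for a two-way table: (rows - 1) * (columns - 1), excluding totals."""
    return (n_rows - 1) * (n_cols - 1)


def p_value_df1(chi_square):
    """Upper-tail probability of a chi-square distribution with df = 1 only."""
    return math.erfc(math.sqrt(chi_square / 2))


df = degrees_of_freedom(2, 2)
p = p_value_df1(5.33)  # rounded statistic for the sickle cell table

print(df)           # 1
print(round(p, 3))  # about 0.021, inside the 0.02 to 0.025 range from the table
```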
Make a Statistical Decision and Conclusion:
- If the p-value is less than or equal to the significance level ($\alpha$), reject the null hypothesis. This suggests there is a statistically significant relationship between the variables.
- If the p-value is greater than $\alpha$, fail to reject the null hypothesis. This suggests there is not enough evidence to support a relationship between the variables.
- Example Conclusion: Since the whole p-value range (0.02 to 0.025) lies below the significance level of 0.05, we reject the null hypothesis. The sample provides statistically significant evidence that there is a relationship between the presence of sickle cell trait and heavy malaria infection.
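
The decision rule can be written as a small function. This is only a sketch; the name `decide` and the returned wording are illustrative. When only a p-value range is available from a table, comparing the upper end of the range with the significance level is enough to justify rejection.

```python
def decide(p_value, alpha=0.05):
    """Apply the decision rule: reject the null hypothesis when p-value <= alpha."""
    if p_value <= alpha:
        return "reject the null hypothesis"
    return "fail to reject the null hypothesis"


# Upper end of the p-value range from the sickle cell example.
print(decide(0.025))  # reject the null hypothesis
print(decide(0.20))   # fail to reject the null hypothesis
```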