WK11: Statistical inference: Two categorical variables: Chi-square test for independence

Chi-Square Test for Independence

Overview

The Chi-square test for independence is used to determine if there is a relationship between two categorical variables. It assesses whether the counts for specific categories in one variable are dependent on the categories of the other variable.

Example: Sickle Cell Trait and Malaria

Background
  • Question: Is there a relationship between being a sickle cell carrier and protection against malaria?
  • Study: 543 African children were checked for sickle cell trait and malaria infection.
  • Variables:
    • Sickle cell trait (presence or absence)
    • Malaria (presence or absence of heavy infection)
Data
  • Of 136 children with sickle cell trait, 36 were heavily infected with malaria.
  • Of 407 children without sickle cell trait, 152 were heavily infected with malaria.
Contingency Table
                     | Heavily Infected with Malaria | Not Heavily Infected with Malaria | Total
Sickle Cell Trait    | 36                            | 100                               | 136
No Sickle Cell Trait | 152                           | 255                               | 407
Total                | 188                           | 355                               | 543
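
As a quick check on the table, the marginal totals can be recomputed from the four observed cell counts. This is a minimal sketch in plain Python; the dictionary layout and the names `observed`, `row_totals`, `col_totals`, and `table_total` are illustrative choices, not part of the original notes.

```python
# Observed counts from the study, keyed by (trait status, malaria status).
observed = {
    ("sickle cell trait", "heavily infected"): 36,
    ("sickle cell trait", "not heavily infected"): 100,
    ("no sickle cell trait", "heavily infected"): 152,
    ("no sickle cell trait", "not heavily infected"): 255,
}

# Accumulate the row (trait) and column (malaria) totals.
row_totals = {}
col_totals = {}
for (row, col), count in observed.items():
    row_totals[row] = row_totals.get(row, 0) + count
    col_totals[col] = col_totals.get(col, 0) + count

table_total = sum(observed.values())

print(row_totals)   # trait totals: 136 and 407
print(col_totals)   # malaria totals: 188 and 355
print(table_total)  # 543
```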

Steps for Chi-Square Test

  1. Hypotheses and Significance Level:

    • Null Hypothesis ($H_0$): There is no relationship between the two categorical variables (they are independent).
      • Example: There is no relationship between the presence of sickle cell trait and malaria.
    • Alternative Hypothesis ($H_a$): The two categorical variables are dependent.
      • Example: There is a relationship between the presence of sickle cell trait and malaria.
    • Significance Level ($\alpha$): 5% (0.05) in this example.
  2. Check Conditions for Use of the Test:

    • Random Sample: The data should come from a random sample of the population.
    • Expected Counts:
      • All expected counts should be at least 1.
      • At least 80% of the cells in the two-way table should have an expected count of at least 5.
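
The expected-count conditions can be checked mechanically once the observed table is known. The sketch below assumes the table is stored as a list of rows; the function names `expected_counts` and `conditions_met` are hypothetical helpers introduced only for illustration. The random-sample condition depends on the study design and cannot be verified from the counts.

```python
# Observed 2x2 table: rows = trait / no trait, columns = malaria / no malaria.
observed = [[36, 100], [152, 255]]

def expected_counts(table):
    """Expected count per cell: row total * column total / table total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]

def conditions_met(expected):
    """All expected counts >= 1, and at least 80% of cells have expected count >= 5."""
    cells = [e for row in expected for e in row]
    all_at_least_1 = all(e >= 1 for e in cells)
    share_at_least_5 = sum(e >= 5 for e in cells) / len(cells)
    return all_at_least_1 and share_at_least_5 >= 0.8

expected = expected_counts(observed)
ok = conditions_met(expected)
print(ok)
```

In the sickle cell example every expected count is well above 5, so both expected-count conditions hold.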
  3. Calculate the Test Statistic:

    • Observed Counts: The actual counts from the sample data.
    • Expected Counts: The counts expected under the assumption of no relationship between the variables.
      • Calculated as: $\frac{\text{Row Total} \times \text{Column Total}}{\text{Table Total}}$
      • For example, if $T_A$ is the total for category A of variable X, $T_B$ is the total for category B of variable Y, and $T$ is the table total, the expected count is: $\frac{T_A \times T_B}{T}$
    • Example calculation for sickle cell trait and malaria:
      • Proportion of children with sickle cell trait: $\frac{136}{543} \approx 0.2505$ (25.05%)
      • Expected count for children with malaria and sickle cell trait: $188 \times \frac{136}{543} \approx 47.09$
      • Expected count for children without malaria but with sickle cell trait: $355 \times \frac{136}{543} \approx 88.91$
      • Expected count for children with malaria but without sickle cell trait: $188 \times \frac{407}{543} \approx 140.91$
      • Expected count for children without sickle cell trait and without malaria: $355 \times \frac{407}{543} \approx 266.09$
    • Chi-square Statistic ($\chi^2$):
      • Formula: $\chi^2 = \sum \frac{(\text{Observed Count} - \text{Expected Count})^2}{\text{Expected Count}}$
      • The sum is calculated across all cells in the contingency table.
      • Example result: $\chi^2 = 5.33$
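
The test statistic for the example can be reproduced directly from the formula. This is a minimal sketch in plain Python; `chi_square_statistic` is a hypothetical helper name, and no continuity correction is applied.

```python
def chi_square_statistic(table):
    """Sum of (observed - expected)^2 / expected over every cell of the table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    return statistic

# Rows = trait / no trait, columns = heavily infected / not heavily infected.
chi2 = chi_square_statistic([[36, 100], [152, 255]])
print(round(chi2, 2))  # 5.33
```

Some statistical software applies a continuity correction to 2x2 tables by default (for example, `scipy.stats.chi2_contingency` with `correction=True`), which would give a somewhat smaller value than the uncorrected 5.33 computed here.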
  4. Find the P-value:

    • The p-value is the probability of obtaining a test statistic at least as large as the calculated one, assuming the null hypothesis is true.
    • Degrees of Freedom (df): $(\text{Number of Rows} - 1) \times (\text{Number of Columns} - 1)$
      • In the sickle cell example: $(2 - 1) \times (2 - 1) = 1$
    • Use a chi-square distribution table to find the p-value corresponding to the calculated $\chi^2$ statistic and degrees of freedom.
    • Example: For $\chi^2 = 5.33$ and df = 1, the p-value is between 0.02 and 0.025.
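
Instead of a printed table, the p-value for the example can be computed numerically. The sketch below uses only the standard library and is valid for df = 1 only: a chi-square variable with one degree of freedom is the square of a standard normal variable, so its upper-tail probability equals `erfc(sqrt(x / 2))`. The function names are hypothetical; tables with more degrees of freedom would need a general chi-square distribution routine from a statistics library.

```python
import math

def degrees_of_freedom(n_rows, n_cols):
    """df = (rows - 1) * (columns - 1) for a two-way table."""
    return (n_rows - 1) * (n_cols - 1)

def chi_square_p_value_df1(statistic):
    """Upper-tail probability of a chi-square distribution with df = 1 (only)."""
    return math.erfc(math.sqrt(statistic / 2))

df = degrees_of_freedom(2, 2)
p_value = chi_square_p_value_df1(5.33)
print(df)
print(p_value)  # falls between 0.02 and 0.025, matching the table lookup
```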
  5. Make a Statistical Decision and Conclusion:

    • If the p-value is less than or equal to the significance level ($\alpha$), reject the null hypothesis. This suggests there is a statistically significant relationship between the variables.
    • If the p-value is greater than $\alpha$, fail to reject the null hypothesis. This suggests there is not enough evidence to support a relationship between the variables.
    • Example Conclusion: Since the p-value range (0.02 - 0.025) is below the significance level of 0.05, we reject the null hypothesis. The sample provides statistically significant evidence that there is a relationship between the presence of sickle cell trait and the presence of malaria.
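
The decision rule in step 5 amounts to a single comparison. A minimal sketch, with `decide` as a hypothetical helper name and the default significance level of 0.05 taken from this example:

```python
def decide(p_value, alpha=0.05):
    """Apply the decision rule: reject H0 when the p-value is at most alpha."""
    if p_value <= alpha:
        return "reject H0"
    return "fail to reject H0"

# Both ends of the p-value range from the example lead to the same decision.
print(decide(0.02))
print(decide(0.025))
```

Because the whole range 0.02 to 0.025 lies below 0.05, the decision does not depend on where in that range the exact p-value falls.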