Notes on Chi-Squared Test of Independence (Chapter 5)

Section 5.0: Introduction

  • Topic: Detecting and describing associations between two categorical variables using the chi-squared test.

  • Key ideas:

    • Terminology for categorical data analysis.

    • Statistical dependence vs independence: whether population conditional distributions differ across categories.

    • Significance testing framework: chi-squared test to determine if two categorical variables are statistically dependent or independent.

  • Core concepts:

    • Dependence means the conditional distributions of one variable differ across categories of the other variable; independence means they are identical across categories.

    • The test assesses whether observed counts in a contingency table are consistent with independence under the null hypothesis H_0.

  • Notation (to be used throughout):

    • Observed frequency in a cell: f_0.

    • Expected frequency under independence: f_e.

    • Sample size: N (total number of observations).

    • Degrees of freedom will be derived from table dimensions.

  • Practical points:

    • Random sampling and a large sample are required for the chi-squared approximation to be valid.

    • Fisher’s exact test is an alternative for small samples (not covered in detail here).

Section 5.1: Contingency Tables

  • Contingency tables display counts for all combinations of possible outcomes for two categorical variables.

  • Marginal distributions: the row totals and column totals give the marginal distribution of each variable.

  • Example 1 (Ex.1): Gender vs. Political party identification (2×3 table)

    • Variables: gender (Females, Males) and party ID (Democrat, Independent, Republican).

    • Data:

    • Females: Democrat = 573, Independent = 516, Republican = 422; Row total = 1511.

    • Males: Democrat = 386, Independent = 475, Republican = 399; Row total = 1260.

    • Totals: Democrat = 959, Independent = 991, Republican = 821; Column totals sum to 2771.

    • Conditional distributions (within gender) as relative frequencies:

    • Females: \frac{573}{1511}=0.38, \frac{516}{1511}=0.34, \frac{422}{1511}=0.28

    • (Democrat, Independent, Republican) for Females: (38%, 34%, 28%).

    • Males: \frac{386}{1260}=0.31, \frac{475}{1260}=0.38, \frac{399}{1260}=0.32

    • (Democrat, Independent, Republican) for Males: (31%, 38%, 32%).

    • Interpretation:

    • Whether there is an association is based on whether these conditional distributions differ between females and males.

    • Population question: Is party ID associated with gender?
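The conditional distributions in Ex.1 can be reproduced with a few lines of Python. This is a minimal sketch using only the counts listed above; the names `observed` and `conditional_distribution` are illustrative choices, not part of the notes.

```python
# Observed counts from Ex.1: one row per gender, columns ordered by party ID.
parties = ["Democrat", "Independent", "Republican"]
observed = {
    "Females": [573, 516, 422],
    "Males": [386, 475, 399],
}


def conditional_distribution(counts):
    """Divide each count by its row total to get within-row proportions."""
    total = sum(counts)
    return [count / total for count in counts]


for gender, counts in observed.items():
    shares = conditional_distribution(counts)
    formatted = ", ".join(f"{p}: {s:.2f}" for p, s in zip(parties, shares))
    print(f"{gender} (n = {sum(counts)}): {formatted}")
```

Comparing the two printed rows side by side is the descriptive step; whether the difference reflects a population association is the question the test in Section 5.2 addresses.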

  • Ex.2 (Independence example): Party ID vs. Ethnic group (data show independence)

    • Table rows: Ethnic groups (White, Black, Hispanic).

    • Example data (percentages are conditional distributions within each ethnic group; group sizes in parentheses):

    • White: Democrat 44%, Independent 14%, Republican 42% (Total 1000)

    • Black: Democrat 44%, Independent 14%, Republican 42% (Total 100)

    • Hispanic: Democrat 44%, Independent 14%, Republican 42% (Total 250)

    • Conclusion: Since the probability of each party ID is the same across ethnic groups, party ID is independent of ethnic group.
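As a quick check of Ex.2, the sketch below tests whether every ethnic group has exactly the same conditional distribution. The cell counts are an assumption: they are derived here from the stated percentages and group sizes (for example, 44% of 1000 = 440) and are not listed in the notes. Exact fractions are used so the comparison is not affected by floating-point rounding.

```python
from fractions import Fraction

# Counts implied by the Ex.2 percentages and group sizes (derived, not given).
# Columns: Democrat, Independent, Republican.
table = {
    "White": [440, 140, 420],
    "Black": [44, 14, 42],
    "Hispanic": [110, 35, 105],
}


def row_shares(counts):
    """Exact within-row proportions as fractions."""
    total = sum(counts)
    return [Fraction(count, total) for count in counts]


shares = {group: row_shares(counts) for group, counts in table.items()}

# The table shows independence when all rows share one conditional distribution.
independent_in_table = len({tuple(s) for s in shares.values()}) == 1
print(independent_in_table)
```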

  • Ex.3 (Belief in life after death):

    • General GSS finding: about 80% believe in life after death across gender and race groups, suggesting that belief is independent of these variables.

    • However, belief varies by religion (Catholics, Protestants ~80% vs. Jews and those with no religion ~40–50%), suggesting potential dependence between life-after-death belief and religion.

Section 5.2: Chi-Squared Test of Independence

  • Purpose: Test whether two categorical variables are statistically independent in the population, based on a sample.

  • Hypotheses:

    • Null hypothesis H_0: The variables are statistically independent.

    • Alternative H_a: The variables are statistically dependent.

  • Requirements:

    • Randomization and a large sample.

    • Expected frequencies in each cell should be sufficiently large; a common rule is f_e > 5 in each cell.

  • Notation and quantities:

    • Observed frequency in a cell: f_0.

    • Expected frequency under H_0 (independence):

    • f_e = \frac{(\text{row total}) \times (\text{column total})}{N}.

    • Test statistic (Pearson chi-squared):

    • \chi^2 = \sum_{\text{cells}} \frac{(f_0 - f_e)^2}{f_e}.

    • Degrees of freedom: \text{df} = (r - 1)(c - 1) where r is the number of rows and c the number of columns.

    • P-value: Right-tail probability above the observed \chi^2 value.

    • Decision rule: Reject H_0 if P\text{-value} \le \alpha.
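The right-tail P-value and the decision rule can be sketched without a statistics library in two special cases: df = 1 (where a chi-squared variable is a squared standard normal) and even df (where the tail probability is a finite sum). The function names below are hypothetical, and restricting to these cases is a simplification made for this sketch; for other df a statistics package would be needed.

```python
import math


def chi2_right_tail(chi2, df):
    """Right-tail probability P(X >= chi2) for a chi-squared variable.

    Only df = 1 and even df are handled, because those cases have
    closed forms that need nothing beyond the standard library.
    """
    if chi2 < 0:
        raise ValueError("chi2 must be non-negative")
    if df == 1:
        # chi-squared(1) is Z^2, so P(Z^2 >= x) = P(|Z| >= sqrt(x)).
        return math.erfc(math.sqrt(chi2 / 2))
    if df > 0 and df % 2 == 0:
        # For df = 2m: exp(-x/2) * sum_{k < m} (x/2)^k / k!
        half = chi2 / 2
        terms = sum(half ** k / math.factorial(k) for k in range(df // 2))
        return math.exp(-half) * terms
    raise NotImplementedError("use a statistics library for other df")


def reject_null(chi2, df, alpha):
    """Decision rule: reject H_0 when the P-value is at most alpha."""
    return chi2_right_tail(chi2, df) <= alpha


# Standard 5% critical values should give P-values of about 0.05.
print(round(chi2_right_tail(3.841, 1), 3))
print(round(chi2_right_tail(5.991, 2), 3))
```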

  • Ex.4: Calculation of expected frequencies for Example 1 (2×3 table)

    • Observed counts (f_0) given in Ex.1.

    • Total N = 2771; row totals: Females 1511, Males 1260; column totals: Democrat 959, Independent 991, Republican 821.

    • Expected frequencies f_e (rounded):

    • Democrat × Females: f_{e, \text{D,F}} = \frac{959 \times 1511}{2771} = 522.9.

    • Democrat × Males: f_{e, \text{D,M}} = \frac{959 \times 1260}{2771} = 436.1.

    • Independent × Females: f_{e, \text{I,F}} = \frac{991 \times 1511}{2771} = 540.4.

    • Independent × Males: f_{e, \text{I,M}} = \frac{991 \times 1260}{2771} = 450.6.

    • Republican × Females: f_{e, \text{R,F}} = \frac{821 \times 1511}{2771} = 447.7.

    • Republican × Males: f_{e, \text{R,M}} = \frac{821 \times 1260}{2771} = 373.3.
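The six expected frequencies can be checked programmatically. The sketch below applies f_e = (row total × column total) / N to the Ex.1 counts; `expected_frequencies` is an illustrative helper name, not something defined in the notes.

```python
# Observed counts from Ex.1 (rows: Females, Males; columns: D, I, R).
observed = [
    [573, 516, 422],
    [386, 475, 399],
]


def expected_frequencies(table):
    """Return f_e = (row total * column total) / N for every cell."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


expected = expected_frequencies(observed)
for label, row in zip(["Females", "Males"], expected):
    print(label, [round(value, 1) for value in row])
```

A useful sanity check is that the expected frequencies have the same row and column totals as the observed counts.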

  • Ex.5: Run the chi-squared test for Example 1 (with \alpha = 0.01), following the five-step framework

    • Assumptions:

    • Type of data: categorical.

    • Randomization.

    • Large sample: f_e > 5 in all cells (true in this example).

    • Hypotheses:

    • H_0: party ID and gender are independent.

    • H_a: party ID and gender are dependent.

    • Test statistic: Using the observed counts from Ex.1 and the expected frequencies from Ex.4, \chi^2 = 16.2 with \text{df} = (2-1)(3-1) = 2.

    • P-value: P \approx 0.0003 (computed via software; table lookup may approximate).

    • Conclusion: Since P < \alpha = 0.01, reject H_0; the sample provides strong evidence that gender and party ID are statistically dependent in the population.

    • Interpretation: A large \chi^2 is evidence that an association exists, but it does not measure how strong the association is; assess strength through the conditional distributions or an effect size measure.
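The test statistic and P-value for Ex.5 can be recomputed from the Ex.1 counts. This is a minimal standard-library sketch; it relies on the fact that with df = 2 the right-tail probability of the chi-squared distribution is exp(-\chi^2 / 2), so it does not generalize to other table sizes as written.

```python
import math

# Observed counts from Ex.1 (rows: Females, Males; columns: D, I, R).
observed = [[573, 516, 422], [386, 475, 399]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Sum (f_0 - f_e)^2 / f_e over all six cells.
chi2 = 0.0
for i, row in enumerate(observed):
    for j, f_obs in enumerate(row):
        f_exp = row_totals[i] * col_totals[j] / n
        chi2 += (f_obs - f_exp) ** 2 / f_exp

df = (len(observed) - 1) * (len(observed[0]) - 1)

# Closed form for the right tail, valid only because df = 2 here.
p_value = math.exp(-chi2 / 2)

alpha = 0.01
print(f"chi2 = {chi2:.1f}, df = {df}, P = {p_value:.4f}, reject H_0: {p_value <= alpha}")
```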

  • Ex.6: Strength of association vs. sample size (three cases)

    • Each case cross-classifies race (White, Black) by a Yes/No response, and all three cases have the same conditional distributions (White: 49% Yes, 51% No; Black: 51% Yes, 49% No); only the sample size differs.

    • Case A: 100 respondents per race (N = 200); \chi^2 = 0.08, P\text{-value} = 0.78.

    • Case B: 200 respondents per race (N = 400); \chi^2 = 0.16, P\text{-value} = 0.69.

    • Case C: 10,000 respondents per race (N = 20,000); \chi^2 = 8.0, P\text{-value} = 0.005.

    • Key takeaway:

    • For a fixed conditional distribution, \chi^2 is directly proportional to the sample size: larger samples yield larger \chi^2 values and smaller P\text{-values}.

    • A small P\text{-value} can occur with a large sample despite a weak association (as in Case C).
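The three cases can be reproduced numerically. In the sketch below the cell counts are an assumption: they are built from the 49%/51% split with equal group sizes and overall sample sizes of 200, 400, and 20,000, which is the reading that matches the reported \chi^2 values. For a 2×2 table df = 1, so the P-value is computed from the standard normal tail via `math.erfc`.

```python
import math


def chi_squared(table):
    """Pearson chi-squared statistic for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    total = 0.0
    for i, row in enumerate(table):
        for j, f_obs in enumerate(row):
            f_exp = row_totals[i] * col_totals[j] / n
            total += (f_obs - f_exp) ** 2 / f_exp
    return total


# Rows: White, Black; columns: Yes, No (counts assumed from the percentages).
cases = {
    "A": [[49, 51], [51, 49]],
    "B": [[98, 102], [102, 98]],
    "C": [[4900, 5100], [5100, 4900]],
}

results = {}
for name, table in cases.items():
    chi2 = chi_squared(table)
    # df = 1: P(chi-squared >= x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    results[name] = (chi2, p_value)
    n = sum(sum(row) for row in table)
    print(f"Case {name}: N = {n}, chi2 = {chi2:.2f}, P = {p_value:.3f}")
```

Multiplying every count by 100 (Case A to Case C) multiplies \chi^2 by 100 while the conditional distributions stay the same.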

Section 5.3: Chapter Summary

  • Main idea: A sample shows association when certain values of one variable tend to go with certain values of the other.

  • Approaches to study association between two categorical variables:

    • Describe the counts via conditional (percentage) distributions of the response variable within each category of the other variable.

    • Independence in population means identical conditional distributions across the levels of the other variable.

    • If not identical, variables are statistically dependent.

  • Chi-squared test for independence:

    • Compare observed frequencies f_0 to expected frequencies f_e under H_0:

    • \chi^2 = \sum \frac{(f_0 - f_e)^2}{f_e}.

    • For large samples, \chi^2 has approximately a chi-squared distribution under H_0.

    • Degrees of freedom: \text{df} = (r - 1)(c - 1).

    • P-value: right-tail probability above the observed \chi^2.

  • Practical notes:

    • The chi-squared test indicates association, not strength.

    • The statistic is sensitive to sample size: large samples can produce significant results even for weak associations.

    • For small samples, consider Fisher’s exact test.
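The sensitivity to sample size follows directly from the formula. A short derivation, using notation introduced here rather than in the notes: let p_{ij} = f_0 / N be the observed proportion in cell (i, j), and let p_{i+} and p_{+j} be the corresponding row and column proportions.

```latex
f_e = \frac{(N p_{i+})(N p_{+j})}{N} = N \, p_{i+} \, p_{+j},
\qquad
\chi^2 = \sum_{\text{cells}} \frac{(N p_{ij} - N p_{i+} p_{+j})^2}{N p_{i+} p_{+j}}
       = N \sum_{\text{cells}} \frac{(p_{ij} - p_{i+} p_{+j})^2}{p_{i+} p_{+j}}.
```

The sum on the right depends only on the cell proportions, so if those stay fixed while N grows, \chi^2 grows in proportion to N.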

Section 5.4: Practice Problems

  • Problem (1): Abortion opinion by gender (GSS-type data)

    • Tasks:

    • (a) Construct a contingency table showing the conditional distribution of opinion on unrestricted abortion (Yes, No) by gender (Male, Female).

    • (b) Assess plausibility of statistical independence between gender and opinion, with justification.

  • Problem (2): Educational level and marital status (Country A)

    • Data: 423 respondents with counts across Educational levels (Middle school or lower, High school, Bachelor’s, Master’s/PhD) and Marital status (Never married, Married, Divorced or widowed).

    • Table (given):

    • Never married: 32, 53, 46, 17

    • Married: 21, 54, 48, 72

    • Divorced or widowed: 12, 13, 21, 34

    • Tasks:

    • Use \alpha = 0.05 and run the chi-squared test to determine if Educational level and Marital status are independent.

  • General notes for solving practice problems:

    • Build the r \times c contingency table.

    • Compute row totals, column totals, and N.

    • Compute f_e for each cell with f_e = \frac{(\text{row total}) (\text{column total})}{N}.

    • Compute \chi^2 with \chi^2 = \sum \frac{(f_0 - f_e)^2}{f_e} and \text{df} = (r-1)(c-1).

    • Determine the P\text{-value} from the \chi^2 distribution with the appropriate \text{df}, and compare it to \alpha to decide whether to reject H_0 (independence).
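The steps above can be sketched as one small function. Two choices here are assumptions made for a library-free example: the decision is made by comparing \chi^2 with a tabulated 5% critical value (equivalent to checking P-value \le 0.05), and the table of critical values covers only df = 1 to 6. The function and variable names are illustrative. The usage lines apply it to the Problem (2) table.

```python
# Upper 5% critical values of the chi-squared distribution for df = 1..6
# (standard table values; extend from a table or software for larger df).
CRITICAL_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488, 5: 11.070, 6: 12.592}


def chi_squared_test(table):
    """Return (chi2, df, smallest expected frequency) for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    min_expected = float("inf")
    for i, row in enumerate(table):
        for j, f_obs in enumerate(row):
            f_exp = row_totals[i] * col_totals[j] / n
            min_expected = min(min_expected, f_exp)
            chi2 += (f_obs - f_exp) ** 2 / f_exp
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df, min_expected


# Problem (2): rows are marital status, columns are educational level.
marital_by_education = [
    [32, 53, 46, 17],  # Never married
    [21, 54, 48, 72],  # Married
    [12, 13, 21, 34],  # Divorced or widowed
]

chi2, df, min_expected = chi_squared_test(marital_by_education)
reject = chi2 >= CRITICAL_05[df]
print(f"chi2 = {chi2:.2f}, df = {df}, smallest f_e = {min_expected:.1f}")
print(f"Reject H_0 at alpha = 0.05: {reject}")
```

Returning the smallest expected frequency makes it easy to confirm the large-sample requirement before trusting the result.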