Notes on Chi-Squared Test of Independence (Chapter 5)

Section 5.0: Introduction

Topic: Detecting and describing associations between two categorical variables using the chi-squared test.
Key ideas:
- Terminology for categorical data analysis.
- Statistical dependence vs independence: whether population conditional distributions differ across categories.
- Significance testing framework: chi-squared test to determine if two categorical variables are statistically dependent or independent.
Core concepts:
- Dependence means the conditional distributions of one variable differ across categories of the other; independence means they are identical across categories.
- The test assesses whether the observed counts in a contingency table are consistent with independence under the null hypothesis $H_0$.
Notation (to be used throughout):
- Observed frequency in a cell: $f_0$.
- Expected frequency under independence: $f_e$.
- Sample size: $N$ (total number of observations).
- Degrees of freedom will be derived from table dimensions.
Practical points:
- Random sampling and a large sample are required for the chi-squared approximation to be valid.
- Fisher’s exact test is an alternative for small samples (not covered in detail here).

Section 5.1: Contingency Tables

Contingency tables display counts for all combinations of possible outcomes for two categorical variables.
Marginal distributions:
- Row totals and column totals summarize the marginal distributions.
Example 1 (Ex.1): Gender vs. political party identification (2×3 table)
- Variables: gender (Females, Males) and party ID (Democrat, Independent, Republican).
- Data:
- Females: Democrat = 573, Independent = 516, Republican = 422; Row total = 1511.
- Males: Democrat = 386, Independent = 475, Republican = 399; Row total = 1260.
- Totals: Democrat = 959, Independent = 991, Republican = 821; Column totals sum to 2771.
- Conditional distributions (within gender) as relative frequencies:
- Females: $\frac{573}{1511}=0.38, \frac{516}{1511}=0.34, \frac{422}{1511}=0.28$
- (Democrat, Independent, Republican) for Females: (38%, 34%, 28%).
- Males: $\frac{386}{1260}=0.31, \frac{475}{1260}=0.38, \frac{399}{1260}=0.32$
- (Democrat, Independent, Republican) for Males: (31%, 38%, 32%).
- Interpretation:
- In the sample, the conditional distributions differ between females and males (e.g., 38% vs. 31% Democrat), so the sample shows an association.
- Population question: Is party ID associated with gender?
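The conditional distributions in Ex.1 can be reproduced with a short sketch. The dictionary layout and the names `observed` and `conditional_distribution` are illustrative choices, not part of the notes; only the counts come from Ex.1.

```python
# Observed counts from Ex.1; the layout and names are illustrative.
PARTIES = ["Democrat", "Independent", "Republican"]
observed = {
    "Females": [573, 516, 422],
    "Males": [386, 475, 399],
}


def conditional_distribution(counts):
    """Return each count as a proportion of its row total."""
    total = sum(counts)
    return [count / total for count in counts]


conditionals = {
    group: conditional_distribution(row) for group, row in observed.items()
}

for group, props in conditionals.items():
    summary = ", ".join(f"{party} {p:.0%}" for party, p in zip(PARTIES, props))
    print(f"{group}: {summary}")
```

The rounded male percentages add to 101% only because each one is rounded separately.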
Ex.2 (Independence example): Party ID vs. ethnic group (data show independence)
- Table rows: Ethnic groups (White, Black, Hispanic).
- Example data (percentages are row percentages within each ethnic group; row totals in parentheses):
- White: Democrat 44%, Independent 14%, Republican 42% (Total 1000)
- Black: Democrat 44%, Independent 14%, Republican 42% (Total 100)
- Hispanic: Democrat 44%, Independent 14%, Republican 42% (Total 250)
- Conclusion: Since the probability of each party ID is the same across ethnic groups, party ID is independent of ethnic group.
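As a quick check of Ex.2, the sketch below converts the stated percentages into counts (44%, 14%, and 42% of each row total; these counts are derived here, not given in the notes) and compares them with the counts expected under independence, row total × column total / N.

```python
# Counts implied by Ex.2: 44% / 14% / 42% of each row total.
observed = [
    [440, 140, 420],  # White, total 1000
    [44, 14, 42],     # Black, total 100
    [110, 35, 105],   # Hispanic, total 250
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count under independence: row total * column total / N.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Largest absolute difference between an observed and an expected count.
max_gap = max(
    abs(o - e)
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
print(max_gap)
```

Because every row has the same conditional distribution, the observed and expected counts agree in every cell, so the chi-squared statistic for this table is zero.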
Ex.3 (Belief in life after death and other variables):
- General GSS finding: roughly 80% believe in life after death across groups defined by gender or race, suggesting independence in these cases.
- However, belief varies by religion (Catholics and Protestants ~80% vs. Jews and those with no religion ~40–50%), suggesting dependence between belief in life after death and religion.

Section 5.2: Chi-Squared Test of Independence

Purpose: Test whether two categorical variables are statistically independent in the population, based on a sample.
Hypotheses:
- Null hypothesis $H_0$: The variables are statistically independent.
- Alternative $H_a$: The variables are statistically dependent.
Requirements:
- Randomization and a large sample.
- Expected frequencies in each cell should be sufficiently large; a common rule is $f_e \ge 5$ in each cell.
Notation and quantities:
- Observed frequency in a cell: $f_0$.
- Expected frequency under $H_0$ (independence):
- $f_e = \frac{(\text{row total}) \times (\text{column total})}{N}.$
- Test statistic (Pearson chi-squared):
- $\chi^2 = \sum_{\text{cells}} \frac{(f_0 - f_e)^2}{f_e}.$
- Degrees of freedom: $\text{df} = (r - 1)(c - 1)$, where $r$ is the number of rows and $c$ the number of columns.
- P-value: Right-tail probability above the observed $\chi^2$ value.
- Decision rule: Reject $H_0$ if $P\text{-value} \le \alpha$.
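To make the P-value and decision rule concrete, here is a minimal sketch of the right-tail probability of a chi-squared distribution with integer df, using only the standard library. The helper name `chi2_sf` is hypothetical; it relies on the recurrence for the regularized upper incomplete gamma function, $Q(a+1, x) = Q(a, x) + x^a e^{-x} / \Gamma(a+1)$. In practice a statistics library (for example SciPy's `scipy.stats.chi2.sf`) would normally be used instead.

```python
import math


def chi2_sf(x, df):
    """Right-tail probability P(X >= x) for a chi-squared variable
    with a positive integer number of degrees of freedom."""
    if x <= 0:
        return 1.0
    h = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-h)               # Q(1, h) = exp(-h)
    else:
        a, q = 0.5, math.erfc(math.sqrt(h))    # Q(1/2, h) = erfc(sqrt(h))
    # Step a up to df / 2 using Q(a + 1, h) = Q(a, h) + h^a e^{-h} / Gamma(a + 1).
    while a < df / 2.0:
        q += h ** a * math.exp(-h) / math.gamma(a + 1.0)
        a += 1.0
    return q


# Decision rule: reject H0 when the P-value is at most alpha.
alpha = 0.05
p_value = chi2_sf(3.0, 1)
print(p_value, p_value <= alpha)
```

For df = 2 the function reduces to $e^{-\chi^2/2}$, which is convenient for checking 2×3 tables by hand.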
Ex.4: Calculation of expected frequencies for Example 1 (2×3 table)
- Observed counts ($f_0$) given in Ex.1.
- Total $N = 2771$; row totals: Females 1511, Males 1260; column totals: Democrat 959, Independent 991, Republican 821.
- Expected frequencies $f_e$ (rounded):
- Democrat × Females: $f_{e, \text{D,F}} = \frac{959 \times 1511}{2771} = 522.9$.
- Democrat × Males: $f_{e, \text{D,M}} = \frac{959 \times 1260}{2771} = 436.1$.
- Independent × Females: $f_{e, \text{I,F}} = \frac{991 \times 1511}{2771} = 540.4$.
- Independent × Males: $f_{e, \text{I,M}} = \frac{991 \times 1260}{2771} = 450.6$.
- Republican × Females: $f_{e, \text{R,F}} = \frac{821 \times 1511}{2771} = 447.7$.
- Republican × Males: $f_{e, \text{R,M}} = \frac{821 \times 1260}{2771} = 373.3$.
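The expected frequencies in Ex.4 can be checked with a small sketch. The function name `expected_frequencies` and the list-of-rows layout are assumptions made for illustration; the counts are those of Ex.1.

```python
# Rows: Females, Males. Columns: Democrat, Independent, Republican.
observed = [
    [573, 516, 422],
    [386, 475, 399],
]


def expected_frequencies(table):
    """Expected counts under independence: row total * column total / N."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


expected = expected_frequencies(observed)
for label, row in zip(["Females", "Males"], expected):
    print(label, [round(value, 1) for value in row])
```

A useful sanity check is that the expected frequencies have the same row and column totals as the observed counts.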
Ex.5: Run the chi-squared test for Example 1 (with $\alpha = 0.01$)
- Assumptions (step 1 of the five-step framework):
- Type of data: categorical.
- Randomization.
- Large sample: $f_e \ge 5$ in all cells (true in this example).
- Hypotheses:
- $H_0$: party ID and gender are independent.
- $H_a$: party ID and gender are dependent.
- Test statistic: Using the expected frequencies from Ex.4, $\chi^2 = 16.2$ with $\text{df} = (2-1)(3-1) = 2$.
- P-value: $P \approx 0.0003$ (computed via software; a table lookup gives only an approximate range).
- Conclusion: Since $P < \alpha = 0.01$, reject $H_0$; the sample gives strong evidence that gender and party ID are statistically dependent in the population.
- Interpretation: A large $\chi^2$ signals that an association exists but does not quantify its strength; strength should be assessed via conditional probabilities or effect-size measures.
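The test in Ex.5 can be reproduced with the sketch below, which uses only the standard library. It relies on the fact that for df = 2 the right-tail probability of the chi-squared distribution has the closed form $e^{-\chi^2/2}$; that shortcut is specific to df = 2 and is used here only to avoid an external dependency. Computed from unrounded expected frequencies, the statistic comes out near 16.2.

```python
import math

# Rows: Females, Males. Columns: Democrat, Independent, Republican.
observed = [
    [573, 516, 422],
    [386, 475, 399],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Pearson statistic: sum over cells of (observed - expected)^2 / expected.
chi2 = 0.0
for i, row in enumerate(observed):
    for j, f_obs in enumerate(row):
        f_exp = row_totals[i] * col_totals[j] / n
        chi2 += (f_obs - f_exp) ** 2 / f_exp

df = (len(observed) - 1) * (len(observed[0]) - 1)

# Closed-form right-tail probability, valid only because df == 2.
p_value = math.exp(-chi2 / 2)

alpha = 0.01
reject_h0 = p_value <= alpha
print(f"chi2 = {chi2:.1f}, df = {df}, P = {p_value:.4f}, reject H0: {reject_h0}")
```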
Ex.6: Strength of association vs. sample size (three cases)
- All three cases cross-classify race (White, Black) with a Yes/No response and share the same conditional distributions (White: 49% Yes, 51% No; Black: 51% Yes, 49% No); only the sample size differs:
- Case A: 100 per race ($N = 200$); $\chi^2 = 0.08$, $P\text{-value} = 0.78$.
- Case B: 200 per race ($N = 400$); $\chi^2 = 0.16$, $P\text{-value} = 0.69$.
- Case C: 10,000 per race ($N = 20{,}000$); $\chi^2 = 8.0$, $P\text{-value} = 0.005$.
- Key takeaway:
- For fixed conditional distributions (i.e., fixed cell proportions), $\chi^2$ is directly proportional to the sample size: when an association is present in the sample, larger samples yield larger $\chi^2$ values and smaller $P\text{-values}$.
- A small $P\text{-value}$ can occur with a large sample despite a weak association (as in Case C).
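The three cases can be recomputed with a short sketch. It assumes equal group sizes per race with total sample sizes of 200, 400, and 20,000, which is the reading that matches the reported $\chi^2$ values. For a 2×2 table df = 1, and the right-tail probability can be written with the complementary error function as $\operatorname{erfc}(\sqrt{\chi^2/2})$. The function names are illustrative.

```python
import math


def pearson_chi2(table):
    """Pearson chi-squared statistic for a table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return sum(
        (f_obs - r * c / n) ** 2 / (r * c / n)
        for row, r in zip(table, row_totals)
        for f_obs, c in zip(row, col_totals)
    )


def p_value_df1(chi2):
    """Right-tail chi-squared probability for df = 1 only."""
    return math.erfc(math.sqrt(chi2 / 2))


results = {}
# k scales every cell count while keeping the 49% / 51% split fixed.
for case, k in [("A", 1), ("B", 2), ("C", 100)]:
    table = [
        [49 * k, 51 * k],  # White: Yes, No
        [51 * k, 49 * k],  # Black: Yes, No
    ]
    chi2 = pearson_chi2(table)
    results[case] = (chi2, p_value_df1(chi2))
    print(f"Case {case}: N = {200 * k}, chi2 = {chi2:.2f}, P = {results[case][1]:.3f}")
```

The difference between the groups is two percentage points in every case; only the sample size moves the P-value below 0.01.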

Section 5.3: Chapter Summary

Main idea: A sample shows association when certain values of one variable tend to go with certain values of the other.
Approaches to study association between two categorical variables:
- Describe the counts via percentage (conditional) distributions of the response variable within each category of the other variable.
- Independence in the population means identical conditional distributions across the levels of the other variable.
- If they are not identical, the variables are statistically dependent.
Chi-squared test for independence:
- Compare observed frequencies $f_0$ to expected frequencies $f_e$ under $H_0$:
- $\chi^2 = \sum \frac{(f_0 - f_e)^2}{f_e}.$
- Under $H_0$, the statistic has approximately a chi-squared distribution for large samples.
- Degrees of freedom: $\text{df} = (r - 1)(c - 1).$
- P-value: right-tail probability above the observed $\chi^2$.
Practical notes:
- The chi-squared test indicates association, not strength.
- The statistic is sensitive to sample size: large samples can produce significant results even for weak associations.
- For small samples, consider Fisher’s exact test.
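The sensitivity to sample size follows directly from the formulas. As a short derivation, suppose every cell count is multiplied by a constant $k > 0$ (an assumption that keeps all conditional distributions fixed). Row totals, column totals, and $N$ then all scale by $k$, and so do the expected frequencies:

```latex
f_e' = \frac{(k \cdot \text{row total}) \times (k \cdot \text{column total})}{k N} = k f_e,
\qquad
\chi'^2 = \sum \frac{(k f_0 - k f_e)^2}{k f_e}
        = k \sum \frac{(f_0 - f_e)^2}{f_e}
        = k \chi^2 .
```

Since df does not change with $k$, a larger $\chi^2$ at the same df means a smaller P-value whenever the sample shows any association at all.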

Section 5.4: Practice Problems

Problem (1): Abortion opinion by gender (GSS-type data)
- Tasks:
- (a) Construct a contingency table showing the conditional distribution of opinion on unrestricted abortion (Yes, No) by gender (Male, Female).
- (b) Assess plausibility of statistical independence between gender and opinion, with justification.
Problem (2): Educational level and marital status (Country A)
- Data: 423 respondents with counts across Educational levels (Middle school or lower, High school, Bachelor’s, Master’s/PhD) and Marital status (Never married, Married, Divorced or widowed).
- Table (given; each row lists counts for the educational levels, presumably in the order above):
- Never married: 32, 53, 46, 17
- Married: 21, 54, 48, 72
- Divorced or widowed: 12, 13, 21, 34
- Tasks:
- Use $\alpha = 0.05$ and run the chi-squared test to determine if Educational level and Marital status are independent.
General notes for solving practice problems:
- Build the $r \times c$ contingency table.
- Compute row totals, column totals, and $N$.
- Compute $f_e$ for each cell with $f_e = \frac{(\text{row total}) (\text{column total})}{N}$.
- Compute $\chi^2$ with $\chi^2 = \sum \frac{(f_0 - f_e)^2}{f_e}$ and $\text{df} = (r-1)(c-1)$.
- Determine the $P\text{-value}$ from the $\chi^2$ distribution with the appropriate $\text{df}$, and compare it to $\alpha$ to decide whether to reject $H_0$ (independence).
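The steps above can be sketched as one function. This is a minimal standard-library version under stated assumptions: the names `chi2_independence` and `chi2_right_tail` are hypothetical, the P-value helper handles integer df only, and the input shown is the Problem (2) table with its column order assumed to follow the listed educational levels (the order does not affect $\chi^2$). A library routine such as SciPy's `scipy.stats.chi2_contingency` would be the usual choice in practice.

```python
import math


def chi2_right_tail(x, df):
    """P(X >= x) for a chi-squared variable with positive integer df."""
    if x <= 0:
        return 1.0
    h = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-h)
    else:
        a, q = 0.5, math.erfc(math.sqrt(h))
    # Recurrence Q(a + 1, h) = Q(a, h) + h^a e^{-h} / Gamma(a + 1).
    while a < df / 2.0:
        q += h ** a * math.exp(-h) / math.gamma(a + 1.0)
        a += 1.0
    return q


def chi2_independence(table, alpha=0.05):
    """Follow the listed steps for an r x c table given as a list of rows."""
    # Row totals, column totals, and N.
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Expected frequency for each cell.
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    # Pearson statistic and degrees of freedom.
    chi2 = sum(
        (f_obs - f_exp) ** 2 / f_exp
        for obs_row, exp_row in zip(table, expected)
        for f_obs, f_exp in zip(obs_row, exp_row)
    )
    df = (len(table) - 1) * (len(table[0]) - 1)
    # P-value and decision.
    p_value = chi2_right_tail(chi2, df)
    return {
        "n": n,
        "expected": expected,
        "min_expected": min(min(row) for row in expected),
        "chi2": chi2,
        "df": df,
        "p_value": p_value,
        "reject_h0": p_value <= alpha,
    }


# Problem (2) counts: rows are Never married, Married, Divorced or widowed.
table = [
    [32, 53, 46, 17],
    [21, 54, 48, 72],
    [12, 13, 21, 34],
]
result = chi2_independence(table, alpha=0.05)
print(result["df"], result["min_expected"] >= 5)
```

Checking `min_expected` before reading the P-value mirrors the large-sample requirement; if it falls below 5, the chi-squared approximation may be unreliable.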