Notes on Chi-Squared Test of Independence (Chapter 5)
Section 5.0: Introduction
Topic: Detecting and describing associations between two categorical variables using the chi-squared test.
Key ideas:
- Terminology for categorical data analysis.
- Statistical dependence vs. independence: whether population conditional distributions differ across categories.
- Significance testing framework: chi-squared test to determine whether two categorical variables are statistically dependent or independent.
Core concepts:
- Dependence means the conditional distributions of one variable differ across the categories of the other; independence means they are identical across those categories.
- The test assesses whether the observed counts in a contingency table are consistent with independence, which is the null hypothesis H_0.
Notation (to be used throughout):
- Observed frequency in a cell: f_0.
- Expected frequency under independence: f_e.
- Sample size: N (total number of observations).
- Degrees of freedom: derived from the table dimensions.
Practical points:
- Random sampling and a large sample are required for the chi-squared approximation to be valid.
- Fisher’s exact test is an alternative for small samples (not covered in detail here).
Section 5.1: Contingency Tables
Contingency tables display counts for all combinations of possible outcomes for two categorical variables.
Marginal distributions: the row totals and column totals summarize the marginal distributions of the two variables.
Example 1 (Ex.1): Gender vs. political party identification (2×3 table)
- Variables: gender (Females, Males) and party ID (Democrat, Independent, Republican).
- Data (observed counts):
  - Females: Democrat = 573, Independent = 516, Republican = 422; row total = 1511.
  - Males: Democrat = 386, Independent = 475, Republican = 399; row total = 1260.
  - Totals: Democrat = 959, Independent = 991, Republican = 821; N = 2771.
- Conditional distributions of party ID within each gender, as relative frequencies for (Democrat, Independent, Republican):
  - Females: \frac{573}{1511} = 0.38, \frac{516}{1511} = 0.34, \frac{422}{1511} = 0.28, i.e., (38%, 34%, 28%).
  - Males: \frac{386}{1260} = 0.31, \frac{475}{1260} = 0.38, \frac{399}{1260} = 0.32, i.e., (31%, 38%, 32%); the percentages sum to 101% because of rounding.
- Interpretation:
  - The sample shows an association if these conditional distributions differ between females and males.
  - Population question: Is party ID associated with gender?
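The conditional distributions for Example 1 can be reproduced in a few lines of Python. This is only a minimal sketch using the standard library; the names `counts` and `conditional_distribution` are illustrative and do not come from the notes.

```python
# Observed counts from Example 1 (rows: gender, columns: party ID).
parties = ["Democrat", "Independent", "Republican"]
counts = {
    "Females": [573, 516, 422],
    "Males": [386, 475, 399],
}


def conditional_distribution(row):
    """Return each count divided by its row total."""
    total = sum(row)
    return [count / total for count in row]


cond = {gender: conditional_distribution(row) for gender, row in counts.items()}

for gender, props in cond.items():
    shares = ", ".join(f"{party}: {p:.2f}" for party, p in zip(parties, props))
    print(f"{gender} -> {shares}")
```

Comparing the two printed rows is the descriptive step; whether the difference reflects a population association is the question the chi-squared test addresses.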
Ex.2 (Independence example): Party ID vs. ethnic group (data show independence)
- Table rows: ethnic groups (White, Black, Hispanic).
- Example data (conditional distribution of party ID within each ethnic group, as percentages; totals are row counts):
  - White: Democrat 44%, Independent 14%, Republican 42% (total 1000)
  - Black: Democrat 44%, Independent 14%, Republican 42% (total 100)
  - Hispanic: Democrat 44%, Independent 14%, Republican 42% (total 250)
- Conclusion: Since the probability of each party ID is the same in every ethnic group, party ID is independent of ethnic group.
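The independence in Ex.2 can also be checked numerically. The sketch below assumes the counts implied by the stated percentages and row totals (for instance, 44% of 1000 = 440); those counts are derived here for illustration and are not listed in the notes. It compares each observed count with the count expected under independence, (row total × column total) / N.

```python
# Counts implied by the Ex.2 percentages and row totals (an inference,
# e.g. 44% of 1000 = 440); columns: Democrat, Independent, Republican.
table = {
    "White": [440, 140, 420],
    "Black": [44, 14, 42],
    "Hispanic": [110, 35, 105],
}

n = sum(sum(row) for row in table.values())
col_totals = [sum(col) for col in zip(*table.values())]

# When every row has the same conditional distribution, the expected
# count (row total * column total / N) matches the observed count.
max_gap = 0.0
for group, row in table.items():
    row_total = sum(row)
    for observed, col_total in zip(row, col_totals):
        expected = row_total * col_total / n
        max_gap = max(max_gap, abs(observed - expected))

print(f"largest |observed - expected| = {max_gap:.6f}")
```

A gap of zero in every cell is what exact independence in a table looks like; real samples rarely match this closely even when the population variables are independent.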
Ex.3 (Belief in life after death vs. other variables):
- General GSS finding: belief in life after death is roughly 80% in each gender and each racial group, suggesting independence in these cases.
- However, belief varies by religion (Catholics and Protestants ~80% vs. Jews and those with no religion ~40–50%), suggesting potential dependence between life-after-death belief and religion.
Section 5.2: Chi-Squared Test of Independence
Purpose: Test whether two categorical variables are statistically independent in the population, based on a sample.
Hypotheses:
- Null hypothesis H_0: The variables are statistically independent.
- Alternative H_a: The variables are statistically dependent.
Requirements:
- Randomization and a large sample.
- Expected frequencies in each cell should be sufficiently large; a common rule is f_e > 5 in each cell.
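Notation and quantities:
- Observed frequency in a cell: f_0.
- Expected frequency under H_0 (independence):
  f_e = \frac{(\text{row total}) \times (\text{column total})}{N}.
  Under independence each cell probability is the product of the marginal probabilities, which are estimated by \frac{\text{row total}}{N} and \frac{\text{column total}}{N}; multiplying their product by N gives f_e.
- Test statistic (Pearson chi-squared):
  \chi^2 = \sum_{\text{cells}} \frac{(f_0 - f_e)^2}{f_e}.
- Degrees of freedom: \text{df} = (r - 1)(c - 1), where r is the number of rows and c the number of columns.
- P-value: Right-tail probability of the chi-squared distribution (with this df) above the observed \chi^2 value.
- Decision rule: Reject H_0 if P\text{-value} \le \alpha.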
Ex.4: Calculation of expected frequencies for Example 1 (2×3 table)
- Observed counts (f_0) given in Ex.1.
- Total N = 2771; row totals: Females 1511, Males 1260; column totals: Democrat 959, Independent 991, Republican 821.
- Expected frequencies f_e (rounded):
  - Democrat × Females: f_{e, \text{D,F}} = \frac{959 \times 1511}{2771} = 522.9.
  - Democrat × Males: f_{e, \text{D,M}} = \frac{959 \times 1260}{2771} = 436.1.
  - Independent × Females: f_{e, \text{I,F}} = \frac{991 \times 1511}{2771} = 540.4.
  - Independent × Males: f_{e, \text{I,M}} = \frac{991 \times 1260}{2771} = 450.6.
  - Republican × Females: f_{e, \text{R,F}} = \frac{821 \times 1511}{2771} = 447.7.
  - Republican × Males: f_{e, \text{R,M}} = \frac{821 \times 1260}{2771} = 373.3.
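The six expected frequencies above follow one rule, so they can be computed in a single pass. The following is a minimal standard-library sketch; the variable names (`observed`, `expected`, and so on) are illustrative choices, not notation from the notes.

```python
# Observed counts from Example 1 (rows: Females, Males;
# columns: Democrat, Independent, Republican).
observed = [
    [573, 516, 422],
    [386, 475, 399],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# f_e = (row total * column total) / N for every cell.
expected = [[r * c / n for c in col_totals] for r in row_totals]

for label, row in zip(["Females", "Males"], expected):
    print(label, [round(f_e, 1) for f_e in row])
```

A useful check on hand calculations: the expected frequencies have the same row and column totals as the observed counts.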
Ex.5: Chi-squared test for Example 1 (with \alpha = 0.01), following the five-step framework
- Assumptions:
  - Type of data: categorical.
  - Randomization.
  - Large sample: f_e > 5 in all cells (true in this example).
- Hypotheses:
  - H_0: party ID and gender are independent.
  - H_a: party ID and gender are dependent.
- Test statistic: Using the expected frequencies from Ex.4, \chi^2 = 16.2 with \text{df} = (2-1)(3-1) = 2.
- P-value: P \approx 0.0003 (computed via software; a table lookup gives only an approximation).
- Conclusion: Since P < \alpha = 0.01, reject H_0; the sample gives strong evidence that gender and party ID are statistically dependent in the population.
- Interpretation: A large \chi^2 is evidence that an association exists, but it does not measure how strong the association is. Strength should be assessed via the conditional probabilities or effect-size measures.
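The test in Ex.5 can be checked with a short script. This sketch uses only the standard library and relies on a fact specific to df = 2: the chi-squared right-tail probability equals exp(-\chi^2 / 2). For other df values a general routine or statistical software would be needed. Variable names are illustrative.

```python
import math

# Observed counts from Example 1.
observed = [
    [573, 516, 422],   # Females
    [386, 475, 399],   # Males
]
alpha = 0.01

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Pearson statistic: sum over cells of (f_0 - f_e)^2 / f_e.
chi2 = 0.0
for i, row in enumerate(observed):
    for j, f_0 in enumerate(row):
        f_e = row_totals[i] * col_totals[j] / n
        chi2 += (f_0 - f_e) ** 2 / f_e

df = (len(observed) - 1) * (len(observed[0]) - 1)
if df != 2:
    raise ValueError("the closed-form P-value below is valid only for df = 2")

# Right-tail probability of the chi-squared distribution with df = 2.
p_value = math.exp(-chi2 / 2)

print(f"chi2 = {chi2:.1f}, df = {df}, P = {p_value:.4f}")
print("reject H_0" if p_value <= alpha else "do not reject H_0")
```

The printed statistic and P-value can be compared with the hand calculation; small differences may arise from rounding the expected frequencies.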
Ex.6: Strength of association vs. sample size (three cases)
- Each case cross-classifies race (White, Black) by a Yes/No answer to a certain question. The conditional distributions are the same in all three cases (49% vs. 51% Yes); only the sample size differs:
  - Case A: White 49% Yes, 51% No; Black 51% Yes, 49% No; 100 per race (N = 200); \chi^2 = 0.08, P\text{-value} = 0.78.
  - Case B: same percentages with N = 400 (200 per race); \chi^2 = 0.16, P\text{-value} = 0.69.
  - Case C: same percentages with N = 20,000 (10,000 per race); \chi^2 = 8.0, P\text{-value} = 0.005.
- Key takeaway:
  - For fixed conditional distributions, \chi^2 is directly proportional to the sample size: larger samples yield larger \chi^2 values and smaller P\text{-values}.
  - A small P\text{-value} can occur with a large sample despite a weak association (as in Case C).
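The proportionality between \chi^2 and sample size can be seen by recomputing the three cases. This sketch assumes equal group sizes of 100, 200, and 10,000 per race (N = 200, 400, and 20,000), which is the reading consistent with the reported \chi^2 values. For df = 1 the right-tail probability can be written with the complementary error function, `math.erfc`; the helper names are illustrative.

```python
import math


def chi2_2x2(table):
    """Pearson chi-squared statistic for a 2x2 table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    total = 0.0
    for i in range(2):
        for j in range(2):
            f_e = row_totals[i] * col_totals[j] / n
            total += (table[i][j] - f_e) ** 2 / f_e
    return total


def p_value_df1(chi2):
    """Right-tail probability of the chi-squared distribution with df = 1."""
    return math.erfc(math.sqrt(chi2 / 2))


results = {}
for case, per_race in [("A", 100), ("B", 200), ("C", 10_000)]:
    # Rows: White, Black; columns: Yes, No (49%/51% and 51%/49%).
    yes_white = 49 * per_race // 100
    yes_black = 51 * per_race // 100
    table = [
        [yes_white, per_race - yes_white],
        [yes_black, per_race - yes_black],
    ]
    chi2 = chi2_2x2(table)
    results[case] = (chi2, p_value_df1(chi2))
    print(f"Case {case}: N = {2 * per_race}, chi2 = {chi2:.2f}, "
          f"P = {results[case][1]:.3f}")
```

Multiplying every count by the same factor multiplies \chi^2 by that factor, while the 2-percentage-point difference between the groups stays the same.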
Section 5.3: Chapter Summary
Main idea: A sample shows association when certain values of one variable tend to go with certain values of the other.
Approaches to study association between two categorical variables:
- Describe counts via percentage distributions (conditional distributions) across the response variable categories.
- Independence in the population means identical conditional distributions across the levels of the other variable.
- If they are not identical, the variables are statistically dependent.
Chi-squared test for independence:
- Compare observed frequencies f_0 to expected frequencies f_e under H_0:
  \chi^2 = \sum \frac{(f_0 - f_e)^2}{f_e}.
- Large-sample chi-squared distribution under H_0.
- Degrees of freedom: \text{df} = (r - 1)(c - 1).
- P-value: right-tail probability above the observed \chi^2.
Practical notes:
- The chi-squared test indicates association, not strength.
- The statistic is sensitive to sample size: large samples can produce significant results even for weak associations.
- For small samples, consider Fisher’s exact test.
Section 5.4: Practice Problems
Problem (1): Abortion opinion by gender (GSS-type data)
- Tasks:
  - (a) Construct a contingency table showing the conditional distribution of opinion on unrestricted abortion (Yes, No) by gender (Male, Female).
  - (b) Assess the plausibility of statistical independence between gender and opinion, with justification.
Problem (2): Educational level and marital status (Country A)
- Data: 423 respondents with counts across educational levels (Middle school or lower, High school, Bachelor’s, Master’s/PhD) and marital status (Never married, Married, Divorced or widowed).
- Table (given; counts in each row follow the order of the educational levels listed above):
  - Never married: 32, 53, 46, 17
  - Married: 21, 54, 48, 72
  - Divorced or widowed: 12, 13, 21, 34
- Task: Use \alpha = 0.05 and run the chi-squared test to determine whether educational level and marital status are independent.
General notes for solving practice problems:
- Build the r \times c contingency table.
- Compute row totals, column totals, and N.
- Compute f_e for each cell with f_e = \frac{(\text{row total}) (\text{column total})}{N}.
- Compute \chi^2 = \sum \frac{(f_0 - f_e)^2}{f_e} and \text{df} = (r-1)(c-1).
- Determine the P\text{-value} from the \chi^2 distribution with the appropriate \text{df}, and reject H_0 (independence) if P\text{-value} \le \alpha.
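The steps above can be collected into one reusable routine for checking hand calculations. This is a sketch under stated assumptions, not a reference implementation: `chi2_sf` and `chi2_independence` are hypothetical helper names, and `chi2_sf` uses standard closed-form expressions for the chi-squared right-tail probability that hold for integer df. In practice a library routine such as `scipy.stats.chi2_contingency` would normally do this work. The table entered is the one from Problem (2), with rows taken as marital status and columns as educational level.

```python
import math


def chi2_sf(x, df):
    """Right-tail probability of a chi-squared distribution, integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{k < df/2} (x/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # Odd df: normal-tail term plus a finite series in sqrt(x).
    chi = math.sqrt(x)
    density = math.exp(-half) / math.sqrt(2 * math.pi)
    term, total = chi, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    return math.erfc(chi / math.sqrt(2)) + 2 * density * total


def chi2_independence(table):
    """Return (chi2, df, p_value, expected) for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    chi2 = sum(
        (f_0 - f_e) ** 2 / f_e
        for obs_row, exp_row in zip(table, expected)
        for f_0, f_e in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, df, chi2_sf(chi2, df), expected


# Problem (2) counts; rows: marital status, columns: educational level.
table = [
    [32, 53, 46, 17],   # Never married
    [21, 54, 48, 72],   # Married
    [12, 13, 21, 34],   # Divorced or widowed
]
alpha = 0.05

chi2, df, p_value, expected = chi2_independence(table)
min_expected = min(min(row) for row in expected)

print(f"chi2 = {chi2:.2f}, df = {df}, P = {p_value:.4g}")
print(f"smallest f_e = {min_expected:.1f} (large-sample rule: f_e > 5)")
print("reject H_0" if p_value <= alpha else "do not reject H_0")
```

Returning the expected frequencies along with the statistic makes it easy to verify the large-sample condition before trusting the P-value.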