Chapter 5: Association between Categorical Variables (Chi-Squared Test)
Section 5.0: Introduction
Purpose: Introduce a method for detecting and describing associations between two categorical variables: the chi-squared test.
Key ideas:
- Terminology for categorical data analysis.
Statistical dependence vs independence: expressing presence or absence of association in a population.
Introduction of the chi-squared test as a significance test to determine if two categorical variables are statistically dependent or independent.
Core concepts:
- Contingency tables display counts for all combinations of two categorical variables.
Marginal distributions are the row totals and column totals.
Notation to remember:
- Observed frequency in a cell: f_o
- Expected frequency under independence: f_e
- Pearson chi-squared statistic:
\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e}
Big picture: If the population conditional distributions are the same across categories of the other variable, variables are independent; otherwise, dependent.
Section 5.1: Contingency Tables
What contingency tables show:
- Counts of subjects by all combinations of outcomes for the two variables.
They summarize the joint distribution and allow computation of conditional distributions.
Example: 2004 General Social Survey (GSS) data on gender and political party identification (Democrat, Independent, Republican).
- Table: 2 \times 3 contingency table with rows = Gender (Females, Males) and columns = Party ID (Democrat, Independent, Republican).
Data (sample sizes):
\begin{array}{l||ccc||c}
 & \text{Democrat} & \text{Independent} & \text{Republican} & \text{Total} \\ \hline
\text{Females} & 573 & 516 & 422 & 1511 \\
\text{Males} & 386 & 475 & 399 & 1260 \\ \hline
\text{Total} & 959 & 991 & 821 & 2771 \\
\end{array}
Conditional distributions (relative frequencies) by gender:
- Females:
\text{Democrat} : 573/1511 = 0.38,
\text{Independent} : 516/1511 = 0.34,
\text{Republican} : 422/1511 = 0.28
Males:
\text{Democrat} : 386/1260 = 0.31,
\text{Independent} : 475/1260 = 0.38,
\text{Republican} : 399/1260 = 0.32
Interpretation:
- The population question is whether party ID is associated with gender.
If the conditional distributions on party ID were identical for females and males, the variables would be statistically independent.
Marginal/conditional concepts:
- Marginal distributions: row totals and column totals.
Conditional distribution example:
P(\text{Democrat} \mid \text{Female}) = \frac{573}{1511} = 0.38
To assess independence, compare conditional distributions across levels of the other variable.
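The comparison of conditional distributions can be reproduced in a few lines of code. This is a minimal Python sketch using the GSS counts above (variable names are illustrative):

```python
# Conditional distributions of party ID given gender, from the 2x3 table.
table = {
    "Females": [573, 516, 422],   # Democrat, Independent, Republican
    "Males":   [386, 475, 399],
}
for gender, counts in table.items():
    total = sum(counts)                            # row (marginal) total
    props = [round(c / total, 2) for c in counts]  # conditional distribution
    print(gender, props)
# Females [0.38, 0.34, 0.28]
# Males [0.31, 0.38, 0.32]
```

The two proportion lists differ in the sample; whether that reflects dependence in the population is the question the chi-squared test in Section 5.2 addresses.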
Section 5.1 (continued): Statistical independence and dependence
Definitions:
- Statistically independent: The population conditional distributions on one variable are identical at each category of the other variable.
In other words, the probability of any particular category of one variable is the same for all levels of the other variable.
Statistically dependent: The conditional distributions are not identical.
Illustrative Ex.2 (independence):
- Table (Ethnic Group \times Party ID) with percentages indicating independence:
\begin{array}{l||ccc||c}
 & \text{Democrat} & \text{Independent} & \text{Republican} & \text{Total} \\ \hline
\text{White} & 440\ (44\%) & 140\ (14\%) & 420\ (42\%) & 1000\ (100\%) \\
\text{Black} & 44\ (44\%) & 14\ (14\%) & 42\ (42\%) & 100\ (100\%) \\
\text{Hispanic} & 110\ (44\%) & 35\ (14\%) & 105\ (42\%) & 250\ (100\%) \\ \hline
\text{Total} & 594 & 189 & 567 & 1350 \\
\end{array}
Interpretation: The party-ID percentages are identical across ethnic groups (44% Democrat, 14% Independent, 42% Republican in every row), so party ID and ethnicity are independent in this table.
Ex.3 (dependent possibility):
- Belief in life after death is about 80% across categories of gender and race (appears independent of those variables).
- However, belief differs by religion: Catholics/Protestants ~80%, Jews and those with no religion ~40–50%.
- Conclusion: Belief in life after death appears independent of gender and race but dependent on religion.
Section 5.2: Chi-Squared Test of Independence
Central question: If we have a sample, can we infer independence in the population?
Null and alternative hypotheses:
- H0: The variables are statistically independent.
Ha: The variables are statistically dependent.
Assumptions for the test:
- Randomization (random sample).
Large enough sample so that expected frequencies are adequate (f_e > 5 in each cell).
Notation:
- Observed frequency in a cell: f_o
- Expected frequency under H0 (independence):
f_e = \frac{(\text{row total}) \times (\text{column total})}{N}
Test statistic (Pearson chi-squared):
\chi^2 = \sum_{\text{cells}} \frac{(f_o - f_e)^2}{f_e}
Interpretation:- Under H0, for large samples,
\chi^2 follows a chi-squared distribution with
df = (r - 1)(c - 1)
where r = number of rows and c = number of columns.
Larger values of
\chi^2 provide stronger evidence against H0.
P-value is the right-tail probability:
\text{P-value} = P(\chi^2 \ge \chi^2_{\text{obs}})
Practical workflow (five standard steps):
- Assumptions: categorical data, randomization, large sample (f_e > 5).
Hypotheses: H0 and Ha as above.
Test statistic: compute \chi^2 using observed and expected frequencies.
P-value: obtain from chi-squared distribution with df = (r-1)(c-1).
Conclusion: reject H0 if \text{P-value} \le \alpha, otherwise do not reject.
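The five steps above can be collected into one small function. This is an illustrative sketch, not library code: `chi2_test` is a name chosen here, and the closed-form tail probability it uses is exact only when df is even (for odd df, use a chi-squared table or a statistics library):

```python
import math

def chi2_test(table):
    """Pearson chi-squared test of independence for a two-way table of counts.

    Returns (chi2, df, p_value). The P-value uses the closed-form
    chi-squared survival function, which is exact when df is even.
    """
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    n = sum(row_tot)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, f_o in enumerate(row):
            f_e = row_tot[i] * col_tot[j] / n   # expected count under H0
            chi2 += (f_o - f_e) ** 2 / f_e
    df = (len(row_tot) - 1) * (len(col_tot) - 1)
    if df % 2:
        raise NotImplementedError("closed form below requires even df")
    # P(X >= chi2) for even df: exp(-x) * sum_{k < df/2} x^k / k!, with x = chi2/2
    x = chi2 / 2
    p = math.exp(-x) * sum(x ** k / math.factorial(k) for k in range(df // 2))
    return chi2, df, p
```

Applied to the gender × party table, `chi2_test([[573, 516, 422], [386, 475, 399]])` gives χ² ≈ 16.2 with df = 2 and a P-value of about 0.0003.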
Section 5.2 (continued): Example 1 – compute the chi-squared statistic
Data (observed frequencies, from the Ex.1 table):
- Females: Dem 573, Ind 516, Rep 422
Males: Dem 386, Ind 475, Rep 399
Totals: Females 1511, Males 1260, Grand total 2771
Expected frequencies under independence (given by the provided calculations):
\begin{array}{l||ccc}
 & \text{Democrat} & \text{Independent} & \text{Republican} \\ \hline
\text{Females} & 522.9 & 540.4 & 447.7 \\
\text{Males} & 436.1 & 450.6 & 373.3 \\
\end{array}
Computed chi-squared statistic:
\chi^2 = \frac{(573-522.9)^2}{522.9} + \frac{(516-540.4)^2}{540.4} + \frac{(422-447.7)^2}{447.7} + \frac{(386-436.1)^2}{436.1} + \frac{(475-450.6)^2}{450.6} + \frac{(399-373.3)^2}{373.3} \approx 16.2
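As a sanity check, the expected counts and the statistic can be recomputed directly from the observed table (a minimal Python sketch; recomputing from these data gives a statistic of about 16.2):

```python
# Recompute expected counts and the Pearson statistic for gender x party ID.
observed = [[573, 516, 422],    # Females: Dem, Ind, Rep
            [386, 475, 399]]    # Males
row_tot = [sum(r) for r in observed]            # [1511, 1260]
col_tot = [sum(c) for c in zip(*observed)]      # [959, 991, 821]
n = sum(row_tot)                                # 2771

# f_e = (row total)(column total)/N for each cell
expected = [[r * c / n for c in col_tot] for r in row_tot]
chi2 = sum((o - e) ** 2 / e
           for o_row, e_row in zip(observed, expected)
           for o, e in zip(o_row, e_row))
print([[round(e, 1) for e in row] for row in expected])
# [[522.9, 540.4, 447.7], [436.1, 450.6, 373.3]]
print(round(chi2, 1))  # 16.2
```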
Section 5.2 (continued): Chi-squared distribution, P-values, and decision rule
Properties of the chi-squared distribution:
- Concentrated on the positive side of the real line.
Skewed to the right.
Shape depends on degrees of freedom:
df = (r - 1)(c - 1)
Larger \chi^2 implies stronger evidence against H0.
P-value interpretation:
- The P-value is the probability, under H0, of observing a chi-squared value as extreme or more extreme than the observed value.
Decision rule: Reject H0 if the P-value is less than or equal to the chosen significance level \alpha.
Important nuance:
- A large \chi^2 suggests association, but not necessarily a strong association in the population.
The statistic is sensitive to sample size: larger samples can yield large \chi^2 even for weak associations.
Related topic: Fisher’s exact test for small samples when some expected counts are low.
Section 5.3: Example 5 – Run the chi-squared test (Example 1) with \alpha = 0.01
Standard five-step procedure applied:
- Assumptions:
Data are categorical.
Randomization is assumed.
Large sample: each cell expected count > 5.
Hypotheses:
H0: party ID and gender are statistically independent.
Ha: party ID and gender are statistically dependent.
Test statistic: from Ex.4,
\chi^2 = 16.2 and
df = (2 - 1)(3 - 1) = 2
P-value: reported as
\text{P-value} = 0.0003 (computed via software; could also be approximated using a chi-squared table).
Conclusion: Since \text{P-value} < \alpha = 0.01, reject H0. The data suggest that gender and party ID are statistically dependent.
Important interpretation:
- A large \chi^2 indicates association, not necessarily a strong one.
Conditional probabilities can be inspected to assess strength of association.
Section 5.3 (continued): Interpreting the strength of association with conditional distributions
Observations from Ex.6 (Case A, B, C): same conditional probabilities but different sample sizes lead to different \chi^2 values and P-values.
Cases:
- Case A: White vs Black, 100 subjects per group (200 total); the data yield \chi^2 = 0.08, P-value 0.78.
- Case B: the same percentage pattern with 200 per group (400 total); \chi^2 = 0.16, P-value 0.69.
- Case C: the same percentage pattern with 10,000 per group (20,000 total); \chi^2 = 8.0, P-value 0.005.
Key takeaway:
- For a fixed pattern of percentages (i.e., fixed conditional distributions), increasing sample size increases \chi^2 and decreases the P-value.
Therefore, a small P-value can arise from a weak association if the sample size is large.
The strength of association should be assessed from the conditional distributions themselves (e.g., differences in conditional proportions, or summary measures of association), not from the P-value alone.
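The sample-size effect can be demonstrated directly: multiplying every cell of a table by a constant k leaves all conditional distributions unchanged but multiplies \chi^2 by k. The 2×2 table below is hypothetical (the original Case A/B/C tables are not reproduced in these notes); it is chosen so that scaling it reproduces the reported pattern of 0.08, 0.16, and 8.0:

```python
def pearson_chi2(table):
    """Pearson chi-squared statistic for a two-way table of counts."""
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    n = sum(row_tot)
    return sum((table[i][j] - row_tot[i] * col_tot[j] / n) ** 2
               / (row_tot[i] * col_tot[j] / n)
               for i in range(len(row_tot)) for j in range(len(col_tot)))

# Hypothetical data: two groups of 100 whose responses differ by only 2%.
base = [[49, 51],
        [51, 49]]
for k in (1, 2, 100):   # scale every count by k; the percentages never change
    scaled = [[k * cell for cell in row] for row in base]
    print(k, round(pearson_chi2(scaled), 2))
# 1 0.08
# 2 0.16
# 100 8.0
```

The same weak 49% vs 51% split is "not significant" at n = 200 but highly significant at n = 20,000, which is exactly why the P-value alone does not measure strength of association.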
Section 5.4: Chapter Summary (key takeaways)
A sample shows association between two variables if certain values of one variable tend to occur with certain values of the other.
Two main approaches:
- Describe counts in contingency tables via percentage distributions (conditional distributions) across categories of the response variable to assess independence.
Use the chi-squared test to test H0: independence between the two categorical variables.
Chi-squared test specifics:
- Pearson chi-squared statistic:
\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e}
Under H0 and large samples, the statistic follows a chi-squared distribution with
df = (r - 1)(c - 1)
The P-value is the right-tail probability above the observed \chi^2 value.
Important caveats:
- A large \chi^2 indicates association but not necessarily a strong association.
\chi^2 grows with sample size; very large samples can yield small P-values even for modest associations.
For small samples, consider Fisher’s exact test.
Section 5.4: Practice Problems
Problem 1 (GSS abortion opinion vs gender):
- Data: Approximately 40% of both males and females believe abortion should be legal for any reason.
Tasks:
(a) Construct a contingency table showing the conditional distribution on whether unrestricted abortion should be legal (Yes, No) by gender.
(b) Based on these results, does statistical independence seem plausible between gender and opinion about unrestricted abortion? Why?
Problem 2 (Country A data on education vs marital status):
- Data: Random sample of 423 people with a cross-tabulation by educational level and marital status:
\begin{array}{l||cccc||c}
 & \text{Middle school or lower} & \text{High school} & \text{Bachelor’s} & \text{Master’s, PhD or higher} & \text{Total} \\ \hline
\text{Never married} & 32 & 53 & 46 & 17 & 148 \\
\text{Married} & 21 & 54 & 48 & 72 & 195 \\
\text{Divorced or widowed} & 12 & 13 & 21 & 34 & 80 \\ \hline
\text{Total} & 65 & 120 & 115 & 123 & 423 \\
\end{array}
Task: At \alpha = 0.05, run the chi-squared test to determine if educational level and marital status are independent.
Note: In practice, you would compute the observed frequencies, the row/column totals, the expected frequencies under H0, the statistic, and the P-value to reach a conclusion.
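That computation for Problem 2 can be sketched in pure Python (the closed-form chi-squared tail probability below is exact here because df = 6 is even):

```python
import math

observed = [
    [32, 53, 46, 17],   # Never married
    [21, 54, 48, 72],   # Married
    [12, 13, 21, 34],   # Divorced or widowed
]
row_tot = [sum(r) for r in observed]
col_tot = [sum(c) for c in zip(*observed)]
n = sum(row_tot)                                  # 423

chi2 = sum((o - row_tot[i] * col_tot[j] / n) ** 2 / (row_tot[i] * col_tot[j] / n)
           for i, row in enumerate(observed) for j, o in enumerate(row))
df = (len(row_tot) - 1) * (len(col_tot) - 1)      # (3 - 1)(4 - 1) = 6
# P(X >= chi2) for even df: exp(-x) * sum_{k < df/2} x^k / k!, with x = chi2/2
x = chi2 / 2
p = math.exp(-x) * sum(x ** k / math.factorial(k) for k in range(df // 2))
print(round(chi2, 1), df)   # roughly 39.7 and 6; p falls far below alpha = 0.05
```

Since the P-value is far below \alpha = 0.05, H0 would be rejected: these data suggest educational level and marital status are statistically dependent.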
Key formulas to remember
Observed vs. expected:
f_e = \frac{(\text{row total})(\text{column total})}{N}
Pearson chi-squared statistic:
\chi^2 = \sum_{\text{cells}} \frac{(f_o - f_e)^2}{f_e}
Degrees of freedom:
df = (r - 1)(c - 1)
P-value interpretation:
\text{P-value} = P(\chi^2 \ge \chi^2_{\text{obs}})
Decision rule:
- Reject H0 if \text{P-value} \le \alpha; otherwise fail to reject.
Important caveats:
- Chi-squared is sensitive to sample size: larger samples can inflate evidence even for weak associations.
Not a measure of strength by itself; inspect conditional distributions or use measures of association where appropriate.