Chapter 5: Association between Categorical Variables (Chi-Squared Test)

Section 5.0: Introduction

  • Purpose: Introduce a method for detecting and describing associations between two categorical variables: the chi-squared test.

  • Key ideas:

    • Terminology for categorical data analysis.

    • Statistical dependence vs independence: expressing presence or absence of association in a population.

    • Introduction of the chi-squared test as a significance test to determine if two categorical variables are statistically dependent or independent.

  • Core concepts:

    • Contingency tables display counts for all combinations of two categorical variables.

    • Marginal distributions are the row totals and column totals.

  • Notation to remember:

    • Observed frequency in a cell:

      f_o

    • Expected frequency under independence:

      f_e

    • Pearson chi-squared statistic:

      \chi^2 = \sum \frac{(f_o - f_e)^2}{f_e}

  • Big picture: If the population conditional distributions are the same across categories of the other variable, variables are independent; otherwise, dependent.

Section 5.1: Contingency Tables

  • What contingency tables show:

    • Counts of subjects by all combinations of outcomes for the two variables.

    • They summarize the joint distribution and allow computation of conditional distributions.

  • Example: 2004 General Social Survey (GSS) data on gender and political party identification (Democrat, Independent, Republican).

    • Table: 2 \times 3 contingency table with rows = Gender (Females, Males) and columns = Party ID (Democrat, Independent, Republican).

    • Data (sample sizes):

      \begin{array}{l||ccc||c}
      & \text{Democrat} & \text{Independent} & \text{Republican} & \text{Total} \\ \hline
      \text{Females} & 573 & 516 & 422 & 1511 \\
      \text{Males} & 386 & 475 & 399 & 1260 \\ \hline
      \text{Total} & 959 & 991 & 821 & 2771
      \end{array}

  • Conditional distributions (relative frequencies) by gender:

    • Females:

      \text{Democrat} : 573/1511 = 0.38,

      \text{Independent} : 516/1511 = 0.34,

      \text{Republican} : 422/1511 = 0.28

    • Males:

      \text{Democrat} : 386/1260 = 0.31,

      \text{Independent} : 475/1260 = 0.38,

      \text{Republican} : 399/1260 = 0.32
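The conditional distributions above can be reproduced by dividing each row of the contingency table by its row total. A minimal sketch, assuming NumPy is available:

```python
import numpy as np

# GSS 2004 counts: rows = Females, Males; cols = Democrat, Independent, Republican
counts = np.array([[573, 516, 422],
                   [386, 475, 399]], dtype=float)

# Conditional distribution of party ID given gender: divide each row by its total
conditional = counts / counts.sum(axis=1, keepdims=True)

print(conditional.round(2))  # Females ~ [0.38 0.34 0.28], Males ~ [0.31 0.38 0.32]
```

Each row of `conditional` sums to 1, since it is a probability distribution over party ID within one gender.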

  • Interpretation:

    • The population question is whether party ID is associated with gender.

    • If the conditional distributions on party ID were identical for females and males, the variables would be statistically independent.

  • Marginal/conditional concepts:

    • Marginal distributions: row totals and column totals.

    • Conditional distribution example:

      P(\text{Democrat} \mid \text{Female}) = \frac{573}{1511} = 0.38

    • To assess independence, compare conditional distributions across levels of the other variable.

Section 5.1 (continued): Statistical independence and dependence

  • Definitions:

    • Statistically independent: The population conditional distributions on one variable are identical at each category of the other variable.

    • In other words, the probability of any particular category of one variable is the same for all levels of the other variable.

    • Statistically dependent: The conditional distributions are not identical.

  • Illustrative Ex.2 (independence):

    • Table (Ethnic Group \times Party ID) with percentages indicating independence:

    \begin{array}{l||ccc||c}
    & \text{Democrat} & \text{Independent} & \text{Republican} & \text{Total} \\ \hline
    \text{White} & 440\ (44\%) & 140\ (14\%) & 420\ (42\%) & 1000\ (100\%) \\
    \text{Black} & 44\ (44\%) & 14\ (14\%) & 42\ (42\%) & 100\ (100\%) \\
    \text{Hispanic} & 110\ (44\%) & 35\ (14\%) & 105\ (42\%) & 250\ (100\%) \\ \hline
    \text{Total} & 594 & 189 & 567 & 1350
    \end{array}

    • Interpretation: Every ethnic group has the same conditional distribution over party ID (44\% Democrat, 14\% Independent, 42\% Republican), indicating independence of party ID and ethnicity.
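Because every row of this table has exactly the 44/14/42 split, the expected counts under independence equal the observed counts cell by cell, and the chi-squared statistic is exactly zero. A quick check, assuming NumPy:

```python
import numpy as np

# Ex.2 counts: rows = White, Black, Hispanic; cols = Dem, Ind, Rep
counts = np.array([[440, 140, 420],
                   [ 44,  14,  42],
                   [110,  35, 105]], dtype=float)

# Expected counts under independence: (row total)(column total) / N
expected = np.outer(counts.sum(axis=1), counts.sum(axis=0)) / counts.sum()

# Pearson chi-squared: zero here, since observed equals expected in every cell
chi2 = ((counts - expected) ** 2 / expected).sum()
print(chi2)  # 0.0
```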

  • Ex.3 (dependence):

    • Belief in life after death is about 80% for both genders and across racial groups (it appears independent of gender and race).

    • However, belief differs by religion: Catholics and Protestants ~80%, Jews and those with no religion ~40–50%.

    • Conclusion: Belief in life after death appears independent of gender and race but dependent on religion.

Section 5.2: Chi-Squared Test of Independence

  • Central question: If we have a sample, can we infer independence in the population?

  • Null and alternative hypotheses:

    • H0: The variables are statistically independent.

    • Ha: The variables are statistically dependent.

  • Assumptions for the test:

    • Randomization (random sample).

    • Large enough sample so that expected frequencies are adequate (f_e > 5 in each cell).

  • Notation:

    • Observed frequency in a cell:

      f_o

    • Expected frequency under H0 (independence):

      f_e = \frac{\text{(row total)} (\text{column total})}{N}

    • Test statistic (Pearson chi-squared):

      \chi^2 = \sum_{\text{cells}} \frac{(f_o - f_e)^2}{f_e}

  • Interpretation:

    • Under H0, for large samples, \chi^2 follows a chi-squared distribution with

      df = (r - 1)(c - 1)

      where r = number of rows and c = number of columns.

    • Larger values of \chi^2 provide stronger evidence against H0.

    • P-value is the right-tail probability:

      \text{P-value} = P(\chi^2 \ge \chi^2_{\text{obs}})

  • Practical workflow (five standard steps):

    • Assumptions: categorical data, randomization, large sample (f_e > 5).

    • Hypotheses: H0 and Ha as above.

    • Test statistic: compute \chi^2 using observed and expected frequencies.

    • P-value: obtain from chi-squared distribution with df = (r-1)(c-1).

    • Conclusion: reject H0 if \text{P-value} \le \alpha, otherwise do not reject.
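The five steps above can be carried out in one call with SciPy; a sketch using the gender × party ID counts from Section 5.1 (`chi2_contingency` computes the expected counts, the statistic, df, and the P-value):

```python
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[573, 516, 422],
                     [386, 475, 399]])

# Returns the statistic, P-value, degrees of freedom, and expected counts
chi2, p, df, expected = chi2_contingency(observed)

print(round(chi2, 1), df)  # 16.2 2  (the text's 16.3 rounds each term before summing)
alpha = 0.01
print("reject H0" if p <= alpha else "do not reject H0")
```

Note that SciPy applies the Yates continuity correction only to 2×2 tables (df = 1), so it does not affect this 2×3 example.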

Section 5.2 (continued): Example 1 – compute the chi-squared statistic

  • Data (observed frequencies) from the Ex.1 table:

    • Females: Dem 573, Ind 516, Rep 422

    • Males: Dem 386, Ind 475, Rep 399

    • Totals: Females 1511, Males 1260, Grand total 2771

  • Expected frequencies under independence (given by the provided calculations):

    \begin{array}{l||ccc}
    & \text{Democrat} & \text{Independent} & \text{Republican} \\ \hline
    \text{F} & 522.9 & 540.4 & 447.7 \\
    \text{M} & 436.1 & 450.6 & 373.3
    \end{array}

  • Computed chi-squared statistic:

    \chi^2 = \frac{(573-522.9)^2}{522.9} + \frac{(516-540.4)^2}{540.4} + \frac{(422-447.7)^2}{447.7} + \frac{(386-436.1)^2}{436.1} + \frac{(475-450.6)^2}{450.6} + \frac{(399-373.3)^2}{373.3} \approx 16.3
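The hand computation can be checked from scratch: build the expected counts from the margins, then sum the cell contributions. A sketch assuming NumPy:

```python
import numpy as np

observed = np.array([[573, 516, 422],
                     [386, 475, 399]], dtype=float)

row_totals = observed.sum(axis=1)   # [1511, 1260]
col_totals = observed.sum(axis=0)   # [959, 991, 821]
N = observed.sum()                  # 2771

# f_e = (row total)(column total) / N for every cell
expected = np.outer(row_totals, col_totals) / N
print(expected.round(1))  # [[522.9 540.4 447.7], [436.1 450.6 373.3]]

chi2 = ((observed - expected) ** 2 / expected).sum()
print(round(chi2, 1))  # 16.2 (16.3 when each term is rounded before summing)
```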

Section 5.2 (continued): Chi-squared distribution, P-values, and decision rule

  • Properties of the chi-squared distribution:

    • Takes only nonnegative values (concentrated on the positive half-line).

    • Skewed to the right.

    • Shape depends on degrees of freedom:

      df = (r - 1)(c - 1)

    • Larger \chi^2 implies stronger evidence against H0.

  • P-value interpretation:

    • The P-value is the probability, under H0, of observing a chi-squared value as extreme or more extreme than the observed value.

    • Decision rule: Reject H0 if the P-value is less than or equal to the chosen significance level \alpha.

  • Important nuance:

    • A large \chi^2 suggests association, but not necessarily a strong association in the population.

    • The statistic is sensitive to sample size: larger samples can yield large \chi^2 even for weak associations.

  • Related topic: Fisher’s exact test for small samples when some expected counts are low.
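For a small 2×2 table where expected counts fall below 5, SciPy also provides Fisher's exact test. The counts below are hypothetical, just to show the call:

```python
from scipy.stats import fisher_exact

# Hypothetical small 2x2 table: rows = groups, cols = outcome yes/no
table = [[8, 2],
         [1, 5]]

# Exact P-value from the hypergeometric distribution;
# no large-sample chi-squared approximation is needed
odds_ratio, p = fisher_exact(table)
print(p)
```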

Section 5.3: Example 5 – Run the chi-squared test (Example 1) with \alpha = 0.01

  • Standard five-step procedure applied:

    • Assumptions:

      • Data are categorical.

      • Randomization is assumed.

      • Large sample: each cell expected count > 5.

    • Hypotheses:

      • H0: party ID and gender are statistically independent.

      • Ha: party ID and gender are statistically dependent.

    • Test statistic: from Ex.4,

      \chi^2 = 16.3 and df = (2 - 1)(3 - 1) = 2

    • P-value: reported as

      \text{P-value} = 0.0003 (computed via software; it could also be approximated from a chi-squared table).

    • Conclusion: Since \text{P-value} < \alpha = 0.01, reject H0. The data suggest that gender and party ID are statistically dependent.

  • Important interpretation:

    • A large \chi^2 indicates association, not necessarily a strong one.

    • Conditional probabilities can be inspected to assess strength of association.

Section 5.3 (continued): Interpreting the strength of association with conditional distributions

  • Observations from Ex.6 (Case A, B, C): same conditional probabilities but different sample sizes lead to different \chi^2 values and P-values.

  • Cases:

    • Case A: White vs Black, 100 subjects per race (200 total); the data yield \chi^2 = 0.08, P-value 0.78.

    • Case B: same percentages with 200 subjects per race (400 total); \chi^2 = 0.16, P-value 0.69.

    • Case C: same percentages with 10,000 subjects per race (20,000 total); \chi^2 = 8.0, P-value 0.005.

  • Key takeaway:

    • For a fixed pattern of percentages (i.e., fixed conditional distributions), increasing the sample size increases \chi^2 and decreases the P-value.

    • Therefore, a small P-value can arise from a weak association if the sample size is large.

    • The strength of association should be assessed via actual conditional probabilities (e.g., differences in marginals or by measures of association) rather than solely by the P-value.
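The three cases can be mimicked by scaling a single table. Assuming a hypothetical 2×2 pattern of 49%/51% vs 51%/49% responses (a reconstruction consistent with the reported values, not the original data), multiplying every count by k multiplies \chi^2 by k while the percentages stay fixed:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 pattern: 49% vs 51% "yes" responses in the two races
base = np.array([[49, 51],
                 [51, 49]])

for k in (1, 2, 100):  # Cases A, B, C: 200, 400, 20,000 subjects
    # correction=False skips the Yates continuity correction (df = 1 here),
    # so the plain Pearson statistic is reported
    chi2, p, df, _ = chi2_contingency(base * k, correction=False)
    print(k, round(chi2, 2), round(p, 3))
# 1 0.08 0.777 / 2 0.16 0.689 / 100 8.0 0.005 -- matches Cases A, B, C
```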

Section 5.4: Chapter Summary (key takeaways)

  • A sample shows association between two variables if certain values of one variable tend to occur with certain values of the other.

  • Two main approaches:

    • Describe counts in contingency tables via percentage distributions (conditional distributions) across categories of the response variable to assess independence.

    • Use the chi-squared test to test H0: independence between the two categorical variables.

  • Chi-squared test specifics:

    • Pearson chi-squared statistic:

      \chi^2 = \sum \frac{(f_o - f_e)^2}{f_e}

    • Under H0 and large samples, the statistic follows a chi-squared distribution with

      df = (r - 1)(c - 1)

    • The P-value is the right-tail probability above the observed \chi^2 value.

  • Important caveats:

    • A large \chi^2 indicates association but not necessarily a strong association.

    • \chi^2 grows with sample size; very large samples can yield small P-values even for modest associations.

    • For small samples, consider Fisher’s exact test.

Section 5.4: Practice Problems

  • Problem 1 (GSS abortion opinion vs gender):

    • Data: Approximately 40% of both males and females believe abortion should be legal for any reason.

    • Tasks:

    • (a) Construct a contingency table showing the conditional distribution on whether unrestricted abortion should be legal (Yes, No) by gender.

    • (b) Based on these results, does statistical independence seem plausible between gender and opinion about unrestricted abortion? Why?

  • Problem 2 (Country A data on education vs marital status):

    • Data: Random sample of 423 people with a cross-tabulation by educational level and marital status:

    \begin{array}{l||cccc||c}
    & \text{Middle school or lower} & \text{High school} & \text{Bachelor’s} & \text{Master’s, PhD or higher} & \text{Total} \\ \hline
    \text{Never married} & 32 & 53 & 46 & 17 & 148 \\
    \text{Married} & 21 & 54 & 48 & 72 & 195 \\
    \text{Divorced or widowed} & 12 & 13 & 21 & 34 & 80 \\ \hline
    \text{Total} & 65 & 120 & 115 & 123 & 423
    \end{array}

    • Task: At \alpha = 0.05, run the chi-squared test to determine if educational level and marital status are independent.

  • Note: In practice, you would compute the observed frequencies, the row/column totals, the expected frequencies under H0, the statistic, and the P-value to reach a conclusion.
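For Problem 2, the SciPy call sketched earlier does all of that bookkeeping; the code below just produces the statistic, df, and P-value to compare against \alpha = 0.05, leaving the conclusion to the reader:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: Never married, Married, Divorced or widowed
# Cols: Middle school or lower, High school, Bachelor's, Master's/PhD or higher
observed = np.array([[32, 53, 46, 17],
                     [21, 54, 48, 72],
                     [12, 13, 21, 34]])

chi2, p, df, expected = chi2_contingency(observed)
print(df)                    # (3-1)(4-1) = 6
print((expected > 5).all())  # True: the large-sample assumption holds
print(round(chi2, 1), p)     # compare p to alpha = 0.05
```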

Key formulas to remember

  • Expected frequency under independence:

    f_e = \frac{(\text{row total})(\text{column total})}{N}

  • Pearson chi-squared statistic:

    \chi^2 = \sum_{\text{cells}} \frac{(f_o - f_e)^2}{f_e}

  • Degrees of freedom:

    df = (r - 1)(c - 1)

  • P-value interpretation:

    \text{P-value} = P(\chi^2 \ge \chi^2_{\text{obs}})

  • Decision rule:

    • Reject H0 if \text{P-value} \le \alpha; otherwise fail to reject.

  • Important caveats:

    • Chi-squared is sensitive to sample size: larger samples can inflate evidence even for weak associations.

    • Not a measure of strength by itself; inspect conditional distributions or use measures of association where appropriate.