10. Chi-Squared Tests

Types of Questions Addressed

  • Examining relationships between two categorical variables:
    • Do Canadian men and women differ in educational attainment?
    • Are gender and educational attainment independent?
    • Do immigrants differ from Canadian-born adults in social media use?
    • Are immigrant status and social media use independent?
    • Do Gen Z and older Canadians differ in the time they spend following the news?
    • Are generation and time spent following the news independent?

Week 10 Key Points

  • Conduct chi-squared hypothesis tests for pairs of categorical variables:
    • Perform tests by hand and using Stata.
    • Interpret results meaningfully.
    • Review descriptive statistics (crosstabs) for pairs of categorical variables, essential for Assignment 2.

Steps for All Hypothesis Tests

  1. Check if assumptions for the test are met.
  2. Choose the significance level (𝛼), typically set at 0.05.
  3. State the hypotheses:
    • Null hypothesis (H0): distributions of the variables across populations are the same (variables are independent).
    • Alternative hypothesis (Ha): distributions are not the same (variables are not independent).
  4. Compute the chi-squared statistic using the formula: [ \chi^2 = \sum \frac{(O - E)^2}{E} ]
    • Where O = observed frequency, E = expected frequency.
  5. Find the associated p-value and compare it to 𝛼:
    • If p < 𝛼, reject H0.
    • If p ≥ 𝛼, do not reject H0.
  6. Interpret results in plain English.

Assumptions for Chi-Squared Test

  • The following must be satisfied:
    1. Simple Random Sample (SRS).
    2. Expected count in each cell must be at least 5 (E ≥ 5).
  • Under these conditions, the sampling distribution of the test statistic under the null hypothesis is approximately chi-squared with degrees of freedom (d.f.) = (r-1)(c-1), where r is the number of rows and c the number of columns in the crosstab.
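The expected-count condition and the degrees of freedom can be checked with a few lines of arithmetic. The following is a minimal Python sketch for readers who want to verify the numbers outside Stata; the function name expected_counts and the 2×3 table of counts are hypothetical, chosen only for illustration. The SRS assumption cannot be checked from the table itself; it depends on how the data were collected.

```python
def expected_counts(observed):
    """Expected counts under independence: (row total * column total) / N."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


# Hypothetical 2x3 crosstab (rows: two groups; columns: three categories).
observed = [[20, 30, 50],
            [10, 40, 50]]

expected = expected_counts(observed)
min_expected = min(min(row) for row in expected)

# d.f. = (r - 1)(c - 1)
df = (len(observed) - 1) * (len(observed[0]) - 1)

print("Expected counts:", expected)
print("Smallest expected count:", min_expected)
print("All expected counts >= 5:", min_expected >= 5)
print("Degrees of freedom:", df)
```

If any expected count fell below 5, the chi-squared approximation would be doubtful for that table.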

Understanding Hypotheses

  • H0: The distributions across tested populations are the same (independent).
  • Ha: The distributions are not the same (not independent).

Test Statistic and P-Value Calculation

  • Chi-squared statistic is calculated as: [ \chi^2 = \sum \frac{(O - E)^2}{E} ]
    • O = observed frequency in each cell.
    • E is the expected frequency calculated as:
      [ E = \frac{(row\ total) \times (column\ total)}{N} ]
  • Under H0, the test statistic approximately follows a chi-squared distribution with d.f. = (r-1)(c-1).
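The by-hand calculation can be sketched in Python as follows. The 2×2 counts are hypothetical and the function name chi_squared_statistic is an illustrative choice. For a 2×2 table d.f. = 1, and in that special case the right-tail probability has a closed form in terms of the complementary error function, which is what this sketch relies on; it does not generalize to larger tables as written.

```python
import math


def chi_squared_statistic(observed):
    """Sum of (O - E)^2 / E over all cells, with E = row total * column total / N."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n
            stat += (o - e) ** 2 / e
    return stat


# Hypothetical 2x2 crosstab: row totals 40 and 60, column totals 50 and 50.
observed = [[30, 10],
            [20, 40]]

stat = chi_squared_statistic(observed)

# With d.f. = 1 only: P(chi-squared > x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(stat / 2))

alpha = 0.05
print("Chi-squared statistic:", round(stat, 2))
print("Reject H0 at alpha = 0.05:", p_value < alpha)
```

Here the expected counts are 20, 20, 30, and 30, so the statistic is 5 + 5 + 3.33 + 3.33, about 16.67, and H0 is rejected at the 0.05 level.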

Concluding and Interpreting Results

  • If p < 𝛼, reject the null hypothesis (H0): there is evidence of a significant association/difference.
  • If p ≥ 𝛼, do not reject H0: there is insufficient evidence of a significant association/difference.

Important Notes on Chi-Squared Distribution

  • 1 parameter: degrees of freedom.
  • Shape changes with the degrees of freedom: the distribution takes only non-negative values, is right-skewed, and approaches a normal shape as d.f. increases.
  • Can be expressed as the sum of the squares of k independent standard normal variables, where k is the degrees of freedom:
    [ \chi^2 = Z_1^2 + Z_2^2 + \dots + Z_k^2 ]
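The sum-of-squared-normals construction can be illustrated by simulation. The sketch below is a minimal Python example with arbitrary choices (k = 3, 20,000 draws, a fixed seed). It checks two properties: every draw is non-negative, and the sample mean is close to k. The second relies on a standard fact not stated in these notes, namely that the mean of a chi-squared distribution equals its degrees of freedom.

```python
import random

random.seed(0)  # fixed seed so the run is reproducible

k = 3            # degrees of freedom (arbitrary choice for illustration)
n_draws = 20000  # number of simulated chi-squared values

# Each draw is the sum of k squared independent standard normal values.
draws = [sum(random.gauss(0, 1) ** 2 for _ in range(k)) for _ in range(n_draws)]

sample_mean = sum(draws) / n_draws

print("All draws non-negative:", min(draws) >= 0)
print("Sample mean (should be close to k):", round(sample_mean, 2))
```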
  • Karl Pearson proposed its use for independence tests and goodness-of-fit in 1900.

Textbook Section References

  • Goodness-of-fit (11.2), independence (11.3), and homogeneity (11.4) tests all use the same chi-squared statistic and are essentially the same, differing mainly in terminology and in how the hypotheses are framed.

Understanding Expected Counts

  • Expected counts are derived from the product of row and column totals divided by N (total observations).
    • Example: with 40 men and 60 women (N = 100), the expected count of men in a given food-preference category is 40 × (that category's column total) / 100.
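As a worked instance with hypothetical numbers, take 40 men and 60 women (N = 100) and suppose 30 respondents in total chose a particular food. The column total of 30 is an assumption made only for illustration. The expected counts for that food are then:

```latex
\[
E_{\text{men}} = \frac{40 \times 30}{100} = 12,
\qquad
E_{\text{women}} = \frac{60 \times 30}{100} = 18
\]
```

Both values are at least 5, so these two cells would satisfy the expected-count assumption.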

Plain English Interpretation

  • If the null is rejected, we can conclude:
    • There is a significant association between the variables.
    • The distribution of one variable significantly differs across the levels of the other.
  • If the null is not rejected, we can state:
    • No significant association between the variables.
    • The distribution of one variable is not significantly different across the levels of the other.

Stata Commands for Chi-Squared Tests

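  • For aggregate counts (cell counts typed in directly), use the command:
    tabi a b \ c d \ ..., [options]
    where each \ starts a new row of the table and the chi2 option requests the chi-squared test.
  • For raw data analysis, use the command:
    tab var1 var2, [options]
    where options include chi2 (chi-squared test), expected (expected counts), and column (column percentages).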

Calculating Probabilities Under Chi-Squared Distribution

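  • For the cumulative (left-tail) probability P(χ² ≤ threshold), use the command:
    dis chi2(df, threshold)
  • For the right-tail probability P(χ² > threshold), which is the p-value of the test, take the complement:
    dis 1 - chi2(df, threshold)
    or, equivalently, dis chi2tail(df, threshold)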
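For checking a right-tail probability outside Stata, the following Python sketch computes the same quantity as 1 - chi2(df, threshold) using only the standard library. The function name chi2_right_tail is hypothetical. The sketch assumes df is a positive integer and uses the standard closed-form series for the chi-squared tail (one form for even d.f., another for odd d.f.); it is an illustration, not a replacement for Stata's own functions.

```python
import math


def chi2_right_tail(x, df):
    """P(chi-squared with integer df > x), via closed-form finite series."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{i=0}^{df/2 - 1} (x/2)^i / i!
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= (x / 2) / i
            total += term
        return math.exp(-x / 2) * total
    # Odd df: two-sided normal tail plus a finite correction series.
    root = math.sqrt(x)
    normal_two_tail = math.erfc(root / math.sqrt(2))
    density = math.exp(-x / 2) / math.sqrt(2 * math.pi)
    term, total = root, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    return normal_two_tail + 2 * density * total


# Right-tail probabilities at the familiar 5% critical values.
for df, threshold in [(1, 3.841), (2, 5.991), (3, 7.815), (4, 9.488)]:
    print(df, threshold, round(chi2_right_tail(threshold, df), 3))
```

Each printed probability is approximately 0.05, matching the usual critical values for 𝛼 = 0.05.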