10. Chi-Squared Tests

Conduct chi-squared hypothesis tests for pairs of categorical variables:
- Perform tests by hand and using Stata.
- Interpret results meaningfully.
- Review descriptive statistics (crosstabs) for pairs of categorical variables, essential for Assignment 2.

1. Check whether the assumptions for the test are met.
2. Choose the significance level (𝛼), typically set at 0.05.
3. State the hypotheses:
   - Null hypothesis (H0): the distribution of one variable is the same across the levels of the other (the variables are independent).
   - Alternative hypothesis (Ha): the distributions are not the same (the variables are not independent).
4. Compute the chi-squared statistic using the formula: [ \chi^2 = \sum \frac{(O - E)^2}{E} ]
   - Where O = observed frequency, E = expected frequency.
5. Find the associated p-value and compare it to 𝛼:
   - If p < 𝛼, reject H0.
   - If p ≥ 𝛼, do not reject H0.
6. Interpret the results in plain English.

The following assumptions must be satisfied for the test to be valid:
1. The data come from a simple random sample (SRS).
2. The expected count in each cell must be at least 5 (E ≥ 5).
Under these conditions, the sampling distribution of the test statistic under the null hypothesis is approximately chi-squared with degrees of freedom (d.f.) = (r-1)(c-1), where r is the number of rows and c the number of columns in the table.
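The expected-count condition and the degrees of freedom can be checked mechanically. The sketch below is a minimal Python illustration, not Stata output; the function names and the 2x3 table of counts are hypothetical, chosen only so that the row totals are 40 and 60.

```python
def expected_counts(observed):
    """Expected cell counts under H0: (row total * column total) / N."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def degrees_of_freedom(observed):
    """d.f. = (r - 1)(c - 1) for a table with r rows and c columns."""
    return (len(observed) - 1) * (len(observed[0]) - 1)


def meets_expected_count_rule(observed, minimum=5):
    """True if every expected count is at least `minimum`."""
    return all(e >= minimum for row in expected_counts(observed) for e in row)


# Hypothetical observed counts: 2 rows (groups) by 3 columns (categories).
observed = [[12, 18, 10],
            [18, 22, 20]]

print(expected_counts(observed))
print(degrees_of_freedom(observed))
print(meets_expected_count_rule(observed))
```

A table with very small counts, such as [[3, 2], [4, 1]], fails the rule because some of its expected counts fall below 5.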

The chi-squared statistic is calculated as: [ \chi^2 = \sum \frac{(O - E)^2}{E} ]
- O = observed frequency in each cell.
- E = expected frequency in each cell, calculated as:
  [ E = \frac{(\text{row total}) \times (\text{column total})}{N} ]
Under H0, the test statistic approximately follows a chi-squared distribution with d.f. = (r-1)(c-1).
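To make the formula concrete, here is a small Python sketch that sums (O - E)^2 / E over every cell of a table. The function name and the 2x2 counts are hypothetical and serve only as an illustration of the by-hand calculation.

```python
def chi_squared_statistic(observed):
    """Sum of (O - E)^2 / E over all cells, with E = row total * column total / N."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n  # expected count for cell (i, j)
            statistic += (o - e) ** 2 / e
    return statistic


# Hypothetical 2x2 table: row totals 40 and 60, column totals 50 and 50.
observed = [[30, 10],
            [20, 40]]

stat = chi_squared_statistic(observed)
print(round(stat, 3))
```

Here the expected counts are 20, 20, 30 and 30, so the four cell contributions are 5, 5, 10/3 and 10/3, giving a statistic of 50/3 (about 16.67) on 1 degree of freedom.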

If p < 𝛼, reject the null hypothesis (H0): there is a statistically significant association/difference.
If p ≥ 𝛼, do not reject H0: there is no evidence of a significant association/difference.
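The decision rule can be sketched in Python as follows. The helper `chi2_sf` is a hypothetical name; it computes the upper-tail probability using closed-form expressions that hold for whole-number degrees of freedom, so it avoids any external library. In practice the p-value would normally come from Stata or a statistics package.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X >= x) for a chi-squared variable, integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum over i < df/2 of (x/2)^i / i!
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    # Odd df: two-sided normal tail plus a finite correction series.
    tail = math.erfc(math.sqrt(half))
    term = math.sqrt(2.0 * x / math.pi) * math.exp(-half)
    total = 0.0
    for r in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * r + 1)
    return tail + total


def decide(p_value, alpha=0.05):
    """Apply the rule: reject H0 only when p < alpha."""
    return "reject H0" if p_value < alpha else "do not reject H0"


p = chi2_sf(16.67, 1)  # hypothetical statistic on 1 degree of freedom
print(p, decide(p))
```

The familiar 5% critical values (3.841 for 1 d.f., 5.991 for 2 d.f., 7.815 for 3 d.f.) each give a tail probability of about 0.05 with this function, which is a convenient sanity check.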

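The chi-squared distribution has 1 parameter: degrees of freedom.
Its shape changes with the degrees of freedom: it takes only non-negative values, is right-skewed, and approaches normality as d.f. increases.
A chi-squared variable with k degrees of freedom can be expressed as the sum of the squares of k independent standard normal variables:
[ \chi^2 = Z_1^2 + Z_2^2 + \dots + Z_k^2 ]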
Karl Pearson proposed its use for independence tests and goodness-of-fit in 1900.
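The sum-of-squares representation can be illustrated by simulation. The Python sketch below draws k independent standard normal values, squares and sums them, and repeats this many times; the function name, seed and sample size are arbitrary choices for illustration. Because a chi-squared variable with k degrees of freedom has mean k, the simulated average should land close to k.

```python
import random


def simulate_chi_squared(k, n_draws, seed=0):
    """Draw n_draws values of Z_1^2 + ... + Z_k^2 with independent standard normal Z_i."""
    rng = random.Random(seed)
    return [sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(k)) for _ in range(n_draws)]


draws = simulate_chi_squared(k=3, n_draws=20000)
mean = sum(draws) / len(draws)

print(round(mean, 2))  # should be close to k = 3
print(min(draws) >= 0)  # sums of squares are never negative
```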

Goodness-of-fit (11.2), independence (11.3), and homogeneity (11.4) tests are essentially the same, differing mainly in terminology.

Expected counts are the product of the row total and the column total, divided by N (the total number of observations).
- Example: with 40 men and 60 women (N = 100), if, say, 30 respondents in total prefer a particular food, the expected number of men preferring it is E = (40 × 30)/100 = 12.

If the null is rejected, we can conclude:
- There is a significant association between the variables.
- The distribution of one variable significantly differs across the levels of the other.
If the null is not rejected, we can state:
- There is no significant association between the variables.
- The distribution of one variable is not significantly different across the levels of the other.

For aggregate counts (cell frequencies entered directly), use the command:
tabi a b \ c d \ ..., [options]
For raw data, use the command:
tab var1 var2, [options]
Options such as chi2 (chi-squared test) and column (column percentages) request the corresponding output.
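For readers who want to see what a crosstab with column percentages contains, the following Python sketch builds one from raw (var1, var2) pairs. It is only a rough analogue of the Stata output, not a reproduction of it; the function names and the ten observations are hypothetical.

```python
from collections import Counter


def crosstab(pairs):
    """Return row labels, column labels and the table of counts for (row, column) pairs."""
    counts = Counter(pairs)
    rows = sorted({r for r, _ in pairs})
    cols = sorted({c for _, c in pairs})
    return rows, cols, [[counts[(r, c)] for c in cols] for r in rows]


def column_percentages(table):
    """Express each cell as a percentage of its column total."""
    col_totals = [sum(col) for col in zip(*table)]
    return [[100.0 * cell / total for cell, total in zip(row, col_totals)]
            for row in table]


# Hypothetical raw data: one (food preference, gender) pair per respondent.
data = ([("pizza", "man")] * 3 + [("salad", "man")] * 1
        + [("pizza", "woman")] * 2 + [("salad", "woman")] * 4)

rows, cols, table = crosstab(data)
percentages = column_percentages(table)

print(rows, cols)
print(table)
print(percentages)
```

Each column of percentages sums to 100, so the columns can be compared directly even when the groups differ in size.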