Chi-Square Test Notes

Chi-Square Test

The chi-square test is symbolized as χ². It is a statistical measure for comparing variance in sampling analysis.
It is a non-parametric test.
It determines if categorical data shows dependency or if classifications are independent.
It compares theoretical populations with actual data using categories.
Researchers use the chi-square test to:
- test goodness of fit
- assess the significance of association between attributes
- test population variance homogeneity

Chi-Square as a Test for Comparing Variance

The chi-square value helps judge the significance of population variance.
It tests if a random sample comes from a normal population with mean µ and specified variance σ².
Based on the χ²-distribution, it deals with sums of squares.
By dividing each sample variance by the known population variance and multiplying by (n - 1), where n is the sample size, we get a χ²-distribution.
The formula is (s²/σ²) * (n - 1), which follows a χ²-distribution with (n - 1) degrees of freedom.

Chi-Square Distribution

The χ²-distribution is asymmetrical and only has positive values.
The shape of the distribution depends on the degrees of freedom; fewer degrees result in a more skewed distribution.
Tables provide critical χ² values for different degrees of freedom.
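The link between degrees of freedom and skewness can be illustrated with the standard moments of the χ²-distribution: for k degrees of freedom the mean is k, the variance is 2k and the skewness is √(8/k). The Python sketch below is illustrative only; the function name chi_square_shape is introduced here and is not part of these notes.

```python
import math

def chi_square_shape(df):
    """Mean, variance and skewness of a chi-square distribution with df degrees of freedom."""
    if df < 1:
        raise ValueError("degrees of freedom must be at least 1")
    return {"mean": df, "variance": 2 * df, "skewness": math.sqrt(8 / df)}

# Skewness shrinks as the degrees of freedom grow.
for df in (1, 2, 8, 30):
    shape = chi_square_shape(df)
    print(df, shape["mean"], shape["variance"], round(shape["skewness"], 3))
```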
Testing Population Variance:
To test population variance using chi-square, we test the null hypothesis that the population variance equals a specified value σ² by calculating: \chi^2 = \frac{s^2}{\sigma^2} (n - 1) where:
- s² = sample variance
- σ² = population variance specified under H₀
- (n – 1) = degrees of freedom, with n being the sample size
We compare the calculated χ² value to the table value for (n – 1) degrees of freedom at a specific significance level.
If the calculated value is less than the table value, we accept the null hypothesis; if it is equal to or greater, we reject it.
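As a minimal sketch of the procedure just described, the following Python functions compute the statistic and apply the decision rule. The function names and the input numbers in the usage lines are hypothetical; the table value still has to be looked up for (n − 1) degrees of freedom at the chosen significance level.

```python
def variance_chi_square(sample_variance, hypothesized_variance, n):
    """Return (s^2 / sigma^2) * (n - 1) for a sample of size n."""
    if n < 2:
        raise ValueError("need at least two observations")
    if hypothesized_variance <= 0:
        raise ValueError("hypothesized variance must be positive")
    return sample_variance / hypothesized_variance * (n - 1)

def decide(chi_sq, table_value):
    """Apply the rule from the notes: reject when calculated >= table value."""
    return "reject H0" if chi_sq >= table_value else "accept H0"

# Hypothetical inputs: s^2 = 4.0, sigma^2 = 2.0, n = 11 (10 degrees of freedom).
chi_sq = variance_chi_square(4.0, 2.0, 11)
# 18.307 is the 5 per cent table value for 10 degrees of freedom.
print(chi_sq, decide(chi_sq, 18.307))
```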

Example

Weight of 10 students is as follows:
Can we say that the variance of the distribution of weight of all students from which the above sample of 10 students was drawn is equal to 20 (in kg²)?
Test this at 5 per cent and 1 per cent level of significance.
Solution:
First, calculate the variance of the sample data, s².
\bar{X} = \frac{\Sigma X}{n} = \frac{470}{10} = 47 kg.
s = \sqrt{\frac{\Sigma(X_i - \bar{X})^2}{n-1}} = \sqrt{\frac{280}{10 - 1}} = \sqrt{31.11} = 5.58, so s² = 31.11.
Let the null hypothesis be H₀: σp² = 20.
Calculate the χ² value as follows:
\chi^2 = \frac{s^2}{\sigma_p^2} (n - 1) = \frac{31.11}{20} (10 - 1) = 13.999

Degrees of Freedom

Degrees of freedom in the given case is (n - 1) = (10-1)=9.
At the 5 per cent level of significance the table value of χ² is 16.92, and at the 1 per cent level it is 21.67, for 9 d.f.
Both these values are greater than the calculated value of χ², which is 13.999.
Hence we accept the null hypothesis and conclude that the variance of the given distribution can be taken as 20 (in kg²) at the 5 per cent as well as the 1 per cent level of significance.
In other words, the sample can be said to have been taken from a population with variance 20.
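The arithmetic of this example can be re-checked from the two sums quoted in the solution (ΣX = 470 and Σ(X − X̄)² = 280). The sketch below is only a check, with variable names chosen here for illustration. Because (n − 1)s² equals Σ(X − X̄)², the statistic is exactly 280/20 = 14; the value 13.999 arises from rounding s² to 31.11.

```python
n = 10
sum_x = 470              # sum of the weights, from the solution
sum_sq_dev = 280         # sum of squared deviations from the mean, from the solution
hypothesized_variance = 20

mean = sum_x / n
sample_variance = sum_sq_dev / (n - 1)
# (n - 1) * s^2 is the sum of squared deviations, so no rounding is needed here.
chi_sq = sum_sq_dev / hypothesized_variance

# Table values for 9 degrees of freedom, as quoted in the notes.
table_values = {"5%": 16.92, "1%": 21.67}
decisions = {
    level: ("reject H0" if chi_sq >= value else "accept H0")
    for level, value in table_values.items()
}
print(mean, round(sample_variance, 2), chi_sq, decisions)
```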

Chi-Square as a Non-Parametric Test and Goodness of Fit

Chi-square is a key non-parametric test that does not require strict assumptions about the population type. Only the degrees of freedom (related to sample size) are needed.
It is used for:
- testing goodness of fit
- testing independence
Goodness of Fit Test
- Evaluates how well an assumed theoretical distribution (e.g., Binomial, Poisson, Normal) fits the observed data.
- If the calculated χ² is less than the table value, the fit is good, indicating that the differences are due to sampling fluctuations.
- If the calculated χ² is greater than the table value, the fit is poor.
Test of Independence
- Determines if two attributes are associated.
- Example: Evaluating the effectiveness of a new medicine for fever control.
- Null hypothesis: Attributes are independent (medicine is not effective).
- Calculate expected frequencies and χ².
- If the calculated χ² is less than the table value, the null hypothesis stands (attributes are independent).
- If greater, the null hypothesis is rejected (attributes are associated, and the medicine is effective).

Application Requirements and Formula

Observed and expected frequencies must be grouped similarly.
Theoretical distribution must match the total frequency of the observed distribution.
Formula for χ² calculation: \chi^2 = \Sigma \frac{(O_i - E_i)^2}{E_i}
- Exact match: if observed and theoretical distributions are identical, χ² = 0.
- Typically, χ² ≠ 0 due to sampling errors.
Significance:
Use χ² tables to determine significance for given degrees of freedom and significance levels.
- If calculated χ² ≥ table value: difference is significant.
- If calculated χ² < table value: difference is insignificant (due to chance).
Degrees of Freedom (d.f.):
- For frequency classes: d.f. = n − 1 (where n is the number of groups).
- For contingency tables: d.f. = (c − 1)(r − 1) (where c is the number of columns and r is the number of rows).
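The two degrees-of-freedom rules above are simple enough to express directly. The helper names below are hypothetical and the sketch assumes only the rules as stated in these notes.

```python
def df_frequency_classes(n_groups):
    """d.f. = n - 1 for a single row of n frequency classes."""
    if n_groups < 2:
        raise ValueError("need at least two groups")
    return n_groups - 1

def df_contingency(rows, cols):
    """d.f. = (c - 1)(r - 1) for an r x c contingency table."""
    if rows < 2 or cols < 2:
        raise ValueError("need at least a 2 x 2 table")
    return (cols - 1) * (rows - 1)

print(df_frequency_classes(6))   # six faces of a die
print(df_contingency(2, 2))      # a (2 x 2) table
```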

Conditions for Applying Chi-Square Test

Random Observations: Data must be collected randomly.
Independence: All items in the sample must be independent.
Minimum Group Size: No group should contain fewer than 10 items. If necessary, combine groups to meet this criterion (some statisticians accept a minimum of 5, but 10 is preferred).
Sample Size: The overall sample size should be at least 50.
Linear Constraints: Constraints must be linear equations (no squares or higher powers) in the cell frequencies of a contingency table.
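The two numeric conditions (minimum group size and overall sample size) can be screened mechanically before running the test. This is a rough sketch: the function name check_conditions and its default thresholds of 10 and 50 simply mirror the figures above, and randomness and independence still have to be judged from how the data were collected.

```python
def check_conditions(frequencies, min_cell=10, min_total=50):
    """Return a list of warnings about group sizes and total sample size."""
    problems = []
    total = sum(frequencies)
    if total < min_total:
        problems.append(f"total sample size {total} is below {min_total}")
    for index, frequency in enumerate(frequencies):
        if frequency < min_cell:
            problems.append(f"group {index} has frequency {frequency}, below {min_cell}")
    return problems

# Hypothetical group frequencies: one group is too small and may need combining.
print(check_conditions([22, 31, 7, 15]))
# Relaxing the minimum to 5, as some statisticians accept.
print(check_conditions([22, 31, 7, 15], min_cell=5))
```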

Steps Involved in Applying Chi-Square Test

Expected Frequencies: Calculate expected frequencies based on the hypothesis or null hypothesis. For a contingency table, use:
E_{ij} = \frac{(\text{Row Total}) \times (\text{Column Total})}{\text{Grand Total}}
Calculate Differences: Find the difference between observed and expected frequencies and square these differences:
(O_i - E_i)^2
Normalize Differences: Divide the squared differences by the corresponding expected frequencies:
\frac{(O_i - E_i)^2}{E_i}
Summation: Sum all the normalized differences to obtain the χ² value:
\chi^2 = \Sigma \frac{(O_i - E_i)^2}{E_i}
Comparison with Table: Compare the calculated χ² value with the table value for the relevant degrees of freedom to draw conclusions.
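The steps above can be sketched in Python for a contingency table given as a list of rows. The function names and the sample table are hypothetical, and the final comparison with a table value is left to the reader.

```python
def expected_frequencies(table):
    """E_ij = (row total * column total) / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

def chi_square(table):
    """Sum of (O - E)^2 / E over every cell of the table."""
    expected = expected_frequencies(table)
    return sum(
        (o - e) ** 2 / e
        for observed_row, expected_row in zip(table, expected)
        for o, e in zip(observed_row, expected_row)
    )

# Hypothetical 2 x 2 table of observed frequencies.
observed = [[10, 20], [20, 10]]
print(expected_frequencies(observed))
print(round(chi_square(observed), 3))
```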

Example: Die Throwing

A die is thrown 132 times with the following results:

Number turned up	1	2	3	4	5	6
Frequency	16	20	25	14	29	28

Is the die unbiased?

Solution: Let H_0: the die is unbiased.
If that is so, the probability of obtaining any one of the six numbers is 1/6 and as such the expected frequency of any one number coming upward is 132 ×1/6 = 22.

No. turned up	Observed frequency (Oᵢ)	Expected frequency (Eᵢ)	(Oᵢ - Eᵢ)	(Oᵢ - Eᵢ)²	(Oᵢ - Eᵢ)²/Eᵢ
1	16	22	-6	36	36/22
2	20	22	-2	4	4/22
3	25	22	3	9	9/22
4	14	22	-8	64	64/22
5	29	22	7	49	49/22
6	28	22	6	36	36/22

\Sigma \frac{(O_i - E_i)^2}{E_i} = \frac{198}{22} = 9
Hence, the calculated value of χ² = 9.
Degrees of freedom in the given problem is (n - 1) = (6 - 1) = 5.
The table value of χ² for 5 degrees of freedom at the 5 per cent level of significance is 11.071.
Comparing the calculated and table values of χ², we find that the calculated value is less than the table value, so the differences between observed and expected frequencies could have arisen from fluctuations of sampling. The result thus supports the hypothesis, and it can be concluded that the die is unbiased.
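The die example can be reproduced in a few lines of Python. This is only a check of the arithmetic above; the variable names are chosen here, and the table value 11.071 is the one quoted in the solution.

```python
observed = [16, 20, 25, 14, 29, 28]      # frequencies for faces 1 to 6
total_throws = sum(observed)              # 132 throws in all
expected = total_throws / len(observed)   # 22 per face if the die is unbiased

chi_sq = sum((o - expected) ** 2 / expected for o in observed)
degrees_of_freedom = len(observed) - 1
table_value = 11.071                      # 5 per cent level, 5 d.f.

verdict = "reject H0" if chi_sq >= table_value else "accept H0"
print(round(chi_sq, 3), degrees_of_freedom, verdict)
```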

Alternative Formula for (2 × 2) Table

There is an alternative method of calculating the value of χ² in the case of a (2 × 2) table. If we write the cell frequencies and marginal totals of a (2 × 2) table thus:


a	b	(a + b)
c	d	(c+d)
(a + c)	(b + d)	N

then the formula for calculating the value of χ² is:
\chi^2 = \frac{N (ad - bc)^2}{(a + c)(b + d)(a + b)(c + d)}
- where N means the total frequency, ad means the larger cross product, bc means the smaller cross product and (a + c), (b + d), (a + b), and (c + d) are the marginal totals.
The alternative formula is rarely used in finding out the value of chi-square as it is not applicable uniformly in all cases but can be used only in a (2 × 2) contingency table.
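As a quick illustration, the shortcut can be written as a small Python function. The function name and the cell values are hypothetical. For the sample table every expected frequency is 15, so the general formula gives 4 × 25/15 ≈ 6.667, which the shortcut should reproduce.

```python
def chi_square_2x2(a, b, c, d):
    """Shortcut formula N(ad - bc)^2 / product of the four marginal totals."""
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        raise ValueError("every marginal total must be positive")
    return n * (a * d - b * c) ** 2 / denominator

# Hypothetical cell frequencies a, b, c, d.
print(round(chi_square_2x2(10, 20, 20, 10), 3))
```

Squaring (ad − bc) makes the order of the cross products irrelevant, so the function does not need to know which one is larger.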

Yates' Correction

F. Yates suggested a correction for continuity in the χ² value calculated for a (2 × 2) table, particularly when cell frequencies are small (no cell frequency should be less than 5 in any case, though 10 is better, as stated earlier) and χ² is just on the significance level.
The correction suggested by Yates is popularly known as Yates' correction.
It reduces the deviation of observed from expected frequencies, which in turn reduces the value of χ².
The rule for correction is to adjust the observed frequency in each cell of a (2 × 2) table so as to reduce the deviation of the observed from the expected frequency for that cell by 0.5, making this adjustment in all the cells without disturbing the marginal totals.
The formula for finding the value of χ² after applying Yates' correction can be stated thus:
\chi^2 \text{ (corrected)} = \frac{N (|ad - bc| - 0.5N)^2}{(a + b)(c + d)(a + c)(b + d)}
If we use the usual formula for calculating the value of chi-square, viz.,
\chi^2 = \Sigma \frac{(O_i - E_i)^2}{E_i}
then Yates' correction can be applied as under:
\chi^2 \text{ (corrected)} = \frac{(|O_1 - E_1| - 0.5)^2}{E_1} + \frac{(|O_2 - E_2| - 0.5)^2}{E_2} + \dots
It may again be emphasised that Yates' correction is made only in case of (2 × 2) table and that too when cell frequencies are small.
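Both forms of the corrected statistic can be sketched in Python and checked against each other. The function names and the cell frequencies are hypothetical; for the sample table every expected frequency is 15, so each cell contributes (5 − 0.5)²/15.

```python
def yates_2x2(a, b, c, d):
    """Corrected shortcut: N(|ad - bc| - 0.5N)^2 / product of marginal totals."""
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        raise ValueError("every marginal total must be positive")
    return n * (abs(a * d - b * c) - 0.5 * n) ** 2 / denominator

def yates_cellwise(observed, expected):
    """Corrected usual formula: sum of (|O - E| - 0.5)^2 / E."""
    return sum((abs(o - e) - 0.5) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical 2 x 2 table (a, b, c, d) with all marginal totals equal to 30.
cells = (10, 20, 20, 10)
shortcut = yates_2x2(*cells)
cellwise = yates_cellwise(cells, [15.0, 15.0, 15.0, 15.0])
print(round(shortcut, 3), round(cellwise, 3))
```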

Example: Shops in Towns and Villages

The following information is obtained concerning an investigation of 50 ordinary shops of small size:

Shops	In towns	In villages	Total
Run by men	17	18	35
Run by women	3	12	15
Total	20	30	50

Can it be inferred that shops run by women are relatively more in villages than in towns? Use the χ² test.
Solution:

Take the hypothesis that there is no difference between towns and villages in the proportion of shops run by men and by women. With this hypothesis the expectation of shops run by men in towns would be:
Expectation of (AB) = \frac{(A) \times (B)}{N}
- where A = shops run by men
- B = shops in towns
- (A) = 35; (B) = 20 and N = 50
- Thus, expectation of (AB) = \frac{35 × 20}{50} = 14
Table of expected frequencies:

	Shops in towns	Shops in villages	Total
Run by men	14 (AB)	21 (Ab)	35
Run by women	6 (aB)	9 (ab)	15
Total	20	30	50

Calculation of χ² value:

Groups	Observed frequency (Oᵢ)	Expected frequency (Eᵢ)	(Oᵢ-Eᵢ)	(Oᵢ-Eᵢ)²/E
(AB)	17	14	3	9/14=0.64
(Ab)	18	21	-3	9/21=0.43
(aB)	3	6	-3	9/6=1.50
(ab)	12	9	3	9/9=1.00

\chi^2 = \Sigma \frac{(O_{ij} - E_{ij})^2}{E_{ij}} = 0.64 + 0.43 + 1.50 + 1.00 = 3.57

Table Value and Hypothesis Testing

For 1 degree of freedom at the 5% level of significance, the table value of χ² is 3.841.
The calculated χ² of 3.57 is less than 3.841, and Yates' correction would only lower it further, so the hypothesis stands.
Conclusion:
There is no significant difference between towns and villages in the proportion of shops run by men and by women.
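The figures in this example, including the Yates-corrected value that the notes mention but do not compute, can be checked with a short Python sketch. Variable names are chosen here for illustration; the corrected value uses the cell-by-cell form of Yates' correction given earlier.

```python
observed = [[17, 18], [3, 12]]   # rows: men, women; columns: towns, villages

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

pairs = [
    (o, e)
    for observed_row, expected_row in zip(observed, expected)
    for o, e in zip(observed_row, expected_row)
]
chi_sq = sum((o - e) ** 2 / e for o, e in pairs)
chi_sq_yates = sum((abs(o - e) - 0.5) ** 2 / e for o, e in pairs)

table_value = 3.841              # 5% level, 1 d.f.
print(expected)
print(round(chi_sq, 2), round(chi_sq_yates, 2), chi_sq < table_value)
```

Under these assumptions the uncorrected value is about 3.57 and the corrected value about 2.48, both below 3.841.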

Additive Property of χ²

χ² values from multiple independent samples can be added together.
The degrees of freedom (d.f.) are also additive.
The combined χ² value and total d.f. provide a better understanding of the significance of the overall problem.
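A small sketch of the additive property follows. The three (χ², d.f.) pairs are hypothetical values invented for illustration, and the function name is an assumption; the table values 3.841 (1 d.f.) and 7.815 (3 d.f.) are the standard 5% critical values.

```python
def combine_chi_square(results):
    """Add chi-square values and degrees of freedom from independent samples."""
    total_chi_sq = sum(chi_sq for chi_sq, _ in results)
    total_df = sum(df for _, df in results)
    return total_chi_sq, total_df

# Hypothetical results from three independent samples, each with 1 d.f.
samples = [(3.2, 1), (2.9, 1), (3.1, 1)]
total_chi_sq, total_df = combine_chi_square(samples)

# None of the samples reaches 3.841 alone, but the total exceeds 7.815.
print(round(total_chi_sq, 2), total_df, total_chi_sq >= 7.815)
```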

Conversion of χ² into Phi Coefficient (φ)

Purpose: χ² assesses significance, not magnitude of association. For a (2 × 2) table, the phi coefficient (φ) provides this magnitude.
Formula: \phi = \sqrt{\frac{\chi^2}{N}}
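As an illustration, the formula can be applied to the shops example above (χ² ≈ 3.57, N = 50). The function name is an assumption made for this sketch.

```python
import math

def phi_coefficient(chi_sq, n):
    """phi = sqrt(chi-square / N)."""
    if n <= 0 or chi_sq < 0:
        raise ValueError("need N > 0 and chi-square >= 0")
    return math.sqrt(chi_sq / n)

# Shops example: chi-square of about 3.57 with N = 50.
print(round(phi_coefficient(3.57, 50), 3))
```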

Conversion of χ² into Coefficient of Contingency (C)

Purpose: Assess magnitude of association, especially for higher-order contingency tables.
Formula: C = \sqrt{\frac{\chi^2}{N + \chi^2}}
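The same kind of sketch works for the coefficient of contingency. The function name is hypothetical, and the inputs reuse the shops example (χ² ≈ 3.57, N = 50) purely to show the arithmetic.

```python
import math

def contingency_coefficient(chi_sq, n):
    """C = sqrt(chi-square / (N + chi-square))."""
    if n <= 0 or chi_sq < 0:
        raise ValueError("need N > 0 and chi-square >= 0")
    return math.sqrt(chi_sq / (n + chi_sq))

# Shops example: chi-square of about 3.57 with N = 50.
print(round(contingency_coefficient(3.57, 50), 3))
```

Because χ² also appears in the denominator, C stays below 1 for any finite χ².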

Important Characteristics of χ² Test

Based on Frequencies: Utilizes frequencies rather than mean and standard deviation.
Hypothesis Testing: Used for hypothesis testing, not estimation.
Additive Property: χ² test values are additive.
Applicability: Can be applied to complex contingency tables, making it useful in research.
Non-Parametric: No rigid assumptions about population type or parameter values.

Caution in Using χ² Test

Independence of Observations: Ensure observations are independent.
Handling Small Theoretical Frequencies: Combine groups whose theoretical (expected) frequencies are too small before applying the test.
Common Errors: Neglect of non-occurrence frequencies, failure to equalize observed and expected frequencies, incorrect determination of degrees of freedom, and computation errors.
Researcher's Responsibility: Researchers should thoroughly understand the test's rationale and potential pitfalls before application.