Lecture 18: Chi-Square Goodness of Fit Test Notes

Overview of Chi-Square Distribution

Definition: The Chi-Square distribution is a continuous probability distribution over non-negative values. It is positively (right) skewed, and the skew decreases as the degrees of freedom increase.
Parameter: It features an associated degrees of freedom ( $df$ ) parameter which determines the specific shape of the distribution.
Primary Applications in Statistical Inference:
* Test of Independence: Used to determine if there is a significant relationship between two categorical variables.
* Goodness of Fit Test: Used to determine how well an observed set of data fits a theoretical distribution.
Source Reference: Weiss, p. 599.

Chi-Square Goodness of Fit Test Fundamentals

Purpose: This test is utilized to assess whether the observed sample distribution of a variable (which can be either quantitative or qualitative) agrees with a pre-specified theoretical population pattern or distribution.
Comparison Basis: The test relies on comparing the observed frequency ( $O$ ) of a specific value or category to the expected frequency ( $E$ ).
Expected Frequency Calculation: The expected frequency is calculated under the assumption that the null hypothesis ( $H_0$ ) is true, using the formula:
* $E = n \times p$
* Where $n$ is the total sample size and $p$ is the probability or proportion of the category under the assumed distribution ( $0 \le p \le 1$ ).
Logic of the Test: If the observed frequencies and expected frequencies differ significantly, the test leads to the rejection of the assumed distribution as a valid model for the observed pattern.
Null and Alternative Hypotheses:
* $H_0$ : The population from which the sample was drawn follows the theoretical/past distribution.
* $H_1$ : The population from which the sample was drawn does not follow the theoretical/past distribution.
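The Test Statistic: The Chi-Square test statistic is calculated as:
* $\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$
* When $H_0$ is true, this statistic approximately follows a Chi-Square distribution with degrees of freedom $df = k - 1$ , where $k$ represents the number of different groups or categories in the distribution.
* Large values of $\chi^2$ indicate a poor fit, so $H_0$ is rejected when the statistic exceeds the critical value $\chi^2(\alpha, df)$ .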
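The calculation above can be sketched in a few lines of Python. This is a minimal illustration rather than part of the lecture: the function name `chi_square_gof` and the die-roll counts are hypothetical, and only the standard library is used.

```python
def chi_square_gof(observed, proportions):
    """Return (chi2, df, expected) for a goodness of fit test.

    observed    -- observed counts O_i, one per category
    proportions -- hypothesized proportions p_i under H0 (must sum to 1)
    """
    if len(observed) != len(proportions):
        raise ValueError("observed and proportions must have the same length")
    if abs(sum(proportions) - 1.0) > 1e-9:
        raise ValueError("hypothesized proportions must sum to 1")

    n = sum(observed)                          # total sample size
    expected = [n * p for p in proportions]    # E_i = n * p_i
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - 1                     # df = k - 1
    return chi2, df, expected


# Hypothetical data: 60 rolls of a die; H0 says each face has p = 1/6.
chi2, df, expected = chi_square_gof([8, 12, 9, 11, 10, 10], [1 / 6] * 6)
print(f"chi2 = {chi2:.3f}, df = {df}")
```

The statistic would then be compared with the tabulated critical value for the chosen significance level and the returned `df`.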

Requirements and Assumptions for the Goodness of Fit Test

The Rule of Five: Generally, the test is considered valid if each of the expected frequencies is at least $5$ .
Statistical Caveat: Dr. Javed Iqbal notes that, according to Weiss (p. 603), the statistician Cochran considered the "rule of 5" often too restrictive for practical application.
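A quick way to apply the rule of five is to list the categories whose expected frequency falls below the threshold. The sketch below is illustrative only: the helper name `small_expected_cells`, the proportions, and the sample sizes are assumptions chosen for the demonstration.

```python
def small_expected_cells(n, proportions, minimum=5):
    """Return (index, E) pairs for categories with expected frequency below `minimum`."""
    return [(i, n * p) for i, p in enumerate(proportions) if n * p < minimum]


# Hypothetical three-category distribution.
proportions = [0.45, 0.40, 0.15]

flags_large = small_expected_cells(200, proportions)  # E = 90, 80, 30: none flagged
flags_small = small_expected_cells(20, proportions)   # E = 9, 8, 3: last one flagged

print("n = 200:", flags_large)
print("n = 20: ", flags_small)
```

Given Cochran's caveat, a flagged category is better read as a prompt for judgment (for example, merging sparse categories) than as an automatic disqualification.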

Example 1: Market Share for Fabric Softener (Company A)

Context: Company A launched an aggressive advertising campaign to maintain or increase its market share.
Prior Market Distribution:
* Company A: $45\,\%$
* Company B (Main Competitor): $40\,\%$
* Other Competitors: $15\,\%$
Sample Data: A random sample of $n = 200$ customers was surveyed after the campaign.
* Observed Preference for Company A: $102$
* Observed Preference for Company B: $82$
* Observed Preference for Others: $16$
Hypotheses:
* $H_0$ : Market share distribution after the campaign is the same as the past market share distribution.
* $H_1$ : Market share distribution after the campaign is different from the past market share distribution.
Calculation Table ( $n = 200$ ):

| Category | $O$ | $E$ | $\frac{(O - E)^2}{E}$ |
| --- | --- | --- | --- |
| Company A | $102$ | $200 \times 0.45 = 90$ | $\frac{(102 - 90)^2}{90} = 1.60$ |
| Company B | $82$ | $200 \times 0.40 = 80$ | $\frac{(82 - 80)^2}{80} = 0.05$ |
| Others | $16$ | $30$ | $\frac{(16 - 30)^2}{30} = 6.53$ |

The expected frequency for Others is derived as $200 - 90 - 80 = 30$ to minimize rounding error.
Summation of Test Statistic: $\chi^2 = 1.600 + 0.050 + 6.533 = 8.183$ (terms carried to three decimals before summing)
Decision Criteria:
* Significance Level ( $\alpha$ ): $5\,\%$
* Degrees of Freedom ( $df$ ): $3 - 1 = 2$
* Critical Value: $\chi^2(0.05, 2) = 5.991$
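Conclusion: Since the calculated statistic ( $8.183$ ) is greater than the critical value ( $5.991$ ), the null hypothesis is rejected at the $5\,\%$ level. The data indicate that the market share distribution has changed.
Observation: In the sample, Company A's share rose from $45\,\%$ to $51\,\%$ ( $102/200$ ), Company B's stayed close to its prior level ( $41\,\%$ versus $40\,\%$ ), and the Others' share fell from $15\,\%$ to $8\,\%$ . Company A's gain therefore appears to have come at the expense of "Other" competitors rather than its main competitor, Company B.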
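The arithmetic of Example 1 can be reproduced as follows. This is a sketch for checking the numbers, not the lecture's own procedure. The p-value line is an addition: it relies on the fact that, for $df = 2$ specifically, the upper-tail probability of the Chi-Square distribution has the closed form $e^{-x/2}$ . Variable names are illustrative.

```python
import math

labels = ["Company A", "Company B", "Others"]
observed = [102, 82, 16]
prior = [0.45, 0.40, 0.15]      # past market shares (H0)

n = sum(observed)
expected = [n * p for p in prior]
terms = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
chi2 = sum(terms)
df = len(observed) - 1

# Closed form valid only for df = 2: P(X > x) = exp(-x / 2)
p_value = math.exp(-chi2 / 2)
critical = 5.991                # chi-square(0.05, 2), as given in the notes

# The signed difference O - E shows the direction of each category's shift.
for label, o, e, t in zip(labels, observed, expected, terms):
    print(f"{label:10s} O={o:4d}  E={e:6.1f}  O-E={o - e:+6.1f}  term={t:.3f}")
print(f"chi2 = {chi2:.3f}, df = {df}, p-value = {p_value:.4f}")
print("reject H0" if chi2 > critical else "do not reject H0")
```

The p-value comes out at roughly 0.017, below 0.05, which agrees with the critical-value decision.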

Weiss Example 13.2: Violent Crime Patterns

Objective: To determine if the observed pattern of violent crimes in a recent year (Weiss Table 13.2) is the same as the pattern from the year 2010 (Weiss Table 13.1).
Hypotheses:
* $H_0$ : Crime distribution of last year is the same as the 2010 crime distribution.
* $H_1$ : Crime distribution of last year has changed from the 2010 crime distribution.
Statistical Data:
* Calculated Test Statistic: $6.529$
* Significance Level: $5\,\%$
* Degrees of Freedom ( $df$ ): $3$
* Critical Value (from Anderson Table 3): $\chi^2(0.05, 3) = 7.815$
Conclusion: There is insufficient evidence in the data to conclude that the violent crime distribution in the last year differs from the 2010 distribution (calculated value 6.529 < 7.815).
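The same decision can be expressed with a p-value instead of a critical value. The sketch below is an addition to the notes: it uses the closed-form upper-tail probability for a Chi-Square variable with exactly $3$ degrees of freedom, $P(X > x) = \operatorname{erfc}(\sqrt{x/2}) + \sqrt{2x/\pi}\, e^{-x/2}$ , and takes the test statistic $6.529$ from the notes rather than recomputing it from the Weiss tables. The function name `chi2_sf_df3` is hypothetical.

```python
import math


def chi2_sf_df3(x):
    """Upper-tail probability P(X > x) for a chi-square variable with 3 df.

    Closed form: erfc(sqrt(x/2)) + sqrt(2x/pi) * exp(-x/2).
    """
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


statistic = 6.529   # test statistic reported in the notes
alpha = 0.05

p_value = chi2_sf_df3(statistic)
print(f"p-value = {p_value:.3f}")
print("reject H0" if p_value < alpha else "do not reject H0")

# Sanity check: the tabulated critical value 7.815 should have tail area near 0.05.
print(f"tail area at 7.815 = {chi2_sf_df3(7.815):.4f}")
```

The p-value is a little under 0.09, which exceeds 0.05, so the result matches the critical-value comparison.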

Example 2: ABO Blood Type Distribution

Prior Population Data (Believed Distribution):
* Type A: $34\,\%$
* Type B: $15\,\%$
* Type AB: $23\,\%$
* Type O: $28\,\%$
Sample Data: $n = 100$ students from a campus.
* Observed Frequencies: A = $29$ , B = $17$ , AB = $20$ , O = $34$ .
Hypotheses:
* $H_0$ : Blood group distribution of campus students agrees with the population distribution.
* $H_1$ : Blood group distribution of campus students does not agree with the population distribution.
Calculation Table ( $n = 100$ ):

| Blood Type | $O$ | $E$ | $\frac{(O - E)^2}{E}$ |
| --- | --- | --- | --- |
| A | $29$ | $34$ | $0.7353$ |
| B | $17$ | $15$ | $0.2667$ |
| AB | $20$ | $23$ | $0.3913$ |
| O | $34$ | $28$ | $1.2857$ |
Summation of Test Statistic: $\chi^2 \approx 2.679$
Decision Criteria:
* Degrees of Freedom ( $df$ ): $4 - 1 = 3$
* Significance Level: $5\,\%$
* Critical Value: $\chi^2(0.05, 3) = 7.815$
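Conclusion: The calculated test statistic ( $\approx 2.679$ ) is less than the critical value ( $7.815$ ). Therefore, the null hypothesis is not rejected. The sample gives no significant evidence that the campus blood group distribution differs from that of the broad population; this is a failure to reject $H_0$ , not proof that the two distributions are identical.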
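The table for Example 2 can be checked with a short loop. This is only a verification sketch with illustrative variable names; the critical value is the one quoted in the notes.

```python
observed = {"A": 29, "B": 17, "AB": 20, "O": 34}
believed = {"A": 0.34, "B": 0.15, "AB": 0.23, "O": 0.28}   # H0 proportions

n = sum(observed.values())
chi2 = 0.0
for blood_type, o in observed.items():
    e = n * believed[blood_type]        # expected frequency E = n * p
    term = (o - e) ** 2 / e
    chi2 += term
    print(f"Type {blood_type:2s}  O={o:3d}  E={e:5.1f}  term={term:.4f}")

df = len(observed) - 1
critical = 7.815                        # chi-square(0.05, 3), from the notes
reject = chi2 > critical

print(f"chi2 = {chi2:.4f}, df = {df}")
print("reject H0" if reject else "do not reject H0")
```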

Weiss Example 13.27: Color Distribution

Overview: This exercise compares observed color frequencies in a sample of $n = 509$ items with the stated color percentages. Each expected frequency is $E = 509 \times p$ .
Calculation Table:

| Color | Percentage | $E$ | $O$ | $\frac{(O - E)^2}{E}$ |
| --- | --- | --- | --- | --- |
| Brown | $30\,\%$ | $152.7$ | $152$ | $0.0032$ |
| Yellow | $20\,\%$ | $101.8$ | $114$ | $1.4621$ |
| Red | $20\,\%$ | $101.8$ | $106$ | $0.1733$ |
| Orange | $10\,\%$ | $50.9$ | $51$ | $0.0002$ |
| Green | $10\,\%$ | $50.9$ | $43$ | $1.2261$ |
| Blue | $10\,\%$ | $50.9$ | $43$ | $1.2261$ |
Sums: Total Expected = $509$ , Total Observed = $509$ , Calculated $\chi^2$ statistic = $4.091$ .
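The notes stop at the test statistic for this exercise. The sketch below recomputes it and, as an assumption not stated in the notes, carries the test through at the $5\,\%$ level used in the other examples, with $df = 6 - 1 = 5$ and the standard table value $\chi^2(0.05, 5) = 11.070$ . Variable names are illustrative.

```python
percentages = {"Brown": 30, "Yellow": 20, "Red": 20,
               "Orange": 10, "Green": 10, "Blue": 10}
observed = {"Brown": 152, "Yellow": 114, "Red": 106,
            "Orange": 51, "Green": 43, "Blue": 43}

n = sum(observed.values())
expected = {color: n * pct / 100 for color, pct in percentages.items()}
chi2 = sum((observed[c] - expected[c]) ** 2 / expected[c] for c in observed)
df = len(observed) - 1

# Assumption: alpha = 0.05 (not stated for this exercise in the notes).
critical = 11.070                       # standard table value chi-square(0.05, 5)
rule_of_five_ok = min(expected.values()) >= 5

print(f"n = {n}, chi2 = {chi2:.3f}, df = {df}")
print("all expected frequencies >= 5:", rule_of_five_ok)
print("reject H0" if chi2 > critical else "do not reject H0")
```

Under these assumptions the statistic falls well below the critical value, so the stated color percentages would not be rejected.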

Reference Exercises

Anderson (pdf p. 611): Exercises 22, 23, and 24 are listed for further study of these concepts.