Lecture 18: Chi Square Goodness of Fit Test Notes

Overview of Chi-Square Distribution

  • Definition: The Chi-Square distribution is a probability distribution characterized by being positively skewed.
  • Parameter: It has an associated degrees of freedom ($df$) parameter, which determines the specific shape of the distribution.
  • Primary Applications in Statistical Inference:
      * Test of Independence: Used to determine if there is a significant relationship between two categorical variables.
      * Goodness of Fit Test: Used to determine how well an observed set of data fits a theoretical distribution.
  • Source Reference: Weiss, p. 599.
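
For reference, the standard density of a chi-square variable can be written out to make the role of the degrees of freedom concrete. This is a textbook formula rather than part of the lecture notes, and the symbol $\nu$ for the degrees of freedom is a notational assumption made here.

```latex
f(x;\nu) = \frac{x^{\nu/2 - 1}\, e^{-x/2}}{2^{\nu/2}\,\Gamma(\nu/2)}, \qquad x > 0,
\qquad
\text{mean} = \nu, \quad \text{variance} = 2\nu, \quad \text{skewness} = \sqrt{8/\nu} > 0 .
```

The skewness is always positive and shrinks as $\nu$ grows, so the curve becomes more symmetric for larger degrees of freedom.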

Chi-Square Goodness of Fit Test Fundamentals

  • Purpose: This test is utilized to assess whether the observed sample distribution of a variable (which can be either quantitative or qualitative) agrees with a pre-specified theoretical population pattern or distribution.
  • Comparison Basis: The test relies on comparing the observed frequency ($O$) of a specific value or category to the expected frequency ($E$).
  • Expected Frequency Calculation: The expected frequency is calculated under the assumption that the null hypothesis ($H_0$) is true, using the formula:
      * $E = n \times p$
      * where $n$ is the total sample size and $p$ is the probability or proportion of the category under the assumed distribution ($0 \le p \le 1$).
  • Logic of the Test: If the observed frequencies and expected frequencies differ significantly, the test leads to the rejection of the assumed distribution as a valid model for the observed pattern.
  • Null and Alternative Hypotheses:
      * $H_0$: The observed distribution matches the theoretical/past distribution.
      * $H_1$: The observed distribution differs from the theoretical/past distribution.
  • The Test Statistic: The Chi-Square test statistic is calculated as:
      * $\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}$
      * This statistic follows a Chi-Square distribution with degrees of freedom $df = k - 1$, where $k$ represents the number of different groups or categories in the distribution.
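
The expected-frequency and test-statistic formulas above can be sketched in a few lines of plain Python. This is a minimal illustration rather than anything from the lecture: the function name `gof_statistic`, its arguments, and the die-roll counts are all hypothetical.

```python
def gof_statistic(observed, proportions):
    """Return (expected, chi_square, df) for a goodness of fit test.

    observed    -- observed counts O_i, one per category
    proportions -- hypothesized proportions p_i under H0 (should sum to 1)
    """
    if len(observed) != len(proportions):
        raise ValueError("need one proportion per category")
    if abs(sum(proportions) - 1.0) > 1e-9:
        raise ValueError("proportions must sum to 1")
    n = sum(observed)                          # total sample size
    expected = [n * p for p in proportions]    # E_i = n * p_i
    chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - 1                     # k - 1
    return expected, chi_square, df


# Hypothetical data (not from the notes): 60 die rolls tested against a fair die.
rolls = [8, 12, 9, 11, 6, 14]
expected, chi_square, df = gof_statistic(rolls, [1 / 6] * 6)
print(expected, round(chi_square, 3), df)
```

Deriving `n` from the observed counts keeps the total of the expected frequencies equal to the total of the observed ones, which is the constraint behind $df = k - 1$.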

Requirements and Assumptions for the Goodness of Fit Test

  • The Rule of Five: Generally, the test is considered valid if each of the expected frequencies is at least 5.
  • Statistical Caveat: Dr. Javed Iqbal notes that on p. 603 of Weiss, it is mentioned that the statistician Cochran suggests the "rule of 5" is often too restrictive for practical application.
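
A quick way to apply the rule of five is to list the categories whose expected frequency falls short before running the test. The sketch below assumes a hypothetical helper `small_expected_cells`; the proportions are the prior market shares from Example 1 in these notes, and the sample size of 20 is invented purely to show a failing case.

```python
def small_expected_cells(n, proportions, minimum=5):
    """Return (index, expected) pairs whose expected frequency is below `minimum`."""
    return [(i, n * p) for i, p in enumerate(proportions) if n * p < minimum]


shares = [0.45, 0.40, 0.15]                  # prior market shares from Example 1
print(small_expected_cells(200, shares))     # every expected frequency is at least 5
print(small_expected_cells(20, shares))      # the last category expects only about 3
```

Given Cochran's remark, an empty list is best read as a conservative check rather than a strict requirement.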

Example 1: Market Share for Fabric Softener (Company A)

  • Context: Company A launched an aggressive advertising campaign to maintain or increase its market share.
  • Prior Market Distribution:
      * Company A: 45%
      * Company B (Main Competitor): 40%
      * Other Competitors: 15%
  • Sample Data: A random sample of $n = 200$ customers was surveyed after the campaign.
      * Observed Preference for Company A: 102
      * Observed Preference for Company B: 82
      * Observed Preference for Others: 16
  • Hypotheses:
      * $H_0$: Market share distribution after the campaign is the same as the past market share distribution.
      * $H_1$: Market share distribution after the campaign is different from the past market share distribution.
  • Calculation Table ($n = 200$):
      * Category A: $O = 102$; $E = 200 \times 0.45 = 90$; $\frac{(O - E)^2}{E} = \frac{(102 - 90)^2}{90} = 1.600$
      * Category B: $O = 82$; $E = 200 \times 0.40 = 80$; $\frac{(O - E)^2}{E} = \frac{(82 - 80)^2}{80} = 0.050$
      * Others: $O = 16$; $E = 30$ (derived as $200 - 90 - 80$ to minimize rounding error); $\frac{(O - E)^2}{E} = \frac{(16 - 30)^2}{30} = 6.533$
  • Summation of Test Statistic: $\chi^2 = 1.600 + 0.050 + 6.533 = 8.183$
  • Decision Criteria:
      * Significance Level ($\alpha$): 5%
      * Degrees of Freedom ($df$): $3 - 1 = 2$
      * Critical Value: $\chi^2(0.05, 2) = 5.991$
  • Conclusion: Since the calculated statistic (8.183) is greater than the critical value (5.991), the null hypothesis is rejected at the 5% significance level. The market share distribution has changed.
  • Observation: Company A successfully increased its market share, seemingly at the expense of "Other" competitors rather than its main competitor, Company B.
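
The arithmetic in this example, and the observation about where Company A's gain came from, can be reproduced with a short script. It is a sketch using only the figures quoted above; the variable names are illustrative, and the p-value line relies on the fact that for $df = 2$ the chi-square upper-tail probability has the closed form $e^{-x/2}$.

```python
import math

observed = {"A": 102, "B": 82, "Others": 16}
prior_share = {"A": 0.45, "B": 0.40, "Others": 0.15}
n = sum(observed.values())   # 200 customers

chi_square = 0.0
for name, o in observed.items():
    e = n * prior_share[name]                 # expected count under H0
    contribution = (o - e) ** 2 / e           # this category's share of chi-square
    chi_square += contribution
    print(f"{name:7s} O={o:4d} E={e:6.1f} "
          f"sample share={o / n:.2%} contribution={contribution:.3f}")

# For df = 2 only: P(chi-square > x) = exp(-x / 2).
p_value = math.exp(-chi_square / 2)
print(f"chi-square = {chi_square:.3f}, p-value = {p_value:.4f}")
```

The per-category contributions show that most of the statistic comes from the "Others" row, which matches the observation that the shift came mainly at their expense.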

Weiss Example 13.2: Violent Crime Patterns

  • Objective: To determine if the observed pattern of violent crimes in a recent year (Weiss Table 13.2) is the same as the pattern from the year 2010 (Weiss Table 13.1).
  • Hypotheses:
      * $H_0$: Crime distribution of last year is the same as the 2010 crime distribution.
      * $H_1$: Crime distribution of last year has changed from the 2010 crime distribution.
  • Statistical Data:
      * Calculated Test Statistic: 6.529
      * Significance Level: 5%
      * Degrees of Freedom ($df$): 3
      * Critical Value (from Anderson Table 3): $\chi^2(0.05, 3) = 7.815$
  • Conclusion: There is insufficient evidence in the data to conclude that the violent crime distribution in the last year differs from the 2010 distribution (calculated value 6.529 < 7.815).
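
Comparing the statistic with a tabulated critical value is equivalent to comparing a p-value with the significance level. The sketch below computes the upper-tail probability without a table, using a hypothetical helper `chi2_sf` built on the standard recurrence for the regularized incomplete gamma function; it assumes an integer number of degrees of freedom and is meant as an illustration, not a replacement for a statistics library.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with integer df."""
    h = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-h)               # df = 2: exp(-x/2)
    else:
        a, q = 0.5, math.erfc(math.sqrt(h))    # df = 1: erfc(sqrt(x/2))
    # Recurrence Q(a + 1, h) = Q(a, h) + h**a * exp(-h) / Gamma(a + 1),
    # stepped up until a reaches df / 2.
    while a < df / 2.0:
        q += h ** a * math.exp(-h) / math.gamma(a + 1.0)
        a += 1.0
    return q


statistic, df, alpha = 6.529, 3, 0.05          # values quoted in the example
p_value = chi2_sf(statistic, df)
print(f"p-value = {p_value:.4f}")
print("reject H0" if p_value < alpha else "do not reject H0")
```

As a sanity check, `chi2_sf(7.815, 3)` comes out very close to 0.05, which agrees with the critical value used in the notes.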

Example 2: ABO Blood Type Distribution

  • Prior Population Data (Believed Distribution):
      * Type A: 34%
      * Type B: 15%
      * Type AB: 23%
      * Type O: 28%
  • Sample Data: $n = 100$ students from a campus.
      * Observed Frequencies: A = 29, B = 17, AB = 20, O = 34.
  • Hypotheses:
      * $H_0$: Observed blood group distribution of campus students agrees with the population distribution.
      * $H_1$: Observed blood group distribution of campus students does not agree with the population distribution.
  • Calculation Table ($n = 100$):
      * Type A: $O = 29$; $E = 34$; $\frac{(O - E)^2}{E} = 0.7352$
      * Type B: $O = 17$; $E = 15$; $\frac{(O - E)^2}{E} = 0.2666$
      * Type AB: $O = 20$; $E = 23$; $\frac{(O - E)^2}{E} = 0.3913$
      * Type O: $O = 34$; $E = 28$; $\frac{(O - E)^2}{E} = 1.2857$
  • Summation of Test Statistic: $\chi^2 = 2.6789$
  • Decision Criteria:
      * Degrees of Freedom ($df$): $4 - 1 = 3$
      * Significance Level: 5%
      * Critical Value: $\chi^2(0.05, 3) = 7.815$
  • Conclusion: The calculated test statistic (2.6789) is less than the critical value (7.815). Therefore, the null hypothesis is not rejected: the sample provides insufficient evidence that the campus blood group distribution differs from that of the broad population.
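
The table entries for this example can be cross-checked with exact rational arithmetic, which avoids any question of intermediate rounding. This is a sketch using Python's standard `fractions` module; the dictionary names are illustrative.

```python
from fractions import Fraction

observed = {"A": 29, "B": 17, "AB": 20, "O": 34}
percent = {"A": 34, "B": 15, "AB": 23, "O": 28}    # believed population percentages
n = sum(observed.values())                          # 100 students

terms = {}
for blood_type, o in observed.items():
    e = Fraction(n * percent[blood_type], 100)      # E = n * p, kept exact
    terms[blood_type] = (o - e) ** 2 / e            # exact (O - E)^2 / E

chi_square = sum(terms.values())
for blood_type, term in terms.items():
    print(blood_type, term, float(term))
print("chi-square =", float(chi_square))
```

The exact sum is about 2.67898, so the 2.6789 in the notes is the same figure truncated to four decimals, and it remains well below 7.815 either way.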

Weiss Example 13.27: Color Distribution

  • Overview: This data set analyzes color percentages and observed frequencies for a sample where the sum of observed frequencies is 509.
  • Calculation Table:
      * Brown: Percentage = 30%, $E = 152.7$, $O = 152$, $\frac{(O - E)^2}{E} = 0.0032$
      * Yellow: Percentage = 20%, $E = 101.8$, $O = 114$, $\frac{(O - E)^2}{E} = 1.4620$
      * Red: Percentage = 20%, $E = 101.8$, $O = 106$, $\frac{(O - E)^2}{E} = 0.1732$
      * Orange: Percentage = 10%, $E = 50.9$, $O = 51$, $\frac{(O - E)^2}{E} = 0.0002$
      * Green: Percentage = 10%, $E = 50.9$, $O = 43$, $\frac{(O - E)^2}{E} = 1.226$
      * Blue: Percentage = 10%, $E = 50.9$, $O = 43$, $\frac{(O - E)^2}{E} = 1.226$
  • Sums: Total Expected = 509, Total Observed = 509, Calculated $\chi^2$ statistic = 4.091.
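
The table and its totals can be regenerated from the percentages and observed counts alone. The sketch below also applies the lecture's decision rule, which the notes do not carry out for this example: the 5% significance level is an assumption carried over from the other examples, and 11.070 is the standard tabulated chi-square critical value for $df = 5$, not a figure taken from the notes.

```python
colors = ["Brown", "Yellow", "Red", "Orange", "Green", "Blue"]
percent = [30, 20, 20, 10, 10, 10]
observed = [152, 114, 106, 51, 43, 43]
n = sum(observed)                                   # 509

expected = [n * p / 100 for p in percent]           # E = n * p
terms = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
chi_square = sum(terms)

print(f"{'Color':8s}{'%':>4s}{'O':>6s}{'E':>8s}{'(O-E)^2/E':>12s}")
for row in zip(colors, percent, observed, expected, terms):
    print("{:8s}{:4d}{:6d}{:8.1f}{:12.4f}".format(*row))
print(f"n = {n}, chi-square = {chi_square:.3f}, df = {len(colors) - 1}")

# Assumption: alpha = 0.05 as in the other examples; 11.070 is the standard
# tabulated critical value for df = 5 (not given in the notes).
critical_value = 11.070
print("reject H0" if chi_square > critical_value else "do not reject H0")
```

Under those assumptions the statistic falls below the critical value, so the stated color percentages would not be rejected.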

Reference Exercises

  • Anderson (pdf p. 611): Exercises 22, 23, and 24 are listed for further study of these concepts.