Chi-Square Test for Independence and Association

  • General Definition: This test is conducted to determine if there is a relationship between two categorical variables from a single population.

  • Key Distinctions: Unlike the Chi-Square Test for Homogeneity (which compares one variable across separate samples), the Chi-Square Test for Independence involves one sample and asks about two variables.

  • Terminology: The terms "independence" and "association" may be used interchangeably when naming the test or stating hypotheses.

  • Data Structure: This test is always performed on data presented in a two-way table.

Hypotheses and Variable Relationships

  • Null Hypothesis ($H_0$): States that there is no relationship between the variables. They are independent or not associated.
     * Example: "Taco Tongue and Evil Eyebrow are independent" or "There is no association between Taco Tongue and Evil Eyebrow."

  • Alternative Hypothesis ($H_a$): States that there is a relationship between the variables. They are dependent or associated.
     * Example: "Taco Tongue and Evil Eyebrow are associated" or "The variables are dependent."

  • The Concept of Independence: If two variables are independent, knowing the value of one tells you nothing about the other. For instance, whether a person can do a "taco tongue" (folding the tongue) has nothing to do with whether that person can raise one eyebrow ("evil eyebrow").

Calculating Expected Counts

  • Calculation Method: To find the expected counts for a cell in a two-way table without using a technology matrix, use the following formula:
     $E = \frac{\text{Row Total} \times \text{Column Total}}{\text{Grand Total}}$

  • Numerical Example ($n = 600$):
     * Total sample size (Grand Total): 600
     * Row totals: 480, 120
     * Column totals: 200, 400
     * Calculation 1: $\frac{480 \times 200}{600} = 160$
     * Calculation 2: $\frac{480 \times 400}{600} = 320$
     * Calculation 3: $\frac{120 \times 200}{600} = 40$
     * Calculation 4: $\frac{120 \times 400}{600} = 80$

  • Significance of Expected Counts: Expected counts often appear in multiple-choice exam questions and are essential for checking the Large Counts condition.
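
As a quick check of the arithmetic in the numerical example, the four expected counts can be computed from the row, column, and grand totals alone. This is a minimal Python sketch; the variable names are illustrative and are not part of the notes.

```python
# Totals from the numerical example (n = 600).
row_totals = [480, 120]
col_totals = [200, 400]
grand_total = 600

# Expected count for each cell = row total * column total / grand total.
expected = [
    [row * col / grand_total for col in col_totals]
    for row in row_totals
]

for row in expected:
    print(row)
# [160.0, 320.0]
# [40.0, 80.0]
```

Each inner list is one row of the two-way table, so the layout matches the order of Calculations 1 through 4.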

Conditions for Inference

  1. Random: Data must come from a random sample to generalize the findings to the population.
     * In the example: A random sample of 600 seniors was taken to generalize to all seniors.

  2. 10% Condition: When sampling without replacement, the sample size ($n$) must be less than 10% of the population size ($N$).
     * Condition check: $600 < 0.10 \times (\text{All Seniors})$.
     * Critical Exception: Do not check the 10% condition if the data comes from an experiment using random assignment. Checking it in this context will result in a loss of points on the exam.

  3. Large Counts: All expected counts must be greater than or equal to 5.
     * In the example: The expected counts were 160, 320, 40, and 80. Since all values are $\ge 5$, the condition is satisfied.
     * Strict Reporting Rule: It is mandatory to list the specific expected counts; simply stating they are all above 5 is insufficient.
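
The two numeric conditions can be sketched as small checks. This is only an illustration: the function names are made up here, and the population size is a hypothetical placeholder, because the notes do not say how many seniors there are. The Random condition depends on how the data were collected and cannot be checked in code.

```python
expected = [[160, 320], [40, 80]]  # expected counts from the example


def large_counts_check(expected, minimum=5):
    """Return the flat list of expected counts and whether all are >= minimum."""
    counts = [count for row in expected for count in row]
    return counts, all(count >= minimum for count in counts)


def ten_percent_check(n, population_size):
    """True when n is less than 10% of the population.

    Written as 10 * n < N so the comparison uses integers only.
    Skip this check for an experiment with random assignment.
    """
    return 10 * n < population_size


counts, large_ok = large_counts_check(expected)
print("Expected counts:", counts, "| all >= 5:", large_ok)

HYPOTHETICAL_SENIORS = 10_000  # placeholder value, not given in the notes
print("10% condition met:", ten_percent_check(600, HYPOTHETICAL_SENIORS))
```

Printing the individual counts, not just the verdict, mirrors the reporting rule above: the specific expected counts must be listed.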

Calculations and Technology Output

  • Chi-Square Test Statistic Formula:
     $\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$

  • Matrix Setup for Technology:
     * Matrix A (Observed Data): $\begin{pmatrix} 180 & 300 \\ 20 & 100 \end{pmatrix}$
     * Matrix B (Expected Data): $\begin{pmatrix} 160 & 320 \\ 40 & 80 \end{pmatrix}$

  • Degrees of Freedom ($df$):
     * Formula for two-way tables: $df = (r-1) \times (c-1)$
     * Calculation: $(2-1) \times (2-1) = 1 \times 1 = 1$

  • Results for the Example Problem:
     * Chi-Square Test Statistic: $\chi^2 = 18.75$
     * P-value: $P \approx 0$
     * Significance Level: $\alpha = 0.05$
     * Result interpretation: A very small P-value indicates that the observed counts are very different from the expected counts computed under the assumption of independence.
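
The results above can be reproduced by hand from the observed and expected matrices. The sketch below uses only the Python standard library and is not the calculator procedure from the notes. The p-value line relies on a shortcut that holds only when $df = 1$: a chi-square variable with one degree of freedom is a squared standard normal, so its upper-tail probability is `erfc(sqrt(x / 2))`. For larger tables a statistics library would be needed.

```python
import math

observed = [[180, 300], [20, 100]]  # Matrix A
expected = [[160, 320], [40, 80]]   # Matrix B

# Sum of (observed - expected)^2 / expected over all four cells.
chi_square = sum(
    (obs - exp) ** 2 / exp
    for obs_row, exp_row in zip(observed, expected)
    for obs, exp in zip(obs_row, exp_row)
)

# df = (rows - 1) * (columns - 1)
df = (len(observed) - 1) * (len(observed[0]) - 1)

# Upper-tail probability; this formula is valid for df = 1 only.
p_value = math.erfc(math.sqrt(chi_square / 2))

alpha = 0.05
print("chi-square =", chi_square)   # 18.75
print("df =", df)                   # 1
print("p-value =", p_value)         # very close to 0
print("reject H0:", p_value < alpha)
```

The four cell contributions are 2.5, 1.25, 10, and 5, which add up to the reported 18.75.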

Decision and Conclusion

  • Decision: Since the P-value ($P \approx 0$) is less than the alpha level ($\alpha = 0.05$), we reject the null hypothesis ($H_0$).

  • Conclusion Statement: We have convincing evidence that Taco Tongue and Evil Eyebrow are associated among seniors.

Error Types and Statistical Power

  • Type I Error: Occurs if we reject the null hypothesis when it is actually true.
     * The probability of a Type I error is equal to the significance level: $P(\text{Type I Error}) = \alpha = 0.05$.

  • Type II Error: Occurs if we fail to reject the null hypothesis when it is actually false.

  • Relationship between Alpha, Type II Error, and Power:
     * As $\alpha$ (Type I Error probability) increases, the probability of a Type II error decreases.
     * As the probability of a Type II error decreases, the Power of the test increases.
     * Power and Alpha move in the same direction: If $\alpha$ increases, Power increases.

  • How to Increase Power:
     * Increase the sample size ($n$).
     * Increase the significance level ($\alpha$) (e.g., from 0.05 to 0.15).
     * Use a value in the alternative hypothesis that is further away from the null value.
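
The effect of sample size and alpha on power can be seen in a small simulation. Everything about the population below is an assumption made for illustration (the probabilities do not come from the notes): the two variables are built to be associated, so $H_0$ is false and every failure to reject is a Type II error. Power is estimated as the share of simulated samples in which $H_0$ is rejected. The p-value shortcut is valid only for a 2x2 table ($df = 1$).

```python
import math
import random

# Hypothetical population, chosen only for illustration:
# 80% fall in row 1; P(column 1) is 0.35 in row 1 and 0.25 in row 2.
P_ROW1 = 0.80
P_COL1_GIVEN_ROW = (0.35, 0.25)


def simulate_p_value(n, rng):
    """Draw one sample of size n and return the chi-square p-value (df = 1)."""
    table = [[0, 0], [0, 0]]
    for _ in range(n):
        row = 0 if rng.random() < P_ROW1 else 1
        col = 0 if rng.random() < P_COL1_GIVEN_ROW[row] else 1
        table[row][col] += 1
    row_totals = [sum(r) for r in table]
    col_totals = [sum(c) for c in zip(*table)]
    if 0 in row_totals or 0 in col_totals:
        return 1.0  # degenerate table: treat as no evidence against H0
    chi_square = 0.0
    for i in range(2):
        for j in range(2):
            exp = row_totals[i] * col_totals[j] / n
            chi_square += (table[i][j] - exp) ** 2 / exp
    return math.erfc(math.sqrt(chi_square / 2))  # valid for df = 1 only


def estimate_power(p_values, alpha):
    """Share of simulated samples in which H0 is rejected at level alpha."""
    return sum(p < alpha for p in p_values) / len(p_values)


rng = random.Random(0)  # fixed seed so the run is repeatable
p_small = [simulate_p_value(100, rng) for _ in range(2000)]
p_large = [simulate_p_value(400, rng) for _ in range(2000)]

for label, p_values in (("n = 100", p_small), ("n = 400", p_large)):
    for alpha in (0.05, 0.15):
        power = estimate_power(p_values, alpha)
        print(label, "| alpha =", alpha, "| power ~", round(power, 3),
              "| P(Type II) ~", round(1 - power, 3))
```

For a fixed set of samples, raising alpha can only add rejections, so the estimated power at 0.15 is never below the estimate at 0.05. The larger sample should also show noticeably higher power.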

Comparison Summary: GOF vs. Homogeneity vs. Association

  • Goodness of Fit (GOF): One sample, one variable; checks if the sample distribution matches a specific population distribution.

  • Homogeneity: Two or more samples (or groups), one variable; checks if the distribution of a single variable is the same across multiple populations.

  • Independence/Association: One sample, two variables; checks if there is a relationship between two variables within a single population.

Questions & Discussion

  • Student Question: "Do you remember how to find expected counts on a two-way table?"
     * Response: Row total times column total over table total.

  • Student Question: "Is it homogeneity because we're looking at only one sample?"
     * Response: No, it is not homogeneity. If you have one sample with two variables, it's a test for independence/association. Homogeneity requires two or more samples.

  • Student Question: "What are the lines for? (referring to notations like 'TT' and 'EE')"
     * Response: Those are shorthand notations to be "lazy" while writing. Instead of writing "Taco Tongue" and "Evil Eyebrow" repeatedly, "TT" and "EE" are used.

  • Student Question: "Why are we checking 10%?"
     * Response: We check the 10% condition because we are sampling without replacement.

  • Student Question: "If it's an experiment, do we have to check 10%?"
     * Response: No. If it is an experiment with random assignment, do not check the 10% condition, or you will lose points.

  • Student Question: "How can you decrease the chance of making a Type II error?"
     * Response: Increase the sample size or increase the alpha level (e.g., set alpha to 0.15).