Focuses on inference for categorical variables that have more than two categories.
Aims to explore interactions between categorical variables in depth.
This chapter shifts focus towards theoretical approaches for inference.
Chapter 9 will address Mean Group Differences.
Qualitative or categorical measurements include examples such as:
M&M colors (6 possible colors)
Airline ticket classes (coach, business, first)
Survey responses (strongly disagree to strongly agree)
Such data can be recorded as counts across categories, representing a multinomial experiment.
Binomial experiments are limited to two categories.
Experiments with two categories can be modeled with a weighted coin flip.
For k > 2 categories, use weighted dice to simulate data with specific probabilities for each category.
Use the observed frequency in each category as the sample statistic (n1, n2, …, nk).
The p-value is computed by summing the probabilities of all possible outcomes that are as likely as, or less likely than, the observed outcome.
Larger samples require more computational power, leading to a preference for theoretical methods.
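The weighted-dice idea above can be sketched in a few lines. The notes do their analysis in R; Python's standard library is used here only as a stand-in. The function name roll_weighted_dice, the seed, the sample size of 100, and the ticket-class weights are all illustrative assumptions, not values from the notes.

```python
import random
from collections import Counter

def roll_weighted_dice(categories, probs, n, seed=None):
    """Simulate n rolls of a weighted k-sided die; return counts (n1, ..., nk)."""
    rng = random.Random(seed)
    draws = rng.choices(categories, weights=probs, k=n)
    tally = Counter(draws)
    # Report counts in the same order as the categories, including zeros.
    return tuple(tally[c] for c in categories)

# Hypothetical weights for the three ticket classes (assumed, not from the notes).
categories = ["coach", "business", "first"]
probs = [0.7, 0.2, 0.1]

counts = roll_weighted_dice(categories, probs, n=100, seed=1)
print(dict(zip(categories, counts)))
```

Repeating the simulation many times and comparing each simulated count vector with the observed one gives a simulated p-value; the cost of doing so grows with the sample size.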
A local pharmacy's ice cream sales data is analyzed to verify if flavor preferences have changed from five years ago:
Previous proportions: Strawberry (25%), Chocolate (40%), Vanilla (20%), Butterscotch (15%).
Owner collects customer preference data over one day.
Plan to evaluate the evidence of preference change at a 10% significance level.
The analysis is a multinomial experiment with one variable.
Use the xmulti function from the XNomial package in R to perform inference.
Example output shows 4960 different tables can be constructed; the observed situation has a probability of 0.002638.
Assumption: Simple random sample from a multinomial distribution.
Hypotheses:
H0: Ice cream preference remains the same.
Ha: Ice cream preference differs.
Test Statistic: 7
P-value: Simulated p = 0.5666
Conclusions: p > 0.10
Fail to reject H0; insufficient evidence for a preference difference.
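The exact calculation behind this kind of output can be sketched by brute force: list every possible table of counts, compute its multinomial probability under H0, and add up the probabilities that do not exceed the probability of the observed table. This is a minimal Python sketch, not the XNomial implementation. The notes do not list the owner's actual counts, so the observed counts below are hypothetical; a total of 29 customers is assumed because 4 flavors and 29 customers give exactly 4960 possible tables, which is consistent with the reported output.

```python
from math import factorial

def compositions(n, k):
    """Yield every tuple of k non-negative counts that sums to n."""
    if k == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in compositions(n - first, k - 1):
            yield (first,) + rest

def multinomial_prob(counts, probs):
    """Probability of one table of counts under the given category probabilities."""
    coef = factorial(sum(counts))
    for c in counts:
        coef //= factorial(c)
    p = float(coef)
    for c, q in zip(counts, probs):
        p *= q ** c
    return p

def exact_multinomial_test(observed, probs):
    """Return (number of tables, P(observed table), exact p-value)."""
    n, k = sum(observed), len(observed)
    p_obs = multinomial_prob(observed, probs)
    n_tables, p_value = 0, 0.0
    for table in compositions(n, k):
        n_tables += 1
        p = multinomial_prob(table, probs)
        # Small tolerance so that ties with the observed table are included.
        if p <= p_obs * (1 + 1e-9):
            p_value += p
    return n_tables, p_obs, p_value

# Proportions from five years ago (from the notes):
# strawberry, chocolate, vanilla, butterscotch.
null_probs = [0.25, 0.40, 0.20, 0.15]
# Hypothetical one-day counts for 29 customers (assumed, not the owner's data).
observed = (7, 12, 6, 4)

n_tables, p_obs, p_value = exact_multinomial_test(observed, null_probs)
print(n_tables, p_obs, p_value)
```

Because the counts are made up, the printed probabilities will not match the values in the notes; only the number of tables is comparable.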
This test allows simulation for 2x2 tables and is based on a multivariate hypergeometric distribution.
The test works directly with the contingency table, with the p-value derived by summing the probabilities of all table configurations that are as likely as, or less likely than, the observed one.
Larger samples often warrant theoretical tests instead due to computational complexity.
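For a 2x2 table with fixed row and column totals, the multivariate hypergeometric distribution reduces to the ordinary hypergeometric distribution, so the exact p-value can be computed directly. The sketch below assumes the test described here is the construction usually called Fisher's exact test, which the notes do not name. The function names and the example table are hypothetical.

```python
from math import comb

def hypergeom_prob(a, row1, row2, col1):
    """P(top-left cell equals a) when all margins of the 2x2 table are fixed."""
    return comb(row1, a) * comb(row2, col1 - a) / comb(row1 + row2, col1)

def exact_p_2x2(table):
    """Two-sided exact p-value: sum probabilities of tables no more likely than the observed one."""
    (a, b), (c, d) = table
    row1, row2, col1 = a + b, c + d, a + c
    lowest = max(0, col1 - row2)   # smallest feasible top-left count
    highest = min(row1, col1)      # largest feasible top-left count
    p_obs = hypergeom_prob(a, row1, row2, col1)
    total = 0.0
    for x in range(lowest, highest + 1):
        p = hypergeom_prob(x, row1, row2, col1)
        if p <= p_obs * (1 + 1e-9):
            total += p
    return total

# Hypothetical 2x2 table of counts (assumed for illustration only).
table = [[3, 1], [1, 3]]
print(round(exact_p_2x2(table), 4))
```

The loop visits only a handful of tables here; with larger counts or more rows and columns the number of configurations grows quickly, which is the computational burden noted above.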
Focuses on theoretical applications of inferring single qualitative variables with two or more categories.
Traditional z-procedures and t-procedures do not fit scenarios with multiple categories; Chi-Square tests are applied.
The Chi-Square distribution is introduced, leading to the Goodness-of-Fit Test.
The chi-square distribution is right-skewed, with a shape determined by its degrees of freedom (df).
Notation: χ²(df, α) indicates the critical χ² value, for the given df, that has area α to its right.
Basic properties:
Total area under the curve = 1.
Begins at 0 and extends to the right indefinitely.
Right-skewed curve.
As df increases, the curve resembles a normal distribution.
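These properties can be checked by simulation. The sketch below relies on a standard fact the notes do not state: a chi-square variable with df degrees of freedom can be generated as the sum of df squared independent standard normal values. The function names, seed, and sample size are illustrative assumptions.

```python
import random
import statistics

def simulate_chi_square(df, size, seed=0):
    """Draw chi-square values as sums of df squared standard normals."""
    rng = random.Random(seed)
    return [sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(df)) for _ in range(size)]

def sample_skewness(values):
    """Average cubed z-score; positive values indicate right skew."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

for df in (2, 10, 50):
    draws = simulate_chi_square(df, size=5000)
    # Columns: df, smallest draw, sample mean, sample skewness.
    print(df, round(min(draws), 3), round(statistics.fmean(draws), 2),
          round(sample_skewness(draws), 2))
```

Every draw is non-negative, and the sample skewness shrinks as df grows, which is the sense in which the curve moves toward a normal shape.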
M&M's are produced in claimed proportions; a sample is analyzed to see if the observed distribution aligns with the expected.
Null Hypothesis (H0): M&M color distribution is accurate according to the company's claims.
Alternative Hypothesis (Ha): Distribution differs from the claimed proportions.
Hypotheses expressed as:
H0: π1 = π1,0, π2 = π2,0, ..., πk = πk,0
Ha: At least one πi ≠ πi,0
Aim is to show at least one category has a different proportion.
Compare actual counts (Oi) with expected counts (Ei) under H0.
Compute expected counts using the formula: Ei = n × πi,0
The chi-square test statistic, χ² = Σ (Oi − Ei)² / Ei, approximately follows a chi-square distribution under specific conditions:
Simple random sample.
Sample size large enough that each expected frequency Ei ≥ 5.
One commonly used alternative condition on expected frequencies: every Ei is at least 1 and no more than 20% of them are below 5.
If H0 is true, differences between observed and expected counts should be minor (χ² ≈ 0).
Large differences produce a high χ² value, which is evidence against H0.
Degrees of freedom calculated as k - 1, where k = number of categories.
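The calculation of expected counts, the test statistic, and the degrees of freedom can be sketched as follows. The M&M proportions and counts are not given in the notes, so the sketch borrows the ice cream proportions from earlier in the notes and pairs them with hypothetical counts for 100 customers. The function name goodness_of_fit is an assumption.

```python
def goodness_of_fit(observed, null_probs):
    """Return expected counts, the chi-square statistic, df, and the Ei >= 5 check."""
    n = sum(observed)
    expected = [n * p for p in null_probs]           # Ei = n * pi_i,0
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - 1                           # k - 1
    large_enough = all(e >= 5 for e in expected)     # sample-size condition
    return expected, statistic, df, large_enough

# Null proportions from the notes: strawberry, chocolate, vanilla, butterscotch.
null_probs = [0.25, 0.40, 0.20, 0.15]
# Hypothetical observed counts for 100 customers (assumed, not from the notes).
observed = [30, 35, 22, 13]

expected, statistic, df, large_enough = goodness_of_fit(observed, null_probs)
print(expected, round(statistic, 4), df, large_enough)
```

With these made-up counts each category is close to its expected count, so the statistic stays small relative to its 3 degrees of freedom.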
R output used for hypothesis testing:
H0: M&M color distribution matches company claims.
Ha: Distribution differs.
Test Statistic: χ² = 1.2468, 5 df
P-value: p = 0.9403
p > 0.05 ➔ Fail to reject H0; insufficient evidence for claim discrepancy.
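The p-value is the area to the right of the test statistic under the chi-square curve, so it can be recomputed from the reported statistic and df alone. The sketch below is one way to do that with only Python's standard library, using a standard recurrence for the upper tail; the function name chi2_upper_tail is an assumption, and this is not how R computes the value internally.

```python
import math

def chi2_upper_tail(x, df):
    """Area to the right of x under a chi-square curve with integer df."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-half)                # closed form for df = 2
    else:
        a, q = 0.5, math.erfc(math.sqrt(half))     # closed form for df = 1
    # Each step adds 2 to df: Q(a + 1, x) = Q(a, x) + x^a e^(-x) / Gamma(a + 1).
    while a < df / 2.0:
        q += math.exp(a * math.log(half) - half - math.lgamma(a + 1.0))
        a += 1.0
    return q

# Statistic and df reported in the notes for the M&M example.
p_value = chi2_upper_tail(1.2468, 5)
print(round(p_value, 3))  # close to the p = 0.9403 reported in the notes
```

A statistic of 1.2468 sits far to the left of the bulk of a 5 df chi-square curve, which is why almost all of the area lies to its right.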
Introduces the concept of contingency tables to assess the relationship between two categorical variables.
Questions focus on whether variables are associated.
If two variables are independent, one variable provides no information on the other.
The survey investigates the relationship between happiness and family income.
Assess observed versus expected counts under the null hypothesis to find suggestions of dependence.
The test takes a form similar to the Goodness-of-Fit Test and relies on the same conditions:
Simple random sample.
Sufficiently large sample size.
Reject H0 if the test statistic is large enough.
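A minimal sketch of the calculation for a contingency table follows. It uses the standard expected count under independence, (row total × column total) / grand total, and df = (rows − 1)(columns − 1); neither formula is spelled out in the notes. The counts are hypothetical and are not GSS data, and the function name is an assumption.

```python
def independence_statistic(table):
    """Return expected counts, the chi-square statistic, and df for a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    # Expected count under independence: row total * column total / grand total.
    expected = [[r * c / total for c in col_totals] for r in row_totals]
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return expected, statistic, df

# Hypothetical counts (assumed, not GSS data).
# Rows: income below average, average, above average.
# Columns: not too happy, pretty happy, very happy.
table = [
    [30, 50, 20],
    [25, 60, 35],
    [10, 40, 30],
]

expected, statistic, df = independence_statistic(table)
print(round(statistic, 3), df)
```

When the rows of a table are exactly proportional to one another, observed and expected counts coincide and the statistic is 0; the further the table departs from that pattern, the larger the statistic.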
Assess results from GSS to determine if perceived happiness is associated with income and understand significance levels.
Emphasize that chi-square tests are only procedures; their results still require careful interpretation.