Chapter 13 Categorical Data Analysis

13.1 Categorical Data and the Multinomial Experiment

  • Recall Variable Types:

    • Quantitative (Numerical) Variables: Include measurable quantities such as the number of students, temperature, or height.

    • Qualitative (Categorical) Variables: Include non-numerical categories such as the color of hair, brand of car, or type of blood.

Binomial vs. Multinomial Experiments
  • Binomial Experiment: Involves categorical variables with only two possible outcomes (success or failure).

  • Multinomial Experiment: Involves categorical variables with more than two possible outcomes.

Properties of the Multinomial Experiment:
  1. The experiment consists of n identical trials.

  2. There are k possible outcomes for each trial, referred to as classes, categories, or cells.

  3. The probabilities of the possible outcomes remain the same for each trial, denoted by:
    $P_i$, where $i = 1, 2, \ldots, k$ and $P_1 + P_2 + \cdots + P_k = 1$

  4. The trials are independent.

  5. The random variables of interest are the counts in each cell, denoted $n_i$, the number of observations in category $i$.

Example 1: Educational Level of Employees
  • Context: Analyzing the educational levels of employees in a company, which includes four levels: High School, BS, MS, and PHD.

  • Assumed Probability Distribution:

    • $P_1 = 5\%$ (High School)

    • $P_2 = 70\%$ (BS)

    • $P_3 = 20\%$ (MS)

    • $P_4 = 5\%$ (PHD)

  • Random Sample Size: 1000 employees.

  • Results of Counts:

    • High School: 55

    • BS: 678

    • MS: 197

    • PHD: 70

  • Evaluation:

    1. Total trials n = 1000.

    2. Each trial has k = 4 outcomes.

    3. The probabilities $P_i$ are the same for every trial.

    4. The education level of one employee does not affect another.

    5. The counts in each educational category represent the random variables of interest.
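
As a quick check on this setup, the expected count in each cell under the assumed distribution is n times the cell probability. The minimal Python sketch below computes these counts; the variable names are illustrative and not part of the original notes.

```python
# Assumed probabilities and observed counts from Example 1.
levels = ["High School", "BS", "MS", "PHD"]
probs = [0.05, 0.70, 0.20, 0.05]
observed = [55, 678, 197, 70]

n = sum(observed)  # total number of trials (employees sampled)

# Sanity checks on the multinomial setup: n trials, probabilities sum to 1.
assert n == 1000
assert abs(sum(probs) - 1.0) < 1e-9

# Expected count in each cell: n * P_i.
expected = [n * p for p in probs]

for level, obs, exp in zip(levels, observed, expected):
    print(f"{level:12s} observed={obs:4d} expected={exp:6.1f}")
```

The expected counts come out to roughly 50, 700, 200, and 50, which can be compared cell by cell with the observed counts.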

13.2 Testing Categorical Probabilities: One Categorical Variable

  • One-Way Table Analysis:

    • Purpose: Testing a hypothesis regarding multinomial probabilities using the Chi-Square Test for Goodness of Fit.

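    • Hypotheses:

    • Null Hypothesis (H0): Every multinomial probability equals its hypothesized value, $P_1 = P_{1,0}, P_2 = P_{2,0}, \ldots, P_k = P_{k,0}$, where $P_{i,0}$ denotes the hypothesized value for cell $i$.

    • Alternative Hypothesis (Ha): At least one of the multinomial probabilities does not equal its hypothesized value.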

    • Test Statistic:

    • $\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$, where $O_i$ is the observed cell count and $E_i$ is the expected cell count under the null hypothesis ($n$ times the hypothesized probability of cell $i$). The statistic measures how far the observed counts are from the counts expected under the null hypothesis.

    • Rejection Region: Reject H0 if $\chi^2 > \chi^2_{\alpha}$, where $\chi^2_{\alpha}$ has $k - 1$ degrees of freedom.
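
As an illustration, the statistic can be evaluated for the educational-level data of Example 1 (observed counts 55, 678, 197, 70; hypothesized probabilities 5%, 70%, 20%, 5%; n = 1000). The significance level α = 0.05 is an assumption made here, since that example does not state one.

```latex
E_1 = 1000(0.05) = 50, \quad E_2 = 1000(0.70) = 700, \quad
E_3 = 1000(0.20) = 200, \quad E_4 = 1000(0.05) = 50

\chi^2 = \frac{(55-50)^2}{50} + \frac{(678-700)^2}{700}
       + \frac{(197-200)^2}{200} + \frac{(70-50)^2}{50}
       \approx 0.500 + 0.691 + 0.045 + 8.000 = 9.236
```

With k - 1 = 3 degrees of freedom, the tabled critical value at α = 0.05 is 7.815, so the statistic falls in the rejection region. Most of its value comes from the PHD cell.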

Distribution Properties
  1. The total area under the curve of a probability distribution is 1.

  2. The Chi-Square distribution is right skewed.

13.3 Conditions for Valid Testing

  • Conditions Required for a Valid Test:

    1. The sample must be a random sample from a multinomial experiment.

    2. The expected cell count for each cell must be large; as a rule of thumb, each expected count should be at least 5.

Example 1: Voting Preferences Survey
  • Context: Survey conducted to analyze voter preferences with the following outcomes:

    • Candidate 1: 61

    • Candidate 2: 53

    • Candidate 3: 36

  • Statistical Results:

    • Total Sample Size: n = 150.

    • Hypothesis Testing at α = 0.05: Does the sample data show a preference for any of the candidates?
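
One way to carry out this test is sketched below in plain Python. It assumes that "no preference" means each candidate has probability 1/3 under H0, and it hard-codes the tabled critical value 5.991 (α = 0.05, 2 degrees of freedom). The function name `chi_square_gof` is illustrative.

```python
def chi_square_gof(observed, probs):
    """Return (statistic, degrees of freedom, expected counts) for a
    chi-square goodness-of-fit test of `observed` against `probs`."""
    n = sum(observed)
    expected = [n * p for p in probs]
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return statistic, len(observed) - 1, expected


observed = [61, 53, 36]          # Candidates 1, 2, 3
probs = [1 / 3, 1 / 3, 1 / 3]    # assumption: H0 is "no preference"

statistic, df, expected = chi_square_gof(observed, probs)

# Validity condition: every expected count should be at least 5.
assert min(expected) >= 5

CRITICAL_VALUE = 5.991  # tabled chi-square value for alpha = 0.05, df = 2
reject_h0 = statistic > CRITICAL_VALUE

print(f"chi-square = {statistic:.2f}, df = {df}, reject H0: {reject_h0}")
```

Each expected count is 50 and the statistic is about 6.52, which exceeds 5.991. Under these assumptions, the data suggest that the three candidates are not equally preferred.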

Example 2: Violent Crimes Distribution
  • Data: The FBI published information regarding the distribution of crimes in 1995.

  • Frequency Analysis of a Sample Size of 500:

    • Murder: 9

    • Forcible Rape: 26

    • Robbery: 144

    • Aggravated Assault: 321

  • Hypothesis Test: Determine whether the distribution of violent crimes last year changed from the published 1995 distribution, using α = 0.01. The published 1995 proportions serve as the hypothesized probabilities.

Example 3: American Roulette Wheel Outcomes
  • Setup: An American roulette wheel with outcomes over 200 trials.

    • Red: 88

    • Black: 102

    • Green: 10

  • Hypothesis Test: Test at the 5% significance level whether the wheel is balanced, with the expected count for each color computed from a fair wheel (18 red, 18 black, and 2 green pockets out of 38).
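
A sketch of this test in plain Python follows. It assumes the standard American wheel layout of 38 pockets (18 red, 18 black, 2 green) as the definition of "balanced", and it hard-codes the tabled critical value 5.991 for α = 0.05 with 2 degrees of freedom.

```python
# Pocket counts on a standard American wheel (assumption used for H0).
pockets = {"Red": 18, "Black": 18, "Green": 2}
observed = {"Red": 88, "Black": 102, "Green": 10}

total_pockets = sum(pockets.values())   # 38 pockets
n = sum(observed.values())              # 200 spins

# Expected count per color under a balanced wheel.
expected = {color: n * count / total_pockets for color, count in pockets.items()}

# Chi-square statistic: sum over colors of (O - E)^2 / E.
statistic = sum(
    (observed[color] - expected[color]) ** 2 / expected[color]
    for color in pockets
)
df = len(pockets) - 1

CRITICAL_VALUE = 5.991  # tabled chi-square value for alpha = 0.05, df = 2
reject_h0 = statistic > CRITICAL_VALUE

for color in pockets:
    print(f"{color:5s} observed={observed[color]:3d} expected={expected[color]:.2f}")
print(f"chi-square = {statistic:.3f}, df = {df}, reject H0: {reject_h0}")
```

The statistic comes out near 1.06, well below 5.991, so under these assumptions there is no evidence that the wheel is unbalanced. The smallest expected count (green, about 10.5) also satisfies the at-least-5 condition.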

Two Categorical Variables Analysis

Example: Education Level and Gender
  • Data Collected: A sample of 1000 employees, cross-classified by gender and education level.

    • Male: High School (26), BS (373), MS (126), PHD (49)

    • Female: High School (29), BS (305), MS (71), PHD (21)

  • Hypothesis Testing Objective: Investigate if there is a dependence between education level and gender using a Chi-Square Test for Independence.

  • Conditions for Valid Test:

    1. The observed counts must be derived from a random sample.

    2. The expected count for each cell must be sufficiently large (generally at least 5).
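
The test of independence for the gender by education table can be sketched in plain Python as follows. Expected counts use the standard formula (row total × column total) / n, and the degrees of freedom are (rows - 1)(columns - 1). The significance level α = 0.05 and its tabled critical value 7.815 (3 degrees of freedom) are assumptions, since the example does not specify α.

```python
levels = ["High School", "BS", "MS", "PHD"]
table = {
    "Male":   [26, 373, 126, 49],
    "Female": [29, 305, 71, 21],
}

row_totals = {gender: sum(counts) for gender, counts in table.items()}
col_totals = [sum(col) for col in zip(*table.values())]
n = sum(row_totals.values())

# Expected count for each cell if gender and education were independent.
expected = {
    gender: [row_totals[gender] * col / n for col in col_totals]
    for gender in table
}

# Chi-square statistic: sum over all cells of (O - E)^2 / E.
statistic = sum(
    (o - e) ** 2 / e
    for gender in table
    for o, e in zip(table[gender], expected[gender])
)
df = (len(table) - 1) * (len(levels) - 1)

# Validity condition: the smallest expected count should be at least 5.
smallest_expected = min(min(row) for row in expected.values())

CRITICAL_VALUE = 7.815  # tabled chi-square value for alpha = 0.05 (assumed), df = 3
reject_h0 = statistic > CRITICAL_VALUE

print(f"chi-square = {statistic:.2f}, df = {df}, "
      f"smallest expected = {smallest_expected:.2f}, reject H0: {reject_h0}")
```

The statistic is roughly 11.9, which exceeds 7.815, so at the assumed α the data suggest that education level and gender are dependent. The smallest expected count is about 23.4, so the large-count condition holds.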