WK10: Analyzing Categorical Data

Categorical Data Analysis

Analyzing categorical data involves examining variables whose values fall into distinct groups or categories. These variables can be inherently categorical, like gender, or created by grouping quantitative data, such as categorizing age into classes (e.g., child, adult).

Contingency Tables

Contingency tables, also known as two-way tables, are used to display distributions of categorical variables and investigate relationships between two categorical variables.

Structure
  • Rows: Represent categories of one variable (e.g., living arrangement).

  • Columns: Represent categories of the other variable (e.g., age).

  • Cells: Contain the number of individuals belonging to a specific combination of categories.

Example

Investigating the relationship between living arrangement (parent's home, another person's home, own place, group quarters, other) and age (19, 20, 21, 22).

  • The cell at the intersection of "parent's home" and "19" contains the count of 19-year-olds living with their parents.
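A minimal sketch of how such a table can be tallied from raw observations. The records below are hypothetical and only illustrate the structure; the names `records` and `cell_counts` are assumptions, not part of the original notes.

```python
from collections import Counter

# Hypothetical raw records: one (living_arrangement, age) pair per individual.
records = [
    ("parent's home", 19),
    ("parent's home", 19),
    ("own place", 19),
    ("parent's home", 20),
    ("own place", 20),
    ("own place", 20),
    ("group quarters", 20),
]

# Each key is one cell of the table; its value is the count for that cell.
cell_counts = Counter(records)

rows = sorted({arrangement for arrangement, _ in records})
cols = sorted({age for _, age in records})

# Print with rows = living arrangement and columns = age.
print(f"{'':<16}" + "".join(f"{age:>5}" for age in cols))
for arrangement in rows:
    counts = "".join(f"{cell_counts[(arrangement, age)]:>5}" for age in cols)
    print(f"{arrangement:<16}{counts}")
```

A `Counter` returns 0 for combinations that never occur, so empty cells need no special handling.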

Marginal Distribution

Marginal distributions represent the distribution of a single categorical variable, disregarding the other variable in the contingency table. They are calculated by summing the counts for each category across all categories of the other variable.

  • For living arrangement, the marginal distribution shows the total count for each living arrangement category, regardless of age.

  • For age, the marginal distribution shows the total count for each age category, regardless of living arrangement.

Calculation
  1. Sum the counts for each category of interest.

  2. Divide by the total number of individuals in the sample to obtain proportions or percentages.

  • Proportion: $p_i = \frac{\text{count}_i}{\text{total}}$

  • Percentage: $\text{percentage}_i = p_i \times 100$

  3. Verify that proportions sum to 1 or percentages sum to 100%.

  • Example: 1357 individuals live in their parent's home out of 2984 total individuals. The proportion is $\frac{1357}{2984} \approx 0.455$, or 45.5%.
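The calculation steps can be sketched in code. The table below uses hypothetical counts chosen for easy arithmetic, not the survey data; `row_marginal` and `column_marginal` are illustrative helper names. Only the final check uses figures from the notes (1357 and 2984).

```python
# Hypothetical counts: rows are living arrangements, columns are ages.
table = {
    "parent's home": {19: 60, 20: 50, 21: 40, 22: 30},
    "own place":     {19: 20, 20: 30, 21: 45, 22: 60},
    "other":         {19: 20, 20: 20, 21: 15, 22: 10},
}

def row_marginal(table):
    """Proportion of all individuals in each row category, ignoring columns."""
    row_totals = {row: sum(cells.values()) for row, cells in table.items()}
    grand_total = sum(row_totals.values())
    return {row: total / grand_total for row, total in row_totals.items()}

def column_marginal(table):
    """Proportion of all individuals in each column category, ignoring rows."""
    col_totals = {}
    for cells in table.values():
        for col, count in cells.items():
            col_totals[col] = col_totals.get(col, 0) + count
    grand_total = sum(col_totals.values())
    return {col: total / grand_total for col, total in col_totals.items()}

living_marginal = row_marginal(table)
age_marginal = column_marginal(table)

# Verification step: the proportions of each marginal distribution sum to 1.
assert abs(sum(living_marginal.values()) - 1.0) < 1e-9
assert abs(sum(age_marginal.values()) - 1.0) < 1e-9

# The arithmetic from the example in the notes.
parents_share = 1357 / 2984
print(f"{parents_share:.3f}")  # 0.455
```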

Conditional Distributions

Conditional distributions show the distribution of one variable's values, considering only individuals within a specific category of another variable. They help us assess potential relationships between categorical variables.

Process
  1. For each category of the second variable, calculate the distribution of the first variable.

  2. Compare the conditional distributions across different categories.

Interpretation
  • Similar distributions: Suggest little to no relationship.

  • Different distributions: Suggest a potential relationship.

Example: Does Living Arrangement Depend on Age?
  1. Calculate the conditional distribution of living arrangements for each age group.

  2. For 19-year-olds, divide the count in each living arrangement category by the total number of 19-year-olds.

  3. Repeat for each age group, then compare the resulting conditional distributions across the age groups.

  • Example: Of the 540 19-year-olds, 324 live in their parent's home, giving a proportion of $\frac{324}{540} = 0.6$, or 60%.
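As a sketch of this procedure, the code below divides each count in an age column by that column's total and prints one conditional distribution per age. The counts are hypothetical (not the survey data), and `conditional_given_age` is an illustrative name.

```python
# Hypothetical counts: rows are living arrangements, columns are ages.
table = {
    "parent's home": {19: 60, 20: 50, 21: 40, 22: 30},
    "own place":     {19: 20, 20: 30, 21: 45, 22: 60},
    "other":         {19: 20, 20: 20, 21: 15, 22: 10},
}

def conditional_given_age(table, age):
    """Distribution of living arrangement among individuals of one age."""
    age_total = sum(cells[age] for cells in table.values())
    return {row: cells[age] / age_total for row, cells in table.items()}

# One conditional distribution per age group, printed side by side for comparison.
for age in (19, 20, 21, 22):
    dist = conditional_given_age(table, age)
    summary = ", ".join(f"{row}: {share:.0%}" for row, share in dist.items())
    print(f"age {age} -> {summary}")
```

In this made-up table the share living in the parent's home falls with age while the share in their own place rises, which is the kind of difference that suggests a relationship.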

Example: Does Age Depend on Living Arrangement?
  1. Calculate the conditional distribution of age for each living arrangement category.

  2. For individuals living with parents, divide the count in each age category by the total number of individuals living with parents.

  • Example: Out of 1357 individuals living with parents, 324 are 19 years old, giving a proportion of $\frac{324}{1357} \approx 0.239$, or 23.9%.
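Conditioning in this direction divides by a row total rather than a column total, so the same cell yields a different proportion. A small sketch with hypothetical counts (not the survey data); `conditional_given_arrangement` is an illustrative name.

```python
# Hypothetical counts: rows are living arrangements, columns are ages.
table = {
    "parent's home": {19: 60, 20: 50, 21: 40, 22: 30},
    "own place":     {19: 20, 20: 30, 21: 45, 22: 60},
    "other":         {19: 20, 20: 20, 21: 15, 22: 10},
}

def conditional_given_arrangement(table, arrangement):
    """Distribution of age among individuals in one living arrangement."""
    cells = table[arrangement]
    row_total = sum(cells.values())
    return {age: count / row_total for age, count in cells.items()}

age_given_parents = conditional_given_arrangement(table, "parent's home")

# Same cell (parent's home, 19), two different denominators.
cell = table["parent's home"][19]
share_of_parents_row = cell / sum(table["parent's home"].values())      # 60 / 180
share_of_age_19_col = cell / sum(cells[19] for cells in table.values())  # 60 / 100

print(f"19-year-olds among those living with parents: {share_of_parents_row:.1%}")
print(f"living with parents among 19-year-olds: {share_of_age_19_col:.1%}")
```

The two printed percentages answer different questions, so it matters which variable is treated as the condition.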

Visualizing Conditional Distributions

Stacked 100% Bar Chart
  • A useful tool for visualizing and comparing conditional distributions.

  • Each bar represents a category of one variable, with segments showing the proportions of categories of the other variable.

  • Differences in segment sizes across bars highlight potential relationships.
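To make the idea concrete without a plotting library, the sketch below renders each 100% bar as a fixed-width line of text. The conditional distributions and the one-letter symbols are hypothetical; in practice a charting tool would draw the same thing graphically.

```python
# Hypothetical conditional distributions of living arrangement, one per age.
conditionals = {
    19: {"parent's home": 0.6, "own place": 0.2, "other": 0.2},
    22: {"parent's home": 0.3, "own place": 0.6, "other": 0.1},
}

# One character per category so segments are distinguishable in plain text.
symbols = {"parent's home": "P", "own place": "O", "other": "x"}

def stacked_bar(proportions, symbols, width=40):
    """Render one 100% bar as text; segment length tracks each proportion."""
    bar = ""
    cumulative = 0.0
    for category, share in proportions.items():
        cumulative += share
        # Extend the bar to the rounded cumulative position so the final
        # length is exactly `width` when the shares sum to 1.
        target = round(cumulative * width)
        bar += symbols[category] * (target - len(bar))
    return bar

for age, proportions in conditionals.items():
    print(f"{age} |{stacked_bar(proportions, symbols)}|")
```

Every bar has the same length, so only the segment boundaries move; a shift in those boundaries from one bar to the next is the visual cue for a possible relationship.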

Lurking Variables and Simpson's Paradox

Lurking Variables
  • Variables that influence the relationship between two variables but are not considered in the initial analysis.

  • Ignoring lurking variables can lead to misleading conclusions.

Simpson's Paradox
  • A phenomenon where the relationship between two variables changes or reverses when a lurking variable is considered.

Example: Transport Mode and Survival Rate
  • Initial observation: Lower survival rate for helicopter transport compared to road transport.

  • Lurking variable: Seriousness of the accident.

  • When considering seriousness: Helicopter transport yields higher survival rates for both serious and less serious accidents.

Data
  • Helicopter: 32% death, 68% survival

  • Road: 23.6% death, 76.4% survival

  • Serious Accidents (Helicopter): 52% survival

  • Serious Accidents (Road): 40% survival

  • Less Serious Accidents (Helicopter): 84% survival

  • Less Serious Accidents (Road): 80% survival
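The reversal can be checked from the percentages alone. Assuming every accident is classed as either serious or less serious, each mode's overall survival rate is a weighted average of its two within-group rates, and the weight (the share of that mode's cases that are serious) can be solved for. The sketch below does this; the function name `implied_serious_share` is illustrative.

```python
# Survival rates (percent) as stated in the notes.
overall = {"helicopter": 68.0, "road": 76.4}
serious = {"helicopter": 52.0, "road": 40.0}
less_serious = {"helicopter": 84.0, "road": 80.0}

def implied_serious_share(mode):
    """Share of a mode's cases that are serious, implied by the stated rates.

    Solves overall = w * serious + (1 - w) * less_serious for w.
    """
    return (less_serious[mode] - overall[mode]) / (less_serious[mode] - serious[mode])

for mode in overall:
    w = implied_serious_share(mode)
    print(f"{mode}: {w:.0%} of cases are serious accidents")

# Within each severity group the helicopter does better, yet its overall
# rate is lower than the road rate.
assert serious["helicopter"] > serious["road"]
assert less_serious["helicopter"] > less_serious["road"]
assert overall["helicopter"] < overall["road"]
```

Under this assumption, about half of the helicopter cases are serious accidents compared with about 9% of road cases, which is why the helicopter's overall survival rate looks worse even though it is better within each group.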

Implications
  • Always consider potential lurking variables when analyzing relationships between variables.

  • Failing to account for lurking variables can lead to incorrect interpretations and decisions.

Statistical Inference

The tools above are exploratory. After this preliminary investigation, formal statistical inference should be performed to draw more robust conclusions about the relationship between two categorical variables.