WK11: Statistical inference: Two categorical variables: Two-way contingency table

Analyzing Categorical Data

Categorical Variables

  • Categorical variables are variables whose values fall into groups or categories.
  • Examples:
    • Gender (categorical by nature)
    • Age (grouped into classes, e.g., child 0-18, adult 19-100)

Two-Way Contingency Table

  • Used for preliminary investigation of the relationship between two categorical variables.
  • Displays counts for combinations of categories.
  • Example: Investigating the relationship between living arrangement and age.
    • Living arrangements (rows): Parent's home, another person's home, own place, group quarters, other.
    • Age (columns): 19, 20, 21, 22 years old.
  • Each cell contains the number of individuals belonging to a specific age and living arrangement combination.
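
The counting step behind a two-way table can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used to build the table in these notes: the `records` list is hypothetical toy data, and `two_way_table` is a helper name introduced here.

```python
from collections import Counter

# Hypothetical toy records of (living arrangement, age); NOT the data from the notes.
records = [
    ("Parent's home", 19),
    ("Parent's home", 19),
    ("Own place", 19),
    ("Parent's home", 20),
    ("Own place", 20),
    ("Own place", 21),
    ("Group quarters", 21),
    ("Own place", 22),
]


def two_way_table(pairs):
    """Count how many records fall into each (row category, column category) cell."""
    return Counter(pairs)


table = two_way_table(records)
rows = sorted({r for r, _ in records})   # living arrangements present in the data
cols = sorted({c for _, c in records})   # ages present in the data

# Print one line per living arrangement, with counts listed in age order.
for r in rows:
    print(r, [table[(r, c)] for c in cols])
```

A `Counter` returns 0 for combinations that never occur, so empty cells need no special handling.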

Marginal Distribution

  • Counts for each category of a single variable, regardless of the other variable.
  • Calculated by summing up counts across all categories of the other variable.
  • Example: Marginal distribution for living arrangement:
    • Count of individuals living in their parent's home, regardless of age.
  • A marginal distribution is usually reported as percentages or proportions rather than raw counts.
  • Proportion = Count of individuals in a category / Total number of individuals.
  • Example: 1,357 individuals live in their parents' home out of a total of 2,984.
    • Proportion = 1357 / 2984 ≈ 0.455 (45.5%)
  • Check that percentages sum up to 100% or proportions sum up to 1.
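
As a rough sketch of the calculation above, the function below sums each row of a two-way table and divides by the grand total. The `toy` table is hypothetical (the notes give only one row total, 1,357 of 2,984), and `marginal_distribution` is a name introduced for illustration.

```python
def marginal_distribution(table):
    """table maps row category -> {column category -> count}.

    Returns row category -> proportion of all individuals in that row.
    """
    row_totals = {row: sum(cells.values()) for row, cells in table.items()}
    grand_total = sum(row_totals.values())
    return {row: total / grand_total for row, total in row_totals.items()}


# Hypothetical counts, chosen only to keep the arithmetic easy to follow.
toy = {
    "Parent's home": {19: 60, 20: 40},
    "Own place": {19: 30, 20: 40},
    "Other": {19: 10, 20: 20},
}

marginals = marginal_distribution(toy)
print(marginals)

# Sanity check from the notes: the proportions should sum to 1.
print(abs(sum(marginals.values()) - 1.0) < 1e-9)

# The single figure quoted in the notes, rounded to three decimals.
print(round(1357 / 2984, 3))
```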

Conditional Distributions

  • Distribution of values or categories of one variable, considering only individuals belonging to a particular category in the other variable.
  • Used to investigate the relationship between two categorical variables.
  • If the conditional distributions are similar across the categories of the conditioning variable, there is likely no relationship.
  • If the conditional distributions differ noticeably, there might be a relationship.
  • To determine whether living arrangement depends on age, obtain conditional distributions of living arrangements for each age category and compare.
  • Proportion calculation: Count of individuals in a living arrangement category / Total number of individuals in that age category.
  • Example: For individuals who are 19, 324 live in their parents' home out of 540 total.
    • Proportion = 324 / 540 = 0.6 (60%)
  • Check that percentages in each conditional distribution sum up to 100% or proportions sum up to 1.
  • Visual comparison of conditional distributions can be done using plots.
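
The comparison described above can be sketched as follows: for each age category, divide every cell by that column's total, then compare the resulting distributions. The `toy` table is hypothetical and `conditional_distributions` is a helper name introduced here; only the 324 of 540 figure comes from the notes.

```python
def conditional_distributions(table):
    """table maps row category -> {column category -> count}.

    Returns column category -> {row category -> proportion within that column}.
    """
    cols = sorted({c for cells in table.values() for c in cells})
    result = {}
    for c in cols:
        # Total number of individuals in this column (e.g. this age category).
        col_total = sum(cells.get(c, 0) for cells in table.values())
        result[c] = {row: cells.get(c, 0) / col_total for row, cells in table.items()}
    return result


# Hypothetical counts, chosen only to keep the arithmetic easy to follow.
toy = {
    "Parent's home": {19: 60, 20: 40},
    "Own place": {19: 30, 20: 40},
    "Other": {19: 10, 20: 20},
}

cond = conditional_distributions(toy)
for age, dist in cond.items():
    # Each conditional distribution should sum to 1.
    print(age, dist, abs(sum(dist.values()) - 1.0) < 1e-9)

# The figure quoted in the notes for 19-year-olds living in their parents' home.
print(324 / 540)
```

In this toy table the two age columns give different distributions (0.6 versus 0.4 for the parent's home), which is the kind of difference that would suggest a relationship.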

Statistical Inference and Lurking Variables

  • When doing statistical inference, it is important to watch out for lurking variables.
  • A lurking variable helps explain changes in the response but is not among the variables studied, often because it was not known before the study.
  • Example: Simpson's Paradox
    • The relationship between two quantitative variables (X and Y) appears to be decreasing when all observations are pooled.
    • Once a categorical variable is taken into account, the relationship between X and Y within each category is actually increasing.
    • If the effect of the lurking variable is not taken into account, misleading conclusions may result.

Simpson's Paradox Example: Survival and Mode of Transport

  • Relationship between survival of a victim in an accident and mode of transport to the hospital.
  • Conditional distributions of survival conditioned on mode of transport:
    • Helicopter: 32% death, 68% survival.
    • Road: 23.6% death, 76.4% survival.
  • Helicopter transport appears to have a lower survival rate (68% versus 76.4% for road).
  • Lurking variable: Seriousness of the accident (less serious, more serious).
  • Data separated by seriousness of the accident:
    • Serious Accidents:
      • Helicopter: 52% survival.
      • Road: 40% survival.
    • Less Serious Accidents:
      • Helicopter: 84% survival.
      • Road: 80% survival.
  • Within both severity groups, helicopter transport yields a higher survival rate than road transport.
  • This example shows how a lurking variable can reverse the overall interpretation of the data.
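
The reversal can be checked numerically. The notes give only percentages, so the counts below are an assumption: they were chosen to reproduce every percentage quoted above and are not necessarily the original data.

```python
# ASSUMED counts (survived, died), chosen to match the percentages in the notes.
counts = {
    "serious": {"helicopter": (52, 48), "road": (40, 60)},
    "less serious": {"helicopter": (84, 16), "road": (800, 200)},
}


def survival_rate(survived, died):
    """Proportion of victims who survived."""
    return survived / (survived + died)


# Survival rate within each severity group (conditioning on the lurking variable).
within = {
    severity: {mode: survival_rate(*sd) for mode, sd in modes.items()}
    for severity, modes in counts.items()
}

# Survival rate after pooling both severity groups (ignoring the lurking variable).
overall = {}
for mode in ("helicopter", "road"):
    survived = sum(counts[sev][mode][0] for sev in counts)
    died = sum(counts[sev][mode][1] for sev in counts)
    overall[mode] = survival_rate(survived, died)

print(within)
print(overall)
```

Under these assumed counts, half of the helicopter cases are serious accidents, while most road cases are less serious. That imbalance is what pulls the pooled helicopter survival rate below the road rate even though the helicopter does better within each group.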