WK11: Statistical inference: Two categorical variables: Two-way contingency table
Analyzing Categorical Data
Categorical Variables
- Categorical variables are variables whose values fall into groups or categories.
- Examples:
- Gender (categorical by nature)
- Age (grouped into classes, e.g., child 0-18, adult 19-100)
Two-Way Contingency Table
- Used for preliminary investigation of the relationship between two categorical variables.
- Displays counts for combinations of categories.
- Example: Investigating the relationship between living arrangement and age.
- Living arrangements (rows): parents' home, another person's home, own place, group quarters, other.
- Age (columns): 19, 20, 21, 22 years old.
- Each cell contains the number of individuals belonging to a specific age and living arrangement combination.
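As a minimal sketch of how such a table is tallied, the snippet below counts (living arrangement, age) pairs from a handful of records. The records and the names `records` and `build_table` are invented for illustration; they are not the survey data described above.

```python
from collections import Counter

# Hypothetical raw records: one (living arrangement, age) pair per individual.
records = [
    ("parents' home", 19),
    ("parents' home", 19),
    ("own place", 19),
    ("parents' home", 20),
    ("own place", 20),
    ("own place", 21),
    ("group quarters", 21),
    ("own place", 22),
]

def build_table(pairs):
    """Count how many individuals fall in each (row, column) combination."""
    return Counter(pairs)

table = build_table(records)
rows = sorted({r for r, _ in records})
cols = sorted({c for _, c in records})

# Print the table: one row per living arrangement, one column per age.
print(f"{'':16}" + "".join(f"{c:>5}" for c in cols))
for r in rows:
    print(f"{r:16}" + "".join(f"{table[(r, c)]:>5}" for c in cols))
```

A `Counter` returns 0 for combinations that never occur, so empty cells need no special handling.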
Marginal Distribution
- Counts for each category of a single variable, regardless of the other variable.
- Calculated by summing up counts across all categories of the other variable.
- Example: Marginal distribution for living arrangement:
- Count of individuals living in their parents' home, regardless of age.
- A marginal distribution is usually reported as percentages or proportions rather than raw counts.
- Proportion = Count of individuals in a category / Total number of individuals.
- Example: 1,357 individuals live in their parents' home out of a total of 2,984.
- Proportion = 1357/2984 ≈ 0.455 (45.5%)
- Check that percentages sum up to 100% or proportions sum up to 1.
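The marginal calculation above can be sketched as follows. Only the figures 324, 540, 1,357 and 2,984 are quoted in these notes; the remaining cell counts are illustrative assumptions chosen to reproduce those totals, and `marginal_distribution` is a hypothetical helper name.

```python
# Illustrative two-way table: rows are living arrangements, and the four
# values in each row are the counts for ages 19, 20, 21 and 22.
# Assumption: apart from 324 (and the totals 540, 1357, 2984 it feeds into),
# the cell values are made up to be consistent with the notes.
COUNTS = {
    "parents' home":         [324, 378, 337, 318],
    "another person's home": [37, 47, 40, 38],
    "own place":             [116, 279, 372, 487],
    "group quarters":        [58, 60, 49, 25],
    "other":                 [5, 2, 3, 9],
}

def marginal_distribution(counts):
    """Row totals divided by the grand total."""
    row_totals = {row: sum(cells) for row, cells in counts.items()}
    grand_total = sum(row_totals.values())
    return {row: total / grand_total for row, total in row_totals.items()}

marginal = marginal_distribution(COUNTS)
for row, p in marginal.items():
    print(f"{row:22} {p:.3f}")

# Sanity check from the notes: the proportions must sum to 1.
assert abs(sum(marginal.values()) - 1.0) < 1e-9
```

The same function gives the marginal distribution of age if the table is transposed first.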
Conditional Distributions
- The distribution of one variable's categories among only those individuals who belong to a particular category of the other variable.
- Used to investigate the relationship between two categorical variables.
- If the conditional distributions are similar, the data suggest little or no relationship between the variables.
- If the conditional distributions differ noticeably, there might be a relationship between the variables.
- To determine whether living arrangement depends on age, obtain conditional distributions of living arrangements for each age category and compare.
- Proportion calculation: Count of individuals in a living arrangement category / Total number of individuals in that age category.
- Example: For individuals who are 19, 324 live in their parents' home out of 540 total.
- Proportion = 324/540 = 0.6 (60%)
- Check that percentages in each conditional distribution sum up to 100% or proportions sum up to 1.
- Visual comparison of conditional distributions can be done using plots.
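A minimal sketch of the conditional calculation, conditioning living arrangement on age. As an assumption, the table uses illustrative cell counts that agree with the figures quoted in these notes (324 of 540 at age 19; 1,357 of 2,984 overall); the other cells are invented, and `conditional_on_column` is a hypothetical helper name.

```python
AGES = [19, 20, 21, 22]

# Illustrative counts: one row per living arrangement, one value per age.
# Assumption: only 324, 540, 1357 and 2984 come from the notes.
COUNTS = {
    "parents' home":         [324, 378, 337, 318],
    "another person's home": [37, 47, 40, 38],
    "own place":             [116, 279, 372, 487],
    "group quarters":        [58, 60, 49, 25],
    "other":                 [5, 2, 3, 9],
}

def conditional_on_column(counts, col_index):
    """Distribution of the row variable within a single column."""
    column = {row: cells[col_index] for row, cells in counts.items()}
    col_total = sum(column.values())
    return {row: n / col_total for row, n in column.items()}

# One conditional distribution of living arrangement per age group.
conditionals = {
    age: conditional_on_column(COUNTS, i) for i, age in enumerate(AGES)
}

for age, dist in conditionals.items():
    # Each conditional distribution must sum to 1.
    assert abs(sum(dist.values()) - 1.0) < 1e-9
    print(age, {row: round(p, 3) for row, p in dist.items()})
```

With these illustrative counts, the share living in the parents' home falls as age rises, which is the kind of difference between conditional distributions that hints at a relationship.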
Statistical Inference and Lurking Variables
- When doing statistical inference, it is important to watch out for lurking variables.
- A lurking variable helps explain changes in the response but is not among the variables considered in the analysis, often because it was not recognized before the study.
- Example: Simpson's Paradox
- Across all the data pooled together, the relationship between two quantitative variables (X and Y) appears to be decreasing.
- Once a categorical variable is taken into account, the relationship between X and Y within each category is actually increasing.
- If the effect of the lurking variable is not taken into account, misleading conclusions may result.
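The quantitative version of the paradox can be illustrated with a small synthetic data set. The numbers below are invented purely for illustration, and `slope` is a hypothetical helper that computes an ordinary least-squares slope.

```python
def slope(xs, ys):
    """Least-squares slope of y on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Synthetic data: two groups defined by a categorical (lurking) variable.
group_a = {"x": [1, 2, 3], "y": [8, 9, 10]}
group_b = {"x": [6, 7, 8], "y": [2, 3, 4]}

slope_a = slope(group_a["x"], group_a["y"])
slope_b = slope(group_b["x"], group_b["y"])

# Pooling ignores the group labels entirely.
pooled_slope = slope(group_a["x"] + group_b["x"], group_a["y"] + group_b["y"])

print(f"group A slope: {slope_a:+.2f}")
print(f"group B slope: {slope_b:+.2f}")
print(f"pooled slope:  {pooled_slope:+.2f}")
```

Within each group Y increases with X, yet the pooled slope is negative because group B has larger X values and smaller Y values overall.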
Simpson's Paradox Example: Survival and Mode of Transport
- Relationship between survival of a victim in an accident and mode of transport to the hospital.
- Conditional distributions of survival conditioned on mode of transport:
- Helicopter: 32% death, 68% survival.
- Road: 23.6% death, 76.4% survival.
- Taken at face value, helicopter transport appears to have a lower survival rate (68% versus 76.4%).
- Lurking variable: Seriousness of the accident (less serious, more serious).
- Data separated by seriousness of the accident:
- Serious Accidents:
- Helicopter: 52% survival.
- Road: 40% survival.
- Less Serious Accidents:
- Helicopter: 84% survival.
- Road: 80% survival.
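- In both cases, helicopter transport yields a higher survival rate.
- The pooled comparison is misleading because a much larger share of helicopter cases are serious accidents, where survival is lower whatever the mode of transport.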
- This example shows how a lurking variable can reverse the overall interpretation of the data.
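The reversal can be checked numerically. The notes report only percentages, so the counts below are hypothetical values chosen to reproduce them; `pooled_rate` is likewise an illustrative helper name.

```python
# Hypothetical (survived, total) counts chosen to match the percentages in
# the notes. Assumption: the actual counts are not given in the notes.
COUNTS = {
    "serious":      {"helicopter": (52, 100), "road": (40, 100)},
    "less serious": {"helicopter": (84, 100), "road": (800, 1000)},
}

def pooled_rate(mode):
    """Survival rate for one mode of transport, ignoring seriousness."""
    survived = sum(COUNTS[s][mode][0] for s in COUNTS)
    total = sum(COUNTS[s][mode][1] for s in COUNTS)
    return survived / total

# Survival rates within each level of the lurking variable.
for seriousness, by_mode in COUNTS.items():
    for mode, (survived, total) in by_mode.items():
        print(f"{seriousness:13} {mode:10} {survived / total:.1%}")

# Survival rates after pooling over seriousness.
for mode in ("helicopter", "road"):
    print(f"{'all accidents':13} {mode:10} {pooled_rate(mode):.1%}")
```

Under these assumed counts, half of the helicopter cases are serious accidents compared with roughly 9% of the road cases, which is why pooling drags the helicopter survival rate below the road rate.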