Categorical Variable Relationships: Independence, Association, and Segmented Bar Charts
Relationship Between Variables: Independence and Association
- Associations Between Variables: Variables can be associated in many different ways and to varying degrees. Understanding these associations is critical for interpreting data sets such as the historical records of the Titanic.
- Defining Independence:
- In the context of a contingency table, variables are considered independent when the distribution of one variable remains identical across all categories of another variable.
- If independence is established, there is no association between the variables in question.
- The Logical Equivalence of Terms:
- To say that variables are "independent" is equivalent to saying they are "not associated."
- To say that variables are "not independent" is equivalent to saying they are "associated."
- Visual and Formal Checks: Formal mathematical methods for checking independence exist, but a primary method is to compare the conditional distributions, either visually or in a conditional table.
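The definition of independence above can be sketched in a few lines of Python. This is a minimal illustration, not a formal statistical test: the function names and the toy counts are hypothetical, and the check demands exactly identical distributions, which real data will almost never show. In practice the question is whether the distributions are close enough to be considered the same.

```python
from fractions import Fraction


def conditional_distribution(counts):
    """Return each category's share of the group total as an exact fraction."""
    total = sum(counts.values())
    return {category: Fraction(n, total) for category, n in counts.items()}


def is_independent(table):
    """True when every group has exactly the same conditional distribution."""
    distributions = [conditional_distribution(row) for row in table.values()]
    return all(d == distributions[0] for d in distributions[1:])


# Hypothetical toy counts (not Titanic data): Group B is exactly twice
# Group A in every category, so the two distributions are identical.
toy_independent = {
    "Group A": {"Yes": 10, "No": 30},
    "Group B": {"Yes": 20, "No": 60},
}

# Hypothetical toy counts where the shares differ between the groups.
toy_associated = {
    "Group A": {"Yes": 10, "No": 30},
    "Group B": {"Yes": 30, "No": 10},
}

print(is_independent(toy_independent))
print(is_independent(toy_associated))
```

Exact fractions are used so that proportional counts such as 10/40 and 20/80 compare as equal without floating-point rounding getting in the way.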
Analyzing Categorical Relationships with Contingency Tables
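- Conditional Distributions: A conditional distribution shows the distribution of one variable restricted to the cases that share a given value of another variable. Comparing these distributions gives a closer look at how the two variables are related.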
- The Titanic Dataset Example (Table 3.7): This conditional table displays the distribution of ticket Class (First, Second, Third, and Crew) conditioned on Survival status (Alive or Dead).
- Conditional Table Data for Survival:
- Condition: Alive (Survivors)
- First Class: 203 individuals, representing 28.6%
- Second Class: 118 individuals, representing 16.6%
- Third Class: 178 individuals, representing 25.0%
- Crew: 212 individuals, representing 29.8%
- Total Alive: 711 individuals, representing 100%
- Condition: Dead (Non-survivors)
- First Class: 122 individuals, representing 8.2%
- Second Class: 167 individuals, representing 11.2%
- Third Class: 528 individuals, representing 35.4%
- Crew: 673 individuals, representing 45.2%
- Total Dead: 1490 individuals, representing 100%
- Observing Associations: The margins of the contingency table are vital because they show the actual numbers of people involved. Comparing the percentages across the "Alive" and "Dead" rows makes it clear that the distribution of classes is not the same for both groups. For example, First Class accounts for 28.6% of survivors but only 8.2% of non-survivors. This difference implies an association between ticket Class and Survival.
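The percentages in the conditional table can be reproduced from the counts. The sketch below assumes the counts listed above for Table 3.7. The variable and function names are illustrative only.

```python
# Counts of ticket Class within each Survival status (Table 3.7).
titanic = {
    "Alive": {"First": 203, "Second": 118, "Third": 178, "Crew": 212},
    "Dead": {"First": 122, "Second": 167, "Third": 528, "Crew": 673},
}


def row_percentages(counts):
    """Convert one row of counts to percentages of that row's total."""
    total = sum(counts.values())
    return {cls: round(100 * n / total, 1) for cls, n in counts.items()}


percent_by_survival = {
    status: row_percentages(counts) for status, counts in titanic.items()
}

for status, percentages in percent_by_survival.items():
    total = sum(titanic[status].values())  # the row margin (711 or 1490)
    cells = ", ".join(f"{cls} {pct}%" for cls, pct in percentages.items())
    print(f"{status} (n={total}): {cells}")
```

Each row is divided by its own total, which is what "conditioning on Survival" means here. Dividing by the grand total instead would give a different table: the joint distribution.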
Visualizing Association: Segmented Bar Charts
- Definition of a Segmented Bar Chart (Figure 3.9): This display treats each individual bar as a "whole" (representing 100%). It divides the bar proportionally into segments corresponding to the percentage of each category within that group.
- Key Features of Figure 3.9:
- Y-Axis: Percent, ranging from 0% to 100%. Each bar always totals 100%.
- X-Axis Categories: Survival status, specifically "Alive" and "Dead."
- Color/Segment Coding: Identifies the ticket Class (First, Second, Third, and Crew).
- Comparison of Bar Heights: Even though the raw totals for survivors (711) and non-survivors (1490) are very different, the bars in the chart are the same height because the counts have been converted into percentages.
- Interpreting the Chart:
- If the segments within the bars look different across the categories (e.g., the "First Class" segment is much larger in the "Alive" bar than in the "Dead" bar), then the variables are associated.
- In the Titanic case, the visible differences in distributions across the categories of survival indicate that survival was not independent of ticket class.
- Comparison with Other Visuals: The segmented bar chart is often used as an alternative to side-by-side pie charts (referenced as Figure 3.6 in the text) to display the same type of data.
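The idea of a segmented bar chart can be sketched without a plotting library. This rough text rendering is not a reproduction of Figure 3.9, and the letter codes and bar width are arbitrary choices for the illustration. It assumes the Table 3.7 counts. The bars are drawn horizontally, and each one has the same length because it represents 100% of its group.

```python
# Counts of ticket Class within each Survival status (Table 3.7).
titanic = {
    "Alive": {"First": 203, "Second": 118, "Third": 178, "Crew": 212},
    "Dead": {"First": 122, "Second": 167, "Third": 528, "Crew": 673},
}

# Hypothetical one-letter codes standing in for the chart's colors.
SYMBOLS = {"First": "F", "Second": "S", "Third": "T", "Crew": "C"}


def segmented_bar(counts, width=50):
    """Build a fixed-width bar whose segments are proportional to the counts."""
    total = sum(counts.values())
    bar = ""
    cumulative = 0
    for category, n in counts.items():
        cumulative += n
        # Position each segment's right edge by its cumulative share, so the
        # finished bar is always exactly `width` characters long.
        right_edge = round(width * cumulative / total)
        bar += SYMBOLS[category] * (right_edge - len(bar))
    return bar


for status, counts in titanic.items():
    print(f"{status:>5} |{segmented_bar(counts)}|")
```

The run of F characters is visibly longer in the "Alive" bar than in the "Dead" bar, which is the kind of difference in segment size that the chart is designed to reveal.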
Writing Comprehensive Statistical Conclusions
- Requirement for Proper Conclusion: When answering a question about association or independence between two categorical variables, two steps are required:
- You must create a segmented bar chart.
- You must interpret the chart by describing the differences or similarities in the distributions.
- Identifying Association: If the segments of the bars differ noticeably between the categories being compared, you must conclude that there is an association (the variables are not independent).
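When describing the differences in writing, it can help to state how far apart the two distributions are in percentage points. The sketch below is a descriptive aid only, assuming the Table 3.7 counts. It does not replace the segmented bar chart or a formal test, and the names are illustrative.

```python
# Counts of ticket Class within each Survival status (Table 3.7).
titanic = {
    "Alive": {"First": 203, "Second": 118, "Third": 178, "Crew": 212},
    "Dead": {"First": 122, "Second": 167, "Third": 528, "Crew": 673},
}


def shares(counts):
    """Percentage of the group total falling in each category."""
    total = sum(counts.values())
    return {cls: 100 * n / total for cls, n in counts.items()}


alive = shares(titanic["Alive"])
dead = shares(titanic["Dead"])

# Positive values mean the class is over-represented among survivors.
differences = {cls: alive[cls] - dead[cls] for cls in alive}

for cls, diff in differences.items():
    print(
        f"{cls}: {alive[cls]:.1f}% of survivors vs "
        f"{dead[cls]:.1f}% of non-survivors ({diff:+.1f} points)"
    )

# The class whose share differs most between the two groups.
largest_gap = max(differences, key=lambda cls: abs(differences[cls]))
print(f"Largest gap: {largest_gap}")
```

A written conclusion can then cite the specific segments that differ, rather than only asserting that the bars "look different."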