2.6
Data Analysis for Two-Way Tables
- Overview of Two-Way Tables
- Statistical analysis of two-way tables is considered the most widely used method in the subfield of statistics referred to as categorical data analysis.
- The discussion begins with a motivating example to illustrate the concept.
Simpson’s Paradox
Definition
- Simpson’s Paradox refers to the phenomenon in which a lurking variable strongly influences, and can even reverse, the observed relationship between two categorical variables.
Application of Simpson's Paradox
- A significant example is provided to demonstrate this concept.
Example: Airline Data
Context
- Air travelers generally prefer to arrive on time.
- Airlines collect data on on-time arrivals across their flights.
- The example uses one year of data on flights arriving in Prince George from two airlines: Air Canada and WestJet Air.
Data Reporting
- The table provided details a year’s worth of on-time and delayed arrivals for two airlines:
- Air Canada:
  - On time: 718
  - Delayed: 74
  - Total flights: 792
- WestJet Air:
  - On time: 5534
  - Delayed: 532
  - Total flights: 6066
Percentage of Delays
- WestJet Air’s delay percentage: $532/6066 \approx 0.088$, or 8.8%.
- Air Canada’s delay percentage: $74/792 \approx 0.093$, or 9.3%.
Initial Impressions
- Looking only at these percentages, WestJet Air appears to be the better choice for minimizing delays.
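The aggregate percentages above can be reproduced directly from the counts. The following is a minimal sketch; the dictionary layout and the helper name `delay_rate` are illustrative choices, not part of the original notes.

```python
# Aggregated counts from the two-way table.
flights = {
    "Air Canada": {"on_time": 718, "delayed": 74},
    "WestJet Air": {"on_time": 5534, "delayed": 532},
}


def delay_rate(counts):
    """Proportion of flights that were delayed (hypothetical helper)."""
    total = counts["on_time"] + counts["delayed"]
    return counts["delayed"] / total


overall = {airline: delay_rate(counts) for airline, counts in flights.items()}

for airline, rate in overall.items():
    print(f"{airline}: {rate:.1%} delayed")
```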
Detailed Departure Studies
Analysis by Departure City
- The subsequent data table further breaks down delays by city of departure:
Vancouver Departures:
- Air Canada:
  - On time: 497
  - Delayed: 62
  - Total: 559
  - Delay percentage: $62/559 \approx 0.111$, or 11.1%.
- WestJet Air:
  - On time: 694
  - Delayed: 117
  - Total: 811
  - Delay percentage: $117/811 \approx 0.144$, or 14.4%.
Toronto Departures:
- Air Canada:
  - On time: 221
  - Delayed: 12
  - Total: 233
  - Delay percentage: $12/233 \approx 0.052$, or 5.2%.
- WestJet Air:
  - On time: 4840
  - Delayed: 415
  - Total: 5255
  - Delay percentage: $415/5255 \approx 0.079$, or 7.9%.
Conclusions from Departure Data
- For both Vancouver and Toronto, Air Canada has a lower percentage of delays than WestJet Air. Within each departure city, Air Canada is therefore the better choice for on-time arrivals, which is the opposite of what the aggregated percentages suggested.
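The city-level comparison can be checked the same way. This sketch assumes a nested dictionary of (on time, delayed) counts taken from the tables above; the variable and function names are illustrative.

```python
# City -> airline -> (on time, delayed), copied from the departure tables.
by_city = {
    "Vancouver": {"Air Canada": (497, 62), "WestJet Air": (694, 117)},
    "Toronto": {"Air Canada": (221, 12), "WestJet Air": (4840, 415)},
}


def delay_rate(on_time, delayed):
    """Proportion of flights that were delayed (hypothetical helper)."""
    return delayed / (on_time + delayed)


rates = {
    city: {airline: delay_rate(*counts) for airline, counts in airlines.items()}
    for city, airlines in by_city.items()
}

for city, city_rates in rates.items():
    for airline, rate in city_rates.items():
        print(f"{city:9s} {airline:11s} {rate:.1%} delayed")

# True when Air Canada has the lower delay rate in every departure city.
air_canada_better_everywhere = all(
    city_rates["Air Canada"] < city_rates["WestJet Air"]
    for city_rates in rates.values()
)
print("Air Canada lower in every city:", air_canada_better_everywhere)
```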
Three-Way Table and Lurking Variables
- Understanding Data Aggregation
- The previous table is an example of a three-way table: it classifies each flight by three variables (airline, arrival status, and city of departure).
- The initial two-way airline table was obtained by aggregating over the city variable, which masked the effect of the lurking variable (city of departure).
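To make the aggregation step concrete, the sketch below collapses the three-way counts over city and recovers the two-way table. The data structure is an assumption made for illustration; only the counts come from the notes.

```python
# Three-way table: city -> airline -> (on time, delayed).
three_way = {
    "Vancouver": {"Air Canada": (497, 62), "WestJet Air": (694, 117)},
    "Toronto": {"Air Canada": (221, 12), "WestJet Air": (4840, 415)},
}

# Collapse over city by summing the counts for each airline.
two_way = {}
for city, airlines in three_way.items():
    for airline, (on_time, delayed) in airlines.items():
        prev_on_time, prev_delayed = two_way.get(airline, (0, 0))
        two_way[airline] = (prev_on_time + on_time, prev_delayed + delayed)

for airline, (on_time, delayed) in two_way.items():
    total = on_time + delayed
    print(f"{airline}: on time={on_time}, delayed={delayed}, total={total}")
```

Once the counts are summed, the city variable no longer appears anywhere in `two_way`, which is why its influence cannot be seen in the aggregated percentages.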
Deeper Dive into Simpson’s Paradox
Definition Reconfirmed
- An association or comparison that holds for all of several groups can reverse direction when the data are combined to form a single group; this reversal is called Simpson’s Paradox.
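One way to see why the reversal occurs in the airline data, using only the counts given earlier: each airline's overall delay rate is a weighted average of its city-level rates, with weights equal to the share of its flights departing from each city (Vancouver first, then Toronto). The decimals are rounded.

```latex
\begin{align*}
\text{Air Canada:}\quad \frac{74}{792}
  &= \frac{559}{792}\cdot\frac{62}{559} + \frac{233}{792}\cdot\frac{12}{233}
   \approx 0.706\,(0.1109) + 0.294\,(0.0515) \approx 0.093,\\[4pt]
\text{WestJet Air:}\quad \frac{532}{6066}
  &= \frac{811}{6066}\cdot\frac{117}{811} + \frac{5255}{6066}\cdot\frac{415}{5255}
   \approx 0.134\,(0.1443) + 0.866\,(0.0790) \approx 0.088.
\end{align*}
```

About 87% of WestJet Air's flights depart from Toronto, where delays are less common for both airlines, while about 71% of Air Canada's flights depart from Vancouver, where delays are more common. The different weights, not better performance in either city, give WestJet Air the lower aggregate rate.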
Simpson’s Paradox for Quantitative Variables
- An example involving two quantitative variables is introduced next, showing how the fitted relationship can differ depending on whether the data are analyzed within groups or pooled.
Example: Numeric Data and Regression Analysis
Illustrative Graphs
- Scatter plots (shown across several slides) illustrate how the fitted relationship changes depending on which data points are included:
Estimated Regression Equations
- An example regression equation, fitted to all the data: $\hat{y} = 14.618 - 0.445x$
- Correlation coefficient: $r = -0.564$
Visual Representation
- Different scatter plots show various data groupings with their respective trends, including:
- An upper left grouping with equation $\hat{y} = 8.364 + 0.455x$ and correlation $r = 0.455$.
- A lower right grouping with equation $\hat{y} = 0.182 + 0.455x$ and correlation $r = 0.455$.
Recap of Key Statistics
- Summary of regression data:
- All data:
  - Intercept: 14.618
  - Slope: -0.445
  - Correlation: -0.564
- Upper left data set:
  - Intercept: 8.364
  - Slope: 0.455
  - Correlation: 0.455
- Lower right data set:
  - Intercept: 0.182
  - Slope: 0.455
  - Correlation: 0.455
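The same sign flip can be reproduced with a small least-squares sketch. The data points below are hypothetical, chosen only to mimic the upper-left and lower-right layout; they are not the data behind the slides, so the fitted values will not match the recap above. The helper `fit_line` is likewise illustrative.

```python
# Hypothetical data (NOT the lecture's data set): two groups with a positive
# within-group trend, positioned so that the pooled trend is negative.
upper_left = [(1, 9), (2, 10), (3, 9), (4, 11), (5, 10)]
lower_right = [(11, 4), (12, 5), (13, 4), (14, 6), (15, 5)]


def fit_line(points):
    """Return (intercept, slope, correlation) of the least-squares line."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    correlation = sxy / (sxx * syy) ** 0.5
    return intercept, slope, correlation


fits = {
    "upper left": fit_line(upper_left),
    "lower right": fit_line(lower_right),
    "all data": fit_line(upper_left + lower_right),
}

for name, (intercept, slope, correlation) in fits.items():
    print(f"{name}: intercept={intercept:.3f}, slope={slope:.3f}, r={correlation:.3f}")
```

In this sketch each group has a positive slope and correlation, while the pooled fit has a negative slope and correlation, which is the pattern summarized in the recap.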
Final Remarks on Data Relationships
- The association observed within each group can reverse when the data are aggregated into a single set. Relationships should therefore be examined both within groups and in the pooled data, because a lurking variable can confound the aggregated result.