2.6

Data Analysis for Two-Way Tables

  • Overview of Two-Way Tables
    • Statistical analysis of two-way tables is considered the most widely used method in the subfield of statistics referred to as categorical data analysis.
    • The discussion begins with a motivating example to illustrate the concept.

Simpson’s Paradox

  • Definition

    • Simpson’s Paradox refers to the phenomenon in which a lurking variable strongly influences, and can even reverse, the apparent relationship between two categorical variables.
  • Application of Simpson's Paradox

    • The airline example that follows demonstrates this effect.

Example: Airline Data

  • Context

    • Air travelers generally prefer to arrive on time.
    • Airlines collect data on on-time arrivals across their flights.
    • The example uses one year of data on flights arriving in Prince George on two airlines: Air Canada and WestJet Air.
  • Data Reporting

    • The table provided details a year’s worth of on-time and delayed arrivals for two airlines:
    • Air Canada:
      • On time: 718
      • Delayed: 74
      • Total flights: 792
    • WestJet Air:
      • On time: 5534
      • Delayed: 532
      • Total flights: 6066
  • Percentage of Delays

    • WestJet Air's delay percentage:
    • $\frac{532}{6066} = 0.088$, or 8.8%.
    • Air Canada’s delay percentage:
    • $\frac{74}{792} = 0.093$, or 9.3%.
  • Initial Impressions

    • Looking only at these percentages suggests that WestJet Air (8.8% delayed) is the better choice than Air Canada (9.3% delayed) for travelers who want to minimize delays.
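
The overall delay percentages above can be reproduced directly from the counts in the two-way table. The following is a minimal sketch; the dictionary layout and the name `delay_rate` are illustrative choices, not part of the original notes.

```python
# Counts from the two-way table (one year of arrivals in Prince George).
flights = {
    "Air Canada": {"on_time": 718, "delayed": 74},
    "WestJet Air": {"on_time": 5534, "delayed": 532},
}


def delay_rate(counts):
    """Fraction of an airline's flights that arrived late."""
    total = counts["on_time"] + counts["delayed"]
    return counts["delayed"] / total


overall = {airline: delay_rate(counts) for airline, counts in flights.items()}

for airline, rate in overall.items():
    print(f"{airline}: {rate:.1%} delayed")
```

On these aggregated counts, WestJet Air has the lower delay rate.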

Detailed Departure Studies

  • Analysis by Departure City

    • The subsequent data table further breaks down delays by city of departure:
  • Vancouver Departures:

    • Air Canada:
      • On time: 497
      • Delayed: 62
      • Total: 559
      • Delay percentage: $\frac{62}{559} = 0.111$, or 11.1%.
    • WestJet Air:
      • On time: 694
      • Delayed: 117
      • Total: 811
      • Delay percentage: $\frac{117}{811} = 0.144$, or 14.4%.
  • Toronto Departures:

    • Air Canada:
      • On time: 221
      • Delayed: 12
      • Total: 233
      • Delay percentage: $\frac{12}{233} = 0.052$, or 5.2%.
    • WestJet Air:
      • On time: 4840
      • Delayed: 415
      • Total: 5255
      • Delay percentage: $\frac{415}{5255} = 0.079$, or 7.9%.
  • Conclusions from Departure Data

    • For both Vancouver and Toronto, Air Canada shows a lower percentage of delays compared to WestJet Air, suggesting that in both instances, Air Canada is the better choice for on-time arrivals.
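
The city-by-city comparison can be checked the same way. This sketch stores the counts from the departure-city breakdown in a dictionary keyed by (airline, city); that layout and the variable names are assumptions made for illustration.

```python
# (airline, departure city) -> (on time, delayed), from the breakdown by city.
by_city = {
    ("Air Canada", "Vancouver"): (497, 62),
    ("WestJet Air", "Vancouver"): (694, 117),
    ("Air Canada", "Toronto"): (221, 12),
    ("WestJet Air", "Toronto"): (4840, 415),
}

# Delay rate within each (airline, city) cell.
city_rates = {
    key: delayed / (on_time + delayed)
    for key, (on_time, delayed) in by_city.items()
}

for city in ("Vancouver", "Toronto"):
    ac = city_rates[("Air Canada", city)]
    wj = city_rates[("WestJet Air", city)]
    better = "Air Canada" if ac < wj else "WestJet Air"
    print(f"{city}: Air Canada {ac:.1%}, WestJet Air {wj:.1%} -> {better}")
```

Within each departure city, Air Canada has the lower delay rate.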

Three-Way Table and Lurking Variables

  • Understanding Data Aggregation
    • The previous table is an example of a three-way table: it classifies flights by airline, arrival status, and city of departure.
    • The initial two-way airline table was obtained by summing over the city variable, which masked the effect of the lurking variable (city of departure).
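
One way to see why aggregation flips the comparison is to write each airline's overall delay rate as a weighted average of its city rates, with weights equal to the share of its flights from each city. The decomposition below uses only counts from the tables above; the weighted-average framing is an added explanation rather than something stated in the original notes.

```latex
\begin{align*}
\text{Air Canada: } \frac{74}{792}
  &= \frac{559}{792}\cdot\frac{62}{559} + \frac{233}{792}\cdot\frac{12}{233}
   \approx 0.093 \\
\text{WestJet Air: } \frac{532}{6066}
  &= \frac{811}{6066}\cdot\frac{117}{811} + \frac{5255}{6066}\cdot\frac{415}{5255}
   \approx 0.088
\end{align*}
```

About 71% of Air Canada's flights ($559/792$) depart from Vancouver, where delay rates are higher for both airlines, compared with about 13% of WestJet Air's flights ($811/6066$). Air Canada's overall rate is therefore pulled toward its Vancouver rate, while WestJet Air's is dominated by its Toronto rate.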

Deeper Dive into Simpson’s Paradox

  • Definition Reconfirmed

    • An association or comparison that holds for all of several groups can reverse direction when the data are combined to form a single group; this reversal is called Simpson’s Paradox.
  • Simpson’s Paradox for Quantitative Variables

    • An example involving two quantitative variables follows, demonstrating how the relationship in the combined data can differ from the relationship within each subgroup.
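
For count data such as the airline tables, the reversal described in the definition above can be checked mechanically. The function below is a hedged sketch: the name `reverses_when_combined` and the table layout are hypothetical, and ties in delay rate are simply treated as "not lower".

```python
def delay_rate(on_time, delayed):
    """Fraction of flights that were delayed."""
    return delayed / (on_time + delayed)


def reverses_when_combined(table, a, b):
    """Return True if airline `a` compares the same way with `b` in every
    group, but the comparison flips once the groups are pooled."""
    groups = sorted({group for _, group in table})
    within = [
        delay_rate(*table[(a, g)]) < delay_rate(*table[(b, g)]) for g in groups
    ]
    if len(set(within)) != 1:
        return False  # the groups disagree, so there is nothing to reverse

    def pooled(airline):
        # Sum the counts over all groups, then compute a single rate.
        on_time = sum(table[(airline, g)][0] for g in groups)
        delayed = sum(table[(airline, g)][1] for g in groups)
        return delay_rate(on_time, delayed)

    return (pooled(a) < pooled(b)) != within[0]


# (airline, departure city) -> (on time, delayed), from the airline example.
by_city = {
    ("Air Canada", "Vancouver"): (497, 62),
    ("WestJet Air", "Vancouver"): (694, 117),
    ("Air Canada", "Toronto"): (221, 12),
    ("WestJet Air", "Toronto"): (4840, 415),
}

print(reverses_when_combined(by_city, "Air Canada", "WestJet Air"))  # True
```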

Example: Numeric Data and Regression Analysis

  • Illustrative Graphs

    • Scatter plots (shown across several slides) illustrate how the relationship between two quantitative variables can change depending on whether the data are viewed as a whole or by group.
  • Estimated Regression Equations

    • Estimated regression equation for all data combined:
    • $Y = 14.618 - 0.445X$
    • Correlation coefficient: $r = -0.564$
  • Visual Representation

    • Different scatter plots show various data groupings with their respective trends, including:
    • An upper left grouping with equation $Y = 8.364 + 0.455X$ and correlation $r = 0.455$.
    • A lower right grouping with equation $Y = 0.182 + 0.455X$ and correlation $r = 0.455$.
  • Recap of Key Statistics

    • Summary of regression data:
    • All data set:
      • Intercept: 14.618
      • Slope: -0.445
      • Correlation: -0.564
    • Upper Left data set:
      • Intercept: 8.364
      • Slope: 0.455
      • Correlation: 0.455
    • Lower Right data set:
      • Intercept: 0.182
      • Slope: 0.455
      • Correlation: 0.455
  • Final Remarks on Data Relationships

    • The association observed within each group can reverse when the data are aggregated into a single set: both groups have a positive slope (0.455), while the combined data have a negative slope (-0.445). Relationships should therefore be examined within groups as well as in aggregate, because a lurking variable can confound the combined result.
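
The same reversal can be reproduced with a small least-squares fit. The data points behind the slides are not given in these notes, so the points below are HYPOTHETICAL values invented for illustration; they will not reproduce the coefficients reported above, only the sign pattern. The helper `fit_line` is likewise an assumed name, implementing the standard simple-regression formulas.

```python
from math import sqrt


def fit_line(points):
    """Least-squares intercept, slope, and correlation r for (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)
    return intercept, slope, r


# HYPOTHETICAL data (not the slide data): two clusters, each trending upward.
upper_left = [(1, 9), (2, 9), (3, 10), (4, 10), (5, 11)]
lower_right = [(8, 4), (9, 4), (10, 5), (11, 5), (12, 6)]
all_points = upper_left + lower_right

for name, pts in [
    ("Upper left", upper_left),
    ("Lower right", lower_right),
    ("All data", all_points),
]:
    b0, b1, r = fit_line(pts)
    print(f"{name}: intercept = {b0:.3f}, slope = {b1:.3f}, r = {r:.3f}")
```

Each cluster has a positive slope and correlation, while the combined fit has a negative slope and correlation, because the lower right cluster sits at larger X and smaller Y than the upper left one.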