Chapter 3 Notes on Contingency Tables and Related Concepts

3.1 Contingency Tables

  • Definition: A contingency table shows how the cases are distributed across the categories of two categorical variables, so the relationship between the variables can be examined.
  • Example: There are 897 females who have both a cat and a dog.
  • The bottom row gives the totals for each gender; these totals form the marginal distribution of gender. The right column gives the totals for each pet category, which form the marginal distribution of pets.

3.1 Table of Percents

  • Percents in each column: 54.8% of pet-owning men have dogs but not cats.

  • Each column of percents sums to 100%.

  • Women don’t favor either pet; men seem to favor dogs over cats.

  • What is wrong with just comparing these percentages? Look at Row Percents.

3.1 Table of Percents (continued)

  • 60.9% of dual pet owners are women.
  • 54.2% of the participants are women.

3.1 Table of Percents (another set)

  • 6.3% of OkCupid pet owners are women who have both a dog and a cat.
  • 25.1% of OkCupid pet owners are men with a dog only.
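
The three kinds of percents in this section (column, row, and whole-table percents) differ only in the denominator. The sketch below illustrates this with made-up counts; they are NOT the OkCupid data, which these notes do not reproduce, and the names counts and pct are hypothetical.

```python
# Hypothetical counts (NOT the OkCupid data): rows are pet categories,
# columns are genders.
counts = {
    "cat only": {"women": 30, "men": 20},
    "dog only": {"women": 50, "men": 60},
    "both":     {"women": 20, "men": 20},
}

genders = ["women", "men"]
row_totals = {pet: sum(row.values()) for pet, row in counts.items()}
col_totals = {g: sum(row[g] for row in counts.values()) for g in genders}
grand_total = sum(row_totals.values())


def pct(part, whole):
    """Percent of `whole` made up by `part`, rounded to one decimal."""
    return round(100 * part / whole, 1)


# Column percent: of all men, what percent have a dog only?
column_pct = pct(counts["dog only"]["men"], col_totals["men"])
# Row percent: of all dual pet owners, what percent are women?
row_pct = pct(counts["both"]["women"], row_totals["both"])
# Table percent: of everyone, what percent are women with both pets?
table_pct = pct(counts["both"]["women"], grand_total)

print(column_pct, row_pct, table_pct)
```

The same cell count (for example, women with both pets) gives a different percent depending on whether it is divided by its row total, its column total, or the grand total.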

3.2 Conditional Distributions

  • Pie charts are a good way to display percentages.

  • Pie charts make it difficult to be precise about comparisons.

  • Conditional distributions are better shown with side-by-side bar charts.

  • Example: 65.8% of people who voiced an opinion for the commercials were women.

3.2 Independence and Association

  • Independence: The conditional distribution of one variable is the same for every category of the other.
  • If the variables are not independent, there is an association between them.

3.3 Association Example

  • Slide shows an association example (no specific numbers given here in this heading).

3.3 Association Example (continued)

  • 31% of women versus 26.8% of men didn’t plan to watch.
  • There appears to be an association between gender and what people are planning to watch.
  • Women: 38.8% game; 30.2% commercials.
  • Men: 56.7% game; 16.5% commercials.
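
As a quick consistency check on the percents above: each gender's conditional distribution must total 100%, so the share who didn't plan to watch is whatever remains after the game and the commercials. A minimal sketch, assuming these three responses are the only categories (the variable names are illustrative):

```python
# Percents reported in the notes for each gender's viewing plans.
plans = {
    "women": {"game": 38.8, "commercials": 30.2},
    "men":   {"game": 56.7, "commercials": 16.5},
}

# Each conditional distribution totals 100%, so the remainder is the
# share who did not plan to watch (assuming only three categories).
wont_watch = {
    gender: round(100 - sum(shares.values()), 1)
    for gender, shares in plans.items()
}

print(wont_watch)
```

The remainders agree with the 31% and 26.8% quoted in the first bullet.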

3.4 Examining Contingency Tables Example

  • Medical researchers followed 6272 Swedish men for 30 years to see whether there was any association between the amount of fish in their diet and prostate cancer.
  • Results summarized in a contingency table (the lecture slide references the table).
  • Goal: determine whether there is an association between fish consumption and prostate cancer.

3.4 Example (continued) Mechanics

  • Two of the diet categories are quite small: only 2.0% of the men are in the “Never/Seldom” category and 8.8% are in the “Large Part” category.
  • Overall, 7.4% of the men in this study had prostate cancer.

3.4 Example (continued) Mechanics (Part II)

  • Further mechanics about interpreting small category sizes and overall percentages.

3.4 Example (continued) Mechanics (Part III)

  • Additional mechanics about chart choices and how to read the data.

3.4 Example (continued) Conclusion

  • Overall, 7.4% rate of prostate cancer.
  • 89.3% ate fish as either a moderate or a small part of their diet.
  • Differences are hard to read from pie charts; bar charts can show potential differences more clearly.
  • Only 124 of the 6272 men fell into the “Never/Seldom” category. Only 14 of them developed prostate cancer.
  • More study is needed before recommending dietary changes.
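
The caution about the small “Never/Seldom” group can be made concrete with the counts quoted above. This is only a sketch of the arithmetic; the variable names are illustrative, and the 7.4% overall rate is taken as reported rather than recomputed.

```python
total_men = 6272
never_seldom = 124        # men in the "Never/Seldom" category
never_seldom_cases = 14   # of those, men who developed prostate cancer
overall_rate = 7.4        # percent, as reported in the notes

# Share of the whole sample in the "Never/Seldom" category.
share_of_sample = round(100 * never_seldom / total_men, 1)

# Prostate cancer rate within the "Never/Seldom" category.
group_rate = round(100 * never_seldom_cases / never_seldom, 1)

# How far one additional (or one fewer) case moves the group's rate,
# in percentage points.
points_per_case = round(100 / never_seldom, 1)

print(share_of_sample, group_rate, overall_rate, points_per_case)
```

The group's rate (about 11.3%) is higher than the overall 7.4%, but with only 124 men each single case shifts that rate by roughly 0.8 percentage points, which is one reason the notes call for more study.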

3.4 Observational Studies

  • Observational studies: Look at a sample of data to learn about a larger population.
  • Questions to consider:
    • What population is represented by the data? All Swedish men? All men?
    • Are there other factors associated with prostate cancer?
    • Are there habits of fish-eating men that distinguish them from non-fish-eating men?
    • Is it those habits that affect the rate of prostate cancer?
  • Observational studies often lead to contradictory results.
  • A later study suggested that fatty acids may increase the risk of prostate cancer.

3.5 RANDOM MATTERS: Nightmares and Sleep Side

  • Nightmares study: Do you sleep on your side, and do you remember dreams?
  • Data: 63 participants; 41 right-side sleepers; 22 left-side sleepers.
  • Among right-side sleepers, 6 reported frequent nightmares.
  • Among left-side sleepers, 9 reported nightmares.
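
The counts above can be arranged as a two-way table and turned into conditional percents. A minimal sketch using only the numbers in these notes (the dictionary layout and names are illustrative):

```python
# Counts from the notes: sleepers on each side and how many reported nightmares.
right_total, right_nightmares = 41, 6
left_total, left_nightmares = 22, 9

# Two-way table: rows are sleep sides, columns are nightmare status.
table = {
    "right": {"nightmares": right_nightmares,
              "no nightmares": right_total - right_nightmares},
    "left":  {"nightmares": left_nightmares,
              "no nightmares": left_total - left_nightmares},
}

# Conditional percent reporting nightmares within each sleep side.
nightmare_pct = {
    side: round(100 * row["nightmares"] / sum(row.values()), 1)
    for side, row in table.items()
}

print(table)
print(nightmare_pct)
```

About 14.6% of right-side sleepers versus 40.9% of left-side sleepers reported nightmares.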

3.5 RANDOM MATTERS: What the Question Really Asks

  • The question: “Can we see a difference?” Does sleeping side affect dreams, or is the difference due to randomness?
  • Data can be organized into a contingency table.

3.5 RANDOM MATTERS: Random Distribution Illustration

  • Suppose nightmares are unrelated to sleep side; we could distribute the 15 nightmares randomly among sleepers and obtain a new table. An example of a table with randomly distributed nightmares might look like this (illustrative).
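
One way to carry out the random redistribution described above is by simulation. The sketch below assumes that "unrelated" means each set of 15 sleepers is equally likely to be the ones with nightmares; the seed, the number of trials, and the function name are arbitrary choices, not part of the study.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# 63 participants labeled by sleep side, as in the notes.
sleepers = ["right"] * 41 + ["left"] * 22
n_nightmares = 15   # total nightmares reported (6 + 9)
observed_left = 9   # nightmares observed among left-side sleepers


def simulate_left_count():
    """Give the 15 nightmares to 15 sleepers chosen at random and
    return how many of them are left-side sleepers."""
    chosen = rng.sample(sleepers, n_nightmares)
    return chosen.count("left")


n_trials = 10_000
left_counts = [simulate_left_count() for _ in range(n_trials)]

# Proportion of random tables with at least as many left-side
# nightmares as were actually observed.
prop_as_extreme = sum(c >= observed_left for c in left_counts) / n_trials

print(prop_as_extreme)
```

If random tables rarely put 9 or more of the nightmares among the 22 left-side sleepers, the observed difference is hard to attribute to chance alone.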

3.5 RANDOM MATTERS: Critical Reflection

  • This may seem convincing. Does it apply to you?
  • The subjects were people who reported: (i) they fell asleep and woke up on the same side consistently, and (ii) usually remembered their dreams.
  • Are they representative enough to draw conclusions about the effect of sleep positions?

3.3 Displaying Contingency Tables

  • Contingency table for the Titanic.
  • The table contains the information we want, but the numbers are not necessarily easy to compare.

3.3 Displaying Contingency Tables: Bar Chart

  • Bar Chart interpretation for Titanic: First Class, Second Class, Third Class, Crew.
  • Example values:
    • First Class: 38% perished
    • Second Class: 58.2% perished
    • Third Class: 74.6% perished
    • Crew: 76.2% perished
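
The percents above are conditional distributions of survival within each group. If survival were independent of class, they would be roughly equal. A small sketch using only the values listed here (the names are illustrative) fills in the survivors and measures the gap:

```python
# Percent perished in each group, as listed in the notes.
perished = {"First": 38.0, "Second": 58.2, "Third": 74.6, "Crew": 76.2}

# Conditional distribution of survival within each group:
# the two percents in each group total 100%.
conditional = {
    group: {"perished": p, "survived": round(100 - p, 1)}
    for group, p in perished.items()
}

# Gap between the highest and lowest percent perished,
# in percentage points.
spread = round(max(perished.values()) - min(perished.values()), 1)

print(conditional)
print(spread)
```

A gap of about 38 percentage points between First Class and Crew is far from "roughly equal", which points to an association between class and survival.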

3.3 Displaying Contingency Tables: Segmented Bar Chart

  • Segmented Bar Chart: Distributions of ticket Class are clearly different.
  • Survival was not independent of ticket Class.
  • Bars are the same height because the numbers were converted to percents.

3.3 Displaying Contingency Tables: Mosaic Plot

  • A mosaic plot can visualize the relationships among categorical variables.

3.4 Three Categorical Variables: OkCupid Contingency Table

  • Contingency Table for OkCupid:
    • 58.5% of non-drug using male pet owners have only dogs.
    • 46.1% of drug-using male pet owners have certain combinations (as shown in the table).

3.4 Association Among Three Variables Example

  • Question: How did survival on the Titanic depend on sex and ticket class? This involves three variables: Sex, Class, and Survived.
  • A mosaic plot can help visualize these relationships.

3.4 Association Among Three Variables Example (Answer)

  • Observations:
    • More men than women.
    • Most of crew were men.
    • Greater fraction of women survived.
    • Class distributions for men were similar for survivors and those who died.
    • Class distributions for women: those who died were overwhelmingly third class.

3.4 Caution: What Can Go Wrong?

  • Don’t confuse similar-sounding percentages.
    • Percent of passengers who were both in first class and survived: 201/2208 = 9.1%.
    • Percent of first-class passengers who survived: 201/324 = 62.0%.
    • Percent of the survivors who were in first class: 201/712 = 28.2%.
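
All three percentages share the numerator 201; only the denominator changes. A short sketch of the arithmetic, using the counts quoted above (variable names are illustrative):

```python
first_class_survivors = 201   # in first class AND survived
everyone = 2208               # all people in the table
first_class = 324             # all first-class passengers
survivors = 712               # all survivors

# Joint percent: of everyone, first class and survived.
pct_of_everyone = round(100 * first_class_survivors / everyone, 1)

# Conditional on class: of first-class passengers, percent who survived.
pct_of_first_class = round(100 * first_class_survivors / first_class, 1)

# Conditional on survival: of survivors, percent who were first class.
pct_of_survivors = round(100 * first_class_survivors / survivors, 1)

print(pct_of_everyone, pct_of_first_class, pct_of_survivors)
```

Asking "percent of what?" identifies the denominator and so tells the three statements apart.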

3.4 Caution: Additional Warnings

  • Don’t forget to look at the variables separately, too.
  • Use enough individuals.
  • Don’t overstate your case.
  • Watch out for lurking variables.

3.5 Simpson’s Paradox

  • Simpson’s Paradox is named after the statistician who described it. It arises when a relationship that holds within each group weakens, disappears, or reverses once the groups are aggregated.
  • Example: apparent gender discrimination in admissions at UC Berkeley was explained by a lurking variable, the school to which each applicant applied.
  • The apparent paradox is not truly paradoxical; it’s a failure to realize there is an important third variable that was not considered.
  • The paradox is easy to explain if you know what to look for.
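
The reversal is easiest to see with numbers. The counts below are invented for illustration and are NOT the Berkeley admissions data; the school names and the rate helper are likewise hypothetical. Women have the higher admission rate within each school, yet the lower rate overall, because most women applied to the school that admits few applicants.

```python
# Hypothetical counts (NOT the Berkeley data):
# (admitted, applicants) by school and gender.
admissions = {
    "School A": {"men": (80, 100), "women": (18, 20)},
    "School B": {"men": (4, 20),   "women": (30, 100)},
}


def rate(admitted, applicants):
    """Admission rate as a percent, rounded to one decimal."""
    return round(100 * admitted / applicants, 1)


# Admission rates within each school.
within = {
    school: {gender: rate(*counts) for gender, counts in by_gender.items()}
    for school, by_gender in admissions.items()
}

# Admission rates after aggregating over schools.
overall = {}
for gender in ("men", "women"):
    admitted = sum(admissions[s][gender][0] for s in admissions)
    applicants = sum(admissions[s][gender][1] for s in admissions)
    overall[gender] = rate(admitted, applicants)

print(within)
print(overall)
```

Here the third variable (school) is what resolves the apparent contradiction: the overall rates mix two schools with very different admission rates and very different applicant pools.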

3.6 What Can Go Wrong? (Recap of pitfalls)

  • Don’t confuse similar-sounding percentages.
  • Don’t mix up the interpretation of percentages within different groups.
  • Look at both conditional and marginal distributions.
  • Consider lurking variables and sample size.
  • Distinguish correlation from causation.

4 Chapter Review

  • Make and interpret a contingency table: Put counts and/or percents in a two-way table (contingency table).
  • Make and interpret bar charts and pie charts of marginal distributions.
  • Look at the marginal distribution of each variable.
  • Also look at the conditional distribution of a variable within each category of the other variable.
  • Comparing conditional distributions of one variable across categories of another tells us about the association between variables.
  • If the conditional distributions of one variable are (roughly) the same for every category of the other, the variables are independent.
  • Consider a third variable whenever it is appropriate.

4 Summary: Key Takeaways

  • Contingency tables summarize relationships between two or more categorical variables.
  • Independence means no association between the variables; association exists if distributions differ across categories.
  • Conditional distributions help reveal associations that marginal totals can obscure.
  • Visualizations (side-by-side bars, segmented bars, mosaic plots) can aid interpretation; beware of misleading overall percentages (Simpson’s paradox).
  • Always consider potential lurking variables and sample size when drawing conclusions.