Chapter 3 Notes on Contingency Tables and Related Concepts
3.1 Contingency Tables
- Definition: A contingency table displays the counts for every combination of categories of two categorical variables, showing how the variables are related.
- Example: There are 897 females who have both a cat and a dog.
- The bottom row gives the totals for each gender, and the right column gives the totals for each pet category. These totals sit in the margins of the table; each set of totals is the marginal distribution of its variable.
3.1 Table of Percents: Column Percents
- Percents are computed within each column: 54.8% of pet-owning men have dogs but not cats.
- The percents in each column sum to 100%.
- Women don’t favor either pet; men seem to favor dogs over cats.
- What is wrong with just comparing these percentages? Look at the row percents.
3.1 Table of Percents: Row Percents
- 60.9% of dual pet owners are women.
- 54.2% of the participants are women.
3.1 Table of Percents: Table Percents
- 6.3% of OkCupid pet owners are women who have both a dog and a cat.
- 25.1% of OkCupid pet owners are men with a dog only.
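Column, row, and table percents differ only in the denominator. The following is a minimal Python sketch of that idea. The counts are hypothetical and are not the OkCupid data; the names `counts` and `percent` are illustrative choices.

```python
# Hypothetical counts (NOT the OkCupid data).
# Rows are pet categories, columns are gender.
counts = {
    "cats only": {"women": 30, "men": 20},
    "dogs only": {"women": 40, "men": 60},
    "both":      {"women": 30, "men": 20},
}

genders = ["women", "men"]
row_totals = {pet: sum(row.values()) for pet, row in counts.items()}
col_totals = {g: sum(row[g] for row in counts.values()) for g in genders}
grand_total = sum(row_totals.values())


def percent(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Column percent: of all men, what share have dogs only?
col_pct = percent(counts["dogs only"]["men"], col_totals["men"])
# Row percent: of all owners of both pets, what share are women?
row_pct = percent(counts["both"]["women"], row_totals["both"])
# Table percent: of everyone, what share are women with both pets?
table_pct = percent(counts["both"]["women"], grand_total)

print(f"column percent: {col_pct}%")
print(f"row percent:    {row_pct}%")
print(f"table percent:  {table_pct}%")
```

The same cell count (30 women with both pets) yields a row percent and a table percent that answer different questions, which is why the denominator should always be stated.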
3.2 Conditional Distributions
- Pie charts are a good way to display the percentages within a single distribution.
- Pie charts make it difficult to be precise about comparisons between groups.
- Conditional distributions are better shown with side-by-side bar charts.
- Example: 65.8% of people who voiced an opinion for the commercials were women.
3.2 Independence and Association
- Independence: The distribution of one variable is the same for all categories of another.
- For dependent variables, there is an association between the two variables.
3.3 Association Example
- 31% of women versus 26.8% of men didn’t plan to watch.
- There appears to be an association between gender and what people are planning to watch.
- Women: 38.8% game; 30.2% commercials.
- Men: 56.7% game; 16.5% commercials.
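If gender and viewing plans were independent, the two conditional distributions would be roughly the same. A small Python sketch, using only the percentages quoted above, lines them up and computes the gap in each category. How large a gap counts as "roughly the same" is a judgment call; this is not a formal test.

```python
# Conditional distributions of viewing plans by gender, in percent,
# taken from the figures quoted in the notes.
women = {"game": 38.8, "commercials": 30.2, "won't watch": 31.0}
men = {"game": 56.7, "commercials": 16.5, "won't watch": 26.8}

# Each conditional distribution should account for about 100%.
women_sum = round(sum(women.values()), 1)
men_sum = round(sum(men.values()), 1)

# Gap in percentage points (women minus men) for each category.
gaps = {cat: round(women[cat] - men[cat], 1) for cat in women}
largest_gap = max(abs(gap) for gap in gaps.values())

for cat, gap in gaps.items():
    print(f"{cat:12s} {gap:+.1f} points")
print(f"largest gap: {largest_gap:.1f} points")
```

The largest gap is in the "game" category, which matches the reading that men were more likely than women to watch for the game itself.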
3.4 Examining Contingency Tables Example
- Medical researchers followed 6272 Swedish men for 30 years to see whether there was any association between the amount of fish in their diet and prostate cancer.
- Results are summarized in a contingency table (shown on the lecture slide).
- Goal: determine whether there is an association between fish consumption and prostate cancer.
3.4 Example (continued) Mechanics
- Two categories of the diet are quite small: only 2.0% of the men are in the “Never/Seldom” category and 8.8% are in the “Large Part” category.
- Overall, 7.4% of the men in this study had prostate cancer.
3.4 Example (continued) Conclusion
- Overall, the rate of prostate cancer was 7.4%.
- 89.3% ate fish as either a moderate or small part of their diet.
- Differences are hard to read from pie charts; bar charts can show potential differences more clearly.
- Only 124 of the 6272 men fell into the “Never/Seldom” category. Only 14 of them developed prostate cancer.
- More study needed before recommending dietary changes.
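The caution about the “Never/Seldom” group can be checked with a little arithmetic. This Python sketch uses only the counts and the overall rate quoted above; the variable names are illustrative.

```python
total_men = 6272          # men followed in the study
never_seldom = 124        # men in the "Never/Seldom" category
never_seldom_cancer = 14  # of those, men who developed prostate cancer
overall_rate = 7.4        # overall prostate cancer rate in percent, as quoted

# Share of the whole sample that falls in the small category.
share_of_sample = round(100 * never_seldom / total_men, 1)

# Prostate cancer rate within the small category.
group_rate = round(100 * never_seldom_cancer / never_seldom, 1)

# How far one additional (or one fewer) case moves the group's rate.
points_per_case = round(100 / never_seldom, 1)

print(f"share of sample:      {share_of_sample}%")
print(f"rate in the group:    {group_rate}% (overall: {overall_rate}%)")
print(f"effect of one case:   {points_per_case} percentage points")
```

The group's rate is higher than the overall rate, but it rests on only 14 cases, so a handful of cases either way would change the picture noticeably.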
3.4 Observational Studies
- Observational studies: Researchers observe subjects without assigning treatments, using a sample of data to learn about a larger population.
- Questions to consider:
  - What population is represented by the data? All Swedish men? All men?
  - Are there other factors associated with prostate cancer?
  - Are there habits of fish-eating men that distinguish them from non-fish-eating men?
  - Is it those habits that affect the rate of prostate cancer?
- Observational studies often lead to contradictory results.
- A later study suggested that fatty acids may increase the risk of prostate cancer.
3.5 RANDOM MATTERS: Nightmares and Sleep Side
- Nightmares study: Do you sleep on your side, and do you remember dreams?
- Data: 63 participants; 41 right-side sleepers; 22 left-side sleepers.
- Among right-side sleepers, 6 reported frequent nightmares.
- Among left-side sleepers, 9 reported nightmares.
3.5 RANDOM MATTERS: What the Question Really Asks
- The question: “Can we see a difference?” Does sleeping side affect dreams, or is the difference due to randomness?
- Data can be organized into a contingency table.
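One way to lay out the counts quoted above as a contingency table with margins is sketched below in Python. The "no nightmares" column is an assumption: it is obtained by subtraction, on the premise that every participant either reported nightmares or did not.

```python
# Counts quoted in the notes.
sleepers = {"right": 41, "left": 22}
with_nightmares = {"right": 6, "left": 9}

# Assumption: every participant either reported nightmares or did not,
# so the second column is obtained by subtraction.
table = {
    side: {
        "nightmares": with_nightmares[side],
        "no nightmares": sleepers[side] - with_nightmares[side],
    }
    for side in sleepers
}

columns = ["nightmares", "no nightmares"]
row_totals = {side: sum(row.values()) for side, row in table.items()}
col_totals = {col: sum(table[side][col] for side in table) for col in columns}
grand_total = sum(row_totals.values())

# Print the table with its margins.
print(f"{'side':8s}{'nightmares':>12s}{'no nightmares':>15s}{'total':>8s}")
for side, row in table.items():
    print(f"{side:8s}{row['nightmares']:>12d}{row['no nightmares']:>15d}{row_totals[side]:>8d}")
print(f"{'total':8s}{col_totals['nightmares']:>12d}{col_totals['no nightmares']:>15d}{grand_total:>8d}")
```

The margins recover the totals stated in the notes: 63 participants overall and 15 who reported nightmares.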
3.5 RANDOM MATTERS: Random Distribution Illustration
- Suppose nightmares are unrelated to sleep side; we could distribute the 15 nightmares randomly among sleepers and obtain a new table. An example of a table with randomly distributed nightmares might look like this (illustrative).
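The random redistribution described above can be repeated many times by computer. The Python sketch below is one possible way to do it, not necessarily the procedure used in the lecture. The function name `simulate_left_nightmares`, the number of repetitions, and the seed are all illustrative assumptions.

```python
import random


def simulate_left_nightmares(n_left=22, n_right=41, n_nightmares=15, rng=None):
    """Hand out the nightmares at random; return how many go to left-side sleepers."""
    if rng is None:
        rng = random.Random()
    sleepers = ["left"] * n_left + ["right"] * n_right
    # Choose which sleepers get a nightmare, ignoring sleep side entirely.
    chosen = rng.sample(sleepers, n_nightmares)
    return chosen.count("left")


rng = random.Random(0)  # fixed seed so the run is repeatable
results = [simulate_left_nightmares(rng=rng) for _ in range(10_000)]

observed_left = 9  # left-side sleepers who reported nightmares in the data
share_as_extreme = sum(r >= observed_left for r in results) / len(results)
average_left = sum(results) / len(results)

print(f"average left-side nightmares under random assignment: {average_left:.2f}")
print(f"share of shuffles with at least {observed_left}: {share_as_extreme:.3f}")
```

If shuffles with 9 or more left-side nightmares are rare, the observed table is hard to attribute to randomness alone. The simulation says nothing about whether the participants are representative.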
3.5 RANDOM MATTERS: Critical Reflection
- This may seem convincing. Does it apply to you?
- The subjects were people who reported: (i) they fell asleep and woke up on the same side consistently, and (ii) usually remembered their dreams.
- Are they representative enough to draw conclusions about the effect of sleep positions?
3.3 Displaying Contingency Tables
- Contingency table for the Titanic.
- The table contains the information we want, but the counts are not necessarily easy to compare.
3.3 Displaying Contingency Tables: Bar Chart
- Bar chart for the Titanic: percent who perished in each group (First Class, Second Class, Third Class, Crew).
- Example values:
  - First Class: 38% perished
  - Second Class: 58.2% perished
  - Third Class: 74.6% perished
  - Crew: 76.2% perished
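The comparison a bar chart makes easy can be imitated with a text-only sketch in Python, using the percentages listed above. No plotting library is assumed; the `bar` helper and the width of 50 characters are arbitrary illustrative choices.

```python
# Percent who perished in each group, as listed in the notes.
perished = {
    "First Class": 38.0,
    "Second Class": 58.2,
    "Third Class": 74.6,
    "Crew": 76.2,
}


def bar(pct, width=50):
    """Return a text bar whose length is proportional to pct out of 100."""
    return "#" * round(pct / 100 * width)


for group, pct in perished.items():
    print(f"{group:13s} {bar(pct):50s} {pct:.1f}%")
```

Because every bar is drawn on the same 0 to 100 scale, the bar lengths can be compared directly across groups.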
3.3 Displaying Contingency Tables: Segmented Bar Chart
- Segmented Bar Chart: Distributions of ticket Class are clearly different.
- Survival was not independent of ticket Class.
- Bars are the same height because the numbers were converted to percents.
3.3 Displaying Contingency Tables: Mosaic Plot
- A mosaic plot can visualize the relationships among categorical variables.
3.4 Three Categorical Variables: OkCupid Contingency Table
- Contingency table for OkCupid:
  - 58.5% of non-drug using male pet owners have only dogs.
  - 46.1% of drug-using male pet owners have certain combinations (as shown in the table).
3.4 Association Among Three Variables Example
- Question: How did survival on the Titanic depend on sex and ticket class? This involves three variables: Sex, Class, and Survived.
- A mosaic plot can help visualize these relationships.
3.4 Association Among Three Variables Example (Answer)
- Observations:
  - There were more men than women.
  - Most of the crew were men.
  - A greater fraction of women survived.
  - Class distributions for men were similar for survivors and those who died.
  - Class distributions for women: those who died were overwhelmingly third class.
3.4 Caution: What Can Go Wrong?
- Don’t confuse similar-sounding percentages.
  - Percent of passengers who were both in first class and survived: 201/2208 = 9.1%.
  - Percent of first-class passengers who survived: 201/324 = 62.0%.
  - Percent of the survivors who were in first class: 201/712 = 28.2%.
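The three percentages share a numerator and differ only in the denominator. A short Python check using the counts quoted above (the variable names are illustrative):

```python
first_class_survivors = 201  # in first class AND survived
everyone = 2208              # total in the table
first_class = 324            # all first-class passengers
survivors = 712              # all survivors


def pct(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Of everyone, the share who were in first class and survived.
both = pct(first_class_survivors, everyone)
# Of first-class passengers, the share who survived.
survived_given_first = pct(first_class_survivors, first_class)
# Of survivors, the share who were in first class.
first_given_survived = pct(first_class_survivors, survivors)

print(both, survived_given_first, first_given_survived)
```

Asking "percent of what?" before reading any figure identifies which of the three is meant.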
3.4 Caution: Additional Warnings
- Don’t forget to look at the variables separately, too.
- Use enough individuals.
- Don’t overstate your case.
- Watch out for lurking variables.
3.5 Simpson’s Paradox
- Simpson’s paradox is named after Edward Simpson, the statistician who described it. It arises when a pattern seen within each group reverses or disappears once the groups are combined.
- Example: apparent gender discrimination in admissions at UC Berkeley was explained by a lurking variable, the school to which applicants applied.
- The apparent paradox is not truly paradoxical; it’s a failure to realize there is an important third variable that was not considered.
- The paradox is easy to explain if you know what to look for.
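A small Python sketch shows how the reversal can happen. The counts are invented purely for illustration and are NOT the actual UC Berkeley figures; the school names and the `rate` helper are likewise hypothetical.

```python
# Hypothetical (applied, admitted) counts, invented to illustrate the effect.
# These are NOT the actual UC Berkeley figures.
applications = {
    "School A": {"men": (800, 480), "women": (100, 70)},
    "School B": {"men": (200, 20), "women": (900, 135)},
}


def rate(applied, admitted):
    """Return the admission rate in percent, rounded to one decimal place."""
    return round(100 * admitted / applied, 1)


# Admission rates within each school.
within = {
    school: {gender: rate(*counts) for gender, counts in groups.items()}
    for school, groups in applications.items()
}

# Admission rates after combining the schools.
overall = {}
for gender in ("men", "women"):
    applied = sum(groups[gender][0] for groups in applications.values())
    admitted = sum(groups[gender][1] for groups in applications.values())
    overall[gender] = rate(applied, admitted)

for school, rates in within.items():
    print(school, rates)
print("Combined", overall)
```

In this made-up example women have the higher admission rate within each school, yet the lower rate overall, because most women applied to the school that admits few applicants. The school is the third variable that the combined table hides.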
3.6 What Can Go Wrong? (Recap of pitfalls)
- Don’t confuse similar-sounding percentages.
- Don’t mix up the interpretation of percentages within different groups.
- Look at both conditional and marginal distributions.
- Consider lurking variables and sample size.
- Distinguish correlation from causation.
4 Chapter Review
- Make and interpret a contingency table: Put counts and/or percents in a two-way table (contingency table).
- Make and interpret bar charts and pie charts of marginal distributions.
- Look at the marginal distribution of each variable.
- Also look at the conditional distribution of a variable within each category of the other variable.
- Comparing conditional distributions of one variable across categories of another tells us about the association between variables.
- If the conditional distributions of one variable are (roughly) the same for every category of the other, the variables are independent.
- Consider a third variable whenever it is appropriate.
4 Summary: Key Takeaways
- Contingency tables summarize relationships between two or more categorical variables.
- Independence means no association between the variables; association exists if distributions differ across categories.
- Conditional distributions help reveal associations that marginal totals can obscure.
- Visualizations (side-by-side bars, segmented bars, mosaic plots) can aid interpretation; beware of misleading overall percentages (Simpson’s paradox).
- Always consider potential lurking variables and sample size when drawing conclusions.