Summarizing Categorical Data

Tools for Summarizing Data

  • Tools for calculating numerical and graphical summaries are divided between categorical and numerical data.

  • The focus here is on categorical data.

Penguin Data

  • Dr. Gorman collected data near Palmer Station, Antarctica.

  • Data includes 8 variables on 333 penguins.

  • Example of the dataset:

    • species: Adelie, Chinstrap, Gentoo

    • island: Biscoe, Dream, Torgersen

    • bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, sex, year

Challenges with Raw Data

  • Raw data frames contain too much information to process at a glance.

  • Need to consolidate information into numerical summaries.

Contingency Tables

  • Categorical variables don’t take numbers as values, so we can’t compute an average or a median.

  • Count the number of penguins in every combination of levels.

  • Display counts in a contingency table.

  • Example:

    • Counts of penguin species on different islands.

    • Adelie: Biscoe (44), Dream (55), Torgersen (47)

    • Chinstrap: Biscoe (0), Dream (68), Torgersen (0)

    • Gentoo: Biscoe (119), Dream (0), Torgersen (0)
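
The counting step can be sketched in a few lines of Python. This is a minimal illustration, not the tooling used to produce the table above: the `records` list is a stand-in for the raw data frame, rebuilt from the cell counts quoted in the text, and all variable names are assumptions made for the example.

```python
from collections import Counter

# Cell counts quoted in the text: (species, island) -> number of penguins.
cell_counts = {
    ("Adelie", "Biscoe"): 44,
    ("Adelie", "Dream"): 55,
    ("Adelie", "Torgersen"): 47,
    ("Chinstrap", "Dream"): 68,
    ("Gentoo", "Biscoe"): 119,
}

# Stand-in for the raw data: one (species, island) record per penguin.
records = [pair for pair, n in cell_counts.items() for _ in range(n)]

species_levels = ["Adelie", "Chinstrap", "Gentoo"]
island_levels = ["Biscoe", "Dream", "Torgersen"]

# Tally every combination of levels; a Counter returns 0 for unseen pairs.
tally = Counter(records)
table = {s: {i: tally[(s, i)] for i in island_levels} for s in species_levels}

# Print the contingency table with species as rows and islands as columns.
print(f"{'':10}" + "".join(f"{i:>10}" for i in island_levels))
for s in species_levels:
    print(f"{s:10}" + "".join(f"{table[s][i]:>10}" for i in island_levels))
```

Combinations that never occur in the data, such as Chinstrap penguins on Biscoe, still get a cell in the table with a count of zero.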

Definition of Contingency Table
  • A table that shows the counts/frequencies of observations in every combination of levels of two categorical variables.

  • Used to display the relationship between variables.

Bar Charts

  • Present counts in graphical form using bar charts.

  • Two common methods:

    • Stacked bar chart

    • Side-by-side (dodged) bar chart

Stacked Bar Chart
  • One variable on the x-axis (e.g., species).

  • Y-axis filled according to the counts in each level of the other variable (e.g., island).

Side-by-Side (Dodged) Bar Chart
  • Similar to stacked, but unstacks the bars.

  • Puts the levels of the second variable beside one another.

Choosing a Bar Chart Type
  • Stacked bar charts highlight total counts (e.g., total number of Adelie penguins).

  • Side-by-side charts make it easier to see the relative sizes of each level of the second variable (e.g., which islands Adelie penguins came from).
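
The difference between the two layouts comes down to where each bar is placed. The sketch below computes the bar geometry a plotting library would need for each style; it draws nothing, and the function names and the `group_width` value of 0.9 are assumptions chosen for illustration.

```python
# Counts of penguins by species (x-axis variable) and island (fill variable).
counts = {
    "Adelie": {"Biscoe": 44, "Dream": 55, "Torgersen": 47},
    "Chinstrap": {"Biscoe": 0, "Dream": 68, "Torgersen": 0},
    "Gentoo": {"Biscoe": 119, "Dream": 0, "Torgersen": 0},
}
islands = ["Biscoe", "Dream", "Torgersen"]


def stacked_segments(row):
    """Return (island, bottom, height) for each segment of one stacked bar."""
    segments, bottom = [], 0
    for island in islands:
        segments.append((island, bottom, row[island]))
        bottom += row[island]  # the next segment starts where this one ends
    return segments


def dodged_bars(x_center, row, group_width=0.9):
    """Return (island, x_left, height) for bars placed beside one another."""
    width = group_width / len(islands)
    left_edge = x_center - group_width / 2
    return [(island, left_edge + k * width, row[island])
            for k, island in enumerate(islands)]


stacked = {s: stacked_segments(row) for s, row in counts.items()}
dodged = {s: dodged_bars(x, counts[s]) for x, s in enumerate(counts)}

print(stacked["Adelie"])
print(dodged["Adelie"])
```

In the stacked layout the top of the last Adelie segment sits at 146, the species total, which is why that style highlights totals. In the dodged layout every bar starts at zero, so the heights 44, 55, and 47 can be compared directly.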

From Counts to Proportions

  • Emphasize relative magnitude by converting counts into proportions.

  • Example table of proportions:

    • Adelie: Biscoe (0.132), Dream (0.165), Torgersen (0.141), Total (0.438)

    • Chinstrap: Biscoe (0.000), Dream (0.204), Torgersen (0.000), Total (0.204)

    • Gentoo: Biscoe (0.357), Dream (0.000), Torgersen (0.000), Total (0.357)

    • Total: Biscoe (0.489), Dream (0.369), Torgersen (0.141), Grand Total (1.000)
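
A table like this can be derived from the counts by dividing every cell by the grand total and then summing across rows and columns. The following is a minimal sketch under that reading, with variable names chosen for the example. Values are rounded to three decimals only when printed, so the printed margins may differ from the sum of the printed cells in the last digit.

```python
counts = {
    "Adelie": {"Biscoe": 44, "Dream": 55, "Torgersen": 47},
    "Chinstrap": {"Biscoe": 0, "Dream": 68, "Torgersen": 0},
    "Gentoo": {"Biscoe": 119, "Dream": 0, "Torgersen": 0},
}
islands = ["Biscoe", "Dream", "Torgersen"]

# Grand total: the number of penguins in the whole table.
total = sum(sum(row.values()) for row in counts.values())

# Each cell divided by the grand total.
joint = {s: {i: row[i] / total for i in islands} for s, row in counts.items()}

# Row and column totals of the proportions.
species_total = {s: sum(row.values()) for s, row in joint.items()}
island_total = {i: sum(joint[s][i] for s in joint) for i in islands}

# Print the table of proportions with a Total row and a Total column.
print(f"{'':10}" + "".join(f"{i:>11}" for i in islands) + f"{'Total':>11}")
for s in counts:
    cells = "".join(f"{joint[s][i]:>11.3f}" for i in islands)
    print(f"{s:10}{cells}{species_total[s]:>11.3f}")
bottom = "".join(f"{island_total[i]:>11.3f}" for i in islands)
print(f"{'Total':10}{bottom}{sum(species_total.values()):>11.3f}")
```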

Joint Proportion
  • Proportion of all observations that fall in a particular combination of levels of multiple variables.

  • Example: 44 / 333 = 0.132, the proportion of all penguins that were Adelie and from Biscoe.

Marginal Proportion
  • Proportion of all observations that fall in a single level of one variable, ignoring the other variable.

  • Example: (44 + 119) / 333 = 0.489, the proportion of all penguins that were from Biscoe.

Conditional Proportion
  • Proportion of observations in one level of one variable that appear in a level of a second variable.

  • Example: 0.132 / 0.489 = 0.270 (equivalently 44 / 163), the proportion of penguins from Biscoe that were Adelie.
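
The three kinds of proportion are linked: a conditional proportion is a joint proportion divided by the marginal proportion of the level being conditioned on. The short sketch below works through the Adelie and Biscoe example using the counts from the contingency table; the variable names are illustrative only.

```python
total = 333          # all penguins
adelie_biscoe = 44   # Adelie penguins on Biscoe
gentoo_biscoe = 119  # Gentoo penguins on Biscoe (there are no Chinstraps there)

# Joint: Adelie AND Biscoe, out of all penguins.
joint = adelie_biscoe / total

# Marginal: Biscoe, out of all penguins.
marginal = (adelie_biscoe + gentoo_biscoe) / total

# Conditional: Adelie, out of the penguins on Biscoe.
conditional = joint / marginal

# The grand total cancels, so dividing the counts gives the same answer.
conditional_from_counts = adelie_biscoe / (adelie_biscoe + gentoo_biscoe)

print(f"joint       = {joint:.3f}")
print(f"marginal    = {marginal:.3f}")
print(f"conditional = {conditional:.3f}")
```

Working from counts rather than from already-rounded proportions avoids carrying rounding error into the final digit.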

Normalized Stacked Bar Chart

  • Plots proportions instead of raw counts.

  • Typically shows conditional proportions.

  • Example: The height of a bar segment indicates the proportion of all penguins of a species that were from a given island; for Adelie penguins from Torgersen, 47 / 146 = 0.322.

  • Conditioning on a variable means that the count in a level of that variable is the denominator of the proportion.

Conditioning on Different Variables

  • Conditioning on island instead of species gives a different story.

  • Example: Proportion of penguins from Biscoe that were Gentoo: 119 / 163 = 0.730.

  • Choosing which variable to condition on is crucial for the message you want to convey.
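
The two ways of conditioning differ only in which total goes in the denominator. As a hedged sketch (the function names are made up for this example), dividing each cell by its species total conditions on species, while dividing by its island total conditions on island:

```python
counts = {
    "Adelie": {"Biscoe": 44, "Dream": 55, "Torgersen": 47},
    "Chinstrap": {"Biscoe": 0, "Dream": 68, "Torgersen": 0},
    "Gentoo": {"Biscoe": 119, "Dream": 0, "Torgersen": 0},
}
islands = ["Biscoe", "Dream", "Torgersen"]


def condition_on_species(counts):
    """Divide each cell by its species (row) total."""
    return {s: {i: row[i] / sum(row.values()) for i in islands}
            for s, row in counts.items()}


def condition_on_island(counts):
    """Divide each cell by its island (column) total."""
    island_totals = {i: sum(row[i] for row in counts.values()) for i in islands}
    return {s: {i: row[i] / island_totals[i] for i in islands}
            for s, row in counts.items()}


by_species = condition_on_species(counts)
by_island = condition_on_island(counts)

# Same cell, two different stories.
print(f"Adelie penguins that were from Torgersen: {by_species['Adelie']['Torgersen']:.3f}")
print(f"Torgersen penguins that were Adelie:      {by_island['Adelie']['Torgersen']:.3f}")
print(f"Biscoe penguins that were Gentoo:         {by_island['Gentoo']['Biscoe']:.3f}")
```

The Adelie and Torgersen cell illustrates the contrast: about a third of Adelie penguins came from Torgersen, yet every penguin on Torgersen was an Adelie.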

Importance of Choosing the Right Chart

  • Example: To determine species distribution on Dream Island, a chart conditioning on the island is needed.

  • Conditioning on species loses the relative numbers of Chinstraps and Adelies.

  • When a claim rests on a conditional proportion and you present it with a normalized bar chart, think carefully about which conditional proportion the chart displays.

Association

  • There is an association between two categorical variables if the conditional proportions vary as you move from one level of the conditioning variable to the next.

  • Example: If the distribution of species is unchanged across islands, there is no association.

  • The real data show an association: Biscoe is dominated by Gentoos, Dream is split between Chinstraps and Adelies, and Torgersen has only Adelies.
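
The definition can be checked directly by computing the distribution of species within each island and comparing across islands. The sketch below is a descriptive check only, not a statistical test, since sampled data will rarely match exactly. The `no_association` table is a hypothetical example invented for contrast, and the function names are assumptions.

```python
import math

islands = ["Biscoe", "Dream", "Torgersen"]

# Counts from the penguin data.
penguins = {
    "Adelie": {"Biscoe": 44, "Dream": 55, "Torgersen": 47},
    "Chinstrap": {"Biscoe": 0, "Dream": 68, "Torgersen": 0},
    "Gentoo": {"Biscoe": 119, "Dream": 0, "Torgersen": 0},
}

# Hypothetical counts: every island has the same species mix (2 : 1 : 3).
no_association = {
    "Adelie": {"Biscoe": 20, "Dream": 40, "Torgersen": 10},
    "Chinstrap": {"Biscoe": 10, "Dream": 20, "Torgersen": 5},
    "Gentoo": {"Biscoe": 30, "Dream": 60, "Torgersen": 15},
}


def species_given_island(counts):
    """For each island, the proportion of its penguins in each species."""
    result = {}
    for island in islands:
        island_total = sum(row[island] for row in counts.values())
        result[island] = {s: row[island] / island_total for s, row in counts.items()}
    return result


def has_association(counts, tol=1e-9):
    """True if the species distribution differs between any two islands."""
    dists = list(species_given_island(counts).values())
    first = dists[0]
    return any(not math.isclose(d[s], first[s], abs_tol=tol)
               for d in dists[1:] for s in first)


print(has_association(penguins))        # True
print(has_association(no_association))  # False
```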

Summary

  • Categorical summaries involve counts of categories or proportions.

  • Proportions can be joint, marginal, or conditional.

  • Counts and proportions are displayed in contingency tables or bar charts.

  • Subtle choices about which proportion to present can tell dramatically different stories.

  • Summarizing categorical data involves deciding what to add and divide.