Summarizing Categorical Data
Tools for Summarizing Data
Tools for calculating numerical and graphical summaries are divided between categorical and numerical data.
The focus here is on categorical data.
Penguin Data
Dr. Gorman collected data near Palmer Station, Antarctica.
Data includes 8 variables on 333 penguins.
Example of the dataset:
species: Adelie, Chinstrap, Gentoo
island: Biscoe, Dream, Torgersen
bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, sex, year
Challenges with Raw Data
Raw data frames contain too much information to process at a glance.
Need to consolidate information into numerical summaries.
Contingency Tables
Categorical variables don’t take numbers as values, so we can’t take an average or a median.
Count the number of penguins in every combination of levels.
Display counts in a contingency table.
Example:
Counts of penguin species on different islands.
Adelie: Biscoe (44), Dream (55), Torgersen (47)
Chinstrap: Biscoe (0), Dream (68), Torgersen (0)
Gentoo: Biscoe (119), Dream (0), Torgersen (0)
Definition of Contingency Table
A table that shows the counts/frequencies of observations in every combination of levels of two categorical variables.
Used to display the relationship between variables.
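The counting step can be sketched in a few lines of Python. This is a minimal illustration, not the course's own code: the raw data frame is not reproduced in these notes, so the sketch rebuilds one (species, island) pair per penguin from the example counts above. The names TABLE, penguins, and contingency_table are made up for this example.

```python
from collections import Counter

# Counts copied from the example table above. The raw data frame is not
# available here, so we rebuild one (species, island) pair per penguin.
TABLE = {
    ("Adelie", "Biscoe"): 44, ("Adelie", "Dream"): 55, ("Adelie", "Torgersen"): 47,
    ("Chinstrap", "Biscoe"): 0, ("Chinstrap", "Dream"): 68, ("Chinstrap", "Torgersen"): 0,
    ("Gentoo", "Biscoe"): 119, ("Gentoo", "Dream"): 0, ("Gentoo", "Torgersen"): 0,
}
penguins = [pair for pair, n in TABLE.items() for _ in range(n)]


def contingency_table(rows):
    """Count observations in every combination of levels of two variables."""
    return Counter(rows)


counts = contingency_table(penguins)
species = sorted({s for s, _ in TABLE})
islands = sorted({i for _, i in TABLE})

# Print species as rows and islands as columns.
print(f"{'':10}" + "".join(f"{i:>10}" for i in islands))
for s in species:
    # A Counter returns 0 for combinations that never occur.
    print(f"{s:10}" + "".join(f"{counts[(s, i)]:>10}" for i in islands))
```

Combinations that never occur, such as Chinstrap on Biscoe, produce no records at all. The table still shows them as 0 because a Counter returns 0 for missing keys.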
Bar Charts
Present counts in graphical form using bar charts.
Two common methods:
Stacked bar chart
Side-by-side (dodged) bar chart
Stacked Bar Chart
One variable on the x-axis (e.g., species).
Y-axis filled according to the counts in each level of the other variable (e.g., island).
Side-by-Side (Dodged) Bar Chart
Similar to stacked, but unstacks the bars.
Puts the levels of the second variable beside one another.
Choosing a Bar Chart Type
Stacked bar charts highlight total counts (e.g., total number of Adelie penguins).
Side-by-side charts make it easier to see the relative sizes of each level of the second variable (e.g., which islands Adelie penguins came from).
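To see why the two layouts emphasize different things, it can help to look at the numbers each one draws. The sketch below does no plotting. It only computes the bar geometry from the example counts, and the helper names stacked_segments and dodged_bars are hypothetical.

```python
# Counts from the example contingency table (species -> island -> count).
COUNTS = {
    "Adelie": {"Biscoe": 44, "Dream": 55, "Torgersen": 47},
    "Chinstrap": {"Biscoe": 0, "Dream": 68, "Torgersen": 0},
    "Gentoo": {"Biscoe": 119, "Dream": 0, "Torgersen": 0},
}


def stacked_segments(by_island):
    """Return (island, bottom, top) for each segment of one stacked bar."""
    segments, bottom = [], 0
    for island, n in by_island.items():
        segments.append((island, bottom, bottom + n))
        bottom += n  # the next segment starts where this one ends
    return segments


def dodged_bars(by_island):
    """Return (island, height) pairs; every dodged bar starts at zero."""
    return list(by_island.items())


for species, by_island in COUNTS.items():
    stacked = stacked_segments(by_island)
    total = stacked[-1][2]  # top of the last segment is the species total
    print(species, "stacked:", stacked, "total:", total)
    print(species, "dodged: ", dodged_bars(by_island))
```

In the stacked layout the top of each bar is the species total, but only the first segment sits on the baseline. In the dodged layout every bar starts at zero, so island counts can be compared directly, while the total is no longer drawn.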
From Counts to Proportions
Emphasize relative magnitude by converting counts into proportions.
Example table of proportions:
Adelie: Biscoe (0.132), Dream (0.165), Torgersen (0.141), Total (0.438)
Chinstrap: Biscoe (0.000), Dream (0.204), Torgersen (0.000), Total (0.204)
Gentoo: Biscoe (0.357), Dream (0.000), Torgersen (0.000), Total (0.357)
Total: Biscoe (0.489), Dream (0.369), Torgersen (0.141), Grand Total (1.000)
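A table like this can be produced by dividing every count by the grand total and then adding up rows and columns. The following is a minimal sketch using the example counts. The variable names (joint, row_totals, col_totals) are illustrative assumptions, not names from the original material.

```python
# Counts from the example contingency table (species -> island -> count).
COUNTS = {
    "Adelie": {"Biscoe": 44, "Dream": 55, "Torgersen": 47},
    "Chinstrap": {"Biscoe": 0, "Dream": 68, "Torgersen": 0},
    "Gentoo": {"Biscoe": 119, "Dream": 0, "Torgersen": 0},
}
ISLANDS = ["Biscoe", "Dream", "Torgersen"]

grand_total = sum(sum(row.values()) for row in COUNTS.values())

# Divide each cell by the grand total.
joint = {s: {i: n / grand_total for i, n in row.items()} for s, row in COUNTS.items()}

# Margins: sum across islands (rows) and across species (columns).
row_totals = {s: sum(row.values()) for s, row in joint.items()}
col_totals = {i: sum(joint[s][i] for s in joint) for i in ISLANDS}

for s, row in joint.items():
    cells = "  ".join(f"{row[i]:.3f}" for i in ISLANDS)
    print(f"{s:10} {cells}  {row_totals[s]:.3f}")
totals = "  ".join(f"{col_totals[i]:.3f}" for i in ISLANDS)
print(f"{'Total':10} {totals}  {sum(row_totals.values()):.3f}")
```

The margins here are summed before rounding, so a displayed total can differ by 0.001 from the sum of the rounded cells next to it.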
Joint Proportion
Proportion of all observations that fall in a particular combination of levels of two or more variables.
Example: $44/333 \approx 0.132$, the proportion of all penguins that were Adelie and from Biscoe.
Marginal Proportion
Proportion of all observations that fall in a single level of one variable, ignoring the other variable.
Example: $163/333 \approx 0.489$, the proportion of all penguins that were from Biscoe.
Conditional Proportion
Among the observations in one level of one variable, the proportion that fall in a given level of a second variable.
Example: $44/163 \approx 0.270$, the proportion of penguins from Biscoe that were Adelie.
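The three kinds of proportion differ only in which count goes in the numerator and which goes in the denominator. A small sketch using the example counts makes that concrete. The function names are hypothetical and chosen for readability.

```python
# Counts from the example contingency table (species -> island -> count).
COUNTS = {
    "Adelie": {"Biscoe": 44, "Dream": 55, "Torgersen": 47},
    "Chinstrap": {"Biscoe": 0, "Dream": 68, "Torgersen": 0},
    "Gentoo": {"Biscoe": 119, "Dream": 0, "Torgersen": 0},
}
GRAND_TOTAL = sum(n for row in COUNTS.values() for n in row.values())


def island_total(island):
    """Number of penguins on one island, summed over species."""
    return sum(row[island] for row in COUNTS.values())


def joint(species, island):
    """Cell count divided by the grand total."""
    return COUNTS[species][island] / GRAND_TOTAL


def marginal_island(island):
    """Island total divided by the grand total."""
    return island_total(island) / GRAND_TOTAL


def conditional_species_given_island(species, island):
    """Cell count divided by the total for the conditioning island."""
    return COUNTS[species][island] / island_total(island)


print(f"joint:       {joint('Adelie', 'Biscoe'):.3f}")
print(f"marginal:    {marginal_island('Biscoe'):.3f}")
print(f"conditional: {conditional_species_given_island('Adelie', 'Biscoe'):.3f}")
```

The three quantities are linked: the conditional proportion equals the joint proportion divided by the marginal proportion of the conditioning level.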
Normalized Stacked Bar Chart
Plots proportions instead of raw counts.
Typically shows conditional proportions.
Example: The height of a bar segment indicates the proportion of all penguins of a species that were from Torgersen: for Adelie, $47/146 \approx 0.322$.
Conditioning on a variable means the count for one level of that variable is the denominator of the proportion.
Conditioning on Different Variables
Conditioning on island instead of species gives a different story.
Example: Proportion of penguins from Biscoe that were Gentoo: $119/163 \approx 0.730$.
Choosing which variable to condition on is crucial for the message you want to convey.
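The effect of switching the conditioning variable can be checked directly. The sketch below normalizes the example counts two ways, once by species totals and once by island totals. The helper names condition_on_species and condition_on_island are assumptions made for this illustration.

```python
# Counts from the example contingency table (species -> island -> count).
COUNTS = {
    "Adelie": {"Biscoe": 44, "Dream": 55, "Torgersen": 47},
    "Chinstrap": {"Biscoe": 0, "Dream": 68, "Torgersen": 0},
    "Gentoo": {"Biscoe": 119, "Dream": 0, "Torgersen": 0},
}


def condition_on_species(counts):
    """Divide each cell by its species (row) total."""
    out = {}
    for species, row in counts.items():
        total = sum(row.values())
        out[species] = {island: n / total for island, n in row.items()}
    return out


def condition_on_island(counts):
    """Divide each cell by its island (column) total."""
    islands = next(iter(counts.values())).keys()
    totals = {i: sum(row[i] for row in counts.values()) for i in islands}
    return {s: {i: n / totals[i] for i, n in row.items()} for s, row in counts.items()}


given_species = condition_on_species(COUNTS)
given_island = condition_on_island(COUNTS)

# Same cell, two different denominators.
print("Gentoos that were on Biscoe:    ", round(given_species["Gentoo"]["Biscoe"], 3))
print("Biscoe penguins that were Gentoo:", round(given_island["Gentoo"]["Biscoe"], 3))
```

Both numbers come from the same count of 119, yet one says every Gentoo lived on Biscoe while the other says Gentoos made up roughly three quarters of Biscoe.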
Importance of Choosing the Right Chart
Example: To determine species distribution on Dream Island, a chart conditioning on the island is needed.
Conditioning on species hides the relative numbers of Chinstraps and Adelies on Dream, because each species bar is rescaled to sum to 1.
When a claim rests on a conditional proportion shown in a normalized bar chart, check carefully which variable the chart conditions on.
Association
There is an association between two categorical variables if the conditional proportions vary as you move from one level of the conditioning variable to the next.
Example: If the distribution of species is unchanged across islands, there is no association.
Real data shows an association: Biscoe dominated by Gentoos, Dream with Chinstraps and Adelies, Torgersen only Adelies.
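The definition suggests a simple descriptive check: compute the species mix on each island and see whether it changes from island to island. The sketch below does this for the example counts and for a second table, NO_ASSOCIATION, which is invented purely for contrast. The function names are also hypothetical. This check is only descriptive; it does not account for sampling variability.

```python
import math

# Counts from the example contingency table (species -> island -> count).
COUNTS = {
    "Adelie": {"Biscoe": 44, "Dream": 55, "Torgersen": 47},
    "Chinstrap": {"Biscoe": 0, "Dream": 68, "Torgersen": 0},
    "Gentoo": {"Biscoe": 119, "Dream": 0, "Torgersen": 0},
}

# Hypothetical table, invented for illustration: both islands have the
# same species mix (25% A, 75% B), so there is no association.
NO_ASSOCIATION = {"A": {"X": 10, "Y": 20}, "B": {"X": 30, "Y": 60}}


def species_mix_by_island(counts):
    """Conditional proportions of species given island.

    Assumes every island has at least one observation.
    """
    islands = next(iter(counts.values())).keys()
    mix = {}
    for island in islands:
        total = sum(row[island] for row in counts.values())
        mix[island] = {s: row[island] / total for s, row in counts.items()}
    return mix


def varies_across_levels(mix, tol=1e-9):
    """True if any island's species mix differs from the first island's."""
    dists = list(mix.values())
    first = dists[0]
    return any(
        not math.isclose(d[s], first[s], abs_tol=tol)
        for d in dists[1:]
        for s in first
    )


print(varies_across_levels(species_mix_by_island(COUNTS)))
print(varies_across_levels(species_mix_by_island(NO_ASSOCIATION)))
```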
Summary
Categorical summaries involve counts of categories or proportions.
Proportions can be joint, marginal, or conditional.
Counts and proportions are displayed in contingency tables or bar charts.
Subtle choices of which proportion to present can result in dramatically different stories.
Summarizing categorical data involves deciding what to add and divide.