Categorical Data Analysis: Bar Charts and Contingency Tables
Principles and Visual Representation of Categorical Data
Applicability of Bar Charts:
* Bar charts are used only for categorical variables.
* They are not used for quantitative data; histograms are the appropriate display for quantitative distributions.
The Area Principle:
* A fundamental rule for graphical displays is that the area occupied by a part of the graph should be proportional to the magnitude of the value it represents.
* In a bar chart that satisfies the area principle, all bars have the same width.
* Because the widths are the same, each bar's area is proportional to its height, so the visual impression is accurate.
* If the widths vary (as in some decorative graphics), the display can give a misleading impression of the distribution.
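The arithmetic behind the area principle can be sketched in a few lines. The counts, the category names, and the helper functions `shares` and `bar_areas` are hypothetical and chosen only for illustration; they do not come from the source.

```python
# Hypothetical counts for three categories (illustrative only, not from the source).
counts = {"A": 10, "B": 20, "C": 40}


def shares(values):
    """Return each entry's share of the total of all entries."""
    total = sum(values.values())
    return {key: value / total for key, value in values.items()}


def bar_areas(counts, widths):
    """Return the drawn area of each bar: width times height (height = count)."""
    return {cat: widths[cat] * counts[cat] for cat in counts}


equal_widths = {"A": 1, "B": 1, "C": 1}
varied_widths = {"A": 1, "B": 2, "C": 4}  # width grows along with the count

true_shares = shares(counts)
equal_shares = shares(bar_areas(counts, equal_widths))
varied_shares = shares(bar_areas(counts, varied_widths))

# With equal widths, each bar's share of the ink equals its share of the data.
print(equal_shares == true_shares)
# With varied widths, the largest category takes up more of the ink than it should.
print(varied_shares["C"] > true_shares["C"])
```

With equal widths the area shares match the data shares exactly. When the width also grows with the count, the largest category is visually exaggerated and the smaller ones are understated.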
Structural Requirements of Bar Charts:
* Bar Width: All bars must be the same width. The width itself carries no numerical meaning.
* Spacing: Small spaces must be left between the bars. These gaps signal that the categories are freestanding, discrete groups rather than points on a continuous scale.
* Order: Because the variable is categorical, the bars can be rearranged into any order without changing the statistical meaning, unlike a histogram, where the x-axis follows a numerical sequence.
* Base Line: All bars are lined up along a common horizontal base.
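As a rough illustration of these requirements, the sketch below renders a text bar chart. The function name `text_bar_chart` and the counts are hypothetical. The bars are drawn sideways for simplicity, so the common base is the vertical `|` column rather than a horizontal axis.

```python
def text_bar_chart(counts):
    """Render a sideways text bar chart, one bar per category.

    Every bar is one text row thick (equal width), starts at the same
    '|' column (common base), and is separated from its neighbors by a
    blank line (the gap between freestanding categories).
    """
    label_width = max(len(label) for label in counts)
    lines = []
    for label, count in counts.items():
        bar = "#" * count  # bar length encodes the count
        lines.append(f"{label:<{label_width}} | {bar} {count}")
    return "\n\n".join(lines)


# Hypothetical counts (illustrative only, not from the source).
counts = {"A": 3, "B": 5, "C": 2}
print(text_bar_chart(counts))

# Reordering the categories changes the layout but not the set of bars.
reordered = dict(sorted(counts.items(), key=lambda item: item[1], reverse=True))
print(text_bar_chart(reordered))
```

Sorting by count, as in the second call, is only one of many possible orders. Every ordering shows the same set of bars.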
Case Study: Titanic Ticket Class Distribution (Figure 3.3):
* The bar chart displays the distribution of people on board across four categories: First, Second, Third, and Crew.
* Categorical Comparisons:
  * The majority of people on board were not crew members (correcting misconceptions created by non-standardized graphics).
  * There were approximately […] times as many crew members as second-class passengers.
  * There were more than twice as many third-class passengers as either first-class or second-class passengers.
* Variable Type: Ticket class is defined as the "Class-Variable."
* Data Representation: The y-axis represents "Counts," ranging from […] to […] in increments of […].
Naming Conventions and Software Caveats:
* "Bar chart" is specifically the term for a display of the counts of a categorical variable with bars.
* Some computer programs incorrectly apply the name "bar chart" to any graph that uses bars.
* Other software packages change the name depending on orientation (horizontal vs. vertical bars), which can be misleading for students.
Relative Frequency Bar Charts
Definition and Purpose:
* A relative frequency bar chart is used when the analyst wants to draw attention to the relative proportion of individuals in each category rather than the raw counts.
* In this chart, counts are replaced with percentages.
Visual Characteristics (Figure 3.4):
* The shape of the relative frequency bar chart is identical to that of the standard bar chart.
* The only difference is the scale on the y-axis, which displays percentages (e.g., […], […], […], […]) rather than absolute numbers.
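The conversion from counts to percentages, and the reason the shape of the chart does not change, can be sketched as follows. The counts and the helper name `relative_frequencies` are hypothetical and used only for illustration.

```python
def relative_frequencies(counts):
    """Convert category counts to percentages of the overall total."""
    total = sum(counts.values())
    return {cat: 100 * n / total for cat, n in counts.items()}


# Hypothetical counts (illustrative only, not from the source).
counts = {"A": 10, "B": 30, "C": 60}
percents = relative_frequencies(counts)

print(percents)
# Every count is divided by the same total, so ratios between bars are
# unchanged and the chart keeps its shape; only the y-axis scale differs.
print(percents["C"] / percents["A"] == counts["C"] / counts["A"])
```

Because the percentages always sum to 100, a relative frequency chart also makes it easier to compare distributions from groups of different sizes.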
Educational Activity:
* The material suggests an activity, "Watch bar charts grow from data," followed by using a statistics package to create bar charts on one's own.
Introduction to Contingency Tables (Two-Way Tables)
Definition of Two-Way Tables:
* To analyze the relationship between two categorical variables simultaneously, counts are arranged into a two-way table.
* These are formally known as Contingency Tables because they show how individuals are distributed along one variable contingent on the values of the other variable.
Structure of the Titanic Contingency Table (Table 3.4):
* Categories of Variable 1 (Class): First, Second, Third, and Crew.
* Categories of Variable 2 (Survival): Alive and Dead.
* Cell Entries (Counts):
  * Alive: First: […]; Second: […]; Third: […]; Crew: […]; Total Alive: […].
  * Dead: First: […]; Second: […]; Third: […]; Crew: […]; Total Dead: […].
* Grand Total: The total number of individuals accounted for is […].
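A two-way table of this kind can be tallied from paired observations. The sketch below is a minimal illustration using only the standard library. The records, the category labels, and the function name `contingency_table` are hypothetical and are not the Titanic data.

```python
from collections import Counter

# Hypothetical (group, outcome) observations, one pair per individual.
records = [
    ("A", "Yes"), ("A", "No"), ("B", "Yes"),
    ("B", "Yes"), ("C", "No"), ("A", "Yes"),
]


def contingency_table(records):
    """Tally (column, row) pairs into a nested dict: table[row][column] = count."""
    cell_counts = Counter(records)
    col_labels = sorted({col for col, _ in records})
    row_labels = sorted({row for _, row in records})
    # Counter returns 0 for combinations that never occur, so every cell is filled.
    return {
        row: {col: cell_counts[(col, row)] for col in col_labels}
        for row in row_labels
    }


table = contingency_table(records)
for row, cells in table.items():
    print(row, cells)
```

Each individual contributes to exactly one cell, so the cell counts add up to the number of records.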
Marginal Distributions
Definition:
* The frequency distribution of one of the variables, as shown in the margins of a contingency table, is called its marginal distribution.
Identification in Tables:
* The margins, which hold the totals, are located on the far right and along the bottom of the table.
* Bottom Margin: Represents the marginal distribution of Ticket Class. These totals ([…], […], […], […]) match the counts found in a standard one-way frequency table for class.
* Right Margin: Represents the marginal distribution of the variable Survival.
  * Alive: […]
  * Dead: […]
* The far bottom-right cell represents the total count of all observations in the dataset ([…]).
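The margins can be computed directly from the cell counts. The following sketch assumes a table stored as a nested dictionary with hypothetical labels and counts (not the Titanic values), and the function name `marginals` is likewise an illustrative choice.

```python
# Hypothetical two-way table: table[row][column] = count (not from the source).
table = {
    "Yes": {"A": 2, "B": 2, "C": 0},
    "No": {"A": 1, "B": 0, "C": 1},
}


def marginals(table):
    """Return (row totals, column totals, grand total) for a two-way table."""
    # Right margin: sum across each row.
    row_totals = {row: sum(cells.values()) for row, cells in table.items()}
    # Bottom margin: sum down each column.
    col_totals = {}
    for cells in table.values():
        for col, n in cells.items():
            col_totals[col] = col_totals.get(col, 0) + n
    # Bottom-right cell: total count of all observations.
    grand_total = sum(row_totals.values())
    return row_totals, col_totals, grand_total


row_totals, col_totals, grand_total = marginals(table)
print(row_totals)
print(col_totals)
print(grand_total)
```

Both margins sum to the same grand total, which is a quick consistency check on any contingency table.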