Categorical Data Analysis: Bar Charts and Contingency Tables

Principles and Visual Representation of Categorical Data

  • Applicability of Bar Charts:
    * Bar charts are used exclusively for categorical variables.
    * They are not used for quantitative data; histograms are the appropriate tool for quantitative distributions.

  • The Area Principle:
    * A fundamental rule for graphical displays is that the area occupied by a part of the graph should be proportional to the magnitude of the value it represents.
    * In a bar chart that satisfies the area principle, all bars must have the same width.
    * Because the widths are the same, the height of each bar determines its area, so the visual impression is accurate.
    * If the widths vary (as in some decorative graphics), the display can give a misleading impression of the distribution.

  • Structural Requirements of Bar Charts:
    * Bar Width: All bars must be the same width. The width itself carries no numerical meaning.
    * Spacing: Small spaces must be maintained between the bars. These gaps signal that the categories are freestanding and represent discrete groups rather than a continuous scale.
    * Order: Because the variable is categorical, the bars can be rearranged into any order without changing the statistical meaning, unlike a histogram, where the x-axis follows a numerical sequence.
    * Baseline: All bars are lined up along a common horizontal base.

  • Case Study: Titanic Ticket Class Distribution (Figure 3.3):
    * The bar chart displays the distribution of people on board across four categories: First, Second, Third, and Crew.
    * Categorical Comparisons:
        * The majority of people on board were not crew members (correcting misconceptions created by non-standardized graphics).
        * There were approximately 3 times as many crew members as second-class passengers (885 versus 285).
        * There were more than twice as many third-class passengers as either first-class or second-class passengers.
    * Variable Type: Ticket class is defined as the "Class-Variable."
    * Data Representation: The y-axis represents "Counts," ranging from 0 to 1000 in increments of 200.

  • Naming Conventions and Software Caveats:
    * "Bar chart" is specifically the term for a display of counts of a categorical variable with bars.
    * Some computer programs incorrectly assign the name "bar chart" to any graph that uses bars.
    * Other software packages change the name based on orientation (horizontal vs. vertical bars), which can be misleading for students.
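
The area principle and the Titanic comparisons above can be checked with a little arithmetic. The sketch below is only an illustration: the class counts are the class totals reported for Table 3.4, while the `bar_area` helper and the idea of drawing the crew bar twice as wide are hypothetical, introduced here to show how unequal widths distort the picture.

```python
# Counts by ticket class (the class totals of Table 3.4).
counts = {"First": 325, "Second": 285, "Third": 706, "Crew": 885}


def bar_area(height, width=1.0):
    """Area of a rectangular bar. `width` is a hypothetical drawing width."""
    return height * width


# Equal widths: the ratio of areas equals the ratio of counts.
equal_ratio = bar_area(counts["Crew"]) / bar_area(counts["Second"])

# Unequal widths (crew bar drawn twice as wide, an assumed distortion):
# the area ratio doubles although the counts have not changed.
distorted_ratio = bar_area(counts["Crew"], width=2.0) / bar_area(counts["Second"])

total = sum(counts.values())
crew_share = counts["Crew"] / total

print(f"crew : second, equal widths      = {equal_ratio:.2f}")
print(f"crew : second, crew bar 2x wide  = {distorted_ratio:.2f}")
print(f"crew share of everyone on board  = {crew_share:.1%}")
print("third > 2 x first: ", counts["Third"] > 2 * counts["First"])
print("third > 2 x second:", counts["Third"] > 2 * counts["Second"])
```

With equal widths the crew bar has roughly three times the area of the second-class bar, matching the counts; widening one bar inflates its apparent share without any change in the data.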

Relative Frequency Bar Charts

  • Definition and Purpose:
    * A relative frequency bar chart is used when the analyst wants to draw attention to the relative proportion of individuals in each category rather than the raw counts.
    * In this chart, counts are replaced with percentages.

  • Visual Characteristics (Figure 3.4):
    * The shape of the relative frequency bar chart is identical to that of the standard bar chart.
    * The only difference is the scale on the y-axis, which displays percentages (e.g., 10%, 20%, 30%, 40%) rather than absolute numbers.

  • Educational Activity:
    * The material suggests an activity, "Watch bar charts grow from data," followed by using a statistics package to create bar charts independently.
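
The conversion from counts to percentages described above can be sketched in a few lines. This is a minimal illustration, assuming the class totals from the Titanic table as input; the `relative_frequencies` helper and the text rendering with `#` characters are hypothetical and stand in for whatever a statistics package would draw.

```python
# Counts by ticket class (the class totals of the Titanic table).
counts = {"First": 325, "Second": 285, "Third": 706, "Crew": 885}


def relative_frequencies(counts):
    """Return each category's share of the total, as a percentage."""
    total = sum(counts.values())
    return {category: 100 * n / total for category, n in counts.items()}


percentages = relative_frequencies(counts)

# Crude text rendering: one '#' per two percentage points.
for category, pct in percentages.items():
    print(f"{category:<7}{pct:5.1f}%  {'#' * round(pct / 2)}")
```

Because every count is divided by the same total, the ratios between bars are unchanged, which is why the relative frequency chart has the same shape as the chart of counts.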

Introduction to Contingency Tables (Two-Way Tables)

  • Definition of Two-Way Tables:
    * To analyze the relationship between two categorical variables simultaneously, counts are arranged into a two-way table.
    * These are formally known as Contingency Tables because they show how individuals are distributed along one variable contingent on the values of the other variable.

  • Structure of the Titanic Contingency Table (Table 3.4):
    * Categories of Variable 1 (Class): First, Second, Third, and Crew.
    * Categories of Variable 2 (Survival): Alive and Dead.
    * Cell Entries (Counts):
        * Alive: First: 203; Second: 118; Third: 178; Crew: 212; Total Alive: 711.
        * Dead: First: 122; Second: 167; Third: 528; Crew: 673; Total Dead: 1490.
    * Grand Total: The total number of individuals accounted for is 2201.
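
One way to hold a two-way table in code is a nested mapping from one variable's categories to the other's. The sketch below uses the cell counts listed above; the dictionary layout and the `cell` helper are assumptions made for illustration, not a prescribed representation.

```python
# Cell counts from Table 3.4: outer keys are Survival, inner keys are Class.
table = {
    "Alive": {"First": 203, "Second": 118, "Third": 178, "Crew": 212},
    "Dead": {"First": 122, "Second": 167, "Third": 528, "Crew": 673},
}


def cell(table, survival, ticket_class):
    """Look up the count for one combination of the two variables."""
    return table[survival][ticket_class]


# The grand total is the sum of every cell in the table.
grand_total = sum(sum(row.values()) for row in table.values())

# Print the table in a simple aligned layout.
classes = list(table["Alive"])
print(f"{'':<8}" + "".join(f"{c:>8}" for c in classes))
for survival, row in table.items():
    print(f"{survival:<8}" + "".join(f"{row[c]:>8}" for c in classes))
print("Grand total:", grand_total)
```

Each cell answers a question about both variables at once; for example, `cell(table, "Dead", "Third")` gives the number of third-class passengers who died.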

Marginal Distributions

  • Definition:
    * The frequency distribution of a single variable, as presented in the margins of a contingency table, is called the marginal distribution.

  • Identification in Tables:
    * The margins are located on the far right and the bottom of the table and hold the totals.
    * Bottom Margin: Represents the marginal distribution of Ticket Class. These totals (325, 285, 706, 885) match the counts found in a standard one-way frequency table for class.
    * Right Margin: Represents the marginal distribution of the variable Survival.
        * Alive: 711
        * Dead: 1490
    * The far bottom-right cell holds the total count of all observations in the dataset (2201).
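
The margins described above are just row sums and column sums of the cell counts. The following sketch recomputes them from the cells of the Titanic table; the nested-dictionary layout and the variable names are assumptions chosen for illustration.

```python
# Cell counts from the Titanic contingency table (Survival by Class).
table = {
    "Alive": {"First": 203, "Second": 118, "Third": 178, "Crew": 212},
    "Dead": {"First": 122, "Second": 167, "Third": 528, "Crew": 673},
}

# Right margin: sum across each row to get the distribution of Survival.
survival_margin = {survival: sum(row.values()) for survival, row in table.items()}

# Bottom margin: sum down each column to get the distribution of Class.
class_margin = {}
for row in table.values():
    for ticket_class, n in row.items():
        class_margin[ticket_class] = class_margin.get(ticket_class, 0) + n

# Both margins must add up to the same grand total (bottom-right cell).
grand_total = sum(survival_margin.values())
assert grand_total == sum(class_margin.values())

print("Survival margin:", survival_margin)
print("Class margin:   ", class_margin)
print("Grand total:    ", grand_total)
```

Summing either margin gives the same grand total, which is a quick consistency check on any contingency table.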