Categorical Data Analysis and Contingency Tables

Contingency Tables and Categorical Organization

Contingency tables are used to organize categorical data by counting how many individuals fall into specific categories. This method allows researchers to compare groups and observe relationships, determining how one variable is contingent upon another. For example, a dataset might track exam performance with categories such as below average, average, above average, or high performance against daily caffeine consumption levels including no caffeine, low, moderate, high, or very high. In the provided example, the total counts for exam performance were $35$ for below average, $70$ for average, $59$ for above average, and $41$ for high performance. For caffeine consumption, totals showed $35$ students consumed no caffeine and $15$ students consumed a very high amount.
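The counting step behind a contingency table can be sketched in a few lines of Python. This is a minimal illustration, not the method used to build the table in the example: the six records and the names `records`, `cell_counts`, `performance_totals`, and `caffeine_totals` are invented here. Only the four exam performance totals at the end come from the text.

```python
from collections import Counter

# Hypothetical (performance, caffeine) records, invented purely for illustration.
# They are NOT the students from the example in the text.
records = [
    ("average", "none"),
    ("average", "high"),
    ("above average", "high"),
    ("below average", "very high"),
    ("average", "high"),
    ("high", "low"),
]

# Each cell of the table counts the individuals sharing one pair of categories.
cell_counts = Counter(records)

# Marginal totals count one variable while ignoring the other.
performance_totals = Counter(perf for perf, _ in records)
caffeine_totals = Counter(caf for _, caf in records)

print(cell_counts[("average", "high")])   # 2
print(performance_totals["average"])      # 3
print(caffeine_totals["high"])            # 3

# Exam performance totals as reported in the text; their sum is the sample size.
reported_performance_totals = {
    "below average": 35,
    "average": 70,
    "above average": 59,
    "high": 41,
}
sample_size = sum(reported_performance_totals.values())
print(sample_size)                        # 205
```

The same check applies to any contingency table: the marginal totals of either variable should add up to the same overall count.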

Proportions, Relative Frequencies, and Percentages

Proportions measure the fraction of a total group. They are often called relative frequencies because they describe a frequency in relation to a total. In a sample of $n = 205$ students, a proportion is calculated by dividing a specific count by the total; for instance, a cell count of $8$ gives a proportion of $\frac{8}{205} \approx 0.039$, which is expressed as $3.9\,\%$ after multiplying by $100$. Contingency tables also allow these proportions to be examined within a single group: of the $45$ students who consumed a high level of caffeine, $16$ performed average on the exam and $10$ performed above average.
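The arithmetic above can be reproduced with a short Python sketch. The helper names `proportion` and `as_percent` are illustrative choices, not part of the original material; the counts ($8$, $16$, $10$, $45$, and $205$) are the ones quoted in the text. The sketch also shows that the denominator changes depending on whether the whole sample or only the high-caffeine group is the reference total.

```python
def proportion(count, total):
    """Return count / total as a relative frequency."""
    if total <= 0:
        raise ValueError("total must be positive")
    return count / total


def as_percent(p, digits=1):
    """Convert a proportion to a percentage, rounded to `digits` decimals."""
    return round(100 * p, digits)


n = 205                 # total sample size from the text
high_caffeine = 45      # students who consumed a high level of caffeine

# Proportion of the whole sample that falls in one cell.
cell = proportion(8, n)

# Proportions within the high-caffeine group only.
average_given_high = proportion(16, high_caffeine)
above_given_high = proportion(10, high_caffeine)

print(round(cell, 3), as_percent(cell))      # 0.039 3.9
print(as_percent(average_given_high))        # 35.6
print(as_percent(above_given_high))          # 22.2
```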

Categorical Comparison and Data Visualization

Visualizations such as bar plots and box-and-whisker plots facilitate comparison across categorical groups. A bar plot can reveal that students with very high caffeine consumption are twice as likely to perform below average as those who consume no caffeine. Using the Titanic dataset, survival proportions can be compared across age groups: approximately $\frac{10}{32}$ (roughly $31\,\%$) of passengers in the $60$ to $70$ year old category survived, while approximately $\frac{50}{82}$ (roughly $61\,\%$) of passengers in the $0$ to $10$ year old category survived. Box plots, in turn, compare a numerical variable such as stress level across categories. Here the typical stress score (the median, marked by the line inside the box) for students getting $8$ or more hours of sleep ($n = 35$) is approximately $20$, whereas students getting $6$ or fewer hours ($n = 15$) have a higher typical score, between $35$ and $38$.
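The Titanic comparison can be written as a small Python sketch. It assumes each fraction is read as (survivors in the age group) / (passengers in the age group), which is the reading that makes the two different denominators consistent. The counts are the approximate values quoted in the text, and the names `age_groups` and `survival_rate` are illustrative.

```python
# (survivors, passengers) per age group: approximate counts quoted in the text,
# interpreted here as within-group survival (an assumption).
age_groups = {
    "0-10": (50, 82),
    "60-70": (10, 32),
}

# Within-group survival proportion for each age category.
survival_rate = {
    group: survived / total
    for group, (survived, total) in age_groups.items()
}

for group, rate in survival_rate.items():
    print(f"{group}: {rate:.0%}")
# 0-10: 61%
# 60-70: 31%
```

Because each proportion uses its own group total as the denominator, groups of different sizes can be compared directly, which is what a bar plot of these proportions displays.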