Categorical Data Analysis and Contingency Tables

Contingency Tables and Categorical Organization

Contingency tables organize categorical data by counting how many individuals fall into each combination of categories. This lets researchers compare groups and observe relationships, that is, how one variable is contingent upon another. For example, a dataset might cross exam performance (below average, average, above average, or high) with daily caffeine consumption (none, low, moderate, high, or very high). In the provided example, the exam performance totals were 35 for below average, 70 for average, 59 for above average, and 41 for high performance, which together account for all 205 students in the sample. For caffeine consumption, the totals showed that 35 students consumed no caffeine and 15 students consumed a very high amount.
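The counting step behind a contingency table can be sketched in a few lines of Python. This is a minimal illustration rather than the method used for the example above: the `records` list and the `contingency_table` helper are hypothetical, and the six records are made up for demonstration, not taken from the dataset.

```python
from collections import Counter

# Hypothetical records, one per student: (exam performance, caffeine level).
# These values are invented for illustration only.
records = [
    ("average", "none"),
    ("average", "high"),
    ("below average", "very high"),
    ("above average", "high"),
    ("average", "high"),
    ("high", "low"),
]


def contingency_table(pairs):
    """Count records per (row, column) cell, plus row and column totals."""
    cells = Counter(pairs)
    row_totals = Counter(row for row, _ in pairs)
    col_totals = Counter(col for _, col in pairs)
    return cells, row_totals, col_totals


cells, row_totals, col_totals = contingency_table(records)

print(cells[("average", "high")])  # students who are both "average" and "high"
print(row_totals["average"])       # marginal total for the "average" row
print(col_totals["high"])          # marginal total for the "high" column
```

The row and column totals play the same role as the marginal totals quoted above (for example, 70 students with average performance), and each set of totals sums to the sample size.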

Proportions, Relative Frequencies, and Percentages

Proportions measure the fraction of a total group. They are often called relative frequencies because they describe a frequency in relation to a total. In a sample of n = 205 students, a proportion is calculated by dividing a specific count by the total: a cell count of 8 gives 8/205, or approximately 0.039, which becomes 3.9% when multiplied by 100. Contingency tables also allow more detailed analysis within a single category. For example, of the 45 students who consumed a high level of caffeine, 16 performed average on the exam and 10 performed above average.
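The arithmetic above can be checked with a short Python sketch. The counts (8 of 205, and 16 and 10 of 45) come from the text; the function names `proportion` and `as_percentage` are hypothetical helpers chosen for this example.

```python
def proportion(count, total):
    """Relative frequency: a count divided by the total it belongs to."""
    if total <= 0:
        raise ValueError("total must be positive")
    return count / total


def as_percentage(p, digits=1):
    """Convert a proportion to a percentage, rounded to `digits` places."""
    return round(p * 100, digits)


n = 205  # total number of students in the sample

# One cell of the table as a share of the whole sample.
cell_share = proportion(8, n)
print(round(cell_share, 3), as_percentage(cell_share))

# Shares within the 45 students who consumed a high level of caffeine.
high_caffeine = 45
print(as_percentage(proportion(16, high_caffeine)))  # performed average
print(as_percentage(proportion(10, high_caffeine)))  # performed above average
```

Note the change of denominator: the first proportion is relative to all 205 students, while the last two are relative only to the 45 students in the high-caffeine group.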

Categorical Comparison and Data Visualization

Visualizations such as bar plots and box-and-whisker plots make it easier to compare categorical groups. A bar plot can reveal, for instance, that students with very high caffeine consumption are twice as likely to perform below average as those who consume no caffeine. Using the Titanic dataset, survival proportions can be compared across age groups: approximately 10/32 (roughly 31%) of passengers in the 60 to 70 year old category survived, while roughly 50/82 (roughly 61%) of passengers in the 0 to 10 year old category survived. Box plots, in turn, provide a way to compare a numerical variable such as stress level across categories: the mean stress score for students getting 8 or more hours of sleep (n = 35) is approximately 20, whereas students getting 6 or fewer hours (n = 15) have a higher mean stress score, between 35 and 38.
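A "how many times as likely" comparison is a ratio of two within-group proportions. The sketch below applies that idea to the Titanic fractions quoted above, under the assumption that each fraction is read as survivors divided by passengers in that age group; the helper name `survival_rate` is hypothetical.

```python
def survival_rate(survived, group_size):
    """Proportion of a group that survived."""
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    return survived / group_size


# Counts as quoted in the text (approximate readings from a plot).
older = survival_rate(10, 32)    # 60 to 70 year old category
younger = survival_rate(50, 82)  # 0 to 10 year old category

# Ratio of the two proportions: how many times as likely the younger
# group was to survive compared with the older group.
ratio = younger / older

print(f"older: {older:.0%}, younger: {younger:.0%}, ratio: {ratio:.2f}")
```

Because the two groups differ in size (32 versus 82), comparing raw counts of 10 and 50 would overstate the difference; dividing by each group's size first puts them on the same scale.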