1.1 Analyzing Categorical Data
DISTRIBUTIONS
distribution: displays the possible values of a variable and how often it takes each value
depending on the type of data and what you are looking for, there are different ways to organize and display data
frequency table: displays the possible categories and the number of observations in each category
car make      relative frequency
american      12.9%
asian         67.7%
european      19.35%
total         100%
relative frequency: the fraction or proportion of observations that fall in a particular category
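As a minimal sketch of the calculation, the Python snippet below turns category counts into relative frequencies. The counts (4, 21, and 6 cars out of 31) are an assumption chosen to be consistent with the percentages in the table above; the notes do not give the raw counts.

```python
# Hypothetical counts: assumed for illustration, not given in the notes.
counts = {"american": 4, "asian": 21, "european": 6}

# Total number of observations across all categories.
total = sum(counts.values())

# Relative frequency = category count / total number of observations.
relative_freq = {make: n / total for make, n in counts.items()}

for make, p in relative_freq.items():
    print(f"{make}: {p:.2%}")
```

The relative frequencies always sum to 1 (100%), apart from rounding in the displayed values.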
area principle: the area occupied by a part of the graph should be proportional to the value it represents
bar chart: represents the count of individuals in each category of a categorical variable
vertical axis represents counts, horizontal axis represents the different categories
bars representing different categories do not touch
relative frequency bar chart: represents the proportion or percent of individuals in each category of a categorical variable
vertical axis represents proportions or percentages, horizontal axis represents the different categories
pie charts and segmented bar charts: useful when comparing categories that form parts of the whole
examples: age group, gender, political affiliation
contingency table: displays counts and sometimes percentages of individuals falling into named categories on two or more variables
the comparison of two categorical variables can be accomplished by use of two-way tables
shows how individuals are distributed along each variable
analysis of categorical data uses the counts or percents of individuals that fall within the various categories being studied
two-way tables should include column totals, row totals, and the overall table total for ease of analysis
if these totals are not present, be sure to provide them yourself
each cell within the two-way table gives the count or percent for a particular combination of the two categorical variables
MARGINAL DISTRIBUTIONS
marginal distributions are used to describe the distribution of one variable only
depending on which variable is being studied, either the column totals or the row totals are used as the numerators of the fractions used to compute distribution percents, and the table total is used as the denominator of those fractions
some round-off error may occur when working with two-way tables
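The marginal calculation can be sketched in Python as follows. The counts are the gender-by-response survey counts used in the worked example later in these notes; the variable names are illustrative only.

```python
# Two-way table of counts: rows are gender, columns are response.
table = {
    "males":   {"game": 279, "commercials": 81, "won't watch": 132},
    "females": {"game": 122, "commercials": 96, "won't watch": 98},
}

# Row totals, column totals, and the overall table total.
row_totals = {gender: sum(row.values()) for gender, row in table.items()}
col_totals = {}
for row in table.values():
    for response, n in row.items():
        col_totals[response] = col_totals.get(response, 0) + n
table_total = sum(row_totals.values())

# Marginal distribution: a row or column total divided by the table total.
marginal_gender = {g: t / table_total for g, t in row_totals.items()}
marginal_response = {r: t / table_total for r, t in col_totals.items()}

for response, p in marginal_response.items():
    print(f"{response}: {p:.2%}")
```

Rounded to two decimal places, the three response percentages are 49.63%, 21.91%, and 28.47%, which add to 100.01%. This is the kind of round-off error mentioned above.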
CONDITIONAL DISTRIBUTIONS
used to show an association between the two variables by placing a condition on one of the variables and then looking at the distribution of the remaining variable based on the stated condition
once a condition has been stated, we are now concerned with a single row or column
computed using the values within a row or column as the numerators of the fractions used to compute conditional distribution percentages; the row or column total is the denominator
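A short Python sketch of this calculation, conditioning on a single row. The helper name conditional_distribution is hypothetical, and the counts are the male respondents from the survey example in these notes.

```python
def conditional_distribution(row):
    """Return each cell of one row divided by that row's total."""
    row_total = sum(row.values())
    return {category: n / row_total for category, n in row.items()}

# Condition: respondent is male (a single row of the two-way table).
males = {"game": 279, "commercials": 81, "won't watch": 132}

dist = conditional_distribution(males)
for category, p in dist.items():
    print(f"{category}: {p:.2%}")
```

The same helper works for a column if the column's cells are passed in as the dictionary, since only the choice of denominator changes.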
DISPLAYING AND DESCRIBING CATEGORICAL VARIABLES
displayed using bar chart or pie chart
bar chart preferred for AP Statistics
two or more conditional distributions can be displayed and compared by using a segmented bar chart for each condition studied, or a mosaic plot
males:
game: 279/492 = 56.71%
commercials: 81/492 = 16.46%
won’t watch: 132/492 = 26.83%
females:
game: 122/316 = 38.61%
commercials: 96/316 = 30.38%
won’t watch: 98/316 = 31.01%
when comparing conditional distributions, use comparative words such as “similar to” or “about twice as many”; reciting percentages is not enough
say that there “appears” to be a relationship, since extraneous variables we do not know about may be involved
CONTINGENCY TABLES AND INDEPENDENCE
when the distribution of one variable is the same for all categories of the second variable, then we consider the variables to be independent
one variable does not affect the other variable
with the example above, the conditional distributions of response differ by gender: 56.71% of males but only 38.61% of females plan to watch the game, while 30.38% of females but only 16.46% of males plan to watch the commercials, so gender and response do not appear to be independent
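One way to sketch this comparison in Python is to compute the conditional distribution of response for each gender and look at the gaps between them. The names below are illustrative, and how large a gap counts as meaningful is a judgment call that this sketch does not settle.

```python
# Survey counts from the example above: rows are gender, columns are response.
table = {
    "males":   {"game": 279, "commercials": 81, "won't watch": 132},
    "females": {"game": 122, "commercials": 96, "won't watch": 98},
}

def conditional_distribution(row):
    """Return each cell of one row divided by that row's total."""
    row_total = sum(row.values())
    return {category: n / row_total for category, n in row.items()}

cond = {gender: conditional_distribution(row) for gender, row in table.items()}

# If the variables were independent, both rows would share one distribution,
# so every gap below would be close to zero (gaps are in percentage points).
gaps = {
    response: abs(cond["males"][response] - cond["females"][response]) * 100
    for response in table["males"]
}
largest = max(gaps, key=gaps.get)

for response, gap in gaps.items():
    print(f"{response}: {gap:.1f} percentage points")
```

Here the largest gap is for watching the game (about 18 percentage points), which is consistent with the conclusion that gender and response do not appear to be independent.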