1.1 Analyzing Categorical Data

DISTRIBUTIONS

  • distribution: displays the possible values of a variable and how often it takes each value

  • depending on the type of data and what you are looking for, there are different ways to organize and display data

  • frequency table: displays the possible categories and the number of observations in each category

    car make      relative frequency
    american      12.9%
    asian         67.7%
    european      19.35%
    total         100%

  • relative frequency: the fraction or proportion of all observations that fall in a particular category

  • area principle: the area occupied by a part of the graph should be proportional to the value it represents

  • bar chart: represents the count of individuals in each category of a categorical variable

    • vertical axis represents counts, horizontal axis represents the different categories

    • bars representing different categories do not touch

  • relative frequency bar chart: represents the proportion or percent of individuals in each category of a categorical variable

    • vertical axis represents proportions or percentages, horizontal axis represents the different categories

  • pie charts and segmented bar charts: useful when comparing categories that form parts of the whole

    • examples: age group, gender, political affiliation

  • contingency table: displays counts and sometimes percentages of individuals falling into named categories on two or more variables

  • the comparison of two categorical variables can be accomplished by use of two-way tables

    • shows how individuals are distributed along each variable

  • analysis of categorical data uses the counts or percents of individuals that fall within the various categories being studied

  • two-way tables always include column totals, row totals, and overall table total for ease of analysis

    • if the totals mentioned above are not present, then be sure to provide them yourself

  • each cell within the two-way table gives the count or percent for a particular combination of the two categorical variables
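
A minimal sketch of building a two-way table with its totals, using the gender and response counts from the viewing example later in these notes. The variable names (counts, row_totals, col_totals, table_total) are illustrative assumptions, not part of the original notes.

```python
# Counts from the viewing example later in these notes
# (rows: gender, columns: response).
counts = {
    "male":   {"game": 279, "commercials": 81, "won't watch": 132},
    "female": {"game": 122, "commercials": 96, "won't watch": 98},
}
responses = ["game", "commercials", "won't watch"]

# Row totals: one total per gender.
row_totals = {g: sum(row.values()) for g, row in counts.items()}

# Column totals: one total per response.
col_totals = {r: sum(counts[g][r] for g in counts) for r in responses}

# Overall table total.
table_total = sum(row_totals.values())

# Print the table with the totals in the margins.
print(f"{'':8}" + "".join(f"{r:>13}" for r in responses) + f"{'total':>8}")
for g in counts:
    cells = "".join(f"{counts[g][r]:>13}" for r in responses)
    print(f"{g:8}" + cells + f"{row_totals[g]:>8}")
totals = "".join(f"{col_totals[r]:>13}" for r in responses)
print(f"{'total':8}" + totals + f"{table_total:>8}")
```

Each inner cell is the count for one combination of gender and response; the margins hold the row totals, the column totals, and the overall table total.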

MARGINAL DISTRIBUTIONS

  • marginal distributions are used to describe the distribution of one variable only

  • depending on which variable is being studied, either the column totals or the row totals are used as the numerators of the fractions that give the distribution percents, and the table total is used as the denominator of those fractions

  • some round-off error may occur when working with two-way tables
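
As a rough illustration of the rule above, the marginal distribution of response can be computed from the column totals of the viewing example that appears later in these notes. The names used here (counts, col_totals, marginal) are assumptions made for the sketch.

```python
# Counts from the viewing example later in these notes.
counts = {
    "male":   {"game": 279, "commercials": 81, "won't watch": 132},
    "female": {"game": 122, "commercials": 96, "won't watch": 98},
}
responses = ["game", "commercials", "won't watch"]

# Column totals are the numerators; the table total is the denominator.
col_totals = {r: sum(counts[g][r] for g in counts) for r in responses}
table_total = sum(col_totals.values())

# Marginal distribution of response, as percents rounded to two decimals.
marginal = {r: round(100 * n / table_total, 2) for r, n in col_totals.items()}

for r, pct in marginal.items():
    print(f"{r:12} {col_totals[r]}/{table_total} = {pct:.2f}%")

# The rounded percents need not add to exactly 100 (round-off error).
total_pct = round(sum(marginal.values()), 2)
print(f"sum of rounded percents: {total_pct:.2f}%")
```

With these counts the rounded percents add to 100.01% rather than 100%, which is the kind of round-off error mentioned above.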

CONDITIONAL DISTRIBUTIONS

  • used to show an association between the two variables by placing a condition on one of the variables and then looking at the distribution of the remaining variable based on the stated condition

  • once a condition has been stated, we are now concerned with a single row or column

  • computed using the values within a row or column as the numerators of the fractions that give the conditional distribution percentages; the row or column total is the denominator
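
The computation described above can be sketched as follows, conditioning on gender (one row at a time) with the counts from the viewing example later in these notes. The dictionary layout and the name conditional are assumptions for illustration.

```python
# Counts from the viewing example later in these notes
# (rows: gender, columns: response).
counts = {
    "male":   {"game": 279, "commercials": 81, "won't watch": 132},
    "female": {"game": 122, "commercials": 96, "won't watch": 98},
}

# Conditional distribution of response given gender:
# each cell in a row is divided by that row's total.
conditional = {}
for gender, row in counts.items():
    row_total = sum(row.values())
    conditional[gender] = {
        response: round(100 * n / row_total, 2) for response, n in row.items()
    }

for gender, dist in conditional.items():
    print(gender)
    for response, pct in dist.items():
        print(f"  {response:12} {pct:.2f}%")
```

Conditioning on a response instead would use a single column, with that column's total as the denominator.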

DISPLAYING AND DESCRIBING CATEGORICAL VARIABLES

  • displayed using bar chart or pie chart

    • bar chart preferred for AP Statistics

  • two or more conditional distributions can be displayed and compared using a segmented bar chart for each condition studied, or a mosaic plot

  • males:

    • game: 279/492 = 56.71%

    • commercials: 81/492 = 16.46%

    • won’t watch: 132/492 = 26.83%

  • females:

    • game: 122/316 = 38.61%

    • commercials: 96/316 = 30.38%

    • won’t watch: 98/316 = 31.01%

  • when comparing conditional distributions, use comparative words such as “similar to” or “about twice as many”; reciting percentages is not enough

  • say that there “appears” to be a relationship, since extraneous variables may be involved that we do not know about
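
A rough text-only sketch of a segmented bar chart for the two conditional distributions above. Each bar has the same total width, and each segment's length is proportional to its share, in keeping with the area principle. The letter symbols, the width of 50 characters, and the helper name segmented_bar are all assumptions made for this illustration.

```python
# Counts from the example above (rows: gender, columns: response).
counts = {
    "male":   {"game": 279, "commercials": 81, "won't watch": 132},
    "female": {"game": 122, "commercials": 96, "won't watch": 98},
}

# One letter per response category (hypothetical choice for display).
symbols = {"game": "G", "commercials": "C", "won't watch": "W"}
WIDTH = 50  # every bar is the same length because each represents 100%


def segmented_bar(row, width=WIDTH):
    """Return a string whose segments are proportional to the row's counts."""
    total = sum(row.values())
    bar = ""
    for response, n in row.items():
        bar += symbols[response] * round(width * n / total)
    # Rounding can leave the bar one character short or long;
    # pad with the last symbol or trim so the length is exactly `width`.
    return (bar + bar[-1] * width)[:width]


bars = {gender: segmented_bar(row) for gender, row in counts.items()}
for gender, bar in bars.items():
    print(f"{gender:7} |{bar}|")
```

Lining the two bars up makes the comparison visual: the G segment is clearly longer for males, and the C segment is nearly twice as long for females.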

CONTINGENCY TABLES AND INDEPENDENCE

  • when the distribution of one variable is the same for all categories of the second variable, then we consider the variables to be independent

    • knowing the category of one variable tells you nothing about the distribution of the other variable

  • in the example above, the conditional distributions of response differ by gender: males are more likely than females to be interested in the game (56.71% vs. 38.61%), and females are more likely than males to be interested in the commercials (30.38% vs. 16.46%), so gender and response appear not to be independent
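
A small descriptive check of the reasoning above, assuming the same counts: if gender and response were independent, the conditional distribution of response would be the same for males and females, so the gaps below would all be close to zero. This is only a sketch for comparing the distributions, not a formal test, and the names conditional and gaps are illustrative.

```python
# Counts from the example above (rows: gender, columns: response).
counts = {
    "male":   {"game": 279, "commercials": 81, "won't watch": 132},
    "female": {"game": 122, "commercials": 96, "won't watch": 98},
}


def conditional(row):
    """Proportion of the row total that falls in each response category."""
    total = sum(row.values())
    return {response: n / total for response, n in row.items()}


male = conditional(counts["male"])
female = conditional(counts["female"])

# Gap (male minus female) in percentage points for each response.
gaps = {r: round(100 * (male[r] - female[r]), 2) for r in male}

for response, gap in gaps.items():
    print(f"{response:12} {gap:+.2f} percentage points")

largest_gap = max(abs(g) for g in gaps.values())
print(f"largest gap: {largest_gap:.2f} percentage points")
```

The largest gap is about 18 percentage points (for the game), which is why the two variables appear not to be independent in this sample.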