MODULE 5

Introduction

  • Summarizing data is essential for understanding, especially with large datasets.

  • A distribution is a way to summarize qualitative data.

  • Distributions can be displayed using frequency tables or graphs.

  • Numerical summaries of qualitative data involve:

    • Computing proportions or percentages.

    • Computing odds.

    • Computing modes.

    • Computing medians (for ordinal data only).

Frequency Tables for Qualitative Data

  • Qualitative data is typically collated in a frequency table.

  • Rows or columns list the levels of the variable.

  • Levels should be exhaustive (cover all levels) and exclusive (observations belong to only one level).

  • For nominal data, levels can be displayed:

    • Alphabetically

    • By size

    • By personal preference

    • Any other way that is useful to readers.

  • For ordinal data, use the natural order of levels.

  • Example: A study surveyed 400 Phoenix residents about autonomous vehicles (AVs), recording these qualitative variables:

    • Gender (nominal, two levels)

    • Age group (ordinal, six levels)

    • Safety opinions (ordinal, five levels)
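
A frequency table can be built by counting how many observations fall in each level. The sketch below is illustrative only: the responses are made up (they are not the AV data), and `frequency_table` is a hypothetical helper name. The levels are passed explicitly so that ordinal levels keep their natural order and levels with no observations still appear.

```python
from collections import Counter

def frequency_table(observations, levels):
    """Return (level, count, percentage) rows in the order given by `levels`."""
    counts = Counter(observations)
    # Levels must be exhaustive: every observation should match a listed level.
    unknown = set(counts) - set(levels)
    if unknown:
        raise ValueError(f"Observations outside the listed levels: {sorted(unknown)}")
    n = len(observations)
    return [(level, counts[level], 100 * counts[level] / n) for level in levels]

# Hypothetical responses, for illustration only.
levels = ["Unsafe", "Neutral", "Safe"]
responses = ["Safe", "Neutral", "Safe", "Unsafe", "Safe", "Neutral", "Safe", "Safe"]

table = frequency_table(responses, levels)
for level, count, pct in table:
    print(f"{level:<8} {count:>3} {pct:6.1f}%")
```

Because each observation is counted once under exactly one level, the counts always add to the number of observations, which is a quick check that the levels are exclusive.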

Graphs

  • Options for graphing qualitative data:

    • Dot chart: Usually a good choice.

    • Bar chart: Usually a good choice.

    • Pie chart: Only useful in special circumstances; can be harder to interpret.

  • For nominal data: order the levels in the most useful way.

  • For ordinal data: use the natural order of the levels.

  • These graphs can also be used for discrete quantitative data when only a small number of values are possible.

  • The purpose of a graph is to display information clearly and simply.

Dot Charts (Qualitative Data)

  • Dot charts indicate counts (or percentages) in each level using dots on a line starting at zero.

  • Levels can be on the horizontal or vertical axis.

  • Placing level names on the vertical axis often makes it easier to read and provides space for long labels.

  • For dot charts:

    • Place the qualitative variable on the horizontal or vertical axis and label with the levels of the variable.

    • Use counts or percentages on the other axis.

    • For nominal data, think about the most helpful order for the levels.

Bar Charts

  • Bar charts indicate counts in each category using bars starting from zero.

  • Levels can be on the horizontal or vertical axis, but placing level names on the vertical axis often makes it easier to read and allows room for long labels.

  • For bar charts:

    • Place the qualitative variable on the horizontal or vertical axis and label with the levels of the variable.

    • Use counts or percentages on the other axis.

    • For nominal data, levels can be ordered any way. Think about the most helpful order.

    • Bars have gaps between them, as the bars represent distinct categories.

  • By contrast, the bars in histograms are butted together (except when an interval has a count of zero), as the variable axis represents a continuous numerical scale.

Pie Charts

  • In pie charts, a circle is divided into segments proportional to the number in each level of the qualitative variable.

  • Pie charts may present challenges:

    • Pie charts only work when graphing parts of a whole.

    • Pie charts only work when all options are present (‘exhaustive’).

    • Pie charts are difficult to use with levels having zero counts or small counts.

    • Pie charts are difficult to read with many categories present.

    • Pie charts are hard to read: Humans compare lengths (as in bar and dot charts) better than angles (as in pie charts).

Comparing Pie, Bar, and Dot Charts

  • Determining which age groups have the most respondents is hard in the pie chart.

  • The equivalent bar chart (or dot chart) makes the comparison easy: clearly the youngest age group has the smallest representation, while the 25 to 34 and the 35 to 44 age groups have the most respondents.

  • The tilted pie chart makes this comparison even harder.

  • Recall that the purpose of a graph is to display information in the clearest, simplest possible way, to help the reader understand the message(s) in the data.

  • A pie chart often makes the message hard to see.

Parameters and Statistics

  • In quantitative research, both qualitative and quantitative data are summarised and analysed numerically.

  • Numerical quantities are computed from one of countless possible samples, even though the whole population is of interest.

  • A statistic (a sample value) is a numerical value estimating the unknown parameter (population value).

  • Since countless samples are possible, countless values of the statistic are possible, all of which are estimates of the value of the parameter.

  • The value of the statistic that is observed depends on which one of the countless possible samples is (randomly) selected.

  • The research question (RQ) identifies the population, but in practice only one of the many possible samples is studied.

  • Statistics are estimates of parameters, and the value of the statistic is not the same for every possible sample. We only observe one value of the statistic from our single sample.
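
The idea that a statistic varies from sample to sample can be illustrated with a small simulation. This is a sketch only: the population proportion of 0.19 and the random seed are assumptions chosen for illustration, since in practice the parameter is unknown and only one sample is observed.

```python
import random

def sample_proportion(p, n, rng):
    """Simulate one random sample of size n and return the sample proportion."""
    successes = sum(1 for _ in range(n) if rng.random() < p)
    return successes / n

rng = random.Random(1)   # fixed seed so the run is repeatable (an assumption)
p_population = 0.19      # hypothetical parameter; unknown in real studies
n = 400

# Each simulated sample gives its own estimate of the same parameter.
estimates = [sample_proportion(p_population, n, rng) for _ in range(5)]
print(estimates)
```

Each value printed is one possible value of the statistic; a real study would observe only one of them.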

Proportions and Percentages

  • Qualitative data can be summarised using proportions or percentages.

  • These can be given instead of, or with, the counts.

  • Definition: A proportion is a fraction out of a total and is a number between 0 and 1.

  • Definition: A percentage is a proportion multiplied by 100. In this context, percentages are numbers between 0% and 100%.

  • Population proportions are almost always unknown.

  • Instead, the population proportion (the parameter), denoted p, is estimated by a sample proportion (a statistic), denoted by \hat{p}.

  • Example: Consider the AV data, summarizing results from n = 400 respondents.

    • The sample proportion aged 25 to 34 is: 76 \div 400, or 0.19.

    • The sample percentage aged 25 to 34 is: 0.19 \times 100, or 19%.
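
The arithmetic in this example can be checked with a few lines of Python. This is a minimal sketch; the function name `proportion` is a hypothetical helper, and the counts are those quoted for the AV data.

```python
def proportion(count, total):
    """Sample proportion: the number of interest divided by the total number."""
    if total <= 0 or not 0 <= count <= total:
        raise ValueError("need total > 0 and 0 <= count <= total")
    return count / total

n = 400              # respondents in the AV data
aged_25_to_34 = 76   # respondents aged 25 to 34

p_hat = proportion(aged_25_to_34, n)
percentage = 100 * p_hat
print(f"p-hat = {p_hat:.2f}, percentage = {percentage:.0f}%")
```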

Odds

  • The number of females is slightly larger than the number of males.

  • The ratio of females to males is 204 \div 196 = 1.04.

  • That is, there are 1.04 times as many females as males.

  • This value of 1.04 is the odds that a respondent is female.

  • An alternative interpretation: there are 1.04 \times 100 = 104 females for every 100 males.

  • Take care:

    • Proportions and percentages are the number of interest divided by the total number.

    • Odds are the number of interest divided by the remaining number.

  • Definition: The odds are the number (or proportion, or percentage) of times that an event happens, divided by the number (or proportion, or percentage) of times that the event does not happen:
    \text{Odds} = \frac{\text{Number of times the event happens}}{\text{Number of times the event does not happen}}

  • or (equivalently)
    \text{Odds} = \frac{\text{Proportion of times the event happens}}{\text{Proportion of times the event does not happen}}

  • The odds are how many times an event happens compared to the event not happening.

  • Example:

    • The AV data includes 204 females and 196 males.

    • The odds that a respondent is female are 1.04, as found above.

    • These odds are greater than one, as the number of females is larger than the number of males.

    • The odds that a respondent is male are 196/204 = 0.96; that is, there are 0.96 times as many males as females.

    • These odds are less than one, as the number of males is smaller than the number of females.

    • Alternatively, there are 96 males for every 100 females.

  • Take care interpreting odds:

    • Odds are greater than 1: the event is more likely to happen than not to happen.

    • Odds are equal to 1: the event is just as likely to happen as it is not to happen.

    • Odds are less than 1: the event is less likely to happen than not to happen.

  • Population odds are almost always unknown.

  • Instead, the population odds (the parameter) are estimated by the sample odds (a statistic).
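
The odds for the AV gender data can be reproduced as follows. This is a small sketch, with `odds` as a hypothetical helper name; it also checks that the two forms of the definition (using counts or using proportions) give the same value.

```python
def odds(happens, does_not_happen):
    """Odds: times the event happens divided by times it does not happen."""
    if does_not_happen == 0:
        raise ZeroDivisionError("odds are undefined when the event always happens")
    return happens / does_not_happen

females, males = 204, 196   # counts from the AV data
n = females + males

odds_female = odds(females, males)
odds_male = odds(males, females)

# The same odds computed from proportions rather than counts.
odds_female_from_props = odds(females / n, males / n)

print(f"odds female = {odds_female:.2f}, odds male = {odds_male:.2f}")
```

Note that the odds divide by the remaining number (196 or 204), whereas a proportion would divide by the total (400).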

Modes

  • Because qualitative data has levels, all qualitative data (nominal; ordinal) can be numerically summarised by counting the number of observations in each level (or computing the percentage of observations in each level).

  • Definition: A mode is the level (or levels) of a qualitative variable with the most observations.

  • Population modes are almost always unknown.

  • Instead, the population mode (the parameter) is estimated by a sample mode (a statistic).

  • Example: Consider again the AV data.

    • ‘Gender’ is nominal qualitative; age group is ordinal qualitative.

    • The responses to the safety questions are ordinal.

    • The mode could be used to summarise each variable.

  • The mode for gender is ‘Female’ (with 204 respondents).

  • The modal age groups are 25 to 34 and 35 to 44 (both with 76 respondents).

  • The modal response to the question about driving near AVs is ‘Somewhat safe’ (97 respondents).

  • The modal response to the question about cycling near AVs is ‘Somewhat unsafe’ (104 respondents).

  • The modal response to the question about walking near AVs is ‘Neutral’ (103 respondents).
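
Finding a mode amounts to finding the largest count, allowing for ties. The sketch below uses a hypothetical helper named `modes`. The gender counts come from the AV data, while the second set of counts is made up purely to show a tie.

```python
def modes(counts):
    """Return every level tied for the largest count."""
    largest = max(counts.values())
    return [level for level, count in counts.items() if count == largest]

gender_counts = {"Female": 204, "Male": 196}   # from the AV data
gender_modes = modes(gender_counts)
print(gender_modes)

# Hypothetical counts showing a tie, so two levels are both modes.
tied_counts = {"Low": 12, "Medium": 30, "High": 30}
tied_modes = modes(tied_counts)
print(tied_modes)
```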

Medians for Ordinal Data

  • Ordinal data can be summarised in ways that nominal data cannot be since ordinal data have levels with a natural order.

  • Ordinal qualitative data, but not nominal data, can be summarised using medians.

  • Find the median by locating the response that is in the middle when the levels from all individuals are placed in order.

  • Medians can be used to summarise quantitative data and ordinal data, but never nominal data.

  • Example: Consider again the AV data.

    • ‘Gender’ is nominal qualitative, so medians are not appropriate.

  • The other variables are ordinal, so medians could be used to summarise each variable.

  • Since n = 400, the median lies halfway between the 200th and 201st responses when the responses are placed in order.

  • The median age group is 35 to 44 (observations 200 and 201 fall here).

  • The median response to the driving question is ‘Neutral’ (observations 200 and 201 fall here).

  • The median response to the cycling question is ‘Neutral’ (observations 200 and 201 fall here).

  • The median response to the walking question is ‘Neutral’ (observations 200 and 201 fall here).
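
The median level of an ordinal variable can be located from the frequency table by accumulating counts in the natural order of the levels. The following is a sketch under assumptions: `ordinal_median` is a hypothetical helper, and the counts are made up (the full AV frequency tables are not reproduced in these notes). When n is even and the two middle observations fall in different levels, both levels are returned rather than averaged, since levels cannot be averaged.

```python
def ordinal_median(counts_in_order):
    """Return the median level(s) from (level, count) pairs in natural order."""
    n = sum(count for _, count in counts_in_order)
    if n == 0:
        raise ValueError("no observations")
    # 1-based positions of the middle observation(s).
    positions = [(n + 1) // 2] if n % 2 == 1 else [n // 2, n // 2 + 1]
    found = []
    running = 0
    for level, count in counts_in_order:
        running += count
        for pos in positions:
            # This level holds observations running - count + 1 up to running.
            if running - count < pos <= running and level not in found:
                found.append(level)
    return found

# Hypothetical counts (n = 10): observations 5 and 6 both fall in 'Medium'.
same_level = ordinal_median([("Low", 3), ("Medium", 4), ("High", 3)])
print(same_level)

# Hypothetical counts (n = 10): observations 5 and 6 fall in different levels.
split_levels = ordinal_median([("Low", 5), ("High", 5)])
print(split_levels)
```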

Example: Water Access

  • A study of three rural communities in Cameroon recorded data about access to water.

  • Numerous qualitative variables are recorded.

  • Notice that the levels of the two ordinal variables are displayed in the natural order.

  • The distance to the nearest water source is usually less than 1 km, and the wait at the source is often over 1 min.

  • The most common water source is a bore (68.6%).

Comparing between individuals

  • Relational RQs compare groups.

  • We now consider how to compare qualitative variables in different groups.

  • Tables and graphs are very useful for this purpose.

Two-way tables

  • When more than one qualitative variable is recorded for each individual, the data can be collated into a table.

  • When two qualitative variables are cross-tabulated, the result is a two-way table.

  • As always, the categories should be exhaustive (cover all values) and exclusive (observations belong to one and only one category).

  • Example: A medical study compared two treatments for kidney stones to determine which had a higher success rate.

  • Data were collected from 700 UK patients on two qualitative variables:

    • The treatment method (‘A’ or ‘B’): the explanatory variable.

    • The result (‘success’ or ‘failure’ of the procedure): the response variable.

  • Both variables are qualitative with two levels, and each treatment was used on 350 patients:

    • Treatment A: used from 1972–1980.

    • Treatment B: used from 1980–1985.

  • Treatments were not randomly allocated, so confounding may be present.

  • So, the researchers also recorded the size of the kidney stone (‘small’ or ‘large’) as one possible confounding variable.

  • Firstly, consider just the small stones, displayed in the two-way table.

Summary tables by rows and columns

  • Each variable in a two-way table can be analysed separately, using percentages, proportions or odds.

  • For example, the two variables (Method; Result) can each be analysed separately. For the Result variable:

    • The percentage of procedures that were successful is 315/357 \times 100 = 88.2\%.

    • The odds that a procedure was successful are 315/42 = 7.5; that is, there were 7.5 times as many successful procedures as unsuccessful procedures.

  • However, to compare Methods A and B, these odds and percentages can be computed for each row (or column) separately.
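
Computing the summaries row by row can be sketched as below. The cell counts are not listed in these notes; the values used here are inferred from the figures that are quoted (357 small-stone procedures, 315 successes and 42 failures, and odds of success of 13.5 for Method A and 6.5 for Method B), so treat them as an assumption if the original table differs. The name `summarise` is a hypothetical helper.

```python
# Small kidney stones: counts for each method (inferred, see the note above).
small_stones = {
    "A": {"success": 81, "failure": 6},
    "B": {"success": 234, "failure": 36},
}

def summarise(row):
    """Return (percentage successful, odds of success) for one row."""
    total = row["success"] + row["failure"]
    return 100 * row["success"] / total, row["success"] / row["failure"]

# One summary per row, so the two methods can be compared.
summary = {method: summarise(row) for method, row in small_stones.items()}
for method, (pct, odds_of_success) in summary.items():
    print(f"Method {method}: {pct:.1f}% successful, odds of success {odds_of_success:.1f}")

# Ignoring the method reproduces the overall percentage and odds.
total_success = sum(row["success"] for row in small_stones.values())
total_failure = sum(row["failure"] for row in small_stones.values())
overall_pct = 100 * total_success / (total_success + total_failure)
overall_odds = total_success / total_failure
```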

Graphs

  • When a qualitative variable is compared across different groups (i.e., comparing between individuals), options for plotting include:

    • Stacked bar charts

    • Side-by-side bar charts

    • Dot charts

Stacked bar charts

  • The data can be graphed by using a bar for each level of one variable, and stacking within each bar the levels of the second variable.

  • For the kidney-stone data, a stacked bar chart can be created.

  • Numbers or percentages within each group can be used.

Comparing odds: odds ratios

  • For the small kidney stone data:

    • Method A: the odds of success are 13.5 (13.5 times as many successes as failures).

    • Method B: the odds of success are 6.5 (6.5 times as many successes as failures).

  • The odds of success for Method A and Method B are quite different.

  • In the sample, the odds of success for Method A are 13.5 \div 6.5 = 2.08 times the odds of success for Method B.

  • This value is the odds ratio (OR).

  • The sample odds ratio is a statistic, and the (unknown) population odds ratio is a parameter.

  • Definition: The odds ratio (OR) is the odds of an event in one group divided by the odds of the same event in another group:
    \text{Odds ratio} = \frac{\text{Odds of an event in Group A}}{\text{Odds of the same event in Group B}}

  • The odds ratio in jamovi and many other software programs can be interpreted in either of these ways (both are correct):

    • The odds compare Row 1 counts to Row 2 counts, for both columns. The odds ratio then compares the Column 1 odds to the Column 2 odds.

    • The odds compare Column 1 counts to Column 2 counts, for both rows. The odds ratio then compares the Row 1 odds to the Row 2 odds.

  • Odds and odds ratios are computed with the first row and first column values on the top of the fraction.

  • The OR compares the odds of the same event (e.g., success) in two different groups (e.g., Method A and Method B).

  • This means that a 2 × 2 table can be summarised using one number: the odds ratio (OR).

  • Take care interpreting odds ratios:

    • Odds ratio greater than 1: the odds of the event are greater for the group on the top of the division than for the group on the bottom.

    • Odds ratio equal to 1: the odds of the event are the same for both groups (on the top and the bottom of the division).

    • Odds ratio less than 1: the odds of the event are smaller for the group on the top of the division than for the group on the bottom.
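
The odds ratio for a 2 × 2 table can be computed directly from the four counts. The sketch below uses a hypothetical helper `odds_ratio` and the small-stone counts inferred from the figures in these notes (an assumption, since the table itself is not reproduced here). It also checks that taking the odds within rows or within columns gives the same odds ratio.

```python
def odds_ratio(table):
    """Odds ratio for a 2x2 table [[a, b], [c, d]].

    Odds are taken within each row (first column over second column), and
    the first row's odds go on top of the ratio: (a / b) / (c / d).
    """
    (a, b), (c, d) = table
    return (a / b) / (c / d)

# Rows: Method A, Method B. Columns: success, failure (inferred counts).
small_stones = [[81, 6], [234, 36]]
or_a_vs_b = odds_ratio(small_stones)
print(f"OR = {or_a_vs_b:.2f}")

# Taking odds within columns instead gives the same value: (a / c) / (b / d).
(a, b), (c, d) = small_stones
or_by_columns = (a / c) / (b / d)
```

Swapping the two rows would give the reciprocal (about 0.48), which describes the same comparison with Method B on the top of the division.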

Example: all kidney stones

  • Comparing the two methods gives a result that seems strange:

    • Method A performs better for small kidney stones and for large kidney stones.

    • Method B performs better when size is unknown (i.e., ignoring size).

  • The size of the stone is a confounding variable.

  • This confounding could have been avoided by randomly allocating a treatment method to patients.

  • However, random allocation was not possible in this study.

  • This reversal, where the comparison within each group changes direction when the groups are combined, is called Simpson’s paradox.
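
The reversal can be checked numerically. In the sketch below, the small-stone counts are inferred from the figures in these notes, and the large-stone counts are an assumption: they are the figures commonly reported for this study and are not given in these notes. Both sets are consistent with each method being used on 350 patients.

```python
# (successes, failures) for each stone size and treatment method.
# Small-stone counts are inferred from the figures in these notes.
# Large-stone counts are an assumption (commonly reported for this study).
counts = {
    "small": {"A": (81, 6), "B": (234, 36)},
    "large": {"A": (192, 71), "B": (55, 25)},
}

def success_pct(successes, failures):
    """Percentage of procedures that were successful."""
    return 100 * successes / (successes + failures)

# Success percentage for each method within each stone size.
by_size = {
    size: {method: success_pct(*cell) for method, cell in methods.items()}
    for size, methods in counts.items()
}

# Ignore stone size by adding the counts across sizes for each method.
pooled = {
    method: (
        sum(counts[size][method][0] for size in counts),
        sum(counts[size][method][1] for size in counts),
    )
    for method in ("A", "B")
}
pooled_pct = {method: success_pct(*cell) for method, cell in pooled.items()}

for size, pcts in by_size.items():
    print(f"{size} stones: A {pcts['A']:.1f}%, B {pcts['B']:.1f}%")
print(f"all stones: A {pooled_pct['A']:.1f}%, B {pooled_pct['B']:.1f}%")
```

Under these counts, Method A was used mostly on large stones (263 of its 350 patients), which have lower success percentages for both methods. That imbalance pulls the pooled percentage for Method A below the pooled percentage for Method B, even though Method A does better within each stone size.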