Protocols
- If the research question (RQ) is well-constructed and the study is well-designed, the data collection process should be easy to describe.
- Data collection can be time-consuming, so it's important to collect data correctly the first time.
- A plan should be established and documented before data collection, explaining how data will be obtained, including operational definitions.
- This plan is called a protocol.
- A pilot study (or practice run) is often conducted to check the protocol and identify potential problems.
Pilot Study Definition
A pilot study is a small test run of the study protocol used to check that the protocol is appropriate and practical, and to identify and hence fix possible problems with the study design or protocol.
A pilot study allows the researcher to:
- Determine the feasibility of the data collection protocol.
- Identify unforeseen challenges.
- Obtain data to determine appropriate sample sizes.
- Potentially save time and money.
- Once finalized, protocols ensure studies are repeatable.
- Protocols should indicate how design aspects (such as blinding, random allocation) will happen.
- The final protocol, without unnecessary detail, should be reported.
- Diagrams can be useful to support explanations.
- All studies should have a well-established protocol for describing how the study was done.
- A protocol usually has at least three components that describe:
- How individuals are chosen from the population (i.e., external validity);
- How information is collected from the individuals (i.e., internal validity);
- The analyses, and what software (and version) was used.
Protocol Example
- To increase the nutritional value of cookies, researchers made cookies using pureed green peas in place of margarine (Romanchik-Cerpovicz, Jeffords, and Onyenwoke 2018).
- The protocol discussed how the individuals were chosen (p. 4).
- The researchers also described how the data was obtained from the individuals (p. 5).
- The analyses and software used were also given.
Using Questionnaires
- Collecting data using questionnaires is common for both observational and experimental studies.
- Questionnaires are very difficult to do well: question wording is crucial, and surprisingly difficult to get right (Fink 1995).
- Pilot testing questionnaires is crucial!
Questionnaire Definition
A questionnaire is a set of questions for respondents to answer.
- Questions in a questionnaire may be:
- Open-ended (respondents can write their own answers) or
- Closed (respondents select from a small number of possible answers, as in multiple-choice questions).
- Open and closed questions both have advantages and disadvantages.
- Answers to open questions more easily lend themselves to qualitative analysis.
- Avoid leading questions, which may lead respondents to answer a certain way.
- Avoid ambiguity: avoid unfamiliar terms and unclear questions.
- Avoid asking the uninformed: avoid asking respondents about issues they don’t know about. Many people will give a response even if they do not understand (such responses are worthless).
- Avoid complex and double-barrelled questions, which can be hard to understand.
- Avoid problems with ethics: avoid questions about people breaking laws, or revealing confidential or private information.
- Ensure clarity in question wording.
- Ensure options are mutually exclusive, so that answers fit into only one category.
- Ensure options are exhaustive, so that the categories cover all options.
Questionnaire Example
Consider a questionnaire asking these questions:
- Because bottles from bottled water create enormous amounts of non-biodegradable landfill and hence threaten native wildlife, do you support banning bottled water?
- Do you drink more water now?
- Are you more concerned about Coagulase-negative Staphylococcus or Neisseria pharyngis in bottled water?
- Do you drink water in plastic and glass bottles?
- Do you have a water tank installed illegally, without permission?
- Do you avoid purchasing water in plastic bottles unless it is carbonated, unless the bottles are plastic but not necessarily if the lid is recyclable?
Question 1 is leading because the expected response is obvious.
Question 2 is ambiguous: what is ‘more water now’ being compared to?
Question 3 is unlikely to be answerable, as most people will be uninformed. Nonetheless, many people will still give an opinion.
Question 4 is double-barrelled; better asked as two separate questions.
Question 5 is unlikely to be given ethical approval or to obtain truthful answers, as respondents are unlikely to admit to breaking rules.
Question 6 is unclear, since knowing what a yes or no answer means is confusing.
Question wording can be important (Jardina 2018).
- In the 2014 General Social Survey (https://gss.norc.org), when white Americans were asked for their opinion of the amount America spends on welfare, 58% of respondents answered ‘Too much’.
- However, when white Americans were asked for their opinion of the amount America spends on assistance to the poor, only 16% of respondents answered ‘Too much’.
Consider this question: Do you like this new orthotic?
This question is leading, since liking is the only option presented.
Better would be: Do you like or dislike this new orthotic?
In a study to determine the time doctors spent on patients (from Chan et al. (2008)), doctors were given the options:
- 0–5 mins;
- 5–10 mins; or
- more than 10 mins.
This is a poor question, because a respondent does not know which option to select for an answer of ‘5 minutes’.
The options are not mutually exclusive.
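The overlap can be checked by writing each option as an interval and counting how many options contain a given answer. The sketch below is illustrative only: the function name `matching_options` and the repaired option wording are assumptions, and are not taken from Chan et al. (2008).

```python
import math

# Each option is (label, lower, upper, lower_inclusive, upper_inclusive).
# As worded, the first two options both include 5 minutes.
original = [
    ("0-5 mins", 0, 5, True, True),
    ("5-10 mins", 5, 10, True, True),
    ("more than 10 mins", 10, math.inf, False, True),
]

# One possible repair (hypothetical wording): half-open intervals.
repaired = [
    ("0 to under 5 mins", 0, 5, True, False),
    ("5 to under 10 mins", 5, 10, True, False),
    ("10 mins or more", 10, math.inf, True, True),
]


def matching_options(minutes, options):
    """Return the labels of every option that contains the answer."""
    labels = []
    for label, lower, upper, lower_inc, upper_inc in options:
        above = minutes >= lower if lower_inc else minutes > lower
        below = minutes <= upper if upper_inc else minutes < upper
        if above and below:
            labels.append(label)
    return labels


print(matching_options(5, original))  # two labels: not mutually exclusive
print(matching_options(5, repaired))  # exactly one label
```

Options are mutually exclusive and exhaustive when every possible answer matches exactly one label.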
Challenges Using Questionnaires
- Using questionnaires presents myriad challenges.
- Non-response bias: common with questionnaires, as they are often used with voluntary-response samples.
- Response bias: People do not always answer truthfully.
- Recall bias: People may not be able to accurately recall past events clearly, or recall when they happened.
- Question order: The order of the questions can influence the responses.
- Interpretation: Phrases and words such as “Sometimes” and “Somewhat disagree” may mean different things to different people.
- Many of these can be managed with careful questionnaire design, but discussing the methods is beyond the scope of this course.
Classifying Variables
- Understanding the type of data collected is essential before starting any analysis.
- The type of data determines how to proceed with summaries and analyses.
- Broadly, data may be classified as either:
- quantitative data, or
- qualitative data.
- The data are the recorded values of the variables.
- So, sometimes we talk about quantitative and qualitative variables.
Variables Example
- ‘Age’ is a variable because age varies from individual to individual.
- The data includes values like 13 months, 21 years and 76 years.
- Quantitative research summarises and analyses data using numerical methods.
- Quantitative research uses both qualitative and quantitative variables, because both can be summarised and analysed numerically.
Quantitative Data
- Quantitative data are mathematically numerical.
- Most data arising from counting or measuring is quantitative.
- Quantitative data are often (but not always) measured with measurement units (such as kg or cm).
- Numerical data are not necessarily quantitative.
Quantitative Data Definition
Quantitative data is mathematically numerical: the numbers have numerical meaning and represent quantities or amounts. Quantitative data generally arise from counting or measuring.
Quantitative Data: Discrete and Continuous Data
- The weight of numbats, the thickness of sheet metal, and blood pressure are all measured, and are quantitative variables.
- The number of power failures per year, the number of solar panels per home, and the number of tangelos per tree are all counts, and are quantitative variables.
- Australian postcodes are four-digit numbers, but are not quantitative; the numbers are labels.
- A postcode of 4556 isn’t one ‘better’ or ‘more’ than a postcode of 4555.
- The values do not have numerical meanings.
- Indeed, alphabetic postcodes could have been chosen.
- For example, the postcode of Caboolture is 4510, but could have been QCAB.
- Quantitative data may be further classified as discrete or continuous.
- Discrete quantitative data has possible values that can be counted, at least in theory.
- Sometimes, the possible values may have no theoretical upper limit, yet are still considered ‘countable’.
- Continuous quantitative data has values that cannot, at least in theory, be recorded exactly.
- Another value can always be found between any two given values of the variable, if we measure to a greater number of decimal places.
- In practice, though, values must be rounded to a reasonable number of decimal places.
Quantitative Data: Discrete and Continuous Data Example
These quantitative variables are discrete:
- The number of heart attacks in the previous year experienced by Croatian women over 40. Possible values: 0, 1, 2, 3, ...
- The number of cracked eggs in a carton of 12. Possible values: 0, 1, 2, 3, ..., 12.
- The number of orthotic devices a person has used. Possible values: 0, 1, 2, 3, ...
- The number of turbine cracks after 750 hours of use. Possible values: 0, 1, 2, 3, ...
Continuous quantitative data definition
Continuous quantitative data have (at least in theory) an infinite number of possible values between any two given values.
- Height is continuous: between the heights of 179 cm and 180 cm, many heights exist, depending on how many decimal places are used to record height.
- In practice, however, heights are usually rounded to the nearest centimetre for convenience.
- All continuous data are rounded.
Quantitative Data: Discrete and Continuous Data Example
These quantitative variables are continuous:
- The weight of 6-year-old Fijian children. Values exist between any two given values of weight, by measuring to more decimal places of a kilogram. However, weights are usually reported to the nearest kilogram.
- The energy consumption of houses in London. Values exist between any two given values of energy consumption, by measuring to more and more decimal places of a kilowatt-hour (kWh). Consumption would usually be given to the nearest kWh.
- The time spent in front of a computer each day for employees in a given industry. Values exist between any two given times, by measuring to more decimal places of a second. The values may be reported to the nearest minute, or the nearest 15 mins.
- Sometimes, discrete quantitative data with a very large number of possible values may be treated as continuous.
Quantitative Data: Discrete and Continuous Data Example
Annual income is discrete: no income is between the values $80 000.00 and $80 000.01. However, annual incomes are usually much larger than cents, and vary at scales much greater than cents, and so are usually treated as continuous.
Qualitative Data
- Qualitative data have distinct labels or categories, and are not mathematically numerical.
- Be careful: numerical data may be qualitative if those numbers don’t have numerical meanings.
- The categories of a qualitative variable are called the levels or the values of the variable.
Qualitative Data Definition
Qualitative data is not mathematically numerical data: it consists of categories or labels.
Definition
The levels (or the values) of a qualitative variable refer to the names of the distinct categories.
Qualitative Data: Nominal and Ordinal Data Example
- ‘Age’ is a continuous quantitative variable, since age could be measured to many decimal places of a second.
- Age is usually rounded down to the number of completed years, for convenience.
- However, the age of young children may be given as ‘3 days’ or ‘10 months’.
- Sometimes Age group is used (such as Under 20; 20 to under 50; 50 or over) instead of Age.
- ‘Age group’ is qualitative.
- Ensure you are clear about which is used!
- ‘Brand of mobile phone’ is qualitative. Many levels (i.e., brands) are possible, but these could be simplified to the levels ‘Apple’, ‘Samsung’, ‘Google’ and ‘Other’.
- Australian postcodes are numbers, but are qualitative.
- Qualitative data can be further classified as nominal or ordinal.
- Nominal variables are qualitative variables; the levels have no natural order.
- Ordinal variables are qualitative variables; the levels do have a natural order.
- ‘Blood type’ is qualitative nominal; ‘Age group’ is qualitative ordinal.
A nominal qualitative variable definition
A nominal qualitative variable is a qualitative variable where the levels do not have a natural order.
An ordinal qualitative variable definition
An ordinal qualitative variable is a qualitative variable where the levels do have a natural order.
Qualitative Data: Nominal and Ordinal Data Example
The variable ‘How students get to university’ is nominal; the levels may be: Car (driver or passenger); Bus; Ride bicycle; Walk; Other. The data are nominal with five levels. The levels can appear in any order:
- from largest group to smallest,
- in alphabetical order, etc.
Since there is no natural order, the order used should be carefully considered: what is the most useful order when summarising the data?
Qualitative Data: Nominal and Ordinal Data Example
A questionnaire question where respondents are asked to select from options like Strongly disagree; Disagree; Neither agree nor disagree; Agree; Strongly agree will produce ordinal data.
For example, the responses to the following question will be ordinal with five levels: Please indicate the extent to which you agree or disagree with this statement: ‘Vaping should be banned’.
Giving the levels in the given order (or the reverse order) makes sense; giving the levels in alphabetical order, for example, would not make sense.
Qualitative Data: Nominal and Ordinal Data Example
Consider a study to determine if 500 g bags of pasta really weigh 500 g (or more).
One approach is to record the weight of pasta in each bag (a quantitative variable), and compare the average weight to the target weight of 500 g.
Another approach is to record whether each bag of pasta was underweight or not (using a balance scale).
This variable would be qualitative, with two levels (underweight; not underweight).
The percentage of underweight bags could be reported.
- Most statistical software packages, like jamovi, require the user to describe the variables.
- This enables the software to produce appropriate output and suggest appropriate analyses.
Summarising Quantitative Data
- Except for very small amounts of data, understanding data is difficult without a summary.
- A distribution is a way to summarise quantitative data.
The Distribution of a Variable
The distribution of a variable describes what values are present in the data, and how often those values appear.
- The distribution can be displayed using a frequency table or a graph.
- The distribution of quantitative data can be summarised numerically by:
- computing the average value,
- computing the amount of variation,
- describing the shape, and
- identifying outliers.
Frequency Tables for Quantitative Data
- Quantitative data can be collated in a frequency table.
- Group the values of the variable into appropriate intervals.
- The categories should be:
- exhaustive (cover all values), and
- exclusive (observations belong to one and only one category).
- While not essential, usually the categories are of equal size.
Example
The data are the weights of 44 babies born in a hospital on one day (Dunn 1999; Steele 1997).
The weights can be grouped into weight categories. The percentages are also added. Most babies in the sample are between 3 and 4 kg at birth.
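The baby weights themselves are not reproduced in these notes, so the sketch below builds a frequency table from the 14 chest-beating rates listed later in the notes instead. The interval width of 1 is an assumption made for illustration. Each interval includes its lower boundary but not its upper boundary, which keeps the categories exclusive and exhaustive.

```python
# Chest-beating rates (beats per 10 h) for the 14 young gorillas,
# as listed in the quartile example later in these notes.
rates = [0.7, 0.9, 1.3, 1.5, 1.5, 1.5, 1.7,
         1.7, 1.8, 2.6, 3.0, 4.1, 4.4, 4.4]

lowers = [0, 1, 2, 3, 4]  # assumed interval lower boundaries (width 1)

table = []
for lower in lowers:
    upper = lower + 1
    # Half-open interval: lower <= x < upper, so a boundary value such as 3.0
    # is counted once only, in the higher interval.
    count = sum(1 for x in rates if lower <= x < upper)
    percent = 100 * count / len(rates)
    table.append((lower, upper, count, percent))

for lower, upper, count, percent in table:
    print(f"{lower} to under {upper}: {count:2d} ({percent:.1f}%)")
```

The counts should add to the sample size; if they do not, the intervals are not exhaustive.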
Graphs
- The graphs discussed here are appropriate for continuous quantitative data.
- They may sometimes be useful for discrete quantitative data if many values are possible.
- The purpose of a graph is to display the information in the clearest, simplest possible way, to facilitate understanding the message(s) in the data.
- Graphs used to display the distribution of one quantitative variable include:
- Histogram: Best for moderate to large amounts of data.
- Stemplots: Best for small amounts of data; only sometimes useful.
- Dot chart: Used for small to moderate amounts of data.
Histograms
- Histograms are a series of boxes.
- The width of the box represents a range of values of the variable being graphed.
- The height of the box represents the number (or percentage) of observations within that range of values.
- Histograms are essentially a picture of a frequency table.
- The vertical axis can be counts (labelled as ‘Counts’, ‘Frequency’, or similar) or percentages.
- When the quantitative variable is discrete and the boxes have a width of one, sometimes the labels are placed on the axis aligned with the centre of the bar.
- When an observation occurs on a boundary between the boxes, software usually (but not universally) places it in the higher box.
- The histogram shows, for example, that 17 babies weighed 3.0 kg or more, but under 3.5 kg.
Stemplots
- Stemplots (or stem-and-leaf plots) are best described and explained using an example.
- Consider the baby data.
- In a stemplot, part of each number is placed to the left of a vertical line (the stem).
- The rest of each number is placed to the right of the line (the leaf).
- The weights are given to one decimal place of a kilogram, so the whole number of kilograms is placed to the left of the line (as the stem).
- The first decimal place is placed on the right of the line (as a leaf).
- The original data remains visible.
- For stemplots:
- place the larger unit (e.g., kilograms) on the left (stems).
- place the smaller unit (e.g., first decimal of a kilogram) on the right (leaves).
- some data do not work well with stemplots.
- round data before creating the stemplot if needed.
- the numbers in each row should be evenly spaced, placing the numbers in the columns under each other.
- within each stem, the observations are ordered so patterns can be seen.
- add an explanation for reading the stemplot (e.g., ‘2 | 6 means 2.6 kg’).
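The steps above can be sketched in a few lines of code. This is a minimal illustration, using the chest-beating rates listed later in these notes rather than the baby weights (which are not reproduced here); the variable names are arbitrary.

```python
rates = [0.7, 0.9, 1.3, 1.5, 1.5, 1.5, 1.7,
         1.7, 1.8, 2.6, 3.0, 4.1, 4.4, 4.4]

stems = {}
for x in sorted(rates):               # ordering makes patterns visible
    tenths = round(x * 10)            # e.g. 2.6 -> 26, avoiding floating-point slips
    stem, leaf = divmod(tenths, 10)   # larger unit on the left, smaller on the right
    stems.setdefault(stem, []).append(leaf)

lines = []
for stem in range(min(stems), max(stems) + 1):   # keep any empty stems too
    leaves = " ".join(str(leaf) for leaf in stems.get(stem, []))
    lines.append(f"{stem} | {leaves}".rstrip())

print("\n".join(lines))
print("Key: 2 | 6 means 2.6 beats per 10 h")
```

Printing the leaves with a single space between them keeps the columns aligned, so the length of each row acts like the bar of a histogram.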
Dot Charts
- Dot charts show the original data on a single (usually horizontal) axis, with each observation represented by a dot (or other symbol).
- Consider again the weights (in kg) of babies.
- Most babies were born between 3 and 4 kg.
- Observations have been jittered (i.e., placed with some added randomness in the vertical direction) to avoid overplotting.
- The chest-beating rate of young gorillas can be displayed using a dot chart.
- Observations are stacked on top of each other when multiple observations are the same, or very nearly so.
Describing the Distribution
- Graphs are constructed to help readers understand the data.
- The distribution of the data should be described:
- The average: What is an average, central or typical value?
- The variation: How much variation is present in the bulk of the data?
- The shape: What is the shape of the distribution? That is, are most of the values smaller or larger, or about evenly distributed between smaller and larger values?
- Mention any outliers (observations unusually large or small) or unusual features.
- These can be described in rough terms, but usually using numerical quantities.
- Weights of babies: typically between about 2.5 kg and 3 kg (the average).
- Most between 1.5 kg and 4.5 kg (variation).
- A few babies have very low weights, probably premature births (shape).
- No unusual values are present.
Parameters and Statistics
- In quantitative research, both qualitative and quantitative data are summarised and analysed numerically.
- Importantly, numerical quantities are computed from a sample, even though the whole population is of interest.
- As a result, distinguishing parameters and statistics is important.
Parameter Definition
A parameter is a number, usually unknown, describing some feature of a population.
Statistic Definition
A statistic is a number describing some feature of a sample (to estimate an unknown population parameter).
- A statistic is a numerical value estimating an unknown population value.
- Countless samples are possible, and so countless values for the statistic are possible, all of which are estimates of the value of the parameter.
- The value of the statistic that is observed depends on which one of the countless possible samples is (randomly) selected.
- The RQ identifies the population, but in practice only one of the many possible samples is studied.
- Statistics are estimates of parameters, and the value of the statistic is not the same for every possible sample.
- We only observe one value of the statistic from our single observed sample.
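A small simulation can make this idea concrete. The sketch below is hypothetical throughout: the population is simulated (a real population is almost never fully observed), and the population size, sample size and seed are arbitrary choices.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical population of 1000 values.
population = [random.gauss(50, 10) for _ in range(1000)]

# The parameter: known here only because the population was simulated.
mu = statistics.mean(population)

# Five different random samples of size 20 give five different sample means.
sample_means = [
    statistics.mean(random.sample(population, 20)) for _ in range(5)
]

print(f"Population mean (parameter): {mu:.2f}")
for i, xbar in enumerate(sample_means, start=1):
    print(f"Sample {i} mean (statistic): {xbar:.2f}")
```

In a real study, only one of these sample means would be observed.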
Summaries: Shape
- Describing the shape of a distribution can be difficult.
- Introducing terminology helps:
- Right (or positively) skewed: most data are smaller, with some larger values.
- Left (or negatively) skewed: most of the data are larger, with some smaller values.
- Symmetric data: Approximately equal numbers of values are smaller and larger.
- Bimodal data: The distribution has two peaks.
Summaries: Averages
- The average (or location, or central value) for quantitative sample data can be described in many ways.
- The most common are:
- the sample mean (or sample arithmetic mean), which estimates the population mean; and
- the sample median, which estimates the population median.
- In both cases, the population parameter is estimated by a sample statistic.
- Understanding whether to use the mean or median is important.
- ‘Average’ can refer to means, medians or other measures of centre.
- Use the precise term ‘mean’ or ‘median’, rather than ‘average’, when possible!
- Consider the daily river flow volume (‘streamflow’) at the Mary River from 01 October 1959 to 17 January 2019.
- The ‘average’ daily streamflow in February could be described using either the mean or the median:
- the mean daily streamflow is 1 123.2 ML.
- the median daily streamflow is 146.1 ML.
- These both summarise the same data, and both give an estimate of the ‘average’ daily streamflow in February, yet give very different answers.
- This implies they measure the ‘average’ differently.
- Which is the best ‘average’ to use?
- To decide, both measures of average will need to be studied.
Average: The Mean
- The mean of the population: \mu.
- The value of \mu is almost always unknown.
- The mean of the population is estimated by the mean of the sample, denoted \bar{x}.
- The value of the unknown parameter is \mu.
- The value of the statistic is \bar{x}. The sample mean estimates the population mean. All the possible samples are likely to have a different sample mean.
For gorillas aged under 20, what is the average chest-beating rate?
- The population mean rate (denoted \mu) is to be estimated.
- Clearly, every gorilla cannot be studied; a sample is studied.
- The unknown population mean is estimated using the sample mean (\bar{x}).
- Every possible sample can give a different value for \bar{x}.
- Measurements were taken from 14 young gorillas
- The sample mean is the ‘balance point’ of the observations.
- Alternatively, the mean is the value such that the positive and negative distances of the observations from the mean add to zero. Both explanations seem reasonable for identifying an ‘average’ for the data.
- The mean is one way to measure the ‘average’ value of quantitative data.
- To find the value of the sample mean, add (denoted by \sum) all the observations (denoted by x); then divide by the number of observations (denoted by n).
In symbols: \bar{x} = \frac{\sum x}{n}.
- For the chest-beating data, an estimate of the population mean (i.e., the sample mean) is found by summing all n = 14 observations and dividing by n:
\bar{x} = \frac{\sum x}{n} = \frac{0.7 + 0.9 + \cdots + 4.4}{14} = \frac{31.1}{14} = 2.221429
- The sample mean is 2.22 beats per 10 h.
- Software (such as jamovi) or a calculator (in Statistics Mode) is usually used to compute the sample mean.
- Software and calculators give numerical answers to many decimal places.
- Typically round to one or two more significant figures than the original data.
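The calculation of the sample mean can be reproduced with a short script. This is a sketch using Python's standard library rather than jamovi; the 14 values are the chest-beating rates listed in the quartile example later in these notes. It also checks the 'balance point' idea: the deviations from the mean add to zero, apart from floating-point rounding.

```python
import statistics

rates = [0.7, 0.9, 1.3, 1.5, 1.5, 1.5, 1.7,
         1.7, 1.8, 2.6, 3.0, 4.1, 4.4, 4.4]

n = len(rates)
xbar = sum(rates) / n                 # add the observations, divide by n

# Positive and negative distances from the mean cancel out.
deviations = [x - xbar for x in rates]

print(f"n = {n}")
print(f"sample mean = {xbar:.2f} beats per 10 h")
print(f"sum of deviations = {sum(deviations):.10f}")

# The library function gives the same answer as the formula.
library_mean = statistics.mean(rates)
```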
Average: The Median
- A median is a value separating the largest 50% of the data from the smallest 50% of the data.
- In a dataset with n values, the median is ordered observation number \frac{(n + 1)}{2}.
- (The median is not equal to \frac{(n + 1)}{2}, and is not halfway between the minimum and maximum values in the data.)
- Many calculators cannot find the median.
- The median has no commonly-used symbol.
- The median is one way to measure the ‘average’ value of data.
- A median is a value such that half the values are larger than the median, and half the values are smaller than the median.
- To find a sample median for the chest-beating data, first arrange the data in numerical order.
- The median separates the larger 7 numbers from the smaller 7 numbers.
- With n = 14 observations, the median is the ordered observation located between the seventh and eighth observations (i.e., at position \frac{(14 + 1)}{2} = 7.5; the median itself is not 7.5).
- The sample median is between 1.7 (ordered observation 7) and 1.7 (observation 8).
- Since these values are the same, the sample median is 1.7 beats per 10 h.
- To clarify:
- if the sample size n is odd, the median is the middle number when the observations are ordered.
- if the sample size n is even (such as the chest-beating data), the median is halfway between the two middle numbers, when the observations are ordered.
- Some software uses slightly different rules when n is even, producing slightly different values for the median. The sample median estimates the population median. All the possible samples are likely to have a different sample median.
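The rule for odd and even sample sizes can be written as a short function. The function name `median_by_position` is an assumption made for this sketch; Python's built-in `statistics.median` follows the same rule and is used as a check.

```python
import statistics

rates = [0.7, 0.9, 1.3, 1.5, 1.5, 1.5, 1.7,
         1.7, 1.8, 2.6, 3.0, 4.1, 4.4, 4.4]


def median_by_position(values):
    """Median found by ordering the data and locating the middle position."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # n odd: the single middle observation
        return ordered[middle]
    # n even: halfway between the two middle observations
    return (ordered[middle - 1] + ordered[middle]) / 2


sample_median = median_by_position(rates)
print(f"sample median = {sample_median} beats per 10 h")
```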
Which Average To Use?
- Consider the daily streamflow at the Mary River (Bellbird Creek) during February again.
- The mean daily streamflow is 1 123 ML; the median is 146.1 ML.
- Which is ‘best’ for measuring the average streamflow?
- For these data, about 86% of the observations are less than the mean.
- 50% of the values are less than the median (by definition).
- The mean is hardly a central value...
- A dot chart shows that the data are very highly right-skewed. Many very large outliers are present.
- The streamflow data are very highly right skewed:
- Means are best used for approximately symmetric data: the mean is influenced by outliers and skewness.
- Medians are best used for data that are highly skewed or contain outliers: the median is not influenced by outliers and skewness.
- Means tend to be too large if the data contains large outliers or severe right skewness.
- Means tend to be too small if the data contains small outliers or severe left skewness.
- For the Mary River data, the large outliers (which are both extreme and abundant) cause the mean to be much larger than the median. The median is the better measure of average for these data.
- The mean is generally used if possible (for practical and mathematical reasons), and is the most commonly-used measure of location.
- However, the mean is not always appropriate; the median is not influenced by outliers and skewness.
- The mean and median are similar in approximately symmetric distributions.
- Sometimes, quoting both the mean and the median may be appropriate.
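The different sensitivity of the mean and median to outliers can be seen in a small experiment. The Mary River data are not reproduced in these notes, so this sketch uses the chest-beating rates listed later in the notes and adds one extreme value. That extra value (40.0) is invented purely for illustration.

```python
import statistics

rates = [0.7, 0.9, 1.3, 1.5, 1.5, 1.5, 1.7,
         1.7, 1.8, 2.6, 3.0, 4.1, 4.4, 4.4]

# Hypothetical extreme observation, not part of the real data.
with_outlier = rates + [40.0]

mean_before = statistics.mean(rates)
mean_after = statistics.mean(with_outlier)
median_before = statistics.median(rates)
median_after = statistics.median(with_outlier)

print(f"mean:   {mean_before:.2f} -> {mean_after:.2f}")
print(f"median: {median_before:.2f} -> {median_after:.2f}")
```

One large value pulls the mean upwards noticeably, while the median does not move for these data.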
Summaries: Variation
- For quantitative data, the amount of variation in the bulk of the data should be described.
- Many ways exist to measure the variation in a dataset:
- the range: very simple and simplistic, so not often used.
- the standard deviation: commonly used.
- the interquartile range (or IQR): commonly used.
- percentiles: useful in specific situations. The sample values estimate the population values. All the possible samples are likely to have different sample values.
Variation: The Range
- The simplest measure of variation, but not often used.
- The range is the maximum value minus the minimum value.
- Not often used, as it uses only two values: the two extreme observations.
- This means the range is highly influenced by outliers.
- Sometimes, the range is given by stating both the maximum and the minimum value in the data instead of giving the difference between these values.
- The range is measured in the same measurement units as the data.
- The range is usually quoted with the median.
Example
For the chest-beating data, the largest value is 4.4, and the smallest value is 0.7; hence Range = 4.4 − 0.7 = 3.7. The sample median chest-beating rate is 1.7 beats per 10 h, with a range of 3.7 beats per 10 h.
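The same calculation as a short script, for readers who prefer to check it in software. This is a sketch only; the data are the chest-beating rates listed later in these notes.

```python
rates = [0.7, 0.9, 1.3, 1.5, 1.5, 1.5, 1.7,
         1.7, 1.8, 2.6, 3.0, 4.1, 4.4, 4.4]

smallest = min(rates)
largest = max(rates)
data_range = largest - smallest   # maximum minus minimum

# Formatting to one decimal place hides floating-point rounding error.
print(f"min = {smallest}, max = {largest}, range = {data_range:.1f} beats per 10 h")
```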
Variation: The Standard Deviation
- The population standard deviation: \sigma (the parameter).
- It is estimated by the sample standard deviation s (the statistic).
- The standard deviation is the most commonly-used measure of variation but is tedious to compute manually.
- You will almost always find the sample standard deviation s using computer software (e.g., jamovi) or a calculator (in Statistics Mode).
- The standard deviation is (approximately) the mean distance that observations are from the mean.
- This seems a reasonable measure of variation.
- The sample standard deviation estimates the population standard deviation, and every one of the possible samples is likely to have a different sample standard deviation.
Standard Deviation Definition
The standard deviation is, approximately, the average distance of the observations from the mean.
- The sample standard deviation s is:
- positive (unless all observations are the same, when it is zero: no variation);
- best used for (approximately) symmetric data;
- usually quoted with the mean;
- the most commonly-used measure of variation;
- measured in the same units as the data;
- influenced by skewness and outliers, like the mean.
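These notes do not give a formula for s. The sketch below assumes the usual sample formula, which divides the sum of squared distances from the mean by n − 1 before taking the square root. It is applied to the chest-beating rates listed later in the notes and checked against Python's `statistics.stdev`. The mean absolute distance is computed only for comparison with the 'average distance' description.

```python
import math
import statistics

rates = [0.7, 0.9, 1.3, 1.5, 1.5, 1.5, 1.7,
         1.7, 1.8, 2.6, 3.0, 4.1, 4.4, 4.4]

n = len(rates)
xbar = sum(rates) / n

# Usual sample formula (an assumption here, not stated in these notes):
# square the distances from the mean, sum, divide by n - 1, take the root.
squared_distances = [(x - xbar) ** 2 for x in rates]
s = math.sqrt(sum(squared_distances) / (n - 1))

# For comparison only: the plain mean distance from the mean.
mean_distance = sum(abs(x - xbar) for x in rates) / n

print(f"s = {s:.2f} beats per 10 h")
print(f"mean distance from the mean = {mean_distance:.2f} beats per 10 h")
```

The two values are similar but not equal, which is why the standard deviation is described as only approximately the average distance from the mean.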
Variation: The Inter-Quartile Range (IQR)
- The standard deviation uses the value of \bar{x}, so is affected by skewness like the sample mean.
- The IQR is not impacted by skewness or outliers.
- To understand the IQR, understanding quartiles is necessary.
- The first quartile Q1 is a value separating the smallest 25% of observations from the largest 75%. The Q1 is like the median of the smaller half of the data, halfway between the minimum value and the median.
- The second quartile Q2 is a value separating the smallest 50% of observations from the largest 50%. (This is also the median.)
- The third quartile Q3 is a value separating the smallest 75% of observations from the largest 25%. The Q3 is like the median of the larger half of the data, halfway between the median and the maximum value.
- Quartiles divide the data into four parts of approximately equal numbers of observations.
- The inter-quartile range (or IQR) is the difference between Q3 and Q1.
- Since the IQR measures the range of the central 50% of the data, the IQR is not influenced by outliers.
- The IQR is measured in the same measurement units as the data. The sample IQR estimates the population IQR, and every one of the possible samples is likely to have a different sample IQR.
- For the chest-beating data, the median is 1.7.
- The data then can be split into the smaller and the larger halves, each with seven values:
- Smaller half: 0.7 0.9 1.3 1.5 1.5 1.5 1.7
- Larger half: 1.7 1.8 2.6 3.0 4.1 4.4 4.4
- Since each half has seven observations, the median of each half is the (7 + 1)/2 = 4th value.
- (When n is odd, the median may or may not be included in each of these halves; we decide not to include the median in each half.)
- Q1, the first quartile, is the median of the smaller half: Q1 = 1.5.
- Q2, the second quartile or median, is 1.7.
- Q3, the third quartile, is the median of the larger half: Q3 = 3.0.
- These quartile values are approximate, as exact values cannot be found here: each quarter of the data would need to contain 14/4 = 3.5 observations, which is not possible.
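The 'median of each half' rule used above can be sketched as code, finishing with the IQR as the difference Q3 − Q1. The function names are assumptions made for this sketch. Other software may use different quartile rules, and so may report slightly different values.

```python
rates = [0.7, 0.9, 1.3, 1.5, 1.5, 1.5, 1.7,
         1.7, 1.8, 2.6, 3.0, 4.1, 4.4, 4.4]


def median_of(ordered):
    """Median of a list that is already in numerical order."""
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2


def quartiles_by_halves(values):
    """Q1, Q2, Q3 using the median of the smaller and larger halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    smaller = ordered[:half]
    # When n is odd, the median itself is left out of both halves.
    larger = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return median_of(smaller), median_of(ordered), median_of(larger)


q1, q2, q3 = quartiles_by_halves(rates)
iqr = q3 - q1

print(f"Q1 = {q1}, Q2 = {q2}, Q3 = {q3}, IQR = {iqr}")
```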
Variation: Percentiles
- Percentiles are similar in principle to quartiles.
- The pth percentile of the data is a value separating the smallest p% of the data from the rest.
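Percentiles can be computed with Python's `statistics.quantiles`, which returns the 99 cut points that divide the data into 100 parts. This is a sketch under that function's default interpolation rule, which is only one of several rules in use; with a sample as small as the 14 chest-beating rates, different rules can give noticeably different percentiles, and the 25th and 75th percentiles need not match the quartiles found by the 'median of each half' rule.

```python
import statistics

rates = [0.7, 0.9, 1.3, 1.5, 1.5, 1.5, 1.7,
         1.7, 1.8, 2.6, 3.0, 4.1, 4.4, 4.4]

# 99 cut points: cut_points[p - 1] is the pth percentile.
cut_points = statistics.quantiles(rates, n=100)


def percentile(p):
    """Return the pth percentile (p from 1 to 99) of the chest-beating rates."""
    return cut_points[p - 1]


for p in (25, 50, 75, 90):
    print(f"{p}th percentile = {percentile(p):.2f} beats per 10 h")
```

The 50th percentile is the median, so it provides a simple check on the calculation.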