Chapter 3 - Describing, Exploring, and Comparing Data
A measure of center is a value at the center/middle of a data set
Mean (or arithmetic mean) of a set of data is the measure of center found by adding all of the data values and dividing the total by the number of data values
Sample means drawn from the same population tend to vary LESS than other measures of center
The mean uses every data value
A disadvantage of the mean is that just 1 extreme value (outlier) can change the value of the mean substantially (not resistant)
Mean = sum of all data values / number of data values
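As a small illustration of the formula above and of why the mean is not resistant, here is a minimal Python sketch. The function name and the sample data are made up for illustration only.

```python
def mean(values):
    """Arithmetic mean: sum of the data values divided by their count."""
    if not values:
        raise ValueError("mean requires at least one data value")
    return sum(values) / len(values)


# Illustrative data (not from the notes): one extreme value shifts the mean a lot.
data = [2, 4, 6, 8, 10]
with_outlier = data + [100]

mean_plain = mean(data)             # 30 / 5
mean_outlier = mean(with_outlier)   # 130 / 6
print(mean_plain, mean_outlier)
```

Adding the single value 100 moves the mean from the middle of the original five values to a point larger than every one of them.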
A statistic is resistant if the presence of extreme values (outliers) does not cause it to change very much
The median of a data set is the measure of center that is the middle value when the original data values are arranged in order of increasing/decreasing magnitude
Median does not change by large amounts when we include just a few extreme values, so the median is a resistant measure of center
The median does not directly use every data value
If the number of data values is ODD, the median is the number located in the exact middle of the sorted list. If the number of data values is EVEN, the median is found by computing the mean of the 2 middle numbers in the sorted list
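The odd/even rule can be sketched in Python as follows. This is a minimal version for illustration; the function name and test values are hypothetical, and Python's built-in statistics.median does the same job.

```python
def median(values):
    """Middle value of the sorted data; mean of the 2 middle values if the count is even."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median requires at least one data value")
    mid = n // 2
    if n % 2 == 1:
        # Odd count: the exact middle of the sorted list.
        return ordered[mid]
    # Even count: mean of the 2 middle numbers.
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([5, 1, 3]))            # odd number of values
print(median([4, 1, 3, 2]))         # even number of values
print(median([1, 2, 3, 4, 1000]))   # the extreme value barely matters
```

The last call shows resistance: replacing 1000 with 5 would leave the median unchanged.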
The mode of a data set is the value(s) that occur(s) with the greatest frequency
The mode can be found with qualitative data
A data set can have no mode or 1 mode or multiple modes
When 2 data values occur with the same greatest frequency, each one is a mode and the data set is said to be bimodal
When more than 2 data values occur with the same greatest frequency, each one is a mode and the data set is said to be multimodal
When no data value is repeated, we say there is no mode
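A possible way to find the mode(s) in Python, following the rules above (no mode when nothing repeats, every tied value counts as a mode). The function name is hypothetical and the data are made up.

```python
from collections import Counter


def modes(values):
    """Return a sorted list of modes; an empty list means there is no mode."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    if top == 1:
        # No data value is repeated, so there is no mode.
        return []
    return sorted(v for v, c in counts.items() if c == top)


print(modes([1, 2, 2, 3]))             # one mode
print(modes([1, 1, 2, 2, 3]))          # bimodal
print(modes([1, 2, 3]))                # no mode
print(modes(["red", "blue", "red"]))   # works with qualitative data
```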
The midrange of a data set is the measure of center that is the value midway between the max and min values in the original data set
Midrange = (max data value + min data value) / 2
The midrange is not resistant
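A short sketch of the midrange formula in Python, with made-up data showing why it is not resistant: it depends only on the two most extreme values.

```python
def midrange(values):
    """Value midway between the maximum and minimum data values."""
    if not values:
        raise ValueError("midrange requires at least one data value")
    return (max(values) + min(values)) / 2


print(midrange([1, 2, 3, 4, 5]))      # 3.0
print(midrange([1, 2, 3, 4, 1000]))   # 500.5, pulled far away by one extreme value
```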
Round off rules for measures of center:
For the mean, median, and midrange, carry 1 more decimal place than is present in the original set of values
For the mode, leave the value as is without rounding
The mean from a frequency distribution = (sum of products of each frequency and class midpoint) / (sum of frequencies)
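The frequency-distribution formula can be sketched like this. The class limits and frequencies are invented for illustration, and each class midpoint is assumed to be the average of the lower and upper class limits.

```python
def grouped_mean(class_limits, frequencies):
    """Approximate mean from a frequency distribution using class midpoints."""
    if len(class_limits) != len(frequencies):
        raise ValueError("need one frequency per class")
    total = sum(frequencies)
    if total == 0:
        raise ValueError("sum of frequencies must be positive")
    midpoints = [(low + high) / 2 for low, high in class_limits]
    return sum(f * m for f, m in zip(frequencies, midpoints)) / total


# Hypothetical classes 0-9, 10-19, 20-29 with frequencies 2, 5, 3.
classes = [(0, 9), (10, 19), (20, 29)]
freqs = [2, 5, 3]
approx_mean = grouped_mean(classes, freqs)   # (2*4.5 + 5*14.5 + 3*24.5) / 10
print(approx_mean)
```

The result is an approximation, because every value in a class is treated as if it were equal to the class midpoint.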
A weighted mean is computed when different x data values are assigned different weights w
Weighted mean = (sum of product of weight and value) / (sum of weights)
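A minimal Python sketch of the weighted mean. The grade-point example (grade values weighted by credit hours) is a made-up illustration.

```python
def weighted_mean(values, weights):
    """Sum of (weight * value) divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("need one weight per value")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("sum of weights must be nonzero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight


# Hypothetical grade points (x) weighted by credit hours (w).
grade_points = [4, 3, 2]
credits = [3, 4, 1]
gpa = weighted_mean(grade_points, credits)   # (12 + 12 + 2) / 8
print(gpa)
```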
Round off rule: when rounding the value of a measure of variation, carry 1 more decimal place than what is in the original set of data
Range = maximum data value - minimum data value
The range is not resistant
The range does not take every value into account (it uses only the maximum and minimum values)
The standard deviation of a set of sample values, denoted by s, is a measure of how much the data values deviate from the mean.
The value of s is never negative. It is 0 only when all the data values are exactly the same.
A larger value of s indicates a greater amount of variation.
s is a biased estimator of the population standard deviation
Range rule of thumb for identifying significant values: significantly low values are (mean - 2s) or lower, and significantly high values are (mean + 2s) or higher
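A sketch of the range rule of thumb in Python, assuming its usual statement: values at or below mean - 2s are significantly low and values at or above mean + 2s are significantly high. The function name and data are hypothetical.

```python
import statistics


def significance_limits(values):
    """Return (low_limit, high_limit) = (mean - 2s, mean + 2s) for sample data."""
    m = statistics.mean(values)
    s = statistics.stdev(values)   # sample standard deviation (divides by n - 1)
    return m - 2 * s, m + 2 * s


# Illustrative data: mean 14, s = sqrt(10).
sample = [10, 12, 14, 16, 18]
low_limit, high_limit = significance_limits(sample)
print(low_limit, high_limit)
```

A value would then be flagged as significant if it falls at or outside these two limits.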
The variance of a set of values is a measure of variation equal to the square of the standard deviation.
The units of variance are the squares of the units of the original data values.
The sample variance is an unbiased estimator of the population variance.
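The notes do not spell out the formulas, so the sketch below assumes the standard sample definitions: the variance is the sum of squared deviations from the mean divided by n - 1, and s is its square root. Function names and data are made up for illustration.

```python
import math


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("sample variance requires at least 2 data values")
    m = sum(values) / n
    return sum((x - m) ** 2 for x in values) / (n - 1)


def sample_std(values):
    """Sample standard deviation s: the square root of the sample variance."""
    return math.sqrt(sample_variance(values))


data = [1, 3, 14]                  # mean 6, squared deviations 25 + 9 + 64 = 98
print(sample_variance(data))       # 98 / 2
print(sample_std(data))
print(sample_std([5, 5, 5]))       # all values equal, so s is 0
```

Dividing by n - 1 rather than n is what makes the sample variance an unbiased estimator of the population variance.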
The mean absolute deviation is the mean distance of the data from the mean.
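A brief sketch of the mean absolute deviation in Python. The function name and data are hypothetical.

```python
def mean_absolute_deviation(values):
    """Mean of the absolute distances between each data value and the mean."""
    if not values:
        raise ValueError("requires at least one data value")
    m = sum(values) / len(values)
    return sum(abs(x - m) for x in values) / len(values)


data = [1, 3, 14]                        # mean 6, distances 5, 3, 8
mad = mean_absolute_deviation(data)      # 16 / 3
print(mad)
```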
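Empirical rule (68-95-99.7) for data with a bell-shaped distribution: about 68% of values fall within 1 standard deviation of the mean, about 95% within 2 standard deviations, and about 99.7% within 3 standard deviations.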
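Chebyshev's theorem applies to ANY data set, unlike the empirical rule: the proportion of values lying within K standard deviations of the mean is at least 1 - 1/K^2, where K > 1.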
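To compare the two rules, the sketch below evaluates the Chebyshev lower bound, assuming the standard statement of the theorem (at least 1 - 1/K^2 of the values lie within K standard deviations of the mean, for K > 1). The function name is hypothetical.

```python
def chebyshev_lower_bound(k):
    """Minimum proportion of values within k standard deviations of the mean (k > 1)."""
    if k <= 1:
        raise ValueError("Chebyshev's theorem is informative only for k > 1")
    return 1 - 1 / k ** 2


# Empirical rule figures apply only to bell-shaped distributions.
empirical = {2: 0.95, 3: 0.997}

for k in (2, 3):
    print(k, chebyshev_lower_bound(k), empirical[k])
```

The Chebyshev figures are smaller because they are guaranteed minimums for any distribution, while the empirical rule gives approximate values for bell-shaped data only.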
The coefficient of variation (or CV) describes the standard deviation relative to the mean, expressed as a percent: CV = (s / mean) * 100%
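A minimal sketch of the coefficient of variation for sample data, assuming the usual definition CV = (s / mean) * 100%. The function name and the data are illustrative only.

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation as a percent of the sample mean."""
    m = statistics.mean(values)
    if m == 0:
        raise ValueError("CV is undefined when the mean is 0")
    return statistics.stdev(values) / m * 100


data = [1, 3, 14]                       # s = 7, mean = 6
cv = coefficient_of_variation(data)     # 7 / 6 * 100
print(cv)
```

Because the units cancel, the CV allows variation to be compared between data sets measured on different scales.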
A z-score is the number of standard deviations that a given value x is above or below the mean: z = (x - mean) / s
z-scores have no units of measurement
A data value is significantly low if its z-score is less than or equal to -2, or significantly high if the z-score is greater than or equal to 2
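The z-score and the cutoffs of -2 and 2 can be sketched in Python as follows, assuming the usual sample formula z = (x - mean) / s. Function names and numbers are hypothetical.

```python
def z_score(x, mean, s):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    if s <= 0:
        raise ValueError("standard deviation must be positive")
    return (x - mean) / s


def classify(z):
    """Label a z-score using the cutoffs z <= -2 and z >= 2."""
    if z <= -2:
        return "significantly low"
    if z >= 2:
        return "significantly high"
    return "not significant"


# Illustrative values: mean 14, s = 3.
for x in (8, 14, 20):
    z = z_score(x, 14, 3)
    print(x, z, classify(z))
```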
Percentiles are one type of quantile (or fractile); quantiles partition data into groups with roughly the same number of values in each group.
Percentiles are measures of location that divide a set of data into 100 groups with about 1% of the values in each group
Percentile of value x = ( (number of values < x) / total # of values ) * 100
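The percentile formula above translates directly into Python. The function name and data are made up; the result is left unrounded here, although it is commonly rounded to the nearest whole number.

```python
def percentile_of(x, values):
    """Percent of the data values that are strictly less than x."""
    if not values:
        raise ValueError("requires at least one data value")
    below = sum(1 for v in values if v < x)
    return 100 * below / len(values)


data = [10, 20, 30, 40, 50]
print(percentile_of(40, data))   # 3 of the 5 values are less than 40
print(percentile_of(10, data))   # no values are less than the minimum
```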
Quartiles are measures of location, denoted Q1, Q2, Q3, which divide a set of data into 4 groups with about 25% of the values in each group
The 5-number summary consists of: minimum, Q1, Q2 (or the median), Q3, maximum
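A possible sketch of the 5-number summary in Python. Quartile conventions differ between textbooks and software; this version assumes Q1 and Q3 are the medians of the lower and upper halves of the sorted data (leaving out the middle value when the count is odd), which may not match the procedure used in a given course. The function name and data are hypothetical.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, Q1, Q2, Q3, maximum) using medians of the two halves."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("requires at least 2 data values")
    half = n // 2
    lower = ordered[:half]
    # For an odd count, skip the middle value when forming the upper half.
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return (
        ordered[0],
        statistics.median(lower),
        statistics.median(ordered),
        statistics.median(upper),
        ordered[-1],
    )


print(five_number_summary([1, 2, 3, 4, 5, 6, 7, 8]))
print(five_number_summary([7, 1, 4, 2, 6, 3, 5]))
```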
A boxplot is a graph of a data set that consists of a line extending from the minimum value to the maximum value, and a box with lines drawn at the first quartile, the median, and the third quartile.
A modified boxplot is a regular boxplot constructed with these modifications:
a special symbol used to identify outliers
the solid horizontal line extends only as far as the minimum data value that is not an outlier and the maximum data value that is not an outlier
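The notes do not say how outliers are identified for a modified boxplot. The sketch below assumes the common 1.5 x IQR criterion (a value is an outlier if it is below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR, where IQR = Q3 - Q1). The function name is hypothetical, and Q1 and Q3 are passed in so that any quartile convention can be used.

```python
def modified_boxplot_parts(values, q1, q3):
    """Return (line_start, line_end, outliers) for a modified boxplot.

    Assumes the 1.5 * IQR criterion for outliers, which is a common
    convention and not something stated in the notes.
    """
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [v for v in values if low_fence <= v <= high_fence]
    outliers = sorted(v for v in values if v < low_fence or v > high_fence)
    # The line runs only to the most extreme values that are not outliers.
    return min(inside), max(inside), outliers


# Illustrative data with Q1 = 2.5 and Q3 = 6.5 (medians of the two halves).
data = [1, 2, 3, 4, 5, 6, 7, 30]
parts = modified_boxplot_parts(data, 2.5, 6.5)
print(parts)
```

Here the fences are -3.5 and 12.5, so 30 would be drawn with the special outlier symbol and the line would stop at 7.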