1.2: Histograms, Box Plots, Outliers, and Standard Deviation
Histogram: a bar graph for quantitative data
The horizontal axis is divided into classes
Each class needs to cover the same range of values
Generally, 5-7 classes is a good minimum
The more classes, the more detail/nuance shown
The vertical axis measures how much data is in each class
The bars must be touching
If a data point is on the break of a class group (on a tick mark on the x-axis), it is included in the right bar
Frequency histogram: a histogram showing the number of data points in each class
Relative frequency histogram: a histogram showing the percent of data in each class
Can be made by taking the frequency in each class and dividing it by the total number of data points
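A minimal sketch of how the class counts and relative frequencies could be computed. The function name, the class width of 5, and the starting value of 0 are assumptions chosen for illustration; the data is the example set used later in these notes. A value that falls exactly on a class boundary is counted in the bar to its right.

```python
from collections import Counter

def histogram_counts(data, start, width, num_classes):
    """Count data points per equal-width class.

    A value on a class boundary is placed in the class to its right,
    because floor division maps the boundary to the higher class index.
    """
    counts = Counter()
    for x in data:
        index = int((x - start) // width)
        if 0 <= index < num_classes:
            counts[index] += 1
    return [counts[i] for i in range(num_classes)]

data = [18, 19, 13, 2, 15, 19, 15, 31, 17, 16, 29]

# Seven classes of width 5: 0-5, 5-10, ..., 30-35 (hypothetical choice).
freq = histogram_counts(data, start=0, width=5, num_classes=7)

# Relative frequency = class frequency / total number of data points.
rel_freq = [f / len(data) for f in freq]

print(freq)
print([round(r, 3) for r in rel_freq])
```

Here the two values equal to 15 sit on the boundary between the 10-15 and 15-20 classes and are counted in the 15-20 bar.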
The center is generally found by estimation, especially if only a graph is given
A histogram displays how many pieces of data are in each class
Consistent scales on both axes
Labels for both axes
A break on the x-axis if it does not start at 0
The y-axis starting at 0
Data points are considered outliers if they lie…
E.g., data set: 18, 19, 13, 2, 15, 19, 15, 31, 17, 16, 29
Q1 = 15
Q3 = 19
IQR = Q3 - Q1 = 4
Q1 - 1.5(IQR) = 15 - 1.5(4) = 9 → anything below 9 is an outlier
Q3 + 1.5(IQR) = 19 + 1.5(4) = 25 → anything above 25 is an outlier
So, 2, 29, and 31 are outliers
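The worked example above can be checked with a short sketch. It assumes quartiles are found as the medians of the lower and upper halves of the sorted data, leaving out the overall median when the number of data points is odd; that convention reproduces Q1 = 15 and Q3 = 19 here, but other software may use a different quartile rule. The function names are illustrative.

```python
from statistics import median

def quartiles(data):
    """Return (Q1, Q3) as medians of the lower and upper halves.

    Assumes at least two data points. When n is odd, the overall
    median is excluded from both halves.
    """
    s = sorted(data)
    half = len(s) // 2
    return median(s[:half]), median(s[-half:])

def iqr_outliers(data):
    """Return the values below Q1 - 1.5(IQR) or above Q3 + 1.5(IQR)."""
    q1, q3 = quartiles(data)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [x for x in data if x < low_fence or x > high_fence]

data = [18, 19, 13, 2, 15, 19, 15, 31, 17, 16, 29]
q1, q3 = quartiles(data)
outliers = sorted(iqr_outliers(data))

print(q1, q3)      # 15 19
print(outliers)    # [2, 29, 31]
```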
Mean is the arithmetic average of a set of data: the sum of the values divided by the number of values
Standard deviation is the spread of data about the mean
Standard deviation uses the same units as the original data
Skew and outliers influence both mean and standard deviation
Skew: the extent to which a graph is pulled to one side or centered around the middle
If a data set is skewed or contains outliers, mean and standard deviation should not be used to summarize it
These measurements work well when data is approximately symmetrical with no outliers
Resistant to outliers
These measurements work well when data is skewed and/or contains outliers
Range = maximum - minimum
IQR = Q3 - Q1
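As a rough illustration of what resistance to outliers means, the sketch below computes the range and IQR for the example data set with and without its three outliers. The quartile rule (medians of the lower and upper halves) and the function names are assumptions made for this example.

```python
from statistics import mean, median

def iqr(data):
    """IQR = Q3 - Q1, with quartiles taken as medians of the two halves."""
    s = sorted(data)
    half = len(s) // 2
    return median(s[-half:]) - median(s[:half])

def summary(data):
    """Collect a few measures of center and spread for comparison."""
    return {
        "mean": mean(data),
        "median": median(data),
        "range": max(data) - min(data),
        "iqr": iqr(data),
    }

full = [18, 19, 13, 2, 15, 19, 15, 31, 17, 16, 29]
trimmed = [x for x in full if x not in (2, 29, 31)]  # outliers removed

print(summary(full))
print(summary(trimmed))
```

Removing the outliers drops the range from 29 to 6, while the IQR only moves from 4 to 3.5, which is why the IQR is the safer measure of spread when outliers are present.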
Standard Deviation
x̄ = mean
Standard deviation measures the rough average distance between each point and the mean
Larger standard deviations indicate that there is more data further from the mean
Moderate standard deviations indicate that data is moderately spread around the mean
Smaller standard deviations indicate that there is more data clumped closer to the mean
Variance
Variance is equal to the square of the standard deviation (so standard deviation is the square root of variance)
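The notes do not state the formula, so the sketch below assumes the sample version commonly taught in introductory courses, s = sqrt( sum of (x - x̄)² / (n - 1) ), with variance as the square of the standard deviation. If the course uses the population version, the divisor would be n instead of n - 1. The two small data sets are hypothetical and chosen only to contrast tight and wide spread.

```python
import math
from statistics import mean

def sample_variance(data):
    """Sum of squared deviations from the mean, divided by n - 1 (assumed sample formula)."""
    xbar = mean(data)
    return sum((x - xbar) ** 2 for x in data) / (len(data) - 1)

def sample_std_dev(data):
    """Standard deviation is the square root of the variance."""
    return math.sqrt(sample_variance(data))

tight = [9, 10, 11]    # hypothetical data clumped near the mean
spread = [0, 10, 20]   # hypothetical data far from the same mean

print(sample_std_dev(tight), sample_variance(tight))
print(sample_std_dev(spread), sample_variance(spread))
```

Both sets have a mean of 10, but the second has the larger standard deviation because its points lie farther from the mean.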
Remember to always plot data; measures of spread and center only display specific facts about a data set, but graphs give the best overall pictures of distributions