1.2: Histograms, Box Plots, Outliers, and Standard Deviation

Introduction to Histograms

  • Histogram: a bar graph for quantitative data
  • The horizontal axis is divided into classes
  • Each class needs to cover the same range of values
  • Generally, 5-7 classes is a good minimum
  • The more classes, the more detail/nuance shown
  • The vertical axis measures how much data is in each class
  • The bars must be touching
  • If a data point is on the break of a class group (on a tick mark on the x-axis), it is included in the right bar
  • Frequency histogram: a histogram showing the number of data points
  • Relative frequency histogram: a histogram showing the percent of data
    • Can be made by taking the frequency in each class and dividing it by the total number of data points
  • The center is generally found by estimation, especially if only a graph is given
  • A histogram displays how many pieces of data are in each class
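As a rough sketch of how the class counts behind a histogram could be tallied, the Python below sorts values into equal-width classes and converts the frequencies to relative frequencies. The function name `class_counts`, the class width of 5, and the starting value of 0 are illustrative assumptions; the data set is borrowed from the outlier example in these notes.

```python
def class_counts(data, start, width, num_classes):
    """Count how many values fall in each equal-width class.

    Class i covers start + i*width up to (but not including)
    start + (i+1)*width, so a value sitting exactly on a class
    boundary is counted in the bar to its right.
    """
    counts = [0] * num_classes
    for x in data:
        i = int((x - start) // width)
        if 0 <= i < num_classes:  # values outside all classes are skipped
            counts[i] += 1
    return counts


data = [18, 19, 13, 2, 15, 19, 15, 31, 17, 16, 29]

# Frequency histogram: number of data points per class (0-5, 5-10, ..., 30-35)
freq = class_counts(data, start=0, width=5, num_classes=7)

# Relative frequency histogram: each frequency divided by the total count
rel_freq = [count / len(data) for count in freq]

print(freq)
print([round(r, 3) for r in rel_freq])
```

Note that the two values equal to 15 sit on a class boundary and are counted in the 15-20 class, not the 10-15 class.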

Histograms must have

  • Consistent scales on both axes
  • Labels for both axes
  • A break on the x-axis if it does not start at 0
  • The y-axis starting at 0

Outliers

  • Data points are considered outliers if they lie…

  • Eg. data set: 18, 19, 13, 2, 15, 19, 15, 31, 17, 16, 29

    • Q1 = 15
    • Q3 = 19
    • IQR = Q3 - Q1 = 4
    • Q1 - 1.5(IQR) = 15 - 1.5(4) = 9 → anything below 9 is an outlier
    • Q3 + 1.5(IQR) = 19 + 1.5(4) = 25 → anything above 25 is an outlier
    • So, 2, 29, and 31 are outliers
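The worked example above can be checked with a short Python sketch. It assumes quartiles are found as the medians of the lower and upper halves of the sorted data (leaving out the middle value when the count is odd), which reproduces the Q1 and Q3 in the example; other quartile conventions can give slightly different values. The function names are illustrative.

```python
from statistics import median


def quartiles(data):
    """Return (Q1, Q3) as the medians of the lower and upper halves."""
    s = sorted(data)
    half = len(s) // 2
    lower = s[:half]
    # Skip the middle value when the number of data points is odd
    upper = s[half + 1:] if len(s) % 2 else s[half:]
    return median(lower), median(upper)


def find_outliers(data):
    """Return the values beyond 1.5 * IQR from the quartiles."""
    q1, q3 = quartiles(data)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return sorted(x for x in data if x < low_fence or x > high_fence)


data = [18, 19, 13, 2, 15, 19, 15, 31, 17, 16, 29]
q1, q3 = quartiles(data)
print(q1, q3, q3 - q1)
print(find_outliers(data))
```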

Choosing Relevant Measurements

Mean and Standard Deviation

  • Mean is the arithmetic average of a data set: the sum of the values divided by the number of values
  • Standard deviation measures the spread of data about the mean
  • Standard deviation uses the same units as the original data
  • Skew and outliers influence both mean and standard deviation
    • Skew: the extent to which a graph is pulled to one side rather than being centered around the middle
    • If a data set is skewed or contains outliers, mean and standard deviation can be misleading and generally should not be used to summarize it
  • These measurements work well when data is approximately symmetrical with no outliers

Median, Quartiles, Range, and IQR

  • Resistant to outliers
  • These measurements work well when data is skewed and/or contains outliers
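To illustrate what "resistant" means here, the following sketch compares how the mean and the median react when one extreme value is added. The base values are the non-outlier points from the example data set in these notes; the added value of 100 is hypothetical, chosen only to make the effect visible.

```python
from statistics import mean, median

# Roughly symmetric data with no outliers
base = [13, 15, 15, 16, 17, 18, 19, 19]

# Same data plus one hypothetical extreme value
with_outlier = base + [100]

mean_shift = mean(with_outlier) - mean(base)
median_shift = median(with_outlier) - median(base)

print("mean:  ", mean(base), "->", round(mean(with_outlier), 2))
print("median:", median(base), "->", median(with_outlier))
```

The single extreme value pulls the mean up by several units, while the median moves by only half a unit.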

Measuring the Spread of Data

  • Range = maximum - minimum
  • IQR = Q3 - Q1
  • Standard Deviation
    • x̄ = mean
    • Standard deviation measures the rough average distance between each point and the mean
    • Larger standard deviations indicate that there is more data further from the mean
    • Moderate standard deviations indicate that data is moderately spread around the mean
    • Smaller standard deviations indicate that there is more data clumped closer to the mean
  • Variance
    • Variance is equal to the square of the standard deviation (equivalently, standard deviation is the square root of variance)
  • Remember to always plot data; measures of spread and center only display specific facts about a data set, but graphs give the best overall pictures of distributions
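The measures of spread listed above can be computed with a short Python sketch. The notes do not say whether the standard deviation divides by n or by n - 1; this sketch assumes the sample version (dividing by n - 1), which is what `statistics.stdev` and `statistics.variance` compute. The data set is the one from the outlier example.

```python
import math
from statistics import mean, stdev, variance

data = [18, 19, 13, 2, 15, 19, 15, 31, 17, 16, 29]

# Range = maximum - minimum
data_range = max(data) - min(data)

# Sample variance by hand: squared distances from the mean, divided by n - 1
x_bar = mean(data)
n = len(data)
sample_var = sum((x - x_bar) ** 2 for x in data) / (n - 1)

# Standard deviation is the square root of the variance
sample_sd = math.sqrt(sample_var)

print("range:", data_range)
print("variance:", round(sample_var, 2))
print("standard deviation:", round(sample_sd, 2))
```

Because the standard deviation is in the same units as the data while the variance is in squared units, the standard deviation is usually the easier of the two to interpret.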
