1.2: Histograms, Box Plots, Outliers, and Standard Deviation

Introduction to Histograms

  • Histogram: a bar graph for quantitative data

  • The horizontal axis is divided into classes

  • Each class needs to cover the same range of values

  • Generally, 5-7 classes is a good minimum

  • The more classes, the more detail/nuance shown

  • The vertical axis measures how much data is in each class

  • The bars must be touching

  • If a data point falls exactly on a class boundary (on a tick mark on the x-axis), it is counted in the bar to its right (the higher class)

  • Frequency histogram: a histogram showing the number of data points in each class

  • Relative frequency histogram: a histogram showing the percent of the data in each class

    • Can be made by taking the frequency in each class and dividing it by the total number of data points

  • The center is generally found by estimation, especially if only a graph is given

  • A histogram displays how many pieces of data are in each class
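
The class rules above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the helper name `class_counts`, the class width of 5, and the starting value of 0 are assumptions chosen for the example, and the data set is the one used in the Outliers section of this guide.

```python
def class_counts(data, start, width, num_classes):
    """Count how many values fall in each equal-width class.

    Class k covers [start + k*width, start + (k+1)*width), so a value
    sitting exactly on a boundary is counted in the bar to its right.
    """
    counts = [0] * num_classes
    for x in data:
        k = int((x - start) // width)
        if 0 <= k < num_classes:
            counts[k] += 1
    return counts


data = [18, 19, 13, 2, 15, 19, 15, 31, 17, 16, 29]

# Seven classes of width 5 (an assumed choice): 0-5, 5-10, ..., 30-35
freq = class_counts(data, start=0, width=5, num_classes=7)

# Relative frequency = class frequency / total number of data points
rel_freq = [f / len(data) for f in freq]

for k, (f, r) in enumerate(zip(freq, rel_freq)):
    low = 5 * k
    print(f"{low:>2}-{low + 5:<2}  frequency={f}  relative={r:.1%}")
```

Both values of 15 land in the 15-20 class rather than the 10-15 class, which matches the boundary rule, and the relative frequencies add up to 100%.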

Histograms must have

  • Consistent scales on both axes

  • Labels for both axes

  • A break on the x-axis if it does not start at 0

  • The y-axis starting at 0

Outliers

  • Data points are considered outliers if they lie…

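  • E.g., data set: 18, 19, 13, 2, 15, 19, 15, 31, 17, 16, 29

    • Sorted: 2, 13, 15, 15, 16, 17, 18, 19, 19, 29, 31 (median = 17)

    • Q1 = 15 (median of the lower half)

    • Q3 = 19 (median of the upper half)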

    • IQR = Q3 - Q1 = 4

      • Q1 - 1.5(IQR) = 15 - 1.5(4) = 9 → anything below 9 is an outlier

      • Q3 + 1.5(IQR) = 19 + 1.5(4) = 25 → anything above 25 is an outlier

    • So, 2, 29, and 31 are outliers
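
The worked example above can be checked with a short Python sketch. The function names `quartiles` and `iqr_outliers` are hypothetical, and the quartile convention is an assumption: Q1 and Q3 are taken as the medians of the lower and upper halves, leaving out the overall median when the number of values is odd. Other conventions (including some software defaults) can give slightly different quartiles.

```python
from statistics import median


def quartiles(data):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    Assumption: with an odd number of values, the overall median is
    excluded from both halves. Needs at least two values.
    """
    s = sorted(data)
    half = len(s) // 2
    lower = s[:half]
    upper = s[len(s) - half:]
    return median(lower), median(upper)


def iqr_outliers(data):
    """Apply the 1.5 * IQR rule and return the quartiles, fences, and outliers."""
    q1, q3 = quartiles(data)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    outliers = sorted(x for x in data if x < low_fence or x > high_fence)
    return q1, q3, low_fence, high_fence, outliers


data = [18, 19, 13, 2, 15, 19, 15, 31, 17, 16, 29]
q1, q3, low_fence, high_fence, outliers = iqr_outliers(data)
print(q1, q3, low_fence, high_fence, outliers)  # 15 19 9.0 25.0 [2, 29, 31]
```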

Choosing Relevant Measurements

Mean and Standard Deviation

  • Mean is the arithmetic average of a data set: the sum of the values divided by the number of values

  • Standard deviation is the spread of data about the mean

  • Standard deviation uses the same units as the original data

  • Skew and outliers influence both mean and standard deviation

    • Skew: the extent to which a distribution is pulled toward one side (a longer tail on that side) rather than being centered symmetrically

    • If skew or outliers are present in a data set, the mean and standard deviation are poor summaries and should not be used

  • These measurements work well when data is approximately symmetrical with no outliers

Median, Quartiles, Range, and IQR

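  • Median, quartiles, and IQR are resistant to outliers; the range is not, because it depends only on the maximum and minimum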

  • These measurements work well when data is skewed and/or contains outliers
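
A quick way to see what "resistant" means is to change one extreme value and compare the summaries. In the sketch below, the `stretched` data set is a hypothetical variation of the outlier example (31 replaced by 310), and the `iqr` helper assumes the same median-of-halves quartile convention as that example.

```python
from statistics import mean, median

original = [18, 19, 13, 2, 15, 19, 15, 31, 17, 16, 29]
# Hypothetical variation: the largest value is replaced by an extreme one.
stretched = [310 if x == 31 else x for x in original]


def iqr(data):
    """IQR from medians of the lower and upper halves (overall median excluded when n is odd)."""
    s = sorted(data)
    half = len(s) // 2
    return median(s[len(s) - half:]) - median(s[:half])


for name, data in [("original", original), ("stretched", stretched)]:
    print(f"{name:>9}: mean={mean(data):.1f} median={median(data)} "
          f"range={max(data) - min(data)} IQR={iqr(data)}")
```

The median and IQR are the same for both data sets, while the mean and the range both jump when the single extreme value changes.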

Measuring the Spread of Data

  • Range = maximum - minimum

  • IQR = Q3 - Q1

  • Standard Deviation

    • x̄ = mean

    • Standard deviation measures the rough average distance between each point and the mean

      • Larger standard deviations indicate that there is more data further from the mean

      • Moderate standard deviations indicate that data is moderately spread around the mean

      • Smaller standard deviations indicate that there is more data clumped closer to the mean

  • Variance

    • Variance is the square of the standard deviation; equivalently, standard deviation is the square root of variance

  • Remember to always plot data; measures of spread and center only display specific facts about a data set, but graphs give the best overall pictures of distributions
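
The measures of spread listed above can be computed step by step as a rough check on the definitions. This sketch assumes the sample form of the standard deviation, s = sqrt(Σ(x - x̄)² / (n - 1)), which the guide does not state explicitly; dividing by n instead would give the population version. The data set is reused from the Outliers section.

```python
from math import sqrt
from statistics import mean, stdev, variance

data = [18, 19, 13, 2, 15, 19, 15, 31, 17, 16, 29]

# Range = maximum - minimum
data_range = max(data) - min(data)

# Sample variance: sum of squared distances from the mean, divided by n - 1 (assumed form)
x_bar = mean(data)
squared_devs = [(x - x_bar) ** 2 for x in data]
var = sum(squared_devs) / (len(data) - 1)

# Standard deviation is the square root of the variance
sd = sqrt(var)

print(f"range={data_range}")
print(f"variance={var:.2f}  standard deviation={sd:.2f}")

# The standard library's sample versions should agree with the manual calculation
print(abs(var - variance(data)) < 1e-9, abs(sd - stdev(data)) < 1e-9)
```

Note that the standard deviation is in the same units as the data, while the variance is in squared units.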
