Descriptive Statistics: Measures of Location, Variation, and the Five-Number Summary

  • Core goal of statistics: to collect, organize, summarize, and interpret data in order to support decisions.

    • Process flow: collect data → clean/restructure → summarize → interpret → tell the story of the data.
    • Graphical analysis and numerical analysis are parts of the summarizing step.
  • Two broad types of statistics:

    • Descriptive statistics: uses data from the sample to describe features of the sample itself.
    • Inferential statistics: uses sample data to make inferences about a population, often with quantified uncertainty (e.g., confidence intervals, hypothesis tests).
    • There is always uncertainty in estimation of population parameters from samples; the goal is to quantify and communicate that uncertainty to aid decision-making.
  • Population vs. Sample vs. Parameters vs. Statistics:

    • Population: the group of interest that we want to learn about (e.g., all MSU campus students).
    • Parameter: a true, usually unknown value describing the population (e.g., the population mean \mu).
    • Sample: a subset of the population used to learn about the population (e.g., 100 students sampled on campus).
    • Statistic: a numerical summary computed from the sample (e.g., sample mean \bar{x}, or sample standard deviation s).
    • Relationship: statistics estimate parameters. This process is called statistical estimation; once an estimate has been computed from the sample, we can make inferences about the population.
  • The circle model of population, parameter, sample, and statistic:

    • Population → Parameter (true population value, unknown)
    • Sample (subset) → Statistic (computed from the sample)
    • Inference uses the statistic to draw conclusions about the parameter.
    • Descriptive statistics describe the sample; inferential statistics extend to the population with uncertainty quantification.
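
The parameter/statistic distinction can be made concrete with a small simulation. This is a minimal sketch: the population values, the seed, and every variable name are assumptions made for illustration and do not come from the notes. In practice the population is not fully observed, so mu would be unknown.

```python
import random
import statistics

random.seed(0)  # fixed seed so repeated runs give the same numbers

# Hypothetical population: 10,000 made-up values (an assumption, not real data).
population = [random.gauss(70, 12) for _ in range(10_000)]

# Parameter: computed from the whole population (normally unknown in practice).
mu = statistics.fmean(population)

# Sample: a random subset of 100 members of the population.
sample = random.sample(population, 100)

# Statistics: computed from the sample; x_bar estimates mu.
x_bar = statistics.fmean(sample)
s = statistics.stdev(sample)

print(f"parameter mu    = {mu:.2f}")
print(f"statistic x_bar = {x_bar:.2f}, s = {s:.2f}")
```

The gap between x_bar and mu changes from sample to sample; inferential statistics is about quantifying that gap.
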
  • Measures of location (central tendency): key ideas

    • Mean: the arithmetic average; the balance point of the data; used when data are symmetric with no outliers.
    • Median: the middle value (or the average of the two middle values for even n); robust to outliers and skewness.
    • Mode: the most frequent value; the only one of these measures that is defined for categorical (nominal) data.
    • Trimmed mean: a resistant alternative that removes a portion of extreme values before averaging.
    • Decision rules (based on shape):
      • If the data are roughly symmetric with no outliers: use the mean.
      • If the data are skewed or have outliers: use the median (or trimmed mean).
      • If the data are categorical (nominal): the mean and median are not defined; the mode is typically used.
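
The location measures above can be compared side by side with Python's standard `statistics` module. This is a minimal sketch on hypothetical data (the values are not from the notes), and `trimmed_mean` is an illustrative helper written here, not a library function.

```python
import statistics


def trimmed_mean(values, k=1):
    """Mean after dropping the k smallest and k largest observations."""
    ordered = sorted(values)
    if 2 * k >= len(ordered):
        raise ValueError("k removes every observation")
    return statistics.fmean(ordered[k:len(ordered) - k])


# Hypothetical numeric data with one large outlier (120).
data = [48, 50, 51, 51, 53, 55, 120]

mean = statistics.fmean(data)      # pulled upward by the outlier
median = statistics.median(data)   # middle value, resistant to the outlier
trimmed = trimmed_mean(data, k=1)  # drops 48 and 120 before averaging

# Hypothetical categorical data: only the mode is meaningful here.
colors = ["red", "blue", "red", "green"]
mode = statistics.mode(colors)

print(mean, median, trimmed, mode)
```

On this data the mean lands well above the median and the trimmed mean, which is the pattern the decision rules are meant to guard against.
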
  • Example of location measures (conceptual)

    • Calculation of sample mean:
    • Let the data be x_1, x_2, \dots, x_n; then the sample mean is
      \bar{x} = \frac{1}{n} \sum_{i=1}^n x_i.
    • Example from notes: a dataset with a minimum of 43 and a maximum of 125; removing these extremes leaves 10 observations; the reported sample mean is
      \bar{x} = 65.83.
    • Median for the same trimmed 10-observation set (even n):
      \text{Median} = \frac{x_{(n/2)} + x_{(n/2+1)}}{2}.
      In the example, the median is reported as 51.
    • Range (as a simple measure of spread):
      \text{Range} = x_{(n)} - x_{(1)}.
    • For the example: range = 125 - 43 = 82.
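
The order-statistic formulas for the median and the range translate directly into code once the 1-based positions are shifted to 0-based indices. The sketch below is illustrative: the ten-value dataset is hypothetical (the notes do not list the raw observations), and the function names are made up for this example.

```python
def median_from_order_stats(values):
    """Median via order statistics x_(1) <= ... <= x_(n)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("need at least one observation")
    if n % 2 == 1:
        # Odd n: the single middle value (0-based index n // 2).
        return ordered[n // 2]
    # Even n: average x_(n/2) and x_(n/2+1), i.e. 0-based indices n/2 - 1 and n/2.
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2


def data_range(values):
    """Range = x_(n) - x_(1)."""
    return max(values) - min(values)


# Hypothetical ten observations (even n), not the dataset from the notes.
data = [47, 49, 50, 50, 51, 53, 60, 72, 90, 110]

print(median_from_order_stats(data))  # average of the 5th and 6th ordered values
print(data_range(data))
print(data_range([43, 125]))          # the extremes quoted in the notes
```
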
  • Measures of variation (spread): key ideas

    • Range: difference between the maximum and minimum values.
    • Interquartile range (IQR): the spread of the middle 50% of the data.
    • IQR = Q3 − Q1, where Q1 is the 25th percentile and Q3 is the 75th percentile.
    • Variance and standard deviation measure dispersion around the center:
    • Sample variance:
      s^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2.
    • Sample standard deviation:
      s = \sqrt{s^2}.
    • Relationship to units: standard deviation is in the same unit as the data, which makes it easier to interpret than variance.
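
The sample variance and standard deviation formulas can be checked by computing them by hand and comparing against the standard library. This is a minimal sketch on hypothetical data; `sample_variance` is an illustrative helper, while `statistics.variance` and `statistics.stdev` are the standard-library functions that use the same n − 1 divisor.

```python
import math
import statistics


def sample_variance(values):
    """s^2 = sum((x_i - x_bar)^2) / (n - 1)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


# Hypothetical data (not from the notes); its mean is exactly 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]

s2 = sample_variance(data)  # squared deviations sum to 32, divided by n - 1 = 7
s = math.sqrt(s2)           # back in the original units of the data

print(s2, statistics.variance(data))
print(s, statistics.stdev(data))
```
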
  • Five-number summary and box plots: core concepts

    • Five-number summary consists of:
    • Minimum, Q1 (first quartile), Median (second quartile, Q2), Q3 (third quartile), Maximum.
    • It provides a compact numeric description of the distribution.
    • Box plot components:
    • A box spanning from Q1 to Q3, with a line inside the box marking the median.
    • The length of the box represents the IQR.
    • Whiskers extend to the most extreme data points that are not outliers.
    • Outliers are often plotted as individual points or asterisks beyond the whiskers.
    • How to compute quartiles (typical method):
    • Order the data from smallest to largest.
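    • For a data set with n values, split the ordered data at the median: Q1 is the median of the lower half and Q3 is the median of the upper half. Conventions differ on whether the overall median is included in each half when n is odd, so software may report slightly different quartiles.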
    • Outlier boundaries (fences) in box plots:
    • Lower fence:
      \text{Lower fence} = Q_1 - 1.5 \cdot IQR
    • Upper fence:
      \text{Upper fence} = Q_3 + 1.5 \cdot IQR
    • Observations outside these fences are plotted as outliers (often with a star or asterisk).
    • Reading a box plot:
    • The box shows the middle 50% of the data (Q1 to Q3).
    • The line inside the box is the median.
    • The whiskers extend to the minimum and maximum values within the fences; points beyond are outliers.
  • Worked example: how to compute and interpret a box plot

    • Given a built-in dataset with 71 observations on weight (as described in notes):
    • Ordered data: min = 108, max = 423.
    • The box plot displays Q1, the median, and Q3, each computed from the ordered data (Q1 as the median of the lower half, Q3 as the median of the upper half).
    • The interquartile range: IQR = Q3 − Q1.
    • The 1.5 × IQR rule gives the whisker reach and identifies potential outliers.
    • Numeric summary reported in the notes for this dataset (example values):
    • Median ≈ 258; Mean ≈ 261.
    • This closeness suggests the distribution is approximately symmetric.
    • Here the mean is slightly greater than the median, which can indicate a slight right skew (a longer tail toward larger values).
    • Interpretation for reading the box plot:
    • Symmetric dataset: box roughly centered around the median; similar tails on both sides.
    • Right-skewed: the mean is pulled toward the right tail, so it sits above the median; the median typically lies closer to Q1 than to Q3, and the upper whisker is typically longer than the lower one.
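
The mean-versus-median comparison used in the interpretation above can be illustrated on two small hypothetical datasets (neither is the 71-observation weight data, whose raw values are not given in the notes). The helper name is made up for this sketch, and the comparison is a rule of thumb rather than a formal test of skewness.

```python
import statistics


def mean_minus_median(values):
    """Positive values hint at right skew, negative at left skew (heuristic only)."""
    return statistics.fmean(values) - statistics.median(values)


# Hypothetical datasets for illustration.
symmetric = [10, 12, 14, 16, 18]     # evenly spread around 14
right_skewed = [10, 11, 12, 13, 40]  # one large value stretches the right tail

for name, data in [("symmetric", symmetric), ("right-skewed", right_skewed)]:
    gap = mean_minus_median(data)
    print(f"{name}: mean - median = {gap:.2f}")
```

For the symmetric data the gap is zero; for the right-skewed data the mean exceeds the median because the single large value pulls the average up without moving the middle observation.
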
  • Graphical vs numerical summaries and interpretation

    • Graphical analysis (histograms, box plots) helps identify shape, central tendency, and spread visually, including outliers.
    • Numerical analysis provides precise summaries:
    • Location: mean, median, mode, trimmed mean.
    • Variation: range, IQR, variance, standard deviation.
    • Five-number summary as a compact descriptor for the distribution.
    • The choice between mean vs median (and trimmed mean) depends on distribution shape and presence of outliers; this choice affects interpretation of the central tendency.
  • Connections to inference and future topics (brief orientation)

    • After mastering descriptive summaries, the course moves to inferential techniques: confidence intervals and hypothesis tests.
    • Conceptually, a point estimate (e.g., a sample mean) gives a single best guess of a population parameter, but an interval estimate (e.g., a confidence interval) provides a range that likely contains the true parameter with a stated confidence level (e.g., 95%).
    • The entire process emphasizes quantifying uncertainty to support decision making, rather than asserting exact population values.
  • Quick recap of essential formulas (to memorize and apply)

    • Population parameter (mean) notation: \mu\text{ (parameter)}
    • Sample mean: \bar{x} = \dfrac{1}{n} \sum_{i=1}^n x_i
    • Median for sorted data: if n is odd, the middle value; if n is even, \text{Median} = \dfrac{x_{(n/2)} + x_{(n/2+1)}}{2}
    • Range: \text{Range} = x_{(n)} - x_{(1)}
    • Interquartile range: IQR = Q_3 - Q_1
    • Lower/Upper fences for outliers:
      \text{Lower fence} = Q_1 - 1.5 \cdot IQR, \quad \text{Upper fence} = Q_3 + 1.5 \cdot IQR
    • Variance (sample): s^2 = \dfrac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2
    • Standard deviation (sample): s = \sqrt{s^2}
    • Five-number summary: \min, \; Q_1, \; \text{Median}, \; Q_3, \; \max
  • Takeaway: The slides emphasize the foundational relationship between population parameters and sample statistics, the distinction between descriptive and inferential statistics, and the practical use of the five-number summary and box plots to summarize and visualize data in a way that supports informed decision making.