Descriptive Statistics: Measures of Location, Variation, and the Five-Number Summary
Core goal of statistics: the science of collecting, organizing, summarizing, and interpreting data to help us make decisions.
- Process flow: collect data → clean/restructure → summarize → interpret → tell the story of the data.
- Graphical analysis and numerical analysis are parts of the summarizing step.
Two broad types of statistics:
- Descriptive statistics: uses data from the sample to describe features of the sample itself.
- Inferential statistics: uses sample data to make inferences about a population, often with quantified uncertainty (e.g., confidence intervals, hypothesis tests).
- There is always uncertainty in estimation of population parameters from samples; the goal is to quantify and communicate that uncertainty to aid decision-making.
Population vs. Sample vs. Parameters vs. Statistics:
- Population: the group of interest that we want to learn about (e.g., all MSU campus students).
- Parameter: a true, usually unknown value describing the population (e.g., the population mean \mu).
- Sample: a subset of the population used to learn about the population (e.g., 100 students sampled on campus).
- Statistic: a numerical summary computed from the sample (e.g., sample mean \bar{x}, or sample standard deviation s).
- Relationship: statistics estimate parameters; the process is called statistical estimation and, after using the sample to estimate, we can perform inference about the population.
The circle model of population, parameter, sample, and statistic:
- Population → Parameter (true population value, unknown)
- Sample (subset) → Statistic (computed from the sample)
- Inference uses the statistic to draw conclusions about the parameter.
- Descriptive statistics describe the sample; inferential statistics extend to the population with uncertainty quantification.
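The parameter/statistic relationship can be illustrated with a small simulation. This is only a sketch under stated assumptions: the population is simulated (10,000 hypothetical values drawn from a normal distribution with mean 70 and standard deviation 10), the sample size of 100 echoes the campus example, and the names `mu`, `x_bar`, and `s` are chosen here for illustration. In practice the whole population is not observed, so the parameter cannot be computed this way.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: 10,000 simulated measurements (an assumption, not real data).
population = [random.gauss(70, 10) for _ in range(10_000)]

# Parameter: computed from the whole population (normally unknown).
mu = statistics.mean(population)

# Sample: a subset of 100 units drawn without replacement.
sample = random.sample(population, 100)

# Statistics: computed from the sample and used to estimate the parameter.
x_bar = statistics.mean(sample)
s = statistics.stdev(sample)

print(f"population mean (parameter): {mu:.2f}")
print(f"sample mean (statistic):     {x_bar:.2f}")
print(f"sample std. dev. (statistic): {s:.2f}")
```

The sample mean will generally be close to, but not equal to, the population mean; that gap is the uncertainty that inferential methods try to quantify.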
Measures of location (central tendency): key ideas
- Mean: the arithmetic average; the balance point of the data; used when data are symmetric with no outliers.
- Median: the middle value (or the average of the two middle values for even n); robust to outliers and skewness.
- Mode: the most frequent value; the only one of these measures that applies to nominal (unordered) categorical data.
- Trimmed mean: a resistant alternative that removes a portion of extreme values before averaging.
- Decision rules (based on shape):
- If the data are roughly symmetric: use the mean.
- If the data are skewed or have outliers: use the median (or trimmed mean).
- If data are categorical: the mean is not defined (and the median is meaningful only when the categories are ordered); the mode is typically used.
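The effect of an outlier on each location measure can be seen in a short sketch. The data below are hypothetical (nine typical values plus one large value), and `trimmed_mean` is a helper written here for illustration, with a trimming proportion of 10% per tail chosen as an assumption; it is not taken from the notes or from a library.

```python
import statistics

def trimmed_mean(values, proportion=0.1):
    """Drop the lowest and highest `proportion` of the values, then average the rest."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # number of values dropped from each end
    kept = ordered[k:len(ordered) - k] if k > 0 else ordered
    return statistics.mean(kept)

# Hypothetical data: nine typical values plus one large outlier (125).
data = [48, 50, 51, 52, 52, 53, 54, 55, 57, 125]

mean_value = statistics.mean(data)        # pulled upward by the outlier
median_value = statistics.median(data)    # average of the 5th and 6th ordered values
mode_value = statistics.mode(data)        # most frequent value
trimmed_value = trimmed_mean(data, 0.1)   # drops 48 and 125 before averaging

# For nominal categories only the mode applies.
colors = ["red", "blue", "red", "green"]
color_mode = statistics.mode(colors)

print(mean_value, median_value, mode_value, trimmed_value, color_mode)
```

Here the mean (59.7) sits above every value except the outlier, while the median (52.5) and the trimmed mean (53) stay near the bulk of the data, which is the reasoning behind the decision rules above.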
Example of location measures (conceptual)
- Calculation of the sample mean: let the data be x_1, x_2, \dots, x_n; then the sample mean is
\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i.
- Example from notes: a dataset with a minimum of 43 and a maximum of 125; removing these extremes leaves 10 observations; the reported sample mean is
\bar{x} = 65.83.
- Median for the same trimmed 10-observation set (even n), where x_{(k)} denotes the k-th smallest value:
\text{Median} = \frac{x_{(n/2)} + x_{(n/2+1)}}{2}.
In the example, the median is reported as 51.
- Range (as a simple measure of spread):
\text{Range} = x_{(n)} - x_{(1)}.
- For the example (using the untrimmed minimum and maximum): range = 125 - 43 = 82.
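The median and range formulas above are written with 1-based order statistics, which shift by one when translated to 0-based list indices. The following is a minimal sketch of that translation; the function names and the six-value data set are hypothetical and are not the data from the notes.

```python
def median_from_order_statistics(values):
    """Median via order statistics; x_(k) in the formula is ordered[k - 1] here."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of empty data is undefined")
    if n % 2 == 1:
        return ordered[n // 2]  # the single middle value
    # Even n: average x_(n/2) and x_(n/2 + 1), i.e. indices n/2 - 1 and n/2.
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

def data_range(values):
    """Range = x_(n) - x_(1), the largest minus the smallest value."""
    return max(values) - min(values)

# Hypothetical data with even n = 6; sorted: 1, 3, 4, 7, 8, 9.
data = [3, 9, 4, 7, 1, 8]

median_value = median_from_order_statistics(data)  # (4 + 7) / 2
range_value = data_range(data)                     # 9 - 1

print(median_value, range_value)
```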
Measures of variation (spread): key ideas
- Range: difference between the maximum and minimum values.
- Interquartile range (IQR): the spread of the middle 50% of the data.
- IQR = Q3 − Q1, where Q1 is the 25th percentile and Q3 is the 75th percentile.
- Variance and standard deviation measure dispersion around the center:
- Sample variance:
s^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2.
- Sample standard deviation:
s = \sqrt{s^2}.
- Relationship to units: standard deviation is in the same units as the data, which makes it easier to interpret than variance.
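The variance and standard deviation formulas can be checked by computing them directly and comparing with Python's standard library, whose `statistics.variance` and `statistics.stdev` also use the n - 1 divisor. The data set and the helper name `sample_variance` are assumptions made for this sketch.

```python
import math
import statistics

def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("sample variance needs at least two observations")
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)

# Hypothetical data: n = 8, mean = 5, squared deviations sum to 32.
data = [2, 4, 4, 4, 5, 5, 7, 9]

s2 = sample_variance(data)  # 32 / 7
s = math.sqrt(s2)           # back in the original units of the data

print(f"s^2 = {s2:.4f}, s = {s:.4f}")
print(f"library check: {statistics.variance(data):.4f}, {statistics.stdev(data):.4f}")
```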
Five-number summary and box plots: core concepts
- Five-number summary consists of:
- Minimum, Q1 (first quartile), Median (second quartile, Q2), Q3 (third quartile), Maximum.
- It provides a compact numeric description of the distribution.
- Box plot components:
- A box spanning from Q1 to Q3 with a horizontal line for the median inside the box.
- The length of the box represents the IQR.
- Whiskers extend to the most extreme data points that are not outliers.
- Outliers are often plotted as individual points or asterisks beyond the whiskers.
- How to compute quartiles (typical method):
- Order the data from smallest to largest.
- For a data set with n values, Q1 is the median of the lower half and Q3 is the median of the upper half (method: split at the median and take medians of each half).
- Outlier boundaries (fences) in box plots:
- Lower fence:
\text{Lower fence} = Q_1 - 1.5 \cdot IQR
- Upper fence:
\text{Upper fence} = Q_3 + 1.5 \cdot IQR
- Observations outside these fences are plotted as outliers (often with a star or asterisk).
- Reading a box plot:
- The box shows the middle 50% of the data (Q1 to Q3).
- The line inside the box is the median.
- The whiskers extend to the minimum and maximum values within the fences; points beyond are outliers.
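The quartile method, fences, and whisker rule described in this section can be put together in one sketch. Two assumptions are worth flagging: when n is odd, the median itself is excluded from both halves (conventions differ, and software may report slightly different quartiles), and the data set is hypothetical. The function names are chosen here for illustration.

```python
import statistics

def five_number_summary(values):
    """Min, Q1, median, Q3, max using the median-of-halves method."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two observations")
    half = n // 2
    lower = ordered[:half]
    # Assumption: for odd n the middle value is left out of both halves.
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return (ordered[0], statistics.median(lower), statistics.median(ordered),
            statistics.median(upper), ordered[-1])

def box_plot_stats(values):
    """Fences, whisker ends, and outliers from the 1.5 * IQR rule."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [x for x in values if lower_fence <= x <= upper_fence]
    return {
        "iqr": iqr,
        "lower_fence": lower_fence,
        "upper_fence": upper_fence,
        # Whiskers stop at the most extreme observations inside the fences.
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": sorted(x for x in values if x < lower_fence or x > upper_fence),
    }

# Hypothetical data (n = 12) with one unusually large value.
data = [52, 55, 57, 58, 60, 61, 63, 64, 66, 68, 70, 95]

summary = five_number_summary(data)
stats = box_plot_stats(data)

print(summary)
print(stats)
```

For these values the upper fence is 67 + 1.5 × 9.5 = 81.25, so 95 is flagged as an outlier and the upper whisker stops at 70, the largest observation inside the fence.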
Worked example: how to compute and interpret a box plot
- Given a built-in dataset with 71 observations on weight (as described in notes):
- Ordered data: min = 108, max = 423.
- The box plot displays Q1, the median, and Q3; the exact quartiles are determined from the ordered data.
- The interquartile range: IQR = Q3 − Q1.
- The 1.5 × IQR rule gives the whisker reach and identifies potential outliers.
- Numeric summary reported in the notes for this dataset (example values):
- Median ≈ 258; Mean ≈ 261.
- This closeness suggests the distribution is approximately symmetric.
- If the mean is greater than the median, it can indicate a slight right skew (longer tail toward larger values).
- Interpretation for reading the box plot:
- Symmetric dataset: box roughly centered around the median; similar tails on both sides.
- Right-skewed: the mean is pulled toward the right tail (the balance point shifts to higher values), so the mean exceeds the median; the median sits closer to the lower (Q1) end of the box.
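The mean-versus-median comparison used above can be sketched numerically. The two small data sets are hypothetical; the last part reuses only the summary values reported in the notes (median 258, mean 261, min 108, max 423). Scaling the gap by the range is an illustrative choice made here to judge "closeness", not a rule from the notes.

```python
import statistics

def mean_median_gap(values):
    """Mean minus median; a clearly positive gap hints at a right skew."""
    return statistics.mean(values) - statistics.median(values)

# Hypothetical data sets.
symmetric = [10, 20, 30, 40, 50]
right_skewed = [10, 12, 14, 16, 18, 20, 90]

gap_symmetric = mean_median_gap(symmetric)  # mean and median coincide
gap_skewed = mean_median_gap(right_skewed)  # the value 90 pulls the mean upward

# Summary values reported in the notes for the weight data.
notes_gap = 261 - 258      # mean minus median
notes_range = 423 - 108    # max minus min
relative_gap = notes_gap / notes_range  # gap as a fraction of the range (assumed yardstick)

print(gap_symmetric, round(gap_skewed, 2), round(relative_gap, 4))
```

For the weight data the gap is about 1% of the range, which is consistent with reading the distribution as approximately symmetric with, at most, a slight right skew.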
Graphical vs numerical summaries and interpretation
- Graphical analysis (histograms, box plots) helps identify shape, central tendency, and spread visually, including outliers.
- Numerical analysis provides precise summaries:
- Location: mean, median, mode, trimmed mean.
- Variation: range, IQR, variance, standard deviation.
- Five-number summary as a compact descriptor for the distribution.
- The choice between mean vs median (and trimmed mean) depends on distribution shape and presence of outliers; this choice affects interpretation of the central tendency.
Connections to inference and future topics (brief orientation)
- After mastering descriptive summaries, the course moves to inferential techniques: confidence intervals and hypothesis tests.
- Conceptually, a point estimate (e.g., a sample mean) gives a single best guess of a population parameter, but an interval estimate (e.g., a confidence interval) provides a range that likely contains the true parameter with a stated confidence level (e.g., 95%).
- The entire process emphasizes quantifying uncertainty to support decision making, rather than asserting exact population values.
Quick recap of essential formulas (to memorize and apply)
- Population parameter (mean) notation: \mu\text{ (parameter)}
- Sample mean: \bar{x} = \dfrac{1}{n} \sum_{i=1}^n x_i
- Median for sorted data: if n is odd, the middle value; if n is even, \text{Median} = \dfrac{x_{(n/2)} + x_{(n/2+1)}}{2}
- Range: \text{Range} = x_{(n)} - x_{(1)}
- Interquartile range: IQR = Q_3 - Q_1
- Lower/Upper fences for outliers:
\text{Lower fence} = Q_1 - 1.5 \cdot IQR, \quad \text{Upper fence} = Q_3 + 1.5 \cdot IQR
- Variance (sample): s^2 = \dfrac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2
- Standard deviation (sample): s = \sqrt{s^2}
- Five-number summary: \min, \; Q_1, \; \text{Median}, \; Q_3, \; \max
Takeaway: The slides emphasize the foundational relationship between population parameters and sample statistics, the distinction between descriptive and inferential statistics, and the practical use of the five-number summary and box plots to summarize and visualize data in a way that supports informed decision making.