September 11th
Data quality and data collection
Treat data as quantitative only when computing with and comparing the values is meaningful; categorical (non-quantitative) data are not suitable for calculations.
When planning data entry, ensure the data will work for the intended analysis.
Different answers may be acceptable depending on how the data is loaded or entered into a computer; check the feedback and adjust accordingly.
Five-number summary and quartiles
Five-number summary: \min,\; Q_1,\; Q_2,\; Q_3,\; \max
Q_2 is the median (the 50th percentile). For an odd sample size, it is the middle value of the sorted data; for an even sample size, it is the average of the two middle values.
Q1 is the median of the lower half; Q3 is the median of the upper half.
This summary provides key cut points of the data distribution.
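The quartile rules above can be sketched in Python. This is a minimal illustration, not the only accepted method: it assumes the convention that, for an odd sample size, the middle value is left out of both halves (some books and software include it or interpolate, which can give slightly different Q1 and Q3). The function name `five_number_summary` is made up for this example.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, Q2, Q3, max) using the median-of-halves rule."""
    data = sorted(values)
    n = len(data)
    if n < 2:
        raise ValueError("need at least two values")
    half = n // 2
    lower = data[:half]             # values below the median position
    upper = data[half + (n % 2):]   # assumption: skip the middle value when n is odd
    return (data[0], median(lower), median(data), median(upper), data[-1])

summary = five_number_summary([7, 1, 3, 9, 5, 11, 13])
print(summary)  # (1, 3, 7, 11, 13)
```

For an even-sized sample such as [2, 4, 6, 8], the same function averages the two middle values and returns a median of 5.0.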
Box-and-whisker plots and IQR
Box spans from Q1 to Q3; a vertical line inside the box marks the median Q_2.
Whiskers extend from the box to the data's minimum and maximum values (in plots that mark outliers separately, to the most extreme values that are not outliers).
Interquartile Range: \text{IQR} = Q3 - Q1
About 50% of the data lie inside the box (between Q1 and Q3).
Box shape reveals skewness: a longer whisker on one side suggests skew toward that side.
Outliers (1.5 IQR rule)
Outlier thresholds: Q1 - 1.5\times\text{IQR} \quad \text{and} \quad Q3 + 1.5\times\text{IQR}
Any data point outside these thresholds is considered an outlier.
Outliers can be identified visually as values far beyond the whiskers and confirmed by threshold calculation.
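As a rough sketch of the threshold calculation, the snippet below computes the IQR and both outlier thresholds for a small made-up data set. It assumes the same median-of-halves quartiles described in these notes; `iqr_outliers` is a hypothetical helper name, and other quartile conventions may move the thresholds slightly.

```python
from statistics import median

def iqr_outliers(values):
    """Return (lower threshold, upper threshold, outliers) under the 1.5 IQR rule."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    q1 = median(data[:half])              # median of the lower half
    q3 = median(data[half + (n % 2):])    # median of the upper half
    iqr = q3 - q1
    low = q1 - 1.5 * iqr
    high = q3 + 1.5 * iqr
    return low, high, [x for x in data if x < low or x > high]

# Made-up data: Q1 = 4.5, Q3 = 8.5, IQR = 4.0
low_fence, high_fence, outliers = iqr_outliers([2, 4, 5, 6, 7, 8, 9, 30])
print(low_fence, high_fence, outliers)  # -1.5 14.5 [30]
```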
Percentiles, quartiles, and deciles
Quartiles divide data into 4 equal parts: Q1 (25th percentile), Q2 (50th percentile / median), Q3 (75th percentile).
Deciles divide data into 10 equal parts.
Percentiles indicate the percentage of values at or below a certain point.
Two common definitions for percentile p exist:
Definition A (book-style): percentile p is the value x_p where \frac{\#\{X_i < x_p\}}{n} = \frac{p}{100} (data strictly below x_p).
Definition B (CDF style): F(x_p) = \frac{p}{100}, where F is the cumulative distribution function (data at or below x_p).
An Ogive (cumulative frequency plot) can be used to estimate percentiles.
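To see why the two definitions can disagree, the sketch below counts the fraction of values strictly below a point (Definition A) and the fraction at or below it (Definition B). The scores are made up, and the function names are illustrative only; the two counts differ whenever the point itself appears in the data.

```python
def fraction_below(values, x):
    """Definition A style: share of values strictly below x."""
    return sum(1 for v in values if v < x) / len(values)

def fraction_at_or_below(values, x):
    """Definition B style: empirical CDF, share of values at or below x."""
    return sum(1 for v in values if v <= x) / len(values)

scores = [55, 60, 65, 70, 70, 75, 80, 85, 90, 95]  # made-up data
a = fraction_below(scores, 70)        # 3 of 10 values are below 70
b = fraction_at_or_below(scores, 70)  # 5 of 10 values are at or below 70
print(a, b)  # 0.3 0.5
```

Here a score of 70 sits at the 30th percentile under Definition A but at the 50th under Definition B.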
Z-scores and standardization
Standardization converts data to a common scale with mean 0 and standard deviation 1.
Z-score formulas:
Population: z = \frac{X - \mu}{\sigma}
Sample: z = \frac{\,X - \bar{X}\,}{s}
Z-scores enable direct comparison across different distributions, and because standardization preserves order, a z-score has the same cumulative probability as the original value: P(X \le x) = P(Z \le z).
Z-scores are typically rounded to two decimal places.
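A short sketch of both uses of the formula follows. The sample values, exam scores, means, and standard deviations are all invented for illustration, and `z_score` is a hypothetical helper; `statistics.stdev` computes the sample standard deviation s.

```python
from statistics import mean, stdev

def z_score(x, center, spread):
    """Standardize x and round to two decimal places."""
    return round((x - center) / spread, 2)

# Sample formula: use the sample mean and sample standard deviation.
sample = [10, 20, 30]
sample_z = [z_score(x, mean(sample), stdev(sample)) for x in sample]
print(sample_z)  # [-1.0, 0.0, 1.0]

# Comparing scores from two different (made-up) exam distributions.
math_z = z_score(85, 70, 10)     # 1.5 standard deviations above the mean
physics_z = z_score(80, 60, 8)   # 2.5 standard deviations above the mean
print(math_z, physics_z)  # 1.5 2.5
```

Although 85 is the higher raw score, the 80 is further above its own mean in standard-deviation units, which is the kind of comparison z-scores make possible.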
Normal distribution and the empirical rule
Empirical rule (68-95-99.7 rule) for a normal distribution:
Within \pm 1\sigma: P(|Z|\le 1) \approx 0.68
Within \pm 2\sigma: P(|Z|\le 2) \approx 0.95
Within \pm 3\sigma: P(|Z|\le 3) \approx 0.997
These approximations help estimate probabilities from mean and standard deviation without detailed tables for normally distributed data.
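The rounded figures in the rule can be checked numerically. For a standard normal variable, P(|Z| \le k) = \operatorname{erf}(k/\sqrt{2}), which Python's standard library can evaluate directly; `prob_within` is a name chosen for this sketch.

```python
import math

def prob_within(k):
    """P(|Z| <= k) for a standard normal Z, via the error function."""
    return math.erf(k / math.sqrt(2))

probs = {k: round(prob_within(k), 4) for k in (1, 2, 3)}
print(probs)  # {1: 0.6827, 2: 0.9545, 3: 0.9973}
```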
Quick tips for exam and practice
Use the percentile definition from your course or book consistently when solving problems.
Use z-scores to streamline probability questions instead of recalculating from scratch.
Read box-and-whisker plots quickly to assess spread (IQR), center (median), skewness (whisker length), and potential outliers.
Report IQR along with quartiles to concisely convey data spread.