September 11th

Data quality and data collection

  • Collect data as quantitative values only when computing with them and comparing them makes sense; categories (non-quantitative data) are not suitable for calculations.

  • When planning data entry, ensure the data will work for the intended analysis.

  • Different acceptable answers may exist depending on how the data is entered or loaded into software; check the feedback you receive and adjust accordingly.

Five-number summary and quartiles

  • Five-number summary: \min,\; Q_1,\; Q_2,\; Q_3,\; \max

  • Q_2 is the median (the 50th percentile). For an odd sample size, it is the middle value of the sorted data; for an even size, it is the average of the two middle values.

  • Q_1 is the median of the lower half of the sorted data; Q_3 is the median of the upper half.

  • This summary provides key cut points of the data distribution.
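
A minimal Python sketch of the median-of-halves rule described above. The function name five_number_summary and the sample values are made up for illustration. One assumption to note: for an odd sample size the middle value is left out of both halves, which is a common convention but not the only one, so software may report slightly different quartiles.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, Q2, Q3, max) using the median-of-halves rule."""
    data = sorted(values)
    n = len(data)
    if n < 2:
        raise ValueError("need at least two values")
    half = n // 2
    lower = data[:half]             # values before the median position
    upper = data[half + (n % 2):]   # skips the middle value when n is odd (assumed convention)
    return (data[0], median(lower), median(data), median(upper), data[-1])

# Hypothetical sample with an odd number of values
summary = five_number_summary([7, 1, 3, 9, 5, 11, 13])
print(summary)  # (1, 3, 7, 11, 13)
```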

Box-and-whisker plots and IQR

  • Box spans from Q_1 to Q_3; a vertical line inside the box marks the median Q_2.

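  • Whiskers extend from the box to the data’s minimum and maximum values; when outliers are flagged with the 1.5 IQR rule, the whiskers instead stop at the most extreme values that are not outliers.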

  • Interquartile range: \text{IQR} = Q_3 - Q_1

  • About 50% of the data lie inside the box (between Q_1 and Q_3).

  • Box shape reveals skewness: a longer whisker on one side suggests skew toward that side.

Outliers (1.5 IQR rule)

  • Outlier thresholds: Q_1 - 1.5\times\text{IQR} \quad \text{and} \quad Q_3 + 1.5\times\text{IQR}

  • Any data point outside these thresholds is considered an outlier.

  • Outliers can be identified visually as values far beyond the whiskers and confirmed by threshold calculation.
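
The threshold check can be sketched in a few lines of Python. The function name iqr_outliers and the data are hypothetical, and the quartiles are computed with the median-of-halves rule (middle value excluded for odd sample sizes), which is an assumption; other quartile conventions can shift the thresholds slightly.

```python
from statistics import median

def iqr_outliers(values):
    """Return (lower threshold, upper threshold, outliers) under the 1.5 IQR rule."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    q1 = median(data[:half])             # median of the lower half
    q3 = median(data[half + (n % 2):])   # median of the upper half
    iqr = q3 - q1
    low = q1 - 1.5 * iqr
    high = q3 + 1.5 * iqr
    # Anything strictly outside [low, high] is flagged
    outliers = [x for x in data if x < low or x > high]
    return low, high, outliers

# Hypothetical sample: Q1 = 4, Q3 = 8, IQR = 4
low, high, flagged = iqr_outliers([2, 4, 5, 6, 7, 8, 30])
print(low, high, flagged)  # -2.0 14.0 [30]
```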

Percentiles, quartiles, and deciles

  • Quartiles divide data into 4 equal parts: Q_1 (25th percentile), Q_2 (50th percentile / median), Q_3 (75th percentile).

  • Deciles divide data into 10 equal parts.

  • Percentiles indicate the percentage of values at or below a certain point.

  • Two common definitions for percentile p exist:

    • Definition A (book-style): percentile p is the value x_p where \frac{\#\{X_i < x_p\}}{n} = \frac{p}{100} (data strictly below x_p).

    • Definition B (CDF style): F(x_p) = \frac{p}{100}, where F is the cumulative distribution function (data at or below x_p).

  • An ogive (cumulative frequency plot) can be used to estimate percentiles.
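
To see how the two definitions can disagree, here is a small Python sketch that computes the percentage of values strictly below a point (Definition A) and at or below it (Definition B). The function names and the scores are invented for illustration; the gap between the two results comes from the values equal to the point itself.

```python
def percent_below(values, x):
    """Definition A: percentage of values strictly below x."""
    return 100 * sum(v < x for v in values) / len(values)

def percent_at_or_below(values, x):
    """Definition B: percentage of values at or below x (empirical CDF)."""
    return 100 * sum(v <= x for v in values) / len(values)

# Hypothetical exam scores
scores = [55, 60, 60, 70, 75, 80, 85, 90, 95, 100]
rank_a = percent_below(scores, 80)
rank_b = percent_at_or_below(scores, 80)
print(rank_a, rank_b)  # 50.0 60.0
```

Under Definition A the score 80 sits at the 50th percentile; under Definition B it sits at the 60th.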

Z-scores and standardization

  • Standardization converts data to a common scale with mean 0 and standard deviation 1.

  • Z-score formulas:

    • Population: z = \frac{X - \mu}{\sigma}

    • Sample: z = \frac{\,X - \bar{X}\,}{s}

  • Z-scores enable direct comparison across different distributions, and standardizing preserves probabilities: P(X \le x) = P(Z \le z) for the matching z.

  • Z-scores are typically rounded to two decimal places.
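
A short Python sketch of both formulas, using the standard library's pstdev (population) and stdev (sample). The function names, the data, and the two exam scenarios are hypothetical.

```python
from statistics import mean, pstdev, stdev

def z_score(x, center, spread):
    """Standardize one value given a mean and a standard deviation."""
    return round((x - center) / spread, 2)

def z_scores(values, population=False):
    """Standardize a whole data set with sigma (population) or s (sample)."""
    center = mean(values)
    spread = pstdev(values) if population else stdev(values)
    return [z_score(x, center, spread) for x in values]

# Hypothetical data with mean 5 and population standard deviation 2
print(z_scores([2, 4, 4, 4, 5, 5, 7, 9], population=True))
# [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]

# Comparing scores from two hypothetical exams with different scales
exam_a = z_score(85, 70, 10)   # 1.5
exam_b = z_score(90, 80, 20)   # 0.5
```

Although 90 is the higher raw score, the 85 is further above its own exam's mean in standard-deviation units.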

Normal distribution and the empirical rule

  • Empirical rule (68-95-99.7 rule) for a normal distribution:

    • Within \pm 1\sigma: P(|Z|\le 1) \approx 0.68

    • Within \pm 2\sigma: P(|Z|\le 2) \approx 0.95

    • Within \pm 3\sigma: P(|Z|\le 3) \approx 0.997

  • For normally distributed data, these approximations let you estimate probabilities from the mean and standard deviation without detailed tables.
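
The three figures can be checked numerically. The sketch below relies on the identity P(|Z| \le k) = \operatorname{erf}(k/\sqrt{2}) for a standard normal Z and on Python's math.erf; the function name prob_within is made up for illustration.

```python
import math

def prob_within(k):
    """P(|Z| <= k) for a standard normal Z, via the error function."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, round(prob_within(k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```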

Quick tips for exam and practice

  • Use the percentile definition from your course or book consistently when solving problems.

  • Use z-scores to streamline probability questions instead of recalculating from scratch.

  • Read box-and-whisker plots quickly to assess spread (IQR), center (median), skewness (whisker length), and potential outliers.

  • Report IQR along with quartiles to concisely convey data spread.