Statistics 1A: Describing Data with Numerical Measures

Introduction to Descriptive Statistics

  • Graphical methods may not always be sufficient for describing data.
  • Numerical measures can be created for both populations and samples.
    • Parameter: A numerical descriptive measure calculated for a population.
    • Statistic: A numerical descriptive measure calculated for a sample.

Describing Data with Numerical Measures

  • Categories of Descriptive Statistical Measures:
    • Location
    • Shape
    • Spread

Central Location Statistics

  • A central location statistic is a single number that indicates where the data values in a sample are concentrated.

Measures of Location

Example 1.14: The Mean

  • The mean, or arithmetic average, is a frequently used measure of the center of a set of numbers. When it is computed from a sample, it is called the sample mean.
  • Definition: The sample mean of observations $x_1, x_2, \ldots, x_n$ is given by: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
  • Advantages:
    • Uses all data values.
  • Disadvantages:
    • Only valid for numeric variables.
    • Distorted by outliers.
  • Humorous Illustration: "My girlfriend dropped me – she said I am AVERAGE. But I think she was just being MEAN!"
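
The definition and the outlier disadvantage can be illustrated with a minimal Python sketch. The function name `sample_mean` and the exam marks are hypothetical, invented only for this illustration.

```python
def sample_mean(values):
    """Return the arithmetic mean: the sum of the observations divided by n."""
    if not values:
        raise ValueError("sample_mean requires at least one observation")
    return sum(values) / len(values)


# Hypothetical exam marks (not from the notes).
marks = [60, 62, 64, 66, 68]
# The same marks with one unusually high value added.
marks_with_outlier = marks + [100]

print(sample_mean(marks))               # 64.0
print(sample_mean(marks_with_outlier))  # 70.0
```

A single extreme value moves the mean from 64 to 70, above every one of the original five marks.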

Measures of Location (Median)

Example 1.15: The Median
  • The median is the middle value of the ordered observations. It is resistant to outliers but is only appropriate for numeric data.
  • Calculating Mean and Median for different heart rates:
    • 19-year-old patients: 108, 68, 80, 83, 72.
    • 55-year-old patients: 86, 86, 92, 100, 112, 116, 136, 140.
  • The median is calculated from the observations sorted in increasing order:
    • If $n$ is odd, the median is the middle value.
    • If $n$ is even, the median is the average of the two middle values.
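
The odd and even cases can be worked through on the heart-rate data above. The helper `median_by_hand` is a hypothetical name used for this sketch; Python's `statistics.median` applies the same rule.

```python
from statistics import mean, median

young = [108, 68, 80, 83, 72]                   # 19-year-old patients, n = 5 (odd)
older = [86, 86, 92, 100, 112, 116, 136, 140]   # 55-year-old patients, n = 8 (even)


def median_by_hand(values):
    """Median via the odd/even rule applied to the sorted observations."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2   # average of the two middle values


print(mean(young), median_by_hand(young))   # 82.2 80
print(mean(older), median_by_hand(older))   # 108.5 106.0
```

For the younger group the sorted values are 68, 72, 80, 83, 108, so the median is 80, while the single high reading of 108 pulls the mean up to 82.2.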

Measures of Location - Population Mean and Median

  • Generally, the population mean and median will not be identical.
  • Skewness: If the population distribution is positively or negatively skewed, then:
    mean ≠ median
  • When making inferences about a population, an important step is deciding which characteristic (mean or median) is more relevant.
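
The effect of skewness can be seen on a small sample as well as on a population. The sketch below uses hypothetical data invented for illustration. In typical cases a long right tail pulls the mean above the median and a long left tail pulls it below, although this is a rule of thumb rather than a guarantee.

```python
from statistics import mean, median

# Hypothetical right-skewed sample: most values are small, one is large.
right_skewed = [1, 2, 2, 3, 3, 4, 20]
# Mirror image of the same sample, which gives a long lower tail instead.
left_skewed = [21 - x for x in right_skewed]

print(mean(right_skewed), median(right_skewed))  # mean 5, median 3
print(mean(left_skewed), median(left_skewed))    # mean 16, median 18
```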

Quartiles and Percentiles

  • The median divides the data set into two equal parts.
  • Quartiles: Divide the data set into four equal parts:
    • First quartile ($Q_1$): 25th percentile
    • Second quartile ($Q_2$): Median (50th percentile)
    • Third quartile ($Q_3$): 75th percentile
  • Percentiles: For finer measures, percentiles divide the data into 100 equal parts. E.g., the 99th percentile separates the top 1% of values from the bottom 99%.
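
Quartiles and percentiles can be computed with Python's `statistics.quantiles`. This is only a sketch: several conventions exist for interpolating between data points, and the choice of `method="inclusive"` here is an assumption, not something the notes prescribe. The data are the 55-year-old heart rates from Example 1.15.

```python
from statistics import median, quantiles

heart_rates = [86, 86, 92, 100, 112, 116, 136, 140]

# n=4 requests the three cut points that split the data into four parts.
q1, q2, q3 = quantiles(heart_rates, n=4, method="inclusive")
print(q1, q2, q3)  # 90.5 106.0 121.0

# n=100 returns 99 cut points; index 98 holds the 99th percentile.
percentiles = quantiles(heart_rates, n=100, method="inclusive")
p99 = percentiles[98]
print(p99)
```

The second quartile agrees with the median (106), as expected. Other conventions can return slightly different values for $Q_1$ and $Q_3$ on the same data.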

The Trimmed Mean

  • The trimmed mean excludes the $k$ smallest and $k$ largest order statistics and averages the remaining $n - 2k$ values, which reduces the impact of outliers.
  • Robustness: Trimmed means are not unduly affected by extreme values.
  • Example: Judges' scores in sports where extreme scores are discarded before calculation.
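
A minimal sketch of a trimmed mean that drops the $k$ smallest and $k$ largest observations before averaging. The function name `trimmed_mean` and the judges' scores are hypothetical, chosen to mirror the sports example.

```python
def trimmed_mean(values, k):
    """Mean of the observations left after dropping the k smallest and k largest."""
    n = len(values)
    if k < 0 or 2 * k >= n:
        raise ValueError("k must satisfy 0 <= k < n / 2")
    ordered = sorted(values)
    kept = ordered[k:n - k]          # the middle n - 2k order statistics
    return sum(kept) / len(kept)


# Hypothetical scores from seven judges; one judge scores far lower than the rest.
scores = [9.0, 8.5, 8.0, 8.5, 3.0, 9.5, 8.5]

print(round(sum(scores) / len(scores), 2))  # ordinary mean: 7.86
print(trimmed_mean(scores, 1))              # drop lowest and highest: 8.5
```

With $k = 0$ the function reduces to the ordinary mean. As $k$ grows toward $n/2$ the result approaches the median.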

Measures of Variability

  • Reporting a measure of center (mean or median) gives only partial information about a data set.
  • Samples can have the same central measures but different spreads.
    • Visual Representation: Dot plots may show varying extents of spread even with identical means and medians.

Descriptive Statistical Measures for Variability

  • Types of Measures:
    • Variance
    • Standard Deviation
    • Interquartile Range (IQR)
    • Range
    • Quartile Deviation

Measures of Variability - The Range

  • The range is the difference between the largest and smallest sample values:
    $R = x_{\max} - x_{\min}$
  • The range is adequate for small data sets, but it depends only on the two extreme values and ignores how the remaining observations are spread.
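
The sketch below computes the range for two hypothetical samples that share the same mean and median (both 50) but differ in spread. The data and the name `sample_range` are invented for illustration.

```python
def sample_range(values):
    """Range: largest observation minus smallest observation."""
    return max(values) - min(values)


# Two hypothetical samples with identical mean and median but different spread.
tight = [48, 49, 50, 51, 52]
wide = [10, 30, 50, 70, 90]

print(sample_range(tight))  # 4
print(sample_range(wide))   # 80
```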

Measures of Variability - The Interquartile Range

  • The Interquartile Range (IQR) is defined as:
    $IQR = Q_3 - Q_1$
  • Where $Q_3$ is the median of the upper half and $Q_1$ is the median of the lower half of the data set.
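
The "median of each half" rule can be sketched as follows. One detail the definition leaves open is what to do when $n$ is odd. This sketch assumes the overall median is included in both halves, which is one common convention. The function name is hypothetical.

```python
from statistics import median


def iqr_by_halves(values):
    """Return (Q1, Q3, IQR) using the medians of the lower and upper halves.

    Assumption: when n is odd, the middle observation belongs to both halves.
    """
    ordered = sorted(values)
    n = len(ordered)
    q1 = median(ordered[:(n + 1) // 2])   # lower half
    q3 = median(ordered[n // 2:])         # upper half
    return q1, q3, q3 - q1


heart_rates = [86, 86, 92, 100, 112, 116, 136, 140]
print(iqr_by_halves(heart_rates))  # (89.0, 126.0, 37.0)
```

Library routines that interpolate between observations may report slightly different quartiles for the same data.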

Variance and Standard Deviation

  • Population Variance ($\sigma^2$) and Sample Variance ($s^2$):
    • Population variance formula:
      $\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2$
    • Sample variance formula:
      $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$
  • The standard deviation is the positive square root of the variance: $\sigma = \sqrt{\sigma^2}$ for a population and $s = \sqrt{s^2}$ for a sample.
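
The two formulas differ only in the divisor ($N$ versus $n - 1$). The sketch below implements both directly and compares them with Python's `statistics` module. The data set and function names are hypothetical, chosen so that the arithmetic is easy to follow.

```python
from math import sqrt
from statistics import pvariance, variance


def population_variance(values):
    """Average squared deviation from the mean, dividing by N."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n


def sample_variance(values):
    """Sum of squared deviations from the sample mean, dividing by n - 1."""
    n = len(values)
    xbar = sum(values) / n
    return sum((x - xbar) ** 2 for x in values) / (n - 1)


# Hypothetical data: mean 5, sum of squared deviations 32.
data = [2, 4, 4, 4, 5, 5, 7, 9]

print(population_variance(data))        # 4.0  (32 / 8)
print(sqrt(population_variance(data)))  # 2.0
print(sample_variance(data))            # 32 / 7, about 4.57
print(pvariance(data), variance(data))  # library equivalents
```

The same eight numbers give a larger value when treated as a sample, because the divisor is 7 rather than 8.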

Variance with a Constant

  • If $y = cx + d$, where $c$ and $d$ are constants:
    • Sample Variance of $y$:
      $s_y^2 = c^2 s_x^2$
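
The result follows because $\bar{y} = c\bar{x} + d$, so each deviation is $y_i - \bar{y} = c(x_i - \bar{x})$ and the shift $d$ cancels. A quick numerical check, using hypothetical temperatures converted from Celsius to Fahrenheit ($c = 1.8$, $d = 32$):

```python
from math import isclose
from statistics import stdev, variance

# Hypothetical temperatures in degrees Celsius (invented for this check).
celsius = [18.0, 21.5, 19.0, 25.0, 22.5]
c, d = 1.8, 32.0
fahrenheit = [c * x + d for x in celsius]

# The shift d has no effect; the scale c enters the variance as c squared.
print(isclose(variance(fahrenheit), c ** 2 * variance(celsius)))  # True
# The standard deviation therefore scales by |c|.
print(isclose(stdev(fahrenheit), abs(c) * stdev(celsius)))        # True
```

The comparison uses `isclose` rather than `==` because floating-point rounding can make the two sides differ in the last digits.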

Boxplots

  • A boxplot is based on measures that remain stable in the presence of a few outliers, specifically the median and a measure of spread known as the fourth spread.
  • Definitions for Boxplots:
    • Lower Fourth: Median of the smallest half.
    • Upper Fourth: Median of the largest half.
    • Fourth Spread ($f_s$):
      $f_s$ = upper fourth - lower fourth

The Five-Number Summary

  • A basic boxplot summarizes a data set with:
    • Minimum
    • Lower Fourth
    • Median
    • Upper Fourth
    • Maximum
  • Boxplots provide a quick overview of the data distribution and help detect outliers.
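
The five numbers can be computed directly from the sorted data. This sketch assumes that, when $n$ is odd, the median is counted in both halves when finding the fourths. The function name is hypothetical, and the data are the heart rates from Example 1.15.

```python
from statistics import median


def five_number_summary(values):
    """Return (minimum, lower fourth, median, upper fourth, maximum).

    Assumption: when n is odd, the median is included in both halves.
    """
    ordered = sorted(values)
    n = len(ordered)
    lower_fourth = median(ordered[:(n + 1) // 2])
    upper_fourth = median(ordered[n // 2:])
    return ordered[0], lower_fourth, median(ordered), upper_fourth, ordered[-1]


print(five_number_summary([108, 68, 80, 83, 72]))
# (68, 72, 80, 83, 108)
print(five_number_summary([86, 86, 92, 100, 112, 116, 136, 140]))
# (86, 89.0, 106.0, 126.0, 140)
```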

Outliers in Boxplots

  • Outlier Definition: Any observation farther than 1.5 times the fourth spread ($1.5 f_s$) from the closest fourth is considered an outlier.
    • Extreme Outlier: More than $3 f_s$ from the nearest fourth.
    • Mild Outlier: More than $1.5 f_s$ but at most $3 f_s$ from the nearest fourth.
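
The outlier rule can be sketched as a small classifier. It treats an observation as mild when its distance from the nearest fourth is more than 1.5 but at most 3 fourth spreads, and as extreme beyond 3 fourth spreads. The function name, the data, and the choice to include the median in both halves when $n$ is odd are assumptions made for this illustration.

```python
from statistics import median


def classify_outliers(values):
    """Return (mild, extreme) outliers using the 1.5 and 3 fourth-spread rule."""
    ordered = sorted(values)
    n = len(ordered)
    lower_fourth = median(ordered[:(n + 1) // 2])
    upper_fourth = median(ordered[n // 2:])
    fs = upper_fourth - lower_fourth          # fourth spread

    mild, extreme = [], []
    for x in ordered:
        # Distance from the nearest fourth; zero for values inside the box.
        distance = max(lower_fourth - x, x - upper_fourth, 0)
        if distance > 3 * fs:
            extreme.append(x)
        elif distance > 1.5 * fs:
            mild.append(x)
    return mild, extreme


# Hypothetical data: lower fourth 13, upper fourth 18, fourth spread 5.
data = [10, 12, 13, 14, 15, 16, 17, 18, 28, 45]
print(classify_outliers(data))  # ([28], [45])
```

Here 28 lies 10 units above the upper fourth (between 7.5 and 15), so it is mild, while 45 lies 27 units above it and is extreme.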

Interpreting Box Plots

  • Characteristics of distributions based on boxplot structure:
    • Symmetric Distribution: Median line in center of box and whiskers of equal length.
    • Skewed Right: Median line left of center and long right whisker.
    • Skewed Left: Median line right of center and long left whisker.

Comparative Boxplots

  • Comparative (side-by-side) boxplots are an effective way to reveal similarities and differences between two or more data sets on the same variable.
    • Example: Box plots of marks from 450 students across three classes can illustrate their comparative performance.