Descriptive Statistics for Numerical Data - Comprehensive Notes

Descriptive Statistics for Numerical Data

  • Descriptive statistics focus on summarizing and describing the main features of a dataset, not making inferences about a larger population.
  • Key goals: provide a concise overview of the data’s characteristics to gain insights.
  • Three characteristics used to describe data:
    • Central tendency (center)
    • Dispersion (spread)
    • Shape
  • Source note: slides are based on OpenIntro Statistics by Diez et al.

Descriptive Statistics Overview

  • Descriptive statistics summarize the main features of a dataset.
  • They do not infer properties about a population; they describe the data at hand.
  • Central tendency, dispersion, and shape are the primary descriptors used.
  • Central tendency identifies a value that best represents the center of the data distribution.
  • Dispersion describes how spread out the data are around the center.
  • Shape captures the overall distribution form (modality, symmetry/skewness, outliers).

Central Tendency

  • Central tendency is a statistical measure that identifies a value that best represents the center of a data distribution; a value where most data are concentrated.
  • Measures of central tendency:
    • Mean
    • Median
    • Mode
  • The mean (average) is a common way to measure the center of a distribution. Think of the mean as the balancing point of the distribution.
  • Notation:
    • Sample mean:
      \bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^n x_i
    • Population mean:
      \mu = \frac{1}{N}\sum_{i=1}^N x_i
  • The sample mean (\bar{x}) is a point estimate of the population mean (\mu).
  • (\bar{x}) is not a perfect estimate of (\mu), but a good estimate if the sample is representative of the population.
  • Additional notation:
    • The variable in the population is often denoted by X, and the population mean of that variable is (\mu_X).
  • Practical note: The mean can be useful for rescaling or standardizing a metric to compare across datasets.

Using the Mean: Example

  • Example: Emilio’s food truck business over the last 3 months: total revenue = $11,000; total hours = 625.
  • Average revenue per hour (mean rate):
    \text{Average rate} = \frac{\$11{,}000}{625\ \text{hours}} = \$17.60/\text{hour}
  • This is the mean revenue per hour worked: the total revenue spread evenly over all 625 hours, which gives a single figure for typical performance.
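The arithmetic in this example can be checked with a short Python sketch. The function name `sample_mean` and the variable names are illustrative choices, not taken from the slides.

```python
def sample_mean(values):
    """Return the arithmetic mean: the sum of the observations divided by n."""
    if not values:
        raise ValueError("the mean of an empty dataset is undefined")
    return sum(values) / len(values)


# Emilio's totals from the example above.
total_revenue = 11_000  # dollars over the last 3 months
total_hours = 625

# Mean revenue per hour is the total revenue divided by the total hours.
average_rate = total_revenue / total_hours
print(f"Average revenue per hour: ${average_rate:.2f}")

# The same formula applied to a small list of observations.
print(sample_mean([2, 4, 7, 9, 9]))
```

The rate is a ratio of two totals, which equals the mean of the hourly revenues only because every hour is weighted equally.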

Median and Mode

  • The median is the middle value when data are ordered; for an even number of values, the median is the average of the two middle values.
    • Example: If (X = 2, 4, 7, 9, 9), then (\text{Median}(X) = 7).
    • If (Z = 2, 4, 7, 9), then (\text{Median}(Z) = 5.5).
  • The mode is the value that appears most frequently in the dataset.
    • A dataset can have more than one mode: a distribution with one mode is unimodal, with two modes bimodal, and with more than two multimodal.
    • Example: If (X = 2, 4, 7, 9, 9), then (\text{Mode}(X) = 9).
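The median and mode examples above can be reproduced with Python's standard `statistics` module. This is a minimal sketch; the variable names are illustrative.

```python
import statistics

X = [2, 4, 7, 9, 9]
Z = [2, 4, 7, 9]

median_x = statistics.median(X)    # odd n: the single middle value
median_z = statistics.median(Z)    # even n: average of the two middle values
modes_x = statistics.multimode(X)  # list of every most-frequent value

print("Median(X) =", median_x)
print("Median(Z) =", median_z)
print("Mode(s) of X =", modes_x)
```

`multimode` returns a list, so it also handles bimodal and multimodal data, where a single "most frequent value" does not exist.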

Dispersion (Spread)

  • Dispersion refers to how spread out the data are around the center; to measure it we use variance and standard deviation.
  • Sample variance ((s^2)) is, roughly, the average squared deviation from the mean.
    • Formula:
      s^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2
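  • Rationale for dividing by (n-1) instead of (n): deviations measured from the sample mean tend to be slightly too small, so dividing by (n) would underestimate the population variance. Dividing by (n-1) corrects this and makes (s^2) a more reliable estimate.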
  • Why square the deviations?
    1) To remove negative values so they don’t cancel.
    2) To emphasize larger deviations (outliers) more.
  • The standard deviation is the square root of the variance:
    s = \sqrt{s^2} = \sqrt{\frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2}
  • Notation for population dispersion:
    • Population variance: (\sigma^2)
    • Population standard deviation: (\sigma)
  • The range is another dispersion measure: the distance between the first and last observations in the sorted data distribution.
  • To compute the variance by hand: find each deviation (x_i - \bar{x}), square it, sum the squared deviations, and divide the sum by (n-1).
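The variance steps can be sketched in Python as follows. The helper name `sample_variance` is an illustrative choice, and the dataset is the X used in the median example; the standard library's `statistics.variance` serves as a cross-check because it also divides by n - 1.

```python
import math
import statistics


def sample_variance(values):
    """Sample variance: sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("sample variance needs at least two observations")
    mean = sum(values) / n
    squared_deviations = [(x - mean) ** 2 for x in values]
    return sum(squared_deviations) / (n - 1)


X = [2, 4, 7, 9, 9]
s2 = sample_variance(X)       # sample variance
s = math.sqrt(s2)             # sample standard deviation
data_range = max(X) - min(X)  # largest minus smallest observation

print(f"s^2 = {s2:.2f}, s = {s:.2f}, range = {data_range}")
# Cross-check against the standard library implementation.
print(math.isclose(s2, statistics.variance(X)))
```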

Notation: Population vs Sample

  • Population parameter (population mean): (\mu)
  • Sample statistic (sample mean): (\bar{x})
  • The population mean is a fixed, unknown parameter; the sample mean is a statistic computed from a sample.
  • The five-number summary and other summary statistics are computed from the sample unless stated otherwise.
  • In graphs, the variable X might represent things like the interest rate charged on a loan; the population mean would be the mean across all loans, while (\bar{x}) would be the mean across the sample.

Shape of the Distribution

  • Shape refers to the visual form of the distribution.
  • Features used to describe shape:
    • Modality: unimodal, bimodal, multimodal
    • Skewness: symmetry or skew (right-skewed vs left-skewed)
    • Outliers: presence of unusual observations
  • Visualization helps reveal shape, including symmetry, skewness, and outliers.

Visualization Methods

  • Dot plots and stacked dot plots: useful for small datasets; darker colors indicate more observations.
  • Histograms: useful for large datasets; group data into bins and plot counts or densities.
  • Box plots: summarize data with a five-number summary and show unusual observations (outliers).

Dot Plots and Stacked Dot Plots

  • A dot plot is a one-variable scatterplot for a numerical variable; darker colors indicate higher concentration.
  • Dot plots can be stacked so that repeated or nearby values pile up instead of overlapping; taller stacks show more observations.

Histograms

  • When data are large, dot plots are hard to read; histograms group data into bins and count observations per bin.
  • A histogram is a plot of the number of observations in each bin (counts) or a plot of relative frequencies.
  • Bin limits rule: observations that fall on the lower bin-limit are allocated to that bin; observations on the upper bin-limit are allocated to the next bin.
  • Histograms provide a view of data density: higher bars indicate more common data values; they reveal the shape (unimodal, skewed, etc.).
  • Bin width affects the story a histogram tells; too wide or too narrow can obscure features.
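The bin-limit rule above (lower limit included, upper limit excluded) can be sketched in Python. The function `bin_counts` and the `ages` values are hypothetical, made up only to show where boundary observations land.

```python
def bin_counts(values, edges):
    """Count observations per bin, where each bin is [lower, upper).

    An observation equal to a lower limit stays in that bin; one equal to
    an upper limit is allocated to the next bin. Values outside
    [edges[0], edges[-1]) are ignored in this sketch.
    """
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(edges) - 1):
            if edges[i] <= v < edges[i + 1]:
                counts[i] += 1
                break
    return counts


# Hypothetical ages; 17, 19 and 21 sit exactly on bin limits.
ages = [15, 16.5, 17, 18, 19, 19, 20.9, 21, 22, 24]
edges = [15, 17, 19, 21, 23, 25]

counts = bin_counts(ages, edges)
for lower, upper, count in zip(edges, edges[1:], counts):
    print(f"{lower} <= X < {upper}: {count}")
```

Rerunning the sketch with different `edges` is a quick way to see how bin width changes the apparent shape.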

Frequency vs Relative Frequency in Histograms

  • Histograms can show either frequencies or relative frequencies.
  • Example bin table (age in years, n = 50):
    • 15 ≤ X < 17 : freq 2, rel. freq 0.04
    • 17 ≤ X < 19 : freq 10, rel. freq 0.20
    • 19 ≤ X < 21 : freq 15, rel. freq 0.30
    • 21 ≤ X < 23 : freq 21, rel. freq 0.42
    • 23 ≤ X < 25 : freq 2, rel. freq 0.04
  • Subsequent slides include the same data in bin width form and relative frequency form.
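The relative frequencies in the bin table are each bin's count divided by n. A minimal Python sketch, using the counts from the table (the variable names are illustrative):

```python
# Bin labels and frequencies from the age table above (n = 50).
bins = ["15 <= X < 17", "17 <= X < 19", "19 <= X < 21", "21 <= X < 23", "23 <= X < 25"]
frequencies = [2, 10, 15, 21, 2]

n = sum(frequencies)
relative_frequencies = [f / n for f in frequencies]

for label, f, rf in zip(bins, frequencies, relative_frequencies):
    print(f"{label}: freq {f:2d}, rel. freq {rf:.2f}")

# Relative frequencies are proportions, so they add up to 1.
print(f"n = {n}, total relative frequency = {sum(relative_frequencies):.2f}")
```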

Practice Questions (Histograms and Shape)

  • Practice Question 2: Which graph is more helpful to describe the distribution of interest rate, and why? To describe the data, discuss center, spread, and shape.
  • Practice Question 3: Given the emp_length variable from loan50 data, create a frequency table with an appropriate number of bins, then sketch a histogram (data include values like 1, 2, 10, NA, etc.).
  • Practice Question 4: Describe the shape of the distribution of hours per week spent on extracurricular activities.
  • Practice Question 5: If you want to estimate typical household income for a student, would you prefer the mean or the median? Why?

Box Plots

  • A box plot summarizes a data set using five statistics and also plots unusual observations.
  • Steps to build a box plot:
    1) Draw a dark line denoting the median, which is the middle value when data are ordered.
    2) Draw a rectangle to represent the middle 50% of the data (the box).
      • (Q_1) = first quartile (25% of the data fall below this value)
      • (Q_3) = third quartile (75% of the data fall below this value)
      • The height of the box is the interquartile range:
        \mathrm{IQR} = Q_3 - Q_1
    3) Draw the whiskers to capture data outside the box, reaching no more than 1.5 × IQR beyond the quartiles.
      • Upper whisker reach: (Q_3 + 1.5 \times \mathrm{IQR})
      • Lower whisker reach: (Q_1 - 1.5 \times \mathrm{IQR})
      • Each whisker stops at the most extreme observation within its reach: the upper whisker at the largest such observation, the lower whisker at the smallest.
    4) Identify observations beyond the whiskers as potential outliers and plot them individually.
  • Box plots show the five-number summary (min, Q1, median, Q3, max) and highlight potential outliers.
  • Example note: the lower whisker may not extend to the absolute minimum if that value is an outlier.
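The box-plot steps can be sketched in Python with the standard `statistics` module. The function name `box_plot_summary` and the dataset are hypothetical. Quartile conventions differ between textbooks and software, so the `method="inclusive"` choice here is an assumption and other tools may give slightly different quartiles.

```python
import statistics


def box_plot_summary(values):
    """Compute the quantities needed to draw a box plot with 1.5 x IQR whiskers."""
    data = sorted(values)
    # Quartiles; "inclusive" is one of several common conventions.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_reach = q1 - 1.5 * iqr
    upper_reach = q3 + 1.5 * iqr
    # Whiskers stop at the most extreme observations inside the reach.
    inside = [x for x in data if lower_reach <= x <= upper_reach]
    outliers = [x for x in data if x < lower_reach or x > upper_reach]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": outliers,
    }


# Hypothetical data with one unusually large observation.
data = [2, 4, 7, 9, 9, 30]
summary = box_plot_summary(data)
for name, value in summary.items():
    print(f"{name}: {value}")
```

In this sketch the upper whisker does not extend to the maximum, because the largest value lies beyond the upper reach and is reported as a potential outlier instead.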

Outliers and Robust Statistics

  • An outlier is an observation that appears extreme relative to the rest of the data.
  • Why pay attention to outliers?
    • Identify strong skewness in the distribution
    • Identify possible data entry errors
    • Provide insight into interesting properties of the data
  • Robust statistics are statistics that are not unduly influenced by outliers.
    • The median and the interquartile range (IQR) are robust because extreme values have little to no effect on them.
    • Therefore:
      • Skewed distributions → use median and IQR
      • Symmetric distributions → use mean and standard deviation
  • Mean vs. median behavior:
    • If distribution is symmetric, the center is often defined by the mean (mean ≈ median)
    • If distribution is skewed or has extreme outliers, the center is often defined by the median
    • Right-skewed: mean > median
    • Left-skewed: mean < median
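A small Python sketch illustrates why the median is called robust. The second dataset is hypothetical: it is the earlier X with its largest value replaced by an extreme one.

```python
import statistics

original = [2, 4, 7, 9, 9]
with_outlier = [2, 4, 7, 9, 90]  # hypothetical: one value made extreme

mean_original = statistics.mean(original)
mean_outlier = statistics.mean(with_outlier)
median_original = statistics.median(original)
median_outlier = statistics.median(with_outlier)

print(f"mean:   {mean_original} -> {mean_outlier}")
print(f"median: {median_original} -> {median_outlier}")
```

The single large value pulls the mean well above the median, matching the right-skewed pattern (mean > median), while the median does not move.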

Transforming Data

  • When data are extremely skewed, transforming them might make modeling easier.
  • Common transformation: the log transformation (usually base 10), denoted as log10.
  • A transformation is a rescaling of the data using a function.
    • Example: a plot of log10(Population) gives data that are more symmetric; outliers appear less extreme.
    • Transformations can help in building statistical models that fit the data better.
  • Transformations can be applied to one or both variables in a scatterplot.
  • Goals of transforming data:
    • See the data structure differently
    • Reduce skewness
    • Assist in modeling
    • Straighten a nonlinear relationship in a scatterplot
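The effect of a log10 transformation on skewness can be sketched in Python. The `populations` values are hypothetical, chosen only to be strongly right-skewed, and the transformation requires strictly positive data.

```python
import math
import statistics

# Hypothetical, strongly right-skewed population counts (all positive).
populations = [1_200, 3_500, 8_000, 15_000, 40_000, 250_000, 9_000_000]

# Rescale each observation with the base-10 logarithm.
log_populations = [math.log10(p) for p in populations]

raw_mean = statistics.mean(populations)
raw_median = statistics.median(populations)
log_mean = statistics.mean(log_populations)
log_median = statistics.median(log_populations)

print(f"raw:   mean = {raw_mean:,.0f}, median = {raw_median:,.0f}")
print(f"log10: mean = {log_mean:.2f}, median = {log_median:.2f}")
```

On the raw scale the mean is far above the median; on the log10 scale the two are much closer, which is one sign that the transformed data are more symmetric.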

Transforming Data Across Groups and in Practice

  • Transformations are sometimes used when comparing numerical data across groups or when fitting models across skewed datasets.

Shape Across Groups: Comparing Data (At Home)

  • When comparing numerical data across groups, side-by-side box plots and hollow histograms can be helpful.

Practice Questions (Cross-Group and Transformations)

  • Practice Question 5 (revisited): Compare median incomes for counties across two groups and discuss center, dispersion, and shape.

Notes on Notation and Concepts from the Slides

  • Population parameter vs. sample statistic:
    • Population mean: (\mu) (a fixed, unknown parameter describing the entire population)
    • Sample mean: (\bar{x}) (a statistic computed from a sample)
  • The variable of interest is often denoted as X, with population mean (\mu_X).
  • The sample mean (\bar{x}) is a point estimate of the population mean, not a perfect estimate but often useful when the sample is representative.
  • The standard deviation and variance describe typical dispersion around the mean.
  • The range provides a quick sense of spread; IQR provides a robust measure of spread (less sensitive to outliers).

Quick Reference Formulas

  • Mean (sample):
    \bar{x} = \frac{1}{n}\sum_{i=1}^n x_i

  • Mean (population):
    \mu = \frac{1}{N}\sum_{i=1}^N x_i

  • Variance (sample):
    s^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2

  • Standard deviation (sample):
    s = \sqrt{s^2}

  • Variance (population):
    \sigma^2

  • Standard deviation (population):
    \sigma

  • Interquartile range:
    \mathrm{IQR} = Q_3 - Q_1

  • Box-plot whiskers (1.5×IQR):
    \text{Upper whisker} = Q_3 + 1.5 \times \mathrm{IQR},\quad \text{Lower whisker} = Q_1 - 1.5 \times \mathrm{IQR}

  • Range:
    \text{Range} = \max(x_i) - \min(x_i)

  • Notes:

    • The choice between mean vs median depends on skewness and presence of outliers.
    • Bin width in histograms can dramatically alter the interpretation of shape.
    • Relative frequency is the proportion of observations in each bin, while frequency is the count.