Descriptive Statistics for Numerical Data - Comprehensive Notes

Descriptive Statistics for Numerical Data

Descriptive statistics focus on summarizing and describing the main features of a dataset, not making inferences about a larger population.
Key goals: provide a concise overview of the data’s characteristics to gain insights.
Three characteristics used to describe data:
- Central tendency (center)
- Dispersion (spread)
- Shape
Source note: slides are based on OpenIntro Statistics by Diez et al.

Descriptive Statistics Overview

Descriptive statistics summarize the main features of a dataset.
They do not infer properties about a population; they describe the data at hand.
Central tendency, dispersion, and shape are the primary descriptors used.
Central tendency identifies a value that best represents the center of the data distribution.
Dispersion describes how spread out the data are around the center.
Shape captures the overall distribution form (modality, symmetry/skewness, outliers).

Central Tendency

Central tendency is a statistical measure that identifies a value that best represents the center of a data distribution; a value where most data are concentrated.
Measures of central tendency:
- Mean
- Median
- Mode
The mean (average) is a common way to measure the center of a distribution. Think of the mean as the balancing point of the distribution.
Notation:
- Sample mean:
 $\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^n x_i$
- Population mean:
 $\mu = \frac{1}{N}\sum_{i=1}^N x_i$
The sample mean (\bar{x}) is a point estimate of the population mean (\mu).
(\bar{x}) is not a perfect estimate of (\mu), but a good estimate if the sample is representative of the population.
Additional notation:
- The variable in the population is often denoted by X, and the population mean of that variable is (\mu_X).
Practical note: The mean can be useful for rescaling or standardizing a metric to compare across datasets.

Using the Mean: Example

Example: Emilio’s food truck business over the last 3 months: total revenue = \$11,000; total hours = 625.
Average revenue per hour (mean rate):
$\text{Average rate} = \frac{\$11{,}000}{625\ \text{hours}} = \$17.60/\text{hour}$
This illustrates the mean as a balance point for interpreting typical performance.
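As a minimal sketch of the calculation above, the following Python snippet computes a sample mean and the food truck rate. The helper name `sample_mean` and the small list `x` are illustrative assumptions, not part of the slides.

```python
def sample_mean(values):
    """Return the arithmetic mean: the sum of the values divided by their count."""
    if not values:
        raise ValueError("the mean of an empty dataset is undefined")
    return sum(values) / len(values)


# Rate from the food truck example: total revenue divided by total hours.
total_revenue = 11_000  # dollars
total_hours = 625
average_rate = total_revenue / total_hours

# Small illustrative dataset (an assumption for this sketch).
x = [2, 4, 7, 9, 9]

print(sample_mean(x))   # 6.2
print(average_rate)     # 17.6
```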

Median and Mode

The median is the middle value when data are ordered; for an even number of values, the median is the average of the two middle values.
- Example: If (X = 2, 4, 7, 9, 9), then (\text{Median}(X) = 7).
- If (Z = 2, 4, 7, 9), then (\text{Median}(Z) = 5.5).
The mode is the value that appears most frequently in the dataset.
- A dataset can have more than one mode: a distribution with one mode is unimodal, with two modes bimodal, and with more than two modes multimodal.
- Example: If (X = 2, 4, 7, 9, 9), then (\text{Mode}(X) = 9).
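The examples above can be checked with Python's standard `statistics` module. This is only a sketch; the list passed to `multimode` is a made-up dataset added to show a bimodal case.

```python
import statistics

x = [2, 4, 7, 9, 9]
z = [2, 4, 7, 9]

median_x = statistics.median(x)  # odd count: the single middle value
median_z = statistics.median(z)  # even count: average of the two middle values
mode_x = statistics.mode(x)      # the most frequent value

# multimode returns every value tied for most frequent.
# The dataset below is hypothetical and has two modes.
modes = statistics.multimode([1, 1, 2, 2, 3])

print(median_x, median_z, mode_x, modes)  # 7 5.5 9 [1, 2]
```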

Dispersion (Spread)

Dispersion refers to how spread out the data are around the center; to measure it we use variance and standard deviation.
Sample variance ((s^2)) is the average squared deviation from the mean.
- Formula:
 $s^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2$
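Rationale for dividing by (n-1) instead of (n): the deviations are measured from the sample mean (\bar{x}) rather than the unknown (\mu), so dividing by (n) would tend to underestimate the population variance. Dividing by (n-1) corrects for this.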
Why square the deviations?
1) To remove negative values so they don’t cancel.
2) To emphasize larger deviations (outliers) more.
The standard deviation is the square root of the variance:
$s = \sqrt{\,s^2\,} = \sqrt{\frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2}$
Notation for population dispersion:
- Population variance: (\sigma^2)
- Population standard deviation: (\sigma)
The range is another dispersion measure: the distance between the first and last observations in the sorted data distribution.
In a variance calculation, each deviation (x_i - \bar{x}) is squared, the squared deviations are summed, and the sum is divided by (n-1).
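The variance, standard deviation, and range can be worked through step by step. The sketch below reuses the small dataset 2, 4, 7, 9, 9 from the median example as an assumption; the variable names are illustrative.

```python
import math
import statistics

x = [2, 4, 7, 9, 9]
n = len(x)
x_bar = sum(x) / n                      # sample mean: 6.2

# Squared deviation of each observation from the mean.
squared_devs = [(xi - x_bar) ** 2 for xi in x]

s2 = sum(squared_devs) / (n - 1)        # sample variance, denominator n - 1
s = math.sqrt(s2)                       # sample standard deviation
data_range = max(x) - min(x)            # range: largest minus smallest

print(round(s2, 4), round(s, 4), data_range)  # 9.7 3.1145 7

# The standard library uses the same n - 1 convention for sample variance.
assert math.isclose(s2, statistics.variance(x))
```

`statistics.pvariance` and `statistics.pstdev` divide by the count instead, which matches the population quantities (\sigma^2) and (\sigma) when the data are the whole population.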

Notation: Population vs Sample

Population parameter (population mean): (\mu)
Sample statistic (sample mean): (\bar{x})
The population mean is a fixed, unknown parameter; the sample mean is a statistic computed from a sample.
The five-number summary and other metrics are often tied to the sample unless stated otherwise.
In graphs, the variable X might represent things like the interest rate charged on a loan; the population mean would be the mean across all loans, while (\bar{x}) would be the mean across the sample.

Shape of the Distribution

Shape refers to the visual form of the distribution.
Features used to describe shape:
- Modality: unimodal, bimodal, multimodal
- Skewness: symmetry or skew (right-skewed vs left-skewed)
- Outliers: presence of unusual observations
Visualization helps reveal shape, including symmetry, skewness, and outliers.

Visualization Methods

Dot plots and stacked dot plots: useful for small datasets; darker colors indicate more observations.
Histograms: useful for large datasets; group data into bins and plot counts or densities.
Box plots: summarize data with a five-number summary and show unusual observations (outliers).

Dot Plots and Stacked Dot Plots

A dot plot is a one-variable scatterplot for a numerical variable; darker colors indicate higher concentration.
Dot plots can be stacked to show more observations in a given region; taller stacks indicate more observations.

Histograms

When data are large, dot plots are hard to read; histograms group data into bins and count observations per bin.
A histogram is a plot of the number of observations in each bin (counts) or a plot of relative frequencies.
Bin limits rule: observations that fall on the lower bin-limit are allocated to that bin; observations on the upper bin-limit are allocated to the next bin.
Histograms provide a view of data density: higher bars indicate more common data values; they reveal the shape (unimodal, skewed, etc.).
Bin width affects the story a histogram tells; too wide or too narrow can obscure features.
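The bin-limit rule can be sketched in a few lines of Python. The function `bin_counts`, its parameters, and the sample ages are hypothetical names and values chosen for illustration; each bin is treated as closed on its lower limit and open on its upper limit.

```python
def bin_counts(values, lower, width, n_bins):
    """Count observations per bin, where bin k covers
    [lower + k*width, lower + (k+1)*width).

    A value on a lower bin limit is counted in that bin; a value on an
    upper bin limit falls into the next bin. Values outside the overall
    range (including one exactly on the last upper limit) are skipped.
    """
    counts = [0] * n_bins
    for v in values:
        index = int((v - lower) // width)
        if 0 <= index < n_bins:
            counts[index] += 1
    return counts


# Hypothetical ages; 17 sits on a bin limit and is counted in [17, 19).
ages = [15, 16, 17, 17, 18, 19, 20, 21, 22, 24]
print(bin_counts(ages, lower=15, width=2, n_bins=5))  # [2, 3, 2, 2, 1]
```

Changing `width` and `n_bins` on the same data is a quick way to see how bin width alters the apparent shape.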

Frequency vs Relative Frequency in Histograms

Histograms can show either frequencies or relative frequencies.
Example bin table (age in years, n = 50):
- 15 ≤ X < 17 : freq 2, rel. freq 0.04
- 17 ≤ X < 19 : freq 10, rel. freq 0.20
- 19 ≤ X < 21 : freq 15, rel. freq 0.30
- 21 ≤ X < 23 : freq 21, rel. freq 0.42
- 23 ≤ X < 25 : freq 2, rel. freq 0.04
Subsequent slides include the same data in bin width form and relative frequency form.
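Relative frequency is each bin's count divided by the total number of observations. The sketch below recomputes the relative frequencies from the counts in the age table; the interval labels and variable names are assumptions made for this example.

```python
# Bin counts taken from the age table (n = 50).
frequencies = {
    "[15, 17)": 2,
    "[17, 19)": 10,
    "[19, 21)": 15,
    "[21, 23)": 21,
    "[23, 25)": 2,
}

n = sum(frequencies.values())

# Relative frequency = count in the bin / total count.
relative = {label: count / n for label, count in frequencies.items()}

for label, rel in relative.items():
    print(f"{label}: freq {frequencies[label]}, rel. freq {rel:.2f}")
```

The relative frequencies sum to 1, so a relative-frequency histogram has the same shape as the count histogram with a rescaled vertical axis.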

Practice Questions (Histograms and Shape)

Practice Question 2: Which graph is more helpful to describe the distribution of interest rate, and why? To describe the data, discuss center, spread, and shape.
Practice Question 3: Given the emp_length variable from loan50 data, create a frequency table with an appropriate number of bins, then sketch a histogram (data include values like 1, 2, 10, NA, etc.).
Practice Question 4: Describe the shape of the distribution of hours per week spent on extracurricular activities.
Practice Question 5: If you want to estimate typical household income for a student, would you prefer the mean or the median? Why?

Box Plots

A box plot summarizes a data set using five statistics and also plots unusual observations.
Steps to build a box plot:
1) Draw a dark line denoting the median, which is the middle value when data are ordered.
2) Draw a rectangle to represent the middle 50% of the data (the box).
- (Q_1) = first quartile (25% of the data fall below this value)
- (Q_3) = third quartile (75% of the data fall below this value)
- The height of the box is the interquartile range:
 $\mathrm{IQR} = Q_3 - Q_1$
3) Draw the whiskers to capture data outside the box, reaching at most 1.5 × IQR beyond the quartiles.
- Upper whisker reach: (Q_3 + 1.5 \times \mathrm{IQR})
- Lower whisker reach: (Q_1 - 1.5 \times \mathrm{IQR})
- Each whisker ends at the most extreme observation within its limit: the upper whisker at the largest such observation, the lower whisker at the smallest.
4) Mark observations beyond the whiskers individually; these are potential outliers.
Box plots show the five-number summary (min, Q1, median, Q3, max) and highlight potential outliers.
Example note: the lower whisker may not extend to the absolute minimum if that value is an outlier.
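The box plot steps can be sketched numerically as follows. The dataset is hypothetical, and quartile conventions differ between textbooks and software; this example assumes the "inclusive" method of `statistics.quantiles`, which may not match the slides' convention.

```python
import statistics

# Hypothetical data, chosen so that one value is far from the rest.
data = sorted([1, 3, 4, 5, 5, 6, 7, 8, 9, 30])

# Quartiles (assumed convention: the "inclusive" interpolation method).
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Maximum reach of the whiskers.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Whiskers end at the most extreme observations inside the limits.
inside = [v for v in data if lower_limit <= v <= upper_limit]
lower_whisker = min(inside)
upper_whisker = max(inside)

# Observations beyond the limits are flagged as potential outliers.
outliers = [v for v in data if v < lower_limit or v > upper_limit]

print(q1, median, q3, iqr)            # 4.25 5.5 7.75 3.5
print(lower_whisker, upper_whisker)   # 1 9
print(outliers)                       # [30]
```

Here the upper whisker stops at 9 rather than at the maximum, 30, because 30 lies beyond the upper limit of 13.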

Outliers and Robust Statistics

An outlier is an observation that appears extreme relative to the rest of the data.
Why pay attention to outliers?
- Identify strong skewness in the distribution
- Identify possible data entry errors
- Provide insight into interesting properties of the data
Robust statistics are statistics that are not unduly influenced by outliers.
- The median and the interquartile range (IQR) are robust because extreme values have little to no effect on them.
- Therefore:
  - Skewed distributions → use median and IQR
  - Symmetric distributions → use mean and standard deviation
Mean vs. median behavior:
- If distribution is symmetric, the center is often defined by the mean (mean ≈ median)
- If distribution is skewed or has extreme outliers, the center is often defined by the median
- Right-skewed: mean > median
- Left-skewed: mean < median
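A small numerical sketch shows why the median is called robust. Both datasets below are hypothetical; the second replaces the largest value of the first with an extreme one.

```python
import statistics

clean = [2, 4, 7, 9, 9]
with_outlier = [2, 4, 7, 9, 100]   # same data, largest value made extreme

mean_clean = statistics.mean(clean)
mean_outlier = statistics.mean(with_outlier)
median_clean = statistics.median(clean)
median_outlier = statistics.median(with_outlier)

# The mean is pulled toward the extreme value; the median does not move.
print(mean_clean, mean_outlier)      # 6.2 24.4
print(median_clean, median_outlier)  # 7 7

# The standard deviation is also inflated by the single extreme value.
sd_clean = statistics.stdev(clean)
sd_outlier = statistics.stdev(with_outlier)
print(round(sd_clean, 2), round(sd_outlier, 2))
```

With the extreme value on the right, the mean ends up well above the median, which is the pattern expected for right-skewed data.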

Transforming Data

When data are extremely skewed, transforming them might make modeling easier.
Common transformation: the log transformation (usually base 10), denoted as log10.
A transformation is a rescaling of the data using a function.
- Example: a plot of log10(Population) gives data that are more symmetric; outliers appear less extreme.
- Transformations can help in building statistical models that fit the data better.
Transformations can be applied to one or both variables in a scatterplot.
Goals of transforming data:
- See the data structure differently
- Reduce skewness
- Assist in modeling
- Straighten a nonlinear relationship in a scatterplot
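The effect of a log10 transformation on skewed data can be sketched as below. The population figures are made up for illustration, and the transformation assumes every value is strictly positive.

```python
import math
import statistics

# Hypothetical, strongly right-skewed populations (all values must be > 0).
populations = [1_000, 5_000, 10_000, 50_000, 10_000_000]

# Rescale each observation with the base-10 logarithm.
log_pop = [math.log10(p) for p in populations]

raw_mean = statistics.mean(populations)
raw_median = statistics.median(populations)
log_mean = statistics.mean(log_pop)
log_median = statistics.median(log_pop)

# On the raw scale the mean is far above the median; on the log scale
# the two are much closer, i.e. the data look more symmetric.
print(raw_mean, raw_median)
print(round(log_mean, 3), round(log_median, 3))
```

On the log scale, the largest value sits at 7 while the others fall between 3 and about 4.7, so it appears far less extreme than on the raw scale.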

Transforming Data Across Groups and in Practice

Transformations are sometimes used when comparing numerical data across groups or when fitting models across skewed datasets.

Shape Across Groups: Comparing Data (At Home)

When comparing numerical data across groups, side-by-side box plots and hollow histograms can be helpful.

Practice Questions (Cross-Group and Transformations)

Practice Question 5 (revisited): Compare median incomes for counties across two groups and discuss center, dispersion, and shape.

Notes on Notation and Concepts from the Slides

Population parameter vs. sample statistic:
- Population mean: (\mu) (a fixed, unknown parameter describing the entire population)
- Sample mean: (\bar{x}) (a statistic computed from a sample)
The variable of interest is often denoted as X, with population mean (\mu_X).
The sample mean (\bar{x}) is a point estimate of the population mean, not a perfect estimate but often useful when the sample is representative.
The standard deviation and variance describe typical dispersion around the mean.
The range provides a quick sense of spread; IQR provides a robust measure of spread (less sensitive to outliers).

Quick Reference Formulas

Mean (sample):
$\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i$
Mean (population):
$\mu = \frac{1}{N}\sum_{i=1}^N x_i$
Variance (sample):
$s^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2$
Standard deviation (sample):
$s = \sqrt{\,s^2\,}$
Variance (population):
$\sigma^2$
Standard deviation (population):
$\sigma$
Interquartile range:
$\mathrm{IQR} = Q_3 - Q_1$
Box-plot whiskers (1.5×IQR):
$\text{Upper whisker} = Q_3 + 1.5 \times \mathrm{IQR},\quad \text{Lower whisker} = Q_1 - 1.5 \times \mathrm{IQR}$
Range:
$\text{Range} = \max(x_i) - \min(x_i)$
Notes:
- The choice between mean vs median depends on skewness and presence of outliers.
- Bin width in histograms can dramatically alter the interpretation of shape.
- Relative frequency is the proportion of observations in each bin, while frequency is the count.