Descriptive Statistics for Numerical Data - Comprehensive Notes
Descriptive Statistics for Numerical Data
- Descriptive statistics focus on summarizing and describing the main features of a dataset, not making inferences about a larger population.
- Key goals: provide a concise overview of the data’s characteristics to gain insights.
- Three characteristics used to describe data:
- Central tendency (center)
- Dispersion (spread)
- Shape
- Source note: slides are based on OpenIntro Statistics by Diez et al.
Descriptive Statistics Overview
- Descriptive statistics summarize the main features of a dataset.
- They do not infer properties about a population; they describe the data at hand.
- Central tendency, dispersion, and shape are the primary descriptors used.
- Central tendency identifies a value that best represents the center of the data distribution.
- Dispersion describes how spread out the data are around the center.
- Shape captures the overall distribution form (modality, symmetry/skewness, outliers).
Central Tendency
- Central tendency is a statistical measure that identifies a value that best represents the center of a data distribution; a value where most data are concentrated.
- Measures of central tendency:
- Mean
- Median
- Mode
- The mean (average) is a common way to measure the center of a distribution. Think of the mean as the balancing point of the distribution.
- Notation:
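- Sample mean: (\bar{x})
- Population mean: (\mu)
- Sample mean formula, for observations (x_1, x_2, \dots, x_n):
\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}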
- The sample mean (\bar{x}) is a point estimate of the population mean (\mu).
- (\bar{x}) is not a perfect estimate of (\mu), but a good estimate if the sample is representative of the population.
- Additional notation:
- The variable in the population is often denoted by X, and the population mean of that variable is (\mu_X).
- Practical note: The mean can be useful for rescaling or standardizing a metric to compare across datasets.
Using the Mean: Example
- Example: Emilio’s food truck business over the last 3 months: total revenue = \$11{,}000; total hours = 625.
- Average revenue per hour (mean rate):
\text{Average rate} = \frac{\$11{,}000}{625 \text{ hours}} = \$17.60 \text{ per hour}
- This illustrates the mean as a balance point for interpreting typical performance.
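The arithmetic in this example can be checked with a minimal Python sketch. The variable names and the short `hourly_revenue` list are illustrative assumptions and do not come from the slides.

```python
from statistics import mean

# Totals from the food truck example.
total_revenue = 11_000  # dollars over the last 3 months
total_hours = 625

# Mean revenue per hour worked.
average_rate = total_revenue / total_hours
print(f"${average_rate:.2f} per hour")  # $17.60 per hour

# The same idea as a sample mean: sum of the observations divided by their count.
# These five hourly figures are hypothetical illustration values.
hourly_revenue = [15.0, 20.0, 17.0, 18.0, 18.0]
x_bar = sum(hourly_revenue) / len(hourly_revenue)
print(x_bar, mean(hourly_revenue))
```

Dividing a total by the number of units is the same operation as averaging the per-unit values, which is why a rate such as revenue per hour can be read as a mean.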
Median and Mode
- The median is the middle value when data are ordered; for an even number of values, the median is the average of the two middle values.
- Example: If (X = 2, 4, 7, 9, 9), then (\text{Median}(X) = 7).
- If (Z = 2, 4, 7, 9), then (\text{Median}(Z) = 5.5).
- The mode is the value that appears most frequently in the dataset.
- A dataset can have one mode (unimodal), two modes (bimodal), or more than two modes (multimodal).
- Example: If (X = 2, 4, 7, 9, 9), then (\text{Mode}(X) = 9).
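The median and mode examples above can be reproduced with Python's standard `statistics` module. This is a small sketch; the bimodal list in the last line is a made-up illustration, not data from the slides.

```python
from statistics import median, mode, multimode

x = [2, 4, 7, 9, 9]
z = [2, 4, 7, 9]

print(median(x))  # 7   (odd count: the middle value)
print(median(z))  # 5.5 (even count: average of 4 and 7)
print(mode(x))    # 9

# multimode returns every most-frequent value, so it also handles bimodal data.
bimodal = [1, 1, 2, 3, 3]  # hypothetical example
print(multimode(bimodal))  # [1, 3]
```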
Dispersion (Spread)
- Dispersion refers to how spread out the data are around the center; to measure it we use variance and standard deviation.
- Sample variance ((s^2)) is the average squared deviation from the mean.
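- Formula:
s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2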
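- Rationale for dividing by (n-1) instead of (n): deviations are measured from (\bar{x}) rather than from (\mu), so dividing by (n) tends to underestimate the population variance. Dividing by (n-1) corrects for this and makes (s^2) a more reliable estimate.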
- Why square the deviations?
1) To remove negative values so they don’t cancel.
2) To emphasize larger deviations (outliers) more.
- The standard deviation is the square root of the variance:
s = \sqrt{s^2} = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}
- Notation for population dispersion:
- Population variance: (\sigma^2)
- Population standard deviation: (\sigma)
- The range is another dispersion measure: the distance between the smallest and largest observations, (\text{Range} = x_{\max} - x_{\min}).
- To compute the variance by hand: find each deviation (x_i - \bar{x}), square it, add up the squared deviations, and divide the total by (n-1).
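The by-hand variance calculation can be sketched step by step in Python. The data reuse the values (2, 4, 7, 9, 9) from the median example; choosing that dataset here is an assumption for illustration, and the library calls at the end are only a cross-check.

```python
from math import sqrt
from statistics import stdev, variance

x = [2, 4, 7, 9, 9]
n = len(x)
x_bar = sum(x) / n                       # 6.2

deviations = [xi - x_bar for xi in x]    # x_i minus the sample mean
squared = [d ** 2 for d in deviations]   # squaring removes the signs
s2 = sum(squared) / (n - 1)              # sample variance, about 9.7
s = sqrt(s2)                             # sample standard deviation, about 3.11
data_range = max(x) - min(x)             # 7

# statistics.variance and statistics.stdev also divide by n - 1.
print(s2, variance(x))
print(s, stdev(x))
print(data_range)
```

Note that the deviations themselves sum to zero (up to rounding), which is why they are squared before averaging.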
Notation: Population vs Sample
- Population parameter (population mean): (\mu)
- Sample statistic (sample mean): (\bar{x})
- The population mean is a fixed, unknown parameter; the sample mean is a statistic computed from a sample.
- The five-number summary and the other summary statistics in these notes are computed from the sample unless stated otherwise.
- In graphs, the variable X might represent things like the interest rate charged on a loan; the population mean would be the mean across all loans, while (\bar{x}) would be the mean across the sample.
Shape of the Distribution
- Shape refers to the visual form of the distribution.
- Features used to describe shape:
- Modality: unimodal, bimodal, multimodal
- Skewness: symmetry or skew (right-skewed vs left-skewed)
- Outliers: presence of unusual observations
- Visualization helps reveal shape, including symmetry, skewness, and outliers.
Visualization Methods
- Dot plots and stacked dot plots: useful for small datasets; darker colors indicate more observations.
- Histograms: useful for large datasets; group data into bins and plot counts or densities.
- Box plots: summarize data with a five-number summary and show unusual observations (outliers).
Dot Plots and Stacked Dot Plots
- A dot plot is a one-variable scatterplot for a numerical variable; darker colors indicate higher concentration.
- Dot plots can be stacked so that repeated or nearby values pile up in a given region; taller stacks show more observations.
Histograms
- When data are large, dot plots are hard to read; histograms group data into bins and count observations per bin.
- A histogram is a plot of the number of observations in each bin (counts) or a plot of relative frequencies.
- Bin limits rule: observations that fall on the lower bin-limit are allocated to that bin; observations on the upper bin-limit are allocated to the next bin.
- Histograms provide a view of data density: higher bars indicate more common data values; they reveal the shape (unimodal, skewed, etc.).
- Bin width affects the story a histogram tells; too wide or too narrow can obscure features.
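The bin limits rule can be illustrated with a short Python sketch. The function `bin_counts`, its parameters, and the `ages` list are hypothetical names and values introduced only for this example.

```python
def bin_counts(values, lower, width, n_bins):
    """Count observations per bin, where each bin is [low, high):
    a value on a bin's lower limit belongs to that bin, and a value
    on its upper limit belongs to the next bin."""
    counts = [0] * n_bins
    for v in values:
        index = int((v - lower) // width)
        # Values outside the overall range are ignored in this sketch.
        if 0 <= index < n_bins:
            counts[index] += 1
    return counts

# Hypothetical ages; 17 and 19 sit exactly on bin limits.
ages = [15, 16.5, 17, 18, 18.9, 19, 20, 22, 24.5]
counts = bin_counts(ages, lower=15, width=2, n_bins=5)
print(counts)  # [2, 3, 2, 1, 1]

# A wider bin merges features that narrower bins keep apart.
wide_counts = bin_counts(ages, lower=15, width=5, n_bins=2)
print(wide_counts)  # [6, 3]
```

Changing `width` in this sketch is a quick way to see how bin width changes the picture a histogram gives.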
Frequency vs Relative Frequency in Histograms
- Histograms can show either frequencies or relative frequencies.
- Example bin table (age in years, n = 50):
- 15 ≤ X < 17 : freq 2, rel. freq 0.04
- 17 ≤ X < 19 : freq 10, rel. freq 0.20
- 19 ≤ X < 21 : freq 15, rel. freq 0.30
- 21 ≤ X < 23 : freq 21, rel. freq 0.42
- 23 ≤ X < 25 : freq 2, rel. freq 0.04
- Subsequent slides include the same data in bin width form and relative frequency form.
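The relative frequencies in the bin table follow from dividing each count by the total number of observations. A minimal Python sketch, with the bin labels written as plain strings for display:

```python
# Counts from the age table (n = 50).
bins = ["15-17", "17-19", "19-21", "21-23", "23-25"]
freq = [2, 10, 15, 21, 2]

n = sum(freq)                     # 50
rel_freq = [f / n for f in freq]  # proportion of observations per bin

for label, f, r in zip(bins, freq, rel_freq):
    print(f"{label}: freq {f:2d}, rel. freq {r:.2f}")

# Relative frequencies always add up to 1 (up to rounding).
print(round(sum(rel_freq), 10))
```

Both versions of the histogram have the same shape; only the scale of the vertical axis differs.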
Practice Questions (Histograms and Shape)
- Practice Question 2: Which graph is more helpful to describe the distribution of interest rate, and why? To describe the data, discuss center, spread, and shape.
- Practice Question 3: Given the emp_length variable from loan50 data, create a frequency table with an appropriate number of bins, then sketch a histogram (data include values like 1, 2, 10, NA, etc.).
- Practice Question 4: Describe the shape of the distribution of hours per week spent on extracurricular activities.
- Practice Question 5: If you want to estimate typical household income for a student, would you prefer the mean or the median? Why?
Box Plots
- A box plot summarizes a data set using five statistics and also plots unusual observations.
- Steps to build a box plot:
1) Draw a dark line denoting the median, which is the middle value when data are ordered.
2) Draw a rectangle to represent the middle 50% of the data (the box).
- (Q_1) = first quartile (25% below this)
- (Q_3) = third quartile (75% below this)
- The height of the box is the interquartile range: (\mathrm{IQR} = Q_3 - Q_1)
3) Draw the whiskers to capture data outside the box, reaching no more than 1.5 × IQR beyond the quartiles.
- Upper whisker reach: (Q_3 + 1.5 \times \mathrm{IQR})
- Lower whisker reach: (Q_1 - 1.5 \times \mathrm{IQR})
- The upper whisker extends to the largest observation within the upper limit, and the lower whisker extends to the smallest observation within the lower limit; observations beyond the whiskers are potential outliers.
4) Identify outliers as observations outside the whiskers.
- Box plots show the five-number summary (min, Q1, median, Q3, max) and highlight potential outliers.
- Example note: the lower whisker may not extend to the absolute minimum if that value is an outlier.
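The box plot steps above can be sketched in Python as follows. Quartile conventions differ between textbooks and software; this sketch assumes the "inclusive" interpolation method of `statistics.quantiles`, which may give slightly different quartiles than the method used in the slides. The function name `box_plot_summary` and the data are hypothetical.

```python
from statistics import quantiles

def box_plot_summary(values):
    """Return the pieces needed to draw a box plot with 1.5 x IQR whiskers."""
    data = sorted(values)
    # Assumed quartile convention: linear interpolation ("inclusive").
    q1, med, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_limit = q1 - 1.5 * iqr
    upper_limit = q3 + 1.5 * iqr
    inside = [v for v in data if lower_limit <= v <= upper_limit]
    outliers = [v for v in data if v < lower_limit or v > upper_limit]
    return {
        "q1": q1, "median": med, "q3": q3, "iqr": iqr,
        # Whiskers stop at the most extreme observations inside the limits.
        "whiskers": (min(inside), max(inside)),
        "outliers": outliers,
    }

# Hypothetical data with one unusually large value.
summary = box_plot_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])
print(summary)
# q1 = 3.25, median = 5.5, q3 = 7.75, iqr = 4.5,
# whiskers = (1, 9), outliers = [30]
```

Here the upper whisker stops at 9 rather than at the maximum of 30, because 30 lies beyond the upper limit of 7.75 + 1.5 × 4.5 = 14.5.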
Outliers and Robust Statistics
- An outlier is an observation that appears extreme relative to the rest of the data.
- Why pay attention to outliers?
- Identify strong skewness in the distribution
- Identify possible data entry errors
- Provide insight into interesting properties of the data
- Robust statistics are statistics that are not unduly influenced by outliers.
- The median and the interquartile range (IQR) are robust because extreme values have little to no effect on them.
- Therefore:
- Skewed distributions → use median and IQR
- Symmetric distributions → use mean and standard deviation
- Mean vs. median behavior:
- If distribution is symmetric, the center is often defined by the mean (mean ≈ median)
- If distribution is skewed or has extreme outliers, the center is often defined by the median
- Right-skewed: mean > median
- Left-skewed: mean < median
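A small Python sketch can show why the median is called robust. The second dataset is a hypothetical variation of the earlier example, with the last value replaced by an extreme one.

```python
from statistics import mean, median

x = [2, 4, 7, 9, 9]
x_outlier = [2, 4, 7, 9, 90]  # hypothetical: one extreme value on the right

print(mean(x), median(x))                  # 6.2 7
print(mean(x_outlier), median(x_outlier))  # 22.4 7

# The outlier pulls the mean far to the right, while the median does not move.
# With the long right tail, mean > median, matching the right-skewed rule above.
```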
Transforming Data
- When data are extremely skewed, transforming them might make modeling easier.
- Common transformation: the log transformation (usually base 10), denoted as log10.
- A transformation is a rescaling of the data using a function.
- Example: a plot of log10(Population) gives data that are more symmetric; outliers appear less extreme.
- Transformations can help in building statistical models that fit the data better.
- Transformations can be applied to one or both variables in a scatterplot.
- Goals of transforming data:
- See the data structure differently
- Reduce skewness
- Assist in modeling
- Straighten a nonlinear relationship in a scatterplot
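The effect of a log transformation on skewed data can be sketched in Python. The population values below are hypothetical and chosen only to be strongly right-skewed; the transformation requires all values to be positive.

```python
from math import log10
from statistics import mean, median

# Hypothetical, strongly right-skewed population counts.
populations = [1_000, 2_000, 5_000, 10_000, 1_000_000]

# Rescale every observation with the same function.
log_pop = [log10(p) for p in populations]

# Before: the single large value drags the mean far above the median.
print(mean(populations), median(populations))  # 203600 5000

# After: mean and median are close, and the largest value is less extreme.
print(round(mean(log_pop), 2), round(median(log_pop), 2))  # 4.0 3.7
```

On the log scale the largest observation is 6 instead of 1,000,000, so it sits much closer to the rest of the data.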
Transforming Data Across Groups and in Practice
- Transformations are sometimes used when comparing numerical data across groups or when fitting models across skewed datasets.
Shape Across Groups: Comparing Data (At Home)
- When comparing numerical data across groups, side-by-side box plots and hollow histograms can be helpful.
Practice Questions (Cross-Group and Transformations)
- Practice Question 5 (revisited): Compare median incomes for counties across two groups and discuss center, dispersion, and shape.
Notes on Notation and Concepts from the Slides
- Population parameter vs. sample statistic:
- Population mean: (\mu) (a fixed, unknown parameter describing the entire population)
- Sample mean: (\bar{x}) (a statistic computed from a sample)
- The variable of interest is often denoted as X, with population mean (\mu_X).
- The sample mean (\bar{x}) is a point estimate of the population mean, not a perfect estimate but often useful when the sample is representative.
- The standard deviation and variance describe typical dispersion around the mean.
- The range provides a quick sense of spread; IQR provides a robust measure of spread (less sensitive to outliers).
Quick Reference Formulas
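Mean (sample): (\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i)
Mean (population): (\mu = \frac{1}{N} \sum_{i=1}^{N} x_i), where (N) is the population size
Variance (sample): (s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2)
Standard deviation (sample): (s = \sqrt{s^2})
Variance (population): (\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2)
Standard deviation (population): (\sigma = \sqrt{\sigma^2})
Interquartile range: (\mathrm{IQR} = Q_3 - Q_1)
Box-plot whiskers (1.5×IQR): reach at most (Q_1 - 1.5 \times \mathrm{IQR}) below the box and (Q_3 + 1.5 \times \mathrm{IQR}) above it
Range: (\text{Range} = x_{\max} - x_{\min})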
Notes:
- The choice between mean vs median depends on skewness and presence of outliers.
- Bin width in histograms can dramatically alter the interpretation of shape.
- Relative frequency is the proportion of observations in each bin, while frequency is the count.