Descriptive Statistics - One-Variable Categorical Data (Comprehensive Notes)
Notation and Sampling
Random sample of size n: observations x_i drawn from the same population via a valid sampling procedure.
Observations are independent and identically distributed (i.i.d.).
Notation used: sample values x_1, x_2, \dots, x_n.
Exploratory Data Analysis (EDA)
EDA is described as detective work and as foundational for data science: in practice, roughly 80% of the work is EDA and data wrangling.
EDA aims to answer: "What can we learn from the data?" through graphical representations and numerical summaries.
Graphical representations reveal information for context-based claims; numbers gain meaning when placed in context.
EDA Focus and Outcomes
Focus Question: What can we learn from the data?
Key components: shapes, centers, variability, outliers, gaps, clusters, and context.
Learning Outcomes (Ohlone’s Student Learning Outcomes)
Interpret data displayed in tables and graphically.
Calculate measures of central tendency and variation for a given data set.
Summary Statistics
Definition: A summary statistic is a numerical value that summarizes the sample data.
From a random sample, a summary statistic captures what is learned from the sample into a single value.
A single value acts as a proxy for all sample data, losing individuality of observations.
Summary statistics can misrepresent data (quote: "One death is a tragedy. One million is a statistic.").
Summary Statistics and Inference
Descriptive statistics describe the sample; they do not attribute properties to a population.
However, they may provide a basis for conjectures for subsequent testing.
Information learned from a sample (with proper sampling) can be generalized to the population from which the sample was drawn.
Quantitative Data and Variables
Quantitative variable: values are numerical (interval or ratio levels).
Such data can be summarized by a mathematical function (e.g., mean) or transformation.
The most common summary statistic for quantitative data is the mean; others include the median and correlation coefficient.
Discrete vs. Continuous Variables:
Discrete: countable values (integers like 1, 2, 3, …).
Example: Number of children in a household, number of cars owned.
Continuous: can take any value within an interval, so the possible values are uncountably many (e.g., X \in [0, 1]).
Example: Height of a person, temperature, time taken to complete a task.
Measures of Central Tendency
Mean (Sample Mean): \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i.
Also described as the arithmetic average, or balance point, of the data.
For a sample of size n, the mean is the average of the observations.
Example: Mean(50, 50, 100) = \frac{50+50+100}{3} = \frac{200}{3} \approx 66.7.
Example: Mean(50, 100, 100) = \frac{50+100+100}{3} = \frac{250}{3} \approx 83.3.
Real-life example: The average test score in a class, or the average income of residents in a city.
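The two mean examples above can be checked with a short Python sketch. The helper name sample_mean is illustrative only; the last lines illustrate the "balance point" idea, assuming it refers to the fact that deviations from the mean sum to zero.

```python
def sample_mean(values):
    """Arithmetic mean: sum of the observations divided by n."""
    return sum(values) / len(values)

scores_a = [50, 50, 100]   # first example from the notes
scores_b = [50, 100, 100]  # second example from the notes

mean_a = sample_mean(scores_a)  # 200 / 3
mean_b = sample_mean(scores_b)  # 250 / 3
print(round(mean_a, 1), round(mean_b, 1))  # 66.7 83.3

# Balance-point property: deviations from the mean sum to zero
# (up to floating-point rounding).
deviation_sum = sum(x - mean_a for x in scores_a)
```

Replacing a single 50 with 100 moves the mean from about 66.7 to about 83.3, which shows how strongly one observation can pull the mean in a small sample.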
Median (Sample Median):
Center of the distribution for ordered data; the middle value that splits the data into two halves.
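Not unique in some samples: when n is even, any value between the two middle observations splits the data in half, so by convention their midpoint is used. It also lacks some of the convenient algebraic properties of the mean.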
Computing the median from ordered data x^{(1)}, x^{(2)}, …, x^{(n)}:
If n is odd: Median = x^{(\tfrac{n+1}{2})}.
If n is even: Median = \dfrac{x^{(\tfrac{n}{2})} + x^{(\tfrac{n}{2}+1)}}{2}.
Example (provided): Ordered data example yields medians as shown in the material.
Real-life example: The median home price in a neighborhood, which is less affected by a few extremely expensive houses than the mean.
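A minimal sketch of the odd/even median rule stated above. The function name median_by_rule is hypothetical; the notes index ordered data from 1, while Python lists start at 0, so each position is shifted by one.

```python
def median_by_rule(values):
    """Sample median using the odd/even rule on the ordered data."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        # Odd n: the ((n + 1) / 2)-th ordered value (1-based).
        return ordered[(n + 1) // 2 - 1]
    # Even n: average of the (n / 2)-th and (n / 2 + 1)-th ordered values.
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

odd_example = median_by_rule([100, 50, 50])   # ordered: 50, 50, 100
even_example = median_by_rule([4, 1, 3, 2])   # ordered: 1, 2, 3, 4
print(odd_example, even_example)  # 50 2.5
```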
Mode (Sample Mode):
Center defined by the most frequently observed value(s).
Data may have multiple modes if frequencies are prominent at several values.
Real-life example: The most popular shoe size sold in a store, or the most frequently chosen color of car.
Measures of Variability
Range: difference between the maximum and minimum values. \text{Range}(x_1,\dots,x_n) = x^{(n)} - x^{(1)}.
Real-life example: The difference between the highest and lowest temperature recorded in a day.
Interquartile Range (IQR): range of the middle 50% of the data. \text{IQR}(x_1,\dots,x_n) = Q_3 - Q_1, where Q_1 \approx x^{(0.25n)}, \quad Q_3 \approx x^{(0.75n)}.
Real-life example: The spread of the middle 50% of salaries in a company, which helps understand typical salary variations without being affected by extreme high or low earners.
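The range and IQR can be computed as in the sketch below. The data are hypothetical, and the quartile convention is an assumption: textbooks and software use several slightly different definitions of Q1 and Q3, and here Python's statistics.quantiles with method="inclusive" stands in for the approximate positions given in the notes.

```python
from statistics import quantiles

data = [2, 4, 4, 5, 7, 9, 10, 12, 40]  # hypothetical sample

# Range: maximum minus minimum.
data_range = max(data) - min(data)

# Quartiles under the "inclusive" convention (one of several in use).
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print(data_range, iqr)  # 38 6.0
```

The single large value 40 inflates the range to 38, while the IQR of 6 reflects only the middle half of the data.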
Variance (Sample Variance):
s^2 = \text{Var}(x_1, x_2, \dots, x_n) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}.
Standard Deviation (Sample SD):
s = \text{SD}(x_1, x_2, \dots, x_n) = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}.
Real-life example for SD: How spread out the scores are from the average in a standardized test; a high SD indicates a wide range of scores, while a low SD means scores are clustered around the mean.
Notes on why certain denominators appear (e.g., n-1):
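Dividing by n-1 instead of n corrects the downward bias that arises because deviations are measured from \bar{x} rather than from the unknown population mean; this makes s^2 an unbiased estimator of the population variance.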
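As a rough illustration of the variance and standard deviation formulas, the sketch below computes both by hand on a hypothetical sample and compares the n-1 denominator with the n denominator. The variable names are illustrative.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample
n = len(data)

x_bar = sum(data) / n                                # sample mean: 5.0
sum_sq_dev = sum((x - x_bar) ** 2 for x in data)     # squared deviations: 32.0

s2 = sum_sq_dev / (n - 1)       # sample variance (n - 1 denominator)
s = math.sqrt(s2)               # sample standard deviation
divide_by_n = sum_sq_dev / n    # n denominator, for comparison: 4.0

# The standard library uses the same n - 1 definition.
library_s2 = statistics.variance(data)
```

Here the n-1 version (32/7, about 4.57) is larger than the n version (4.0); the gap shrinks as n grows.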
Graphical Displays
General guidelines for graphical displays:
Include titles, labels, and legends.
Pay attention to the scale.
Captions should provide context.
Only include graphics that contribute to the goals.
For quantitative data, describe shape, center, variability, and outstanding features.
Pie Charts:
Circular displays where area/proportion corresponds to relative frequencies.
Relative frequencies sum to 1.00.
Can be misleading due to spatial interpretation.
Example: Class distribution and Survived distribution in Titanic dataset.
Real-life example: Showing the proportion of different categories of expenses in a household budget.
Bar Charts:
Display frequencies or relative frequencies for each category.
x-axis lists categorical values; y-axis shows counts or proportions.
Bars are separated by gaps.
Real-life example: Comparing the number of students enrolled in different majors at a university.
Histograms:
Display where height of each bar shows the number or proportion of observations within an interval.
Construction involves choosing bins of equal width; aim for 5–8 bins.
Real-life example: Illustrating the distribution of ages of customers visiting a store, or the distribution of heights of individuals in a population.
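The counting step behind a histogram can be sketched as follows. This is one simple way to form equal-width bins, not the only one; the function name histogram_counts and the ages are hypothetical, and the sketch assumes the data are not all identical.

```python
def histogram_counts(values, num_bins):
    """Count observations in equal-width bins spanning min..max.

    Bins are closed on the left and open on the right, except the last
    bin, which also includes the maximum. Assumes max(values) > min(values).
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for v in values:
        # Clamp so the maximum value lands in the last bin.
        index = min(int((v - lo) / width), num_bins - 1)
        counts[index] += 1
    edges = [lo + k * width for k in range(num_bins + 1)]
    return edges, counts

ages = [18, 21, 22, 25, 27, 30, 31, 34, 35, 38, 41, 45, 52, 58, 68]
edges, counts = histogram_counts(ages, num_bins=5)
print(counts)  # [5, 4, 3, 1, 2]
```

The decreasing counts with a longer tail toward higher ages would be drawn as a right-skewed histogram.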
Cumulative (Relative) Frequency Graphs:
Plot cumulative frequencies or relative frequencies against the value axis.
Should be non-decreasing; the curve reaches its maximum (n for cumulative frequencies, 1.00 for cumulative relative frequencies) at the maximum data value.
Real-life example: Showing the percentage of students who scored at or below a certain grade on an exam.
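The points of a cumulative relative frequency graph can be computed as in this sketch. The grades are hypothetical and the function name is illustrative.

```python
from collections import Counter

def cumulative_relative_frequency(values):
    """Return (value, proportion of observations <= value) pairs."""
    n = len(values)
    counts = Counter(values)
    running = 0
    points = []
    for value in sorted(counts):
        running += counts[value]
        points.append((value, running / n))
    return points

grades = [70, 80, 80, 90, 100]
points = cumulative_relative_frequency(grades)
print(points)  # [(70, 0.2), (80, 0.6), (90, 0.8), (100, 1.0)]
```

Reading the output: 60% of these students scored at or below 80, and the curve ends at 1.0 at the largest grade.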
Dotplots:
Each observation is a dot on the value axis; identical values stack vertically.
Real-life example: Displaying the number of hours students spent studying per week, where each dot represents a student.
Stemplots (Stem-and-Leaf):
Split each value into a stem (leading digits) and leaf (last digit).
A key translates stems/leaves back to original values.
Back-to-back stemplots can compare two distributions.
Real-life example: Quickly organizing and visualizing individual test scores in a small class, showing the exact values and their distribution simultaneously.
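A minimal stemplot sketch, assuming non-negative integers with the tens (and higher) digits as the stem and the ones digit as the leaf. The scores and the function name are hypothetical.

```python
from collections import defaultdict

def stemplot(values):
    """Build stem-and-leaf rows; key: '6 | 2' means 62."""
    rows = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)  # stem = leading digits, leaf = last digit
        rows[stem].append(leaf)
    lines = []
    # Include empty stems so gaps in the data stay visible.
    for stem in range(min(rows), max(rows) + 1):
        leaves = "".join(str(leaf) for leaf in rows.get(stem, []))
        lines.append(f"{stem} | {leaves}")
    return lines

scores = [62, 67, 71, 75, 75, 83, 88, 91]
for line in stemplot(scores):
    print(line)
# 6 | 27
# 7 | 155
# 8 | 38
# 9 | 1
```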
Boxplots:
Display 5-number summary: minimum, Q1, median, Q3, maximum.
Outliers identified via 1.5 IQR rule; whiskers extend to the most extreme non-outlier values.
Real-life example: Comparing the distribution of salaries between different departments in a company, highlighting median salary, spread, and potential outlier earners.
Datasets and Examples
Titanic dataset:
4 variables resulting from cross-tabulation of 2201 observations:
Class: {1st, 2nd, 3rd, Crew}
Sex: {Male, Female}
Age: {Child, Adult}
Survived: {Yes, No}
Used to illustrate frequency tables and relative frequency tables.
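A frequency table and a relative frequency table for Class might be built as below. The class counts are an assumption: they are the totals commonly reported for the version of the Titanic table distributed with R, not values stated in these notes, though they do sum to the 2201 observations mentioned above.

```python
# Assumed class counts (commonly reported for R's Titanic table).
class_counts = {"1st": 325, "2nd": 285, "3rd": 706, "Crew": 885}

total = sum(class_counts.values())
relative = {label: count / total for label, count in class_counts.items()}

print(f"{'Class':<6}{'Freq':>6}{'Rel. freq':>11}")
for label, count in class_counts.items():
    print(f"{label:<6}{count:>6}{relative[label]:>11.3f}")
print(f"{'Total':<6}{total:>6}{sum(relative.values()):>11.3f}")
```

The relative frequencies are the heights of a relative-frequency bar chart, or the slice proportions of a pie chart, for the same variable.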
Hurricanes and grocery shoppers data (Examples I and II):
Example I: Hurricanes per year from 1944 to 1969; sample includes values like 0, 1, 2, 3, 4, 5, 6, 7.
Example II: 50 grocery shoppers’ total amounts spent (in dollars).
Descriptions include skewness, median, IQR, and potential outliers (e.g., possible outliers near 70–80 in some datasets).
Describing Distributions
Distribution definition: describes values the variable can take and the likelihood of observing each value.
A distribution can be visualized and organized to reveal patterns in the data's behavior.
To describe a distribution, mention:
1) Shape
2) Center
3) Variability
4) Unusual features (outliers, gaps, clusters)
5) Context of the problem
Shape categories:
Symmetrical (unimodal, thin tails)
Real-life example: Heights of adult humans, IQ scores.
Uniform (roughly equal height across values)
Real-life example: Rolling a fair six-sided die, where each outcome has an equal probability.
Skew Right (long right tail; peak on the left)
Real-life example: Household income distribution, where a few individuals earn significantly more.
Skew Left (long left tail; peak on the right)
Real-life example: Scores on an easy exam, where most students score high, and a few score low.
Bimodal/Multimodal (two or more prominent peaks)
Real-life example: Distribution of heights in a population containing both adult males and adult females (two distinct peaks).
Other descriptive features:
Outliers: unusually small or large observations relative to the rest.
Gaps: regions with no observed data.
Clusters: concentrations of data often separated by gaps.
Robust Statistics:
A statistic is robust if it is not heavily influenced by outliers.
In practice, robust statistics are preferred when data contain outliers.
Note: Merely labeling a statistic as "robust" is insufficient without justification.
Robustness for central tendency:
Mean is not robust (sensitive to outliers) but may be appropriate for symmetric distributions.
Median is robust (outliers have limited impact on the 50th percentile).
Mode is robust (outliers are not typically modes).
Robustness for variation:
Range is not robust (outliers affect it).
IQR is robust (middle 50% is less affected by outliers).
Variance and standard deviation are not robust.
Describing center and spread in context:
When distributions are skewed, prefer median and IQR for robust summaries.
For symmetric distributions, mean and standard deviation can be informative.
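The robustness claims above can be illustrated with a small experiment: compute each summary with and without one extreme value. The sample reuses the values from the outlier example later in these notes; dropping 21 to form the comparison set is a choice made for this sketch.

```python
from statistics import mean, median, stdev

base = [2, 2, 4, 5, 5, 7, 10]
with_outlier = base + [21]

# Mean and standard deviation shift noticeably ...
mean_before, mean_after = mean(base), mean(with_outlier)   # 5 -> 7
sd_before, sd_after = stdev(base), stdev(with_outlier)

# ... while the median does not move.
median_before, median_after = median(base), median(with_outlier)  # 5 -> 5.0

print(mean_before, mean_after)
print(median_before, median_after)
print(round(sd_before, 2), round(sd_after, 2))
```

One added observation raises the mean from 5 to 7 and more than doubles the standard deviation, while the median stays at 5.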
Descriptive Statistics in Practice: Examples
Example I (Rivers in North America):
Data: lengths of 108 rivers in miles.
Task: Construct a histogram and describe the distribution (shape, center, variability).
Example II (Grocery shoppers):
Data: 50 totals spent (in dollars).
Task: Construct a histogram and describe the distribution; identify skewness and approximate center (median) and variability (IQR).
Observations: Likely skewed right; median between 20 and 30; IQR (spread of the middle 50%) about 20.75; potential outliers beyond 70–80.
Example III (Presidents' ages at inauguration):
Data: Ages at inauguration (44 values listed).
Task: Construct a cumulative relative frequency graph; compute the median and IQR; interpret distribution.
Observations: Distribution roughly symmetrical, with tails that are not strictly decreasing; center (median) in the mid-50s; ages range from the low 40s to about 70; no unusual features.
Comparing Distributions
When comparing distributions, describe:
Shape, center, and variability for each distribution.
Use comparative language (larger/smaller, more/less).
Numerical summaries may be used to compare independent samples.
Example I: Baby weights for non-smoker vs smoker parents.
Non-smoker weights are left-skewed; smoker weights are more symmetrical.
Median: non-smoker center is larger than smoker center.
IQR: middle 50% variability is smaller for non-smoker group.
No unusual features noted in either distribution.
Comparative Visual Representations
Back-to-back Stemplots:
Compare distributions side-by-side for two groups.
Real-life example: Comparing exam scores of two different classes to see how their performances differ in detail.
Boxplots:
Compare distributions by shape, center, and variability using the five-number summaries.
Real-life example: Comparing the effectiveness of two different fertilizers on crop yield by showing the distribution of yields for each fertilizer.
Dotplots and Histograms:
Used to contrast frequency patterns, with attention to skewness and outliers.
Real-life example: Comparing the distribution of commute times for employees using public transport versus those driving personal cars.
Outliers and Boxplots
Boxplot construction steps:
1) Compute the five-number summary: min, Q1, median, Q3, max.
2) Identify outliers using the 1.5(IQR) rule: any value outside [Q1 - 1.5\cdot\text{IQR}, Q3 + 1.5\cdot\text{IQR}] is an outlier.
3) Draw the box from Q1 to Q3 with a line at the median; whiskers extend to the most extreme non-outlier values; plot outliers individually.
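The three construction steps above can be sketched as a function that returns the quantities needed to draw a boxplot. The data and the name boxplot_stats are hypothetical, and the quartile convention (statistics.quantiles with method="inclusive") is an assumption, since other conventions give slightly different quartiles.

```python
from statistics import quantiles

def boxplot_stats(values):
    """Five-number summary, 1.5*IQR fences, whisker ends, and outliers."""
    ordered = sorted(values)
    # Step 1: five-number summary (assumed quartile convention).
    q1, med, q3 = quantiles(ordered, n=4, method="inclusive")
    # Step 2: fences from the 1.5*IQR rule.
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [x for x in ordered if lower_fence <= x <= upper_fence]
    outliers = [x for x in ordered if x < lower_fence or x > upper_fence]
    # Step 3: whiskers stop at the most extreme non-outlier values.
    return {
        "min": ordered[0], "q1": q1, "median": med, "q3": q3, "max": ordered[-1],
        "fences": (lower_fence, upper_fence),
        "whiskers": (inside[0], inside[-1]),
        "outliers": outliers,
    }

stats = boxplot_stats([1, 3, 4, 5, 6, 7, 8, 9, 30])
print(stats["whiskers"], stats["outliers"])  # (1, 9) [30]
```

The upper whisker stops at 9 rather than at the maximum 30, because 30 lies above the upper fence of 14 and is plotted as an individual point.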
Outlier detection methods:
Mean and standard deviation rule: outlier if not within (\bar{x} - 2s, \bar{x} + 2s).
Median and IQR rule: outlier if not within (Q1 - 1.5\cdot\text{IQR}, Q3 + 1.5\cdot\text{IQR}).
These rules create artificial boundaries for identifying potential outliers.
Example: Outlier assessment in a small sample: values 2, 2, 4, 5, 5, 7, 10, 21.
Use both rules to determine whether 21 is an outlier.
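One way to work this example is sketched below. The quartile convention is again an assumption (statistics.quantiles with method="inclusive"); other conventions shift the fences slightly, but with this sample the verdict on 21 should not change.

```python
from statistics import mean, quantiles, stdev

sample = [2, 2, 4, 5, 5, 7, 10, 21]
candidate = 21

# Rule 1: mean and standard deviation (outside x_bar +/- 2s).
x_bar = mean(sample)    # 56 / 8 = 7
s = stdev(sample)       # sqrt(272 / 7), about 6.23
sd_low, sd_high = x_bar - 2 * s, x_bar + 2 * s
outlier_by_sd = not (sd_low < candidate < sd_high)

# Rule 2: median and IQR (outside Q1 - 1.5*IQR, Q3 + 1.5*IQR).
q1, _, q3 = quantiles(sample, n=4, method="inclusive")  # 3.5 and 7.75
iqr = q3 - q1                                           # 4.25
iqr_low, iqr_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outlier_by_iqr = not (iqr_low < candidate < iqr_high)

print(round(sd_low, 2), round(sd_high, 2), outlier_by_sd)
print(iqr_low, iqr_high, outlier_by_iqr)
```

Under these assumptions both rules flag 21: it exceeds the upper 2s bound of roughly 19.47 and the upper fence of 14.125. The 2s bound is wider partly because 21 itself inflates both the mean and s, which is the non-robustness discussed earlier.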
Boxplots and Interpretations
Boxplots summarize the distribution with a 5-number summary, showing central tendency and spread, and highlighting potential outliers.
The whiskers do not always extend to the absolute min and max if outliers are present; they extend to the most extreme non-outlier values.
Reading and Interpreting Cumulative Frequency Graphs
Cumulative Relative Frequency Graphs show the proportion of observations at or below a given value.
Percentiles can be interpreted from these graphs.
Percentiles and Measures of Location
Percentiles: The pth percentile is the value at or below which p% of the data fall.
The sample median is the 50th percentile.
In formulas: the pth percentile is a value v such that the proportion of observations with x_i \le v is p/100.
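A sketch of one percentile convention, the nearest-rank method: take the smallest ordered value that has at least p% of the data at or below it. This is an assumption made for illustration; other conventions interpolate between observations, and for even n they may return the average of the two middle values as the 50th percentile.

```python
import math

def percentile_nearest_rank(values, p):
    """Smallest ordered value with at least p% of the data at or below it."""
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]

sample = [2, 2, 4, 5, 5, 7, 10, 21]
p25 = percentile_nearest_rank(sample, 25)
p50 = percentile_nearest_rank(sample, 50)
p90 = percentile_nearest_rank(sample, 90)
print(p25, p50, p90)  # 2 5 21
```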
Practice: Describing Distributions (In Context)
In practice, you should describe the data in terms of shape, center, and variability, and include any notable features (outliers, gaps, clusters) and the context of the problem.