Descriptive Statistics - One-Variable Categorical Data (Comprehensive Notes)

Notation and Sampling

  • Random sample of size n: observations x_i drawn from the same population via a valid sampling procedure.

  • Observations are independent and identically distributed (i.i.d.).

  • Notation used: sample values x_1, x_2, \dots, x_n.

Exploratory Data Analysis (EDA)

  • EDA is described as detective work and is foundational for data science: in practice, roughly 80% of the work is EDA and data wrangling.

  • EDA aims to answer: "What can we learn from the data?" through graphical representations and numerical summaries.

  • Graphical representations reveal information for context-based claims; numbers gain meaning when placed in context.

EDA Focus and Outcomes

  • Focus Question: What can we learn from the data?

  • Key components: shapes, centers, variability, outliers, gaps, clusters, and context.

Learning Outcomes (Ohlone’s Student Learning Outcomes)

  • Interpret data displayed in tables and graphically.

  • Calculate measures of central tendency and variation for a given data set.

Summary Statistics

  • Definition: A summary statistic is a numerical value that summarizes the sample data.

  • From a random sample, a summary statistic captures what is learned from the sample into a single value.

  • A single value acts as a proxy for all sample data, losing individuality of observations.

  • Summary statistics can misrepresent data (quote: "One death is a tragedy. One million is a statistic.").

Summary Statistics and Inference

  • Descriptive statistics describe the sample; they do not attribute properties to a population.

  • However, they may provide a basis for conjectures for subsequent testing.

  • Information learned from a sample (with proper sampling) can be generalized to the population from which the sample was drawn.

Quantitative Data and Variables

  • Quantitative variable: values are numerical (interval or ratio levels).

  • Such data can be summarized by a mathematical function (e.g., mean) or transformation.

  • The most common summary statistic for quantitative data is the mean; others include the median and correlation coefficient.

  • Discrete vs. Continuous Variables:

    • Discrete: countable values (integers like 1, 2, 3, …).

      • Example: Number of children in a household, number of cars owned.

    • Continuous: can take any value within an interval (e.g., X \in [0, 1]).

      • Example: Height of a person, temperature, time taken to complete a task.

Measures of Central Tendency

  • Mean (Sample Mean): \bar{x} = \frac{1}{n}\sum_{i=1}^n x_i.

    • Also described as the arithmetic middle or balance point of the data.

    • For a sample of size n, the mean is the average of the observations.

    • Example: Mean(50, 50, 100) = \frac{50+50+100}{3} = \frac{200}{3} \approx 66.7.

    • Example: Mean(50, 100, 100) = \frac{50+100+100}{3} = \frac{250}{3} \approx 83.3.

    • Real-life example: The average test score in a class, or the average income of residents in a city.

  • Median (Sample Median):

    • Center of the distribution for ordered data; the middle value that splits the data into two halves.

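    • Not unique in some samples: when n is even, any value between the two middle observations splits the data in half, and the convention is to average them. The median is also less convenient to work with mathematically than the mean.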

    • Computing the median from ordered data x^{(1)}, x^{(2)}, …, x^{(n)}:

    • If n is odd: Median = x^{(\tfrac{n+1}{2})}.

    • If n is even: Median = \dfrac{x^{(\tfrac{n}{2})} + x^{(\tfrac{n}{2}+1)}}{2}.

    • Example (provided): Ordered data example yields medians as shown in the material.

    • Real-life example: The median home price in a neighborhood, which is less affected by a few extremely expensive houses than the mean.

  • Mode (Sample Mode):

    • Center defined by the most frequently observed value(s).

    • Data may have multiple modes if frequencies are prominent at several values.

    • Real-life example: The most popular shoe size sold in a store, or the most frequently chosen color of car.
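
The three measures of center above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names (sample_mean, sample_median, sample_modes) are made up for this example, and the median follows the odd/even rule stated above.

```python
from collections import Counter

def sample_mean(xs):
    """Arithmetic mean: sum of the observations divided by n."""
    return sum(xs) / len(xs)

def sample_median(xs):
    """Median from the ordered data, following the odd/even rule above."""
    ordered = sorted(xs)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]  # x^{((n+1)/2)} in 1-based indexing
    return (ordered[mid - 1] + ordered[mid]) / 2

def sample_modes(xs):
    """All values that share the highest frequency (there may be several)."""
    counts = Counter(xs)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

print(sample_mean([50, 50, 100]))                  # about 66.67
print(sample_mean([50, 100, 100]))                 # about 83.33
print(sample_median([50, 100, 100]))               # 100
print(sample_median([2, 2, 4, 5, 5, 7, 10, 21]))   # 5.0
print(sample_modes([2, 2, 4, 5, 5, 7, 10, 21]))    # [2, 5]
```

The last line shows a sample with two modes: both 2 and 5 occur twice.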

Measures of Variability

  • Range: difference between the maximum and minimum values. \text{Range}(x_1,\dots,x_n) = x^{(n)} - x^{(1)}.

    • Real-life example: The difference between the highest and lowest temperature recorded in a day.

  • Interquartile Range (IQR): range of the middle 50% of the data. \text{IQR}(x_1,\dots,x_n) = Q3 - Q1, where Q1 \approx x^{(0.25n)}, \quad Q3 \approx x^{(0.75n)}.

    • Real-life example: The spread of the middle 50% of salaries in a company, which helps understand typical salary variations without being affected by extreme high or low earners.

  • Variance (Sample Variance):
    s^2 = \text{Var}(x_1, x_2, \dots, x_n) = \frac{\sum_{i=1}^n (x_i - \bar{x})^2}{n-1}.

  • Standard Deviation (Sample SD): s = \text{SD}(x_1, x_2, \dots, x_n) = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^n (x_i - \bar{x})^2}{n-1}}.

    • Real-life example for SD: How spread out the scores are from the average in a standardized test; a high SD indicates a wide range of scores, while a low SD means scores are clustered around the mean.

  • Notes on why certain denominators appear (e.g., n-1):

    • Corrects bias in estimating population variance from a sample (unbiased estimator).
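
As a rough illustration of the measures of variability above, the sketch below computes each one for the small sample 2, 2, 4, 5, 5, 7, 10, 21 used later in these notes. One assumption to flag: quartiles are taken here as the medians of the lower and upper halves of the ordered data, which is only one of several conventions, so software may report slightly different values for Q1 and Q3.

```python
import math
import statistics

data = [2, 2, 4, 5, 5, 7, 10, 21]
ordered = sorted(data)
n = len(ordered)

# Range: largest minus smallest observation.
data_range = ordered[-1] - ordered[0]

# Quartiles as medians of the lower and upper halves of the ordered data
# (one textbook convention; software packages may interpolate differently).
half = n // 2
q1 = statistics.median(ordered[:half])
q3 = statistics.median(ordered[n - half:])
iqr = q3 - q1

# Sample variance with the n - 1 denominator, written out from the formula.
x_bar = sum(data) / n
s2 = sum((x - x_bar) ** 2 for x in data) / (n - 1)
s = math.sqrt(s2)

# The standard library uses the same n - 1 convention for sample variance.
assert math.isclose(s2, statistics.variance(data))
# Dividing by n instead gives a smaller value.
assert statistics.pvariance(data) < s2

print(data_range, q1, q3, iqr)    # 19 3.0 8.5 5.5
print(round(s2, 3), round(s, 3))  # 38.857 6.234
```

The comparison with statistics.pvariance shows the effect of the denominator: dividing by n gives 34, while dividing by n - 1 gives about 38.86.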

Graphical Displays

  • General guidelines for graphical displays:

    • Include titles, labels, and legends.

    • Pay attention to the scale.

    • Captions should provide context.

    • Only include graphics that contribute to the goals.

    • For quantitative data, describe shape, center, variability, and outstanding features.

  • Pie Charts:

    • Circular displays where area/proportion corresponds to relative frequencies.

    • Relative frequencies sum to 1.00.

    • Can be misleading due to spatial interpretation.

    • Example: Class distribution and Survived distribution in Titanic dataset.

    • Real-life example: Showing the proportion of different categories of expenses in a household budget.

  • Bar Charts:

    • Display frequencies or relative frequencies for each category.

    • x-axis lists categorical values; y-axis shows counts or proportions.

    • Bars are separated by gaps.

    • Real-life example: Comparing the number of students enrolled in different majors at a university.

  • Histograms:

    • Display where height of each bar shows the number or proportion of observations within an interval.

    • Construction involves choosing bins of equal width; aim for 5–8 bins.

    • Real-life example: Illustrating the distribution of ages of customers visiting a store, or the distribution of heights of individuals in a population.

  • Cumulative (Relative) Frequency Graphs:

    • Plot cumulative frequencies or relative frequencies against the value axis.

    • Should be non-decreasing; the graph reaches its maximum (n for frequencies, 1.00 for relative frequencies) at the largest data value.

    • Real-life example: Showing the percentage of students who scored at or below a certain grade on an exam.

  • Dotplots:

    • Each observation is a dot on the value axis; identical values stack vertically.

    • Real-life example: Displaying the number of hours students spent studying per week, where each dot represents a student.

  • Stemplots (Stem-and-Leaf):

    • Split each value into a stem (leading digits) and leaf (last digit).

    • A key translates stems/leaves back to original values.

    • Back-to-back stemplots can compare two distributions.

    • Real-life example: Quickly organizing and visualizing individual test scores in a small class, showing the exact values and their distribution simultaneously.

  • Boxplots:

    • Display 5-number summary: minimum, Q1, median, Q3, maximum.

    • Outliers identified via 1.5 IQR rule; whiskers extend to the most extreme non-outlier values.

    • Real-life example: Comparing the distribution of salaries between different departments in a company, highlighting median salary, spread, and potential outlier earners.
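
The histogram and cumulative relative frequency constructions described above can be sketched without a plotting library. The data below are hypothetical (hours studied per week by 12 students), the function names are made up for this example, and the sketch assumes the values are not all identical so that the bin width is positive.

```python
def histogram_counts(xs, num_bins):
    """Count observations in num_bins equal-width bins spanning min..max."""
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / num_bins  # assumes hi > lo
    counts = [0] * num_bins
    for x in xs:
        # The maximum value is placed in the last bin rather than a new one.
        index = min(int((x - lo) / width), num_bins - 1)
        counts[index] += 1
    edges = [lo + i * width for i in range(num_bins + 1)]
    return edges, counts

def cumulative_relative(counts):
    """Running total of the counts divided by n; the last entry is 1.0."""
    n = sum(counts)
    running, out = 0, []
    for c in counts:
        running += c
        out.append(running / n)
    return out

# Hypothetical data: hours studied per week by 12 students.
hours = [1, 2, 2, 3, 4, 4, 5, 6, 8, 9, 10, 11]
edges, counts = histogram_counts(hours, 5)
for left, right, c in zip(edges, edges[1:], counts):
    print(f"{left:4.1f} to {right:4.1f}: {'*' * c}")
print(cumulative_relative(counts))
```

The cumulative values never decrease and end at 1.0, which matches the description of cumulative relative frequency graphs above.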

Datasets and Examples

  • Titanic dataset:

    • 4 variables resulting from cross-tabulation of 2201 observations:

    • Class: {1st, 2nd, 3rd, Crew}

    • Sex: {Male, Female}

    • Age: {Child, Adult}

    • Survived: {Yes, No}

    • Used to illustrate frequency tables and relative frequency tables.

  • Hurricane and grocery-shopper data (Examples I and II):

    • Example I: Hurricanes per year from 1944 to 1969; sample includes values like 0, 1, 2, 3, 4, 5, 6, 7.

    • Example II: 50 grocery shoppers’ total amounts spent (in dollars).

    • Descriptions include skewness, median, IQR, and potential outliers (e.g., possible outliers near 70–80 in some datasets).
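
A frequency table and relative frequency table for the Titanic Class variable might be built as follows. The notes do not list the counts, so this sketch assumes the class totals from R's built-in Titanic table, which sum to the 2201 observations mentioned above.

```python
# Assumed class counts (from R's built-in Titanic table; 2201 people in total).
class_counts = {"1st": 325, "2nd": 285, "3rd": 706, "Crew": 885}

n = sum(class_counts.values())
# Relative frequency = category count divided by the total count.
relative = {k: v / n for k, v in class_counts.items()}

print(f"{'Class':<6}{'Freq':>6}{'Rel. freq':>11}")
for k, v in class_counts.items():
    print(f"{k:<6}{v:>6}{relative[k]:>11.3f}")
print(f"{'Total':<6}{n:>6}{sum(relative.values()):>11.3f}")
```

The Total row confirms that the relative frequencies sum to 1.00, as required for a pie chart or relative frequency bar chart.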

Describing Distributions

  • Distribution definition: describes values the variable can take and the likelihood of observing each value.

  • A distribution can be visualized and organized to reveal patterns in data behavior.

  • To describe a distribution, mention:

    1) Shape

    2) Center

    3) Variability

    4) Unusual features (outliers, gaps, clusters)

    5) Context of the problem

  • Shape categories:

    • Symmetrical (unimodal, thin tails)

      • Real-life example: Heights of adult humans, IQ scores.

    • Uniform (roughly equal height across values)

      • Real-life example: Rolling a fair six-sided die, where each outcome has an equal probability.

    • Skew Right (long right tail; peak on the left)

      • Real-life example: Household income distribution, where a few individuals earn significantly more.

    • Skew Left (long left tail; peak on the right)

      • Real-life example: Scores on an easy exam, where most students score high, and a few score low.

    • Bimodal/Multimodal (two or more prominent peaks)

      • Real-life example: Distribution of heights in a population containing both adult males and adult females (two distinct peaks).

  • Other descriptive features:

    • Outliers: unusually small or large observations relative to the rest.

    • Gaps: regions with no observed data.

    • Clusters: concentrations of data often separated by gaps.

  • Robust Statistics:

    • A statistic is robust if it is not heavily influenced by outliers.

    • In practice, robust statistics are preferred when data contain outliers.

    • Note: Merely labeling a statistic as "robust" is insufficient without justification.

  • Robustness for central tendency:

    • Mean is not robust (sensitive to outliers) but may be appropriate for symmetric distributions.

    • Median is robust (outliers have limited impact on the 50th percentile).

    • Mode is robust (outliers are not typically modes).

  • Robustness for variation:

    • Range is not robust (outliers affect it).

    • IQR is robust (middle 50% is less affected by outliers).

    • Variance and standard deviation are not robust.

  • Describing center and spread in context:

    • When distributions are skewed, prefer median and IQR for robust summaries.

    • For symmetric distributions, mean and standard deviation can be informative.
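
The robustness claims above can be checked on a small sample by adding a single extreme value and seeing which summaries move. This is a sketch under one assumption: the IQR uses medians of the lower and upper halves as quartiles, and the helper name quartile_spread is made up for this example.

```python
import statistics

without_outlier = [2, 2, 4, 5, 5, 7, 10]
with_outlier = without_outlier + [21]

def quartile_spread(xs):
    """IQR using medians of the lower and upper halves (one convention)."""
    ordered = sorted(xs)
    half = len(ordered) // 2
    q1 = statistics.median(ordered[:half])
    q3 = statistics.median(ordered[len(ordered) - half:])
    return q3 - q1

for label, xs in [("without 21", without_outlier), ("with 21", with_outlier)]:
    print(label,
          "mean:", statistics.mean(xs),
          "median:", statistics.median(xs),
          "sd:", round(statistics.stdev(xs), 2),
          "iqr:", quartile_spread(xs))
```

Adding 21 moves the mean from 5 to 7 and more than doubles the standard deviation (about 2.83 to about 6.23), while the median stays at 5 and the IQR changes only from 5 to 5.5.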

Descriptive Statistics in Practice: Examples

  • Example I (Rivers in North America):

    • Data: lengths of 108 rivers in miles.

    • Task: Construct a histogram and describe the distribution (shape, center, variability).

  • Example II (Grocery shoppers):

    • Data: 50 totals spent (in dollars).

    • Task: Construct a histogram and describe the distribution; identify skewness and approximate center (median) and variability (IQR).

    • Observations: Likely skewed right; median between 20 and 30; middle 50% variability around 20.75; potential outliers beyond 70–80.

  • Example III (Presidents' ages at inauguration):

    • Data: Ages at inauguration (44 values listed).

    • Task: Construct a cumulative relative frequency graph; compute the median and IQR; interpret distribution.

    • Observations: Distribution roughly symmetrical with tails not strictly decreasing; mean between 30 and 40; data range from 10 to 60 inclusive; no unusual features.

Comparing Distributions

  • When comparing distributions, describe:

    • Shape, center, and variability for each distribution.

    • Use comparative language (larger/smaller, more/less).

    • Numerical summaries may be used to compare independent samples.

  • Example I: Baby weights for non-smoker vs smoker parents.

    • Non-smoker weights are left-skewed; smoker weights are more symmetrical.

    • Median: non-smoker center is larger than smoker center.

    • IQR: middle 50% variability is smaller for non-smoker group.

    • No unusual features noted in either distribution.

Comparative Visual Representations

  • Back-to-back Stemplots:

    • Compare distributions side-by-side for two groups.

    • Real-life example: Comparing exam scores of two different classes to see how their performances differ in detail.

  • Boxplots:

    • Compare distributions by shape, center, and variability using the five-number summaries.

    • Real-life example: Comparing the effectiveness of two different fertilizers on crop yield by showing the distribution of yields for each fertilizer.

  • Dotplots and Histograms:

    • Used to contrast frequency patterns, with attention to skewness and outliers.

    • Real-life example: Comparing the distribution of commute times for employees using public transport versus those driving personal cars.
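
A back-to-back stemplot can be produced as plain text. The sketch below assumes non-negative whole-number data with the last digit as the leaf; the exam scores and the function name are hypothetical and serve only to illustrate the layout.

```python
from collections import defaultdict

def back_to_back_stemplot(left, right):
    """Return the lines of a back-to-back stemplot (last digit is the leaf)."""
    left_leaves, right_leaves = defaultdict(list), defaultdict(list)
    for x in left:
        left_leaves[x // 10].append(x % 10)
    for x in right:
        right_leaves[x // 10].append(x % 10)
    stems = range(min(min(left), min(right)) // 10,
                  max(max(left), max(right)) // 10 + 1)
    width = max(len(v) for v in left_leaves.values())
    lines = []
    for stem in stems:
        # Left leaves read outward from the stem, so they are sorted descending.
        l = "".join(str(d) for d in sorted(left_leaves[stem], reverse=True))
        r = "".join(str(d) for d in sorted(right_leaves[stem]))
        lines.append(f"{l:>{width}} | {stem} | {r}")
    return lines

# Hypothetical exam scores for two classes.
class_a = [58, 63, 67, 71, 74, 74, 79, 82, 85, 91]
class_b = [62, 68, 70, 73, 77, 81, 84, 86, 88, 95]
print("\n".join(back_to_back_stemplot(class_a, class_b)))
print("Key: 7 | 4 means 74")
```

Every stem between the smallest and largest is printed, even when one side has no leaves, so gaps remain visible.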

Outliers and Boxplots

  • Boxplot construction steps:

    1) Compute the five-number summary: min, Q1, median, Q3, max.

    2) Identify outliers using the 1.5(IQR) rule: any value outside [Q1 - 1.5\cdot\text{IQR}, Q3 + 1.5\cdot\text{IQR}] is an outlier.

    3) Draw the box from Q1 to Q3 with a line at the median; whiskers extend to the most extreme non-outlier values; plot outliers individually.

  • Outlier detection methods:

    • Mean and standard deviation rule: outlier if not within (\bar{x} - 2s, \bar{x} + 2s).

    • Median and IQR rule: outlier if not within (Q1 - 1.5\cdot\text{IQR}, Q3 + 1.5\cdot\text{IQR}).

    • These rules create artificial boundaries for identifying potential outliers.

  • Example: Outlier assessment in a small sample: values 2, 2, 4, 5, 5, 7, 10, 21.

    • Use both rules to determine whether 21 is an outlier.
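
The example above can be worked through in code. This sketch applies both outlier rules to the sample and also reports the five-number summary and whisker endpoints from the boxplot steps. As an assumption, quartiles are taken as medians of the lower and upper halves; other quartile conventions shift the fences slightly.

```python
import statistics

data = sorted([2, 2, 4, 5, 5, 7, 10, 21])
n = len(data)

# Rule 1: mean and standard deviation (flag values outside x_bar +/- 2s).
x_bar = statistics.mean(data)
s = statistics.stdev(data)
sd_low, sd_high = x_bar - 2 * s, x_bar + 2 * s
sd_outliers = [x for x in data if not (sd_low < x < sd_high)]

# Rule 2: median and IQR (flag values outside the 1.5 * IQR fences).
half = n // 2
q1 = statistics.median(data[:half])
q3 = statistics.median(data[n - half:])
iqr = q3 - q1
fence_low, fence_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = [x for x in data if x < fence_low or x > fence_high]

# Boxplot whiskers stop at the most extreme values that are not outliers.
inside = [x for x in data if fence_low <= x <= fence_high]
whisker_low, whisker_high = inside[0], inside[-1]

print("2s interval:", round(sd_low, 2), round(sd_high, 2), "->", sd_outliers)
print("fences:", fence_low, fence_high, "->", iqr_outliers)
print("five-number summary:", data[0], q1, statistics.median(data), q3, data[-1])
print("whiskers:", whisker_low, whisker_high)
```

Under these assumptions both rules flag 21: the 2s interval is roughly (-5.47, 19.47) and the fences are -5.25 and 16.75. The upper whisker therefore stops at 10, and 21 is plotted as an individual point.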

Boxplots and Interpretations

  • Boxplots summarize the distribution with a 5-number summary, showing central tendency and spread, and highlighting potential outliers.

  • The whiskers do not always extend to the absolute min and max if outliers are present; they extend to the most extreme non-outlier values.

Reading and Interpreting Cumulative Frequency Graphs

  • Cumulative Relative Frequency Graphs show the proportion of observations at or below a given value.

  • Percentiles can be interpreted from these graphs.

Percentiles and Measures of Location

  • Percentiles: The pth percentile is the value at or below which p% of the data fall.

    • The sample median is the 50th percentile.

    • In formulas: the pth percentile is the value such that p% of the data are \le that value.
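
The percentile definition above can be illustrated in two directions: from a value to the proportion at or below it, and from p back to a value. The second function uses the nearest-rank convention, which is an assumption of this sketch; the scores and function names are hypothetical.

```python
import math

def proportion_at_or_below(xs, value):
    """Cumulative relative frequency: share of observations <= value."""
    return sum(1 for x in xs if x <= value) / len(xs)

def percentile_nearest_rank(xs, p):
    """Smallest observation with at least p% of the data at or below it."""
    ordered = sorted(xs)
    rank = max(1, math.ceil(p * len(ordered) / 100))  # 1-based position
    return ordered[rank - 1]

# Hypothetical exam scores for ten students.
scores = [55, 62, 68, 70, 74, 78, 81, 85, 90, 97]
print(proportion_at_or_below(scores, 74))   # 0.5
print(percentile_nearest_rank(scores, 50))  # 74
print(percentile_nearest_rank(scores, 90))  # 90
```

Conventions differ: nearest rank returns 74 as the 50th percentile here, whereas the averaging rule for the median of an even-sized sample gives 76.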

Practice: Describing Distributions (In Context)

  • In practice, you should describe the data in terms of shape, center, and variability, and include any notable features (outliers, gaps, clusters) and the context of the problem.
