Biostatistics 3A Lecture Notes: Describing Numerical Data

Chapter 2: Describing Numerical Data

Setup and context from the lecture
- Technical hiccups in the live class (microphone off, audio issues for the online audience) were acknowledged; the instructor emphasizes that such “little things” are part of learning and one reason attending class is valuable.
- Deadlines and logistics mentioned:
- Quiz 1 and Homework 1 for Lecture 1 are due tonight, with Homework 1 due at 11:00 (not midnight).
- Quiz 2 and Homework 2 have been assigned; the instructor aims to finish the 70-slide lecture today.
- Quiz due date: one week from today.
- Wednesday: the instructor will review the homework and give an introduction to using R; two R labs are available on openintro.org under the book’s Labs section, and two (older) videos are available to help with R.
- A two-day grace period applies to the two Homework 2 problems that involve R.
- Emphasis on getting students to log into and use R before Wednesday; the OpenIntro R labs are referenced for additional practice.
- Reminder about the rule: to pass Homework, you must show both the R command and the R output.
Core topic of Lecture 2: examining and summarizing numerical data
- Goal: describe a large dataset by summarizing with statistics (two main ideas).
- Important caveat: relationships in scatter plots may imply association but not causation; a third variable may drive observed associations (e.g., poverty impacting life expectancy and fertility).
- Visual tools discussed: scatter plots, dot plots, histograms, bar plots, and box plots to describe distributions and relationships.
- A note about real-world data use: examples include surveys (e.g., votes for Trump vs. Kamala) and international data (fertility vs. life expectancy).
Scatter plots and association
- A scatter plot has an x-axis and a y-axis representing two numerical variables.
- Example discussed: number of children per woman (fertility) vs. life expectancy; dots represent countries; dot size may reflect population (larger dot = larger population).
- Observed pattern: a downward trend suggesting that higher fertility is associated with lower life expectancy in the plotted countries, but this is not claimed as causal.
- A warning about causality and a reminder that poverty acts as a third variable that influences both fertility and life expectancy.
- The relationship can appear linear in some regions but may not be strictly causal; the concept of association versus causation is emphasized.
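The third-variable idea can be sketched with a small simulation. This is a minimal illustration in Python (the course itself uses R), and every number in it is made up: a hypothetical poverty index drives both simulated fertility and simulated life expectancy, and neither outcome appears in the other's formula. The helper `pearson_r` is written here for illustration only.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(1)  # fixed seed so the run is repeatable

# Made-up model: a poverty index (0 to 1) drives BOTH outcomes.
# Fertility never enters the life-expectancy formula, and vice versa.
poverty = [rng.random() for _ in range(500)]
fertility = [2 + 3 * p + rng.gauss(0, 0.5) for p in poverty]
life_expectancy = [80 - 20 * p + rng.gauss(0, 3) for p in poverty]

r = pearson_r(fertility, life_expectancy)
print(f"correlation(fertility, life expectancy) = {r:.2f}")
```

The printed correlation is clearly negative even though the simulated fertility has no direct effect on the simulated life expectancy, which is the pattern the lecture warns about.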
Dot plots and data description
- Dot plots show individual observations; color/density can indicate concentration (e.g., darker colors indicate more observations in a region).
- Used to describe distributions like GPAs in a class.
- When describing a distribution, describe three aspects: center, shape, and spread.
- The instructor notes occasional classroom tech issues but returns to describe the plotting concepts.
Center of a distribution: mean vs. median
- Mean (average) is denoted by
 $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
- The population mean is denoted by $\mu$ (a population parameter); the sample mean is a statistic denoted by $\bar{x}$ and is an estimate of the population mean $\mu$ .
- Highlights a key distinction:
- Population parameter: value of the whole population (often unknown, symbolized by Greek letters like $\mu$ ).
- Sample statistic: a value calculated from a sample (like $\bar{x}$ ) used as an estimate of the population parameter.
- The phrase "population parameters are god-only-knows" is used humorously to emphasize that parameters cannot be measured directly; only statistics from a sample can be observed.
- If a distribution is skewed, the median can be a better measure of center than the mean (more robust to outliers).
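A small numeric sketch of the mean versus the median, using Python's standard library rather than the R used in the course. The sample is invented for illustration; its one large value makes it right-skewed.

```python
from statistics import mean, median

# Hypothetical, made-up sample; the single large value (40) skews it right.
sample = [2, 3, 3, 4, 5, 6, 40]

x_bar = mean(sample)     # sum of the values divided by n
middle = median(sample)  # middle value of the sorted data

print(f"mean = {x_bar:.2f}, median = {middle}")
```

Here the mean (9) sits above six of the seven observations, while the median (4) stays near the bulk of the data.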
Measures of spread and variability
- The sample variance is roughly the average squared deviation from the mean (the sum is divided by $n - 1$ rather than $n$):
 $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$
- Standard deviation is the square root of the variance:
 $s = \sqrt{s^2}$
- Example given: sleep hours data with a mean of $\bar{x} = 6.71$ hours and a variance of $s^2 = 4.11\ \text{hours}^2$, so
 $s = \sqrt{4.11} \approx 2.03\text{ hours}$
- Practical point: the standard deviation has convenient properties (described as “magical”). For roughly normal (bell-shaped) data, almost all observations lie within three standard deviations of the mean; approximately:
- 68% of data within $\bar{x} \pm s$
- 95% of data within $\bar{x} \pm 2s$
- 99.7% of data within $\bar{x} \pm 3s$
- This is the empirical rule (68-95-99.7), which will be discussed further later.
- Emphasis on robustness: mean and standard deviation are not robust to outliers or heavy skew; median and IQR are more robust in skewed distributions.
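The variance and standard deviation formulas can be checked by hand. The sketch below uses Python's standard library (the course uses R), with a made-up sleep-hours sample that is not the lecture's data. It also re-checks the lecture's arithmetic that the square root of 4.11 is about 2.03.

```python
import math
from statistics import mean, stdev, variance

# Hypothetical sleep-hours sample (made up; not the lecture data).
hours = [5.0, 6.0, 6.5, 7.0, 7.5, 8.0, 9.0]

n = len(hours)
x_bar = mean(hours)

# Sample variance: sum of squared deviations divided by n - 1.
s2_by_hand = sum((x - x_bar) ** 2 for x in hours) / (n - 1)
s_by_hand = math.sqrt(s2_by_hand)

# The standard library uses the same n - 1 definition.
assert math.isclose(s2_by_hand, variance(hours))
assert math.isclose(s_by_hand, stdev(hours))

print(f"mean = {x_bar:.2f}, s^2 = {s2_by_hand:.2f}, s = {s_by_hand:.2f}")

# Lecture arithmetic: sqrt(4.11) rounds to 2.03.
print(f"sqrt(4.11) = {math.sqrt(4.11):.2f}")
```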
Median, percentiles, and quartiles
- Median is the middle value when data are ordered; if there is an even number of observations, the median is the average of the two middle values.
- Percentiles capture the value below which a given percent of observations fall; the 50th percentile is the median.
- Quartiles:
- Q1 is the 25th percentile (the first quartile).
- Q3 is the 75th percentile (the third quartile).
- The interquartile range (IQR) is the width of the middle 50% of the data: $IQR = Q_3 - Q_1$
- Relationship between mean and median depends on skewness:
- Right-skewed (positive skew): mean > median (big values pull the mean up).
- Left-skewed (negative skew): mean < median (small values pull the mean down).
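A short sketch of quartiles and the IQR in Python (the course uses R). The data are invented, and the quartile method is an assumption: different software and textbooks use slightly different conventions, so Q1 and Q3 can differ a little for the same data.

```python
from statistics import mean, median, quantiles

# Hypothetical weekly study hours (made up for illustration).
hours = [2, 4, 5, 7, 8, 10, 12, 15, 30]

# method="inclusive" is one common quartile convention; other conventions
# may give slightly different Q1 and Q3 for the same data.
q1, q2, q3 = quantiles(hours, n=4, method="inclusive")
iqr = q3 - q1

print(f"Q1 = {q1}, median = {q2}, Q3 = {q3}, IQR = {iqr}")
print(f"mean = {mean(hours):.2f} vs median = {median(hours)}")

# With an even number of observations, the median averages the two middle values.
print(median([1, 2, 3, 4]))  # 2.5
```

Because of the large value 30, the mean (about 10.33) exceeds the median (8), as expected for right-skewed data.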
Box plots: anatomy and outliers
- A box plot shows the middle 50% of data in a box bounded by Q1 and Q3.
- The median is shown as a line inside the box.
- Whiskers extend to the most extreme data points that are not considered outliers. The 1.5 × IQR rule sets the farthest a whisker can reach:
- Upper whisker limit = $Q_3 + 1.5 \cdot IQR$
- Lower whisker limit = $Q_1 - 1.5 \cdot IQR$
- Observations beyond these limits are outliers and are plotted as individual points.
- Example given: if Q1 = 10, Q3 = 20, IQR = 10, then
- Upper whisker limit = 20 + 15 = 35
- Lower whisker limit = 10 - 15 = -5 (for data that cannot be negative, such as study hours, the whisker stops at the smallest actual observation, so it never goes below 0).
- Outliers can indicate data entry errors (fat-finger errors) or genuine extreme observations; they can reveal skewness or important data features and may warrant further investigation.
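The 1.5 × IQR rule can be traced step by step. This is a sketch in Python (the course uses R) with made-up study hours chosen so that, under the assumed "inclusive" quartile convention, Q1 = 10 and Q3 = 20 as in the lecture's example.

```python
from statistics import quantiles

# Hypothetical study hours, chosen so that Q1 = 10 and Q3 = 20.
study_hours = [4, 8, 10, 12, 15, 18, 20, 25, 50]

q1, _, q3 = quantiles(study_hours, n=4, method="inclusive")
iqr = q3 - q1

# Farthest points the whiskers are allowed to reach.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Anything beyond the limits is flagged as an outlier.
outliers = [x for x in study_hours if x < lower_limit or x > upper_limit]

# The drawn whiskers stop at the most extreme observations inside the limits.
inside = [x for x in study_hours if lower_limit <= x <= upper_limit]
whisker_low, whisker_high = min(inside), max(inside)

print(f"limits: [{lower_limit}, {upper_limit}]")
print(f"whiskers drawn at {whisker_low} and {whisker_high}; outliers: {outliers}")
```

The limits are -5 and 35, but the whiskers are drawn at 4 and 25 because those are the most extreme actual observations inside the limits; 50 is flagged as an outlier.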
Robustness and choosing summary statistics
- Robust statistics are less affected by outliers or extreme values:
- Median and IQR are relatively robust.
- Mean and standard deviation can be heavily influenced by extreme values.
- If data are skewed or contain major outliers, report the median and IQR for center and spread instead of the mean and standard deviation.
- If data are symmetric with no major outliers, mean and standard deviation can be informative.
- Illustrative exercise: replacing the largest or smallest value with extremely large numbers (e.g., 10,000,000) tends to affect the mean and SD a lot, but leaves the median and IQR relatively stable.
- Practical takeaway: use median and IQR for skewed distributions; use mean and SD for symmetric, mound-shaped distributions.
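The robustness exercise can be tried directly. This sketch is in Python (the course uses R); the data are made up, and `summarize` is a hypothetical helper that assumes the "inclusive" quartile convention.

```python
from statistics import mean, median, quantiles, stdev

def summarize(data):
    """Return mean, SD, median, and IQR for a list of numbers."""
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    return {
        "mean": mean(data),
        "sd": stdev(data),
        "median": median(data),
        "iqr": q3 - q1,
    }

# Hypothetical sample (made up for illustration).
original = [2, 4, 5, 7, 8, 10, 12, 15, 30]

# Replace the largest value with an extreme one, as in the lecture's exercise.
contaminated = sorted(original)[:-1] + [10_000_000]

before = summarize(original)
after = summarize(contaminated)

for name in ("mean", "sd", "median", "iqr"):
    print(f"{name:>6}: before = {before[name]:.2f}, after = {after[name]:.2f}")
```

In this sample the median and IQR are unchanged by the replacement, while the mean and SD jump by several orders of magnitude.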
Shape of distributions: modality, skewness, and outliers
- Modality describes the number of peaks in the distribution's shape:
- Unimodal: one clear peak.
- Bimodal: two distinct peaks.
- Multimodal: more than two peaks.
- Uniform: relatively flat with no pronounced peak.
- Visual description trick: imagine a smooth curve overlaying the histogram to judge modality (the “limp spaghetti” analogy).
- Skewness describes tail direction:
- Right-skewed (positive skew): long tail to the right.
- Left-skewed (negative skew): long tail to the left.
- Symmetric: roughly balanced tails.
- Outliers are extreme observations that fall far from the bulk of the data; the box plot's 1.5 × IQR rule helps identify them.
- Example thoughts: extracurricular activities hours may form a right-skewed distribution with possible outliers at high hours; exam scores are often expected to be symmetric but can skew depending on exam difficulty.
- Exercise prompts in lecture include predicting shape characteristics for: piercings, note-taking time vs. Facebook, etc., to illustrate how skewness and modality can vary by variable.
Practical viewing and interpretation of shapes
- When describing a distribution, report:
- Modality (unimodal, bimodal, multimodal, or uniform).
- Skewness (right, left, symmetric).
- Presence of outliers.
- In data analysis or homework, you may be asked to describe the histogram shape and to identify outliers or the effect of bin width on histogram shape.
- Example: for a histogram of hours spent on extracurricular activities, a bin width of 10 hours may strike a good balance between over-smoothing and too much noise; alternative bin widths (5 or 2 hours) may reveal or obscure shape features.
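The effect of bin width can be seen with a text-only histogram. This sketch is in Python (the course uses R); the hours are made up, and `bin_counts` is a hypothetical helper, not a library function.

```python
from collections import Counter

def bin_counts(values, width):
    """Count observations in left-closed bins [start, start + width)."""
    counts = Counter((v // width) * width for v in values)
    return dict(sorted(counts.items()))

# Hypothetical weekly extracurricular hours (made up for illustration).
hours = [1, 2, 3, 4, 6, 7, 8, 11, 12, 13, 14, 18, 22, 27, 35, 48, 62]

for width in (10, 5, 2):
    print(f"bin width = {width}")
    # Empty bins are simply skipped in this text display.
    for start, count in bin_counts(hours, width).items():
        print(f"  [{start:>2}, {start + width:>2}) {'*' * count}")
```

With a width of 10 the right skew is easy to see; with a width of 2 most bins hold zero or one observation and the overall shape is harder to read.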
Histograms versus bar plots and other charts
- Histogram: used for a numerical (continuous) variable; shows data density via bar heights representing counts or densities within bins.
- Bar plot: used for categorical data; shows frequencies or proportions for categories (e.g., died vs survived by adult vs child in Titanic example).
- Relative frequency bar plot: displays proportions rather than raw counts.
- Stacked bar plot vs side-by-side bar plot vs standardized stacked bar plot: different ways to compare categories within groups; useful for visualizing proportions across categories.
- Pie charts: generally discouraged in statistics unless there are only a few categories; less precise for comparing small shares.
- Titanic example illustrates using a contingency table (two categorical variables: age category and survival outcome) and deriving row proportions (e.g., proportion of adults who died vs. survived; proportion of children who died vs. survived).
Contingency tables and categorical data analysis
- A contingency table summarizes two categorical variables (e.g., age group: adult vs child; outcome: died vs survived).
- Bar plots can display absolute counts or percentages for these categories.
- Relative frequencies (percentages) help interpret proportions across groups.
- Row proportions help compare the likelihood of an outcome within each row category (e.g., proportion of adults who died vs. children who died).
- In Titanic example, observed patterns suggested differences in survival rates by age category, which motivates comparing proportions across rows.
- Additional plots include stacked and standardized stacked bar plots to emphasize group-wise proportions.
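Row proportions divide each cell by its row total. The sketch below is in Python (the course uses R) and uses placeholder counts that are NOT the real Titanic figures; `row_proportions` is a hypothetical helper written for this illustration.

```python
# Placeholder counts for illustration only (not the real Titanic data).
table = {
    "adult": {"died": 60, "survived": 40},
    "child": {"died": 10, "survived": 15},
}

def row_proportions(table):
    """Divide each cell by its row total so every row sums to 1."""
    result = {}
    for row, counts in table.items():
        total = sum(counts.values())
        result[row] = {outcome: n / total for outcome, n in counts.items()}
    return result

props = row_proportions(table)
for row, outcomes in props.items():
    cells = ", ".join(f"{k}: {v:.0%}" for k, v in outcomes.items())
    print(f"{row:>5} -> {cells}")
```

Comparing the "died" proportion across rows (60% of adults versus 40% of children in these placeholder counts) is the comparison a standardized stacked bar plot shows visually.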
Practical use cases and homework relevance
- You will be asked to create histograms in R and describe the resulting shape (modality, skewness, outliers).
- You will compute and interpret measures of center and spread (mean, median, and IQR) and examine the relationship between the mean, the median, and skewness.
- You will calculate variability measures: variance, standard deviation, and IQR.
- You will identify outliers via the 1.5 × IQR rule and discuss the importance of outliers for data interpretation and potential data entry errors.
- You will learn to interpret box plots and explain how whiskers are determined, and how to identify outliers.
- You will learn to decide whether to use robust statistics (median, IQR) versus non-robust statistics (mean, SD) depending on data skewness or presence of outliers.
- You will encounter questions about how introducing extreme values affects different statistics, reinforcing the concept of robustness.
Summary tips for choosing statistics in practice
- If the distribution is skewed or has outliers, use the median to describe center and the IQR to describe spread.
- If the distribution is symmetric and roughly bell-shaped with no major outliers, the mean and standard deviation are reasonable descriptive statistics.
- Always consider the shape and potential outliers before selecting summary measures.
Quick reference formulas and concepts to remember for exams/homework
- Sample mean: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
- Population mean: $\mu$ (parameter)
- Relationship between statistics and parameters:
- Population parameter: a fixed, unknown value (e.g., $\mu$ )
- Sample statistic: an estimate based on data (e.g., $\bar{x}$ as an estimate of $\mu$ )
- Variance and standard deviation:
- $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$
- $s = \sqrt{s^2}$
- Interquartile range:
- $IQR = Q_3 - Q_1$
- Box plot whisker limits (outlier rule; whiskers stop at the most extreme observations inside these limits):
- Upper limit: $Q_3 + 1.5 \cdot IQR$
- Lower limit: $Q_1 - 1.5 \cdot IQR$
- Percentiles and quartiles: Q1 = 25th percentile, Median = 50th percentile, Q3 = 75th percentile; IQR covers the middle 50%
- Modality and skewness: unimodal vs bimodal vs multimodal; right-skewed vs left-skewed vs symmetric
- Empirical rule (normal-like data): approximately
- 68% within $\bar{x} \pm s$
- 95% within $\bar{x} \pm 2s$
- 99.7% within $\bar{x} \pm 3s$
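The empirical rule can be checked against simulated normal data. This is a sketch in Python (the course uses R); the mean of 100 and SD of 15 are arbitrary choices for the simulation, and `share_within` is a hypothetical helper.

```python
import random
from statistics import mean, stdev

rng = random.Random(0)  # fixed seed so the run is repeatable

# Simulated normal data; the mean (100) and SD (15) are arbitrary.
data = [rng.gauss(100, 15) for _ in range(20_000)]

x_bar = mean(data)
s = stdev(data)

def share_within(k):
    """Fraction of observations within k standard deviations of the mean."""
    return sum(abs(x - x_bar) <= k * s for x in data) / len(data)

for k in (1, 2, 3):
    print(f"within {k} SD: {share_within(k):.3f}")
```

The three printed shares should land close to 0.68, 0.95, and 0.997. For skewed or heavy-tailed data the shares can differ noticeably, which is why the rule is stated for normal-like data.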
R, labs, and additional resources (as discussed in lecture)
- Two R labs exist on openintro.org under the OpenIntro book’s Labs section; additional videos (older) are available to help with R usage.
- Homework policy emphasizes that you must show both the R command and the R output for full credit.
- Wednesday session will cover how to run R commands and interpret outputs; you will learn practical steps for generating histograms, computing summary statistics, and interpreting results with R.
- Practical tip: have R ready before Wednesday to minimize the learning curve during the live session.
Real-world relevance and ethical notes
- The lecture connects statistical methods to real-world data analysis (survey data, public datasets, large-scale population data).
- It emphasizes critical interpretation: correlations do not imply causation; beware of lurking variables (e.g., poverty) that can drive observed associations.
- The discussion also touches on misinterpretation risks in data (e.g., oversimplifying what a center statistic tells you about a population).
Final practical reminders from the lecture
- Check your equipment and ensure you can hear both in-class and online participants; the instructor recaps the importance of ensuring access to audio/video during online learning.
- If you’re watching a recording, ensure you review the key points: scatter plots, dot plots, distributions, center/spread, box plots, IQR, outliers, and robustness concepts.
- Be prepared to discuss and apply these concepts to homework problems, including creating histograms in R and computing the associated statistics and interpretation.