Biostatistics 3A Lecture Notes: Describing Numerical Data

Chapter 2: Describing Numerical Data

  • Setup and context from the lecture

    • Technical hiccups in live class (microphone off, online audience audio issues) were acknowledged; the instructor emphasizes that such “little things” are part of learning and why attending class is valuable.
    • Deadlines and logistics mentioned:
      • Quiz 1 and Homework 1 for Lecture 1 are due tonight, with Homework 1 due at 11:00 (not midnight).
      • Quiz 2 and Homework 2 have been assigned; the instructor aims to finish the 70-slide lecture today.
      • Quiz due date: one week from today.
      • Wednesday: the instructor will review the homework and give an introduction to using R. Two R labs are available on openintro.org under the book’s Labs section, along with two older videos that help with R.
      • The two Homework 2 problems that involve R get an extra two-day grace period.
      • Students should log into and use R before Wednesday; the OpenIntro R labs offer additional practice.
      • Reminder about the rule: to pass Homework, you must show both the R command and the R output.
  • Core topic of Lecture 2: examining and summarizing numerical data

    • Goal: describe a large dataset by summarizing with statistics (two main ideas).
    • Important caveat: relationships in scatter plots may imply association but not causation; a third variable may drive observed associations (e.g., poverty impacting life expectancy and fertility).
    • Visual tools discussed: scatter plots, dot plots, histograms, bar plots, and box plots to describe distributions and relationships.
    • A note about real-world data use: examples include surveys (e.g., votes for Trump vs. Kamala) and international data (fertility vs. life expectancy).
  • Scatter plots and association

    • A scatter plot has an x-axis and a y-axis representing two numerical variables.
    • Example discussed: number of children per woman (fertility) vs. life expectancy; dots represent countries; dot size may reflect population (larger dot = larger population).
    • Observed pattern: a downward trend suggesting that higher fertility is associated with lower life expectancy in the plotted countries, but this is not claimed as causal.
    • A warning about causality and a reminder that poverty acts as a third variable that influences both fertility and life expectancy.
    • The relationship can look roughly linear over part of the range, but even a clear pattern shows only association, not causation.
  • Dot plots and data description

    • Dot plots show individual observations; color/density can indicate concentration (e.g., darker colors indicate more observations in a region).
    • Used to describe distributions like GPAs in a class.
    • When describing a distribution, describe three aspects: center, shape, and spread.
    • The instructor notes occasional classroom tech issues but returns to describe the plotting concepts.
  • Center of a distribution: mean vs. median

    • Mean (average) is denoted by
      \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
    • The population mean is denoted by \mu (a population parameter); the sample mean is a statistic denoted by \bar{x} and is an estimate of the population mean \mu.
    • The lecture highlights a key distinction:
      • Population parameter: value of the whole population (often unknown, symbolized by Greek letters like \mu).
      • Sample statistic: a value calculated from a sample (like \bar{x}) used as an estimate of the population parameter.
    • The phrase "population parameters are god-only-knows" is used humorously to emphasize that parameters cannot be measured directly; only statistics from a sample can be observed.
    • If a distribution is skewed, the median can be a better measure of center than the mean (more robust to outliers).
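
A minimal sketch of the mean-versus-median point above. It is written in Python rather than the course's R, and the values are made up for illustration (they are not data from the lecture). One large value pulls the mean well above the median.

```python
from statistics import mean, median

# Hypothetical right-skewed sample (made-up values, not from the lecture).
values = [30, 32, 35, 38, 40, 45, 250]

x_bar = mean(values)   # sample mean: sum of the values divided by n
med = median(values)   # middle value of the sorted data

print(f"mean = {x_bar:.2f}, median = {med}")
```

Here the single value 250 drags the mean upward while the median stays at the middle observation, which is why the median is the safer description of center for skewed data.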
  • Measures of spread and variability

    • Variance measures roughly the average squared deviation from the mean (the sum of squared deviations is divided by n - 1 rather than n):
      s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2
    • Standard deviation is the square root of the variance:
      s = \sqrt{s^2}
    • Example given: sleep hours data with a mean of \bar{x} = 6.71 hours and a variance of s^2 = 4.11 hours^2, so
      s = \sqrt{4.11} \approx 2.03\text{ hours}
    • Practical point: the standard deviation has nice properties (described as “magical”). For roughly normal (bell-shaped) data, approximately
      • 68% of data lie within \bar{x} \pm s
      • 95% of data lie within \bar{x} \pm 2s
      • 99.7% of data lie within \bar{x} \pm 3s
    • Together these percentages form the empirical rule (68-95-99.7), which will be discussed further later; in practice, almost all data lie within three standard deviations of the mean.
    • Emphasis on robustness: mean and standard deviation are not robust to outliers or heavy skew; median and IQR are more robust in skewed distributions.
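
The variance and standard deviation formulas above can be sketched as follows. This is an illustrative Python version (the course itself uses R); the `hours` list is a made-up sample, not the lecture's sleep data. The last line only re-checks the lecture's arithmetic that a variance of 4.11 gives a standard deviation of about 2.03.

```python
from math import sqrt
from statistics import mean, stdev, variance

# Hypothetical sleep-hours sample (made-up values, not the lecture's data).
hours = [5, 6, 7, 7, 8, 9]

n = len(hours)
x_bar = mean(hours)

# Sample variance by the formula: sum of squared deviations divided by n - 1.
s2_manual = sum((x - x_bar) ** 2 for x in hours) / (n - 1)
s_manual = sqrt(s2_manual)

# The standard library functions use the same n - 1 definition.
s2 = variance(hours)
s = stdev(hours)

# Re-check of the lecture's arithmetic: s^2 = 4.11 hours^2.
s_lecture = sqrt(4.11)

print(f"manual: s^2 = {s2_manual:.2f}, s = {s_manual:.2f}")
print(f"library: s^2 = {s2:.2f}, s = {s:.2f}")
print(f"sqrt(4.11) = {s_lecture:.2f}")
```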
  • Median, percentiles, and quartiles

    • Median is the middle value when data are ordered; if there is an even number of observations, the median is the average of the two middle values.
    • Percentiles capture the value below which a given percent of observations fall; the 50th percentile is the median.
    • Quartiles:
      • Q1 is the 25th percentile (the first quartile).
      • Q3 is the 75th percentile (the third quartile).
    • The interquartile range (IQR) is the width of the interval that contains the middle 50% of the data: IQR = Q_3 - Q_1
    • Relationship between mean and median depends on skewness:
      • Right-skewed (positive skew): mean > median (big values pull the mean up).
      • Left-skewed (negative skew): mean < median (small values pull the mean down).
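
A short sketch of the median, quartiles, and IQR on a made-up sample with an even number of observations. It uses Python's standard library instead of the course's R. Software packages interpolate quartiles in slightly different ways, so the `method="inclusive"` choice here is an assumption, and hand-computed quartiles may differ a little.

```python
from statistics import median, quantiles

# Hypothetical sample with an even number of observations (already sorted).
data = [2, 4, 4, 5, 7, 8, 9, 10]

# Even n: the median is the average of the two middle values (5 and 7).
med = median(data)

# Quartiles as the 25th, 50th, and 75th percentiles.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print(f"median = {med}, Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
```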
  • Box plots: anatomy and outliers

    • A box plot shows the middle 50% of data in a box bounded by Q1 and Q3.
    • The median is shown as a line inside the box.
    • Whiskers extend to the most extreme data points that are not considered outliers. The 1.5 × IQR rule sets the farthest a whisker may reach:
      • Upper whisker limit = Q_3 + 1.5 \cdot IQR
      • Lower whisker limit = Q_1 - 1.5 \cdot IQR
    • Observations beyond these limits are outliers and are plotted as individual points.
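    • Example given: if Q1 = 10 and Q3 = 20, then IQR = 10 and 1.5 \cdot IQR = 15, so
      • Upper whisker limit = 20 + 15 = 35
      • Lower whisker limit = 10 - 15 = -5 (for data that cannot be negative, such as study hours, the whisker stops at the smallest observation, so it is effectively floored at 0).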
    • Outliers can indicate data entry errors (fat-finger errors) or genuine extreme observations; they can reveal skewness or important data features and may warrant further investigation.
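
The 1.5 × IQR rule can be sketched as below, in Python rather than the course's R. The quartiles Q1 = 10 and Q3 = 20 come from the lecture's example; the `hours` list and the helper name `whisker_limits` are hypothetical, introduced only for illustration.

```python
def whisker_limits(q1, q3, k=1.5):
    """Return the (lower, upper) limits given by the k * IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles from the lecture's example.
lower, upper = whisker_limits(10, 20)

# Hypothetical study-hours observations (made-up values).
hours = [0, 3, 10, 10, 12, 15, 18, 20, 20, 30, 48]

# Points beyond the limits are flagged as outliers.
outliers = [x for x in hours if x < lower or x > upper]

# Whiskers stop at the most extreme observations inside the limits.
inside = [x for x in hours if lower <= x <= upper]
whisker_low, whisker_high = min(inside), max(inside)

print(f"limits: ({lower}, {upper})")
print(f"whiskers drawn at {whisker_low} and {whisker_high}; outliers: {outliers}")
```

The lower limit is negative, but the lower whisker is drawn at 0 because that is the smallest observation in this sample.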
  • Robustness and choosing summary statistics

    • Robust statistics are less affected by outliers or extreme values:
      • Median and IQR are relatively robust.
      • Mean and standard deviation can be heavily influenced by extreme values.
    • If data are skewed or contain major outliers, report the median and IQR for center and spread instead of the mean and standard deviation.
    • If data are symmetric with no major outliers, mean and standard deviation can be informative.
    • Illustrative exercise: replacing the largest or smallest value with an extremely large number (e.g., 10,000,000) changes the mean and SD a lot but leaves the median and IQR relatively stable.
    • Practical takeaway: use median and IQR for skewed distributions; use mean and SD for symmetric, mound-shaped distributions.
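
The robustness exercise above can be tried directly. This is a sketch in Python (not the course's R) on a made-up sample; the helper `summary` is hypothetical, and the quartile method is an assumption, as conventions vary between software packages.

```python
from statistics import mean, median, quantiles, stdev

def summary(values):
    """Return the mean, SD, median, and IQR of a sample."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return {
        "mean": mean(values),
        "sd": stdev(values),
        "median": median(values),
        "iqr": q3 - q1,
    }

# Hypothetical sample (made-up values).
original = [4, 5, 6, 6, 7, 8, 9]

# Replace the largest value with an extreme one, as in the lecture's exercise.
contaminated = original[:-1] + [10_000_000]

before = summary(original)
after = summary(contaminated)

for name in ("mean", "sd", "median", "iqr"):
    print(f"{name}: before = {before[name]:.2f}, after = {after[name]:.2f}")
```

In this sample the median and IQR do not move at all, while the mean and SD change by several orders of magnitude.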
  • Shape of distributions: modality, skewness, and outliers

    • Modality describes the number of peaks in the distribution's shape:
      • Unimodal: one clear peak.
      • Bimodal: two distinct peaks.
      • Multimodal: more than two peaks.
      • Uniform: relatively flat with no pronounced peak.
    • Visual description trick: imagine a smooth curve overlaying the histogram to judge modality (the “limp spaghetti” analogy).
    • Skewness describes tail direction:
      • Right-skewed (positive skew): long tail to the right.
      • Left-skewed (negative skew): long tail to the left.
      • Symmetric: roughly balanced tails.
    • Outliers are extreme observations that lie far from the bulk of the data, beyond the end of a tail; the box plot whiskers help identify them.
    • Example thoughts: extracurricular activities hours may form a right-skewed distribution with possible outliers at high hours; exam scores are often expected to be symmetric but can skew depending on exam difficulty.
    • Exercise prompts in lecture include predicting shape characteristics for: piercings, note-taking time vs. Facebook, etc., to illustrate how skewness and modality can vary by variable.
  • Practical viewing and interpretation of shapes

    • When describing a distribution, report:
      • Modality (unimodal, bimodal, multimodal, or uniform).
      • Skewness (right, left, symmetric).
      • Presence of outliers.
    • In data analysis or homework, you may be asked to describe the histogram shape and to identify outliers or the effect of bin width on histogram shape.
    • Example: for a histogram of hours spent on extracurricular activities, a bin width of 10 hours may strike a good balance between over-smoothing and too much noise; narrower bins (5 or 2 hours) may reveal more detail or bury the overall shape in noise.
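
To see how bin width changes what a histogram shows, the sketch below counts the same observations with widths of 10, 5, and 2. It is a Python illustration (the course uses R), the `hours` values are made up, and the helper `bin_counts` is hypothetical. Bins here include their left edge and exclude their right edge, which is an assumption; plotting software may use a different convention.

```python
from collections import Counter

def bin_counts(values, width):
    """Count observations in bins [k * width, (k + 1) * width)."""
    counts = Counter((v // width) * width for v in values)
    return dict(sorted(counts.items()))

# Hypothetical weekly extracurricular hours (made-up, right-skewed values).
hours = [0, 1, 2, 2, 3, 4, 5, 5, 6, 8, 9, 11, 12, 14, 18, 23, 31, 47]

# Each dictionary maps a bin's left edge to the number of observations in it.
for width in (10, 5, 2):
    print(f"width {width}: {bin_counts(hours, width)}")
```

Wider bins give a smoother picture with fewer bars; narrower bins spread the same observations over many more bars, many of them holding a single value.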
  • Histograms versus bar plots and other charts

    • Histogram: used for a numerical (continuous) variable; shows data density via bar heights representing counts or densities within bins.
    • Bar plot: used for categorical data; shows frequencies or proportions for categories (e.g., died vs. survived for adults vs. children in the Titanic example).
    • Relative frequency bar plot: displays proportions rather than raw counts.
    • Stacked bar plot vs side-by-side bar plot vs standardized stacked bar plot: different ways to compare categories within groups; useful for visualizing proportions across categories.
    • Pie charts: generally discouraged in statistics unless there are only a few categories; less precise for comparing small shares.
    • Titanic example illustrates using a contingency table (two categorical variables: age category and survival outcome) and deriving row proportions (e.g., proportion of adults who died vs. survived; proportion of children who died vs. survived).
  • Contingency tables and categorical data analysis

    • A contingency table summarizes two categorical variables (e.g., age group: adult vs child; outcome: died vs survived).
    • Bar plots can display absolute counts or percentages for these categories.
    • Relative frequencies (percentages) help interpret proportions across groups.
    • Row proportions help compare the likelihood of an outcome within each row category (e.g., proportion of adults who died vs. children who died).
    • In the Titanic example, observed patterns suggested differences in survival rates by age category, which motivates comparing proportions across rows.
    • Additional plots include stacked and standardized stacked bar plots to emphasize group-wise proportions.
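
Row proportions can be sketched as follows, in Python rather than the course's R. The counts are made up for illustration and are not the actual Titanic figures; the helper `row_proportions` is hypothetical.

```python
# Hypothetical contingency table: rows are age groups, columns are outcomes.
# The counts are invented for illustration, not the real Titanic data.
table = {
    "adult": {"died": 140, "survived": 60},
    "child": {"died": 10, "survived": 15},
}

def row_proportions(tbl):
    """Divide each cell by its row total so every row sums to 1."""
    result = {}
    for row, counts in tbl.items():
        total = sum(counts.values())
        result[row] = {outcome: n / total for outcome, n in counts.items()}
    return result

props = row_proportions(table)

for row, outcomes in props.items():
    print(row, {k: round(v, 2) for k, v in outcomes.items()})
```

Comparing rows of proportions rather than raw counts puts groups of very different sizes on the same scale, which is the idea behind a standardized stacked bar plot.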
  • Practical use cases and homework relevance

    • You will be asked to create histograms in R and describe the resulting shape (modality, skewness, outliers).
    • You will compute and interpret measures of center and spread (mean, median, and IQR) and examine the relationship between mean, median, and skewness.
    • You will calculate variability measures: variance, standard deviation, and IQR.
    • You will identify outliers via the 1.5 × IQR rule and discuss the importance of outliers for data interpretation and potential data entry errors.
    • You will learn to interpret box plots and explain how whiskers are determined, and how to identify outliers.
    • You will learn to decide whether to use robust statistics (median, IQR) versus non-robust statistics (mean, SD) depending on data skewness or presence of outliers.
    • You will encounter questions about how introducing extreme values affects different statistics, reinforcing the concept of robustness.
  • Summary tips for choosing statistics in practice

    • If the distribution is skewed or has outliers, use the median to describe center and the IQR to describe spread.
    • If the distribution is symmetric and roughly bell-shaped with no major outliers, the mean and standard deviation are reasonable descriptive statistics.
    • Always consider the shape and potential outliers before selecting summary measures.
  • Quick reference formulas and concepts to remember for exams/homework

    • Sample mean: \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
    • Population mean: \mu (parameter)
    • Relationship between statistics and parameters:
      • Population parameter: a fixed, unknown value (e.g., \mu)
      • Sample statistic: an estimate based on data (e.g., \bar{x} as an estimate of \mu)
    • Variance and standard deviation:
      • s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2
      • s = \sqrt{s^2}
    • Interquartile range:
      • IQR = Q_3 - Q_1
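    • Box plot whisker limits (outlier rule):
      • Upper limit: Q_3 + 1.5 \cdot IQR
      • Lower limit: Q_1 - 1.5 \cdot IQR
      • Whiskers stop at the most extreme observations inside these limits; points beyond them are outliers.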
    • Percentiles and quartiles: Q1 = 25th percentile, Median = 50th percentile, Q3 = 75th percentile; IQR covers the middle 50%
    • Modality and skewness: unimodal vs bimodal vs multimodal; right-skewed vs left-skewed vs symmetric
    • Empirical rule (normal-like data): approximately
      • 68% within \bar{x} \pm s
      • 95% within \bar{x} \pm 2s
      • 99.7% within \bar{x} \pm 3s
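
The empirical rule can be checked on simulated data. The sketch below is in Python (the course uses R) and draws a bell-shaped sample from a random number generator; the mean of 100, standard deviation of 15, sample size, and seed are arbitrary choices for illustration, and the helper `share_within` is hypothetical.

```python
import random
from statistics import mean, stdev

random.seed(1)  # fixed seed so the simulation is repeatable

# Simulated bell-shaped sample; the parameters 100 and 15 are arbitrary.
sample = [random.gauss(100, 15) for _ in range(10_000)]

x_bar = mean(sample)
s = stdev(sample)

def share_within(k):
    """Proportion of the sample within k standard deviations of the mean."""
    return sum(abs(x - x_bar) <= k * s for x in sample) / len(sample)

for k in (1, 2, 3):
    print(f"within {k} SD: {share_within(k):.3f}")
```

The printed shares should land close to 0.68, 0.95, and 0.997. The rule is an approximation for normal-like data, so strongly skewed samples can deviate from it noticeably.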
  • R, labs, and additional resources (as discussed in lecture)

    • Two R labs exist on openintro.org under the OpenIntro book’s Labs section; additional videos (older) are available to help with R usage.
    • Homework policy emphasizes that you must show both the R command and the R output for full credit.
    • Wednesday session will cover how to run R commands and interpret outputs; you will learn practical steps for generating histograms, computing summary statistics, and interpreting results with R.
    • Practical tip: have R ready before Wednesday to minimize the learning curve during the live session.
  • Real-world relevance and ethical notes

    • The lecture connects statistical methods to real-world data analysis (survey data, public datasets, large-scale population data).
    • It emphasizes critical interpretation: correlations do not imply causation; beware of lurking variables (e.g., poverty) that can drive observed associations.
    • The discussion also touches on misinterpretation risks in data (e.g., oversimplifying what a center statistic tells you about a population).
  • Final practical reminders from the lecture

    • Check your equipment and ensure you can hear both in-class and online participants; the instructor recaps the importance of ensuring access to audio/video during online learning.
    • If you’re watching a recording, ensure you review the key points: scatter plots, dot plots, distributions, center/spread, box plots, IQR, outliers, and robustness concepts.
    • Be prepared to discuss and apply these concepts to homework problems, including creating histograms in R and computing the associated statistics and interpretation.