C-N M201 Lecture Notes (Part 2) — Comprehensive Summary

Random Variables: Classification and Basics

  • Random variable (rv) concepts and classifications are extended to dimension n:
    • If dim(A) = 1, the rv is quantitative; if dim(A) > 1, the rv is qualitative (e.g., color).
    • Quantitative rvs can have ranges that are intervals (continuous) or finite/countable sets (discrete).
  • In these notes, focus is on quantitative rvs.
  • Examples of variable types:
    • Diastolic Blood Pressure: quantitative, continuous.
    • College Class: quantitative, discrete.
  • Two approaches to “sin” as an rv example:
    • Discrete case: X ∈ {0, 1} with X = 0 (no sin) or X = 1 (sin).
    • Continuous case: X > 0 representing magnitude of sin.
  • Related concept: the distribution of an rv describes the pattern in which its numerical values occur.
    • Questions about distribution include where values occur most often, how far apart occurrences are, and expected/average occurrences.
  • Famous continuous distributions are introduced later (Normal, t, Chi-square, F). See Notes 3 and 4.
  • Simple discrete examples illustrate rv behavior:
    • Vehicle direction example: Ω = {Right, Straight, Left}, X: Ω → ℤ with values in A = {−1, 0, 1} (symmetric shape).
    • Coin-toss example: Ω = {HH, HT, TH, TT}, X: Ω → ℕ with X = 1, 2, 3, 4 (symmetric, discrete).
  • Real-world data and distributions link to inference: distributions describe data patterns and underpin statistical inference.

Shape, Spread, and Location of Distributions

  • Key descriptive facets of a distribution:
    • Shape: the overall form (e.g., symmetric, skewed).
    • Spread: how dispersed the data are (range, variability).
    • Location: where the data center or balance point lies (mean, median, mode).
  • Terminology in notes: “random variable, data set and distribution” are used interchangeably in context; however, they aren’t exactly the same.
  • Measures of shape, spread, and location are called the distribution’s parameters (shape, spread, location).
  • Skewness terminology:
    • Right-skewed (positively skewed): tail to the right; longer right tail.
    • Left-skewed (negatively skewed): tail to the left; longer left tail.
  • Example data sets and interpretations:
    • Set C (inches): 30,32,33,33,33,34,35 → symmetric shape, range 5 in, location ~33 in.
    • Set A (inches): 47,52,43,56,45,49,50 → right-skewed, range 13 in, location near ~48 in.
    • Set B (inches): 61,67,68,70,69,72,69 → left-skewed, range 11 in, location ~69 in.
  • A graph can illustrate shape, spread, and location together (e.g., histograms, stemplots, boxplots).
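
The spread and location figures quoted for Sets C, A, and B can be checked with a short R sketch. The helper name `describe` is hypothetical (it is not from the notes); only base R functions are used.

```r
# Data sets from the notes (inches)
set_c <- c(30, 32, 33, 33, 33, 34, 35)
set_a <- c(47, 52, 43, 56, 45, 49, 50)
set_b <- c(61, 67, 68, 70, 69, 72, 69)

# Hypothetical helper: range width, mean, and median for one data set
describe <- function(x) {
  c(range = max(x) - min(x), mean = mean(x), median = median(x))
}

# One row per data set, one column per measure
result <- rbind(C = describe(set_c), A = describe(set_a), B = describe(set_b))
print(round(result, 2))
```

For Set B the mean (68) falls below the median (69), which agrees with the left-skewed label. With only seven values per set, the mean-versus-median comparison is a weak indicator of shape, so a stemplot or histogram is likely the better guide.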

Random Variable Types and Examples

  • Continuous vs discrete rv examples:
    • X: movement through an intersection; X ∈ {−1,0,1} (discrete, symmetric) with probabilities provided.
    • Two-coin toss: Ω = {HH, HT, TH, TT}, X ∈ {1,2,3,4} with equal probabilities (discrete, symmetric).
    • Waiting time for Apple gadget: X is quantitative, continuous; distribution described as right-skewed with infinite range in some cases (Exponential distribution).
  • Waiting-time example details:
    • X = time (in hours) a person waits in a queue; X > 0; average ~5 hours; could be as large as 25 hours (extreme).
    • This is an informal description of a continuous rv with right-skewed shape (the Exponential distribution is cited as the model).
  • Qualitative rvs and plotting:
    • Pareto or Pie charts are used for qualitative rvs; do not infer distribution shape from these graphs.
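
The waiting-time example can be explored numerically. The sketch below assumes X is exactly Exponential with mean 5 hours (so rate = 1/5); the notes cite the Exponential distribution but do not state the rate, so that value is an assumption drawn from the quoted average.

```r
# Assumption: waiting time X (hours) is Exponential with mean 5, so rate = 1/5
rate <- 1 / 5

# Probability of waiting longer than 25 hours: P(X > 25) = exp(-rate * 25)
p_extreme <- pexp(25, rate = rate, lower.tail = FALSE)

# Median of the model: log(2) / rate
model_median <- qexp(0.5, rate = rate)

print(c(p_extreme = p_extreme, median = model_median))
```

Under this assumption the median (about 3.47 hours) sits below the mean of 5 hours, as expected for a right-skewed distribution, and a 25-hour wait has probability exp(−5), roughly 0.0067.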

Famous Continuous Distributions (in this course)

  • Four famous continuous distributions highlighted for study: Normal, t, Chi-square, and F.
  • Exponential distribution is presented as another well-known continuous rv (example in the notes).
  • Relationship to inference: understanding these distributions aids in modeling and statistical inference.

Descriptive Statistics: Location, Spread, and Shape in Practice

  • Notation and key concepts:

    • Mean (location) μ or E(X): the balance point of the distribution.
    • Median (Q2): 50th percentile; midpoint of the ordered data.
    • Mode: most frequent value; may be nonunique or nonexistent.
    • Quartiles: Q1 (25th percentile), Q2 (median), Q3 (75th percentile).
    • Five-number summary: min, Q1, Q2, Q3, max.
    • Range: max − min.
    • Interquartile range (IQR): Q3 − Q1.
    • Variance: σ^2 = E[(X − μ)^2]; standard deviation: σ = √σ^2. For a sample:
      s^{2} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^{2}}{n-1}, \quad s = \sqrt{s^{2}}.

  • The mean is the first raw moment; the variance is the second central moment.

  • Robustness considerations:

    • Mean and mode may be sensitive to outliers; median is more robust as a measure of center.
  • Shape vs location vs spread interplay:

    • For right-skewed distributions, typically Median < Mean; for left-skewed, Mean < Median.
    • For symmetric distributions, mean ≈ median.
  • In practice, use R or other software to obtain: 5-number summary, mean, median, mode, quartiles, range, IQR, variance, and standard deviation.
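
As a check on the sample formulas above, the following sketch computes the variance, standard deviation, and IQR step by step and compares them with R's built-in functions. It reuses Set C from the notes; the variable names are illustrative only.

```r
x <- c(30, 32, 33, 33, 33, 34, 35)   # Set C from the notes (inches)

n <- length(x)
x_bar <- sum(x) / n                          # sample mean
s2_manual <- sum((x - x_bar)^2) / (n - 1)    # sample variance, divisor n - 1
s_manual <- sqrt(s2_manual)                  # sample standard deviation

# Quartiles with R's default quantile method, then IQR = Q3 - Q1
iqr_manual <- unname(quantile(x, 0.75) - quantile(x, 0.25))

# Each comparison should report TRUE
print(all.equal(s2_manual, var(x)))
print(all.equal(s_manual, sd(x)))
print(all.equal(iqr_manual, IQR(x)))
```

Note that R's `quantile()` uses one of several quartile conventions by default, so hand-computed quartiles from a textbook rule can differ slightly on small data sets.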

Practical Data Visualization and Software Tools (R)

  • Tools introduced to assess distribution features:
    • Stem-and-leaf plots (stem()), histograms, boxplots (boxplot()), Pareto/Pie charts for qualitative rv.
    • Five-number summary: min, Q1, Q2, Q3, max.
    • Spread measures include: range, IQR, variance, standard deviation.
  • Example data: commuting times of a class (25 observations, minutes):
    • Data: 23, 2, 10, 15, 2, 5, 5, 6, 15, 6, 10, 6, 2, 45, 35, 2, 5, 5, 25, 30, 5, 5, 5, 2, 6.
    • Computations (via R):
    • Mean: X̄ ≈ 11.1 minutes
    • Median Q2 = 6 minutes
    • Mode ≈ 5 minutes
    • Range = 45 − 2 = 43 minutes
    • IQR = Q3 − Q1 ≈ 10 minutes
    • Shape: right-skewed (long tail toward higher values).
  • R practice with ctime (7 observations) and timegordon example:
    • Small set: ctime = c(10, 13, 5, 9, 5, 7, 7)
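    • Summary and spread: s^2 ≈ 134.3 min^2, s ≈ 11.6 min. These values match the 25-observation commuting data above; for the 7-value vector shown here, s^2 ≈ 8.3 min^2 and s ≈ 2.9 min.
    • Expanded timegordon: start from timegordon = ctime, then append the values 6 and 6.5; summary(timegordon) then shows a right-skewed distribution with mean ≈ 7.6 and median = 7.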
  • Example with larger data: 2019 US Cities dataset (citygordon) used to illustrate qualitative vs quantitative variables and the use of Pareto charts and CST20 (cost of riots) as a quantitative variable with severe right skew.
  • Important R commands introduced:
    • ctime = c(…)
    • length(ctime)
    • summary(ctime)
    • sd(ctime); var(ctime)
    • stem(ctime); boxplot(ctime)
    • boxplot and 5-number summary via boxplot output
    • read.table("URL", header = TRUE) to load datasets
    • citygordon$CST20 to access a variable within a data frame
    • summary(citygordon$CST20), sd(citygordon$CST20) for location/spread
    • q() to exit R
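
The commands listed above can be strung together as follows. This is a minimal sketch using the commuting data and the ctime vector from the notes; the name `commute` and the mode calculation via `table()` are illustrative choices, not taken from the notes.

```r
# Commuting times from the notes (25 observations, minutes)
commute <- c(23, 2, 10, 15, 2, 5, 5, 6, 15, 6, 10, 6, 2, 45, 35,
             2, 5, 5, 25, 30, 5, 5, 5, 2, 6)

print(length(commute))        # number of observations
print(summary(commute))       # min, Q1, median, mean, Q3, max
print(IQR(commute))           # Q3 - Q1
print(diff(range(commute)))   # max - min
print(var(commute))           # sample variance
print(sd(commute))            # sample standard deviation

# Base R has no function for the statistical mode:
# tabulate the values and take the most frequent one
counts <- table(commute)
mode_value <- as.numeric(names(counts)[which.max(counts)])
print(mode_value)

# The 7-value vector from the notes and its extension with 6 and 6.5
ctime <- c(10, 13, 5, 9, 5, 7, 7)
timegordon <- c(ctime, 6, 6.5)
print(summary(timegordon))

stem(commute)   # text stemplot; the long upper tail shows the right skew
```

Running this reproduces the figures quoted in the notes for the commuting data: mean 11.08, median 6, IQR 10, range 43, and mode 5.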

Practical Examples and Interpretations

  • Exponential distribution example (Apple queue):
    • X = waiting time; shape is right-skewed; range is infinite; location around 5 hours; this is described as another famous distribution.
  • ACT score dataset (Carson-Newman):
    • Data show a slightly right-skewed shape; spread via range ≈ 22; location around the most common score ≈ 21; mean and median differ due to skewness.
  • Conceptual interpretation of the 5-number summary for commuting times:
    • Min ≈ 2, Q1 ≈ 5, Median ≈ 6, Q3 ≈ 15, Max ≈ 45
    • Range ≈ 43; IQR ≈ 10; indicates right-skew with a concentration of values near the lower end but with a long tail toward higher times.
  • Summary interpretation of X (theoretical example):
    • Symmetric discrete rv with X ∈ {−2, 0, 2} and P(−2)=0.05, P(0)=0.9, P(2)=0.05
    • Range = 4; mean and median both 0; distribution remains symmetric, but spread changes with sample size (SSE and s^2 differences shown for n=3 vs n=20).
    • Demonstrates that range can mislead about spread; variance captures spread more accurately as shown by computations: for data sets {−2,0,2} s^2 = 4 with s = 2; for a larger set with many zeros and two nonzero outliers, s ≈ 0.65 (s^2 ≈ 0.42).
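
The n = 3 versus n = 20 comparison can be reproduced in R. The notes do not list the larger data set, so the sketch assumes it consists of 18 zeros plus the values −2 and 2; that composition is an assumption, chosen because it gives the quoted s ≈ 0.65. The helper `spread` is hypothetical.

```r
small <- c(-2, 0, 2)
large <- c(-2, rep(0, 18), 2)   # assumed: 18 zeros plus the two nonzero values

# Hypothetical helper: sample size, range, sum of squared deviations, s^2, s
spread <- function(x) {
  c(n = length(x),
    range = max(x) - min(x),
    SSE = sum((x - mean(x))^2),
    s2 = var(x),
    s = sd(x))
}

print(round(rbind(small = spread(small), large = spread(large)), 3))
```

Both sets have range 4 and SSE 8, but dividing by n − 1 gives s^2 = 4 for the small set and s^2 = 8/19 ≈ 0.42 for the large one, which is why the variance separates the two while the range does not.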

Inferential Statistics: From Sample to Population

  • Inferential model overview:
    • Population parameters (shape, center, spread) are generally unknown.
    • Use random samples to estimate parameters (statistics) and infer about true population characteristics.
    • The progression: Population → (random sampling) → Sample → Statistics (estimators) → (inference) → Parameters, i.e., the truth about the population.
  • Population vs Sample terminology:
    • Population parameters denoted by Greek letters (e.g., μ, σ, π, ρ).
    • Sample statistics denoted by Roman letters (e.g., X̄, s, p̂, r̂).
  • Random sampling methods to obtain representative data:
    • Simple random sample (SRS): every possible sample of size n has an equal chance of being selected.
    • Stratified random sample: sample from each stratum to ensure representation.
    • Cluster sample: sample from clusters; often used in large populations.
    • Systematic sampling: select every k-th item after a random start.
    • Non-random sampling: voluntary response samples (VRS) are self-selected and can bias results.
  • Observational studies vs designed experiments:
    • Observational study: data reflect current situation without manipulating factors.
    • Designed experiment: researchers actively manipulate at least one factor to observe effects.
  • Practical inference with data:
    • Use random samples and statistical software (R in these notes) to estimate the shape, spread, and location of distributions.
    • Inferential conclusions depend on the sampling design and data quality.
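
The sampling designs above can be sketched with base R's `sample()`. Everything here is illustrative: the population of 100 labelled units, the two strata, the sample size, and the seed are all hypothetical and not taken from the notes.

```r
set.seed(201)            # arbitrary seed so the draws are reproducible
population <- 1:100      # hypothetical population of 100 labelled units
n <- 10

# Simple random sample: every subset of size n is equally likely
srs <- sample(population, size = n)

# Systematic sample: random start in 1..k, then every k-th unit
k <- length(population) / n
start <- sample(k, size = 1)
systematic <- population[seq(from = start, by = k, length.out = n)]

# Stratified sample: hypothetical strata of 50 units each, 5 drawn from each
strata <- rep(c("A", "B"), each = 50)
stratified <- unlist(lapply(split(population, strata), sample, size = 5))

print(srs)
print(systematic)
print(stratified)
```

A cluster sample would follow the same pattern as the stratified case, except that whole groups are selected at random and only the units in the selected groups are observed.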

Relationship Between Two Quantitative Variables: Correlation

  • Pearson correlation measures the strength and direction of a linear relationship between two quantitative variables.
    • The population correlation is denoted by ρ; the estimate computed from sample data is called r.
    • r ∈ [−1, 1], where:
    • r ≈ 1 indicates a strong positive linear relationship,
    • r ≈ −1 indicates a strong negative linear relationship,
    • r ≈ 0 indicates little to no linear relationship.
  • Examples discussed:
    • Municipal violence vs percentage of Hillary Clinton voters: a positive linear relationship with r ≈ 0.64 reported in one example.
    • Transformations (e.g., log CST20) can linearize highly skewed variables to improve linear correlation estimates.
  • Practical notes:
    • Correlation does not imply causation. Because r measures only linear association, two variables with a strong nonlinear relationship can still have r near 0.
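
The points in this section can be illustrated with a small sketch. All data below are hypothetical (they are not the city data from the notes); the sketch computes r from its definition, shows a perfect nonlinear relationship with r = 0, and shows how a log transform can straighten a curved relationship.

```r
# Hypothetical data, not from the notes
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)

# Pearson r computed directly from its definition
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  (sqrt(sum((x - mean(x))^2)) * sqrt(sum((y - mean(y))^2)))
print(all.equal(r_manual, cor(x, y)))   # agrees with the built-in cor()

# A perfect but nonlinear relationship can give r = 0
u <- c(-2, -1, 0, 1, 2)
v <- u^2
print(cor(u, v))

# A log transform can straighten an exponential-type relationship
w <- exp(x)
r_raw <- cor(x, w)        # below 1: the relationship is curved
r_log <- cor(x, log(w))   # 1 up to rounding: log(w) is linear in x
print(c(raw = r_raw, logged = r_log))
```

The same idea lies behind transforming a severely right-skewed variable before computing r: the transform does not create a relationship, it only makes an existing one closer to linear.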

Worked Data Set Illustrations with R and Practical Notes

  • City data set exercises (CST20, LDR, CST20 vs LDR) show how skewness affects correlation and how transformations can improve linear interpretation.
  • The five-number summary and boxplots (via R) help visualize shape, spread, and location for quantitative variables.
  • Pareto charts and Pie charts are recommended for qualitative rv to summarize category frequencies; they are not used to infer distribution shape.

Problem Sets: What to Do and How to Think About Them

  • B problems (B1–B12): classification and description tasks for various rvs (Birthplace, Letter grade, ACT Score, Weight, Hair Color, Class Rank, Number of Siblings, Age); describe distributions and shape/spread/location for current class data; discuss robustness and appropriate measures.
  • B5: For a continuous rv with mean 5, explain why P(X = 5) = 0.
  • B6–B7: A practical binomial-type problem (Miriam’s baton spins) uses n = 15 trials; estimate the probability of making at least 10 catches; involves binomial modeling and estimating p from a sample of observed catches.
  • B8–B12: Numerical exercises with discrete rv X; compute shape, spread, and location; consider binomial examples, weather predictions, stadium capacity, etc.; apply 5-number summaries, range, IQR, and X̄/M (mean/median).
  • C problems (C1–C18): extensive hands-on data-analysis tasks using R with real data sets (winter temps, attendance, baseball hits, Coke weights, Old Faithful heights, cereal sugar rates, ACT distributions, city CST20/LDR, etc.).
    • Tasks include: loading data with read.table, producing stemplots, boxplots, IQR, s, s^2, and X̄/M (shape, spread, and location metrics) for each rv; interpret what the results say about the distribution and type of rv (e.g., symmetric, right-skewed, left-skewed, binomial).
  • D problems (D1–D6): deeper discussion of sampling and relationships, including identifying study types (observational vs designed experiments), Simpson’s Paradox, and correlation findings from Shelton data.
  • Summary of key problem-solving ideas:
    • Classify rvs by type (qualitative vs quantitative; discrete vs continuous).
    • Use shape, spread, and location to describe distributions; prefer robust measures (median, IQR) when outliers are present.
    • Use appropriate graphs (stem plots, boxplots, histograms) to visually assess distributions.
    • Use R for data analysis: loading data, computing summary statistics, and visualizing distributions.
    • Recognize when to apply transformations (e.g., logarithms) to linearize relationships for correlation analyses.
    • Distinguish between sampling methods and their implications for inference; consider potential biases.
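
For binomial-type problems such as B6–B7 (n = 15 trials, probability of at least 10 successes), the calculation might look like the sketch below. The value `p_hat = 0.7` is purely hypothetical; in the actual problem p is estimated from the observed catches, which are not given in this summary.

```r
n <- 15
p_hat <- 0.7   # hypothetical estimate of the catch probability; not from the notes

# P(X >= 10) = 1 - P(X <= 9), using the upper tail of the binomial cdf
p_at_least_10 <- pbinom(9, size = n, prob = p_hat, lower.tail = FALSE)

# Same probability by summing the binomial pmf over 10, 11, ..., 15
p_check <- sum(dbinom(10:15, size = n, prob = p_hat))

print(c(upper_tail = p_at_least_10, pmf_sum = p_check))
```

The upper-tail call uses 9 rather than 10 because `pbinom(q, ..., lower.tail = FALSE)` returns P(X > q), a common source of off-by-one errors.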

Quick Formulas and Key Relationships (LaTeX)

  • Mean (location): \mu = E(X) for the population, estimated from a sample by \bar{X}
  • Expected value for discrete rv: E(X) = \sum_{x} x \, P(X=x)
  • Variance and standard deviation:
    • s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}
    • s = \sqrt{s^2}
  • Range and IQR:
    • \text{Range} = \max(x_i) - \min(x_i)
    • \text{IQR} = Q_3 - Q_1
  • Five-number summary: \text{min}, Q_1, Q_2, Q_3, \text{max}
  • Quartiles: Q_1 (25th percentile), Q_2 (median), Q_3 (75th percentile)
  • Pearson correlation (between two quantitative variables X and Y):
    r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2} \sqrt{\sum (y_i - \bar{y})^2}}
  • Exponential distribution (brief description): for rate \lambda > 0:
    • Density: f(x) = \lambda e^{-\lambda x}, \quad x \ge 0
    • Mean: E(X) = \frac{1}{\lambda}
    • Variance: \mathrm{Var}(X) = \frac{1}{\lambda^2}
  • Note on skewness and estimation: in right-skewed data, typically \text{Median} < \text{Mean}; in symmetric data, Mean ≈ Median.
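
The discrete expected-value formula can be applied to the two small discrete examples from the notes. This is a sketch using the probabilities stated there (equal probabilities for the two-coin rv; 0.05, 0.9, 0.05 for the symmetric rv); the variance line uses the population definition E[(X − μ)^2], not the sample formula with n − 1.

```r
# Two-coin example from the notes: X in {1, 2, 3, 4}, equal probabilities
x <- 1:4
p <- rep(1 / 4, 4)
ex <- sum(x * p)                  # E(X) = 2.5
var_x <- sum((x - ex)^2 * p)      # Var(X) = 1.25

# Symmetric example from the notes: P(-2) = 0.05, P(0) = 0.9, P(2) = 0.05
y <- c(-2, 0, 2)
q <- c(0.05, 0.9, 0.05)
ey <- sum(y * q)                  # E(Y) = 0
var_y <- sum((y - ey)^2 * q)      # Var(Y) = 0.4

print(c(EX = ex, VarX = var_x, EY = ey, VarY = var_y))
```

The population variance of 0.4 for the symmetric rv is close to the sample value s^2 ≈ 0.42 reported for the n = 20 data set, which is what one would hope for when the sample reflects the stated probabilities.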

Practical Takeaways for Exam Preparation

  • Be able to classify an rv as quantitative vs qualitative and discrete vs continuous.
  • Describe a distribution using shape, spread (range, IQR, variance, standard deviation) and location (mean, median, mode, quartiles).
  • Use the 5-number summary to quickly assess distribution features and skewness.
  • Recognize when to use robust measures (median, IQR) vs mean/variance depending on outliers.
  • Interpret and compute basic statistics from data sets; understand how to read and summarize R outputs (stem plot, boxplot, summary, and standard deviation).
  • Understand the difference between sampling methods and their impact on inference; distinguish observational studies from designed experiments.
  • Apply correlation appropriately, including considering transformations to linearize relationships when needed.
  • Practice with both theoretical rv distributions (e.g., binomial, exponential) and real data sets to estimate shape, spread, and location; be able to interpret what these tell you about the underlying process and its practical implications.