C-N M201 Lecture Notes (Part 2) — Comprehensive Summary

Random Variables: Classification and Basics

Random variable (rv) concepts and classifications are extended to dimension n:
- If dim(A) = 1, the rv is quantitative; if dim(A) > 1, the rv is qualitative (e.g., color).
- Quantitative rvs can have ranges that are intervals (continuous) or finite/countable sets (discrete).
In these notes, focus is on quantitative rvs.
Examples of variable types:
- Diastolic Blood Pressure: quantitative, continuous.
- College Class: quantitative, discrete.
Two approaches to “sin” as an rv example:
- Discrete case: X ∈ {0, 1} with X = 0 (no sin) or X = 1 (sin).
- Continuous case: X > 0 representing magnitude of sin.
Affiliated concept: distribution of a rv describes its numerical occurrence pattern.
- Questions about distribution include where values occur most often, how far apart occurrences are, and expected/average occurrences.
Famous continuous distributions are introduced later (Normal, t, Chi-square, F). See Notes 3 and 4.
Simple discrete examples illustrate rv behavior:
- Vehicle direction example: Ω = {Right, Straight, Left}, X: Ω → ℝ with values in A = {−1, 0, 1} (symmetric shape).
- Coin-toss example: W = {HH, HT, TH, TT}, X: W → ℕ with X = 1,2,3,4 (symmetric, discrete).
Real-world data and distributions link to inference: distributions describe data patterns and underpin statistical inference.

Shape, Spread, and Location of Distributions

Key descriptive facets of a distribution:
- Shape: the overall form (e.g., symmetric, skewed).
- Spread: how dispersed the data are (range, variability).
- Location: where the data center or balance point lies (mean, median, mode).
Terminology in notes: “random variable, data set and distribution” are used interchangeably in context; however, they aren’t exactly the same.
Measures of shape, spread, and location are called the distribution’s parameters (shape, spread, location).
Skewness terminology:
- Right-skewed (positively skewed): tail to the right; longer right tail.
- Left-skewed (negatively skewed): tail to the left; longer left tail.
Example data sets and interpretations:
- Set C (inches): 30,32,33,33,33,34,35 → symmetric shape, range 5 in, location ~33 in.
- Set A (inches): 47,52,43,56,45,49,50 → right-skewed, range 13 in, location near ~48 in.
- Set B (inches): 61,67,68,70,69,72,69 → left-skewed, range 11 in, location ~69 in.
A graph can illustrate shape, spread, and location together (e.g., histograms, stemplots, boxplots).
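The three example sets above can be checked quickly in R. This is a minimal sketch: the vector names setC, setA, setB and the helper describe() are illustrative names, not taken from the notes.

```r
# Example sets from the notes (inches); vector names are illustrative
setC <- c(30, 32, 33, 33, 33, 34, 35)
setA <- c(47, 52, 43, 56, 45, 49, 50)
setB <- c(61, 67, 68, 70, 69, 72, 69)

# Hypothetical helper: one spread measure and two location measures
describe <- function(x) {
  c(range = max(x) - min(x), mean = mean(x), median = median(x))
}

# One column per data set, one row per measure
round(sapply(list(C = setC, A = setA, B = setB), describe), 2)
```

For Set B the mean (68) falls below the median (69), which is consistent with the left skew described above.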

Random Variable Types and Examples

Continuous vs discrete rv examples:
- X: movement through an intersection; X ∈ {−1,0,1} (discrete, symmetric) with probabilities provided.
- Two-coin toss: W = {HH, HT, TH, TT}, X ∈ {1,2,3,4} with equal probabilities (discrete, symmetric).
- Waiting time for Apple gadget: X is quantitative, continuous; distribution described as right-skewed with infinite range in some cases (Exponential distribution).
Waiting-time example details:
- X = time (in hours) a person waits in a queue; X > 0; average ~5 hours; could be as large as 25 hours (extreme).
- The notes treat this as a continuous rv with a right-skewed shape and cite the Exponential distribution as the model.
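If the waiting time is modeled as Exponential with mean 5 hours (an assumption: the notes give only the average and cite the distribution), the rate is 1/5 per hour, and R's built-in qexp() and pexp() give a feel for the right skew. The variable names below are illustrative.

```r
# Assumption: X ~ Exponential with mean 5 hours, so rate = 1/5
rate <- 1 / 5

mean_wait   <- 1 / rate                                   # theoretical mean (hours)
median_wait <- qexp(0.5, rate = rate)                     # theoretical median = 5 * log(2)
p_over_25   <- pexp(25, rate = rate, lower.tail = FALSE)  # P(X > 25) = exp(-5)

c(mean = mean_wait, median = median_wait, p_over_25 = p_over_25)
```

Under this assumption the median (about 3.47 hours) sits below the mean of 5 hours, and a wait longer than 25 hours has probability below 1%.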
Qualitative rvs and plotting:
- Pareto or Pie charts are used for qualitative rvs; these graphs should not be used to infer distribution shape.

Famous Continuous Distributions (in this course)

Four famous continuous distributions highlighted for study: Normal, t, Chi-square, and F.
Exponential distribution is presented as another well-known continuous rv (example in the notes).
Relationship to inference: understanding these distributions aids in modeling and statistical inference.

Descriptive Statistics: Location, Spread, and Shape in Practice

Notation and key concepts:
- Mean (location) μ or E(X): the balance point of the distribution.
- Median (Q2): 50th percentile; midpoint of the ordered data.
- Mode: most frequent value; may be nonunique or nonexistent.
- Quartiles: Q1 (25th percentile), Q2 (median), Q3 (75th percentile).
- Five-number summary: min, Q1, Q2, Q3, max.
- Range: max − min.
- Interquartile range (IQR): Q3 − Q1.
- Variance: σ^2 = E[(X − μ)^2]; standard deviation: σ = √(σ^2). For a sample:
  s^{2} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^{2}}{n-1},
  s = \sqrt{s^{2}}.
The mean is the first raw moment; the variance is the second central moment.
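The sample formula can be checked against R's built-in var() and sd(). A minimal sketch using the 7-value ctime vector from the R practice part of these notes; the intermediate names (xbar, sse, s2, s) are illustrative.

```r
ctime <- c(10, 13, 5, 9, 5, 7, 7)   # commuting times (minutes) from the notes

n    <- length(ctime)
xbar <- mean(ctime)
sse  <- sum((ctime - xbar)^2)        # sum of squared deviations from the mean
s2   <- sse / (n - 1)                # sample variance, divisor n - 1
s    <- sqrt(s2)                     # sample standard deviation

c(by_hand = s2, builtin = var(ctime))
c(by_hand = s,  builtin = sd(ctime))
```

The two columns agree because var() and sd() use the same n − 1 divisor.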
Robustness considerations:
- Mean and mode may be sensitive to outliers; median is more robust as a measure of center.
Shape vs location vs spread interplay:
- For right-skewed distributions, typically Median < Mean; for left-skewed, Mean < Median.
- For symmetric distributions, mean ≈ median.
In practice, use R or other software to obtain: 5-number summary, mean, median, mode, quartiles, range, IQR, variance, and standard deviation.

Practical Data Visualization and Software Tools (R)

Tools introduced to assess distribution features:
- Stem-and-leaf plots (stem()), histograms, boxplots (boxplot()), Pareto/Pie charts for qualitative rv.
- Five-number summary: min, Q1, Q2, Q3, max.
- Spread measures include: range, IQR, variance, standard deviation.
Example data: commuting times of a class (25 observations, minutes):
- Data: 23, 2, 10, 15, 2, 5, 5, 6, 15, 6, 10, 6, 2, 45, 35, 2, 5, 5, 25, 30, 5, 5, 5, 2, 6.
- Computations (via R):
- Mean: \bar{X} \approx 11.1\text{ min}
- Median Q2 = 6 minutes
- Mode ≈ 5 minutes
- Range = 45 − 2 = 43 minutes
- IQR = Q3 − Q1 ≈ 10 minutes
- Shape: right-skewed (long tail toward higher values).
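The figures listed above can be reproduced with a few base R calls. This is a sketch: the vector name commute is illustrative, and since base R has no built-in function for the statistical mode, the mode is read off a frequency table.

```r
# Commuting times (minutes), 25 observations from the notes
commute <- c(23, 2, 10, 15, 2, 5, 5, 6, 15, 6, 10, 6, 2,
             45, 35, 2, 5, 5, 25, 30, 5, 5, 5, 2, 6)

mean(commute)                      # location: mean
median(commute)                    # location: median (Q2)

freq <- table(commute)             # frequency of each distinct value
mode_value <- as.numeric(names(freq)[freq == max(freq)])  # most frequent value(s)
mode_value

diff(range(commute))               # spread: max - min
IQR(commute)                       # spread: Q3 - Q1 (R's default quantile rule)
fivenum(commute)                   # min, lower hinge, median, upper hinge, max
```

The mean (about 11.1) well above the median (6) is the numerical signature of the right skew noted above.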
R practice with ctime (7 observations) and timegordon example:
- Small set: ctime = c(10, 13, 5, 9, 5, 7, 7), which has mean 8, s^2 ≈ 8.3 min^2, and s ≈ 2.9 min.
- The spread values s^2 ≈ 134.3 min^2 and s ≈ 11.6 min belong to the 25-observation commuting data above, not to ctime.
- Expanded timegordon: timegordon = ctime, then 6 and 6.5 are appended; summary(timegordon) gives mean ≈ 7.6 and median ≈ 7, and the notes describe the distribution as right-skewed.
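The timegordon step can be sketched as follows. Using append() here is one reasonable reading of the description; c(timegordon, 6, 6.5) would give the same vector.

```r
ctime <- c(10, 13, 5, 9, 5, 7, 7)

timegordon <- ctime                              # start from the small set
timegordon <- append(timegordon, c(6, 6.5))      # add the two extra observations

length(timegordon)     # number of observations after appending
summary(timegordon)    # min, quartiles, median, mean, max
```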
Example with larger data: 2019 US Cities dataset (citygordon) used to illustrate qualitative vs quantitative variables and the use of Pareto charts and CST20 (cost of riots) as a quantitative variable with severe right skew.
Important R commands introduced:
- ctime = c(…)
- length(ctime)
- summary(ctime)
- sd(ctime); var(ctime)
- stem(ctime); boxplot(ctime)
- boxplot and 5-number summary via boxplot output
- read.table("URL", header = TRUE) to load datasets
- citygordon$CST20 to access a variable within a data frame
- summary(citygordon$CST20), sd(citygordon$CST20) for location/spread
- q() to exit R
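The data-loading commands can be tried without a network connection by reading from an inline string. In this sketch the data frame toycity and all of its values are invented purely for illustration; only the column name CST20 comes from the notes, and the real data set is loaded from a URL that is not reproduced here.

```r
# Hypothetical stand-in for a data file; every value below is invented
raw <- "CITY CST20
A 0.5
B 1.2
C 0.8
D 40.0
E 2.5"

# Same call shape as read.table("URL", header = TRUE), but reading from text
toycity <- read.table(text = raw, header = TRUE)

str(toycity)                  # CITY is read as text (qualitative), CST20 as numeric
summary(toycity$CST20)        # location: min, quartiles, median, mean, max
sd(toycity$CST20)             # spread
boxplot(toycity$CST20, horizontal = TRUE)   # the one large value plots as an outlier
```

With a single very large value, the mean is pulled far above the median, the same pattern the notes describe for the severely right-skewed CST20.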

Practical Examples and Interpretations

Exponential distribution example (Apple queue):
- X = waiting time; shape is right-skewed; range is infinite; location around 5 hours; this is described as another famous distribution.
ACT score dataset (Carson-Newman):
- Data show a slightly right-skewed shape; spread via range ≈ 22; location around the most common score ≈ 21; mean and median differ due to skewness.
Conceptual interpretation of the 5-number summary for commuting times:
- Min ≈ 2, Q1 ≈ 5, Median ≈ 6, Q3 ≈ 15, Max ≈ 45
- Range ≈ 43; IQR ≈ 10; indicates right-skew with a concentration of values near the lower end but with a long tail toward higher times.
Summary interpretation of X (theoretical example):
- Symmetric discrete rv with X ∈ {−2, 0, 2} and P(−2)=0.05, P(0)=0.9, P(2)=0.05
- Range = 4; mean and median both 0; distribution remains symmetric, but spread changes with sample size (SSE and s^2 differences shown for n=3 vs n=20).
- Demonstrates that range can mislead about spread; variance captures spread more accurately as shown by computations: for data sets {−2,0,2} s^2 = 4 with s = 2; for a larger set with many zeros and two nonzero outliers, s ≈ 0.65 (s^2 ≈ 0.42).
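The comparison can be reproduced in R. This sketch assumes the larger set consists of eighteen zeros plus one −2 and one 2 (n = 20), which matches the quoted s^2 ≈ 0.42; the names small, large, and spread() are illustrative.

```r
small <- c(-2, 0, 2)                 # n = 3
large <- c(-2, rep(0, 18), 2)        # n = 20 (assumed composition, see lead-in)

# Hypothetical helper returning three spread measures
spread <- function(x) c(range = diff(range(x)), s2 = var(x), s = sd(x))

round(rbind(small = spread(small), large = spread(large)), 2)
```

Both sets have range 4, yet the variance drops from 4 to about 0.42, because most of the larger set sits exactly at the mean.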

Inferential Statistics: From Sample to Population

Inferential model overview:
- Population parameters (shape, center, spread) are generally unknown.
- Use random samples to estimate parameters (statistics) and infer about true population characteristics.
- The progression: Population --Inference--> Parameters; Sample --Inference--> Statistics (estimators) --Inference--> Truth.
Population vs Sample terminology:
- Population parameters denoted by Greek letters (e.g., μ, σ, π, ρ).
- Sample statistics denoted by Roman letters (e.g., X̄, s, p̂, r̂).
Random sampling methods to obtain representative data:
- Simple random sample (SRS): every possible sample of the chosen size has an equal chance of being selected.
- Stratified random sample: sample from each stratum to ensure representation.
- Cluster sample: sample from clusters; often used in large populations.
- Systematic sampling: select every k-th item after a random start.
- Non-sampling bias: voluntary response samples (VRS) can bias results.
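The four random sampling designs can be sketched with base R's sample(). Everything about the population below is hypothetical: 100 units, two strata labelled "A" and "B", and ten clusters of ten units each.

```r
set.seed(1)                      # arbitrary seed so the sketch is repeatable
N <- 100
pop <- data.frame(id      = 1:N,
                  stratum = rep(c("A", "B"), each = N / 2),   # hypothetical strata
                  cluster = rep(1:10, each = N / 10))          # hypothetical clusters

# Simple random sample: 10 units, all subsets of size 10 equally likely
srs <- sample(pop$id, size = 10)

# Stratified: 5 units drawn at random within each stratum
strat <- unlist(lapply(split(pop$id, pop$stratum), sample, size = 5))

# Cluster: pick 2 clusters at random and keep every unit in them
chosen <- sample(unique(pop$cluster), size = 2)
clus   <- pop$id[pop$cluster %in% chosen]

# Systematic: random start in 1..k, then every k-th unit
k     <- 10
start <- sample(k, size = 1)
syst  <- seq(from = start, to = N, by = k)
```

Only the stratified design guarantees that both strata appear in the sample; the SRS may over- or under-represent one of them by chance.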
Observational studies vs designed experiments:
- Observational study: data reflect current situation without manipulating factors.
- Designed experiment: researchers actively manipulate at least one factor to observe effects.
Practical inference with data:
- Use random samples and statistical software (R in these notes) to estimate the shape, spread, and location of distributions.
- Inferential conclusions depend on the sampling design and data quality.

Relationship Between Two Quantitative Variables: Correlation

Pearson correlation measures the strength and direction of a linear relationship between two quantitative variables.
- The population correlation is denoted by ρ; the estimate computed from sample data is denoted by r.
- r ∈ [−1, 1], where:
- r ≈ 1 indicates a strong positive linear relationship,
- r ≈ −1 indicates a strong negative linear relationship,
- r ≈ 0 indicates little to no linear relationship.
Examples discussed:
- Municipal violence vs percentage of Hillary Clinton voters: a positive linear relationship with r ≈ 0.64 reported in one example.
- Transformations (e.g., log CST20) can linearize highly skewed variables to improve linear correlation estimates.
Practical notes:
- Correlation does not imply causation. Because r measures only linear association, two variables with a strong nonlinear relationship can still have r near 0.
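Both points about linearity can be illustrated with small made-up vectors. In this sketch pearson_r() is a hypothetical helper that applies the definition directly, and the data (x, y_quad, u, v) are invented for illustration.

```r
# Pearson r from the definition, for comparison with the built-in cor()
pearson_r <- function(x, y) {
  dx <- x - mean(x)
  dy <- y - mean(y)
  sum(dx * dy) / (sqrt(sum(dx^2)) * sqrt(sum(dy^2)))
}

x <- -3:3
y_quad <- x^2                    # perfect but nonlinear (quadratic) relationship
pearson_r(x, y_quad)             # 0: r does not detect the curve

u <- 1:10
v <- exp(u)                      # strongly right-skewed, curved in u
cor(u, v)                        # well below 1
cor(u, log(v))                   # log transform makes the relationship linear
```

The second pair mirrors the log CST20 idea: transforming the skewed variable straightens the relationship, so r reflects its strength more faithfully.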

Worked Data Set Illustrations with R and Practical Notes

City data set exercises (CST20, LDR, CST20 vs LDR) show how skewness affects correlation and how transformations can improve linear interpretation.
The five-number summary and boxplots (via R) help visualize shape, spread, and location for quantitative variables.
Pareto charts and Pie charts are recommended for qualitative rv to summarize category frequencies; they are not used to infer distribution shape.

Problem Sets: What to Do and How to Think About Them

B problems (B1–B12): classification and description tasks for various rvs (Birthplace, Letter grade, ACT Score, Weight, Hair Color, Class Rank, Number of Siblings, Age); describe distributions and shape/spread/location for current class data; discuss robustness and appropriate measures.
B5: For a continuous rv with mean 5, explain why P(X = 5) = 0.
B6–B7: A practical binomial-type problem (Miriam’s baton spins) uses n = 15 trials; estimate the probability of making at least 10 catches; involves binomial modeling and estimating p from a sample of observed catches.
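For the baton problem, the "at least 10 of 15" probability is a binomial upper tail. A minimal sketch, assuming a catch probability of 0.7; that value is hypothetical and stands in for whatever estimate the sample of observed catches gives.

```r
n     <- 15
p_hat <- 0.7     # hypothetical estimate of the catch probability; not from the notes

# P(X >= 10) = 1 - P(X <= 9)
p_at_least_10 <- pbinom(9, size = n, prob = p_hat, lower.tail = FALSE)

# Same probability by summing the individual binomial probabilities
p_check <- sum(dbinom(10:n, size = n, prob = p_hat))

c(p_at_least_10 = p_at_least_10, p_check = p_check)
```

Note the cutoff of 9 in pbinom(): with lower.tail = FALSE it returns P(X > 9), which equals P(X ≥ 10) for a discrete rv.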
B8–B12: Numerical exercises with discrete rv X; compute shape, spread, and location; consider binomial examples, weather predictions, stadium capacity, etc.; apply 5-number summaries, range, IQR, and X̄/M (mean/median).
C problems (C1–C18): extensive hands-on data-analysis tasks using R with real data sets (winter temps, attendance, baseball hits, Coke weights, Old Faithful heights, cereal sugar rates, ACT distributions, city CST20/LDR, etc.).
- Tasks include: loading data with read.table, producing stemplots, boxplots, IQR, s, s^2, and X̄/M (shape, spread, and location metrics) for each rv; interpret what the results say about the distribution and type of rv (e.g., symmetric, right-skewed, left-skewed, binomial).
D problems (D1–D6): deeper discussion of sampling and relationships, including identifying study types (observational vs designed experiments), Simpson’s Paradox, and correlation findings from Shelton data.
Summary of key problem-solving ideas:
- Classify rvs by type (qualitative vs quantitative; discrete vs continuous).
- Use shape, spread, and location to describe distributions; prefer robust measures (median, IQR) when outliers are present.
- Use appropriate graphs (stem plots, boxplots, histograms) to visually assess distributions.
- Use R for data analysis: loading data, computing summary statistics, and visualizing distributions.
- Recognize when to apply transformations (e.g., logarithms) to linearize relationships for correlation analyses.
- Distinguish between sampling methods and their implications for inference; consider potential biases.

Quick Formulas and Key Relationships (LaTeX)

Mean (location): \mu = E(X) \text{ (population)}, \quad \bar{X} \text{ (sample estimate)}
Expected value for discrete rv: E(X) = \sum_{x} x \, P(X=x)
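As a worked instance of the expected-value formula, the symmetric rv from the Practical Examples section (values −2, 0, 2 with probabilities 0.05, 0.9, 0.05) can be evaluated directly. The variable names are illustrative.

```r
x <- c(-2, 0, 2)
p <- c(0.05, 0.90, 0.05)
stopifnot(isTRUE(all.equal(sum(p), 1)))   # probabilities must sum to 1

EX   <- sum(x * p)                 # E(X) = sum of x * P(X = x)
VarX <- sum((x - EX)^2 * p)        # Var(X) = E[(X - E(X))^2]
SDX  <- sqrt(VarX)

c(EX = EX, VarX = VarX, SDX = SDX)
```

The theoretical variance is 0.4, close to the sample value s^2 ≈ 0.42 quoted for the n = 20 data set.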
Variance and standard deviation:
- s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}
- s = \sqrt{s^2}
Range and IQR:
- \text{Range} = \max(x_i) - \min(x_i)
- \text{IQR} = Q_3 - Q_1
Five-number summary: \text{min}, Q_1, Q_2, Q_3, \text{max}
Quartiles: Q_1\; (25\%\text{ile}), Q_2\; (\text{median}), Q_3\; (75\%\text{ile})
Pearson correlation (between two quantitative variables X and Y):
r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2} \sqrt{\sum (y_i - \bar{y})^2}}
Exponential distribution (brief description): for rate \lambda > 0,
- Density: f(x) = \lambda e^{-\lambda x}, \quad x \ge 0
- Mean: E(X) = \frac{1}{\lambda}
- Variance: \mathrm{Var}(X) = \frac{1}{\lambda^2}
Note on skewness and estimation: in right-skewed data, typically \text{Median} < \text{Mean}; in symmetric data, Mean ≈ Median.

Practical Takeaways for Exam Preparation

Be able to classify an rv as quantitative vs qualitative and discrete vs continuous.
Describe a distribution using shape, spread (range, IQR, variance, standard deviation) and location (mean, median, mode, quartiles).
Use the 5-number summary to quickly assess distribution features and skewness.
Recognize when to use robust measures (median, IQR) vs mean/variance depending on outliers.
Interpret and compute basic statistics from data sets; understand how to read and summarize R outputs (stem plot, boxplot, summary, and standard deviation).
Understand the difference between sampling methods and their impact on inference; distinguish observational studies from designed experiments.
Apply correlation appropriately, including considering transformations to linearize relationships when needed.
Practice with both theoretical rv distributions (e.g., binomial, exponential) and real data sets to estimate shape, spread, and location; be able to interpret what these tell you about the underlying process and its practical implications.