C-N M201 Lecture Notes (Part 2) — Comprehensive Summary
Random Variables: Classification and Basics
- Random variable (rv) concepts and classifications are extended to dimension n:
- If dim(A) = 1, the rv is quantitative; if dim(A) > 1, the rv is qualitative (e.g., color).
- Quantitative rvs can have ranges that are intervals (continuous) or finite/countable sets (discrete).
- In these notes, focus is on quantitative rvs.
- Examples of variable types:
- Diastolic Blood Pressure: quantitative, continuous.
- College Class: quantitative, discrete.
- Two approaches to “sin” as an rv example:
- Discrete case: X ∈ {0, 1} with X = 0 (no sin) or X = 1 (sin).
- Continuous case: X > 0 representing magnitude of sin.
- Affiliated concept: distribution of a rv describes its numerical occurrence pattern.
- Questions about distribution include where values occur most often, how far apart occurrences are, and expected/average occurrences.
- Famous continuous distributions are introduced later (Normal, t, Chi-square, F). See Notes 3 and 4.
- Simple discrete examples illustrate rv behavior:
- Vehicle direction example: Ω = {Right, Straight, Left}, X: Ω → ℝ with values in A = {−1, 0, 1} (symmetric shape).
- Coin-toss example: Ω = {HH, HT, TH, TT}, X: Ω → ℕ with X = 1, 2, 3, 4 (symmetric, discrete).
- Real-world data and distributions link to inference: distributions describe data patterns and underpin statistical inference.
Shape, Spread, and Location of Distributions
- Key descriptive facets of a distribution:
- Shape: the overall form (e.g., symmetric, skewed).
- Spread: how dispersed the data are (range, variability).
- Location: where the data center or balance point lies (mean, median, mode).
- Terminology in notes: “random variable, data set and distribution” are used interchangeably in context; however, they aren’t exactly the same.
- Measures of shape, spread, and location are called the distribution’s parameters (shape, spread, location).
- Skewness terminology:
- Right-skewed (positively skewed): tail to the right; longer right tail.
- Left-skewed (negatively skewed): tail to the left; longer left tail.
- Example data sets and interpretations:
- Set C (inches): 30,32,33,33,33,34,35 → symmetric shape, range 5 in, location ~33 in.
- Set A (inches): 47,52,43,56,45,49,50 → right-skewed, range 13 in, location near ~48 in.
- Set B (inches): 61,67,68,70,69,72,69 → left-skewed, range 11 in, location ~69 in.
- A graph can illustrate shape, spread, and location together (e.g., histograms, stemplots, boxplots).
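The range and location values quoted for Sets C, A, and B can be checked with a short R sketch. The vector names and the describe() helper are illustrative, not part of the notes; the data are the three sets listed above.

```r
# Data sets from the notes (inches)
set_C <- c(30, 32, 33, 33, 33, 34, 35)
set_A <- c(47, 52, 43, 56, 45, 49, 50)
set_B <- c(61, 67, 68, 70, 69, 72, 69)

# Hypothetical helper: range as a single number, plus mean and median
describe <- function(x) {
  c(range = max(x) - min(x), mean = mean(x), median = median(x))
}

desc_C <- describe(set_C)
desc_A <- describe(set_A)
desc_B <- describe(set_B)

# One row per data set, rounded for display
print(round(rbind(C = desc_C, A = desc_A, B = desc_B), 2))
```

For Set B the mean falls below the median, which agrees with the left-skewed description. With only seven values per set, a mean/median comparison is a rough hint about shape rather than a reliable test.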
Random Variable Types and Examples
- Continuous vs discrete rv examples:
- X: movement through an intersection; X ∈ {−1,0,1} (discrete, symmetric) with probabilities provided.
- Two-coin toss: Ω = {HH, HT, TH, TT}, X ∈ {1,2,3,4} with equal probabilities (discrete, symmetric).
- Waiting time for Apple gadget: X is quantitative, continuous; distribution described as right-skewed with infinite range in some cases (Exponential distribution).
- Waiting-time example details:
- X = time (in hours) a person waits in a queue; X > 0; average ~5 hours; could be as large as 25 hours (extreme).
- This is a qualitative note about a continuous rv with right-skewed shape (Exponential distribution is cited).
- Qualitative rvs and plotting:
- Pareto or Pie charts are used for qualitative rvs; do not try to infer a distribution's shape from these graphs.
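As a small illustration of a discrete rv, the two-coin-toss variable above (values 1 to 4, equal probabilities) can be summarized directly from its probability table. This is a minimal sketch; the variable names are illustrative, and the variance here is the population (probability-weighted) variance, not the n − 1 sample formula.

```r
# Two-coin-toss rv from the notes: X takes 1, 2, 3, 4 with equal probability
x <- 1:4
p <- rep(1 / 4, 4)

# A valid probability table must sum to 1
stopifnot(isTRUE(all.equal(sum(p), 1)))

ex   <- sum(x * p)            # E(X) = sum over x of x * P(X = x)
varx <- sum((x - ex)^2 * p)   # Var(X) = E[(X - E(X))^2]
sdx  <- sqrt(varx)

cat("E(X) =", ex, " Var(X) =", varx, " SD(X) =", round(sdx, 3), "\n")
```

Because the probabilities are symmetric about 2.5, the mean sits at the center of the range, matching the "symmetric" label in the notes.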
Famous Continuous Distributions (in this course)
- Four famous continuous distributions highlighted for study: Normal, t, Chi-square, and F.
- Exponential distribution is presented as another well-known continuous rv (example in the notes).
- Relationship to inference: understanding these distributions aids in modeling and statistical inference.
Descriptive Statistics: Location, Spread, and Shape in Practice
Notation and key concepts:
- Mean (location) μ or E(X): the balance point of the distribution.
- Median (Q2): 50th percentile; midpoint of the ordered data.
- Mode: most frequent value; may be nonunique or nonexistent.
- Quartiles: Q1 (25th percentile), Q2 (median), Q3 (75th percentile).
- Five-number summary: min, Q1, Q2, Q3, max.
- Range: max − min.
- Interquartile range (IQR): Q3 − Q1.
- Variance: σ^2 = E[(X − μ)^2]; standard deviation: σ = √σ^2. For a sample:
s^{2} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^{2}}{n-1},
s = \sqrt{s^{2}}.
The mean is the first raw moment; the variance is the second central moment.
Robustness considerations:
- The mean is sensitive to outliers; the median is more robust as a measure of center.
Shape vs location vs spread interplay:
- For right-skewed distributions, typically Median < Mean; for left-skewed, Mean < Median.
- For symmetric distributions, mean ≈ median.
In practice, use R or other software to obtain: 5-number summary, mean, median, mode, quartiles, range, IQR, variance, and standard deviation.
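A minimal R sketch of these quantities, using the seven commute times that appear in these notes as ctime. The helper variable names are illustrative. Base R has no built-in function for the statistical mode, so the sketch tabulates the data instead; that workaround is an assumption of this example, not something prescribed by the notes.

```r
# Seven commute times (minutes), the `ctime` vector from the notes
ctime <- c(10, 13, 5, 9, 5, 7, 7)

n    <- length(ctime)
xbar <- mean(ctime)

# Sample variance from the definition: squared deviations summed, divided by n - 1
s2_manual <- sum((ctime - xbar)^2) / (n - 1)
s_manual  <- sqrt(s2_manual)

# Built-in equivalents
five_num <- fivenum(ctime)            # min, Q1, median, Q3, max (Tukey hinges)
iqr      <- IQR(ctime)                # Q3 - Q1 via quantile()'s default method
rng      <- max(ctime) - min(ctime)   # range as a single number

# Mode: tabulate and keep the most frequent value(s); there may be several
counts <- table(ctime)
modes  <- as.numeric(names(counts)[counts == max(counts)])

cat("mean =", xbar, " s^2 =", round(s2_manual, 2), " s =", round(s_manual, 2), "\n")
cat("modes:", modes, "\n")
```

Here both 5 and 7 occur twice, so the mode is not unique, which illustrates the caveat in the definition of the mode above.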
Practical Data Visualization and Software Tools (R)
- Tools introduced to assess distribution features:
- Stem-and-leaf plots (stem()), histograms, boxplots (boxplot()), Pareto/Pie charts for qualitative rv.
- Five-number summary: min, Q1, Q2, Q3, max.
- Spread measures include: range, IQR, variance, standard deviation.
- Example data: commuting times of a class (25 observations, minutes):
- Data: 23, 2, 10, 15, 2, 5, 5, 6, 15, 6, 10, 6, 2, 45, 35, 2, 5, 5, 25, 30, 5, 5, 5, 2, 6.
- Computations (via R):
- Mean: \bar{X} \approx 11.1 min
- Median Q2 = 6 minutes
- Mode ≈ 5 minutes
- Range = 45 − 2 = 43 minutes
- IQR = Q3 − Q1 ≈ 10 minutes
- Shape: right-skewed (long tail toward higher values).
- R practice with ctime (7 observations) and timegordon example:
- Small set: ctime = c(10, 13, 5, 9, 5, 7, 7)
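- Summary and spread for this small set: mean = 8 min, s^2 ≈ 8.33 min^2, s ≈ 2.89 min. (The values s^2 ≈ 134.3 min^2 and s ≈ 11.6 min belong to the full 25-observation commuting data above.)
- Expanded timegordon: set timegordon = ctime, then append 6 and 6.5 to timegordon; summary(timegordon) then shows a right-skewed distribution with mean ≈ 7.6 and median = 7.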
- Example with larger data: 2019 US Cities dataset (citygordon) used to illustrate qualitative vs quantitative variables and the use of Pareto charts and CST20 (cost of riots) as a quantitative variable with severe right skew.
- Important R commands introduced:
- ctime = c(…)
- length(ctime)
- summary(ctime)
- sd(ctime); var(ctime)
- stem(ctime); boxplot(ctime)
- boxplot and 5-number summary via boxplot output
- read.table("URL", header = TRUE) to load datasets
- citygordon$CST20 to access a variable within a data frame
- summary(citygordon$CST20), sd(citygordon$CST20) for location/spread
- q() to exit R
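The commands above can be strung together on the 25 commuting times from the class example. This is a sketch of one possible session; the vector name commute and the stored result names are illustrative choices, not names used in the notes.

```r
# Commuting times (minutes) for the 25-student class in the notes
commute <- c(23, 2, 10, 15, 2, 5, 5, 6, 15, 6, 10, 6, 2,
             45, 35, 2, 5, 5, 25, 30, 5, 5, 5, 2, 6)

n_obs   <- length(commute)    # number of observations
smry    <- summary(commute)   # min, Q1, median, mean, Q3, max
iqr_c   <- IQR(commute)       # Q3 - Q1
var_c   <- var(commute)       # sample variance (divides by n - 1)
sd_c    <- sd(commute)        # sample standard deviation

print(smry)
cat("IQR =", iqr_c, " s^2 =", round(var_c, 1), " s =", round(sd_c, 1), "\n")

# Text and graphical views of shape, spread, and location
stem(commute)
boxplot(commute, horizontal = TRUE, main = "Commuting times (min)")
```

The stemplot and boxplot should both show the bulk of the values at the low end with a long tail toward 45 minutes, consistent with the right-skewed description.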
Practical Examples and Interpretations
- Exponential distribution example (Apple queue):
- X = waiting time; shape is right-skewed; range is infinite; location around 5 hours; this is described as another famous distribution.
- ACT score dataset (Carson-Newman):
- Data show a slightly right-skewed shape; spread via range ≈ 22; location around the most common score ≈ 21; mean and median differ due to skewness.
- Conceptual interpretation of the 5-number summary for commuting times:
- Min ≈ 2, Q1 ≈ 5, Median ≈ 6, Q3 ≈ 15, Max ≈ 45
- Range ≈ 43; IQR ≈ 10; indicates right-skew with a concentration of values near the lower end but with a long tail toward higher times.
- Summary interpretation of X (theoretical example):
- Symmetric discrete rv with X ∈ {−2, 0, 2} and P(−2)=0.05, P(0)=0.9, P(2)=0.05
- Range = 4; mean and median both 0; distribution remains symmetric, but spread changes with sample size (SSE and s^2 differences shown for n=3 vs n=20).
- Demonstrates that range can mislead about spread, while variance captures it more accurately: for the data set {−2, 0, 2}, s^2 = 4 and s = 2; for the n = 20 data set with eighteen zeros plus one −2 and one 2, s ≈ 0.65 (s^2 ≈ 0.42), even though both sets have range 4.
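The n = 3 versus n = 20 comparison can be reproduced in R. The n = 20 data set below is built to match the stated probabilities (one −2, eighteen 0s, one 2); the vector names and the spread() helper are illustrative.

```r
small <- c(-2, 0, 2)
large <- c(-2, rep(0, 18), 2)   # n = 20: matches P(-2) = 0.05, P(0) = 0.9, P(2) = 0.05

# Hypothetical helper collecting the spread measures side by side
spread <- function(x) {
  c(n = length(x), range = max(x) - min(x), s2 = var(x), s = sd(x))
}

result <- rbind(small = spread(small), large = spread(large))
print(round(result, 2))
```

Both rows report the same range, while the variance and standard deviation drop sharply for the larger set because most of its values sit exactly at the mean.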
Inferential Statistics: From Sample to Population
- Inferential model overview:
- Population parameters (shape, center, spread) are generally unknown.
- Use random samples to estimate parameters (statistics) and infer about true population characteristics.
- The progression: Population --Inference--> Parameters; Sample --Inference--> Statistics (estimators) --Inference--> Truth.
- Population vs Sample terminology:
- Population parameters denoted by Greek letters (e.g., μ, σ, π, ρ).
- Sample statistics denoted by Roman letters (e.g., X̄, s, p̂, r̂).
- Random sampling methods to obtain representative data:
- Simple random sample (SRS): every possible sample of a given size n has an equal chance of being selected.
- Stratified random sample: sample from each stratum to ensure representation.
- Cluster sample: sample from clusters; often used in large populations.
- Systematic sampling: select every k-th item after a random start.
- Non-sampling bias: voluntary response samples (VRS) can bias results.
- Observational studies vs designed experiments:
- Observational study: data reflect current situation without manipulating factors.
- Designed experiment: researchers actively manipulate at least one factor to observe effects.
- Practical inference with data:
- Use random samples and statistical software (R in these notes) to estimate the shape, spread, and location of distributions.
- Inferential conclusions depend on the sampling design and data quality.
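The four random sampling designs listed above can be sketched with base R's sample(). Everything about the population here is hypothetical (100 units, four strata of 25, ten blocks of 10 used as clusters), and the seed is arbitrary; the sketch is meant only to show how the designs differ mechanically.

```r
set.seed(201)  # arbitrary seed so the draws are reproducible

# Hypothetical population: 100 units, 4 strata of 25, 10 clusters ("blocks") of 10
population <- data.frame(
  id      = 1:100,
  stratum = rep(c("A", "B", "C", "D"), each = 25),
  block   = rep(1:10, each = 10)
)

# Simple random sample: 12 units drawn without replacement
srs <- population[sample(nrow(population), 12), ]

# Stratified random sample: 3 units drawn at random within each stratum
strat_ids  <- unlist(lapply(split(population$id, population$stratum), sample, size = 3))
stratified <- population[population$id %in% strat_ids, ]

# Systematic sample: random start in 1..k, then every k-th unit
k          <- 10
start      <- sample(k, 1)
systematic <- population[seq(start, nrow(population), by = k), ]

# Cluster sample: choose 2 whole blocks at random and keep every unit in them
cluster_ids <- sample(unique(population$block), 2)
cluster     <- population[population$block %in% cluster_ids, ]

table(stratified$stratum)   # each stratum is represented equally by design
```

Note that only the stratified design guarantees representation from every stratum; the SRS may over- or under-represent a stratum by chance.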
Relationship Between Two Quantitative Variables: Correlation
- Pearson correlation measures the strength and direction of a linear relationship between two quantitative variables.
- The population correlation is denoted by ρ; its estimate computed from sample data is called r.
- r ∈ [−1, 1], where:
- r ≈ 1 indicates a strong positive linear relationship,
- r ≈ −1 indicates a strong negative linear relationship,
- r ≈ 0 indicates little to no linear relationship.
- Examples discussed:
- Municipal violence vs percentage of Hillary Clinton voters: a positive linear relationship with r ≈ 0.64 reported in one example.
- Transformations (e.g., log CST20) can linearize highly skewed variables to improve linear correlation estimates.
- Practical notes:
- Correlation does not imply causation. Pearson's r measures only linear association, so a strong nonlinear relationship can still produce an r near 0.
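Two of the points above (r can miss a nonlinear pattern, and a log transform can linearize a skewed variable) can be illustrated with made-up data. Neither data set below comes from the notes, and pearson_r() is a hypothetical helper that simply writes out the Pearson formula; R's built-in cor() is used as a cross-check.

```r
# Hypothetical helper implementing the Pearson formula directly
pearson_r <- function(x, y) {
  dx <- x - mean(x)
  dy <- y - mean(y)
  sum(dx * dy) / (sqrt(sum(dx^2)) * sqrt(sum(dy^2)))
}

# Made-up data 1: a perfect but nonlinear (quadratic) relationship
x1 <- -3:3
y1 <- x1^2
r_quadratic <- pearson_r(x1, y1)   # the linear measure misses the pattern

# Made-up data 2: exponential growth, so y is strongly right-skewed
x2 <- 1:8
y2 <- exp(x2)
r_raw <- cor(x2, y2)               # weakened by the curvature
r_log <- cor(x2, log(y2))          # log(y) is exactly linear in x

cat("quadratic r =", round(r_quadratic, 3),
    " raw r =", round(r_raw, 3),
    " log-scale r =", round(r_log, 3), "\n")
```

The quadratic example has y completely determined by x yet yields r of essentially zero, which is why a scatterplot should accompany any correlation estimate.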
Worked Data Set Illustrations with R and Practical Notes
- City data set exercises (CST20, LDR, CST20 vs LDR) show how skewness affects correlation and how transformations can improve linear interpretation.
- The five-number summary and boxplots (via R) help visualize shape, spread, and location for quantitative variables.
- Pareto charts and Pie charts are recommended for qualitative rv to summarize category frequencies; they are not used to infer distribution shape.
Problem Sets: What to Do and How to Think About Them
- B problems (B1–B12): classification and description tasks for various rvs (Birthplace, Letter grade, ACT Score, Weight, Hair Color, Class Rank, Number of Siblings, Age); describe distributions and shape/spread/location for current class data; discuss robustness and appropriate measures.
- B5: For a continuous rv with mean 5, explain why P(X = 5) = 0.
- B6–B7: A practical binomial-type problem (Miriam’s baton spins) uses n = 15 trials; estimate the probability of making at least 10 catches; involves binomial modeling and estimating p from a sample of observed catches.
- B8–B12: Numerical exercises with discrete rv X; compute shape, spread, and location; consider binomial examples, weather predictions, stadium capacity, etc.; apply 5-number summaries, range, IQR, and X̄/M (mean/median).
- C problems (C1–C18): extensive hands-on data-analysis tasks using R with real data sets (winter temps, attendance, baseball hits, Coke weights, Old Faithful heights, cereal sugar rates, ACT distributions, city CST20/LDR, etc.).
- Tasks include: loading data with read.table, producing stemplots, boxplots, IQR, s, s^2, and X̄/M (shape, spread, and location metrics) for each rv; interpret what the results say about the distribution and type of rv (e.g., symmetric, right-skewed, left-skewed, binomial).
- D problems (D1–D6): deeper discussion of sampling and relationships, including identifying study types (observational vs designed experiments), Simpson’s Paradox, and correlation findings from Shelton data.
- Summary of key problem-solving ideas:
- Classify rvs by type (qualitative vs quantitative; discrete vs continuous).
- Use shape, spread, and location to describe distributions; prefer robust measures (median, IQR) when outliers are present.
- Use appropriate graphs (stem plots, boxplots, histograms) to visually assess distributions.
- Use R for data analysis: loading data, computing summary statistics, and visualizing distributions.
- Recognize when to apply transformations (e.g., logarithms) to linearize relationships for correlation analyses.
- Distinguish between sampling methods and their implications for inference; consider potential biases.
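For binomial-type problems such as B6–B7 (n = 15 trials, probability of at least 10 successes), the calculation might be set up as follows. The value of p_hat is purely a placeholder assumption; the notes say p is to be estimated from the observed catches, and that estimate is not given here.

```r
n     <- 15
p_hat <- 0.7   # HYPOTHETICAL estimate; replace with the observed proportion of catches

# P(X >= 10) = 1 - P(X <= 9), using the upper tail of the binomial cdf
p_at_least_10 <- pbinom(9, size = n, prob = p_hat, lower.tail = FALSE)

# Same probability obtained by summing the pmf over 10, 11, ..., 15
p_check <- sum(dbinom(10:n, size = n, prob = p_hat))

cat("P(X >= 10) with p =", p_hat, ":", round(p_at_least_10, 4), "\n")
```

The key step is the off-by-one in the cdf call: "at least 10" is the complement of "at most 9", so the first argument to pbinom() is 9, not 10.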
Quick Formulas and Key Relationships (LaTeX)
- Mean (location): \mu = E(X) \text{ for a population, estimated by the sample mean } \bar{X}
- Expected value for discrete rv: E(X) = \sum_{x} x \, P(X=x)
- Variance and standard deviation:
- s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{X})^2}{n-1}
- s = \sqrt{s^2}
- Range and IQR:
- \text{Range} = \max(x_i) - \min(x_i)
- \text{IQR} = Q_3 - Q_1
- Five-number summary: \text{min}, Q_1, Q_2, Q_3, \text{max}
- Quartiles: Q_1\; (25\%\text{ile}), Q_2\; (\text{median}), Q_3\; (75\%\text{ile})
- Pearson correlation (between two quantitative variables X and Y):
r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2} \sqrt{\sum (y_i - \bar{y})^2}}
- Exponential distribution (brief description): for rate \lambda > 0,
- Density: f(x) = \lambda e^{-\lambda x}, \quad x \ge 0
- Mean: E(X) = \frac{1}{\lambda}
- Variance: \mathrm{Var}(X) = \frac{1}{\lambda^2}
- Note on skewness and estimation: in right-skewed data, typically \text{Median} < \text{Mean}; in symmetric data, Mean ≈ Median.
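The exponential formulas can be tied back to the waiting-time example with a short R sketch. It assumes the quoted average of about 5 hours is the distribution's mean, so that \lambda = 1/5; the variable names and the simulation seed are illustrative.

```r
mean_wait <- 5            # hours, taken from the waiting-time example
rate      <- 1 / mean_wait   # lambda, since E(X) = 1 / lambda

# Probability of an extreme wait: P(X > 25) = exp(-lambda * 25)
p_over_25 <- pexp(25, rate = rate, lower.tail = FALSE)

# Median of an exponential rv is log(2) / lambda, which is below the mean
median_wait <- qexp(0.5, rate = rate)
var_wait    <- 1 / rate^2    # Var(X) = 1 / lambda^2

# Simulated check of the mean (arbitrary seed)
set.seed(1)
sim <- rexp(1e5, rate = rate)

cat("P(X > 25) =", signif(p_over_25, 3),
    " median =", round(median_wait, 2),
    " simulated mean =", round(mean(sim), 2), "\n")
```

The median falling below the mean is the same Median < Mean pattern noted for right-skewed data, and the small value of P(X > 25) matches the description of a 25-hour wait as extreme.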
Practical Takeaways for Exam Preparation
- Be able to classify an rv as quantitative vs qualitative and discrete vs continuous.
- Describe a distribution using shape, spread (range, IQR, variance, standard deviation) and location (mean, median, mode, quartiles).
- Use the 5-number summary to quickly assess distribution features and skewness.
- Recognize when to use robust measures (median, IQR) vs mean/variance depending on outliers.
- Interpret and compute basic statistics from data sets; understand how to read and summarize R outputs (stem plot, boxplot, summary, and standard deviation).
- Understand the difference between sampling methods and their impact on inference; distinguish observational studies from designed experiments.
- Apply correlation appropriately, including considering transformations to linearize relationships when needed.
- Practice with both theoretical rv distributions (e.g., binomial, exponential) and real data sets to estimate shape, spread, and location; be able to interpret what these tell you about the underlying process and its practical implications.