Statistics Lecture Notes – Measures, Graphs, Correlation & Exam Review
Symmetric vs. Skewed Distributions
- When a distribution is symmetric
- Shape is “mirror-image” around the center.
- \text{Mean}=\text{Median} (mode usually sits there too).
- Arithmetic mean ( \bar x ) is the preferred measure of center.
- Graph of choice: Histogram (bell-shaped/symmetric quantitative display).
- When a distribution is skewed
- Skewed left (long left tail): typically \text{Mean}<\text{Median}<\text{Mode}.
- Skewed right (long right tail): typically \text{Mode}<\text{Median}<\text{Mean}.
- The tail “drags” the mean in its direction.
- Preferred measure of center: Median.
- Preferred display: Box plot (visualizes 5-number summary & outliers).
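A quick way to see the tail "dragging" the mean is to compare both measures on a small right-skewed data set. The sketch below uses made-up values (the `incomes` list is hypothetical, not from the lecture) and Python's standard `statistics` module.

```python
from statistics import mean, median

# Hypothetical right-skewed data: most values are small, one is very large.
incomes = [30, 32, 35, 36, 38, 40, 250]

center_mean = mean(incomes)      # pulled toward the long right tail
center_median = median(incomes)  # middle value, unaffected by the size of 250

print(f"mean   = {center_mean:.2f}")
print(f"median = {center_median}")
print("mean > median:", center_mean > center_median)
```

The single large value moves the mean well above the median, which is why the median is the preferred center for skewed data.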
Box Plots, Quartiles & Five-Number Summary
- Five numbers: Min, Q1, Median (Q2), Q3, Max.
- Box spans Q1 to Q3; vertical line in box = median.
- “Whiskers”: extend to the smallest & largest data values that are not flagged as outliers (the min & max when there are no outliers).
- Fences (imaginary cut-offs for outliers)
- \text{Lower Fence}=Q_1-1.5(IQR)
- \text{Upper Fence}=Q_3+1.5(IQR)
- Any data beyond fences = outliers (often plotted as dots or asterisks).
- Interquartile Range (IQR)
- IQR=Q3-Q1
- Describes spread of middle 50 % of data.
- For skewed data, use IQR instead of standard deviation to discuss spread.
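The five-number summary, IQR, and fences can be sketched in a few lines of Python. This is an illustration under stated assumptions: quartiles are computed with the "median of each half" rule (leaving out the middle value when n is odd), which is one common textbook and calculator convention; other software may use a different quartile rule and give slightly different Q1 and Q3. The data and the helper names `five_number_summary` and `fences` are hypothetical.

```python
from statistics import median

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule."""
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    lower = xs[:half]             # values below the median position
    upper = xs[half + (n % 2):]   # values above it (skips the median when n is odd)
    return xs[0], median(lower), median(xs), median(upper), xs[-1]

def fences(q1, q3):
    """Return (lower fence, upper fence) = (Q1 - 1.5*IQR, Q3 + 1.5*IQR)."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Hypothetical data set with one unusually large value.
data = [2, 4, 5, 7, 8, 9, 11, 12, 40]

lo, q1, med, q3, hi = five_number_summary(data)
lower_fence, upper_fence = fences(q1, q3)
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print("five-number summary:", (lo, q1, med, q3, hi))
print("IQR:", q3 - q1)
print("fences:", (lower_fence, upper_fence))
print("outliers:", outliers)
```

For this data, Q1 = 4.5 and Q3 = 11.5, so IQR = 7 and the fences are -6 and 22; only the value 40 falls outside them.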
Histograms
- Bar-like graph for quantitative data.
- Good for symmetric or bell-shaped sets.
- Relative frequency for each class = \frac{\text{class frequency}}{\text{total count}}.
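Converting a frequency table to relative frequencies is a single division per class. The sketch below assumes a made-up table (the class labels and counts in `class_counts` are hypothetical).

```python
# Hypothetical frequency table: class label -> class frequency.
class_counts = {"20-24": 3, "25-29": 7, "30-34": 6, "35-39": 4}

total = sum(class_counts.values())

# Relative frequency = class frequency / total count.
rel_freq = {label: count / total for label, count in class_counts.items()}

for label, rf in rel_freq.items():
    print(f"{label}: {rf:.2f}")
```

The relative frequencies should always add up to 1 (up to rounding), which is a useful check on the arithmetic.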
Z-Scores
- Standardized distance of an individual value x from the mean:
z=\frac{x-\bar x}{s}
- Interpreted as “# of standard deviations above/below the mean”.
- Example (MPG data)
- x=34.2,\; \bar x=38.97,\; s=3.54 \Rightarrow z=\frac{34.2-38.97}{3.54}\approx-1.35 (about 1.35 standard deviations below the mean).
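The MPG example can be checked with a short Python sketch. The function name `z_score` is just an illustrative helper; the numbers are the ones from the example above.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

# Values from the MPG example: x = 34.2, x-bar = 38.97, s = 3.54.
z = z_score(34.2, 38.97, 3.54)
print(round(z, 2))  # -1.35
```

A negative z means the value is below the mean; a positive z means it is above.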
Univariate vs. Bivariate Data
- Univariate: one variable (height, commute time, siblings, etc.).
- Bivariate: two variables; study their relationship through an x–y pair.
- René Descartes’ xy-plane (17th c.) fused geometry & algebra, giving us coordinate mapping & modern graphs.
Explanatory / Response Terminology
- Algebra: Independent (input x) ↔ Dependent (output y).
- Computer science: Input ↔ Output.
- Statistics: Explanatory (predictor) ↔ Response (outcome).
- Examples
- Hours worked \to Paycheck.
- Distance from light source \to Light intensity.
- Gallons pumped \to Cost of fuel.
Correlation vs. Causation
- Causation = direct cause-and-effect (drop rock → hits floor every time).
- Correlation = variables “run together”; may be strong, weak, or nonexistent, but not necessarily causal (sun rises & you wake up).
- Practical implications
- Low income areas & high crime: correlated but poverty ≠ automatic criminality.
- Smoking & lung disease: high positive correlation (evidence ultimately supported causal link).
Linear Correlation Coefficient r
- Measures strength & direction of a linear relation.
- Range: -1\le r\le1.
- r\approx1 → very strong positive linear fit.
- r\approx-1 → very strong negative linear fit.
- r\approx0 → little/no linear relation.
- Interpretation scale used in lecture
- |r|\ge0.9 : extremely strong.
- 0.7\le|r|<0.9 : strong.
- 0.4\le|r|<0.7 : moderate.
- |r|<0.4 : weak to none.
- Positive vs. Negative examples
- Positive: Cigarettes smoked ↑ ⇒ Disease risk ↑.
- Negative: Hours driven ↑ ⇒ Fuel in tank ↓.
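Although the calculator finds r for you, the computation can be sketched directly from the recap formula r = Σ[(x − x̄)(y − ȳ)] / ((n − 1)·s_x·s_y). The data below are made up to mimic the "hours driven vs. fuel in tank" example, and the helpers `pearson_r` and `describe` are hypothetical names; `describe` simply applies the interpretation scale used in lecture.

```python
from statistics import mean, stdev

def pearson_r(xs, ys):
    """Linear correlation coefficient from the sample-SD form of the formula."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (n - 1)
    cross = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return cross / ((n - 1) * s_x * s_y)

def describe(r):
    """Label |r| using the lecture's interpretation scale."""
    a = abs(r)
    if a >= 0.9:
        return "extremely strong"
    if a >= 0.7:
        return "strong"
    if a >= 0.4:
        return "moderate"
    return "weak to none"

# Hypothetical data: hours driven (explanatory) vs. gallons left in tank (response).
hours = [0, 1, 2, 3, 4]
fuel = [15.0, 13.1, 10.8, 9.2, 7.0]

r = pearson_r(hours, fuel)
print(f"r = {r:.3f} ({describe(r)}, {'negative' if r < 0 else 'positive'})")
```

Here r comes out very close to -1: as hours driven go up, fuel in the tank goes down almost perfectly linearly.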
Empirical Rule (68-95-99.7)
For bell-shaped distributions:
- \mu\pm1\sigma ≈ 68 % of data.
- \mu\pm2\sigma ≈ 95 % of data.
- \mu\pm3\sigma ≈ 99.7 % of data.
Example with IQ:
- \mu=100,\;\sigma=15.
- 95 % of IQs lie between 100-2(15)=70 and 100+2(15)=130.
- Values
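The three empirical-rule intervals for the IQ example can be tabulated with a small Python sketch. The helper name `empirical_intervals` is hypothetical, and the percentages are approximations that hold only for bell-shaped distributions.

```python
def empirical_intervals(mu, sigma):
    """Map each approximate percentage to its (low, high) interval mu ± k*sigma."""
    return {
        pct: (mu - k * sigma, mu + k * sigma)
        for k, pct in [(1, 68), (2, 95), (3, 99.7)]
    }

# IQ example: mu = 100, sigma = 15.
iq = empirical_intervals(100, 15)
for pct, (low, high) in iq.items():
    print(f"about {pct}% of IQs lie between {low} and {high}")
```

The 95 % row reproduces the interval from 70 to 130 worked out above.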
Data Types & Scale
- Qualitative (categorical): eye color, brand, nationality (addition makes no sense).
- Quantitative
- Discrete: whole-number counts (siblings, cars owned).
- Continuous: can assume any value in interval (height, MPG, weight).
Exam Preparation Highlights
- Know relationships between mean, median, mode for symmetric vs. skewed.
- Be able to:
- Compute mean, median, standard deviation, 5-number summary on calculator (STAT → EDIT to enter the data, then STAT → CALC → 1-Var Stats).
- Convert frequency table to relative frequency.
- Apply empirical rule & z-scores.
- Determine explanatory/response variables in context.
- Distinguish correlation from causation.
- Identify discrete vs. continuous quantitative data.
- Allowed: personal formula sheet (no size limit announced), calculator.
- Concepts such as fences, IQR, box plots, skew direction will appear (“I promise it’s on there”).
Numerical & Formula Recap
- \text{Mean}\;(\bar x)=\dfrac{\sum x}{n}.
- \text{Sample SD}\;(s)=\sqrt{\dfrac{\sum(x-\bar x)^2}{n-1}}.
- IQR=Q3-Q1.
- \text{Fences}=Q1-1.5(IQR),\;Q3+1.5(IQR).
- z=\dfrac{x-\bar x}{s}.
- r=\dfrac{\sum\big[(x-\bar x)(y-\bar y)\big]}{(n-1)s_x s_y} (calculator finds it for you).
End of Unit 1 material; Unit 2 (Correlation/Regression) begins after the exam.