Statistics Lecture Notes – Measures, Graphs, Correlation & Exam Review

Symmetric vs. Skewed Distributions

  • When a distribution is symmetric
    • Shape is “mirror-image” around the center.
    • \text{Mean}=\text{Median} (mode usually sits there too).
    • Arithmetic mean ( \bar x ) is the preferred measure of center.
    • Graph of choice: Histogram (bell-shaped/symmetric quantitative display).
  • When a distribution is skewed
    • Skewed left (long left tail): typically \text{Mean}<\text{Median}<\text{Mode}.
    • Skewed right (long right tail): typically \text{Mode}<\text{Median}<\text{Mean}.
    • The tail “drags” the mean in its direction.
    • Preferred measure of center: Median.
    • Preferred display: Box plot (visualizes 5-number summary & outliers).
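
A minimal sketch of how a long tail drags the mean but not the median. The income values are made up for illustration (they are not lecture data), and the variable names are hypothetical.

```python
import statistics

# Hypothetical right-skewed data (thousands of dollars): one long right tail.
incomes = [28, 30, 31, 33, 35, 36, 38, 40, 42, 250]

mean_income = statistics.mean(incomes)
median_income = statistics.median(incomes)

# The single large value pulls the mean toward the tail;
# the median stays near the bulk of the data.
print(f"mean   = {mean_income:.1f}")    # 56.3
print(f"median = {median_income:.1f}")  # 35.5
```

Here the mean lands above nine of the ten values, which is why the median is the preferred center for skewed data.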

Box Plots, Quartiles & Five-Number Summary

  • Five numbers: Min, Q_1, Median (Q_2), Q_3, Max.
  • Box spans Q_1 to Q_3; vertical line in box = median.
  • “Whiskers”: extend to the smallest & largest values that are not flagged as outliers (the true min & max when there are no outliers).
  • Fences (imaginary cut-offs for outliers)
    • \text{Lower Fence}=Q_1-1.5(IQR)
    • \text{Upper Fence}=Q_3+1.5(IQR)
    • Any data beyond fences = outliers (often plotted as dots or asterisks).
  • Interquartile Range (IQR)
    • IQR=Q_3-Q_1
    • Describes spread of middle 50 % of data.
    • For skewed data, use IQR instead of standard deviation to discuss spread.
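
The five-number summary, IQR, and fences can be sketched as below. This assumes the median-of-halves quartile rule (the middle value is left out of both halves when n is odd); other software may use a different quartile convention and give slightly different Q_1 and Q_3. The data set and function names are hypothetical.

```python
import statistics

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule."""
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    lower = xs[:half]            # values below the median position
    upper = xs[half + n % 2:]    # values above it (middle value skipped if n is odd)
    return (xs[0], statistics.median(lower), statistics.median(xs),
            statistics.median(upper), xs[-1])

def fences_and_outliers(data):
    """Return (IQR, lower fence, upper fence, list of outliers)."""
    _, q1, _, q3, _ = five_number_summary(data)
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [x for x in data if x < lower_fence or x > upper_fence]
    return iqr, lower_fence, upper_fence, outliers

scores = [2, 4, 5, 6, 7, 8, 9, 10, 11, 30]  # made-up data with one extreme value

print(five_number_summary(scores))   # (2, 5, 7.5, 10, 30)
print(fences_and_outliers(scores))   # (5, -2.5, 17.5, [30])
```

With these numbers the value 30 lies above the upper fence of 17.5, so it would be plotted as an outlier and the right whisker would stop at 11.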

Histograms

  • Bar-like graph for quantitative data.
  • Good for symmetric or bell-shaped sets.
  • Relative frequency for each class = \frac{\text{class frequency}}{\text{total count}}.
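
A short sketch of converting a frequency table to relative frequencies. The class labels and counts are invented for illustration.

```python
# Hypothetical class frequencies for a histogram (class label -> count).
frequencies = {"60-69": 4, "70-79": 10, "80-89": 8, "90-99": 3}

total = sum(frequencies.values())  # total count = 25

# Relative frequency = class frequency / total count.
relative = {label: count / total for label, count in frequencies.items()}

for label, rel in relative.items():
    print(f"{label}: {rel:.2f}")
```

The relative frequencies should always add up to 1 (or 100 %), which is a quick check on the arithmetic.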

Z-Scores

  • Standardized distance of an individual value x from the mean:
    z=\frac{x-\bar x}{s}
  • Interpreted as “# of standard deviations above/below the mean”.
  • Example (MPG data)
    • x=34.2,\; \bar x=38.97,\; s=3.54 \Rightarrow z=\frac{34.2-38.97}{3.54}\approx-1.35 (about 1.35 standard deviations below the mean).
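
The MPG example above can be checked with a few lines of code. The function name z_score is a hypothetical helper; the numbers are the ones from the example.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

z = z_score(34.2, 38.97, 3.54)
print(round(z, 2))  # -1.35
```

A negative result means the value is below the mean; a positive result means it is above.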

Univariate vs. Bivariate Data

  • Univariate: one variable (height, commute time, siblings, etc.).
  • Bivariate: two variables; study their relationship through an x–y pair.
  • René Descartes’ xy-plane (17th c.) fused geometry & algebra, giving us coordinate mapping & modern graphs.

Explanatory / Response Terminology

  • Algebra: Independent (input x) ↔ Dependent (output y).
  • Computer science: Input ↔ Output.
  • Statistics: Explanatory (predictor) ↔ Response (outcome).
  • Examples
    • Hours worked \to Paycheck.
    • Distance from light source \to Light intensity.
    • Gallons pumped \to Cost of fuel.

Correlation vs. Causation

  • Causation = direct cause-and-effect (drop rock → hits floor every time).
  • Correlation = variables “run together”; may be strong, weak, or nonexistent, but not necessarily causal (sun rises & you wake up).
  • Practical implications
    • Low income areas & high crime: correlated but poverty ≠ automatic criminality.
    • Smoking & lung disease: high positive correlation (evidence ultimately supported causal link).

Linear Correlation Coefficient r

  • Measures strength & direction of a linear relation.
  • Range: -1\le r\le1.
    • r\approx1 → very strong positive linear fit.
    • r\approx-1 → very strong negative linear fit.
    • r\approx0 → little/no linear relation.
  • Interpretation scale used in lecture
    • |r|\ge0.9 : extremely strong.
    • 0.7\le|r|<0.9 : strong.
    • 0.4\le|r|<0.7 : moderate.
    • |r|<0.4 : weak to none.
  • Positive vs. Negative examples
    • Positive: Cigarettes smoked ↑ ⇒ Disease risk ↑.
    • Negative: Hours driven ↑ ⇒ Fuel in tank ↓.
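
A rough sketch of computing r by hand and labeling its strength with the lecture's scale. It uses the formula from the recap section, r = \sum[(x-\bar x)(y-\bar y)] / ((n-1) s_x s_y). The hours/fuel data and all function names are made up for illustration.

```python
import math

def sample_sd(values):
    """Sample standard deviation (divides by n - 1)."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))

def correlation(xs, ys):
    """Linear correlation coefficient r for paired data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    cross = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return cross / ((n - 1) * sample_sd(xs) * sample_sd(ys))

def strength(r):
    """Label |r| using the interpretation scale from lecture."""
    a = abs(r)
    if a >= 0.9:
        return "extremely strong"
    if a >= 0.7:
        return "strong"
    if a >= 0.4:
        return "moderate"
    return "weak to none"

# Hypothetical data: hours driven (explanatory) vs. gallons left in tank (response).
hours = [0, 1, 2, 3, 4]
fuel = [15.0, 13.1, 10.8, 9.2, 7.0]

r = correlation(hours, fuel)
print(f"r = {r:.3f} ({strength(r)}, {'negative' if r < 0 else 'positive'})")
```

The sign of r gives the direction and |r| gives the strength; here r comes out close to -1, matching the "hours driven up, fuel down" example.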

Empirical Rule (68-95-99.7)

For bell-shaped distributions:

  • About 68 % of data lie within \mu\pm1\sigma.
  • About 95 % of data lie within \mu\pm2\sigma.
  • About 99.7 % of data lie within \mu\pm3\sigma.
  • Example with IQ
    • \mu=100,\;\sigma=15.
    • 95 % of IQs lie between 100-2(15)=70 and 100+2(15)=130.
  • Values
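
The IQ example can be extended to all three bands of the empirical rule. This is only a sketch: the function name is hypothetical, and the percentages are the approximate 68-95-99.7 values, which apply to bell-shaped distributions only.

```python
def empirical_intervals(mu, sigma):
    """Return {k: (low, high, approx % of data)} for k = 1, 2, 3 standard deviations."""
    coverage = {1: 68.0, 2: 95.0, 3: 99.7}
    return {k: (mu - k * sigma, mu + k * sigma, pct) for k, pct in coverage.items()}

iq_bands = empirical_intervals(100, 15)
for k, (low, high, pct) in iq_bands.items():
    print(f"within {k} SD: {low} to {high} (about {pct} % of IQs)")
```

For IQ this gives 85 to 115, 70 to 130, and 55 to 145.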

Data Types & Scale

  • Qualitative (categorical): eye color, brand, nationality (addition makes no sense).
  • Quantitative
    • Discrete: whole-number counts (siblings, cars owned).
    • Continuous: can assume any value in interval (height, MPG, weight).

Exam Preparation Highlights

  • Know relationships between mean, median, mode for symmetric vs. skewed.
  • Be able to:
    • Compute mean, median, standard deviation, 5-number summary on calculator (STAT → EDIT to enter data, then STAT → CALC → 1-Var Stats).
    • Convert frequency table to relative frequency.
    • Apply empirical rule & z-scores.
    • Determine explanatory/response variables in context.
    • Distinguish correlation from causation.
    • Identify discrete vs. continuous quantitative data.
  • Allowed: personal formula sheet (no size limit announced), calculator.
  • Concepts such as fences, IQR, box plots, skew direction will appear (“I promise it’s on there”).

Numerical & Formula Recap

  • \text{Mean}\;(\bar x)=\dfrac{\sum x}{n}.
  • \text{Sample SD}\;(s)=\sqrt{\dfrac{\sum(x-\bar x)^2}{n-1}}.
  • IQR=Q_3-Q_1.
  • \text{Fences}=Q_1-1.5(IQR),\;Q_3+1.5(IQR).
  • z=\dfrac{x-\bar x}{s}.
  • r=\dfrac{\sum\big[(x-\bar x)(y-\bar y)\big]}{(n-1)s_x s_y} (calculator finds it for you).
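
To see the mean and sample SD formulas at work, here is a small sketch that computes both by hand and compares the SD with Python's built-in statistics.stdev. The data values are made up.

```python
import math
import statistics

data = [4, 8, 6, 5, 7]  # hypothetical sample
n = len(data)

# Mean: sum of the values divided by n.
x_bar = sum(data) / n

# Sample SD: square root of (sum of squared deviations divided by n - 1).
squared_deviations = [(x - x_bar) ** 2 for x in data]
s = math.sqrt(sum(squared_deviations) / (n - 1))

print(f"mean = {x_bar}")                           # 6.0
print(f"s    = {s:.4f}")                           # 1.5811
print(f"check: {statistics.stdev(data):.4f}")      # same value from the library
```

Dividing by n - 1 rather than n is what makes this the sample standard deviation, the value a calculator typically reports as Sx.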

End of Unit 1 material; Unit 2 (Correlation/Regression) begins after the exam.