Statistics Lecture Notes – Measures, Graphs, Correlation & Exam Review

Symmetric vs. Skewed Distributions

When a distribution is symmetric
- Shape is “mirror-image” around the center.
- \text{Mean}=\text{Median} (mode usually sits there too).
- Arithmetic mean ( \bar x ) is the preferred measure of center.
- Graph of choice: Histogram (bell-shaped/symmetric quantitative display).
When a distribution is skewed
- Skewed left (long left tail): \text{Mean}<\text{Median}<\text{Mode}.
- Skewed right (long right tail): \text{Mode}<\text{Median}<\text{Mean}.
- The tail “drags” the mean in its direction.
- Preferred measure of center: Median.
- Preferred display: Box plot (visualizes 5-number summary & outliers).
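
The way a long tail drags the mean while leaving the median nearly untouched can be seen with a small numeric check. The sketch below is illustrative only: the `incomes` list is made-up, right-skewed data and is not from the lecture.

```python
import statistics

# Hypothetical right-skewed data (values in thousands); one large value forms the tail.
incomes = [28, 30, 31, 33, 35, 36, 38, 40, 45, 150]

mean = statistics.mean(incomes)      # pulled toward the long right tail
median = statistics.median(incomes)  # stays with the bulk of the data

print(f"mean   = {mean}")    # 46.6
print(f"median = {median}")  # 35.5
print("mean > median:", mean > median)
```

Here the single value 150 pulls the mean well above the median, which matches the skewed-right ordering listed above.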

Box Plots and the Five-Number Summary

Five numbers: Min, Q_1, Median (Q_2), Q_3, Max.
Box spans Q_1 to Q_3; vertical line in box = median.
“Whiskers”: extend to the smallest and largest values that are not flagged as outliers.
Fences (imaginary cut-offs for outliers)
- \text{Lower Fence}=Q_1-1.5(IQR)
- \text{Upper Fence}=Q_3+1.5(IQR)
- Any data beyond fences = outliers (often plotted as dots or asterisks).
Interquartile Range (IQR)
- IQR=Q_3-Q_1
- Describes spread of middle 50% of data.
- For skewed data, use IQR instead of standard deviation to discuss spread.
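
The five-number summary, IQR, and fences can be computed in a few lines. This is a minimal sketch under stated assumptions: the data set is made up, `five_number_summary` is a hypothetical helper, and the quartiles use one common textbook convention (median of the lower and upper halves, leaving out the overall median when n is odd). Software that interpolates quartiles may give slightly different values.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves convention."""
    xs = sorted(values)
    n = len(xs)
    half = n // 2
    lower = xs[:half]                                   # values below the median position
    upper = xs[half:] if n % 2 == 0 else xs[half + 1:]  # values above the median position
    return (xs[0], statistics.median(lower), statistics.median(xs),
            statistics.median(upper), xs[-1])

# Hypothetical data with one unusually large value.
data = [2, 4, 5, 7, 8, 9, 10, 12, 13, 30]

minimum, q1, med, q3, maximum = five_number_summary(data)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print("five-number summary:", (minimum, q1, med, q3, maximum))  # (2, 5, 8.5, 12, 30)
print("IQR:", iqr)                                              # 7
print("fences:", lower_fence, upper_fence)                      # -5.5 22.5
print("outliers:", outliers)                                    # [30]
```

With these numbers the upper whisker would stop at 13, and 30 would be plotted as a separate point.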

Histograms

Bar-like graph for quantitative data.
Good for symmetric or bell-shaped sets.
Relative frequency for each class = \frac{\text{class frequency}}{\text{total count}}.
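
Converting a frequency table to relative frequencies is a single division per class. The sketch below uses a made-up table; the class labels and counts are assumptions for illustration.

```python
# Hypothetical frequency table: class label -> class frequency.
freq = {"30-33": 4, "34-37": 10, "38-41": 16, "42-45": 10}

total = sum(freq.values())  # total count = 40
rel_freq = {cls: count / total for cls, count in freq.items()}

for cls, rf in rel_freq.items():
    print(f"{cls}: {rf:.2f}")
```

The relative frequencies always add to 1 (up to rounding), which is a quick way to check the table.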

z-Scores

Standardized distance of an individual value x from the mean:
z=\frac{x-\bar x}{s}
Interpreted as “# of standard deviations above/below the mean”.
Example (MPG data)
- x=34.2,\; \bar x=38.97,\; s=3.54 \Rightarrow z=\frac{34.2-38.97}{3.54}\approx-1.35 (about 1.35 standard deviations below the mean).
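
The MPG example can be reproduced directly from the formula. In this short sketch, `z_score` is a hypothetical helper name; the three numbers are the ones given in the example.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

z = z_score(34.2, 38.97, 3.54)
print(round(z, 2))  # -1.35
```

A negative result means the value sits below the mean.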

Univariate vs. Bivariate Data

Univariate: one variable (height, commute time, siblings, etc.).
Bivariate: two variables; study their relationship through an x–y pair.
René Descartes’ xy-plane (17th c.) fused geometry & algebra, giving us coordinate mapping & modern graphs.

Naming the Two Variables

Algebra: Independent (input x) ↔ Dependent (output y).
Computer science: Input ↔ Output.
Statistics: Explanatory (predictor) ↔ Response (outcome).
Examples
- Hours worked \to Paycheck.
- Distance from light source \to Light intensity.
- Gallons pumped \to Cost of fuel.

Correlation vs. Causation

Causation = direct cause-and-effect (drop rock → hits floor every time).
Correlation = variables “run together”; may be strong, weak, or nonexistent, but not necessarily causal (sun rises & you wake up).
Practical implications
- Low-income areas & high crime: correlated, but poverty ≠ automatic criminality.
- Smoking & lung disease: high positive correlation (evidence ultimately supported causal link).

Correlation Coefficient (r)

r measures the strength & direction of a linear relation.
Range: -1\le r\le1.
- r\approx1 → very strong positive linear fit.
- r\approx-1 → very strong negative linear fit.
- r\approx0 → little/no linear relation.
Interpretation scale used in lecture
- |r|\ge0.9 : extremely strong.
- 0.7\le|r|<0.9 : strong.
- 0.4\le|r|<0.7 : moderate.
- |r|<0.4 : weak to none.
Positive vs. Negative examples
- Positive: Cigarettes smoked ↑ ⇒ Disease risk ↑.
- Negative: Hours driven ↑ ⇒ Fuel in tank ↓.
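
A small sketch of how r might be computed by hand and then labeled with the lecture's interpretation scale. The assumptions: the hours/fuel numbers are made up to mimic the negative example above, and `pearson_r` and `strength` are hypothetical helper names. The formula is the one from these notes, with the sum of cross-deviations divided by (n-1) s_x s_y.

```python
import statistics

def pearson_r(xs, ys):
    """Sample correlation: sum of cross-deviations over (n - 1) * s_x * s_y."""
    n = len(xs)
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)  # sample SDs (n - 1)
    cross = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return cross / ((n - 1) * s_x * s_y)

def strength(r):
    """Label |r| using the interpretation scale from lecture."""
    a = abs(r)
    if a >= 0.9:
        return "extremely strong"
    if a >= 0.7:
        return "strong"
    if a >= 0.4:
        return "moderate"
    return "weak to none"

# Hypothetical data: hours driven vs. gallons of fuel left in the tank.
hours = [0, 1, 2, 3, 4]
fuel = [15.0, 12.8, 11.1, 8.9, 7.2]

r = pearson_r(hours, fuel)
direction = "negative" if r < 0 else "positive"
print(f"r = {r:.3f} ({strength(r)}, {direction})")
```

For this made-up data r comes out very close to -1, so the scale labels it an extremely strong negative linear relation.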

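Empirical Rule

For bell-shaped distributions:
- About 68% of data fall within 1 standard deviation of the mean.
- About 95% fall within 2 standard deviations.
- About 99.7% fall within 3 standard deviations.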
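
The percentages in the empirical rule (the 68-95-99.7 rule named in the exam review) can be checked against an exactly normal model. This is a minimal sketch under that assumption; real bell-shaped data will only match approximately, and `within_k_sd` is a hypothetical helper name.

```python
import math

def within_k_sd(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} SD: {within_k_sd(k):.1%}")
```

The printed proportions are roughly 68.3%, 95.4%, and 99.7%, which the rule rounds to 68, 95, and 99.7.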

Types of Data

Qualitative (categorical): eye color, brand, nationality (addition makes no sense).
Quantitative
- Discrete: whole-number counts (siblings, cars owned).
- Continuous: can assume any value in interval (height, MPG, weight).

Exam Review

Know relationships between mean, median, mode for symmetric vs. skewed.
Be able to:
- Compute mean, median, standard deviation, 5-number summary on calculator (enter data under STAT → EDIT, then run STAT → CALC → 1-Var Stats).
- Convert frequency table to relative frequency.
- Apply empirical rule & z-scores.
- Determine explanatory/response variables in context.
- Distinguish correlation from causation.
- Identify discrete vs. continuous quantitative data.
Allowed: personal formula sheet (no size limit announced), calculator.
Concepts such as fences, IQR, box plots, skew direction will appear (“I promise it’s on there”).

Key Formulas

\text{Mean}\;(\bar x)=\dfrac{\sum x}{n}.
\text{Sample SD}\;(s)=\sqrt{\dfrac{\sum(x-\bar x)^2}{n-1}}.
IQR=Q_3-Q_1.
\text{Lower Fence}=Q_1-1.5(IQR),\;\text{Upper Fence}=Q_3+1.5(IQR).
z=\dfrac{x-\bar x}{s}.
r=\dfrac{\sum\big[(x-\bar x)(y-\bar y)\big]}{(n-1)\,s_x s_y} (calculator finds it for you).
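
As a check on the mean and sample SD formulas, the sketch below computes both by hand and compares the result with Python's standard library. The data set is made up for illustration.

```python
import math
import statistics

# Hypothetical data set.
data = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(data)
x_bar = sum(data) / n                           # mean = (sum of x) / n
sum_sq = sum((x - x_bar) ** 2 for x in data)    # sum of squared deviations
s = math.sqrt(sum_sq / (n - 1))                 # sample SD divides by n - 1

print(f"mean = {x_bar}")    # 5.0
print(f"s    = {s:.3f}")    # 2.138
print("matches statistics.stdev:", math.isclose(s, statistics.stdev(data)))
```

Dividing by n instead of n-1 gives the population SD (`statistics.pstdev`), which is a different quantity from the s used in these notes.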

End of Unit 1 material; Unit 2 (Correlation/Regression) begins after the exam.