Notes on Measures of Center, Variation, Relative Standing, Z-scores, Percentiles, Boxplots, and Probability

Measures of Center

  • A measure of center is a value at the center or middle of a data set.

  • Main measures discussed: mean and median (also midrange and mode). The objective is to obtain a value that represents the center of the data and to interpret it.

Mean (or Arithmetic Mean)

  • The mean of a data set is found by adding all data values and dividing by the number of data values.

  • Formula (sample): \bar{X} = \frac{\sum X}{n}

  • Notation:

    • (\bar{X}) = mean of a sample

    • (\mu = \frac{\sum X}{N}) = mean of a population

  • Notation for sum: (\Sigma X)

  • Key properties:

    • Sample means drawn from the same population tend to vary less than other measures of center.

    • The mean uses every data value.

    • Disadvantage: a single extreme value (outlier) can substantially change the mean (not resistant).

  • Caution: Avoid the term average when referring to a measure of center; it is vague, so in statistics a specific measure (mean, median, mode, or midrange) is named instead.

  • Example: Verizon data speeds (first five): 38.5, 55.6, 22.4, 14.1, 23.1 Mbps

    • Sum = 153.7; n = 5; \bar{X} = \frac{153.7}{5} = 30.74\,\text{Mbps}

  • Summary of notation: sample mean (\bar{X}); population mean (\mu)
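The sample-mean computation above can be reproduced in a few lines of Python (a sketch, using the Verizon values from the example):

```python
# Sample mean: add all data values and divide by the number of values.
speeds = [38.5, 55.6, 22.4, 14.1, 23.1]  # first five Verizon data speeds (Mbps)

x_bar = sum(speeds) / len(speeds)        # (sum of x) / n = 153.7 / 5
print(round(x_bar, 2))                   # 30.74 Mbps
```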

Median

  • The median is the middle value when data are ordered from smallest to largest.

  • For an odd number of data values, the median is the exact middle value.

  • For an even number, the median is the mean of the two middle values.

  • Example (odd, 5 values): Verizon speeds sorted: 14.1, 22.4, 23.1, 38.5, 55.6; median = 23.1 Mbps.

  • Example (even, 6 values, including 24.5): 14.1, 22.4, 23.1, 24.5, 38.5, 55.6; median = (\frac{23.1 + 24.5}{2} = 23.80) Mbps.

  • Important properties:

    • The median is a resistant measure of center (not affected much by extreme values).

    • The median does not necessarily use every data value (e.g., a large change in the largest value may not affect the median).

  • Calculation notes:

    • Median often denoted as (\tilde{X}), M, or Med.

    • Steps: sort data; if odd, take middle value; if even, average the two middle values.
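The calculation steps above can be sketched directly (a minimal Python version, checked against both Verizon examples):

```python
def median(data):
    """Sort the data; middle value if n is odd, mean of the two middle values if even."""
    xs = sorted(data)
    n = len(xs)
    mid = n // 2
    if n % 2 == 1:
        return xs[mid]
    return (xs[mid - 1] + xs[mid]) / 2

odd_speeds = [38.5, 55.6, 22.4, 14.1, 23.1]       # median = 23.1
even_speeds = odd_speeds + [24.5]                  # median = (23.1 + 24.5) / 2 = 23.8
```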

Mode

  • The mode is the value(s) that occur with the greatest frequency.

  • A data set can have:

    • one mode, two modes (bimodal), multiple modes (multimodal), or no mode.

  • Examples:

    • Mode of {0.2, 0.3, 0.3, 0.3, 0.6, 0.6, 1.2} is 0.3 Mbps (occurs most often).

    • Data speeds with two modes: {0.3, 0.3, 0.6, 4.0, 4.0} have modes 0.3 and 4.0 Mbps.

    • Data speeds with no repetition have no mode.

  • Notes: The mode can be found with qualitative data.
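Python's standard library can reproduce the mode examples above; note one behavioral difference from the convention in these notes:

```python
# statistics.multimode returns every value tied for the greatest frequency.
from statistics import multimode

one_mode = multimode([0.2, 0.3, 0.3, 0.3, 0.6, 0.6, 1.2])   # [0.3]
two_modes = multimode([0.3, 0.3, 0.6, 4.0, 4.0])             # [0.3, 4.0]

# Caution: when no value repeats, multimode returns every value,
# whereas these notes would say the data set has no mode.
```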

Midrange

  • The midrange is the value midway between the maximum and minimum values:
    \text{Midrange} = \frac{\max X + \min X}{2}

  • Example: Verizon speeds: max = 55.6, min = 14.1; midrange = (\frac{55.6 + 14.1}{2} = 34.85) Mbps.

  • Important properties:

    • Very sensitive to extreme values; not a resistant measure.

  • Practical notes:

    • Very easy to compute.

    • Useful to illustrate there are multiple ways to define a center.

    • Can help illustrate potential confusion with the median; always define midrange when used.

Round-off Rules for Measures of Center

  • For the mean, median, and midrange, carry one more decimal place than is present in the original data.

  • For the mode, do not round (keep as is, since it is one of the original data values).

Critical Thinking and Measures of Center

  • Always consider whether computing a measure of center makes sense for the data and sampling method.

  • Some data (e.g., zip codes, ranks) are not measurements; their means/central values may be meaningless.

  • When data are summarized by a frequency distribution, the mean can be approximated using class midpoints: \bar{x} \approx \frac{\sum f_i x_i}{\sum f_i} where (f_i) are class frequencies and (x_i) are class midpoints.

  • Weighted means: when observations have different weights, the weighted mean is
    \bar{X}_w = \frac{\sum w_i x_i}{\sum w_i}
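The class-midpoint approximation can be sketched as follows (the classes, midpoints, and frequencies here are hypothetical, chosen only for illustration):

```python
# Approximate mean from a frequency distribution: sum(f_i * x_i) / sum(f_i).
midpoints   = [24.5, 34.5, 44.5, 54.5]   # class midpoints x_i (hypothetical classes)
frequencies = [10, 20, 15, 5]            # class frequencies f_i (hypothetical counts)

approx_mean = sum(f * x for f, x in zip(frequencies, midpoints)) / sum(frequencies)
print(approx_mean)  # 37.5
```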

Examples and Applications (Measures of Center)

  • Grade-Point Average (GPA) example:

    • Grades A(3), A(4), B(3), C(3), F(1) with quality points A=4, B=3, C=2, F=0.

    • Weighted mean: \bar{X} = \frac{3(4) + 4(4) + 3(3) + 3(2) + 1(0)}{3+4+3+3+1} = \frac{43}{14} \approx 3.07

  • Rounding: GPA is typically reported to two decimal places (here 3.07), though the exact rounding policy varies by institution.
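The GPA calculation above is a weighted mean with credit hours as weights; it can be checked in a few lines:

```python
# Weighted mean (GPA): weights are credit hours, values are quality points.
credits = [3, 4, 3, 3, 1]    # credit hours for grades A, A, B, C, F
points  = [4, 4, 3, 2, 0]    # quality points: A=4, B=3, C=2, F=0

gpa = sum(w * x for w, x in zip(credits, points)) / sum(credits)
print(round(gpa, 2))  # 3.07  (= 43/14)
```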

Measures of Variation

  • Variation measures describe how spread out the data are.

  • The three key measures: range, standard deviation, and variance.

  • The discussion emphasizes interpretation and understanding of variation, not just computation.

Range

  • Definition: difference between maximum and minimum values.
    \text{Range} = \max X - \min X

  • Important properties:

    • Uses only the extreme values; very sensitive to outliers; not resistant.

    • Does not reflect variation among all data values.

  • Example: Verizon speeds {38.5, 55.6, 22.4, 14.1, 23.1} => range = 55.6 - 14.1 = 41.5 Mbps.

Standard Deviation and Variance

  • Standard deviation measures how much data values deviate from the mean.

  • Notation:

    • s = sample standard deviation

    • \sigma = population standard deviation

  • Properties:

    • s is nonnegative; larger s indicates more spread.

    • Outliers can dramatically increase s.

    • s is a biased estimator of (\sigma) (tends not to center around the population value).

  • Sample standard deviation (dividing by n − 1 makes the variance s^2 an unbiased estimator of \sigma^2):
    s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n-1}}

  • Shortcut (computational) form:
    s = \sqrt{\frac{n\sum x_i^2 - (\sum x_i)^2}{n(n-1)}}

  • Example (Verizon speeds):

    • Given: (\sum x_i = 153.7), (\sum x_i^2 = 5807.79), n = 5.

    • Then: s = \sqrt{\frac{5(5807.79) - (153.7)^2}{5(4)}} = \sqrt{270.763} \approx 16.45\,\text{Mbps}

  • Range rule of thumb for estimating s: approximately s \approx \frac{\text{range}}{4}

    • For Verizon speeds, range = 41.5; estimate (s \approx 41.5/4 \approx 10.38) Mbps (actual s ≈ 16.45 Mbps; the rule gives only a crude estimate).

  • Variance:

    • Definition: variance is the square of the standard deviation.

    • Notation:

    • (s^2) = sample variance

    • (\sigma^2) = population variance

    • Notation point: units of variance are square of original units, e.g., Mbps^2.

  • Empirical Rule (for bell-shaped distributions):

    • About 68% of values lie within 1 standard deviation of the mean.

    • About 95% lie within 2 standard deviations.

    • About 99.7% lie within 3 standard deviations.

  • Chebyshev's Theorem (for any distribution):

    • For any k > 1, at least 1 − 1/k^2 of the data lie within k standard deviations of the mean.

    • Examples: for k = 2, at least 75%; for k = 3, at least 89%.
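The definitional and shortcut forms of s, the range rule of thumb, and Chebyshev's bound can all be verified with a short script (a sketch using the Verizon sample from the examples above):

```python
from math import sqrt

speeds = [38.5, 55.6, 22.4, 14.1, 23.1]
n = len(speeds)
x_bar = sum(speeds) / n

# Definitional form: s = sqrt( sum((x - x_bar)^2) / (n - 1) )
s_def = sqrt(sum((x - x_bar) ** 2 for x in speeds) / (n - 1))

# Shortcut form: s = sqrt( (n * sum(x^2) - (sum x)^2) / (n * (n - 1)) )
s_short = sqrt((n * sum(x * x for x in speeds) - sum(speeds) ** 2) / (n * (n - 1)))

# Range rule of thumb: s is roughly range / 4 (a crude estimate).
s_rough = (max(speeds) - min(speeds)) / 4

def chebyshev_bound(k):
    """Minimum proportion of data within k standard deviations of the mean (k > 1)."""
    return 1 - 1 / k ** 2
```

Both forms of s agree (about 16.45 Mbps here), while the range rule gives roughly 10.38 Mbps, illustrating how rough that estimate can be.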

Notation Summary and Additional Concepts

  • Notation: s (sample standard deviation), $s^2$ (sample variance), (\sigma) (population standard deviation), (\sigma^2) (population variance).

  • The empirical rule is a useful approximation for bell-shaped distributions; Chebyshev's theorem applies to any distribution.

  • Coefficient of Variation (CV): a standardized measure of dispersion relative to the mean.

    • Sample CV: \text{CV}_{\text{sample}} = \frac{s}{\bar{X}} \times 100\%

    • Population CV: \text{CV}_{\text{population}} = \frac{\sigma}{\mu} \times 100\%

  • Why divide by (n − 1) in the sample variance? Because with n − 1 degrees of freedom, the sample variance is an unbiased estimator of the population variance. Using n tends to underestimate the true variance.
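The sample CV formula can be applied to the Verizon sample used throughout these notes (a sketch; the data and s are the same as in the earlier examples):

```python
from math import sqrt

speeds = [38.5, 55.6, 22.4, 14.1, 23.1]
x_bar = sum(speeds) / len(speeds)
s = sqrt(sum((x - x_bar) ** 2 for x in speeds) / (len(speeds) - 1))

cv = s / x_bar * 100   # sample CV, as a percentage of the mean
print(round(cv, 1))    # about 53.5 (%)
```

Because CV is a ratio, it is unitless and can be used to compare spread across data sets measured in different units.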

Z Scores

  • A z score (standard score) shows how many standard deviations a value is from the mean.

  • Two common forms:

    • Sample: Z = \frac{X - \bar{X}}{s}

    • Population: Z = \frac{X - \mu}{\sigma}

  • Properties:

    • z scores are unitless.

    • A value below the mean has a negative z; above the mean is positive.

    • A value with z ≤ −2 or z ≥ 2 is often considered significantly low or high.

  • Rounding: round z scores to two decimal places.
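The z-score rules above can be sketched as a small helper (using the mean and standard deviation computed earlier for the Verizon sample):

```python
def z_score(x, center, spread):
    """How many standard deviations x lies from the mean; rounded to two decimals."""
    return round((x - center) / spread, 2)

# Largest Verizon speed against the sample mean and standard deviation from earlier:
z = z_score(55.6, 30.74, 16.45)      # about 1.51
significant = z <= -2 or z >= 2      # the usual cutoff for significantly low/high
```

Here z ≈ 1.51, so 55.6 Mbps is above the mean but not significantly high by the ±2 rule.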

Percentiles, Quartiles, and Boxplots

  • Percentiles: values that divide the data into 100 equal parts.

    • Notation: P_k denotes the k-th percentile.

    • Example: P_{40} divides data into bottom 40% and top 60%.

  • How to find the percentile of a value x:

    • Percentile of x = (number of values < x) / n × 100.

  • Converting a percentile to a data value:

    • Let L = (k/100) × n.

    • If L is an integer, the percentile value is the average of the L-th and (L+1)-th values in the ordered data.

    • If L is not an integer, the percentile value is the value at position ⌈L⌉ in the ordered data.

  • Quartiles:

    • Q1 = P25, Q2 = P50 (the median), Q3 = P75.

    • Interquartile Range (IQR) = Q3 − Q1.

    • Semi-interquartile range = (Q3 − Q1)/2.

    • Midquartile range = (Q3 + Q1)/2.

    • 10–90 percentile range = P_{90} − P_{10}.

  • 5-Number Summary: minimum, Q1, Q2 (median), Q3, maximum.

  • Boxplot construction (uses the 5-number summary):
    1) Draw a line from min to max.
    2) Draw a box from Q1 to Q3.
    3) Put a line inside the box at the median Q2.

  • Example: Verizon airport speeds (sorted data) yield min = 0.8, Q1 = 7.9, Q2 = 13.9, Q3 = 21.5, max = 77.8 Mbps, so the 5-number summary is 0.8, 7.9, 13.9, 21.5, 77.8.

  • Boxplot interpretation and skewness:

    • Boxplots can reveal skewness in distributions.

  • Outliers in boxplots (modified boxplots):

    • Outlier rule: a value is an outlier if it lies below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR.

    • Modified boxplots show outliers with a special symbol and whiskers stop at the most extreme non-outlier values.
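The percentile-to-value conversion and the 1.5 × IQR outlier rule can be sketched together. The quartiles passed to `outlier_fences` are the airport-speed values from the example above; the small data sets used with `percentile_value` are illustrative:

```python
import math

def percentile_value(data, k):
    """Value of the k-th percentile (0 < k < 100) using the locator L = (k/100) * n.

    If L is a whole number, average the Lth and (L+1)th ordered values;
    otherwise round L up and take that (1-based) position.
    """
    xs = sorted(data)
    L = k / 100 * len(xs)
    if L == int(L):
        i = int(L)
        return (xs[i - 1] + xs[i]) / 2
    return xs[math.ceil(L) - 1]

def outlier_fences(q1, q3):
    """Modified-boxplot rule: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are outliers."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Airport quartiles from the example: Q1 = 7.9, Q3 = 21.5, so IQR = 13.6.
low, high = outlier_fences(7.9, 21.5)   # fences at -12.5 and 41.9; max 77.8 is an outlier
```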

Probability (Basics, Rules, and Applications)

  • A probability is a number between 0 and 1, inclusive, describing how likely an event is to occur.

  • Key concepts:

    • Event: a set of outcomes.

    • Simple event: an outcome that cannot be broken down further.

    • Sample space: all possible simple events (outcomes).

  • Three common approaches to finding probabilities:

    • Relative frequency (observed frequencies from repeated trials or simulations).

    • Classical approach (equally likely outcomes): If there are n equally likely simple events and A can occur in s ways, then P(A) = \frac{s}{n}.

    • Subjective probabilities: based on knowledge or belief when data are sparse.

  • Law of Large Numbers: as a procedure is repeated many times, the relative frequency probability tends to the true probability.

    • Cautions: LLN applies to long-run behavior, not a single outcome; do not assume equal likelihood without justification.

  • Relative frequency example: Skydiving

    • 3,000,000 jumps, 21 deaths, so P(\text{death}) = \frac{21}{3{,}000{,}000} = 7 \times 10^{-6} = 0.000007.

Addition Rule

  • For P(A or B): the probability that either A or B occurs (or both).

  • Intuitive approach: add the number of ways A can occur to the number of ways B can occur, counting each outcome only once; divide by the total number of outcomes in the sample space.

  • Formal rule (not necessarily disjoint):
    P(A \text{ or } B) = P(A) + P(B) - P(A \text{ and } B)

  • Disjoint events (mutually exclusive): if A and B cannot occur together, then
    P(A \text{ or } B) = P(A) + P(B).

  • Example: selecting a male and selecting a female in a single trial are disjoint events; one person cannot be counted as both.

  • Summary:

    • “Or” corresponds to addition, but avoid double counting.

Complementary Events

  • The complement of A, denoted \bar{A}, consists of all outcomes in which A does not occur.

  • Rule: P(A) + P(\bar{A}) = 1 \Rightarrow P(\bar{A}) = 1 - P(A).

  • Example: P(not sleepwalked) = 1 − P(sleepwalked).

Multiplication Rule

  • Intuitive: to get A and B in successive trials, multiply the probabilities, taking into account conditional probability.

  • Formal: P(A \text{ and } B) = P(A) \cdot P(B|A).

  • If A and B are independent, then P(A \text{ and } B) = P(A)P(B).

  • Examples:

    • Drug screening with replacement (independent): P(positive then negative) = P(positive) × P(negative).

    • Without replacement (dependent): adjust the second probability conditional on the first outcome.

  • 5% guideline for cumbersome calculations: when sampling without replacement and the sample size is no more than 5% of the population, you can treat selections as independent to simplify calculations.

  • Redundancy example (Airbus 310): with three independent hydraulic systems, probability that all fail is extremely small: (0.002^3 = 8 \times 10^{-9}) and probability that at least one works is 1 minus that value.
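The redundancy calculation is a direct application of the multiplication rule for independent events (a quick check in Python):

```python
# Three independent hydraulic systems, each failing with probability 0.002:
p_fail = 0.002
p_all_fail = p_fail ** 3               # multiplication rule for independent events
p_at_least_one_works = 1 - p_all_fail  # complement of "all three fail"
```

With all three systems failing only eight times in a billion, the redundancy makes total hydraulic failure extremely unlikely.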

Conditional Probability

  • Definition: the probability of an event given that another event has already occurred.

  • Notation: P(B|A) denotes the probability of B given A.

  • Intuitive approach: condition on A occurring, then compute B within that scenario.

  • Formal definition: P(B|A) = \frac{P(A \text{ and } B)}{P(A)}.

  • Example: Pre-Employment Drug Screening

    • Using a 2×2 table of test results by actual drug use, find:

    • P(positive test result | subject uses drugs) = \frac{45}{50} = 0.900.

    • P(subject uses drugs | positive test result) = \frac{45}{70} = 0.643.

    • Note: P(B|A) ≠ P(A|B) in general; swapping the two is the error known as confusion of the inverse.

Bayes’ Theorem

  • Bayes’ theorem links prior and posterior probabilities:
    P(B|A) = \frac{P(A|B) \cdot P(B)}{P(A)}.

  • Example: Interpreting Medical Test Results (cancer example)

    • Population prevalence P(C) = 0.01 (1%).

    • Test characteristics: false positive rate P(positive|no cancer) = 0.10; true positive rate P(positive|cancer) = 0.80.

    • For 1000 subjects: expected with cancer = 10; among these, 8 test positive (true positives).

    • Among the 990 without cancer, 99 test positive (false positives).

    • Total positives = 8 + 99 = 107; thus P(C | positive) = 8/107 ≈ 0.0748 (7.48%).

    • Interpretation: A positive test increases the probability of cancer from 1% to about 7.48%; not definitive.

  • Prior vs posterior probability concepts:

    • Prior probability P(C) is the initial probability before new information.

    • Posterior probability P(C | positive) is revised using Bayes’ rule after new information (positive test result).
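The cancer-screening arithmetic above can be reproduced with Bayes' theorem and the law of total probability (a sketch using the prevalence and test rates from the example):

```python
# Posterior P(cancer | positive) for the screening example above.
p_c = 0.01                 # prior: prevalence of cancer
p_pos_given_c = 0.80       # true positive rate
p_pos_given_not_c = 0.10   # false positive rate

# Total probability of a positive test, then Bayes' theorem:
p_pos = p_pos_given_c * p_c + p_pos_given_not_c * (1 - p_c)   # 0.008 + 0.099 = 0.107
posterior = p_pos_given_c * p_c / p_pos
print(round(posterior, 4))  # 0.0748
```

This matches the counting argument: of 1000 subjects, 8 true positives out of 107 total positives, i.e. 8/107 ≈ 0.0748.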

At Least One

  • When finding the probability that something occurs at least once in a series of trials, use the complement:
    P(\text{at least one occurrence}) = 1 - P(\text{no occurrences})

Practical Considerations and Examples

  • Example: Accidental iPad damage

    • If 6% of damaged iPads are damaged in bags/backpacks, and 20 damaged iPads are sampled, probability of at least one damaged in a bag/backpack is:

    • Calculation uses 1 − (probability none are bag/backpack) = 1 − (0.94)^{20} ≈ 0.710.

    • Interpretation: The probability is not very high; to be reasonably sure, more than 20 damaged iPads would be needed.
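The iPad calculation follows the complement pattern above and is easy to check:

```python
# P(at least one of 20 damaged iPads was damaged in a bag/backpack), p = 0.06 each.
p = 0.06
n = 20
p_none = (1 - p) ** n              # all 20 were damaged elsewhere: 0.94^20
p_at_least_one = 1 - p_none
print(round(p_at_least_one, 3))    # about 0.71
```

Raising n shows why a larger sample is needed to be reasonably sure of seeing at least one such iPad.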

Additional Probability Concepts

  • Odds vs Probability:

    • Odds against an event A: P(A^c)/P(A); odds in favor: P(A)/P(A^c).

    • Payoff odds vs actual odds: examples in roulette illustrate differences between true probabilities and casino payoffs.

  • Significance heuristics (rare event rule): If under a given assumption the probability of an observed event is very small and that event occurs, the assumption may be incorrect.

  • Dependence and independence in sampling:

    • With replacement: independent events.

    • Without replacement: dependent events (the outcome affects subsequent probabilities).

    • 5% guideline: for small samples relative to population, independence is a reasonable approximation.

Summary of Key Formulas (Quick Reference)

  • Mean (sample): \bar{X} = \frac{\sum X}{n}

  • Mean (population): \mu = \frac{\sum X}{N}

  • Median (odd n): middle value after ordering; (even n): \frac{x_{(n/2)} + x_{(n/2+1)}}{2}

  • Range: \text{Range} = \max X - \min X

  • Midrange: \text{Midrange} = \frac{\max X + \min X}{2}

  • Mode: value(s) with greatest frequency

  • Standard deviation (sample): s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n-1}}

  • Standard deviation (population): \sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}}

  • Variance (sample): s^2 = \frac{\sum (x_i - \bar{x})^2}{n-1}

  • Variance (population): \sigma^2 = \frac{\sum (x_i - \mu)^2}{N}

  • Shortcut for sample standard deviation: s = \sqrt{\frac{n\sum x_i^2 - (\sum x_i)^2}{n(n-1)}}

  • Coefficient of variation (sample): \text{CV}_{\text{sample}} = \frac{s}{\bar{X}} \times 100\%

  • Z score (sample): Z = \frac{X - \bar{X}}{s}

  • Z score (population): Z = \frac{X - \mu}{\sigma}

  • Quartiles: Q1 = P_{25}, Q2 = P_{50}, Q3 = P_{75}; IQR = Q3 - Q1

  • Boxplot construction: uses the 5-number summary (min, Q1, Q2, Q3, max)

  • Percentile of value x: P(x) = \frac{\#\{x_i < x\}}{n} \times 100

  • Percentile to data value conversion: let (L = \frac{k}{100} n); if L is integer, percentile is the average of the Lth and (L+1)th values; otherwise it is the value at position (\lceil L \rceil).

  • P(A or B) (General): P(A \text{ or } B) = P(A) + P(B) - P(A \text{ and } B)

  • P(A or B) when A and B are disjoint: P(A \text{ or } B) = P(A) + P(B)

  • P(A and B) (General): P(A \text{ and } B) = P(A) \cdot P(B|A)

  • Bayes’ Theorem: P(B|A) = \frac{P(A|B) \cdot P(B)}{P(A)}

  • Probability of at least one in n trials: P(\text{at least one}) = 1 - P(\text{none})