Chapter 3 Notes: Describing, Exploring, and Comparing Data

Measures of Center (3-1)
  • Center: representative average value indicating the middle of the data.

  • Variation: measure of how values vary.

  • Key measures:

    • Mean (arithmetic mean): sum of all values divided by the number of values.

    • Median: middle value of the ordered data; resistant to extreme values.

    • Mode: most frequent value; can be non-unique or absent. It is the only measure of center that can be used with nominal data.

    • Midrange: midpoint between max and min, defined as $\text{Midrange} = \frac{\text{max} + \text{min}}{2}$.

  • Notation:

    • $\sum$: sum; $x$: data values; $n$: sample size; $N$: population size.

    • Population mean: $\mu = \frac{\sum x}{N}$.

    • Sample mean: $\bar{x} = \frac{\sum x}{n}$.

  • Mean from a frequency distribution: $\bar{x} = \frac{\sum (f \cdot x)}{\sum f}$.

  • Round-off Rule: carry one more decimal place than original data; round only the final answer.
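
A minimal Python sketch of the measures of center listed in this section. The data set and the frequency table are hypothetical, chosen only for illustration; `multimode` needs Python 3.8 or later.

```python
from statistics import mean, median, multimode

# Hypothetical sample, chosen only for illustration.
data = [2, 5, 5, 8, 10, 12, 25]

sample_mean = mean(data)                 # sum of all values / number of values
sample_median = median(data)             # middle value of the sorted data: 8
modes = multimode(data)                  # every most-frequent value: [5]
midrange = (max(data) + min(data)) / 2   # (25 + 2) / 2 = 13.5

# Note: when no value repeats, multimode returns every value;
# under the definition in these notes that case counts as "no mode".

# Mean from a frequency distribution: sum(f * x) / sum(f).
# Keys are the values (or class midpoints), values are the frequencies f.
freq = {2: 1, 5: 2, 8: 1, 10: 1, 12: 1, 25: 1}
freq_mean = sum(f * x for x, f in freq.items()) / sum(freq.values())

# Round-off rule: one more decimal place than the data, applied only at the end.
print(round(sample_mean, 1), sample_median, modes, midrange, round(freq_mean, 1))
```

The frequency table here describes the same seven values as `data`, so `freq_mean` should agree with `sample_mean`.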


Measures of Variation (3-2)
  • Key concept: Variation quantifies how values differ; standard deviation interpretation is essential.

  • Range: difference between maximum and minimum values, $\text{Range} = \max{x} - \min{x}$.

  • Standard deviation (sample): measure of variation about the mean.

    • Formula: $s = \sqrt{ \frac{\sum (x - \bar{x})^2}{n-1} }$.

    • Shortcut formula: $s = \sqrt{ \frac{ n\sum x^2 - (\sum x)^2 }{ n(n-1) } }$.

  • Properties of $s$:

    • Measures variation of all values from the mean.

    • Never negative, and zero only when all values are equal; can increase sharply when outliers are present.

    • Units are the same as the original data.

  • Range Rule of Thumb: estimate standard deviation as $s \approx \frac{\text{Range}}{4}$.

  • Variance:

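    • Population variance: $\sigma^2$ (square of the population standard deviation $\sigma$).

    • Sample variance: $s^2$ (square of the sample standard deviation $s$).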

  • Round-off Rule for Variation: carry one more decimal place than in original data; round only the final answer.

  • Interpreting a known standard deviation (range rule of thumb): values between the two limits below are considered usual.

    • Minimum usual value $\approx \bar{x} - 2s$; maximum usual value $\approx \bar{x} + 2s$.

  • Empirical (68-95-99.7) Rule for bell-shaped distributions:

    • ~68% of values fall within $1s$ of the mean.

    • ~95% within $2s$.

    • ~99.7% within $3s$.
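
The formulas in this section can be checked against each other with a short Python sketch. The data set is hypothetical; `statistics.stdev` from the standard library is used only as an independent check on the two hand-written formulas.

```python
import math
from statistics import stdev

# Hypothetical sample, chosen only for illustration.
data = [2, 5, 5, 8, 10, 12, 25]
n = len(data)
x_bar = sum(data) / n

# Definition: s = sqrt( sum((x - x_bar)^2) / (n - 1) )
s_def = math.sqrt(sum((x - x_bar) ** 2 for x in data) / (n - 1))

# Shortcut: s = sqrt( (n * sum(x^2) - (sum(x))^2) / (n * (n - 1)) )
sum_x = sum(data)
sum_x2 = sum(x * x for x in data)
s_short = math.sqrt((n * sum_x2 - sum_x ** 2) / (n * (n - 1)))

# Sample variance is the square of the sample standard deviation.
variance = s_def ** 2

# Range rule of thumb: a rough estimate of s from the range alone.
s_estimate = (max(data) - min(data)) / 4   # (25 - 2) / 4 = 5.75

# Limits separating usual from unusual values, using the computed s.
min_usual = x_bar - 2 * s_def
max_usual = x_bar + 2 * s_def

print(round(s_def, 1), round(s_short, 1), round(stdev(data), 1), s_estimate)
```

The definition and the shortcut are algebraically equivalent, so `s_def` and `s_short` should differ only by floating-point rounding. The range rule of thumb is a coarse estimate and can be far from the computed value when the data contain an outlier, as this sample does.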


Measures of Relative Standing and Boxplots (3-3)
  • Key concept: Compare values across/within data sets; z-score is central; identify outliers.

  • z-score (standardized value): number of standard deviations a value is above or below the mean.

    • Population: $z = \frac{x - \mu}{\sigma}$.

    • Sample: $z = \frac{x - \bar{x}}{s}$.

    • Interpretation: round z-scores to 2 decimal places; values with $|z| > 2$ are considered unusual.

  • Five-number summary: $\min$, $Q_1$, median ($Q_2$), $Q_3$, $\max$.

  • Quartiles:

    • $Q_1$: separates bottom 25%.

    • $Q_2$ (median): separates bottom 50%.

    • $Q_3$: separates bottom 75%.

  • Boxplot (box-and-whisker diagram): shows min, $Q_1$, median, $Q_3$, max; box spans $Q_1$ to $Q_3$, line at median; whiskers extend to min and max.

  • Outliers in boxplots: values far from most data can distort interpretation.

  • Modified boxplots: outliers shown as special points; a value is an outlier if it is $> Q_3 + 1.5 \times IQR$ or $< Q_1 - 1.5 \times IQR$, where $IQR = Q_3 - Q_1$.

  • Boxplot examples: distributions can be normal, uniform, or skewed; boxplots illustrate shape and spread.

  • Percentiles ($P_1$ to $P_{99}$): divide data into 100 groups; $P_k$ is the value below which $k\%$ of observations fall.

    • Finding the percentile of a value $x$: percentile of $x$ $= 100 \times \frac{\text{number of values less than } x}{n}$.

    • Locator method for $P_k$: sort the data and compute $L = \frac{n k}{100}$.

      • If $L$ is a whole number, $P_k$ is the average of the $L$-th value and the next value: $\frac{x_{L} + x_{L+1}}{2}$.

      • If $L$ is not a whole number, round $L$ up to the next whole number; $P_k$ is the value at that position, $x_{\lceil L \rceil}$.

  • Summary: boxplots and percentiles provide quick visual and numeric summaries of distribution shape, spread, and relative standing.
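
A hedged Python sketch of the relative-standing tools in this section. The function names and data are hypothetical. The locator convention is an assumption, chosen so that $P_{50}$ agrees with the usual median rule: when $L = nk/100$ is a whole number, average the $L$-th value and the next one; otherwise round $L$ up and take the value at that position. Quartiles are taken as $P_{25}$, $P_{50}$, and $P_{75}$; statistical software often uses other quartile conventions and can give slightly different results.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above or below the mean."""
    return (x - mean) / sd

def percentile_of(x, data):
    """Percentile of the value x: 100 * (number of values less than x) / n."""
    return 100 * sum(1 for v in data if v < x) / len(data)

def percentile_value(data, k):
    """P_k by the locator method, for an integer k from 1 to 99."""
    values = sorted(data)
    # L = n*k/100, split exactly into a whole part and a remainder.
    whole, remainder = divmod(len(values) * k, 100)
    if remainder == 0:
        # L is a whole number: average the L-th and (L+1)-th values (1-based).
        return (values[whole - 1] + values[whole]) / 2
    # L is not a whole number: round up to position whole + 1 (1-based).
    return values[whole]

# Hypothetical sample, chosen only for illustration.
data = [2, 4, 5, 7, 8, 10, 12, 25]

q1 = percentile_value(data, 25)   # 4.5
q2 = percentile_value(data, 50)   # 7.5, the median
q3 = percentile_value(data, 75)   # 11.0
five_number_summary = (min(data), q1, q2, q3, max(data))

# Modified-boxplot outlier criterion.
iqr = q3 - q1
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < low_fence or x > high_fence]   # [25]

print(five_number_summary, outliers, percentile_of(8, data))
print(round(z_score(25, 10, 5), 2))   # hypothetical mean 10 and sd 5
```

Integer arithmetic (`divmod`) is used to test whether $L$ is a whole number so the check is not affected by floating-point rounding.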

The "mu" symbol $\mu$ represents the population mean. It is calculated by summing all data values ($\sum x$) and dividing by the population size ($N$). In contrast, the sample mean is denoted by $\bar{x}$ (x-bar).

m = median

$\bar{x}$ (x-bar) = sample mean

If the number of data values is even, the median is the average of the two middle values in the sorted data set. If the number of values is odd, the median is simply the middle value of the sorted data set. Either way, it is a measure of central tendency that is less affected by outliers than the mean. The mode is another measure of central tendency: the value that appears most frequently in a data set, which makes it useful for identifying the most common value or category.
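
The rule for finding the median can be sketched as a small Python function. The function name and the sample values are hypothetical; the standard library's `statistics.median` does the same job.

```python
def median_of(values):
    """Median of a non-empty collection of numbers."""
    if not values:
        raise ValueError("median of an empty data set is undefined")
    ordered = sorted(values)          # the data must be sorted first
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]           # odd count: the single middle value
    # even count: average of the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_case = median_of([2, 25, 5, 8, 10])        # sorted: 2 5 8 10 25 -> 8
even_case = median_of([2, 25, 5, 8, 10, 12])   # sorted: 2 5 8 10 12 25 -> 9.0
print(odd_case, even_case)
```

In both cases the extreme value 25 does not pull the median upward, which is why the median is described as resistant to outliers.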

Midrange = (2 + 25)/2 = 13.5, where 2 is the minimum value and 25 is the maximum value in the data set.

Range = 25 - 2 = 23