Chapter 3 Notes: Describing, Exploring, and Comparing Data

Measures of Center (3-1)

Center: representative average value indicating the middle of the data.
Variation: measure of how much the values differ from one another.
Key measures:
- Mean (arithmetic mean): sum of all values divided by the number of values.
- Median: middle value of ordered data; resistant to extreme values.
- Mode: most frequent value; can be non-unique (two or more modes) or absent (no value repeats). It is the only measure of center that can be used with nominal data.
- Midrange: midpoint between max and min, defined as $\text{Midrange} = \frac{\text{max} + \text{min}}{2}$ .
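The four measures above can be sketched with Python's standard `statistics` module. The data set is hypothetical, chosen only for illustration, and `midrange` is a small helper written here because the module has no such function.

```python
from statistics import mean, median, multimode


def midrange(values):
    """Midpoint between the largest and smallest value: (max + min) / 2."""
    return (max(values) + min(values)) / 2


# Hypothetical sample, used only to illustrate the definitions.
data = [2, 4, 4, 7, 9, 25]

center = {
    "mean": mean(data),          # sum of all values / number of values
    "median": median(data),      # middle of the sorted values
    "modes": multimode(data),    # list of the most frequent value(s)
    "midrange": midrange(data),  # (max + min) / 2
}
print(center)
```

Note that `multimode` returns every value when none repeats, a case these notes would describe as having no mode.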
Notation:
- $\sum$ : sum; $x$ : data values; $n$ : sample size; $N$ : population size.
- Population mean: $\mu = \frac{\sum x}{N}$ .
- Sample mean: $\bar{x} = \frac{\sum x}{n}$ .
Mean from a frequency distribution: $\bar{x} = \frac{\sum (f \cdot x)}{\sum f}$ .
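A minimal sketch of the frequency-distribution formula, assuming the table is given as `(x, f)` pairs. The function name and the table are hypothetical; for grouped data, `x` would presumably be the class midpoint, which the notes do not state.

```python
def mean_from_frequencies(table):
    """Mean of a frequency distribution given as (x, f) pairs."""
    total_fx = sum(f * x for x, f in table)  # sum of f * x
    total_f = sum(f for _, f in table)       # sum of f (the sample size)
    return total_fx / total_f


# Hypothetical table: the value 1 occurs 3 times, 2 occurs 5 times, 3 occurs twice.
table = [(1, 3), (2, 5), (3, 2)]
freq_mean = mean_from_frequencies(table)
print(freq_mean)
```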
Round-off Rule: carry one more decimal place than original data; round only the final answer.

Measures of Variation (3-2)

Key concept: Variation quantifies how values differ; standard deviation interpretation is essential.
Range: difference between maximum and minimum values, $\text{Range} = \max{x} - \min{x}$ .
Standard deviation (sample): measure of variation about the mean.
- Formula: $s = \sqrt{ \frac{\sum (x - \bar{x})^2}{n-1} }$ .
- Shortcut formula: $s = \sqrt{ \frac{ n\sum x^2 - (\sum x)^2 }{ n(n-1) } }$ .
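Both formulas give the same result, which the following sketch checks on a small hypothetical sample. The function names are illustrative; `statistics.stdev` from the standard library is included only as a cross-check.

```python
from math import sqrt
from statistics import stdev


def sample_sd(values):
    """Definition formula: sqrt( sum((x - x_bar)^2) / (n - 1) )."""
    n = len(values)
    x_bar = sum(values) / n
    return sqrt(sum((x - x_bar) ** 2 for x in values) / (n - 1))


def sample_sd_shortcut(values):
    """Shortcut formula: sqrt( (n * sum(x^2) - (sum x)^2) / (n * (n - 1)) )."""
    n = len(values)
    sum_x = sum(values)
    sum_x2 = sum(x * x for x in values)
    return sqrt((n * sum_x2 - sum_x ** 2) / (n * (n - 1)))


# Hypothetical sample: mean 6, squared deviations 25 + 9 + 64 = 98, 98 / 2 = 49.
data = [1, 3, 14]
print(sample_sd(data), sample_sd_shortcut(data), stdev(data))
```

The shortcut form avoids computing the mean first, which is convenient when only the running totals of $x$ and $x^2$ are available.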
Properties of $s$ :
- Measures variation of all values from the mean.
- Never negative; equals zero only when all values are the same; can increase dramatically with outliers.
- Units are the same as the original data.
Range Rule of Thumb: estimate standard deviation as $s \approx \frac{\text{Range}}{4}$ .
Variance:
- Population variance: $\sigma^2$
- Sample variance: $s^2$
Round-off Rule for Variation: carry one more decimal place than in original data; round only the final answer.
Usual values (range rule of thumb, when $s$ is known):
- Minimum usual $\approx \bar{x} - 2s$ , maximum usual $\approx \bar{x} + 2s$ .
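The range rule of thumb works in two directions: estimating $s$ from the range, and finding the limits of usual values from $\bar{x}$ and $s$. A short sketch of both, with hypothetical function names and inputs:

```python
def estimate_sd_from_range(minimum, maximum):
    """Rough estimate of s: range / 4."""
    return (maximum - minimum) / 4


def usual_limits(x_bar, s):
    """(minimum usual, maximum usual) = (x_bar - 2s, x_bar + 2s)."""
    return x_bar - 2 * s, x_bar + 2 * s


# Direction 1: data with minimum 2 and maximum 25 (range 23).
rough_s = estimate_sd_from_range(2, 25)

# Direction 2: hypothetical mean 100 and standard deviation 15.
low, high = usual_limits(100, 15)

print(rough_s, low, high)
```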
Empirical (68-95-99.7) Rule for bell-shaped distributions:
- ~68% of values fall within $1s$ of the mean.
- ~95% within $2s$ .
- ~99.7% within $3s$ .
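For an exactly normal distribution these percentages can be checked numerically. The sketch below uses `statistics.NormalDist` from the standard library; `share_within` is a hypothetical helper name. Real bell-shaped data will only approximate these figures.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
std_normal = NormalDist(mu=0, sigma=1)


def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {share_within(k):.4f}")
```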

Measures of Relative Standing and Boxplots (3-3)

Key concept: Compare values across/within data sets; z-score is central; identify outliers.
z-score (standardized value): number of standard deviations a value is above or below the mean.
- Population: $z = \frac{x - \mu}{\sigma}$ .
- Sample: $z = \frac{x - \bar{x}}{s}$ .
- Interpretation: round z-scores to 2 decimal places; values with $|z| > 2$ (that is, $z < -2$ or $z > 2$ ) are unusual.
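A minimal sketch of the z-score calculation and the "unusual" test. The function names and the numbers (mean 100, standard deviation 15) are hypothetical; the same formula applies whether the mean and standard deviation come from a population or a sample.

```python
def z_score(x, mean, sd):
    """Number of standard deviations x lies above (+) or below (-) the mean,
    rounded to 2 decimal places."""
    return round((x - mean) / sd, 2)


def is_unusual_z(z):
    """A value is unusual when its z-score is below -2 or above 2."""
    return abs(z) > 2


z_a = z_score(135, 100, 15)  # 35 / 15 = 2.333...
z_b = z_score(85, 100, 15)   # -15 / 15 = -1
print(z_a, is_unusual_z(z_a))
print(z_b, is_unusual_z(z_b))
```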
Five-number summary: $\min, Q_1, \text{median } (Q_2), Q_3, \max$ .
Quartiles:
- $Q_1$ : separates bottom 25%.
- $Q_2$ (median): separates bottom 50%.
- $Q_3$ : separates bottom 75%.
Boxplot (box-and-whisker diagram): shows min, $Q_1$ , median, $Q_3$ , max; box spans $Q_1$ to $Q_3$ , line at median; whiskers extend to min and max.
Outliers in boxplots: values far from most data can distort interpretation.
Modified boxplots: outliers shown as special points; a value is an outlier if it is greater than $Q_3 + 1.5 \times IQR$ or less than $Q_1 - 1.5 \times IQR$ , where $IQR = Q_3 - Q_1$ .
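The outlier criterion can be sketched as follows. The quartiles are passed in as arguments rather than computed, because different textbooks and software use slightly different quartile conventions; the function names, the data, and the quartile values are hypothetical.

```python
def outlier_fences(q1, q3):
    """Lower and upper cutoffs: Q1 - 1.5 * IQR and Q3 + 1.5 * IQR."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def find_outliers(values, q1, q3):
    """Values strictly below the lower fence or strictly above the upper fence."""
    low, high = outlier_fences(q1, q3)
    return [x for x in values if x < low or x > high]


# Hypothetical data with assumed quartiles Q1 = 4 and Q3 = 9 (IQR = 5).
data = [2, 4, 4, 7, 9, 25]
fences = outlier_fences(4, 9)
outliers = find_outliers(data, 4, 9)
print(fences, outliers)
```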
Boxplot examples: distributions can be normal, uniform, or skewed; boxplots illustrate shape and spread.
Percentiles ( $P_1$ to $P_{99}$ ): divide data into 100 groups; $P_k$ is the value below which $k\%$ of observations fall.
- Percentile of a value $x$ : $100 \times \frac{\text{number of values less than } x}{n}$ .
- Locator method for $P_k$ : sort the data and compute $L = \frac{k}{100} \cdot n$ .
 - If $L$ is a whole number, $P_k$ is the average of the $L$ -th value and the next value: $\frac{x_{L} + x_{L+1}}{2}$ . (This makes $P_{50}$ agree with the median when $n$ is even.)
 - If $L$ is not a whole number, round $L$ up to the next whole number; $P_k$ is the $L$ -th value.
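A sketch of both percentile conversions. The locator lookup assumes the convention under which $P_{50}$ agrees with the median: a whole-number $L$ means averaging the $L$-th and $(L+1)$-th sorted values, and a fractional $L$ is rounded up. Other textbooks and software use different conventions, so results may differ slightly. The function names and data are hypothetical.

```python
def percentile_of_value(values, x):
    """Percentile rank of x: 100 * (number of values less than x) / n."""
    below = sum(1 for v in values if v < x)
    return 100 * below / len(values)


def value_at_percentile(values, k):
    """Locator method for P_k, with k a whole number from 1 to 99."""
    if not 1 <= k <= 99:
        raise ValueError("k must be between 1 and 99")
    ordered = sorted(values)
    # L = n * k / 100, kept as quotient and remainder to avoid float round-off.
    whole, remainder = divmod(len(ordered) * k, 100)
    if remainder == 0:
        # L is a whole number: average the L-th and (L+1)-th values (1-based).
        return (ordered[whole - 1] + ordered[whole]) / 2
    # L is not a whole number: round up and take that value (1-based).
    return ordered[whole]


data = [2, 4, 4, 7, 9, 25]
print(percentile_of_value(data, 9))   # 4 of the 6 values are below 9
print(value_at_percentile(data, 25))  # L = 1.5, round up to the 2nd value
print(value_at_percentile(data, 50))  # L = 3, average of the 3rd and 4th values
print(value_at_percentile(data, 75))  # L = 4.5, round up to the 5th value
```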
Summary: boxplots and percentiles provide quick visual and numeric summaries of distribution shape, spread, and relative standing.

The "mu" symbol $\mu$ represents the population mean. This is calculated by summing all data values ( $\sum x$ ) and dividing by the population size ( $N$ ). In contrast, the sample mean is denoted by $\bar{x}$ (x-bar).

m = median

$\bar{x}$ (x-bar) = sample mean

If the number of values is even, the median is the average of the two middle numbers in the sorted data set. If the number of values is odd, the median is simply the middle number of the sorted data set. Either way, the median is a measure of central tendency that is less affected by outliers than the mean. The mode is another measure of central tendency: the value that appears most frequently in a data set, which is particularly useful for identifying the most common value or category.
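The even and odd cases can be written out by hand, as in this sketch. The function name and data are hypothetical; in practice `statistics.median` from the standard library does the same job.

```python
def median_by_hand(values):
    """Median of a data set, handling even and odd counts separately."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: the single middle value.
        return ordered[mid]
    # Even count: average of the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_case = median_by_hand([25, 2, 9, 4, 7])      # sorted: 2, 4, 7, 9, 25
even_case = median_by_hand([2, 4, 4, 7, 9, 25])  # two middle values: 4 and 7
# Replacing the maximum with an extreme value leaves the median unchanged.
with_outlier = median_by_hand([2, 4, 4, 7, 9, 2500])
print(odd_case, even_case, with_outlier)
```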

Midrange = $\frac{2 + 25}{2} = 13.5$ , where 2 is the minimum value and 25 is the maximum value in the data set.

Range = $25 - 2 = 23$ for the same data set.