Descriptive Statistics: Percentiles, Quartiles, Range, IQR, and Variance (Excel)

Percentiles and Their Interpretation

  • Percentiles divide a dataset into 100 equal parts; each percentile represents a specific data point.

  • The kth percentile is the value below which k percent of the data falls. Equivalently, the kth percentile is a value such that at least k percent of the items are
    <= that value and at least 100−k percent of the items are >= that value.

  • Interpretation example: the 25th percentile (P25) means that 25% of the data values are ≤ P25 and 75% are ≥ P25.

  • When data are sorted from smallest to largest, the percentile indicates the position of a data value in that ordered list.

  • Common mistake to avoid: percentile is not the percent of questions correct on a test. It is a location of your score within the distribution of all scores.

  • SAT example from the transcript:

    • If a student scores 1000 and has an 80th percentile, it means that 80% of test-takers scored 1000 or less, and 20% scored 1000 or higher.

  • Percentile values can be any number from 0 to 100 (inclusive).

  • The percentile value is the data value at that position after sorting (not a percentage).

  • In practice, you are often asked for a percentile like the 25th, 50th, or 75th, which correspond to quartiles as well (see Quartiles section).

Using Excel to Compute Percentiles

  • Built-in Excel function: percentile.EXC (uses the same logic as the textbook definition but is easier to use).

  • Arguments in the dialog:

    • Array: the data range (the dataset).

    • k: the percentile you are looking for, entered as a decimal between 0 and 1 (not as a percent).

  • Example inputs:

    • For the 40th percentile, enter 0.40 as k.

    • For the 35th percentile, enter 0.35 as k.

  • Important input rule: if you enter 40 (not 0.40), Excel will return an error (e.g., #NUM) because it expects a decimal between 0 and 1.

  • Practical setup: place the data range in one place, and reference a cell for k so you can reuse the same formula for multiple percentiles by simply changing that k cell.

  • Tip: put each problem on a separate worksheet/tab to keep work organized and easy to verify later.

  • Data and k example setup:

    • Data range: e.g., A2:A101.

    • Percentile cell: e.g., B1 contains 0.40 for the 40th percentile.

    • Formula: =PERCENTILE.EXC(A2:A101, B1) (depending on Excel version, you may see =PERCENTILE.EXC or =PERCENTILE.EXC(A2:A101, B1)).

  • Relation to problem parts: you can reuse the same setup for multiple percentiles by changing the k value (e.g., 0.40, 0.65) without re-typing the data range.

  • Practical note: Excel sorts the data internally for percentile calculation; you do not need to manually sort the data.

Quartiles and Their Relationship to Percentiles

  • Quartiles partition data into four equal parts; notation:

    • Q1: first quartile (lower Q1 = 25th percentile)

    • Q2: second quartile (median, 50th percentile)

    • Q3: third quartile (75th percentile)

  • Relationship to percentiles:

    • Q1 = P_{0.25}

    • Q2 = P_{0.50}

    • Q3 = P_{0.75}

  • In practice, textbooks use q1, q2, q3 notation, but they are computed using the same percentile function with k = 0.25, 0.50, and 0.75 respectively.

  • Example: to compute q1, use k = 0.25; to compute q2, use k = 0.50; to compute q3, use k = 0.75, all via the same percentile function.

Measures of Variability: Range and Interquartile Range (IQR)

  • Range: the simplest variability measure; defined as

    • Range=max(X)min(X)\text{Range} = \max(X) - \min(X)

  • Limitations of the range:

    • It only uses the extreme values and tells nothing about the values in between, so it’s not a very sophisticated measure of spread.

  • Excel approach to range:

    • No single built-in "range" function; instead use two functions:

    • MAX(data) returns the largest value in the dataset.

    • MIN(data) returns the smallest value in the dataset.

    • Range can be computed in one step as: Range=max(X)min(X)\text{Range} = \max(X) - \min(X) using a single formula.

  • Interquartile Range (IQR): a more robust measure of spread that focuses on the middle 50% of the data.

    • Definition:

    • IQR=Q<em>3Q</em>1\text{IQR} = Q<em>3 - Q</em>1

    • What it tells you: the range of the middle 50% of the data; less sensitive to extreme values (outliers).

    • Real-world usage: universities reporting SAT/ACT scores often present Q1 and Q3 to describe the middle 50% of the class.

    • Example from transcript: if Q3 = 625 and Q1 = 545, then IQR=625545=80\text{IQR} = 625 - 545 = 80 (in dollars for rental rates in the example).

  • Practical application:

    • The IQR helps assess whether results are influenced by outliers by focusing on the central portion of the data.

    • Researchers sometimes examine how results change when extreme values are trimmed (e.g., dropping bottom 5% and top 5%) to see if conclusions hold for the middle 90%.

Variance: Population vs Sample

  • Variance measures the average squared deviation from the mean and uses all data values.

  • Population variance (denoted by a Greek letter):

    • σ2=1N<em>i=1N(x</em>iμ)2\sigma^2 = \frac{1}{N} \sum<em>{i=1}^{N} (x</em>i - \mu)^2

    • Here, $N$ is the population size and $\mu$ is the population mean.

  • Sample variance (denoted by an $s^2$):

    • s2=1n1<em>i=1n(x</em>ixˉ)2s^2 = \frac{1}{n-1} \sum<em>{i=1}^{n} (x</em>i - \bar{x})^2

    • Here, $n$ is the sample size and $\bar{x}$ is the sample mean.

  • Key difference between the two formulas:

    • Numerator is the same: sum of squared deviations $\sum (x_i - \text{mean})^2$.

    • Denominator differs: population uses $N$, while sample uses $n-1$ (the degrees of freedom adjustment).

  • Practical interpretation issues:

    • Variance uses squared units; if data are dollars, variance is in dollars$^2$, which can be hard to interpret.

  • Excel usage (conceptual):

    • There are two built-in variance functions corresponding to population and sample data (one for each denominator choice).

    • In practice, you need to identify whether your data represent a population or a sample to choose the correct function.

  • Note on the course plan:

    • The instructor indicates that the next session will cover the Excel variance functions in detail.

  • Important reminders from the transcript:

    • Always check whether your problem uses population data or sample data before computing variance.

    • If unspecified, the problem typically uses sample data.

Quick references and implementation tips

  • Percentile inputs must be decimal (0 to 1) in Excel; using a whole percent like 40 will produce an error.

  • When solving multiple percentile questions, link k to a cell and reference that cell in the percentile formula to reuse the same setup for different percentiles.

  • For organization, keep each problem on a separate worksheet/tab to easily verify results and track mistakes.

  • Remember the equivalences:

    • Q1 = P_{0.25}

    • Q2 = P_{0.50}

    • Q3 = P_{0.75}

  • Recap of core formulas:

    • Percentile: P<em>k=value such that at least k% of data are P</em>k and at least (100k)% are Pk.P<em>k = \text{value such that at least } k\% \text{ of data are } \le P</em>k \text{ and at least } (100-k)\% \text{ are } \ge P_k.

    • Range: Range=max(X)min(X)\text{Range} = \max(X) - \min(X)

    • IQR: IQR=Q<em>3Q</em>1\text{IQR} = Q<em>3 - Q</em>1

    • Population variance: σ2=1N<em>i=1N(x</em>iμ)2\sigma^2 = \frac{1}{N} \sum<em>{i=1}^{N} (x</em>i - \mu)^2

    • Sample variance: s2=1n1<em>i=1n(x</em>ixˉ)2s^2 = \frac{1}{n-1} \sum<em>{i=1}^{n} (x</em>i - \bar{x})^2

  • These notes summarize the key ideas and steps described in the transcript and set up practical Excel workflows for percentile, quartile, range, IQR, and variance calculations.