Unit 1: Exploring One-Variable Data - Section 1D: Describing Quantitative Data with Numbers

SECTION 1D: DESCRIBING QUANTITATIVE DATA WITH NUMBERS

## PAGE 1

### Learning Targets
By the end of this section, students should be able to:
- Find the median of a distribution of quantitative data.
- Calculate the mean of a distribution of quantitative data.
- Find the range of a distribution of quantitative data.
- Calculate and interpret the standard deviation of a distribution of quantitative data.
- Find the interquartile range ($IQR$) of a distribution of quantitative data.
- Choose appropriate measures of center and variability to summarize a distribution of quantitative data.
- Identify outliers in a distribution of quantitative data.
- Make and interpret boxplots of quantitative data.
- Use boxplots and summary statistics to compare distributions of quantitative data.

### Introduction to Numerical Summaries
In Section 1C, the focus was on displaying quantitative data with graphs and using those graphs to describe and compare distributions. Section 1D shifts focus to numerical summaries.

Context: Lead Levels in Flint, Michigan
City managers switched the water source from Lake Huron to the Flint River to save money. The following data represents lead levels (in parts per billion, $ppb$) in $71$ water samples collected from randomly selected dwellings:

$0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 9, 9, 7, 8, 8, 9, 10, 7, 2, 2, 10, 11, 13, 18, 20, 21, 22, 29, 42, 42, 104$

Observations from the Distribution:
- The distribution is right-skewed and single-peaked.
- The dwelling with a lead level of $104\,ppb$ appears to be an outlier.
- The Mode is the most frequently occurring data value. For this data, the mode is $0\,ppb$. However, the mode is often a poor measure of center because it can fall anywhere in a distribution, a distribution can have multiple modes, or it can have no mode at all.
In this context, $0\,ppb$ is not representative of a typical dwelling's lead level.

## PAGE 2

### Measuring Center: The Median
The median is the conceptual "middle value" of an ordered data set.

DEFINITION: Median
The median is the midpoint of a distribution: the number such that about half the observations are smaller and about half are larger. To find the median, arrange the data values from smallest to largest.
- If the number $n$ of data values is odd, the median is the middle value in the ordered list.
- If the number $n$ of data values is even, the median is the average of the two middle values in the ordered list.

Example: Population Density in Central America
Data on population density (people per $km^2$) for seven countries:
- Belize: $17$
- Costa Rica: $100$
- El Salvador: $308$
- Guatemala: $158$
- Honduras: $82$
- Nicaragua: $48$
- Panama: $52$

Calculation:
1. Sort the data: $17, 48, 52, 82, 100, 158, 308$
2. Since $n=7$ (odd), the middle value is the $4^{\text{th}}$ value.
3. Median = $82$.

Example: More Chips Please (Measuring Center: The Median)
Problem: A group collected data on the percentage of air in a sample of $14$ brands of chips.
- Brands/Values: Cape Cod ($46$), Cheetos ($59$), Doritos ($48$), Fritos ($19$), Kettle Brand ($47$), Lays ($41$), Lays Baked ($39$), Popchips ($45$), Pringles ($28$), Ruffles ($50$), Stacy's Pita Chips ($50$), Sun Chips ($41$), Terra ($49$), Tostitos Scoops ($34$).

## PAGE 3

Solution (Median):
1. Sorted values: $19, 28, 34, 39, 41, 41, 45, 46, 47, 48, 49, 50, 50, 59$
2. Since $n=14$ (even), the median is the average of the $7^{\text{th}}$ and $8^{\text{th}}$ values ($45$ and $46$).
3.
Median = $\frac{45+46}{2} = 45.5\%$.

Interpretation: About half of the chip brands have less than $45.5\%$ air, and about half have more.

### Measuring Center: The Mean
The mean is the most common measure of center.

DEFINITION: The Mean
The mean of a distribution of quantitative data is the average of all individual data values. Add all values and divide by the total number of values.

Formula for Sample Mean ($\bar{x}$):
$\bar{x} = \frac{\sum x_i}{n} = \frac{x_1 + x_2 + \dots + x_n}{n}$
- The symbol $\sum$ (capital Greek letter sigma) means "add them all up."
- The subscripts in $x_i$ only distinguish one data value from another; they do not indicate order.

AP® Exam Tip: The formula for the sample mean $\bar{x}$ is provided on the AP® Statistics exam formula sheet.

## PAGE 4

Example: More Chips Please (Measuring Center: The Mean)
(a) Calculate the mean percent air for the 14 brands.
$\bar{x} = \frac{46 + 59 + 48 + 19 + 47 + 41 + 39 + 45 + 28 + 50 + 50 + 41 + 49 + 34}{14}$
$\bar{x} = \frac{596}{14} \approx 42.57\%$ air.

(b) Calculate the mean if the possible outlier (Fritos, 19%) is removed.
$\bar{x} = \frac{577}{13} \approx 44.38\%$ air.

Observation: The inclusion of the Fritos bag decreased the mean by $1.81$ percentage points.

Population Mean ($\mu$):
When data represents an entire population, the Greek letter $\mu$ (mu) is used.
Example: Population density of the seven Central American countries.
$\mu = \frac{17 + 48 + 52 + 82 + 100 + 158 + 308}{7} \approx 109.286\, \text{people per } km^2$

DEFINITION: Statistic vs. Parameter
- Statistic: A number describing a characteristic of a sample ($\bar{x}$).
- Parameter: A number describing a characteristic of a population ($\mu$).

## PAGE 5

### Properties of the Mean
The mean is not resistant to extreme values or outliers.

DEFINITION: Resistant
A statistical measure is resistant if it is not affected much by extreme data values.
The median is resistant; the mean is not.

Activity: Interpreting the Mean (The Seesaw Interpretation)
1. Five pennies on the $6$-inch mark of a $12$-inch ruler balance at the $6$-inch mark. The mean is $6$.
2. Moving one penny to $8$ inches and another to $4$ inches maintains the balance at $6$. The mean remains $6$.
3. Mean as the Balance Point: The mean is the point where the dotplot or distribution would physically balance.

### Comparing the Mean and Median
The choice between mean and median depends on the distribution's shape and outliers.

## PAGE 6

- Skewed to the Left: The mean is pulled toward the long tail. $\text{Mean} < \text{Median}$.
- Roughly Symmetric: The mean and median are close. $\text{Mean} \approx \text{Median}$.
- Skewed to the Right: The mean is pulled toward the long tail. $\text{Mean} > \text{Median}$.

Effect Summary:
- In symmetric distributions with no outliers, mean and median are similar.
- In strongly skewed distributions, the mean is pulled in the direction of skewness.
- Median is resistant to outliers; mean is not.

Real-World Example: MLB Player Salaries (2022)
- Distribution: Strongly right-skewed.
- Median Salary: $\approx \$1.2\, \text{million}$ (describes the "typical" player).
- Mean Salary: $\approx \$4.4\, \text{million}$ (pulled up by superstars like Max Scherzer and Mike Trout).
- The mean is useful for calculating totals: $(\$4.4\, \text{million}) \times (975\, \text{players}) \approx \$4.3\, \text{billion}$.

## PAGE 7

### Questions & Discussion: Check Your Understanding
Context: Pumpkin weights (lb): $3.6, 4.0, 9.6, 14.0, 11.0, 12.4, 13.0, 2.0, 6.0, 6.6, 15.0, 3.4, 12.7, 9.6, 4.0, 6.1, 6.0, 2.8, 5.4, 11.9, 5.4, 31.0, 33.0$.

1. Find the median weight.
Sorted: $2.0, 2.8, 3.4, 3.6, 4.0, 4.0, 5.4, 5.4, 6.0, 6.0, 6.1, 6.6, 9.6, 9.6, 11.0, 11.9, 12.4, 12.7, 13.0, 14.0, 15.0, 31.0, 33.0$.
With $n=23$, median is the $12^{\text{th}}$ value = $6.6\,lb$.

2.
Calculate the mean weight.
$\bar{x} = \frac{228.5}{23} \approx 9.93\,lb$.

3. Why is the mean larger than the median?
The distribution is right-skewed with extreme values ($31.0$ and $33.0$) that pull the mean upward.

### Measuring Variability: The Range
Distributions can have the same center and shape but different variability.

DEFINITION: Range
The range is the distance between the minimum and maximum values.
$\text{Range} = \text{Maximum} - \text{Minimum}$

Example: PVC Pipe Lengths
- Supplier A: $601.5 - 598.5 = 3.0\,mm$.
- Supplier B: $604.0 - 596.0 = 8.0\,mm$.

## PAGE 8

Example: More Chips Please (Measuring Variability: The Range)
- Max: $59\%$.
- Min: $19\%$.
- Range = $59 - 19 = 40\%$ air.

Limitations of the Range:
1. Not Resistant: Strongly affected by outliers. Without the $19\%$ outlier, the chip air range drops to $31\%$.
2. Uses only two values: Does not describe how other data values are distributed. (Example: In film strip machine widths, Machine A and B have the same range ($0.4\,mm$) but Machine B has values more spread out from the center).

## PAGE 9

### Measuring Variability: The Standard Deviation
The standard deviation describes the variation of data values around the mean.

DEFINITION: Standard Deviation
The standard deviation measures the typical distance of data values in a distribution from the mean. It is the square root of the average squared deviation.

Formula for Sample Standard Deviation ($s_x$):
$s_x = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n-1}}$

DEFINITION: Sample Variance ($s_x^2$)
The value obtained before taking the square root ($\frac{\sum (x_i - \bar{x})^2}{n-1}$). It is measured in squared units.

Steps to Calculate $s_x$:
1. Find the mean ($\bar{x}$).
2. Calculate deviations: $\text{deviation} = \text{value} - \text{mean}$.
3. Square each deviation.
4. Add squared deviations and divide by $n-1$ (Variance).
5.
Take the square root.

Example: How Many Friends?
Data: $1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 6$ ($n=11$).

## PAGE 10

1. Mean: $\bar{x} = 3$.
2. Deviations and Squares:
- $1-3 = -2 \rightarrow (-2)^2 = 4$
- $2-3 = -1 \rightarrow (-1)^2 = 1$ (three times)
- $3-3 = 0 \rightarrow (0)^2 = 0$ (four times)
- $4-3 = 1 \rightarrow (1)^2 = 1$ (two times)
- $6-3 = 3 \rightarrow (3)^2 = 9$
3. Sum of Squares: $4 + 1 + 1 + 1 + 0 + 0 + 0 + 0 + 1 + 1 + 9 = 18$.
4. Variance: $s_x^2 = \frac{18}{11-1} = 1.80$.
5. Standard Deviation: $s_x = \sqrt{1.80} \approx 1.34\, \text{close friends}$.

Interpretation: The number of close friends these students have typically varies from the mean by about $1.34$ friends.

Population Standard Deviation ($\sigma$): Calculated using the population mean $\mu$ and dividing by $N$ (not $n-1$).

Rationale for calculation method:
- Deviations sum to $0$ because the mean is the balance point.
- We square deviations to make them positive so they don't cancel out.
- We take the square root to return to the original units.

## PAGE 11

### Properties of the Standard Deviation
- $s_x \ge 0$. $s_x = 0$ only if all data values are identical (no variability).
- Greater variation from the mean results in larger $s_x$.
- Not resistant: Squared deviations make $s_x$ highly sensitive to outliers.
(Removing the student with $6$ friends drops $s_x$ from $1.34$ to $0.949$).
- $s_x$ measures variation specifically about the mean.

Scenario: Adding a 12th observation.
If a value equal to the mean (e.g., $3$ friends) is added, $s_x$ decreases because the new typical distance from the mean decreases ($s_x \rightarrow 1.28$).

### Measuring Variability: The Interquartile Range ($IQR$)
The $IQR$ focuses on the middle half of the distribution to avoid the impact of extremes.

## PAGE 12

DEFINITION: Quartiles
Quartiles divide an ordered data set into four groups of roughly equal size.
- First Quartile ($Q_1$): The median of the data values to the left of the actual median.
- Third Quartile ($Q_3$): The median of the data values to the right of the actual median.

DEFINITION: Interquartile Range ($IQR$)
The distance between the first and third quartiles.
$IQR = Q_3 - Q_1$

Example: Charity Collections Per Hour
Data: $\$19, \$22, \$22, \$25, \$26, \$28, \$29, \$31, \$31, \$34, \$37, \$39$ ($n=12$).
- Median ($Q_2$): $\frac{28+29}{2} = \$28.50$.
- $Q_1$: Median of left 6 ($19, 22, 22, 25, 26, 28$) = $\frac{22+25}{2} = \$23.50$.
- $Q_3$: Median of right 6 ($29, 31, 31, 34, 37, 39$) = $\frac{31+34}{2} = \$32.50$.
- $IQR$ = $\$32.50 - \$23.50 = \$9.00$.

## PAGE 13

Example: More Chips Please (Measuring Variability: $IQR$)
Sorted Air Data ($n=14$): $19, 28, 34, 39, 41, 41, 45 | 46, 47, 48, 49, 50, 50, 59$
- Median: $45.5$
- $Q_1$: Median of first 7 values = $39$.
- $Q_3$: Median of last 7 values = $49$.
- $IQR$ = $49 - 39 = 10\%$ air.

Interpretation: The middle half of the distribution of percent air has a range of $10\%$.

Properties of $IQR$:
- Resistant: Not affected by extreme values. If the max was $69\%$ instead of $59\%$, $IQR$ remains $10\%$.
- Important Note: Leave out the median when locating quartiles.
If the median is part of the data set, ignore it in both halves.

### Choosing Summary Statistics

## PAGE 14

Guidelines for summarizing center and variability:
- Roughly Symmetric with No Outliers: Use the Mean ($\bar{x}$) and Standard Deviation ($s_x$).
- Skewed or with Outliers: Use the Median and $IQR$ (they are resistant).
- Range: Use only as a last resort; it provides the least information about internal distribution.

Example: Lead in the water (Summary Statistics)
Data: $n=71$. Mean = $7.31$, SD = $14.347$, Min = $0$, $Q_1=2$, Med = $3$, $Q_3=7$, Max = $104$.
- Analysis: Distribution is right-skewed with a prominent outlier at $104\,ppb$.
- Choice: Choose resistant measures: Median = $3\,ppb$ and $IQR$ = $7 - 2 = 5\,ppb$.

## PAGE 15

### Tech Corner: Calculating Summary Statistics (TI-83/84)
1. Enter values into list L1.
2. Choose STAT → CALC → 1-Var Stats.
3. Outputs include $\bar{x}, \sum x, \sum x^2, s_x, n, \text{minX}, Q_1, \text{Med}, Q_3, \text{maxX}$.
4. Range and $IQR$ must be calculated by hand from the five-number summary: $\text{maxX} - \text{minX}$ and $Q_3 - Q_1$.

Caution: Different software (like Minitab) may use slightly different rules for calculating quartiles (e.g., $Q_1=37.75$ instead of $39$ for the chip data), but results are usually similar for large data sets.

## PAGE 16

### Questions & Discussion: Check Your Understanding (Pumpkins Continued)
1. Can you calculate range exactly from a histogram? No, histograms only show bins (intervals), not individual values. Using the raw data: $33.0 - 2.0 = 31.0\,lb$.
2. Interpret the standard deviation ($8.01\,lb$). The typical distance of a pumpkin's weight from the mean ($9.93\,lb$) is about $8.01\,lb$.
3. Calculate $IQR$. $Q_1 = 4.0, Q_3 = 12.7$. $IQR = 12.7 - 4.0 = 8.7\,lb$.
4. Preferred measures?
Use Median and $IQR$ because the distribution is strongly right-skewed with outliers.

### Identifying Outliers
Question: Is LeBron James's rookie season average ($20.9\,pts$) an outlier among the averages for his first $16$ seasons?

## PAGE 17

HOW TO IDENTIFY OUTLIERS: The $1.5 \times IQR$ Rule
An observation is an outlier if:
- It is lower than $Q_1 - 1.5 \times IQR$
- It is higher than $Q_3 + 1.5 \times IQR$

Example: LeBron James
Data: $20.9, 25.3, 25.3, 26.4, 26.7, 26.8, 27.1, 27.1, 27.2, 27.3, 27.4, 27.5, 28.4, 29.7, 30.0, 31.4$.
1. $Q_1 = 26.55, Q_3 = 27.95$.
2. $IQR = 27.95 - 26.55 = 1.40$.
3. Multiplier: $1.5 \times 1.40 = 2.1$.
4. Low cutoff: $26.55 - 2.1 = 24.45$.
5. High cutoff: $27.95 + 2.1 = 30.05$.
6. Outliers: $20.9$ (low) and $31.4$ (high).

The $2 \times SD$ Rule:
Some use "more than 2 standard deviations from the mean." For LeBron:
- Cutoffs: $27.156 \pm 2(2.328) \rightarrow 22.50$ to $31.812$.
- Only $20.9$ is an outlier by this rule. This rule is less resistant than the $IQR$ rule because the mean and standard deviation are themselves affected by outliers.

AP® Exam Tip: Be prepared to use both the $1.5 \times IQR$ rule and the $2 \times SD$ rule.

## PAGE 18

Importance of Identifying Outliers:
1. They might be inaccuracies (recording errors or machine failure).
2. They can indicate a remarkable occurrence (e.g., Serena Williams's earnings).
3. They heavily influence mean, range, and standard deviation.

### Displaying Summary Statistics: Boxplots

DEFINITION: Five-number summary
Minimum, $Q_1$, Median, $Q_3$, and Maximum.

DEFINITION: Boxplot
A visual representation of the five-number summary.

## PAGE 19

HOW TO MAKE A BOXPLOT:
1. Find the five-number summary.
2. Identify outliers using the $1.5 \times IQR$ rule.
3. Draw and label a horizontal axis with variable name and units.
4. Scale the axis appropriately.
5. Draw a box from $Q_1$ to $Q_3$.
6. Mark the median with a vertical line in the box.
7. Mark outliers with special symbols (e.g., $*$).
8.
Draw whiskers to the smallest and largest non-outlier data values.

Example: How big are the large fries?
15 orders: $165, 172, 173, 176, 178, 179, 179, 180, 181, 181, 183, 183, 184, 186, 187$.

## PAGE 20

- Min: $165$, $Q_1=176$, Med=$180$, $Q_3=183$, Max=$187$.
- $IQR = 7$. $1.5 \times 7 = 10.5$.
- Low cutoff: $176 - 10.5 = 165.5$; High cutoff: $183 + 10.5 = 193.5$.
- Outlier: $165\,g$.
- Boxplot shows left skewness (left half varies from $165$ to $180$, right half from $180$ to $187$).

Limitations of Boxplots:
- They hide individual data points.
- They hide shape details like peaks, gaps, or clusters. (Example: Old Faithful Geyser eruptions are bimodal, but a boxplot makes them look unimodal).

## PAGE 21

### Questions & Discussion: Check Your Understanding (Pumpkins Final)
1. Identify outliers. $Q_1=4.0, Q_3=12.7, IQR=8.7$, so $1.5 \times IQR = 13.05$. High cutoff: $12.7 + 13.05 = 25.75$. Outliers: $31.0, 33.0$. (The low cutoff, $4.0 - 13.05 = -9.05$, has no values below it).
2. Make boxplot. Whiskers go to $2.0$ and $15.0$. Asterisks for $31.0, 33.0$.
3. Why is shape incomplete? The boxplot does not show the peaks/unimodality visible in the histogram.

### Comparing Distributions with Boxplots and Summary Statistics
Always discuss Shape, Outliers, Center, and Variability (SOCV) with context.

Example: Apple vs. Samsung Tablets
Ratings: High score is better.

## PAGE 22

- Apple Ratings ($n=20$): $87, 87, 87, 87, 86, 86, 86, 86, 84, 84, 83, 83, 83, 83, 81, 79, 76, 73, …$
- Samsung Ratings ($n=20$): $88, 87, 87, 86, 86, 86, 86, 84, 84, 83, 83, 77, 76, 76, 75, 75, 75, 75, 74, 71, 62$

## PAGE 23

Comparison (Apple vs. Samsung):
- Shape: Both are left-skewed.
- Outliers: Apple has two low outliers ($73, 76$). Samsung has none.
- Center: Apple has a slightly higher median ($84$) than Samsung ($83$). $75\%$ of Apple tablets are at or above the Samsung median.
- Variability: Samsung ratings vary much more.
Samsung's $IQR$ ($11$) is nearly four times as large as Apple's $IQR$ ($3$).

AP® Exam Tip: Use precise terminology. Do not say "mean" if you mean "median." Skewed refers to shape, not center. $IQR$ and Range are single numbers, not regions.

### Activity: Team Challenge - Did Mr. Starnes Stack His Class?
Students compare GPA data from two teachers (Starnes vs. McGrail) to determine if class placement was random. This requires creating graphs and calculating summary statistics to assess differences in academic profiles between the two classes.

## PAGE 24

### Tech Corner: Making Boxplots (TI-83/84)
- Use Plot 1 and Plot 2 to show parallel boxplots.
- Select the boxplot type that identifies outliers (the first icon with dots).
- Use ZOOM → ZoomStat to view.
- Use TRACE to see values (Min, $Q_1$, Med, $Q_3$, Max).
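
For readers without a TI-83/84, the calculations in this section can also be checked in Python. The sketch below is only an illustration, not part of the original notes: the function names (`median`, `five_number_summary`, `outliers_by_iqr_rule`, `sample_sd`) are hypothetical, and the quartiles follow this section's rule of leaving the median out of both halves, so other software may report slightly different values. It reuses the large-fries and close-friends data from the examples above.

```python
from math import sqrt


def median(sorted_values):
    """Median of a list that is already sorted from smallest to largest."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]  # odd n: the middle value
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2  # even n: average


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max).

    Quartile rule assumed here: Q1 and Q3 are the medians of the lower and
    upper halves, with the overall median excluded when n is odd.
    """
    data = sorted(values)
    n = len(data)
    lower = data[: n // 2]        # values left of the median
    upper = data[(n + 1) // 2:]   # values right of the median
    return data[0], median(lower), median(data), median(upper), data[-1]


def outliers_by_iqr_rule(values):
    """Values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low_cutoff = q1 - 1.5 * iqr
    high_cutoff = q3 + 1.5 * iqr
    return [x for x in values if x < low_cutoff or x > high_cutoff]


def sample_sd(values):
    """Sample standard deviation s_x (divides by n - 1)."""
    n = len(values)
    mean = sum(values) / n
    return sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))


# Data copied from the "large fries" and "How Many Friends?" examples.
fries = [165, 172, 173, 176, 178, 179, 179, 180, 181, 181, 183, 183, 184, 186, 187]
friends = [1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 6]

print(five_number_summary(fries))    # (165, 176, 180, 183, 187)
print(outliers_by_iqr_rule(fries))   # [165]
print(round(sample_sd(friends), 2))  # 1.34
```

The printed values match the worked examples: the fries five-number summary, the single low outlier at $165\,g$, and $s_x \approx 1.34$ for the friends data.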