Ch 3 - Numerical Summaries of Data

Sanduni Palliyage, PhD
MATH 220 – Fall 2024
James Madison University


The Mean

Overview

  • The mean, or average, summarizes a quantitative variable with a single value that measures the center of the data.

Calculation

  • Formula: Mean = (Sum of all data points) / (Number of data points)

  • Example:

    • For the data set [10, 15, 9, 21, 17, 10]:

      • Mean = (10 + 15 + 9 + 21 + 17 + 10) / 6 = 82 / 6 ≈ 13.667
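The calculation above can be checked with a few lines of Python. This is only a minimal sketch; the function name `mean` is chosen here for illustration and is not part of the course materials.

```python
# Data set from the example above.
data = [10, 15, 9, 21, 17, 10]


def mean(values):
    """Return the sum of the values divided by the number of values."""
    if not values:
        raise ValueError("mean requires at least one data point")
    return sum(values) / len(values)


print(round(mean(data), 3))  # 13.667
```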


Notation for the Mean

Sample Mean and Population Mean

  • Denote the data points as: x_1, x_2, …, x_n

  • Sample Mean (n = number of values in the sample):

    • x̄ = (x_1 + x_2 + … + x_n) / n

  • Population Mean (N = number of values in the population):

    • μ = (x_1 + x_2 + … + x_N) / N


The Median

Definition and Importance

  • The median is the middle value in a dataset when organized in ascending order, marking the 50th percentile.

  • About 50% of the data values lie at or below the median, and about 50% lie at or above it.

  • Calculation differs based on whether n (number of observations) is odd or even.

Calculation Steps

  1. Organize: Arrange your data in increasing order.

  2. Find n: Determine the total number of data values, n.

  3. Determine Median:

    • If n is odd: Median = the value in position (n + 1) / 2 of the sorted data

    • If n is even: Median = the average of the two middle values (positions n / 2 and n / 2 + 1 of the sorted data)
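The three steps can be sketched in Python as follows. The function name `median` is an illustrative choice, and the two data sets are the ones used in the worked examples in these notes. Python lists count positions from 0, so position (n + 1) / 2 in the sorted data corresponds to index n // 2.

```python
def median(values):
    """Find the median by sorting, counting, and picking the middle."""
    ordered = sorted(values)   # Step 1: arrange in increasing order
    n = len(ordered)           # Step 2: find n
    if n == 0:
        raise ValueError("median requires at least one data point")
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the single middle value, position (n + 1) / 2.
        return ordered[mid]
    # Even n: average of the two middle values, positions n/2 and n/2 + 1.
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([10, 8, 14, 20, 19, 17, 8]))      # 14
print(median([10, 8, 14, 20, 19, 17, 8, 21]))  # 15.5
```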


Examples of Finding the Median

Odd Number of Observations

  • Data: 10, 8, 14, 20, 19, 17, 8

  • Sorted: 8, 8, 10, 14, 17, 19, 20

  • Median Calculation:

    • n = 7 (Odd) -> Median position: (7+1)/2 = 4th observation = 14

Even Number of Observations

  • Data: 10, 8, 14, 20, 19, 17, 8, 21

  • Sorted: 8, 8, 10, 14, 17, 19, 20, 21

  • Median Calculation:

    • n = 8 (Even) -> Median = (4th + 5th) / 2 = (14 + 17) / 2 = 15.5


Comparing Mean and Median

Key Differences

  • Mean includes all data points; sensitive to extremes.

  • Median only considers the middle value(s); resistant to extreme values.

Example Analysis

  • Data Set 1: 1, 2, 3, 4, 5, 6

    • Mean = 3.5, Median = 3.5

  • Data Set 2: 1, 2, 3, 4, 5, 21

    • Mean = 6, Median = 3.5

  • Conclusion: The median is unchanged, while the mean is pulled upward by the outlier (21).
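The comparison can be reproduced with Python's built-in `statistics` module. This is a small sketch; the variable names are illustrative.

```python
import statistics

set1 = [1, 2, 3, 4, 5, 6]
set2 = [1, 2, 3, 4, 5, 21]  # same values, but 6 is replaced by the outlier 21

mean1, median1 = statistics.mean(set1), statistics.median(set1)
mean2, median2 = statistics.mean(set2), statistics.median(set2)

print("Data Set 1: mean =", mean1, "median =", median1)
print("Data Set 2: mean =", mean2, "median =", median2)
```

Replacing one value with an extreme one moves the mean from 3.5 to 6 but leaves the median at 3.5.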


The Mode

Definition

  • The mode is the value that appears most frequently in a dataset.

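    • If several values tie for the highest frequency, each of them is a mode (e.g., a data set with two modes is bimodal).

    • Example Data: [1, 2, 2, 3, 4, 5, 5, 5, 6]

    • Mode = 5 (it appears three times, more often than any other value)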
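In Python (3.8 or later), `statistics.multimode` returns every value that ties for the highest frequency, which matches the definition above. The second data set below is made up here purely to illustrate a bimodal case.

```python
import statistics

data = [1, 2, 2, 3, 4, 5, 5, 5, 6]
modes = statistics.multimode(data)
print(modes)  # [5]

# Hypothetical data set in which 1 and 3 both appear twice.
bimodal_data = [1, 1, 2, 3, 3]
bimodal_modes = statistics.multimode(bimodal_data)
print(bimodal_modes)  # [1, 3]
```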


Measures of Spread

Overview

  • Measures of center do not provide insight into the data's spread.

  • Example: A single typical value, such as a city's average temperature, can be misleading without knowing how much the temperatures vary around it.

Range

  • Definition: Difference between the maximum and minimum values.

  • Example:

    • SF Range = 63 - 51 = 12

    • STL Range = 79 - 30 = 49
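A short sketch of the range calculation follows. The notes give only the minimum and maximum for SF and STL, so the temperature lists below are invented for illustration; only their smallest and largest values match the figures above.

```python
def data_range(values):
    """Return the maximum value minus the minimum value."""
    if not values:
        raise ValueError("range requires at least one data point")
    return max(values) - min(values)


# Hypothetical temperatures; only the min and max come from the notes.
sf_temps = [51, 54, 57, 60, 63]
stl_temps = [30, 45, 58, 70, 79]

sf_range = data_range(sf_temps)
stl_range = data_range(stl_temps)
print(sf_range, stl_range)  # 12 49
```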


The Variance

Definition

  • Variance measures how far the data values are spread out from the mean, based on their squared deviations from the mean.

Importance

  • Low Variance: Data values close to the mean.

  • High Variance: Values spread out from the mean, indicating varied data points.


Population Variance

Calculation Steps

  1. Find the deviation of each value from the mean (x - μ), square each deviation, and add up the squares.

  2. Divide by the population size N.

Formula

  • Variance (σ²) = Σ(𝑥 - μ)² / N
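The formula can be sketched in Python as below. Treating the values 1 through 6 as an entire population is an assumption made only for this illustration, and the function name is not from the notes.

```python
import statistics


def population_variance(values):
    """Sum the squared deviations from the mean, then divide by N."""
    n = len(values)
    if n == 0:
        raise ValueError("variance requires at least one data point")
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n


population = [1, 2, 3, 4, 5, 6]  # assumed here to be a whole population
sigma_squared = population_variance(population)
print(sigma_squared)

# The standard library's pvariance uses the same definition.
print(statistics.pvariance(population))
```

Here μ = 3.5 and the squared deviations add up to 17.5, so σ² = 17.5 / 6, or about 2.917.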


Sample Variance

Key Differences

  • Sample variance (s²) uses the sample mean (x̄) in place of μ and divides by n - 1 instead of n; dividing by n would tend to underestimate the population variance.

  • Formula: s² = Σ(𝑥 - x̄)² / (n - 1)
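A minimal Python sketch of the sample formula is shown below. The data are treated as a sample here only for illustration, and the function name is an assumption of this sketch.

```python
import statistics


def sample_variance(values):
    """Sum the squared deviations from x-bar, then divide by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("sample variance requires at least two data points")
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


sample = [1, 2, 3, 4, 5, 6]  # treated as a sample for this illustration
s_squared = sample_variance(sample)
print(s_squared)  # 3.5

# The standard library's variance also divides by n - 1.
print(statistics.variance(sample))
```

For the same six values, dividing the sum of squared deviations (17.5) by n - 1 = 5 gives 3.5, which is larger than the result of dividing by n = 6.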


The Standard Deviation

Explanation

  • Definition: The standard deviation is the square root of the variance; it measures the spread of the data in the same units as the data.

    • Population Standard Deviation (σ) = √σ²

    • Sample Standard Deviation (s) = √s²
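Both versions can be computed in Python by taking the square root of the matching variance. The sketch below reuses the data set from the mean example; whether those values should be viewed as a population or a sample is not stated in the notes, so both results are shown.

```python
import math
import statistics

data = [10, 15, 9, 21, 17, 10]

# Population standard deviation: square root of the population variance.
sigma = math.sqrt(statistics.pvariance(data))

# Sample standard deviation: square root of the sample variance.
s = math.sqrt(statistics.variance(data))

print(round(sigma, 3), round(s, 3))
```

The standard library also provides `statistics.pstdev` and `statistics.stdev`, which give the same two values directly.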


z-Scores

Definition

  • A z-score indicates how many standard deviations a data point is from the mean: z = (x - μ) / σ

  • A positive z-score indicates a value above the mean; a negative z-score indicates a value below the mean.
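The z-score formula can be sketched in Python as follows. The data set is Data Set 2 from the mean and median comparison, treated here as a population (an assumption for this illustration); the function name `z_score` is also illustrative.

```python
import statistics


def z_score(x, mu, sigma):
    """Return how many standard deviations x lies from the mean mu."""
    if sigma == 0:
        raise ValueError("z-score is undefined when sigma is 0")
    return (x - mu) / sigma


data = [1, 2, 3, 4, 5, 21]        # assumed to be a whole population
mu = statistics.mean(data)        # population mean
sigma = statistics.pstdev(data)   # population standard deviation

z_high = z_score(21, mu, sigma)   # value above the mean
z_low = z_score(1, mu, sigma)     # value below the mean
print(round(z_high, 2), round(z_low, 2))
```

The value 21 lies above the mean and gets a positive z-score, while the value 1 lies below the mean and gets a negative one.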


Quartiles and Percentiles

Quartiles

  • Quartiles divide data into four parts:

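    • Q1 (25th percentile): about 25% of the data lie at or below it

    • Median (Q2, 50th percentile): about 50% of the data lie at or below it

    • Q3 (75th percentile): about 75% of the data lie at or below it, so it separates off the upper 25%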
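Textbooks and software use several slightly different conventions for computing quartiles, and these notes do not say which one the course follows. The sketch below assumes the "median of each half" convention: Q1 is the median of the lower half of the sorted data and Q3 is the median of the upper half, with the overall median left out of both halves when n is odd. The function name is illustrative.

```python
import statistics


def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-each-half convention."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("quartiles require at least two data points")
    half = n // 2
    lower = ordered[:half]
    # For odd n, skip the middle value so it belongs to neither half.
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return (statistics.median(lower),
            statistics.median(ordered),
            statistics.median(upper))


print(quartiles([10, 8, 14, 20, 19, 17, 8]))      # Q1 = 8, median = 14, Q3 = 19
print(quartiles([10, 8, 14, 20, 19, 17, 8, 21]))  # Q1 = 9, median = 15.5, Q3 = 19.5
```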


Five-Number Summary

Components

  1. Minimum

  2. Q1

  3. Median

  4. Q3

  5. Maximum
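A five-number summary can be assembled in Python as sketched below. This version relies on `statistics.quantiles` with its default "exclusive" method, which is one of several quartile conventions and may differ slightly from a hand calculation for some data sets. The function name is illustrative.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    # quantiles(..., n=4) returns the three cut points Q1, Q2, Q3.
    q1, q2, q3 = statistics.quantiles(values, n=4)
    return (min(values), q1, q2, q3, max(values))


data = [10, 8, 14, 20, 19, 17, 8]
summary = five_number_summary(data)
print(summary)
```

For this data set the summary is minimum 8, Q1 8, median 14, Q3 19, maximum 20.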


Outliers

Detection

  • An outlier is a value that differs markedly from the rest of the data. A common rule for flagging outliers uses the interquartile range (IQR).

IQR Method Steps

  1. Determine Q1 and Q3.

  2. Calculate IQR = Q3 - Q1.

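  3. Compute the boundaries (fences): lower = Q1 - 1.5 × IQR and upper = Q3 + 1.5 × IQR. Values below the lower fence or above the upper fence are classified as outliers.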
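The IQR method can be sketched in Python as follows, using the usual fences Q1 - 1.5 × IQR and Q3 + 1.5 × IQR. Quartiles come from `statistics.quantiles` with its default "exclusive" method, which is an assumption of this sketch; a different quartile convention can shift the fences slightly. The function name is illustrative.

```python
import statistics


def iqr_outliers(values):
    """Return (lower fence, upper fence, outliers) by the 1.5 * IQR rule."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # Step 1: Q1 and Q3
    iqr = q3 - q1                                  # Step 2: IQR = Q3 - Q1
    lower_fence = q1 - 1.5 * iqr                   # Step 3: boundaries
    upper_fence = q3 + 1.5 * iqr
    outliers = [x for x in values if x < lower_fence or x > upper_fence]
    return lower_fence, upper_fence, outliers


low, high, flagged = iqr_outliers([1, 2, 3, 4, 5, 21])
print(low, high, flagged)
```

With this data set, 21 falls above the upper fence and is the only value flagged as an outlier.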


Boxplots

Explanation

  • Visual representation of the five-number summary. Useful in identifying outliers and comparing distributions.


Comparative Boxplots

Utility

  • Allow easy visual comparisons between datasets (e.g., rainfall over different time periods).


Conclusion

To find n, which represents the total number of data values in your dataset, you simply count how many values are present. For example, if your dataset is [10, 8, 14, 20, 19, 17, 8], you would count these values:

  • 10

  • 8

  • 14

  • 20

  • 19

  • 17

  • 8

In this case, n = 7, since there are 7 values in the dataset.

To calculate the population variance using a TI-84 calculator, follow these steps:

  1. Enter your data:

    • Press the STAT button.

    • Select 1: Edit... to access the lists.

    • Enter your data into one of the columns (e.g., L1).

  2. Calculate statistics:

    • After entering all your data points, press the STAT button again.

    • Arrow over to the CALC menu.

    • Select 1: 1-Var Stats.

    • Input the list you used for your data (e.g., L1) and press ENTER.

  3. Find the variance:

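    • The 1-Var Stats output does not list the variance directly. Find σx (the population standard deviation) and square it to get the population variance σ². (Sx, also shown, is the sample standard deviation.)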

The TI-84 has no built-in function that returns the z-score of a data point, but you can compute it from the formula z = (x - μ) / σ, where x is your data point, μ is the mean of the dataset, and σ is the standard deviation. Follow these steps:

  1. First, find the mean (μ) and standard deviation (σ):

    • Press the STAT button.

    • Choose 1: Edit to enter your data in one of the lists (e.g., L1).

    • After entering data, press STAT, then arrow over to CALC, and select 1: 1-Var Stats. The output shows the mean (labeled x̄) and the standard deviation (σx for a population, Sx for a sample).

  2. Once you have μ and σ, use the z-score formula to manually calculate z for your specific data point x.