Ch 3 - Numerical Summaries of Data

Sanduni Palliyage, PhD
MATH 220 – Fall 2024
James Madison University

The Mean

Overview

The mean, or average, summarizes a quantitative variable with a single value, representing a measure of data center.

Calculation

Formula: Mean = (Sum of all data points) / (Number of data points)
Example:
- For the data set [10, 15, 9, 21, 17, 10]:
  - Mean = (10 + 15 + 9 + 21 + 17 + 10) / 6 = 82 / 6 ≈ 13.667
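
The calculation above can be checked with a few lines of Python. This is a minimal sketch; the function name `mean` is chosen here for illustration and is not part of the course materials.

```python
# Data set from the example above.
data = [10, 15, 9, 21, 17, 10]

def mean(values):
    """Return the arithmetic mean: sum of the values divided by their count."""
    return sum(values) / len(values)

print(sum(data))             # 82
print(round(mean(data), 3))  # 13.667
```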

Notation for the Mean

Sample Mean and Population Mean

Denote data points as: x_1, x_2, …, x_n
Sample Mean (n = sample size):
- x̄ = (x_1 + x_2 + … + x_n) / n
Population Mean (N = population size):
- μ = (x_1 + x_2 + … + x_N) / N

The Median

Definition and Importance

The median is the middle value of a dataset arranged in ascending order; it marks the 50th percentile.
About half of the data values lie below the median and about half lie above it.
The calculation depends on whether n (the number of observations) is odd or even.

Calculation Steps

Organize: Arrange your data in increasing order.
Find n: Determine the total number of data values, n.
Determine Median:
- If n is odd: Median = the value in position (n + 1) / 2 of the sorted data
- If n is even: Median = the average of the two middle values, in positions n / 2 and n / 2 + 1

Examples of Finding the Median

Odd Number of Observations

Data: 10, 8, 14, 20, 19, 17, 8
Sorted: 8, 8, 10, 14, 17, 19, 20
Median Calculation:
- n = 7 (Odd) -> Median position: (7+1)/2 = 4th observation = 14

Even Number of Observations

Data: 10, 8, 14, 20, 19, 17, 8, 21
Sorted: 8, 8, 10, 14, 17, 19, 20, 21
Median Calculation:
- n = 8 (Even) -> Median = (4th + 5th) / 2 = (14 + 17) / 2 = 15.5
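
Both worked examples can be reproduced with a short Python sketch of the steps above (sort, find n, then take the middle value or the average of the two middle values). The function name `median` is illustrative only.

```python
def median(values):
    """Return the median of a non-empty list of numbers."""
    ordered = sorted(values)   # Step 1: arrange in increasing order
    n = len(ordered)           # Step 2: find n
    mid = n // 2
    if n % 2 == 1:
        # Odd n: position (n + 1) / 2 in 1-based counting is index mid.
        return ordered[mid]
    # Even n: average the values in positions n / 2 and n / 2 + 1.
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([10, 8, 14, 20, 19, 17, 8]))      # 14
print(median([10, 8, 14, 20, 19, 17, 8, 21]))  # 15.5
```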

Comparing Mean and Median

Key Differences

Mean includes all data points; sensitive to extremes.
Median only considers the middle value(s); resistant to extreme values.

Example Analysis

Data Set 1: 1, 2, 3, 4, 5, 6
- Mean = 3.5, Median = 3.5
Data Set 2: 1, 2, 3, 4, 5, 21
- Mean = 6, Median = 3.5
Conclusion: The median is unchanged, while the mean is pulled upward by the outlier (21).
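
As a quick check of this comparison, the sketch below uses Python's standard `statistics` module on the two data sets. The variable names are chosen here for illustration.

```python
from statistics import mean, median

data_set_1 = [1, 2, 3, 4, 5, 6]
data_set_2 = [1, 2, 3, 4, 5, 21]  # same data, with 6 replaced by the outlier 21

for label, values in [("Data Set 1", data_set_1), ("Data Set 2", data_set_2)]:
    # The mean uses every value; the median only looks at the middle of the sorted list.
    print(label, "mean =", mean(values), "median =", median(values))
```

Only the largest value changed between the two lists, so the middle pair (3 and 4) and therefore the median are the same in both.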

The Mode

Definition

The mode is the value that appears most frequently in a dataset.
- If several values tie for the highest frequency, all of them are modes (two modes: bimodal).
- Example Data: [1, 2, 2, 3, 4, 5, 5, 5, 6]
- Mode = 5 (it appears three times, more than any other value)
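
A minimal Python sketch of finding the mode(s), using `statistics.multimode` (available in Python 3.8 and later), which returns every value tied for the highest frequency. The second list is made-up data, included only to show a bimodal case.

```python
from statistics import multimode

data = [1, 2, 2, 3, 4, 5, 5, 5, 6]   # example data from above
print(multimode(data))               # [5]

bimodal_example = [1, 1, 2, 2, 3]    # hypothetical data with two modes
print(multimode(bimodal_example))    # [1, 2]
```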

Measures of Spread

Overview

Measures of center say nothing about how spread out the data are.
Example: An average temperature alone can be misleading without knowing how much the temperatures vary around it.

Range

Definition: Difference between the maximum and minimum values.
Example:
- SF Range = 63 - 51 = 12
- STL Range = 79 - 30 = 49
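
The range is simple to compute in code. The raw SF and STL temperature data are not listed in these notes, so this sketch reuses the data set from the mean example instead; the function name `data_range` is illustrative.

```python
def data_range(values):
    """Return the range: maximum value minus minimum value."""
    return max(values) - min(values)

data = [10, 15, 9, 21, 17, 10]  # data set from the mean example
print(data_range(data))         # 21 - 9 = 12
```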

The Variance

Definition

Variance measures data points’ dispersion from the mean.

Importance

Low Variance: Data values close to the mean.
High Variance: Values spread out from the mean, indicating varied data points.

Population Variance

Calculation Steps

Compute the population mean μ.
Subtract μ from each data value and square the result.
Sum the squared deviations.
Divide by the population size N.

Formula

Variance (σ²) = Σ(𝑥 - μ)² / N
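
The formula can be sketched step by step in Python. The data are Data Set 1 from the mean/median comparison, treated here as a whole population purely for illustration; the function name is hypothetical.

```python
def population_variance(values):
    """Population variance: sum of squared deviations from the mean, divided by N."""
    N = len(values)
    mu = sum(values) / N                                 # population mean
    squared_deviations = [(x - mu) ** 2 for x in values]
    return sum(squared_deviations) / N

data = [1, 2, 3, 4, 5, 6]
# mu = 3.5, squared deviations sum to 17.5, so the variance is 17.5 / 6
print(round(population_variance(data), 3))  # 2.917
```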

Sample Variance

Key Differences

Sample variance (s²) uses sample mean (x̄) and divides by n - 1 to compensate for bias:
Formula: s² = Σ(𝑥 - x̄)² / (n - 1)
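
A minimal sketch of the sample variance formula, using the same illustrative data as Data Set 1 but now treated as a sample. The only change from the population version is the divisor n - 1. The function name is hypothetical.

```python
def sample_variance(values):
    """Sample variance: sum of squared deviations from x-bar, divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n                              # sample mean
    squared_deviations = [(x - x_bar) ** 2 for x in values]
    return sum(squared_deviations) / (n - 1)

data = [1, 2, 3, 4, 5, 6]
# x_bar = 3.5, squared deviations sum to 17.5, so s^2 = 17.5 / 5
print(sample_variance(data))  # 3.5
```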

The Standard Deviation

Explanation

Definition: Square root of variance provides a measure of data spread with the same units as the data.
- Population Standard Deviation (σ) = √σ²
- Sample Standard Deviation (s) = √s²
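
The relationship between variance and standard deviation can be checked with Python's `statistics` module, where `pvariance`/`pstdev` are the population versions and `variance`/`stdev` are the sample versions. The data set is the illustrative Data Set 1 used earlier.

```python
import math
from statistics import pstdev, pvariance, stdev, variance

data = [1, 2, 3, 4, 5, 6]

sigma = pstdev(data)  # population standard deviation
s = stdev(data)       # sample standard deviation

# Each standard deviation is the square root of the matching variance.
print(math.isclose(sigma, math.sqrt(pvariance(data))))  # True
print(math.isclose(s, math.sqrt(variance(data))))       # True
print(round(sigma, 3), round(s, 3))                     # 1.708 1.871
```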

z-Scores

Definition

A z-score indicates how many standard deviations a data point is from the mean: z = (x - μ) / σ
A positive z-score indicates a value above the mean; a negative z-score indicates a value below the mean.
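
A short sketch of the z-score formula in Python. It uses Data Set 2 from the mean/median comparison and assumes, for illustration only, that the list is the whole population (so μ and σ are computed from it). The function name `z_score` is hypothetical.

```python
from statistics import mean, pstdev

def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean."""
    return (x - mu) / sigma

data = [1, 2, 3, 4, 5, 21]
mu = mean(data)       # 6
sigma = pstdev(data)  # population standard deviation, about 6.831

z_high = z_score(21, mu, sigma)  # above the mean, so positive
z_low = z_score(1, mu, sigma)    # below the mean, so negative
print(round(z_high, 2), round(z_low, 2))
```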

Quartiles and Percentiles

Quartiles

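Quartiles divide sorted data into four parts of roughly equal size:
- Q1 (25th percentile) separates the lowest 25% from the rest
- Median (Q2, 50th percentile) separates the lower 50% from the upper 50%
- Q3 (75th percentile) separates the highest 25% from the rest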

Five-Number Summary

Components

Minimum
Q1
Median
Q3
Maximum
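
The five components can be computed with the sketch below. Quartile conventions differ between textbooks and software; this version assumes the "median of each half" rule (Q1 is the median of the lower half, Q3 the median of the upper half, with the overall median excluded from both halves when n is odd), which may not match the rule used in the course. The function name is hypothetical.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule.

    Assumes at least two data values.
    """
    ordered = sorted(values)
    n = len(ordered)
    lower_half = ordered[: n // 2]        # values below the middle position
    upper_half = ordered[(n + 1) // 2 :]  # values above the middle position
    return (
        ordered[0],
        median(lower_half),
        median(ordered),
        median(upper_half),
        ordered[-1],
    )

# Data from the odd-n median example.
print(five_number_summary([10, 8, 14, 20, 19, 17, 8]))  # (8, 8, 14, 19, 20)
```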

Outliers

Detection

An outlier is a value that differs markedly from the rest of the data. The interquartile range (IQR) gives a standard rule for classifying values as outliers.

IQR Method Steps

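Determine Q1 and Q3.
Calculate IQR = Q3 - Q1.
Compute the boundaries: lower = Q1 - 1.5 × IQR, upper = Q3 + 1.5 × IQR.
Classify any value below the lower boundary or above the upper boundary as an outlier.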
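
A sketch of the IQR method in Python, applied to Data Set 2 from the mean/median comparison. The quartiles are found with the "median of each half" rule, which is an assumption here since conventions vary; the function name `iqr_outliers` is hypothetical.

```python
from statistics import median

def iqr_outliers(values):
    """Return (lower_boundary, upper_boundary, outliers) using the 1.5 * IQR rule."""
    ordered = sorted(values)
    n = len(ordered)
    q1 = median(ordered[: n // 2])        # median of the lower half
    q3 = median(ordered[(n + 1) // 2 :])  # median of the upper half
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    outliers = [x for x in ordered if x < lower or x > upper]
    return lower, upper, outliers

# Q1 = 2, Q3 = 5, IQR = 3, so the boundaries are -2.5 and 9.5.
print(iqr_outliers([1, 2, 3, 4, 5, 21]))  # (-2.5, 9.5, [21])
```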

Boxplots

Explanation

Visual representation of the five-number summary. Useful in identifying outliers and comparing distributions.

Comparative Boxplots

Utility

Allow easy visual comparisons between datasets (e.g., rainfall over different time periods).

Conclusion

To find n, the total number of data values, count the values in the dataset. For example, the dataset [10, 8, 14, 20, 19, 17, 8] contains the values 10, 8, 14, 20, 19, 17, and 8.

In this case, n = 7, since there are 7 values in the dataset.

To calculate the population variance using a TI-84 calculator, follow these steps:

Enter your data:
- Press the STAT button.
- Select 1: Edit... to access the lists.
- Enter your data into one of the columns (e.g., L1).
Calculate statistics:
- After entering all your data points, press the STAT button again.
- Arrow over to the CALC menu.
- Select 1: 1-Var Stats.
- Input the list you used for your data (e.g., L1) and press ENTER.
Find the variance:
- The 1-Var Stats output does not list σ² directly. Find σx, the population standard deviation, and square it to get the population variance. (Sx is the sample standard deviation.)

The TI-84 has no dedicated z-score function, but you can compute a z-score from the formula z = (x - μ) / σ, where x is your data point, μ is the mean of the dataset, and σ is the standard deviation. Follow these steps:

First, find the mean (μ) and standard deviation (σ):
- Press the STAT button.
- Choose 1: Edit to enter your data in one of the lists (e.g., L1).
- After entering the data, press STAT, arrow over to CALC, and select 1: 1-Var Stats. The output lists the mean (x̄) and the standard deviation (σx for a population, Sx for a sample).
Once you have μ and σ, substitute them into the z-score formula to calculate z for your data point x.