Ch. 3 Numerical Descriptive Measures

Objectives

Measures of central tendency
Measures of variation and identifying outliers
Measures of shape and relative location in numerical variables.
Compute descriptive summary measures for a population.
Interpret the Empirical Rule.
Calculate the covariance and the coefficient of correlation.

Summary Definitions

Central Tendency: The extent to which values of a numerical variable group around a typical or central value.
Variation: The amount of dispersion or scattering away from a central value that the values of a numerical variable show.
Shape: The pattern of the distribution of values from the lowest value to the highest value.

Measures of Central Tendency

Arithmetic Mean
Median
Mode
Geometric Mean (not exam material)

The Mean

The arithmetic mean is the most common measure of central tendency.
Each value plays an equal role, serving as a balance point in a data set.
Formula: $\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$
- $\bar{X}$ : Pronounced x-bar (mean of a sample)
- $n$ : Sample size
- $X_i$ : The ith value

Calculation and Impact

Mean = sum of values divided by the number of values.
Affected by extreme values (outliers).

Example
* Data Set 1: 11, 12, 13, 14, 15, 16, 17, 18, 19, 20; Mean = 15.5
* Data Set 2: 11, 12, 13, 14, 15, 16, 17, 18, 19, 100; Mean = 23.5

The mean is a poor measure of central tendency in the presence of outliers.
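
The pull of a single outlier on the mean can be checked directly. The following is a minimal sketch, assuming Python's standard `statistics` module; the variable names are illustrative and not part of the course material.

```python
from statistics import mean

data_set_1 = [11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
data_set_2 = [11, 12, 13, 14, 15, 16, 17, 18, 19, 100]  # last value is an outlier

mean_1 = mean(data_set_1)  # 155 / 10
mean_2 = mean(data_set_2)  # 235 / 10

print(mean_1, mean_2)
```

Replacing one value out of ten shifts the mean by 8 units, even though the other nine values are unchanged.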

The Median

In an ordered array, the median is the “middle” number (50% above, 50% below).
Less sensitive than the mean to extreme values.
Example
- Data Set 1: 11, 12, 13, 14, 15, 16, 17, 18, 19, 20; Median = 15.5
- Data Set 2: 11, 12, 13, 14, 15, 16, 17, 18, 19, 100; Median = 15.5

Locating the Median

Sort the values in numerical order (smallest to largest).
Use the formula to find the position of the middle number: $\frac{n+1}{2}$
- If $n$ is odd, the median is the middle number.
- If $n$ is even, the median is the average of the two middle numbers.
Example
- Ranked values: 29, 31, 35, 39, 39, 40, 43, 44, 44, 52 (n = 10)
- Median position: $\frac{10+1}{2} = 5.5$
- Median = $\frac{39 + 40}{2} = 39.5$
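
The position rule can be sketched in code as follows. This is only an illustration, assuming a non-empty list of numbers; the function name `median_by_position` is hypothetical.

```python
def median_by_position(values):
    """Locate the median with the (n + 1) / 2 position rule."""
    ranked = sorted(values)            # smallest to largest
    n = len(ranked)
    position = (n + 1) / 2             # 1-based position of the middle value
    if n % 2 == 1:
        return ranked[int(position) - 1]   # odd n: a single middle value
    lower = ranked[int(position) - 1]      # value just below the .5 position
    upper = ranked[int(position)]          # value just above the .5 position
    return (lower + upper) / 2             # even n: average the two middle values

ranked_values = [29, 31, 35, 39, 39, 40, 43, 44, 44, 52]
print(median_by_position(ranked_values))  # 39.5
```

For n = 10 the position is 5.5, so the function averages the 5th and 6th ranked values (39 and 40).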

The Mode

Mode is the value that occurs most often.
Not affected by extreme values.
Used for either numerical or categorical data.
There may be several modes in a data set or no mode at all.
Example
- Data Set 1: 29, 31, 35, 39, 39, 40, 43, 44, 44, 52; Modes = 39 and 44
- Data Set 2: No Mode
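
A short sketch of finding modes, including the "no mode" case. The helper `find_modes` and the list `no_repeats` are hypothetical; treating a data set in which every value occurs once as having no mode is the convention assumed here.

```python
from collections import Counter

def find_modes(values):
    """Return every value tied for the highest count, or [] if nothing repeats."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []  # every value occurs exactly once, so report "no mode"
    return [value for value, count in counts.items() if count == highest]

data_set_1 = [29, 31, 35, 39, 39, 40, 43, 44, 44, 52]
no_repeats = [1, 2, 3, 4, 5]  # hypothetical data, not from the notes

print(find_modes(data_set_1))  # [39, 44]
print(find_modes(no_repeats))  # []
```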

Review Example: House Prices

House Prices: $2,000,000, $500,000, $300,000, $100,000, $100,000
Mean: $3,000,000 / 5 = $600,000
Median: $300,000
Mode: $100,000

Which Measure to Choose?

The mean is generally used unless extreme values (outliers) exist.
The median is often used when outliers are present, since it is not sensitive to extreme values.
In many situations, it makes sense to report both the mean and the median.

Measures of Variation

Measures of variation give information on the spread or variability or dispersion of the data values.
Types:
- Range
- Variance
- Standard Deviation
- Coefficient of Variation

The Range

Simplest measure of variation.
Difference between the largest and the smallest values: $\text{Range} = X_{\text{largest}} - X_{\text{smallest}}$
Example
- Data: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13
- Range = 13 - 1 = 12

Why The Range Can Be Misleading

Does not account for how the data are distributed.
Sensitive to outliers.
Examples
- 7, 8, 9, 10, 11, 12; Range = 12 - 7 = 5
- 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 5; Range = 5 - 1 = 4
- 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 120; Range = 120 - 1 = 119
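
The last two examples can be reproduced with a small sketch (the function name `value_range` is illustrative). It shows that changing one value changes the range drastically while the other 24 values stay the same.

```python
def value_range(values):
    """Range = largest value minus smallest value."""
    return max(values) - min(values)

# Eleven 1s, eight 2s, four 3s, then two final values
concentrated = [1] * 11 + [2] * 8 + [3] * 4 + [4, 5]
with_outlier = [1] * 11 + [2] * 8 + [3] * 4 + [4, 120]

print(value_range(concentrated))  # 4
print(value_range(with_outlier))  # 119
```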

The Sample Variance

Measures the average scatter around the mean.
Calculated as the sum of squared deviations of values from the mean, divided by n - 1 (approximately the average squared deviation).
Formula: $S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}$
- $\bar{X}$ = arithmetic mean
- $n$ = sample size
- $X_i$ = ith value of the variable X

The Sample Standard Deviation

The square root of the variance.
Has the same units as that of the original sample data.
Neither the variance nor the standard deviation can ever be negative.
Sample standard deviation:
$S = \sqrt{S^2}$

Steps for Computing Standard Deviation:

Compute the difference between each value and the mean.
Square each difference.
Add the squared differences.
Divide this total by n-1 to get the sample variance.
Take the square root of the sample variance to get the sample standard deviation.

Calculation Example

Sample Data ( $X_i$ ): 10, 12, 14, 15, 17, 18, 18, 24
n = 8
Mean = $\bar{X}$ = 16
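
Carrying the five steps above through for this sample gives a sum of squared differences of 130, a sample variance of 130 / 7 (about 18.57), and a sample standard deviation of about 4.31. The sketch below follows the steps one at a time; it assumes Python's standard library, and the variable names are illustrative.

```python
from math import sqrt
from statistics import stdev

sample = [10, 12, 14, 15, 17, 18, 18, 24]
n = len(sample)
x_bar = sum(sample) / n                       # 128 / 8 = 16.0

deviations = [x - x_bar for x in sample]      # step 1: value minus mean
squared = [d ** 2 for d in deviations]        # step 2: square each difference
sum_of_squares = sum(squared)                 # step 3: add them up (130.0)
sample_variance = sum_of_squares / (n - 1)    # step 4: divide by n - 1
sample_std_dev = sqrt(sample_variance)        # step 5: take the square root

print(round(sample_variance, 2), round(sample_std_dev, 2))

# Cross-check against the library's sample standard deviation
assert abs(sample_std_dev - stdev(sample)) < 1e-9
```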

Comparing Standard Deviations

Different data sets can have the same mean but different standard deviations.
The more spread out the observations are, the larger the standard deviation.

Summary Characteristics

The more the data are spread out, the greater the range, variance, and standard deviation.
The more the data are concentrated, the smaller the range, variance, and standard deviation.
If the values are all the same (no variation), all these measures will be zero.
None of these measures are ever negative.

The Coefficient of Variation

Measures relative variation.
Always expressed as a percentage (%).
Shows variation relative to mean.
Can be used to compare the variability of two or more sets of data measured in different units.
Formula:
$CV = \left(\frac{S}{\bar{X}}\right) \cdot 100\%$

Comparing Coefficients of Variation

Example
* Stock A: Mean price = $50, Standard deviation = $5, CV = 10%
* Stock B: Mean price = $100, Standard deviation = $5, CV = 5%

Stock B is less variable relative to its mean price.

Example
* Stock A: Mean price = $50, Standard deviation = $5, CV = 10%
* Stock C: Mean price = $8, Standard deviation = $2, CV = 25%

Stock C has a much smaller standard deviation but a much higher coefficient of variation.
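
The three stocks can be compared with a small sketch; the function name `coefficient_of_variation` is illustrative.

```python
def coefficient_of_variation(std_dev, mean):
    """CV as a percentage: standard deviation relative to the mean."""
    return 100 * std_dev / mean

cv_a = coefficient_of_variation(5, 50)    # Stock A
cv_b = coefficient_of_variation(5, 100)   # Stock B
cv_c = coefficient_of_variation(2, 8)     # Stock C

print(cv_a, cv_b, cv_c)  # 10.0 5.0 25.0
```

Stocks A and B share the same standard deviation but differ in CV, while Stock C has the smallest standard deviation and the largest CV.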

Locating Extreme Outliers: Z-Score

An extreme value or outlier is a value located far away from the mean.
Z scores are useful in identifying outliers.
The Z score of a value is the difference between that data value and the mean, divided by the standard deviation.
Formula:
$Z_i = \frac{X_i - \bar{X}}{S}$
- $X_i$ represents the ith data value
- $\bar{X}$ is the sample mean
- $S$ is the sample standard deviation
Example
- Suppose a data value $X_i = 10$ comes from a data set with sample mean $\bar{X} = 2$ and sample standard deviation $S = 4$. Then the Z score for $X_i$ is $Z = \frac{10 - 2}{4} = 2$.
A Z score of 0 indicates that the data value is the same as the mean.
A positive Z score indicates that the data value is above the mean, and a negative Z score indicates that it is below the mean; the magnitude gives the number of standard deviations.
The Z-score is the number of standard deviations a data value is away from the mean.
A data value is considered an extreme outlier if its Z-score is less than -3.0 or greater than +3.0.
The larger the absolute value of the Z-score, the farther the data value is from the mean.

Example
* Suppose the mean math SAT score is 490, with a standard deviation of 100.
* Compute the Z-score for a test score of 620 and determine if it is an outlier.
* $Z = \frac{620-490}{100} = 1.3$

A score of 620 is 1.3 standard deviations above the mean and would not be considered an outlier.
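
A minimal sketch of the same check in code. The function names are illustrative, and the cutoff of 3.0 is the extreme-outlier rule stated above.

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / std_dev

def is_extreme_outlier(z, cutoff=3.0):
    """Flag Z scores below -cutoff or above +cutoff."""
    return abs(z) > cutoff

z = z_score(620, 490, 100)
print(z, is_extreme_outlier(z))  # 1.3 False
```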

Shape of a Distribution

Describes how data are distributed.
In a symmetrical distribution, the values below the mean are distributed exactly as the values above the mean.
Two useful shape related statistics are:
- Skewness: Measures the extent to which data values are not symmetrical.
- Kurtosis (not test material): Measures the peakedness of the curve of the distribution.

Shape of a Distribution (Skewness)

Measures the extent to which data are not symmetrical.
Left-Skewed: Mean < Median, Skewness Statistic < 0
Symmetric: Mean = Median, Skewness Statistic = 0
Right-Skewed: Median < Mean, Skewness Statistic > 0
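
As a rough illustration, the mean-versus-median comparison can be applied to the house prices from the review example. This sketch only compares the two measures; it is a rule of thumb and does not compute a formal skewness statistic. The function name `skew_direction` is hypothetical.

```python
from statistics import mean, median

def skew_direction(values):
    """Rule-of-thumb skew direction from comparing the mean with the median."""
    m, med = mean(values), median(values)
    if m > med:
        return "right-skewed"
    if m < med:
        return "left-skewed"
    return "mean equals median"

house_prices = [2_000_000, 500_000, 300_000, 100_000, 100_000]
print(skew_direction(house_prices))  # right-skewed: mean 600,000 > median 300,000
```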

General Descriptive Stats Using Microsoft Excel Functions

=AVERAGE(range)
=MEDIAN(range)
=MODE.SNGL(range)
=STDEV.S(range)
=VAR.S(range)
=STANDARDIZE(x, mean, standard_dev)

Numerical Descriptive Measures for a Population

Descriptive statistics discussed previously described a sample, not the population.
Summary measures describing a population, called parameters, are denoted with Greek letters.
Important descriptive population parameters are the population mean, population variance, and population standard deviation.

Sample statistics versus population parameters

Measure	Population Parameter	Sample Statistic
Mean	$µ$	$\bar{X}$
Variance	$σ^2$	$S^2$
Standard Deviation	$σ$	$S$
Proportion (Ch.7)	$π$	$p$

The Mean µ

The population mean is the sum of the values in the population divided by the population size, N.
Formula: $μ = \frac{\sum_{i=1}^{N} X_i}{N}$
- $μ$ = population mean
- $N$ = population size
- $X_i$ = ith value of the variable X

The Variance σ^2

Average of squared deviations of values from the mean.
Population variance: $σ^2 = \frac{\sum_{i=1}^{N} (X_i - μ)^2}{N}$
- $μ$ = population mean
- $N$ = population size
- $X_i$ = ith value of the variable X

The Standard Deviation σ

Most commonly used measure of variation.
Shows variation about the mean.
Is the square root of the population variance.
Has the same units as the original data.
Population standard deviation:
$σ = \sqrt{σ^2}$
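
The only computational difference from the sample formulas is the divisor: N for a population, n - 1 for a sample. The sketch below uses Python's standard `statistics` module and, purely for illustration, treats a small list of eight values as if it were an entire population.

```python
from statistics import pstdev, pvariance, variance

# Hypothetical: treat these eight values as a complete population
values = [10, 12, 14, 15, 17, 18, 18, 24]

population_variance = pvariance(values)   # sum of squared deviations / N
population_std_dev = pstdev(values)       # square root of the population variance
sample_variance = variance(values)        # same numerator, divided by n - 1

print(population_variance, round(population_std_dev, 4), round(sample_variance, 4))
```

For the same values, dividing by N always gives a smaller variance than dividing by n - 1.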

The Empirical Rule

The empirical rule approximates the variation of data that are in a symmetric bell-shaped distribution.
Approximately 68% of the data in a symmetric bell-shaped distribution lies within one standard deviation of the mean, or $µ ± 1σ$ .
Approximately 95% of the data in a symmetric bell-shaped distribution lies within two standard deviations of the mean, or $µ ± 2σ$ .
Approximately 99.7% of the data in a symmetric bell-shaped distribution lies within three standard deviations of the mean, or $µ ± 3σ$ .

Using the Empirical Rule

Suppose that the variable Math SAT scores is bell-shaped with a mean of 500 and a standard deviation of 90. Then:
Approximately 68% of all test takers scored between 410 and 590, ( $500 ± 90$ ).
Approximately 95% of all test takers scored between 320 and 680, ( $500 ± 180$ ).
Approximately 99.7% of all test takers scored between 230 and 770, ( $500 ± 270$ ).
The empirical rule helps measure how the values distribute above and below the mean and can help identify outliers.
You can consider values not found in the interval $µ ± 3σ$ as outliers.
Note: this rule also applies to bell-shaped sample data sets (i.e., $\bar{X} ± 1S$ contains approximately 68% of the data, $\bar{X} ± 2S$ approximately 95%, and $\bar{X} ± 3S$ approximately 99.7%).
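
The Math SAT intervals above can be generated with a short sketch. The function name `empirical_rule_intervals` is illustrative, and the percentages are the approximations given by the empirical rule, which assumes a bell-shaped distribution.

```python
def empirical_rule_intervals(mean, std_dev):
    """Return (approximate % of data, lower bound, upper bound) for k = 1, 2, 3."""
    coverage = {1: 68.0, 2: 95.0, 3: 99.7}
    return [
        (pct, mean - k * std_dev, mean + k * std_dev)
        for k, pct in coverage.items()
    ]

for pct, low, high in empirical_rule_intervals(500, 90):
    print(f"about {pct}% of scores fall between {low} and {high}")
```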

Measures Of The Relationship Between Two Numerical Variables

The Covariance
The Coefficient of Correlation

The Covariance

The covariance measures the direction of the linear relationship between two numerical variables (X & Y).
The sample covariance: $cov(X,Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n-1}$
- $n$ = number of the pairs
Only concerned with the directional relationship.
No causal effect is implied.

Interpreting Covariance

cov(X,Y) > 0: X and Y tend to move in the same direction.
cov(X,Y) < 0: X and Y tend to move in opposite directions.
$cov(X,Y) = 0$ : X and Y have no linear relationship (this alone does not mean they are independent).
The covariance has a major flaw: It is not possible to determine the relative strength of the relationship from the size of the covariance.
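
A minimal sketch of the sample covariance, using small hypothetical data sets that are not from the notes. It illustrates two points: rescaling Y changes the size of the covariance without changing the relationship, and a covariance of zero only rules out a linear trend.

```python
def sample_covariance(xs, ys):
    """Sum of cross-deviations from the means, divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)

# Hypothetical paired data
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]
ys_rescaled = [y * 100 for y in ys]   # same relationship, different units

print(sample_covariance(xs, ys))           # 5.0
print(sample_covariance(xs, ys_rescaled))  # 500.0

# Y depends entirely on X (Y = X squared), yet the covariance is zero
symmetric_xs = [-2, -1, 0, 1, 2]
squares = [x ** 2 for x in symmetric_xs]
print(sample_covariance(symmetric_xs, squares))  # 0.0
```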

Coefficient of Correlation

Measures the relative strength of the linear relationship between two numerical variables.
Sample coefficient of correlation: $r = \frac{cov(X,Y)}{S_X S_Y}$
- Where $S_X$ and $S_Y$ are the sample standard deviations of X and Y.
No causal effect is implied.

Features of the Coefficient of Correlation

The population coefficient of correlation is referred to as $ρ$ .
The sample coefficient of correlation is referred to as $r$ .
Both $ρ$ and $r$ have the following features:
- Unit free
- Range between –1 and 1
- The closer to –1, the stronger the negative linear relationship.
- The closer to 1, the stronger the positive linear relationship.
- The closer to 0, the weaker the linear relationship.
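
The unit-free property can be seen in a small sketch: rescaling one variable leaves r essentially unchanged. The data are hypothetical and the function names are illustrative; only the standard library is assumed.

```python
from statistics import stdev

def sample_covariance(xs, ys):
    """Sum of cross-deviations from the means, divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Sample coefficient of correlation: covariance over the product of standard deviations."""
    return sample_covariance(xs, ys) / (stdev(xs) * stdev(ys))

# Hypothetical paired data
xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 5]

r = correlation(xs, ys)
r_rescaled = correlation(xs, [y * 100 for y in ys])  # change the units of Y

print(round(r, 3), round(r_rescaled, 3))  # 0.8 0.8
```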

Interpreting the Coefficient of Correlation

Example
* r = 0.733

There is a relatively strong positive linear relationship between test score #1 and test score #2.
Students who scored high on the first test tended to score high on the second test.