Ch. 3 Numerical Descriptive Measures
Objectives
- Measures of central tendency
- Measures of variation and identifying outliers
- Measures of shape and relative location of numerical variables.
- Compute descriptive summary measures for a population.
- Interpret the Empirical Rule.
- Calculate the covariance and the coefficient of correlation.
Summary Definitions
- Central Tendency: The extent to which values of a numerical variable group around a typical or central value.
- Variation: The amount of dispersion or scattering away from a central value that the values of a numerical variable show.
- Shape: The pattern of the distribution of values from the lowest value to the highest value.
Measures of Central Tendency
- Arithmetic Mean
- Median
- Mode
- Geometric Mean (not exam material)
The Mean
- The arithmetic mean is the most common measure of central tendency.
- Each value plays an equal role in its computation; the mean serves as the balance point of a data set.
- Formula: $\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$
- $\bar{X}$: Pronounced "x-bar" (mean of a sample)
- $n$: Sample size
- $X_i$: The ith value
Calculation and Impact
- Mean = sum of values divided by the number of values.
- Affected by extreme values (outliers).
Example
* Data Set 1: 11, 12, 13, 14, 15, 16, 17, 18, 19, 20; Mean = 15.5
* Data Set 2: 11, 12, 13, 14, 15, 16, 17, 18, 19, 100; Mean = 23.5
- The mean is a poor measure of central tendency in the presence of outliers.
The Median
- In an ordered array, the median is the “middle” number (50% above, 50% below).
- Less sensitive than the mean to extreme values.
Example
- Data Set 1: 11, 12, 13, 14, 15, 16, 17, 18, 19, 20; Median = 15.5
- Data Set 2: 11, 12, 13, 14, 15, 16, 17, 18, 19, 100; Median = 15.5
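The two data sets above can be checked with Python's standard library; this is a quick sketch, not part of the course material:

```python
from statistics import mean, median

data1 = [11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
data2 = [11, 12, 13, 14, 15, 16, 17, 18, 19, 100]  # same data with one outlier

# The outlier pulls the mean up sharply but leaves the median unchanged.
print(mean(data1), median(data1))  # 15.5 15.5
print(mean(data2), median(data2))  # 23.5 15.5
```

This illustrates why the median is preferred when outliers are present.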
Locating the Median
- Sort the values in numerical order (smallest to largest).
- Use the formula $\frac{n+1}{2}$ to find the position of the middle number:
- If $n$ is odd, the median is the middle number.
- If $n$ is even, the median is the average of the two middle numbers.
Example - Ranked values: 29, 31, 35, 39, 39, 40, 43, 44, 44, 52 (n = 10)
- Median position: $\frac{n+1}{2} = \frac{10+1}{2} = 5.5$ (between the 5th and 6th values)
- Median = $\frac{39 + 40}{2} = 39.5$
The Mode
- Mode is the value that occurs most often.
- Not affected by extreme values.
- Used for either numerical or categorical data.
- There may be several modes in a data set or no mode at all.
Example
- Data Set 1: 29, 31, 35, 39, 39, 40, 43, 44, 44, 52; Modes = 39 and 44
- Data Set 2: No Mode
Review Example: House Prices
- House Prices: $2,000,000, $500,000, $300,000, $100,000, $100,000
- Mean: $\frac{3{,}000{,}000}{5}$ = \$600,000
- Median: $300,000
- Mode: $100,000
Which Measure to Choose?
- The mean is generally used unless extreme values (outliers) exist.
- The median is often used since it is not sensitive to extreme values.
- In many situations, it makes sense to report both the mean and the median.
Measures of Variation
- Measures of variation give information on the spread (variability or dispersion) of the data values.
- Types:
- Range
- Variance
- Standard Deviation
- Coefficient of Variation
The Range
- Simplest measure of variation.
- Difference between the largest and the smallest values: Range = $X_{largest} - X_{smallest}$
Example
- Data: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13
- Range = 13 - 1 = 12
Why The Range Can Be Misleading
- Does not account for how the data are distributed.
- Sensitive to outliers.
Examples
- 7, 8, 9, 10, 11, 12; Range = 12 - 7 = 5
- 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 5; Range = 5 - 1 = 4
- 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 120; Range = 120 - 1 = 119
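The examples above can be reproduced with a one-line helper; a minimal sketch (the function name `value_range` is my own, not from the text):

```python
def value_range(data):
    """Range = largest value minus smallest value."""
    return max(data) - min(data)

# A single outlier can inflate the range dramatically.
print(value_range([7, 8, 9, 10, 11, 12]))                 # 5
print(value_range([1]*11 + [2]*8 + [3]*4 + [4, 120]))     # 119
```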
The Sample Variance
- Measures the average scatter around the mean.
- Calculated as (approximately) the average of the squared deviations of values from the mean, dividing by $n-1$ rather than $n$.
- Formula: $S^2 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{n-1}$
- $\bar{X}$ = arithmetic mean
- $n$ = sample size
- $X_i$ = ith value of the variable X
The Sample Standard Deviation
- The square root of the variance.
- Has the same units as that of the original sample data.
- Neither the variance nor the standard deviation can ever be negative.
- Sample standard deviation: $S = \sqrt{\frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{n-1}}$
Steps for Computing Standard Deviation:
- Compute the difference between each value and the mean.
- Square each difference.
- Add the squared differences.
- Divide this total by n-1 to get the sample variance.
- Take the square root of the sample variance to get the sample standard deviation.
Calculation Example
- Sample Data ($X_i$): 10, 12, 14, 15, 17, 18, 18, 24
- $n = 8$
- Mean = $\bar{X} = \frac{10+12+14+15+17+18+18+24}{8} = \frac{128}{8} = 16$
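The five-step procedure above can be sketched in Python for the sample data 10, 12, 14, 15, 17, 18, 18, 24; this is an illustration, not course material:

```python
from statistics import mean, stdev

data = [10, 12, 14, 15, 17, 18, 18, 24]
n = len(data)                        # 8
xbar = mean(data)                    # 16

# Steps: square each deviation from the mean, sum them,
# divide by n - 1, then take the square root.
squared_devs = [(x - xbar) ** 2 for x in data]
s2 = sum(squared_devs) / (n - 1)     # sample variance = 130 / 7
s = s2 ** 0.5                        # sample standard deviation

print(round(s2, 4), round(s, 4))     # 18.5714 4.3095
assert abs(s - stdev(data)) < 1e-12  # matches the library routine
```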
Comparing Standard Deviations
- Different data sets can have the same mean but different standard deviations.
- The more spread out the observations are, the larger the standard deviation.
Summary Characteristics
- The more the data are spread out, the greater the range, variance, and standard deviation.
- The more the data are concentrated, the smaller the range, variance, and standard deviation.
- If the values are all the same (no variation), all these measures will be zero.
- None of these measures are ever negative.
The Coefficient of Variation
- Measures relative variation.
- Always in percentage (%).
- Shows variation relative to mean.
- Can be used to compare the variability of two or more sets of data measured in different units.
- Formula: $CV = \left(\frac{S}{\bar{X}}\right) \cdot 100\%$
Comparing Coefficients of Variation
Example
* Stock A: Mean price = $50, Standard deviation = $5, CV = 10%
* Stock B: Mean price = $100, Standard deviation = $5, CV = 5%
- Stock B is less variable relative to its mean price.
Example
* Stock A: Mean price = $50, Standard deviation = $5, CV = 10%
* Stock C: Mean price = $8, Standard deviation = $2, CV = 25%
- Stock C has a much smaller standard deviation but a much higher coefficient of variation.
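The stock comparisons above can be verified with a small helper (the function name is my own):

```python
def coeff_of_variation(std_dev, mean_value):
    """CV = (S / x-bar) * 100%, a unit-free measure of relative variation."""
    return std_dev / mean_value * 100

# Figures taken from the stock examples above (prices in dollars).
print(coeff_of_variation(5, 50))    # 10.0  -> Stock A
print(coeff_of_variation(5, 100))   # 5.0   -> Stock B
print(coeff_of_variation(2, 8))     # 25.0  -> Stock C
```

Even though Stock C has the smallest standard deviation in dollars, its CV is the largest because its mean price is so low.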
Locating Extreme Outliers: Z-Score
An extreme value or outlier is a value located far away from the mean.
Z scores are useful in identifying outliers.
The Z score of a value is the difference between that data value and the mean, divided by the standard deviation.
Formula: $Z = \frac{X_i - \bar{X}}{S}$
- $X_i$ represents the ith data value
- $\bar{X}$ is the sample mean
- $S$ is the sample standard deviation
Example - For a data value $X_i$ drawn from a data set with sample mean $\bar{X}$ and sample standard deviation $S$, the Z score for $X_i$ is $Z = \frac{X_i - \bar{X}}{S}$.
A Z score of 0 indicates that the data value is the same as the mean.
- The sign of a Z score indicates whether the data value is above (positive) or below (negative) the mean, and its magnitude tells by how many standard deviations.
The Z-score is the number of standard deviations a data value is away from the mean.
A data value is considered an extreme outlier if its Z-score is less than -3.0 or greater than +3.0.
The larger the absolute value of the Z-score, the farther the data value is from the mean.
Example
* Suppose the mean math SAT score is 490, with a standard deviation of 100.
* Compute the Z-score for a test score of 620 and determine if it is an outlier.
* $Z = \frac{620 - 490}{100} = \frac{130}{100} = 1.3$
- A score of 620 is 1.3 standard deviations above the mean and would not be considered an outlier.
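The SAT example above can be sketched in Python (helper names are my own, not from the text):

```python
def z_score(x, mean_value, std_dev):
    """Number of standard deviations x lies from the mean."""
    return (x - mean_value) / std_dev

def is_extreme_outlier(x, mean_value, std_dev):
    """Extreme outlier if Z < -3.0 or Z > +3.0."""
    return abs(z_score(x, mean_value, std_dev)) > 3.0

# Math SAT: mean 490, standard deviation 100, score of 620.
z = z_score(620, 490, 100)
print(z, is_extreme_outlier(620, 490, 100))  # 1.3 False
```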
Shape of a Distribution
- Describes how data are distributed.
- In a symmetrical distribution, the values below the mean are distributed exactly as the values above the mean.
- Two useful shape related statistics are:
- Skewness: Measures the extent to which data values are not symmetrical.
- Kurtosis (not test material): Measures the peakedness of the curve of the distribution.
Shape of a Distribution (Skewness)
- Measures the extent to which data is not symmetrical.
- Left-Skewed: Mean < Median, Skewness Statistic < 0
- Symmetric: Mean = Median, Skewness Statistic = 0
- Right-Skewed: Median < Mean, Skewness Statistic > 0
General Descriptive Stats Using Microsoft Excel Functions
- =AVERAGE(range)
- =MEDIAN(range)
- =MODE.SNGL(range)
- =STDEV.S(range)
- =VAR.S(range)
- =STANDARDIZE(x, mean, standard_dev) (computes a Z-score)
Numerical Descriptive Measures for a Population
- Descriptive statistics discussed previously described a sample, not the population.
- Summary measures describing a population, called parameters, are denoted with Greek letters.
- Important descriptive population parameters are the population mean, population variance, and population standard deviation.
Sample statistics versus population parameters
| Measure | Population Parameter | Sample Statistic |
|---|---|---|
| Mean | $\mu$ | $\bar{X}$ |
| Variance | $\sigma^2$ | $S^2$ |
| Standard Deviation | $\sigma$ | $S$ |
| Proportion (Ch. 7) | $\pi$ | $p$ |
The mean µ
- The population mean is the sum of the values in the population divided by the population size, N: $\mu = \frac{\sum_{i=1}^{N} X_i}{N}$
- $\mu$ = population mean
- $N$ = population size
- $X_i$ = ith value of the variable X
The Variance σ^2
- Average of squared deviations of values from the mean.
- Population variance: $\sigma^2 = \frac{\sum_{i=1}^{N}(X_i - \mu)^2}{N}$
- $\mu$ = population mean
- $N$ = population size
- $X_i$ = ith value of the variable X
The Standard Deviation σ
- Most commonly used measure of variation.
- Shows variation about the mean.
- Is the square root of the population variance.
- Has the same units as the original data.
- Population standard deviation: $\sigma = \sqrt{\frac{\sum_{i=1}^{N}(X_i - \mu)^2}{N}}$
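The difference between population and sample formulas is just the divisor ($N$ versus $n-1$), which Python's standard library exposes directly; a quick sketch using the sample data from the earlier calculation example:

```python
from statistics import pstdev, pvariance, stdev, variance

data = [10, 12, 14, 15, 17, 18, 18, 24]

# Population formulas divide by N; sample formulas divide by n - 1.
print(pvariance(data))   # sigma^2 = 130 / 8 = 16.25
print(variance(data))    # S^2 = 130 / 7 (about 18.57)
print(pstdev(data))      # sigma = sqrt(16.25)
print(stdev(data))       # S = sqrt(130 / 7)
```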
The Empirical Rule
- The empirical rule approximates the variation of data that are in a symmetric bell-shaped distribution.
- Approximately 68% of the data in a symmetric bell-shaped distribution is within 1 standard deviation of the mean, or $\mu \pm 1\sigma$.
- Approximately 95% of the data in a symmetric bell-shaped distribution lies within two standard deviations of the mean, or $\mu \pm 2\sigma$.
- Approximately 99.7% of the data in a symmetric bell-shaped distribution lies within three standard deviations of the mean, or $\mu \pm 3\sigma$.
Using the Empirical Rule
- Suppose that the variable Math SAT scores is bell-shaped with a mean of 500 and a standard deviation of 90. Then:
- Approximately 68% of all test takers scored between 410 and 590 ($\mu \pm 1\sigma = 500 \pm 90$).
- Approximately 95% of all test takers scored between 320 and 680 ($\mu \pm 2\sigma = 500 \pm 180$).
- Approximately 99.7% of all test takers scored between 230 and 770 ($\mu \pm 3\sigma = 500 \pm 270$).
- The empirical rule helps measure how the values distribute above and below the mean and can help identify outliers.
- You can consider values not found in the interval $\mu \pm 3\sigma$ as outliers.
- Note: this rule also applies to bell-shaped sample data sets (i.e., $\bar{X} \pm 1S$ contains about 68% of the data, $\bar{X} \pm 2S$ about 95%, and $\bar{X} \pm 3S$ about 99.7%).
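The Math SAT intervals above can be generated with a short loop (an illustration only):

```python
mu, sigma = 500, 90  # Math SAT example: mean 500, standard deviation 90

# Intervals mu +/- k*sigma for k = 1, 2, 3 and their approximate coverage.
for k, pct in [(1, 68), (2, 95), (3, 99.7)]:
    low, high = mu - k * sigma, mu + k * sigma
    print(f"about {pct}% of scores fall between {low} and {high}")
```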
Measures Of The Relationship Between Two Numerical Variables
- The Covariance
- The Coefficient of Correlation
The Covariance
- The covariance measures the direction of the linear relationship between two numerical variables (X & Y).
- The sample covariance: $\text{cov}(X,Y) = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{n-1}$
- $n$ = number of pairs of (X, Y) values
- Only concerned with the directional relationship.
- No causal effect is implied.
Interpreting Covariance
- cov(X,Y) > 0: X and Y tend to move in the same direction.
- cov(X,Y) < 0: X and Y tend to move in opposite directions.
- cov(X,Y) = 0: X and Y have no linear relationship (independent variables have zero covariance, though zero covariance alone does not guarantee independence).
- The covariance has a major flaw: It is not possible to determine the relative strength of the relationship from the size of the covariance.
Coefficient of Correlation
- Measures the relative strength of the linear relationship between two numerical variables.
- Sample coefficient of correlation: $r = \frac{\text{cov}(X,Y)}{S_X S_Y}$
- Where $\text{cov}(X,Y) = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{n-1}$, $S_X = \sqrt{\frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{n-1}}$, and $S_Y = \sqrt{\frac{\sum_{i=1}^{n}(Y_i - \bar{Y})^2}{n-1}}$
- No causal effect is implied.
Features of the Coefficient of Correlation
- The population coefficient of correlation is referred to as $\rho$ (rho).
- The sample coefficient of correlation is referred to as $r$.
- Both $\rho$ and $r$ have the following features:
- Unit free
- Range between –1 and 1
- The closer to –1, the stronger the negative linear relationship.
- The closer to 1, the stronger the positive linear relationship.
- The closer to 0, the weaker the linear relationship.
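The covariance and correlation formulas can be sketched directly from their definitions; the score data below are hypothetical, chosen only to illustrate a positive linear relationship (they are not the data behind the r = 0.733 example):

```python
from statistics import mean, stdev

def sample_cov(x, y):
    """cov(X, Y) = sum of (Xi - xbar)(Yi - ybar) over the pairs, divided by n - 1."""
    xbar, ybar = mean(x), mean(y)
    return sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / (len(x) - 1)

def sample_corr(x, y):
    """r = cov(X, Y) / (Sx * Sy); unit free, always between -1 and 1."""
    return sample_cov(x, y) / (stdev(x) * stdev(y))

# Hypothetical pairs of test scores for five students.
score1 = [70, 80, 85, 90, 95]
score2 = [65, 78, 88, 86, 99]
print(sample_cov(score1, score2))            # 117.75
print(round(sample_corr(score1, score2), 3)) # a value close to +1
```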
Interpreting the Coefficient of Correlation
Example
* r = 0.733
- There is a relatively strong positive linear relationship between test score #1 and test score #2.
- Students who scored high on the first test tended to score high on the second test.