Ch. 3 Numerical Descriptive Measures

Objectives

  • Measures of central tendency
  • Measures of variation and identifying outliers
  • Measures of shape and relative location in numerical variables.
  • Compute descriptive summary measures for a population.
  • Interpret the Empirical Rule
  • Calculate the covariance and the coefficient of correlation.

Summary Definitions

  • Central Tendency: The extent to which values of a numerical variable group around a typical or central value.
  • Variation: The amount of dispersion or scattering away from a central value that the values of a numerical variable show.
  • Shape: The pattern of the distribution of values from the lowest value to the highest value.

Measures of Central Tendency

  • Arithmetic Mean
  • Median
  • Mode
  • Geometric Mean (not exam material)

The Mean

  • The arithmetic mean is the most common measure of central tendency.
  • Each value plays an equal role, serving as a balance point in a data set.
    \bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}
    • \bar{X}: Pronounced x-bar (mean of a sample)
    • n: Sample size
    • X_i: The ith value
Calculation and Impact
  • Mean = sum of values divided by the number of values.
  • Affected by extreme values (outliers).

Example
* Data Set 1: 11, 12, 13, 14, 15, 16, 17, 18, 19, 20; Mean = 15.5
* Data Set 2: 11, 12, 13, 14, 15, 16, 17, 18, 19, 100; Mean = 23.5

  • The mean is a poor measure of central tendency in the presence of outliers.
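
As a minimal sketch in Python (the variable names are illustrative and not part of the notes), the effect of a single outlier on the mean can be checked with the standard library:

```python
from statistics import mean

# The two data sets from the example; they differ only in the last value.
data_set_1 = [11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
data_set_2 = [11, 12, 13, 14, 15, 16, 17, 18, 19, 100]

mean_1 = mean(data_set_1)  # sum 155 divided by n = 10
mean_2 = mean(data_set_2)  # sum 235 divided by n = 10

print(mean_1, mean_2)
```

Replacing one value (20 with 100) shifts the mean by 8 units even though nine of the ten values are unchanged.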

The Median

  • In an ordered array, the median is the “middle” number (50% above, 50% below).
  • Less sensitive than the mean to extreme values.
Example
    • Data Set 1: 11, 12, 13, 14, 15, 16, 17, 18, 19, 20; Median = 15.5
    • Data Set 2: 11, 12, 13, 14, 15, 16, 17, 18, 19, 100; Median = 15.5
Locating the Median
  • Sort the values in numerical order (smallest to largest).
  • Use the formula to find the position of the middle number: \frac{n+1}{2}
    • If n is odd, the median is the middle number.
    • If n is even, the median is the average of the two middle numbers.
Example
    • Ranked values: 29, 31, 35, 39, 39, 40, 43, 44, 44, 52 (n = 10)
    • Median position: \frac{10+1}{2} = 5.5
    • Median = \frac{39 + 40}{2} = 39.5
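
The locating procedure above can be sketched in Python as follows. The function name `median_by_position` is hypothetical; in practice `statistics.median` from the standard library does the same job.

```python
def median_by_position(values):
    """Locate the median using the (n + 1) / 2 position rule."""
    ranked = sorted(values)      # step 1: smallest to largest
    n = len(ranked)
    position = (n + 1) / 2       # 1-based position of the middle number
    if n % 2 == 1:
        # Odd n: the position is a whole number, so take that value.
        return ranked[int(position) - 1]
    # Even n: the position ends in .5, so average the two neighbours.
    lower = ranked[n // 2 - 1]
    upper = ranked[n // 2]
    return (lower + upper) / 2


ranked_values = [29, 31, 35, 39, 39, 40, 43, 44, 44, 52]
median_value = median_by_position(ranked_values)
print(median_value)
```

For n = 10 the position is 5.5, so the function averages the 5th and 6th ranked values (39 and 40).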

The Mode

  • Mode is the value that occurs most often.
  • Not affected by extreme values.
  • Used for either numerical or categorical data.
  • There may be several modes in a data set or no mode at all.
Example
    • Data Set 1: 29, 31, 35, 39, 39, 40, 43, 44, 44, 52; Modes = 39 and 44
    • Data Set 2: No Mode
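
A small sketch of finding modes in Python. The helper `find_modes` is hypothetical, and it assumes the convention used in these notes that a data set in which no value repeats has no mode.

```python
from collections import Counter


def find_modes(values):
    """Return every value that occurs most often, or [] if nothing repeats."""
    if not values:
        return []
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []  # every value occurs once: treated here as "no mode"
    return sorted(v for v, c in counts.items() if c == highest)


data_set_1 = [29, 31, 35, 39, 39, 40, 43, 44, 44, 52]
print(find_modes(data_set_1))         # two modes
print(find_modes(["red", "blue", "red"]))  # also works for categorical data
```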

Review Example: House Prices

  • House Prices: $2,000,000, $500,000, $300,000, $100,000, $100,000
  • Mean: \frac{3,000,000}{5} = $600,000
  • Median: $300,000
  • Mode: $100,000
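
The same three measures can be checked with Python's `statistics` module; this is only a quick verification sketch, and the variable names are illustrative.

```python
from statistics import mean, median, mode

house_prices = [2_000_000, 500_000, 300_000, 100_000, 100_000]

price_mean = mean(house_prices)      # 3,000,000 / 5
price_median = median(house_prices)  # middle of the five ranked prices
price_mode = mode(house_prices)      # the only repeated price

print(price_mean, price_median, price_mode)
```

The single $2,000,000 house pulls the mean to twice the median, which is why the median is usually the more representative figure for prices like these.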

Which Measure to Choose?

  • The mean is generally used unless extreme values (outliers) exist.
  • When outliers exist, the median is often used instead, since it is far less sensitive to extreme values.
  • In many situations, it makes sense to report both the mean and the median.

Measures of Variation

  • Measures of variation give information on the spread or variability or dispersion of the data values.
  • Types:
    • Range
    • Variance
    • Standard Deviation
    • Coefficient of Variation

The Range

  • Simplest measure of variation.
  • Difference between the largest and the smallest values: Range = X_{largest} - X_{smallest}
Example
    • Data: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13
    • Range = 13 - 1 = 12

Why The Range Can Be Misleading

  • Does not account for how the data are distributed.
  • Sensitive to outliers.
Examples
    • 7, 8, 9, 10, 11, 12; Range = 12 - 7 = 5
    • 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 5; Range = 5 - 1 = 4
    • 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 120; Range = 120 - 1 = 119
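
A brief Python sketch of the three cases above (the helper name `value_range` is hypothetical):

```python
def value_range(values):
    """Range = largest value minus smallest value."""
    return max(values) - min(values)


evenly_spread = [7, 8, 9, 10, 11, 12]
# 11 ones, 8 twos, 4 threes, then 4 and 5: most values sit near 1.
concentrated = [1] * 11 + [2] * 8 + [3] * 4 + [4, 5]
# Same data with the last value replaced by an outlier.
with_outlier = [1] * 11 + [2] * 8 + [3] * 4 + [4, 120]

for data in (evenly_spread, concentrated, with_outlier):
    print(value_range(data))
```

The first two data sets have nearly the same range although they are distributed very differently, and a single outlier changes the range of the third from 4 to 119.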

The Sample Variance

  • Measures the average scatter around the mean.
  • Calculated as approximately the average of squared deviations of values from the mean (the sum is divided by n - 1 rather than n).
  • Formula: S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}
    • \bar{X} = arithmetic mean
    • n = sample size
    • X_i = ith value of the variable X

The Sample Standard Deviation

  • The square root of the variance.
  • Has the same units as that of the original sample data.
  • Neither the variance nor the standard deviation can ever be negative.
  • Sample standard deviation:
    S = \sqrt{S^2}
Steps for Computing Standard Deviation:
  1. Compute the difference between each value and the mean.
  2. Square each difference.
  3. Add the squared differences.
  4. Divide this total by n-1 to get the sample variance.
  5. Take the square root of the sample variance to get the sample standard deviation.
Calculation Example
  • Sample Data (X_i): 10, 12, 14, 15, 17, 18, 18, 24
  • n = 8
  • Mean = \bar{X} = 16
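
The five steps can be traced on this sample with a short Python sketch. The variable names are illustrative; `statistics.variance` and `statistics.stdev` would give the same results directly.

```python
import math

sample = [10, 12, 14, 15, 17, 18, 18, 24]
n = len(sample)
x_bar = sum(sample) / n                      # sample mean

deviations = [x - x_bar for x in sample]     # step 1: value minus mean
squared = [d ** 2 for d in deviations]       # step 2: square each difference
sum_of_squares = sum(squared)                # step 3: add the squares
sample_variance = sum_of_squares / (n - 1)   # step 4: divide by n - 1
sample_std = math.sqrt(sample_variance)      # step 5: square root

print(x_bar, sum_of_squares)
print(round(sample_variance, 4), round(sample_std, 4))
```

For this sample the squared deviations sum to 130, so S^2 = 130 / 7 and S is roughly 4.31.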

Comparing Standard Deviations

  • Data sets can have the same mean but different standard deviations.
  • The more spread out the observations are, the larger the standard deviation.

Summary Characteristics

  • The more the data are spread out, the greater the range, variance, and standard deviation.
  • The more the data are concentrated, the smaller the range, variance, and standard deviation.
  • If the values are all the same (no variation), all these measures will be zero.
  • None of these measures are ever negative.

The Coefficient of Variation

  • Measures relative variation.
  • Always in percentage (%).
  • Shows variation relative to mean.
  • Can be used to compare the variability of two or more sets of data measured in different units.
  • Formula:
    CV = \left(\frac{S}{\bar{X}}\right) \cdot 100\%
Comparing Coefficients of Variation

Example
* Stock A: Mean price = $50, Standard deviation = $5, CV = 10%
* Stock B: Mean price = $100, Standard deviation = $5, CV = 5%

  • Stock B is less variable relative to its mean price.

Example
* Stock A: Mean price = $50, Standard deviation = $5, CV = 10%
* Stock C: Mean price = $8, Standard deviation = $2, CV = 25%

  • Stock C has a much smaller standard deviation but a much higher coefficient of variation.
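
The stock comparisons above can be reproduced with a small Python sketch (the function name `coefficient_of_variation` is hypothetical):

```python
def coefficient_of_variation(std_dev, mean):
    """CV as a percentage: standard deviation relative to the mean."""
    return 100 * std_dev / mean


cv_a = coefficient_of_variation(5, 50)    # Stock A
cv_b = coefficient_of_variation(5, 100)   # Stock B
cv_c = coefficient_of_variation(2, 8)     # Stock C

print(cv_a, cv_b, cv_c)
```

Stocks A and B share the same standard deviation but differ in CV, while Stock C has the smallest standard deviation and the largest CV.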

Locating Extreme Outliers: Z-Score

  • An extreme value or outlier is a value located far away from the mean.

  • Z scores are useful in identifying outliers.

  • The Z score of a value is the difference between that data value and the mean, divided by the standard deviation.

  • Formula:
    Z_i = \frac{X_i - \bar{X}}{S}

    • X_i represents the ith data value
    • \bar{X} is the sample mean
    • S is the sample standard deviation
Example
    • Suppose a data value X_i = 10 comes from a data set with sample mean \bar{X} = 2 and sample standard deviation S = 4. Then the Z score for X_i is Z = \frac{10 - 2}{4} = 2
  • A Z score of 0 indicates that the data value is the same as the mean.

    • A positive Z score means the data value is above the mean; a negative Z score means it is below the mean. The magnitude is the distance from the mean in standard deviations.
  • The Z-score is the number of standard deviations a data value is away from the mean.

  • A data value is considered an extreme outlier if its Z-score is less than -3.0 or greater than +3.0.

  • The larger the absolute value of the Z-score, the farther the data value is from the mean.

Example
* Suppose the mean math SAT score is 490, with a standard deviation of 100.
* Compute the Z-score for a test score of 620 and determine if it is an outlier.
* Z = \frac{620-490}{100} = 1.3

  • A score of 620 is 1.3 standard deviations above the mean and would not be considered an outlier.
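
A minimal Python sketch of the Z-score calculation and the ±3.0 outlier check. The function names and the `threshold` parameter are illustrative assumptions, not part of the notes.

```python
def z_score(value, mean, std_dev):
    """Number of standard deviations that value lies from the mean."""
    return (value - mean) / std_dev


def is_extreme_outlier(z, threshold=3.0):
    """Apply the rule: Z below -3.0 or above +3.0 is an extreme outlier."""
    return abs(z) > threshold


z_sat = z_score(620, 490, 100)
print(z_sat, is_extreme_outlier(z_sat))
```

With the same mean and standard deviation, a score would have to exceed 790 or fall below 190 to be flagged by this rule.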

Shape of a Distribution

  • Describes how data are distributed.
  • In a symmetrical distribution, the values below the mean are distributed exactly as the values above the mean.
  • Two useful shape related statistics are:
    • Skewness: Measures the extent to which data values are not symmetrical.
    • Kurtosis (not test material): Measures the peakedness of the curve of the distribution.

Shape of a Distribution (Skewness)

  • Measures the extent to which data are not symmetrical.
  • Left-Skewed: Mean < Median, Skewness Statistic < 0
  • Symmetric: Mean = Median, Skewness Statistic = 0
  • Right-Skewed: Median < Mean, Skewness Statistic > 0
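
As a rough sketch, the mean-versus-median comparison above can be coded as a rule of thumb. This does not compute the skewness statistic itself; the function name `describe_skew` and the `tolerance` parameter are hypothetical.

```python
from statistics import mean, median


def describe_skew(values, tolerance=1e-9):
    """Classify shape by comparing mean and median (rule of thumb only)."""
    difference = mean(values) - median(values)
    if abs(difference) <= tolerance:
        return "symmetric"
    return "right-skewed" if difference > 0 else "left-skewed"


# Data with one large value: the mean is pulled above the median.
print(describe_skew([11, 12, 13, 14, 15, 16, 17, 18, 19, 100]))
print(describe_skew([1, 2, 3, 4, 5]))
```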

General Descriptive Stats Using Microsoft Excel Functions

  • =AVERAGE(range)
  • =MEDIAN(range)
  • =MODE.SNGL(range)
  • =STDEV.S(range)
  • =VAR.S(range)
  • =STANDARDIZE(x, mean, standard_dev) (returns the Z score)

Numerical Descriptive Measures for a Population

  • Descriptive statistics discussed previously described a sample, not the population.
  • Summary measures describing a population, called parameters, are denoted with Greek letters.
  • Important descriptive population parameters are the population mean, population variance, and population standard deviation.

Sample statistics versus population parameters

Measure              Population Parameter   Sample Statistic
Mean                 µ                      \bar{X}
Variance             σ^2                    S^2
Standard Deviation   σ                      S
Proportion (Ch. 7)   π                      p

The mean µ

  • The population mean is the sum of the values in the population divided by the population size, N.
    μ = \frac{\sum_{i=1}^{N} X_i}{N}
    • μ = population mean
    • N = population size
    • X_i = ith value of the variable X

The Variance σ^2

  • Average of squared deviations of values from the mean.
  • Population variance: σ^2 = \frac{\sum_{i=1}^{N} (X_i - μ)^2}{N}
    • μ = population mean
    • N = population size
    • X_i = ith value of the variable X

The Standard Deviation σ

  • Most commonly used measure of variation.
  • Shows variation about the mean.
  • Is the square root of the population variance.
  • Has the same units as the original data.
  • Population standard deviation:
    σ = \sqrt{σ^2}
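
The population formulas differ from the sample formulas only in the divisor (N instead of n - 1). The sketch below assumes, purely for illustration, that the eight values form an entire population; the variable names are hypothetical.

```python
from math import sqrt

# Assumption for illustration: these eight values are the whole population.
population = [10, 12, 14, 15, 17, 18, 18, 24]
N = len(population)

mu = sum(population) / N                                  # population mean
sum_of_squares = sum((x - mu) ** 2 for x in population)   # squared deviations
population_variance = sum_of_squares / N                  # divide by N
population_std = sqrt(population_variance)

# For comparison: the sample formula divides the same sum by N - 1.
sample_style_variance = sum_of_squares / (N - 1)

print(mu, population_variance, round(population_std, 4))
print(round(sample_style_variance, 4))
```

Dividing by N always gives a slightly smaller variance than dividing by N - 1 on the same data, and the gap shrinks as the number of values grows.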

The Empirical Rule

  • The empirical rule approximates the variation of data that are in a symmetric bell-shaped distribution.
  • Approximately 68% of the data in a symmetric bell-shaped distribution lies within one standard deviation of the mean, or µ ± 1σ.
  • Approximately 95% of the data in a symmetric bell-shaped distribution lies within two standard deviations of the mean, or µ ± 2σ.
  • Approximately 99.7% of the data in a symmetric bell-shaped distribution lies within three standard deviations of the mean, or µ ± 3σ.

Using the Empirical Rule

  • Suppose that the variable Math SAT scores is bell-shaped with a mean of 500 and a standard deviation of 90. Then:
  • Approximately 68% of all test takers scored between 410 and 590 (500 ± 90).
  • Approximately 95% of all test takers scored between 320 and 680 (500 ± 180).
  • Approximately 99.7% of all test takers scored between 230 and 770 (500 ± 270).
  • The empirical rule helps measure how the values distribute above and below the mean and can help identify outliers.
  • You can consider values not found in the interval µ ± 3σ as outliers.
  • Note: this rule also applies to bell-shaped sample data sets (i.e., \bar{X} ± 1S contains about 68% of the data, \bar{X} ± 2S about 95%, and \bar{X} ± 3S about 99.7%).
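
The SAT intervals above follow mechanically from the mean and standard deviation. A short Python sketch, with a hypothetical helper name, assuming the data are bell-shaped:

```python
def empirical_rule_intervals(mean, std_dev):
    """Return {k: (lower, upper, approx_percent)} for k = 1, 2, 3."""
    approx_percent = {1: 68.0, 2: 95.0, 3: 99.7}
    return {
        k: (mean - k * std_dev, mean + k * std_dev, pct)
        for k, pct in approx_percent.items()
    }


sat_intervals = empirical_rule_intervals(500, 90)
for k, (lower, upper, pct) in sat_intervals.items():
    print(f"about {pct}% of scores lie between {lower} and {upper} (k = {k})")
```

The percentages are approximations that hold only for symmetric bell-shaped data; the function does not check the shape of any actual data set.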

Measures Of The Relationship Between Two Numerical Variables

  • The Covariance
  • The Coefficient of Correlation

The Covariance

  • The covariance measures the direction of the linear relationship between two numerical variables (X & Y).
  • The sample covariance: cov(X,Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n-1}
    • n = number of pairs
  • Only concerned with the directional relationship.
  • No causal effect is implied.
Interpreting Covariance
  • cov(X,Y) > 0: X and Y tend to move in the same direction.
  • cov(X,Y) < 0: X and Y tend to move in opposite directions.
  • cov(X,Y) = 0: X and Y have no linear relationship (zero covariance alone does not establish that they are independent).
  • The covariance has a major flaw: It is not possible to determine the relative strength of the relationship from the size of the covariance.
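
A minimal sketch of the sample covariance formula in Python. The function name and the small data sets are hypothetical, chosen only to show the sign behaviour and the scale problem.

```python
def sample_covariance(xs, ys):
    """Sum of products of paired deviations, divided by n - 1."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must contain the same number of values")
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    products = [(x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)]
    return sum(products) / (n - 1)


# Hypothetical paired data, for illustration only.
x = [1, 2, 3, 4, 5]
y_same_direction = [2, 4, 6, 8, 10]
y_opposite = [10, 8, 6, 4, 2]

cov_pos = sample_covariance(x, y_same_direction)
cov_neg = sample_covariance(x, y_opposite)
# Rescaling Y (e.g. changing its units) rescales the covariance too.
cov_rescaled = sample_covariance(x, [100 * v for v in y_same_direction])

print(cov_pos, cov_neg, cov_rescaled)
```

The rescaled case shows the flaw noted above: the relationship is exactly as strong as before, yet the covariance is 100 times larger.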

Coefficient of Correlation

  • Measures the relative strength of the linear relationship between two numerical variables.
  • Sample coefficient of correlation: r = \frac{cov(X,Y)}{S_X S_Y}
    • Where S_X and S_Y are the sample standard deviations of X and Y.
  • No causal effect is implied.
Features of the Coefficient of Correlation
  • The population coefficient of correlation is referred to as ρ.
  • The sample coefficient of correlation is referred to as r.
  • Both ρ and r have the following features:
    • Unit free
    • Range between –1 and 1
    • The closer to –1, the stronger the negative linear relationship.
    • The closer to 1, the stronger the positive linear relationship.
    • The closer to 0, the weaker the linear relationship.
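
The sample coefficient of correlation can be sketched in Python by dividing the sample covariance by the product of the two sample standard deviations. All function names and the paired scores below are hypothetical, for illustration only.

```python
from math import sqrt


def sample_std(values):
    """Sample standard deviation (divisor n - 1)."""
    n = len(values)
    m = sum(values) / n
    return sqrt(sum((v - m) ** 2 for v in values) / (n - 1))


def sample_covariance(xs, ys):
    """Sample covariance of paired values (divisor n - 1)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)


def correlation(xs, ys):
    """r = cov(X, Y) / (S_X * S_Y); unit free, between -1 and 1."""
    return sample_covariance(xs, ys) / (sample_std(xs) * sample_std(ys))


# Hypothetical paired scores.
score_1 = [1, 2, 3, 4, 5]
score_2 = [1, 3, 2, 5, 4]

r = correlation(score_1, score_2)
r_rescaled = correlation(score_1, [100 * v for v in score_2])
print(round(r, 3), round(r_rescaled, 3))
```

Unlike the covariance, r is unchanged (up to floating-point rounding) when one variable is rescaled, which is what makes it usable as a measure of relative strength.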
Interpreting the Coefficient of Correlation

Example
* r = 0.733

  • There is a relatively strong positive linear relationship between test score #1 and test score #2.
  • Students who scored high on the first test tended to score high on the second test.