Numerical Descriptive Statistics Notes

Numerical Measures
  • Sample Statistics vs. Population Parameters:

    • Measures computed from sample data are called sample statistics. These are used to estimate population parameters when it's not feasible to analyze the entire population.

    • Measures computed from population data are called population parameters. Because they are calculated from every member of the population, they describe it without sampling error.

    • A sample statistic serves as a point estimator for the corresponding population parameter. The accuracy of this estimation depends on the sample size and variability.

Measures of Location - Data Centrality
  • Mean:

    • Population Mean: \( \mu = \frac{\Sigma X}{N} \), where \( \mu \) is the population mean, \( \Sigma X \) is the sum of all values in the population, and \( N \) is the number of values in the population.

    • Sample Mean: \( \bar{x} = \frac{\Sigma X}{n} \), where \( \bar{x} \) is the sample mean, \( \Sigma X \) is the sum of all values in the sample, and \( n \) is the number of values in the sample.

    • Excel Function: =AVERAGE(cell range)

    • The sample mean \( \bar{x} \) is the point estimator of the population mean \( \mu \). It is an unbiased estimator: across many repeated samples, the average of the sample means equals the population mean, even though any single sample mean will usually differ from it.

    • Example: Monthly Starting Salary (Graduates 1-12)

      • Salaries: $3850, $3950, $4050, $3880, $3755, $3710, $3890, $4130, $3940, $4325, $3920, $3880

      • \( \Sigma x = 47280 \)

      • \( \bar{x} = \frac{47280}{12} = 3940 \)
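
The salary calculation above can be reproduced in a few lines of Python. This is a minimal sketch that uses only the standard library; the variable names are illustrative and do not come from the notes.

```python
from statistics import mean

# Monthly starting salaries for graduates 1-12, copied from the notes.
salaries = [3850, 3950, 4050, 3880, 3755, 3710,
            3890, 4130, 3940, 4325, 3920, 3880]

total = sum(salaries)                 # the sum of all sample values
sample_mean = total / len(salaries)   # divide by n, the sample size

# statistics.mean performs the same calculation in one call.
assert sample_mean == mean(salaries)

print(f"sum = {total}, n = {len(salaries)}, mean = {sample_mean}")
```

This mirrors what =AVERAGE(cell range) does in Excel for the same twelve values.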

  • Median:

    • The median is the middle value when data is arranged in ascending order. It divides the data set into two equal parts.

    • Preferred when there are extreme values in the data set. The median is less sensitive to outliers than the mean.

    • Commonly used for annual income and property value data because these data sets often contain extreme values.

    • Excel Function: =MEDIAN(cell range)

    • Odd Number of Observations:

      • Example: 7 observations: 26, 18, 27, 12, 14, 27, 19

      • Ascending order: 12, 14, 18, 19, 26, 27, 27

      • Median = 19

    • Even Number of Observations:

      • Example: 8 observations

      • Median is the average of the middle two values.

      • Example: Median = \( \frac{19 + 26}{2} = 22.5 \)
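
Both median cases can be checked with Python's standard library. The seven observations come from the notes. The notes do not list the eight observations, so the eighth value (30) below is a hypothetical addition chosen only so that the two middle values are 19 and 26.

```python
from statistics import median

# Seven observations from the notes (odd count).
odd_sample = [26, 18, 27, 12, 14, 27, 19]

# Hypothetical eighth value, added only to illustrate the even case.
even_sample = odd_sample + [30]

median_odd = median(odd_sample)    # middle value of the sorted data
median_even = median(even_sample)  # average of the two middle values

print(sorted(odd_sample), "->", median_odd)
print(sorted(even_sample), "->", median_even)
```

The function sorts the data internally, so the values do not need to be arranged in ascending order first.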

Measures of Variability - Data Dispersion
  • Range:

    • The difference between the largest and smallest data values. It provides a basic understanding of data spread.

    • Range = Largest value – Smallest value

    • Simplest measure of variability but highly affected by outliers.

    • Example: Monthly Starting Salary

      • Smallest value: $3710

      • Largest value: $4325

      • Range = $4325 - $3710 = $615
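
As a quick check of the range arithmetic, here is a short Python sketch using the same salary data; the variable names are illustrative.

```python
# Monthly starting salaries for graduates 1-12, copied from the notes.
salaries = [3850, 3950, 4050, 3880, 3755, 3710,
            3890, 4130, 3940, 4325, 3920, 3880]

smallest = min(salaries)
largest = max(salaries)
data_range = largest - smallest  # uses only the two extreme values

print(f"range = {largest} - {smallest} = {data_range}")
```

Because only the minimum and maximum enter the calculation, a single extreme salary would change the range while leaving the other ten values irrelevant.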

  • Variance:

    • Measures variability using all of the data, not just the extreme values.

    • Based on the squared differences between each data value and the mean. For a population it is the average of those squared differences; for a sample, the sum of squared differences is divided by n - 1 instead of n. Squaring keeps every term non-negative, so deviations above and below the mean do not cancel.

    • Formula:

      • Sample Variance: \( s^2 = \frac{\Sigma (x_i - \bar{x})^2}{n - 1} \)

      • Population Variance: \( \sigma^2 = \frac{\Sigma (x_i - \mu)^2}{N} \)
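
To make the variance calculation concrete, the sketch below applies the standard definitions to the salary data from the notes: the sum of squared deviations is divided by n - 1 for a sample and by N for a population. Treating the same twelve salaries as a full population is an assumption made here purely for comparison.

```python
from statistics import mean, pvariance, variance

# Monthly starting salaries for graduates 1-12, copied from the notes.
salaries = [3850, 3950, 4050, 3880, 3755, 3710,
            3890, 4130, 3940, 4325, 3920, 3880]

x_bar = mean(salaries)
squared_deviations = [(x - x_bar) ** 2 for x in salaries]
sum_sq = sum(squared_deviations)

sample_variance = sum_sq / (len(salaries) - 1)   # divide by n - 1
population_variance = sum_sq / len(salaries)     # divide by N (illustration only)

# The library functions should agree with the manual calculation.
assert abs(sample_variance - variance(salaries)) < 1e-6
assert abs(population_variance - pvariance(salaries)) < 1e-6

print(f"sum of squared deviations = {sum_sq}")
print(f"sample variance = {sample_variance:.2f} (dollars squared)")
```

Note the units: the variance of salaries is measured in squared dollars, which is why it is hard to interpret directly.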

  • Standard Deviation:

    • Positive square root of the variance. It measures the spread of data around the mean.

    • Measured in the same units as the data, which makes it easier to interpret than the variance.

    • Formula:

      • Sample Standard Deviation: \( s = \sqrt{s^2} \)

      • Population Standard Deviation: \( \sigma = \sqrt{\sigma^2} \)

    • Pizza Delivery Example:

      • Pizza A: Average delivery time = 20 minutes, standard deviation = 10 minutes.

      • Pizza B: Average delivery time = 20 minutes, standard deviation = 5 minutes.

      • Lower standard deviation (Pizza B) indicates more consistent delivery times. This means Pizza B's delivery times are more predictable.
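
The pizza example gives only summary figures, so the sketch below uses the salary data from earlier in the notes to show how a sample standard deviation is obtained as the positive square root of the sample variance. Variable names are illustrative.

```python
from math import sqrt
from statistics import stdev, variance

# Monthly starting salaries for graduates 1-12, copied from the notes.
salaries = [3850, 3950, 4050, 3880, 3755, 3710,
            3890, 4130, 3940, 4325, 3920, 3880]

sample_variance = variance(salaries)   # in dollars squared
sample_sd = sqrt(sample_variance)      # back in dollars

# statistics.stdev computes the same quantity directly.
assert abs(sample_sd - stdev(salaries)) < 1e-9

print(f"sample standard deviation = {sample_sd:.2f} dollars")
```

Taking the square root returns the measure to the original units (dollars), which is what makes statements like "delivery times vary by about 5 minutes" possible.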

  • Coefficient of Variation:

    • Indicates how large the standard deviation is in relation to the mean. It is a dimensionless number, often expressed as a percentage.

    • Allows comparison of variables with different units of measurement. For example, comparing the variability of heights (in cm) and weights (in kg).

    • Formula: \( CV = \frac{s}{\bar{x}} \times 100\% \) for a sample (use \( \sigma \) and \( \mu \) for a population)

    • Examples (Excel): Pizza Restaurants, Household Expenditures, Fuel Prices
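
As a small illustration, the coefficient of variation can be computed for the two pizza restaurants described earlier, using the standard definition (standard deviation divided by the mean, times 100). The helper name `coefficient_of_variation` is hypothetical.

```python
def coefficient_of_variation(std_dev, mean):
    """Return the standard deviation as a percentage of the mean."""
    return std_dev / mean * 100


# Summary figures from the pizza delivery example (minutes).
cv_pizza_a = coefficient_of_variation(std_dev=10, mean=20)
cv_pizza_b = coefficient_of_variation(std_dev=5, mean=20)

print(f"Pizza A: CV = {cv_pizza_a:.1f}%")
print(f"Pizza B: CV = {cv_pizza_b:.1f}%")
```

Because both restaurants have the same mean, the coefficient of variation ranks them the same way the standard deviation does. It becomes more useful when the means or the units differ.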

Measures of Distribution Shape, Relative Location, and Detecting Outliers
  • z-Scores:

    • Also called the standardized value. It gives the number of standard deviations a data value lies from the mean. A z-score of 1 means the value is one standard deviation above the mean.

    • \( z_i = \frac{x_i - \bar{x}}{s} \), where \( x_i \) is the individual data value, \( \bar{x} \) is the sample mean, and \( s \) is the sample standard deviation.

    • A data value less than the sample mean will have a z-score less than zero. This indicates the value is below average.

    • A data value greater than the sample mean will have a z-score greater than zero. This indicates the value is above average.

    • A data value equal to the sample mean will have a z-score of zero. This indicates the value is exactly at the average.

    • Examples:

      • SAT, ACT, and GRE scores: Standardizing scores allows for comparison across different tests.

      • Baby’s weight in comparison to her cohort: Assessing if a baby's weight is within the normal range.

    • Compare Different Types of Variables

      • Compare the weights of an apple and an orange, each relative to its own kind.

      • A 110-gram apple and a 100-gram orange

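      • Apples:

        • Mean weight = 100 grams

        • Standard deviation = 15 grams

        • \( z_{apple} = \frac{110 - 100}{15} = 0.667 \)

      • Oranges:

        • Mean weight = 140 grams

        • Standard deviation = 25 grams

        • \( z_{orange} = \frac{100 - 140}{25} = -1.6 \)

      • Relative to its own kind, the orange is the more unusual fruit: it lies 1.6 standard deviations below the mean orange weight, while the apple lies about 0.67 standard deviations above the mean apple weight.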
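
The apple and orange comparison can be sketched in Python as follows. The helper name `z_score` is illustrative; the means and standard deviations are the ones given in the notes.

```python
def z_score(value, mean, std_dev):
    """Number of standard deviations that value lies from mean."""
    return (value - mean) / std_dev


# Figures from the notes (grams).
z_apple = z_score(110, mean=100, std_dev=15)
z_orange = z_score(100, mean=140, std_dev=25)

print(f"apple:  z = {z_apple:.3f}")
print(f"orange: z = {z_orange:.3f}")

# The larger absolute z-score marks the more unusual item for its kind.
more_unusual = "orange" if abs(z_orange) > abs(z_apple) else "apple"
print(f"more unusual for its kind: {more_unusual}")
```

Standardizing removes the units and the differing scales, which is what allows the two fruits to be compared at all.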

  • Detecting Outliers:

    • Data values with z-scores beyond +/- 3 (more than 3 standard deviations from the mean) are unusual and should be treated as potential outliers. Outliers can significantly skew the results of statistical analyses, so they should be reviewed for accuracy.
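
One way to apply this screening is to compute a z-score for every observation and flag those beyond +/- 3. The sketch below does this for the salary data from the notes; the cutoff of 3 follows the rule of thumb stated above, and the variable names are illustrative.

```python
from statistics import mean, stdev

# Monthly starting salaries for graduates 1-12, copied from the notes.
salaries = [3850, 3950, 4050, 3880, 3755, 3710,
            3890, 4130, 3940, 4325, 3920, 3880]

x_bar = mean(salaries)
s = stdev(salaries)  # sample standard deviation

z_scores = [(x - x_bar) / s for x in salaries]

# Flag any value more than 3 standard deviations from the mean.
outliers = [x for x, z in zip(salaries, z_scores) if abs(z) > 3]

print(f"largest z-score: {max(z_scores):.2f}")
print(f"potential outliers: {outliers}")
```

For this data set the highest salary, $4325, has a z-score of about 2.3, so nothing is flagged under the 3-standard-deviation cutoff.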

  • Empirical Rule:

    • For data with an approximately normal (bell-shaped) distribution:

      • About 68.3% of the data falls within +/- 1 standard deviation of the mean.

      • About 95.4% of the data falls within +/- 2 standard deviations of the mean.

      • About 99.7% of the data falls within +/- 3 standard deviations of the mean, so values outside this range are rare.
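
These percentages come from the normal distribution itself and can be checked with `statistics.NormalDist` from the Python standard library. The helper name `share_within` is illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)


def share_within(k):
    """Share of a normal distribution within +/- k standard deviations."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


shares = {k: share_within(k) for k in (1, 2, 3)}

for k, share in shares.items():
    print(f"within +/- {k} standard deviations: {share:.4f}")
# shares are approximately 0.6827, 0.9545, and 0.9973
```

The rule only applies when the data are roughly bell-shaped, so it is worth checking a histogram before relying on these figures.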

Mini Workshops
  • Mini Workshop 1:

    • DCManager: Purchase Orders; lead times, fill rates, and items damaged

    • Determine the mean and median metrics for each of the attributes. This helps understand central tendencies.

    • Determine the variance and standard deviation for each of the attributes. What are the units? This quantifies the spread of the data.

    • Which variable of the purchase orders has the most variability? Understanding this helps in identifying areas of inconsistency.

  • Mini Workshop 2:

    • Weather: Which city experiences more fluctuation in temperature?

    • Which metric should be used? The standard deviation is appropriate here.

  • Mini Workshop 3:

    • DCManager: Use the lead times attribute.

    • Is it appropriate to use the empirical rule to interpret the spread in this data set? Why or why not? Check if the data is normally distributed.

    • If so, use the empirical rule to help understand the spread in the data set of sample lead times. This provides insights into expected lead time ranges.

  • Mini Workshop 4:

    • Autos: Random sample of auto weights and fuel economy (mpg).

    • Develop an appropriate visualization. Scatter plots can be useful here.

    • Determine the standard deviation for each variable. This provides insights into data spread.

    • Determine the correlation coefficient and interpret it. This helps in understanding the relationship between auto weight and mpg.