Numerical Descriptive Statistics Notes
Numerical Measures
Sample Statistics vs. Population Parameters:
Measures computed from sample data are called sample statistics. These are used to estimate population parameters when it's not feasible to analyze the entire population.
Measures computed from population data are called population parameters. Population parameters provide a complete and accurate description of the entire population.
A sample statistic serves as a point estimator for the corresponding population parameter. The accuracy of this estimation depends on the sample size and variability.
Measures of Location - Data Centrality
Mean:
Population Mean: $\mu = \frac{\sum x_i}{N}$, where $\mu$ is the population mean, $\sum x_i$ is the sum of all values in the population, and $N$ is the number of values in the population.
Sample Mean: $\bar{x} = \frac{\sum x_i}{n}$, where $\bar{x}$ is the sample mean, $\sum x_i$ is the sum of all values in the sample, and $n$ is the number of values in the sample.
Excel Function:
=AVERAGE(cell range)
The sample mean $\bar{x}$ is the point estimator of the population mean $\mu$. It is an unbiased estimator: across many repeated samples, the average of the sample means equals the population mean.
Example: Monthly Starting Salary (Graduates 1-12)
Salaries: $3850, $3950, $4050, $3880, $3755, $3710, $3890, $4130, $3940, $4325, $3920, $3880
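As a quick check of the sample mean for this example, here is a minimal Python sketch. The function name sample_mean is just an illustrative choice; the salary figures are the twelve values listed above.

```python
# Monthly starting salaries for graduates 1-12, taken from the example above.
salaries = [3850, 3950, 4050, 3880, 3755, 3710,
            3890, 4130, 3940, 4325, 3920, 3880]


def sample_mean(values):
    """Return the sum of the values divided by the number of values."""
    return sum(values) / len(values)


mean_salary = sample_mean(salaries)
print(f"Sum of salaries: ${sum(salaries):,}")   # Sum of salaries: $47,280
print(f"Sample mean: ${mean_salary:,.2f}")      # Sample mean: $3,940.00
```

This should agree with =AVERAGE applied to the same twelve cells.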
Median:
The median is the middle value when data is arranged in ascending order. It divides the data set into two equal parts.
Preferred when there are extreme values in the data set. The median is less sensitive to outliers than the mean.
Commonly used for annual income and property value data because these data sets often contain extreme values.
Excel Function:
=MEDIAN(cell range)
Odd Number of Observations:
Example: 7 observations: 26, 18, 27, 12, 14, 27, 19
Ascending order: 12, 14, 18, 19, 26, 27, 27
Median = 19
Even Number of Observations:
Example: 8 observations
Median is the average of the middle two values.
Example: Median =
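The odd and even cases can be sketched in Python as follows. The function name median is an illustrative choice. Because the eight-observation data set is not reproduced in these notes, the even case below uses the twelve Monthly Starting Salary values instead.

```python
def median(values):
    """Middle value of the sorted data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # odd n: single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # even n: average the middle pair


# Odd case: the 7 observations from the example above.
odd_data = [26, 18, 27, 12, 14, 27, 19]
median_odd = median(odd_data)

# Even case: the 12 monthly starting salaries (middle pair is 3890 and 3920).
salaries = [3850, 3950, 4050, 3880, 3755, 3710,
            3890, 4130, 3940, 4325, 3920, 3880]
median_even = median(salaries)

print(median_odd)   # 19
print(median_even)  # 3905.0
```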
Measures of Variability - Data Dispersion
Range:
The difference between the largest and smallest data values. It provides a basic understanding of data spread.
Range = Largest value - Smallest value
Simplest measure of variability but highly affected by outliers.
Example: Monthly Starting Salary
Smallest value: $3710
Largest value: $4325
Range = $4325 - $3710 = $615
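The same calculation as a short Python sketch. The extra $10,000 salary in the second part is hypothetical, added only to illustrate how a single extreme value changes the range.

```python
salaries = [3850, 3950, 4050, 3880, 3755, 3710,
            3890, 4130, 3940, 4325, 3920, 3880]

# Range = largest value - smallest value
salary_range = max(salaries) - min(salaries)
print(salary_range)  # 615

# Hypothetical: one extreme salary of $10,000 (not part of the example data).
with_outlier = salaries + [10000]
range_with_outlier = max(with_outlier) - min(with_outlier)
print(range_with_outlier)  # 6290
```

One added value raises the range from $615 to $6,290, even though the other twelve salaries are unchanged.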
Variance:
Measures variability using all of the data, not just the largest and smallest values.
Based on the squared differences between each data value and the mean. Squaring keeps positive and negative differences from canceling each other out. The population variance averages the squared differences over N values; the sample variance divides their sum by n - 1.
Formula:
Sample Variance: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
Population Variance: $\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$
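A minimal Python sketch of the usual definitions, applied to the Monthly Starting Salary data: the sample variance divides the sum of squared deviations by n - 1, and the population variance divides it by N. Treating the twelve salaries as a whole population in the second calculation is an assumption made only to show the difference in divisors.

```python
import math
import statistics

salaries = [3850, 3950, 4050, 3880, 3755, 3710,
            3890, 4130, 3940, 4325, 3920, 3880]

n = len(salaries)
mean = sum(salaries) / n

# Squared difference between each data value and the mean.
squared_devs = [(x - mean) ** 2 for x in salaries]
sum_sq = sum(squared_devs)

sample_variance = sum_sq / (n - 1)   # divisor n - 1
population_variance = sum_sq / n     # divisor N (salaries treated as a population)

# Cross-check against the standard library implementations.
assert math.isclose(sample_variance, statistics.variance(salaries))
assert math.isclose(population_variance, statistics.pvariance(salaries))

print(f"Sum of squared deviations: {sum_sq:,.0f}")
print(f"Sample variance: {sample_variance:,.2f}")
print(f"Population variance: {population_variance:,.2f}")
```

The units are dollars squared, which is why the variance is hard to interpret directly.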
Standard Deviation:
Positive square root of the variance. It measures the spread of data around the mean.
Measured in the same units as the data (for example, dollars rather than dollars squared), which makes it easier to interpret than the variance.
Formula:
Sample Standard Deviation: $s = \sqrt{s^2} = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$
Population Standard Deviation: $\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum (x_i - \mu)^2}{N}}$
Pizza Delivery Example:
Pizza A: Average delivery time = 20 minutes, standard deviation = 10 minutes.
Pizza B: Average delivery time = 20 minutes, standard deviation = 5 minutes.
Lower standard deviation (Pizza B) indicates more consistent delivery times. This means Pizza B's delivery times are more predictable.
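The pizza example gives only summary figures, so the sketch below uses the Monthly Starting Salary data to show the standard deviation as the positive square root of the variance. It relies on Python's standard statistics module; treating the salaries as a population in the second call is an assumption for illustration.

```python
import math
import statistics

salaries = [3850, 3950, 4050, 3880, 3755, 3710,
            3890, 4130, 3940, 4325, 3920, 3880]

# Sample standard deviation (divisor n - 1) and population version (divisor N).
sample_sd = statistics.stdev(salaries)
population_sd = statistics.pstdev(salaries)

# Each is the positive square root of the matching variance.
assert math.isclose(sample_sd, math.sqrt(statistics.variance(salaries)))
assert math.isclose(population_sd, math.sqrt(statistics.pvariance(salaries)))

print(f"Sample standard deviation: ${sample_sd:,.2f}")
print(f"Population standard deviation: ${population_sd:,.2f}")
```

Both results are in dollars, the same units as the salaries themselves.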
Coefficient of Variation:
Indicates how large the standard deviation is in relation to the mean. It is a dimensionless number, often expressed as a percentage.
Allows comparison of variables with different units of measurement. For example, comparing the variability of heights (in cm) and weights (in kg).
Formula: $CV = \frac{s}{\bar{x}} \times 100\%$ for a sample, or $CV = \frac{\sigma}{\mu} \times 100\%$ for a population
Examples (Excel): Pizza Restaurants, Household Expenditures, Fuel Prices
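As a rough illustration, the coefficient of variation (standard deviation divided by the mean, expressed as a percentage) can be computed from the summary figures in the Pizza Delivery Example. The function name coefficient_of_variation is an illustrative choice.

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation expressed as a percentage of the mean."""
    return std_dev / mean * 100


# Summary figures from the Pizza Delivery Example (minutes).
cv_pizza_a = coefficient_of_variation(std_dev=10, mean=20)
cv_pizza_b = coefficient_of_variation(std_dev=5, mean=20)

print(f"Pizza A: {cv_pizza_a:.1f}%")  # Pizza A: 50.0%
print(f"Pizza B: {cv_pizza_b:.1f}%")  # Pizza B: 25.0%
```

Because the units cancel, the same function can compare variables measured in different units, such as heights in cm and weights in kg.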
Measures of Distribution Shape, Relative Location, and Detecting Outliers
z-Scores:
Also called the standardized value.
Denotes the number of standard deviations a data value is from the mean. A z-score of 1 means the value is one standard deviation above the mean.
$z_i = \frac{x_i - \bar{x}}{s}$, where $x_i$ is the individual data value, $\bar{x}$ is the sample mean, and $s$ is the sample standard deviation.
A data value less than the sample mean will have a z-score less than zero. This indicates the value is below average.
A data value greater than the sample mean will have a z-score greater than zero. This indicates the value is above average.
A data value equal to the sample mean will have a z-score of zero. This indicates the value is exactly at the average.
Examples:
SAT, ACT, and GRE scores: Standardizing scores allows for comparison across different tests.
Baby’s weight in comparison to her cohort: Assessing if a baby's weight is within the normal range.
Compare Different Types of Variables
Compare the weights of an apple and an orange:
A 110-gram apple and a 100-gram orange
Apples:
Mean weight = 100 grams
Standard deviation = 15 grams
Oranges:
Mean weight = 140 grams
Standard deviation = 25 grams
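The apple and orange comparison can be worked through with a short Python sketch. The function name z_score is an illustrative choice; the means and standard deviations are the figures given above.

```python
def z_score(value, mean, std_dev):
    """Number of standard deviations that value lies from the mean."""
    return (value - mean) / std_dev


apple_z = z_score(110, mean=100, std_dev=15)   # 10 / 15
orange_z = z_score(100, mean=140, std_dev=25)  # -40 / 25

print(f"Apple z-score: {apple_z:.2f}")    # Apple z-score: 0.67
print(f"Orange z-score: {orange_z:.2f}")  # Orange z-score: -1.60
```

The apple is about two-thirds of a standard deviation heavier than a typical apple, while the orange is 1.6 standard deviations lighter than a typical orange, so the orange is the more unusual fruit for its kind.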
Detecting Outliers:
Data values with z-scores less than -3 or greater than +3 (that is, more than 3 standard deviations from the mean) are unusual and should be reviewed as potential outliers. Outliers can significantly skew the results of statistical analyses.
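One way to apply this screening rule is sketched below for the Monthly Starting Salary data. The cutoff of 3 is the usual rule of thumb described above, not a strict test, and the variable names are illustrative.

```python
import statistics

salaries = [3850, 3950, 4050, 3880, 3755, 3710,
            3890, 4130, 3940, 4325, 3920, 3880]

mean = statistics.mean(salaries)
sd = statistics.stdev(salaries)  # sample standard deviation

# z-score for every salary, then keep the values beyond +/- 3.
z_scores = [(x - mean) / sd for x in salaries]
outliers = [x for x, z in zip(salaries, z_scores) if abs(z) > 3]

largest_z = max(z_scores)
print(f"Largest z-score: {largest_z:.2f}")  # Largest z-score: 2.32
print(f"Potential outliers: {outliers}")    # Potential outliers: []
```

The highest salary, $4325, has a z-score of about 2.32, so no value in this sample is flagged.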
Empirical Rule:
For data that is approximately normally distributed:
About 68.3% of the data falls within +/- 1 standard deviation of the mean.
About 95.4% of the data falls within +/- 2 standard deviations of the mean.
About 99.7% of the data falls within +/- 3 standard deviations of the mean, so almost all data points fall within this range.
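These percentages can be reproduced from the normal distribution itself. The sketch below assumes an exactly normal distribution and uses the error function from Python's math module; the function name share_within is an illustrative choice.

```python
import math


def share_within(k):
    """Probability that a normal value lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))


shares = {k: share_within(k) for k in (1, 2, 3)}
for k, share in shares.items():
    print(f"Within +/- {k} standard deviation(s): {share:.1%}")
# Within +/- 1 standard deviation(s): 68.3%
# Within +/- 2 standard deviation(s): 95.4%
# Within +/- 3 standard deviation(s): 99.7%
```

The middle figure is about 95.45% before rounding. Real data sets are only approximately normal, so observed shares will differ somewhat from these values.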
Mini Workshops
Mini Workshop 1:
DCManager: Purchase Orders; lead times, fill rates, and items damaged
Determine the mean and median metrics for each of the attributes. This helps understand central tendencies.
Determine the variance and standard deviation for each of the attributes. What are the units? This quantifies the spread of the data.
Which variable of the purchase orders has the most variability? Understanding this helps in identifying areas of inconsistency.
Mini Workshop 2:
Weather: Which city experiences more fluctuation in temperature?
Which metric should be used? The standard deviation is appropriate here.
Mini Workshop 3:
DCManager: Use the lead times attribute.
Is it appropriate to use the empirical rule to interpret the spread in this data set? Why or why not? Check if the data is normally distributed.
If so, use the empirical rule to help understand the spread in the data set of sample lead times. This provides insights into expected lead time ranges.
Mini Workshop 4:
Autos: Random sample of auto weights and mpg values.
Develop an appropriate visualization. Scatter plots can be useful here.
Determine the standard deviation for each variable. This provides insights into data spread.
Determine the correlation coefficient and interpret it. This helps in understanding the relationship between auto weight and mpg.
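For the last step of Mini Workshop 4, a minimal Python sketch of the Pearson correlation coefficient is given below. It assumes the Pearson coefficient is the one intended, since the notes do not say which. The weights and mpg values are hypothetical placeholders, not the workshop's Autos data, and the function name pearson_r is illustrative.

```python
import math

# Hypothetical data for illustration only; NOT the workshop's Autos data set.
weights = [2500, 2800, 3100, 3400, 3700, 4000]  # pounds
mpg = [34, 31, 28, 25, 23, 20]                  # miles per gallon


def pearson_r(xs, ys):
    """Pearson correlation: co-deviation of x and y scaled by their spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


r = pearson_r(weights, mpg)
print(f"Correlation coefficient: {r:.3f}")
```

A value near -1 indicates a strong negative linear relationship: in this made-up data, heavier autos tend to have lower mpg. Values near 0 would indicate little linear relationship, and values near +1 a strong positive one.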