Numerical Descriptive Statistics Notes

Numerical Measures

Sample Statistics vs. Population Parameters:
- Measures computed from sample data are called sample statistics. These are used to estimate population parameters when it's not feasible to analyze the entire population.
- Measures computed from population data are called population parameters. Population parameters provide a complete and accurate description of the entire population.
- A sample statistic serves as a point estimator for the corresponding population parameter. The accuracy of this estimation depends on the sample size and variability.

Measures of Location - Data Centrality

Mean:
- Population Mean: $\mu = \frac{\Sigma X}{N}$, where $\mu$ is the population mean, $\Sigma X$ is the sum of all values in the population, and $N$ is the number of values in the population.
- Sample Mean: $\bar{x} = \frac{\Sigma X}{n}$, where $\bar{x}$ is the sample mean, $\Sigma X$ is the sum of all values in the sample, and $n$ is the number of values in the sample.
- Excel Function: =AVERAGE(cell range)
- The sample mean $\bar{x}$ is the point estimator of the population mean $\mu$. It is an unbiased estimator: a single sample mean will usually differ from $\mu$, but the average of the sample means over many repeated samples equals $\mu$.
- Example: Monthly Starting Salary (Graduates 1-12)
  - Salaries: $3850, $3950, $4050, $3880, $3755, $3710, $3890, $4130, $3940, $4325, $3920, $3880
  - $\Sigma x = 47280$
  - $\bar{x} = \frac{47280}{12} = 3940$
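The salary example can be checked with a few lines of Python. This is only a sketch of the arithmetic above; the variable names are illustrative and not part of the original notes.

```python
# Monthly starting salaries (USD) for graduates 1-12, copied from the example.
salaries = [3850, 3950, 4050, 3880, 3755, 3710,
            3890, 4130, 3940, 4325, 3920, 3880]

total = sum(salaries)            # Sigma x
x_bar = total / len(salaries)    # sample mean = Sigma x / n

print(total)   # 47280
print(x_bar)   # 3940.0
```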
Median:
- The median is the middle value when data is arranged in ascending order. It divides the data set into two equal parts.
- Preferred when there are extreme values in the data set. The median is less sensitive to outliers than the mean.
- Commonly used for annual income and property value data because these data sets often contain extreme values.
- Excel Function: =MEDIAN(cell range)
- Odd Number of Observations:
  - Example: 7 observations: 26, 18, 27, 12, 14, 27, 19
  - Ascending order: 12, 14, 18, 19, 26, 27, 27
  - Median = 19
- Even Number of Observations:
  - Example: 8 observations
  - Median is the average of the middle two values.
  - Example: if the two middle values are 19 and 26, Median = $\frac{19 + 26}{2} = 22.5$
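A minimal sketch of the odd/even rule in Python. The seven observations come from the example above. The notes do not list the eighth observation, so the value 30 used below is an assumption, chosen only so that the two middle values are 19 and 26.

```python
def median(values):
    """Middle value for odd n; average of the two middle values for even n."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_sample = [26, 18, 27, 12, 14, 27, 19]

# ASSUMPTION: the eighth value (30) is hypothetical; it is not given in the notes.
even_sample = odd_sample + [30]

print(median(odd_sample))    # 19
print(median(even_sample))   # 22.5
```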

Measures of Variability - Data Dispersion

Range:
- The difference between the largest and smallest data values. It provides a basic understanding of data spread.
- Range = Largest value – Smallest value
- Simplest measure of variability but highly affected by outliers.
- Example: Monthly Starting Salary
  - Smallest value: $3710
  - Largest value: $4325
  - Range = $4325 - $3710 = $615
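The same range calculation, sketched in Python with the salary data from the mean example (variable names are illustrative):

```python
# Monthly starting salaries (USD) for graduates 1-12.
salaries = [3850, 3950, 4050, 3880, 3755, 3710,
            3890, 4130, 3940, 4325, 3920, 3880]

smallest = min(salaries)
largest = max(salaries)
data_range = largest - smallest   # Range = largest value - smallest value

print(smallest, largest, data_range)   # 3710 4325 615
```

Because only the two extreme values enter the calculation, a single unusual salary would change the result, which is why the range is so sensitive to outliers.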
Variance:
- Measures variability using all the data, not just the two extreme values.
- Based on the squared differences between each data value and the mean. Squaring makes every difference non-negative, so deviations above and below the mean cannot cancel each other out.
- Formula:
  - Sample Variance: $s^2 = \frac{\Sigma (x_i - \bar{x})^2}{n - 1}$
  - Population Variance: $\sigma^2 = \frac{\Sigma (x_i - \mu)^2}{N}$
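As a worked illustration, the sketch below computes the variance of the monthly starting salaries using the usual definitions: the sum of squared deviations from the mean, divided by n - 1 for a sample or by N when the data are treated as a whole population. Treating the 12 salaries both ways is an assumption made here purely for comparison.

```python
# Monthly starting salaries (USD) for graduates 1-12.
salaries = [3850, 3950, 4050, 3880, 3755, 3710,
            3890, 4130, 3940, 4325, 3920, 3880]

n = len(salaries)
x_bar = sum(salaries) / n                               # 3940.0
sum_sq_dev = sum((x - x_bar) ** 2 for x in salaries)    # sum of squared deviations

sample_var = sum_sq_dev / (n - 1)   # divide by n - 1 for a sample
pop_var = sum_sq_dev / n            # divide by N for a population

print(sum_sq_dev)            # 301850.0
print(f"{sample_var:.2f}")   # 27440.91
print(f"{pop_var:.2f}")      # 25154.17
```

The units are squared dollars, which is one reason the variance is harder to interpret than the standard deviation. Python's standard library offers `statistics.variance` and `statistics.pvariance` for the same two calculations.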
Standard Deviation:
- Positive square root of the variance. It measures the spread of data around the mean.
- Measured in the same units as the data, which makes it easier to interpret than the variance.
- Formula:
  - Sample Standard Deviation: $s = \sqrt{s^2}$
  - Population Standard Deviation: $\sigma = \sqrt{\sigma^2}$
- Pizza Delivery Example:
  - Pizza A: Average delivery time = 20 minutes, standard deviation = 10 minutes.
  - Pizza B: Average delivery time = 20 minutes, standard deviation = 5 minutes.
  - Lower standard deviation (Pizza B) indicates more consistent delivery times. This means Pizza B's delivery times are more predictable.
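The pizza example gives only summary figures, so the sketch below uses the monthly starting salaries instead to show how a standard deviation is obtained from raw data. It relies on `stdev` (sample) and `pstdev` (population) from Python's standard `statistics` module.

```python
from statistics import pstdev, stdev

# Monthly starting salaries (USD) for graduates 1-12.
salaries = [3850, 3950, 4050, 3880, 3755, 3710,
            3890, 4130, 3940, 4325, 3920, 3880]

s = stdev(salaries)        # sample standard deviation (n - 1 in the denominator)
sigma = pstdev(salaries)   # population standard deviation (N in the denominator)

print(f"{s:.2f}")       # 165.65
print(f"{sigma:.2f}")   # 158.60
```

Both results are in dollars, the same unit as the salaries themselves.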
Coefficient of Variation:
- Indicates how large the standard deviation is in relation to the mean. It is a dimensionless number, often expressed as a percentage.
- Allows comparison of variables with different units of measurement. For example, comparing the variability of heights (in cm) and weights (in kg).
- Formula: $CV = \left(\frac{s}{\bar{x}} \times 100\right)\%$ for a sample (use $\sigma$ and $\mu$ for a population)
- Examples (Excel): Pizza Restaurants, Household Expenditures, Fuel Prices
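A short sketch of the coefficient of variation as the standard deviation divided by the mean, expressed as a percentage. It is applied to the pizza delivery figures and to the monthly starting salaries from earlier in these notes; the helper name `coefficient_of_variation` is illustrative.

```python
from statistics import mean, stdev

def coefficient_of_variation(std_dev, avg):
    """Standard deviation as a percentage of the mean."""
    return std_dev / avg * 100

# Pizza delivery times: mean 20 minutes for both restaurants.
cv_pizza_a = coefficient_of_variation(10, 20)
cv_pizza_b = coefficient_of_variation(5, 20)
print(cv_pizza_a, cv_pizza_b)   # 50.0 25.0

# Monthly starting salaries (USD) for graduates 1-12.
salaries = [3850, 3950, 4050, 3880, 3755, 3710,
            3890, 4130, 3940, 4325, 3920, 3880]
cv_salary = coefficient_of_variation(stdev(salaries), mean(salaries))
print(f"{cv_salary:.1f}")       # 4.2
```

Because the result is a percentage, delivery times in minutes and salaries in dollars can be compared directly: relative to their means, the delivery times vary far more than the salaries.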

Measures of Distribution Shape, Relative Location, and Detecting Outliers

z-Scores:
- Also called the standardized value. It denotes the number of standard deviations a data value is from the mean. A z-score of 1 means the value is one standard deviation above the mean.
- $z_i = \frac{x_i - \bar{x}}{s}$, where $x_i$ is the individual data value, $\bar{x}$ is the sample mean, and $s$ is the sample standard deviation.
- A data value less than the sample mean will have a z-score less than zero. This indicates the value is below average.
- A data value greater than the sample mean will have a z-score greater than zero. This indicates the value is above average.
- A data value equal to the sample mean will have a z-score of zero. This indicates the value is exactly at the average.
- Examples:
  - SAT, ACT, and GRE scores: Standardizing scores allows for comparison across different tests.
  - Baby’s weight in comparison to her cohort: Assessing if a baby's weight is within the normal range.
- Compare Different Types of Variables
  - Compare weights of an apple and orange
  - 110-gram Apple and a 100-gram Orange
  - Apples:
    - Mean weight grams = 100
    - Standard Deviation = 15
    - $Z_{apple} = \frac{110-100}{15} = 0.667$
  - Oranges
    - Mean weight grams = 140
    - Standard Deviation = 25
    - $Z_{orange} = \frac{100-140}{25} = -1.6$
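The apple and orange comparison can be reproduced with a small helper. This is a sketch; the function name `z_score` is illustrative.

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / std_dev

z_apple = z_score(110, mean=100, std_dev=15)
z_orange = z_score(100, mean=140, std_dev=25)

print(round(z_apple, 3))   # 0.667
print(z_orange)            # -1.6
```

The apple is heavier than a typical apple (positive z-score), while the orange is well below the typical orange weight (negative z-score), even though the two raw weights differ by only 10 grams.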
Detecting Outliers:
- Data values with a z-score below -3 or above +3 (that is, more than 3 standard deviations from the mean) are unusual and should be treated as potential outliers. Outliers can significantly skew the results of statistical analyses.
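As an illustration, the sketch below applies the 3-standard-deviation screen to the monthly starting salaries from earlier in these notes. The cutoff of 3 follows the rule of thumb stated above; it is a convention, not a strict test.

```python
from statistics import mean, stdev

# Monthly starting salaries (USD) for graduates 1-12.
salaries = [3850, 3950, 4050, 3880, 3755, 3710,
            3890, 4130, 3940, 4325, 3920, 3880]

x_bar = mean(salaries)
s = stdev(salaries)   # sample standard deviation

# z-score for every observation, then keep the values with |z| > 3.
z_scores = [(x - x_bar) / s for x in salaries]
outliers = [x for x, z in zip(salaries, z_scores) if abs(z) > 3]

largest_z = max(z_scores, key=abs)
print(f"{largest_z:.2f}")   # 2.32
print(outliers)             # []
```

The most extreme salary (4325) is about 2.32 standard deviations above the mean, so none of the observations is flagged.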
Empirical Rule:
- For data that is normally distributed:
  - About 68.3% of the data falls within +/- 1 standard deviation of the mean.
  - About 95.4% of the data falls within +/- 2 standard deviations of the mean. This range captures the vast majority of the data.
  - About 99.7% of the data falls within +/- 3 standard deviations of the mean. Almost all data points fall within this range, which is why values beyond +/- 3 standard deviations are treated as potential outliers.
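These percentages can be checked against the standard normal distribution. The sketch below uses `NormalDist` from Python's standard `statistics` module (Python 3.8 or later); the helper name `share_within` is illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Proportion of a normal distribution within +/- k standard deviations."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, f"{100 * share_within(k):.2f}%")
# 1 68.27%
# 2 95.45%
# 3 99.73%
```

The rule only describes data that is approximately normally distributed; for skewed data the actual proportions can differ noticeably.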

Mini Workshops

Mini Workshop 1:
- DCManager: Purchase Orders; lead times, fill rates, and items damaged
- Determine the mean and median metrics for each of the attributes. This helps understand central tendencies.
- Determine the variance and standard deviation for each of the attributes. What are the units? This quantifies the spread of the data.
- Which variable of the purchase orders has the most variability? Understanding this helps in identifying areas of inconsistency.
Mini Workshop 2:
- Weather: Which city experiences more fluctuation in temperature?
- Which metric should be used? The standard deviation is appropriate here.
Mini Workshop 3:
- DCManager: Use the lead times attribute.
- Is it appropriate to use the empirical rule to interpret the spread in this data set? Why or why not? Check if the data is normally distributed.
- If so, use the empirical rule to help understand the spread in the data set of sample lead times. This provides insights into expected lead time ranges.
Mini Workshop 4:
- Autos: Random sample of auto weights and mpg values.
- Develop an appropriate visualization. Scatter plots can be useful here.
- Determine the standard deviation for each variable. This provides insights into data spread.
- Determine the correlation coefficient and interpret it. This helps understand the relationship between auto weight and mpg.
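For the last step, one possible approach is the Pearson correlation coefficient, sketched below. The notes do not define which coefficient is intended, so Pearson's r is an assumption. The numbers are made-up, perfectly linear placeholder values used only to show the mechanics; they are NOT the workshop's Autos data.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

# HYPOTHETICAL placeholder values, not the workshop data set.
weights = [2000, 2500, 3000, 3500, 4000]   # pounds
mpg = [40, 35, 30, 25, 20]                 # miles per gallon

r = pearson_r(weights, mpg)
print(r)   # -1.0
```

A value near -1 indicates that heavier autos tend to have lower mpg, a value near +1 indicates the opposite, and a value near 0 indicates little linear relationship. Real data will rarely be exactly -1 or +1.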