Lesson 3 Descriptive Statistics - Numerical Measures

Slide Topics

  • Measures of Location

  • Measures of Variability

  • Measures of Distribution Shape, Relative Location, and Detecting Outliers

  • Measures of Association Between Two Variables

Measures of Location

Measures of location describe where the data are centered or where a particular value sits within the data. A measure computed from a sample (a subset of the group) is called a sample statistic; a measure computed from the entire population is called a population parameter.

Common Measures Include:

  • Mean

  • Median

  • Mode

  • Percentiles

  • Quartiles

Mean

The mean, or average, is calculated by adding all the data values together and dividing by the number of values. It gives a single number that summarizes the entire dataset.

  • Sample Mean (denoted ( \bar{x} )): The mean of a sample, used to estimate the mean of the larger population the sample was drawn from.

  • Sample Mean Formula: [ \bar{x} = \frac{\sum x_i}{n} ] where ( \sum x_i ) is the sum of the values of the n observations and ( n ) is the number of observations in the sample.

  • Population Mean Formula: [ \mu = \frac{\sum x_i}{N} ] where ( \sum x_i ) is the sum of the values of the N observations and ( N ) is the number of observations in the population.

Example: Apartment Rents

Suppose we have a sample of rent prices (in dollars) for 70 small apartments: 445, 615, 430, and so on. To find the mean, add up all 70 rents and divide by 70:

  • Sample Mean Calculation: [ \bar{x} = \frac{34,356}{70} = 490.80 ] This means the average rent is $490.80.
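The calculation above can be sketched in Python. This is a minimal illustration, not a definitive implementation: the function name `sample_mean` and the five-value `demo` list are assumptions made up for this sketch. Only the total (34,356) and the count (70) come from the example.

```python
def sample_mean(values):
    """Return the arithmetic mean: sum of the values divided by their count."""
    if not values:
        raise ValueError("the mean of an empty dataset is undefined")
    return sum(values) / len(values)


# Figures quoted in the rent example: 70 rents that sum to 34,356.
total_rent = 34_356
n = 70
mean_rent = total_rent / n
print(f"Mean rent: ${mean_rent:.2f}")

# Hypothetical five-value sample, used only to exercise the function.
demo = [445, 615, 430, 590, 435]
print(sample_mean(demo))
```

Because the full list of 70 rents is not reproduced here, the sketch works from the quoted total rather than from the raw data.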

Median

The median is the middle value when the data are arranged in order. Because it is not pulled toward extreme values (outliers), it often describes the center of the data better than the mean when outliers are present.

  • How to Find the Median:

    1. Arrange data from smallest to largest.

    2. If there’s an odd number of values, it’s the middle one.

    3. If there’s an even number, it’s the average of the two middle ones.
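The three steps above translate almost line for line into code. The following is a small sketch; the function name `median` and the sample values are hypothetical and chosen only to show how one extreme value moves the mean but not the median.

```python
def median(values):
    """Middle value of the sorted data; average of the two middle values if n is even."""
    if not values:
        raise ValueError("the median of an empty dataset is undefined")
    ordered = sorted(values)          # step 1: arrange from smallest to largest
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                    # step 2: odd count, take the middle value
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2   # step 3: even count


# Hypothetical rents with one extreme value (2000).
rents = [430, 445, 450, 465, 2000]
print("median:", median(rents))
print("mean:  ", sum(rents) / len(rents))
```

In this made-up sample the single extreme value pulls the mean well above most of the rents, while the median stays at the middle observation.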

Mode

The mode is the number that appears most often in the dataset. It can show us the most common item.

  • Types of Mode:

    • Unimodal: One mode

    • Bimodal: Two modes

    • Multimodal: More than two modes

  • Example of Mode Calculation: In a list, if the number 450 appears 7 times and no other value appears as often, the mode is 450.
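One way to find the mode, and to tell unimodal data from bimodal or multimodal data, is to count how often each value occurs. The sketch below uses `collections.Counter` from the Python standard library; the function name `modes` and the sample lists are hypothetical.

```python
from collections import Counter


def modes(values):
    """Return every value that occurs with the highest frequency, in sorted order."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)


# Hypothetical samples for illustration.
unimodal = [450, 450, 450, 430, 615]
bimodal = [430, 430, 450, 450, 615]

print(modes(unimodal))   # one mode
print(modes(bimodal))    # two modes
```

Note that when every value occurs exactly once, this function returns all of them, which is a sign that the mode is not a useful summary for that data.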

Percentiles

A percentile tells us where a certain value stands compared to the rest of the data.

  • Definition: The p-th percentile is a value such that at least p% of the observations are less than or equal to it and at least (100 - p)% are greater than or equal to it.

  • Steps to Calculate Percentiles:

    1. Arrange the data in order from smallest to largest.

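    2. Compute the index ( i = \frac{p}{100} n ), where ( n ) is the total number of values.

    3. If ( i ) is not a whole number, round it up; the value in that position is the p-th percentile. If ( i ) is a whole number, the p-th percentile is the average of the values in positions ( i ) and ( i + 1 ).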
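The index method above can be sketched as follows. The sketch assumes a common textbook convention for reading off the value: if ( i ) is not a whole number, round up and take that position; if ( i ) is a whole number, average the values in positions ( i ) and ( i + 1 ). The function name `percentile` and the ten-value sample are hypothetical, and `p` is assumed to be a whole number so the index can be computed with exact integer arithmetic.

```python
def percentile(values, p):
    """p-th percentile via the index i = (p / 100) * n (positions are 1-based)."""
    if not values:
        raise ValueError("percentile of an empty dataset is undefined")
    if not (isinstance(p, int) and 0 < p < 100):
        raise ValueError("p must be a whole number strictly between 0 and 100")
    ordered = sorted(values)                  # step 1: arrange in order
    n = len(ordered)
    whole, remainder = divmod(p * n, 100)     # step 2: i = whole + remainder / 100
    if remainder != 0:
        # i is not a whole number: round up to position whole + 1,
        # which is index `whole` in a 0-based list.
        return ordered[whole]
    # i is a whole number: average positions i and i + 1.
    return (ordered[whole - 1] + ordered[whole]) / 2


# Hypothetical sample for illustration.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(percentile(data, 25))
print(percentile(data, 50))
print(percentile(data, 90))
```

Other textbooks and software use slightly different index formulas, so percentiles computed elsewhere may differ a little from this sketch on small datasets.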

Quartiles

Quartiles are specific percentiles that divide the ordered data into four parts, each containing about 25% of the observations. They show how the data is spread out.

  • First Quartile (Q1): 25% of data falls below this point.

  • Second Quartile (Q2): This is the median, meaning 50% of data is below this point.

  • Third Quartile (Q3): 75% of data falls below this point.
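As a quick illustration, Python's standard library can compute the three quartiles directly. This sketch uses `statistics.quantiles` (available from Python 3.8), which by default interpolates between neighboring values, so its results can differ slightly from quartiles found with other percentile methods. The sample data is hypothetical.

```python
import statistics

# Hypothetical sample for illustration.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# With n=4, quantiles() returns the three cut points Q1, Q2, Q3.
q1, q2, q3 = statistics.quantiles(data, n=4)

print("Q1:", q1)
print("Q2:", q2, "(same as the median:", statistics.median(data), ")")
print("Q3:", q3)
```

The second quartile always matches the median, which is a handy sanity check regardless of the interpolation method used.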

Measures of Variability

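Variability measures describe how spread out the data is, that is, whether the data points are close together or widely separated. Some measures (range, interquartile range) look at distances between positions in the ordered data; others (variance, standard deviation) look at distances from the mean. Key measures include: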

  • Range

  • Interquartile Range

  • Variance

  • Standard Deviation

  • Coefficient of Variation

Range

The range is simply the difference between the largest and smallest values in the data.

  • Formula: ( \text{Range} = \text{largest value} - \text{smallest value} )

  • Example: For rent prices, if the highest price is 615 and the lowest is 425, the range is 615 - 425 = 190.

Interquartile Range (IQR)

The IQR tells us the range of the middle 50% of the data, which helps minimize the effect of outliers.

  • Formula: ( \text{IQR} = Q3 - Q1 )

  • Example: If Q3 is 525 and Q1 is 445, then ( \text{IQR} = 525 - 445 = 80 ).
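To see why the IQR is less affected by outliers than the range, the sketch below computes both for a hypothetical sample, then again after replacing the largest value with an extreme one. The function names and data are assumptions for illustration; quartiles come from `statistics.quantiles`, whose default interpolation may differ slightly from other quartile methods.

```python
import statistics


def data_range(values):
    """Largest value minus smallest value."""
    return max(values) - min(values)


def interquartile_range(values):
    """Q3 minus Q1, the spread of the middle 50% of the data."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1


# Hypothetical rents, and the same rents with the largest replaced by an extreme value.
base = [430, 440, 445, 450, 460, 470, 480, 500, 525, 615]
with_outlier = base[:-1] + [2000]

for label, sample in (("base", base), ("with outlier", with_outlier)):
    print(label, "range:", data_range(sample), "IQR:", interquartile_range(sample))
```

In this made-up sample the range grows sharply when the extreme value is introduced, while the IQR does not move, because the quartiles depend only on the middle of the ordered data.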

Variance

Variance shows how far the data points are spread out from the average. Higher variance means more spread.

  • Formula: [ s^2 = \frac{\sum{(x_i - \bar{x})^2}}{n - 1} ] (this is for sample variance)
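The sample variance formula can be worked through step by step in code. This is a minimal sketch with a hypothetical function name and hypothetical data; it follows the formula above, dividing by ( n - 1 ).

```python
def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("sample variance needs at least two values")
    mean = sum(values) / n
    squared_deviations = [(x - mean) ** 2 for x in values]
    return sum(squared_deviations) / (n - 1)


# Hypothetical sample: the mean is 5 and the squared deviations sum to 32.
data = [2, 4, 4, 4, 5, 5, 7, 9]
print(sample_variance(data))   # 32 / 7
```

Note that the result is in squared units (for rents, "dollars squared"), which is one reason the standard deviation is usually reported alongside it.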

Standard Deviation

The standard deviation is simply the square root of the variance. It allows us to understand how spread out the values are in the same units as the original data.

  • Formula: [ s = \sqrt{\frac{\sum{(x_i - \bar{x})^2}}{n - 1}} ]
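In practice the standard library already provides both quantities. The sketch below uses `statistics.variance` and `statistics.stdev`, which both use the ( n - 1 ) divisor shown above, and checks that one is the square root of the other. The data is hypothetical.

```python
import math
import statistics

# Hypothetical sample for illustration.
data = [2, 4, 4, 4, 5, 5, 7, 9]

variance = statistics.variance(data)   # sample variance (divides by n - 1)
std_dev = statistics.stdev(data)       # sample standard deviation

print("variance:", variance)
print("standard deviation:", std_dev)
print("sqrt(variance) matches:", math.isclose(std_dev, math.sqrt(variance)))
```

For population data, the corresponding functions are `statistics.pvariance` and `statistics.pstdev`, which divide by ( N ) instead.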

Coefficient of Variation (CV)

The CV shows how much variability there is in comparison to the average value and is presented as a percentage.

  • Formula: [ CV = \left( \frac{s}{\bar{x}} \times 100 \right)\% ]
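A short sketch of the CV calculation follows. The function name `coefficient_of_variation` and the data are hypothetical; the sample standard deviation comes from `statistics.stdev`.

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation as a percentage of the sample mean."""
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return statistics.stdev(values) / mean * 100


# Hypothetical sample for illustration (mean is 5).
data = [2, 4, 4, 4, 5, 5, 7, 9]
print(f"CV = {coefficient_of_variation(data):.2f}%")
```

Because the CV is unit-free, it can be used to compare the variability of datasets measured on different scales.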

Distribution Shape and Detecting Outliers

The shape of the distribution (how the data points are spread around the center) adds context that measures of location and variability alone do not capture.

  • Z-Scores: A z-score tells us how many standard deviations a data point lies from the mean: ( z_i = \frac{x_i - \bar{x}}{s} ).

  • Chebyshev’s Theorem: This states that at least ( 1 - \frac{1}{z^2} ) of the data falls within z standard deviations of the mean (for any ( z > 1 )). For example, with ( z = 2 ), at least 75% of the data falls within 2 standard deviations of the mean.

  • Outliers: These are data points that lie unusually far from the mean, commonly flagged by z-scores less than -3 or greater than +3. No outliers were found in this dataset.
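The three ideas above can be sketched together. This is an illustration under stated assumptions: z-scores are computed as the deviation from the sample mean divided by the sample standard deviation, and the function names and sample data are hypothetical.

```python
import statistics


def z_scores(values):
    """Number of sample standard deviations each value lies from the sample mean."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)
    if s == 0:
        raise ValueError("z-scores are undefined when all values are equal")
    return [(x - mean) / s for x in values]


def flag_outliers(values, threshold=3.0):
    """Values whose z-score is below -threshold or above +threshold."""
    return [x for x, z in zip(values, z_scores(values)) if abs(z) > threshold]


def chebyshev_lower_bound(z):
    """Minimum proportion of data within z standard deviations of the mean (z > 1)."""
    if z <= 1:
        raise ValueError("Chebyshev's theorem requires z > 1")
    return 1 - 1 / z ** 2


# Hypothetical sample: twenty values of 10 and one extreme value of 100.
sample = [10] * 20 + [100]
print("outliers:", flag_outliers(sample))
print("at least", chebyshev_lower_bound(2), "of data within 2 standard deviations")
```

With very small samples no z-score can exceed 3 in absolute value, so the ±3 rule is only informative when there are enough observations.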

Exploratory Data Analysis

  • Five-Number Summary: This summarizes the data with five values: the smallest value, Q1, the median (Q2), Q3, and the largest value.

  • Box Plot: A simple visual way to show the five-number summary. It can also indicate potential outliers using the interquartile range: values below the lower limit or above the upper limit are treated as potential outliers.

    • Lower Limit: ( Q1 - 1.5(IQR) )

    • Upper Limit: ( Q3 + 1.5(IQR) )
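The limits can be computed directly from Q1 and Q3. The sketch below uses the quartile values quoted in the IQR example earlier in these notes (Q1 = 445, Q3 = 525); the function names and the list of values being checked are hypothetical.

```python
def box_plot_limits(q1, q3):
    """Lower and upper limits: 1.5 * IQR below Q1 and above Q3."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def outside_limits(values, q1, q3):
    """Values that fall below the lower limit or above the upper limit."""
    lower, upper = box_plot_limits(q1, q3)
    return [x for x in values if x < lower or x > upper]


lower, upper = box_plot_limits(445, 525)
print("limits:", lower, upper)

# Hypothetical values to check against the limits.
print(outside_limits([425, 615, 700], 445, 525))
```

With these quartiles the limits are 325 and 645, so the smallest (425) and largest (615) rents quoted in the range example both fall inside them.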

Measures of Association Between Two Variables

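  • Covariance: This shows how two variables change together. A positive covariance means they tend to increase together; a negative covariance means one tends to decrease as the other increases. Its size depends on the units of measurement, so it is hard to interpret on its own.

  • Correlation Coefficient: This shows the strength and direction of the linear relationship between two variables and can range from -1 (a perfect negative linear relationship) to +1 (a perfect positive linear relationship), with 0 meaning no linear relationship. Remember, correlation does not mean causation.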
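These notes do not give the formulas, so the sketch below assumes the standard sample definitions: the sample covariance is the sum of the products of paired deviations from the means divided by ( n - 1 ), and the correlation coefficient is the covariance divided by the product of the two sample standard deviations. Function names and data are hypothetical.

```python
import statistics


def sample_covariance(xs, ys):
    """Sum of (x - mean_x) * (y - mean_y) over all pairs, divided by n - 1."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long samples with at least two pairs")
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    products = [(x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)


def correlation(xs, ys):
    """Covariance rescaled by both standard deviations; always between -1 and +1."""
    return sample_covariance(xs, ys) / (statistics.stdev(xs) * statistics.stdev(ys))


# Hypothetical paired data: y rises exactly in step with x.
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
print("covariance:", sample_covariance(x, y))
print("correlation:", correlation(x, y))
print("correlation with y reversed:", correlation(x, y[::-1]))
```

Dividing by the standard deviations removes the units, which is why the correlation coefficient is easier to interpret than the covariance.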

Weighted Mean and Grouped Data

  • Weighted Mean: This is used when some values are more important than others, like calculating grades where some classes count more than others.

  • Grouped Data Calculations: This method helps estimate mean, variance, and standard deviation using midpoints and frequencies for large datasets when we don’t have every single data point.
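Both ideas can be sketched in a few lines. The formulas are not given in these notes, so the sketch assumes the standard ones: the weighted mean is the sum of weight times value divided by the sum of the weights, and for grouped data each class midpoint is weighted by its frequency, with the sample variance divided by ( n - 1 ). All names, grades, midpoints, and frequencies are hypothetical.

```python
def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    if len(values) != len(weights) or not values:
        raise ValueError("need equally long, non-empty values and weights")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight


def grouped_mean(midpoints, frequencies):
    """Approximate mean for grouped data: midpoints weighted by class frequency."""
    return weighted_mean(midpoints, frequencies)


def grouped_sample_variance(midpoints, frequencies):
    """Approximate sample variance for grouped data, dividing by n - 1."""
    n = sum(frequencies)
    if n < 2:
        raise ValueError("need at least two observations in total")
    mean = grouped_mean(midpoints, frequencies)
    return sum(f * (m - mean) ** 2 for m, f in zip(midpoints, frequencies)) / (n - 1)


# Hypothetical grades: a 3-credit class (90) and a 1-credit class (80).
print(weighted_mean([90, 80], [3, 1]))

# Hypothetical grouped data: class midpoints and how many values fall in each class.
midpoints = [10, 20, 30]
frequencies = [1, 2, 1]
print(grouped_mean(midpoints, frequencies))
print(grouped_sample_variance(midpoints, frequencies) ** 0.5)   # standard deviation
```

Grouped-data results are approximations, since every value in a class is treated as if it sat exactly at the class midpoint.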