Lesson 3 Descriptive Statistics - Numerical Measures
Slide Topics
Measures of Location
Measures of Variability
Measures of Distribution Shape, Relative Location, and Detecting Outliers
Measures of Association Between Two Variables
Measures of Location
Measures of location describe where the data are centered and where a particular value falls relative to the rest of the data. A measure computed from a sample (a subset of the group) is called a sample statistic; a measure computed from the entire population is called a population parameter.
Common Measures Include:
Mean
Median
Mode
Percentiles
Quartiles
Mean
The mean, or average, is calculated by adding all the data values together and dividing by the number of values. It gives a single number that summarizes the entire dataset.
Sample Mean (denoted ( \bar{x} )): This is the average of a sample and is used to estimate the average of the larger group the sample was drawn from.
Sample Mean Formula: [ \bar{x} = \frac{\sum x_i}{n} = \frac{\text{Sum of the values of the n observations}}{\text{Number of observations in the sample}} ]
Population Mean Formula: [ \mu = \frac{\sum x_i}{N} = \frac{\text{Sum of the values of the N observations}}{\text{Number of observations in the population}} ]
Example: Apartment Rents
Imagine we have the rent prices of a sample of 70 small apartments, such as 445, 615, 430, and so forth. To find the mean:
Sample Mean Calculation: [ \bar{x} = \frac{\sum x_i}{n} = \frac{34{,}356}{70} = 490.80 ] This means the average rent is $490.80.
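The calculation above can be sketched in Python. The function name `sample_mean` is an illustrative assumption, and the short list holds only the three rents quoted in the example, not the full set of 70.

```python
def sample_mean(values):
    """Return the arithmetic mean: the sum of the values divided by their count."""
    if not values:
        raise ValueError("the mean of an empty dataset is undefined")
    return sum(values) / len(values)


# Totals quoted in the apartment rent example: 34,356 summed over 70 apartments.
mean_from_totals = 34356 / 70

# Only the three rents listed in the example (not the full dataset).
mean_of_three = sample_mean([445, 615, 430])
```

The same function works for a population mean; only the interpretation of the result changes.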
Median
The median is the middle value when the data are arranged in order. Because extreme values (outliers) have little effect on it, the median is often a better measure of the center than the mean when outliers are present.
How to Find the Median:
Arrange data from smallest to largest.
If there’s an odd number of values, it’s the middle one.
If there’s an even number, it’s the average of the two middle ones.
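The steps above can be sketched as a small Python function. The name `median` and the sample values are assumptions chosen for illustration.

```python
def median(values):
    """Return the middle value of the sorted data.

    With an even number of values, return the average of the two middle ones.
    """
    if not values:
        raise ValueError("the median of an empty dataset is undefined")
    ordered = sorted(values)          # step 1: arrange from smallest to largest
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                    # odd count: the single middle value
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2  # even count: average the middle pair


odd_case = median([445, 615, 430])         # middle of 430, 445, 615
even_case = median([445, 615, 430, 450])   # average of 445 and 450
```

Python's standard library also provides `statistics.median`, which follows the same rule.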
Mode
The mode is the value that occurs most often in the dataset. It identifies the most common item.
Types of Mode:
Unimodal: One mode
Bimodal: Two modes
Multimodal: More than two modes
Example of Mode Calculation: If 450 appears 7 times in the data and no other value appears as often, the mode is 450.
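A minimal sketch of finding the mode in Python follows. The function name `modes` and the sample lists are hypothetical; the function returns every value tied for the highest count, so it also covers the bimodal and multimodal cases.

```python
from collections import Counter


def modes(values):
    """Return a sorted list of the value(s) that occur most often."""
    if not values:
        raise ValueError("the mode of an empty dataset is undefined")
    counts = Counter(values)                 # value -> number of occurrences
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)


unimodal = modes([450, 450, 430, 615])   # one value occurs most often
bimodal = modes([1, 1, 2, 2, 3])         # two values tie for most frequent
```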
Percentiles
A percentile tells us where a certain value stands compared to the rest of the data.
Definition: The p-th percentile is a value such that at least p% of the values are less than or equal to it and at least (100 - p)% of the values are greater than or equal to it.
Steps to Calculate Percentiles:
Arrange the data in order from smallest to largest.
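Compute the index ( i = \frac{p}{100}n ), where ( n ) is the total number of values.
If ( i ) is not an integer, round up; the value at that position is the p-th percentile. If ( i ) is an integer, the p-th percentile is the average of the values at positions ( i ) and ( i + 1 ).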
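The steps above can be sketched in Python. This sketch assumes one common textbook convention for handling the index: round up when ( i ) is not a whole number, and average positions ( i ) and ( i + 1 ) when it is. Other software may interpolate differently. The function name `percentile` and the sample data are illustrative.

```python
import math


def percentile(values, p):
    """Return the p-th percentile using the index rule i = (p / 100) * n.

    Assumed convention: if i is not an integer, round up and take that
    position; if i is an integer, average positions i and i + 1 (1-based).
    """
    if not values:
        raise ValueError("percentiles of an empty dataset are undefined")
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    ordered = sorted(values)                 # step 1: sort ascending
    n = len(ordered)
    i = p * n / 100                          # step 2: compute the index
    if i == int(i):                          # whole-number index
        k = int(i)
        return (ordered[k - 1] + ordered[k]) / 2
    return ordered[math.ceil(i) - 1]         # round up, convert to 0-based


data = [10, 20, 30, 40, 50, 60, 70, 80]      # hypothetical values
p25 = percentile(data, 25)   # i = 2, average of 2nd and 3rd values
p90 = percentile(data, 90)   # i = 7.2, rounds up to the 8th value
```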
Quartiles
Quartiles divide the data into four equal parts to show how the data is spread out.
First Quartile (Q1): 25% of data falls below this point.
Second Quartile (Q2): This is the median, meaning 50% of data is below this point.
Third Quartile (Q3): 75% of data falls below this point.
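For a quick check, Python's standard library offers `statistics.quantiles`. Note that its default method interpolates between positions, so its results may differ slightly from hand calculations based on a rounding rule. The data below are made up for illustration.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7]  # hypothetical values

# n=4 asks for the three cut points that split the data into quarters.
q1, q2, q3 = statistics.quantiles(data, n=4)

# The second quartile is the median.
median_value = statistics.median(data)
```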
Measures of Variability
Measures of variability describe how spread out the data are, showing whether the data points cluster close together or are widely dispersed. Key measures include:
Range
Interquartile Range
Variance
Standard Deviation
Coefficient of Variation
Range
The range is simply the difference between the largest and smallest values in the data.
Formula: ( \text{Range} = \text{largest value} - \text{smallest value} )
Example: For rent prices, if the highest price is 615 and the lowest is 425, the range is 615 - 425 = 190.
Interquartile Range (IQR)
The IQR is the range of the middle 50% of the data. Because it ignores the lowest and highest quarters of the data, it is not affected by extreme values.
Formula: ( \text{IQR} = Q3 - Q1 )
Example: If Q3 is 525 and Q1 is 445, then ( \text{IQR} = 525 - 445 = 80 ).
Variance
Variance measures spread using the squared deviations of the data values from the mean. A higher variance means the data are more spread out.
Formula: [ s^2 = \frac{\sum{(x_i - \bar{x})^2}}{n - 1} ] (this is for sample variance)
Standard Deviation
The standard deviation is simply the square root of the variance. It allows us to understand how spread out the values are in the same units as the original data.
Formula: [ s = \sqrt{\frac{\sum{(x_i - \bar{x})^2}}{n - 1}} ]
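The two formulas above can be sketched directly in Python. The function names and the five sample values are assumptions for illustration; both functions divide by ( n - 1 ), so they compute the sample versions.

```python
import math


def sample_variance(values):
    """Return s^2: the sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("sample variance needs at least two values")
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


def sample_std(values):
    """Return s, the square root of the sample variance."""
    return math.sqrt(sample_variance(values))


data = [46, 54, 42, 46, 32]          # hypothetical values with mean 44
variance = sample_variance(data)     # squared deviations sum to 256; 256 / 4 = 64
std_dev = sample_std(data)           # square root of 64 = 8
```

The standard library functions `statistics.variance` and `statistics.stdev` use the same ( n - 1 ) divisor.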
Coefficient of Variation (CV)
The CV shows how much variability there is in comparison to the average value and is presented as a percentage.
Formula: [ CV = \left( \frac{s}{\bar{x}} \times 100 \right)\% ]
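A minimal sketch of the CV formula in Python follows. The function name `coefficient_of_variation` and the sample values are hypothetical.

```python
import statistics


def coefficient_of_variation(values):
    """Return the sample standard deviation as a percentage of the mean."""
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("the CV is undefined when the mean is zero")
    return statistics.stdev(values) / mean * 100


data = [46, 54, 42, 46, 32]            # hypothetical values: mean 44, s = 8
cv = coefficient_of_variation(data)    # 8 / 44 * 100, roughly 18.2 percent
```

Because the CV is unit-free, it can be used to compare the variability of datasets measured on different scales.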
Distribution Shape and Detecting Outliers
The shape of a distribution and the relative location of individual values within it help us interpret the data and spot unusual observations.
Z-Scores: A z-score tells us how many standard deviations a data value lies from the mean: ( z_i = \frac{x_i - \bar{x}}{s} ).
Chebyshev’s Theorem: This states that at least ( 1 - \frac{1}{z^2} ) of the data falls within z standard deviations of the mean (for any ( z > 1 )).
Outliers: These are data points that lie unusually far from the mean. A common rule treats values with z-scores less than -3 or greater than +3 as outliers. No outliers were found in the apartment rent dataset.
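The ideas above can be sketched in Python. The sketch assumes the usual sample z-score, ( z_i = (x_i - \bar{x}) / s ); the function names and sample values are illustrative only.

```python
import statistics


def z_scores(values):
    """Return each value's distance from the mean in standard deviations."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)          # sample standard deviation (n - 1)
    if s == 0:
        raise ValueError("z-scores are undefined when all values are equal")
    return [(x - mean) / s for x in values]


def z_outliers(values, threshold=3.0):
    """Return the values whose z-score is below -threshold or above +threshold."""
    return [x for x, z in zip(values, z_scores(values)) if abs(z) > threshold]


def chebyshev_bound(z):
    """Return the minimum proportion of data within z standard deviations (z > 1)."""
    if z <= 1:
        raise ValueError("Chebyshev's theorem requires z > 1")
    return 1 - 1 / z ** 2


data = [46, 54, 42, 46, 32]       # hypothetical values: mean 44, s = 8
scores = z_scores(data)           # [0.25, 1.25, -0.25, 0.25, -1.5]
flagged = z_outliers(data)        # empty: no |z| exceeds 3
within_2 = chebyshev_bound(2)     # at least 75% of values within 2 standard deviations
```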
Exploratory Data Analysis
Five-Number Summary: This summarizes the data with five values: the smallest value, Q1, the median (Q2), Q3, and the largest value.
Box Plot: A graphical display of the five-number summary. Values that fall outside the following limits, which are based on the interquartile range, are flagged as potential outliers:
Lower Limit: ( Q1 - 1.5(IQR) )
Upper Limit: ( Q3 + 1.5(IQR) )
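The limits can be sketched in Python using the quartiles from the IQR example (Q1 = 445, Q3 = 525). The function names and the three test values passed to `flag_outliers` are assumptions for illustration.

```python
def outlier_limits(q1, q3):
    """Return the (lower, upper) box plot limits: 1.5 IQRs beyond each quartile."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def flag_outliers(values, q1, q3):
    """Return the values that fall below the lower limit or above the upper limit."""
    lower, upper = outlier_limits(q1, q3)
    return [x for x in values if x < lower or x > upper]


# Quartiles from the IQR example: IQR = 80, so the limits sit 120 beyond each quartile.
lower, upper = outlier_limits(445, 525)   # (325.0, 645.0)

# Hypothetical values: 425 and 615 are inside the limits, 700 is not.
flagged = flag_outliers([425, 615, 700], 445, 525)
```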
Measures of Association Between Two Variables
Covariance: This measures how two variables vary together. A positive value indicates that they tend to increase together; a negative value indicates that one tends to decrease as the other increases.
Correlation Coefficient: This shows the strength of the linear relationship between two variables and can range from -1 (a perfect negative linear relationship) to +1 (a perfect positive linear relationship), with 0 meaning no linear relationship. Remember, correlation does not imply causation.
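As a rough sketch, both measures can be computed in Python. The formulas assumed here are the standard sample versions, which the text above does not spell out: covariance ( s_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y}) / (n - 1) ) and correlation ( r_{xy} = s_{xy} / (s_x s_y) ). Function names and data are illustrative.

```python
import math


def sample_covariance(xs, ys):
    """Return the sample covariance of paired observations (divides by n - 1)."""
    n = len(xs)
    if n != len(ys):
        raise ValueError("xs and ys must have the same length")
    if n < 2:
        raise ValueError("covariance needs at least two pairs")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def correlation(xs, ys):
    """Return the correlation coefficient: covariance scaled by both standard deviations."""
    s_x = math.sqrt(sample_covariance(xs, xs))   # covariance of x with itself is its variance
    s_y = math.sqrt(sample_covariance(ys, ys))
    if s_x == 0 or s_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sample_covariance(xs, ys) / (s_x * s_y)


xs = [1, 2, 3, 4]                     # hypothetical paired data
r_up = correlation(xs, [2, 4, 6, 8])      # y rises with x: close to +1
r_down = correlation(xs, [8, 6, 4, 2])    # y falls as x rises: close to -1
```

Unlike covariance, the correlation coefficient does not depend on the units of measurement, which makes it easier to interpret.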
Weighted Mean and Grouped Data
Weighted Mean: This is used when some values are more important than others, like calculating grades where some classes count more than others.
Grouped Data Calculations: This method helps estimate mean, variance, and standard deviation using midpoints and frequencies for large datasets when we don’t have every single data point.
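Both ideas can be sketched in Python. The formulas assumed here are the standard ones, not stated explicitly above: the weighted mean is ( \sum w_i x_i / \sum w_i ), and for grouped data the class midpoints ( M_i ) are weighted by their frequencies ( f_i ), with the sample variance estimated as ( \sum f_i (M_i - \bar{x})^2 / (n - 1) ). All names and numbers are hypothetical.

```python
import math


def weighted_mean(values, weights):
    """Return the sum of weight * value divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("the weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight


def grouped_stats(midpoints, frequencies):
    """Estimate (mean, variance, standard deviation) from class midpoints and frequencies."""
    n = sum(frequencies)                      # total number of observations
    if n < 2:
        raise ValueError("need at least two observations")
    mean = weighted_mean(midpoints, frequencies)
    variance = sum(
        f * (m - mean) ** 2 for m, f in zip(midpoints, frequencies)
    ) / (n - 1)
    return mean, variance, math.sqrt(variance)


# Hypothetical grades: 90 in a 3-credit course, 80 in a 1-credit course.
grade_average = weighted_mean([90, 80], [3, 1])

# Hypothetical classes with midpoints 10, 20, 30 and frequencies 1, 2, 1.
g_mean, g_var, g_std = grouped_stats([10, 20, 30], [1, 2, 1])
```

The grouped results are approximations, because every observation in a class is treated as if it sat at the class midpoint.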