Descriptive Statistics and Measures of Central Tendency
Descriptive Statistics Overview
Course Context
Course: Descriptive Statistics
Institution: Western University Canada
Week: 3 of the course
Focus Areas:
Measures of central tendency
Standard deviation
Visualizing central tendency and range
Distributions
Review from Previous Week
Measures of Central Tendency
Mode:
Definition: The mode is the most frequently occurring value in a dataset.
Types:
Bimodal: Dataset with two modes.
Multimodal: Dataset with more than two modes.
Median:
Definition: The median is the middle value in a sorted list of numbers, dividing the dataset into two equal halves. With an even number of values, it is the average of the two middle values.
Mean (Average):
Definition: The mean is calculated by summing all values in the dataset and dividing by the total number of observations.
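The three measures above can be sketched with Python's standard `statistics` module. This is a minimal illustration, not part of the course material; the values are the ones used in the range example in these notes, and the variable names are chosen here for illustration.

```python
import statistics

# Values taken from the range example in these notes.
values = [2, 2, 3, 5, 5, 7, 8]

mean_value = statistics.mean(values)      # sum of all values / number of values
median_value = statistics.median(values)  # middle value of the sorted list
modes = statistics.multimode(values)      # every value tied for most frequent

print("mean:", round(mean_value, 2))
print("median:", median_value)
print("modes:", modes)  # two modes, so this dataset is bimodal
```

Because both 2 and 5 occur twice, `multimode` returns two values, which matches the definition of a bimodal dataset.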
Correlation Coefficient
Linearity and Correlation Coefficient (r):
Ranges from +1 to -1, indicating the strength and direction of a linear relationship between two variables.
Values Interpretation:
+1: Perfect positive correlation
0: No linear correlation
-1: Perfect negative correlation
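As a rough sketch of how r behaves at the extremes, the function below computes Pearson's correlation coefficient from its textbook definition. The function name `pearson_r` and the small datasets are assumptions made up for this illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2 values)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of paired deviations from the means.
    co_deviation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable.
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return co_deviation / math.sqrt(ss_x * ss_y)

# Hypothetical data: y rises or falls in a perfectly straight line with x.
x = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]
falling = [10, 8, 6, 4, 2]

r_rising = pearson_r(x, rising)
r_falling = pearson_r(x, falling)
print(r_rising, r_falling)
```

Note that the function divides by zero if either variable is constant, since r is undefined in that case.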
Central Tendency
Definition: Central tendency indicates the center or typical value of a dataset, revealing where data points tend to cluster.
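Dependence on Distribution: The most appropriate measure of central tendency depends on the shape of the data distribution. For example, the median is usually preferred over the mean when the data are strongly skewed.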
Bell Curve (Normal Distribution):
Characteristics:
Symmetrical distribution where the mean, median, and mode are equal and located at the midpoint.
Approximately 68% of values fall within one standard deviation of the mean, and about 95% fall within two standard deviations.
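The percentages above can be checked against the theoretical normal curve. The sketch below uses `statistics.NormalDist` from the Python standard library; the helper name `share_within` is an assumption introduced here.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} SD: {share_within(k):.2%}")
```

The same proportions hold for any normal distribution, whatever its mean and standard deviation, because they depend only on the distance from the mean measured in standard deviations.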
Measures of Dispersion
Definition: Dispersion describes how data varies from the central tendency.
Key Terms:
Range:
Definition: The range is the difference between the highest and lowest values in the dataset.
Calculation: Range = Highest value - Lowest value
Example Calculation: Given values 2, 2, 3, 5, 5, 7, 8:
Highest = 8
Lowest = 2
Calculation: Range = 8 - 2 = 6
Thus, the range is 6.
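The same calculation as a short Python sketch, using the values from the example above:

```python
values = [2, 2, 3, 5, 5, 7, 8]

# Range = highest value - lowest value.
data_range = max(values) - min(values)
print(data_range)  # 6
```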
Standard Deviation:
Definition: The standard deviation measures the typical distance of values from the mean (formally, the square root of the average squared difference from the mean), indicating how spread out the values are in the dataset.
Characteristics:
A larger standard deviation indicates greater variability in the data.
Calculation Steps:
Calculate the mean.
Subtract the mean from each value and square the result.
Find the average of these squared differences (this is the variance).
Take the square root of the variance.
Note on Sample vs Population:
When calculating the standard deviation from a sample, dividing by n - 1 instead of n corrects for underestimation of the population standard deviation by producing a slightly larger result.
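The calculation steps and the sample vs. population distinction can be sketched in a few lines of Python. The function name `standard_deviation`, its `sample` flag, and the `scores` data are assumptions made up for this illustration.

```python
import math

def standard_deviation(values, sample=False):
    """Standard deviation; divides by n - 1 when sample=True, otherwise by n."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    mean = sum(values) / n                              # calculate the mean
    squared_diffs = [(v - mean) ** 2 for v in values]   # squared differences from the mean
    divisor = n - 1 if sample else n                    # sample vs. population
    variance = sum(squared_diffs) / divisor             # average squared difference
    return math.sqrt(variance)                          # square root of the variance

# Hypothetical data chosen so the arithmetic is easy to follow by hand.
scores = [2, 4, 4, 4, 5, 5, 7, 9]

population_sd = standard_deviation(scores)
sample_sd = standard_deviation(scores, sample=True)
print(population_sd, round(sample_sd, 3))
```

The sample version is always slightly larger than the population version for the same data, because the same sum of squared differences is divided by a smaller number.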
Application of Standard Deviation
Example: Dog Heights
Heights of dogs at the shoulder (in mm): 600, 470, 170, 430, and 300.
Step 1: Calculate the Mean
Mean Height Calculation:
Total number of dogs (n) = 5.
Mean = (600 + 470 + 170 + 430 + 300) / 5 = 1970 / 5 = 394 mm.
Step 2: Differences from the Mean
Calculating each dog's difference from the mean:
Height data: 600, 470, 170, 430, 300
Differences calculation: Difference = height - mean, where the mean is 394 mm (for example, 600 - 394 = 206).
Individual results: 206, 76, -224, 36, -94.
Total number of dogs: 5.
Implications of Standard Deviation
Standard deviation illustrates the typical variation from the average height of dogs.
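It also gives a yardstick for which heights are typical: values within one standard deviation of the mean (about ± 164 mm from the mean of 394 mm, using the sample formula that divides by n - 1) can be regarded as ordinary, while values further away are unusually tall or short.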
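The remaining steps of the dog heights example can be worked through with Python's `statistics` module. This is a sketch for checking the arithmetic; the variable names are chosen here. It shows that the figure of about 164 mm corresponds to the sample formula (dividing by n - 1), while dividing by n gives about 147 mm.

```python
import statistics

heights_mm = [600, 470, 170, 430, 300]

mean_height = statistics.mean(heights_mm)                 # 394
differences = [h - mean_height for h in heights_mm]       # 206, 76, -224, 36, -94

population_variance = statistics.pvariance(heights_mm)    # divides by n: 21704
population_sd = statistics.pstdev(heights_mm)             # about 147.3 mm
sample_sd = statistics.stdev(heights_mm)                  # divides by n - 1: about 164.7 mm

# Heights that lie within one sample standard deviation of the mean.
within_one_sd = [h for h in heights_mm if abs(h - mean_height) <= sample_sd]

print(mean_height, differences)
print(round(population_sd, 1), round(sample_sd, 1))
print(within_one_sd)
```

With either version of the standard deviation, the dogs measuring 470, 430, and 300 mm fall within one standard deviation of the mean, while the 600 mm and 170 mm dogs fall outside it.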
Understanding Normality
Importance of Normal Distribution:
Many variables exhibit normal distribution patterns.
Normality assumptions are critical for inferential statistics and hypothesis testing.
Characteristics:
For a normally distributed variable, approx. 68.3% of data will be within 1 SD of the mean, 95.4% within 2 SD, and 99.7% within 3 SD.
Key Symbols:
μ = mean (population)
x̄ or M = mean (sample)
σ = standard deviation (population)
s = standard deviation (sample)
Application of Normal Distribution
Statistical Tests:
Z-scores can compare observed vs. expected values.
Confidence intervals give a range of plausible values for the population mean.
Many hypothesis tests rely on normal distribution assumptions.
Z-Scores:
Definition: Z-scores measure how many standard deviations a value is from the mean.
Usage: Z-scores are standardized, which makes them useful for comparing values from different data sets.
Interpretation:
Z-score of 0 indicates the mean.
Positive Z-scores indicate a value above the mean.
Negative Z-scores indicate a value below the mean.
For normally distributed data, Z-scores map directly onto standard deviations: about 68% of values have Z-scores between -1 and +1, about 95% between -2 and +2, and about 99.7% between -3 and +3.
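A Z-score is computed as (value - mean) / standard deviation. The sketch below applies this to the dog heights from the earlier example. Treating the five dogs as the whole population (and therefore using the population standard deviation) is an assumption made here for illustration; the function name `z_score` is also chosen here.

```python
import statistics

heights_mm = [600, 470, 170, 430, 300]
mean_height = statistics.mean(heights_mm)
sd_height = statistics.pstdev(heights_mm)  # assumption: population standard deviation

def z_score(value, mean, sd):
    """Number of standard deviations that value lies above (+) or below (-) the mean."""
    return (value - mean) / sd

z_scores = {h: z_score(h, mean_height, sd_height) for h in heights_mm}

for height, z in z_scores.items():
    print(f"{height} mm -> z = {z:+.2f}")
```

The tallest dog gets a positive Z-score and the shortest a negative one, and a dog exactly at the mean height would have a Z-score of 0.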
Deviations from Normality
Considerations: Data are not always normally distributed.
Examples:
Skewness:
Definition: An asymmetrical distribution where tails differ in length.
Types:
Positive skew (longer right tail).
Negative skew (longer left tail).
Kurtosis:
Indicates how heavy the tails of a distribution are compared with a normal distribution, often described informally as peakedness or flatness.
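Skewness and kurtosis can both be computed from the central moments of the data. The sketch below uses the simple moment-based (population) formulas, which is one of several conventions; the function names and the small datasets are assumptions made up for this illustration.

```python
def central_moment(values, k):
    """Average of (value - mean) ** k over all values."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** k for v in values) / len(values)

def skewness(values):
    """Moment-based skewness: 0 for symmetric data, > 0 for a longer right tail."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Moment-based kurtosis minus 3, so a normal distribution scores about 0."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3

# Hypothetical datasets.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]          # one large value stretches the right tail
left_skewed = [-v for v in right_skewed]    # mirror image: longer left tail

print(skewness(symmetric), skewness(right_skewed), skewness(left_skewed))
print(excess_kurtosis(symmetric))
```

Statistical packages often apply small-sample corrections to these formulas, so their output may differ slightly from this sketch.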
Probability and Uncertainty
Probability: Reflects the chance of a specific outcome occurring.
Probability Value (P-value): The probability of obtaining results at least as extreme as those observed if chance alone were at work (that is, if the null hypothesis were true).
Interpretation of P-values:
Small P-values: The observed results would be unlikely under chance alone, suggesting a real effect.
Large P-values: The observed results are consistent with chance variation, so they are not statistically significant.
Probability Distribution: A description, often shown as a graph, of the probabilities of all possible outcomes rather than their observed frequencies.
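To make the ideas of a probability distribution and a P-value concrete, the sketch below uses a hypothetical coin-flip experiment that is not part of the course notes: a coin lands heads 9 times in 10 flips, and the question is how surprising that would be if the coin were fair. The function and variable names are assumptions chosen here.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials with success chance p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n_flips = 10
observed_heads = 9

# Probability distribution of the number of heads for a fair coin.
distribution = [binomial_pmf(k, n_flips, 0.5) for k in range(n_flips + 1)]

# One-sided P-value: chance of 9 or more heads if the coin is fair.
p_one_sided = sum(distribution[observed_heads:])

# Two-sided P-value: doubling is valid here because the fair-coin distribution is symmetric.
p_two_sided = min(1.0, 2 * p_one_sided)

print(round(p_one_sided, 4), round(p_two_sided, 4))
```

A result this extreme would occur in only about 2% of experiments with a fair coin, which is a small P-value in the sense described above.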
Sources of Uncertainty in Data
Variability in sampling can yield different results from the same population.
Measurement errors can arise from inaccuracies in tools, methods, or human factors.
Model assumptions may distort analysis if they don't reflect actual events.
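Sampling variability, the first source of uncertainty listed above, can be illustrated with a small simulation. Everything in this sketch is an assumption made for illustration: the simulated population (mean 170, standard deviation 10), the sample size of 25, and the fixed random seed used to make the run repeatable.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the simulation is repeatable

# Hypothetical population of 10,000 measurements.
population = [rng.gauss(170, 10) for _ in range(10_000)]

# Draw five separate random samples of 25 values and record each sample mean.
sample_means = [statistics.mean(rng.sample(population, 25)) for _ in range(5)]

print("population mean:", round(statistics.mean(population), 2))
for i, m in enumerate(sample_means, start=1):
    print(f"sample {i} mean: {m:.2f}")
```

Each sample comes from the same population, yet the sample means differ from one another and from the population mean, which is why a single sample carries uncertainty.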
Summary of Key Points
Range: Measures dispersion within a dataset.
Standard Deviation: Typical distance of values from the mean.
Normality: Important for statistical analysis but may not always be present (consider skewness/kurtosis).
Uncertainty: An inherent aspect of data quantifiable through probability.