Part 3
Topic context
- Week 1, Lecture Series: Data Preparation, Data Exploration, Cleaning, and Managing Data (Part 3).
- Focus: revisiting measures of central tendency (mean, median, mode) and variability (range, variance, standard deviation) after handling outliers and achieving a reasonable distribution.
- Relationship to prior concepts: checking for normal distribution and addressing outliers to shape data toward a bell curve; then using central tendency and dispersion to summarize data for further analysis (e.g., SPSS workflow in the workshop).
Key concepts: central tendency
Mode
Definition: the most common score in a data set.
Example: in a set of 20 numbers, if 34 occurs three times, 72 occurs three times, and no other value occurs as often, the distribution is bimodal (two modes: 34 and 72).
Interpretation: mode(s) indicate the most frequent value; distributions can be unimodal (one mode) or bimodal/multimodal (multiple modes).
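The bimodal case can be checked with a short Python sketch. The 20 scores below are hypothetical (only the values 34 and 72 come from the example above), and the unimodal/bimodal/multimodal labels are simply assigned from the number of tied modes.

```python
from statistics import multimode

# Hypothetical data set of 20 scores: 34 and 72 each occur three times,
# and every other value occurs less often.
scores = [34, 34, 34, 72, 72, 72, 50, 50, 61, 45,
          58, 66, 49, 53, 70, 41, 63, 55, 47, 68]

# multimode() returns every value tied for the highest frequency,
# in the order the values first appear in the data.
modes = multimode(scores)
print(modes)  # [34, 72]

# Label the distribution by how many modes it has.
if len(modes) == 1:
    shape = "unimodal"
elif len(modes) == 2:
    shape = "bimodal"
else:
    shape = "multimodal"
print(shape)  # bimodal
```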
Median
Definition: the middle score in an ordered data set.
50% below and 50% above the median.
Example with odd n: the median is the single middle value (e.g., with 19 ordered values, the 10th value, 62, has 9 values below it and 9 above it).
Example with even n: there are two middle scores; the median is the average of those two values.
- Example given: two middle scores are 62 and 68; median = \frac{62 + 68}{2} = 65.
Note: the median is robust to outliers and skewness, but it requires ordering the data first and is less convenient than the mean for algebraic manipulation.
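A minimal Python sketch of the odd-n and even-n cases, and of the median's robustness to an outlier. The data sets are hypothetical; only the middle values 62 and 68 are taken from the example above.

```python
from statistics import mean, median

odd_scores = [55, 58, 62, 68, 71]        # n = 5: one middle value
even_scores = [55, 58, 62, 68, 71, 74]   # n = 6: two middle values, 62 and 68

print(median(odd_scores))    # 62
print(median(even_scores))   # 65.0, the average of 62 and 68

# Replace the largest score with an extreme outlier (hypothetical).
with_outlier = [55, 58, 62, 68, 71, 740]

print(median(with_outlier))  # 65.0, unchanged by the outlier
print(mean(even_scores))     # about 64.7
print(mean(with_outlier))    # about 175.7, pulled far upward by the outlier
```

The median depends only on the middle of the ordered data, so changing an extreme value leaves it untouched, while the mean shifts substantially.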
Mean (the average)
Definition: the sum of all scores divided by the number of scores.
Notation: the mean can be written with a hat or a bar. The transcript writes \hat{X}; the variance formulas in these notes use the more common \bar{X}, so that symbol is used here: \bar{X} = \frac{1}{N} \sum_{i=1}^{N} X_i, where N is the total number of scores and X_i is the i-th score.
Why we like the mean:
- It can be calculated directly from the data using a simple formula, without sorting the data.
- It is the most widely used measure of central tendency, especially for interval/ratio data, and is often a better estimator of the population mean than the sample mode or sample median when inferring about the population.
- It enables mathematical/statistical analysis and modeling.
Practical note: the mean is most appropriate for interval/ratio-scale data (as opposed to ordinal). This aligns with many questionnaire/survey designs in which responses are on a meaningful numeric scale.
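As a small illustration of the definition, the mean can be computed in one pass without sorting. This is a sketch only: `sample_mean` is a hypothetical helper name and the scores are made up.

```python
def sample_mean(scores):
    """Sum of all scores divided by the number of scores."""
    if not scores:
        raise ValueError("the mean of an empty data set is undefined")
    return sum(scores) / len(scores)


# Hypothetical scores, deliberately left unsorted.
scores = [4, 2, 5, 3, 1]
print(sample_mean(scores))  # 3.0
```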
Measures of variability (dispersion)
Concept of variability
Variability shows how far data points spread around the mean.
Low variability: data tightly clustered around the mean (small arrows around the mean in a histogram).
High variability: data more dispersed around the mean (larger spread around the mean).
Importance: even with the same mean, different variabilities imply different information about the population; a tight cluster around the mean provides a better summary than a widely dispersed set.
Range
Definition: difference between the maximum and minimum scores in the data set.
Example in histograms: a visible extreme outlier can inflate the range, suggesting more spread than is representative for most data.
Limitation: highly sensitive to outliers; not a robust measure of dispersion.
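The sensitivity of the range to a single extreme value can be seen in a short sketch. The data and the helper name `value_range` are hypothetical.

```python
def value_range(scores):
    """Difference between the maximum and minimum scores."""
    return max(scores) - min(scores)


typical = [12, 14, 15, 15, 16, 17, 18]   # hypothetical, tightly clustered
with_outlier = typical + [60]            # same data plus one extreme score

print(value_range(typical))       # 6
print(value_range(with_outlier))  # 48
```

One added score multiplies the range by eight even though the other seven scores have not moved, which is why the range alone can overstate the spread of most of the data.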
Variance (sample variance)
Intuition: variance is (approximately) the average of the squared deviations from the mean; it quantifies, in squared units, how far data points typically lie from the mean.
Formula (sample variance):
- s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}
- Here, \bar{X} is the sample mean and n is the sample size.
Why we square deviations: to avoid cancellations (sum of raw deviations would be zero) and to emphasize larger deviations.
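The cancellation of raw deviations can be checked numerically. The sketch below uses a hypothetical four-score data set and computes the sample variance step by step; the variable names are illustrative.

```python
# Hypothetical scores with mean 4.
scores = [2, 3, 4, 7]
n = len(scores)
mean_score = sum(scores) / n                    # 4.0

# Raw deviations from the mean: -2, -1, 0, 3.
deviations = [x - mean_score for x in scores]
print(sum(deviations))                          # 0.0, the deviations cancel out

# Squared deviations: 4, 1, 0, 9. The largest deviation (3) contributes
# 9 of the total 14, so large deviations dominate the sum.
squared = [d ** 2 for d in deviations]
sum_of_squares = sum(squared)                   # 14.0

# Sample variance: sum of squared deviations divided by n - 1.
s2 = sum_of_squares / (n - 1)                   # 14 / 3
print(s2)
```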
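The divisor n-1 (rather than n): deviations are measured from the sample mean, which lies closer to the sample's own values than the population mean does, so dividing by n would underestimate the population variance on average. Dividing by n-1 corrects this and makes s^2 an unbiased estimator of the population variance.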
Interpretation: s^2 indicates how spread out the data are around the mean; larger values mean more dispersion.
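The effect of the n-1 divisor can be illustrated by brute force. The sketch below assumes a tiny hypothetical population {0, 3, 6} and enumerates every possible sample of size 2 drawn with replacement; averaging the variance estimates over all samples shows which divisor recovers the population variance. The function name `spread` is illustrative.

```python
from itertools import product

# Hypothetical population; its variance uses the divisor N because
# it is the whole population, not a sample.
population = [0, 3, 6]
N = len(population)
mu = sum(population) / N
population_variance = sum((x - mu) ** 2 for x in population) / N


def spread(sample, divisor):
    """Sum of squared deviations from the sample mean, over a chosen divisor."""
    m = sum(sample) / len(sample)
    return sum((x - m) ** 2 for x in sample) / divisor


# All 9 ordered samples of size n = 2, drawn with replacement.
n = 2
samples = list(product(population, repeat=n))

avg_with_n_minus_1 = sum(spread(s, n - 1) for s in samples) / len(samples)
avg_with_n = sum(spread(s, n) for s in samples) / len(samples)

print(population_variance)   # 6.0
print(avg_with_n_minus_1)    # 6.0, matches the population variance
print(avg_with_n)            # 3.0, underestimates it
```

Averaged over every possible sample, the n-1 version equals the population variance exactly, while the n version comes out too small. This is one small case, not a proof, but it shows what "unbiased" means here.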
Standard deviation (SD)
Definition: the square root of the variance; provides dispersion in the same units as the data.
Formula:
- s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}}.
Practical interpretation: a more intuitive measure of typical deviation from the mean; easier to compare across datasets with the same units.
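To make the point about units concrete, here is a sketch with hypothetical heights in centimetres. Python's `statistics.variance` and `statistics.stdev` both use the n-1 divisor, matching the sample formulas above.

```python
from math import sqrt
from statistics import stdev, variance

# Hypothetical heights measured in centimetres.
heights_cm = [160, 165, 170, 175, 180]

s2 = variance(heights_cm)   # sample variance, in squared centimetres
s = stdev(heights_cm)       # sample standard deviation, in centimetres

print(s2)  # 62.5
print(s)   # about 7.91
```

A variance of 62.5 square centimetres is hard to picture, whereas an SD of roughly 7.9 cm can be read directly against the heights themselves.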
Worked examples (variance and standard deviation)
- Example 1: data = {2, 3, 4}
- Mean: \bar{X} = \frac{2 + 3 + 4}{3} = 3
- Deviations: (2-3 = -1), (3-3 = 0), (4-3 = 1)
- Squared deviations: (1, 0, 1); Sum = 2
- Variance: s^2 = \frac{2}{3-1} = \frac{2}{2} = 1
- Standard deviation: s = \sqrt{1} = 1
- Example 2: data = {0, 3, 6}
- Mean: \bar{X} = \frac{0 + 3 + 6}{3} = 3
- Deviations: (0-3 = -3), (3-3 = 0), (6-3 = 3)
- Squared deviations: (9, 0, 9); Sum = 18
- Variance: s^2 = \frac{18}{3-1} = \frac{18}{2} = 9
- Standard deviation: s = \sqrt{9} = 3
- Takeaway: the two datasets share the same mean, but the larger dispersion in the second leads to a larger variance and SD.
- Relationship: larger deviations from the mean inflate the variance and SD; small variance means the mean is a better representative of the data.
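The two worked examples above can be verified with Python's standard `statistics` module, whose `variance` and `stdev` functions use the n-1 divisor. This is only a cross-check of the hand calculations, not part of the workshop workflow.

```python
from statistics import mean, stdev, variance

datasets = {
    "Example 1": [2, 3, 4],
    "Example 2": [0, 3, 6],
}

results = {}
for name, data in datasets.items():
    # Store (mean, sample variance, sample standard deviation) per data set.
    results[name] = (mean(data), variance(data), stdev(data))
    print(name, results[name])

# Example 1: mean 3, variance 1, SD 1
# Example 2: mean 3, variance 9, SD 3
```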
Practical application: computing in SPSS (workflow overview in the workshop)
- Data setup: load your dataset and select the variable you want to examine (total column or mean column).
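- Navigation: open the Statistics options (in the standard SPSS menus this is typically Analyze > Descriptive Statistics > Frequencies, then the Statistics button) and tick the measures of central tendency and dispersion you want.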
- Outputs available:
- Central tendency: mean, median, mode.
- Dispersion: range, standard deviation, variance.
- Also provides minimum and maximum values.
- Interpretation of SPSS output (example described in the transcript):
- Mean: 18.7; Median: 19; Mode: 18.
- These three values being close suggests a roughly symmetric distribution, likely unimodal and near normal.
- Range: 25 (scores run from a minimum of 5 to a maximum of 30, so 30 - 5 = 25; the underlying items use a five-point scale).
- Indicates a good spread across the scale rather than data all clustered at a single point.
- Standard deviation (example): 5.7; Variance (derived): s^2 \approx 5.7^2 = 32.49 (approximate because the reported SD is itself rounded).
- The distribution appears well-dispersed with a wide range of scores.
- Other outputs and implications:
- SPSS may show the number of valid data points and missing data points (e.g., valid data points: 220; missing data points: 120).
- The presence of missing data highlights the potential need for imputation or handling missingness in analysis.
- Practical write-up guidance: report the mean and standard deviation as primary descriptors of central tendency and variability, respectively, e.g., mean = 18.7, SD = 5.7.
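For readers without SPSS to hand, a comparable summary can be sketched in plain Python. This is not SPSS output and does not reproduce the transcript's data: the `total_scores` column is hypothetical, `describe` is an illustrative helper name, and missing responses are assumed to be recorded as `None`.

```python
from statistics import mean, median, multimode, stdev, variance


def describe(column):
    """Summarise one column, counting None entries as missing."""
    valid = [x for x in column if x is not None]
    return {
        "n_valid": len(valid),
        "n_missing": len(column) - len(valid),
        "mean": mean(valid),
        "median": median(valid),
        "mode": multimode(valid),
        "sd": stdev(valid),            # n - 1 divisor
        "variance": variance(valid),   # n - 1 divisor
        "range": max(valid) - min(valid),
        "min": min(valid),
        "max": max(valid),
    }


# Hypothetical total scores with two missing responses.
total_scores = [18, 22, None, 15, 18, 25, None, 12, 20, 18]
summary = describe(total_scores)

# Write-up style: mean and SD as the primary descriptors.
print(f"mean = {summary['mean']:.1f}, SD = {summary['sd']:.1f}")  # mean = 18.5, SD = 4.0
print(f"valid = {summary['n_valid']}, missing = {summary['n_missing']}")  # valid = 8, missing = 2
```

Here the missing values are simply excluded before computing the statistics; whether to exclude or impute them is a separate decision that should be reported alongside the results.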
Connections to prior lectures and broader implications
- Normal distribution and outliers: after addressing outliers, data can resemble a normal distribution, making the mean/SD more informative.
- Central tendency and population inference: the mean is used to estimate the population mean from a sample; it serves as the basis for many statistical tests and confidence intervals.
- Shape and symmetry indicators: close mean, median, and mode suggest a symmetric, unimodal distribution; larger gaps among them hint at skewness or multimodality.
- Practical data-quality considerations: outliers affect range; missing data affect the reliability of summary statistics; imputation strategies are important for maintaining data integrity.
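The mean-versus-median comparison mentioned above can be turned into a quick check. This is a rough heuristic sketched on hypothetical data, not a formal skewness statistic; `mean_median_gap` is an illustrative helper name.

```python
from statistics import mean, median


def mean_median_gap(scores):
    """Mean minus median: near zero for symmetric data, positive when
    a long right tail pulls the mean upward."""
    return mean(scores) - median(scores)


symmetric = [1, 2, 3, 4, 5]       # hypothetical, symmetric around 3
right_skewed = [1, 2, 3, 4, 30]   # same data with one large value

print(mean_median_gap(symmetric))     # 0
print(mean_median_gap(right_skewed))  # 5
```

A gap close to zero is consistent with symmetry but does not prove it, so a histogram is still worth inspecting.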
Formulas (quick reference)
- Mean (sample): \bar{X} = \frac{1}{N} \sum_{i=1}^{N} X_i
- Median (general concept): middle value when data are ordered; for even n, \text{Median} = \frac{a_{(n/2)} + a_{(n/2+1)}}{2}, where a_{(k)} is the k-th value in sorted order
- Variance (sample): s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}
- Standard deviation: s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}}
- Range (quick): \text{Range} = \max(X_i) - \min(X_i)
Summary takeaways
- Mean, median, and mode are central tendency measures; the mean is preferred when data are interval/ratio and when population inference is desired, provided the data are not overly skewed or heavily affected by outliers.
- Variance and standard deviation quantify dispersion around the mean; n-1 in the denominator makes the variance an unbiased estimator of the population variance from a sample.
- Practical data analysis involves using software (e.g., SPSS) to obtain these statistics quickly, check data quality (missing data, outliers), and guide interpretation of the data distribution and subsequent analyses.
Ethical/philosophical/practical implications
- Accurate reporting of means and dispersion is essential for valid inferences about populations.
- Handling missing data (imputation) involves assumptions that can influence results; transparency about methods used is important.
- Understanding variability helps avoid overinterpretation of a single central value and encourages consideration of data spread when making decisions or policy recommendations.