Part 3

  • Topic context

    • Week 1, Lecture Series: Data Preparation, Data Exploration, Cleaning, and Managing Data (Part 3).
    • Focus: revisiting measures of central tendency (mean, median, mode) and variability (range, variance, standard deviation) after handling outliers and achieving a reasonable distribution.
    • Relationship to prior concepts: checking for normal distribution and addressing outliers to shape data toward a bell curve; then using central tendency and dispersion to summarize data for further analysis (e.g., SPSS workflow in the workshop).
  • Key concepts: central tendency

    • Mode

    • Definition: the most common score in a data set.

    • Example: in a set of 20 numbers, if 34 occurs three times, 72 occurs three times, and no other value occurs as often, the distribution is bimodal (two modes: 34 and 72).

    • Interpretation: mode(s) indicate the most frequent value; distributions can be unimodal (one mode) or bimodal/multimodal (multiple modes).
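The bimodal case above can be checked with a minimal Python sketch. The 20 scores are hypothetical, chosen only so that 34 and 72 each occur three times and every other value occurs once; `multimode` comes from Python's standard `statistics` module.

```python
from statistics import multimode

# Hypothetical 20 scores: 34 and 72 each occur three times, all others once.
scores = [34, 34, 34, 72, 72, 72, 41, 45, 48, 50,
          53, 55, 58, 60, 63, 65, 67, 69, 75, 80]

# multimode returns every value tied for the highest frequency.
modes = multimode(scores)
print(modes)  # [34, 72]

label = {1: "unimodal", 2: "bimodal"}.get(len(modes), "multimodal")
print(label)  # bimodal
```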

    • Median

    • Definition: the middle score in an ordered data set.

    • 50% below and 50% above the median.

    • Example with odd n: the median is the single middle value (e.g., 62 when it is the middle score of 19 ordered values).

    • Example with even n: there are two middle scores; the median is the average of those two values.

      • Example given: two middle scores are 62 and 68; median = \frac{62 + 68}{2} = 65.
    • Note: median is robust to outliers and skewness, but not as easy to compute algebraically as the mean.
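A short sketch of the odd and even cases, assuming small hypothetical score lists whose middle values match the 62 and 68 used above. `middle_value` is an illustrative helper, not part of any library; the standard-library `statistics.median` gives the same answer.

```python
from statistics import median

def middle_value(values):
    """Median by hand: sort, then take the middle score or average the two middle scores."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                                  # odd n: single middle value
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2    # even n: mean of the two middle values

odd_scores = [70, 55, 62]        # hypothetical, n = 3
even_scores = [70, 55, 68, 62]   # hypothetical, n = 4

print(middle_value(odd_scores))   # 62
print(middle_value(even_scores))  # 65.0

# Robustness: replacing the largest score with an extreme value leaves the median unchanged.
print(middle_value([700, 55, 68, 62]))  # 65.0
```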

    • Mean (the average)

    • Definition: the sum of all scores divided by the number of scores.

    • Notation: the sample mean is usually written with a bar (the transcript also mentions a hat): \bar{X} = \frac{1}{N} \sum_{i=1}^{N} X_i, where N is the total number of scores and X_i is the i-th score.

    • Why we like the mean:

      • It can be calculated directly from the data using a simple formula, without sorting the data.
      • It is the most widely used measure of central tendency, especially for interval/ratio data, and is often a better estimator of the population mean than the sample mode or sample median when inferring about the population.
      • It enables mathematical/statistical analysis and modeling.
    • Practical note: the mean is most appropriate for interval/ratio-scale data (as opposed to ordinal). This aligns with many questionnaire/survey designs in which responses are on a meaningful numeric scale.
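As a rough illustration, the mean formula translates directly into code, with no sorting needed. The scores below are hypothetical; the second half of the sketch shows why the mean should be used with care when an extreme value is present.

```python
from statistics import mean, median

scores = [2, 3, 4, 5, 6]            # hypothetical scores

# Mean straight from the formula: sum of scores divided by the number of scores.
x_bar = sum(scores) / len(scores)
print(x_bar)                         # 4.0

# A single extreme score pulls the mean strongly; the median barely moves.
with_outlier = scores + [100]
print(mean(with_outlier))            # 20
print(median(with_outlier))          # 4.5
```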

  • Measures of variability (dispersion)

    • Concept of variability

    • Variability shows how far data points spread around the mean.

    • Low variability: data tightly clustered around the mean (small arrows around the mean in a histogram).

    • High variability: data more dispersed around the mean (larger spread around the mean).

    • Importance: two data sets with the same mean can differ in variability and so tell different stories about the population; the mean summarizes a tightly clustered data set better than a widely dispersed one.

    • Range

    • Definition: difference between the maximum and minimum scores in the data set.

    • Example in histograms: a visible extreme outlier can inflate the range, suggesting more spread than is representative for most data.

    • Limitation: highly sensitive to outliers; not a robust measure of dispersion.
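A minimal sketch of this sensitivity, using hypothetical scores; `value_range` is an illustrative helper name.

```python
def value_range(values):
    """Range = maximum score minus minimum score."""
    return max(values) - min(values)

typical = [18, 19, 20, 21, 22]   # hypothetical, tightly clustered scores
with_outlier = typical + [60]    # same scores plus one extreme value

print(value_range(typical))       # 4
print(value_range(with_outlier))  # 42
```

One added score changes the range from 4 to 42, even though the other five scores are unchanged.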

    • Variance (sample variance)

    • Intuition: variance is roughly the average squared deviation from the mean (the sum of squared deviations divided by n-1); it quantifies how far data points typically lie from the mean.

    • Formula (sample variance):

      • s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}
      • Here, \bar{X} is the sample mean and n is the sample size.
    • Why we square deviations: to avoid cancellations (sum of raw deviations would be zero) and to emphasize larger deviations.

    • The divisor n-1 (rather than n): dividing by n-1 makes the sample variance s^2 an unbiased estimator of the population variance.

    • Interpretation: s^2 indicates how spread out the data are around the mean; larger values mean more dispersion.
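The steps behind the sample variance can be sketched as follows, assuming a small hypothetical data set. The sketch shows that the raw deviations cancel to zero, and compares the n-1 divisor with the n divisor using `variance` and `pvariance` from Python's standard `statistics` module.

```python
from statistics import pvariance, variance

data = [2, 4, 4, 6, 9]                      # hypothetical scores
n = len(data)
x_bar = sum(data) / n                        # 5.0

deviations = [x - x_bar for x in data]       # [-3.0, -1.0, -1.0, 1.0, 4.0]
print(sum(deviations))                       # 0.0, raw deviations cancel out

sum_sq = sum(d ** 2 for d in deviations)     # 28.0
s2 = sum_sq / (n - 1)                        # sample variance, divisor n-1
sigma2_n = sum_sq / n                        # divisor n, shown for comparison

print(f"{s2:.2f}")        # 7.00
print(f"{sigma2_n:.2f}")  # 5.60
```

Here `s2` should agree with `variance(data)` and `sigma2_n` with `pvariance(data)`; the n-1 version is always the slightly larger of the two.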

    • Standard deviation (SD)

    • Definition: the square root of the variance; provides dispersion in the same units as the data.

    • Formula:

      • s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}}.
    • Practical interpretation: a more intuitive measure of typical deviation from the mean; easier to compare across datasets with the same units.
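A small sketch of the units point, assuming hypothetical heights in centimetres: the variance comes out in squared centimetres, while the standard deviation is back in centimetres.

```python
import math
from statistics import stdev, variance

heights_cm = [160, 165, 170, 175, 180]   # hypothetical heights in cm

s2 = variance(heights_cm)   # sample variance, in cm^2
s = math.sqrt(s2)           # standard deviation, in cm

print(f"variance = {s2:.1f} cm^2")  # variance = 62.5 cm^2
print(f"SD = {s:.2f} cm")           # SD = 7.91 cm
```

The library function `stdev(heights_cm)` should return the same value as `s` up to floating-point rounding.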

  • Worked examples (variance and standard deviation)

    • Example 1: data = {2, 3, 4}
    • Mean: \bar{X} = \frac{2 + 3 + 4}{3} = 3
    • Deviations: (2-3 = -1), (3-3 = 0), (4-3 = 1)
    • Squared deviations: (1, 0, 1); Sum = 2
    • Variance: s^2 = \frac{2}{3-1} = \frac{2}{2} = 1
    • Standard deviation: s = \sqrt{1} = 1
    • Example 2: data = {0, 3, 6}
    • Mean: \bar{X} = \frac{0 + 3 + 6}{3} = 3
    • Deviations: (0-3 = -3), (3-3 = 0), (6-3 = 3)
    • Squared deviations: (9, 0, 9); Sum = 18
    • Variance: s^2 = \frac{18}{3-1} = \frac{18}{2} = 9
    • Standard deviation: s = \sqrt{9} = 3
    • Takeaway: the two data sets have the same mean, but the larger dispersion in the second one gives a larger variance and SD.
    • Relationship: larger deviations from the mean inflate the variance and SD; small variance means the mean is a better representative of the data.
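Both worked examples can be reproduced with Python's standard `statistics` module, whose `variance` and `stdev` functions use the n-1 divisor. This is only a quick check of the arithmetic above, not part of the SPSS workflow.

```python
from statistics import mean, stdev, variance

examples = {
    "Example 1": [2, 3, 4],
    "Example 2": [0, 3, 6],
}

results = {}
for name, data in examples.items():
    # Store (mean, sample variance, sample SD) for each worked example.
    results[name] = (mean(data), variance(data), stdev(data))
    m, s2, s = results[name]
    print(f"{name}: mean = {m:.1f}, variance = {s2:.1f}, SD = {s:.1f}")

# Example 1: mean = 3.0, variance = 1.0, SD = 1.0
# Example 2: mean = 3.0, variance = 9.0, SD = 3.0
```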
  • Practical application: computing in SPSS (workflow overview in the workshop)

    • Data setup: load your dataset and select the variable you want to examine (total column or mean column).
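    • Navigation: open the Statistics options (in SPSS this is typically reached via Analyze > Descriptive Statistics > Frequencies, then the Statistics button) and select the measures of central tendency and dispersion you need.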
    • Outputs available:
    • Central tendency: mean, median, mode.
    • Dispersion: range, standard deviation, variance.
    • Also provides minimum and maximum values.
    • Interpretation of SPSS output (example described in the transcript):
    • Mean: 18.7; Median: 19; Mode: 18.
    • These three values being close suggests a roughly symmetric distribution, likely unimodal and near normal.
    • Range: 25 (maximum 30 minus minimum 5, for scores built from five-point-scale items).
    • Indicates a good spread across the scale rather than data all clustered at a single point.
    • Standard deviation (example): 5.7; Variance (derived from the rounded SD): s^2 = 5.7^2 = 32.49.
    • The distribution appears well-dispersed with a wide range of scores.
    • Other outputs and implications:
    • SPSS may show the number of valid data points and missing data points (e.g., valid data points: 220; missing data points: 120).
    • The presence of missing data highlights the potential need for imputation or handling missingness in analysis.
    • Practical write-up guidance: report the mean and standard deviation as primary descriptors of central tendency and variability, respectively, e.g., mean = 18.7, SD = 5.7.
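For readers without SPSS at hand, a comparable summary can be sketched in Python. This is an illustrative stand-in, not SPSS output: the `describe` helper, the use of `None` for missing values, and the responses themselves are all assumptions, and the numbers do not reproduce the transcript's example.

```python
from statistics import mean, median, multimode, stdev, variance

def describe(values):
    """Summarize a variable, treating None as a missing value (hypothetical helper)."""
    valid = [v for v in values if v is not None]
    return {
        "n_valid": len(valid),
        "n_missing": len(values) - len(valid),
        "mean": mean(valid),
        "median": median(valid),
        "modes": multimode(valid),
        "sd": stdev(valid),
        "variance": variance(valid),
        "min": min(valid),
        "max": max(valid),
        "range": max(valid) - min(valid),
    }

# Hypothetical total scores with two missing responses.
responses = [18, 19, None, 12, 25, 18, None, 30, 5, 21]

summary = describe(responses)
for key, value in summary.items():
    print(f"{key}: {value}")

# Typical write-up line: mean and SD as the primary descriptors.
print(f"mean = {summary['mean']:.1f}, SD = {summary['sd']:.1f}")
```

Dropping missing values before summarizing, as done here, is only one way to handle missingness; it is used in the sketch for simplicity.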
  • Connections to prior lectures and broader implications

    • Normal distribution and outliers: after addressing outliers, data can resemble a normal distribution, making the mean/SD more informative.
    • Central tendency and population inference: the mean is used to estimate the population mean from a sample; it serves as the basis for many statistical tests and confidence intervals.
    • Shape and symmetry indicators: close mean, median, and mode suggest a symmetric, unimodal distribution; larger gaps among them hint at skewness or multimodality.
    • Practical data-quality considerations: outliers affect range; missing data affect the reliability of summary statistics; imputation strategies are important for maintaining data integrity.
  • Formulas (quick reference)

    • Mean (sample): \bar{X} = \frac{1}{N} \sum_{i=1}^{N} X_i
    • Median (general concept): middle value when data are ordered; for even n, \text{Median} = \frac{a_{(n/2)} + a_{(n/2+1)}}{2}
    • Variance (sample): s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}
    • Standard deviation: s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}}
    • Range (quick): \text{Range} = \max(X_i) - \min(X_i)
  • Summary takeaways

    • Mean, median, and mode are central tendency measures; the mean is preferred when data are interval/ratio and when population inference is desired, provided the data are not heavily skewed or dominated by outliers.
    • Variance and standard deviation quantify dispersion around the mean; n-1 in the denominator makes the variance an unbiased estimator of the population variance from a sample.
    • Practical data analysis involves using software (e.g., SPSS) to obtain these statistics quickly, check data quality (missing data, outliers), and guide interpretation of the data distribution and subsequent analyses.
  • Ethical/philosophical/practical implications

    • Accurate reporting of means and dispersion is essential for valid inferences about populations.
    • Handling missing data (imputation) involves assumptions that can influence results; transparency about methods used is important.
    • Understanding variability helps avoid overinterpretation of a single central value and encourages consideration of data spread when making decisions or policy recommendations.