Chapter 6: Measures of Dispersion

6.1 Get Ready

  • In this chapter, the focus is on understanding variation or dispersion in outcomes while describing how variables are distributed.

  • The concept of skewness was mentioned in the last chapter but will be addressed more broadly here by examining overall spread.

  • Required R packages to follow along:

    • descr

    • DescTools

    • Hmisc

  • Data sets needed:

    • anes20

    • states20

    • cces20

  • Cooperative Congressional Election Study (cces20):

    • Similar to ANES, it is a large-scale, regularly occurring survey of political attitudes and behaviors, featuring pre- and post-election waves.

    • Key differences from ANES:

    • Fewer topics and questions.

    • Larger sample size: 61,000 respondents in the 2020 survey.

    • Used in this chapter to demonstrate some measures of dispersion, because the age variable in ANES is truncated.

6.2 Introduction

  • Measures of central tendency provide expected values, but variables with identical means, medians, and modes can have differing distributions.

  • Three symmetric distributions with the same central tendencies can vary in dispersion:

    • First distribution: Most tightly clustered around the mean.

    • Second distribution: Average spread.

    • Third distribution: Widely spread out.

  • How tightly observations concentrate around the central tendency is an important feature of a distribution, and it can be quantified with measures of dispersion.

6.3 Measures of Spread

  • Focus: Describing how far observations spread out, before turning to dispersion around the mean.

  • Measures of spread report the lower and upper limits of different ranges of outcomes.

6.3.1 Range
  • Definition: Difference between the lowest and highest values of a variable.

  • Limited use on its own; for example:

    • The three distributions described in Section 6.2 can share the same range (4 to 16) despite having distinct shapes.

  • Helpful in spotting coding errors in datasets:

    • E.g., respondent age should realistically run from about 18 to 100; a reported range that extends to 932 points to a coding error.

  • For the cces20 survey, the age range is from 18 to 95, showing a plausible spread of 77 years.
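
The range check described above can be sketched in base R. The age vector is hypothetical (it is not the cces20 data), and the value 932 is planted to stand in for a data-entry error.

```r
# Hypothetical ages; 932 is a deliberately planted coding error.
age <- c(18, 25, 34, 47, 52, 61, 78, 95, 932)

age_limits <- range(age)        # lowest and highest observed values
age_spread <- diff(age_limits)  # highest minus lowest

# Flag values outside the plausible window used in the text (18 to 100).
implausible <- age[age < 18 | age > 100]

# Recompute the range after setting the flagged value aside.
clean_age <- age[age >= 18 & age <= 100]
clean_limits <- range(clean_age)
clean_spread <- diff(clean_limits)

print(age_limits)
print(implausible)
print(clean_limits)
print(clean_spread)
```

With the real data the same calls would be range(cces20$age) and diff(range(cces20$age)), assuming the data set is loaded.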

6.3.2 Interquartile Range (IQR)
  • Definition: The range between the 25th percentile (1st quartile) and 75th percentile (3rd quartile).

  • Importance:

    • Represents the middle 50% of a distribution.

    • More informative than the range: it shows where the middle half of the observations lie and how tightly they cluster around the center.

  • Calculating the IQR:

    • In R, the command IQR(cces20$age) returns the width of the IQR, while summary(cces20$age) reports its lower and upper limits (the 1st and 3rd quartiles).

  • Example from cces20$age variable:

    • 25th percentile (1st Qu.): 33

    • 75th percentile (3rd Qu.): 63

  • IQR width: 63 - 33 = 30. The middle half of respondents fall within a 30-year span, compared with 77 years for the full range.

  • Visualization:

    • Age histogram with vertical lines for 25th and 75th percentiles provides insight into data spread and central tendency.
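
A minimal sketch of the IQR calculation in base R. The nine ages are hypothetical, chosen so that the quartiles match the 33 and 63 reported above; they are not the cces20 data.

```r
# Hypothetical ages, sorted for readability (not the cces20 data).
age <- c(18, 25, 33, 41, 49, 56, 63, 74, 95)

# Limits of the middle 50%: the 25th and 75th percentiles.
q <- quantile(age, probs = c(0.25, 0.75))

# Width of the interquartile range (75th minus 25th percentile).
iqr_width <- IQR(age)

print(q)
print(iqr_width)
print(summary(age))  # also reports the 1st and 3rd quartiles
```

To reproduce the histogram described above, hist(age) followed by abline(v = q) draws vertical lines at the two quartiles.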

6.3.3 Boxplots
  • Boxplots represent range and IQR visually:

    • Box: Middle 50% of data.

    • Horizontal line (median): central tendency.

    • Whiskers: extend to the extremes of the data, with outliers marked individually beyond those limits.

  • Useful for examining skewness visually and for comparing distributions, e.g., cces20$age versus variables in states20.

  • The same process is applied to states20 variables such as percent foreign-born and abortion restrictions.
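
The numbers behind a boxplot can be inspected without drawing it. This sketch uses boxplot.stats from base R on hypothetical values with one high outlier. Note that boxplot.stats uses hinges, which can differ slightly from the quartiles returned by quantile.

```r
# Hypothetical values with one unusually high observation.
x <- c(2, 3, 3, 4, 5, 5, 6, 7, 8, 21)

b <- boxplot.stats(x)

# Five numbers: lower whisker, lower hinge, median, upper hinge, upper whisker.
print(b$stats)

# Observations more than 1.5 box-lengths beyond the box are reported as outliers.
print(b$out)
```

Calling boxplot(x) draws the corresponding plot, with the outlier shown as a separate point above the upper whisker.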

6.4 Dispersion Around the Mean

  • Dispersion can also be measured as the typical deviation of observations from the mean, using:

    • Mean Absolute Deviation (M.A.D.) and Variance.

    • M.A.D. Definition: The mean of the absolute deviations from the mean; taking absolute values treats negative deviations as positive, so they do not cancel the positive ones.

  • Calculating M.A.D. for states20$fb yields an average deviation of approximately 4.27.

  • Although M.A.D. is straightforward to interpret, it is rarely used as a building block for further statistical analysis.
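
A short sketch of the mean absolute deviation in base R. The five values are hypothetical percentages, not states20$fb. Base R's mad() function computes a scaled median absolute deviation, which is a different statistic, so the calculation is written out by hand.

```r
# Hypothetical percent foreign-born for five made-up states.
fb <- c(2, 4, 6, 8, 20)

deviations <- fb - mean(fb)   # signed deviations from the mean

# Signed deviations always sum to zero, so they cannot measure spread directly.
sum_dev <- sum(deviations)

# Mean absolute deviation: average distance from the mean, ignoring sign.
mean_abs_dev <- mean(abs(deviations))

print(deviations)
print(sum_dev)
print(mean_abs_dev)
```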

Variance
  • Definition: The sum of squared deviations from the mean, divided by n - 1; it is the foundation for many other statistics.

  • Calculate the variance of states20$fb with the sample variance formula:
    S^2 = \frac{1}{n-1} \times \sum_{i=1}^{n} (x_i - \bar{x})^2

  • In R, var(states20$fb) returns the variance in a single step.

  • Variance can be counterintuitive because it is expressed in squared units; for instance, a variance of 29.18 suggests more dispersion than the data actually show, and a variance can even exceed the range of the variable.
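
The variance calculation can be traced step by step and checked against var(). The values are hypothetical, not states20$fb, and are chosen so that the variance is larger than the range.

```r
# Hypothetical percent foreign-born for five made-up states.
fb <- c(2, 4, 6, 8, 20)

n <- length(fb)
squared_dev <- (fb - mean(fb))^2          # squared deviations from the mean
s2_manual <- sum(squared_dev) / (n - 1)   # sum of squares divided by n - 1
s2_builtin <- var(fb)                     # same quantity from base R

value_range <- diff(range(fb))            # highest minus lowest value

print(s2_manual)
print(s2_builtin)
print(value_range)
```

Here the variance (50) is larger than the range (18), which shows why squared units are hard to read as a description of spread.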

Standard Deviation (S)
  • Definition: The square root of the variance, which returns the measure to the original units of the variable.

  • S = \sqrt{S^2}

  • In R, sd(states20$fb) returns the standard deviation directly: approximately 5.40.

6.5 Coefficient of Variation

  • The standard deviation is expressed in the units of its variable, so it cannot be compared directly across variables measured on different scales. Dividing it by the mean gives a scale-free measure, the Coefficient of Variation (CV):
    CV = \frac{S}{\bar{x}}

  • E.g., percent foreign-born has a CV of 0.767092: the standard deviation is about 77% of the mean, so the states vary considerably relative to the average level.
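
A sketch of why the CV helps with comparisons across scales. The helper cv() is a hypothetical convenience function written for this example, and the data are made up. The same five values are expressed once as percentages and once as counts per 1,000 residents.

```r
# Hypothetical helper: standard deviation relative to the mean.
cv <- function(x) sd(x) / mean(x)

# Hypothetical data: the same quantity on two different scales.
fb_percent  <- c(2, 4, 6, 8, 20)
fb_per_1000 <- fb_percent * 10

sd_percent  <- sd(fb_percent)    # depends on the scale
sd_per_1000 <- sd(fb_per_1000)   # ten times larger
cv_percent  <- cv(fb_percent)    # unaffected by the change of scale
cv_per_1000 <- cv(fb_per_1000)

print(c(sd_percent, sd_per_1000))
print(c(cv_percent, cv_per_1000))
```

The standard deviation also equals sqrt(var(x)), so sd() and var() always agree with the formulas given earlier.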

6.6 Dichotomous Variables

  • Variance and standard deviation can also be computed for dichotomous (binary) variables coded 0 and 1, where the formula simplifies:

    • General variance formula for dichotomous data, where p is the proportion of cases coded 1:
      S^2 = p(1 - p)
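
A minimal sketch of the dichotomous formula, using a hypothetical 0/1 variable. R's var() divides by n - 1, so it is slightly larger than p(1 - p); the two differ by a factor of n / (n - 1).

```r
# Hypothetical binary indicator (1 = yes, 0 = no) for ten respondents.
voted <- c(1, 1, 1, 0, 1, 0, 1, 1, 0, 1)

n <- length(voted)
p <- mean(voted)              # proportion coded 1
s2_formula <- p * (1 - p)     # variance from the dichotomous formula
s2_sample  <- var(voted)      # sample variance, which divides by n - 1

print(p)
print(s2_formula)
print(s2_sample)
print(s2_formula * n / (n - 1))  # matches var(voted)
```

In a sample as large as cces20 the n / (n - 1) adjustment is negligible.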

6.7 Exploring Dispersion in Categorical Variables

  • For multi-category nominal variables, dispersion refers to how evenly responses are spread across categories. The Index of Qualitative Variation (IQV) measures this diversity relative to the maximum possible for the number of categories.

  • The formula, where K is the number of categories and p_k is the proportion of cases in category k:
    IQV = \frac{K(1 - \sum_{k=1}^{K} p_k^2)}{K - 1}

  • The IQV ranges from 0 (all cases in a single category) to 1 (cases spread evenly across all categories).
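
The IQV can be sketched as a small base R function. The name iqv() and the example data are hypothetical. The function counts K from the categories that table() reports, so a category with no cases must be declared as a factor level to be included.

```r
# Hypothetical helper: Index of Qualitative Variation for a nominal variable.
iqv <- function(x) {
  p <- prop.table(table(x))   # proportion of cases in each category
  k <- length(p)              # number of categories (K)
  as.numeric(k * (1 - sum(p^2)) / (k - 1))
}

# Hypothetical nominal data.
region <- c("South", "South", "West", "Midwest",
            "Northeast", "West", "South", "Midwest")
even   <- rep(c("a", "b", "c", "d"), times = 5)          # evenly spread
single <- factor(rep("a", 10), levels = c("a", "b"))     # all in one category

print(iqv(region))
print(iqv(even))
print(iqv(single))
```

The evenly spread variable gives the maximum value and the single-category variable gives the minimum, which brackets the result for region.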

6.8 The Standard Deviation and Normal Curve

  • The relationship of standard deviation with the normal distribution is critical:

    • Characteristics of the normal curve:

    • Single-peaked.

    • Mean=Median=Mode.

    • Infinite tails, etc.

  • Share of the area under a normal curve that lies within a given number of standard deviations of the mean:

    • 68.26% within 1 standard deviation.

    • 95.44% within 2 standard deviations, and so forth.
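
These coverage areas can be checked with pnorm, which returns the area under the standard normal curve to the left of a z-score. The helper within_sd() is a hypothetical name used only for this sketch.

```r
# Hypothetical helper: area within k standard deviations of the mean.
within_sd <- function(k) pnorm(k) - pnorm(-k)

coverage <- sapply(1:3, within_sd)

# Express as percentages for 1, 2, and 3 standard deviations.
print(round(100 * coverage, 2))
```

Printed z-tables carry four decimals, so figures derived from them (2 × 0.3413 = 0.6826 and 2 × 0.4772 = 0.9544) can differ from pnorm in the last digit.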

6.9 Area Under a Normal Curve Calculations

  • A z-score expresses a value as its distance from the mean in standard deviation units. The area under the normal curve below, above, or between z-scores can be found with a z-distribution table or with R's pnorm function.
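
A sketch of a z-score calculation followed by area lookups with pnorm. The mean, standard deviation, and observed value are hypothetical numbers chosen for easy arithmetic.

```r
# Hypothetical mean, standard deviation, and observed value.
x_bar <- 50
s     <- 10
x     <- 65

z <- (x - x_bar) / s                        # distance from the mean in SD units

below   <- pnorm(z)                         # area to the left of z
above   <- pnorm(z, lower.tail = FALSE)     # area to the right of z
between <- pnorm(1.5) - pnorm(-0.5)         # area between z = -0.5 and z = 1.5

# pnorm also accepts the raw value when given the mean and standard deviation.
below_raw <- pnorm(x, mean = x_bar, sd = s)

print(z)
print(c(below, above, between))
print(below_raw)
```

The areas below and above a z-score always sum to 1, which is a quick check on any table lookup.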

6.10 Key Functions in R

  • Desc (from the DescTools package) reports many descriptive statistics for a variable in a single call, which makes analysis more efficient.

  • The usage examples show how to read each descriptive statistic in the output.

6.11 Assignments & Practice Problems

  • Practice problems apply the chapter's concepts: interpreting skewness, interpreting z-scores, and comparing variables across data sets using descriptive statistics.