Chapter 6: Measures of Dispersion
6.1 Get Ready
In this chapter, the focus is on understanding variation or dispersion in outcomes while describing how variables are distributed.
The concept of skewness was mentioned in the last chapter but will be addressed more broadly here by examining overall spread.
Required R packages to follow along:
descr, DescTools, Hmisc
Data sets needed:
anes20, states20, cces20
Cooperative Congressional Election Study (cces20):
Similar to ANES, it is a large-scale, regularly occurring survey of political attitudes and behaviors, featuring pre- and post-election waves.
Key differences from ANES:
Fewer topics and questions.
Larger sample size: 61,000 respondents in the 2020 survey.
Used in this chapter to demonstrate some measures of dispersion, because the age variable in ANES is truncated.
6.2 Introduction
Measures of central tendency provide expected values, but variables with identical means, medians, and modes can have differing distributions.
Three symmetric distributions with the same central tendencies can vary in dispersion:
First distribution: Most tightly clustered around the mean.
Second distribution: Average spread.
Third distribution: Widely spread out.
How tightly observations concentrate around the central tendency is an important feature of a distribution, and measures of dispersion quantify it.
6.3 Measures of Spread
Focus: Understanding how far observations spread out from the center of the distribution.
Measures of spread report the lower and upper limits of different ranges of outcomes.
6.3.1 Range
Definition: Difference between the lowest and highest values of a variable.
Limited use; for example:
All three graphs may show the same range (4 to 16) despite distinct shapes.
Helpful in spotting coding errors in datasets:
E.g., age should range realistically from 18 to 100; an observed range of 932 indicates data corruption.
For the cces20 survey, the age range is from 18 to 95, showing a plausible spread of 77 years.
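As a rough illustration of using the range to spot coding errors, the sketch below uses a small hypothetical vector of ages (not the cces20 data) in which one value, 932, stands in for a data-entry error. The 18 to 100 window is the plausibility rule from the text; the variable names are illustrative.

```r
# Hypothetical ages (NOT cces20 data); 932 mimics a data-entry error.
age <- c(18, 25, 34, 47, 52, 63, 95, 932)

age_limits <- range(age)        # smallest and largest observed values
age_width  <- diff(age_limits)  # the range expressed as a single number

# Flag values outside the plausible 18-100 window.
implausible <- age[age < 18 | age > 100]

age_limits
age_width
implausible
```

With real data, the same three lines could be run on cces20$age, assuming the data set is loaded.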
6.3.2 Interquartile Range (IQR)
Definition: The range between the 25th percentile (1st quartile) and 75th percentile (3rd quartile).
Importance:
Represents the middle 50% of a distribution.
More informative than the range: it shows where the middle half of the observations lie and how tightly they cluster around the center of the distribution.
Calculating the IQR in R:
IQR(cces20$age) returns the IQR width, while summary(cces20$age) reports its limits (the 1st and 3rd quartiles).
Example from cces20$age variable:
25th percentile (1st Qu.): 33
75th percentile (3rd Qu.): 63
IQR width: 30. The middle 50% of respondents are between 33 and 63 years old, which says more about where the data concentrate than the full range of 18 to 95 does.
Visualization:
Age histogram with vertical lines for 25th and 75th percentiles provides insight into data spread and central tendency.
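The IQR steps and the histogram described above can be sketched in base R. The ages here are hypothetical stand-ins, not the cces20 data, so the quartiles will not match the 33 and 63 reported in the text.

```r
# Hypothetical ages (NOT the cces20 data).
age <- c(18, 22, 29, 33, 41, 48, 55, 63, 70, 82, 95)

quartiles <- quantile(age, probs = c(0.25, 0.75))  # limits of the middle 50%
iqr_width <- IQR(age)                              # distance between those limits

summary(age)  # prints Min., 1st Qu., Median, Mean, 3rd Qu., Max.

# Histogram with dashed vertical lines at the 25th and 75th percentiles.
hist(age, main = "Age with quartile markers", xlab = "Age")
abline(v = quartiles, lty = 2)
```

Note that quantile() and IQR() use R's default interpolation rule (type 7); other software may report slightly different quartiles for small samples.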
6.3.3 Boxplots
Boxplots represent range and IQR visually:
Box: Middle 50% of data.
Horizontal line (median): central tendency.
Ends (whiskers): extremes of the data, with outliers marked individually beyond the whiskers.
Useful for examining skewness visually and for comparing distribution shapes, e.g., cces20$age versus variables in states20.
The same process is applied in the states20 data to variables such as percent foreign-born and abortion restrictions.
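To see which numbers a boxplot is drawn from, the following sketch uses a hypothetical right-skewed vector (not a states20 variable). It relies on base R's boxplot.stats(), which by default treats points more than 1.5 box lengths beyond the box as outliers.

```r
# Hypothetical right-skewed values (NOT states20 data).
x <- c(1, 2, 2, 3, 3, 4, 5, 6, 8, 21)

bp <- boxplot.stats(x)
bp$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
bp$out    # observations drawn as individual outlier points

# The longer upper whisker and the outlier above it signal right skew.
boxplot(x, main = "Hypothetical skewed variable")
```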
6.4 Dispersion Around the Mean
Measures of dispersion do not only encompass the range but also focus on the average deviation from the mean using:
Mean Absolute Deviation (M.A.D.) and Variance.
M.A.D. Definition: The average of the absolute deviations from the mean; taking absolute values treats negative deviations as positive so that they do not cancel out.
Calculating M.A.D. for states20$fb yields an average deviation of approximately 4.27.
Although M.A.D. is straightforward, its limitations mean it is rarely used for further statistical analysis.
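A minimal sketch of the mean absolute deviation, computed step by step on hypothetical values (not states20$fb, so the result differs from the 4.27 in the text):

```r
# Hypothetical percentages (NOT states20$fb).
fb <- c(2, 4, 5, 7, 12)

deviations   <- fb - mean(fb)         # raw deviations; these always sum to zero
mean_abs_dev <- mean(abs(deviations)) # average of the absolute deviations

sum(deviations)
mean_abs_dev
```

One caution: base R's mad() function computes the median absolute deviation (scaled by a constant), which is a different statistic from the mean absolute deviation shown here.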
Variance
Definition: The sum of squared deviations from the mean divided by n - 1; squaring keeps negative and positive deviations from canceling out, and the result is foundational for further statistics.
Calculate it for the states20$fb variable with the sample variance formula:
S^2 = \frac{1}{n-1} \times \text{Sum of Squared Deviations} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}
In R, compute it directly with var(states20$fb).
Variance can be counterintuitive because it is expressed in squared units; for instance, a variance of 29.18 can suggest more dispersion than the data actually show (a variance can even exceed the range of the variable).
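The variance formula can be checked against var() by hand. This sketch uses hypothetical values (not states20$fb), so the answer is not the 29.18 reported in the text.

```r
# Hypothetical percentages (NOT states20$fb).
fb <- c(2, 4, 5, 7, 12)

n          <- length(fb)
sum_sq_dev <- sum((fb - mean(fb))^2)  # sum of squared deviations
s2_manual  <- sum_sq_dev / (n - 1)    # sample variance by the formula
s2_builtin <- var(fb)                 # R's built-in sample variance

c(manual = s2_manual, builtin = s2_builtin)
```

Both values agree because var() also divides by n - 1 rather than n.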
Standard Deviation (S)
Definition: The square root of the variance; because it is expressed in the variable's original units, it is easier to interpret than the variance:
S = \sqrt{S^2}
In R, obtain the standard deviation directly with sd(states20$fb), which yields approximately 5.40.
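As a quick arithmetic check, the two figures reported in the text (a variance of 29.18 and a standard deviation of about 5.40) should be related by a square root. The second half of the sketch confirms the same relationship on a hypothetical vector.

```r
# Figures reported in the text for states20$fb.
s2_reported <- 29.18
s_reported  <- sqrt(s2_reported)  # should be close to 5.40
round(s_reported, 2)

# Same relationship on hypothetical data (NOT states20$fb).
fb <- c(2, 4, 5, 7, 12)
s_from_var <- sqrt(var(fb))
s_builtin  <- sd(fb)
c(from_var = s_from_var, builtin = s_builtin)
```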
6.5 Coefficient of Variation
The usefulness of the standard deviation depends on context: it is expressed in the units of its variable, so it cannot be compared directly across variables measured on different scales. The Coefficient of Variation (CV) expresses the standard deviation relative to the mean:
CV = \frac{S}{\bar{x}}
E.g., percent foreign-born has a moderate CV of 0.767092, meaning its standard deviation is about 77% of its mean.
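To illustrate why the CV helps with variables on different scales, the sketch below records the same hypothetical heights in centimeters and in meters. The cv() helper is an illustrative name defined here, not a function from the chapter's packages.

```r
# Hypothetical heights, recorded on two scales.
height_cm <- c(150, 160, 170, 180, 190)
height_m  <- height_cm / 100

# Illustrative helper: standard deviation relative to the mean.
cv <- function(x) sd(x) / mean(x)

c(sd_cm = sd(height_cm), sd_m = sd(height_m))  # differ by a factor of 100
c(cv_cm = cv(height_cm), cv_m = cv(height_m))  # identical
```

The standard deviation changes with the unit of measurement, while the CV does not, which is what makes it comparable across variables.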
6.6 Dichotomous Variables
Variance and standard deviation can also be calculated for dichotomous (binary) variables, where they summarize how evenly cases are split between the two categories.
For dichotomous data, the variance formula uses the proportion (p) of cases in one category:
S^2 = p(1 - p)
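A short sketch of the dichotomous formula on a hypothetical 0/1 indicator. One detail worth noting: R's var() divides by n - 1, so it matches p(1 - p) only after rescaling by (n - 1)/n; with large samples the difference is negligible.

```r
# Hypothetical 0/1 indicator (e.g., 1 = voted, 0 = did not vote).
vote <- c(1, 1, 1, 0, 0, 1, 0, 1, 1, 0)

n  <- length(vote)
p  <- mean(vote)    # proportion of cases coded 1
s2 <- p * (1 - p)   # variance from the dichotomous formula
s  <- sqrt(s2)      # corresponding standard deviation

# var() uses n - 1 in the denominator; rescale to compare.
s2_from_var <- var(vote) * (n - 1) / n

c(p = p, s2 = s2, s2_from_var = s2_from_var, s = s)
```

Because p(1 - p) peaks at p = 0.5, a dichotomous variable has the most variation when cases are split evenly between the two categories.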
6.7 Exploring Dispersion in Categorical Variables
For nominal variables with multiple categories, dispersion refers to how evenly responses are spread across the categories. The Index of Qualitative Variation (IQV) measures this relative diversity.
The formula:
IQV = \frac{K(1 - \text{sum of squared proportions})}{K - 1} = \frac{K(1 - \sum_{k=1}^{K} p_k^2)}{K - 1}
Here K is the number of categories and p_k is the proportion of cases in category k. IQV is 0 when all cases fall in one category and 1 when cases are spread evenly across categories.
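The IQV formula can be sketched as a small base R function. The name iqv() and the example vectors are hypothetical, introduced only for illustration; the function assumes at least two categories.

```r
# Illustrative helper implementing K * (1 - sum of squared proportions) / (K - 1).
# Note: table() counts only observed categories for character vectors; pass a
# factor with all levels defined if some categories may have zero cases.
iqv <- function(x) {
  p <- prop.table(table(x))  # proportion of cases in each category
  k <- length(p)             # number of categories (must be >= 2)
  as.numeric(k * (1 - sum(p^2)) / (k - 1))
}

# Hypothetical nominal responses.
even     <- c("a", "a", "b", "b", "c", "c")  # spread evenly
lopsided <- c("a", "a", "a", "a", "b", "c")  # concentrated in one category

iqv(even)
iqv(lopsided)
```

The evenly spread variable reaches the maximum of 1, while the concentrated one scores lower, reflecting less diversity across categories.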
6.8 The Standard Deviation and Normal Curve
The relationship of standard deviation with the normal distribution is critical:
Characteristics of the normal curve:
Single-peaked.
Mean=Median=Mode.
Infinite tails, etc.
Knowledge of coverage areas within standard deviations:
68.26% within 1 standard deviation.
95.44% within 2 standard deviations, and so forth.
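The coverage areas can be reproduced with pnorm(), which returns the area under the standard normal curve to the left of a given value. The helper name within_sd() is illustrative.

```r
# Area within k standard deviations of the mean of a normal distribution.
within_sd <- function(k) pnorm(k) - pnorm(-k)

coverage <- sapply(1:3, within_sd)
round(100 * coverage, 2)  # percent within 1, 2, and 3 standard deviations
```

Printed z-tables round each half of the area to four decimal places, so table-based figures such as 68.26% can differ from these values in the last digit.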
6.9 Area Under a Normal Curve Calculations
Areas under the normal curve can be calculated from z-scores using either a z-distribution table or R's pnorm function.
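A minimal sketch of the z-score workflow, assuming a hypothetical normally distributed variable with mean 70 and standard deviation 10 (these numbers are made up for illustration). A z-score is the distance of a value from the mean in standard deviation units.

```r
# Hypothetical values: an observation of 75 from a normal(70, 10) variable.
x     <- 75
mu    <- 70
sigma <- 10

z <- (x - mu) / sigma  # z-score: 0.5 standard deviations above the mean

below   <- pnorm(z)                      # area to the left of z
above   <- pnorm(z, lower.tail = FALSE)  # area to the right of z
between <- pnorm(1.96) - pnorm(-1.96)    # area between z = -1.96 and z = 1.96

# pnorm() can also skip the manual z-score step.
below_direct <- pnorm(x, mean = mu, sd = sigma)

c(z = z, below = below, above = above, between = between)
```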
6.10 Key Functions in R
Functions such as Desc (from the DescTools package) report multiple descriptive statistics at once, making analysis more efficient.
Usage examples give context for interpreting each descriptive statistic.
6.11 Assignments & Practice Problems
The exercises apply the chapter's concepts, including skewness, the interpretation of z-scores, and comparisons of descriptive statistics across different data sets.