Class 4
Class Overview
Class Title: Analysis of Univariate Data
Focus: Measures of Dispersion and Other Measures of Shape
Course Context: Introduction to Statistics for Social Sciences
Department: Department of Statistics, UC3M
Chapter Structure
Chapter 4: Analysis of Univariate Data
Key Topics:
Measures of Dispersion
Range
Interquartile Range (IQR)
Box and Whiskers Plot
Dispersion Measures Associated with the Mean
Variance
Standard Deviation
Other Shape Features of a Sample
Skewness
Kurtosis
Recommended Reading: The Wikipedia page on statistical dispersion, which links to the relevant quantities.
Measures of Dispersion
Range
Definition: Difference between the maximum and minimum values in a dataset.
Example Calculation: For data ranging from 0 to 4, the range = 4 - 0 = 4.
Consideration: The range depends only on the two most extreme observations, so a single unusually large or small value can inflate it; it is often a poor measure of dispersion.
Interquartile Range (IQR)
Definition: Distance between the 1st (Q1) and 3rd (Q3) quartiles.
Semi-Interquartile Range: Half of the IQR, used for comparison with the standard deviation.
Example Calculation:
Given Data: {5, 3, 11, 21, 7, 5, 2, 1, 3}; sorted: {1, 2, 3, 3, 5, 5, 7, 11, 21}
Q1 = 2.5 (midpoint of 2 and 3), Q3 = 9 (midpoint of 7 and 11)
IQR = Q3 - Q1 = 9 - 2.5 = 6.5
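The quartile example can be checked with a short Python sketch. It assumes the sample is the nine observations 5, 3, 11, 21, 7, 5, 2, 1, 3, and it uses the standard library's statistics.quantiles with the "exclusive" method, which reproduces Q1 = 2.5 and Q3 = 9 for this sample. Other quartile conventions can give slightly different values.

```python
import statistics

# Assumed sample: the nine observations from the example.
data = [5, 3, 11, 21, 7, 5, 2, 1, 3]

# Range: distance between the largest and smallest observation.
data_range = max(data) - min(data)

# Quartiles with the "exclusive" convention (one of several in use).
q1, median, q3 = statistics.quantiles(data, n=4, method="exclusive")

iqr = q3 - q1          # interquartile range
semi_iqr = iqr / 2     # semi-interquartile range

print(f"range = {data_range}")
print(f"Q1 = {q1}, median = {median}, Q3 = {q3}")
print(f"IQR = {iqr}, semi-IQR = {semi_iqr}")
```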
Identifying Outliers Using IQR
Mild Outlier: Below Q1 - 1.5*IQR or above Q3 + 1.5*IQR, but still within the outer fences.
Extreme Outlier: Below Q1 - 3*IQR or above Q3 + 3*IQR.
Example: For Q3 = 9 and IQR = 6.5:
Upper inner fence: 9 + 1.5*6.5 = 18.75
Upper outer fence: 9 + 3*6.5 = 24.5
Values above 18.75 and up to 24.5 are mild outliers; values above 24.5 are extreme outliers. The observation 21 is therefore a mild outlier.
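The fence rules can be written as a small function. This is a minimal sketch: the function name classify_outlier is hypothetical, and the test value 30 is made up for illustration; Q1 = 2.5 and Q3 = 9 come from the example.

```python
def classify_outlier(value, q1, q3):
    """Label a value using the 1.5*IQR (inner) and 3*IQR (outer) fences."""
    iqr = q3 - q1
    lower_inner, upper_inner = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    lower_outer, upper_outer = q1 - 3 * iqr, q3 + 3 * iqr
    if value < lower_outer or value > upper_outer:
        return "extreme outlier"
    if value < lower_inner or value > upper_inner:
        return "mild outlier"
    return "not an outlier"


q1, q3 = 2.5, 9  # quartiles from the example
for value in (11, 21, 30):  # 30 is a hypothetical extra value
    print(value, classify_outlier(value, q1, q3))
```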
Box and Whiskers Plot
Purpose: Visualize the shape of the dataset and identify potential outliers.
Components of Plot:
Minimum, Maximum, Median, Q1, Q3, Mean, Upper and Lower Fences.
The position of the median inside the box and the relative lengths of the whiskers indicate whether the distribution is symmetric or skewed.
Measures of Spread Associated with the Mean
Why Average Distance?
A natural first attempt at a typical distance from the mean is to average the differences between each value and the mean.
These differences always sum to zero because positive and negative deviations cancel, so their average cannot measure dispersion.
Variance
Defined as the average squared distance from each value to the mean.
Calculation Formula:
Variance = Sigma((x_i - mean)^2) / n
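Example Calculation: For the data {5, 3, 11, 21, 7, 5, 2, 1, 3} the mean is 58/9 = 6.44 and the squared deviations sum to 310.22, so the variance is 310.22/9 = 34.47.
Interpretation: The variance measures the spread of the data around the mean, but it is expressed in squared units of the variable.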
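The variance calculation can be reproduced step by step. The sketch below assumes the same nine observations as the quartile example; it also shows that the raw deviations sum to zero (up to floating-point rounding), which is why they are squared first.

```python
# Assumed sample: the nine observations used in the quartile example.
data = [5, 3, 11, 21, 7, 5, 2, 1, 3]

n = len(data)
mean = sum(data) / n

# Raw deviations cancel out, so their sum is (numerically) zero.
deviations = [x - mean for x in data]
sum_of_deviations = sum(deviations)

# Squaring removes the signs; the variance is the average squared deviation.
sum_of_squares = sum(d ** 2 for d in deviations)
variance = sum_of_squares / n

print(f"mean = {mean:.2f}")
print(f"sum of deviations = {sum_of_deviations:.10f}")
print(f"sum of squared deviations = {sum_of_squares:.2f}")
print(f"variance = {variance:.2f}")
```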
Standard Deviation
Definition: Square root of the variance; it is expressed in the same units as the variable.
Importance: Indication of typical distance from the mean.
Comparing Sensitivity to Outliers: Assess which is more affected - range, IQR, or standard deviation.
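One way to explore the sensitivity question is to change a single observation and recompute the three measures. The sketch below is an illustration under assumptions: the helper spread_summary is hypothetical, the second sample replaces 21 with a made-up value of 210, and quartiles use the "exclusive" convention of statistics.quantiles.

```python
import statistics


def spread_summary(values):
    """Return range, IQR and standard deviation (dividing by n)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="exclusive")
    return {
        "range": max(values) - min(values),
        "iqr": q3 - q1,
        "sd": statistics.pstdev(values),
    }


original = [5, 3, 11, 21, 7, 5, 2, 1, 3]
# Hypothetical variant: the largest value 21 is replaced by 210.
with_extreme = [5, 3, 11, 210, 7, 5, 2, 1, 3]

before = spread_summary(original)
after = spread_summary(with_extreme)

for name in before:
    print(f"{name}: {before[name]:.2f} -> {after[name]:.2f}")
```

In this illustration the IQR does not move, while the range and the standard deviation both grow sharply.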
Quasi Variance and Standard Deviation
Quasi Variance (s^2): Sum of squared deviations divided by n - 1 instead of n. This correction is used when a sample variance estimates the population variance; it is always at least as large as the variance computed with n.
Quasi Standard Deviation (s): Square root of the quasi variance; like the standard deviation, it is expressed in the same units as the variable.
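The difference between the two divisors can be seen with the standard library, where statistics.pvariance divides by n and statistics.variance divides by n - 1. The sketch assumes the same nine observations as in the earlier examples.

```python
import statistics

# Assumed sample: the nine observations used in the earlier examples.
data = [5, 3, 11, 21, 7, 5, 2, 1, 3]
n = len(data)

variance = statistics.pvariance(data)        # divides by n
quasi_variance = statistics.variance(data)   # divides by n - 1
quasi_sd = statistics.stdev(data)            # square root of the quasi variance

print(f"variance (n)       = {variance:.2f}")
print(f"quasi variance     = {quasi_variance:.2f}")
print(f"quasi std. dev.    = {quasi_sd:.2f}")
print(f"ratio quasi / var  = {quasi_variance / variance:.4f}  (n / (n - 1) = {n / (n - 1):.4f})")
```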
Chebyshev's Inequality
Theorem: For any sample and any k > 1, at least 1 - 1/k^2 of the data lie within k standard deviations of the mean:
At least 75% within 2 standard deviations
At least 88.89% within 3 standard deviations
At least 93.75% within 4 standard deviations
The bound is conservative: in most datasets the actual proportions are considerably higher.
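The bound 1 - 1/k^2 can be compared with what a concrete sample does. This is a sketch under assumptions: it reuses the nine observations from the earlier examples, uses the standard deviation that divides by n, and the function names are hypothetical.

```python
import statistics


def chebyshev_bound(k):
    """Minimum proportion of data within k standard deviations of the mean."""
    return 1 - 1 / k ** 2


def proportion_within(values, k):
    """Observed proportion of values strictly within k standard deviations."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    inside = [x for x in values if abs(x - mean) < k * sd]
    return len(inside) / len(values)


# Assumed sample: the nine observations used in the earlier examples.
data = [5, 3, 11, 21, 7, 5, 2, 1, 3]

for k in (2, 3, 4):
    print(f"k = {k}: bound = {chebyshev_bound(k):.4f}, "
          f"observed = {proportion_within(data, k):.4f}")
```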
Relative Variability: Coefficient of Variation
Definition: CV = (Standard Deviation) / (Mean)
Comparison Example:
Milk Price: €1, SD: €0.15, CV: 0.15
Car Price: €13000, SD: €1300, CV: 0.1
Conclusion: More competition in milk market based on relative variability.
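The comparison reduces to two divisions. A minimal sketch using the figures from the example; the function name coefficient_of_variation is hypothetical, and the guard against a zero mean is an added assumption, since the ratio is undefined there.

```python
def coefficient_of_variation(sd, mean):
    """Standard deviation relative to the size of the mean (unit-free)."""
    if mean == 0:
        raise ValueError("the coefficient of variation is undefined for mean 0")
    return sd / abs(mean)


cv_milk = coefficient_of_variation(sd=0.15, mean=1.0)      # prices in euros
cv_car = coefficient_of_variation(sd=1300.0, mean=13000.0)  # prices in euros

print(f"CV milk = {cv_milk:.2f}")
print(f"CV car  = {cv_car:.2f}")
print("Milk prices vary more, relative to their mean:", cv_milk > cv_car)
```

Although the car price has a much larger standard deviation in euros, its variability relative to the mean is smaller.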
Other Shape Features of Data
Skewness
Measure of asymmetry in the data distribution.
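Positive (right) skewness: a long tail toward upper values, with most data concentrated at lower values. Negative (left) skewness: a long tail toward lower values, with most data concentrated at upper values.
The relative positions of the mean, median, and mode indicate the direction of skewness: typically mean > median > mode under positive skewness, and the reverse under negative skewness.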
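The notes do not state which skewness formula is used. As an assumption, the sketch below uses one common choice, the moment coefficient (third central moment divided by the variance raised to the power 3/2), and compares its sign with the position of the mean relative to the median. The sample is the nine observations from the earlier examples; function names are hypothetical.

```python
import statistics


def central_moment(values, order):
    """Average of (x - mean) ** order over the sample (divides by n)."""
    mean = statistics.fmean(values)
    return sum((x - mean) ** order for x in values) / len(values)


def moment_skewness(values):
    """Moment coefficient of skewness: m3 / m2 ** 1.5 (one common definition)."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


# Assumed sample: the nine observations used in the earlier examples.
data = [5, 3, 11, 21, 7, 5, 2, 1, 3]

print(f"mean = {statistics.fmean(data):.2f}, median = {statistics.median(data)}")
print(f"skewness = {moment_skewness(data):.2f}")                      # positive: right-skewed
print(f"symmetric sample: {moment_skewness([1, 2, 3, 4, 5]):.2f}")    # zero
```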
Kurtosis
Measure of 'tailedness' of the distribution, classified as:
Leptokurtic: High peak and heavy tails.
Mesokurtic: Peak and tails comparable to the normal distribution.
Platykurtic: Flat-top and light tails.
Kurtosis calculated using the fourth moment around the mean.
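A minimal sketch of the fourth-moment calculation, assuming the usual standardized form: the fourth central moment divided by the squared variance. Under that form a normal distribution has kurtosis 3, so the excess kurtosis (kurtosis minus 3) is positive for leptokurtic and negative for platykurtic data. Function names are hypothetical and the samples are for illustration only.

```python
import statistics


def central_moment(values, order):
    """Average of (x - mean) ** order over the sample (divides by n)."""
    mean = statistics.fmean(values)
    return sum((x - mean) ** order for x in values) / len(values)


def kurtosis(values):
    """Fourth central moment divided by the squared variance."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2


def excess_kurtosis(values):
    """Kurtosis relative to the normal distribution's value of 3."""
    return kurtosis(values) - 3


# Assumed sample with one large value (heavy right tail).
data = [5, 3, 11, 21, 7, 5, 2, 1, 3]
# Evenly spread sample with no tails to speak of.
flat = [1, 2, 3, 4, 5]

print(f"data: kurtosis = {kurtosis(data):.2f}, excess = {excess_kurtosis(data):.2f}")
print(f"flat: kurtosis = {kurtosis(flat):.2f}, excess = {excess_kurtosis(flat):.2f}")
```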
Exercises
Government Age Distribution: Create a box plot; calculate mean, median, standard deviation, identify outliers and distribution shape, and calculate skewness.
Wage Bill Comparison: Analyze sections with different employee counts and wage averages to determine total wage bill and variability.
Political Leanings Assessment: Analyze three datasets using histograms for means and variances; identify and mark correct answers regarding variability and distribution characteristics.