Class 4

Class Overview

  • Class Title: Analysis of Univariate Data

  • Focus: Measures of Dispersion and Other Measures of Shape

  • Course Context: Introduction to Statistics for Social Sciences

  • Department: Department of Statistics, UC3M

Chapter Structure

  • Chapter 4: Analysis of Univariate Data

  • Key Topics:

    1. Measures of Dispersion

      • Range

      • Interquartile Range (IQR)

    2. Box and Whiskers Plot

    3. Dispersion Measures Associated with the Mean

      • Variance

      • Standard Deviation

    4. Other Shape Features of a Sample

      • Skewness

      • Kurtosis

  • Recommended Reading: The Wikipedia page on statistical dispersion, which links to the relevant quantities.

Measures of Dispersion

Range

  • Definition: Difference between the maximum and minimum values in a dataset.

  • Example Calculation: For data ranging from 0 to 4, the range = 4 - 0 = 4.

  • Consideration: The range depends only on the two most extreme observations, so a single extreme value can inflate it; it is often not the best measure of dispersion.

Interquartile Range (IQR)

  • Definition: Distance between the 1st (Q1) and 3rd (Q3) quartiles.

  • Semi-Interquartile Range: Half of the IQR, used for comparison with the standard deviation.

  • Example Calculation:

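    • Given Data: {5, 3, 11, 21, 7, 5, 2, 1, 3}; sorted: {1, 2, 3, 3, 5, 5, 7, 11, 21}

    • Median = 5; Q1 = 2.5 (median of the lower half {1, 2, 3, 3}), Q3 = 9 (median of the upper half {5, 7, 11, 21})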

    • IQR = Q3 - Q1 = 9 - 2.5 = 6.5
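
The quartile calculation can be sketched in Python. This is a minimal illustration that takes the sample to be the nine observations 5, 3, 11, 21, 7, 5, 2, 1, 3 and assumes the median-of-halves convention (the median itself is excluded from both halves when n is odd). The helper name `quartiles` is hypothetical, and other quartile conventions can give slightly different values.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves convention."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]              # observations below the median position
    upper = data[half + (n % 2):]    # observations above the median position
    return median(lower), median(data), median(upper)

sample = [5, 3, 11, 21, 7, 5, 2, 1, 3]
q1, q2, q3 = quartiles(sample)
iqr = q3 - q1
semi_iqr = iqr / 2
print(q1, q2, q3, iqr, semi_iqr)  # 2.5 5 9.0 6.5 3.25
```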

Identifying Outliers Using IQR

  • Mild Outlier: Below Q1 - 1.5*IQR or above Q3 + 1.5*IQR.

  • Extreme Outlier: Below Q1 - 3*IQR or above Q3 + 3*IQR.

  • Example: For Q3 = 9 and IQR = 6.5:

    • Upper inner fence: 9 + 1.5*6.5 = 18.75

    • Upper outer fence: 9 + 3*6.5 = 28.5

    • A value of 21 lies between the two upper fences (18.75 < 21 < 28.5), so it is a mild outlier; only values above 28.5 would be extreme outliers.
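
The fence rules can be checked with a short sketch. It assumes the nine-value sample 5, 3, 11, 21, 7, 5, 2, 1, 3 with Q1 = 2.5 and Q3 = 9; the function name `classify_outliers` is hypothetical.

```python
def classify_outliers(values, q1, q3):
    """Compute inner/outer fences and split outliers into mild and extreme."""
    iqr = q3 - q1
    inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)   # beyond these: at least mild
    outer = (q1 - 3 * iqr, q3 + 3 * iqr)       # beyond these: extreme
    mild, extreme = [], []
    for x in values:
        if x < outer[0] or x > outer[1]:
            extreme.append(x)
        elif x < inner[0] or x > inner[1]:
            mild.append(x)
    return inner, outer, mild, extreme

sample = [5, 3, 11, 21, 7, 5, 2, 1, 3]
inner, outer, mild, extreme = classify_outliers(sample, q1=2.5, q3=9)
print(inner)          # (-7.25, 18.75)
print(outer)          # (-17.0, 28.5)
print(mild, extreme)  # [21] []
```

Here 21 exceeds the upper inner fence (18.75) but not the upper outer fence (28.5), so it is classified as mild.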

Box and Whiskers Plot

  • Purpose: Visualize the shape of the dataset and identify potential outliers.

  • Components of Plot:

    • Minimum, Maximum, Median, Q1, Q3, Mean, Upper and Lower Fences.

  • Different shapes indicate various data distributions.

Measures of Spread Associated with the Mean

Why Average Distance?

  • Goal: Find a measure of the typical distance from the mean, starting from the average of the differences x_i - mean.

  • Problem: The signed differences from the mean always sum to zero (positive and negative deviations cancel), so their plain average cannot measure dispersion.
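
A quick check of the cancellation, as a sketch using exact fractions so that no rounding error hides the result. The nine-value sample 5, 3, 11, 21, 7, 5, 2, 1, 3 is assumed.

```python
from fractions import Fraction

sample = [5, 3, 11, 21, 7, 5, 2, 1, 3]
m = Fraction(sum(sample), len(sample))      # exact mean, 58/9

# Signed deviations cancel exactly, whatever the data are.
signed_sum = sum(x - m for x in sample)
print(m, signed_sum)  # 58/9 0

# Squaring removes the sign, so the deviations no longer cancel.
squared_sum = sum((x - m) ** 2 for x in sample)
print(round(float(squared_sum), 2))  # 310.22
```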

Variance

  • Definition: The average squared distance from each value to the mean.

  • Calculation Formula:

    • Variance = Sigma((x_i - mean)^2) / n

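  • Example Calculation: For the sample {1, 2, 3, 3, 5, 5, 7, 11, 21} (n = 9, mean = 58/9 = 6.44), the squared deviations sum to 310.22, so the variance is 310.22/9 = 34.47.

  • Interpretation: The variance describes the spread of the data, but it is expressed in squared units of the variable, which makes it hard to read directly.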
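
The formula can be written out directly. This sketch divides by n, as in the formula above, and assumes the nine-value sample 5, 3, 11, 21, 7, 5, 2, 1, 3; the function name `population_variance` is hypothetical.

```python
def population_variance(values):
    """Average squared deviation from the mean (divides by n)."""
    n = len(values)
    m = sum(values) / n
    return sum((x - m) ** 2 for x in values) / n

sample = [5, 3, 11, 21, 7, 5, 2, 1, 3]
var = population_variance(sample)
print(round(var, 2))  # 34.47
```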

Standard Deviation

  • Definition: The square root of the variance; it measures spread in the same units as the variable.

  • Importance: It indicates the typical distance of an observation from the mean.

  • Comparing Sensitivity to Outliers: Assess which is more affected - range, IQR, or standard deviation.
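
One way to explore the sensitivity question is to make one observation much more extreme and recompute all three measures. The sketch below assumes the nine-value sample 5, 3, 11, 21, 7, 5, 2, 1, 3 and a hypothetical variant in which 21 is replaced by 210; the quartiles use the median-of-halves convention, and `spread_summary` is an illustrative helper name.

```python
from statistics import median, pstdev

def spread_summary(values):
    """Range, IQR (median-of-halves quartiles) and standard deviation."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    q1 = median(data[:half])
    q3 = median(data[half + (n % 2):])
    return {"range": data[-1] - data[0], "iqr": q3 - q1, "sd": pstdev(data)}

base = spread_summary([5, 3, 11, 21, 7, 5, 2, 1, 3])
inflated = spread_summary([5, 3, 11, 210, 7, 5, 2, 1, 3])  # hypothetical outlier

for name in ("range", "iqr", "sd"):
    print(name, round(base[name], 2), round(inflated[name], 2))
```

In this sketch the IQR does not move at all, while the range and the standard deviation both grow by roughly a factor of ten.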

Quasi Variance and Standard Deviation

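  • Quasi Variance (s^2): Divides the sum of squared deviations by n - 1 instead of n, which makes it a better estimate of the population variance from a sample. Since s^2 = n/(n - 1) * Variance, it is never smaller than the variance.

  • Quasi Standard Deviation (s): Square root of the quasi variance; a typical distance from the mean, in the same units as the variable, computed from sample data.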
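
Python's standard `statistics` module provides both versions: `pvariance` divides by n and `variance` divides by n - 1. A short comparison, assuming the nine-value sample 5, 3, 11, 21, 7, 5, 2, 1, 3:

```python
from statistics import pvariance, variance, stdev

sample = [5, 3, 11, 21, 7, 5, 2, 1, 3]
n = len(sample)

var_n = pvariance(sample)     # divides by n
quasi_var = variance(sample)  # divides by n - 1
quasi_sd = stdev(sample)      # square root of the quasi variance

print(round(var_n, 2), round(quasi_var, 2), round(quasi_sd, 2))  # 34.47 38.78 6.23

# The two variances differ only by the factor n / (n - 1).
print(abs(quasi_var - var_n * n / (n - 1)) < 1e-9)  # True
```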

Chebyshev's Inequality

  • Theorem: For any sample and any k > 1, at least a proportion 1 - 1/k^2 of the data lies within k standard deviations of the mean:

    • At least 75% within 2 standard deviations

    • At least 88.89% within 3 standard deviations

    • At least 93.75% within 4 standard deviations

  • The bound is conservative: it holds for any distribution, so real datasets usually have a larger share of values within k standard deviations.
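
The bound 1 - 1/k^2 can be compared with the share actually observed in a dataset. This sketch assumes the nine-value sample 5, 3, 11, 21, 7, 5, 2, 1, 3 and uses the standard deviation that divides by n; the helper name `share_within` is hypothetical.

```python
from statistics import mean, pstdev

def share_within(values, k):
    """Proportion of values strictly within k standard deviations of the mean."""
    m = mean(values)
    sd = pstdev(values)
    inside = sum(1 for x in values if abs(x - m) < k * sd)
    return inside / len(values)

sample = [5, 3, 11, 21, 7, 5, 2, 1, 3]
rows = []
for k in (2, 3, 4):
    bound = 1 - 1 / k**2
    rows.append((k, round(bound, 4), round(share_within(sample, k), 4)))
    print(rows[-1])
# (2, 0.75, 0.8889)
# (3, 0.8889, 1.0)
# (4, 0.9375, 1.0)
```

The observed shares exceed the guaranteed minimum in every row, which illustrates how conservative the inequality is.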

Relative Variability: Coefficient of Variation

  • Definition: CV = (Standard Deviation) / (Mean)

  • Comparison Example:

    • Milk Price: €1, SD: €0.15, CV: 0.15

    • Car Price: €13000, SD: €1300, CV: 0.1

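    • Conclusion: Although the car SD is far larger in absolute terms, milk prices vary more relative to their mean (CV 0.15 vs. 0.1), which is read here as a sign of more competition in the milk market.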
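
The comparison can be reproduced in a few lines. The function name `coefficient_of_variation` is hypothetical, and the guard against a zero mean is an added assumption, since the CV is undefined in that case.

```python
def coefficient_of_variation(sd, mean):
    """Standard deviation expressed as a fraction of the mean."""
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return sd / mean

milk_cv = coefficient_of_variation(sd=0.15, mean=1.0)
car_cv = coefficient_of_variation(sd=1300, mean=13000)
print(milk_cv, car_cv)   # 0.15 0.1
print(milk_cv > car_cv)  # True: milk prices are relatively more variable
```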

Other Shape Features of Data

Skewness

  • Measure of asymmetry in the data distribution.

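  • Positive skewness: most of the data sits at lower values, with a long tail toward higher values. Negative skewness: most of the data sits at upper values, with a long tail toward lower values.

  • Comparing the mean, median, and mode indicates the direction of the skew: as a rule of thumb, mean > median > mode suggests positive skewness, and the reverse ordering suggests negative skewness.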
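
The notes do not state which skewness formula is used, so the sketch below shows two common choices as assumptions: the moment coefficient (third central moment divided by the cubed standard deviation) and Pearson's second coefficient, 3 * (mean - median) / SD. The function names are hypothetical, and both use the standard deviation that divides by n.

```python
from statistics import mean, median, pstdev

def moment_skewness(values):
    """Third central moment divided by the cubed standard deviation."""
    m = mean(values)
    n = len(values)
    m2 = sum((x - m) ** 2 for x in values) / n
    m3 = sum((x - m) ** 3 for x in values) / n
    return m3 / m2 ** 1.5

def pearson_median_skewness(values):
    """Pearson's second coefficient: 3 * (mean - median) / SD."""
    return 3 * (mean(values) - median(values)) / pstdev(values)

sample = [5, 3, 11, 21, 7, 5, 2, 1, 3]
print(moment_skewness(sample) > 0)          # True: long right tail
print(pearson_median_skewness(sample) > 0)  # True: mean above median
print(moment_skewness([1, 2, 3, 4, 5]))     # 0.0 for symmetric data
```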

Kurtosis

  • Measure of 'tailedness' of the distribution, delineated as:

    • Leptokurtic: High peak and heavy tails.

    • Mesokurtic: Peak and tails comparable to those of the normal distribution.

    • Platykurtic: Flat-top and light tails.

  • Kurtosis calculated using the fourth moment around the mean.
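
A minimal sketch of a fourth-moment calculation. It assumes the common definition of kurtosis as the fourth central moment divided by the squared variance (moments divided by n), and reports excess kurtosis by subtracting 3, the value for a normal distribution. The function name `excess_kurtosis` is hypothetical.

```python
def excess_kurtosis(values):
    """Fourth central moment over squared variance, minus 3."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((x - m) ** 2 for x in values) / n
    m4 = sum((x - m) ** 4 for x in values) / n
    return m4 / m2 ** 2 - 3

flat = [1, 2, 3, 4, 5]                      # evenly spread, no heavy tails
heavy = [5, 3, 11, 21, 7, 5, 2, 1, 3]       # one value far from the rest

print(round(excess_kurtosis(flat), 2))  # -1.3 (negative: platykurtic)
print(excess_kurtosis(heavy) > 0)       # True (positive: leptokurtic)
```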

Exercises

  1. Government Age Distribution: Create a box plot; calculate mean, median, standard deviation, identify outliers and distribution shape, and calculate skewness.

  2. Wage Bill Comparison: Analyze sections with different employee counts and wage averages to determine total wage bill and variability.

  3. Political Leanings Assessment: Analyze three datasets using histograms for means and variances; identify and mark correct answers regarding variability and distribution characteristics.