Standard deviation and variance

Week 2: Data Distributions

  • Topics Covered:

    • What to do with Quantitative Data

    • Variance and Standard Deviation

    • Z-Scores

What Do We Do With Quantitative Data?

  • Collect and Pre-Process Data: Gather relevant numerical data and prepare it for analysis to ensure accuracy and relevance.

  • Descriptive Statistics: Summarize and describe features of the data, providing insights into its characteristics.

  • Test for Differences: Analyze data to determine if there are significant differences between various groups or conditions within the dataset.

  • Look for Relationships: Explore correlations and dependencies between different variables to understand their interactions.

Understanding Distributions

Basic Histogram Features

  • X Axis (Horizontal): Represents the range of values or bins.

  • Y Axis (Vertical): Represents the frequency or count of observations in each bin.

  • Bins: Ranges of values that group the data for analysis.

  • Purpose: Shows where the data is concentrated across the range of values, making the shape of the distribution visible.
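To make the role of bins concrete, here is a minimal sketch that counts observations per bin and prints a text histogram. The values are hypothetical percentages invented for illustration, and the helper name `bin_counts` is an assumption rather than something from the course material. Bins include their upper edge, matching ranges such as "5.1 to 10.0%".

```python
import math

# Hypothetical percentages, made up for illustration only.
values = [2.3, 4.8, 6.1, 7.5, 9.9, 12.4, 3.3, 8.0, 14.2, 5.5]
bin_width = 5.0


def bin_counts(data, width):
    """Count observations per bin; bin k covers values above k*width up to (k+1)*width.

    Assumes every value is greater than zero.
    """
    counts = {}
    for x in data:
        k = math.ceil(x / width) - 1  # 5.0 lands in bin 0, 5.5 lands in bin 1
        counts[k] = counts.get(k, 0) + 1
    return counts


counts = bin_counts(values, bin_width)

# X axis: the bin range. Y axis: the count, drawn here as a row of '#'.
for k in sorted(counts):
    low, high = k * bin_width, (k + 1) * bin_width
    print(f"above {low:4.1f} up to {high:4.1f}: {'#' * counts[k]} ({counts[k]})")
```

The tallest row marks the bin where the data is most concentrated.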

Example Histogram

  • 20 Class Count Distribution:

    • Example bin heights representing the number of states with specific percentages of foreign-born residents:

      • 0.1 to 5.0%: Height 13 states

      • 5.1 to 10.0%: Height 20 states

      • 10.1 to 15.0%: Height 10 states

      • ...

Analyzing Peaks in Histograms

  • Two Peaks in Distribution:

    • A distribution with two peaks (bimodal) often indicates two distinct groups; in education data, such peaks might show the presence of 'ACT states' vs 'SAT states'.

Identifying Outliers

  • Outliers: Observations that fall significantly outside the overall distribution pattern; may require further investigation.

Distribution Characteristics

Symmetric vs. Skewed Distributions

  • Symmetric Distribution: Both sides of the histogram are mirror images.

  • Right-Skewed Distribution: Tail on the right extends further; common in income data. The few very large values pull the mean above the median.

  • Left-Skewed Distribution: Tail on the left extends further. The few very small values pull the mean below the median.
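The effect of a long right tail on the mean can be seen in a small sketch. The incomes below are hypothetical (in thousands) and chosen only so that one large value creates the tail.

```python
import statistics

# Hypothetical incomes in thousands; the single large value forms the right tail.
incomes = [28, 31, 33, 35, 38, 40, 42, 45, 60, 250]

mean_income = statistics.mean(incomes)      # sensitive to the extreme value
median_income = statistics.median(incomes)  # middle of the sorted values

print(f"mean:   {mean_income}")
print(f"median: {median_income}")
print("mean above median (right skew):", mean_income > median_income)
```

Removing the value 250 would bring the mean much closer to the median, while the median itself would barely move.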

Visualizations of Skewness

  • Histogram Examples:

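    • Age at Death of Australian Males: Left-skewed; most deaths occur at older ages, with a long left tail of deaths at younger ages.

    • Income Data: Right-skewed; a small number of very high incomes stretch the right tail, which is why income is usually reported as a median rather than a mean.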

Uniform Distribution

  • Uniform Distribution: Every value in the range is equally likely, so all bins of the histogram have roughly the same height.

Normal Distribution

  • Characteristics:

    • Symmetrical and bell-shaped (Gaussian curve).

    • Mean = Median = Mode.

    • Many statistical methods assume data are normally distributed.

Summary Statistics

Mean

  • Mean Equation: mean = (x1 + x2 + ... + xn) / n, the sum of all observations divided by the number of observations n. It gives the average of a data set.

Variance

  • Definition: Measures how widely the data are spread around the mean. It is calculated by taking the distance of each observation from the mean, squaring it, summing these squared distances, and dividing by the number of observations n. This gives the population variance; when the data are a sample from a larger population, the sum is divided by n - 1 instead.
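The calculation can be sketched step by step. The data values are hypothetical. Dividing by n gives the population variance; the sample variance, which divides by n - 1, is shown alongside for comparison and checked against Python's standard `statistics` module.

```python
import statistics

data = [1, 3, 5, 7, 9]  # hypothetical observations
n = len(data)

mean = sum(data) / n                                  # step 1: the mean
squared_distances = [(x - mean) ** 2 for x in data]   # step 2: square each distance
total = sum(squared_distances)                        # step 3: sum them

population_variance = total / n        # divide by the number of observations
sample_variance = total / (n - 1)      # divide by n - 1 for a sample

print(population_variance, statistics.pvariance(data))
print(sample_variance, statistics.variance(data))
```

The two versions differ noticeably for small data sets and converge as n grows.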

Standard Deviation

  • Definition: The square root of the variance. It indicates how far data points typically lie from the mean, expressed in the same units as the data.

  • Empirical Rule for Normal Distribution:

    • About 68% of observations are within +/- 1 Standard Deviation (SD) of the mean.

    • About 95% are within +/- 2 SD.

    • About 99.7% are within +/- 3 SD.
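The empirical rule can be checked on simulated data. This is a minimal sketch: the mean of 100, SD of 15, sample size, and seed are arbitrary assumptions, and `share_within` is a hypothetical helper name. Because the data are random draws, the printed shares will be close to 68%, 95%, and 99.7% rather than exact.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Simulated draws from a normal distribution (parameters are arbitrary).
sample = [random.gauss(100, 15) for _ in range(10_000)]

mean = statistics.mean(sample)
sd = statistics.pstdev(sample)  # square root of the population variance


def share_within(data, center, spread, k):
    """Fraction of observations within k standard deviations of the center."""
    return sum(abs(x - center) <= k * spread for x in data) / len(data)


for k in (1, 2, 3):
    print(f"within +/- {k} SD: {share_within(sample, mean, sd, k):.1%}")
```

Running the same check on strongly skewed data would give noticeably different shares, which is why the rule is stated for normal distributions.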

Z-Scores

  • Definition: Each data point has an associated z-score, z = (x - mean) / SD, which indicates how many standard deviations it lies from the mean.

  • Characteristics:

    • Mean of z-scores is zero; standard deviation is one.

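    • Z-scores can be calculated for any distribution, but interpreting them with the empirical rule (for example, treating |z| > 2 as unusual) is only appropriate when the data are approximately normally distributed.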

    • A z-score can clarify whether an observation is typical or atypical within the dataset.

    • Negative z-scores indicate values below the mean; positive z-scores indicate values above the mean.
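A short sketch of the z-score calculation follows. The scores are hypothetical, and the population standard deviation is used so that the z-scores have a standard deviation of exactly one.

```python
import statistics

scores = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations

mean = statistics.mean(scores)
sd = statistics.pstdev(scores)  # population standard deviation

# z = (x - mean) / SD for every observation
z_scores = [(x - mean) / sd for x in scores]

for x, z in zip(scores, z_scores):
    side = "below" if z < 0 else "above" if z > 0 else "at"
    print(f"value {x}: z = {z:+.2f} ({side} the mean)")

# The z-scores themselves have mean zero and standard deviation one.
print(statistics.mean(z_scores), statistics.pstdev(z_scores))
```

Values with z-scores near zero are typical for this data set, while the value 9 stands out as the furthest from the mean.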