L8-L9 In-depth Notes on Summary Statistics and Data Visualization

Summary Statistics

  • Summary statistics are numbers that condense a dataset into a few descriptive values.
  • They capture key properties such as frequency, location, and spread.
  • Common metrics include:
    • Location: Mean, Median
    • Spread: Standard Deviation, Variance

Functions for Summary Statistics

  • Two-pass Variance Function:
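    ```python
    def two_pass_variance(data):
        n = len(data)
        # First pass: compute the mean.
        mean = sum(data) / n
        # Second pass: sum the squared deviations from the mean.
        variance = sum((x - mean) ** 2 for x in data) / (n - 1)
        return variance
    ```
  • The first pass calculates the mean; the second pass uses it to compute the variance.
  • Applies to a sample of data points x1, x2, …, xn, which is why the sum is divided by n - 1 rather than n.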

Welford’s One-pass Algorithm

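  • Computes the mean and variance in a single pass by updating running values as each data point arrives, so the data does not have to be read twice.
  • More generally, many summary statistics can be computed in a single pass through the data, which is time-efficient.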
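A minimal sketch of the one-pass update may help here. The function name `welford_variance` and the variable `m2` are illustrative choices, not names from the notes; the sketch assumes the same sample variance (dividing by n - 1) as the two-pass function.

```python
def welford_variance(data):
    """Return (mean, sample variance) using a single pass over data."""
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in data:
        n += 1
        delta = x - mean          # deviation from the old mean
        mean += delta / n         # update the running mean
        m2 += delta * (x - mean)  # uses both the old and the new mean
    if n < 2:
        raise ValueError("sample variance needs at least two data points")
    return mean, m2 / (n - 1)


mean, variance = welford_variance([2, 2, 3, 4, 5])
print(mean, variance)
```

Because the loop only keeps `n`, `mean`, and `m2`, it also works on a generator or a stream that cannot be replayed, which is where the two-pass version does not apply.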

Frequency and Mode

  • Frequency: percentage of a specific attribute value within a dataset.
    • E.g., in a gender dataset, females may account for 50%.
  • Mode: the most frequently occurring value in a dataset.
  • Typically used with categorical data.

Examples of Mode Calculation:

  1. Data Set: 3, 3, 6, 9, 15, 15, 15, 27, 27, 37, 48 => Mode = 15 (occurs most frequently)
  2. Data Set: 4, 4, 4, 9, 15, 15, 15, 27, 37, 48 => Modes: 4 and 15 (both occur equally often)
  3. Data Set: 3, 6, 9, 16, 27, 37, 48 => No mode (all values occur once).
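The frequency and mode definitions above can be sketched with a counter. The helper names `relative_frequencies` and `modes` are hypothetical, and the sketch assumes a non-empty input and follows the convention used in the examples: when every value occurs once, it reports no mode.

```python
from collections import Counter


def relative_frequencies(values):
    """Map each distinct value to its percentage of the dataset."""
    counts = Counter(values)
    total = len(values)
    return {value: 100 * count / total for value, count in counts.items()}


def modes(values):
    """Return all most-frequent values, or [] if every value occurs once."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []  # convention from the examples: no mode
    return sorted(value for value, count in counts.items() if count == top)


print(relative_frequencies(["F", "M", "F", "M"]))
print(modes([3, 3, 6, 9, 15, 15, 15, 27, 27, 37, 48]))
print(modes([4, 4, 4, 9, 15, 15, 15, 27, 37, 48]))
print(modes([3, 6, 9, 16, 27, 37, 48]))
```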

Percentiles

  • A percentile indicates the value below which a certain percentage of data falls.
  • pth percentile = value xp such that p% of the values are below it.
  • Example: 50th percentile (median) divides the dataset into two equal halves.

Example of Percentile:

  • If you are the fourth tallest in a group of 20, then 16 of the 20 people (80%) are shorter than you, so you are at the 80th percentile.
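The height example can be checked with a short sketch. It uses the "percentage of values strictly below" definition from these notes; the function name `percent_below` and the height values are made up for illustration, and library routines may use slightly different percentile conventions.

```python
def percent_below(values, x):
    """Percentage of values that are strictly less than x."""
    return 100 * sum(v < x for v in values) / len(values)


# 20 distinct hypothetical heights in cm.
heights = list(range(150, 190, 2))
fourth_tallest = sorted(heights)[-4]
print(percent_below(heights, fourth_tallest))  # 80.0
```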

Measures of Location: Mean and Median

  • Mean: Sum of all values divided by the number of values.
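  • The mean is sensitive to outliers.
  • Median: The middle value when data is ordered (the average of the two middle values when the count is even). It is much less affected by outliers than the mean.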

Example Calculation:

  • For data: 2, 2, 3, 4, 5:
  • Mean = (2+2+3+4+5)/5 = 3.2
  • Median = 3
  • Mode = 2
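The calculation above can be reproduced with Python's standard `statistics` module. The extra value 100 is a made-up outlier, added only to illustrate how differently the mean and the median react to it.

```python
import statistics

data = [2, 2, 3, 4, 5]
mean = statistics.mean(data)      # 3.2
median = statistics.median(data)  # 3
mode = statistics.mode(data)      # 2
print(mean, median, mode)

# Add one hypothetical outlier and recompute.
with_outlier = data + [100]
outlier_mean = statistics.mean(with_outlier)      # jumps to about 19.33
outlier_median = statistics.median(with_outlier)  # moves only to 3.5
print(outlier_mean, outlier_median)
```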

Measures of Spread: Range and Variance

  • Range: Difference between maximum and minimum values.
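  • Variance/Standard Deviation: The most common measures of spread. The sample variance is the sum of squared deviations from the mean divided by n - 1; the standard deviation is its square root. Both are sensitive to outliers.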
  • Other measures can include Absolute Average Deviation, Median Absolute Deviation, Interquartile Range (IQR).

Interquartile Range (IQR):

  • IQR = Q3 - Q1 (the range containing the middle 50% of the data).
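As a rough sketch, the spread measures listed above can be computed side by side. The function name `spread_measures` and the sample data are illustrative. Quartiles can be defined in several ways; this sketch assumes the "inclusive" method of `statistics.quantiles`, so other tools may report slightly different Q1 and Q3 values.

```python
import statistics


def spread_measures(data):
    """Return range, AAD, MAD, and IQR for a list of numbers."""
    mean = statistics.fmean(data)
    med = statistics.median(data)
    # Assumption: "inclusive" quartiles; other definitions exist.
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {
        "range": max(data) - min(data),
        # Absolute average deviation: mean distance from the mean.
        "aad": sum(abs(x - mean) for x in data) / len(data),
        # Median absolute deviation: median distance from the median.
        "mad": statistics.median([abs(x - med) for x in data]),
        "iqr": q3 - q1,
    }


clean = spread_measures([1, 2, 3, 4, 5, 6, 7, 8, 9])
skewed = spread_measures([1, 2, 3, 4, 5, 6, 7, 8, 100])
print(clean)
print(skewed)
```

Replacing 9 with 100 changes the range from 8 to 99, while the MAD and IQR stay the same, which is why those two are often preferred when outliers are present.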

Visualization

  • Visualization: Converting data into visual formats to analyze characteristics and relationships.
  • Helps in detecting patterns, trends, and outliers effectively.
  • Example: Visualizing Sea Surface Temperature (SST) data.

Visualization Techniques:

  • Histograms: Show distribution of single variable values using bins.
  • Scatter Plots: Use two variables to illustrate their relationship.
  • Box Plots: Based on quartiles, show data distribution and outliers.
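  • Parallel Coordinates: Represent high-dimensional data by giving each attribute its own axis, drawing the axes parallel to one another, and plotting each object as a line that connects its values across the axes.
  • Matrix Plots: Display the data matrix itself as an image, with each entry colored according to its value, which helps reveal structure and relationships in the matrix.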
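To make the idea of histogram bins concrete, here is a minimal sketch that counts values into equal-width bins and prints a text histogram. The function name `histogram_counts`, the sample values, and the choice of four bins are all illustrative; plotting libraries offer their own binning options.

```python
def histogram_counts(values, num_bins):
    """Count how many values fall into each of num_bins equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    if width == 0:
        counts[0] = len(values)  # all values identical: one occupied bin
        return counts
    for v in values:
        # The maximum value sits on the right edge, so clamp it into the last bin.
        index = min(int((v - lo) / width), num_bins - 1)
        counts[index] += 1
    return counts


values = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]
counts = histogram_counts(values, 4)
for i, count in enumerate(counts):
    left_edge = min(values) + i * (max(values) - min(values)) / 4
    print(f"{left_edge:5.1f} | {'#' * count}")
```

Changing the number of bins changes the picture: too few bins hide the shape of the distribution, while too many leave most bins nearly empty.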

Other Visualization Techniques:

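  • Star Plots: Similar to parallel coordinates, but the axes radiate from a central point, so each object appears as a polygon.
  • Chernoff Faces: Each attribute is mapped to a facial characteristic, and its value determines how that feature is drawn, taking advantage of the human ability to recognize faces.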