L8-L9 In-depth Notes on Summary Statistics and Data Visualization
Summary Statistics
- Summary statistics are numbers that condense the characteristics of a dataset into a few values.
- They summarize key properties like frequency, location, and spread.
- Common metrics include:
- Location: Mean, Median
- Spread: Standard Deviation, Variance
Functions for Summary Statistics
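- Two-pass Variance Function:
```python
def two_pass_variance(data):
    """Sample variance: pass 1 computes the mean, pass 2 the squared deviations."""
    n = len(data)
    if n < 2:
        raise ValueError("sample variance needs at least two data points")
    mean = sum(data) / n  # first pass over the data
    # second pass: sum of squared deviations, divided by n - 1 (sample variance)
    variance = sum((x - mean) ** 2 for x in data) / (n - 1)
    return variance
```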
- Involves calculating the mean and then using it to determine variance.
- Applies to a sample of data points: x1, x2, …, xn.
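Written out, the quantity the two-pass function computes is the sample variance. The symbols \bar{x} (sample mean) and s^2 (sample variance) are notation introduced here, not taken from the notes:

```latex
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i,
\qquad
s^2 = \frac{1}{n-1}\sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2
```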
Welford’s One-pass Algorithm
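- Welford's algorithm computes the mean and variance in a single pass by updating a running mean and a running sum of squared deviations as each value arrives.
- More generally, many summary statistics can be computed in one pass, which avoids reading the data twice.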
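A minimal sketch of Welford's update rule, for comparison with the two-pass function. The name `welford_variance` and the choice to return both mean and variance are assumptions made for illustration:

```python
def welford_variance(data):
    """One-pass sample mean and variance using Welford's update rule."""
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in data:
        n += 1
        delta = x - mean          # deviation from the old mean
        mean += delta / n         # update the running mean
        m2 += delta * (x - mean)  # uses both the old and the new mean
    if n < 2:
        raise ValueError("sample variance needs at least two data points")
    return mean, m2 / (n - 1)


mean, variance = welford_variance([2, 2, 3, 4, 5])
print(mean, variance)
```

Because the loop only needs the current value and three running numbers, it also works on iterators and streams whose length is not known in advance.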
Frequency and Mode
- Frequency: the percentage of records in a dataset that take a specific attribute value.
- E.g., in a dataset with a gender attribute, the value "female" may account for 50% of the records.
- Mode: the most frequently occurring value in a dataset.
- Typically used with categorical data.
Examples of Mode Calculation:
- Data Set: 3, 3, 6, 9, 15, 15, 15, 27, 27, 37, 48 => Mode = 15 (occurs most frequently)
- Data Set: 4, 4, 4, 9, 15, 15, 15, 27, 37, 48 => Modes: 4 and 15 (both occur equally often)
- Data Set: 3, 6, 9, 16, 27, 37, 48 => No mode (all values occur once).
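The frequency and mode examples above can be reproduced with a short sketch. The helper names `relative_frequencies` and `modes` are hypothetical, and returning an empty list when every value occurs once is a convention chosen to match the "no mode" example:

```python
from collections import Counter


def relative_frequencies(values):
    """Map each distinct value to its share of the dataset (0.0 to 1.0)."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}


def modes(values):
    """Return every most-frequent value, or [] if no value repeats."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    if top == 1:
        return []  # all values occur once: treated as "no mode"
    return [value for value, count in counts.items() if count == top]


print(modes([3, 3, 6, 9, 15, 15, 15, 27, 27, 37, 48]))  # [15]
print(modes([4, 4, 4, 9, 15, 15, 15, 27, 37, 48]))      # [4, 15]
print(modes([3, 6, 9, 16, 27, 37, 48]))                 # []
print(relative_frequencies(["F", "M", "F", "M"]))       # {'F': 0.5, 'M': 0.5}
```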
Percentiles
- A percentile indicates the value below which a certain percentage of data falls.
- pth percentile = value xp such that p% of the values are below it.
- Example: 50th percentile (median) divides the dataset into two equal halves.
Example of Percentile:
- If you are the fourth tallest in a group of 20, then 16 people are shorter than you. Since 16/20 = 80%, you are at the 80th percentile.
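The height example can be checked with a small sketch that follows the definition above (the percentage of values strictly below a given value). The function name `percentile_rank` and the list of heights are made up for illustration:

```python
def percentile_rank(values, x):
    """Percentage of entries in `values` that are strictly below x."""
    below = sum(1 for v in values if v < x)
    return 100.0 * below / len(values)


# 20 hypothetical, distinct heights in cm: 150, 152, ..., 188
heights = list(range(150, 190, 2))
fourth_tallest = sorted(heights, reverse=True)[3]

print(percentile_rank(heights, fourth_tallest))  # 80.0
```

Other definitions of percentile (for example, counting values at or below x, or interpolating between ranks) give slightly different numbers on small datasets.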
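Measures of Location: Mean and Median
- Mean: Sum of all values divided by the number of values.
- Sensitive to outliers.
- Median: The middle value when data is ordered (the average of the two middle values when the count is even).
- Robust to outliers.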
Example Calculation:
- For data: 2, 2, 3, 4, 5:
- Mean = (2+2+3+4+5)/5 = 3.2
- Median = 3
- Mode = 2
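The calculation above can be verified with Python's standard `statistics` module. The second dataset, with 5 replaced by 500, is an invented example to show how an outlier moves the mean but not the median:

```python
import statistics

data = [2, 2, 3, 4, 5]
print(statistics.mean(data))    # 3.2
print(statistics.median(data))  # 3
print(statistics.mode(data))    # 2

# Hypothetical outlier: replace 5 with 500
with_outlier = [2, 2, 3, 4, 500]
print(statistics.mean(with_outlier))    # 102.2
print(statistics.median(with_outlier))  # 3
```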
Measures of Spread: Range and Variance
- Range: Difference between maximum and minimum values.
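- Variance/Standard Deviation: The most common measures of spread. Variance averages the squared deviations from the mean; standard deviation is its square root, expressed in the same units as the data. Both are sensitive to outliers.
- Other measures include Absolute Average Deviation (AAD), Median Absolute Deviation (MAD), and Interquartile Range (IQR).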
Interquartile Range (IQR):
- IQR = Q3 - Q1, where Q1 and Q3 are the 25th and 75th percentiles (the range containing the middle 50% of the data).
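A sketch that computes the spread measures listed above for one dataset. Two assumptions are made here: MAD is taken as the median of absolute deviations from the median (some texts center on the mean instead), and quartiles use the `"inclusive"` method of `statistics.quantiles`, which is only one of several quartile conventions. The function name `spread_measures` is hypothetical:

```python
import statistics


def spread_measures(data):
    """Return several measures of spread for a list of numbers."""
    xs = sorted(data)
    mean = statistics.mean(xs)
    median = statistics.median(xs)
    # Quartile convention is an assumption; other methods shift Q1 and Q3
    q1, _, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    return {
        "range": xs[-1] - xs[0],
        "variance": statistics.variance(xs),  # sample variance (n - 1)
        "std_dev": statistics.stdev(xs),
        "aad": sum(abs(x - mean) for x in xs) / len(xs),
        "mad": statistics.median([abs(x - median) for x in xs]),
        "iqr": q3 - q1,
    }


result = spread_measures([2, 2, 3, 4, 5])
for name, value in result.items():
    print(f"{name}: {value}")
```

For this dataset the range is 3, the sample variance is 1.7, and the IQR under the inclusive convention is 2.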
Visualization
- Visualization: Converting data into visual formats to analyze characteristics and relationships.
- Helps in detecting patterns, trends, and outliers effectively.
- Example: Visualizing Sea Surface Temperature (SST) data.
Visualization Techniques:
- Histograms: Show distribution of single variable values using bins.
- Scatter Plots: Use two variables to illustrate their relationship.
- Box Plots: Based on quartiles, show data distribution and outliers.
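- Parallel Coordinates: Represent high-dimensional data by giving each attribute its own axis, with all axes drawn parallel to one another; each object becomes a line that crosses every axis at its attribute value.
- Matrix Plots: Display the data matrix itself as an image, with each cell colored by its value, to reveal structure in the matrix.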
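Histograms and box plots are both drawn from a handful of computed numbers. The sketch below computes those numbers without plotting anything. The helper names are hypothetical, and the 1.5 × IQR rule for flagging outliers is a common convention assumed here rather than something stated in these notes:

```python
import statistics


def histogram_counts(data, num_bins):
    """Count values in equal-width bins spanning [min(data), max(data)]."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    if width == 0:
        counts[0] = len(data)  # all values identical: one occupied bin
        return counts
    for x in data:
        # the maximum value is placed in the last bin
        index = min(int((x - lo) / width), num_bins - 1)
        counts[index] += 1
    return counts


def box_plot_stats(data):
    """Quartiles plus outliers under the (assumed) 1.5 * IQR rule."""
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [x for x in data if x < low_fence or x > high_fence]
    return {"q1": q1, "median": median, "q3": q3, "outliers": outliers}


values = [3, 3, 6, 9, 15, 15, 15, 27, 27, 37, 48]
print(histogram_counts(values, 3))     # [7, 2, 2]
print(box_plot_stats(values + [120]))  # 120 is flagged as an outlier
```

A plotting library would take these counts and quartiles and draw the bars, box, whiskers, and outlier markers.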
Other Visualization Techniques:
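- Star Plots: Similar to parallel coordinates, but the axes radiate from a central point, so each object is drawn as a polygon.
- Chernoff Faces: Each attribute is mapped to a facial feature, so each object is drawn as a face; this takes advantage of the human ability to recognize and compare faces.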