Notes on Numerical Data Visualization and Summary Statistics

Examining Numerical Data

  • Focus: Visualizing and describing one numerical variable using plots and numerical summaries.

  • Key ideas:

    • Center (typical value), shape (distribution form), and spread (variability) are the main characteristics to describe a distribution.

    • Visual tools discussed: dot plots, stacked dot plots, histograms, bar charts (categorical representations of the same data), and box plots.

  • Question prompt often used: "How would you describe the distribution of GPA(s) in this data set?" in terms of center, shape, and spread.


Dot Plots and Means

  • The mean (also called the average) is a measure of center.

  • In a dot plot, the mean is sometimes indicated by a triangle as a visual cue in addition to the data points.

  • Example given: The mean GPA is 3.59.

  • Important concept: The sample mean, denoted \bar{x}, is a point estimate of the population mean, denoted \mu (which is often unknown).

  • Formula (sample mean):
    \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i

  • Population mean: \mu = \frac{1}{N} \sum_{i=1}^{N} x_i

  • Practical note: Population data are rarely available, so we rely on the sample mean as an estimate. A good representative sample provides a reasonably accurate estimate.
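
  • Sketch (not from the transcript): the sample-mean formula above can be checked in a few lines of Python. The GPA values below are hypothetical, invented only for illustration, and `sample_mean` is an illustrative helper name.

```python
# Hypothetical GPA values, invented for illustration (not the class data).
gpas = [3.2, 3.5, 3.6, 3.8, 3.9]


def sample_mean(values):
    """Return x-bar = (1/n) * (sum of the observations)."""
    n = len(values)
    if n == 0:
        raise ValueError("the sample mean is undefined for an empty sample")
    return sum(values) / n


x_bar = sample_mean(gpas)
print(f"sample mean = {x_bar:.2f}")
```

  • The same function applied to a full population would give \mu; the arithmetic is identical, only the interpretation differs.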


Stacked Dot Plot and Center Estimation

  • Stacked dot plots emphasize density by stacking dots in columns; taller stacks indicate more observations at that value.

  • This representation helps in judging the center and the shape more easily than a simple dot plot.

  • Example visuals show how stacking aids interpretation of distribution features.


Numerical Example: Mean with and without Outliers

  • Data snippet (countries visited): values include 0, 1, 2, 30, 31, etc.

  • Mean with all observations (illustrative):
    \text{Mean} = \frac{1+2+30+31}{22}
    (Note: This follows the transcript’s formatting where the sum is divided by the total count 22; use the actual data to compute the exact value.)

  • Mean without outliers (illustrative):
    \text{Mean without outliers} = \frac{1+2}{20} = 0.15
    (Interpretation: removing extreme observations changes the center; real data will determine the exact value.)

  • Conceptual takeaway: Outliers can strongly influence the mean; reporting a mean with and without outliers provides insight into the data’s center and sensitivity to extreme values.
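
  • Sketch (not from the transcript): a minimal Python comparison of the mean with and without extreme values. The 22 values are hypothetical stand-ins for the countries-visited data (the transcript does not list the full data set), and the cutoff of 30 used to label outliers is an assumption made only for this example.

```python
# Hypothetical data: 22 students, two of whom report 30 and 31 countries.
# These values are invented for illustration, not the class data.
visited = [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3,
           4, 4, 5, 5, 6, 6, 7, 8, 30, 31]

OUTLIER_CUTOFF = 30  # assumed threshold, chosen by eye for this example


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


without_outliers = [v for v in visited if v < OUTLIER_CUTOFF]

mean_all = mean(visited)                # 124 / 22
mean_trimmed = mean(without_outliers)   # 63 / 20

print(f"n = {len(visited)}, mean with outliers    = {mean_all:.2f}")
print(f"n = {len(without_outliers)}, mean without outliers = {mean_trimmed:.2f}")
```

  • In this made-up sample, two observations out of 22 move the mean from about 3.2 to about 5.6, which is the sensitivity the takeaway describes.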


Histograms

  • Histograms provide a view of data density: higher bars indicate more observations in that range.

  • Histograms are particularly convenient for describing the shape of the distribution.

  • A key caveat: The chosen bin width can alter the apparent story told by the histogram.


Histogram Example: Non-US Countries Visited by Students

  • Data presented as a histogram with axes like:

    • x-axis: Number of Non-US Countries Visited

    • y-axis: Number of Students

  • Observations span across values from 0 up to around 32.

  • Interpretation should consider bin width and overall pattern (peaks, spread, gaps).


Bin Width and Data Storytelling

  • Question: Which histograms are useful? Which reveal too much about the data? Which hide too much?

  • Takeaway: Bin width affects the balance between detail and clarity. Too-narrow bins may show noise; too-wide bins may hide important features.
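
  • Sketch (not from the transcript): the effect of bin width can be seen by counting the same data into bins of different widths. The data are hypothetical, `bin_counts` is an illustrative helper, and the left-closed bin convention [a, b) is an assumption (plotting libraries differ on how they treat bin edges).

```python
# Hypothetical countries-visited data, invented for illustration.
visited = [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3,
           4, 4, 5, 5, 6, 6, 7, 8, 30, 31]


def bin_counts(values, width, start=0):
    """Count observations in left-closed bins [start + k*width, start + (k+1)*width)."""
    if width <= 0:
        raise ValueError("bin width must be positive")
    if min(values) < start:
        raise ValueError("all values must be >= start")
    n_bins = int((max(values) - start) // width) + 1
    counts = [0] * n_bins
    for v in values:
        counts[int((v - start) // width)] += 1
    return counts


# Narrow, moderate, and very wide bins for the same data.
for width in (1, 5, 40):
    counts = bin_counts(visited, width)
    print(f"bin width {width}: {len(counts)} bins")
    for k, c in enumerate(counts):
        print(f"  [{k * width:>2}, {(k + 1) * width:>2}) {'#' * c}")
```

  • With width 1 the text histogram has 32 bins, many of them empty; with width 40 everything collapses into a single bin and the gap before the two large values disappears.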


Time Data: Original Time to Decimal Hours

  • Task: Convert time expressed in hours and minutes to decimal hours.

  • Examples (as given in the transcript):

    • 6 hours 0 minutes → 6.00

    • 13 hr 21 min → 13.35

    • 5 hours → 5.00

    • 4 hr 30 mins → 4.50

    • 4 hr 27 mins → 4.45

    • 2 hours 21 minutes → 2.35

    • 1 hour 31 minutes → 1.52

    • 5 hr 14 min → 5.23

    • 1 hr 30 min → 1.50

    • 8 hours → 8.00

    • 4 hours → 4.00

  • Rule: Decimal hours = hours + (minutes/60). Examples above illustrate common conversions.

  • Rationale: Converting to decimal hours standardizes time data for numerical analysis and visualization.
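
  • Sketch (not from the transcript): the conversion rule can be automated for free-text responses like those listed above. The function name `to_decimal_hours` and the regular expressions are assumptions; they cover only the spellings that appear in these examples (hour/hours/hr/hrs and minute/minutes/min/mins).

```python
import re


def to_decimal_hours(text):
    """Convert a string such as '13 hr 21 min' or '5 hours' to decimal hours."""
    hours = re.search(r"(\d+)\s*(?:hours?|hrs?)\b", text)
    minutes = re.search(r"(\d+)\s*(?:minutes?|mins?)\b", text)
    if hours is None and minutes is None:
        raise ValueError(f"no hours or minutes found in {text!r}")
    h = int(hours.group(1)) if hours else 0
    m = int(minutes.group(1)) if minutes else 0
    # Rule from the notes: decimal hours = hours + minutes / 60.
    return h + m / 60


# The responses listed in the notes.
examples = [
    "6 hours 0 minutes", "13 hr 21 min", "5 hours", "4 hr 30 mins",
    "4 hr 27 mins", "2 hours 21 minutes", "1 hour 31 minutes",
    "5 hr 14 min", "1 hr 30 min", "8 hours", "4 hours",
]

for text in examples:
    print(f"{text} -> {to_decimal_hours(text):.2f}")
```

  • Rounding to two decimals happens only when printing; keeping the unrounded value for analysis avoids accumulating rounding error.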


Daily Screen Time: Distribution Visualization

  • Two representations shown: a bar chart (frequency) and a histogram (numerical distribution).

  • Variables: Daily screen time (in hours).

  • In the bar chart, frequencies are plotted for discrete categories; in the histogram, the continuous distribution is depicted with bins.


How Much Sleep Do Students Get? (Survey)

  • Question: How many hours of sleep on average do you get?

  • Sample size: 22 responses

  • Reported distribution (percentages shown in the transcript):

    • 72.7% in one sleep category

    • 9.1% in another category

    • 18.2% in another category

  • Categories shown: 0-4 hours, 4-6 hours, 6-8 hours, 8-10 hours, over 10 hours

  • Interpretation: The majority sleep in a typical range (e.g., 6-8 hours) with some students getting less and a few more; the exact category totals depend on the coded data.
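
  • Sketch (not from the transcript): the reported percentages can be turned back into respondent counts using the sample size of 22. The transcript does not say which category each percentage belongs to, so the code leaves the categories unlabeled.

```python
n = 22                           # sample size from the notes
percentages = [72.7, 9.1, 18.2]  # reported shares, category labels unknown

# Each share times n should land very close to a whole number of students.
counts = [round(p / 100 * n) for p in percentages]

for p, c in zip(percentages, counts):
    print(f"{p}% of {n} responses is about {c} students")

print("total students accounted for:", sum(counts))
```

  • The counts come out to 16, 2, and 4, which sum to 22, so the three reported categories account for every response and the remaining two categories are empty.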


Representations of Sleep Data: Bar Chart vs Histogram

  • A bar chart (categorical) and a histogram (numerical) look different but describe the same underlying data.

  • Bar chart summarizes by category counts; histogram shows the distribution across continuous hours.

  • Both representations can be used to assess central tendency, spread, and overall shape.


Modality and Shape of a Distribution

  • Modality: The number of peaks in a distribution.

    • Unimodal: one clear peak

    • Bimodal: two distinct peaks

    • Multimodal: more than two peaks

    • Uniform: flat, no distinct peak

  • Visual aid (conceptual metaphor): imagine draping a limp piece of spaghetti over a set of wooden histogram bars; the resulting curve approximates the distribution’s smooth shape.


Skewness: Right, Left, or Symmetric

  • Right-skewed (positively skewed): tail extends to the right; mean typically exceeds the median.

  • Left-skewed (negatively skewed): tail extends to the left; mean typically less than the median.

  • Symmetric: left and right sides mirror each other; mean and median are typically close.


Shape and Outliers: Visual Clues

  • Unusual observations or potential outliers may appear as isolated bars or distant points from the main cluster.

  • Outliers can signal data collection/entry errors or genuine rare observations.


Modality and Shape Practice Visuals

  • The transcript references several slides illustrating unimodal, bimodal, multimodal, and uniform shapes.

  • Pay attention to how the number of peaks (modality) and the symmetry around a central value (skewness) interact to describe a distribution.


Center and Spread: Summary Statistics Essentials

  • Median vs Mean:

    • Median: the middle value when data are ordered; 50th percentile.

    • If even number of observations, median is the average of the two middle values.

    • For skewed distributions or those with outliers, the median is often a better measure of center.

  • Percentiles:

    • Q1 = 25th percentile; Q3 = 75th percentile; IQR = Q3 - Q1.

    • The IQR measures the spread of the middle 50% of the data.

  • Box Plot representation:

    • The box spans Q1 to Q3 with a line at the median inside the box.

    • Whiskers extend to the most extreme data points within 1.5 × IQR from the quartiles.

    • Outliers lie beyond the whiskers.


Variance and Standard Deviation

  • Variance measures the average squared deviation from the mean.

    • Population variance: \sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2

    • Sample variance: s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2

  • Standard deviation is the square root of the variance and has the same units as the data:

    • Population standard deviation: \sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2}

    • Sample standard deviation: s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}

  • Why squared deviations? To remove negative signs and to weight larger deviations more heavily.
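
  • Sketch (not from the transcript): the population and sample formulas differ only in the denominator, which a short Python function makes explicit. The data values are hypothetical and `variance_of` is an illustrative helper name.

```python
import math

# Hypothetical data, invented for illustration.
data = [2, 4, 4, 4, 5, 5, 7, 9]


def variance_of(values, sample=True):
    """Sum of squared deviations from the mean, divided by n - 1 (sample) or n (population)."""
    n = len(values)
    if n < (2 if sample else 1):
        raise ValueError("not enough observations")
    mean = sum(values) / n
    squared_deviations = sum((x - mean) ** 2 for x in values)
    return squared_deviations / (n - 1 if sample else n)


pop_var = variance_of(data, sample=False)   # 32 / 8
samp_var = variance_of(data, sample=True)   # 32 / 7
pop_sd = math.sqrt(pop_var)
samp_sd = math.sqrt(samp_var)

print(f"population: variance = {pop_var:.3f}, SD = {pop_sd:.3f}")
print(f"sample:     variance = {samp_var:.3f}, SD = {samp_sd:.3f}")
```

  • Python's standard library offers the same calculations as `statistics.pvariance`, `statistics.variance`, `statistics.pstdev`, and `statistics.stdev`.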


Population vs Sample (Important Concept)

  • Population: all data you care about.

  • Sample: the portion of that population you actually observe.

  • This distinction matters for calculating variance and standard deviation because formulas differ between populations and samples (denominators N vs n-1).


The Median, Quartiles, and Interquartile Range (IQR)

  • Median: the middle value of ordered data; also the 50th percentile.

  • Quartiles:

    • Q1: 25th percentile

    • Q3: 75th percentile

  • IQR: the spread of the middle 50% of the data: \text{IQR} = Q_3 - Q_1
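
  • Sketch (not from the transcript): median, quartiles, and IQR computed with Python's `statistics` module. The sleep-hours values are hypothetical. Software packages use slightly different quartile conventions; the `method="inclusive"` setting used here is one choice among several, so other tools may report slightly different Q1 and Q3 for the same data.

```python
import statistics

# Hypothetical nightly sleep hours, invented for illustration.
sleep = [5, 6, 6, 7, 7, 7, 8, 8, 9]

median = statistics.median(sleep)

# quantiles(..., n=4) returns the three cut points Q1, Q2 (the median), Q3.
q1, q2, q3 = statistics.quantiles(sleep, n=4, method="inclusive")
iqr = q3 - q1

print(f"median = {median}, Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")

# With an even number of observations the median is the average
# of the two middle values.
even_median = statistics.median([5, 6, 7, 8])
print(f"median of [5, 6, 7, 8] = {even_median}")
```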


Box Plot Anatomy and Whiskers

  • The box represents the middle 50% of the data (Q1 to Q3).

  • The thick line inside the box marks the median.

  • Whiskers extend to the furthest observations within 1.5 × IQR from the quartiles.

  • Formula for whiskers:

    • max upper whisker reach = Q_3 + 1.5 \times \text{IQR}

    • max lower whisker reach = Q_1 - 1.5 \times \text{IQR}

  • Observations beyond these limits are considered potential outliers.
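
  • Sketch (not from the transcript): applying the 1.5 × IQR rule in Python. The data are hypothetical, and the quartiles use the `method="inclusive"` convention of `statistics.quantiles`, which is an assumption; other quartile conventions shift the fences slightly.

```python
import statistics

# Hypothetical countries-visited data, invented for illustration.
visited = [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3,
           4, 4, 5, 5, 6, 6, 7, 8, 30, 31]

q1, _, q3 = statistics.quantiles(visited, n=4, method="inclusive")
iqr = q3 - q1

# Maximum reach of each whisker.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

inside = [v for v in visited if lower_fence <= v <= upper_fence]
outliers = [v for v in visited if v < lower_fence or v > upper_fence]

# The whiskers stop at the most extreme observations inside the fences,
# not at the fences themselves.
lower_whisker, upper_whisker = min(inside), max(inside)

print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
print(f"fences: [{lower_fence}, {upper_fence}]")
print(f"whiskers end at {lower_whisker} and {upper_whisker}")
print(f"potential outliers: {outliers}")
```

  • Note the distinction the code makes explicit: the fences are limits, while the drawn whiskers end at actual data points within those limits.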


Outliers: Why They Matter

  • Outliers can indicate:

    • Extreme skew or unusual observations

    • Data collection or entry errors

    • Interesting features worth investigation

  • Practical reasons to examine: identify data quality issues and understand data behavior beyond the main cluster.


Robust Statistics

  • Robust statistics emphasize statistics that are not unduly affected by outliers or skewness:

    • Median and IQR are more robust than mean and SD.

    • For skewed distributions, prefer median and IQR to describe center and spread.

    • For symmetric distributions, mean and SD can be appropriate.

  • Practical intuition: If you want to estimate a typical household income, median is often more informative than the mean due to right-skewness in income data.
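
  • Sketch (not from the transcript): robustness can be demonstrated by changing a single extreme value and recomputing the summaries. The income figures (in thousands) are hypothetical, `summarize` is an illustrative helper, and the quartile convention (`method="inclusive"`) is an assumption.

```python
import statistics


def summarize(values):
    """Return mean, SD, median, and IQR for a list of numbers."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),
        "median": statistics.median(values),
        "iqr": q3 - q1,
    }


# Hypothetical household incomes in thousands, invented for illustration.
incomes = [30, 35, 40, 45, 50, 55, 60, 65, 70]

# Same data, but the largest income is replaced by an extreme value.
incomes_extreme = incomes[:-1] + [700]

before = summarize(incomes)
after = summarize(incomes_extreme)

for name in ("mean", "sd", "median", "iqr"):
    print(f"{name:>6}: {before[name]:8.2f} -> {after[name]:8.2f}")
```

  • In this made-up example the mean and SD jump when one income becomes extreme, while the median and IQR stay exactly where they were.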


Mean vs. Median: When to Use Which

  • In symmetric distributions: center is often defined by the mean; mean ≈ median.

  • In skewed distributions or those with outliers: center is often defined by the median.

  • Relationship: Right-skewed distributions typically have mean > median; Left-skewed distributions typically have mean < median.
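
  • Sketch (not from the transcript): the typical mean/median ordering under skew, shown on two small hypothetical samples. The left-skewed sample is simply the mirror image of the right-skewed one. This is a tendency rather than a guarantee, so treat the comparison as a rough check and not as a test for skewness.

```python
import statistics

# Hypothetical right-skewed sample: most values are small, one long right tail.
right_skewed = [1, 2, 2, 3, 3, 3, 4, 4, 5, 20]

# Mirror image (21 - x), which puts the long tail on the left.
left_skewed = [21 - x for x in right_skewed]

for label, sample in (("right-skewed", right_skewed),
                      ("left-skewed", left_skewed)):
    mean = statistics.mean(sample)
    median = statistics.median(sample)
    relation = ">" if mean > median else "<" if mean < median else "="
    print(f"{label}: mean = {mean:.1f} {relation} median = {median:.1f}")
```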


Practice Question Insights

  • A comparative prompt asks which is most likely true for a given distribution (notes vs. Facebook usage). Possible conclusions include:

    • Mean > Median

    • Mean ≈ Median

    • Mean < Median

    • It may be impossible to tell without the data

  • A worked example in the transcript gives a median of 80% and a mean of 76% (illustrative) and asks the student to infer the relationship. Here mean < median, which is what a left-skewed distribution typically produces, though the exact interpretation depends on the actual data.


Quick Reference: Key Formulas

  • Sample mean: \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i

  • Population mean: \mu = \frac{1}{N} \sum_{i=1}^{N} x_i

  • Variance (sample): s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2

  • Standard deviation (sample): s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}

  • Median: middle value of ordered data; or average of two middle values if n is even.

  • Quartiles and IQR: Q_1, Q_3, \text{IQR} = Q_3 - Q_1

  • Box plot whiskers: max upper whisker reach = Q_3 + 1.5 \times \text{IQR}; max lower whisker reach = Q_1 - 1.5 \times \text{IQR}

  • Modality: unimodal, bimodal, multimodal, uniform

  • Skewness concepts: right-skewed, left-skewed, symmetric


Quick Recap: How to Describe a Distribution

  • Center: use the mean for symmetric data; use the median for skewed data or data with outliers.

  • Spread: use range, IQR, and standard deviation as appropriate; note robustness considerations.

  • Shape: assess modality and skewness to decide which statistics are most informative.

  • Outliers: identify and consider their impact on summary statistics; use robust statistics when appropriate.

  • Visualize: dot plots, stacked dot plots, histograms, and box plots each provide different insights and should be interpreted in light of bin width, scale, and sample size.


Practice Prompts for Exam Preparation

  • Describe a given GPA distribution in terms of center, spread, and shape using a dot plot.

  • Compute the sample mean and explain how outliers might affect it.

  • Explain the difference between population and sample variance and why the denominator differs.

  • Interpret a histogram’s bin width choice and discuss how it could alter your conclusions about modality and skewness.

  • Identify potential outliers using the 1.5 × IQR rule and explain why detecting them matters.

  • Decide when to report mean with SD versus median with IQR based on distribution symmetry and the presence of outliers.