Statistical Thinking – Data Types & Descriptive Statistics (Quick Review)

Introduction

  • Focus: Gathering & understanding data; descriptive statistics for quick comparison.

Data Types

  • Two broad classes:
    • Numerical / Quantitative
    • Categorical / Qualitative
Numerical Data
  • Quantity that can be counted or measured.
    • Discrete
      • Countable, separate values (e.g., number of weekly posts: $0, 1, 2, \dots$)
    • Continuous
      • Any value in a range (e.g., time on social media: $1.5\,\text{hrs}$)
Categorical Data
  • Quality or category labels.
    • Nominal
      • No inherent order (Instagram / Facebook / Snapchat)
    • Ordinal
      • Ordered categories without exact spacing (UI rating: Poor < Fair < Good < Excellent)
Quick Recap Table
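  • Type | Class | Defining feature
  • Nominal | Categorical | No inherent order
  • Ordinal | Categorical | Ordered, spacing not exact
  • Discrete | Numerical | Countable, separate values
  • Continuous | Numerical | Any value in a range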
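
As a rough illustration of the four types, the sketch below stores one made-up column of each kind in plain Python. The variable names and sample values are assumptions for illustration only. It also shows one practical consequence of the ordinal/nominal split: ordinal labels can be sorted by rank, so a middle rating is meaningful, whereas nominal labels have no such order.

```python
# Hypothetical survey answers, one list per data type.
weekly_posts = [0, 3, 1, 2, 3]                # discrete: counts
hours_online = [1.5, 0.75, 2.25, 3.0, 1.5]    # continuous: measurements
platform = ["Instagram", "Facebook", "Snapchat", "Instagram", "Facebook"]  # nominal
ui_rating = ["Good", "Poor", "Excellent", "Fair", "Good"]                  # ordinal

# Ordinal data carry an order, so each label can be mapped to a rank.
RATING_ORDER = ["Poor", "Fair", "Good", "Excellent"]
rank = {label: i for i, label in enumerate(RATING_ORDER)}

sorted_ratings = sorted(ui_rating, key=rank.__getitem__)

# With an odd number of answers, the middle rating is well defined.
median_rating = sorted_ratings[len(sorted_ratings) // 2]

print(sorted_ratings)
print(median_rating)
```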

Descriptive Statistics — The 3 M’s

  • Mean: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
  • Median: middle value after sorting (the average of the two middle values when $n$ is even).
  • Mode: most frequent value/category.
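
The three measures can be computed with Python's standard `statistics` module. The sketch below uses a made-up sample of weekly post counts and a made-up list of platform choices; both data sets are assumptions for illustration.

```python
import statistics

# Hypothetical sample: number of weekly posts for seven users.
posts = [2, 3, 3, 4, 5, 6, 12]

mean_posts = statistics.mean(posts)      # sum of the values divided by n
median_posts = statistics.median(posts)  # middle value after sorting
mode_posts = statistics.mode(posts)      # most frequent value

print(mean_posts, median_posts, mode_posts)  # 5 4 3

# The mode also works for categorical labels, where mean and median do not apply.
platforms = ["Instagram", "Facebook", "Instagram", "Snapchat"]
print(statistics.mode(platforms))  # Instagram
```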

Properties of the Mean

  • Add a constant $c$ to every data point ⇒ $\bar{x}_{\text{new}} = \bar{x} + c$
  • Multiply every data point by $k$ ⇒ $\bar{x}_{\text{new}} = k\,\bar{x}$
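
Both properties follow because the sum in the mean formula distributes over addition and scaling. A quick numerical check, assuming an arbitrary sample and arbitrary values for `c` and `k`:

```python
import statistics

data = [2, 4, 6, 8]  # hypothetical sample; its mean is 5
c, k = 3, 2          # arbitrary shift constant and scale factor

base_mean = statistics.mean(data)
shifted_mean = statistics.mean([x + c for x in data])  # every point moved by c
scaled_mean = statistics.mean([k * x for x in data])   # every point multiplied by k

print(base_mean, shifted_mean, scaled_mean)  # 5 8 10
```

The shifted mean equals `base_mean + c` and the scaled mean equals `k * base_mean`, matching the two rules above.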

Mean vs. Median vs. Mode

  • Mean is sensitive to outliers (e.g., for $\{2, 3, 4, 5, 6, 20\}$, $\bar{x} \approx 6.67$, which is larger than five of the six values and so not typical).
  • Median preferred when extreme values exist.
  • Mode preferred when:
    • Data are categorical / non-numeric.
    • Interest is in most common choice.
    • Distribution highly skewed (income example).
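
The outlier example can be checked directly. The sketch below compares the mean and median of the sample 2, 3, 4, 5, 6, 20 with and without the extreme value; dropping the 20 is done here only to show the effect, not as a recommended cleaning step.

```python
import statistics

values = [2, 3, 4, 5, 6, 20]

mean_all = statistics.mean(values)      # 40 / 6, about 6.67
median_all = statistics.median(values)  # (4 + 5) / 2 = 4.5

trimmed = values[:-1]                   # same sample without the outlier 20
mean_trimmed = statistics.mean(trimmed)      # 20 / 5 = 4
median_trimmed = statistics.median(trimmed)  # 4

print(round(mean_all, 2), median_all)  # 6.67 4.5
print(mean_trimmed, median_trimmed)    # 4 4
```

The single outlier moves the mean from 4 to about 6.67, while the median only moves from 4 to 4.5.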

Moving (Window) Average

  • Compute mean over a sliding window (e.g., 3-point moving average) to smooth time-series fluctuations.
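
A minimal sketch of a simple (unweighted) moving average, assuming the window slides one position at a time and that only full windows are reported. The function name `moving_average` and the sample data are hypothetical.

```python
def moving_average(values, window=3):
    """Return the mean of each consecutive slice of length `window`."""
    if window < 1:
        raise ValueError("window must be at least 1")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# Hypothetical daily minutes spent on social media.
daily_minutes = [30, 45, 60, 30, 90, 75]
smoothed = moving_average(daily_minutes, window=3)

print(smoothed)  # [45.0, 45.0, 60.0, 65.0]
```

The output has `window - 1` fewer points than the input, because the first full window only becomes available at the third observation.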

Choosing the Right Measure

  • Symmetric, outlier-free numeric data → Mean.
  • Skewed or outlier-heavy numeric data → Median.
  • Categorical or desire for most common value → Mode.

Key Takeaways

  • Distinguish data type first; it guides valid statistics.
  • Understand how transformations affect mean.
  • Be aware of outliers before selecting a measure of center.
  • Moving averages help reveal trends in sequential data.