Lecture 1: Averages and Charts

Course Administration

  • Instructor: Julia Yan, Assistant Professor (Operations and Logistics Division).
  • TA: Haho Presto.
  • Materials: Lecture content is primary; optional free textbook "OpenIntro Statistics." Canvas for readings/exercises.
  • Technology: iClicker, Excel with the Analysis ToolPak add-in (Google Sheets discouraged). No generative AI allowed.
  • Grading Breakdown: Homework (35%), Prep Questions (10%), Clicker Questions (5%), Attendance (5%), Final Exam (45%).
  • Support: Office hours, tutorials, email (reply within 24 hours).
  • Course Pace: Moves quickly over 5 weeks.

Describing Data: Visualization

  • Line Charts: Best for visualizing trends/changes over a meaningful order on the x-axis (e.g., time).
  • Scatterplots: Show relationships between two variables; each point is an individual observation.
  • Bar Charts: Used for simple comparisons of categorical data.
  • Histograms: Display the shape or distribution of data (symmetry, peaks, outliers). Not ideal for comparing many distributions.
  • Boxplots: Effective for comparing multiple distributions. Show outliers, min/max (excluding outliers), the 25th percentile, the median (50th percentile), and the 75th percentile.
  • Plotting Checklist: Ensure best chart type, informative title, labeled axes with units, and a legend if necessary.
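The five-number summary behind a boxplot can be computed directly. A minimal Python sketch on made-up data (the course itself uses Excel; `statistics.quantiles` with `method="inclusive"` matches Excel's QUARTILE.INC):

```python
import statistics

# Illustrative data, e.g., daily customer counts (made-up values).
data = [12, 15, 14, 10, 18, 21, 13, 16, 19, 11, 40]

# Quartiles; "inclusive" interpolation matches Excel's QUARTILE.INC.
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# The common boxplot rule flags points beyond 1.5 * IQR from the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(q1, median, q3, outliers)
```

Here the single large value 40 falls above the upper fence and would be drawn as an outlier point on the boxplot, while the whiskers stop at the most extreme non-outlier observations.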

Describing Data: Summary Statistics

  • Types of Data:
    • Quantitative: Numerical data (e.g., counts, measurements).
    • Categorical: Data taking a limited set of values (e.g., binary, ordinal).
  • Measures of Location:
    • Mean: The average (=AVERAGE()); sensitive to extreme observations.
    • Median: The middle value (=MEDIAN()); robust to outliers.
    • Mode: The most common value (=MODE()); less useful for continuous quantitative data.
  • Measures of Variation:
    • Standard Deviation: Typical distance to the mean (=STDEV()).
    • Interquartile Range (IQR): Difference between the 75th and 25th percentiles (=QUARTILE.INC(range, 3) - QUARTILE.INC(range, 1)).
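These measures map directly onto Python's standard library; a sketch on a small made-up dataset, mirroring the Excel functions above. Note how the one extreme value (100) drags the mean far above the median, while the median and IQR barely move:

```python
import statistics

data = [4, 8, 6, 5, 3, 8, 9, 7, 8, 100]  # note the extreme value 100

mean = statistics.mean(data)      # like =AVERAGE(); pulled up by the outlier
median = statistics.median(data)  # like =MEDIAN(); robust to the outlier
mode = statistics.mode(data)      # like =MODE(); the most common value
stdev = statistics.stdev(data)    # like =STDEV(); sample standard deviation

# IQR, like =QUARTILE.INC(range, 3) - QUARTILE.INC(range, 1)
q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print(mean, median, mode, round(stdev, 2), iqr)
```

The mean comes out at 15.8 versus a median of 7.5, illustrating why the median is preferred as a measure of location when extreme observations are present.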

Modeling Data: Probability Distributions

  • Random Variables: Variables with uncertain outcomes and a set of plausible results; used to model data collection.
  • Probability: A framework to reason about random variables, assigning likelihoods (0 to 1) to outcomes such that all probabilities sum to 1.
  • Probability Distributions: Summarize all possible outcomes of a random variable and their associated probabilities; visualize shapes with histograms.
  • The Normal Distribution (Bell Curve):
    • A continuous, bell-shaped distribution, commonly used due to its convenience and ability to approximate many real-world datasets (e.g., heights, stock returns, demand).
    • Fully described by its Mean (μ) (location) and Standard Deviation (σ) (variation/width).
  • The Empirical Rule (for Normal distributions):
    • Approximately 68% of observations fall within 1 standard deviation of the mean (μ ± 1σ).
    • Approximately 95% of observations fall within 2 standard deviations of the mean (μ ± 2σ).
    • Approximately 99.7% of observations fall within 3 standard deviations of the mean (μ ± 3σ).
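The empirical rule's percentages are approximations of exact normal probabilities, which can be checked with the standard library's NormalDist:

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

# Probability that a normal observation falls within k standard
# deviations of the mean: P(-k < Z < k) = CDF(k) - CDF(-k).
within = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}

for k, p in within.items():
    print(f"within {k} sd: {p:.4f}")
```

The exact values are about 0.6827, 0.9545, and 0.9973, which the 68/95/99.7 rule rounds for easy recall.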
  • Z-score: Measures how many standard deviations an observation (x) is away from the mean:
    • Z = (x − μ) / σ.
    • Z-scores far from 0 are considered "unusual" (low-probability).
  • Percentiles: Indicate the percentage of observations that fall below a given value. Can be calculated using z-scores and normal distribution functions (e.g., Excel's =NORM.S.DIST(z, TRUE)).
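Putting the z-score and percentile steps together, a short Python sketch (the mean, standard deviation, and observation are made-up illustration values, not from the lecture):

```python
from statistics import NormalDist

# Assumed example: heights with mean 170 cm and standard deviation 8 cm,
# modeled as normally distributed.
mu, sigma = 170, 8
x = 186  # the observation of interest

z = (x - mu) / sigma              # z-score: sd's above the mean
percentile = NormalDist().cdf(z)  # like Excel's =NORM.S.DIST(z, TRUE)

print(z, round(percentile, 3))
```

An observation 2 standard deviations above the mean sits near the 97.7th percentile, consistent with the empirical rule (about 95% within ±2σ leaves roughly 2.5% in each tail).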

From Data to Decisions

  • Motivation: Data-driven decision-making is pervasive in management.
  • Experimentation Challenges: "Before-and-After" experiments are prone to confounders (other explanations for observed changes).
  • Solution: Randomized Controlled Trials (RCTs) are the "gold standard" to control for confounders, isolating the effect of the variable being tested. Randomness in RCTs introduces noise that requires probability for understanding.