Lecture 1: Averages and Charts
Course Administration
- Instructor: Julia Yan, Assistant Professor (Operations and Logistics Division).
- TA: Haho Presto.
- Materials: Lecture content is primary; optional free textbook "OpenIntro Statistics." Canvas for readings/exercises.
- Technology: iClicker, Excel with the Analysis ToolPak add-in (Google Sheets discouraged). No generative AI allowed.
- Grading Breakdown: Homework (35%), Prep Questions (10%), Clicker Questions (5%), Attendance (5%), Final Exam (45%).
- Support: Office hours, tutorials, email (reply within 24 hours).
- Course Pace: Moves quickly over 5 weeks.
Describing Data: Visualization
- Line Charts: Best for visualizing trends/changes over a meaningful order on the x-axis (e.g., time).
- Scatterplots: Show relationships between two variables; each point is an individual observation.
- Bar Charts: Used for simple comparisons of categorical data.
- Histograms: Display the shape or distribution of data (symmetry, peaks, outliers). Not ideal for comparing many distributions.
- Boxplots: Effective for comparing multiple distributions. Each boxplot shows outliers, the min/max (excluding outliers), the 25th percentile, the median (50th percentile), and the 75th percentile.
- Plotting Checklist: Use the most suitable chart type, an informative title, labeled axes with units, and a legend if necessary.
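To make the boxplot description concrete, here is a minimal Python sketch of the numbers a boxplot draws. It assumes outliers are points more than 1.5 × IQR beyond the quartiles, which is a common convention but is not stated in the lecture. The function name `boxplot_summary` and the delivery-time data are hypothetical.

```python
import statistics

def boxplot_summary(values):
    """Return the numbers a boxplot draws, plus any outliers.

    Assumption: outliers lie more than 1.5 * IQR beyond the quartiles
    (a common convention, not taken from the lecture).
    """
    data = sorted(values)
    # "inclusive" interpolates the same way as Excel's QUARTILE.INC.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [x for x in data if low_fence <= x <= high_fence]
    outliers = [x for x in data if x < low_fence or x > high_fence]
    return {
        "whisker_min": min(inside),   # min excluding outliers
        "q1": q1,                     # 25th percentile
        "median": median,             # 50th percentile
        "q3": q3,                     # 75th percentile
        "whisker_max": max(inside),   # max excluding outliers
        "outliers": outliers,
    }

# Hypothetical delivery times in minutes, made up for illustration.
delivery_minutes = [2, 4, 4, 5, 7, 9, 40]
summary = boxplot_summary(delivery_minutes)
print(summary)
```

In this made-up sample the 40-minute delivery is flagged as an outlier, so the upper whisker stops at 9 rather than 40.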
Describing Data: Summary Statistics
- Types of Data:
  - Quantitative: Numerical data (e.g., counts, measurements).
  - Categorical: Data taking a limited set of values (e.g., binary, ordinal).
- Measures of Location:
  - Mean: The average (Excel: =AVERAGE()); sensitive to extreme observations.
  - Median: The middle value (Excel: =MEDIAN()); robust to outliers.
  - Mode: The most common value (Excel: =MODE()); less useful for continuous quantitative data.
- Measures of Variation:
  - Standard Deviation: Typical distance to the mean (Excel: =STDEV()).
  - Interquartile Range (IQR): Difference between the 75th and 25th percentiles (Excel: =QUARTILE.INC(range,3)−QUARTILE.INC(range,1)).
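For readers who want to check the Excel results outside a spreadsheet, the measures above can be sketched with Python's standard `statistics` module. The weekly sales figures are made up for illustration. `statistics.stdev` is the sample standard deviation, and `method="inclusive"` is used because it interpolates quartiles the same way as QUARTILE.INC.

```python
import statistics

# Hypothetical weekly sales (units), made up for illustration.
# The last value is deliberately extreme.
weekly_sales = [12, 15, 15, 18, 20, 22, 95]

mean_sales = statistics.mean(weekly_sales)      # like =AVERAGE()
median_sales = statistics.median(weekly_sales)  # like =MEDIAN()
mode_sales = statistics.mode(weekly_sales)      # like =MODE()
sd_sales = statistics.stdev(weekly_sales)       # sample SD, like =STDEV()

# Quartiles with inclusive interpolation, like QUARTILE.INC(range, 1) and (range, 3).
q1, _, q3 = statistics.quantiles(weekly_sales, n=4, method="inclusive")
iqr_sales = q3 - q1

print(f"mean={mean_sales:.2f} median={median_sales} mode={mode_sales}")
print(f"sd={sd_sales:.2f} IQR={iqr_sales}")
```

The single extreme week pulls the mean (about 28.1) well above the median (18) and inflates the standard deviation, while the median and IQR barely react. That is the sensitivity to extreme observations noted above.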
Modeling Data: Probability Distributions
- Random Variables: Variables with uncertain outcomes and a set of plausible results; used to model data collection.
- Probability: A framework to reason about random variables, assigning likelihoods (0 to 1) to outcomes such that all probabilities sum to 1.
- Probability Distributions: Summarize all possible outcomes of a random variable and their associated probabilities; visualize shapes with histograms.
- The Normal Distribution (Bell Curve):
  - A continuous, bell-shaped distribution, commonly used due to its convenience and ability to approximate many real-world datasets (e.g., heights, stock returns, demand).
  - Fully described by its Mean (μ) (location) and Standard Deviation (σ) (variation/width).
- The Empirical Rule (for Normal distributions):
  - Approximately 68% of observations fall within 1 standard deviation of the mean (μ±1σ).
  - Approximately 95% of observations fall within 2 standard deviations of the mean (μ±2σ).
  - Approximately 99.7% of observations fall within 3 standard deviations of the mean (μ±3σ).
- Z-score: Measures how many standard deviations an observation (x) is away from the mean:
  - Z = (x − μ) / σ.
  - Z-scores far from 0 are considered "unusual" (low-probability).
- Percentiles: Indicate the percentage of observations that fall below a given value. Can be calculated using z-scores and normal distribution functions (e.g., Excel's =NORM.S.DIST(z, TRUE)).
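The empirical rule, z-scores, and percentiles can be checked numerically. The sketch below uses `NormalDist` from Python's standard library. The height figures (mean 170 cm, standard deviation 10 cm, observation 185 cm) are hypothetical values chosen only for illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

# Empirical rule: probability of landing within k standard deviations of the mean.
coverage = {}
for k in (1, 2, 3):
    coverage[k] = standard_normal.cdf(k) - standard_normal.cdf(-k)
    print(f"within {k} sd: {coverage[k]:.4f}")

# Hypothetical example: heights with mean 170 cm and sd 10 cm (made-up numbers).
mu, sigma = 170.0, 10.0
x = 185.0

z = (x - mu) / sigma                 # z-score: distance from the mean in sd units
percentile = standard_normal.cdf(z)  # same role as Excel's =NORM.S.DIST(z, TRUE)
print(f"z = {z:.2f}, share of observations below x = {percentile:.4f}")
```

The three coverage values come out near 0.6827, 0.9545, and 0.9973, matching the 68%, 95%, and 99.7% figures. In the height example, z = 1.5 puts the observation at roughly the 93rd percentile.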
From Data to Decisions
- Motivation: Data-driven decision-making is pervasive in management.
- Experimentation Challenges: "Before-and-After" experiments are prone to confounders (other explanations for observed changes).
- Solution: Randomized Controlled Trials (RCTs) are the "gold standard" for controlling confounders, isolating the effect of the variable being tested. Random assignment introduces noise, so probability is needed to interpret the results.
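A small simulation can illustrate why a before-and-after comparison misleads when a confounder is present. Everything in this sketch is assumed for illustration: the true effect of 5, the seasonal lift of 10, the spend distribution, and the helper name `spend`. None of these numbers come from the lecture.

```python
import random
import statistics

rng = random.Random(0)   # fixed seed so the sketch is repeatable
TRUE_EFFECT = 5.0        # assumed lift from the change being tested
SEASONAL_LIFT = 10.0     # assumed confounder: the "after" period is busier
N = 10_000               # simulated customers per comparison

def spend(treated, busy_season):
    """Simulated spend for one customer (all numbers are made up)."""
    base = rng.gauss(100.0, 10.0)
    return base + TRUE_EFFECT * treated + SEASONAL_LIFT * busy_season

# Before-and-after: untreated in the quiet season, treated in the busy season.
before = [spend(False, False) for _ in range(N)]
after = [spend(True, True) for _ in range(N)]
before_after_estimate = statistics.mean(after) - statistics.mean(before)

# RCT: everyone is observed in the same (busy) period; a coin flip assigns treatment.
treatment, control = [], []
for _ in range(N):
    treated = rng.random() < 0.5
    (treatment if treated else control).append(spend(treated, True))
rct_estimate = statistics.mean(treatment) - statistics.mean(control)

print(f"before-and-after estimate: {before_after_estimate:.2f}")
print(f"RCT estimate:              {rct_estimate:.2f}")
```

Under these assumptions the before-and-after estimate lands near 15, because it absorbs the seasonal lift, while the RCT estimate lands near the assumed true effect of 5. The RCT estimate still varies from run to run because assignment is random, which is the noise that probability is needed to reason about.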