Lecture 1: Averages and Charts

Course Administration

  • Instructor: Julia Yan, Assistant Professor (Operations and Logistics Division).
  • TA: Haho Presto.
  • Materials: Lecture content is primary; optional free textbook "OpenIntro Statistics." Canvas for readings/exercises.
  • Technology: iClicker, Excel with the Analysis ToolPak add-in (Google Sheets discouraged). No generative AI allowed.
  • Grading Breakdown: Homework (35%), Prep Questions (10%), Clicker Questions (5%), Attendance (5%), Final Exam (45%).
  • Support: Office hours, tutorials, email (reply within 24 hours).
  • Course Pace: Moves quickly over 5 weeks.

Describing Data: Visualization

  • Line Charts: Best for visualizing trends/changes over a meaningful order on the x-axis (e.g., time).
  • Scatterplots: Show relationships between two variables; each point is an individual observation.
  • Bar Charts: Used for simple comparisons of categorical data.
  • Histograms: Display the shape or distribution of data (symmetry, peaks, outliers). Not ideal for comparing many distributions.
  • Boxplots: Effective for comparing multiple distributions. Show outliers, min/max (excluding outliers), the 25th percentile, the median (50th percentile), and the 75th percentile.
  • Plotting Checklist: Ensure best chart type, informative title, labeled axes with units, and a legend if necessary.
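The five-number summary behind a boxplot can be computed directly. A minimal Python sketch on made-up data (the course itself uses Excel; `statistics.quantiles` with `method="inclusive"` matches Excel's QUARTILE.INC):

```python
import statistics

# Illustrative data, e.g., daily customer counts (made-up values).
data = [12, 15, 14, 10, 18, 21, 13, 16, 19, 11, 40]

# Quartiles; "inclusive" interpolation matches Excel's QUARTILE.INC.
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# The common boxplot rule flags points beyond 1.5 * IQR from the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(q1, median, q3, outliers)
```

Here the single large value 40 falls above the upper fence and would be drawn as an outlier point on the boxplot, while the whiskers stop at the most extreme non-outlier observations.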

Describing Data: Summary Statistics

  • Types of Data:
    • Quantitative: Numerical data (e.g., counts, measurements).
    • Categorical: Data taking a limited set of values (e.g., binary, ordinal).
  • Measures of Location:
    • Mean: The average (=AVERAGE()); sensitive to extreme observations.
    • Median: The middle value (=MEDIAN()); robust to outliers.
    • Mode: The most common value (=MODE()); less useful for continuous quantitative data.
  • Measures of Variation:
    • Standard Deviation: Typical distance to the mean (=STDEV()).
    • Interquartile Range (IQR): Difference between the 75th and 25th percentiles (=QUARTILE.INC(range, 3) - QUARTILE.INC(range, 1)).
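These measures map directly onto Python's standard library; a sketch on a small made-up dataset, mirroring the Excel functions above. Note how the one extreme value (100) drags the mean far above the median, while the median and IQR barely move:

```python
import statistics

data = [4, 8, 6, 5, 3, 8, 9, 7, 8, 100]  # note the extreme value 100

mean = statistics.mean(data)      # like =AVERAGE(); pulled up by the outlier
median = statistics.median(data)  # like =MEDIAN(); robust to the outlier
mode = statistics.mode(data)      # like =MODE(); the most common value
stdev = statistics.stdev(data)    # like =STDEV(); sample standard deviation

# IQR, like =QUARTILE.INC(range, 3) - QUARTILE.INC(range, 1)
q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print(mean, median, mode, round(stdev, 2), iqr)
```

The mean comes out at 15.8 versus a median of 7.5, illustrating why the median is preferred as a measure of location when extreme observations are present.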

Modeling Data: Probability Distributions

  • Random Variables: Variables with uncertain outcomes and a set of plausible results; used to model data collection.
  • Probability: A framework to reason about random variables, assigning likelihoods (0 to 1) to outcomes such that all probabilities sum to 1.
  • Probability Distributions: Summarize all possible outcomes of a random variable and their associated probabilities; visualize shapes with histograms.
  • The Normal Distribution (Bell Curve):
    • A continuous, bell-shaped distribution, commonly used due to its convenience and ability to approximate many real-world datasets (e.g., heights, stock returns, demand).
    • Fully described by its Mean (μ) (location) and Standard Deviation (σ) (variation/width).
  • The Empirical Rule (for Normal distributions):
    • Approximately 68% of observations fall within 1 standard deviation of the mean (μ ± 1σ).
    • Approximately 95% of observations fall within 2 standard deviations of the mean (μ ± 2σ).
    • Approximately 99.7% of observations fall within 3 standard deviations of the mean (μ ± 3σ).
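The empirical rule's percentages are approximations of exact normal probabilities, which can be checked with the standard library's NormalDist:

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

# Probability that a normal observation falls within k standard
# deviations of the mean: P(-k < Z < k) = CDF(k) - CDF(-k).
within = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}

for k, p in within.items():
    print(f"within {k} sd: {p:.4f}")
```

The exact values are about 0.6827, 0.9545, and 0.9973, which the 68/95/99.7 rule rounds for easy recall.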
  • Z-score: Measures how many standard deviations an observation (x) is away from the mean:
    • Z = (x − μ) / σ.
    • Z-scores far from 0 are considered "unusual" (low-probability).
  • Percentiles: Indicate the percentage of observations that fall below a given value. Can be calculated using z-scores and normal distribution functions (e.g., Excel's =NORM.S.DIST(z, TRUE)).
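Putting the z-score and percentile steps together, a short Python sketch (the mean, standard deviation, and observation are made-up illustration values, not from the lecture):

```python
from statistics import NormalDist

# Assumed example: heights with mean 170 cm and standard deviation 8 cm,
# modeled as normally distributed.
mu, sigma = 170, 8
x = 186  # the observation of interest

z = (x - mu) / sigma              # z-score: sd's above the mean
percentile = NormalDist().cdf(z)  # like Excel's =NORM.S.DIST(z, TRUE)

print(z, round(percentile, 3))
```

An observation 2 standard deviations above the mean sits near the 97.7th percentile, consistent with the empirical rule (about 95% within ±2σ leaves roughly 2.5% in each tail).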

From Data to Decisions

  • Motivation: Data-driven decision-making is pervasive in management.
  • Experimentation Challenges: "Before-and-After" experiments are prone to confounders (other explanations for observed changes).
  • Solution: Randomized Controlled Trials (RCTs) are the "gold standard" to control for confounders, isolating the effect of the variable being tested. Randomness in RCTs introduces noise that requires probability for understanding.