Models and Model Evaluation
Introduction to Models and Model Evaluation
Motivating Problem: Imagine 100 people in a lecture hall. Everyone puts their cell phone into a bag. Phones are shuffled and randomly redistributed, one per person. How likely is it that at least 3 people get their own phone back?
Challenge: We often face situations with many possible outcomes and uncertainty about their likelihood.
Goal: Learn how to accurately estimate the likelihood of different outcomes.
Three Approaches to Calculating Probability
There are three main ways to approach such problems:
Calculate the exact probability: When possible, this provides the most accurate answer.
Collect data from real-world observations: Base probabilities on how often events occurred in reality.
Model and simulate events using a computer: Run a computer model thousands of times to estimate probabilities.
Challenge with Exact Calculation
Pros: When feasible, it yields the most accurate, often exact, answer.
Examples:
Probability of a royal flush from a shuffled deck (five cards): approximately 1/650,000.
Probability of rolling a 7 with a pair of fair dice: 1/6.
Cons: Exact calculations are not always easy or possible. Math can quickly become unwieldy, or exact probabilities may be unknown.
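Both examples above are small enough to compute exactly by counting equally likely outcomes. The following is a minimal sketch, assuming Python's standard library; the variable names are illustrative and not part of the original notes.

```python
from fractions import Fraction
from itertools import product
from math import comb

# Royal flush: 4 winning hands (one per suit) out of all 5-card hands.
total_hands = comb(52, 5)
p_royal_flush = Fraction(4, total_hands)

# Rolling a 7: enumerate all 36 equally likely outcomes of two fair dice.
rolls = list(product(range(1, 7), repeat=2))
sevens = [roll for roll in rolls if sum(roll) == 7]
p_seven = Fraction(len(sevens), len(rolls))

print(p_royal_flush)  # 1/649740, i.e. roughly 1 in 650,000
print(p_seven)        # 1/6
```

Counting works here because every hand and every roll is equally likely and the outcomes are easy to list; that is exactly what breaks down in harder problems.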
Detailed Cell Phone Problem Example (Why exact calculation is hard):
Person 1: Probability of getting their own phone back is 1/100.
Person 2: Probability depends on what Person 1 received.
If Person 1 did not get Person 2's phone, then Person 2's phone is still in the bag, one of the 99 phones remaining. Probability is 1/99.
If Person 1 did get Person 2's phone, then Person 2's phone is no longer in the bag. Probability is 0.
Person 3 and beyond: The dependencies become increasingly complex, making exact calculation impractical very quickly.
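To make the Person 2 reasoning concrete, the two cases can be combined by weighting each by how likely it is. This is a sketch using exact fractions; the variable names are hypothetical. Note that it only gives Person 2's overall chance, not the probability that at least 3 people match, which requires tracking all of the dependencies at once.

```python
from fractions import Fraction

n = 100  # number of people and phones

# Chance that Person 1 drew Person 2's phone, and chance that they did not.
p_taken = Fraction(1, n)
p_still_in_bag = 1 - p_taken  # 99/100

# Person 2's chance of drawing their own phone in each case.
p_own_if_still_in_bag = Fraction(1, n - 1)  # 1 of the 99 remaining phones
p_own_if_taken = Fraction(0)                # their phone is already gone

# Weight each case by its probability and add the results.
p_person2_own = (p_still_in_bag * p_own_if_still_in_bag
                 + p_taken * p_own_if_taken)

print(p_person2_own)  # 1/100
```

Even this single step needs a case split; each further person multiplies the number of cases to track.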
Challenge with Real-World Observation
Pros: Allows data itself to inform probabilities, factoring in all real-world complexities and noise.
Example: Calculating baseball batting averages. An everyday player has approximately 600 at-bats per season. Collecting data over 2-3 seasons provides well over a thousand events from which to calculate the proportion of at-bats in which the player gets a hit.
Cons: Requires collecting data from a vast number of events, which can be immensely time-consuming or expensive.
Time Example: Researching Brood X cicadas, which emerge only once every 17 years, would require thousands of years to collect data from many generations.
Cost Example: Studying neural pathways using brain scanners (thousands of dollars per scan) makes collecting thousands of data points financially prohibitive for most scientists.
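The observational approach boils down to a relative frequency: the number of times the outcome occurred divided by the number of events observed. A minimal sketch follows; the function name and the ten-event record are hypothetical toy values for illustration, not real data.

```python
def relative_frequency(outcomes):
    """Return the fraction of observed events in which the outcome occurred."""
    if not outcomes:
        raise ValueError("need at least one observation")
    return sum(outcomes) / len(outcomes)

# Hypothetical toy record of 10 events (True = the outcome of interest
# occurred, for example a hit in an at-bat). Not real player data.
observed = [True, False, False, True, False,
            False, False, True, False, False]

estimate = relative_frequency(observed)
print(estimate)  # 0.3
```

With only ten events the estimate is very rough, which is why this approach depends on collecting a large number of observations.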
Solution: Modeling and Simulation
Method: Create a computer model or simulation of an event, run it thousands of times, and analyze the results to determine the likelihood of particular outcomes.
Pros: Fast, cheap, and relatively easy to understand.
Applications: Widely used in various fields.
Examples: Sports simulations that predict game winners; Washington Post simulations, published early in the COVID-19 pandemic, that investigated strategies for slowing the spread of the disease.
This approach is a primary focus of this statistics class.
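As an illustration of the method, the cell phone problem from the introduction can be simulated directly: shuffle the phones, count how many people got their own back, and repeat many times. This is a minimal sketch, not the course's official code; the function names, the trial count of 10,000, and the fixed seed are all assumptions made for the example.

```python
import random

def count_matches(n_people, rng):
    """Shuffle n_people phones and count people who get their own back."""
    phones = list(range(n_people))  # phone i belongs to person i
    rng.shuffle(phones)             # random redistribution, one per person
    return sum(1 for person, phone in enumerate(phones) if person == phone)

def estimate_at_least(k, n_people=100, n_trials=10_000, seed=0):
    """Estimate the probability that at least k people get their own phone."""
    rng = random.Random(seed)       # fixed seed so the run is repeatable
    hits = sum(1 for _ in range(n_trials)
               if count_matches(n_people, rng) >= k)
    return hits / n_trials

estimate = estimate_at_least(3)
print(estimate)
```

The estimate should land near 0.08, because the number of matches in a large random shuffle is approximately Poisson-distributed with mean 1. Running more trials narrows the run-to-run variation.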
Defining a Model
Definition: For our purposes, a model is a formal description of a process that generates some data.
Key elements: A model has two parts: a process, and the piece of data or outcome that the process generates.
Example 1: Chicks on a Farm
Real-world situation: A farm hatches 45 chicks each day. How many are likely to be hens?
Model: Randomly pick