Models and Model Evaluation

Introduction to Models and Model Evaluation
  • Motivating Problem: Imagine 100 people in a lecture hall. Everyone puts their cell phone into a bag. Phones are shuffled and randomly redistributed, one per person. How likely is it that at least 3 people get their own phone back?

  • Challenge: We often face situations with many possible outcomes and uncertainty about their likelihood.

  • Goal: Learn how to accurately estimate the likelihood of different outcomes.

Three Approaches to Calculating Probability
  • There are three main ways to approach such problems:

    1. Calculate the exact probability: When possible, this provides the most accurate answer.

    2. Collect data from real-world observations: Base probabilities on how often events occurred in reality.

    3. Model and simulate events using a computer: Run a computer model thousands of times to estimate probabilities.

Challenge with Exact Calculation
  • Pros: When feasible, it yields the most accurate, often exact, answer.

    • Examples:

      • Probability of a royal flush from a shuffled deck (five cards): approximately 1/650,000.

      • Probability of rolling a 7 with a pair of fair dice: 1/6.

  • Cons: Exact calculations are not always easy or possible. Math can quickly become unwieldy, or exact probabilities may be unknown.

  • Detailed Cell Phone Problem Example (Why exact calculation is hard):

    • Person 1: Probability of getting their own phone back is 1/100.

    • Person 2: Probability depends on what Person 1 received.

      • If Person 1 did not get Person 2's phone, then Person 2's phone is still in the bag along with 98 others (99 phones in total). Probability is 1/99.

      • If Person 1 did get Person 2's phone, then Person 2's phone is no longer in the bag. Probability is 0.

    • Person 3 and beyond: The dependencies become increasingly complex, making exact calculation impractical very quickly.
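The exact results quoted in this section can be checked with a few lines of arithmetic. The following is a minimal sketch using Python's standard library; the variable names are illustrative choices, not part of the original notes. It also combines the two Person 2 cases with the law of total probability.

```python
from fractions import Fraction
from itertools import product
from math import comb

# Royal flush: 4 such hands (one per suit) among all 5-card hands.
royal_flush = Fraction(4, comb(52, 5))

# Sum of 7 with two fair dice: enumerate all 36 equally likely rolls.
rolls = list(product(range(1, 7), repeat=2))
seven = Fraction(sum(1 for a, b in rolls if a + b == 7), len(rolls))

# Person 2 in the phone problem, combining both cases:
# P(own phone) = P(Person 1 did not take it) * 1/99 + P(Person 1 took it) * 0
n = 100
p1_took_it = Fraction(1, n)
person2 = (1 - p1_took_it) * Fraction(1, n - 1) + p1_took_it * 0

print(royal_flush)  # 1/649740
print(seven)        # 1/6
print(person2)      # 1/100
```

Taken alone, Person 2 still has a 1/100 chance overall. The difficulty is that the events for different people are not independent, so a question like "at least 3 matches" cannot be answered by simply multiplying these individual probabilities.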

Challenge with Real-World Observation
  • Pros: Allows data itself to inform probabilities, factoring in all real-world complexities and noise.

    • Example: Calculating baseball batting averages. Players have approx. 600 at-bats per season. Collecting data over 2-3 seasons provides well over a thousand events from which to calculate the proportion of at-bats in which a player gets a hit.

  • Cons: Requires collecting data from a vast number of events, which can be immensely time-consuming or expensive.

    • Time Example: Researching Brood X cicadas, which emerge once every 17 years, would require thousands of years to collect data from many generations.

    • Cost Example: Studying neural pathways using brain scanners (thousands of dollars per scan) makes collecting thousands of data points financially prohibitive for most scientists.
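The observation-based approach boils down to counting how often an outcome occurred and dividing by the number of events observed. The sketch below illustrates this with a made-up record of ten at-bats; both the data and the function name `empirical_probability` are hypothetical and used only for illustration.

```python
def empirical_probability(outcomes):
    """Proportion of observed events in which the outcome occurred."""
    if not outcomes:
        raise ValueError("need at least one observation")
    return sum(outcomes) / len(outcomes)

# Hypothetical record: 1 = hit, 0 = no hit, for 10 made-up at-bats.
at_bats = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
print(empirical_probability(at_bats))  # 0.3
```

With only ten events the estimate is crude, which is why this approach needs many observations to be trustworthy.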

Solution: Modeling and Simulation
  • Method: Create a computer model or simulation of an event, run it thousands of times, and analyze the results to determine the likelihood of particular outcomes.

  • Pros: Fast, cheap, and relatively easy to understand.

  • Applications: Widely used in various fields.

    • Examples: Sports simulations that predict game winners; Washington Post simulations (early in the COVID-19 pandemic) that explored strategies for slowing the spread of the disease.

  • This approach is a primary focus of this statistics class.
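As an illustration of the method, the cell phone problem from the introduction can be simulated directly. This is a minimal sketch, not the course's own code: the function names, the number of trials, and the fixed random seed are assumptions made here for the example.

```python
import random

def count_own_phones(n, rng):
    """Shuffle n phones and count the people who draw their own."""
    phones = list(range(n))
    rng.shuffle(phones)  # phones[i] is the phone handed to person i
    return sum(1 for person, phone in enumerate(phones) if person == phone)

def estimate_at_least(k, n=100, trials=20_000, seed=0):
    """Estimate P(at least k people get their own phone) by simulation."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = sum(1 for _ in range(trials) if count_own_phones(n, rng) >= k)
    return hits / trials

estimate = estimate_at_least(3)
print(f"Estimated P(at least 3 matches): {estimate:.3f}")
```

The estimate should land around 0.08, and it varies slightly with the seed and the number of trials. Running more trials narrows that variation at the cost of a longer run.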

Defining a Model
  • Definition: For our purposes, a model is a formal description of a process that generates some data.

    • Key elements: A process (the steps that are carried out) and the piece of data or outcome that the process generates.

  • Example 1: Chicks on a Farm

    • Real-world situation: A farm hatches 45 chicks each day. How many are likely to be hens?

    • Model: Randomly pick