Probability Theory and Event Prediction Notes

Foundations of Probability Theory

  • Definition of Probability: A probability is a quantitative value, expressed as either a number or a percentage, that ranges between 0 and 1 (0% to 100%). It measures the likelihood that a specific event occurs.

  • Backbone of Statistics: Probability theory is considered the fundamental prerequisite and structural basis for the field of statistics.

  • Methods for Assigning Probabilities:
    * Classical Method: Probabilities are assigned from the theoretical ratio of outcomes, assuming every outcome is equally likely. The formula is:
        * $\frac{\text{Number of outcomes in which an event occurs}}{\text{Number of possible outcomes}}$
    * Empirical Probability (Relative Frequency Method): This approach relies on historical data or observed results. The formula is:
        * $\frac{\text{Number of outcomes in which an event has occurred in the past}}{\text{Number of opportunities for an event to occur}}$
    * Subjective Probability Method: This method uses individual judgment, experience, and other non-mathematical criteria to determine the likelihood of an event.

  • Core Definitions:
    * Experiment: A defined random process that generates specific results or data points (e.g., a data-collection procedure).
    * Sample Space: The set of all possible outcomes of an experiment.
    * Event: A set of outcomes of an experiment. An event can contain no outcome (the empty set), a single outcome, or multiple outcomes. Probabilities are assigned to events.
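
The classical and empirical methods above can be contrasted in a short Python sketch. The die experiment and the list of past rolls are hypothetical illustrations that do not come from these notes; the function names are likewise made up for this example.

```python
from fractions import Fraction

# Experiment: roll one six-sided die. The sample space lists every possible outcome.
sample_space = {1, 2, 3, 4, 5, 6}

# Event: "the roll is even". An event is a subset of the sample space.
even = {outcome for outcome in sample_space if outcome % 2 == 0}


def classical_probability(event, space):
    """Favourable outcomes divided by possible outcomes (equally likely outcomes assumed)."""
    return Fraction(len(event & space), len(space))


def empirical_probability(event, observations):
    """Times the event occurred divided by the number of opportunities."""
    hits = sum(1 for outcome in observations if outcome in event)
    return Fraction(hits, len(observations))


# Hypothetical record of 10 past rolls (made-up data, for illustration only).
past_rolls = [2, 5, 6, 1, 4, 4, 3, 6, 2, 5]

print(classical_probability(even, sample_space))   # 1/2
print(empirical_probability(even, past_rolls))     # 3/5
print(classical_probability(set(), sample_space))  # 0 (the empty event)
```

The two methods need not agree on a small record of observations: the classical value comes from the structure of the experiment, the empirical one from what happened to be observed.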

Data Structure and Contingency Tables

  • Multi-dimensional Data: Data collections often involve multiple variables or dimensions. This is frequently organized using a contingency table (also known as a cross-tab).

  • Phone Plan Example Data (Counts):
    * The data tracks two random variables: phone plan choice (29, 49, or 79 dollars) and the day of purchase (Monday-Friday vs. Saturday-Sunday).
    * Monday-Friday (Weekdays):
        * 29 Plan: 10 instances
        * 49 Plan: 120 instances
        * 79 Plan: 250 instances
        * Total Weekday Purchases: 380
    * Saturday-Sunday (Weekends):
        * 29 Plan: 40 instances
        * 49 Plan: 30 instances
        * 79 Plan: 350 instances
        * Total Weekend Purchases: 420
    * Combined Totals:
        * Total 29 Plans: 50
        * Total 49 Plans: 150
        * Total 79 Plans: 600
        * Grand Total Outcomes: 800

  • Potential Outcomes: In this example, there are 6 potential outcomes (3 plan prices times 2 day types), one for every possible combination of plan price and purchase day.

  • Sample Space for Example: $\{\$29 \text{ on weekdays},\ \$29 \text{ on weekends},\ \$49 \text{ on weekdays},\ \$49 \text{ on weekends},\ \$79 \text{ on weekdays},\ \$79 \text{ on weekends}\}$
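
As a minimal sketch, the contingency table above can be stored as a dictionary keyed by (plan price, day type). The variable names are illustrative choices, not part of the notes; the counts are the ones listed in the example.

```python
# Counts from the phone plan example, keyed by (plan price in dollars, day type).
counts = {
    (29, "weekday"): 10,  (29, "weekend"): 40,
    (49, "weekday"): 120, (49, "weekend"): 30,
    (79, "weekday"): 250, (79, "weekend"): 350,
}

plans = sorted({plan for plan, _ in counts})
days = sorted({day for _, day in counts})

# The sample space is every combination of plan price and day type.
sample_space = [(plan, day) for plan in plans for day in days]

# Row and column totals (the margins of the table) and the grand total.
plan_totals = {plan: sum(counts[(plan, day)] for day in days) for plan in plans}
day_totals = {day: sum(counts[(plan, day)] for plan in plans) for day in days}
grand_total = sum(counts.values())

# Print the cross-tab with its margins.
print(f"{'plan':>6}" + "".join(f"{day:>10}" for day in days) + f"{'total':>10}")
for plan in plans:
    row = "".join(f"{counts[(plan, day)]:>10}" for day in days)
    print(f"{plan:>6}" + row + f"{plan_totals[plan]:>10}")
print(f"{'total':>6}" + "".join(f"{day_totals[day]:>10}" for day in days) + f"{grand_total:>10}")
```

The margins reproduce the totals in the notes: 50, 150, and 600 by plan, 380 and 420 by day type, and 800 overall.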

Probability Concepts and Visualizations

  • Relative Frequency Assignment: Outcomes are assigned probabilities based on their observed frequency relative to the total number of observations.
    * $P(\text{outcome}_i) = \frac{\text{number of occurrences of outcome}_i}{\text{total number of occurrences of all outcomes}}$

  • Joint vs. Marginal Probabilities:
    * Joint Probability: This denotes the relative frequency of an event involving all dimensions simultaneously (e.g., the probability that a customer bought the 49-dollar plan on a weekday). It describes outcomes associated with more than one random variable.
    * Marginal Probability: This represents the relative frequency of an event when considering only a single dimension, regardless of other variables (e.g., the total probability of a customer buying a 49-dollar plan, ignoring which day it was purchased).

  • Venn Diagrams: Introduced by John Venn in 1880, these diagrams visualize logical relations between sets.
    * External Rectangle: Represents the entire sample space.
    * Internal Circle: Represents a specific event, such as event $A$.

  • Mathematical Notation:
    * Complement ($A'$): Pronounced "A prime," this refers to the event "not $A$." For example, if $A$ is the event of buying a 29-dollar plan, $A'$ is the event of buying any plan except the 29-dollar one.
    * Intersection ($\cap$): This symbol represents "and," indicating the co-occurrence of events. $A \cap B$ refers to both $A$ and $B$ happening. This is visualized as the overlapping section in a Venn diagram.
    * Union ($\cup$): Pronounced "union," this symbol represents "or." $A \cup B$ indicates that event $A$ or event $B$ (or both) occurs.
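
One way to make the notation in this section concrete is to treat each event as a Python set of outcomes, so that intersection, union, and complement map onto set operators. This is a sketch using the phone plan counts; the helper `prob` and the event names are assumptions made for illustration.

```python
from fractions import Fraction

# Counts from the phone plan example, keyed by (plan price in dollars, day type).
counts = {
    (29, "weekday"): 10,  (29, "weekend"): 40,
    (49, "weekday"): 120, (49, "weekend"): 30,
    (79, "weekday"): 250, (79, "weekend"): 350,
}
total = sum(counts.values())
sample_space = set(counts)


def prob(event):
    """Relative frequency of an event, where an event is a set of outcomes."""
    return Fraction(sum(counts[outcome] for outcome in event), total)


# Events are subsets of the sample space.
plan_49 = {o for o in sample_space if o[0] == 49}
weekday = {o for o in sample_space if o[1] == "weekday"}

joint = prob(plan_49 & weekday)            # intersection: 49 plan AND weekday
marginal = prob(plan_49)                   # 49 plan, ignoring the day
union = prob(plan_49 | weekday)            # 49 plan OR weekday (or both)
complement = prob(sample_space - plan_49)  # NOT the 49 plan

print(float(joint))       # 0.15
print(float(marginal))    # 0.1875
print(float(union))       # 0.5125
print(float(complement))  # 0.8125
```

Using `Fraction` keeps the relative frequencies exact, so later comparisons between probabilities are not affected by floating-point rounding.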

Essential Rules of Probability

  • Complement Rule: The sum of the probability of an event and the probability of its complement is always equal to 1.
    * $P(A) + P(A') = 1$

  • Law of Total Probability (Version 1): The joint probability of $A$ and $B$ plus the joint probability of $A$ and not $B$ equals the marginal probability of $A$.
    * $P(A \cap B) + P(A \cap B') = P(A)$

  • General Rule of Addition: This calculates the probability of the union of two events.
    * $P(A \cup B) = P(A) + P(B) - P(A \cap B)$
    * Example calculation from data: Probability of selling a plan on a weekday OR selling it for 29 dollars.
        * Calculation 1 (summing the non-overlapping pieces: 29 on weekdays, 29 on weekends, and 49 or 79 on weekdays): $\frac{10}{800} + \frac{40}{800} + \frac{370}{800} = 0.0125 + 0.05 + 0.4625 = 0.525$
        * Alternative calculation: $P(29) + P(\text{weekday}) - P(29 \cap \text{weekday}) = 0.0625 + 0.475 - 0.0125 = 0.525$

  • Mutually Exclusive Events: Events that cannot occur at the same time. If $A$ and $B$ are mutually exclusive, then $P(A \cap B) = 0$. They do not intersect in a Venn diagram. Any event and its complement are inherently mutually exclusive.

  • Collectively Exhaustive Events: A set of events that together cover the entire sample space, so at least one of them must occur. If $A$ and $B$ are collectively exhaustive, then $P(A \cup B) = 1$. Any event and its complement are collectively exhaustive.
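
The rules in this section can be checked numerically against the phone plan counts. The sketch below takes $A$ as "bought the 29-dollar plan" and $B$ as "bought on a weekday"; the helper functions are hypothetical names introduced for this example only.

```python
from fractions import Fraction

# Counts from the phone plan example, keyed by (plan price in dollars, day type).
counts = {
    (29, "weekday"): 10,  (29, "weekend"): 40,
    (49, "weekday"): 120, (49, "weekend"): 30,
    (79, "weekday"): 250, (79, "weekend"): 350,
}
total = sum(counts.values())
sample_space = set(counts)


def prob(event):
    """Relative frequency of an event given as a set of outcomes."""
    return Fraction(sum(counts[o] for o in event), total)


def mutually_exclusive(a, b):
    """True when the two events cannot occur together, i.e. P(a and b) = 0."""
    return prob(a & b) == 0


def collectively_exhaustive(*events):
    """True when the events together cover the sample space, i.e. P(union) = 1."""
    return prob(set().union(*events)) == 1


A = {o for o in sample_space if o[0] == 29}         # bought the 29-dollar plan
B = {o for o in sample_space if o[1] == "weekday"}  # bought on a weekday
not_A = sample_space - A
not_B = sample_space - B

complement_sum = prob(A) + prob(not_A)                  # complement rule
total_prob = prob(A & B) + prob(A & not_B)              # total probability, version 1
union_direct = prob(A | B)                              # union computed from outcomes
union_by_rule = prob(A) + prob(B) - prob(A & B)         # general rule of addition

print(complement_sum)                                   # 1
print(total_prob == prob(A))                            # True
print(float(union_direct), float(union_by_rule))        # 0.525 0.525
print(mutually_exclusive(A, not_A), mutually_exclusive(A, B))            # True False
print(collectively_exhaustive(A, not_A), collectively_exhaustive(A, B))  # True False
```

Subtracting $P(A \cap B)$ in the addition rule removes the 10 weekday purchases of the 29-dollar plan that would otherwise be counted twice.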

Conditional Probabilities and Bayes Rule

  • Conditional Probability ($P(A|B)$): Denotes the probability that event $A$ occurs given that event $B$ has occurred.
    * Formula: $P(A|B) = \frac{P(A \cap B)}{P(B)}$
    * Formula: $P(B|A) = \frac{P(A \cap B)}{P(A)}$
    * Example: The probability of a client choosing a 29-dollar plan conditional on visiting the store during a weekday.
        * Approach 1: $\frac{10}{380} \approx 0.026$
        * Approach 2: $\frac{P(29 \cap \text{Weekday})}{P(\text{Weekday})} = \frac{0.0125}{0.475} \approx 0.026$

  • General Law of Multiplication: Derived from the conditional probability formula.
    * $P(A \cap B) = P(A|B)P(B) = P(B|A)P(A)$

  • Bayes Rule: An algebraic rearrangement of the multiplication law.
    * $P(B|A) = \frac{P(A|B)P(B)}{P(A)}$

  • Law of Total Probability (Version 2): Using conditional probabilities to find a marginal probability.
    * $P(A|B)P(B) + P(A|B')P(B') = P(A)$
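
The formulas in this section can be traced through the phone plan data in a few lines of Python. This is a sketch under the same conventions as the notes ($A$ is "29-dollar plan", $B$ is "weekday"); the function `cond` is a hypothetical helper, not an established API.

```python
from fractions import Fraction

# Counts from the phone plan example, keyed by (plan price in dollars, day type).
counts = {
    (29, "weekday"): 10,  (29, "weekend"): 40,
    (49, "weekday"): 120, (49, "weekend"): 30,
    (79, "weekday"): 250, (79, "weekend"): 350,
}
total = sum(counts.values())
sample_space = set(counts)


def prob(event):
    """Relative frequency of an event given as a set of outcomes."""
    return Fraction(sum(counts[o] for o in event), total)


def cond(a, b):
    """P(a | b) = P(a and b) / P(b); undefined when P(b) is zero."""
    if prob(b) == 0:
        raise ValueError("cannot condition on an event with probability 0")
    return prob(a & b) / prob(b)


A = {o for o in sample_space if o[0] == 29}         # bought the 29-dollar plan
B = {o for o in sample_space if o[1] == "weekday"}  # bought on a weekday
not_B = sample_space - B

# Conditional probability: 10 of the 380 weekday purchases were 29-dollar plans.
p_a_given_b = cond(A, B)
print(p_a_given_b, round(float(p_a_given_b), 3))  # 1/38 0.026

# General law of multiplication: both factorizations recover the joint probability.
joint_via_b = cond(A, B) * prob(B)
joint_via_a = cond(B, A) * prob(A)
print(joint_via_b == joint_via_a == prob(A & B))  # True

# Bayes rule: recover P(B|A) from P(A|B), P(B), and P(A).
bayes = cond(A, B) * prob(B) / prob(A)
print(bayes, bayes == cond(B, A))                 # 1/5 True

# Law of total probability, version 2.
marginal_a = cond(A, B) * prob(B) + cond(A, not_B) * prob(not_B)
print(marginal_a, marginal_a == prob(A))          # 1/16 True
```

The Bayes result says that one in five buyers of the 29-dollar plan purchased on a weekday, which matches the direct count of 10 out of 50.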

Event Independence

  • Defining Independence: Events are independent if the occurrence or non-occurrence of one event has no effect on the likelihood of the other event.
    * Example: Flipping a coin twice. The outcome of the second toss is independent of the first; $P(\text{Heads on 2nd toss}) = 50\%$ regardless of the first result.

  • Mathematical Tests for Independence:
    * Version 1: $P(A|B) = P(A)$ and $P(B|A) = P(B)$
    * Version 2: $P(A \cap B) = P(A) \times P(B)$

  • Testing for Dependence: If $P(A|B) \neq P(A)$ or if $P(A \cap B) \neq P(A) \times P(B)$, the events are considered dependent (not independent).

  • Independence Test Example: Plan choice vs. Day of purchase.
    * $P(29) = 0.0625$
    * $P(\text{weekdays}) = 0.475$
    * $P(29) \times P(\text{weekdays}) = 0.0625 \times 0.475 \approx 0.0297$
    * Observed $P(29 \cap \text{weekdays}) = 0.0125$
    * Since $0.0297 \neq 0.0125$, the events are not independent in this sample, so plan choice and day of purchase may be dependent in general.
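
The product test (Version 2) can be sketched as a small function. The second table below is hypothetical: its counts were made up so that the test passes, purely to show what an independent table looks like. The function names are also illustrative.

```python
from fractions import Fraction


def prob(counts, predicate):
    """Relative frequency of the outcomes that satisfy `predicate`."""
    total = sum(counts.values())
    return Fraction(sum(n for outcome, n in counts.items() if predicate(outcome)), total)


def is_independent(counts, in_a, in_b):
    """Version 2 test: P(A and B) == P(A) * P(B), checked exactly on the sample."""
    p_joint = prob(counts, lambda o: in_a(o) and in_b(o))
    return p_joint == prob(counts, in_a) * prob(counts, in_b)


def is_29(outcome):
    return outcome[0] == 29


def is_weekday(outcome):
    return outcome[1] == "weekday"


# Observed counts from the phone plan example.
observed = {
    (29, "weekday"): 10,  (29, "weekend"): 40,
    (49, "weekday"): 120, (49, "weekend"): 30,
    (79, "weekday"): 250, (79, "weekend"): 350,
}

# Hypothetical counts (made up): each plan splits evenly across day types.
balanced = {
    (29, "weekday"): 25,  (29, "weekend"): 25,
    (49, "weekday"): 75,  (49, "weekend"): 75,
    (79, "weekday"): 300, (79, "weekend"): 300,
}

product = prob(observed, is_29) * prob(observed, is_weekday)
joint = prob(observed, lambda o: is_29(o) and is_weekday(o))
print(float(product), float(joint))                 # 0.0296875 0.0125
print(is_independent(observed, is_29, is_weekday))  # False
print(is_independent(balanced, is_29, is_weekday))  # True
```

Exact equality is a strict criterion for sample data; whether a gap such as 0.0297 versus 0.0125 reflects real dependence in the population is a question for formal hypothesis tests.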

Case Study: Titanic Sinking Data

  • Data Set Summary:
    * Survivors: 233 Females, 109 Males. (Total survivors: 342/891)
    * Deceased: 81 Females, 468 Males. (Total deaths: 549/891)
    * Gender Totals: 314 Females, 577 Males. (Grand Total: 891)

  • Probability Calculations:
    * Marginal Probability of Surviving: $P(\text{Survived}) = \frac{342}{891} \approx 0.38$ (38%)
    * Joint Probability (Male and Surviving): $P(\text{Survived} \cap \text{Male}) = \frac{109}{891} \approx 0.12$ (12%)
    * Union Probability (Survived OR Male): $P(\text{Survived} \cup \text{Male}) = \frac{342}{891} + \frac{577}{891} - \frac{109}{891} = \frac{810}{891} \approx 0.91$ (91%)
    * Conditional Probability (Survived given Male): $P(\text{Survived}|\text{Male}) = \frac{109/891}{577/891} = \frac{109}{577} \approx 0.19$ (19%). Working from the rounded values instead, $0.12 / 0.65 \approx 0.18$, loses accuracy.
    * Probability of Not Surviving: $P(\text{Survived}') = 1 - 0.38 = 0.62$. Note that Surviving and Dying are complements, mutually exclusive, and collectively exhaustive.

  • Testing Independence (Gender and Survival):
    * Independence would require $P(\text{Survived}|\text{Male}) = P(\text{Survived})$.
    * $P(\text{Survived}|\text{Male}) \approx 0.19$.
    * $P(\text{Survived}) \approx 0.38$.
    * Since $0.19 \neq 0.38$, survival and gender are dependent in this data set, which suggests that survival may have depended on gender.
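
The case-study figures can be reproduced from the four counts in the data set summary. The sketch below works from the counts rather than from rounded probabilities, to avoid rounding drift; the variable names are illustrative.

```python
from fractions import Fraction

# Counts from the Titanic data set summary, keyed by (survival status, gender).
counts = {
    ("survived", "female"): 233, ("survived", "male"): 109,
    ("died", "female"): 81,      ("died", "male"): 468,
}
total = sum(counts.values())  # 891 passengers
sample_space = set(counts)


def prob(event):
    """Relative frequency of an event given as a set of outcomes."""
    return Fraction(sum(counts[o] for o in event), total)


survived = {o for o in sample_space if o[0] == "survived"}
male = {o for o in sample_space if o[1] == "male"}

p_survived = prob(survived)                      # marginal
p_joint = prob(survived & male)                  # joint
p_union = prob(survived | male)                  # union
p_cond = prob(survived & male) / prob(male)      # conditional: survived given male
p_died = 1 - p_survived                          # complement rule

print(round(float(p_survived), 2))  # 0.38
print(round(float(p_joint), 2))     # 0.12
print(round(float(p_union), 2))     # 0.91
print(round(float(p_cond), 2))      # 0.19
print(round(float(p_died), 2))      # 0.62

# Version 1 independence test: compare P(Survived | Male) with P(Survived).
print(p_cond == p_survived)         # False
```

In this data set the survival rate among males (about 19%) is roughly half the overall survival rate (about 38%).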

Questions & Discussion

  • Q: Why use the word "may" when concluding lack of independence based on the calculation?
    * A: We cannot be 100% sure from raw data alone without performing formal statistical hypothesis tests. These tests determine whether the difference observed in the sample is large enough to hold for the population. They will be studied later in the semester.

  • Q: If we collect data to answer if plan choices and day of purchase are independent, is that a sample or a population?
    * A: It is typically a sample. If you chose a different set of data, the numbers and resulting probabilities in the contingency table would likely vary.
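
The point about samples varying can be illustrated with a small simulation. This is only a sketch under a strong simplifying assumption: it treats the observed relative frequencies from the phone plan table as if they were the true population probabilities, then draws new samples of 800 purchases. The seed value and function name are arbitrary choices for this example.

```python
import random

# Observed counts from the phone plan example, keyed by (plan price, day type).
observed = {
    (29, "weekday"): 10,  (29, "weekend"): 40,
    (49, "weekday"): 120, (49, "weekend"): 30,
    (79, "weekday"): 250, (79, "weekend"): 350,
}
outcomes = list(observed)
weights = list(observed.values())
sample_size = sum(weights)  # 800, the size of the original data set


def resampled_frequency(rng, outcome, size):
    """Draw `size` purchases using the observed frequencies as weights and
    return the relative frequency of `outcome` in the new sample."""
    draws = rng.choices(outcomes, weights=weights, k=size)
    return draws.count(outcome) / size


rng = random.Random(0)  # fixed, arbitrary seed so reruns give the same draws
estimates = [resampled_frequency(rng, (29, "weekday"), sample_size) for _ in range(5)]

# Each simulated sample yields its own estimate of P(29 and weekday).
for estimate in estimates:
    print(f"{estimate:.4f}")
```

The printed estimates generally differ from one another and from the 0.0125 in the original table, which is why conclusions about independence in the population call for formal tests.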

Weekly Summary of Formulae

  • a) $P(A) + P(A') = 1$

  • b) $P(A \cap B) + P(A \cap B') = P(A)$

  • c) $P(A \cup B) = P(A) + P(B) - P(A \cap B)$

  • d) $P(A|B) = \frac{P(A \cap B)}{P(B)}$ ; $P(B|A) = \frac{P(A \cap B)}{P(A)}$

  • e) $P(A|B)P(B) + P(A|B')P(B') = P(A)$ ; $P(B|A)P(A) + P(B|A')P(A') = P(B)$

  • f) $\text{Independence} \iff P(A|B) = P(A);\ P(B|A) = P(B)$

  • g) $\text{Independence} \iff P(A \cap B) = P(A) \times P(B)$