Unit 5: Probability Rules, Simulations, and Independence

Basic Probability Rules

  • Probability Model:
    * A probability model must show all possible outcomes in the sample space.
    * A probability model must show the probabilities for those outcomes.
    * Rule 1: The probability of each individual outcome must be a value between 0 and 1 (inclusive, representing $0\%$ to $100\%$).
    * Rule 2: The sum of all probabilities for the entire sample space must equal exactly 1 (expressing $100\%$ certainty that one of the outcomes in the sample space will occur).
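The two rules above can be checked mechanically. The following is a minimal sketch, not a standard routine: the function name `is_valid_probability_model` and the fair-die and coin models are hypothetical illustrations, not part of the notes.

```python
import math


def is_valid_probability_model(model: dict[str, float]) -> bool:
    """Return True if the model satisfies Rule 1 and Rule 2."""
    # Rule 1: each outcome's probability lies between 0 and 1 inclusive.
    rule_1 = all(0.0 <= p <= 1.0 for p in model.values())
    # Rule 2: the probabilities sum to 1 (isclose tolerates float rounding).
    rule_2 = math.isclose(sum(model.values()), 1.0)
    return rule_1 and rule_2


# Hypothetical model: a fair six-sided die, each face with probability 1/6.
fair_die = {str(face): 1 / 6 for face in range(1, 7)}
# Hypothetical invalid model: the probabilities sum to 1.1.
bad_coin = {"heads": 0.5, "tails": 0.6}

print(is_valid_probability_model(fair_die))
print(is_valid_probability_model(bad_coin))
```

The sum is compared with `math.isclose` rather than `==` because six floating-point copies of 1/6 do not necessarily add to exactly 1.0.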

  • Complement:
    * Definition: If $A$ is an event, then $A^C$ is the complement of $A$, also known as "not $A$."
    * Conceptual Meaning: The complement is the event that $A$ did not happen.
    * Complement Rule Formula: $P(A^C) = 1 - P(A)$.
    * Plain Language Explanation: The probability of the complement of $A$ is "everything" ($100\%$, or 1) minus the probability that $A$ occurs.

  • Mutually Exclusive Events:
    * Definition: Events that have no outcomes in common.
    * Addition Rule for Mutually Exclusive Events: When events $A$ and $B$ are mutually exclusive, the probability of either event occurring is the sum of their individual probabilities: $P(A \text{ or } B) = P(A) + P(B)$.

Randomness, Probability, and Simulation

  • Probability:
    * Definition: The proportion of times an outcome would occur in the "long run," that is, over a very large number of repetitions of the chance process.
    * Nature of Randomness: Random phenomena are unpredictable in the short run but predictable in the long run.

  • Law of Large Numbers:
    * Definition: If a chance process is repeated many, many times, the proportion of repetitions in which a given outcome occurs will approach the actual probability of that outcome.
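The Law of Large Numbers can be illustrated with a short simulation. This is a sketch under stated assumptions: the success probability 0.9 is borrowed from the on-time train claim in Application 5.1, and the function name `running_proportion`, the seed, and the trial counts are arbitrary choices made for illustration.

```python
import random


def running_proportion(p: float, n_trials: int, seed: int = 0) -> list[float]:
    """Simulate n_trials independent trials with success probability p.

    Returns the proportion of successes observed after each trial.
    """
    rng = random.Random(seed)  # seeded so the run is repeatable
    successes = 0
    proportions = []
    for trial in range(1, n_trials + 1):
        if rng.random() < p:  # this trial counts as a success
            successes += 1
        proportions.append(successes / trial)
    return proportions


props = running_proportion(p=0.9, n_trials=100_000)
for n in (10, 100, 1_000, 10_000, 100_000):
    # Early proportions may wander; later ones should settle near 0.9.
    print(n, round(props[n - 1], 4))
```

The printed proportions typically swing noticeably for small `n` and settle close to 0.9 for large `n`, which is the short-run versus long-run contrast described above.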

  • Simulation:
    * Definition: The act of imitating a chance process, often used by statisticians to estimate probabilities when direct calculation is difficult or to verify theoretical models.

Two-Way Tables and Venn Diagrams

  • Two-Way Table:
    * A grid format (often $2 \times 2$) used to organize data for two categorical variables, arranged in rows and columns.

  • Venn Diagram Components:
    * Circle: Represents a specific event.
    * Rectangle: Represents the entire sample space.
    * Intersection: The overlapping space in the Venn diagram, representing the event where both designated outcomes occur simultaneously ($A \cap B$, or "both").
    * Union: The total space covered by the events ($A \cup B$), representing the occurrence of either event $A$, event $B$, or both.

  • General Addition Rule:
    * This rule is used when events are not necessarily mutually exclusive (there may be an overlap/intersection).
    * Formula: $P(A \cup B) = P(A) + P(B) - P(A \cap B)$.
    * Symbolism: The symbol $\cup$ denotes the union ("or"), and the symbol $\cap$ denotes the intersection ("and").
    * Logic: The intersection probability $P(A \cap B)$ must be subtracted because it is counted twice when $P(A)$ and $P(B)$ are simply added.
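The double-counting logic can be seen directly with Python sets, whose `|` and `&` operators mirror $\cup$ and $\cap$. This is a small sketch under an assumed setup: one roll of a fair six-sided die, with the events `A` (even roll) and `B` (roll of 4 or more) chosen purely for illustration.

```python
from fractions import Fraction

# Assumed setup: one roll of a fair die, all six faces equally likely.
sample_space = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}  # hypothetical event: the roll is even
B = {4, 5, 6}  # hypothetical event: the roll is 4 or more


def prob(event: set[int]) -> Fraction:
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event & sample_space), len(sample_space))


# Left side of the rule: probability of the union, counted directly.
direct = prob(A | B)
# Right side of the rule: add both, then subtract the overlap once.
by_rule = prob(A) + prob(B) - prob(A & B)
# Naive sum without the subtraction: the overlap {4, 6} is counted twice.
naive_sum = prob(A) + prob(B)

print(direct, by_rule, naive_sum)
```

Exact fractions are used so the two sides of the rule can be compared with `==` without rounding concerns.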

Conditional Probability and Independence

  • Conditional Probability:
    * Definition: The probability that event $A$ will occur given that event $B$ has already occurred.
    * Notation: $P(A|B)$, read as "the probability of $A$ given $B$."
    * Formula: $P(A|B) = \frac{P(A \cap B)}{P(B)}$ (the probability of both events occurring divided by the probability of the given condition), defined when $P(B) > 0$.

  • Independence:
    * Test for Independence: Events $A$ and $B$ are independent if and only if $P(A) = P(A|B) = P(A|B^C)$.
    * Interpretation: Independence means that knowing whether or not event $B$ occurred does not change the probability/chance of event $A$ occurring.
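The conditional probability formula and the independence test can be sketched together. The setup is an assumption made for illustration: one roll of a fair six-sided die, with hypothetical events `A`, `B`, and `C`. The helper names `prob`, `cond_prob`, and `independent` are likewise illustrative.

```python
from fractions import Fraction

# Assumed setup: one roll of a fair die, all six faces equally likely.
SAMPLE_SPACE = frozenset(range(1, 7))


def prob(event) -> Fraction:
    """Probability of an event under equally likely outcomes."""
    return Fraction(len(event), len(SAMPLE_SPACE))


def cond_prob(a, b) -> Fraction:
    """P(A | B) = P(A and B) / P(B); only defined when P(B) > 0."""
    if not b:
        raise ValueError("cannot condition on an event with probability 0")
    return prob(a & b) / prob(b)


def independent(a, b) -> bool:
    """Apply the test P(A) = P(A | B) = P(A | B^C)."""
    b_complement = SAMPLE_SPACE - b
    return prob(a) == cond_prob(a, b) == cond_prob(a, b_complement)


A = {2, 4, 6}     # hypothetical event: the roll is even
B = {1, 2, 3, 4}  # hypothetical event: the roll is 4 or less
C = {4, 5, 6}     # hypothetical event: the roll is 4 or more

print(independent(A, B))  # knowing B leaves the chance of A at 1/2
print(independent(A, C))  # knowing C raises the chance of A to 2/3
```

Note that this sketch raises an error if `B` or its complement is empty, since the conditional probability is undefined in that case.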

Questions & Discussion

  • Application 5.2: How Prevalent is High Cholesterol?
    * Context: American adults chosen at random. Event $A$: high cholesterol ($\ge 240\,\text{mg/dl}$); Event $B$: borderline high cholesterol ($200$ to $< 240\,\text{mg/dl}$). Data: $P(A) = 0.16$ and $P(B) = 0.29$.
    * Question 1: Explain why events $A$ and $B$ are mutually exclusive.
    * Answer: Events $A$ and $B$ are mutually exclusive because a randomly chosen American adult cannot have high cholesterol ($240\,\text{mg/dl}$ or above) and borderline high cholesterol ($200$ to $< 240\,\text{mg/dl}$) at the same time.
    * Question 2: Say in plain language what the event "$A \text{ or } B$" is, then find $P(A \text{ or } B)$.
    * Plain Language: A randomly chosen American adult has either high or borderline high cholesterol.
    * Calculation: $P(A \text{ or } B) = P(A) + P(B) = 0.16 + 0.29 = 0.45$.
    * Question 3: Let $C$ be the event that the person chosen has normal cholesterol (less than $200\,\text{mg/dl}$). Find $P(C)$.
    * Logic: Every adult falls into exactly one of the three categories, so "normal" is the complement of "borderline or high."
    * Calculation: $P(C) = 1 - P(A \text{ or } B) = 1 - 0.45 = 0.55$. A randomly chosen American adult has a $55\%$ chance of having normal cholesterol.
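The arithmetic in Application 5.2 can be reproduced in a few lines. This is only a sketch: the variable names are illustrative, and the probabilities 0.16 and 0.29 are the ones given in the context.

```python
from fractions import Fraction

# Probabilities given in the context, stored as exact fractions.
p_high = Fraction(16, 100)        # P(A): high cholesterol
p_borderline = Fraction(29, 100)  # P(B): borderline high cholesterol

# A and B are mutually exclusive, so the simple addition rule applies.
p_high_or_borderline = p_high + p_borderline

# "Normal" is the complement of "high or borderline."
p_normal = 1 - p_high_or_borderline

print(float(p_high_or_borderline), float(p_normal))
```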

  • Application 5.1: Will the Train Arrive on Time?
    * Context: NJ Transit claims its 8:00 a.m. train has probability $0.9$ of arriving on time.
    * Question 1: Explain what probability $0.9$ means in this setting.
    * Answer: In the long run, over a very large number of trips, the train would arrive on time on about $90\%$ of them.
    * Question 2: The train arrived on time 5 days in a row. What is the probability it arrives on time tomorrow?
    * Answer: The probability remains $0.9$. A short streak of on-time arrivals does not change the long-run probability.
    * Question 3: The train was late on $3$ of $20$ days. Describe how to use a 10-sided die to simulate the number of late arrivals in $20$ days.
    * Answer: Assign faces $1$ through $9$ as "on time" ($90\%$) and face $10$ as "late" ($10\%$). Roll the die $20$ times and record whether the train is on time or late for each roll.
    * Question 4: Explain what the dot at $7$ on the dotplot represents.
    * Answer: It represents one repetition of the simulation (out of $100$) in which the train arrived late exactly $7$ times in $20$ rolls of the die.
    * Question 5: Estimate the probability that the train will arrive late on $3$ or more of $20$ days based on simulation results.
    * Answer: According to the dotplot, there are $30$ dots at $3$ or higher, so $P(\text{Late } \ge 3 \text{ out of } 20) \approx \frac{30}{100} = 0.3$, or $30\%$.
    * Question 6: Is there convincing evidence that New Jersey Transit's claim is false given $3$ late arrivals in $20$ days?
    * Answer: No. Because it is fairly likely (about a $30\%$ chance) that the train will arrive late on $3$ or more days out of $20$ just by chance, there is not convincing evidence that the claim is false.
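The die-rolling simulation described in Application 5.1 can be sketched in code. The function names, the seed, and the choice to run 10,000 repetitions alongside the 100 used in the notes are assumptions for illustration. A rerun will not reproduce the original dotplot's 30 dots exactly, because each simulation produces its own random results.

```python
import random


def late_days_in_one_repetition(rng: random.Random, n_days: int = 20) -> int:
    """Roll a 10-sided die once per day; a roll of 10 means 'late'."""
    rolls = [rng.randint(1, 10) for _ in range(n_days)]  # 1..10 inclusive
    return sum(1 for roll in rolls if roll == 10)


def estimate_prob_late_at_least(threshold: int, n_reps: int, seed: int = 0) -> float:
    """Estimate P(late on at least `threshold` of 20 days) by simulation."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    hits = sum(
        1 for _ in range(n_reps)
        if late_days_in_one_repetition(rng) >= threshold
    )
    return hits / n_reps


# 100 repetitions, matching the size of the dotplot in the notes.
estimate_100 = estimate_prob_late_at_least(3, n_reps=100)
# More repetitions give a steadier estimate (Law of Large Numbers).
estimate_10000 = estimate_prob_late_at_least(3, n_reps=10_000)

print(estimate_100, estimate_10000)
```

With only 100 repetitions the estimate can vary by several percentage points from run to run, which is why the answer to Question 5 is best read as an approximation.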

  • Application 5.3: Who Owns a Home?
    * Context: Random sample of $500$ U.S. adults. Event $G$: High school graduate; Event $H$: Homeowner.
    * Table Data:
        * HS Grad ($G$) and Homeowner ($H$): $221$
        * HS Grad ($G$) and Not Homeowner ($H^C$): $89$
        * Not HS Grad ($G^C$) and Homeowner ($H$): $119$
        * Not HS Grad ($G^C$) and Not Homeowner ($H^C$): $71$
        * Total HS Grads: $310$; Total Not HS Grads: $190$; Total Homeowners: $340$; Total Not Homeowners: $160$.
    * Question 1: Find $P(G^C)$.
    * Answer: $P(G^C) = \frac{190}{500} = 0.38$, which is a $38\%$ chance.
    * Question 2: Explain why $P(G \text{ or } H) \neq P(G) + P(H)$, and find $P(G \text{ or } H)$.
    * Answer: They are not equal because it is possible to be both a high school graduate and a homeowner (the events overlap). Simple addition would double-count the $221$ individuals who are both.
    * Calculation: $P(G \cup H) = \frac{310 + 340 - 221}{500} = \frac{429}{500} = 0.858$.
    * Question 3: Structure of the Venn Diagram for this data.
        * $G$ region only: $89$
        * $H$ region only: $119$
        * Overlap ($G$ and $H$): $221$
        * Outside region (Neither): $71$
    * Question 4: Find $P(\text{is not a high school graduate and is a homeowner})$.
    * Answer: $P(G^C \cap H) = \frac{119}{500} = 0.238$.
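The answers in Application 5.3 can be recomputed from the four cell counts. This is a minimal sketch: the dictionary layout and the helper `prob` are illustrative choices, and the counts are the ones listed in the table data.

```python
from fractions import Fraction

# Cell counts from the two-way table, keyed by (graduate status, owner status).
counts = {
    ("G", "H"): 221,
    ("G", "not H"): 89,
    ("not G", "H"): 119,
    ("not G", "not H"): 71,
}
total = sum(counts.values())  # total number of adults in the sample


def prob(predicate) -> Fraction:
    """Probability that a randomly chosen adult falls in a matching cell."""
    matching = sum(n for cell, n in counts.items() if predicate(cell))
    return Fraction(matching, total)


# Question 1: not a high school graduate.
p_not_g = prob(lambda cell: cell[0] == "not G")
# Question 2: graduate or homeowner; each cell is counted once, so no overlap.
p_g_or_h = prob(lambda cell: cell[0] == "G" or cell[1] == "H")
# Question 4: not a graduate and a homeowner.
p_not_g_and_h = prob(lambda cell: cell == ("not G", "H"))

print(float(p_not_g), float(p_g_or_h), float(p_not_g_and_h))
```

Summing cells directly avoids the double-counting issue altogether, and it agrees with the general addition rule calculation $\frac{310 + 340 - 221}{500}$.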

  • Application 5.4: Who Earns A's in College?
    * Context: 10,000 grades from UNH categorized by School (Liberal Arts, EPS, Health) and Grade (A, B, Lower than B). Event $E$: grade from EPS; Event $L$: grade lower than a B.
    * Data Table:
        * EPS: $368$ (A), $432$ (B), $800$ (Lower than B); Total EPS = $1600$.
        * Total Lower than B ($L$): $3666$.
    * Question 1: Find $P(L|E)$ and describe it in words.
    * Answer: $P(L|E) = \frac{P(L \cap E)}{P(E)} = \frac{800/10000}{1600/10000} = \frac{800}{1600} = 0.5$, or $50\%$. There is a $50\%$ chance that a randomly chosen course grade is lower than a B, given that it is from an EPS course.
    * Question 2: Are events $L$ and $E$ independent? Justify.
    * Answer: For independence, $P(L)$ must equal $P(L|E)$.
    * Check: $P(L) = \frac{3666}{10000} = 0.3666$ ($36.66\%$). $P(L|E) = 0.5$ ($50\%$). Since $0.3666 \neq 0.5$, the events are not independent.
    * Question 3: Given the grade is not lower than a B, find the probability it came from an EPS course.
    * Answer: Find $P(E | L^C)$. The number of grades not lower than a B is $10000 - 3666 = 6334$, so $P(L^C) = \frac{6334}{10000} = 0.6334$. The number of EPS grades not lower than a B is $368 + 432 = 800$, so $P(E \cap L^C) = \frac{800}{10000} = 0.08$.
    * Calculation: $P(E | L^C) = \frac{P(E \cap L^C)}{P(L^C)} = \frac{800}{6334} \approx 0.126$, or $12.6\%$.
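The three answers in Application 5.4 can be checked from the counts in the data table. This is a sketch with illustrative variable names. It uses only the EPS row and the totals given in the notes, since the Liberal Arts and Health rows are not listed.

```python
from fractions import Fraction

# Counts given in the data table.
total_grades = 10_000
eps_a, eps_b, eps_lower = 368, 432, 800
total_eps = eps_a + eps_b + eps_lower  # all EPS grades
total_lower = 3666                     # all grades lower than a B

# Question 1: P(L | E), restricting attention to EPS grades only.
p_l_given_e = Fraction(eps_lower, total_eps)

# Question 2: compare with the unconditional P(L).
p_l = Fraction(total_lower, total_grades)
events_independent = p_l == p_l_given_e

# Question 3: P(E | L^C), restricting attention to grades of A or B.
total_not_lower = total_grades - total_lower
p_e_given_not_l = Fraction(eps_a + eps_b, total_not_lower)

print(float(p_l_given_e), float(p_l), events_independent)
print(round(float(p_e_given_not_l), 3))
```

Working from counts within the restricted group (EPS grades, or grades of A or B) gives the same result as dividing the two probabilities, because the common denominator of 10,000 cancels.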