Statistics and Probability Final Test Comprehensive Study Guide

Core Concepts in Diagnostic Testing and Medical Screening

  • Home Pregnancy Test (HPT) Metrics and Definitions
    - Sensitivity: The probability that the test result is positive, given that the woman is actually pregnant ($\text{Chance of positive HPT} \mid \text{Pregnant}$).
    - Specificity: The probability that the test result is negative, given that the woman is not pregnant ($\text{Chance of negative HPT} \mid \text{Not Pregnant}$).
    - False Positive Rate (FPR): The probability that the test result is positive, given that the woman is not pregnant ($\text{Chance of positive HPT} \mid \text{Not Pregnant}$).
    - False Negative Rate (FNR): The probability that the test result is negative, given that the woman is pregnant ($\text{Chance of negative HPT} \mid \text{Pregnant}$).
    - Positive Predictive Value (PPV): The probability that a person is actually pregnant, given that they have tested positive ($\text{Chance of Pregnant} \mid \text{Positive Test}$).
    - Negative Predictive Value (NPV): The probability that a person is not pregnant, given that they have tested negative ($\text{Chance of Not Pregnant} \mid \text{Negative Test}$).
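
Two pairs of these definitions condition on the same group, so they are complements of each other. The relations below follow directly from the definitions and are offered as a quick memory aid rather than as part of the original notes:

```latex
\begin{align*}
\text{FPR} &= 1 - \text{Specificity}
  && \text{(both are conditioned on Not Pregnant)} \\
\text{FNR} &= 1 - \text{Sensitivity}
  && \text{(both are conditioned on Pregnant)}
\end{align*}
```

PPV and NPV are not complements of each other, because one conditions on a positive test and the other on a negative test.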

  • Medical Importance of Sensitivity vs. Specificity
    - In the context of life-threatening conditions like cancer, high sensitivity is considered more critical than high specificity.
    - Reasoning: It is imperative to identify and treat everyone who has the disease. Failing to identify a person with cancer (a false negative) means they may go untreated, their condition may worsen, or they may die.
    - The Trade-off: Telling a healthy person they have cancer (a false positive) is psychologically distressing, but follow-up diagnostic tests should reveal that they do not have the disease, which prevents unnecessary treatment.

Hypothesis Testing Fundamentals

  • The Nature of Hypothesis Tests
    - A hypothesis test is designed to test a claim about a population parameter.
    - The test seeks evidence against the null hypothesis ($H_0$).
    - The evidence is expressed quantitatively in the form of a p-value.

  • Interpreting p-values
    - Small p-values:
        - Indicate a small chance of a false positive (the False Positive Rate, FPR, in this guide's terms): data this extreme would rarely occur if $H_0$ were true.
        - Represent a statistically significant result.
        - Lead to rejecting the null hypothesis ($H_0$) in favor of the alternative hypothesis ($H_A$).
    - Large p-values:
        - Indicate a large chance of a false positive.
        - Represent a non-significant result.
        - Lead to a failure to reject the null hypothesis ($H_0$).

  • Practical vs. Statistical Significance
    - Practical Significance: Refers to whether the results are useful or meaningful in real-world applications.
    - Statistical Significance: Refers purely to the p-value, that is, whether the observed effect could plausibly be due to chance alone.
    - Example of Practical Significance: A drug that reduces cancer risk by $90\%$ is practically significant.
    - Example of Lack of Practical Significance: A drug that reduces cancer risk by only $0.0001\%$ is not practically significant, even if a large study finds the result to be statistically significant.

Statistical Data and Probability Examples

  • Z-scores and Associated p-values Table
    - Note: Stars (*) indicate significance at the $\alpha = 0.05$ level (where $p < 0.05$).
    - $Z = 0$: $P\text{-value} = 0.5$
    - $Z = 0.34$: $P\text{-value} = 0.36693$
    - $Z = 0.57$: $P\text{-value} = 0.28434$
    - $Z = 1.66$: $P\text{-value} = 0.04846^*$
    - $Z = 1.97$: $P\text{-value} = 0.02442^*$
    - $Z = 2.04$: $P\text{-value} = 0.02068^*$
    - $Z = 2.15$: $P\text{-value} = 0.01578^*$
    - $Z = 3.68$: $P\text{-value} = 0.00012^*$
    - $Z = 3.95$: $P\text{-value} = 0^*$
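
The listed values match upper-tail areas of the standard normal distribution. As a minimal sketch, assuming the table is one-sided (upper tail), they can be reproduced with Python's standard library. The function names `upper_tail_p_value` and `is_significant` are illustrative and do not come from the course materials.

```python
import math


def upper_tail_p_value(z: float) -> float:
    """One-sided (upper-tail) p-value for a standard normal Z-score."""
    # For a standard normal variable, P(Z >= z) = 0.5 * erfc(z / sqrt(2)).
    return 0.5 * math.erfc(z / math.sqrt(2))


def is_significant(z: float, alpha: float = 0.05) -> bool:
    """True when the upper-tail p-value falls below alpha."""
    return upper_tail_p_value(z) < alpha


if __name__ == "__main__":
    # Z-scores copied from the table; a star marks p < 0.05.
    for z in (0, 0.34, 0.57, 1.66, 1.97, 2.04, 2.15, 3.68, 3.95):
        star = "*" if is_significant(z) else ""
        print(f"Z = {z:<4}  p-value = {upper_tail_p_value(z):.5f}{star}")
```

For $Z = 3.95$ the computed value is very small but not exactly zero; the table lists it as 0.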

  • Categorical Investment Data Analysis
    - These calculations are based on a contingency table of age groups and investment types with a total sample size of $236$.
    - Chance of investing in cash given age 25-34: $\frac{1}{41}$
    - Chance of investing in bonds given age 55-70: $\frac{30}{56}$
    - Chance of being between 45 and 70: $\frac{133}{236}$
    - Chance of being 55-70 AND investing in stock: $\frac{22}{236}$
    - Chance of being 70 or younger given stock investment: $\frac{125}{125} = 1$
    - Chance of being younger than 45 given bond investment: $\frac{35}{100}$
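
The fractions above mix two kinds of probability: "AND" probabilities divide by the full sample (236), while "given" probabilities divide by the size of the conditioning group. As an illustration only, and assuming the denominator 56 in the bonds calculation is the number of people aged 55-70, the same numbers give a conditional probability that is not listed in the original set:

```latex
\begin{align*}
P(A \mid B) &= \frac{P(A \text{ and } B)}{P(B)} \\[4pt]
P(\text{stock} \mid \text{age 55-70})
  &= \frac{22/236}{56/236} = \frac{22}{56} \approx 0.39
\end{align*}
```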

Depression Screening Case Study: Threshold Analysis

  • Study Scenario 1 (Threshold Score $\ge 3$)
    - Ten patients (Adam, Mark, Bob, Mike, Joe, Fred, Josh, Toney, Jack, Jim) were evaluated. Each entry below gives (score, depressed Y/N).
    - Positive results (Score $\ge 3$): Mark (5, Y), Bob (7, Y), Mike (3, N), Joe (4, N), Josh (3, N), Toney (4, N), Jack (3, Y), Jim (9, Y).
    - Negative results (Score $< 3$): Adam (1, N), Fred (2, Y).
    - Contingency Table:
        - Depressed / Test Positive: $4$
        - Not Depressed / Test Positive: $4$
        - Depressed / Test Negative: $1$
        - Not Depressed / Test Negative: $1$
    - Calculated Metrics:
        - Sensitivity: $\frac{4}{5} = 0.8$
        - Specificity: $\frac{1}{5} = 0.2$
        - FPR: $\frac{4}{5}$
        - FNR: $\frac{1}{5}$
        - PPV: $\frac{4}{8}$
        - NPV: $\frac{1}{2}$
        - Prevalence: $\frac{5}{10} = 0.5$
        - Overall Accuracy: $(\text{Prevalence}) \times (\text{Sensitivity}) + (1 - \text{Prevalence}) \times (\text{Specificity}) = (0.5)(0.8) + (0.5)(0.2) = 0.4 + 0.1 = 0.5$
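
The metrics for Scenario 1 can be checked mechanically from the four contingency counts. The sketch below is one way to do that, not the course's prescribed method; the function name `screening_metrics` and its parameter names are hypothetical. It uses exact fractions so the output lines up with the hand calculations, and it assumes no denominator is zero.

```python
from fractions import Fraction


def screening_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Return screening metrics as exact fractions.

    tp: depressed and test positive      fp: not depressed and test positive
    fn: depressed and test negative      tn: not depressed and test negative
    """
    total = tp + fp + fn + tn
    sensitivity = Fraction(tp, tp + fn)
    specificity = Fraction(tn, tn + fp)
    prevalence = Fraction(tp + fn, total)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "fpr": Fraction(fp, fp + tn),
        "fnr": Fraction(fn, fn + tp),
        "ppv": Fraction(tp, tp + fp),
        "npv": Fraction(tn, tn + fn),
        "prevalence": prevalence,
        # Accuracy as the prevalence-weighted mix of sensitivity and specificity.
        "accuracy": prevalence * sensitivity + (1 - prevalence) * specificity,
    }


# Counts from the Scenario 1 contingency table.
scenario_1 = screening_metrics(tp=4, fp=4, fn=1, tn=1)
for name, value in scenario_1.items():
    print(f"{name}: {value} = {float(value):.2f}")
```

Fractions are printed in lowest terms, so PPV appears as 1/2 rather than 4/8.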

  • Study Scenario 2 (New Rule: Threshold Score $\ge 5$)
    - New Contingency Table:
        - Depressed / Test Positive: $3$
        - Not Depressed / Test Positive: $0$
        - Depressed / Test Negative: $2$
        - Not Depressed / Test Negative: $5$
    - Analysis of Result Change:
        - Pros: Specificity is higher and the False Positive Rate (FPR) is lower.
        - Cons: Sensitivity is lower (misses more depressed patients) and the False Negative Rate (FNR) is higher.
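
To see where the new table comes from, the ten patient scores from Scenario 1 can be re-classified at each threshold. The following is a small sketch under the assumption that "score at or above the threshold" means a positive test; `PATIENTS` and `confusion_counts` are illustrative names.

```python
# (name, score, depressed?) taken from the Scenario 1 patient list.
PATIENTS = [
    ("Adam", 1, False), ("Mark", 5, True), ("Bob", 7, True),
    ("Mike", 3, False), ("Joe", 4, False), ("Fred", 2, True),
    ("Josh", 3, False), ("Toney", 4, False), ("Jack", 3, True),
    ("Jim", 9, True),
]


def confusion_counts(threshold):
    """Count (tp, fp, fn, tn) when score >= threshold means test positive."""
    tp = fp = fn = tn = 0
    for _name, score, depressed in PATIENTS:
        positive = score >= threshold
        if positive and depressed:
            tp += 1
        elif positive and not depressed:
            fp += 1
        elif depressed:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


for threshold in (3, 5):
    tp, fp, fn, tn = confusion_counts(threshold)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    print(f"threshold >= {threshold}: tp={tp} fp={fp} fn={fn} tn={tn} "
          f"sensitivity={sensitivity:.1f} specificity={specificity:.1f}")
```

Raising the threshold from 3 to 5 moves sensitivity from 0.8 to 0.6 and specificity from 0.2 to 1.0, which is the trade-off described in the pros and cons.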

Correlation, Causation, and Experimental Design

  • Structural Definitions
    - Correlation: A straight-line (linear) association between two variables. It can be assessed visually via scatterplots.
    - Causation: A relationship where one variable or event directly causes changes in another.
    - Lurking Variable: A variable that is not included in the analysis but affects the relationship between the variables being studied, inducing confounding.
    - Confounding: Occurs when the effects of two or more variables are mixed together, making it impossible to determine which variable is causing the observed response.

  • Observational Studies vs. Experiments
    - Observational Study: Researchers observe subjects and measure variables without manipulating or assigning treatments (e.g., surveying people's exercise habits and measuring brain chemicals).
    - Controlled Experiment: Researchers actively impose an explanatory variable (treatment) on subjects (e.g., randomly assigning people to an exercise program vs. no change).
    - Causal Links: Causal links cannot be established from observational studies alone because subjects are not randomly assigned, meaning confounding variables (like gender, age, or general health) may be responsible for the results.

  • Properties of Correlation Coefficients
    - Outliers: An outlier can either decrease or increase a correlation coefficient, depending on its position relative to the rest of the data points.
    - Range: Correlation values must fall between $-1$ and $1$. Values like $-20$, $-2$, $2$, and $20$ are impossible.
    - Weak Correlation: A very weak correlation coefficient ($r$ near $0$) implies little or no overall linear relationship. There could still be separate linear relationships masked by a third variable, or a non-linear relationship may exist.
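
These properties are easy to check numerically. The sketch below computes Pearson's $r$ by hand on made-up data sets (none of the numbers come from the course) to show an outlier lowering $r$, an outlier raising $r$, and a perfectly curved relationship with $r = 0$. The helper name `pearson_r` is illustrative.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: a perfect line, then the same line plus one low outlier.
line_x, line_y = [1, 2, 3, 4, 5], [2, 4, 6, 8, 10]
r_line = pearson_r(line_x, line_y)
r_line_outlier = pearson_r(line_x + [6], line_y + [0])

# Hypothetical data: a loose cloud, then the same cloud plus one far-away point.
cloud_x, cloud_y = [1, 2, 3, 4], [2, 1, 2, 1]
r_cloud = pearson_r(cloud_x, cloud_y)
r_cloud_outlier = pearson_r(cloud_x + [10], cloud_y + [10])

# A perfect parabola: strongly related, but not linearly.
curve_x = [-2, -1, 0, 1, 2]
curve_y = [x ** 2 for x in curve_x]
r_curve = pearson_r(curve_x, curve_y)

print(f"line: {r_line:.3f}, with outlier: {r_line_outlier:.3f}")
print(f"cloud: {r_cloud:.3f}, with outlier: {r_cloud_outlier:.3f}")
print(f"parabola: {r_curve:.3f}")
```

In every case the result stays within $-1$ and $1$, consistent with the range rule.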

  • Variable Association Examples
    - Education and Salary: Positive association (more education is generally associated with higher salary).
    - Age and Running Speed (Adults): Negative association (older adults typically run slower than younger adults).
    - Age and Running Speed (Children): Positive association (older children typically run faster than younger children).
    - Husband's Age and Wife's Age: Positive association (people tend to marry partners of similar ages).

  • Case Study: NBA Rebounds vs. Free Throw Percentage
    - Trend: Negative association (players good at free throws tend to get fewer rebounds; high rebounders tend to have lower free throw percentages).
    - Outlier Example: A player with a high free throw percentage ($0.87$) and a high number of rebounds ($1100$) would be an outlier in this negative trend.

Problem-Specific Hypothesis Test Walkthroughs

  • Problem 1: Electric Utility Satisfaction
    - Claim: CEO claims more than $75\%$ of customers are satisfied.
    - Sample: $n = 100$, $\hat{p} = 0.77$.
    - Hypotheses:
        - $H_0: p \le 0.75$ (Proportion is $75\%$ or less).
        - $H_A: p > 0.75$ (Proportion is more than $0.75$).
    - P-value Result: $0.32276$. Since this is large, we fail to reject the null hypothesis.
    - Effect of Sample Size: If the same $77\%$ proportion were found with $n = 10$, the conclusion would remain the same (fail to reject), but the p-value would be even larger.
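
The sample-size claim can be checked with a short calculation. This is a minimal sketch of a one-sided, one-proportion z-test using the same standard-error formula that appears in the election example, $\sqrt{p_0(1-p_0)/n}$; the function name `one_proportion_z_test` is an illustrative choice.

```python
import math


def one_proportion_z_test(p_hat, p0, n):
    """Return (z, upper-tail p-value) for H0: p <= p0 vs HA: p > p0."""
    standard_error = math.sqrt(p0 * (1 - p0) / n)
    z = (p_hat - p0) / standard_error
    # Upper-tail area of the standard normal distribution.
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value


# Same sample proportion (0.77) and null value (0.75), two sample sizes.
z_100, p_100 = one_proportion_z_test(0.77, 0.75, 100)
z_10, p_10 = one_proportion_z_test(0.77, 0.75, 10)

print(f"n = 100: Z = {z_100:.2f}, p-value = {p_100:.3f}")
print(f"n = 10:  Z = {z_10:.2f}, p-value = {p_10:.3f}")
```

With $n = 100$ the p-value comes out near 0.32; the small difference from the guide's 0.32276 appears to come from rounding $Z$ to 0.46 before the table lookup. With $n = 10$ the standard error is larger, $Z$ is closer to 0, and the p-value rises.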

  • Problem 2: Drug for Stuttering (Pagoclone)
    - Study: $132$ patients (largest stuttering trial); $88$ received Pagoclone, the rest placebo.
    - Hypotheses:
        - $H_0$: Drug does not help stuttering.
        - $H_A$: Drug helps stuttering.
    - Results: Doctors gave a "numerically superior rating" to Pagoclone, but it did not reach statistical significance.
    - Conclusion: Because the result did not reach statistical significance, the chance of a false positive (FPR) is too high to reject $H_0$. There is not enough evidence to conclude that the drug helps stuttering.

  • UK Student Body Male Proportion Walkthrough
    - Context: Jane guesses the male proportion is $> 30\%$.
    - Step 1: Hypotheses:
        - $H_0: p \le 0.30$
        - $H_A: p > 0.30$
    - Step 2: Sample Data: $n = 30$, $16$ males ($\hat{p} = \frac{16}{30} \approx 0.533$).
    - Step 3: Conclusion: At $\alpha = 0.05$, reject the null. We conclude more than $30\%$ are male.
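
The walkthrough states the conclusion without the intermediate arithmetic. A worked version, assuming the same standard-error formula used in the election example, would look roughly like this (the rounded values are approximate):

```latex
\begin{align*}
\text{S.E.} &= \sqrt{\frac{p_0(1-p_0)}{n}}
             = \sqrt{\frac{0.30(0.70)}{30}} \approx 0.0837 \\[4pt]
Z &= \frac{\hat{p} - p_0}{\text{S.E.}}
   = \frac{0.533 - 0.30}{0.0837} \approx 2.79
\end{align*}
```

A $Z$ of about 2.79 corresponds to an upper-tail p-value of roughly 0.003, well below $\alpha = 0.05$, which is why the null is rejected.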

  • Election Campaign Hypothesis Test
    - Requirement to Win: $> 50\%$.
    - Hypotheses: $H_0: p \le 0.5$, $H_A: p > 0.5$.
    - Identifying Data: $p_0 = 0.5$, $\hat{p} = \frac{152}{250} = 0.608$, $n = 250$.
    - Calculations:
        - $\text{S.E.} = \sqrt{\frac{p_0(1-p_0)}{n}} = \sqrt{\frac{0.5(0.5)}{250}} = 0.03162$
        - $D = 0.608 - 0.5 = 0.108$
        - $Z = \frac{0.108}{0.03162} = 3.42$
    - Outcome: $P\text{-value} = 0.00031$. Reject the null. FPR is low ($0.00031$), meaning the result is reassuring.

  • Crayon Factory Quality Control
    - Criteria: If $> 95\%$ of products are correct, do not replace machines.
    - Hypotheses: $H_0: p \le 0.95$, $H_A: p > 0.95$.
    - Sample Data: $n = 50$, $48$ correct ($\hat{p} = 0.96$).
    - Calculations:
        - $\text{S.E.} = 0.03082$
        - $Z = \frac{0.96 - 0.95}{0.03082} = 0.32446$
        - $P\text{-value} = 0.37$
    - Conclusion: Fail to reject the null. There is not enough evidence to say the machines are working properly. Increasing the sample size (e.g., to $5000$) would provide more evidence and could lead to rejecting the null, because statistics estimate parameters better with larger samples.
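
To illustrate the sample-size remark, suppose (hypothetically) that a sample of $n = 5000$ also showed $\hat{p} = 0.96$. Nothing guarantees the larger sample would give the same proportion; this is only a what-if using the same formulas:

```latex
\begin{align*}
\text{S.E.} &= \sqrt{\frac{0.95(0.05)}{5000}} \approx 0.00308 \\[4pt]
Z &= \frac{0.96 - 0.95}{0.00308} \approx 3.24
\end{align*}
```

A $Z$ near 3.24 gives an upper-tail p-value well below 0.05, so the same one-point difference that was inconclusive at $n = 50$ would lead to rejecting the null at $n = 5000$.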

  • Beetle Infestation in Local Trees
    - Concern: $> 40\%$ of trees infected.
    - Hypotheses: $H_0: p \le 0.40$, $H_A: p > 0.40$.
    - Sample Data: $n = 320$, $150$ infected ($\hat{p} = 0.46875$).
    - Calculations:
        - The $\text{S.E.}$ calculation leads to $Z = 2.51$.
        - $P\text{-value} = 0.00604$.
    - Conclusion: At $\alpha = 0.07$, reject the null. Conclude more than $40\%$ are infected. FPR is $0.00604$.
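
The standard-error step is skipped in the notes. Filling it in with the formula from the election example reproduces the stated $Z$:

```latex
\begin{align*}
\text{S.E.} &= \sqrt{\frac{p_0(1-p_0)}{n}}
             = \sqrt{\frac{0.40(0.60)}{320}} \approx 0.02739 \\[4pt]
Z &= \frac{\hat{p} - p_0}{\text{S.E.}}
   = \frac{0.46875 - 0.40}{0.02739} \approx 2.51
\end{align*}
```

Since $0.00604 < 0.07$, the p-value is below the chosen $\alpha$ and the null is rejected.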