Lectures 4-10
Lecture Notes - CHS 780 Biostatistics in Public Health
Page 1: Introduction
Lecture 4: Distribution
Minggen Lu, PhD
Date: September 19, 2024
Page 2: Random Variable
Definition: A numeric variable that assumes a value based on the outcome of a random experiment.
Types of Random Variables:
Discrete Random Variable: Takes specific numeric values (often integers).
Continuous Random Variable: Can assume any value over an interval or continuum.
Examples:
Discrete: X = number of heads from 3 coin tosses (Sample space S = {0, 1, 2, 3}).
Continuous: X = high temperature in Reno on a summer day (Sample space S = {50 ≤ X ≤ 110}).
Notation:
Capital letters (X, Y, Z) denote random variables.
Lower-case (x, y, z) denote observed values.
Page 3: Probability Distribution
Discrete Random Variable X: A function p(x) assigns probabilities for each possible value of X, expressed as p(x) = Pr(X = x).
Example: Toss an unbiased coin three times and let X = number of heads, with sample space S = {0, 1, 2, 3}.
Probability Distribution:
x: 0, p(x): 1/8 (0.125)
x: 1, p(x): 3/8 (0.375)
x: 2, p(x): 3/8 (0.375)
x: 3, p(x): 1/8 (0.125)
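The table above can be reproduced by listing the 8 equally likely toss sequences and counting heads. This is a minimal sketch; the names `outcomes` and `pmf` are illustrative and not part of the lecture.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All 2^3 = 8 equally likely sequences of heads (H) and tails (T).
outcomes = list(product("HT", repeat=3))

# Number of sequences that give each count of heads.
counts = Counter(seq.count("H") for seq in outcomes)

# p(x) = (sequences with x heads) / (all sequences)
pmf = {x: Fraction(counts[x], len(outcomes)) for x in sorted(counts)}

for x, p in pmf.items():
    print(f"x = {x}: p(x) = {p} ({float(p):.3f})")
```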
Page 4: Application of p(x)
Events: Let A = event of obtaining 2 heads, B = event of obtaining at least 2 heads.
Compute:
Pr(A) = p(2) = 0.375
Pr(B) = p(2) + p(3) = 0.5
Conditional Probability:
Pr(A|B) = Pr(A ∩ B) / Pr(B) = Pr(A) / Pr(B) = 0.375 / 0.5 = 0.75
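The same computation can be sketched in code by treating each event as a set of values of X. The helper `prob` is a hypothetical name introduced for illustration.

```python
# p(x) for the number of heads in 3 tosses of a fair coin (Page 3 table).
p = {0: 0.125, 1: 0.375, 2: 0.375, 3: 0.125}

A = {2}       # exactly 2 heads
B = {2, 3}    # at least 2 heads


def prob(event):
    """Pr(event) = sum of p(x) over the values x in the event."""
    return sum(p[x] for x in event)


pr_a = prob(A)
pr_b = prob(B)
# A is a subset of B, so the intersection of A and B is A itself.
pr_a_given_b = prob(A & B) / pr_b

print(pr_a, pr_b, pr_a_given_b)
```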
Page 5: Continuous Probability Distribution
The probability distribution of a continuous random variable X is represented by an unbroken curve (the density curve).
Characteristics:
The area under the curve over an interval represents the probability of X assuming a value in that interval.
The total area under the curve equals 1.
The probability of X assuming any specific value is zero.
Page 6: Example of Continuous Random Variable
Example: Let X be the time spent studying each week by a college student.
Sample space for X: S = {0 ≤ X ≤ 50}.
The density curve is drawn such that the total area equals 1.
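The lecture does not give a formula for this density curve. As a sketch only, assume a hypothetical straight-line density on [0, 50] that falls from 0.04 at x = 0 to 0 at x = 50. The three characteristics from Page 5 can then be checked numerically.

```python
def density(x):
    """Hypothetical density on [0, 50] (an assumption, not from the lecture)."""
    return (2 / 50) * (1 - x / 50) if 0 <= x <= 50 else 0.0


def area(f, a, b, steps=10_000):
    """Approximate the area under f between a and b with the midpoint rule."""
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width


total = area(density, 0, 50)      # total area under the curve
p_10_20 = area(density, 10, 20)   # Pr(10 <= X <= 20)
p_point = area(density, 15, 15)   # Pr(X = 15): interval of zero width

print(round(total, 4), round(p_10_20, 4), p_point)
```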
Page 7: Mean and Variance
When a random variable is repeatedly measured, it leads to observed values.
Mean (µ): The long-run average of a large set of repeated measurements on X.
Variance (σ²): The long-run variance of a large set of repeated measurements on X.
Standard Deviation (σ): The square root of variance.
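For a discrete random variable, these long-run quantities can be computed directly from p(x) using µ = Σ x·p(x) and σ² = Σ (x − µ)²·p(x). These formulas are standard but are not written out in the lecture. The sketch below applies them to the coin-toss distribution from Page 3.

```python
from math import sqrt

# p(x) for the number of heads in 3 tosses of a fair coin (Page 3 table).
p = {0: 0.125, 1: 0.375, 2: 0.375, 3: 0.125}

mu = sum(x * px for x, px in p.items())                # mean
var = sum((x - mu) ** 2 * px for x, px in p.items())   # variance
sd = sqrt(var)                                         # standard deviation

print(mu, var, round(sd, 4))
```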
Page 8: Bernoulli Distribution
Factorials: n! = n(n-1)(n-2)...1, with 0! = 1.
Examples:
3! = 6, 6! = 720, 10! = 3,628,800.
Selection:
Ordering n objects: n!
Choosing x from n (0 ≤ x ≤ n): Binomial Coefficient
n choose x: C(n, x) = n!/(x!(n-x)!)
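A short sketch of these counting rules using Python's standard `math` module. The function `choose` is a hypothetical helper that mirrors the formula above; `math.comb` is the built-in equivalent.

```python
from math import comb, factorial

print(factorial(3), factorial(6), factorial(10))   # 3!, 6!, 10!
print(factorial(0))                                # 0! is defined as 1


def choose(n, x):
    """C(n, x) = n! / (x! (n - x)!), written out from the definition."""
    return factorial(n) // (factorial(x) * factorial(n - x))


# Number of ways to choose 4 objects from 10, computed two ways.
print(choose(10, 4), comb(10, 4))
```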
Page 9: Bernoulli Trials
Criteria:
Each trial has 2 possible outcomes (success/failure).
Trials are independent.
Probability of success (p) remains constant.
Examples:
Tossing a coin 25 times.
Rolling a die 12 times (even/odd outcomes).
Testing 1000 blood samples for HIV status.
Page 10: Binomial Random Variable
Let X count the number of successes in n Bernoulli trials.
X is a Binomial Random Variable.
Probability Distribution:
Binomial Distribution denoted as Bin(n, p).
Formula: p(x) = Pr(X = x) = C(n, x) p^x (1-p)^(n-x), where x = 0, 1,.., n.
Page 11: Properties of Binomial Random Variable
Mean: µ = np.
Variance: σ² = np(1-p).
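The formula and the two properties can be checked together. The sketch below uses n = 3 and p = 0.5, which is the coin-toss example from Page 3; `binom_pmf` is an illustrative name.

```python
from math import comb


def binom_pmf(x, n, p):
    """Pr(X = x) for X ~ Bin(n, p)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


n, p = 3, 0.5
pmf = [binom_pmf(x, n, p) for x in range(n + 1)]

# Mean and variance computed from the distribution itself.
mean = sum(x * px for x, px in enumerate(pmf))
var = sum((x - mean) ** 2 * px for x, px in enumerate(pmf))

print(pmf)
print(mean, n * p)             # both give the mean
print(var, n * p * (1 - p))    # both give the variance
```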
Page 12: Blood Type Example
Probability assignments: Blood Types O, A, B, AB:
O: 0.45, A: 0.40, B: 0.10, AB: 0.04.
Consider a sample of 10 Americans - let X = number with blood type A:
Calculate the mean, standard deviation, and probabilities for specific counts (e.g., exactly 4 with blood type A).
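A sketch of the calculation, assuming the 10 people are independent Bernoulli trials with Pr(type A) = 0.40, so that X ~ Bin(10, 0.40).

```python
from math import comb, sqrt


def binom_pmf(x, n, p):
    """Pr(X = x) for X ~ Bin(n, p)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


n, p = 10, 0.40    # 10 people, Pr(blood type A) = 0.40

mean = n * p                        # np
sd = sqrt(n * p * (1 - p))          # sqrt(np(1 - p))
p_exactly_4 = binom_pmf(4, n, p)    # Pr(X = 4)

print(mean, round(sd, 3), round(p_exactly_4, 4))
```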
Page 13: Psychiatric Disorder Example
Prevalence: Each adult has a 20% chance of suffering from a psychiatric disorder.
In a random sample of 12 adults, let X = number with a disorder and compute probabilities for ranges of X.
Example Questions: a. Probability of 3-6 having a disorder. b. Probability of more than 3 but fewer than 6. c. Probability that at least one has a disorder.
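A sketch of the three questions with X ~ Bin(12, 0.20). It assumes that "3-6" in question (a) includes both endpoints; the lecture does not say so explicitly.

```python
from math import comb


def binom_pmf(x, n, p):
    """Pr(X = x) for X ~ Bin(n, p)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


n, p = 12, 0.20

# a. Pr(3 <= X <= 6), endpoints assumed inclusive
a = sum(binom_pmf(x, n, p) for x in range(3, 7))

# b. Pr(3 < X < 6) = Pr(X = 4) + Pr(X = 5)
b = binom_pmf(4, n, p) + binom_pmf(5, n, p)

# c. Pr(X >= 1) = 1 - Pr(X = 0)
c = 1 - binom_pmf(0, n, p)

print(round(a, 4), round(b, 4), round(c, 4))
```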
Page 14: Eye Color Example
A couple plans to have 6 children; each child has blue eyes with probability 0.25 and brown eyes with probability 0.75.
Let X = number of blue-eyed children. Calculate the mean and standard deviation of X, and probabilities for various outcomes.
Find the probability that exactly one child has blue eyes, that at least one has blue eyes, and that at least one has brown eyes.
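A sketch with X ~ Bin(6, 0.25), assuming the children's eye colors are independent. The reading of the last question as "at least one blue-eyed child" and "at least one brown-eyed child" is an assumption.

```python
from math import comb, sqrt


def binom_pmf(x, n, p):
    """Pr(X = x) for X ~ Bin(n, p)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


n, p = 6, 0.25    # 6 children, Pr(blue eyes) = 0.25

mean = n * p
sd = sqrt(n * p * (1 - p))

p_one_blue = binom_pmf(1, n, p)               # exactly one blue-eyed child
p_at_least_one_blue = 1 - binom_pmf(0, n, p)  # 1 - Pr(no blue-eyed child)
p_at_least_one_brown = 1 - binom_pmf(n, n, p) # 1 - Pr(all six blue-eyed)

print(mean, round(sd, 4))
print(round(p_one_blue, 4), round(p_at_least_one_blue, 4),
      round(p_at_least_one_brown, 4))
```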
Page 15: Poisson Distribution
Used for counting occurrences of events over an interval of time or space.
Let X count occurrences—X is a Poisson Random Variable.
Mean number of occurrences: λ (lambda).
Distribution Formula: p(x) = Pr(X = x) = (e^(-λ) λ^x) / x!, where x = 0, 1, 2, ...
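The formula translates directly into code. In this sketch the value λ = 2 is a hypothetical choice for illustration, and `poisson_pmf` is an illustrative name.

```python
from math import exp, factorial


def poisson_pmf(x, lam):
    """Pr(X = x) for a Poisson random variable with mean lam."""
    return exp(-lam) * lam**x / factorial(x)


lam = 2.0    # hypothetical mean number of occurrences

# X has no upper limit, but the terms beyond x = 50 are negligible here.
probs = [poisson_pmf(x, lam) for x in range(50)]

for x in range(6):
    print(x, round(probs[x], 4))
print("sum of p(x):", round(sum(probs), 6))
```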
Page 16: Characteristics of Poisson Distribution
The probability mass function rises to a peak near λ and then decreases as the number of occurrences increases; when λ < 1 it decreases from x = 0 onward.
Used to approximate probabilities for Bin(n, p) when n is large and p is small, taking λ = np.
Page 17: Properties of a Poisson Random Variable
Mean & Variance: Both are equal to λ.
Can approximate binomial probabilities when n is large and p is small, using λ = np.
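Both properties can be checked numerically. The sketch below uses hypothetical values n = 1000 and p = 0.003 (so λ = np = 3) to compare binomial and Poisson probabilities, then sums the Poisson distribution to confirm that its mean and variance both equal λ.

```python
from math import comb, exp, factorial


def binom_pmf(x, n, p):
    """Pr(X = x) for X ~ Bin(n, p)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


def poisson_pmf(x, lam):
    """Pr(X = x) for a Poisson random variable with mean lam."""
    return exp(-lam) * lam**x / factorial(x)


n, p = 1000, 0.003    # hypothetical: large n, small p
lam = n * p

rows = [(x, binom_pmf(x, n, p), poisson_pmf(x, lam)) for x in range(11)]
max_gap = max(abs(b - q) for _, b, q in rows)

for x, b, q in rows[:6]:
    print(x, round(b, 4), round(q, 4))
print("largest difference:", max_gap)

# Mean and variance of the Poisson, summed over enough terms to cover the tail.
xs = range(60)
pois_mean = sum(x * poisson_pmf(x, lam) for x in xs)
pois_var = sum((x - pois_mean) ** 2 * poisson_pmf(x, lam) for x in xs)
print(round(pois_mean, 6), round(pois_var, 6))
```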
Page 18: Example - Emergency Treatment Center
Define number of patients arriving at ETC: mean = 4.5 per day.
Probability queries for no, at least one, or 4 or 5 patients arriving.
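A sketch of the three queries, taking X = number of patients arriving in a day as Poisson with λ = 4.5.

```python
from math import exp, factorial


def poisson_pmf(x, lam):
    """Pr(X = x) for a Poisson random variable with mean lam."""
    return exp(-lam) * lam**x / factorial(x)


lam = 4.5    # mean arrivals per day

p_none = poisson_pmf(0, lam)                          # Pr(X = 0)
p_at_least_one = 1 - p_none                           # Pr(X >= 1)
p_4_or_5 = poisson_pmf(4, lam) + poisson_pmf(5, lam)  # Pr(X = 4 or X = 5)

print(round(p_none, 4), round(p_at_least_one, 4), round(p_4_or_5, 4))
```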
Page 19: Normal Distribution
Normal distribution: most common for continuous variables.
Characteristics:
Examples include physiological measures (blood pressure, cholesterol).
Density Curve Formula: f(x) = (1/(σ√(2π))) e^(-(x-µ)²/(2σ²)), where µ = mean, σ = standard deviation.
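The density formula can be coded directly and compared with `statistics.NormalDist` from the Python standard library. The values µ = 120 and σ = 15 are hypothetical, chosen only for illustration.

```python
from math import exp, pi, sqrt
from statistics import NormalDist


def normal_pdf(x, mu, sigma):
    """Height of the N(mu, sigma^2) density curve at x."""
    return exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * sqrt(2 * pi))


mu, sigma = 120.0, 15.0    # hypothetical mean and standard deviation

for x in (90.0, 120.0, 135.0):
    by_formula = normal_pdf(x, mu, sigma)
    by_library = NormalDist(mu, sigma).pdf(x)
    print(x, by_formula, by_library)
```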
Page 20: Properties of Normal Distribution
Denoted as N(µ, σ²).
Standard Normal Distribution: µ = 0 and σ = 1.
Visual representation: bell-shaped curve centered on µ.
Page 21: The Empirical Rule
About 68% of area lies within one standard deviation (µ ± σ).
About 95% of area lies within two standard deviations (µ ± 2σ).
About 99.7% lies within three standard deviations (µ ± 3σ).
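The three percentages can be verified from the standard normal cumulative distribution function. This sketch uses `statistics.NormalDist` from the standard library; the same areas hold for any µ and σ.

```python
from statistics import NormalDist

Z = NormalDist()    # standard normal: mean 0, standard deviation 1

# Area within k standard deviations of the mean, for k = 1, 2, 3.
areas = {k: Z.cdf(k) - Z.cdf(-k) for k in (1, 2, 3)}

for k, a in areas.items():
    print(f"within {k} SD: {a:.4f}")
```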
Page 22: Computing Standard Normal Probabilities
Definition of a standard normal variable: Z = (X - µ) / σ.
Area under the standard normal curve between points gives probability.
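A sketch of standardization with hypothetical values: suppose X ~ N(120, 15²) and the probability Pr(105 < X < 135) is wanted. Converting the endpoints to Z scores gives the same area as working with X directly.

```python
from statistics import NormalDist

mu, sigma = 120.0, 15.0    # hypothetical mean and standard deviation of X
a, b = 105.0, 135.0        # interval of interest

# Standardize the endpoints: Z = (X - mu) / sigma
z_a = (a - mu) / sigma
z_b = (b - mu) / sigma

Z = NormalDist()                       # standard normal
p_via_z = Z.cdf(z_b) - Z.cdf(z_a)      # area between the two Z scores

# Same probability computed on the original scale, as a check.
X = NormalDist(mu, sigma)
p_direct = X.cdf(b) - X.cdf(a)

print(z_a, z_b, round(p_via_z, 4), round(p_direct, 4))
```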
Page 23: Properties of Standard Normal Distribution
Basic probability relationships based on standard normal distribution.
General formulas for comparisons of Z scores and their implications in hypothesis testing.
Page 24-31: Example and Applications
Various practical examples demonstrating how to compute probabilities, perform hypothesis testing, and analyze confidence intervals using normal and other distributions.
Page 32: Statistical Inference
Distinction between estimation and hypothesis testing.
Estimation methods including point and interval estimates.
Page 33-34: Key Definitions
Key statistical terms including populations, samples, statistics, and parameters outlined.
Importance of random sampling emphasized.
Page 35-40: Sampling Distribution and Mean
Walking through the concepts of sampling distributions, expected values, and standard errors.
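These pages are only summarized here, so the following is a sketch under stated assumptions rather than the lecture's own example. It simulates repeated samples of size n = 25 from a hypothetical N(100, 15²) population and compares the spread of the sample means with the standard error σ/√n.

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(780)    # fixed seed so the run is repeatable

mu, sigma, n = 100.0, 15.0, 25    # hypothetical population and sample size
reps = 5000                       # number of repeated samples

sample_means = [
    mean([random.gauss(mu, sigma) for _ in range(n)])
    for _ in range(reps)
]

theoretical_se = sigma / sqrt(n)       # standard error of the mean
simulated_se = stdev(sample_means)     # spread of the simulated sample means

print("mean of sample means:", round(mean(sample_means), 2))
print("theoretical SE:", theoretical_se)
print("simulated SE:", round(simulated_se, 2))
```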
Page 41-58: Hypothesis Testing and Power
Steps and methods involved in performing hypothesis tests, including Type I and Type II errors.
Power analysis and its significance discussed.
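As an illustration only, with all numbers hypothetical, the sketch below computes the Type I error rate, Type II error rate, and power for a one-sided z-test of H0: µ = 100 against H1: µ = 105 with known σ = 15, n = 36, and α = 0.05.

```python
from math import sqrt
from statistics import NormalDist

mu0, mu1 = 100.0, 105.0             # hypothetical null and alternative means
sigma, n, alpha = 15.0, 36, 0.05    # hypothetical SD, sample size, level

se = sigma / sqrt(n)                          # standard error of the mean
z_crit = NormalDist().inv_cdf(1 - alpha)      # one-sided critical value
cutoff = mu0 + z_crit * se                    # reject H0 if sample mean > cutoff

# Type I error: reject H0 when H0 is true.
type_i = 1 - NormalDist(mu0, se).cdf(cutoff)

# Power: reject H0 when the true mean is mu1. Type II error = 1 - power.
power = 1 - NormalDist(mu1, se).cdf(cutoff)
type_ii = 1 - power

print(round(cutoff, 3), round(type_i, 4), round(type_ii, 4), round(power, 4))
```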
Page 59-82: Examples and Applications
Illustrative examples reinforcing concepts of hypothesis testing, including confidence interval calculations.
Page 83-137: Two-Sample and Chi-Square Tests
Comparing means through paired and independent samples.
Chi-square test fundamentals and applications in testing.
Page 138-156: McNemar’s Test and Practical Examples
The application of nonparametric tests in various scenarios, including details on ear infection tests and vaccine evaluation.
Chi-Square Tests
Purpose
Assess relationships between categorical variables by comparing observed frequencies to expected frequencies.
Types
Chi-Square Test of Independence
Purpose: Determine if there is a significant association between two categorical variables in a contingency table.
Hypothesis:
Null (H0): There is no association between the variables.
Alternative (H1): There is an association.
Assumptions:
Random sampling
No more than 20% of cells should have an expected frequency below 5.
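A sketch of the test of independence on a hypothetical 2×2 table. Expected counts are (row total × column total) / n and the statistic is Σ (observed − expected)² / expected. The p-value line is valid only for 1 degree of freedom, where the chi-square tail area equals erfc(√(χ²/2)).

```python
from math import erfc, sqrt

# Hypothetical 2x2 contingency table (rows: groups, columns: outcomes).
observed = [[30, 20],
            [20, 30]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count under H0 (no association): row total * column total / n
expected = [[r * c / n for c in col_totals] for r in row_totals]

chi2 = sum((o - e) ** 2 / e
           for obs_row, exp_row in zip(observed, expected)
           for o, e in zip(obs_row, exp_row))
df = (len(row_totals) - 1) * (len(col_totals) - 1)

# Assumption check: share of cells with expected count below 5.
cells = [e for row in expected for e in row]
share_small = sum(e < 5 for e in cells) / len(cells)

# Tail area of the chi-square distribution; this shortcut holds for df = 1 only.
p_value = erfc(sqrt(chi2 / 2))

print(chi2, df, share_small, round(p_value, 4))
```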
Chi-Square Goodness of Fit Test
Purpose: Determine if the sample data matches a population with a specified distribution (e.g., expected proportions of categories).
Hypothesis:
Null (H0): The observed frequencies match the expected frequencies.
Alternative (H1): The observed frequencies do not match the expected frequencies.
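A sketch of the goodness of fit statistic with hypothetical counts in three categories and hypothesized proportions 0.4, 0.4, 0.2. With three categories there are 2 degrees of freedom, and for df = 2 only the chi-square tail area is exp(−χ²/2).

```python
from math import exp

observed = [50, 30, 20]          # hypothetical observed counts
proportions = [0.4, 0.4, 0.2]    # hypothesized category proportions (H0)

n = sum(observed)
expected = [n * q for q in proportions]

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1

# Tail area of the chi-square distribution; this shortcut holds for df = 2 only.
p_value = exp(-chi2 / 2)

print(expected, round(chi2, 4), df, round(p_value, 4))
```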
Applications
Useful in survey data analysis, experimental research, and quality control.
McNemar's Test
Purpose
Determine if there are differences on a dichotomous outcome between two related groups (e.g., before-and-after measurements on the same subjects).
Hypothesis
Null (H0): The proportions are the same (no change).
Alternative (H1): The proportions are different.
Assumptions
Data must be paired (repeated measures) and dichotomous (two possible outcomes).
Applications
Commonly used in matched-pair case-control studies and pre-post intervention studies to examine changes in a binary outcome.
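The test uses only the discordant pairs, those whose outcome differs between the two measurements. Below is a sketch with hypothetical counts b and c, using the statistic (b − c)² / (b + c) with 1 degree of freedom. A continuity-corrected version, (|b − c| − 1)² / (b + c), is also in common use; these notes do not say which form the lecture adopts.

```python
from math import erfc, sqrt

# Hypothetical paired data, before vs. after on the same subjects:
b = 15    # pairs that changed from "yes" to "no"
c = 5     # pairs that changed from "no" to "yes"

# McNemar statistic (uncorrected), compared with chi-square on 1 df.
stat = (b - c) ** 2 / (b + c)

# Continuity-corrected variant.
stat_corrected = (abs(b - c) - 1) ** 2 / (b + c)

# Chi-square tail area for df = 1.
p_value = erfc(sqrt(stat / 2))
p_value_corrected = erfc(sqrt(stat_corrected / 2))

print(stat, round(p_value, 4))
print(stat_corrected, round(p_value_corrected, 4))
```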
Wilcoxon Tests
Purpose
Non-parametric alternatives to the t-test for comparing two related or independent samples when a normal distribution cannot be assumed.
Types
Wilcoxon Signed-Rank Test
Purpose: Compare two related samples or matched observations.
Hypothesis:
Null (H0): The median difference between pairs is zero.
Alternative (H1): The median difference is not zero.
Applications: Used for before-and-after scenarios.
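A sketch of the signed-rank procedure on hypothetical paired differences: rank the absolute differences (average ranks for ties, zeros dropped), sum the ranks of the positive and negative differences, and get an exact two-sided p-value by enumerating every sign assignment. The function names are illustrative, and the enumeration is practical only for small samples.

```python
from itertools import product


def signed_rank(diffs):
    """Return (W+, W-, ranks) for paired differences, dropping zeros."""
    d = [x for x in diffs if x != 0]
    order = sorted(range(len(d)), key=lambda i: abs(d[i]))
    ranks = [0.0] * len(d)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied absolute differences.
        while j + 1 < len(order) and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1    # average of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    w_plus = sum(r for r, x in zip(ranks, d) if x > 0)
    w_minus = sum(r for r, x in zip(ranks, d) if x < 0)
    return w_plus, w_minus, ranks


def exact_p_two_sided(ranks, w_obs):
    """Under H0 each rank is equally likely to carry a + or - sign."""
    total = sum(ranks)
    hits = 0
    for signs in product((0, 1), repeat=len(ranks)):
        w_pos = sum(r for r, s in zip(ranks, signs) if s)
        if min(w_pos, total - w_pos) <= w_obs:
            hits += 1
    return hits / 2 ** len(ranks)


# Hypothetical after-minus-before differences for 8 subjects.
diffs = [5, 3, -1, 4, 6, 2, 7, 8]

w_plus, w_minus, ranks = signed_rank(diffs)
p_value = exact_p_two_sided(ranks, min(w_plus, w_minus))

print(w_plus, w_minus, p_value)
```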
Wilcoxon Rank-Sum Test (Mann-Whitney U Test)
Purpose: Compare two independent samples.
Hypothesis:
Null (H0): The distributions of the two populations are equal.
Alternative (H1): The distributions of the populations are not equal.
Applications: Used when comparing two different groups or populations, especially when sample sizes are small.
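A sketch of the rank-sum (Mann-Whitney U) calculation on two small hypothetical samples. U counts the pairs in which a value from the first sample exceeds a value from the second, with ties counted as one half. The exact two-sided p-value comes from enumerating every way to split the pooled data into groups of the same sizes. The function names are illustrative.

```python
from itertools import combinations


def mann_whitney_u(sample1, sample2):
    """U1 = number of (x, y) pairs with x > y, counting ties as 1/2."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0
               for x in sample1 for y in sample2)


def exact_p_two_sided(sample1, sample2):
    """Exact p-value by enumerating all regroupings of the pooled data."""
    n1, n2 = len(sample1), len(sample2)
    pooled = list(sample1) + list(sample2)
    u_obs = mann_whitney_u(sample1, sample2)
    extreme = min(u_obs, n1 * n2 - u_obs)
    hits = total = 0
    for idx in combinations(range(n1 + n2), n1):
        chosen = set(idx)
        g1 = [pooled[i] for i in range(n1 + n2) if i in chosen]
        g2 = [pooled[i] for i in range(n1 + n2) if i not in chosen]
        u = mann_whitney_u(g1, g2)
        if min(u, n1 * n2 - u) <= extreme:
            hits += 1
        total += 1
    return hits / total


# Hypothetical measurements from two independent groups.
group1 = [12, 15, 11, 18, 14]
group2 = [22, 19, 25, 17, 21]

u1 = mann_whitney_u(group1, group2)
u2 = len(group1) * len(group2) - u1
p_value = exact_p_two_sided(group1, group2)

print(u1, u2, round(p_value, 4))
```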
Key Considerations
Both tests rely on ranking data rather than using raw scores and do not require normality of data distribution.
Two-Sample Tests
Purpose: Compare means from two independent samples to determine if they come from populations with the same mean.
Types:
Independent Samples t-Test: Used when the sample sizes are small (n < 30) or when the population standard deviations are unknown. Assumes the data are normally distributed.
Hypothesis: Null (H0): μ1 = μ2 vs. Alternative (H1): μ1 ≠ μ2.
Mann-Whitney U Test: Non-parametric alternative when the data do not meet the normality assumption. Compares the distributions of two groups, often interpreted as a comparison of medians.
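A sketch of the independent samples t statistic on hypothetical data, in two common forms: the pooled version, which assumes equal variances, and Welch's version, which does not. The notes do not say which form the lecture uses. The standard library has no t distribution, so the statistic and degrees of freedom would be compared against a t table or statistical software to get a p-value.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t(a, b):
    """Two-sample t statistic assuming equal population variances."""
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    t = (mean(a) - mean(b)) / sqrt(sp2 * (1 / na + 1 / nb))
    return t, na + nb - 2


def welch_t(a, b):
    """Two-sample t statistic without the equal-variance assumption."""
    na, nb = len(a), len(b)
    va, vb = variance(a) / na, variance(b) / nb
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va**2 / (na - 1) + vb**2 / (nb - 1))
    return t, df


# Hypothetical measurements from two independent groups.
group1 = [5.1, 4.9, 6.2, 5.8, 6.0]
group2 = [4.2, 4.8, 5.0, 4.4, 4.6]

t_pooled, df_pooled = pooled_t(group1, group2)
t_welch, df_welch = welch_t(group1, group2)

print(round(t_pooled, 3), df_pooled)
print(round(t_welch, 3), round(df_welch, 2))
```

With equal sample sizes the two statistics coincide and only the degrees of freedom differ.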
Key Considerations
For two-sample tests, ensure random sampling and check for equal variances when using t-tests.
For Chi-Square tests, no more than 20% of the cells in a contingency table should have an expected frequency below 5 for the test to remain valid.