Lecture 4: Distribution
Minggen Lu, PhD
Date: September 19, 2024
Random Variable
Definition: A numeric variable whose value is determined by the outcome of a random experiment.
Types of Random Variables:
Discrete Random Variable: Takes specific numeric values (often integers).
Continuous Random Variable: Can assume any value over an interval or continuum.
Examples:
Discrete: X = number of heads from 3 coin tosses (Sample space S = {0, 1, 2, 3}).
Continuous: X = high temperature in Reno on a summer day (Sample space S = {50 ≤ X ≤ 110}).
Notation:
Capital letters (X, Y, Z) denote random variables.
Lower-case (x, y, z) denote observed values.
Probability distribution of a discrete random variable X: a function p(x), called the probability mass function, assigns a probability to each possible value of X, expressed as p(x) = Pr(X = x).
Example: Toss an unbiased coin three times and let X = number of heads, so the possible values of X are S = {0, 1, 2, 3}.
Probability Distribution:
x: 0, p(x): 1/8 (0.125)
x: 1, p(x): 3/8 (0.375)
x: 2, p(x): 3/8 (0.375)
x: 3, p(x): 1/8 (0.125)
Events: Let A = event of obtaining 2 heads, B = event of obtaining at least 2 heads.
Compute:
Pr(A) = p(2) = 0.375
Pr(B) = p(2) + p(3) = 0.5
Conditional Probability:
Pr(A|B) = Pr(A ∩ B) / Pr(B); since A is a subset of B, A ∩ B = A, so Pr(A|B) = Pr(A) / Pr(B) = 0.375 / 0.5 = 0.75
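As a quick check, here is a minimal Python sketch (variable names are illustrative) that rebuilds this probability distribution with math.comb and reproduces Pr(A), Pr(B), and Pr(A|B):

```python
from math import comb

# p(x) = Pr(X = x) for X = number of heads in 3 tosses of a fair coin
p = {x: comb(3, x) * 0.5 ** 3 for x in range(4)}  # {0: 0.125, 1: 0.375, 2: 0.375, 3: 0.125}

pr_A = p[2]                 # A: exactly 2 heads
pr_B = p[2] + p[3]          # B: at least 2 heads
pr_A_given_B = pr_A / pr_B  # A is a subset of B, so Pr(A and B) = Pr(A)

print(pr_A, pr_B, pr_A_given_B)  # 0.375 0.5 0.75
```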
The probability distribution of a continuous random variable X is represented by an unbroken curve, called the density curve.
Characteristics:
The area under the curve over an interval represents the probability of X assuming a value in that interval.
The total area under the curve equals 1.
The probability of X assuming any specific value is zero.
Example: Let X be the time spent studying each week by a college student.
Sample space for X: S = {0 ≤ X ≤ 50}.
The density curve is drawn such that the total area equals 1.
When a random variable is repeatedly measured, the long-run behavior of the observed values is summarized by its mean and variance.
Mean (µ): the average of the values of X over a large number of measurements; for a discrete X, µ = E(X) = Σ x · p(x).
Variance (σ²): the average squared deviation of X from its mean, σ² = E[(X − µ)²].
Standard Deviation (σ): The square root of the variance.
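For a discrete random variable these quantities can be computed directly from p(x); a minimal sketch using the coin-toss distribution above:

```python
# Mean, variance, and standard deviation of X = number of heads in 3 fair coin tosses,
# computed directly from the probability mass function p(x).
p = {0: 1/8, 1: 3/8, 2: 3/8, 3: 1/8}

mu = sum(x * px for x, px in p.items())               # E(X) = 1.5
var = sum((x - mu) ** 2 * px for x, px in p.items())  # E[(X - mu)^2] = 0.75
sd = var ** 0.5                                       # about 0.866

print(mu, var, sd)
```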
Factorials: n! = n(n-1)(n-2)...1, with 0! = 1.
Examples:
3! = 6, 6! = 720, 10! = 3,628,800.
Selection (counting rules):
Number of ways to order n distinct objects: n!
Number of ways to choose x objects from n (0 ≤ x ≤ n), ignoring order: the binomial coefficient
n choose x: C(n, x) = n!/(x!(n−x)!)
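A minimal Python sketch of these counting facts, using only the standard-library math module:

```python
from math import comb, factorial

print(factorial(3), factorial(6), factorial(10))  # 6 720 3628800

# Number of ways to choose x = 2 objects out of n = 4, ignoring order:
# C(4, 2) = 4! / (2! * 2!) = 6
print(comb(4, 2))                                     # 6
print(factorial(4) // (factorial(2) * factorial(2)))  # same value, from the formula
```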
Bernoulli Trials
Criteria:
Each trial has 2 possible outcomes (success/failure).
Trials are independent.
Probability of success (p) remains constant.
Examples:
Tossing a coin 25 times.
Rolling a die 12 times (even/odd outcomes).
Testing 1000 blood samples for HIV status.
Let X count the number of successes in n Bernoulli trials.
X is a Binomial Random Variable.
Probability Distribution:
Binomial Distribution denoted as Bin(n, p).
Formula: p(x) = Pr(X = x) = C(n, x) p^x (1−p)^(n−x), where x = 0, 1, ..., n.
Mean: µ = np.
Variance: σ² = np(1-p).
Probability assignments: Blood Types O, A, B, AB:
O: 0.45, A: 0.40, B: 0.10, AB: 0.04.
Consider a random sample of 10 Americans and let X = number with blood type A, so X ~ Bin(10, 0.40).
Calculate the mean and standard deviation of X, and probabilities for specific counts (e.g., exactly 4 with blood type A).
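A hedged sketch of this calculation with scipy.stats.binom (assuming SciPy is available); the values n = 10 and p = 0.40 come from the example above:

```python
from scipy.stats import binom

n, p = 10, 0.40          # sample of 10 Americans, Pr(blood type A) = 0.40

mean = binom.mean(n, p)  # np = 4.0
sd = binom.std(n, p)     # sqrt(n p (1 - p)), about 1.55

pr_exactly_4 = binom.pmf(4, n, p)  # Pr(X = 4), about 0.25

print(mean, sd, pr_exactly_4)
```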
Prevalence: suppose there is a 20% chance that a randomly selected adult suffers from a psychiatric disorder.
Consider a sample of 12 adults and let X = number with the disorder, so X ~ Bin(12, 0.20).
Example questions: a. Probability that between 3 and 6 (inclusive) have a disorder. b. Probability that more than 3 but fewer than 6 have a disorder. c. Probability that at least one has a disorder.
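The sketch below answers (a) through (c) with scipy.stats.binom (SciPy assumed available), reading "between 3 and 6" as inclusive:

```python
from scipy.stats import binom

n, p = 12, 0.20  # 12 adults, Pr(disorder) = 0.20

pr_a = binom.cdf(6, n, p) - binom.cdf(2, n, p)  # (a) Pr(3 <= X <= 6)
pr_b = binom.pmf(4, n, p) + binom.pmf(5, n, p)  # (b) Pr(3 < X < 6) = Pr(X = 4) + Pr(X = 5)
pr_c = 1 - binom.pmf(0, n, p)                   # (c) Pr(X >= 1) = 1 - Pr(X = 0)

print(pr_a, pr_b, pr_c)
```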
A couple plans to have 6 children; for each child, the probability of blue eyes is 0.25 and of brown eyes is 0.75.
Let X = number of blue-eyed children, so X ~ Bin(6, 0.25); calculate the mean and standard deviation of X and probabilities for various outcomes.
For example, the probability of exactly one child with blue eyes, and of at least one child with blue eyes or brown eyes.
Poisson Distribution: used for counting occurrences of events over an interval of time or space.
Let X count occurrences—X is a Poisson Random Variable.
Mean number of occurrences: λ (lambda).
Distribution Formula: p(x) = Pr(X = x) = (e^(-λ) λ^x) / x!, where x = 0, 1, 2, ...
The probability mass function rises to a peak near x = λ and then decreases as the number of occurrences increases.
Used to approximate probabilities for binomial distributions when n is large and p is small (take λ = np).
Mean & Variance: Both are equal to λ.
Example: let X = number of patients arriving at the ETC per day, with mean λ = 4.5 patients per day.
Compute the probability that no patients, at least one patient, and exactly 4 or 5 patients arrive in a day.
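A minimal sketch of these Poisson probabilities with scipy.stats.poisson (SciPy assumed available), using λ = 4.5 from the example:

```python
from scipy.stats import poisson

lam = 4.5  # mean number of patients arriving per day

pr_none = poisson.pmf(0, lam)                          # Pr(X = 0) = e^(-4.5), about 0.011
pr_at_least_one = 1 - pr_none                          # Pr(X >= 1), about 0.989
pr_4_or_5 = poisson.pmf(4, lam) + poisson.pmf(5, lam)  # Pr(X = 4 or X = 5)

print(pr_none, pr_at_least_one, pr_4_or_5)
```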
Normal distribution: most common for continuous variables.
Characteristics:
Many physiological measures (e.g., blood pressure, serum cholesterol) are approximately normally distributed.
Density Curve Formula: f(x) = (1/(σ√(2π))) e^(−(x−µ)²/(2σ²)), where µ = mean and σ = standard deviation.
Denoted as N(µ, σ²).
Standard Normal Distribution: µ = 0 and σ = 1.
Visual representation: bell-shaped curve centered on µ.
About 68% of area lies within one standard deviation (µ ± σ).
About 95% of area lies within two standard deviations (µ ± 2σ).
About 99.7% lies within three standard deviations (µ ± 3σ).
Definition of a standard normal variable: if X ~ N(µ, σ²), then Z = (X − µ)/σ follows the standard normal distribution N(0, 1).
Area under the standard normal curve between points gives probability.
Basic probability relationships: for X ~ N(µ, σ²), Pr(a ≤ X ≤ b) = Pr((a − µ)/σ ≤ Z ≤ (b − µ)/σ), so standard normal areas give probabilities for any normal variable.
These Z-score calculations underlie the comparisons used later in hypothesis testing.
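A short sketch with scipy.stats.norm that verifies the 68-95-99.7 rule and illustrates standardization; the blood-pressure mean and standard deviation (120 and 10) are hypothetical values chosen only for illustration:

```python
from scipy.stats import norm

# Empirical rule: area within 1, 2, and 3 standard deviations of the mean
for k in (1, 2, 3):
    print(k, norm.cdf(k) - norm.cdf(-k))  # about 0.683, 0.954, 0.997

# Standardization with hypothetical numbers: suppose X ~ N(120, 10^2) (systolic blood pressure)
mu, sigma = 120, 10
x = 135
z = (x - mu) / sigma                     # Z = (X - mu) / sigma = 1.5
print(norm.cdf(z))                       # Pr(X <= 135) = Pr(Z <= 1.5), about 0.933
print(norm.cdf(x, loc=mu, scale=sigma))  # same probability without standardizing by hand
```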
Various practical examples demonstrating how to compute probabilities, perform hypothesis testing, and analyze confidence intervals using normal and other distributions.
Distinction between estimation and hypothesis testing.
Estimation methods including point and interval estimates.
Key statistical terms including populations, samples, statistics, and parameters outlined.
Importance of random sampling emphasized.
Walking through the concepts of sampling distributions, expected values, and standard errors.
Steps and methods involved in performing hypothesis tests, including Type I and Type II errors.
Power analysis and its significance discussed.
Illustrative examples reinforcing concepts of hypothesis testing, including confidence interval calculations.
Comparing means through paired and independent samples.
Chi-square test fundamentals and applications in testing.
The application of nonparametric tests in various scenarios, including details on ear infection tests and vaccine evaluation.
Chi-Square Tests
Purpose: Assess relationships between categorical variables by comparing observed frequencies to expected frequencies.
Types:
Chi-Square Test of Independence
Purpose: Determine if there is a significant association between two categorical variables in a contingency table.
Hypothesis:
Null (H0): There is no association between the variables.
Alternative (H1): There is an association.
Assumptions:
Random sampling
Expected frequency should not be less than 5 in more than 20% of cells.
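A minimal sketch of the test of independence using scipy.stats.chi2_contingency; the 2x2 table counts are hypothetical:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 contingency table: exposure group (rows) by outcome (columns)
observed = np.array([[30, 20],
                     [15, 35]])

chi2, p_value, dof, expected = chi2_contingency(observed)
print(chi2, p_value, dof)
print(expected)  # check that expected counts satisfy the "not less than 5" rule
```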
Chi-Square Goodness of Fit Test
Purpose: Determine if the sample data matches a population with a specified distribution (e.g., expected proportions of categories).
Hypothesis:
Null (H0): The observed frequencies match the expected frequencies.
Alternative (H1): The observed frequencies do not match the expected frequencies.
Useful in survey data analysis, experimental research, and quality control.
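A minimal goodness-of-fit sketch with scipy.stats.chisquare, using a hypothetical set of 120 die rolls (the die example mentioned earlier) tested against a fair-die distribution:

```python
from scipy.stats import chisquare

# Hypothetical goodness-of-fit check: are 120 rolls consistent with a fair die?
observed = [25, 18, 22, 17, 20, 18]  # observed counts for faces 1 through 6
expected = [120 / 6] * 6             # 20 expected per face under H0

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(stat, p_value)
```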
McNemar's Test
Purpose: Determine if there are differences on a dichotomous outcome between two related groups (e.g., before-and-after measurements on the same subjects).
Hypothesis:
Null (H0): The proportions are the same (no change).
Alternative (H1): The proportions are different.
Assumptions: Data must be paired (repeated measures) and dichotomous (two possible outcomes).
Applications: Commonly used in case-control studies and pre-post intervention studies to examine changes in a binary outcome.
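A sketch of this paired-proportions comparison using statsmodels (assumed installed), with hypothetical before/after counts:

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

# Hypothetical paired (before/after) binary outcomes for 100 subjects:
# rows = outcome before, columns = outcome after
table = np.array([[40, 10],   # "yes" before: 40 still "yes", 10 switch to "no"
                  [25, 25]])  # "no" before: 25 switch to "yes", 25 still "no"

result = mcnemar(table, exact=True)  # exact binomial test on the discordant pairs
print(result.statistic, result.pvalue)
```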
Wilcoxon Tests
Purpose: Non-parametric alternatives to the T-test for comparing two related or independent samples when a normal distribution cannot be assumed.
Types:
Wilcoxon Signed-Rank Test
Purpose: Compare two related samples or matched observations.
Hypothesis:
Null (H0): The median difference between pairs is zero.
Alternative (H1): The median difference is not zero.
Applications: Used for before-and-after scenarios.
Wilcoxon Rank-Sum Test (Mann-Whitney U Test)
Purpose: Compare two independent samples.
Hypothesis:
Null (H0): The distributions of the two populations are equal.
Alternative (H1): The distributions of the populations are not equal.
Applications: Used when comparing two different groups or populations, especially when sample sizes are small.
Both tests rely on ranking data rather than using raw scores and do not require normality of data distribution.
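A sketch of both tests with scipy.stats; the before/after measurements and the two independent groups are hypothetical data used only for illustration:

```python
from scipy.stats import mannwhitneyu, wilcoxon

# Wilcoxon signed-rank test: hypothetical before/after measurements on the same 8 subjects
before = [125, 130, 118, 140, 135, 128, 122, 132]
after = [120, 126, 119, 133, 130, 127, 118, 125]
stat_sr, p_sr = wilcoxon(before, after)
print(stat_sr, p_sr)

# Wilcoxon rank-sum (Mann-Whitney U) test: hypothetical data from two independent groups
group1 = [12, 15, 14, 10, 13, 16]
group2 = [18, 17, 20, 15, 19, 21]
stat_u, p_u = mannwhitneyu(group1, group2)
print(stat_u, p_u)
```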
Two-Sample Tests
Purpose: Compare means from two independent samples to determine if they come from populations with the same mean.
Types:
Independent Samples T-Test: Used when the sample sizes are small (n < 30) or when the population standard deviations are unknown. Assumes normal distribution.
Hypothesis: Null (H0): μ1 = μ2 vs. Alternative (H1): μ1 ≠ μ2.
Mann-Whitney U Test: Non-parametric alternative when data do not meet the assumptions of normality. Compares the medians of two groups.
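A minimal sketch of the independent samples T-test with scipy.stats.ttest_ind and hypothetical measurements; the non-parametric alternative (Mann-Whitney U) is sketched in the Wilcoxon section above:

```python
from scipy.stats import ttest_ind

# Hypothetical cholesterol measurements from two independent groups
group1 = [200, 215, 190, 205, 220, 198, 210]
group2 = [185, 192, 180, 200, 178, 188, 195]

# H0: mu1 = mu2 vs. H1: mu1 != mu2, assuming equal population variances
t_stat, p_value = ttest_ind(group1, group2, equal_var=True)
print(t_stat, p_value)
```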
For two-sample tests, ensure random sampling and check for equal variances when using T-tests.
For Chi-Square tests, ensure that expected frequencies are not less than 5 in more than 20% of the cells of the contingency table, to maintain validity.