Discrete Distributions Study Guide
Variance (σ²): Measures how spread out the values of a numerical random variable are from the mean.
Standard Deviation (σ): The square root of the variance, expressed in the same units as the data.
Var(X) = E[(X - μ)²] = ∑ (x - μ)² P(X = x)
where μ = E(X) (expected value of X).
Example (fair six-sided die D):
Mean: E(D) = 3.5
Variance: Var(D) ≈ 2.92
Standard Deviation: σ(D) ≈ 1.71
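The die figures can be reproduced from the definition of variance. The following is a minimal sketch, assuming the PMF is stored as a Python dict mapping each value to its probability; the helper names mean and variance are illustrative and not part of the guide.

```python
from fractions import Fraction
from math import sqrt

def mean(pmf):
    """E(X): sum of x * P(X = x) over all values x."""
    return sum(x * p for x, p in pmf.items())

def variance(pmf):
    """Var(X): sum of (x - mu)^2 * P(X = x), with mu = E(X)."""
    mu = mean(pmf)
    return sum((x - mu) ** 2 * p for x, p in pmf.items())

# Fair six-sided die: each face 1..6 has probability 1/6.
die = {face: Fraction(1, 6) for face in range(1, 7)}

mu = mean(die)        # exact value 7/2
var = variance(die)   # exact value 35/12
sd = sqrt(var)        # standard deviation as a float

print(float(mu), round(float(var), 2), round(sd, 2))  # prints: 3.5 2.92 1.71
```

Exact fractions are used so that the rounding to 2.92 and 1.71 happens only at the final step.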
If extreme values are more likely, variance increases.
If extreme values are less likely, variance decreases.
For random variables X and Y and a constant c:
E(X + c) = E(X) + c
E(cX) = cE(X)
Var(X + c) = Var(X)
Var(cX) = c² Var(X)
Var(X + Y) = Var(X) + Var(Y) (when X and Y are independent; it can fail for dependent variables)
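These rules can be checked by exact enumeration. The sketch below assumes a fair six-sided die as the test distribution and c = 3 as an arbitrary constant; the helpers transform and sum_independent are hypothetical names introduced only for this illustration. The last check uses Y = X, a dependent pair, to show why the independence condition matters.

```python
from fractions import Fraction
from itertools import product

def mean(pmf):
    return sum(x * p for x, p in pmf.items())

def variance(pmf):
    mu = mean(pmf)
    return sum((x - mu) ** 2 * p for x, p in pmf.items())

def transform(pmf, f):
    """PMF of f(X): outcomes that map to the same value are merged."""
    out = {}
    for x, p in pmf.items():
        out[f(x)] = out.get(f(x), 0) + p
    return out

def sum_independent(pmf_x, pmf_y):
    """PMF of X + Y for independent X, Y: joint probability is the product."""
    out = {}
    for (x, px), (y, py) in product(pmf_x.items(), pmf_y.items()):
        out[x + y] = out.get(x + y, 0) + px * py
    return out

die = {face: Fraction(1, 6) for face in range(1, 7)}
c = 3  # arbitrary constant chosen for the check

shifted = transform(die, lambda x: x + c)   # X + c
scaled = transform(die, lambda x: c * x)    # cX
two_dice = sum_independent(die, die)        # X + Y, independent
doubled = transform(die, lambda x: x + x)   # X + X, fully dependent

print("E(X + c) = E(X) + c:", mean(shifted) == mean(die) + c)
print("E(cX) = cE(X):", mean(scaled) == c * mean(die))
print("Var(X + c) = Var(X):", variance(shifted) == variance(die))
print("Var(cX) = c^2 Var(X):", variance(scaled) == c**2 * variance(die))
print("independent sum adds variances:", variance(two_dice) == 2 * variance(die))
print("dependent sum adds variances:", variance(doubled) == 2 * variance(die))
```

The first five checks print True. The last prints False, because X + X = 2X has variance 4 Var(X) rather than 2 Var(X).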
A Bernoulli trial is a single experiment with two outcomes (success/failure).
Probability mass function (PMF):
P(X = 1) = p, P(X = 0) = 1 - p
Mean: E(X) = p
Variance: Var(X) = p(1 - p)
Standard Deviation: σ(X) = √(p(1 - p))
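The Bernoulli mean and variance follow from the general definitions applied to the two-point PMF. A short derivation, using only Var(X) = ∑ (x - μ)² P(X = x) with μ = p:

```latex
\begin{align*}
E(X) &= 1 \cdot p + 0 \cdot (1 - p) = p \\
\mathrm{Var}(X) &= (1 - p)^2 \, p + (0 - p)^2 \, (1 - p) \\
&= p(1 - p)\,[(1 - p) + p] \\
&= p(1 - p)
\end{align*}
```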
The binomial distribution models the number of successes in n independent Bernoulli trials, each with success probability p.
PMF:
P(X = k) = (n choose k) p^k (1 - p)^(n - k)
where (n choose k) = n! / (k!(n - k)!)
Mean: E(X) = np
Variance: Var(X) = np(1 - p)
Standard Deviation: σ(X) = √(np(1 - p))
Example (n = 20, p = 2/3):
E(X) = 20 * (2/3) ≈ 13.33
Var(X) = 20 * (2/3) * (1/3) ≈ 4.44
σ(X) = √(4.44) ≈ 2.11
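The closed-form results for n = 20 and p = 2/3 can be cross-checked against the PMF itself. This is a sketch that assumes exact rational arithmetic via Python's fractions module; binomial_pmf is an illustrative helper name.

```python
from fractions import Fraction
from math import comb, sqrt

def binomial_pmf(k, n, p):
    """P(X = k) = (n choose k) p^k (1 - p)^(n - k)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 20, Fraction(2, 3)
pmf = {k: binomial_pmf(k, n, p) for k in range(n + 1)}

total = sum(pmf.values())                             # probabilities sum to 1
mean = sum(k * q for k, q in pmf.items())             # should equal n * p
var = sum((k - mean) ** 2 * q for k, q in pmf.items())  # should equal n * p * (1 - p)

print(total == 1, mean == n * p, var == n * p * (1 - p))
print(round(float(mean), 2), round(float(var), 2), round(sqrt(var), 2))
```

The second line of output reproduces the rounded values 13.33, 4.44 and 2.11.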
The Zipf distribution describes data where a few values occur very frequently and many values occur very rarely.
PMF:
P(X = k) ∝ (k + d)^(-α)
where d is an offset and α is the exponent.
Common in natural language processing, web analysis, wealth distributions.
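Example (word counts by frequency rank in a text corpus):
Rank 1 ("the"): 6.2 million occurrences
Rank 2 ("of"): 2.9 million occurrences
Rank 3 ("and"): 2.67 million occurrences
The counts follow y = c (r + 1)^(-α), where r is the rank, y is the number of occurrences, and α ≈ 1.08.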
Fat Head: A few values dominate (e.g., the top 175 words account for 50% of tokens).
Long Tail: A very large number of distinct values are rare (e.g., words occurring only once make up 0.5% of all tokens).
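The fat head and long tail can be seen numerically by normalizing the power-law weights over a finite set of ranks. The sketch below makes several assumptions that are not from the guide: a hypothetical vocabulary of 50,000 ranks, ranks starting at k = 1, and an offset d = 1. The exponent 1.08 is the value quoted in the word-count example. The resulting shares depend on these choices and are not expected to match the corpus figures above.

```python
def zipf_pmf(n_ranks, alpha, d):
    """Normalized P(X = k) proportional to (k + d)^(-alpha), for k = 1..n_ranks."""
    weights = [(k + d) ** (-alpha) for k in range(1, n_ranks + 1)]
    total = sum(weights)
    return [w / total for w in weights]

N_RANKS = 50_000  # hypothetical vocabulary size (assumption)
pmf = zipf_pmf(N_RANKS, alpha=1.08, d=1)

head_share = sum(pmf[:100])           # mass on the 100 most frequent ranks
tail_share = sum(pmf[N_RANKS // 2:])  # mass on the rarer half of all ranks

print(f"top 100 ranks: {head_share:.1%}")
print(f"bottom half of ranks: {tail_share:.1%}")
```

Under these assumptions the 100 top ranks carry far more probability mass than the 25,000 rarest ranks combined, which is the fat-head pattern described above.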
Issues for AI:
Easy to capture common cases, but difficult to cover rare cases.
Requires large datasets for accurate modeling.
Variance and standard deviation measure the spread of a distribution.
Bernoulli distribution models single-trial success/failure.
Binomial distribution models multiple independent trials.
Zipf distribution explains power-law relationships in data.
AI applications face challenges due to rare event distributions.
True or False: The variance of a binomial distribution is always less than its mean.
If a fair die is rolled 30 times, what is the expected number of times it lands on 6?
In a language corpus, the most common word appears 5 million times. The second most common appears 2.5 million times. Estimate how many times the 10th most common word appears using Zipf's law.
🚀 Use this guide to master Discrete Distributions for problem sets and exams!