Module 3 Review of Probabilities and Estimating Model Parameters
Review of Probabilities
What is a Probability Distribution?
- A mathematical object used to model an event.
- Assigns a number (a probability) to each possible outcome.
- Example: Coin flip.
- Outcome space: {heads, tails}
- A random variable is a variable that takes on a value according to some probability distribution.
- Probabilities must be:
- Non-negative
- Sum to 1
Notation for Coin Flips
- Random variable c denotes a coin flip.
- Event space: c ∈ {heads, tails, other}
- Event space: c ∈ {heads, tails}
- The above represents an unfair coin.
Example Probability Distributions
- Dice roll
- Random variable d.
- Outcome space: {1 .. 6}
- Parameters θ
- A fair die would have θ = (1/6, 1/6, 1/6, 1/6, 1/6, 1/6).
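As a small illustration of the two requirements on probabilities (non-negative, summing to 1), the sketch below checks a few candidate distributions stored as Python dictionaries. The helper name `is_valid_distribution` and the 0.6 / 0.4 coin are illustrative assumptions, not values from the slides.

```python
import math

def is_valid_distribution(probs, tol=1e-9):
    """Return True if every probability is non-negative and they sum to 1."""
    values = list(probs.values())
    return all(p >= 0 for p in values) and math.isclose(sum(values), 1.0, abs_tol=tol)

# Fair six-sided die: every face has probability 1/6.
fair_die = {face: 1 / 6 for face in range(1, 7)}

# A hypothetical unfair coin (the 0.6 / 0.4 split is an assumption for illustration).
unfair_coin = {"heads": 0.6, "tails": 0.4}

# Not a distribution: the values sum to 1.1.
not_a_distribution = {"heads": 0.6, "tails": 0.5}

for name, table in [("fair_die", fair_die),
                    ("unfair_coin", unfair_coin),
                    ("not_a_distribution", not_a_distribution)]:
    print(name, is_valid_distribution(table))
```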
Simple Example: Picking a Letter
- Roll a fair die to pick a word from the sentence: "Sam laughs last and laughs loudest"
- Define random variables:
- f = first letter of the chosen word
- Outcome space: {S, l, a}
- s = second letter of the chosen word
- Outcome space {a, n, o}
- Examples:
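One way to work out concrete values for p(f) and p(s) is to count letters directly. The sketch below assumes each of the six words is picked with probability 1/6 (a fair die) and uses exact fractions; the helper name `distribution` is an illustrative choice.

```python
from collections import Counter
from fractions import Fraction

SENTENCE = "Sam laughs last and laughs loudest"
words = SENTENCE.split()  # six words, one per face of the die

def distribution(values):
    """Turn a list of equally likely values into a probability table."""
    counts = Counter(values)
    return {v: Fraction(c, len(values)) for v, c in counts.items()}

p_f = distribution([w[0] for w in words])  # first letter of the chosen word
p_s = distribution([w[1] for w in words])  # second letter of the chosen word

for name, table in (("p(f)", p_f), ("p(s)", p_s)):
    print(name, {k: str(v) for k, v in table.items()})
```

Four of the six words start with "l", so p(f = l) = 4/6 = 2/3, and four of the six have "a" as their second letter, so p(s = a) = 2/3.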
Relationships Between Distributions
- Joint probability distribution over both random variables at the same time.
- Outcome space: {(S, a), (S, n), (S, o), … (l, a), (l, n), (l, o)}
- Conditional probability distribution of s given a particular value of f.
- Outcome space is the same as that of p(s).
Conditional Probability
- p(s = a | f = l) is the fraction of the probability of the event f = l that is also shared with the event s = a: p(s = a | f = l) = p(f = l, s = a) / p(f = l).
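A minimal sketch of the joint and conditional distributions for the sentence example, assuming a fair die so that each word has probability 1/6. The function names `marginal_f` and `conditional_s_given_f` are illustrative.

```python
from collections import Counter
from fractions import Fraction

words = "Sam laughs last and laughs loudest".split()

# Joint distribution p(f, s): each word is chosen with probability 1/6.
pair_counts = Counter((w[0], w[1]) for w in words)
p_joint = {pair: Fraction(c, len(words)) for pair, c in pair_counts.items()}

def marginal_f(f):
    """p(f): sum the joint over every value of s."""
    return sum(p for (f_val, _), p in p_joint.items() if f_val == f)

def conditional_s_given_f(s, f):
    """p(s | f) = p(f, s) / p(f)."""
    return p_joint.get((f, s), Fraction(0)) / marginal_f(f)

print(conditional_s_given_f("a", "l"))  # 3/4
```

Three of the four words that start with "l" have "a" as their second letter, which is where the 3/4 comes from.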
Bayes Rule
Example: Using the sentence "Sam laughs last and laughs loudest"
- Joint
- Conditional
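Bayes rule, p(f | s) = p(s | f) p(f) / p(s), can be checked numerically on the same sentence. This sketch assumes a fair die and compares the Bayes-rule value with the value read directly off the joint distribution; the variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction

words = "Sam laughs last and laughs loudest".split()
n = len(words)

# Joint and marginal tables built by counting (each word has probability 1/6).
p_joint = {pair: Fraction(c, n)
           for pair, c in Counter((w[0], w[1]) for w in words).items()}
p_f = {f: Fraction(c, n) for f, c in Counter(w[0] for w in words).items()}
p_s = {s: Fraction(c, n) for s, c in Counter(w[1] for w in words).items()}

f, s = "l", "a"
p_s_given_f = p_joint[(f, s)] / p_f[f]   # conditional p(s | f) from the joint
bayes = p_s_given_f * p_f[f] / p_s[s]    # Bayes rule: p(f | s)
direct = p_joint[(f, s)] / p_s[s]        # p(f | s) straight from the joint

print("joint p(f=l, s=a):", p_joint[(f, s)])
print("conditional p(s=a | f=l):", p_s_given_f)
print("Bayes p(f=l | s=a):", bayes, "direct:", direct)
```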
Estimating Model Parameters
Probabilistic Models in Practice
- Usual scenario:
- Gather some data.
- Define a probabilistic model.
- Use the data to estimate the parameters of the model.
- Example:
- We flip a coin n times and collect the observations: [heads, tails, tails, …]
- Model: each flip is independent, with probability p(heads) = h.
- The goal is to determine how to select h.
Terminology
- The data we collect are called a sample.
- The procedure we use to choose model parameters is called an estimator.
- The data likelihood is the probability of the data under the model's distribution.
- E.g.
- Data: [heads, heads, tails]
- Model: p(heads) = h = 0.7
- Likelihood = 0.7 * 0.7 * 0.3 = 0.147
- Usually, we look at log-likelihood (the natural log of the likelihood).
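The likelihood and log-likelihood of a sequence of independent flips can be computed as follows. This is a sketch under the independence assumption stated in the model; the function names are illustrative.

```python
import math

def likelihood(data, h):
    """Probability of an independent sequence of flips when p(heads) = h."""
    result = 1.0
    for flip in data:
        result *= h if flip == "heads" else 1.0 - h
    return result

def log_likelihood(data, h):
    """Natural log of the likelihood, computed as a sum of per-flip logs."""
    return sum(math.log(h if flip == "heads" else 1.0 - h) for flip in data)

data = ["heads", "heads", "tails"]
print(likelihood(data, 0.7))      # product of 0.7, 0.7 and 0.3
print(log_likelihood(data, 0.7))  # ln of the same quantity
```

Summing logs instead of multiplying probabilities avoids numerical underflow when the sample is large, which is one practical reason to prefer the log-likelihood.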
Data Likelihood
- Question: which value of h = P(heads) gives the highest likelihood?
- Choosing h to maximize the likelihood is called the maximum likelihood estimator.
- Examples:
- Data: D = [heads, heads, tails]
- P(heads) = 0.5 ⇨ likelihood(D) = 0.5 * 0.5 * 0.5 = 0.125
- P(heads) = 0.8 ⇨ likelihood(D) = 0.8 * 0.8 * 0.2 = 0.128
- P(heads) = 0.6 ⇨ likelihood(D) = 0.6 * 0.6 * 0.4 = 0.144
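Before using calculus, the best value can be approximated by brute force: evaluate the likelihood on a grid of candidate values and keep the largest. The grid resolution of 0.001 is an arbitrary choice for illustration.

```python
def likelihood(num_heads, num_tails, h):
    """Likelihood of independent flips with p(heads) = h."""
    return h ** num_heads * (1.0 - h) ** num_tails

# Data: [heads, heads, tails] -> 2 heads, 1 tail.
candidates = [i / 1000 for i in range(1001)]
best_h = max(candidates, key=lambda h: likelihood(2, 1, h))

print(best_h, likelihood(2, 1, best_h))
```

The best grid point lands next to 2/3, the fraction of heads in the data, with a likelihood slightly above the 0.144 obtained for P(heads) = 0.6.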
How to Find the Max of a Function
- Looking for the very top of the curve.
- For a smooth curve, the top is always flat.
- More formally, the slope is zero: derivative of the function is zero.
Review of Calculus
- Derivative notation:
- Partial derivative of f(x) with respect to x: ∂f(x)/∂x
- Derivative of a logarithm
- Partial derivative of ln(x) with respect to x equals 1/x
- Chain rule of calculus will also be used.
Maximum Likelihood for Our Example
- Data: [heads, heads, tails]
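- Log-likelihood: ln L(h) = 2 ln(h) + ln(1 − h)
- Taking the derivative: ∂/∂h ln L(h) = 2/h − 1/(1 − h)
- Setting the derivative to zero: 2/h = 1/(1 − h), so 2(1 − h) = h and h = 2/3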
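The closed-form result can be sanity-checked numerically. The sketch below assumes the log-likelihood num_heads * ln(h) + num_tails * ln(1 − h) for independent flips and confirms that its derivative vanishes at h = num_heads / (num_heads + num_tails); the function names are illustrative.

```python
def d_log_likelihood(num_heads, num_tails, h):
    """Derivative with respect to h of num_heads*ln(h) + num_tails*ln(1 - h)."""
    return num_heads / h - num_tails / (1.0 - h)

def mle_heads(num_heads, num_tails):
    """Closed-form maximum likelihood estimate of p(heads)."""
    return num_heads / (num_heads + num_tails)

# Data: [heads, heads, tails]
h_hat = mle_heads(2, 1)
print(h_hat)
print(d_log_likelihood(2, 1, h_hat))  # approximately zero at the maximum
print(d_log_likelihood(2, 1, 0.5))    # positive: likelihood still rising
print(d_log_likelihood(2, 1, 0.8))    # negative: likelihood already falling
```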
Multinomial Distribution
- Distribution over some discrete outcomes.
- E.g. coin flip; dice; letters of the alphabet, words in the dictionary, etc.
- Parameters:
- Probability of outcome i: π_i
- Where i ranges from 1 to k.
- Remember they are probabilities!
Maximum Likelihood for Multinomials
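Data: [o1, o2, o3, …, on]
Log-likelihood: ln L(π) = Σ_i n_i ln(π_i)
- Where n_i counts how many times we see outcome i in the data.
Optimization problem: maximize Σ_i n_i ln(π_i) over π.
The larger π_i, the larger the log-likelihood. So, without a constraint, we can make the log-likelihood arbitrarily large by setting π_i arbitrarily high.
Optimization problem: maximize Σ_i n_i ln(π_i) subject to Σ_i π_i = 1.
Make the constraint into a game by introducing a Lagrange multiplier λ: Λ(π, λ) = Σ_i n_i ln(π_i) − λ (Σ_i π_i − 1)
Setting derivative to zero: ∂Λ/∂π_i = n_i/π_i − λ = 0, so π_i = n_i/λ.
What about λ? How can we compute it? The constraint Σ_i π_i = 1 gives Σ_i n_i/λ = 1, so λ = Σ_i n_i = n and π_i = n_i/n.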
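In code, the maximum likelihood estimate for a multinomial amounts to normalizing the counts: each outcome's probability is its count divided by the sample size. The sketch below uses a made-up list of die rolls (an assumption for illustration) and compares the log-likelihood of the estimate with that of a uniform distribution over the observed outcomes.

```python
import math
from collections import Counter

def multinomial_mle(data):
    """Estimate each outcome's probability as count / sample size."""
    counts = Counter(data)
    n = len(data)
    return {outcome: c / n for outcome, c in counts.items()}

def log_likelihood(data, probs):
    """Sum of ln(probability) over the observations."""
    return sum(math.log(probs[o]) for o in data)

rolls = [1, 3, 3, 6, 3, 1, 2, 6]  # hypothetical observations
estimate = multinomial_mle(rolls)
uniform = {outcome: 1 / len(estimate) for outcome in estimate}

print(estimate)
print(log_likelihood(rolls, estimate), log_likelihood(rolls, uniform))
```

Outcomes that never appear in the data (4 and 5 here) get no entry, which corresponds to an estimated probability of zero.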
Note: It Is Not Always Easy to Compute Max Likelihood
- Sometimes, we do not have a closed-form solution for maximum likelihood.
- We do not always observe everything we would like to.
Complicated Example
- Roll a die to pick a word from: "Sam laughs last and laughs loudest"
- Data:
- f = first letter of the chosen word
- s = second letter of the chosen word
- F = [l, l, a, …]; S = [a, a, n, …]
- Model:
- Probability distribution over [1, 2, 3, 4, 5, 6].
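To see why this case is harder, note that the die face itself is never observed, only the letter pair. Under the assumption that face k selects the k-th word, the probability of a pair (f, s) is the sum of the die probabilities of every word consistent with it. The sketch below evaluates the log-likelihood for one candidate θ, using only the first three observations shown in the data above; the function names and the uniform θ are illustrative.

```python
import math

words = "Sam laughs last and laughs loudest".split()

def pair_probability(f, s, theta):
    """p(f, s): sum theta over every die face whose word matches (f, s)."""
    return sum(t for word, t in zip(words, theta)
               if word[0] == f and word[1] == s)

def log_likelihood(F, S, theta):
    """Log-likelihood of the observed letter pairs under die parameters theta."""
    return sum(math.log(pair_probability(f, s, theta)) for f, s in zip(F, S))

theta = [1 / 6] * 6    # candidate parameters: a fair die
F = ["l", "l", "a"]    # first three observed first letters
S = ["a", "a", "n"]    # first three observed second letters

print(pair_probability("l", "a", theta))
print(log_likelihood(F, S, theta))
```

Because the pair (l, a) is produced by three different faces, the data can only constrain the sum of those three parameters, not each one individually.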
Problem with Small Samples
- Suppose we flip a coin just once.
- Data: [heads]
- Max likelihood estimate: p(heads) = 1; p(tails) = 0.
- Is this a good model of the world?
Add-1 Smoothing
- One way to overcome this is to add 1, ½, or some other constant to each of our counts.
- Data = [heads]; counts = {heads: 2, tails: 1}, so p(heads) = 2/3 and p(tails) = 1/3.
- This turns out to be the same as having a prior belief about what the coin probability is.
- A prior is a probability distribution over the model parameters.
- Priors and other forms of regularization are very important for most models.
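A minimal sketch of add-k smoothing: start every outcome's count at k instead of 0, then normalize as usual. The function name `smoothed_estimate` and its parameter `k` are illustrative; k = 1 gives add-1 smoothing and k = 0 recovers the plain maximum likelihood estimate.

```python
def smoothed_estimate(data, outcomes, k=1.0):
    """Add k to every outcome's count before normalizing."""
    counts = {o: k for o in outcomes}
    for x in data:
        counts[x] += 1
    total = sum(counts.values())
    return {o: c / total for o, c in counts.items()}

data = ["heads"]
outcomes = ["heads", "tails"]

print(smoothed_estimate(data, outcomes, k=0))    # maximum likelihood
print(smoothed_estimate(data, outcomes, k=1))    # add-1 smoothing
print(smoothed_estimate(data, outcomes, k=0.5))  # add-1/2 smoothing
```

With a single observed flip, add-1 smoothing moves the estimate from p(heads) = 1 to p(heads) = 2/3, so tails is no longer treated as impossible.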