Module 3 Review of Probabilities and Estimating Model Parameters

A probability distribution is a mathematical object used to model an event.
It assigns a number (a probability) to each possible outcome.
Example: Coin flip.
- Outcome space: {heads, tails}
- $P(heads) = \frac{1}{2}$
- $P(tails) = \frac{1}{2}$
A random variable is a variable that takes on a value according to some probability distribution.
Probabilities must:
- Be non-negative
- Sum to 1
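
The two requirements above can be checked mechanically. The following is a minimal sketch; the helper name `is_valid_distribution` and the example dictionaries are hypothetical and not part of the module.

```python
import math

def is_valid_distribution(probs, tol=1e-9):
    """Return True if all probabilities are non-negative and sum to 1.

    `probs` maps each outcome to its probability (hypothetical helper).
    """
    values = list(probs.values())
    non_negative = all(p >= 0 for p in values)
    sums_to_one = math.isclose(sum(values), 1.0, abs_tol=tol)
    return non_negative and sums_to_one

fair_coin = {"heads": 0.5, "tails": 0.5}
broken = {"heads": 0.7, "tails": 0.4}  # sums to 1.1, so not a distribution

print(is_valid_distribution(fair_coin))
print(is_valid_distribution(broken))
```

The tolerance is there because floating-point sums are rarely exactly 1.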

Random variable c denotes a coin flip.
Outcome space: c ∈ {heads, tails}
- $P(c = heads) = 0.6$
- $P(c = tails) = 0.4$
Outcome space: c ∈ {heads, tails}
- $P(c = heads) = h$
- $P(c = tails) = 1 - h$
The above represents a coin that may be unfair; it is fair only when h = 0.5.

Roll a fair die to pick a word from the six-word sentence: "Sam laughs last and laughs loudest"
Define random variables:
- f = first letter of the chosen word
  - Outcome space: {S, l, a}
- s = second letter of the chosen word
  - Outcome space: {a, n, o}
Examples:
- $P(f = l) = \frac{4}{6}$
- $P(s = a) = \frac{4}{6}$
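
The two example probabilities can be reproduced by counting letters. This is a small sketch that assumes each of the six words is equally likely; the helper name `marginal` is hypothetical.

```python
from collections import Counter
from fractions import Fraction

sentence = "Sam laughs last and laughs loudest"
words = sentence.split()  # six words, one per face of the die

def marginal(letters):
    """Turn a list of observed letters into exact probabilities."""
    counts = Counter(letters)
    total = len(letters)
    return {letter: Fraction(c, total) for letter, c in counts.items()}

p_f = marginal([w[0] for w in words])  # distribution of the first letter
p_s = marginal([w[1] for w in words])  # distribution of the second letter

# Fraction reduces automatically, so 4/6 is displayed as 2/3.
print(p_f["l"], p_s["a"])
```

Using `Fraction` keeps the arithmetic exact, which makes it easy to compare against the hand-computed values.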

Joint probability $p(f, s)$: the distribution over both random variables at the same time.
- Outcome space: {(S, a), (S, n), (S, o), … (l, a), (l, n), (l, o)}
Conditional probability $p(s | f)$: the distribution of s given a particular value of f.
- Outcome space is the same as that of p(s).

$P(s = a | f = l)$
- Is the fraction of the probability of the event f = l that is shared with the event s = a.
- $P(s = a | f = l) = \frac{P(s = a, f = l)}{P(f = l)}$

Bayes' rule relates the two conditional distributions:
$P(c | d) = \frac{P(d | c) * P(c)}{P(d)}$
$P(s | f) = \frac{P(f | s) * P(s)}{P(f)}$
Example: Using the sentence "Sam laughs last and laughs loudest"
- Joint $P(f = l, s = a) = \frac{3}{6}$
- Conditional $P(s = a | f = l) = \frac{3}{4}$
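
As a check on the joint and conditional values above, the sketch below builds the joint table from the sentence and computes $P(s = a | f = l)$ two ways: directly from the definition and through Bayes' rule. The helper names are hypothetical.

```python
from collections import Counter
from fractions import Fraction

# Each word of the sentence is one equally likely outcome of the die roll.
words = "Sam laughs last and laughs loudest".split()
pairs = [(w[0], w[1]) for w in words]  # (first letter, second letter)
n = len(pairs)

# Joint distribution p(f, s) and the two marginals p(f) and p(s).
joint = {pair: Fraction(c, n) for pair, c in Counter(pairs).items()}
p_f = {f: Fraction(c, n) for f, c in Counter(f for f, _ in pairs).items()}
p_s = {s: Fraction(c, n) for s, c in Counter(s for _, s in pairs).items()}

def cond_s_given_f(s, f):
    """P(s | f) = P(f, s) / P(f)."""
    return joint.get((f, s), Fraction(0)) / p_f[f]

def cond_f_given_s(f, s):
    """P(f | s) = P(f, s) / P(s)."""
    return joint.get((f, s), Fraction(0)) / p_s[s]

direct = cond_s_given_f("a", "l")
via_bayes = cond_f_given_s("l", "a") * p_s["a"] / p_f["l"]

# Fractions are shown reduced, so 3/6 appears as 1/2.
print(joint[("l", "a")], direct, via_bayes)
```

Both routes give the same conditional probability, which is what Bayes' rule guarantees.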

Usual scenario:
1. Gather some data.
2. Define a probabilistic model.
3. Use the data to estimate the parameters of the model.
Example:
- We flip a coin n times and collect the observations: [heads, tails, tails, …]
- Model: each flip is independent, with probability p(heads) = h.
- The goal is to determine how to select h.

The data we collect are called a sample.
The procedure we use to choose model parameters is called an estimator.
The data likelihood is the probability of the data under the model's distribution.
- E.g.
  - Data: [heads, heads, tails]
  - Model: p(heads) = h = 0.7
  - Likelihood = 0.7 * 0.7 * 0.3
Usually, we look at log-likelihood (the natural log of the likelihood).
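
A minimal sketch of both quantities for the independent coin-flip model follows. The function names are hypothetical. One practical reason to prefer the log form is that a product of many small probabilities can underflow to zero, while a sum of logs stays well behaved.

```python
import math

def likelihood(data, h):
    """Probability of the observed flips when P(heads) = h."""
    result = 1.0
    for flip in data:
        result *= h if flip == "heads" else 1.0 - h
    return result

def log_likelihood(data, h):
    """Natural log of the likelihood, computed as a sum of logs."""
    return sum(math.log(h if flip == "heads" else 1.0 - h) for flip in data)

data = ["heads", "heads", "tails"]
print(likelihood(data, 0.7))      # 0.7 * 0.7 * 0.3
print(log_likelihood(data, 0.7))  # ln of the value above
```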

Question: what value of h = P(heads) gives the highest likelihood?
Choosing h this way is called the maximum likelihood estimator.
Examples:
- Data: D = [heads, heads, tails]
  - P(heads) = 0.5 ⇨ likelihood(D) = 0.5 * 0.5 * 0.5 = 0.125
  - P(heads) = 0.8 ⇨ likelihood(D) = 0.8 * 0.8 * 0.2 = 0.128
  - P(heads) = 0.6 ⇨ likelihood(D) = 0.6 * 0.6 * 0.4 = 0.144
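
The comparison above can be extended to a brute-force search over many candidate values. This is only a sketch: the grid of 99 candidates is an arbitrary choice, and the function name `likelihood` is hypothetical.

```python
def likelihood(data, h):
    """Probability of the observed flips when P(heads) = h."""
    result = 1.0
    for flip in data:
        result *= h if flip == "heads" else 1.0 - h
    return result

data = ["heads", "heads", "tails"]

# Candidate values 0.01, 0.02, ..., 0.99 (arbitrary grid resolution).
grid = [i / 100 for i in range(1, 100)]
best_h = max(grid, key=lambda h: likelihood(data, h))

print(best_h, likelihood(data, best_h))
```

On this grid the best candidate is 0.67, slightly better than any of the three values tried by hand. A grid search only approximates the maximizer; calculus gives it exactly.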

Derivative notation:
- Partial derivative of f(x) with respect to x: $\frac{∂ f(x)}{∂x}$
Derivative of a logarithm:
- Partial derivative of ln(x) with respect to x equals 1/x
  - $\frac{∂ ln(x)}{∂x} = \frac{1}{x}$
The chain rule of calculus will also be used.

Data: [heads, heads, tails]
$\mathcal{L} = \text{Log-likelihood(Data)}$
$\mathcal{L} = ln(h * h * (1 - h))$
$\mathcal{L} = 2 ln(h) + ln(1 - h)$
Taking the derivative:
- $\frac{∂\mathcal{L}}{∂h} = 2 * \frac{1}{h} + \frac{(-1)}{(1-h)}$
Setting the derivative to zero:
- $\frac{2}{h} - \frac{1}{(1-h)} = 0$
- $\frac{2}{h} = \frac{1}{(1-h)}$
- $2 - 2h = h$
- $2 = 3h ⇨ h = \frac{2}{3}$
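
The result can be sanity-checked numerically. The sketch below evaluates the derivative at h = 2/3, compares it with a finite-difference approximation, and confirms that nearby values of h give a lower log-likelihood. Function names and the step size are hypothetical choices.

```python
import math

def log_likelihood(h):
    """Log-likelihood of [heads, heads, tails] when P(heads) = h."""
    return 2 * math.log(h) + math.log(1 - h)

def derivative(h):
    """Analytic derivative: 2/h - 1/(1 - h)."""
    return 2 / h - 1 / (1 - h)

def finite_difference(f, x, eps=1e-6):
    """Central-difference approximation of df/dx (step size is arbitrary)."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

h_star = 2 / 3
print(derivative(h_star))                         # approximately zero
print(finite_difference(log_likelihood, h_star))  # approximately zero
print(log_likelihood(h_star) > log_likelihood(0.6))
print(log_likelihood(h_star) > log_likelihood(0.7))
```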

A categorical distribution is a distribution over a fixed set of k discrete outcomes.
- E.g. coin flip, die roll, letters of the alphabet, words in the dictionary, etc.
Parameters:
- Probability of outcome i: $p(X=i) = π_i$
  - Where i ranges from 1 to k.
- Remember they are probabilities!
  - $π_i ≥ 0$
  - $Σ_i π_i = 1$

Data: $[o_1, o_2, o_3, …, o_n]$
$\text{Log-Likelihood} = ln(π_{o_1}) + ln(π_{o_2}) + ln(π_{o_3}) + … + ln(π_{o_n})$
$\text{Log-Likelihood} = c_1 ln(π_1) + c_2 ln(π_2) + … + c_k ln(π_k)$
- Where $c_i$ counts how many times we see i in the data.
$\mathcal{L} = Σ_i c_i ln(π_i)$
Optimization problem:
- $\max_π Σ_i c_i ln(π_i)$
Without a constraint, the larger each $π_i$, the larger $\mathcal{L}$. So we could make $\mathcal{L}$ arbitrarily large by setting the $π_i$ arbitrarily high.
Optimization problem:
- $\max_π Σ_i c_i ln(π_i) \text{ such that } Σ_i π_i = 1$
Make the constraint into a game:
- $\max_π \min_λ Σ_i c_i ln(π_i) + λ (Σ_i π_i - 1)$
$\mathcal{L}(π, λ) = Σ_i c_i ln(π_i) + λ (Σ_i π_i - 1)$
$\frac{∂\mathcal{L}}{∂ π_i} = \frac{c_i}{π_i} + λ$
Setting the derivative to zero:
- $\frac{c_i}{π_i} + λ = 0$
- $π_i = \frac{c_i}{-λ}$
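What about λ? How can we compute it?
- Use the constraint: $Σ_i π_i = Σ_i \frac{c_i}{-λ} = 1$
- $λ = -Σ_i c_i = -n$
- So $π_i = \frac{c_i}{n}$: the estimate is the fraction of the data equal to i.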
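
Once the sum-to-1 constraint is enforced, the maximum likelihood estimate for this model is the relative frequency of each outcome, $π_i = c_i / n$, where n is the number of observations. A minimal sketch follows; the function name and the sample of die rolls are hypothetical, made up for illustration only.

```python
from collections import Counter

def categorical_mle(data):
    """Estimate pi_i as count_i / n for every outcome seen in the data."""
    counts = Counter(data)
    n = len(data)
    return {outcome: c / n for outcome, c in counts.items()}

# Hypothetical observations, e.g. six rolls of a die.
rolls = [1, 3, 3, 6, 3, 1]
pi = categorical_mle(rolls)

print(pi)
print(sum(pi.values()))  # the estimates sum to 1 (up to rounding)
```

Outcomes that never appear in the data (2, 4, and 5 here) get no entry, which amounts to an estimated probability of zero.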

Roll a die to pick a word from: "Sam laughs last and laughs loudest"
Data:
- f = first letter of the chosen word
- s = second letter of the chosen word
- F = [l, l, a, …]; S = [a, a, n, …]
Model:
- Probability distribution over [1, 2, 3, 4, 5, 6].

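With very little data, maximum likelihood can give extreme estimates: for Data = [heads], it sets P(heads) = 1 and P(tails) = 0.
One way to overcome this is to add 1, ½, or some other constant to our counts.
Data = [heads]; counts after adding 1 = {heads: 2, tails: 1}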
This turns out to be the same as having a prior belief about what the coin probability is.
A prior is a probability distribution over the model parameters.
Priors and other forms of regularization are very important for most models.
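
Adding a constant to the counts can be sketched as follows. The function name `smoothed_estimate` and its parameter `k` (the constant added to every count) are hypothetical names chosen for this illustration.

```python
def smoothed_estimate(data, outcomes, k=1.0):
    """Estimate outcome probabilities after adding k to every count.

    `outcomes` lists all possible outcomes, including unseen ones.
    """
    counts = {o: k for o in outcomes}  # start every count at the constant k
    for x in data:
        counts[x] += 1
    total = sum(counts.values())
    return {o: c / total for o, c in counts.items()}

data = ["heads"]
outcomes = ["heads", "tails"]

add_one = smoothed_estimate(data, outcomes, k=1.0)   # counts: heads 2, tails 1
add_half = smoothed_estimate(data, outcomes, k=0.5)  # counts: heads 1.5, tails 0.5

print(add_one)
print(add_half)
```

With k = 1 the single observed heads yields P(heads) = 2/3 rather than 1, so tails keeps a non-zero probability. A smaller k trusts the data more and pulls the estimate less strongly toward uniform.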