Module 3 Review of Probabilities and Estimating Model Parameters

Review of Probabilities

What is a Probability Distribution?

  • A mathematical object used to model an event.
  • Assigns a number to possible outcomes.
  • Example: Coin flip.
    • Outcome space: {heads, tails}
    • P(heads)=12P(heads) = \frac{1}{2}
    • P(tails)=12P(tails) = \frac{1}{2}
  • A random variable is a variable that takes on a value according to some probability distribution.
  • Probabilities must be:
    • Non-negative
    • Sum to 1

Notation for Coin Flips

  • Random variable c denotes a coin-flip.
  • Event space: c ∈ {heads, tails, other}
    • P(c=heads)=0.6P(c = heads) = 0.6
    • P(c=tails)=0.4P(c = tails) = 0.4
  • Event space: c ∈ {heads, tails}
    • P(c=heads)=hP(c = heads) = h
    • P(c=tails)=1hP(c = tails) = 1 - h
  • The above represents an unfair coin.

Example Probability Distributions

  • Dice roll
    • Random variable d.
    • Outcome space: {1 .. 6}
    • Parameters θ
    • P(d=i)=θiP(d = i) = θ_i
    • A fair dice would have θ = [16,16,16,16,16,16][\frac{1}{6}, \frac{1}{6}, \frac{1}{6}, \frac{1}{6}, \frac{1}{6}, \frac{1}{6}]

Simple Example: Picking a Letter

  • Roll a fair dice to pick a word from the sentence: "Sam laughs last and laughs loudest"
  • Define random variables:
    • f = first letter of the chosen word
      • Outcome space: {S, l, a}
    • s = second letter of the chosen word
      • Outcome space {a, n, o}
  • Examples:
    • P(f=l)=46P(f = l) = \frac{4}{6}
    • P(s=a)=46P(s = a) = \frac{4}{6}

Relationships Between Distributions

  • Joint probability p(f,s)p(f, s) distribution over both random variables at the same time.
    • Outcome space: {(S, a), (S, n), (S, o), … (l, a), (l, n), (l, o)}
  • Conditional probability p(sf)p(s | f) distribution of s given a particular value of f.
    • Outcome space is the same as p(s).

Conditional Probability

  • P(s=af=l)P(s = a | f = l)
    • Is the amount of probability in the event f=l that is also shared with the event s=a.
    • P(s=af=l)=P(s=a,f=l)P(f=l)P(s = a | f = l) = \frac{P(s = a, f = l)}{P(f = l)}

Bayes Rule

  • P(cd)=P(dc)P(c)P(d)P(c | d) = \frac{P(d | c) * P(c)}{P(d)}

  • P(sf)=P(fs)P(s)P(f)P(s | f) = \frac{P(f | s) * P(s)}{P(f)}

  • Example: Using the sentence "Sam laughs last and laughs loudest"

    • Joint P(f=l,s=a)=36P(f = l, s = a) = \frac{3}{6}
    • Conditional P(s=af=l)=34P(s = a | f = l) = \frac{3}{4}

Estimating Model Parameters

Probabilistic Models in Practice

  • Usual scenario:
    1. Gather some data.
    2. Define a probabilistic model.
    3. Use the data to estimate the parameters of the model.
  • Example:
    • We flip a coin n times and collect the observations: [heads, tails, tails, …]
    • Model: each flip is independent, with probability p(heads) = h.
    • The goal is to determine how to select p(h).

Terminology

  • The data we collect are called a sample.
  • The procedure we use to choose model parameters is called an estimator.
  • The data likelihood is the probability of the data under the model's distribution.
    • E.g.
      • Data: [heads, heads, tails]
      • Model: p(heads) = h = 0.7
      • Likelihood = 0.7 * 0.7 * 0.3
  • Usually, we look at log-likelihood (the natural log of the likelihood).

Data Likelihood

  • Question: what P(h) gives the highest likelihood?
  • Choosing P(h) this way is called the maximum likelihood estimator.
  • Examples:
    • Data: D = [heads, heads, tails]
      • P(heads) = 0.5 ⇨ likelihood(D) = 0.5 * 0.5 * 0.5 = 0.125
      • P(heads) = 0.8 ⇨ likelihood(D) = 0.8 * 0.8 * 0.2 = 0.128
      • P(heads) = 0.6 ⇨ likelihood(D) = 0.6 * 0.6 * 0.4 = 0.144

How to Find the Max of a Function

  • Looking for the very top of the curve.
  • The top is always flat.
  • More formally, the slope is zero: derivative of the function is zero.

Review of Calculus

  • Derivative notation:
    • Partial derivative of f(x) with respect to x: f(x)x\frac{∂ f(x)}{∂x}
  • Derivative of a logarithm
    • Partial derivative of ln(x) with respect to x equals 1/x
      • ln(x)x=1x\frac{∂ ln(x)}{∂x} = \frac{1}{x}
  • Chain rule of calculus will also be used.

Maximum Likelihood for Our Example

  • Data: [heads, heads, tails]
  • L=Log-likelihood(Data)\mathcal{L} = \text{Log-likelihood(Data)}
  • L=ln(hh(1h))\mathcal{L} = ln(h * h * (1 - h))
  • L=2ln(h)+ln(1h)\mathcal{L} = 2 ln(h) + ln(1 - h)
  • Taking the derivative:
    • Lh=21h+(1)(1h)\frac{∂\mathcal{L}}{∂h} = 2 * \frac{1}{h} + \frac{(-1)}{(1-h)}
  • Setting the derivative to zero:
    • 2h1(1h)=0\frac{2}{h} - \frac{1}{(1-h)} = 0
    • 2h=1(1h)\frac{2}{h} = \frac{1}{(1-h)}
    • 22h=h2 – 2h = h
    • 2=3hh=232 = 3h ⇨ h = \frac{2}{3}

Multinomial Distribution

  • Distribution over some discrete outcomes.
    • E.g. coin flip; dice; letters of the alphabet, words in the dictionary, etc.
  • Parameters:
    • Probability of outcome i: p(X=i)=πip(X=i) = π_i
      • Where i ranges from 1 to k.
    • Remember they are probabilities!
      • πi0π_i ≥ 0
      • Σπi=1Σ π_i = 1

Maximum Likelihood for Multinomials

  • Data: [o1, o2, o3, …, on]

  • Log-Likelihood=ln(π<em>o1)+ln(π</em>o2)+ln(π<em>o3)++ln(π</em>on)\text{Log-Likelihood} = ln(π<em>{o1}) + ln(π</em>{o2}) + ln(π<em>{o3}) + … + ln(π</em>{on})

  • Log-Likelihood=c<em>1ln(π</em>1)+c<em>2ln(π</em>2)++c<em>kln(π</em>k)\text{Log-Likelihood} = c<em>1 ln(π</em>1) + c<em>2 ln(π</em>2) + … + c<em>k ln(π</em>k)

    • Where cic_i counts how many times we see i in the data.
  • L=Σ<em>ic</em>iln(πi)\mathcal{L} = Σ<em>i c</em>i ln(π_i)

  • Optimization problem:

    • maxΣ<em>ic</em>iln(πi)max Σ<em>i c</em>i ln(π_i)
  • The larger π, the larger L\mathcal{L}. So we can get L\mathcal{L} arbitrarily large by setting π arbitrarily high.

  • Optimization problem:

    • maxΣ<em>ic</em>iln(π<em>i) such that Σπ</em>i=1max Σ<em>i c</em>i ln(π<em>i) \text{ such that } Σ π</em>i = 1
  • Make the constraint into a game:

    • max<em>πmin</em>λΣ<em>ic</em>iln(π<em>i)+λ(Σπ</em>i1)max<em>π min</em>λ Σ<em>i c</em>i ln(π<em>i) + λ (Σ π</em>i - 1)
  • L(π,λ)=Σ<em>ic</em>iln(π<em>i)+λ(Σπ</em>i1)\mathcal{L}(π, λ) = Σ<em>i c</em>i ln(π<em>i) + λ (Σ π</em>i - 1)

  • Lπ<em>i=c</em>iπi+λ\frac{∂\mathcal{L}}{∂ π<em>i} = \frac{c</em>i}{π_i} + λ

  • Setting derivative to zero:

    • c<em>iπ</em>i+λ=0\frac{c<em>i}{π</em>i} + λ = 0
    • π<em>i=c</em>iλπ<em>i = \frac{c</em>i}{-λ}
  • What about λ? How can we compute it?

    • λ=Σπiλ = Σ π_i

Note: It Is Not Always Easy to Compute Max Likelihood

  • Sometimes, we do not have a closed-form solution for maximum likelihood.
  • We do not always observe everything we would like to.

Complicated Example

  • Roll a dice to pick a word from: "Sam laughs last and laughs loudest"
  • Data:
    • f = first letter of the chosen word
    • s = second letter of the chosen word
    • F = [l, l, a, …]; S = [a, a, n, …]
  • Model:
    • Probability distribution over [1, 2, 3, 4, 5, 6].

Problem with Small Samples

  • Suppose we flip a coin just once.
  • Data: [heads]
  • Max likelihood estimate: p(heads) = 1; p(tails) = 0.
  • Is this a good model of the world?

Add-1 Smoothing

  • One way to overcome this, is to add 1, or ½ or something else to our counts.
  • Data = [heads]; counts = {heads: 2, tails: 1}
  • This turns out to be the same as having a prior belief about what the coin probability is.
  • This is a probability distribution over the model parameters.
  • Priors and other forms of regularization are very important for most models.