Gaussian Mixture Model Notes

Machine Learning Methods

  • Gaussian Mixture Model (GMM) is a probabilistic clustering method in machine learning.
  • The model describes data as a mixture of clusters (e.g., Cluster 1, Cluster 2, Cluster 3).

Recap of Conditional Probabilities and Bayes Rule

  • Conditional probabilities:
    p(A, B) = p(A|B)p(B) = p(B|A)p(A)
  • Bayes rule components:
    • Posterior Probability: p(A|B)
    • Likelihood: p(B|A)
    • Prior Probability: p(A)
    • Marginal Probability: p(B)
  • Marginal Probability formula:
    p(A = 1) = \sum_{i} p(A = 1, B_{i}) = \sum_{i} p(A = 1|B_{i})p(B_{i})
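The rule above can be sketched numerically. The prior and likelihood values below are made up for illustration, not from the notes:

```python
# A minimal sketch of Bayes' rule and marginalization with made-up numbers.
priors = {"A0": 0.6, "A1": 0.4}          # p(A)
likelihoods = {"A0": 0.2, "A1": 0.9}     # p(B = 1 | A)

# Marginal: p(B = 1) = sum_i p(B = 1 | A_i) p(A_i)
marginal = sum(likelihoods[a] * priors[a] for a in priors)

# Posterior: p(A = A1 | B = 1) = p(B = 1 | A1) p(A1) / p(B = 1)
posterior = likelihoods["A1"] * priors["A1"] / marginal
print(marginal, posterior)  # 0.48 0.75
```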

Example of Probabilities Calculation

  • Given joint probabilities for today's and tomorrow's weather (Rainy/Cold):
    • P(Tomorrow=Rainy, Today=Rainy) = 4/9
    • P(Tomorrow=Cold, Today=Rainy) = 2/9
    • P(Tomorrow=Rainy, Today=Cold) = 2/9
    • P(Tomorrow=Cold, Today=Cold) = 1/9
  • Marginal probabilities for Today:
    • P(Today=Rainy) = 4/9 + 2/9 = 2/3
    • P(Today=Cold) = 2/9 + 1/9 = 1/3
  • Marginal probabilities for Tomorrow:
    • P(Tomorrow=Rainy) = 4/9 + 2/9 = 2/3
    • P(Tomorrow=Cold) = 2/9 + 1/9 = 1/3
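The marginals above can be checked by summing the joint table over the other variable; a small sketch using exact fractions:

```python
from fractions import Fraction as F

# Joint probabilities p(Today, Tomorrow) from the table above.
joint = {
    ("Rainy", "Rainy"): F(4, 9),
    ("Rainy", "Cold"):  F(2, 9),
    ("Cold",  "Rainy"): F(2, 9),
    ("Cold",  "Cold"):  F(1, 9),
}

# Marginalize by summing the joint over the other variable.
p_today = {t: sum(p for (today, _), p in joint.items() if today == t)
           for t in ("Rainy", "Cold")}
p_tomorrow = {t: sum(p for (_, tom), p in joint.items() if tom == t)
              for t in ("Rainy", "Cold")}
print(p_today, p_tomorrow)  # both give Rainy: 2/3, Cold: 1/3
```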

Hard Clustering Challenges

  • Hard Clustering methods include K-Means, Hierarchical Clustering, and DBSCAN.
  • These methods struggle in scenarios where data points do not clearly belong to a single cluster.

Towards Soft Clustering

  • K-means:
    • Performs hard assignment, where each object belongs to only one cluster.
  • Mixture modeling:
    • Performs soft assignment, providing a probability that an object belongs to a cluster.
    • o_{i} \in \{o_{1}, …, o_{K}\}
    • Example probabilities: p=0.315, p=0.287

Gaussian Mixture Model (GMM)

  • GMM provides a soft assignment of data points to clusters.

Gaussian Distribution

  • 1-d Gaussian distribution:
    N(x|\mu, \sigma^{2}) = \frac{1}{\sqrt{2\pi\sigma^{2}}} e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}}
  • μ is the mean.
  • σ is the standard deviation.
  • GMM uses Gaussian distributions to model clusters.
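The 1-d density above translates directly into code; a minimal sketch:

```python
import math

def gaussian_pdf(x, mu, sigma):
    """1-d Gaussian density N(x | mu, sigma^2)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

# The density peaks at the mean; for mu = 0, sigma = 1 the peak is 1/sqrt(2*pi).
print(gaussian_pdf(0.0, 0.0, 1.0))  # ≈ 0.3989
```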

Gaussian Distribution in d Dimensions

  • For d dimensions, the Gaussian distribution of a vector x = (x^{1}, x^{2}, …, x^{d})^{T} is defined by:
    N(x|\mu, \Sigma) = \frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}} \exp(-\frac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu))
    • where \mu is the mean vector.
    • \Sigma is the covariance matrix.
  • Example:
    • \mu = (0, 0)^{T}
    • \Sigma = \begin{bmatrix} 0.25 & 0.30 \\ 0.30 & 1.00 \end{bmatrix}
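The d-dimensional density can be evaluated directly from the formula; a sketch using the example mean and covariance above:

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """d-dimensional Gaussian density N(x | mu, Sigma)."""
    d = len(mu)
    diff = x - mu
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return float(np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff) / norm)

mu = np.array([0.0, 0.0])
Sigma = np.array([[0.25, 0.30], [0.30, 1.00]])
# At x = mu the exponent is 0, so the density equals 1 / ((2*pi) * sqrt(det(Sigma))).
print(mvn_pdf(np.array([0.0, 0.0]), mu, Sigma))
```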

Mixture Models

  • A Mixture Model is the weighted sum of a number of PDFs.
  • Weights are determined by a distribution \pi_{0}, \pi_{1}, \pi_{2}.

Mixture Model Equation

  • The probability density function p(x) is given by:
    p(x) = \pi_{0}f_{0}(x) + \pi_{1}f_{1}(x) + \pi_{2}f_{2}(x)
    • \pi_{0}, \pi_{1}, \pi_{2} are the mixing proportions.
    • f_{0}(x), f_{1}(x), f_{2}(x) are the component densities.
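A weighted sum of component densities is easy to sketch in code. The three components below (Gaussians with made-up means and standard deviations) are illustrative assumptions, not values from the notes:

```python
import math

def gaussian(x, mu, sigma):
    """1-d Gaussian density N(x | mu, sigma^2)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

# Hypothetical three-component mixture: weights are non-negative and sum to 1.
weights = [0.5, 0.3, 0.2]                        # pi_0, pi_1, pi_2
params = [(-2.0, 1.0), (0.0, 0.5), (3.0, 1.5)]   # (mean, std) per component

def mixture_pdf(x):
    """p(x) = sum_k pi_k f_k(x)."""
    return sum(w * gaussian(x, mu, s) for w, (mu, s) in zip(weights, params))

print(mixture_pdf(0.0))
```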

Why GMM?

  • GMM defines a new PDF from which random variables can be generated.
  • It is a generative model.
  • It models each cluster with its own Gaussian distribution.
  • It supports posterior inference.
  • It gives a soft assignment of data points to clusters.

Intuition Behind GMM

  • Consider a histogram of temperature readings.
  • Two distributions may indicate data from two different cities.
  • Formulate as mathematics with:
    • Observable variable: temperature.
    • Latent variable: cities (z).

Gaussian Mixture Models: Mathematical Notation

  • For the general case, π is a vector of probabilities (non-negative values which sum to 1).
  • π is known as the mixing proportions.

Gaussian Mixture Models: PDF

  • The probability density function (PDF) over x is computed by marginalizing out (summing out) z:
    p(x) = \sum_{k=1}^{K} \pi_{k} N(x|\mu_{k}, \Sigma_{k})
  • This PDF is a convex combination, or weighted average, of the PDFs of the component distributions.

Posterior Inference

  • Assume parameters of the model have been chosen.
  • Goal: Infer, given a data point x, which component it likely belongs to.
  • In mathematical terms, infer the posterior distribution p(z|x).
  • Posterior distribution can be inferred using Bayes’ Rule.
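The Bayes' rule step can be sketched for a 1-d, two-component model. The mixing proportions and component parameters below are assumed for illustration only:

```python
import math

def gaussian(x, mu, sigma):
    """1-d Gaussian density N(x | mu, sigma^2)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

# Hypothetical two-component model (parameters are illustrative assumptions).
pi = [0.4, 0.6]                    # Pr(z = k)
params = [(0.0, 1.0), (4.0, 2.0)]  # (mu_k, sigma_k)

def posterior(x):
    """Pr(z = k | x) = pi_k N(x | mu_k, sigma_k) / p(x), by Bayes' rule."""
    joint = [p * gaussian(x, mu, s) for p, (mu, s) in zip(pi, params)]
    total = sum(joint)  # p(x): marginalize out z
    return [j / total for j in joint]

print(posterior(1.0))  # x = 1 is closer to component 0, so its posterior is larger
```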

Example I

  • Using a previous example, suppose we observe x=2 in the model.
  • We want to compute the posterior probability Pr(z = 1|x).

Example II

  • Observations are two-dimensional (x1, x2).
  • We observe x1 and want to predict x2 using the posterior predictive distribution.

Example II: Two-Dimensional Mixture of Gaussian Model

  • Have a two-dimensional mixture of Gaussian model where x1 and x2 are conditionally independent given z.
  • Suppose we observe x1=3.
  • We can compute the posterior distribution just like in the previous example.

Example II: Posterior Predictive Distribution

  • Compute the posterior predictive distribution using the posterior:
    p(x_{2}|x_{1}) = Pr(z = 1|x_{1})\,p(x_{2}|z = 1) + Pr(z = 2|x_{1})\,p(x_{2}|z = 2)
    = 0.213\,\text{Gaussian}(x_{2}; 6, 1) + 0.787\,\text{Gaussian}(x_{2}; 3, 2)
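The predictive distribution above can be evaluated in code. One assumption: the second parameter of Gaussian(x; mu, s) is read here as the standard deviation; the predictive mean is unaffected by that choice either way:

```python
import math

def gaussian(x, mu, sigma):
    """1-d Gaussian density N(x | mu, sigma^2)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

# Posterior weights and component parameters from the example above.
weights = [0.213, 0.787]
components = [(6.0, 1.0), (3.0, 2.0)]  # (mean, std), std assumed

def predictive_density(x2):
    """p(x2 | x1) as a weighted sum of the component densities."""
    return sum(w * gaussian(x2, mu, s) for w, (mu, s) in zip(weights, components))

# The predictive mean is the posterior-weighted average of the component means.
predictive_mean = sum(w * mu for w, (mu, _) in zip(weights, components))
print(predictive_mean)  # 0.213 * 6 + 0.787 * 3 ≈ 3.639
```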

Learning

  • Parameters of GMM:
    • Mean \mu_{k} and standard deviation \sigma_{k} associated with each component k.
    • Mixing proportions \pi_{k}, defined as Pr(z = k).
  • Idea:
    • Repeatedly alternate between inference (computing cluster responsibilities) and updating the parameters.

Learning: Maximum Likelihood Estimation (MLE)

  • Use Maximum Likelihood Estimation (MLE) to solve the parameter learning problem.

Learning: Log-Likelihood Derivatives

  • Compute log-likelihood derivatives and set them to 0, or use gradient descent.
  • θ can be any parameter to learn, such as a mixing proportion, mean, or standard deviation of a component.
  • The derivative works out to the expected derivative of the joint log probability.

Learning: Optimization of Means

  • Optimization of means:
    \ln p(x|\pi, \mu, \Sigma) = \sum_{n=1}^{N} \sum_{k=1}^{K} T_{nk} \ln(\pi_{k} N(x_{n}|\mu_{k}, \Sigma_{k}))
    \frac{\partial}{\partial \mu_{k}} \ln p(x|\pi, \mu, \Sigma) = \sum_{n=1}^{N} T_{nk} \Sigma_{k}^{-1} (x_{n} - \mu_{k}) = 0
    \mu_{k} = \frac{\sum_{n=1}^{N} T_{nk} x_{n}}{\sum_{n=1}^{N} T_{nk}}
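The mean update is a responsibility-weighted average; a minimal sketch with made-up responsibilities T_nk and 1-d data:

```python
import numpy as np

# Responsibilities T[n, k] = Pr(z_n = k | x_n) and 1-d data x[n].
# Values are made up for illustration.
T = np.array([[0.9, 0.1],
              [0.2, 0.8],
              [0.7, 0.3]])
x = np.array([1.0, 5.0, 2.0])

# Weighted-mean update: mu_k = sum_n T_nk x_n / sum_n T_nk
mu = (T * x[:, None]).sum(axis=0) / T.sum(axis=0)
print(mu)  # each entry is a responsibility-weighted average of the data
```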

Learning: Optimization of Covariance

  • Optimization of covariance:
    \Sigma_{k} = \frac{\sum_{n=1}^{N} T_{nk} (x_{n} - \mu_{k})(x_{n} - \mu_{k})^{T}}{\sum_{n=1}^{N} T_{nk}}

Learning: Optimization of Mixing Term

  • Optimization of the mixing term uses a Lagrange multiplier to enforce that the proportions sum to 1:
    \ln p(x|\pi, \mu, \Sigma) + \lambda (\sum_{k=1}^{K} \pi_{k} - 1)
  • \pi_{k} = \frac{N_{k}}{N}
    where
  • N_{k} = \sum_{n=1}^{N} T_{nk}

MLE of a GMM

  • MLE of a GMM:
    \mu_{k} = \frac{1}{N_{k}} \sum_{n=1}^{N} T(z_{nk})\, x_{n}
    \Sigma_{k} = \frac{1}{N_{k}} \sum_{n=1}^{N} T(z_{nk}) (x_{n} - \mu_{k})(x_{n} - \mu_{k})^{T}
    \pi_{k} = \frac{N_{k}}{N}
    N_{k} = \sum_{n=1}^{N} T(z_{nk})
  • This is not a closed-form solution: the responsibilities T(z_{nk}) themselves depend on the parameters.
  • Use the Expectation-Maximization (EM) algorithm instead.
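The alternation between inference and the update equations can be sketched as a bare-bones 1-d EM loop. This is an illustrative sketch (deterministic initialization, no numerical safeguards, assumes K ≥ 2), not a production implementation:

```python
import math

def gaussian(x, mu, sigma):
    """1-d Gaussian density N(x | mu, sigma^2)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def em_gmm_1d(xs, K=2, iters=30):
    """Bare-bones EM for a 1-d GMM (illustrative; assumes K >= 2)."""
    xs_sorted = sorted(xs)
    # Initialize means spread across the data range, unit variances, uniform weights.
    mu = [xs_sorted[i * (len(xs) - 1) // (K - 1)] for i in range(K)]
    sigma = [1.0] * K
    pi = [1.0 / K] * K
    for _ in range(iters):
        # E-step: responsibilities T[n][k] = Pr(z_n = k | x_n) via Bayes' rule.
        T = []
        for x in xs:
            joint = [pi[k] * gaussian(x, mu[k], sigma[k]) for k in range(K)]
            s = sum(joint)
            T.append([j / s for j in joint])
        # M-step: the weighted-average updates derived above.
        for k in range(K):
            Nk = sum(T[n][k] for n in range(len(xs)))
            mu[k] = sum(T[n][k] * xs[n] for n in range(len(xs))) / Nk
            var = sum(T[n][k] * (xs[n] - mu[k]) ** 2 for n in range(len(xs))) / Nk
            sigma[k] = math.sqrt(max(var, 1e-6))  # floor the variance for stability
            pi[k] = Nk / len(xs)
    return pi, mu, sigma

# Two well-separated groups; EM should recover means near 0.05 and 10.0.
data = [0.1, -0.2, 0.3, 0.0, 10.1, 9.8, 10.2, 9.9]
pi, mu, sigma = em_gmm_1d(data)
print(sorted(mu))
```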

Take-Home Messages

  • The generative process of the Gaussian Mixture Model
  • Inferring cluster membership based on a learned GMM