Gaussian Mixture Model Notes

Pattern Classification: Gaussian Mixture Model

Recap: Kernel Density Estimator

  • The Kernel Density Estimator (KDE) is a non-parametric method for estimating the probability density function of a random variable.
  • It places a kernel function (e.g., a Gaussian) at each data point and sums the kernels to approximate the true density; a minimal sketch follows.
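
As a quick illustration, the snippet below builds a 1-D Gaussian-kernel density estimate from scratch with NumPy. The bandwidth `h`, the sample values, and the function name `gaussian_kde_1d` are illustrative choices, not part of the original notes.

```python
import numpy as np

def gaussian_kde_1d(x_query, data, h=0.3):
    """Evaluate a 1-D Gaussian kernel density estimate at the points x_query.

    A Gaussian kernel with bandwidth h is centered on every data point,
    and the kernels are averaged to approximate the underlying density.
    """
    x_query = np.asarray(x_query, dtype=float)
    data = np.asarray(data, dtype=float)
    diffs = (x_query[:, None] - data[None, :]) / h             # pairwise differences
    kernels = np.exp(-0.5 * diffs ** 2) / np.sqrt(2 * np.pi)   # standard normal kernel
    return kernels.mean(axis=1) / h                            # average and rescale by h

# Example: estimate the density of a small 1-D sample on a coarse grid
samples = [0.1, 0.4, 0.5, 1.8, 2.1]
grid = np.linspace(-1.0, 3.0, 5)
print(gaussian_kde_1d(grid, samples))
```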

Gaussian Mixture Model (GMM)

  • Problem: With a small dataset, there are too few samples to estimate the class-conditional densities reliably with the Kernel Density Estimator.
  • Solution: Model each class-conditional density as a weighted sum of multivariate Gaussian densities.

GMM Formulation

  • The class-conditional density $P(X|C_i)$ is modeled as a weighted sum of Gaussian densities:

    $P(X|C_i) = \sum_{j=1}^{K} w_j N(X, \mu_j, \Sigma_j)$

    Where:

    • $w_j$ represents the probability (mixing weight) of mixture component $j$.
    • $N(X, \mu_j, \Sigma_j)$ is a multivariate Gaussian density with mean $\mu_j$ and covariance $\Sigma_j$.
  • The parameters (mixture weights, mean vectors, and covariance matrices) are chosen so that this sum approximates the given class-conditional density as closely as possible; a small evaluation sketch follows.
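
To make the formulation concrete, here is a minimal NumPy sketch that evaluates a GMM density at a point. The helper names (`gaussian_pdf`, `gmm_pdf`) and the two-component parameters are illustrative assumptions.

```python
import numpy as np

def gaussian_pdf(x, mu, cov):
    """Multivariate Gaussian density N(x, mu, Sigma)."""
    d = len(mu)
    diff = x - mu
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff) / norm

def gmm_pdf(x, weights, means, covs):
    """Weighted sum of Gaussian components: sum_j w_j N(x, mu_j, Sigma_j)."""
    return sum(w * gaussian_pdf(x, mu, cov)
               for w, mu, cov in zip(weights, means, covs))

# Two-component example in 2-D with illustrative parameters
weights = [0.6, 0.4]                                   # mixture weights, sum to 1
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]   # component means
covs = [np.eye(2), 2.0 * np.eye(2)]                    # component covariances
print(gmm_pdf(np.array([1.0, 1.0]), weights, means, covs))
```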

GMM Parameters and Conditions

  • Parameters: Mixture weights $w_j$, means $\mu_j$, and covariances $\Sigma_j$.

  • Condition: The sum of the weights must equal 1:

    $\sum_{j=1}^{K} w_j = 1$

    This condition (together with $w_j \ge 0$) is necessary to ensure that $P(X|C_i)$ is a valid probability density function:

    $\int P(X|C_i)\, dX = \int \sum_{j=1}^{K} w_j N(X, \mu_j, \Sigma_j)\, dX = \sum_{j=1}^{K} w_j \int N(X, \mu_j, \Sigma_j)\, dX = \sum_{j=1}^{K} w_j \cdot 1 = 1$
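
As a quick sanity check of this normalization argument, the snippet below numerically integrates a 1-D two-component mixture over a wide grid; the weights, means, and standard deviations are arbitrary illustrative values.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    """1-D Gaussian density."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Two-component 1-D mixture with weights that sum to 1
w = [0.3, 0.7]
x = np.linspace(-20.0, 20.0, 100_000)
mixture = w[0] * normal_pdf(x, -2.0, 1.0) + w[1] * normal_pdf(x, 3.0, 2.0)

# Riemann-sum approximation of the integral; it should be very close to 1
dx = x[1] - x[0]
print(np.sum(mixture) * dx)
```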

1-D Example

  • Consider an example with three Gaussian components:

    • $w_1 N(X, \mu_1, \sigma_1)$
    • $w_2 N(X, \mu_2, \sigma_2)$
    • $w_3 N(X, \mu_3, \sigma_3)$
  • The sum of these components approximates the class-conditional probability $P(X|C_i)$, as sketched in the code below.

    $P(X|C_i) \approx \sum_{j=1}^{K} w_j N(X, \mu_j, \sigma_j)$
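
A minimal sketch of this 1-D example: three Gaussian components are combined into one density and evaluated on a grid. The weights, means, and standard deviations are made-up values for illustration.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    """1-D Gaussian density N(x, mu, sigma)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Illustrative parameters for the three components
w = [0.2, 0.5, 0.3]            # mixture weights, sum to 1
mus = [-3.0, 0.0, 4.0]         # component means
sigmas = [1.0, 0.7, 1.5]       # component standard deviations

# Evaluate the approximated class conditional on a grid
x = np.linspace(-8.0, 10.0, 500)
p = sum(w_j * normal_pdf(x, mu_j, s_j) for w_j, mu_j, s_j in zip(w, mus, sigmas))
print(p[:5])   # density values at the first few grid points
```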

Expectation-Maximization (EM) Algorithm

  • The EM algorithm is used to estimate the parameters of the GMM components.

  • It is an iterative algorithm.

  • For simplicity, consider a 2-component case:

    $P(X|C_i) = w\, N(X, \mu_1, \Sigma_1) + (1 - w)\, N(X, \mu_2, \Sigma_2)$

EM Algorithm Steps

  1. Initialization: Take initial guesses for the parameters $w, \mu_1, \Sigma_1, \mu_2$, and $\Sigma_2$.

  2. Expectation Step: Compute the responsibilities:

    $\gamma_m = \frac{w\, N(X^{(m)}, \mu_1, \Sigma_1)}{w\, N(X^{(m)}, \mu_1, \Sigma_1) + (1 - w)\, N(X^{(m)}, \mu_2, \Sigma_2)}$

    $\gamma_m$ represents the probability that sample $X^{(m)}$ was generated by component 1.

  3. Maximization Step: Compute the weighted means and covariance matrices:

    $\mu_1 = \frac{\sum_{m=1}^{M} \gamma_m X^{(m)}}{\sum_{m=1}^{M} \gamma_m}$

    $\mu_2 = \frac{\sum_{m=1}^{M} (1 - \gamma_m) X^{(m)}}{\sum_{m=1}^{M} (1 - \gamma_m)}$

    $\Sigma_1 = \frac{\sum_{m=1}^{M} \gamma_m (X^{(m)} - \mu_1)(X^{(m)} - \mu_1)^T}{\sum_{m=1}^{M} \gamma_m}$

    $\Sigma_2 = \frac{\sum_{m=1}^{M} (1 - \gamma_m)(X^{(m)} - \mu_2)(X^{(m)} - \mu_2)^T}{\sum_{m=1}^{M} (1 - \gamma_m)}$

    $w = \frac{\sum_{m=1}^{M} \gamma_m}{M}$

  4. Iteration: Repeat steps 2 and 3 until convergence; a self-contained sketch of the full loop follows.
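
Below is a self-contained NumPy sketch of these four steps for the 2-component, 1-D case. The synthetic data, the initial guesses, and the fixed iteration count are illustrative assumptions; in practice one would monitor the log-likelihood for convergence.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    """1-D Gaussian density."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Synthetic 1-D data drawn from two Gaussians (illustration only)
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-2.0, 1.0, 150), rng.normal(3.0, 1.5, 350)])
M = len(X)

# 1. Initialization: rough guesses for w, mu_1, sigma_1, mu_2, sigma_2
w, mu1, s1, mu2, s2 = 0.5, -1.0, 1.0, 1.0, 1.0

for _ in range(100):
    # 2. Expectation step: responsibility of component 1 for each sample
    p1 = w * normal_pdf(X, mu1, s1)
    p2 = (1.0 - w) * normal_pdf(X, mu2, s2)
    gamma = p1 / (p1 + p2)

    # 3. Maximization step: responsibility-weighted means, variances, and weight
    mu1 = np.sum(gamma * X) / np.sum(gamma)
    mu2 = np.sum((1.0 - gamma) * X) / np.sum(1.0 - gamma)
    s1 = np.sqrt(np.sum(gamma * (X - mu1) ** 2) / np.sum(gamma))
    s2 = np.sqrt(np.sum((1.0 - gamma) * (X - mu2) ** 2) / np.sum(1.0 - gamma))
    w = np.sum(gamma) / M

# 4. After iterating, the estimates approximate the data-generating mixture
print(f"w={w:.2f}, mu1={mu1:.2f}, s1={s1:.2f}, mu2={mu2:.2f}, s2={s2:.2f}")
```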

Interpretation of Responsibilities

  • $\gamma_m = P(X^{(m)} \in \text{component 1})$

  • Using Bayes' rule:

    $\gamma_m = \frac{P(\text{comp. 1})\, P(X^{(m)} \mid \text{comp. 1})}{P(X^{(m)})} = \frac{P(\text{comp. 1})\, P(X^{(m)} \mid \text{comp. 1})}{P(\text{comp. 1})\, P(X^{(m)} \mid \text{comp. 1}) + P(\text{comp. 2})\, P(X^{(m)} \mid \text{comp. 2})}$

Issues with GMM

  • Initialization:
    • EM is very sensitive to initial conditions.
    • Starting from a bad initial guess can lead to poor results.
    • Usually, K-Means is used to get a good initialization.
  • Number of Gaussian Components:
    • Choosing the appropriate number of Gaussian components $K$ is crucial.
    • Try different numbers of components and choose the one that performs best on a held-out validation set, as sketched below.
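
In practice these two issues are often handled with an off-the-shelf implementation. The sketch below uses scikit-learn's GaussianMixture, whose K-Means initialization and multiple restarts address the initialization issue, and selects the number of components by validation-set log-likelihood; the synthetic data and the candidate values of $K$ are illustrative assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Illustrative data for one class: two 2-D Gaussian blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (300, 2)),
               rng.normal(4.0, 1.5, (300, 2))])

# Shuffle, then hold out part of the data as a validation set
X = X[rng.permutation(len(X))]
X_train, X_val = X[:450], X[450:]

best_k, best_score = None, -np.inf
for k in range(1, 6):
    # init_params="kmeans" initializes EM with K-Means;
    # n_init restarts EM from several initializations to reduce sensitivity.
    gmm = GaussianMixture(n_components=k, init_params="kmeans",
                          n_init=5, random_state=0).fit(X_train)
    score = gmm.score(X_val)   # mean log-likelihood on the validation data
    if score > best_score:
        best_k, best_score = k, score

print("chosen number of components:", best_k)
```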