Gaussian Mixture Model Notes

Pattern Classification: Gaussian Mixture Model

Recap: Kernel Density Estimator

  • The Kernel Density Estimator (KDE) is a non-parametric method for estimating the probability density function of a random variable.
  • It places a kernel function (e.g., a Gaussian) at each data point and sums the kernels to approximate the true density; a minimal sketch follows.
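
As a quick illustration, the snippet below builds a 1-D Gaussian-kernel density estimate from scratch with NumPy. The bandwidth `h`, the sample values, and the function name `gaussian_kde_1d` are illustrative choices, not part of the original notes.

```python
import numpy as np

def gaussian_kde_1d(x_query, data, h=0.3):
    """Evaluate a 1-D Gaussian kernel density estimate at the points x_query.

    A Gaussian kernel with bandwidth h is centered on every data point,
    and the kernels are averaged to approximate the underlying density.
    """
    x_query = np.asarray(x_query, dtype=float)
    data = np.asarray(data, dtype=float)
    diffs = (x_query[:, None] - data[None, :]) / h             # pairwise differences
    kernels = np.exp(-0.5 * diffs ** 2) / np.sqrt(2 * np.pi)   # standard normal kernel
    return kernels.mean(axis=1) / h                            # average and rescale by h

# Example: estimate the density of a small 1-D sample on a coarse grid
samples = [0.1, 0.4, 0.5, 1.8, 2.1]
grid = np.linspace(-1.0, 3.0, 5)
print(gaussian_kde_1d(grid, samples))
```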

Gaussian Mixture Model (GMM)

  • Problem: With a small dataset, there are too few samples to estimate the class-conditional densities reliably with the Kernel Density Estimator.
  • Solution: Model each class-conditional density as a weighted sum of multivariate Gaussian densities.

GMM Formulation

  • The class-conditional density $P(X|C_i)$ is modeled as a weighted sum of Gaussian densities:

    $P(X|C_i) = \sum_{j=1}^{K} w_j N(X, \mu_j, \Sigma_j)$

    Where:

    • $w_j$ represents the probability (mixing weight) of mixture component $j$.
    • $N(X, \mu_j, \Sigma_j)$ is a multivariate Gaussian density with mean $\mu_j$ and covariance $\Sigma_j$.
  • The parameters (mixture weights, mean vectors, and covariance matrices) are chosen so that this sum approximates the given class-conditional density as closely as possible; a small evaluation sketch follows.
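
To make the formulation concrete, here is a minimal NumPy sketch that evaluates a GMM density at a point. The helper names (`gaussian_pdf`, `gmm_pdf`) and the two-component parameters are illustrative assumptions.

```python
import numpy as np

def gaussian_pdf(x, mu, cov):
    """Multivariate Gaussian density N(x, mu, Sigma)."""
    d = len(mu)
    diff = x - mu
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff) / norm

def gmm_pdf(x, weights, means, covs):
    """Weighted sum of Gaussian components: sum_j w_j N(x, mu_j, Sigma_j)."""
    return sum(w * gaussian_pdf(x, mu, cov)
               for w, mu, cov in zip(weights, means, covs))

# Two-component example in 2-D with illustrative parameters
weights = [0.6, 0.4]                                   # mixture weights, sum to 1
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]   # component means
covs = [np.eye(2), 2.0 * np.eye(2)]                    # component covariances
print(gmm_pdf(np.array([1.0, 1.0]), weights, means, covs))
```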

GMM Parameters and Conditions

  • Parameters: Mixture weights $w_j$, means $\mu_j$, and covariances $\Sigma_j$.

  • Condition: The sum of the weights must equal 1:

    $\sum_{j=1}^{K} w_j = 1$

    This condition (together with $w_j \ge 0$) is necessary to ensure that $P(X|C_i)$ is a valid probability density function:

    $\int P(X|C_i)\, dX = \int \sum_{j=1}^{K} w_j N(X, \mu_j, \Sigma_j)\, dX = \sum_{j=1}^{K} w_j \int N(X, \mu_j, \Sigma_j)\, dX = \sum_{j=1}^{K} w_j \cdot 1 = 1$
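
As a quick sanity check of this normalization argument, the snippet below numerically integrates a 1-D two-component mixture over a wide grid; the weights, means, and standard deviations are arbitrary illustrative values.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    """1-D Gaussian density."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Two-component 1-D mixture with weights that sum to 1
w = [0.3, 0.7]
x = np.linspace(-20.0, 20.0, 100_000)
mixture = w[0] * normal_pdf(x, -2.0, 1.0) + w[1] * normal_pdf(x, 3.0, 2.0)

# Riemann-sum approximation of the integral; it should be very close to 1
dx = x[1] - x[0]
print(np.sum(mixture) * dx)
```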

1-D Example

  • Consider an example with three Gaussian components:

    • $w_1 N(X, \mu_1, \sigma_1)$
    • $w_2 N(X, \mu_2, \sigma_2)$
    • $w_3 N(X, \mu_3, \sigma_3)$
  • The sum of these components approximates the class-conditional probability $P(X|C_i)$, as sketched in the code below.

    $P(X|C_i) \approx \sum_{j=1}^{K} w_j N(X, \mu_j, \sigma_j)$
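
A minimal sketch of this 1-D example: three Gaussian components are combined into one density and evaluated on a grid. The weights, means, and standard deviations are made-up values for illustration.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    """1-D Gaussian density N(x, mu, sigma)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Illustrative parameters for the three components
w = [0.2, 0.5, 0.3]            # mixture weights, sum to 1
mus = [-3.0, 0.0, 4.0]         # component means
sigmas = [1.0, 0.7, 1.5]       # component standard deviations

# Evaluate the approximated class conditional on a grid
x = np.linspace(-8.0, 10.0, 500)
p = sum(w_j * normal_pdf(x, mu_j, s_j) for w_j, mu_j, s_j in zip(w, mus, sigmas))
print(p[:5])   # density values at the first few grid points
```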

Expectation-Maximization (EM) Algorithm

  • The EM algorithm is used to estimate the parameters of the GMM components.

  • It is an iterative algorithm.

  • For simplicity, consider a 2-component case:

    $P(X|C_i) = w\, N(X, \mu_1, \Sigma_1) + (1 - w)\, N(X, \mu_2, \Sigma_2)$

EM Algorithm Steps

  1. Initialization: Take initial guesses for the parameters $w, \mu_1, \Sigma_1, \mu_2$, and $\Sigma_2$.

  2. Expectation Step: Compute the responsibilities:

    $\gamma_m = \frac{w\, N(X^{(m)}, \mu_1, \Sigma_1)}{w\, N(X^{(m)}, \mu_1, \Sigma_1) + (1 - w)\, N(X^{(m)}, \mu_2, \Sigma_2)}$

    $\gamma_m$ represents the probability that sample $X^{(m)}$ was generated by component 1.

  3. Maximization Step: Compute the weighted means and covariance matrices:

    $\mu_1 = \frac{\sum_{m=1}^{M} \gamma_m X^{(m)}}{\sum_{m=1}^{M} \gamma_m}$

    $\mu_2 = \frac{\sum_{m=1}^{M} (1 - \gamma_m) X^{(m)}}{\sum_{m=1}^{M} (1 - \gamma_m)}$

    $\Sigma_1 = \frac{\sum_{m=1}^{M} \gamma_m (X^{(m)} - \mu_1)(X^{(m)} - \mu_1)^T}{\sum_{m=1}^{M} \gamma_m}$

    $\Sigma_2 = \frac{\sum_{m=1}^{M} (1 - \gamma_m)(X^{(m)} - \mu_2)(X^{(m)} - \mu_2)^T}{\sum_{m=1}^{M} (1 - \gamma_m)}$

    $w = \frac{\sum_{m=1}^{M} \gamma_m}{M}$

  4. Iteration: Repeat steps 2 and 3 until convergence; a self-contained sketch of the full loop follows.
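
Below is a self-contained NumPy sketch of these four steps for the 2-component, 1-D case. The synthetic data, the initial guesses, and the fixed iteration count are illustrative assumptions; in practice one would monitor the log-likelihood for convergence.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    """1-D Gaussian density."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Synthetic 1-D data drawn from two Gaussians (illustration only)
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-2.0, 1.0, 150), rng.normal(3.0, 1.5, 350)])
M = len(X)

# 1. Initialization: rough guesses for w, mu_1, sigma_1, mu_2, sigma_2
w, mu1, s1, mu2, s2 = 0.5, -1.0, 1.0, 1.0, 1.0

for _ in range(100):
    # 2. Expectation step: responsibility of component 1 for each sample
    p1 = w * normal_pdf(X, mu1, s1)
    p2 = (1.0 - w) * normal_pdf(X, mu2, s2)
    gamma = p1 / (p1 + p2)

    # 3. Maximization step: responsibility-weighted means, variances, and weight
    mu1 = np.sum(gamma * X) / np.sum(gamma)
    mu2 = np.sum((1.0 - gamma) * X) / np.sum(1.0 - gamma)
    s1 = np.sqrt(np.sum(gamma * (X - mu1) ** 2) / np.sum(gamma))
    s2 = np.sqrt(np.sum((1.0 - gamma) * (X - mu2) ** 2) / np.sum(1.0 - gamma))
    w = np.sum(gamma) / M

# 4. After iterating, the estimates approximate the data-generating mixture
print(f"w={w:.2f}, mu1={mu1:.2f}, s1={s1:.2f}, mu2={mu2:.2f}, s2={s2:.2f}")
```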

Interpretation of Responsibilities

  • $\gamma_m = P(X^{(m)} \in \text{component 1})$

  • Using Bayes' rule:

    $\gamma_m = \frac{P(\text{comp. 1})\, P(X^{(m)} \mid \text{comp. 1})}{P(X^{(m)})} = \frac{P(\text{comp. 1})\, P(X^{(m)} \mid \text{comp. 1})}{P(\text{comp. 1})\, P(X^{(m)} \mid \text{comp. 1}) + P(\text{comp. 2})\, P(X^{(m)} \mid \text{comp. 2})}$

Issues with GMM

  • Initialization:
    • EM is very sensitive to initial conditions.
    • Starting from a bad initial guess can lead to poor results.
    • Usually, K-Means is used to get a good initialization.
  • Number of Gaussian Components:
    • Choosing the appropriate number of Gaussian components $K$ is crucial.
    • Try different numbers of components and choose the one that performs best on a held-out validation set, as sketched below.
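
In practice these two issues are often handled with an off-the-shelf implementation. The sketch below uses scikit-learn's GaussianMixture, whose K-Means initialization and multiple restarts address the initialization issue, and selects the number of components by validation-set log-likelihood; the synthetic data and the candidate values of $K$ are illustrative assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Illustrative data for one class: two 2-D Gaussian blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (300, 2)),
               rng.normal(4.0, 1.5, (300, 2))])

# Shuffle, then hold out part of the data as a validation set
X = X[rng.permutation(len(X))]
X_train, X_val = X[:450], X[450:]

best_k, best_score = None, -np.inf
for k in range(1, 6):
    # init_params="kmeans" initializes EM with K-Means;
    # n_init restarts EM from several initializations to reduce sensitivity.
    gmm = GaussianMixture(n_components=k, init_params="kmeans",
                          n_init=5, random_state=0).fit(X_train)
    score = gmm.score(X_val)   # mean log-likelihood on the validation data
    if score > best_score:
        best_k, best_score = k, score

print("chosen number of components:", best_k)
```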