Mixture of Gaussians Dataset
Develop a toy dataset of \(m = 1000\) sample points drawn from a Mixture of Gaussians (MoG):
- Feature dimensions: 2
- Number of Gaussian components: 3
- Means: random.
- Covariance matrices: diagonal.
- Create visualizations of the dataset (a generator sketch follows below).
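One possible generator, as a minimal sketch: it assumes NumPy and Matplotlib, and the mixing weights, mean range, and variance range below are arbitrary choices, not part of the specification.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)            # arbitrary seed for reproducibility

m, d, K = 1000, 2, 3                      # sample points, feature dims, components
pi = np.array([0.5, 0.3, 0.2])            # assumed mixing weights
means = rng.uniform(-5.0, 5.0, size=(K, d))      # random component means
variances = rng.uniform(0.3, 1.5, size=(K, d))   # diagonals of the covariances

z = rng.choice(K, size=m, p=pi)           # latent component of each point
X = means[z] + np.sqrt(variances[z]) * rng.standard_normal((m, d))

plt.scatter(X[:, 0], X[:, 1], c=z, s=8)
plt.scatter(means[:, 0], means[:, 1], c="red", marker="x", s=80)
plt.title("Toy MoG dataset: 3 components, 2 features")
plt.show()
```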
Gradient Formulas
We consider a 2-component Mixture of Gaussians (MoG) with 1-dimensional data. Show that the gradients needed to solve the estimation problem are as follows. Use LaTeX math in Markdown format for this task.
The likelihood for a point \(x_i\) is
\[ p(x_i \mid \pi, \mu, \sigma^2) = \pi_1 \, \mathcal{N}(x_i \mid \mu_1, \sigma_1^2) + \pi_2 \, \mathcal{N}(x_i \mid \mu_2, \sigma_2^2). \]
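For reference, the 1-dimensional Gaussian density used throughout this section is
\[ \mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left( -\frac{(x - \mu)^2}{2\sigma^2} \right). \]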
The log-likelihood for \(m\) data points is
\[ \mathcal{L} = \sum_{i=1}^m \log \left( \sum_{k=1}^2 \pi_k \, \mathcal{N}(x_i \mid \mu_k, \sigma_k^2) \right). \]
Define the responsibility of component \(k\) for data point \(i\):
\[ \gamma_{ik} = \frac{ \pi_k \, \mathcal{N}(x_i \mid \mu_k, \sigma_k^2) } { \sum_{j=1}^2 \pi_j \, \mathcal{N}(x_i \mid \mu_j, \sigma_j^2) }. \]
Then the gradients are:
Mean gradient: \[ \frac{\partial \mathcal{L}}{\partial \mu_k} = \sum_{i=1}^m \gamma_{ik} \, \frac{(x_i - \mu_k)}{\sigma_k^2}. \]
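One worked step showing where the mean gradient comes from: differentiating the log-sum and using \(\partial \mathcal{N}_k / \partial \mu_k = \mathcal{N}_k \, \partial \log \mathcal{N}_k / \partial \mu_k\),
\[ \frac{\partial \mathcal{L}}{\partial \mu_k} = \sum_{i=1}^m \frac{ \pi_k \, \mathcal{N}(x_i \mid \mu_k, \sigma_k^2) }{ \sum_{j=1}^2 \pi_j \, \mathcal{N}(x_i \mid \mu_j, \sigma_j^2) } \cdot \frac{x_i - \mu_k}{\sigma_k^2} = \sum_{i=1}^m \gamma_{ik} \, \frac{x_i - \mu_k}{\sigma_k^2}. \]
The variance and weight gradients follow the same pattern.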
Variance gradient: \[ \frac{\partial \mathcal{L}}{\partial \sigma_k^2} = \frac{1}{2} \sum_{i=1}^m \gamma_{ik} \left[ \frac{(x_i - \mu_k)^2}{\sigma_k^4} - \frac{1}{\sigma_k^2} \right]. \]
Mixture weights gradient (unconstrained partial derivative; the constraint \(\sum_k \pi_k = 1\) must still be enforced separately, e.g. via a Lagrange multiplier or a re-parameterization): \[ \frac{\partial \mathcal{L}}{\partial \pi_k} = \sum_{i=1}^m \frac{\gamma_{ik}}{\pi_k}. \]
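A quick way to validate such derivations is a finite-difference check; a minimal sketch for the mean gradient, with arbitrary sample data and parameter values:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=50)                                   # arbitrary 1-D data
pi = np.array([0.4, 0.6])
mu = np.array([-1.0, 1.0])
sig2 = np.array([0.5, 2.0])

def loglik(mu):
    dens = np.exp(-(x[:, None] - mu) ** 2 / (2 * sig2)) / np.sqrt(2 * np.pi * sig2)
    return np.log(dens @ pi).sum()

# Analytic gradient via the responsibilities.
dens = np.exp(-(x[:, None] - mu) ** 2 / (2 * sig2)) / np.sqrt(2 * np.pi * sig2)
gamma = (pi * dens) / (dens @ pi)[:, None]
grad_analytic = (gamma * (x[:, None] - mu) / sig2).sum(axis=0)

# Central finite differences.
eps = 1e-6
grad_numeric = np.array([
    (loglik(mu + eps * np.eye(2)[k]) - loglik(mu - eps * np.eye(2)[k])) / (2 * eps)
    for k in range(2)
])
print(np.allclose(grad_analytic, grad_numeric, atol=1e-5))  # expect True
```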
SGD from Scratch for 3-Component, 2-Feature MoG
Implement Stochastic Gradient Descent (SGD) from scratch with the Negative Log-Likelihood (NLL) objective and analytic derivatives to optimize the parameters.
Notes:
- Initialize covariance matrices as diagonal.
- For a mini-batch \(B\), provide expressions for the gradients of the mini-batch log-likelihood \(\mathcal{L}_B\), as given below; the NLL gradients used in the SGD updates are their negatives.
- Re-parameterize the variance as \(\log \sigma^2\) to keep the optimization stable and guarantee positive variances while applying SGD.
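Under that re-parameterization, the chain rule \(\frac{\partial \mathcal{L}}{\partial \log \sigma^2} = \sigma^2 \, \frac{\partial \mathcal{L}}{\partial \sigma^2}\) converts the variance gradient from the previous section into the form given below.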
Provide and use the following in your write-up/code:
Responsibility function
The responsibility function for component \(k\) and data point \(i\):
\[ \gamma_{ik} = \frac{ \pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k) } { \sum_{j=1}^K \pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j) }. \]
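A minimal NumPy sketch of this computation, done in log-space to avoid underflow; the function names and array shapes are illustrative assumptions:

```python
import numpy as np

def log_gauss_diag(X, mu, var):
    """Log-density of a diagonal-covariance Gaussian for each row of X (n, d)."""
    d = X.shape[1]
    return -0.5 * (d * np.log(2 * np.pi) + np.sum(np.log(var))
                   + np.sum((X - mu) ** 2 / var, axis=1))

def responsibilities(X, pi, means, variances):
    """Return gamma with gamma[i, k] as defined above; shape (n, K)."""
    log_p = np.stack([np.log(pi[k]) + log_gauss_diag(X, means[k], variances[k])
                      for k in range(len(pi))], axis=1)
    log_p -= log_p.max(axis=1, keepdims=True)   # log-sum-exp shift for stability
    gamma = np.exp(log_p)
    return gamma / gamma.sum(axis=1, keepdims=True)
```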
Mean gradient
For a mini-batch \(B\):
\[ \frac{\partial \mathcal{L}_B}{\partial \mu_k} = \sum_{i \in B} \gamma_{ik} \, \Sigma_k^{-1} (x_i - \mu_k). \]
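For diagonal covariances, \(\Sigma_k^{-1}(x_i - \mu_k)\) reduces to elementwise division by the variances; a vectorized sketch (shapes assumed: `Xb` is \((|B|, d)\), `gamma` is \((|B|, K)\), `means` and `variances` are \((K, d)\)):

```python
import numpy as np

def mean_grad(Xb, gamma, means, variances):
    """Mini-batch log-likelihood gradient w.r.t. each component mean (K, d)."""
    return np.einsum("ik,ikd->kd", gamma, (Xb[:, None, :] - means) / variances)
```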
Variance gradient
For diagonal covariance matrices:
\[ \frac{\partial \mathcal{L}_B}{\partial \log \sigma_{kd}^2} = \frac{1}{2} \sum_{i \in B} \gamma_{ik} \left[ \frac{(x_{id} - \mu_{kd})^2}{\sigma_{kd}^2} - 1 \right], \]
where \(d\) indexes the feature dimension.
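The corresponding sketch for the log-variance gradient, under the same shape assumptions:

```python
import numpy as np

def log_var_grad(Xb, gamma, means, variances):
    """Mini-batch gradient w.r.t. log sigma^2_{kd}, diagonal covariances (K, d)."""
    sq = (Xb[:, None, :] - means) ** 2 / variances          # (|B|, K, d)
    return 0.5 * np.einsum("ik,ikd->kd", gamma, sq - 1.0)
```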
Mixture weights gradient
With the softmax re-parameterization \(\pi_k = \exp(\alpha_k) / \sum_j \exp(\alpha_j)\), the unconstrained derivative \(\partial \mathcal{L}_B / \partial \pi_k = \sum_{i \in B} \gamma_{ik} / \pi_k\) combined with the chain rule gives the gradient with respect to the logits:
\[ \frac{\partial \mathcal{L}_B}{\partial \alpha_k} = \sum_{i \in B} \left( \gamma_{ik} - \pi_k \right). \]
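Putting the pieces together, a minimal SGD sketch that ascends the batch log-likelihood (equivalently, descends the NLL). It reuses `responsibilities`, `mean_grad`, and `log_var_grad` from the sketches above; the function name `fit_mog_sgd` and the learning rate, epoch count, and batch size are arbitrary choices:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def logit_grad(gamma, pi):
    """Batch log-likelihood gradient w.r.t. the softmax logits alpha_k."""
    return (gamma - pi).sum(axis=0)

def fit_mog_sgd(X, K, lr=0.01, epochs=50, batch=64, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    means = X[rng.choice(n, size=K, replace=False)].copy()  # init from data points
    log_var = np.zeros((K, d))                              # variances start at 1
    alpha = np.zeros(K)                                     # uniform mixture weights
    for _ in range(epochs):
        for idx in np.array_split(rng.permutation(n), max(n // batch, 1)):
            Xb = X[idx]
            pi, var = softmax(alpha), np.exp(log_var)
            gamma = responsibilities(Xb, pi, means, var)
            g_mu = mean_grad(Xb, gamma, means, var)
            g_lv = log_var_grad(Xb, gamma, means, var)
            g_al = logit_grad(gamma, pi)
            means += lr * g_mu          # ascent on the log-likelihood
            log_var += lr * g_lv
            alpha += lr * g_al
    return softmax(alpha), means, np.exp(log_var)
```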