Mixture of Gaussians Dataset
Develop a toy dataset of \(m = 1000\) sample points drawn from a Mixture of Gaussians (MoG):
- Feature dimensions: 2
- Number of Gaussian components: 3
- Means: random.
- Covariance matrices: diagonal.
- Create visualizations of the dataset (a generator sketch follows below).
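One possible generator, as a minimal sketch: it assumes NumPy and Matplotlib, and the mixing weights, mean range, and variance range below are arbitrary choices, not part of the specification.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)            # arbitrary seed for reproducibility

m, d, K = 1000, 2, 3                      # sample points, feature dims, components
pi = np.array([0.5, 0.3, 0.2])            # assumed mixing weights
means = rng.uniform(-5.0, 5.0, size=(K, d))      # random component means
variances = rng.uniform(0.3, 1.5, size=(K, d))   # diagonals of the covariances

z = rng.choice(K, size=m, p=pi)           # latent component of each point
X = means[z] + np.sqrt(variances[z]) * rng.standard_normal((m, d))

plt.scatter(X[:, 0], X[:, 1], c=z, s=8)
plt.scatter(means[:, 0], means[:, 1], c="red", marker="x", s=80)
plt.title("Toy MoG dataset: 3 components, 2 features")
plt.show()
```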
Gradient Formulas
We consider a 2-component Mixture of Gaussians (MoG) with 1-dimensional data. Show that the gradients needed to solve the estimation problem are as follows. Use LaTeX math in Markdown format for this task.
The likelihood for a point \(x_i\) is
\[ p(x_i \mid \pi, \mu, \sigma^2) = \pi_1 \, \mathcal{N}(x_i \mid \mu_1, \sigma_1^2) + \pi_2 \, \mathcal{N}(x_i \mid \mu_2, \sigma_2^2). \]
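For reference, the 1-dimensional Gaussian density used throughout this section is
\[ \mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left( -\frac{(x - \mu)^2}{2\sigma^2} \right). \]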
The log-likelihood for \(m\) data points is
\[ \mathcal{L} = \sum_{i=1}^m \log \left( \sum_{k=1}^2 \pi_k \, \mathcal{N}(x_i \mid \mu_k, \sigma_k^2) \right). \]
Define the responsibility of component \(k\) for data point \(i\):
\[ \gamma_{ik} = \frac{ \pi_k \, \mathcal{N}(x_i \mid \mu_k, \sigma_k^2) } { \sum_{j=1}^2 \pi_j \, \mathcal{N}(x_i \mid \mu_j, \sigma_j^2) }. \]
Then the gradients are:
Mean gradient: \[ \frac{\partial \mathcal{L}}{\partial \mu_k} = \sum_{i=1}^m \gamma_{ik} \, \frac{(x_i - \mu_k)}{\sigma_k^2}. \]
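One worked step showing where the mean gradient comes from: differentiating the log-sum and using \(\partial \mathcal{N}_k / \partial \mu_k = \mathcal{N}_k \, \partial \log \mathcal{N}_k / \partial \mu_k\),
\[ \frac{\partial \mathcal{L}}{\partial \mu_k} = \sum_{i=1}^m \frac{ \pi_k \, \mathcal{N}(x_i \mid \mu_k, \sigma_k^2) }{ \sum_{j=1}^2 \pi_j \, \mathcal{N}(x_i \mid \mu_j, \sigma_j^2) } \cdot \frac{x_i - \mu_k}{\sigma_k^2} = \sum_{i=1}^m \gamma_{ik} \, \frac{x_i - \mu_k}{\sigma_k^2}. \]
The variance and weight gradients follow the same pattern.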
Variance gradient: \[ \frac{\partial \mathcal{L}}{\partial \sigma_k^2} = \frac{1}{2} \sum_{i=1}^m \gamma_{ik} \left[ \frac{(x_i - \mu_k)^2}{\sigma_k^4} - \frac{1}{\sigma_k^2} \right]. \]
Mixture weights gradient (unconstrained partial derivative; the constraint \(\sum_k \pi_k = 1\) must still be enforced separately, e.g. via a Lagrange multiplier or a re-parameterization): \[ \frac{\partial \mathcal{L}}{\partial \pi_k} = \sum_{i=1}^m \frac{\gamma_{ik}}{\pi_k}. \]
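A quick way to validate such derivations is a finite-difference check; a minimal sketch for the mean gradient, with arbitrary sample data and parameter values:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=50)                                   # arbitrary 1-D data
pi = np.array([0.4, 0.6])
mu = np.array([-1.0, 1.0])
sig2 = np.array([0.5, 2.0])

def loglik(mu):
    dens = np.exp(-(x[:, None] - mu) ** 2 / (2 * sig2)) / np.sqrt(2 * np.pi * sig2)
    return np.log(dens @ pi).sum()

# Analytic gradient via the responsibilities.
dens = np.exp(-(x[:, None] - mu) ** 2 / (2 * sig2)) / np.sqrt(2 * np.pi * sig2)
gamma = (pi * dens) / (dens @ pi)[:, None]
grad_analytic = (gamma * (x[:, None] - mu) / sig2).sum(axis=0)

# Central finite differences.
eps = 1e-6
grad_numeric = np.array([
    (loglik(mu + eps * np.eye(2)[k]) - loglik(mu - eps * np.eye(2)[k])) / (2 * eps)
    for k in range(2)
])
print(np.allclose(grad_analytic, grad_numeric, atol=1e-5))  # expect True
```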
SGD from Scratch for 3-Component, 2-Feature MoG
Implement Stochastic Gradient Descent (SGD) from scratch with the Negative Log-Likelihood (NLL) objective and analytic derivatives to optimize the parameters.
Notes:
- Initialize covariance matrices as diagonal.
- For a mini-batch \(B\), provide expressions for the gradients of the mini-batch log-likelihood \(\mathcal{L}_B\), as given below; the NLL gradients used in the SGD updates are their negatives.
- Re-parameterize the variance as \(\log \sigma^2\) to keep the optimization stable and guarantee positive variances while applying SGD.
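Under that re-parameterization, the chain rule \(\frac{\partial \mathcal{L}}{\partial \log \sigma^2} = \sigma^2 \, \frac{\partial \mathcal{L}}{\partial \sigma^2}\) converts the variance gradient from the previous section into the form given below.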
Provide and use the following in your write-up/code:
Responsibility function
The responsibility function for component \(k\) and data point \(i\):
\[ \gamma_{ik} = \frac{ \pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k) } { \sum_{j=1}^K \pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j) }. \]
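A minimal NumPy sketch of this computation, done in log-space to avoid underflow; the function names and array shapes are illustrative assumptions:

```python
import numpy as np

def log_gauss_diag(X, mu, var):
    """Log-density of a diagonal-covariance Gaussian for each row of X (n, d)."""
    d = X.shape[1]
    return -0.5 * (d * np.log(2 * np.pi) + np.sum(np.log(var))
                   + np.sum((X - mu) ** 2 / var, axis=1))

def responsibilities(X, pi, means, variances):
    """Return gamma with gamma[i, k] as defined above; shape (n, K)."""
    log_p = np.stack([np.log(pi[k]) + log_gauss_diag(X, means[k], variances[k])
                      for k in range(len(pi))], axis=1)
    log_p -= log_p.max(axis=1, keepdims=True)   # log-sum-exp shift for stability
    gamma = np.exp(log_p)
    return gamma / gamma.sum(axis=1, keepdims=True)
```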
Mean gradient
For a mini-batch \(B\):
\[ \frac{\partial \mathcal{L}_B}{\partial \mu_k} = \sum_{i \in B} \gamma_{ik} \, \Sigma_k^{-1} (x_i - \mu_k). \]
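For diagonal covariances, \(\Sigma_k^{-1}(x_i - \mu_k)\) reduces to elementwise division by the variances; a vectorized sketch (shapes assumed: `Xb` is \((|B|, d)\), `gamma` is \((|B|, K)\), `means` and `variances` are \((K, d)\)):

```python
import numpy as np

def mean_grad(Xb, gamma, means, variances):
    """Mini-batch log-likelihood gradient w.r.t. each component mean (K, d)."""
    return np.einsum("ik,ikd->kd", gamma, (Xb[:, None, :] - means) / variances)
```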
Variance gradient
For diagonal covariance matrices:
\[ \frac{\partial \mathcal{L}_B}{\partial \log \sigma_{kd}^2} = \frac{1}{2} \sum_{i \in B} \gamma_{ik} \left[ \frac{(x_{id} - \mu_{kd})^2}{\sigma_{kd}^2} - 1 \right], \]
where \(d\) indexes the feature dimension.
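The corresponding sketch for the log-variance gradient, under the same shape assumptions:

```python
import numpy as np

def log_var_grad(Xb, gamma, means, variances):
    """Mini-batch gradient w.r.t. log sigma^2_{kd}, diagonal covariances (K, d)."""
    sq = (Xb[:, None, :] - means) ** 2 / variances          # (|B|, K, d)
    return 0.5 * np.einsum("ik,ikd->kd", gamma, sq - 1.0)
```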
Mixture weights gradient
With the softmax re-parameterization \(\pi_k = \exp(\alpha_k) / \sum_j \exp(\alpha_j)\), the unconstrained derivative \(\partial \mathcal{L}_B / \partial \pi_k = \sum_{i \in B} \gamma_{ik} / \pi_k\) combined with the chain rule gives the gradient with respect to the logits:
\[ \frac{\partial \mathcal{L}_B}{\partial \alpha_k} = \sum_{i \in B} \left( \gamma_{ik} - \pi_k \right). \]
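Putting the pieces together, a minimal SGD sketch that ascends the batch log-likelihood (equivalently, descends the NLL). It reuses `responsibilities`, `mean_grad`, and `log_var_grad` from the sketches above; the function name `fit_mog_sgd` and the learning rate, epoch count, and batch size are arbitrary choices:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def logit_grad(gamma, pi):
    """Batch log-likelihood gradient w.r.t. the softmax logits alpha_k."""
    return (gamma - pi).sum(axis=0)

def fit_mog_sgd(X, K, lr=0.01, epochs=50, batch=64, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    means = X[rng.choice(n, size=K, replace=False)].copy()  # init from data points
    log_var = np.zeros((K, d))                              # variances start at 1
    alpha = np.zeros(K)                                     # uniform mixture weights
    for _ in range(epochs):
        for idx in np.array_split(rng.permutation(n), max(n // batch, 1)):
            Xb = X[idx]
            pi, var = softmax(alpha), np.exp(log_var)
            gamma = responsibilities(Xb, pi, means, var)
            g_mu = mean_grad(Xb, gamma, means, var)
            g_lv = log_var_grad(Xb, gamma, means, var)
            g_al = logit_grad(gamma, pi)
            means += lr * g_mu          # ascent on the log-likelihood
            log_var += lr * g_lv
            alpha += lr * g_al
    return softmax(alpha), means, np.exp(log_var)
```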