Assignment 1 (grad)

In this assignment you will be working on extending maximum-likelihood (ML) estimation to probability distributions that can be modeled by Gaussian mixtures.

You are required to use NumPy and the PyTorch namespace libraries such as torch.linalg, torch.rand, and in general the libraries in the torch.* namespace. The idea is to implement the following from scratch without implementing every minute component, such as random number generators, plotting, etc.

If you are familiar with Keras but not PyTorch, the same rule applies.

Points:


Development Environment Setup

Ubuntu and MAC users

Install Docker on your system, along with the VSCode Docker and remote development extensions.

Windows users

  1. Install WSL2.

  2. Ensure that you also follow this tutorial to set up VSCode properly, i.e., so that VSCode can access the WSL2 filesystem and work with remote Docker containers.

  3. If you have an NVIDIA GPU in your system, ensure that you have enabled GPU support in WSL2.

All Users

Follow the instructions on the course site with respect to the course Docker container:

  1. Install Docker on your machine.
  2. Clone the repo. (Windows users: ensure that you clone it on the WSL2 filesystem.) Show this with a screenshot below of the terminal where you cloned the repo.
  3. Build and launch the Docker container inside your desired IDE (if you haven't used an IDE before, you can start with VSCode).
  4. Launch the virtual environment with make start inside the container, and then show a screenshot of your IDE and the terminal with the (your virtual env) prefix.
  5. Select the kernel of your virtual environment (the .venv folder) and execute the following code. Save the output of all cells of this notebook before submitting.

Mixture of Gaussians Dataset

Develop a toy dataset of \(m=1000\) sample points for a Mixture of Gaussians (MoG); a starter sketch follows the specification below:

  • Feature dimensions: 2
  • Number of Gaussian components: 3
  • Means: random.
  • Covariance matrices: diagonal.
  • Create visualizations for the dataset.
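A minimal sketch of how such a dataset could be generated and visualized, assuming matplotlib for plotting; the mixture weights, mean range, and standard-deviation range below are illustrative choices, not prescribed values:

```python
import torch
import matplotlib.pyplot as plt

torch.manual_seed(0)  # for reproducibility; any seed works

m, d, K = 1000, 2, 3                    # samples, feature dimensions, components
pi = torch.tensor([0.5, 0.3, 0.2])      # illustrative mixture weights (assumption)
mu = torch.rand(K, d) * 10.0            # random means in [0, 10)^d
sigma = torch.rand(K, d) * 0.9 + 0.1    # per-dimension std devs in [0.1, 1.0) -> diagonal covariance

# Draw a component index for each point, then sample from that Gaussian.
z = torch.multinomial(pi, m, replacement=True)   # (m,) component assignments
X = mu[z] + sigma[z] * torch.randn(m, d)         # (m, d) sample points

# Scatter plot colored by the true component.
for k in range(K):
    pts = X[z == k]
    plt.scatter(pts[:, 0], pts[:, 1], s=8, label=f"component {k}")
plt.legend()
plt.title("Toy MoG dataset")
plt.show()
```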

Gradient Formulas

We consider a 2-component Mixture of Gaussians (MoG) with 1-dimensional data. Show that the gradients you need for solving the estimation problem are as follows. Use LaTeX math in Markdown format for this task.

The likelihood for a point \(x_i\) is

\[ p(x_i \mid \pi, \mu, \sigma^2) = \pi_1 \, \mathcal{N}(x_i \mid \mu_1, \sigma_1^2) + \pi_2 \, \mathcal{N}(x_i \mid \mu_2, \sigma_2^2). \]

The log-likelihood for \(m\) data points is

\[ \mathcal{L} = \sum_{i=1}^m \log \left( \sum_{k=1}^2 \pi_k \, \mathcal{N}(x_i \mid \mu_k, \sigma_k^2) \right). \]

Define the responsibility of component \(k\) for data point \(i\):

\[ \gamma_{ik} = \frac{ \pi_k \, \mathcal{N}(x_i \mid \mu_k, \sigma_k^2) } { \sum_{j=1}^2 \pi_j \, \mathcal{N}(x_i \mid \mu_j, \sigma_j^2) }. \]

Then the gradients are (a derivation hint and a quick numerical check follow this list):

  • Mean gradient: \[ \frac{\partial \mathcal{L}}{\partial \mu_k} = \sum_{i=1}^m \gamma_{ik} \, \frac{(x_i - \mu_k)}{\sigma_k^2}. \]

  • Variance gradient: \[ \frac{\partial \mathcal{L}}{\partial \sigma_k^2} = \frac{1}{2} \sum_{i=1}^m \gamma_{ik} \left[ \frac{(x_i - \mu_k)^2}{\sigma_k^4} - \frac{1}{\sigma_k^2} \right]. \]

  • Mixture weights gradient (with constraint \(\sum_k \pi_k = 1\)): \[ \frac{\partial \mathcal{L}}{\partial \pi_k} = \sum_{i=1}^m \frac{\gamma_{ik}}{\pi_k}. \]
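As a hint for the derivation: when differentiating \(\mathcal{L}\) with respect to \(\mu_k\), only the \(k\)-th term of the inner sum depends on \(\mu_k\), and \(\frac{\partial}{\partial \mu_k} \mathcal{N}(x_i \mid \mu_k, \sigma_k^2) = \mathcal{N}(x_i \mid \mu_k, \sigma_k^2) \, \frac{x_i - \mu_k}{\sigma_k^2}\), so

\[ \frac{\partial \mathcal{L}}{\partial \mu_k} = \sum_{i=1}^m \frac{ \pi_k \, \frac{\partial}{\partial \mu_k} \mathcal{N}(x_i \mid \mu_k, \sigma_k^2) }{ \sum_{j=1}^2 \pi_j \, \mathcal{N}(x_i \mid \mu_j, \sigma_j^2) } = \sum_{i=1}^m \gamma_{ik} \, \frac{x_i - \mu_k}{\sigma_k^2}. \]

The variance and weight gradients follow the same pattern. The results can also be sanity-checked numerically against autograd; a minimal sketch, assuming a tiny random 1-D sample (the \(\pi\) gradient is skipped here because autograd sees the constrained parameterization \(\pi_2 = 1 - \pi_1\)):

```python
import math
import torch

torch.manual_seed(0)
x = torch.randn(5)                               # tiny 1-D sample (assumption)
pi1 = torch.tensor(0.4, requires_grad=True)      # pi_2 = 1 - pi_1
mu = torch.randn(2, requires_grad=True)          # component means
var = (torch.rand(2) + 0.5).requires_grad_()     # component variances sigma^2

pi = torch.stack([pi1, 1 - pi1])
comp = torch.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / torch.sqrt(2 * math.pi * var)
L = torch.log((pi * comp).sum(dim=1)).sum()      # log-likelihood as defined above
L.backward()

with torch.no_grad():
    gamma = pi * comp / (pi * comp).sum(dim=1, keepdim=True)   # responsibilities
    g_mu = (gamma * (x[:, None] - mu) / var).sum(dim=0)        # analytic mean gradient
    g_var = (0.5 * gamma * ((x[:, None] - mu) ** 2 / var ** 2 - 1 / var)).sum(dim=0)
    # Both comparisons should print True.
    print(torch.allclose(mu.grad, g_mu), torch.allclose(var.grad, g_var))
```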

SGD from Scratch for 3-Component, 3-Feature MoG

Implement Stochastic Gradient Descent (SGD) from scratch with the Negative Log-Likelihood (NLL) objective and analytic derivatives to optimize the parameters.

Notes:

  • Initialize covariance matrices as diagonal.
  • For a mini-batch \(B\), provide expressions for the gradients below.
  • Re-parameterize the variance as \(\log \sigma^2\) to keep the model stable and avoid invalid (non-positive) variances while applying SGD.

Provide and use the following in your write-up/code:

Responsibility function

The responsibility function for component \(k\) and data point \(i\):

\[ \gamma_{ik} = \frac{ \pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k) } { \sum_{j=1}^K \pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j) }. \]

Mean gradient

For a mini-batch \(B\):

\[ \frac{\partial \mathcal{L}_B}{\partial \mu_k} = \sum_{i \in B} \gamma_{ik} \, \Sigma_k^{-1} (x_i - \mu_k). \]

Variance gradient

For diagonal covariance matrices:

\[ \frac{\partial \mathcal{L}_B}{\partial \log \sigma_{kd}^2} = \frac{1}{2} \sum_{i \in B} \gamma_{ik} \left[ \frac{(x_{id} - \mu_{kd})^2}{\sigma_{kd}^2} - 1 \right], \]

where \(d\) indexes the feature dimension; this follows from the 1-D variance gradient via the chain rule, since \(\partial \mathcal{L}_B / \partial \log \sigma_{kd}^2 = \sigma_{kd}^2 \, \partial \mathcal{L}_B / \partial \sigma_{kd}^2\).

Mixture weights gradient

For the mixture weights, the gradient with respect to \(\pi_k\) directly is

\[ \frac{\partial \mathcal{L}_B}{\partial \pi_k} = \sum_{i \in B} \frac{\gamma_{ik}}{\pi_k}. \]

With the softmax re-parameterization \(\pi_k = e^{\alpha_k} / \sum_j e^{\alpha_j}\), the chain rule gives the gradient with respect to the unconstrained logits \(\alpha_k\):

\[ \frac{\partial \mathcal{L}_B}{\partial \alpha_k} = \sum_{i \in B} \left( \gamma_{ik} - \pi_k \right). \]
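Putting the formulas together, here is a minimal sketch of the full SGD loop. The gradient code follows the expressions above (negated and batch-averaged, since we minimize the NLL); the synthetic data, initialization scheme, learning rate, batch size, and epoch count are illustrative assumptions, not prescribed values:

```python
import math
import torch

torch.manual_seed(0)
K, D, m = 3, 3, 1000
lr, epochs, batch = 0.05, 200, 64        # illustrative hyperparameters (assumption)

# Synthetic training data from a hypothetical ground-truth MoG (replace with your dataset).
true_mu = torch.randn(K, D) * 3.0
X = torch.cat([true_mu[k] + torch.randn(m // K, D) for k in range(K)])

# Parameters: softmax logits, means, log-variances (diagonal covariance).
alpha = torch.zeros(K)
mu = X[torch.randperm(len(X))[:K]].clone()   # initialize means at random data points
log_var = torch.zeros(K, D)                  # sigma^2 = 1, i.e. identity diagonal covariance

def stats(xb):
    """Responsibilities gamma (B, K) and mean NLL for a mini-batch xb of shape (B, D)."""
    pi = torch.softmax(alpha, dim=0)
    diff = xb[:, None, :] - mu[None, :, :]                                # (B, K, D)
    log_comp = -0.5 * (diff**2 / log_var.exp() + log_var + math.log(2 * math.pi)).sum(-1)
    log_joint = torch.log(pi) + log_comp                                  # (B, K)
    return torch.softmax(log_joint, dim=1), -torch.logsumexp(log_joint, dim=1).mean()

for epoch in range(epochs):
    for idx in torch.randperm(len(X)).split(batch):
        xb = X[idx]
        gamma, nll = stats(xb)
        pi, var = torch.softmax(alpha, dim=0), log_var.exp()
        diff = xb[:, None, :] - mu[None, :, :]
        # Analytic gradients of the batch NLL: negatives of the formulas above, batch-averaged.
        g_alpha = -(gamma - pi).sum(0) / len(xb)
        g_mu = -(gamma[:, :, None] * diff / var).sum(0) / len(xb)
        g_log_var = -0.5 * (gamma[:, :, None] * (diff**2 / var - 1.0)).sum(0) / len(xb)
        alpha -= lr * g_alpha
        mu -= lr * g_mu
        log_var -= lr * g_log_var
    if epoch % 50 == 0:
        print(f"epoch {epoch}: batch NLL {nll.item():.3f}")
```

Because the updates act on \(\alpha_k\) and \(\log \sigma_{kd}^2\) rather than on \(\pi_k\) and \(\sigma_{kd}^2\), the constraints \(\sum_k \pi_k = 1\) and \(\sigma_{kd}^2 > 0\) hold automatically after every step, with no projection needed.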

References and Review Materials

Deliverables Checklist
