Assignment 1 (undergrad)

In this assignment you will use Monte Carlo simulation to demonstrate the curse of dimensionality and perform a logistic regression classification task with stochastic gradient descent (SGD).

You are required to use NumPy and the PyTorch namespace libraries such as torch.linalg and torch.rand, and in general libraries in the torch.xyz namespace. The idea is to implement the following from scratch, without implementing every minute component such as random number generators, plotting, etc.

If you are familiar with Keras rather than PyTorch, the same rule applies.

Points:

Development Environment: 10 points
Simulation: 25 points
Logistic Regression: 50 points
Code commented throughout, either inline or via markdown cells: 15 points

Development Environment Setup

Ubuntu and macOS users

Install Docker on your system, along with the VSCode Docker and Remote extensions.

Windows users

Install WSL2.
Ensure that you also follow this tutorial to set up VSCode properly, i.e., so that VSCode can access the WSL2 filesystem and work with the remote Docker containers.
If you have an NVIDIA GPU in your system, ensure you have enabled it.

All Users

Follow the instructions on the course site for the course Docker container:

Install Docker on your machine.
Clone the repo (Windows users: ensure that you clone it onto the WSL2 filesystem). Show this with a screenshot below of the terminal where you cloned the repo.
Build and launch the Docker container inside your desired IDE (if you haven't used an IDE before, you can start with VSCode).
Launch the virtual environment with make start inside the container, then show a screenshot of your IDE and the terminal with the (your virtual env) prefix.
Select the kernel of your virtual environment (.venv folder) and execute the following code. Save the output of all cells of this notebook before submitting.


Simulation - Sparsity in High Dimensions

Use Monte Carlo experiments to show that, for a fixed number of samples \(m\), data become effectively “sparser” as the input dimension \(n\) grows.

For each \(n \in \{1,2,5,10,20,50,100\}\), draw \(m\) i.i.d. points \(x_1,\dots,x_m \sim \mathrm{Unif}([0,1]^n)\).

Compute the nearest-neighbor (NN) distance for each point, then report and plot the mean NN distance as a function of \(n\).
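One possible shape for this experiment is sketched below in NumPy. It is a minimal illustration, not a reference solution: the sample count m = 500, the number of repeated trials, the seed, and the helper name mean_nn_distance are all assumptions made for this example, and plotting is left out.

```python
import numpy as np

def mean_nn_distance(m, n, rng):
    """Mean nearest-neighbor distance among m points drawn uniformly from [0, 1]^n."""
    x = rng.random((m, n))
    # Pairwise squared distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    sq = np.sum(x * x, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (x @ x.T)
    np.maximum(d2, 0.0, out=d2)     # clip tiny negatives caused by round-off
    np.fill_diagonal(d2, np.inf)    # a point is not its own neighbor
    return float(np.sqrt(d2.min(axis=1)).mean())

rng = np.random.default_rng(0)      # seed chosen arbitrarily (assumption)
m = 500                             # assumed fixed sample count
trials = 5                          # assumed number of Monte Carlo repetitions
dims = [1, 2, 5, 10, 20, 50, 100]

# Average the per-trial mean NN distance for each dimension n.
mean_nn = {
    n: float(np.mean([mean_nn_distance(m, n, rng) for _ in range(trials)]))
    for n in dims
}

for n in dims:
    print(n, mean_nn[n])
```

With m held fixed, the printed values should grow with n, which is the sparsity effect the plot is meant to show. The same structure carries over to PyTorch by swapping the array calls for their torch equivalents.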


Logistic Regression

You are interviewing with Google’s ad team, and one of their tasks is predicting the click-through rate (CTR) of ads they place on web or mobile properties. Your hiring manager, keen on testing you out, suggests that you download this dataset and asks you to code up a model that predicts the CTR based on logistic regression.

Data Preprocessing

Preprocess the data you are given to your liking. This may include dropping columns you won't use, addressing noisy or missing data, etc.

Use Pandas as a dataframe abstraction for this task; you can easily convert dataframes to PyTorch tensors for later processing. You can learn about Pandas here:
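As an illustration of the kind of preprocessing described above, the sketch below works on a small made-up frame. The column names (ad_id, device, hour, click) and the imputation choices are hypothetical and are not taken from the actual dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real dataset (all columns are assumptions).
df = pd.DataFrame({
    "ad_id": [101, 102, 103, 104],
    "device": ["mobile", "web", None, "mobile"],
    "hour": [9.0, np.nan, 21.0, 14.0],
    "click": [0, 1, 0, 1],
})

df = df.drop(columns=["ad_id"])                          # drop an identifier column
df["hour"] = df["hour"].fillna(df["hour"].median())      # impute a missing numeric value
df["device"] = df["device"].fillna("unknown")            # mark a missing category explicitly
df = pd.get_dummies(df, columns=["device"], dtype=float) # one-hot encode the categorical column

# Separate features and labels as float32 arrays.
X = df.drop(columns=["click"]).to_numpy(dtype=np.float32)
y = df["click"].to_numpy(dtype=np.float32)
print(X.shape, y.shape)
```

The resulting NumPy arrays can be handed to torch.from_numpy to obtain tensors for the later training steps.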

SGD

Implement a logistic regression solution to the prediction problem that is trained with Stochastic Gradient Descent (SGD).
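A minimal mini-batch SGD loop for logistic regression might look like the sketch below. It runs on synthetic data rather than the CTR dataset, and the function names, learning rate, batch size, and epoch count are assumptions chosen for illustration only.

```python
import numpy as np

def sigmoid(z):
    """Logistic function mapping real scores to probabilities in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def sgd_logistic(X, y, lr=0.1, epochs=20, batch_size=32, seed=0):
    """Fit weights w and bias b by mini-batch SGD on the mean cross-entropy loss."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    w = np.zeros(n)
    b = 0.0
    for _ in range(epochs):
        order = rng.permutation(m)              # reshuffle the samples every epoch
        for start in range(0, m, batch_size):
            idx = order[start:start + batch_size]
            p = sigmoid(X[idx] @ w + b)         # predicted click probabilities
            err = p - y[idx]                    # gradient of the loss w.r.t. the logit
            w -= lr * (X[idx].T @ err) / len(idx)
            b -= lr * err.mean()
    return w, b

# Synthetic data with known weights (assumption, for demonstration only).
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = (rng.random(1000) < sigmoid(X @ true_w)).astype(float)

w, b = sgd_logistic(X, y)
accuracy = float(((sigmoid(X @ w + b) > 0.5) == (y == 1)).mean())
print(w, b, accuracy)
```

On the synthetic data the learned weights should point in roughly the same direction as true_w. Enhancements such as feature scaling, regularization, or a learning-rate schedule would be layered on top of this loop.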

Show clearly all equations of the gradient and include comments in markdown explaining every stage of processing. Also, highlight any enhancements you may have made to improve performance.
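For reference, the standard per-example gradient of the binary cross-entropy loss is written out below. The symbols \(w\), \(b\), \(\eta\), and \(\hat{y}_i\) are notation assumed for this sketch, not fixed by the assignment; a mini-batch version averages these gradients over the batch.

```latex
% Model: predicted click probability for example x_i \in \mathbb{R}^n
\hat{y}_i = \sigma(w^\top x_i + b), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}

% Per-example binary cross-entropy loss with label y_i \in \{0, 1\}
\ell_i(w, b) = -\bigl[\, y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \,\bigr]

% Gradients, using \sigma'(z) = \sigma(z)\,(1 - \sigma(z))
\nabla_w \ell_i = (\hat{y}_i - y_i)\, x_i, \qquad
\frac{\partial \ell_i}{\partial b} = \hat{y}_i - y_i

% SGD update with learning rate \eta
w \leftarrow w - \eta\, \nabla_w \ell_i, \qquad
b \leftarrow b - \eta\, \frac{\partial \ell_i}{\partial b}
```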

Plot the final precision vs recall curve of your classifier. Clearly explain the tradeoff between the two quantities and the shape of the curve.
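The points of a precision-recall curve can be computed by sweeping the decision threshold over the sorted scores, as in the sketch below. The labels and scores are made-up toy values, pr_curve is a hypothetical helper name, and for simplicity each sample is treated as its own threshold, so tied scores are not grouped.

```python
import numpy as np

def pr_curve(y_true, scores):
    """Precision and recall at every threshold, sweeping from the highest score down."""
    order = np.argsort(-scores, kind="stable")   # sort by descending score
    y_sorted = y_true[order]
    tp = np.cumsum(y_sorted)                     # true positives if we cut after each sample
    fp = np.cumsum(1 - y_sorted)                 # false positives at the same cut
    precision = tp / (tp + fp)
    recall = tp / y_true.sum()
    return precision, recall, scores[order]

# Toy labels and classifier scores (assumptions, for illustration only).
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1])

precision, recall, thresholds = pr_curve(y_true, scores)
for p, r, t in zip(precision, recall, thresholds):
    print(f"threshold={t:.2f} precision={p:.3f} recall={r:.3f}")
```

Lowering the threshold can only keep or raise recall, while precision tends to fall as more negatives are admitted, though not monotonically. That is why the curve usually has a jagged, downward-sloping shape.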
