Video Search#
Step 1: Video library (10 points)#
Write a Python API that downloads a video and its closed captions from YouTube. Make sure to document how one can use your API. Use it to download the following video from YouTube along with its captions:
https://www.youtube.com/watch?v=wbWRWeVe1XE
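One way to sketch this API is as a thin wrapper over the `yt-dlp` command-line tool (assuming it is installed); the output template, subtitle language, and function names below are illustrative choices, not requirements:

```python
import subprocess

def build_cmd(url: str, out_dir: str) -> list[str]:
    """Build a yt-dlp command that fetches a video plus its closed captions."""
    return [
        "yt-dlp",
        "--write-subs",       # human-made captions, if available
        "--write-auto-subs",  # auto-generated captions as a fallback
        "--sub-langs", "en",
        "-f", "mp4",
        "-o", f"{out_dir}/%(id)s.%(ext)s",  # save as <video id>.<extension>
        url,
    ]

def download(url: str, out_dir: str = "videos") -> None:
    """Download one video and its captions into out_dir."""
    subprocess.run(build_cmd(url, out_dir), check=True)

if __name__ == "__main__":
    download("https://www.youtube.com/watch?v=wbWRWeVe1XE")
```

Usage: `download(url)` drops the MP4 and an accompanying `.vtt` caption file into `videos/`.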
Step 2: Video indexing pipeline (90 points)#
In this step you will build and train your models to extract embeddings for the frames of your videos and store the extracted information in a database such as postgres for indexing the videos.
2.1 Preprocess the video (15 points)#
You can use OpenCV, FFmpeg, GStreamer, https://pytorchvideo.org/ or any other library to implement the preprocessing steps as shown below:
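For instance, a minimal preprocessing step with the FFmpeg CLI (assumed installed) can sample frames at a reduced rate; the sampling rate and output naming below are assumptions you can adjust:

```python
import subprocess
from pathlib import Path

def extract_frames_cmd(video: str, out_dir: str, fps: int = 1) -> list[str]:
    """FFmpeg command that samples `fps` frames per second as JPEGs."""
    return [
        "ffmpeg", "-i", video,
        "-vf", f"fps={fps}",   # downsample the frame rate
        "-q:v", "2",           # high JPEG quality
        f"{out_dir}/frame_%06d.jpg",
    ]

def extract_frames(video: str, out_dir: str = "frames", fps: int = 1) -> None:
    """Extract sampled frames from `video` into `out_dir`."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(extract_frames_cmd(video, out_dir, fps), check=True)
```

Sampling one frame per second also keeps the later detection and embedding steps tractable.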
2.2 Detecting objects (25 points)#
Use any of the pretrained object detectors to detect objects belonging to the MS COCO classes. For each video and each frame where a detection is found, compile and report the results in the following tabular structure:
[vidId, frameNum, timestamp, detectedObjId, detectedObjClass, confidence, bbox info]
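As one possible setup (not the only valid one), a COCO-pretrained detector from torchvision can produce the raw boxes, labels, and scores, and a small helper can turn each frame's output into the rows above; `detections_to_rows` and the 0.5 confidence threshold are illustrative assumptions:

```python
def detections_to_rows(vid_id, frame_num, timestamp, boxes, labels, scores,
                       class_names, conf_thresh=0.5):
    """Convert one frame's raw detector output into rows of the form
    (vidId, frameNum, timestamp, detectedObjId, detectedObjClass, confidence, bbox)."""
    rows = []
    obj_id = 0
    for box, label, score in zip(boxes, labels, scores):
        if score < conf_thresh:  # drop low-confidence detections
            continue
        rows.append((vid_id, frame_num, timestamp, obj_id,
                     class_names[label], round(score, 3), tuple(box)))
        obj_id += 1
    return rows

def load_detector():
    """Load a COCO-pretrained detector (assumes torchvision is installed)."""
    import torchvision
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
    model.eval()
    return model
```

Running `load_detector()` on each sampled frame and feeding its outputs to `detections_to_rows` yields exactly the tabular records to be stored later.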
Feel free to fine-tune your detectors if required. If you notice that your model performs better on a different video from this channel, document this and use that video for the rest of the assignment.
2.3 Embedding model (30 points)#
Develop a convolutional autoencoder such as the one described here whose input will be all the objects detected in each frame (not the entire frame!), if any. Note: you can downsample the frame rate of your original video to avoid long training/processing times. For a given input image, the autoencoder should output its small vector embedding.
Train your autoencoder on the COCO dataset for classes which get detected in the given list of videos.
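A possible shape for such a model, assuming PyTorch and object crops resized to 64x64 RGB (the crop size, channel widths, and 128-dimensional embedding are all assumptions), is sketched below; `conv_out_size` is the standard formula for the spatial size after a convolution:

```python
def conv_out_size(n, kernel, stride, padding):
    """Spatial size after a conv layer: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * padding - kernel) // stride + 1

def build_autoencoder(embed_dim=128):
    """Sketch of a small conv autoencoder (assumes PyTorch is installed).
    Three stride-2 convs take a 64x64 crop down to 8x8 before the bottleneck."""
    import torch.nn as nn
    encoder = nn.Sequential(
        nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),   # 64 -> 32
        nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),  # 32 -> 16
        nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),  # 16 -> 8
        nn.Flatten(),
        nn.Linear(64 * 8 * 8, embed_dim),                      # the small vector embedding
    )
    decoder = nn.Sequential(
        nn.Linear(embed_dim, 64 * 8 * 8), nn.ReLU(),
        nn.Unflatten(1, (64, 8, 8)),
        nn.ConvTranspose2d(64, 32, 3, stride=2, padding=1, output_padding=1), nn.ReLU(),
        nn.ConvTranspose2d(32, 16, 3, stride=2, padding=1, output_padding=1), nn.ReLU(),
        nn.ConvTranspose2d(16, 3, 3, stride=2, padding=1, output_padding=1), nn.Sigmoid(),
    )
    return encoder, decoder
```

Training would minimize a reconstruction loss (e.g. MSE) between the decoder's output and the input crop; at indexing time only the encoder's output vector is stored.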
Extra credit (10 points)#
Extra credit: if you want to maximize the possibility of developing something new, think about how a video can be better segmented into representative frames. For example, in this ~3 min video (accessed Nov 2023) you have multiple scenes, each lasting 30 s or so. Can you find a way to segment each video and store the frame embeddings of each segment? This way there are multiple embeddings per video, and you need to keep them that way for the subsequent steps of this project.
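One classic, simple approach to scene segmentation (offered here only as a starting point, with hypothetical function names and an arbitrary 0.5 threshold) is to compare color histograms of consecutive sampled frames and declare a cut where the distance spikes:

```python
def hist_distance(h1, h2):
    """L1 distance between two normalized histograms (lists of bin counts)."""
    s1, s2 = sum(h1), sum(h2)
    return sum(abs(a / s1 - b / s2) for a, b in zip(h1, h2))

def find_scene_cuts(histograms, threshold=0.5):
    """Return indices of sampled frames where a new scene likely starts.
    `histograms` holds one color histogram per sampled frame, in order."""
    cuts = [0]  # the first frame always opens a segment
    for i in range(1, len(histograms)):
        if hist_distance(histograms[i - 1], histograms[i]) > threshold:
            cuts.append(i)
    return cuts
```

Each segment between consecutive cuts can then keep its own frame embeddings, giving multiple embeddings per video as the extra-credit task asks.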
Indexing the embeddings (20 points)#
Use docker compose to bring up two Docker containers: your application container with the dev environment (you must have done this in Step 1) and a second container with Postgres.
docker pull postgres:latest
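A compose file for this two-container setup might look like the sketch below; the service names, password, and the `pgvector/pgvector` image (a Postgres build that ships with the pgvector extension, convenient for the indexing step) are assumptions you can swap for `postgres:latest` plus a manual extension install:

```yaml
services:
  app:
    build: .                       # your Step 1 dev-environment image
    depends_on:
      - db
  db:
    image: pgvector/pgvector:pg16  # Postgres with pgvector preinstalled
    environment:
      POSTGRES_PASSWORD: example   # placeholder credential
    ports:
      - "5432:5432"
```

With this file in place, `docker compose up -d` starts both containers.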
Process all the detected object sub-images for each frame of each video to compile your final results in the following tabular structure:
[vidId, frameNum, timestamp, detectedObjId, detectedObjClass, confidence, bbox info, vector]
Index the video image embedding vectors in the database. To do that in Postgres (with the pgvector extension) you can use this guide.
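As a sketch of the SQL involved (the `detections` table and `embedding` column names are hypothetical, chosen to match the tabular structure above), pgvector's `<->` L2-distance operator drives the nearest-neighbor query, and an IVFFlat index accelerates it:

```python
def knn_query(table="detections", vector_col="embedding", k=10):
    """SQL returning the k stored embeddings nearest to a query vector;
    the query vector is bound as a parameter (%s) by the database driver."""
    return (
        f"SELECT vidId, frameNum, detectedObjClass, {vector_col} <-> %s AS dist "
        f"FROM {table} ORDER BY dist LIMIT {k};"
    )

def ivfflat_index(table="detections", vector_col="embedding", lists=100):
    """SQL creating a pgvector IVFFlat index for L2 distance."""
    return (
        f"CREATE INDEX ON {table} USING ivfflat "
        f"({vector_col} vector_l2_ops) WITH (lists = {lists});"
    )
```

Executed through any Postgres driver, `knn_query(k=10)` returns exactly the top-10 matches the search demonstration below calls for.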
Demonstrate that you can search the database using image queries, and post screenshots of your search results; they must include the 10 most similar images across the input videos.