Chat with your Video Library
In this project you will be exposed to the engineering aspects of Retrieval-Augmented Generation (RAG) systems, as presented in this key book reference.
RAG combines the strengths of retrieval-based and generation-based models, enabling a system to retrieve relevant information from a large corpus and generate a coherent, domain-specific response.
Shortcuts such as heavy use of LangChain
or other bloated high-level RAG frameworks will not be accepted as solutions to this project. You need to understand the components of the RAG system and how they interact with each other.
Don’t attempt this project before reading the book’s overview sections on RAG; you also need to clone the book’s repository if you plan to implement Option 1 below.
Project Goals and Strategy
In our case the task is to build a chatbot application where students can ask questions about the course, and the bot generates responses grounded in specific video segments (clips) from the course YouTube channel.
The individual components of a general RAG system are shown in Figure 1.

You have two options to deal with the complexity of the project:
Option 1
You will integrate additional elements into the book’s repository, extending the components shown in Figure 1.
Option 2
You will use the book’s repository as a reference and implement the components of the RAG system from scratch. You can start from a canonical RAG implementation; this is just an example, and you are free to use any other RAG framework you know.
Either of the two options is acceptable. The only requirement is that you must be able to show a working RAG system at the end of the project. Your strategy should be to split the project into multiple milestones and to have a proof of concept (POC) with all the components interacting with each other, even if the end-to-end flow initially fails. Do not spend so much time on any one component that you have nothing to show by the project deadline; each iteration should improve the quality of the components. Although there are no per-milestone deadlines, we can guarantee that you will be unable to finish the project if you start late or cannot keep up.
Data Collection Pipeline (ETL) Milestone
Here you will use the Hugging Face dataset repository and develop a pipeline that streams the video dataset stored in WebDataset format, decodes its frames and other metadata as described here, and stores them in a MongoDB database.
It is critical that each image stored in MongoDB is aligned with its corresponding subtitle. It is also important to understand how to remove redundancy, such as multiple images in a video showing the same content.
We are in the process of populating the streamable webdataset.
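A minimal sketch of such a pipeline, assuming the shards live on the Hugging Face Hub and each sample carries `jpg` and `txt` entries (the URL pattern and key names below are placeholders; check the dataset card for the actual layout):

```python
"""ETL sketch: stream a WebDataset, decode frames and subtitles, drop
near-duplicate frames, and store the rest in MongoDB."""
import io

import imagehash          # pip install ImageHash
import webdataset as wds  # pip install webdataset
from PIL import Image
from pymongo import MongoClient

# Placeholder shard URL pattern; replace with the real dataset location.
SHARDS = "https://huggingface.co/datasets/<org>/<name>/resolve/main/shard-{000000..000009}.tar"

frames = MongoClient("mongodb://localhost:27017")["course_videos"]["frames"]

seen_hashes = set()
for sample in wds.WebDataset(SHARDS):
    image = Image.open(io.BytesIO(sample["jpg"]))
    # A perceptual hash maps near-identical frames (e.g. a slide shown for
    # 30 seconds) to the same value, which is one way to remove redundancy.
    phash = str(imagehash.average_hash(image))
    if phash in seen_hashes:
        continue
    seen_hashes.add(phash)
    frames.insert_one({
        "key": sample["__key__"],                  # keeps frame/subtitle alignment
        "image": sample["jpg"],                    # raw JPEG bytes
        "subtitle": sample["txt"].decode("utf-8"),
        "phash": phash,
    })
```

Storing the WebDataset sample key alongside both the image bytes and the subtitle is what keeps the two aligned, which the milestone above calls critical.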
Finetuning Milestone
Follow the instructions from a similar finetuning exercise to finetune the model to the required domain: captioned handwritten images. You are free to use other baseline VLMs (Vision Language Models), such as Gemma 3, depending on your hardware.
PS: The linked finetuning tutorial should work on the free Colab tier, but you may consider paying Google for one month (or using another provider) to avoid issues during finetuning.
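For orientation only, a minimal LoRA setup sketch follows; it is not the linked tutorial’s exact recipe, and the model id, class names, and target modules are assumptions to verify against a recent transformers/peft install:

```python
"""Finetuning sketch: load a baseline VLM and attach LoRA adapters.
Data preparation and the training loop follow the linked tutorial."""
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForImageTextToText, AutoProcessor

MODEL_ID = "google/gemma-3-4b-it"  # assumption: any baseline VLM your hardware supports

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForImageTextToText.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

# LoRA freezes the base weights and trains small adapter matrices instead,
# which is what makes finetuning feasible on a single (free-tier) Colab GPU.
lora = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # assumption: inspect the model for real names
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()
# From here, build the captioned-image dataset and train, e.g. with TRL's SFTTrainer.
```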
Featurization Pipeline Milestone
Implement the featurization pipeline that converts the raw data (images and subtitles) into a format that can be used by the RAG model. The featurization pipeline should store the featurized data as vectors in Qdrant.
Baseline Featurization and Retrieval Pipeline
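One possible sketch of the baseline pipeline, using sentence-transformers to embed each subtitle and the qdrant-client API to store the vectors (the collection name, embedding model, and payload fields below are illustrative choices, not requirements):

```python
"""Featurization sketch: embed subtitles pulled from MongoDB and upsert them
into a Qdrant collection. Embedding the frames too (e.g. with CLIP) is a
natural extension."""
from pymongo import MongoClient
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dim text embeddings
qdrant = QdrantClient(url="http://localhost:6333")
frames = MongoClient("mongodb://localhost:27017")["course_videos"]["frames"]

if not qdrant.collection_exists("video_segments"):
    qdrant.create_collection(
        collection_name="video_segments",
        vectors_config=VectorParams(size=384, distance=Distance.COSINE),
    )

points = [
    PointStruct(
        id=i,
        vector=encoder.encode(doc["subtitle"]).tolist(),
        # The payload lets retrieval results point back to the MongoDB record
        # and hence to the source video segment.
        payload={"key": doc["key"], "subtitle": doc["subtitle"]},
    )
    for i, doc in enumerate(frames.find({}, {"key": 1, "subtitle": 1}))
]
qdrant.upsert(collection_name="video_segments", points=points)  # batch for large datasets
```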
Retrieval Milestone
Implement the retrieval pipeline that will accept a query such as “explain how ResNets work” and return the most relevant video segments (clips) from the database.
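Continuing the sketch above, retrieval is the mirror image of featurization: embed the query with the same encoder and ask Qdrant for the nearest vectors (the collection name is carried over from the featurization sketch):

```python
"""Retrieval sketch: embed a query and return the top-k matching segments."""
from qdrant_client import QdrantClient
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # must match featurization
qdrant = QdrantClient(url="http://localhost:6333")

def retrieve_segments(query: str, top_k: int = 3):
    hits = qdrant.search(
        collection_name="video_segments",
        query_vector=encoder.encode(query).tolist(),
        limit=top_k,
    )
    # Each hit carries the payload stored at featurization time, so results
    # map back to the MongoDB record and the corresponding video clip.
    return [(h.payload["key"], h.payload["subtitle"], h.score) for h in hits]

print(retrieve_segments("explain how ResNets work"))
```

Using the same embedding model for queries and documents is essential; mixing encoders puts queries and segments in incompatible vector spaces.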
Deploying the App Milestone
Develop a Gradio app that allows the user to interact with the RAG system using Ollama, with the facility to pull your model from the HF Hub. The app should be able to answer the questions listed below, returning the relevant video clip(s). You need to use streaming to produce the answers; see the sketch after the question list.
In addition to the app, you need to ensure that there is a demonstration.ipynb notebook in your GitHub repo where the questions and related video clips are shown (the clips are a few MB each, so committing them as part of your repo should not be an issue).
There should be a minimum of 3 video clips.
- “Explain how ResNets work”
- “Explain the advantages of CNNs over fully connected networks”
- “Explain the binary cross entropy loss function”
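A hedged sketch of such an app, reusing `retrieve_segments` from the retrieval milestone sketch and assuming a placeholder Ollama model name standing in for your finetuned model pulled from the HF Hub:

```python
"""App sketch: Gradio chat UI that retrieves segments and streams an answer
from a local Ollama model."""
import gradio as gr
import ollama  # pip install ollama; requires a running Ollama server

def chat(message, history):
    # retrieve_segments() comes from the retrieval milestone sketch.
    segments = retrieve_segments(message, top_k=3)
    context = "\n\n".join(subtitle for _, subtitle, _ in segments)
    stream = ollama.chat(
        model="my-finetuned-vlm",  # placeholder: your model pulled from the HF Hub
        messages=[{
            "role": "user",
            "content": f"Answer using these transcript excerpts:\n{context}\n\nQuestion: {message}",
        }],
        stream=True,
    )
    # Yielding a growing partial answer is how Gradio streams text to the UI.
    partial = ""
    for chunk in stream:
        partial += chunk["message"]["content"]
        yield partial

gr.ChatInterface(chat, title="Course Video RAG").launch()
```

The app should additionally surface the matching clip(s) alongside the streamed text, e.g. via a `gr.Video` component keyed off the retrieved segment metadata.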