Finetuned RAG Systems Engineering

In this project you will be exposed to Retrieval Augmented Generation (RAG) systems. RAG is a recent paradigm for large-scale language understanding tasks. It combines the strengths of retrieval-based and generation-based models, enabling the model to retrieve relevant information from a large corpus and generate a coherent domain-specific response.

The individual components are shown in Figure 1.

The engineering aspects of RAG are covered in this book, which we borrowed from heavily to draft this project description. Note that the project description makes different choices in both tooling and implementation details.

Project Goals and Strategy

The goal is to build a RAG system that a ROS2 robotics developer can use to develop the navigation stack of an agent with egomotion. Robotics is the domain, but the system must be particularly helpful in the subdomains, that is, able to answer very specific questions about them (see the app milestone). The subdomains are:

Your strategy as a team should be to have a proof of concept (POC) with all components interacting with each other, even if the evaluation of the RAG system fails. Do not spend so much time on any one component that you have nothing to show by the project deadline. Although the milestone deadlines are staggered in time, it is best to organize multiple iterations through most of the milestones over the duration of the project. Each iteration will improve the quality of the components. One strategy (if finetuning is one of your deliverables) is to not finetune the model in the first iteration.

Project Milestones

Environment and Tooling Milestone

Deliver a docker compose file that will create a development environment for the RAG system. The environment should have the following components:

app: can train and serve PyTorch or TF/Keras models, interact with the Hugging Face Hub API, and in general encapsulate all subcomponents that are not infrastructure.
mongodb: the database that stores the raw RAG data produced by the ETL pipeline.
qdrant: the vector search engine for the RAG system.
clearml: the orchestrator and experiment tracking system for the RAG system. Note that the book uses ZenML instead.
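
Once the services are up, a quick reachability check can complement the docker ps screenshot. The following is a minimal sketch, not part of the required deliverable. The host and port values are assumptions based on the usual defaults of each service (27017 for MongoDB, 6333 for Qdrant, 8080 for the ClearML web server); adjust them to match the port mappings in your own compose file. The helper names are hypothetical.

```python
import socket
from typing import Callable, Dict, Tuple

# ASSUMPTION: usual default ports; your compose file may map different ones.
SERVICES: Dict[str, Tuple[str, int]] = {
    "mongodb": ("localhost", 27017),
    "qdrant": ("localhost", 6333),
    "clearml": ("localhost", 8080),
}


def tcp_probe(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_services(
    services: Dict[str, Tuple[str, int]],
    probe: Callable[[str, int], bool] = tcp_probe,
) -> Dict[str, bool]:
    """Map each service name to whether its port accepted a connection."""
    return {name: probe(host, port) for name, (host, port) in services.items()}
```

Calling check_services(SERVICES) from a notebook cell returns a dictionary such as one entry per service with True or False. An open port only shows that something is listening; it does not prove the service is healthy, so treat this as a first sanity check.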

Show a screenshot of the output of your docker ps command in the project report, indicating that all services are running without errors. Ensure that you list your team IDs, both GitHub and Hugging Face, in your README.md file. Use notebooks to showcase the outputs and demos of your RAG system in all subsequent milestones.

ETL Milestone

Important

Only the CS370 Honors and CS-GY-6613 students are required to ingest video transcripts in the ETL pipeline.

Use the ClearML orchestrator to create an ETL pipeline that ingests multiple media sources, such as GitHub-hosted ROS2 documentation (LTS releases) and YouTube videos related to the domain and subdomains. The ETL pipeline should store the raw data in the mongodb database. Ensure you have a notebook cell that prints all the URLs you have ingested, either explicitly or via a database query.
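
As an illustration of the load step, here is a minimal sketch of one possible shape for a raw document record, keyed by URL so that re-ingesting a source overwrites rather than duplicates it. The field names and helper functions are hypothetical, and a plain dictionary stands in for the database so the sketch runs without a MongoDB instance.

```python
import hashlib
from datetime import datetime, timezone
from typing import Dict, List

RawDoc = Dict[str, str]


def to_raw_document(url: str, source_type: str, content: str) -> RawDoc:
    """Build one raw record; field names are illustrative, not prescribed."""
    return {
        "url": url,
        "source_type": source_type,  # e.g. "github" or "youtube"
        "content": content,
        "content_sha256": hashlib.sha256(content.encode("utf-8")).hexdigest(),
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    }


def upsert_by_url(store: Dict[str, RawDoc], doc: RawDoc) -> None:
    """Insert or overwrite the record for this URL.

    With pymongo, the equivalent call on a collection would be:
    collection.update_one({"url": doc["url"]}, {"$set": doc}, upsert=True)
    """
    store[doc["url"]] = doc


def ingested_urls(store: Dict[str, RawDoc]) -> List[str]:
    """Return every ingested URL, sorted.

    With pymongo, collection.distinct("url") gives the same information.
    """
    return sorted(store)
```

A notebook cell that prints the result of ingested_urls (or of the equivalent database query) would satisfy the URL listing requirement above.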

Featurization Pipelines Milestone

Implement the featurization pipeline that will convert the raw data into a format that can be used by the RAG model. The featurization pipeline should be able to store the featurized data in the mongodb database and in the qdrant vector search engine.
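
One common first step in featurization is splitting each raw document into overlapping chunks before embedding. The sketch below assumes simple word-based chunking; the chunk size, the overlap, and the record layout are all illustrative choices rather than requirements, and the embedding step itself is left out because it depends on the model you pick.

```python
from typing import Dict, List


def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> List[str]:
    """Split text into word-based chunks; neighbours share `overlap` words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


def featurize(doc: Dict[str, str], chunk_size: int = 200, overlap: int = 40) -> List[Dict[str, object]]:
    """Turn one raw document into chunk records that keep the source URL."""
    return [
        {"url": doc["url"], "chunk_index": i, "text": chunk}
        for i, chunk in enumerate(chunk_text(doc["content"], chunk_size, overlap))
    ]
```

Each chunk record would then be embedded and written to qdrant, with the source URL kept alongside it so that answers can point back to the original document.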

Finetuning Milestone

Important

Only the CS370 Honors and CS-GY-6613 students are required to implement finetuning.

Follow the instructions from a similar finetuning exercise, and any additional resources such as the video above, to finetune the model on the required subdomain. You are free to choose a baseline LLM that suits your hardware, and you may use commercially available APIs to create the instruct dataset.

PS: The linked finetuning tutorial should work in free Colab, but you may consider paying Google for one month (or using another provider) to avoid issues with finetuning.
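
For the instruct dataset, a minimal sketch of one possible record layout is shown below. It assumes the widely used instruction/input/output field names and JSON Lines serialization; the finetuning tutorial you follow may expect a different format, so check before generating data at scale. The helper names are hypothetical.

```python
import json
from typing import Dict, Iterable


def make_instruct_record(instruction: str, output: str, context: str = "") -> Dict[str, str]:
    """Build one instruction-tuning example (assumed field names)."""
    return {"instruction": instruction, "input": context, "output": output}


def to_jsonl(records: Iterable[Dict[str, str]]) -> str:
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)
```

The resulting string can be written to a file and loaded by most dataset tooling that accepts JSON Lines.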

Deploying the App Milestone

Develop a gradio app that allows the user to interact with the RAG system using Ollama, with the facility to pull your model from the HF hub. The app should be able to answer the following questions, which must be pre-populated and selectable from a dropdown menu. Ensure that your report or notebook includes screenshots of the answers to these questions.

Tell me how can I navigate to a specific pose - include replanning aspects in your answer.
Can you provide me with code for this task?
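
The dropdown requirement can be sketched as follows. This is a minimal outline under stated assumptions, not a reference implementation: build_prompt and build_app are hypothetical helper names, the prompt wording is illustrative, and the answer_fn callable (which would retrieve context from qdrant and query the model through Ollama) is left for you to supply. gradio is imported inside build_app so the other helpers can be used without it installed.

```python
from typing import Callable, List

# The pre-populated questions, copied verbatim from the list above.
QUESTIONS: List[str] = [
    "Tell me how can I navigate to a specific pose - include replanning aspects in your answer.",
    "Can you provide me with code for this task?",
]


def build_prompt(question: str, contexts: List[str]) -> str:
    """Combine numbered retrieved passages and the question into one prompt."""
    context_block = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(contexts))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context_block}\n\n"
        f"Question: {question}\nAnswer:"
    )


def build_app(answer_fn: Callable[[str], str]):
    """Wrap answer_fn in a gradio interface with a question dropdown."""
    import gradio as gr  # lazy import: only needed when the UI is built

    return gr.Interface(
        fn=answer_fn,
        inputs=gr.Dropdown(choices=QUESTIONS, label="Question"),
        outputs=gr.Textbox(label="Answer"),
        title="ROS2 navigation RAG assistant",
    )
```

Calling build_app(your_answer_fn).launch() starts the app locally. As more questions are announced, appending them to QUESTIONS is enough to make them selectable.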

Note

More questions will be added as the project progresses. The specificity and utility of the answers to ROS developers are key to the evaluation of the RAG system, and your grade will depend on them.