Computer Using Agent

In this project, you will build a Computer Using Agent (CUA) that can assist users in understanding and interacting with visual content, such as images and diagrams, through natural language and visual prompting.
The system will leverage advanced NLP and computer vision techniques to provide explanations, answer questions, and facilitate learning from materials displayed on your screen. All tests can be run locally on your machine; however, if you do not have the required hardware, you can split the processing between multiple local machines or between a local machine and the cloud.
System Architecture
The system will consist of the following key components:
Web browser
You will enter a URL of your choice to display academic papers in the field of AI that contain text, tables, images, equations, diagrams, and figures. You need to present results for the paper, so select papers from arXiv that contain all of these features. You can use Acrobat Reader or any other PDF reader of your choice that supports annotations.
WebRTC
You will stream the screen share over WebRTC to the agent. This means that you can broadcast your screen over WebRTC, and the agent will be able to see this stream and capture it. It is advised to use the native web browser for the screen sharing as shown below and to use FastRTC for capturing the WebRTC stream. The WebRTC stream can of course be local (localhost) or remote (over the internet).
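As a starting point, here is a minimal sketch of the capture side using FastRTC, assuming its `Stream` API delivers decoded video frames to a Python handler as NumPy arrays; the frame queue is a hypothetical hand-off to your VLM worker, and how the browser is set up to share a screen rather than a camera is left to your implementation.

```python
# A minimal sketch, assuming the fastrtc `Stream` API (handler receives video
# frames as NumPy arrays); `frame_queue` and the downstream VLM consumer are
# hypothetical parts of your own pipeline.
import queue
import numpy as np
from fastrtc import Stream

frame_queue: "queue.Queue[np.ndarray]" = queue.Queue(maxsize=30)

def on_frame(frame: np.ndarray) -> np.ndarray:
    """Receive one decoded video frame from the WebRTC stream."""
    try:
        frame_queue.put_nowait(frame)   # hand off to the VLM worker
    except queue.Full:
        pass                            # drop frames if the consumer lags behind
    return frame                        # echo back so the sender sees a preview

stream = Stream(handler=on_frame, modality="video", mode="send-receive")
stream.ui.launch()  # built-in Gradio page that negotiates the WebRTC connection
```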

AI Agent
You will use Pydantic AI for the agent framework, which will be able to call tools - at a bare minimum you need to invoke web search. Web search can provide metadata about the references of the paper you are reading. You can use the Semantic Scholar API or the Google Scholar API for that.
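Below is a minimal sketch of a Pydantic AI agent with a Semantic Scholar search tool. The model name, the local Ollama endpoint, and the exact provider wiring are assumptions that vary between pydantic-ai versions; the search call uses the public Semantic Scholar Graph API.

```python
# A minimal sketch of a Pydantic AI agent with a web-search tool.
import httpx
from pydantic_ai import Agent
from pydantic_ai.models.openai import OpenAIModel
from pydantic_ai.providers.openai import OpenAIProvider

# Point the agent at a local OpenAI-compatible server (Ollama shown here; assumed setup).
model = OpenAIModel(
    "qwen2.5",
    provider=OpenAIProvider(base_url="http://localhost:11434/v1", api_key="ollama"),
)

agent = Agent(
    model,
    system_prompt="You answer questions about the paper shown on screen. "
                  "Use the search tool to resolve citations and references.",
)

@agent.tool_plain
def search_papers(query: str, limit: int = 5) -> list[dict]:
    """Look up papers on Semantic Scholar and return basic metadata."""
    resp = httpx.get(
        "https://api.semanticscholar.org/graph/v1/paper/search",
        params={"query": query, "limit": limit, "fields": "title,year,authors,externalIds"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json().get("data", [])

result = agent.run_sync("Which paper does the citation [Yuan et al., 2022] refer to?")
print(result.output)  # older pydantic-ai versions expose this as `result.data`
```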
Vision Language Model (VLM)
You will use a VLM such as DeepSeek-OCR to process the captured frames of the screen. You need to process a video stream of images (frames) from the screen, but you are free to implement any downsampling or stitching of frames to create whole-page images, or an image of the complete paper for that matter. You are also free to use tricks such as looking up the paper to detect whether it is public and downloading it directly from arxiv.org, but you then have to render it to images and process it with the VLM. The advantage of doing this is that you can retrieve the references directly if the user highlights a citation such as [Yuan et al., 2022] but does not scroll to the references section.
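The sketch below illustrates one possible downsampling and stitching strategy, assuming frames arrive from the WebRTC capture queue shown earlier; `run_vlm` is a hypothetical wrapper around whatever locally hosted VLM you choose, not a real library call.

```python
# A minimal sketch of frame downsampling and naive vertical stitching before VLM
# processing. `frame_queue` comes from the WebRTC capture sketch above; `run_vlm`
# is a hypothetical callable you implement around your local VLM (e.g. DeepSeek-OCR).
import numpy as np

def frames_changed(prev: np.ndarray | None, cur: np.ndarray, threshold: float = 2.0) -> bool:
    """Cheap duplicate filter: mean absolute pixel difference between frames."""
    if prev is None or prev.shape != cur.shape:
        return True
    return float(np.mean(np.abs(prev.astype(np.int16) - cur.astype(np.int16)))) > threshold

def stitch(frames: list[np.ndarray]) -> np.ndarray:
    """Naively stack distinct frames top-to-bottom to approximate a full page."""
    width = min(f.shape[1] for f in frames)
    return np.concatenate([f[:, :width] for f in frames], axis=0)

def vlm_worker(frame_queue, run_vlm, batch_size: int = 4) -> None:
    prev, batch = None, []
    while True:
        frame = frame_queue.get()
        if frames_changed(prev, frame):   # downsample: keep only frames that changed
            batch.append(frame)
            prev = frame
        if len(batch) >= batch_size:
            page_image = stitch(batch)
            text = run_vlm(page_image)    # hypothetical: OCR / structure extraction
            print(text[:200])             # hand the result to the reasoning LLM instead
            batch = []
```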
Reasoning Large Language Model (LLM)
You will use a local reasoning LLM such as Qwen2.5 to process the outputs of the VLM and the web search and to generate answers to user questions. You can also use an LLM API if you do not have the required hardware to run the model locally, but you need to be careful about spending limits - set a low spending limit to avoid surprises. You are free to select an LLM server of your choice, such as Ollama, LM Studio, or any other.
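For example, a minimal sketch of querying a locally served model through Ollama's OpenAI-compatible endpoint might look like the following; the base URL, model tag, and prompt layout are assumptions (LM Studio exposes a similar endpoint on a different port).

```python
# A minimal sketch of combining VLM output and web-search metadata into a prompt
# for a locally served reasoning LLM via Ollama's OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

def answer_question(vlm_text: str, search_results: str, question: str) -> str:
    """Ask the local LLM to answer strictly from the extracted page text and references."""
    response = client.chat.completions.create(
        model="qwen2.5",
        messages=[
            {"role": "system",
             "content": "Answer strictly from the provided page text and references."},
            {"role": "user",
             "content": f"Page text:\n{vlm_text}\n\nReferences:\n{search_results}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```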
Demo
You will build a Gradio app that will allow the user to:
- Configure the CUA system (e.g., select the VLM model, the LLM model, the web search API keys, etc.).
- Start the screen sharing session and consume the webRTC stream.
- Enter questions about the content displayed on the screen.
- Store the interactions in a MongoDB database (extracted text, images, tables, figures, highlights, the questions asked, the answers provided, the references used for the answers, etc.), as sketched below.
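A minimal skeleton of the Gradio app with MongoDB logging, as referenced above, might look like this; the database URI, collection name, and the stubbed `ask_agent` callback are placeholders for your own pipeline.

```python
# A minimal sketch of the Gradio front end with MongoDB logging of interactions.
import datetime
import gradio as gr
from pymongo import MongoClient

# Assumes a MongoDB server running locally; database/collection names are arbitrary.
interactions = MongoClient("mongodb://localhost:27017")["cua"]["interactions"]

def ask_agent(question: str) -> str:
    """Placeholder: wire this to the Pydantic AI agent and VLM pipeline."""
    return f"(stub) answer for: {question}"

def ask(vlm_name: str, llm_name: str, question: str) -> str:
    answer = ask_agent(question)
    interactions.insert_one({
        "timestamp": datetime.datetime.now(datetime.timezone.utc),
        "vlm": vlm_name,
        "llm": llm_name,
        "question": question,
        "answer": answer,
    })
    return answer

with gr.Blocks(title="Computer Using Agent") as demo:
    vlm_name = gr.Textbox(label="VLM model", value="deepseek-ocr")
    llm_name = gr.Textbox(label="LLM model", value="qwen2.5")
    question = gr.Textbox(label="Question about the on-screen content")
    answer = gr.Textbox(label="Answer")
    gr.Button("Ask").click(ask, inputs=[vlm_name, llm_name, question], outputs=answer)

demo.launch()
```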
The questions that you need to answer are:
- For the text highlighted in yellow, provide a tutorial explanation.
- Auto-highlight the most important sections and figures of the paper and explain why they are important. Use purple for highlighting. The auto-highlighting will be done as the user scrolls through the paper.
- What is this highlighted table (or the table with the given number) telling us?
- What is the most detrimental ablation study in this paper and why?
- Replicate the bounded block diagram in Mermaid and Excalidraw (Extra credit of 20 points).
Constraints
- You are not allowed to save the PDF file locally and use text extraction libraries. The only way to get the content of the PDF is via the VLM processing of the screen images.
- You are not allowed to use any external API for the VLM processing - you need to run the model locally or in your cloud instance.
- You are not allowed to use any external API for the LLM processing - you need to run the model locally or in your cloud instance.
- You are allowed to use external APIs for web search only.
- You are allowed to use pre-trained models only - no fine-tuning is required or expected.
- You are not allowed to use any external paid services - all processing must be done either locally or in your cloud instances that you control.
- You are not allowed to use any AI browsers recently introduced by Perplexity (Comet), OpenAI and others.
- You are allowed to author Chrome extensions to facilitate the screen sharing over WebRTC and other capabilities you may incorporate. You cannot use other Chrome Web Store extensions that provide AI capabilities.
- There is no constraint on the runtime performance of the system - it is understood that running locally or in the cloud may introduce latency. You will be graded on functionality and quality of responses and not speed.