Multimodal Agents
You will be working individually to develop the services needed for a real-time multimodal assistant. A user wearing a headset can converse with an AI assistant that engages a number of specialized AI agents to coordinate a response relevant to the user’s query and respond with multimedia (voice, images, video, code).
You will build the following specialized agents:
News Recommendation Agent
Code Search Agent
Image Search Agent
Video Search Agent
Speech to Text Agent
Text to Speech Agent
Text Summarization Agent
Text Translation Agent
Text to Code Agent
Code to Text Agent
Text to Image Agent
Image to Text Agent
Text to Video Agent
Video to Text Agent
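One way the assistant might engage several of the agents above for a single query is sketched below. This is a minimal illustration, not a prescribed design: the `AgentResponse` type, the `AGENTS` registry, the `register` decorator, the `coordinate` function, and the agent names are all hypothetical, and each agent is a stub that returns placeholder text instead of real media.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class AgentResponse:
    agent: str      # name of the agent that produced the payload
    modality: str   # e.g. "text", "image", "video", "code", "audio"
    payload: str    # placeholder content; a real agent would return media or URLs


# Hypothetical registry mapping agent names to handler functions.
AGENTS: Dict[str, Callable[[str], AgentResponse]] = {}


def register(name: str, modality: str):
    """Decorator that registers a stub agent under the given name."""
    def wrap(fn: Callable[[str], str]):
        def handler(query: str) -> AgentResponse:
            return AgentResponse(agent=name, modality=modality, payload=fn(query))
        AGENTS[name] = handler
        return handler
    return wrap


@register("news_recommendation", "text")
def recommend_news(query: str) -> str:
    return f"[stub] news related to: {query}"


@register("image_search", "image")
def search_images(query: str) -> str:
    return f"[stub] images matching: {query}"


@register("code_search", "code")
def search_code(query: str) -> str:
    return f"[stub] code snippets for: {query}"


def coordinate(query: str, agent_names: List[str]) -> List[AgentResponse]:
    """Call each requested agent in order and collect the responses.

    Names that are not in the registry are skipped silently.
    """
    return [AGENTS[name](query) for name in agent_names if name in AGENTS]
```

Calling `coordinate("solar power", ["news_recommendation", "image_search"])` returns one `AgentResponse` per agent, which the assistant could then merge into a single multimedia reply. Keeping every agent behind the same call signature means the remaining agents can be added without changing the coordinator.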
In this project, the user is emulated by YouTube video transcripts. This removes the complications of speech-to-text and leaves the most important part of the system to implement: a recommendation system.
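As a rough illustration of that setup, the sketch below treats each non-empty transcript line as one user utterance and ranks a small set of articles by keyword overlap. Everything here is an assumption made for the example: the function names `tokenize`, `transcript_to_queries`, and `recommend`, the one-utterance-per-line transcript format, and the overlap score, which is a simple stand-in for whatever recommendation method the project actually uses.

```python
import re
from collections import Counter
from typing import Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def transcript_to_queries(transcript: str) -> List[str]:
    """Treat each non-empty transcript line as one emulated user utterance."""
    return [line.strip() for line in transcript.splitlines() if line.strip()]


def recommend(query: str, articles: Dict[str, str], top_k: int = 2) -> List[Tuple[str, int]]:
    """Rank articles by how often the query's distinct tokens occur in them.

    `articles` maps a title to its body text (a hypothetical in-memory corpus).
    Articles with no overlap are dropped; ties are broken by title.
    """
    query_tokens = set(tokenize(query))
    scored = []
    for title, body in articles.items():
        counts = Counter(tokenize(title + " " + body))
        score = sum(counts[token] for token in query_tokens)
        if score > 0:
            scored.append((title, score))
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:top_k]
```

A transcript would be passed through `transcript_to_queries`, and each resulting utterance handed to `recommend` together with the article collection. A real system would likely replace the overlap score with a stronger ranking method, but the interface (utterance in, ranked items out) can stay the same.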