Multimodal Agents

You will work individually to develop the services needed for a real-time multimodal assistant. A user wearing a headset converses with an AI assistant, which engages a number of specialized AI agents to coordinate a response relevant to the user’s query and replies with multimedia (voice, images, video, code).

You will build the following specialized agents:

  1. News Recommendation Agent

  2. Code Search Agent

  3. Image Search Agent

  4. Video Search Agent

  5. Speech to Text Agent

  6. Text to Speech Agent

  7. Text Summarization Agent

  8. Text Translation Agent

  9. Text to Code Agent

  10. Code to Text Agent

  11. Text to Image Agent

  12. Image to Text Agent

  13. Text to Video Agent

  14. Video to Text Agent
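
One way to picture how the assistant could engage the agents listed above is a small registry that maps an agent name to a callable. The sketch below is only an illustration under stated assumptions: the `Coordinator` class, the `Agent` signature (text in, text out), the agent names, and the two stub agents are all hypothetical and are not prescribed by the assignment.

```python
from typing import Callable, Dict

# Assumption: every agent takes a text query and returns a text payload.
# Real agents would return audio, images, video, or code as well.
Agent = Callable[[str], str]


class Coordinator:
    """Hypothetical registry that routes a query to a named agent."""

    def __init__(self) -> None:
        self._agents: Dict[str, Agent] = {}

    def register(self, name: str, agent: Agent) -> None:
        # Store the agent under a unique name.
        self._agents[name] = agent

    def dispatch(self, name: str, query: str) -> str:
        # Fail loudly when the requested agent has not been registered.
        if name not in self._agents:
            raise KeyError(f"no agent registered under {name!r}")
        return self._agents[name](query)


def summarize_stub(text: str) -> str:
    # Placeholder for the Text Summarization Agent: keep the first sentence.
    return text.split(".")[0].strip() + "."


def translate_stub(text: str) -> str:
    # Placeholder for the Text Translation Agent: tag the text, no real translation.
    return f"[translated] {text}"


coordinator = Coordinator()
coordinator.register("text_summarization", summarize_stub)
coordinator.register("text_translation", translate_stub)

print(coordinator.dispatch("text_summarization", "Agents cooperate. They share results."))
```

Keeping each agent behind the same call signature lets the remaining agents be added one at a time without changing the coordinator.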

For this assignment, the user will be emulated by YouTube video transcriptions. This removes the complications of speech-to-text and leaves the most important part of the system to implement: a recommendation system.
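
To make the transcript-driven setup concrete, the sketch below treats each transcript line as one user utterance and ranks a small catalog against it. It is a minimal illustration, not the required design: the `TRANSCRIPT` and `CATALOG` data are made up, and Jaccard word overlap is a stand-in scoring technique chosen for brevity; a real recommender would likely use a stronger retrieval or ranking model.

```python
import re
from typing import List, Set, Tuple


def tokenize(text: str) -> Set[str]:
    # Lowercase the text and keep alphanumeric word tokens as a set.
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: Set[str], b: Set[str]) -> float:
    # Jaccard similarity: shared tokens divided by all distinct tokens.
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def recommend(utterance: str, catalog: List[str], top_k: int = 1) -> List[Tuple[str, float]]:
    # Score every catalog item against the utterance and return the best top_k.
    query = tokenize(utterance)
    scored = [(item, jaccard(query, tokenize(item))) for item in catalog]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical transcript lines standing in for the emulated user.
TRANSCRIPT = [
    "show me news about electric cars",
    "find python code for sorting a list",
]

# Hypothetical catalog of items the assistant could recommend.
CATALOG = [
    "Electric cars gain market share",
    "Python tips: sorting a list in place",
    "New recipes for summer",
]

for utterance in TRANSCRIPT:
    best_item, score = recommend(utterance, CATALOG)[0]
    print(f"{utterance!r} -> {best_item!r} (score={score:.2f})")
```

Because the input is already text, the same loop can later be fed by a speech-to-text agent without changing the recommendation step.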