News Time Machine#

In this project you will build a Django Ninja-based web app that implements a recommendation system. The web app will:

  1. Accept, in a search text field, a natural-language query about a news topic.

  2. Respond with an ordered table of recommended video clips drawn from the video vault of a media outlet. The user will be able to order the recommendations by the following criteria:

    • Most recent

    • Most relevant

    • Most popular

Milestone 1: Django site#

Learn one of the most popular web frameworks in the world, Django. You can start with the official tutorial, but bear in mind that you will also need to consult Django Ninja, as this is the framework you will use to build the API/app.

You deliver a new repo (separate from your assignments repo) called ninja-news containing a Docker Compose-based deployment of Django Ninja with Postgres, Redis, Celery, and JetStream. You can borrow the structure of the project from here; ensure that the project uses Django Ninja. Your Django views in this app are based on Bootstrap 5-based Tabler.io components.
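A minimal docker-compose.yml for this stack might look like the sketch below. Service names, image tags, and the Celery app name `config` are illustrative assumptions, not requirements:

```yaml
services:
  web:
    build: .
    command: python manage.py runserver 0.0.0.0:8000
    ports: ["8000:8000"]
    depends_on: [db, redis, nats]
  worker:
    build: .
    command: celery -A config worker -l info   # "config" is a placeholder app name
    depends_on: [db, redis]
  db:
    image: postgres:16
    environment:
      POSTGRES_DB: ninja_news
      POSTGRES_USER: ninja
      POSTGRES_PASSWORD: ninja
  redis:
    image: redis:7
  nats:
    image: nats:latest
    command: ["-js"]   # the -js flag enables JetStream on the NATS server
    ports: ["4222:4222"]
```

The compose file you borrow may differ; the important parts are the five services and that NATS runs with JetStream enabled.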

Note

Any dynamic UI interaction, if needed, must be based on HTMX (see, e.g., this guide).
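As an illustration of the kind of HTMX interaction intended, a table refresh can be wired with plain attributes; the route `/latest` comes from Milestone 3, while the element id is a made-up example:

```html
<!-- Clicking the button issues a GET to /latest and swaps the returned
     HTML fragment into the #news-table element, with no custom JS. -->
<button hx-get="/latest" hx-target="#news-table" hx-swap="innerHTML">
  Refresh news
</button>
<div id="news-table"><!-- server-rendered table fragment goes here --></div>
```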

Milestone 2: Data Model#

Build a data model that satisfies the needs of the project by making use of the Pydantic 2-based data modeling tooling of Django Ninja. You can consult this loosely related implementation, or any other implementation you think is relevant. Note that that data model is for podcasts, so you will need to implement a quite different data model for this project.
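As a starting point, a minimal Pydantic 2 sketch of the core entity might look like the following. The field names mirror the stream fields defined in Milestone 3; the class name and field types are assumptions, and in the real app you would likely pair a Django model with a Ninja `ModelSchema` rather than use a bare `BaseModel`:

```python
from datetime import datetime

from pydantic import BaseModel


class Video(BaseModel):
    """Illustrative schema for one mined news video (not the required model)."""

    video_url: str
    transcript: str
    channel: str
    publication_date: datetime
    category: str


# Pydantic 2 coerces ISO-8601 strings into datetime objects on validation.
v = Video(
    video_url="https://www.youtube.com/watch?v=example",
    transcript="...",
    channel="Example News",
    publication_date="2024-01-15T09:30:00",
    category="World",
)
print(v.category)
```

Expect this schema to grow (e.g., summary and embedding fields) as later milestones add features.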

Deliver the data model, a screenshot of the admin interface of the Django site, and a screenshot of the DBeaver CE entity relationship diagram. Note that the initial data model will evolve over time as you add more features to the app.

Milestone 3: YouTube News Mining#

Using the JetStream instance, define a stream called “youtube-news” that publishes video content and content metadata.

Using this stream, download the transcripts of the first (as ranked by YouTube) K=10 videos from each of the 9 categories into which YouTube organizes its news content (Top stories, Sports, Entertainment, Science, Health, Business, Technology, National, and World). Do not download Live now or Upcoming videos.

You may use the YouTube Data API to retrieve the videos' captions, but also consult this guide in Python.
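One way to assemble the search call is sketched below. The `search.list` parameter names (`part`, `q`, `type`, `maxResults`, `order`) are from the YouTube Data API v3, and `snippet.liveBroadcastContent` is the field the API uses to mark live/upcoming items; the helper names themselves are made up for illustration:

```python
def news_search_params(query: str, max_results: int = 10) -> dict:
    """Build parameters for a YouTube Data API v3 search.list request."""
    return {
        "part": "snippet",
        "q": query,            # e.g. a category name such as "World"
        "type": "video",
        "maxResults": max_results,   # K=10 per category in this milestone
        "order": "relevance",        # defer to YouTube's own ranking
    }


def is_live_or_upcoming(item: dict) -> bool:
    """True for search results we must skip (Live now / Upcoming videos)."""
    return item["snippet"].get("liveBroadcastContent", "none") != "none"


params = news_search_params("World")
print(params["maxResults"])
```

Filtering on `liveBroadcastContent` client-side is one option; check the API reference for server-side alternatives before committing to this approach.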

Each message on the JetStream will be serialized using protobuf and will contain the following fields:

  • video_url: The URL of the video

  • transcript: The transcript of the video

  • channel: The channel of the video

  • publication_date: The publication date of the video

  • category: The category of the video
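The field list above maps directly onto a .proto definition; the sketch below is one possible shape (the message, package, and field-type choices are not prescribed):

```proto
syntax = "proto3";

package ninja_news;

// One message per mined video, published on the "youtube-news" stream.
message NewsVideo {
  string video_url = 1;        // The URL of the video
  string transcript = 2;       // The transcript of the video
  string channel = 3;          // The channel of the video
  string publication_date = 4; // e.g. an ISO-8601 timestamp
  string category = 5;         // One of the 9 YouTube news categories
}
```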

The stream consumer(s) retrieve the messages and store them in a database table, exposing them to Django apps via your data model. Ensure that only updates are stored, not duplicates.
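The no-duplicates requirement amounts to an idempotent upsert keyed by the video URL. With the Django ORM this is typically done with `update_or_create`; the dictionary-based sketch below illustrates just the idempotency property outside Django (the `Video` model name in the comment is an assumption):

```python
# In-memory stand-in for the database table, keyed by video_url.
store: dict[str, dict] = {}


def upsert(msg: dict) -> None:
    """Insert a new video or update the existing row; never duplicate.

    The Django ORM equivalent would be roughly:
        Video.objects.update_or_create(video_url=msg["video_url"], defaults=msg)
    """
    key = msg["video_url"]
    store[key] = {**store.get(key, {}), **msg}


upsert({"video_url": "https://youtu.be/x", "category": "World"})
# Re-delivering the same video only updates the existing row.
upsert({"video_url": "https://youtu.be/x", "transcript": "updated text"})
print(len(store))  # 1
```

Because JetStream consumers can redeliver messages, the write path must tolerate seeing the same video more than once.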

Use the suggested Tabler.io template to produce the table in a view using the Datatable component, visible at the route /latest.

Note

If you want to dive deeper into the architectural patterns of an event-based architecture, specifically for the Python language, you can consult this chapter.

Milestone 4: Summarization and Querying APIs#

At this point you may want to switch off the publisher and the consumer of the JetStream and start working on the API. You may hit YouTube quotas and be blocked from downloading further videos; switching the pipeline off prevents that. The professional way to switch off features such as a streaming publisher is via feature flags, but this is optional.

The API must support the following external endpoints:

  • /summarize

  • /chat

and other internal REST endpoints of your choice based on the needs of the app.

The /summarize endpoint uses existing OpenAI APIs to summarize the text of the transcript. It runs continuously but invokes the OpenAI APIs only when new summaries are needed. Periodic invocation may be achieved with Celery.
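The "only when new summaries are needed" check can be as simple as filtering out rows that already have a summary before each periodic Celery run. The sketch below uses plain dicts; with the ORM it would be something like `Video.objects.filter(summary__isnull=True)` (model and field names assumed):

```python
def pending_summaries(rows: list[dict]) -> list[dict]:
    """Return only the rows whose summary is missing or empty,
    so the periodic task calls the OpenAI API just for those."""
    return [r for r in rows if not r.get("summary")]


rows = [
    {"video_url": "https://youtu.be/a", "summary": "Already summarized."},
    {"video_url": "https://youtu.be/b", "summary": None},
    {"video_url": "https://youtu.be/c", "summary": ""},
]
todo = pending_summaries(rows)
print(len(todo))  # 2
```

Guarding the API call this way keeps the periodic task cheap when nothing new has arrived on the stream.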

The view that supports this functionality is the original table with a new column called summary, whose value links to a new view presenting two text fields: the left shows the original transcript and the right its summary. You can use LangChain to orchestrate an agent invocation that will provide the summary based on the OpenAI API.

The /chat endpoint uses an OpenAI LLM (GPT-3.5) to answer natural-language queries using the parameterized knowledge afforded by the LLM and the contents of your database. To respond to queries you will implement Retrieval-Augmented Generation (RAG) as described in the LangChain documentation. You need to use the PGVector component to store the embeddings of the transcripts.
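In the app, the retrieval half of RAG is performed by PGVector inside Postgres over real embeddings. As a toy illustration of the same idea, here is nearest-neighbour search by cosine similarity over hard-coded vectors (the documents and their embeddings below are entirely made up):

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


# Toy "embedding store": in the real app these rows live in PGVector and
# the vectors come from an embedding model, not hand-written numbers.
docs = [
    ("Fed holds interest rates steady", [0.9, 0.1, 0.0]),
    ("New cancer immunotherapy trial", [0.0, 0.2, 0.9]),
]

query_vec = [1.0, 0.0, 0.1]  # pretend embedding of the user's question
best = max(docs, key=lambda d: cosine(d[1], query_vec))
print(best[0])  # the transcript chunk handed to the LLM as context
```

The retrieved chunk is then placed into the LLM prompt alongside the user's question, which is the "augmented generation" step of RAG.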

Milestone 5: Frontend#

Demonstrate the conversational capabilities of the API by developing a chat interface in Django Ninja and asking the following questions:

  • What is the most recent news about the war in Ukraine?

  • What are the chances for the US Fed to reduce interest rates in 2024?

  • Did the stock market fall after the latest jobs report?

  • Why are prices of electric vehicles falling?

  • What are the latest developments in the fight against cancer?