Building a Real-Time Video Assistant with RAG on AWS
> April 15, 2025
One of the most interesting projects I have shipped is a Video Assistant that autonomously joins Microsoft Teams meetings and delivers real-time news briefings relevant to the ongoing conversation. This post explains the architecture from ingestion to delivery.
> The Problem
Enterprise teams spend hours manually researching context before meetings. The goal was to build a system that automatically surfaces the most relevant news and internal documents during a live meeting — without any human curation.
> Architecture Overview
The system has four main stages:
- **Ingestion** — scrape news sources on a schedule, chunk and embed documents
- **Indexing** — store embeddings in Qdrant with rich metadata
- **Retrieval** — semantic search triggered by meeting transcript segments
- **Delivery** — a bot that joins Teams calls and posts briefings in the chat
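In code terms, the record that flows between these stages is just a small article dict; a minimal sketch of that contract (the `Article` type is illustrative — the production code passes plain dicts with these keys):

```python
from typing import TypedDict

class Article(TypedDict):
    """The record handed from ingestion through indexing to delivery."""
    title: str
    summary: str
    published: str
```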
> Stage 1: News Ingestion with AWS Lambda
```python
import boto3
import feedparser
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct

def handler(event, context):
    feeds = [
        "https://feeds.reuters.com/reuters/topNews",
        "https://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml",
    ]
    articles = []
    for feed_url in feeds:
        feed = feedparser.parse(feed_url)
        for entry in feed.entries[:20]:
            # .get() guards against feeds that omit summary or published
            articles.append({
                "title": entry.get("title", ""),
                "summary": entry.get("summary", ""),
                "published": entry.get("published", ""),
            })
    embed_and_index(articles)
    return {"indexed": len(articles)}
```

An EventBridge (formerly CloudWatch Events) rule triggers this Lambda every 15 minutes, keeping the index fresh.
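The Lambda above indexes short RSS summaries directly; for the longer internal documents mentioned in the overview, chunking runs first. A sketch of a simple overlapping word-window chunker (`chunk_text` and its window sizes are illustrative, not the production implementation):

```python
def chunk_text(text: str, max_words: int = 200, overlap: int = 40) -> list[str]:
    """Split text into overlapping word windows so each chunk fits the embedding model."""
    words = text.split()
    if not words:
        return []
    chunks = []
    step = max_words - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break
    return chunks
```

Each chunk is then embedded and upserted as its own point, with the parent document id in the payload so retrieval can link back to the source.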
> Stage 2: Embedding and Indexing with Qdrant
```python
from langchain_aws import BedrockEmbeddings
from qdrant_client.models import Distance, VectorParams

embeddings = BedrockEmbeddings(model_id="amazon.titan-embed-text-v1", region_name="us-east-1")
client = QdrantClient(url=QDRANT_URL, api_key=QDRANT_API_KEY)

client.recreate_collection(
    collection_name="news",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)

def embed_and_index(articles):
    points = []
    for i, article in enumerate(articles):
        text = f"{article['title']}. {article['summary']}"
        vector = embeddings.embed_query(text)
        # Note: sequential ids overwrite the previous batch on each run; use a
        # stable id (e.g. a hash of the article URL) if results should accumulate.
        points.append(PointStruct(id=i, vector=vector, payload=article))
    client.upsert(collection_name="news", points=points)
```

> Stage 3: Real-Time Retrieval
During a meeting, the Teams bot captures transcript segments and triggers retrieval:
```python
def retrieve_relevant_news(transcript_segment: str, top_k: int = 5):
    query_vector = embeddings.embed_query(transcript_segment)
    results = client.search(
        collection_name="news",
        query_vector=query_vector,
        limit=top_k,
        score_threshold=0.72,
    )
    return [r.payload for r in results]
```

The score threshold (0.72) was tuned empirically to filter noise while keeping high-signal results.
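One detail the snippet glosses over is what counts as a "transcript segment." Firing a search on every caption fragment would be wasteful, so fragments can be buffered into fixed-size query windows first; a sketch (the `TranscriptBuffer` class and its 40-word default are assumptions, not the original implementation):

```python
class TranscriptBuffer:
    """Accumulate caption fragments and emit a query segment every `window` words."""

    def __init__(self, window: int = 40):
        self.window = window
        self.words: list[str] = []

    def add(self, fragment: str):
        """Return a segment to query with once enough words accumulate, else None."""
        self.words.extend(fragment.split())
        if len(self.words) >= self.window:
            segment = " ".join(self.words)
            self.words = []
            return segment
        return None
```

Each non-None segment returned by `add` is then passed to `retrieve_relevant_news`.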
> Stage 4: Teams Bot Delivery
The bot uses the Microsoft Bot Framework SDK to join meetings and post briefings:
```python
from botbuilder.core import MessageFactory, TurnContext

def format_briefing(news_items):
    # Minimal formatter: one line per article
    return "\n".join(f"- {item['title']}: {item['summary']}" for item in news_items)

# Method on the bot's ActivityHandler subclass
async def on_message_activity(self, turn_context: TurnContext):
    transcript = turn_context.activity.text
    news_items = retrieve_relevant_news(transcript)
    if news_items:
        summary = format_briefing(news_items)
        await turn_context.send_activity(MessageFactory.text(summary))
```

> Lessons Learned
- **Rate-limit your ingestion** — news APIs and RSS feeds have quotas; use SQS to buffer bursts.
- **Tune your score threshold carefully** — too low gives noise, too high misses useful context.
- **Async delivery matters** — posting synchronously in the bot handler causes timeouts in Teams; always use background tasks.
- **Monitor your vector store** — Qdrant collection sizes grow fast; implement TTL-based cleanup for stale news.
This project showed me that the hardest part of a RAG system is not the LLM — it's data freshness and retrieval quality tuning.