I built a video search engine that actually understands what's being said and shown. Try it and give me feedback.

Monesh R — Fri, 05 Jun 2026 08:57:41 +0000

I spent the last few days building a multimodal video RAG platform, with a lot of help from Claude (vibecoding is real) and a lot of my own time tuning the retrieval numbers until they stopped being embarrassing.

You paste a YouTube link, the system watches the video (transcribes the audio, samples visual frames, captions them with Claude), and then you can ask questions about it. Like Google for the inside of videos.

It returns timestamped answers grounded in what was actually said or shown. Not vibes. Citations.

The thing is live: multimodal-video-rag-web.vercel.app

I have about 13 videos indexed right now. Try searching. Ask it something. Break it. I want honest feedback.

What's under the hood

The query pipeline is a 9-node LangGraph graph. When you search:

Your query gets classified by intent (visual, transcript, timestamp, summary)
It hits two separate Pinecone indexes: one for transcript chunks (hybrid dense + BM25 sparse search), one for visual frame embeddings
Results get fused with reciprocal rank fusion
A retrieval gate filters out low-confidence junk per modality
Claude Haiku generates an answer grounded in the surviving evidence, with timestamps
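
To make the fusion step concrete, here is a minimal sketch of reciprocal rank fusion in plain Python. It is not the code from the repo. The segment IDs (video plus start second), the function name, and the constant k=60 (a common default for RRF) are all assumptions for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several best-first ranked lists of IDs into one ranking.

    Each item scores sum(1 / (k + rank)) over the lists it appears in,
    so items that rank well in more than one list float to the top.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, item_id in enumerate(ranking, start=1):
            scores[item_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so output is deterministic.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical segment IDs: "<video>@<start second>".
transcript_hits = ["vid1@120", "vid1@30", "vid2@60"]
visual_hits = ["vid1@30", "vid3@0", "vid1@120"]

fused = reciprocal_rank_fusion([transcript_hits, visual_hits])
fused_ids = [item_id for item_id, _ in fused]
print(fused_ids)
```

RRF only looks at ranks, not raw scores, which is why it is handy for mixing a transcript index and a visual index whose similarity scores are not on the same scale.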

Ingestion runs on Fargate. A worker pulls the video, runs faster-whisper for transcription, ffmpeg for frame extraction, and Claude for frame captioning, then writes Titan embeddings into Pinecone. The whole thing runs on AWS (Lambda, SQS, DynamoDB, S3, API Gateway, CloudWatch), deployed with CDK.
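
For the frame extraction step, a rough sketch of how a worker could build the ffmpeg call for one frame every 30 seconds. The function names, paths, and output naming are hypothetical, and the real worker may do this differently. The `fps` video filter and `-q:v` option are standard ffmpeg.

```python
def build_frame_extract_cmd(video_path, out_dir, every_seconds=30):
    """Return an ffmpeg argv that writes one JPEG per `every_seconds` of video."""
    return [
        "ffmpeg",
        "-i", video_path,
        "-vf", f"fps=1/{every_seconds}",  # one output frame per interval
        "-q:v", "2",                      # high JPEG quality
        f"{out_dir}/frame_%05d.jpg",      # frame_00001.jpg, frame_00002.jpg, ...
    ]


def approx_timestamp(frame_number, every_seconds=30):
    """Approximate source time in seconds for the Nth output frame (1-based).

    Approximate only: the exact input frame ffmpeg picks depends on its rounding.
    """
    return (frame_number - 1) * every_seconds


cmd = build_frame_extract_cmd("video.mp4", "frames")
print(" ".join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```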

Frontend is Next.js 16 + React 19 on Vercel.

moneshrallapalli / multimodal-video-rag

Multimodal video RAG platform on AWS — search inside videos by visual frames and spoken transcript, with timestamped grounded citations.

Multimodal Video RAG

VideoRAG is a deployed search app for long-form video. Ask a question, and it returns the transcript chunks or visual frames that support the answer, with timestamps back to the source moment.

The frontend runs on Vercel, and the backend is AWS-native: FastAPI in a Lambda container, SQS/Fargate ingestion, S3 artifacts, Pinecone indexes, LangGraph retrieval, and Bedrock Claude Haiku for answer generation and query rewrite. If retrieval is weak, the app refuses instead of guessing.
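
The "refuse instead of guessing" behavior can be pictured as a per-modality score gate in front of the generator. This is a minimal sketch, not the repo's implementation: the `Hit` shape, the threshold values, and the refusal message are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    modality: str     # "transcript" or "visual"
    score: float      # retrieval score, higher is better
    timestamp_s: int  # start of the supporting moment, in seconds


# Hypothetical thresholds; real values would come from tuning on an eval set.
MIN_SCORE = {"transcript": 0.45, "visual": 0.30}
REFUSAL = "Not enough evidence in the indexed videos to answer that."


def gate(hits):
    """Keep only hits that clear the threshold for their own modality."""
    return [h for h in hits if h.score >= MIN_SCORE.get(h.modality, float("inf"))]


def respond(hits):
    """Refuse when nothing survives the gate; otherwise list the citations."""
    kept = gate(hits)
    if not kept:
        return REFUSAL
    cites = ", ".join(
        f"{h.modality}@{h.timestamp_s // 60:02d}:{h.timestamp_s % 60:02d}" for h in kept
    )
    return f"Answer from {len(kept)} evidence item(s): {cites}"


strong = respond([Hit("transcript", 0.62, 754), Hit("visual", 0.12, 760)])
weak = respond([Hit("visual", 0.12, 760)])
print(strong)
print(weak)
```

In the real pipeline the surviving hits would be passed to the model as context; the point here is only that an empty evidence set short-circuits to a refusal.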

How it works

Admin submits a YouTube URL through the web console.
A Fargate worker downloads the video, extracts keyframes every 30 seconds, transcribes the audio with faster-whisper, and embeds both modalities into separate Pinecone indexes.
When a user asks a question, a LangGraph pipeline classifies intent, retrieves from one or both indexes, fuses results with Reciprocal Rank Fusion, reranks, gates on evidence strength, and generates a grounded answer via Bedrock.
The…


What I want from you

Be brutal. Specifically:

Search quality: Did it answer your question? Were the timestamps right? Did it hallucinate?
UI/UX: Anything confusing? Anything ugly?
Speed: How long did queries take? Was it annoying?
Broken stuff: Errors, weird results, blank screens. Screenshot them if you can.

Drop feedback in the comments, or open an issue on GitHub. Both work.

Why I'm posting this

I'm an AI engineer and this is a portfolio project. I vibecoded most of it with Claude and then spent my time on the parts that matter: tuning retrieval thresholds, running eval sets, and figuring out why hybrid search with BM25 kept returning garbage until I got the alpha weighting right.
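
For anyone wondering what "alpha weighting" means here: one common way to balance dense and sparse signals in a hybrid query is a convex combination, where the dense vector is scaled by alpha and the sparse values by 1 - alpha before querying. The sketch below shows that pattern. The function name and the sample vectors are hypothetical, and this is not necessarily how the project does it.

```python
def weight_hybrid(dense, sparse, alpha):
    """Scale a dense vector by alpha and sparse values by (1 - alpha).

    alpha=1.0 is pure dense (semantic) search, alpha=0.0 is pure sparse (BM25-style).
    `sparse` is a dict with parallel "indices" and "values" lists.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    scaled_dense = [v * alpha for v in dense]
    scaled_sparse = {
        "indices": list(sparse["indices"]),
        "values": [v * (1.0 - alpha) for v in sparse["values"]],
    }
    return scaled_dense, scaled_sparse


# Toy vectors, just to show the scaling.
dense_q = [1.0, 2.0]
sparse_q = {"indices": [10, 42], "values": [4.0, 8.0]}

scaled_dense, scaled_sparse = weight_hybrid(dense_q, sparse_q, alpha=0.75)
print(scaled_dense, scaled_sparse)
```

If alpha is too low, exact keyword matches can drown out semantically relevant chunks, which is one plausible reading of the "garbage" results mentioned above.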

But portfolio projects have a blind spot: I've been staring at this thing for days straight. I need fresh eyes.

If you work on RAG, search, or video ML, I'd especially love your take on the retrieval pipeline. The eval harness has 145 golden queries, but I know there are gaps.

Thanks for reading. Go break my stuff.
