In my last post, I wrote about the unreasonable effectiveness of model stacking for media search—combining CLIP, Whisper, and ArcFace to find video content through visual descriptions, dialog, and faces. Over the holidays I expanded that afternoon hack into something more production-like.
Live demo: fennec.jasongpeterson.com
Starter code: github.com/JasonMakes801/fennec-search
Try This
- Go to fennec.jasongpeterson.com (desktop browser)
- Enter `older man on phone, harbor background` in Visual Content → click +
- Click the face of the older guy with glasses sitting with the harbor at his back
- Enter `the Americans had launched their missiles` in Dialog (Semantic mode) → click +
- Play the clip
You've drilled down to an exact shot without metadata, timecodes, or remembering exact words. The semantic search is fuzzy—he actually says "What it was telling him was that the US had launched their ICBMs," but that's close enough.
What's Under the Hood
- Containerized architecture: Vue/Nginx frontend, FastAPI backend, standalone ingest worker, Postgres+pgvector—all via docker-compose
- Background enrichment: Polling-based worker that handles drive mounting/unmounting gracefully (Watchdog doesn't work reliably with NFS/network shares); sketch after this list
- Semantic dialog search: Sentence-transformer embeddings so "Americans launched missiles" finds "US fired rockets"
- Frame-accurate playback: HTML5 video decoded to canvas using `requestVideoFrameCallback()`
- EDL export: Queue scenes and export CMX 3600 for NLE roundtrip (sketch below)
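The polling approach in the enrichment worker is the least glamorous but most robust part. Here's a minimal sketch of the idea; the paths, interval, and `enqueue_for_enrichment` hand-off are my own placeholders, not fennec's actual code:

```python
import time
from pathlib import Path

MEDIA_ROOT = Path("/mnt/media")   # hypothetical mount point for the watched drive
POLL_INTERVAL = 30                # seconds between scans
VIDEO_EXTS = {".mp4", ".mov", ".mkv"}

seen: set[Path] = set()

def enqueue_for_enrichment(path: Path) -> None:
    # Stand-in for handing the file to the real ingest pipeline (CLIP/Whisper/ArcFace).
    print(f"queued {path}")

def poll_loop() -> None:
    """Poll the mount instead of inotify/Watchdog, which is unreliable on NFS shares."""
    while True:
        if MEDIA_ROOT.is_dir():
            for path in MEDIA_ROOT.rglob("*"):
                if path.suffix.lower() in VIDEO_EXTS and path not in seen:
                    seen.add(path)
                    enqueue_for_enrichment(path)
        # If the drive is unmounted, is_dir() is False and we simply skip this
        # cycle; `seen` survives, so files aren't re-queued when the mount returns.
        time.sleep(POLL_INTERVAL)

if __name__ == "__main__":
    poll_loop()
```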
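The EDL export, meanwhile, is mostly string formatting. A sketch of emitting a CMX 3600 cut list from queued scenes, with an assumed frame rate and a made-up `Scene` shape rather than fennec's real data model:

```python
from dataclasses import dataclass

FPS = 24  # assumed project frame rate

@dataclass
class Scene:
    reel: str     # source reel/tape name (max 8 chars in CMX 3600)
    src_in: int   # source in point, in frames
    src_out: int  # source out point (exclusive), in frames

def tc(frames: int) -> str:
    """Frames -> HH:MM:SS:FF timecode at the assumed frame rate."""
    ff = frames % FPS
    ss = (frames // FPS) % 60
    mm = (frames // (FPS * 60)) % 60
    hh = frames // (FPS * 3600)
    return f"{hh:02d}:{mm:02d}:{ss:02d}:{ff:02d}"

def export_edl(title: str, scenes: list[Scene]) -> str:
    lines = [f"TITLE: {title}", "FCM: NON-DROP FRAME", ""]
    rec = 0  # record-side timeline position, in frames
    for i, s in enumerate(scenes, start=1):
        dur = s.src_out - s.src_in
        # Event line: number, reel, video track, cut, source in/out, record in/out
        lines.append(
            f"{i:03d}  {s.reel:<8} V     C        "
            f"{tc(s.src_in)} {tc(s.src_out)} {tc(rec)} {tc(rec + dur)}"
        )
        rec += dur
    return "\n".join(lines) + "\n"

if __name__ == "__main__":
    print(export_edl("FENNEC_EXPORT", [Scene("AX", 240, 360), Scene("AX", 1200, 1320)]))
```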
The Postgres + pgvector setup turned out cleaner than expected—vector similarity combined with metadata filtering in a single query just works.
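For illustration, here's roughly what that single query looks like with psycopg and sentence-transformers; the `dialog_segments` table, its columns, and the model name are assumptions on my part, not the project's actual schema:

```python
import psycopg
from sentence_transformers import SentenceTransformer

# Assumed model; anything that matches the ingest-time embeddings works.
model = SentenceTransformer("all-MiniLM-L6-v2")

def search_dialog(conn: psycopg.Connection, text: str, project_id: int, limit: int = 10):
    # Embed the query the same way transcript segments were embedded at ingest.
    qvec = model.encode(text)
    vec_literal = "[" + ",".join(str(float(x)) for x in qvec) + "]"  # pgvector text format
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT clip_id, start_sec, end_sec, transcript,
                   embedding <=> %s::vector AS distance  -- cosine distance via pgvector
            FROM dialog_segments
            WHERE project_id = %s                        -- ordinary metadata filter, same query
            ORDER BY distance
            LIMIT %s
            """,
            (vec_literal, project_id, limit),
        )
        return cur.fetchall()

# Usage (placeholder DSN):
# with psycopg.connect("postgresql://localhost/fennec") as conn:
#     rows = search_dialog(conn, "the Americans had launched their missiles", project_id=1)
```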
Links
Demo footage from Pioneer One, a Creative Commons-licensed Canadian drama. Built with significant help from Claude Code.
