I Built a Video Search Engine That Understands What You're Looking For

Ever tried finding that ONE moment in a 2-hour video? Yeah, me too. It sucks.

Back again with another project! Hope y'all had an amazing Christmas πŸŽ„ Jingle bells, jingle bells, jingle all the way ✌️


The Problem

You recorded a meeting. Or a lecture. Or your kid's recital. Now you need to find that specific part where someone said something important, or that exact scene you vaguely remember.

Your options:

  1. Scrub through the entire video like a caveman
  2. Hope YouTube's auto-chapters got it right (they didn't)
  3. Give up and rewatch the whole thing

What if you could just... describe what you're looking for?

"Find the part where he talks about the budget"

"Show me when there's a red car on screen"

"Jump to where she mentions the deadline"

That's what I built.


Introducing SearchLightAI πŸ”¦

SearchLightAI lets you search your videos by describing what you see OR what was said. Upload a video, wait for it to process, then search with natural language.

It returns the exact timestamp. Click it. You're there.

Search your videos like you search your documents.


The Tech Stack πŸ€“

Layer              Tech
API                FastAPI + SQLModel
Databases          PostgreSQL (metadata) + Qdrant (vectors)
Vision AI          SigLIP2 (google/siglip2-base-patch16-512)
Speech AI          faster-whisper + Sentence Transformers
Video Processing   FFmpeg + PySceneDetect
Frontend           Next.js 16, React 19, Tailwind CSS, shadcn/ui

How It Works

πŸ“₯ Ingestion Pipeline

Video Upload
    ↓
PySceneDetect β†’ finds scene changes
    ↓
FFmpeg β†’ extracts keyframes + audio
    ↓
faster-whisper β†’ transcribes speech
    ↓
SigLIP2 β†’ embeds keyframes (768-dim)
Sentence Transformers β†’ embeds transcript (384-dim)
    ↓
Qdrant β†’ stores all vectors
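To make that concrete, here's a minimal sketch of the transcript half of the pipeline (the keyframe half mirrors it with SigLIP2). The collection name and payload fields are placeholders of mine, and all-MiniLM-L6-v2 is a guess for the Sentence Transformers model (its 384 dims match the diagram), not the project's actual code:

from faster_whisper import WhisperModel
from sentence_transformers import SentenceTransformer
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct

# Transcribe speech into timestamped segments
whisper = WhisperModel("base", device="cuda")
segments, _info = whisper.transcribe("video.mp4")
texts = [(seg.start, seg.text) for seg in segments]

# Embed each transcript segment (384-dim vectors)
text_model = SentenceTransformer("all-MiniLM-L6-v2")
vectors = text_model.encode([text for _, text in texts])

# Store the vectors in Qdrant, keeping the timestamp in the payload
client = QdrantClient("localhost", port=6333)
client.upsert(
    collection_name="transcript",  # placeholder name
    points=[
        PointStruct(id=i, vector=vec.tolist(),
                    payload={"start": start, "text": text})
        for i, ((start, text), vec) in enumerate(zip(texts, vectors))
    ],
)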

πŸ” Search Pipeline

Your query: "when he talks about the budget"
    ↓
Same models embed your query
    ↓
Cosine similarity search in Qdrant
    ↓
Results ranked by relevance
    ↓
Click β†’ jump to exact timestamp
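Continuing the sketch above, the speech side of a search is just two calls: embed the query with the same model used at ingestion time, then ask Qdrant for the nearest neighbors:

query = "when he talks about the budget"

# Embed the query with the same model used at ingestion time
query_vector = text_model.encode(query)

# Cosine-similarity search against the stored transcript vectors
hits = client.search(
    collection_name="transcript",  # placeholder name from the sketch above
    query_vector=query_vector.tolist(),
    limit=5,
)

for hit in hits:
    print(f"{hit.payload['start']:.1f}s  {hit.score:.2f}  {hit.payload['text']}")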

Three Search Modes

🎬 Visual Search - Describe what you see

  • "man standing near whiteboard"
  • "outdoor scene with trees"
  • "someone holding a laptop"

🎀 Speech Search - What was said

  • "when they mentioned the quarterly results"
  • "the part about machine learning"
  • "discussion about the timeline"

πŸ”€ Hybrid Search - Best of both
Combines visual and speech results. Usually what you want.
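The post doesn't spell out the fusion step, so here's one plausible way to merge the two ranked lists: min-max-normalize each modality's scores so they're comparable, weight them, and re-sort. The weighting is my guess, not SearchLightAI's actual tuning:

def hybrid_merge(visual_hits, speech_hits, visual_weight=0.5):
    """Merge two ranked hit lists, normalizing scores per modality first."""
    def normalized(hits):
        if not hits:
            return []
        scores = [h.score for h in hits]
        lo, hi = min(scores), max(scores)
        span = (hi - lo) or 1.0  # avoid division by zero
        return [(h, (h.score - lo) / span) for h in hits]

    combined = [(h, visual_weight * s) for h, s in normalized(visual_hits)]
    combined += [(h, (1 - visual_weight) * s) for h, s in normalized(speech_hits)]
    combined.sort(key=lambda pair: pair[1], reverse=True)
    return [h for h, _ in combined]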


The Secret Sauce: SigLIP2

Most visual search uses CLIP. I went with SigLIP2 instead.

Why? SigLIP uses a sigmoid loss instead of CLIP's softmax contrastive loss. The practical difference: better zero-shot performance, especially on fine-grained visual details.
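For reference, embedding a keyframe and a text query with SigLIP2 looks like this with Hugging Face transformers (a sketch; one detail worth knowing is that SigLIP-family text encoders expect max_length padding):

import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

ckpt = "google/siglip2-base-patch16-512"
model = AutoModel.from_pretrained(ckpt)
processor = AutoProcessor.from_pretrained(ckpt)

with torch.no_grad():
    # Image side: one 768-dim vector per keyframe
    img_inputs = processor(images=Image.open("keyframe.jpg"), return_tensors="pt")
    image_emb = model.get_image_features(**img_inputs)

    # Text side: same embedding space, so queries and frames are comparable
    txt_inputs = processor(text=["a red car on screen"],
                           padding="max_length", max_length=64,
                           return_tensors="pt")
    text_emb = model.get_text_features(**txt_inputs)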

One quirk though - raw SigLIP scores are lower than you'd expect. A "great match" might be 0.25-0.35 cosine similarity. So I rescale them:

import math

def rescale_siglip_score(cosine_score: float) -> float:
    """Maps raw SigLIP cosine scores to an intuitive 0-1 range via a sigmoid."""
    midpoint = 0.18   # raw score that should map to 0.5
    steepness = 12    # how sharply scores spread around the midpoint
    x = (cosine_score - midpoint) * steepness
    return 1 / (1 + math.exp(-x))

Now 0.35 β†’ ~90%, 0.25 β†’ ~70%, which feels right in the UI.


Smart Keyframe Extraction

I'm not extracting every frame (that would be insane). PySceneDetect uses adaptive content detection to find actual scene changes.

For each scene, I grab:

  • Frame at the start
  • Frame at the middle (for scenes > 2 seconds)

This gives good coverage without exploding storage or processing time.
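In code, the selection logic is only a few lines with PySceneDetect (a sketch of the idea, not the project's exact code):

from scenedetect import detect, AdaptiveDetector

scenes = detect("video.mp4", AdaptiveDetector())

keyframe_times = []
for start, end in scenes:
    start_s, end_s = start.get_seconds(), end.get_seconds()
    keyframe_times.append(start_s)               # frame at the scene start
    if end_s - start_s > 2.0:                    # middle frame for longer scenes
        keyframe_times.append((start_s + end_s) / 2)

# Each timestamp can then be handed to FFmpeg, e.g.:
#   ffmpeg -ss <t> -i video.mp4 -frames:v 1 frame.jpg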


Running It Yourself

Docker Compose (Recommended)

git clone https://github.com/kiranbaby14/searchlightai.git
cd searchlightai

cp apps/server/.env.example apps/server/.env
cp apps/client/.env.example apps/client/.env

docker-compose up -d

Wait for the models to load (around 2-3 minutes on the first run), and you're good to go.

Prerequisites

  • NVIDIA GPU with CUDA support
  • Docker + Docker Compose
  • 4GB+ of VRAM (SigLIP2, faster-whisper, and Sentence Transformers are all relatively lightweight)

⏱️ Heads up: processing time scales with video length. A 10-minute video takes a couple of minutes, but longer videos (1hr+) will need more patience: scene detection, transcription, and embedding generation all add up.


What Could You Build With This?

Some ideas:

  • πŸ“Ή Meeting search - Find decisions across hundreds of recorded meetings
  • πŸŽ“ Lecture navigation - Students jumping to specific topics
  • πŸ“Ί Media asset management - Search through footage libraries
  • πŸ“± Personal video search - Your phone videos, finally searchable

The Code Is Yours

GitHub: github.com/kiranbaby14/SearchLightAI

Star it ⭐ if you think video search should be this easy.


Shoutouts πŸ™

  • SigLIP2 from Google for visual embeddings that actually work
  • PySceneDetect for making scene detection actually usable
  • Qdrant for a vector DB that just works
  • faster-whisper for Whisper that's actually fast

That's It. Go Break It.

Clone it, throw your weirdest videos at it, see what breaks. File issues. Send PRs. Roast my code in the comments.

The best part of putting stuff out there? Finding out all the ways you didn't think of using it.

Catch you in the next one. ✌️


Built with ⚑, marathon Claude Code sessions, and an unhealthy amount of caffeine β˜• by @kiranbaby14
