Ever tried finding that ONE moment in a 2-hour video? Yeah, me too. It sucks.
Back again with another project! Hope y'all had an amazing Christmas! Jingle bells, jingle bells, jingle all the way.
The Problem
You recorded a meeting. Or a lecture. Or your kid's recital. Now you need to find that specific part where someone said something important, or that exact scene you vaguely remember.
Your options:
- Scrub through the entire video like a caveman
- Hope YouTube's auto-chapters got it right (they didn't)
- Give up and rewatch the whole thing
What if you could just... describe what you're looking for?
"Find the part where he talks about the budget"
"Show me when there's a red car on screen"
"Jump to where she mentions the deadline"
That's what I built.
Introducing SearchLightAI
SearchLightAI lets you search your videos by describing what you see OR what was said. Upload a video, wait for it to process, then search with natural language.
It returns the exact timestamp. Click it. You're there.
Search your videos like you search your documents.
The Tech Stack
| Layer | Tech |
|---|---|
| API | FastAPI + SQLModel |
| Databases | PostgreSQL (metadata) + Qdrant (vectors) |
| Vision AI | SigLIP2 (google/siglip2-base-patch16-512) |
| Speech AI | faster-whisper + Sentence Transformers |
| Video Processing | FFmpeg + PySceneDetect |
| Frontend | Next.js 16, React 19, Tailwind CSS, shadcn/ui |
How It Works
Ingestion Pipeline
```
Video Upload
   ↓
PySceneDetect → finds scene changes
   ↓
FFmpeg → extracts keyframes + audio
   ↓
faster-whisper → transcribes speech
   ↓
SigLIP2 → embeds keyframes (768-dim)
Sentence Transformers → embeds transcript (384-dim)
   ↓
Qdrant → stores all vectors
```
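Here's a minimal sketch of the speech half of that pipeline (transcribe, embed, store). The Whisper size, the `all-MiniLM-L6-v2` sentence model, and the `speech_segments` collection name are my illustrative assumptions, not necessarily what the repo configures:

```python
from faster_whisper import WhisperModel
from sentence_transformers import SentenceTransformer
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

# Assumed model/collection choices; swap in whatever the repo actually uses.
whisper = WhisperModel("base", device="cuda", compute_type="float16")
embedder = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dim text embeddings
qdrant = QdrantClient(url="http://localhost:6333")

qdrant.recreate_collection(
    collection_name="speech_segments",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)

# Transcribe the extracted audio, then embed each segment and store it with
# start/end timestamps so search results can link back into the video.
segments, _info = whisper.transcribe("audio.wav")
points = []
for i, seg in enumerate(segments):
    points.append(
        PointStruct(
            id=i,
            vector=embedder.encode(seg.text).tolist(),
            payload={"start": seg.start, "end": seg.end, "text": seg.text},
        )
    )
qdrant.upsert(collection_name="speech_segments", points=points)
```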
Search Pipeline
```
Your query: "when he talks about the budget"
   ↓
Same models embed your query
   ↓
Cosine similarity search in Qdrant
   ↓
Results ranked by relevance
   ↓
Click → jump to exact timestamp
```
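For the speech mode, the query path is just a few lines. Again, the collection name and sentence model below are my assumptions and have to match whatever was used at ingestion time:

```python
from qdrant_client import QdrantClient
from sentence_transformers import SentenceTransformer

qdrant = QdrantClient(url="http://localhost:6333")
embedder = SentenceTransformer("all-MiniLM-L6-v2")  # must match the ingestion model

# Embed the natural-language query with the same model used for the transcript,
# then let Qdrant rank stored segments by cosine similarity.
query_vector = embedder.encode("when he talks about the budget").tolist()
hits = qdrant.search(
    collection_name="speech_segments",
    query_vector=query_vector,
    limit=5,
)
for hit in hits:
    print(f"{hit.score:.3f}  {hit.payload['start']:.1f}s  {hit.payload['text']}")
```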
Three Search Modes
Visual Search - Describe what you see
- "man standing near whiteboard"
- "outdoor scene with trees"
- "someone holding a laptop"
Speech Search - What was said
- "when they mentioned the quarterly results"
- "the part about machine learning"
- "discussion about the timeline"
Hybrid Search - Best of both
Combines visual and speech results. Usually what you want.
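The post doesn't spell out how the two result lists get merged, so don't read the following as the repo's actual logic. One simple, widely used way to fuse rankings from two modalities is reciprocal rank fusion; here's a sketch (the payload keys are hypothetical):

```python
def reciprocal_rank_fusion(visual_hits, speech_hits, k: int = 60):
    """Merge two ranked hit lists, favouring items ranked high in either list.

    Not necessarily SearchLightAI's fusion strategy; just one standard option.
    Each hit is assumed to expose payload["video_id"] and payload["start"].
    """
    fused: dict[tuple, float] = {}
    for hits in (visual_hits, speech_hits):
        for rank, hit in enumerate(hits):
            key = (hit.payload["video_id"], round(hit.payload["start"]))
            fused[key] = fused.get(key, 0.0) + 1.0 / (k + rank + 1)
    # Highest fused score first: (video_id, timestamp) pairs ready to display.
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)
```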
The Secret Sauce: SigLIP2
Most visual search uses CLIP. I went with SigLIP2 instead.
Why? SigLIP uses sigmoid loss instead of softmax contrastive loss. The practical difference: better zero-shot performance, especially for fine-grained visual details.
One quirk though - raw SigLIP scores are lower than you'd expect. A "great match" might be 0.25-0.35 cosine similarity. So I rescale them:
```python
import math

def rescale_siglip_score(cosine_score: float) -> float:
    """Maps raw SigLIP cosine scores to an intuitive 0-1 range."""
    midpoint = 0.18
    steepness = 12
    x = (cosine_score - midpoint) * steepness
    return 1 / (1 + math.exp(-x))
```
Now 0.35 → ~90% and 0.25 → ~70%, which feels right in the UI.
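For context, here's roughly where those raw cosine scores come from, sketched against the Hugging Face transformers interface for SigLIP-style models (the `get_image_features`/`get_text_features` calls are what that library exposes; the repo's own wrapper code may look different):

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

model_id = "google/siglip2-base-patch16-512"
model = AutoModel.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("keyframe.jpg")
with torch.no_grad():
    image_inputs = processor(images=image, return_tensors="pt")
    text_inputs = processor(text=["a red car on screen"], padding="max_length", return_tensors="pt")
    image_emb = model.get_image_features(**image_inputs)
    text_emb = model.get_text_features(**text_inputs)

# Normalize, take the dot product (= cosine similarity), then rescale for the UI.
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
raw = (image_emb @ text_emb.T).item()
print(raw, rescale_siglip_score(raw))
```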
Smart Keyframe Extraction
I'm not extracting every frame (that would be insane). PySceneDetect uses adaptive content detection to find actual scene changes.
For each scene, I grab:
- Frame at the start
- Frame at the middle (for scenes > 2 seconds)
This gives good coverage without exploding storage or processing time.
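A sketch of what that looks like with PySceneDetect's high-level API plus FFmpeg (the detector settings and output naming here are illustrative, not copied from the repo):

```python
import subprocess
from pathlib import Path
from scenedetect import AdaptiveDetector, detect

def extract_keyframes(video_path: str, out_dir: str = "frames") -> None:
    """Grab a start frame per scene, plus a middle frame for scenes longer than 2s."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    # detect() returns a list of (start, end) FrameTimecode pairs, one per scene.
    scenes = detect(video_path, AdaptiveDetector())
    for i, (start, end) in enumerate(scenes):
        timestamps = [start.get_seconds()]
        if end.get_seconds() - start.get_seconds() > 2.0:
            timestamps.append((start.get_seconds() + end.get_seconds()) / 2.0)
        for j, t in enumerate(timestamps):
            # Seek to the timestamp and dump a single JPEG keyframe.
            subprocess.run(
                ["ffmpeg", "-y", "-ss", f"{t:.3f}", "-i", video_path,
                 "-frames:v", "1", f"{out_dir}/scene{i:04d}_{j}.jpg"],
                check=True,
            )
```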
Running It Yourself
Docker Compose (Recommended)
```bash
git clone https://github.com/kiranbaby14/searchlightai.git
cd searchlightai
cp apps/server/.env.example apps/server/.env
cp apps/client/.env.example apps/client/.env
docker-compose up -d
```
Wait for models to load (around 2-3 min first time), then:
- Frontend: http://localhost:3000
- API: http://localhost:8000
Prerequisites
- NVIDIA GPU with CUDA support
- Docker + Docker Compose
- 4GB+ of VRAM should be enough (SigLIP2, faster-whisper, and Sentence Transformers are relatively lightweight)
Heads up: Processing time depends on video length. A 10-min video takes a couple minutes, but longer videos (1hr+) will need more patience. Scene detection, transcription, and embedding generation all add up.
What Could You Build With This?
Some ideas:
- Meeting search - Find decisions across hundreds of recorded meetings
- Lecture navigation - Students jumping to specific topics
- Media asset management - Search through footage libraries
- Personal video search - Your phone videos, finally searchable
The Code Is Yours
GitHub: github.com/kiranbaby14/SearchLightAI
Star it if you think video search should be this easy.
Shoutouts
- SigLIP2 from Google for visual embeddings that actually work
- PySceneDetect for making scene detection actually usable
- Qdrant for a vector DB that just works
- faster-whisper for Whisper that's actually fast
That's It. Go Break It.
Clone it, throw your weirdest videos at it, see what breaks. File issues. Send PRs. Roast my code in the comments.
The best part of putting stuff out there? Finding out all the ways you didn't think of using it.
Catch you in the next one.
Built with mass Claude Code sessions and an unhealthy amount of caffeine, by @kiranbaby14