I Built a Video Search Engine That Understands What You're Looking For

Ever tried finding that ONE moment in a 2-hour video? Yeah, me too. It sucks.

Back again with another project! Hope y'all had an amazing Christmas πŸŽ„ Jingle bells, jingle bells, jingle all the way ✌️


The Problem

You recorded a meeting. Or a lecture. Or your kid's recital. Now you need to find that specific part where someone said something important, or that exact scene you vaguely remember.

Your options:

  1. Scrub through the entire video like a caveman
  2. Hope YouTube's auto-chapters got it right (they didn't)
  3. Give up and rewatch the whole thing

What if you could just... describe what you're looking for?

"Find the part where he talks about the budget"

"Show me when there's a red car on screen"

"Jump to where she mentions the deadline"

That's what I built.


Introducing SearchLightAI πŸ”¦

SearchLightAI lets you search your videos by describing what you see OR what was said. Upload a video, wait for it to process, then search with natural language.

It returns the exact timestamp. Click it. You're there.

Search your videos like you search your documents.


The Tech Stack πŸ€“

Layer              Tech
API                FastAPI + SQLModel
Databases          PostgreSQL (metadata) + Qdrant (vectors)
Vision AI          SigLIP2 (google/siglip2-base-patch16-512)
Speech AI          faster-whisper + Sentence Transformers
Video Processing   FFmpeg + PySceneDetect
Frontend           Next.js 16, React 19, Tailwind CSS, shadcn/ui

How It Works

πŸ“₯ Ingestion Pipeline

Video Upload
    ↓
PySceneDetect β†’ finds scene changes
    ↓
FFmpeg β†’ extracts keyframes + audio
    ↓
faster-whisper β†’ transcribes speech
    ↓
SigLIP2 β†’ embeds keyframes (768-dim)
Sentence Transformers β†’ embeds transcript (384-dim)
    ↓
Qdrant β†’ stores all vectors
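To make that concrete, here's a minimal sketch of the transcript half of the pipeline (the keyframe half mirrors it with SigLIP2). The collection name and payload fields are placeholders of mine, and all-MiniLM-L6-v2 is a guess for the Sentence Transformers model (its 384 dims match the diagram), not the project's actual code:

from faster_whisper import WhisperModel
from sentence_transformers import SentenceTransformer
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct

# Transcribe speech into timestamped segments
whisper = WhisperModel("base", device="cuda")
segments, _info = whisper.transcribe("video.mp4")
texts = [(seg.start, seg.text) for seg in segments]

# Embed each transcript segment (384-dim vectors)
text_model = SentenceTransformer("all-MiniLM-L6-v2")
vectors = text_model.encode([text for _, text in texts])

# Store the vectors in Qdrant, keeping the timestamp in the payload
client = QdrantClient("localhost", port=6333)
client.upsert(
    collection_name="transcript",  # placeholder name
    points=[
        PointStruct(id=i, vector=vec.tolist(),
                    payload={"start": start, "text": text})
        for i, ((start, text), vec) in enumerate(zip(texts, vectors))
    ],
)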

πŸ” Search Pipeline

Your query: "when he talks about the budget"
    ↓
Same models embed your query
    ↓
Cosine similarity search in Qdrant
    ↓
Results ranked by relevance
    ↓
Click β†’ jump to exact timestamp
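Continuing the sketch above, the speech side of a search is just two calls: embed the query with the same model used at ingestion time, then ask Qdrant for the nearest neighbors:

query = "when he talks about the budget"

# Embed the query with the same model used at ingestion time
query_vector = text_model.encode(query)

# Cosine-similarity search against the stored transcript vectors
hits = client.search(
    collection_name="transcript",  # placeholder name from the sketch above
    query_vector=query_vector.tolist(),
    limit=5,
)

for hit in hits:
    print(f"{hit.payload['start']:.1f}s  {hit.score:.2f}  {hit.payload['text']}")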

Three Search Modes

🎬 Visual Search - Describe what you see

  • "man standing near whiteboard"
  • "outdoor scene with trees"
  • "someone holding a laptop"

🎀 Speech Search - What was said

  • "when they mentioned the quarterly results"
  • "the part about machine learning"
  • "discussion about the timeline"

πŸ”€ Hybrid Search - Best of both
Combines visual and speech results. Usually what you want.
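The post doesn't spell out the fusion step, so here's one plausible way to merge the two ranked lists: min-max-normalize each modality's scores so they're comparable, weight them, and re-sort. The weighting is my guess, not SearchLightAI's actual tuning:

def hybrid_merge(visual_hits, speech_hits, visual_weight=0.5):
    """Merge two ranked hit lists, normalizing scores per modality first."""
    def normalized(hits):
        if not hits:
            return []
        scores = [h.score for h in hits]
        lo, hi = min(scores), max(scores)
        span = (hi - lo) or 1.0  # avoid division by zero
        return [(h, (h.score - lo) / span) for h in hits]

    combined = [(h, visual_weight * s) for h, s in normalized(visual_hits)]
    combined += [(h, (1 - visual_weight) * s) for h, s in normalized(speech_hits)]
    combined.sort(key=lambda pair: pair[1], reverse=True)
    return [h for h, _ in combined]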


The Secret Sauce: SigLIP2

Most visual search uses CLIP. I went with SigLIP2 instead.

Why? SigLIP uses a sigmoid loss instead of CLIP's softmax contrastive loss. The practical difference: better zero-shot performance, especially on fine-grained visual details.
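For reference, embedding a keyframe and a text query with SigLIP2 looks like this with Hugging Face transformers (a sketch; one detail worth knowing is that SigLIP-family text encoders expect max_length padding):

import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

ckpt = "google/siglip2-base-patch16-512"
model = AutoModel.from_pretrained(ckpt)
processor = AutoProcessor.from_pretrained(ckpt)

with torch.no_grad():
    # Image side: one 768-dim vector per keyframe
    img_inputs = processor(images=Image.open("keyframe.jpg"), return_tensors="pt")
    image_emb = model.get_image_features(**img_inputs)

    # Text side: same embedding space, so queries and frames are comparable
    txt_inputs = processor(text=["a red car on screen"],
                           padding="max_length", max_length=64,
                           return_tensors="pt")
    text_emb = model.get_text_features(**txt_inputs)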

One quirk though - raw SigLIP scores are lower than you'd expect. A "great match" might be 0.25-0.35 cosine similarity. So I rescale them:

import math

def rescale_siglip_score(cosine_score: float) -> float:
    """Maps raw SigLIP cosine scores to an intuitive 0-1 range via a sigmoid."""
    midpoint = 0.18   # raw score that should map to 0.5
    steepness = 12    # how sharply scores spread around the midpoint
    x = (cosine_score - midpoint) * steepness
    return 1 / (1 + math.exp(-x))

Now 0.35 β†’ ~90%, 0.25 β†’ ~70%, which feels right in the UI.


Smart Keyframe Extraction

I'm not extracting every frame (that would be insane). PySceneDetect uses adaptive content detection to find actual scene changes.

For each scene, I grab:

  • Frame at the start
  • Frame at the middle (for scenes > 2 seconds)

This gives good coverage without exploding storage or processing time.
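In code, the selection logic is only a few lines with PySceneDetect (a sketch of the idea, not the project's exact code):

from scenedetect import detect, AdaptiveDetector

scenes = detect("video.mp4", AdaptiveDetector())

keyframe_times = []
for start, end in scenes:
    start_s, end_s = start.get_seconds(), end.get_seconds()
    keyframe_times.append(start_s)               # frame at the scene start
    if end_s - start_s > 2.0:                    # middle frame for longer scenes
        keyframe_times.append((start_s + end_s) / 2)

# Each timestamp can then be handed to FFmpeg, e.g.:
#   ffmpeg -ss <t> -i video.mp4 -frames:v 1 frame.jpg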


Running It Yourself

Docker Compose (Recommended)

git clone https://github.com/kiranbaby14/searchlightai.git
cd searchlightai

cp apps/server/.env.example apps/server/.env
cp apps/client/.env.example apps/client/.env

docker-compose up -d

Wait for the models to load (around 2-3 minutes on the first run), and you're good to go.

Prerequisites

  • NVIDIA GPU with CUDA support
  • Docker + Docker Compose
  • 4GB+ of VRAM (SigLIP2, faster-whisper, and Sentence Transformers are all relatively lightweight)

⏱️ Heads up: processing time scales with video length. A 10-minute video takes a couple of minutes, but longer videos (1hr+) will need more patience: scene detection, transcription, and embedding generation all add up.


What Could You Build With This?

Some ideas:

  • πŸ“Ή Meeting search - Find decisions across hundreds of recorded meetings
  • πŸŽ“ Lecture navigation - Students jumping to specific topics
  • πŸ“Ί Media asset management - Search through footage libraries
  • πŸ“± Personal video search - Your phone videos, finally searchable

The Code Is Yours

GitHub: github.com/kiranbaby14/SearchLightAI

Star it ⭐ if you think video search should be this easy.


Shoutouts πŸ™

  • SigLIP2 from Google for visual embeddings that actually work
  • PySceneDetect for making scene detection actually usable
  • Qdrant for a vector DB that just works
  • faster-whisper for Whisper that's actually fast

That's It. Go Break It.

Clone it, throw your weirdest videos at it, see what breaks. File issues. Send PRs. Roast my code in the comments.

The best part of putting stuff out there? Finding out all the ways you didn't think of using it.

Catch you in the next one. ✌️


Built with ⚑, marathon Claude Code sessions, and an unhealthy amount of caffeine β˜• by @kiranbaby14
