SentrySearch: Semantic Search Over Dashcam Footage Using Gemini Embedding 2
Written by Arshdeep Singh
Scrubbing through hours of dashcam footage to find one specific moment is exactly as tedious as it sounds. You remember something happened — a car cut you off, someone ran a red light — but now you're stuck fast-forwarding through gigabytes of MP4 files like it's 2003.
SentrySearch solves this. It's an open-source Python CLI that lets you search raw video files in plain English. Type what you're looking for, get a trimmed clip back.
sentrysearch search "red truck running a stop sign"
That's it.
What Is SentrySearch?
SentrySearch is a command-line tool built by @ssrajadh that brings semantic search to any folder of MP4 files. It was originally built for Tesla Sentry Mode footage (hence the name), but it works with any dashcam or video library.
The core idea:
- Index your footage once
- Search it with natural language queries
- Get back an auto-trimmed clip of the matching moment
No transcriptions. No frame captioning. No OCR. Just raw video → vectors → search.
How It Works: The Technical Core
The secret sauce is Google's Gemini Embedding 2 — the first natively multimodal embedding model that maps text, images, audio, and video into a single unified vector space.
Here's what that means in practice:
When you search for "car cutting me off at an intersection", Gemini converts that text into a 768-dimensional vector. It can also convert a 30-second video clip into a vector in that same space. So text and video become directly comparable — no intermediate step required.
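To make "directly comparable" concrete, here's the cosine-similarity math in plain Python. The two vectors below are made-up stand-ins (real ones would be 768-dimensional outputs from the Gemini API), but the comparison step is exactly this:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical stand-ins for a text-query vector and a video-chunk vector.
# In SentrySearch both would come back from the same Gemini embedding model.
query_vec = [0.12, -0.45, 0.83, 0.07]
chunk_vec = [0.10, -0.40, 0.80, 0.11]

print(round(cosine_similarity(query_vec, chunk_vec), 3))  # → 0.998
```

A score near 1.0 means "these point the same way in embedding space" — which, in a joint space, means the text describes the video.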
The Pipeline
Indexing: MP4 files → ffmpeg chunking → Gemini video embeddings → ChromaDB (local vector store)
Search: text query → Gemini text embedding → cosine similarity → top match → trimmed clip
Step by step:
Chunking — ffmpeg splits each MP4 into overlapping 30-second chunks (configurable). Overlap ensures events that span chunk boundaries aren't missed.
Still-frame skipping — chunks with no meaningful visual change (parked car, nothing happening) are skipped automatically. This saves API calls and reduces cost.
Embedding — each chunk is uploaded to the Gemini Embedding API, which processes exactly 1 frame per second and returns a dense vector.
Storage — vectors are stored in a local ChromaDB database alongside metadata (source file, timestamp offset).
Search — your query is embedded as text into the same vector space, matched via cosine similarity, and the top result is trimmed from the original file.
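The steps above can be sketched end to end in a few lines. Everything here is a stand-in: `fake_embed` replaces the Gemini API call and a plain list replaces ChromaDB — the point is the shape of the index/search loop, not SentrySearch's actual implementation:

```python
import math

def fake_embed(payload: str) -> list[float]:
    """Stand-in for a Gemini embedding call: deterministic pseudo-vector."""
    return [((hash(payload) >> (8 * i)) % 256) / 255.0 for i in range(8)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / ((math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))) or 1.0)

# "Index": embed each chunk and store the vector alongside its metadata,
# mirroring what SentrySearch keeps in ChromaDB.
chunks = [
    {"file": "front_2024-01-15.mp4", "start": 0,  "end": 30},
    {"file": "front_2024-01-15.mp4", "start": 25, "end": 55},
]
index = [{"meta": c, "vec": fake_embed(f"{c['file']}@{c['start']}")} for c in chunks]

def search(query: str, top_k: int = 1):
    """Embed the query, rank chunks by cosine similarity, return top metadata."""
    qvec = fake_embed(query)
    ranked = sorted(index, key=lambda e: cosine(qvec, e["vec"]), reverse=True)
    return [e["meta"] for e in ranked[:top_k]]

print(search("red truck running a stop sign"))
```

The returned metadata (file plus start/end offsets) is all that's needed for the final step: handing ffmpeg a trim window into the original file.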
Gemini Embedding 2: The Breakthrough That Makes This Possible
Before Gemini Embedding 2, building something like SentrySearch would require:
- Running a vision model on each frame to generate captions
- Embedding those captions as text
- Hoping the captions captured what you actually care about
That's slow, lossy, and expensive.
Gemini Embedding 2 eliminates the middleman. It's Google's first model where video, text, images, audio, and PDFs all project into a single joint embedding space. A text query is directly comparable to a video clip at the vector level.
This is what makes sub-second semantic search over hours of footage practical.
Key specs:
- 768-dimensional vectors (search mode)
- Native video support — 1 frame/second processed regardless of source FPS
- Available via Gemini API and Vertex AI
- Works with LangChain, LlamaIndex, Haystack, ChromaDB, Qdrant, Weaviate
Getting Started
Prerequisites
- Python 3.10+
- ffmpeg (or let it use the bundled imageio-ffmpeg)
- Gemini API key (free tier available at aistudio.google.com)
Install
git clone https://github.com/ssrajadh/sentrysearch.git
cd sentrysearch
python -m venv venv && source venv/bin/activate
pip install -e .
Setup
sentrysearch init
# Prompts for your Gemini API key, writes to .env, validates with a test embedding
Index Your Footage
sentrysearch index /path/to/dashcam/footage
Output:
Indexing file 1/3: front_2024-01-15_14-30.mp4 [chunk 1/4]
Indexing file 1/3: front_2024-01-15_14-30.mp4 [chunk 2/4]
...
Indexed 12 new chunks from 3 files. Total: 12 chunks from 3 files.
Search
sentrysearch search "red truck running a stop sign"
Output:
#1 [0.87] front_2024-01-15_14-30.mp4 @ 02:15-02:45
#2 [0.74] left_2024-01-15_14-30.mp4 @ 02:10-02:40
#3 [0.61] front_2024-01-20_09-15.mp4 @ 00:30-01:00
Saved clip: ./match_front_2024-01-15_14-30_02m15s-02m45s.mp4
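SentrySearch does the trimming for you, but for the curious: a match window like `02:15-02:45` maps to an ffmpeg stream-copy trim roughly like the sketch below. The exact flags the tool uses may differ; this is the standard fast-seek pattern:

```python
def trim_command(src: str, start: str, duration: int, dst: str) -> list[str]:
    """Build an ffmpeg argv that copies a window out of src without re-encoding.
    Putting -ss before -i seeks to the nearest keyframe, which is fast but may
    start the clip slightly before the requested time."""
    return ["ffmpeg", "-ss", start, "-i", src, "-t", str(duration), "-c", "copy", dst]

cmd = trim_command(
    "front_2024-01-15_14-30.mp4", "00:02:15", 30,
    "match_front_2024-01-15_14-30_02m15s-02m45s.mp4",
)
print(" ".join(cmd))
```

`-c copy` is why the trim is near-instant: no re-encoding, just a container-level cut.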
Tesla Dashcam Overlay (Bonus Feature)
For Tesla owners, SentrySearch can burn speed, GPS location, and timestamp directly onto your trimmed clips:
sentrysearch search "car cutting me off" --overlay
This reads SEI metadata embedded in Tesla dashcam files and renders a HUD showing:
- Speed (MPH)
- Date and time
- City and road name (via OpenStreetMap reverse geocoding)
Requires Tesla firmware 2025.44.25+ and HW3+.
pip install -e ".[tesla]"
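The "city and road name" lookup uses OpenStreetMap reverse geocoding. As a sketch (not SentrySearch's actual code), a reverse-geocode request against OSM's public Nominatim endpoint looks like this — note that Nominatim's usage policy requires a descriptive User-Agent header on real requests:

```python
from urllib.parse import urlencode

NOMINATIM = "https://nominatim.openstreetmap.org/reverse"

def reverse_geocode_url(lat: float, lon: float) -> str:
    """Build a Nominatim reverse-geocoding URL for a GPS fix.
    The JSON response's display_name field contains road/city strings."""
    return f"{NOMINATIM}?{urlencode({'lat': lat, 'lon': lon, 'format': 'jsonv2'})}"

print(reverse_geocode_url(37.7749, -122.4194))
```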
Pricing: Is It Practical?
Indexing 1 hour of footage costs approximately $2.84 with Gemini's embedding API at default settings (30s chunks, 5s overlap).
Breakdown:
- 1 hour = 3,600 seconds → 3,600 frames at $0.00079/frame = ~$2.84
- Still-frame skipping can cut this significantly for parked/security footage
- Search queries cost almost nothing (text embeddings only)
Cost optimization levers:
| Option | Effect |
|--------|--------|
| --chunk-duration 60 | Fewer chunks = fewer API calls |
| --overlap 0 | No overlap = minimum chunks |
| Still-frame skipping (default ON) | Skips idle footage = direct savings |
| --no-preprocess | Raw chunks (no ffmpeg downscaling) |
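To see how these levers interact, here's a back-of-the-envelope estimator using the article's figures (1 frame sampled per second, $0.00079/frame — check current Gemini pricing before relying on it). One subtlety: overlap re-embeds the boundary seconds, so default settings land somewhat above the no-overlap 3,600-frame baseline quoted above:

```python
import math

PRICE_PER_FRAME = 0.00079  # article's figure; verify against current pricing

def index_cost(video_seconds: float, chunk: float = 30.0, overlap: float = 5.0) -> float:
    """Estimated embedding cost: chunks advance by (chunk - overlap) seconds,
    and each chunk is sampled at 1 frame per second."""
    stride = chunk - overlap
    n_chunks = max(1, math.ceil((video_seconds - overlap) / stride))
    return n_chunks * chunk * PRICE_PER_FRAME

print(round(index_cost(3600), 2))                       # defaults: 30s chunks, 5s overlap → 3.41
print(round(index_cost(3600, chunk=60, overlap=0), 2))  # fewer, longer chunks → 2.84
```

Doubling chunk duration and dropping overlap shaves roughly 20% off — before still-frame skipping does anything.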
Limitations & Honest Caveats
- Chunk boundary problem — events that span two chunks may not match perfectly. Overlapping windows help, but aren't perfect.
- Gemini Embedding 2 is in preview — API behavior and pricing may change.
- No local model option — currently requires Gemini API. The community is watching for open-source multimodal embedding models to reach this quality level.
- Driving footage only for Tesla overlay — SEI telemetry isn't present in parked/Sentry Mode clips.
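The boundary caveat is easy to see with a little arithmetic. With 30-second chunks and 5 seconds of overlap, any event no longer than the overlap is guaranteed to land wholly inside some chunk; longer events can still straddle two windows (illustrative numbers, not from the tool's source):

```python
def chunk_windows(total: float, chunk: float = 30.0, overlap: float = 5.0):
    """Yield (start, end) windows advancing by (chunk - overlap) seconds."""
    start = 0.0
    while start < total:
        yield (start, min(start + chunk, total))
        start += chunk - overlap

def fully_contained(event_start: float, event_end: float, windows) -> bool:
    """True if some single window covers the whole event."""
    return any(s <= event_start and event_end <= e for s, e in windows)

windows = list(chunk_windows(90))  # (0,30), (25,55), (50,80), (75,90)
# A 4-second event straddling the 30s boundary still fits the 25-55 window.
print(fully_contained(28, 32, windows))  # → True
# A 10-second event spanning 24-34 fits no single window.
print(fully_contained(24, 34, windows))  # → False
```

In the second case the event is still *findable* — both partial chunks embed some of it — but the similarity score, and the trimmed clip, will be weaker than for a cleanly contained event.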
The Bigger Picture: Multimodal RAG Is Here
SentrySearch is a clean, practical example of what becomes possible when embedding models go truly multimodal.
The same architecture can apply to:
- Security camera footage — search hours of CCTV with natural language
- Sports video — find specific plays or moments
- Meeting recordings — semantic search without transcription
- Medical imaging — cross-modal retrieval across reports and scans
We're entering an era where the traditional RAG pipeline (chunk text → embed → retrieve) expands to cover every modality. Gemini Embedding 2 is the first production model that makes this real with video.
SentrySearch is a sharp, well-executed proof of concept. And it ships today.
Resources
- GitHub: github.com/ssrajadh/sentrysearch
- Gemini Embedding 2 docs: ai.google.dev/gemini-api/docs/models/gemini-embedding-2-preview
- HN discussion: news.ycombinator.com/item?id=47427193
- Gemini API key: aistudio.google.com/apikey