SentrySearch: Semantic Search Over Dashcam Footage Using Gemini Embedding 2
Written by Arshdeep Singh
Scrubbing through hours of dashcam footage to find one specific moment is exactly as tedious as it sounds. You remember something happened — a car cut you off, someone ran a red light — but now you're stuck fast-forwarding through gigabytes of MP4 files like it's 2003.
SentrySearch solves this. It's an open-source Python CLI that lets you search raw video files in plain English. Type what you're looking for, get a trimmed clip back.
sentrysearch search "red truck running a stop sign"
That's it.
What Is SentrySearch?
SentrySearch is a command-line tool built by @ssrajadh that brings semantic search to any folder of MP4 files. It was originally built for Tesla Sentry Mode footage (hence the name), but it works with any dashcam or video library.
The core idea:
- Index your footage once
- Search it with natural language queries
- Get back an auto-trimmed clip of the matching moment
No transcriptions. No frame captioning. No OCR. Just raw video → vectors → search.
How It Works: The Technical Core
The secret sauce is Google's Gemini Embedding 2 — the first natively multimodal embedding model that maps text, images, audio, and video into a single unified vector space.
Here's what that means in practice:
When you search for "car cutting me off at an intersection", Gemini converts that text into a 768-dimensional vector. It can also convert a 30-second video clip into a vector in that same space. So text and video become directly comparable — no intermediate step required.
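To make "directly comparable" concrete, here's the cosine-similarity math in plain Python. The two vectors below are made-up stand-ins (real ones would be 768-dimensional outputs from the Gemini API), but the comparison step is exactly this:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical stand-ins for a text-query vector and a video-chunk vector.
# In SentrySearch both would come back from the same Gemini embedding model.
query_vec = [0.12, -0.45, 0.83, 0.07]
chunk_vec = [0.10, -0.40, 0.80, 0.11]

print(round(cosine_similarity(query_vec, chunk_vec), 3))  # → 0.998
```

A score near 1.0 means "these point the same way in embedding space" — which, in a joint space, means the text describes the video.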
The Pipeline
Indexing: MP4 files → ffmpeg chunking → Gemini video embeddings → ChromaDB (local vector store)
Search: text query → Gemini text embedding → cosine similarity → top match → trimmed clip
Step by step:
Chunking — ffmpeg splits each MP4 into overlapping 30-second chunks (configurable). Overlap ensures events that span chunk boundaries aren't missed.
Still-frame skipping — chunks with no meaningful visual change (parked car, nothing happening) are skipped automatically. This saves API calls and reduces cost.
Embedding — each chunk is uploaded to the Gemini Embedding API, which processes exactly 1 frame per second and returns a dense vector.
Storage — vectors are stored in a local ChromaDB database alongside metadata (source file, timestamp offset).
Search — your query is embedded as text into the same vector space, matched via cosine similarity, and the top result is trimmed from the original file.
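The steps above can be sketched end to end in a few lines. Everything here is a stand-in: `fake_embed` replaces the Gemini API call and a plain list replaces ChromaDB — the point is the shape of the index/search loop, not SentrySearch's actual implementation:

```python
import math

def fake_embed(payload: str) -> list[float]:
    """Stand-in for a Gemini embedding call: deterministic pseudo-vector."""
    return [((hash(payload) >> (8 * i)) % 256) / 255.0 for i in range(8)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / ((math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))) or 1.0)

# "Index": embed each chunk and store the vector alongside its metadata,
# mirroring what SentrySearch keeps in ChromaDB.
chunks = [
    {"file": "front_2024-01-15.mp4", "start": 0,  "end": 30},
    {"file": "front_2024-01-15.mp4", "start": 25, "end": 55},
]
index = [{"meta": c, "vec": fake_embed(f"{c['file']}@{c['start']}")} for c in chunks]

def search(query: str, top_k: int = 1):
    """Embed the query, rank chunks by cosine similarity, return top metadata."""
    qvec = fake_embed(query)
    ranked = sorted(index, key=lambda e: cosine(qvec, e["vec"]), reverse=True)
    return [e["meta"] for e in ranked[:top_k]]

print(search("red truck running a stop sign"))
```

The returned metadata (file plus start/end offsets) is all that's needed for the final step: handing ffmpeg a trim window into the original file.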
Gemini Embedding 2: The Breakthrough That Makes This Possible
Before Gemini Embedding 2, building something like SentrySearch would require:
- Running a vision model on each frame to generate captions
- Embedding those captions as text
- Hoping the captions captured what you actually care about
That's slow, lossy, and expensive.
Gemini Embedding 2 eliminates the middleman. It's Google's first model where video, text, images, audio, and PDFs all project into a single joint embedding space. A text query is directly comparable to a video clip at the vector level.
This is what makes sub-second semantic search over hours of footage practical.
Key specs:
- 768-dimensional vectors (search mode)
- Native video support — 1 frame/second processed regardless of source FPS
- Available via Gemini API and Vertex AI
- Works with LangChain, LlamaIndex, Haystack, ChromaDB, Qdrant, Weaviate
Getting Started
Prerequisites
- Python 3.10+
- ffmpeg (or let it use the bundled imageio-ffmpeg)
- Gemini API key (free tier available at aistudio.google.com)
Install
git clone https://github.com/ssrajadh/sentrysearch.git
cd sentrysearch
python -m venv venv && source venv/bin/activate
pip install -e .
Setup
sentrysearch init
# Prompts for your Gemini API key, writes to .env, validates with a test embedding
Index Your Footage
sentrysearch index /path/to/dashcam/footage
Output:
Indexing file 1/3: front_2024-01-15_14-30.mp4 [chunk 1/4]
Indexing file 1/3: front_2024-01-15_14-30.mp4 [chunk 2/4]
...
Indexed 12 new chunks from 3 files. Total: 12 chunks from 3 files.
Search
sentrysearch search "red truck running a stop sign"
Output:
#1 [0.87] front_2024-01-15_14-30.mp4 @ 02:15-02:45
#2 [0.74] left_2024-01-15_14-30.mp4 @ 02:10-02:40
#3 [0.61] front_2024-01-20_09-15.mp4 @ 00:30-01:00
Saved clip: ./match_front_2024-01-15_14-30_02m15s-02m45s.mp4
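SentrySearch does the trimming for you, but for the curious: a match window like `02:15-02:45` maps to an ffmpeg stream-copy trim roughly like the sketch below. The exact flags the tool uses may differ; this is the standard fast-seek pattern:

```python
def trim_command(src: str, start: str, duration: int, dst: str) -> list[str]:
    """Build an ffmpeg argv that copies a window out of src without re-encoding.
    Putting -ss before -i seeks to the nearest keyframe, which is fast but may
    start the clip slightly before the requested time."""
    return ["ffmpeg", "-ss", start, "-i", src, "-t", str(duration), "-c", "copy", dst]

cmd = trim_command(
    "front_2024-01-15_14-30.mp4", "00:02:15", 30,
    "match_front_2024-01-15_14-30_02m15s-02m45s.mp4",
)
print(" ".join(cmd))
```

`-c copy` is why the trim is near-instant: no re-encoding, just a container-level cut.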
Tesla Dashcam Overlay (Bonus Feature)
For Tesla owners, SentrySearch can burn speed, GPS location, and timestamp directly onto your trimmed clips:
sentrysearch search "car cutting me off" --overlay
This reads SEI metadata embedded in Tesla dashcam files and renders a HUD showing:
- Speed (MPH)
- Date and time
- City and road name (via OpenStreetMap reverse geocoding)
Requires Tesla firmware 2025.44.25+ and HW3+.
pip install -e ".[tesla]"
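The "city and road name" lookup uses OpenStreetMap reverse geocoding. As a sketch (not SentrySearch's actual code), a reverse-geocode request against OSM's public Nominatim endpoint looks like this — note that Nominatim's usage policy requires a descriptive User-Agent header on real requests:

```python
from urllib.parse import urlencode

NOMINATIM = "https://nominatim.openstreetmap.org/reverse"

def reverse_geocode_url(lat: float, lon: float) -> str:
    """Build a Nominatim reverse-geocoding URL for a GPS fix.
    The JSON response's display_name field contains road/city strings."""
    return f"{NOMINATIM}?{urlencode({'lat': lat, 'lon': lon, 'format': 'jsonv2'})}"

print(reverse_geocode_url(37.7749, -122.4194))
```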
Pricing: Is It Practical?
Indexing 1 hour of footage costs approximately $2.84 with Gemini's embedding API at default settings (30s chunks, 5s overlap).
Breakdown:
- 1 hour = 3,600 seconds → 3,600 frames at $0.00079/frame = ~$2.84
- Still-frame skipping can cut this significantly for parked/security footage
- Search queries cost almost nothing (text embeddings only)
Cost optimization levers:
| Option | Effect |
|--------|--------|
| --chunk-duration 60 | Fewer chunks = fewer API calls |
| --overlap 0 | No overlap = minimum chunks |
| Still-frame skipping (default ON) | Skips idle footage = direct savings |
| --no-preprocess | Raw chunks (no ffmpeg downscaling) |
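To see how these levers interact, here's a back-of-the-envelope estimator using the article's figures (1 frame sampled per second, $0.00079/frame — check current Gemini pricing before relying on it). One subtlety: overlap re-embeds the boundary seconds, so default settings land somewhat above the no-overlap 3,600-frame baseline quoted above:

```python
import math

PRICE_PER_FRAME = 0.00079  # article's figure; verify against current pricing

def index_cost(video_seconds: float, chunk: float = 30.0, overlap: float = 5.0) -> float:
    """Estimated embedding cost: chunks advance by (chunk - overlap) seconds,
    and each chunk is sampled at 1 frame per second."""
    stride = chunk - overlap
    n_chunks = max(1, math.ceil((video_seconds - overlap) / stride))
    return n_chunks * chunk * PRICE_PER_FRAME

print(round(index_cost(3600), 2))                       # defaults: 30s chunks, 5s overlap → 3.41
print(round(index_cost(3600, chunk=60, overlap=0), 2))  # fewer, longer chunks → 2.84
```

Doubling chunk duration and dropping overlap shaves roughly 20% off — before still-frame skipping does anything.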
Limitations & Honest Caveats
- Chunk boundary problem — events that span two chunks may not match perfectly. Overlapping windows help, but aren't perfect.
- Gemini Embedding 2 is in preview — API behavior and pricing may change.
- No local model option — currently requires Gemini API. The community is watching for open-source multimodal embedding models to reach this quality level.
- Driving footage only for Tesla overlay — SEI telemetry isn't present in parked/Sentry Mode clips.
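The boundary caveat is easy to see with a little arithmetic. With 30-second chunks and 5 seconds of overlap, any event no longer than the overlap is guaranteed to land wholly inside some chunk; longer events can still straddle two windows (illustrative numbers, not from the tool's source):

```python
def chunk_windows(total: float, chunk: float = 30.0, overlap: float = 5.0):
    """Yield (start, end) windows advancing by (chunk - overlap) seconds."""
    start = 0.0
    while start < total:
        yield (start, min(start + chunk, total))
        start += chunk - overlap

def fully_contained(event_start: float, event_end: float, windows) -> bool:
    """True if some single window covers the whole event."""
    return any(s <= event_start and event_end <= e for s, e in windows)

windows = list(chunk_windows(90))  # (0,30), (25,55), (50,80), (75,90)
# A 4-second event straddling the 30s boundary still fits the 25-55 window.
print(fully_contained(28, 32, windows))  # → True
# A 10-second event spanning 24-34 fits no single window.
print(fully_contained(24, 34, windows))  # → False
```

In the second case the event is still *findable* — both partial chunks embed some of it — but the similarity score, and the trimmed clip, will be weaker than for a cleanly contained event.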
The Bigger Picture: Multimodal RAG Is Here
SentrySearch is a clean, practical example of what becomes possible when embedding models go truly multimodal.
The same architecture can apply to:
- Security camera footage — search hours of CCTV with natural language
- Sports video — find specific plays or moments
- Meeting recordings — semantic search without transcription
- Medical imaging — cross-modal retrieval across reports and scans
We're entering an era where the traditional RAG pipeline (chunk text → embed → retrieve) expands to cover every modality. Gemini Embedding 2 is the first production model that makes this real with video.
SentrySearch is a sharp, well-executed proof of concept. And it ships today.
Resources
- GitHub: github.com/ssrajadh/sentrysearch
- Gemini Embedding 2 docs: ai.google.dev/gemini-api/docs/models/gemini-embedding-2-preview
- HN discussion: news.ycombinator.com/item?id=47427193
- Gemini API key: aistudio.google.com/apikey