DEV Community


SentrySearch: Semantic Search Over Dashcam Footage Using Gemini Embedding 2


Written by Arshdeep Singh


Scrubbing through hours of dashcam footage to find one specific moment is exactly as tedious as it sounds. You remember something happened — a car cut you off, someone ran a red light — but now you're stuck fast-forwarding through gigabytes of MP4 files like it's 2003.

SentrySearch solves this. It's an open-source Python CLI that lets you search raw video files in plain English. Type what you're looking for, get a trimmed clip back.

sentrysearch search "red truck running a stop sign"

That's it.


What Is SentrySearch?

SentrySearch is a command-line tool built by @ssrajadh that brings semantic search to any folder of MP4 files. It was originally built for Tesla Sentry Mode footage (hence the name), but it works with any dashcam or video library.

The core idea:

  • Index your footage once
  • Search it with natural language queries
  • Get back an auto-trimmed clip of the matching moment

No transcriptions. No frame captioning. No OCR. Just raw video → vectors → search.


How It Works: The Technical Core

The secret sauce is Google's Gemini Embedding 2 — the first natively multimodal embedding model that maps text, images, audio, and video into a single unified vector space.

Here's what that means in practice:

When you search for "car cutting me off at an intersection", Gemini converts that text into a 768-dimensional vector. It can also convert a 30-second video clip into a vector in that same space. So text and video become directly comparable — no intermediate step required.
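Conceptually, the match is just cosine similarity between two vectors in that shared space. A minimal sketch with NumPy and stand-in random embeddings (in the real tool, both vectors come back from the Gemini Embedding API):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-in 768-dim embeddings; SentrySearch gets these from Gemini
# (one for the text query, one per indexed video chunk).
rng = np.random.default_rng(0)
query_vec = rng.normal(size=768)
chunk_vec = query_vec + rng.normal(scale=0.5, size=768)  # a "related" chunk
other_vec = rng.normal(size=768)                         # an unrelated chunk

print(cosine_similarity(query_vec, chunk_vec))  # high score
print(cosine_similarity(query_vec, other_vec))  # near zero
```

Because the similarity is a single dot-product-based score, ranking thousands of chunks per query stays cheap.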

The Pipeline

Indexing:  MP4 files → ffmpeg chunking → Gemini video embeddings → ChromaDB (local vector store)
Querying:  Text query → Gemini text embedding → cosine similarity against stored vectors → top match → trimmed clip

Step by step:

  1. Chunking — ffmpeg splits each MP4 into overlapping 30-second chunks (configurable). Overlap ensures events that span chunk boundaries aren't missed.

  2. Still-frame skipping — chunks with no meaningful visual change (parked car, nothing happening) are skipped automatically. This saves API calls and reduces cost.

  3. Embedding — each chunk is uploaded to the Gemini Embedding API, which processes exactly 1 frame per second and returns a dense vector.

  4. Storage — vectors are stored in a local ChromaDB database alongside metadata (source file, timestamp offset).

  5. Search — your query is embedded as text into the same vector space, matched via cosine similarity, and the top result is trimmed from the original file.
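The overlapping windows in step 1 are easy to reason about. A minimal sketch of how the chunk start offsets might be computed — the function name and logic are illustrative, not SentrySearch's actual code, though the defaults mirror the article's numbers (30 s chunks, 5 s overlap):

```python
def chunk_starts(duration_s: float, chunk_s: float = 30.0, overlap_s: float = 5.0):
    """Start offsets for overlapping chunks covering a whole file.

    Consecutive chunks share `overlap_s` seconds, so an event that
    straddles one boundary still appears intact in at least one chunk.
    """
    step = chunk_s - overlap_s
    starts, t = [], 0.0
    while t < duration_s:
        starts.append(t)
        if t + chunk_s >= duration_s:
            break  # this chunk already reaches the end of the file
        t += step
    return starts

# A 70-second file with default settings → chunks starting at 0 s, 25 s, 50 s
print(chunk_starts(70))
```

With a 25-second stride, any event shorter than the 5-second overlap is guaranteed to fall fully inside at least one window.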


Gemini Embedding 2: The Breakthrough That Makes This Possible

Before Gemini Embedding 2, building something like SentrySearch would have required:

  • Running a vision model on each frame to generate captions
  • Embedding those captions as text
  • Hoping the captions captured what you actually care about

That's slow, lossy, and expensive.

Gemini Embedding 2 eliminates the middleman. It's Google's first model where video, text, images, audio, and PDFs all project into a single joint embedding space. A text query is directly comparable to a video clip at the vector level.

This is what makes sub-second semantic search over hours of footage practical.

Key specs:

  • 768-dimensional vectors (search mode)
  • Native video support — 1 frame/second processed regardless of source FPS
  • Available via Gemini API and Vertex AI
  • Works with LangChain, LlamaIndex, Haystack, ChromaDB, Qdrant, Weaviate

Getting Started

Prerequisites

  • Python 3.10+
  • ffmpeg (or let SentrySearch fall back to the bundled imageio-ffmpeg)
  • Gemini API key (free tier available at aistudio.google.com)

Install

git clone https://github.com/ssrajadh/sentrysearch.git
cd sentrysearch
python -m venv venv && source venv/bin/activate
pip install -e .

Setup

sentrysearch init
# Prompts for your Gemini API key, writes to .env, validates with a test embedding

Index Your Footage

sentrysearch index /path/to/dashcam/footage

Output:

Indexing file 1/3: front_2024-01-15_14-30.mp4 [chunk 1/4]
Indexing file 1/3: front_2024-01-15_14-30.mp4 [chunk 2/4]
...
Indexed 12 new chunks from 3 files. Total: 12 chunks from 3 files.

Search

sentrysearch search "red truck running a stop sign"

Output:

#1 [0.87] front_2024-01-15_14-30.mp4 @ 02:15-02:45
#2 [0.74] left_2024-01-15_14-30.mp4 @ 02:10-02:40
#3 [0.61] front_2024-01-20_09-15.mp4 @ 00:30-01:00

Saved clip: ./match_front_2024-01-15_14-30_02m15s-02m45s.mp4

Tesla Dashcam Overlay (Bonus Feature)

For Tesla owners, SentrySearch can burn speed, GPS location, and timestamp directly onto your trimmed clips:

sentrysearch search "car cutting me off" --overlay

This reads SEI metadata embedded in Tesla dashcam files and renders a HUD showing:

  • Speed (MPH)
  • Date and time
  • City and road name (via OpenStreetMap reverse geocoding)

Requires Tesla firmware 2025.44.25+ and HW3+.

pip install -e ".[tesla]"
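The city/road lookup maps onto OpenStreetMap's Nominatim reverse-geocoding endpoint. A sketch that only builds the request URL — illustrative, not SentrySearch's actual client code; a real client must also send a descriptive User-Agent and respect Nominatim's usage policy:

```python
from urllib.parse import urlencode

NOMINATIM = "https://nominatim.openstreetmap.org/reverse"

def reverse_geocode_url(lat: float, lon: float) -> str:
    """Build a Nominatim reverse-geocoding request for a GPS fix.

    SentrySearch resolves coordinates from Tesla SEI telemetry to a
    city/road name; this sketch just constructs the HTTP request URL.
    """
    return f"{NOMINATIM}?{urlencode({'lat': lat, 'lon': lon, 'format': 'jsonv2'})}"

print(reverse_geocode_url(37.7749, -122.4194))
```

The JSON response's `address` object contains fields like `road` and `city`, which is all a HUD overlay needs.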

Pricing: Is It Practical?

Indexing 1 hour of footage costs approximately $2.84 with Gemini's embedding API at default settings (30s chunks, 5s overlap).

Breakdown:

  • 1 hour = 3,600 seconds → 3,600 frames at $0.00079/frame = ~$2.84
  • Still-frame skipping can cut this significantly for parked/security footage
  • Search queries cost almost nothing (text embeddings only)
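The arithmetic above is easy to sanity-check with a tiny cost model. Constants are taken from the article's figures; verify them against current Gemini API pricing before budgeting:

```python
FRAME_RATE = 1             # Gemini samples 1 frame per second of video
PRICE_PER_FRAME = 0.00079  # USD per frame, per the article's estimate

def indexing_cost(footage_seconds: float, skipped_fraction: float = 0.0) -> float:
    """Approximate embedding cost for a stretch of footage.

    `skipped_fraction` models still-frame skipping, e.g. 0.7 if 70% of
    chunks are idle and never sent to the API.
    """
    frames = footage_seconds * FRAME_RATE * (1 - skipped_fraction)
    return frames * PRICE_PER_FRAME

print(round(indexing_cost(3600), 2))       # 1 hour of driving footage
print(round(indexing_cost(3600, 0.7), 2))  # same hour, 70% idle (parked)
```

For mostly-idle security footage, the skip fraction dominates the bill, which is why still-frame skipping defaults to on.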

Cost optimization levers:
| Option | Effect |
|--------|--------|
| --chunk-duration 60 | Fewer chunks = fewer API calls |
| --overlap 0 | No overlap = minimum chunks |
| Still-frame skipping (default ON) | Skips idle footage = direct savings |
| --no-preprocess | Raw chunks (no ffmpeg downscaling) |


Limitations & Honest Caveats

  • Chunk boundary problem — events that span two chunks may not match perfectly. Overlapping windows help, but aren't perfect.
  • Gemini Embedding 2 is in preview — API behavior and pricing may change.
  • No local model option — currently requires Gemini API. The community is watching for open-source multimodal embedding models to reach this quality level.
  • Driving footage only for Tesla overlay — SEI telemetry isn't present in parked/Sentry Mode clips.

The Bigger Picture: Multimodal RAG Is Here

SentrySearch is a clean, practical example of what becomes possible when embedding models go truly multimodal.

The same architecture can apply to:

  • Security camera footage — search hours of CCTV with natural language
  • Sports video — find specific plays or moments
  • Meeting recordings — semantic search without transcription
  • Medical imaging — cross-modal retrieval across reports and scans

We're entering an era where the traditional RAG pipeline (chunk text → embed → retrieve) expands to cover every modality. Gemini Embedding 2 is the first production model that makes this real with video.

SentrySearch is a sharp, well-executed proof of concept. And it ships today.


