Simba Zhang
I built an AI agent that watches security cameras and talks to you about what it sees

I have a mix of Ring, Blink, and IP cameras at home. They all do the same thing: detect motion, send a notification, and make me open three different apps to see what happened. 47 alerts a day — wind, shadows, cats — and no way to just ask "did anyone come to the door today?"

So I built something different.

Aegis AI — a security agent, not a camera app

Aegis is a desktop app that connects to all your cameras and puts an AI agent in front of them. The agent watches, understands, remembers, and talks to you.

Not "motion detected." Not "person." Instead:

"A person in a blue hoodie walked up to the front door at 2:15 PM, stood there for about 30 seconds, then left a package and walked away."

And you can ask it questions:

You: "What happened at the front door today?"

Aegis: "It's been quiet. A package was delivered at 1:15 PM. Your daughter got home at 3:30 PM. No unfamiliar visitors."

How it works under the hood

The architecture is three layers:

1. Camera layer — Connects to Ring, Blink, any RTSP/ONVIF IP camera, your laptop webcam, even an old iPhone. Everything unified in one timeline. Live view uses go2rtc (WebRTC relay) for ~300ms latency.
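
For the IP-camera side, go2rtc takes a plain YAML config mapping stream names to source URLs and then relays them over WebRTC. A minimal sketch (the stream names, credentials, and addresses below are placeholders, not from the Aegis codebase):

```yaml
# go2rtc.yaml — example stream definitions
streams:
  front_door:
    - rtsp://user:pass@192.168.1.20:554/stream1   # generic RTSP/ONVIF camera
  backyard:
    - rtsp://user:pass@192.168.1.21:554/stream1
```

With this, go2rtc exposes each named stream for low-latency WebRTC playback, which is what keeps the live view around 300ms instead of the multi-second delay you get from a naive FFmpeg pipeline.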

2. Vision layer — Instead of YOLO object detection, Aegis uses Vision Language Models for scene analysis. You choose:

  • Local: llama-server with GGUF models from HuggingFace — SmolVLM2, Qwen-VL, LFM2.5, MiniCPM-V, LLaVA. Browse and download models right inside the app. Runs on Apple Silicon with Metal acceleration — an 8GB M1 Mac mini handles LFM2.5 Q4 at about 3-5 seconds per inference.
  • Cloud: GPT Vision or Google APIs with your own key.
  • Or both — local for routine analysis, cloud for complex scenes.
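
llama-server exposes an OpenAI-compatible chat endpoint, and multimodal builds accept images inline as base64 data URIs. A minimal sketch of what a local inference call can look like — the prompt, port, and function names here are illustrative, not Aegis's actual code:

```python
import base64
import json
import urllib.request

VLM_SERVER = "http://localhost:8080"  # llama-server's default port (assumption)

def build_vlm_request(jpeg_bytes: bytes, prompt: str) -> dict:
    # llama-server speaks the OpenAI chat-completions schema; images are
    # passed inline as base64 data URIs alongside the text prompt.
    data_uri = "data:image/jpeg;base64," + base64.b64encode(jpeg_bytes).decode()
    return {
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": data_uri}},
            ],
        }],
        "max_tokens": 120,
    }

def describe_frame(jpeg_bytes: bytes) -> str:
    payload = build_vlm_request(
        jpeg_bytes, "Describe any people, vehicles, or packages in this frame."
    )
    req = urllib.request.Request(
        VLM_SERVER + "/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Because the endpoint shape is the same as the cloud providers', switching between local and cloud inference is mostly a matter of swapping the base URL and API key.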

The pipeline doesn't send every frame to the VLM. Motion detection (TF.js in the Electron renderer) triggers recording, key frames get extracted and composited, then only the meaningful frames hit inference.
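
The gating idea is simple: compare each frame against the last frame you kept, and only promote frames that changed enough. A toy Python sketch of that selection step (the real pipeline uses TF.js in the renderer; thresholds and names here are made up for illustration):

```python
def frame_delta(prev, curr):
    # Mean absolute pixel difference between two equal-length
    # grayscale frames, flattened to lists of 0-255 ints.
    return sum(abs(a - b) for a, b in zip(prev, curr)) / len(curr)

def select_key_frames(frames, threshold=12.0):
    # Keep the first frame, then only frames that differ enough from
    # the last kept one — everything else never reaches the VLM.
    if not frames:
        return []
    kept = [frames[0]]
    for frame in frames[1:]:
        if frame_delta(kept[-1], frame) > threshold:
            kept.append(frame)
    return kept
```

Ten minutes of a swaying branch collapses to a handful of frames, which is what makes second-scale local VLM inference viable.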

3. Agent layer — This is what makes it different from just "a camera app with a better AI model." The agent has:

  • Memory — It learns your household. Who's family, who visits regularly, what's normal at different times of day. Day one you get 30 alerts. Day seven you get 3 — the ones that matter.
  • A configurable Soul — You set its personality: how it talks, what it cares about, how cautious it should be. It's your agent, your preferences.
  • 16 toggleable skills — Video search, forensic analysis, clip delivery, smart alerts, voice output, generative video recaps. Enable what you need.
  • Interactive messaging — Alerts go to Slack, Discord, or Telegram with action buttons inline. "Analyze this clip," "Show me who was there," "Send the video." One tap.
  • Conversational search — Ask "was anyone in the backyard this afternoon?" and the agent searches its memory, triages the results, and gives you a narrative answer — not a list of timestamps.
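
The "30 alerts on day one, 3 on day seven" behavior comes down to the agent treating recurring sightings as normal. A toy model of that learning loop — this is an illustrative sketch, not the actual Aegis implementation, and the threshold is arbitrary:

```python
from collections import defaultdict

class HouseholdMemory:
    """Sightings that recur at the same camera and hour of day
    become 'normal' and stop generating alerts."""

    def __init__(self, familiar_after=3):
        self.counts = defaultdict(int)
        self.familiar_after = familiar_after

    def observe(self, camera: str, label: str, hour: int) -> bool:
        """Record a sighting; return True if it should still alert."""
        key = (camera, label, hour)
        self.counts[key] += 1
        return self.counts[key] <= self.familiar_after
```

The real system keys on much richer context (identities, time-of-day patterns, per-camera baselines), but the shape is the same: familiarity suppresses noise, novelty surfaces it.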

The tech stack

  • Electron — Desktop app shell, GPU-accelerated video decoding, TF.js for motion preprocessing
  • Python backend — VLM orchestration, motion compositing, decision service
  • llama-server — Local VLM inference with Metal/CUDA acceleration
  • go2rtc — WebRTC relay for low-latency live camera streams
  • SQLite — Local storage for clips, analysis results, and vector search (sqlite-vec)
  • Node.js gateway — Communication bridge for Telegram, Discord, Slack with interactive buttons
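
The vector-search piece is what powers conversational queries: each clip's VLM description gets embedded, and a question like "was anyone in the backyard?" becomes a nearest-neighbor lookup. A dependency-free sketch of the idea using plain sqlite3 — sqlite-vec does this ranking natively inside SQLite with an indexed virtual table, and the schema and embeddings below are invented for illustration:

```python
import json
import math
import sqlite3

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def search_clips(db, query_vec, k=3):
    # Brute-force cosine ranking in Python; sqlite-vec replaces this
    # loop with an indexed MATCH query inside the database.
    rows = db.execute("SELECT description, embedding FROM clips").fetchall()
    scored = sorted(
        ((cosine(query_vec, json.loads(emb)), desc) for desc, emb in rows),
        reverse=True,
    )
    return [desc for _, desc in scored[:k]]

# Tiny in-memory demo with hand-made 3-dim "embeddings".
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE clips (description TEXT, embedding TEXT)")
db.executemany(
    "INSERT INTO clips VALUES (?, ?)",
    [
        ("package delivery at front door", json.dumps([0.9, 0.1, 0.0])),
        ("cat crossing the backyard", json.dumps([0.1, 0.9, 0.2])),
    ],
)
```

The agent then feeds the top matches back into the VLM/LLM to triage them into a narrative answer rather than dumping timestamps on you.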

What I learned building this

VLMs are practical for real-world use now. A 1.6B parameter vision model on 8GB of RAM gives usable scene descriptions. Not perfect, but good enough to distinguish "UPS driver with a package" from "wind blowing a branch."

The agent layer matters more than the model. A better VLM gives you better descriptions. But memory, context, and decision-making are what turn "AI analysis" into something actually useful. The deduplication and learning system is what stops you from getting spammed.

Electron gets a bad rap but it solves real problems here. GPU video decoding, TF.js in the renderer for motion preprocessing, WebRTC for live camera streams — you need a browser engine for this. A CLI with FFmpeg gives you 5+ seconds of latency. Electron with go2rtc gives you 300ms.

Try it

Runs on Mac, Windows, and Linux. Everything stored locally on your machine.

Happy to answer questions about the architecture, the VLM pipeline, or anything else!
