I Built a Real-Time AI Surveillance System with Voice Alerts in One Night — Here's How
Using Stream Vision Agents SDK, Moondream, and Gemini Realtime to build something I genuinely didn't think was possible in a hackathon.
The Idea
It started with a simple question: what if a security camera could actually talk to you?
Not just send a notification. Not just log an event. But literally speak to you the moment it sees something dangerous — "ALERT! Unattended bag detected!" — in real time, like having an AI guard watching your feed 24/7.
That became SentinelAI: a real-time AI surveillance system that watches a live camera feed, detects threats using plain English descriptions (no training required), and announces them via voice automatically.
I built it in one night for the Stream Vision Agents Hackathon, and honestly... it works. Here's the full story.
What SentinelAI Does
Before diving into the how, here's what the finished system does:
📷 Watches a live camera feed via Stream WebRTC at 4 FPS
🎯 Detects threats in zero-shot — you describe what to look for in plain English: "knife or blade weapon", "person fallen on ground", "fire or flames", "gun or pistol", "unattended bag or backpack"
🔊 Speaks voice alerts the instant a threat is detected — Gemini Realtime literally says "ALERT! Unattended bag detected!" out loud into the call
🗣️ Two-way voice conversation — you can talk to SentinelAI and ask it questions
🖥️ Live dashboard with dual video tiles, bounding box overlays, confidence scores, and a real-time alert log
The Tech Stack
| Layer | Technology |
| --- | --- |
| Vision Agent Framework | Stream Vision Agents SDK |
| Zero-Shot Detection | Moondream Cloud API |
| Voice + LLM | Gemini 2.0 Realtime |
| Video Infrastructure | Stream WebRTC |
| Backend | FastAPI + Python |
| Frontend | React + Vite |
The most interesting choice here is Moondream for zero-shot detection. Traditional CV models like YOLO need thousands of labelled images to detect custom objects. Moondream lets you describe what you want to detect in plain English and just... works. No training. No dataset. That's the magic.
Architecture
```
Browser (React)
 │
 ├── Camera feed via Stream WebRTC
 │
 ▼
FastAPI Backend (Port 8000)
 │   Creates Stream sessions
 │   Manages agent lifecycle
 ▼
Vision Agent (Port 8001)
 │
 ├── Moondream CloudDetectionProcessor (4 FPS)
 │     └── Scans every frame for threats
 │     └── Queues detections when confidence > 0.65
 │
 └── Gemini 2.0 Realtime
       └── Receives alert queue
       └── Speaks alerts aloud via WebRTC
```
Three services, one surveillance system.
Building It: The Vision Agents Core
The heart of SentinelAI is a custom `ThreatAlertProcessor` — a subclass of Vision Agents' `moondream.CloudDetectionProcessor`. This is where the zero-shot magic happens:
```python
import asyncio
from typing import Dict

from vision_agents.plugins import moondream

THREATS = [
    "kitchen knife or blade weapon",
    "person fallen on ground",
    "fire or flames",
    "gun or pistol",
    "unattended bag or backpack",
]

_detection_queue: asyncio.Queue = asyncio.Queue()


class ThreatAlertProcessor(moondream.CloudDetectionProcessor):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._last_alerted: Dict[str, float] = {}
        self._alert_cooldown = 15.0  # seconds between same-threat alerts

    async def _process_and_add_frame(self, frame):
        await super()._process_and_add_frame(frame)
        detections = self._last_results.get("detections", [])
        now = asyncio.get_running_loop().time()
        for det in detections:
            label = det.get("label", "")
            conf = det.get("confidence", 0)
            if conf >= 0.65:  # high-confidence threshold
                last = self._last_alerted.get(label, 0)
                if now - last > self._alert_cooldown:
                    self._last_alerted[label] = now
                    await _detection_queue.put(label)
```
The `_alert_cooldown` is important — without it, once a threat is detected, Gemini would shout about it every 250 ms. The 15-second cooldown makes it feel like a real alert system.
The Voice Alert Magic
This was the trickiest part. Making Gemini actually speak the alerts required using `agent.llm.session.send()` — sending a direct instruction to the Gemini Realtime session:
```python
async def join_vision_call(agent: Agent, call_type: str, call_id: str, **kwargs) -> None:
    try:
        call = agent.edge.client.video.call(call_type, call_id)
        async with agent.join(call):
            logger.info(f"[SentinelAI] 👁 Moondream active. Watching: {THREATS}")
            while True:
                try:
                    # Wait for a threat detection (timeout after 30s for keep-alive)
                    label = await asyncio.wait_for(
                        _detection_queue.get(),
                        timeout=30.0,
                    )
                    # Trigger Gemini to speak the alert
                    await agent.llm.session.send(
                        input=f"Say out loud urgently: ALERT! {label} detected!",
                        end_of_turn=True,
                    )
                    logger.info(f"[SentinelAI] 🔊 Alert sent: {label}")
                except asyncio.TimeoutError:
                    # Keep-alive ping every 30s to prevent idle timeout
                    await agent.llm.session.send(
                        input="Continue monitoring silently. Do not speak.",
                        end_of_turn=True,
                    )
    except Exception as e:
        logger.warning(f"[SentinelAI] Session dropped ({e}), will retry on next call.")
```
The 30-second keep-alive was critical — without it, Gemini's Realtime session would silently die during quiet periods with no detections.
Creating the Agent
Here's how we wire everything together using Vision Agents SDK:
```python
from vision_agents import Agent, AgentOptions
from vision_agents.plugins import gemini, moondream
from vision_agents.plugins.getstream import StreamVideoPlugin


def create_agent():
    return Agent(
        agent_id="sentinel-agent",
        options=AgentOptions(
            llm=gemini.Realtime(
                model="gemini-2.0-flash-exp",
                fps=2,
                instructions="""You are SentinelAI, an elite AI security guardian.
                You monitor live camera feeds for threats. When you first join,
                introduce yourself. When threats are detected, respond with urgency.
                Always be professional and calm.""",
            ),
            plugins=[
                StreamVideoPlugin(
                    join_call=join_vision_call,
                    processors=[
                        ThreatAlertProcessor(
                            detect_objects=THREATS,
                            conf_threshold=0.65,
                            fps=4,
                        )
                    ],
                )
            ],
        ),
    )
```
Vision Agents SDK handles all the WebRTC complexity, audio/video track management, and plugin lifecycle — we just define what we want.
The Bugs That Almost Broke Me
No honest hackathon post is complete without the war stories.
Bug 1: Windows + asyncio + aiodns = disaster
Running on Windows, the Vision Agents serve mode crashed immediately with:
```
NotImplementedError: aiodns needs a SelectorEventLoop on Windows
```
The fix — set the event loop policy before anything else runs:
```python
import asyncio
import sys

if sys.platform == "win32":
    asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())
```
One line. One hour of debugging to find it.
Bug 2: Stream API signature verification
The FastAPI backend was returning 401 errors on every session creation. Turns out Stream's SDK requires server-side token generation with proper HMAC signatures — the frontend can't just pass an API key directly.
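For context on what "server-side token generation" means here: Stream user tokens are JWTs signed with your API secret using HMAC-SHA256, which is exactly why the secret can never live in frontend code. Here's a stdlib-only sketch of that signing step (illustrative — in a real app you'd use Stream's server SDK, which does this for you):

```python
import base64
import hashlib
import hmac
import json
import time


def b64url(data: bytes) -> str:
    """Base64url without padding, as JWTs require."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def create_user_token(api_secret: str, user_id: str) -> str:
    """Sign a minimal Stream-style user token (HS256) on the server."""
    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = b64url(json.dumps({"user_id": user_id, "iat": int(time.time())}).encode())
    signing_input = f"{header}.{payload}".encode()
    signature = b64url(hmac.new(api_secret.encode(), signing_input, hashlib.sha256).digest())
    return f"{header}.{payload}.{signature}"


token = create_user_token("my-api-secret", "dashboard-user")
print(token.count("."))  # 2 — the classic header.payload.signature shape
```

The frontend only ever sees the finished token; the FastAPI backend holds the secret and mints one per session.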
Bug 3: Gemini session drops
`'GeminiRealtime' object has no attribute 'session'` — this crashed everything when we tried to send the intro message too early. Gemini's Realtime session isn't available until after the WebRTC connection fully establishes.
The fix: move the intro to the agent's instructions instead of sending it manually. Gemini introduces itself naturally when it's ready.
Bug 4: My phone kept triggering "knife" alerts
Zero-shot detection is powerful but imprecise. A phone's rectangular shape was matching "knife" at low confidence. Two fixes: raise the confidence threshold from 0.35 to 0.65, and use more descriptive threat names ("kitchen knife or blade weapon" instead of just "knife").
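The effect of the threshold bump is easy to see in a toy filter (the detection dicts below are illustrative, not real Moondream output):

```python
def filter_detections(detections, threshold=0.65):
    """Drop detections below the confidence threshold."""
    return [d for d in detections if d.get("confidence", 0.0) >= threshold]


detections = [
    {"label": "kitchen knife or blade weapon", "confidence": 0.42},  # phone false positive
    {"label": "unattended bag or backpack", "confidence": 0.81},     # real threat
]

# At the old 0.35 threshold both pass; at 0.65 only the real threat survives.
print(len(filter_detections(detections, threshold=0.35)))  # 2
print(len(filter_detections(detections, threshold=0.65)))  # 1
```

The descriptive labels attack the problem from the other side: "kitchen knife or blade weapon" gives Moondream more semantic context to match against than the single ambiguous word "knife".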
The Moment It Worked
At around 2 AM, after fixing the session keep-alive bug, I held an empty bag up to my camera.
Moondream detected it. Queued the alert. Gemini received it.
And then — through my speakers — I heard:
"ALERT! Unattended bag detected!"
I genuinely jumped. It works. It actually works.
The React Dashboard
The frontend is a custom dark surveillance UI with:
Dual video tiles — your camera feed + the SentinelAI agent feed side by side
Live bounding boxes — green overlays with confidence scores on detected threats
Dynamic watchlist editor — add/remove threat categories in real-time
Alert log — timestamped color-coded log of every detection
Built with React + Vite, connecting to Stream's JavaScript SDK for WebRTC video.
What I Learned
- Zero-shot detection is a game changer. Not needing to train a model, collect data, or label images means you can prototype surveillance for any use case in minutes. Just describe what you want to detect.
- Vision Agents SDK abstracts the hard stuff beautifully. WebRTC negotiation, ICE candidates, audio/video track management, plugin lifecycle — all handled. I focused on the AI logic, not the infrastructure.
- Gemini Realtime + WebRTC is genuinely impressive. Having a two-way voice conversation with an AI that's watching your camera feed feels like science fiction. It's not — it's a few hundred lines of Python.
- Keep-alives matter. Any long-running AI session needs a heartbeat. Without the 30-second ping, Gemini silently times out during quiet monitoring periods.
What's Next
YOLO local inference on my GTX 1650 for 30+ FPS detection (vs 4 FPS cloud)
Incident memory — SentinelAI remembers and summarizes the last hour of detections
Mobile support via Stream's React Native SDK
Multi-camera — monitor multiple feeds simultaneously
Try It Yourself
🐙 GitHub: github.com/kapilshastriwork-maker/sentinelai
🎬 Demo Video: Watch SentinelAI detect threats and speak voice alerts live on YouTube - https://youtu.be/RX4EW9DvaFQ
To run it locally:
```bash
# Terminal 1 — FastAPI backend
uv run uvicorn main:app --host 0.0.0.0 --port 8000 --reload

# Terminal 2 — Vision Agent
uv run main.py serve --port 8001

# Terminal 3 — React frontend
cd frontend && npm run dev
```
Add your .env with STREAM_API_KEY, STREAM_API_SECRET, MOONDREAM_API_KEY, and GEMINI_API_KEY — and you're watching.
Resources
Stream Vision Agents SDK
Moondream Cloud API
Gemini 2.0 Realtime
Stream WebRTC
Built for the Stream Vision Agents Hackathon 2026. If you're building something cool with Vision Agents, I'd love to see it — drop a comment below!
Tags: python ai machinelearning webdev hackathon