
Kunal Jaiswal

Building a Real-Time Security Camera System with Local Vision LLMs

I replaced my Lorex NVR's motion detection — which alerted me 40 times a day about swaying trees and shadows — with a pipeline that uses a vision language model to understand what it's actually seeing. It runs entirely on local hardware, costs nothing after setup, and sends me a WhatsApp message only when something real happens.

Architecture

3× Lorex 4K cameras (RTSP)
    ↓
gate_monitor.py (Mac Studio, M2 Ultra)
    ├── OpenCV: frame capture every 5s per camera
    ├── OpenCV: contour-based motion detection (frame N vs N-1)
    ├── Crop: extract largest changed region
    ├── VLM: qwen2.5vl:7b on DGX Spark (Blackwell, 10GbE link)
    │   └── "Classify this crop: ALERT or CLEAR?"
    ├── Alert: annotate frame with contour boxes
    │   ├── WiiM speaker announcement (TTS)
    │   └── WhatsApp message with image
    └── Audio: faster-whisper transcription (gate camera only)
        └── Gated by visual confirmation (120s window)

Three cameras — front gate, backyard, driveway — each running in its own thread. The system analyzes about 50,000 frames per day and has been running 24/7 for weeks.
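The per-camera threading model could be sketched like this. The `camera_loop` and `analyze_frame` names are illustrative, not the actual code; each thread waits on a shared stop event so the 5-second sleep is interruptible for clean shutdown.

```python
import threading
import time

CAPTURE_INTERVAL = 5  # seconds between analyzed frames, per camera

def camera_loop(name, stop_event, analyze_frame, interval=CAPTURE_INTERVAL):
    """Run one camera's capture/analyze cycle until stop_event is set."""
    while not stop_event.is_set():
        analyze_frame(name)          # capture + motion check + optional VLM call
        stop_event.wait(interval)    # interruptible sleep

def start_cameras(names, analyze_frame):
    """One daemon thread per camera, all sharing a single stop event."""
    stop = threading.Event()
    threads = [
        threading.Thread(target=camera_loop, args=(n, stop, analyze_frame),
                         daemon=True)
        for n in names
    ]
    for t in threads:
        t.start()
    return stop, threads
```

Using `Event.wait` instead of `time.sleep` means setting the event both stops the loop and wakes any sleeping thread immediately.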

Why Not Just Use YOLO?

Traditional object detection (YOLO, SSD) tells you what is in a frame. A vision language model tells you what's happening.

My gate camera watches a residential street. YOLO would detect "person" for the mail carrier, the neighbor walking their dog, someone cutting through to the next street, and an actual trespasser — all equally. A VLM can distinguish:

  • "A delivery driver placing a package at the door" → alert
  • "A person walking on the public sidewalk beyond the gate" → not relevant
  • "The shadow of a tree branch moving across the driveway" → clear

The key insight: I don't need the VLM to be fast (it runs at ~15 tok/s). I need it to be smart. By using OpenCV contour detection as a fast pre-filter, the VLM only sees cropped regions where something actually changed — typically 2–5 calls per camera per minute instead of 12.

The Contour Detection Layer

Before any AI touches a frame, OpenCV does the heavy lifting:

  1. Capture frame, convert to grayscale, resize to 640px width
  2. Compute absolute difference against previous analyzed frame
  3. Apply binary threshold (25) and dilation (3 iterations) to merge nearby changes
  4. Find contours, filter by area (min 150px², max 40% of frame)
  5. Merge nearby bounding boxes (within 50px)

If no contours survive filtering: CLEAR — zero VLM calls. This happens 70%+ of the time (still frames, minor lighting shifts).

If contours are found: crop the largest region, send to VLM for classification.

Every 60 seconds, a fallback full-frame check catches anything that appeared between frames but hasn't moved (a parked car that wasn't there before, a person standing still).

Exclusion Zones

Not all motion is interesting. I built a polygon zone editor (web UI at /zones) that lets me draw exclusion and inclusion zones on camera frames — similar to professional NVR software.

Current exclusion zones:

  • Gate camera: the road beyond the gate (top portion of frame) — cars passing on the street aren't security events
  • Driveway: a steam pipe and stone wall fixture that cause constant false triggers
  • Backyard: a kamado BBQ grill and tree branches that sway in wind

The zones are stored as JSON polygons. At runtime, cv2.fillPoly builds a binary mask, which is applied to the thresholded diff before contour detection. Masked pixels are zeroed — contours in excluded areas never form.

False Positive War Stories

Vision LLMs hallucinate. In security camera analysis, this means phantom alerts. Here are the patterns I found and fixed:

The negation problem. The VLM would say "No people, vehicles, or animals are visible in the frame" and my classifier would see "people, vehicles, animals" and trigger an alert. Fix: expanded the negation lookback from 25 to 60 characters and added sentence-level negation detection ("if sentence starts with no/not/without AND ends with visible/present/found → CLEAR").
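The sentence-level check could be sketched as below. One interpretation note: the author's own example ends with "in the frame," so this sketch treats visible/present/found as needing to appear anywhere in a negated sentence rather than strictly at its end; the keyword lists are illustrative.

```python
import re

NEG_STARTS = ("no ", "not ", "without ")
NEG_WORDS = ("visible", "present", "found")

def is_negated(sentence):
    """A sentence that opens with a negation and mentions visible/present/found."""
    s = sentence.strip().lower()
    return s.startswith(NEG_STARTS) and any(w in s for w in NEG_WORDS)

def classify(response):
    """ALERT only if some sentence mentions a subject without negating it."""
    for s in re.split(r"(?<=[.!?])\s+", response):
        if re.search(r"\b(person|people|vehicle|animal)s?\b", s, re.I):
            if not is_negated(s):
                return "ALERT"
    return "CLEAR"
```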

The hedge problem. The VLM would output both "ALERT" and "CLEAR" in the same response when it was uncertain. Fix: if both keywords appear on the same line, CLEAR wins. It's better to miss an event than to false-alert at 3 AM.

The location confusion. "A vehicle on the road beyond the gate" was triggering alerts for the gate camera. But the road isn't my property. Fix: added location-based negation — "beyond the gate", "past the gate", "on the road", "on the street" → CLEAR.

The shadow/reflection problem. "A shadow of a person" would alert. Fix: added "shadow of" as a negation pattern.
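The hedge, location, and shadow fixes above compose naturally into one post-processor. This is a simplified sketch (the pattern lists are illustrative, and the hedge check here scans the whole response rather than line by line):

```python
LOCATION_NEGATIONS = ("beyond the gate", "past the gate",
                      "on the road", "on the street")
SHADOW_NEGATIONS = ("shadow of", "reflection of")

def resolve(response):
    """Collapse a raw VLM response into ALERT or CLEAR, biased toward CLEAR."""
    text = response.lower()
    # Hedge rule: if the model emits both keywords, CLEAR wins.
    if "alert" in text and "clear" in text:
        return "CLEAR"
    # Location and shadow negations override an ALERT keyword.
    if any(p in text for p in LOCATION_NEGATIONS + SHADOW_NEGATIONS):
        return "CLEAR"
    return "ALERT" if "alert" in text else "CLEAR"
```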

The phantom description. This was the most insidious. When the VLM received a nearly-black night frame, it would occasionally hallucinate vivid descriptions of people or vehicles. Fix: contour detection at night produces zero contours (no pixel changes in darkness), so the VLM is never called — the contour pre-filter eliminates this class of error entirely.

Audio Intelligence

The gate camera has a microphone. faster-whisper (medium.en model) transcribes 15-second audio chunks, but audio alerts are gated by visual confirmation — a speech transcription only fires if there was a visually-confirmed alert within the last 120 seconds. This prevents phantom audio alerts from wind, distant traffic, or radio.

Urgent keywords (help, emergency, fire) bypass the gate.

The transcription pipeline: PCM audio from RTSP → WAV → faster-whisper → filter noise phrases ("thank you for watching", street chatter) → if visually gated → WiiM speaker announcement + WhatsApp message with OGG audio clip.
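The 120-second gate with the urgent-keyword bypass could be sketched as a small stateful check (class and method names are mine):

```python
import time

VISUAL_WINDOW = 120  # seconds a visual alert keeps the audio gate open
URGENT_KEYWORDS = ("help", "emergency", "fire")

class AudioGate:
    def __init__(self):
        self.last_visual_alert = 0.0

    def record_visual_alert(self, now=None):
        """Called whenever a visually-confirmed alert fires."""
        self.last_visual_alert = time.time() if now is None else now

    def should_alert(self, transcript, now=None):
        """Urgent keywords always pass; otherwise require a recent visual alert."""
        now = time.time() if now is None else now
        text = transcript.lower()
        if any(k in text for k in URGENT_KEYWORDS):
            return True  # bypass the gate
        return (now - self.last_visual_alert) <= VISUAL_WINDOW
```

The `now` parameter is only there to make the window testable; in production the wall clock is used.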

The Alert Review Tool

50,000 frame analyses per day generate a lot of classification data. I built a daily review tool (/alerts/review) that:

  1. Parses the last 24 hours of CONFIRMED ALERT lines from the log
  2. Groups by camera + normalized description
  3. Sends all patterns to qwen3.5:35b for meta-classification: REAL / FALSE_POSITIVE / NOISE
  4. Presents a web UI with tabs (Needs Review / AI Flagged / Suppressed / Acknowledged)
  5. One-click suppress permanently filters a pattern from future alerts
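Steps 1–2 could look like the sketch below. The log-line format and the normalization rules (lowercase, digits collapsed to `N`, whitespace squashed) are assumptions for illustration:

```python
import re
from collections import Counter

LINE_RE = re.compile(r"CONFIRMED ALERT \[(?P<camera>\w+)\]\s+(?P<desc>.+)")

def normalize(desc):
    """Collapse case, digits, and whitespace so similar alerts group together."""
    desc = re.sub(r"\d+", "N", desc.lower())
    return re.sub(r"\s+", " ", desc).strip()

def group_alerts(log_lines):
    """Count alerts per (camera, normalized description) pattern."""
    counts = Counter()
    for line in log_lines:
        m = LINE_RE.search(line)
        if m:
            counts[(m.group("camera"), normalize(m.group("desc")))] += 1
    return counts
```

The grouped patterns, not the raw lines, are what gets sent to the larger model for meta-classification.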

The LLM classification is given context: Calgary snowy conditions, known permanent features (kamado BBQ, stone wall, gate post lights), typical neighborhood activity. It correctly flags 80%+ of false positives for one-click suppression.

Performance

| Metric | Value |
| --- | --- |
| Cameras | 3 (4K, RTSP) |
| Frame interval | 5 seconds per camera |
| Frames analyzed/day | ~50,000 |
| VLM model | qwen2.5vl:7b-4k (14.5 GB on DGX Spark) |
| Inference latency | ~200 ms per crop (10GbE link) |
| False positive rate | <5% after zone exclusions + negation fixes |
| Total system cost | $0/month (all local hardware) |

What I'd Do Differently

  • Start with contour detection, not VLM. I initially sent every frame to the VLM. The 70.8 GB memory leak I found in Ollama (separate blog post) was partly caused by this constant load. Contour pre-filtering reduced VLM calls by 70%+ and made the whole system viable.

  • Use a smaller VLM for classification, larger for description. A 3B model could handle binary ALERT/CLEAR classification. Reserve the 7B model for generating the detailed description that goes into the WhatsApp alert.

  • Night mode needs a different approach. IR cameras produce grayscale footage that confuses vision LLMs trained on color images. Thermal cameras or dedicated night-vision models would work better.
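The two-tier idea above could be as simple as routing to different model tags in Ollama's /api/generate endpoint. This sketch only builds the request payloads; the small-model tag is hypothetical, and nothing here touches the network.

```python
import base64
import json

CLASSIFY_MODEL = "qwen2.5vl:3b"   # hypothetical small model for ALERT/CLEAR
DESCRIBE_MODEL = "qwen2.5vl:7b"   # larger model for alert descriptions

def build_payload(task, image_bytes):
    """Build an Ollama /api/generate request body for the given task."""
    model = CLASSIFY_MODEL if task == "classify" else DESCRIBE_MODEL
    prompt = ("Classify this crop: ALERT or CLEAR?" if task == "classify"
              else "Describe what is happening in this image in one sentence.")
    return json.dumps({
        "model": model,
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode()],  # Ollama expects base64
        "stream": False,
    })
```

Since classification runs on every motion event and description only on confirmed alerts, the small model would carry almost all of the load.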

The Stack

All of this runs on:

  • Mac Studio M2 Ultra (128 GB) — camera capture, OpenCV, audio processing, web UI
  • NVIDIA DGX Spark (120 GB) — VLM inference via Ollama
  • 10GbE direct link between the two machines
  • Raspberry Pi — WhatsApp gateway
  • WiiM speaker — voice announcements
  • Python 3.9, stdlib only — no pip dependencies in any production script

Zero cloud APIs. Zero subscriptions. Full privacy.


The full pipeline code and zone editor are on my GitHub. If you're running local vision models for home automation, I'd like to hear what models and pre-filters work for you.
