Why My Smart Security Camera Was Actually Pretty Dumb (Until I Gave It Memory)

#ai #machinelearning #security #showdev

How I Built a Store Surveillance System That Remembers Every Face It's Ever Flagged
Most retail security setups work like this: you hire someone to stare at a wall of cameras and hope they catch something. The cameras never forget a frame, but the person watching them absolutely does. The entire system depends on a human staying alert for eight hours, scanning sixteen feeds simultaneously, and somehow not missing the one moment that matters.
I built SentinelAI to flip that model. The system watches every camera feed continuously, computes a real-time threat score for every person it detects, fires alerts the instant something looks off — and critically, it remembers. When someone who was flagged last Tuesday walks back into the store on Friday, the system already knows their face.

What the System Actually Does
At its core, SentinelAI is a FastAPI backend running YOLOv8n inference on a live video feed, a React frontend displaying real-time detection data, and a persistent memory layer powered by Hindsight that ties known faces to historical incidents.
The backend streams annotated video frames via MJPEG to the dashboard. Every ~100ms, a detection loop runs YOLO on the current frame and extracts bounding boxes for every person detected. From there, a suspicion scoring function produces a float in [0, 1] using three weighted components:
python
def _compute_suspicious_score(
    self, people_count: int, confidences: List[float], frame_area: int, boxes: list
) -> float:
    # component 1: crowd factor (saturates at 5+ people)
    crowd = min(people_count / 5.0, 1.0)

    # component 2: avg YOLO detection confidence
    avg_conf = sum(confidences) / len(confidences) if confidences else 0.0

    # component 3: spatial density — total bbox area vs frame area
    total_box_area = sum((x2 - x1) * (y2 - y1) for x1, y1, x2, y2 in boxes)
    density = min(total_box_area / max(frame_area, 1), 1.0)

    score = 0.4 * crowd + 0.3 * avg_conf + 0.3 * density
    return round(min(score, 1.0), 3)

When the score exceeds 0.6, an alert is logged to alerts.json and surfaced on the dashboard. Scores of 0.8 or higher escalate to HIGH severity; the rest are MEDIUM. The frontend classifies scores into four threat bands (NOMINAL, ELEVATED, GUARDED, CRITICAL), each rendered with its own color, animated arc gauge, and severity indicator on the UI.
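To make the weighting concrete, here is a standalone version of the scoring function run on made-up inputs. The frame size, boxes, and confidences are illustrative values, and `classify_severity` is a hypothetical helper that encodes the 0.6 alert threshold and the 0.8 HIGH cutoff; treat it as a sketch rather than the project's exact code.

```python
from typing import List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels


def compute_suspicious_score(
    people_count: int, confidences: List[float], frame_area: int, boxes: List[Box]
) -> float:
    # Same three components and weights as the method shown in the article.
    crowd = min(people_count / 5.0, 1.0)
    avg_conf = sum(confidences) / len(confidences) if confidences else 0.0
    total_box_area = sum((x2 - x1) * (y2 - y1) for x1, y1, x2, y2 in boxes)
    density = min(total_box_area / max(frame_area, 1), 1.0)
    score = 0.4 * crowd + 0.3 * avg_conf + 0.3 * density
    return round(min(score, 1.0), 3)


def classify_severity(score: float) -> Optional[str]:
    # Hypothetical helper: no alert at or below 0.6, HIGH from 0.8 upward.
    if score <= 0.6:
        return None
    return "HIGH" if score >= 0.8 else "MEDIUM"


# Illustrative inputs: three people in a 640x480 frame, each box 160x320 px.
boxes = [(0, 0, 160, 320), (200, 100, 360, 420), (400, 50, 560, 370)]
confidences = [0.9, 0.8, 0.7]

example_score = compute_suspicious_score(3, confidences, 640 * 480, boxes)
example_severity = classify_severity(example_score)
```

With three people, an average confidence of 0.8, and boxes covering half the frame, the components are 0.6, 0.8, and 0.5. That gives 0.4 × 0.6 + 0.3 × 0.8 + 0.3 × 0.5 = 0.63, just over the alert threshold and well short of HIGH.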
That part was straightforward. The harder problem was memory.

The Problem With Stateless Detection
YOLOv8 is stateless. Each frame is evaluated in isolation. If the same person gets flagged at 2:00pm, leaves, and re-enters at 4:30pm, the system has no concept that these are the same individual. A new detection, a new bounding box, a new score. Context lost.
This is fine if you're just counting people. It's a problem if you're trying to build an intelligent security layer that actually learns from what it's seen.
Real retail security isn't just about catching shoplifters in the act. It's about pattern recognition over time — someone who loiters near the same shelf on three separate visits, someone who's been flagged for suspicious behavior at another branch, someone who consistently triggers the score threshold but never quite gets caught. None of that is possible when your system treats every detection as its first.
This is exactly what Hindsight's agent memory is designed for. Rather than storing raw alerts in a flat JSON file and hoping someone remembers to cross-reference them, Hindsight gives the system a way to persist, recall, and reflect on who it's seen and what happened.

Integrating Persistent Face Memory
The memory layer works as a lookup that runs alongside the YOLO detection pipeline. When a face is detected and a suspicion score is computed, the system queries Hindsight with a face embedding vector. If there's a prior record for that individual (previous flags, timestamps, severity history), that context is pulled into the current detection window.
python
# After detection, before scoring decision:
face_embedding = extract_face_embedding(frame, box)
memory_context = hindsight_client.recall(face_embedding, top_k=3)

if memory_context:
    prior_flags = memory_context[0]["metadata"]["flag_count"]
    last_severity = memory_context[0]["metadata"]["last_severity"]
    # Boost score for repeat offenders
    if prior_flags >= 2 or last_severity == "HIGH":
        score = min(score + 0.2, 1.0)
This small adjustment has a disproportionate effect on accuracy. A person who is genuinely acting suspiciously tends to score near the threshold repeatedly — not quite 0.8 on any single detection, but consistently above 0.6. With memory, those repeated near-misses compound into a definitive high-severity flag. Without memory, each incident looks ambiguous in isolation.
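The compounding effect is easier to see with the boost isolated into a pure function. The sketch below assumes recall results are a list of dicts shaped like the `memory_context` in the snippet; `apply_memory_boost` is a hypothetical name, not part of Hindsight, and the rounding is added here only to keep the output tidy.

```python
from typing import Any, Dict, List


def apply_memory_boost(
    score: float, memory_context: List[Dict[str, Any]], boost: float = 0.2
) -> float:
    """Raise the score for repeat offenders; leave first-time detections alone."""
    if not memory_context:
        return score

    # Only the closest match is consulted, mirroring memory_context[0] above.
    metadata = memory_context[0].get("metadata", {})
    prior_flags = metadata.get("flag_count", 0)
    last_severity = metadata.get("last_severity")

    if prior_flags >= 2 or last_severity == "HIGH":
        return round(min(score + boost, 1.0), 3)
    return score


# The same raw score, with and without history (values are illustrative).
first_visit = apply_memory_boost(0.65, [])
repeat_visit = apply_memory_boost(
    0.65, [{"metadata": {"flag_count": 2, "last_severity": "MEDIUM"}}]
)
```

Here a raw 0.65 stays a MEDIUM alert for an unknown face, while the same 0.65 for someone with two prior flags becomes 0.85 and crosses the HIGH cutoff.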
On the "retain" side, any time a person crosses the alert threshold, the system stores their face embedding and incident metadata back to Hindsight:
python
alert = {
    "id": len(self.alerts) + 1,
    "timestamp": now,
    "people_count": people_count,
    "suspicious_score": score,
    "severity": "HIGH" if score >= 0.8 else "MEDIUM",
}
hindsight_client.retain(face_embedding, metadata=alert)
self._save_alert(alert)
Every flagged detection becomes part of a growing institutional memory. The system gets better the longer it runs, without any model retraining.
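For local testing without the memory service, the retain and recall calls can be mimicked with a small in-memory store. Everything here is an assumption for illustration: `InMemoryFaceMemory`, the cosine-similarity matching, the 0.8 similarity threshold, and the way `flag_count` and `last_severity` are derived are stand-ins, not Hindsight's actual API or behavior.

```python
import math
from typing import Any, Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class InMemoryFaceMemory:
    """Hypothetical stand-in for a face memory store (not the Hindsight client)."""

    def __init__(self, match_threshold: float = 0.8) -> None:
        self.match_threshold = match_threshold
        self._records: List[Dict[str, Any]] = []

    def _matches(self, embedding: Sequence[float]) -> List[Tuple[float, Dict[str, Any]]]:
        # Score every stored face, keep those above the threshold, best first.
        scored = [(cosine_similarity(embedding, r["embedding"]), r) for r in self._records]
        hits = [(s, r) for s, r in scored if s >= self.match_threshold]
        return sorted(hits, key=lambda pair: pair[0], reverse=True)

    def retain(self, embedding: Sequence[float], metadata: Dict[str, Any]) -> None:
        hits = self._matches(embedding)
        if hits:
            record = hits[0][1]  # known face: bump its flag count
            record["metadata"]["flag_count"] += 1
        else:
            record = {"embedding": list(embedding), "metadata": {"flag_count": 1}}
            self._records.append(record)
        record["metadata"]["last_severity"] = metadata.get("severity")
        record["metadata"]["last_seen"] = metadata.get("timestamp")

    def recall(self, embedding: Sequence[float], top_k: int = 3) -> List[Dict[str, Any]]:
        return [
            {"similarity": s, "metadata": dict(r["metadata"])}
            for s, r in self._matches(embedding)[:top_k]
        ]


# Two sightings of nearly the same embedding, then lookups (toy 3-d vectors).
memory = InMemoryFaceMemory()
face_a = [1.0, 0.0, 0.0]
memory.retain(face_a, {"severity": "MEDIUM", "timestamp": "t1"})
memory.retain([0.99, 0.05, 0.0], {"severity": "HIGH", "timestamp": "t2"})

hits = memory.recall(face_a)
unknown = memory.recall([0.0, 1.0, 0.0])
```

The two near-identical embeddings collapse into one record with a flag count of 2, which is the shape the boost logic expects, while an unrelated embedding returns nothing.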
You can explore how Hindsight handles this kind of agent memory at hindsight.vectorize.io — the documentation covers the retain/recall/reflect loop in detail.

The Frontend: What Security Personnel Actually See
The React dashboard renders all of this in a format that makes it immediately actionable. The layout has three panels:
Left sidebar — active camera list with per-camera status (ACTIVE / ALERT / WARN / CRITICAL), person count per zone, and a live AI analysis engine indicator with activity bars showing inference throughput.
Center — the live MJPEG video feed streamed from /video_feed, with bounding boxes overlaid by OpenCV on the backend. Green boxes for normal detections, red for suspicious. This keeps the frontend thin — no client-side inference, just rendering.
Right panel — the threat meter (an animated SVG arc gauge), current threat classification, and a scrollable incident log showing every flagged detection with timestamp, severity badge, score percentage, and person count.
The incident log is the part security staff actually use in practice. Instead of cycling through camera feeds manually, they can scan a chronological list of everything the system flagged in the last shift, then jump to the corresponding timestamp in the camera recording if something warrants investigation.
Architecture at a Glance
Camera Feed (webcam / IP camera / video file)
        │
        ▼
FastAPI Backend (/video_feed endpoint)
  ├── YOLOv8n inference (person detection only, class 0)
  ├── Suspicion score computation
  ├── Face embedding extraction
  ├── Hindsight recall (prior incidents for this face)
  ├── Alert threshold check → Hindsight retain (if flagged)
  └── MJPEG stream to frontend
        │
        ▼
React Frontend
  ├── Live video feed (img src="/video_feed")
  ├── ThreatMeter component (arc gauge + severity bands)
  ├── AlertsSidebar (camera status + incident log)
  └── Real-time polling → /status endpoint (people count, score)
The backend deliberately does no face recognition on its own — that's Hindsight's job. YOLO's role is detection and localization. Face embeddings are handled by a separate lightweight model. This separation of concerns makes it easy to swap out either component without touching the other.

What Surprised Me
The scoring heuristic is surprisingly robust. I expected to spend a lot of time tuning the weights (0.4 / 0.3 / 0.3 for crowd, confidence, and density). In practice, the initial values worked well on test footage. Density — the ratio of total bounding box area to frame area — is a particularly useful signal because it catches both tight clustering of people and individuals who position themselves very close to the camera, which is often correlated with concealment behavior.
Memory changed the character of alerts. Before integrating Hindsight, the incident log felt noisy — borderline detections at 0.61 or 0.63 were technically above threshold but often false positives. With the prior-flag boost applied to known repeat detections, the score distribution shifted. Genuine incidents clustered at the top of the range; borderline cases mostly stayed borderline unless reinforced by history. Fewer false positives, same true positive rate.
Streaming over MJPEG is simpler than WebSockets for this use case. I considered WebSockets for the video feed, but a plain <img> tag targeting a StreamingResponse from FastAPI works fine and has zero client-side complexity. The limitation is that you can't do bidirectional control (e.g., PTZ camera commands) over MJPEG, but for a read-only monitoring dashboard it's the right call.
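Part of why MJPEG stays simple is that the stream is just JPEG frames separated by a multipart boundary, which the browser's image element consumes natively. The generator below sketches that framing with the standard library only. The boundary name `frame` and the helper names are assumptions; in a FastAPI app a generator like this would typically be passed to `StreamingResponse` with `media_type="multipart/x-mixed-replace; boundary=frame"`.

```python
from typing import Iterable, Iterator

BOUNDARY = b"frame"  # assumed boundary name; must match the response media type


def mjpeg_part(jpeg_bytes: bytes) -> bytes:
    """Wrap one encoded JPEG frame as a multipart/x-mixed-replace part."""
    return b"".join(
        [
            b"--" + BOUNDARY + b"\r\n",
            b"Content-Type: image/jpeg\r\n",
            b"Content-Length: " + str(len(jpeg_bytes)).encode("ascii") + b"\r\n",
            b"\r\n",
            jpeg_bytes,
            b"\r\n",
        ]
    )


def mjpeg_stream(frames: Iterable[bytes]) -> Iterator[bytes]:
    # In the real backend, `frames` would yield annotated frames already
    # encoded to JPEG; here any iterable of bytes works.
    for jpeg in frames:
        yield mjpeg_part(jpeg)


# Placeholder payloads standing in for encoded frames.
fake_frames = [b"\xff\xd8fake1\xff\xd9", b"\xff\xd8fake2\xff\xd9"]
parts = list(mjpeg_stream(fake_frames))
```

Each part replaces the previous image in the browser, which is why no client-side code is needed beyond the image element itself.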

Lessons Worth Taking
Don't fight the stateless model — add a memory layer. YOLOv8 is excellent at what it does. Trying to make it stateful by hacking pose history or bounding box tracking into the inference loop is the wrong abstraction. Let the detection model detect; let Hindsight remember.
Score thresholds need context. A 0.65 suspicious score means something different for a first-time visitor than for someone with three prior HIGH-severity flags. Memory makes this distinction possible.
The incident log is the product. The live feed is satisfying to watch, but the scrollable log of timestamped, severity-ranked incidents is what actually saves time. Security staff don't watch feeds — they investigate alerts. Design for that.
Keep the backend doing the heavy lifting. Doing inference on the server and streaming annotated frames keeps the frontend thin. If you need to scale, you scale the backend. The frontend is just a display.
Persistence matters from day one. Alerts logged to alerts.json on day one become training data for refining the scoring heuristic on day thirty. Don't throw that data away.
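As a sketch of what that persistence can look like, the helpers below append alerts to a JSON file and summarize the logged scores later. The file layout (a single JSON list), the function names, and the 0.65 cutoff used to call a score "borderline" are all assumptions; the project's own `_save_alert` may work differently.

```python
import json
from pathlib import Path
from statistics import mean
from typing import Any, Dict, List


def load_alerts(path: Path) -> List[Dict[str, Any]]:
    # A missing file simply means nothing has been flagged yet.
    if not path.exists():
        return []
    return json.loads(path.read_text())


def save_alert(alert: Dict[str, Any], path: Path) -> None:
    # Read-modify-write keeps the file a valid JSON list after every alert.
    alerts = load_alerts(path)
    alerts.append(alert)
    path.write_text(json.dumps(alerts, indent=2))


def summarize_scores(alerts: List[Dict[str, Any]], borderline_below: float = 0.65) -> Dict[str, float]:
    """Summarize logged scores; useful input when revisiting the weights."""
    scores = [a["suspicious_score"] for a in alerts]
    if not scores:
        return {"count": 0, "mean": 0.0, "borderline_share": 0.0}
    borderline = [s for s in scores if s < borderline_below]
    return {
        "count": len(scores),
        "mean": round(mean(scores), 3),
        "borderline_share": round(len(borderline) / len(scores), 3),
    }
```

Calling `summarize_scores(load_alerts(Path("alerts.json")))` after a few weeks shows how much of the log sits just above the threshold, which is a reasonable first signal for whether the weights or the threshold need adjusting.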

The full project source, setup instructions, and dependency list (fastapi, uvicorn, ultralytics, opencv-python) are available in the repo. If you're building something similar, the Hindsight documentation and the agent memory overview on Vectorize are the best starting points for the memory integration layer.
Retail theft costs the industry billions annually. The technology to build a genuinely intelligent security layer — one that watches everything, forgets nothing, and gets better over time — is already available and not particularly expensive to run. It just needs to be assembled correctly.
That's what SentinelAI does.
