How I solved video AI's biggest blind spot — amnesia — by building a real-time temporal memory engine on top of the Vision Agents SDK by Stream.
![ARGUS: The AI That Never Forgets](https://img.shields.io/badge/ARGUS-The_AI_That_Never_Forgets-00d4aa?style=for-the-badge&logo=openai&logoColor=white&labelColor=101010&logoWidth=30&pd=20)
🏆 Built for the Vision Possible: Agent Protocol Hackathon
⚡ Powered by Vision Agents SDK by Stream
🎥 Watch ARGUS in Action
Before I explain how I built it, you have to see it to believe it. Here is ARGUS detecting objects, tracking them over time, and answering questions about the past in real-time.
(If the video doesn't load, click here to watch the demo.)
🚨 The Problem: AI Has Amnesia
I realized something frustrating while testing modern video AI demos: they are brilliant at telling you what is happening right now, but terrible at telling you what happened five minutes ago.
If I drop my keys and ask a standard AI agent, "Where are my keys?", it looks at the current frame, sees nothing, and says: "I don't see any keys."
The Vision Agents SDK documentation actually highlighted this limitation:
"Longer videos can cause the AI to lose context. For instance, if it's watching a soccer match, it will get confused after 30 seconds."
That was my lightbulb moment. 💡
The Mission: Build ARGUS, a real-time agent that doesn't just "see" video—it remembers it.
🧠 What is ARGUS?
ARGUS is a multimodal AI agent that watches live video, tracks objects using computer vision, and maintains a Temporal Memory Engine.
Unlike standard agents that process Frame → Detect → Forget, ARGUS uses a stateful pipeline:
```mermaid
graph LR
    A[Camera Feed] --> B(YOLO26 Detection)
    B --> C{Temporal Memory Engine}
    C --> D[Update Object History]
    C --> E[Log Events]
    D & E --> F[LLM Context]
    F --> G((Voice Response))
```
Key Capabilities
👁️ Real-time Tracking: Uses YOLO26 Nano + ByteTrack to assign persistent IDs to objects.
🕰️ Time Travel: Can answer "What did I hold up 2 minutes ago?"
📍 Spatial Awareness: Converts raw coordinates into human terms like "top-left" or "center" (sketched below).
🗣️ Voice Interaction: Full-duplex voice conversation with <1s latency.
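The spatial mapping is simpler than it sounds: divide the frame into a 3×3 grid and name the cell. Here is a minimal sketch of the idea; the function name `to_zone` and the nine-zone grid are my illustration, not necessarily the exact ARGUS helper:

```python
def to_zone(cx: float, cy: float, width: int, height: int) -> str:
    """Map a bounding-box center to a human-readable zone like 'middle-center'."""
    col = ["left", "center", "right"][min(int(cx / width * 3), 2)]
    row = ["top", "middle", "bottom"][min(int(cy / height * 3), 2)]
    return f"{row}-{col}"

# e.g. to_zone(320, 240, 640, 480) -> "middle-center"
```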
💬 Real Conversations With ARGUS
These are actual interactions from my testing sessions:
(Screenshot) Real-time event logging showing objects appearing and moving.
| I Said | ARGUS Responded |
| --- | --- |
| "What do you see?" | "Person ID:2 at middle-center, Cup ID:3 at bottom-right" |
| "What am I holding?" | "You appear to be holding a bottle, ID:7" |
| "What just moved?" | "Cup moved from bottom-left to bottom-right at 2:05 PM" |
| "Summarize everything" | "Person appeared at center 30s ago. Cup moved left to right at 2:05" |
⚡ Response time: ~1 second
🧠 All answers came from temporal memory — not from re-analyzing the video frame.
🏗️ The Architecture & Tech Stack
I needed a stack that was fast, cheap, and capable of handling real-time video streams without melting my laptop.
| Component | Technology | Why I Chose It |
| --- | --- | --- |
| Framework | Vision Agents SDK | It handled all the WebRTC/audio/video piping for me. |
| Vision Model | YOLO26 Nano | Benchmarked at 130ms/frame on CPU; fast and accurate. |
| Reasoning | Llama 3.3 via OpenRouter | Fast inference with tool-calling capabilities. |
| Speech | Deepgram (STT) + ElevenLabs (TTS) | The lowest-latency combo I tested. |
| Transport | Stream Edge Network | Kept video latency under 30ms. |
🛠️ The Build Journey
1. The "Secret Weapon": Temporal Memory Engine
This is the heart of the project. I wrote a custom Python class that sits between the vision processor and the LLM.
Instead of feeding raw video frames to the LLM (which is slow and expensive), I feed it structured event logs.
```python
# Core logic: if an object moves zones, log it.
if old_zone != zone:
    self._log("moved", f"{class_name} (ID:{track_id}) moved from {old_zone} to {zone}")
```
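The full engine is essentially that check plus bookkeeping. Here is a condensed sketch of the pattern; apart from the `_log` call and the zone comparison above, the class layout and names are illustrative rather than the actual ARGUS source:

```python
import time
from collections import deque

class TemporalMemoryEngine:
    """Sketch: last-seen state per tracked object, plus a rolling event log."""

    def __init__(self, max_events: int = 200):
        self.objects = {}                       # track_id -> {"class", "zone", "last_seen"}
        self.events = deque(maxlen=max_events)  # oldest events fall off automatically

    def _log(self, kind: str, message: str) -> None:
        self.events.append(f"[{time.strftime('%I:%M %p')}] {kind}: {message}")

    def update(self, track_id: int, class_name: str, zone: str) -> None:
        prev = self.objects.get(track_id)
        if prev is None:
            self._log("appeared", f"{class_name} (ID:{track_id}) appeared at {zone}")
        elif prev["zone"] != zone:
            self._log("moved", f"{class_name} (ID:{track_id}) moved from {prev['zone']} to {zone}")
        self.objects[track_id] = {"class": class_name, "zone": zone, "last_seen": time.time()}
```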
When I ask, "Where is the cup?", the LLM receives this context injection:
```
[ARGUS MEMORY]
Cup (ID:2): Last seen at bottom-right at 12:05 PM.
Person (ID:1): Currently visible at center.
Event: Cup moved from left to right 30 seconds ago.
```
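Turning that state into the [ARGUS MEMORY] block is plain string formatting. A sketch, assuming the `TemporalMemoryEngine` above; the two-second "currently visible" cutoff and five-event window are my guesses at reasonable defaults:

```python
import time

def build_context(memory: TemporalMemoryEngine) -> str:
    """Render memory as a compact text block to inject into the LLM prompt."""
    now = time.time()
    lines = ["[ARGUS MEMORY]"]
    for track_id, obj in memory.objects.items():
        age = now - obj["last_seen"]
        status = "Currently visible" if age < 2 else f"Last seen {int(age)}s ago"
        lines.append(f"{obj['class']} (ID:{track_id}): {status} at {obj['zone']}.")
    lines += [f"Event: {event}" for event in list(memory.events)[-5:]]
    return "\n".join(lines)
```

A few hundred tokens of text instead of a raw video frame is what keeps responses near a second.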
2. Building the Custom Processor
Using the SDK's VideoProcessorPublisher pattern was intuitive. I could access the raw av.VideoFrame, run my YOLO inference, draw bounding boxes, and push the frame back to the browser.
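Roughly, the per-frame step looks like the sketch below. The Ultralytics calls (`model.track`, `results.plot`) and the `av.VideoFrame` conversions are real APIs; the class shape and method name are my simplification, since the SDK's actual `VideoProcessorPublisher` interface defines the real hook:

```python
import av

class ArgusProcessor:
    """Sketch of the detection step inside a custom video processor."""

    def __init__(self, model, memory):
        self.model = model    # Ultralytics YOLO model (tracking via ByteTrack)
        self.memory = memory  # the TemporalMemoryEngine from earlier

    def process_frame(self, frame: av.VideoFrame) -> av.VideoFrame:
        img = frame.to_ndarray(format="bgr24")
        results = self.model.track(img, persist=True, verbose=False)[0]
        for box in results.boxes:
            if box.id is None:  # skip detections without a track ID
                continue
            track_id = int(box.id.item())
            class_name = results.names[int(box.cls.item())]
            x1, y1, x2, y2 = box.xyxy[0].tolist()
            zone = to_zone((x1 + x2) / 2, (y1 + y2) / 2, img.shape[1], img.shape[0])
            self.memory.update(track_id, class_name, zone)
        annotated = results.plot()  # boxes + persistent IDs drawn on the frame
        out = av.VideoFrame.from_ndarray(annotated, format="bgr24")
        out.pts, out.time_base = frame.pts, frame.time_base
        return out
```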
(Screenshot) ARGUS tracking objects with persistent IDs and spatial zones.
3. Solving the Latency Problem
My first prototype had 5-second delays. To fix this, I optimized ruthlessly:
- Switched from Gemini (rate limits) to Llama 3.3 via OpenRouter.
- Swapped YOLO11 for YOLO26 Nano (7.7 FPS on CPU).
- Fed the LLM human-readable zones ("top-left") instead of raw coordinates, cutting its token usage (see the `to_zone` sketch earlier).
🧪 Benchmark Results
I ran a diagnostic script to confirm the pipeline holds up on a standard laptop (no GPU):
| Model | Speed (ms/frame) | Max FPS | Verdict |
| --- | --- | --- | --- |
| YOLO26 Nano | 130 | 7.7 | ✅ Winner |
| YOLOv8 Nano | 138 | 7.2 | Solid |
| YOLO11 Small | 310 | 3.2 | Too slow |
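For reference, the harness can be as simple as the sketch below. The weight-file names follow the usual Ultralytics convention but are my assumption, and a synthetic frame only approximates real camera input:

```python
import time
import numpy as np
from ultralytics import YOLO

def benchmark(weights: str, runs: int = 50) -> None:
    model = YOLO(weights)
    frame = np.zeros((480, 640, 3), dtype=np.uint8)  # synthetic 640x480 frame
    model(frame, verbose=False)                      # warm-up run
    start = time.perf_counter()
    for _ in range(runs):
        model(frame, verbose=False)
    ms = (time.perf_counter() - start) / runs * 1000
    print(f"{weights}: {ms:.0f} ms/frame (~{1000 / ms:.1f} FPS)")

for weights in ("yolo26n.pt", "yolov8n.pt", "yolo11s.pt"):
    benchmark(weights)
```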
The Vision Agents SDK was crucial here. Because it handles the video transport efficiently, I could use all my CPU cycles for the actual detection logic.
💡 The "Aha!" Moment
The magic happened during a test run. I held up a water bottle, put it down, and waited. Then I asked:
Me: "What did I just show you?"
ARGUS: "You were holding a bottle (ID:7) at the center of the screen about 15 seconds ago."
It wasn't looking at the bottle now. It remembered. That feeling of interacting with an AI that has object permanence is wild.
🌍 Why This Matters
Hackathons often produce cool demos that don't solve real problems. ARGUS solves the context window problem for video.
By abstracting video into structured temporal data, we can build agents that:
- Monitor security feeds for hours and summarize activity.
- Help find lost items in a room.
- Analyze workflow efficiency in factories.
The Vision Agents SDK made this possible by removing the complexity of WebRTC and audio handling, allowing me to focus entirely on the memory innovation.
🔗 Links & Resources
- Code Repository: [ARGUS on GitHub](https://github.com/Vaibhav13Shukla/argus)
- Vision Agents SDK: Star the repo!
- Hackathon: Vision Possible
Thanks to Stream and WeMakeDevs for this challenge. It pushed me to build something I didn't think was possible in a weekend!