Swagatika Beura

🚀 Mission Accomplished: Building a Real-Time AI Spatial Agent | Vision Possible Hackathon & The Power of Learning by Building

When curiosity meets constraints, innovation happens.

And when you choose to just learn by building and enjoy the journey — growth becomes unstoppable.

Recently, I built VisionMate AI as my submission for the Vision Possible: Agent Protocol Hackathon — an event that challenged developers worldwide to build cutting-edge generative AI solutions using multimodal agents.

Hosted by WeMakeDevs, the hackathon challenged participants to push the limits of what's possible with real-time AI — from blazing-fast inference to true multimodal interactions.

And that's exactly where VisionMate AI was born.


🚀 What is VisionMate AI?

Standard computer vision is entirely passive. If you are a cyclist, an e-scooter rider, or someone using a smart dashcam, knowing **"there is a car"** in a video feed is useless. You already know there are cars on the road!

The real problem is knowing if a specific car or bus is approaching too fast or getting dangerously close to you.

VisionMate AI is a comprehensive mobility platform that allows users to turn any standard camera into an active spatial hazard agent — transforming the way we handle personal safety on the road.

It’s built with Streamlit, OpenCV, and Python, powered by Stream's Vision Agents SDK and Ultralytics YOLOv8, running on the Gemini Realtime LLM for unparalleled inference speed and voice interaction.


🧠 How I Used Stream Vision Agents & Gemini Together

At the core of **VisionMate AI** lies a unique combination:

  • 🧩 Ultralytics YOLOv8 for highly accurate entity detection and bounding box generation.

  • ⚔️ Stream Vision Agents SDK for lightning-fast edge network routing — ensuring real-time, snappy voice-to-vision pipelines.

  • šŸ—£ļø Gemini Realtime LLM for instant, conversational audio warnings.


āš™ļø Architecture Highlights:

Instead of just identifying objects passively, VisionMate AI actively measures spatial threats across three core modes:

📷 1. Image Processing (High-Fidelity Frame Inspection)

Contextual safety thrives on precision. You simply upload an image, and the model instantly processes it to accurately detect, label, and draw bounding boxes around the objects present in the frame. It maps out the immediate environment, providing structured, high-quality safety responses in milliseconds.

🎬 2. Video Pipeline (Time-Series Hazard Tracking)

This feature is where the custom spatial math truly shines. You upload a pre-recorded dashcam video, and the system processes it frame-by-frame. It features a synchronous dashboard with a live telemetry HUD that successfully detects and tracks objects throughout the footage, exporting a final security audit in JSON format.
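The exported security audit can be approximated like this. It is a sketch under my own assumptions about the log format — `frame_log` and the output schema are hypothetical, not the project's actual structures:

```python
import json
from collections import Counter

def build_security_audit(frame_log, hazard_threshold=0.4):
    """Aggregate per-frame detections into a JSON audit string.
    frame_log: list of frames, each a list of (label, vertical_span)
    tuples. Hypothetical schema for illustration."""
    class_counts = Counter()
    hazard_events = []
    for idx, detections in enumerate(frame_log):
        for label, span in detections:
            class_counts[label] += 1
            if span > hazard_threshold:
                hazard_events.append({"frame": idx, "label": label,
                                      "vertical_span": round(span, 3)})
    audit = {
        "total_frames": len(frame_log),
        "detections_per_class": dict(class_counts),
        "hazard_events": hazard_events,
    }
    return json.dumps(audit, indent=2)
```

The idea is that per-frame detections are cheap to log, and the heavy lifting (counting, hazard filtering) happens once at export time.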

🎥 3. Live Protocol (Real-Time Agent Interface)

This is where the custom hackathon logic comes in. It features a Native Hardware Test for local proximity calculations and a WebRTC Edge Network. The AI detects everything in your live camera feed while isolating true spatial threats in real time.

I wrote the "40% Rule", a custom proximity algorithm inside the bounding box processor:


```python
# The Core Spatial Hazard Math
def process_bounding_boxes(image, results):
    img_h = image.shape[0]  # frame height in pixels
    # ... other setup code ...

    for box in results.boxes:
        x1, y1, x2, y2 = map(int, box.xyxy[0])
        vertical_span = (y2 - y1) / img_h  # object height relative to frame

        # The "40% Rule": the closer an object, the taller it appears
        if vertical_span > 0.4:
            status, color, thickness = "HAZARD", (0, 0, 255), 4  # red
        elif vertical_span > 0.2:
            status, color, thickness = "MID", (0, 255, 255), 2   # yellow
        else:
            status, color, thickness = "FAR", (0, 255, 0), 2     # green
```

If an object (like a car or bus) suddenly takes up more than 40% of the frame's height, my system mathematically determines it is an imminent collision risk and triggers the Gemini agent to issue an instant gTTS voice alert!
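One practical wrinkle: a hazard that stays in frame would re-trigger the voice alert on every frame. A simple cooldown guard keeps the alerts from being spammed — this is my own sketch of the idea, not necessarily how VisionMate implements it:

```python
import time

class AlertCooldown:
    """Suppress repeat voice alerts for the same hazard class within a
    cooldown window. Illustrative sketch, not the project's code."""
    def __init__(self, cooldown_s=5.0, clock=time.monotonic):
        self.cooldown_s = cooldown_s
        self.clock = clock        # injectable clock, handy for testing
        self._last_alert = {}     # label -> timestamp of last alert

    def should_alert(self, label):
        now = self.clock()
        last = self._last_alert.get(label)
        if last is not None and now - last < self.cooldown_s:
            return False          # still cooling down for this label
        self._last_alert[label] = now
        return True
```

Wrapping the gTTS call in `if cooldown.should_alert("car"): ...` means the agent warns once per hazard window instead of once per frame.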


🧩 Tech Stack Overview

  • Frontend: Streamlit, Custom CSS UI Components

  • Core Logic: Python, OpenCV, NumPy

  • Authentication: Bcrypt, JSON-based Local DB

  • AI & Vision: Stream Vision Agents SDK, Ultralytics YOLOv8, Gemini Realtime LLM

  • Audio: Google Text-to-Speech (gTTS)

  • Deployment/Environment: python-dotenv, nest_asyncio


⚔️ The Experience — When Curiosity Met Constraints

The Vision Possible: Agent Protocol Hackathon was centered around one powerful idea: building real-time multimodal AI agents that go beyond traditional cloud pipelines.

For me, this was an opportunity to test a belief I’ve always lived by:

"You don’t need to be an expert to start. You just need to start doing."

I joined the hackathon organized by WeMakeDevs almost spontaneously. There was no massive roadmap—just curiosity and a desire to build something meaningful.

šŸ› ļø Engineering Around the "OOM" Wall

To be honest, my development machine is not a high-end workstation. Running Ultralytics YOLOv8 locally was a struggle, and under strict 1GB cloud RAM limits, I repeatedly faced Out of Memory (OOM) crashes.

Instead of scaling down, I treated these constraints as engineering challenges:

  • Custom OpenCV Scaler: I engineered an aspect-ratio scaler to intelligently reduce the memory footprint without compromising spatial accuracy.

  • Memory Management: I minimized redundant tensor allocations and tuned inference batching to keep the system stable on low-tier instances.

  • WebRTC Integration: The real breakthrough came with Stream’s Vision Agents SDK and Google’s Gemini Realtime LLM.
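The aspect-ratio scaler mentioned above is conceptually simple; a minimal version (my own reconstruction for illustration, not the exact project code) computes a target size that fits a memory budget while preserving the frame's ratio:

```python
def fit_within(width, height, max_side):
    """Scale (width, height) down so neither side exceeds max_side,
    preserving aspect ratio. Returns the original size unchanged
    if the frame is already small enough."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))
```

Resizing a 1080p frame down to a 640-pixel longest side before inference cuts the pixel count (and the tensor memory) by roughly 9x, which is exactly the kind of saving that keeps a 1GB instance alive.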

🚀 The Result: Sub-30ms Latency

Rather than relying on a laggy HTTP pipeline, the Vision Agents SDK enabled a WebRTC-based streaming architecture over Stream’s Edge network. This allowed YOLO detections to bind directly with Gemini’s reasoning engine.

Stream handled efficient edge routing of heavy video data, while Gemini provided refined multimodal understanding — allowing the AI to "see" through the camera and "speak" contextual warnings in real time.


šŸ Looking back, this was about proving something to myself.

  • ✅ You don't need a perfect plan.

  • ✅ You don't need elite hardware.

  • ✅ You don't need to feel fully ready.

You need curiosity. You need resilience. And you need to enjoy the journey while building.

Because when curiosity meets constraints — and you choose to build anyway — innovation becomes inevitable.

šŸŒ What’s Next for VisionMate AI :

This is only the beginning. I'm working on turning VisionMate AI into a production-grade safety ecosystem:

  • Edge Deployment: Porting the weights to run natively on NPU-enabled hardware like Raspberry Pi for dedicated smart dashcams.

  • Predictive Analytics: Moving beyond the "40% Rule" to include velocity-based trajectory prediction for higher accuracy.

  • Unified Safety Dashboard: Allowing users to review past hazard logs, security audits, and spatial data in one centralized interface.


🙌 A Huge Thanks to WeMakeDevs

A massive shoutout to WeMakeDevs and Kunal Kushwaha for organizing such a well-structured and inspiring hackathon experience.

From the very beginning, they ensured that participants had everything they needed—from technical resources and community support to sessions with the Stream team explaining their technology in depth.

They hosted interactive workshops on how to effectively use the Vision Agents SDK, which really helped me understand how to build a low-latency, impactful project. The WeMakeDevs community was always active and approachable, answering doubts instantly and fostering an environment where learning and innovation could thrive.

I’m truly grateful for the opportunity to be a part of such an event — it wasn’t just a competition, it was a collaborative learning experience that empowered me to build, explore, and create with real-time AI.


GitHub Repo
āš ļø Disclaimer: This project was developed during the Vision Possible Hackathon. The current architecture is optimized for low-memory environments (1GB RAM) to ensure accessibility for users without high-end hardware.

VisionMate AI is an active spatial hazard agent that empowers users to turn any camera into a real-time safety tool. With advanced CV capabilities, sub-30ms WebRTC latency, and conversational AI alerts, VisionMate transforms passive video feeds into life-saving interactions.

Demo Video
Live Application
Built with ā¤ļø by Swagatika Beura for the Vision Possible Hackathon.
