
Vishwa Kumaresh


I Built a Real-Time AI Vision Assistant in 1 Week — Here's What I Learned About Multimodal AI

The Idea That Wouldn't Let Go

For the 466 million people with disabling hearing loss and the 43 million who are blind, two questions define daily life:

"What did you say?" and "What's in front of me?"

These aren't minor inconveniences. They're barriers — to independence, to safety, to just walking down a street.

When I saw the WeMakeDevs Vision Possible Hackathon, I knew exactly what I wanted to build: a system that turns a camera into an intelligent companion that can see, speak, navigate, and translate — in real-time.

No buttons. No screens to read. Just natural voice conversation with an AI that has eyes.

That's WorldLens.


What Is WorldLens?

WorldLens is a dual-mode assistive vision platform built on the Vision Agents SDK:

GuideLens — Your Walking Companion

For visually impaired users. Point any camera — laptop, phone, or even a tiny M5Stack edge device — and GuideLens becomes your eyes:

  • YOLO11 object detection across 80 classes — people, cars, obstacles, animals, furniture
  • Hazard tracking with approach speed and direction estimation (left/center/right, near/medium/far)
  • Real-time OCR — reads signs, building names, bus numbers aloud
  • Turn-by-turn walking navigation via Google Maps
  • Spatial memory — remembers every object it's seen, queryable by voice
  • Natural voice conversation — you talk, it sees and responds
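The direction and distance cues above can be derived from nothing more than a YOLO bounding box. Here's a minimal sketch of that idea; the function name and the thresholds (horizontal thirds for direction, box height for distance) are my illustrative assumptions, not WorldLens's actual code:

```python
# Sketch: estimate hazard direction and rough distance from a YOLO-style
# bounding box. Thresholds are illustrative assumptions.

def classify_hazard(box, frame_width, frame_height):
    """box = (x1, y1, x2, y2) in pixels."""
    x1, y1, x2, y2 = box
    center_x = (x1 + x2) / 2

    # Horizontal thirds of the frame -> left / center / right
    if center_x < frame_width / 3:
        direction = "left"
    elif center_x > 2 * frame_width / 3:
        direction = "right"
    else:
        direction = "center"

    # Taller boxes generally mean closer objects (a crude monocular cue)
    height_ratio = (y2 - y1) / frame_height
    if height_ratio > 0.5:
        distance = "near"
    elif height_ratio > 0.2:
        distance = "medium"
    else:
        distance = "far"

    return direction, distance

print(classify_hazard((500, 100, 620, 450), 640, 480))  # ('right', 'near')
```

Feeding these labels to the voice agent is what turns "person, 0.92 confidence" into "person on your right, close by."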

SignBridge — Sign Language Translation (Prototype Level)

A real-time sign language → spoken English bridge:

  • YOLO11 Pose extracts 17 body keypoints
  • MediaPipe tracks 21 hand landmarks per hand
  • ASL finger-spelling recognition for letters like A, B, D, I, L, V, W, Y
  • Gesture classification (wave, point, thumbs up) via 30-frame buffer analysis
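To make the 30-frame buffer idea concrete, here is a toy sketch of one gesture ("wave") classified from a rolling buffer of wrist x-positions, one sample per frame. The class, thresholds, and reversal-counting heuristic are my assumptions for illustration, not the SignBridge implementation:

```python
import math
from collections import deque

# Sketch: classify a "wave" from a rolling 30-frame buffer of normalized
# wrist x-positions. Thresholds are illustrative assumptions.

class GestureBuffer:
    def __init__(self, size=30):
        self.xs = deque(maxlen=size)

    def add(self, wrist_x):
        self.xs.append(wrist_x)

    def is_wave(self, min_reversals=3, min_swing=0.05):
        xs = list(self.xs)
        if len(xs) < 10:
            return False
        reversals = 0
        prev_dir = 0
        for a, b in zip(xs, xs[1:]):
            if abs(b - a) < min_swing:
                continue  # ignore jitter below the swing threshold
            d = 1 if b > a else -1
            if prev_dir and d != prev_dir:
                reversals += 1  # the hand changed direction
            prev_dir = d
        return reversals >= min_reversals

buf = GestureBuffer()
for i in range(30):
    buf.add(0.5 + 0.2 * math.sin(i / 2))  # oscillating wrist -> wave
print(buf.is_wave())  # True
```

The same buffer-then-classify pattern extends to pointing (sustained directional offset) and thumbs-up (static hand pose held across frames).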

The Architecture — How It All Fits Together

```
Camera (Webcam / M5Stack K210)
    │
    ▼
GetStream Edge Network (WebRTC)
    │
    ▼
┌─────────── Vision Agents Backend ───────────┐
│                                              │
│  YOLO11 Detection ─── Hazard Tracking        │
│  YOLO11 Pose ──────── MediaPipe Hands        │
│  Multi-VLM OCR                               │
│                                              │
│  Event Bus (pub/sub)                         │
│       │                                      │
│       ▼                                      │
│  Gemini 2.5 Flash Realtime                   │
│  Speech-to-Speech @ 5 FPS                    │
│  + 12 MCP Tools (Maps, Memory, Weather...)   │
│                                              │
└──────────────────────────────────────────────┘
    │
    ▼
React 19 Frontend (WebRTC + Alerts)
```

The entire system is one real-time voice+vision conversation. The user speaks, the AI sees and responds. No manual triggers. Gemini autonomously decides when to call tools — "Take me to the train station" triggers Google Maps directions, "What does that sign say?" triggers the OCR pipeline.
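The SDK exposes a `register_function()` hook for this (see the stack rundown below). The sketch here mimics the pattern with a plain registry so the dispatch idea is visible; the decorator, tool bodies, and return strings are my stand-ins, not the Vision Agents SDK API:

```python
# Sketch of the tool-calling pattern: the model picks a tool name plus
# arguments, and the app dispatches. Registry and tools are illustrative.

TOOLS = {}

def register_function(fn):
    TOOLS[fn.__name__] = fn
    return fn

@register_function
def get_walking_directions(destination: str) -> str:
    # A real version would call the Google Maps Directions API
    return f"Head north for 200 m toward {destination}."

@register_function
def read_sign_text() -> str:
    # A real version would run the OCR pipeline on the latest frame
    return "Platform 2: Trains to Central"

def dispatch(tool_name: str, **kwargs) -> str:
    return TOOLS[tool_name](**kwargs)

print(dispatch("get_walking_directions", destination="the train station"))
```

The key design point is that the model, not the user, decides when to dispatch: there is no "navigate" button anywhere in the system.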


The Build — 7 Days, One Vision

Day 1: Infrastructure

Got the Vision Agents SDK running with GetStream WebRTC transport. Built the React frontend skeleton. Established dual-mode architecture (GuideLens / SignBridge). Wired up camera input.

Day 2: Computer Vision

Integrated YOLO11 for both object detection and pose estimation. Built the multi-VLM provider chain with automatic failover across 5 providers (Gemini → Grok → Azure GPT-4o → NVIDIA Cosmos → HuggingFace). Mode switching working end-to-end.
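The failover chain is conceptually simple: try each provider in priority order and fall back on error. A minimal sketch, with stand-in provider callables (the real chain wraps actual VLM clients):

```python
# Sketch of a multi-provider failover chain. Provider callables are
# stand-ins for real VLM clients (Gemini, Grok, Azure GPT-4o, ...).

def describe_with_failover(frame, providers):
    """providers: ordered list of (name, callable) pairs."""
    errors = []
    for name, call in providers:
        try:
            return name, call(frame)
        except Exception as exc:
            errors.append((name, exc))  # remember the failure, try the next
    raise RuntimeError(f"All providers failed: {errors}")

def flaky(frame):
    raise TimeoutError("rate limited")

def working(frame):
    return "a crosswalk with two pedestrians"

chain = [("gemini", flaky), ("grok", flaky), ("azure-gpt4o", working)]
print(describe_with_failover(None, chain))
```

With five providers in the chain, a single rate limit or outage degrades latency instead of killing the feature, which matters a lot when the feature is someone's eyes.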

Day 3: Advanced Visuals

OCR processor with the multi-VLM chain. NVIDIA Cosmos integration for dense scene descriptions. 3D avatar with lip-sync using React Three Fiber (since discontinued). OCR text overlay on the frontend.

Day 4: Agentic Intelligence

This was the breakthrough day. Google Maps API integration for live walking directions. SQLite spatial memory database. MediaPipe hand landmarks for ASL finger-spelling. Priority-based navigation engine with announcement cooldowns.
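The spatial memory boils down to a sightings table plus a dedup window so the agent doesn't re-announce the same bench every frame. A minimal sketch using stdlib `sqlite3` (WorldLens uses `aiosqlite`; the schema and the 30-second window here are my illustrative assumptions):

```python
import sqlite3
import time

# Sketch of spatial memory with a deduplication cooldown, using stdlib
# sqlite3. Schema and the 30 s window are illustrative assumptions.

def make_db():
    db = sqlite3.connect(":memory:")
    db.execute("""CREATE TABLE sightings (
        label TEXT, direction TEXT, seen_at REAL)""")
    return db

def remember(db, label, direction, now=None, cooldown=30.0):
    """Record a sighting; return False if it's a duplicate within cooldown."""
    now = time.time() if now is None else now
    row = db.execute(
        "SELECT MAX(seen_at) FROM sightings WHERE label = ?", (label,)
    ).fetchone()
    if row[0] is not None and now - row[0] < cooldown:
        return False  # seen recently, skip announcing
    db.execute("INSERT INTO sightings VALUES (?, ?, ?)", (label, direction, now))
    return True

db = make_db()
print(remember(db, "bench", "left", now=100.0))   # True  (new object)
print(remember(db, "bench", "left", now=110.0))   # False (within 30 s)
print(remember(db, "bench", "right", now=140.0))  # True  (cooldown expired)
```

Because sightings persist, the voice agent can also answer "where did you last see a bench?" by querying the same table.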

Day 5: Polish & Testing

Wired up all 12 MCP tools. Built AlertOverlay v2 with Web Audio API chimes and severity-based haptic feedback. Enterprise-grade telemetry panel. Glassmorphism UI.

Day 6: A LOT of bug fixes :)

Day 7: Deployment (Docker and the M5Stack K210 camera)


The Hard Lessons

1. Edge Deployment Is TOUGH

I connected an M5Stack UnitV K210, a RISC-V chip with 8 MB of SRAM and a hardware neural accelerator. Getting Tiny YOLOv2 to run on it at ~15 FPS taught me more about real-world constraints than any tutorial.

You can't just "deploy to edge." You're fighting memory limits, model quantization, serial communication protocols, and the fact that a 224×224 input resolution means your detection accuracy drops significantly. Edge AI sounds great in blog posts. In practice, it's an engineering discipline unto itself.

2. Real-Time Is Possible — But It Takes Architecture

My first approach was naive: detect objects → send to LLM → speak response. It crumbled instantly. Duplicate announcements every frame. Hazard alerts drowning out navigation. The LLM getting overwhelmed with events.

The solution was an entire event-driven architecture:

  • BaseEvent pub/sub for decoupled communication
  • Priority-based announcement queues with configurable cooldowns
  • Bounding box growth rate estimation for approach speed (not just "car detected" but "car approaching from the left, getting closer")
  • 30-second deduplication cooldowns in spatial memory
  • User speech suppression during active navigation
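The priority queue plus cooldown pieces above can be sketched in a few lines. Priorities, categories, and cooldown values here are my illustrative assumptions:

```python
import heapq

# Sketch of priority-based announcements with per-category cooldowns.
# Categories, priorities, and cooldown values are illustrative.

COOLDOWN = {"hazard": 2.0, "navigation": 10.0, "scene": 30.0}
PRIORITY = {"hazard": 0, "navigation": 1, "scene": 2}  # lower = speak first

class Announcer:
    def __init__(self):
        self.queue = []
        self.last_spoken = {}

    def push(self, category, text, now):
        last = self.last_spoken.get(category, float("-inf"))
        if now - last < COOLDOWN[category]:
            return  # suppress repeats within the cooldown window
        heapq.heappush(self.queue, (PRIORITY[category], now, category, text))

    def next_announcement(self, now):
        if not self.queue:
            return None
        _, _, category, text = heapq.heappop(self.queue)
        self.last_spoken[category] = now
        return text

a = Announcer()
a.push("scene", "A quiet street ahead", now=0.0)
a.push("hazard", "Car approaching from the left", now=0.1)
print(a.next_announcement(now=0.2))  # hazard speaks first despite arriving later
```

This is the mechanical form of "knowing what NOT to say": low-priority chatter waits, and anything inside its cooldown window is simply dropped.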

Real-time isn't about speed. It's about knowing what NOT to say.

3. Voice-First Design Changes Everything

This was the deepest lesson. When your user can't see a screen:

  • You can't show a loading spinner — you have to say "one moment" or stay silent
  • You can't display status text — you have to speak it naturally
  • You can't use visual hierarchy — everything is sequential audio
  • Error messages become spoken apologies
  • "Tap to retry" becomes "just ask me again"

Every single UX pattern I knew was wrong. Voice-first isn't a feature. It's a complete paradigm shift.

The Vision Agents SDK Made This Possible

I want to be real about this: building an agentic real-time video+voice system from scratch would have taken months. The Vision Agents SDK gave me:

  • Agent class for lifecycle management
  • Realtime mode for speech-to-speech with Gemini
  • VideoProcessorPublisher base class for all my vision pipelines
  • BaseEvent for event-driven architecture
  • register_function() for MCP tool registration
  • GetStream Edge integration for WebRTC transport

I could focus on the what (assistive vision) instead of the how.


The Tech Stack

| Layer | Tech | Purpose |
| --- | --- | --- |
| Orchestration | Vision Agents SDK | Agent lifecycle, processors, events, MCP |
| Reasoning | Gemini 2.5 Flash Realtime | Speech-to-speech @ 5 FPS |
| Detection | YOLO11 (Ultralytics) | 80-class detection + 17-keypoint pose |
| Hand Tracking | MediaPipe | 21 keypoints/hand, ASL recognition |
| Navigation | Google Maps APIs | Directions, Places, Geocoding |
| Memory | aiosqlite | Persistent spatial object history |
| Transport | GetStream Edge (WebRTC) | Real-time video + audio |
| Frontend | React 19 + Vite 7 + TypeScript | WebRTC client, 3D avatar, alerts |
| Edge Device | M5Stack K210 (RISC-V) | On-device Tiny YOLOv2 |
| Deployment | Docker (multi-stage) | Single-container deployment |
| Testing | pytest + Vitest | 70 tests (24 + 46) |

What's Next

WorldLens is a proof-of-concept, but the vision (pun intended) is bigger:

  • Full mobile edge deployment with SIM card connectivity — truly portable, untethered navigation
  • Lip reading to speech — supplement audio in noisy environments
  • Caller vibration alerts — detect when someone is speaking to you, alert via haptics
  • Full SignBridge two-user mode — real-time bidirectional deaf ↔ hearing translation
  • Expanded ASL vocabulary — beyond finger-spelling to full conversational signs
  • Offline fallback — on-device YOLO + edge TTS for basic hazard detection without internet (I faced a lot of connectivity issues)

Final Thought

I built WorldLens because I believe multimodal AI shouldn't just be impressive demos — it should solve real problems for real people. For someone who can't see, a camera that speaks is not a gimmick. It's independence. For someone who signs, an AI that translates in real-time isn't a novelty. It's being heard.


Built for the WeMakeDevs Vision Possible Hackathon (February 2026) using the Vision Agents SDK.

GitHub: WorldLens Repository

#VisionPossible #VisionAgents #AI #Accessibility #Hackathon
