
Vishwa Kumaresh


I Built a Real-Time AI Vision Assistant in 1 Week — Here's What I Learned About Multimodal AI

The Idea That Wouldn't Let Go

For the 466 million people with disabling hearing loss and the 43 million who are blind, two questions define daily life:

"What did you say?" and "What's in front of me?"

These aren't minor inconveniences. They're barriers — to independence, to safety, to just walking down a street.

When I saw the WeMakeDevs Vision Possible Hackathon, I knew exactly what I wanted to build: a system that turns a camera into an intelligent companion that can see, speak, navigate, and translate — in real-time.

No buttons. No screens to read. Just natural voice conversation with an AI that has eyes.

That's WorldLens.


What Is WorldLens?

WorldLens is a dual-mode assistive vision platform built on the Vision Agents SDK:

GuideLens — Your Walking Companion

For visually impaired users. Point any camera — laptop, phone, or even a tiny M5Stack edge device — and GuideLens becomes your eyes:

  • YOLO11 object detection across 80 classes — people, cars, obstacles, animals, furniture
  • Hazard tracking with approach speed and direction estimation (left/center/right, near/medium/far)
  • Real-time OCR — reads signs, building names, bus numbers aloud
  • Turn-by-turn walking navigation via Google Maps
  • Spatial memory — remembers every object it's seen, queryable by voice
  • Natural voice conversation — you talk, it sees and responds
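The direction and distance cues above can be derived from nothing more than a YOLO bounding box. Here's a minimal sketch of that idea; the function name and the thresholds (horizontal thirds for direction, box height for distance) are my illustrative assumptions, not WorldLens's actual code:

```python
# Sketch: estimate hazard direction and rough distance from a YOLO-style
# bounding box. Thresholds are illustrative assumptions.

def classify_hazard(box, frame_width, frame_height):
    """box = (x1, y1, x2, y2) in pixels."""
    x1, y1, x2, y2 = box
    center_x = (x1 + x2) / 2

    # Horizontal thirds of the frame -> left / center / right
    if center_x < frame_width / 3:
        direction = "left"
    elif center_x > 2 * frame_width / 3:
        direction = "right"
    else:
        direction = "center"

    # Taller boxes generally mean closer objects (a crude monocular cue)
    height_ratio = (y2 - y1) / frame_height
    if height_ratio > 0.5:
        distance = "near"
    elif height_ratio > 0.2:
        distance = "medium"
    else:
        distance = "far"

    return direction, distance

print(classify_hazard((500, 100, 620, 450), 640, 480))  # ('right', 'near')
```

Feeding these labels to the voice agent is what turns "person, 0.92 confidence" into "person on your right, close by."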

SignBridge — Sign Language Translation (Prototype Level)

A real-time sign language → spoken English bridge:

  • YOLO11 Pose extracts 17 body keypoints
  • MediaPipe tracks 21 hand landmarks per hand
  • ASL finger-spelling recognition for letters like A, B, D, I, L, V, W, Y
  • Gesture classification (wave, point, thumbs up) via 30-frame buffer analysis
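To make the 30-frame buffer idea concrete, here is a toy sketch of one gesture ("wave") classified from a rolling buffer of wrist x-positions, one sample per frame. The class, thresholds, and reversal-counting heuristic are my assumptions for illustration, not the SignBridge implementation:

```python
import math
from collections import deque

# Sketch: classify a "wave" from a rolling 30-frame buffer of normalized
# wrist x-positions. Thresholds are illustrative assumptions.

class GestureBuffer:
    def __init__(self, size=30):
        self.xs = deque(maxlen=size)

    def add(self, wrist_x):
        self.xs.append(wrist_x)

    def is_wave(self, min_reversals=3, min_swing=0.05):
        xs = list(self.xs)
        if len(xs) < 10:
            return False
        reversals = 0
        prev_dir = 0
        for a, b in zip(xs, xs[1:]):
            if abs(b - a) < min_swing:
                continue  # ignore jitter below the swing threshold
            d = 1 if b > a else -1
            if prev_dir and d != prev_dir:
                reversals += 1  # the hand changed direction
            prev_dir = d
        return reversals >= min_reversals

buf = GestureBuffer()
for i in range(30):
    buf.add(0.5 + 0.2 * math.sin(i / 2))  # oscillating wrist -> wave
print(buf.is_wave())  # True
```

The same buffer-then-classify pattern extends to pointing (sustained directional offset) and thumbs-up (static hand pose held across frames).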

The Architecture — How It All Fits Together

```
Camera (Webcam / M5Stack K210)
    │
    ▼
GetStream Edge Network (WebRTC)
    │
    ▼
┌─────────── Vision Agents Backend ───────────┐
│                                              │
│  YOLO11 Detection ─── Hazard Tracking        │
│  YOLO11 Pose ──────── MediaPipe Hands        │
│  Multi-VLM OCR                               │
│                                              │
│  Event Bus (pub/sub)                         │
│       │                                      │
│       ▼                                      │
│  Gemini 2.5 Flash Realtime                   │
│  Speech-to-Speech @ 5 FPS                    │
│  + 12 MCP Tools (Maps, Memory, Weather...)   │
│                                              │
└──────────────────────────────────────────────┘
    │
    ▼
React 19 Frontend (WebRTC + Alerts)
```

The entire system is one real-time voice+vision conversation. The user speaks, the AI sees and responds. No manual triggers. Gemini autonomously decides when to call tools — "Take me to the train station" triggers Google Maps directions, "What does that sign say?" triggers the OCR pipeline.
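The SDK exposes a `register_function()` hook for this (see the stack rundown below). The sketch here mimics the pattern with a plain registry so the dispatch idea is visible; the decorator, tool bodies, and return strings are my stand-ins, not the Vision Agents SDK API:

```python
# Sketch of the tool-calling pattern: the model picks a tool name plus
# arguments, and the app dispatches. Registry and tools are illustrative.

TOOLS = {}

def register_function(fn):
    TOOLS[fn.__name__] = fn
    return fn

@register_function
def get_walking_directions(destination: str) -> str:
    # A real version would call the Google Maps Directions API
    return f"Head north for 200 m toward {destination}."

@register_function
def read_sign_text() -> str:
    # A real version would run the OCR pipeline on the latest frame
    return "Platform 2: Trains to Central"

def dispatch(tool_name: str, **kwargs) -> str:
    return TOOLS[tool_name](**kwargs)

print(dispatch("get_walking_directions", destination="the train station"))
```

The key design point is that the model, not the user, decides when to dispatch: there is no "navigate" button anywhere in the system.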


The Build — 7 Days, One Vision

Day 1: Infrastructure

Got the Vision Agents SDK running with GetStream WebRTC transport. Built the React frontend skeleton. Established dual-mode architecture (GuideLens / SignBridge). Wired up camera input.

Day 2: Computer Vision

Integrated YOLO11 for both object detection and pose estimation. Built the multi-VLM provider chain with automatic failover across 5 providers (Gemini → Grok → Azure GPT-4o → NVIDIA Cosmos → HuggingFace). Mode switching working end-to-end.
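The failover chain is conceptually simple: try each provider in priority order and fall back on error. A minimal sketch, with stand-in provider callables (the real chain wraps actual VLM clients):

```python
# Sketch of a multi-provider failover chain. Provider callables are
# stand-ins for real VLM clients (Gemini, Grok, Azure GPT-4o, ...).

def describe_with_failover(frame, providers):
    """providers: ordered list of (name, callable) pairs."""
    errors = []
    for name, call in providers:
        try:
            return name, call(frame)
        except Exception as exc:
            errors.append((name, exc))  # remember the failure, try the next
    raise RuntimeError(f"All providers failed: {errors}")

def flaky(frame):
    raise TimeoutError("rate limited")

def working(frame):
    return "a crosswalk with two pedestrians"

chain = [("gemini", flaky), ("grok", flaky), ("azure-gpt4o", working)]
print(describe_with_failover(None, chain))
```

With five providers in the chain, a single rate limit or outage degrades latency instead of killing the feature, which matters a lot when the feature is someone's eyes.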

Day 3: Advanced Visuals

OCR processor with the multi-VLM chain. NVIDIA Cosmos integration for dense scene descriptions. 3D avatar with lip-sync using React Three Fiber (since discontinued). OCR text overlay on the frontend.

Day 4: Agentic Intelligence

This was the breakthrough day. Google Maps API integration for live walking directions. SQLite spatial memory database. MediaPipe hand landmarks for ASL finger-spelling. Priority-based navigation engine with announcement cooldowns.
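The spatial memory boils down to a sightings table plus a dedup window so the agent doesn't re-announce the same bench every frame. A minimal sketch using stdlib `sqlite3` (WorldLens uses `aiosqlite`; the schema and the 30-second window here are my illustrative assumptions):

```python
import sqlite3
import time

# Sketch of spatial memory with a deduplication cooldown, using stdlib
# sqlite3. Schema and the 30 s window are illustrative assumptions.

def make_db():
    db = sqlite3.connect(":memory:")
    db.execute("""CREATE TABLE sightings (
        label TEXT, direction TEXT, seen_at REAL)""")
    return db

def remember(db, label, direction, now=None, cooldown=30.0):
    """Record a sighting; return False if it's a duplicate within cooldown."""
    now = time.time() if now is None else now
    row = db.execute(
        "SELECT MAX(seen_at) FROM sightings WHERE label = ?", (label,)
    ).fetchone()
    if row[0] is not None and now - row[0] < cooldown:
        return False  # seen recently, skip announcing
    db.execute("INSERT INTO sightings VALUES (?, ?, ?)", (label, direction, now))
    return True

db = make_db()
print(remember(db, "bench", "left", now=100.0))   # True  (new object)
print(remember(db, "bench", "left", now=110.0))   # False (within 30 s)
print(remember(db, "bench", "right", now=140.0))  # True  (cooldown expired)
```

Because sightings persist, the voice agent can also answer "where did you last see a bench?" by querying the same table.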

Day 5: Polish & Testing

Wired up all 12 MCP tools. Built AlertOverlay v2 with Web Audio API chimes and severity-based haptic feedback. Enterprise-grade telemetry panel. Glassmorphism UI.

Day 6: A LOT of bug fixes :)

Day 7: Deployment (Docker and the M5Stack K210 camera)


The Hard Lessons

1. Edge Deployment Is TOUGH

I connected an M5Stack UnitV K210, a RISC-V chip with 8 MB of SRAM and a hardware neural accelerator. Getting Tiny YOLOv2 to run on it at ~15 FPS taught me more about real-world constraints than any tutorial.

You can't just "deploy to edge." You're fighting memory limits, model quantization, serial communication protocols, and the fact that a 224×224 input resolution means your detection accuracy drops significantly. Edge AI sounds great in blog posts. In practice, it's an engineering discipline unto itself.

2. Real-Time Is Possible — But It Takes Architecture

My first approach was naive: detect objects → send to LLM → speak response. It crumbled instantly. Duplicate announcements every frame. Hazard alerts drowning out navigation. The LLM getting overwhelmed with events.

The solution was an entire event-driven architecture:

  • BaseEvent pub/sub for decoupled communication
  • Priority-based announcement queues with configurable cooldowns
  • Bounding box growth rate estimation for approach speed (not just "car detected" but "car approaching from the left, getting closer")
  • 30-second deduplication cooldowns in spatial memory
  • User speech suppression during active navigation
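The priority queue plus cooldown pieces above can be sketched in a few lines. Priorities, categories, and cooldown values here are my illustrative assumptions:

```python
import heapq

# Sketch of priority-based announcements with per-category cooldowns.
# Categories, priorities, and cooldown values are illustrative.

COOLDOWN = {"hazard": 2.0, "navigation": 10.0, "scene": 30.0}
PRIORITY = {"hazard": 0, "navigation": 1, "scene": 2}  # lower = speak first

class Announcer:
    def __init__(self):
        self.queue = []
        self.last_spoken = {}

    def push(self, category, text, now):
        last = self.last_spoken.get(category, float("-inf"))
        if now - last < COOLDOWN[category]:
            return  # suppress repeats within the cooldown window
        heapq.heappush(self.queue, (PRIORITY[category], now, category, text))

    def next_announcement(self, now):
        if not self.queue:
            return None
        _, _, category, text = heapq.heappop(self.queue)
        self.last_spoken[category] = now
        return text

a = Announcer()
a.push("scene", "A quiet street ahead", now=0.0)
a.push("hazard", "Car approaching from the left", now=0.1)
print(a.next_announcement(now=0.2))  # hazard speaks first despite arriving later
```

This is the mechanical form of "knowing what NOT to say": low-priority chatter waits, and anything inside its cooldown window is simply dropped.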

Real-time isn't about speed. It's about knowing what NOT to say.

3. Voice-First Design Changes Everything

This was the deepest lesson. When your user can't see a screen:

  • You can't show a loading spinner — you have to say "one moment" or stay silent
  • You can't display status text — you have to speak it naturally
  • You can't use visual hierarchy — everything is sequential audio
  • Error messages become spoken apologies
  • "Tap to retry" becomes "just ask me again"

Every single UX pattern I knew was wrong. Voice-first isn't a feature. It's a complete paradigm shift.

The Vision Agents SDK Made This Possible

I want to be real about this: building an agentic real-time video+voice system from scratch would have taken months. The Vision Agents SDK gave me:

  • Agent class for lifecycle management
  • Realtime mode for speech-to-speech with Gemini
  • VideoProcessorPublisher base class for all my vision pipelines
  • BaseEvent for event-driven architecture
  • register_function() for MCP tool registration
  • GetStream Edge integration for WebRTC transport

I could focus on the what (assistive vision) instead of the how.


The Tech Stack

| Layer | Tech | Purpose |
| --- | --- | --- |
| Orchestration | Vision Agents SDK | Agent lifecycle, processors, events, MCP |
| Reasoning | Gemini 2.5 Flash Realtime | Speech-to-speech @ 5 FPS |
| Detection | YOLO11 (Ultralytics) | 80-class detection + 17-keypoint pose |
| Hand Tracking | MediaPipe | 21 keypoints/hand, ASL recognition |
| Navigation | Google Maps APIs | Directions, Places, Geocoding |
| Memory | aiosqlite | Persistent spatial object history |
| Transport | GetStream Edge (WebRTC) | Real-time video + audio |
| Frontend | React 19 + Vite 7 + TypeScript | WebRTC client, 3D avatar, alerts |
| Edge Device | M5Stack K210 (RISC-V) | On-device Tiny YOLOv2 |
| Deployment | Docker (multi-stage) | Single-container deployment |
| Testing | pytest + Vitest | 70 tests (24 + 46) |

What's Next

WorldLens is a proof-of-concept, but the vision (pun intended) is bigger:

  • Full mobile edge deployment with SIM card connectivity — truly portable, untethered navigation
  • Lip reading to speech — supplement audio in noisy environments
  • Caller vibration alerts — detect when someone is speaking to you, alert via haptics
  • Full SignBridge two-user mode — real-time bidirectional deaf ↔ hearing translation
  • Expanded ASL vocabulary — beyond finger-spelling to full conversational signs
  • Offline fallback — on-device YOLO + edge TTS for basic hazard detection without internet (I faced a lot of connectivity issues)

Final Thought

I built WorldLens because I believe multimodal AI shouldn't just be impressive demos — it should solve real problems for real people. For someone who can't see, a camera that speaks is not a gimmick. It's independence. For someone who signs, an AI that translates in real-time isn't a novelty. It's being heard.


Built for the WeMakeDevs Vision Possible Hackathon (February 2026) using the Vision Agents SDK.

GitHub: WorldLens Repository

#VisionPossible #VisionAgents #AI #Accessibility #Hackathon
