The Idea That Wouldn't Let Go
For 466 million people with disabling hearing loss and 43 million with visual impairment, two questions define their daily lives:
"What did you say?" and "What's in front of me?"
These aren't minor inconveniences. They're barriers — to independence, to safety, to just walking down a street.
When I saw the WeMakeDevs Vision Possible Hackathon, I knew exactly what I wanted to build: a system that turns a camera into an intelligent companion that can see, speak, navigate, and translate — in real time.
No buttons. No screens to read. Just natural voice conversation with an AI that has eyes.
That's WorldLens.
What Is WorldLens?
WorldLens is a dual-mode assistive vision platform built on the Vision Agents SDK:
GuideLens — Your Walking Companion
For visually impaired users. Point any camera — laptop, phone, or even a tiny M5Stack edge device — and GuideLens becomes your eyes:
- YOLO11 object detection across 80 classes — people, cars, obstacles, animals, furniture
- Hazard tracking with approach speed and direction estimation (left/center/right, near/medium/far)
- Real-time OCR — reads signs, building names, bus numbers aloud
- Turn-by-turn walking navigation via Google Maps
- Spatial memory — remembers every object it's seen, queryable by voice
- Natural voice conversation — you talk, it sees and responds
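As a sketch of how the left/center/right and near/medium/far labels above can be derived from a YOLO bounding box — the function name and thresholds here are illustrative, not WorldLens's actual values:

```python
def classify_position(bbox, frame_width, frame_height):
    """Map a detection's bounding box to coarse direction and distance.

    bbox is (x1, y1, x2, y2) in pixels. Thresholds are illustrative.
    """
    x1, y1, x2, y2 = bbox
    center_x = (x1 + x2) / 2

    # Horizontal thirds of the frame give left / center / right.
    if center_x < frame_width / 3:
        direction = "left"
    elif center_x < 2 * frame_width / 3:
        direction = "center"
    else:
        direction = "right"

    # Box height relative to frame height is a rough proxy for distance.
    height_ratio = (y2 - y1) / frame_height
    if height_ratio > 0.5:
        distance = "near"
    elif height_ratio > 0.2:
        distance = "medium"
    else:
        distance = "far"

    return direction, distance
```

A tall box near the left edge becomes "near, on your left" — exactly the kind of phrase the agent can speak without any LLM round-trip.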
SignBridge — Sign Language Translation (Prototype Level)
A real-time sign language → spoken English bridge:
- YOLO11 Pose extracts 17 body keypoints
- MediaPipe tracks 21 hand landmarks per hand
- ASL finger-spelling recognition for letters like A, B, D, I, L, V, W, Y
- Gesture classification (wave, point, thumbs up) via 30-frame buffer analysis
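The 30-frame buffer idea can be sketched as follows. This is a toy version: the real pipeline classifies over full MediaPipe landmark sets, while this one tracks only a wrist x-coordinate and detects the oscillation that characterizes a wave:

```python
from collections import deque

class GestureBuffer:
    """Classify coarse gestures from a rolling window of hand positions.

    Simplified sketch: tracks one wrist x-coordinate over 30 frames.
    """
    def __init__(self, size=30):
        self.xs = deque(maxlen=size)

    def add_frame(self, wrist_x):
        self.xs.append(wrist_x)

    def classify(self):
        if len(self.xs) < self.xs.maxlen:
            return None  # not enough history yet
        # A wave shows up as repeated left-right direction changes.
        direction_changes = 0
        prev_delta = 0
        for a, b in zip(self.xs, list(self.xs)[1:]):
            delta = b - a
            if delta * prev_delta < 0:  # sign flip = direction change
                direction_changes += 1
            if delta != 0:
                prev_delta = delta
        return "wave" if direction_changes >= 4 else None
```

The same windowed approach extends to pointing (sustained directional offset) and thumbs-up (stable landmark configuration over the window).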
The Architecture — How It All Fits Together
Camera (Webcam / M5Stack K210)
│
▼
GetStream Edge Network (WebRTC)
│
▼
┌─────────── Vision Agents Backend ───────────┐
│ │
│ YOLO11 Detection ─── Hazard Tracking │
│ YOLO11 Pose ──────── MediaPipe Hands │
│ Multi-VLM OCR │
│ │
│ Event Bus (pub/sub) │
│ │ │
│ ▼ │
│ Gemini 2.5 Flash Realtime │
│ Speech-to-Speech @ 5 FPS │
│ + 12 MCP Tools (Maps, Memory, Weather...) │
│ │
└──────────────────────────────────────────────┘
│
▼
React 19 Frontend (WebRTC + Alerts)
The entire system is one real-time voice+vision conversation. The user speaks, the AI sees and responds. No manual triggers. Gemini autonomously decides when to call tools — "Take me to the train station" triggers Google Maps directions, "What does that sign say?" triggers the OCR pipeline.
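The pub/sub flow in the middle of that diagram can be illustrated with a minimal bus. This is a standalone sketch — the SDK's actual `BaseEvent` types and dispatch APIs differ:

```python
from collections import defaultdict

class EventBus:
    """Minimal pub/sub bus, in the spirit of the event-driven backbone."""
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self.subscribers[event_type].append(handler)

    def publish(self, event_type, payload):
        for handler in self.subscribers[event_type]:
            handler(payload)

# Vision processors publish; the agent layer subscribes and speaks.
bus = EventBus()
announcements = []
bus.subscribe("hazard", lambda e: announcements.append(f"{e['label']} {e['direction']}"))
bus.publish("hazard", {"label": "car", "direction": "left"})
```

The point of the indirection: YOLO, OCR, and pose processors never talk to Gemini directly, so each side can be swapped or throttled independently.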
The Build — 7 Days, One Vision
Day 1: Infrastructure
Got the Vision Agents SDK running with GetStream WebRTC transport. Built the React frontend skeleton. Established dual-mode architecture (GuideLens / SignBridge). Wired up camera input.
Day 2: Computer Vision
Integrated YOLO11 for both object detection and pose estimation. Built the multi-VLM provider chain with automatic failover across 5 providers (Gemini → Grok → Azure GPT-4o → NVIDIA Cosmos → HuggingFace). Mode switching working end-to-end.
Day 3: Advanced Visuals
OCR processor with multi-VLM chain. NVIDIA Cosmos integration for dense scene descriptions. 3D avatar with lip-sync using React Three Fiber (since discontinued). OCR text overlay on the frontend.
Day 4: Agentic Intelligence
This was the breakthrough day. Google Maps API integration for live walking directions. SQLite spatial memory database. MediaPipe hand landmarks for ASL finger-spelling. Priority-based navigation engine with announcement cooldowns.
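The spatial memory idea from that day might look roughly like this. I use stdlib `sqlite3` here for brevity; WorldLens itself uses `aiosqlite` so database writes don't block the event loop, and the real schema is richer:

```python
import sqlite3
import time

# In-memory DB for the sketch; the real system persists to disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sightings (label TEXT, direction TEXT, distance TEXT, seen_at REAL)"
)

def remember(label, direction, distance):
    """Record a detected object with where and when it was seen."""
    conn.execute(
        "INSERT INTO sightings VALUES (?, ?, ?, ?)",
        (label, direction, distance, time.time()),
    )

def recall(label):
    """Answer 'where did you last see my backpack?' style voice queries."""
    return conn.execute(
        "SELECT direction, distance FROM sightings "
        "WHERE label = ? ORDER BY seen_at DESC LIMIT 1",
        (label,),
    ).fetchone()

remember("backpack", "left", "near")
```

Exposed as an MCP tool, this lets Gemini answer memory questions without re-scanning the scene.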
Day 5: Polish & Testing
Wired up all 12 MCP tools. Built AlertOverlay v2 with Web Audio API chimes and severity-based haptic feedback. Enterprise-grade telemetry panel. Glassmorphism UI.
Day 6: A LOT of bug fixes :)
Day 7: Deployment (Docker and the M5Stack K210 camera)
The Hard Lessons
1. Edge Deployment Is TOUGH
I connected an M5Stack UnitV K210 — a RISC-V chip with 8 MB of SRAM and a hardware neural accelerator. Getting YOLO v2 tiny to run on it at ~15 FPS taught me more about real-world constraints than any tutorial.
You can't just "deploy to edge." You're fighting memory limits, model quantization, serial communication protocols, and the fact that a 224×224 input resolution means your detection accuracy drops significantly. Edge AI sounds great in blog posts. In practice, it's an engineering discipline unto itself.
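For a taste of the serial side, here is what parsing detections from the K210 can look like, assuming a hypothetical JSON-lines wire format — the actual MaixPy firmware protocol may differ:

```python
import json

def parse_k210_line(line):
    """Parse one detection line arriving from the K210 over serial.

    The JSON-lines format here is a hypothetical example, not the
    real firmware protocol.
    """
    try:
        msg = json.loads(line)
    except json.JSONDecodeError:
        return None  # serial noise or a partial frame: drop it
    # Rescale from the 224x224 model input to a normalized 0-1 box
    # so downstream code is resolution-independent.
    x, y, w, h = msg["box"]
    return {
        "label": msg["label"],
        "box": (x / 224, y / 224, w / 224, h / 224),
    }
```

Half the edge work is exactly this kind of defensive framing: assume every line can be truncated or corrupted mid-transmission.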
2. Real-Time Is Possible — But It Takes Architecture
My first approach was naive: detect objects → send to LLM → speak response. It crumbled instantly. Duplicate announcements every frame. Hazard alerts drowning out navigation. The LLM getting overwhelmed with events.
The solution was an entire event-driven architecture:
- `BaseEvent` pub/sub for decoupled communication
- Priority-based announcement queues with configurable cooldowns
- Bounding box growth rate estimation for approach speed (not just "car detected" but "car approaching from the left, getting closer")
- 30-second deduplication cooldowns in spatial memory
- User speech suppression during active navigation
Real-time isn't about speed. It's about knowing what NOT to say.
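A minimal version of the priority queue with cooldowns might look like this — priorities, the cooldown value, and names are illustrative:

```python
import heapq

class AnnouncementQueue:
    """Priority queue with per-message cooldowns (illustrative sketch).

    Lower priority number = more urgent. A message re-queued within
    its cooldown window is dropped instead of spoken again.
    """
    def __init__(self, cooldown=30.0):
        self.heap = []
        self.last_spoken = {}
        self.cooldown = cooldown
        self.counter = 0  # tie-breaker preserving FIFO within a priority

    def push(self, priority, text, now):
        if now - self.last_spoken.get(text, float("-inf")) < self.cooldown:
            return  # deduplicate: said this too recently
        self.counter += 1
        heapq.heappush(self.heap, (priority, self.counter, text))

    def pop(self, now):
        if not self.heap:
            return None
        _, _, text = heapq.heappop(self.heap)
        self.last_spoken[text] = now
        return text
```

So "car approaching" (priority 0) always preempts "turn left" (priority 2), and neither repeats every frame.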
3. Voice-First Design Changes Everything
This was the deepest lesson. When your user can't see a screen:
- You can't show a loading spinner — you have to say "one moment" or stay silent
- You can't display status text — you have to speak it naturally
- You can't use visual hierarchy — everything is sequential audio
- Error messages become spoken apologies
- "Tap to retry" becomes "just ask me again"
Every single UX pattern I knew was wrong. Voice-first isn't a feature. It's a complete paradigm shift.
The Vision Agents SDK Made This Possible
I want to be real about this: building an agentic real-time video+voice system from scratch would have taken months. The Vision Agents SDK gave me:
- `Agent` class for lifecycle management
- `Realtime` mode for speech-to-speech with Gemini
- `VideoProcessorPublisher` base class for all my vision pipelines
- `BaseEvent` for event-driven architecture
- `register_function()` for MCP tool registration
- GetStream Edge integration for WebRTC transport
I could focus on the what (assistive vision) instead of the how.
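The tool-registration pattern is mimicked here in a standalone sketch, since the real `register_function()` signature may differ and the Maps call is stubbed out:

```python
# Standalone imitation of the register-then-dispatch pattern;
# not the actual Vision Agents SDK API.
tools = {}

def register_function(name, description):
    def wrap(fn):
        tools[name] = {"fn": fn, "description": description}
        return fn
    return wrap

@register_function("get_directions", "Walking directions via Google Maps")
def get_directions(destination: str) -> str:
    # In WorldLens this hits the Google Maps Directions API;
    # stubbed here so the sketch is self-contained.
    return f"Head north toward {destination}"

# When the model emits a tool call, dispatch by registered name:
call = {"name": "get_directions", "args": {"destination": "the train station"}}
result = tools[call["name"]]["fn"](**call["args"])
```

This is what makes "Take me to the train station" work hands-free: the LLM picks the tool and arguments, and the agent just dispatches.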
The Tech Stack
| Layer | Tech | Purpose |
|---|---|---|
| Orchestration | Vision Agents SDK | Agent lifecycle, processors, events, MCP |
| Reasoning | Gemini 2.5 Flash Realtime | Speech-to-speech @ 5 FPS |
| Detection | YOLO11 (Ultralytics) | 80-class detection + 17-keypoint pose |
| Hand Tracking | MediaPipe | 21 keypoints/hand, ASL recognition |
| Navigation | Google Maps APIs | Directions, Places, Geocoding |
| Memory | aiosqlite | Persistent spatial object history |
| Transport | GetStream Edge (WebRTC) | Real-time video + audio |
| Frontend | React 19 + Vite 7 + TypeScript | WebRTC client, 3D avatar, alerts |
| Edge Device | M5Stack K210 (RISC-V) | On-device YOLO v2 tiny |
| Deployment | Docker (multi-stage) | Single-container deployment |
| Testing | pytest + Vitest | 70 tests (24 + 46) |
What's Next
WorldLens is a proof-of-concept, but the vision (pun intended) is bigger:
- Full mobile edge deployment with SIM card connectivity — truly portable, untethered navigation
- Lip reading to speech — supplement audio in noisy environments
- Caller vibration alerts — detect when someone is speaking to you, alert via haptics
- Full SignBridge two-user mode — real-time bidirectional deaf ↔ hearing translation
- Expanded ASL vocabulary — beyond finger-spelling to full conversational signs
- Offline fallback — on-device YOLO + edge TTS for basic hazard detection without internet (I faced a lot of connectivity issues)
Final Thought
I built WorldLens because I believe multimodal AI shouldn't just be impressive demos — it should solve real problems for real people. For someone who can't see, a camera that speaks is not a gimmick. It's independence. For someone who signs, an AI that translates in real time isn't a novelty. It's being heard.
Built for the WeMakeDevs Vision Possible Hackathon (February 2026) using the Vision Agents SDK.
GitHub: WorldLens Repository