This is a submission for the DEV Weekend Challenge: Community
The Community
This project is built for learners who study and build alone — developers who are grinding through tutorials late at night, students debugging code without a mentor nearby, and self-taught engineers who don't always have someone to ask "hey, does this look right?"
The dev community thrives on pair programming, code reviews, and the kind of feedback you get when someone's looking over your shoulder. But most people don't have that luxury. OneVision brings that real-time, eyes-on guidance to anyone with a browser — making the AI their always-available coding companion.
What I Built
OneVision is a real-time AI learning assistant that watches your camera or screen share, listens to your voice, and speaks back natural guidance — like a senior developer sitting next to you.
You join a video call from your browser, share your screen (or just look at the camera), and the AI agent:
- Analyzes your screen — spots IDE errors, reads your code, understands what you're working on
- Speaks feedback out loud — no typing, no copy-pasting: it just talks to you
- Listens to your questions — ask "why is this failing?" and get a spoken answer
- Watches your posture and hand position via camera when you're not screen-sharing (using YOLO pose estimation)
- Switches modes automatically — screen share active? It focuses on your screen. You close it? Back to camera mode.
It's designed to feel like a conversation, not a chatbot. The agent stays quiet when everything looks fine and only speaks up when it spots something worth flagging — or when you ask.
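The mode switch described above can be sketched as a simple track-selection check. This is illustrative only; the track shape and source names here are assumptions, not the Stream SDK's actual interface:

```python
# Illustrative mode switch: prefer the screen-share track when one is present.
# The dict shape {"kind": ..., "source": ...} is a stand-in for real track objects.
def pick_mode(tracks: list[dict]) -> str:
    sources = {t["source"] for t in tracks if t["kind"] == "video"}
    if "screenshare" in sources:
        return "screen"   # run the screen-share analyzer
    if "camera" in sources:
        return "camera"   # run the YOLO pose processor
    return "idle"         # no video yet; stay quiet
```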
Key Features
- Browser-based UI — no CLI setup required for end users (developers may still prefer the CLI), just open the link and go
- Auto mode switching — agent seamlessly transitions between camera and screen share analysis
- Echo suppression — fuzzy-match guard prevents the agent's own voice from looping back into its "ears"
- Proactive but not annoying — exponential back-off between feedback checks keeps the agent from talking over you
- Three VLM providers — OpenRouter (Claude), Gemini, NVIDIA — all swappable via environment variable
Demo
Live App (the CLI setup is recommended, but the live app works for a quick demo): vision-deploy-wine.vercel.app
Video Walkthrough:
Code
OneVision — Real-Time Multimodal Learning Assistant
An AI tutor that watches your camera or screen share in real time, listens to your voice and speaks back actionable guidance — creating an active feedback loop for hands-on learning.
Table of Contents
- What It Does
- Architecture
- Project Structure
- Prerequisites
- Installation
- Configuration
- Running the Project
- How It Works (Detailed Flow)
- LLM Provider Guide
- Troubleshooting & Known Issues
- Environment Variable Reference
- Tech Stack
What It Does
Phase 1 — Camera Coaching
- Watches live video from a call participant via camera.
- Runs YOLO pose estimation on incoming video frames to detect body posture and hand positions.
- Sends buffered frames to a Vision Language Model (VLM) to reason about what the user is doing.
- Provides spoken guidance using Deepgram TTS.
- Accepts voice questions through Deepgram STT.
Phase 2 — Screen Share Analysis
- User shares their screen (IDE, terminal, circuit tool, CAD, Figma) instead of camera.
- ScreenShareProcessor…
How I Built It
Architecture Overview
Browser (React + Stream SDK)
│
▼
token_server.py (JWT signing + agent process manager)
│ spawns agent subprocess per call
▼
VisionLearningPipeline
├── YOLO Pose Processor (camera mode)
├── ScreenShareProcessor (screen share mode)
├── VideoLLM (Claude / Gemini / NVIDIA)
├── Deepgram STT (voice input + echo guard)
└── Deepgram TTS (voice output)
Technologies Used
| Layer | Tech |
|---|---|
| Agent Framework | Vision Agents SDK by GetStream |
| VLM | Claude 3.5 Sonnet via OpenRouter (recommended), Gemini 2.0 Flash, NVIDIA Llama 3.2 Vision |
| Pose Estimation | YOLOv8 Nano Pose |
| Speech-to-Text | Deepgram Flux General English |
| Text-to-Speech | Deepgram Aura 2 Thalia |
| Video Transport | GetStream Video + WebRTC |
| Frontend | React 18 + TypeScript + Vite 5 |
| Frontend SDK | @stream-io/video-react-sdk |
| Backend | Python 3.11 with uv package management |
The Tricky Parts
1. Getting video frames to actually reach the LLM
This was the biggest silent failure. The Vision Agents SDK only wires video tracks to classes that extend VideoLLM. The SDK's built-in openrouter.LLM extends OpenAILLM — not VideoLLM — so it never received a single frame. The agent was generating responses with zero visual input.
The fix: use openai.ChatCompletionsVLM pointed at OpenRouter's base URL. This class buffers frames, encodes them as base64 JPEG, and sends them as image_url content parts with every request.
# ❌ text-only — never receives video frames
from livekit.plugins import openrouter
llm = openrouter.LLM(model="claude-3.5-sonnet")

# ✅ VideoLLM — frames wired automatically by SDK
from livekit.plugins import openai as openai_plugin
llm = openai_plugin.ChatCompletionsVLM(
    model="anthropic/claude-3.5-sonnet",
    base_url="https://openrouter.ai/api/v1",
    api_key=openrouter_api_key,
)
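Under the hood, a request carrying buffered frames is an OpenAI-style chat payload with `image_url` content parts. Here is a hand-rolled sketch of that encoding (not the SDK's actual code):

```python
import base64

def frame_to_content_part(jpeg_bytes: bytes) -> dict:
    """Encode one JPEG frame as an OpenAI-style image_url content part."""
    b64 = base64.b64encode(jpeg_bytes).decode("ascii")
    return {"type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{b64}"}}

def build_messages(frames: list[bytes], question: str) -> list[dict]:
    """One user message mixing buffered frames with the spoken question."""
    parts = [frame_to_content_part(f) for f in frames]
    parts.append({"type": "text", "text": question})
    return [{"role": "user", "content": parts}]
```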
2. The echo feedback loop
When the agent speaks through TTS, the microphone picks up that audio and feeds it back into STT — the agent hears itself and responds to itself, endlessly.
The echo guard compares incoming transcripts against recently spoken text using Python's SequenceMatcher. The tricky part: STT garbles TTS output slightly ("npm" becomes "n p m", "variable" becomes "vary able"), so exact match or high-threshold fuzzy match both fail. Tuning the similarity threshold down to 0.55 and the history window to 30 seconds made it reliable.
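The guard's core idea fits in a few lines. This is a minimal sketch using the thresholds from the post (0.55 similarity, 30-second window); the class name and method signatures are assumptions, not the project's API:

```python
import time
from difflib import SequenceMatcher

class EchoGuard:
    """Reject STT transcripts that fuzzily match the agent's own recent TTS."""

    def __init__(self, threshold: float = 0.55, window_s: float = 30.0):
        self.threshold = threshold
        self.window_s = window_s
        self._spoken: list[tuple[float, str]] = []  # (timestamp, lowered text)

    def record_tts(self, text: str) -> None:
        """Remember what the agent just said out loud."""
        self._spoken.append((time.monotonic(), text.lower()))

    def is_echo(self, transcript: str) -> bool:
        """True if the transcript resembles anything spoken in the last window."""
        now = time.monotonic()
        self._spoken = [(t, s) for t, s in self._spoken
                        if now - t < self.window_s]
        transcript = transcript.lower()
        return any(
            SequenceMatcher(None, transcript, s).ratio() >= self.threshold
            for _, s in self._spoken
        )
```

The low 0.55 threshold is what tolerates STT garbling like "npm" coming back as "n p m" while still rejecting genuine user speech.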
3. Mid-sentence interruptions
Eager turn detection in Deepgram would fire a "user finished speaking" event mid-sentence if there was a half-second pause. This caused the agent to respond to incomplete thoughts. Disabling eager_turn_detection fixed the fragmented conversations entirely.
4. Keeping the agent quiet
Without careful prompt engineering, the agent narrates everything it sees — constantly. The prompts now have explicit CRITICAL RULES: never narrate, never repeat, maximum 1-2 sentences, stay silent unless asked a question or a real error is spotted. The proactive feedback loop also uses exponential back-off, doubling the interval after consecutive "no feedback needed" responses.
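The back-off policy can be expressed as a tiny pure function driving an async loop. A sketch under assumed names (`check_feedback` stands in for the real VLM call; the base and cap values are illustrative):

```python
import asyncio

def next_interval(prev: float, had_feedback: bool,
                  base: float = 5.0, cap: float = 120.0) -> float:
    """Reset to base after real feedback; double (up to cap) when quiet."""
    return base if had_feedback else min(prev * 2, cap)

async def proactive_loop(check_feedback, speak,
                         base: float = 5.0, cap: float = 120.0) -> None:
    interval = base
    while True:
        await asyncio.sleep(interval)
        feedback = await check_feedback()  # None when nothing is worth flagging
        if feedback:
            await speak(feedback)
        interval = next_interval(interval, feedback is not None, base, cap)
```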
Running It Yourself (I highly recommend reading the docs on GitHub first)
git clone https://github.com/Arjunhg/onevision.git
cd onevision/project
# Install Python deps
uv sync
# Install frontend deps
cd frontend && npm install && cd ..
# Set up .env with your API keys
# (see README for the full template)
# Terminal 1 — token server + agent manager
uv run python token_server.py
# Terminal 2 — frontend
cd frontend && npm run dev
Full setup guide in the README.