DEV Community

Arjun Sharma

Building a Vision AI That Sees Your Code and Talks Back

This is a submission for the DEV Weekend Challenge: Community

The Community

This project is built for learners who study and build alone — developers who are grinding through tutorials late at night, students debugging code without a mentor nearby, and self-taught engineers who don't always have someone to ask "hey, does this look right?"

The dev community thrives on pair programming, code reviews, and the kind of feedback you get when someone's looking over your shoulder. But most people don't have that luxury. OneVision brings that real-time, eyes-on guidance to anyone with a browser — making the AI their always-available coding companion.


What I Built

OneVision is a real-time AI learning assistant that watches your camera or screen share, listens to your voice, and speaks back natural guidance — like a senior developer sitting next to you.

You join a video call from your browser, share your screen (or just look at the camera), and the AI agent:

  • Analyzes your screen — spots IDE errors, reads your code, understands what you're working on
  • Speaks feedback out loud — no typing, no copy-pasting: it just talks to you
  • Listens to your questions — ask "why is this failing?" and get a spoken answer
  • Watches your posture and hand position via camera when you're not screen-sharing (using YOLO pose estimation)
  • Switches modes automatically — screen share active? It focuses on your screen. You close it? Back to camera mode.
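The automatic mode switching in the last bullet can be sketched as a tiny two-state model. This is a minimal illustration, not the SDK's actual wiring; the class and method names here are mine:

```python
from enum import Enum, auto


class Mode(Enum):
    CAMERA = auto()
    SCREEN = auto()


class ModeSwitcher:
    """Tracks which video source the agent should be analyzing."""

    def __init__(self) -> None:
        self.mode = Mode.CAMERA

    def on_tracks_changed(self, screen_share_active: bool) -> Mode:
        # Screen share takes priority over the camera; when the user
        # stops sharing, fall back to camera (pose) analysis.
        self.mode = Mode.SCREEN if screen_share_active else Mode.CAMERA
        return self.mode
```

In the real agent the equivalent decision runs on track added/removed events from the call, so the user never has to toggle anything.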

It's designed to feel like a conversation, not a chatbot. The agent stays quiet when everything looks fine and only speaks up when it spots something worth flagging — or when you ask.

Key Features

  • Browser-based UI — no CLI setup required; just open the link and go (developers can still run everything locally from the CLI)
  • Auto mode switching — agent seamlessly transitions between camera and screen share analysis
  • Echo suppression — fuzzy-match guard prevents the agent's own voice from looping back into its "ears"
  • Proactive but not annoying — exponential back-off between feedback checks keeps the agent from talking over you
  • Three VLM providers — OpenRouter (Claude), Gemini, NVIDIA — all swappable via environment variable

Demo

Live App (the CLI setup below is the recommended path, but for a quick demo try the hosted version): vision-deploy-wine.vercel.app

Video Walkthrough:


Code

OneVision — Real-Time Multimodal Learning Assistant

An AI tutor that watches your camera or screen share in real time, listens to your voice and speaks back actionable guidance — creating an active feedback loop for hands-on learning.




What It Does

Phase 1 — Camera Coaching

  • Watches live video from a call participant via camera.
  • Runs YOLO pose estimation on incoming video frames to detect body posture and hand positions.
  • Sends buffered frames to a Vision Language Model (VLM) to reason about what the user is doing.
  • Provides spoken guidance using Deepgram TTS.
  • Accepts voice questions through Deepgram STT.
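As an illustration of what the pose step can feed the agent, here is a small heuristic over YOLO-style keypoints. The keypoint indices follow the COCO convention that YOLOv8 pose models output; the threshold and function name are illustrative, not the project's actual logic:

```python
# COCO keypoint indices, as emitted by YOLOv8 pose models
NOSE, L_SHOULDER, R_SHOULDER = 0, 5, 6


def is_slouching(keypoints: list[tuple[float, float]], max_drop: float = 40.0) -> bool:
    """Flag a slouch when the nose drops close to shoulder height.

    `keypoints` holds one person's (x, y) pixel coordinates; `max_drop`
    is an illustrative threshold in pixels (image y grows downward).
    """
    nose_y = keypoints[NOSE][1]
    shoulder_y = (keypoints[L_SHOULDER][1] + keypoints[R_SHOULDER][1]) / 2
    return (shoulder_y - nose_y) < max_drop
```

A signal like this is cheap to compute per frame, so it can run continuously while the heavier VLM call happens only on buffered batches.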

Phase 2 — Screen Share Analysis

  • User shares their screen (IDE, terminal, circuit tool, CAD, Figma) instead of camera.
  • A dedicated ScreenShareProcessor takes over frame capture, routing screen frames (instead of camera frames) to the VLM for analysis.

How I Built It

Architecture Overview

Browser (React + Stream SDK)
    │
    ▼
token_server.py  (JWT signing + agent process manager)
    │  spawns agent subprocess per call
    ▼
VisionLearningPipeline
    ├── YOLO Pose Processor  (camera mode)
    ├── ScreenShareProcessor (screen share mode)
    ├── VideoLLM             (Claude / Gemini / NVIDIA)
    ├── Deepgram STT         (voice input + echo guard)
    └── Deepgram TTS         (voice output)
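For intuition on the token server's signing step, here is a stdlib-only HS256 JWT sketch. The function and claim names are illustrative; in practice the Stream server SDK mints call tokens for you:

```python
import base64
import hashlib
import hmac
import json
import time


def b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWT requires."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def sign_call_token(api_secret: str, user_id: str, ttl: int = 3600) -> str:
    """Mint an HS256 JWT for a call participant (illustrative claims)."""
    header = {"alg": "HS256", "typ": "JWT"}
    payload = {"user_id": user_id, "exp": int(time.time()) + ttl}
    signing_input = f"{b64url(json.dumps(header).encode())}.{b64url(json.dumps(payload).encode())}"
    sig = hmac.new(api_secret.encode(), signing_input.encode(), hashlib.sha256).digest()
    return f"{signing_input}.{b64url(sig)}"
```

The browser exchanges nothing secret: it asks `token_server.py` for a signed token, and the same request triggers the per-call agent subprocess spawn.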

Technologies Used

  • Agent framework: Vision Agents SDK by GetStream
  • VLM: Claude 3.5 Sonnet via OpenRouter (recommended), Gemini 2.0 Flash, NVIDIA Llama 3.2 Vision
  • Pose estimation: YOLOv8 Nano Pose
  • Speech-to-text: Deepgram Flux General English
  • Text-to-speech: Deepgram Aura 2 Thalia
  • Video transport: GetStream Video + WebRTC
  • Frontend: React 18 + TypeScript + Vite 5
  • Frontend SDK: @stream-io/video-react-sdk
  • Backend: Python 3.11 with uv package management

The Tricky Parts

1. Getting video frames to actually reach the LLM

This was the biggest silent failure. The Vision Agents SDK only wires video tracks to classes that extend VideoLLM. The SDK's built-in openrouter.LLM extends OpenAILLM — not VideoLLM — so it never received a single frame. The agent was generating responses with zero visual input.

The fix: use openai.ChatCompletionsVLM pointed at OpenRouter's base URL. This class buffers frames, encodes them as base64 JPEG, and sends them as image_url content parts with every request.

# ❌ text-only — never receives video frames
from vision_agents.plugins import openrouter
llm = openrouter.LLM(model="claude-3.5-sonnet")

# ✅ VideoLLM — frames wired automatically by SDK
from vision_agents.plugins import openai as openai_plugin
llm = openai_plugin.ChatCompletionsVLM(
    model="anthropic/claude-3.5-sonnet",
    base_url="https://openrouter.ai/api/v1",
    api_key=openrouter_api_key,
)
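The buffering-and-encoding step described above boils down to building OpenAI-style `image_url` content parts with data URLs. A minimal sketch of that payload shape (the helper names are mine, not the SDK's):

```python
import base64


def frame_to_image_part(jpeg_bytes: bytes) -> dict:
    """Wrap one JPEG-encoded frame as an OpenAI-style image_url content part."""
    b64 = base64.b64encode(jpeg_bytes).decode()
    return {
        "type": "image_url",
        "image_url": {"url": f"data:image/jpeg;base64,{b64}"},
    }


def build_vision_message(prompt: str, frames: list[bytes]) -> dict:
    # One user message carrying the text prompt plus every buffered frame.
    return {
        "role": "user",
        "content": [{"type": "text", "text": prompt}]
        + [frame_to_image_part(f) for f in frames],
    }
```

Because OpenRouter speaks the OpenAI chat-completions dialect, the same message shape works whether the model behind it is Claude, Gemini, or NVIDIA's.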

2. The echo feedback loop

When the agent speaks through TTS, the microphone picks up that audio and feeds it back into STT — the agent hears itself and responds to itself, endlessly.

The echo guard compares incoming transcripts against recently spoken text using Python's SequenceMatcher. The tricky part: STT garbles TTS output slightly ("npm" becomes "n p m", "variable" becomes "vary able"), so exact match or high-threshold fuzzy match both fail. Tuning the similarity threshold down to 0.55 and the history window to 30 seconds made it reliable.
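A condensed sketch of that guard, using the same SequenceMatcher approach and the tuned 0.55 threshold and 30-second window (the class name is mine):

```python
import time
from difflib import SequenceMatcher


class EchoGuard:
    """Drops STT transcripts that fuzzy-match the agent's own recent TTS output."""

    def __init__(self, threshold: float = 0.55, window_s: float = 30.0) -> None:
        self.threshold = threshold
        self.window_s = window_s
        self._spoken: list[tuple[float, str]] = []  # (timestamp, lowered text)

    def record_tts(self, text: str) -> None:
        self._spoken.append((time.monotonic(), text.lower()))

    def is_echo(self, transcript: str) -> bool:
        now = time.monotonic()
        # Only compare against utterances inside the history window.
        self._spoken = [(t, s) for t, s in self._spoken if now - t < self.window_s]
        transcript = transcript.lower()
        return any(
            SequenceMatcher(None, transcript, s).ratio() >= self.threshold
            for _, s in self._spoken
        )
```

The low threshold is the point: an STT-garbled echo like "n p m" still lands well above 0.55 against the original "npm" sentence, while genuinely new user speech stays below it.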

3. Mid-sentence interruptions

Eager turn detection in Deepgram would fire a "user finished speaking" event mid-sentence if there was a half-second pause. This caused the agent to respond to incomplete thoughts. Disabling eager_turn_detection fixed the fragmented conversations entirely.

4. Keeping the agent quiet

Without careful prompt engineering, the agent narrates everything it sees — constantly. The prompts now have explicit CRITICAL RULES: never narrate, never repeat, maximum 1-2 sentences, stay silent unless asked a question or a real error is spotted. The proactive feedback loop also uses exponential back-off, doubling the interval after consecutive "no feedback needed" responses.
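The back-off half of that can be sketched in a few lines (interval values are illustrative and the class name is mine):

```python
class FeedbackScheduler:
    """Doubles the check interval while the VLM keeps answering 'no feedback needed'."""

    def __init__(self, base_s: float = 5.0, max_s: float = 60.0) -> None:
        self.base_s = base_s
        self.max_s = max_s
        self.interval_s = base_s

    def on_no_feedback(self) -> float:
        # Consecutive quiet checks push the next check further out, capped.
        self.interval_s = min(self.interval_s * 2, self.max_s)
        return self.interval_s

    def on_feedback_given(self) -> float:
        # Something was worth saying, so resume frequent checks.
        self.interval_s = self.base_s
        return self.interval_s
```

Resetting on real feedback matters: right after the agent flags an error is exactly when the user's screen is most likely to change again.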

Running It Yourself (see the docs on GitHub for full details)

git clone https://github.com/Arjunhg/onevision.git
cd onevision/project

# Install Python deps
uv sync

# Install frontend deps
cd frontend && npm install && cd ..

# Set up .env with your API keys
# (see README for the full template)

# Terminal 1 — token server + agent manager
uv run python token_server.py

# Terminal 2 — frontend
cd frontend && npm run dev

Full setup guide in the README.
