DEV Community

Arjun Sharma

Building a Vision AI That Sees Your Code and Talks Back

This is a submission for the DEV Weekend Challenge: Community

The Community

This project is built for learners who study and build alone — developers who are grinding through tutorials late at night, students debugging code without a mentor nearby, and self-taught engineers who don't always have someone to ask "hey, does this look right?"

The dev community thrives on pair programming, code reviews, and the kind of feedback you get when someone's looking over your shoulder. But most people don't have that luxury. OneVision brings that real-time, eyes-on guidance to anyone with a browser — making the AI their always-available coding companion.


What I Built

OneVision is a real-time AI learning assistant that watches your camera or screen share, listens to your voice, and speaks back natural guidance — like a senior developer sitting next to you.

You join a video call from your browser, share your screen (or just look at the camera), and the AI agent:

  • Analyzes your screen — spots IDE errors, reads your code, understands what you're working on
  • Speaks feedback out loud — no typing, no copy-pasting: it just talks to you
  • Listens to your questions — ask "why is this failing?" and get a spoken answer
  • Watches your posture and hand position via camera when you're not screen-sharing (using YOLO pose estimation)
  • Switches modes automatically — screen share active? It focuses on your screen. You close it? Back to camera mode.
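The automatic mode switching in the last bullet can be sketched as a tiny two-state model. This is a minimal illustration, not the SDK's actual wiring; the class and method names here are mine:

```python
from enum import Enum, auto


class Mode(Enum):
    CAMERA = auto()
    SCREEN = auto()


class ModeSwitcher:
    """Tracks which video source the agent should be analyzing."""

    def __init__(self) -> None:
        self.mode = Mode.CAMERA

    def on_tracks_changed(self, screen_share_active: bool) -> Mode:
        # Screen share takes priority over the camera; when the user
        # stops sharing, fall back to camera (pose) analysis.
        self.mode = Mode.SCREEN if screen_share_active else Mode.CAMERA
        return self.mode
```

In the real agent the equivalent decision runs on track added/removed events from the call, so the user never has to toggle anything.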

It's designed to feel like a conversation, not a chatbot. The agent stays quiet when everything looks fine and only speaks up when it spots something worth flagging — or when you ask.

Key Features

  • Browser-based UI — no CLI setup required; just open the link and go (developers can still run everything locally from the CLI)
  • Auto mode switching — agent seamlessly transitions between camera and screen share analysis
  • Echo suppression — fuzzy-match guard prevents the agent's own voice from looping back into its "ears"
  • Proactive but not annoying — exponential back-off between feedback checks keeps the agent from talking over you
  • Three VLM providers — OpenRouter (Claude), Gemini, NVIDIA — all swappable via environment variable

Demo

Live App (the CLI setup below is the recommended path, but for a quick demo try the hosted version): vision-deploy-wine.vercel.app

Video Walkthrough:


Code

OneVision — Real-Time Multimodal Learning Assistant

An AI tutor that watches your camera or screen share in real time, listens to your voice and speaks back actionable guidance — creating an active feedback loop for hands-on learning.




What It Does

Phase 1 — Camera Coaching

  • Watches live video from a call participant via camera.
  • Runs YOLO pose estimation on incoming video frames to detect body posture and hand positions.
  • Sends buffered frames to a Vision Language Model (VLM) to reason about what the user is doing.
  • Provides spoken guidance using Deepgram TTS.
  • Accepts voice questions through Deepgram STT.
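As an illustration of what the pose step can feed the agent, here is a small heuristic over YOLO-style keypoints. The keypoint indices follow the COCO convention that YOLOv8 pose models output; the threshold and function name are illustrative, not the project's actual logic:

```python
# COCO keypoint indices, as emitted by YOLOv8 pose models
NOSE, L_SHOULDER, R_SHOULDER = 0, 5, 6


def is_slouching(keypoints: list[tuple[float, float]], max_drop: float = 40.0) -> bool:
    """Flag a slouch when the nose drops close to shoulder height.

    `keypoints` holds one person's (x, y) pixel coordinates; `max_drop`
    is an illustrative threshold in pixels (image y grows downward).
    """
    nose_y = keypoints[NOSE][1]
    shoulder_y = (keypoints[L_SHOULDER][1] + keypoints[R_SHOULDER][1]) / 2
    return (shoulder_y - nose_y) < max_drop
```

A signal like this is cheap to compute per frame, so it can run continuously while the heavier VLM call happens only on buffered batches.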

Phase 2 — Screen Share Analysis

  • User shares their screen (IDE, terminal, circuit tool, CAD, Figma) instead of camera.
  • A dedicated ScreenShareProcessor takes over frame capture, routing screen frames (instead of camera frames) to the VLM for analysis.

How I Built It

Architecture Overview

Browser (React + Stream SDK)
    │
    ▼
token_server.py  (JWT signing + agent process manager)
    │  spawns agent subprocess per call
    ▼
VisionLearningPipeline
    ├── YOLO Pose Processor  (camera mode)
    ├── ScreenShareProcessor (screen share mode)
    ├── VideoLLM             (Claude / Gemini / NVIDIA)
    ├── Deepgram STT         (voice input + echo guard)
    └── Deepgram TTS         (voice output)
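For intuition on the token server's signing step, here is a stdlib-only HS256 JWT sketch. The function and claim names are illustrative; in practice the Stream server SDK mints call tokens for you:

```python
import base64
import hashlib
import hmac
import json
import time


def b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWT requires."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def sign_call_token(api_secret: str, user_id: str, ttl: int = 3600) -> str:
    """Mint an HS256 JWT for a call participant (illustrative claims)."""
    header = {"alg": "HS256", "typ": "JWT"}
    payload = {"user_id": user_id, "exp": int(time.time()) + ttl}
    signing_input = f"{b64url(json.dumps(header).encode())}.{b64url(json.dumps(payload).encode())}"
    sig = hmac.new(api_secret.encode(), signing_input.encode(), hashlib.sha256).digest()
    return f"{signing_input}.{b64url(sig)}"
```

The browser exchanges nothing secret: it asks `token_server.py` for a signed token, and the same request triggers the per-call agent subprocess spawn.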

Technologies Used

  • Agent framework: Vision Agents SDK by GetStream
  • VLM: Claude 3.5 Sonnet via OpenRouter (recommended), Gemini 2.0 Flash, NVIDIA Llama 3.2 Vision
  • Pose estimation: YOLOv8 Nano Pose
  • Speech-to-text: Deepgram Flux General English
  • Text-to-speech: Deepgram Aura 2 Thalia
  • Video transport: GetStream Video + WebRTC
  • Frontend: React 18 + TypeScript + Vite 5
  • Frontend SDK: @stream-io/video-react-sdk
  • Backend: Python 3.11 with uv package management

The Tricky Parts

1. Getting video frames to actually reach the LLM

This was the biggest silent failure. The Vision Agents SDK only wires video tracks to classes that extend VideoLLM. The SDK's built-in openrouter.LLM extends OpenAILLM — not VideoLLM — so it never received a single frame. The agent was generating responses with zero visual input.

The fix: use openai.ChatCompletionsVLM pointed at OpenRouter's base URL. This class buffers frames, encodes them as base64 JPEG, and sends them as image_url content parts with every request.

# ❌ text-only — never receives video frames
from vision_agents.plugins import openrouter
llm = openrouter.LLM(model="claude-3.5-sonnet")

# ✅ VideoLLM — frames wired automatically by SDK
from vision_agents.plugins import openai as openai_plugin
llm = openai_plugin.ChatCompletionsVLM(
    model="anthropic/claude-3.5-sonnet",
    base_url="https://openrouter.ai/api/v1",
    api_key=openrouter_api_key,
)
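The buffering-and-encoding step described above boils down to building OpenAI-style `image_url` content parts with data URLs. A minimal sketch of that payload shape (the helper names are mine, not the SDK's):

```python
import base64


def frame_to_image_part(jpeg_bytes: bytes) -> dict:
    """Wrap one JPEG-encoded frame as an OpenAI-style image_url content part."""
    b64 = base64.b64encode(jpeg_bytes).decode()
    return {
        "type": "image_url",
        "image_url": {"url": f"data:image/jpeg;base64,{b64}"},
    }


def build_vision_message(prompt: str, frames: list[bytes]) -> dict:
    # One user message carrying the text prompt plus every buffered frame.
    return {
        "role": "user",
        "content": [{"type": "text", "text": prompt}]
        + [frame_to_image_part(f) for f in frames],
    }
```

Because OpenRouter speaks the OpenAI chat-completions dialect, the same message shape works whether the model behind it is Claude, Gemini, or NVIDIA's.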

2. The echo feedback loop

When the agent speaks through TTS, the microphone picks up that audio and feeds it back into STT — the agent hears itself and responds to itself, endlessly.

The echo guard compares incoming transcripts against recently spoken text using Python's SequenceMatcher. The tricky part: STT garbles TTS output slightly ("npm" becomes "n p m", "variable" becomes "vary able"), so exact match or high-threshold fuzzy match both fail. Tuning the similarity threshold down to 0.55 and the history window to 30 seconds made it reliable.
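A condensed sketch of that guard, using the same SequenceMatcher approach and the tuned 0.55 threshold and 30-second window (the class name is mine):

```python
import time
from difflib import SequenceMatcher


class EchoGuard:
    """Drops STT transcripts that fuzzy-match the agent's own recent TTS output."""

    def __init__(self, threshold: float = 0.55, window_s: float = 30.0) -> None:
        self.threshold = threshold
        self.window_s = window_s
        self._spoken: list[tuple[float, str]] = []  # (timestamp, lowered text)

    def record_tts(self, text: str) -> None:
        self._spoken.append((time.monotonic(), text.lower()))

    def is_echo(self, transcript: str) -> bool:
        now = time.monotonic()
        # Only compare against utterances inside the history window.
        self._spoken = [(t, s) for t, s in self._spoken if now - t < self.window_s]
        transcript = transcript.lower()
        return any(
            SequenceMatcher(None, transcript, s).ratio() >= self.threshold
            for _, s in self._spoken
        )
```

The low threshold is the point: an STT-garbled echo like "n p m" still lands well above 0.55 against the original "npm" sentence, while genuinely new user speech stays below it.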

3. Mid-sentence interruptions

Eager turn detection in Deepgram would fire a "user finished speaking" event mid-sentence if there was a half-second pause. This caused the agent to respond to incomplete thoughts. Disabling eager_turn_detection fixed the fragmented conversations entirely.

4. Keeping the agent quiet

Without careful prompt engineering, the agent narrates everything it sees — constantly. The prompts now have explicit CRITICAL RULES: never narrate, never repeat, maximum 1-2 sentences, stay silent unless asked a question or a real error is spotted. The proactive feedback loop also uses exponential back-off, doubling the interval after consecutive "no feedback needed" responses.
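The back-off half of that can be sketched in a few lines (interval values are illustrative and the class name is mine):

```python
class FeedbackScheduler:
    """Doubles the check interval while the VLM keeps answering 'no feedback needed'."""

    def __init__(self, base_s: float = 5.0, max_s: float = 60.0) -> None:
        self.base_s = base_s
        self.max_s = max_s
        self.interval_s = base_s

    def on_no_feedback(self) -> float:
        # Consecutive quiet checks push the next check further out, capped.
        self.interval_s = min(self.interval_s * 2, self.max_s)
        return self.interval_s

    def on_feedback_given(self) -> float:
        # Something was worth saying, so resume frequent checks.
        self.interval_s = self.base_s
        return self.interval_s
```

Resetting on real feedback matters: right after the agent flags an error is exactly when the user's screen is most likely to change again.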

Running It Yourself (see the docs on GitHub for full details)

git clone https://github.com/Arjunhg/onevision.git
cd onevision/project

# Install Python deps
uv sync

# Install frontend deps
cd frontend && npm install && cd ..

# Set up .env with your API keys
# (see README for the full template)

# Terminal 1 — token server + agent manager
uv run python token_server.py

# Terminal 2 — frontend
cd frontend && npm run dev

Full setup guide in the README.
