DEV Community

BEDANTA CHATTERJEE

Vision Agent — Real-Time Multimodal AI with YOLO, STT & LLM Cascade

I built Vision Agent for the Vision Possible hackathon — a real-time multimodal AI platform that watches live video, transcribes audio, detects objects & human poses, and responds using a multi-tier LLM cascade.

This isn’t a static demo.
It’s a production-style, extensible system.

🚀 What It Does

Live webcam streaming (chunked WebM ingestion)

YOLOv8 object detection + pose estimation

Real-time rep counting & posture correction

Speech-to-text processing

Fast deterministic response (instant reply)

Polished long-form LLM response (async cascade)

Tool/function registry for structured actions

SSE-based live UI updates

Deployment-ready with Docker configs

Demo: https://youtube.com/shorts/2F8jyKPJwTs?feature=share

Repo: https://github.com/rupac4530-creator/vision-agent

🧠 Architecture Overview

Browser
→ 1–2s WebM chunks
→ FastAPI backend
→ Frame extraction
→ YOLOv8 vision pipeline
→ STT transcription
→ FastReply (deterministic, instant)
→ LLM cascade (quality escalation)
→ SSE responses to UI
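To make the vision → LLM handoff concrete, here is a hypothetical helper that collapses per-frame YOLO detections into a one-line scene summary suitable for an LLM prompt. The dict shape and confidence cutoff are assumptions for illustration, not the repo's actual data model.

```python
from collections import Counter

def summarize_detections(detections, min_conf=0.5):
    """Collapse per-frame detections into a one-line scene description.

    Expects dicts like {"label": "person", "conf": 0.9} (an assumed shape).
    """
    counts = Counter(d["label"] for d in detections if d.get("conf", 0) >= min_conf)
    if not counts:
        return "No confident detections."
    parts = [f"{n} {label}{'s' if n > 1 else ''}" for label, n in counts.most_common()]
    return "Scene contains: " + ", ".join(parts) + "."
```

Feeding a compact summary like this, instead of raw bounding boxes, keeps the LLM prompt short and cheap.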

The separation between FastReply and PolishReply ensures:

Instant UX

Higher-quality reasoning

Cost control

Provider fallback reliability
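The two-tier split can be sketched roughly like this (a minimal illustration, not the repo's actual code): answer instantly from deterministic state, then emit a slower refined reply when it lands.

```python
import asyncio

def fast_reply(state):
    # Deterministic template reply built from current pipeline state.
    return f"Reps so far: {state['reps']}."

async def polish_reply(state):
    # Stand-in for a slower LLM call; a real provider client goes here.
    await asyncio.sleep(0.01)
    return f"Great form! You've completed {state['reps']} reps; keep your back straight."

async def respond(state, emit):
    emit(fast_reply(state))           # instant UX: reply before any LLM call
    emit(await polish_reply(state))   # higher-quality follow-up arrives later
```

The UI renders the first event immediately and replaces it when the polished version arrives.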

⚙️ Key Engineering Decisions
Streaming Approach

Used MediaRecorder chunk uploads instead of WebRTC for simplicity and cross-browser compatibility. This trades a small latency increase for reliability during prototyping.
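One subtlety with MediaRecorder streams: only the first chunk carries the WebM header, so later chunks must be appended in order to stay decodable. A server-side buffer along these lines (an assumed sketch, not the repo's implementation) keeps continuity per session:

```python
from collections import defaultdict

class ChunkBuffer:
    """Accumulates WebM chunks per session so frame extraction sees a valid stream."""

    def __init__(self):
        self._sessions = defaultdict(bytearray)

    def append(self, session_id: str, chunk: bytes) -> int:
        """Append a chunk in arrival order; return total bytes buffered."""
        self._sessions[session_id].extend(chunk)
        return len(self._sessions[session_id])

    def drain(self, session_id: str) -> bytes:
        """Hand the accumulated stream to the frame extractor and reset."""
        return bytes(self._sessions.pop(session_id, b""))
```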

LLM Cascade Strategy

Rather than a single provider:

Fast, low-cost model replies first

Higher-tier model refines answers asynchronously

Automatic fallback on timeouts / quota errors

Provider health metrics and auto-fallback logic
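The fallback idea reduces to trying providers in priority order and treating a timeout or error as the signal to escalate. A hedged sketch (provider call signatures are placeholders):

```python
import asyncio

async def call_with_fallback(providers, prompt, timeout=5.0):
    """providers: list of (name, async_callable) in priority order."""
    errors = {}
    for name, call in providers:
        try:
            return name, await asyncio.wait_for(call(prompt), timeout)
        except (asyncio.TimeoutError, RuntimeError) as exc:
            errors[name] = exc   # record failure, escalate to the next tier
    raise RuntimeError(f"All providers failed: {list(errors)}")
```

The per-provider error log doubles as a cheap health metric for deciding which provider to try first next time.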

Pose Counting Logic

Implemented robust rep counting using:

Joint angle thresholds

State-machine transitions (down → up → down)

Hysteresis / cooldowns to avoid duplicate counts

Per-exercise detectors (squat, pushup, curl)
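The state machine above can be sketched in a few lines. The angle thresholds are illustrative (real values are tuned per exercise); the gap between them is the hysteresis band that suppresses duplicate counts from jittery keypoints.

```python
class RepCounter:
    DOWN_ANGLE = 90   # joint angle (degrees) that counts as fully "down"
    UP_ANGLE = 160    # angle that counts as "up"; the gap is the hysteresis band

    def __init__(self):
        self.state = "up"
        self.reps = 0

    def update(self, angle: float) -> int:
        if self.state == "up" and angle < self.DOWN_ANGLE:
            self.state = "down"
        elif self.state == "down" and angle > self.UP_ANGLE:
            self.state = "up"
            self.reps += 1   # one full down -> up cycle = one rep
        return self.reps
```

Angles in the 90–160° band leave the state unchanged, so a squat that wobbles mid-range never double-counts.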

🧪 Practical Use Cases

AI fitness coach (real-time posture feedback)

Security monitoring assistant (live alerts)

Accessibility tool (scene descriptions + captions)

Smart classroom assistant (lecture summarization)

Live gaming companion / strategy hints

🛠 Tech Stack

Python + FastAPI (backend)

Ultralytics YOLOv8 (vision)

Whisper-style STT (transcription)

Multi-provider LLM cascade (Gemini / OpenAI / others)

SSE for real-time UI updates

Docker + Render / Railway deployment configs
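For the SSE piece, the wire format is simple enough to show directly. The event names here are illustrative; the actual UI contract may differ.

```python
import json

def format_sse(event: str, data: dict) -> str:
    """Serialize one Server-Sent Event frame for an EventSource consumer."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"
```

In FastAPI, yielding these frames from an async generator wrapped in a streaming response gives the browser live updates over a single long-lived request.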

🏗 Lessons Learned

Streaming video pipelines are complex — chunk continuity matters.

Deployment build limits (e.g., PyTorch wheel size) require careful dependency choices.

Design for failure: timeouts, retries, and graceful fallbacks are essential.

UX speed (instant replies) beats slow, perfect responses for interactive apps.

Built for the Vision Possible hackathon by WeMakeDevs.
Inspired by Vision-Agents and real-time ideas from Stream.

Open source: https://github.com/rupac4530-creator/vision-agent

Feedback, issues, and contributions welcome — drop a PR or open an issue.
