I built Vision Agent for the Vision Possible hackathon — a real-time multimodal AI platform that watches live video, transcribes audio, detects objects & human poses, and responds using a multi-tier LLM cascade.
This isn’t a static demo.
It’s a production-style, extensible system.
🚀 What It Does
Live webcam streaming (chunked WebM ingestion)
YOLOv8 object detection + pose estimation
Real-time rep counting & posture correction
Speech-to-text processing
Fast deterministic response (instant reply)
Polished long-form LLM response (async cascade)
Tool/function registry for structured actions
SSE-based live UI updates
Deployment-ready with Docker configs
Demo: https://youtube.com/shorts/2F8jyKPJwTs?feature=share
Repo: https://github.com/rupac4530-creator/vision-agent
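The tool/function registry from the feature list can be sketched as a simple decorator-based lookup. This is a minimal illustration, not the repo's actual API — the names (`tool`, `dispatch`, `count_reps`) are hypothetical:

```python
# Hypothetical sketch of a tool/function registry for structured actions.
from typing import Callable, Dict

TOOLS: Dict[str, Callable] = {}

def tool(name: str):
    """Register a callable so the agent can invoke it by name."""
    def decorator(fn: Callable) -> Callable:
        TOOLS[name] = fn
        return fn
    return decorator

@tool("count_reps")
def count_reps(exercise: str) -> dict:
    # In the real system this would query the pose pipeline's state.
    return {"exercise": exercise, "reps": 0}

def dispatch(name: str, **kwargs):
    """Look up and call a registered tool; fail loudly if unknown."""
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    return TOOLS[name](**kwargs)
```

The registry pattern keeps LLM "function calls" decoupled from their implementations: the model only ever sees tool names and schemas, and the dispatcher validates before executing.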
🧠 Architecture Overview
Browser
→ 1–2s WebM chunks
→ FastAPI backend
→ Frame extraction
→ YOLOv8 vision pipeline
→ STT transcription
→ FastReply (deterministic, instant)
→ LLM cascade (quality escalation)
→ SSE responses to UI
Separating the instant, deterministic FastReply from the asynchronous PolishReply (the cascade's refined answer) delivers:
Instant UX
Higher-quality reasoning
Cost control
Provider fallback reliability
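The FastReply/PolishReply split can be sketched with asyncio: the deterministic reply is yielded as an SSE frame immediately, and the polished reply follows when the cascade finishes. This is an assumed structure, not the repo's exact code; the event names and helper functions are illustrative:

```python
# Sketch of the FastReply / PolishReply split over SSE (assumed design).
import asyncio

def sse_event(event: str, data: str) -> str:
    """Format a Server-Sent Events frame."""
    return f"event: {event}\ndata: {data}\n\n"

def fast_reply(transcript: str, detections: list) -> str:
    # Deterministic template: instant, no LLM call involved.
    objs = ", ".join(detections) or "nothing yet"
    return f"I can see: {objs}. You said: {transcript!r}"

async def polish_reply(transcript: str, detections: list) -> str:
    # Stand-in for the async LLM cascade call.
    await asyncio.sleep(0)  # simulates provider latency
    return f"Detailed analysis of {transcript!r} with {len(detections)} detections."

async def respond(transcript: str, detections: list):
    """Yield the instant reply first, then the polished one."""
    yield sse_event("fast", fast_reply(transcript, detections))
    yield sse_event("polish", await polish_reply(transcript, detections))
```

Because the browser receives two distinct SSE event types, the UI can render the fast reply immediately and swap in the polished answer in place when it arrives.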
⚙️ Key Engineering Decisions
Streaming Approach
Used MediaRecorder chunk uploads instead of WebRTC for simplicity and cross-browser compatibility. This trades a small latency increase for reliability during prototyping.
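One subtlety of this approach: MediaRecorder only emits the WebM container header in the first chunk, so later chunks are only decodable when appended, in order, to the session's running byte stream. A minimal server-side sketch of that buffering (assumed design, not the repo's actual handler):

```python
# Sketch of server-side chunk handling: append each MediaRecorder chunk
# to a per-session WebM file so the container stays decodable.
from pathlib import Path

def append_chunk(session_dir: Path, session_id: str, chunk: bytes) -> Path:
    """Append one uploaded chunk to the session's growing WebM file."""
    session_dir.mkdir(parents=True, exist_ok=True)
    path = session_dir / f"{session_id}.webm"
    with path.open("ab") as f:  # append, never overwrite: chunk order matters
        f.write(chunk)
    return path
```

Frames can then be extracted from the growing file with ffmpeg or OpenCV; dropping or reordering a single chunk corrupts everything after it, which is the "chunk continuity" pitfall mentioned in the lessons below.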
LLM Cascade Strategy
Rather than a single provider:
Fast, low-cost model replies first
Higher-tier model refines answers asynchronously
Automatic fallback on timeouts / quota errors
Provider health metrics and auto-fallback logic
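The fallback logic boils down to trying providers in priority order and catching recoverable errors. A minimal sketch, with illustrative provider names and a single stand-in error type:

```python
# Sketch of the provider cascade with timeout/quota fallback
# (provider names and the error type are illustrative).
class ProviderError(Exception):
    """Stand-in for a timeout, quota exhaustion, or 5xx from a provider."""

def call_with_fallback(providers, prompt: str) -> tuple:
    """Try (name, call) pairs in order; return (provider_name, answer)."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except ProviderError as e:
            errors.append((name, str(e)))  # record and fall through to next
    raise RuntimeError(f"all providers failed: {errors}")
```

In the full system this is where per-provider health metrics plug in: a provider that keeps failing can be demoted in the ordering instead of being retried first every time.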
Pose Counting Logic
Implemented robust rep counting using:
Joint angle thresholds
State-machine transitions (down → up → down)
Hysteresis / cooldowns to avoid duplicate counts
Per-exercise detectors (squat, pushup, curl)
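The pieces above compose into a small state machine per exercise: compute a joint angle from three keypoints, then count a rep only on a full down-to-up transition, with a dead zone between thresholds to absorb jitter. A minimal sketch (thresholds are illustrative; the real system tunes them per exercise):

```python
# Sketch of angle-threshold rep counting with hysteresis.
import math

def joint_angle(a, b, c) -> float:
    """Angle at b (degrees) formed by points a-b-c, e.g. hip-knee-ankle."""
    ang = math.degrees(
        math.atan2(c[1] - b[1], c[0] - b[0])
        - math.atan2(a[1] - b[1], a[0] - b[0])
    )
    ang = abs(ang)
    return 360 - ang if ang > 180 else ang

class RepCounter:
    """down -> up -> down state machine with hysteresis bands."""
    def __init__(self, down_thresh=100.0, up_thresh=160.0):
        self.down_thresh = down_thresh  # fully flexed below this angle
        self.up_thresh = up_thresh      # fully extended above this angle
        self.state = "up"
        self.reps = 0

    def update(self, angle: float) -> int:
        # The dead zone between the two thresholds absorbs pose jitter,
        # so noise near one boundary can't produce duplicate counts.
        if self.state == "up" and angle < self.down_thresh:
            self.state = "down"
        elif self.state == "down" and angle > self.up_thresh:
            self.state = "up"
            self.reps += 1  # count one full down -> up cycle
        return self.reps
```

Per-exercise detectors then differ only in which keypoint triple feeds `joint_angle` (knee for squats, elbow for curls) and the threshold pair.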
🧪 Practical Use Cases
AI fitness coach (real-time posture feedback)
Security monitoring assistant (live alerts)
Accessibility tool (scene descriptions + captions)
Smart classroom assistant (lecture summarization)
Live gaming companion / strategy hints
🛠 Tech Stack
Python + FastAPI (backend)
Ultralytics YOLOv8 (vision)
Whisper-style STT (transcription)
Multi-provider LLM cascade (Gemini / OpenAI / others)
SSE for real-time UI updates
Docker + Render / Railway deployment configs
🏗 Lessons Learned
Streaming video pipelines are complex — chunk continuity matters.
Deployment build limits (e.g., PyTorch image size) require careful dependency choices.
Design for failure: timeouts, retries, and graceful fallbacks are essential.
UX speed (instant replies) beats slow perfect responses for interactive apps.
Built for the Vision Possible hackathon by WeMakeDevs.
Inspired by Vision-Agents and realtime ideas from Stream.
Open source: https://github.com/rupac4530-creator/vision-agent
Feedback, issues, and contributions welcome — drop a PR or open an issue.













