How I built a real-time AI agent that watches, listens, understands, and adapts during interview preparation.
## 🚀 Introduction

Interview preparation tools today are mostly static. They either:

- provide a fixed list of questions,
- simulate a chatbot conversation, or
- give post-session feedback without real-time intelligence.

What they lack is realism. Real interviews are dynamic. They involve:

- Body language
- Eye contact
- Confidence
- Content structure
- Adaptive questioning
For the Vision Possible: Agent Protocol Hackathon, I built InterviewOS, a real-time AI Communication Coach powered by Vision Agents.

InterviewOS is not just a chatbot. It is a multi-modal AI agent that:

- **Watches** (live webcam analysis)
- **Listens** (speech recognition)
- **Understands** (behavior + content evaluation)
- **Adapts** (progressive question generation)
- **Evaluates** (structured scoring + final report)
## 🎯 The Core Idea

The goal was to simulate a realistic interview preparation experience:

1. The AI asks a question (via voice).
2. The user answers verbally.
3. The AI analyzes posture and eye contact in real time.
4. The AI transcribes and evaluates the answer.
5. The user manually proceeds to the next question.
6. At the end, the AI generates a detailed performance report.

This required true multi-modal intelligence, and that is where the Vision Agents SDK by Stream became the foundation.
## 🧩 Architecture Overview

High-level flow:

- Webcam → Vision Agents → Behavior Analyzer → Metrics
- Microphone → Speech-to-Text → Answer Evaluator
- Metrics + Content Score → Adaptive Engine → Next Question
- End Session → Final Performance Report
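Under the hood this is a producer/consumer pipeline. The sketch below models the webcam-to-metrics leg with asyncio queues; every name here (`webcam`, `analyzer`) is an illustrative stand-in, not the actual InterviewOS or Vision Agents API:

```python
# Minimal asyncio sketch of the streaming pipeline: frames flow from a
# producer (webcam stand-in) through an analyzer into a metrics queue.
import asyncio

async def webcam(frames_out: asyncio.Queue, n_frames: int) -> None:
    for i in range(n_frames):
        await frames_out.put({"frame_id": i})   # stand-in for a video frame
    await frames_out.put(None)                  # end-of-stream sentinel

async def analyzer(frames_in: asyncio.Queue, metrics_out: asyncio.Queue) -> None:
    while (frame := await frames_in.get()) is not None:
        # Real code would run YOLO Pose here; we emit a dummy metric.
        await metrics_out.put({"frame_id": frame["frame_id"], "posture": 0.8})
    await metrics_out.put(None)

async def main() -> list:
    frames, metrics = asyncio.Queue(), asyncio.Queue()
    results = []

    async def collector():
        while (m := await metrics.get()) is not None:
            results.append(m)

    await asyncio.gather(webcam(frames, 3), analyzer(frames, metrics), collector())
    return results

collected = asyncio.run(main())
```

Each stage only awaits its input queue, so stages run concurrently and the sentinel cleanly shuts the pipeline down.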
### Tech Stack

**Frontend**

- Vanilla JavaScript
- Web Speech API (speech recognition)
- SpeechSynthesis API (AI voice interviewer)
- WebSocket client
- Chart.js (real-time metrics visualization)

**Backend**

- FastAPI
- Vision Agents SDK (Stream)
- YOLO Pose (behavior detection)
- GPT-4o (question generation + evaluation)
- Async WebSocket architecture
## 🔍 How I Used Vision Agents

The Vision Agents SDK is the core real-time video intelligence layer of InterviewOS.

### 1️⃣ Joining a Real-Time Stream Call

The agent joins a Stream video call and continuously receives live webcam frames. This gives:

- Low latency (under 30 ms of processing per frame)
- A scalable architecture
- A clean frame-streaming pipeline
### 2️⃣ Real-Time Behavioral Analysis

Each frame is analyzed with YOLO pose detection to extract structured metrics:

- Posture score
- Eye contact score
- Stability score
- Attention score

These metrics are streamed to the backend over WebSocket in real time.
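To make this concrete, here is a minimal sketch of how pose output can be turned into two of the metrics above. The keypoint layout, pixel thresholds, and function names are assumptions for illustration; the real analyzer works on full YOLO Pose keypoint output:

```python
# Illustrative scoring from 2D pose keypoints (x, y pixel coordinates).

def posture_score(keypoints: dict) -> float:
    """Score 0-1: how vertically aligned the shoulders are with the hips."""
    shoulder_mid_x = (keypoints["l_shoulder"][0] + keypoints["r_shoulder"][0]) / 2
    hip_mid_x = (keypoints["l_hip"][0] + keypoints["r_hip"][0]) / 2
    lean = abs(shoulder_mid_x - hip_mid_x)   # horizontal drift = leaning/slouching
    return max(0.0, 1.0 - lean / 50.0)       # assumed: 50 px of drift -> score 0

def eye_contact_score(keypoints: dict, frame_center_x: float = 320.0) -> float:
    """Score 0-1: how close the nose is to the horizontal frame center."""
    offset = abs(keypoints["nose"][0] - frame_center_x)
    return max(0.0, 1.0 - offset / frame_center_x)

kp = {"nose": (330, 100), "l_shoulder": (280, 200), "r_shoulder": (380, 200),
      "l_hip": (285, 350), "r_hip": (375, 350)}
metrics = {"posture": posture_score(kp), "eye_contact": eye_contact_score(kp)}
```

Per-frame scores like these are then smoothed over a window before being streamed, so a single noisy detection does not spike the chart.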
### 3️⃣ Multi-Modal Fusion

The behavioral metrics are combined with a GPT-based evaluation of the answer itself:

- Clarity score
- Structure score
- Depth score
- Relevance score

From this fusion, InterviewOS calculates:

- Confidence Prediction Score
- Communication Persona
- Selection Probability

Vision Agents made this real-time vision intelligence possible, something a static image API cannot do.
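A hedged sketch of what such a fusion step can look like; the weights, persona labels, and thresholds below are illustrative placeholders, not the production values:

```python
# Illustrative fusion of behavior metrics and content scores into the
# three derived outputs. Weights and thresholds are assumptions.

def fuse(behavior: dict, content: dict) -> dict:
    b = sum(behavior.values()) / len(behavior)   # posture, eye_contact, ...
    c = sum(content.values()) / len(content)     # clarity, structure, ...
    confidence = round(100 * (0.4 * b + 0.6 * c))
    persona = ("Confident Communicator" if confidence >= 75
               else "Developing Speaker" if confidence >= 50
               else "Needs Practice")
    return {
        "confidence": confidence,
        "persona": persona,
        "selection_probability": round(confidence * 0.9),  # simple proxy
    }

result = fuse({"posture": 0.8, "eye_contact": 0.9},
              {"clarity": 0.7, "structure": 0.8, "depth": 0.6, "relevance": 0.9})
```

Weighting content above behavior reflects the intuition that what a candidate says matters slightly more than how they sit, but the split is tunable.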
## 🎙️ Autonomous Yet Controlled Interview Flow

One of the biggest design decisions was avoiding full automation. Instead of continuously auto-generating questions, the system follows a controlled state machine:
IDLE
→ ASKING_QUESTION
→ LISTENING
→ TRANSCRIPT_REVIEW
→ WAITING_FOR_NEXT
→ EVALUATING
→ ASKING_QUESTION
Why? Because interview preparation requires:

- Transcript correction (speech recognition isn't perfect)
- User control
- Realistic pacing

After answering, the user can:

- Edit the transcript
- Re-record
- Click "Next Question"
- End the interview

This makes the system feel like a real mock interview, not a bot.
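The state machine above can be encoded as a plain transition table, so an out-of-order event can never advance the session. Event names here are illustrative:

```python
# Controlled state machine: transitions fire only on explicit events.
TRANSITIONS = {
    ("IDLE", "start"): "ASKING_QUESTION",
    ("ASKING_QUESTION", "question_spoken"): "LISTENING",
    ("LISTENING", "answer_finished"): "TRANSCRIPT_REVIEW",
    ("TRANSCRIPT_REVIEW", "transcript_confirmed"): "WAITING_FOR_NEXT",
    ("WAITING_FOR_NEXT", "next_clicked"): "EVALUATING",
    ("WAITING_FOR_NEXT", "end_clicked"): "REPORT",
    ("EVALUATING", "evaluation_done"): "ASKING_QUESTION",
}

def step(state: str, event: str) -> str:
    """Return the next state; an invalid event leaves the state unchanged."""
    return TRANSITIONS.get((state, event), state)

state = "IDLE"
for event in ("start", "question_spoken", "answer_finished",
              "transcript_confirmed", "next_clicked"):
    state = step(state, event)
# An out-of-order event is ignored instead of corrupting the session:
state_after_noise = step(state, "start")
```

Because unknown `(state, event)` pairs are no-ops, a stray or duplicated WebSocket message cannot auto-trigger the next question.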
## 🧠 Adaptive Question Generation

InterviewOS does not use a fixed question bank. Instead, it:

- Tracks the question history
- Tracks answer scores
- Adjusts difficulty progressively

The question-prompt logic:

- Beginner-level questions for low scores
- Moderate questions for mid-range scores
- Advanced follow-ups for high scores
- Never repeat a previous question

This prevents the common "Tell me about yourself" loop problem.
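One way to express this logic is a prompt builder that folds the history and the last score into the instruction sent to the model. The tier thresholds and wording below are my assumptions, not the exact production prompt:

```python
# Illustrative adaptive prompt builder for the GPT-4o question step.

def build_question_prompt(role: str, history: list, last_score: int) -> str:
    """Pick a difficulty tier from the last answer score and forbid repeats."""
    if last_score < 50:
        tier = "a beginner-level"
    elif last_score < 75:
        tier = "a moderate"
    else:
        tier = "an advanced follow-up"
    asked = "; ".join(history) if history else "none"
    return (
        f"You are interviewing a candidate for a {role} role. "
        f"Ask {tier} question. "
        f"Never repeat any of these previously asked questions: {asked}."
    )

prompt = build_question_prompt(
    "backend engineer",
    ["Tell me about yourself", "Describe a project you led"],
    82,
)
```

Keeping the full history in the prompt is what enforces the "never repeat" constraint on the model side.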
## 📊 Final Interview Report

When the user ends the interview, InterviewOS generates:

- Overall Score (0–100)
- Selection Probability
- Communication Persona
- Three strengths
- Three areas for improvement
- A behavioral feedback summary
- A content feedback summary

This transforms the tool from a demo into a preparation mentor.
## 🧠 Challenges Faced

### 1️⃣ State Machine Bugs

Early versions auto-triggered the next question or ended sessions prematurely.

Solution: strict event-based transitions tied only to user actions.

### 2️⃣ Question Repetition

GPT would repeat generic questions.

Solution: maintain the question history and enforce a "never repeat" constraint in the prompt.

### 3️⃣ Speech Recognition Errors

Browser speech APIs sometimes misinterpret words.

Solution: an editable transcript review panel before evaluation.

### 4️⃣ WebSocket Race Conditions

Sending multiple messages in quick succession caused evaluation issues.

Solution: combine the confirm and next-question logic into a single backend trigger.
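That single-trigger fix can be sketched as follows: the client sends one combined message instead of two racing ones, and the handler serializes evaluations with a lock. Message and field names are illustrative:

```python
# Illustrative fix for the race: one combined message, serialized handling.
import asyncio

class Session:
    def __init__(self):
        self.lock = asyncio.Lock()
        self.evaluated = []

    async def handle(self, message: dict) -> None:
        if message["type"] == "confirm_and_next":
            async with self.lock:   # one evaluation at a time per session
                self.evaluated.append(message["transcript"])
                # ...evaluate the answer, then generate the next question...

async def main() -> list:
    s = Session()
    # Even if the client fires messages back-to-back, handling is serialized.
    await asyncio.gather(
        s.handle({"type": "confirm_and_next", "transcript": "answer 1"}),
        s.handle({"type": "confirm_and_next", "transcript": "answer 2"}),
    )
    return s.evaluated

order = asyncio.run(main())
```

Collapsing "confirm transcript" and "next question" into one message removes the window in which the two could interleave.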
These debugging cycles significantly improved system stability.
## 🌍 Beyond Interviews

InterviewOS is not limited to interviews. The same architecture supports:

- Public speaking practice
- Event anchoring simulation
- Presentation coaching
- Debate training

Because Vision Agents handles the real-time video intelligence, the platform can scale across multiple communication domains.
## 🚀 What I Learned

This project gave me a deep, hands-on understanding of:

- Real-time AI system design
- State machine architecture
- WebSocket concurrency handling
- Multi-modal AI fusion
- Vision + LLM integration

Most importantly, I learned that building AI agents is not just about calling models; it is about designing intelligent, controlled systems.
## 🎯 Conclusion

InterviewOS demonstrates how Vision Agents can power real-time, multi-modal AI systems that watch, listen, understand, adapt, and evaluate. It moves interview preparation from static Q&A tools to an immersive, AI-driven simulation.

This project reflects the true vision of the hackathon: building intelligent, real-time vision AI agents that operate beyond static image analysis.

If you're interested in the code and implementation details, check out the GitHub repository below.

🚀 Built for the Vision Possible: Agent Protocol Hackathon