<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: DevTwinkal</title>
    <description>The latest articles on DEV Community by DevTwinkal (@devtwinkal).</description>
    <link>https://dev.to/devtwinkal</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2480856%2F9b935d5f-3a64-4301-9780-8a9df1cef26e.png</url>
      <title>DEV Community: DevTwinkal</title>
      <link>https://dev.to/devtwinkal</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/devtwinkal"/>
    <language>en</language>
    <item>
      <title>Building InterviewOS: A Real-Time Multi-Modal AI Interview Tutor Using Vision Agents</title>
      <dc:creator>DevTwinkal</dc:creator>
      <pubDate>Sun, 01 Mar 2026 17:40:36 +0000</pubDate>
      <link>https://dev.to/devtwinkal/building-interviewos-a-real-time-multi-modal-ai-interview-tutor-using-vision-agents-36dk</link>
      <guid>https://dev.to/devtwinkal/building-interviewos-a-real-time-multi-modal-ai-interview-tutor-using-vision-agents-36dk</guid>
      <description>&lt;p&gt;How I built a real-time AI agent that watches, listens, understands, and adapts during interview preparation.&lt;/p&gt;

&lt;h2&gt;🚀 Introduction&lt;/h2&gt;

&lt;p&gt;Interview preparation tools today are mostly static. They either:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;provide a fixed list of questions,&lt;/li&gt;
&lt;li&gt;simulate a chatbot conversation, or&lt;/li&gt;
&lt;li&gt;give post-session feedback without real-time intelligence.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What they lack is realism.&lt;/p&gt;

&lt;p&gt;Real interviews are dynamic. They involve:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;body language&lt;/li&gt;
&lt;li&gt;eye contact&lt;/li&gt;
&lt;li&gt;confidence&lt;/li&gt;
&lt;li&gt;content structure&lt;/li&gt;
&lt;li&gt;adaptive questioning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For the Vision Possible: Agent Protocol Hackathon, I built InterviewOS — a real-time AI Communication Coach powered by Vision Agents.&lt;/p&gt;

&lt;p&gt;InterviewOS is not just a chatbot.&lt;/p&gt;

&lt;p&gt;It is a multi-modal AI agent that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;watches (live webcam analysis)&lt;/li&gt;
&lt;li&gt;listens (speech recognition)&lt;/li&gt;
&lt;li&gt;understands (behavior + content evaluation)&lt;/li&gt;
&lt;li&gt;adapts (progressive question generation)&lt;/li&gt;
&lt;li&gt;evaluates (structured scoring + final report)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;🎯 The Core Idea&lt;/h2&gt;

&lt;p&gt;The goal was to simulate a realistic interview preparation experience where:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;AI asks a question (via voice)&lt;/li&gt;
&lt;li&gt;User answers verbally&lt;/li&gt;
&lt;li&gt;AI analyzes posture and eye contact in real time&lt;/li&gt;
&lt;li&gt;AI transcribes and evaluates the answer&lt;/li&gt;
&lt;li&gt;User manually proceeds to the next question&lt;/li&gt;
&lt;li&gt;At the end, AI generates a detailed performance report&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This required true multi-modal intelligence.&lt;/p&gt;

&lt;p&gt;That’s where Vision Agents SDK by Stream became the foundation.&lt;/p&gt;

&lt;h2&gt;🧩 Architecture Overview&lt;/h2&gt;

&lt;p&gt;High-level flow:&lt;/p&gt;

&lt;pre&gt;Webcam → Vision Agents → Behavior Analyzer → Metrics
Microphone → Speech-to-Text → Answer Evaluator
Metrics + Content Score → Adaptive Engine → Next Question
End Session → Final Performance Report&lt;/pre&gt;

&lt;h3&gt;Tech Stack&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Frontend&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Vanilla JavaScript&lt;/li&gt;
&lt;li&gt;Web Speech API (speech recognition)&lt;/li&gt;
&lt;li&gt;SpeechSynthesis API (AI voice interviewer)&lt;/li&gt;
&lt;li&gt;WebSocket client&lt;/li&gt;
&lt;li&gt;Chart.js (real-time metrics visualization)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Backend&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;FastAPI&lt;/li&gt;
&lt;li&gt;Vision Agents SDK (Stream)&lt;/li&gt;
&lt;li&gt;YOLO Pose (behavior detection)&lt;/li&gt;
&lt;li&gt;GPT-4o (question generation + evaluation)&lt;/li&gt;
&lt;li&gt;Async WebSocket architecture&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;🔍 How I Used Vision Agents&lt;/h2&gt;

&lt;p&gt;Vision Agents SDK is the core real-time video intelligence layer of InterviewOS.&lt;/p&gt;

&lt;h3&gt;1️⃣ Joining a Real-Time Stream Call&lt;/h3&gt;

&lt;p&gt;Vision Agents joins a Stream video call and continuously receives live webcam frames.&lt;/p&gt;

&lt;p&gt;This ensures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;low latency (&amp;lt;30ms processing)&lt;/li&gt;
&lt;li&gt;a scalable architecture&lt;/li&gt;
&lt;li&gt;a clean frame-streaming pipeline&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;2️⃣ Real-Time Behavioral Analysis&lt;/h3&gt;

&lt;p&gt;Each frame is analyzed using YOLO pose detection to extract structured metrics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Posture score&lt;/li&gt;
&lt;li&gt;Eye contact score&lt;/li&gt;
&lt;li&gt;Stability score&lt;/li&gt;
&lt;li&gt;Attention score&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These metrics are streamed through WebSocket to the backend in real time.&lt;/p&gt;
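&lt;p&gt;As a rough sketch of how one such metric can be derived from pose keypoints (this is illustrative Python, not the actual InterviewOS analyzer; the keypoint names, coordinate convention, and thresholds are my assumptions):&lt;/p&gt;

```python
def posture_score(keypoints):
    """Score 0-100 from shoulder levelness and head centering (normalized coords)."""
    ls = keypoints["left_shoulder"]
    rs = keypoints["right_shoulder"]
    nose = keypoints["nose"]
    # Level shoulders score high: penalize the vertical gap between them.
    tilt_penalty = min(abs(ls[1] - rs[1]) * 200, 50)
    # A head centered between the shoulders scores high: penalize horizontal lean.
    mid_x = (ls[0] + rs[0]) / 2
    lean_penalty = min(abs(nose[0] - mid_x) * 200, 50)
    return round(100 - tilt_penalty - lean_penalty, 1)

# Example frame: shoulders nearly level, head centered, so the score is high.
kp = {"left_shoulder": (0.40, 0.50), "right_shoulder": (0.60, 0.51), "nose": (0.50, 0.30)}
print(posture_score(kp))
```

&lt;p&gt;Eye contact, stability, and attention scores follow the same pattern: a per-frame geometric check, normalized to 0–100 and smoothed over time.&lt;/p&gt;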

&lt;h3&gt;3️⃣ Multi-Modal Fusion&lt;/h3&gt;

&lt;p&gt;The behavioral metrics are combined with GPT-based answer evaluation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Clarity score&lt;/li&gt;
&lt;li&gt;Structure score&lt;/li&gt;
&lt;li&gt;Depth score&lt;/li&gt;
&lt;li&gt;Relevance score&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;From this fusion, InterviewOS calculates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Confidence Prediction Score&lt;/li&gt;
&lt;li&gt;Communication Persona&lt;/li&gt;
&lt;li&gt;Selection Probability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Vision Agents enabled continuous, real-time vision intelligence, something a static, one-image-per-request API cannot provide.&lt;/p&gt;
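&lt;p&gt;The fusion step can be sketched as a weighted combination of the two score families. This is a hypothetical formula; the weights, thresholds, and persona labels are illustrative, not the project's actual values:&lt;/p&gt;

```python
# Assumed weights: content matters slightly more than behavior.
BEHAVIOR_WEIGHT = 0.4
CONTENT_WEIGHT = 0.6

def confidence_score(behavior, content):
    """behavior and content are dicts of 0-100 sub-scores."""
    b = sum(behavior.values()) / len(behavior)
    c = sum(content.values()) / len(content)
    return round(BEHAVIOR_WEIGHT * b + CONTENT_WEIGHT * c, 1)

def persona(score):
    # Illustrative persona bands.
    if score >= 80:
        return "Confident Communicator"
    if score >= 60:
        return "Developing Speaker"
    return "Needs Structured Practice"

behavior = {"posture": 82, "eye_contact": 74, "stability": 90, "attention": 78}
content = {"clarity": 70, "structure": 65, "depth": 60, "relevance": 85}
score = confidence_score(behavior, content)
print(score, persona(score))
```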

&lt;h2&gt;🎙️ Autonomous Yet Controlled Interview Flow&lt;/h2&gt;

&lt;p&gt;One of the biggest design decisions was avoiding full automation.&lt;/p&gt;

&lt;p&gt;Instead of auto-generating questions continuously, the system follows a controlled state machine:&lt;/p&gt;

&lt;pre&gt;IDLE
→ ASKING_QUESTION
→ LISTENING
→ TRANSCRIPT_REVIEW
→ WAITING_FOR_NEXT
→ EVALUATING
→ ASKING_QUESTION&lt;/pre&gt;

&lt;p&gt;Why? Because interview preparation requires:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;transcript correction (speech recognition isn’t perfect),&lt;/li&gt;
&lt;li&gt;user control, and&lt;/li&gt;
&lt;li&gt;realistic pacing.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After answering, the user can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;edit the transcript,&lt;/li&gt;
&lt;li&gt;re-record,&lt;/li&gt;
&lt;li&gt;click “Next Question”, or&lt;/li&gt;
&lt;li&gt;end the interview.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This makes the system feel like a real mock interview, not a bot.&lt;/p&gt;
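&lt;p&gt;The state machine above can be sketched as an explicit transition table. The state names follow the post; the event names and transition rules are my illustrative assumptions. The key property is that every transition fires only on an explicit event, never on a timer:&lt;/p&gt;

```python
from enum import Enum, auto

class State(Enum):
    IDLE = auto()
    ASKING_QUESTION = auto()
    LISTENING = auto()
    TRANSCRIPT_REVIEW = auto()
    WAITING_FOR_NEXT = auto()
    EVALUATING = auto()

# Each transition is tied to a user or system event; unknown events are ignored.
TRANSITIONS = {
    (State.IDLE, "start"): State.ASKING_QUESTION,
    (State.ASKING_QUESTION, "question_spoken"): State.LISTENING,
    (State.LISTENING, "answer_recorded"): State.TRANSCRIPT_REVIEW,
    (State.TRANSCRIPT_REVIEW, "re_record"): State.LISTENING,
    (State.TRANSCRIPT_REVIEW, "confirm"): State.WAITING_FOR_NEXT,
    (State.WAITING_FOR_NEXT, "next_question"): State.EVALUATING,
    (State.EVALUATING, "evaluated"): State.ASKING_QUESTION,
}

def step(state, event):
    """Advance only on a known (state, event) pair; otherwise stay put."""
    return TRANSITIONS.get((state, event), state)

s = State.IDLE
for ev in ["start", "question_spoken", "answer_recorded", "confirm"]:
    s = step(s, ev)
print(s.name)
```

&lt;p&gt;Because unknown (state, event) pairs are no-ops, a stray message can never auto-advance or prematurely end a session.&lt;/p&gt;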

&lt;h2&gt;🧠 Adaptive Question Generation&lt;/h2&gt;

&lt;p&gt;InterviewOS does not use a fixed question bank.&lt;/p&gt;

&lt;p&gt;Instead, it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;tracks question history,&lt;/li&gt;
&lt;li&gt;tracks answer scores, and&lt;/li&gt;
&lt;li&gt;adjusts difficulty progressively.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Question prompt logic:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Beginner-level questions for low scores&lt;/li&gt;
&lt;li&gt;Moderate questions for mid scores&lt;/li&gt;
&lt;li&gt;Advanced follow-ups for high scores&lt;/li&gt;
&lt;li&gt;Never repeat previous questions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This prevents the common “Tell me about yourself” loop problem.&lt;/p&gt;
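&lt;p&gt;A minimal sketch of that prompt logic, assuming hypothetical score thresholds and prompt wording (not the project's exact prompt):&lt;/p&gt;

```python
def difficulty_for(score):
    """Map the last answer's 0-100 score to a difficulty tier (assumed cutoffs)."""
    if score >= 75:
        return "advanced follow-up"
    if score >= 50:
        return "moderate"
    return "beginner-level"

def build_prompt(role, history, last_score):
    """Build a generation prompt that embeds the no-repeat constraint."""
    level = difficulty_for(last_score)
    asked = "; ".join(history) if history else "none yet"
    return (
        f"You are interviewing a candidate for a {role} role. "
        f"Ask ONE {level} question. "
        f"Never repeat any of these previously asked questions: {asked}."
    )

print(build_prompt("backend engineer", ["Tell me about yourself"], 62))
```

&lt;p&gt;Embedding the full question history in every prompt is what enforces the “never repeat” constraint on the model side.&lt;/p&gt;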

&lt;h2&gt;📊 Final Interview Report&lt;/h2&gt;

&lt;p&gt;When the user ends the interview, InterviewOS generates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Overall Score (0–100)&lt;/li&gt;
&lt;li&gt;Selection Probability&lt;/li&gt;
&lt;li&gt;Communication Persona&lt;/li&gt;
&lt;li&gt;3 Strengths&lt;/li&gt;
&lt;li&gt;3 Areas of Improvement&lt;/li&gt;
&lt;li&gt;Behavioral Feedback Summary&lt;/li&gt;
&lt;li&gt;Content Feedback Summary&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This transforms the tool from a demo into a preparation mentor.&lt;/p&gt;
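&lt;p&gt;One plausible shape for that report payload, with field names following the list above and purely illustrative values (not the project's actual schema):&lt;/p&gt;

```python
# Hypothetical report payload assembled at session end.
report = {
    "overall_score": 74,                      # 0-100
    "selection_probability": 0.61,            # 0.0-1.0
    "communication_persona": "Developing Speaker",
    "strengths": ["stable posture", "relevant answers", "clear voice"],
    "improvements": ["eye contact", "answer structure", "technical depth"],
    "behavioral_feedback": "Posture stayed steady; eye contact dipped on harder questions.",
    "content_feedback": "Answers were relevant but often lacked a clear structure.",
}
print(report["communication_persona"])
```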

&lt;h2&gt;🧠 Challenges Faced&lt;/h2&gt;

&lt;h3&gt;1️⃣ State Machine Bugs&lt;/h3&gt;

&lt;p&gt;Initial versions auto-triggered next questions or prematurely ended sessions.&lt;br&gt;
Solution: Strict event-based transitions tied only to user actions.&lt;/p&gt;

&lt;h3&gt;2️⃣ Question Repetition&lt;/h3&gt;

&lt;p&gt;GPT would repeat generic questions.&lt;br&gt;
Solution: Maintain question history and enforce “never repeat” constraint in prompt.&lt;/p&gt;

&lt;h3&gt;3️⃣ Speech Recognition Errors&lt;/h3&gt;

&lt;p&gt;Browser speech APIs sometimes misinterpret words.&lt;br&gt;
Solution: Editable transcript review panel before evaluation.&lt;/p&gt;

&lt;h3&gt;4️⃣ WebSocket Race Conditions&lt;/h3&gt;

&lt;p&gt;Sending multiple messages quickly caused evaluation issues.&lt;br&gt;
Solution: Combine confirm + next logic into a single backend trigger.&lt;/p&gt;
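&lt;p&gt;The single-trigger fix can be sketched like this (class and method names are illustrative, not the project's actual handlers): instead of separate “confirm transcript” and “next question” messages racing each other, one lock-guarded step stores the confirmed answer and advances the session:&lt;/p&gt;

```python
import asyncio

class Session:
    def __init__(self):
        self.lock = asyncio.Lock()
        self.answers = []

    async def confirm_and_next(self, transcript):
        # One atomic step: store the confirmed answer, then advance the interview.
        async with self.lock:
            self.answers.append(transcript)
            return f"Question {len(self.answers) + 1}"

async def main():
    session = Session()
    # Even if the client fires two triggers in quick succession,
    # the lock serializes them and no evaluation is lost.
    results = await asyncio.gather(
        session.confirm_and_next("answer one"),
        session.confirm_and_next("answer two"),
    )
    print(results)
    return results

asyncio.run(main())
```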

&lt;p&gt;These debugging cycles significantly improved system stability.&lt;/p&gt;

&lt;h2&gt;🌍 Beyond Interviews&lt;/h2&gt;

&lt;p&gt;InterviewOS is not limited to interviews.&lt;/p&gt;

&lt;p&gt;The same architecture supports:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Public Speaking Practice&lt;/li&gt;
&lt;li&gt;Event Anchoring Simulation&lt;/li&gt;
&lt;li&gt;Presentation Coaching&lt;/li&gt;
&lt;li&gt;Debate Training&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because Vision Agents handles real-time video intelligence, this platform can scale across multiple communication domains.&lt;/p&gt;

&lt;h2&gt;🚀 What I Learned&lt;/h2&gt;

&lt;p&gt;This project helped me deeply understand:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;real-time AI system design&lt;/li&gt;
&lt;li&gt;state machine architecture&lt;/li&gt;
&lt;li&gt;WebSocket concurrency handling&lt;/li&gt;
&lt;li&gt;multi-modal AI fusion&lt;/li&gt;
&lt;li&gt;vision + LLM integration&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most importantly, I learned that building AI agents is not just about calling models — it’s about designing intelligent, controlled systems.&lt;/p&gt;

&lt;h2&gt;🎯 Conclusion&lt;/h2&gt;

&lt;p&gt;InterviewOS demonstrates how Vision Agents can power real-time multi-modal AI systems that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Watch&lt;/li&gt;
&lt;li&gt;Listen&lt;/li&gt;
&lt;li&gt;Understand&lt;/li&gt;
&lt;li&gt;Adapt&lt;/li&gt;
&lt;li&gt;Evaluate&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It moves interview preparation from static Q&amp;amp;A tools to an immersive AI-driven simulation experience.&lt;/p&gt;

&lt;p&gt;This project reflects the true vision of the hackathon:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Building intelligent, real-time Vision AI agents that operate beyond static image analysis.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If you’re interested in the code and implementation details, check out the GitHub repository below.&lt;/p&gt;

&lt;p&gt;🚀 Built for Vision Possible: Agent Protocol Hackathon&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>career</category>
      <category>showdev</category>
    </item>
  </channel>
</rss>
