<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: DevTwinkal</title>
    <description>The latest articles on DEV Community by DevTwinkal (@devtwinkal).</description>
    <link>https://dev.to/devtwinkal</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2480856%2F9b935d5f-3a64-4301-9780-8a9df1cef26e.png</url>
      <title>DEV Community: DevTwinkal</title>
      <link>https://dev.to/devtwinkal</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/devtwinkal"/>
    <language>en</language>
    <item>
      <title>Building InterviewOS: A Real-Time Multi-Modal AI Interview Tutor Using Vision Agents</title>
      <dc:creator>DevTwinkal</dc:creator>
      <pubDate>Sun, 01 Mar 2026 17:40:36 +0000</pubDate>
      <link>https://dev.to/devtwinkal/building-interviewos-a-real-time-multi-modal-ai-interview-tutor-using-vision-agents-36dk</link>
      <guid>https://dev.to/devtwinkal/building-interviewos-a-real-time-multi-modal-ai-interview-tutor-using-vision-agents-36dk</guid>
      <description>&lt;p&gt;How I built a real-time AI agent that watches, listens, understands, and adapts during interview preparation.&lt;/p&gt;

&lt;h2&gt;🚀 Introduction&lt;/h2&gt;

&lt;p&gt;Interview preparation tools today are mostly static. They either:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;provide a fixed list of questions,&lt;/li&gt;
&lt;li&gt;simulate a chatbot conversation, or&lt;/li&gt;
&lt;li&gt;give post-session feedback without real-time intelligence.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What they lack is realism.&lt;/p&gt;

&lt;p&gt;Real interviews are dynamic. They involve:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;body language&lt;/li&gt;
&lt;li&gt;eye contact&lt;/li&gt;
&lt;li&gt;confidence&lt;/li&gt;
&lt;li&gt;content structure&lt;/li&gt;
&lt;li&gt;adaptive questioning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For the Vision Possible: Agent Protocol Hackathon, I built InterviewOS — a real-time AI Communication Coach powered by Vision Agents.&lt;/p&gt;

&lt;p&gt;InterviewOS is not just a chatbot.&lt;/p&gt;

&lt;p&gt;It is a multi-modal AI agent that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;watches (live webcam analysis)&lt;/li&gt;
&lt;li&gt;listens (speech recognition)&lt;/li&gt;
&lt;li&gt;understands (behavior + content evaluation)&lt;/li&gt;
&lt;li&gt;adapts (progressive question generation)&lt;/li&gt;
&lt;li&gt;evaluates (structured scoring + final report)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;🎯 The Core Idea&lt;/h2&gt;

&lt;p&gt;The goal was to simulate a realistic interview preparation experience where:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;AI asks a question (via voice)&lt;/li&gt;
&lt;li&gt;User answers verbally&lt;/li&gt;
&lt;li&gt;AI analyzes posture and eye contact in real time&lt;/li&gt;
&lt;li&gt;AI transcribes and evaluates the answer&lt;/li&gt;
&lt;li&gt;User manually proceeds to the next question&lt;/li&gt;
&lt;li&gt;At the end, AI generates a detailed performance report&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This required true multi-modal intelligence.&lt;/p&gt;

&lt;p&gt;That’s where Vision Agents SDK by Stream became the foundation.&lt;/p&gt;

&lt;h2&gt;🧩 Architecture Overview&lt;/h2&gt;

&lt;p&gt;High-level flow:&lt;/p&gt;

&lt;pre&gt;Webcam → Vision Agents → Behavior Analyzer → Metrics
Microphone → Speech-to-Text → Answer Evaluator
Metrics + Content Score → Adaptive Engine → Next Question
End Session → Final Performance Report&lt;/pre&gt;

&lt;h3&gt;Tech Stack&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Frontend&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Vanilla JavaScript&lt;/li&gt;
&lt;li&gt;Web Speech API (speech recognition)&lt;/li&gt;
&lt;li&gt;SpeechSynthesis API (AI voice interviewer)&lt;/li&gt;
&lt;li&gt;WebSocket client&lt;/li&gt;
&lt;li&gt;Chart.js (real-time metrics visualization)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Backend&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;FastAPI&lt;/li&gt;
&lt;li&gt;Vision Agents SDK (Stream)&lt;/li&gt;
&lt;li&gt;YOLO Pose (behavior detection)&lt;/li&gt;
&lt;li&gt;GPT-4o (question generation + evaluation)&lt;/li&gt;
&lt;li&gt;Async WebSocket architecture&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;🔍 How I Used Vision Agents&lt;/h2&gt;

&lt;p&gt;Vision Agents SDK is the core real-time video intelligence layer of InterviewOS.&lt;/p&gt;

&lt;h3&gt;1️⃣ Joining a Real-Time Stream Call&lt;/h3&gt;

&lt;p&gt;Vision Agents joins a Stream video call and continuously receives live webcam frames.&lt;/p&gt;

&lt;p&gt;This ensures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;low latency (&amp;lt;30ms processing)&lt;/li&gt;
&lt;li&gt;a scalable architecture&lt;/li&gt;
&lt;li&gt;a clean frame-streaming pipeline&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;2️⃣ Real-Time Behavioral Analysis&lt;/h3&gt;

&lt;p&gt;Each frame is analyzed using YOLO pose detection to extract structured metrics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Posture score&lt;/li&gt;
&lt;li&gt;Eye contact score&lt;/li&gt;
&lt;li&gt;Stability score&lt;/li&gt;
&lt;li&gt;Attention score&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These metrics are streamed through WebSocket to the backend in real time.&lt;/p&gt;
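&lt;p&gt;As a rough sketch of how one such metric can be derived from pose keypoints (this is illustrative Python, not the actual InterviewOS analyzer; the keypoint names, coordinate convention, and thresholds are my assumptions):&lt;/p&gt;

```python
def posture_score(keypoints):
    """Score 0-100 from shoulder levelness and head centering (normalized coords)."""
    ls = keypoints["left_shoulder"]
    rs = keypoints["right_shoulder"]
    nose = keypoints["nose"]
    # Level shoulders score high: penalize the vertical gap between them.
    tilt_penalty = min(abs(ls[1] - rs[1]) * 200, 50)
    # A head centered between the shoulders scores high: penalize horizontal lean.
    mid_x = (ls[0] + rs[0]) / 2
    lean_penalty = min(abs(nose[0] - mid_x) * 200, 50)
    return round(100 - tilt_penalty - lean_penalty, 1)

# Example frame: shoulders nearly level, head centered, so the score is high.
kp = {"left_shoulder": (0.40, 0.50), "right_shoulder": (0.60, 0.51), "nose": (0.50, 0.30)}
print(posture_score(kp))
```

&lt;p&gt;Eye contact, stability, and attention scores follow the same pattern: a per-frame geometric check, normalized to 0–100 and smoothed over time.&lt;/p&gt;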

&lt;h3&gt;3️⃣ Multi-Modal Fusion&lt;/h3&gt;

&lt;p&gt;The behavioral metrics are combined with GPT-based answer evaluation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Clarity score&lt;/li&gt;
&lt;li&gt;Structure score&lt;/li&gt;
&lt;li&gt;Depth score&lt;/li&gt;
&lt;li&gt;Relevance score&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;From this fusion, InterviewOS calculates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Confidence Prediction Score&lt;/li&gt;
&lt;li&gt;Communication Persona&lt;/li&gt;
&lt;li&gt;Selection Probability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Vision Agents enabled continuous, real-time vision intelligence, something a static, one-image-per-request API cannot provide.&lt;/p&gt;
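&lt;p&gt;The fusion step can be sketched as a weighted combination of the two score families. This is a hypothetical formula; the weights, thresholds, and persona labels are illustrative, not the project's actual values:&lt;/p&gt;

```python
# Assumed weights: content matters slightly more than behavior.
BEHAVIOR_WEIGHT = 0.4
CONTENT_WEIGHT = 0.6

def confidence_score(behavior, content):
    """behavior and content are dicts of 0-100 sub-scores."""
    b = sum(behavior.values()) / len(behavior)
    c = sum(content.values()) / len(content)
    return round(BEHAVIOR_WEIGHT * b + CONTENT_WEIGHT * c, 1)

def persona(score):
    # Illustrative persona bands.
    if score >= 80:
        return "Confident Communicator"
    if score >= 60:
        return "Developing Speaker"
    return "Needs Structured Practice"

behavior = {"posture": 82, "eye_contact": 74, "stability": 90, "attention": 78}
content = {"clarity": 70, "structure": 65, "depth": 60, "relevance": 85}
score = confidence_score(behavior, content)
print(score, persona(score))
```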

&lt;h2&gt;🎙️ Autonomous Yet Controlled Interview Flow&lt;/h2&gt;

&lt;p&gt;One of the biggest design decisions was avoiding full automation.&lt;/p&gt;

&lt;p&gt;Instead of auto-generating questions continuously, the system follows a controlled state machine:&lt;/p&gt;

&lt;pre&gt;IDLE
→ ASKING_QUESTION
→ LISTENING
→ TRANSCRIPT_REVIEW
→ WAITING_FOR_NEXT
→ EVALUATING
→ ASKING_QUESTION&lt;/pre&gt;

&lt;p&gt;Why? Because interview preparation requires:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;transcript correction (speech recognition isn’t perfect),&lt;/li&gt;
&lt;li&gt;user control, and&lt;/li&gt;
&lt;li&gt;realistic pacing.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After answering, the user can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;edit the transcript,&lt;/li&gt;
&lt;li&gt;re-record,&lt;/li&gt;
&lt;li&gt;click “Next Question”, or&lt;/li&gt;
&lt;li&gt;end the interview.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This makes the system feel like a real mock interview, not a bot.&lt;/p&gt;
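&lt;p&gt;The state machine above can be sketched as an explicit transition table. The state names follow the post; the event names and transition rules are my illustrative assumptions. The key property is that every transition fires only on an explicit event, never on a timer:&lt;/p&gt;

```python
from enum import Enum, auto

class State(Enum):
    IDLE = auto()
    ASKING_QUESTION = auto()
    LISTENING = auto()
    TRANSCRIPT_REVIEW = auto()
    WAITING_FOR_NEXT = auto()
    EVALUATING = auto()

# Each transition is tied to a user or system event; unknown events are ignored.
TRANSITIONS = {
    (State.IDLE, "start"): State.ASKING_QUESTION,
    (State.ASKING_QUESTION, "question_spoken"): State.LISTENING,
    (State.LISTENING, "answer_recorded"): State.TRANSCRIPT_REVIEW,
    (State.TRANSCRIPT_REVIEW, "re_record"): State.LISTENING,
    (State.TRANSCRIPT_REVIEW, "confirm"): State.WAITING_FOR_NEXT,
    (State.WAITING_FOR_NEXT, "next_question"): State.EVALUATING,
    (State.EVALUATING, "evaluated"): State.ASKING_QUESTION,
}

def step(state, event):
    """Advance only on a known (state, event) pair; otherwise stay put."""
    return TRANSITIONS.get((state, event), state)

s = State.IDLE
for ev in ["start", "question_spoken", "answer_recorded", "confirm"]:
    s = step(s, ev)
print(s.name)
```

&lt;p&gt;Because unknown (state, event) pairs are no-ops, a stray message can never auto-advance or prematurely end a session.&lt;/p&gt;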

&lt;h2&gt;🧠 Adaptive Question Generation&lt;/h2&gt;

&lt;p&gt;InterviewOS does not use a fixed question bank.&lt;/p&gt;

&lt;p&gt;Instead, it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;tracks question history,&lt;/li&gt;
&lt;li&gt;tracks answer scores, and&lt;/li&gt;
&lt;li&gt;adjusts difficulty progressively.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Question prompt logic:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Beginner-level questions for low scores&lt;/li&gt;
&lt;li&gt;Moderate questions for mid scores&lt;/li&gt;
&lt;li&gt;Advanced follow-ups for high scores&lt;/li&gt;
&lt;li&gt;Never repeat previous questions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This prevents the common “Tell me about yourself” loop problem.&lt;/p&gt;
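&lt;p&gt;A minimal sketch of that prompt logic, assuming hypothetical score thresholds and prompt wording (not the project's exact prompt):&lt;/p&gt;

```python
def difficulty_for(score):
    """Map the last answer's 0-100 score to a difficulty tier (assumed cutoffs)."""
    if score >= 75:
        return "advanced follow-up"
    if score >= 50:
        return "moderate"
    return "beginner-level"

def build_prompt(role, history, last_score):
    """Build a generation prompt that embeds the no-repeat constraint."""
    level = difficulty_for(last_score)
    asked = "; ".join(history) if history else "none yet"
    return (
        f"You are interviewing a candidate for a {role} role. "
        f"Ask ONE {level} question. "
        f"Never repeat any of these previously asked questions: {asked}."
    )

print(build_prompt("backend engineer", ["Tell me about yourself"], 62))
```

&lt;p&gt;Embedding the full question history in every prompt is what enforces the “never repeat” constraint on the model side.&lt;/p&gt;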

&lt;h2&gt;📊 Final Interview Report&lt;/h2&gt;

&lt;p&gt;When the user ends the interview, InterviewOS generates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Overall Score (0–100)&lt;/li&gt;
&lt;li&gt;Selection Probability&lt;/li&gt;
&lt;li&gt;Communication Persona&lt;/li&gt;
&lt;li&gt;3 Strengths&lt;/li&gt;
&lt;li&gt;3 Areas of Improvement&lt;/li&gt;
&lt;li&gt;Behavioral Feedback Summary&lt;/li&gt;
&lt;li&gt;Content Feedback Summary&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This transforms the tool from a demo into a preparation mentor.&lt;/p&gt;
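&lt;p&gt;One plausible shape for that report payload, with field names following the list above and purely illustrative values (not the project's actual schema):&lt;/p&gt;

```python
# Hypothetical report payload assembled at session end.
report = {
    "overall_score": 74,                      # 0-100
    "selection_probability": 0.61,            # 0.0-1.0
    "communication_persona": "Developing Speaker",
    "strengths": ["stable posture", "relevant answers", "clear voice"],
    "improvements": ["eye contact", "answer structure", "technical depth"],
    "behavioral_feedback": "Posture stayed steady; eye contact dipped on harder questions.",
    "content_feedback": "Answers were relevant but often lacked a clear structure.",
}
print(report["communication_persona"])
```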

&lt;h2&gt;🧠 Challenges Faced&lt;/h2&gt;

&lt;h3&gt;1️⃣ State Machine Bugs&lt;/h3&gt;

&lt;p&gt;Initial versions auto-triggered next questions or prematurely ended sessions.&lt;br&gt;
Solution: Strict event-based transitions tied only to user actions.&lt;/p&gt;

&lt;h3&gt;2️⃣ Question Repetition&lt;/h3&gt;

&lt;p&gt;GPT would repeat generic questions.&lt;br&gt;
Solution: Maintain question history and enforce “never repeat” constraint in prompt.&lt;/p&gt;

&lt;h3&gt;3️⃣ Speech Recognition Errors&lt;/h3&gt;

&lt;p&gt;Browser speech APIs sometimes misinterpret words.&lt;br&gt;
Solution: Editable transcript review panel before evaluation.&lt;/p&gt;

&lt;h3&gt;4️⃣ WebSocket Race Conditions&lt;/h3&gt;

&lt;p&gt;Sending multiple messages quickly caused evaluation issues.&lt;br&gt;
Solution: Combine confirm + next logic into a single backend trigger.&lt;/p&gt;
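&lt;p&gt;The single-trigger fix can be sketched like this (class and method names are illustrative, not the project's actual handlers): instead of separate “confirm transcript” and “next question” messages racing each other, one lock-guarded step stores the confirmed answer and advances the session:&lt;/p&gt;

```python
import asyncio

class Session:
    def __init__(self):
        self.lock = asyncio.Lock()
        self.answers = []

    async def confirm_and_next(self, transcript):
        # One atomic step: store the confirmed answer, then advance the interview.
        async with self.lock:
            self.answers.append(transcript)
            return f"Question {len(self.answers) + 1}"

async def main():
    session = Session()
    # Even if the client fires two triggers in quick succession,
    # the lock serializes them and no evaluation is lost.
    results = await asyncio.gather(
        session.confirm_and_next("answer one"),
        session.confirm_and_next("answer two"),
    )
    print(results)
    return results

asyncio.run(main())
```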

&lt;p&gt;These debugging cycles significantly improved system stability.&lt;/p&gt;

&lt;h2&gt;🌍 Beyond Interviews&lt;/h2&gt;

&lt;p&gt;InterviewOS is not limited to interviews.&lt;/p&gt;

&lt;p&gt;The same architecture supports:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Public Speaking Practice&lt;/li&gt;
&lt;li&gt;Event Anchoring Simulation&lt;/li&gt;
&lt;li&gt;Presentation Coaching&lt;/li&gt;
&lt;li&gt;Debate Training&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because Vision Agents handles real-time video intelligence, this platform can scale across multiple communication domains.&lt;/p&gt;

&lt;h2&gt;🚀 What I Learned&lt;/h2&gt;

&lt;p&gt;This project helped me deeply understand:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;real-time AI system design&lt;/li&gt;
&lt;li&gt;state machine architecture&lt;/li&gt;
&lt;li&gt;WebSocket concurrency handling&lt;/li&gt;
&lt;li&gt;multi-modal AI fusion&lt;/li&gt;
&lt;li&gt;vision + LLM integration&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most importantly, I learned that building AI agents is not just about calling models — it’s about designing intelligent, controlled systems.&lt;/p&gt;

&lt;h2&gt;🎯 Conclusion&lt;/h2&gt;

&lt;p&gt;InterviewOS demonstrates how Vision Agents can power real-time multi-modal AI systems that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Watch&lt;/li&gt;
&lt;li&gt;Listen&lt;/li&gt;
&lt;li&gt;Understand&lt;/li&gt;
&lt;li&gt;Adapt&lt;/li&gt;
&lt;li&gt;Evaluate&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It moves interview preparation from static Q&amp;amp;A tools to an immersive AI-driven simulation experience.&lt;/p&gt;

&lt;p&gt;This project reflects the true vision of the hackathon:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Building intelligent, real-time Vision AI agents that operate beyond static image analysis.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If you’re interested in the code and implementation details, check out the GitHub repository below.&lt;/p&gt;

&lt;p&gt;🚀 Built for Vision Possible: Agent Protocol Hackathon&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>career</category>
      <category>showdev</category>
    </item>
  </channel>
</rss>
