The Problem That Started It All
You've been preparing for a job interview for weeks. Your answers are sharp. Your research is solid. But when the actual interview happens — you slouch, you break eye contact, you say "um" seventeen times, and you speak so fast nobody can follow you. The frustrating part? You knew better. You just didn't have anyone watching.
That's the gap AceView fills.
AceView is a real-time AI interview coach that joins your video call, watches you with computer vision, listens to you with speech recognition, and gives you live feedback on six performance axes — every single second.
Not after the interview. During it.
The Vision: What We Set Out to Build
The dream was simple to describe and hard to build:
"An AI that watches you practice an interview and coaches you the way a great mentor would — not by grading you at the end, but by whispering in your ear the whole time."
Six things to track, in real time, for every second of every session:
| Signal | What We Measure |
| --- | --- |
| 👁️ Eye Contact | Are you looking at the camera? |
| 🧍 Posture | Are your shoulders straight? |
| 💬 Filler Words | How many "um", "like", "you know"? |
| 🎙️ Speech Pace | Words per minute vs the ideal 130 WPM |
| 🤖 AI Nudges | Smart coaching tips fired when you slip |
| 📊 Report Card | Gemini-generated grade + actionable feedback |
The result: a platform where you can do a full mock interview, get coached live by an ElevenLabs-voiced AI, and walk away with a PDF report card — all in one session.
The Execution: How Vision Agents Made This Possible
The core of AceView runs on the Vision Agents SDK by GetStream. Before building this, I was genuinely unsure if it was possible to do pose detection and speech analysis simultaneously on a live video call without building custom WebRTC infrastructure from scratch. Vision Agents solved the hardest parts.
Here's how the system is wired:
Frontend (Next.js)
   │
   │  WebRTC (Stream Video SDK)
   ▼
Backend (FastAPI)
   │
   ├── Vision Agents Agent (joins call as participant)
   │     ├── YOLOPoseProcessor → posture + eye contact
   │     ├── Deepgram STT      → speech → WPM + fillers
   │     ├── ElevenLabs TTS    → AI coach voice
   │     └── Gemini 2.0 Flash  → interview questions + report
   │
   └── Custom events → frontend metrics panel
The agent joins the video call as a second participant. It sees the user's video stream, hears their audio, and sends coaching data back to the frontend — all through the same WebRTC connection.

Screenshot: AceView live session at HIGH confidence. The confidence ring glows green when posture and eye contact are both strong; metrics update every second on the right panel.
Setting Up the Agent
import os

from vision_agents.core import Agent, AgentLauncher, Runner, User
from vision_agents.plugins import deepgram, elevenlabs, openai, getstream

from agents.vision_processor import AceViewVisionProcessor

# SYSTEM_PROMPT and MODEL_PATH are module-level constants defined elsewhere

async def create_agent(**kwargs) -> Agent:
    # Gemini 2.0 Flash, served through OpenRouter's OpenAI-compatible API
    llm = openai.ChatCompletionsLLM(
        api_key=os.getenv("OPENROUTER_API_KEY"),
        base_url="https://openrouter.ai/api/v1",
        model="google/gemini-2.0-flash-001",
    )
    agent = Agent(
        edge=getstream.Edge(),  # Low-latency edge network
        agent_user=User(name="AceView AI Coach", id="aceview_agent"),
        instructions=SYSTEM_PROMPT,
        processors=[
            AceViewVisionProcessor(model_path=MODEL_PATH, fps=1, conf_threshold=0.25)
        ],
        llm=llm,
        tts=elevenlabs.TTS(model_id="eleven_flash_v2_5"),
        stt=deepgram.STT(eager_turn_detection=True),
        streaming_tts=True,  # Drastically cuts response latency
    )
    return agent
Notice the processors list — that's our custom vision processor injected directly into the agent pipeline. Vision Agents calls it every frame with the YOLO keypoints. We don't write any WebRTC or frame-extraction code. The SDK handles all of that.
The Code: Real-Time Vision Analysis
Building a Custom Vision Processor
The SDK's YOLOPoseProcessor is the key class we extend. You inherit from it, override process(), and you get every YOLO pose frame automatically.
from typing import Dict, Optional

import numpy as np

from vision_agents.core import Agent
from vision_agents.plugins.ultralytics import YOLOPoseProcessor

class AceViewVisionProcessor(YOLOPoseProcessor):
    """
    Custom YOLO processor that calculates real-time interview coaching metrics
    and broadcasts them to the frontend via WebRTC custom events.
    """
    name = "aceview_vision"

    def __init__(self, *args, **kwargs):
        kwargs.setdefault("fps", 1)
        super().__init__(*args, **kwargs)
        self.agent: Optional[Agent] = None
        self._last_nudge_times: Dict[str, float] = {}

    def attach_agent(self, agent: Agent) -> None:
        """Called automatically by the SDK when the agent starts."""
        self.agent = agent
The SDK automatically calls attach_agent() — so by the time the session starts, our processor has a reference to the agent and can send custom events directly to the frontend.
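To make that concrete, here is a minimal sketch of what a per-second metrics broadcast from inside the processor could look like. The `broadcast_metrics` helper, the `FakeAgent` stand-in, and the payload keys are illustrative assumptions; only `agent.send_custom_event()` is the SDK call used elsewhere in this post.

```python
import asyncio

class FakeAgent:
    """Stand-in for the real Agent so the sketch runs offline: records events."""
    def __init__(self):
        self.events = []

    async def send_custom_event(self, payload):
        self.events.append(payload)

async def broadcast_metrics(agent, posture: int, eye_contact: int, face_visible: bool):
    """Push one second's worth of coaching metrics toward the frontend panel."""
    await agent.send_custom_event({
        "type": "metrics",            # hypothetical event name
        "posture_score": posture,
        "eye_contact_score": eye_contact,
        "face_visible": face_visible,
    })

agent = FakeAgent()
asyncio.run(broadcast_metrics(agent, 88, 92, True))
print(agent.events[0]["posture_score"])  # → 88
```

In production the real agent reference set by attach_agent() takes the place of FakeAgent.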
Posture Scoring with YOLO Keypoints
YOLO pose detection gives us 17 keypoints per person — every joint and facial landmark with confidence scores. We use shoulder alignment, torso height, and head position to compute a posture score:
# COCO keypoint indices
LEFT_SHOULDER, RIGHT_SHOULDER = 5, 6
LEFT_HIP, RIGHT_HIP = 11, 12
NOSE = 0
CONF_THRESH = 0.3

def _calculate_posture_score(self, kpts: np.ndarray) -> int:
    """Score 0-100 from shoulder level, torso height, and head position."""
    scores = []
    l_sh, r_sh = kpts[LEFT_SHOULDER], kpts[RIGHT_SHOULDER]
    l_hip, r_hip = kpts[LEFT_HIP], kpts[RIGHT_HIP]
    nose = kpts[NOSE]

    # 1. Shoulder tilt (tilted shoulders = slouching)
    if l_sh[2] > CONF_THRESH and r_sh[2] > CONF_THRESH:
        shoulder_width = abs(l_sh[0] - r_sh[0])
        if shoulder_width > 10:
            tilt = abs(l_sh[1] - r_sh[1]) / shoulder_width
            scores.append(max(10, int(100 - tilt * 300)))

    # 2. Torso upright (hips below shoulders = sitting straight)
    if all(kpts[i][2] > CONF_THRESH for i in [LEFT_SHOULDER, RIGHT_SHOULDER, LEFT_HIP, RIGHT_HIP]):
        avg_sh_y = (l_sh[1] + r_sh[1]) / 2
        avg_hip_y = (l_hip[1] + r_hip[1]) / 2
        torso_h = avg_hip_y - avg_sh_y
        scores.append(90 if torso_h > 30 else 65 if torso_h > 0 else 35)

    # 3. Head upright (nose above shoulder line)
    if nose[2] > CONF_THRESH and l_sh[2] > CONF_THRESH:
        avg_sh_y = (l_sh[1] + r_sh[1]) / 2
        scores.append(95 if nose[1] < avg_sh_y else 40)

    return min(100, int(sum(scores) / len(scores))) if scores else 0
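To see the shoulder-tilt arithmetic in action, here is the first branch pulled out as a standalone function (the name `tilt_score` is mine; the math matches the snippet above) with an upright and a slouched example:

```python
def tilt_score(l_sh, r_sh):
    """Shoulder-tilt component: each shoulder is (x, y); more tilt = lower score."""
    width = abs(l_sh[0] - r_sh[0])
    if width <= 10:
        return None  # shoulders too close together to judge tilt reliably
    tilt = abs(l_sh[1] - r_sh[1]) / width
    return max(10, int(100 - tilt * 300))

# Level shoulders: 2 px height difference over 120 px width → tilt ≈ 0.017 → 95
print(tilt_score((100, 200), (220, 202)))
# Slouched to one side: 30 px drop over 120 px width → tilt 0.25 → 25
print(tilt_score((100, 200), (220, 230)))
```

The `max(10, ...)` floor means even a dramatic slouch never zeroes the component, which keeps the averaged score stable.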
Eye Contact via Ear Asymmetry (The Key Insight)
YOLO can't track eyeball direction. But there's a clever proxy: ear visibility.
When you look left, your right ear becomes more visible. When you look right, your left ear pops into view. Perfect forward gaze = both ears visible with similar confidence. Head turned = major asymmetry.
from typing import Tuple

# Remaining COCO facial keypoint indices
LEFT_EYE, RIGHT_EYE = 1, 2
LEFT_EAR, RIGHT_EAR = 3, 4

def _calculate_eye_contact(self, kpts: np.ndarray) -> Tuple[int, bool]:
    """
    Strategy: YOLO can't track eyeballs, but it CAN detect head rotation
    via which ears are visible. Ear asymmetry = head-rotation signal.
    """
    l_ear_conf = kpts[LEFT_EAR][2] if kpts[LEFT_EAR][2] > CONF_THRESH else 0.0
    r_ear_conf = kpts[RIGHT_EAR][2] if kpts[RIGHT_EAR][2] > CONF_THRESH else 0.0
    ear_total = l_ear_conf + r_ear_conf

    if ear_total > 0.05:
        # 0 = symmetric (facing camera), 1 = only one ear visible (turned head)
        ear_asymmetry = abs(l_ear_conf - r_ear_conf) / ear_total
        ear_penalty = int(ear_asymmetry * 80)  # up to 80-point drop
    else:
        ear_penalty = 0

    # Refine with nose horizontal offset
    nose_penalty = 0
    l_eye, r_eye, nose = kpts[LEFT_EYE], kpts[RIGHT_EYE], kpts[NOSE]
    if all(kpts[i][2] > CONF_THRESH for i in [LEFT_EYE, RIGHT_EYE, NOSE]):
        eye_cx = (l_eye[0] + r_eye[0]) / 2
        eye_w = abs(l_eye[0] - r_eye[0])
        if eye_w > 5:
            offset = abs(nose[0] - eye_cx) / eye_w
            nose_penalty = int(offset * 100)

    score = max(5, 100 - min(95, ear_penalty + nose_penalty))
    return score, True
This approach made the eye contact score far more sensitive than the naive "nose offset" method I'd tried first. Looking away from the camera now drops the score into the 25–40 range almost instantly.
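The numbers behind that sensitivity are easy to check. Here is a standalone sketch of the ear-asymmetry penalty on its own (hypothetical function name, nose refinement omitted):

```python
def eye_contact_from_ears(l_ear_conf: float, r_ear_conf: float) -> int:
    """Ear-asymmetry penalty only: symmetric confidences mean facing the camera."""
    total = l_ear_conf + r_ear_conf
    # Same formula as the processor: asymmetry in [0, 1] scaled to an 80-pt drop
    penalty = int(abs(l_ear_conf - r_ear_conf) / total * 80) if total > 0.05 else 0
    return max(5, 100 - min(95, penalty))

print(eye_contact_from_ears(0.8, 0.8))  # facing the camera → 100
print(eye_contact_from_ears(0.9, 0.1))  # head turned: penalty 64 → 36
```

A confident single-ear detection alone is enough to pull the score into the 30s, which is what makes head turns register almost instantly.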
Real-Time Filler Word Detection + WPM Tracking
Speech processing runs through Deepgram's streaming STT. We subscribe to two events — partial transcripts (live preview) and final transcripts (committed sentences). On every final transcript, we count filler words and compute the real words-per-minute:
import re
import time

# `agent`, `STTTranscriptEvent`, and the Deepgram event subscription
# are wired up elsewhere in the backend

FILLER_WORDS = {
    "umm", "um", "hmm", "hm", "mm", "mhm", "uh",
    "like", "you know", "basically", "actually",
    "literally", "right", "so",
}

# Tracks real speaking time for WPM
_pace = {"words": 0, "speaking_secs": 0.0, "turn_start": None}

async def on_final_transcript(event: STTTranscriptEvent):
    text = event.text.strip()
    normalized = text.lower()

    # Count fillers (multi-word phrases first, then single words)
    filler_count = len(re.findall(r'\byou know\b', normalized))
    words = re.split(r'\W+', normalized)
    filler_count += sum(1 for w in words if w in FILLER_WORDS - {"you know"})

    # Accumulate real speaking time
    _pace["words"] += len([w for w in words if w])
    if _pace["turn_start"] is not None:
        _pace["speaking_secs"] += time.monotonic() - _pace["turn_start"]
        _pace["turn_start"] = None

    # 130 WPM = perfect score of 100
    pace_score = 82  # default until we have enough data
    if _pace["speaking_secs"] >= 3.0 and _pace["words"] >= 5:
        wpm = (_pace["words"] / _pace["speaking_secs"]) * 60
        pace_score = max(0, min(100,
            int((wpm - 50) / 80 * 100) if wpm <= 130
            else int(100 - (wpm - 130) / 80 * 100)
        ))

    await agent.send_custom_event({
        "type": "transcript",
        "text": text,
        "filler_count": filler_count,
        "pace_score": pace_score,
    })

Screenshot: filler words highlighted in the live transcript. Every filler word is highlighted in real time as the user speaks; the counter on the right updates live.
agent.send_custom_event() broadcasts this directly to the frontend over WebRTC. No REST API, no polling. The metrics panel updates in real time.
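The pace formula rewards 130 WPM with a perfect score and falls off linearly on both sides. Pulled into a standalone function (hypothetical name, same constants as the handler above), it looks like this:

```python
def pace_score(words: int, speaking_secs: float) -> int:
    """100 at 130 WPM; linear falloff when speaking slower or faster."""
    if speaking_secs < 3.0 or words < 5:
        return 82  # default until we have enough data
    wpm = words / speaking_secs * 60
    raw = (wpm - 50) / 80 * 100 if wpm <= 130 else 100 - (wpm - 130) / 80 * 100
    return max(0, min(100, int(raw)))

print(pace_score(26, 12.0))  # 130 WPM → 100
print(pace_score(30, 12.0))  # 150 WPM, too fast → 75
print(pace_score(18, 12.0))  # 90 WPM, a bit slow → 50
```

Note the asymmetric ramps: the score hits 0 at 50 WPM on the slow side but only at 210 WPM on the fast side, so rushing is penalized more gently than freezing up.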
AI Nudges — Invisible Mid-Session Coaching
When a metric crosses a threshold, we fire a coaching nudge: a small pop-up that appears for a few seconds and then disappears. Each nudge type has its own 10-second cooldown, so different issues can fire at the same moment while no single nudge type spams the user:
NUDGE_COOLDOWN = 10.0  # seconds

def _should_nudge(self, key: str) -> bool:
    now = time.time()
    if now - self._last_nudge_times.get(key, 0.0) >= NUDGE_COOLDOWN:
        self._last_nudge_times[key] = now
        return True
    return False

async def _send_nudge_if_needed(self, posture: int, eye_contact: int, face_visible: bool):
    # Independent if-checks — all three can fire simultaneously
    if not face_visible and self._should_nudge("face"):
        await self._safe_nudge("Make sure your face and body are visible on camera")
    if posture < 70 and self._should_nudge("posture"):
        await self._safe_nudge("Sit up straight and square your shoulders to the camera")
    if eye_contact < 65 and self._should_nudge("eye"):
        await self._safe_nudge("Look directly at your camera — eye contact matters in interviews")

Screenshot: an AI nudge firing on poor eye contact. The nudge appears mid-session when eye contact drops below threshold and fades out automatically, with no interruption to the interview flow.
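The per-key cooldown is simple enough to verify in isolation. A sketch with an injectable clock (the class name is mine; the window matches NUDGE_COOLDOWN above) makes the behavior testable without waiting real seconds:

```python
class NudgeLimiter:
    """Rate-limits each nudge type independently to one fire per window."""
    COOLDOWN = 10.0  # seconds, matching the processor

    def __init__(self):
        self._last = {}

    def should_nudge(self, key: str, now: float) -> bool:
        if now - self._last.get(key, 0.0) >= self.COOLDOWN:
            self._last[key] = now
            return True
        return False

limiter = NudgeLimiter()
print(limiter.should_nudge("posture", now=100.0))  # → True  (first fire)
print(limiter.should_nudge("posture", now=105.0))  # → False (still cooling down)
print(limiter.should_nudge("eye", now=105.0))      # → True  (independent key)
print(limiter.should_nudge("posture", now=110.0))  # → True  (window elapsed)
```

Passing `now` explicitly is just for the sketch; the real processor reads time.time() internally.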
Session Averaging: Fairness by Design
One design decision I'm proud of: the report card is based on the average of the entire session, not the score at the moment you click "End Session." Early on, the report card reflected only the last 5 seconds of the call — meaning if you adjusted your posture right before ending, you'd get an inflated score. We fixed this with a running accumulator on the frontend:
// Zustand store — accumulates every frame for true session averaging
interface SessionAccumulator {
  postureSum: number; eyeSum: number; paceSum: number; count: number;
}

endSession: async () => {
  const { _acc, metrics } = get();
  const avgPosture = _acc.count > 0 ? Math.round(_acc.postureSum / _acc.count) : metrics.postureScore;
  const avgEye = _acc.count > 0 ? Math.round(_acc.eyeSum / _acc.count) : metrics.eyeContactScore;
  const avgPace = _acc.count > 0 ? Math.round(_acc.paceSum / _acc.count) : metrics.speechPaceScore;

  // Update displayed metrics to show session averages
  set({ isSessionActive: false, metrics: { ...metrics, postureScore: avgPosture, ... } });

  // Send averages to Gemini for honest report card
  get().fetchSummary({ posture: avgPosture, eye: avgEye, pace: avgPace });
}
The result: if you had terrible eye contact for 8 minutes and then looked at the camera for the last 30 seconds, your eye contact score will honestly reflect your session performance.
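The effect is easy to demonstrate with the same scenario. Here is a minimal Python sketch of the accumulator idea (the real store is TypeScript; names here are illustrative), fed 8 minutes of poor eye contact and a strong final 30 seconds:

```python
class SessionAccumulator:
    """Running sum + count: every second of the session carries equal weight."""
    def __init__(self):
        self.eye_sum = 0
        self.count = 0

    def add(self, eye_score: int) -> None:
        self.eye_sum += eye_score
        self.count += 1

    def average(self, fallback: int) -> int:
        return round(self.eye_sum / self.count) if self.count else fallback

acc = SessionAccumulator()
for _ in range(480):  # 8 minutes of looking away (score 30 each second)
    acc.add(30)
for _ in range(30):   # final 30 seconds of solid eye contact (score 95)
    acc.add(95)
print(acc.average(fallback=0))  # → 34, not the flattering 95
```

A last-second posture fix moves the average by at most a few points, which is exactly the honesty the report card needs.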
The Report Card: Honest AI Feedback
After the session, averages are sent to Gemini 2.0 Flash via OpenRouter. The prompt includes strict rules to prevent contradictory feedback:
Metrics ≥ 75 → listed as strengths only
Metrics < 75 → listed as areas to improve
If all scores are low → strengths focus on effort, not weak metrics
Grading: A ≥ 85, B ≥ 70, C ≥ 55, D < 55
No more "Grade D — great posture!" contradictions.
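Those rules reduce to a few lines of logic. Here is a sketch (function names are mine) of the grade bands and the 75-point strength split:

```python
def grade(avg_score: int) -> str:
    """Grade bands from the report card rules: A ≥ 85, B ≥ 70, C ≥ 55, D below."""
    return "A" if avg_score >= 85 else "B" if avg_score >= 70 else "C" if avg_score >= 55 else "D"

def split_feedback(metrics: dict) -> tuple:
    """Metrics at or above 75 are strengths; everything below needs work."""
    strengths = {k: v for k, v in metrics.items() if v >= 75}
    improve = {k: v for k, v in metrics.items() if v < 75}
    return strengths, improve

session = {"posture": 82, "eye_contact": 60, "pace": 75}
strengths, improve = split_feedback(session)
print(grade(sum(session.values()) // len(session)))  # average 72 → B
print(sorted(strengths))                             # ['pace', 'posture']
print(sorted(improve))                               # ['eye_contact']
```

Because the split and the grade come from the same numbers, a D grade can never coexist with a praised metric.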
What We Learned
On Vision Agents SDK: The SDK's YOLOPoseProcessor is genuinely powerful. The ability to extend it with a custom class and get frames automatically — without writing a single line of WebRTC video capture code — saved at least a week of work. The agent.send_custom_event() method is elegant: any JSON you send appears as a custom event on the frontend call object.
On real-time ML: Running YOLO at 1 FPS (not 3) was a critical decision. At 3 FPS, the audio queue started starving — ElevenLabs would stutter or drop audio entirely. At 1 FPS, everything runs smoothly. Always profile your ML inference loops in an audio-first pipeline.
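The budget math behind that decision is worth spelling out: at a given processor FPS, the per-frame wall-clock budget is simply its inverse, and YOLO inference plus STT, TTS, and event fan-out all have to fit inside it (a trivial sketch, but it framed the choice):

```python
def frame_budget_ms(fps: float) -> float:
    """Wall-clock time available per processed frame at a given FPS."""
    return 1000.0 / fps

print(frame_budget_ms(1))  # 1000.0 ms per frame: comfortable headroom
print(frame_budget_ms(3))  # ~333 ms: tight enough to starve the audio queue
```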
On eye contact detection: The "nose offset" approach I tried first was too weak. Ear asymmetry is a much more reliable signal for head rotation because it reflects a physical landmark (the 3D orientation of the head), not a 2D projection artifact.
Final Result
✅ 6 real-time coaching signals
✅ ElevenLabs-voiced AI interviewer conducting the call
✅ Live transcript with filler word highlights
✅ Confidence ring animation (green → red, faster pulse = lower confidence)
✅ Mid-session AI nudges for posture, eye contact, and filler words
✅ Session-averaged Gemini report card with honest grading
✅ One-click PDF download

Screenshot: the AI report card with grade and feedback. Gemini generates an honest A–D graded report card; strengths only appear for metrics ≥ 75, so no fake praise.
Try It
🔗 GitHub: https://github.com/SKfaizan-786/aceview
🌐 Live Demo: https://aceview-murex.vercel.app
🎬 Demo Video: https://youtu.be/PQ_nZ8KGVYQ
Built with: Vision Agents · YOLOv11 · Deepgram · ElevenLabs · Gemini 2.0 Flash · Stream Video · Next.js · FastAPI
Submitted for Vision Possible Hackathon 2025 by WeMakeDevs — #BuildInPublic #VisionAgents #AIHackathon
