Nobody tells you this in college: knowing your subject is only half the battle. The other half is knowing how to present yourself. How to answer a question you weren't expecting. How to sit, how to make eye contact, how not to freeze when someone stares at you waiting for a response.
I found this out the hard way.
I graduated from a tier-3 college, stepped out into the job market, and felt like I'd walked onto a stage with no script and no rehearsal. I didn't know what a structured interview looked like from the inside. I didn't know how questions were framed, how long my answers should be, what a recruiter is actually watching for while I'm talking. None of this was ever covered. Not once.
So I started looking for help. Mock interview services, interview coaches, career counsellors — they exist. But when you're a fresh graduate with no income and a family that's already stretched thin, ₹2,000 for a single mock interview session isn't a small ask. The guidance exists. It's just locked behind a price that filters out the people who need it most.
I know I'm not alone in this. There are thousands of students from colleges like mine — bright, capable, hardworking — who lose opportunities not because they don't know their stuff, but because nobody ever taught them how to show it.
That's why I built Vision Agent.
What is Vision Agent?
Vision Agent is an AI interviewer. Not a chatbot that asks you questions in a text box — an actual video interview agent that joins a call, watches you, listens to you, and reacts to everything it observes in real time.
Agent introduces himself and asks role-specific questions one at a time. He waits patiently for you to finish. If your answer is vague, he follows up — once — with a targeted probe, just like a real recruiter would. If you're rambling, he gently redirects. If you're nervous, he notices and offers encouragement.
And here's the part that makes Agent different from anything else I've seen: he's watching you the entire time.
He can tell if you're slouching. He tracks whether you're making eye contact. He detects if your phone is sitting on your desk. He notices if someone else walks into the room. And he responds to all of it — mid-interview, in real time, in a warm and human way.
When the interview ends, Agent doesn't just hang up. He gives you verbal feedback: what you did well, what to work on, how your body language read. Then he opens the floor for your questions. And the moment you're done, a full performance report appears, and this is the magic part: you can ask how to get better. Every answer is scored on Clarity, Relevance, and Depth, alongside your engagement metrics, posture analysis, and a final recommendation.
All of this. Free. For anyone with a laptop and an internet connection.
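To make that feedback structure concrete, here's an illustrative sketch of the per-answer scoring in Python. The field names and the simple averaging are my assumptions for this post, not the project's actual scoring code:

```python
from dataclasses import dataclass

# Illustrative only: the fields and averaging below are assumptions,
# not Vision Agent's real scoring implementation.
@dataclass
class AnswerScore:
    clarity: int    # 1-10
    relevance: int  # 1-10
    depth: int      # 1-10

    @property
    def overall(self) -> float:
        return (self.clarity + self.relevance + self.depth) / 3

def session_summary(scores: list[AnswerScore]) -> dict:
    """Aggregate per-answer scores into a session-level summary."""
    n = len(scores)
    return {
        "clarity": sum(s.clarity for s in scores) / n,
        "relevance": sum(s.relevance for s in scores) / n,
        "depth": sum(s.depth for s in scores) / n,
        "overall": sum(s.overall for s in scores) / n,
    }
```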
The Stack That Made It Possible
Before I get into the build, I want to be honest about something: I could not have built this in 4 days without Vision Agents SDK.
I considered doing it the hard way — raw OpenCV for video, manual WebRTC for streaming, direct API calls stitched together with custom orchestration. I started sketching it out and realised I was looking at weeks of work before I'd even get to the actual interview logic.
Vision Agents changed that entirely.
Here's the architecture:
GetStream handles the live video call infrastructure. Agent joins as a participant and sees and hears the candidate through GetStream's edge network. Running on the ap-south-1 Mumbai region meant latency stayed remarkably low throughout — critical when you're building something that needs to feel like a real conversation.
Vision Agents SDK is the core runtime. It gave me the Agent class that manages the entire session lifecycle — from greeting through debrief — and crucially, the processors architecture that lets multiple AI models run in parallel alongside the LLM.
return Agent(
    edge=getstream.Edge(region="ap-south-1"),
    agent_user=User(name="Agent - Senior AI Recruiter"),
    processors=[
        YOLOPoseProcessor(model_path="yolo11n-pose.pt"),
        YOLOPoseProcessor(model_path="yolo11n.pt"),
    ],
    instructions=dynamic_instructions,
    llm=gemini.Realtime(model="gemini-2.5-flash-native-audio-preview-12-2025"),
)
That's the entire agent definition. Two YOLO processors and a Gemini Realtime LLM, orchestrated automatically by Vision Agents.
YOLOPoseProcessor with yolo11n-pose.pt handles real-time pose estimation — tracking eye contact, shoulder posture, and movement patterns frame by frame. This feeds an EngagementTracker that measures eye contact percentage, posture quality, and nervousness levels throughout the session.
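To give a sense of what the EngagementTracker does, here's a self-contained sketch. The signal names and the percentage logic are illustrative assumptions on my part, not the SDK's internals:

```python
# Sketch of an engagement tracker fed by per-frame pose results.
# The per-frame signals and percentages are my assumptions, not SDK code.
class EngagementTracker:
    def __init__(self):
        self.frames = 0
        self.eye_contact_frames = 0
        self.good_posture_frames = 0

    def update(self, looking_at_camera: bool, shoulders_level: bool) -> None:
        """Record one frame of pose-derived signals."""
        self.frames += 1
        self.eye_contact_frames += looking_at_camera
        self.good_posture_frames += shoulders_level

    def report(self) -> dict:
        """Summarise the session as percentages for the final feedback."""
        if self.frames == 0:
            return {"eye_contact_pct": 0.0, "posture_pct": 0.0}
        return {
            "eye_contact_pct": round(100 * self.eye_contact_frames / self.frames, 1),
            "posture_pct": round(100 * self.good_posture_frames / self.frames, 1),
        }
```

In the real system the booleans would come from the YOLO keypoints (nose orientation, shoulder alignment) rather than being passed in directly.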
YOLOPoseProcessor with yolo11n.pt handles object detection — scanning the candidate's environment for mobile phones, extra screens, and additional people in the room. When detected, that state passes directly to Gemini which acts on it mid-interview.
Gemini Realtime is what gives Agent his brain and his senses simultaneously. As a VideoLLM, it processes the live video feed and audio stream at the same time — it doesn't just hear the candidate, it sees them. The YOLO processor state combines with Gemini's own visual understanding to create something that feels genuinely aware.
Streamlit is the interface — a setup page where candidates choose their role, level, and interview mode (paste a JD, paste their resume, or let Agent decide), and a live Report page that polls the session data in real time and renders the full performance feedback the moment the interview ends.
No PDFs. No emails. Just instant, on-screen feedback.
The Build — What Actually Happened
The Moment It Clicked
I'd been working on the agent setup for about a day. Got the code running, no errors, pressed go — silence. Agent said nothing. Just sat there on the call, present but completely mute.
I dug through the logs. Wrong model name. I was using a standard Gemini model that doesn't support bidirectional streaming. The realtime-capable model is specifically gemini-2.5-flash-native-audio-preview-12-2025 — a version with native audio support built in.
I swapped the model name. Ran it again.
And Agent spoke.
I don't know how to fully explain what that felt like. I'd been staring at this code for hours, convinced something fundamental was broken — and then suddenly there was a voice coming out of my speakers, introducing itself, asking if I was ready to begin. I just sat there for a second. It was real. It was actually working.
That was the moment I knew I was building something.
The Skeleton Problem
The first time I ran the pose detection, green dots and red lines appeared all over my face on the video — the YOLO skeleton overlay, drawn directly onto the live feed. It looked like something out of a sci-fi horror film, not a professional interview tool.
Removing it wasn't as simple as a flag or a parameter. The SDK didn't expose a straightforward draw=False option that actually worked. The solution was subclassing YOLOPoseProcessor and overriding every possible drawing method the SDK might call:
class CleanProcessor(ultralytics.YOLOPoseProcessor):
    def render(self, frame, results): return frame
    def draw(self, frame, results): return frame
    def annotate(self, frame, results): return frame
    def visualize(self, frame, results): return frame
    def plot(self, frame, results): return frame
Five overrides — because I didn't know which one the SDK was actually calling. So I overrode all of them. The pose data still flows to Gemini. The skeleton just no longer appears on screen. Clean, professional, invisible.
This felt like a small thing. But it's the kind of problem that reminds you that building with real SDKs means encountering real edges. The documentation won't always have your answer. Sometimes you have to read the source.
The Phone Detection Journey
This one took longer than I expected.
I added the object detection processor, wrote the instructions telling Agent what to do when he sees a phone, and pointed my mobile directly at the camera. Nothing. Agent kept going like nothing had happened.
The issue wasn't detection — yolo11n.pt was detecting the phone fine. The issue was communication. The processor was computing results but Gemini wasn't acting on them in real time, only picking it up eventually as ambient context. Agent mentioned the phone at the very end of the session, almost as an afterthought in feedback.
The fix was two-part: adding a dedicated ObjectAlertProcessor class that explicitly formats detections as state text for Gemini, and updating the instructions with an ACTIVE VISUAL SCANNING directive telling Gemini to check processor state continuously — not passively. Once Gemini understood it was expected to scan and act immediately, not just observe and remember, the behaviour changed completely.
Now when you hold up a phone, Agent pauses within seconds and addresses it directly. That's Vision Agents' processor-to-LLM pipeline working exactly as designed — I just needed to understand how to use it properly.
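For illustration, here's roughly what "formatting detections as state text" can look like. The labels, strings, and alert wording below are my own assumptions, not the actual ObjectAlertProcessor code:

```python
# Illustrative sketch: turn raw detections into a short alert string the
# LLM is instructed to act on immediately. All strings here are assumed.
FLAGGED = {
    "cell phone": "a mobile phone is visible on the desk",
    "person": "an additional person may be in the room",
}

def format_alerts(detections: list[dict], candidate_count: int = 1) -> str:
    """Summarise environment detections as state text for the LLM."""
    alerts = []
    persons = sum(1 for d in detections if d["label"] == "person")
    if persons > candidate_count:  # more people than the candidate alone
        alerts.append(FLAGGED["person"])
    if any(d["label"] == "cell phone" for d in detections):
        alerts.append(FLAGGED["cell phone"])
    if not alerts:
        return "ENVIRONMENT: clear"
    return "ENVIRONMENT ALERT: " + "; ".join(alerts) + ". Address this now, warmly."
```

The key design point is the second half of the fix: the state text alone isn't enough — the instructions must also tell the LLM to check that state continuously and act on it, rather than treat it as background context.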
Replacing the PDF Report
My original design generated a PDF report after each session and emailed it to the candidate. It seemed like the obvious approach — interviews end, report gets sent.
But then I thought about the actual experience. You finish an interview, you're sitting there processing how it went, and you have to wait for an email? That friction is completely unnecessary. The data is already there. The session just ended. Why not show it right now, right here, on the same screen?
Sometimes the better UX decision is the simpler one.
What Vision Agents SDK Actually Gave Me
I want to be specific about this because I think it matters.
The thing that makes Vision Agents valuable isn't any single feature. It's the orchestration. Running YOLO and Gemini Realtime in parallel, having processor state automatically available to the LLM, managing the session lifecycle — all of that infrastructure exists so I don't have to build it. I got to spend my 4 days on the actual product: the interview logic, the instruction design, the feedback structure, the user experience.
The latency stayed impressively low throughout testing. The edge transport on ap-south-1 kept the conversation feeling natural — no perceptible lag between speaking and Agent responding, which is non-negotiable when you're simulating a real interview.
And honestly — the processors architecture is elegant. The idea that you can attach arbitrary vision processors to an agent and have their outputs flow into the LLM's context is a genuinely powerful primitive. I used it for two things. Someone else could use it for ten.
What I Learned
The instructions are the product. I spent almost as much time on Agent's instruction prompt as on the code. Phases, priorities, intervention rules, tone guidelines — a well-structured prompt is the difference between an agent that feels human and one that feels like a form with a voice. Don't underinvest here.
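As a sketch of what "phases, priorities, intervention rules" can mean in practice, here's one way to assemble a phased instruction prompt. The phase names and wording are illustrative, not the project's real dynamic_instructions:

```python
# Illustrative prompt assembly. Phase names and rules are assumptions,
# not the actual Vision Agent instruction prompt.
def build_instructions(role: str, level: str, phases: list[tuple[str, str]]) -> str:
    """Assemble a phased interviewer prompt from (phase_name, rule) pairs."""
    header = (f"You are a warm, professional senior recruiter interviewing a "
              f"{level} candidate for a {role} role.\n")
    body = "\n".join(f"PHASE {i}. {name.upper()}: {rule}"
                     for i, (name, rule) in enumerate(phases, 1))
    return header + body

instructions = build_instructions(
    "Backend Engineer", "fresher",
    [("greeting", "Introduce yourself and put the candidate at ease."),
     ("questions", "Ask role-specific questions one at a time; probe vague answers once."),
     ("debrief", "Give verbal feedback, then invite the candidate's questions.")],
)
```

Structuring the prompt this way makes each phase and rule individually editable, which matters when you're iterating on tone and intervention behaviour as much as on code.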
Subclass aggressively. When an SDK doesn't expose a parameter you need, don't work around it at the call site. Subclass and override. It's cleaner, it's explicit, and it survives SDK updates better than hacks.
Real-time means background threads. Anything that can run off the main agent loop should. Gemini scoring each answer, writing the report, updating the transcript — all of it runs in background threads so the live conversation never feels blocked or hesitant.
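Here's a minimal sketch of that pattern: a queue plus a daemon worker thread, so scoring never blocks the live loop. The score_answer heuristic is a stand-in for the real Gemini call:

```python
import queue
import threading

def score_answer(answer: str) -> int:
    """Placeholder heuristic standing in for a real Gemini scoring call."""
    return min(10, max(1, len(answer.split())))

def start_scorer(results: list) -> tuple[queue.Queue, threading.Thread]:
    """Start a background worker that scores answers off the main loop."""
    q: queue.Queue = queue.Queue()

    def worker():
        # Pull answers until a None sentinel arrives, scoring each one
        # on this thread so the live conversation is never blocked.
        while (answer := q.get()) is not None:
            results.append(score_answer(answer))

    t = threading.Thread(target=worker, daemon=True)
    t.start()
    return q, t
```

The agent loop just does q.put(answer) and moves on; the report writer reads results whenever it needs them.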
The first working moment is the most important. When Agent first spoke back to me, I could have shipped it that day and called it done. But that moment also showed me how much further it could go. Let your first working version tell you what version two should be.
For the Students Who Share This Story
Vision Agent isn't just a hackathon project to me. It's the tool I wish existed when I was preparing for my first interview. It's what I would have used if I'd had access to it at the right time.
If you're a student from a college that never prepared you for this, if you're a fresh graduate trying to figure out how to present yourself in a room full of expectations — Vision Agent is for you. You don't need to pay for coaching. You don't need connections. You just need a laptop and the willingness to practice.
One thing to know: to use it right now, you have to clone my repo from GitHub and run it locally on your computer. That can actually be fun, because once you get your hands on the project, you'll discover how addictive building things can become.
The gap between knowing your subject and knowing how to show it is real. But it's closable. That's what I was trying to build.
And if Vision Agents SDK made it possible to build it in 4 days instead of 4 months — then maybe the barrier to building the tools people actually need is lower than we think.
Check out the project: https://github.com/Yamini26284/Vision-Agent
Watch the demo: https://www.youtube.com/watch?v=mScWgvHX-As
Built for Vision Possible: Agent Protocol — WeMakeDevs Hackathon 2026
By Yamini Priya