DEV Community

Obadiah Bridges

I Built an AI That Conducts Cybersecurity Interviews (and Scores You in Real-Time)

This blog post was created for the purposes of entering the Gemini Live Agent Challenge hackathon. #GeminiLiveAgentChallenge

The Problem

Cybersecurity hiring is broken in a specific way: the gap between "can talk about incident response" and "has actually triaged a 2 AM page" is enormous, and most interviewers can't tell the difference until the candidate is already on the team.

I've sat on both sides of that table. Over the last 2 years I've completed 15+ interviews with top tech companies including Amazon, SpaceX, and TikTok. I took detailed notes on every question, every follow-up probe, and every scoring pattern I could observe. The problem? There's no way to practice these interviews realistically. Mock interviews with friends don't push you the way a trained bar raiser does. And existing AI tools are just chatbots — they don't listen to you talk, watch you code, or adapt their questions based on how deep your knowledge actually goes.

So when the Gemini Live Agent Challenge dropped, I had my idea immediately: an AI interviewer that conducts real-company-style cybersecurity interviews with voice, vision, and adaptive depth.

What CyberLoop Actually Does

It's a voice-first interview platform with three modes:

Hands-On Coding — The candidate writes Python in a built-in Monaco editor to parse Apache access logs, identify suspicious IPs, and explain the attack chain. Click "Run" to execute the code in a sandboxed subprocess. The output gets sent to Gemini alongside a screen frame capture — the agent literally sees your terminal output and gives spoken feedback referencing actual results.

Behavioral — Amazon bar raiser-style STAR interviews. The agent pushes on "what did YOU do?", tracks I-vs-we ratio, and probes until it finds the edge of your experience. Choose Amazon's 16 Leadership Principles framework or custom questions.

Technical Depth — A four-level depth ladder that maps directly to engineering leveling:

Level              | Maps To   | What It Tests
-------------------|-----------|-------------------------------
L1 - Foundational  | Junior    | Do you know the concept?
L2 - Applied       | Mid-Level | Can you use it in a scenario?
L3 - Architectural | Senior    | Can you design with tradeoffs?
L4 - Principal     | Staff     | Can you challenge the premise?

If you give a senior-level answer to a junior question, the system recognizes that and skips ahead. If you stall at the same level three times, it detects your ceiling and moves on. This is how top interviewers at Amazon and Google actually calibrate candidates — CyberLoop automates that entire process.
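The calibration loop above can be sketched in a few lines. This is an illustrative model, not CyberLoop's actual code — the level names and the three-stall threshold come from the post, but the class and method names are mine:

```python
# Hedged sketch of the depth-ladder calibration described above.
LEVELS = ["L1-foundational", "L2-applied", "L3-architectural", "L4-principal"]
STALL_LIMIT = 3  # three stalls at one level => ceiling found

class DepthLadder:
    def __init__(self):
        self.level = 0   # index into LEVELS
        self.stalls = 0  # consecutive answers that failed to advance

    def record_answer(self, answer_level: int) -> str:
        """Advance, probe again, or stop based on the level the answer showed."""
        if answer_level > self.level:
            # A senior-level answer to a junior question: skip ahead.
            self.level = min(answer_level, len(LEVELS) - 1)
            self.stalls = 0
            return "advance"
        self.stalls += 1
        if self.stalls >= STALL_LIMIT:
            return "ceiling"  # stop probing this domain and move on
        return "probe_again"
```

The interesting property is that advancement resets the stall counter, so a candidate who recovers mid-interview keeps climbing instead of being written off.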

After each session, Gemini 2.5 Pro analyzes the full transcript with mode-specific scoring rubrics and generates a report card with domain scores, strengths, areas to improve, missed concepts, and study recommendations.

The Stack

  • Voice: Gemini 2.5 Flash native audio through the Live API, bidirectional PCM16 streaming via WebSocket
  • Agent: Google ADK with run_live() for real-time orchestration
  • Code Execution: Sandboxed Python subprocess with 10s timeout, output sent to Gemini
  • Vision: Screen frame captures sent alongside code output for visual analysis
  • Scoring: Gemini 2.5 Pro with mode-specific rubrics (STAR for behavioral, code quality for coding, technical depth for technical)
  • Frontend: React 18 + Vite + TailwindCSS + Monaco Editor
  • Backend: FastAPI on Cloud Run
  • Data: 5 cybersecurity domains with calibrated question trees, depth probes, and scoring rubrics built from real interview loops
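The sandboxed execution bullet can be approximated with the standard library alone. A minimal sketch, assuming a separate-interpreter subprocess with the 10-second timeout mentioned above (the function name is mine; a production sandbox would also limit memory, network, and filesystem access):

```python
import os
import subprocess
import sys
import tempfile

def run_candidate_code(code: str, timeout: float = 10.0) -> dict:
    """Execute candidate Python in a separate interpreter process."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, "-I", path],  # -I: isolated mode, no user site dirs
            capture_output=True, text=True, timeout=timeout,
        )
        return {"stdout": proc.stdout, "stderr": proc.stderr,
                "returncode": proc.returncode, "timed_out": False}
    except subprocess.TimeoutExpired:
        # Runaway code (e.g. an infinite loop) is killed at the deadline.
        return {"stdout": "", "stderr": "", "returncode": None, "timed_out": True}
    finally:
        os.unlink(path)
```

The returned dict is what would get forwarded to Gemini alongside the screen frame, so the agent can quote actual stdout values back to the candidate.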

The ADK agent uses three tools: evaluate_and_continue() (a merged tool that scores the response, auto-advances depth, and fetches the next question in one round trip), get_next_question(), and end_interview(). Merging the tools was a critical optimization — more on that below.
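The shape of that merged tool looks roughly like this. The helpers `score_response` and `pick_next_question` are hypothetical stand-ins for CyberLoop's real scoring and question-tree logic — the point is the single return payload that replaces three round trips:

```python
def score_response(answer: str) -> int:
    # Toy scorer: longer, more specific answers demonstrate deeper levels.
    return min(4, max(1, len(answer.split()) // 20 + 1))

def pick_next_question(depth: int) -> str:
    # Stand-in for the calibrated question tree.
    bank = {1: "What is a SYN flood?",
            2: "How would you rate-limit it at the edge?",
            3: "Design detection for a distributed variant.",
            4: "When is rate limiting the wrong tool entirely?"}
    return bank[min(depth, 4)]

def evaluate_and_continue(answer: str, current_depth: int) -> dict:
    """One tool call = score + depth advance + next question."""
    score = score_response(answer)
    next_depth = current_depth + 1 if score > current_depth else current_depth
    return {"score": score,
            "depth": next_depth,
            "next_question": pick_next_question(next_depth)}
```

Because the model gets everything it needs in one tool response, it only pays the Live API round-trip cost once per candidate answer.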

The Hard-Won Lessons

The 2-Minute Death Wall

Our sessions kept crashing with WebSocket close codes 1008 and 1011 after about 60 seconds. After hours of debugging, we discovered two compounding limits: the Gemini Live API enforces a 2-minute hard cap on audio+video sessions (vs 15 minutes for audio-only), and every screen frame we sent both classified the session as audio+video and helped fill the 128K context window at ~258 tokens/second.

The fix: context_window_compression with a SlidingWindow(target_tokens=20000). This tells Gemini to prune old conversation turns, effectively removing the session duration limit. We paired it with session_resumption to survive the ~10-minute lifetime of each underlying WebSocket connection. None of this was well documented anywhere, and it was the single most impactful discovery of the project.
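In the google-genai SDK, the combined fix is a few lines of session config. A minimal sketch — field names follow the Live API reference, but verify them against your SDK version:

```python
from google.genai import types

live_config = types.LiveConnectConfig(
    response_modalities=["AUDIO"],
    # Prune old turns so screen frames (~258 tokens/s) can't fill the
    # 128K context window and trigger the hard session cutoff.
    context_window_compression=types.ContextWindowCompressionConfig(
        sliding_window=types.SlidingWindow(target_tokens=20000),
    ),
    # Opt in to resumption handles so the ~10-minute WebSocket lifetime
    # can be bridged by reconnecting with the last handle received.
    session_resumption=types.SessionResumptionConfig(handle=None),
)
```

On reconnect, you pass the most recent handle the server sent you back into `SessionResumptionConfig`, and the conversation state carries over.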

Everything Is a Prompt

Any data sent via send_content() in the Live API triggers the model to respond. Screen frames every 3 seconds? The agent talked constantly, narrating what it saw. Code text updates? It commented on every line as it was typed. System instructions injected as content? It literally read them aloud.

Key insight for other developers: send_realtime() (audio) doesn't trigger responses, but send_content() (text/images) always does. There is no "silent context" mechanism. The only way to give the model information without triggering speech is through tool call responses. Plan your entire architecture around this constraint.
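Concretely, the three channels into a Live session behave like this. A hedged sketch against the google-genai async session API — method and type names vary across SDK versions, so treat these as illustrative:

```python
from google.genai import types

# 1) Microphone audio: does NOT trigger a model turn by itself.
# await session.send_realtime_input(
#     audio=types.Blob(data=pcm16_bytes, mime_type="audio/pcm;rate=16000"))

# 2) Text or image content: ALWAYS triggers a spoken response.
# await session.send_client_content(
#     turns=types.Content(role="user", parts=[types.Part(text=code_text)]))

# 3) Tool responses: the only silent channel -- the model receives the
#    payload as context but does not narrate it back to the user.
# await session.send_tool_response(function_responses=[
#     types.FunctionResponse(id=call_id, name="evaluate_and_continue",
#                            response={"next_question": question})])
```

In practice this means anything you want the model to *know* but not *say* — rubrics, code snapshots, state — has to be smuggled through channel 3.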

30-Second Response Latency → 8 Seconds

After the candidate stopped talking, the agent took 30-40 seconds to respond. Root cause: three sequential tool call round trips through the Live API (score → advance depth → get next question). Each round trip added 5-15 seconds. We merged all three into a single evaluate_and_continue() tool. One call, one round trip. Response time dropped to ~8 seconds.

The Hallucinating Interviewer

The agent would ask a question, then immediately call the scoring tool with a fabricated summary of what the candidate "said" — scoring an imaginary answer and moving on. We built a two-layer guard: reject tool calls with responses under 15 characters, and cross-reference against actual transcribed speech from the audio stream.
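The guard itself is simple. An illustrative version — the 15-character floor comes from the post, but the overlap threshold and function names are assumptions:

```python
MIN_ANSWER_CHARS = 15

def words(text: str) -> set:
    return set(text.lower().split())

def guard_tool_call(claimed_answer: str, transcript: str,
                    min_overlap: float = 0.5) -> bool:
    """Accept a scoring tool call only if the claimed answer is substantial
    and actually overlaps the transcribed audio."""
    # Layer 1: reject trivially short "answers".
    if len(claimed_answer.strip()) < MIN_ANSWER_CHARS:
        return False
    # Layer 2: cross-reference against what the candidate really said.
    claimed, heard = words(claimed_answer), words(transcript)
    if not claimed:
        return False
    return len(claimed & heard) / len(claimed) >= min_overlap
```

A rejected call gets bounced back to the model as a tool error, which nudges it to actually wait for the candidate instead of inventing an answer.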

Gemini Says Everything Twice

The model would generate the same question twice in one audio response with different phrasing. We built semantic repetition detection that extracts key words from each sentence and mutes audio when 60%+ word overlap is detected. The transcript flush also truncates to the first question only.
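The overlap check reduces to comparing word sets between consecutive sentences. A minimal sketch of the 60% rule described above (tokenization and threshold handling in CyberLoop may differ):

```python
import re

def keywords(sentence: str) -> set:
    # Lowercase word extraction; a real version might also drop stopwords.
    return set(re.findall(r"[a-z']+", sentence.lower()))

def is_repetition(prev: str, new: str, threshold: float = 0.6) -> bool:
    """Mute the new sentence if >=60% of its words appeared in the previous one."""
    prev_w, new_w = keywords(prev), keywords(new)
    if not new_w:
        return False
    return len(new_w & prev_w) / len(new_w) >= threshold
```

When `is_repetition` fires, the audio for the second phrasing is muted and the transcript keeps only the first question.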

What Actually Worked

The depth ladder. This is the core innovation. Real interviewers don't ask random questions — they probe deeper and deeper until they find your ceiling. Automating this with smart probe selection (checking concept coverage against the transcript before asking each probe) makes the interviews feel genuinely adaptive.

Company personas. Amazon mode does deep behavioral dives with relentless "what did YOU do?" follow-ups. SpaceX mode pushes first-principles thinking. These made interviews feel real in a way I didn't expect.

The Run Code button. Write code → click Run → see output → agent analyzes it. This closes the loop between writing code and getting evaluated on it. The agent references actual output values ("I see 10.0.0.33 has 8 requests — what makes that suspicious?") instead of generic feedback.

Mode-specific scoring. A behavioral interview evaluates STAR structure and I-vs-we ratio. A coding interview evaluates approach, code quality, and security insight. A technical interview evaluates depth and specificity. Same report card format, completely different rubrics.

Try It

CyberLoop is live at https://cyberloop-382549188807.us-central1.run.app

The code is open source: https://github.com/obadiaha/cyberinterviewer

Pick a domain, start talking, and find out where you actually stand. The report card is honest — sometimes uncomfortably so — which is the point.


Built with Google Gemini (Live API + 2.5 Pro), Google ADK, and Google Cloud Run for the #GeminiLiveAgentChallenge
