Ademola Balogun

Why I Chose Voice Over Chat for AI Interviews (And Why It Almost Backfired)

Most AI interview platforms are glorified chatbots with better questions. We built Squrrel to do something harder: have actual spoken conversations with candidates.

That decision nearly killed the product before launch.

The Obvious Choice That Wasn't Obvious

When I started building Squrrel, the safe play was text-based interviews. Lower latency, fewer technical headaches, easier to parse and analyze. Every AI product manager I talked to said the same thing: "Start with chat. Voice is a nightmare."

They were right about the nightmare part.

But I kept coming back to one fact: 78% of recruiting happens over the phone. Not email. Not Slack. Phone calls. Because hiring managers want to hear how candidates think on their feet, how they structure explanations, whether they can articulate complex ideas clearly.

A text-based interview platform would be easier to build and completely miss the point.

So we went with voice. And immediately discovered why everyone warned us against it.

The Technical Debt I Didn't See Coming

Speech recognition for interviews is different from speech recognition for everything else.

Siri and Alexa are optimized for short commands. Transcription tools like Otter are optimized for meetings with multiple speakers. We needed something that could handle:

20-40 minute monologues about technical projects

Industry jargon that barely shows up in general-purpose speech training data ("Kubernetes," "PostgreSQL," "JWT authentication")

Non-native English speakers with varying accents

Candidates who talk fast when nervous or slow when thinking

Off-the-shelf speech-to-text models failed spectacularly. Our first pilot had a 23% word error rate on technical terms. A candidate said "I implemented Redis caching" and got transcribed as "I implemented ready's catching." Recruiters couldn't trust the output.

I spent three weeks fine-tuning Wav2Vec 2.0 on domain-specific data—transcripts from actual tech interviews, recordings of engineers explaining their work, podcasts about software development. Got the error rate down to 6% for technical vocabulary.
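If you want to check whether your own model has this problem, the measurement is cheap. Here's a minimal sketch of how you might track error rates on technical vocabulary specifically; it assumes the open-source jiwer package, and the jargon list and transcript pairs are illustrative placeholders, not our real data or pipeline.

```python
# Minimal sketch: overall WER plus a term-level error rate for jargon.
# The TECH_TERMS set and example pairs are placeholders for illustration.
import jiwer

TECH_TERMS = {"kubernetes", "postgresql", "jwt", "redis", "caching"}

def technical_term_error_rate(pairs):
    """pairs: list of (reference, hypothesis) transcript strings."""
    hits, total = 0, 0
    for ref, hyp in pairs:
        hyp_words = set(hyp.lower().split())
        for word in ref.lower().split():
            if word in TECH_TERMS:
                total += 1
                if word in hyp_words:
                    hits += 1
    return 1.0 - (hits / total) if total else 0.0

pairs = [
    ("I implemented Redis caching", "I implemented ready's catching"),
    ("We deploy on Kubernetes", "We deploy on Kubernetes"),
]
print("overall WER:", jiwer.wer([r for r, _ in pairs], [h for _, h in pairs]))
print("technical-term error rate:", technical_term_error_rate(pairs))
```

Tracking the two numbers separately is the whole point: a model can look fine on overall WER while quietly butchering the vocabulary recruiters actually care about.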

But here's what surprised me: the remaining errors weren't random. They clustered around moments of hesitation, filler words, and self-corrections—exactly the moments that reveal how someone thinks under pressure.

We almost removed those "errors" before realizing they were features, not bugs.

The Conversational AI Problem Nobody Talks About

Building an AI that can conduct a natural interview conversation is way harder than building one that asks scripted questions.

The models are good at turn-taking now—knowing when the candidate has finished speaking, when to probe deeper, when to move on. But they're terrible at knowing why to do those things.

Our first version would ask "Tell me about a time you faced a technical challenge" and then immediately jump to the next question, regardless of whether the candidate gave a three-sentence answer or a three-minute story. It felt robotic because it was robotic—no human interviewer would blow past a shallow answer without following up.

We had to build a layer that analyzes response depth and triggers follow-ups. Not just keyword matching—actual semantic understanding of whether the candidate addressed the question or danced around it.

This meant combining LLaMA 3.3 70B for conversation flow with TinyBERT for real-time classification. The large model decides what to ask; the small model decides whether the answer was substantive enough to move forward. They run in parallel, with about 800ms of latency between the candidate finishing and the AI responding.
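To make the split concrete, here's a rough sketch of the orchestration, not our production code. The classify_depth and draft_next_turn functions are hypothetical stand-ins for the TinyBERT classifier and a hosted Llama 3.3 70B endpoint; the point is that the two calls run concurrently, so the total latency is the slower of the two rather than their sum.

```python
# Illustrative two-model split: a small, fast classifier gates whether we
# move on, while the large model drafts both possible next turns in parallel.
import asyncio

async def classify_depth(answer: str) -> bool:
    """Small, fast model: was the answer substantive enough to move on?"""
    ...  # e.g. a TinyBERT classifier fine-tuned on labeled interview answers
    return len(answer.split()) > 40  # placeholder heuristic, not the real model

async def draft_next_turn(transcript: list[str]) -> dict:
    """Large model: drafts both a follow-up probe and the next question."""
    ...  # e.g. a chat-completion call to a hosted Llama 3.3 70B
    return {"follow_up": "Can you walk me through the trade-offs you weighed?",
            "next_question": "Tell me about a system you had to scale."}

async def next_ai_turn(transcript: list[str]) -> str:
    # Run both models concurrently; latency is max(), not sum().
    substantive, draft = await asyncio.gather(
        classify_depth(transcript[-1]),
        draft_next_turn(transcript),
    )
    return draft["next_question"] if substantive else draft["follow_up"]

print(asyncio.run(next_ai_turn(["I used Redis to cache session tokens."])))
```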

That 800ms pause? Candidates tell us it makes the conversation feel more natural. Humans don't respond instantly either.

The Bias Problem That Wasn't a Bias Problem

Everyone asked about bias in AI hiring. "How do you prevent discrimination against protected classes?"

Honest answer? We can't. Not completely.

But we can be transparent about where bias enters the system and give recruiters tools to catch it.

Our approach:

Standardized questions - Every candidate gets asked the same core questions in the same order. This eliminates the biggest source of interviewer bias: one person getting softball questions while another gets grilled.

Anonymized analysis - The AI evaluation doesn't see candidate names, photos, or demographic data. It only sees the transcript and voice characteristics relevant to communication (clarity, pace, coherence—not accent or gender).

Bias audit logs - We track which candidates get follow-up questions and why. If the AI is consistently probing deeper with one demographic group, that pattern surfaces in our analytics (there's a rough sketch of that aggregation after this list).

Human override - Recruiters see the full transcript alongside the AI summary. They can—and do—disagree with the AI's assessment.
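Here's roughly what the audit aggregation from the third point looks like. The field names and the use of pandas are assumptions for illustration; I'm treating group labels as coming from optional self-identification collected outside the interview, never from the audio itself.

```python
# Rough sketch of a follow-up-rate audit. Schema and data are illustrative.
import pandas as pd

# One row per interview: how many follow-up probes the AI triggered,
# joined with an optional self-reported demographic group.
interviews = pd.DataFrame([
    {"candidate_id": 1, "group": "A", "follow_ups": 3, "questions": 8},
    {"candidate_id": 2, "group": "B", "follow_ups": 1, "questions": 8},
    {"candidate_id": 3, "group": "A", "follow_ups": 4, "questions": 8},
    {"candidate_id": 4, "group": "B", "follow_ups": 2, "questions": 8},
])

# If one group is probed much more often than another, it shows up here
# and gets flagged for human review.
rates = (interviews.assign(rate=interviews.follow_ups / interviews.questions)
                   .groupby("group")["rate"].mean())
print(rates)
```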

The dirty secret of AI hiring tools is that removing human bias is impossible. What's possible is making bias visible and consistent. A human interviewer might grill technical candidates on Tuesdays because they're stressed, then lob softballs on Fridays when they're in a good mood. The AI applies the same standards at 2 PM and 2 AM.

That's not unbiased. It's consistently biased, which is actually useful if you know what you're looking for.

What Breaking Things Taught Me

When we started testing the system, the AI asked a great opening question, then froze for 14 seconds before asking it again. The candidate thought the system crashed and hung up.

The bug? Our conversation state management couldn't handle the candidate pausing to think. The silence triggered a "no response detected" error, which triggered a retry, which created a race condition.

Fixed it by adding a confidence threshold—the AI now distinguishes between "finished talking" silence and "still thinking" silence based on speech patterns in the previous 3 seconds. Not perfect, but it dropped the false-positive rate from 18% to 2%.
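For the curious, here's a hedged sketch of the idea. The features, weights, and threshold are illustrative, not the production values; the real signal extraction happens on the audio stream itself.

```python
# Sketch of "finished talking" vs. "still thinking" silence classification.
from dataclasses import dataclass

@dataclass
class SpeechWindow:
    """Features extracted from the ~3 seconds before the silence began."""
    ended_with_filler: bool   # "um", "so", "like" right before the pause
    falling_intonation: bool  # pitch dropped, typical of a finished thought
    trailing_energy: float    # mean audio energy at the end, 0.0-1.0

def finished_speaking(window: SpeechWindow, threshold: float = 0.6) -> bool:
    """Return True only if we're confident the candidate is done, not thinking."""
    score = 0.0
    score += 0.5 if window.falling_intonation else 0.0
    score -= 0.3 if window.ended_with_filler else 0.0
    score += 0.3 * (1.0 - window.trailing_energy)
    # Below the confidence threshold, keep waiting instead of re-asking.
    return score >= threshold

# Paused after "so, um..." with level pitch: keep waiting.
print(finished_speaking(SpeechWindow(True, False, 0.4)))   # False
# Pitch fell and the voice tapered off: safe to respond.
print(finished_speaking(SpeechWindow(False, True, 0.1)))   # True
```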

Here's the lesson I took away: voice AI in high-stakes scenarios requires defensive design at every layer. Unlike a chatbot where someone can retype their message, you can't ask a candidate to "restart the interview" because your error handling failed.

We built in:

Automatic session recovery if connectivity drops (sketched after this list)

Manual override for recruiters to flag bad transcriptions

A "pause interview" button for candidates (surprisingly popular)

Playback of the actual audio alongside transcripts
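The session-recovery item boils down to checkpointing conversation state after every completed turn so a dropped connection resumes instead of restarting. A minimal sketch of that idea, assuming a simple key-value store; the redis client and key names here are placeholders, not our actual storage layer:

```python
# Checkpoint conversation state so a reconnect resumes mid-interview.
import json
import redis

store = redis.Redis()  # any durable store works; Redis is just an example

def checkpoint(session_id: str, state: dict) -> None:
    """Persist conversation state after every completed turn."""
    store.set(f"interview:{session_id}", json.dumps(state), ex=60 * 60 * 4)

def resume(session_id: str) -> dict | None:
    """On reconnect, pick up exactly where the candidate left off."""
    raw = store.get(f"interview:{session_id}")
    return json.loads(raw) if raw else None
```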

The goal isn't perfection. It's resilience when things go wrong, because they will go wrong.

Why This Matters for Other AI Builders

If you're building AI for professional contexts—interviews, legal analysis, medical screening, financial advice—here's what I'd tell you:

Voice is worth the pain. The richness of verbal communication unlocks insights that text can't capture. But only if you're willing to solve the hard problems instead of shipping a minimum viable chatbot.

Domain-specific fine-tuning isn't optional. General-purpose models are amazing and terrible at the same time. They'll handle 90% of your use case brilliantly, then catastrophically fail on the 10% that matters most. Find that 10% early and train specifically for it.

Latency is a feature. We obsessed over response time at first, trying to get under 500ms. Then we realized that instant responses felt uncanny. The sweet spot for conversational AI is 600-1000ms—fast enough to feel responsive, slow enough to feel natural. (There's a rough sketch of enforcing that floor just below.)

Build for the failure modes. Your AI will misunderstand accents, mishear technical terms, and ask nonsensical follow-ups. Design the system so humans can catch these failures gracefully instead of catastrophically.
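On the latency point above, the counterintuitive part is that you sometimes slow the system down on purpose. Here's a sketch of that floor; the function names are illustrative and the timing constants just mirror the range I mentioned.

```python
# If the pipeline finishes too fast, hold the reply so the turn feels human-paced.
import asyncio
import time

MIN_DELAY_S = 0.6  # below this, responses feel uncanny; above ~1.0s feels frozen

async def respond_naturally(generate_reply) -> str:
    start = time.monotonic()
    reply = await generate_reply()
    elapsed = time.monotonic() - start
    if elapsed < MIN_DELAY_S:
        await asyncio.sleep(MIN_DELAY_S - elapsed)  # pad fast replies
    return reply

async def fake_reply():
    return "Thanks, can you go deeper on the caching layer?"

print(asyncio.run(respond_naturally(fake_reply)))
```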

The Uncomfortable Truth About AI Products

Six months into building Squrrel, I had a realization that almost made me quit: the AI isn't the product. The product is the workflow that the AI enables.

Candidates don't care that we use Wav2Vec 2.0 for transcription or LLaMA 3.3 for conversation. They care that they can interview at midnight without four rounds of scheduling emails. Recruiters don't care about our evaluation algorithms. They care that they can review 10 candidates in an hour instead of spending all week on phone screens.

The AI is infrastructure. The value is in removing friction from a broken process.

This realization changed everything. We stopped optimizing for model accuracy and started optimizing for user experience. We added features like letting candidates preview questions before starting, because that reduced anxiety and led to better responses—even though it "broke" the blind evaluation model we'd carefully designed.

Turns out, a slightly worse AI that people actually use beats a perfect AI that sits unused because the UX is terrible.

What's Next

We're expanding our pilots and learning every day. The technology works. The question now is whether we can scale the human side—onboarding, support, training recruiters to trust but verify AI outputs.

I'm also watching the regulatory space closely. The EU AI Act classifies hiring tools as "high-risk AI systems." New York City requires bias audits for automated employment decision tools. This is good—high-stakes AI should be regulated.

But it also means we need to build compliance into the product from day one, not bolt it on later. Audit trails, explainability, human oversight—these aren't nice-to-haves. They're survival requirements.

If you're building AI products in regulated industries, start designing for compliance now. It's way easier than retrofitting later.
