I want to share something I've been working on for the past few weeks, mostly because I couldn't find a single honest write-up about it when I started. So here's mine.
A client came to us with a really specific problem. They were losing leads. Not because their service was bad; the opposite. They were so good that the phone never stopped ringing, and a two-person team couldn't pick up every call. The owner told me, almost embarrassed, that he'd been answering calls at 11 PM the night before because he didn't want to miss a sale.
That hit me in a way I didn't expect.
So we said, okay, let's build an AI that answers the phone like a real person. Books appointments. Qualifies leads. Tells callers what they need to know. Doesn't sound like a robot. And does it for under what a part-time receptionist would cost.
Three weeks later, we shipped it. Here's the honest story.
Why I almost gave up in week 1
I'd built voice apps before. I thought this would be easy: Twilio for the phone line, GPT-4o for the brain, some TTS for the voice, done.
Lol.
The first version we built worked exactly once, in the staging environment, with a perfect WiFi connection and me speaking in a clean American accent into a $200 microphone. The moment a real customer called in from a noisy car with a British accent, the whole thing collapsed.
Three things broke immediately:
- Latency. Standard TTS pipelines add 600-1200ms before any audio comes back. On a phone call, that feels like the line went dead. People hung up.
- Interruptions. Real humans interrupt. Real humans say "uhh" and "wait, actually." Our agent waited politely for them to finish, which they never did.
- Hallucinations on critical data. The agent confidently told a caller our office was open Sundays. We are not open Sundays.
I closed my laptop at 2 AM and considered emailing the client to ask for an extension. Instead I made coffee and started over.
"Coffee fixes everything except your code."
The stack that actually worked
After three iterations, here's what stuck:
- Twilio for the carrier layer (porting numbers, SIP, programmable voice)
- OpenAI Realtime API for speech-to-speech (this changed everything; more on this below)
- LiveKit for WebRTC audio routing (because Twilio's media streams alone have weird buffering)
- Node.js for the orchestration layer
- Postgres + Redis for state and call session tracking
The single biggest unlock was the Realtime API. Once we stopped doing the old audio → STT → LLM → TTS → audio pipeline and switched to native speech-to-speech, our perceived latency dropped from ~1.5 seconds to about 350ms. That's the difference between "this feels like a robot" and "wait, is this a person?"
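If you haven't used the Realtime API, the hookup is a single WebSocket that takes raw audio in and streams raw audio back. A minimal sketch, assuming the `ws` package and the beta event shapes; the model name, headers, and the mu-law formats (which match Twilio's 8kHz phone audio) are from the API docs at the time of writing, so verify against current ones:

// lib/realtime.ts — open a speech-to-speech session
import WebSocket from "ws";

export function connectRealtime(instructions: string): WebSocket {
  const ws = new WebSocket(
    "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview",
    {
      headers: {
        Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
        "OpenAI-Beta": "realtime=v1",
      },
    }
  );

  ws.on("open", () => {
    // Configure the session once: business prompt, voice, audio formats.
    ws.send(
      JSON.stringify({
        type: "session.update",
        session: {
          instructions,
          voice: "alloy",
          input_audio_format: "g711_ulaw",  // matches Twilio's 8kHz mu-law stream
          output_audio_format: "g711_ulaw",
          turn_detection: { type: "server_vad" }, // let the server detect turn ends
        },
      })
    );
  });

  return ws;
}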
The webhook that started it all
This is the entry point. When Twilio receives a call, it pings our webhook, and we hand off the call to LiveKit which then bridges to the realtime model.
// app/api/voice/incoming/route.ts
import { NextRequest, NextResponse } from "next/server";
import { generateLiveKitToken } from "@/lib/livekit";
import { db } from "@/lib/db"; // Prisma-style client, not shown in this post

export async function POST(req: NextRequest) {
  // Twilio posts call metadata as form-encoded fields
  const formData = await req.formData();
  const callSid = formData.get("CallSid") as string;
  const from = formData.get("From") as string;

  // create a session row so we can track the call later
  const session = await db.callSession.create({
    data: { callSid, from, startedAt: new Date() },
  });

  const token = await generateLiveKitToken({
    room: `call-${session.id}`,
    identity: `caller-${from}`,
  });

  // hand the call off to our LiveKit room via Twilio's <Connect><Stream>
  const twiml = `
    <Response>
      <Connect>
        <Stream url="wss://livekit.example.com/twilio?token=${token}" />
      </Connect>
    </Response>
  `;

  return new NextResponse(twiml, {
    headers: { "Content-Type": "text/xml" },
  });
}
Looks simple. Took me four days to get right.
Teaching the agent the business
This is the part that I think most tutorials skip, and it's the most important part.
The model is smart. The model does not know your business. We had to give it a "brain file": a structured prompt that combined business hours, services offered, pricing rules, escalation paths, and behavior guardrails. We also gave it function-calling tools so it could actually do things, not just talk about them.
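A hypothetical shape of that brain file first; the BusinessBrain type and buildInstructions helper here are mine for illustration, not a library API:

// lib/brain.ts — render business facts into system instructions
type BusinessBrain = {
  hours: Record<string, string>; // e.g. { mon: "8am-6pm", sun: "closed" }
  services: string[];
  pricingRules: string[];
  escalationTriggers: string[];
  guardrails: string[];
};

export function buildInstructions(brain: BusinessBrain): string {
  return [
    "You are the phone receptionist for this business. Answer only from the facts below.",
    `Hours: ${JSON.stringify(brain.hours)}`,
    `Services offered: ${brain.services.join(", ")}`,
    `Pricing rules: ${brain.pricingRules.join(" | ")}`,
    `Escalate to a human immediately if: ${brain.escalationTriggers.join("; ")}`,
    `Never: ${brain.guardrails.join("; ")}`,
    "If you are not sure about a fact, say so and offer a human callback.",
  ].join("\n");
}

And the tools themselves: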
// Example service list; the real one came from the client's brain file
const SERVICES = ["repair", "installation", "maintenance", "inspection"];

const tools = [
  {
    type: "function",
    name: "book_appointment",
    description: "Book a service appointment in the calendar",
    parameters: {
      type: "object",
      properties: {
        customer_name: { type: "string" },
        phone: { type: "string" },
        service: { type: "string", enum: SERVICES },
        preferred_date: { type: "string", format: "date-time" },
      },
      required: ["customer_name", "phone", "service"],
    },
  },
  {
    type: "function",
    name: "escalate_to_human",
    description: "Transfer to a human agent for sensitive or complex cases",
    parameters: {
      type: "object",
      properties: {
        reason: { type: "string" },
      },
    },
  },
];
The escalate_to_human tool was the one that earned the client's trust. We baked in hard rules: any pricing dispute, any angry caller, any mention of anything legal, and the agent escalates. It doesn't try to be a hero. It knows when to tap out.
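For completeness, here's roughly how tool calls get routed back to real code on the ws socket from the Realtime sketch above. The event and field names follow the Realtime API beta; bookAppointment and transferToHuman are our own helpers, shown here as stand-ins:

// Dispatch function calls coming back from the model
ws.on("message", async (raw) => {
  const event = JSON.parse(raw.toString());

  // fired when the model finishes emitting arguments for a tool call
  if (event.type === "response.function_call_arguments.done") {
    const args = JSON.parse(event.arguments);
    let result: unknown = {};

    if (event.name === "book_appointment") {
      result = await bookAppointment(args); // calendar helper (stand-in)
    } else if (event.name === "escalate_to_human") {
      result = await transferToHuman(args.reason); // bridge the call to a human
    }

    // feed the result back so the model can keep talking
    ws.send(JSON.stringify({
      type: "conversation.item.create",
      item: {
        type: "function_call_output",
        call_id: event.call_id,
        output: JSON.stringify(result),
      },
    }));
    ws.send(JSON.stringify({ type: "response.create" }));
  }
});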
The interruption problem
This took me a full evening to solve and I'm still not 100% happy with it.
Real humans interrupt. If you don't handle interruptions, your agent feels like it's lecturing them. We used voice activity detection to cut the agent off mid-sentence when the caller starts speaking, then resumed the conversation from where the caller left off, not from where the agent was.
// Event names here are from our orchestration layer; your SDK's will differ
session.on("user_speaking_started", () => {
  if (agentIsSpeaking) {
    // cut the agent off mid-sentence and give the caller the floor
    session.cancelResponse();
    agentIsSpeaking = false;
  }
});
Two lines. One evening of debugging. I'm not proud, but I'm tired.
What I'd do differently
A few honest takeaways:
- Don't try to make it sound exactly human. Tell callers it's an AI in the first 5 seconds. Trust goes up, not down. The ones who care will appreciate the honesty. The ones who don't will keep talking.
- Log every single conversation. We were debugging hallucinations from memory for three days before we set up proper transcript logging. Don't be us. (A sketch of what that logging looked like follows this list.)
- Edge cases are 60% of the work. "What if they don't speak English?" "What if they say a number wrong?" "What if they're on speakerphone?" We have a 47-item edge case checklist now.
- Latency matters more than intelligence. A slightly dumber model that responds in 300ms feels better than a smarter one that takes 1.2s.
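The logging sketch I promised. It assumes the Prisma-style db from the webhook (transcriptLine is a hypothetical model) and that input transcription is enabled on the session; the event names are the Realtime API's:

// Persist every utterance against the call session
async function logUtterance(sessionId: string, role: "caller" | "agent", text: string) {
  await db.transcriptLine.create({
    data: { sessionId, role, text, at: new Date() },
  });
}

ws.on("message", (raw) => {
  const event = JSON.parse(raw.toString());
  // caller speech, transcribed server-side
  if (event.type === "conversation.item.input_audio_transcription.completed") {
    void logUtterance(sessionId, "caller", event.transcript);
  }
  // the agent's own words, as actually spoken
  if (event.type === "response.audio_transcript.done") {
    void logUtterance(sessionId, "agent", event.transcript);
  }
});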
Where we are now
The agent has been live for two months. It answers about 80 calls a day. It books roughly half of them as actual appointments. The client tells me he sleeps through the night now, which honestly is the metric I care about most.
Was it worth three weeks of evenings and one mild breakdown at 2 AM?
Yeah. It was.
If you're building something similar, feel free to reach out. I'm Abbas, and I run engineering at Seedinov, where we build AI products for small teams that can't hire a Big Tech AI department. You can see the voice agent work in our portfolio or drop us a line if you want to talk shop.
Good luck out there. Build the thing.
Top comments (1)
Great practical write-up. Voice AI is one of those areas where the “happy path” demo hides most of the real engineering work: latency, interruptions, fallback behavior, call state, retry handling, and knowing when to escalate to a human.
One thing I’d add from a production AI perspective is that these systems benefit a lot from explicit state machines. Instead of letting the model implicitly manage the whole call, you can separate intent detection, slot filling, policy checks, appointment flow, and escalation rules. That makes the behavior easier to debug and safer to evolve.
I’d also be interested in how you’re monitoring call quality over time — not just success/failure, but things like handoff rate, caller frustration signals, tool-call latency, booking completion, and cases where the AI confidently went down the wrong path.