Rayhan Mahmood

Posted on • Originally published at nevermisshq.com

The part of building an AI receptionist nobody talks about

Most teams trying to build their own AI receptionist think the hard part is the AI.

It's not. The AI is the easy part now.

The hard part is everything around the AI. The part that doesn't show up in demos or tutorials. The part that takes six to eight months to figure out and breaks every time it goes near production.

I've watched a few teams try to build this themselves. They all hit the same wall.

  • 1000ms — Total latency budget per response
  • 6-8 mo — Orchestration buildout before production
  • 8 layers — Hidden under the 30-second demo

What they think they're building

You watch a Vapi or Retell demo. Agent answers a call, takes a booking, sends a confirmation. Looks simple.

So they think the build is:

  • Pick an LLM
  • Write some prompts
  • Pick a voice
  • Connect a phone number
  • Ship it

A weekend project.

What they're actually building

Here's what's underneath that 30-second demo.

Telephony layer. SIP trunking. Carrier integration. STIR/SHAKEN attestation so calls don't get marked as spam. Inbound number provisioning. Outbound caller ID verification. DTMF detection. Call recording compliance per state.

Audio infrastructure. Voice activity detection that doesn't false-trigger on background noise. Barge-in handling so the agent stops talking when the caller interrupts. Echo cancellation. Silence detection. Dropped audio recovery.
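Barge-in, for instance, reduces to a small decision: if the agent is mid-utterance and the VAD reports sustained caller speech, cancel playback. A minimal sketch, assuming a VAD that emits one boolean per 20ms audio frame (the frame size and the 100ms threshold are illustrative assumptions):

```python
# Minimal barge-in sketch. Assumes a VAD that yields True/False per
# 20ms frame; MIN_SPEECH_FRAMES is an illustrative threshold, tuned in
# practice to avoid triggering on coughs and background noise.

MIN_SPEECH_FRAMES = 5  # ~100ms of sustained speech before interrupting

def should_barge_in(vad_frames, agent_speaking: bool) -> bool:
    """Return True once the caller has spoken long enough to interrupt."""
    if not agent_speaking:
        return False  # nothing to interrupt
    run = 0
    for is_speech in vad_frames:
        run = run + 1 if is_speech else 0  # reset on any silent frame
        if run >= MIN_SPEECH_FRAMES:
            return True  # cancel TTS playback at this point
    return False
```

The consecutive-frame requirement is the part that separates real interruptions from false triggers on noise.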

Latency budget. The whole call has a 1000ms response window before it sounds robotic. That 1000ms gets split across speech-to-text, LLM inference, tool calls, text-to-speech, telephony round trip. Each one has to be optimized. Miss the budget and customers hang up.
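As a rough sketch, the split might look like this (every number below is an illustrative assumption, not a measurement from any real system):

```python
# Illustrative latency budget (ms) for one response turn.
# Stage allocations are assumed for the sketch, not measured values.
BUDGET_MS = 1000

stage_budget = {
    "stt_final": 200,        # speech-to-text finalizing the utterance
    "llm_first_token": 350,  # LLM time-to-first-token
    "tool_call": 150,        # optional CRM/calendar lookup
    "tts_first_audio": 200,  # text-to-speech time-to-first-audio
    "telephony_rtt": 100,    # network round trip to the carrier
}

def over_budget(measured_ms: dict) -> bool:
    """True if the measured stage latencies blow the response window."""
    return sum(measured_ms.values()) > BUDGET_MS
```

The point of writing it down as a budget is that any stage running over forces another stage to give milliseconds back.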

Tool reliability. The agent calls your CRM to book an appointment. The API times out at 8 seconds. Agent already said "perfect, you're booked for Thursday." Customer gets no confirmation. Shows up. No record. Trust gone.

State management. Call drops mid-conversation. Customer calls back. How does the agent know they were already 80% through booking? Handoff between inbound and outbound. Retry logic. Idempotency so the same booking doesn't get created twice.

Escalation logic. When does the agent transfer to a human? When does it just take a message? How does it handle threats, lawsuits, contract disputes, refund demands? These aren't AI problems. They're product problems with hard rules.

Monitoring. How do you know the agent is failing? You can't watch every call. You need three layers — system health (uptime, error rates), leading indicators (transfer rate, low-confidence responses), business outcomes (bookings, conversion, revenue).
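Those three layers can be rolled into a single health report. The thresholds below are illustrative assumptions, not recommendations:

```python
# Sketch of the three-layer health check described above.
# All threshold values are assumed for illustration.

def health_report(metrics: dict) -> dict:
    """Roll raw call metrics up into the three monitoring layers."""
    return {
        # Layer 1: is the system itself up and erroring rarely?
        "system": metrics["error_rate"] < 0.01 and metrics["uptime"] > 0.999,
        # Layer 2: leading indicators that something is drifting
        "leading": metrics["transfer_rate"] < 0.15
                   and metrics["low_confidence_rate"] < 0.10,
        # Layer 3: is the business outcome still happening?
        "business": metrics["booking_conversion"] > 0.30,
    }
```

The layering matters because each one fails at a different speed: system health breaks in minutes, leading indicators drift over days, and business outcomes only show up weeks later.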

Model and data drift. The LLM provider updates their model. Agent behavior shifts subtly. Nobody notices for two weeks. You find out when bookings drop 15%.

The build vs buy moment

This is the conversation I have with operators who think they want to build it themselves.

They're not wrong about the AI. Anyone can prompt an LLM to sound friendly on the phone.

They're wrong about the rest.

I talked to a guy who'd been building his own setup for 8 months. He had the agent working great in test calls. The moment he tried to ship it into production, everything broke.

His telephony provider's webhook signatures weren't matching. His CRM API was throwing 500s on bookings during peak hours. His agent was confirming bookings before the API actually wrote them, so customers got told they had appointments that didn't exist. His latency was 2.4 seconds because he was running STT → LLM → TTS sequentially instead of streaming.
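The webhook-signing failure is worth dwelling on, because the usual culprit is the comparison step, not the hash. A generic sketch of HMAC-SHA256 verification (header names and signing schemes vary by telephony provider, so treat this as the shape, not any particular provider's API):

```python
# Generic webhook signature verification sketch (HMAC-SHA256 over the
# raw request body). Providers differ in what they sign and which header
# carries the signature; this shows only the verification pattern.
import hashlib
import hmac

def verify_webhook(secret: bytes, raw_body: bytes, signature_hex: str) -> bool:
    """True if the signature matches the HMAC of the raw body."""
    expected = hmac.new(secret, raw_body, hashlib.sha256).hexdigest()
    # compare_digest is constant-time; it also forces you to compare the
    # exact strings, which surfaces the raw-body-vs-parsed-body mismatches
    # that "signatures aren't matching" bugs usually trace back to
    return hmac.compare_digest(expected, signature_hex)
```

A common failure mode is computing the HMAC over a re-serialized JSON body instead of the raw bytes the provider signed; the hashes then differ even though the secret is right.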

He asked me how long it took us to solve those problems.

About a year of running it in production with real shops.

He stopped trying to build his own.

The difference between a $300/month AI receptionist and one that actually works is everything underneath the conversation.

Why this matters if you're shopping

If you're an operator looking at AI receptionist providers, the question isn't "do you have an AI that sounds good." Every provider sounds good in the demo.

The question is "what happens when something goes wrong."

Ask them:

  • What's your average end-to-end response latency under load
  • How do you handle webhook timeouts on CRM bookings
  • What happens if a call drops mid-conversation
  • Show me how you detect false success — when the agent says "booked" but the booking didn't actually happen
  • What's your transfer rate to humans and what triggers it
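The false-success question in that list is checkable: reconcile what the agent claimed against what the CRM actually recorded. A sketch, with the record shapes assumed for illustration:

```python
# Sketch: surface "false success" by reconciling agent-claimed bookings
# against CRM records. Matching on (phone, slot) is an assumption; a real
# system would match on whatever uniquely identifies a booking.

def false_successes(agent_claims: list, crm_records: list) -> list:
    """Return claims the agent confirmed that have no matching CRM record."""
    recorded = {(r["phone"], r["slot"]) for r in crm_records}
    return [c for c in agent_claims
            if (c["phone"], c["slot"]) not in recorded]
```

Run as a periodic job, this turns a silent trust-destroying failure into an alert.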

Most cheap providers can't answer these. They shipped the demo. They didn't ship the production system.

The takeaway

Building the AI is no longer the hard part. The infrastructure around it is.

If you're an operator, ask the harder questions before you buy. The conversation quality is table stakes. The orchestration is what determines whether the agent actually books the job.

If you're a builder thinking about competing in this space, plan for six to eight months on the orchestration before you ship. Or pick a different problem. This one is solved by people who have already taken their lumps.

If you want to see what running the orchestration looks like from the operator side, my last long-form was on how I replaced hours of manual work with a self-hosted AI agent — same NeverMiss, different stack, full build log including the security layer most tutorials skip.

Want this kind of automation built into your business?

If you'd rather not spend six to eight months on the orchestration yourself, that's what NeverMiss does. nevermisshq.com

Top comments (1)

PEACEBINFLOW

The 1000ms latency budget is the detail that reframes everything else. It's not just a technical constraint — it's a product constraint that dictates your entire architecture before you've written a single prompt. Once you accept that the total window from caller finishing their sentence to agent beginning its response is one second, you realize you can't afford sequential processing, you can't afford retries on timeouts, and you definitely can't afford an LLM that thinks for two seconds before responding. The demo doesn't care about this because the demo isn't running under load. But in production, the latency budget is the thing that separates a conversation from an automated phone tree that happens to use natural language.

What makes this harder than it looks is that the latency budget isn't evenly distributed. Speech-to-text takes what it takes. Text-to-speech takes what it takes. The network round trips are physics. By the time you subtract those, the LLM and any tool calls are fighting for whatever milliseconds are left. And the tool calls are the unpredictable part — one slow CRM API response and you've blown the budget, and the caller has already decided this thing is broken and hung up.

The false success problem — where the agent says "booked" but the API didn't actually commit — feels like the truly scary one. A slow response is frustrating. A confidently wrong confirmation is trust-destroying. And the worst part is you might not know it's happening unless you're specifically monitoring for mismatches between what the agent said and what the CRM actually recorded. Most teams wouldn't think to build that reconciliation check until after the first customer complaint. I'm curious how much of that monitoring is standard across providers now, or if it's still mostly custom infrastructure that teams have to build themselves after getting burned.