What I learned building an AI voice agent stack solo (Vapi + n8n, 2 months in)

Sam — Sat, 13 Jun 2026 21:09:32 +0000

Two months ago I started building voice agents for small service businesses: dental clinics and HVAC companies that lose real money every time a call goes to voicemail. I'm doing it solo, alongside a day job, which means every wrong turn costs me a weekend I don't get back.

Here's what actually went wrong, what I'd tell myself on day one, and the parts of the stack that held up.

The stack, briefly

Vapi for the voice layer (speech-to-text, the LLM turn, text-to-speech)
n8n self-hosted on a cheap VPS for orchestration — booking lookups, calendar writes, follow-up triggers
A Google Sheets + n8n layer for scheduling and logging while I'm pre-revenue and don't want to pay for tooling I haven't validated

Nothing here is exotic. That was the point. I wanted boring, debuggable infrastructure I could reason about at 11pm.

Lesson 1: The hard problem isn't the AI. It's the handoff.

I assumed the voice model would be the scary part. It wasn't. Modern voice platforms handle the conversation surprisingly well out of the box.

The actual pain was everything around the conversation — what happens when the agent needs to check an appointment slot, write to a calendar, or hand off to a human gracefully. That orchestration logic is where I lost the most time, and it's the part no demo video ever shows you.

If you're evaluating this space: budget your time for the plumbing, not the model.

Lesson 2: Self-hosting n8n is worth it, but prune your execution data or die

Running n8n in Docker on a small VPS is genuinely fine for low volume. What nobody warned me about: execution data accumulates fast and will quietly eat your disk.

The fix is one environment variable that turns on execution-data pruning.

Set it early. I found out the way you'd expect — a workflow failing for no obvious reason, an hour of confusion, then a df -h showing a nearly full disk.
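The exact variable isn't shown above, so treat this as an assumption: n8n's documentation lists EXECUTIONS_DATA_PRUNE (on/off) and EXECUTIONS_DATA_MAX_AGE (retention in hours) for this. Below is a minimal preflight sketch that checks both are set and that the disk still has headroom. The function name and the 80% threshold are made up for illustration.

```python
import os
import shutil

def pruning_warnings(env, disk_path="/", max_used_fraction=0.8):
    """Return human-readable warnings about pruning config and disk headroom."""
    warnings = []
    # Assumed n8n setting: EXECUTIONS_DATA_PRUNE switches pruning on.
    if env.get("EXECUTIONS_DATA_PRUNE", "").lower() != "true":
        warnings.append("EXECUTIONS_DATA_PRUNE is not explicitly set to true")
    # Assumed n8n setting: EXECUTIONS_DATA_MAX_AGE is the retention window in hours.
    if not env.get("EXECUTIONS_DATA_MAX_AGE", "").isdigit():
        warnings.append("EXECUTIONS_DATA_MAX_AGE is not set to a number of hours")
    # Same signal as `df -h`, but checked before a workflow fails.
    usage = shutil.disk_usage(disk_path)
    used_fraction = usage.used / usage.total
    if used_fraction > max_used_fraction:
        warnings.append(f"disk at {disk_path} is {used_fraction:.0%} full")
    return warnings

if __name__ == "__main__":
    for warning in pruning_warnings(dict(os.environ)):
        print("WARNING:", warning)
```

With Docker, the variables would typically be passed as `-e EXECUTIONS_DATA_PRUNE=true -e EXECUTIONS_DATA_MAX_AGE=168`. The 168 hours (one week) is only an illustrative value, so pick a window that matches how far back you actually debug.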

Lesson 3: Cold outreach taught me more than my landing page did

I ran a cold email campaign to roughly 1,600 leads over two months. Clean domain warmup, SPF/DKIM/DMARC all verified, aggregate reports showing no auth failures.

Replies: basically zero.

That stung, but it was useful. It forced me to confront that deliverability being technically correct and the message being compelling are completely different problems. The infrastructure was fine. The offer and the targeting weren't sharp enough yet. No amount of DNS hygiene fixes a message that doesn't land.

Lesson 4: Narrow beats broad, faster than I expected

Early on I wanted to serve "service businesses." Too vague. The moment I picked one vertical and wrote scripts for specific call patterns — new patient booking, after-hours emergencies, the weird edge cases a real receptionist handles — everything got easier. The demos got sharper. The objections got predictable.

If you're building anything agent-shaped: pick the narrowest viable slice and over-fit to it. You can generalize later.

What I'd tell myself on day one

The model is the easy 20%. Plan for the orchestration.
Turn on data pruning before you need it.
Correct infrastructure ≠ a message people respond to. Validate the offer separately.
Go narrower than feels comfortable.

If you're building in the voice-agent or automation space and have hit the same walls, I'd genuinely like to compare notes in the comments.

I'm building VoiceIntego, AI voice agents for service businesses, mostly so businesses stop losing jobs to voicemail. Still early. Happy to talk shop.

Building an AI Voice Agent for Appointment Booking: What I Learned

Sam — Fri, 05 Jun 2026 09:53:17 +0000

Over the past few months I’ve been building VoiceIntego, an AI voice agent that answers calls and books appointments for service businesses (dental clinics, HVAC, plumbing). Here are some of the technical lessons that surprised me along the way.

Latency is the whole game

With text chatbots, a 2-second delay is fine. On a phone call, anything over ~800ms feels broken — people start talking over the AI. The hard part isn’t the LLM response; it’s the round trip: speech-to-text → LLM → text-to-speech, all streaming. You have to stream every stage and start TTS before the full response is generated.
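One way to picture "start TTS before the full response is generated" is to cut the token stream into sentence-sized chunks and hand each one to the speech stage as soon as it completes. This is a rough sketch, not how any particular platform does it. The `sentence_chunks` name and the punctuation rule are assumptions for illustration.

```python
from typing import Iterable, Iterator

SENTENCE_END = (".", "!", "?")

def sentence_chunks(tokens: Iterable[str]) -> Iterator[str]:
    """Yield a speakable chunk as soon as a sentence boundary arrives."""
    buffer = ""
    for token in tokens:
        buffer += token
        # Flush at the first sentence boundary instead of waiting for the end.
        if buffer.rstrip().endswith(SENTENCE_END):
            yield buffer.strip()
            buffer = ""
    # Whatever is left when the stream ends still needs to be spoken.
    if buffer.strip():
        yield buffer.strip()

if __name__ == "__main__":
    streamed = ["Sure", ".", " I", " can", " book", " that", ".", " What", " day"]
    for chunk in sentence_chunks(streamed):
        print("send to TTS:", chunk)
```

Because it is a generator, the first chunk is available after the first sentence rather than after the whole reply. The boundary rule is deliberately naive: it will also split on decimals and abbreviations, which a real pipeline would need to handle.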

Interruptions break naive pipelines

Real callers interrupt. “Actually, can we do Tuesday instead—” mid-sentence. A simple request/response loop can’t handle this. You need barge-in detection: monitor the incoming audio stream and cancel the current TTS playback the moment the caller starts speaking again.
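The cancel-on-speech idea can be sketched with asyncio: run playback as a task, race it against a "caller is speaking" signal, and cancel whichever loses. Everything here is a stand-in. `speak` fakes TTS playback with a sleep, and the `caller_speaking` event would in practice be set by voice-activity detection on the incoming audio, which a hosted voice platform may already handle for you.

```python
import asyncio

async def speak(chunks, played):
    """Stand-in for TTS playback: 'plays' one chunk at a time."""
    for chunk in chunks:
        await asyncio.sleep(0.1)  # pretend each chunk takes time to play
        played.append(chunk)

async def play_with_barge_in(chunks, caller_speaking: asyncio.Event):
    """Play chunks until done, or stop as soon as the caller speaks."""
    played = []
    playback = asyncio.create_task(speak(chunks, played))
    barge_in = asyncio.create_task(caller_speaking.wait())
    done, _ = await asyncio.wait(
        {playback, barge_in}, return_when=asyncio.FIRST_COMPLETED
    )
    interrupted = barge_in in done and playback not in done
    # Cancel whichever task is still running, then let cancellation settle.
    for task in (playback, barge_in):
        if not task.done():
            task.cancel()
    await asyncio.gather(playback, barge_in, return_exceptions=True)
    return played, interrupted

async def demo():
    caller_speaking = asyncio.Event()
    # Simulate the caller cutting in partway through the reply.
    asyncio.get_running_loop().call_later(0.15, caller_speaking.set)
    return await play_with_barge_in(["one", "two", "three"], caller_speaking)

if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Returning what was actually played matters: the next LLM turn should know the caller only heard part of the reply.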

Booking logic needs guardrails, not vibes

Letting the LLM “decide” availability is a recipe for double-bookings. The reliable pattern: the LLM extracts intent (date, time, service), then deterministic code checks the actual calendar API and confirms. The model handles language; your code handles truth.
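A minimal sketch of that split, assuming the LLM's output has already been parsed into a structured intent. The `BookingIntent` type, the `try_book` helper, and the in-memory `booked` list are hypothetical stand-ins for a real calendar API call.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass(frozen=True)
class BookingIntent:
    """What the model extracted from the conversation, nothing more."""
    service: str
    start: datetime
    duration: timedelta

def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open intervals [start, end) share any time."""
    return a_start < b_end and b_start < a_end

def try_book(intent, booked):
    """Record the slot in `booked` and return True only if nothing overlaps it."""
    end = intent.start + intent.duration
    for b_start, b_end in booked:
        if overlaps(intent.start, end, b_start, b_end):
            return False  # the code, not the model, refuses the double-booking
    booked.append((intent.start, end))
    return True

if __name__ == "__main__":
    booked = []
    cleaning = BookingIntent("cleaning", datetime(2026, 6, 9, 14, 0), timedelta(minutes=30))
    print(try_book(cleaning, booked))  # first request for the slot
    print(try_book(cleaning, booked))  # same slot requested again
```

The model never sees or asserts availability here. It only fills in the intent, and the agent's reply is driven by the boolean that comes back. Against a real calendar, the check and the write would also need to be atomic so that two simultaneous calls can't both pass the check.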

Confirmation loops matter more than you’d think

Always read the booking back: “So that’s a cleaning on Tuesday the 9th at 2pm — correct?” Phone audio is noisy and names/times get misheard constantly. One extra confirmation turn cuts errors dramatically.
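The read-back is easiest to get right when it is generated from the structured booking rather than from whatever the model last said. A small sketch, with hypothetical helper names:

```python
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

def ordinal(day: int) -> str:
    """1 -> '1st', 2 -> '2nd', 11 -> '11th', 23 -> '23rd'."""
    if 11 <= day % 100 <= 13:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(day % 10, "th")
    return f"{day}{suffix}"

def spoken_time(when: datetime) -> str:
    """14:00 -> '2pm', 9:30 -> '9:30am'."""
    hour = when.hour % 12 or 12
    meridiem = "am" if when.hour < 12 else "pm"
    minutes = f":{when.minute:02d}" if when.minute else ""
    return f"{hour}{minutes}{meridiem}"

def read_back(service: str, when: datetime) -> str:
    """Build the confirmation sentence from the booking the code will write."""
    weekday = WEEKDAYS[when.weekday()]
    return (f"So that's a {service} on {weekday} the {ordinal(when.day)} "
            f"at {spoken_time(when)}, correct?")

if __name__ == "__main__":
    print(read_back("cleaning", datetime(2026, 6, 9, 14, 0)))
    # So that's a cleaning on Tuesday the 9th at 2pm, correct?
```

Saying the weekday and the date together gives the caller two chances to catch a misheard day.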

Phone numbers and edge cases everywhere

Voicemail detection, callers who mumble, background noise, people who say “yeah” to mean no. The happy path is maybe 20% of the work.
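For the "yeah" that means no, one conservative approach is to accept a confirmation only when the reply is unambiguous and re-ask otherwise. This is an illustrative sketch with made-up word lists, not a tested classifier:

```python
import re

# Hypothetical word lists; a real deployment would tune these per vertical.
AFFIRM = {"yes", "yeah", "yep", "correct", "right", "sure"}
NEGATE = {"no", "nope", "not", "wrong", "don't", "doesn't", "can't"}

def interpret_confirmation(utterance: str) -> str:
    """Return 'yes', 'no', or 'unclear' for a reply to a confirmation question."""
    words = set(re.findall(r"[a-z']+", utterance.lower()))
    has_yes = bool(words & AFFIRM)
    has_no = bool(words & NEGATE)
    if has_yes and not has_no:
        return "yes"
    if has_no and not has_yes:
        return "no"
    # Mixed signals ("yeah, no...") or nothing recognizable: ask again.
    return "unclear"

if __name__ == "__main__":
    for reply in ["Yeah, that's right", "No", "Yeah, no, that doesn't work", "umm"]:
        print(repr(reply), "->", interpret_confirmation(reply))
```

Treating mixed signals as "unclear" costs one extra turn, which is cheaper than writing a wrong booking to the calendar.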

If you’re building something in this space, happy to compare notes. You can see what I’m working on at VoiceIntego.