After running AI agents in production for 18 months across content, sales calls, and WhatsApp customer support, here's the playbook I wish I had on day 1.
The 3 things that actually move the needle
Most AI agent tutorials online focus on prompts and tools. In production, those matter least. What matters:
- Failure modes before features — your agent will hallucinate, lose context, or pick the wrong tool. Build observability before you build capability.
- Cost ceilings per session — without hard caps, one user can burn $50 of tokens in a single conversation. Cap tokens-per-session, retries-per-tool-call, and total runtime.
- Human handoff triggers — "agent confidence below X" or "user said 'human' twice" → handoff. Customers tolerate AI that says "let me get someone" 10x more than AI that bullshits an answer.
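A minimal sketch of that handoff trigger, with an illustrative confidence floor and keyword check (none of these names come from a specific framework; they're stand-ins for whatever your agent exposes):

```python
# Hypothetical handoff trigger: names and thresholds are illustrative.
CONFIDENCE_FLOOR = 0.6
HUMAN_KEYWORDS = ("human", "agent", "person", "representative")

def should_hand_off(confidence: float, recent_user_messages: list[str]) -> bool:
    """Escalate when the agent is unsure or the user has asked for a human twice."""
    if confidence < CONFIDENCE_FLOOR:
        return True
    human_requests = sum(
        any(kw in msg.lower() for kw in HUMAN_KEYWORDS)
        for msg in recent_user_messages
    )
    return human_requests >= 2
```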
The agent stack that survived
After 4 rewrites:
┌─────────────────────────────┐
│      Channel adapters       │  WhatsApp / web chat / phone
└──────────────┬──────────────┘
               │
┌──────────────▼──────────────┐
│  Orchestrator (LangGraph)   │  state machine, not LLM-as-controller
└──────────────┬──────────────┘
               │
┌──────────────▼──────────────┐
│        Tool registry        │  search, db query, send email, escalate
└──────────────┬──────────────┘
               │
┌──────────────▼──────────────┐
│  Observability (LangSmith)  │  every tool call + token + error logged
└─────────────────────────────┘
Switching from LLM-as-controller to a LangGraph state machine cut our bug rate by 60%. The LLM proposes the next state; the state machine enforces which transitions are allowed.
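A stripped-down sketch of that split, assuming a recent LangGraph release (the node names, the AgentState shape, and the stubbed classifier are all illustrative): the LLM only proposes a label for the next step, and the graph's edges decide whether that transition exists at all.

```python
from typing import TypedDict
from langgraph.graph import StateGraph, END

class AgentState(TypedDict):
    messages: list   # conversation so far
    next_step: str   # label proposed by the LLM

def classify(state: AgentState) -> AgentState:
    # In production this is an LLM call that proposes the next step;
    # stubbed here so the sketch runs standalone.
    state["next_step"] = "answer"
    return state

def answer(state: AgentState) -> AgentState:
    state["messages"].append("final answer")
    return state

def escalate(state: AgentState) -> AgentState:
    state["messages"].append("handing off to a human")
    return state

graph = StateGraph(AgentState)
graph.add_node("classify", classify)
graph.add_node("answer", answer)
graph.add_node("escalate", escalate)
graph.set_entry_point("classify")

# The graph, not the LLM, decides which transitions are legal.
graph.add_conditional_edges(
    "classify",
    lambda s: s["next_step"],
    {"answer": "answer", "escalate": "escalate"},
)
graph.add_edge("answer", END)
graph.add_edge("escalate", END)

app = graph.compile()
```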
Per-channel quirks I learned the hard way
WhatsApp: Meta enforces a 24-hour customer service window. Once it closes, you can only send pre-approved template messages until the user replies again, so build re-engagement templates.
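A hedged sketch of the re-engagement path against the WhatsApp Cloud API; the Graph API version, template name, and window-tracking logic are placeholders to adapt to your setup:

```python
import time
import requests

GRAPH_URL = "https://graph.facebook.com/v19.0/{phone_number_id}/messages"
WINDOW_SECONDS = 24 * 60 * 60

def reengage_if_window_closed(last_user_msg_ts: float, to: str,
                              phone_number_id: str, token: str) -> None:
    """Outside the 24h window only pre-approved templates are deliverable."""
    if time.time() - last_user_msg_ts < WINDOW_SECONDS:
        return  # window still open, free-form messages are fine
    payload = {
        "messaging_product": "whatsapp",
        "to": to,
        "type": "template",
        # "reengage_checkin" is a placeholder template name registered with Meta.
        "template": {"name": "reengage_checkin", "language": {"code": "en_US"}},
    }
    requests.post(
        GRAPH_URL.format(phone_number_id=phone_number_id),
        json=payload,
        headers={"Authorization": f"Bearer {token}"},
        timeout=10,
    )
```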
Voice (Twilio + STT): keep end-to-end latency under 800ms or users hang up. Stream tokens as they generate; don't wait for the full response.
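The latency win comes from pushing partial tokens into the TTS pipeline instead of waiting for the full completion. A minimal sketch with the OpenAI Python SDK, where `speak_chunk` is a stand-in for whatever feeds your Twilio audio stream:

```python
from openai import OpenAI

client = OpenAI()

def stream_reply(messages: list[dict], speak_chunk) -> str:
    """Forward tokens to the TTS pipeline as they arrive to keep latency low."""
    stream = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        stream=True,
    )
    full_reply = []
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            speak_chunk(delta)       # hand partial text to TTS immediately
            full_reply.append(delta)
    return "".join(full_reply)
```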
Web chat: typing indicators matter. Show "AI thinking..." within 200ms; the real LLM response can take 3s, so show progress in the meantime.
The cost trap
In month 3 we hit $4,200 in OpenAI bills for one client because of:
- A loop where the agent retried failed tool calls 5x each
- No timeout on long conversations
- Two misclassified intents that kept bouncing conversations back and forth between specialized agents
After fixing: same client, same volume, $380/month.
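All three fixes boil down to hard ceilings checked on every turn. A minimal sketch of that kind of guard, with illustrative limits (tune per channel and client):

```python
import time
from dataclasses import dataclass, field

@dataclass
class SessionGuard:
    """Hard ceilings per conversation; numbers are illustrative, not defaults."""
    max_tokens: int = 20_000
    max_retries_per_tool: int = 2
    max_runtime_s: int = 600
    started_at: float = field(default_factory=time.time)
    tokens_used: int = 0
    tool_retries: dict = field(default_factory=dict)

    def check(self) -> None:
        # Called before every LLM or tool call.
        if self.tokens_used > self.max_tokens:
            raise RuntimeError("token budget exceeded, escalate to human")
        if time.time() - self.started_at > self.max_runtime_s:
            raise RuntimeError("session timed out, escalate to human")

    def allow_retry(self, tool_name: str) -> bool:
        self.tool_retries[tool_name] = self.tool_retries.get(tool_name, 0) + 1
        return self.tool_retries[tool_name] <= self.max_retries_per_tool
```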
Token economics:
- Track $ per resolved ticket, not $ per LLM call (see the sketch after this list)
- A bad architecture with cheap models often costs more than a smart architecture with GPT-4o
- Caching system prompts saved 35% across the board
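Tracking the first point is a single division, but it's the number worth reporting to clients. A sketch, assuming you already log token counts and resolution outcomes; the per-million-token prices are assumptions, so plug in your model's actual pricing:

```python
def cost_per_resolved_ticket(prompt_tokens: int, completion_tokens: int,
                             resolved_tickets: int,
                             prompt_price: float = 2.50,
                             completion_price: float = 10.00) -> float:
    """Dollars per resolved ticket; prices are assumed per 1M tokens."""
    total_cost = (
        (prompt_tokens / 1e6) * prompt_price
        + (completion_tokens / 1e6) * completion_price
    )
    return total_cost / max(resolved_tickets, 1)
```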
What I built around all this
Three of my own products run on this stack:
- vocalis-ai.org is the voice AI agent for outbound sales calls
- agentic-whatsup.com is the WhatsApp business agent layer
- agents-ia.pro is where I publish playbooks for B2B AI deployments
For the content side I run vocalis.pro — it generates SEO articles for the same SMBs that buy the agents. Cross-pollination works. And ai-due.com is the AI-for-SMB-ops project.
What I'd skip if starting today
- LangChain agents (use LangGraph directly)
- Vector DBs for small contexts (just use full prompt)
- Multi-agent collaboration patterns (95% of use cases need 1 agent + good tools, not 5 agents)
- Custom embeddings (OpenAI's text-embedding-3-small is fine for 90% of use cases)
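For that last point, the stock embedding call is a couple of lines with the OpenAI SDK (the inputs here are just examples):

```python
from openai import OpenAI

client = OpenAI()

# Stock embeddings cover most retrieval needs at this scale.
resp = client.embeddings.create(
    model="text-embedding-3-small",
    input=["refund policy for annual plans", "how do I cancel my subscription?"],
)
vectors = [item.embedding for item in resp.data]
```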
Open question
Anyone running production agents on Claude Sonnet 4.5 vs GPT-4o for tool use? Curious about real-world latency + tool call accuracy comparison. I've stayed on GPT-4o for tool reliability but Claude's coding is sharper for code-gen agents.