After running AI agents in production for 18 months across content, sales calls, and WhatsApp customer support, here's the playbook I wish I had on day 1.
The 3 things that actually move the needle
Most AI agent tutorials online focus on prompts and tools. In production, those matter least. What matters:
- Failure modes before features — your agent will hallucinate, lose context, or pick the wrong tool. Build observability before you build capability.
- Cost ceilings per session — without hard caps, one user can burn $50 of tokens in a single conversation. Cap tokens-per-session, retries-per-tool-call, and total runtime.
- Human handoff triggers — "agent confidence below X" or "user said 'human' twice" → handoff. Customers tolerate AI that says "let me get someone" 10x more than AI that bullshits an answer.
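A minimal sketch of that handoff trigger, with an illustrative confidence floor and keyword check (none of these names come from a specific framework; they're stand-ins for whatever your agent exposes):

```python
# Hypothetical handoff trigger: names and thresholds are illustrative.
CONFIDENCE_FLOOR = 0.6
HUMAN_KEYWORDS = ("human", "agent", "person", "representative")

def should_hand_off(confidence: float, recent_user_messages: list[str]) -> bool:
    """Escalate when the agent is unsure or the user has asked for a human twice."""
    if confidence < CONFIDENCE_FLOOR:
        return True
    human_requests = sum(
        any(kw in msg.lower() for kw in HUMAN_KEYWORDS)
        for msg in recent_user_messages
    )
    return human_requests >= 2
```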
The agent stack that survived
After 4 rewrites:
┌─────────────────────────────┐
│      Channel adapters       │  WhatsApp / web chat / phone
└──────────────┬──────────────┘
               │
┌──────────────▼──────────────┐
│  Orchestrator (LangGraph)   │  state machine, not LLM-as-controller
└──────────────┬──────────────┘
               │
┌──────────────▼──────────────┐
│        Tool registry        │  search, db query, send email, escalate
└──────────────┬──────────────┘
               │
┌──────────────▼──────────────┐
│  Observability (LangSmith)  │  every tool call + token + error logged
└─────────────────────────────┘
Switching from LLM-as-controller to a LangGraph state machine cut our bug rate by 60%. The LLM proposes the next state; the state machine enforces which transitions are allowed.
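A stripped-down sketch of that split, assuming a recent LangGraph release (the node names, the AgentState shape, and the stubbed classifier are all illustrative): the LLM only proposes a label for the next step, and the graph's edges decide whether that transition exists at all.

```python
from typing import TypedDict
from langgraph.graph import StateGraph, END

class AgentState(TypedDict):
    messages: list   # conversation so far
    next_step: str   # label proposed by the LLM

def classify(state: AgentState) -> AgentState:
    # In production this is an LLM call that proposes the next step;
    # stubbed here so the sketch runs standalone.
    state["next_step"] = "answer"
    return state

def answer(state: AgentState) -> AgentState:
    state["messages"].append("final answer")
    return state

def escalate(state: AgentState) -> AgentState:
    state["messages"].append("handing off to a human")
    return state

graph = StateGraph(AgentState)
graph.add_node("classify", classify)
graph.add_node("answer", answer)
graph.add_node("escalate", escalate)
graph.set_entry_point("classify")

# The graph, not the LLM, decides which transitions are legal.
graph.add_conditional_edges(
    "classify",
    lambda s: s["next_step"],
    {"answer": "answer", "escalate": "escalate"},
)
graph.add_edge("answer", END)
graph.add_edge("escalate", END)

app = graph.compile()
```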
Per-channel quirks I learned the hard way
WhatsApp: Meta enforces a 24-hour customer service window. Once it closes, you can only send pre-approved template messages until the user replies again, so build re-engagement templates.
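A hedged sketch of the re-engagement path against the WhatsApp Cloud API; the Graph API version, template name, and window-tracking logic are placeholders to adapt to your setup:

```python
import time
import requests

GRAPH_URL = "https://graph.facebook.com/v19.0/{phone_number_id}/messages"
WINDOW_SECONDS = 24 * 60 * 60

def reengage_if_window_closed(last_user_msg_ts: float, to: str,
                              phone_number_id: str, token: str) -> None:
    """Outside the 24h window only pre-approved templates are deliverable."""
    if time.time() - last_user_msg_ts < WINDOW_SECONDS:
        return  # window still open, free-form messages are fine
    payload = {
        "messaging_product": "whatsapp",
        "to": to,
        "type": "template",
        # "reengage_checkin" is a placeholder template name registered with Meta.
        "template": {"name": "reengage_checkin", "language": {"code": "en_US"}},
    }
    requests.post(
        GRAPH_URL.format(phone_number_id=phone_number_id),
        json=payload,
        headers={"Authorization": f"Bearer {token}"},
        timeout=10,
    )
```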
Voice (Twilio + STT): keep end-to-end latency under 800ms or users hang up. Stream tokens as they generate; don't wait for the full response.
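The latency win comes from pushing partial tokens into the TTS pipeline instead of waiting for the full completion. A minimal sketch with the OpenAI Python SDK, where `speak_chunk` is a stand-in for whatever feeds your Twilio audio stream:

```python
from openai import OpenAI

client = OpenAI()

def stream_reply(messages: list[dict], speak_chunk) -> str:
    """Forward tokens to the TTS pipeline as they arrive to keep latency low."""
    stream = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        stream=True,
    )
    full_reply = []
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            speak_chunk(delta)       # hand partial text to TTS immediately
            full_reply.append(delta)
    return "".join(full_reply)
```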
Web chat: typing indicators matter. Show "AI thinking..." within 200ms; the real LLM response can take 3s, so show progress in the meantime.
The cost trap
In month 3 we hit $4,200 in OpenAI bills for one client because of:
- A loop where the agent retried failed tool calls 5x each
- No timeout on long conversations
- Two misclassified intents that kept bouncing conversations back and forth between specialized agents
After fixing: same client, same volume, $380/month.
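All three fixes boil down to hard ceilings checked on every turn. A minimal sketch of that kind of guard, with illustrative limits (tune per channel and client):

```python
import time
from dataclasses import dataclass, field

@dataclass
class SessionGuard:
    """Hard ceilings per conversation; numbers are illustrative, not defaults."""
    max_tokens: int = 20_000
    max_retries_per_tool: int = 2
    max_runtime_s: int = 600
    started_at: float = field(default_factory=time.time)
    tokens_used: int = 0
    tool_retries: dict = field(default_factory=dict)

    def check(self) -> None:
        # Called before every LLM or tool call.
        if self.tokens_used > self.max_tokens:
            raise RuntimeError("token budget exceeded, escalate to human")
        if time.time() - self.started_at > self.max_runtime_s:
            raise RuntimeError("session timed out, escalate to human")

    def allow_retry(self, tool_name: str) -> bool:
        self.tool_retries[tool_name] = self.tool_retries.get(tool_name, 0) + 1
        return self.tool_retries[tool_name] <= self.max_retries_per_tool
```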
Token economics:
- Track $ per resolved ticket, not $ per LLM call (see the sketch after this list)
- A bad architecture with cheap models often costs more than a smart architecture with GPT-4o
- Caching system prompts saved 35% across the board
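Tracking the first point is a single division, but it's the number worth reporting to clients. A sketch, assuming you already log token counts and resolution outcomes; the per-million-token prices are assumptions, so plug in your model's actual pricing:

```python
def cost_per_resolved_ticket(prompt_tokens: int, completion_tokens: int,
                             resolved_tickets: int,
                             prompt_price: float = 2.50,
                             completion_price: float = 10.00) -> float:
    """Dollars per resolved ticket; prices are assumed per 1M tokens."""
    total_cost = (
        (prompt_tokens / 1e6) * prompt_price
        + (completion_tokens / 1e6) * completion_price
    )
    return total_cost / max(resolved_tickets, 1)
```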
What I built around all this
Three of my own products run on this stack:
- vocalis-ai.org is the voice AI agent for outbound sales calls
- agentic-whatsup.com is the WhatsApp business agent layer
- agents-ia.pro is where I publish playbooks for B2B AI deployments
For the content side I run vocalis.pro — it generates SEO articles for the same SMBs that buy the agents. Cross-pollination works. And ai-due.com is the AI-for-SMB-ops project.
What I'd skip if starting today
- LangChain agents (use LangGraph directly)
- Vector DBs for small contexts (just use full prompt)
- Multi-agent collaboration patterns (95% of use cases need 1 agent + good tools, not 5 agents)
- Custom embeddings (OpenAI's text-embedding-3-small is fine for 90% of use cases)
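For that last point, the stock embedding call is a couple of lines with the OpenAI SDK (the inputs here are just examples):

```python
from openai import OpenAI

client = OpenAI()

# Stock embeddings cover most retrieval needs at this scale.
resp = client.embeddings.create(
    model="text-embedding-3-small",
    input=["refund policy for annual plans", "how do I cancel my subscription?"],
)
vectors = [item.embedding for item in resp.data]
```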
Open question
Anyone running production agents on Claude Sonnet 4.5 vs GPT-4o for tool use? Curious about real-world latency + tool call accuracy comparison. I've stayed on GPT-4o for tool reliability but Claude's coding is sharper for code-gen agents.