Ujjawal Tyagi
Shipping AI WhatsApp Automation to Production: Lessons from Growara

Most AI-on-WhatsApp demos fall apart in production. The customer asks the same question three different ways. Sentiment shifts mid-conversation. The AI confidently answers something it doesn't actually know — and gets it wrong.

Building Growara at Xenotix Labs — an AI-powered WhatsApp automation platform — taught us that you don't deploy an LLM to WhatsApp. You deploy a system that contains an LLM.

Here's the architecture we settled on after shipping to production for businesses handling 100k+ messages a month.

The system, not the model

A naive WhatsApp AI bot looks like this: send the message to the LLM, return whatever it says. Works for the first hundred messages. Then a customer asks "kya tum refund de doge?" ("will you give me a refund?") and the LLM hallucinates a refund policy that doesn't exist.

The production system looks like this:

  1. WhatsApp → Webhook
  2. Intent Classifier (small fast model)
  3. Knowledge Retrieval (vector store of business policies)
  4. LLM with retrieved context
  5. Confidence Check
  6. Reply / Escalate to Human / Ask Clarifying Question

Each layer is doing one job, and the LLM is just one component — not the whole brain.
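
Concretely, the orchestration reads like a short routing function. Here's a minimal TypeScript sketch; every function here is a stub standing in for a layer described below, and all names, thresholds, and reply strings are illustrative, not our production code:

```typescript
type Intent = "transactional" | "informational" | "conversational" | "out_of_scope";

// Stubs so the sketch type-checks; each layer below sketches its own piece.
async function classifyIntent(_msg: string): Promise<Intent> { return "informational"; }   // Layer 1
async function retrieveContext(_tenant: string, _msg: string): Promise<string[]> { return []; } // Layer 2
async function generateWithContext(_msg: string, _ctx: string[]): Promise<string> { return ""; } // LLM call
async function shouldEscalate(_draft: string, _ctx: string[]): Promise<boolean> { return true; } // Layer 3

async function handleInbound(tenantId: string, message: string): Promise<string> {
  const intent = await classifyIntent(message);
  if (intent === "conversational") return "Hi! How can I help?";                        // templated reply
  if (intent === "out_of_scope") return "Sorry, that's outside what I can help with."; // polite decline

  const chunks = await retrieveContext(tenantId, message);  // per-tenant knowledge base
  const draft = await generateWithContext(message, chunks); // LLM with retrieved context
  const risky = await shouldEscalate(draft, chunks);        // confidence check

  return risky ? "Connecting you with a teammate." : draft; // reply or escalate
}
```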

Layer 1: Intent classification before LLM

Every inbound message gets classified before it touches an LLM. Categories:

  • Transactional intent (refund, cancel, change address) → high-stakes path
  • Informational (store hours, return policy, product specs) → retrieval path
  • Conversational (greeting, smalltalk) → templated reply
  • Out-of-scope (medical, legal, anything unrelated) → polite decline

We use a small fine-tuned classifier model running locally — not the main LLM. Fast, cheap, deterministic. Don't pay LLM token costs for a routing decision.
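
For illustration, here's what that pre-LLM hop can look like, assuming the classifier is served behind a local HTTP endpoint (the /classify route, response shape, and 0.7 cutoff are assumptions for the sketch, not Growara's actual interface):

```typescript
type Intent = "transactional" | "informational" | "conversational" | "out_of_scope";

// One cheap local call decides the path before any LLM tokens are spent.
async function classifyIntent(message: string): Promise<Intent> {
  const res = await fetch("http://localhost:8080/classify", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ text: message }),
  });
  const { label, score } = (await res.json()) as { label: Intent; score: number };
  // If the classifier itself is unsure, fall through to the safest path
  // (the high-stakes one, which ends with a human anyway).
  return score >= 0.7 ? label : "transactional";
}
```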

Layer 2: Retrieval-augmented context

For informational intents, we never let the LLM "freestyle." We retrieve relevant content from a vector store of the business's actual policies, product catalog, FAQs, and shipping rules.

The retrieved chunks are passed into the LLM as context with a hard system prompt: "Answer ONLY using the retrieved context below. If the answer is not in the context, say you don't know and offer to escalate."

This is the difference between hallucinated refund policies and accurate ones.
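
A sketch of how those chunks become the prompt (the chunk shape and message format here are illustrative; the quoted instruction is the part that matters):

```typescript
interface Chunk { source: string; text: string; } // one retrieved passage, e.g. from "returns-policy.md"

// Build a grounded prompt: the model only ever sees retrieved business content.
function buildGroundedPrompt(chunks: Chunk[], question: string) {
  const context = chunks.map((c, i) => `[${i + 1}] (${c.source}) ${c.text}`).join("\n");
  const system =
    "Answer ONLY using the retrieved context below. " +
    "If the answer is not in the context, say you don't know and offer to escalate.\n\n" +
    `Context:\n${context}`;
  return [
    { role: "system", content: system },
    { role: "user", content: question },
  ];
}
```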

Layer 3: Confidence scoring

Every LLM response goes through a second pass that scores three things:

  • Confidence: does the response actually match the retrieved context?
  • Sentiment: is the customer frustrated, happy, or neutral?
  • Action commitment: does the response promise something the system can't actually do?

If confidence is low, or the response promises an action we can't fulfill, we escalate to a human.
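
Shaped as code, the second pass is a small grader whose output drives the routing decision (field names and the 0.75 threshold are illustrative; in production the grader is its own model call, stubbed here):

```typescript
interface SecondPass {
  grounded: number;                               // 0..1: does the reply match the retrieved context?
  sentiment: "negative" | "neutral" | "positive"; // the customer's mood, tracked for escalation rules
  commitsAction: boolean;                         // does the reply promise something we can't do?
}

// Stub grader: stands in for the real second-pass model call.
async function gradeReply(_reply: string, _context: string[]): Promise<SecondPass> {
  return { grounded: 1, sentiment: "neutral", commitsAction: false };
}

async function shouldEscalate(reply: string, context: string[]): Promise<boolean> {
  const score = await gradeReply(reply, context);
  return score.grounded < 0.75 || score.commitsAction; // low confidence or unfulfillable promise
}
```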

Layer 4: Hard escalation triggers

Some things never go to the LLM. Hard-coded triggers immediately route to a human:

  • Mentions of "complaint", "refund over [X]", "lawyer", "consumer court"
  • Sentiment sharply negative for 2+ messages in a row
  • Customer routed through AI 5+ times without resolution
  • Any payment, refund, or monetary commitment over a threshold

Better to over-escalate than to let an AI commit to a refund the business can't honor.
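
These rules are plain code with no model in the loop. A sketch (patterns and thresholds are illustrative; each business tunes its own):

```typescript
// Hard triggers: checked before anything reaches the LLM.
const TRIGGER_PATTERNS = [/complaint/i, /lawyer/i, /consumer court/i];

interface ConversationState {
  negativeStreak: number;    // consecutive sharply negative messages
  unresolvedAiTurns: number; // times routed through AI without resolution
  pendingAmount: number;     // money the reply would commit to, if any
}

function mustEscalate(message: string, state: ConversationState, amountLimit: number): boolean {
  if (TRIGGER_PATTERNS.some((re) => re.test(message))) return true;
  if (state.negativeStreak >= 2) return true;
  if (state.unresolvedAiTurns >= 5) return true;
  if (state.pendingAmount > amountLimit) return true;
  return false;
}
```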

The infrastructure

The Meta WhatsApp Business API has its own rules — message templates for outbound, 24-hour customer-service window, opt-in management, rate limits per phone number.

Our stack:

  • Webhook receiver — Node.js handling Meta webhooks, signature verification (sketched after this list), dedup
  • Message queue — RabbitMQ for inbound buffering and outbound rate limiting
  • AI orchestration — Node.js running classify → retrieve → LLM → confidence pipeline
  • Vector store — per-tenant knowledge bases (each business has its own)
  • Conversation store — PostgreSQL for canonical history + audit log
  • Human handoff — Next.js dashboard with pre-drafted AI responses for human review
  • Template manager — for managing Meta-approved message templates
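
One concrete piece of that receiver: Meta signs every webhook delivery with an X-Hub-Signature-256 header (the HMAC-SHA256 of the raw request body, keyed by your app secret), and anything that doesn't verify gets rejected. A minimal Node.js check:

```typescript
import crypto from "node:crypto";

// Verify Meta's webhook signature: X-Hub-Signature-256 is "sha256=" plus the
// hex HMAC-SHA256 of the raw body, keyed by the app secret.
function verifyMetaSignature(rawBody: Buffer, header: string | undefined, appSecret: string): boolean {
  if (!header?.startsWith("sha256=")) return false;
  const expected = "sha256=" + crypto.createHmac("sha256", appSecret).update(rawBody).digest("hex");
  const a = Buffer.from(header);
  const b = Buffer.from(expected);
  // timingSafeEqual throws on unequal lengths, so guard first;
  // the constant-time compare avoids leaking the signature byte by byte.
  return a.length === b.length && crypto.timingSafeEqual(a, b);
}
```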

Cost guardrails

LLMs cost per token. Without ceilings, one chatty user in a single long conversation can cost more than they're worth.

Our rules:

  • Per-conversation token budget (10k input + 2k output max)
  • Per-user daily cap (50k tokens/day max per phone)
  • Per-tenant monthly cap (configurable; alerts at 80%)
  • Cheaper models for non-critical paths (classification, summarization use smaller models)

When a budget is hit, degrade gracefully: hand off to a human or send a templated "we'll get back to you".
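
A sketch of the check that runs before each LLM call (the numbers mirror the rules above; the usage counters would come from a store that's stubbed out here):

```typescript
const LIMITS = {
  convoInput: 10_000, // per-conversation input tokens
  convoOutput: 2_000, // per-conversation output tokens
  userDaily: 50_000,  // per-phone daily tokens
};

interface Usage { convoInput: number; convoOutput: number; userDaily: number; }

type BudgetCheck = { ok: true } | { ok: false; fallback: "human_handoff" | "template_reply" };

function checkBudget(usage: Usage, nextInput: number): BudgetCheck {
  if (usage.userDaily + nextInput > LIMITS.userDaily)
    return { ok: false, fallback: "template_reply" }; // "we'll get back to you"
  if (usage.convoInput + nextInput > LIMITS.convoInput || usage.convoOutput >= LIMITS.convoOutput)
    return { ok: false, fallback: "human_handoff" };  // conversation too long for AI
  return { ok: true };
}
```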

Evals before deployment

Every prompt change, every retrieval tweak, every model swap goes through an evaluation suite of ~500 representative conversations. Eval scores are tracked in CI like test coverage — a regression below threshold blocks the merge.

Without evals, "improvements" silently break edge cases. With evals, you ship with confidence.
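
As a sketch, the CI gate is just a script that exits non-zero on regression (the report and baseline file shapes are assumptions for illustration):

```typescript
import fs from "node:fs";

interface EvalReport { passRate: number; cases: number; } // e.g. { passRate: 0.94, cases: 500 }

// Compare this run's eval results against the committed baseline;
// a non-zero exit fails the CI job and blocks the merge.
function gateOnEvals(reportPath: string, baselinePath: string): void {
  const report: EvalReport = JSON.parse(fs.readFileSync(reportPath, "utf8"));
  const baseline: EvalReport = JSON.parse(fs.readFileSync(baselinePath, "utf8"));
  const floor = baseline.passRate - 0.01; // tolerate one point of run-to-run noise
  if (report.passRate < floor) {
    console.error(`Eval regression: ${report.passRate.toFixed(3)} < ${floor.toFixed(3)}`);
    process.exit(1);
  }
}

gateOnEvals("evals/report.json", "evals/baseline.json");
```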

What we'd do differently

  • Build evals on day one. Reverse-engineering an eval suite after 6 months of production is brutal.
  • Don't trust LLM JSON output. Run it through a schema validator or a smaller checking model before acting on it (see the sketch after this list).
  • Pre-translate, then prompt. For multilingual (Hindi/English/Tamil), translating to English before the main LLM was more reliable than mixed-script prompts.
  • Log everything. Every classification, every retrieved chunk, every score, every prompt version. When something goes wrong in production, you'll need all of it.
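
For the JSON point, a sketch using zod (any schema validator works; the field names are illustrative):

```typescript
import { z } from "zod";

// Never act on LLM "JSON" until it both parses and matches the expected shape.
const ReplySchema = z.object({
  text: z.string().min(1),
  intent: z.enum(["transactional", "informational", "conversational", "out_of_scope"]),
  confidence: z.number().min(0).max(1),
});

function parseLlmReply(raw: string): z.infer<typeof ReplySchema> | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null; // not even valid JSON
  }
  const result = ReplySchema.safeParse(data);
  return result.success ? result.data : null; // null → retry or escalate, never act blindly
}
```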

Building AI-powered customer engagement?

Whether it's WhatsApp, in-app chat, voice, or email — production AI is a discipline. The model is 10% of the work; the rest is the system around it. If you're building in this space, Xenotix Labs has shipped AI customer-engagement products that survive real customer load. Reach out at https://xenotixlabs.com.
