Alessandro Binda

Building a WhatsApp AI Assistant for Small Business: Architecture and Lessons Learned

WhatsApp has 2.5 billion monthly active users. In Europe, it is the primary communication channel for most people — not email, not phone calls, not SMS. For small businesses, this creates an enormous opportunity: meet customers where they already are.

But building a reliable WhatsApp AI assistant is significantly harder than it looks. This post documents the architecture we settled on after many iterations, the mistakes we made, and the things that actually work in production.

Why WhatsApp, Not a Chatbot Widget

The answer is simple: nobody installs apps, nobody visits your website's chat widget at midnight, and nobody reads email subject lines anymore. WhatsApp is different. People have it open all day. Messages get read within minutes. Response rates are 5-10x higher than email.

For a small business — a dental clinic, a real estate agency, a consulting firm — this is transformative. Instead of playing phone tag with clients, you have a persistent, asynchronous communication channel that works around the clock.

The Architecture We Use

Our implementation for the S.C.A.L.A. platform uses the following stack:

```
WhatsApp Cloud API (Meta)
        ↓
Webhook receiver (Fastify / Node.js)
        ↓
Message classifier (intent detection)
        ↓
RAG pipeline (retrieval-augmented generation)
        ↓
LLM (Mistral 7B / Groq / fallback chain)
        ↓
Response formatter
        ↓
WhatsApp Cloud API (send)
```

Message Classification First

Before hitting the LLM, every incoming message goes through a lightweight classifier. This is not an LLM — it is a fine-tuned intent classifier that runs in milliseconds. Its job is to assign one of ~20 intents: booking_request, price_inquiry, complaint, document_request, general_question, etc.

Why bother? Because 60-70% of messages can be answered with template responses or simple database lookups, without touching the expensive LLM at all. This keeps costs low and latency down.
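The routing decision in front of the LLM can be sketched as follows. The intent names and sets here are illustrative, not our production taxonomy, and the fine-tuned classifier itself is out of scope:

```javascript
// Hypothetical routing layer: template-able intents skip the LLM entirely,
// certain intents always go to a human, everything else hits the RAG pipeline.
const TEMPLATE_INTENTS = new Set(['price_inquiry', 'opening_hours', 'document_request']);
const HUMAN_INTENTS = new Set(['complaint']);

function routeMessage(intent) {
  if (HUMAN_INTENTS.has(intent)) return 'human';       // always escalate
  if (TEMPLATE_INTENTS.has(intent)) return 'template'; // cheap DB/template path
  return 'llm';                                        // RAG + LLM path
}
```

Because the cheap paths are resolved before any model call, the expensive path only runs for the minority of messages that actually need it.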

RAG for Business-Specific Knowledge

The LLM on its own does not know anything about your specific business — your opening hours, your pricing, your team, your policies. RAG (retrieval-augmented generation) solves this by injecting relevant context into the prompt before asking the model to respond.

We store business knowledge in a vector database. When a question comes in, we embed the query, find the top-k most relevant chunks from the knowledge base, and inject them into the prompt. The model then answers based on the actual business data, not its training data.

```javascript
// Simplified RAG pipeline
async function buildPrompt(query, businessId) {
  const business = await db.getBusiness(businessId); // name, settings, etc.
  const embedding = await embed(query);
  const context = await vectorDB.search(embedding, {
    businessId,
    topK: 5
  });

  return `
    You are a helpful assistant for ${business.name}.
    Context from our knowledge base:
    ${context.map(c => c.text).join('\n\n')}

    Customer question: ${query}

    Answer based only on the context above. If you don't know, say so.
  `;
}
```

The Fallback Chain

LLMs fail. APIs go down. Rate limits get hit. You need a fallback chain:

  1. Primary: Groq API (fast, generous free tier)
  2. Secondary: Mistral API (reliable, good quality)
  3. Tertiary: Local Ollama instance (your own hardware, always available)
  4. Final fallback: Template response + flag for human review

The fallback chain means the assistant never goes dark. A degraded response is better than no response.
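The chain boils down to a loop over providers. This is a minimal sketch: each provider function stands in for a real client (Groq, Mistral, local Ollama), and any of them may throw or time out:

```javascript
// Try each provider in order; if all fail, fall back to a template response
// and mark the result as degraded so it can be flagged for human review.
async function generateWithFallback(prompt, providers, templateResponse) {
  for (const provider of providers) {
    try {
      return { text: await provider(prompt), degraded: false };
    } catch (err) {
      // Log and move on to the next provider in the chain.
      console.warn(`provider failed: ${err.message}`);
    }
  }
  return { text: templateResponse, degraded: true };
}
```

The `degraded` flag matters: downstream code uses it to decide whether the reply goes out automatically or lands in the review queue.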

Lessons Learned the Hard Way

Lesson 1: Context Window Management Is Critical

WhatsApp conversations can run for weeks. You cannot stuff the entire conversation history into every prompt — you will hit token limits and costs will spiral. We use a sliding window: the last N messages, plus a summary of earlier context generated every K turns.
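A minimal sketch of the sliding-window history builder, assuming `summary` is produced by a periodic summarization call (the window size and the field names are illustrative):

```javascript
// Keep only the last N raw messages; older context survives as a one-line summary.
function buildHistory(messages, summary, N = 10) {
  const recent = messages.slice(-N);
  const parts = [];
  if (summary) parts.push(`Summary of earlier conversation: ${summary}`);
  for (const m of recent) parts.push(`${m.role}: ${m.text}`);
  return parts.join('\n');
}
```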

Lesson 2: Tone Calibration Takes Time

Italian users (our primary market) are more formal in business contexts than the English-speaking world assumes. We spent weeks calibrating the tone — not too robotic, not too casual, with appropriate courtesy markers that feel natural in Italian business communication.

Lesson 3: Handoff to Humans Must Be Seamless

The AI cannot handle everything. Angry customers, complex complaints, legally sensitive questions — these need a human. The handoff mechanism must be invisible from the customer's perspective: the conversation continues in the same WhatsApp thread, but a human agent takes over.

We use a simple flagging system: high-confidence responses are sent automatically, low-confidence responses are queued for human review before sending, and certain intent categories always require human approval.
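The flagging logic reduces to a small dispatch function. The threshold and the always-human set here are illustrative values, not our production configuration:

```javascript
// Route a candidate response based on intent category and model confidence.
const ALWAYS_HUMAN = new Set(['complaint', 'legal_question']);

function dispatch(intent, confidence) {
  if (ALWAYS_HUMAN.has(intent)) return 'human_approval'; // policy override
  if (confidence >= 0.85) return 'auto_send';            // high confidence
  return 'review_queue';                                 // human checks before send
}
```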

Lesson 4: WhatsApp Template Messages Are Restrictive

Meta enforces strict rules on template messages (messages sent outside the 24-hour customer service window). They must be pre-approved, cannot be promotional, and have limited formatting options. This catches many developers off guard. Plan your template library early and get them approved before you need them.
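For reference, a Cloud API template message payload looks roughly like this. The template name and parameters are placeholders; the named template must already be approved in Meta's WhatsApp Manager before it can be sent:

```javascript
// Body of a POST to the Cloud API /{phone-number-id}/messages endpoint.
const templatePayload = {
  messaging_product: 'whatsapp',
  to: '391234567890',             // recipient, international format
  type: 'template',
  template: {
    name: 'appointment_reminder', // hypothetical pre-approved template
    language: { code: 'it' },
    components: [
      { type: 'body', parameters: [{ type: 'text', text: 'Mario' }] }
    ]
  }
};
```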

Lesson 5: Message Queue Is Not Optional

Do not send WhatsApp messages synchronously from your webhook handler. Use a queue. Message delivery is not always instant, rate limits apply, and you need retry logic. A proper message queue (BullMQ, Redis-backed) makes this manageable.
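The pattern can be sketched with an in-memory stand-in: the webhook handler only enqueues, and a separate worker sends with retries. In production this role is played by BullMQ backed by Redis, which adds persistence, backoff, and concurrency control:

```javascript
// Toy outbound queue illustrating the enqueue/worker split and retry logic.
const outbox = [];

function enqueueOutgoing(message) {
  outbox.push({ message, attempts: 0 });
}

async function drainOutbox(send, maxAttempts = 3) {
  while (outbox.length > 0) {
    const job = outbox.shift();
    try {
      await send(job.message);
    } catch (err) {
      job.attempts += 1;
      if (job.attempts < maxAttempts) {
        outbox.push(job); // requeue for a later retry
      }
      // else: dead-letter the job and flag it for human review
    }
  }
}
```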

What the Assistant Can Handle in Production

Our production deployment handles:

  • Appointment booking and rescheduling (with calendar integration)
  • FAQ responses (with RAG over the business knowledge base)
  • Document requests (sending PDFs, contracts, invoices via WhatsApp)
  • Lead qualification (collecting contact details and requirements from new inquiries)
  • Escalation to human agents (with context transfer)

The system is available as part of the S.C.A.L.A. AI OS, where businesses can configure the assistant via a no-code interface — setting the knowledge base, defining custom responses, and reviewing conversation logs.

Performance Numbers

In our production deployments:

  • ~68% of messages handled fully automatically
  • Average response latency: 2.3 seconds
  • Escalation rate: ~12% (messages flagged for human review)
  • Customer satisfaction: consistently higher than phone-only support (based on follow-up surveys)

What We Would Do Differently

If we were starting over, we would invest more time upfront in the knowledge base structure. The quality of RAG responses is almost entirely determined by the quality and organization of your business knowledge base. Garbage in, garbage out — the LLM cannot make up for missing or poorly structured context.

We would also build the human handoff mechanism from day one, not as an afterthought.


The WhatsApp AI assistant (SARA) is part of the S.C.A.L.A. platform. Source architecture details are available in our engineering documentation at get-scala.com.
