How I Built a Production WhatsApp AI Assistant with Gemini, Groq, and LanceDB

TL;DR: I built a self-hosted WhatsApp AI assistant that never goes down — it chains 3 LLM providers (Gemini → Groq → Ollama), remembers everything with vector search, transcribes voice notes locally with Whisper, reads your PDFs, and supports 20+ commands. The whole thing runs on a $5/mo VPS.

⭐ Star it on GitHub if you find this useful!

Graph showing how many messages I was able to automate.


The Problem

I wanted a WhatsApp assistant that could:

  • Answer questions using multiple AI models (not just one)
  • Remember context from past conversations via RAG
  • Transcribe and respond to voice notes
  • Analyze images sent in chat
  • Download media from YouTube, TikTok, Instagram, and Spotify
  • Be monitored in real-time from a dashboard

Existing solutions were either closed-source, limited to a single model, or didn't support voice/vision. So I built my own.

The Architecture

WhatsApp (via whatsapp-web.js)
    │
    ├── Message Router
    │   ├── Command Handler (20+ commands)
    │   │   ├── !download (yt-dlp multi-platform)
    │   │   ├── !read (PDF/DOCX/XLSX parser)
    │   │   ├── !draw (image generation)
    │   │   ├── !ocr (image text extraction)
    │   │   ├── !search (SearxNG web search)
    │   │   └── !learn (RAG knowledge ingestion)
    │   │
    │   └── AI Engine (3-tier cascade)
    │       ├── Tier 1: Gemini (primary)
    │       ├── Tier 2: Groq (fallback)
    │       └── Tier 3: Ollama (local fallback)
    │
    ├── RAG Pipeline
    │   ├── LanceDB (vector store)
    │   ├── Embedding generation
    │   └── Semantic search
    │
    ├── Voice Pipeline
    │   ├── OGG → WAV conversion
    │   └── Whisper (local STT)
    │
    └── Dashboard (Express + WebSocket)
        ├── Live conversation feed
        ├── Token usage analytics
        └── System health metrics
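The Message Router branch of the diagram can be sketched as a small dispatcher. This is a minimal illustration rather than the project's actual code: the names `parseCommand` and `routeMessage`, the handler map, and the rule that unknown commands fall through to the AI engine are all assumptions based on the `!`-prefixed commands listed above.

```javascript
// Hypothetical sketch of the Message Router; names and shapes are assumptions.

// Split "!learn https://example.com" into { name: 'learn', args: 'https://example.com' }.
// Returns null when the text is not a command.
function parseCommand(text) {
  const match = /^!(\w+)\s*([\s\S]*)$/.exec(text.trim());
  if (!match) return null;
  return { name: match[1].toLowerCase(), args: match[2].trim() };
}

// Route a message to its command handler, or to the AI engine otherwise.
// `handlers` is a Map from command name to an async function taking the argument string.
async function routeMessage(text, handlers, aiEngine) {
  const command = parseCommand(text);
  if (command && handlers.has(command.name)) {
    return handlers.get(command.name)(command.args);
  }
  // Plain messages (and unknown commands) go to the 3-tier AI engine.
  return aiEngine(text);
}
```

Keeping the handlers in a `Map` means adding a new command is a single `handlers.set('name', fn)` call, with no changes to the routing logic.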

The 3-Tier LLM Cascade

The most interesting design decision was the AI cascade. Instead of relying on a single provider, the bot tries them in order:

async function generateResponse(prompt, context) {
  // Tier 1: Try Gemini (best quality, rate-limited)
  try {
    return await geminiGenerate(prompt, context);
  } catch (e) {
    // Log the reason so outages and rate limits show up in the logs
    console.warn(`Gemini failed (${e.message}), falling back to Groq...`);
  }
  // Tier 2: Try Groq (fast, generous free tier)
  try {
    return await groqGenerate(prompt, context);
  } catch (e) {
    console.warn(`Groq failed (${e.message}), falling back to Ollama...`);
  }
  // Tier 3: Local Ollama (always available, slower); errors here propagate to the caller
  return await ollamaGenerate(prompt, context);
}
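One gap in a plain try/catch chain is a provider that hangs instead of failing, since the fallback never triggers. Below is a hedged sketch of the same cascade written as a loop with a per-provider timeout. The `cascade` and `withTimeout` helpers, the provider object shape, and the 15-second default are assumptions for illustration, not part of the project.

```javascript
// Sketch only: helper names, provider shape, and timeout value are assumptions.

// Reject if `promise` does not settle within `ms` milliseconds.
function withTimeout(promise, ms, label) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms} ms`)), ms);
  });
  // Whichever settles first wins; always clear the timer afterwards.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Try each provider in order and return the first successful response.
// `providers` is an array of { name, generate } objects.
async function cascade(providers, prompt, context, timeoutMs = 15000) {
  const errors = [];
  for (const { name, generate } of providers) {
    try {
      const text = await withTimeout(generate(prompt, context), timeoutMs, name);
      return { provider: name, text };
    } catch (err) {
      // Remember why this tier failed, then move on to the next one.
      errors.push(`${name}: ${err.message}`);
    }
  }
  throw new Error(`All providers failed (${errors.join('; ')})`);
}
```

With the functions from the post, the call would look like `cascade([{ name: 'gemini', generate: geminiGenerate }, { name: 'groq', generate: groqGenerate }, { name: 'ollama', generate: ollamaGenerate }], prompt, context)`. Returning the provider name alongside the text also makes it easy to track usage per tier.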

Why this matters:

  • Zero downtime: if one provider is down or rate-limited, the next one picks up
  • Cost optimization: Gemini and Groq have generous free tiers
  • Privacy option: Ollama runs entirely locally

RAG: Teaching the Bot Your Knowledge

The !learn command lets you feed documents into a LanceDB vector store. When someone asks a question, the bot performs semantic search before answering:

User: !learn https://mycompany.com/docs/faq
Bot: ✅ Learned 47 chunks from FAQ page
User: What's the return policy?
Bot: Based on your FAQ, returns are accepted within 30 days 
     with original packaging. Here's the process...

This means the bot doesn't just answer from its training data — it answers from your documents.
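To make the retrieval step concrete, here is a minimal in-memory sketch of what happens between `!learn` and the answer. It illustrates the technique under stated assumptions and is not the project's code. In the real bot, LanceDB stores the vectors and runs the nearest-neighbour search, and the vectors come from an embedding model. The function names, chunk size, and overlap below are hypothetical.

```javascript
// Illustrative sketch; LanceDB and an embedding model do this work in the real bot.

// Split text into overlapping chunks so each one fits the embedding model.
function chunkText(text, size = 500, overlap = 50) {
  if (overlap >= size) throw new RangeError('overlap must be smaller than size');
  const chunks = [];
  for (let start = 0; start < text.length; start += size - overlap) {
    chunks.push(text.slice(start, start + size));
    if (start + size >= text.length) break; // last chunk reached the end
  }
  return chunks;
}

// Cosine similarity between two equal-length vectors (0 if either is all zeros).
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Return the k stored rows ({ text, vector }) most similar to the query vector.
function topK(queryVector, rows, k = 3) {
  return rows
    .map((row) => ({ text: row.text, score: cosineSimilarity(queryVector, row.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

// Put the retrieved chunks in front of the question before calling the LLM.
function buildPrompt(question, hits) {
  const contextBlock = hits.map((hit, i) => `[${i + 1}] ${hit.text}`).join('\n');
  return `Answer using only the context below.\n\nContext:\n${contextBlock}\n\nQuestion: ${question}`;
}
```

The prompt produced by `buildPrompt` is what would be handed to the AI engine, which is why the answer can quote the FAQ rather than the model's training data.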

Voice Notes with Local Whisper

When someone sends a voice message, the bot:

  1. Downloads the OGG audio from WhatsApp
  2. Converts it to WAV using FFmpeg
  3. Transcribes it using a local Whisper model
  4. Feeds the transcript to the AI engine

No cloud APIs needed for transcription: it all runs on your machine.

The Real-Time Dashboard

The Express + WebSocket dashboard shows:

  • 📊 Live conversation feed with timestamps
  • 📈 Token usage per model provider
  • 🖥️ System health (CPU, RAM, uptime)
  • 🔧 Configuration management

Running It Yourself

git clone https://github.com/Charly-bite/whatsapp-ai-bot
cd whatsapp-ai-bot
npm install
cp .env.example .env
# Add your API keys to .env
npm start

Scan the QR code with WhatsApp, and you're live.

What I Learned

  1. LLM cascading is a production pattern more people should use
  2. RAG with LanceDB is surprisingly easy to set up (no external DB needed)
  3. Local Whisper is good enough for voice notes (no API costs)
  4. PM2 is essential for production Node.js bots (auto-restart, logs, monitoring)
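For the PM2 point, a minimal `ecosystem.config.js` might look like the sketch below. The app name, entry script, memory limit, and log paths are assumptions chosen for illustration. Check what `npm start` actually runs in the repo before copying it.

```javascript
// ecosystem.config.js (hypothetical values; adjust to the real project layout)
module.exports = {
  apps: [
    {
      name: 'whatsapp-ai-bot',
      script: 'index.js',             // assumed entry point
      autorestart: true,              // restart the process if it crashes
      max_memory_restart: '300M',     // restart if memory grows past this limit
      out_file: './logs/out.log',     // stdout log file
      error_file: './logs/error.log', // stderr log file
      time: true,                     // prefix log lines with timestamps
    },
  ],
};
```

With this file in place, `pm2 start ecosystem.config.js` launches the bot, `pm2 logs` tails its output, and `pm2 monit` opens the live monitoring view.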

Try It

The entire project is open source:
🔗 github.com/Charly-bite/whatsapp-ai-bot

If you found this useful, please consider dropping a ⭐ on the repo — it helps others discover the project!

I'm Carlos, a cybersecurity student at Universidad de Guadalajara building tools at the intersection of AI and security. Find me on GitHub and LinkedIn.
