DEV Community

Ademola Balogun

How to Build an AI-Powered WhatsApp Bot That Analyzes Images Using Python and Vision Models

WhatsApp has over 2 billion users. Most AI tools live on websites nobody visits. What if you could bring AI directly to where people already spend their time?

In this tutorial, I'll show you how to build a WhatsApp bot that accepts images, analyzes them using AI vision models, and responds with intelligent insights.

What We're Building

By the end of this guide, you'll have a working WhatsApp bot that:

  • Receives images from users via WhatsApp
  • Processes them using AI vision models (Llama, GPT-4V, or Claude)
  • Returns structured analysis in natural conversation
  • Stores conversation history in MongoDB
  • Runs on a free-tier cloud server

The entire stack costs nearly nothing to run at low volume, making it perfect for MVPs, side projects, or learning.

Why WhatsApp + AI Vision?

Before we dive into code, let's talk about why this combination is powerful.

Traditional AI apps require users to visit a website, create an account, and learn a new interface. WhatsApp bots eliminate all that friction. Users message your bot exactly like they'd message a friend.

AI vision models have become remarkably capable. They can identify objects, read text, understand context, and generate detailed descriptions. Combining this with WhatsApp's ubiquity creates tools that feel magical.

Some real-world applications:

  • Receipt scanners that extract expenses automatically
  • Plant identifiers for gardening enthusiasts
  • Food analyzers that estimate nutrition from photos
  • Document readers that summarize uploaded PDFs
  • Product lookup tools for shopping assistance

Prerequisites

You'll need:

  • Python 3.9+
  • A Meta Developer account (free)
  • A cloud AI provider account (Together AI, OpenAI, or Anthropic)
  • MongoDB Atlas account (free tier works)
  • A server with a public URL (we'll use ngrok for development)

Architecture Overview

Here's how the pieces fit together:

User sends image via WhatsApp
         ↓
Meta's WhatsApp Cloud API receives it
         ↓
Webhook forwards to your Flask server
         ↓
Server downloads image from Meta's CDN
         ↓
Image sent to AI Vision API for analysis
         ↓
Response formatted and sent back via WhatsApp API
         ↓
User receives analysis in their chat

The beauty of this architecture is its simplicity. One Python file handles everything.

Step 1: Set Up the Meta Developer Account

First, we need access to the WhatsApp Business API.

  1. Go to developers.facebook.com and create an account
  2. Click "My Apps" → "Create App"
  3. Select "Business" as the app type
  4. Name your app and click "Create"
  5. Find "WhatsApp" in the product list and click "Set Up"

Meta provides a free test phone number that works for development. You'll see it in the WhatsApp section of your app dashboard.

Note down these values from your dashboard:

  • Phone Number ID (under "From" phone number)
  • WhatsApp Business Account ID
  • Temporary Access Token (we'll make this permanent later)

Step 2: Set Up Your AI Vision Provider

For this tutorial, I'll use Together AI with their Llama Vision model because it's cost-effective and doesn't require waitlist approval. The code works with minor modifications for OpenAI's GPT-4V or Anthropic's Claude.

  1. Sign up at together.ai
  2. Get your API key from the dashboard
  3. Note the model name: meta-llama/Llama-4-Scout-17B-16E-Instruct

Together AI charges about $0.18 per million tokens for vision models—significantly cheaper than alternatives.

Step 3: Set Up MongoDB

We'll use MongoDB to store user sessions and analysis history.

  1. Create a free account at mongodb.com/atlas
  2. Create a new cluster (the free M0 tier works fine)
  3. Create a database user with read/write access
  4. Get your connection string (looks like mongodb+srv://user:pass@cluster.xxxxx.mongodb.net/)

Step 4: Project Structure

Create a new directory and set up these files:

whatsapp-ai-bot/
├── app.py           # Main application
├── .env             # Environment variables (don't commit this)
├── .env.example     # Template for environment variables
└── requirements.txt # Python dependencies

Step 5: Install Dependencies

Create requirements.txt:

flask==3.0.0
requests==2.31.0
pymongo==4.6.0
python-dotenv==1.0.0
together==1.0.0
gunicorn==21.2.0

Install them:

pip install -r requirements.txt

Step 6: Environment Variables

Create .env.example (commit this as a template):

# WhatsApp API
WHATSAPP_ACCESS_TOKEN=your_token_here
WHATSAPP_PHONE_NUMBER_ID=your_phone_id_here
WHATSAPP_VERIFY_TOKEN=any_random_string_you_choose

# AI Provider
TOGETHER_API_KEY=your_together_api_key

# Database
MONGODB_URI=mongodb+srv://user:pass@cluster.mongodb.net/dbname

Copy it to .env and fill in your actual values.
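A missing variable otherwise surfaces later as a confusing runtime error, so I like to fail fast at startup. A minimal sketch — the variable names match the `.env` above, but `missing_vars` is a helper I'm introducing here, not part of the final app.py:

```python
import os

# The five variables app.py reads; extend this list as you add providers.
REQUIRED = [
    "WHATSAPP_ACCESS_TOKEN",
    "WHATSAPP_PHONE_NUMBER_ID",
    "WHATSAPP_VERIFY_TOKEN",
    "TOGETHER_API_KEY",
    "MONGODB_URI",
]

def missing_vars(env=os.environ):
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED if not env.get(name)]

# In app.py, right after load_dotenv():
# if missing_vars():
#     raise SystemExit(f"Missing env vars: {', '.join(missing_vars())}")
```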

Step 7: The Main Application

Here's the complete app.py. I'll explain each section after:

import os
import json
import requests
from datetime import datetime
from flask import Flask, request, jsonify
from pymongo import MongoClient
from dotenv import load_dotenv
from together import Together

load_dotenv()

app = Flask(__name__)

# Configuration
WHATSAPP_TOKEN = os.getenv("WHATSAPP_ACCESS_TOKEN")
PHONE_NUMBER_ID = os.getenv("WHATSAPP_PHONE_NUMBER_ID")
VERIFY_TOKEN = os.getenv("WHATSAPP_VERIFY_TOKEN")
TOGETHER_API_KEY = os.getenv("TOGETHER_API_KEY")
MONGODB_URI = os.getenv("MONGODB_URI")

# Initialize clients
mongo_client = MongoClient(MONGODB_URI)
db = mongo_client.whatsapp_bot
together_client = Together(api_key=TOGETHER_API_KEY)

# Vision model configuration
VISION_MODEL = "meta-llama/Llama-4-Scout-17B-16E-Instruct"


def send_whatsapp_message(to, message):
    """Send a text message via WhatsApp API."""
    url = f"https://graph.facebook.com/v18.0/{PHONE_NUMBER_ID}/messages"
    headers = {
        "Authorization": f"Bearer {WHATSAPP_TOKEN}",
        "Content-Type": "application/json"
    }
    payload = {
        "messaging_product": "whatsapp",
        "to": to,
        "type": "text",
        "text": {"body": message}
    }

    response = requests.post(url, headers=headers, json=payload, timeout=30)
    return response.json()


def download_media(media_id):
    """Download media from WhatsApp's CDN."""
    # First, get the media URL
    url = f"https://graph.facebook.com/v18.0/{media_id}"
    headers = {"Authorization": f"Bearer {WHATSAPP_TOKEN}"}

    response = requests.get(url, headers=headers, timeout=30)
    media_url = response.json().get("url")

    if not media_url:
        return None

    # Download the actual file
    media_response = requests.get(media_url, headers=headers, timeout=30)
    return media_response.content


def analyze_image(image_data, prompt):
    """Send image to AI vision model for analysis."""
    import base64

    # Convert to base64
    image_base64 = base64.b64encode(image_data).decode("utf-8")
    image_url = f"data:image/jpeg;base64,{image_base64}"

    try:
        response = together_client.chat.completions.create(
            model=VISION_MODEL,
            messages=[{
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": image_url}},
                    {"type": "text", "text": prompt}
                ]
            }],
            max_tokens=500
        )
        return response.choices[0].message.content
    except Exception as e:
        print(f"Vision API error: {e}")
        return None


def get_analysis_prompt():
    """Return the prompt for image analysis."""
    return """Analyze this image and provide:

1. A brief description of what you see
2. Key details or notable elements
3. Any relevant insights or observations

Keep your response concise and conversational, suitable for a chat message."""


def log_interaction(phone_number, message_type, content, response):
    """Log the interaction to MongoDB."""
    db.interactions.insert_one({
        "phone_number": phone_number,
        "message_type": message_type,
        "content": content[:500] if content else None,
        "response": response[:500] if response else None,
        "timestamp": datetime.utcnow()
    })


@app.route("/webhook", methods=["GET"])
def verify_webhook():
    """Handle webhook verification from Meta."""
    mode = request.args.get("hub.mode")
    token = request.args.get("hub.verify_token")
    challenge = request.args.get("hub.challenge")

    if mode == "subscribe" and token == VERIFY_TOKEN:
        print("Webhook verified successfully")
        return challenge, 200

    return "Forbidden", 403


@app.route("/webhook", methods=["POST"])
def handle_webhook():
    """Process incoming WhatsApp messages."""
    data = request.json

    try:
        # Extract message details
        entry = data["entry"][0]
        changes = entry["changes"][0]
        value = changes["value"]

        # Check if this is a message (not a status update)
        if "messages" not in value:
            return jsonify({"status": "ok"}), 200

        message = value["messages"][0]
        phone_number = message["from"]
        message_type = message["type"]

        # Handle image messages
        if message_type == "image":
            media_id = message["image"]["id"]

            # Send acknowledgment
            send_whatsapp_message(
                phone_number, 
                "Got your image! Analyzing it now..."
            )

            # Download and analyze
            image_data = download_media(media_id)

            if image_data:
                analysis = analyze_image(image_data, get_analysis_prompt())

                if analysis:
                    send_whatsapp_message(phone_number, analysis)
                    log_interaction(phone_number, "image", "image_received", analysis)
                else:
                    send_whatsapp_message(
                        phone_number,
                        "Sorry, I couldn't analyze that image. Please try again."
                    )
            else:
                send_whatsapp_message(
                    phone_number,
                    "Sorry, I couldn't download that image. Please try again."
                )

        # Handle text messages
        elif message_type == "text":
            text = message["text"]["body"]

            # Simple response for non-image messages
            response = (
                "Hi! Send me an image and I'll analyze it for you.\n\n"
                "Just take a photo or send one from your gallery!"
            )
            send_whatsapp_message(phone_number, response)
            log_interaction(phone_number, "text", text, response)

        return jsonify({"status": "ok"}), 200

    except Exception as e:
        print(f"Error processing webhook: {e}")
        return jsonify({"status": "error"}), 500


@app.route("/health", methods=["GET"])
def health_check():
    """Simple health check endpoint."""
    return jsonify({"status": "healthy", "timestamp": datetime.utcnow().isoformat()})


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000, debug=True)

Step 8: Understanding the Code

Let's break down the key parts.

Webhook Verification

When you configure your webhook URL in Meta's dashboard, they send a GET request with a challenge. Your server must echo it back:

@app.route("/webhook", methods=["GET"])
def verify_webhook():
    mode = request.args.get("hub.mode")
    token = request.args.get("hub.verify_token")
    challenge = request.args.get("hub.challenge")

    if mode == "subscribe" and token == VERIFY_TOKEN:
        return challenge, 200

Processing Incoming Messages

WhatsApp sends POST requests to your webhook for each message. The nested JSON structure requires careful extraction:

message = value["messages"][0]
phone_number = message["from"]
message_type = message["type"]
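To make that structure concrete, here's a trimmed, illustrative payload of the shape Meta posts for an inbound image. Real payloads carry extra keys (metadata, contacts, message id and timestamp), and the values here are made up:

```python
# A trimmed, illustrative webhook payload for an inbound image message.
sample_webhook = {
    "entry": [{
        "changes": [{
            "value": {
                "messaging_product": "whatsapp",
                "messages": [{
                    "from": "15551234567",  # sender's phone number
                    "type": "image",
                    "image": {"id": "MEDIA_ID", "mime_type": "image/jpeg"},
                }],
            }
        }]
    }]
}

# The same extraction path the webhook handler walks:
value = sample_webhook["entry"][0]["changes"][0]["value"]
message = value["messages"][0]
print(message["from"], message["type"])  # 15551234567 image
```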

Downloading Media

WhatsApp doesn't send images directly. Instead, they provide a media ID. You must first get the download URL, then fetch the actual file:

def download_media(media_id):
    headers = {"Authorization": f"Bearer {WHATSAPP_TOKEN}"}

    # Get URL from media ID
    url = f"https://graph.facebook.com/v18.0/{media_id}"
    response = requests.get(url, headers=headers)
    media_url = response.json().get("url")

    # Download actual file
    media_response = requests.get(media_url, headers=headers)
    return media_response.content

Vision Analysis

The AI vision API accepts base64-encoded images. We send both the image and a text prompt:

response = together_client.chat.completions.create(
    model=VISION_MODEL,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": image_url}},
            {"type": "text", "text": prompt}
        ]
    }]
)

Step 9: Set Up the Webhook

For development, we'll use ngrok to expose your local server.

  1. Install ngrok from ngrok.com
  2. Run your Flask app: python app.py
  3. In another terminal, run: ngrok http 5000
  4. Copy the HTTPS URL (looks like https://abc123.ngrok.io)

Now configure the webhook in Meta's dashboard:

  1. Go to your app → WhatsApp → Configuration
  2. Click "Edit" next to Webhook
  3. Enter your URL: https://abc123.ngrok.io/webhook
  4. Enter your verify token (same as WHATSAPP_VERIFY_TOKEN in .env)
  5. Click "Verify and Save"
  6. Subscribe to "messages" events
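Before pointing Meta at your URL, you can sanity-check the handshake logic locally. This sketch mirrors the verify_webhook route with plain dicts — the query-param names are the real ones Meta sends, but the token and challenge values are made up:

```python
def verification_response(params, expected_token):
    """Echo hub.challenge only when the verify token matches."""
    if (params.get("hub.mode") == "subscribe"
            and params.get("hub.verify_token") == expected_token):
        return params.get("hub.challenge"), 200
    return "Forbidden", 403

good = {"hub.mode": "subscribe", "hub.verify_token": "my-secret", "hub.challenge": "12345"}
bad = {"hub.mode": "subscribe", "hub.verify_token": "wrong", "hub.challenge": "12345"}

print(verification_response(good, "my-secret"))  # ('12345', 200)
print(verification_response(bad, "my-secret"))   # ('Forbidden', 403)
```

If "Verify and Save" fails in the dashboard, this is the logic to re-check first, then ngrok.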

Step 10: Test Your Bot

  1. Open WhatsApp on your phone
  2. Message the test number shown in Meta's dashboard
  3. Send an image
  4. Watch your terminal for logs
  5. Receive the AI analysis!

If something doesn't work, check:

  • Is ngrok running and the URL current?
  • Are all environment variables set?
  • Is the webhook subscribed to "messages"?
  • Check Meta's webhook logs for delivery status

Step 11: Making the Access Token Permanent

The temporary access token expires in 24 hours. For production, create a permanent one:

  1. Go to your app → WhatsApp → API Setup
  2. Click "Add" under "Add a system user token"
  3. Create a system user if you haven't
  4. Generate a token with whatsapp_business_messaging permission
  5. This token won't expire
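To confirm the new token really is long-lived, the Graph API's debug_token endpoint reports a token's expiry and scopes. A small sketch — debug_token is a real endpoint, but the helper name and the use of an app token as access_token reflect my own setup, so treat the details as an assumption:

```python
GRAPH_BASE = "https://graph.facebook.com/v18.0"

def debug_token_url(input_token, app_token):
    """Build the Graph API URL that inspects `input_token`."""
    return (f"{GRAPH_BASE}/debug_token"
            f"?input_token={input_token}&access_token={app_token}")

# Run manually with real tokens; data.expires_at of 0 means "never expires":
# import requests
# info = requests.get(debug_token_url(SYSTEM_USER_TOKEN, APP_TOKEN)).json()
# print(info["data"].get("expires_at"))
```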

Step 12: Deploying to Production

For production, you need a server with a stable URL. Here are affordable options:

Railway

# Install Railway CLI
npm install -g @railway/cli

# Deploy
railway init
railway up

Render (Free tier available)

  1. Connect your GitHub repo
  2. Set environment variables in dashboard
  3. Deploy automatically on push

DigitalOcean

# On your droplet
git clone your-repo
cd your-repo
pip install -r requirements.txt
gunicorn app:app -b 0.0.0.0:5000

For any option, remember to:

  • Set all environment variables
  • Use HTTPS (required by WhatsApp)
  • Run with gunicorn instead of Flask's dev server
  • Set up process management (systemd, supervisor, or PM2)
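For the process-management point on a plain VPS like the DigitalOcean option, a systemd unit keeps gunicorn running across reboots. A sketch, assuming the repo lives at /opt/whatsapp-ai-bot and a dedicated `bot` user exists — adjust paths and user to your setup:

```ini
# /etc/systemd/system/whatsapp-bot.service
[Unit]
Description=WhatsApp AI bot
After=network.target

[Service]
User=bot
WorkingDirectory=/opt/whatsapp-ai-bot
EnvironmentFile=/opt/whatsapp-ai-bot/.env
ExecStart=/usr/local/bin/gunicorn app:app -b 0.0.0.0:5000
Restart=always

[Install]
WantedBy=multi-user.target
```

Enable it with `systemctl enable --now whatsapp-bot`.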

Step 13: Customizing for Your Use Case

The base code is intentionally generic. Here's how to adapt it for specific applications.

For a Receipt Scanner:

def get_analysis_prompt():
    return """Analyze this receipt image and extract:

1. Store/merchant name
2. Date of purchase
3. List of items with prices
4. Total amount
5. Payment method if visible

Format the response as a clear summary."""

For a Plant Identifier:

def get_analysis_prompt():
    return """Identify this plant and provide:

1. Plant name (common and scientific)
2. Key identifying features
3. Care requirements (water, sunlight)
4. Is it toxic to pets?

Keep it conversational and helpful."""

For a Food Analyzer:

def get_analysis_prompt():
    return """Analyze this food image and estimate:

1. What foods are present
2. Approximate calories
3. Protein, carbs, and fat estimates
4. Health rating from 1-10
5. A brief nutritional insight

Be helpful but note these are estimates."""

Step 14: Adding Conversation Context

To make your bot smarter, store and use conversation history:

def get_user_context(phone_number, limit=5):
    """Get recent interactions for context."""
    recent = db.interactions.find(
        {"phone_number": phone_number}
    ).sort("timestamp", -1).limit(limit)

    return list(recent)


def analyze_image_with_context(image_data, phone_number):
    """Include conversation history in analysis."""
    context = get_user_context(phone_number)

    context_text = ""
    if context:
        context_text = "Previous interactions:\n"
        for item in reversed(context):
            context_text += f"- {item.get('response', '')[:100]}\n"

    prompt = f"""{context_text}

Now analyze this new image. Consider any relevant context from previous interactions."""

    return analyze_image(image_data, prompt)

Performance Tips

After running this in production, here's what I've learned:

  1. Send acknowledgments immediately. Users get anxious if there's no response. Send "Analyzing..." before doing the heavy work.

  2. Cache repeated analyses. Hash incoming images and check if you've seen them before.

  3. Set timeout limits. Vision APIs can be slow. Set a 30-second timeout and send a graceful error if exceeded.

  4. Rate limit by user. Prevent abuse by limiting requests per phone number per hour.

  5. Monitor costs. Log API calls and set up billing alerts. Vision APIs charge per image.
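Tips 2 and 4 are small enough to sketch. This is an in-memory version with helper names of my own choosing; in production you'd back both with MongoDB so state survives restarts:

```python
import hashlib
import time
from collections import defaultdict, deque

analysis_cache = {}                  # sha256 of image bytes -> previous analysis
request_times = defaultdict(deque)   # phone number -> recent request timestamps

def image_key(image_data):
    """Tip 2: hash the image so identical uploads can skip the vision API."""
    return hashlib.sha256(image_data).hexdigest()

def allow_request(phone_number, limit=10, window=3600, now=None):
    """Tip 4: allow at most `limit` requests per number per `window` seconds."""
    now = time.time() if now is None else now
    times = request_times[phone_number]
    while times and now - times[0] > window:
        times.popleft()  # drop requests that fell out of the window
    if len(times) >= limit:
        return False
    times.append(now)
    return True
```

In handle_webhook, check allow_request(phone_number) before downloading media, and look up image_key(image_data) in analysis_cache before calling analyze_image.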

Common Pitfalls

"Webhook verification failed"

  • Your verify token doesn't match
  • Your server isn't accessible (check ngrok)
  • You're not returning the challenge correctly

"Message not delivered"

  • Access token expired (get a permanent one)
  • Phone number not in allowed list (in test mode)
  • Invalid phone number format

"Image download failed"

  • Access token doesn't have media.read permission
  • Media URL expired (they're temporary)
  • Network timeout

"Vision API error"

  • Image too large (resize before sending)
  • Unsupported format (stick to JPEG/PNG)
  • Rate limit hit
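For the "image too large" case, downscaling before the API call usually fixes it. A sketch using Pillow (add `pillow` to requirements.txt); the 1024px cap is my own default rather than a documented API limit:

```python
from io import BytesIO
from PIL import Image

def shrink_image(image_data, max_side=1024, quality=85):
    """Downscale and re-encode an image as JPEG before the vision call."""
    img = Image.open(BytesIO(image_data))
    img.thumbnail((max_side, max_side))  # in-place, preserves aspect ratio
    if img.mode != "RGB":
        img = img.convert("RGB")         # JPEG has no alpha channel
    out = BytesIO()
    img.save(out, format="JPEG", quality=quality)
    return out.getvalue()
```

Calling shrink_image(image_data) just before analyze_image also normalizes PNGs to JPEG, which covers the unsupported-format pitfall at the same time.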

What's Next?

This foundation supports many extensions:

  • Multi-language support: Detect user's language and respond accordingly
  • Voice messages: Transcribe audio and respond
  • Buttons and lists: Use WhatsApp's interactive message types
  • Payment integration: Connect to Stripe for premium features
  • Admin dashboard: Build a web interface for monitoring

Wrapping Up

You now have a complete AI-powered WhatsApp bot that can analyze images. The stack is simple, affordable, and scales well.

The combination of WhatsApp's reach and AI vision capabilities opens interesting possibilities. Users don't need to learn new interfaces—they just message like they always do.


This article was written based on my hands-on experience building production WhatsApp bots. The code examples are simplified for clarity—production deployments should include proper error handling, logging, and security measures.
