DEV Community

Ademola Balogun

How to Build an AI-Powered WhatsApp Bot That Analyzes Images Using Python and Vision Models

WhatsApp has over 2 billion users. Most AI tools live on websites nobody visits. What if you could bring AI directly to where people already spend their time?

In this tutorial, I'll show you how to build a WhatsApp bot that accepts images, analyzes them using AI vision models, and responds with intelligent insights.

What We're Building

By the end of this guide, you'll have a working WhatsApp bot that:

  • Receives images from users via WhatsApp
  • Processes them using AI vision models (Llama, GPT-4V, or Claude)
  • Returns structured analysis in natural conversation
  • Stores conversation history in MongoDB
  • Runs on a free-tier cloud server

The entire stack costs nearly nothing to run at low volume, making it perfect for MVPs, side projects, or learning.

Why WhatsApp + AI Vision?

Before we dive into code, let's talk about why this combination is powerful.

Traditional AI apps require users to visit a website, create an account, and learn a new interface. WhatsApp bots eliminate all that friction. Users message your bot exactly like they'd message a friend.

AI vision models have become remarkably capable. They can identify objects, read text, understand context, and generate detailed descriptions. Combining this with WhatsApp's ubiquity creates tools that feel magical.

Some real-world applications:

  • Receipt scanners that extract expenses automatically
  • Plant identifiers for gardening enthusiasts
  • Food analyzers that estimate nutrition from photos
  • Document readers that summarize uploaded PDFs
  • Product lookup tools for shopping assistance

Prerequisites

You'll need:

  • Python 3.9+
  • A Meta Developer account (free)
  • A cloud AI provider account (Together AI, OpenAI, or Anthropic)
  • MongoDB Atlas account (free tier works)
  • A server with a public URL (we'll use ngrok for development)

Architecture Overview

Here's how the pieces fit together:

User sends image via WhatsApp
         ↓
Meta's WhatsApp Cloud API receives it
         ↓
Webhook forwards to your Flask server
         ↓
Server downloads image from Meta's CDN
         ↓
Image sent to AI Vision API for analysis
         ↓
Response formatted and sent back via WhatsApp API
         ↓
User receives analysis in their chat

The beauty of this architecture is its simplicity. One Python file handles everything.

Step 1: Set Up the Meta Developer Account

First, we need access to the WhatsApp Business API.

  1. Go to developers.facebook.com and create an account
  2. Click "My Apps" → "Create App"
  3. Select "Business" as the app type
  4. Name your app and click "Create"
  5. Find "WhatsApp" in the product list and click "Set Up"

Meta provides a free test phone number that works for development. You'll see it in the WhatsApp section of your app dashboard.

Note down these values from your dashboard:

  • Phone Number ID (under "From" phone number)
  • WhatsApp Business Account ID
  • Temporary Access Token (we'll make this permanent later)

Step 2: Set Up Your AI Vision Provider

For this tutorial, I'll use Together AI with their Llama Vision model because it's cost-effective and doesn't require waitlist approval. The code works with minor modifications for OpenAI's GPT-4V or Anthropic's Claude.

  1. Sign up at together.ai
  2. Get your API key from the dashboard
  3. Note the model name: meta-llama/Llama-4-Scout-17B-16E-Instruct

Together AI charges about $0.18 per million tokens for vision models—significantly cheaper than alternatives.

Step 3: Set Up MongoDB

We'll use MongoDB to store user sessions and analysis history.

  1. Create a free account at mongodb.com/atlas
  2. Create a new cluster (the free M0 tier works fine)
  3. Create a database user with read/write access
  4. Get your connection string (looks like mongodb+srv://user:pass@cluster.xxxxx.mongodb.net/)

Step 4: Project Structure

Create a new directory and set up these files:

whatsapp-ai-bot/
├── app.py           # Main application
├── .env             # Environment variables (don't commit this)
├── .env.example     # Template for environment variables
└── requirements.txt # Python dependencies

Step 5: Install Dependencies

Create requirements.txt:

flask==3.0.0
requests==2.31.0
pymongo==4.6.0
python-dotenv==1.0.0
together==1.0.0
gunicorn==21.2.0

Install them:

pip install -r requirements.txt

Step 6: Environment Variables

Create .env.example (commit this as a template):

# WhatsApp API
WHATSAPP_ACCESS_TOKEN=your_token_here
WHATSAPP_PHONE_NUMBER_ID=your_phone_id_here
WHATSAPP_VERIFY_TOKEN=any_random_string_you_choose

# AI Provider
TOGETHER_API_KEY=your_together_api_key

# Database
MONGODB_URI=mongodb+srv://user:pass@cluster.mongodb.net/dbname

Copy it to .env and fill in your actual values.
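A missing variable otherwise surfaces later as a confusing runtime error, so I like to fail fast at startup. A minimal sketch — the variable names match the `.env` above, but `missing_vars` is a helper I'm introducing here, not part of the final app.py:

```python
import os

# The five variables app.py reads; extend this list as you add providers.
REQUIRED = [
    "WHATSAPP_ACCESS_TOKEN",
    "WHATSAPP_PHONE_NUMBER_ID",
    "WHATSAPP_VERIFY_TOKEN",
    "TOGETHER_API_KEY",
    "MONGODB_URI",
]

def missing_vars(env=os.environ):
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED if not env.get(name)]

# In app.py, right after load_dotenv():
# if missing_vars():
#     raise SystemExit(f"Missing env vars: {', '.join(missing_vars())}")
```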

Step 7: The Main Application

Here's the complete app.py. I'll explain each section after:

import os
import json
import requests
from datetime import datetime
from flask import Flask, request, jsonify
from pymongo import MongoClient
from dotenv import load_dotenv
from together import Together

load_dotenv()

app = Flask(__name__)

# Configuration
WHATSAPP_TOKEN = os.getenv("WHATSAPP_ACCESS_TOKEN")
PHONE_NUMBER_ID = os.getenv("WHATSAPP_PHONE_NUMBER_ID")
VERIFY_TOKEN = os.getenv("WHATSAPP_VERIFY_TOKEN")
TOGETHER_API_KEY = os.getenv("TOGETHER_API_KEY")
MONGODB_URI = os.getenv("MONGODB_URI")

# Initialize clients
mongo_client = MongoClient(MONGODB_URI)
db = mongo_client.whatsapp_bot
together_client = Together(api_key=TOGETHER_API_KEY)

# Vision model configuration
VISION_MODEL = "meta-llama/Llama-4-Scout-17B-16E-Instruct"


def send_whatsapp_message(to, message):
    """Send a text message via WhatsApp API."""
    url = f"https://graph.facebook.com/v18.0/{PHONE_NUMBER_ID}/messages"
    headers = {
        "Authorization": f"Bearer {WHATSAPP_TOKEN}",
        "Content-Type": "application/json"
    }
    payload = {
        "messaging_product": "whatsapp",
        "to": to,
        "type": "text",
        "text": {"body": message}
    }

    response = requests.post(url, headers=headers, json=payload, timeout=30)
    return response.json()


def download_media(media_id):
    """Download media from WhatsApp's CDN."""
    # First, get the media URL
    url = f"https://graph.facebook.com/v18.0/{media_id}"
    headers = {"Authorization": f"Bearer {WHATSAPP_TOKEN}"}

    response = requests.get(url, headers=headers, timeout=30)
    media_url = response.json().get("url")

    if not media_url:
        return None

    # Download the actual file
    media_response = requests.get(media_url, headers=headers, timeout=30)
    return media_response.content


def analyze_image(image_data, prompt):
    """Send image to AI vision model for analysis."""
    import base64

    # Convert to base64
    image_base64 = base64.b64encode(image_data).decode("utf-8")
    image_url = f"data:image/jpeg;base64,{image_base64}"

    try:
        response = together_client.chat.completions.create(
            model=VISION_MODEL,
            messages=[{
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": image_url}},
                    {"type": "text", "text": prompt}
                ]
            }],
            max_tokens=500
        )
        return response.choices[0].message.content
    except Exception as e:
        print(f"Vision API error: {e}")
        return None


def get_analysis_prompt():
    """Return the prompt for image analysis."""
    return """Analyze this image and provide:

1. A brief description of what you see
2. Key details or notable elements
3. Any relevant insights or observations

Keep your response concise and conversational, suitable for a chat message."""


def log_interaction(phone_number, message_type, content, response):
    """Log the interaction to MongoDB."""
    db.interactions.insert_one({
        "phone_number": phone_number,
        "message_type": message_type,
        "content": content[:500] if content else None,
        "response": response[:500] if response else None,
        "timestamp": datetime.utcnow()
    })


@app.route("/webhook", methods=["GET"])
def verify_webhook():
    """Handle webhook verification from Meta."""
    mode = request.args.get("hub.mode")
    token = request.args.get("hub.verify_token")
    challenge = request.args.get("hub.challenge")

    if mode == "subscribe" and token == VERIFY_TOKEN:
        print("Webhook verified successfully")
        return challenge, 200

    return "Forbidden", 403


@app.route("/webhook", methods=["POST"])
def handle_webhook():
    """Process incoming WhatsApp messages."""
    data = request.json

    try:
        # Extract message details
        entry = data["entry"][0]
        changes = entry["changes"][0]
        value = changes["value"]

        # Check if this is a message (not a status update)
        if "messages" not in value:
            return jsonify({"status": "ok"}), 200

        message = value["messages"][0]
        phone_number = message["from"]
        message_type = message["type"]

        # Handle image messages
        if message_type == "image":
            media_id = message["image"]["id"]

            # Send acknowledgment
            send_whatsapp_message(
                phone_number, 
                "Got your image! Analyzing it now..."
            )

            # Download and analyze
            image_data = download_media(media_id)

            if image_data:
                analysis = analyze_image(image_data, get_analysis_prompt())

                if analysis:
                    send_whatsapp_message(phone_number, analysis)
                    log_interaction(phone_number, "image", "image_received", analysis)
                else:
                    send_whatsapp_message(
                        phone_number,
                        "Sorry, I couldn't analyze that image. Please try again."
                    )
            else:
                send_whatsapp_message(
                    phone_number,
                    "Sorry, I couldn't download that image. Please try again."
                )

        # Handle text messages
        elif message_type == "text":
            text = message["text"]["body"]

            # Simple response for non-image messages
            response = (
                "Hi! Send me an image and I'll analyze it for you.\n\n"
                "Just take a photo or send one from your gallery!"
            )
            send_whatsapp_message(phone_number, response)
            log_interaction(phone_number, "text", text, response)

        return jsonify({"status": "ok"}), 200

    except Exception as e:
        print(f"Error processing webhook: {e}")
        return jsonify({"status": "error"}), 500


@app.route("/health", methods=["GET"])
def health_check():
    """Simple health check endpoint."""
    return jsonify({"status": "healthy", "timestamp": datetime.utcnow().isoformat()})


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000, debug=True)

Step 8: Understanding the Code

Let's break down the key parts.

Webhook Verification

When you configure your webhook URL in Meta's dashboard, they send a GET request with a challenge. Your server must echo it back:

@app.route("/webhook", methods=["GET"])
def verify_webhook():
    mode = request.args.get("hub.mode")
    token = request.args.get("hub.verify_token")
    challenge = request.args.get("hub.challenge")

    if mode == "subscribe" and token == VERIFY_TOKEN:
        return challenge, 200

Processing Incoming Messages

WhatsApp sends POST requests to your webhook for each message. The nested JSON structure requires careful extraction:

message = value["messages"][0]
phone_number = message["from"]
message_type = message["type"]
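To make that structure concrete, here's a trimmed, illustrative payload of the shape Meta posts for an inbound image. Real payloads carry extra keys (metadata, contacts, message id and timestamp), and the values here are made up:

```python
# A trimmed, illustrative webhook payload for an inbound image message.
sample_webhook = {
    "entry": [{
        "changes": [{
            "value": {
                "messaging_product": "whatsapp",
                "messages": [{
                    "from": "15551234567",  # sender's phone number
                    "type": "image",
                    "image": {"id": "MEDIA_ID", "mime_type": "image/jpeg"},
                }],
            }
        }]
    }]
}

# The same extraction path the webhook handler walks:
value = sample_webhook["entry"][0]["changes"][0]["value"]
message = value["messages"][0]
print(message["from"], message["type"])  # 15551234567 image
```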

Downloading Media

WhatsApp doesn't send images directly. Instead, they provide a media ID. You must first get the download URL, then fetch the actual file:

def download_media(media_id):
    headers = {"Authorization": f"Bearer {WHATSAPP_TOKEN}"}

    # Get URL from media ID
    url = f"https://graph.facebook.com/v18.0/{media_id}"
    response = requests.get(url, headers=headers)
    media_url = response.json().get("url")

    # Download actual file
    media_response = requests.get(media_url, headers=headers)
    return media_response.content

Vision Analysis

The AI vision API accepts base64-encoded images. We send both the image and a text prompt:

response = together_client.chat.completions.create(
    model=VISION_MODEL,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": image_url}},
            {"type": "text", "text": prompt}
        ]
    }]
)

Step 9: Set Up the Webhook

For development, we'll use ngrok to expose your local server.

  1. Install ngrok from ngrok.com
  2. Run your Flask app: python app.py
  3. In another terminal, run: ngrok http 5000
  4. Copy the HTTPS URL (looks like https://abc123.ngrok.io)

Now configure the webhook in Meta's dashboard:

  1. Go to your app → WhatsApp → Configuration
  2. Click "Edit" next to Webhook
  3. Enter your URL: https://abc123.ngrok.io/webhook
  4. Enter your verify token (same as WHATSAPP_VERIFY_TOKEN in .env)
  5. Click "Verify and Save"
  6. Subscribe to "messages" events
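Before pointing Meta at your URL, you can sanity-check the handshake logic locally. This sketch mirrors the verify_webhook route with plain dicts — the query-param names are the real ones Meta sends, but the token and challenge values are made up:

```python
def verification_response(params, expected_token):
    """Echo hub.challenge only when the verify token matches."""
    if (params.get("hub.mode") == "subscribe"
            and params.get("hub.verify_token") == expected_token):
        return params.get("hub.challenge"), 200
    return "Forbidden", 403

good = {"hub.mode": "subscribe", "hub.verify_token": "my-secret", "hub.challenge": "12345"}
bad = {"hub.mode": "subscribe", "hub.verify_token": "wrong", "hub.challenge": "12345"}

print(verification_response(good, "my-secret"))  # ('12345', 200)
print(verification_response(bad, "my-secret"))   # ('Forbidden', 403)
```

If "Verify and Save" fails in the dashboard, this is the logic to re-check first, then ngrok.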

Step 10: Test Your Bot

  1. Open WhatsApp on your phone
  2. Message the test number shown in Meta's dashboard
  3. Send an image
  4. Watch your terminal for logs
  5. Receive the AI analysis!

If something doesn't work, check:

  • Is ngrok running and the URL current?
  • Are all environment variables set?
  • Is the webhook subscribed to "messages"?
  • Check Meta's webhook logs for delivery status

Step 11: Making the Access Token Permanent

The temporary access token expires in 24 hours. For production, create a permanent one:

  1. Go to your app → WhatsApp → API Setup
  2. Click "Add" under "Add a system user token"
  3. Create a system user if you haven't
  4. Generate a token with whatsapp_business_messaging permission
  5. This token won't expire
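To confirm the new token really is long-lived, the Graph API's debug_token endpoint reports a token's expiry and scopes. A small sketch — debug_token is a real endpoint, but the helper name and the use of an app token as access_token reflect my own setup, so treat the details as an assumption:

```python
GRAPH_BASE = "https://graph.facebook.com/v18.0"

def debug_token_url(input_token, app_token):
    """Build the Graph API URL that inspects `input_token`."""
    return (f"{GRAPH_BASE}/debug_token"
            f"?input_token={input_token}&access_token={app_token}")

# Run manually with real tokens; data.expires_at of 0 means "never expires":
# import requests
# info = requests.get(debug_token_url(SYSTEM_USER_TOKEN, APP_TOKEN)).json()
# print(info["data"].get("expires_at"))
```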

Step 12: Deploying to Production

For production, you need a server with a stable URL. Here are affordable options:

Railway

# Install Railway CLI
npm install -g @railway/cli

# Deploy
railway init
railway up

Render (Free tier available)

  1. Connect your GitHub repo
  2. Set environment variables in dashboard
  3. Deploy automatically on push

DigitalOcean

# On your droplet
git clone your-repo
cd your-repo
pip install -r requirements.txt
gunicorn app:app -b 0.0.0.0:5000

For any option, remember to:

  • Set all environment variables
  • Use HTTPS (required by WhatsApp)
  • Run with gunicorn instead of Flask's dev server
  • Set up process management (systemd, supervisor, or PM2)
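For the process-management point on a plain VPS like the DigitalOcean option, a systemd unit keeps gunicorn running across reboots. A sketch, assuming the repo lives at /opt/whatsapp-ai-bot and a dedicated `bot` user exists — adjust paths and user to your setup:

```ini
# /etc/systemd/system/whatsapp-bot.service
[Unit]
Description=WhatsApp AI bot
After=network.target

[Service]
User=bot
WorkingDirectory=/opt/whatsapp-ai-bot
EnvironmentFile=/opt/whatsapp-ai-bot/.env
ExecStart=/usr/local/bin/gunicorn app:app -b 0.0.0.0:5000
Restart=always

[Install]
WantedBy=multi-user.target
```

Enable it with `systemctl enable --now whatsapp-bot`.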

Step 13: Customizing for Your Use Case

The base code is intentionally generic. Here's how to adapt it for specific applications.

For a Receipt Scanner:

def get_analysis_prompt():
    return """Analyze this receipt image and extract:

1. Store/merchant name
2. Date of purchase
3. List of items with prices
4. Total amount
5. Payment method if visible

Format the response as a clear summary."""

For a Plant Identifier:

def get_analysis_prompt():
    return """Identify this plant and provide:

1. Plant name (common and scientific)
2. Key identifying features
3. Care requirements (water, sunlight)
4. Is it toxic to pets?

Keep it conversational and helpful."""

For a Food Analyzer:

def get_analysis_prompt():
    return """Analyze this food image and estimate:

1. What foods are present
2. Approximate calories
3. Protein, carbs, and fat estimates
4. Health rating from 1-10
5. A brief nutritional insight

Be helpful but note these are estimates."""

Step 14: Adding Conversation Context

To make your bot smarter, store and use conversation history:

def get_user_context(phone_number, limit=5):
    """Get recent interactions for context."""
    recent = db.interactions.find(
        {"phone_number": phone_number}
    ).sort("timestamp", -1).limit(limit)

    return list(recent)


def analyze_image_with_context(image_data, phone_number):
    """Include conversation history in analysis."""
    context = get_user_context(phone_number)

    context_text = ""
    if context:
        context_text = "Previous interactions:\n"
        for item in reversed(context):
            context_text += f"- {item.get('response', '')[:100]}\n"

    prompt = f"""{context_text}

Now analyze this new image. Consider any relevant context from previous interactions."""

    return analyze_image(image_data, prompt)

Performance Tips

After running this in production, here's what I've learned:

  1. Send acknowledgments immediately. Users get anxious if there's no response. Send "Analyzing..." before doing the heavy work.

  2. Cache repeated analyses. Hash incoming images and check if you've seen them before.

  3. Set timeout limits. Vision APIs can be slow. Set a 30-second timeout and send a graceful error if exceeded.

  4. Rate limit by user. Prevent abuse by limiting requests per phone number per hour.

  5. Monitor costs. Log API calls and set up billing alerts. Vision APIs charge per image.
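Tips 2 and 4 are small enough to sketch. This is an in-memory version with helper names of my own choosing; in production you'd back both with MongoDB so state survives restarts:

```python
import hashlib
import time
from collections import defaultdict, deque

analysis_cache = {}                  # sha256 of image bytes -> previous analysis
request_times = defaultdict(deque)   # phone number -> recent request timestamps

def image_key(image_data):
    """Tip 2: hash the image so identical uploads can skip the vision API."""
    return hashlib.sha256(image_data).hexdigest()

def allow_request(phone_number, limit=10, window=3600, now=None):
    """Tip 4: allow at most `limit` requests per number per `window` seconds."""
    now = time.time() if now is None else now
    times = request_times[phone_number]
    while times and now - times[0] > window:
        times.popleft()  # drop requests that fell out of the window
    if len(times) >= limit:
        return False
    times.append(now)
    return True
```

In handle_webhook, check allow_request(phone_number) before downloading media, and look up image_key(image_data) in analysis_cache before calling analyze_image.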

Common Pitfalls

"Webhook verification failed"

  • Your verify token doesn't match
  • Your server isn't accessible (check ngrok)
  • You're not returning the challenge correctly

"Message not delivered"

  • Access token expired (get a permanent one)
  • Phone number not in allowed list (in test mode)
  • Invalid phone number format

"Image download failed"

  • Access token doesn't have media.read permission
  • Media URL expired (they're temporary)
  • Network timeout

"Vision API error"

  • Image too large (resize before sending)
  • Unsupported format (stick to JPEG/PNG)
  • Rate limit hit
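For the "image too large" case, downscaling before the API call usually fixes it. A sketch using Pillow (add `pillow` to requirements.txt); the 1024px cap is my own default rather than a documented API limit:

```python
from io import BytesIO
from PIL import Image

def shrink_image(image_data, max_side=1024, quality=85):
    """Downscale and re-encode an image as JPEG before the vision call."""
    img = Image.open(BytesIO(image_data))
    img.thumbnail((max_side, max_side))  # in-place, preserves aspect ratio
    if img.mode != "RGB":
        img = img.convert("RGB")         # JPEG has no alpha channel
    out = BytesIO()
    img.save(out, format="JPEG", quality=quality)
    return out.getvalue()
```

Calling shrink_image(image_data) just before analyze_image also normalizes PNGs to JPEG, which covers the unsupported-format pitfall at the same time.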

What's Next?

This foundation supports many extensions:

  • Multi-language support: Detect user's language and respond accordingly
  • Voice messages: Transcribe audio and respond
  • Buttons and lists: Use WhatsApp's interactive message types
  • Payment integration: Connect to Stripe for premium features
  • Admin dashboard: Build a web interface for monitoring

Wrapping Up

You now have a complete AI-powered WhatsApp bot that can analyze images. The stack is simple, affordable, and scales well.

The combination of WhatsApp's reach and AI vision capabilities opens interesting possibilities. Users don't need to learn new interfaces—they just message like they always do.


This article was written based on my hands-on experience building production WhatsApp bots. The code examples are simplified for clarity—production deployments should include proper error handling, logging, and security measures.
