WhatsApp has over 2 billion users. Most AI tools live on websites nobody visits. What if you could bring AI directly to where people already spend their time?
In this tutorial, I'll show you how to build a WhatsApp bot that accepts images, analyzes them using AI vision models, and responds with intelligent insights.
What We're Building
By the end of this guide, you'll have a working WhatsApp bot that:
- Receives images from users via WhatsApp
- Processes them using AI vision models (Llama, GPT-4V, or Claude)
- Returns structured analysis in natural conversation
- Stores conversation history in MongoDB
- Runs on a free-tier cloud server
The entire stack costs nearly nothing to run at low volume, making it perfect for MVPs, side projects, or learning.
Why WhatsApp + AI Vision?
Before we dive into code, let's talk about why this combination is powerful.
Traditional AI apps require users to visit a website, create an account, and learn a new interface. WhatsApp bots eliminate all that friction. Users message your bot exactly like they'd message a friend.
AI vision models have become remarkably capable. They can identify objects, read text, understand context, and generate detailed descriptions. Combining this with WhatsApp's ubiquity creates tools that feel magical.
Some real-world applications:
- Receipt scanners that extract expenses automatically
- Plant identifiers for gardening enthusiasts
- Food analyzers that estimate nutrition from photos
- Document readers that summarize uploaded PDFs
- Product lookup tools for shopping assistance
Prerequisites
You'll need:
- Python 3.9+
- A Meta Developer account (free)
- A cloud AI provider account (Together AI, OpenAI, or Anthropic)
- MongoDB Atlas account (free tier works)
- A server with a public URL (we'll use ngrok for development)
Architecture Overview
Here's how the pieces fit together:
User sends image via WhatsApp
↓
Meta's WhatsApp Cloud API receives it
↓
Webhook forwards to your Flask server
↓
Server downloads image from Meta's CDN
↓
Image sent to AI Vision API for analysis
↓
Response formatted and sent back via WhatsApp API
↓
User receives analysis in their chat
The beauty of this architecture is its simplicity. One Python file handles everything.
Step 1: Set Up the Meta Developer Account
First, we need access to the WhatsApp Business API.
- Go to developers.facebook.com and create an account
- Click "My Apps" → "Create App"
- Select "Business" as the app type
- Name your app and click "Create"
- Find "WhatsApp" in the product list and click "Set Up"
Meta provides a free test phone number that works for development. You'll see it in the WhatsApp section of your app dashboard.
Note down these values from your dashboard:
- Phone Number ID (under "From" phone number)
- WhatsApp Business Account ID
- Temporary Access Token (we'll make this permanent later)
Step 2: Set Up Your AI Vision Provider
For this tutorial, I'll use Together AI with their Llama Vision model because it's cost-effective and doesn't require waitlist approval. The code works with minor modifications for OpenAI's GPT-4V or Anthropic's Claude.
- Sign up at together.ai
- Get your API key from the dashboard
- Note the model name:
meta-llama/Llama-4-Scout-17B-16E-Instruct
Together AI charges about $0.18 per million tokens for vision models—significantly cheaper than alternatives.
Step 3: Set Up MongoDB
We'll use MongoDB to store user sessions and analysis history.
- Create a free account at mongodb.com/atlas
- Create a new cluster (the free M0 tier works fine)
- Create a database user with read/write access
- Get your connection string (it looks like mongodb+srv://user:pass@cluster.xxxxx.mongodb.net/)
Step 4: Project Structure
Create a new directory and set up these files:
whatsapp-ai-bot/
├── app.py # Main application
├── .env # Environment variables (don't commit this)
├── .env.example # Template for environment variables
└── requirements.txt # Python dependencies
Step 5: Install Dependencies
Create requirements.txt:
flask==3.0.0
requests==2.31.0
pymongo==4.6.0
python-dotenv==1.0.0
together==1.0.0
gunicorn==21.2.0
Install them:
pip install -r requirements.txt
Step 6: Environment Variables
Create .env.example (commit this as a template):
# WhatsApp API
WHATSAPP_ACCESS_TOKEN=your_token_here
WHATSAPP_PHONE_NUMBER_ID=your_phone_id_here
WHATSAPP_VERIFY_TOKEN=any_random_string_you_choose
# AI Provider
TOGETHER_API_KEY=your_together_api_key
# Database
MONGODB_URI=mongodb+srv://user:pass@cluster.mongodb.net/dbname
Copy it to .env and fill in your actual values.
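A missing variable usually shows up later as a confusing 401 or connection error, so it can help to check the environment at startup. The following is a minimal sketch, not part of the original app: the helper name missing_env_vars is hypothetical, and the variable names are the ones from the .env.example template above.

```python
import os

# Variable names taken from the .env.example template.
REQUIRED_VARS = [
    "WHATSAPP_ACCESS_TOKEN",
    "WHATSAPP_PHONE_NUMBER_ID",
    "WHATSAPP_VERIFY_TOKEN",
    "TOGETHER_API_KEY",
    "MONGODB_URI",
]


def missing_env_vars(env=None, required=REQUIRED_VARS):
    """Return the required variable names that are unset or empty.

    Hypothetical helper: pass a dict for testing, or nothing to
    inspect the real process environment.
    """
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
```

If you adopt something like this, call it after load_dotenv() so values from .env are already loaded.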
Step 7: The Main Application
Here's the complete app.py. I'll explain each section after:
import os
import json
import requests
from datetime import datetime
from flask import Flask, request, jsonify
from pymongo import MongoClient
from dotenv import load_dotenv
from together import Together
load_dotenv()
app = Flask(__name__)
# Configuration
WHATSAPP_TOKEN = os.getenv("WHATSAPP_ACCESS_TOKEN")
PHONE_NUMBER_ID = os.getenv("WHATSAPP_PHONE_NUMBER_ID")
VERIFY_TOKEN = os.getenv("WHATSAPP_VERIFY_TOKEN")
TOGETHER_API_KEY = os.getenv("TOGETHER_API_KEY")
MONGODB_URI = os.getenv("MONGODB_URI")
# Initialize clients
mongo_client = MongoClient(MONGODB_URI)
db = mongo_client.whatsapp_bot
together_client = Together(api_key=TOGETHER_API_KEY)
# Vision model configuration
VISION_MODEL = "meta-llama/Llama-4-Scout-17B-16E-Instruct"
def send_whatsapp_message(to, message):
    """Send a text message via WhatsApp API."""
    url = f"https://graph.facebook.com/v18.0/{PHONE_NUMBER_ID}/messages"
    headers = {
        "Authorization": f"Bearer {WHATSAPP_TOKEN}",
        "Content-Type": "application/json"
    }
    payload = {
        "messaging_product": "whatsapp",
        "to": to,
        "type": "text",
        "text": {"body": message}
    }
    # A timeout keeps a stalled API call from blocking the webhook handler
    response = requests.post(url, headers=headers, json=payload, timeout=30)
    return response.json()
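def download_media(media_id):
    """Download media from WhatsApp's CDN. Returns bytes, or None on failure."""
    # First, look up the temporary download URL for this media ID
    url = f"https://graph.facebook.com/v18.0/{media_id}"
    headers = {"Authorization": f"Bearer {WHATSAPP_TOKEN}"}
    try:
        response = requests.get(url, headers=headers, timeout=30)
        media_url = response.json().get("url")
        if not media_url:
            return None
        # Download the actual file; the same bearer token is required
        media_response = requests.get(media_url, headers=headers, timeout=30)
        media_response.raise_for_status()
        return media_response.content
    except requests.RequestException as e:
        print(f"Media download error: {e}")
        return None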
def analyze_image(image_data, prompt):
    """Send image to AI vision model for analysis."""
    import base64
    # Convert to a base64 data URL (assumes the image is a JPEG)
    image_base64 = base64.b64encode(image_data).decode("utf-8")
    image_url = f"data:image/jpeg;base64,{image_base64}"
    try:
        response = together_client.chat.completions.create(
            model=VISION_MODEL,
            messages=[{
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": image_url}},
                    {"type": "text", "text": prompt}
                ]
            }],
            max_tokens=500
        )
        return response.choices[0].message.content
    except Exception as e:
        print(f"Vision API error: {e}")
        return None
def get_analysis_prompt():
    """Return the prompt for image analysis."""
    return """Analyze this image and provide:
1. A brief description of what you see
2. Key details or notable elements
3. Any relevant insights or observations
Keep your response concise and conversational, suitable for a chat message."""
def log_interaction(phone_number, message_type, content, response):
    """Log the interaction to MongoDB."""
    db.interactions.insert_one({
        "phone_number": phone_number,
        "message_type": message_type,
        "content": content[:500] if content else None,
        "response": response[:500] if response else None,
        "timestamp": datetime.utcnow()
    })
@app.route("/webhook", methods=["GET"])
def verify_webhook():
    """Handle webhook verification from Meta."""
    mode = request.args.get("hub.mode")
    token = request.args.get("hub.verify_token")
    challenge = request.args.get("hub.challenge")
    if mode == "subscribe" and token == VERIFY_TOKEN:
        print("Webhook verified successfully")
        return challenge, 200
    return "Forbidden", 403
@app.route("/webhook", methods=["POST"])
def handle_webhook():
    """Process incoming WhatsApp messages."""
    data = request.json
    try:
        # Extract message details
        entry = data["entry"][0]
        changes = entry["changes"][0]
        value = changes["value"]
        # Check if this is a message (not a status update)
        if "messages" not in value:
            return jsonify({"status": "ok"}), 200
        message = value["messages"][0]
        phone_number = message["from"]
        message_type = message["type"]
        # Handle image messages
        if message_type == "image":
            media_id = message["image"]["id"]
            # Send acknowledgment
            send_whatsapp_message(
                phone_number,
                "Got your image! Analyzing it now..."
            )
            # Download and analyze
            image_data = download_media(media_id)
            if image_data:
                analysis = analyze_image(image_data, get_analysis_prompt())
                if analysis:
                    send_whatsapp_message(phone_number, analysis)
                    log_interaction(phone_number, "image", "image_received", analysis)
                else:
                    send_whatsapp_message(
                        phone_number,
                        "Sorry, I couldn't analyze that image. Please try again."
                    )
            else:
                send_whatsapp_message(
                    phone_number,
                    "Sorry, I couldn't download that image. Please try again."
                )
        # Handle text messages
        elif message_type == "text":
            text = message["text"]["body"]
            # Simple response for non-image messages
            response = (
                "Hi! Send me an image and I'll analyze it for you.\n\n"
                "Just take a photo or send one from your gallery!"
            )
            send_whatsapp_message(phone_number, response)
            log_interaction(phone_number, "text", text, response)
        return jsonify({"status": "ok"}), 200
    except Exception as e:
        print(f"Error processing webhook: {e}")
        return jsonify({"status": "error"}), 500
@app.route("/health", methods=["GET"])
def health_check():
    """Simple health check endpoint."""
    return jsonify({"status": "healthy", "timestamp": datetime.utcnow().isoformat()})
if __name__ == "__main__":
    # Development only: debug mode must not be used in production (see Step 12)
    app.run(host="0.0.0.0", port=5000, debug=True)
Step 8: Understanding the Code
Let's break down the key parts.
Webhook Verification
When you configure your webhook URL in Meta's dashboard, they send a GET request with a challenge. Your server must echo it back:
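@app.route("/webhook", methods=["GET"])
def verify_webhook():
    mode = request.args.get("hub.mode")
    token = request.args.get("hub.verify_token")
    challenge = request.args.get("hub.challenge")
    if mode == "subscribe" and token == VERIFY_TOKEN:
        return challenge, 200
    return "Forbidden", 403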
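The GET handshake only proves that you control the endpoint. It does not authenticate the POST deliveries that follow. Meta signs each POST body with your app secret and sends the result in the X-Hub-Signature-256 header, so one option is to check that header before trusting a payload. Here is a rough sketch of such a check; the function name is hypothetical and the app secret is assumed to come from an extra environment variable that the tutorial's .env does not define.

```python
import hashlib
import hmac


def is_valid_signature(app_secret, raw_body, signature_header):
    """Check an X-Hub-Signature-256 header against the raw request body.

    app_secret: the Meta app secret as a string (assumed to be loaded
                from an environment variable you add yourself).
    raw_body: the exact request bytes, before any JSON parsing.
    signature_header: the header value, expected as "sha256=<hex digest>".
    """
    if not signature_header or not signature_header.startswith("sha256="):
        return False
    expected = hmac.new(
        app_secret.encode("utf-8"), raw_body, hashlib.sha256
    ).hexdigest()
    received = signature_header[len("sha256="):]
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode("utf-8"), received.encode("utf-8"))
```

In a Flask handler the inputs would be request.get_data() and request.headers.get("X-Hub-Signature-256"). The raw bytes matter: re-serializing the parsed JSON can change the body and break the comparison.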
Processing Incoming Messages
WhatsApp sends POST requests to your webhook for each message. The nested JSON structure requires careful extraction:
message = value["messages"][0]
phone_number = message["from"]
message_type = message["type"]
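The extraction above reads only the first entry, change, and message. Each of those levels is a list in the payload, so a more defensive variant could walk all of them and silently skip status updates. This is a sketch under that assumption: extract_messages and SAMPLE_PAYLOAD are illustrative names, and the sample is trimmed to the fields the tutorial uses.

```python
def extract_messages(payload):
    """Yield (sender, message_type, message) for each message in a payload.

    Missing keys are treated as empty, so status-only deliveries
    and malformed payloads simply yield nothing.
    """
    for entry in payload.get("entry", []):
        for change in entry.get("changes", []):
            value = change.get("value", {})
            # Status updates carry "statuses" instead of "messages".
            for message in value.get("messages", []):
                yield message.get("from"), message.get("type"), message


# Trimmed, illustrative payload with one image message.
SAMPLE_PAYLOAD = {
    "entry": [{
        "changes": [{
            "value": {
                "messages": [{
                    "from": "15551234567",
                    "type": "image",
                    "image": {"id": "MEDIA_ID", "mime_type": "image/jpeg"},
                }]
            }
        }]
    }]
}

for sender, message_type, message in extract_messages(SAMPLE_PAYLOAD):
    print(sender, message_type)
```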
Downloading Media
WhatsApp doesn't send images directly. Instead, they provide a media ID. You must first get the download URL, then fetch the actual file:
def download_media(media_id):
    headers = {"Authorization": f"Bearer {WHATSAPP_TOKEN}"}
    # Get URL from media ID
    url = f"https://graph.facebook.com/v18.0/{media_id}"
    response = requests.get(url, headers=headers)
    media_url = response.json().get("url")
    # Download actual file
    media_response = requests.get(media_url, headers=headers)
    return media_response.content
Vision Analysis
The AI vision API accepts base64-encoded images. We send both the image and a text prompt:
response = together_client.chat.completions.create(
    model=VISION_MODEL,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": image_url}},
            {"type": "text", "text": prompt}
        ]
    }]
)
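The full listing labels every image as image/jpeg when it builds the data URL. WhatsApp reports a mime_type alongside the media ID in the image object, so a PNG could be labeled correctly instead. A small sketch of that idea follows; build_data_url and SUPPORTED_MIME_TYPES are hypothetical names, and the supported set is an assumption you should adjust to what your vision provider accepts.

```python
import base64

# Assumed set of formats; adjust to your vision provider's documentation.
SUPPORTED_MIME_TYPES = {"image/jpeg", "image/png"}


def build_data_url(image_bytes, mime_type="image/jpeg"):
    """Return a base64 data URL for the image, labeled with its MIME type."""
    # Drop any parameters such as "; charset=..." and normalize case.
    mime_type = mime_type.split(";")[0].strip().lower()
    if mime_type not in SUPPORTED_MIME_TYPES:
        raise ValueError(f"Unsupported image type: {mime_type}")
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


print(build_data_url(b"abc", "image/png"))  # data:image/png;base64,YWJj
```

In the webhook handler the type would come from message["image"].get("mime_type", "image/jpeg").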
Step 9: Set Up the Webhook
For development, we'll use ngrok to expose your local server.
- Install ngrok from ngrok.com
- Run your Flask app: python app.py
- In another terminal, run: ngrok http 5000
- Copy the HTTPS URL (looks like https://abc123.ngrok.io)
Now configure the webhook in Meta's dashboard:
- Go to your app → WhatsApp → Configuration
- Click "Edit" next to Webhook
- Enter your URL: https://abc123.ngrok.io/webhook
- Enter your verify token (same as WHATSAPP_VERIFY_TOKEN in .env)
- Click "Verify and Save"
- Subscribe to "messages" events
Step 10: Test Your Bot
- Open WhatsApp on your phone
- Message the test number shown in Meta's dashboard
- Send an image
- Watch your terminal for logs
- Receive the AI analysis!
If something doesn't work, check:
- Is ngrok running and the URL current?
- Are all environment variables set?
- Is the webhook subscribed to "messages"?
- Check Meta's webhook logs for delivery status
Step 11: Making the Access Token Permanent
The temporary access token expires in 24 hours. For production, create a permanent one:
- Go to your app → WhatsApp → API Setup
- Click "Add" under "Add a system user token"
- Create a system user if you haven't
- Generate a token with the whatsapp_business_messaging permission
- This token won't expire
Step 12: Deploying to Production
For production, you need a server with a stable URL. Here are affordable options:
Railway
# Install Railway CLI
npm install -g @railway/cli
# Deploy
railway init
railway up
Render (Free tier available)
- Connect your GitHub repo
- Set environment variables in dashboard
- Deploy automatically on push
DigitalOcean
# On your droplet
git clone your-repo
cd your-repo
pip install -r requirements.txt
gunicorn app:app -b 0.0.0.0:5000
For any option, remember to:
- Set all environment variables
- Use HTTPS (required by WhatsApp)
- Run with gunicorn instead of Flask's dev server
- Set up process management (systemd, supervisor, or PM2)
Step 13: Customizing for Your Use Case
The base code is intentionally generic. Here's how to adapt it for specific applications.
For a Receipt Scanner:
def get_analysis_prompt():
    return """Analyze this receipt image and extract:
1. Store/merchant name
2. Date of purchase
3. List of items with prices
4. Total amount
5. Payment method if visible
Format the response as a clear summary."""
For a Plant Identifier:
def get_analysis_prompt():
    return """Identify this plant and provide:
1. Plant name (common and scientific)
2. Key identifying features
3. Care requirements (water, sunlight)
4. Is it toxic to pets?
Keep it conversational and helpful."""
For a Food Analyzer:
def get_analysis_prompt():
    return """Analyze this food image and estimate:
1. What foods are present
2. Approximate calories
3. Protein, carbs, and fat estimates
4. Health rating from 1-10
5. A brief nutritional insight
Be helpful but note these are estimates."""
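If one bot should cover several of these use cases, the prompt can be chosen per message instead of being hardcoded. WhatsApp image messages may include an optional caption, which gives users a simple way to say what they want. The sketch below assumes that approach; the keyword table, the shortened prompts, and the select_prompt helper are all illustrative rather than part of the original code.

```python
# Shortened, illustrative prompts keyed by a caption keyword.
PROMPTS = {
    "receipt": "Analyze this receipt image and extract the merchant, date, items, and total.",
    "plant": "Identify this plant and describe its care requirements.",
    "food": "Analyze this food image and estimate calories and macronutrients.",
}
DEFAULT_PROMPT = "Analyze this image and describe what you see."


def select_prompt(caption):
    """Pick a prompt from the first keyword found in the image caption."""
    text = (caption or "").lower()
    for keyword, prompt in PROMPTS.items():
        if keyword in text:
            return prompt
    return DEFAULT_PROMPT


print(select_prompt("Receipt from lunch"))
```

In the handler, the caption would be read with message["image"].get("caption"), which is None when the user sends a bare photo.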
Step 14: Adding Conversation Context
To make your bot smarter, store and use conversation history:
def get_user_context(phone_number, limit=5):
    """Get recent interactions for context."""
    recent = db.interactions.find(
        {"phone_number": phone_number}
    ).sort("timestamp", -1).limit(limit)
    return list(recent)
def analyze_image_with_context(image_data, phone_number):
    """Include conversation history in analysis."""
    context = get_user_context(phone_number)
    context_text = ""
    if context:
        context_text = "Previous interactions:\n"
        # Results are newest first, so reverse to get chronological order
        for item in reversed(context):
            # "response" may be stored as None, so fall back to an empty string
            context_text += f"- {(item.get('response') or '')[:100]}\n"
    prompt = f"""{context_text}
Now analyze this new image. Consider any relevant context from previous interactions."""
    return analyze_image(image_data, prompt)
Performance Tips
After running this in production, here's what I've learned:
Send acknowledgments immediately. Users get anxious if there's no response. Send "Analyzing..." before doing the heavy work.
Cache repeated analyses. Hash incoming images and check if you've seen them before.
Set timeout limits. Vision APIs can be slow. Set a 30-second timeout and send a graceful error if exceeded.
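A rough way to enforce such a limit around any blocking call is to run it in a worker thread and stop waiting after a deadline. The sketch below uses the standard library's ThreadPoolExecutor; run_with_timeout and slow_analysis are hypothetical names, and the short sleep only stands in for a slow vision API call.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Shared pool; the size is an arbitrary assumption.
_executor = ThreadPoolExecutor(max_workers=4)


def run_with_timeout(func, timeout_seconds, *args):
    """Run func(*args) in a worker thread; return None if it takes too long.

    The worker is not cancelled on timeout. It keeps running in the
    background and its eventual result is discarded.
    """
    future = _executor.submit(func, *args)
    try:
        return future.result(timeout=timeout_seconds)
    except FutureTimeout:
        return None


def slow_analysis():
    """Stand-in for a slow vision API call."""
    time.sleep(0.3)
    return "done"


print(run_with_timeout(slow_analysis, 0.05))  # None: gave up after 50 ms
```

When the result is None, the handler can send the same graceful error message it already uses for a failed analysis.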
Rate limit by user. Prevent abuse by limiting requests per phone number per hour.
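A sliding-window counter keyed by phone number is one simple way to do this. The sketch below keeps timestamps in memory, which is an assumption for illustration: each gunicorn worker would have its own counts, so a shared store would be needed for strict limits. The class name and default limits are hypothetical.

```python
import time
from collections import defaultdict, deque


class RateLimiter:
    """Allow at most max_requests per phone number within window_seconds."""

    def __init__(self, max_requests=10, window_seconds=3600):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # phone number -> request timestamps

    def allow(self, phone_number, now=None):
        """Record a request and return True, or return False if over the limit."""
        now = time.monotonic() if now is None else now
        hits = self._hits[phone_number]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


limiter = RateLimiter(max_requests=2, window_seconds=60)
print(limiter.allow("15551234567"), limiter.allow("15551234567"), limiter.allow("15551234567"))
```

The optional now argument exists only to make the behavior easy to test with fixed timestamps.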
Monitor costs. Log API calls and set up billing alerts. Vision APIs charge per image.
Common Pitfalls
"Webhook verification failed"
- Your verify token doesn't match
- Your server isn't accessible (check ngrok)
- You're not returning the challenge correctly
"Message not delivered"
- Access token expired (get a permanent one)
- Phone number not in allowed list (in test mode)
- Invalid phone number format
"Image download failed"
- Access token lacks the permission needed to read media (check that it includes whatsapp_business_messaging)
- Media URL expired (they're temporary)
- Network timeout
"Vision API error"
- Image too large (resize before sending)
- Unsupported format (stick to JPEG/PNG)
- Rate limit hit
What's Next?
This foundation supports many extensions:
- Multi-language support: Detect user's language and respond accordingly
- Voice messages: Transcribe audio and respond
- Buttons and lists: Use WhatsApp's interactive message types
- Payment integration: Connect to Stripe for premium features
- Admin dashboard: Build a web interface for monitoring
Wrapping Up
You now have a complete AI-powered WhatsApp bot that can analyze images. The stack is simple, affordable, and scales well.
The combination of WhatsApp's reach and AI vision capabilities opens interesting possibilities. Users don't need to learn new interfaces—they just message like they always do.
This article was written based on my hands-on experience building production WhatsApp bots. The code examples are simplified for clarity—production deployments should include proper error handling, logging, and security measures.