DEV Community

Marcelo Acosta Cavalero for AWS Community Builders

Posted on • Originally published at buildwithaws.substack.com

What a Multimodal WhatsApp Agent Looks Like on AWS

Originally published on Build With AWS. Subscribe for weekly AWS builds.

AWS Agentic Architectures

I watched Miguel Otero Pedrido and Jesus Copado’s brilliant Ava the WhatsApp Agent series and tried building something similar. They built a multimodal WhatsApp bot using LangGraph and Google Cloud Run. The agent could hold conversations, analyze images, generate art, and process voice messages.

After going through the series, I had one question: what would this look like built 100% on AWS?

I started sketching out the architecture and quickly realized there were too many ways to build it. Pure Lambda orchestration? Bedrock Agents? Bedrock AgentCore? LangChain on Lambda? Step Functions? Each approach had tradeoffs I couldn’t ignore.

That’s when I decided to build a hybrid system. Not because hybrid is always better, but because building both patterns side by side would force me to understand when each approach makes sense.

The result is a production-ready WhatsApp bot on a manageable budget that demonstrates two distinct architectural patterns in the same codebase. You can find the complete code and deployment scripts at github.com/marceloacosta/multimodal-whatsapp-bot-aws to try it yourself.

WhatsApp screenshot

What You’ll Build

By the end of this guide, you’ll understand how to build a WhatsApp bot with:

  • Natural conversations powered by Claude 3.5 Sonnet
  • Image analysis using Claude Vision
  • AI image generation with Stable Diffusion XL (or Amazon Titan)
  • Voice message transcription with AWS Transcribe
  • Text-to-speech responses using Amazon Polly
  • A serverless architecture that scales automatically

More importantly, you’ll understand when to use direct Lambda processing versus Bedrock Agent frameworks.

Why Hybrid Architecture?

Most tutorials show you one approach and call it done. I’m showing you both because the “best” architecture depends on what you’re building.

Here’s the reality: simple operations don’t need the complexity of agent frameworks. Complex operations benefit from them. I learned this the hard way after rebuilding parts of this system three times.

The project uses direct Lambda functions for straightforward tasks like image analysis, text-to-speech, and transcription. These are deterministic operations that don’t need natural language understanding or multi-turn conversations.

For image generation, I use Bedrock Agents. Why? Because turning “create a sunset over mountains” into an optimized prompt for an image model requires natural language understanding and prompt engineering. An agent handles this better than hardcoded logic.

This approach saves money where agents would be overkill, and uses them where they add real value.

The Cost Reality Check

Before we dive deeper, here’s what running this bot actually costs:

For 1,000 messages per day:

  • Lambda execution: $5-10
  • Bedrock models: $20-30
  • S3 storage: $1-2
  • API Gateway: $1
  • Other services: $3-5

Total: $30-50 per month.

Image generation adds extra cost per image. Titan costs $0.01 per image, Stable Diffusion XL costs $0.04. These costs scale with usage, but you have full control over which model you use.

Paying only for what you use across AWS services often beats being locked into third-party platforms with mandatory monthly fees.

Architecture Overview

The system consists of 8 Lambda functions working together:

Architecture overview

Entry and orchestration:

  • inbound-webhook: Receives WhatsApp messages via API Gateway
  • wa-process: Main orchestrator that routes requests
  • wa-send: Sends messages back to WhatsApp

Feature handlers:

  • wa-image-analyze: Analyzes images using Claude Vision
  • wa-image-generate: Generates images using Titan or Stable Diffusion
  • wa-tts: Converts text to speech with Amazon Polly
  • wa-audio-transcribe: Starts transcription jobs using AWS Transcribe
  • wa-transcribe-finish: Handles transcription callbacks

Supporting services:

  • AWS Bedrock: Supervisor Agent + ImageCreator Sub-Agent
  • Amazon Polly: Text-to-speech synthesis
  • AWS Transcribe: Audio transcription
  • S3 buckets: Media storage and generated images
  • Secrets Manager: WhatsApp API credentials

The architecture diagram shows the complete flow, but I’ll walk you through how each piece works and why I made specific decisions.

Decision Framework: Lambda vs Agents

Lambda vs Agents

Here’s how I decided which approach to use for each feature.

Use direct Lambda when:

  • The operation is deterministic (TTS always works the same way)
  • You’re calling an AWS service directly (Transcribe, Polly)
  • The input-output relationship is simple
  • You want lower latency and cost

Use Bedrock Agents when:

  • You need natural language understanding
  • The task requires reasoning or optimization
  • Multi-turn conversations matter
  • Context needs to persist across interactions

Image analysis went to Lambda. The operation is simple: take an image, send it to Claude Vision, return the description. No complex prompt engineering needed.

Image generation went to Agents. User requests like “sunset” need to become detailed prompts like “a photorealistic sunset over mountain peaks with golden hour lighting, highly detailed, 8k resolution.” The agent handles this transformation.

The goal isn’t to pick a winner, but to match each method to what it does best.

Building the Foundation

Let’s start with the basics. You’ll need:

  • AWS account with Bedrock access
  • Python 3.9 or higher
  • AWS CLI configured
  • WhatsApp Business API account from Meta for Developers

You also need to enable model access in Bedrock for:

  • Claude 3.5 Sonnet v2
  • Claude 3.5 Haiku
  • Titan Image Generator v2

Model access is free to enable. You only pay when you use them.

Setting Up WhatsApp Business API

Getting WhatsApp access is straightforward but takes a few steps:

  • Go to Meta for Developers and create an app
  • Add the WhatsApp product to your app
  • Get your Phone Number ID and Access Token
  • Generate a verify token (any random string you choose)

Setting Up WhatsApp Business API

Store the long-lived access token in AWS Secrets Manager. This matters because the token is security-sensitive and needs rotation over time.

Create a secret with this structure:

{
  "token": "your_long_lived_access_token"
}

The Phone Number ID and Verify Token go in Lambda environment variables. Only the access token needs to be in Secrets Manager because it’s the credential that requires rotation and is security-sensitive.
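Fetching the token at runtime is a one-call boto3 sketch. The secret name is whatever you chose when creating the secret; the parsing step assumes the single-key JSON structure shown above.

```python
import json


def parse_whatsapp_secret(secret_string: str) -> str:
    """Extract the access token from the Secrets Manager JSON payload."""
    return json.loads(secret_string)["token"]


def get_whatsapp_token(secret_name: str) -> str:
    """Fetch the current token at runtime, so rotation needs no code change."""
    import boto3  # imported lazily so the parser stays testable offline

    client = boto3.client("secretsmanager")
    response = client.get_secret_value(SecretId=secret_name)
    return parse_whatsapp_secret(response["SecretString"])
```

Because the Lambda reads the secret on each invocation (or caches it briefly), rotating the token in Secrets Manager takes effect without redeploying anything.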

The Configuration Strategy

Lambda functions don’t use .env files. Each function has its own environment variables set directly in AWS Console or via CLI.

The env.example file in the repo is just a reference document showing what variables exist and where they’re used. Different Lambda functions need different configurations. The orchestrator needs agent IDs. The image generator needs model IDs and bucket names. The sender only needs to know where to find the access token in Secrets Manager.

This keeps each function’s configuration minimal and explicit.

Environment variables

Building the Entry Point

Every WhatsApp message hits inbound-webhook first. This Lambda handles two responsibilities: webhook verification and receiving messages.

Entry point

The verification flow is straightforward. WhatsApp sends a GET request with a challenge token. The Lambda verifies the token matches what you configured, then returns the challenge back. This proves you control the endpoint.

After verification passes, WhatsApp starts sending POST requests with message data. When media arrives (images, audio), the webhook downloads it to S3 for processing. Then it invokes wa-process asynchronously.

The async pattern is critical. WhatsApp expects a 200 response within seconds. Your bot might take 10-20 seconds to generate a response. Async invocation lets you acknowledge receipt immediately while processing happens in the background.
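Both responsibilities fit in a few lines. This is a minimal sketch, not the repo's exact handler: the `hub.*` parameter names are Meta's documented webhook handshake, while the response shapes and the `wa-process` target come from this article.

```python
import json


def handle_verification(query_params: dict, verify_token: str) -> dict:
    """Answer WhatsApp's GET handshake: echoing hub.challenge back
    proves we control the endpoint."""
    if (
        query_params.get("hub.mode") == "subscribe"
        and query_params.get("hub.verify_token") == verify_token
    ):
        return {"statusCode": 200, "body": query_params.get("hub.challenge", "")}
    return {"statusCode": 403, "body": "Forbidden"}


def acknowledge_and_process(lambda_client, payload: dict) -> dict:
    """Fire-and-forget: InvocationType='Event' returns immediately, so WhatsApp
    gets its 200 within seconds while wa-process works in the background."""
    lambda_client.invoke(
        FunctionName="wa-process",
        InvocationType="Event",
        Payload=json.dumps(payload).encode(),
    )
    return {"statusCode": 200, "body": "EVENT_RECEIVED"}
```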

Building the Orchestrator

The wa-process Lambda is the brain of the system. It receives a message and decides what to do with it.

The logic follows a simple flow: identify message type (text, image, audio), check for special intents like voice responses, route to the appropriate handler, and send the response back.

Orchestrator

For text messages, the function invokes the Bedrock Supervisor Agent and sends the response directly. For images with questions, it prepares context that includes the S3 URI and user’s question, then invokes the agent. For audio, it triggers the transcription Lambda and waits for the callback.

The hybrid architecture shows its value here. The orchestrator doesn’t care whether a feature uses direct Lambda calls or agent frameworks. Text and image analysis go through the agent. Audio transcription calls a Lambda directly. Image generation gets delegated to a sub-agent. The orchestrator just routes requests to the right place.
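The routing decision itself is plain deterministic code. Here's an illustrative sketch with a flattened message shape (`type`, `text`) rather than WhatsApp's real nested payload, and made-up keyword hints for detecting voice-response requests:

```python
VOICE_HINTS = ("voice message", "audio message", "send me audio")  # illustrative


def route(message: dict) -> dict:
    """Decide where a message goes and whether the reply should be spoken."""
    mtype = message.get("type", "text")
    text = (message.get("text") or "").lower()
    wants_voice = any(hint in text for hint in VOICE_HINTS)

    if mtype == "audio":
        target = "wa-audio-transcribe"   # direct Lambda, async callback
    elif mtype in ("text", "image"):
        target = "supervisor-agent"      # Bedrock agent handles both
    else:
        target = "unsupported"
    return {"target": target, "voice_reply": wants_voice}
```

When `voice_reply` is set, the orchestrator still gets text back from the agent, then hands it to wa-tts before sending.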

The orchestrator also handles voice response requests. When a user asks for a voice message, it sets a flag and invokes the agent to generate text. Once the agent responds, it calls wa-tts to convert that text to audio. This separation keeps the agent focused on content generation while the orchestrator manages output formats.

Direct Lambda Pattern: Image Analysis

Image analysis

Image analysis shows the direct Lambda approach clearly. The operation is simple: download an image from S3, send it to Claude Vision via the Bedrock Converse API, and return the description.

The Lambda downloads the image bytes from S3 rather than passing an S3 reference. This makes the code more resilient to API changes. The image bytes and the user’s question get sent to Claude 3.5 Sonnet Vision, which returns a description.

Image analysis with lambda

This direct approach gives you full control. No agent orchestration, no prompt optimization, just a straightforward API call. The entire Lambda executes in under 3 seconds.
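The Converse API call can be sketched as a request builder plus one client call. The image/text content-block shapes are the documented Converse format; the inference-profile model ID matches the one used elsewhere in this project.

```python
def build_vision_request(image_bytes: bytes, question: str, fmt: str = "jpeg") -> dict:
    """Shape a Bedrock Converse request: raw image bytes plus the user's question."""
    return {
        "modelId": "us.anthropic.claude-3-5-sonnet-20241022-v2:0",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"image": {"format": fmt, "source": {"bytes": image_bytes}}},
                    {"text": question},
                ],
            }
        ],
    }


# In the Lambda (sketch):
#   bedrock = boto3.client("bedrock-runtime")
#   reply = bedrock.converse(**build_vision_request(image_bytes, question))
#   description = reply["output"]["message"]["content"][0]["text"]
```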

The cost is predictable: $0.008 per image analyzed. At 1,000 images per month, that’s $8. The agent framework would add orchestration overhead without adding value for this use case.

When would you add an agent layer? When the image analysis needs to trigger other actions, maintain conversation context across multiple images, or integrate with knowledge bases. For straightforward “analyze this image” requests, direct Lambda is the better choice.

Direct Lambda Pattern: Voice and Audio

Text-to-speech and audio transcription follow the same direct Lambda pattern.

Voice and audio

For TTS, the wa-tts Lambda receives text from the orchestrator and calls Amazon Polly to synthesize speech. Polly returns an MP3 audio stream, which gets uploaded to S3. The Lambda generates a presigned URL for the audio file and returns it to the orchestrator. The orchestrator then calls wa-send with that audio URL to deliver it to WhatsApp. The entire operation costs about $0.016 per request (Polly charges $16 per 1 million characters).
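The Polly call is a single synchronous API request. A minimal sketch, with the voice ID as an assumption (in practice you'd pick one matching the user's detected language):

```python
def build_polly_request(text: str, voice_id: str = "Joanna") -> dict:
    """Arguments for polly.synthesize_speech."""
    return {
        "Text": text,
        "OutputFormat": "mp3",   # WhatsApp accepts MP3 audio attachments
        "VoiceId": voice_id,
        "Engine": "neural",
    }


# In the Lambda (sketch):
#   response = polly.synthesize_speech(**build_polly_request(reply_text))
#   s3.upload_fileobj(response["AudioStream"], bucket, key)
#   url = s3.generate_presigned_url(
#       "get_object", Params={"Bucket": bucket, "Key": key}, ExpiresIn=3600
#   )
```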

Audio transcription is more complex because AWS Transcribe is asynchronous. You can’t just call an API and get the result immediately.

The wa-audio-transcribe Lambda starts a transcription job. It tells Transcribe where to find the audio file in S3 (uploaded earlier by the webhook), what format it’s in (usually OGG for WhatsApp voice notes), and where to store the results. Then it returns immediately.

AWS Transcribe processes the audio in the background. When finished, it writes the transcript JSON to S3. This triggers an S3 ObjectCreated event that invokes the wa-transcribe-finish Lambda. This Lambda reads the transcript from S3, extracts the text, and sends it back to the orchestrator as if it were a new text message. The orchestrator then sends it to the agent for processing.
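Starting the job is one `start_transcription_job` call. A sketch of the parameters, with `IdentifyLanguage` as an assumption (letting Transcribe detect Spanish/English/Portuguese rather than hardcoding a language code):

```python
def build_transcribe_job(job_name: str, media_uri: str, output_bucket: str) -> dict:
    """Parameters for transcribe.start_transcription_job. OGG is WhatsApp's
    voice-note format; writing results to output_bucket is what fires the
    S3 ObjectCreated trigger for wa-transcribe-finish."""
    return {
        "TranscriptionJobName": job_name,
        "Media": {"MediaFileUri": media_uri},
        "MediaFormat": "ogg",
        "IdentifyLanguage": True,
        "OutputBucketName": output_bucket,
    }


# transcribe = boto3.client("transcribe")
# transcribe.start_transcription_job(**build_transcribe_job(job, uri, bucket))
```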

This async pattern is crucial for long-running operations. WhatsApp users expect quick responses, but transcription can take 30-60 seconds depending on audio length. The callback pattern lets the user know their message was received while processing happens in the background.

Agent Framework Pattern: Conversations

Now let’s look at the agent side. The Supervisor Agent handles all text conversations.

Agent Framework

The agent instructions require quite a bit of thought. You need to balance several competing concerns: natural conversation flow, WhatsApp’s messaging constraints, multi-language support, and managing different output formats.

The instructions need to handle language detection and matching. Users might write in Spanish, English, or Portuguese. The agent needs to detect this and respond appropriately. This is straightforward for text but becomes tricky when you add voice responses.

For voice responses, there’s a subtle problem. If a user asks for an audio message and the agent says “I’ll send you an audio message about quantum physics,” the TTS system converts that entire sentence to audio. The user hears “I’ll send you an audio message about quantum physics” instead of just hearing about quantum physics. The solution is explicit instructions: never mention the output format, just generate the content. The backend handles format conversion.

The instructions also need to consider WhatsApp’s messaging patterns. Long paragraphs work poorly in chat. The agent needs to keep responses concise while still being helpful. This means being explicit about brevity without sacrificing accuracy.

Benefits of this approach: the agent focuses on content generation, not infrastructure concerns. You can add new output formats (video captions, PDFs) without changing agent instructions. The separation between content and delivery is clean.

Drawbacks: the instructions become longer and more specific. More specific instructions mean less flexibility for the agent to adapt to edge cases. You also need to test thoroughly because the agent won’t tell you when it’s confused about format handling.

The agent connects to Lambda functions via action groups. For image analysis, the action group defines a function with parameters for the S3 URI, optional question, and optional language code. When a user sends an image with a question, the orchestrator formats it as a structured context block with these parameters. The agent parses this, calls the analyzeImage action, and returns the result.
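The structured context block is just text the agent can parse reliably. The exact layout below is illustrative (the repo may format it differently); what matters is that the three action-group parameters are unambiguous:

```python
from typing import Optional


def format_image_context(s3_uri: str, question: Optional[str] = None,
                         language: Optional[str] = None) -> str:
    """Pack the analyzeImage action-group parameters into a text block
    the Supervisor Agent can parse. Layout is a hypothetical convention."""
    lines = [f"[IMAGE] s3_uri={s3_uri}"]
    if question:
        lines.append(f"question={question}")
    if language:
        lines.append(f"language={language}")
    return "\n".join(lines)
```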

This separation is powerful. You can change how image analysis works (switch models, add caching, implement fallbacks) without touching the orchestrator or agent instructions. The interface stays stable while the implementation evolves.

Agent Framework Pattern: Image Generation

Agent Image Generation

Image generation shows why agents matter for complex tasks. When a user says “create a sunset,” that vague request needs to become a detailed prompt like “a photorealistic sunset over mountain peaks with golden hour lighting, vibrant orange and purple clouds, highly detailed, 8k resolution.” This transformation requires natural language understanding and prompt engineering, which agents handle well.

The architecture uses a sub-agent pattern. The Supervisor Agent detects image generation requests and delegates to an ImageCreator sub-agent. This keeps responsibility focused: the supervisor handles routing decisions, the sub-agent handles prompt optimization, and the Lambda handles the actual image generation.

The ImageCreator sub-agent analyzes the user’s natural language request and creates an optimized prompt for the image model. It considers style preferences, adds quality modifiers, and constructs negative prompts to avoid common issues. Then it calls the wa-image-generate Lambda through an action group.

The Lambda receives the optimized prompt and calls the configured Bedrock image model (Stable Diffusion XL or Titan). The generated image gets uploaded to S3, a presigned URL is created, and the Lambda uses Claude Haiku to generate a natural caption in the user’s language. Finally, it invokes wa-send to deliver the image to WhatsApp with the caption.

The sub-agent responds with a simple success indicator to the supervisor, which passes it back to the orchestrator. The orchestrator knows the image was already sent directly by the Lambda, so it doesn’t send anything else.

This multi-layer delegation (orchestrator → supervisor → sub-agent → Lambda) seems complex, but each layer has a clear purpose. The orchestrator routes by message type. The supervisor manages conversation context. The sub-agent optimizes prompts. The Lambda generates images. Each component does one thing well.

The Configuration Pattern

Earlier I mentioned environment variables are set per-Lambda. Here’s the complete pattern:

Secrets Manager (long-lived token only):

  • WhatsApp access token (needs rotation, security-sensitive)

Lambda environment variables (function-specific):

  • wa-process: Agent IDs, region, function names
  • wa-image-generate: Model IDs, bucket names
  • inbound-webhook: Bucket names, verify token, downstream functions
  • wa-send: Phone number ID, secret name

This approach scales better than shared configuration. Each function only knows what it needs. Changes to one function don’t affect others.

Setting these via CLI looks like:

aws lambda update-function-configuration \
  --function-name wa-process \
  --environment '{
    "Variables": {
      "BEDROCK_AGENT_ID": "AGENTXXX",
      "BEDROCK_AGENT_ALIAS_ID": "ALIASXXX",
      "BEDROCK_REGION": "us-east-1",
      "MEDIA_BUCKET": "my-media-bucket"
    }
  }'
Or use the AWS Console for easier management. Both approaches work.

Deployment Strategy

The repo includes automated deployment scripts that handle the entire setup. Understanding what happens during deployment helps when debugging issues later.

Lambda deployment involves several steps: packaging the code, creating the function with the right runtime and memory settings, configuring environment variables, and setting up triggers. Each function needs different timeout and memory configurations. The webhook and orchestrator need quick response times. Image generation needs more time and memory. Audio transcription is somewhere in between.

The deployment scripts handle creating IAM roles with appropriate permissions. Each Lambda gets least-privilege access: only the specific AWS services it needs. The image analyzer reads from S3 but doesn’t write. The image generator writes to S3 but doesn’t read user data. The orchestrator invokes other Lambdas but doesn’t access S3 directly.

Deployment

Triggers need configuration too. API Gateway triggers the webhook Lambda on HTTP requests. S3 ObjectCreated events trigger the transcription finish Lambda. Other Lambdas get invoked directly by other functions, so they don’t need external triggers.

The critical piece many people miss: Bedrock Agents need explicit permission to invoke Lambda functions. AWS doesn’t automatically grant this. You must add a resource-based policy to each Lambda that allows the bedrock.amazonaws.com service principal to invoke it, scoped to your specific agent ARN. Without this permission, the agent fails silently with generic error messages like “I cannot help with that.”
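Granting that permission is one `add_permission` call per Lambda. A sketch of the arguments (the agent ARN shown in the test is a placeholder):

```python
def agent_invoke_permission(function_name: str, agent_arn: str,
                            statement_id: str = "AllowBedrockAgentInvoke") -> dict:
    """kwargs for lambda_client.add_permission: lets the Bedrock service
    principal invoke this function, scoped to one specific agent."""
    return {
        "FunctionName": function_name,
        "StatementId": statement_id,
        "Action": "lambda:InvokeFunction",
        "Principal": "bedrock.amazonaws.com",
        "SourceArn": agent_arn,  # scope to your agent, not all of Bedrock
    }


# lambda_client = boto3.client("lambda")
# lambda_client.add_permission(**agent_invoke_permission("wa-image-generate", agent_arn))
```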

The automated scripts handle all these details, but knowing what they do helps when something goes wrong. If an agent can’t invoke a Lambda, check the resource policy. If a Lambda times out, check the timeout setting. If environment variables are missing, check the function configuration.

Setting Up Bedrock Agents

Creating agents through the AWS Console is straightforward but has specific steps.

Setting up Bedrock Agents

For the Supervisor Agent:

  1. Go to Bedrock Console → Agents → Create Agent
  2. Name it descriptively (I use whatsapp-supervisor-agent)
  3. Choose Claude 3.5 Sonnet v2 as the foundation model
  4. Copy instructions from supervisor-agent-instructions.txt
  5. Add action group for image analysis
  6. Prepare the agent (this compiles everything)
  7. Create an alias pointing to the prepared version

That last step trips people up. Changes to an agent don’t take effect until you:

  • Prepare the agent (creates a new version)
  • Update the alias to point to the new version

If you change instructions and skip these steps, your bot still uses the old version.

For the ImageCreator sub-agent:

  1. Create another agent with a focused name
  2. Use simpler instructions (it has one job)
  3. Add action group with the OpenAPI schema from lambdas/wa-image-generate/openapi-schema.json
  4. Prepare and create alias

Then link them:

  1. Edit the Supervisor Agent
  2. Add ImageCreator as a collaborator
  3. Specify when to delegate (image generation requests)
  4. Prepare the supervisor again
  5. Update its alias

The supervisor now knows to call the sub-agent for image requests.

Image Generation Models

The system supports two image generation models through a single Lambda function. You choose which model to use by setting the IMAGE_MODEL_ID environment variable.

IMAGE_MODEL_ID = os.environ.get("IMAGE_MODEL_ID", "stability.stable-diffusion-xl-v1")

Stable Diffusion XL is the default. It offers more creative control with style presets and costs about $0.04 per image. Amazon Titan Image Generator v2 is the alternative, optimized for photorealistic output at about $0.01 per image.

The Lambda detects which model is configured and uses the appropriate API format. Each model has different input parameters and response structures, but the Lambda abstracts these differences. From the agent’s perspective, image generation works the same way regardless of which model you choose.

To switch models, you update the Lambda’s environment variable in AWS Console or via CLI. The benefit of this design is that only the one Lambda changes. The orchestrator, agents, and other Lambdas continue working without modification. The abstraction layer handles the model-specific differences.
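The abstraction layer amounts to building a different request body per model family. A sketch, keyed off the model ID prefix; the request shapes are Bedrock's documented formats for SDXL and Titan, while the specific `cfg_scale`/`steps`/quality values are assumptions:

```python
def build_image_request(model_id: str, prompt: str, negative: str = "") -> dict:
    """Per-model invoke_model bodies behind one interface."""
    if model_id.startswith("stability."):
        prompts = [{"text": prompt, "weight": 1.0}]
        if negative:
            prompts.append({"text": negative, "weight": -1.0})
        return {"text_prompts": prompts, "cfg_scale": 7, "steps": 30}
    if model_id.startswith("amazon.titan-image"):
        return {
            "taskType": "TEXT_IMAGE",
            "textToImageParams": {"text": prompt},
            "imageGenerationConfig": {"numberOfImages": 1, "quality": "standard"},
        }
    raise ValueError(f"Unsupported model: {model_id}")
```

The Lambda serializes the chosen body with `json.dumps` and passes it to `bedrock_runtime.invoke_model`; the agent never sees which branch ran.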

Performance Optimization

Lambda cold starts matter for user experience. When a function hasn’t run recently, AWS needs to initialize it. This adds 1-3 seconds of latency.

Lambda cold start

This demo doesn’t use provisioned concurrency to keep costs minimal. For production deployments with consistent traffic, consider provisioned concurrency for the webhook and orchestrator functions. These are in the critical path for response time. Other functions can tolerate cold starts because they’re not user-facing or run asynchronously.

Agent response time varies based on complexity. Simple text responses take 2-4 seconds. Image generation requests take 10-15 seconds total (agent reasoning + generation + upload).

For audio transcription, the system can send an immediate acknowledgment, then delivers the actual transcription when ready. This manages user expectations for the longer processing time.

Security Considerations

The system has several security layers.

Security Layers

Webhook verification ensures only WhatsApp can send messages. Without the correct verify token, requests are rejected.

IAM roles follow least privilege. Each Lambda only has permissions for the specific AWS services it needs. The image analyzer can read from S3 but not write. The image generator can write but not read others’ images.

Secrets Manager handles credential rotation. The WhatsApp access token can be rotated without code changes. Lambda functions fetch the current token at runtime.

S3 buckets are private by default. Images are shared via presigned URLs that expire after 7 days. No public bucket access.
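The 7-day expiry is the maximum SigV4 presigned URLs allow. A sketch of the arguments:

```python
SEVEN_DAYS = 7 * 24 * 3600  # 604800 s, the SigV4 presigned-URL maximum


def presign_params(bucket: str, key: str) -> dict:
    """Arguments for s3.generate_presigned_url('get_object', ...)."""
    return {"Params": {"Bucket": bucket, "Key": key}, "ExpiresIn": SEVEN_DAYS}


# url = s3.generate_presigned_url("get_object", **presign_params(bucket, key))
```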

What’s missing? Content moderation. The current implementation doesn’t filter generated images or user prompts. For production use, add:

  • Bedrock Guardrails to filter inappropriate prompts
  • Image scanning before sending to users
  • Rate limiting per user
  • Cost monitoring and alerts

These additions depend on your specific requirements and risk tolerance.

Lessons Learned

I rebuilt parts of this system three times. Here’s what I learned:

Lessons Learned

Agent instructions require precision. Vague instructions lead to unpredictable behavior. The voice response handling needed explicit rules about never mentioning the output format. Language detection needed clear fallback behavior. Each edge case required specific handling in the instructions.

Hybrid architecture balances trade-offs. Pure agent systems cost more and respond slower for simple operations. Pure Lambda systems require writing all the conversational logic yourself. The hybrid approach uses agents where their natural language capabilities add value and direct Lambdas where they don’t.

Async patterns matter for user experience. WhatsApp users expect quick acknowledgments. Transcription takes 30-60 seconds. Image generation takes 10-15 seconds. The async callback patterns let the system respond immediately while work happens in the background.

Component isolation simplifies debugging. Each Lambda has a single responsibility. When something breaks, you can test that Lambda independently. Clear interfaces between components mean changes don’t cascade unexpectedly.

Permission issues cause silent failures. Bedrock Agents fail with generic error messages when they can’t invoke Lambdas. IAM permission debugging takes time. Checking permissions early when something doesn’t work saves troubleshooting time later.

Alternative Approaches

This hybrid architecture is one way to build this system. Here are alternatives and when to use them.

Pure Lambda orchestration: Remove Bedrock Agents entirely. The orchestrator directly calls all functions based on deterministic logic. Simpler and cheaper, but you write all the prompt engineering logic yourself.

Pure Agent architecture: Make everything an agent action group. Image analysis, TTS, transcription all go through the agent. Unified conversational interface with better context management, but higher cost and latency for simple tasks.

Bedrock AgentCore: Use AWS Bedrock AgentCore with your choice of agent framework (LangGraph, CrewAI, LlamaIndex). More infrastructure services like 8-hour runtimes and built-in observability, but requires more architectural decisions upfront.

Agent framework (LangChain, CrewAI): Replace Bedrock Agents with an open-source framework hosted in Lambda. Full control and portability, but you handle state management and dependencies yourself.

Step Functions orchestration: Use AWS Step Functions for workflow management instead of Lambda orchestration. Visual workflows with built-in retry logic, but more services to manage.

The right choice depends on your requirements. The hybrid approach teaches you both patterns so you can decide what works for your use case.

For a detailed comparison with pros, cons, and migration paths, see the ARCHITECTURE_DECISIONS.md document in the repo.

Getting Started

The repo includes automated deployment scripts that handle the Lambda setup. You can deploy everything at once or go function by function to understand each piece. After the Lambda deployment, you’ll create the Bedrock agents through the AWS Console and link them together.

The documentation walks you through both approaches. If you want to understand every component, deploy and test each Lambda individually. If you want to get running quickly, use the automated scripts and dive into specific parts later.

Setting up the agents requires more manual steps. You’ll create the supervisor agent with its conversation instructions, add the action group for image analysis, then create the image creator sub-agent and link it as a collaborator. The agent setup guide includes the exact instructions and parameters for each step.

The code is designed to be adaptable. The hybrid architecture isn’t prescriptive. Want to remove agents and handle everything with Lambda logic? The orchestrator is easy to modify. Want to add new capabilities? Create a Lambda, add it to the orchestrator’s routing logic, and decide whether to call it directly or through an agent action group.

The repo documentation covers deployment details, agent configuration, troubleshooting, and architectural alternatives. Start with what interests you most.

What This Enables

This isn’t just about building a WhatsApp bot. The patterns here apply to many AI applications.

The hybrid architecture shows how to balance simplicity with capability. The agent collaboration pattern shows how to break complex tasks into focused components. The async processing pattern shows how to maintain good user experience with slow operations.

You can adapt these patterns to build:

  • Telegram or Discord bots with the same backend
  • Slack integrations with multimodal capabilities
  • API services that use agents for complex requests
  • Customer service automation with image support

The serverless foundation means it scales automatically. The AWS services handle infrastructure so you focus on functionality.

Where to Go From Here

If you build something with this architecture, I’d like to hear about it. What worked? What didn’t? What did you change?

The complete code, documentation, and deployment scripts are at github.com/marceloacosta/multimodal-whatsapp-bot-aws. The repo is actively maintained. Issues and pull requests are welcome.

Start with the README for an overview, then dive into the architecture decisions document to understand the tradeoffs. The code includes comments explaining why specific approaches were chosen.

For questions or discussion, you can find me here or on LinkedIn. I regularly share updates about AI systems and AWS architecture patterns.

Build something interesting with this. Then share what you learned.

I publish every week on buildwithaws.substack.com. If this was useful, subscribe. It's free.
