Aurora75


AWS Production-Ready Email Infrastructure

Building Production-Ready Email Infrastructure with SESMailEngine: An Event-Driven Architecture Deep Dive

by SESMailEngine Team on 31 JAN 2026 in Amazon Simple Email Service (SES), Architecture, Serverless


The AWS SES Challenge: Power Meets Complexity

AWS SES offers compelling economics: with a free tier of 3,000 emails/month and $0.10 per 1,000 emails thereafter, it's 85-90% cheaper than SendGrid or Mailgun. But this cost advantage comes with significant engineering challenges that many teams underestimate.

The reputation management burden: Unlike managed email services, AWS expects you to monitor bounce rates, complaint rates, and sender reputation. Cross the 5% bounce threshold and your account enters review. Hit 10% and sending gets paused. Reputation management is the sender's responsibility, requiring proactive monitoring and response.

The template sprawl problem: Without a centralized template system, teams inevitably construct HTML emails across multiple services. What starts as a simple welcome email quickly becomes dozens of scattered email templates. Updating a footer copyright from "2025" to "2026" becomes a multi-hour archaeology expedition across codebases, often resulting in inconsistent branding that damages platform credibility.

The AI security risk: In the era of AI-powered development, exposing ses.send_email() directly to AI agents is risky. Without guardrails, an AI agent could send emails to unverified addresses, ignore suppression lists, or trigger reputation violations that pause sending for your entire account.

The Solution: Event-Driven Multi-Producer, Single-Consumer Architecture

These challenges can be addressed with a serverless, event-driven architecture. SESMailEngine implements this approach, following AWS best practices for scalability, reliability, and cost optimization. It uses a multi-producer, single-consumer pattern: any number of services can request emails, but all requests flow through a single, centralized email processing pipeline.

Visualize the architecture: https://www.sesmailengine.com/?demo=true

This architectural approach delivers three critical benefits:

  1. Centralized reputation protection: All emails flow through a single suppression check, preventing any service from accidentally sending to bounced or complained addresses.

  2. Template consistency: Email templates live in S3, decoupled from application code. Update a template once, and all services immediately use the new version—no code deployments required.

  3. AI-safe interface: Services publish events to EventBridge instead of calling SES directly. The pipeline behind the event bus applies validation, suppression checks, and template rendering before any email reaches SES.


Architecture Overview: EventBridge as the Foundation

The architecture centers on Amazon EventBridge as both the entry point and fan-out mechanism for the entire system. This design choice aligns with AWS event-driven architecture best practices and provides several key advantages.

Why EventBridge as Entry Point?

Decoupling producers from consumers: Services publish "Email Request" events without knowing how emails are processed. This loose coupling means you can upgrade the email infrastructure, add new features, or even swap email providers without touching producer code.
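As a rough illustration of what a producer might publish, here is a minimal sketch that builds one PutEvents entry for an "Email Request" event. The bus name, source string, and detail field names (emailId, to, templateName, templateData) are assumptions made for this example, not the documented SESMailEngine schema.

```python
import json
import uuid


def build_email_request_entry(to_email, template_name, template_data,
                              bus_name="sesmailengine-bus",  # hypothetical bus name
                              source="my.service"):          # hypothetical source
    """Build one PutEvents entry describing an email request."""
    detail = {
        "emailId": str(uuid.uuid4()),   # hypothetical field names throughout
        "to": to_email,
        "templateName": template_name,
        "templateData": template_data,
    }
    return {
        "EventBusName": bus_name,
        "Source": source,
        "DetailType": "Email Request",
        "Detail": json.dumps(detail),   # EventBridge expects Detail as a JSON string
    }


entry = build_email_request_entry("user@example.com", "welcome", {"firstName": "Ada"})
# With boto3 installed and credentials configured, the entry could be sent with:
#   boto3.client("events").put_events(Entries=[entry])
```

The producer only needs permission to put events on the bus; it never holds SES credentials.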

Built-in retry and reliability: EventBridge automatically retries failed deliveries with exponential backoff. If the downstream SQS queue is temporarily unavailable, EventBridge handles retries transparently—no custom retry logic required.

Bidirectional event flow: EventBridge serves as both the entry point and the notification hub. Services publish "Email Request" events to send emails, and the Feedback Processor publishes "Email Status Changed" events back to the same bus when emails are delivered, bounced, or marked as spam. Consumer services can create EventBridge rules to subscribe to these status updates—enabling real-time reactions to bounces, delivery confirmations, or complaint handling without polling.
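A consumer that wants to react to bounces and complaints could subscribe with a rule pattern along these lines. This is a sketch: the source value and the status field are assumed names, and the small matches() helper only imitates a subset of EventBridge matching so the pattern can be tried locally.

```python
import json

# Hypothetical source and detail field names.
status_rule_pattern = {
    "source": ["sesmailengine"],
    "detail-type": ["Email Status Changed"],
    "detail": {"status": ["bounced", "complained"]},
}


def matches(pattern, event):
    """Local imitation of a subset of EventBridge matching:
    lists of exact values and nested objects only."""
    for key, expected in pattern.items():
        if isinstance(expected, dict):
            if not isinstance(event.get(key), dict) or not matches(expected, event[key]):
                return False
        elif event.get(key) not in expected:
            return False
    return True


# The JSON string is what a rule definition would carry, for example:
#   boto3.client("events").put_rule(Name=..., EventBusName=..., EventPattern=pattern_json)
pattern_json = json.dumps(status_rule_pattern)
```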

Multi-consumer fan-out: Need to archive all email requests to S3 for compliance? Add an EventBridge rule. Want to forward bounce notifications to a third-party analytics service? Add another rule. The producer code never changes, and new consumers can subscribe without modifying the core infrastructure.

Cross-account and cross-region support: EventBridge natively supports cross-account event delivery. If your organization runs multiple AWS accounts (common in enterprise environments), services in Account A can publish email requests that are processed by SESMailEngine in Account B—all with proper IAM controls and audit trails.

EventBridge to SQS: The Buffering Layer

EventBridge routes email requests to an SQS queue, not directly to Lambda. This architectural decision is critical for handling traffic spikes and protecting against downstream failures.

Why SQS after EventBridge?

Traffic buffering: Marketing campaigns can generate thousands of email requests in seconds. Without SQS, these bursts would overwhelm Lambda concurrency limits and trigger throttling errors. SQS absorbs the spike, queuing messages for processing at a controlled rate.

Automatic retry with visibility timeout: When Lambda fails to process a message (template error, transient SES issue, etc.), SQS automatically returns the message to the queue after the visibility timeout expires. The message gets three attempts before moving to the Dead Letter Queue—all without custom retry code.

Backpressure and rate limiting: The MaximumConcurrency setting on the Lambda event source mapping for the queue caps how many Lambda invocations process emails simultaneously. This helps keep the pipeline under your SES sending rate limit (typically 14 emails/sec for new production accounts). Size the concurrency to your SES rate, and the system naturally throttles itself.
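One way to size that concurrency is to note that steady-state throughput is roughly concurrency divided by the average time one invocation takes. The sketch below assumes one email per invocation and a measured average send time; the function name is made up for this example. The bounds reflect the range the event source mapping accepts for MaximumConcurrency (2 to 1000).

```python
import math

MIN_CONCURRENCY = 2      # lowest value the event source mapping accepts
MAX_CONCURRENCY = 1000   # highest value the event source mapping accepts


def max_concurrency_for(ses_rate_per_sec, avg_send_seconds):
    """Concurrency that keeps throughput at or below the SES send rate,
    assuming one email per invocation: throughput ~= concurrency / avg_send_seconds."""
    target = math.floor(ses_rate_per_sec * avg_send_seconds)
    return max(MIN_CONCURRENCY, min(MAX_CONCURRENCY, target))


print(max_concurrency_for(14, 0.5))  # 7
```

Because the minimum is 2, very fast invocations can still exceed a low SES rate, so transient throttling errors from SES should still be treated as retryable.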

Message durability: SQS can retain messages for up to 14 days (the default is 4 days, so set the retention period explicitly). If your Lambda function is down due to a deployment issue, messages wait in the queue rather than being lost. Direct EventBridge-to-Lambda invocation relies on EventBridge's retry policy, which keeps retrying an event for at most 24 hours.

Cost optimization: SQS long polling (20-second wait) reduces empty receives and lowers costs. The queue only charges for actual message operations, making it nearly free for typical email volumes.


Lambda Processing: Single Responsibility, Maximum Reliability

The Email Sender Lambda implements a single responsibility: transform an email request event into a sent email. This focused design makes the function testable, maintainable, and reliable.

Processing Flow

1. Parse event (SQS-wrapped EventBridge event)
2. Check suppression list (DynamoDB query)
3. Load template from S3 (with 5-minute cache)
4. Render template with Jinja2
5. Resolve sender email (override → template → default)
6. Send via SES with configuration set
7. Track in DynamoDB (status: "sent")
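Steps 1 and 5 of the flow above can be sketched as follows. When EventBridge targets SQS without an input transformer, the message body is the full event as JSON, so the request sits under its "detail" key. The field names (from, sender) and the default address are assumptions for illustration.

```python
import json

DEFAULT_SENDER = "noreply@example.com"  # hypothetical default sender


def parse_email_request(sqs_record):
    """Step 1: unwrap the EventBridge event carried in the SQS message body."""
    envelope = json.loads(sqs_record["body"])
    return envelope["detail"]


def resolve_sender(request, template_metadata, default=DEFAULT_SENDER):
    """Step 5: request override first, then template metadata, then the default."""
    return request.get("from") or template_metadata.get("sender") or default


record = {"body": json.dumps({
    "detail-type": "Email Request",
    "detail": {"to": "user@example.com", "templateName": "welcome"},
})}
request = parse_email_request(record)
sender = resolve_sender(request, {"sender": "hello@example.com"})
```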

Idempotency protection: Before processing, the Lambda checks if the emailId already exists in DynamoDB. If found, the message is discarded without sending. This prevents duplicate emails when SQS redelivers a message after a Lambda timeout.
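The idea can be sketched with an in-memory stand-in for the tracking table. Note that this variant claims the emailId atomically before sending rather than doing a separate check; with DynamoDB the same claim could be expressed as a put_item with the condition attribute_not_exists(emailId). Class and function names here are illustrative, not the actual implementation.

```python
class InMemoryTracking:
    """Stand-in for the EmailTracking table, for illustration only."""

    def __init__(self):
        self._items = {}

    def put_if_absent(self, email_id, item):
        """Return True if the record was created, False if it already existed."""
        if email_id in self._items:
            return False
        self._items[email_id] = item
        return True


def send_once(table, email_id, send):
    """Claim the emailId first, then send; a redelivered message is skipped."""
    if not table.put_if_absent(email_id, {"status": "processing"}):
        return "duplicate"
    send()
    return "sent"
```

A real implementation would also have to decide how to treat a record left in "processing" by an invocation that crashed before sending.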

Fail-fast validation: Invalid email addresses, missing template variables, and suppressed recipients are detected early and tracked as "failed" status. The message is consumed (removed from queue) to prevent infinite retries. This design ensures the DLQ only contains truly unexpected errors that need human investigation.
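One way to express this split between permanent and transient failures is the SQS partial batch response, which requires ReportBatchItemFailures to be enabled on the event source mapping. The article does not say whether SESMailEngine uses that feature, so treat this as a sketch; PermanentError and handle_batch are hypothetical names.

```python
class PermanentError(Exception):
    """Bad input that will never succeed on retry (invalid address, missing variable)."""


def handle_batch(records, process):
    """Return an SQS partial-batch response so only transient failures are retried."""
    failures = []
    for record in records:
        try:
            process(record)
        except PermanentError:
            # Consume the message; the caller would record status "failed" here.
            continue
        except Exception:
            # Unexpected or transient error: let SQS redeliver this message.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```

Messages reported in batchItemFailures return to the queue and count toward the receive limit, so only they can end up in the DLQ.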

Why Not Direct EventBridge-to-Lambda?

Direct EventBridge-to-Lambda invocation seems simpler, but it sacrifices critical reliability features:

  • Limited retry control: EventBridge lets you set a maximum retry count and a maximum event age per target, but not the backoff schedule, and it offers nothing like a visibility timeout or per-message receive count.
  • No visibility into pending work: With SQS, you can see queue depth in CloudWatch and scale Lambda concurrency accordingly. Direct invocation provides no visibility into pending events.
  • No backpressure mechanism: EventBridge will invoke Lambda as fast as events arrive, potentially exceeding concurrency limits and triggering throttling.
  • Harder to debug: SQS DLQ messages can be inspected and redriven to the source queue. Events that land in an EventBridge DLQ must be re-published to the bus to be replayed, which takes extra tooling.

S3 Template Storage: Decoupling Content from Code

Email templates live in S3, not in application code or Lambda deployment packages. This architectural decision enables template updates without code deployments and supports non-technical users managing email content.

Template Structure

s3://sesmailengine-templates-{AccountId}/
└── templates/
    └── {templateName}/
        ├── template.html      # HTML body (Jinja2)
        ├── template.txt       # Plain text fallback
        └── metadata.json      # Subject, sender, variables

Why S3 for templates?

Zero-downtime updates: Marketing teams can update email copy, fix typos, or refresh branding by uploading new template files to S3. The Lambda function's 5-minute cache ensures changes propagate quickly without requiring code deployments or Lambda restarts.

Version control with S3 versioning: Enable S3 versioning on the template bucket, and every template change is automatically versioned. Roll back a bad template update by restoring a previous version—no Git commits or code deployments required.

Separation of concerns: Developers own the email infrastructure (Lambda, DynamoDB, SQS). Marketing owns email content (templates in S3). Neither team blocks the other, and changes happen independently.

Template reuse across environments: The same template bucket structure works in dev, staging, and production. Copy templates between environments using AWS CLI or S3 sync—no code changes needed.

Template Caching Strategy

The Lambda function caches templates in memory for 5 minutes. This reduces S3 GET requests (cost optimization) while ensuring template updates propagate reasonably quickly.
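A time-based cache of this kind can be sketched in a few lines. The loader callable stands in for the S3 read, and the injectable clock exists only to make the behavior easy to test; neither is taken from the actual implementation.

```python
import time


class TemplateCache:
    """Per-instance in-memory cache with a fixed time-to-live."""

    def __init__(self, loader, ttl_seconds=300, clock=time.monotonic):
        self._loader = loader      # e.g. a function that reads the template from S3
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}         # template name -> (expires_at, template)

    def get(self, name):
        now = self._clock()
        entry = self._entries.get(name)
        if entry and entry[0] > now:
            return entry[1]        # still fresh: no S3 request
        template = self._loader(name)
        self._entries[name] = (now + self._ttl, template)
        return template
```

Each warm Lambda instance holds its own cache, which is why an updated template can take up to the full TTL to reach every instance.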

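Cache invalidation: There's no explicit cache invalidation. After uploading a new template, wait up to 5 minutes for every warm Lambda instance to refresh its cache. For immediate updates, force new execution environments, for example by changing an environment variable on the function or deploying a new version. Simply invoking the function usually reuses a warm instance along with its cache.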


DynamoDB: Tracking and Suppression

SESMailEngine uses two DynamoDB tables with distinct purposes: EmailTracking for audit trails and Suppression for reputation protection.

EmailTracking Table

Purpose: Store every email request with status, timestamps, and metadata for compliance, debugging, and analytics.

Key design decisions:

On-demand billing: Email volume is unpredictable (marketing campaigns create spikes). On-demand billing eliminates capacity planning and scales automatically, with throttling possible only under unusually sharp spikes.

TTL for automatic cleanup: Records older than 90 days are automatically deleted via DynamoDB TTL. This reduces storage costs and supports data-retention obligations such as GDPR's storage limitation principle.

GSI for querying: Three Global Secondary Indexes enable common query patterns:

  • to-email-timestamp-index: Find all emails sent to a specific address in time window (for support debugging)
  • ses-message-id-index: Look up emails by SES message ID (for correlating SES events)
  • date-partition-index: Query emails by date range (for analytics and reporting)
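For the TTL-based cleanup described above, DynamoDB expects the expiry time as a Number attribute holding epoch seconds. A minimal sketch, assuming the TTL attribute is called ttl (the attribute name and the item shape are illustrative):

```python
from datetime import datetime, timedelta, timezone


def ttl_epoch_seconds(sent_at, retention_days=90):
    """Expiry time as epoch seconds, the format DynamoDB TTL expects."""
    return int((sent_at + timedelta(days=retention_days)).timestamp())


sent_at = datetime(2026, 1, 31, tzinfo=timezone.utc)
item = {"emailId": "example-id", "ttl": ttl_epoch_seconds(sent_at)}  # hypothetical item
```

TTL deletion is a background process, so expired items can remain readable for a while after the timestamp passes; queries that must respect the 90-day window should filter on the attribute as well.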

Suppression Table

Purpose: Store email addresses that have bounced, complained, or exceeded soft bounce thresholds. Prevent future sends to these addresses.
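The suppression rules can be sketched as a small state function. The threshold of 3 consecutive soft bounces is the default mentioned later in this article; the simplified event kinds ("bounce", "complaint", "delivery"), the "hard"/"soft" labels, and the choice to reset the counter on a successful delivery are assumptions for this example.

```python
SOFT_BOUNCE_LIMIT = 3  # consecutive soft bounces before suppressing


def next_suppression_state(event_kind, consecutive_soft_bounces, bounce_type=None):
    """Return (suppress, new_soft_bounce_count) for one feedback event."""
    if event_kind == "complaint":
        return True, consecutive_soft_bounces
    if event_kind == "bounce" and bounce_type == "hard":
        return True, consecutive_soft_bounces
    if event_kind == "bounce" and bounce_type == "soft":
        count = consecutive_soft_bounces + 1
        return count >= SOFT_BOUNCE_LIMIT, count
    if event_kind == "delivery":
        return False, 0  # assumed: a successful delivery breaks the streak
    return False, consecutive_soft_bounces
```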

Why a separate table?

Performance: Suppression checks occur on every email send. A dedicated table with a simple primary key (email address) enables fast lookups. Querying the EmailTracking table would require scanning or complex GSI queries.

Data lifecycle: Suppression records are permanent (or manually removed). EmailTracking records expire after 90 days. Mixing these lifecycles in one table complicates TTL configuration.

Access patterns: Suppression is write-once, read-many. EmailTracking is write-many, read-occasionally. Separate tables optimize for these different patterns.


SNS for SES Feedback: Why Not SQS?

SES publishes bounce, complaint, and delivery events to an SNS topic, which triggers the Feedback Processor Lambda. This design raises an obvious question: why SNS instead of SQS?

SNS fan-out for future extensibility: While SESMailEngine currently has one subscriber (Feedback Processor Lambda), SNS enables adding more subscribers without changing SES configuration. Need to forward bounces to a third-party analytics service? Add an SNS subscription. Want to archive all SES events to S3? Add another subscription. The SES configuration never changes.

Native SES integration: SES configuration sets natively support SNS as an event destination for bounce, complaint, and delivery notifications. SES cannot publish to SQS directly, so a queue would need an intermediary (an SNS subscription, a Lambda function, or an EventBridge rule), adding architectural complexity without meaningful benefit for this use case.

Simpler DLQ configuration: With SNS → Lambda, failed messages route to a DLQ configured on the SNS subscription. This keeps the failure handling close to the event source, making it easier to correlate failed SES feedback events with their original context.

Why we didn't need SQS after SNS:

SES feedback events (bounces, complaints, deliveries) arrive at a much lower rate than email sends. A typical system might send 10,000 emails/hour but receive only 50-100 feedback events/hour. This low volume doesn't require SQS buffering—Lambda can handle the load directly.

Additionally, feedback processing is idempotent (updating the same email status multiple times produces the same result). If Lambda fails and SNS retries, the worst case is a duplicate status update, which is harmless.
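Inside the Feedback Processor, each SNS record carries the SES event as a JSON string in its Message field. The sketch below unwraps it and extracts the event kind and SES message ID; it reads eventType (used by configuration set event publishing) and falls back to notificationType (used by identity-level notifications). The function name is hypothetical.

```python
import json


def parse_ses_feedback(lambda_event):
    """Return a list of (event kind, SES message ID) pairs from an SNS-triggered event."""
    results = []
    for record in lambda_event["Records"]:
        message = json.loads(record["Sns"]["Message"])
        kind = message.get("eventType") or message.get("notificationType")
        results.append((kind, message["mail"]["messageId"]))
    return results
```

The SES message ID is what the ses-message-id-index lookup would use to find the matching tracking record.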


Dead Letter Queues: Zero Data Loss Architecture

SESMailEngine implements three Dead Letter Queues (DLQs) to ensure no email request or feedback event is silently lost.

EventBridge DLQ

Trigger: EventBridge fails to deliver an email request to SQS (rare, usually due to SQS service issues or misconfigured permissions).

Action: Message is sent to EventBridge DLQ for manual inspection.

EmailQueue DLQ

Trigger: Email Sender Lambda fails to process a message after 3 attempts (template errors, unexpected exceptions, or persistent SES errors).

Action: Message is sent to EmailQueue DLQ. CloudWatch alarm triggers, notifying the admin team.

Why 3 attempts? This balances retry opportunities (transient errors often resolve within 2-3 attempts) with preventing infinite loops (persistent errors shouldn't retry forever).
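In SQS terms, the three-attempt limit is the maxReceiveCount of the queue's redrive policy. A minimal sketch of the queue attribute, using a placeholder DLQ ARN:

```python
import json


def redrive_attributes(dlq_arn, max_receive_count=3):
    """Queue attributes that move a message to the DLQ after max_receive_count receives."""
    return {
        "RedrivePolicy": json.dumps({
            "deadLetterTargetArn": dlq_arn,
            "maxReceiveCount": max_receive_count,
        })
    }


# Placeholder ARN for illustration only.
attrs = redrive_attributes("arn:aws:sqs:us-east-1:123456789012:email-queue-dlq")
# With boto3, these could be applied with:
#   boto3.client("sqs").set_queue_attributes(QueueUrl=..., Attributes=attrs)
```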

FeedbackProcessor DLQ

Trigger: Feedback Processor Lambda fails to process an SES event after SNS retries (rare, usually due to DynamoDB throttling or code bugs).

Action: Event is sent to FeedbackProcessor DLQ. CloudWatch alarm triggers.

DLQ monitoring: All three DLQs have CloudWatch alarms that trigger when message count > 0. Alarms send notifications to an SNS topic subscribed by the admin email. This ensures DLQ messages are investigated promptly, not discovered weeks later.


CloudWatch Alarms: Proactive Reputation Management

SESMailEngine implements seven CloudWatch alarms to detect issues before they impact sender reputation or email delivery.

SES Reputation Alarms

Bounce Rate Alarm: Triggers at 3% bounce rate (2 consecutive 1-hour periods). AWS reviews accounts at 5% and suspends at 10%. The 3% threshold provides early warning to clean email lists before AWS takes action.

Complaint Rate Alarm: Triggers at 0.05% complaint rate (2 consecutive 1-hour periods). AWS reviews at 0.1% and suspends at 0.5%. Early detection enables immediate investigation of spam complaints.

Why 2 consecutive periods? Single-period alarms create false positives from temporary spikes (e.g., one bad email batch). Two consecutive periods confirm a sustained problem requiring action.
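As a sketch, the two reputation alarms could be described with parameters like these for CloudWatch's put_metric_alarm. SES publishes Reputation.BounceRate and Reputation.ComplaintRate in the AWS/SES namespace as fractions, so 3% is 0.03 and 0.05% is 0.0005. The alarm names and the Average statistic are assumptions.

```python
def reputation_alarm(name, metric, threshold):
    """Parameters for a CloudWatch alarm over two consecutive 1-hour periods."""
    return {
        "AlarmName": name,                  # hypothetical alarm name
        "Namespace": "AWS/SES",
        "MetricName": metric,
        "Statistic": "Average",             # assumed statistic
        "Period": 3600,                     # 1 hour, in seconds
        "EvaluationPeriods": 2,             # two consecutive periods
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
    }


bounce_alarm = reputation_alarm("ses-bounce-rate", "Reputation.BounceRate", 0.03)
complaint_alarm = reputation_alarm("ses-complaint-rate", "Reputation.ComplaintRate", 0.0005)
# With boto3: boto3.client("cloudwatch").put_metric_alarm(**bounce_alarm)
```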

Lambda Error Alarms

Email Sender Errors: Triggers when Lambda errors exceed 50 per 5-minute period for 2 consecutive periods (10 minutes total).

Feedback Processor Errors: Same threshold as Email Sender.

Why 50 errors? This threshold ignores temporary burst throttling (common during campaign launches) while catching sustained problems. A marketing campaign sending 10,000 emails might trigger 10-20 throttling errors in the first minute—this won't alarm. But a code bug causing 50+ errors for 10+ minutes will alert the team.

DLQ Depth Alarms

All three DLQs have alarms that trigger when message count > 0. Any message in a DLQ indicates an unexpected failure requiring investigation.


Serverless Cost Optimization: Scale to Zero

SESMailEngine's serverless architecture delivers near-zero costs when idle and scales automatically under load.

Cost Breakdown (100,000 emails/month)

Service       Cost    Notes
SES           $9.70   $0.10 per 1,000 emails (97,000 after 3,000 free tier)
Lambda        $0.00   Within 1M request + 400,000 GB-sec free tier
DynamoDB      $0.00   Within 25 GB storage + 25 WCU/RCU free tier
EventBridge   $0.10   $1.00 per million events
S3            $0.00   Within 5 GB storage + 20,000 GET free tier
SQS           $0.00   Within 1M request free tier
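The two non-zero rows follow from simple arithmetic, reproduced here with the prices quoted in this article. The function names are made up for this example, and the EventBridge figure counts only the request events, one per email.

```python
def ses_cost(emails, free_tier=3000, price_per_thousand=0.10):
    """Monthly SES cost in dollars after the free tier."""
    billable = max(0, emails - free_tier)
    return round(billable / 1000 * price_per_thousand, 2)


def eventbridge_cost(events, price_per_million=1.00):
    """Monthly EventBridge cost in dollars for published events."""
    return round(events / 1_000_000 * price_per_million, 2)


monthly_emails = 100_000
total = round(ses_cost(monthly_emails) + eventbridge_cost(monthly_emails), 2)
print(ses_cost(monthly_emails), eventbridge_cost(monthly_emails), total)  # 9.7 0.1 9.8
```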

Scale to zero: When no emails are sent, only S3 storage ($0.023/GB/month) and DynamoDB storage ($0.25/GB/month) incur costs. A typical deployment with 8 starter templates and minimal tracking data costs less than $0.50/month at zero volume.

No idle costs: Unlike EC2-based email services or container-based solutions, there are no compute costs when idle. Lambda only charges for actual invocations. SQS only charges for message operations. EventBridge only charges for events published.


Alignment with AWS Best Practices

SESMailEngine's architecture follows AWS Well-Architected Framework principles and service-specific best practices.

Event-Driven Architecture Best Practices

Loose coupling: Services publish events without knowing consumers. Email infrastructure can be upgraded without touching producer code.

Asynchronous communication: EventBridge and SQS decouple request submission from processing, enabling independent scaling.

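Event schema validation: EventBridge rule patterns only route events that contain the required fields, so malformed requests never reach consumers. Because unmatched events are dropped without an error, the Email Sender Lambda still validates each request.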

Idempotency: Email Sender Lambda checks for duplicate emailId before processing, preventing duplicate sends.

Dead letter queues: All failure paths route to DLQs for investigation, ensuring zero data loss.

Observability: CloudWatch Logs, Metrics, and Alarms provide visibility into every stage of the pipeline.

SES Best Practices

Bounce and complaint handling: Feedback Processor automatically suppresses bounced and complained addresses, protecting sender reputation.

Suppression list management: Dedicated Suppression table prevents sends to problematic addresses.

Configuration sets: All emails use an SES configuration set that enables event tracking and reputation monitoring.

Reputation monitoring: CloudWatch alarms trigger at 3% bounce rate and 0.05% complaint rate, providing early warning before AWS review thresholds.

Soft bounce handling: Consecutive soft bounce detection (default: 3) suppresses addresses with persistent delivery issues.

Email validation: Compatible with AWS SES Email Validation (Auto Validation) for proactive address verification.

Serverless Best Practices

Right-sized Lambda functions: Email Sender uses 512MB memory (balances performance and cost). Feedback Processor uses 256MB (lower memory for simpler processing).

Concurrency control: MaximumConcurrency on the SQS event source mapping is sized to the SES sending rate to prevent throttling.

On-demand DynamoDB: Eliminates capacity planning and scales automatically, with throttling possible only under unusually sharp spikes.

S3 for static content: Templates stored in S3, not Lambda deployment packages, reducing function size and enabling independent updates.

CloudWatch Logs retention: Configurable log retention (default: 90 days) reduces CloudWatch storage costs.


Conclusion

This architecture demonstrates how event-driven patterns address common email infrastructure challenges. By combining EventBridge's routing flexibility, SQS's buffering reliability, Lambda's serverless scalability, and DynamoDB's performance, the system delivers production-ready email infrastructure that scales from zero to millions of emails per day.

Key architectural decisions—centralized suppression, decoupled templates, and controlled access interfaces—address the specific pain points teams face when building on AWS SES. And the serverless foundation ensures costs remain minimal at low volumes while scaling automatically under load.

Explore the implementation:

"The bugs were mine to fix, the architecture mine to design. AI helped me find the words—but the sleepless nights debugging SES bounces? Those were all me."
The SESMailEngine Team
