INTRODUCTION
A few months ago I built a serverless power outage reporting system for Nigeria using AWS SAM. Citizens could submit outage reports via API, the system stored them in DynamoDB, and SNS sent email alerts when a threshold was hit.
It worked. But the infrastructure lived in a single template.yaml file with no modularity, no CloudWatch dashboard, no Dead Letter Queue, and the alert threshold was hardcoded.
So I rebuilt it from scratch using Terraform modules. Same problem. Same architecture. Better infrastructure.
This is Part 2. If you missed Part 1, read it here first: How I Built a Serverless Power Outage Tracker for Nigeria on AWS.
The Architecture
The event-driven pipeline looks like this:
User
↓
API Gateway (HTTP API)
↓
Lambda Validator → SQS Dead Letter Queue (failed reports)
↓
SQS Queue
↓
Lambda Enricher → DynamoDB + SNS alert (threshold exceeded)
↓
Lambda Query ← GET /reports
↓
Lambda Aggregator ← daily summary (EventBridge scheduled)
Every component is serverless. No servers to manage, no idle compute costs, pay only when reports come in.
The Architecture Drawing
The Terraform Module Structure
Instead of one giant template I split everything into 5 independent modules:
terraform/
├── modules/
│   ├── api/            # API Gateway HTTP API + routes + Lambda permissions
│   ├── compute/        # IAM role, Lambda functions, SQS event source mapping
│   ├── messaging/      # SQS queue, Dead Letter Queue, SNS topic + subscription
│   ├── observability/  # CloudWatch log groups, alarms, dashboard
│   └── storage/        # DynamoDB table with GSI and TTL
└── environments/
    └── dev/            # Root module wiring everything together
Each module has one job. The messaging module doesn't know about Lambda. The compute module doesn't know about API Gateway. They communicate through outputs and inputs only.
Module 1 — Storage
The DynamoDB table stores every outage report with 3 key design decisions:
Partition key + sort key:
hash_key = "LGA" # Local Government Area e.g. Ikeja
range_key = "timestamp" # ISO 8601 timestamp
This lets you query all reports for a specific LGA efficiently.
Global Secondary Index for state-level queries:
global_secondary_index {
  name            = "StateIndex"
  hash_key        = "state"
  range_key       = "timestamp"
  projection_type = "ALL"
}
Query all outages in Lagos State without scanning the entire table.
TTL — auto-expire old records:
ttl {
  attribute_name = "expiry"
  enabled        = true
}
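Each item carries an expiry attribute holding a Unix epoch timestamp (in seconds) set 90 days after the report. Once that time passes, DynamoDB deletes the item in the background, typically within a few days. No manual cleanup, no growing storage costs.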
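DynamoDB TTL only works when the attribute holds a number of epoch seconds, so whichever Lambda writes the item has to compute it. Here is a minimal Python sketch of that calculation. The helper names are hypothetical and this is not the project's actual code; only the attribute names (LGA, state, timestamp, expiry) come from the table definition above.

```python
from datetime import datetime, timedelta, timezone

RETENTION_DAYS = 90  # matches the 90-day retention described above


def expiry_epoch(reported_at: datetime, retention_days: int = RETENTION_DAYS) -> int:
    """Return the Unix epoch second at which a report should expire."""
    return int((reported_at + timedelta(days=retention_days)).timestamp())


def build_item(lga: str, state: str, reported_at: datetime) -> dict:
    """Build a report item with the key attributes and the TTL attribute."""
    return {
        "LGA": lga,                              # partition key
        "timestamp": reported_at.isoformat(),    # sort key, ISO 8601
        "state": state,                          # GSI partition key
        "expiry": expiry_epoch(reported_at),     # TTL attribute, epoch seconds
    }
```

Passing a timezone-aware datetime (for example `datetime.now(timezone.utc)`) keeps the epoch value independent of the Lambda's local time settings.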
Module 2 — Messaging
Three resources with one important design decision — the Dead Letter Queue:
resource "aws_sqs_queue" "dlq" {
  name                      = "${var.project_name}-outage-dlq"
  message_retention_seconds = 1209600 # 14 days
}

resource "aws_sqs_queue" "outage_queue" {
  name                       = "${var.project_name}-outage-queue"
  receive_wait_time_seconds  = 10 # long polling
  visibility_timeout_seconds = 30
  redrive_policy = jsonencode({
    deadLetterTargetArn = aws_sqs_queue.dlq.arn
    maxReceiveCount     = 3
  })
}
Why the DLQ matters:
Without a DLQ, a message the enricher Lambda cannot process keeps returning to the queue and is retried until the queue's retention period runs out (4 days by default). Then SQS deletes it silently. You never know it failed.
With a DLQ, after 3 failed attempts the message moves to the DLQ where it stays for 14 days. You can investigate, fix the bug, and replay the message. No data loss.
Long polling (receive_wait_time_seconds = 10) reduces empty API calls to SQS: a receive call waits up to 10 seconds for a message to arrive instead of returning an empty response right away. Fewer API calls, lower cost.
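To make the redrive behaviour concrete, here is a small in-memory Python model of it. It is a simplification for illustration only: the queues are plain deques, the function name is hypothetical, and real SQS tracks the receive count and moves the message itself.

```python
from collections import deque

MAX_RECEIVE_COUNT = 3  # mirrors maxReceiveCount in the redrive policy


def drain(queue: deque, dlq: deque, handler) -> list:
    """Process every message, parking repeated failures in the DLQ.

    A message that fails MAX_RECEIVE_COUNT times is moved to dlq instead
    of being retried again.
    """
    processed = []
    while queue:
        message = queue.popleft()
        message["receive_count"] += 1
        try:
            handler(message["body"])
            processed.append(message["body"])
        except Exception:
            if message["receive_count"] >= MAX_RECEIVE_COUNT:
                dlq.append(message)    # kept for inspection and replay
            else:
                queue.append(message)  # becomes visible again for a retry
    return processed
```

The point of the model is the last branch: the failing message is never dropped, it just changes queues.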
Module 3 — Compute
Four Lambda functions, one IAM role, least privilege policies:
resource "aws_iam_role_policy" "lambda_sqs" {
  # name and role arguments omitted from this excerpt
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Action = [
        "sqs:SendMessage",
        "sqs:ReceiveMessage",
        "sqs:DeleteMessage",
        "sqs:GetQueueAttributes"
      ]
      # Scoped to the two queues only, never "*"
      Resource = [var.queue_arn, var.dlq_arn]
    }]
  })
}
Each policy only grants what's needed. Lambda can send to SQS but cannot delete tables. Lambda can write to DynamoDB but cannot access S3. Least privilege at every layer.
The SQS event source mapping connects the queue to the enricher Lambda automatically:
resource "aws_lambda_event_source_mapping" "sqs_enricher" {
  event_source_arn = var.queue_arn
  function_name    = aws_lambda_function.enricher.arn
  batch_size       = 10
}
When messages arrive in SQS, Lambda polls and processes them in batches of up to 10. No manual polling code needed.
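On the Lambda side, each invocation receives the batch under event["Records"], with the original message in each record's body field. A rough Python sketch of an enricher handler follows. The enrich step here is a hypothetical placeholder (it only normalises casing), not the project's real enrichment logic.

```python
import json


def enrich(report: dict) -> dict:
    """Hypothetical enrichment step: normalise the location fields."""
    return {
        **report,
        "lga": report["lga"].strip().title(),
        "state": report["state"].strip().title(),
    }


def process_batch(event: dict) -> list:
    """Parse and enrich every SQS record in the batch.

    Any exception propagates, so the whole batch returns to the queue
    and counts toward maxReceiveCount.
    """
    return [enrich(json.loads(record["body"])) for record in event["Records"]]


def handler(event, context=None):
    """Lambda entry point for the SQS event source mapping."""
    reports = process_batch(event)
    # Writing to DynamoDB and publishing to SNS would happen here.
    return {"processed": len(reports)}
```

Because one bad message fails the whole batch in this sketch, a single malformed report would drag up to nine good ones through the retry cycle with it. That is one more reason the DLQ matters.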
The alert threshold is a variable:
variable "alert_threshold" {
  type    = number
  default = 3
}
Change it in terraform.tfvars and redeploy. No code changes needed:
alert_threshold = 5 # require 5 reports before alerting
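For the variable to reach the function, Terraform has to hand it over somehow, and an environment variable is the usual route. The Python sketch below assumes an environment variable named ALERT_THRESHOLD and a greater-than-or-equal comparison. Both are assumptions for illustration; the project may name or compare things differently.

```python
import os

DEFAULT_THRESHOLD = 3  # same default as the Terraform variable


def alert_threshold(env=os.environ) -> int:
    """Read the threshold from the environment, falling back to the default."""
    return int(env.get("ALERT_THRESHOLD", DEFAULT_THRESHOLD))


def should_alert(report_count: int, threshold: int) -> bool:
    """Alert once the number of reports for an LGA reaches the threshold."""
    return report_count >= threshold
```

With this split, the Python never changes when the threshold does. Only the value injected at deploy time moves.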
Module 4 — API
HTTP API Gateway with two routes:
resource "aws_apigatewayv2_route" "post_reports" {
  route_key = "POST /reports"
  target    = "integrations/${aws_apigatewayv2_integration.validator.id}"
}

resource "aws_apigatewayv2_route" "get_reports" {
  route_key = "GET /reports"
  target    = "integrations/${aws_apigatewayv2_integration.query.id}"
}
POST /reports routes to the validator Lambda. GET /reports routes to the query Lambda. CORS is enabled so a frontend can call it directly from the browser.
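What the validator does with a POST can be sketched in a few lines of Python. The required fields are taken from the sample request in this post; the status codes, error messages, and the injectable enqueue callable (standing in for the SQS send) are assumptions of mine, not the project's actual code.

```python
import json
import uuid

REQUIRED_FIELDS = ("lga", "state", "reporter_name", "description")


def _response(status: int, payload: dict) -> dict:
    """Shape a response the way an HTTP API Lambda integration expects."""
    return {
        "statusCode": status,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(payload),
    }


def handler(event, context=None, enqueue=None):
    """Validate a report and hand it to the queue."""
    try:
        report = json.loads(event.get("body") or "")
    except json.JSONDecodeError:
        return _response(400, {"message": "Body must be valid JSON"})
    if not isinstance(report, dict):
        return _response(400, {"message": "Body must be a JSON object"})

    # A field counts as missing if it is absent, not a string, or blank.
    missing = [
        field for field in REQUIRED_FIELDS
        if not (isinstance(report.get(field), str) and report[field].strip())
    ]
    if missing:
        return _response(400, {"message": "Missing fields: " + ", ".join(missing)})

    report_id = str(uuid.uuid4())
    if enqueue is not None:
        enqueue({**report, "report_id": report_id})  # SQS send in the real thing
    return _response(200, {"message": "Outage report received", "report_id": report_id})
```

Rejecting bad input here, before it reaches SQS, keeps the DLQ reserved for genuine processing failures rather than malformed requests.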
Module 5 — Observability
This was missing entirely from the SAM version. Four CloudWatch log groups, alarms on Lambda errors, and a dashboard:
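resource "aws_cloudwatch_metric_alarm" "dlq_messages" {
  alarm_name        = "${var.project_name}-dlq-messages"
  alarm_description = "Messages are landing in the Dead Letter Queue"
  namespace         = "AWS/SQS"
  metric_name       = "ApproximateNumberOfMessagesVisible"
  threshold         = 0

  # Not shown in the original excerpt. comparison_operator and
  # evaluation_periods are required by the provider; the values for all
  # four settings below are assumptions.
  comparison_operator = "GreaterThanThreshold"
  statistic           = "Maximum"
  period              = 60
  evaluation_periods  = 1

  dimensions = {
    QueueName = "${var.project_name}-outage-dlq"
  }
}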
The DLQ alarm is the most important one. If even one message lands in the DLQ something is wrong. With a threshold of 0 and a greater-than comparison, the alarm fires as soon as a single message is visible in the DLQ.
The Proof
Submit an outage report:
curl -X POST https://your-api-id.execute-api.eu-west-1.amazonaws.com/reports \
  -H "Content-Type: application/json" \
  -d '{
    "lga": "Ikeja",
    "state": "Lagos",
    "reporter_name": "Emmanuel Ulu",
    "description": "Power outage on Allen Avenue since 6am"
  }'
{"message": "Outage report received", "report_id": "6765dd6f-cbbd-4900-a607-b542e5720487"}
Query reports for Ikeja:
curl "https://your-api-id.execute-api.eu-west-1.amazonaws.com/reports?lga=Ikeja"
{"count": 4, "reports": [...]}
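Behind that GET, the query Lambda has to turn the query string into a DynamoDB key condition. A Python sketch of how that mapping might look is below. It builds the keyword arguments for a low-level DynamoDB Query call; the table name is hypothetical, and the state parameter is my assumption based on the StateIndex GSI (only ?lga= appears in this post).

```python
TABLE_NAME = "outage-reports"  # hypothetical table name


def build_query(params: dict) -> dict:
    """Translate query-string parameters into DynamoDB Query arguments."""
    if params.get("lga"):
        # Base table: LGA is the partition key.
        return {
            "TableName": TABLE_NAME,
            "KeyConditionExpression": "LGA = :lga",
            "ExpressionAttributeValues": {":lga": {"S": params["lga"]}},
            "ScanIndexForward": False,  # newest timestamps first
        }
    if params.get("state"):
        # GSI: the attribute name is aliased to stay clear of DynamoDB's
        # reserved-word list.
        return {
            "TableName": TABLE_NAME,
            "IndexName": "StateIndex",
            "KeyConditionExpression": "#st = :state",
            "ExpressionAttributeNames": {"#st": "state"},
            "ExpressionAttributeValues": {":state": {"S": params["state"]}},
            "ScanIndexForward": False,
        }
    raise ValueError("Provide an lga or state query parameter")
```

Either branch is a Query, not a Scan, which is the whole reason for picking LGA as the partition key and adding the GSI.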
After the 3rd report from the same LGA an email alert fires:
Power outage alert for Ikeja, Lagos.
4 reports received today.
Latest report: Generator running low on fuel, still no NEPA
Reported by: Bola
Time: 2026-06-27T21:04:45
SAM vs Terraform — What Actually Changed
| Feature | SAM Version | Terraform Version |
|---|---|---|
| IaC tool | AWS SAM | Terraform |
| State management | CloudFormation | S3 backend with native locking |
| Module structure | Single template | 5 independent modules |
| Dead Letter Queue | No | Yes |
| CloudWatch dashboard | No | Yes |
| Alert threshold | Hardcoded | Configurable variable |
| Observability | Basic logs | Log groups, alarms, dashboard |
Key Lessons
The DLQ is not optional
Without it you have no idea when messages fail. Every production SQS queue needs a DLQ.
Long polling saves money
receive_wait_time_seconds = 10 cuts empty SQS API calls significantly. Small change, real savings at scale.
TTL is free cleanup
DynamoDB TTL automatically removes expired items. No Lambda scheduled to delete old records, no growing storage costs.
Modules make variables powerful
The alert threshold is a number in terraform.tfvars. Change it, run terraform apply, done. In the SAM version you would have to edit Python code, redeploy, and hope nothing broke.
Observability is infrastructure
CloudWatch dashboards and alarms should be provisioned with the same code that provisions the app. Not added later when something breaks.
What's Next
- Add EventBridge scheduled trigger for the daily aggregator summary
- Add X-Ray tracing across the full pipeline
- Build a simple frontend on S3 and CloudFront to visualize outages on a map
- Add VPC endpoints so Lambda never touches the public internet
Screenshots
Resources
GitHub: nigeria-outage-tracker
Part 1: How I Built a Serverless Power Outage Tracker for Nigeria on AWS
Terraform AWS Provider docs: registry.terraform.io/providers/hashicorp/aws




