Emmanuel Ulu

From SAM to Terraform: Rebuilding My Nigeria Power Outage Tracker with Modules

Introduction

A few months ago I built a serverless power outage reporting system for Nigeria using AWS SAM. Citizens could submit outage reports via API, the system stored them in DynamoDB, and SNS sent email alerts when a threshold was hit.
It worked. But the infrastructure lived in a single template.yaml file with no modularity, no CloudWatch dashboard, no Dead Letter Queue, and the alert threshold was hardcoded.
So I rebuilt it from scratch using Terraform modules. Same problem. Same architecture. Better infrastructure.
This is Part 2. If you missed Part 1, read it here first: How I Built a Serverless Power Outage Tracker for Nigeria on AWS.

The Architecture

The event-driven pipeline looks like this:

User
  ↓
API Gateway (HTTP API)
  ↓
Lambda Validator    → SQS Dead Letter Queue (failed reports)
  ↓
SQS Queue
  ↓
Lambda Enricher     → DynamoDB + SNS alert (threshold exceeded)
  ↓
Lambda Query        ← GET /reports
  ↓
Lambda Aggregator   ← daily summary (EventBridge scheduled)

Every component is serverless. No servers to manage, no idle compute costs, pay only when reports come in.

The Architecture Drawing

[Image: architecture drawing]

The Terraform Module Structure

Instead of one giant template, I split everything into 5 independent modules:

terraform/
├── modules/
│   ├── api/           # API Gateway HTTP API + routes + Lambda permissions
│   ├── compute/       # IAM role, Lambda functions, SQS event source mapping
│   ├── messaging/     # SQS queue, Dead Letter Queue, SNS topic + subscription
│   ├── observability/ # CloudWatch log groups, alarms, dashboard
│   └── storage/       # DynamoDB table with GSI and TTL
└── environments/
    └── dev/           # Root module wiring everything together

Each module has one job. The messaging module doesn't know about Lambda. The compute module doesn't know about API Gateway. They communicate through outputs and inputs only.

Module 1 — Storage

The DynamoDB table stores every outage report with 3 key design decisions:
Partition key + sort key:

hash_key  = "LGA"        # Local Government Area e.g. Ikeja
range_key = "timestamp"  # ISO 8601 timestamp

This lets you query all reports for a specific LGA efficiently.
Global Secondary Index for state-level queries:

global_secondary_index {
  name            = "StateIndex"
  hash_key        = "state"
  range_key       = "timestamp"
  projection_type = "ALL"
}

Query all outages in Lagos State without scanning the entire table.
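
As a rough sketch of how the query Lambda might use these keys, the helper below builds the arguments for a low-level DynamoDB Query call against either the table or the StateIndex GSI. The table name `outage-reports` and the function `build_query` are hypothetical; only the key names and the index name come from the Terraform above. Expression attribute names are used because `timestamp` is a DynamoDB reserved word.

```python
from typing import Any, Dict, Optional


def build_query(table_name: str, key_name: str, key_value: str,
                since: Optional[str] = None,
                index_name: Optional[str] = None) -> Dict[str, Any]:
    """Build keyword arguments for a low-level DynamoDB Query call."""
    names = {"#pk": key_name}
    values: Dict[str, Any] = {":pk": {"S": key_value}}
    condition = "#pk = :pk"
    if since is not None:
        # ISO 8601 strings in one consistent format sort chronologically,
        # so a string comparison on the sort key acts as a time filter.
        names["#ts"] = "timestamp"
        values[":since"] = {"S": since}
        condition += " AND #ts >= :since"
    params: Dict[str, Any] = {
        "TableName": table_name,
        "KeyConditionExpression": condition,
        "ExpressionAttributeNames": names,
        "ExpressionAttributeValues": values,
    }
    if index_name is not None:
        params["IndexName"] = index_name  # query the GSI instead of the table
    return params


# "outage-reports" is a placeholder table name, not the project's real one.
by_lga = build_query("outage-reports", "LGA", "Ikeja", since="2026-06-27T00:00:00")
by_state = build_query("outage-reports", "state", "Lagos", index_name="StateIndex")
```

Either dictionary could then be passed to boto3 as `boto3.client("dynamodb").query(**by_lga)`.
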
TTL — auto-expire old records:

ttl {
  attribute_name = "expiry"
  enabled        = true
}

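Each record carries an expiry timestamp set 90 days after submission, and DynamoDB deletes it in the background once that time has passed (typically within a few days of expiry, not to the second). No manual cleanup, no growing storage costs.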
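
DynamoDB TTL only acts on a Number attribute holding a Unix epoch time in seconds, so the Lambda that writes the record has to compute it. A minimal sketch, assuming the writer sets `expiry` at insert time; the helper `expiry_epoch` is hypothetical and not taken from the project code:

```python
from datetime import datetime, timedelta, timezone

RETENTION_DAYS = 90  # retention period stated in the article


def expiry_epoch(created_at: datetime, days: int = RETENTION_DAYS) -> int:
    """Return the TTL value as whole epoch seconds, the format DynamoDB TTL expects."""
    if created_at.tzinfo is None:
        # Treat naive datetimes as UTC so the result does not depend on the host timezone.
        created_at = created_at.replace(tzinfo=timezone.utc)
    return int((created_at + timedelta(days=days)).timestamp())


created = datetime(2026, 6, 27, 21, 4, 45, tzinfo=timezone.utc)
item = {
    "LGA": "Ikeja",
    "timestamp": created.isoformat(),
    "expiry": expiry_epoch(created),  # must match attribute_name in the ttl block
}
```
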

Module 2 — Messaging

Three resources with one important design decision — the Dead Letter Queue:

resource "aws_sqs_queue" "dlq" {
  name                      = "${var.project_name}-outage-dlq"
  message_retention_seconds = 1209600  # 14 days
}

resource "aws_sqs_queue" "outage_queue" {
  name                       = "${var.project_name}-outage-queue"
  receive_wait_time_seconds  = 10      # long polling
  visibility_timeout_seconds = 30

  redrive_policy = jsonencode({
    deadLetterTargetArn = aws_sqs_queue.dlq.arn
    maxReceiveCount     = 3
  })
}

Why the DLQ matters:
Without a DLQ, a message the enricher Lambda cannot process keeps returning to the queue and is retried until the queue's retention period runs out (4 days by default). Then SQS deletes it silently. You never know it failed.
With a DLQ, after 3 failed receives (maxReceiveCount = 3) the message moves to the DLQ, where it stays for 14 days. You can investigate, fix the bug, and replay the message. No data loss.
Long polling (receive_wait_time_seconds = 10) reduces empty responses from SQS: a receive call waits up to 10 seconds for a message to arrive instead of returning empty straight away. Fewer API calls, lower cost.
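
The replay step mentioned above could look roughly like this. It is a sketch, not the project's tooling: `replay_dlq` is a hypothetical helper, the queue URLs are supplied by the caller, and `sqs` is assumed to be a boto3 SQS client (only `receive_message`, `send_message`, and `delete_message` are used).

```python
from typing import Any


def replay_dlq(sqs: Any, dlq_url: str, queue_url: str, max_batches: int = 10) -> int:
    """Move messages from the DLQ back to the main queue; return how many moved."""
    moved = 0
    for _ in range(max_batches):
        response = sqs.receive_message(
            QueueUrl=dlq_url,
            MaxNumberOfMessages=10,  # SQS maximum per receive call
            WaitTimeSeconds=1,
        )
        messages = response.get("Messages", [])
        if not messages:
            break  # DLQ is empty (or nothing arrived within the wait time)
        for message in messages:
            # Re-send first, delete second: a crash in between duplicates a
            # message rather than losing it.
            sqs.send_message(QueueUrl=queue_url, MessageBody=message["Body"])
            sqs.delete_message(QueueUrl=dlq_url, ReceiptHandle=message["ReceiptHandle"])
            moved += 1
    return moved
```

Usage would be along the lines of `replay_dlq(boto3.client("sqs"), dlq_url, queue_url)`, run only after the bug that caused the failures has been fixed; otherwise the same messages land back in the DLQ.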

Module 3 — Compute

Four Lambda functions, one IAM role, least privilege policies:

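resource "aws_iam_role_policy" "lambda_sqs" {
  role = aws_iam_role.lambda.id  # role resource name assumed; not shown in the original snippet

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Action = [
        "sqs:SendMessage",
        "sqs:ReceiveMessage",
        "sqs:DeleteMessage",
        "sqs:GetQueueAttributes"
      ]
      # Scoped to the two project queues only
      Resource = [var.queue_arn, var.dlq_arn]
    }]
  })
}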

Each policy only grants what's needed. Lambda can send to SQS but cannot delete tables. Lambda can write to DynamoDB but cannot access S3. Least privilege at every layer.
The SQS event source mapping connects the queue to the enricher Lambda automatically:

resource "aws_lambda_event_source_mapping" "sqs_enricher" {
  event_source_arn = var.queue_arn
  function_name    = aws_lambda_function.enricher.arn
  batch_size       = 10
}

When messages arrive in SQS, Lambda polls and processes them in batches of up to 10. No manual polling code needed.
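
To make the batch shape concrete, here is a minimal sketch of what the enricher's handler might look like. The `enrich` function and the exact fields it keeps are assumptions for illustration; the only thing taken from AWS is the event shape, where each entry in `event["Records"]` carries the message text in `body`.

```python
import json
from typing import Any, Dict, List


def enrich(report: Dict[str, Any]) -> Dict[str, str]:
    """Hypothetical enrichment step: normalise the fields used as table keys."""
    return {
        "LGA": report["lga"].strip().title(),
        "state": report["state"].strip().title(),
        "description": report["description"].strip(),
    }


def handler(event: Dict[str, Any], context: Any) -> Dict[str, int]:
    """Process one SQS batch; any exception sends the whole batch back to the queue."""
    enriched: List[Dict[str, str]] = []
    for record in event["Records"]:
        report = json.loads(record["body"])  # SQS delivers the message body as a string
        enriched.append(enrich(report))
    # A real enricher would write `enriched` to DynamoDB here.
    return {"processed": len(enriched)}


sample_event = {"Records": [
    {"messageId": "1", "body": json.dumps(
        {"lga": " ikeja", "state": "lagos", "description": "No power since 6am"})},
]}
result = handler(sample_event, None)
```

With the event source mapping as written, an exception from the handler means none of the batch is deleted: the messages become visible again after the visibility timeout, and each redelivery counts toward maxReceiveCount.
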
The alert threshold is a variable:

variable "alert_threshold" {
  type    = number
  default = 3
}

Change it in terraform.tfvars and redeploy. No code changes needed:

alert_threshold = 5  # require 5 reports before alerting
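
On the Lambda side, one plausible way to pick the value up is through an environment variable set by the compute module. The sketch below assumes a variable named `ALERT_THRESHOLD` and assumes the alert rule is "count reaches the threshold"; neither detail is confirmed by the article, and both function names are hypothetical.

```python
import os
from typing import Mapping


def load_threshold(env: Mapping[str, str] = os.environ, default: int = 3) -> int:
    """Read the alert threshold from the environment, falling back to the default."""
    raw = env.get("ALERT_THRESHOLD")  # hypothetical variable name
    if raw is None:
        return default
    value = int(raw)
    if value < 1:
        raise ValueError("ALERT_THRESHOLD must be at least 1")
    return value


def should_alert(report_count: int, threshold: int) -> bool:
    """Assumed rule: alert once the count for an LGA reaches the threshold."""
    return report_count >= threshold
```

Because the handler reads the number at runtime, changing terraform.tfvars and applying updates the function configuration without touching the Python source.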

Module 4 — API

HTTP API Gateway with two routes:

resource "aws_apigatewayv2_route" "post_reports" {
  route_key = "POST /reports"
  target    = "integrations/${aws_apigatewayv2_integration.validator.id}"
}

resource "aws_apigatewayv2_route" "get_reports" {
  route_key = "GET /reports"
  target    = "integrations/${aws_apigatewayv2_integration.query.id}"
}

POST /reports routes to the validator Lambda. GET /reports routes to the query Lambda. CORS is enabled so a frontend can call it directly from the browser.
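
For a sense of what sits behind POST /reports, here is a minimal validator sketch. It assumes the four fields shown in the sample request later in this post are all required and that the handler answers with 202 on success; the real validator's rules and status codes may differ. The event and response shapes follow the HTTP API Lambda proxy format, where the request body arrives as a string.

```python
import json
import uuid
from typing import Any, Dict, List

REQUIRED_FIELDS = ("lga", "state", "reporter_name", "description")


def validation_errors(payload: Any) -> List[str]:
    """Return a list of problems; an empty list means the report is acceptable."""
    if not isinstance(payload, dict):
        return ["body must be a JSON object"]
    return [
        f"{field} is required"
        for field in REQUIRED_FIELDS
        if not isinstance(payload.get(field), str) or not payload[field].strip()
    ]


def _response(status: int, body: Dict[str, Any]) -> Dict[str, Any]:
    return {
        "statusCode": status,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(body),
    }


def handler(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
    try:
        payload = json.loads(event.get("body") or "")
    except json.JSONDecodeError:
        return _response(400, {"errors": ["body must be valid JSON"]})
    errors = validation_errors(payload)
    if errors:
        return _response(400, {"errors": errors})
    report_id = str(uuid.uuid4())
    # A real validator would send the report to the SQS queue here.
    return _response(202, {"message": "Outage report received", "report_id": report_id})
```

Rejecting bad input at this stage keeps malformed reports out of the queue, so the enricher only retries on genuine processing failures.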

Module 5 — Observability

This was missing entirely from the SAM version. Four CloudWatch log groups, alarms on Lambda errors, and a dashboard:

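resource "aws_cloudwatch_metric_alarm" "dlq_messages" {
  alarm_name          = "${var.project_name}-dlq-messages"
  alarm_description   = "Messages are landing in the Dead Letter Queue"
  namespace           = "AWS/SQS"
  metric_name         = "ApproximateNumberOfMessagesVisible"
  statistic           = "Maximum"               # assumed; not shown in the original snippet
  period              = 300                     # seconds; assumed
  evaluation_periods  = 1                       # assumed
  comparison_operator = "GreaterThanThreshold"  # fires when the count is above 0
  threshold           = 0

  dimensions = {
    QueueName = "${var.project_name}-outage-dlq"
  }
}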

The DLQ alarm is the most important one. If even one message lands in the DLQ, something is wrong. A threshold of 0 means a single visible message is enough to trigger the alarm at its next evaluation.

The Proof

Submit an outage report:

curl -X POST https://your-api-id.execute-api.eu-west-1.amazonaws.com/reports \
  -H "Content-Type: application/json" \
  -d '{
    "lga": "Ikeja",
    "state": "Lagos",
    "reporter_name": "Emmanuel Ulu",
    "description": "Power outage on Allen Avenue since 6am"
  }'

{"message": "Outage report received", "report_id": "6765dd6f-cbbd-4900-a607-b542e5720487"}

Query reports for Ikeja:

curl "https://your-api-id.execute-api.eu-west-1.amazonaws.com/reports?lga=Ikeja"

{"count": 4, "reports": [...]}

Once an LGA has at least as many reports as the alert threshold (3 by default), an email alert fires. This one arrived after the 4th report:

Power outage alert for Ikeja, Lagos.
4 reports received today.
Latest report: Generator running low on fuel, still no NEPA
Reported by: Bola
Time: 2026-06-27T21:04:45

SAM vs Terraform — What Actually Changed

| Feature | SAM Version | Terraform Version |
| --- | --- | --- |
| IaC tool | AWS SAM | Terraform |
| State management | CloudFormation | S3 backend with native locking |
| Module structure | Single template | 5 independent modules |
| Dead Letter Queue | No | Yes |
| CloudWatch dashboard | No | Yes |
| Alert threshold | Hardcoded | Configurable variable |
| Observability | Basic logs | Log groups, alarms, dashboard |

Key Lessons

  1. The DLQ is not optional
    Without it you have no idea when messages fail. Every production SQS queue needs a DLQ.

  2. Long polling saves money
    receive_wait_time_seconds = 10 cuts empty SQS API calls significantly. Small change, real savings at scale.

  3. TTL is free cleanup
    DynamoDB TTL automatically removes expired items. No Lambda scheduled to delete old records, no growing storage costs.

  4. Modules make variables powerful
    The alert threshold is a number in terraform.tfvars. Change it, run terraform apply, done. In the SAM version you would have to edit Python code, redeploy, and hope nothing broke.

  5. Observability is infrastructure
    CloudWatch dashboards and alarms should be provisioned with the same code that provisions the app. Not added later when something breaks.

What's Next

  • Add EventBridge scheduled trigger for the daily aggregator summary
  • Add X-Ray tracing across the full pipeline
  • Build a simple frontend on S3 and CloudFront to visualize outages on a map
  • Add VPC endpoints so Lambda never touches the public internet

Resources

GitHub: nigeria-outage-tracker
Part 1: How I Built a Serverless Power Outage Tracker for Nigeria on AWS
Terraform AWS Provider docs: registry.terraform.io/providers/hashicorp/aws
