Emmanuel Ulu

Posted on Jul 1

From SAM to Terraform: Rebuilding My Nigeria Power Outage Tracker with Modules

#aws #terraform #serverless #devops

Introduction

A few months ago I built a serverless power outage reporting system for Nigeria using AWS SAM. Citizens could submit outage reports via API, the system stored them in DynamoDB, and SNS sent email alerts when a threshold was hit.
It worked. But the infrastructure lived in a single template.yaml file with no modularity, no CloudWatch dashboard, no Dead Letter Queue, and a hardcoded alert threshold.
So I rebuilt it from scratch using Terraform modules. Same problem. Same architecture. Better infrastructure.
This is Part 2. If you missed Part 1, read it here first: How I Built a Serverless Power Outage Tracker for Nigeria on AWS.

The Architecture

The event-driven pipeline looks like this:

User
  ↓
API Gateway (HTTP API)
  ↓
Lambda Validator    → SQS Dead Letter Queue (failed reports)
  ↓
SQS Queue
  ↓
Lambda Enricher     → DynamoDB + SNS alert (threshold exceeded)
  ↓
Lambda Query        ← GET /reports
  ↓
Lambda Aggregator   ← daily summary (EventBridge scheduled)

Every component is serverless. No servers to manage, no idle compute costs, pay only when reports come in.

[Figure: architecture drawing]

The Terraform Module Structure

Instead of one giant template I split everything into 5 independent modules:

terraform/
├── modules/
│   ├── api/           # API Gateway HTTP API + routes + Lambda permissions
│   ├── compute/       # IAM role, Lambda functions, SQS event source mapping
│   ├── messaging/     # SQS queue, Dead Letter Queue, SNS topic + subscription
│   ├── observability/ # CloudWatch log groups, alarms, dashboard
│   └── storage/       # DynamoDB table with GSI and TTL
└── environments/
    └── dev/           # Root module wiring everything together

Each module has one job. The messaging module doesn't know about Lambda. The compute module doesn't know about API Gateway. They communicate through outputs and inputs only.
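
As a minimal sketch of that contract, the messaging module could expose its ARNs as outputs and the root module in environments/dev could hand them to compute. The input names queue_arn and dlq_arn match the variables the compute module uses later in this post. The output names, the file names, and the project_name input on each module are assumptions, and each module is assumed to declare matching variable blocks.

```hcl
# modules/messaging/outputs.tf (assumed file name)
output "queue_arn" {
  description = "ARN of the main outage queue"
  value       = aws_sqs_queue.outage_queue.arn
}

output "dlq_arn" {
  description = "ARN of the Dead Letter Queue"
  value       = aws_sqs_queue.dlq.arn
}

# environments/dev/main.tf (assumed file name)
module "messaging" {
  source       = "../../modules/messaging"
  project_name = var.project_name
}

module "compute" {
  source       = "../../modules/compute"
  project_name = var.project_name

  # Outputs from messaging become inputs to compute; neither module
  # references the other's resources directly.
  queue_arn = module.messaging.queue_arn
  dlq_arn   = module.messaging.dlq_arn

  # Other inputs (table name, SNS topic ARN, alert_threshold) would be
  # wired the same way and are left out of this sketch.
}
```

Because the only coupling is the pair of ARNs, the messaging module can be changed or tested on its own as long as its outputs keep the same names.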

Module 1 — Storage

The DynamoDB table stores every outage report. Three design decisions shape it.
Partition key + sort key:

hash_key  = "LGA"        # Local Government Area e.g. Ikeja
range_key = "timestamp"  # ISO 8601 timestamp

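This lets you query all reports for a specific LGA efficiently. Because the sort key is an ISO 8601 timestamp, results come back in time order and can be narrowed to a date range.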
Global Secondary Index for state-level queries:

global_secondary_index {
  name            = "StateIndex"
  hash_key        = "state"
  range_key       = "timestamp"
  projection_type = "ALL"
}

Query all outages in Lagos State without scanning the entire table.
TTL — auto-expire old records:

ttl {
  attribute_name = "expiry"
  enabled        = true
}

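Records expire automatically after 90 days. TTL itself has no duration setting: each record carries an expiry attribute holding a Unix epoch time in seconds, set 90 days after the report is written, and DynamoDB deletes the item in the background once that time has passed (typically within a few days, not instantly). No manual cleanup, no growing storage costs.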
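
Put together, the three fragments could sit in one table resource along these lines. This is a sketch rather than the module's actual code: the resource label, table name, and on-demand billing mode are assumptions. The part worth noticing is that only key attributes are declared, while the TTL attribute is not.

```hcl
resource "aws_dynamodb_table" "reports" {
  name         = "${var.project_name}-outage-reports" # assumed table name
  billing_mode = "PAY_PER_REQUEST"                    # assumed: on-demand capacity

  hash_key  = "LGA"
  range_key = "timestamp"

  # Only attributes used in a key schema (table or index) are declared.
  attribute {
    name = "LGA"
    type = "S"
  }

  attribute {
    name = "timestamp"
    type = "S" # ISO 8601 string
  }

  attribute {
    name = "state"
    type = "S"
  }

  global_secondary_index {
    name            = "StateIndex"
    hash_key        = "state"
    range_key       = "timestamp"
    projection_type = "ALL"
  }

  # "expiry" is not declared above. It only has to exist on each item as
  # a Number holding a Unix epoch time in seconds.
  ttl {
    attribute_name = "expiry"
    enabled        = true
  }
}
```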

Module 2 — Messaging

Three resources with one important design decision — the Dead Letter Queue:

resource "aws_sqs_queue" "dlq" {
  name                      = "${var.project_name}-outage-dlq"
  message_retention_seconds = 1209600  # 14 days
}

resource "aws_sqs_queue" "outage_queue" {
  name                       = "${var.project_name}-outage-queue"
  receive_wait_time_seconds  = 10      # long polling
  visibility_timeout_seconds = 30      # must not be shorter than the enricher Lambda's timeout

  redrive_policy = jsonencode({
    deadLetterTargetArn = aws_sqs_queue.dlq.arn
    maxReceiveCount     = 3
  })
}

Why the DLQ matters:
Without a DLQ, a message the enricher Lambda keeps failing on is retried until the queue's retention period runs out (4 days by default), and then SQS deletes it silently. You never find out it failed.
With a DLQ, once a message has been received 3 times without being deleted (maxReceiveCount = 3), SQS moves it to the DLQ, where it stays for up to 14 days. You can investigate, fix the bug, and replay the message instead of losing it.
Long polling (receive_wait_time_seconds = 10) reduces empty responses from SQS: a receive call waits up to 10 seconds for a message to arrive instead of returning empty straight away. Fewer API calls, lower cost.

Module 3 — Compute

Four Lambda functions, one IAM role, least privilege policies:

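resource "aws_iam_role_policy" "lambda_sqs" {
  name = "${var.project_name}-lambda-sqs" # policy name assumed
  role = aws_iam_role.lambda.id           # role resource name assumed

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Action = [
        "sqs:SendMessage",
        "sqs:ReceiveMessage",
        "sqs:DeleteMessage",
        "sqs:GetQueueAttributes"
      ]
      Resource = [var.queue_arn, var.dlq_arn]
    }]
  })
}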

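Each policy grants only the actions the functions need, on named resources rather than a wildcard. Lambda can send to SQS but cannot delete tables. Lambda can write to DynamoDB but cannot access S3. One caveat: because all four functions share a single role, each function holds the union of these permissions, so a role per function would be tighter still.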
The SQS event source mapping connects the queue to the enricher Lambda automatically:

resource "aws_lambda_event_source_mapping" "sqs_enricher" {
  event_source_arn = var.queue_arn
  function_name    = aws_lambda_function.enricher.arn
  batch_size       = 10
}

When messages arrive in SQS, Lambda polls and processes them in batches of up to 10. No manual polling code needed.
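
One consequence of batching is worth spelling out: if the handler raises an error, Lambda treats the whole batch as failed, so one bad report can drag up to nine good ones toward the DLQ. A possible refinement, not part of the configuration shown in this post, is to turn on partial batch responses. The sketch below is the same mapping with one extra argument. It assumes the enricher handler is changed to return a batchItemFailures list naming each record it could not process.

```hcl
resource "aws_lambda_event_source_mapping" "sqs_enricher" {
  event_source_arn = var.queue_arn
  function_name    = aws_lambda_function.enricher.arn
  batch_size       = 10

  # Assumption: the handler returns a response shaped like
  #   {"batchItemFailures": [{"itemIdentifier": "<messageId>"}]}
  # Only the listed messages become visible again and count toward
  # maxReceiveCount; the rest of the batch is deleted from the queue.
  function_response_types = ["ReportBatchItemFailures"]
}
```

An empty batchItemFailures list means the whole batch succeeded, so the handler has to report failures explicitly for this setting to help.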
The alert threshold is a variable:

variable "alert_threshold" {
  type    = number
  default = 3
}

Change it in terraform.tfvars and redeploy. No code changes needed:

alert_threshold = 5  # require 5 reports before alerting
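
The post does not show how the value reaches the Python code. A common pattern, sketched here as an assumption, is to pass it to the enricher as an environment variable and read it at runtime. The variable block is the one above with a description and a validation rule added. The function name, handler, runtime, zip path variable, IAM role resource name, and the ALERT_THRESHOLD key are all hypothetical.

```hcl
variable "alert_threshold" {
  type        = number
  default     = 3
  description = "Reports per LGA required before an alert is sent"

  validation {
    condition     = var.alert_threshold >= 1 && floor(var.alert_threshold) == var.alert_threshold
    error_message = "alert_threshold must be a whole number of at least 1."
  }
}

variable "enricher_zip_path" {
  type        = string
  description = "Path to the packaged enricher code (hypothetical input)"
}

resource "aws_lambda_function" "enricher" {
  function_name    = "${var.project_name}-enricher" # assumed name
  role             = aws_iam_role.lambda.arn        # assumed role resource name
  runtime          = "python3.12"                   # assumed runtime
  handler          = "enricher.handler"             # assumed handler
  filename         = var.enricher_zip_path
  source_code_hash = filebase64sha256(var.enricher_zip_path)

  environment {
    variables = {
      # Environment variable values are strings; the handler would parse
      # this with int(os.environ["ALERT_THRESHOLD"]).
      ALERT_THRESHOLD = tostring(var.alert_threshold)
    }
  }
}
```

With the validation rule in place, a typo such as alert_threshold = 0 fails at plan time instead of producing an alert on every report.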

Module 4 — API

HTTP API Gateway with two routes:

resource "aws_apigatewayv2_route" "post_reports" {
  api_id    = aws_apigatewayv2_api.http.id # API resource name assumed
  route_key = "POST /reports"
  target    = "integrations/${aws_apigatewayv2_integration.validator.id}"
}

resource "aws_apigatewayv2_route" "get_reports" {
  api_id    = aws_apigatewayv2_api.http.id # API resource name assumed
  route_key = "GET /reports"
  target    = "integrations/${aws_apigatewayv2_integration.query.id}"
}

POST /reports routes to the validator Lambda. GET /reports routes to the query Lambda. CORS is enabled so a frontend can call it directly from the browser.
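
The routes depend on a few resources that are not shown: the API itself with its CORS settings, a stage, an integration per Lambda, and a permission that lets API Gateway invoke each function. A minimal sketch for the validator side follows. The resource labels, the wide-open allow_origins value, and the two input variables (which would come from the compute module's outputs) are assumptions, not the module's actual code.

```hcl
variable "validator_invoke_arn" {
  type        = string
  description = "Invoke ARN of the validator Lambda (hypothetical input)"
}

variable "validator_function_name" {
  type        = string
  description = "Name of the validator Lambda (hypothetical input)"
}

resource "aws_apigatewayv2_api" "http" {
  name          = "${var.project_name}-api" # assumed name
  protocol_type = "HTTP"

  cors_configuration {
    allow_origins = ["*"] # assumed; restrict to the frontend's origin in practice
    allow_methods = ["GET", "POST", "OPTIONS"]
    allow_headers = ["content-type"]
  }
}

# The $default stage serves requests at the base URL with no stage prefix.
resource "aws_apigatewayv2_stage" "default" {
  api_id      = aws_apigatewayv2_api.http.id
  name        = "$default"
  auto_deploy = true
}

resource "aws_apigatewayv2_integration" "validator" {
  api_id                 = aws_apigatewayv2_api.http.id
  integration_type       = "AWS_PROXY"
  integration_uri        = var.validator_invoke_arn
  payload_format_version = "2.0"
}

# Without this permission API Gateway cannot invoke the function.
resource "aws_lambda_permission" "validator" {
  statement_id  = "AllowHttpApiInvoke"
  action        = "lambda:InvokeFunction"
  function_name = var.validator_function_name
  principal     = "apigateway.amazonaws.com"
  source_arn    = "${aws_apigatewayv2_api.http.execution_arn}/*/*"
}
```

The query Lambda would get its own integration and permission in the same shape.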

Module 5 — Observability

This was missing entirely from the SAM version. Four CloudWatch log groups, alarms on Lambda errors, and a dashboard:

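resource "aws_cloudwatch_metric_alarm" "dlq_messages" {
  alarm_name          = "${var.project_name}-dlq-messages"
  metric_name         = "ApproximateNumberOfMessagesVisible"
  namespace           = "AWS/SQS"
  statistic           = "Maximum"              # assumed; not shown in the original excerpt
  period              = 60                     # seconds, assumed
  evaluation_periods  = 1                      # assumed
  comparison_operator = "GreaterThanThreshold" # fire when the value exceeds the threshold
  threshold           = 0
  alarm_description   = "Messages are landing in the Dead Letter Queue"

  dimensions = {
    QueueName = "${var.project_name}-outage-dlq"
  }
}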

The DLQ alarm is the most important one. If even one message lands in the DLQ something is wrong. With a threshold of 0 and a greater-than comparison, a single message in the DLQ is enough to trip the alarm at its next evaluation.
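
The log groups and Lambda error alarms mentioned above follow the same pattern. Here is a sketch for the enricher only, with every name treated as an assumption: the input variables, the 14-day log retention, the one-minute period, and the SNS topic used for notifications are illustrative choices, not values from the real module.

```hcl
variable "enricher_function_name" {
  type        = string
  description = "Name of the enricher Lambda (hypothetical input)"
}

variable "ops_topic_arn" {
  type        = string
  description = "SNS topic that receives alarm notifications (hypothetical input)"
}

# Managing the log group in Terraform makes retention explicit.
resource "aws_cloudwatch_log_group" "enricher" {
  name              = "/aws/lambda/${var.enricher_function_name}"
  retention_in_days = 14 # assumed retention
}

resource "aws_cloudwatch_metric_alarm" "enricher_errors" {
  alarm_name          = "${var.project_name}-enricher-errors"
  alarm_description   = "The enricher Lambda reported at least one error"
  namespace           = "AWS/Lambda"
  metric_name         = "Errors"
  statistic           = "Sum"
  period              = 60
  evaluation_periods  = 1
  comparison_operator = "GreaterThanThreshold"
  threshold           = 0

  # No invocations means no data points; treat that as healthy.
  treat_missing_data = "notBreaching"

  # An alarm with no actions changes state without notifying anyone.
  alarm_actions = [var.ops_topic_arn]

  dimensions = {
    FunctionName = var.enricher_function_name
  }
}
```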

The Proof

Submit an outage report:

curl -X POST https://your-api-id.execute-api.eu-west-1.amazonaws.com/reports \
  -H "Content-Type: application/json" \
  -d '{
    "lga": "Ikeja",
    "state": "Lagos",
    "reporter_name": "Emmanuel Ulu",
    "description": "Power outage on Allen Avenue since 6am"
  }'

{"message": "Outage report received", "report_id": "6765dd6f-cbbd-4900-a607-b542e5720487"}

Query reports for Ikeja:

curl "https://your-api-id.execute-api.eu-west-1.amazonaws.com/reports?lga=Ikeja"

{"count": 4, "reports": [...]}

Once reports from the same LGA reach the alert threshold (3 by default), an email alert fires:

Power outage alert for Ikeja, Lagos.
4 reports received today.
Latest report: Generator running low on fuel, still no NEPA
Reported by: Bola
Time: 2026-06-27T21:04:45

SAM vs Terraform — What Actually Changed

Feature	SAM Version	Terraform Version
IaC tool	AWS SAM	Terraform
State management	CloudFormation	S3 backend with native locking
Module structure	Single template	5 independent modules
Dead Letter Queue	No	Yes
CloudWatch dashboard	No	Yes
Alert threshold	Hardcoded	Configurable variable
Observability	Basic logs	Log groups, alarms, dashboard
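
For the state management row, an S3 backend with native locking could look roughly like this. The bucket name and key are hypothetical, the region is the one from the API URL earlier in this post, and the sketch assumes Terraform 1.10 or later, where the use_lockfile argument is available.

```hcl
terraform {
  required_version = ">= 1.10.0"

  backend "s3" {
    bucket       = "example-outage-tracker-tfstate" # hypothetical bucket name
    key          = "dev/terraform.tfstate"          # hypothetical state path
    region       = "eu-west-1"
    encrypt      = true
    use_lockfile = true # S3-native lock file, no DynamoDB lock table
  }
}
```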

Key Lessons

The DLQ is not optional
Without it you have no idea when messages fail. Every production SQS queue needs a DLQ.

Long polling saves money
receive_wait_time_seconds = 10 cuts empty SQS API calls significantly. Small change, real savings at scale.

TTL is free cleanup
DynamoDB TTL automatically removes expired items. No Lambda scheduled to delete old records, no growing storage costs.

Modules make variables powerful
The alert threshold is a number in terraform.tfvars. Change it, run terraform apply, done. In the SAM version you would have to edit Python code, redeploy, and hope nothing broke.

Observability is infrastructure
CloudWatch dashboards and alarms should be provisioned with the same code that provisions the app. Not added later when something breaks.