Anthony Uketui

Posted on Jul 2

How I Built a DevOps Handover Document That Could Run Without Me.


TL;DR: When I left my role as the sole DevOps/Platform Engineer for a fintech payment gateway, I created a handover document that could orient any engineer — from "where are the logs?" to "how do we fail over to another region?" — in under 60 seconds. Here's the framework I used and why most handover docs fail.

The Problem with Most Handovers

I've seen handover documents that are:

A brain dump of links with no context
A 50-page novel nobody reads
A set of credentials with zero operational guidance
A Confluence page last updated 6 months ago

None of these help the engineer at 2 AM when the payment gateway is down and you're unreachable.

My design principle: If I get hit by a bus tomorrow, can someone who's never seen this infrastructure keep the platform stable, deployable, observable, secure, and cost-controlled?

The Framework: 6 Domain Runbooks + 1 Quick Start

I structured the handover as a single entry-point page with drill-down links to domain-specific runbooks.

The 60-Second Quick Start

At the top of the page, in a highlighted box:

If you are new or taking over in an emergency, read this first.

Core ownership: AWS infrastructure, ECS services, CI/CD pipelines, security posture, observability, Terraform migration, and cost optimization

Primary environments: Staging in us-east-2; production in eu-west-2

How code reaches production: Source control → CodePipeline → CodeBuild → ECR → ECS service deployment behind ALB

When something breaks: Check monitoring first, confirm the latest deployment, inspect ECS service events/ALB target health, then follow the Incident Ops Runbook

Top current risks: Incomplete Terraform migration, shared security groups/IAM roles, remaining observability gaps

This is the most important section. Everything else is a drill-down.

The 6 Domain Runbooks

1. Platform & Infrastructure Runbook

Covers: AWS architecture, ECS clusters, EC2 fleet, networking, security groups

Key content:

Architecture diagrams (what lives where)
Service maps: ECS cluster → services → load balancers → domains
EC2 instance inventory with IPs, types, and purposes
Security group matrix: which ports, which sources, which services
How to add a new service, resize an instance, or modify networking

Critical detail: I included a service map that mapped every ECS service to its load balancer, domain path, CI/CD pipeline, and operational status. When someone asks "what handles /checkout/*?", the answer is one table lookup.
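The "one table lookup" idea can be sketched in a few lines of Python. This is only an illustration of the shape of the service map, not the actual table: every service name, load balancer, path pattern, and pipeline name below is a hypothetical placeholder.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import List, Optional


@dataclass(frozen=True)
class ServiceEntry:
    """One row of the service map. All values used below are made up."""
    ecs_service: str
    load_balancer: str
    path_pattern: str  # path pattern as routed by the load balancer, e.g. "/checkout/*"
    pipeline: str
    status: str        # e.g. "active" or "deprecated"


# Hypothetical rows; a real map would be checked against what is deployed.
SERVICE_MAP: List[ServiceEntry] = [
    ServiceEntry("checkout-api", "prod-public-alb", "/checkout/*", "checkout-api-pipeline", "active"),
    ServiceEntry("refunds-api", "prod-public-alb", "/refunds/*", "refunds-api-pipeline", "active"),
]


def who_handles(request_path: str) -> Optional[ServiceEntry]:
    """Return the first service whose path pattern matches the request path."""
    for entry in SERVICE_MAP:
        if fnmatchcase(request_path, entry.path_pattern):
            return entry
    return None


entry = who_handles("/checkout/session/123")
if entry is not None:
    print(entry.ecs_service, entry.load_balancer, entry.pipeline, entry.status)
```

Keeping the map as structured data rather than prose also makes it easy to diff when a service is added or retired.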

2. CI/CD & Release Engineering Runbook

Covers: CodePipeline, CodeBuild, ECR, deployment flow

Key content:

End-to-end pipeline flow: source → build → test → deploy
Buildspec template with SonarQube integration
How to create a new pipeline for a new service
How to roll back a bad deployment
Common failure patterns and fixes

Critical detail: Rolling back in ECS is simple: select the prior task definition revision and force a new deployment. But this needs to be documented, not assumed.
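A rough sketch of that rollback, under two assumptions: task definition ARNs end in `family:revision`, and the most recent earlier revision is the known-good one (in a real incident you may need to pick a specific revision instead). The cluster, service, and ARN values are hypothetical. The code only builds the arguments; with boto3 they could be passed to the ECS client's `update_service` call.

```python
from typing import Dict, List


def previous_revision(current_arn: str, available_arns: List[str]) -> str:
    """Pick the highest revision of the same family that is lower than the current one."""
    family, _, rev = current_arn.rpartition(":")
    current_rev = int(rev)
    candidates = []
    for arn in available_arns:
        other_family, _, other_rev = arn.rpartition(":")
        if other_family == family and int(other_rev) < current_rev:
            candidates.append((int(other_rev), arn))
    if not candidates:
        raise ValueError(f"no earlier revision found for {current_arn}")
    return max(candidates)[1]


def rollback_request(cluster: str, service: str, target_arn: str) -> Dict[str, object]:
    """Build keyword arguments for an ECS update_service call (nothing is sent here)."""
    return {
        "cluster": cluster,
        "service": service,
        "taskDefinition": target_arn,
        "forceNewDeployment": True,
    }


# Hypothetical ARNs for illustration only.
prefix = "arn:aws:ecs:eu-west-2:123456789012:task-definition/payments-api"
revisions = [f"{prefix}:{n}" for n in (40, 41, 42)]
target = previous_revision(f"{prefix}:42", revisions)
request = rollback_request("prod-cluster", "payments-api", target)
# With boto3 installed and credentials configured, something like:
#   boto3.client("ecs").update_service(**request)
```

Separating "decide the target revision" from "send the request" means the decision can be reviewed, or tested, before anything touches production.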

3. Observability & Monitoring Runbook

Covers: New Relic, CloudWatch, alerting, dashboards

Key content:

What's monitored and what isn't (the gaps are as important as the coverage)
How to instrument a new service
Alert routing: which alerts go where, who responds
Dashboard locations for each environment

4. Security & Compliance Runbook

Covers: IAM, security groups, GuardDuty, WAF, encryption

Key content:

Current security posture (honest assessment of gaps)
Remediation plan and priority matrix
How to rotate credentials
Incident response procedures

5. FinOps & Cost Management Runbook

Covers: AWS billing, optimization strategies, governance

Key content:

Current monthly spend and breakdown by service
Active cost optimizations (staging shutdown schedules, reserved instances)
Budget alerts and thresholds
How to investigate a cost spike
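As a starting point for investigating a cost spike, a per-service comparison between two billing periods narrows the search quickly. The sketch below assumes you have already exported spend per service for both periods (for example from Cost Explorer); the figures are invented, and the 30% default simply mirrors the FinOps escalation threshold used elsewhere in this handover.

```python
from typing import Dict, List, Tuple


def cost_spikes(
    previous: Dict[str, float],
    current: Dict[str, float],
    threshold: float = 0.30,
) -> List[Tuple[str, float]]:
    """Return (service, relative_increase) pairs at or above the threshold, largest first.

    Services with spend now but none in the previous period are reported
    with an increase of infinity so they sort to the top.
    """
    spikes = []
    for service, now in current.items():
        before = previous.get(service, 0.0)
        if before == 0.0:
            if now > 0.0:
                spikes.append((service, float("inf")))
            continue
        change = (now - before) / before
        if change >= threshold:
            spikes.append((service, change))
    return sorted(spikes, key=lambda item: item[1], reverse=True)


# Invented monthly figures in USD, for illustration only.
last_month = {"ECS": 1000.0, "RDS": 500.0}
this_month = {"ECS": 1400.0, "RDS": 510.0, "NAT Gateway": 80.0}
for service, increase in cost_spikes(last_month, this_month):
    print(service, increase)
```

New line items are surfaced first on purpose: a service that did not exist last month is often the forgotten resource behind the spike.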

6. Terraform/IaC Runbook

Covers: Terraform state, modules, migration status

Key content:

What's in Terraform and what's still ClickOps
State file locations and backend config
How to import a new resource
How to plan and apply safely
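The "what's in Terraform and what's still ClickOps" question reduces to comparing two sets. This sketch assumes you can export one list of resource IDs from your cloud inventory and another from Terraform state; how you produce those lists is left open, and the IDs shown are hypothetical.

```python
from typing import Dict, Iterable, List


def migration_status(
    inventory_ids: Iterable[str],
    terraform_ids: Iterable[str],
) -> Dict[str, List[str]]:
    """Split resources into Terraform-managed, still-manual, and stale state entries."""
    inventory = set(inventory_ids)
    managed = set(terraform_ids)
    return {
        "managed": sorted(inventory & managed),         # exists and tracked in state
        "clickops": sorted(inventory - managed),        # exists but not yet imported
        "stale_in_state": sorted(managed - inventory),  # tracked but no longer exists
    }


# Hypothetical resource IDs for illustration only.
status = migration_status(
    inventory_ids=["sg-0aaa", "sg-0bbb", "i-0ccc"],
    terraform_ids=["sg-0aaa", "i-0ddd"],
)
print(status["clickops"])
```

The "stale_in_state" bucket is worth keeping in the report: entries tracked in state that no longer exist are a common source of confusing plans.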

The Responsibility Matrix (RACI-style)

This is the section most handovers miss. It's not enough to document what — you need to document who.

I created a table mapping each domain to a primary owner, a backup, and escalation triggers:

Domain	Primary Owner	Backup	When to Escalate
Platform & Infra	[Assign]	[Assign]	Any production service instability, scaling issues, VPC/SG changes
CI/CD	[Assign]	[Assign]	Pipeline failures lasting >30min, new service onboarding
Security	[Assign]	[Assign]	GuardDuty critical finding, suspected breach, credential exposure
Observability	[Assign]	[Assign]	Monitoring gaps, alert fatigue, instrumentation requests
FinOps	[Assign]	[Assign]	30%+ cost increase, daily spend spike, commitment purchases
Terraform	[Assign]	[Assign]	State corruption, import failures, drift detection
Incident Command	[Assign]	[Assign]	Any customer-impacting issue, suspected production instability

Critical rule: No domain should have a single named operator. Before the handover is considered complete, every domain needs at least one backup.
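The "no single named operator" rule is easy to check mechanically before signing off the handover. A minimal sketch, assuming the matrix is kept as structured data; the names below are placeholders, not real assignments.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DomainOwnership:
    domain: str
    primary: Optional[str]
    backup: Optional[str]


def handover_blockers(matrix: List[DomainOwnership]) -> List[str]:
    """List every reason the matrix does not yet satisfy the backup rule."""
    problems = []
    for row in matrix:
        if not row.primary:
            problems.append(f"{row.domain}: no primary owner")
        if not row.backup:
            problems.append(f"{row.domain}: no backup")
        elif row.backup == row.primary:
            problems.append(f"{row.domain}: backup is the same person as the primary owner")
    return problems


# Placeholder names for illustration only.
matrix = [
    DomainOwnership("CI/CD", "alice", "bob"),
    DomainOwnership("Security", "alice", None),
    DomainOwnership("FinOps", "carol", "carol"),
]
for problem in handover_blockers(matrix):
    print(problem)
```

An empty result is the condition for calling the handover complete; anything else is a named gap to resolve first.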

Common Failure Patterns (Cheat Sheet)

I included a quick-reference troubleshooting table:

Symptom	Check First	Fix
ALB health check failing	Health path, container port, target group port, SG rules	Fix path or port mapping
SonarQube gate failing	Token, project key, coverage threshold	Rotate token, verify key
Image mismatch	ECR pushed tag vs. ECS task definition image URI	Update task definition
Task crash loop	App logs, env vars, secrets access, CPU/memory limits	Fix config, increase resources
Cost spike	Cost Explorer by service, check for zombie resources	Terminate unused resources
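Rows like "Image mismatch" lend themselves to a scripted check. The sketch below compares the tag a pipeline pushed with the tag referenced in a task definition's image URI. It assumes tag-based image references (not digests), and the registry, repository, and tag values are hypothetical.

```python
def image_tag(image_uri: str) -> str:
    """Return the tag from 'registry/repo:tag', or 'latest' when no tag is given."""
    last_segment = image_uri.rsplit("/", 1)[-1]
    if ":" in last_segment:
        return last_segment.rsplit(":", 1)[1]
    return "latest"


def image_mismatch(pushed_tag: str, task_definition_image: str) -> bool:
    """True when the task definition does not reference the tag that was pushed."""
    return image_tag(task_definition_image) != pushed_tag


# Hypothetical values for illustration only.
running_image = "123456789012.dkr.ecr.eu-west-2.amazonaws.com/payments-api:build-41"
if image_mismatch("build-42", running_image):
    print("Task definition still points at", image_tag(running_image))
```

Splitting on the last path segment first keeps a registry port such as `localhost:5000/repo` from being mistaken for a tag.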

Known Gaps and Backlog

The most honest section of any handover: what's not done.

I listed:

Terraform migration: pending services and resources
Security: remaining shared security groups and IAM roles
Observability: services not yet instrumented
FinOps: Graviton migration candidates not yet evaluated
Automation: edge cases in reconciliation system

Why this matters: The person taking over needs to know where the landmines are, not just where the paved roads go.

Linked Source Pages

Every claim in the handover links to its authoritative source page:

EC2 Infrastructure Risk Assessment
ECS Security Assessment Reports (per cluster)
Complete CI/CD Pipeline Setup Guide
Deployment Architecture Case Study
SonarQube + CI/CD Integration Standard
AWS Cost Optimization Initiative
Cost Optimization Strategy ($3K Reduction Plan)
Terraform Migration Status Report
Multi-Region Architecture Decision Record
ALB Health Check Troubleshooting Guide

What Makes This Different

It's a living document. Not a one-time brain dump. The handover page links to source pages that are independently maintained.
It starts with the emergency. The 60-second quick start is for the 2 AM incident. The domain runbooks are for the Tuesday morning onboarding.
It's honest about gaps. Most handovers present a rosy picture. Mine included a "Known Gaps and Backlog" section because the person taking over needs to know what's fragile.
It separates "what" from "who." Documentation without ownership assignment is just a wiki page. The RACI matrix ensures every domain has a human responsible for it.
It includes the "why." Not just "we use SonarQube" but "we use SonarQube because we needed a SAST tool that scales on LOC-based pricing for a growing Java team in a cost-sensitive fintech."

The Template

If you're building your own handover, here's the skeleton:

# DevOps/Cloud Handover — [Your Name]

## TL;DR — Platform in 60 Seconds
[5-bullet emergency orientation]

## Responsibility Matrix
[Domain → Owner → Backup → Escalation triggers]

## Domain Runbooks
1. Platform & Infrastructure
2. CI/CD & Release Engineering
3. Observability & Monitoring
4. Security & Compliance
5. FinOps & Cost Management
6. Terraform/IaC

## Common Failure Patterns
[Symptom → Check → Fix table]

## Known Gaps and Backlog
[Honest list of what's not done]

## References
[Links to all authoritative source pages]

I built this handover when leaving my role as sole DevOps/Platform Engineer for a fintech payment gateway. The test of a good handover isn't whether it's comprehensive — it's whether the person reading it at 2 AM can keep the platform running. Design for the emergency first, then add depth.
