DEV Community

Anthony Uketui

How I Built a DevOps Handover Document That Could Run Without Me.

TL;DR: When I left my role as the sole DevOps/Platform Engineer for a fintech payment gateway, I created a handover document that could orient any engineer — from "where are the logs?" to "how do we fail over to another region?" — in under 60 seconds. Here's the framework I used and why most handover docs fail.


The Problem with Most Handovers

I've seen handover documents that are:

  • A brain dump of links with no context
  • A 50-page novel nobody reads
  • A set of credentials with zero operational guidance
  • A Confluence page last updated 6 months ago

None of these help the engineer at 2 AM when the payment gateway is down and you're unreachable.

My design principle: If I get hit by a bus tomorrow, can someone who's never seen this infrastructure keep the platform stable, deployable, observable, secure, and cost-controlled?


The Framework: 6 Domain Runbooks + 1 Quick Start

I structured the handover as a single entry point page with drill-down links to domain-specific runbooks.

The 60-Second Quick Start

At the top of the page, in a highlighted box:

If you are new or taking over in an emergency, read this first.

  • Core ownership: AWS infrastructure, ECS services, CI/CD pipelines, security posture, observability, Terraform migration, and cost optimization
  • Primary environments: Staging in us-east-2; production in eu-west-2
  • How code reaches production: Source control → CodePipeline → CodeBuild → ECR → ECS service deployment behind ALB
  • When something breaks: Check monitoring first, confirm the latest deployment, inspect ECS service events/ALB target health, then follow the Incident Ops Runbook
  • Top current risks: Incomplete Terraform migration, shared security groups/IAM roles, remaining observability gaps

This is the most important section. Everything else is a drill-down.
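
The "when something breaks" order above can be sketched as one helper that gathers the deployment state, recent service events, and unhealthy ALB targets in a single call. This is a minimal sketch, not the tooling from the original handover: `triage_snapshot` and its parameters are hypothetical names, and the `ecs` and `elbv2` arguments are assumed to behave like `boto3.client("ecs")` and `boto3.client("elbv2")`.

```python
from typing import Any, Dict


def triage_snapshot(ecs: Any, elbv2: Any, cluster: str, service: str,
                    target_group_arn: str, max_events: int = 5) -> Dict[str, list]:
    """Collect the first things to look at during an incident.

    The clients are passed in (assumed to be boto3 ECS and ELBv2 clients)
    so the function is easy to stub in tests.
    """
    svc = ecs.describe_services(cluster=cluster, services=[service])["services"][0]
    health = elbv2.describe_target_health(TargetGroupArn=target_group_arn)

    # Any target that is not "healthy" is worth a look (draining, initial, unhealthy).
    unhealthy = [
        d["Target"]["Id"]
        for d in health["TargetHealthDescriptions"]
        if d["TargetHealth"]["State"] != "healthy"
    ]
    return {
        # Which task definition each deployment of the service is running.
        "deployments": [(d["status"], d["taskDefinition"]) for d in svc["deployments"]],
        # The first few service event messages as returned by the API.
        "recent_events": [e["message"] for e in svc["events"][:max_events]],
        "unhealthy_targets": unhealthy,
    }
```

Injecting the clients keeps the sketch free of credentials and region assumptions; the runbook would state which cluster, service, and target group to pass for each environment.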


The 6 Domain Runbooks

1. Platform & Infrastructure Runbook

Covers: AWS architecture, ECS clusters, EC2 fleet, networking, security groups

Key content:

  • Architecture diagrams (what lives where)
  • Service maps: ECS cluster → services → load balancers → domains
  • EC2 instance inventory with IPs, types, and purposes
  • Security group matrix: which ports, which sources, which services
  • How to add a new service, resize an instance, or modify networking

Critical detail: I included a service map that mapped every ECS service to its load balancer, domain path, CI/CD pipeline, and operational status. When someone asks "what handles /checkout/*?", the answer is one table lookup.
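
A service map like this can also be kept as data, so the "what handles this path?" question becomes a lookup. The sketch below is illustrative only: the rows, service names, and the `owner_of` helper are hypothetical, and glob matching is only an approximation of how ALB path rules are evaluated.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import List, Optional


@dataclass(frozen=True)
class ServiceEntry:
    path_pattern: str   # ALB path rule, e.g. "/checkout/*"
    ecs_service: str
    load_balancer: str
    pipeline: str
    status: str


# Hypothetical rows; real values would come from the infrastructure runbook.
SERVICE_MAP: List[ServiceEntry] = [
    ServiceEntry("/checkout/*", "checkout-svc", "public-alb", "checkout-pipeline", "live"),
    ServiceEntry("/refunds/*", "refunds-svc", "public-alb", "refunds-pipeline", "live"),
]


def owner_of(path: str) -> Optional[ServiceEntry]:
    """Return the first entry whose path pattern matches, or None."""
    for entry in SERVICE_MAP:
        if fnmatchcase(path, entry.path_pattern):
            return entry
    return None
```

Calling `owner_of("/checkout/pay")` returns the checkout row with its load balancer and pipeline, which is the same one-table lookup described above.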

2. CI/CD & Release Engineering Runbook

Covers: CodePipeline, CodeBuild, ECR, deployment flow

Key content:

  • End-to-end pipeline flow: source → build → test → deploy
  • Buildspec template with SonarQube integration
  • How to create a new pipeline for a new service
  • How to roll back a bad deployment
  • Common failure patterns and fixes

Critical detail: Rolling back in ECS is simple. Select the prior task definition revision and force a new deployment. But this needs to be documented, not assumed.
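
As a rough illustration of that rollback, the sketch below finds the next-lower ACTIVE revision of the service's task definition family and redeploys with it. It assumes `ecs` behaves like `boto3.client("ecs")`; `previous_revision` and `roll_back` are hypothetical helper names, and "next-lower revision number" is an assumption that may not match the last known-good release in every case.

```python
from typing import Any, List, Optional


def previous_revision(current_arn: str, candidate_arns: List[str]) -> Optional[str]:
    """Pick the highest revision of the same family that is below the current one."""
    family, rev = current_arn.rsplit(":", 1)
    older = [
        arn for arn in candidate_arns
        if arn.rsplit(":", 1)[0] == family and int(arn.rsplit(":", 1)[1]) < int(rev)
    ]
    return max(older, key=lambda arn: int(arn.rsplit(":", 1)[1]), default=None)


def roll_back(ecs: Any, cluster: str, service: str) -> str:
    """Point the service at the prior task definition revision and redeploy."""
    svc = ecs.describe_services(cluster=cluster, services=[service])["services"][0]
    current = svc["taskDefinition"]  # ARN ending in "task-definition/<family>:<revision>"
    family = current.rsplit("/", 1)[1].rsplit(":", 1)[0]

    # Only the first page of results is read; familyPrefix can also match
    # other families with the same prefix, so previous_revision filters again.
    listed = ecs.list_task_definitions(familyPrefix=family, status="ACTIVE", sort="DESC")
    target = previous_revision(current, listed["taskDefinitionArns"])
    if target is None:
        raise RuntimeError(f"No earlier ACTIVE revision found for {family}")

    ecs.update_service(cluster=cluster, service=service,
                       taskDefinition=target, forceNewDeployment=True)
    return target
```

A runbook entry would pair something like this with the check that follows it: watch the service events and target health until the older revision is steady.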

3. Observability & Monitoring Runbook

Covers: New Relic, CloudWatch, alerting, dashboards

Key content:

  • What's monitored and what isn't (the gaps are as important as the coverage)
  • How to instrument a new service
  • Alert routing: which alerts go where, who responds
  • Dashboard locations for each environment

4. Security & Compliance Runbook

Covers: IAM, security groups, GuardDuty, WAF, encryption

Key content:

  • Current security posture (honest assessment of gaps)
  • Remediation plan and priority matrix
  • How to rotate credentials
  • Incident response procedures

5. FinOps & Cost Management Runbook

Covers: AWS billing, optimization strategies, governance

Key content:

  • Current monthly spend and breakdown by service
  • Active cost optimizations (staging shutdown schedules, reserved instances)
  • Budget alerts and thresholds
  • How to investigate a cost spike
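
One way to start a cost-spike investigation is to compare spend per service across two periods and flag anything that grew past a threshold. The sketch below uses the 30% figure that appears as the FinOps escalation trigger later in this post; the function name, the input shape (service name to spend), and the choice to always flag newly appearing services are assumptions.

```python
from typing import Dict, List, Tuple


def cost_spikes(previous: Dict[str, float], current: Dict[str, float],
                threshold: float = 0.30) -> List[Tuple[str, float]]:
    """Return (service, absolute increase) pairs, largest increase first.

    A service is flagged when its spend grew by at least `threshold`
    relative to the previous period, or when it had no spend before.
    """
    spikes = []
    for service, now in current.items():
        before = previous.get(service, 0.0)
        increase = now - before
        if increase <= 0:
            continue  # flat or cheaper: not a spike
        if before == 0 or increase / before >= threshold:
            spikes.append((service, increase))
    return sorted(spikes, key=lambda item: item[1], reverse=True)
```

The per-service totals could come from a Cost Explorer export grouped by service; sorting by absolute increase puts the most expensive surprise at the top of the list.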

6. Terraform/IaC Runbook

Covers: Terraform state, modules, migration status

Key content:

  • What's in Terraform and what's still ClickOps
  • State file locations and backend config
  • How to import a new resource
  • How to plan and apply safely
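
A common pattern for "plan and apply safely" is to save the plan to a file, inspect it, and apply exactly that file. The wrapper below is a sketch under that assumption, not the workflow from the original runbook: `plan`, `apply_saved_plan`, and the `tfplan` file name are hypothetical. It relies on `terraform plan -detailed-exitcode`, which exits with 0 for no changes, 1 for an error, and 2 when changes are present.

```python
import subprocess
from typing import Any, Callable

PLAN_FILE = "tfplan"  # hypothetical file name for the saved plan


def plan(run: Callable[..., Any] = subprocess.run) -> str:
    """Run `terraform plan` and report whether the saved plan contains changes."""
    result = run(["terraform", "plan", "-input=false",
                  "-detailed-exitcode", f"-out={PLAN_FILE}"])
    if result.returncode == 0:
        return "no-changes"
    if result.returncode == 2:
        return "changes"
    raise RuntimeError("terraform plan failed; nothing should be applied")


def apply_saved_plan(run: Callable[..., Any] = subprocess.run) -> None:
    """Apply exactly the plan that was reviewed, not a freshly computed one."""
    run(["terraform", "apply", "-input=false", PLAN_FILE], check=True)
```

Applying the saved file means the changes that were reviewed are the changes that run, even if the state or configuration moved in between.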

The Responsibility Matrix (RACI-style)

This is the section most handovers miss. It's not enough to document what — you need to document who.

I created a table mapping each domain to:

| Domain | Primary Owner | Backup | When to Escalate |
| --- | --- | --- | --- |
| Platform & Infra | [Assign] | [Assign] | Any production service instability, scaling issues, VPC/SG changes |
| CI/CD | [Assign] | [Assign] | Pipeline failures lasting >30min, new service onboarding |
| Security | [Assign] | [Assign] | GuardDuty critical finding, suspected breach, credential exposure |
| Observability | [Assign] | [Assign] | Monitoring gaps, alert fatigue, instrumentation requests |
| FinOps | [Assign] | [Assign] | 30%+ cost increase, daily spend spike, commitment purchases |
| Terraform | [Assign] | [Assign] | State corruption, import failures, drift detection |
| Incident Command | [Assign] | [Assign] | Any customer-impacting issue, suspected production instability |

Critical rule: No domain should have a single named operator. Before the handover is considered complete, every domain needs at least one backup.
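
The "no single operator" rule is easy to check mechanically if the matrix is kept as data. The sketch below is one possible check, with assumed field names (`domain`, `owner`, `backup`) and hypothetical people; it treats the `[Assign]` placeholder, an empty cell, or the same person in both columns as missing coverage.

```python
from typing import Dict, List

UNASSIGNED = {"", "[Assign]"}  # placeholder values that do not count as a person


def domains_without_backup(matrix: List[Dict[str, str]]) -> List[str]:
    """Return the domains that still lack a distinct owner and backup."""
    missing = []
    for row in matrix:
        owner = row.get("owner", "").strip()
        backup = row.get("backup", "").strip()
        if owner in UNASSIGNED or backup in UNASSIGNED or owner == backup:
            missing.append(row["domain"])
    return missing
```

Running this before sign-off gives a concrete definition of "handover complete": the function returns an empty list.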


Common Failure Patterns (Cheat Sheet)

I included a quick-reference troubleshooting table:

| Symptom | Check First | Fix |
| --- | --- | --- |
| ALB health check failing | Health path, container port, target group port, SG rules | Fix path or port mapping |
| SonarQube gate failing | Token, project key, coverage threshold | Rotate token, verify key |
| Image mismatch | ECR pushed tag vs. ECS task definition image URI | Update task definition |
| Task crash loop | App logs, env vars, secrets access, CPU/memory limits | Fix config, increase resources |
| Cost spike | Cost Explorer by service, check for zombie resources | Terminate unused resources |
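
The "image mismatch" row above can be illustrated with a small comparison between the image that was pushed and the images referenced in a task definition. This is a sketch with hypothetical helper names; `container_definitions` is assumed to have the shape of the `containerDefinitions` list in an ECS task definition (each item with `name` and `image`).

```python
from typing import Dict, List, Tuple


def split_image(uri: str) -> Tuple[str, str]:
    """Split an image URI into (repository, tag or digest)."""
    if "@" in uri:                       # pinned by digest: repo@sha256:...
        repo, _, digest = uri.partition("@")
        return repo, digest
    repo, sep, tag = uri.rpartition(":")
    if not sep or "/" in tag:            # no tag; a colon, if any, was a registry port
        return uri, "latest"
    return repo, tag


def stale_containers(container_definitions: List[Dict[str, str]],
                     pushed_uri: str) -> List[str]:
    """Names of containers that use the pushed repository but a different tag."""
    pushed_repo, pushed_tag = split_image(pushed_uri)
    stale = []
    for container in container_definitions:
        repo, tag = split_image(container["image"])
        if repo == pushed_repo and tag != pushed_tag:
            stale.append(container["name"])
    return stale
```

An empty result means the task definition already points at the pushed image; a non-empty one names the containers whose definition needs updating.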

Known Gaps and Backlog

The most honest section of any handover: what's not done.

I listed:

  • Terraform migration: pending services and resources
  • Security: remaining shared security groups and IAM roles
  • Observability: services not yet instrumented
  • FinOps: Graviton migration candidates not yet evaluated
  • Automation: edge cases in reconciliation system

Why this matters: The person taking over needs to know where the landmines are, not just where the paved roads go.


Linked Source Pages

Every claim in the handover links to its authoritative source page:

  • EC2 Infrastructure Risk Assessment
  • ECS Security Assessment Reports (per cluster)
  • Complete CI/CD Pipeline Setup Guide
  • Deployment Architecture Case Study
  • SonarQube + CI/CD Integration Standard
  • AWS Cost Optimization Initiative
  • Cost Optimization Strategy ($3K Reduction Plan)
  • Terraform Migration Status Report
  • Multi-Region Architecture Decision Record
  • ALB Health Check Troubleshooting Guide

What Makes This Different

  1. It's a living document. Not a one-time brain dump. The handover page links to source pages that are independently maintained.

  2. It starts with the emergency. The 60-second quick start is for the 2 AM incident. The domain runbooks are for the Tuesday morning onboarding.

  3. It's honest about gaps. Most handovers present a rosy picture. Mine included a "Known Gaps and Backlog" section because the person taking over needs to know what's fragile.

  4. It separates "what" from "who." Documentation without ownership assignment is just a wiki page. The RACI matrix ensures every domain has a human responsible for it.

  5. It includes the "why." Not just "we use SonarQube" but "we use SonarQube because we needed a SAST tool that scales on LOC-based pricing for a growing Java team in a cost-sensitive fintech."


The Template

If you're building your own handover, here's the skeleton:

# DevOps/Cloud Handover — [Your Name]

## TL;DR — Platform in 60 Seconds
[5-bullet emergency orientation]

## Responsibility Matrix
[Domain → Owner → Backup → Escalation triggers]

## Domain Runbooks
1. Platform & Infrastructure
2. CI/CD & Release Engineering
3. Observability & Monitoring
4. Security & Compliance
5. FinOps & Cost Management
6. Infrastructure as Code

## Common Failure Patterns
[Symptom → Check → Fix table]

## Known Gaps and Backlog
[Honest list of what's not done]

## References
[Links to all authoritative source pages]

I built this handover when leaving my role as sole DevOps/Platform Engineer for a fintech payment gateway. The test of a good handover isn't whether it's comprehensive — it's whether the person reading it at 2 AM can keep the platform running. Design for the emergency first, then add depth.
