TL;DR: When I left my role as the sole DevOps/Platform Engineer for a fintech payment gateway, I created a handover document that could orient any engineer — from "where are the logs?" to "how do we fail over to another region?" — in under 60 seconds. Here's the framework I used and why most handover docs fail.
The Problem with Most Handovers
I've seen handover documents that are:
- A brain dump of links with no context
- A 50-page novel nobody reads
- A set of credentials with zero operational guidance
- A Confluence page last updated 6 months ago
None of these help the engineer at 2 AM when the payment gateway is down and you're unreachable.
My design principle: If I get hit by a bus tomorrow, can someone who's never seen this infrastructure keep the platform stable, deployable, observable, secure, and cost-controlled?
The Framework: 6 Domain Runbooks + 1 Quick Start
I structured the handover as a single entry point page with drill-down links to domain-specific runbooks.
The 60-Second Quick Start
At the top of the page, in a highlighted box:
If you are new or taking over in an emergency, read this first.
- Core ownership: AWS infrastructure, ECS services, CI/CD pipelines, security posture, observability, Terraform migration, and cost optimization
- Primary environments: Staging in us-east-2; production in eu-west-2
- How code reaches production: Source control → CodePipeline → CodeBuild → ECR → ECS service deployment behind ALB
- When something breaks: Check monitoring first, confirm the latest deployment, inspect ECS service events/ALB target health, then follow the Incident Ops Runbook
- Top current risks: Incomplete Terraform migration, shared security groups/IAM roles, remaining observability gaps
This is the most important section. Everything else is a drill-down.
The 6 Domain Runbooks
1. Platform & Infrastructure Runbook
Covers: AWS architecture, ECS clusters, EC2 fleet, networking, security groups
Key content:
- Architecture diagrams (what lives where)
- Service maps: ECS cluster → services → load balancers → domains
- EC2 instance inventory with IPs, types, and purposes
- Security group matrix: which ports, which sources, which services
- How to add a new service, resize an instance, or modify networking
Critical detail: I included a service map that ties every ECS service to its load balancer, domain path, CI/CD pipeline, and operational status. When someone asks "what handles /checkout/*?", the answer is one table lookup.
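The same service map can be kept as data so the lookup is scriptable. This is a minimal sketch, not the real inventory: the service, load balancer, and pipeline names are hypothetical placeholders, and the wildcard matching via `fnmatchcase` only approximates how an ALB path rule behaves.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import Optional


@dataclass(frozen=True)
class ServiceEntry:
    ecs_service: str    # ECS service name
    load_balancer: str  # ALB in front of the service
    path_pattern: str   # domain path routed to the service
    pipeline: str       # CI/CD pipeline that deploys it
    status: str         # operational status note


# Hypothetical rows; a real table would list every ECS service.
SERVICE_MAP = [
    ServiceEntry("checkout-svc", "payments-alb", "/checkout/*", "checkout-pipeline", "active"),
    ServiceEntry("refunds-svc", "payments-alb", "/refunds/*", "refunds-pipeline", "active"),
]


def who_handles(path: str) -> Optional[ServiceEntry]:
    """Return the first service whose path pattern matches the request path."""
    for entry in SERVICE_MAP:
        if fnmatchcase(path, entry.path_pattern):
            return entry
    return None
```

Calling `who_handles("/checkout/cart")` returns the checkout row, including the pipeline to look at if the latest deployment is the suspect.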
2. CI/CD & Release Engineering Runbook
Covers: CodePipeline, CodeBuild, ECR, deployment flow
Key content:
- End-to-end pipeline flow: source → build → test → deploy
- Buildspec template with SonarQube integration
- How to create a new pipeline for a new service
- How to roll back a bad deployment
- Common failure patterns and fixes
Critical detail: Rolling back in ECS is simple. Select the prior task definition revision and force a new deployment. But this needs to be documented, not assumed.
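A rollback along those lines could be scripted roughly as follows. Treat it as a sketch under assumptions: `ecs_client` is expected to be a boto3 ECS client (for example `boto3.client("ecs")`), the cluster and service names are whatever your environment uses, and "prior revision" is assumed to mean revision N-1 of the same family, still active and known good. In a real incident you would confirm which revision was last healthy before pointing the service at it.

```python
def previous_revision(task_definition_arn: str) -> str:
    """Turn '...task-definition/family:42' into '...task-definition/family:41'."""
    prefix, _, revision = task_definition_arn.rpartition(":")
    number = int(revision)
    if number <= 1:
        raise ValueError("no earlier revision to roll back to")
    return f"{prefix}:{number - 1}"


def roll_back(ecs_client, cluster: str, service: str) -> str:
    """Point the service at the prior task definition revision and redeploy."""
    described = ecs_client.describe_services(cluster=cluster, services=[service])
    current = described["services"][0]["taskDefinition"]
    target = previous_revision(current)
    ecs_client.update_service(
        cluster=cluster,
        service=service,
        taskDefinition=target,
        forceNewDeployment=True,  # start fresh tasks on the older revision
    )
    return target
```

Passing the client in as a parameter keeps the function easy to dry-run against a stub before anyone needs it at 2 AM.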
3. Observability & Monitoring Runbook
Covers: New Relic, CloudWatch, alerting, dashboards
Key content:
- What's monitored and what isn't (the gaps are as important as the coverage)
- How to instrument a new service
- Alert routing: which alerts go where, who responds
- Dashboard locations for each environment
4. Security & Compliance Runbook
Covers: IAM, security groups, GuardDuty, WAF, encryption
Key content:
- Current security posture (honest assessment of gaps)
- Remediation plan and priority matrix
- How to rotate credentials
- Incident response procedures
5. FinOps & Cost Management Runbook
Covers: AWS billing, optimization strategies, governance
Key content:
- Current monthly spend and breakdown by service
- Active cost optimizations (staging shutdown schedules, reserved instances)
- Budget alerts and thresholds
- How to investigate a cost spike
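One way to make the cost spike investigation repeatable is to compare two per-service spend snapshots. The sketch below assumes you have already exported spend grouped by service (for example from Cost Explorer) into plain dictionaries. The 30% default mirrors the FinOps escalation trigger in the responsibility matrix, and the function name is illustrative.

```python
from typing import Dict, List, Tuple


def find_cost_spikes(
    previous: Dict[str, float],
    current: Dict[str, float],
    threshold: float = 0.30,
) -> List[Tuple[str, float]]:
    """Return (service, fractional increase) pairs at or above the threshold, largest first."""
    spikes = []
    for service, now in current.items():
        before = previous.get(service, 0.0)
        if before <= 0:
            # No baseline: any new spend is reported as unbounded growth.
            if now > 0:
                spikes.append((service, float("inf")))
            continue
        increase = (now - before) / before
        if increase >= threshold:
            spikes.append((service, increase))
    return sorted(spikes, key=lambda item: item[1], reverse=True)
```

Services with no previous spend sort first, which is usually where zombie resources show up.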
6. Terraform/IaC Runbook
Covers: Terraform state, modules, migration status
Key content:
- What's in Terraform and what's still ClickOps
- State file locations and backend config
- How to import a new resource
- How to plan and apply safely
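"Plan and apply safely" can be partly automated by inspecting a saved plan before applying it. A sketch, assuming the workflow is `terraform plan -out=tfplan`, then `terraform show -json tfplan`, then `terraform apply tfplan` once the plan has been reviewed (`tfplan` is just a placeholder file name). The helper below reads the JSON plan and lists anything that would be deleted or replaced. It is illustrative and not part of the original runbook.

```python
import json
from typing import List


def destructive_changes(plan_json: str) -> List[str]:
    """List resource addresses that the plan would delete or replace."""
    plan = json.loads(plan_json)
    flagged = []
    for change in plan.get("resource_changes", []):
        actions = change.get("change", {}).get("actions", [])
        # "delete" appears alone for removals and paired with "create" for replacements.
        if "delete" in actions:
            flagged.append(change["address"])
    return flagged
```

An empty list is not proof the plan is safe, but a non-empty one is a good reason to stop and get a second pair of eyes, especially on shared security groups and IAM roles.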
The Responsibility Matrix (RACI-style)
This is the section most handovers miss. It's not enough to document what — you need to document who.
I created a table mapping each domain to:
| Domain | Primary Owner | Backup | When to Escalate |
|---|---|---|---|
| Platform & Infra | [Assign] | [Assign] | Any production service instability, scaling issues, VPC/SG changes |
| CI/CD | [Assign] | [Assign] | Pipeline failures lasting >30min, new service onboarding |
| Security | [Assign] | [Assign] | GuardDuty critical finding, suspected breach, credential exposure |
| Observability | [Assign] | [Assign] | Monitoring gaps, alert fatigue, instrumentation requests |
| FinOps | [Assign] | [Assign] | 30%+ cost increase, daily spend spike, commitment purchases |
| Terraform | [Assign] | [Assign] | State corruption, import failures, drift detection |
| Incident Command | [Assign] | [Assign] | Any customer-impacting issue, suspected production instability |
Critical rule: No domain should have a single named operator. Before the handover is considered complete, every domain needs at least one backup.
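That rule is easy to check mechanically if the matrix is kept as data. A small sketch, with hypothetical names and `None` standing in for a cell that still says "[Assign]":

```python
from typing import Dict, List, Optional, Tuple

# domain -> (primary owner, backup); None means the cell is still "[Assign]"
Matrix = Dict[str, Tuple[Optional[str], Optional[str]]]


def handover_blockers(matrix: Matrix) -> List[str]:
    """Return the reasons the handover cannot yet be considered complete."""
    problems = []
    for domain, (owner, backup) in matrix.items():
        if not owner:
            problems.append(f"{domain}: no primary owner")
        if not backup:
            problems.append(f"{domain}: no backup")
        elif backup == owner:
            problems.append(f"{domain}: backup is the same person as the owner")
    return problems
```

An empty result means every domain has two distinct names against it.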
Common Failure Patterns (Cheat Sheet)
I included a quick-reference troubleshooting table:
| Symptom | Check First | Fix |
|---|---|---|
| ALB health check failing | Health path, container port, target group port, SG rules | Fix path or port mapping |
| SonarQube gate failing | Token, project key, coverage threshold | Rotate token, verify key |
| Image mismatch | ECR pushed tag vs. ECS task definition image URI | Update task definition |
| Task crash loop | App logs, env vars, secrets access, CPU/memory limits | Fix config, increase resources |
| Cost spike | Cost Explorer by service, check for zombie resources | Terminate unused resources |
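The image mismatch row is the easiest one to turn into a check. The sketch below compares the image URI from a task definition with the repository and tag you expect to be running. It assumes tag-based references; digest references (`repo@sha256:...`) are not handled, and the function name is illustrative.

```python
def image_matches(task_definition_image: str, expected_repo: str, expected_tag: str) -> bool:
    """True if the task definition image URI points at the expected repo and tag."""
    repo, sep, tag = task_definition_image.rpartition(":")
    if not sep or "/" in tag:
        # No tag in the URI (the colon, if any, belonged to a registry port),
        # so fall back to the implicit default tag.
        repo, tag = task_definition_image, "latest"
    return repo == expected_repo and tag == expected_tag
```

If this returns `False` right after a pipeline run, the build most likely pushed a tag that the task definition never picked up.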
Known Gaps and Backlog
The most honest section of any handover: what's not done.
I listed:
- Terraform migration: pending services and resources
- Security: remaining shared security groups and IAM roles
- Observability: services not yet instrumented
- FinOps: Graviton migration candidates not yet evaluated
- Automation: edge cases in reconciliation system
Why this matters: The person taking over needs to know where the landmines are, not just where the paved roads go.
Linked Source Pages
Every claim in the handover links to its authoritative source page:
- EC2 Infrastructure Risk Assessment
- ECS Security Assessment Reports (per cluster)
- Complete CI/CD Pipeline Setup Guide
- Deployment Architecture Case Study
- SonarQube + CI/CD Integration Standard
- AWS Cost Optimization Initiative
- Cost Optimization Strategy ($3K Reduction Plan)
- Terraform Migration Status Report
- Multi-Region Architecture Decision Record
- ALB Health Check Troubleshooting Guide
What Makes This Different
It's a living document. Not a one-time brain dump. The handover page links to source pages that are independently maintained.
It starts with the emergency. The 60-second quick start is for the 2 AM incident. The domain runbooks are for the Tuesday morning onboarding.
It's honest about gaps. Most handovers present a rosy picture. Mine included a "Known Gaps and Backlog" section because the person taking over needs to know what's fragile.
It separates "what" from "who." Documentation without ownership assignment is just a wiki page. The responsibility matrix ensures every domain has a named human responsible for it.
It includes the "why." Not just "we use SonarQube" but "we use SonarQube because we needed a SAST tool that scales on LOC-based pricing for a growing Java team in a cost-sensitive fintech."
The Template
If you're building your own handover, here's the skeleton:
# DevOps/Cloud Handover — [Your Name]
## TL;DR — Platform in 60 Seconds
[5-bullet emergency orientation]
## Responsibility Matrix
[Domain → Owner → Backup → Escalation triggers]
## Domain Runbooks
1. Platform & Infrastructure
2. CI/CD & Release Engineering
3. Observability & Monitoring
4. Security & Compliance
5. FinOps & Cost Management
6. Infrastructure as Code
## Common Failure Patterns
[Symptom → Check → Fix table]
## Known Gaps and Backlog
[Honest list of what's not done]
## References
[Links to all authoritative source pages]
I built this handover when leaving my role as sole DevOps/Platform Engineer for a fintech payment gateway. The test of a good handover isn't whether it's comprehensive — it's whether the person reading it at 2 AM can keep the platform running. Design for the emergency first, then add depth.