There's a conversation that happens in almost every Cloud Operations team, and it almost never happens in a formal meeting.
It happens in the hallway, in a direct message, or in the Slack thread where someone asks something and someone else answers from memory:
"Hey, what's the naming convention we use for buckets in staging?"
"I think it was environment-team-service… or was it team-environment-service. Ask Jhon — they defined it."
"Jhon isn't on the team anymore."
"Oh. Then check the document in the wiki."
"I checked. There are two different versions and neither has a date."
Welcome to the problem nobody puts on the sprint but everyone lives with every day.
1. Policies are born as documents — and that makes complete sense
When a team defines how it's going to operate its infrastructure, the natural thing is to write it down. A wiki, a documentation repository, a PDF in a shared folder. The tool doesn't matter — the act of writing it does.
At that moment, the document is truth. It reflects real decisions made by people who understood the context. It's useful, it's consulted, it's maintained.
The problem isn't the document. The problem is what happens after.
2. Infra evolves. The repository too. The document, not always.
The cloud isn't static. Providers deprecate services, rename resources, add options that didn't exist before, and remove others that no longer make sense. What was the right way to do something eighteen months ago might today be the wrong way, the expensive way, or simply the way that no longer works.
The IaC code feels it first. A Terraform module that worked perfectly starts throwing deprecation warnings. A resource that was created with a specific block now requires a different configuration. The cloud provider changed its API, and the Terraform provider reflects that change in its next version.
The team updates the code. Opens a PR, reviews it, merges it. The infra keeps working.
But the document that described how that module works — the wiki that explains the architecture, the runbook that details the steps — that one doesn't have a CI/CD pipeline validating it. It has no tests. It has no one assigned as responsible for keeping it synchronized with reality.
And so, silently, without anyone explicitly deciding it, the document starts describing infrastructure that no longer exists.
3. The day-to-day leaves no room to verify
This is where we need to be honest.
We all know we should periodically review whether documentation is still valid. We all know we should have a recurring task in the backlog that says "audit infrastructure documentation" or something similar.
But we also all know what happens to that task.
In the best case, it exists on some board, has a "technical debt" label, and hasn't been touched for months because there's always something more urgent. In the worst case, it was never created because, just as someone was about to create it, an alert came in, a client called, an urgent deploy needed to happen, or the day simply ended.
The problem isn't lack of discipline. It's that verifying documentation has no alert associated with it. Nobody gets a PagerDuty at 3am because the database recovery runbook has steps that no longer apply. That problem only surfaces when someone needs to execute that runbook in a high-stress situation — which is exactly the worst moment to discover it's outdated.
The day-to-day in Cloud Operations is reactive by nature. Preventive documentation maintenance tasks compete with real incidents, and incidents always win.
4. Hallway gossip becomes the word of God
And then something happens that we've all lived but few write about: operational knowledge migrates from documents to people.
Not as a decision. As a natural consequence of asking being faster than reading, and reading being faster than verifying whether what the document says is still true.
"How long does a DNS change take to propagate in this environment?"
The answer isn't in any up-to-date document. It's in the head of someone who measured it six months ago and remembers it with relative accuracy.
"What happens if the backup job fails silently?"
There's a runbook. But the person who wrote it is gone, and whoever is executing it would rather ask the colleague who "knows about that" than read four pages that may or may not reflect the current configuration.
The problem with this dynamic isn't that it's inefficient — sometimes it's the fastest way to resolve something. The problem is that knowledge that lives in people leaves with people. And when the person who "knows about that" changes projects, goes on vacation, or simply isn't available at 2am on a Sunday, the team operates on assumptions nobody can validate.
5. Sometimes the answer is to stop and start over
There's a conclusion the industry is slow to accept because it sounds like defeat, but that teams with real experience recognize as maturity:
Sometimes the document can't be updated. Sometimes you need to write a new one.
Not as an admission of failure. As a recognition that reality changed so much that the existing document generates more confusion than clarity, especially when lessons learned were added to it in the moment, because it mixes what was true with what is true now, and it's no longer easy to tell them apart.
A clean break. New documentation that starts from what actually exists today, not from what existed the last time someone had time to write.
And more important than that: an honest conversation about where that documentation should live so it doesn't go stale again without anyone noticing.
The answer that has convinced me most after years in Cloud Operations isn't a better wiki, a stricter review process, or more team discipline.
It's moving governance from the document to the system.
Policies that live in code — that are validated on every deploy, that the same system deploying infrastructure executes before acting — can't go stale silently. If the policy changes, the code changes. If the code changes, there's a PR, there's a review, there's a record.
The runbook nobody reads fails silently. The policy that lives in the pipeline fails loudly, at the right moment, before the damage happens.
That doesn't solve everything. Narrative documentation — the context, the reasoning behind decisions, the history of why things are the way they are — is still necessary and still human.
But operational rules, guardrails, the conventions that determine what's allowed and what isn't: those deserve to live where the system can execute them, not where someone has to remember to read them.
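To make that a little more concrete, here is a minimal sketch of the bucket-naming question from the opening turned into a check a pipeline could run. It is an illustration under assumptions, not a finished implementation: I'm assuming the convention is environment-team-service (the conversation above never settles the order), and the environment list and the names `check_bucket_name` and `main` are hypothetical.

```python
import re

# Assumption: the agreed convention is <environment>-<team>-<service>.
# The hallway conversation never settles the order; the point is that
# the order is now written in exactly one place that the system executes.
ALLOWED_ENVIRONMENTS = ("dev", "staging", "prod")  # hypothetical values

BUCKET_NAME = re.compile(
    r"^(?P<environment>[a-z0-9]+)-(?P<team>[a-z0-9]+)-(?P<service>[a-z0-9]+)$"
)


def check_bucket_name(name: str) -> list[str]:
    """Return the violations for one bucket name; an empty list means it complies."""
    match = BUCKET_NAME.match(name)
    if match is None:
        return [f"{name!r}: expected <environment>-<team>-<service>"]
    environment = match.group("environment")
    if environment not in ALLOWED_ENVIRONMENTS:
        return [f"{name!r}: unknown environment {environment!r}"]
    return []


def main(bucket_names: list[str]) -> int:
    """Check every name, print the violations, and return a process exit code."""
    violations = []
    for name in bucket_names:
        violations.extend(check_bucket_name(name))
    for violation in violations:
        print(violation)
    # A non-zero code is what makes the pipeline step fail loudly.
    return 1 if violations else 0
```

A pipeline step could call something like `sys.exit(main(sys.argv[1:]))` with the bucket names taken from the planned changes. If the convention changes, this file changes, and that change goes through a PR like any other.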
In the next post I'll show what that looks like in practice — what it actually means to move a policy from a document to a system, and what changes when you do.
If this resonated with something you've lived in your team, share it in the comments. These stories are more common than they appear at conferences.
