In the previous articles, we explored what happens when things go wrong in a global system.
I followed an incident from the first signal, through investigation, to resolution and learning.
That story is familiar to anyone who has worked with production systems. What is less familiar and more difficult is what comes next:
How do you make this way of working survive growth, turnover, and time?
This is not a question about tooling or individual skill.
It is a question about how an organisation behaves when pressure is applied.
After enough years in this industry, failures stop being surprising. What continues to cause damage, even in experienced organisations, is the difference in how teams behave. One team responds to an incident calmly, follows signals, and converges on a fix.
Another team escalates late and debates basic facts.
Most organisations start with sensible ideas:
- Structured logs
- Correlation IDs
- Post-mortems
- Dashboards
- Runbooks
All of these help, but as people rotate between teams or leave, the practice becomes inconsistent and turns into "how we used to do things".
The alternative is to stop hoping that teams behave a certain way during incidents and have the system itself guarantee certain properties:
- Requests can be followed end-to-end
- Meaningful business actions leave an audit trail
- Signals are consistent enough to be trusted
- Incidents start from a known place rather than polite guessing
What matters is not the specific tools involved, but the fact that these guarantees hold every time, regardless of who is on call.
Every organisation starts out the other way, relying on someone who knows exactly where to look and remembers the details. Over time, this becomes a liability.
Sustainable systems remove dangerous choices
In the previous article, I argued that incidents should not be treated as problems to be closed, but as events that must result in concrete changes in behaviour, expectations, or system design.
Closing the series
This series started with a simple idea:
Trust is the architecture
Not trust in individuals, intentions, or processes, but trust in systems that behave predictably when things go wrong.
At a global scale, speed does not come from removing controls; it comes from removing ambiguity.
That is what makes systems sustainable.