Rollback Scripts Are Not System State: Why Runtime Truth Comes First in Recovery Work
In operations work, it is easy to treat what exists in the repository, what is written in deployment scripts, or what is declared in a compose file as if it were the current state of the system.
That is a mistake.
What matters is what is actually running right now. I think of this gap as the separation between source truth and runtime truth.
A recent recovery task exposed this problem again. The original goal was simple: redeploy a set of custom plugins into a newer environment. But if you only look at deployment scripts, image tags, and compose definitions, it is easy to conclude that the environment is already aligned. The real questions should come first:
- Which containers are actually running in production right now?
- Are the volumes carrying forward the wrong generation of data?
- Does the plugin merely exist on disk, while remaining disabled in the platform?
- Has the target environment's business data already been overwritten by another environment?
- If a feature is missing in the UI, is the problem in the frontend entry point, a backend switch, or the plugin state itself?
Those are runtime-truth questions.
Why source truth can mislead you
Source truth is still important. It defines what the ideal state should be. But in recovery, rollback, and migration scenarios, the system is often already drifting away from that ideal state.
Here are a few common traps.
1. The code exists, so the feature must exist
No.
If the plugin source exists in the repository, that only proves that the feature was implemented at some point. It does not prove that:
- the plugin was built,
- the build output made it into the correct image,
- that image was deployed,
- the container restarted with the new version,
- the platform actually enabled the plugin,
- or the frontend is exposing the entry point to users.
If any one of those steps breaks, the user still sees the same outcome: the feature is missing.
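The chain above can be sketched as an ordered set of checks that stops at the first broken link. This is a minimal illustration, not a real integration: the step names and the `state` dict are hypothetical stand-ins for probes against your build system, registry, runtime, and UI.

```python
# Ordered delivery-chain steps, from "code exists" to "user can see it".
# Each entry is a hypothetical boolean observation, not a real API call.
PIPELINE = [
    "built",
    "in_image",
    "image_deployed",
    "container_restarted",
    "plugin_enabled",
    "entry_point_exposed",
]

def first_broken_link(state):
    """Return the first pipeline step that is not satisfied, or None."""
    for step in PIPELINE:
        if not state.get(step, False):
            return step
    return None

# Example: the code was built and shipped, but the platform toggle is off.
state = {
    "built": True,
    "in_image": True,
    "image_deployed": True,
    "container_restarted": True,
    "plugin_enabled": False,
    "entry_point_exposed": False,
}
print(first_broken_link(state))  # -> plugin_enabled
```

The point of the ordering is diagnostic: the user-visible symptom ("feature is missing") is identical no matter which link broke, so you have to locate the first failing link rather than re-verify the links you already trust.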
2. The compose file is correct, so production must also be correct
Also no.
Production containers may not have been recreated. Old volumes may still be attached. Environment variables may still come from an earlier release. Sometimes even the service names are correct while the internal process state is not.
`docker compose config` tells you how the system is supposed to start. `docker ps`, `docker inspect`, mount points, and in-platform enablement states tell you how it is actually running.
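One way to make that comparison concrete is to diff the declared state against the observed state. The sketch below assumes you have already parsed both sides into dicts (from `docker compose config` and `docker inspect` output, for example); the service names and fields are illustrative, not a real schema.

```python
# Diff source truth (declared) against runtime truth (observed).
# Both inputs are hypothetical parsed snapshots: {service: {field: value}}.

def runtime_drift(declared, observed):
    """Map each service to the fields where runtime differs from source truth."""
    drift = {}
    for service, want in declared.items():
        have = observed.get(service)
        if have is None:
            # Declared but not running at all.
            drift[service] = {"status": ("running", "missing")}
            continue
        diffs = {k: (v, have.get(k)) for k, v in want.items() if have.get(k) != v}
        if diffs:
            drift[service] = diffs
    return drift

declared = {"web": {"image": "app:2.4", "volume": "data-v2"}}
observed = {"web": {"image": "app:2.4", "volume": "data-v1"}}  # old volume still attached
print(runtime_drift(declared, observed))
```

Notice that the example drift is exactly the trap described above: the image tag matches, so a quick glance says "aligned", while the volume is still carrying forward the previous generation of data.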
3. The rollback script finished, so the system is recovered
This one is especially dangerous.
A successful script only proves that the script completed its own actions. It does not prove that the business state returned to the intended version.
A proper recovery check needs to verify:
- whether critical data is back where it should be,
- whether critical plugins are visible to end users,
- whether a real user path succeeds end to end,
- and whether the boundary between environments is still intact.
Without that, "recovery completed" is just a surface-level status.
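A recovery check along those lines can be expressed as a set of independent business-level probes, where the script's own exit code is deliberately not one of them. The check names mirror the list above; the lambda probes are hypothetical placeholders for real queries.

```python
# Post-rollback verification: every business-level check must pass on
# its own. The probes here are hypothetical stand-ins.

def verify_recovery(checks):
    """Run every check, return (recovered, list_of_failed_check_names)."""
    failed = [name for name, probe in checks.items() if not probe()]
    return (not failed, failed)

checks = {
    "critical_data_restored": lambda: True,
    "plugins_visible_to_users": lambda: False,  # script succeeded, users still see nothing
    "user_path_succeeds": lambda: False,
    "environment_boundary_intact": lambda: True,
}
recovered, failed = verify_recovery(checks)
print(recovered, failed)
```

In this example the rollback script exited 0, yet `recovered` is false, which is precisely the gap between "the script completed its own actions" and "the business state returned to the intended version".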
The order I now prefer for recovery work
When the problem is "we deployed the wrong thing, now recover it," this is the order I recommend.
1. Define what you are protecting first
Before anything else, be explicit about the protected object:
- Are you protecting data?
- Are you protecting plugins/code?
- Are you protecting the identity of the current environment, such as users, agents, and settings?
A lot of recovery incidents are not caused by a lack of technical skill. They happen because the protected object was never clearly defined. You think you are restoring plugins, but you touch business data. You think you are syncing code, but you overwrite a live environment.
2. Inspect runtime before you inspect the repo
The investigation order should be:
- running processes and containers,
- volumes and bind mounts,
- in-platform enablement state,
- user-visible entry points,
- and only then the repository, scripts, and images.
That order prevents the classic illusion of "but the code is clearly there."
3. Validate the user path, not the engineer path
Engineers often comfort themselves with checks like these:
- the file exists,
- the API returns 200,
- the container is running,
- the logs show no errors.
That is not enough.
The useful validation is to walk the user path:
- Can the user see the entry point?
- Can they open it?
- Can they complete the core action?
- Is the result correct?
If the user path is broken, the system is not recovered.
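The user-path check differs from the engineer-side checks in one important way: it is ordered and fail-fast, because each step only makes sense if the previous one worked. A minimal sketch, with hypothetical probe names standing in for real UI or end-to-end tests:

```python
# Walk the user path in order and stop at the first broken step,
# instead of sampling engineer-side signals like "API returns 200".

USER_PATH = ["see_entry_point", "open_feature", "complete_core_action", "result_correct"]

def walk_user_path(probes):
    """Return (recovered, first_failed_step_or_None)."""
    for step in USER_PATH:
        if not probes[step]():
            return False, step
    return True, None

probes = {
    "see_entry_point": lambda: True,        # entry is rendered
    "open_feature": lambda: True,           # the page loads
    "complete_core_action": lambda: False,  # the core action still fails
    "result_correct": lambda: True,
}
print(walk_user_path(probes))  # -> (False, 'complete_core_action')
```

Here the container is up, the logs are clean, and the first two steps pass, yet the system is still not recovered, because the user cannot complete the core action.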
4. Treat partial recovery as a real state, not as a useless failure
In real automation flows, partial success is normal.
An article may publish successfully on three platforms while a fourth fails. A plugin may be redeployed while a token failure blocks one sync step. Code may be deployed while a platform toggle is still off.
The worst response is to label the whole thing as "failed" and preserve no useful state.
A more practical approach is to:
- record what succeeded,
- isolate what failed,
- preserve retryable state,
- and only rerun the failed parts later.
Recovery work, like publishing work, is usually not binary. It converges in stages.
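The record-isolate-retry loop can be sketched as a small batch runner that keeps per-target state between passes. The target names and the `publish` failure are hypothetical; the pattern is the point.

```python
# Preserve partial success: record per-target outcomes, isolate failures,
# and on a later pass retry only the targets that are not yet done.
import json

def run_batch(targets, action, state=None):
    """Run action per target, skipping targets already marked ok."""
    state = dict(state or {})
    for t in targets:
        if state.get(t) == "ok":
            continue  # succeeded in an earlier pass, do not redo
        try:
            action(t)
            state[t] = "ok"
        except Exception as e:
            state[t] = f"failed: {e}"  # isolate; do not abort the batch
    return state

def publish(target):
    # Hypothetical failure: one platform's token has expired.
    if target == "platform_b":
        raise RuntimeError("token expired")

targets = ["platform_a", "platform_b", "platform_c"]
state = run_batch(targets, publish)
print(json.dumps(state))
```

After fixing the token, the same call with the saved state reruns only `platform_b`; the two platforms that already succeeded are never touched again, which is exactly the "converges in stages" behavior.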
A simple checklist that catches a lot of mistakes
Before doing recovery or rollback work, I now ask these six questions:
- Am I protecting code, data, or environment identity?
- Am I looking at source truth or runtime truth?
- What are the actual states of containers, volumes, environment variables, and platform switches?
- Can users see and use the target feature?
- Which parts are genuinely successful, and which parts only look successful?
- If something fails, did I preserve retry information or force myself to start over?
These are simple questions, but they block a surprising number of avoidable recovery mistakes.
Conclusion
The most dangerous thing in rollback and recovery work is not an error message. It is the absence of one.
A script finishing successfully, containers running, and repository code looking correct do not prove that the system is actually correct. The thing you should trust first is runtime truth: what is running in production right now, and what users can actually use right now.
If you do not verify that first, your "recovery" may only be moving the problem into a less visible place.