
The Configuration Drift Discovery During a Drill

Rack2Cloud | Field Notes

Quarterly recovery drill. Backup job green for four months. Restore executes cleanly — data intact, VM boots, database service starts. The application fails on the first transaction.

Three hours disappear into backup triage before anyone checks the environment.

The backup was not the problem. It never was. This is recovery configuration drift — and it belongs to the same class of data protection failure where the protection plane reports success and the recovery plane produces nothing useful.

[Image: recovery configuration drift — the Recovery Drift Gap between backup consistency and environment state]

What the Drill Was Testing (and What It Wasn't)

The drill validated backup integrity and restore mechanics — the protection plane. It did not validate environment state — the recovery plane. Four months of silent drift had accumulated between the backup capture point and the recovery target. The backup job reported green throughout. Nothing surfaced the gap until the restore ran into an environment that had moved on.

Three drift vectors accumulated silently over those four months. Each individually invisible. Each fatal at the application layer on restore.

01 — Service Endpoint Changed — An internal service endpoint was updated — manual change, never committed back to IaC. The recovery config still pointed at the old dependency. The restored application reached out to an endpoint that no longer existed in that form. First transaction failed. The backup had nothing to do with it.
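
Catching this before the drill is mechanical. The sketch below assumes the recovery runbook keeps its expected service endpoints in a small YAML file; the file name, keys, hosts, and ports are placeholders, not details from the actual environment.

```python
# pre_drill_endpoint_check.py -- sketch only; runbook layout and endpoint names are hypothetical
import socket
import yaml  # pip install pyyaml

def endpoint_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def check_runbook_endpoints(runbook_path: str) -> list[str]:
    """Compare the endpoints the recovery runbook expects against what currently answers."""
    with open(runbook_path) as f:
        runbook = yaml.safe_load(f)

    failures = []
    for name, ep in runbook.get("service_endpoints", {}).items():
        if not endpoint_reachable(ep["host"], ep["port"]):
            failures.append(f"{name}: {ep['host']}:{ep['port']} not reachable from recovery network")
    return failures

if __name__ == "__main__":
    for failure in check_runbook_endpoints("recovery-runbook.yaml"):
        print("DRIFT:", failure)
```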

02 — Internal CA Trust Path Rotated — The internal CA trust path had been rotated by an automated process. The config reference was never updated. The restored application could no longer validate upstream auth — certificate chain validation failed silently on the first authenticated call. The backup was application-consistent at the point it was taken. The trust chain it assumed no longer existed.
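
This one is also checkable ahead of time. A sketch, assuming the recovery config points at a CA bundle on disk (the bundle path and upstream hostname below are placeholders): open a TLS connection to the upstream dependency using exactly that bundle and see whether chain validation still succeeds.

```python
# pre_drill_trust_check.py -- sketch; the CA bundle path and upstream host are placeholders
import socket
import ssl

def chain_validates(upstream_host: str, port: int, ca_bundle_path: str) -> bool:
    """Attempt a TLS handshake against the upstream using the CA bundle the
    recovery config references. A rotated internal CA surfaces here as a
    verification error, not as a backup problem. Plain connection failures
    raise and are a separate issue."""
    context = ssl.create_default_context(cafile=ca_bundle_path)
    try:
        with socket.create_connection((upstream_host, port), timeout=5) as sock:
            with context.wrap_socket(sock, server_hostname=upstream_host):
                return True
    except ssl.SSLCertVerificationError:
        return False

if __name__ == "__main__":
    if not chain_validates("auth.internal.example", 443, "/etc/recovery/ca-bundle.pem"):
        print("DRIFT: trust store in recovery config no longer validates upstream auth")
```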

03 — East-West Network Policy Tightened — A security team policy change had tightened east-west traffic rules three months prior. The recovery runbook was never updated to reflect it. The restored application attempted to reach a service dependency on first transaction and hit a policy block. The change was correct. The recovery documentation was not.
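
A policy delta like this can be diffed before the drill rather than discovered on first transaction. A sketch, assuming an AWS environment where the security-group rules the runbook relies on were snapshotted the last time it was validated; the group ID, snapshot file, and configured boto3 credentials are assumptions, not details from the incident.

```python
# pre_drill_policy_diff.py -- sketch; assumes AWS, boto3 credentials configured, and a
# JSON snapshot of the rules taken when the runbook was last validated
import json
import boto3

def current_ingress_rules(group_id: str) -> list:
    """Fetch the live ingress rules for a security group the restored app depends on."""
    ec2 = boto3.client("ec2")
    resp = ec2.describe_security_groups(GroupIds=[group_id])
    return resp["SecurityGroups"][0]["IpPermissions"]

def policy_drift(group_id: str, snapshot_path: str) -> bool:
    """True if the live rules differ from the rules the recovery runbook assumes.
    Order-sensitive comparison; a real check would normalise rules first."""
    with open(snapshot_path) as f:
        assumed = json.load(f)
    return current_ingress_rules(group_id) != assumed

if __name__ == "__main__":
    # the group ID and snapshot file are placeholders
    if policy_drift("sg-0123456789abcdef0", "runbook-sg-snapshot.json"):
        print("DRIFT: east-west policy has changed since the runbook was last validated")
```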

None of these were flagged as backup problems. All three were invisible until the application ran in a recovery environment that had accumulated three separate departures from the state the backup assumed.

The Recovery Drift Gap

This was a Recovery Drift Gap.

The backup was consistent. The recovery environment was not. The Consistency Boundary is where those two states diverged — the line between what the backup knows and what the environment currently is. Every untracked manual change widens it. Every automated process that updates configuration without committing it back widens it. Every security policy change that doesn't propagate to the recovery runbook widens it. The backup job reports green throughout. Nothing surfaces the gap until the restore runs into an environment that has moved on.

The gap does not announce itself. It waits for the drill — or worse, for the incident.

[Image: recovery configuration drift — the Consistency Boundary widening across four months of untracked environment changes]

What a Pre-Drill Diff Actually Checks

The standard framing — "run drift detection before a drill" — is correct but not actionable. A pre-drill diff that is actually useful checks six specific things:

  • Current service endpoints — do recovery runbook references match what the environment is currently serving?
  • Cert chain / trust store paths — has any CA rotation, cert renewal, or trust store update occurred since the last validated drill?
  • Security group / ACL / east-west policy deltas — have any policy changes been applied that would block traffic paths the restored application depends on?
  • DNS dependencies — do internal resolution paths the restored app will traverse still resolve to the correct targets?
  • Secret path references — vault paths, key references, rotation state — does the restored application's config point at secrets that still exist at those paths?
  • Mount / storage path assumptions — anything the application opens on first transaction — are those paths still valid in the recovery environment?

This is not a full environment audit. It is a pre-flight against the specific failure modes that kill restores in well-maintained environments — the ones that have nothing to do with whether the backup completed successfully. A sketch of that pre-flight follows.
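
The sketch covers the checklist items not already sketched above: DNS targets and mount paths, with secret paths left as a comment. Every hostname, IP, and path below is a placeholder.

```python
# pre_drill_preflight.py -- sketch of remaining checklist items: DNS resolution and
# mount/storage paths. All names below are placeholders.
import os
import socket

def dns_targets_resolve(expected: dict[str, str]) -> list[str]:
    """Check that internal names the restored app will traverse still resolve
    to the targets the runbook expects."""
    drift = []
    for name, expected_ip in expected.items():
        try:
            actual = socket.gethostbyname(name)
        except socket.gaierror:
            drift.append(f"{name}: no longer resolves")
            continue
        if actual != expected_ip:
            drift.append(f"{name}: resolves to {actual}, runbook expects {expected_ip}")
    return drift

def missing_paths(paths: list[str]) -> list[str]:
    """Check mount / storage paths the application opens on first transaction."""
    return [p for p in paths if not os.path.exists(p)]

if __name__ == "__main__":
    report = []
    report += dns_targets_resolve({"db.internal.example": "10.20.30.40"})
    report += [f"missing path: {p}" for p in missing_paths(["/mnt/app-data", "/var/run/app.sock"])]
    # secret path checks would read the exact vault/key paths named in the restored
    # application's config against whatever secret store is in use -- omitted here
    for line in report:
        print("DRIFT:", line)
    if not report:
        print("pre-flight clean: no drift detected on checked items")
```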

Sovereign Drift Auditor — surfaces delta between declared IaC state and live environment; the starting point for any pre-drill diff.
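
If Terraform is what declares the environment, the declared-versus-live delta can be approximated without dedicated tooling: terraform plan -detailed-exitcode exits 0 when live state matches the declaration, 2 when it has drifted, and 1 on error. A sketch of wiring that into a pre-drill gate; the working directory is a placeholder.

```python
# pre_drill_iac_gate.py -- sketch; wraps `terraform plan -detailed-exitcode`,
# which exits 0 for no changes, 2 for drift, 1 for errors
import subprocess
import sys

def iac_drift_present(working_dir: str) -> bool:
    """Return True if the live environment has drifted from the declared IaC state."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false"],
        cwd=working_dir,
        capture_output=True,
        text=True,
    )
    if result.returncode == 1:
        raise RuntimeError(f"terraform plan failed: {result.stderr}")
    return result.returncode == 2

if __name__ == "__main__":
    if iac_drift_present("./recovery-environment"):
        print("DRIFT: declared IaC state does not match the live environment")
        sys.exit(2)
```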

The Transferable Principle

The backup is not the unit of recovery. The backup plus the environment configuration plus the dependency graph is the unit of recovery. A drill that only validates backup integrity has validated one third of that unit.

Backup integrity tells you the restore can start. It tells you nothing about whether the application can recover.

Diagnostic: "When did you last verify the recovery target environment matches the configuration state of what you're restoring?"

This post is part of the Field Notes series. FN-01 covered the Declaration Gap in DNS failover — every layer behaving correctly while the system failed operationally. The Recovery Drift Gap is the same failure class at a different layer.

The backup restored exactly as expected. The environment didn't. Only one of them was being tested.


Originally published at rack2cloud.com
