Why We Started Failing CI on Configuration Drift
Most CI pipelines validate code.
Very few validate configuration.
In my experience, a surprising number of production issues aren't logic
bugs. They're configuration drift.
Things like:
- Staging missing a required environment variable
- A secret copied as plaintext
- A renamed key in one environment but not another
- Old keys sticking around after refactors
Everything looks fine. Tests pass. Deployment succeeds. Then something
behaves differently in production and you're debugging environment
settings instead of code.
Runtime validation isn't enough
Libraries that validate required variables at startup are useful. They
make sure something exists before the app boots.
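For reference, a startup check of this kind can be as small as the sketch below. It is not taken from any particular library; the function name `require_env` and the choice to treat empty strings as missing are assumptions made for illustration.

```python
import os


def require_env(names, environ=os.environ):
    """Return the requested variables, or raise if any is unset or empty.

    Hypothetical helper: real validation libraries offer more (types, defaults).
    """
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise RuntimeError(
            "Missing required environment variables: " + ", ".join(missing)
        )
    return {name: environ[name] for name in names}


# Example with an explicit mapping instead of the real process environment.
settings = require_env(
    ["DATABASE_URL", "JWT_SECRET"],
    environ={"DATABASE_URL": "postgres://example", "JWT_SECRET": "example"},
)
print(sorted(settings))
```

A check like this only sees the one environment the process is running in.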
But they don't answer questions like:
- Is staging aligned with production?
- Did someone accidentally downgrade a secret to plaintext?
- Did a required key disappear in just one environment?
- Are there keys that no longer belong anywhere?
That's configuration drift.
And most pipelines don't check for it.
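The alignment question is mostly a comparison of key names across environments. A minimal sketch, assuming each environment can be exported as a plain mapping of key to value (the `diff_keys` helper and the sample data are hypothetical):

```python
def diff_keys(baseline, candidate):
    """Compare key names only; values are expected to differ per environment."""
    base, cand = set(baseline), set(candidate)
    return {
        "missing": sorted(base - cand),  # in the baseline, absent from the candidate
        "extra": sorted(cand - base),    # in the candidate, absent from the baseline
    }


# Mocked data for illustration; values are placeholders.
production = {"DATABASE_URL": "...", "JWT_SECRET": "...", "SENTRY_DSN": "..."}
staging = {"DATABASE_URL": "...", "LEGACY_API_KEY": "..."}

drift = diff_keys(production, staging)
print(drift)
```

Comparing names rather than values keeps the check safe to run in CI, since no secret values need to be printed or stored.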
Treating environment variables like contracts
The approach I started experimenting with was simple: treat environment
variables like a contract.
Define which keys are required.
Define which keys must be secret.
Define what's optional.
Then compare environments before deployment.
A minimal example contract might look like this:
{
  "requiredKeys": ["DATABASE_URL", "JWT_SECRET"],
  "secretKeys": ["DATABASE_URL", "JWT_SECRET"],
  "optionalKeys": ["SENTRY_DSN"],
  "allowedPrefixes": ["FEATURE_"]
}
Before deploying, the pipeline evaluates an environment against this
contract and against a baseline.
If the differences cross a defined severity threshold, the build fails.
Not because the code is broken.
Because the configuration integrity is broken.
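To make the evaluation step concrete, here is a rough sketch of how a contract check with a severity threshold could work. It is not the implementation of any specific tool. The severity assigned to each finding, the `evaluate` and `should_fail` helpers, and the shape of the environment data (key name mapped to a `secret` flag) are all assumptions.

```python
import json

SEVERITY_ORDER = {"low": 1, "medium": 2, "high": 3}

CONTRACT = json.loads("""
{
  "requiredKeys": ["DATABASE_URL", "JWT_SECRET"],
  "secretKeys": ["DATABASE_URL", "JWT_SECRET"],
  "optionalKeys": ["SENTRY_DSN"],
  "allowedPrefixes": ["FEATURE_"]
}
""")


def evaluate(contract, env):
    """env maps key name -> {"secret": bool}. Returns (severity, message) pairs."""
    findings = []
    known = (
        set(contract["requiredKeys"])
        | set(contract["secretKeys"])
        | set(contract["optionalKeys"])
    )
    prefixes = tuple(contract["allowedPrefixes"])
    for key in contract["requiredKeys"]:
        if key not in env:
            findings.append(("high", f"missing required key: {key}"))
    for key in contract["secretKeys"]:
        if key in env and not env[key]["secret"]:
            findings.append(("high", f"secret stored as plaintext: {key}"))
    for key in sorted(env):
        if key not in known and not key.startswith(prefixes):
            findings.append(("medium", f"disallowed key: {key}"))
    return findings


def should_fail(findings, fail_on):
    """True if any finding is at or above the chosen severity threshold."""
    threshold = SEVERITY_ORDER[fail_on]
    return any(SEVERITY_ORDER[sev] >= threshold for sev, _ in findings)


# Mocked staging environment for illustration.
staging = {
    "DATABASE_URL": {"secret": False},
    "FEATURE_NEW_UI": {"secret": False},
    "LEGACY_API_KEY": {"secret": True},
}
findings = evaluate(CONTRACT, staging)
for severity, message in findings:
    print(f"[{severity}] {message}")

# In CI, this value would be passed to sys.exit() to fail the build.
exit_code = 1 if should_fail(findings, "medium") else 0
```

Keeping the threshold separate from the findings lets a team start by reporting everything and fail only on high severity, then tighten the threshold over time.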
What this catches in practice
Here's an example of what a failing check could look like in CI:
$ configstack doctor --fail-on medium
✖ Configuration Drift Detected
Severity: HIGH
Missing required keys:
- JWT_SECRET
Disallowed keys:
- LEGACY_API_KEY
CI failed.
In practice, this surfaces things like:
- Missing required keys such as JWT_SECRET
- Secrets marked incorrectly
- Typos like DATABASEURL instead of DATABASE_URL
- Drift between staging and production
- Keys removed in one environment but still present in another
None of these are syntax errors. But they're still deployment risks.
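The typo case from the list above is worth a closer look, because a misspelled key shows up as one missing key plus one unknown key. One possible way to connect the two is fuzzy matching against the contract's key names. The sketch below uses `difflib.get_close_matches` from the Python standard library; the `suspected_typos` helper and the 0.8 cutoff are assumptions, not how any particular tool does it.

```python
from difflib import get_close_matches


def suspected_typos(env_keys, contract_keys, cutoff=0.8):
    """Map each unknown key to the contract key it most closely resembles."""
    known = set(contract_keys)
    suspects = {}
    for key in env_keys:
        if key in known:
            continue
        match = get_close_matches(key, contract_keys, n=1, cutoff=cutoff)
        if match:
            suspects[key] = match[0]
    return suspects


contract_keys = ["DATABASE_URL", "JWT_SECRET", "SENTRY_DSN"]
env_keys = ["DATABASEURL", "JWT_SECRET", "LEGACY_API_KEY"]

typos = suspected_typos(env_keys, contract_keys)
print(typos)  # {'DATABASEURL': 'DATABASE_URL'}
```

A high cutoff keeps unrelated leftovers such as LEGACY_API_KEY from being reported as typos; those are better handled as disallowed keys.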
Why this matters
Once you have more than one environment and a CI pipeline, relying on
convention alone starts to break down. Drift accumulates quietly until
something subtle fails.
We're generally disciplined about tests and linting. Configuration tends
to be treated as an afterthought.
I put together a small interactive demo using mocked data to illustrate
the idea: demo
I'm still figuring out how common this pain actually is. If you've dealt
with configuration drift in your own CI/CD setup, I'd be interested to
know how you're handling it today.
Runtime validation only?
Vault plus convention?
Custom scripts?
Something else?