Why We Started Failing CI on Configuration Drift

#testing #automation #devops #cicd

Most CI pipelines validate code.

Very few validate configuration.

In my experience, a surprising number of production issues aren't logic
bugs. They're configuration drift.

Things like:

Staging missing a required environment variable
A secret copied as plaintext
A renamed key in one environment but not another
Old keys sticking around after refactors

Everything looks fine. Tests pass. Deployment succeeds. Then something
behaves differently in production and you're debugging environment
settings instead of code.

Runtime validation isn't enough

Libraries that validate required variables at startup are useful. They
make sure every required variable is set before the app boots.

But they don't answer questions like:

Is staging aligned with production?
Did someone accidentally downgrade a secret to plaintext?
Did a required key disappear in just one environment?
Are there keys that no longer belong anywhere?

That's configuration drift.

And most pipelines don't check for it.
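
At its simplest, checking for drift is a set comparison over key names.
Here's a minimal Python sketch of that idea. The function name
`diff_keys` and the example key sets are made up for illustration; a real
pipeline would load the key names from wherever each environment's
configuration lives.

```python
from __future__ import annotations


def diff_keys(baseline: set[str], candidate: set[str]) -> dict[str, list[str]]:
    """Compare the key names of two environments. Values are never read."""
    return {
        # Keys the baseline has but the candidate lacks.
        "missing": sorted(baseline - candidate),
        # Keys the candidate has but the baseline does not.
        "extra": sorted(candidate - baseline),
    }


# Hypothetical key sets for two environments.
production = {"DATABASE_URL", "JWT_SECRET", "SENTRY_DSN"}
staging = {"DATABASE_URL", "SENTRY_DSN", "LEGACY_API_KEY"}

drift = diff_keys(production, staging)
print(drift)
```

Comparing only key names keeps secret values out of the check entirely,
which also makes it safer to run in CI logs.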

Treating environment variables like contracts

The approach I started experimenting with was simple: treat environment
variables like a contract.

Define which keys are required.
Define which keys must be secret.
Define what's optional.
Then compare environments before deployment.

A minimal example contract might look like this:

{
  "requiredKeys": ["DATABASE_URL", "JWT_SECRET"],
  "secretKeys": ["DATABASE_URL", "JWT_SECRET"],
  "optionalKeys": ["SENTRY_DSN"],
  "allowedPrefixes": ["FEATURE_"]
}

Before deploying, the pipeline evaluates an environment against this
contract and against a baseline environment, such as production.

If the differences cross a defined severity threshold, the build fails.

Not because the code is broken.
Because the configuration integrity is broken.
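
To make that concrete, here's a rough Python sketch of a contract check
with a severity threshold. It isn't how any particular tool implements
it: the function names, the severity assigned to each finding, and the
`plaintext_keys` input are assumptions for illustration. I'm also
assuming that any key outside the contract and the allowed prefixes
counts as disallowed.

```python
import json

# Assumed ordering of severities, lowest to highest.
SEVERITY_ORDER = {"low": 0, "medium": 1, "high": 2}

CONTRACT = json.loads("""
{
  "requiredKeys": ["DATABASE_URL", "JWT_SECRET"],
  "secretKeys": ["DATABASE_URL", "JWT_SECRET"],
  "optionalKeys": ["SENTRY_DSN"],
  "allowedPrefixes": ["FEATURE_"]
}
""")


def evaluate(contract, present_keys, plaintext_keys):
    """Return (severity, message) findings for one environment's key names."""
    findings = []
    known = (
        set(contract["requiredKeys"])
        | set(contract["secretKeys"])
        | set(contract["optionalKeys"])
    )
    prefixes = tuple(contract["allowedPrefixes"])

    for key in contract["requiredKeys"]:
        if key not in present_keys:
            findings.append(("high", f"missing required key: {key}"))
    for key in contract["secretKeys"]:
        if key in plaintext_keys:
            findings.append(("high", f"secret stored as plaintext: {key}"))
    for key in sorted(present_keys):
        # Anything not named in the contract and not under an allowed prefix.
        if key not in known and not key.startswith(prefixes):
            findings.append(("medium", f"disallowed key: {key}"))
    return findings


def should_fail(findings, threshold):
    """True if any finding is at or above the threshold severity."""
    limit = SEVERITY_ORDER[threshold]
    return any(SEVERITY_ORDER[severity] >= limit for severity, _ in findings)


# Hypothetical staging environment: key names only, no values.
staging_keys = {"DATABASE_URL", "FEATURE_NEW_UI", "LEGACY_API_KEY"}
findings = evaluate(CONTRACT, staging_keys, plaintext_keys=set())
for severity, message in findings:
    print(f"[{severity}] {message}")

# In CI, this value would be passed to sys.exit() to fail the build.
exit_code = 1 if should_fail(findings, "medium") else 0
```

Keeping the threshold as a parameter means the same check can warn on a
feature branch and block on a release branch.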

What this catches in practice

Here's an example of what a failing check could look like in CI:

$ configstack doctor --fail-on medium

✖ Configuration Drift Detected

Severity: HIGH

Missing required keys:
  - JWT_SECRET

Disallowed keys:
  - LEGACY_API_KEY

CI failed.

In practice, this surfaces things like:

Missing required keys such as JWT_SECRET
Secrets that are marked or stored as plaintext
Typos like DATABASEURL instead of DATABASE_URL
Drift between staging and production
Keys removed in one environment but still present in another

None of these are syntax errors. But they're still deployment risks.
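
Typos are worth a special mention because a misspelled key shows up as
two findings at once: a missing required key and an unexpected extra
one. One way to connect the two is fuzzy matching with Python's standard
`difflib` module. This is only a sketch; `suggest_typos` and the 0.85
cutoff are my own illustrative choices.

```python
import difflib


def suggest_typos(unknown_keys, contract_keys, cutoff=0.85):
    """Map each unknown key to the contract key it most likely misspells."""
    suggestions = {}
    for key in unknown_keys:
        # Return at most one contract key whose similarity is >= cutoff.
        matches = difflib.get_close_matches(key, contract_keys, n=1, cutoff=cutoff)
        if matches:
            suggestions[key] = matches[0]
    return suggestions


suggestions = suggest_typos(
    ["DATABASEURL", "LEGACY_API_KEY"],
    ["DATABASE_URL", "JWT_SECRET", "SENTRY_DSN"],
)
print(suggestions)
```

A fairly high cutoff keeps genuinely unrelated keys such as
LEGACY_API_KEY from being reported as misspellings.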

Why this matters

Once you have more than one environment and a CI pipeline, relying on
convention alone starts to break down. Drift accumulates quietly until
something subtle fails.

We're generally disciplined about tests and linting. Configuration tends
to be treated as an afterthought.

I put together a small interactive demo using mocked data to illustrate
the idea: demo

I'm still figuring out how common this pain actually is. If you've dealt
with configuration drift in your own CI/CD setup, I'd be interested to
know how you're handling it today.

Runtime validation only?
Vault plus convention?
Custom scripts?
Something else?