DEV Community

Jckilous

Why We Started Failing CI on Configuration Drift

Most CI pipelines validate code.

Very few validate configuration.

In my experience, a surprising number of production issues aren't logic
bugs. They're configuration drift.

Things like:

  • Staging missing a required environment variable
  • A secret copied as plaintext
  • A renamed key in one environment but not another
  • Old keys sticking around after refactors

Everything looks fine. Tests pass. Deployment succeeds. Then something
behaves differently in production and you're debugging environment
settings instead of code.

Runtime validation isn't enough

Libraries that validate required variables at startup are useful. They
make sure each required variable exists before the app boots.
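
For reference, a startup check of this kind can be sketched in a few lines. This is a minimal illustration, not any particular library's API: the function names and the `REQUIRED_KEYS` list are assumptions made for the example.

```python
import os

# Hypothetical list of variables the app cannot boot without.
REQUIRED_KEYS = ["DATABASE_URL", "JWT_SECRET"]


def missing_required(environ, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty in the mapping."""
    return [key for key in required if not environ.get(key)]


def validate_at_startup(environ=os.environ):
    """Raise before the app boots if any required variable is missing."""
    missing = missing_required(environ)
    if missing:
        raise RuntimeError(
            "Missing required environment variables: " + ", ".join(missing)
        )
```

A check like this only sees the one environment it runs in, which is why it cannot say anything about how environments relate to each other.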

But they don't answer questions like:

  • Is staging aligned with production?
  • Did someone accidentally downgrade a secret to plaintext?
  • Did a required key disappear in just one environment?
  • Are there keys that no longer belong anywhere?

That's configuration drift.

And most pipelines don't check for it.
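
To make the idea concrete, here is a rough sketch of a drift check between two environments. It compares key names only and ignores values. The function name, the result keys, and the sample environments are all made up for illustration.

```python
def diff_environments(baseline, candidate):
    """Compare the key sets of two environments (values are ignored)."""
    base_keys, cand_keys = set(baseline), set(candidate)
    return {
        # Keys the baseline has but the candidate lacks.
        "missing_in_candidate": sorted(base_keys - cand_keys),
        # Keys the candidate has that the baseline does not.
        "extra_in_candidate": sorted(cand_keys - base_keys),
    }


# Hypothetical environments; values are placeholders.
production = {"DATABASE_URL": "...", "JWT_SECRET": "...", "SENTRY_DSN": "..."}
staging = {"DATABASE_URL": "...", "LEGACY_API_KEY": "..."}

drift = diff_environments(production, staging)
print(drift)
```

Here production is treated as the baseline, so anything staging lacks or adds shows up as drift.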

Treating environment variables like contracts

The approach I started experimenting with was simple: treat environment
variables like a contract.

Define which keys are required.
Define which keys must be secret.
Define what's optional.
Then compare environments before deployment.

A minimal example contract might look like this:

{
  "requiredKeys": ["DATABASE_URL", "JWT_SECRET"],
  "secretKeys": ["DATABASE_URL", "JWT_SECRET"],
  "optionalKeys": ["SENTRY_DSN"],
  "allowedPrefixes": ["FEATURE_"]
}

Before deploying, the pipeline evaluates an environment against this
contract and against a baseline.

If the differences cross a defined severity threshold, the build fails.

Not because the code is broken.
Because the configuration integrity is broken.
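
One way such an evaluation could work is sketched below, using the example contract from above. This is not how any specific tool implements it. The severity levels, the `Finding` type, and the representation of an environment as a mapping from key to "is stored as a secret" are assumptions for the sake of the example.

```python
import json
from dataclasses import dataclass

# Hypothetical severity scale; higher numbers are worse.
SEVERITY_ORDER = {"low": 1, "medium": 2, "high": 3}

CONTRACT = json.loads("""
{
  "requiredKeys": ["DATABASE_URL", "JWT_SECRET"],
  "secretKeys": ["DATABASE_URL", "JWT_SECRET"],
  "optionalKeys": ["SENTRY_DSN"],
  "allowedPrefixes": ["FEATURE_"]
}
""")


@dataclass(frozen=True)
class Finding:
    severity: str
    message: str


def evaluate(contract, env):
    """Check env against the contract.

    env maps each key to True if it is stored as a secret,
    or False if it is stored as plaintext (an assumed representation).
    """
    findings = []
    for key in contract["requiredKeys"]:
        if key not in env:
            findings.append(Finding("high", f"missing required key: {key}"))
    for key in contract["secretKeys"]:
        if key in env and not env[key]:
            findings.append(Finding("high", f"secret stored as plaintext: {key}"))
    known = (
        set(contract["requiredKeys"])
        | set(contract["secretKeys"])
        | set(contract["optionalKeys"])
    )
    prefixes = tuple(contract["allowedPrefixes"])
    for key in sorted(env):
        if key not in known and not key.startswith(prefixes):
            findings.append(Finding("medium", f"disallowed key: {key}"))
    return findings


def should_fail(findings, fail_on):
    """True if any finding is at or above the fail_on severity."""
    threshold = SEVERITY_ORDER[fail_on]
    return any(SEVERITY_ORDER[f.severity] >= threshold for f in findings)


staging = {"DATABASE_URL": True, "LEGACY_API_KEY": False, "FEATURE_NEW_UI": False}
findings = evaluate(CONTRACT, staging)
for finding in findings:
    print(f"[{finding.severity.upper()}] {finding.message}")
failed = should_fail(findings, "medium")
```

In a pipeline, the script would exit non-zero when `failed` is true, which is what turns a drift report into a failed build.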

What this catches in practice

Here's an example of what a failing check could look like in CI:

$ configstack doctor --fail-on medium

✖ Configuration Drift Detected

Severity: HIGH

Missing required keys:
  - JWT_SECRET

Disallowed keys:
  - LEGACY_API_KEY

CI failed.

In practice, this surfaces things like:

  • Missing required keys such as JWT_SECRET
  • Secret keys stored or flagged as plaintext
  • Typos like DATABASEURL instead of DATABASE_URL
  • Drift between staging and production
  • Keys removed in one environment but still present in another

None of these are syntax errors. But they're still deployment risks.
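
Typos are the least obvious item on that list, so here is one possible way to flag them: compare each unknown key against the contract's keys with fuzzy matching from Python's standard `difflib` module. The function name and the `0.8` cutoff are arbitrary choices for this sketch, not part of any tool.

```python
import difflib


def suggest_typos(env_keys, contract_keys, cutoff=0.8):
    """Map each unknown key to the contract key it most closely resembles."""
    known = set(contract_keys)
    suggestions = {}
    for key in env_keys:
        if key in known:
            continue  # exact match, nothing to suggest
        close = difflib.get_close_matches(key, contract_keys, n=1, cutoff=cutoff)
        if close:
            suggestions[key] = close[0]
    return suggestions


suggestions = suggest_typos(
    ["DATABASEURL", "JWT_SECRET", "LEGACY_API_KEY"],
    ["DATABASE_URL", "JWT_SECRET", "SENTRY_DSN"],
)
print(suggestions)
```

A key that resembles nothing in the contract, like `LEGACY_API_KEY` here, gets no suggestion and would be reported as a plain disallowed key instead.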

Why this matters

Once you have more than one environment and a CI pipeline, relying on
convention alone starts to break down. Drift accumulates quietly until
something subtle fails.

We're generally disciplined about tests and linting. Configuration tends
to be treated as an afterthought.

I put together a small interactive demo using mocked data to illustrate
the idea: demo

I'm still figuring out how common this pain actually is. If you've dealt
with configuration drift in your own CI/CD setup, I'd be interested to
know how you're handling it today.

Runtime validation only?
Vault plus convention?
Custom scripts?
Something else?
