
Kevin Mack

Posted on • Originally published at welldocumentednerd.com

Fast and Furious: Configuration Drift

Unlike in the movie Tokyo Drift, “you’re not in control until you’re out of control” is pretty much the worst approach you can take when delivering software.


Don’t get me wrong, I love the movie. But Configuration Drift is the kind of thing that can cripple an organization, be the poison pill that ruins your ability to support high availability for any solution, and increase your operational costs exponentially.

What is configuration drift?

Configuration Drift is the problem that occurs when manual changes are allowed in an environment, causing environments to change in ways that are undocumented.

Stop me if you’ve heard this one before:

  • You deploy code to a dev environment, and everything works fine.
  • You run a battery of tests, automated or otherwise to ensure everything works.
  • You deploy that same code to a test environment.
  • You run a battery of tests, some fail, so you make some changes to the environment, and then things work just as you would expect.
  • You deploy to production, expecting it all to go fine, and instead start seeing errors and issues, losing hours to debugging weird problems.
  • Along the way you find a bunch of environment issues and fix each of them. You get things stable and are finally through everything.

Now honestly, that should sound pretty familiar; we’ve all lived it if I’m being honest. The problem is that this kind of situation causes configuration drift. What I mean by configuration drift is the situation where there is “drift” in the configuration of the environments, such that they have differences that can cause additional problems.

If you look at the above, you will see a pattern of behavior that leads to bigger issues. For example, one of the biggest issues is that the problem actually starts in the lower environments, where there are clearly configuration issues that are just “fixed” for the sake of convenience.
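To make that concrete, here is a minimal sketch (my example, not from the original article) of what detecting drift looks like in principle: compare the configuration you have documented in source control against what an environment actually reports, and flag anything undocumented. The setting names and values below are hypothetical placeholders.

```python
# Minimal drift-detection sketch: compare documented settings against what an
# environment actually reports, and flag any undocumented differences.

def find_drift(documented: dict, live: dict) -> dict:
    """Return every setting where the live environment differs from the docs."""
    drift = {}
    for key in sorted(set(documented) | set(live)):
        doc_value = documented.get(key, "<not documented>")
        live_value = live.get(key, "<not present>")
        if doc_value != live_value:
            drift[key] = (doc_value, live_value)
    return drift


if __name__ == "__main__":
    # Hypothetical values: 'documented' comes from source control,
    # 'live' is what the environment actually has after manual "fixes".
    documented = {"connection_timeout": 30, "retry_enabled": True}
    live = {"connection_timeout": 120, "retry_enabled": True, "debug_logging": True}

    for setting, (doc_value, live_value) in find_drift(documented, live).items():
        print(f"DRIFT in {setting}: documented={doc_value!r}, live={live_value!r}")
```

Anything that shows up in a report like this but not in source control is, by definition, drift.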

What kind of problems does Configuration Drift create?

Ultimately by allowing configuration drift to happen, you are undermining your ability to make processes truly repeatable. You essentially create a situation where certain environments are “golden.”

So this creates a situation where each environment, or even each virtual machine, can’t be trusted to run the pieces of the application.

This problem gets even worse when you consider multi-region deployments as part of each environment. You now have to manage changes across the entire environment, not just one region.

This can cause a lot of problems:

  • Inconsistent service monitoring
  • Increased difficulty debugging
  • Insufficient testing of changes
  • Increased pressure on deployments
  • Eroding user confidence

How does this impact availability?

When you have configuration drift, it undermines the ability to deploy reliably to multiple regions, which means you can’t trust your failover, and you can’t create new environments as needed.

The most important thing to keep in mind is that the core concept behind everything here is that “Services are cattle, not pets…sometimes you have to make hamburgers.”

What can we do to fix it?

So given the above, how do we fix it? There are some things you can do that are process based, and others that are tool based, to resolve this problem. In my experience, the following things are important to realize, and it starts with “admitting there is a problem.” Deadlines will always be aggressive, and demands for releases will always be greater. But you have to take a step back and say: “if we have to change something, it has to be done by script.”

By forcing all changes to go through the pipeline, we can make sure everyone is aware of them, and make sure that the changes are made the same way every time. So you have to force yourself to do that, and it will change the above flow in the following ways (a sketch of what such a scripted change might look like follows the list):

  • You deploy code to a dev environment, and everything works fine.
  • You run a battery of tests, automated or otherwise to ensure everything works.
  • You deploy that same code to a test environment.
  • You run a battery of tests, and some fail, so you make changes to the deployment scripts to get things working.
  • When you change those scripts, they are automatically redeployed to dev, and automated tests are run on the dev environment.
  • The new scripts are run on test and everything works properly.
  • You deploy to production, and everything goes fine.
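To show what “it has to be done by script” might look like in practice, here is a minimal, hypothetical sketch: the fix discovered in test is captured as data, checked into source control, and applied idempotently, so dev, test, and production all end up with exactly the same change. The appsettings.json path and the setting names are placeholders, not from the original article.

```python
import json
from pathlib import Path

# The fix discovered in the test environment, captured as data and checked into
# source control so every environment gets exactly the same change.
REQUIRED_SETTINGS = {
    "ConnectionTimeoutSeconds": 30,
    "EnableRetryPolicy": True,
}


def apply_required_settings(config_path: Path) -> None:
    """Idempotently apply the required settings to an environment's config file."""
    config = json.loads(config_path.read_text()) if config_path.exists() else {}
    changed = False
    for key, desired in REQUIRED_SETTINGS.items():
        if config.get(key) != desired:
            print(f"Updating {key}: {config.get(key)!r} -> {desired!r}")
            config[key] = desired
            changed = True
    if changed:
        config_path.write_text(json.dumps(config, indent=2))
    else:
        print("No changes needed - environment already matches the script.")


if __name__ == "__main__":
    # Hypothetical path; in a pipeline this would point at the target environment.
    apply_required_settings(Path("appsettings.json"))
```

Because the script only touches settings that actually differ, running it against any environment, any number of times, produces the same documented result.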

So ultimately you need to focus on simplifying and automating the changes to the environments. And there are some tools you can look at to help here.

  • Implement CI/CD, and limit access to environments as you move up the stack.
  • Require changes to be scripted as you push to test, stage, or preprod environments.
  • Leverage tools like Chef, Ansible, PowerShell, etc., to script the actions that have to be taken during deployments.
  • Leverage infrastructure as code, via tools like Terraform, to ensure that your environments are the same every time (see the drift-check sketch below).
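As a rough example of how infrastructure as code helps here, Terraform can tell you when an environment no longer matches the code that describes it: `terraform plan -detailed-exitcode` exits with code 2 when changes are pending. The sketch below is mine, not from the article, and assumes a Terraform configuration already exists in a hypothetical ./infrastructure directory; a scheduled pipeline job could run it as a drift check.

```python
import subprocess
import sys


def check_for_drift(working_dir: str) -> int:
    """Run 'terraform plan -detailed-exitcode' and interpret the result.

    Terraform exit codes: 0 = no changes, 1 = error, 2 = changes pending
    (i.e. the environment has drifted from the code that describes it).
    """
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false"],
        cwd=working_dir,
    )
    if result.returncode == 2:
        print("Drift detected: environment no longer matches the Terraform code.")
    elif result.returncode == 0:
        print("No drift: environment matches the Terraform code.")
    else:
        print("Terraform plan failed; check the output above.")
    return result.returncode


if __name__ == "__main__":
    # Hypothetical path to the environment's Terraform configuration.
    sys.exit(check_for_drift("./infrastructure"))
```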

By taking some of the above steps, you can make sure that things are consistent, ultimately limiting access to production to “machines” only.


So ultimately, the summary of this article is that I wanted to call attention to this issue as one that I see plague lots of organizations.
