Why you should take care of infrastructure drift

#devops #terraform #aws

Bring up infrastructure drift and you often get knowing glances and heated answers. Gaps between what you expect your infrastructure to be and what it actually is are a well-known, widespread issue bothering hundreds of DevOps teams around the globe. Interestingly, though, the exact definition of drift that teams give varies with their context.

Facing consequences ranging from intensive toil to dangerous security threats, many DevOps teams are keenly aware of the issue and actively looking for solutions.

We decided to look more closely into how they deal with it and conducted a study that will be released in the coming weeks. Here is a foretaste of this study, outlining some of the key facts we recorded.

How infrastructure drift is becoming a major issue: a combination of growing volume and automation
The ever-growing number of workloads running in the cloud has gone hand in hand with a similarly growing number of people interacting with infrastructure, each of them working with several environments.

In this context, as Infrastructure as Code (IaC) becomes widely adopted by users with heterogeneous skillsets, IaC codebases grow larger. That mechanically multiplies the daily opportunities to create drift and makes it harder to track.

The various definitions and core causes of infrastructure drift
Drift is a multi-faceted problem, so let’s agree on a definition.

Depending on their context and process (or lack thereof), people tend to stress one specific aspect of drift or another when asked for a definition. Some, for example, relate drift to a lack of consistency between, say, staging and production environments. That is true, but incomplete.

Still, most of the people we talked to define infrastructure drift as an “unwanted delta” between their infrastructure codebase and the actual state of their infrastructure.
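To make the “unwanted delta” concrete, here is a minimal Python sketch that compares a declared state with an observed one. The resource names, the attribute dictionaries, and the `compute_drift` helper are all hypothetical and only for illustration; real tooling would read the IaC state and query provider APIs rather than compare hand-written dictionaries.

```python
from typing import Any, Dict

# Hypothetical representation: resource name -> attributes.
State = Dict[str, Dict[str, Any]]


def compute_drift(declared: State, actual: State) -> Dict[str, Any]:
    """Return resources that are missing, unmanaged, or changed."""
    missing = sorted(set(declared) - set(actual))    # in code, not in the cloud
    unmanaged = sorted(set(actual) - set(declared))  # in the cloud, not in code
    changed = {}
    for name in sorted(set(declared) & set(actual)):
        keys = sorted(set(declared[name]) | set(actual[name]))
        diff = {
            key: (declared[name].get(key), actual[name].get(key))
            for key in keys
            if declared[name].get(key) != actual[name].get(key)
        }
        if diff:
            changed[name] = diff  # attribute -> (declared, actual)
    return {"missing": missing, "unmanaged": unmanaged, "changed": changed}


# Made-up example data, not taken from the study.
declared = {
    "bucket.logs": {"versioning": True, "acl": "private"},
    "sg.web": {"ingress_ports": [443]},
}
actual = {
    "bucket.logs": {"versioning": True, "acl": "public-read"},  # edited by hand
    "iam_user.tmp": {"path": "/"},                              # created by a script
}

drift = compute_drift(declared, actual)
print(drift)
```

The three buckets loosely mirror the situations described in this article: a resource defined in code but absent from the cloud, a resource created outside of code, and a resource whose attributes were changed after it was deployed.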

Drift can be driven by human input, poor configuration, applications making unwanted changes, etc. It has consequences on toil and efficiency, forces teams to put in place strict controls that decrease flexibility, and can have a security impact.

Two of the most common causes of drift are linked to process or workflow issues: manual changes made in a cloud console and never transposed into code, or changes applied to one environment but not propagated to the others. These issues obviously become more complex as the number of environments and team members grows. Some teams have dozens of environments to keep up to date.

More surprising, though, is how widespread what we call API-driven drift is. API-driven drift happens when an application or a script calling a cloud provider API (or a Terraform provider) changes your actual state directly and autonomously. When that happens, your actual state no longer matches your codebase, and you won’t be aware of it until some later apply reveals a problem that might take a couple of hours to fix. It can even originate on the cloud provider’s side: one of the people we talked to told us of the time he suddenly got locked out of Azure because a default machine authentication parameter had been changed from true to false.

The impacts of infrastructure drift
Maintaining a solid and reliable source of knowledge about how your cloud resources are organized is certainly important, but at the end of the day you might still think of infrastructure drift as a mere annoyance that is not worth the effort.

While drift mostly causes additional work, it can also trigger severe security issues. In one of our interviews, a DevOps lead analyzed the problem quite clearly: “every drift event causes uncertainty, a resolution time, and a potential security issue”. Willingly or not, a developer can do a lot with IAM keys and an SDK, especially a junior one. Being able to catch bad decisions quickly and bring the infrastructure back to its expected state is crucial.

There are also more subtle effects of drift, especially on efficiency. To avoid excessive drift, some teams make significant adjustments to their workflows. In some teams, only the DevOps team lead is allowed access to production. In others, where developers are not skilled at IaC, even a small change to an environment has to go through a long and painful ticketing process. In other words, drift causes issues, which lead to rigid, counterproductive processes, which in turn reduce speed and flexibility.

How teams solve drift so far
Whatever a team’s maturity level and the degree of automation of its cloud infrastructure, maintaining reliability and security while moving at developer speed implies tracking down inconsistencies such as infrastructure drift.

The faster drift is detected, the easier it is to remediate, which is why many of the DevOps engineers we interviewed run terraform plan in a cron job.
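As an illustration of that cron-job approach, here is a minimal Python sketch, not a description of any interviewed team’s setup. It relies on the `-detailed-exitcode` flag of `terraform plan`, which exits with 0 when the plan is empty, 1 on error, and 2 when the plan contains changes. The `check_drift` and `exit_code_for` helpers, the status labels, and the injectable `run` parameter are assumptions made for this example.

```python
import subprocess
from typing import Callable

# Exit codes of `terraform plan -detailed-exitcode`:
# 0 = empty plan, 1 = error, 2 = non-empty plan.
PLAN_STATUS = {0: "in_sync", 1: "error", 2: "drift"}

PLAN_COMMAND = ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"]


def check_drift(workdir: str, run: Callable = subprocess.run) -> str:
    """Run a plan in `workdir` and classify the outcome.

    `run` is injectable (hypothetical design choice) so the function
    can be exercised without a real Terraform installation.
    """
    result = run(PLAN_COMMAND, cwd=workdir, capture_output=True, text=True)
    # Any unexpected exit code is treated as an error.
    return PLAN_STATUS.get(result.returncode, "error")


def exit_code_for(status: str) -> int:
    """Non-zero when attention is needed, so a cron wrapper can alert."""
    return 0 if status == "in_sync" else 1
```

A scheduled script could call `check_drift` on each Terraform working directory and exit with `exit_code_for(status)` so that the scheduler reports failures. Note that a non-empty plan also shows up when code has been merged but not yet applied, so a "drift" status from this sketch is a signal to investigate rather than proof of an out-of-band change.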

Most teams explain that the first thing they did to tackle drift was restrict access to production to a few team members. That certainly contains the problem, but it does not address drift in staging environments, since only a minority of teams restrict access to staging. The next step, deploying a full GitOps workflow, prevents some drift but not all of it (API-driven events, for example). It is also hard to follow by the book when there is an urgent production issue to solve and MTTR is key for the business.

Why we think infrastructure drift should be addressed with a better solution
We recently caught an excellent Twitter thread about the various impacts of fast- and slow-moving broken stuff. Infrastructure drift might not be considered a fast-moving broken thing. That is maybe what makes it all the more dangerous, as “sometimes you just can’t get anyone to pay attention to slow-moving broken things because what harm could that iceberg do?”*

Analyzing the solutions that more than 100 teams have deployed against drift showed us how poorly the topic is addressed. Stay tuned if you’d like to see how we intend to bring an open source solution to this issue.

*borrowed from @yvonnezlam