Derek Morgan for AWS Community Builders

Posted on • Updated on • Originally published at morethancertified.com

Infrastructure Drift in the Cloud

What is Configuration Drift?

If you’ve spent time with infrastructure-as-code, you’ve probably heard about configuration drift. For those new to the topic, configuration drift is when your actual infrastructure deviates from the desired infrastructure defined in your code. You usually define this infrastructure using tools such as HashiCorp Terraform, OpenTofu, AWS CloudFormation, Azure’s ARM/Bicep, Ansible, and others. Drift happens when an engineer or a process changes something outside that code. Think of manually adding a security rule from the console to allow access to a server for an emergency patch at 0300.
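To make the idea concrete, here is a minimal Python sketch of what "drift" means at the attribute level. It is only an illustration: the `find_drift` helper and the two dictionaries are hypothetical and do not reflect how any particular IaC tool stores or compares its data.

```python
from typing import Any, Dict, Tuple


def find_drift(
    desired: Dict[str, Any], actual: Dict[str, Any]
) -> Dict[str, Tuple[Any, Any]]:
    """Return {attribute: (desired_value, actual_value)} for every mismatch."""
    drift = {}
    # Check every attribute that appears on either side.
    for key in sorted(desired.keys() | actual.keys()):
        want = desired.get(key)
        have = actual.get(key)
        if want != have:
            drift[key] = (want, have)
    return drift


# Hypothetical data: what the code says vs. what is actually deployed.
desired = {"instance_type": "t3.micro", "ingress_ports": [443]}
actual = {"instance_type": "t3.micro", "ingress_ports": [22, 443]}  # port 22 added by hand

print(find_drift(desired, actual))
# prints {'ingress_ports': ([443], [22, 443])}
```

The manually opened port 22 shows up as the only difference, which is exactly the kind of change the 0300 emergency patch would leave behind.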

Obviously, you can just lock all engineers out of the console, but honestly, things (sic) happen, and stuff (sic) goes down! Outages, security breaches, and surprise updates can lead to a need for immediate changes. When the clock is ticking, someone must take action, possibly without time to read your (obviously well-commented) infrastructure code.

Before I explain how to prevent and mitigate drift, I want to remind you that configuration drift is not always bad! In the above example, this emergency patch may prevent a 0-day exploit launched by a bad actor who has a habit of attacking companies in your industry. A significant security risk is obviously more important than keeping your infrastructure code clean. After the emergency has subsided, someone must either add this rule to the infrastructure code or remove it from the environment entirely.

Prevention, Detection, and Mitigation Methods

Luckily, there are several guardrails you can put into place to help prevent drift or at least mitigate it efficiently when it occurs. The following are some essential strategies and tools to prevent, detect, and mitigate infrastructure drift.

Have a Strict Change-Control Policy

A change-control policy should be the bedrock for any company that deals with changes, which is, you know, all companies. The complexity of the policy can vary with the scale of the organization and the possible impact of a change, but be careful: overcomplicating change management can lead to a loss of productivity and agility. A fantastic example is the Change Management Policy doc provided by GitLab: https://handbook.gitlab.com/handbook/security/change-management-policy/
This document is a great template to get you started on your own change management policy.

Keep All IaC and Config Files in Version Control

Hopefully, you keep all infrastructure files in a VCS, hosted on a platform such as GitHub or GitLab. Your organization should have a well-documented process for modifying these files, such as PR and comment policies, branch guidelines, permissions, approval policies, etc. Great ways to ensure a smooth VCS strategy include tagging commits properly, having policy frameworks statically scan the code for changes to mission-critical resources and require approvals accordingly, and always ensuring those approvers are available when needed.

Use Configuration Monitoring Tools

Configuration monitoring tools, such as AWS Config, Azure Application Change Analysis, and others, are crucial for maintaining a tight infrastructure policy. They monitor your infrastructure and notify you based on default and custom rules. Many of these tools are not free, so ensure accounting is also involved! Even so, these tools are pretty much required to keep your infrastructure in check, and they’re worth the small cost to operate.
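As a rough idea of what a custom rule might check, here is a small Python sketch that flags a security group allowing SSH from anywhere. The `evaluate_ssh_rule` function is hypothetical, and the input is a simplified dictionary loosely modeled on an EC2 security group description (`IpPermissions`, `FromPort`, `ToPort`, `IpRanges`, `CidrIp`); it is not the event format a real AWS Config rule receives. The `COMPLIANT` and `NON_COMPLIANT` strings are borrowed from AWS Config's compliance vocabulary.

```python
from typing import Any, Dict


def evaluate_ssh_rule(security_group: Dict[str, Any]) -> str:
    """Return NON_COMPLIANT if port 22 is open to 0.0.0.0/0, else COMPLIANT."""
    for permission in security_group.get("IpPermissions", []):
        from_port = permission.get("FromPort")
        to_port = permission.get("ToPort")
        if from_port is None or to_port is None:
            # Assumption: a rule without a port range covers all ports.
            covers_ssh = True
        else:
            covers_ssh = from_port <= 22 <= to_port
        open_to_world = any(
            ip_range.get("CidrIp") == "0.0.0.0/0"
            for ip_range in permission.get("IpRanges", [])
        )
        if covers_ssh and open_to_world:
            return "NON_COMPLIANT"
    return "COMPLIANT"


# Hypothetical security group with the 0300 emergency rule still in place.
emergency_group = {
    "GroupId": "sg-0123456789abcdef0",
    "IpPermissions": [
        {"FromPort": 443, "ToPort": 443, "IpRanges": [{"CidrIp": "10.0.0.0/16"}]},
        {"FromPort": 22, "ToPort": 22, "IpRanges": [{"CidrIp": "0.0.0.0/0"}]},
    ],
}

print(evaluate_ssh_rule(emergency_group))
# prints NON_COMPLIANT
```

A managed service does the hard parts for you (collecting the configuration, running rules on change, sending notifications), but the rule logic itself is often about this simple.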

Never Share Credentials

I’m not going to use this section to pad my word count. I think it’s pretty self-explanatory. I’ll just say that OIDC, zero-trust, and IAM roles are vastly superior to passwords. Unfortunately, this isn’t a perfect world, and passwords still exist. So, if you still have to use passwords, don’t share them.

Run Scheduled Plans

Not all IaC tools have the notion of a “plan,” but most do. Running a plan on a scheduled interval allows you to see if anything about the resources known to the tool has changed.
Most tools, such as Terraform, can only check the resources they know about for drift. Because of this, a clean plan may not tell the whole story. You should include tools like AWS Config in your strategy to see the entire picture. If you encounter resources outside of your Terraform state due to an emergency or a click-happy engineer, you can always use Terraform’s import blocks (v1.5.0+) to bring those resources under management.
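A scheduled plan can be as simple as a small script run by cron or a CI job. The Python sketch below wraps `terraform plan -detailed-exitcode`, which exits with 0 when there is nothing to change, 1 on error, and 2 when the plan contains changes. The function names, the status labels, and the injectable `runner` parameter are my own assumptions for illustration, and the sketch assumes `terraform` is on the PATH and the working directory is already initialized.

```python
import subprocess
from typing import Callable, Sequence

# Exit codes documented for `terraform plan -detailed-exitcode`.
PLAN_EXIT_MEANINGS = {
    0: "no-changes",
    1: "error",
    2: "changes-present",
}


def run_command(args: Sequence[str], cwd: str) -> int:
    """Run a command in cwd and return its exit code without raising."""
    return subprocess.run(list(args), cwd=cwd, check=False).returncode


def check_for_drift(
    workdir: str,
    runner: Callable[[Sequence[str], str], int] = run_command,
) -> str:
    """Run a plan in workdir and translate the exit code into a status label."""
    exit_code = runner(
        ["terraform", "plan", "-detailed-exitcode", "-input=false"],
        workdir,
    )
    return PLAN_EXIT_MEANINGS.get(exit_code, "unknown")


# Example usage from a scheduled job (the path is hypothetical):
#   status = check_for_drift("/srv/infra/production")
#   if status != "no-changes":
#       notify_the_team(status)  # hypothetical alerting hook
```

Keep in mind that an exit code of 2 only says the plan is not empty. That can mean drift in the real infrastructure, but it can also mean code changes that were merged and never applied, so someone still has to read the plan.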

Use Drift Detection Tools Such as driftctl

Scheduled plans are a rudimentary but somewhat effective method to detect drift. If you really want to step up your drift-detection game, driftctl is a tool that can help. With driftctl, you can check for resources outside Terraform, simplifying the import process. Unfortunately, driftctl is currently in “maintenance mode” and will not continue to be supported, so if its End-of-Life is a concern, this may not be the tool for you. CloudQuery is one alternative, and several Terraform Automation and Collaboration Software (TACOS) platforms include drift detection.
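The core idea behind finding resources outside Terraform is a comparison between what your state file tracks and what actually exists in the cloud account. The Python sketch below shows that comparison with plain sets. It is a conceptual illustration only, not how driftctl or any other tool is implemented; the `classify_resources` helper and the resource IDs are hypothetical.

```python
from typing import Dict, List, Set


def classify_resources(state_ids: Set[str], cloud_ids: Set[str]) -> Dict[str, List[str]]:
    """Split resource IDs into managed, unmanaged, and missing groups."""
    return {
        # Tracked in state and present in the cloud.
        "managed": sorted(state_ids & cloud_ids),
        # Present in the cloud but unknown to the IaC tool.
        "unmanaged": sorted(cloud_ids - state_ids),
        # Tracked in state but no longer present in the cloud.
        "missing": sorted(state_ids - cloud_ids),
    }


# Hypothetical IDs: one security group was created by hand in the console.
state_ids = {"i-0aaa", "sg-0bbb"}
cloud_ids = {"i-0aaa", "sg-0bbb", "sg-0ccc"}

print(classify_resources(state_ids, cloud_ids))
# prints {'managed': ['i-0aaa', 'sg-0bbb'], 'unmanaged': ['sg-0ccc'], 'missing': []}
```

The "unmanaged" group is the list a scheduled plan alone will never show you, and it doubles as a to-do list of resources to either import or delete.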

Conclusion

As you can see from this non-exhaustive list, there are many methods to manage drift, intentional or unintentional. How your organization handles infrastructure drift can be complicated because there are numerous moving parts. Many of these strategies can become quite pricey, which is always a hurdle for organizations of any size. Overall, having a solid plan is the most important thing you can do. If you want to learn more about DevOps, including Terraform, Docker, Jenkins, Ansible, and more, check out my courses at https://courses.morethancertifed.com.

See my LinkedIn post that inspired this article here: https://www.linkedin.com/posts/derekm1215_infrastructure-drift-in-the-cloud-more-activity-7118210758550683648-P8zf

Top comments (1)

Augusto Valdivia

Interesting article @morethancertified thank you for sharing it. In addition to your section on "Use Configuration Monitoring Tools," I would also recommend setting strict policies to regularly rotate credentials and passwords for enhanced security. As for "Run Scheduled Plans," I would advise utilizing AWS SecurityHub reports, as they are very powerful as well.

Thank you again for sharing!