From Ghost Assets to Infrastructure Drift - Don't Get Spooked

#devops #cloud #security

At Firefly, we’ve often spoken about state and infrastructure drift, which is frequently the byproduct of large-scale cloud operations that predate infrastructure as code and were once managed manually. As a quick recap for those who aren’t familiar: drift is what happens when assets exist in both your code and your cloud, but their actual configuration in the cloud no longer matches the intended state described in the code. If you want a good reference on this, you can start here with this great talk from DevOpsDays Tel Aviv.

In this post, though, I’m going to talk a little bit about ghost assets, which we’ve found to be a recurring challenge in large-scale cloud operations. Unlike drifted assets, ghost assets no longer exist in your cloud at all but still remain in your code, and this can result in all kinds of problematic, and sometimes unpredictable or insecure, behavior.
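To make the distinction concrete, here is a minimal Python sketch that compares what the code declares with what the cloud actually contains. The function name, the dictionary layout, and the asset names are all hypothetical, chosen only for illustration; a real tool would read IaC state and query cloud APIs instead.

```python
def classify_assets(codified: dict, cloud: dict) -> dict:
    """Label every codified asset as 'ghost', 'drifted' or 'in_sync'.

    Both arguments map a (hypothetical) asset ID to its configuration.
    """
    report = {}
    for asset_id, desired in codified.items():
        if asset_id not in cloud:
            # Declared in code, but no longer present in the cloud.
            report[asset_id] = "ghost"
        elif cloud[asset_id] != desired:
            # Present in both, but the live configuration has changed.
            report[asset_id] = "drifted"
        else:
            report[asset_id] = "in_sync"
    return report


if __name__ == "__main__":
    codified = {
        "web-server": {"instance_type": "t3.micro"},
        "legacy-worker": {"instance_type": "t3.small"},
        "database": {"engine": "postgres"},
    }
    cloud = {
        "web-server": {"instance_type": "t3.large"},  # changed by hand
        "database": {"engine": "postgres"},
        # "legacy-worker" was deleted manually
    }
    for asset_id, status in classify_assets(codified, cloud).items():
        print(f"{asset_id}: {status}")
```

In this toy inventory, the asset that was changed by hand is reported as drifted, while the one that was deleted by hand is reported as a ghost.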

So why do Ghost Assets even exist in the first place?

Ghost assets are the byproduct of a codified asset, i.e. one that appears in your infrastructure as code (IaC) configuration files, being manually deleted in the cloud, whether via the UI or an API. This can happen for two primary reasons:

1. The resource isn’t actually needed anymore and someone decided to remove it
2. Human or machine error

Both reasons have implications: reason #1 tends to show up as unnecessary cost and security exposure, while reason #2 can cause outages and data loss.

Let’s take a closer look at what this looks like under the hood.

What is the Impact of Ghost Assets?

Ghost Assets and Unnecessary Cloud Costs

Sometimes a cloud resource that is no longer needed is manually deleted from the cloud, but not from the code. Because IaC treats the code as the source of truth, the manually deleted asset will be recreated the next time you redeploy the IaC code. This has implications ranging from cost to stability.
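The behavior described above can be sketched with a toy model of a redeploy. This is not how Terraform or Pulumi are implemented; `redeploy` and the resource names are hypothetical, and the sketch assumes only that a deploy creates whatever the code declares and the cloud lacks.

```python
def redeploy(codified: dict, cloud: dict) -> list:
    """Toy IaC apply: create every codified resource missing from the cloud.

    Mutates `cloud` in place and returns the names that were (re)created.
    """
    recreated = []
    for name, config in codified.items():
        if name not in cloud:
            cloud[name] = dict(config)  # the "deleted" resource comes back
            recreated.append(name)
    return recreated


if __name__ == "__main__":
    codified = {
        "web-server": {"size": "small"},
        "legacy-worker": {"size": "large"},
    }
    cloud = {"web-server": {"size": "small"}}  # legacy-worker deleted by hand

    print(redeploy(codified, cloud))  # the ghost asset is recreated
    print(sorted(cloud))              # and is running (and billed) again
```

As long as the resource stays in the code, every deploy against a cloud where it is missing brings it back.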

The cost implication is obvious: you don’t want to be paying for resources you no longer need (certainly when cloud operations are already quite costly). But sometimes a different service has replaced the old one, and bringing the old one back can cause conflicts, data loss, or even security issues. It is bad practice to have your IaC and cloud out of sync.

Let’s drill down into the security issues that may arise from ghost assets. Imagine you have an EC2 instance on AWS, or a container you used to run on ECS or EKS. These workloads are no longer in use, so you decided to delete them from your cloud. Now someone has redeployed the code, and because the resources were still defined there (despite being manually removed from the cloud), they are back. This is problematic because those workloads are no longer maintained: they might use old libraries with known vulnerabilities, and they might even be exposed to the internet, making it easier for attackers to exploit them. They are effectively invisible assets running old, unmaintained code that might still have access to other managed resources such as databases, and this is a huge risk to your production deployment.

Ghost Assets and Security

The first scenario, while it can be frustrating to hunt down and resolve, is surprisingly less critical than the second scenario.

Reason #2 is the scenario you’d really want to avoid. If an asset is accidentally deleted, whether by humans or machines, then Houston, you have a problem. This can lead to production breakage and downtime, long-term data loss (if a database or backup service was removed accidentally), and many more issues.

Imagine, for example, that you have a production deployment with different workloads, a web server, and a database. If you accidentally delete a workload, you can expect downtime. When users try to reach that service, it will not be available to them, and that can cause immediate loss of business. This is more likely to go unnoticed with workloads and resources that customers don’t use often, but that are critical when they are needed. Let me guess: you probably have something in mind right now.

At times we’ve had the horror of finding haunted house config files that are just riddled with ghosts. Config files like these, which describe resources that are all but obsolete in the cloud, create clutter and redundancy, and are often a source of needless cloud costs.

How to Manage and Prevent Ghost Assets

Obviously, if an important and still required resource was manually deleted from the cloud, restoring it is a matter of how up-to-date and “restorable” your backups are. (There is no shortage of horror stories where the backups were never actually tested for restoration and, to the shock and dismay of everyone, didn’t work when they were needed.) So a good piece of evergreen advice is to always test your backups!

While we can’t undo what’s already been done, we can help you improve your practices and avoid such a recurrence in the future.

One way to avoid losing critical resources to manual deletion is to prevent this kind of action entirely, whether from the UI or via API, and to enforce policies that require resources to be added or deleted solely through your IaC. This practice is often called GitOps or Policy as Code (and you can learn more about Policy as Code in this post).
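As one possible illustration, assuming AWS (since the examples in this post use EC2), such a guardrail could be expressed as an IAM-style deny statement that blocks delete actions for every principal except the role your IaC pipeline uses. The role name, the helper function, and the choice of actions below are assumptions for the sketch, not a recommended or complete policy; check how your organization attaches such policies before relying on it.

```python
import json


def deny_manual_deletes_policy(iac_role_arn_pattern: str, delete_actions: list) -> dict:
    """Build a policy document denying deletes to everyone but the IaC role.

    `iac_role_arn_pattern` is a hypothetical ARN pattern for the pipeline role.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyDeletesOutsideIaC",
                "Effect": "Deny",
                "Action": list(delete_actions),
                "Resource": "*",
                # Applies to every principal whose ARN does NOT match the role.
                "Condition": {
                    "ArnNotLike": {"aws:PrincipalArn": iac_role_arn_pattern}
                },
            }
        ],
    }


if __name__ == "__main__":
    policy = deny_manual_deletes_policy(
        "arn:aws:iam::*:role/iac-deployer",  # hypothetical role name
        ["ec2:TerminateInstances", "rds:DeleteDBInstance", "s3:DeleteBucket"],
    )
    print(json.dumps(policy, indent=2))
```

Keeping the policy itself in version control alongside the rest of your IaC means the guardrail is reviewed and deployed the same way as the resources it protects.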

This means that any resource that requires deletion should have its resource block deleted entirely from the code, so that the next time you run terraform apply or pulumi up, the resource will not be created again. If you would like to search for such missing resources, you can run terraform plan, which will flag any resources that appear in your code but do not appear in your cloud.
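For Terraform, the plan check described above can be automated. A plan saved with `terraform plan -out=tfplan` can be rendered as JSON with `terraform show -json tfplan`, and that JSON contains a `resource_changes` array with the planned actions per resource. The sketch below assumes that format; it also reads the `resource_drift` array that newer Terraform versions emit, and treating a `delete` entry there as "removed outside of Terraform" is an assumption worth verifying against your version. The function names and sample addresses are hypothetical.

```python
def planned_creations(plan: dict) -> list:
    """Addresses Terraform plans to create (in code, missing from the cloud)."""
    return [
        rc["address"]
        for rc in plan.get("resource_changes", [])
        if rc.get("change", {}).get("actions") == ["create"]
    ]


def ghost_candidates(plan: dict) -> list:
    """Planned creations that the plan also reports as deleted outside Terraform.

    Assumption: a manual deletion shows up in `resource_drift` with a
    `delete` action. Brand-new resources have no such entry.
    """
    deleted_outside = {
        rd["address"]
        for rd in plan.get("resource_drift", [])
        if rd.get("change", {}).get("actions") == ["delete"]
    }
    return [addr for addr in planned_creations(plan) if addr in deleted_outside]


if __name__ == "__main__":
    # Hand-written sample in the assumed shape of `terraform show -json` output.
    sample_plan = {
        "resource_changes": [
            {"address": "aws_instance.legacy_worker", "change": {"actions": ["create"]}},
            {"address": "aws_instance.web", "change": {"actions": ["no-op"]}},
            {"address": "aws_s3_bucket.new_logs", "change": {"actions": ["create"]}},
        ],
        "resource_drift": [
            {"address": "aws_instance.legacy_worker", "change": {"actions": ["delete"]}},
        ],
    }
    print(planned_creations(sample_plan))
    print(ghost_candidates(sample_plan))
```

Note that a planned creation on its own is ambiguous: it also covers resources that were just added to the code and never deployed, which is why the sketch cross-checks against the drift section before calling something a ghost.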

Ghost Busters

When you have very large cloud fleets, and you manage hundreds or thousands of config files that may have many ghost assets in them, this can be a much heavier lift than a ghost asset or two in a handful of files. There are certainly many upsides to the efficiency, flexibility and scale the cloud provides, but its long-term upkeep and maintenance has created a lot of operational overhead.

The many abstraction layers and tools we use today have caused our clouds to be riddled with drifted assets, ghost assets, and even unmanaged resources that no one even knows exist and are bleeding costs. The cloud has become a wild west with disparate ways of management, and this has become an operational nightmare for many cloud engineers to handle at scale.

This is what we set out to solve at Firefly: detection of these issues at scale, along with quick fixes for remediation and codification. With everything shifting left and being managed as code, your cloud can’t be left behind. The ability to move and deploy rapidly depends on how quickly and efficiently you can manage and automate your infrastructure and resources. So codify all the things and discard the manual toil, and you’ll see your engineering velocity improve, as well as your infrastructure safety and robustness.