How to enable 100+ developers to deploy cloud resources in a controlled fashion

by Bryan Wood – December 21, 2020

Governance strategy in the cloud is a great new challenge that often gets overlooked. I’ve seen lots of organizations open an AWS account and turn developers loose to learn and deploy production services, only to realize later that there are large security consequences, cost ramifications, and infrastructure sprawl that they were not prepared to deal with.

A full-blown cloud initiative at an already profitable company is too wide a topic to address in a single article like this, so let’s zoom in on this one specific concern and look at how we’ve addressed part of that initiative at OpenMarket.

There’s a long list of cloud providers, and in each one you can deploy and configure resources with effectively infinite complexity. Writing down early on which standards you would like to enforce is a good idea, so that you have something to work towards. Start with, at the very least, a tagging standard.

Keeping these standards in mind, it’s time to select tooling to manage your configurations and deploy your resources. I’m going to select Terraform for you. You can Google its strengths and shortcomings, but in short, it’s going to support probably the widest variety of services across the widest variety of cloud providers. If you only need to deploy one thing, like Kubernetes, you might be better off choosing a tool built to manage that specific toolchain or technology and its lifecycle. Our aim was to deploy any of the crazy number of services in AWS (or other cloud providers) that a developer might choose, and to manage the state of each service after it’s deployed. Terraform is a code-controlled way to do that.
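
As a minimal sketch of how a written tagging standard can be carried into Terraform, the tags can live in one local value that every resource and module reuses. The name local.common_tags matches the module call later in this article; the tag keys, their values, and the "environment" variable are hypothetical placeholders, not our actual standard.

```hcl
# Hypothetical variable: which environment this configuration is deployed to.
variable "environment" {
  description = "Deployment environment: dev, stg, or prd."
  type        = string

  # Reject anything outside the three environments used in this article.
  validation {
    condition     = contains(["dev", "stg", "prd"], var.environment)
    error_message = "The environment must be one of: dev, stg, prd."
  }
}

# Hypothetical tag keys and values; substitute whatever your standard requires.
locals {
  common_tags = {
    Owner       = "platform-team"
    CostCenter  = "12345"
    Environment = var.environment
    ManagedBy   = "terraform"
  }
}
```

Keeping the tags in one place means a change to the standard is a one-line change rather than a hunt through every resource.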

Architecture

We’re going to work in AWS in this example, but most of the examples should translate well across cloud providers. Terraform is code, and like any code, you shouldn’t test it in production. We’ve found that developing Terraform code in a “Development” account removes the risk of accidentally clobbering production resources. We have a “Staging” account for good measure, as well as the “Production” account where we run our production workloads.
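
One way to make the account separation explicit in code is the AWS provider’s allowed_account_ids argument, which makes Terraform stop with an error if the credentials in use belong to an account that is not listed. The sketch below assumes a hypothetical aws_account_id variable supplied per environment; the provider alias matches the one used in the module example later in this article.

```hcl
# Hypothetical variable: the one account this environment may touch.
variable "aws_account_id" {
  description = "ID of the AWS account this environment deploys into."
  type        = string
}

provider "aws" {
  alias  = "us-west-2"
  region = "us-west-2"

  # Guard rail: fail early if the active credentials point at any
  # account other than the one named for this environment.
  allowed_account_ids = [var.aws_account_id]
}
```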

For each project we follow an “Environment Branches” pattern in git to make deployment very simple. Contributors follow normal git contribution practices and changes end up in the master branch. Each environment branch (“dev”, “stg”, or “prd”) has automation that picks up the code on that branch and applies it to the corresponding account.

Developers have only read access to the Staging and Production accounts, which ensures that every resource in production has been deployed by Terraform.
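
How read access is granted depends on how you manage identities, and this article does not prescribe a method. As one hedged illustration, assuming developers are managed as an IAM group (the group name here is hypothetical), the AWS-managed ReadOnlyAccess policy could be attached in the Staging and Production accounts like this:

```hcl
# Hypothetical group name; your identity setup may use roles or SSO instead.
resource "aws_iam_group" "developers" {
  name = "developers"
}

# ReadOnlyAccess is an AWS-managed policy, so only its ARN is needed here.
resource "aws_iam_group_policy_attachment" "developers_read_only" {
  group      = aws_iam_group.developers.name
  policy_arn = "arn:aws:iam::aws:policy/ReadOnlyAccess"
}
```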

Terraform Modules

At the beginning, we didn’t have a strong convention or enough people supporting the platform to enforce anything. As time went on, the desire for support and consistency became very real. We’ve been spending a lot of time developing our Terraform modules so that developers never have to declare resources directly and instead only call heavily abstracted modules. If they need a feature that a module doesn’t yet offer for the platform they are deploying, they can submit a feature request to be completed by a Site Reliability Engineer.

The Vision

If we abstract these platforms enough, projects will read a little bit more like a bill of materials than actual code. Our goal is that nobody calling a module from a project will have to be particularly fluent in Terraform. It might be that a savvy Technical Programme Manager could fill out the necessary requirements and get the resources deployed before their developers even need them.

Example of calling an abstracted module for a new project:

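module "my_new_application" {
  # Deploy with the provider configured for us-west-2.
  providers = {
    aws = aws.us-west-2
  }

  # The double slash separates the repository from the module subdirectory.
  source = "git@git.company.com:cloud/om-modules.git//modules/beanstalk_uber_module?ref=v0.10"

  has_mysql      = true
  has_redis      = true
  has_s3_bucket  = true
  instance_types = "t3.micro"

  # Add a project-specific tag on top of the shared tagging standard.
  tags = merge(local.common_tags, {
    Function = "awesome new microservice"
  })
}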

The module might have a bunch of other parameters with sane defaults and documentation would make those options clear. I find the above easy to read, even for the non-developer.
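
To illustrate what “sane defaults” could look like inside such a module, here is a hedged sketch of a few input variables and one resource that is created only when requested. The variable names mirror the call above, but the defaults, descriptions, and the bucket prefix are assumptions for illustration, not the contents of our actual module.

```hcl
variable "has_s3_bucket" {
  description = "Create an S3 bucket for the application."
  type        = bool
  default     = false
}

variable "instance_types" {
  description = "EC2 instance type for the application servers."
  type        = string
  default     = "t3.micro"
}

variable "tags" {
  description = "Tags applied to every resource the module creates."
  type        = map(string)
  default     = {}
}

# Created only when the caller sets has_s3_bucket = true.
resource "aws_s3_bucket" "this" {
  count = var.has_s3_bucket ? 1 : 0

  # Hypothetical prefix; AWS appends a unique suffix to it.
  bucket_prefix = "my-new-application-"
  tags          = var.tags
}
```

A caller who sets nothing gets the defaults, which is what lets a project file stay as short as the example above.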

Some Lessons Learned

It feels unsafe to be deploying data services with data or state in them with terraform. I’m afraid that a change is going to unexpectedly delete a resource and my data.

In Terraform there are lots of useful “meta-arguments”. Any resource block can have a “lifecycle” block, and inside it you can set the “prevent_destroy” argument to true. You don’t want to use it too often, because any plan that would destroy the resource, including “terraform destroy”, will fail with an error, and that can break your pipelines in multiple places. It will, however, prevent a data store from being accidentally destroyed, and it is a good idea to add it to resources that might contain important data.
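
A minimal sketch of that protection on a database follows. The resource arguments are illustrative placeholders and db_password is a hypothetical variable; the lifecycle block is the part that matters.

```hcl
# Hypothetical variable; supply the value from a secure source.
variable "db_password" {
  description = "Master password for the database."
  type        = string
}

resource "aws_db_instance" "mysql" {
  identifier        = "my-new-application"
  engine            = "mysql"
  instance_class    = "db.t3.micro"
  allocated_storage = 20
  username          = "admin"
  password          = var.db_password

  lifecycle {
    # Any plan that would destroy this resource fails with an error
    # instead of deleting the database.
    prevent_destroy = true
  }
}
```

To intentionally remove a protected resource, the argument has to be set to false (or removed) in a reviewed change first, which is exactly the speed bump you want around data.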

What if I want resources configured differently in dev, stg, or prd environments?

We handle this by having separate dev, stg, and prd .tfvars files. This makes it very simple to adjust parameter values on a per-environment basis. Things that we might vary between environments include instance sizes, number of instances, tag values, or configuration such as where you want your logs sent.
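
As a sketch of what such files might contain, assume the root module declares hypothetical variables named instance_type, instance_count, and log_bucket. A dev.tfvars file could then look like this:

```hcl
# dev.tfvars (hypothetical values): small and cheap.
instance_type  = "t3.micro"
instance_count = 1
log_bucket     = "example-logs-dev"
```

and the matching prd.tfvars would set the same variables to production-sized values:

```hcl
# prd.tfvars (hypothetical values): sized for real traffic.
instance_type  = "m5.large"
instance_count = 3
log_bucket     = "example-logs-prd"
```

The automation for each environment branch would then select the matching file with the -var-file option, for example: terraform apply -var-file="dev.tfvars"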

Parting Shots

This approach has some shortcomings. Without additional tooling, we don’t have a good way of estimating costs or taking financial data into consideration before merging to an environment branch. Still, with consistent tagging, the financial data is at least visible to us as we consume cloud resources.

Because we give developers more freedom in our Development environment, including the ability to provision resources without Terraform, we end up with some untagged and unmanaged resources that can cause unnecessary spend as well as other issues. We use Cloud Custodian policies to mop up non-compliant resources.

There are lots of management solutions for cloud providers and we’re constantly evaluating new options. As we’ve progressed, this pattern with Terraform has matured into something that we find supportable and flexible enough to allow us to leverage the vast array of services that modern cloud providers make available.