Self-service infrastructure as code

how to give product teams the autonomy to quickly provision infrastructure for their services, without the platform team losing control

context 📚

At many tech startups and scale-ups, the technology organisation eventually splits into some sort of central Platform team (aka "DevOps", "Infrastructure", or "SRE", maybe some "Dev Ex") and a bunch of Product teams, who are constantly working to build new features and maintain the core product for customers.

Product teams want to release their work quickly, often, and reliably, with the autonomy to not have to wait for a Platform team to provision a database or other resource for them.

The Platform team, however, wants to be comfortable and confident that all of the infrastructure the company is running is stable, securely configured, and right-sized to reduce overspending on their cloud costs. They'll also want to keep infrastructure as code, with tools like Terraform, which the Product teams might not care to use or learn.

So how can the Platform team enable the Product teams to work efficiently and not be blocked, while not losing visibility or control over the foundations upon which they run?

first attempt at self-service infra 🧪

I wrote in my previous post about how the Platform team I worked on adopted GitOps and Helm to codify the deployment process, with the additional benefits of making deployments auditable and making Kubernetes clusters recoverable in disaster scenarios.

Once that migration was completed, we wanted to enable the many Product teams to have the independence to set up new micro-services without our involvement or the need to raise any tickets.

Our first attempt was to introduce the other engineering teams to Terraform. The Platform team was already using it extensively with Terragrunt, with Atlantis automating plan and apply operations in a Git flow to keep infrastructure consistent. We'd written modules, with documentation; an engineer would simply raise a PR that used a module with the right values, and once the PR was approved by Platform, Atlantis would go ahead and set it up for them.
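
To give a flavour of what one of those PRs looked like, here's a minimal sketch of a Terragrunt file consuming a module; the module name, source, and inputs are illustrative rather than our actual code:

# terragrunt.hcl raised in a PR by a product team, applied by Atlantis after approval
# (module source and inputs are illustrative)
terraform {
  source = "git::git@github.com:example-org/terraform-modules.git//rds-postgres-database?ref=v1.4.0"
}

include {
  path = find_in_parent_folders()
}

inputs = {
  service_name      = "account-service"
  rds_instance_name = "account"
  environment       = "dev"
}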

To us, this felt light touch. Product engineers wouldn't have to learn Terraform (Platform would own and maintain the Terraform modules); they'd just need to learn how to use those modules with Terragrunt and apply them with Atlantis. We wrote up docs, recorded a show and tell, ... profit?

Except not quite.

another tool to learn 🛠️

While a few engineers did start to use this workflow, and they appreciated getting more hands-on instead of raising tickets to have it done for them at some undetermined future date, most engineers didn't.

Whether it was a cultural thing, with teams not wanting to care about the nuts and bolts underneath their service at runtime, or simply that tight deadlines left no time to stop and work out the Atlantis steps, ultimately it doesn't really matter. As a Platform team, we're an enablement team, and the Product teams are our customers. What we had built was not serving their needs.

Given the Product teams had already adopted GitOps and were familiar with deployments powered by Helm Releases and Flux, we wanted to move infrastructure provisioning into the same process that creates a service and its continuous deployment.

infrastructure as code as GitOps 🚀

We stumbled upon a project for managing Terraform through Kubernetes CRDs that we could deploy with Helm. That project is now called Tofu-Controller - another Weaveworks project, so it integrated neatly with our existing Flux setup.

This meant that Product engineers could provision a database for their service, or any other per-service infrastructure they needed, from the same Helm Release they already used to configure their service at runtime.

---
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: account-service
  namespace: dev
spec:
  chart:
    spec:
      chart: kotlin-deployment
      version: 2.0.33
  values:
    replicas: 2
    infrastructure:
      postgres: 
        enabled: true
        rdsInstance: account

The above (simplified) example shows .Values.infrastructure.postgres.enabled set to true. When the Helm chart is installed or upgraded, a Terraform resource will be templated:

{{- if eq .Values.infrastructure.postgres.enabled true }}
---
apiVersion: infra.contrib.fluxcd.io/v1alpha2
kind: Terraform
metadata:
  name: "postgres-{{ .Release.Name }}"
spec:
  interval: 2h
  approvePlan: auto
  tfstate:
    forceUnlock: auto
  destroyResourcesOnDeletion: false
  sourceRef:
    kind: GitRepository
    name: service-postgres
  backendConfig:
    customConfiguration: |
      backend "s3" {
        ...
      }
  vars:
    - name: service_name
      value: "{{ .Release.Name }}"
    - name: rds_instance_name
      value: "{{ Values.infrastructure.postgres.rdsInstance }}"

Again, this code has been simplified for clarity, but there are a few things worth noting:

  • the Terraform resource can be configured to automatically plan and apply, and even forceUnlock some state locks

  • the 2 hour interval means the operator will regularly re-plan and re-apply the Terraform

  • the backendConfig means the Terraform operator can share the remote state bucket with Atlantis; this is powerful since one module can then reference another's remote state outputs

  • importantly for stateful resources like databases, you can set destroyResourcesOnDeletion to false to avoid destroying data when you uninstall the helm chart

  • we can pass in the vars as usual to the service-postgres Terraform module; here we're passing in a name for the service (which maps to the database name and database user) and the name of the RDS instance on which to create it

When the above Helm chart is installed, it creates a custom resource of kind Terraform, and the operator goes and plans and applies the service-postgres module with the vars set as inputs. In this case, it'll create a database and user called account-service on the account RDS instance, and manage roles and grants, passwords, security group access, etc.
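
For a sense of what a module like service-postgres might contain, here's a rough sketch; the variables mirror the vars above, but the resources and provider wiring are illustrative rather than the real module's code:

# Illustrative sketch of a per-service Postgres module, not the real service-postgres code
variable "service_name" {
  type = string # becomes the database name and database user
}

variable "rds_instance_name" {
  type = string # which shared RDS instance to create the database on
}

resource "random_password" "service" {
  length  = 32
  special = false
}

# Assumes the cyrilgdn/postgresql provider is configured against the RDS instance
resource "postgresql_role" "service" {
  name     = var.service_name
  login    = true
  password = random_password.service.result
}

resource "postgresql_database" "service" {
  name  = var.service_name
  owner = postgresql_role.service.name
}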

eventually consistent infrastructure ⏱️

Of course, the Terraform resources might take a short while to set up, so the service needs to handle scenarios where its Terraform dependencies don't exist yet. In practice though, we were able to provision databases, service accounts, S3 buckets, Kafka topics, and more within a few seconds, and the service's pods would simply restart until their dependencies existed.

The Terraform operator also continually re-applies the resources it manages, which helped us avoid drift between what we expect to exist and what actually does; it also reverts any changes someone might make to the infrastructure manually, outside of code approvals.

split brain terraform 🧠

It's important to point out that this workflow is only for per-service infrastructure; each Helm Release provisions resources just for its own service, and those are managed by the operator.

The Platform team would continue to use Atlantis and the Terragrunt repo to manage the main cloud estate (VPCs, security groups, database instances, EKS, etc). The Platform team would also maintain the deployment Helm chart and the Terraform modules referenced by it.

The per-service Terraform modules could reference the remote state of those managed by Atlantis, since they shared the same remote state S3 bucket. In the example above, because the RDS instance's name is passed in via the Terraform resource in Kubernetes, the module can pull whatever outputs it needs from the remote state Atlantis wrote when it set up that instance.
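
Inside the per-service module, that lookup might look something like the sketch below; the bucket name, key layout, and output name are assumptions for illustration:

# Read outputs from the RDS instance's state, written by Atlantis/Terragrunt
# (bucket, key layout, and output name are illustrative)
data "terraform_remote_state" "rds_instance" {
  backend = "s3"
  config = {
    bucket = "example-terraform-state"
    key    = "rds/${var.rds_instance_name}/terraform.tfstate"
    region = "eu-west-1"
  }
}

locals {
  database_host = data.terraform_remote_state.rds_instance.outputs.endpoint
}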

next steps 🥾

In many ways, this made the HelmRelease in the GitOps repo the source of truth for a deployment: it already described how the service should behave at runtime, and now it also captured some of the service's dependencies.

With time, our goal was to abstract the YAML from the repo into a service catalog like Backstage, where it would be easy to say "I want to create a service called X, it needs Postgres and Kafka", and its entire setup, including some boilerplate code and its infrastructure, could go off and be created automatically.
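
As a rough sketch of that idea (we hadn't built this), a Backstage scaffolder template could expose just those choices and render the boilerplate plus the HelmRelease values shown earlier:

# Sketch of a Backstage scaffolder template for the "service called X with Postgres and Kafka" flow
# (template name, skeleton path, and parameters are illustrative)
apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
  name: new-service
spec:
  type: service
  owner: platform-team
  parameters:
    - title: New service
      properties:
        serviceName:
          type: string
        needsPostgres:
          type: boolean
        needsKafka:
          type: boolean
  steps:
    - id: render
      name: Render boilerplate and HelmRelease
      action: fetch:template
      input:
        url: ./skeleton
        values:
          serviceName: ${{ parameters.serviceName }}
          needsPostgres: ${{ parameters.needsPostgres }}
          needsKafka: ${{ parameters.needsKafka }}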
