<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: garry</title>
    <description>The latest articles on DEV Community by garry (@grrywlsn).</description>
    <link>https://dev.to/grrywlsn</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F367177%2Fdd709c0d-9bb8-4e46-8add-1f6cab467ab0.jpg</url>
      <title>DEV Community: garry</title>
      <link>https://dev.to/grrywlsn</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/grrywlsn"/>
    <language>en</language>
    <item>
      <title>Self-service infrastructure as code</title>
      <dc:creator>garry</dc:creator>
      <pubDate>Tue, 12 Mar 2024 15:53:02 +0000</pubDate>
      <link>https://dev.to/grrywlsn/self-service-infrastructure-as-code-23bl</link>
      <guid>https://dev.to/grrywlsn/self-service-infrastructure-as-code-23bl</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;how to give product teams the autonomy to quickly provision infrastructure for their services, without the platform team losing control&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
   context 📚
&lt;/h2&gt;

&lt;p&gt;For many tech startups and scale-ups, the technology team usually evolves into some sort of central Platform team(s) (aka "DevOps", "Infrastructure", or "SRE", maybe some "Dev Ex") and a bunch of Product teams, who are constantly working to build new features and maintain the core product for customers.&lt;/p&gt;

&lt;p&gt;Product teams want to release their work quickly, often, and reliably, but with the autonomy of not needing to wait for a Platform team to provision a database or other resource for them. &lt;/p&gt;

&lt;p&gt;The Platform team, however, wants to be comfortable and confident that all of the infrastructure the company is running is stable, securely configured, and right-sized to reduce overspending on their cloud costs. They'll also want to keep infrastructure as code, with tools like Terraform, which the Product teams might not care to use or learn.&lt;/p&gt;

&lt;p&gt;So how can the Platform team enable the Product teams to work efficiently and not be blocked, while not losing visibility or control over the foundations upon which they run?&lt;/p&gt;

&lt;h2&gt;
  
  
  first attempt at self-service infra 🧪
&lt;/h2&gt;

&lt;p&gt;I wrote in &lt;a href="https://dev.to/grrywlsn/disposable-kubernetes-clusters-2f44"&gt;my previous post&lt;/a&gt; about how the Platform team I worked on adopted GitOps and Helm to codify the deployment process, with the additional benefits of making deployments auditable and making Kubernetes clusters recoverable in disaster scenarios. &lt;/p&gt;

&lt;p&gt;Once that migration was completed, we wanted to enable the many Product teams to have the independence to set up new micro-services without our involvement or the need to raise any tickets.&lt;/p&gt;

&lt;p&gt;Our first attempt was to introduce other engineering teams to Terraform - the Platform team was already using it extensively with &lt;a href="https://terragrunt.gruntwork.io"&gt;Terragrunt&lt;/a&gt;, and using &lt;a href="https://www.runatlantis.io"&gt;Atlantis&lt;/a&gt; to automate &lt;code&gt;plan&lt;/code&gt; and &lt;code&gt;apply&lt;/code&gt; operations in a Git flow to ensure infrastructure was consistent. We'd written modules, with documentation, and an engineer would simply need to raise a PR to use the module and provide the right values, and Atlantis (once the PR was approved by Platform) would go ahead and set it up for them.&lt;/p&gt;
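&lt;p&gt;As a rough sketch of that workflow (the module path and inputs here are illustrative, not our real ones), an engineer would raise a PR adding a &lt;code&gt;terragrunt.hcl&lt;/code&gt; along these lines:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# terragrunt.hcl - hypothetical usage of a Platform-owned module
terraform {
  source = "git::git@github.com:example-org/terraform-modules.git//service-postgres?ref=v1.2.0"
}

include {
  path = find_in_parent_folders()
}

inputs = {
  service_name      = "account-service"
  rds_instance_name = "account"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Atlantis would then comment the &lt;code&gt;plan&lt;/code&gt; output on the PR, and &lt;code&gt;apply&lt;/code&gt; it once approved.&lt;/p&gt;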

&lt;p&gt;To us, this felt light touch. Product engineers wouldn't have to learn Terraform (Platform would own and maintain the Terraform modules), they'd just need to learn how to use those modules with Terragrunt and apply them with Atlantis. We wrote up docs, recorded a show and tell, ... profit?&lt;/p&gt;

&lt;p&gt;Except not quite.&lt;/p&gt;

&lt;h2&gt;
  
  
  another tool to learn 🛠️
&lt;/h2&gt;

&lt;p&gt;While a few engineers did start to use this workflow, and they appreciated getting more hands-on instead of raising tickets to have it done for them at some undetermined future date, most engineers didn't.&lt;/p&gt;

&lt;p&gt;Whether it's a cultural thing, and teams don't want to care about the nuts and bolts under their service at runtime, or simply because they're on tight deadlines and don't have time to stop and work out Atlantis steps, ultimately, it &lt;em&gt;doesn't really matter&lt;/em&gt;. As a Platform team, we're an enablement team, and the Product teams are our customers. What we had built was not serving their needs.&lt;/p&gt;

&lt;p&gt;Given the team had already adopted GitOps and were familiar with deployments powered by &lt;a href="https://fluxcd.io/flux/components/helm/helmreleases/"&gt;Helm Releases&lt;/a&gt; and &lt;a href="https://fluxcd.io"&gt;Flux&lt;/a&gt;, we wanted to move the provisioning of the infrastructure to be part of the same process of creating the service and its continuous deployment.&lt;/p&gt;

&lt;h2&gt;
  
  
  infrastructure as code as GitOps 🚀
&lt;/h2&gt;

&lt;p&gt;We stumbled upon a project for maintaining Terraform with CRDs that we could deploy with Helm. That project is now called &lt;a href="https://github.com/flux-iac/tofu-controller"&gt;Tofu-Controller&lt;/a&gt; - another WeaveWorks project, so it integrated neatly with our existing Flux setup.&lt;/p&gt;

&lt;p&gt;What it meant is that Product engineers could provision a database for their service, or any other per-service infrastructure they needed, from the Helm Release they already used to configure their service at runtime.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;---
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: account-service
  namespace: dev
spec:
  chart:
    spec:
      chart: kotlin-deployment
      version: 2.0.33
  values:
    replicas: 2
    infrastructure:
      postgres: 
        enabled: true
        rdsInstance: account
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The above (simplified) example sets &lt;code&gt;.Values.infrastructure.postgres.enabled&lt;/code&gt; to &lt;code&gt;true&lt;/code&gt;. When the Helm chart is installed or upgraded, a &lt;code&gt;Terraform&lt;/code&gt; resource will be templated:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{{- if eq .Values.infrastructure.postgres.enabled true }}
---
apiVersion: infra.contrib.fluxcd.io/v1alpha2
kind: Terraform
metadata:
  name: "postgres-{{ .Release.Name }}"
spec:
  interval: 2h
  approvePlan: auto
  tfstate:
    forceUnlock: auto
  destroyResourcesOnDeletion: false
  sourceRef:
    kind: GitRepository
    name: service-postgres
  backendConfig:
    customConfiguration: |
      backend "s3" {
        ...
      }
  vars:
    - name: service_name
      value: "{{ .Release.Name }}"
    - name: rds_instance_name
      value: "{{ .Values.infrastructure.postgres.rdsInstance }}"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Again, this code has been simplified for clarity, but there are a few things worth noting:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;the &lt;code&gt;Terraform&lt;/code&gt; resource can be configured to automatically plan and apply, and even &lt;code&gt;forceUnlock&lt;/code&gt; some state locks&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;the 2 hour &lt;code&gt;interval&lt;/code&gt; means the operator will re-apply the Terraform regularly&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;the &lt;code&gt;backendConfig&lt;/code&gt; meant that the Terraform operator can share the remote state bucket with Atlantis; this is powerful since you can then reference across modules to get remote state outputs&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;importantly for stateful resources like databases, you can set &lt;code&gt;destroyResourcesOnDeletion&lt;/code&gt; to false to avoid destroying data when you uninstall the helm chart&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;we can pass in the &lt;code&gt;vars&lt;/code&gt; as usual to the &lt;code&gt;service-postgres&lt;/code&gt; Terraform module; here we're passing in a name for the service (which maps to the database name and database user) and the name of the RDS instance on which to create it&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When the above Helm chart is installed, it creates a custom resource of the &lt;code&gt;Terraform&lt;/code&gt; kind, and the operator will plan and apply the &lt;code&gt;service-postgres&lt;/code&gt; module with the &lt;code&gt;vars&lt;/code&gt; set as inputs. In this case, it'll create a database and user called &lt;code&gt;account-service&lt;/code&gt; on the &lt;code&gt;account&lt;/code&gt; RDS instance, and manage roles and grants, passwords, security group access, etc.&lt;/p&gt;
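&lt;p&gt;One pattern worth sketching here: Tofu-Controller can write a module's outputs to a Kubernetes Secret, so generated credentials can reach the service without leaving the cluster. The output names below are illustrative and assume the module exposes them:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;spec:
  # write selected Terraform outputs to a Secret the service can mount
  writeOutputsToSecret:
    name: "postgres-{{ .Release.Name }}-outputs"
    outputs:
      - db_user
      - db_password
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;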

&lt;h2&gt;
  
  
  eventually consistent infrastructure ⏱️
&lt;/h2&gt;

&lt;p&gt;Of course, the Terraform resources might take a short while to set up, so the service will need to handle scenarios where its Terraform dependencies might not exist yet. In practice though, we were able to provision databases, service accounts, S3 buckets, Kafka topics, and more within a few seconds and the service's pods would simply restart until they existed.&lt;/p&gt;

&lt;p&gt;The Terraform operator will also continually re-apply the resources it's managing, which helped us avoid drift between what we expect to exist and what actually does; it also fixes any situations where a user might manually change the infrastructure outside of code approvals.&lt;/p&gt;

&lt;h2&gt;
  
  
  split brain terraform 🧠
&lt;/h2&gt;

&lt;p&gt;It's important to point out that this workflow is only for per-service infrastructure; each Helm Release would provision resources just for itself, to be managed by the operator.&lt;/p&gt;

&lt;p&gt;The Platform team would continue to use Atlantis and the Terragrunt repo to manage the main cloud estate (VPCs, security groups, database instances, EKS, etc). The Platform team would also maintain the deployment Helm chart and the Terraform modules referenced by it.&lt;/p&gt;

&lt;p&gt;The per-service Terraform modules could reference the remote state of those managed by Atlantis since they shared the same remote state S3 bucket. With the example above, by passing in the name of the RDS instance from the Terraform resource in Kubernetes, the operator can pull outputs from the instance's remote state when it was set up by Atlantis, based on whatever was needed.&lt;/p&gt;
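&lt;p&gt;Inside a per-service module, that cross-reference might look roughly like the following (the bucket name, key layout, and output names are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# read the Atlantis-managed RDS instance's state from the shared S3 bucket
data "terraform_remote_state" "rds" {
  backend = "s3"
  config = {
    bucket = "example-terraform-state"   # shared with Atlantis (illustrative name)
    key    = "rds/${var.rds_instance_name}/terraform.tfstate"
    region = "eu-west-1"
  }
}

# e.g. connect to the endpoint exported by the Atlantis-managed module
locals {
  rds_endpoint = data.terraform_remote_state.rds.outputs.endpoint
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;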

&lt;h2&gt;
  
  
  next steps 🥾
&lt;/h2&gt;

&lt;p&gt;In many ways, this made the HelmRelease in the GitOps repo a source of truth for the deployment; already describing &lt;em&gt;how&lt;/em&gt; it should work at runtime, it was also now including some of its dependencies.&lt;/p&gt;

&lt;p&gt;With time, our goal was to abstract the YAML from the repo into a service catalog like Backstage, where it would be easy to say "I want to create a service called X, it needs Postgres and Kafka", and its entire setup, including some boilerplate code and its infrastructure, could go off and be created automatically.&lt;/p&gt;

</description>
      <category>terraform</category>
      <category>devops</category>
      <category>gitops</category>
    </item>
    <item>
      <title>Disposable Kubernetes clusters</title>
      <dc:creator>garry</dc:creator>
      <pubDate>Sat, 31 Oct 2020 16:23:36 +0000</pubDate>
      <link>https://dev.to/grrywlsn/disposable-kubernetes-clusters-2f44</link>
      <guid>https://dev.to/grrywlsn/disposable-kubernetes-clusters-2f44</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;An overview of how we manage Kubernetes clusters at &lt;a href="https://www.curve.com"&gt;Curve&lt;/a&gt; to allow for zero downtime upgrades while handling live Curve card transactions.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  context 📚
&lt;/h2&gt;

&lt;p&gt;When I joined Curve as the Lead SRE in January 2019, Kubernetes was already being used in production to manage the many microservices (and few monoliths) that make up the Curve estate. Quite bravely at the time, Curve was also using &lt;a href="https://istio.io"&gt;Istio&lt;/a&gt; in production - well before it had wider (aka "stable") adoption.&lt;/p&gt;

&lt;p&gt;The clusters were being set up manually by &lt;a href="https://github.com/kubernetes/kops"&gt;Kops&lt;/a&gt;, and deployments happened with Jenkins and a bunch of scripts. This is &lt;em&gt;fine&lt;/em&gt; but ideally not something you want to use in a production setup; recreating the cluster, or even just tracking the current version of what's deployed in an environment is a slow and difficult task.&lt;/p&gt;

&lt;h2&gt;
  
  
  adopting GitOps 👩‍💻
&lt;/h2&gt;

&lt;p&gt;The first step in trying to tackle this mild chaos was to define the current reality of our clusters in one central location - and what better tool to track the state of something than Git. It's scalable, it's got a change history, and all of the Engineering team already know how it works.&lt;/p&gt;

&lt;p&gt;We dabbled briefly with &lt;a href="https://argoproj.github.io/argo-cd/"&gt;ArgoCD&lt;/a&gt; but settled eventually on &lt;a href="https://www.weave.works/oss/flux/"&gt;Flux, by WeaveWorks&lt;/a&gt;. Its simplicity, and ability to manage Helm charts effectively with the Helm Operator, was a winner for what we wanted to do.&lt;/p&gt;

&lt;h2&gt;
  
  
  standardising deployments 🚀
&lt;/h2&gt;

&lt;p&gt;Before Flux, Jenkins managed deployments through a Git repo of its own, templating in image versions at deploy time that were never committed back to the repo.&lt;/p&gt;

&lt;p&gt;Additionally, everything defined in that repo was just raw YAML; engineers would copy/paste config from an existing service to define a new one, often copying bits of config that weren't relevant to their new service.&lt;/p&gt;

&lt;p&gt;The Platform team started work on a Helm chart that would replace all of that - no more copy-pasting, just add your service name and the version of the image you want to deploy. &lt;/p&gt;

&lt;p&gt;A bunch of sensible defaults would be established (resource requests and limits, health checks, rollout strategy), and the Platform team would encourage standardisation of services (ports, metrics endpoints, and so on). &lt;/p&gt;

&lt;p&gt;Each service would be defined as a &lt;a href="https://docs.fluxcd.io/projects/helm-operator/en/1.0.0-rc9/references/helmrelease-custom-resource.html"&gt;Helm Release&lt;/a&gt;; an instantiation of that Platform-managed chart. Values could be added to a release to override some defaults or add optional features, such as setting up an ingress route from the Internet.&lt;/p&gt;
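&lt;p&gt;A minimal sketch of such a release, using the Helm Operator's &lt;code&gt;HelmRelease&lt;/code&gt; resource (the service, chart repository, and value names are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;---
apiVersion: helm.fluxcd.io/v1
kind: HelmRelease
metadata:
  name: payment-service
  namespace: prod
spec:
  releaseName: payment-service
  chart:
    repository: https://charts.example.com   # illustrative chart repo
    name: platform-deployment
    version: 1.4.0
  values:
    image:
      tag: 1.17.2
    # optional feature toggled on top of the chart's defaults
    ingress:
      enabled: true
      host: api.example.com
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;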

&lt;h2&gt;
  
  
  rebuilding the infrastructure 🏗️
&lt;/h2&gt;

&lt;p&gt;With work underway to manage the services as code and standardise deployments with Helm, we began work on replacing Kops with clusters that were &lt;em&gt;also&lt;/em&gt; managed by code. We chose to move to &lt;a href="https://aws.amazon.com/eks/"&gt;AWS's EKS&lt;/a&gt;, which we'd set up and configure with &lt;a href="https://www.terraform.io"&gt;Terraform&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The Terraform module we wrote for EKS sets up the infrastructure of course — such as the EKS control plane, the worker nodes, security groups and some IAM roles — but also installs onto the cluster a few components we consider core - Terraform uses Helm to install Istio, &lt;a href="https://www.getambassador.io"&gt;Ambassador Edge Stack&lt;/a&gt; (our API gateway), and Flux with its Helm Operator.&lt;/p&gt;

&lt;p&gt;When the Terraform module is applied, a fully working cluster will start up, with an Istio service mesh, an API gateway with a load balancer for ingress, and Flux preconfigured to connect to the Git repo that defines what should be deployed on that cluster. Flux will take over deploying everything else not deployed by Terraform, including monitoring tools and all of our production services.&lt;/p&gt;
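&lt;p&gt;For illustration, bootstrapping one of those core components from the EKS module could be done with the Terraform Helm provider, roughly like this (the Git URL is a placeholder):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# install Flux onto the new cluster, pointed at the GitOps repo
resource "helm_release" "flux" {
  name             = "flux"
  namespace        = "flux-system"
  create_namespace = true
  repository       = "https://charts.fluxcd.io"
  chart            = "flux"

  set {
    name  = "git.url"
    value = "git@github.com:example-org/cluster-state.git"  # illustrative repo
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;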

&lt;h2&gt;
  
  
  cycling through clusters ♻️
&lt;/h2&gt;

&lt;p&gt;With an easily Terraform-able EKS cluster that would start up and deploy all of the services we'd defined in code with Helm, we could create, destroy, and recreate our environments at will.&lt;/p&gt;

&lt;p&gt;That's great for dev environments where we can recreate the cluster often, but how do we upgrade production without causing downtime? Any outage of our services means we decline card transactions, upset customers, and cause the business to lose revenue. We need to do seamless upgrades.&lt;/p&gt;

&lt;p&gt;Like the old "&lt;a href="http://cloudscaling.com/blog/cloud-computing/the-history-of-pets-vs-cattle/"&gt;cattle not pets&lt;/a&gt;" mantra, we decided to treat each of our clusters as something disposable - rather than try risky in-place upgrades of Istio or other core components, we'd simply start a new cluster configured the way we want, and switch.&lt;/p&gt;

&lt;h2&gt;
  
  
  making the switch 🚦
&lt;/h2&gt;

&lt;p&gt;The key to this was our simple cluster ingress - &lt;em&gt;all&lt;/em&gt; customer-facing calls to our APIs go through Ambassador Edge Stack across one load balancer. Each cluster has its own load balancer, set up by Terraform.&lt;/p&gt;

&lt;p&gt;We set the EKS Terraform module to output the DNS of the load balancer to remote state, and created another Terraform module to handle &lt;a href="https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/resource-record-sets-values-weighted.html"&gt;weighted routing&lt;/a&gt; to those load balancers. This new module would create a fixed Route 53 entry with a CNAME that would resolve to a different load balancer address based on weighting we gave each.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  cluster_weighting = {
    "cluster-a" = "90"
    "cluster-b" = "10"
  }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For simplicity, we split a percentage weighting between clusters, and handle them as a map in Terraform. In the example above, CNAMEs are created for &lt;code&gt;cluster-a&lt;/code&gt; and &lt;code&gt;cluster-b&lt;/code&gt;, and 90% of traffic would resolve to &lt;code&gt;cluster-a&lt;/code&gt; and reach its load balancer. The fixed Route 53 record that served the weighted load balancer records was then used as the origin for the &lt;a href="https://aws.amazon.com/cloudfront/"&gt;CloudFront distribution&lt;/a&gt; that sits in front of our APIs.&lt;/p&gt;
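&lt;p&gt;Wiring that map up might look roughly like the following sketch (bucket, key layout, domain, and output names are illustrative) — one weighted record per cluster, pointing at the load balancer DNS each EKS module wrote to remote state:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# read each cluster's load balancer DNS from the shared state bucket
data "terraform_remote_state" "eks" {
  for_each = var.cluster_weighting
  backend  = "s3"
  config = {
    bucket = "example-terraform-state"   # illustrative shared bucket
    key    = "eks/${each.key}/terraform.tfstate"
    region = "eu-west-1"
  }
}

# one weighted CNAME per cluster behind the same fixed record name
resource "aws_route53_record" "api" {
  for_each = var.cluster_weighting

  zone_id        = var.zone_id
  name           = "api.example.com"     # fixed record used as the CloudFront origin
  type           = "CNAME"
  ttl            = 60
  set_identifier = each.key
  records        = [data.terraform_remote_state.eks[each.key].outputs.lb_dns_name]

  weighted_routing_policy {
    weight = tonumber(each.value)
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;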

&lt;h2&gt;
  
  
  going live 🚨
&lt;/h2&gt;

&lt;p&gt;We practiced this process &lt;em&gt;many&lt;/em&gt; times in our non-production environments before we made the switch in production, to the point where we had destroyed and recreated dozens of clusters in the months before we went live.&lt;/p&gt;

&lt;p&gt;In the end, &lt;a href="https://www.linkedin.com/posts/grrywlsn_yesterday-we-moved-all-of-curves-production-activity-6663749886594301952-fAL3"&gt;in a single day&lt;/a&gt; we moved all of our API traffic and all card payment transactions from our old Kops cluster to EKS, without dropping a single payment. We stepped up the percentage of traffic gradually at first with weighted routing until all traffic was migrated.&lt;/p&gt;

&lt;p&gt;This week we updated to the latest version of EKS and did the same process again, but this time we did the whole thing between morning standup and lunch. We're continuing to refine the process to the point that soon, we will make it fully automated. I'll post more on how that journey goes!&lt;/p&gt;

</description>
      <category>eks</category>
      <category>kubernetes</category>
      <category>aws</category>
      <category>sre</category>
    </item>
  </channel>
</rss>
