Karl Schriek

Introducing Snap CD: Why I Built a New Terraform Orchestrator

Anyone who has operated Terraform/OpenTofu at scale knows the pattern:

You start with one state file. It works great. Then it grows. And grows. Maybe your company also grows, so now multiple teams are deploying infrastructure. terraform plan used to take seconds - now it takes many minutes. A single change triggers a refresh of hundreds of resources. One team's DNS change blocks another team's application deployment. You start sweating every time terraform apply runs. All you want to do is add a tag to an Azure storage account, but the plan won't run through because one fringe resource with stale credentials is causing the refresh to fail! Or it does run through, but now the plan touches multiple resources you know nothing about, all of which will be modified by the apply.

The answer everyone arrives at is the same: break it up. Split your monolith into smaller, focused state files. Networking in one. DNS in another. Application infrastructure in a third. Give different teams the responsibility to manage different pieces.

But the moment you do that, you inherit a new problem: the dependencies between those pieces are no longer enforced, and no longer visible.

Your application module needs the vpc_id from your networking module. Your DNS module needs the load_balancer_arn from your application module. Suddenly you're stitching together terraform_remote_state data sources, writing wrapper scripts, building CI/CD pipelines with hard-coded dependency chains, and praying that someone doesn't deploy a networking change that deletes resources your application deployments depend on.
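
The manual stitching typically looks something like the sketch below (illustrative only: the backend type, bucket, key and output names are assumptions, not anything Snap CD specific):

# In the "application" root module: manually wire in an output from the
# networking state via a terraform_remote_state data source.
data "terraform_remote_state" "networking" {
  backend = "s3"
  config = {
    bucket = "my-terraform-states"          # illustrative
    key    = "networking/terraform.tfstate" # illustrative
    region = "eu-west-1"
  }
}

module "application" {
  source = "./modules/application"

  # This only works if the networking configuration exposes this output,
  # and nothing re-plans this module when the upstream value changes.
  vpc_id = data.terraform_remote_state.networking.outputs.vpc_id
}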

The dependency graph that Terraform/OpenTofu handles beautifully within a single state becomes your manual responsibility across states.

What I wanted

I wanted a system where I could:

  1. Break infrastructure into small, focused modules, each with its own state file, its own lifecycle, its own blast radius. Outputs from any module automatically become available as inputs to other modules, creating a declarative dependency system across my entire infrastructure.

  2. Have changes propagate automatically. When my "vpc" module produces a new private_subnet_id, downstream modules that consume it should re-plan and re-apply without manual intervention. It should also be a true GitOps orchestrator, meaning new commits or updated configuration should automatically trigger deployment.

  3. Keep my cloud credentials out of the control plane. The orchestrator should coordinate work, not execute it. Execution should happen on runners I deploy in my own infrastructure. I decide where they run, what access they have, and which modules are allowed to use them. I manage my state files in whatever remote location I am most comfortable with.

  4. Control access granularly. Infrastructure is organized into stacks (hard boundaries like "prod" and "dev"), then namespaces (logical groupings like "networking" or "storage"), then modules (individual deployments). I need role-based permissions assignable at every one of these levels, whether for service principals or users.

  5. Stay non-invasive. No proprietary runtimes, no lock-in at the execution layer. Runners should execute standard commands like terraform plan and terraform apply in a normal shell. I should be able to SSH into a runner's working directory and run commands manually if I need to.

  6. Manage everything as code. A Terraform Provider for the orchestrator itself, so that stacks, namespaces, modules, runners, secrets, role assignments etc. are all defined in HCL.

None of the existing tools delivered on all six of these, and eventually I realized that if I wanted a system like this, I would have to start building it myself.

How I solved it

This probably warrants a dedicated article by itself, but suffice it to say that for a software engineer with the interests I have (building cohesive solutions that consist of various interlocking systems) this was a wonderful project to work on. Few things I have done in my career have brought me as much satisfaction as this. It tapped into all the skills I had learnt over 20 years of software engineering and also demanded that I learn quite a few more!

It took me about six months to lay down the bare bones, then about another 12 months of iteration, testing (with pretty serious production infrastructure) and feature expansion. During this time I completely rewrote some of the core systems multiple times until I was happy with them.

With that being said, allow me to introduce Snap CD and explain how it ticks off the six requirements I mentioned above:

1. Modular deployments

Snap CD organizes infrastructure into three levels:

  • A module is a single Terraform/OpenTofu deployment. It points to code in a Git repo, has its own state file, and defines inputs and outputs.
  • A namespace groups related modules. Think "networking", "storage", "applications". Typically only one team would be responsible for a single namespace.
  • A stack is a hard boundary, such as "prod", "dev" or "staging". Namespaces are organized into stacks. Modules in different stacks don't influence each other.

Below is some very simple sample code using the Snap CD Terraform Provider to deploy a new namespace into an existing stack. Into the namespace we deploy two modules, "vpc" and "cluster", where the latter requires an output from the former as one of its inputs.

# Stack

data "snapcd_stack" "mystack" {
  name     = "my-stack"
}

# Namespace

resource "snapcd_namespace" "mynamespace" {
  name     = "my-namespace"
  stack_id = data.snapcd_stack.mystack.id
}

## Module 1 (VPC)

resource "snapcd_module" "vpc" {
  name                     = "vpc"
  namespace_id             = snapcd_namespace.mynamespace.id
  source_revision          = "main"
  source_url               = "https://github.com/snapcd-samples/mock-module-vpc.git"
  source_subdirectory      = ""
  runner_id                = data.snapcd_runner.my_runner.id
}

## Module 2 (Cluster)

resource "snapcd_module" "cluster" {
  name                     = "cluster"
  namespace_id             = snapcd_namespace.mynamespace.id
  source_revision          = "main"
  source_url               = "https://github.com/snapcd-samples/mock-module-kubernetes-cluster.git"
  source_subdirectory      = ""
  runner_id                = data.snapcd_runner.my_runner.id
}

resource "snapcd_module_input_from_output" "private_subnet_id" {
  input_kind       = "Param"
  module_id        = snapcd_module.cluster.id
  name             = "deploy_to_subnet_id" // The "cluster" module expects a variable called "deploy_to_subnet_id"
  output_module_id = snapcd_module.vpc.id
  output_name      = "private_subnet_id" // The "vpc" module produces an output called "private_subnet_id", which we map to "deploy_to_subnet_id"
}

NOTE that these are "mock" deployments, meant for illustration only. You can find the code here and here.

The dependency graph is the core of Snap CD. Since the "cluster" module has a snapcd_module_input_from_output that references an output from the "vpc" module, Snap CD knows that a dependency exists. No scripts. No CI/CD glue. The dependency graph is derived from the configuration itself.

[Image: dependency graph (DAG) between the modules]

2. Event-driven CD

Modules can trigger automatically based on multiple events:

  • Source changes: A new commit lands on a branch, or a new semantic version tag appears. Snap CD detects this (typically via polling jobs pushed to a runner, but manual notification webhooks are also supported) and triggers a deployment job.
  • Upstream output changes: When a dependency's outputs change, downstream modules re-deploy.
  • Definition changes: When you modify a module's configuration (e.g. via the Terraform Provider, or manually via the Portal), it triggers a sync.

You can also require manual approval before applies go through, with configurable approval thresholds. This lets you build workflows where plans run automatically but apply waits for human sign-off.

Let's consider again the code for the "cluster" module above. That module points to the "main" branch of the repo at "https://github.com/snapcd-samples/mock-module-kubernetes-cluster.git". Whenever new commits are pushed to this branch, Snap CD will automatically trigger a deployment.

Similarly if the "vpc" module outputs a new value for private_subnet_id, then the "cluster" module deployment will trigger.

Lastly, a change to the definition as follows would automatically trigger a deployment.

resource "snapcd_module" "vpc" {
  name                     = "vpc"
  namespace_id             = snapcd_namespace.mynamespace.id
  source_revision          = "main"
  - source_url               = "https://github.com/snapcd-samples/mock-module-vpc.git"
  + source_url               = "https://github.com/snapcd-samples/mock-module-another-vpc.git"
  source_subdirectory      = ""
  runner_id                = data.snapcd_runner.my_runner.id
}

Here we are changing the source_url, but any changes to the snapcd_module itself, or to any of its child resources such as snapcd_module_input..., snapcd_extra_file, snapcd_backend_config and so forth, would also automatically trigger a deployment!

3. Runner isolation

Snap CD's architecture cleanly separates orchestration from execution. The Server (snapcd.io) is the control plane - it handles configuration, dependency tracking, job management, and log/output storage. It never touches your cloud infrastructure directly. No AWS credentials, no Azure service principals, no GCP service accounts.

Runners are self-hosted agents that you deploy in a manner and location of your choosing. They connect to the Server over an authenticated WebSocket, pick up jobs, execute standard terraform plan and terraform apply etc., and report back with logs and outputs.

The Runner is an open-source component, published at github.com/schrieksoft/snapcd-runner.

We provide sample code for deploying runners locally, with Docker Compose, or on Kubernetes.

You configure your runners with whatever cloud credentials they need, and then you dictate which Snap CD modules are allowed to use them. For example, you may want to separate runners for "dev" and "prod", and/or for different cloud providers.

[Image: runner architecture]

Below is an example of how you would register a runner and assign it for use by modules within the namespace we created above.

data "snapcd_service_principal" "my_service_principal" {
  // fetch a pre-existing Service Principal (this must be created manually via the snapcd.io portal)
  name = "MyServicePrincipal"
}

resource "snapcd_runner" "my_runner" {
  name                       = "myrunner"
  service_principal_id       = data.snapcd_service_principal.my_service_principal.id
  is_assigned_to_all_modules = false
}

resource "snapcd_runner_namespace_assignment" "myrunner_mynamespace" {
  runner_id    = snapcd_runner.my_runner.id
  namespace_id = snapcd_namespace.mynamespace.id
}

Runners can be assigned to a single module, to an entire namespace, to an entire stack, or (by setting the is_assigned_to_all_modules flag to true) to the entire organization.
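
For the organization-wide case, no assignment resources are needed at all. A minimal sketch, reusing the attributes shown in the example above:

resource "snapcd_runner" "shared_runner" {
  name                       = "shared-runner"
  service_principal_id       = data.snapcd_service_principal.my_service_principal.id
  # With this flag set, any module in the organization may use the runner,
  # so no snapcd_runner_namespace_assignment resources are required.
  is_assigned_to_all_modules = true
}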

4. Permission system

Role-based access control is assignable at every level of the hierarchy: organization, stack, namespace, module, as well as to runners.

Users, service principals, and groups can all be scoped precisely.

In the example code below, we grant a user the Contributor role on the namespace we created above.


data "snapcd_user" "myuser" {
  user_name = "myuser@somedomain.com"
}

resource "snapcd_namespace_role_assignment" "myuser_contributor" {
  namespace_id            = snapcd_namespace.mynamespace.id
  principal_id            = data.snapcd_user.myuser.id
  principal_discriminator = "User" // Can be one of "User", "ServicePrincipal" or "Group"
  role_name               = "Contributor"
}


5. Non-invasive orchestration

Snap CD is not a Terraform/OpenTofu replacement. It doesn't parse HCL. It doesn't have its own resource model. Your modules are regular Terraform/OpenTofu modules. Your providers are regular Terraform/OpenTofu providers. If you stopped using Snap CD tomorrow, your infrastructure and state files would still be perfectly valid.

Snap CD also doesn't force proprietary tooling into your deployment process. Runners execute standard terraform plan and terraform apply in a normal shell. Snap CD provides the inputs - .env files, .tfvars files, scripts - and the runner executes them. If you needed to, you could navigate directly to a runner's working directory and run those commands manually.

6. Everything as code

Almost everything in Snap CD is managed via its own Terraform Provider. Stacks, namespaces, modules, runners, secrets, role assignments, etc. - all defined in HCL.
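
If you are starting from scratch, the only boilerplate you need before the examples above is the provider block itself. Below is a minimal sketch; the registry source address and authentication arguments are placeholders, so check the quickstart guide for the actual values:

terraform {
  required_providers {
    snapcd = {
      source = "snapcd-io/snapcd" # placeholder address, see the quickstart guide
    }
  }
}

provider "snapcd" {
  # Authentication (e.g. a service principal credential) is configured here;
  # the exact argument names are documented in the quickstart guide.
}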

For a more complete tutorial, see the quickstart guide or go directly to the sample deployment.

One of the interesting patterns made possible by the Terraform Provider is a module-within-module pattern. In other words, you can instruct Snap CD to deploy Snap CD modules, which in turn instruct Snap CD to deploy your actual resources!
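
As a rough sketch of what this looks like (the repository URL is hypothetical), a "parent" module simply points at a repo whose Terraform code declares further snapcd_module resources:

resource "snapcd_module" "platform" {
  name                = "platform"
  namespace_id        = snapcd_namespace.mynamespace.id
  source_revision     = "main"
  # Hypothetical repo: its code declares the snapcd_module resources for the
  # actual infrastructure (vpc, cluster, ...), which Snap CD then deploys.
  source_url          = "https://github.com/example-org/snapcd-platform-modules.git"
  source_subdirectory = ""
  runner_id           = data.snapcd_runner.my_runner.id
}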

Getting started

Snap CD is available as a hosted service at snapcd.io with a free community tier. The runner is open source and available on GitHub. We provide deployment instructions for local use, with Docker Compose, or on Kubernetes.

If you've ever stared at a sprawling Terraform monolith and thought "there has to be a better way to split this up" - that's exactly the problem Snap CD was built to solve. If you would like to try it out, here is a quickstart guide.
