DEV Community: Karl Schriek

Scaling Terraform Infrastructure Beyond a Single Team

Karl Schriek — Mon, 06 Jul 2026 12:38:25 +0000

When a single engineer manages all the Terraform in an organisation, everything is simple. One repo, one state, one pipeline, one set of credentials. There's no coordination overhead because there's no one to coordinate with.

That stops working the moment a second team needs to deploy infrastructure. And by the time you have three or four teams — networking, platform, application, security — the single-team model is actively slowing everyone down.

This guide covers what breaks, how teams typically work around it, and how to set up a structure where each team owns their slice of infrastructure independently.

What breaks

State lock contention

Terraform's state locking is per-state. When the networking team is running terraform plan, the application team's pipeline is blocked — even though they're changing completely unrelated resources. The more teams share a state, the more time everyone spends waiting.

Blast radius

A junior engineer deploying a new application service shouldn't be able to accidentally destroy the VPC. But if application resources and networking resources share a state, a single misconfigured terraform apply can touch anything. Code review catches some of this. Not all of it.

Credential sprawl

A shared pipeline needs credentials for everything — the networking team's Azure subscription, the application team's AWS account, the security team's DNS provider. Every team's secrets end up in one CI environment, accessible to anyone who can trigger a run. This fails most compliance audits.

Approval bottlenecks

In many organisations, one person or a small group gatekeeps all infrastructure changes. Every PR needs their review. Every apply needs their approval. The gatekeeper becomes a bottleneck not because they're slow, but because they're a single point of serialisation for all infrastructure work.

Backend access as implicit access control

Terraform has no built-in concept of per-team or per-workspace permissions. All workspaces in a backend share the same credentials, so giving a user access to one workspace implicitly grants access to all of them. There's been a long-standing request to support separate backend configurations per workspace, and a related request to allow variables in backend configuration blocks; both are still open. Teams that need isolation end up managing separate backends per team. That works, but now the cross-team dependency problem (how to pass outputs between backends) sits on top of the access control problem. Demand for a scalable multi-root-module architecture is real: OpenTofu's proposal to make terraliths a thing of the past has drawn significant community support.
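
A common stopgap while backend blocks can't take variables is Terraform's partial backend configuration: leave the backend block empty and supply the team-specific settings at init time. The following is a minimal sketch, assuming an S3 backend; the bucket, key, region, and file name are hypothetical.

```hcl
# backend.tf (the same file can live in every team's root module)
terraform {
  # Settings are intentionally omitted here; they are supplied at init time.
  backend "s3" {}
}

# team-a.s3.tfbackend (hypothetical file, one per team, never shared)
# Used with: terraform init -backend-config=team-a.s3.tfbackend
bucket = "team-a-terraform-state"
key    = "app-a/database/terraform.tfstate"
region = "eu-west-1"
```

This only removes hard-coded backend settings. Each team still needs its own credentials for its own bucket, and passing outputs between the resulting backends remains unsolved.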

Knowledge boundaries

The networking team understands route tables and peering. The application team understands container orchestration and databases. When both work in the same Terraform codebase, they need to understand each other's resources well enough to avoid breaking them. That cross-training is expensive and doesn't scale.

Typical approaches

Separate repos and pipelines per team

The most common first attempt: give each team their own repo, their own CI pipeline, and their own state backend. This solves the isolation problem but creates a new one — how do teams share outputs? The networking team produces a vpc_id that the application team needs.

Teams end up with one of:

Manual handoff: someone copies an output value into another team's terraform.tfvars. This is error-prone and doesn't trigger re-deploys when the upstream value changes.
terraform_remote_state: each consuming team configures a data source pointing at the producer's state backend. This tightly couples teams to each other's backend configuration, requires read access to the producer's entire state snapshot rather than just its outputs, and provides no change detection.
Shell scripts or CI glue: a pipeline runs terraform output on one state and feeds the result into terraform apply -var on the next. The dependency graph lives in CI configuration rather than in code, and it's fragile.
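
For reference, the terraform_remote_state option looks roughly like this on the consuming side. This is a sketch, assuming the networking team keeps its state in an S3 backend and exposes a vpc_id output; the bucket, key, region, and security group are hypothetical.

```hcl
# In the application team's root module.
data "terraform_remote_state" "networking" {
  backend = "s3"
  config = {
    bucket = "networking-terraform-state" # hypothetical bucket owned by the networking team
    key    = "prod/networking/terraform.tfstate"
    region = "eu-west-1"
  }
}

# Only root-module outputs of the producer are visible to the consumer.
resource "aws_security_group" "app" {
  name   = "app-a"
  vpc_id = data.terraform_remote_state.networking.outputs.vpc_id
}
```

Note that the application team's pipeline must hold credentials that can read the networking team's state bucket, which is exactly the coupling described above.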

Workspaces

Terraform workspaces let you run the same configuration against multiple state files. Some teams use this to give each team their own workspace. But workspaces don't solve cross-team dependencies — they're designed for multiple instances of the same infrastructure (dev, staging, prod), not for splitting ownership of different infrastructure.
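
A short sketch of what workspaces are designed for: one configuration, several instances that differ only by environment. The sizing values here are hypothetical.

```hcl
locals {
  # terraform.workspace is the name of the currently selected workspace.
  environment = terraform.workspace

  # Hypothetical per-environment sizing for one and the same configuration.
  instance_count = local.environment == "prod" ? 3 : 1
}

output "deployment_summary" {
  value = "${local.environment}: ${local.instance_count} instance(s)"
}
```

Every workspace here runs identical code against a different state file. Nothing in this model expresses "team A owns these resources and team B owns those".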

Terragrunt

Terragrunt adds a layer on top of Terraform that can manage dependencies between configurations. It works, but introduces its own complexity — terragrunt.hcl files, dependency blocks, wrapper commands. Teams now need to learn Terragrunt in addition to Terraform, and debugging requires understanding both layers. Your Terraform code also becomes coupled to Terragrunt's conventions.
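
For context, a Terragrunt dependency between two configurations is typically declared along these lines. This is a sketch, assuming a networking configuration that exposes a vpc_id output; all paths are hypothetical.

```hcl
# app-a/database/terragrunt.hcl
terraform {
  source = "../../modules/database" # hypothetical module path
}

# Terragrunt reads the outputs of the configuration at this path.
dependency "networking" {
  config_path = "../../platform/networking"
}

# Values passed to the Terraform module as input variables.
inputs = {
  vpc_id = dependency.networking.outputs.vpc_id
}
```

The wiring lives in terragrunt.hcl rather than in the .tf files, so it only takes effect when the configuration is run through the terragrunt command.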

Platform team as intermediary

Some organisations create a platform team that owns all the Terraform and exposes a simplified interface (YAML files, internal portals, or custom tooling) to application teams. This can work well, but it means application teams can't deploy infrastructure directly — they file tickets or submit YAML and wait. The platform team becomes the bottleneck instead.

A better structure

The goal is straightforward: each team owns their own Terraform modules with their own state, credentials, and approval workflows, while cross-team dependencies are handled automatically.

Define ownership boundaries

Start by mapping teams to infrastructure boundaries:

Platform team       → networking, DNS, shared services
Application team A  → their databases, caches, storage
Application team B  → their databases, queues, functions
Security team       → IAM policies, compliance resources, audit logging

Each boundary becomes an independent Terraform root module with its own state. The platform team's networking module produces outputs (vpc_id, subnet_ids) that the application teams consume as inputs.
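
In plain Terraform terms, the contract between two boundaries is just outputs on one side and input variables on the other. The sketch below assumes AWS and assumes that aws_vpc.main and aws_subnet.private are defined elsewhere in the networking module; those resource names and the file layout are hypothetical.

```hcl
# networking/outputs.tf (platform team's root module)
output "vpc_id" {
  description = "ID of the shared VPC"
  value       = aws_vpc.main.id
}

output "subnet_ids" {
  description = "IDs of the subnets application teams may deploy into"
  value       = aws_subnet.private[*].id
}

# app-a-database/variables.tf (application team's root module)
variable "vpc_id" {
  type        = string
  description = "Supplied by the platform team's networking module"
}

variable "subnet_ids" {
  type        = list(string)
  description = "Supplied by the platform team's networking module"
}
```

Neither module references the other's state or backend. How the values travel from the outputs to the variables is the part the rest of this guide is concerned with.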

Scope credentials per team

Each team's deployment environment should only have the credentials it needs. The platform team's Runner has access to the networking subscription. Application team A's Runner has access to their project's service account. No team has access to another team's cloud credentials.

This isn't just a security measure — it's an organisational one. When teams know they can't accidentally (or intentionally) touch resources outside their boundary, they move faster and with more confidence.

Scope approvals per team

The platform team should approve changes to networking. Application team A should approve changes to their own databases. Neither team should need the other's approval for changes within their boundary.

This requires an approval system that understands infrastructure boundaries — not just "can this user approve?" but "can this user approve changes to this specific Module?"

Wire dependencies declaratively

When the platform team changes a subnet, the application teams that depend on those subnets should automatically re-plan and re-deploy. This should happen without the platform team needing to notify anyone, without the application teams needing to poll for changes, and without a CI pipeline encoding the dependency graph in YAML.

How Snap CD handles this

Snap CD's architecture maps directly to the multi-team structure described above.

Modules as ownership units

Each team's Terraform root becomes a Snap CD Module. Modules are grouped into Namespaces within a Stack, creating a natural hierarchy:

resource "snapcd_stack" "prod" {
  name = "prod"
}

resource "snapcd_namespace" "platform" {
  name     = "platform"
  stack_id = snapcd_stack.prod.id
}

resource "snapcd_namespace" "app_a" {
  name     = "app-a"
  stack_id = snapcd_stack.prod.id
}

resource "snapcd_module" "networking" {
  name            = "networking"
  namespace_id    = snapcd_namespace.platform.id
  source_url      = "https://github.com/myorg/infra-networking.git"
  source_revision = "main"
  runner_id       = data.snapcd_runner.platform.id
}

resource "snapcd_module" "app_a_database" {
  name            = "database"
  namespace_id    = snapcd_namespace.app_a.id
  source_url      = "https://github.com/myorg/app-a-database.git"
  source_revision = "main"
  runner_id       = data.snapcd_runner.app_a.id
}

Scoped permissions

Snap CD's permission system lets you assign roles at any level of the hierarchy — Organization, Stack, Namespace, or individual Module:

# Platform team owns their Namespace
resource "snapcd_namespace_role_assignment" "platform_team" {
  principal_id            = snapcd_group.platform_team.id
  principal_discriminator = "Group"
  role_name               = "Owner"
  namespace_id            = snapcd_namespace.platform.id
}

# App team A owns their Namespace
resource "snapcd_namespace_role_assignment" "app_a_team" {
  principal_id            = snapcd_group.app_a_team.id
  principal_discriminator = "Group"
  role_name               = "Owner"
  namespace_id            = snapcd_namespace.app_a.id
}

# App team A can read platform Outputs (to see what's available)
resource "snapcd_namespace_role_assignment" "app_a_reads_platform" {
  principal_id            = snapcd_group.app_a_team.id
  principal_discriminator = "Group"
  role_name               = "Reader"
  namespace_id            = snapcd_namespace.platform.id
}

Each team can deploy, approve, and manage their own Modules without involving anyone else. They can read the platform team's Outputs but can't modify platform resources.

Isolated Runners

Each team deploys their own Runner with only the credentials they need:

The platform team's Runner has Azure Network Contributor credentials.
App team A's Runner has access to their specific resource group.
Neither Runner can access the other team's cloud resources.

Snap CD's permission system also controls which Modules can use which Runners, so even if a team tried to point their Module at the platform Runner, it would be denied.

Automatic dependency wiring

Cross-team dependencies are declared once and enforced automatically:

resource "snapcd_module_input_from_output" "vpc_id" {
  module_id        = snapcd_module.app_a_database.id
  input_kind       = "Param"
  name             = "vpc_id"
  output_module_id = snapcd_module.networking.id
  output_name      = "vpc_id"
}

When the platform team changes networking and the vpc_id Output updates, Snap CD automatically queues a re-plan for app team A's database Module. The app team's approval workflow decides whether to apply it. No manual handoff, no polling, no CI glue.

A practical example

An organisation with three teams:

Team	Namespace	Modules	Runner
Platform	`prod/platform`	networking, dns, shared-services	`runner-platform` (Azure Networking + DNS credentials)
App team A	`prod/app-a`	api-database, api-cache, api-storage	`runner-app-a` (Azure App A resource group credentials)
App team B	`prod/app-b`	worker-queue, worker-functions	`runner-app-b` (AWS App B account credentials)

Each team:

Owns their Namespace and everything in it.
Deploys using their own Runner with scoped credentials.
Approves their own changes without involving other teams.
Receives automatic re-plans when upstream dependencies change.

The platform team can ship a networking change without notifying anyone. Both app teams automatically re-plan if relevant Outputs changed. If nothing changed that affects them, nothing happens.
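
As an illustration, adding one more Module from the table for app team A and wiring it to the platform team's networking Outputs might look like the sketch below. It reuses only the resource types and attributes that appear in the snippets above; the repository URL is hypothetical, and it assumes the networking Module exposes a subnet_ids Output.

```hcl
# Assumes snapcd_namespace.app_a, snapcd_module.networking, and
# data.snapcd_runner.app_a exist as in the earlier snippets.
resource "snapcd_module" "app_a_cache" {
  name            = "api-cache"
  namespace_id    = snapcd_namespace.app_a.id
  source_url      = "https://github.com/myorg/app-a-cache.git" # hypothetical repo
  source_revision = "main"
  runner_id       = data.snapcd_runner.app_a.id
}

# The cache Module receives subnet_ids from the platform team's networking Module.
resource "snapcd_module_input_from_output" "cache_subnet_ids" {
  module_id        = snapcd_module.app_a_cache.id
  input_kind       = "Param"
  name             = "subnet_ids"
  output_module_id = snapcd_module.networking.id
  output_name      = "subnet_ids"
}
```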

Tips

Start with two teams, not five. Split the most obvious boundary first — usually platform vs. application. Add more boundaries as the need becomes clear.
Give each team a Namespace, not just Modules. Namespaces let you assign permissions once for the whole group rather than per-Module.
Use Reader roles for cross-team visibility. Teams should be able to see what other teams are deploying without being able to modify it.
Don't share Runners across trust boundaries. A Runner that has both prod networking and prod application credentials defeats the purpose of isolation.
Document the dependency graph. Even though Snap CD manages it automatically, teams should understand which of their Inputs come from other teams and what would trigger a re-plan.
Resist the urge to centralise approvals. If you've scoped permissions correctly, each team is qualified to approve their own changes. A central approval requirement reintroduces the bottleneck you're trying to eliminate.

Managing Terraform Across Multiple Cloud Providers

Karl Schriek — Mon, 06 Jul 2026 12:37:27 +0000

Most organisations don't live in a single cloud. You might run compute in AWS, DNS in Cloudflare, identity in Microsoft Entra ID (formerly Azure AD), and logging in GCP. Terraform handles each provider fine on its own, but the moment you need to coordinate across providers, the tooling fights you.

This guide walks through the common pain points of multi-cloud Terraform setups and the approaches teams use to cope — then shows how Snap CD makes cross-cloud dependency management a solved problem.

Where it gets difficult

Credential sprawl

Each cloud provider has its own authentication mechanism. AWS uses IAM roles and access keys. Azure uses service principals and managed identities. GCP uses service accounts and workload identity federation. A single Terraform state that spans providers needs credentials for all of them — which means your CI runner or developer workstation holds keys to everything.

That's a security problem. A compromised CI pipeline with AWS and Azure credentials exposes both clouds simultaneously. And it's an operational problem: rotating credentials means updating every pipeline that touches that state. The problem compounds at scale. Each Terraform provider configuration is bound to one set of credentials and runs as its own plugin process, so managing hundreds of accounts across clouds from one state means hundreds or thousands of provider processes, which quickly becomes unmanageable.

Provider version conflicts

Terraform providers are versioned independently. Upgrading the AWS provider to fix a bug in aws_eks_cluster shouldn't require you to also test a new version of the Azure provider. But when they share a state, terraform init -upgrade pulls the newest version allowed by your constraints for every provider, and a regression in one provider blocks all deployments. Terraform also can't create provider configurations in a loop (provider blocks don't accept count or for_each), and a module that uses for_each can't be handed a different provider configuration per instance, which makes multi-cloud configurations especially verbose and repetitive.
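
A sketch of what this looks like in a single shared root module. The version constraints, regions, and the ./modules/network path are hypothetical; the point is that every provider is pinned in one place and every extra region or account needs its own hand-written provider block and module call.

```hcl
terraform {
  required_providers {
    # Both providers are resolved together by `terraform init -upgrade`.
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0" # example constraint
    }
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 3.100" # example constraint
    }
  }
}

provider "azurerm" {
  features {}
}

# One aliased provider block per region; these cannot be generated in a loop.
provider "aws" {
  alias  = "us_east_1"
  region = "us-east-1"
}

provider "aws" {
  alias  = "eu_west_1"
  region = "eu-west-1"
}

# One module call per provider configuration, each passing its provider explicitly.
module "network_us" {
  source    = "./modules/network" # hypothetical local module
  providers = { aws = aws.us_east_1 }
}

module "network_eu" {
  source    = "./modules/network"
  providers = { aws = aws.eu_west_1 }
}
```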

Blast radius across clouds

A misconfigured terraform apply in a single-cloud state damages resources in one cloud. A misconfigured apply in a multi-cloud state can damage resources everywhere. The blast radius scales with the number of providers in the state.

Slow plans

By default, every terraform plan refreshes every resource in state. When your state contains resources across three clouds, the plan makes API calls to all three, and a slow or rate-limited API in any one of them holds up the whole run. In the worst case the costs add up: refreshes that take 30 seconds per cloud approach 90 seconds when they all sit in one state.

Typical approaches

Separate repos per cloud

The simplest split: one repo for AWS infrastructure, one for Azure, one for GCP. Each has its own state, its own CI pipeline, its own credentials.

infra-aws/        # VPCs, EKS, S3 buckets
infra-azure/      # AKS, Azure SQL, Key Vault
infra-gcp/        # GKE, Cloud SQL, BigQuery

This solves credential isolation and blast radius. But it introduces a new problem: cross-cloud dependencies. Your Azure DNS zone needs the IP address of an AWS load balancer. Your GCP logging sink needs the ARN of an AWS S3 bucket. These values have to flow between repos somehow.

Monorepo with directory-per-cloud

Keep everything in one repo but separate by directory. Each directory has its own state:

infra/
  aws/
    networking/
    compute/
  azure/
    dns/
    identity/
  gcp/
    logging/

Better for code organisation, but the dependency problem remains. You still need to pass outputs from aws/networking to azure/dns, and nothing in Terraform's native tooling handles that.
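
Giving each directory its own state usually comes down to a separate backend block per directory. The sketch below assumes an S3 backend for the AWS directories and an Azure Storage backend for the Azure ones; the bucket, resource group, and storage account names are hypothetical.

```hcl
# infra/aws/networking/backend.tf
terraform {
  backend "s3" {
    bucket = "my-terraform-state"
    key    = "aws/networking/terraform.tfstate"
    region = "us-east-1"
  }
}

# infra/azure/dns/backend.tf
terraform {
  backend "azurerm" {
    resource_group_name  = "rg-terraform-state" # hypothetical
    storage_account_name = "tfstateexample"     # hypothetical
    container_name       = "tfstate"
    key                  = "azure/dns/terraform.tfstate"
  }
}
```

The two states are now fully independent, which is the point, and also why an output in one is invisible to the other without extra wiring.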

`terraform_remote_state` across clouds

The built-in approach: each consuming state reads the producer's state directly.

data "terraform_remote_state" "aws_networking" {
  backend = "s3"
  config = {
    bucket = "my-terraform-state"
    key    = "aws/networking/terraform.tfstate"
    region = "us-east-1"
  }
}

resource "azurerm_dns_a_record" "api" {
  name                = "api"
  zone_name           = azurerm_dns_zone.main.name
  resource_group_name = azurerm_dns_zone.main.resource_group_name
  ttl                 = 300
  records             = [data.terraform_remote_state.aws_networking.outputs.load_balancer_ip]
}

This works but has the same drawbacks it always does:

Every consumer needs the backend configuration of every producer — including cross-cloud backend access (Azure state reading from an S3 bucket needs AWS credentials too).
No automatic re-plan when the upstream state changes. You have to trigger it manually or via CI glue.
The dependency graph lives in your head, not in code.

Wrapper scripts and CI orchestration

When terraform_remote_state gets too painful, teams write wrapper scripts:

#!/usr/bin/env bash
# Stop on the first failure so a failed apply never feeds an empty value downstream
set -euo pipefail

# Apply AWS networking first
cd infra/aws/networking
terraform apply -auto-approve

# Extract outputs
LB_IP=$(terraform output -raw load_balancer_ip)

# Apply Azure DNS with the output
cd ../../azure/dns
terraform apply -auto-approve -var="load_balancer_ip=$LB_IP"

Or they build multi-step CI pipelines that chain applies in order, passing outputs via pipeline variables or artifacts. This is fragile — the dependency graph is encoded in CI config, not infrastructure code. Adding a new dependency means editing the pipeline, not just the Terraform.

How Snap CD handles multi-cloud

Snap CD was built for exactly this problem. Each cloud's infrastructure becomes one or more Snap CD Modules, each assigned to a Runner with the appropriate credentials. The dependency graph is declared in code via Inputs, and Snap CD handles the orchestration.

One Runner per cloud

Deploy a Runner in each cloud environment with only the credentials it needs. Each Runner is a separate process — one running in AWS with an IAM role, one in Azure with a managed identity, one in GCP with workload identity federation. See Self-Hosted Terraform Runners with Credential Isolation for the full deployment model.

data "snapcd_runner" "aws" {
  name = "aws-runner"
}

data "snapcd_runner" "azure" {
  name = "azure-runner"
}

data "snapcd_runner" "gcp" {
  name = "gcp-runner"
}

Each Runner only has access to its own cloud. A compromised AWS Runner can't touch Azure resources.

Modules per cloud component

Each piece of infrastructure is a Module, assigned to the appropriate Runner:

resource "snapcd_module" "aws_networking" {
  name            = "networking"
  namespace_id    = snapcd_namespace.aws.id
  source_url      = "https://github.com/myorg/infra-aws-networking.git"
  source_revision = "main"
  runner_id       = data.snapcd_runner.aws.id
}

resource "snapcd_module" "azure_dns" {
  name            = "dns"
  namespace_id    = snapcd_namespace.azure.id
  source_url      = "https://github.com/myorg/infra-azure-dns.git"
  source_revision = "main"
  runner_id       = data.snapcd_runner.azure.id
}

Cross-cloud dependencies as code

The load balancer IP from AWS flows into Azure DNS via a snapcd_module_input_from_output:

resource "snapcd_module_input_from_output" "lb_ip_to_dns" {
  module_id        = snapcd_module.azure_dns.id
  input_kind       = "Param"
  name             = "load_balancer_ip"
  output_module_id = snapcd_module.aws_networking.id
  output_name      = "load_balancer_ip"
}

With this in place:

Snap CD knows to apply aws_networking before azure_dns.
When aws_networking is applied and its load_balancer_ip output changes, azure_dns automatically re-plans and re-applies.
No wrapper scripts. No CI orchestration. No cross-cloud terraform_remote_state.
The AWS Runner never needs Azure credentials and vice versa.

A practical multi-cloud example

A common pattern: compute in AWS, DNS and identity in Azure, logging in GCP.

Namespace: aws
  Module: networking     (Runner: aws-runner)
  Module: compute        (Runner: aws-runner)

Namespace: azure
  Module: identity       (Runner: azure-runner)
  Module: dns            (Runner: azure-runner)

Namespace: gcp
  Module: logging        (Runner: gcp-runner)

Dependencies:

# compute needs vpc_id from networking
resource "snapcd_module_input_from_output" "vpc_to_compute" {
  module_id        = snapcd_module.compute.id
  input_kind       = "Param"
  name             = "vpc_id"
  output_module_id = snapcd_module.networking.id
  output_name      = "vpc_id"
}

# dns needs load_balancer_ip from compute
resource "snapcd_module_input_from_output" "lb_to_dns" {
  module_id        = snapcd_module.dns.id
  input_kind       = "Param"
  name             = "load_balancer_ip"
  output_module_id = snapcd_module.compute.id
  output_name      = "load_balancer_ip"
}

# logging needs cluster_name from compute and subscription_id from identity
resource "snapcd_module_input_from_output" "cluster_to_logging" {
  module_id        = snapcd_module.logging.id
  input_kind       = "Param"
  name             = "cluster_name"
  output_module_id = snapcd_module.compute.id
  output_name      = "cluster_name"
}

resource "snapcd_module_input_from_output" "sub_to_logging" {
  module_id        = snapcd_module.logging.id
  input_kind       = "Param"
  name             = "azure_subscription_id"
  output_module_id = snapcd_module.identity.id
  output_name      = "subscription_id"
}

A commit to infra-aws-networking triggers a cascade: networking re-applies, compute re-plans if vpc_id changed, DNS re-plans if load_balancer_ip changed, and logging re-plans if cluster_name changed. Each step runs on the Runner with the right credentials. No manual intervention.

Comparison

Aspect	Single multi-cloud state	Separate repos + CI glue	Snap CD
Credential isolation	None — one set of creds for all clouds	Per-repo/pipeline	Per-Runner
Blast radius	All clouds	Single cloud	Single Module
Cross-cloud dependencies	Direct references	Scripts / CI variables	Declarative wiring
Automatic cascading	N/A (single state)	Manual triggers	Built-in
Plan speed	Slowest provider wins	Per-cloud	Per-Module

Tips

Start with one Runner per cloud. You can split further later (e.g., separate Runners for prod-aws and dev-aws), but one per cloud is the natural starting point.
Keep cross-cloud dependencies narrow. A handful of outputs flowing between clouds (IPs, ARNs, resource IDs) is normal. If you're passing dozens, you might have a boundary in the wrong place.
Use Namespaces to mirror your cloud structure. aws/networking, azure/dns, gcp/logging makes the dependency graph readable at a glance.
Don't share state backends across clouds. An S3 backend for AWS state and an Azure Storage Account for Azure state is fine — Snap CD manages the dependency graph, not the backends.

Why Snap CD: Non-invasive Orchestration

Karl Schriek — Sun, 05 Jul 2026 14:36:17 +0000

Most infrastructure CD tools ask you to change the way you write Terraform. Some require a proprietary wrapper CLI. Others impose a specific directory layout, inject custom backends, or parse plans through a format only they understand. The trade-off is always the same: you get orchestration, but your code now only works inside that tool's ecosystem.

Snap CD takes a different approach. It orchestrates deployments without modifying how Terraform runs. Your code stays portable, your commands stay standard, and nothing is hidden behind an abstraction you can't inspect.

The lock-in pattern

Infrastructure CD tools typically insert themselves between you and Terraform in one or more of these ways:

Wrapper CLIs. Instead of terraform plan, you run toolname plan or toolname run -- terraform plan. The wrapper intercepts the command, adds flags, manages state configuration, and sometimes alters the output. Your CI pipeline, your local workflow, and your debugging sessions all depend on the wrapper being present.

Proprietary plan formats. Some tools parse Terraform's plan output into their own internal representation for policy checks or approval workflows. When Terraform changes its plan format, which can happen between releases, you're waiting on the tool vendor to update their parser before you can upgrade.

Opinionated directory structures. A tool might require your repo to follow a specific layout: one directory per environment, a configuration file at the root describing which directories map to which workspaces, naming conventions that the tool uses to infer relationships. Reorganise your repo and the tool breaks.

Custom state backends. Some tools manage Terraform state themselves, replacing the S3/GCS/Azure backend you'd normally configure. This can simplify initial setup, but it means your state is locked inside the tool. Migrating away requires state surgery.

DSL layers. A few tools go further: you write configuration in a tool-specific language or templating system that generates Terraform code. At that point, you're not really writing Terraform anymore — you're writing input to a code generator.

Each of these creates a dependency. The more a tool wraps Terraform, the harder it is to leave, and the more your team needs to learn beyond Terraform itself.

What non-invasive means in practice

When Snap CD deploys a Module, here's what actually happens on the Runner:

1. Clone the Source

The Runner clones your Git repository (or downloads from a Terraform registry) into a local working directory. This is the same code you'd check out on your laptop.

The working directory follows a predictable path: ~/.snapcd/runner/<stack>/<namespace>/<module>. If the Module specifies a subdirectory within the repo, the Runner navigates into it before running any commands.

2. Provide Inputs through standard mechanisms

Snap CD writes the Inputs your Module needs into a .snapcd subdirectory using formats that Terraform already understands:

inputs.tfvars — Terraform variables (values wired from other Modules' outputs or from static configuration).
snapcd.env — environment variables your providers or scripts might need.
Shell scripts (init.sh, plan.sh, apply.sh, etc.) — wrap the Terraform commands with the correct flags and environment.

There's nothing proprietary about these files. The .tfvars file is a standard Terraform variable file. You can open it, read it, and pass it to terraform apply -var-file= yourself.
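
To make that concrete, here is a sketch of variable declarations in a module and a matching variable file. The exact contents Snap CD writes are not shown in this article, so the variable names and values below are hypothetical; the point is only that the file is ordinary .tfvars syntax.

```hcl
# variables.tf (in your module; no Snap CD-specific syntax)
variable "vpc_id" {
  type = string
}

variable "subnet_ids" {
  type = list(string)
}

# .snapcd/inputs.tfvars (hypothetical contents)
vpc_id     = "vpc-0123456789abcdef0"
subnet_ids = ["subnet-0aaa1111bbbb2222c", "subnet-0ddd3333eeee4444f"]
```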

3. Run standard Terraform commands

The Runner executes terraform init, then terraform plan, then (after approval) terraform apply. This is the real Terraform binary: not a wrapper, not a fork, not a shim. The Runner captures stdout and stderr and streams them back to the Snap CD Server for logging, but it doesn't intercept or alter the commands.

4. Collect Outputs

After a successful apply, the Runner runs terraform output -json and reports the results back to the Server. These Outputs become available as Inputs to dependent Modules. Standard Terraform, standard JSON.
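
On the module side, this only requires ordinary output blocks. A minimal sketch, assuming the module defines an aws_eip.lb resource elsewhere (a hypothetical name):

```hcl
# outputs.tf (in your module)
output "load_balancer_ip" {
  description = "Public IP that downstream modules (for example DNS) consume"
  value       = aws_eip.lb.public_ip # hypothetical resource defined elsewhere in the module
}

# `terraform output -json` reports each root-module output as an object
# with "value", "type", and "sensitive" keys, keyed by the output name.
```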

You can always drop to the shell

Because the Runner operates on a plain directory with real Terraform files, you can inspect and interact with it directly:

# SSH into the Runner host
ssh runner-prod

# Navigate to the Module's working directory
cd ~/.snapcd/runner/<stack>/<namespace>/<module>

# Look at what Snap CD prepared
ls -la
# main.tf
# variables.tf
# outputs.tf
# .snapcd/
#   inputs.tfvars    ← Inputs from Snap CD
#   snapcd.env       ← environment variables
#   init.sh          ← the init command Snap CD ran
#   plan.sh          ← the plan command Snap CD ran
#   apply.sh         ← the apply command Snap CD ran
#   output.sh        ← the output command Snap CD ran

# Run a plan yourself
terraform init
terraform plan -var-file=.snapcd/inputs.tfvars

This is useful for debugging ("why is this plan showing a diff?"), for one-off operations (terraform import, terraform state mv), and for building confidence that nothing magical is happening behind the scenes. The .snapcd directory contains a .gitignore that excludes all its contents, so none of these generated files pollute your repository.

Your code doesn't know about Snap CD

A Terraform module managed by Snap CD is identical to one that isn't. There's no snapcd {} block, no special provider, no required metadata annotation. You won't find a single line in your .tf files that reveals which tool deploys them. If you searched a managed module for the string "snapcd", you'd get zero results.

All of the orchestration configuration — which Runner deploys the Module, which Inputs to provide, which Outputs to wire to downstream consumers — lives in Snap CD itself, typically managed via the Terraform Provider for Snap CD. Your infrastructure code stays portable: it works with Snap CD, without it, or with something else entirely.

Contrast with the alternatives

Concern	Typical CD tool	Snap CD
How plans run	`toolname plan` or tool-managed wrapper	`terraform plan` (standard binary)
Input delivery	Tool-specific config files or API injection	`.tfvars`, environment variables, shell scripts
State management	Often tool-managed custom backend	Your existing backend (S3, GCS, Azure, etc.)
Directory structure	Must follow tool's conventions	Any structure — Snap CD points at your repo
Debugging	Through the tool's UI/logs only	SSH to Runner, inspect files, run commands
Leaving the tool	State migration, code restructuring	Change nothing — your code already works standalone
Plan format dependency	Tool must parse each TF version's plan format	No plan parsing — Snap CD reads Outputs, not plans

When this matters

The value of non-invasiveness shows up in specific moments:

Upgrading Terraform. You upgrade from 1.5 to 1.9. With Snap CD, you update the binary on your Runner and you're done. There's no intermediary that needs to understand the new plan format.
Debugging a failed apply. Instead of reading logs through a web UI and guessing, you SSH into the Runner, look at the actual files, and run the command yourself to reproduce the error.
Onboarding a new team member. They already know Terraform. They don't need to learn a wrapper CLI, a directory convention, or a configuration DSL. The Snap CD concepts — Modules, Stacks, Namespaces — are the orchestration layer; they don't change how Terraform itself works.
Evaluating alternatives. If you decide Snap CD isn't the right fit, your Terraform code doesn't need to change. Your state files are where they've always been. You take your code and go.

Why Snap CD: An Extensive Supporting Toolset

Karl Schriek — Sun, 05 Jul 2026 14:35:03 +0000

Snap CD ships with a full supporting ecosystem — documentation, a Terraform provider, deployment references, a guided sample, and a migration tool. This article walks through each one.

Documentation

The documentation site at docs.snapcd.io covers three layers:

Quickstart guides — step-by-step walkthroughs for both the Cloud and Self-Hosted editions. From zero to a working deployment in minutes.

Resource reference — detailed pages for every configurable resource: Stacks, Namespaces, Modules, Runners, Module Inputs, Secrets, Identity & Access Management, Agents, Missions, Integrations, Hooks, Flags, and more. Each page explains what the resource is, how it relates to other resources, and how to configure it.

Component documentation — architecture and operational details for the Server, Runner, and Agent (including Sidecars). Covers deployment topology, configuration settings, and the execution model.

Terraform Provider

The Snap CD Terraform Provider lets you manage all Snap CD configuration as code. Stacks, Namespaces, Modules, Runners, Sources, Inputs, Role Assignments, Agents, Missions, Integrations — everything you can configure in the dashboard, you can express in HCL.

resource "snapcd_namespace" "platform" {
  name     = "platform"
  stack_id = snapcd_stack.prod.id
}

resource "snapcd_module" "networking" {
  name            = "networking"
  namespace_id    = snapcd_namespace.platform.id
  source_url      = "https://github.com/example/infra.git"
  source_revision = "main"
  runner_id       = data.snapcd_runner.platform.id
}

resource "snapcd_module" "compute" {
  name            = "compute"
  namespace_id    = snapcd_namespace.platform.id
  source_url      = "https://github.com/example/infra.git"
  source_revision = "main"
  runner_id       = data.snapcd_runner.platform.id
}

resource "snapcd_module_input_from_output" "vpc_id" {
  module_id        = snapcd_module.compute.id
  input_kind       = "Param"
  name             = "vpc_id"
  output_module_id = snapcd_module.networking.id
  output_name      = "vpc_id"
}

This is standard Terraform — you plan it, review it, apply it. Your Snap CD configuration lives in version control, goes through code review, and is reproducible across environments. You don't click through a UI to set up a new environment — you copy a Terraform module and change the variables.

The module-within-module pattern

The provider enables a composition pattern: a Snap CD Module that deploys additional Snap CD Modules.

Say you have a platform team that maintains base infrastructure. Application teams each need their own set of Modules that depend on platform Outputs. Rather than manually creating Modules for each team, you write a Terraform module that creates a Snap CD Namespace, creates the application's Modules within it, and wires the Inputs from platform Outputs:

resource "snapcd_namespace" "app" {
  name     = var.app_name
  stack_id = var.stack_id
}

resource "snapcd_module" "database" {
  name         = "database"
  namespace_id = snapcd_namespace.app.id
  source_url   = var.database_source_url
  runner_id    = var.runner_id
}

resource "snapcd_module_input_from_output" "cluster_endpoint" {
  module_id        = snapcd_module.database.id
  input_kind       = "Param"
  name             = "cluster_endpoint"
  output_module_id = var.platform_compute_module_id
  output_name      = "cluster_endpoint"
}

Deploy this through Snap CD itself and you have a self-service system: the platform team defines the pattern once, and new applications are onboarded by adding an entry to a configuration file.

Reference Deployments

Snap CD components ship as Docker images and as zipped binaries on GitHub Releases. Three reference deployment repositories cover every common substrate, each containing a components/ directory with one self-contained sub-deployment per component (Server, Runner, Agent):

Substrate	Repository
Docker / Compose	schrieksoft/snapcd-deployment-docker
Kubernetes (Kustomize)	schrieksoft/snapcd-deployment-kubernetes
Local (native binaries)	schrieksoft/snapcd-deployment-local

You can bring up all three components together, or just the one you need — a Runner pointed at the Cloud edition, an Agent attached to a remote Server, etc. The images are version-pinned, the environment variables are documented, and each repo's README walks through both shapes.

These are the same deployment specifications used to run the Snap CD Cloud offering — not simplified demo versions.

Sample Deployment

The sample-deployment repository is a guided walkthrough that creates a realistic set of Snap CD resources using the Terraform Provider. It deploys four Modules with mock resources (no real cloud infrastructure needed) arranged in a dependency graph:

       |-----> cluster  ----- |
vpc ---|                      | ---> app
       |-----> database ----- |

The sample is organized into numbered sections, each introducing a new resource type with inline commentary explaining the reasoning:

Stack and Namespace — snapcd_namespace, snapcd_namespace_input_from_literal, snapcd_namespace_hook
Module and literal Inputs — snapcd_module, snapcd_module_input_from_literal (both Param and EnvVar kinds), snapcd_module_hook
Output wiring — snapcd_module_input_from_output (single Output), approval thresholds
Output Sets — snapcd_module_input_from_output_set (all Outputs by name match), snapcd_module_terraform_flag
Secrets and non-string types — snapcd_module_input_from_secret, type = "NotString" for numeric values
Agents and Missions — snapcd_agent_namespace_supply, snapcd_namespace_mission (SummarizeJob, AutoDiagnose, ApprovalRecommend)

The sample is meant to be copied and adapted. After completing the Self-Hosted Quickstart, you can terraform apply the sample and have a working multi-Module environment with dependency wiring, approval gates, hooks, secrets, and AI Missions configured.

Demonolith — monolith migration tool

Demonolith is a Go CLI that refactors a monolithic Terraform/OpenTofu root into independent per-module roots — the first step toward managing them with Snap CD.

The problem it solves: a single-root monolith gets slow, risky, and coupled. A one-line change re-plans everything, and one broken resource can block unrelated ones. Splitting it by hand is error-prone — you need to move resources, carve state, create variable/output boundaries at every cross-module reference, and verify that nothing is inadvertently recreated.

Demonolith automates all of this. You annotate your resources with decorator comments indicating which Module each belongs to:

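# @demono:move networking
resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"
}

# @demono:move networking
resource "aws_subnet" "public" {
  vpc_id     = aws_vpc.main.id
  cidr_block = "10.0.1.0/24"
}

# @demono:move compute
resource "aws_instance" "web" {
  # ami, instance_type, etc. omitted for brevity
  subnet_id = aws_subnet.public.id # crosses the networking/compute boundary
}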

Then run it:

# Emit carved roots (code only, offline):
demonolith split ./infra

# Also carve state into per-module local files:
demonolith split ./infra --state

# Carve + prove every module plans to zero create/destroy:
demonolith split ./infra --state --proof

The pipeline:

Parse — builds a resource-level reference graph via AST traversal (not regex — it catches refs inside templatefile(), jsonencode(), and index expressions).
Place — resolves decorators into a total assignment. Undecorated resources fall to a configurable remainder module.
Boundary — references crossing module boundaries become variable/output pairs. depends_on-only edges become ordering dependencies (no spurious value wiring).
Cycle gate — refuses impossible splits with a named cycle path.
Emit — writes per-module roots via hclwrite (formatting preserved), rewrites cross-module references to var.<input>, propagates providers and locals.
State carve — terraform state mv over local copies. Never touches the real backend.
Proof — walks modules in topological order, threads each producer's extracted outputs into its consumers' inputs (the role Snap CD plays at runtime), and plans each against its carved state. Zero creates and zero destroys = the split is operationally inert.
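
The Cycle gate and the topological walk in the Proof step both reduce to ordering a module dependency graph. The following is a minimal Python sketch of that idea using the standard library's graphlib; it is not Demonolith's actual Go implementation, and the function name plan_order and the example module names are assumptions made for illustration.

```python
from graphlib import CycleError, TopologicalSorter


def plan_order(deps: dict[str, set[str]]) -> list[str]:
    """Order modules so every producer comes before its consumers.

    `deps` maps each module to the set of modules whose outputs it consumes.
    Raises ValueError naming the cycle when no valid order exists.
    """
    try:
        return list(TopologicalSorter(deps).static_order())
    except CycleError as err:
        # graphlib reports the offending cycle as the second exception argument.
        cycle_path = " -> ".join(err.args[1])
        raise ValueError(f"split refused, dependency cycle: {cycle_path}") from err


if __name__ == "__main__":
    # Hypothetical graph: compute consumes an output of networking.
    print(plan_order({"networking": set(), "compute": {"networking"}}))
```

A graph in which two modules consume each other's outputs has no valid order, so the function raises instead of returning a partial result, which mirrors the "refuse with a named cycle path" behaviour described above.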

The carved roots are plain Terraform — valid standalone, with the cross-module edges being exactly the wiring you'd configure in Snap CD via snapcd_module_input_from_output.

Why Snap CD: AI on a Leash

Karl Schriek — Sun, 05 Jul 2026 13:58:23 +0000

AI coding agents are showing up in infrastructure workflows. They can diagnose a failed terraform apply, summarise what changed across a dozen modules overnight, draft a fix for a misconfigured security group, and recommend whether a plan is safe to approve. The potential to eliminate toil is real.

But so is the potential to break things. A bad terraform apply can delete a production database. An agent that auto-approves plans without understanding blast radius is not a productivity tool — it's a liability. The question isn't whether to use AI in infrastructure management. It's how to use it without handing over the keys.

The problem with unrestricted agents

Most AI agent frameworks assume broad access. Give the agent credentials, point it at your infrastructure, and let it figure things out. This works fine for generating code in a branch. It's a terrible model for infrastructure, where the gap between "run this command" and "destroy this resource" is one flag.

The usual mitigations are crude:

Read-only API keys. The agent can observe but not act. You get diagnostics but no automation — the human still has to do everything.
Wrapper scripts with allow-lists. You write a shell script that only permits certain Terraform commands. Fragile, hard to maintain, and easy to outgrow.
Separate CI pipelines. The agent commits to a branch, CI runs the plan, a human reviews. This works but adds latency and doesn't let the agent participate in approval or deployment at all.

None of these give you a spectrum of trust. It's all-or-nothing: either the agent can do everything, or it's limited to generating text that a human has to act on manually.

What you actually want

A useful model looks more like how you'd onboard a new team member:

Start them with read access so they can learn the system and diagnose issues.
Give them deploy access to test so they can move fast without risk.
Let them approve low-risk changes in staging once they've proven reliable.
Grant production access only when trust is established — and even then, scoped to the systems they own.

The same progression makes sense for an AI agent. The challenge is finding a system that supports this without building a separate authorization layer just for AI.

How Snap CD handles this

Snap CD has first-class support for AI agents — but rather than inventing a parallel permission system, it treats an Agent as a principal governed by the same RBAC that controls human access. The result is two complementary layers: a Missions framework that gives agents a narrow, event-driven interface to the deployment lifecycle, and a Permission System that controls what those agents (or any other AI) can actually do.

The Agent component and Missions

Snap CD's Agent is a self-hosted process that consumes deployment events and runs AI-driven Missions. Rather than giving an AI broad access and hoping it does the right thing, Missions provide a narrow frame — each Mission type is bound to a specific trigger:

Mission	Trigger	What it does
`AutoDiagnose`	Job fails	Posts a root-cause hypothesis with relevant log excerpts and suggested next steps
`AutoFix`	Job fails	Attempts an automated fix based on the diagnosis: a fix pull request for code errors, or a re-triggered Job for transient failures
`ApprovalRecommend`	Job reaches approval-required state	Analyzes the plan output and recommends whether to approve
`SummarizeJob`	Job succeeds	Generates a human-readable summary of what changed

You create an Agent, assign it a Service Principal (which determines what it can do via RBAC), and supply it to the scopes it should serve — the same supply model used for Runners:

resource "snapcd_agent" "ai" {
  name                       = "ai-agent"
  service_principal_id       = data.snapcd_service_principal.ai_agent.id
  is_supplied_to_all_modules = false
}

resource "snapcd_agent_stack_supply" "test" {
  agent_id = snapcd_agent.ai.id
  stack_id = snapcd_stack.test.id
}

resource "snapcd_stack_mission" "diagnose_test" {
  stack_id     = snapcd_stack.test.id
  agent_id     = snapcd_agent.ai.id
  mission_type = "AutoDiagnose"
}

This sets up an Agent that auto-diagnoses any failed Job in the test Stack. Without a supply covering prod, the Agent won't receive Missions there — even if someone accidentally creates a prod-scoped Mission for it. You can scope Missions down to individual Namespaces or Modules.
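
The rule that a Mission is only dispatched when both a supply and a Mission cover the failing Module can be modelled in a few lines. This is a hedged Python sketch of that rule, not Snap CD's server logic: representing scopes as tuples of names and the helper names covers and should_dispatch are assumptions for illustration.

```python
from dataclasses import dataclass

# Assumed representation: ("test",) is a Stack, ("test", "networking") a
# Namespace, and ("test", "networking", "vpc") a Module.
Scope = tuple[str, ...]


def covers(scope: Scope, module: Scope) -> bool:
    """A scope covers a module when it is the module itself or one of its ancestors."""
    return module[: len(scope)] == scope


@dataclass(frozen=True)
class Mission:
    scope: Scope
    mission_type: str


def should_dispatch(module: Scope, mission_type: str,
                    supplies: list[Scope], missions: list[Mission]) -> bool:
    """Dispatch only when the Agent is supplied to the module AND a matching Mission exists."""
    supplied = any(covers(s, module) for s in supplies)
    configured = any(
        m.mission_type == mission_type and covers(m.scope, module) for m in missions
    )
    return supplied and configured


if __name__ == "__main__":
    supplies = [("test",)]  # the Agent is supplied to the test Stack only
    missions = [Mission(("test",), "AutoDiagnose"), Mission(("prod",), "AutoDiagnose")]
    print(should_dispatch(("test", "networking", "vpc"), "AutoDiagnose", supplies, missions))
    print(should_dispatch(("prod", "networking", "vpc"), "AutoDiagnose", supplies, missions))
```

In this model the accidental prod-scoped Mission has no effect, because no supply covers any prod Module.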

Scoped role assignments

The Agent's Service Principal still needs the appropriate RBAC role to perform its actions. You grant roles at whatever granularity makes sense — scoped to a Stack, Namespace, or Module:

# The agent can read everything in prod — diagnose issues, view plans, inspect state
resource "snapcd_stack_role_assignment" "agent_prod_reader" {
  principal_id            = data.snapcd_service_principal.ai_agent.id
  principal_discriminator = "ServicePrincipal"
  role_name               = "Reader"
  stack_id                = snapcd_stack.prod.id
}

# The agent can deploy freely in test — run plans, apply
resource "snapcd_stack_role_assignment" "agent_test_contributor" {
  principal_id            = data.snapcd_service_principal.ai_agent.id
  principal_discriminator = "ServicePrincipal"
  role_name               = "Contributor"
  stack_id                = snapcd_stack.test.id
}

# The agent can manage jobs in staging, but only for the networking namespace
resource "snapcd_namespace_role_assignment" "agent_staging_jobs" {
  principal_id            = data.snapcd_service_principal.ai_agent.id
  principal_discriminator = "ServicePrincipal"
  role_name               = "JobManager"
  namespace_id            = snapcd_namespace.staging_networking.id
}

If the agent tries to approve a production deploy, it gets a permission-denied error, the same as any user without the right role on that scope. No special-case logic, no wrapper scripts.

Approval gates as natural checkpoints

Snap CD's approval system works the same regardless of who (or what) created the plan. A Module can require a minimum number of approvals before an apply proceeds. This means:

An agent can trigger a plan and recommend approval (via the ApprovalRecommend Mission).
A human reviews the plan output and approves or rejects.
The apply only proceeds once the required approval count is met.

You can also set up a workflow where the agent itself is one of multiple required approvers. Two humans and one agent, or two agents and one human — whatever quorum makes sense for the risk level. The approval system doesn't care whether the approver is biological.

Full audit trail

Every action an agent takes — triggering a plan, approving a deployment, reading state — is logged and attributed to its Service Principal. When the Service Principal is attached to an Agent resource, Snap CD stamps an agent_id claim on the token, so the audit log distinguishes between "service principal X acting as agent Y" and "service principal X acting as a plain service account." You can answer "what did the agent do last Tuesday?" the same way you'd answer it for any user: check the audit log.

Integrations: pushing events to external systems

Missions and permissions govern what an agent can do. Integrations govern what your team sees. An Integration connects Snap CD to an external system — Slack is the first supported sink — and delivers notifications for deployment lifecycle events and Mission milestones.

Like Agents and Runners, Integrations use a supply model: you supply the Integration to the scopes it should serve, then subscribe specific Integration Events (triggers) at those scopes. A notification is delivered only when both the supply and the subscription exist.

data "snapcd_integration" "alerts" {
  name = "alerts"
}

# The integration serves every module in the production stack
resource "snapcd_integration_stack_supply" "prod" {
  integration_id = data.snapcd_integration.alerts.id
  stack_id       = snapcd_stack.production.id
}

# Notify Slack when any job in the production stack fails
resource "snapcd_stack_integration_event" "failed" {
  stack_id       = snapcd_stack.production.id
  integration_id = data.snapcd_integration.alerts.id
  trigger        = "JobFailed"
}

# Notify Slack when a Mission reports a milestone (diagnosis, fix attempt, etc.)
resource "snapcd_stack_integration_event" "milestone" {
  stack_id       = snapcd_stack.production.id
  integration_id = data.snapcd_integration.alerts.id
  trigger        = "MissionMilestoneReported"
}

Available triggers include JobSucceeded, JobFailed, JobAwaitingApproval, JobCancelled, and MissionMilestoneReported. You can scope subscriptions at the Organization, Stack, Namespace, or Module level and use optional message templates with tokens like {{moduleName}}, {{jobUrl}}, {{missionType}}, and {{message}}.
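
To make the delivery rule and the template tokens concrete, here is a small Python sketch. It only models the behaviour described above (a notification needs both a supply and a subscription, and {{token}} placeholders are substituted); the function names, the choice to leave unknown tokens untouched, and the example template are assumptions, not Snap CD's actual rendering code.

```python
import re
from typing import Optional

TOKEN = re.compile(r"\{\{(\w+)\}\}")


def render(template: str, values: dict[str, str]) -> str:
    """Replace {{token}} placeholders; unknown tokens are left as-is (an assumption)."""
    return TOKEN.sub(lambda m: values.get(m.group(1), m.group(0)), template)


def notification(trigger: str, values: dict[str, str], *,
                 supplied: bool, subscriptions: dict[str, str]) -> Optional[str]:
    """Return the message to deliver, or None when the supply or the subscription is missing.

    `subscriptions` maps a trigger name to its message template.
    """
    if not supplied or trigger not in subscriptions:
        return None
    return render(subscriptions[trigger], values)


if __name__ == "__main__":
    subs = {"JobFailed": "Apply failed on {{moduleName}}: {{jobUrl}}"}
    event = {"moduleName": "vpc", "jobUrl": "https://mydomain.com/jobs/abc-123"}
    print(notification("JobFailed", event, supplied=True, subscriptions=subs))
    print(notification("JobSucceeded", event, supplied=True, subscriptions=subs))
```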

For Mission milestones, Snap CD threads all updates from a single Mission run under one Slack message. This means a multi-step AutoFix run — diagnosis, attempted fix, retry — shows up as a single threaded conversation rather than a spray of unrelated messages.

End-to-end: AutoFix in action

Here's what happens when a deployment fails and AutoFix is configured. The setup:

A test Stack with an AutoFix Mission configured for the Agent
The Agent's Service Principal has Contributor on the test Stack
A Slack Integration is supplied to the Stack with JobFailed, JobSucceeded and MissionMilestoneReported triggers

Someone pushes a commit that introduces a typo in a Terraform variable name. Snap CD detects the source change and triggers a plan-and-apply Job on the affected Module. The apply fails.

1. Slack: Job failure notification

The JobFailed Integration Event fires. Slack receives a message:

❌ Apply failed on vpc (test/networking)
https://mydomain.com/jobs/abc-123

2. AutoFix Mission dispatched

The Server dispatches an AutoFix Mission to the Agent. The Agent routes it to its Sidecar (e.g. the claude-sidecar), which works through a structured sequence: read the job logs via Snap CD's MCP server, diagnose the root cause, clone the source repo, make the minimal fix, and open a pull request. The Sidecar never pushes to the default branch directly — it always creates a fix branch and opens a PR. As it works through each step, it emits milestone events that stream back through the Agent to the Server.

Slack: AutoFix milestones (threaded under the failure message)

🔧 AutoFix — Job on vpc failed — investigating.

🔧 AutoFix — Root cause: variable vnet_cidr_block referenced in main.tf:42 does not exist. The variable was renamed to vnet_address_space in the latest commit but the reference was not updated. Fixing.

🔧 AutoFix — Opened PR: https://github.com/example/vpc-module/pull/47

3. Human merges the PR

A human reviews the PR, sees the one-line fix, and merges it. Snap CD detects the source change on the tracked branch and automatically triggers a new plan-and-apply Job.

4. Retry succeeds

The new Job runs. The apply succeeds.

Slack: success notification

✅ Apply succeeded on vpc (test/networking)
https://mydomain.com/jobs/def-456

The entire sequence — failure, diagnosis, fix PR, human merge, retry, success — plays out across one Slack thread and a GitHub PR. A human glancing at the channel sees the original failure, the agent's reasoning, and where to review the fix. Every action is attributed to the Agent's Service Principal in the audit log.

If the failure had been transient (a provider rate limit or timeout), the AutoFix Mission would have simply re-triggered the Job — no code change, no PR. And if the root cause wasn't something the agent could safely fix in-repo (expired credentials, state drift, a defect in a referenced module), it would degrade to a diagnosis with a recommended manual action.

Now contrast this with the same Agent on the prod Stack, where it is only configured with the AutoDiagnose Mission (not AutoFix). The same failure would produce a diagnosis and stop there: no fix attempt, no PR, no retry. The agent reports what went wrong and a human takes it from there.

Bring your own AI

The Missions framework is the canonical way to let AI participate in your deployment lifecycle. But Snap CD doesn't force you into it. If you prefer to use your own AI agent — or a different orchestration framework — you can have it interact with Snap CD's REST API directly using a plain Service Principal with Role Assignments. The same RBAC governs what the API caller can do, regardless of whether it's a human, a CI bot, or an LLM.

There are two authentication approaches:

1. Plain Service Principal — register a Service Principal, assign it the appropriate roles, and have your agent authenticate with the Client ID / Client Secret pair via the standard OAuth token endpoint:

POST /connect/token
Content-Type: application/x-www-form-urlencoded

grant_type=client_credentials&client_id=<org_id>:<client_id>&client_secret=<client_secret>

The returned bearer token is then attached to subsequent API requests as an Authorization: Bearer <token> header, letting your agent call any endpoint its roles permit — trigger plans, read job logs, post approvals, etc.

2. Service Principal with Agent identity — if you want Snap CD to recognize that API calls are coming from a specific Agent (for audit trail purposes), attach the Service Principal to an Agent resource and pass the agent_id parameter when requesting the token:

POST /connect/token
Content-Type: application/x-www-form-urlencoded

grant_type=client_credentials&client_id=<org_id>:<client_id>&client_secret=<client_secret>&agent_id=<agent_guid>

The returned token carries an agent_id claim. Snap CD uses this to attribute API calls to the named Agent in the audit log, making it clear which actions were taken by AI versus humans or other service accounts.
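
Both token requests differ only in the optional agent_id parameter, so a client can build them with one helper. The sketch below uses only Python's standard library and constructs the request without sending it; the base URL, the function name build_token_request, and the example credentials are placeholders, and the response format is not shown in this article, so check it against the API documentation before relying on a particular field name.

```python
from typing import Optional
from urllib.parse import urlencode
from urllib.request import Request


def build_token_request(base_url: str, org_id: str, client_id: str,
                        client_secret: str, agent_id: Optional[str] = None) -> Request:
    """Build the client-credentials POST to /connect/token.

    Passing `agent_id` produces the second variant (token carrying an agent_id claim).
    """
    form = {
        "grant_type": "client_credentials",
        "client_id": f"{org_id}:{client_id}",
        "client_secret": client_secret,
    }
    if agent_id is not None:
        form["agent_id"] = agent_id
    return Request(
        url=f"{base_url.rstrip('/')}/connect/token",
        # urlencode percent-encodes the ":" in client_id; the server decodes it back.
        data=urlencode(form).encode("ascii"),
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


if __name__ == "__main__":
    # Hypothetical host and credentials, for illustration only.
    req = build_token_request("https://snapcd.example.com", "my-org", "my-client", "my-secret")
    print(req.get_method(), req.full_url)
    print(req.data.decode())
```

Sending the request is then a matter of passing it to urllib.request.urlopen and attaching the returned token as an Authorization: Bearer header on later calls.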

A reasonable pattern: give the agent broad permissions in test, narrow permissions in prod.

test/          → Contributor (full deploy access)
staging/       → JobManager (can manage jobs, scoped to specific namespaces)
prod/          → Reader (observe only)

In test, the agent can deploy freely — run plans, approve them, apply them. It iterates fast, catches issues early, and doesn't need human intervention for routine changes. In staging, it participates in the approval process but can't act unilaterally on sensitive namespaces. In prod, it can diagnose and report but never modify.

This isn't a rigid hierarchy. You can adjust per Namespace or per Module. Maybe the agent gets Contributor on prod/monitoring because deploying a new dashboard is low-risk, while prod/database stays human-only. The permission system is granular enough to express whatever trust model you need.

Why Snap CD: A Permission System Built for Infrastructure

Karl Schriek — Fri, 03 Jul 2026 07:12:48 +0000

Most infrastructure teams handle access control in one of two places: the CI/CD layer or the cloud provider's IAM layer. Neither maps well to how infrastructure is actually structured.

CI permissions are usually binary — you can trigger a pipeline or you can't. There's no concept of "this person can deploy networking but not databases." Cloud IAM is more granular, but it governs what credentials can do, not what people can do within your deployment workflow. You end up with a gap: the system that understands your infrastructure topology has no permission model, and the system that has a permission model doesn't understand your infrastructure topology.

Snap CD sits in that gap. It provides a hierarchical role-based access control system that maps directly to the way you organise your infrastructure (Stacks, Namespaces, Modules, Runners, Agents, and Integrations) and enforces it uniformly whether actions come through the web dashboard, the API, or the Terraform Provider.

The two common approaches and where they break down

CI/CD gating

The simplest form of infrastructure access control: who can trigger the pipeline?

Most CI systems give you repository-level permissions. If you have write access to the repo, you can trigger the workflow. Some offer environment-level protection rules — require approval from a specific team before deploying to prod.

This works until your infrastructure spans multiple repositories, or until you need more granularity than "can deploy to this environment." Can this person create new Modules but not delete existing ones? Can they approve a plan but not trigger an apply? CI systems don't model these distinctions.

There's also the backdoor problem. Protection rules only apply to CI-triggered runs. Anyone with the right credentials can run terraform apply from their laptop and bypass every gate you've set up.

Cloud IAM

Cloud providers have sophisticated permission systems — Azure RBAC, AWS IAM, GCP IAM. These control what API calls a principal can make against cloud resources. But they operate at the wrong abstraction level for deployment workflows.

Cloud IAM doesn't know that your VPC, subnets, and route tables form a logical "networking" group that one team owns. It doesn't know that module-compute depends on module-networking and should only be deployable after networking is stable. It can tell you whether a service principal can create an EC2 instance, but it can't tell you whether a human should be allowed to approve the plan that creates it.

You end up encoding deployment permissions across multiple systems — repo access in GitHub, environment protection rules in Actions, IAM policies in AWS — with no single place to answer "who can do what to which part of my infrastructure?"

Snap CD's permission model

Snap CD's permission system is built around two ideas: roles describe what you can do, and scope determines where you can do it.

Principals

Three types of identity can hold role assignments:

Users — human operators, authenticated via the identity provider.
Service principals — machine identities for automation, CI pipelines, and API integrations.
Groups — collections of users or service principals, for managing permissions at team scale.

Roles

Roles define a set of allowed operations. The same role names appear across different scope levels, with context-appropriate permissions:

Owner — full control, including the ability to delete the resource and manage role assignments on it.
Contributor — create, update, and manage child resources, but cannot delete the resource itself or manage role assignments.
Reader — read-only access.
IdentityAccessManager — can manage role assignments on this resource without having full Owner control.

Additional roles exist at specific scope levels:

StackCreator (organization) — can create new Stacks.
NamespaceCreator (Stack) — can create new Namespaces within the Stack.
ModuleCreator (Namespace) — can create new Modules within the Namespace.
Approver — can approve deployment plans.
JobManager — can manage deployment jobs (cancel, retry).
SourceChangeNotifier — can notify the system of source changes (used by webhooks and CI integrations).

Scope hierarchy

Role assignments are scoped to a specific level in the hierarchy. Permissions granted at a higher level flow down to all children:

Organization
  └── Stack (e.g. "prod", "test")
        └── Namespace (e.g. "prod/networking", "prod/application")
              └── Module (e.g. "prod/networking/vpc")

Runners, Agents, and Integrations sit outside this hierarchy — they each have their own scope. A Runner's Owner controls which Modules are allowed to execute on it. An Agent's Owner controls which scopes it can serve Missions in.

A role assigned at the Organization level applies everywhere. A role assigned at a specific Module applies only to that Module. This means you can express both broad policies ("the platform team is Reader on the entire organization") and narrow exceptions ("except they're Owner on the networking Namespace").
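
The downward flow of permissions can be illustrated with a short Python model. This is a sketch under stated assumptions, not Snap CD's evaluation code: scopes are represented as tuples of names (the empty tuple standing for the Organization), and the names RoleAssignment and effective_roles are invented for the example.

```python
from dataclasses import dataclass

# Assumed representation: () is the Organization, ("prod",) a Stack,
# ("prod", "networking") a Namespace, ("prod", "networking", "vpc") a Module.
Scope = tuple[str, ...]


@dataclass(frozen=True)
class RoleAssignment:
    principal: str
    role: str
    scope: Scope


def effective_roles(principal: str, target: Scope,
                    assignments: list[RoleAssignment]) -> set[str]:
    """Roles held on `target`: every assignment made at the target or at any ancestor scope."""
    return {
        a.role
        for a in assignments
        if a.principal == principal and target[: len(a.scope)] == a.scope
    }


if __name__ == "__main__":
    assignments = [
        RoleAssignment("platform-team", "Reader", ()),                     # broad policy
        RoleAssignment("platform-team", "Owner", ("prod", "networking")),  # narrow exception
    ]
    print(effective_roles("platform-team", ("prod", "networking", "vpc"), assignments))
    print(effective_roles("platform-team", ("prod", "application"), assignments))
```

In this model the platform team holds both Reader and Owner on the vpc Module but only Reader on the application Namespace, which is the broad-policy-plus-exception shape described above.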

Concrete examples

Platform team owns networking, reads everything else

The platform team manages all networking infrastructure but should only observe application deployments:

resource "snapcd_stack_role_assignment" "platform_reader" {
  stack_id                = snapcd_stack.production.id
  principal_id            = snapcd_group.platform_team.id
  principal_discriminator = "Group"
  role_name               = "Reader"
}

resource "snapcd_namespace_role_assignment" "platform_owns_networking" {
  namespace_id            = snapcd_namespace.networking.id
  principal_id            = snapcd_group.platform_team.id
  principal_discriminator = "Group"
  role_name               = "Owner"
}

The platform team gets Reader at the Stack level (they can see everything in production) and Owner on the networking Namespace (they can deploy, approve, and manage Modules within it). They cannot modify or deploy anything in other Namespaces.

Junior engineer approves test but not prod

A junior team member should be able to approve deployment plans in the test environment but only observe production:

resource "snapcd_stack_role_assignment" "junior_test_contributor" {
  stack_id                = snapcd_stack.test.id
  principal_id            = snapcd_user.junior_engineer.id
  principal_discriminator = "User"
  role_name               = "Contributor"
}

resource "snapcd_stack_role_assignment" "junior_prod_reader" {
  stack_id                = snapcd_stack.prod.id
  principal_id            = snapcd_user.junior_engineer.id
  principal_discriminator = "User"
  role_name               = "Reader"
}

They can trigger plans, approve, and deploy in test. In prod, they can see what's happening but can't change anything.

CI Service Principal scoped to a single Module

An automated deployment pipeline should only be able to deploy one specific Module:

resource "snapcd_module_role_assignment" "ci_deploys_api" {
  module_id               = snapcd_module.api_gateway.id
  principal_id            = snapcd_service_principal.ci_pipeline.id
  principal_discriminator = "ServicePrincipal"
  role_name               = "Contributor"
}

The service principal can trigger plans and applies on the API gateway Module, but has no access to anything else in the organization. If the pipeline is compromised, the blast radius is limited to a single Module.

Runner access control

Controlling which Modules can execute on which Runners is a security boundary — a Runner deployed in your production Azure subscription should only execute production Modules. This is handled by Runner Supply, not by role assignments. A Runner Supply declares that a Runner is available to a Stack, Namespace, or individual Module:

resource "snapcd_runner_stack_supply" "prod" {
  runner_id = snapcd_runner.azure_prod.id
  stack_id  = snapcd_stack.production.id
}

Every Module in the production Stack can execute on azure_prod. Modules in other Stacks cannot, regardless of what credentials exist elsewhere. Without a matching Supply, a Module will not execute.

Runner role assignments (snapcd_runner_role_assignment) serve a different purpose — they control what a principal can do to the Runner itself (manage, view, etc.), not which Modules execute on it.

AI Agent access control

Agents follow the same supply-and-RBAC model as Runners. An Agent is backed by a Service Principal — its permissions are whatever roles that Service Principal holds. You supply the Agent to specific scopes, and declare which Missions it can run at each scope:

resource "snapcd_agent" "ai" {
  name                       = "ai-agent"
  service_principal_id       = data.snapcd_service_principal.ai_agent.id
  is_supplied_to_all_modules = false
}

resource "snapcd_agent_stack_supply" "test" {
  agent_id = snapcd_agent.ai.id
  stack_id = snapcd_stack.test.id
}

resource "snapcd_stack_mission" "diagnose_test" {
  stack_id     = snapcd_stack.test.id
  agent_id     = snapcd_agent.ai.id
  mission_type = "AutoDiagnose"
}

This Agent can auto-diagnose failed Jobs in the test Stack. Without a supply covering prod, it won't receive Missions there — even if someone accidentally creates a prod-scoped Mission for it. The Agent's Service Principal still needs the appropriate RBAC role to perform the action (e.g. Contributor to attempt an auto-fix, Reader to diagnose). Every action is logged and attributed to the Agent's Service Principal, giving you the same audit trail as human operators.

No backdoors

A common failure mode with CI-based access control is that the gates only apply to one path. Someone with the right cloud credentials can bypass CI entirely and run terraform apply from their laptop.

Snap CD's permission model applies to every interaction path. Whether you click "Approve" in the web dashboard, call the REST API from a script, or manage resources through the Terraform Provider, the same role assignments are evaluated. There is no unenforced path.

This also means your access control configuration is auditable in one place. Instead of piecing together GitHub team permissions, CI environment protection rules, and cloud IAM policies to understand who can deploy what, you query Snap CD's role assignments.

Managing permissions as code

Because every role assignment is a Terraform resource, your permission model lives in version control alongside the rest of your infrastructure configuration. Changes go through the same review process as any other infrastructure change — pull request, review, approve, apply.

resource "snapcd_stack" "prod" {
  name            = "prod"
  organization_id = snapcd_organization.main.id
}

resource "snapcd_namespace" "networking" {
  name     = "networking"
  stack_id = snapcd_stack.prod.id
}

resource "snapcd_namespace" "application" {
  name     = "application"
  stack_id = snapcd_stack.prod.id
}

resource "snapcd_stack_role_assignment" "sre_owns_prod" {
  stack_id                = snapcd_stack.prod.id
  principal_id            = snapcd_group.sre.id
  principal_discriminator = "Group"
  role_name               = "Owner"
}

resource "snapcd_namespace_role_assignment" "appdev_contributes_app" {
  namespace_id            = snapcd_namespace.application.id
  principal_id            = snapcd_group.app_developers.id
  principal_discriminator = "Group"
  role_name               = "Contributor"
}

The SRE team owns the entire prod Stack. Application developers can deploy within the application Namespace but cannot touch networking. Both constraints are declared, version-controlled, and enforced at every interaction point.
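The inheritance described above can be pictured with a small model. This is a minimal sketch, not Snap CD's implementation: the `Assignment` type, the tuple-based scope paths, and the ordering of `Reader` below `Contributor` below `Owner` are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Assumed ordering of the roles named in this guide (hypothetical ranking).
ROLE_RANK = {"Reader": 1, "Contributor": 2, "Owner": 3}


@dataclass(frozen=True)
class Assignment:
    # ("prod",) stands for a Stack; ("prod", "application") for a Namespace in it.
    scope: tuple
    group: str
    role: str


def effective_role(assignments, group, target):
    """Return the strongest role `group` holds on `target` or any ancestor scope."""
    best = None
    for a in assignments:
        # An assignment covers the target if its scope is a prefix of the target path.
        covers = target[: len(a.scope)] == a.scope
        if a.group == group and covers:
            if best is None or ROLE_RANK[a.role] > ROLE_RANK[best]:
                best = a.role
    return best


ASSIGNMENTS = [
    Assignment(("prod",), "sre", "Owner"),
    Assignment(("prod", "application"), "app_developers", "Contributor"),
]

print(effective_role(ASSIGNMENTS, "sre", ("prod", "networking")))
print(effective_role(ASSIGNMENTS, "app_developers", ("prod", "networking")))
```

In this model the SRE group resolves to Owner on any Namespace under prod, while the application developers resolve to no role at all on networking, which mirrors the two constraints declared in the HCL.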

Tips

Start broad, narrow later. Give your team Contributor at the organization level to start. As you identify boundaries — different teams, different environments, different risk levels — add scoped assignments and remove the broad one.
Use groups, not individual users. Assigning roles to groups means onboarding a new team member is a single group membership change, not a dozen role assignments.
Scope Runners to environments. A Runner with production credentials should only accept jobs from production Modules. Use Runner Supply to enforce this.
Treat permissions as infrastructure. Define all role assignments in Terraform. If a role assignment isn't in code, it shouldn't exist.
Audit regularly. Because all role assignments are Terraform resources, terraform plan will show you any drift between your intended permissions and the actual state.

Why Snap CD: Event-driven Continuous Deployment

Karl Schriek — Fri, 03 Jul 2026 07:12:19 +0000

Infrastructure deployment usually starts simple: someone runs terraform apply on their laptop, eyeballs the plan, and hits yes. That works fine with a small team and a handful of resources. But as the infrastructure grows — more states, more teams, more environments — the question shifts from "how do I apply this" to "how do I make sure the right things deploy at the right time, in the right order, without someone babysitting the process."

This guide walks through the common approaches to automating Terraform deployments, where each one breaks down, and how Snap CD's event-driven model addresses the gaps.

The manual era

Every Terraform project starts here:

cd infra/networking
terraform plan -out=plan.tfplan
# read the output carefully...
terraform apply plan.tfplan

This is fine until it isn't. The problems are well-known:

No audit trail. Who applied what, when? You'd need to grep shell history or hope someone wrote it down.
No ordering guarantee. If networking needs to be applied before compute, that lives in someone's head. A new team member doesn't know.
Drift between environments. Dev gets the latest change; prod doesn't, because someone forgot.
Stale plans. You run plan at 2pm, get distracted, apply at 5pm. The world may have changed in between.

Most teams move away from manual applies within months of going to production.

Scheduled CI pipelines

The natural next step is to put terraform apply in CI. A pipeline runs on every merge to main, or on a cron schedule:

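# Typical CI approach
on:
  push:
    branches: [main]
    paths: ['infra/networking/**']

jobs:
  apply:
    runs-on: ubuntu-latest
    defaults:
      run:
        working-directory: infra/networking
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init
      - run: terraform apply -auto-approve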

This solves the audit-trail problem (CI logs everything) and the environment-drift problem (a cron run eventually catches configuration drift). But it introduces new problems:

No dependency awareness. You can trigger networking's pipeline on a path filter, but compute doesn't know to re-run when networking's outputs change. You end up writing brittle pipeline glue: "after networking finishes, trigger compute, then trigger DNS."
Wasted runs. A cron-based pipeline runs on its schedule, say every 15 minutes, whether anything changed or not. Most runs produce empty plans.
Blast radius of -auto-approve. If the pipeline auto-applies, a bad commit deploys immediately. If it doesn't, someone still has to watch it and click approve — you've automated the init and plan but not the decision.
Cross-repo coordination. If networking and compute live in different repos, the path filter approach doesn't help. You need webhook chains or a shared orchestration layer.

Teams at this stage typically spend significant time maintaining CI configuration that is, in effect, a hand-rolled deployment orchestrator.

GitOps-style operators

Tools like Atlantis and similar Terraform GitOps operators move the trigger model closer to what you want: watch a repo, run plan on PR, apply on merge. This is a genuine improvement over raw CI — the plan is visible in the PR, approvals happen in the code review flow.

But the model has limits:

Single-state focus. Most GitOps operators work within one repository or one state. They don't model the relationship between your networking state and your compute state.
No cascading. When networking outputs change, nothing tells the compute operator to re-plan. You're back to manual coordination or webhook scripts.
Approval is binary. You can approve a PR, but you can't say "this plan needs two approvals before apply" or "destroy plans need a different approval threshold than regular changes."

Event-driven deployment with Snap CD

Snap CD's approach is different: instead of triggering on CI events and bolting on dependency management after the fact, it models the dependency graph as a first-class concept and triggers deployments based on changes to that graph.

Module deployments are orchestrated automatically based on three types of events:

Source changes: A new commit lands on a branch, or a new semantic version tag appears. Snap CD detects this (typically via polling jobs pushed to a Runner, but manual notification webhooks are also supported) and triggers a deployment job.
Upstream output changes: When a dependency's outputs change, downstream Modules re-deploy.
Definition changes: When you modify a Module's configuration (e.g. via the Terraform Provider, or manually via the Dashboard), it triggers a sync.

You can also require manual approval before applies go through, with configurable approval thresholds. This lets you build workflows where plans run automatically but apply waits for human sign-off.

The following sections walk through each of these in detail.

Source changes

Every Snap CD Module points at a source — a Git repository at a specific revision:

resource "snapcd_module" "networking" {
  name             = "networking"
  namespace_id     = snapcd_namespace.platform.id
  source_url       = "https://github.com/myorg/infra-networking.git"
  source_revision  = "main"
  runner_id        = data.snapcd_runner.platform.id
}

Snap CD periodically checks the source for new commits. When it finds one, it triggers a plan. No CI pipeline configuration, webhooks, or path filters are required.

If you prefer version-based releases over branch tracking, use semantic version ranges:

resource "snapcd_module" "networking" {
  name                  = "networking"
  namespace_id          = snapcd_namespace.platform.id
  source_url            = "https://github.com/myorg/infra-networking.git"
  source_revision       = "v2.*"
  source_revision_type  = "SemanticVersionRange"
  runner_id             = data.snapcd_runner.platform.id
}

This tells Snap CD to resolve the latest v2.x.y tag. When you push v2.4.0, Snap CD picks it up and triggers a plan. Tags outside the range (like v3.0.0) are ignored.
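To make the range behaviour concrete, here is a minimal sketch of resolving the newest tag inside a `v2.*` range. It is an illustration only: the function name `resolve_latest` is hypothetical, it handles just the `vN.*` pattern form and plain `vMAJOR.MINOR.PATCH` tags, and Snap CD's actual range syntax and resolution rules may differ.

```python
import re


def resolve_latest(tags, pattern):
    """Pick the highest vMAJOR.MINOR.PATCH tag matching a pattern such as 'v2.*'."""
    major = int(re.fullmatch(r"v(\d+)\.\*", pattern).group(1))
    candidates = []
    for tag in tags:
        m = re.fullmatch(r"v(\d+)\.(\d+)\.(\d+)", tag)
        if m and int(m.group(1)) == major:
            # Compare numerically, so v2.10.0 sorts above v2.4.0.
            version = tuple(int(part) for part in m.groups())
            candidates.append((version, tag))
    return max(candidates)[1] if candidates else None


TAGS = ["v1.9.0", "v2.3.1", "v2.4.0", "v2.10.0", "v3.0.0"]
print(resolve_latest(TAGS, "v2.*"))  # v2.10.0
```

The numeric comparison is the detail worth noticing: a plain string sort would rank v2.4.0 above v2.10.0.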

Upstream output changes (dependency cascading)

The real power shows up when you wire Modules together. Suppose your compute Module needs the VPC ID and subnet IDs from networking:

resource "snapcd_module" "compute" {
  name             = "compute"
  namespace_id     = snapcd_namespace.platform.id
  source_url       = "https://github.com/myorg/infra-compute.git"
  source_revision  = "main"
  runner_id        = data.snapcd_runner.platform.id
}

resource "snapcd_module_input_from_output" "vpc_id" {
  input_kind       = "Param"
  module_id        = snapcd_module.compute.id
  name             = "vpc_id"
  output_module_id = snapcd_module.networking.id
  output_name      = "vpc_id"
}

resource "snapcd_module_input_from_output" "private_subnet_ids" {
  input_kind       = "Param"
  module_id        = snapcd_module.compute.id
  name             = "private_subnet_ids"
  output_module_id = snapcd_module.networking.id
  output_name      = "private_subnet_ids"
}

Now Snap CD knows: compute depends on networking. When networking applies and its outputs change — say you added a new subnet — compute automatically re-plans with the updated values. No webhook. No CI trigger. No glue script.

This cascading is transitive. If DNS depends on compute, and compute depends on networking, a change to networking ripples through:

networking outputs change
    → compute re-plans and applies
        → compute outputs change
            → dns re-plans and applies

Independent Modules run in parallel. If both compute and database depend on networking but not on each other, they re-plan simultaneously.
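The ordering and parallelism described above amount to a topological walk over the dependency graph. The sketch below uses Python's standard `graphlib` to compute the "waves" a change would pass through. It is a simplified model, not Snap CD's scheduler: the function name `cascade_waves` is hypothetical, and it assumes every affected Module's outputs change, whereas a real cascade stops wherever outputs stay the same.

```python
from graphlib import TopologicalSorter


def cascade_waves(depends_on, changed):
    """Group the changed module and its transitive dependents into parallel waves."""
    # Invert the graph: upstream module -> modules that consume its outputs.
    dependents = {}
    for module, upstreams in depends_on.items():
        for upstream in upstreams:
            dependents.setdefault(upstream, set()).add(module)

    # Collect everything reachable downstream of the changed module.
    affected, stack = {changed}, [changed]
    while stack:
        for dependent in dependents.get(stack.pop(), ()):
            if dependent not in affected:
                affected.add(dependent)
                stack.append(dependent)

    # Order only the affected modules; edges to unaffected upstreams are dropped.
    subgraph = {
        m: {u for u in depends_on.get(m, ()) if u in affected} for m in affected
    }
    sorter = TopologicalSorter(subgraph)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # modules in one wave can run in parallel
        waves.append(ready)
        sorter.done(*ready)
    return waves


DEPENDS_ON = {
    "networking": set(),
    "compute": {"networking"},
    "database": {"networking"},
    "dns": {"compute"},
}
print(cascade_waves(DEPENDS_ON, "networking"))
# [['networking'], ['compute', 'database'], ['dns']]
```

Each inner list is a set of Modules with no dependency on one another, which is why compute and database land in the same wave.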

Definition changes

Source changes aren't the only trigger. If you update a Module's definition — change an input value, reassign it to a different Runner, modify a hook — Snap CD detects the configuration change and triggers a re-plan.

This means your Terraform provider code is the single source of truth. Changing a variable in your Snap CD configuration:

resource "snapcd_module_input_from_literal" "cluster_version" {
  input_kind    = "Param"
  module_id     = snapcd_module.compute.id
  name          = "kubernetes_version"
  literal_value = "1.30"   # was "1.28"
  type          = "String"
}

…triggers a re-plan of the compute Module with the new value, the same way a commit to the source repo would.

Any change to the snapcd_module resource itself or to child resources — snapcd_module_input_from_output, snapcd_module_input_from_literal, snapcd_module_input_from_secret, snapcd_extra_file, snapcd_backend_config, and others — triggers a re-plan.

Approval gates

Not every plan should auto-apply. Snap CD lets you set approval thresholds at the Module or Namespace level:

resource "snapcd_module" "database" {
  name                       = "database"
  namespace_id               = snapcd_namespace.platform.id
  source_url                 = "https://github.com/myorg/infra-database.git"
  source_revision            = "main"
  runner_id                  = data.snapcd_runner.platform.id
  apply_approval_threshold   = 1
  destroy_approval_threshold = 2
}

With apply_approval_threshold = 1, Snap CD pauses after planning and waits for at least one principal to approve before applying. Destroy operations require two separate approvals.

You can set defaults at the Namespace level so all Modules within it inherit the same policy. Namespaces live inside Stacks, which represent hard boundaries like "prod" and "dev":

resource "snapcd_namespace" "production" {
  name                                = "production"
  stack_id                            = data.snapcd_stack.main.id
  default_apply_approval_threshold    = 1
  default_destroy_approval_threshold  = 2
  default_approval_timeout_minutes    = 60
}

Individual Modules can override the Namespace defaults. A low-risk monitoring Module might not need approval; a database Module might need two.
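The override rule can be sketched as a simple fallback chain. This is an illustrative model under stated assumptions: the function names are hypothetical, a threshold of 0 is assumed to mean "no approval needed", and approvals are assumed to be counted per distinct principal.

```python
from typing import Optional, Set


def effective_threshold(module_value: Optional[int],
                        namespace_default: Optional[int]) -> int:
    """A Module-level value wins; otherwise fall back to the Namespace default."""
    if module_value is not None:
        return module_value
    if namespace_default is not None:
        return namespace_default
    return 0  # assumption: nothing configured means no approval is required


def can_apply(approvers: Set[str], threshold: int) -> bool:
    """A set de-duplicates principals, so one person approving twice counts once."""
    return len(approvers) >= threshold


# Namespace default of 1, with a database Module overriding to 2.
database_threshold = effective_threshold(module_value=2, namespace_default=1)
print(can_apply({"alice"}, database_threshold))
print(can_apply({"alice", "bob"}, database_threshold))
```

Note that an explicit Module value of 0 is distinct from "not set": the first opts out of approval, the second inherits the Namespace policy.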

Publishing events externally

Deployment lifecycle events aren't limited to internal orchestration. Integrations let you push events to external systems. Slack is the first supported sink, with others to follow. You define which events fire on which scope using Integration Events, and which scopes the Integration serves via the same supply model used by Runners:

data "snapcd_integration" "alerts" {
  name = "alerts"
}

resource "snapcd_integration_stack_supply" "prod" {
  integration_id = data.snapcd_integration.alerts.id
  stack_id       = snapcd_stack.production.id
}

resource "snapcd_stack_integration_event" "failed" {
  stack_id       = snapcd_stack.production.id
  integration_id = data.snapcd_integration.alerts.id
  trigger        = "JobFailed"
}

This sends a Slack notification whenever any Module in the production Stack has a failed deployment.

Putting it together

Consider an infrastructure setup with four states: networking, compute, database, and DNS. In a traditional CI setup, you'd maintain four separate pipelines with webhook triggers, shell scripts to pass outputs between them, and manual ordering logic scattered across CI configuration files.

With Snap CD, the same setup is four Modules with explicit dependency wiring:

networking (watches main branch)
    ├── compute (takes vpc_id, subnet_ids from networking)
    │       └── dns (takes load_balancer_ip from compute)
    └── database (takes subnet_ids, security_group_id from networking)

A commit to infra-networking that changes a subnet:

Networking re-plans and applies.
Snap CD detects networking's outputs changed.
Compute and database re-plan in parallel (independent of each other).
Compute applies. Its outputs change (new load balancer IP).
DNS re-plans and applies with the new IP.
Database applies. No downstream dependents, cascade stops.

All of this happens without any CI configuration. The dependency graph lives in Terraform code (the snapcd_module_input_from_output resources), not in CI pipeline YAML.

When to use what

Event-driven deployment isn't always necessary. Here's a rough guide:

Single state, single team: manual applies or a simple CI pipeline are fine. You don't need an orchestrator.
Multiple states, one team: a CI pipeline with some output-passing glue works, but starts to get brittle. Snap CD simplifies the wiring.
Multiple states, multiple teams: this is where event-driven deployment pays for itself. The dependency graph is explicit, ordering is automatic, and approval gates let each team control their own blast radius.
Version-based releases: if you tag infrastructure Modules with semantic versions and want controlled rollouts, Snap CD's version range tracking is built for this.

Tips

Start with one Namespace. Put your first few Modules in a single Namespace to learn the trigger model. Split into multiple Namespaces later when you need different default approval policies or Runner assignments.
Use source_revision_type = "SemanticVersionRange" for production. Tracking main is fine for dev, but production Modules should pin to a version range so you control exactly when changes roll out.
Set approval thresholds on destructive Modules first. Databases and DNS are the obvious candidates — the resources where a bad apply is hardest to undo.
Don't over-wire dependencies. Only create snapcd_module_input_from_output resources for values that actually flow between Modules. Not every Module needs to depend on every other Module.
Watch the cascade. When networking changes, the cascade might touch five downstream Modules. That's the point — but make sure those Modules have appropriate approval thresholds if you want a human in the loop.

Detecting and Managing Terraform Drift

Karl Schriek — Fri, 03 Jul 2026 06:55:32 +0000

Terraform assumes it's the only thing managing your infrastructure. The moment something changes outside of Terraform — a manual console edit, an auto-scaler adjusting capacity, another tool modifying a resource, an emergency hotfix applied directly in the cloud — Terraform's state file no longer reflects reality.

That gap between what Terraform thinks exists and what actually exists is drift. Every team experiences it. Few have a reliable way to detect it.

What drift looks like

Drift isn't always obvious. Some common scenarios:

Emergency console changes. Production is down. An engineer opens the AWS console and widens a security group to restore traffic. The fix works. Nobody updates the Terraform code. Two weeks later, someone runs terraform apply on an unrelated change, and the plan silently reverts the security group — taking production down again.
Auto-scaling and managed services. AWS auto-scaling changes the desired count on an ASG. Azure adjusts throughput on a Cosmos DB instance. GCP resizes a managed instance group. These are expected changes made by the cloud provider, but Terraform's state doesn't know about them. The next plan shows phantom diffs that confuse reviewers.
Cross-tool modifications. A Kubernetes operator creates a load balancer that Terraform also manages. A CI pipeline updates an IAM policy outside of Terraform. A different team uses Pulumi for their resources but shares a VPC that Terraform created. Any of these can modify resources that Terraform considers under its control.
Provider upgrades. A new version of the AWS provider reads a resource differently: normalising JSON policies, reordering security group rules, or adding new default attributes. The resource hasn't changed, but the plan shows a diff. This is one of the most common sources of noisy drift: diffs that surface in the refresh report and look like drift, but reflect a change in how the provider reads the resource rather than a change to the resource itself.

Why drift is dangerous

Silent overwrites

The most immediate danger: terraform apply will converge the real infrastructure to match the declared configuration. If someone made a manual fix that isn't reflected in the code, the next apply reverts it. There's no special warning. The plan just shows a diff, and if the reviewer doesn't recognise it as "that emergency fix from last Tuesday," it gets applied.

Misleading plans

When state and reality diverge, terraform plan output becomes unreliable. A plan that shows "3 to change" might actually represent 1 intentional change and 2 drift reversions. Reviewers can't tell which is which. Over time, teams stop trusting the plan output — which defeats the entire purpose of plan review.

Compliance drift

Security-sensitive resources are the highest-risk category. A security group opened to 0.0.0.0/0 during an incident, an IAM policy with overly broad permissions added manually, a database encryption setting changed in the console — all of these are compliance violations that persist silently until someone runs a plan and either catches the diff or blindly applies over it.

Cascading across states

When infrastructure is split across multiple states, drift in one state can cascade. If the networking state's actual VPC configuration has drifted from what Terraform believes, every downstream state that depends on networking outputs is making decisions based on stale data. The compute state thinks the VPC has three subnets; it actually has four. Nothing breaks until it does.

How teams detect drift today

Manual `terraform plan`

The simplest approach: someone runs terraform plan and looks for unexpected diffs.

terraform plan -detailed-exitcode
# Exit code 0: no changes
# Exit code 1: error
# Exit code 2: changes detected

Why it doesn't scale:

It requires someone to remember to run it. Under deadline pressure, drift checks are the first thing skipped.
It holds the state lock for the entire plan duration. On a large state, that's minutes of blocking other operations.
The output mixes intentional changes with drift. If someone has uncommitted code changes locally, the plan shows both — and distinguishing them takes expertise.
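There's no structured output by default. You're reading terminal text, not querying a system that knows "this resource drifted." Getting machine-readable results means saving the plan and post-processing terraform show -json yourself.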
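Teams that automate this check usually end up wrapping the exit codes in a small script. A minimal sketch of such a wrapper follows; the function names and status labels are hypothetical, and it assumes the terraform binary is on the PATH and the working directory is already initialised.

```python
import subprocess

# Meaning of `terraform plan -detailed-exitcode` exit codes, as listed above.
DRIFT_STATUS = {0: "clean", 1: "error", 2: "changes"}


def classify(returncode):
    """Map a plan exit code to a status label; unknown codes count as errors."""
    return DRIFT_STATUS.get(returncode, "error")


def check(workdir):
    """Run a plan in `workdir` and return (status, plan text)."""
    proc = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify(proc.returncode), proc.stdout


if __name__ == "__main__":
    status, output = check("infra/networking")
    print(status)
```

Even with the wrapper, the limitations above still apply: the plan holds the state lock while it runs, and "changes" does not say whether the diff is drift or an unapplied code change.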

Scheduled CI plans

A step up: a cron-triggered CI pipeline that runs terraform plan on a schedule and alerts on non-zero exit codes.

# .github/workflows/drift-check.yml
on:
  schedule:
    - cron: '0 6 * * *'  # Daily at 6 AM

jobs:
  drift-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init
      - run: terraform plan -detailed-exitcode

Problems:

Lock contention. The drift-check plan holds the state lock. If an engineer tries to run terraform plan at the same time, they're blocked.
No structured alerting. The pipeline either passes or fails. There's no "these 3 resources drifted", just a wall of plan text in a CI log. Drift detection is a frequently requested Atlantis feature, and how to even trigger it is a recurring question, because CI-based approaches are fundamentally awkward.
False positives. Provider version differences between CI and local, or expected changes from auto-managed attributes, generate noise that drowns out real drift.
CI costs. Running a full plan across 20 states daily burns CI minutes. Running it hourly burns more. Most of those runs find nothing.
Single state at a time. Each pipeline job checks one state. Cross-state drift — where one state's actual outputs don't match what a dependent state consumed — isn't detected at all.

`terraform plan -refresh-only`

Terraform 0.15.4 added the -refresh-only flag, which separates the refresh phase from the planning phase. It shows you what changed in the real world without proposing any configuration changes:

terraform plan -refresh-only

This is better for drift detection than a full plan because it doesn't conflate drift with intentional code changes. But it still requires manual execution, still holds the state lock, and still produces unstructured text output. It's also had reliability issues — OpenTofu's implementation returned false positives where --refresh-only --detailed-exitcode exited with code 2 even when there were no actual changes.
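One way to get past the unstructured text is to save the refresh-only plan (terraform plan -refresh-only -out=drift.tfplan) and read its JSON form (terraform show -json drift.tfplan). The sketch below pulls drifted addresses out of that JSON. It relies on the `resource_drift` field of Terraform's documented JSON plan representation; the helper name and the sample document are invented for illustration.

```python
import json


def drifted_resources(plan_json):
    """Return (address, actions) pairs for changes detected outside Terraform."""
    plan = json.loads(plan_json)
    # `resource_drift` is absent when the refresh found no external changes.
    return [
        (entry["address"], entry["change"]["actions"])
        for entry in plan.get("resource_drift", [])
    ]


# Invented sample, trimmed to the fields this sketch reads.
SAMPLE = json.dumps({
    "format_version": "1.2",
    "resource_drift": [
        {"address": "aws_security_group.web", "change": {"actions": ["update"]}},
    ],
})

for address, actions in drifted_resources(SAMPLE):
    print(address, actions)
```

This yields a list you can alert on or store, but it remains a script someone has to schedule, run, and maintain per state.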

Cloud-native tools

AWS Config, Azure Policy, and GCP Security Command Center can detect configuration changes at the cloud level. They're good at what they do — but they don't understand Terraform.

AWS Config can tell you that a security group rule changed. It can't tell you which Terraform resource manages that security group, which state file it lives in, or whether the change is intentional. Correlating a Config finding back to Terraform code is manual detective work.

These tools complement Terraform drift detection. They don't replace it.

Third-party tools

Tools like driftctl (now part of Snyk) scan cloud resources and compare them against Terraform state. They can find resources that exist in the cloud but aren't in any state file (unmanaged resources) — something terraform plan can't do.

The trade-off is another tool to maintain, another set of credentials to manage, and another source of truth to reconcile. Most of these tools work at a point-in-time snapshot level, not continuously.

How Snap CD handles drift

Snap CD treats drift detection as a first-class operation, not a bolted-on CI job.

Smaller Modules, smaller drift surface

Snap CD's core design philosophy is breaking infrastructure into small, focused Modules — each with its own state, credentials, and lifecycle. This directly reduces the damage drift can cause.

In a monolithic state with 500 resources, a single drifted security group hides among hundreds of resources in the plan output. The drift check is slow because the refresh has to query every resource. And if you decide to correct the drift, the apply touches a state that contains everything — networking, compute, databases, DNS — so the blast radius of a mistake is the entire infrastructure.

When that same infrastructure is split into Modules, the security group lives in a networking Module with 30 resources. The drift check runs in seconds, the plan output is short enough to actually read, and a corrective apply only touches networking. The database and application Modules are untouched.

Smaller states also mean faster refresh cycles, which means you can check for drift more frequently without the lock contention and API throttling that make frequent checks impractical on large states. See The Problem with Large Terraform States for why state size matters, and Splitting a Terraform Monolith for how to get there.

Scheduled drift checks per Module

Each Snap CD Module can run periodic plans on a schedule, independent of code changes. The Server triggers a plan, the Runner executes it, and the result is stored with the same structured metadata as any other deployment.

resource "snapcd_module" "networking" {
  name         = "networking"
  namespace_id = snapcd_namespace.prod.id
  source_url   = "https://github.com/myorg/infra-networking.git"
  runner_id    = snapcd_runner.prod.id
}

Drift checks run as normal plans — they refresh state against reality and show any differences. The key distinction is that they're triggered by the scheduler, not by a code change, so the plan output represents pure drift: changes made outside of Terraform.

Structured results in the dashboard

When drift is detected, it's visible in the Snap CD dashboard as a plan with changes. Reviewers can see exactly which resources drifted and what changed — not a wall of CI log text, but a structured plan output with the same review interface used for normal deployments.

Approval before correction

A drift detection plan that shows changes doesn't automatically apply. It enters the same approval workflow as any other plan. If the drift is intentional (an emergency fix that needs to stay), someone can dismiss the plan. If it's unintentional (someone accidentally changed a setting in the console), the team can approve the corrective apply to bring reality back in line with code.

This is the critical difference from continuous reconciliation tools like Crossplane, which would silently revert the change. Snap CD surfaces drift and lets humans decide what to do about it.

RBAC for drift visibility

Not everyone needs to see drift in every Module. Snap CD's permission system controls who can view plans (including drift check results) and who can approve corrective applies. The security team can have Reader access to see drift across all Modules without the ability to approve changes. The networking team can approve corrections to their own Modules without needing access to application infrastructure.

Cross-state drift awareness

When drift is detected in a Module that produces outputs consumed by other Modules, Snap CD understands the dependency graph. If the networking Module's actual vpc_id has drifted, Snap CD knows that the compute and database Modules depend on that output. Correcting the drift in networking can trigger re-plans in dependent Modules, catching cascading effects that CI-based drift detection misses entirely.

A practical setup

A team with five Modules across networking, compute, database, application, and DNS:

resource "snapcd_module" "networking" {
  name                     = "networking"
  namespace_id             = snapcd_namespace.prod.id
  source_url               = "https://github.com/myorg/infra-networking.git"
  runner_id                = snapcd_runner.prod.id
  apply_approval_threshold = 2
}

resource "snapcd_module" "compute" {
  name                     = "compute"
  namespace_id             = snapcd_namespace.prod.id
  source_url               = "https://github.com/myorg/infra-compute.git"
  runner_id                = snapcd_runner.prod.id
  apply_approval_threshold = 1
}

resource "snapcd_module_input_from_output" "vpc_to_compute" {
  module_id        = snapcd_module.compute.id
  input_kind       = "Param"
  name             = "vpc_id"
  output_module_id = snapcd_module.networking.id
  output_name      = "vpc_id"
}

When the scheduled drift check runs on the networking module and detects that a subnet was added manually in the console:

The drift appears in the dashboard as a plan showing the unexpected subnet.
Two approvers review the plan (networking requires 2 approvals).
If they approve the corrective apply, Terraform removes the manually-added subnet (or, if the team wants to keep it, they update the code first and the next plan shows no changes).
If the corrective apply changes networking's outputs, compute automatically re-plans.

No CI cron job. No Slack message asking "did anyone change the VPC?" No terraform plan holding the lock while an engineer reads the output.

Comparison

	Manual plan	CI cron	Cloud-native tools	Snap CD
Automation	None	Schedule-based	Continuous	Schedule-based
Lock contention	Yes	Yes	No	Managed per-Module
Structured results	No (terminal text)	No (CI logs)	Yes (but not Terraform-aware)	Yes (dashboard)
Approval before fix	Manual	Manual	N/A	Built-in
Cross-state awareness	None	None	None	Dependency graph
Drift vs. code change	Mixed	Mixed	N/A	Separated
Cost	Engineer time	CI minutes	Cloud service pricing	Included

Tips

Check drift more often on high-risk resources. Security groups, IAM policies, and database configurations are the most common targets for manual changes. Schedule drift checks for modules containing these resources more frequently than stable infrastructure like DNS.
Don't auto-apply drift corrections. The whole point is to surface drift for human review. Automatic correction is just continuous reconciliation with extra steps — and it defeats the safety of the plan-then-approve workflow.
Investigate before correcting. When drift is detected, the first question is "why?" If someone made an emergency change, the fix is to update the Terraform code to match, not to revert the change. If the drift is from a provider bug or an auto-managed attribute, the fix might be ignore_changes, not a corrective apply.
Track your drift rate. If the same module drifts repeatedly, that's a signal — either the manual change is actually needed (and should be in code) or the team doesn't trust the Terraform workflow enough to use it for urgent changes.
Separate drift-prone resources. Resources that are frequently modified outside Terraform (auto-scaled groups, resources managed by operators) should be in their own Module, so their expected drift doesn't create noise in Modules that should never drift. Smaller, focused states also reduce the blast radius of any single drift correction — see The Problem with Large Terraform States and Splitting a Terraform Monolith for how to get there.

Why Snap CD: Modular Deployments

Karl Schriek — Thu, 02 Jul 2026 19:22:06 +0000

Terraform manages dependencies between resources within a single state. The moment your infrastructure outgrows one state file — slow plans, wide blast radius, team contention — you need to split. But the pieces still depend on each other: compute needs the VPC ID from networking, application infrastructure needs the cluster endpoint from compute.

The usual approaches — terraform_remote_state, parameter stores, wrapper scripts, Terragrunt — each solve part of the problem but leave gaps in change detection, ordering enforcement, and visibility. For a walkthrough of these approaches and how to perform the split itself, see Splitting a Terraform Monolith.

Snap CD's module system was built specifically for what comes after the split: declaring the dependency graph as code, enforcing apply ordering, and cascading changes automatically.

Managing Snap CD with Terraform

The HCL examples throughout this guide use the Snap CD Terraform provider. This is the canonical way to configure Snap CD — you use Terraform to manage the system that manages your Terraform modules. Everything you see in the examples below — the hierarchy, the dependency wiring, the secret bindings — is declared as standard Terraform resources via this provider.

This means your Snap CD configuration is version-controlled, reviewable, and reproducible — the same properties you expect from the infrastructure it orchestrates. For more on the provider and the broader toolset, see An Extensive Supporting Toolset. For how runners provide credential isolation between modules, see Self-Hosted Terraform Runners with Credential Isolation.

Stacks, Namespaces, and Modules

Snap CD organises infrastructure in a three-level hierarchy:

Stack — a top-level grouping, typically an environment or a product. Examples: production, staging, platform-services.
Namespace — a logical grouping within a Stack, typically a team or an infrastructure layer. Examples: networking, data-platform, frontend.
Module — a single Terraform root within a Namespace. This is the unit of deployment — each Module has its own state, its own Runner, and its own lifecycle.

Permissions, secrets, and default inputs can be set at any level and inherited downward. A secret defined at the namespace level is available to all modules in that namespace. A permission granted at the stack level applies to all namespaces and modules within it.

resource "snapcd_stack" "production" {
  name            = "production"
  organization_id = snapcd_organization.main.id
}

resource "snapcd_namespace" "platform" {
  name     = "platform"
  stack_id = snapcd_stack.production.id
}

Modules

A module is an independent Terraform root — its own source repository, its own state, its own runner, its own credentials. Modules are defined within a namespace:

resource "snapcd_module" "networking" {
  name         = "networking"
  namespace_id = snapcd_namespace.platform.id
  source_url   = "https://github.com/myorg/infra-networking.git"
  runner_id    = snapcd_runner.platform.id
}

resource "snapcd_module" "compute" {
  name         = "compute"
  namespace_id = snapcd_namespace.platform.id
  source_url   = "https://github.com/myorg/infra-compute.git"
  runner_id    = snapcd_runner.platform.id
}

resource "snapcd_module" "database" {
  name         = "database"
  namespace_id = snapcd_namespace.platform.id
  source_url   = "https://github.com/myorg/infra-database.git"
  runner_id    = snapcd_runner.platform.id
}

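Each module has its own state and its own lifecycle. Credentials are scoped via the runner: the three modules above share one runner and therefore one set of credentials, so point a module at a different runner_id when it needs credentials of its own.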

Inputs

Modules receive values through inputs. Snap CD supports several input types, each suited to a different use case.

Outputs from other modules

The most common input type. A module consumes an output from another module, and Snap CD builds the dependency graph from these declarations:

resource "snapcd_module_input_from_output" "vpc_id" {
  module_id        = snapcd_module.compute.id
  input_kind       = "Param"
  name             = "vpc_id"
  output_module_id = snapcd_module.networking.id
  output_name      = "vpc_id"
}

resource "snapcd_module_input_from_output" "private_subnet_ids" {
  module_id        = snapcd_module.compute.id
  input_kind       = "Param"
  name             = "private_subnet_ids"
  output_module_id = snapcd_module.networking.id
  output_name      = "private_subnet_ids"
}

The input_kind controls how the value is delivered to Terraform:

Param — injected as a Terraform variable (written to a .tfvars file). Use this when your Terraform code declares a matching variable block.
EnvVar — injected as an environment variable. Use this for values that configure provider authentication or backend settings (e.g., ARM_SUBSCRIPTION_ID).

This replaces terraform_remote_state entirely. Modules don't need to know each other's backend configuration. They declare what they produce and what they consume — Snap CD handles the wiring.
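On the Terraform side, a Param input only needs ordinary output and variable blocks. A minimal sketch for the vpc_id and private_subnet_ids wiring above, assuming the networking root contains resources named aws_vpc.main and aws_subnet.private (file names and resource names are illustrative, not something Snap CD prescribes):

```hcl
# networking/outputs.tf (illustrative file name)
# The producer exposes the values that other modules consume.
output "vpc_id" {
  value = aws_vpc.main.id # assumes a resource "aws_vpc" "main" in this root
}

output "private_subnet_ids" {
  value = aws_subnet.private[*].id # assumes aws_subnet.private uses count
}

# compute/variables.tf (illustrative file name)
# The consumer declares variables whose names match the input names.
variable "vpc_id" {
  type        = string
  description = "Supplied by Snap CD from networking's vpc_id output"
}

variable "private_subnet_ids" {
  type        = list(string)
  description = "Supplied by Snap CD from networking's private_subnet_ids output"
}
```

Neither root contains a terraform_remote_state data source or any reference to the other's backend.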

When two modules share many outputs, snapcd_module_input_from_output_set wires all outputs from the producer in a single resource — no need to declare each one individually.

Literal values

Static configuration values that don't come from another module or a secret:

resource "snapcd_module_input_from_literal" "environment" {
  module_id     = snapcd_module.compute.id
  input_kind    = "Param"
  name          = "environment"
  literal_value = "production"
}

resource "snapcd_module_input_from_literal" "instance_count" {
  module_id     = snapcd_module.compute.id
  input_kind    = "Param"
  name          = "instance_count"
  literal_value = "3"
  type          = "NotString"
}

The type attribute defaults to "String". Set it to "NotString" for numbers, booleans, lists, or maps — this tells Snap CD to pass the value unquoted so Terraform interprets it as the correct type.
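As a sketch, the compute root might declare the two variables like this. The comments restate the quoting behaviour described above; they are not a dump of what Snap CD actually writes to disk, and the file name is an assumption:

```hcl
# compute/variables.tf (illustrative file name)

# Fed by the "environment" literal. Its type is "String" (the default),
# so the value is passed quoted, as if written: environment = "production"
variable "environment" {
  type = string
}

# Fed by the "instance_count" literal. Its type is "NotString",
# so the value is passed unquoted, as if written: instance_count = 3
variable "instance_count" {
  type = number
}
```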

Secrets

Secrets are stored encrypted in Snap CD's secret store and injected at runtime. See Managing Secrets in Terraform for the full picture. The binding looks like:

resource "snapcd_module_input_from_secret" "db_password" {
  module_id  = snapcd_module.database.id
  name       = "db_password"
  secret_id  = data.snapcd_module_secret.db_password.id
  input_kind = "Param"
}

Secrets are scoped — a secret bound to the database module is never visible to the networking or compute modules.
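On the consuming side, the database root needs a matching variable. A minimal sketch using standard Terraform; marking the variable sensitive is a suggestion here, not something the binding requires:

```hcl
# database/variables.tf (illustrative file name)
variable "db_password" {
  type      = string
  sensitive = true # redacts the value in plan and apply output
}
```

Note that sensitive only affects CLI output. If the value is passed to a resource argument, it is still written to that module's state.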

Namespace-level inputs

When multiple modules in a namespace share common inputs — a subscription ID, a region, a shared tag set — you can define inputs at the namespace level:

resource "snapcd_namespace_input_from_secret" "arm_subscription_id" {
  namespace_id = snapcd_namespace.platform.id
  name         = "ARM_SUBSCRIPTION_ID"
  secret_id    = data.snapcd_namespace_secret.subscription_id.id
  input_kind   = "EnvVar"
  usage_mode   = "UseByDefault"
}

The usage_mode controls inheritance:

UseByDefault — every module in the namespace receives this input automatically unless it declares its own override.
UseIfSelected — the input is available but only applied to modules that explicitly opt in.

This eliminates the need to repeat the same credential or configuration binding on every module in a namespace. Namespace inputs also support literals (snapcd_namespace_input_from_literal) and definition values (snapcd_namespace_input_from_definition).
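A namespace-level literal might look like the sketch below. This is an assumption: the attribute names are inferred by combining the module-level literal resource and the namespace-level secret resource shown earlier, and the input name and value are placeholders. Check the provider documentation before relying on it:

```hcl
# ASSUMPTION: attributes mirror snapcd_module_input_from_literal and
# snapcd_namespace_input_from_secret as shown above; not verified
# against the provider schema.
resource "snapcd_namespace_input_from_literal" "location" {
  namespace_id  = snapcd_namespace.platform.id
  input_kind    = "Param"
  name          = "location"   # hypothetical shared input
  literal_value = "westeurope" # placeholder value
  usage_mode    = "UseByDefault"
}
```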

Definition values

Snap CD can inject its own metadata — module IDs, namespace names, source revisions — as inputs to your Terraform code:

resource "snapcd_module_input_from_definition" "module_name" {
  module_id       = snapcd_module.compute.id
  input_kind      = "Param"
  name            = "snapcd_module_name"
  definition_name = "ModuleName"
}

Available definition values include ModuleId, ModuleName, NamespaceId, NamespaceName, StackId, StackName, SourceUrl, SourceRevision, and SourceSubdirectory. This is useful for tagging resources with their Snap CD provenance or for conditional logic based on which module is being deployed.
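One way to use this for provenance tagging, sketched with the AWS provider's default_tags block. The region and tag keys are placeholders chosen for illustration:

```hcl
# Receives the "snapcd_module_name" input declared above.
variable "snapcd_module_name" {
  type        = string
  description = "Set by Snap CD from the ModuleName definition value"
}

provider "aws" {
  region = "us-east-1" # placeholder region

  # default_tags applies these tags to every taggable resource
  # created through this provider configuration.
  default_tags {
    tags = {
      ManagedBy    = "snapcd"               # hypothetical tag key
      SnapcdModule = var.snapcd_module_name # hypothetical tag key
    }
  }
}
```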

Change propagation

Once modules and their inputs are defined, Snap CD handles the deployment lifecycle automatically:

Automatic ordering. Snap CD knows that compute and database depend on networking. It will never apply compute before networking has successfully completed.

Parallel execution. Compute and database both depend on networking but not on each other. Snap CD runs them in parallel once networking completes.

Cascading changes. When networking's outputs change — say you add a subnet — Snap CD automatically queues compute and database for re-planning. If the new plan has changes, it either auto-applies (if configured) or waits for approval. If compute's outputs are unchanged, downstream modules that depend on compute are skipped entirely.

Source-triggered plans. When a commit is pushed to a module's source repository, Snap CD detects the change and triggers a new plan. If applying that plan changes the module's outputs, dependents cascade as above.

networking
    ├──► compute ──► application
    └──► database

A commit to infra-networking that changes a subnet triggers this cascade:

Networking re-plans and applies.
Compute and database re-plan in parallel.
If compute's outputs change, application re-plans after compute finishes.
If compute's outputs are unchanged, application is skipped.

No scripts. No parameter stores. No terraform_remote_state.
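The remaining edges of the graph above can be declared with the same resources used earlier. A sketch, assuming a fourth application module (its repository URL is a placeholder) and a cluster_endpoint output on compute:

```hcl
# Hypothetical fourth module; not defined elsewhere in this guide.
resource "snapcd_module" "application" {
  name         = "application"
  namespace_id = snapcd_namespace.platform.id
  source_url   = "https://github.com/myorg/infra-application.git" # placeholder
  runner_id    = snapcd_runner.platform.id
}

# Edge: networking -> database
resource "snapcd_module_input_from_output" "database_vpc_id" {
  module_id        = snapcd_module.database.id
  input_kind       = "Param"
  name             = "vpc_id"
  output_module_id = snapcd_module.networking.id
  output_name      = "vpc_id"
}

# Edge: compute -> application (assumes compute exposes cluster_endpoint)
resource "snapcd_module_input_from_output" "application_cluster_endpoint" {
  module_id        = snapcd_module.application.id
  input_kind       = "Param"
  name             = "cluster_endpoint"
  output_module_id = snapcd_module.compute.id
  output_name      = "cluster_endpoint"
}
```

Together with the compute inputs declared earlier, these declarations are the whole graph: there is no separate ordering file to maintain.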

Compared to the alternatives

Capability	terraform_remote_state	Parameter store	Wrapper scripts	Terragrunt	Snap CD
Backend coupling	Yes	No	No	Partial	No
Change detection	No	No	No	No	Automatic
Ordering enforcement	No	No	Manual	Automatic	Automatic
Parallelism	N/A	N/A	Manual	Automatic	Automatic
Approval gates	No	No	DIY	No	Built-in
Scoped permissions	No	No	No	No	Built-in
Persistent visibility	No	No	CI logs	No	Dashboard

Getting started

If you have a monolithic Terraform state today, the path to modular deployments is:

Identify boundaries — group resources by team, lifecycle, and credential scope.
Split the state — use terraform state mv to migrate resources to new roots (see Splitting a Terraform Monolith).
Define modules in Snap CD — one snapcd_module per root.
Wire the dependencies — snapcd_module_input_from_output for cross-module values, snapcd_module_input_from_secret for credentials.
Remove the glue — delete terraform_remote_state blocks, wrapper scripts, and CI pipeline steps.

From that point on, Snap CD manages the dependency graph, propagates changes, and keeps your infrastructure in sync.

Why Snap CD: Self-Hosted Terraform Runners with Credential Isolation

Karl Schriek — Thu, 02 Jul 2026 19:18:00 +0000

Most infrastructure teams run Terraform from a CI pipeline. That pipeline has credentials — cloud provider keys, state backend tokens, maybe a vault token to fetch more secrets. Early on, one pipeline with one set of credentials works fine. But as the infrastructure grows and more environments come online, the shared-runner model starts creating problems that are hard to fix without rethinking the architecture.

The shared-runner problem

When a single CI runner (or pool of identical runners) handles all Terraform work, several things go wrong at the same time.

Credential sprawl

Your CI runner needs to deploy networking in production, spin up a dev Kubernetes cluster, manage DNS records, and provision a staging database. That means it holds credentials for all of those things — often across multiple cloud providers and accounts.

Every credential on the runner is accessible to every job that runs on it. A misconfigured pipeline step for the dev environment can reach production AWS keys. The blast radius of a compromised runner is everything it has access to.

Blast radius

A bad Terraform run is supposed to be scoped to the infrastructure it manages. But when the runner has broad access, a bug in one pipeline — or a malicious commit — can reach resources it was never intended to touch. The runner doesn't know that a dev pipeline shouldn't be able to destroy production resources. It just runs whatever Terraform tells it to, with whatever credentials it has.

Compliance and auditability

Auditors want to know who (or what) can access production, and they want that list to be short and verifiable. "Our CI runner can access everything" is not a satisfying answer. Showing that only a specific, dedicated runner with a specific identity can reach production — and that it can only be invoked by specific modules with specific approval gates — is a much stronger story.

Team boundaries

Different teams own different parts of the infrastructure. The networking team shouldn't need to care about the application team's deployment pipeline, and vice versa. But when they share a runner, they share the pipeline configuration, the credential setup, the job queue, and the failure modes.

How teams typically cope

These are real patterns that work, up to a point.

Separate CI projects

Create one CI project per environment or per team. The prod project has prod credentials; the dev project has dev credentials. This solves credential scoping but multiplies the number of CI configurations you maintain. Pipeline logic gets duplicated or abstracted into shared templates that become their own maintenance burden.

Vault-based credential injection

Use HashiCorp Vault (or a cloud-native equivalent) to issue short-lived credentials at job time. The runner itself has minimal standing access — it authenticates to Vault, gets scoped credentials, and uses them for one job.

This is architecturally sound but adds operational complexity: you need a Vault cluster (or managed service), policies for every credential path, rotation logic, and monitoring for lease expiry. The runner still executes all jobs — you've scoped the credentials, but the execution environment is shared.

Environment-specific pipelines

Separate pipeline definitions for each environment, each with their own credential configuration. Similar to separate CI projects but within a single CI system. You get some isolation but the runner infrastructure is still shared, and the pipeline definitions tend to diverge over time.

Self-hosted runner groups

CI systems like GitHub Actions and GitLab CI support runner groups or tags. You deploy dedicated runner machines for production and different ones for development, then use labels to route jobs to the right group.

This works well for compute isolation but you're now managing runner infrastructure yourself — provisioning machines, keeping them patched, scaling them, and managing the credential distribution to each group. The CI system orchestrates which job goes where, but the operational burden is on you.

Snap CD's approach: separate orchestration from execution

Snap CD organises infrastructure in a three-level hierarchy: Stacks, Namespaces, and Modules. Stacks typically represent environments (production, staging), Namespaces group by team or infrastructure layer (networking, data-platform), and Modules are individual Terraform roots. Permissions, secrets, and Runner access can be scoped at any level and inherited downward. For a full walkthrough of this hierarchy and the input system, see Modular Deployments, and for the permission system that supports it, see A Permission System Built for Infrastructure.

Snap CD was designed around the idea that the system coordinating deployments should not be the same system executing them.

Two distinct roles

The Snap CD Server (hosted at snapcd.io, or self-hosted) handles:

Module definitions — what to deploy, from which source, with which inputs
Dependency tracking — which modules depend on which outputs
Change detection — watching Git repos and upstream outputs for changes
Plan review and approval gates
Logging and audit trails

The server never touches your cloud provider. It never holds your AWS keys or Azure credentials. It doesn't run terraform plan or terraform apply.

Runners handle execution. A Runner is a lightweight, self-hosted worker that you deploy wherever makes sense — a Kubernetes pod in your cluster, a VM in your cloud account, a container on a developer machine. The Runner:

Connects to the Snap CD server via a long-lived, authenticated, bi-directional WebSocket connection.
Picks up jobs assigned to it (plan, apply).
Downloads the module source code.
Executes standard Terraform/OpenTofu commands in a local shell session.
Reports results (plan output, apply output, state changes) back to the server.

The Runner only has the credentials you give it. A Runner deployed into your production Azure subscription with a managed identity has access to production Azure — and nothing else. A Runner on a dev machine with dev AWS keys can only reach dev AWS.

No credential forwarding

The Snap CD server never sees, stores, or forwards cloud credentials. Credentials live on the Runner, configured the same way you'd configure them for any local Terraform run — environment variables, cloud provider metadata services, credential files. The server tells the Runner what to do; the Runner uses its own credentials to do it.

This means a compromise of the Snap CD server does not expose your cloud credentials. The server knows your module definitions and deployment history, but it cannot execute infrastructure changes on its own.

Permission-controlled runner access

Snap CD's permission system extends to Runners. You control which Modules are allowed to use which Runners through supply resources.

The HCL examples below use the Snap CD Terraform provider, the canonical way to configure Snap CD with Terraform. For more on the provider, see An Extensive Supporting Toolset. First, declare the Runners:

resource "snapcd_runner" "prod_azure" {
  name            = "prod-azure"
  organization_id = snapcd_organization.main.id
}

resource "snapcd_runner" "dev_azure" {
  name            = "dev-azure"
  organization_id = snapcd_organization.main.id
}

Then use supply resources to declare which Runners are available to which scopes. A Runner "supplies" itself to a Stack, Namespace, or individual Module. The most common pattern is supplying a Runner to a Stack — since Stacks typically represent environments (production, staging, dev), this gives you per-environment credential isolation:

resource "snapcd_runner_stack_supply" "prod" {
  runner_id = snapcd_runner.prod_azure.id
  stack_id  = snapcd_stack.production.id
}

resource "snapcd_runner_stack_supply" "dev" {
  runner_id = snapcd_runner.dev_azure.id
  stack_id  = snapcd_stack.dev.id
}

Every Module in the production Stack — across all its Namespaces (networking, compute, data, etc.) — can only execute on prod-azure. Every Module in dev can only execute on dev-azure.
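The supply resources above reference snapcd_stack.production and snapcd_stack.dev, which are not declared in this article. Following the snapcd_stack pattern from Modular Deployments, they might look like this (the dev Stack's name is an assumption):

```hcl
resource "snapcd_stack" "production" {
  name            = "production"
  organization_id = snapcd_organization.main.id
}

resource "snapcd_stack" "dev" {
  name            = "dev" # assumed name for the development Stack
  organization_id = snapcd_organization.main.id
}
```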

Supply resources also work at the Namespace and Module level, so you can drill down when a team or a specific piece of infrastructure needs its own isolated Runner.

Namespace-level supply. Useful when a team manages critical resources that require dedicated credentials — for example, a data-platform Namespace where only a Runner with access to production databases should execute:

resource "snapcd_runner" "prod_data" {
  name            = "prod-data-platform"
  organization_id = snapcd_organization.main.id
}

resource "snapcd_runner_namespace_supply" "data_platform" {
  runner_id    = snapcd_runner.prod_data.id
  namespace_id = snapcd_namespace.data_platform.id
}

Modules in the data_platform Namespace execute on prod-data-platform instead of the Stack-level Runner. Other Namespaces in the same Stack continue using the Stack-level Runner.

Module-level supply. Rarely needed, but available for cases where a single module requires its own isolated runner — for example, a module that manages a key vault or certificate authority with uniquely sensitive credentials:

resource "snapcd_runner_module_supply" "key_vault" {
  runner_id = snapcd_runner.prod_keyvault.id
  module_id = snapcd_module.prod_key_vault.id
}

The boundary is enforced by the server before a job is dispatched — a module without a matching supply will not execute, regardless of what credentials are available elsewhere.

Deployment patterns

One Runner per environment

The most common pattern. Deploy a Runner into each environment (dev, staging, production), each with credentials scoped to that environment.

snapcd.io
   │
   ├── Runner-dev       (dev AWS credentials)
   ├── Runner-staging   (staging AWS credentials)
   └── Runner-prod      (prod AWS credentials, approval gates required)

Clean credential boundaries. A compromised dev Runner cannot reach production. Simple to reason about.

One Runner per cloud provider

When your infrastructure spans multiple clouds, deploy Runners with provider-specific credentials:

snapcd.io
   │
   ├── Runner-azure     (Azure managed identity)
   ├── Runner-aws       (AWS IAM role)
   └── Runner-gcp       (GCP service account)

Useful when environment boundaries are less important than provider boundaries — for example, if your Azure and AWS infrastructure are managed by different teams with different credential policies.

Combined: environment × provider

For larger organisations, combine both dimensions:

snapcd.io
   │
   ├── Runner-azure-dev
   ├── Runner-azure-prod
   ├── Runner-aws-dev
   └── Runner-aws-prod

Each Runner has exactly the credentials it needs — nothing more.

Shared Runner with scoped permissions

For smaller teams that don't need strict environment isolation, a single Runner with broad credentials can work. Use Snap CD's permission system to control which users and service principals can trigger deployments on that Runner, and rely on approval gates for production changes.

This trades some isolation for operational simplicity. It's a reasonable starting point that you can tighten as the team and infrastructure grow.

Fully source-available

Snap CD is source-available — the entire codebase, including the Server, Runner, and Terraform provider, is maintained in a single monorepo. You can inspect every component, understand exactly what runs in your environment, and modify it if you need to.

The Runner is designed to be stateless between jobs. It downloads Module source code, runs Terraform, reports results, and cleans up. No job data persists on the Runner after completion, which simplifies security reviews and makes Runners easy to replace or scale.

Compared to the alternatives

Concern	Separate CI projects	Vault + shared runner	Self-hosted runner groups	Snap CD Runners
Credential scoping	Per-project	Per-job (dynamic)	Per-group	Per-Runner
Execution isolation	Separate machines	Shared machine	Separate machines	Separate machines
Orchestration burden	Duplicated pipelines	Single pipeline + Vault	Labels/tags in CI	Managed by Snap CD Server
Dependency awareness	None (manual ordering)	None (manual ordering)	None (manual ordering)	Built-in (Module dependencies)
Audit trail	CI logs per project	CI logs + Vault audit	CI logs per group	Centralised in Snap CD

The key difference is that Snap CD Runners are not general-purpose CI machines repurposed for Terraform. They're purpose-built for infrastructure deployment, integrated with a system that understands Module dependencies, approval gates, and deployment ordering.

Tips

Start with one Runner and split later. You don't need per-environment Runners on day one. Start with a single Runner and add more as your isolation requirements become clear.
Use managed identities where possible. A Runner in Azure with a managed identity, or in AWS with an IAM role attached to its instance profile, avoids storing long-lived credentials entirely.
Keep Runners stateless. Don't store Terraform state on the Runner. Use remote backends (which Snap CD manages) so Runners can be replaced without losing state.
Monitor Runner connectivity. The WebSocket connection is long-lived but not immortal. The Runner reconnects automatically, but monitoring connection status helps you catch network issues before they block deployments.
Scope permissions early. It's easier to set up Runner permissions correctly from the start than to tighten them later when teams are already used to a permissive setup.
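As a sketch of the managed-identity tip on AWS, the following standard Terraform creates a role and instance profile that an EC2-hosted Runner could use instead of stored access keys. The names are placeholders, and the permission policies the Runner actually needs are deliberately left out:

```hcl
# Trust policy: lets EC2 instances assume the role.
data "aws_iam_policy_document" "runner_assume" {
  statement {
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["ec2.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "runner" {
  name               = "snapcd-runner-dev" # placeholder name
  assume_role_policy = data.aws_iam_policy_document.runner_assume.json
}

# Attach this profile to the Runner's EC2 instance; Terraform on the
# Runner then picks up credentials from the instance metadata service.
resource "aws_iam_instance_profile" "runner" {
  name = "snapcd-runner-dev" # placeholder name
  role = aws_iam_role.runner.name
}
```

Scope the policies you attach to this role to the environment the Runner serves, so the role itself becomes the credential boundary.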

Splitting a Terraform Monolith into Smaller States

Karl Schriek — Tue, 30 Jun 2026 15:39:06 +0000

If your Terraform plans are slow, your blast radius is too wide, or multiple teams are stepping on each other's changes, it's time to split your monolith. See The Problem with Large Terraform States for how to diagnose whether you've reached that point.

This guide walks through the process of breaking a monolithic Terraform root into smaller, independent roots — each with its own state — and how to wire the dependencies between them.

The steps

Splitting a monolith is a seven-step process:

Parse the root into a resource-level reference graph — understand what references what.
Place every resource into a target module based on lifecycle and ownership.
Compute boundaries — references crossing module boundaries become variable/output pairs; depends_on-only references become ordering edges without spurious value wiring.
Check for cycles — if module A needs an output of module B and B needs an output of A, no valid apply order exists. Catch this before writing any files.
Emit per-module roots — rewrite cross-module references to var.<input>, generate variables.tf and outputs.tf, propagate providers and locals.
Carve state — terraform state mv over local copies. The monolith state after all moves becomes the remainder module's state. Never touch the live backend during migration.
Prove the split — walk modules in topological order, thread each producer's extracted outputs into its consumers' inputs, and plan each against its carved state. Zero creates and zero destroys = the split is operationally inert.

The sections below first show how to automate these steps with Demonolith, then walk through a manual split.

Automating the steps with Demonolith

Demonolith is a Go CLI that automates all seven steps. You annotate resources with decorator comments indicating which module each belongs to:

# @demono:move networking
resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"
}

# @demono:move networking
resource "aws_subnet" "private" {
  vpc_id     = aws_vpc.main.id
  cidr_block = "10.0.1.0/24"
}

# @demono:move compute
resource "aws_eks_cluster" "main" {
  vpc_config {
    subnet_ids = [aws_subnet.private.id]
  }
}

Resources without a decorator fall to a configurable remainder module (default: monolith). Data sources can be decorated with multiple targets — they're stateless reads and get duplicated into each.

Demonolith handles parsing (via HCL AST traversal), boundary computation, code emission (with hclwrite-preserved formatting), state carving, and the topologically-threaded proof — in a single command:

# Emit carved roots only (code, no state):
demonolith split ./infra

# Also carve state into per-module local files:
demonolith split ./infra --state

# Carve + prove every module plans to zero create/destroy:
demonolith split ./infra --state --proof

The carved roots are plain Terraform — valid standalone. The cross-module edges Demonolith computes are exactly the wiring you'd configure in Snap CD via snapcd_module_input_from_output.

Going through the steps manually

1. Identify natural boundaries

Look at your resources and group them by lifecycle and ownership. Common boundaries:

Networking — VPCs, subnets, route tables, NAT gateways. Changes rarely, underpins everything.
DNS — Zones, records. Usually owned by a platform team.
Compute — Kubernetes clusters, VM scale sets, container services. Changes more often, depends on networking.
Application infrastructure — Databases, caches, queues, storage accounts. Owned by application teams.
Monitoring — Dashboards, alerts, log sinks. Changes frequently, depends on everything but nothing depends on it.

A useful test: if two resources would never be changed in the same PR by the same person, they probably belong in different states.

2. Map the dependency graph

Before you move anything, build a resource-level reference graph. For every resource, identify what it references — and trace those references across the boundaries you drew in step 1. References that cross a boundary become the variable/output pairs you'll need to create.

networking          dns
    │                 ▲
    ▼                 │
  compute ──────────►─┘
    │
    ▼
application
    │
    ▼
monitoring

The values that cross these boundaries are the wiring surface of the split. Typical examples:

Networking → Compute: vpc_id, private_subnet_ids
Compute → DNS: load_balancer_ip
Compute → Application: cluster_endpoint, cluster_ca_certificate
Application → Monitoring: database_id, cache_name

Check for cycles: if module A needs an output of module B and B needs an output of A, no valid apply order exists. You'll need to break the cycle before proceeding — move one of the cross-referencing resources to the other side, or extract the shared resource into a third module.

3. Carve the code

For each new root, create a directory and move the assigned resources into it. Three things happen at the boundary:

On the producer side, expose cross-boundary values as output blocks:

# networking/outputs.tf
output "private_subnet_ids" {
  value = aws_subnet.private[*].id
}

On the consumer side, declare those values as variable blocks:

# compute/variables.tf
variable "private_subnet_ids" {
  type = list(string)
}

In the consumer's resource definitions, rewrite the hard references to use the new variable:

# Before (monolith) — direct reference
resource "aws_eks_cluster" "main" {
  vpc_config {
    subnet_ids = aws_subnet.private[*].id
  }
}

# After (split) — variable reference
resource "aws_eks_cluster" "main" {
  vpc_config {
    subnet_ids = var.private_subnet_ids
  }
}

Don't forget structural blocks: provider configurations, locals, and root variable declarations need to be carried into every module that uses them. A depends_on that pointed at a resource now in another root should be removed — the ordering dependency is carried by the input/output wiring instead.
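A sketch of what the compute root might look like once the structural blocks are carried over. The region, the cluster_role_arn variable, and the aws_nat_gateway.main reference are hypothetical, added only to make the example complete:

```hcl
# compute/main.tf after the split (illustrative)

# Provider configuration copied from the monolith; placeholder region.
provider "aws" {
  region = "us-east-1"
}

variable "private_subnet_ids" {
  type = list(string)
}

# Hypothetical input: the IAM role for the cluster, defined elsewhere.
variable "cluster_role_arn" {
  type = string
}

resource "aws_eks_cluster" "main" {
  name     = "main"
  role_arn = var.cluster_role_arn

  vpc_config {
    subnet_ids = var.private_subnet_ids
  }

  # In the monolith this block might have carried:
  #   depends_on = [aws_nat_gateway.main]
  # That resource now lives in the networking root, so the line is dropped.
  # Ordering comes from the private_subnet_ids input instead.
}
```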

4. Carve the state

Terraform's state mv command lets you move resources from one state to another without destroying and recreating them. Work on local copies of the state — never against the live backend during the migration.

# Pull the monolith state to a local file
cd monolith
terraform state pull > terraform.tfstate

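# Switch to the new root's directory (networking, in this example)
cd ../networking

# Move resources from the local monolith copy into the new root's local state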
terraform state mv \
  -state=../monolith/terraform.tfstate \
  -state-out=./terraform.tfstate \
  aws_vpc.main aws_vpc.main

terraform state mv \
  -state=../monolith/terraform.tfstate \
  -state-out=./terraform.tfstate \
  aws_subnet.private aws_subnet.private

The monolith state file, after all moves are complete, becomes the remainder module's state — it contains exactly the resources that weren't moved out.

5. Verify the split

After carving code and state, every new root must plan to zero changes. This is the proof that the split is operationally inert — nothing will be destroyed or recreated.

The catch: a carved module planned in isolation has its upstream-sourced variables unset, because the input/output wiring doesn't exist yet at the Terraform level. You need to supply those values manually for the verification plan. Walk the modules in topological order: plan each producer first, extract its output values, and feed them as -var arguments into the consumer's plan.

# Plan the producer (no upstream dependencies)
cd networking
terraform plan

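# Extract outputs (create the outputs directory first).
# Outputs are recorded in state only after an apply. If this prints an empty
# object for a freshly carved state, apply the outputs-only change first.
mkdir -p ../outputs
terraform output -json > ../outputs/networking.json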

# Plan the consumer with the producer's outputs
cd ../compute
terraform plan \
  -var="vpc_id=$(jq -r '.vpc_id.value' ../outputs/networking.json)" \
  -var="private_subnet_ids=$(jq -c '.private_subnet_ids.value' ../outputs/networking.json)"

If any module shows creates or destroys, something went wrong — a resource was missed in the state move, a reference was rewritten incorrectly, or a variable type doesn't match.

6. Wire up the cross-state dependencies

Once the split is verified, you need a runtime mechanism to pass outputs from producers to consumers on every deploy. There are several options:

Option A: terraform_remote_state data sources

The built-in approach. Each consuming module reads the producer's state directly:

data "terraform_remote_state" "networking" {
  backend = "s3"
  config = {
    bucket = "my-terraform-state"
    key    = "networking/terraform.tfstate"
    region = "us-east-1"
  }
}

resource "aws_eks_cluster" "main" {
  vpc_config {
    subnet_ids = data.terraform_remote_state.networking.outputs.private_subnet_ids
  }
}

This works but has significant drawbacks:

Every consumer needs to know the backend configuration of every producer.
There's no enforcement of the dependency order — you have to manually ensure networking is applied before compute.
Changes to networking outputs don't automatically trigger a re-plan of compute.

Option B: Wrapper scripts and CI glue

You write shell scripts or CI pipeline steps that run terraform output on one state and feed the values into terraform apply -var on the next. This is what most teams end up doing, and it's fragile — the dependency graph lives in CI config rather than in code.

Option C: Terragrunt

Terragrunt adds a dependency layer on top of Terraform:

# compute/terragrunt.hcl
dependency "networking" {
  config_path = "../networking"
}

inputs = {
  vpc_id             = dependency.networking.outputs.vpc_id
  private_subnet_ids = dependency.networking.outputs.private_subnet_ids
}

This is a genuine improvement — dependencies are declared in code, ordering is enforced, and terragrunt run-all apply handles the graph. But Terragrunt is a local CLI tool. It doesn't provide a persistent view of deployment status, approval gates, automatic re-deployment when upstream outputs change, or scoped permissions.

Option D: Snap CD

Snap CD was built for this problem. Each split becomes a Snap CD Module, and cross-state dependencies are declared as code using the Terraform Provider for Snap CD. Snap CD enforces apply ordering, runs independent Modules in parallel, and automatically cascades changes when upstream Outputs change. The cross-module edges from the split — every output/variable pair — map directly to snapcd_module_input_from_output resources. See Modular Deployments for a detailed walkthrough of how the Module and Input system works.

Tips

Split incrementally. Move one logical group at a time. Don't try to split everything in one go.
Start with the layer that changes least. Networking is usually the best first candidate — it has many dependents but few dependencies.
Keep shared modules small. If a Terraform module (in the module {} sense) is used by multiple states, keep it focused. A module that provisions "everything for an app" is just a monolith in disguise.
Test with terraform plan after every move. A clean plan (no changes) on both the source and destination states confirms the migration was correct.
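
One way to make each move reviewable in a plan is to express it in configuration rather than with state commands. The following is a minimal sketch, assuming Terraform 1.7 or later (removed blocks need 1.7, import blocks need 1.5) and a VPC being moved from a monolith to a networking state. The resource address, the VPC ID, and the CIDR range are placeholders.

```hcl
# --- monolith/main.tf (source state) ---
# Replaces the original resource "aws_vpc" "main" block.
# Terraform forgets the VPC but leaves it running.
removed {
  from = aws_vpc.main

  lifecycle {
    destroy = false
  }
}

# --- networking/main.tf (destination state) ---
# Adopt the existing VPC into the new state.
import {
  to = aws_vpc.main
  id = "vpc-0123456789abcdef0" # placeholder VPC ID
}

resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16" # must match the real VPC to avoid a diff
}
```

With this approach, the first plan in the source state should report the VPC as no longer managed, and the first plan in the destination state should report one resource to import, each with nothing to add, change, or destroy. After both applies, a plain terraform plan in either state should come back clean.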

The Problem with Large Terraform States

Karl Schriek — Tue, 30 Jun 2026 15:35:47 +0000

At some point every growing Terraform project hits a wall. Plans that used to finish in seconds now take minutes. Applies feel risky because hundreds of resources share a single blast radius. Colleagues avoid running terraform plan because it hammers cloud APIs hard enough to trigger throttling. The state file itself becomes a liability — large, slow to lock, and one bad write away from corruption.

This guide covers the symptoms of an oversized state, the band-aids teams reach for, and the structural fix that actually works.

How Terraform state works under the hood

Every terraform plan does two things:

Refresh: for every resource in state, Terraform calls the provider's API to read the current real-world status. A state with 500 resources means at least 500 API calls, and often more, because many resource types take several calls to read and data sources are read as well.
Diff: compare the refreshed state against the desired configuration and produce a change set.

The refresh phase is the bottleneck. Terraform walks the resource graph with limited concurrency (ten operations at a time by default, see the -parallelism flag), and every resource pays the cost whether you changed it or not. Adding ten resources to a 500-resource state doesn't slow down one plan by 2%. It slows down the refresh by roughly 2% on every single plan, for every engineer, forever.

Symptoms of a state that's too large

Slow plans

The most visible symptom. Plan time scales with resource count because every resource is refreshed on every plan, regardless of whether its configuration changed. The exact speed depends on the provider: AWS resources with complex nested structures (IAM policies, security group rules) are slower to refresh than simple ones, and Azure resources that require multiple API calls per refresh are worse still. These aren't edge cases. Users regularly report 2,900-resource states taking 20–25 minutes to plan and 1,600-resource states taking 8+ minutes. Even starting Terraform with a large state can take minutes before a single API call is made.

There's a long-standing proposal for terraform plan -light that would only refresh resources whose configuration changed, but it remains unimplemented. OpenTofu has a similar request to skip refreshing unchanged resources and a proposal for state compression to reduce the overhead of large state files.

API rate limiting

Cloud providers throttle API calls. When Terraform refreshes hundreds of resources, it can exhaust rate limits:

AWS: ThrottlingException or Rate exceeded errors, especially on IAM, EC2 describe calls, and CloudFormation.
Azure: 429 Too Many Requests, particularly on Resource Manager and Key Vault APIs.
GCP: rateLimitExceeded on Compute Engine and IAM.

Terraform retries on throttling, which makes plans even slower. In severe cases, retries exhaust their budget and the plan fails entirely.
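
For AWS, the retry budget is a provider setting. As a rough sketch, the AWS provider's max_retries argument caps how many times a throttled call is retried. The value below is an arbitrary illustration, not a recommendation.

```hcl
provider "aws" {
  region = "us-east-1"

  # Upper bound on retries per API call when AWS throttles a request.
  # Lower values fail faster; higher values tolerate more throttling
  # at the cost of longer plans.
  max_retries = 10
}
```

Tuning this only changes how a large state fails. The number of refresh calls stays the same.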

Blast radius

Every resource in a state shares a blast radius. A DNS record typo and a database resize can end up in the same plan, and one bad terraform apply can damage resources the operator never intended to touch.

This isn't theoretical. Common incidents:

A for_each key change causes Terraform to destroy and recreate resources it shouldn't.
A provider upgrade changes how a resource is read, causing phantom diffs on dozens of resources.
An engineer reviews a plan, then runs terraform apply without a saved plan file after someone else has merged a change to a different resource in the same state, and the apply picks up both.
A third-party API is down or throttling, so the refresh fails for a resource you weren't even changing — blocking the entire plan. With a smaller state, that resource would be in a different state file and wouldn't affect your work at all.

With smaller states, each of these incidents affects only the resources in that state. With a monolith, everything is in play.
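
The for_each case is one of the few that can be defused in code. A minimal sketch, assuming a map key was renamed from "web" to "frontend" while the record itself stayed the same (the record names, addresses, and the zone_id variable are all hypothetical): a moved block tells Terraform to treat the new key as the existing instance instead of planning a destroy and a create.

```hcl
variable "zone_id" {
  type        = string
  description = "Hosted zone ID (hypothetical input)"
}

locals {
  # The key "web" was renamed to "frontend"; the DNS name did not change.
  records = {
    frontend = { name = "www.example.com", address = "203.0.113.20" }
    api      = { name = "api.example.com", address = "203.0.113.10" }
  }
}

resource "aws_route53_record" "app" {
  for_each = local.records

  zone_id = var.zone_id
  name    = each.value.name
  type    = "A"
  ttl     = 300
  records = [each.value.address]
}

# Map the old instance address onto the new one so state follows the rename.
moved {
  from = aws_route53_record.app["web"]
  to   = aws_route53_record.app["frontend"]
}
```

This only helps when someone remembers to write the moved block. In a monolith, a missed one can still put unrelated resources in the same destructive plan.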

Locking contention

Remote state backends use locking to prevent concurrent writes. The longer a plan or apply takes, the longer the lock is held. With a 10-minute plan, other engineers are blocked for 10 minutes. If an apply follows, that's another stretch of locked state.

Teams start working around locks: using -lock=false (dangerous), splitting work by time of day (inefficient), or simply waiting. None of these are real solutions. Large state files make the contention worse, because each write serialises the entire state and so holds the lock for longer.

State file size and corruption risk

State files grow linearly with resource count. A 1,000-resource state file can be several megabytes of JSON. Every plan downloads the full state, and every apply uploads a new version. On slow connections or with large states, this adds latency.

More critically, large state files are harder to recover from corruption. If a write is interrupted (network failure during apply, process killed), the state can become inconsistent. With a small state, recovery is straightforward — reimport a handful of resources. With a monolith, you're reimporting hundreds. Large state files also compound the secrets problem — Terraform stores sensitive values in plaintext in state, so a bigger state means more secrets exposed in a single file. OpenTofu implemented state encryption, but Terraform's proposal has been open since 2016.

Band-aids that don't fix the problem

`terraform plan -target`

The -target flag tells Terraform to plan only the named resources and whatever they depend on:

terraform plan -target=aws_instance.web

This makes individual plans fast, but it's a trap:

You must know which resources to target. Miss a resource that needs to change and the plan is incomplete.
Targeted plans ignore dependents. Terraform includes what the target depends on, but not the resources that depend on the target, so you can apply a change that leaves an untargeted resource broken or out of date.
It's manual and error-prone. There's no guardrail preventing someone from running a full plan and waiting 15 minutes.
Terraform itself warns that -target is meant for exceptional situations, such as recovering from errors, and is not suitable for routine use.

`terraform plan -refresh=false`

Skipping refresh makes plans fast because Terraform uses the last-known state instead of querying APIs:

terraform plan -refresh=false

The problem is obvious: if the real world has drifted from state, the plan is wrong. An engineer deleted a resource manually, someone changed a security group in the console, a colleague applied from a different branch — none of this shows up. You're planning against fiction.

Workspaces

Terraform workspaces let you maintain multiple state files from the same configuration. They're designed for deploying the same infrastructure to different environments (dev, staging, prod), not for splitting a large state into smaller pieces.

Workspaces don't reduce the number of resources per state. If your monolith has 500 resources, each workspace still has 500 resources. They solve a different problem.
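
A short sketch shows why. The configuration below is shared by every workspace, and only the value of terraform.workspace differs between them (the bucket name is hypothetical):

```hcl
locals {
  # Name of the currently selected workspace, e.g. "dev" or "prod".
  environment = terraform.workspace
}

resource "aws_s3_bucket" "assets" {
  bucket = "example-assets-${local.environment}"
}
```

Every resource declared alongside this bucket is planned and refreshed in each workspace, so the per-state resource count is unchanged.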

`terraform state rm` and manual state surgery

When a single resource is causing problems, engineers sometimes remove it from state and reimport it:

terraform state rm aws_instance.problematic
terraform import aws_instance.problematic i-0123456789abcdef0

This is a valid recovery technique but not a scaling strategy. It's manual, risky (remove the wrong address and Terraform forgets a live resource, then tries to create a duplicate on the next apply), and doesn't address the underlying size problem.

The real fix: smaller states

The only way to permanently fix a large state is to break it into smaller ones. Each state contains a logical group of resources — networking, compute, databases, monitoring — with its own lifecycle, credentials, and blast radius. If your state spans multiple cloud providers, splitting along provider boundaries is one of the most effective first moves. Each provider has its own API rate limits, its own authentication, and its own failure modes — an Azure outage shouldn't block a plan that only touches AWS resources. Separate states per provider also let you scope credentials more tightly and parallelise plans that would otherwise run sequentially through a single refresh cycle.
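
In practice a split state is just a separate root module with its own backend key. A minimal sketch, assuming an S3 backend (the bucket name and both key paths are hypothetical):

```hcl
# --- networking/backend.tf ---
terraform {
  backend "s3" {
    bucket = "example-terraform-state" # hypothetical bucket
    key    = "networking/terraform.tfstate"
    region = "us-east-1"
  }
}

# --- compute/backend.tf ---
terraform {
  backend "s3" {
    bucket = "example-terraform-state" # hypothetical bucket
    key    = "compute/terraform.tfstate"
    region = "us-east-1"
  }
}
```

Each directory is planned, locked, and applied independently, so a lock on the networking state no longer blocks a compute plan.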

The hard part isn't the split itself — it's managing the dependencies between the resulting states. Networking outputs need to flow into compute. Compute outputs need to flow into application infrastructure. Changes to one state need to trigger re-plans in dependent states. Snap CD was built for exactly this workflow — it tracks cross-state dependencies declaratively and cascades changes automatically, so you get the benefits of smaller states without the coordination overhead. For a discussion of approaches to breaking a monolith into smaller states, see Splitting a Terraform Monolith. To learn more about how Snap CD approaches modular deployments, see Modular Deployments.

How to tell when it's time

There's no universal threshold, but if any of these are true, you should start planning a split:

terraform plan consistently takes more than a few minutes.
More than one team commits to the same Terraform root module.
You've had an incident where an apply affected resources the operator didn't intend to change.
Applies are failing due to issues with unrelated resources in the same state.
Your state spans multiple cloud providers, and an outage or rate limit on one provider blocks plans for resources on another.
Engineers routinely use -target or -refresh=false to work around slowness.

Start with the layer that changes least (usually networking) and work outward. The Splitting a Terraform Monolith guide has the step-by-step process.

Tips

Split by cloud provider early. If your state has resources across AWS, Azure, and GCP, separating them into per-provider states is one of the highest-value splits. Each provider has independent rate limits, authentication, and failure modes — keeping them together means a slow Azure API refresh delays your AWS plan for no reason.
Watch for provider-specific bottlenecks. Even within a single cloud, some resource types are slower than others. If most of your plan time is AWS IAM resources, splitting out IAM alone might cut plan time dramatically. Splitting along these lines is also a prerequisite if you want to avoid mixing credentials on the workers that run the deployments.
Don't over-split. Five resources that always change together, owned by the same team, with the same credentials, should stay in one state. The goal is fast plans and small blast radius, not one resource per state.
Use -parallelism wisely. Terraform's -parallelism flag (default 10) controls concurrent provider operations. Increasing it can speed up plans but also increases the risk of hitting API rate limits. With smaller states, the default is usually fine.

