DEV Community

Karl Schriek

Originally published at snapcd.io

Why Snap CD: A Permission System Built for Infrastructure

Most infrastructure teams handle access control in one of two places: the CI/CD layer or the cloud provider's IAM layer. Neither maps well to how infrastructure is actually structured.

CI permissions are usually binary — you can trigger a pipeline or you can't. There's no concept of "this person can deploy networking but not databases." Cloud IAM is more granular, but it governs what credentials can do, not what people can do within your deployment workflow. You end up with a gap: the system that understands your infrastructure topology has no permission model, and the system that has a permission model doesn't understand your infrastructure topology.

Snap CD sits in that gap. It provides a hierarchical role-based access control system that maps directly to the way you organise your infrastructure (Stacks, Namespaces, Modules, Runners, Agents, and Integrations) and enforces it uniformly whether actions come through the web dashboard, the API, or the Terraform Provider.

The two common approaches and where they break down

CI/CD gating

The simplest form of infrastructure access control: who can trigger the pipeline?

Most CI systems give you repository-level permissions. If you have write access to the repo, you can trigger the workflow. Some also offer environment-level protection rules, such as requiring approval from a specific team before deploying to prod.

This works until your infrastructure spans multiple repositories, or until you need more granularity than "can deploy to this environment." Can this person create new Modules but not delete existing ones? Can they approve a plan but not trigger an apply? CI systems don't model these distinctions.

There's also the backdoor problem. Protection rules only apply to CI-triggered runs. Anyone with the right credentials can run terraform apply from their laptop and bypass every gate you've set up.

Cloud IAM

Cloud providers have sophisticated permission systems — Azure RBAC, AWS IAM, GCP IAM. These control what API calls a principal can make against cloud resources. But they operate at the wrong abstraction level for deployment workflows.

Cloud IAM doesn't know that your VPC, subnets, and route tables form a logical "networking" group that one team owns. It doesn't know that module-compute depends on module-networking and should only be deployable after networking is stable. It can tell you whether a service principal can create an EC2 instance, but it can't tell you whether a human should be allowed to approve the plan that creates it.

You end up encoding deployment permissions across multiple systems — repo access in GitHub, environment protection rules in Actions, IAM policies in AWS — with no single place to answer "who can do what to which part of my infrastructure?"

Snap CD's permission model

Snap CD's permission system is built around two ideas: roles describe what you can do, and scope determines where you can do it.

Principals

Three types of identity can hold role assignments:

  • Users — human operators, authenticated via the identity provider.
  • Service principals — machine identities for automation, CI pipelines, and API integrations.
  • Groups — collections of users or service principals, for managing permissions at team scale.
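Because roles can be assigned to Groups, a principal effectively acts as itself plus every Group it belongs to. The toy model below sketches that expansion step. `Principal`, `effective_principals`, and the membership dictionary are hypothetical names for illustration, not part of Snap CD's API, and the sketch assumes flat Groups (no Group nested inside another), since nesting is not covered here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Principal:
    """Hypothetical principal; kind is "User", "ServicePrincipal", or "Group"."""
    kind: str
    name: str


def effective_principals(
    principal: Principal, memberships: dict[Principal, set[Principal]]
) -> set[Principal]:
    """Return the principal itself plus every Group that lists it as a member.

    `memberships` maps each Group to its direct members. Groups are assumed
    to be flat, so a Group inside another Group is not expanded here.
    """
    result = {principal}
    for group, members in memberships.items():
        if principal in members:
            result.add(group)
    return result


alice = Principal("User", "alice")
ci = Principal("ServicePrincipal", "ci-pipeline")
platform = Principal("Group", "platform-team")

memberships = {platform: {alice, ci}}
print(sorted(p.name for p in effective_principals(alice, memberships)))
# ['alice', 'platform-team']
```

Any role later assigned to `platform-team` then applies to both `alice` and the CI service principal without a per-member assignment.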

Roles

Roles define a set of allowed operations. The same role names appear across different scope levels, with context-appropriate permissions:

  • Owner — full control, including the ability to delete the resource and manage role assignments on it.
  • Contributor — create, update, and manage child resources, but cannot delete the resource itself or manage role assignments.
  • Reader — read-only access.
  • IdentityAccessManager — can manage role assignments on this resource without having full Owner control.

Additional roles exist at specific scope levels:

  • StackCreator (organization) — can create new Stacks.
  • NamespaceCreator (Stack) — can create new Namespaces within the Stack.
  • ModuleCreator (Namespace) — can create new Modules within the Namespace.
  • Approver — can approve deployment plans.
  • JobManager — can manage deployment jobs (cancel, retry).
  • SourceChangeNotifier — can notify the system of source changes (used by webhooks and CI integrations).
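One way to picture roles is as named sets of operations, where a principal's allowed operations are the union of its roles. The sketch below uses hypothetical operation names derived from the role descriptions above and from the later examples (in which a Contributor can trigger plans, approve, and deploy). Snap CD's real permission identifiers, and the per-scope differences mentioned above, are not modelled.

```python
# Hypothetical operation names; Snap CD's actual permission identifiers may differ.
_CONTRIBUTE = frozenset(
    {"read", "update", "create_child", "manage_children",
     "trigger_plan", "approve_plan", "apply"}
)

ROLE_OPERATIONS: dict[str, frozenset[str]] = {
    "Reader": frozenset({"read"}),
    "Contributor": _CONTRIBUTE,
    # Owner adds deleting the resource itself and managing its role assignments.
    "Owner": _CONTRIBUTE | {"delete", "manage_role_assignments"},
    # Can manage role assignments without full Owner control.
    "IdentityAccessManager": frozenset({"manage_role_assignments"}),
    "Approver": frozenset({"approve_plan"}),
    "JobManager": frozenset({"cancel_job", "retry_job"}),
    "SourceChangeNotifier": frozenset({"notify_source_change"}),
    "StackCreator": frozenset({"create_stack"}),
    "NamespaceCreator": frozenset({"create_namespace"}),
    "ModuleCreator": frozenset({"create_module"}),
}


def allowed_operations(roles: set[str]) -> frozenset[str]:
    """Union of the operations granted by each role; unknown roles grant nothing."""
    ops: frozenset[str] = frozenset()
    for role in roles:
        ops |= ROLE_OPERATIONS.get(role, frozenset())
    return ops


def can(roles: set[str], operation: str) -> bool:
    return operation in allowed_operations(roles)


print(can({"Reader", "Approver"}, "approve_plan"))  # True
print(can({"Approver"}, "apply"))                    # False
```

The Reader plus Approver combination is the kind of split that CI gating cannot express: someone who may approve a plan but not trigger an apply.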

Scope hierarchy

Role assignments are scoped to a specific level in the hierarchy. Permissions granted at a higher level flow down to all children:

Organization
  └── Stack (e.g. "prod", "test")
        └── Namespace (e.g. "prod/networking", "prod/application")
              └── Module (e.g. "prod/networking/vpc")

Runners, Agents, and Integrations sit outside this hierarchy — they each have their own scope. A Runner's Owner controls which Modules are allowed to execute on it. An Agent's Owner controls which scopes it can serve Missions in.

A role assigned at the Organization level applies everywhere. A role assigned at a specific Module applies only to that Module. This means you can express both broad policies ("the platform team is Reader on the entire organization") and narrow exceptions ("except they're Owner on the networking Namespace").
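The flow-down rule can be modelled by treating each scope as a path from the Organization: an assignment applies to a target exactly when its scope is the target itself or one of its ancestors, i.e. a prefix of the target's path. This is a minimal sketch of that rule only. `Assignment`, `effective_roles`, and the tuple-path representation are hypothetical, and the check compares names rather than Snap CD's real resource IDs.

```python
from typing import NamedTuple

# A scope is a path from the Organization: () is the Organization itself,
# ("prod",) a Stack, ("prod", "networking") a Namespace,
# ("prod", "networking", "vpc") a Module.
Scope = tuple[str, ...]


class Assignment(NamedTuple):
    principal: str
    scope: Scope
    role: str


def applies_to(scope: Scope, target: Scope) -> bool:
    """True if `scope` is `target` itself or one of its ancestors."""
    return target[: len(scope)] == scope


def effective_roles(principal: str, target: Scope, assignments: list[Assignment]) -> set[str]:
    """Collect every role the principal holds at `target`, directly or inherited."""
    return {
        a.role
        for a in assignments
        if a.principal == principal and applies_to(a.scope, target)
    }


assignments = [
    Assignment("platform-team", (), "Reader"),                     # broad policy
    Assignment("platform-team", ("prod", "networking"), "Owner"),  # narrow grant
]

print(sorted(effective_roles("platform-team", ("prod", "networking", "vpc"), assignments)))
# ['Owner', 'Reader']
print(sorted(effective_roles("platform-team", ("prod", "application"), assignments)))
# ['Reader']
```

In this model inherited and direct roles simply accumulate: the narrow Owner grant adds to the broad Reader grant rather than replacing it.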

Concrete examples

Platform team owns networking, reads everything else

The platform team manages all networking infrastructure but should only observe application deployments:

resource "snapcd_stack_role_assignment" "platform_reader" {
  stack_id                = snapcd_stack.production.id
  principal_id            = snapcd_group.platform_team.id
  principal_discriminator = "Group"
  role_name               = "Reader"
}

resource "snapcd_namespace_role_assignment" "platform_owns_networking" {
  namespace_id            = snapcd_namespace.networking.id
  principal_id            = snapcd_group.platform_team.id
  principal_discriminator = "Group"
  role_name               = "Owner"
}

The platform team gets Reader at the Stack level (they can see everything in production) and Owner on the networking Namespace (they can deploy, approve, and manage Modules within it). They cannot modify or deploy anything in other Namespaces.

Junior engineer approves test but not prod

A junior team member should be able to approve deployment plans in the test environment but only observe production:

resource "snapcd_stack_role_assignment" "junior_test_contributor" {
  stack_id                = snapcd_stack.test.id
  principal_id            = snapcd_user.junior_engineer.id
  principal_discriminator = "User"
  role_name               = "Contributor"
}

resource "snapcd_stack_role_assignment" "junior_prod_reader" {
  stack_id                = snapcd_stack.prod.id
  principal_id            = snapcd_user.junior_engineer.id
  principal_discriminator = "User"
  role_name               = "Reader"
}

They can trigger plans, approve, and deploy in test. In prod, they can see what's happening but can't change anything.

CI Service Principal scoped to a single Module

An automated deployment pipeline that should only be able to deploy one specific Module:

resource "snapcd_module_role_assignment" "ci_deploys_api" {
  module_id               = snapcd_module.api_gateway.id
  principal_id            = snapcd_service_principal.ci_pipeline.id
  principal_discriminator = "ServicePrincipal"
  role_name               = "Contributor"
}

The service principal can trigger plans and applies on the API gateway Module, but has no access to anything else in the organization. If the pipeline is compromised, the blast radius is limited to a single Module.

Runner access control

Controlling which Modules can execute on which Runners is a security boundary — a Runner deployed in your production Azure subscription should only execute production Modules. This is handled by Runner Supply, not by role assignments. A Runner Supply declares that a Runner is available to a Stack, Namespace, or individual Module:

resource "snapcd_runner_stack_supply" "prod" {
  runner_id = snapcd_runner.azure_prod.id
  stack_id  = snapcd_stack.production.id
}

Every Module in the production Stack can execute on azure_prod. Modules in other Stacks cannot, regardless of what credentials exist elsewhere. Without a matching Supply, a Module will not execute.

Runner role assignments (snapcd_runner_role_assignment) serve a different purpose — they control what a principal can do to the Runner itself (manage, view, etc.), not which Modules execute on it.
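Supply is a separate gate from RBAC: it answers where a Runner may execute, not who may act. A minimal sketch of that check, assuming scopes are modelled as paths from the Organization (the Stack is `("production",)`, a Module under it `("production", "networking", "vpc")`). `can_execute` and the tuple representation are hypothetical, not Snap CD's API; note that the function never looks at role assignments.

```python
Scope = tuple[str, ...]  # path from the Organization, e.g. ("production",) for a Stack


def can_execute(runner: str, module: Scope, supplies: list[tuple[str, Scope]]) -> bool:
    """A Module may run on `runner` only if some Supply for that Runner covers it.

    A Supply scope covers a Module when it is the Module itself or one of its
    ancestors (its Namespace or Stack). Role assignments play no part here.
    """
    return any(
        supplied_runner == runner and module[: len(scope)] == scope
        for supplied_runner, scope in supplies
    )


# Mirrors the snapcd_runner_stack_supply "prod" resource above.
supplies = [("azure_prod", ("production",))]

print(can_execute("azure_prod", ("production", "networking", "vpc"), supplies))  # True
print(can_execute("azure_prod", ("test", "networking", "vpc"), supplies))        # False
```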

AI Agent access control

Agents follow the same supply-and-RBAC model as Runners. An Agent is backed by a Service Principal — its permissions are whatever roles that Service Principal holds. You supply the Agent to specific scopes, and declare which Missions it can run at each scope:

resource "snapcd_agent" "ai" {
  name                       = "ai-agent"
  service_principal_id       = data.snapcd_service_principal.ai_agent.id
  is_supplied_to_all_modules = false
}

resource "snapcd_agent_stack_supply" "test" {
  agent_id = snapcd_agent.ai.id
  stack_id = snapcd_stack.test.id
}

resource "snapcd_stack_mission" "diagnose_test" {
  stack_id     = snapcd_stack.test.id
  agent_id     = snapcd_agent.ai.id
  mission_type = "AutoDiagnose"
}

This Agent can auto-diagnose failed Jobs in the test Stack. Without a supply covering prod, it won't receive Missions there — even if someone accidentally creates a prod-scoped Mission for it. The Agent's Service Principal still needs the appropriate RBAC role to perform the action (e.g. Contributor to attempt an auto-fix, Reader to diagnose). Every action is logged and attributed to the Agent's Service Principal, giving you the same audit trail as human operators.
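The two gates described above (Supply decides whether the Agent receives a Mission; RBAC decides whether its Service Principal may act) can be sketched as follows. `evaluate_mission`, `REQUIRED_ROLE`, and the action strings are hypothetical. The role ordering (Owner includes Contributor, which includes Reader) is an assumption of this sketch, and the Service Principal's roles are passed in already resolved for the Mission's scope.

```python
from typing import NamedTuple

Scope = tuple[str, ...]  # path from the Organization, e.g. ("test",) for a Stack

# Assumption for this sketch: each role includes everything ranked below it.
ROLE_RANK = {"Reader": 1, "Contributor": 2, "Owner": 3}
# Minimum role per action, following the example in the text.
REQUIRED_ROLE = {"diagnose": "Reader", "auto_fix": "Contributor"}


class Decision(NamedTuple):
    dispatched: bool  # did a Supply let the Agent receive the Mission?
    allowed: bool     # may the Agent's Service Principal perform the action?


def evaluate_mission(
    agent: str,
    mission_scope: Scope,
    action: str,
    supplies: list[tuple[str, Scope]],
    sp_roles: set[str],
) -> Decision:
    """Gate 1: a Supply must cover the Mission's scope. Gate 2: RBAC on the SP."""
    dispatched = any(
        a == agent and mission_scope[: len(s)] == s for a, s in supplies
    )
    if not dispatched:
        # No covering Supply: the Mission is never delivered, whatever roles the SP holds.
        return Decision(False, False)
    best = max((ROLE_RANK.get(r, 0) for r in sp_roles), default=0)
    return Decision(True, best >= ROLE_RANK[REQUIRED_ROLE[action]])


# Mirrors the snapcd_agent_stack_supply "test" resource above.
supplies = [("ai-agent", ("test",))]

print(evaluate_mission("ai-agent", ("test",), "diagnose", supplies, {"Reader"}))
# Decision(dispatched=True, allowed=True)
print(evaluate_mission("ai-agent", ("test",), "auto_fix", supplies, {"Reader"}))
# Decision(dispatched=True, allowed=False)
print(evaluate_mission("ai-agent", ("prod",), "diagnose", supplies, {"Owner"}))
# Decision(dispatched=False, allowed=False)
```

The last call shows the point made above: even a highly privileged Service Principal gets nothing in prod when no Supply covers it.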

No backdoors

A common failure mode with CI-based access control is that the gates only apply to one path. Someone with the right cloud credentials can bypass CI entirely and run terraform apply from their laptop.

Snap CD's permission model applies to every interaction path. Whether you click "Approve" in the web dashboard, call the REST API from a script, or manage resources through the Terraform Provider, the same role assignments are evaluated. There is no unenforced path through Snap CD.

This also means your access control configuration is auditable in one place. Instead of piecing together GitHub team permissions, CI environment protection rules, and cloud IAM policies to understand who can deploy what, you query Snap CD's role assignments.
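Having every path enforced is essentially a design pattern: each entry point delegates to one authorization function instead of carrying its own checks. The sketch below illustrates that pattern with hypothetical handler names and a hard-coded grant table standing in for resolved role assignments; it shows the pattern, not Snap CD's internal code.

```python
from functools import wraps
from typing import Callable

# Stand-in for role assignments already resolved into (principal, operation) grants.
GRANTS: set[tuple[str, str]] = {("alice", "approve_plan")}


def authorize(principal: str, operation: str) -> None:
    """The single decision point that every entry path calls."""
    if (principal, operation) not in GRANTS:
        raise PermissionError(f"{principal} may not {operation}")


def enforced(operation: str) -> Callable[[Callable], Callable]:
    """Wrap a handler so it cannot run without passing `authorize` first."""
    def decorator(handler: Callable) -> Callable:
        @wraps(handler)
        def wrapper(principal: str, *args, **kwargs):
            authorize(principal, operation)
            return handler(principal, *args, **kwargs)
        return wrapper
    return decorator


# Hypothetical handlers for the three paths named in the text.
@enforced("approve_plan")
def dashboard_approve(principal: str, plan_id: str) -> str:
    return f"{principal} approved {plan_id} via dashboard"


@enforced("approve_plan")
def api_approve(principal: str, plan_id: str) -> str:
    return f"{principal} approved {plan_id} via REST API"


@enforced("manage_role_assignments")
def provider_create_role_assignment(principal: str, target: str) -> str:
    return f"{principal} created a role assignment on {target} via Terraform Provider"


print(dashboard_approve("alice", "plan-42"))  # alice approved plan-42 via dashboard
try:
    api_approve("bob", "plan-42")
except PermissionError as err:
    print(err)  # bob may not approve_plan
```

Because the decision lives in one place, auditing "who can do what" means inspecting one grant source rather than each interface.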

Managing permissions as code

Because every role assignment is a Terraform resource, your permission model lives in version control alongside the rest of your infrastructure configuration. Changes go through the same review process as any other infrastructure change — pull request, review, approve, apply.

resource "snapcd_stack" "prod" {
  name            = "prod"
  organization_id = snapcd_organization.main.id
}

resource "snapcd_namespace" "networking" {
  name     = "networking"
  stack_id = snapcd_stack.prod.id
}

resource "snapcd_namespace" "application" {
  name     = "application"
  stack_id = snapcd_stack.prod.id
}

resource "snapcd_stack_role_assignment" "sre_owns_prod" {
  stack_id                = snapcd_stack.prod.id
  principal_id            = snapcd_group.sre.id
  principal_discriminator = "Group"
  role_name               = "Owner"
}

resource "snapcd_namespace_role_assignment" "appdev_contributes_app" {
  namespace_id            = snapcd_namespace.application.id
  principal_id            = snapcd_group.app_developers.id
  principal_discriminator = "Group"
  role_name               = "Contributor"
}

The SRE team owns the entire prod Stack. Application developers can deploy within the application Namespace but cannot touch networking. Both constraints are declared, version-controlled, and enforced at every interaction point.

Tips

  • Start broad, narrow later. Give your team Contributor at the organization level to start. As you identify boundaries — different teams, different environments, different risk levels — add scoped assignments and remove the broad one.
  • Use groups, not individual users. Assigning roles to groups means onboarding a new team member is a single group membership change, not a dozen role assignments.
  • Scope Runners to environments. A Runner with production credentials should only accept jobs from production Modules. Use Runner Supply to enforce this.
  • Treat permissions as infrastructure. Define all role assignments in Terraform. If a role assignment isn't in code, it shouldn't exist.
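  • Audit regularly. Because all role assignments are Terraform resources, terraform plan will show you any drift between the permissions declared in code and their actual state. It cannot see role assignments created outside Terraform, which is one more reason to keep every assignment in code.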
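For a scheduled audit, `terraform plan -detailed-exitcode` is a convenient hook: it exits 0 when the plan has no changes, 1 on error, and 2 when the plan contains changes. Below is a minimal Python wrapper, assuming the role assignments live in their own Terraform working directory that has already been initialised, so that a non-empty plan means permission drift rather than unrelated changes. `check_permission_drift` and `interpret_plan_exit_code` are hypothetical helper names.

```python
import subprocess


def interpret_plan_exit_code(code: int) -> str:
    """Map a `terraform plan -detailed-exitcode` exit status to a drift result."""
    if code == 0:
        return "no-drift"  # plan succeeded with an empty diff
    if code == 2:
        return "drift"     # plan succeeded and found changes
    return "error"         # 1 (or anything else): the plan itself failed


def check_permission_drift(workdir: str) -> str:
    """Run a non-interactive plan in `workdir` and report whether it drifted."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return interpret_plan_exit_code(result.returncode)
```

Separating the exit-code mapping from the subprocess call keeps the decision logic testable without a Terraform installation.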
