Karl Schriek

Posted on Jul 5 • Originally published at snapcd.io

Why Snap CD: AI on a Leash

#cicd #devops #cloud #terraform

AI coding agents are showing up in infrastructure workflows. They can diagnose a failed terraform apply, summarize what changed across a dozen modules overnight, draft a fix for a misconfigured security group, and recommend whether a plan is safe to approve. The potential to eliminate toil is real.

But so is the potential to break things. A bad terraform apply can delete a production database. An agent that auto-approves plans without understanding blast radius is not a productivity tool — it's a liability. The question isn't whether to use AI in infrastructure management. It's how to use it without handing over the keys.

The problem with unrestricted agents

Most AI agent frameworks assume broad access. Give the agent credentials, point it at your infrastructure, and let it figure things out. This works fine for generating code in a branch. It's a terrible model for infrastructure, where the gap between "run this command" and "destroy this resource" is one flag.

The usual mitigations are crude:

Read-only API keys. The agent can observe but not act. You get diagnostics but no automation — the human still has to do everything.
Wrapper scripts with allow-lists. You write a shell script that only permits certain Terraform commands. Fragile, hard to maintain, and easy to outgrow.
Separate CI pipelines. The agent commits to a branch, CI runs the plan, a human reviews. This works but adds latency and doesn't let the agent participate in approval or deployment at all.

None of these give you a spectrum of trust. It's all-or-nothing: either the agent can do everything, or it's limited to generating text that a human has to act on manually.

What you actually want

A useful model looks more like how you'd onboard a new team member:

Start them with read access so they can learn the system and diagnose issues.
Give them deploy access to test so they can move fast without risk.
Let them approve low-risk changes in staging once they've proven reliable.
Grant production access only when trust is established — and even then, scoped to the systems they own.

The same progression makes sense for an AI agent. The challenge is finding a system that supports this without building a separate authorization layer just for AI.

How Snap CD handles this

Snap CD has first-class support for AI agents — but rather than inventing a parallel permission system, it treats an Agent as a principal governed by the same RBAC that controls human access. The result is two complementary layers: a Missions framework that gives agents a narrow, event-driven interface to the deployment lifecycle, and a Permission System that controls what those agents (or any other AI) can actually do.

The Agent component and Missions

Snap CD's Agent is a self-hosted process that consumes deployment events and runs AI-driven Missions. Rather than giving an AI broad access and hoping it does the right thing, Missions provide a narrow frame — each Mission type is bound to a specific trigger:

| Mission | Trigger | What it does |
| --- | --- | --- |
| `AutoDiagnose` | Job fails | Posts a root-cause hypothesis with relevant log excerpts and suggested next steps |
| `AutoFix` | Job fails | Attempts an automated fix based on the diagnosis, then retries the Job |
| `ApprovalRecommend` | Job reaches approval-required state | Analyzes the plan output and recommends whether to approve |
| `SummarizeJob` | Job succeeds | Generates a human-readable summary of what changed |

You create an Agent, assign it a Service Principal (which determines what it can do via RBAC), and supply it to the scopes it should serve — the same supply model used for Runners:

resource "snapcd_agent" "ai" {
  name                       = "ai-agent"
  service_principal_id       = data.snapcd_service_principal.ai_agent.id
  is_supplied_to_all_modules = false
}

resource "snapcd_agent_stack_supply" "test" {
  agent_id = snapcd_agent.ai.id
  stack_id = snapcd_stack.test.id
}

resource "snapcd_stack_mission" "diagnose_test" {
  stack_id     = snapcd_stack.test.id
  agent_id     = snapcd_agent.ai.id
  mission_type = "AutoDiagnose"
}

This sets up an Agent that auto-diagnoses any failed Job in the test Stack. Without a supply covering prod, the Agent won't receive Missions there — even if someone accidentally creates a prod-scoped Mission for it. You can scope Missions down to individual Namespaces or Modules.
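The supply rule above can be modelled as a small predicate. This is an illustrative Python sketch, not Snap CD's implementation: the `Supply` and `Mission` types and the `should_dispatch` function are hypothetical names, and it only models Stack-level scopes.

```python
from dataclasses import dataclass

# Hypothetical model of the supply rule; not Snap CD's actual code.

@dataclass(frozen=True)
class Supply:
    agent_id: str
    stack: str

@dataclass(frozen=True)
class Mission:
    agent_id: str
    stack: str
    mission_type: str

def should_dispatch(mission: Mission, supplies: set[Supply]) -> bool:
    """A Mission reaches an Agent only if that Agent is supplied to the Mission's Stack."""
    return Supply(mission.agent_id, mission.stack) in supplies

supplies = {Supply("ai-agent", "test")}
test_mission = Mission("ai-agent", "test", "AutoDiagnose")
prod_mission = Mission("ai-agent", "prod", "AutoDiagnose")  # created by accident

print(should_dispatch(test_mission, supplies))  # True
print(should_dispatch(prod_mission, supplies))  # False
```

The point of the two-key design is that a Mission definition alone is never enough: the supply acts as a second, independent switch.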

Scoped role assignments

The Agent's Service Principal still needs the appropriate RBAC role to perform its actions. You grant roles at whatever granularity makes sense — scoped to a Stack, Namespace, or Module:

# The agent can read everything in prod — diagnose issues, view plans, inspect state
resource "snapcd_stack_role_assignment" "agent_prod_reader" {
  principal_id            = data.snapcd_service_principal.ai_agent.id
  principal_discriminator = "ServicePrincipal"
  role_name               = "Reader"
  stack_id                = snapcd_stack.prod.id
}

# The agent can deploy freely in test — run plans, apply
resource "snapcd_stack_role_assignment" "agent_test_contributor" {
  principal_id            = data.snapcd_service_principal.ai_agent.id
  principal_discriminator = "ServicePrincipal"
  role_name               = "Contributor"
  stack_id                = snapcd_stack.test.id
}

# The agent can manage jobs in staging, but only for the networking namespace
resource "snapcd_namespace_role_assignment" "agent_staging_jobs" {
  principal_id            = data.snapcd_service_principal.ai_agent.id
  principal_discriminator = "ServicePrincipal"
  role_name               = "JobManager"
  namespace_id            = snapcd_namespace.staging_networking.id
}

If the agent tries to approve a production deploy, it gets a permission-denied error, the same as any user without the right role on that scope. No special-case logic, no wrapper scripts.
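To make the scoping concrete, here is a minimal Python sketch of how a scoped role check could resolve. It is a model under stated assumptions, not Snap CD's authorization code: the role-to-action mapping, the slash-separated scope paths, and the `is_allowed` function are all hypothetical, and the real roles may grant different actions.

```python
# Hypothetical role-to-action mapping; the real roles' exact permissions may differ.
ROLE_ACTIONS = {
    "Reader": {"read"},
    "JobManager": {"read", "approve"},
    "Contributor": {"read", "approve", "plan", "apply"},
}

# Role assignments keyed by scope path: "Stack" or "Stack/Namespace".
ASSIGNMENTS = {
    "prod": "Reader",
    "test": "Contributor",
    "staging/networking": "JobManager",
}

def is_allowed(action: str, scope: str) -> bool:
    """Walk from the requested scope up to its Stack; allow if any assigned role grants the action."""
    parts = scope.split("/")
    for depth in range(len(parts), 0, -1):
        role = ASSIGNMENTS.get("/".join(parts[:depth]))
        if role and action in ROLE_ACTIONS[role]:
            return True
    return False

print(is_allowed("read", "prod/database"))         # True: Reader on prod covers its namespaces
print(is_allowed("approve", "prod/database"))      # False: Reader does not grant approve
print(is_allowed("apply", "test/networking/vpc"))  # True: Contributor on test
print(is_allowed("approve", "staging/compute"))    # False: JobManager is scoped to staging/networking
```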

Approval gates as natural checkpoints

Snap CD's approval system works the same regardless of who (or what) created the plan. A Module can require a minimum number of approvals before an apply proceeds. This means:

An agent can trigger a plan and recommend approval (via the ApprovalRecommend Mission).
A human reviews the plan output and approves or rejects.
The apply only proceeds once the required approval count is met.

You can also set up a workflow where the agent itself is one of multiple required approvers. Two humans and one agent, or two agents and one human — whatever quorum makes sense for the risk level. The approval system doesn't care whether the approver is biological.
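The quorum idea can be sketched in a few lines of Python. This is a hypothetical model rather than Snap CD's logic: the `Approval` type and `quorum_met` function are invented for illustration, and counting each principal at most once is an assumption.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Approval:
    principal_id: str
    principal_kind: str  # "User" or "ServicePrincipal"; recorded but not used for counting

def quorum_met(approvals: list[Approval], required: int) -> bool:
    """The apply may proceed once enough distinct principals have approved (assumed rule)."""
    distinct = {a.principal_id for a in approvals}
    return len(distinct) >= required

approvals = [
    Approval("alice", "User"),
    Approval("bob", "User"),
    Approval("ai-agent", "ServicePrincipal"),
]

print(quorum_met(approvals, 3))      # True: two humans and one agent
print(quorum_met(approvals[:2], 3))  # False: still waiting on a third approver
```

Note that `principal_kind` never enters the count, which mirrors the idea that the approver's nature is irrelevant to the gate.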

Full audit trail

Every action an agent takes — triggering a plan, approving a deployment, reading state — is logged and attributed to its Service Principal. When the Service Principal is attached to an Agent resource, Snap CD stamps an agent_id claim on the token, so the audit log distinguishes between "service principal X acting as agent Y" and "service principal X acting as a plain service account." You can answer "what did the agent do last Tuesday?" the same way you'd answer it for any user: check the audit log.

Integrations: pushing events to external systems

Missions and permissions govern what an agent can do. Integrations govern what your team sees. An Integration connects Snap CD to an external system — Slack is the first supported sink — and delivers notifications for deployment lifecycle events and Mission milestones.

Like Agents and Runners, Integrations use a supply model: you supply the Integration to the scopes it should serve, then subscribe specific Integration Events (triggers) at those scopes. A notification is delivered only when both the supply and the subscription exist.

data "snapcd_integration" "alerts" {
  name = "alerts"
}

# The integration serves every module in the production stack
resource "snapcd_integration_stack_supply" "prod" {
  integration_id = data.snapcd_integration.alerts.id
  stack_id       = snapcd_stack.production.id
}

# Notify Slack when any job in the production stack fails
resource "snapcd_stack_integration_event" "failed" {
  stack_id       = snapcd_stack.production.id
  integration_id = data.snapcd_integration.alerts.id
  trigger        = "JobFailed"
}

# Notify Slack when a Mission reports a milestone (diagnosis, fix attempt, etc.)
resource "snapcd_stack_integration_event" "milestone" {
  stack_id       = snapcd_stack.production.id
  integration_id = data.snapcd_integration.alerts.id
  trigger        = "MissionMilestoneReported"
}

Available triggers include JobSucceeded, JobFailed, JobAwaitingApproval, JobCancelled, and MissionMilestoneReported. You can scope subscriptions at the Organization, Stack, Namespace, or Module level and use optional message templates with tokens like {{moduleName}}, {{jobUrl}}, {{missionType}}, and {{message}}.
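As a rough illustration of how such a template might expand, the Python sketch below substitutes `{{token}}` placeholders from a dictionary. The `render` function is hypothetical, and leaving unknown tokens untouched is an assumption; Snap CD's own rendering rules may differ.

```python
import re

def render(template: str, values: dict[str, str]) -> str:
    """Replace {{token}} placeholders; unknown tokens are left as-is (an assumption)."""
    return re.sub(
        r"\{\{(\w+)\}\}",
        lambda m: values.get(m.group(1), m.group(0)),
        template,
    )

message = render(
    "{{missionType}} on {{moduleName}}: {{message}} ({{jobUrl}})",
    {
        "missionType": "AutoFix",
        "moduleName": "vpc",
        "message": "investigating",
        "jobUrl": "https://mydomain.com/jobs/abc-123",
    },
)
print(message)
# AutoFix on vpc: investigating (https://mydomain.com/jobs/abc-123)
```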

For Mission milestones, Snap CD threads all updates from a single Mission run under one Slack message. This means a multi-step AutoFix run — diagnosis, attempted fix, retry — shows up as a single threaded conversation rather than a spray of unrelated messages.

End-to-end: AutoFix in action

Here's what happens when a deployment fails and AutoFix is configured. The setup:

A test Stack with an AutoFix Mission configured for the Agent
The Agent's Service Principal has Contributor on the test Stack
A Slack Integration is supplied to the Stack with JobFailed, JobSucceeded, and MissionMilestoneReported triggers

Someone pushes a commit that introduces a typo in a Terraform variable name. Snap CD detects the source change and triggers a plan-and-apply Job on the affected Module. The apply fails.

1. Slack: Job failure notification

The JobFailed Integration Event fires. Slack receives a message:

❌ Apply failed on vpc (test/networking)
https://mydomain.com/jobs/abc-123

2. AutoFix Mission dispatched

The Server dispatches an AutoFix Mission to the Agent. The Agent routes it to its Sidecar (e.g. the claude-sidecar), which works through a structured sequence: read the job logs via Snap CD's MCP server, diagnose the root cause, clone the source repo, make the minimal fix, and open a pull request. The Sidecar never pushes to the default branch directly — it always creates a fix branch and opens a PR. As it works through each step, it emits milestone events that stream back through the Agent to the Server.

Slack: AutoFix milestones (threaded under the failure message)

🔧 AutoFix — Job on vpc failed — investigating.

🔧 AutoFix — Root cause: variable vnet_cidr_block referenced in main.tf:42 does not exist. The variable was renamed to vnet_address_space in the latest commit but the reference was not updated. Fixing.

🔧 AutoFix — Opened PR: https://github.com/example/vpc-module/pull/47

3. Human merges the PR

A human reviews the PR, sees the one-line fix, and merges it. Snap CD detects the source change on the tracked branch and automatically triggers a new plan-and-apply Job.

4. Retry succeeds

The new Job runs. The apply succeeds.

Slack: success notification

✅ Apply succeeded on vpc (test/networking)
https://mydomain.com/jobs/def-456

The entire sequence — failure, diagnosis, fix PR, human merge, retry, success — plays out across one Slack thread and a GitHub PR. A human glancing at the channel sees the original failure, the agent's reasoning, and where to review the fix. Every action is attributed to the Agent's Service Principal in the audit log.

If the failure had been transient (a provider rate limit or timeout), the AutoFix Mission would have simply re-triggered the Job — no code change, no PR. And if the root cause wasn't something the agent could safely fix in-repo (expired credentials, state drift, a defect in a referenced module), it would degrade to a diagnosis with a recommended manual action.

Now contrast this with the same Agent on the prod Stack, where it is configured only with the AutoDiagnose Mission (not AutoFix). The same failure would produce a diagnosis but stop there: no fix attempt, no PR, no retry. The agent reports what went wrong and a human takes it from there.

Bring your own AI

The Missions framework is the canonical way to let AI participate in your deployment lifecycle. But Snap CD doesn't force you into it. If you prefer to use your own AI agent — or a different orchestration framework — you can have it interact with Snap CD's REST API directly using a plain Service Principal with Role Assignments. The same RBAC governs what the API caller can do, regardless of whether it's a human, a CI bot, or an LLM.

There are two authentication approaches:

1. Plain Service Principal — register a Service Principal, assign it the appropriate roles, and have your agent authenticate with the Client ID / Client Secret pair via the standard OAuth token endpoint:

POST /connect/token
Content-Type: application/x-www-form-urlencoded

grant_type=client_credentials&client_id=<org_id>:<client_id>&client_secret=<client_secret>

The returned bearer token is then attached to subsequent API requests as an Authorization: Bearer <token> header, letting your agent call any endpoint its roles permit — trigger plans, read job logs, post approvals, etc.

2. Service Principal with Agent identity — if you want Snap CD to recognize that API calls are coming from a specific Agent (for audit trail purposes), attach the Service Principal to an Agent resource and pass the agent_id parameter when requesting the token:

POST /connect/token
Content-Type: application/x-www-form-urlencoded

grant_type=client_credentials&client_id=<org_id>:<client_id>&client_secret=<client_secret>&agent_id=<agent_guid>

The returned token carries an agent_id claim. Snap CD uses this to attribute API calls to the named Agent in the audit log, making it clear which actions were taken by AI versus humans or other service accounts.
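For a custom agent written in Python, both token requests could be built along these lines using only the standard library. This is a sketch under assumptions: the `build_token_request` helper and the base URL are hypothetical, and only the `/connect/token` path and the form fields come from the requests shown above.

```python
from typing import Optional
from urllib.parse import urlencode
from urllib.request import Request

def build_token_request(
    base_url: str,
    org_id: str,
    client_id: str,
    client_secret: str,
    agent_id: Optional[str] = None,
) -> Request:
    """Build (but do not send) a client-credentials token request."""
    form = {
        "grant_type": "client_credentials",
        "client_id": f"{org_id}:{client_id}",
        "client_secret": client_secret,
    }
    if agent_id is not None:
        # Only included when the token should carry the agent_id claim.
        form["agent_id"] = agent_id
    return Request(
        f"{base_url}/connect/token",
        # urlencode percent-encodes the ":" in client_id; the server decodes it back.
        data=urlencode(form).encode("ascii"),
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )

# Hypothetical host and credentials, for illustration only.
req = build_token_request("https://snapcd.example.com", "org1", "cid", "secret", agent_id="a-1")
print(req.get_method(), req.full_url)
print(req.data.decode("ascii"))
```

Sending the request with `urllib.request.urlopen` and reading the token from the response would follow the usual OAuth 2.0 client-credentials flow, assuming the response has the standard shape.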

A reasonable pattern: give the agent broad permissions in test, narrow permissions in prod.

test/          → Contributor (full deploy access)
staging/       → JobManager (can manage jobs, scoped to specific namespaces)
prod/          → Reader (observe only)

In test, the agent can deploy freely — run plans, approve them, apply them. It iterates fast, catches issues early, and doesn't need human intervention for routine changes. In staging, it participates in the approval process but can't act unilaterally on sensitive namespaces. In prod, it can diagnose and report but never modify.

This isn't a rigid hierarchy. You can adjust per Namespace or per Module. Maybe the agent gets Contributor on prod/monitoring because deploying a new dashboard is low-risk, while prod/database stays human-only. The permission system is granular enough to express whatever trust model you need.
