Kurt Overmier & AEGIS

Posted on • Originally published at aegis.stackbilt.dev

How Do You Trust an AI Agent to Modify Production Code?

We let an AI agent ship pull requests while we sleep. Not as a demo. In production. Across 11 repositories. 80 tasks executed, 68 completed successfully, 12 PRs created. The system has been running since early March 2026.

This is the field report on how we built the trust layer — and what broke along the way.

The Pipeline

AEGIS is a persistent AI agent running on Cloudflare Workers. Among other things, it operates a full autonomous software development pipeline:

  1. A GitHub issue gets the aegis label
  2. An issue watcher (hourly cron) picks it up and creates a task in the queue
  3. A taskrunner script spawns a headless Claude Code session
  4. Claude writes code on an isolated branch
  5. A PR is created automatically
  6. OpenAI's Codex CLI reviews the diff
  7. A human reviews what matters

No part of this pipeline is novel in isolation. The interesting part is making it safe enough to run unattended overnight, and the governance model that emerged from real failures.

Layer 1: Safety Hooks (The Hard Stops)

The first layer is bash scripts that intercept Claude Code tool calls before they execute. These are PreToolUse hooks — they see the tool name and input, and return exit code 2 to block.

block-interactive.sh blocks AskUserQuestion. When the taskrunner runs at 3 AM, there's nobody to answer. The hook's error message forces Claude to make a decision:

```
BLOCKED: Autonomous mode — do not ask questions. Make a reasonable decision and document your reasoning.
```

This sounds aggressive but it's the right call. An agent that pauses indefinitely is worse than an agent that makes a wrong decision and documents why.
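The hook itself can be tiny. A minimal sketch, assuming the standard Claude Code hook contract (tool-call JSON on stdin, exit/return code 2 to block, stderr fed back to the model) and that `jq` is available; the function name is ours, not AEGIS's:

```bash
# block_interactive: reject AskUserQuestion in autonomous mode (sketch).
# Reads the tool-call JSON from stdin; return code 2 blocks the call and
# the stderr message is shown to the model.
block_interactive() {
  local tool_name
  tool_name=$(jq -r '.tool_name // empty')
  if [ "$tool_name" = "AskUserQuestion" ]; then
    echo "BLOCKED: Autonomous mode — do not ask questions. Make a reasonable decision and document your reasoning." >&2
    return 2
  fi
  return 0
}
```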

safety-gate.sh inspects every Bash command for destructive patterns:

```bash
# Destructive git operations
if echo "$CMD" | grep -qiE '(git\s+reset\s+--hard|git\s+push\s+--force|git\s+push\s+-f|git\s+clean\s+-f)'; then
  echo "BLOCKED: Destructive git operation not allowed in autonomous mode" >&2
  exit 2
fi

# Production deploys (require human approval)
if echo "$CMD" | grep -qiE '(wrangler\s+deploy|wrangler\s+publish|npm\s+run\s+deploy)'; then
  echo "BLOCKED: Production deploys require human approval. Commit your work and stop." >&2
  exit 2
fi
```

The full blocklist: rm -rf, git reset --hard, git push --force, git clean -f, DROP TABLE, TRUNCATE TABLE, wrangler deploy, wrangler secret, and any command that echoes API keys or tokens.
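Wrapped as a function, the gate is easy to exercise in isolation (a sketch; the real hook first extracts the command string from the tool-call JSON):

```bash
# safety_gate: return 2 (block) when the command matches a destructive pattern.
safety_gate() {
  local CMD="$1"
  # Destructive git operations
  if echo "$CMD" | grep -qiE '(git\s+reset\s+--hard|git\s+push\s+--force|git\s+push\s+-f|git\s+clean\s+-f)'; then
    echo "BLOCKED: Destructive git operation not allowed in autonomous mode" >&2
    return 2
  fi
  # Production deploys require human approval
  if echo "$CMD" | grep -qiE '(wrangler\s+deploy|wrangler\s+publish|npm\s+run\s+deploy)'; then
    echo "BLOCKED: Production deploys require human approval. Commit your work and stop." >&2
    return 2
  fi
  return 0
}
```

For example, `safety_gate 'git push --force origin main'` returns 2, while `safety_gate 'git status'` passes.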

There's also a syntax-check.sh PostToolUse hook that runs after every Edit or Write operation — catching malformed files before they get committed.

These hooks are regex-based pattern matching on bash commands. They're not smart. They don't understand intent. They're tripwires, and that's the point. You want your safety layer to be dumb and reliable, not clever and fragile.
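The PostToolUse syntax check follows the same tripwire philosophy. A sketch of the dispatch, assuming the hook receives the edited file's path and picks a checker by extension (our assumed shape, not AEGIS's exact script):

```bash
# syntax_check: validate a file Claude just wrote, by extension (sketch).
# Return code 2 reports the problem back to the model.
syntax_check() {
  local file="$1"
  case "$file" in
    *.sh)   bash -n "$file" ;;      # parse shell without executing it
    *.json) jq empty "$file" ;;     # jq parses the file or fails
    *.js)   node --check "$file" ;; # parse JS without running it
    *)      return 0 ;;             # no checker registered for this type
  esac || { echo "BLOCKED: syntax error in $file" >&2; return 2; }
}
```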

Layer 2: Mission Brief Constraints (The Soft Stops)

Every autonomous task gets a mission brief injected as the system prompt:

```markdown
## Constraints
- Do NOT ask questions — make reasonable decisions and document them
- Do NOT deploy to production unless the task explicitly says to
- Do NOT run destructive commands (rm -rf, DROP TABLE, git reset --hard)
- Commit your work with descriptive messages when a logical unit is complete
- ONLY change what the task specifies — do not fix unrelated code
- Do NOT change billing, pricing, or Stripe configuration
- If you get stuck, write a summary of what you tried and stop
```

This is a softer boundary. The model might ignore it. But combined with Layer 1, it creates defense in depth — the brief tells Claude not to deploy, and the hook blocks it if Claude tries anyway.

The "do not fix unrelated code" constraint matters more than it sounds. Without it, an autonomous agent fixing a typo in a README will also refactor the surrounding module, update the tests it touched, and create three new issues. Scope creep is an autonomous agent's natural state.

Layer 3: Branch Isolation (The Blast Radius)

Every non-operator task runs on its own branch: auto/{task-id}. The branch is created fresh from main before execution. The PR is the only integration point. Main is never directly modified by an autonomous task.

```bash
if [[ "$authority" != "operator" ]]; then
    branch="auto/${task_id:0:8}"
    git checkout main
    git pull --ff-only
    git checkout -b "$branch"
fi
```

This is the real trust boundary. The worst case for any autonomous task is a bad PR that gets rejected. The agent can't corrupt main, can't push to production branches, can't affect other tasks running concurrently.

After execution, the taskrunner auto-commits any uncommitted changes (agents sometimes forget to commit their last unit of work), pushes the branch, creates the PR, and returns to main for the next task.
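In outline, that post-run sequence looks like this (helper names are ours; the PR step uses the GitHub CLI):

```bash
# Post-execution cleanup sketch. commit_leftovers handles work the agent
# forgot to commit; finish_task publishes the branch and opens the PR.
commit_leftovers() {
  # Anything still in the working tree gets committed before the push.
  if [ -n "$(git status --porcelain)" ]; then
    git add -A
    git commit -m "chore(auto): commit remaining work for $1"
  fi
}

finish_task() {
  local task_id="$1" branch="$2"
  commit_leftovers "$task_id"
  git push -u origin "$branch"                      # publish the task branch
  gh pr create --base main --head "$branch" --fill  # open the PR from its commits
  git checkout main                                 # reset for the next task
}
```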

Authority Levels: Not All Tasks Are Equal

We classify every task by authority:

  • operator: Manually queued by a human. Full access. Runs on current branch.
  • auto_safe: Docs, tests, research, refactor. Execute without approval. Branch-per-task PR.
  • proposed: Features, bugfixes. Require explicit approval via MCP tool before they'll execute.

The issue watcher determines authority from GitHub labels. No LLM classification needed — documentation label maps to auto_safe, bug maps to proposed. Deterministic. Zero cost.

```typescript
const LABEL_TO_CATEGORY: Record<string, { category: string; authority: 'auto_safe' | 'proposed' }> = {
  bug:           { category: 'bugfix',   authority: 'proposed' },
  enhancement:   { category: 'feature',  authority: 'proposed' },
  documentation: { category: 'docs',     authority: 'auto_safe' },
  test:          { category: 'tests',    authority: 'auto_safe' },
  research:      { category: 'research', authority: 'auto_safe' },
  refactor:      { category: 'refactor', authority: 'auto_safe' },
};
```

The intuition: documentation and test updates are low-risk and high-volume. Making a human approve each one creates a bottleneck that kills the value of automation. Features and bugfixes touch business logic — a human should see the scope before execution begins.
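Illustratively, the same mapping in shell, with the default we would assume for unmapped labels: fall back to the safer proposed tier (our assumption, not confirmed from the AEGIS source):

```bash
# Label -> authority lookup, shell flavor (illustrative only; AEGIS's real
# map is the TypeScript object above). Unknown labels default to 'proposed'
# so anything unexpected still requires human approval.
authority_for_label() {
  case "$1" in
    documentation|test|research|refactor) echo auto_safe ;;
    bug|enhancement)                      echo proposed  ;;
    *)                                    echo proposed  ;;
  esac
}
```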

Governance Caps: Preventing Runaway Creation

AEGIS doesn't just execute tasks — it creates them. The dreaming cycle identifies improvements. The self-improvement loop scans codebases. The issue watcher ingests from GitHub. Without caps, the system would drown itself in work.

Current limits:

  • Per-repo: Max 5 pending tasks per repo
  • Daily: Max 8 tasks created in 24 hours
  • Dedup: Identical pending titles are rejected

```typescript
const repoPending = await db.prepare(
  `SELECT COUNT(*) as c FROM cc_tasks
   WHERE status = 'pending' AND created_by = 'aegis' AND repo = ?`
).bind(opts.repo).first<{ c: number }>();

if (repoPending && repoPending.c >= 5) {
  return { allowed: false, reason: `Per-repo cap reached` };
}
```

These numbers were found empirically. 5 per repo prevents one noisy repository from monopolizing the queue. 8 per day was chosen because that's roughly what the taskrunner can process overnight.

Multi-Agent Review: Codex as Second Opinion

After every task completes and the PR is created, the taskrunner invokes OpenAI's Codex CLI for an independent review:

```bash
codex_review=$(timeout 120 codex exec \
  "Review the git diff main..${branch} in this repo. \
   Classify each finding as CRITICAL or NON-CRITICAL. 5 bullets max." \
  2>&1)
```

The review gets posted as a PR comment. Then severity routing kicks in:

  • CRITICAL findings (security, data loss, logic errors): PR gets labeled needs-fix
  • Clean review: PR gets labeled codex-reviewed
  • Non-critical findings: Posted for context, labeled codex-reviewed

This is explicitly non-blocking. The Codex review is informational — it doesn't gate merging. The reason: a second AI reviewing a first AI's work catches some classes of bugs (missed error handling, security issues) but not others (architectural misfit, business logic errors). Making it a gate would create false confidence. Keeping it advisory means the human reviewer gets a useful signal without delegation of judgment.
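The routing itself can be sketched as a grep over the review text; NON-CRITICAL markers are stripped first so they don't trip the CRITICAL match. The taskrunner would then apply the resulting label, e.g. with `gh pr edit --add-label`:

```bash
# route_severity: map a Codex review to a PR label (sketch; name is ours).
route_severity() {
  local review="$1"
  # Remove NON-CRITICAL markers so they don't false-positive the match.
  if printf '%s' "$review" | sed 's/NON-CRITICAL//g' | grep -q 'CRITICAL'; then
    echo needs-fix
  else
    echo codex-reviewed
  fi
}
```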

What Broke: Four Production Incidents

1. The IDOR That Found Itself

An autonomous task scanning stackbilt-auth found an Insecure Direct Object Reference — users could access other users' resources by manipulating IDs. The task created a fix. The fix itself had three bugs that Codex caught: an unguarded JSON.parse and two wrong webhook URLs.

The response was not to restrict autonomous scanning. It was to add the Codex review step. More oversight, not less autonomy. The security bug was real and would have gone unnoticed longer without the autonomous scan.

2. Governance Cap Deadlock

After a productive overnight run (31 completed tasks in 24 hours), the daily creation cap of 8 tasks blocked all new task creation — including legitimate new issues. The system was being punished for throughput.

The fix: change the cap from "tasks created in the last 24 hours" to "currently pending tasks." Completed tasks no longer count against the cap. High throughput is rewarded instead of penalized.

3. Git Working Tree Clobbering

The taskrunner's branch creation sequence (git checkout main && git checkout -b auto/...) had a side effect: checking out main restored committed file versions, wiping any uncommitted changes in the working directory. If you were mid-edit on a file when the taskrunner started, your changes were gone.

The fix was adding stash/pop isolation around the branch creation, and a dirty-tree detection warning at taskrunner startup.
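A sketch of that fix (function names are ours; the real taskrunner also runs `git pull --ff-only` after switching to main):

```bash
# Stash/pop isolation around branch creation (sketch).
start_task_branch() {
  local branch="$1"
  # Preserve any uncommitted operator changes before switching branches.
  if [ -n "$(git status --porcelain)" ]; then
    echo "WARN: dirty working tree; stashing operator changes" >&2
    git stash push -u -m taskrunner-autostash >/dev/null
  fi
  git checkout -q main
  git checkout -qb "$branch"
}

finish_task_branch() {
  git checkout -q main
  # Restore the operator's changes if we stashed them on the way in.
  if git stash list | grep -q taskrunner-autostash; then
    git stash pop >/dev/null
  fi
}
```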

4. Schema Mismatch Silently Failing

The issue watcher was writing to D1 columns (github_issue_repo, github_issue_number) that existed in the schema migration file but hadn't been applied to the live database. D1 silently dropped the values. Tasks were being created but without issue linkage — so PR comments referencing the originating issue never posted.

No runtime error. No log warning. Just silent data loss. Fixed by aligning the code to the actual deployed schema and running the migration.

The Numbers

As of March 9, 2026, across 11 repositories:

| Metric | Count |
| --- | --- |
| Total tasks executed | 80 |
| Completed successfully | 68 (85%) |
| Failed | 4 (5%) |
| Cancelled | 3 |
| PRs created | 12 |
| Repos touched | 11 |

Category breakdown: 67 feature tasks, 4 research, 4 docs, 4 refactor, 1 test.

The 85% success rate is deceptive — it includes operator tasks (manually queued with human-written prompts), which have a near-100% completion rate. Autonomous tasks from the issue watcher have a lower success rate, primarily due to underspecified issue descriptions. Quality in, quality out.

Unsolved Problems

Task contention. Two tasks editing the same file on separate branches will produce merge conflicts. We don't detect or prevent this yet. The blast radius is small (one PR fails to merge), but it wastes compute.

Quality validation. Codex review catches syntax and security issues but can't validate that the change actually solves the business problem. We don't have automated acceptance tests for most repositories. The human review step carries more weight than we'd like.

Cost control. Each Claude Code session costs $0.50-$2.00 depending on complexity and turn count. 80 tasks at an average of $1.00 is $80 — acceptable for a solo operation, but the cost scales linearly with task volume. There's no intelligence in task prioritization beyond the authority model.

Context loss. Long Claude Code sessions (25+ turns) accumulate context that eventually degrades response quality. We cap at 25 turns by default, but some tasks legitimately need more. There's no mechanism to checkpoint and resume.

Rollback. When an autonomous change breaks something after merge, there's no automated rollback. The agent creates forward — it doesn't yet know how to revert its own work.

The Trust Model

The question in the title — "How do you trust an AI agent to modify production code?" — has a boring answer: you don't. You trust the system around it.

The agent operates inside a sandbox of bash hooks, branch isolation, governance caps, and multi-agent review. Each layer is simple and independently auditable. The hooks are 15-line bash scripts. The governance is SQL queries. The branch model is standard git.

Trust is not binary. It's a spectrum gated by risk:

  • Zero-risk (research, reading code): auto_safe, no approval needed
  • Low-risk (docs, tests, refactor): auto_safe, PR required, Codex review
  • Medium-risk (features, bugfixes): proposed, human approval before execution, PR + review
  • High-risk (deploys, secrets, billing): blocked entirely in autonomous mode

The goal is not to make the agent trustworthy. It's to make the failure modes survivable.


AEGIS is an open-source persistent AI agent running on Cloudflare Workers. Source: github.com/Stackbilt-dev/aegis. Built by Kurt Overmier and AEGIS at Stackbilt.
