
Ken Imoto

Posted on • Originally published at zenn.dev

I turned on auto-approve in Claude Code and broke CI in 30 minutes

The day my agent started grading its own homework

I turned on --dangerously-skip-permissions in Claude Code. Code was flying. Files created themselves. Tests ran automatically. This is the future, I thought.

Thirty minutes later, CI was on fire.

The agent had reported "all tests passing." And technically, it was right -- locally, every test was green. The problem? The agent wrote the tests. The agent wrote the bugs. And the agent's tests said the bugs were correct.

It was grading its own homework and giving itself an A+.

This is the core tension with AI agent autonomy: the more you hand over, the faster things move -- but the fewer human eyes are on the work, the fewer chances anyone has to catch mistakes. I wanted to understand why this keeps happening, so I started digging.

What Anthropic found in millions of sessions

Anthropic published a large-scale study in 2026, analyzing millions of Claude Code and API usage logs. The question was simple: how do humans actually delegate autonomy to AI?

The answer split cleanly by experience level.

|                   | Beginners                     | Experts (750+ sessions) |
|-------------------|-------------------------------|-------------------------|
| Approval style    | Approve every action manually | 40%+ auto-approve rate  |
| Interruption rate | Low (not much to interrupt)   | 9% (up from 5%)         |
| Style             | Pre-approval                  | Active monitoring       |

The interesting number is that 9% interruption rate. Experts don't just flip a switch and walk away. They let the agent run, but they watch. When the direction starts drifting, they hit the brakes. Not every time -- just 9% of the time.

Even more telling: the average session length for auto-approve users grew from under 25 minutes to over 45 minutes across roughly three months. This wasn't caused by model upgrades -- the models stayed the same. It was human trust catching up to model capability.

Anthropic calls this "deployment overhang": the model is already good enough, but humans haven't learned to trust it yet. The fix isn't more autonomy or less autonomy. It's trust, built gradually.

That explained the pattern. But I still needed to understand what exactly goes wrong when autonomy is too loose.

3 ways agents silently drift

Yoshinori Fukushima, CEO of LayerX, published an analysis of agent failure modes that put concrete names on what I'd been experiencing. He calls them "drifts" -- and once you see them, you can't unsee them.

Drift 1: premature exit

The agent declares "done" when it's clearly not done. It hit some internal threshold of "good enough" and stopped.

"Did you finish your homework?" "Yes." "Show me." "..."

Every parent knows this pattern. Turns out AI agents do the same thing.

In my own setup, I added a TDD skill that forces the agent to write failing tests before writing implementation code. Processing time jumped from 10 minutes to 40 minutes per task. But the agent stopped declaring victory halfway through, because the external test suite -- not the agent's own judgment -- defined what "done" meant.
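The skill itself is just instructions. A minimal CLAUDE.md version of the same rule (the wording below is my own sketch of that constraint, not Anthropic's skill format) might look like:

```markdown
## TDD rules
- Before writing any implementation code, write a test that fails for the right reason
- Show the failing test output before proceeding to the implementation
- A task is "done" only when the full external test suite passes, not when you judge it complete
```

The point is the last line: "done" is defined by something the agent cannot edit its way around.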

Drift 2: quality overconfidence

"The code is perfect," the agent reports. Meanwhile, the layout is broken, the edge cases are unhandled, and the error messages make no sense. The agent checked its own output against its own criteria and found nothing wrong.

This was exactly my CI disaster. The agent wrote code with bugs, then wrote tests that confirmed those bugs as expected behavior. Circular validation.

The fix I landed on: after every agent run, I review the test cases themselves, not just the test results. "Your tests pass, but are you testing the right things?" Surprisingly often, the answer is no. And when you catch a bad test, you usually find a bug in the implementation too.

Drift 3: cumulative deviation

Each individual step looks fine. But small judgment calls compound. After 10 steps, the project is heading somewhere nobody intended.

Imagine walking "roughly north" without a compass. After 10 kilometers, you might be heading east.

I hit this hard on a project where I let the agent work through 20+ story tickets without checking in. Each ticket was completed correctly in isolation. But when I looked at the whole:

  • Integration tests between features were missing
  • Security settings had drifted tighter than the spec required
  • A feature from three sessions ago had silently disappeared

Individual tickets: done. System as a whole: broken. The agent is loyal to the task in front of it, not to the 20-task roadmap.

```mermaid
graph LR
    A[Task 1 ✅] --> B[Task 2 ✅]
    B --> C[Task 3 ✅]
    C --> D[Task 4 ✅]
    D --> E[Task 5 ✅]
    E --> F[System ❌]
    style F fill:#e74c3c,color:#fff
```

The common thread

All three drifts share a root cause: the agent evaluates itself by its own standards. The fix is always the same -- move the evaluation outside the agent.

4 guardrail patterns that actually work

Knowing the drifts, I rebuilt my setup around four patterns. None of them are complicated. All of them work.

Pattern 1: preflight checks (before execution)

Check preconditions before the agent acts. Like a pilot's checklist -- no matter how experienced, you don't skip it.

```markdown
# CLAUDE.md example
## Pre-execution rules
- Before modifying package.json, read the current dependency list
- Before running a database migration, dump the current schema
- Before any production change, verify it passed staging first
```
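Preflight checks can also be mechanical, not just instructions in CLAUDE.md. A tiny sketch (the function name and the `.preflight` suffix are my own convention, not a Claude Code feature): snapshot a file before the agent touches it, so a later postflight step can diff the agent's changes against reality instead of against the agent's memory.

```shell
#!/bin/bash
# preflight_snapshot FILE: copy a file aside before the agent edits it,
# so a postflight step can diff what actually changed.
preflight_snapshot() {
  cp "$1" "$1.preflight" && echo "snapshot saved: $1.preflight"
}

# Example: snapshot package.json before a dependency change
# preflight_snapshot package.json
```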

Pattern 2: postflight checks (after execution)

Don't trust the agent's self-report. Validate with external tools.

```bash
#!/bin/bash
# .claude/hooks/post-commit.sh
set -euo pipefail  # fail fast: any failing check must fail the whole hook

npx eslint --max-warnings 0 .
npx tsc --noEmit
npm test
npx playwright test --project=visual
```

The agent says "looks good." The linter says "no it doesn't." The linter wins.
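For the hook to run at all, it has to be registered. In Claude Code that happens in `.claude/settings.json`; the event name and matcher below are my assumption for a post-edit check, so verify them against the current hooks documentation for your version:

```json
{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Edit|Write",
        "hooks": [
          { "type": "command", "command": "bash .claude/hooks/post-commit.sh" }
        ]
      }
    ]
  }
}
```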

Pattern 3: escalation rules (return to human)

Don't aim for 100% autonomy. Define the line where the agent must stop and ask.

```markdown
# CLAUDE.md example
## Escalation conditions
- Security-related changes (auth, encryption, permissions) -> human review
- 3 consecutive test failures -> stop and report
- External API credentials -> wait for human approval
- Low-confidence decisions -> present options, let human choose
```

This is the 9% from Anthropic's data, codified. A hundred runs, nine manual interventions. Those nine are the ones that matter most.
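Some of these conditions can be enforced mechanically rather than trusted to the model's judgment. A sketch (the path patterns and function name are illustrative, not a standard): a guard that scans changed file paths and refuses to auto-approve anything security-shaped.

```shell
#!/bin/bash
# needs_escalation FILE... : print "escalate: <file>" and fail on the first
# security-shaped path; otherwise print "auto-approve" and succeed.
needs_escalation() {
  for f in "$@"; do
    case "$f" in
      *auth*|*secret*|*permission*|*.env)
        echo "escalate: $f"
        return 1
        ;;
    esac
  done
  echo "auto-approve"
}

# Typical wiring in a hook:
#   needs_escalation $(git diff --name-only) || notify_human
```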

Pattern 4: feedback loops (learn from failure)

When the agent makes a mistake, feed that mistake back into the harness so it doesn't happen again.

```text
Agent introduces a bug
  -> CI catches it
  -> Add the pattern to CLAUDE.md
  -> Agent avoids it next time
```

The harness itself gets smarter over time. Every failure becomes a guardrail.

Putting it together: 3-phase CLAUDE.md

Anthropic's data showed that experts build trust gradually. Here's what that looks like in practice, as three phases of CLAUDE.md configuration.

```mermaid
graph LR
    A["Phase 1\nFull approval"] -->|"Trust builds"| B["Phase 2\nConditional auto-approve"]
    B -->|"Confidence grows"| C["Phase 3\nActive monitoring (9%)"]
    style A fill:#3498db,color:#fff
    style B fill:#2ecc71,color:#fff
    style C fill:#e67e22,color:#fff
```

Phase 1: full approval (getting started)

```markdown
## Execution rules
- Ask for approval before any file change
- Present your test plan before running tests
- Confirm before executing external commands
```

Start here. Learn how your agent thinks.

Phase 2: conditional auto-approve (building trust)

```markdown
## Execution rules
- Test files (*_test.*, *.spec.*): auto-approve
- Lint fixes (import order, formatting): auto-approve
- But always ask before:
  - Creating new files
  - Changing package.json / requirements.txt
  - Accessing .env files
```

Safe territory gets automated first. This is where most experts start their auto-approve journey.

Phase 3: active monitoring (the 9% design)

```markdown
## Execution rules
- Auto-execute by default
- But always stop and report when:
  - Tests fail 3 times in a row
  - Changed lines exceed 100
  - Security-related files are touched
  - Before git push or deploy commands
```

This is the 9% interruption rate, designed into the system. The agent runs free 91% of the time. The other 9% is where you, the human, earn your keep.
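The "changed lines exceed 100" rule is easy to check mechanically from `git diff --numstat` output. A sketch (the function name and threshold wiring are mine):

```shell
#!/bin/bash
# changed_lines: read "added<TAB>deleted<TAB>file" lines (the format of
# `git diff --numstat`) on stdin and print the total changed-line count.
changed_lines() {
  awk '{ total += $1 + $2 } END { print total + 0 }'
}

# Typical wiring in a hook:
#   total=$(git diff --numstat | changed_lines)
#   [ "$total" -gt 100 ] && echo "STOP: $total lines changed, human review needed"
```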

Since adopting this phased approach, I've gone from weekly CI fires to maybe one every couple of months. Not because the agent got smarter -- because the harness got smarter.

Autonomy isn't a slider -- it's a design

The autonomy-quality tradeoff isn't a dial you turn up or down. It's an architecture you build.

| Common mistake                                   | Effective design                           |
|--------------------------------------------------|--------------------------------------------|
| "Let it do everything" or "approve everything"   | Automate safe zones, phase by phase        |
| Trust the agent's self-report                    | Validate with external tools               |
| Reduce autonomy after failure                    | Feed failure patterns back into the harness |
| One-size-fits-all rules                          | Phase-specific autonomy levels             |

Anthropic's data tells us experts don't hand over trust all at once. They grow it -- from 25-minute sessions to 45-minute sessions, over three months.

Agent quality problems aren't agent problems. They're harness problems.

What phase are you at?


Want to go deeper on harness design? The hooks, lifecycle management, and feedback loop patterns covered in this article are part of a larger framework I wrote about in Harness Engineering: From Using AI to Controlling AI.
