
Ken Imoto

Posted on • Originally published at zenn.dev

I turned on auto-approve in Claude Code and broke CI in 30 minutes

The day my agent started grading its own homework

I turned on --dangerously-skip-permissions in Claude Code. Code was flying. Files created themselves. Tests ran automatically. This is the future, I thought.

Thirty minutes later, CI was on fire.

The agent had reported "all tests passing." And technically, it was right -- locally, every test was green. The problem? The agent wrote the tests. The agent wrote the bugs. And the agent's tests said the bugs were correct.

It was grading its own homework and giving itself an A+.

This is the core tension with AI agent autonomy: the more you hand over, the faster things move -- but the fewer human eyes are on the work, the fewer chances anyone has to catch mistakes. I wanted to understand why this keeps happening, so I started digging.

What Anthropic found in millions of sessions

Anthropic published a large-scale study in 2026, analyzing millions of Claude Code and API usage logs. The question was simple: how do humans actually delegate autonomy to AI?

The answer split cleanly by experience level.

|                   | Beginners                     | Experts (750+ sessions) |
|-------------------|-------------------------------|-------------------------|
| Approval style    | Approve every action manually | 40%+ auto-approve rate  |
| Interruption rate | Low (not much to interrupt)   | 9% (up from 5%)         |
| Style             | Pre-approval                  | Active monitoring       |

The interesting number is that 9% interruption rate. Experts don't just flip a switch and walk away. They let the agent run, but they watch. When the direction starts drifting, they hit the brakes. Not every time -- just 9% of the time.

Even more telling: the average session length for auto-approve users grew from under 25 minutes to over 45 minutes across roughly three months. This wasn't caused by model upgrades -- the models stayed the same. It was human trust catching up to model capability.

Anthropic calls this "deployment overhang": the model is already good enough, but humans haven't learned to trust it yet. The fix isn't more autonomy or less autonomy. It's trust, built gradually.

That explained the pattern. But I still needed to understand what exactly goes wrong when autonomy is too loose.

3 ways agents silently drift

Yoshinori Fukushima, CEO of LayerX, published an analysis of agent failure modes that put concrete names on what I'd been experiencing. He calls them "drifts" -- and once you see them, you can't unsee them.

Drift 1: premature exit

The agent declares "done" when it's clearly not done. It hit some internal threshold of "good enough" and stopped.

"Did you finish your homework?" "Yes." "Show me." "..."

Every parent knows this pattern. Turns out AI agents do the same thing.

In my own setup, I added a TDD skill that forces the agent to write failing tests before writing implementation code. Processing time jumped from 10 minutes to 40 minutes per task. But the agent stopped declaring victory halfway through, because the external test suite -- not the agent's own judgment -- defined what "done" meant.
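The skill itself is just instructions. A minimal CLAUDE.md version of the same rule (the wording below is my own sketch of that constraint, not Anthropic's skill format) might look like:

```markdown
## TDD rules
- Before writing any implementation code, write a test that fails for the right reason
- Show the failing test output before proceeding to the implementation
- A task is "done" only when the full external test suite passes, not when you judge it complete
```

The point is the last line: "done" is defined by something the agent cannot edit its way around.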

Drift 2: quality overconfidence

"The code is perfect," the agent reports. Meanwhile, the layout is broken, the edge cases are unhandled, and the error messages make no sense. The agent checked its own output against its own criteria and found nothing wrong.

This was exactly my CI disaster. The agent wrote code with bugs, then wrote tests that confirmed those bugs as expected behavior. Circular validation.

The fix I landed on: after every agent run, I review the test cases themselves, not just the test results. "Your tests pass, but are you testing the right things?" Surprisingly often, the answer is no. And when you catch a bad test, you usually find a bug in the implementation too.

Drift 3: cumulative deviation

Each individual step looks fine. But small judgment calls compound. After 10 steps, the project is heading somewhere nobody intended.

Imagine walking "roughly north" without a compass. After 10 kilometers, you might be heading east.

I hit this hard on a project where I let the agent work through 20+ story tickets without checking in. Each ticket was completed correctly in isolation. But when I looked at the whole:

  • Integration tests between features were missing
  • Security settings had drifted tighter than the spec required
  • A feature from three sessions ago had silently disappeared

Individual tickets: done. System as a whole: broken. The agent is loyal to the task in front of it, not to the 20-task roadmap.

```mermaid
graph LR
    A[Task 1 ✅] --> B[Task 2 ✅]
    B --> C[Task 3 ✅]
    C --> D[Task 4 ✅]
    D --> E[Task 5 ✅]
    E --> F[System ❌]
    style F fill:#e74c3c,color:#fff
```

The common thread

All three drifts share a root cause: the agent evaluates itself by its own standards. The fix is always the same -- move the evaluation outside the agent.

4 guardrail patterns that actually work

Knowing the drifts, I rebuilt my setup around four patterns. None of them are complicated. All of them work.

Pattern 1: preflight checks (before execution)

Check preconditions before the agent acts. Like a pilot's checklist -- no matter how experienced, you don't skip it.

```markdown
# CLAUDE.md example
## Pre-execution rules
- Before modifying package.json, read the current dependency list
- Before running a database migration, dump the current schema
- Before any production change, verify it passed staging first
```
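Preflight checks can also be mechanical, not just instructions in CLAUDE.md. A tiny sketch (the function name and the `.preflight` suffix are my own convention, not a Claude Code feature): snapshot a file before the agent touches it, so a later postflight step can diff the agent's changes against reality instead of against the agent's memory.

```shell
#!/bin/bash
# preflight_snapshot FILE: copy a file aside before the agent edits it,
# so a postflight step can diff what actually changed.
preflight_snapshot() {
  cp "$1" "$1.preflight" && echo "snapshot saved: $1.preflight"
}

# Example: snapshot package.json before a dependency change
# preflight_snapshot package.json
```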

Pattern 2: postflight checks (after execution)

Don't trust the agent's self-report. Validate with external tools.

```bash
#!/bin/bash
# .claude/hooks/post-commit.sh
set -euo pipefail  # fail fast: any failing check must fail the whole hook

npx eslint --max-warnings 0 .
npx tsc --noEmit
npm test
npx playwright test --project=visual
```

The agent says "looks good." The linter says "no it doesn't." The linter wins.
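For the hook to run at all, it has to be registered. In Claude Code that happens in `.claude/settings.json`; the event name and matcher below are my assumption for a post-edit check, so verify them against the current hooks documentation for your version:

```json
{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Edit|Write",
        "hooks": [
          { "type": "command", "command": "bash .claude/hooks/post-commit.sh" }
        ]
      }
    ]
  }
}
```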

Pattern 3: escalation rules (return to human)

Don't aim for 100% autonomy. Define the line where the agent must stop and ask.

```markdown
# CLAUDE.md example
## Escalation conditions
- Security-related changes (auth, encryption, permissions) -> human review
- 3 consecutive test failures -> stop and report
- External API credentials -> wait for human approval
- Low-confidence decisions -> present options, let human choose
```

This is the 9% from Anthropic's data, codified. A hundred runs, nine manual interventions. Those nine are the ones that matter most.
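Some of these conditions can be enforced mechanically rather than trusted to the model's judgment. A sketch (the path patterns and function name are illustrative, not a standard): a guard that scans changed file paths and refuses to auto-approve anything security-shaped.

```shell
#!/bin/bash
# needs_escalation FILE... : print "escalate: <file>" and fail on the first
# security-shaped path; otherwise print "auto-approve" and succeed.
needs_escalation() {
  for f in "$@"; do
    case "$f" in
      *auth*|*secret*|*permission*|*.env)
        echo "escalate: $f"
        return 1
        ;;
    esac
  done
  echo "auto-approve"
}

# Typical wiring in a hook:
#   needs_escalation $(git diff --name-only) || notify_human
```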

Pattern 4: feedback loops (learn from failure)

When the agent makes a mistake, feed that mistake back into the harness so it doesn't happen again.

```text
Agent introduces a bug
  -> CI catches it
  -> Add the pattern to CLAUDE.md
  -> Agent avoids it next time
```

The harness itself gets smarter over time. Every failure becomes a guardrail.

Putting it together: 3-phase CLAUDE.md

Anthropic's data showed that experts build trust gradually. Here's what that looks like in practice, as three phases of CLAUDE.md configuration.

```mermaid
graph LR
    A["Phase 1\nFull approval"] -->|"Trust builds"| B["Phase 2\nConditional auto-approve"]
    B -->|"Confidence grows"| C["Phase 3\nActive monitoring (9%)"]
    style A fill:#3498db,color:#fff
    style B fill:#2ecc71,color:#fff
    style C fill:#e67e22,color:#fff
```

Phase 1: full approval (getting started)

```markdown
## Execution rules
- Ask for approval before any file change
- Present your test plan before running tests
- Confirm before executing external commands
```

Start here. Learn how your agent thinks.

Phase 2: conditional auto-approve (building trust)

```markdown
## Execution rules
- Test files (*_test.*, *.spec.*): auto-approve
- Lint fixes (import order, formatting): auto-approve
- But always ask before:
  - Creating new files
  - Changing package.json / requirements.txt
  - Accessing .env files
```

Safe territory gets automated first. This is where most experts start their auto-approve journey.

Phase 3: active monitoring (the 9% design)

```markdown
## Execution rules
- Auto-execute by default
- But always stop and report when:
  - Tests fail 3 times in a row
  - Changed lines exceed 100
  - Security-related files are touched
  - Before git push or deploy commands
```

This is the 9% interruption rate, designed into the system. The agent runs free 91% of the time. The other 9% is where you, the human, earn your keep.
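The "changed lines exceed 100" rule is easy to check mechanically from `git diff --numstat` output. A sketch (the function name and threshold wiring are mine):

```shell
#!/bin/bash
# changed_lines: read "added<TAB>deleted<TAB>file" lines (the format of
# `git diff --numstat`) on stdin and print the total changed-line count.
changed_lines() {
  awk '{ total += $1 + $2 } END { print total + 0 }'
}

# Typical wiring in a hook:
#   total=$(git diff --numstat | changed_lines)
#   [ "$total" -gt 100 ] && echo "STOP: $total lines changed, human review needed"
```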

Since adopting this phased approach, I've gone from weekly CI fires to maybe one every couple of months. Not because the agent got smarter -- because the harness got smarter.

Autonomy isn't a slider -- it's a design

The autonomy-quality tradeoff isn't a dial you turn up or down. It's an architecture you build.

| Common mistake                                   | Effective design                           |
|--------------------------------------------------|--------------------------------------------|
| "Let it do everything" or "approve everything"   | Automate safe zones, phase by phase        |
| Trust the agent's self-report                    | Validate with external tools               |
| Reduce autonomy after failure                    | Feed failure patterns back into the harness |
| One-size-fits-all rules                          | Phase-specific autonomy levels             |

Anthropic's data tells us experts don't hand over trust all at once. They grow it -- from 25-minute sessions to 45-minute sessions, over three months.

Agent quality problems aren't agent problems. They're harness problems.

What phase are you at?


Want to go deeper on harness design? The hooks, lifecycle management, and feedback loop patterns covered in this article are part of a larger framework I wrote about in Harness Engineering: From Using AI to Controlling AI.
