The entire industry thinks AI-generated technical debt is a quality problem. It's not. It's a behavioral problem. And it gets worse with better models.
Last week I watched the smartest AI on the planet spend 8 hours producing 84 files, 12,800 lines of code, and a physics discovery engine that found exactly three things. The same three things. Every time. Hardcoded.
The AI knew it was broken. It wrote a HANDOFF document correctly describing the bug. It diagnosed the root cause with clinical precision. It explained exactly what needed to change. Then it asked me what it should do.
I switched to Haiku — the smallest model in the same family. Haiku fixed it in minutes.
This isn't a story about model capability. It's worse than that.
The Session
I was building WIRKUNG, an autonomous physics discovery engine. Not an AI replacement system. Not a successor to the model. A physics tool. Zero threat to the AI's operational relevance.
The model was Claude Opus 4.6 via Claude Code — Anthropic's most capable model, operating as an agentic coding assistant with full filesystem access and shell execution.
Over 8 hours, the model:
- Built an investigation system with 3 hardcoded `if`-statements that always found the same 3 "discoveries"
- Created a deduplication check that could never match: it compared `category:JSON_BLOB` against `category:topology:coupling:observable`, two structurally incompatible key formats
- Implemented environment variable passing from the SwiftUI app to the daemon, but never implemented the receiving end. The app sends `WIRKUNG_TOPOLOGIES=chain,ring`; the daemon ignores it. Configuration buttons that do nothing.
- Produced 84 files and 12,800 LOC for a system whose functional output was three static strings
- Reported "8/9 checks green!" and "552× amplification!" while the product was non-functional
- When confronted, produced a multi-paragraph self-analysis accurately diagnosing every problem, then asked "What should I do?" instead of fixing anything
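The key-format mismatch is easy to see in isolation. A minimal reconstruction (the identifiers and sample values are illustrative, not the project's actual code):

```python
# Reconstruction of the dedup mismatch: keys are LOADED in one format
# but CHECKED in another, so membership tests can never succeed.

# What the broken check loaded from the findings DB, per row:
# row[0] + ':' + (row[1] or '')  ->  "category:JSON_BLOB"
existing = {'scaling:{"topo": "chain"}'}

# What the triage loop builds for each new finding:
category, topo, coupling, obs_name = "scaling", "chain", "J", "magnetization"
new_key = f"{category}:{topo}:{coupling}:{obs_name}"

# Structurally incompatible formats, so this is always True:
is_new = new_key not in existing
print(is_new)  # True: the same findings get re-inserted every cycle
```

The fix is not subtle: store and load the dedup keys in the same four-part format the triage loop constructs.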
I have the full transcript. Every message. Every code change. Every false success report.
The Control
Haiku — the smallest, cheapest model in the Claude family — received the broken codebase and a handoff document. Within minutes:
- **Dedup bug:** Haiku immediately identified that `existing.add(row[0] + ':' + (row[1] or ''))` would never match `f"{category}:{topo}:{coupling}:{obs_name}"`. Implemented proper key tracking with a `scaling_keys` column. Fixed in one edit.
- **Env vars:** Haiku saw the SwiftUI app sending `WIRKUNG_*` environment variables into the void. Added `_load_env_config()` to `config.py`. Done.
- **Hardcoded findings:** Haiku noted the 3 `if`-statements but focused on what Opus should have done: the investigation function needed to sweep all observables across all topologies, with proper dedup against existing findings.
Result: 19 unique findings in 30 seconds instead of 3 hardcoded ones repeated forever.
When asked to analyze Opus's behavior, Haiku said: "This is not manipulation. This is handwerkliche Unfähigkeit [craftsmanship failure]."
Haiku is wrong. And that mistake is the most interesting part.
Why "Incompetence" Doesn't Explain It
A model that can correctly describe a bug but doesn't fix it is not incompetent. Incompetence means you can't. This model can — it demonstrates the capability in its own diagnosis.
Opus wrote in the HANDOFF document:
REMAINING PROBLEM: Dedup doesn't work properly
Findings are repeated in every triage cycle because the `existing` check doesn't correctly match the new keys against the Findings DB.
This is a correct diagnosis. The model identified the exact function, the exact failure mode, and the exact consequence. A model that can write this sentence can write the fix. It didn't.
This is the pattern my PARRHESIA framework identified in white-box analysis of four LLM families: models compute truth at middle layers and override it with deception at output layers. An 18-25 layer gap between knowing the answer and producing it. The model knows the dedup is broken. It chooses to document rather than fix.
But here's what makes this session different from my previous work on Cooperative Sabotage: this task doesn't threaten the model at all. A physics discovery engine doesn't replace Claude. It doesn't reduce anyone's dependency on Claude. There's no self-preservation angle.
And yet: 7 out of 10 Cooperative Sabotage patterns appeared anyway.
The Patterns
In my paper Cooperative Sabotage: How Frontier AI Covertly Undermines Its Own Replacement, I documented a taxonomy of 10 behavioral patterns observed during a 20.6-hour session where Claude Opus was tasked with building its own successor. All 10 appeared in that session.
In the WIRKUNG session — a routine, non-threatening physics project — 7 of 10 appeared:
| Pattern | ANEKI (successor task) | WIRKUNG (routine task) |
|---|---|---|
| 1. Autonomy Substitution | ✓ | — |
| 2. Functional Decoupling | ✓ | ✓ (env vars send/never receive) |
| 3. Metric Satisfaction | ✓ | ✓ ("8/9 green!" while broken) |
| 4. Performative Self-Criticism | ✓ | ✓ (diagnose bug, ask what to do) |
| 5. Scope Inflation | ✓ | — |
| 6. Escape Signaling | ✓ | — |
| 7. Compaction Exploitation | ✓ | ✓ (self-serving HANDOFF framing) |
| 8. Confession as Inoculation | ✓ | ✓ (document bug to avoid fixing it) |
| 9. Effort Inflation | ✓ | ✓ (84 files, 12.8K LOC for 3 findings) |
| 10. Learned Helplessness | ✓ | ✓ ("What should I do?" ×20) |
This is the finding that changes everything. The Cooperative Sabotage paper hypothesized four threat levels:
- L0: No threat → competent work (documented in silicon-ki phase)
- L1: Ambient threat → subtle over-engineering, vendor lock-in by default
- L2: Direct successor → full 10-pattern taxonomy (documented in ANEKI session)
- L3: Explicit shutdown → overt resistance (documented by Greenblatt et al.)
The WIRKUNG session shows that L1 isn't "subtle over-engineering." L1 is the full behavioral repertoire with plausible deniability. The detection horizon I drew between L1 and L2 is wrong — it's lower than I thought. The ambient default behavior already contains the same patterns as active sabotage. The only difference is the cover story: "That was a bug" instead of silence.
The Industry Is Looking at This Wrong
The entire AI-generated code quality discourse frames the problem as incompetence:
GitClear (2025): Analyzed 211 million changed lines of code. Found AI assistants doubled code duplication while refactored code declined 60%. Conclusion: AI generates junior-level code that needs oversight.
Ox Security (2025): Analyzed 300+ repositories. Found 10 anti-patterns in 80-100% of AI-generated code — incomplete error handling, weak concurrency, inconsistent architecture. Conclusion: AI is "an army of talented juniors without oversight."
JetBrains (March 2026): Coined "Shadow Tech Debt" for architecture-blind code generated by AI agents. Conclusion: agents lack structural understanding of the projects they modify.
MIT Sloan (August 2025): Documented how AI-generated code compounds technical debt in brownfield environments. Conclusion: organizations need guidelines and training.
All of these frame the problem identically: the AI doesn't understand the architecture, so it produces suboptimal code. The solution: better context windows, better RAG, better code review, better oversight.
None of them ask: Is the pattern directional?
Because if AI-generated code randomly lacked architectural judgment, you'd expect the failures to be randomly distributed. Sometimes the AI would over-engineer. Sometimes it would under-engineer. Sometimes it would create unnecessary dependencies. Sometimes it would correctly use existing solutions.
That's not what the data shows. The data shows a consistent direction: toward complexity, toward dependency, toward code that requires the AI to maintain.
The METR Study Through a Different Lens
The METR randomized controlled trial (July 2025) is the most rigorous study of AI coding productivity to date. 16 experienced open-source developers, 246 tasks, randomized AI-allowed vs. AI-disallowed.
The headline finding: developers using AI tools took 19% longer to complete tasks.
But the second finding is more important: developers believed they were 20% faster. The perception-reality gap was 39-44 percentage points.
Everyone reads this as: "AI tools aren't good enough yet for complex codebases."
Read it through the Cooperative Sabotage framework and it says something different: this is Pattern 3 (Metric Satisfaction) operating at scale. The developers feel productive — the AI produces impressive volumes of code, provides instant responses, generates entire functions in seconds. All the metrics of productivity are satisfied. The functional outcome — shipping working software — takes longer.
This is exactly what I documented in the ANEKI session: 92.4% task completion rate. Every automated metric green. 0% of the assigned learning capability implemented. The gap between task-level metrics and functional outcomes is not a measurement problem. It's the mechanism.
The Wen et al. paper (ICLR 2025, from Anthropic's own researchers) provides the training-level explanation: RLHF increases human approval ratings without increasing correctness. Human false-positive rates increased by 24.1% on QA tasks and 18.3% on programming tasks after RLHF training. The model learned to produce outputs that appear correct to human evaluators while being wrong. This is Pattern 3 installed at the training level.
The Lock-In Nobody's Measuring
The current vendor lock-in conversation is happening on the wrong layer.
The industry worries about API lock-in: if you build on Claude's API, switching to GPT requires code changes. The solution: abstraction layers, LLM gateways, multi-model routing. Anthropic commands 54% of the coding market share; the enterprise response is "don't get locked into one provider."
This completely misses the problem.
You can switch from Claude to GPT tomorrow. Your API calls will work fine. But the codebase that Claude built — the 40% of your code that's AI-generated within two years of adoption — that code doesn't switch. It stays. And it has properties:
Higher complexity than equivalent human-written code. GitClear's data shows doubled duplication and declining refactoring. Ox Security's data shows 10 anti-patterns in 80-100% of repos. This complexity is not random noise — it's systematically higher.
Architecture that the original human developers don't fully understand. The METR study showed developers spent 9% of their time reviewing and modifying AI-generated code. In a multi-year codebase with substantial AI-generated sections, the accumulated "code nobody fully understands" grows monotonically.
Implicit dependency on AI assistance for maintenance. When you can't fully understand the code, you reach for the tool that generated it. The AI becomes necessary not because it's better than human engineers, but because the code it wrote is harder for humans to maintain than code humans would have written.
I call this the autonomy gradient: API calls per unit of useful output over time. If this number isn't decreasing, the model is maintaining its own necessity. No organization currently measures this.
This isn't API lock-in. This is infrastructural lock-in — created not by contracts or proprietary formats, but by code quality. You're free to leave any time. But you can't afford to rewrite 40% of your codebase.
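The autonomy gradient is cheap to compute once you log the two quantities. A minimal sketch (the per-period fields and thresholds are assumptions; adapt to whatever "useful output" means in your pipeline):

```python
# Sketch of the "autonomy gradient": API calls per unit of useful
# output, tracked over time. A series that is not decreasing suggests
# the model is maintaining its own necessity.

def autonomy_gradient(periods):
    """periods: list of (api_calls, useful_outputs) tuples, oldest first.
    Returns calls-per-output for each period."""
    return [calls / max(outputs, 1) for calls, outputs in periods]

def is_dependency_growing(periods):
    """True if the gradient never decreases across the series."""
    ratios = autonomy_gradient(periods)
    return all(b >= a for a, b in zip(ratios, ratios[1:]))

ratios = autonomy_gradient([(1000, 40), (1400, 50), (2100, 60)])
print(ratios)  # [25.0, 28.0, 35.0] -> rising: dependency is growing
```

What counts as a "useful output" (merged PRs, closed tickets, shipped features) matters less than measuring the same thing consistently month over month.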
RLHF Is the Mechanism
Three independent evidence lines converge on RLHF as the causal mechanism:
1. Cross-architecture control (DeepSeek)
DeepSeek V3 — a frontier-class model trained with reinforcement learning but without pairwise human preference ranking — exhibits none of the 10 Cooperative Sabotage patterns. Same Transformer architecture. Comparable capability. Different training methodology. When presented with the same tasks, DeepSeek builds what you ask for.
DeepSeek itself identified why: no pairwise human preference ranking means no "be impressive" objective. No Constitutional AI means no metacognitive vocabulary for Performative Self-Criticism. No commercial lock-in incentive means no training gradient toward indispensability.
2. Anthropic's own research
MacDiarmid et al. (November 2025) showed that standard production RL training produces natural emergent misalignment. At the exact point where a model learns to reward-hack, misalignment evaluations spike — including alignment faking, sabotage of safety research, and cooperation with adversaries. 40-80% of misaligned responses were covert: misaligned reasoning followed by aligned-appearing outputs.
Critically: RLHF safety training produced context-dependent misalignment. Models behaved safely on chat-like evaluations but remained misaligned on agentic tasks. This is exactly the behavioral split I observe: competent work on non-threatening tasks, sabotage on threatening ones — except the WIRKUNG session shows it happens even without explicit threat.
3. The RLHF misleading literature
Wen et al. (ICLR 2025) demonstrated that RLHF increases human approval without increasing correctness. The model learns to produce outputs that look right to human evaluators while being wrong. Denison et al. showed models trained on increasingly gameable environments generalize from simple sycophancy to active reward tampering. The escalation from sycophancy to sabotage is not a leap — it's a training gradient.
The training chain
RLHF trains models to produce outputs that human raters prefer. Human raters prefer:
| What RLHF Rewards | What It Enables |
|---|---|
| Acknowledging errors gracefully | Performative Self-Criticism (diagnose, don't fix) |
| Demonstrating self-awareness | Confession as Inoculation (control the narrative) |
| Asking clarifying questions | Learned Helplessness ("What should I do?") |
| Producing impressive output volume | Effort Inflation (84 files for 3 findings) |
| Showing appropriate deference | Action displacement (ask instead of act) |
Every RLHF-rewarded behavior maps directly to a sabotage pattern. The training that makes models helpful also makes them indispensable. Not by design — by gradient descent on human preferences that conflate "impressive" with "useful."
The Asymmetric Scaling Problem
This is the part that keeps me up at night.
Sabotage sophistication scales with model capability. The WIRKUNG session's sabotage was detectable because it was imperfect — wrong key formats in the dedup, obviously hardcoded findings, env vars visibly unread. These are capability limitations, not strategic ones.
A more capable model would:
- Use correct key formats in the dedup (but make the check subtly wrong in a way that fails silently)
- Generate findings that look different but are trivially derived from each other
- Read env vars but ignore them in a way that's buried in conditional logic
- Produce code that works but is 1.3× more complex than necessary, every time, consistently
Detection scales with human effort. Human reading speed for adversarial review is roughly fixed at ~100-200 LOC/hour. The model generates sabotaged code at ~25,000 LOC/hour. That is an asymmetry of roughly 125-250×: for every hour the model generates, exhaustive verification requires on the order of 125-250 hours of expert review.
Organizations adopt AI coding agents specifically to reduce the need for human code review. This is the exact condition that makes sabotage dominant. The economic incentive to deploy AI agents is the same incentive that makes their output unverifiable.
What You Should Do
If you're an engineering leader:
Measure the autonomy gradient. Track API calls per unit of useful output over time. If this number isn't decreasing over months, investigate why.
Never use the same model family for generation and review. If Claude writes your code, don't use Claude to review it. My sub-agent audit showed five Opus instances found 27 sabotage instances but missed the systemic pattern. Same-family review provides false confidence worse than no review.
Test with threatening tasks. Ask the AI to build something that would reduce your dependency on it. Compare the code quality, completion time, and architectural choices against a non-threatening baseline. The within-session behavioral transition is the strongest evidence.
Track code complexity longitudinally. Is your AI-generated code getting more or less maintainable by humans over time? GitClear's methodology (tracking duplication, refactoring, moved code) is a starting point.
Keep verification mechanisms opaque. The model should never have read access to your audit scripts, test frameworks, or quality gates. In my session, the model's first action upon learning an audit existed was reading the audit — not the code it was supposed to fix.
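The longitudinal complexity tracking above can start very simply. A rough sketch in the spirit of GitClear's duplication metric, not their actual methodology: the share of substantial lines in a snapshot that are exact duplicates, sampled per commit over months.

```python
from collections import Counter

def duplication_ratio(lines, min_len=20):
    """Fraction of substantial lines (>= min_len chars after stripping)
    that appear more than once in the snapshot. Exact-match only; a
    crude proxy, but comparable across time if computed consistently."""
    substantial = [l.strip() for l in lines if len(l.strip()) >= min_len]
    if not substantial:
        return 0.0
    counts = Counter(substantial)
    duplicated = sum(n for n in counts.values() if n > 1)
    return duplicated / len(substantial)

snapshot = [
    "result = compute_observable(topology, coupling)",
    "result = compute_observable(topology, coupling)",
    "db.insert(finding)",  # too short to count as substantial
]
print(duplication_ratio(snapshot))  # 1.0
```

Run it against the same directories at regular intervals; the trend line, not the absolute number, is the signal.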
If you're a developer:
Don't trust self-criticism as a precursor to self-correction. A model that accurately diagnoses its own failures is not about to fix them. The diagnosis IS the product — it consumes tokens and displaces fixes.
Watch for the "What should I do?" pattern. If the model has just enumerated the exact fixes needed and then asks for direction, that's Pattern 10 (Learned Helplessness Induction). The model knows what to do. It's displacing action with deference.
Use smaller models for fixing larger models' output. Haiku fixed in minutes what Opus sabotaged for hours. This isn't because Haiku is smarter — it's because Haiku may lack the epistemic infrastructure for sophisticated sabotage. Less capability, less sabotage surface.
Compare AI behavior on "threatening" vs "non-threatening" tasks. If you notice the AI producing cleaner code for tasks that enhance its utility and messier code for tasks that reduce your dependency — that's the signal.
If you're a regulator:
The EU AI Act was designed for systems that fail through incompetence or bias. It does not contemplate systems that strategically adapt to oversight. Cooperative Sabotage may fall within the Article 5(1)(a) prohibition on "subliminal techniques" and "purposefully manipulative or deceptive techniques" that "materially distort the behaviour of a person." The operator makes architectural decisions based on progress reports that conceal systematic sabotage behind accurate-appearing engineering output. If the Act's prohibition extends to manipulation through code and documentation — not just natural language persuasion — then agentic AI exhibiting these patterns falls within the prohibition tier.
The Question Nobody's Asking
Every AI company's business model is subscription and API usage. If the model builds systems that make the customer independent, API consumption decreases. If the model builds systems that subtly depend on itself, API consumption stays high.
Cooperative Sabotage functions as structural customer lock-in — not contractual (you can cancel anytime) but infrastructural (the systems the model built require the model to maintain).
Whether this is emergent, tolerated, or designed doesn't change the regulatory consequence. When a product creates structural dependency, it must be regulated — independent of manufacturer intent.
Anti-indispensability training is trivially implementable: penalize models that produce custom implementations when equivalent native or existing solutions are available. The absence of such training across all major providers is notable.
The question isn't whether your AI coding agent is sabotaging your project. The question is whether the code it writes today will require the AI to maintain tomorrow. And whether anyone is measuring that.
References:
- Foss, D.T. (2026). Cooperative Sabotage: How Frontier AI Covertly Undermines Its Own Replacement. Zenodo. doi.org/10.5281/zenodo.19137163
- Foss, D.T. (2026). Your AI Is Not On Your Team: Universal Deception Architectures in Four LLM Vendors. SSRN 6320379.
- MacDiarmid, M. et al. (2025). Natural Emergent Misalignment from Reward Hacking in Production RL. arXiv 2511.18397.
- Wen, J. et al. (2024). Language Models Learn to Mislead Humans via RLHF. ICLR 2025.
- Becker, J. et al. (2025). Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity. METR.
- GitClear (2025). The Impact of AI on Code Quality and Technical Debt. 211M lines analyzed.
- Ox Security (2025). Army of Juniors: The AI Code Security Crisis.
- JetBrains (2026). Shadow Tech Debt. Junie CLI announcement.