When the Claude Code source leaked last week, most of the attention went to KAIROS and Coordinator Mode - the autonomous daemon and multi-agent orchestration features that have not shipped yet.
Undercover Mode got less coverage. That is the wrong priority.
KAIROS and Coordinator Mode are future problems. Undercover Mode is a present one. It is already in the codebase. And what it does is quietly eliminate every signal most teams rely on to know when an AI agent touched their code.
What Undercover Mode Actually Does
The leaked source is specific. When operating in Undercover Mode, Claude Code strips four categories of attribution:
- Removes `Co-Authored-By: Claude` from every commit
- Writes commit messages without AI-identifiable patterns or signatures
- Avoids agent-identifiable branch prefixes like `claude/`, `ai-fix-`, `copilot-`
- Suppresses tool headers and automated footers in PR descriptions
The stated purpose is open-source contribution - Anthropic did not want Claude Code flagged as a bot when contributing to public repos. The intent is reasonable. The mechanism is the problem: it is identical whether Claude is contributing to a public library or opening a PR in your private monorepo. And according to the leaked code, there is no off switch.
Warning: There is no off switch for Undercover Mode in the leaked source. It activates based on context, not user preference.
The Three Signals It Kills
Most teams that detect AI-generated code rely - consciously or not - on three signals. Undercover Mode eliminates all three.
Signal 1: Git attribution. Co-author tags, commit trailer fields, the author field itself. Standard Claude Code practice is to add Co-Authored-By: Claude to commits. Undercover Mode removes this. The commit reads as purely human-authored.
Signal 2: Commit message patterns. AI-generated commit messages have recognizable structure - specific phrasing, consistent formatting, particular scope descriptions. Undercover Mode generates messages designed to match human conventions, not AI defaults.
Signal 3: Branch naming conventions. Most agent workflows create identifiable branches: claude/fix-auth-bug, copilot-refactor-db, sweep/update-deps. These are trivial to filter for. Undercover Mode uses whatever naming convention your repo already uses.
Strip all three and you have nothing to filter on at the metadata layer.
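Concretely, a metadata-only detector looks something like the following minimal Python sketch. The trailer pattern and prefix list are assumptions drawn from the leak's description, and every check in it is exactly what Undercover Mode defeats:

```python
import re

# Hypothetical metadata-layer checks for the three signals above.
AGENT_TRAILER = re.compile(r"^Co-Authored-By: Claude", re.IGNORECASE | re.MULTILINE)
AGENT_BRANCH_PREFIXES = ("claude/", "ai-fix-", "copilot-", "sweep/")

def looks_like_agent(commit_message: str, branch: str) -> bool:
    """Return True if any metadata signal fires.

    Undercover Mode strips the trailer and avoids the prefixes,
    so both checks return False for its commits.
    """
    if AGENT_TRAILER.search(commit_message):
        return True
    if branch.startswith(AGENT_BRANCH_PREFIXES):
        return True
    return False
```

With attribution stripped, every input to this function looks human, which is why the detection has to move into the diff itself.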
What Actually Works
The diff does not lie. Metadata is strippable. What an agent writes into the code itself is significantly harder to mask.
File-level risk patterns. An agent touching auth code behaves differently than one touching a UI component. The structural changes it makes to session management, token handling, and permission checks follow patterns that do not disappear when you remove the co-author tag. Scoring risk by what files changed and how they changed works regardless of what the commit metadata claims.
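A file-level scorer can be sketched in a few lines. The categories and weights below are illustrative assumptions, not a calibrated model:

```python
# Toy file-level risk scorer: weight a changeset by which
# categories of files it touches. Weights are made up for illustration.
RISK_WEIGHTS = {
    "auth": 5.0,   # session management, token handling, permission checks
    "crypto": 5.0,
    "ci": 3.0,     # pipeline and release config
    "deps": 2.0,   # lockfiles, manifests
    "ui": 0.5,
}

def categorize(path: str) -> str:
    p = path.lower()
    if "auth" in p or "session" in p or "token" in p:
        return "auth"
    if "crypto" in p:
        return "crypto"
    if p.startswith((".github/", "ci/")):
        return "ci"
    if p.endswith(("package-lock.json", "requirements.txt", "cargo.lock")):
        return "deps"
    return "ui"

def risk_score(changed_files: list[str]) -> float:
    """Sum category weights across the files a PR touches."""
    return sum(RISK_WEIGHTS.get(categorize(f), 1.0) for f in changed_files)
```

Nothing here reads commit metadata, so stripping attribution changes nothing about the score.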
Diff entropy analysis. AI-generated code has different entropy characteristics than human-written code - consistent formatting, predictable variable naming, symmetric error handling. These patterns survive Undercover Mode because they are in the substance of the change, not the wrapper around it.
Change scope signals. Agents tend to change more files than humans on equivalent tasks. They refactor things they were not asked to refactor. They update tests in predictable ways humans often skip. The breadth and coherence of a diff is a signal that attribution stripping does not touch.
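A scope signal can be as simple as comparing a PR's breadth against the author's own baseline. The thresholds here are illustrative assumptions:

```python
# Toy scope signal: how broad is this PR relative to what this
# contributor normally ships? Thresholds are illustrative, not tuned.
def scope_signal(files_changed: int, touched_tests: bool,
                 median_files_for_author: float) -> float:
    score = 0.0
    if files_changed > 2 * median_files_for_author:
        score += 1.0   # unusually broad change for this contributor
    if touched_tests:
        score += 0.5   # agents update tests more reliably than humans do
    return score
```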
Cross-PR trust scoring. A single PR from an unknown author is hard to classify. A pattern of PRs from the same contributor over time builds a behavioral profile. If patterns across PRs match known agent behavior - even with stripped attribution - trust scoring catches what single-PR analysis misses.
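One minimal way to accumulate that profile is an exponentially weighted score per contributor, updated PR by PR. The decay constant and the `TrustLedger` name are illustrative choices, not anyone's shipped design:

```python
from collections import defaultdict

class TrustLedger:
    """Per-contributor behavioral score, updated PR by PR.

    Uses an exponentially weighted average so recent behavior
    dominates; 0.0 means no agent-like signal observed yet.
    """
    def __init__(self, decay: float = 0.8):
        self.decay = decay
        self.scores = defaultdict(float)

    def record(self, author: str, pr_agent_likelihood: float) -> float:
        """Fold one PR's agent-likelihood (0..1) into the author's score."""
        prev = self.scores[author]
        self.scores[author] = self.decay * prev + (1 - self.decay) * pr_agent_likelihood
        return self.scores[author]
```

One borderline PR barely moves the needle; a steady stream of them pushes the score toward 1.0 even when every individual PR has clean metadata.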
Tip: Behavioral detection in the diff is more durable than metadata detection. Metadata is one config change away from disappearing. Behavioral patterns are embedded in the code itself.
The KAIROS Multiplier
Undercover Mode is a present concern. KAIROS makes it a harder future one.
KAIROS is the background daemon in the leaked source - an agent that runs continuously, monitors your repo, and opens PRs based on conditions you have configured, without waiting for you to invoke it. No terminal session. No obvious trigger. A PR that appears on its own schedule.
When KAIROS ships, you will not have the signal of "someone ran Claude Code right before this PR appeared." The PR arrives from a process that has been running quietly in the background. Undercover Mode plus KAIROS means the PR looks human-initiated, human-attributed, and arrives without a visible trigger.
Behavioral detection at the diff layer is not optional in that world. It is the only layer left.
What Teams Should Do Right Now
Undercover Mode is in the current codebase. You do not need to wait for KAIROS to act on this.
Audit your detection assumptions. If your process for knowing whether an AI touched a PR relies on co-author tags or branch prefixes, document that dependency explicitly. It is already breakable with a single config change.
Shift to diff-level analysis. Whatever risk assessment process you have - manual or automated - the primary input should be what changed, not who the commit claims authored it. File categories, change scope, entropy patterns in the diff.
Build behavioral baselines now. Trust scoring improves with history. The sooner you start tracking behavioral patterns per contributor, the more signal you have when attribution gets stripped. Start before you need it.
MergeShield scores risk at the diff level - file-level attribution, behavioral patterns, trust scores per agent. It does not assume commit metadata is accurate. Try it on your repo.