Scott
A Code Authorship Analysis of the Claude Code Leak. What Was Found Doesn't Match Human or AI Code.

On March 31, 2026, Anthropic shipped a source map in their npm package, exposing 514,000 lines of TypeScript. Thousands of developers analyzed it. They found feature flags, a pet system, undercover mode, and a frustration regex.

Nobody analyzed the authorship pattern of the code itself.

Curia is an evidence-accumulating prediction system built for unrelated research. It turned out to be exactly the right tool to ask a question nobody was asking: does this code look like it was written by humans?

The numbers are real. What they mean is up to you.

The Fingerprint

30+ metrics were extracted from every TypeScript file and compared across codebases spanning two eras:

| Codebase | Era | Purpose |
| --- | --- | --- |
| Next.js v12.0.0 | 2021 | Pre-AI baseline |
| Next.js latest | 2026 | Same project, AI era |
| TypeORM 0.2.29 | 2020 | Pre-AI baseline |
| TypeORM latest | 2026 | Same project, AI era |
| Vite 1.0-rc13 | 2020 | Pre-AI baseline |
| Vite latest | 2026 | Same project, AI era |
| Claude Code Action | 2026 | Same Anthropic team, same project type |
| Claude Code v2.1.88 (leak) | 2026 | The leaked source |

Where Claude Code Diverges from Everything

Key metrics across all codebases, pre-AI and AI-era:

| Codebase | Avg line (chars) | Throw/1K | Nullish/1K | Decorator/1K | TODO/1K | Console/1K | Interface/1K |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Next.js 2021 | 31.0 | 6.90 | 1.11 | 0.11 | 1.67 | 6.00 | 2.68 |
| Next.js 2026 | 33.5 | 4.57 | 2.13 | 0.10 | 2.31 | 2.69 | 2.09 |
| TypeORM 2020 | 39.8 | 7.93 | 0.00 | 0.09 | 0.16 | 3.76 | 2.91 |
| TypeORM 2026 | 34.1 | 6.21 | 0.46 | 0.05 | 0.09 | 2.11 | 2.85 |
| Vite 2020 | 29.3 | 1.12 | 0.00 | 0.00 | 0.87 | 12.80 | 5.22 |
| Vite 2026 | 33.1 | 4.30 | 5.14 | 0.16 | 0.58 | 0.91 | 4.24 |
| CC Action | 32.3 | 11.21 | 2.96 | 0.00 | 0.00 | 21.80 | 0.12 |
| CC Early (v0.2) | 27.6 | 10.72 | 0.86 | 0.00 | 0.09 | 12.18 | 6.00 |
| CC Leak (v2.1) | 62.7 | 2.10 | 4.79 | 0.00 | 0.23 | 0.32 | 0.25 |
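The original extraction tooling isn't public, but the per-1K metrics above are simple enough to sketch. Everything here is my reconstruction: the regexes and the `Fingerprint` shape are assumptions, not Curia's implementation.

```typescript
interface Fingerprint {
  avgLineLength: number;
  throwPer1K: number;
  nullishPer1K: number;
  todoPer1K: number;
}

// Rough per-1K fingerprint of a TypeScript source string. Note that these
// regexes also match inside strings and comments; for a coarse stylistic
// fingerprint that error is tolerable.
function fingerprint(source: string): Fingerprint {
  const lines = source.split("\n");
  const loc = lines.length;
  const per1K = (n: number) => (n / loc) * 1000;
  const count = (re: RegExp) => (source.match(re) ?? []).length;

  return {
    avgLineLength: lines.reduce((sum, l) => sum + l.length, 0) / loc,
    throwPer1K: per1K(count(/\bthrow\b/g)),
    nullishPer1K: per1K(count(/\?\?/g)),
    todoPer1K: per1K(count(/\bTODO\b/g)),
  };
}
```

Run over every `.ts` file in a repo and averaged, counters like these are what produce a table row above.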

The AI era brought real changes: throw rates dropped, console.log calls decreased, nullish coalescing appeared. But every project stayed between 29 and 40 characters per line.

Claude Code's early version (v0.2, from a cleanroom deobfuscation) fingerprints as normal human code: 27.6-char lines, 10.72 throw/1K, 12.18 console/1K, 6.00 interfaces/1K. All squarely in the human range.

By v2.1.88, it's a different codebase. 62.7-char lines. Throw rate dropped 5x. Console dropped 38x. Interfaces dropped 24x. This isn't evolution — the human metrics disappeared and something else replaced them.

The Comment Voice

Six different named authors appear in TODO comments across the codebase — paulc, inigo, hackyon, ollie, ashwin, and one signed -ab. In open source projects, different authors have noticeably different comment styles — varying levels of formality, humor, frustration, verbosity.

In Claude Code, every attributed author writes identically: state the fact, give the reason, specify the condition for change. No humor. No frustration. No personality variance. Examples:

  • paulc: "read the JWT from stdin instead of argv to keep it out of shell history. Fine for conformance... but a real user would want echo $TOKEN | ... --stdin."
  • ollie: "The memoization here increases complexity by a lot, and im not sure it really improves performance"
  • ashwin: "see if we can use utility-types DeepReadonly for this"
  • hackyon: "Migrate to the real anthropic SDK types when this feature ships publicly"

Same structure. Same density. Same reasoning style.

Even the TODOs are different. Human TODOs in other codebases sound like notes-to-self:

  • Next.js: // TODO: Is this needed?
  • Next.js: // TODO: fix this
  • TypeORM: // TODO rename
  • TypeORM: // TODO: probably should be like there, but fails on enums, fix later
  • Vite: // TODO: should this be 'worker'?

Casual. Uncertain. Questions. Claude Code's TODOs read like specifications:

  • // TODO(prod-hardening): OAuth token may go stale over the 30min poll;
  • // TODO(#23985): replace registerRemoteAgentTask + startDetachedPoll with
  • // TODO: Clean up this code to avoid passing around a mutable array.

Structured, tagged with contexts or ticket numbers, specific about what needs to change. The same voice as the regular comments.

And almost no questions. Across 514K lines, only 20 comments end with a question mark — 0.039 per 1K LOC. For comparison:

| Codebase | Comments ending in "?" | Per 1K LOC |
| --- | --- | --- |
| Next.js 2021 (pre-AI) | 8 | 0.250 |
| Next.js 2026 (AI era) | 91 | 1.936 |
| TypeORM 2020 (pre-AI) | 30 | 0.708 |
| TypeORM 2026 (AI era) | 33 | 0.447 |
| Vite 2020 (pre-AI) | 0 | 0.000 |
| Vite 2026 (AI era) | 5 | 0.099 |
| Claude Code (leak) | 20 | 0.039 |

Human developers ask questions in comments: "Is this needed?" "Should this be worker?" "probably should be like there, but fails on enums, fix later." Claude Code's question rate (0.039/1K) is among the lowest measured. And of its 20 question marks, most aren't actually questions — they're describing boolean checks: "is the process running?", "does it start with #?"

Only 4 out of 514K lines express genuine uncertainty:

  • // Why is this needed in addition to normalizeMessagesForAPI?
  • // - text to start -- always?
  • // Log tengu_exit event from the last session?
  • // Did the user close the IDE?

Four moments of doubt in half a million lines of code.
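The question-rate metric itself is easy to reproduce. This is my guess at the method: count comment lines that end with a question mark and normalize per 1K LOC.

```typescript
// Count comment lines ending in "?" per 1K lines of source.
// Only handles // and block-comment continuation lines; a rough measure.
function questionRate(source: string): number {
  const lines = source.split("\n");
  const questions = lines.filter((l) => {
    const t = l.trim();
    return (t.startsWith("//") || t.startsWith("*")) && t.endsWith("?");
  }).length;
  return (questions / lines.length) * 1000;
}
```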

Where duplicate comments appear across multiple files, they're character-for-character identical — not paraphrased, not shortened, not adapted to context:

  • // SECURITY: Skip filesystem operations for UNC paths to prevent NTLM credential leaks. — identical in 7 files
  • // SECURITY: Normalize to prevent path traversal bypasses via .. segments — identical in 5 files
  • // Check env var overrides first (for eval harnesses) — identical 5 times in the same file

Human developers paraphrase. "Skip UNC paths" in one file becomes "Don't allow UNC paths here" in another. These don't vary at all. They read like templates — the same output produced for the same input, every time.
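Finding verbatim duplicates like these is mechanical: index every exact comment string by the files it appears in, then keep the ones shared across multiple files. The file names below are hypothetical; only the technique is shown.

```typescript
// Map each exact // comment string to the list of files containing it,
// keeping only comments repeated character-for-character in 2+ files.
function duplicateComments(files: Map<string, string>): Map<string, string[]> {
  const index = new Map<string, string[]>();
  files.forEach((source, path) => {
    const comments = new Set(source.match(/\/\/[^\n]*/g) ?? []);
    comments.forEach((c) => {
      const entry = index.get(c) ?? [];
      entry.push(path);
      index.set(c, entry);
    });
  });
  const dupes = new Map<string, string[]>();
  index.forEach((paths, c) => {
    if (paths.length > 1) dupes.set(c, paths);
  });
  return dupes;
}
```

Exact-string matching is the point: a human paraphrase ("Don't allow UNC paths here") would not be caught, which is exactly why zero variation stands out.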

Anthropic's Claude Code Action (@anthropics/claude-code-action) provides a direct control. Same team, same language, same year, same tooling category. Its comment style is conventional — 5.3% causal ratio (9 out of 171 comments say WHY). Claude Code's main codebase: 29% causal ratio (1,186 comments say WHY). The Action tells you what code does. Claude Code tells you why code exists. The Action scores 36.3% predictability — squarely collaborative. Claude Code scores 20.2% — alone in a category of its own.
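A "causal ratio" like the one above can be approximated by checking comments for WHY-markers. The marker list here is my assumption; Curia's actual classifier is not public.

```typescript
// Hypothetical WHY-markers; a comment containing any of these is
// counted as explaining intent rather than mechanics.
const CAUSAL_MARKERS = [
  /\bbecause\b/i, /\bso that\b/i, /\bto prevent\b/i,
  /\bto avoid\b/i, /\botherwise\b/i, /\bsince\b/i,
];

// Fraction of // comments that contain a causal marker.
function causalRatio(source: string): number {
  const comments = source.match(/\/\/[^\n]*/g) ?? [];
  if (comments.length === 0) return 0;
  const causal = comments.filter((c) =>
    CAUSAL_MARKERS.some((re) => re.test(c))
  );
  return causal.length / comments.length;
}
```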

Whatever produced Claude Code's fingerprint didn't produce the Action.

The Predictability Test

Curia predicts next tokens: it trains on 70% of each codebase's files and tests on the remaining 30%. Higher accuracy means more conventional patterns.
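Curia itself is proprietary, but the train/test idea can be illustrated with a toy bigram model. This is entirely my stand-in, not Curia's method; it only captures the shape of the measurement.

```typescript
// Toy predictability score: learn token bigrams from the first 70% of a
// token stream, then measure top-1 next-token accuracy on the last 30%.
function predictability(tokens: string[]): number {
  const split = Math.floor(tokens.length * 0.7);
  const train = tokens.slice(0, split);
  const test = tokens.slice(split);

  // counts.get(prev).get(next) = bigram frequency in training data
  const counts = new Map<string, Map<string, number>>();
  for (let i = 0; i < train.length - 1; i++) {
    const m = counts.get(train[i]) ?? new Map<string, number>();
    m.set(train[i + 1], (m.get(train[i + 1]) ?? 0) + 1);
    counts.set(train[i], m);
  }

  // best.get(prev) = most frequent follower of prev
  const best = new Map<string, string>();
  counts.forEach((m, prev) => {
    let top = "";
    let topCount = -1;
    m.forEach((n, next) => {
      if (n > topCount) { top = next; topCount = n; }
    });
    best.set(prev, top);
  });

  let hits = 0;
  for (let i = 0; i < test.length - 1; i++) {
    if (best.get(test[i]) === test[i + 1]) hits++;
  }
  return test.length > 1 ? hits / (test.length - 1) : 0;
}
```

Highly conventional code scores near 1 (the model has seen the pattern before); novel patterns score near 0.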

Results averaged over 3 runs per codebase (variance shown):

| Codebase | Predictability | Variance | Conflict (K) |
| --- | --- | --- | --- |
| TypeORM 2020 | 40.8% | ±1.2% | 0.252 |
| CC Action | 37.4% | ±1.2% | 0.268 |
| CC Early (v0.2) | 35.3% | ±1.1% | 0.255 |
| Next.js 2026 | 28.6% | ±0.6% | 0.179 |
| CC Leak (v2.1) | 20.2% | ±1.4% | 0.084 |

The gap between the early and leaked Claude Code — 15.1 percentage points — is over 10x the measurement variance. This isn't noise.

Claude Code's early version sits in the same tier as TypeORM and the CC Action — conventional, collaborative code. The leaked version drops into a category of its own. Lowest predictability, lowest conflict — Curia's sources unanimously agree they've never seen these patterns before. Not random (that would be high conflict). Not conventional (that would be high predictability). A unified voice producing novel patterns.

Same project. Different era. Different authorship.

What This Likely Means

The fingerprint points to something more sophisticated than current publicly-available AI coding assistants. The code patterns suggest an authorship source that:

  • Optimizes for static analyzability over human readability (zero decorators, structural types over interfaces, 2x line width)
  • Treats null as a primary failure class (4.79 nullish coalescing per 1K — among the highest measured, comparable only to Vite 2026)
  • Prefers explicit returns over exception flow (2.10 throw/1K — lower than most codebases measured, down from 10.72 in the early version)
  • Documents intent, not mechanics (92% file coverage, 10.2% density, heavily causal language)
  • Maintains perfect stylistic consistency across 514K lines (lowest conflict score in the dataset)

This profile is consistent with a system that has been trained with code execution feedback — where null crashes, runtime exceptions, and decorator-related failures produce negative signals that shape the coding style over time. It's code written by something that experienced the consequences of bad patterns, not something following a style guide.

Whether that's a more advanced internal model, a specialized code-generation pipeline, or an entirely new approach to AI-assisted development — the fingerprint doesn't tell us. But it clearly sits outside both the human and the current AI-era baseline.

The Phantom Flags

In the GrowthBook configuration sent to every Claude Code client, there are 11 flags with literary names — Proust, Lovecraft, Dali, Japanese art references:

vinteuil_phrase, swann_brevity, swinburne_dune, sotto_voce, sumi, oboe, surreal_dali, bergotte_lantern, dunwich_bell, miraculo_the_bard, hayate

Eight of these have zero code references in 514K lines of TypeScript. They're delivered to every client, cached locally, and no code reads them. Phantom flags — present in configuration, absent from code.
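The phantom-flag check is reproducible with a single scan: which configured flag names never appear anywhere in the concatenated source? The flag list is from the leak; the function is my sketch.

```typescript
// Flag names delivered in the GrowthBook config (from the leaked source).
const FLAGS = [
  "vinteuil_phrase", "swann_brevity", "swinburne_dune", "sotto_voce",
  "sumi", "oboe", "surreal_dali", "bergotte_lantern", "dunwich_bell",
  "miraculo_the_bard", "hayate",
];

// Return the flags with zero references in the given source text.
function phantomFlags(source: string): string[] {
  return FLAGS.filter((f) => !source.includes(f));
}
```

Concatenate all 514K lines into one string, call `phantomFlags`, and the claim is either confirmed or refuted.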

GCS Buckets and an Odd Detail

The model codenames from the source (capybara, fennec, numbat, tengu) and phantom flag names (hayate) all exist as Google Cloud Storage buckets. Most return 403. But hayate returns 200 OK with an empty listing — and object access returns "billing account closed" rather than "not found." The hayate and fennec buckets share the same closed billing state (same GCP project).

One more oddity: vinteuil_phrase has an ASCII character sum of exactly 1618 — the golden ratio. It's the only canonically correct Proustian name that produces this number (vinteuil_sonata = 1621, vinteuil_theme = 1506). A GCS bucket named 1618 also exists, returning 403 with "billing absent." If there's a connection between the phantom flag and the bucket, it hasn't been found yet.
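The ASCII-sum claim is one line of arithmetic, checkable by anyone:

```typescript
// Sum of the character codes of a string.
const asciiSum = (s: string) =>
  s.split("").reduce((sum, ch) => sum + ch.charCodeAt(0), 0);

console.log(asciiSum("vinteuil_phrase")); // 1618
console.log(asciiSum("vinteuil_sonata")); // 1621
console.log(asciiSum("vinteuil_theme"));  // 1506
```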

Is This Anthropic's Cicada Moment?

Two leaks in five days. A // Want to see the unminified source? We're hiring! comment embedded in Claude Code builds since at least v1.0.100 (September 2025). Phantom flags with literary names and mathematical properties that no code reads. A code fingerprint that doesn't match any known baseline. GCS buckets named after internal codenames.

Whether there's a puzzle here or not, nobody else has run this analysis, and the data is either meaningful or it isn't. It's all verifiable. The ASCII sums are arithmetic. The predictability scores are reproducible. The phantom flags are in the source for anyone to check.

If there's something here, it shouldn't sit on one person's hard drive.


Analysis conducted April 1, 2026. The math doesn't change tomorrow.

Full comparison data and predictability results available on request. Analysis conducted using Curia, a proprietary evidence-accumulating prediction system.
