DEV Community: Alexey Pelykh

Claude Code retries rate-limit errors for API keys, not for your Max plan

Alexey Pelykh — Mon, 22 Jun 2026 14:19:59 +0000

A long agent run on a Max plan, and a single 429 ends it. Not pauses it. Ends it. The turn dies, the work in flight is gone, and there is no backoff, no retry, no countdown. The same error, the exact same HTTP 429, on an API key would have been retried silently and you would never have seen it.

I wanted to know why, so I decompiled the Claude Code binary - version 2.1.179 - and read the retry path. The answer is one branch, and it is keyed on how you pay.

One branch, keyed on how you pay

Claude Code wraps every model request in a retry classifier. Decompiled, the rate-limit decision reduces to a single line:

if (status === 429) return !kq() || GsH()

kq() and GsH() are minified. The same branch with readable names for what each gate checks:

if (status === 429) return !isConsumerSubscription() || isEnterprise()

isConsumerSubscription() is true when you authenticate with a consumer subscription - your OAuth token carries the user:inference scope that Pro and Max logins get. isEnterprise() is true only for Enterprise accounts. Read it out:

API key - isConsumerSubscription() false - 429 retryable.
Enterprise - isEnterprise() true - 429 retryable.
Pro / Max - isConsumerSubscription() true, isEnterprise() false - !true || false is false - 429 not retryable.

When the classifier returns non-retryable, the request throws. On a subscription, that throw is your session ending.

It gets sharper. The 529 "overloaded" error, Anthropic's at-capacity signal, is retried for everyone with no tier check at all - status >= 500 returns retryable unconditionally. So the split is not "subscriptions get less retry." It is specifically the 429 rate-limit error, and only for the flat-rate tiers. And the official Anthropic SDK retries 429s by default, two attempts with exponential backoff. Claude Code turns that off - it passes maxRetries: 0 to the SDK client and reimplements retry with the subscription branch bolted on. This is not SDK behavior leaking through. It is a decision in the client.

The defense, and where it breaks

The branch has a real defense, and it deserves to be made well. A 429 on a subscription is often a usage-window limit - you have burned your 5-hour quota, and it will not reset for minutes to hours. Retrying that in-process is pointless. Worse, a client that hammers its own quota ceiling is misbehaving, and a vendor is right not to ship that. If every Max client retried usage-window 429s on a tight loop, the limit would stop meaning anything. All of it is true. For the window-quota case, not retrying is the correct call.

Then look at what the branch actually does. It does not distinguish the two kinds of 429. Anthropic's limits come in two shapes: the long usage window, and the short per-minute request and token ceilings. The per-minute one clears in seconds. It is the textbook case for backoff - wait two seconds, retry, succeed. The classifier treats both the same, because status === 429 is status === 429. The sound reasoning about the window-quota case ends up justifying an exception that also kills the trivially-retryable per-minute case - the exact case where backoff is the right answer, and the one every other tier gets for free.

That is what does not survive the steelman. The metered tiers retry the per-minute 429 and recover. The flat-rate tier takes it as a dead session.

Mechanism, not motive

I am showing you a mechanism, not a motive. I read a branch in a binary. I cannot tell you from that whether this is deliberate tiering, a conservative default that overreaches, or an oversight nobody revisited, and anyone claiming to know which from the same evidence is guessing. I will be blunt about why I am careful here: I published a teardown of this same vendor on June 15 and corrected it the next day, because I asserted intent the evidence did not carry. Once is a lesson. Twice is a pattern I would rather not start.

What the evidence does carry is the effect. Today, in this client, the flat-rate tier runs on a less forgiving error path than the metered one. By design or by neglect, the people paying a fixed monthly price absorb failures the people paying per token do not.

Reliability isn't on the pricing page

Reliability is part of what you are buying, and it is not on the pricing page. You choose a tooling tier on price and quota, and you assume the failure handling underneath is the same - that only the limits move. Here is a concrete case where that assumption breaks. The retry behavior itself changes with your billing relationship, silently, in a branch you will never see unless you go decompile it.

If you are building agentic workflows on a subscription tier and trusting them to ride out transient limits, that trust is load-bearing and untested. A staff engineer who picks Max for the team because the quota math works has also, without knowing it, picked the tier that turns a recoverable two-second blip into a failed run.

This is one binary version, 2.1.179, June 2026, and a future release can rewrite that branch and retire this finding - pin the version when you cite it. The subscription-versus-metered split also has at least one legitimate driver, the window-quota case above, so do not read a conspiracy into a branch. Read it as what it is: a place where your tooling's reliability quietly depends on how you pay, and nobody told you.

Check it yourself

Do not take my word for it - watch it on your own tier. Every API response carries an anthropic-ratelimit-unified-status header, allowed or rejected, and the client emits tengu_api_retry telemetry every time a retry fires. Run a workload until you hit a 429 and look: does the client back off and recover, or does the turn die? On a metered key it retries. On Max, watch what happens. The whole finding is reproducible in your own logs.

I will pin the bet. This is v2.1.179. If a later release retries 429s for subscriptions too, good - that is the right fix, and you will have watched it land in your own telemetry instead of trusting a teardown. Until then, the tier that costs a flat rate is the tier that gives up first.

The 'Own Hardware for AI' Myth

Alexey Pelykh — Mon, 15 Jun 2026 12:54:28 +0000

Update (June 16, 2026): Anthropic announced the June 15 repricing described below, then postponed it on the day - subscription limits unchanged, no credit to claim. The teardown's argument is unaffected - it rests on the measured workload, not on that change. I've reframed the affected lines below to past/conditional; when the repricing actually lands, I'll publish what it does to the bill.

In the last two weeks of April, my AI tooling consumed 17.7 billion tokens. At Anthropic's API list prices, that fortnight of compute is worth a hair over $15,000 - call it $30,000 a month at the same pace. I paid $800.

Every "stop renting intelligence, buy your own GPUs" thread I read falls apart on contact with this workload. So I ran the numbers properly: my measured usage profile against GPU rental prices, on-premise TCO, and the best open-weight models available in June 2026. The hardware loses, and not by a little. The reasons it loses are more interesting than the price tag.

The workload

All figures below are measured from complete local session logs, April 16-30, 2026. Personal projects only, employer work excluded. Monthly numbers are that pace doubled - and the weeks after this window ran hotter, not cooler, so the extrapolation is the conservative read. Pricing basis: Anthropic list as of June 2026 ($5/$25 per million input/output tokens, $6.25 cache writes, $0.50 cache reads). External prices and benchmarks throughout are as of June 12, 2026.

Metric	Measured (15 days)
Sessions	1,886 (126/day)
User-side messages (incl. tool results)	121,343
API requests	76,546
Total tokens (input + output + cache)	17.7B
Cache hit rate	95.8%
Peak context in a single session	964K tokens
Peak concurrent sessions	15
Active time	4,251 of 21,600 minutes (19.7%)
Model mix	96% Opus 4.7 (the current model that fortnight)
API-equivalent cost	$15,027.63

About a third of that compute went into RemoteClaw, the AI agent middleware I maintain. The rest spread across 50+ repos: open-source libraries like PCRE4J and lhremote, infrastructure, automation, writing, even household paperwork.

This is interactive agentic work. Claude Code sessions doing architecture, implementation, review, in parallel, on demand. Not batch jobs. That shape matters more than the volume, and the whole argument below comes down to it.

What $30K actually means

The headline number is the part everyone gets wrong in both directions.

$30,055 a month is not what Anthropic charges me. It's what the same tokens would bill at API list prices: the value of compute consumed. The actual bill is four Claude Max 20x subscriptions at $200 each, because the usage limits on one account can't hold this volume. $800 a month, a 97.3% discount against list. That was April's terms; Anthropic announced a change to part of this for June 15, then postponed it - I'll get to it.

The token mix is the first clue about what this workload actually is. It's prefill-dominated: for every token the model writes, it reads 216. And 95.8% of those reads hit Anthropic's prefix cache, billed at $0.50 per million tokens instead of the $5.00 input rate. Without the cache discount, the same traffic would bill roughly $90K for the fortnight - six times the headline, around $180K a month.

Hold that thought. Prefix caching itself is a config flag in vLLM or SGLang, and a single tenant re-reading his own session prefixes is the easy case - the hit rate is in the workload, not in some Anthropic magic. What the $30K figure nets out is the discount for it. What self-hosting has to replicate is keeping that cache alive: across restarts, across eviction, across whatever capacity holds it. That thread gets pulled at wall two.

Wall one: context

My sessions peak at 964K tokens of context. Not as a stunt - as a Tuesday. Long agentic sessions accumulate repository state, tool outputs, conversation history, and the model has to keep reasoning over all of it.

Until this spring, the answer was blunt: no open-weight model served that depth for real coding work. DeepSeek V3 topped out at 160K native context. Qwen3's "1M" was YaRN extrapolation, with no independent validation past 512K. Llama 4's "10M" was a theoretical RoPE limit on a model that scores 21% on SWE-bench.

Then the wall moved. DeepSeek V4 shipped in preview on April 24: open weights, MIT license, native 1M context on a new attention stack. On paper, the strongest argument against self-hosting is gone.

In practice, "on paper" is doing heavy lifting. Independent evaluation exists now, and it cuts both ways. NIST's CAISI tested V4 Pro in May and put it roughly eight months behind the frontier - GPT-5-class, behind the Opus 4.6 in their test set. And the model's own showcase numbers tell the depth story: on 8-needle MRCR retrieval it holds above 0.82 through 256K of context, then drops to 0.59 at 1M. That's retrieval, the easy part. What happens to coding quality at the depths my sessions live at, nobody has published at all. The wall went from "impossible" to "unproven, with measured degradation at depth." That is real progress. It is not something you sign a $300-400K node order against.

Wall two: concurrency

Peak load on this workload is 15 concurrent sessions. Not all fifteen sit at a million tokens at once - but capacity gets provisioned for the envelope you promise yourself, and interactive latency leaves no queue to hide in. For GQA-attention models, the only long-context architecture on offer until this spring, KV-cache at ~1M context runs around 150GB at FP8 per session: an envelope measured in terabytes of hot cache, hardware multiplied an estimated 8-15x, $130-240K a month. (DeepSeek's MLA models ran 4-5x leaner per session - and capped out at 160K context, so they couldn't serve the depth at all.) That was the concurrency wall as it stood in April.

V4's compressed-attention stack moved this wall too. DeepSeek's published figures put V4 Pro at 10% of V3.2's already-lean KV-cache at 1M tokens, and V4-Flash at 7% - which works out to single-digit gigabytes per million-token session. Fifteen sessions now fit in under 100GB of cache. On paper, the memory side of the concurrency wall just fell.

Here's what's still standing. V4 Pro is 1.6T parameters: at FP8 the weights overflow an 8x H200 node, so you're buying two before the first request - or quantizing to 4-bit to squeeze into one, and whether 4-bit holds for agentic coding at depth is one more unproven bet. V4-Flash fits a node outright, but it's the smaller variant of a model line that independent evaluation already puts months behind the frontier. The cache still has to stay alive: prefix caching is free software, but holding weeks of session state through restarts, evictions, and failures is an operations discipline of its own - the tiered KV-cache architectures built for it (SGLang's HiCache, Moonshot's Mooncake) are real and still young. And the honest 3-year single-node TCO, with colocation, power, networking, and 0.25 FTE of ops:

Setup	Monthly TCO	vs $30,055 API-equivalent
8x MI300X (1 node)	~$11,000	63% cheaper
8x H200 (1 node)	~$16,000	47% cheaper
15 streams on GQA-class models (the April math)	$130-240K	4-8x more expensive

Those rows have knobs. Stretch amortization from three years to five and they drop by a quarter; staff the node realistically at 0.5-1.0 FTE instead of 0.25 and they climb right back. No setting changes the shape. A single node undercuts API list prices by half, and that's the row the advice is built on. Against the API. I don't pay API prices. The cheapest node under the friendliest knobs is still north of 10x my actual bill, and it buys the model that's unproven at depth, the cache discipline you own, and the ops you staff.

Wall three: utilization

Active minutes over the measured window: 19.7%. The other 80% of the time the metal would sit idle - nights, weekends, the hours where I think instead of dispatch. Owned GPUs bill 100% of the calendar regardless.

The standard rebuttal is burst rental: spin nodes up when working, down when not. Cold starts still kill it. Loading a frontier-size model takes minutes, and every cold start drops the prefix cache that makes the economics work in the first place. The floor is moving - CUDA checkpoint/restore demos cut cold starts roughly 10x on serverless platforms, and vLLM's sleep mode swaps models in seconds on a node you keep paying for - but nobody ships instant-on for a frontier-size MoE with its cache intact. Pay-per-active-minute at May's H200 spot rates would have been about $1,451 a month: a fantasy number that assumes exactly that.

Spot pricing itself is the quieter lesson. Between early May and mid-June, H200 rentals went from $1.28/GPU/hr to $2.15-4.00 - a 1.7-3x jump in five weeks, while H100 held around $1.50. Owning hardware locks in depreciation; renting it inherits that volatility. Either way your capacity plan carries risk a subscription doesn't.

The desktop boxes deserve a line here too, because they break this wall - from below. Two linked DGX Sparks amortize to about $220 a month: the only configuration in this whole analysis that undercuts my subscription, and at that price, idling is free. What it buys is one stream at 16-41 tokens per second on V4-Flash, the cheap tier - against my fifteen streams at a 226 tok/s decode median on Opus. Even the ~$97K GB300 Station, 784GB of coherent memory on a desk, can't hold the one near-rival open model: V4 Pro's 4-bit weights run ~800GB. Cheap enough to idle means too small to rival. Big enough to rival means too expensive to idle.

The part nobody computes

Line them up:

Path	Monthly cost for this workload
Self-host at 15 streams, GQA-class (the April math)	$130-240K
Anthropic API list prices	~$30K
Single-node self-host (V4-Flash-class, unproven at depth)	$11-16K
DeepSeek V4-Flash API (their cheap tier), same buckets naively repriced	~$0.4-5K
Four Max subscriptions	$800

The subscription wins for one reason: pooling. Anthropic's fleet runs hot because my idle 80% is someone else's active 20%; the concurrency spikes average out across subscribers; the capacity is provisioned once for everyone. Pooled compute is what solved utilization looks like - a subscription is just one packaging of it.

Which is why the row that deserves respect isn't the GPU one. It's DeepSeek's API. Take my exact token buckets and reprice them at V4-Flash rates and the whole workload lands somewhere between $400 and $5K a month, depending on how their cache discount treats it. That's another pool, running the same utilization play, potentially cheaper than my subscription. What it isn't is proven: it's the same quality-at-depth bet as the hardware - a model line measured months behind the frontier, retrieval already sagging at 1M - just with zero capex attached. And that's the point. If open-model quality at depth ever gets proven, the rational move still isn't a GPU order. It's an API key. The hardware loses this comparison in every branch.

And the terms nearly changed. Anthropic announced that on June 15, 2026, it would move programmatic usage - the Agent SDK, headless claude -p runs, CI integrations, third-party apps riding subscription auth - off the flat subscription onto a monthly credit ($200 of it on a Max 20x account) that overflows into standard API rates, with interactive Claude Code staying on subscription limits. Then, on the day, it pulled the change: "we're not making this change today... your subscription limits are unchanged." Postponed, not canceled.

It would have landed on me, not hypothetically. A real share of my workload runs through headless claude -p: I use it as a heavier-duty subagent dispatcher, because a failed headless session can be resumed where an in-session subagent can't. Under the announced terms that share would have drawn from a $200 monthly credit on each account - $800 across the four, not poolable - and billed at API rates past that. For now it still draws from the subscription, exactly as before. When the repricing actually lands, I'll publish what it does to the bill.

Read what that announcement said, postponed or not. Anthropic moved to price by the same variable this whole article measures: utilization shape. Human-paced interactive work would keep the amortized flat rate; always-on automated fleets would get metered. The subscription discount exists precisely because human-shaped workloads idle 80% of the time. The 97.3% is today's measured arbitrage, not a law of nature. Anthropic's own attempt to reprice it, even postponed, made that explicit. And the GPU side repriced up to 3x in five weeks too. There is no path without volatility. There's only the path where volatility is someone else's problem.

One more objection I owe the skeptics: this workload's shape is partly a child of its pricing. At a flat rate, marginal tokens are free, so nothing forced me to compact contexts, prune sessions, or route easy work to a cheaper tier. Meter me and I'll do all three - the 17.7B would shrink, maybe several-fold. But run the shrink. A workload compacted to a third still values at $10K a month against list, still bursts, still idles 80% of the calendar. And a workload re-architected hard enough to live at 128K, steady and saturated, has quietly become the other workload - the one the rubric below already sends to the hardware store. The argument doesn't depend on my excess. It depends on the shape.

When buying GPUs actually wins

The myth isn't that owning hardware is always wrong. It's that the advice generalizes. Self-hosting wins when all of these hold:

Data can't leave. Sovereignty or compliance constraints make the economics secondary. (Concentration is a real cost on my path too: 50+ repos of my work flow through one US vendor that observes the usage shape and prices accordingly.)
Utilization is sustained. Batch pipelines and 24/7 serving that keep GPUs above 60% busy. A team pooling one node gets there honestly - ten engineers' bursts interleave. A solo bursty workload doesn't.
Context stays at or under 128K. Where open models are validated and KV-cache stays cheap.
Concurrency is known and stable. So you can size the cluster once instead of provisioning for peaks.
Open-model quality clears your bar. On June's SWE-bench Verified leaderboard the best open model, DeepSeek-V4-Pro-Max at 80.6%, sits about 8 points behind Opus 4.8 at 88.6% - and controlled same-harness comparisons have run closer. NIST's independent evaluation puts V4 roughly eight months behind the frontier.
Volume justifies it. Self-hosting breaks even against DeepSeek's own API at roughly 30-60B full-rate tokens a month. Count yours before claiming the bar: my 35B/month is 96% cache reads, which an API discounts to near-nothing and a self-hosted box re-serves from RAM either way. My full-rate volume is under 2B. If your workload is agentic, yours probably collapses the same way.
You're training, not just serving. Fine-tuning, distillation, RL on proprietary data produce model assets you own. No subscription sells that, and none of this article's math applies to it - this is an inference-economics piece.

Fit the first six and a single node beats API list prices by half. That's the workload the advice was written for. It's not what agentic development looks like.

Short of betting everything on one pool, routing is the middle path: V4-Flash costs $0.14 per million input tokens against Opus's $5.00, a 36x gap, and the same logic works inside one vendor's ladder - Sonnet-tier models and batch discounts for the work that doesn't need the frontier. Every version of that arbitrage is API-to-API. None of it is API-to-hardware.

Measure before you buy

Three numbers decide this, and they're already in your logs:

Peak context depth. Mine: 964K tokens.
Peak concurrent streams. Mine: 15.
Active-minute utilization. Mine: 19.7%.

Deep, bursty, and idle-heavy: the subscription wins and it isn't close. Shallow, steady, and saturated: buy the node and bank the 50%.

The "own hardware" advice was written for the second workload and gets repeated at people running the first. Pull your numbers before you buy the metal. Mine get a sequel either way: Anthropic is still reworking the repricing that would meter my own fleet, and I'll publish what it does to the bill when it actually lands. And if your measured math comes out different from mine, send me a DM - I want to see the workload where the hardware wins.

What AI Catches That Humans Miss in Code Review - And Vice Versa

Alexey Pelykh — Thu, 16 Apr 2026 06:43:53 +0000

Most debates about AI code review stay theoretical. "AI can't understand context." "AI catches things humans miss." "AI generates slop."

I have 449 data points that move past the speculation.

Between February 24 and March 4, 2026, I ran an AI-assisted code review campaign across 6 OCA (Odoo Community Association) repositories. Every review was independently validated against the actual code diffs by 40 separate AI validators. The result: a detailed picture of where AI excels, where it fails, and where the two approaches complement each other.

Where AI excels

Security surface scanning

The AI caught 6 genuine security vulnerabilities that human reviewers missed. These weren't theoretical concerns. They were in code heading toward production in widely-used open source modules.

Portal sudo bypass (timesheet #857): A controller endpoint called sudo() without restricting access, allowing any portal user to access arbitrary project records. No human reviewer flagged it.

Cross-record token exposure (project #1599): An endpoint with auth=public accepted tokens that could be used to access records belonging to other users. The security surface was non-obvious because the auth decorator looked standard.

getattr traversal (sale-workflow #3664): A review with 19 findings identified an getattr call that could be exploited for attribute traversal. This was part of the AI's strongest single review across the entire campaign.

.sudo(user) migration gap (timesheet #881): During a version migration, a .sudo(user) call wasn't properly converted, leaving an elevation path in the portal layer.

The pattern: AI excels at scanning every code path for security-relevant patterns. Human reviewers tend to focus on the functional logic and skip the security surface, especially on familiar modules.

Catching what multiple human approvers missed

This is the data point that surprised me most. Multiple PRs had been reviewed and approved by experienced human maintainers, and the AI still found bugs they all missed.

PR #3679: Three prior human approvals. The AI found that api.Environment.manage() had been removed in Odoo 16.0, making the migration code reference a non-existent API. Three reviewers signed off on it.

PR #3449: Two prior human approvals missed an or-to-and logic regression. A boolean condition that should have used and was using or, changing the filtering behavior entirely. The AI caught it.

PR #3584: Two prior human approvals missed a return True inside a for loop. Only the first line item was being processed. The rest were silently skipped.

PR #3760: A critical procurement skip bug where _action_launch_stock_rule returns True for the entire batch when any single line is a byproduct. No prior reviewer caught it.

These aren't obscure edge cases. They're logic bugs that change program behavior. The AI found them because it reads every line systematically. Humans skim, especially on large diffs from trusted contributors.

Consistent coverage

The AI reviewed all 449 PRs. Every one got the same level of structural analysis: architecture, test coverage, security, migration patterns, dependency checks.

Human review coverage in these same repositories was uneven. 28% of PRs received zero reviews. 984 were merged without any formal review trail. The AI didn't solve the depth problem, but it eliminated the coverage gap for the PRs it touched.

Contributors noticed. Multiple PR authors pushed fixes directly in response to AI review feedback, confirming the reviews were actionable enough to act on without waiting for a human to weigh in.

Where AI fails

Reading the room

PR #3819: The AI flagged features as "missing" that had been intentionally removed. Months of maintainer discussion in the PR thread had established consensus to remove those features. The AI didn't read the PR discussion. It reviewed the diff in isolation, saw code removed, and flagged it as a regression.

This is AI's single biggest limitation as a reviewer. Code review isn't just about the code. It's about the conversation around the code. Why was this change made? What did the community agree on? What prior attempts were tried and rejected? The AI has no access to that context unless someone feeds it in.

Recommending buggy patterns

Project #1583: The AI recommended "aligning with the purchase module pattern" for a computation method. The purchase module had the actual bug. It was putting price_subtotal in a groupby position, treating it as a group key instead of summing it. Following the AI's advice would have introduced incorrect totals.

This is a different failure mode from hallucination. The AI correctly identified a pattern from another module. The pattern was real. The pattern was also wrong. The AI couldn't evaluate whether the reference implementation was correct because it treated existing code as authoritative.

Fabricating observations on large diffs

Timesheet #748: The AI described a pre_init_hook performance pattern with confidence. No pre_init_hook exists anywhere in the 7,362-line diff. The AI generated a plausible technical description from training knowledge instead of reading what was actually there.

Timesheet #830: The AI claimed "tests pass." Zero tests exist. Codecov was failing. The AI pattern-matched: most modules have tests, so tests probably pass.

HR #1462: All three bug descriptions cite variable names (qty_initial, qty_done) from version 16.0 source code. The PR is an 18.0 migration. The AI reviewed code it had memorized, not code in the diff.

Large diffs are the trigger. When diffs exceed several thousand lines, the AI's attention degrades. It substitutes what it expects to see for what's actually there. The fabrication rate was low (4 claims out of ~2,000, under 0.2%), but each fabrication was confidently stated and would have passed self-assessment.

Rubber-stamping at scale

33 reviews (7.5%) approved PRs with no evidence the diff was read. The worst: a 7,538-line diff that got "LGTM" (Looks Good To Me) and nothing else. A typo was found in that same code 4 months later, proving the code hadn't been read at approval time.

Diff size inversely correlated with review depth. PR #4163, a 3,500-line new module, was approved with no inline comments while ignoring 13 existing substantive review comments from a community reviewer posted a week prior.

The AI treated large migrations as low-risk by default. "Clean migration, CI green, LGTM." Migrations are where the hardest bugs hide.

Inconsistency across identical code

PRs #4135 and #4136 contained identical code (a forward-port pair). The AI flagged float_compare precision concerns on one and approved the other without mentioning it. Same code, different treatment, no explanation.

This inconsistency undermines trust. If the AI's assessment depends on which batch a PR lands in rather than the code itself, the signal is unreliable.

Soft language on blocking issues

At least 5 reviews identified genuine blocking issues but used COMMENTED instead of CHANGES_REQUESTED. A return True inside a loop that breaks all processing? COMMENTED. A missing @api.depends that prevents field updates? COMMENTED.

The AI was calibrated to be polite rather than firm. In code review, soft language on a real blocker means the issue gets ignored. CHANGES_REQUESTED exists for a reason.

The complementary model

The data points to a clear division of labor.

AI first pass: Security surface scanning, style consistency, migration pattern verification, test coverage checks, dependency analysis. These are systematic, pattern-based tasks where coverage matters more than depth. The AI will review every PR, every file, every path. Humans won't.

Human second pass: PR discussion context, domain-specific conventions, evaluating whether referenced patterns are actually correct, judgment calls on architectural trade-offs, deciding if "tests pass" is a fact or an assumption.

The model isn't "AI or human." It's "AI catches the surface that humans skip, then humans add the judgment that AI lacks."

What this looks like in practice

AI reviews every PR for security patterns, style violations, test coverage, migration completeness, and obvious logic bugs.
AI flags findings with evidence but does NOT issue final verdicts on large or context-dependent PRs.
Human reviewers start from the AI's findings instead of a blank diff. They add context, validate or dismiss flags, and make the judgment calls.
AI findings that reference patterns from other modules get verified against those modules before being acted on.
Any AI claim about test status gets verified against actual CI output.

The economics

At 15 minutes per review, the 449-PR campaign represents 112 hours of review work. OCA's top human reviewer does 290 PRs in his best year. The AI campaign did 449 in 9 days.

The question isn't whether AI reviews are as good as human reviews. They're not. The question is whether imperfect AI coverage is better than no coverage. For the 138 PRs where the AI review was the only review the PR ever received, the answer is obvious.

The uncomfortable takeaways

AI finds real bugs that experienced humans miss. Three prior approvals on PR #3679. Two on #3449. Two on #3584. These aren't junior reviewers. These are maintainers who've been reviewing code for years. The AI caught what they missed because it reads every line instead of pattern-matching on familiarity.

AI also generates confident nonsense. A non-existent hook described in detail. "Tests pass" with no tests. Variable names from the wrong version of the codebase. Confidence and correctness are uncorrelated.

The biggest AI failure is social, not technical. Not reading PR discussions. Not engaging with prior reviewer feedback. Not understanding that removed code was removed on purpose. The technical analysis can be excellent while the review is still invalid because it ignored the human context around the code.

Diff size is the reliability boundary. Below a few thousand lines, AI reviews are strong. Above that threshold, rubber-stamps and fabrications spike. Know where the AI's attention breaks down.

Neither AI nor human review is sufficient alone. Humans miss surface-level bugs because they skim. AI misses context-dependent issues because it can't read the room. The combination covers more ground than either approach solo.

The tooling for this complementary model doesn't fully exist yet. But the data from 449 reviews makes the case clearly: the future of code review isn't choosing between AI and human reviewers. It's figuring out the handoff between them.

I Spend $800/Month on AI Coding Tools and I Can't Stop

Alexey Pelykh — Thu, 02 Apr 2026 18:05:16 +0000

I have four Claude Max x20 accounts. That's $800 a month on a single AI coding tool.

Each account gives me a 5-hour rolling window of tokens. I burn through each one in 30 to 45 minutes. Then I'm stranded. Two hours of nothing. So I check the next account. Maybe that one has tokens. Maybe the window rolled over. I tab between dashboards, refreshing, calculating which account resets soonest.

And the whole time, one thought on repeat: someone is shipping right now. Someone else's context window is still open. They're refactoring, generating, merging - and I'm sitting here watching a countdown timer.

This morning, while writing this article, the AI agents I dispatched to research "productivity addiction" all hit their rate limits simultaneously. Ironic? Sure. But the feeling underneath was real. Not frustration at the tool. Anxiety that the clock was ticking and I wasn't producing.

The slot machine you're proud of

AI coding tools run on the same psychological mechanism as a slot machine. Every prompt is a gamble. Will the output nail it in one shot, or hallucinate an API that doesn't exist? You don't know until you see the result.

Neuroscience research calls this a variable reward schedule. Unpredictable rewards generate more sustained dopamine activity than predictable ones. Same mechanism as slots.

But nobody brags about their slot machine sessions. AI coding tools get you congratulated. You post "53K lines in 28 days" and people applaud. The output is real. The productivity is real. I'm not questioning that.

What I'm questioning is what happens in the gaps.

The anxiety layer

The productivity itself isn't the problem. I built qontoctl - 53K lines of TypeScript, full API coverage, 28 days, one person. AI made that possible. That's not a delusion. That's a commit history.

The problem is the feeling that arises when the tool is unavailable. Not "I can't work." I can always work. I can plan, review, think, sketch architecture. The feeling is more specific than that: someone else is producing right now and I'm not.

Psychology Today has a name for this. They distinguish productivity addiction from workaholism. Workaholics are compelled to work. Productivity addicts are compelled by the feeling of completing things. The dopamine hit of output. The checkbox. The commit. The merged PR.

AI tools collapse the effort-to-output ratio so dramatically that the reward cycle accelerates. A refactoring that takes a day becomes an hour. So you do three more. Then the tokens run out. And the anxiety isn't "I can't code." It's "I'm falling behind someone who still has tokens."

The psychologist who coined "flow state" warned about something like this: flow "can become addictive, at which point the self becomes captive of a certain kind of order, and is then unwilling to cope with the ambiguities of life." That was written in 1990. It applies to AI-assisted developers now.

That's a new kind of professional anxiety. It didn't exist two years ago.

What the gap actually looks like

When my tokens run out:

Minutes 0-5: Refresh dashboards. Check other accounts. Calculate resets. Consider whether a fifth account would be excessive. (It would.)

Minutes 5-15: The anxiety peaks. Open Twitter. See someone posting about what they just built with AI. Feel behind.

Minutes 15-30: The anxiety fades. I start thinking about what I was actually building. Not what the next prompt should be. What the architecture should be. Whether the direction was right. Whether I was generating code or generating value.

After 30 minutes: Clarity. The kind of clarity that doesn't happen inside the loop because inside the loop there's always one more thing to prompt.

That shift from "generate" to "think" is the interesting part. It doesn't happen voluntarily. I have never once thought "I should take a break from AI coding and reflect on my architecture." Not once. The session limit forces it.

And here's the silver lining: productivity was never the bottleneck. That's solvable with scale and cash - four accounts prove it. The actual bottleneck is creativity. Decision-making. Choosing what to build, not how fast to build it. And that part only happens when the tokens stop.

The productivity alibi

AI tools don't just make you faster. They make "not fast enough" feel inexcusable.

When a refactoring that used to take a week now takes a day, taking two days feels like failure. When you can generate a full test suite in an hour, spending an afternoon thinking about test strategy feels like procrastination. The bar moves. And it only moves up.

That's the burnout path nobody's mapping. Not "AI will take your job" - that's old news. The new one is "AI will raise the output bar until the humans behind it break." Because the tool doesn't get tired. You do. The tool is available 24/7. You're rate-limited to 5-hour windows. Every idle moment feels like falling behind someone who figured out the fifth account before you did.

The honest question

The real test isn't whether you enjoy AI coding tools. Of course you do. They're incredible.

The test is what happens when you can't use them. When the tokens run out, when the rate limit hits, when the API goes down. What do you feel?

An engaged professional shrugs and switches to planning, reviewing, thinking. Pulls out a notebook. Goes for a walk. Comes back sharper.

I know what I do. That tells me something. I'm not sure it tells me something I want to hear, but pretending otherwise would be dishonest.

Anyway

I'm not going to wrap this up with a tidy lesson about "finding balance" or "being intentional with AI tools." That's not really my style.

What I'm going to do is close this laptop. It's April in the south of France. The Mediterranean is right there. My tokens are spent, my article is written, and nobody is shipping anything that can't wait two hours.

If you need me, my session resets at 6pm.

'AI Slop' vs No Review at All - Which Actually Kills Open Source?

Alexey Pelykh — Wed, 01 Apr 2026 13:47:33 +0000

Two things are true at once:

AI-assisted code reviews have real quality problems.
The alternative most PRs actually face is no review at all.

I ran 449 AI-assisted code reviews on OCA (Odoo Community Association) open source PRs in 9 days. Then I ran rigorous independent validation against the actual code. I also pulled the review statistics across all 10,808 PRs in the same 6 repositories.

Both datasets tell a story. The community's reaction to each tells another.

The AI reviews got called "slop," triggered near-ban discussions, and were shut down within days. The 28% zero-review rate has been running for years. The community debated the former. The latter is just how things are.

Here's the full data on both sides.

The case against AI reviews: every flaw, quantified

The AI reviews had real problems. Independent validation - 40 AI instances reading actual PR diffs, not the review text - covered 440 of the 449 reviews and produced these numbers:

Category	Count	%
Fully valid	303	68.9%
Partially valid	97	22.0%
Rubber-stamp	33	7.5%
Invalid	4	0.9%

68.9% fully valid. 31.1% had issues ranging from "missed something important" to "factually wrong."

The specific failures:

4 fabricated claims out of roughly 2,000 total (<0.2%). One described a code pattern that doesn't exist in a 7,362-line diff. One claimed "tests pass" on a module with zero tests. These are hallucinations.

33 rubber-stamp reviews (7.5%). PRs approved with "LGTM, CI green" and no evidence the diff was read. One approved a 3,500-line new module with no inline comments. Another approved a security-sensitive portal module with three words.

2 harmful suggestions. One was high-severity: the AI recommended following a pattern from another module that itself contained a bug. Following the advice would have introduced incorrect totals.

34 false positives. Things flagged as bugs that weren't. Wrong version conventions applied, code already doing what was suggested, misread diffs.

340+ significant issues missed across 440 reviews. Things the AI should have caught and didn't.

That's the full record. It's real.

The case against no review: the invisible damage

Same 6 repositories, 10,808 PRs total.

Metric	Number	What it means
PRs with zero reviews	3,070 (28%)	No human ever looked at these
Merged without review trail	984 (15% of all merges)	In production with no audit record
Stale, died unreviewed	471	Contributor effort, wasted
Modern-branch PRs closed unreviewed	58%	Majority of closed PRs never got a single review

3,070 PRs received zero reviews. Not a shallow review. Not a rubber-stamp. Nothing.

984 were merged anyway. A maintainer with merge rights presumably read the code, but left nothing on record. No feedback for the contributor. No searchable review history. No audit trail. If something breaks, there's no reviewer to trace, no review to learn from.

471 went stale and were closed by bots. Contributors submitted work, waited weeks or months, got silence, and watched a stale bot sweep their effort into the archive. On modern branches (16.0+), 58% of closed PRs died this way.

At the community-estimated 15 minutes per review, reviewing those 471 stale PRs would have taken roughly 118 hours. About 3 work weeks spread over years. Nobody found the time.

Side by side

Dimension	AI reviews (449 PRs)	No review (3,070 PRs)
Valid feedback delivered	303 fully valid + 97 partial	0
Security issues caught	6+ genuine findings	0
Bugs found that humans missed	Multiple across 440 reviews	0
Fabricated claims	4 (<0.2%)	N/A
Harmful suggestions	2 (1 high-severity)	N/A
Rubber-stamps	33 (7.5%)	N/A
Contributor feedback	Present, with issues	Absent
Audit trail	Present, with issues	Absent
Community response	"Stop"	Silence

The AI reviews had a fabrication rate under 0.2%. The no-review path has a coverage rate of 0%. One of those numbers got a community thread. The other didn't.

The risk calculus

What each failure mode actually costs:

A fabricated claim in an AI review (happened 4 times in ~2,000 claims): a human reviewer or the PR author sees the claim, recognizes it's wrong, and ignores it. The PR continues. The cost is noise and wasted attention.

A harmful suggestion (happened twice): if followed, introduces a bug. But the suggestion goes through normal review. It's a recommendation, not a merge. A maintainer can reject it. The cost is real but gated by human review downstream.

A rubber-stamp approval (happened 33 times): a false signal that the code was reviewed. This is genuinely dangerous. If a maintainer treats the AI approval as sufficient and merges without reading the code, real bugs ship. The mitigation: AI reviews shouldn't be counted as formal approvals. They're input, not decisions.

A PR merged without any review (happened 984 times): no signal at all. No feedback. No record. If bugs exist, they ship without anyone having a documented chance to catch them. No mitigation exists because no review happened.

A PR that dies unreviewed (happened 471 times): contributor time wasted. Future contributions from that person become less likely. The community shrinks by one potential contributor. Multiply by 471.

The AI failure modes are visible, quantifiable, and bounded. The no-review failure mode is invisible, unquantified, and compounding.

What the community chose

The AI reviews triggered discussion within days. Multiple community members flagged the campaign. The term "AI slop" was applied. I was asked to stop. A near-ban discussion followed.

The feedback about notification volume was legitimate. The quality concerns were legitimate. I published the data showing every flaw.

The 28% zero-review rate has been running for years. The 471 stale PRs accumulated gradually. The 984 no-trail merges happened one at a time. There's no comparable urgency.

This isn't specific to OCA. It's a pattern in how communities process risk. Visible, novel disruptions trigger immune responses. Invisible, chronic problems don't. The immune system targets the new threat, not the ongoing one.

The real security findings

While we're comparing risk: the AI reviews found 6+ genuine security vulnerabilities in code heading toward production.

Portal sudo bypass in a timesheet module
Cross-record token exposure on a public auth endpoint
getattr traversal in sale-workflow
Unfiltered portal properties in a project module
.sudo(user) migration gaps in security-sensitive code

The AI also caught bugs that multiple human reviewers missed. On one PR, two prior human approvals missed a return True inside a loop. On another, two prior approvals missed an or-to-and logic regression. On a third, three prior human approvals missed issues the AI flagged.

These findings came from the same 69%-valid review set that was labeled "slop." The 6 security catches and the 4 fabricated claims exist in the same dataset.

AI reviews aren't good enough

They're not. 69% validity is not where you want to be. 33 rubber-stamps are unacceptable. 2 harmful suggestions are too many.

But "good enough" depends on the comparison. Against a thorough human review by a domain expert, AI reviews lose badly. Against nothing, the calculus changes.

For 138 PRs where my AI review was the only review the PR ever received, the alternative wasn't a better review. It was no review at all. For those PRs, even a partially valid review with missed issues provides more value than the silence they were otherwise going to get.

The question

Open source has a reviewer scarcity crisis. OCA's most prolific reviewer has done 2,197 unique PR reviews across 9.5 years. That's exceptional, sustained effort over a decade. The review backlog still grows faster than volunteers can clear it.

The question isn't whether AI reviews are perfect. They're not. The question is whether a community can afford to reject imperfect coverage when the alternative is no coverage at all.

28% of PRs get nothing. What's the plan for them?

I Ran 449 AI Code Reviews in 9 Days. Then I Almost Got Banned.

Alexey Pelykh — Fri, 27 Mar 2026 20:32:47 +0000

The OCA (Odoo Community Association) has a quiet crisis. Across 6 repositories I tracked, 28% of all pull requests - 3,070 out of 10,808 - received zero reviews. Ever.

984 PRs were merged without any formal review trail. 471 went stale and were closed by a bot, unreviewed. 58% of closed PRs on modern branches died without a single human looking at them.

Nobody panicked about this. It was just how things were.

I decided to do something about it. Between February 24 and March 4, 2026, I ran an AI-assisted review campaign: 449 unique PRs reviewed across 6 OCA repositories in 9 days.

For scale: OCA's most prolific reviewer, pedrobaeza, has reviewed 2,197 unique PRs over 9.5 years. His best year was 290 PRs in 2025, which works out to 0.79 PRs per day. The AI campaign ran at 49.9 PRs per day. That's 63x his best-ever daily pace. To match the campaign's 9-day output at his best-year rate would take roughly 19 months.

At the community-estimated 15 minutes per review, the campaign represented 112 hours of review work. 2.8 full work weeks compressed into 9 calendar days.

Why I didn't ask permission

There was no formal AI policy to comply with. An LLM guidelines thread had been open on the OCA contributors mailing list since September 2025. Six months later, still no policy. Meanwhile, 471 PRs sat rotting in the queue, contributors' work ignored until a stale bot swept it away.

I had the tooling. I had Odoo domain knowledge from years of contributing. The gap was quantified and obvious. I filled it.

How the pipeline worked

The setup was straightforward. Claude Code read each PR's diff via the GitHub API, analyzed the code changes against Odoo framework conventions, and posted structured reviews. I reviewed the pipeline output and iterated on the prompts as quality patterns emerged.

The campaign covered sale-workflow (261 PRs), project (53), hr (46), bank-statement-import (39), timesheet (38), and web (3). Sale-workflow dominated because it had the largest unreviewed backlog.

The results, honestly

I didn't trust self-assessment. When I first had the AI evaluate its own review quality, it came back at 98.6% valid. That number was garbage.

So I ran a second round: 40 independent validator instances, each reading actual PR diffs via gh pr diff and verifying every technical claim against the code. The corrected number: 68.9% fully valid. Including partially valid reviews where some claims were correct but significant issues were missed: 90.9%.

The 30-point gap between self-assessment and independent validation is itself a finding worth its own post. But here's the quality breakdown:

Verdict	Count	%
Fully valid	303	68.9%
Partially valid	97	22.0%
Rubber-stamp	33	7.5%
Invalid	4	0.9%

34 false positives across 440 validated reviews. 4 fabricated claims out of roughly 2,000 total claims, a rate under 0.2%. And 2 harmful suggestions where following the advice would have made the code worse. One was high-severity: the AI recommended following a pattern from another module that itself had a bug.

What the AI actually caught

6 genuine security findings. Portal sudo bypass. Cross-record token exposure on a public auth endpoint. getattr traversal. These weren't theoretical. They were in code heading toward production.

The AI consistently found bugs that prior human reviewers missed. On PR #3584, two prior approvals missed a return True inside a loop. On #3449, two prior approvals missed an or-to-and logic regression. On #3679, three prior approvals missed issues the AI flagged.

What the AI got wrong

4 reviews were outright invalid. One reviewed the wrong version of the code entirely, describing 16.0 bugs on an 18.0 migration PR with variable names that didn't exist in the diff. One approved a fix that a maintainer corrected 15 minutes later. One flagged features as "missing" that had been intentionally removed per months of community consensus it hadn't read.

The rubber-stamp rate was 7.5%. These were PRs approved with no evidence the diff was actually read. Some were large migrations that got "clean migration, CI green, LGTM" and nothing else.

Quality improved over time, from 34% fully valid on early repos to 87% on mid-campaign work, then degraded back to 70% in later batches. Volume fatigue is real, even for AI pipelines.

138 PRs nobody else ever reviewed

This is the number that matters most to me: on 138 of the PRs I reviewed, the AI review was the only review the PR ever received. 30% of my reviews were the first and only external eyes on that code.

Some of those PRs had been sitting for months. Some for over a year. Contributors submitted work, waited, heard nothing, and eventually watched a stale bot close their effort.

The AI review wasn't perfect. But it was something. For those 138 PRs, the alternative wasn't a better review. The alternative was no review at all.

The community response

"Stop."

Stefan Rijnhart flagged the campaign as "flooding PRs with non-contextual reviews." Tom Blauwendraat asked me to stop until policy was established. Akim Juillerat called the reviews "AI slop" after receiving 10+ notifications. Denis Roussel reported continued "flooding."

They weren't entirely wrong. The notification volume was real. Some reviews were shallow. The rubber-stamps were a legitimate quality problem.

But nobody had data on any of this before I ran the experiment. The "AI slop" label was applied before anyone measured whether the reviews were valid. When I measured, 69% were fully valid. 91% had at least some legitimate value. And 138 PRs got their first-ever review.

What I learned

Volume and quality are in tension. The campaign started at 34% quality and climbed to 87% as the pipeline improved. Then it degraded back to 70% as I pushed volume. The optimal pace is slower than what's technically possible.

Self-assessment is worthless. 98.6% vs 68.9%. A 30-point gap. If you're using AI for anything consequential and not running independent validation, you're flying blind.

Communities punish action more than inaction. 28% of PRs getting zero reviews? Acceptable. 471 contributions dying unreviewed? Normal. AI reviews with a 69% validity rate filling the gap? Stop immediately. The asymmetry is real.

The coverage crisis is the actual problem. The debate about AI review quality is important but secondary. The primary crisis is that open source communities don't have enough reviewers. Period. The 28% zero-review rate didn't start when AI showed up. It was there the whole time. Nobody was panicking about it.

Permission isn't always available. There was no policy to comply with. There was no process to request permission through. There was only a gap and a community thread six months into discussion with no resolution. Sometimes you act and deal with the consequences.

The full quality audit, reviewer landscape analysis, and community discussion context are documented in detail. I'll be publishing the validation methodology, security findings, and lessons for AI-augmented teams in follow-up posts.

I ran an unauthorized experiment. The results weren't perfect. They were real. And for 138 PRs that had never gotten a single review, they were the only thing that happened.

53K Lines, 28 Days, $1,600: The Real Numbers Behind QontoCtl 1.0

Alexey Pelykh — Thu, 26 Mar 2026 14:17:52 +0000

The Headline Numbers

28 days ago, QontoCtl didn't exist. Today it's at 1.0.0 with full Qonto banking API coverage. Here's the codebase:

Metric	Value
TypeScript lines of code	52,713
Source files	501
Total lines (incl. blanks, comments)	64,497
Publishable packages	5

I ran scc on the repo. It estimates development cost using the COCOMO model, which factors in lines of code, complexity, and industry benchmarks for team-based development:

COCOMO Estimate	Value
Cost to develop	$1,923,212
Schedule effort	17.63 months
People required	9.69

What it actually took:

Actual	Value
Cost	$1,597 in API costs
Time	28 days
People	1

These aren't projections. The cost is from Claude Code's usage tracking. The timeline is from git history. First commit: February 26, 2026.

What Was Built

The 0.1 release covered the basics: organizations, accounts, transactions, statements, labels, memberships. 10 MCP tools. Read-only access via API keys.

1.0.0 covers everything Qonto exposes:

Account management - create, update, close, IBAN certificates
SEPA beneficiaries - add, update, trust/untrust
Transfers - SEPA (with cancel, proof, verify-payee), internal, bulk, recurring, international
Client invoicing - full lifecycle from create through finalize, send, mark-paid, cancel
Supplier invoices - list, view, bulk-create
Quotes and credit notes
Cards and insurance
Payment links and webhooks
Attachments - upload, link to transactions
Teams and membership management
E-invoicing settings
International currencies and eligibility

All of this through OAuth 2.0 with PKCE. Strong Customer Authentication handling for every write operation. Idempotency keys for safe retries. 69 MCP tools total. Same operations available as CLI commands and MCP tools, backed by one shared core library.

How the Sausage Was Made

306 Claude Code sessions over 28 days. Here's the raw usage data:

Metric	Value
Sessions	306
User turns	37,004
API requests	23,399
Tool calls	9,113
Total tokens	2.47 billion
Estimated cost	$1,597

That's roughly 11 sessions per day, about 1,300 turns per day.

Here's the part that changes the economics: I didn't manually design the architecture for this project. I have a library of Claude Code skills - reusable configuration files that encode how I approach software architecture. Monorepo structure, package boundaries, test strategy, API design patterns, security handling - these are captured as skills that Claude applies automatically when the project context matches.

The skills are the IP. Not this project's architecture specifically, but the architectural judgment encoded in a format that compounds across every project. When I start a new integration project, those skills kick in. The monorepo structure with Turborepo, the shared core library with thin CLI and MCP layers, the OAuth flow design, the Zod schemas for runtime validation - none of that was designed from scratch for QontoCtl. It was applied from patterns I've refined across dozens of projects.

Each new Qonto API domain followed a pattern: read the API docs, design the service layer, implement CLI commands, implement MCP tools, write tests. Once the pattern was established for one domain, Claude applied it across dozens more. Consistent internal patterns multiply AI productivity the same way they multiply human productivity, just faster.

The gap between "decide what to build" and "it exists and is tested" is where this setup changes the math. And the skills library is what makes it repeatable.

The Multiplier Nobody Measures

Here's what COCOMO doesn't account for: third-party API quality.

Qonto's API team built an API that's clean, consistent, and well-documented across every domain. Same patterns everywhere. Predictable naming. Coherent pagination. Consistent error handling.

This matters more than people realize for AI-augmented development. When patterns are consistent, the AI learns them once and applies them across dozens of domains correctly. When documentation is accurate, the implementation matches the spec on the first pass. When error handling follows conventions, you don't spend cycles debugging inconsistencies.

I've built integrations against APIs with inconsistent naming, undocumented edge cases, pagination that works differently per endpoint. That friction multiplies with AI tooling - every inconsistency becomes a correction cycle instead of a generation cycle.

Qonto's API had none of that. Full coverage in 28 days was possible because the API was built right.

What This Data Shows (And Doesn't)

One data point isn't a trend. Here's what I think is defensible:

What the data supports:

A single experienced architect with a reusable skills library can ship production-grade software at a pace that would have required a team 18 months ago
The economics of "is this worth building?" change when implementation cost drops by two orders of magnitude
API/SDK-shaped projects - well-defined interfaces, systematic patterns, comprehensive test coverage - are particularly well-suited to AI-augmented development
The real IP isn't in the code. It's in the skills that generate the code. Those compound across projects
Third-party API quality is a force multiplier that compounds with AI tooling

What it doesn't:

COCOMO models team dynamics, communication overhead, organizational friction. A single expert would never hit $1.9M even without AI. The honest baseline for a solo expert is probably $150K-200K worth of effort - still a 100:1 ratio against $1,600, just not 1,200:1
This doesn't generalize to all software. Integration projects with well-defined external contracts are the best case. Novel algorithm design, ambiguous requirements, or coordination-heavy systems would show different ratios
The human is not optional. The skills library encodes architectural judgment built over years. AI multiplies that judgment. It doesn't replace it

The question this data raises isn't "will AI replace developers?" That framing misses the point. The question is: what becomes worth building when the implementation cost drops by two orders of magnitude? What tools, integrations, and products were previously "not worth the engineering effort" that now make sense?

QontoCtl exists because the answer to "should someone build a full CLI and MCP server for Qonto's API?" changed from "it would take a team months" to "I can do this in February."

Try It

QontoCtl is open source under AGPL-3.0. CLI tool or MCP server usage carries no license obligations.

alexey-pelykh / qontoctl

CLI and MCP server for the Qonto banking API

CLI and MCP server for the Qonto banking API.

This project is brought to you by Alexey Pelykh.

What It Does

QontoCtl lets AI assistants (Claude, etc.) interact with Qonto through the Model Context Protocol. It can:

Organizations — retrieve organization details and settings
Accounts — list, create, update, close bank accounts; download IBAN certificates
Transactions — list, search, filter bank transactions; manage transaction attachments
Bank Statements — list, view, and download bank statements
Labels — manage transaction labels and categories
Memberships — view team members, show current membership, invite new members
SEPA Beneficiaries — list, add, update, trust/untrust SEPA beneficiaries
SEPA Transfers — list, create, cancel transfers; download proofs; verify payees
Internal Transfers — create transfers between accounts in the same organization
Bulk Transfers — list and view bulk transfer batches
Recurring Transfers — list and view recurring transfers
Clients — list, create, update, delete clients
Client…

View on GitHub

P.S. The $1,597 is the pay-as-you-go API estimate. I'm on Claude Max x20 ($200/month), and this project consumed roughly 30% of it. Actual out-of-pocket: ~$60.

The RSS Illusion: 63 GB Process on a 32 GB Machine

Alexey Pelykh — Wed, 18 Mar 2026 14:40:00 +0000

macOS displayed "Apps out of memory - iTerm2: 63.89 GB" on my 32 GB machine.

iTerm2 is my terminal. It doesn't do anything that should consume 64 GB. So I went looking.

The real culprit

Every Claude Code session runs as a child process of iTerm2. macOS attributes all descendant memory to the parent application. That's why the dialog blamed iTerm2.

I had 37 iTerm tabs open, each with a Claude Code session. Most were idle. Finished conversations I never closed.

ps aux reported 4.1 GB total RSS across all 95 Claude processes. The macOS footprint tool reported 62.7 GB.

A 15x gap.

How RSS misleads on macOS

RSS (Resident Set Size) counts pages physically resident in RAM. When macOS compresses or swaps dirty pages, RSS drops. The process appears to shrink.

But footprint tracks dirty pages regardless of compression state. Those pages are still attributed to the process. They still count against system memory pressure. Activity Monitor and the "out of memory" dialog use footprint, not RSS.

The result: a process can show 7 MB RSS while holding 1.3 GB of dirty, non-reclaimable memory. RSS doesn't just undercount. It creates a dangerous illusion. The process looks like it's using less memory over time while actually consuming more.

If you're monitoring macOS workloads with ps or anything RSS-based, you're flying blind.

Decomposing the footprint

Using footprint -p <pid>, each Claude Code process breaks down into memory categories. The pattern across sessions of different ages tells the story:

Session Age	WebKit malloc	IOAccelerator	Total Footprint
Fresh (3 hrs)	343 MB (77%)	48 MB (11%)	443 MB
8 days idle	231 MB (23%)	711 MB (71%)	996 MB
15 days idle	265 MB (20%)	968 MB (73%)	1,324 MB

IOAccelerator starts small and grows to dominate. Every allocation is marked dirty and non-reclaimable. macOS cannot free this memory without killing the process.

128 MB slabs that never get freed

vmmap reveals the IOAccelerator memory is structured as 128 MB slabs:

Address Range                    VSIZE    RSDNT  DIRTY   SWAP
54db4000000-54dbc000000  128.0M  1440K  1440K  95.7M   ← oldest slab
54dbc000000-54dc4000000  128.0M     0K     0K  95.3M
54dc4000000-54dcc000000  128.0M     0K     0K 126.3M
...
54dfc000000-54e04000000  128.0M     0K     0K    80K   ← newest slab
(reserved)                768.0M     0K     0K     0K   ← pre-allocated

The oldest slabs fill to 95-128 MB. When one fills, a new one is allocated. They are never freed or reused. A reserved block pre-allocates VM address space for future growth.

After 15 days idle: 10 slabs, 966 MB swapped, 1.4 MB resident. Peak footprint for this single session hit 2.8 GB.

Where does the GPU stack come from?

Claude Code is built on Bun, which uses JavaScriptCore from WebKit. It renders its TUI using Ink, a React-based terminal rendering framework.

Despite being a terminal REPL that outputs ANSI escape codes, the process loads a full GPU stack:

Metal.framework
MetalPerformanceShaders.framework (MPSNeuralNetwork, MPSNDArray, MPSImage)
IOAccelerator.framework
IOSurface.framework
GPUWrangler.framework
GPUCompiler.framework
WebCore.framework

No explicit Metal API calls exist in the binary. So where does this come from?

My first hypothesis was JSC/WebKit's rendering infrastructure. Testing disproved it. Loading JSC and WebKit directly via dlopen() in a C test program produced zero IOAccelerator allocations and zero GPU frameworks. Standalone Bun also loads zero Metal or GPU frameworks.

The GPU framework stack is loaded specifically by Claude Code. Something in its dependency tree triggers it. What's proven: it's not JSC and it's not Bun's baseline runtime. The exact dependency remains unidentified.

Isolation testing

To narrow the cause, I ran control tests at multiple layers:

Test	IOAccel Slabs	IOAccel Dirty	Footprint
C + dlopen(JSC) + eval	0	0	3.7 MB
C + dlopen(WebKit)	0	0	2.0 MB
Bun idle (sleep)	1	1.3 MB	6.5 MB
Bun + heavy JSON parsing	1	1.4 MB	6.7 MB
Bun + HTTP streaming (20 req)	1 (2 regions)	2.3 MB	13 MB
Claude Code (3 hrs active)	4	46 MB	443 MB
Claude Code (15 days idle)	10	966 MB	1,324 MB

Two layers emerged.

Layer 1 - Bun baseline: Bun allocates a 128 MB IOAccelerator slab on startup. JSC alone (via dlopen) doesn't. This is Bun-specific, small, and fixed. No Metal or GPU frameworks are loaded. I tested 12 JSC/Bun environment variables and flags. None affected the allocation.

Layer 2 - Claude Code growth: Claude Code loads the full Metal/GPU framework stack and grows from 1 slab to 10+ over its lifetime. HTTP streaming in standalone Bun caused growth from 1 to 2 IOAccelerator regions in 20 seconds, suggesting sustained network I/O is a contributor. Claude Code streams API responses for hours, which would amplify this.

Running leaks on the 15-day idle session reported 175,613 leaked objects totaling 13.5 MB in the standard malloc zone alone. The WebKit malloc zone was unreadable due to security restrictions. The actual leak count is likely much higher.

The session's file descriptors were all revoked. No GPU device handles remained open. The IOAccelerator memory was orphaned. Buffers allocated with no active GPU connections.

What this means

For monitoring: If you monitor macOS workloads using RSS, you can get a 15x underestimate for long-running processes with IOAccelerator allocations. Use footprint or kern.memorystatus_level instead.

# Real memory cost per process
footprint -p <pid>

# System memory pressure as percentage
sysctl -n kern.memorystatus_level

For runtime selection: Bun allocates IOAccelerator-tagged memory on startup that JSC alone doesn't. It's small at baseline, but Claude Code shows what happens when a large application runs on top for hours: the allocation grows to nearly 1 GB and is never reclaimed. If your Bun application does sustained network I/O, monitor with footprint, not RSS.

For Claude Code users: Close idle sessions. Each one accumulates ~1 GB of non-reclaimable footprint after a few hours of active use. If macOS reports "out of memory" for your terminal, check for accumulated Claude processes.

What's still open: Which Claude Code dependency loads the Metal/GPU framework stack that standalone Bun doesn't? Is the slab growth driven by sustained network streaming, terminal rendering, or both? These questions are tracked at oven-sh/bun#28234 and anthropics/claude-code#35804. Corrections to the original reports have been issued.

We Thought Our AI Reviews Were 98.6% Valid. Independent Validation Said 69%.

Alexey Pelykh — Tue, 17 Mar 2026 12:28:25 +0000

The most dangerous thing about AI-augmented work isn't the errors. It's thinking you're not making them.

I ran 449 AI-assisted code reviews on OCA (Odoo Community Association) open source PRs in 9 days. When I had the AI assess its own review quality, it said 98.6% valid. When I ran independent validation, the number dropped to 68.9%. The validation used 40 separate AI instances, each reading the actual code diffs and verifying every technical claim.

That 30-point gap should concern anyone using AI for serious work.

The experiment

Between February 24 and March 4, 2026, I reviewed 449 unique pull requests across 6 OCA repositories using AI-assisted workflows. Each PR got a full technical review: architecture assessment, bug identification, security analysis, test coverage evaluation. The output was structured code review comments posted directly to GitHub.

For scale: OCA's most prolific human reviewer has done 2,197 unique PR reviews over 9.5 years. My campaign produced 449 in 9 days.

The reviews weren't rubber-stamps either. They found real security vulnerabilities (portal sudo bypass, cross-record token exposure, getattr traversal), caught bugs that multiple human reviewers missed, and provided actionable technical feedback that PR authors implemented.

But how good were they really?

Round 1: The self-assessment trap

My first attempt at validation was obvious: have AI evaluate the reviews. I fed each review to an evaluator and asked "Is this review technically valid?"

Result: 98.6% valid.

This number is worthless.

The evaluator was reading the review text - not the actual code. It was checking whether the review sounded plausible, not whether the claims matched reality. A review that confidently describes a pre_init_hook performance pattern scores well on plausibility. The fact that no pre_init_hook exists anywhere in the 7,362-line diff? The evaluator had no way to know.

This is the fundamental problem with self-assessment. AI evaluating AI-generated text is pattern-matching for coherence, not verifying truth. It's the equivalent of grading your own exam by checking whether your handwriting is neat.

I discarded the entire Round 1 dataset.

Round 2: Independent validation against actual code

Round 2 used a different approach. I dispatched 40 independent AI instances (I call them "subclauds"), each assigned to a single PR. Each one:

Retrieved the actual PR diff using gh pr diff
Read every technical claim in the review
Independently verified each claim against the real code
Classified the review as VALID, PARTIALLY VALID, RUBBER-STAMP, or INVALID - with evidence

The key difference: validators had the ground truth. They weren't evaluating whether the review sounded right. They were checking whether each claim matched the code.

The results

Category	Count	%
Fully valid	303	68.9%
Partially valid	97	22.0%
Rubber-stamp	33	7.5%
Invalid	4	0.9%

68.9% fully valid. Combined with partially valid: 90.9%.

Not terrible. But 30 points below what self-assessment reported.

What "partially valid" means

Most partially valid reviews had genuinely correct observations but missed important issues in the diff. A review might correctly identify three concerns but miss a critical fourth one. The feedback it gave was real - it just wasn't complete.

What "rubber-stamp" means

33 reviews (7.5%) approved PRs without evidence of reading the diff. These are the reviews that said "LGTM, CI green" on a 3,500-line new module with no inline comments. One approved a security-sensitive portal module with zero tests and gave it three words.

What "invalid" means

Four reviews were factually wrong at their core:

One described bugs from version 16.0 in an 18.0 migration review. The variable names it cited don't exist in the diff.
One approved a fix that a maintainer corrected 15 minutes later.
One flagged features as "missing" that were intentionally removed per months of community discussion.
One requested changes for a state value that already exists correctly in the code.

The fabrication problem

Out of roughly 2,000 total claims across 440 validated PRs, 4 were fabricated. Less than 0.2%.

But each fabrication is instructive:

Phantom pattern: Described a pre_init_hook performance pattern with confidence. No pre_init_hook exists anywhere in the 7,362-line diff. The AI generated a plausible Odoo code pattern from training knowledge rather than reading the actual code.
Phantom tests: Claimed "tests pass" on a module with zero tests. Codecov was failing. The AI assumed tests exist because most modules have them.
Wrong version: All three bug descriptions cite variable names and code patterns from version 16.0 source code, not the 18.0 migration diff under review. The AI was analyzing code it had memorized from training, not code in the PR.
Invisible tests: Claimed "module doesn't include any tests" when a 187-line test file with 6 test methods exists in the PR. The AI missed a file it should have read.

The common thread: every fabrication stems from the AI substituting pattern-matched expectations for actual observation. It "knows" what Odoo modules typically look like and fills in the blanks rather than reading what's actually there.

The quality curve

Quality wasn't uniform. It improved over time, then degraded with volume.

Period	Valid rate
Timesheet (early)	34%
HR	50%
Bank-statement-import	59%
Project	70%
Sale-workflow (early batches)	87%
Sale-workflow (late batches)	70%

The jump from 34% to 87% shows genuine learning - prompts improved, edge cases were handled, failure modes were addressed. The regression from 87% to 70% shows volume fatigue. The same degradation pattern that affects human reviewers doing batch work.

Why this matters beyond code review

The 30-point validation gap isn't specific to code review. It's a structural problem with any AI-assisted workflow where:

The output looks plausible. Well-written text passes surface-level scrutiny.
Self-assessment is circular. AI checking AI text measures coherence, not correctness.
Ground truth verification requires extra work. Actually checking claims against reality takes effort most people skip.

If you're using AI for research, writing, analysis, or decision support, the same gap likely exists. You just haven't measured it yet.

How to validate your own AI output

The methodology is reusable:

Separate the evaluator from the generator. Don't ask the same model to grade its own output.
Give the evaluator ground truth. The evaluator must have access to the source material, not just the AI's output.
Require evidence for every claim. Each verification should quote specific evidence from the source.
Use categorical classification with clear definitions. Valid / Partially Valid / Rubber-stamp / Invalid gives you actionable data.
Run at scale. A few spot checks won't reveal systemic patterns. I validated 440 reviews to see the quality curve.

The cost of this validation was a fraction of the cost of generating the reviews. The cost of NOT validating? Thinking you're at 98.6% when you're at 68.9%.

The bottom line

Self-assessed AI quality is a vanity metric. If you're measuring your AI workflow by asking "does this look right?" you're overestimating quality by 20-30 points.

Validate against ground truth, not against the AI's own output. The gap you find will be uncomfortable. That discomfort is the point.

The Missing Category in the AI Agent Landscape

Alexey Pelykh — Tue, 10 Mar 2026 15:28:55 +0000

There are over 100 projects that will build you an AI agent. You can get one in TypeScript, Rust, Python, Go, Zig, or Shell. You can run it on a Raspberry Pi or a Kubernetes cluster. You can talk to it through Telegram, WhatsApp, Slack, Discord, WeChat, or 15 other channels.

But if you already HAVE an agent -- a Claude Code setup with custom skills and CLAUDE.md, a tuned Gemini CLI workflow, a Codex integration your team depends on -- and you want to message it from your phone? The options are a handful of single-channel scripts and a lot of empty space.

I spent months mapping this landscape: 115+ projects across 10 categories, from full rewrites (5 language ports of OpenClaw alone) to managed hosting services to single-file Telegram bridges. What emerged is a gap I'm calling "agent middleware" -- and I built a project to fill it.

Two Kinds of Users

There are two fundamentally different people evaluating AI agent tools right now:

         Builds Agent Logic              Bridges Existing Agents
              <---------------------------------------------->
              |                                              |
  Many        |  OpenClaw, NanoClaw        RemoteClaw        |
  Channels    |  AstrBot, CoPaw           cc-connect         |
              |  LangBot, PocketPaw                          |
              |                                              |
  Few/No      |  Nanobot, ZeroClaw        TinyClaw           |
  Channels    |  IronClaw, MicroClaw      claude-pipe        |
              |  Moltis, OpenFang         Claude-Code-Remote |

The left side is a crowded, well-served market. OpenClaw alone has 250K+ stars and 36,900 forks. NanoClaw offers the same idea in 15 source files. At least five Rust rewrites compete for the "same thing, but faster" niche.

The right side is almost empty.

This is the developer who already has Claude Code configured with a custom ~/.claude directory, or a Gemini CLI setup they've spent weeks tuning, or a Codex workflow integrated into their team's process. They do not want a new agent. They want to send a message to the agent they already have -- from their phone, from a Slack channel, from WhatsApp.

Where does your setup fall on this spectrum? Bookmark the full landscape reference for the complete data.

The Fork Explosion

In February 2026, OpenClaw was forking at 100 per hour. By March, the ecosystem had produced five complete language rewrites (Rust, Go, Python, Zig, Shell), a dozen managed hosting services, and over 60 forks with meaningful modifications.

The fork explosion was not about OpenClaw being bad. It was about OpenClaw being almost-right for too many different use cases. Every fork adjusts the same core product for a different audience: lighter, more secure, Chinese-market-native, edge-deployable, enterprise-ready.

But almost every fork keeps the same fundamental architecture: a platform that owns the agent loop, runs its own LLM orchestration, and bundles everything from memory to skills to model management.

The Missing Category

We call this gap agent middleware: software that connects existing AI agents to messaging channels without owning the agent loop.

The boundary test for agent middleware is simple: does it route through infrastructure, or does it try to be the agent?

Agent Middleware	Agent Platform
Bridges to your CLI agent	Runs its own LLM calls
Preserves your agent's config	Requires its own configuration
Adds channels, sessions, scheduling	Adds memory, skills, model management
Your `~/.claude` is the agent	Its built-in orchestrator is the agent

This is not a quality judgment. Platforms like OpenClaw, NanoClaw, and Nanobot are excellent at what they do. The distinction is architectural: they own the agent loop, agent middleware does not.

CLI agents ship new capabilities monthly. A platform that bundles its own versions of those capabilities is building on quicksand. OpenClaw's 294,000 lines of code and 5,300+ open issues are the natural result. NanoClaw and Nanobot exist because the full platform became too heavy.

Middleware only provides what a CLI agent cannot provide for itself: sessions, channel routing, scheduling, and gateway services. Everything else is the agent's job.

The Convergence Evidence

Multiple independent developers arrived at the same conclusion:

Project	Runtime	Channels	Notes
claude-code-telegram	Claude Code	Telegram	SDK + CLI fallback, cron
ccbot	Claude Code	Telegram	tmux-based
claude-pipe	Claude Code	Telegram + Discord	~1,000 lines
Claude-Code-Remote	Claude, Gemini, Cursor	Email, Discord, Telegram	Multi-runtime
cc-connect	Claude, Gemini, Codex, Cursor	8 channels	Cron, voice

cc-connect bridges four CLI runtimes to 8 messaging channels with cron scheduling and voice support. Same multi-runtime, multi-channel concept, implemented as a lightweight bridge.

LangBot is the closest thing to production middleware from the Chinese ecosystem: 11+ messaging platforms, integrations with Dify, Coze, n8n, and other agent runtimes. Pure bridge, no agent logic.

Claude-to-IM-skill bridges Claude Code and Codex to Telegram, Discord, and Feishu simultaneously, with persistent sessions and a permission system.

When 10 developers independently build the same Telegram bridge without knowing about each other, that is not a trend. It is a product category announcing itself.

What Agent Middleware Actually Does

If middleware does not own the agent loop, what does it provide?

Capability	What It Does	Why a CLI Agent Cannot Do This
Sessions	Maps Telegram conversations to persistent agent sessions	CLI agent does not know about Telegram sessions
Channel routing	Routes WhatsApp and Slack messages to the same agent	CLI agent assumes a terminal
Scheduling	"Analyze revenue at 8am, post to Slack"	CLI agent cannot trigger itself
Gateway services	Auth, rate limiting, tool access policies	CLI agent has no network layer

These capabilities are infrastructure-bound. They only make sense when there is a system between the user and the agent. The moment you want to access your agent from your phone, you need all of them.

# What the setup looks like
npm install -g remoteclaw
remoteclaw init --channel telegram --runtime claude
remoteclaw start

The Landscape

Here is a simplified map of how the ecosystem divides. This is not exhaustive -- the full reference (linked below) covers 115+ projects across 10 categories.

Mobile remote control apps (Happy Coder, CloudCLI) solve the "remote access" need through native apps rather than messaging. They compete for the same user but through a different channel.

Bot frameworks (Botpress, Rasa, Chatwoot) connect to messaging channels but own the conversation logic. They are platforms, not middleware.

Agent orchestration frameworks (LangGraph, CrewAI, AutoGen) build multi-agent systems but do not provide messaging channel integration. They are infrastructure for agent logic, not for message delivery.

If you are building a single-channel bridge for Claude Code, check if RemoteClaw already supports your channel.

Why I Built RemoteClaw

I built RemoteClaw because I spent months inside the OpenClaw codebase -- analyzing 5,605 files across 334 analysis batches -- and realized that the channel infrastructure was exactly what developers with existing agents needed, but the platform layer was exactly what they did not.

RemoteClaw is a fork of OpenClaw that strips the platform layer and replaces it with an AgentRuntime interface. Your CLI agent runs as a subprocess, preserving your configuration untouched. The gateway handles sessions, channels, and 50 MCP tools. The agent handles everything else.

It is middleware, not a platform. It connects the agent you already have.

Full Reference

The complete landscape data -- 115+ projects, 10 categories, channel coverage comparison, and architecture classification -- is available on our documentation site:

Agent Middleware Landscape Reference

This map covers 115+ projects across 10 categories. We are certain we missed some. If you find a project we missed or a description that needs correction, please open an issue.

Will the "right side" of this map fill up in 2026, or will platforms absorb the middleware function? I have a strong opinion. What's yours?

RemoteClaw is open-source middleware that bridges CLI AI agents to 22+ messaging channels. GitHub | Quick Start | Documentation

From AI-Augmented Human to Human-Augmented AI

Alexey Pelykh — Thu, 05 Mar 2026 14:45:29 +0000

Sometime in late 2025, the relationship between software engineers and AI inverted.

I can't pinpoint the exact moment. But looking at my workflow across ten active projects - Java libraries, LinkedIn automation tools, Odoo modules, browser extensions - the pattern is clear. I stopped using AI to help me write code. AI writes the code. I specify what to build, review what comes back, and steer when it drifts.

The terminology is catching up. Andrej Karpathy declared "vibe coding" passé in February 2026 and promoted "agentic engineering" - where "you are not writing the code directly 99% of the time." Nicholas Zakas mapped a three-stage progression: Coder to Conductor to Orchestrator. Researchers at ArXiv formalized it as "Software Engineering 3.0," analyzing 456,000 AI-authored pull requests across 61,000 repositories.

Different labels. Same observation: the human moved from doing the work with AI assistance to overseeing AI doing the work.

But here's what nobody is saying clearly enough: most of the industry hasn't made this transition. Many haven't entered any AI era at all.

Three Eras, One Industry

Cross-referencing data from Jellyfish, Bain, Stack Overflow, and McKinsey, the software industry is operating in three distinct modes simultaneously.

Era 0: Pre-GenAI (~30-35% of organizations). These companies have AI tool licenses. Their developers have Copilot seats. Nothing has changed. Bain calls it "rollout without adoption" - tools deployed, workflows unchanged. Three of four companies say the hardest part isn't the technology. It's getting people to change how they work.

The engineers at these companies write code the same way they did in 2022. The AI subscription shows up on the expense report. The AI doesn't show up in the workflow.

Era 1: AI-Augmented Human (~50-55%). This is where most AI-adopting organizations sit. Individual developers use Copilot, Cursor, or ChatGPT as smarter autocomplete. They get 10-15% productivity gains at the individual level. They still write the code. AI helps.

The problem: the coding bottleneck moves, but nothing else changes. Review processes, testing infrastructure, security scanning, deployment workflows - all pre-AI. Faster code generation creates bottlenecks everywhere downstream. One Fortune 50 analysis showed a 10x increase in security findings per month after widespread AI adoption - more code hitting the pipeline meant more surface area.

The typical symptom: "Our developers are faster but we're not shipping faster."

Era 2: Human-Augmented AI (~10-15%). This is the inversion. AI is the primary producer across the delivery chain. Humans focus on specification, architecture, steering, review, and judgment.

The Sanity engineering team documented this in detail: AI writes 80% of initial implementations. The first attempt is "95% garbage." By the third iteration, the output is workable. Features ship 2-3x faster overall. Rakuten tested it on a 12.5 million line codebase - Claude Code completed a feature implementation in 7 hours of autonomous work with 99.9% accuracy. Zero human code contribution during execution.

These organizations redesigned their entire delivery chain around AI. Not just the coding step. Everything downstream too. The maturity timeline: 18-24 months of compounding investment to get here.

The Evidence Is Messy (On Purpose)

The data supporting this shift exists. So does data complicating it. Both deserve honest treatment.

The case for the inversion. GitHub Copilot generates 46% of code for its users (61% for Java). Google reports 25%+ of new code is AI-generated. Microsoft says 20-30%. Nearly half of all code written in 2025 was AI-generated. By raw volume, AI is the primary producer in adopting organizations.

The case for skepticism. In early 2025, METR ran a randomized controlled trial and found experienced open-source developers were 19% slower with AI tools. Not faster. Slower. Those developers believed they were 20% faster - a perception-reality gap of 39 percentage points.

But that study has a sequel. When METR tried to replicate it in late 2025, 30-50% of developers refused to submit tasks they didn't want to do without AI. Returning participants from the original study showed an 18% speedup. METR's own February 2026 assessment: developers are "likely more sped up from AI tools now" than in early 2025. The original finding was a snapshot of early-2025 tools on familiar codebases. The reversal itself is evidence of how fast the shift happened.

Code quality concerns remain real regardless. CodeRabbit's analysis of 470 PRs found AI-generated code had 1.7x more issues, with performance problems at roughly 8x the rate. GitClear analyzed 211 million changed lines: refactoring collapsed from 24% to 9.5%, code duplication rose eightfold.

Trust is declining while adoption surges. Stack Overflow's 2025 survey: 84% of developers use AI tools, but trust in accuracy dropped from 40% to 29%. Only 3% report high trust. Forty-six percent actively distrust.

The reconciliation. The productivity trajectory is upward, but the quality and trust problems are structural - they don't disappear with better models. The METR reversal shows developers getting faster. The CodeRabbit and GitClear data show the code getting worse. Both are true simultaneously.

The real picture: AI is a genuine capability amplifier for bounded tasks. It is simultaneously a quality degrader, a security risk (Fortune 50 data showed a 10x vulnerability spike), and a perception distorter. These things are all true at the same time.

The organizations in Era 2 aren't ignoring these problems. They're building systems to manage them. The organizations in Era 0 and Era 1 aren't managing them because they don't know they have them.

What the Human Actually Does Now

In the Era 2 workflow, the human's job changes fundamentally.

Specification. Writing detailed prompts, specs, and context documents. This is where most of the value gets created. A vague specification produces garbage output regardless of the model.

Architecture. System design, technology selection, integration patterns. AI can implement a pattern. It can't choose the right one for your business context.

Steering. Redirecting when AI drifts, constraining the solution space. The Sanity team's experience makes sense: the first attempt is 95% garbage not because the AI is bad, but because iterative refinement with human judgment is the workflow.

Review. Evaluating AI output for correctness, security, and maintainability. This is the new bottleneck. Organizations that treat review as a cost center are accumulating technical debt they can't see yet.

Context provisioning. Building CLAUDE.md files, providing codebase context, configuring tools. MIT Technology Review called this "context engineering" - the discipline that replaced "prompt engineering" in 2025.

Judgment. Edge cases, trade-offs, business logic. The things that require understanding the business, not just the code.

What humans are not doing: writing boilerplate, implementing known patterns, generating test scaffolding, routine refactoring. These tasks made up a significant portion of a developer's day. They're delegated now.

This is an identity crisis for many developers. GitHub frames it as moving from "code producer to creative director of code." Sixty-five percent expect their role to be redefined in 2026. If your career identity is tied to writing code, being told your value is in what you specify rather than what you type requires a fundamental rethink.

What This Means for Engineering Leaders

Three things matter if you're a CTO or VP Engineering.

Know which era you're actually in. Not which era you think you're in. The METR perception-reality gap applies to organizations, not just individuals. If your developers have AI tools but your delivery metrics haven't changed, you're in Era 0 regardless of how many Copilot licenses you're paying for.

The coding step is only 25-35% of development time. Bain's analysis: concept to launch includes requirements, design, implementation, testing, deployment, and maintenance. Even a 50% improvement in the coding step translates to 12-17% faster delivery. The organizations seeing 25-30% overall gains redesigned the full chain, not just the coding step.

The junior pipeline is breaking. Employment for software developers aged 22-25 fell nearly 20% from the 2022 peak. Fifty-four percent of engineering leaders plan to hire fewer juniors. This creates a time bomb: the senior engineers of 2030 need to be hired as juniors in 2026. The organizations figuring out AI-accelerated junior development - 18 months to mid-level instead of three years - will have a structural advantage. Those that simply stop hiring juniors are borrowing from a future they haven't thought through.

The Honest Position

The shift from AI-augmented human to human-augmented AI is real. It is also incomplete, unevenly distributed, and complicated by quality and security trade-offs that most organizations aren't measuring.

Calling it a paradigm shift is accurate for the 10-15% in Era 2. For the majority, it's an unrealized possibility sitting unused behind a subscription login.

The most productive framing isn't "AI is replacing developers" or "AI is just a tool." It's recognizing that the relationship changed - and that the organizations and individuals who understand the new terms are pulling ahead of those who don't.

The gap is widening. Not because the technology demands it. Because the people who adapted first are setting the standard everyone else will be measured against.