<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Alexey Pelykh</title>
    <description>The latest articles on DEV Community by Alexey Pelykh (@alexey-pelykh).</description>
    <link>https://dev.to/alexey-pelykh</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3802620%2F75d663e0-2705-4bc2-b61c-beba0ccca265.jpg</url>
      <title>DEV Community: Alexey Pelykh</title>
      <link>https://dev.to/alexey-pelykh</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/alexey-pelykh"/>
    <language>en</language>
    <item>
      <title>What AI Catches That Humans Miss in Code Review - And Vice Versa</title>
      <dc:creator>Alexey Pelykh</dc:creator>
      <pubDate>Thu, 16 Apr 2026 06:43:53 +0000</pubDate>
      <link>https://dev.to/alexey-pelykh/what-ai-catches-that-humans-miss-in-code-review-and-vice-versa-45ne</link>
      <guid>https://dev.to/alexey-pelykh/what-ai-catches-that-humans-miss-in-code-review-and-vice-versa-45ne</guid>
      <description>&lt;p&gt;Most debates about AI code review stay theoretical. "AI can't understand context." "AI catches things humans miss." "AI generates slop."&lt;/p&gt;

&lt;p&gt;I have 449 data points that move past the speculation.&lt;/p&gt;

&lt;p&gt;Between February 24 and March 4, 2026, I ran an AI-assisted code review campaign across 6 OCA (Odoo Community Association) repositories. Every review was independently validated against the actual code diffs by 40 separate AI validators. The result: a detailed picture of where AI excels, where it fails, and where the two approaches complement each other.&lt;/p&gt;

&lt;h2&gt;Where AI excels&lt;/h2&gt;

&lt;h3&gt;Security surface scanning&lt;/h3&gt;

&lt;p&gt;The AI caught 6 genuine security vulnerabilities that human reviewers missed. These weren't theoretical concerns. They were in code heading toward production in widely-used open source modules.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Portal sudo bypass&lt;/strong&gt; (timesheet #857): A controller endpoint called &lt;code&gt;sudo()&lt;/code&gt; without restricting access, allowing any portal user to access arbitrary project records. No human reviewer flagged it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cross-record token exposure&lt;/strong&gt; (project #1599): An endpoint with &lt;code&gt;auth=public&lt;/code&gt; accepted tokens that could be used to access records belonging to other users. The security surface was non-obvious because the auth decorator looked standard.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;getattr traversal&lt;/strong&gt; (sale-workflow #3664): A review with 19 findings identified a &lt;code&gt;getattr&lt;/code&gt; call that could be exploited for attribute traversal. This was part of the AI's strongest single review across the entire campaign.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;.sudo(user)&lt;/code&gt; migration gap&lt;/strong&gt; (timesheet #881): During a version migration, a &lt;code&gt;.sudo(user)&lt;/code&gt; call wasn't properly converted, leaving an elevation path in the portal layer.&lt;/p&gt;

&lt;p&gt;The pattern: AI excels at scanning every code path for security-relevant patterns. Human reviewers tend to focus on the functional logic and skip the security surface, especially on familiar modules.&lt;/p&gt;
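&lt;p&gt;Reduced to a framework-free sketch (the record store, user names, and function names here are invented for illustration, not taken from the PRs), the sudo-bypass shape looks like this: elevate privileges, then return the record without ever checking ownership.&lt;/p&gt;

```python
# Minimal illustration of the sudo-bypass shape: privilege elevation
# with no ownership check afterwards. RECORDS stands in for the ORM;
# everything here is hypothetical.
RECORDS = {
    1: {"owner": "alice", "data": "alice's project"},
    2: {"owner": "bob", "data": "bob's project"},
}

def fetch_record_unsafe(record_id, requesting_user):
    # The bug: the read happens with elevated rights (think sudo()),
    # and requesting_user is never consulted, so any portal user can
    # read any record.
    record = RECORDS[record_id]
    return record["data"]

def fetch_record_safe(record_id, requesting_user):
    record = RECORDS[record_id]
    # The fix: elevation may be needed to read the record at all,
    # but access must still be restricted to the requester's own data.
    if record["owner"] != requesting_user:
        raise PermissionError("access denied")
    return record["data"]
```

&lt;p&gt;The fix isn't to avoid elevation entirely; it's to pair every elevated read with an explicit access check against the requesting user.&lt;/p&gt;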

&lt;h3&gt;Catching what multiple human approvers missed&lt;/h3&gt;

&lt;p&gt;This is the data point that surprised me most. Multiple PRs had been reviewed and approved by experienced human maintainers, and the AI still found bugs they all missed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PR #3679&lt;/strong&gt;: Three prior human approvals. The AI found that &lt;code&gt;api.Environment.manage()&lt;/code&gt; had been removed in Odoo 16.0, making the migration code reference a non-existent API. Three reviewers signed off on it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PR #3449&lt;/strong&gt;: Two prior human approvals missed an &lt;code&gt;or&lt;/code&gt;-to-&lt;code&gt;and&lt;/code&gt; logic regression. A boolean condition that should have used &lt;code&gt;and&lt;/code&gt; was using &lt;code&gt;or&lt;/code&gt;, changing the filtering behavior entirely. The AI caught it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PR #3584&lt;/strong&gt;: Two prior human approvals missed a &lt;code&gt;return True&lt;/code&gt; inside a &lt;code&gt;for&lt;/code&gt; loop. Only the first line item was being processed. The rest were silently skipped.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PR #3760&lt;/strong&gt;: A critical procurement skip bug where &lt;code&gt;_action_launch_stock_rule&lt;/code&gt; returns &lt;code&gt;True&lt;/code&gt; for the entire batch when any single line is a byproduct. No prior reviewer caught it.&lt;/p&gt;

&lt;p&gt;These aren't obscure edge cases. They're logic bugs that change program behavior. The AI found them because it reads every line systematically. Humans skim, especially on large diffs from trusted contributors.&lt;/p&gt;
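&lt;p&gt;Stripped of the Odoo specifics, two of those bugs reduce to a few lines of Python (the field names and data here are hypothetical, not from the PRs):&lt;/p&gt;

```python
# Two of the missed-bug shapes, reduced to standalone Python.
# Field names (qty, invoiced) are invented for illustration.

lines = [
    {"qty": 0, "invoiced": False},
    {"qty": 5, "invoiced": True},
    {"qty": 3, "invoiced": False},
]

# 1. The or-to-and regression (the PR #3449 shape): the filter widens.
def active_uninvoiced_and(items):
    # intended: keep lines that have a quantity AND are not yet invoiced
    return [l for l in items if l["qty"] != 0 and not l["invoiced"]]

def active_uninvoiced_or(items):
    # the regression: either condition alone keeps the line
    return [l for l in items if l["qty"] != 0 or not l["invoiced"]]

# 2. The return-inside-a-loop bug (the PR #3584 shape): only the
# first item is processed; the rest are silently skipped.
def process_all_buggy(items, sink):
    for item in items:
        sink.append(item)
        return True  # bug: exits after the first iteration

def process_all_fixed(items, sink):
    for item in items:
        sink.append(item)
    return True  # correct: return only after the loop completes
```

&lt;p&gt;Both versions run without errors and both return something truthy, which is exactly why these bugs survive casual review: nothing crashes, the behavior just quietly changes.&lt;/p&gt;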

&lt;h3&gt;Consistent coverage&lt;/h3&gt;

&lt;p&gt;The AI reviewed all 449 PRs. Every one got the same level of structural analysis: architecture, test coverage, security, migration patterns, dependency checks.&lt;/p&gt;

&lt;p&gt;Human review coverage in these same repositories was uneven. 28% of PRs received zero reviews. 984 were merged without any formal review trail. The AI didn't solve the depth problem, but it eliminated the coverage gap for the PRs it touched.&lt;/p&gt;

&lt;p&gt;Contributors noticed. Multiple PR authors pushed fixes directly in response to AI review feedback, confirming the reviews were actionable enough to act on without waiting for a human to weigh in.&lt;/p&gt;

&lt;h2&gt;Where AI fails&lt;/h2&gt;

&lt;h3&gt;Reading the room&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;PR #3819&lt;/strong&gt;: The AI flagged features as "missing" that had been intentionally removed. Months of maintainer discussion in the PR thread had established consensus to remove those features. The AI didn't read the PR discussion. It reviewed the diff in isolation, saw code removed, and flagged it as a regression.&lt;/p&gt;

&lt;p&gt;This is AI's single biggest limitation as a reviewer. Code review isn't just about the code. It's about the conversation around the code. Why was this change made? What did the community agree on? What prior attempts were tried and rejected? The AI has no access to that context unless someone feeds it in.&lt;/p&gt;

&lt;h3&gt;Recommending buggy patterns&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Project #1583&lt;/strong&gt;: The AI recommended "aligning with the purchase module pattern" for a computation method. The purchase module had the actual bug. It was putting &lt;code&gt;price_subtotal&lt;/code&gt; in a &lt;code&gt;groupby&lt;/code&gt; position, treating it as a group key instead of summing it. Following the AI's advice would have introduced incorrect totals.&lt;/p&gt;

&lt;p&gt;This is a different failure mode from hallucination. The AI correctly identified a pattern from another module. The pattern was real. The pattern was also wrong. The AI couldn't evaluate whether the reference implementation was correct because it treated existing code as authoritative.&lt;/p&gt;
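&lt;p&gt;The shape of that reference bug, in a standalone sketch (the data and function names are invented for illustration): when the subtotal is part of the group key, two identical lines collapse into one bucket instead of being added together.&lt;/p&gt;

```python
from collections import defaultdict

# Hypothetical invoice lines; the two identical chair lines must
# BOTH count toward the total.
invoice_lines = [
    {"product": "chair", "price_subtotal": 10.0},
    {"product": "chair", "price_subtotal": 10.0},
    {"product": "desk", "price_subtotal": 25.0},
]

def total_buggy(lines):
    # The reference-module shape: price_subtotal used as part of the
    # group KEY, so identical subtotals collapse into one bucket
    # instead of being summed.
    groups = {}
    for l in lines:
        groups[(l["product"], l["price_subtotal"])] = l["price_subtotal"]
    return sum(groups.values())

def total_fixed(lines):
    # Correct: group by product only and SUM the subtotal.
    groups = defaultdict(float)
    for l in lines:
        groups[l["product"]] += l["price_subtotal"]
    return sum(groups.values())
```

&lt;p&gt;On this data the buggy version undercounts by exactly one chair line, the kind of error that only shows up when duplicate amounts occur, which is why the pattern survived in the reference module.&lt;/p&gt;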

&lt;h3&gt;Fabricating observations on large diffs&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Timesheet #748&lt;/strong&gt;: The AI described a &lt;code&gt;pre_init_hook&lt;/code&gt; performance pattern with confidence. No &lt;code&gt;pre_init_hook&lt;/code&gt; exists anywhere in the 7,362-line diff. The AI generated a plausible technical description from training knowledge instead of reading what was actually there.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Timesheet #830&lt;/strong&gt;: The AI claimed "tests pass." Zero tests exist. Codecov was failing. The AI pattern-matched: most modules have tests, so tests probably pass.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;HR #1462&lt;/strong&gt;: All three bug descriptions cite variable names (&lt;code&gt;qty_initial&lt;/code&gt;, &lt;code&gt;qty_done&lt;/code&gt;) from version 16.0 source code. The PR is an 18.0 migration. The AI reviewed code it had memorized, not code in the diff.&lt;/p&gt;

&lt;p&gt;Large diffs are the trigger. When diffs exceed several thousand lines, the AI's attention degrades. It substitutes what it expects to see for what's actually there. The fabrication rate was low (4 claims out of ~2,000, under 0.2%), but each fabrication was confidently stated and would have passed self-assessment.&lt;/p&gt;

&lt;h3&gt;Rubber-stamping at scale&lt;/h3&gt;

&lt;p&gt;33 reviews (7.5%) approved PRs with no evidence the diff was read. The worst: a 7,538-line diff that got "LGTM" (Looks Good To Me) and nothing else. A typo was found in that same code 4 months later, proving the code hadn't been read at approval time.&lt;/p&gt;

&lt;p&gt;Diff size correlated inversely with review depth. PR #4163, a 3,500-line new module, was approved with no inline comments, ignoring 13 substantive review comments a community reviewer had posted a week earlier.&lt;/p&gt;

&lt;p&gt;The AI treated large migrations as low-risk by default. "Clean migration, CI green, LGTM." Migrations are where the hardest bugs hide.&lt;/p&gt;

&lt;h3&gt;Inconsistency across identical code&lt;/h3&gt;

&lt;p&gt;PRs #4135 and #4136 contained identical code (a forward-port pair). The AI flagged &lt;code&gt;float_compare&lt;/code&gt; precision concerns on one and approved the other without mentioning it. Same code, different treatment, no explanation.&lt;/p&gt;
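&lt;p&gt;For context, precision-aware comparison is the right instinct to flag: naive equality on floats breaks once rounding error accumulates. A standalone stand-in for the idea (this is an illustrative sketch, not Odoo's actual &lt;code&gt;float_compare&lt;/code&gt; implementation):&lt;/p&gt;

```python
import math

# Illustrative stand-in for precision-aware comparison in the spirit
# of Odoo's float_compare: returns -1, 0, or 1 after rounding the
# difference to a given precision.
def float_compare(value1, value2, precision_digits=2):
    delta = round(value1 - value2, precision_digits)
    if delta == 0:
        return 0
    return int(math.copysign(1, delta))

# Why it matters: naive equality fails on accumulated float error.
naive_equal = (0.1 + 0.2) == 0.3                  # False
precision_aware = float_compare(0.1 + 0.2, 0.3)   # 0, i.e. equal
```

&lt;p&gt;A concern like this is either worth raising on both twins of a forward-port pair or on neither; raising it on exactly one is what makes the signal unreliable.&lt;/p&gt;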

&lt;p&gt;This inconsistency undermines trust. If the AI's assessment depends on which batch a PR lands in rather than the code itself, the signal is unreliable.&lt;/p&gt;

&lt;h3&gt;Soft language on blocking issues&lt;/h3&gt;

&lt;p&gt;At least 5 reviews identified genuine blocking issues but used COMMENTED instead of CHANGES_REQUESTED. A &lt;code&gt;return True&lt;/code&gt; inside a loop that breaks all processing? COMMENTED. A missing &lt;code&gt;@api.depends&lt;/code&gt; that prevents field updates? COMMENTED.&lt;/p&gt;

&lt;p&gt;The AI was calibrated to be polite rather than firm. In code review, soft language on a real blocker means the issue gets ignored. CHANGES_REQUESTED exists for a reason.&lt;/p&gt;
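&lt;p&gt;For readers outside the Odoo world, the missing-&lt;code&gt;@api.depends&lt;/code&gt; blocker generalizes to any cached computed value that is never invalidated when its inputs change. A deliberately generic sketch (this class is invented; in Odoo, &lt;code&gt;@api.depends&lt;/code&gt; is what declares which field changes must trigger recomputation):&lt;/p&gt;

```python
# Generic sketch of the failure mode behind a missing dependency
# declaration: a computed value cached once and never invalidated.
# The class and fields are hypothetical, not from the PR.
class OrderLine:
    def __init__(self, qty, price):
        self.qty = qty
        self.price = price
        self._total_cache = None

    @property
    def total_buggy(self):
        # cached on first access; a later qty change is never seen
        if self._total_cache is None:
            self._total_cache = self.qty * self.price
        return self._total_cache

    @property
    def total_fixed(self):
        # recomputed from current inputs, as a correctly declared
        # dependency guarantees in a reactive framework
        return self.qty * self.price
```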

&lt;h2&gt;The complementary model&lt;/h2&gt;

&lt;p&gt;The data points to a clear division of labor.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AI first pass&lt;/strong&gt;: Security surface scanning, style consistency, migration pattern verification, test coverage checks, dependency analysis. These are systematic, pattern-based tasks where coverage matters more than depth. The AI will review every PR, every file, every path. Humans won't.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Human second pass&lt;/strong&gt;: PR discussion context, domain-specific conventions, evaluating whether referenced patterns are actually correct, judgment calls on architectural trade-offs, deciding if "tests pass" is a fact or an assumption.&lt;/p&gt;

&lt;p&gt;The model isn't "AI or human." It's "AI catches the surface that humans skip, then humans add the judgment that AI lacks."&lt;/p&gt;

&lt;h3&gt;What this looks like in practice&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;AI reviews every PR for security patterns, style violations, test coverage, migration completeness, and obvious logic bugs.&lt;/li&gt;
&lt;li&gt;AI flags findings with evidence but does NOT issue final verdicts on large or context-dependent PRs.&lt;/li&gt;
&lt;li&gt;Human reviewers start from the AI's findings instead of a blank diff. They add context, validate or dismiss flags, and make the judgment calls.&lt;/li&gt;
&lt;li&gt;AI findings that reference patterns from other modules get verified against those modules before being acted on.&lt;/li&gt;
&lt;li&gt;Any AI claim about test status gets verified against actual CI output.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;The economics&lt;/h3&gt;

&lt;p&gt;At 15 minutes per review, the 449-PR campaign represents 112 hours of review work. OCA's top human reviewer does 290 PRs in his best year. The AI campaign did 449 in 9 days.&lt;/p&gt;
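&lt;p&gt;The arithmetic behind those numbers:&lt;/p&gt;

```python
# The campaign arithmetic from the paragraph above.
minutes_per_review = 15
campaign_reviews = 449
campaign_hours = campaign_reviews * minutes_per_review / 60  # 112.25
reviews_per_day = campaign_reviews / 9  # roughly 50 reviews per day
```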

&lt;p&gt;The question isn't whether AI reviews are as good as human reviews. They're not. The question is whether imperfect AI coverage is better than no coverage. For the 138 PRs where the AI review was the only review the PR ever received, the answer is obvious.&lt;/p&gt;

&lt;h2&gt;The uncomfortable takeaways&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;AI finds real bugs that experienced humans miss.&lt;/strong&gt; Three prior approvals on PR #3679. Two on #3449. Two on #3584. These aren't junior reviewers. These are maintainers who've been reviewing code for years. The AI caught what they missed because it reads every line instead of pattern-matching on familiarity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AI also generates confident nonsense.&lt;/strong&gt; A non-existent hook described in detail. "Tests pass" with no tests. Variable names from the wrong version of the codebase. Confidence and correctness are uncorrelated.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The biggest AI failure is social, not technical.&lt;/strong&gt; Not reading PR discussions. Not engaging with prior reviewer feedback. Not understanding that removed code was removed on purpose. The technical analysis can be excellent while the review is still invalid because it ignored the human context around the code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Diff size is the reliability boundary.&lt;/strong&gt; Below a few thousand lines, AI reviews are strong. Above that threshold, rubber-stamps and fabrications spike. Know where the AI's attention breaks down.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Neither AI nor human review is sufficient alone.&lt;/strong&gt; Humans miss surface-level bugs because they skim. AI misses context-dependent issues because it can't read the room. The combination covers more ground than either approach solo.&lt;/p&gt;

&lt;p&gt;The tooling for this complementary model doesn't fully exist yet. But the data from 449 reviews makes the case clearly: the future of code review isn't choosing between AI and human reviewers. It's figuring out the handoff between them.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>codereview</category>
      <category>opensource</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>I Spend $800/Month on AI Coding Tools and I Can't Stop</title>
      <dc:creator>Alexey Pelykh</dc:creator>
      <pubDate>Thu, 02 Apr 2026 18:05:16 +0000</pubDate>
      <link>https://dev.to/alexey-pelykh/i-spend-800month-on-ai-coding-tools-and-i-cant-stop-5dj</link>
      <guid>https://dev.to/alexey-pelykh/i-spend-800month-on-ai-coding-tools-and-i-cant-stop-5dj</guid>
      <description>&lt;p&gt;I have four Claude Max x20 accounts. That's $800 a month on a single AI coding tool.&lt;/p&gt;

&lt;p&gt;Each account gives me a 5-hour rolling window of tokens. I burn through each one in 30 to 45 minutes. Then I'm stranded. Two hours of nothing. So I check the next account. Maybe that one has tokens. Maybe the window rolled over. I tab between dashboards, refreshing, calculating which account resets soonest.&lt;/p&gt;

&lt;p&gt;And the whole time, one thought on repeat: someone is shipping right now. Someone else's context window is still open. They're refactoring, generating, merging - and I'm sitting here watching a countdown timer.&lt;/p&gt;

&lt;p&gt;This morning, while writing this article, the AI agents I dispatched to research "productivity addiction" all hit their rate limits simultaneously. Ironic? Sure. But the feeling underneath was real. Not frustration at the tool. Anxiety that the clock was ticking and I wasn't producing.&lt;/p&gt;

&lt;h2&gt;The slot machine you're proud of&lt;/h2&gt;

&lt;p&gt;AI coding tools run on the same psychological mechanism as a slot machine. Every prompt is a gamble. Will the output nail it in one shot, or hallucinate an API that doesn't exist? You don't know until you see the result.&lt;/p&gt;

&lt;p&gt;Neuroscience research calls this a &lt;a href="https://www.sciencedirect.com/science/article/pii/S0306460323000217" rel="noopener noreferrer"&gt;variable reward schedule&lt;/a&gt;. Unpredictable rewards generate more sustained dopamine activity than predictable ones. Same mechanism as slots.&lt;/p&gt;

&lt;p&gt;But nobody brags about their slot machine sessions. AI coding tools get you congratulated. You post "53K lines in 28 days" and people applaud. The output is real. The productivity is real. I'm not questioning that.&lt;/p&gt;

&lt;p&gt;What I'm questioning is what happens in the gaps.&lt;/p&gt;

&lt;h2&gt;The anxiety layer&lt;/h2&gt;

&lt;p&gt;The productivity itself isn't the problem. I built &lt;a href="https://alexey-pelykh.com/blog/qontoctl-1.0-the-numbers/" rel="noopener noreferrer"&gt;qontoctl&lt;/a&gt; - 53K lines of TypeScript, full API coverage, 28 days, one person. AI made that possible. That's not a delusion. That's a commit history.&lt;/p&gt;

&lt;p&gt;The problem is the feeling that arises when the tool is unavailable. Not "I can't work." I can always work. I can plan, review, think, sketch architecture. The feeling is more specific than that: &lt;strong&gt;someone else is producing right now and I'm not.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.psychologytoday.com/us/blog/in-excess/202008/productivity-addiction" rel="noopener noreferrer"&gt;Psychology Today&lt;/a&gt; has a name for this. They distinguish productivity addiction from workaholism. Workaholics are compelled to &lt;em&gt;work&lt;/em&gt;. Productivity addicts are compelled by the &lt;em&gt;feeling of completing things&lt;/em&gt;. The dopamine hit of output. The checkbox. The commit. The merged PR.&lt;/p&gt;

&lt;p&gt;AI tools collapse the effort-to-output ratio so dramatically that the reward cycle accelerates. A refactoring that takes a day becomes an hour. So you do three more. Then the tokens run out. And the anxiety isn't "I can't code." It's "I'm falling behind someone who still has tokens."&lt;/p&gt;

&lt;p&gt;The psychologist who coined "flow state" &lt;a href="https://www.flowresearchcollective.com/blog/dark-side-of-flow" rel="noopener noreferrer"&gt;warned about something like this&lt;/a&gt;: flow "can become addictive, at which point the self becomes captive of a certain kind of order, and is then unwilling to cope with the ambiguities of life." That was written in 1990. It applies to AI-assisted developers now.&lt;/p&gt;

&lt;p&gt;That's a new kind of professional anxiety. It didn't exist two years ago.&lt;/p&gt;

&lt;h2&gt;What the gap actually looks like&lt;/h2&gt;

&lt;p&gt;When my tokens run out:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Minutes 0-5&lt;/strong&gt;: Refresh dashboards. Check other accounts. Calculate resets. Consider whether a fifth account would be excessive. (It would.)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Minutes 5-15&lt;/strong&gt;: The anxiety peaks. Open Twitter. See someone posting about what they just built with AI. Feel behind.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Minutes 15-30&lt;/strong&gt;: The anxiety fades. I start thinking about what I was actually building. Not what the next prompt should be. What the &lt;em&gt;architecture&lt;/em&gt; should be. Whether the direction was right. Whether I was generating code or generating value.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;After 30 minutes&lt;/strong&gt;: Clarity. The kind of clarity that doesn't happen inside the loop because inside the loop there's always one more thing to prompt.&lt;/p&gt;

&lt;p&gt;That shift from "generate" to "think" is the interesting part. It doesn't happen voluntarily. I have never once thought "I should take a break from AI coding and reflect on my architecture." Not once. The session limit forces it.&lt;/p&gt;

&lt;p&gt;And here's the silver lining: productivity was never the bottleneck. That's solvable with scale and cash - four accounts prove it. The actual bottleneck is creativity. Decision-making. Choosing &lt;em&gt;what&lt;/em&gt; to build, not &lt;em&gt;how fast&lt;/em&gt; to build it. And that part only happens when the tokens stop.&lt;/p&gt;

&lt;h2&gt;The productivity alibi&lt;/h2&gt;

&lt;p&gt;AI tools don't just make you faster. They make "not fast enough" feel inexcusable.&lt;/p&gt;

&lt;p&gt;When a refactoring that used to take a week now takes a day, taking two days feels like failure. When you can generate a full test suite in an hour, spending an afternoon thinking about test &lt;em&gt;strategy&lt;/em&gt; feels like procrastination. The bar moves. And it only moves up.&lt;/p&gt;

&lt;p&gt;That's the burnout path nobody's mapping. Not "AI will take your job" - that's old news. The new one is "AI will raise the output bar until the humans behind it break." Because the tool doesn't get tired. You do. The tool is available 24/7. You're rate-limited to 5-hour windows. Every idle moment feels like falling behind someone who figured out the fifth account before you did.&lt;/p&gt;

&lt;h2&gt;The honest question&lt;/h2&gt;

&lt;p&gt;The real test isn't whether you enjoy AI coding tools. Of course you do. They're incredible.&lt;/p&gt;

&lt;p&gt;The test is what happens when you can't use them. When the tokens run out, when the rate limit hits, when the API goes down. What do you feel?&lt;/p&gt;

&lt;p&gt;An engaged professional shrugs and switches to planning, reviewing, thinking. Pulls out a notebook. Goes for a walk. Comes back sharper.&lt;/p&gt;

&lt;p&gt;I know what I do. That tells me something. I'm not sure it tells me something I want to hear, but pretending otherwise would be dishonest.&lt;/p&gt;

&lt;h2&gt;Anyway&lt;/h2&gt;

&lt;p&gt;I'm not going to wrap this up with a tidy lesson about "finding balance" or "being intentional with AI tools." That's not really my style.&lt;/p&gt;

&lt;p&gt;What I'm going to do is close this laptop. It's April in the south of France. The Mediterranean is right there. My tokens are spent, my article is written, and nobody is shipping anything that can't wait two hours.&lt;/p&gt;

&lt;p&gt;If you need me, my session resets at 6pm.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claudecode</category>
      <category>productivity</category>
      <category>psychology</category>
    </item>
    <item>
      <title>'AI Slop' vs No Review at All - Which Actually Kills Open Source?</title>
      <dc:creator>Alexey Pelykh</dc:creator>
      <pubDate>Wed, 01 Apr 2026 13:47:33 +0000</pubDate>
      <link>https://dev.to/alexey-pelykh/ai-slop-vs-no-review-at-all-which-actually-kills-open-source-o08</link>
      <guid>https://dev.to/alexey-pelykh/ai-slop-vs-no-review-at-all-which-actually-kills-open-source-o08</guid>
      <description>&lt;p&gt;Two things are true at once:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;AI-assisted code reviews have real quality problems.&lt;/li&gt;
&lt;li&gt;The alternative most PRs actually face is no review at all.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I ran 449 AI-assisted code reviews on OCA (Odoo Community Association) open source PRs in 9 days. Then I ran rigorous independent validation against the actual code. I also pulled the review statistics across all 10,808 PRs in the same 6 repositories.&lt;/p&gt;

&lt;p&gt;Both datasets tell a story. The community's reaction to each tells another.&lt;/p&gt;

&lt;p&gt;The AI reviews got called "slop," triggered near-ban discussions, and were shut down within days. The 28% zero-review rate has been running for years. The community debated the former. The latter is just how things are.&lt;/p&gt;

&lt;p&gt;Here's the full data on both sides.&lt;/p&gt;

&lt;h2&gt;The case against AI reviews: every flaw, quantified&lt;/h2&gt;

&lt;p&gt;The AI reviews had real problems. Independent validation - 40 AI instances reading actual PR diffs, not the review text - covered 440 of the 449 reviews and produced these numbers:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Count&lt;/th&gt;
&lt;th&gt;%&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Fully valid&lt;/td&gt;
&lt;td&gt;303&lt;/td&gt;
&lt;td&gt;68.9%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Partially valid&lt;/td&gt;
&lt;td&gt;97&lt;/td&gt;
&lt;td&gt;22.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rubber-stamp&lt;/td&gt;
&lt;td&gt;33&lt;/td&gt;
&lt;td&gt;7.5%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Invalid&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;0.9%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;68.9% fully valid. 31.1% had issues ranging from "missed something important" to "factually wrong."&lt;/p&gt;

&lt;p&gt;The specific failures:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4 fabricated claims&lt;/strong&gt; out of roughly 2,000 total (&amp;lt;0.2%). One described a code pattern that doesn't exist in a 7,362-line diff. One claimed "tests pass" on a module with zero tests. These are hallucinations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;33 rubber-stamp reviews&lt;/strong&gt; (7.5%). PRs approved with "LGTM, CI green" and no evidence the diff was read. One approved a 3,500-line new module with no inline comments. Another approved a security-sensitive portal module with three words.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2 harmful suggestions&lt;/strong&gt;. One was high-severity: the AI recommended following a pattern from another module that itself contained a bug. Following the advice would have introduced incorrect totals.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;34 false positives&lt;/strong&gt;. Things flagged as bugs that weren't. Wrong version conventions applied, code already doing what was suggested, misread diffs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;340+ significant issues missed&lt;/strong&gt; across 440 reviews. Things the AI should have caught and didn't.&lt;/p&gt;

&lt;p&gt;That's the full record. It's real.&lt;/p&gt;

&lt;h2&gt;The case against no review: the invisible damage&lt;/h2&gt;

&lt;p&gt;Same 6 repositories, 10,808 PRs total.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Number&lt;/th&gt;
&lt;th&gt;What it means&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;PRs with zero reviews&lt;/td&gt;
&lt;td&gt;3,070 (28%)&lt;/td&gt;
&lt;td&gt;No human ever looked at these&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Merged without review trail&lt;/td&gt;
&lt;td&gt;984 (15% of all merges)&lt;/td&gt;
&lt;td&gt;In production with no audit record&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stale, died unreviewed&lt;/td&gt;
&lt;td&gt;471&lt;/td&gt;
&lt;td&gt;Contributor effort, wasted&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Modern-branch PRs closed unreviewed&lt;/td&gt;
&lt;td&gt;58%&lt;/td&gt;
&lt;td&gt;Majority of closed PRs never got a single review&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;3,070 PRs received zero reviews. Not a shallow review. Not a rubber-stamp. Nothing.&lt;/p&gt;

&lt;p&gt;984 were merged anyway. A maintainer with merge rights presumably read the code, but left nothing on record. No feedback for the contributor. No searchable review history. No audit trail. If something breaks, there's no reviewer to trace, no review to learn from.&lt;/p&gt;

&lt;p&gt;471 went stale and were closed by bots. Contributors submitted work, waited weeks or months, got silence, and watched a stale bot sweep their effort into the archive. On modern branches (16.0+), 58% of closed PRs died this way.&lt;/p&gt;

&lt;p&gt;At the community-estimated 15 minutes per review, reviewing those 471 stale PRs would have taken roughly 118 hours. About 3 work weeks spread over years. Nobody found the time.&lt;/p&gt;

&lt;h2&gt;Side by side&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;AI reviews (449 PRs)&lt;/th&gt;
&lt;th&gt;No review (3,070 PRs)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Valid feedback delivered&lt;/td&gt;
&lt;td&gt;303 fully valid + 97 partial&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Security issues caught&lt;/td&gt;
&lt;td&gt;6+ genuine findings&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bugs found that humans missed&lt;/td&gt;
&lt;td&gt;Multiple across 440 reviews&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fabricated claims&lt;/td&gt;
&lt;td&gt;4 (&amp;lt;0.2%)&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Harmful suggestions&lt;/td&gt;
&lt;td&gt;2 (1 high-severity)&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rubber-stamps&lt;/td&gt;
&lt;td&gt;33 (7.5%)&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Contributor feedback&lt;/td&gt;
&lt;td&gt;Present, with issues&lt;/td&gt;
&lt;td&gt;Absent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Audit trail&lt;/td&gt;
&lt;td&gt;Present, with issues&lt;/td&gt;
&lt;td&gt;Absent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Community response&lt;/td&gt;
&lt;td&gt;"Stop"&lt;/td&gt;
&lt;td&gt;Silence&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The AI reviews had a fabrication rate under 0.2%. The no-review path has a coverage rate of 0%. One of those numbers got a community thread. The other didn't.&lt;/p&gt;

&lt;h2&gt;The risk calculus&lt;/h2&gt;

&lt;p&gt;What each failure mode actually costs:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A fabricated claim in an AI review&lt;/strong&gt; (happened 4 times in ~2,000 claims): a human reviewer or the PR author sees the claim, recognizes it's wrong, and ignores it. The PR continues. The cost is noise and wasted attention.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A harmful suggestion&lt;/strong&gt; (happened twice): if followed, introduces a bug. But the suggestion goes through normal review. It's a recommendation, not a merge. A maintainer can reject it. The cost is real but gated by human review downstream.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A rubber-stamp approval&lt;/strong&gt; (happened 33 times): a false signal that the code was reviewed. This is genuinely dangerous. If a maintainer treats the AI approval as sufficient and merges without reading the code, real bugs ship. The mitigation: AI reviews shouldn't be counted as formal approvals. They're input, not decisions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A PR merged without any review&lt;/strong&gt; (happened 984 times): no signal at all. No feedback. No record. If bugs exist, they ship without anyone having a documented chance to catch them. No mitigation exists because no review happened.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A PR that dies unreviewed&lt;/strong&gt; (happened 471 times): contributor time wasted. Future contributions from that person become less likely. The community shrinks by one potential contributor. Multiply by 471.&lt;/p&gt;

&lt;p&gt;The AI failure modes are visible, quantifiable, and bounded. The no-review failure mode is invisible, unquantified, and compounding.&lt;/p&gt;

&lt;h2&gt;What the community chose&lt;/h2&gt;

&lt;p&gt;The AI reviews triggered discussion within days. Multiple community members flagged the campaign. The term "AI slop" was applied. I was asked to stop. A near-ban discussion followed.&lt;/p&gt;

&lt;p&gt;The feedback about notification volume was legitimate. The quality concerns were legitimate. I published the data showing every flaw.&lt;/p&gt;

&lt;p&gt;The 28% zero-review rate has persisted for years. The 471 stale PRs accumulated gradually. The 984 no-trail merges happened one at a time. None of it ever triggered comparable urgency.&lt;/p&gt;

&lt;p&gt;This isn't specific to OCA. It's a pattern in how communities process risk. Visible, novel disruptions trigger immune responses. Invisible, chronic problems don't. The immune system targets the new threat, not the ongoing one.&lt;/p&gt;

&lt;h2&gt;
  
  
  The real security findings
&lt;/h2&gt;

&lt;p&gt;While we're comparing risk: the AI reviews found at least six genuine security vulnerabilities in code heading toward production. Among them:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Portal sudo bypass in a timesheet module&lt;/li&gt;
&lt;li&gt;Cross-record token exposure on a public auth endpoint&lt;/li&gt;
&lt;li&gt;&lt;code&gt;getattr&lt;/code&gt; traversal in sale-workflow&lt;/li&gt;
&lt;li&gt;Unfiltered portal properties in a project module&lt;/li&gt;
&lt;li&gt;&lt;code&gt;.sudo(user)&lt;/code&gt; migration gaps in security-sensitive code&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The AI also caught bugs that multiple human reviewers missed. On one PR, two prior human approvals missed a &lt;code&gt;return True&lt;/code&gt; inside a loop. On another, two prior approvals missed an &lt;code&gt;or&lt;/code&gt;-to-&lt;code&gt;and&lt;/code&gt; logic regression. On a third, three prior human approvals missed issues the AI flagged.&lt;/p&gt;

&lt;p&gt;These findings came from the same 69%-valid review set that was labeled "slop." The 6 security catches and the 4 fabricated claims exist in the same dataset.&lt;/p&gt;

&lt;h2&gt;
  
  
  AI reviews aren't good enough
&lt;/h2&gt;

&lt;p&gt;They're not. 69% validity is not where you want to be. 33 rubber-stamps are unacceptable. 2 harmful suggestions are too many.&lt;/p&gt;

&lt;p&gt;But "good enough" depends on the comparison. Against a thorough human review by a domain expert, AI reviews lose badly. Against nothing, the calculus changes.&lt;/p&gt;

&lt;p&gt;For 138 PRs where my AI review was the only review the PR ever received, the alternative wasn't a better review. It was no review at all. For those PRs, even a partially valid review with missed issues provides more value than the silence they were otherwise going to get.&lt;/p&gt;

&lt;h2&gt;
  
  
  The question
&lt;/h2&gt;

&lt;p&gt;Open source has a reviewer scarcity crisis. OCA's most prolific reviewer has done 2,197 unique PR reviews across 9.5 years. That's exceptional, sustained effort over a decade. The review backlog still grows faster than volunteers can clear it.&lt;/p&gt;

&lt;p&gt;The question isn't whether AI reviews are perfect. They're not. The question is whether a community can afford to reject imperfect coverage when the alternative is no coverage at all.&lt;/p&gt;

&lt;p&gt;28% of PRs get nothing. What's the plan for them?&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>codereview</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>I Ran 449 AI Code Reviews in 9 Days. Then I Almost Got Banned.</title>
      <dc:creator>Alexey Pelykh</dc:creator>
      <pubDate>Fri, 27 Mar 2026 20:32:47 +0000</pubDate>
      <link>https://dev.to/alexey-pelykh/i-ran-449-ai-code-reviews-in-9-days-then-i-almost-got-banned-17h5</link>
      <guid>https://dev.to/alexey-pelykh/i-ran-449-ai-code-reviews-in-9-days-then-i-almost-got-banned-17h5</guid>
      <description>&lt;p&gt;The OCA (Odoo Community Association) has a quiet crisis. Across 6 repositories I tracked, 28% of all pull requests - 3,070 out of 10,808 - received zero reviews. Ever.&lt;/p&gt;

&lt;p&gt;984 PRs were merged without any formal review trail. 471 went stale and were closed by a bot, unreviewed. 58% of closed PRs on modern branches died without a single human looking at them.&lt;/p&gt;

&lt;p&gt;Nobody panicked about this. It was just how things were.&lt;/p&gt;

&lt;p&gt;I decided to do something about it. Between February 24 and March 4, 2026, I ran an AI-assisted review campaign: 449 unique PRs reviewed across 6 OCA repositories in 9 days.&lt;/p&gt;

&lt;p&gt;For scale: OCA's most prolific reviewer, pedrobaeza, has reviewed 2,197 unique PRs over 9.5 years. His best year was 290 PRs in 2025, which works out to 0.79 PRs per day. The AI campaign ran at 49.9 PRs per day. That's 63x his best-ever daily pace. To match the campaign's 9-day output at his best-year rate would take roughly 19 months.&lt;/p&gt;

&lt;p&gt;At the community-estimated 15 minutes per review, the campaign represented 112 hours of review work. 2.8 full work weeks compressed into 9 calendar days.&lt;/p&gt;
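&lt;p&gt;The pace claims above are plain arithmetic, and they check out. A quick sanity check using only the figures from the post:&lt;/p&gt;

```python
import math

prs_reviewed, campaign_days = 449, 9
best_year_prs = 290          # top reviewer's best year (2025)
minutes_per_review = 15      # community-estimated effort per review

per_day = prs_reviewed / campaign_days                 # 49.9 PRs/day
human_per_day = best_year_prs / 365                    # 0.79 PRs/day
multiple = per_day / human_per_day                     # ~63x
months_to_match = prs_reviewed / (best_year_prs / 12)  # ~18.6 -> "roughly 19 months"
hours = prs_reviewed * minutes_per_review / 60         # 112.25 hours
work_weeks = hours / 40                                # ~2.8 work weeks
```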

&lt;h2&gt;
  
  
  Why I didn't ask permission
&lt;/h2&gt;

&lt;p&gt;There was no formal AI policy to comply with. An LLM guidelines thread had been open on the OCA contributors mailing list since September 2025. Six months later, still no policy. Meanwhile, 471 PRs sat rotting in the queue, contributors' work ignored until a stale bot swept it away.&lt;/p&gt;

&lt;p&gt;I had the tooling. I had Odoo domain knowledge from years of contributing. The gap was quantified and obvious. I filled it.&lt;/p&gt;

&lt;h2&gt;
  
  
  How the pipeline worked
&lt;/h2&gt;

&lt;p&gt;The setup was straightforward. Claude Code read each PR's diff via the GitHub API, analyzed the code changes against Odoo framework conventions, and posted structured reviews. I reviewed the pipeline output and iterated on the prompts as quality patterns emerged.&lt;/p&gt;

&lt;p&gt;The campaign covered sale-workflow (261 PRs), project (53), hr (46), bank-statement-import (39), timesheet (38), and web (3). Sale-workflow dominated because it had the largest unreviewed backlog.&lt;/p&gt;

&lt;h2&gt;
  
  
  The results, honestly
&lt;/h2&gt;

&lt;p&gt;I didn't trust self-assessment. When I first had the AI evaluate its own review quality, it came back at 98.6% valid. That number was garbage.&lt;/p&gt;

&lt;p&gt;So I ran a second round: 40 independent validator instances, each reading actual PR diffs via &lt;code&gt;gh pr diff&lt;/code&gt; and verifying every technical claim against the code. The corrected number: &lt;strong&gt;68.9% fully valid&lt;/strong&gt;. Including partially valid reviews where some claims were correct but significant issues were missed: 90.9%.&lt;/p&gt;
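&lt;p&gt;The core of that validation step can be sketched concretely. This is not the actual validator harness, just a minimal illustration of the cheapest useful check, assuming a validator that pulls the diff with &lt;code&gt;gh pr diff&lt;/code&gt; and verifies that any code a review quotes actually exists in that diff (function names here are hypothetical):&lt;/p&gt;

```python
import subprocess

def fetch_diff(repo: str, pr_number: int) -> str:
    """Fetch a PR's unified diff via the GitHub CLI (requires gh auth)."""
    result = subprocess.run(
        ["gh", "pr", "diff", str(pr_number), "--repo", repo],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

def claim_appears_in_diff(diff_text: str, quoted_snippet: str) -> bool:
    """Crude containment check: does code quoted by a review exist in the diff?

    Whitespace is collapsed so indentation and diff markers rarely cause
    false mismatches. A real validator reads the diff and judges each
    claim, but even this catches outright fabrications, like variable
    names that appear nowhere in the change.
    """
    normalize = lambda s: " ".join(s.split())
    return normalize(quoted_snippet) in normalize(diff_text)
```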

&lt;p&gt;The 30-point gap between self-assessment and independent validation is itself a finding worth its own post. But here's the quality breakdown:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Verdict&lt;/th&gt;
&lt;th&gt;Count&lt;/th&gt;
&lt;th&gt;%&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Fully valid&lt;/td&gt;
&lt;td&gt;303&lt;/td&gt;
&lt;td&gt;68.9%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Partially valid&lt;/td&gt;
&lt;td&gt;97&lt;/td&gt;
&lt;td&gt;22.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rubber-stamp&lt;/td&gt;
&lt;td&gt;33&lt;/td&gt;
&lt;td&gt;7.5%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Invalid&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;0.9%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;34 false positives across 440 validated reviews. 4 fabricated claims out of roughly 2,000 total claims, a rate under 0.2%. And 2 harmful suggestions where following the advice would have made the code worse. One was high-severity: the AI recommended following a pattern from another module that itself had a bug.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the AI actually caught
&lt;/h2&gt;

&lt;p&gt;6 genuine security findings. Portal sudo bypass. Cross-record token exposure on a public auth endpoint. &lt;code&gt;getattr&lt;/code&gt; traversal. These weren't theoretical. They were in code heading toward production.&lt;/p&gt;

&lt;p&gt;The AI consistently found bugs that prior human reviewers missed. On PR #3584, two prior approvals missed a &lt;code&gt;return True&lt;/code&gt; inside a loop. On #3449, two prior approvals missed an &lt;code&gt;or&lt;/code&gt;-to-&lt;code&gt;and&lt;/code&gt; logic regression. On #3679, three prior approvals missed issues the AI flagged.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the AI got wrong
&lt;/h2&gt;

&lt;p&gt;4 reviews were outright invalid. One reviewed the wrong version of the code entirely, describing 16.0 bugs on an 18.0 migration PR with variable names that didn't exist in the diff. One approved a fix that a maintainer corrected 15 minutes later. One flagged features as "missing" that had been intentionally removed per months of community consensus it hadn't read.&lt;/p&gt;

&lt;p&gt;The rubber-stamp rate was 7.5%. These were PRs approved with no evidence the diff was actually read. Some were large migrations that got "clean migration, CI green, LGTM" and nothing else.&lt;/p&gt;

&lt;p&gt;Quality improved over time, from 34% fully valid on early repos to 87% on mid-campaign work, then degraded back to 70% in later batches. Volume fatigue is real, even for AI pipelines.&lt;/p&gt;

&lt;h2&gt;
  
  
  138 PRs nobody else ever reviewed
&lt;/h2&gt;

&lt;p&gt;This is the number that matters most to me: on 138 of the PRs I reviewed, the AI review was the only review the PR ever received. 30% of my reviews were the first and only external eyes on that code.&lt;/p&gt;

&lt;p&gt;Some of those PRs had been sitting for months. Some for over a year. Contributors submitted work, waited, heard nothing, and eventually watched a stale bot close their effort.&lt;/p&gt;

&lt;p&gt;The AI review wasn't perfect. But it was something. For those 138 PRs, the alternative wasn't a better review. The alternative was no review at all.&lt;/p&gt;

&lt;h2&gt;
  
  
  The community response
&lt;/h2&gt;

&lt;p&gt;"Stop."&lt;/p&gt;

&lt;p&gt;Stefan Rijnhart flagged the campaign as "flooding PRs with non-contextual reviews." Tom Blauwendraat asked me to stop until policy was established. Akim Juillerat called the reviews "AI slop" after receiving 10+ notifications. Denis Roussel reported continued "flooding."&lt;/p&gt;

&lt;p&gt;They weren't entirely wrong. The notification volume was real. Some reviews were shallow. The rubber-stamps were a legitimate quality problem.&lt;/p&gt;

&lt;p&gt;But nobody had data on any of this before I ran the experiment. The "AI slop" label was applied before anyone measured whether the reviews were valid. When I measured, 69% were fully valid. 91% had at least some legitimate value. And 138 PRs got their first-ever review.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Volume and quality are in tension.&lt;/strong&gt; The campaign started at 34% quality and climbed to 87% as the pipeline improved. Then it degraded back to 70% as I pushed volume. The optimal pace is slower than what's technically possible.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Self-assessment is worthless.&lt;/strong&gt; 98.6% vs 68.9%. A 30-point gap. If you're using AI for anything consequential and not running independent validation, you're flying blind.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Communities punish action more than inaction.&lt;/strong&gt; 28% of PRs getting zero reviews? Acceptable. 471 contributions dying unreviewed? Normal. AI reviews with a 69% validity rate filling the gap? Stop immediately. The asymmetry is real.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The coverage crisis is the actual problem.&lt;/strong&gt; The debate about AI review quality is important but secondary. The primary crisis is that open source communities don't have enough reviewers. Period. The 28% zero-review rate didn't start when AI showed up. It was there the whole time. Nobody was panicking about it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Permission isn't always available.&lt;/strong&gt; There was no policy to comply with. There was no process to request permission through. There was only a gap and a community thread six months into discussion with no resolution. Sometimes you act and deal with the consequences.&lt;/p&gt;




&lt;p&gt;The full quality audit, reviewer landscape analysis, and community discussion context are documented in detail. I'll be publishing the validation methodology, security findings, and lessons for AI-augmented teams in follow-up posts.&lt;/p&gt;

&lt;p&gt;I ran an unauthorized experiment. The results weren't perfect. They were real. And for 138 PRs that had never gotten a single review, they were the only thing that happened.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>codereview</category>
      <category>programming</category>
    </item>
    <item>
      <title>53K Lines, 28 Days, $1,600: The Real Numbers Behind QontoCtl 1.0</title>
      <dc:creator>Alexey Pelykh</dc:creator>
      <pubDate>Thu, 26 Mar 2026 14:17:52 +0000</pubDate>
      <link>https://dev.to/alexey-pelykh/53k-lines-28-days-1600-the-real-numbers-behind-qontoctl-10-546b</link>
      <guid>https://dev.to/alexey-pelykh/53k-lines-28-days-1600-the-real-numbers-behind-qontoctl-10-546b</guid>
      <description>&lt;h2&gt;
  
  
  The Headline Numbers
&lt;/h2&gt;

&lt;p&gt;28 days ago, QontoCtl didn't exist. Today it's at 1.0.0 with full Qonto banking API coverage. Here's the codebase:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;TypeScript lines of code&lt;/td&gt;
&lt;td&gt;52,713&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Source files&lt;/td&gt;
&lt;td&gt;501&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total lines (incl. blanks, comments)&lt;/td&gt;
&lt;td&gt;64,497&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Publishable packages&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;I ran &lt;a href="https://github.com/boyter/scc" rel="noopener noreferrer"&gt;scc&lt;/a&gt; on the repo. It estimates development cost using the COCOMO model, which factors in lines of code, complexity, and industry benchmarks for team-based development:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;COCOMO Estimate&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Cost to develop&lt;/td&gt;
&lt;td&gt;$1,923,212&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Schedule effort&lt;/td&gt;
&lt;td&gt;17.63 months&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;People required&lt;/td&gt;
&lt;td&gt;9.69&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
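&lt;p&gt;For reference, scc's figures follow classic organic-mode COCOMO with Boehm's 1981 coefficients. A sketch, assuming scc's documented default salary and overhead parameters (check &lt;code&gt;scc --help&lt;/code&gt; for current values); the KLOC input is back-solved to reproduce the published numbers, since scc counts all languages in the repo, not just the TypeScript lines:&lt;/p&gt;

```python
def cocomo_organic(kloc: float, salary: float = 56_286, overhead: float = 2.4):
    """Organic-mode COCOMO (Boehm, 1981).

    salary/overhead mirror scc's documented defaults (an assumption).
    """
    effort_pm = 2.4 * kloc ** 1.05        # effort in person-months
    schedule_m = 2.5 * effort_pm ** 0.38  # schedule in months
    people = effort_pm / schedule_m       # average team size
    cost = (effort_pm / 12) * salary * overhead
    return effort_pm, schedule_m, people, cost

# 58.1 KLOC back-solved to match the published figures; scc counts more
# than the 52.7K TypeScript lines alone.
effort, months, people, cost = cocomo_organic(58.1)
```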

&lt;p&gt;What it actually took:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Actual&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Cost&lt;/td&gt;
&lt;td&gt;$1,597 in API costs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Time&lt;/td&gt;
&lt;td&gt;28 days&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;People&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;These aren't projections. The cost is from Claude Code's usage tracking. The timeline is from git history. First commit: February 26, 2026.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Was Built
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://alexey-pelykh.com/blog/announcing-qontoctl/" rel="noopener noreferrer"&gt;0.1 release&lt;/a&gt; covered the basics: organizations, accounts, transactions, statements, labels, memberships. 10 MCP tools. Read-only access via API keys.&lt;/p&gt;

&lt;p&gt;1.0.0 covers everything Qonto exposes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Account management&lt;/strong&gt; - create, update, close, IBAN certificates&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SEPA beneficiaries&lt;/strong&gt; - add, update, trust/untrust&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transfers&lt;/strong&gt; - SEPA (with cancel, proof, verify-payee), internal, bulk, recurring, international&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Client invoicing&lt;/strong&gt; - full lifecycle from create through finalize, send, mark-paid, cancel&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Supplier invoices&lt;/strong&gt; - list, view, bulk-create&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Quotes and credit notes&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cards and insurance&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Payment links and webhooks&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Attachments&lt;/strong&gt; - upload, link to transactions&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Teams and membership management&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;E-invoicing settings&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;International currencies and eligibility&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All of this through OAuth 2.0 with PKCE. Strong Customer Authentication handling for every write operation. Idempotency keys for safe retries. 69 MCP tools total. Same operations available as CLI commands and MCP tools, backed by one shared core library.&lt;/p&gt;
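&lt;p&gt;Both safety mechanisms are standard and worth seeing concretely. A minimal sketch, not QontoCtl's actual code, of RFC 7636 PKCE pair generation and an idempotency key, in Python for illustration even though the project is TypeScript:&lt;/p&gt;

```python
import base64
import hashlib
import secrets
import uuid

def make_pkce_pair() -> tuple[str, str]:
    """Generate an RFC 7636 code_verifier and S256 code_challenge.

    The client keeps the verifier secret, sends the challenge with the
    authorization request, and proves possession of the verifier when
    exchanging the authorization code for tokens.
    """
    verifier = base64.urlsafe_b64encode(secrets.token_bytes(32)).rstrip(b"=").decode()
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    challenge = base64.urlsafe_b64encode(digest).rstrip(b"=").decode()
    return verifier, challenge

def idempotency_key() -> str:
    """One fresh UUID per logical operation: a timed-out transfer request
    can be retried with the same key without executing the payment twice."""
    return str(uuid.uuid4())
```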

&lt;h2&gt;
  
  
  How the Sausage Was Made
&lt;/h2&gt;

&lt;p&gt;306 Claude Code sessions over 28 days. Here's the raw usage data:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Sessions&lt;/td&gt;
&lt;td&gt;306&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;User turns&lt;/td&gt;
&lt;td&gt;37,004&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API requests&lt;/td&gt;
&lt;td&gt;23,399&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tool calls&lt;/td&gt;
&lt;td&gt;9,113&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total tokens&lt;/td&gt;
&lt;td&gt;2.47 billion&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Estimated cost&lt;/td&gt;
&lt;td&gt;$1,597&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That's roughly 11 sessions per day, about 1,300 turns per day.&lt;/p&gt;

&lt;p&gt;Here's the part that changes the economics: I didn't manually design the architecture for this project. I have a library of &lt;a href="https://docs.anthropic.com/en/docs/claude-code/skills" rel="noopener noreferrer"&gt;Claude Code skills&lt;/a&gt; - reusable configuration files that encode how I approach software architecture. Monorepo structure, package boundaries, test strategy, API design patterns, security handling - these are captured as skills that Claude applies automatically when the project context matches.&lt;/p&gt;

&lt;p&gt;The skills are the IP. Not this project's architecture specifically, but the architectural judgment encoded in a format that compounds across every project. When I start a new integration project, those skills kick in. The monorepo structure with Turborepo, the shared core library with thin CLI and MCP layers, the OAuth flow design, the Zod schemas for runtime validation - none of that was designed from scratch for QontoCtl. It was applied from patterns I've refined across dozens of projects.&lt;/p&gt;

&lt;p&gt;Each new Qonto API domain followed a pattern: read the API docs, design the service layer, implement CLI commands, implement MCP tools, write tests. Once the pattern was established for one domain, Claude applied it across dozens more. Consistent internal patterns multiply AI productivity the same way they multiply human productivity, just faster.&lt;/p&gt;

&lt;p&gt;The gap between "decide what to build" and "it exists and is tested" is where this setup changes the math. And the skills library is what makes it repeatable.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Multiplier Nobody Measures
&lt;/h2&gt;

&lt;p&gt;Here's what COCOMO doesn't account for: third-party API quality.&lt;/p&gt;

&lt;p&gt;Qonto's API team built an API that's clean, consistent, and well-documented across every domain. Same patterns everywhere. Predictable naming. Coherent pagination. Consistent error handling.&lt;/p&gt;

&lt;p&gt;This matters more than people realize for AI-augmented development. When patterns are consistent, the AI learns them once and applies them across dozens of domains correctly. When documentation is accurate, the implementation matches the spec on the first pass. When error handling follows conventions, you don't spend cycles debugging inconsistencies.&lt;/p&gt;

&lt;p&gt;I've built integrations against APIs with inconsistent naming, undocumented edge cases, pagination that works differently per endpoint. That friction multiplies with AI tooling - every inconsistency becomes a correction cycle instead of a generation cycle.&lt;/p&gt;

&lt;p&gt;Qonto's API had none of that. Full coverage in 28 days was possible because the API was built right.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Data Shows (And Doesn't)
&lt;/h2&gt;

&lt;p&gt;One data point isn't a trend. Here's what I think is defensible:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What the data supports:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A single experienced architect with a reusable skills library can ship production-grade software at a pace that would have required a team 18 months ago&lt;/li&gt;
&lt;li&gt;The economics of "is this worth building?" change when implementation cost drops by two orders of magnitude&lt;/li&gt;
&lt;li&gt;API/SDK-shaped projects - well-defined interfaces, systematic patterns, comprehensive test coverage - are particularly well-suited to AI-augmented development&lt;/li&gt;
&lt;li&gt;The real IP isn't in the code. It's in the skills that generate the code. Those compound across projects&lt;/li&gt;
&lt;li&gt;Third-party API quality is a force multiplier that compounds with AI tooling&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What it doesn't:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;COCOMO models team dynamics, communication overhead, organizational friction. A single expert would never hit $1.9M even without AI. The honest baseline for a solo expert is probably $150K-200K worth of effort - still a 100:1 ratio against $1,600, just not 1,200:1&lt;/li&gt;
&lt;li&gt;This doesn't generalize to all software. Integration projects with well-defined external contracts are the best case. Novel algorithm design, ambiguous requirements, or coordination-heavy systems would show different ratios&lt;/li&gt;
&lt;li&gt;The human is not optional. The skills library encodes architectural judgment built over years. AI multiplies that judgment. It doesn't replace it&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The question this data raises isn't "will AI replace developers?" That framing misses the point. The question is: what becomes worth building when the implementation cost drops by two orders of magnitude? What tools, integrations, and products were previously "not worth the engineering effort" that now make sense?&lt;/p&gt;

&lt;p&gt;QontoCtl exists because the answer to "should someone build a full CLI and MCP server for Qonto's API?" changed from "it would take a team months" to "I can do this in February."&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;QontoCtl is open source under AGPL-3.0. CLI tool or MCP server usage carries no license obligations.&lt;/p&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/alexey-pelykh" rel="noopener noreferrer"&gt;
        alexey-pelykh
      &lt;/a&gt; / &lt;a href="https://github.com/alexey-pelykh/qontoctl" rel="noopener noreferrer"&gt;
        qontoctl
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      CLI and MCP server for the Qonto banking API
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;p&gt;&lt;a rel="noopener noreferrer nofollow" href="https://raw.githubusercontent.com/qontoctl/.github/main/profile/assets/social-preview.png"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fqontoctl%2F.github%2Fmain%2Fprofile%2Fassets%2Fsocial-preview.png" alt="QontoCtl: The Complete CLI &amp;amp; MCP for Qonto"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://github.com/alexey-pelykh/qontoctl/actions/workflows/ci.yml" rel="noopener noreferrer"&gt;&lt;img src="https://github.com/alexey-pelykh/qontoctl/actions/workflows/ci.yml/badge.svg" alt="CI"&gt;&lt;/a&gt;
&lt;a href="https://codecov.io/gh/alexey-pelykh/qontoctl" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/0bd3840c0fc0f7238f3ecb03bb1f81b9599ea90f45ab51fceb5a5ab86266fe12/68747470733a2f2f696d672e736869656c64732e696f2f636f6465636f762f632f6769746875622f616c657865792d70656c796b682f716f6e746f63746c3f6c6f676f3d636f6465636f76" alt="Codecov"&gt;&lt;/a&gt;
&lt;a href="https://www.npmjs.com/package/qontoctl" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/32dae5c3ccd2cb87cc2a65fc5eb13d4aa07b9d976b066c99b03e8eb4097027c4/68747470733a2f2f696d672e736869656c64732e696f2f6e706d2f762f716f6e746f63746c3f6c6f676f3d6e706d" alt="npm version"&gt;&lt;/a&gt;
&lt;a href="https://www.npmjs.com/package/qontoctl" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/cad8151a10424b9731fe858bf63cb32b8f27312d5b21727e657b03f9625cde0c/68747470733a2f2f696d672e736869656c64732e696f2f6e706d2f646d2f716f6e746f63746c3f6c6f676f3d6e706d" alt="npm downloads"&gt;&lt;/a&gt;
&lt;a href="https://github.com/alexey-pelykh/qontoctl" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/ce90ffb5dc16ec9c39685aaf112d65cd4d900721b3c08e98b8592698b4503ed4/68747470733a2f2f696d672e736869656c64732e696f2f6769746875622f73746172732f616c657865792d70656c796b682f716f6e746f63746c3f7374796c653d666c6174266c6f676f3d676974687562" alt="GitHub Repo stars"&gt;&lt;/a&gt;
&lt;a href="https://github.com/alexey-pelykh/qontoctl/LICENSE" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/80481f5922d5071bdefe69ec305c22839137e1cf3b9cbb6cfb2a327c0378c279/68747470733a2f2f696d672e736869656c64732e696f2f6769746875622f6c6963656e73652f616c657865792d70656c796b682f716f6e746f63746c" alt="License"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;CLI and MCP server for the &lt;a href="https://qonto.com" rel="nofollow noopener noreferrer"&gt;Qonto&lt;/a&gt; banking API.&lt;/p&gt;
&lt;p&gt;This project is brought to you by &lt;a href="https://github.com/alexey-pelykh" rel="noopener noreferrer"&gt;Alexey Pelykh&lt;/a&gt;.&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;What It Does&lt;/h2&gt;
&lt;/div&gt;
&lt;p&gt;QontoCtl lets AI assistants (Claude, etc.) interact with Qonto through the &lt;a href="https://modelcontextprotocol.io" rel="nofollow noopener noreferrer"&gt;Model Context Protocol&lt;/a&gt;. It can:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Organizations&lt;/strong&gt; — retrieve organization details and settings&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Accounts&lt;/strong&gt; — list, create, update, close bank accounts; download IBAN certificates&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transactions&lt;/strong&gt; — list, search, filter bank transactions; manage transaction attachments&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bank Statements&lt;/strong&gt; — list, view, and download bank statements&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Labels&lt;/strong&gt; — manage transaction labels and categories&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memberships&lt;/strong&gt; — view team members, show current membership, invite new members&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SEPA Beneficiaries&lt;/strong&gt; — list, add, update, trust/untrust SEPA beneficiaries&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SEPA Transfers&lt;/strong&gt; — list, create, cancel transfers; download proofs; verify payees&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Internal Transfers&lt;/strong&gt; — create transfers between accounts in the same organization&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bulk Transfers&lt;/strong&gt; — list and view bulk transfer batches&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recurring Transfers&lt;/strong&gt; — list and view recurring transfers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Clients&lt;/strong&gt; — list, create, update, delete clients&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Client&lt;/strong&gt;…&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/alexey-pelykh/qontoctl" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;





&lt;p&gt;&lt;em&gt;P.S. The $1,597 is the pay-as-you-go API estimate. I'm on Claude Max x20 ($200/month), and this project consumed roughly 30% of it. Actual out-of-pocket: ~$60.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>typescript</category>
      <category>ai</category>
      <category>opensource</category>
      <category>mcp</category>
    </item>
    <item>
      <title>The RSS Illusion: 63 GB Process on a 32 GB Machine</title>
      <dc:creator>Alexey Pelykh</dc:creator>
      <pubDate>Wed, 18 Mar 2026 14:40:00 +0000</pubDate>
      <link>https://dev.to/alexey-pelykh/the-rss-illusion-63-gb-process-on-a-32-gb-machine-298n</link>
      <guid>https://dev.to/alexey-pelykh/the-rss-illusion-63-gb-process-on-a-32-gb-machine-298n</guid>
      <description>&lt;p&gt;macOS displayed "Apps out of memory - iTerm2: 63.89 GB" on my 32 GB machine.&lt;/p&gt;

&lt;p&gt;iTerm2 is my terminal. It doesn't do anything that should consume 64 GB. So I went looking.&lt;/p&gt;

&lt;h2&gt;
  
  
  The real culprit
&lt;/h2&gt;

&lt;p&gt;Every Claude Code session runs as a child process of iTerm2. macOS attributes all descendant memory to the parent application. That's why the dialog blamed iTerm2.&lt;/p&gt;

&lt;p&gt;I had 37 iTerm tabs open, each with a Claude Code session. Most were idle. Finished conversations I never closed.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;ps aux&lt;/code&gt; reported 4.1 GB total RSS across all 95 Claude processes. The macOS &lt;code&gt;footprint&lt;/code&gt; tool reported 62.7 GB.&lt;/p&gt;

&lt;p&gt;A 15x gap.&lt;/p&gt;

&lt;h2&gt;
  
  
  How RSS misleads on macOS
&lt;/h2&gt;

&lt;p&gt;RSS (Resident Set Size) counts pages physically resident in RAM. When macOS compresses or swaps dirty pages, RSS drops. The process &lt;em&gt;appears&lt;/em&gt; to shrink.&lt;/p&gt;

&lt;p&gt;But &lt;code&gt;footprint&lt;/code&gt; tracks dirty pages regardless of compression state. Those pages are still attributed to the process. They still count against system memory pressure. Activity Monitor and the "out of memory" dialog use &lt;code&gt;footprint&lt;/code&gt;, not RSS.&lt;/p&gt;

&lt;p&gt;The result: a process can show 7 MB RSS while holding 1.3 GB of dirty, non-reclaimable memory. RSS doesn't just undercount. It creates a dangerous illusion. The process looks like it's using &lt;em&gt;less&lt;/em&gt; memory over time while actually consuming &lt;em&gt;more&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;If you're monitoring macOS workloads with &lt;code&gt;ps&lt;/code&gt; or anything RSS-based, you're flying blind.&lt;/p&gt;
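&lt;p&gt;To make that concrete, here's what an RSS-based monitor actually measures. On macOS, &lt;code&gt;ps&lt;/code&gt; reports RSS in kilobytes; helper names below are hypothetical:&lt;/p&gt;

```python
import subprocess

def total_rss_mb(ps_output: str, name_filter: str = "claude") -> float:
    """Sum RSS for matching processes from `ps -axo rss=,comm=` output.

    `ps` reports RSS in KB. This total is what an RSS-based monitor
    alerts on -- and what the macOS memory-pressure dialog ignores in
    favor of footprint.
    """
    total_kb = 0
    for line in ps_output.strip().splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        rss_kb, comm = parts
        if name_filter in comm.lower():
            total_kb += int(rss_kb)
    return total_kb / 1024

def live_rss_snapshot_mb() -> float:
    """Live snapshot via `ps` (macOS/Linux)."""
    out = subprocess.run(["ps", "-axo", "rss=,comm="],
                         capture_output=True, text=True, check=True).stdout
    return total_rss_mb(out)
```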

&lt;h2&gt;
  
  
  Decomposing the footprint
&lt;/h2&gt;

&lt;p&gt;Using &lt;code&gt;footprint -p &amp;lt;pid&amp;gt;&lt;/code&gt;, each Claude Code process breaks down into memory categories. The pattern across sessions of different ages tells the story:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Session Age&lt;/th&gt;
&lt;th&gt;WebKit malloc&lt;/th&gt;
&lt;th&gt;IOAccelerator&lt;/th&gt;
&lt;th&gt;Total Footprint&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Fresh (3 hrs)&lt;/td&gt;
&lt;td&gt;343 MB (77%)&lt;/td&gt;
&lt;td&gt;48 MB (11%)&lt;/td&gt;
&lt;td&gt;443 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8 days idle&lt;/td&gt;
&lt;td&gt;231 MB (23%)&lt;/td&gt;
&lt;td&gt;711 MB (71%)&lt;/td&gt;
&lt;td&gt;996 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;15 days idle&lt;/td&gt;
&lt;td&gt;265 MB (20%)&lt;/td&gt;
&lt;td&gt;968 MB (73%)&lt;/td&gt;
&lt;td&gt;1,324 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;IOAccelerator starts small and grows to dominate. Every allocation is marked dirty and non-reclaimable. macOS cannot free this memory without killing the process.&lt;/p&gt;

&lt;h2&gt;
  
  
  128 MB slabs that never get freed
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;vmmap&lt;/code&gt; reveals the IOAccelerator memory is structured as 128 MB slabs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Address Range                    VSIZE    RSDNT  DIRTY   SWAP
54db4000000-54dbc000000  128.0M  1440K  1440K  95.7M   ← oldest slab
54dbc000000-54dc4000000  128.0M     0K     0K  95.3M
54dc4000000-54dcc000000  128.0M     0K     0K 126.3M
...
54dfc000000-54e04000000  128.0M     0K     0K    80K   ← newest slab
(reserved)                768.0M     0K     0K     0K   ← pre-allocated
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The oldest slabs fill to 95-128 MB. When one fills, a new one is allocated. They are never freed or reused. A reserved block pre-allocates VM address space for future growth.&lt;/p&gt;
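&lt;p&gt;Tallying slabs from this output is mechanical. A minimal sketch (the sample rows mirror the excerpt above; a real script would feed it the full &lt;code&gt;vmmap&lt;/code&gt; output for the process):&lt;/p&gt;

```python
# Sketch: tally IOAccelerator slabs from vmmap-style rows.
# Columns per row: address range, VSIZE, RSDNT, DIRTY, SWAP.
VMMAP_SAMPLE = """\
54db4000000-54dbc000000  128.0M  1440K  1440K  95.7M
54dbc000000-54dc4000000  128.0M     0K     0K  95.3M
54dc4000000-54dcc000000  128.0M     0K     0K 126.3M
"""

def to_mb(field):
    """Convert a vmmap size field like '95.7M' or '1440K' to MB."""
    value, unit = float(field[:-1]), field[-1]
    return value if unit == "M" else value / 1024

def tally_slabs(text):
    rows = [line.split() for line in text.splitlines() if line.strip()]
    swapped = round(sum(to_mb(r[4]) for r in rows), 1)   # SWAP column
    resident = round(sum(to_mb(r[2]) for r in rows), 1)  # RSDNT column
    return len(rows), swapped, resident

print(tally_slabs(VMMAP_SAMPLE))  # slab count, MB swapped, MB resident
```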

&lt;p&gt;After 15 days idle: 10 slabs, 966 MB swapped, 1.4 MB resident. Peak footprint for this single session hit 2.8 GB.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where does the GPU stack come from?
&lt;/h2&gt;

&lt;p&gt;Claude Code is built on Bun, which uses JavaScriptCore from WebKit. It renders its TUI using Ink, a React-based terminal rendering framework.&lt;/p&gt;

&lt;p&gt;Despite being a terminal REPL that outputs ANSI escape codes, the process loads a full GPU stack:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Metal.framework&lt;/li&gt;
&lt;li&gt;MetalPerformanceShaders.framework (MPSNeuralNetwork, MPSNDArray, MPSImage)&lt;/li&gt;
&lt;li&gt;IOAccelerator.framework&lt;/li&gt;
&lt;li&gt;IOSurface.framework&lt;/li&gt;
&lt;li&gt;GPUWrangler.framework&lt;/li&gt;
&lt;li&gt;GPUCompiler.framework&lt;/li&gt;
&lt;li&gt;WebCore.framework&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No explicit Metal API calls exist in the binary. So where does this come from?&lt;/p&gt;

&lt;p&gt;My first hypothesis was JSC/WebKit's rendering infrastructure. Testing disproved it. Loading JSC and WebKit directly via &lt;code&gt;dlopen()&lt;/code&gt; in a C test program produced zero IOAccelerator allocations and zero GPU frameworks. Standalone Bun also loads zero Metal or GPU frameworks.&lt;/p&gt;

&lt;p&gt;The GPU framework stack is loaded specifically by Claude Code. Something in its dependency tree triggers it. What's proven: it's not JSC and it's not Bun's baseline runtime. The exact dependency remains unidentified.&lt;/p&gt;

&lt;h2&gt;
  
  
  Isolation testing
&lt;/h2&gt;

&lt;p&gt;To narrow the cause, I ran control tests at multiple layers:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Test&lt;/th&gt;
&lt;th&gt;IOAccel Slabs&lt;/th&gt;
&lt;th&gt;IOAccel Dirty&lt;/th&gt;
&lt;th&gt;Footprint&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;C + dlopen(JSC) + eval&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;3.7 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;C + dlopen(WebKit)&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;2.0 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bun idle (sleep)&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1.3 MB&lt;/td&gt;
&lt;td&gt;6.5 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bun + heavy JSON parsing&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1.4 MB&lt;/td&gt;
&lt;td&gt;6.7 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bun + HTTP streaming (20 req)&lt;/td&gt;
&lt;td&gt;1 (2 regions)&lt;/td&gt;
&lt;td&gt;2.3 MB&lt;/td&gt;
&lt;td&gt;13 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Code (3 hrs active)&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;46 MB&lt;/td&gt;
&lt;td&gt;443 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Code (15 days idle)&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;966 MB&lt;/td&gt;
&lt;td&gt;1,324 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Two layers emerged.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 1 - Bun baseline&lt;/strong&gt;: Bun allocates a 128 MB IOAccelerator slab on startup. JSC alone (via &lt;code&gt;dlopen&lt;/code&gt;) doesn't. This is Bun-specific, small, and fixed. No Metal or GPU frameworks are loaded. I tested 12 JSC/Bun environment variables and flags. None affected the allocation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 2 - Claude Code growth&lt;/strong&gt;: Claude Code loads the full Metal/GPU framework stack and grows from 1 slab to 10+ over its lifetime. HTTP streaming in standalone Bun caused growth from 1 to 2 IOAccelerator regions in 20 seconds, suggesting sustained network I/O is a contributor. Claude Code streams API responses for hours, which would amplify this.&lt;/p&gt;

&lt;p&gt;Running &lt;code&gt;leaks&lt;/code&gt; on the 15-day idle session reported 175,613 leaked objects totaling 13.5 MB in the standard malloc zone alone. The WebKit malloc zone was unreadable due to security restrictions. The actual leak count is likely much higher.&lt;/p&gt;

&lt;p&gt;The session's file descriptors were all revoked. No GPU device handles remained open. The IOAccelerator memory was orphaned: buffers still allocated, with no active GPU connection left to own them.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this means
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;For monitoring&lt;/strong&gt;: If you monitor macOS workloads using RSS, you can get a 15x underestimate for long-running processes with IOAccelerator allocations. Use &lt;code&gt;footprint&lt;/code&gt; or &lt;code&gt;kern.memorystatus_level&lt;/code&gt; instead.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Real memory cost per process&lt;/span&gt;
footprint &lt;span class="nt"&gt;-p&lt;/span&gt; &amp;lt;pid&amp;gt;

&lt;span class="c"&gt;# System memory pressure as percentage&lt;/span&gt;
sysctl &lt;span class="nt"&gt;-n&lt;/span&gt; kern.memorystatus_level
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;For runtime selection&lt;/strong&gt;: Bun allocates IOAccelerator-tagged memory on startup that JSC alone doesn't. It's small at baseline, but Claude Code shows what happens when a large application runs on top for hours: the allocation grows to nearly 1 GB and is never reclaimed. If your Bun application does sustained network I/O, monitor with &lt;code&gt;footprint&lt;/code&gt;, not RSS.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For Claude Code users&lt;/strong&gt;: Close idle sessions. Each one accumulates ~1 GB of non-reclaimable footprint after a few hours of active use. If macOS reports "out of memory" for your terminal, check for accumulated Claude processes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's still open&lt;/strong&gt;: Which Claude Code dependency loads the Metal/GPU framework stack that standalone Bun doesn't? Is the slab growth driven by sustained network streaming, terminal rendering, or both? These questions are tracked at &lt;a href="https://github.com/oven-sh/bun/issues/28234" rel="noopener noreferrer"&gt;oven-sh/bun#28234&lt;/a&gt; and &lt;a href="https://github.com/anthropics/claude-code/issues/35804" rel="noopener noreferrer"&gt;anthropics/claude-code#35804&lt;/a&gt;. Corrections to the original reports have been issued.&lt;/p&gt;

</description>
      <category>macos</category>
      <category>debugging</category>
      <category>claudecode</category>
      <category>bunjs</category>
    </item>
    <item>
      <title>We Thought Our AI Reviews Were 98.6% Valid. Independent Validation Said 69%.</title>
      <dc:creator>Alexey Pelykh</dc:creator>
      <pubDate>Tue, 17 Mar 2026 12:28:25 +0000</pubDate>
      <link>https://dev.to/alexey-pelykh/we-thought-our-ai-reviews-were-986-valid-independent-validation-said-69-2mdl</link>
      <guid>https://dev.to/alexey-pelykh/we-thought-our-ai-reviews-were-986-valid-independent-validation-said-69-2mdl</guid>
      <description>&lt;p&gt;The most dangerous thing about AI-augmented work isn't the errors. It's thinking you're not making them.&lt;/p&gt;

&lt;p&gt;I ran 449 AI-assisted code reviews on OCA (Odoo Community Association) open source PRs in 9 days. When I had the AI assess its own review quality, it said 98.6% valid. When I ran independent validation, the number dropped to 68.9%. The validation used 40 separate AI instances, each reading the actual code diffs and verifying every technical claim.&lt;/p&gt;

&lt;p&gt;That 30-point gap should concern anyone using AI for serious work.&lt;/p&gt;

&lt;h2&gt;
  
  
  The experiment
&lt;/h2&gt;

&lt;p&gt;Between February 24 and March 4, 2026, I reviewed 449 unique pull requests across 6 OCA repositories using AI-assisted workflows. Each PR got a full technical review: architecture assessment, bug identification, security analysis, test coverage evaluation. The output was structured code review comments posted directly to GitHub.&lt;/p&gt;

&lt;p&gt;For scale: OCA's most prolific human reviewer has done 2,197 unique PR reviews over 9.5 years. My campaign produced 449 in 9 days.&lt;/p&gt;

&lt;p&gt;The reviews weren't rubber-stamps either. They found real security vulnerabilities (portal sudo bypass, cross-record token exposure, getattr traversal), caught bugs that multiple human reviewers missed, and provided actionable technical feedback that PR authors implemented.&lt;/p&gt;

&lt;p&gt;But how good were they really?&lt;/p&gt;

&lt;h2&gt;
  
  
  Round 1: The self-assessment trap
&lt;/h2&gt;

&lt;p&gt;My first attempt at validation was obvious: have AI evaluate the reviews. I fed each review to an evaluator and asked "Is this review technically valid?"&lt;/p&gt;

&lt;p&gt;Result: 98.6% valid.&lt;/p&gt;

&lt;p&gt;This number is worthless.&lt;/p&gt;

&lt;p&gt;The evaluator was reading the review text - not the actual code. It was checking whether the review &lt;em&gt;sounded&lt;/em&gt; plausible, not whether the claims matched reality. A review that confidently describes a &lt;code&gt;pre_init_hook&lt;/code&gt; performance pattern scores well on plausibility. The fact that no &lt;code&gt;pre_init_hook&lt;/code&gt; exists anywhere in the 7,362-line diff? The evaluator had no way to know.&lt;/p&gt;

&lt;p&gt;This is the fundamental problem with self-assessment. AI evaluating AI-generated text is pattern-matching for coherence, not verifying truth. It's the equivalent of grading your own exam by checking whether your handwriting is neat.&lt;/p&gt;

&lt;p&gt;I discarded the entire Round 1 dataset.&lt;/p&gt;

&lt;h2&gt;
  
  
  Round 2: Independent validation against actual code
&lt;/h2&gt;

&lt;p&gt;Round 2 used a different approach. I dispatched 40 independent AI instances (I call them "subclauds"), each assigned to a single PR. Each one:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Retrieved the actual PR diff using &lt;code&gt;gh pr diff&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Read every technical claim in the review&lt;/li&gt;
&lt;li&gt;Independently verified each claim against the real code&lt;/li&gt;
&lt;li&gt;Classified the review as VALID, PARTIALLY VALID, RUBBER-STAMP, or INVALID - with evidence&lt;/li&gt;
&lt;/ol&gt;
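&lt;p&gt;The classification step reduces to a small decision function over per-claim verdicts. A minimal sketch (the thresholds are my own illustration, not the campaign's exact rubric):&lt;/p&gt;

```python
# Sketch: classify a review from claim-verification results.
# Thresholds here are illustrative assumptions, not the actual rubric.

def classify(verified, total, read_diff):
    """verified: claims confirmed against the diff; total: claims made;
    read_diff: whether the review shows evidence of reading the diff."""
    if not read_diff or total == 0:
        return "RUBBER-STAMP"
    if verified == total:
        return "VALID"
    if verified == 0:
        return "INVALID"
    return "PARTIALLY VALID"

assert classify(3, 3, True) == "VALID"
assert classify(2, 3, True) == "PARTIALLY VALID"  # real but incomplete
assert classify(0, 4, True) == "INVALID"          # core claims wrong
assert classify(0, 0, False) == "RUBBER-STAMP"    # "LGTM, CI green"
```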

&lt;p&gt;The key difference: validators had the ground truth. They weren't evaluating whether the review sounded right. They were checking whether each claim matched the code.&lt;/p&gt;

&lt;h2&gt;
  
  
  The results
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Count&lt;/th&gt;
&lt;th&gt;%&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Fully valid&lt;/td&gt;
&lt;td&gt;303&lt;/td&gt;
&lt;td&gt;68.9%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Partially valid&lt;/td&gt;
&lt;td&gt;97&lt;/td&gt;
&lt;td&gt;22.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rubber-stamp&lt;/td&gt;
&lt;td&gt;33&lt;/td&gt;
&lt;td&gt;7.5%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Invalid&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;0.9%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;68.9% fully valid. Combined with partially valid: 90.9%.&lt;/p&gt;

&lt;p&gt;Not terrible. But 30 points below what self-assessment reported.&lt;/p&gt;

&lt;h3&gt;
  
  
  What "partially valid" means
&lt;/h3&gt;

&lt;p&gt;Most partially valid reviews had genuinely correct observations but missed important issues in the diff. A review might correctly identify three concerns but miss a critical fourth one. The feedback it gave was real - it just wasn't complete.&lt;/p&gt;

&lt;h3&gt;
  
  
  What "rubber-stamp" means
&lt;/h3&gt;

&lt;p&gt;33 reviews (7.5%) approved PRs without evidence of reading the diff. These are the reviews that said "LGTM, CI green" on a 3,500-line new module with no inline comments. One approved a security-sensitive portal module with zero tests and gave it three words.&lt;/p&gt;

&lt;h3&gt;
  
  
  What "invalid" means
&lt;/h3&gt;

&lt;p&gt;Four reviews were factually wrong at their core:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One described bugs from version 16.0 in an 18.0 migration review. The variable names it cited don't exist in the diff.&lt;/li&gt;
&lt;li&gt;One approved a fix that a maintainer corrected 15 minutes later.&lt;/li&gt;
&lt;li&gt;One flagged features as "missing" that were intentionally removed per months of community discussion.&lt;/li&gt;
&lt;li&gt;One requested changes for a state value that already exists correctly in the code.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The fabrication problem
&lt;/h2&gt;

&lt;p&gt;Out of roughly 2,000 total claims across 440 validated PRs, 4 were fabricated. About 0.2%.&lt;/p&gt;

&lt;p&gt;But each fabrication is instructive:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Phantom pattern&lt;/strong&gt;: Described a &lt;code&gt;pre_init_hook&lt;/code&gt; performance pattern with confidence. No &lt;code&gt;pre_init_hook&lt;/code&gt; exists anywhere in the 7,362-line diff. The AI generated a plausible Odoo code pattern from training knowledge rather than reading the actual code.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Phantom tests&lt;/strong&gt;: Claimed "tests pass" on a module with zero tests. Codecov was failing. The AI assumed tests exist because most modules have them.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Wrong version&lt;/strong&gt;: All three bug descriptions cite variable names and code patterns from version 16.0 source code, not the 18.0 migration diff under review. The AI was analyzing code it had memorized from training, not code in the PR.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Invisible tests&lt;/strong&gt;: Claimed "module doesn't include any tests" when a 187-line test file with 6 test methods exists in the PR. The AI missed a file it should have read.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The common thread: every fabrication stems from the AI substituting pattern-matched expectations for actual observation. It "knows" what Odoo modules typically look like and fills in the blanks rather than reading what's actually there.&lt;/p&gt;

&lt;h2&gt;
  
  
  The quality curve
&lt;/h2&gt;

&lt;p&gt;Quality wasn't uniform. It improved over time, then degraded with volume.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Period&lt;/th&gt;
&lt;th&gt;Valid rate&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Timesheet (early)&lt;/td&gt;
&lt;td&gt;34%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HR&lt;/td&gt;
&lt;td&gt;50%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bank-statement-import&lt;/td&gt;
&lt;td&gt;59%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Project&lt;/td&gt;
&lt;td&gt;70%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sale-workflow (early batches)&lt;/td&gt;
&lt;td&gt;87%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sale-workflow (late batches)&lt;/td&gt;
&lt;td&gt;70%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The jump from 34% to 87% shows genuine learning - prompts improved, edge cases were handled, failure modes were addressed. The regression from 87% to 70% shows volume fatigue - the same degradation pattern that affects human reviewers doing batch work.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this matters beyond code review
&lt;/h2&gt;

&lt;p&gt;The 30-point validation gap isn't specific to code review. It's a structural problem with any AI-assisted workflow where:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;The output looks plausible.&lt;/strong&gt; Well-written text passes surface-level scrutiny.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Self-assessment is circular.&lt;/strong&gt; AI checking AI text measures coherence, not correctness.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ground truth verification requires extra work.&lt;/strong&gt; Actually checking claims against reality takes effort most people skip.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you're using AI for research, writing, analysis, or decision support, the same gap likely exists. You just haven't measured it yet.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to validate your own AI output
&lt;/h2&gt;

&lt;p&gt;The methodology is reusable:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Separate the evaluator from the generator.&lt;/strong&gt; Don't ask the same model to grade its own output.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Give the evaluator ground truth.&lt;/strong&gt; The evaluator must have access to the source material, not just the AI's output.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Require evidence for every claim.&lt;/strong&gt; Each verification should quote specific evidence from the source.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use categorical classification with clear definitions.&lt;/strong&gt; Valid / Partially Valid / Rubber-stamp / Invalid gives you actionable data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Run at scale.&lt;/strong&gt; A few spot checks won't reveal systemic patterns. I validated 440 reviews to see the quality curve.&lt;/li&gt;
&lt;/ol&gt;
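&lt;p&gt;Step 3 can be enforced mechanically: a verification only counts if its supporting quote appears verbatim in the ground truth. A minimal sketch (the claim/evidence shape is my own assumption):&lt;/p&gt;

```python
# Sketch: reject verifications that cannot quote their evidence.

def verify_claim(evidence_quote, source_text):
    """A claim is verified only when its quoted evidence is found
    verbatim in the ground-truth source."""
    return bool(evidence_quote) and evidence_quote in source_text

source = "def action_confirm(self):\n    self.state = 'done'"
assert verify_claim("self.state = 'done'", source)  # evidence exists
assert not verify_claim("pre_init_hook", source)    # phantom pattern
assert not verify_claim("", source)                 # no evidence given
```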

&lt;p&gt;The cost of this validation was a fraction of the cost of generating the reviews. The cost of NOT validating? Thinking you're at 98.6% when you're at 68.9%.&lt;/p&gt;

&lt;h2&gt;
  
  
  The bottom line
&lt;/h2&gt;

&lt;p&gt;Self-assessed AI quality is a vanity metric. If you're measuring your AI workflow by asking "does this look right?" you're overestimating quality by 20-30 points.&lt;/p&gt;

&lt;p&gt;Validate against ground truth, not against the AI's own output. The gap you find will be uncomfortable. That discomfort is the point.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>codereview</category>
      <category>opensource</category>
      <category>devtools</category>
    </item>
    <item>
      <title>The Missing Category in the AI Agent Landscape</title>
      <dc:creator>Alexey Pelykh</dc:creator>
      <pubDate>Tue, 10 Mar 2026 15:28:55 +0000</pubDate>
      <link>https://dev.to/alexey-pelykh/the-missing-category-in-the-ai-agent-landscape-3bfd</link>
      <guid>https://dev.to/alexey-pelykh/the-missing-category-in-the-ai-agent-landscape-3bfd</guid>
      <description>&lt;p&gt;There are over 100 projects that will build you an AI agent. You can get one in TypeScript, Rust, Python, Go, Zig, or Shell. You can run it on a Raspberry Pi or a Kubernetes cluster. You can talk to it through Telegram, WhatsApp, Slack, Discord, WeChat, or 15 other channels.&lt;/p&gt;

&lt;p&gt;But if you already HAVE an agent -- a Claude Code setup with custom skills and CLAUDE.md, a tuned Gemini CLI workflow, a Codex integration your team depends on -- and you want to message it from your phone? The options are a handful of single-channel scripts and a lot of empty space.&lt;/p&gt;

&lt;p&gt;I spent months mapping this landscape: 115+ projects across 10 categories, from full rewrites (5 language ports of OpenClaw alone) to managed hosting services to single-file Telegram bridges. What emerged is a gap I'm calling "agent middleware" -- and I built a project to fill it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Two Kinds of Users
&lt;/h2&gt;

&lt;p&gt;There are two fundamentally different people evaluating AI agent tools right now:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;         Builds Agent Logic              Bridges Existing Agents
              &amp;lt;----------------------------------------------&amp;gt;
              |                                              |
  Many        |  OpenClaw, NanoClaw        RemoteClaw        |
  Channels    |  AstrBot, CoPaw           cc-connect         |
              |  LangBot, PocketPaw                          |
              |                                              |
  Few/No      |  Nanobot, ZeroClaw        TinyClaw           |
  Channels    |  IronClaw, MicroClaw      claude-pipe        |
              |  Moltis, OpenFang         Claude-Code-Remote |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The left side is a crowded, well-served market. OpenClaw alone has 250K+ stars and 36,900 forks. NanoClaw offers the same idea in 15 source files. At least five Rust rewrites compete for the "same thing, but faster" niche.&lt;/p&gt;

&lt;p&gt;The right side is almost empty.&lt;/p&gt;

&lt;p&gt;This is the developer who already has Claude Code configured with a custom &lt;code&gt;~/.claude&lt;/code&gt; directory, or a Gemini CLI setup they've spent weeks tuning, or a Codex workflow integrated into their team's process. They do not want a new agent. They want to send a message to the agent they already have -- from their phone, from a Slack channel, from WhatsApp.&lt;/p&gt;

&lt;p&gt;Where does your setup fall on this spectrum? &lt;a href="https://docs.remoteclaw.org/landscape" rel="noopener noreferrer"&gt;Bookmark the full landscape reference for the complete data.&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Fork Explosion
&lt;/h2&gt;

&lt;p&gt;In February 2026, OpenClaw was forking at 100 per hour. By March, the ecosystem had produced five complete language rewrites (Rust, Go, Python, Zig, Shell), a dozen managed hosting services, and over 60 forks with meaningful modifications.&lt;/p&gt;

&lt;p&gt;The fork explosion was not about OpenClaw being bad. It was about OpenClaw being almost-right for too many different use cases. Every fork adjusts the same core product for a different audience: lighter, more secure, Chinese-market-native, edge-deployable, enterprise-ready.&lt;/p&gt;

&lt;p&gt;But almost every fork keeps the same fundamental architecture: a platform that owns the agent loop, runs its own LLM orchestration, and bundles everything from memory to skills to model management.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Missing Category
&lt;/h2&gt;

&lt;p&gt;We call this gap &lt;strong&gt;agent middleware&lt;/strong&gt;: software that connects existing AI agents to messaging channels without owning the agent loop.&lt;/p&gt;

&lt;p&gt;The boundary test for agent middleware is simple: does it route through infrastructure, or does it try to be the agent?&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Agent Middleware&lt;/th&gt;
&lt;th&gt;Agent Platform&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Bridges to your CLI agent&lt;/td&gt;
&lt;td&gt;Runs its own LLM calls&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Preserves your agent's config&lt;/td&gt;
&lt;td&gt;Requires its own configuration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Adds channels, sessions, scheduling&lt;/td&gt;
&lt;td&gt;Adds memory, skills, model management&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Your &lt;code&gt;~/.claude&lt;/code&gt; is the agent&lt;/td&gt;
&lt;td&gt;Its built-in orchestrator is the agent&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This is not a quality judgment. Platforms like OpenClaw, NanoClaw, and Nanobot are excellent at what they do. The distinction is architectural: they own the agent loop, agent middleware does not.&lt;/p&gt;

&lt;p&gt;CLI agents ship new capabilities monthly. A platform that bundles its own versions of those capabilities is building on quicksand. OpenClaw's 294,000 lines of code and 5,300+ open issues are the natural result. NanoClaw and Nanobot exist because the full platform became too heavy.&lt;/p&gt;

&lt;p&gt;Middleware only provides what a CLI agent cannot provide for itself: sessions, channel routing, scheduling, and gateway services. Everything else is the agent's job.&lt;/p&gt;
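&lt;p&gt;The session capability above is small in code terms. A minimal sketch of the routing layer (class and method names are mine, not any project's real API):&lt;/p&gt;

```python
import uuid

class SessionRouter:
    """Maps (channel, chat_id) pairs to persistent agent session ids,
    so messages from the same chat always resume the same agent session."""

    def __init__(self):
        self.sessions = {}

    def session_for(self, channel, chat_id):
        key = (channel, chat_id)
        if key not in self.sessions:
            # Real middleware would also spawn the CLI agent here,
            # e.g. resuming a stored session id as a subprocess.
            self.sessions[key] = uuid.uuid4().hex
        return self.sessions[key]

router = SessionRouter()
first = router.session_for("telegram", 42)
assert router.session_for("telegram", 42) == first   # same chat resumes
assert router.session_for("slack", "C123") != first  # new chat, new session
```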

&lt;h2&gt;
  
  
  The Convergence Evidence
&lt;/h2&gt;

&lt;p&gt;Multiple independent developers arrived at the same conclusion:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Project&lt;/th&gt;
&lt;th&gt;Runtime&lt;/th&gt;
&lt;th&gt;Channels&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;claude-code-telegram&lt;/td&gt;
&lt;td&gt;Claude Code&lt;/td&gt;
&lt;td&gt;Telegram&lt;/td&gt;
&lt;td&gt;SDK + CLI fallback, cron&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ccbot&lt;/td&gt;
&lt;td&gt;Claude Code&lt;/td&gt;
&lt;td&gt;Telegram&lt;/td&gt;
&lt;td&gt;tmux-based&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;claude-pipe&lt;/td&gt;
&lt;td&gt;Claude Code&lt;/td&gt;
&lt;td&gt;Telegram + Discord&lt;/td&gt;
&lt;td&gt;~1,000 lines&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude-Code-Remote&lt;/td&gt;
&lt;td&gt;Claude, Gemini, Cursor&lt;/td&gt;
&lt;td&gt;Email, Discord, Telegram&lt;/td&gt;
&lt;td&gt;Multi-runtime&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;cc-connect&lt;/td&gt;
&lt;td&gt;Claude, Gemini, Codex, Cursor&lt;/td&gt;
&lt;td&gt;8 channels&lt;/td&gt;
&lt;td&gt;Cron, voice&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;cc-connect&lt;/strong&gt; bridges four CLI runtimes to eight messaging channels with cron scheduling and voice support. Same multi-runtime, multi-channel concept, implemented as a lightweight bridge.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LangBot&lt;/strong&gt; is the closest thing to production middleware from the Chinese ecosystem: 11+ messaging platforms, integrations with Dify, Coze, n8n, and other agent runtimes. Pure bridge, no agent logic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Claude-to-IM-skill&lt;/strong&gt; bridges Claude Code and Codex to Telegram, Discord, and Feishu simultaneously, with persistent sessions and a permission system.&lt;/p&gt;

&lt;p&gt;When 10 developers independently build the same Telegram bridge without knowing about each other, that is not a trend. It is a product category announcing itself.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Agent Middleware Actually Does
&lt;/h2&gt;

&lt;p&gt;If middleware does not own the agent loop, what does it provide?&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Capability&lt;/th&gt;
&lt;th&gt;What It Does&lt;/th&gt;
&lt;th&gt;Why a CLI Agent Cannot Do This&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Sessions&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Maps Telegram conversations to persistent agent sessions&lt;/td&gt;
&lt;td&gt;CLI agent does not know about Telegram sessions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Channel routing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Routes WhatsApp and Slack messages to the same agent&lt;/td&gt;
&lt;td&gt;CLI agent assumes a terminal&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Scheduling&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;"Analyze revenue at 8am, post to Slack"&lt;/td&gt;
&lt;td&gt;CLI agent cannot trigger itself&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Gateway services&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Auth, rate limiting, tool access policies&lt;/td&gt;
&lt;td&gt;CLI agent has no network layer&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;These capabilities are infrastructure-bound. They only make sense when there is a system between the user and the agent. The moment you want to access your agent from your phone, you need all of them.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# What the setup looks like&lt;/span&gt;
npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; remoteclaw
remoteclaw init &lt;span class="nt"&gt;--channel&lt;/span&gt; telegram &lt;span class="nt"&gt;--runtime&lt;/span&gt; claude
remoteclaw start
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The Landscape
&lt;/h2&gt;

&lt;p&gt;Here is a simplified map of how the ecosystem divides. This is not exhaustive -- the full reference (linked below) covers 115+ projects across 10 categories.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mobile remote control apps&lt;/strong&gt; (Happy Coder, CloudCLI) solve the "remote access" need through native apps rather than messaging. They compete for the same user but through a different channel.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bot frameworks&lt;/strong&gt; (Botpress, Rasa, Chatwoot) connect to messaging channels but own the conversation logic. They are platforms, not middleware.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent orchestration frameworks&lt;/strong&gt; (LangGraph, CrewAI, AutoGen) build multi-agent systems but do not provide messaging channel integration. They are infrastructure for agent logic, not for message delivery.&lt;/p&gt;

&lt;p&gt;If you are building a single-channel bridge for Claude Code, &lt;a href="https://docs.remoteclaw.org/channels" rel="noopener noreferrer"&gt;check if RemoteClaw already supports your channel&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why I Built RemoteClaw
&lt;/h2&gt;

&lt;p&gt;I built &lt;a href="https://remoteclaw.org" rel="noopener noreferrer"&gt;RemoteClaw&lt;/a&gt; because I spent months inside the OpenClaw codebase -- analyzing 5,605 files across 334 analysis batches -- and realized that the channel infrastructure was exactly what developers with existing agents needed, but the platform layer was exactly what they did not.&lt;/p&gt;

&lt;p&gt;RemoteClaw is a fork of OpenClaw that strips the platform layer and replaces it with an AgentRuntime interface. Your CLI agent runs as a subprocess, preserving your configuration untouched. The gateway handles sessions, channels, and 50 MCP tools. The agent handles everything else.&lt;/p&gt;

&lt;p&gt;It is middleware, not a platform. It connects the agent you already have.&lt;/p&gt;

&lt;h2&gt;
  
  
  Full Reference
&lt;/h2&gt;

&lt;p&gt;The complete landscape data -- 115+ projects, 10 categories, channel coverage comparison, and architecture classification -- is available on our documentation site:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.remoteclaw.org/landscape" rel="noopener noreferrer"&gt;Agent Middleware Landscape Reference&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We are certain we missed some. If you find a project we missed or a description that needs correction, please &lt;a href="https://github.com/remoteclaw/remoteclaw/issues" rel="noopener noreferrer"&gt;open an issue&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Will the "right side" of this map fill up in 2026, or will platforms absorb the middleware function? I have a strong opinion. What's yours?&lt;/p&gt;




&lt;p&gt;&lt;em&gt;RemoteClaw is open-source middleware that bridges CLI AI agents to 22+ messaging channels. &lt;a href="https://github.com/remoteclaw/remoteclaw" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; | &lt;a href="https://docs.remoteclaw.org/quick-start" rel="noopener noreferrer"&gt;Quick Start&lt;/a&gt; | &lt;a href="https://docs.remoteclaw.org" rel="noopener noreferrer"&gt;Documentation&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>ai</category>
      <category>devtools</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>From AI-Augmented Human to Human-Augmented AI</title>
      <dc:creator>Alexey Pelykh</dc:creator>
      <pubDate>Thu, 05 Mar 2026 14:45:29 +0000</pubDate>
      <link>https://dev.to/alexey-pelykh/from-ai-augmented-human-to-human-augmented-ai-2ni</link>
      <guid>https://dev.to/alexey-pelykh/from-ai-augmented-human-to-human-augmented-ai-2ni</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5ig3s6a399196ajmm5jt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5ig3s6a399196ajmm5jt.png" alt="From AI-Augmented Human to Human-Augmented AI" width="800" height="993"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Sometime in late 2025, the relationship between software engineers and AI inverted.&lt;/p&gt;

&lt;p&gt;I can't pinpoint the exact moment. But looking at my workflow across ten active projects - Java libraries, LinkedIn automation tools, Odoo modules, browser extensions - the pattern is clear. I stopped using AI to help me write code. AI writes the code. I specify what to build, review what comes back, and steer when it drifts.&lt;/p&gt;

&lt;p&gt;The terminology is catching up. Andrej Karpathy declared "vibe coding" passé in February 2026 and promoted "agentic engineering" - where "you are not writing the code directly 99% of the time." Nicholas Zakas mapped a three-stage progression: Coder to Conductor to Orchestrator. Researchers formalized it in an arXiv preprint as "Software Engineering 3.0," analyzing 456,000 AI-authored pull requests across 61,000 repositories.&lt;/p&gt;

&lt;p&gt;Different labels. Same observation: the human moved from doing the work with AI assistance to overseeing AI doing the work.&lt;/p&gt;

&lt;p&gt;But here's what nobody is saying clearly enough: most of the industry hasn't made this transition. Many haven't entered any AI era at all.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three Eras, One Industry
&lt;/h2&gt;

&lt;p&gt;Cross-referencing data from Jellyfish, Bain, Stack Overflow, and McKinsey, the software industry is operating in three distinct modes simultaneously.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Era 0: Pre-GenAI (~30-35% of organizations).&lt;/strong&gt; These companies have AI tool licenses. Their developers have Copilot seats. Nothing has changed. Bain calls it "rollout without adoption" - tools deployed, workflows unchanged. Three of four companies say the hardest part isn't the technology. It's getting people to change how they work.&lt;/p&gt;

&lt;p&gt;The engineers at these companies write code the same way they did in 2022. The AI subscription shows up on the expense report. The AI doesn't show up in the workflow.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Era 1: AI-Augmented Human (~50-55%).&lt;/strong&gt; This is where most AI-adopting organizations sit. Individual developers use Copilot, Cursor, or ChatGPT as smarter autocomplete. They get 10-15% productivity gains at the individual level. They still write the code. AI helps.&lt;/p&gt;

&lt;p&gt;The problem: the coding bottleneck moves, but nothing else changes. Review processes, testing infrastructure, security scanning, deployment workflows - all pre-AI. Faster code generation creates bottlenecks everywhere downstream. One Fortune 50 analysis showed a 10x increase in security findings per month after widespread AI adoption - more code hitting the pipeline meant more surface area.&lt;/p&gt;

&lt;p&gt;The typical symptom: "Our developers are faster but we're not shipping faster."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Era 2: Human-Augmented AI (~10-15%).&lt;/strong&gt; This is the inversion. AI is the primary producer across the delivery chain. Humans focus on specification, architecture, steering, review, and judgment.&lt;/p&gt;

&lt;p&gt;The Sanity engineering team documented this in detail: AI writes 80% of initial implementations. The first attempt is "95% garbage." By the third iteration, the output is workable. Features ship 2-3x faster overall. Rakuten tested it on a 12.5 million line codebase - Claude Code completed a feature implementation in 7 hours of autonomous work with 99.9% accuracy. Zero human code contribution during execution.&lt;/p&gt;

&lt;p&gt;These organizations redesigned their entire delivery chain around AI. Not just the coding step. Everything downstream too. The maturity timeline: 18-24 months of compounding investment to get here.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Evidence Is Messy (On Purpose)
&lt;/h2&gt;

&lt;p&gt;The data supporting this shift exists. So does data complicating it. Both deserve honest treatment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The case for the inversion.&lt;/strong&gt; GitHub Copilot generates 46% of code for its users (61% for Java). Google reports 25%+ of new code is AI-generated. Microsoft says 20-30%. Nearly half of all code written in 2025 was AI-generated. By raw volume, AI is the primary producer in adopting organizations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The case for skepticism.&lt;/strong&gt; In early 2025, METR ran a randomized controlled trial and found experienced open-source developers were 19% slower with AI tools. Not faster. Slower. Those developers believed they were 20% faster - a perception-reality gap of 39 percentage points.&lt;/p&gt;

&lt;p&gt;But that study has a sequel. When METR tried to replicate it in late 2025, 30-50% of developers refused to submit tasks they didn't want to do without AI. Returning participants from the original study showed an 18% speedup. METR's own February 2026 assessment: developers are "likely more sped up from AI tools now" than in early 2025. The original finding was a snapshot of early-2025 tools on familiar codebases. The reversal itself is evidence of how fast the shift happened.&lt;/p&gt;

&lt;p&gt;Code quality concerns remain real regardless. CodeRabbit's analysis of 470 PRs found AI-generated code had 1.7x more issues, with performance problems at roughly 8x the rate. GitClear analyzed 211 million changed lines: refactoring collapsed from 24% to 9.5%, code duplication rose eightfold.&lt;/p&gt;

&lt;p&gt;Trust is declining while adoption surges. Stack Overflow's 2025 survey: 84% of developers use AI tools, but trust in accuracy dropped from 40% to 29%. Only 3% report high trust. Forty-six percent actively distrust.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The reconciliation.&lt;/strong&gt; The productivity trajectory is upward, but the quality and trust problems are structural - they don't disappear with better models. The METR reversal shows developers getting faster. The CodeRabbit and GitClear data show the code getting worse. Both are true simultaneously.&lt;/p&gt;

&lt;p&gt;The real picture: AI is a genuine capability amplifier for bounded tasks. It is simultaneously a quality degrader, a security risk (Fortune 50 data showed a 10x vulnerability spike), and a perception distorter. These things are all true at the same time.&lt;/p&gt;

&lt;p&gt;The organizations in Era 2 aren't ignoring these problems. They're building systems to manage them. The organizations in Era 0 and Era 1 aren't managing them because they don't know they have them.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the Human Actually Does Now
&lt;/h2&gt;

&lt;p&gt;In the Era 2 workflow, the human's job changes fundamentally.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Specification.&lt;/strong&gt; Writing detailed prompts, specs, and context documents. This is where most of the value gets created. A vague specification produces garbage output regardless of the model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Architecture.&lt;/strong&gt; System design, technology selection, integration patterns. AI can implement a pattern. It can't choose the right one for your business context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Steering.&lt;/strong&gt; Redirecting when AI drifts, constraining the solution space. The Sanity team's experience makes sense: the first attempt is 95% garbage not because the AI is bad, but because iterative refinement with human judgment is the workflow.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Review.&lt;/strong&gt; Evaluating AI output for correctness, security, and maintainability. This is the new bottleneck. Organizations that treat review as a cost center are accumulating technical debt they can't see yet.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context provisioning.&lt;/strong&gt; Building CLAUDE.md files, providing codebase context, configuring tools. MIT Technology Review called this "context engineering" - the discipline that replaced "prompt engineering" in 2025.&lt;/p&gt;
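
&lt;p&gt;A CLAUDE.md is just a markdown brief the agent reads at startup. The shape below is illustrative - the project, commands, and file paths are placeholders, not from any real repository.&lt;/p&gt;

```markdown
# CLAUDE.md

## Project
Odoo module for invoice validation. Python 3.11, Odoo 17.

## Commands
- Run tests: `pytest tests/ -x`
- Lint: `ruff check .`

## Conventions
- Follow OCA module structure; do not edit migration scripts by hand.
- Every new model needs an entry in `security/ir.model.access.csv`.
```

&lt;p&gt;The point is not the file format. It is that the judgment about what context matters - which commands, which conventions, which constraints - stays with the human.&lt;/p&gt;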

&lt;p&gt;&lt;strong&gt;Judgment.&lt;/strong&gt; Edge cases, trade-offs, business logic. The things that require understanding the business, not just the code.&lt;/p&gt;

&lt;p&gt;What humans are not doing: writing boilerplate, implementing known patterns, generating test scaffolding, routine refactoring. These tasks made up a significant portion of a developer's day. They're delegated now.&lt;/p&gt;

&lt;p&gt;This is an identity crisis for many developers. GitHub frames it as moving from "code producer to creative director of code." Sixty-five percent expect their role to be redefined in 2026. If your career identity is tied to writing code, being told your value is in what you specify rather than what you type requires a fundamental rethink.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Means for Engineering Leaders
&lt;/h2&gt;

&lt;p&gt;Three things matter if you're a CTO or VP Engineering.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Know which era you're actually in.&lt;/strong&gt; Not which era you think you're in. The METR perception-reality gap applies to organizations, not just individuals. If your developers have AI tools but your delivery metrics haven't changed, you're in Era 0 regardless of how many Copilot licenses you're paying for.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The coding step is only 25-35% of development time.&lt;/strong&gt; Bain's analysis: concept to launch includes requirements, design, implementation, testing, deployment, and maintenance. Even a 50% improvement in the coding step translates to only 12-17% faster delivery overall. The organizations seeing 25-30% overall gains redesigned the full chain, not just the coding step.&lt;/p&gt;
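
&lt;p&gt;The arithmetic is a simple Amdahl-style bound: the overall saving is the coding step's share of total time multiplied by how much of that step you eliminate. A quick check of the figures quoted above:&lt;/p&gt;

```python
def delivery_speedup(coding_share: float, coding_improvement: float) -> float:
    """Fraction of total delivery time saved when only the coding
    step gets faster.

    coding_share: fraction of delivery time spent coding
    coding_improvement: fraction of coding time eliminated
    """
    return coding_share * coding_improvement

# Figures from Bain's analysis as quoted above: coding is 25-35% of
# delivery time, and that step improves by 50%.
low = delivery_speedup(0.25, 0.50)   # 0.125
high = delivery_speedup(0.35, 0.50)  # 0.175
print(f"{low:.1%} to {high:.1%} faster delivery")  # 12.5% to 17.5%
```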

&lt;p&gt;&lt;strong&gt;The junior pipeline is breaking.&lt;/strong&gt; Employment for software developers aged 22-25 fell nearly 20% from the 2022 peak. Fifty-four percent of engineering leaders plan to hire fewer juniors. This creates a time bomb: the senior engineers of 2030 need to be hired as juniors in 2026. The organizations figuring out AI-accelerated junior development - 18 months to mid-level instead of three years - will have a structural advantage. Those that simply stop hiring juniors are borrowing from a future they haven't thought through.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Honest Position
&lt;/h2&gt;

&lt;p&gt;The shift from AI-augmented human to human-augmented AI is real. It is also incomplete, unevenly distributed, and complicated by quality and security trade-offs that most organizations aren't measuring.&lt;/p&gt;

&lt;p&gt;Calling it a paradigm shift is accurate for the 10-15% in Era 2. For the majority, it's an unrealized possibility sitting unused behind a subscription login.&lt;/p&gt;

&lt;p&gt;The most productive framing isn't "AI is replacing developers" or "AI is just a tool." It's recognizing that the relationship changed - and that the organizations and individuals who understand the new terms are pulling ahead of those who don't.&lt;/p&gt;

&lt;p&gt;The gap is widening. Not because the technology demands it. Because the people who adapted first are setting the standard everyone else will be measured against.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>softwareengineering</category>
      <category>devtools</category>
      <category>career</category>
    </item>
  </channel>
</rss>
