<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Dylan Brown</title>
    <description>The latest articles on DEV Community by Dylan Brown (@dylan_brown_4c803aefcfe51).</description>
    <link>https://dev.to/dylan_brown_4c803aefcfe51</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3845674%2F558fcc6f-3390-42e3-9ed6-37646f67d30d.jpg</url>
      <title>DEV Community: Dylan Brown</title>
      <link>https://dev.to/dylan_brown_4c803aefcfe51</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/dylan_brown_4c803aefcfe51"/>
    <language>en</language>
    <item>
      <title>I tracked Claude Code and Codex pass-rates for 95 days — what "getting dumber" actually looks like</title>
      <dc:creator>Dylan Brown</dc:creator>
      <pubDate>Sat, 30 May 2026 05:04:45 +0000</pubDate>
      <link>https://dev.to/dylan_brown_4c803aefcfe51/i-tracked-claude-code-and-codex-pass-rates-for-95-days-what-getting-dumber-actually-looks-like-21le</link>
      <guid>https://dev.to/dylan_brown_4c803aefcfe51/i-tracked-claude-code-and-codex-pass-rates-for-95-days-what-getting-dumber-actually-looks-like-21le</guid>
      <description>&lt;p&gt;Every few weeks a thread blows up: &lt;em&gt;"Is Claude Code getting worse?"&lt;/em&gt; Someone swears Opus got lazy after an update; someone else says it's placebo. The arguments are always vibes — nobody posts numbers.&lt;/p&gt;

&lt;p&gt;So I built a tracker. For ~95 days it's logged the daily &lt;strong&gt;SWE-Bench-Pro pass rate&lt;/strong&gt; for Claude Code and Codex — the % of real coding tasks each agent completes unassisted — and plotted them as candlesticks (open = yesterday, close = today, wick = the 90% confidence interval for that day's sample). Same idea as a stock K-line, except the "price" is &lt;em&gt;how often the agent actually solves the task&lt;/em&gt;.&lt;/p&gt;
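
&lt;p&gt;For readers who want to see how one such candle could be assembled, here is a minimal sketch. The post does not say how the confidence interval is computed, so the Wilson score interval, the sample size of 100 tasks, and the names &lt;code&gt;wilson_interval&lt;/code&gt; and &lt;code&gt;make_candle&lt;/code&gt; are all assumptions made for illustration.&lt;/p&gt;

```python
import math

def wilson_interval(passed, total, z=1.645):
    """Wilson score interval for a pass rate; z=1.645 gives roughly 90% coverage."""
    if total == 0:
        raise ValueError("need at least one task")
    p_hat = passed / total
    denom = 1 + z * z / total
    center = (p_hat + z * z / (2 * total)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half

def make_candle(prev_rate, passed, total):
    """One daily candle: open = yesterday's rate, close = today's, wick = 90% CI."""
    close = passed / total
    low, high = wilson_interval(passed, total)
    return {"open": prev_rate, "close": close, "low": low, "high": high}

# Hypothetical day: 52 of 100 tasks solved, after a 65% day.
candle = make_candle(prev_rate=0.65, passed=52, total=100)
print(candle)
```

&lt;p&gt;With a sample this size the wick spans several points on either side of the close, which is why a single candle says little on its own.&lt;/p&gt;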

&lt;p&gt;Here's what the data says — and it's more interesting than "it got dumber."&lt;/p&gt;

&lt;h2&gt;Claude Code: a real step up, then a recent slide&lt;/h2&gt;

&lt;p&gt;Plotting per-model-version baselines (median of the first 14 days after each release) makes the story obvious:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Opus 4.6 era&lt;/strong&gt; — baseline &lt;strong&gt;~54%&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Opus 4.7 era&lt;/strong&gt; — baseline &lt;strong&gt;~65%&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That 4.6 → 4.7 jump is a genuine &lt;strong&gt;+11 percentage point&lt;/strong&gt; step. Not placebo — the model got materially better at finishing tasks, and it held ~65% steady for a month.&lt;/p&gt;
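
&lt;p&gt;The baseline rule described above (median of the first 14 days after each release) can be sketched in a few lines. The daily numbers below are made up to land on the reported baselines, and &lt;code&gt;era_baseline&lt;/code&gt; is a hypothetical helper name, not the tracker's code.&lt;/p&gt;

```python
from statistics import median

def era_baseline(daily_rates, window=14):
    """Median of the first `window` daily pass rates after a release."""
    if not daily_rates:
        raise ValueError("no data for this era")
    return median(daily_rates[:window])

# Made-up daily pass rates (percent), oldest first, grouped by model era.
eras = {
    "opus-4.6": [53, 55, 54, 52, 56, 54, 55, 53, 54, 54, 55, 53, 54, 56, 54, 55],
    "opus-4.7": [64, 66, 65, 65, 67, 63, 65, 66, 64, 65, 65, 66, 64, 65, 52, 51],
}
baselines = {name: era_baseline(rates) for name, rates in eras.items()}
step_pp = baselines["opus-4.7"] - baselines["opus-4.6"]
print(baselines, step_pp)
```

&lt;p&gt;Because only the first 14 days feed the median, later drift (the trailing low values in the second list) does not drag the baseline down with it.&lt;/p&gt;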

&lt;p&gt;Then the last ~7 days: &lt;strong&gt;today's pass rate is ~52%&lt;/strong&gt;, well below the 65% baseline and past the significance threshold (p &amp;lt; 0.05). So the "Claude Code feels worse lately" crowd isn't imagining it — there's a real, recent drift &lt;em&gt;below the current model's own established baseline.&lt;/em&gt; Whether it's a quantization change, a routing tweak, or load — the number moved, and it moved past noise.&lt;/p&gt;
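
&lt;p&gt;To make the significance claim concrete, here is one way such a check could look: a one-sided test of today's pass rate against the baseline, using the normal approximation to the binomial. The post does not state which test or sample size the tracker uses, so both are assumptions here, as is the function name.&lt;/p&gt;

```python
import math

def one_sided_p_value(passed, total, baseline):
    """P(a rate this low or lower) if the true rate equals `baseline`.

    Normal approximation to the binomial; reasonable when total is large.
    """
    p_hat = passed / total
    std_err = math.sqrt(baseline * (1 - baseline) / total)
    z = (p_hat - baseline) / std_err
    # Standard normal CDF evaluated at z, via the error function.
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Hypothetical sample: 52 of 100 tasks solved today, baseline 65%.
p_value = one_sided_p_value(passed=52, total=100, baseline=0.65)
small_sample_p = one_sided_p_value(passed=10, total=20, baseline=0.65)
print(round(p_value, 4), round(small_sample_p, 4))
```

&lt;p&gt;Sample size matters as much as the gap: under these assumptions the 100-task day clears the 0.05 threshold, while a similar drop measured on only 20 tasks does not.&lt;/p&gt;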

&lt;p&gt;The nuance most threads miss: Claude Code is &lt;strong&gt;both&lt;/strong&gt; "much better than it was on Opus 4.6" &lt;strong&gt;and&lt;/strong&gt; "drifting down this week." Both are true. Vibes can't hold two facts at once; data can.&lt;/p&gt;

&lt;h2&gt;Codex: three versions, basically flat&lt;/h2&gt;

&lt;p&gt;Now the part nobody expects. Across three Codex releases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;gpt-5.3-codex&lt;/strong&gt; — ~58%&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;gpt-5.4-xhigh&lt;/strong&gt; — ~54%&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;gpt-5.5-xhigh&lt;/strong&gt; — ~56%&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Three "major" version bumps, and the pass rate just oscillates in a &lt;strong&gt;54–58% band&lt;/strong&gt;. No step change. The releases didn't move the benchmark needle the way Opus 4.7 did. If you've felt like "new Codex doesn't feel smarter" — the data agrees: it's been flat.&lt;/p&gt;

&lt;h2&gt;Why candlesticks (and a fixed 0–100 axis)&lt;/h2&gt;

&lt;p&gt;Two design choices that matter if you want to read drift honestly:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Fixed 0–100% y-axis.&lt;/strong&gt; Auto-scaling per time window makes a 5pp dip look catastrophic because the view zooms in. A 5pp drop should &lt;em&gt;look like&lt;/em&gt; a 5pp drop whether you're comparing 30 days or 90, Claude or Codex.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Per-era baselines, not one flat line.&lt;/strong&gt; A single baseline across model versions lies about the older model. Each release gets its own dashed reference, so you can see the &lt;em&gt;step&lt;/em&gt;, not just the absolute level.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The live, daily-updating version (red/green toggle for CN vs Western convention, daily/weekly K, 30/90/all windows per agent) is here: &lt;strong&gt;&lt;a href="https://keaiapi.com/coding-agent-tracker" rel="noopener noreferrer"&gt;Drift K-Line tracker →&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;What this means if you ship with these agents&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Don't trust a single bad day.&lt;/strong&gt; One red candle is usually inside the noise band. A &lt;em&gt;week&lt;/em&gt; below baseline is signal. Watch the baseline, not the last datapoint.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;"Newer version" ≠ "smarter."&lt;/strong&gt; Codex's flat line is the proof. Benchmark before you migrate a workflow to a new release.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Capability drifts. Your costs shouldn't have to.&lt;/strong&gt; If an agent quietly drops 13pp, the last thing you want is to &lt;em&gt;also&lt;/em&gt; be locked into one vendor's pricing while you wait it out.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Author note: I build &lt;a href="https://keaiapi.com" rel="noopener noreferrer"&gt;keaiapi&lt;/a&gt;, a pay-as-you-go aggregator that routes Claude, GPT, Gemini, DeepSeek and 20+ models through one OpenAI-compatible endpoint — so when a model drifts, you can switch the one you point at without rewriting code or eating a subscription. The tracker above is a free tool we run; no signup needed to read it. Methodology notes are on the tracker page.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>productivity</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Building an Autonomous AI Agent That Writes Novels — Architecture of a 10-Agent Pipeline</title>
      <dc:creator>Dylan Brown</dc:creator>
      <pubDate>Fri, 27 Mar 2026 16:00:03 +0000</pubDate>
      <link>https://dev.to/dylan_brown_4c803aefcfe51/building-an-autonomous-ai-agent-that-writes-novels-architecture-of-a-10-agent-pipeline-59pf</link>
      <guid>https://dev.to/dylan_brown_4c803aefcfe51/building-an-autonomous-ai-agent-that-writes-novels-architecture-of-a-10-agent-pipeline-59pf</guid>
      <description>&lt;p&gt;AI-generated fiction has a consistency problem. Ask any LLM to write chapter 1 of a novel and it'll do a decent job. Ask it to write chapter 30 and it has no idea what happened in the first 29.&lt;/p&gt;

&lt;p&gt;I built &lt;a href="https://github.com/Narcooo/inkos" rel="noopener noreferrer"&gt;InkOS&lt;/a&gt; to solve this. It's an open-source CLI AI agent that writes, audits, and revises novels autonomously — using a pipeline of 10 specialized AI agents with persistent state tracking across the entire book.&lt;/p&gt;

&lt;p&gt;This post walks through the architecture and the specific engineering problems it solves.&lt;/p&gt;

&lt;h2&gt;The Problem&lt;/h2&gt;

&lt;p&gt;Most AI writing tools work like this: you give the model a prompt, it generates text, you copy it, repeat. There's no memory between chapters. After 20+ chapters, you run into:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Continuity breaks&lt;/strong&gt; — characters remember things they never witnessed, weapons reappear after being lost, relationships reset&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context bloat&lt;/strong&gt; — injecting all previous state into each prompt hits token limits, causes 400 errors, costs $200/chapter in API calls&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hook accumulation&lt;/strong&gt; — the model plants plot hooks but never resolves them. After 30 chapters you have 40+ dangling threads&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI voice&lt;/strong&gt; — every paragraph uses the same words ("delve", "tapestry", "testament", "intricate"), sentence structure is monotonous, and there's excessive summarization&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;The Architecture: 10 Agents in Sequence&lt;/h2&gt;

&lt;p&gt;Instead of one model doing everything, InkOS splits the work across 10 specialized agents:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Radar → Planner → Composer → Architect → Writer → Observer → Reflector → Normalizer → Auditor → Reviser
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each agent has exactly one job:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Agent&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Radar&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Scans platform trends and reader preferences (pluggable, skippable)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Planner&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Reads author intent + current focus + memory retrieval, produces chapter intent with must-keep/must-avoid lists&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Composer&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Selects relevant context from truth files by relevance, compiles rule stack&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Architect&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Plans chapter structure: outline, scene beats, pacing targets&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Writer&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Produces prose from composed context (length-governed, dialogue-driven)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Observer&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Over-extracts 9 categories of facts from the chapter text&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Reflector&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Outputs Zod-validated JSON deltas for state updates&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Normalizer&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Single-pass compress/expand to hit the target word count band&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Auditor&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Validates draft against 7 truth files across 33 dimensions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Reviser&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Auto-fixes critical issues, flags others for human review&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If the audit fails, the pipeline loops back: revise → re-audit until all critical issues are resolved.&lt;/p&gt;
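
&lt;p&gt;The revise and re-audit loop can be sketched as follows. This is an illustration of the control flow only, not InkOS's code: the &lt;code&gt;audit&lt;/code&gt; and &lt;code&gt;revise&lt;/code&gt; callables are toy stand-ins for the agents, and the &lt;code&gt;max_rounds&lt;/code&gt; cap is an added assumption so the sketch cannot loop forever.&lt;/p&gt;

```python
def run_audit_loop(draft, audit, revise, max_rounds=3):
    """Re-audit after each revision until no critical issues remain.

    `audit(draft)` returns a list of issue dicts; `revise(draft, issues)`
    returns a new draft. Both are caller-supplied stand-ins for the agents.
    """
    for round_number in range(max_rounds + 1):
        issues = audit(draft)
        critical = [i for i in issues if i["severity"] == "critical"]
        if not critical:
            return draft, issues, round_number
        if round_number == max_rounds:
            break
        draft = revise(draft, critical)
    raise RuntimeError("critical issues remain after max_rounds revisions")

def toy_audit(draft):
    # Flags a continuity break: the sword was lost earlier in the book.
    if "draws his sword" in draft:
        return [{"severity": "critical", "note": "sword was lost in chapter 3"}]
    return [{"severity": "minor", "note": "pacing slightly slow"}]

def toy_revise(draft, critical_issues):
    return draft.replace("draws his sword", "raises his fists")

final_draft, remaining, rounds = run_audit_loop(
    "Kael draws his sword.", toy_audit, toy_revise
)
print(final_draft, rounds)
```

&lt;p&gt;Non-critical issues are returned rather than fixed, which mirrors the Reviser's split between auto-fixing and flagging for human review.&lt;/p&gt;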

&lt;h2&gt;State Management: 7 Truth Files&lt;/h2&gt;

&lt;p&gt;Every book maintains 7 canonical truth files as the single source of truth:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;current_state.md&lt;/code&gt; — character locations, relationships, knowledge, emotional arcs&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;particle_ledger.md&lt;/code&gt; — resource accounting: items, money, stats with quantities&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;pending_hooks.md&lt;/code&gt; — open plot threads, foreshadowing, unresolved conflicts&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;chapter_summaries.md&lt;/code&gt; — per-chapter summaries with state changes&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;subplot_board.md&lt;/code&gt; — A/B/C subplot line status tracking&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;emotional_arcs.md&lt;/code&gt; — per-character emotion tracking and growth&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;character_matrix.md&lt;/code&gt; — interaction matrix, encounter records, information boundaries&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Auditor checks every draft against these files. If a character "remembers" something they never witnessed, or pulls a weapon they lost two chapters ago, the Auditor catches it.&lt;/p&gt;

&lt;p&gt;Since v0.6, truth files are stored as Zod-validated JSON (&lt;code&gt;story/state/*.json&lt;/code&gt;). The Reflector outputs JSON deltas — not full markdown rewrites. Corrupted data is rejected, not propagated.&lt;/p&gt;
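
&lt;p&gt;To show why validated deltas keep bad data out, here is a rough sketch of the idea. InkOS uses Zod in TypeScript; the hand-rolled checks below are a plain-Python stand-in, and the delta shape (&lt;code&gt;op&lt;/code&gt;, &lt;code&gt;path&lt;/code&gt;, &lt;code&gt;value&lt;/code&gt;) is invented for illustration rather than taken from the project.&lt;/p&gt;

```python
def validate_delta(delta):
    """Reject malformed deltas instead of letting them reach the state files."""
    if not isinstance(delta, dict):
        raise ValueError("delta must be an object")
    if delta.get("op") not in ("set", "remove"):
        raise ValueError("op must be 'set' or 'remove'")
    path = delta.get("path")
    if not isinstance(path, list) or not path or not all(isinstance(k, str) for k in path):
        raise ValueError("path must be a non-empty list of strings")
    if delta["op"] == "set" and "value" not in delta:
        raise ValueError("'set' needs a value")
    return delta

def apply_delta(state, delta):
    """Validate first, then apply one delta in place and return the state."""
    validate_delta(delta)
    *parents, leaf = delta["path"]
    node = state
    for key in parents:
        node = node.setdefault(key, {})
    if delta["op"] == "set":
        node[leaf] = delta["value"]
    else:
        node.pop(leaf, None)
    return state

state = {"characters": {"kael": {"location": "Ashford"}}}
apply_delta(state, {"op": "set", "path": ["characters", "kael", "location"], "value": "Redgate"})

try:
    apply_delta(state, {"op": "teleport", "path": ["characters", "kael"]})
    rejected = False
except ValueError:
    rejected = True
print(state, rejected)
```

&lt;p&gt;Validation runs before any mutation, so a rejected delta leaves the stored state untouched.&lt;/p&gt;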

&lt;h2&gt;Solving Context Bloat: SQLite Temporal Memory&lt;/h2&gt;

&lt;p&gt;On Node 22+, InkOS uses a SQLite temporal memory database (&lt;code&gt;story/memory.db&lt;/code&gt;). Instead of injecting all 7 truth files into every prompt (which blows up after 20 chapters), the Composer agent does relevance-based retrieval — pulling only the facts, hooks, and summaries that matter for the current chapter.&lt;/p&gt;
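
&lt;p&gt;A minimal sketch of selective retrieval, assuming a single hypothetical &lt;code&gt;facts&lt;/code&gt; table and treating "relevance" as a simple match on the entities that appear in the next chapter. The real schema and ranking in &lt;code&gt;story/memory.db&lt;/code&gt; are not described in this post, so everything below is illustrative.&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facts (chapter INTEGER, entity TEXT, fact TEXT)")
conn.executemany(
    "INSERT INTO facts VALUES (?, ?, ?)",
    [
        (3, "kael", "Kael lost his sword in the river."),
        (7, "mira", "Mira learned the guild is corrupt."),
        (12, "kael", "Kael owes the smith 40 gold."),
        (15, "tavern", "The Gilded Boar burned down."),
    ],
)

def retrieve_context(conn, entities, limit=5):
    """Return the most recent facts about the entities in the next chapter."""
    placeholders = ", ".join("?" for _ in entities)
    rows = conn.execute(
        f"SELECT chapter, fact FROM facts WHERE entity IN ({placeholders}) "
        "ORDER BY chapter DESC LIMIT ?",
        [*entities, limit],
    ).fetchall()
    return rows

# Only facts about Kael reach the prompt; the other rows stay in the database.
context = retrieve_context(conn, ["kael"])
print(context)
```

&lt;p&gt;The &lt;code&gt;limit&lt;/code&gt; parameter is what keeps prompt size bounded: the table can grow with every chapter while the retrieved context stays the same size.&lt;/p&gt;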

&lt;p&gt;This was the single biggest improvement in v0.6. Before: context bloat caused 400 errors and made each chapter cost $200+ in API calls. After: selective retrieval keeps context lean regardless of book length.&lt;/p&gt;

&lt;h2&gt;Hook Governance&lt;/h2&gt;

&lt;p&gt;One of the hardest problems in long-form AI fiction: the model loves planting hooks but never pays them off. After 30 chapters you'd have 40+ open threads, none resolving.&lt;/p&gt;

&lt;p&gt;The Planner agent now generates a &lt;code&gt;hookAgenda&lt;/code&gt; — scheduling which hooks to advance and which to resolve in each chapter. &lt;code&gt;analyzeHookHealth&lt;/code&gt; audits hook debt, &lt;code&gt;evaluateHookAdmission&lt;/code&gt; blocks duplicate hooks, and new &lt;code&gt;mention&lt;/code&gt; semantics prevents fake advancement (where the model references a hook without actually progressing it).&lt;/p&gt;
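
&lt;p&gt;As a rough illustration of the scheduling idea, the sketch below resolves the oldest overdue hook, advances a couple of others, and refuses exact duplicates. The function names, the age threshold, and the per-chapter limits are hypothetical; this is not how InkOS's &lt;code&gt;hookAgenda&lt;/code&gt; or &lt;code&gt;evaluateHookAdmission&lt;/code&gt; are implemented.&lt;/p&gt;

```python
def admit_hook(open_hooks, candidate):
    """Block a new hook whose normalized text duplicates an open one."""
    normalized = candidate.strip().lower()
    return all(h["text"].strip().lower() != normalized for h in open_hooks)

def build_agenda(open_hooks, chapter, max_age=10, resolve_per_chapter=1, advance_per_chapter=2):
    """Schedule the oldest overdue hooks for payoff and a few others for progress."""
    by_age = sorted(open_hooks, key=lambda h: h["planted"])
    overdue = [h for h in by_age if chapter - h["planted"] >= max_age]
    to_resolve = overdue[:resolve_per_chapter]
    resolve_ids = {h["id"] for h in to_resolve}
    to_advance = [h for h in by_age if h["id"] not in resolve_ids][:advance_per_chapter]
    return {
        "resolve": [h["id"] for h in to_resolve],
        "advance": [h["id"] for h in to_advance],
    }

hooks = [
    {"id": "H1", "text": "Who sent the sealed letter?", "planted": 2},
    {"id": "H2", "text": "The smith's missing daughter", "planted": 14},
    {"id": "H3", "text": "The cracked amulet", "planted": 25},
]
agenda = build_agenda(hooks, chapter=30)
duplicate_allowed = admit_hook(hooks, "  who sent the sealed letter? ")
fresh_allowed = admit_hook(hooks, "A stranger watches the inn")
print(agenda, duplicate_allowed, fresh_allowed)
```

&lt;p&gt;Capping resolutions per chapter is a design choice in this sketch: it pays down hook debt steadily without forcing every chapter to close several threads at once.&lt;/p&gt;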

&lt;h2&gt;De-AI-ification&lt;/h2&gt;

&lt;p&gt;Every genre profile includes a fatigue word list. For LitRPG: "delve", "tapestry", "testament", "intricate", "pivotal". The Auditor flags these automatically.&lt;/p&gt;
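
&lt;p&gt;Flagging words from a fatigue list is simple to sketch. The word set below is the LitRPG list quoted above; the function name and the whole-word, case-insensitive matching rule are assumptions, not the Auditor's actual logic.&lt;/p&gt;

```python
import re
from collections import Counter

FATIGUE_WORDS = {"delve", "tapestry", "testament", "intricate", "pivotal"}

def flag_fatigue_words(text, fatigue_words=FATIGUE_WORDS):
    """Count whole-word, case-insensitive hits for each fatigue word."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(t for t in tokens if t in fatigue_words)

sample = "It was a testament to the intricate tapestry of fate. A Testament indeed."
hits = flag_fatigue_words(sample)
print(hits)
```

&lt;p&gt;Exact matching misses inflected forms such as "delves" or "delving"; a fuller version would stem tokens or list the variants explicitly.&lt;/p&gt;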

&lt;p&gt;But detection alone isn't enough — InkOS bakes de-AI-ification into the Writer agent's prompts at the source: banned sentence patterns, style fingerprint injection, dialogue-driven scene guidance. &lt;code&gt;revise --mode anti-detect&lt;/code&gt; runs dedicated anti-detection rewriting on existing chapters.&lt;/p&gt;

&lt;p&gt;You can also clone any author's style: &lt;code&gt;inkos style analyze reference.txt&lt;/code&gt; extracts a statistical fingerprint (sentence length distribution, word frequency, rhythm profiles), and &lt;code&gt;inkos style import&lt;/code&gt; injects it into all future chapters.&lt;/p&gt;

&lt;h2&gt;Genre Support&lt;/h2&gt;

&lt;p&gt;10 English-native genre profiles, each with dedicated pacing rules, audit dimensions, and fatigue word lists:&lt;/p&gt;

&lt;p&gt;LitRPG, Progression Fantasy, Isekai, Cultivation, System Apocalypse, Dungeon Core, Romantasy, Sci-Fi, Tower Climber, Cozy Fantasy — plus 5 Chinese web novel genres.&lt;/p&gt;

&lt;h2&gt;Getting Started&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm i &lt;span class="nt"&gt;-g&lt;/span&gt; @actalk/inkos
inkos book create &lt;span class="nt"&gt;--title&lt;/span&gt; &lt;span class="s2"&gt;"The Last Delver"&lt;/span&gt; &lt;span class="nt"&gt;--genre&lt;/span&gt; litrpg
inkos write next
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One command writes a full chapter: draft → audit → auto-revise. Run &lt;code&gt;inkos up&lt;/code&gt; for daemon mode that writes chapters on a schedule.&lt;/p&gt;

&lt;p&gt;Works with Claude, GPT-4, or any OpenAI-compatible API including local models. Multi-model routing lets you put Claude on the Writer and GPT-4o on the Auditor.&lt;/p&gt;

&lt;p&gt;InkOS is also published as an &lt;a href="https://clawhub.ai" rel="noopener noreferrer"&gt;OpenClaw&lt;/a&gt; skill — install with &lt;code&gt;clawhub install inkos&lt;/code&gt; and any compatible agent can invoke it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/Narcooo/inkos" rel="noopener noreferrer"&gt;github.com/Narcooo/inkos&lt;/a&gt; (2.4k stars, MIT license)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;npm&lt;/strong&gt;: &lt;code&gt;npm i -g @actalk/inkos&lt;/code&gt;&lt;/p&gt;




&lt;p&gt;Would love feedback from anyone working on multi-agent systems, long-context state management, or creative AI. What continuity problems have you run into with long-form AI generation?&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>typescript</category>
      <category>writing</category>
    </item>
  </channel>
</rss>
