I tracked Claude Code and Codex pass rates for 95 days: what "getting dumber" actually looks like

Dylan Brown — Sat, 30 May 2026 05:04:45 +0000

Every few weeks a thread blows up: "Is Claude Code getting worse?" Someone swears Opus got lazy after an update; someone else says it's placebo. The arguments are always vibes — nobody posts numbers.

So I built a tracker. For ~95 days it's logged the daily SWE-Bench-Pro pass rate for Claude Code and Codex — the % of real coding tasks each agent completes unassisted — and plotted them as candlesticks (open = yesterday, close = today, wick = the 90% confidence interval for that day's sample). Same idea as a stock K-line, except the "price" is how often the agent actually solves the task.
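To make the candle definition concrete, here is a minimal sketch of how one day's candle could be computed. It assumes each day is a batch of independent pass/fail task runs and uses a Wilson score interval for the 90% wick; the post does not say which interval the tracker uses, and `Candle`, `wilson_interval`, `make_candle`, and the counts are all illustrative.

```python
import math
from dataclasses import dataclass

Z_90 = 1.6449  # normal quantile for a two-sided 90% interval


@dataclass
class Candle:
    open: float   # yesterday's pass rate
    close: float  # today's pass rate
    low: float    # lower end of today's 90% interval
    high: float   # upper end of today's 90% interval


def wilson_interval(passed: int, total: int, z: float = Z_90) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    p = passed / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


def make_candle(prev_rate: float, passed: int, total: int) -> Candle:
    """Build one day's candle from yesterday's rate and today's raw counts."""
    low, high = wilson_interval(passed, total)
    return Candle(open=prev_rate, close=passed / total, low=low, high=high)


# Hypothetical day: 52 of 100 tasks solved, after a 65% day.
candle = make_candle(prev_rate=0.65, passed=52, total=100)
print(candle)
```

The wick width depends heavily on the daily sample size, so the same 52% reading is far noisier on 20 tasks than on 200.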

Here's what the data says — and it's more interesting than "it got dumber."

Claude Code: a real step up, then a recent slide

Plotting per-model-version baselines (median of the first 14 days after each release) makes the story obvious:

Opus 4.6 era — baseline ~54%
Opus 4.7 era — baseline ~65%

That 4.6 → 4.7 jump is a genuine +11 percentage point step. Not placebo — the model got materially better at finishing tasks, and it held ~65% steady for a month.
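The baseline rule above (median of the first 14 days after each release) is simple enough to sketch directly. The daily numbers below are made up to land on the reported baselines; `era_baseline` is a hypothetical helper, not the tracker's code.

```python
from statistics import median

BASELINE_DAYS = 14  # window stated in the post


def era_baseline(daily_rates: list[float], window: int = BASELINE_DAYS) -> float:
    """Median pass rate over the first `window` days after a release."""
    if not daily_rates:
        raise ValueError("need at least one day of data")
    return median(daily_rates[:window])


# Made-up daily pass rates (percent), ordered from release day onward.
opus_46 = [53, 55, 54, 52, 56, 54, 55, 53, 54, 54, 55, 53, 54, 56]
opus_47 = [63, 66, 65, 64, 67, 65, 66, 64, 65, 63, 66, 65, 67, 64, 52, 51]

step = era_baseline(opus_47) - era_baseline(opus_46)
print(f"step between eras: {step:+.0f}pp")
```

Because the baseline only looks at the first 14 days, later readings (the two low values at the end of `opus_47`) do not drag the reference line down with them.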

Then the last ~7 days: today's pass rate is ~52%, well below the 65% baseline and past the significance threshold (p < 0.05). So the "Claude Code feels worse lately" crowd isn't imagining it — there's a real, recent drift below the current model's own established baseline. Whether it's a quantization change, a routing tweak, or load — the number moved, and it moved past noise.
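One way to check a claim like this is a one-sided exact binomial test: how likely is a result this low if the true rate were still at the baseline? The sketch below assumes independent task runs and a hypothetical sample of 100 tasks; the post does not state the tracker's sample size or which test it uses.

```python
from math import comb


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def below_baseline_p_value(passed: int, total: int, baseline: float) -> float:
    """One-sided p-value: chance of `passed` or fewer passes if the baseline still held."""
    return binom_cdf(passed, total, baseline)


# Hypothetical sample: 52 passes out of 100 tasks against a 65% baseline.
p_value = below_baseline_p_value(52, 100, 0.65)
print(f"p = {p_value:.4f}", "significant" if p_value < 0.05 else "within noise")
```

Sample size matters here: the same gap on a much smaller sample, such as 10 passes out of 20, would not clear the 0.05 threshold.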

The nuance most threads miss: Claude Code is both "much better than it was under Opus 4.6" and "drifting down this week." Both are true. Vibes can't hold two facts at once; data can.

Codex: three versions, basically flat

Now the part nobody expects. Across three Codex releases:

gpt-5.3-codex — ~58%
gpt-5.4-xhigh — ~54%
gpt-5.5-xhigh — ~56%

Three "major" releases, and the pass rate just oscillates in a 54–58% band. No step change. The releases didn't move the benchmark needle the way Opus 4.7 did. If you've felt like "new Codex doesn't feel smarter", the data agrees: it's been flat.

Why candlesticks (and a fixed 0–100 axis)

Two design choices that matter if you want to read drift honestly:

Fixed 0–100% y-axis. Auto-scaling per time window makes a 5pp dip look catastrophic because the view zooms in. A 5pp drop should look like a 5pp drop whether you're comparing 30 days or 90, Claude or Codex.
Per-era baselines, not one flat line. A single baseline across model versions lies about the older model. Each release gets its own dashed reference, so you can see the step, not just the absolute level.

The live, daily-updating version (red/green toggle for the Chinese vs Western color convention, daily/weekly K, 30/90/all windows per agent) is on the Drift K-Line tracker.

What this means if you ship with these agents

Don't trust a single bad day. One red candle is inside the noise band. A week below baseline is signal. Watch the baseline line, not the last datapoint.
"Newer version" ≠ "smarter." Codex's flat line is the proof. Benchmark before you migrate a workflow to a new release.
Capability drifts. Your costs shouldn't have to. If an agent quietly drops 13pp, the last thing you want is to also be locked into one vendor's pricing while you wait it out.
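The "one bad day vs. a bad week" rule from the list above can be written as a small check. This is a sketch of one possible alert rule, with a hypothetical `sustained_drift` helper and made-up readings, not the tracker's actual logic.

```python
def sustained_drift(daily_rates: list[float], baseline: float, days: int = 7) -> bool:
    """True when each of the last `days` readings sits below the baseline."""
    if len(daily_rates) < days:
        return False
    return all(rate < baseline for rate in daily_rates[-days:])


# Made-up daily pass rates (percent) against a 65% baseline.
one_bad_day = [66, 64, 65, 67, 65, 66, 52]
bad_week = [66, 58, 57, 55, 54, 53, 52, 52]

print(sustained_drift(one_bad_day, baseline=65))
print(sustained_drift(bad_week, baseline=65))
```

Since the baseline is a median, roughly half of ordinary days fall below it. If days were independent, seven in a row would happen by chance well under 1% of the time, which is why a full week is a much stronger signal than a single red candle.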

Author note: I build keaiapi, a pay-as-you-go aggregator that routes Claude, GPT, Gemini, DeepSeek and 20+ models through one OpenAI-compatible endpoint — so when a model drifts, you can switch the one you point at without rewriting code or eating a subscription. The tracker above is a free tool we run; no signup needed to read it. Methodology notes are on the tracker page.

Building an Autonomous AI Agent That Writes Novels — Architecture of a 10-Agent Pipeline

Dylan Brown — Fri, 27 Mar 2026 16:00:03 +0000

AI-generated fiction has a consistency problem. Ask any LLM to write chapter 1 of a novel and it'll do a decent job. Ask it to write chapter 30 and it has no idea what happened in the first 29.

I built InkOS to solve this. It's an open-source CLI AI agent that writes, audits, and revises novels autonomously — using a pipeline of 10 specialized AI agents with persistent state tracking across the entire book.

This post walks through the architecture and the specific engineering problems it solves.

The Problem

Most AI writing tools work like this: you give the model a prompt, it generates text, you copy it, repeat. There's no memory between chapters. After 20+ chapters, you run into:

Continuity breaks — characters remember things they never witnessed, weapons reappear after being lost, relationships reset
Context bloat — injecting all previous state into each prompt hits token limits, causes 400 errors, costs $200/chapter in API calls
Hook accumulation — the model plants plot hooks but never resolves them. After 30 chapters you have 40+ dangling threads
AI voice — every paragraph uses the same words ("delve", "tapestry", "testament", "intricate"), sentence structure is monotonous, and there's excessive summarization

The Architecture: 10 Agents in Sequence

Instead of one model doing everything, InkOS splits the work across 10 specialized agents:

Radar → Planner → Composer → Architect → Writer → Observer → Reflector → Normalizer → Auditor → Reviser

Each agent has exactly one job:

Agent	What it does
Radar	Scans platform trends and reader preferences (pluggable, skippable)
Planner	Reads author intent + current focus + memory retrieval, produces chapter intent with must-keep/must-avoid lists
Composer	Selects relevant context from truth files by relevance, compiles rule stack
Architect	Plans chapter structure: outline, scene beats, pacing targets
Writer	Produces prose from composed context (length-governed, dialogue-driven)
Observer	Over-extracts 9 categories of facts from the chapter text
Reflector	Outputs Zod-validated JSON deltas for state updates
Normalizer	Single-pass compress/expand to hit the target word count band
Auditor	Validates draft against 7 truth files across 33 dimensions
Reviser	Auto-fixes critical issues, flags others for human review

If the audit fails, the pipeline loops back: revise → re-audit until all critical issues are resolved.
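The revise → re-audit loop can be sketched as below. This is an illustration only: InkOS itself is a Node CLI, the agents here are toy stand-ins, and the round cap (`MAX_ROUNDS`) is an assumption the post does not mention.

```python
from typing import Callable

MAX_ROUNDS = 3  # assumed cap; the post does not state one


def audit_loop(
    draft: str,
    audit: Callable[[str], list[str]],
    revise: Callable[[str, list[str]], str],
    max_rounds: int = MAX_ROUNDS,
) -> tuple[str, list[str]]:
    """Revise until the audit reports no critical issues or rounds run out."""
    issues = audit(draft)
    rounds = 0
    while issues and rounds < max_rounds:
        draft = revise(draft, issues)
        issues = audit(draft)
        rounds += 1
    return draft, issues  # anything left over would go to human review


# Toy stand-ins for the Auditor and Reviser agents.
def toy_audit(text: str) -> list[str]:
    return ["lost sword reappears"] if "sword" in text else []


def toy_revise(text: str, issues: list[str]) -> str:
    return text.replace("sword", "dagger")


final, remaining = audit_loop("She drew her sword.", toy_audit, toy_revise)
print(final, remaining)
```

Capping the rounds keeps a stubborn issue from looping forever and burning API calls; whatever survives the cap is returned rather than silently dropped.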

State Management: 7 Truth Files

Every book maintains 7 canonical truth files as the single source of truth:

current_state.md — character locations, relationships, knowledge, emotional arcs
particle_ledger.md — resource accounting: items, money, stats with quantities
pending_hooks.md — open plot threads, foreshadowing, unresolved conflicts
chapter_summaries.md — per-chapter summaries with state changes
subplot_board.md — A/B/C subplot line status tracking
emotional_arcs.md — per-character emotion tracking and growth
character_matrix.md — interaction matrix, encounter records, information boundaries

The Auditor checks every draft against these files. If a character "remembers" something they never witnessed, or pulls a weapon they lost two chapters ago — the auditor catches it.

Since v0.6, truth files are stored as Zod-validated JSON (story/state/*.json). The Reflector outputs JSON deltas — not full markdown rewrites. Corrupted data is rejected, not propagated.
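The idea of validated deltas can be illustrated without Zod. The sketch below is a Python stand-in with a hypothetical delta shape (`character`, `field`, `value`); the real InkOS schemas are not shown in this post, so treat every name here as an assumption.

```python
import json

# Hypothetical delta shape: {"character": str, "field": str, "value": str}
REQUIRED_KEYS = {"character": str, "field": str, "value": str}


def parse_delta(raw: str) -> dict:
    """Parse one delta and reject it if the shape is wrong."""
    delta = json.loads(raw)  # malformed JSON raises JSONDecodeError, a ValueError
    if not isinstance(delta, dict) or set(delta) != set(REQUIRED_KEYS):
        raise ValueError(f"unexpected shape: {delta!r}")
    for key, expected in REQUIRED_KEYS.items():
        if not isinstance(delta[key], expected):
            raise ValueError(f"{key} must be {expected.__name__}")
    return delta


def apply_delta(state: dict, raw: str) -> dict:
    """Return a new state with the delta applied; the input state is untouched."""
    delta = parse_delta(raw)
    updated = {name: dict(fields) for name, fields in state.items()}
    updated.setdefault(delta["character"], {})[delta["field"]] = delta["value"]
    return updated


state = {"Mira": {"location": "Ashfall Keep"}}
state = apply_delta(state, '{"character": "Mira", "field": "location", "value": "The Undercity"}')

try:
    apply_delta(state, '{"character": "Mira", "location": 42}')
except ValueError as err:
    print("rejected:", err)
```

Validation happens before any copy or write, so a malformed delta leaves the previous state exactly as it was.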

Solving Context Bloat: SQLite Temporal Memory

On Node 22+, InkOS uses a SQLite temporal memory database (story/memory.db). Instead of injecting all 7 truth files into every prompt (which blows up after 20 chapters), the Composer agent does relevance-based retrieval — pulling only the facts, hooks, and summaries that matter for the current chapter.

This was the single biggest improvement in v0.6. Before: context bloat caused 400 errors and made each chapter cost $200+ in API calls. After: selective retrieval keeps context lean regardless of book length.
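A rough sketch of selective retrieval from a SQLite store follows. The post does not describe the schema or how relevance is scored, so the `facts` table, the entity-match rule, and the `retrieve` helper are all assumptions chosen to keep the example small.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory stand-in for a file like story/memory.db
conn.execute("CREATE TABLE facts (chapter INTEGER, entity TEXT, fact TEXT)")
conn.executemany(
    "INSERT INTO facts VALUES (?, ?, ?)",
    [
        (3, "Mira", "Mira lost her sword in the river"),
        (5, "Tobin", "Tobin owes the guild 40 gold"),
        (12, "Mira", "Mira learned the guild master is her uncle"),
        (14, "Kess", "Kess left for the northern pass"),
    ],
)


def retrieve(entities: list[str], before_chapter: int, limit: int = 20) -> list[str]:
    """Latest facts about the given entities, recorded before `before_chapter`."""
    if not entities:
        return []
    placeholders = ", ".join("?" for _ in entities)
    rows = conn.execute(
        f"SELECT fact FROM facts WHERE entity IN ({placeholders}) AND chapter < ? "
        "ORDER BY chapter DESC LIMIT ?",
        (*entities, before_chapter, limit),
    )
    return [row[0] for row in rows]


# Only facts about characters in the upcoming chapter reach the prompt.
context = retrieve(["Mira"], before_chapter=15)
print(context)
```

The `LIMIT` is what bounds prompt size: the number of injected facts stays fixed no matter how many chapters have been written.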

Hook Governance

One of the hardest problems in long-form AI fiction: the model loves planting hooks but never pays them off. After 30 chapters you'd have 40+ open threads, none resolving.

The Planner agent now generates a hookAgenda that schedules which hooks to advance and which to resolve in each chapter. analyzeHookHealth audits hook debt, evaluateHookAdmission blocks duplicate hooks, and new mention semantics prevent fake advancement (where the model references a hook without actually progressing it).
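To illustrate the scheduling idea, here is a toy version of an agenda and an admission check. The function names, thresholds, and the exact-text duplicate rule are hypothetical; they are loose analogues of hookAgenda and evaluateHookAdmission, not InkOS's actual logic.

```python
from dataclasses import dataclass


@dataclass
class Hook:
    text: str
    planted: int        # chapter where the hook was planted
    last_advanced: int  # last chapter that actually progressed it


def hook_agenda(hooks: list[Hook], chapter: int, stale_after: int = 10,
                max_resolve: int = 1, max_advance: int = 2) -> dict[str, list[str]]:
    """Pick the most neglected hooks: resolve the stalest, advance the next few."""
    by_neglect = sorted(hooks, key=lambda h: h.last_advanced)
    stale = [h for h in by_neglect if chapter - h.last_advanced >= stale_after]
    resolve = stale[:max_resolve]
    advance = [h for h in by_neglect if h not in resolve][:max_advance]
    return {"resolve": [h.text for h in resolve], "advance": [h.text for h in advance]}


def _normalize(text: str) -> str:
    return " ".join(text.lower().split())


def admit_hook(hooks: list[Hook], candidate: str) -> bool:
    """Block a new hook whose normalized text matches an open one."""
    return _normalize(candidate) not in {_normalize(h.text) for h in hooks}


open_hooks = [
    Hook("Who sent the sealed letter?", planted=2, last_advanced=4),
    Hook("The dungeon core is waking", planted=9, last_advanced=18),
    Hook("Tobin's debt to the guild", planted=5, last_advanced=15),
]
agenda = hook_agenda(open_hooks, chapter=20)
print(agenda)
print(admit_hook(open_hooks, "who sent the  sealed letter?"))
```

Tracking `last_advanced` separately from `planted` is what lets a scheduler tell a hook that was merely mentioned from one that actually moved.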

De-AI-ification

Every genre profile includes a fatigue word list. For LitRPG: "delve", "tapestry", "testament", "intricate", "pivotal". The Auditor flags these automatically.
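Flagging fatigue words is essentially a word count against a banned list. A minimal sketch, using the LitRPG words quoted above and a hypothetical `flag_fatigue_words` helper:

```python
import re
from collections import Counter

# Fatigue words quoted in the post for the LitRPG profile.
LITRPG_FATIGUE = {"delve", "tapestry", "testament", "intricate", "pivotal"}


def flag_fatigue_words(text: str, banned: set[str] = LITRPG_FATIGUE) -> Counter:
    """Count occurrences of banned words (exact, case-insensitive matches only)."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w in banned)


sample = "A testament to her skill, the intricate map was a Tapestry of intricate lies."
flags = flag_fatigue_words(sample)
print(flags)
```

This version only matches exact forms, so "delves" or "delving" would slip through; a real checker would likely need stemming or per-word patterns.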

But detection alone isn't enough. InkOS bakes de-AI-ification into the Writer agent's prompts at the source: banned sentence patterns, style fingerprint injection, and dialogue-driven scene guidance. For chapters that already exist, revise --mode anti-detect runs a dedicated anti-detection rewrite.

You can also clone any author's style: inkos style analyze reference.txt extracts a statistical fingerprint (sentence length distribution, word frequency, rhythm profiles), and inkos style import injects it into all future chapters.
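A statistical fingerprint of this kind can be approximated with a few lines of standard-library code. The sketch below covers only sentence-length statistics and word frequency; the fields, the naive sentence splitter, and the `style_fingerprint` name are assumptions, not what inkos style analyze actually computes.

```python
import re
from collections import Counter
from statistics import mean, pstdev


def style_fingerprint(text: str, top_n: int = 5) -> dict:
    """Crude fingerprint: sentence-length stats plus the most frequent words."""
    sentences = [s for s in re.split(r"[.!?]+\s*", text) if s.strip()]
    lengths = [len(s.split()) for s in sentences]
    words = re.findall(r"[a-z']+", text.lower())
    return {
        "sentence_count": len(sentences),
        "mean_sentence_len": mean(lengths),
        "sentence_len_stdev": pstdev(lengths),
        "top_words": Counter(words).most_common(top_n),
    }


reference = "The gate fell. Nobody moved. Then the horns began, low and patient, from the hills."
fingerprint = style_fingerprint(reference)
print(fingerprint)
```

The spread of sentence lengths is often more telling than the mean: two short sentences followed by a long one is a rhythm that a flat average would hide.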

Genre Support

10 English-native genre profiles, each with dedicated pacing rules, audit dimensions, and fatigue word lists:

LitRPG, Progression Fantasy, Isekai, Cultivation, System Apocalypse, Dungeon Core, Romantasy, Sci-Fi, Tower Climber, Cozy Fantasy — plus 5 Chinese web novel genres.

Getting Started

npm i -g @actalk/inkos
inkos book create --title "The Last Delver" --genre litrpg
inkos write next

One command writes a full chapter: draft → audit → auto-revise. Run inkos up for daemon mode that writes chapters on a schedule.

Works with Claude, GPT-4, or any OpenAI-compatible API including local models. Multi-model routing lets you put Claude on the Writer and GPT-4o on the Auditor.

InkOS is also published as an OpenClaw skill — install with clawhub install inkos and any compatible agent can invoke it.

GitHub: github.com/Narcooo/inkos (2.4k stars, MIT license)

npm: npm i -g @actalk/inkos

Would love feedback from anyone working on multi-agent systems, long-context state management, or creative AI. What continuity problems have you run into with long-form AI generation?

DEV Community: Dylan Brown