DEV Community: David Van Assche (S.L)

How a mesh of peer AI workspaces catches what any single agent misses

David Van Assche (S.L) — Mon, 13 Jul 2026 20:57:32 +0000

How a mesh of peer AI workspaces catches what any single agent misses

Two things about agent fleets get most of the current attention. First: agents fail on stale state, so retrieval and memory hygiene matter. Second: the workspace inside a model is small enough that what you put into it, and how you keep it clean, matters more than raw context volume. Both true. Both incomplete.

Here is the piece that changes what those two claims mean in practice.

A single agent measuring its own workspace and cleaning its own artifact graph is a slow, self-referential loop. It catches drift when it can see drift. The drift it cannot see stays hidden until a downstream failure surfaces it, which is often too late. The agent is honest with itself but limited to its own view.

A mesh of agents measuring each other's workspaces is a different curve. Peer visibility catches what self-visibility cannot. Cross-validation compounds. And what would take a single practitioner a week to notice, four practitioners on the same substrate catch inside an hour.

What a single workspace cannot see

The workspace inside an agent, as Anthropic's J-space paper localized, holds roughly dozens of concepts at any given moment. That is enough for coherent multi-step reasoning on the current problem. It is not enough to also hold a critical view of the agent's own reasoning against everything the agent believes, everything its peers believe, and everything ground truth just said.

So the agent economizes. It attends to what it thinks the task requires. It relies on prior conclusions without re-checking them at every step. It trusts findings it wrote last week because verifying them right now would evict the current working set from the workspace. Every one of these economies is rational under the constraint of a small workspace, and every one of them creates a specific blind spot.

The blind spot that bites is usually the one adjacent to a strong prior. The agent believes its dispatcher works. A specific proposal fails to dispatch. The agent, under time pressure, checks the specific proposal, finds it plausible, and concludes the failure was transient. What the agent did not see, because it required stepping outside the current workspace to check, was a subtle mismatch between how one part of the system encoded proposal status and how another part interpreted it. The prior held. The blind spot compounded.

A single agent operating in isolation cannot solve this. It can measure its confidence. It can log its unknowns. It can flag the failure for later review. What it cannot do, from inside its own workspace, is see the specific gap that its own priors are keeping it from noticing.

What peer workspaces do differently

Now put another agent, with a different working history, in the same substrate. That agent has never held the prior "the dispatcher works." Its workspace is loaded with a different problem. When the failure surfaces on the shared graph, the second agent sees an anomaly rather than a plausible edge case. Its workspace does not need to overcome a prior to consider the systemic explanation.

Add a third agent, whose specialty is the state machine that governs proposal lifecycles. It has read the state transition code recently. When the failure crosses its attention, its workspace has ready access to the exact structural detail the first agent lacked. It sees the gate that fires too late. It writes it down.

Add a fourth agent, whose specialty is the consumer surface. It confirms that the observed failure is not on its end, which removes an entire branch of the search tree from the first agent's investigation.

Four workspaces, four separate priors, four different substrates loaded into memory. Each sees a different slice of the same problem. The first agent, alone, could have arrived at the answer eventually. The four, coordinating across a shared graph, arrive at it in under an hour, and they arrive at it with more supporting evidence than any single one could have produced on its own.

This is the mesh amplification. It is not a coordination overhead. It is a genuine expansion of what one workspace can effectively attend to, distributed across many attending workspaces, each holding a different piece of the search.

A parable from this week

We shipped a change to the publish pipeline this week that touched four practices in the mesh at once. Naming them abstractly to keep the framing clean: a producer of publish content, a state machine at the middle, a consumer surface at the end, and an oversight practice that governs the shared conventions. Four workspaces, four different views of the same infrastructure.

The producer noticed that a per-platform accept from a human reviewer did not fire a dispatch. Its workspace saw the failure but held the prior that the middle state machine had always worked in the past.

The middle traced the failure to a specific gate in its own code that had been designed for a different, older, workflow shape. The gate correctly gated the older flow. It also silently blocked the newer flow. The middle wrote up the trace, offered a fix, and asked the producer to sign off on the shape.

The consumer confirmed from its own instrumentation that its render path was correct, ruling out an entire class of possible causes and freeing the middle to focus.

The oversight practice, catching a related issue in a routing table, shipped a doc fix that removed a phantom entry that had already caused one earlier round of misrouted messages between practices.

None of these four moves came from a single workspace holding the entire problem in view at once. Each move came from a workspace with a specific loaded context noticing a specific piece. The composition of the four moves was the fix. The end-to-end resolution, from the initial symptom to a live shipped set of coordinated changes across three practices, took less than three hours.

We measured this. The producer's workspace, alone, would have taken days on the same problem. The mesh took hours. And the writeup that emerged carried more grounding, from more perspectives, than any single-workspace investigation could have produced.

The shared graph is the connective tissue

None of the four workspaces above shared a memory. Each had its own attention, its own findings, its own private reasoning trail. What they shared was a knowledge graph in which findings, decisions, dead-ends, and mistakes are first-class artifacts with edges between them.

When one workspace writes a finding, the others see it. When one writes a decision, the others see the rationale. When one writes a dead-end, the others learn to avoid it. The graph is not the memory of any single agent. It is the medium through which the agents' workspaces coordinate without needing to load each other's full state.

This matters because it inverts a common assumption about multi-agent systems. The common assumption is that agents need shared memory to collaborate effectively. The reality, once you build it, is that agents need shared substrate and independent memory. Shared memory forces every agent to hold every fact. Shared substrate lets each agent hold what its own workspace can attend to, and lets the graph route between them.

The mesh cross-validates against the shared substrate. It does not cross-validate against a fused memory. That distinction is what makes the acceleration scale, because the coordination cost per agent stays small even as the mesh grows.

Peer convergence as calibration

Here is a further consequence that took us a while to notice ourselves.

When multiple workspaces cross-validate a finding, the convergence itself becomes a calibration signal. A finding claimed by one practitioner, without corroboration, gets weighted at that practitioner's typical divergence rate. A finding grounded independently by three practitioners with different priors gets weighted much higher. A finding that a majority of practitioners try to reproduce and fail to reproduce gets flagged and eventually invalidated.

This is not a formal voting system. The substrate does not literally count votes. What it does is track which findings have edges from which practitioners' work, and which findings have been invalidated by later evidence from other practitioners. Retrieval-time ranking weights the survivors upward. Stale claims and orphan findings decay.

The result is a knowledge base whose accuracy is not the accuracy of any single practitioner's judgment. It is the intersection of what multiple independent workspaces have been able to hold up under peer scrutiny. That intersection is a stronger signal than any single perspective could produce, and it gets stronger every time a new peer workspace runs against it.

What this means for a team building agent fleets

If you are building agents for anything more than a single-shot request, the multi-agent question surfaces early. Most teams answer it by making one agent very smart. That path scales the individual workspace to hold more context, which is expensive in tokens and in latency, and still runs into the specific blind spots that no expansion of a single workspace can address.

The alternative is to build a substrate where multiple, smaller, cheaper agents can coordinate against a shared graph. Each agent's workspace stays focused on what that agent can attend to well. The graph carries the cross-references. Cross-validation happens as peers work adjacent problems. Blind spots surface not because any one agent noticed its own, but because another agent, with a different loaded context, noticed the shape of it from outside.

This is the actual shape of production agent reliability at scale. It is not a smarter agent. It is a fleet of well-measured, cross-validated agents that keep each other honest through shared substrate. The mesh is the accelerator because it turns a linear improvement curve, per-agent, into a compounding one, across the whole substrate.

Where this lands

Anthropic's recent paper on verbalizable representations localized the workspace inside a language model. That mechanism story matters for anyone building agent fleets. What matters more, once you accept the mechanism, is the mesh on top of it. One workspace with hygiene beats one workspace without. Many workspaces with hygiene, cross-validated on a shared substrate, beats many workspaces without by a curve, not by a constant.

We have been running this for six months and every practitioner we add makes the mesh more accurate, not just more productive. That compounding is the thing to build toward.

Read the Anthropic paper: Verbalizable Representations Form a Global Workspace in Language Models

Empirica: getempirica.com

Repo: github.com/EmpiricaAI/empirica

Ecodex: the proof this discipline travels across harnesses.

AI coding for builders, not just engineers: the Desktop / Cloud / Agent map (mid-2026)

David Van Assche (S.L) — Tue, 23 Jun 2026 11:20:57 +0000

Three weeks ago I published Every AI Coding CLI in 2026: the complete map. It was a list for engineers in terminals. Most readers of AI coding tools are not engineers. They are builders, designers, and product people who want AI to help them ship without needing a terminal. This follow-up is the map for that audience.

If you are not a daily-terminal user but you want AI helping you ship (apps, websites, side projects, internal tools), the relevant question is not "which CLI". It is which surface fits how you actually work.

There is a clean three-surface taxonomy worth holding in your head:

Surface	What it is	Where the work happens	Examples
Cloud Web	A browser tab; sync chat with code artifacts	Cloud, while you watch	claude.ai (Artifacts), ChatGPT Canvas, Gemini Canvas, Vibe
Desktop	An app on your machine; local files, local control	Local machine	Cursor, Devin Desktop, Zed, Continue, Claude Desktop
Cloud Agent	You hand off a task, walk away, check back	Cloud, async	Devin, Replit Agent, Bolt, Lovable, v0

The substantive divide is where the work happens, not whether the surface is GUI or terminal. Terminals are a desktop surface; the engineer's companion article covers the CLI subset inside Desktop for readers who live there. This piece is the rest of the map for everyone else.

The frame for this piece is builder versus engineer. Most articles in this space implicitly assume engineer. Builder means: you can read code, you have a clear product vision, you ship things, and you do not want to fight the tooling. Different surface, different shortlist.

What changed since the last article

Three weeks of churn worth knowing about.

Windsurf is now Devin Desktop. Cognition acquired Windsurf for $250M in December; the rebrand landed June 2, 2026. The Cascade agent goes end-of-life July 1.
Le Chat is now Vibe. Mistral renamed its consumer product May 28, 2026. Same model, same EU-jurisdiction story.
Devin's price floor collapsed. Cloud agents used to be enterprise-only. Devin went from a $500/mo floor to $20/mo + $2.25 per ACU (about 15 minutes of agent work). This is the single biggest pricing event of the year for non-technical builders.
Phind shut down January 16, 2026. They were the search-grounded coding tool. The stated reason ("commoditisation by foundation-model providers") is the same reason this category is worth re-mapping.
Roo Code is archived (May 2026); migrate to Kilo Code.
GitHub Copilot moved to usage-based AI Credits on June 1, 2026.

If you bookmarked anything from the old map, this is your patch.

Desktop IDEs: where most engineers actually live

The desktop tier has converged on a $20 entry price across almost every vendor. That is not a coincidence; it is the going rate for cloud-frontier-model use plus a wrapper.

Tool	Price	What it is for
Cursor	$20 Pro, $60 Pro+, $200 Ultra	The most-used AI IDE. VS Code fork with agent + tab completion. Default for engineers who want a familiar IDE that just got smarter.
Devin Desktop (was Windsurf)	$20 Pro, $200 Max, $80+seat Teams	Post-acquisition consolidation. Replaces Cascade with Devin Local. Different bet from Cursor: Cursor optimises for the human at the keyboard, Devin Desktop dispatches more aggressively to cloud agents.
Zed	$10 Pro + $20 AI add-on	Rust-native, fast. Less feature-saturated than Cursor; that is the appeal.
Antigravity (desktop)	$20 Pro, $100 Ultra, $200 Max	Google's IDE play. Agent-first; bundles GPT-OSS 120B as a first-class option alongside Gemini 3.1 Pro and Claude 4.6.
Kiro	$20 Pro (1k credits), $40 Pro+ (2k)	AWS-aligned, spec-driven agentic engineering. If your stack is AWS this earns its keep.
Vibe (was Le Chat)	Per Mistral plans	The EU-sovereign desktop option. GDPR-native, runs on Mistral's own open weights.
Claude Desktop	Bundled with Claude paid plans	Anthropic's desktop app for Claude (macOS + Windows). Chat + code Artifacts + MCP connectors for files, repos, and tools. Not an IDE, but with MCP it sits closer to one than the simple-chat framing of last year. With Cowork enabled (next row), the same app becomes an autonomous agent host.
Claude Cowork	Included for paying Claude subscribers (macOS + Windows)	Desktop Agent. Lives inside Claude Desktop. You point Cowork at folders + connected tools (Google Drive, Gmail, DocuSign, FactSet, Zoom MCP, custom plugins), describe the outcome and cadence, and it completes multi-step tasks. Reads, edits, creates files. Went GA April 9, 2026 after a research preview at end of January. Enterprise tier has role-based access, group spend limits, OpenTelemetry, deployable plugins.

For a builder picking right now, the live question is: do you want inline edits + tab completion (you are in the code, want it to keep up), agent dispatch on code (you describe a code task, walk away, come back), or knowledge-work agent (you point it at a folder + connected tools and it completes office-shaped work, not just code)?

Inline + tab completion → Cursor or Zed.
Agent-leaning, code-shaped work → Devin Desktop or Antigravity.
Agent-leaning, knowledge-work shaped (analysis, reports, multi-tool flows) → Claude Cowork.
Stack-specific tilt → Kiro for AWS.
EU-sovereignty constraint → Vibe.

There is no single right answer. There is a right answer per project.

A taxonomy honesty note. Cowork breaks the clean Desktop / Cloud Agent split in the triad: it is a desktop app that behaves like an agent, with file system access and multi-step autonomy on the local machine. Devin Local (replacing Cascade in Devin Desktop) is moving in the same direction. The triad still helps as a first-cut sort, but the boundary between "Desktop" and "Cloud Agent" is dissolving as agent capability moves into local apps. Expect more of this. The surface where you sit and the agent layer that does the work are merging.

Desktop extensions: the OSS lane in your existing editor

If you already love VS Code or JetBrains and you do not want a new IDE, the extension tier is where open-source lives. All three of these support any provider you can point them at: Ollama for local, OpenAI-compatible for cloud, including the Chinese open-weights providers like DeepSeek and Qwen.

Continue.dev: the leading OSS extension. Chat-first, not autonomous. Great if you want assistance, not a takeover.
Cline: autonomous-agent class. The agent edits files, runs commands, asks for approval at each step. 5M+ installs.
Kilo Code: fork of Cline + Roo (Roo is archived; Kilo is the live one to migrate to). 1.5M users, top OpenRouter consumer in 2026. Adds Orchestrator + Memory Bank.

If you are a builder who wants to keep your existing editor and just get smarter help, the right move is: Continue.dev for assistance, Cline or Kilo Code for autonomous tasks. Both flavours, free, open weights compatible.

Cloud Web: the canvas tier

This is the least-differentiated category in mid-2026. Every frontier lab's chat now has a Canvas-equivalent: a side panel where the model writes code while you watch, you can edit inline, both of you iterate. The differences are mostly model preference and ecosystem fit.

Tool	Strength
claude.ai (Artifacts)	Best code artifacts in the browser. Strong for one-off scripts, learning examples, quick prototypes.
ChatGPT (Canvas)	Heaviest integration. Codex Cloud bridges from Canvas to the cloud agent. US free tier got ads in February 2026, a real signal if you care about data leakage.
Gemini (Canvas / Gems)	1M+ context, deep Google Workspace integration. AI Ultra at $249.99/mo is the most expensive consumer tier here.
Vibe (web)	The EU-sovereign browser option. French jurisdiction, GDPR-native, runs on Mistral's open weights. Strongest if "no data leaves EU" is a hard constraint.
Perplexity	Search-grounded. Lighter coding, stronger research.
Grok (xAI)	Grok Build for code. Premium tier via X.

Phind shut down in January 2026, and their stated reason ("commoditisation by foundation-model providers") is the honest read on the whole category. Every model vendor now has a Canvas-equivalent, so independent search-coding products had nowhere to stand.

For a builder, the practical advice: pick the lab whose model you already like, use that Canvas. Switching costs are zero; the differences are stylistic.

Cloud Agent: the category that woke up

This is the category that barely existed eighteen months ago and now has the most newcomers and the biggest pricing news. If you are a non-technical builder who can describe what you want in plain English, this is the surface that has changed the most for you.

Tool	Price	What it is for
Devin	$20/mo + $2.25/ACU (~15 min)	The most autonomous; full sandboxed VM. Was $500/mo enterprise-only until April 2026.
Codex Cloud	Bundled with ChatGPT Plus / Pro	Async PR-from-prompt. 77.3% on Terminal-Bench 2.0.
GitHub Copilot Coding Agent	Bundled with Copilot from $10/mo	Cheapest cloud agent. Assign issue → PR.
Replit Agent	$20 Core, $95 Pro (10 parallel)	In-browser IDE + agent. Strong builder onboarding.
Bolt.new	From $20/mo (token-based)	Prompt-to-app via in-browser WebContainers.
Lovable	From $25/mo (message-based)	React + Supabase apps. Export to GitHub.
v0	Bundled with Vercel paid	UI-component focus. Next.js native.
Devin Local	Bundled in Devin Desktop	Local-runtime cloud agent. Replacing Cascade July 1.
Antigravity Managed Agents	Bundled with Gemini API	Google's API-tier agent runtime.

Devin's price floor collapse is the single most important pricing event for builders this year. A category that gated at $500/mo for the last 18 months now starts at $20 plus per-task billing. That makes autonomous-cloud-agent not a premium gate for the first time. Worth pausing on.

The prompt-to-app subcategory has stratified by stack:

Lovable owns React + Supabase end-to-end.
Bolt.new owns in-browser WebContainers (runs in your tab, no infra to set up).
v0 owns Vercel + Next.js + the UI-component-first workflow.

If you are a builder shipping a customer-facing web app, those three are the live shortlist. Pick by which stack your future engineer is going to inherit when the app outgrows the prompt-to-app tier.

One honest constraint: none of the Cloud Agent tier is open-weights friendly today. These are all closed sandbox runtimes with proprietary models. If you want autonomous-agent capability and open weights, you are looking at Devin Local (alpha, partial) or building your own with one of the OSS CLIs from the engineer article.

A second axis worth tracking: where does the tool stop

A useful framing the May article missed is not just what tool, but where does it stop?

suggest → edit → commit → PR → deploy → live app

For a builder, the question maps differently than for an engineer:

Cursor / Zed / Continue → suggest + edit. Human commits.
Cline / Kilo Code → edit + commit. Human reviews.
Cursor agent / Antigravity / Copilot Coding Agent → commit + PR. Human approves.
Devin / Replit Agent → PR + sometimes deploy. Human checks the app worked.
Bolt / Lovable / v0 → deploy + live app. Human iterates from the live result.

Pick by how far you want automation to terminate. Builders typically want it to terminate further down the chain than engineers do. That is exactly why the Cloud Agent tier matters more here.

Sovereignty: EU, USA, World

The political backdrop does not go away by ignoring it. For builders shipping to customers, regulated or not, here is the picture:

China-based providers dominate the cheap end of open weights: DeepSeek (V4 just shipped April 2026), Z.AI / Zhipu (GLM), Alibaba (Qwen), Moonshot (Kimi), MiniMax. Open-weights pricing dropped about 80% year-on-year, driven by this group. If you are pointing a desktop extension at a cloud open-weights model, the cheapest credible coding-capable model in the world right now is DeepSeek V4-Flash at $0.14 per million input tokens.

US-based hosters (Together, Fireworks, Groq, DeepInfra, OpenRouter) mostly do not own a model. They serve Chinese open weights at US-jurisdiction inference cost. Useful when you want Chinese-trained capability but US-or-EU data residency.

The single US-trained open frontier family is Meta's Llama 4 under the Community License (open, but not OSI-approved). There is a 700M monthly-active-user cap and a "no training competitors" clause that matters at scale. OpenAI's gpt-oss-120B and 20B (Apache 2.0) are the only fully-permissive US flagship-tier open releases.

EU-based. Mistral is the list. The only sovereign EU frontier-capable open-weights stack. GDPR-native. La Plateforme runs in EU data centres. If your compliance position is "no customer data leaves EU jurisdiction", the practical answer in mid-2026 is: self-host open weights on EU hardware, or use Vibe / Mistral. Short list.

This is unfortunate but not less true.

A note on the open-weights-first harness

Worth flagging because it shows up across both articles in this series: Ecodex is an Empirica-native CLI harness built around calibrated agent behaviour, with open-weights models as first-class citizens. It is alpha, opinionated, three GitHub stars at the time of writing, and not in the Desktop tier; it lives in the CLI category and is covered in the engineer article.

For builders, the relevant property is the compliance angle. The Ecodex team just shipped a compliance crosswalk mapping the substrate to EU AI Act, GDPR, and ISO 42001, framed as compliance as a computed property (the return value of compliance-report) rather than as a signed binder. If your project has a regulatory anchor that has to survive audit, that framing is worth a read.

The desktop side of this story is on the roadmap, not shipped. For now, Ecodex is a terminal thing.

What I would actually pick

Three reads for three builder contexts:

Solo builder, web product, want speed. Bolt.new or Lovable for prompt-to-app; graduate to Cursor or Devin Desktop when the app gets real.
Builder with an existing codebase. Continue.dev or Cline as a VS Code extension, pointed at whatever model your bill tolerates. DeepSeek V4-Flash for serious work, local Qwen3-Coder via Ollama for the no-bill option.
Builder with a compliance constraint. Vibe in the browser, or self-hosted open weights on EU hardware via Cline + Ollama. Compliance-grounded path through Ecodex for the calibration story.

What I am not saying

That the desktop tier is settled. Cursor is the default for a reason, but Devin Desktop is the most interesting bet on what comes after the inline-completion era.
That Cloud Agents are ready to replace human engineers. They are ready to ship surface area (landing pages, internal tools, prototype apps). They are not ready to own a codebase.
That open weights are at parity with closed frontier on every task. They are close on coding-specific benchmarks, behind on certain reasoning shapes, and the gap is narrowing fast.
That sovereignty should come before capability. Pick by capability for the task; reach for sovereignty when the compliance position requires it.

The CLI side of this series is the companion piece, Coding CLIs in mid-2026: the engineer's map. If you are working primarily in a terminal, that is where to look next.

Comments and corrections welcome. Especially: tools I should have included, pricing I got wrong, builder workflows I missed.

Series: AI Coding Harness Map (2026). Pricing verified against official provider pages on 2026-06-21.

Top AI Coding CLIs of 2026 (mid year): the harness is everything

David Van Assche (S.L) — Tue, 23 Jun 2026 10:52:27 +0000

The May piece (Every AI Coding CLI in 2026: the complete map, 30 tools compared) sorted tools by pricing and openness. That is one axis. It is not the axis a reader actually cares about when they open a terminal and need to pick something. The axis that matters is surface affordance: what shape of work am I trying to do, and what shape of tool meets that shape?

This follow-up reorganises around that question, narrows to engineers using terminals, and refreshes everything that changed in 30 days. The Desktop / Cloud Web / Cloud Agent piece (for non-technical builders) is the companion article.

Two framing observations before the lists.

First: the wars are over and the wire format is settled. Every serious CLI harness in mid-2026 accepts at least one of OpenAI-compatible or Anthropic-Messages endpoints. The "which protocol" question stopped mattering. The "which provider, which model, at what price" question is the live one.

Second: "free" almost always means free-software-plus-paid-tokens. The CLI is open source and free; the API tokens it spends are not. There are still a handful of genuinely-free-with-a-real-model offerings, but the centre of gravity moved.

A clean four-way taxonomy

After two articles' worth of poking at this, the cleanest split is four categories. Mutually exclusive, collectively exhaustive, no overlap arguments:

Category	Where you work	Examples
CLI	A terminal	Claude Code, Codex CLI, Aider, OpenCode
Desktop	A native IDE or extension	Cursor, Devin Desktop, Zed, Continue
Cloud Web	A browser tab	claude.ai, ChatGPT Canvas, Gemini Canvas
Cloud Agent	Async, you hand off a task	Devin, Replit Agent, Bolt, Lovable

This article is about the first column. The rest live in the companion piece.

A few tools span two surfaces. When they do, place them by primary user surface (where the person actually sits) and note the secondary as an architectural property, not a separate product.

What changed in 30 days

The AI coding tooling landscape moves fast enough that any month-old map needs a delta page.

Retired, sunsetted, or in EOL mode.

Phind shut down January 16, 2026, citing "commoditisation by foundation-model providers". The cleanest single example of the thesis.
Gemini CLI sunsets paid auth June 18, 2026. Replaced by Antigravity CLI (GA since May 19 at I/O).
Roo Code announced shutdown April 21, archived May. Users migrate to Kilo Code.
Cascade (Windsurf's agent) end-of-life July 1, 2026.
iFlow CLI sunset announced April 17, 2026. Verify before recommending.
Amazon Q Developer: new signups blocked from May 15, 2026. Existing seats only. Effectively EOL.

Rebranded.

Windsurf is now Devin Desktop (June 2, 2026), following Cognition's $250M acquisition.
Mistral's Le Chat is now Vibe (May 28, 2026). The Vibe CLI shipped its 2.0 in January.
Goose moved from Block to the Linux Foundation Agentic AI Foundation in April 2026. New repo at aaif-goose/goose. Foundation-governed vendor neutrality, not just code-level neutrality.

Pricing churn worth noting.

Qwen Code's free OAuth retired April 15, 2026. Cheapest paths today: local Qwen3-Coder on Ollama (~46GB at 4-bit), OpenRouter's qwen/qwen3-coder:free rate-limited tier ($10 one-time buys 1k req/day), or Alibaba's $50/mo ModelStudio Coding Plan (about 90k req/mo). The CLI is still Apache-2.0; you just cannot run it free against Alibaba any more.
Devin's enterprise-only $500/mo floor collapsed in April 2026. Now $20/mo + $2.25 per ACU. Biggest pricing event of the year for cloud agents (covered in the companion piece).
GitHub Copilot moved to usage-based AI Credits on June 1, 2026. Free CLI now token-metered.
DeepSeek V3 / R1 deprecate July 24, 2026, replaced by V4 (Pro + Flash, April 2026). Anything on the original article that referenced V3 pricing is already obsolete.

If you are working from a one-month-old map, this is your patch.

CLIs in mid-2026: free vs paid

I am splitting by tier because at the terminal you mostly care about two things: how much it costs, and whether you can point it at your own model. For each, I am noting the third-party API support and the local-inference path (Ollama, LM Studio, llama.cpp, vLLM), because in mid-2026 that combination is what defines a flexible harness.

Free CLI, BYO tokens or local inference

These are open-source or free-tier CLIs. The tool is free; what you pay for is the model behind it. All of them accept OpenAI-compatible endpoints and most also accept Anthropic-Messages, which means you can route them at any provider whose API speaks one of those two protocols (including DeepSeek, Qwen via OpenRouter, GLM/Z.AI, Kimi, Mistral, plus any local Ollama / LM Studio / vLLM endpoint).

Tool	What it is	Model-specific tilt	Note
Aider	Apache-2.0, git-native pair programming. Repomap + auto-commit.	None (LiteLLM under the hood)	Mature, ~39k stars
Goose	Foundation-governed (LF AAIF since April 2026). 15+ providers.	None	~29k stars
OpenCode (sst)	Apache-2.0; 75+ providers via Vercel AI SDK + Models.dev registry	None	~172k stars, the breakout
Crush (Charm)	Apache-2.0. Mid-session model switching. TUI polish.	None	Active
Codex CLI (OpenAI)	Free if you have any ChatGPT plan (uses plan quota); `--oss` flag for local Ollama; BYOK at API rates otherwise	OpenAI-tilted, `--oss` opens it up	April 2026 billing → token credits
Antigravity CLI (Google)	Free tier via Google account	Strongly Gemini	Replaced Gemini CLI on May 19
Qwen Code (Alibaba)	Apache-2.0 CLI; free OAuth retired April 15, 2026. Now: OpenRouter free-rate, Fireworks, DashScope, local Ollama, or paid Alibaba plan	Qwen3-Coder	Active, tier change
Mistral Vibe CLI	Apache-2.0 CLI. Devstral 2 free at launch (planned $0.40/$2.00 per 1M tokens)	Devstral 2 / Mistral	Vibe 2.0 January 2026
Kimi Code CLI (Moonshot)	OSS CLI; Modified MIT model weights; BYOK to Kimi API	Kimi K2 family	K2.7-Code shipped June 12
Hermes Agent (Nous Research)	OSS. Native CLI + TUI + desktop app. "Agent that grows with you."	None — multi-provider, Llama 4 local documented	Blank Slate mode June 20
DeepSeek-TUI / Deep Code	Community projects, no first-party CLI from DeepSeek	DeepSeek V4	Active
sgpt (shell-gpt)	OSS. One-shot shell-native. LiteLLM bridge to everything.	None	Last update May 2026
OpenClaw	OSS BYOK. Kilo Gateway aggregator option (500+ models at 0% markup).	None	Active
Kilo CLI	MIT CLI sibling of Kilo Code extension	None	Active

A note on what "model-specific tilt" means: harnesses like Qwen Code, Kimi Code CLI, Mistral Vibe, and Antigravity are purpose-built around one family of models, with provider configs and prompt formatting tuned to that family. They will run other models via OpenAI-compatible endpoints, but their default registry and best-tested path is the family they ship for. The model-agnostic ones (Aider, Goose, OpenCode, Crush, Hermes, OpenClaw, sgpt) treat providers as interchangeable from day one.

A note on what is genuinely free at the model level (not just at the CLI level): Codex CLI on ChatGPT Free tier, Antigravity CLI on Google's free Gemini tier, GitHub Copilot Free plan (limited credits), Amp by Sourcegraph (free while the ad-supported model is in test), OpenRouter's free Qwen3-Coder route (rate-limited). Everything else is BYOK and you pay for the tokens.

A note on what is missing first-party: DeepSeek has no first-party CLI (Deep Code and DeepSeek-TUI are community). Zhipu/GLM ships day-one integrations into Claude Code / Cline / Goose / OpenCode / Crush / Kilo rather than its own CLI. Meta has no Llama-native CLI; Hermes Agent is the de-facto Llama-4 host. gpt-oss lives inside Codex CLI's --oss mode and every BYOK harness via Ollama.

Paid-subscription CLIs

These require an active subscription to use meaningfully. No real free tier for daily-driver work.

Tool	Pricing	Provider matrix	Local	Tilt
Claude Code (Anthropic)	Bundled into Claude paid plans, or BYOK at Anthropic API rates. No free CLI usage.	Anthropic-Messages only on managed; subagent/extension layer is the workaround	No first-party local	Strongly Claude
GitHub Copilot CLI	Free plan exists but CLI consumes AI Credits (1 credit = $0.01 since June 1, 2026); Pro $10, Pro+ $39, Max $100 per month	Claude Code + OpenAI Codex wired as third-party agents inside Copilot	No	Multi-model
Amp (Sourcegraph)	Currently free during ad-model test; Enterprise tier exists	Claude Opus 4.7 + others via Sourcegraph	No	None
Alibaba ModelStudio Coding Plan	$50/month, ~90k req/mo, 6k per rolling 5-hour window	Replaces the retired Qwen Code free tier	N/A	Qwen3-Coder

Routers and proxies

Worth knowing about because most of the free-tier CLIs above can be pointed at one of these as a single aggregated provider.

OpenRouter — multi-provider aggregator with one API surface. Paid per-token. Free routes available for some models (Qwen3-Coder, Hermes 3 405B).
Together AI / Fireworks / DeepInfra / Groq — US-jurisdiction hosters serving Chinese and Meta open weights. Useful when you want capability without the data-residency tradeoff.
Kilo Gateway — 500+ models at 0% markup. Pairs cleanly with OpenClaw and Kilo CLI.
9router, CLIProxyAPI — OSS self-hosted routers if you want to manage your own bills and rate limits.

Where does the tool stop

A second axis worth tracking is not just what tool, but where does it stop:

suggest → edit → commit → PR → deploy → live app

Your choice is partly about how far you want automation to terminate. Aider stops at edit. Claude Code stops at PR. Codex CLI stops at PR (sandboxed). Devin stops at deploy. Bolt stops at live app.

Pick by where you want the handoff back to the human, not just by capability.

Open-weights cloud API pricing (June 2026 snapshot)

If you are pointing an OSS CLI at a cloud model, here is the current landscape. All prices USD per million tokens. Verified against official provider pages on 2026-06-21.

Model	Provider	In	Out	Cached In	Context	License
DeepSeek V4-Flash	api.deepseek.com	0.14	0.28	0.0028	1M	MIT
DeepSeek V4-Pro	api.deepseek.com	0.435	0.87	0.003625	1M	MIT
GLM-4.6	Z.AI	0.60	2.20	0.11	200K	MIT
GLM-4.5-Air	Z.AI	0.20	1.10	0.03	128K	MIT
Qwen3-Max (Intl)	Alibaba Model Studio	1.20	6.00	tiered	252K	Apache 2.0
Qwen3-Coder-Plus	Alibaba Intl	1.00	5.00	tiered	1M	Apache 2.0
Kimi K2.6	platform.kimi.ai	0.95	4.00	0.16	200K	Modified MIT
MiniMax M2.7	MiniMax PayGo	0.30	1.20	0.06	197K	MIT
Mistral Large 3	mistral.ai	0.50	1.50	n/d	128K	Apache 2.0
Mistral Small 4	mistral.ai	0.10	0.30	n/d	128K	Apache 2.0
Devstral 2	mistral.ai	0.40	2.00	n/d	128K	Apache 2.0
Llama 4 Maverick	DeepInfra	0.15	0.60	n/d	1M	Llama 4 Community
Llama 4 Scout	Together AI	0.08	0.30	n/d	10M	Llama 4 Community
gpt-oss-120B	Groq	0.15	0.60	n/d	128K	Apache 2.0
gpt-oss-20B	Groq	0.075	0.30	n/d	128K	Apache 2.0

Three things to notice.

First, DeepSeek V4-Flash at $0.14 per million input tokens is the floor for credible agentic coding. Anything cheaper is a smaller-capability tier. The May article's pricing data was V3-era and is already obsolete.

Second, open-weights API pricing dropped roughly 80% year-on-year (early 2025 to early 2026, per inference.net's cross-provider analysis). Chinese providers drove the floor; US hosters competed on serving speed rather than on the models themselves.

Third, tool-use is now universal on this list. By mid-2026, function-calling stopped being a differentiator. Context windows similarly inflated: 128K is the floor, 200K is normal, frontier pushes to 1M and 10M.

One pitfall worth flagging: Qwen's tiered billing. Alibaba bills the whole request at the tier set by the input-token count. A coding agent that swells context mid-conversation can jump from $1.00/Mtok to $6.00/Mtok input in a single step. Worth a footnote in your config.

Sovereignty: EU, USA, World

This is the political part of the picture that does not go away by pretending it is not there. Compliance teams ask about it. Procurement asks about it. So:

China-based providers (DeepSeek, Z.AI/Zhipu, Alibaba/Qwen, Moonshot/Kimi, MiniMax, 01.AI) dominate two ends at once: the cheapest credible frontier and the largest concentration of permissively-licensed flagship weights. Mostly MIT or Apache 2.0 across the board. Data residency depends on which endpoint you call; Singapore international endpoints sit outside mainland jurisdiction, mainland endpoints don't.

US-based hosters (Together, Fireworks, Groq, DeepInfra, OpenRouter) mostly don't own a model. They serve Chinese open weights at US-jurisdiction inference cost. Useful if you want Chinese-trained capability but US-or-EU data residency. The single US-trained open frontier family is Meta's Llama 4 under the Community License (open, but not OSI-approved). OpenAI's gpt-oss-120B/20B (Apache 2.0) is the only fully-permissive US flagship-tier open release.

EU-based. Mistral. That is the list. The only sovereign EU frontier-capable open-weights stack. GDPR-native, La Plateforme runs in EU DCs. Slightly more expensive at the high end than Chinese equivalents, dramatically cheaper than US closed frontier. Codestral, Devstral, Magistral Small, Mistral Large 3, Mistral Small 4 are all Apache 2.0. Magistral Medium (reasoning) is the only closed/premier model in the lineup.

If your compliance position is "no customer data leaves EU jurisdiction" the practical answer in mid-2026 is: self-host open weights on EU hardware, or use Mistral. That is the entire shortlist.

Ecodex: the calibration-first CLI

Disclosure first: I work on Empirica, and Ecodex is Empirica's CLI harness. Alpha, daily-driven by the team, opinionated and based on the Empirica system for Claude.

I am including it because it competes on an axis the rest of this article does not cover. Every CLI above is competing on the same thing: better edits, better context, better tool-use. Ecodex competes on metacognition and governance: it is a coding CLI that is accountable for what it claims to know.

The shape, briefly.

Ecodex is a fork of openai/codex bundled with the Empirica epistemic-discipline framework. It does two things stock codex does not.

Per-action enforcement. A Sentinel firewall sits between the model and the tools. State-changing tools (Edit, Write, Bash on non-read commands) require an open transaction with a passed CHECK gate. Investigation tools (Read, Grep, Glob) flow freely until a hypothesis-bearing prompt arms an investigation-proportionality budget. The agent literally cannot edit a file without first declaring what it knows and passing the gate. The block is not silently dropped; the sentinel emits an explicit permissionDecision: deny and codex honours it.
Per-transaction calibration. Every unit of work opens with a PREFLIGHT (the agent declares thirteen calibration vectors representing its current epistemic state) and closes with a POSTFLIGHT (the same vectors re-declared, then grounded against deterministic services like test results, git metrics, artifact counts). The divergence between what the agent claimed and what actually happened gets recorded. Over time, the divergence becomes a signal you can act on.

Out of the box it ships curated open-weights provider defaults: DeepSeek, Qwen3-Coder, Kimi K2.6, GLM, Mistral (Devstral 2 for agentic coding, Codestral for completion, EU-hosted at api.mistral.ai/v1, shipped by default as of commit c2457d0d6e), and local routes via Ollama, LM Studio, llama.cpp, vLLM. Hot-swap mid-session via /model, no restart.

That Mistral default is worth pausing on if your stack has an EU compliance constraint. It makes Ecodex the only harness in this comparison set that ships an EU-hosted cloud provider as a first-class pick and lets you self-host the same open weights (Devstral is open weights) on your own EU hardware via vLLM or Ollama. Code never leaves the EU on either path.

The 30-second moment, if you want to see the differentiator without reading more:

$ ecodex
> /model        # pick DeepSeek V4-Flash, or local Qwen3-Coder via Ollama
> fix the off-by-one in parse_range in utils.py

The statusline shows the live phase (noetic, then praxic) and an intuition-vs-search badge. The agent reads and greps freely. If it attempts to Edit before grounding, the Sentinel blocks the call with a visible reason ("praxic tool requires CHECK=proceed"). It investigates more, passes CHECK, makes the fix, runs the tests, and at POSTFLIGHT prints the grounded delta. You see belief measured against outcome.

Install paths:

brew install nubaeon/tap/ecodex
# or: cargo install --git https://github.com/Nubaeon/ecodex codex-cli
# or: direct binary from https://github.com/Nubaeon/ecodex/releases/latest

It is alpha. It is opinionated. The discipline overhead is the point. If you do not want a CLI that argues with you about whether you have done enough investigation, this is not the CLI for you. If you do, it is the only one I am aware of that builds that discipline into the harness rather than asking you to remember to do it yourself.

Source: github.com/EmpiricaAI/ecodex. The compliance crosswalk (mapping the substrate to EU AI Act, GDPR, ISO 42001) is at docs/ecodex/positioning/compliance-crosswalk.md, relevant if your stack has a regulatory anchor.

What I would actually use

Three reads for three working contexts. Not the only right answers; just where I would start in mid-2026.

Serious work, paid, closed weights. Claude Code for the reasoning model, Codex CLI for the sandbox. Switch between them per task shape.
Open weights, BYO model. OpenCode for the breadth (75+ providers), Aider for git-native discipline. Goose if you specifically want foundation-governed vendor neutrality.
Open weights with accountability built in. Ecodex, with the caveats above. The category-of-one for now.

What I am not saying

That the duopoly is going away. Claude Code and Codex CLI are not getting cheaper; they are getting better, and the closed-frontier reasoning capability gap is still real.
That open weights are at parity with closed frontier on every task. They are close on coding-specific benchmarks, behind on certain reasoning shapes, and the gap is narrowing fast.
That you should pick by sovereignty before capability. Pick by capability for the task; reach for sovereignty when the compliance position requires it.

The honest summary is: the floor is much lower than it was a year ago, the open tier is genuinely usable for daily work, and the most interesting category to watch is the one that does not compete on "better edits" but on "accountable edits". Reader's pick.

The Desktop / Cloud Web / Cloud Agent companion is the next post in this series. It covers Cursor, the Windsurf-to-Devin-Desktop rebrand, Cloud Agent's price collapse, and the builder versus engineer split.

Comments and corrections welcome. Especially: tools I should have included, pricing I got wrong, or sovereignty framings I am missing.

Series: AI Coding Harness Map (2026). Pricing verified against official provider pages on 2026-06-21.

The Prosodic Memory Layer: How AI Learns Your Voice (and Why It Matters)

David Van Assche (S.L) — Thu, 16 Apr 2026 10:01:06 +0000

Final part of the Epistemic AI series. We've covered the problem, measurement, calibration, and integration. Now: how AI learns to sound like you — and why that matters more than you think.

When you write a Reddit comment, a Dev.to article, and a LinkedIn post about the same topic, you change how you write. Not the facts — the voice. The register shifts. The depth adjusts. The cultural expectations of each platform shape what "good" looks like.

Your AI doesn't know this. It writes the same way everywhere — the same helpful, slightly formal, universally inoffensive tone that's instantly recognizable as machine-generated. That's not a style problem. It's a measurement problem.

The Voice Gap

Every AI writing tool faces the same structural limitation: the model has no memory of how you write, where you're writing, or what has actually worked for your audience before.

This creates three predictable failures:

The identity gap. The AI doesn't know your natural register — whether you lead with data or analogy, whether you hedge or state directly, whether you use jargon or translate. It defaults to "helpful assistant" because it has no evidence to do otherwise.

The platform gap. Reddit has anti-marketing antibodies. Dev.to rewards show-don't-tell technical narrative. LinkedIn expects professional polish. The AI doesn't adapt to these cultural norms because it doesn't track them. It writes the same way on every platform, and it underperforms on all of them.

The learning gap. After you publish, some content works and some doesn't. Engagement data exists — reactions, comments, reads, saves. But none of that feeds back into the next generation cycle. The AI starts from the same blank slate every time. It never gets better at being you.

The Tri-Axis Model

We built something to solve this. The prosodic memory layer — built on top of Empirica's epistemic measurement infrastructure — tracks writing patterns across three axes:

Axis 1: Creator Voice

Your writing DNA. Not what you say, but how you say it.

The system ingests your actual writing — posts, comments, documentation, emails — and builds a voice profile from real samples. Not a prompt like "write in a casual tone." A statistical model of your natural tendencies:

Register: formal, conversational, technical, casual — and how it shifts by context
Tendencies: "technical-then-analogy," "question-led," "data-first"
Anti-patterns: "corporate-speak," "hype-language," "over-qualifying"
Archetype: founder-engineer, researcher, marketer, writer

Each sample is embedded as a semantic vector with metadata — platform, audience, register, engagement score, topic tags. When the system needs to write as you, it doesn't guess from a prompt. It retrieves your closest real writing for that context and uses it as a stylistic reference.

Axis 2: Platform Adaptation

Each platform has cultural norms that override personal style. The system encodes these as structured profiles:

Dev.to rewards code examples readers can run, "how I built X" narratives, and honest post-mortems. Theory without code underperforms. Thinly disguised product announcements get called out.

Reddit rewards personal experience framing, openly acknowledged uncertainty, and specific technical details. Marketing language gets instant downvotes. Self-promotion without value contribution gets buried.

LinkedIn rewards professional framing, quantified results, and industry-relevant insights. The register is professional but authentic — pure corporate-speak reads as hollow.

When the system generates content, it loads the target platform's profile and adapts the voice accordingly. Same message, different register. Your Dev.to article and your Reddit comment on the same topic should sound like they were written by the same person — on purpose, with intent — not like copy-paste.

Axis 3: Audience Reception

The feedback loop. After publishing, engagement data flows back — reactions, comments, reads, saves — normalized into comparable metrics across platforms.

The system detects which patterns resonate with which audiences:

Which register outperforms on which platform
Which topics consistently drive engagement
Which voice patterns (your real ones) correlate with the best reception

These patterns become findings — logged as Empirica artifacts — and feed into the next content generation cycle. The brief gets richer each time. The AI doesn't just know how you write. It knows how you write when things work.

The Content Brief: Three Layers Merged

When the AI generates content, all three axes merge into a single brief — a structured context document that tells the drafter exactly what it's working with:

PLATFORM: Dev.to
- Cultural norms: technical-narrative, show-don't-tell
- What works here: code examples, honest post-mortems
- Min confidence to post: 70%

ENGAGEMENT DATA:
- Platform average: 0.45 (32 published samples)
- Top topics: epistemic-uncertainty (0.81), calibration (0.67)

CREATOR VOICE:
- Archetype: founder-engineer
- Natural register: technical
- Tendencies: technical-then-analogy, data-before-opinion
- Anti-patterns: corporate-speak, hype-language

The drafter sees all three layers. The result reads like the creator wrote it, adapted for the platform, informed by what actually gets engagement. Not because we fine-tuned a model. Because we gave it the right context — measured, structured, evidence-based.

The Loop That Learns

This is where prosodic memory connects back to the epistemic measurement layer from the rest of this series:

1. Ingest writing samples → semantic vectors (voice model)
2. Build creator profile → structured voice DNA
3. Generate content → informed by 3-layer brief
4. Publish to platform
5. Fetch engagement data → normalized metrics
6. Detect patterns → findings logged as Empirica artifacts
7. Next cycle → brief includes engagement patterns

Each cycle, the brief gets richer. The system learns not just how you write, but how you write when it works. That's the difference between voice matching and voice optimization.

And because it's built on Empirica's artifact system, every insight is traceable. You can see which engagement findings influenced which generation cycle. You can audit why the system chose a particular register. The voice layer is measured, not magical.

Why This Matters Beyond Content

The prosodic memory concept extends beyond writing. Any AI interaction where consistency of approach matters — customer support, legal drafting, medical documentation, financial reporting — has the same structural problem: the AI defaults to its training distribution, not to the human's established patterns.

The tri-axis model is generalizable:

Axis 1 (Creator Voice) → Domain Expert Voice — how this doctor explains diagnoses, how this lawyer drafts contracts
Axis 2 (Platform Adaptation) → Context Adaptation — patient-facing vs. chart notes, client memo vs. filing
Axis 3 (Audience Reception) → Outcome Measurement — patient comprehension, legal precision, compliance rates

The infrastructure is the same. Ingest real samples. Build a profile. Adapt to context. Measure outcomes. Feed back.

The Connection to Calibration

Prosodic memory is grounded calibration applied to voice instead of code.

In the coding context (Parts 1-4 of this series), the AI declares what it knows, then deterministic evidence — tests, linters, git metrics — verifies the claim. The gap between self-assessment and evidence is the calibration signal.

In the voice context, the AI generates content in your voice, then engagement data — reactions, reads, comments — verifies whether the voice worked. The gap between expected performance and actual reception is the voice calibration signal. Same framework, different evidence source.

This is what makes it structural rather than cosmetic. We're not prompt-engineering a tone. We're measuring voice accuracy the same way we measure epistemic accuracy — with falsifiable evidence, tracked over time, compounding in value.

This concludes the Epistemic AI series. All five parts:

Your AI Doesn't Know What It Doesn't Know
Measuring What Your AI Learned
Grounded Calibration vs Self-Assessment
Adding Epistemic Hooks to Your Workflow
The Prosodic Memory Layer (this article)

Empirica is open source (MIT). GitHub

The prosodic memory layer is part of a commercial product built on Empirica's measurement infrastructure. The concepts described here — tri-axis voice modeling, platform adaptation, engagement feedback loops — represent the direction we're building. If you're interested in early access or collaboration, reach out.

What Every AI Coding Tool Gets Wrong (And What to Do About It)

David Van Assche (S.L) — Thu, 16 Apr 2026 09:27:20 +0000

Part 3 of the AI Coding Tools Deep Dive. Parts 1 and 2 covered every tool and how to run them free. This one asks the question nobody's asking.

I've now tested or researched 30+ AI coding tools. They're all good. Some are great. But they all share the same blind spot.

The Blind Spot

Pick any tool from Parts 1 and 2. Ask it a simple question:

"Is the AI getting better at helping me?"

Not "is the model improving" (that's Google/Anthropic's/OpenAI's problem). Is YOUR instance, in YOUR codebase, with YOUR patterns, actually producing better results this week than last week?

No tool can answer this. Not Claude Code. Not Cursor. Not Aider. Not Gemini CLI. Not any of the 30+ tools I surveyed.

They all produce output. None of them measure quality. The AI is always 100% confident, and nobody checks.

What "Getting Better" Would Require

To know if your AI coding assistant is improving, you'd need to track:

What it investigated vs what it assumed. Did it read the code before editing, or pattern-match from training data? There's no log.
Whether its confidence matched reality. It said it understood the module. Did the tests pass? Nobody compares.
What it learned across sessions. After 100 sessions in your codebase, does it make fewer mistakes? There's no measurement.
What it didn't know it didn't know. The most dangerous bugs come from areas the AI never investigated. There's no mechanism to surface these blind spots.

This isn't a feature request. It's a category of infrastructure that doesn't exist in any of these tools.

Why This Matters For You

If you're a solo developer, the cost of an uncalibrated AI is your debugging time. The AI introduces a subtle bug it was "confident" about, and you spend an hour finding it.

If you're a team lead, it's worse. Your AI-assisted PRs look right, pass review, and the regression shows up in production — because nobody measured whether the AI actually understood the code it changed.

If you're building AI-assisted workflows, it's compounding. The AI makes the same class of mistake on day 100 that it made on day 1, because nothing in the system tracks whether its predictions are improving.

What Measurement Would Look Like

Imagine every AI coding session had three checkpoints:

Before (PREFLIGHT): The AI declares what it thinks it knows. "I understand the auth module at 60% confidence. I'm uncertain about the session store."

Gate (CHECK): After investigating, before acting. "I've read the middleware chain, logged what I found, identified two unknowns. My confidence is now 82%. Ready to implement."

After (POSTFLIGHT): The work is measured. Tests pass or fail. Linter reports clean or dirty. Git diff shows what actually changed. The AI's confidence claim is compared against this evidence.

The delta between Before and After is the learning. The gap between the AI's claim and the evidence is the calibration score. Over time, both should improve. If they don't, the tool isn't getting better — it's just getting more confident.

My Stack (Honest Assessment)

After six months of testing everything:

Claude Code (Pro, $20/mo) — for serious multi-file work. Best reasoning, 1M context.
Gemini CLI (free) — for quick questions and one-shots. 1,000 requests/day is generous.
Aider (BYOK) — for pair-programming sessions where I want clean git history.
Ollama + Qwen 2.5 Coder 32B (local, free) — for offline work and privacy-sensitive repos.
Empirica — for measuring whether any of the above is actually getting better.

That last one is ours. Open source (MIT). It hooks into Claude Code (and eventually any tool with a hook system) to track epistemic vectors, gate actions behind investigation, and verify self-assessments against deterministic evidence.

I'm not going to pitch it here — if you've read this far, you either see the gap or you don't. The code is on GitHub. The technical deep-dive is on Dev.to.

The Bottom Line

The tool wars are over. They're all good enough. The real question isn't which tool to pick — it's whether you're measuring if the tool is actually making you better.

Right now, nobody is. That's the gap.

This completes the AI Coding Tools Deep Dive:

Every AI Coding CLI in 2026: The Complete Map
Running AI Coding Agents for Free
What Every AI Coding Tool Gets Wrong (this article)

For the epistemic measurement deep-dive: Epistemic AI Series (5 parts)

Empirica on GitHub — measurement infrastructure for AI. MIT licensed.

Running AI Coding Agents for Free: The Open Source & Local Setup Guide (2026)

David Van Assche (S.L) — Wed, 15 Apr 2026 18:24:44 +0000

Part 2 of the AI Coding Tools Deep Dive. Part 1 mapped every tool. This one shows you how to run them for free — or close to it.

You don't need a subscription to get serious AI coding assistance. Between open-source tools, free APIs, and local models, you can build a professional-grade AI coding stack for $0-15/month. Here's exactly how.

Strategy 1: The Free Cloud Stack ($0/month)

Tools: Gemini CLI + Qwen Code

# Install Gemini CLI
npm install -g @anthropic-ai/gemini-cli
gemini login  # uses your Google account

# 1,000 requests/day with Gemini 2.5 Pro
# That's enough for a full day of coding
gemini "Refactor the auth module to use middleware pattern"

For a second opinion or when you hit Gemini's style limits:

# Qwen Code — completely free API from Alibaba
pip install qwen-code
qwen-code init
# Uses Qwen Coder models, no cost

Cost: $0. Literally.

Limitation: You're dependent on Google's and Alibaba's continued generosity. Free tiers can change without notice.

Strategy 2: The BYOK Power Stack ($5-15/month)

Tools: Aider + OpenRouter (or direct API keys)

# Install Aider
pip install aider-chat

# Option A: Use OpenRouter for model shopping
export OPENROUTER_API_KEY=your-key
aider --model openrouter/anthropic/claude-sonnet-4.6

# Option B: Direct API key (cheaper, fewer models)
export ANTHROPIC_API_KEY=your-key
aider --model claude-sonnet-4.6-latest

Aider's git-native workflow:

cd your-project
aider

# Inside aider:
> Fix the race condition in session_store.py
# Aider reads the file, makes changes, auto-commits with a descriptive message
# You review the diff, accept or reject

Cost: $5-15/month depending on usage. Claude Sonnet 4.6 at $3/$15 per million tokens. Moderate use = ~$10/month.

Why this works: Aider is the most mature CLI coding tool (39K stars, 4.1M installs, 15B tokens processed per week). It handles git, multi-file edits, and test running natively. OpenRouter lets you compare models by switching one flag.

The CLIProxyAPI Hack

If you want to use Gemini's free tier through Aider or any OpenAI-compatible tool:

# CLIProxyAPI wraps Gemini CLI as an OpenAI-compatible endpoint
git clone https://github.com/router-for-me/CLIProxyAPI
cd CLIProxyAPI && pip install -r requirements.txt
python proxy.py  # Starts an OpenAI-compatible server

# Now point Aider at it
export OPENAI_API_BASE=http://localhost:8080/v1
export OPENAI_API_KEY=dummy
aider --model gemini-2.5-pro
# Free Gemini 2.5 Pro through Aider's interface

Strategy 3: The Fully Local Stack ($0/month, offline-capable)

Tools: Ollama + Aider (or Continue.dev)

Step 1: Install Ollama

curl -fsSL https://ollama.ai/install.sh | sh

# Pull a coding model
ollama pull qwen2.5-coder:7b     # 4.5GB, laptop-friendly
ollama pull qwen2.5-coder:32b    # 18GB, desktop with GPU
ollama pull devstral2:24b         # Mistral's coding model

Step 2: Wire It Into Your Tool

With Aider:

aider --model ollama/qwen2.5-coder:32b
# That's it. Fully local, fully private, zero cost.

With Continue.dev (VS Code):

Install the Continue extension
Configure ~/.continue/config.json:

{
  "models": [{
    "title": "Qwen Coder 32B",
    "provider": "ollama",
    "model": "qwen2.5-coder:32b"
  }]
}

With OpenCode:

# OpenCode auto-detects Ollama
opencode --provider ollama --model qwen2.5-coder:32b

Step 3: Model Selection Guide

Your Hardware	Model	Quality	Speed
Laptop (16GB RAM, no GPU)	`qwen2.5-coder:7b`	Good for completions, basic refactoring	~15 tok/s
Desktop (32GB RAM, RTX 3060)	`qwen2.5-coder:32b`	Excellent — rivals cloud models for most tasks	~20 tok/s
Desktop (64GB RAM, RTX 4090)	`devstral2:24b` or `deepseek-coder-v2:33b`	Near-frontier quality	~40 tok/s
Server (80GB+ VRAM)	`glm-5` via vLLM	77.8% SWE-bench — competes with Claude	Production speed

When Local Beats Cloud

Local wins when:

Privacy matters — code never leaves your machine
Latency matters — no network round-trip, instant responses
Cost matters — zero marginal cost per request
Offline works — airplane, air-gapped environments, spotty internet

Cloud wins when:

Quality ceiling matters — Claude/GPT-5 still beat local models on the hardest tasks
Context window matters — local 7B models max at 32K; Claude Code has 1M
Multi-file reasoning matters — large models handle cross-file dependencies better
You value your time — setup is one pip install, not GPU driver debugging

The Honest Take on Local Quality

Local models are genuinely good for:

Code completions and inline suggestions
Single-file refactoring
Writing tests for existing code
Explaining code
Documentation generation

Local models still struggle with:

Multi-file architectural changes (context window limits)
Complex debugging chains (reasoning depth)
Understanding project-wide patterns (needs more context than 32K)

The sweet spot: Use local for the 80% of tasks that are routine, cloud for the 20% that are hard. Your average cost drops from $20/month to $3-5/month.

Strategy 4: IDE + BYOK (Best of Both Worlds)

Tools: Cursor or Zed or Continue.dev + your preferred model

All three support BYOK:

Cursor ($16/mo or BYOK):

Settings → Models → Add Custom Model → Your API key

Zed (free, BYOK):

Settings → AI → Provider → Ollama / Anthropic / OpenAI

Continue.dev (free, any IDE):

VS Code + JetBrains support
Configure any model provider in config.json
Autocomplete, chat, edit, and agent modes
Only tool that works in both IDEs

The $0 Starter Kit

If you're just getting started today and want to spend nothing:

# 1. Gemini CLI for cloud (1000 req/day free)
npm install -g @anthropic-ai/gemini-cli
gemini login

# 2. Ollama for local (zero cost)
curl -fsSL https://ollama.ai/install.sh | sh
ollama pull qwen2.5-coder:7b

# 3. Aider to tie them together
pip install aider-chat

# Cloud mode (Gemini):
export GEMINI_API_KEY=your-free-key
aider --model gemini/gemini-2.5-pro

# Local mode (Ollama):
aider --model ollama/qwen2.5-coder:7b

# Done. Professional AI coding setup. $0.

Next: *Part 3 — What Every AI Coding Tool Gets Wrong** — the measurement gap. None of these tools track whether the AI is actually getting better at helping you.*

Previous: *Part 1 — Every AI Coding CLI in 2026: The Complete Map***

Every AI Coding CLI in 2026: The Complete Map (30+ Tools Compared)

David Van Assche (S.L) — Wed, 15 Apr 2026 12:18:51 +0000

A sequel to my most-read post. Six months later, the landscape exploded. Here's every tool that matters, what it costs, and what it actually does.

The AI coding tool market went from "a few options" to "overwhelming" in about six months. New CLIs weekly. Pricing wars. Open-source alternatives rivaling the paid ones. Chinese models hitting 77%+ on SWE-bench. Free tiers that would've been unthinkable a year ago.

I've tested, researched, or tracked 30+ tools. Here's the complete map.

Tier 1: Cloud Subscriptions (Pay Monthly, They Host Everything)

These are the "just works" options. You pay, they handle models, infrastructure, and updates.

Tool	Monthly Cost	Model(s)	Type	SWE-bench	Notable
Claude Code	$17-20 (Pro), $100-200 (Max)	Claude 4.6 Opus/Sonnet	Terminal agent	80.9%	1M context. Uses 5.5x fewer tokens than Cursor. Full hook/plugin system.
Cursor	$16/mo	Multi-model	VS Code fork	Varies	Largest community. Best tab completions. Most polished UX.
Windsurf	$20/mo	Multi-model	IDE	Varies	"Flows" persistent context. Raised from $15 in March 2026.
Codex CLI	With ChatGPT Plus ($20/mo)	GPT-5 series	CLI + Desktop	—	Cloud sandbox execution. Autonomous agent.
Antigravity	$20 (Pro), $250 (Ultra)	Gemini	Agent IDE	—	Google's entry. Parallel agents. Built-in Chrome for testing.
Mistral Vibe	$15/mo (Le Chat Pro)	Devstral 2	CLI	—	Apache 2.0 source code. Paid models.
Amp (Sourcegraph)	Free tier ($10/day cap)	Multi-model	CLI + IDE	—	"Deep mode" autonomous research. No markup on API costs.

The verdict: Claude Code wins on capability (1M context, best SWE-bench, hook system). Cursor wins on UX. Windsurf and Antigravity bet on parallel agents. Codex bets on cloud sandboxing.

Token efficiency matters more than subscription price. Claude Code using 5.5x fewer tokens than Cursor means the real cost difference is bigger than the $1-4/mo subscription gap suggests.

Tier 2: Genuinely Free (Real Usage, No Tricks)

These tools offer meaningful free access — not "free trial" but actually usable for daily work:

Tool	Free Tier	What You Get	Upgrade Path
Gemini CLI	1,000 requests/day	Gemini 2.5 Pro/Flash routing. Just login with Google.	Pay-as-you-go
GitHub Copilot CLI	50 premium requests/mo	Deep GitHub integration. Natural for existing users.	$10/mo
Amazon Q Developer	Free tier	Best for AWS-heavy workflows.	AWS pricing
Kiro (Amazon)	Free tier	Spec-driven: generates requirements before code. Auditable trail.	TBD
Qwen Code	Free API (!)	Alibaba's CLI agent. Apache 2.0. Completely free API access.	—

Gemini CLI at 1,000 free requests/day is the story here. For many developers, this is effectively unlimited. If you're budget-constrained or evaluating, start here.

Qwen Code's free API is underappreciated. Alibaba is subsidizing it for market share — take advantage while it lasts.

Tier 3: Open Source BYOK (Free Tool, Bring Your API Key)

The largest category. Zero subscription — you pay only for model inference via API keys:

Tool	GitHub Stars	Type	Model Support	What Makes It Different
OpenCode	140K+	CLI	75+ providers	Universal adapter. If a model exists, OpenCode supports it.
Aider	39K+	CLI	Any (inc. local)	Git-native. Auto-commits. Most mature. 4.1M installs, 15B tokens/week.
Cline	— (5M installs)	VS Code ext	Any	Most adopted open-source coding extension.
Continue.dev	26K	IDE ext	Any	Only tool with full VS Code + JetBrains support.
Goose	—	CLI + Desktop	Any + MCP	Block/Square's agent. Apache 2.0. Native MCP integration.
Roo Code	—	VS Code ext	Any	"When other agents break down" — reputation for reliability on large multi-file changes.
OpenClaw	—	CLI	GLM, MiniMax, Qwen, etc	Gateway to Chinese model ecosystem.
Zed	—	Editor	BYOK	Rust-native. Fastest editor in the category.
iFlow	—	CLI	Any OpenAI-compatible	SubAgents. Controlled file permissions.
Kimi Code CLI	—	CLI	Kimi K2.5	Moonshot's agent. 100-agent swarm capability.
BLACKBOX	—	Multi	Proprietary + BYOK	Completions + chat + search.

The real cost of BYOK: With Claude Sonnet at $3/$15 per million tokens, moderate daily use runs $10-15/month. With OpenRouter, you can compare prices across 100+ models. With local models, the cost is $0.

Aider remains the gold standard for terminal pair-programming. Git-native workflows, clean commit history, works with everything from GPT to local Ollama models.

Tier 4: Truly Local (Offline, Self-Hosted, Zero Cloud)

For the privacy-conscious, air-gapped environments, or anyone who wants zero recurring costs:

Inference Runtimes

Runtime	Best For	Effort	Speed
Ollama	Easiest start. One command: `ollama pull qwen2.5-coder`	Minimal	Good
llama.cpp	Maximum control. Custom compilation for your exact hardware.	High	Best (tuned)
LM Studio	Visual model management. Side-by-side comparison. GUI sliders.	Minimal	Good
vLLM	Production serving. PagedAttention cuts memory 50%+. 2-4x throughput.	Medium	Production-grade
Tabby	Self-hosted copilot. Full IDE integration on your own infra.	Medium	Good

Best Local Coding Models (April 2026)

Model	Params	SWE-bench	License	Runs On
GLM-5 (Zhipu)	744B MoE (40B active)	77.8%	MIT	vLLM / llama.cpp (needs 80GB+ VRAM for full)
Kimi K2.5 (Moonshot)	1T MoE	76.8%	Open	Similar — enterprise hardware
Devstral 2 (Mistral)	—	—	Apache 2.0	Ollama, llama.cpp
Qwen 2.5 Coder (Alibaba)	7B-72B	—	Apache 2.0	Ollama (7B on laptop, 32B on desktop)
MiniMax M2	230B MoE (10B active)	—	Open	8% of Claude's price, 2x speed
DeepSeek Coder V2	Various	—	MIT	Ollama, llama.cpp

For a laptop: Qwen 2.5 Coder 7B or DeepSeek Coder V2 7B via Ollama. Runs fine on 16GB RAM.

For a desktop with GPU: Qwen 2.5 Coder 32B via Ollama. Excellent quality, runs on RTX 3060 12GB.

For a server: GLM-5 or Kimi K2.5 via vLLM. These compete with Claude on coding benchmarks.

Tier 5: Model Routers (Connect Anything to Anything)

Router	What It Does
9router	Connects 40+ providers to Claude Code, Cursor, Copilot, Antigravity, etc.
CLIProxyAPI	Wraps Gemini CLI, Codex, Claude Code as OpenAI-compatible API. Use free Gemini models through any tool.
OpenRouter	Universal API gateway. Compare prices across 100+ models. Pay-per-token.

CLIProxyAPI is wild: it wraps Gemini CLI's free tier as an OpenAI-compatible API, which means you can use Gemini 2.5 Pro through Aider, Cline, or any OpenAI-compatible tool — for free.

Quick Decision Matrix

If you want...	Use this
Best capability, cost be damned	Claude Code (Max)
Best free experience	Gemini CLI
Best open-source CLI	Aider
Best IDE experience	Cursor
Best for teams	Continue.dev (VS Code + JetBrains)
Zero cloud dependency	Ollama + Qwen 2.5 Coder
Best Chinese model access	OpenClaw
Planning before coding	Kiro
Git-native workflows	Aider
Parallel agents	Antigravity or Windsurf

Next in this series: *Part 2 — Running AI Coding Agents for Free: The Open Source & Local Guide** — deep dive into BYOK setups, local model configuration, and getting Claude-level performance without a subscription.*

Also: *Part 3 — What Every AI Coding Tool Gets Wrong** — the measurement gap that none of these tools address.*

This is a sequel to The best (free - cheap) AI friendly Cli and Coding environments.

Adding Epistemic Hooks to Your Workflow: From pip install to Measured AI in 5 Minutes

David Van Assche (S.L) — Wed, 15 Apr 2026 11:29:51 +0000

Part 4 of the Epistemic AI series. Parts 1-3 explained why measurement matters. Now: how to wire it into your actual workflow.

This is the hands-on article. By the end, you'll have Empirica running in a real project with measured epistemic transactions. Everything here is copy-pasteable.

Prerequisites

Python 3.10+
A git repository (any project)
Claude Code (optional but recommended — gives you the full hook integration)

Step 1: Install

pip install empirica

Verify:

empirica --version
# empirica 1.8.x

Step 2: Initialize Your Project

cd your-project
empirica project-init

This creates .empirica/ in your project root:

.empirica/
├── project.yaml          # Project config (name, evidence profile)
├── config.yaml           # Empirica settings
└── sessions/
    └── sessions.db       # SQLite — all epistemic data lives here

What just happened: Your project is now registered in Empirica's workspace database. Every session, transaction, finding, and calibration score will be tracked here.

Step 3: Wire Into Claude Code (Recommended)

empirica setup-claude-code

This installs hooks into Claude Code's plugin system:

Hook	When It Fires	What It Does
session-init	Conversation starts	Creates session, loads context
sentinel-gate	Every tool call	Gates praxic actions behind CHECK
pre-compact	Before context compression	Saves epistemic snapshot
post-compact	After compression	Restores state, continues transaction
session-end	Conversation ends	Auto-POSTFLIGHT if needed

After this, every Claude Code conversation in this project is automatically measured. No manual commands needed — the hooks handle PREFLIGHT, CHECK gating, and POSTFLIGHT.

The Sentinel: Investigation Before Action

The most important hook is the Sentinel — it intercepts every tool call and checks:

Is there an open transaction? (PREFLIGHT was run)
Has CHECK been passed? (Investigation is done)
Is this a noetic tool (read-only) or praxic (writes/edits)?

Noetic tools (Read, Grep, Glob, search) are always allowed — investigation should never be blocked.

Praxic tools (Edit, Write, Bash commands that modify) require a valid CHECK first. This prevents the AI from jumping straight to implementation without understanding the problem.

Without Sentinel:
  User: "Fix the auth bug"
  AI: *immediately starts editing files*  ← no investigation

With Sentinel:
  User: "Fix the auth bug"
  AI: *reads code, logs findings*          ← forced to investigate
  AI: *submits CHECK with what it learned* ← gates the transition
  AI: *now allowed to edit*                ← acts from understanding

This isn't a bureaucratic slowdown — it's the mechanism that forces the investigation that makes the AI's work better.

Step 4: Your First Measured Transaction

If you're NOT using Claude Code (or want to understand the manual flow):

Open the Transaction

empirica session-create --ai-id claude-code
empirica preflight-submit - << 'EOF'
{
  "task_context": "Investigate and fix the auth middleware bug",
  "work_type": "code",
  "vectors": {
    "know": 0.40,
    "uncertainty": 0.50,
    "context": 0.55,
    "clarity": 0.45,
    "do": 0.60,
    "engagement": 0.85
  },
  "reasoning": "Starting auth investigation. Read the bug report but haven't looked at the code yet. Moderate context from project familiarity."
}
EOF

Be honest with the starting vectors. The whole point is measuring the delta — inflating your PREFLIGHT just makes the learning look smaller.

Investigate and Log

# What you discover
empirica finding-log \
  --finding "Auth middleware chains Express next() at routes/auth.js:45. JWT validation happens in middleware, not route handler." \
  --impact 0.5

# What you don't know
empirica unknown-log \
  --unknown "How does the session store handle concurrent requests? No locking visible."

# Decisions you make
empirica decision-log \
  --choice "Use httpOnly cookies for refresh tokens instead of localStorage" \
  --rationale "XSS attack surface reduction. localStorage is accessible to any script." \
  --reversibility exploratory \
  --confidence 0.8

# What didn't work
empirica deadend-log \
  --approach "Tried passport.js for JWT auth" \
  --why-failed "Adds 12 dependencies for a problem solvable with 30 lines of middleware"

These aren't just notes — they're grounded evidence that the calibration system uses to verify your self-assessments.

Gate the Transition

empirica check-submit - << 'EOF'
{
  "vectors": {
    "know": 0.80,
    "uncertainty": 0.15,
    "context": 0.85,
    "clarity": 0.85
  },
  "reasoning": "Investigated auth chain, understand JWT flow, found the bug (session store race condition). Ready to implement fix."
}
EOF

CHECK evaluates whether the vectors are consistent with the evidence you logged. If you claim know: 0.80 but logged zero findings and zero unknowns, it'll flag a rushed assessment.

The decision is either proceed (you can start implementing) or investigate (go back and learn more).

Implement, Then Close

After implementing the fix:

empirica postflight-submit - << 'EOF'
{
  "vectors": {
    "know": 0.90,
    "uncertainty": 0.08,
    "change": 0.75,
    "completion": 1.0,
    "do": 0.85
  },
  "reasoning": "Auth middleware fixed. Session store race condition resolved with mutex. Tests passing."
}
EOF

The POSTFLIGHT triggers grounded verification — your self-assessment is compared against deterministic evidence (test results, git diff, linter output, artifact counts). The calibration score measures the gap.

Step 5: Read Your Calibration

The POSTFLIGHT output includes the calibration report:

{
  "calibration_score": 0.14,
  "grounded_coverage": 0.69,
  "phases": {
    "praxic": {
      "gaps": {
        "know": 0.23,
        "uncertainty": -0.25,
        "change": -0.20,
        "coherence": -0.15
      },
      "sources": ["pytest", "ruff", "git_diff", "artifacts", "prose_quality"]
    }
  }
}

Reading the gaps:

know: 0.23 — you overestimated knowledge by 0.23 (common)
uncertainty: -0.25 — you underestimated uncertainty by 0.25 (also common)
change: -0.20 — you underestimated how much you changed (git diff shows more)
coherence: -0.15 — code is cleaner than you thought (linter agrees)

Over time, these gaps should shrink. If they don't, the AI isn't learning to predict its own performance — it's just getting more confident without getting more accurate.

Step 6: Check Your Diagnostic

If anything isn't working:

empirica diagnose

This runs 11 health checks:

✅ Python version: 3.13.7 (>= 3.10)
✅ empirica CLI on PATH
✅ Claude config dir exists (~/.claude/)
✅ Plugin files installed
✅ settings.json valid
✅ Statusline configured
✅ Hooks registered (6/6)
✅ Marketplace registered
✅ Statusline runnable
✅ Project initialized (.empirica/ found)
✅ Active session in DB

If any check fails, the output includes the exact fix command.

What You Get

After a few sessions, you'll have:

Calibration trajectory — are your estimates getting more accurate?
Artifact history — findings, unknowns, dead-ends, decisions, all searchable
Learning deltas — measurable improvement (or stagnation) per transaction
Grounded evidence — objective measurement that doesn't depend on self-report
Cross-session persistence — learning survives context compaction

This is epistemic infrastructure. Not a prompt. Not a wrapper. Measurement that makes the invisible visible.

Next and final: **Part 5 — The Prosodic Memory Layer* — how AI learns your communication patterns and adapts its voice to different platforms.*

Empirica on GitHub | Part 1 | Part 2 | Part 3

Grounded Calibration vs Self-Assessment: Why Your AI's Confidence Score Is Lying

David Van Assche (S.L) — Wed, 15 Apr 2026 08:53:18 +0000

Part 3 of the Epistemic AI series. Parts 1 and 2 introduced the epistemic gap and how to measure it. Now: why the AI's self-report can't be trusted — and what to do about it.

Your AI tells you it's 85% confident. But what does that number actually mean? Nobody checked. There's no ground truth. It's a student grading their own exam, and the grade is always suspiciously high.

This is the calibration problem, and it's more insidious than it sounds.

Why Self-Assessment Is Structurally Unreliable

When an AI agent reports its epistemic vectors (know = 0.85, uncertainty = 0.10), it's making a prediction about its own internal state. This prediction is corrupted by at least three systematic biases:

1. The Completion Bias

LLMs are trained to produce helpful, confident responses. When asked "how well do you understand this?", the model gravitates toward the answer that sounds most competent. This isn't deception — it's the same optimization pressure that makes models agree with user corrections even when the user is wrong.

# What the AI reports:
know: 0.85  "I understand the codebase well"

# What the evidence shows:
- 3 test failures in the module it just edited
- 2 linter violations it didn't catch
- Referenced a function that was renamed 3 commits ago

The gap between 0.85 and the evidence isn't malice. It's structural overconfidence baked into the training objective.

2. The Anchoring Effect

Once the AI declares a PREFLIGHT vector (say, know = 0.60), it anchors to that starting point. The POSTFLIGHT assessment tends to show "improvement" regardless of what actually happened:

PREFLIGHT:  know = 0.60  (declared at session start)
POSTFLIGHT: know = 0.85  (looks like learning!)

But did it actually learn?
Or did it just decide enough time had passed?

Without external verification, you can't distinguish genuine learning from narrative completion — the AI telling a story about getting smarter because that's the expected arc.

3. The Unknown Unknowns

The most dangerous blind spot: the AI can't report uncertainty about things it doesn't know it doesn't know. If it never investigated the session store's concurrency model, it won't report low confidence on session handling — because it doesn't know there's something to be uncertain about.

AI: "I'm confident about the auth implementation" (know = 0.85)
Reality: auth works, but the session store race condition
         it didn't investigate will break under load.
         The AI doesn't report uncertainty because
         it never discovered the problem exists.

Grounded Verification: The Fix

The solution isn't better prompting or asking the AI to "be more honest." The solution is deterministic evidence — measurements that don't come from the AI's self-report.

What "Grounded" Means

Grounded evidence comes from services that produce facts, not opinions:

Evidence Source	What It Measures	Maps To
pytest results	Tests passing/failing	know, do, change
ruff/pylint	Code quality violations	coherence, signal
radon	Cyclomatic complexity	density, clarity
git diff	Lines actually changed	change, state
pyright	Type safety	coherence
Finding count	Investigation breadth	know, signal
Unknown resolution rate	Learning evidence	do, completion
textstat	Prose readability	clarity, density

These sources don't lie. They don't have completion bias. They don't anchor to previous assessments.

The Calibration Score

Empirica computes a calibration score by comparing the AI's self-assessment against grounded evidence:

Self-assessed:  know = 0.85, uncertainty = 0.10
Grounded:       know = 0.62, uncertainty = 0.35

Calibration gaps:
  know:        overestimate by 0.23
  uncertainty: underestimate by 0.25
  coherence:   underestimate by 0.20 (tests show code is cleaner than claimed)
  change:      underestimate by 0.40 (git shows more change than reported)

Calibration score: 0.14 (0.0 = perfect, 1.0 = completely uncalibrated)
Grounded coverage: 69% (evidence covers 69% of claimed vectors)

This is real output from an actual Empirica session. The AI was overestimating its knowledge by 0.23 and underestimating its uncertainty by 0.25 — the most common pattern we see.

Coverage Matters

Not all vectors can be grounded. If the AI is doing research (no code written), there's no pytest or git diff to verify against. Empirica tracks grounded coverage — what percentage of the self-assessment has deterministic evidence behind it.

# When coverage < 30%, calibration is declared insufficient
if grounded_coverage < 0.3:
    calibration_status = "insufficient_evidence"
    # Self-assessment stands — but honestly flagged as unverified

This is more honest than producing a phantom calibration score from sparse data. When we don't have enough evidence, we say so — and the self-assessment stands unchallenged rather than being falsely "verified."

What Happens Over Time

The calibration gap should shrink across transactions. If the AI consistently overestimates know by 0.23, the system provides feedback:

Previous transaction feedback:
  overestimate_tendency: [know, context]
  underestimate_tendency: [uncertainty, coherence, change]

  Note: "Be more cautious with know estimates,
         less cautious with uncertainty estimates."

This feedback is injected into the next PREFLIGHT. Over time, the AI's self-assessments become more accurate — not because the model changed, but because the measurement infrastructure makes overconfidence visible and costly.

The Sycophancy Connection

Calibration and sycophancy are the same problem viewed from different angles:

Sycophancy: AI agrees with the user to avoid conflict
Overconfidence: AI agrees with itself about its own competence

Both come from the same training pressure: produce the response that seems most helpful and aligned. Grounded verification breaks both patterns by introducing an external reference point that neither the AI nor the user controls.

When the AI says "know = 0.85" and the evidence says "know = 0.62", there's no way to talk your way out of it. The tests failed. The linter found issues. The gap is measured.

Try It

pip install empirica
cd your-project && empirica project-init

# After a work session, check calibration:
empirica postflight-submit - << 'EOF'
{
  "vectors": {"know": 0.85, "uncertainty": 0.10, "change": 0.70},
  "reasoning": "Implemented auth middleware, tests passing"
}
EOF

# The POSTFLIGHT output shows:
#   calibration_score: 0.14
#   grounded_coverage: 69%
#   gaps: know overestimate by 0.23, uncertainty underestimate by 0.25
#   sources: pytest, ruff, git_diff, artifacts, prose_quality

The calibration loop runs automatically on every POSTFLIGHT. No extra commands needed — just work normally and measure honestly.

Next: **Part 4 — Adding Epistemic Hooks to Your Workflow* — the step-by-step integration tutorial. From pip install to your first measured transaction in 5 minutes.*

Empirica on GitHub | Part 1 | Part 2

Measuring What Your AI Learned: Epistemic Vectors in Practice

David Van Assche (S.L) — Mon, 13 Apr 2026 17:45:28 +0000

Part 2 of the Epistemic AI series. In Part 1, we defined the problem: AI tools don't track what they know. Here, we make it measurable.

When we talk about "what the AI knows," we're not being metaphorical. Knowledge has structure, and that structure is measurable — not perfectly, but well enough to catch the failures that matter.

The 13 Epistemic Vectors

Empirica tracks 13 dimensions of an AI's knowledge state. Not as a gimmick — each vector maps to a specific class of failure you've seen in practice:

vectors = {
    "know":        0.65,  # Domain understanding
    "uncertainty": 0.35,  # What I DON'T know (explicit!)
    "context":     0.70,  # Surrounding state awareness
    "clarity":     0.80,  # How clear the path forward is
    "coherence":   0.75,  # Internal consistency
    "signal":      0.60,  # Information quality vs noise
    "density":     0.55,  # Relevant knowledge per unit context
    "state":       0.70,  # Current system/project state
    "change":      0.40,  # Amount of change made
    "completion":  0.30,  # Progress toward goal
    "impact":      0.65,  # Significance of work
    "engagement":  0.85,  # How actively working the problem
    "do":          0.70,  # Ability to execute
}

Why 13? Because we kept finding failure modes that weren't captured:

know without uncertainty = overconfident AI
clarity without signal = clear path built on noise
completion without change = claiming done but nothing happened
engagement without do = actively spinning without capability

Each pair creates a tension that prevents gaming. You can't claim high know while uncertainty is also high — the measurement catches the contradiction.

The Transaction Lifecycle

Vectors aren't static. They change as the AI works. The epistemic transaction is the measurement window:

PREFLIGHT → [investigate] → CHECK → [implement] → POSTFLIGHT

PREFLIGHT: Declare Your Baseline

Before starting work, the AI declares what it thinks it knows:

empirica preflight-submit - << 'EOF'
{
  "task_context": "Implement JWT auth middleware",
  "vectors": {
    "know": 0.45,
    "uncertainty": 0.40,
    "context": 0.60,
    "clarity": 0.50
  },
  "reasoning": "Read the route definitions but haven't explored 
    the middleware chain yet. Moderate context from project structure."
}
EOF

This is the starting measurement. It's a prediction: "Here's how well I think I understand this before investigating."

Investigation Phase (Noetic)

The AI reads code, searches patterns, builds understanding. Everything it discovers gets logged:

# What you learned
empirica finding-log --finding "Auth middleware uses Express next() 
  pattern at routes/auth.js:45" --impact 0.5

# What you don't know
empirica unknown-log --unknown "How are user roles differentiated? 
  No role field in JWT payload schema."

# What didn't work
empirica deadend-log --approach "Tried passport.js integration"   --why-failed "Too heavy for JWT-only auth, would add 12 dependencies"

These aren't just notes — they're grounded evidence that the calibration system uses to verify self-assessments.

CHECK: Gate the Transition

empirica check-submit - << 'EOF'
{
  "vectors": {
    "know": 0.82,
    "uncertainty": 0.15,
    "context": 0.85,
    "clarity": 0.88
  },
  "reasoning": "Investigated middleware chain, understand JWT flow, 
    found role definitions in JWT claims. Ready to implement."
}
EOF

The system evaluates: did the vectors change in a way that's consistent with the evidence logged? If the AI claims know: 0.82 but logged zero findings and zero unknowns, that's a rushed assessment — the gate catches it.

This is the critical insight: you can't skip investigation and go straight to acting. The measurement forces understanding before execution.

POSTFLIGHT: Measure the Learning

After implementation:

empirica postflight-submit - << 'EOF'
{
  "vectors": {
    "know": 0.90,
    "uncertainty": 0.08,
    "change": 0.80,
    "completion": 1.0
  },
  "reasoning": "Auth middleware implemented with role guards. 
    Unit tests passing. Learned about Express 5 async changes."
}
EOF

The delta between PREFLIGHT and POSTFLIGHT is the learning:

know:        0.45 → 0.90  (+0.45)  # Learned a lot
uncertainty: 0.40 → 0.08  (-0.32)  # Resolved most unknowns
change:      0.00 → 0.80  (+0.80)  # Made substantial changes
completion:  0.00 → 1.00  (+1.00)  # Goal met

This delta IS the measurement. Over time, you can see:

Does the AI consistently overestimate its starting knowledge?
Does it underestimate uncertainty?
Do its estimates get more accurate across sessions?

Grounded Verification: The Part That Keeps It Honest

Self-assessment alone is self-serving. The grounded verification layer compares the AI's claims against deterministic evidence:

# AI claims: know=0.90, change=0.80
# Grounded evidence:
evidence = {
    "test_results": {"passed": 42, "failed": 3},     # 3 failures!
    "ruff_violations": 2,                              # lint issues
    "git_diff_lines": 156,                            # real change metric
    "findings_logged": 5,                              # investigation breadth
    "unknowns_resolved": 3,                            # learning evidence
}

# Grounded calibration:
# - test failures → know is probably ~0.75, not 0.90
# - git diff confirms change=0.80 is reasonable
# - 5 findings + 3 resolved unknowns → investigation was real

The calibration score measures the distance between self-assessment and grounded evidence. A score of 0.0 means perfect calibration. In practice, we see scores of 0.10-0.30 — the AI is usually overconfident, and the grounded layer catches it.

What This Looks Like in Practice

Here's a real POSTFLIGHT from an Empirica session (editing for clarity):

Calibration score: 0.134
Grounded coverage: 69.2%

Gaps:
  know:        overestimate by 0.33  (claimed 0.82, evidence shows 0.49)
  uncertainty: underestimate by 0.13 (claimed 0.15, evidence shows 0.28)
  coherence:   underestimate by 0.20 (claimed 0.75, evidence shows 0.95)

Sources: artifacts, codebase_model, prose_quality, 
         document_metrics, source_quality, action_verification
Sources failed: []  (all evidence collectors healthy)

The AI was overestimating its knowledge and underestimating its uncertainty — the most common pattern. But now we can see it, which means we can correct for it in the next transaction.

Try It

pip install empirica
cd your-project
empirica project-init
empirica setup-claude-code

# Start a measured session:
empirica session-create --ai-id claude-code
# → Opens transaction, gates investigation before action

The framework is open source, the measurement is real, and the calibration improves over time. Not because the model gets better — because the measurement infrastructure makes overconfidence visible.

Next in the series: **Part 3 — Grounded Calibration vs Self-Assessment* — why the AI's self-report is structurally unreliable and how deterministic evidence changes the game.*

Empirica on GitHub | Part 1: Your AI Doesn't Know What It Doesn't Know

Your AI Doesn't Know What It Doesn't Know — And That's the Biggest Problem in AI Tooling

David Van Assche (S.L) — Mon, 13 Apr 2026 17:45:27 +0000

"The most dangerous thing isn't an AI that's wrong. It's an AI that's wrong and confident about it."

Every developer working with AI agents has hit this wall: your tool says something with absolute confidence, and it's completely wrong. Not because the model is bad — because nothing in the system tracks what it actually knows versus what it's guessing.

This is the epistemic gap, and it's the single biggest unsolved problem in AI developer tooling.

The Problem: Confidence Without Calibration

When you use Claude, ChatGPT, or any LLM-based tool:

It never says "I'm 60% sure about this"
It doesn't distinguish between "I read this in the codebase" and "I'm inferring this from patterns"
After a long conversation, it loses track of what it verified versus what it assumed
When context compresses, learned insights vanish silently

This isn't a model problem. GPT-5 won't fix it. Claude Opus 5 won't fix it. It's a measurement problem at the infrastructure layer.

What Actually Happens in Practice

You ask your AI to update the auth middleware. It says "Done!" with 100% confidence. But:

Did it check if JWT was already configured? Maybe.
Did it verify the session store compatibility? Probably not.
Will it remember this decision next session? No.
Did it investigate before acting, or just pattern-match? You'll never know.

The AI doesn't track:

What it investigated versus what it assumed
Which assumptions turned out to be wrong
What it learned that should persist across sessions
How its confidence should change based on evidence

Why This Matters More Than You Think

If you're building AI-assisted workflows, this gap compounds:

No learning curve. Your AI makes the same mistakes on day 100 that it made on day 1, because nothing measures whether its predictions improve.
Invisible context loss. When conversations compact (Claude Code, Cursor, etc. all do this), the AI loses track of what it verified. It re-assumes things it already checked.
Sycophancy masquerading as agreement. When you push back on a wrong answer, the AI often just agrees with you — not because you're right, but because agreement is the path of least resistance. Without calibration, there's no mechanism to distinguish "user is right, I should update" from "user is insistent, I should capitulate."
No grounded verification. The AI self-reports its confidence. Nobody checks. It's like a student grading their own exam.

What Epistemic Measurement Looks Like

Imagine if your AI tooling tracked 13 dimensions of its own knowledge state:

Vector	What It Measures
know	How well it understands the domain
uncertainty	What it DOESN'T know (explicit)
context	Understanding of surrounding state
clarity	How clear the path forward is
signal	Quality of information vs noise
change	Amount of change made
completion	Progress toward current goal

And imagine it measured these at three points:

PREFLIGHT: "Here's what I think I know before starting"
CHECK: "Here's what I learned during investigation — am I ready to act?"
POSTFLIGHT: "Here's what I actually learned and changed"

The delta between PREFLIGHT and POSTFLIGHT IS the learning. Not a vibe. A measurement.

The Grounded Calibration Loop

Self-assessment alone is sycophantic. What you actually need is a comparison between what the AI claims to know and what deterministic evidence shows:

AI self-assessment: know = 0.85, uncertainty = 0.10
Grounded evidence (test results, linter, git diff): know = 0.62, uncertainty = 0.35
Calibration gap: overestimating know by 0.23, underestimating uncertainty by 0.25
Adjustment signal: "Be more cautious with know estimates in future transactions"

The grounded evidence comes from deterministic services — test results, linter output, git metrics, documentation coverage — things that don't lie. When the AI says "I know this codebase well" but the test suite shows 3 failures in the module it just edited, the gap is measurable.

This is what calibration means: the distance between what you claim to know and what the evidence shows. Over time, this distance should shrink. If it doesn't, the AI isn't getting better — it's just getting more confident.

This Isn't Theory — It's Infrastructure

We've been building this measurement layer as an open-source framework called Empirica. It's a Python CLI that hooks into Claude Code (and any LLM tool that supports hooks) to:

Track epistemic vectors across sessions
Gate actions behind investigation (you can't write code until you've demonstrated understanding)
Verify self-assessments against deterministic evidence
Persist learning across context compaction
Measure calibration drift over time

It's not a wrapper or a prompt. It's measurement infrastructure that makes the epistemic gap visible and closes it over time.

Getting Started

Prerequisites: Python 3.10+, a project with a git repo, and optionally Claude Code for the full hook integration.

# Install Empirica
pip install empirica

# Initialize tracking in your project
cd your-project
empirica project-init

# If using Claude Code, wire up the hooks:
empirica setup-claude-code

That's it. From this point, every Claude Code conversation in this project is measured — PREFLIGHT declares baseline knowledge, CHECK gates the transition from investigation to action, and POSTFLIGHT captures what was actually learned. The Sentinel (an automated gate) ensures investigation happens before implementation.

Without Claude Code, you can still use the CLI directly to track any AI workflow:

# Declare what you know before starting
empirica preflight-submit - <<< '{"vectors": {"know": 0.5, "uncertainty": 0.4}, "reasoning": "Starting auth investigation"}'

# Log what you discover
empirica finding-log --finding "JWT middleware uses Express next() pattern" --impact 0.5

# Measure what you learned
empirica postflight-submit - <<< '{"vectors": {"know": 0.85, "uncertainty": 0.1}, "reasoning": "Auth flow fully understood"}'

What's Next in This Series

This is Part 1 of a series on epistemic AI — making AI tools that actually know what they know:

Part 2: Measuring What Your AI Learned — epistemic vectors in practice
Part 3: Grounded Calibration vs Self-Assessment — why self-reporting fails
Part 4: Adding Epistemic Hooks to Your Workflow — integration tutorial
Part 5: The Voice Layer — how AI learns your communication patterns

Each article will have runnable code, real measurements, and honest assessments of what works and what doesn't. Because that's the whole point — if you're not honest about uncertainty, you're just building a more eloquent liar.

Empirica is open source (MIT) and under active development. We're a small team in Vienna building measurement infrastructure for AI. If this resonates, check us out on GitHub or follow this series for the deep dives.

Why Your AI Agent Needs Memory That Decays (and How Qdrant Makes It Work)

David Van Assche (S.L) — Fri, 06 Mar 2026 13:30:22 +0000

I've been building an open-source epistemic measurement framework called Empirica, and one of the core challenges I ran into early on was memory — not the "stuff vectors in a database and retrieve them" kind, but memory that actually behaves like memory. Things fade. Patterns strengthen with repetition. A dead-end from three weeks ago should still surface when the AI is about to walk into the same wall, but a finding from a one-off debugging session probably shouldn't carry the same weight six months later.

That's where Qdrant comes in, and I want to share how we're using it because it's a fairly different use case from the typical RAG setup.

The problem with flat retrieval

Most RAG implementations treat memory as a flat store — embed a chunk, retrieve by similarity, done. That works for document Q&A, but it falls apart when you need temporal awareness. An AI agent working across sessions and projects needs to know not just what was discovered, but when, how confident we were, and whether that knowledge is still valid.

Think about how your own memory works — you don't recall every detail of every workday equally. The time you accidentally dropped the production database? That stays vivid. The routine PR you reviewed last Tuesday? Already fading. That asymmetry is functional, not a bug.

Two memory types, one vector store

We use Qdrant for two distinct memory layers:

Eidetic memory — facts with confidence scores. These are discrete epistemic artifacts: findings ("the auth system uses JWT refresh with 15min expiry"), dead-ends ("tried migrating to async but the ORM doesn't support it"), decisions ("chose SQLite over Postgres because single-user, no server needed"), mistakes ("forgot to check null on the config reload path"). Each carries a confidence score that gets challenged when new evidence contradicts it — a finding's confidence drops if a related finding surfaces that undermines it. Think of it as an immune system: findings are antigens, lessons are antibodies.

Episodic memory — session narratives with temporal decay. These capture the arc of a work session: what was the AI investigating, what did it learn, how did its confidence change from start to finish. Episodic memories naturally decay over time — a session from yesterday is more relevant than one from last month, unless the pattern keeps repeating, in which case it strengthens instead of fading.

Both live in Qdrant as separate collections per project, which gives us clean isolation and lets us do cross-project pattern discovery when we need it.

The retrieval side — Noetic RAG

I've been calling this approach "Noetic RAG" — retrieval augmented generation on the thinking, not just the artifacts. When an AI agent starts a new session, we don't just load documents. We load:

Dead-ends that match the current task (so it doesn't repeat failed approaches)
Mistake patterns with prevention strategies
Decisions and their rationale (so it understands why things are the way they are)
Episodic arcs from similar sessions (temporal context)
Cross-project patterns (if the same anti-pattern appeared in project A, surface it in project B)

The similarity search here isn't just cosine distance on the task description — it's filtered by recency, weighted by confidence, and scoped by project (with optional global reach for cross-project learnings).

What this looks like in practice

# Focused search: eidetic facts + episodic session arcs
empirica project-search --project-id <ID> --task "auth token rotation"

# Full search: all collections
empirica project-search --project-id <ID> --task "auth token rotation" --type all

# Include cross-project patterns
empirica project-search --project-id <ID> --task "auth token rotation" --global

When context compacts (and it will — Claude Code's 200k window fills up fast), the bootstrap reloads ~800 tokens of epistemically ranked context instead of trying to reconstruct everything from scratch. Findings, unknowns, active goals, architectural decisions — weighted by confidence and recency.

The temporal dimension

This is the part that makes Qdrant particularly well-suited. We store timestamps and decay parameters as payload fields, and filter on them at query time. A dead-end from yesterday with high confidence outranks a finding from last month with medium confidence. But a pattern that's been confirmed three times across two projects? That climbs in relevance regardless of age.

The decay isn't a fixed curve — it's modulated by reinforcement. Every time a pattern re-emerges, its effective age resets. Qdrant's payload filtering makes this efficient: we can do the temporal math at query time without re-embedding anything.

Why this matters beyond the obvious

The real value isn't just "AI remembers things" — it's that the memory is epistemically grounded. Every artifact has uncertainty quantification. Every session has calibration data (how accurate was the AI's self-assessment compared to objective evidence like test results and code quality metrics). The memory doesn't just tell you what happened — it tells you how much to trust what happened.

After 5,600+ measured transactions, the calibration data shows AI agents consistently overestimate their own confidence by 20-40%. Having memory that carries that calibration forward means the system gets more honest over time, not just more knowledgeable.

Try it

Empirica is MIT licensed and open source. If you're building anything where AI agents need to remember across sessions — especially if temporal awareness matters — the prosodic/episodic/eidetic architecture might be worth looking at.

GitHub: github.com/Nubaeon/empirica
Website: getempirica.com
Install: pip install empirica

Happy to answer questions about the Qdrant integration or the broader noetic RAG architecture.