DEV Community: Thomas Landgraf

I Gave 13 LLMs the Same Codebase and Asked for a Specification. Six Ran on My Laptop.

Thomas Landgraf — Mon, 25 May 2026 04:38:09 +0000

There are benchmarks for code an LLM writes. HumanEval, MBPP, SWE-Bench, LiveCodeBench. There are no benchmarks for the specifications an LLM writes. The upstream half of agentic software delivery has been flying blind — and the spec is what your downstream coding agent has to interpret.

I went looking for one and there isn't one. So I propose one, and to demonstrate it I gave thirteen LLMs the same real codebase (excalidraw) and asked each of them to produce a specification tree. Six of those thirteen ran locally on a laptop - via LM Studio and Ollama - and one of them landed within 12% of the frontier-cloud baseline. Then I made Claude Opus walk through every other model's output and judge it.

The numbers surprised me. So did how well the local half held up.

The metric: driftless implementability

A spec compiles to nothing. It is reviewed by the customer, the PM, the QA lead — not by a compiler. A bad function fails its test. A bad spec fails a meeting.

The question that matters to anyone shipping software with AI agents downstream is: hand the spec — and only the spec — to Claude Code, Google Antigravity, or Codex. Does the result match what the spec described? If yes, the spec was good. If the agent had to guess, invent, or ask, the spec was lossy. Drift is the cost. Driftlessness is the goal.

That third claim is the whole reason this experiment can exist. It re-frames "which LLM should I use to write my specs" as a question with an answer, not a vibe.

The setup

Thirteen LLMs. One brief. One codebase.

Cloud, frontier: Claude Opus 4.7, Sonnet 4.6, Haiku 4.5; GPT 5.4, GPT 5.4 Mini; Gemini 3.1 Pro and Flash previews
Local, open-weights: Qwen 3.6 35B A3B (LM Studio), Gemma 4 26B A4B (LM Studio, MoE), Gemma 4 31B (Ollama, dense), Gemma 4 8B (Ollama), GPT-OSS 20B (open weights), Nemotron 3 Nano (open weights)

Six of the thirteen never touched a cloud endpoint. The whole local cohort was a deliberate test of the question every privacy-constrained team has been asking since the open-weights wave: are on-laptop weights good enough for real agentic spec work in 2026?

I'm the creator of SPECLAN (full disclosure), a VS Code extension for spec-driven development. The pipeline that produced these trees is SPECLAN's Infer Specs from Code agent — it walks a codebase via MCP tools, decides what's a feature, writes the requirement, and stops. Same agent across all thirteen runs. Only the model changes.

Each output is a hierarchy of Markdown files: Goal → Feature → Requirement, every entity in its own file with YAML frontmatter (id, parent, status). You can walk all thirteen trees side-by-side on the speclan.net/compare gallery.

What the numbers say

The most defensible single metric is requirement count. Not because more is better — a spec with 200 noisy requirements is worse than one with 80 clean ones — but because it tells you whether the model committed. A model that wrote 12 requirements for excalidraw missed almost all of it. A model that wrote 200 saw the codebase.

The reference baseline: Claude Opus 4.7 produced 5 goals, 16 top-level features, 43 features, 197 requirements. That's the frontier-model bar.

The first surprise: Claude Haiku 4.5 produced more: 5 goals, 14 top, 45 features, 203 requirements. From a smaller and cheaper model. Haiku earned it by splitting Opus's requirements into smaller pieces — not by inventing things Opus missed, but by carving the same surface finer. The right read: model-family scaling sometimes trades resolution for terseness, not insight for capacity.

The second surprise: Qwen 3.6 35B A3B, running locally in LM Studio on an Apple M4 Max, produced 4 goals, 23 top, 49 features, 174 requirements - within 12% of Opus, no tokens leaving the machine. It was the strongest of the six local runs and the one that finally answers the on-laptop question affirmatively for me. The local LLM crowd has been right about something: open-weights MoE models in the 30-40B range crossed the threshold for real agentic work in 2026, and Qwen 3.6 35B A3B is the cleanest example I've benchmarked.

The local cohort split into three tiers worth naming. The 35B-class MoE (Qwen 3.6) was indistinguishable from frontier-cloud output at the structural layer. The 30B-class dense (Gemma 4 31B on Ollama) was the honest workhorse - 60 requirements across 5 top-level features, 3.3x sparser than Opus but covering the same core ground, credible as a privacy-safe substitute when cloud is off-limits. The smaller end (Gemma 4 8B, GPT-OSS 20B, Nemotron 3 Nano, and the Gemma 4 26B A4B MoE which terminated before the orientation pass) produced coherent feature trees but lost primitives - goals, vision, mission, or in Gemma 4 26B's case the literal goal body text.

The third surprise was a failure mode I'd never have predicted: Gemma 4 26B A4B (the MoE variant, also running locally) left the literal placeholder string "Goal description goes here." inside the body of G-093 "Intuitive User Experience & Customization." Real model output. Real shipped file. The smaller open-weights models often look like they're working - the tree fills in, the IDs validate, the structure passes - and then a goal body is template text that nobody asked for. This is exactly the failure shape that would silently slip past a junior reviewer and break implementation downstream.

The fourth surprise: GPT-OSS 20B (OpenAI's open weights) produced 0 goals — five top-level features, 16 features, 17 requirements — but no GOALS layer at all. Same hierarchy primitives, missing the highest level entirely. Coherent feature tree, no rationale layer above it. The kind of structural omission that's invisible in a single file but obvious the moment you open the tree.

The fifth surprise: Gemini 3.1 Pro wrote "User Identity and Access" features for a drawing tool with no accounts. Excalidraw is local-first, anonymous, no auth. The Gemini Pro spec invented an account-billing system that doesn't exist in the codebase. Pattern-match hallucination at the architecture layer — the model recognized "web app" and reached for the canonical web-app feature set, regardless of whether the actual code supported any of it.

Opus judges the rest

This is the part I didn't plan and ended up promoting to its own scene in the video.

After all 13 trees were generated, I had Claude Opus walk every other candidate's output and add a JUDGEMENT subsection at the top of each spec tree. Strengths. Weaknesses. Concrete drift risks. Opus on Haiku: positive — "captures the same surface at finer resolution." Opus on Gemma 4 26B: "placeholder text in G-093 body indicates incomplete generation; do not use as implementation input." Opus on GPT-OSS 20B: "feature tree is coherent but absence of goals layer means no traceability anchor for downstream agents." Opus on its own output: Opus declined to judge itself, which was the right call.

I find this beat the strongest single argument the video makes — not because Opus is the right judge of every spec, but because it demonstrates the thesis concretely. The downstream agent has to interpret the spec. If the most capable downstream agent we have access to today can name the drift risks in another model's spec, that's the same signal you'd get if you let it try to implement and watched it fumble. Faster to ask the question directly.

The local angle, unpacked

Six of the thirteen runs were local, mixed across LM Studio and Ollama as the runtime. The Qwen 3.6 35B A3B run was the strongest of the six and the one I expect most readers to care about, but the on-camera generation in the video is also the local one - so the privacy claim is visible, not just claimed.

LM Studio loaded Qwen 3.6 35B A3B at Q4_K_M with a 262K context window - comfortably above the ~50K floor SPECLAN's agents want for spec-tree generation. Tool use marked Supported. Architecture qwen35moe (Mixture of Experts). Same OpenAI-compatible /v1 endpoint surface as the frontier providers. Throughput on the M4 Max sat around 80 tok/sec.

SPECLAN's Local LLM (Experimental) provider in LLM Configuration accepts any OpenAI-compatible base URL. Switch the active provider from Anthropic to Local LLM, pick the model from LM Studio's (or Ollama's) loaded catalogue, click Apply. The same Infer Specs from Code wizard runs unchanged. The fact that the runtime is local is invisible to the agent code - it's just a different endpoint. Ollama serves the same /v1 shape, which is why the Gemma 4 31B and 8B runs slot in alongside the LM Studio ones without any agent-code change.

The macOS GPU monitor pinned at top-right of the video shows the M4 Max's GPU utilization bars churning the whole hour-long generation. The privacy claim - no tokens left the machine - is on screen, not just narrated. For privacy-constrained teams whose codebase can't leave the laptop, the structural finding from this benchmark is that you have a real choice in 2026: Qwen 3.6 35B A3B if you want output that's structurally indistinguishable from frontier-cloud, Gemma 4 31B (dense) if you want a slower-but-thoroughly-reliable workhorse, the smaller models if your tradeoff is hardware constraint over output quality.

The caveat: Qwen 3.6 35B A3B ran on the OpenAI SDK path (because LM Studio ships an OpenAI-compatible API; everyone does). The Anthropic SDK adds default scaffolding - a persistent Todo-List, a planner, a scratchpad - that the OpenAI SDK doesn't. Some of the requirement-count gap between Claude band (196-203 across Opus/Sonnet/Haiku) and the OpenAI-SDK candidates is SDK, not model. Qwen 3.6 35B A3B at 174 reqs is the OpenAI-SDK band outlier upward, which is the genuinely interesting signal.

A word on Qwen 3.6 35B A3B specifically

I want to be explicit about how remarkable this result is. The benchmark says: open-weights model, MoE architecture, running on a single Apple M4 Max laptop, quantized to Q4_K_M, talking to a SPECLAN agent through LM Studio's OpenAI-compatible endpoint, with no SDK-level scaffolding helping it - produced 174 requirements across 4 goals, 23 top-level features, 49 features on a 13K-file TypeScript monorepo. That output is structurally close enough to Claude Opus 4.7 (197 requirements, 16 top-level features) that walking the two spec trees side-by-side on the /compare page, you have to look at the model labels to tell them apart at a glance. Qwen's tree actually decomposes more aggressively at the top level (23 vs Opus's 16) - the model carved the canvas surface into finer top-level buckets than the frontier baseline did.

The tool-call reliability is the part that genuinely changed my mental model. Smaller-than-frontier models historically fail under multi-turn structured-output workloads - they produce coherent prose but the JSON-schema adherence falls apart by turn 8 or 10, and the agent's create_feature / create_requirement MCP calls start coming back malformed. Qwen 3.6 35B A3B held adherence across the full ~50-minute generation run with one self-correction (a delete_feature it issued after a misread on its own prior create_feature). One. On a multi-hundred-tool-call run. That's the kind of behavior I would have expected from a frontier model 18 months ago and not from open-weights weights running on a laptop.

If you're picking a single open-weights model to point at agentic spec work today, this is the one. It's the best on-laptop result I've seen, period - and it ran without the SDK scaffolding that the Claude-band candidates lean on. The privacy story finally has an output-quality story to match it.

What I'd tell you to do with this

If you're choosing a model to write the spec your downstream coding agent has to implement: read the /compare gallery, pick the two or three model families that are realistic for your budget and privacy posture, and walk their trees side-by-side. Don't average across model families — Gemini 3.1 Pro produced an architecturally different spec from Claude Opus, not a worse one or a better one. Different.

If you're privacy-constrained: Qwen 3.6 35B A3B in LM Studio on a 24-32GB unified-memory Mac is the current best on-laptop choice for agentic spec work. Throughput on M4 Max sat around 80 tok/sec; context held up well under multi-turn tool use; tool-call reliability surprised me more than the speed did. The 8B-class open-weights models are not there yet — they generate coherent feature trees but lose primitives (goals, status fields, or in Gemma 4 26B's case, the actual goal body).

If you're an SDD methodology nerd: the driftless-implementability framing generalizes beyond SPECLAN. You can apply it to any spec format — the test is "hand it to your downstream coding agent and watch what it does." Run the test on your own specs before you ship them.

The full 13-model walk-through is in the video at the top of this post. The interactive side-by-side viewer with all thirteen trees is at speclan.net/compare — every tree linkable, every requirement reachable, every JUDGEMENT subsection expanded. SPECLAN's Local LLM provider and the 13-model comparison blog post are the deeper-dive companions.

The spec is the upstream half of agentic delivery. It's been flying blind. Driftless implementability is one way to make it visible.

I Didn't Ship an 'Improve My Spec' Button. I Shipped Two.

Thomas Landgraf — Sun, 17 May 2026 12:49:58 +0000

For most of its life, the tool I maintain was good at one half of the job: keeping specifications structured, versioned, traceable, and honest about their status. The other half — actually getting the words onto the page — was still the user's to grind through. You stared at an empty requirement file and tried to remember every section that belonged there, every edge case you'd regret forgetting, every sibling spec you might be about to duplicate.

This week I shipped the half that helps with that. The interesting part isn't that there's now an AI in the loop. It's that I deliberately didn't ship the obvious version of it.

Full disclosure: I'm the creator of SPECLAN, a VS Code extension that manages product specifications as Markdown files with YAML frontmatter — Git-native, one file per requirement, organized in a hierarchical tree. The pattern works without the tool; SPECLAN is just where I observed and engineered around the design problem below.

The obvious version: one button

The obvious version of "AI helps you author specs" is a single button: Improve this spec. You write a rough draft, you click it, the model rewrites the whole thing better. Every AI authoring feature trends toward that shape because it demos well and it's one code path.

I built that. I didn't ship it. Two days into using it on my own specs, it was clear the single button was quietly doing two unrelated jobs badly, because they're not the same job.

Correctness and completeness are different problems

Watch what actually goes wrong with a first draft, and it fails in two distinct ways.

It's confidently wrong. The model fills every section with plausible-sounding text. The gaps don't announce themselves — they hide as assumptions nobody questioned until an implementer hit one three weeks later and built the wrong thing. "The system shall let users export their report" reads fine until someone asks: which formats? who's allowed? what about a report still generating? That's a correctness failure, and the fix is interrogation — someone has to ask the questions the draft glossed over and get answers.

It's narrow. You wrote the version of the feature that was in your head. The three adjacent things a good reviewer would have raised never made it onto the page, because review hasn't happened yet — including the ways the new spec quietly contradicts the specs already around it. That's a completeness failure, and the fix is the opposite motion: not asking the author what they meant, but proposing what they didn't think of.

Interrogation and proposal are different interaction models. One is a question form you answer. The other is an idea pool you accept or reject. Bolt both onto one "improve" button and you get a model that does each one vaguely. So I split it into two tools, each doing one thing legibly.

Clarify — make it correct

After the assistant drafts a spec from your one-line idea, Clarify reads it back the way a skeptical reviewer would: ambiguous wording, underspecified behavior, missing scope boundaries, decisions the draft assumed instead of stating. It hands you up to a few targeted questions — multiple-choice when several answers can apply, single-choice when it's one decision, free-text when it needs your judgment. You rate how much each question matters; you skip any you can't answer yet. Nothing blocks you.

On submit, it doesn't append a Q&A list to the bottom of the spec. It runs a refinement pass that works your answers back into the draft body and returns you to the draft, tighter. Before:

## Specification
When a user logs in, the system shall greet them with a
fortune cookie message.

The questions surface the holes — where is it shown, which sign-ins trigger it, what about service accounts. You answer two, skip one. After:

## Specification
On a successful interactive sign-in, the system shall display
a fortune cookie message in a non-blocking banner. It auto-dismisses
after 8 seconds. Shown at most once per user per calendar day;
service and automation accounts are excluded.

Same idea. A spec an implementer can build from without guessing.

Brainstorm — make it complete

Clarify makes a draft correct. Brainstorm makes it complete — and this is where I had to be most careful not to turn it into "add more stuff."

From your idea and the surrounding spec context, Brainstorm generates idea cards. What it raises is the thing I didn't expect to be the most valuable: not feature padding, but inconsistencies with the specs around the draft. A timing rule that contradicts an existing business rule. A dependency reference pointing at a bare ID instead of a real link. An out-of-scope exclusion that names a sibling spec ambiguously. You Accept what belongs, Reject what doesn't. Only accepted ideas are worked in.

The detail I'm proudest of is what happens to a rejected idea. If Brainstorm proposes "scheduled recurring export" and you reject it as scope creep, it doesn't just vanish — it can land as an explicit ## Out of Scope line. The thing you decided not to do is now written down on purpose. The difference between a spec that's complete and one that just looks finished is whether the deliberate exclusions are visible.

The user is the filter. Brainstorm widens the option space; it never widens scope without consent.

The posture, not just the features

The reason this is two narrow tools instead of one button isn't ergonomics — it's a stance about what an AI authoring flow is allowed to do. At no point does the tool finalize a spec for you. Clarify asks; you answer or skip. Brainstorm offers; you accept or reject. The draft lives in memory until you explicitly press Create, and it lands in draft status — the start of the review lifecycle, not the end of it. Owners and stakeholders review it exactly as they always have.

That constraint falls directly out of the product's premise. A spec's whole value is that it's a reviewed artifact with status discipline. An AI flow that auto-finalized specs would contradict the thing that makes specs worth keeping. So the flow is deliberately shaped so the AI accelerates the draft without bypassing the review the lifecycle exists to enforce. The single "improve" button quietly erodes that. Two suggest-then-decide tools preserve it.

The unglamorous win that shipped alongside

Not every improvement needs an AI. The same release ships a plain Search Bar in the editor — permanently visible, full-width, find-in-document with live highlighting, a match counter, next/previous with wraparound, and keyboard shortcuts to focus and clear. Case-insensitive, full-string matching, ignores the hidden YAML frontmatter. It's invisible until the first time a requirement grows past one screen and you need one acceptance criterion now. Secondary to the AI headline, but the quality-of-life touch long-time users had been asking for, and worth saying out loud that "ship the boring thing people actually asked for" still belongs in a release.

The generalizable lesson

If you're building AI authoring into any tool, the temptation is always the single universal "make it better" action — it's one code path and it demos beautifully. The lesson I keep relearning: before you build the button, name the failure modes it's supposed to fix. If there's more than one and they need opposite interaction models, one button will do all of them vaguely. Correctness wants interrogation; completeness wants proposal. They earned two tools because they were two problems.

The SPECLAN-specific tour — the full flow, screenshots, the Search Bar, and a step-by-step walkthrough — lives in the release notes and the help guide. The video at the top is the 3-minute version.

What's a feature you shipped as one action that, in hindsight, was secretly two? Curious whether the "name the failure modes first" lens lands the same way elsewhere.

Why I Shipped Two Artifact Mechanisms In My VS Code Extension — Not One

Thomas Landgraf — Sat, 09 May 2026 10:51:18 +0000

A specification is more than text. It comes with a wireframe, the regulatory PDF it answers to, the API contract it has to honour, the stakeholder slide deck someone negotiated against. For a long time, none of that lived in my spec tree. The Markdown files were git-tracked; the evidence behind them rotted in Confluence, in shared drives, in pasted-and-lost screenshots in chat.

This week I shipped a fix in v0.9.7 of the VS Code extension I maintain. The shape of the fix is the part I want to write about, because the obvious version of it would have been wrong.

Full disclosure: I'm the creator of SPECLAN, a VS Code extension that manages product specifications as Markdown files with YAML frontmatter — Git-native, one file per requirement, organized in a hierarchical tree. The pattern (Markdown + YAML + Git) works without the tool; SPECLAN is just where I observed and engineered around the design problem below.

The obvious version: one artifact mechanism, governance everywhere

Specs travel in a lifecycle: draft → review → approved → in-development → under-test → released → deprecated. Once a spec is approved, the team has agreed on its content; the implementation gets built, tested, and shipped against that agreement. The natural next thought: artifact attachments should follow the same discipline. If you can't silently rewrite a released requirement's body, you also can't silently swap out the API contract attached to it. So: one universal artifact mechanism, Change Request governance everywhere.

I sketched that. I didn't ship it. Two days into testing, I realized the obvious version forced ceremony onto material that had no business going through a review cycle.

The thing the obvious version got wrong

Consider the artifacts a project actually accumulates over a year:

The login-flow-mockup.png attached to a specific feature. Owned by that feature. Means nothing without it.
The api-contract.json attached to a specific requirement. Verified against that requirement. Drift from it is a real bug.
architecture/system-overview.png. Referenced by half the specs. Owned by no spec.
brand-guidelines.pdf. Cited by every customer-facing surface. Owned by no spec.
regulatory/pet-handling-compliance.md. Read whenever a new compliance-touching feature is drafted. Owned by no spec.
meeting-notes/2026-04-22-kickoff.md. Reference material for context. Owned by no spec.

The first two are evidence pinned to a specific entity, with a specific lifecycle, where governance is the whole point. The other four are reference material — shared across many specs, owned by none, not bound to any specification's release cycle.

If I force-march the second class through a Change Request workflow, every architecture-diagram update needs a 4-stage review against… what spec? It doesn't belong to one. The reviewer would be approving a change with no parent entity to compare against. The ceremony has no anchor.

So I shipped two mechanisms with deliberately different governance:

Axis	Spec Artifacts	Project Artifacts
Hierarchy	Flat — top-level files only	Full nested filetree
Change Request governance	Mandatory on locked specs	None — direct file ops always
Filename sanitation	Enforced at every Add path	Whatever the filesystem accepts
Scope	Pinned to one spec entity	Project-wide reference, owned by no one
Visual surface	Section at the bottom of a spec page	Third entry in the project tree

The split falls out of one observation: locking applies to entities with status; project folders don't have status. A spec has a status. A project folder doesn't. There is no "released project" to gate changes against, so a unified governance mechanism would have no anchor for one of the two cases.

How Spec Artifacts work

Every feature, requirement, or change request gets its own artifacts/ folder right next to its .md file:

speclan/features/F-2419-login-flow/
├── F-2419-login-flow.md
├── artifacts/
│   ├── login-mockup.png
│   ├── api-response-schema.json
│   └── stakeholder-approval.pdf
└── change-requests/

In the WYSIWYG editor, an Artifacts section appears at the bottom of the spec page. Drag-drop or pick to add. Click to open in the registered default viewer. Image artifacts (PNG, JPEG, GIF, WebP, SVG) get an extra trick: they can be embedded inline in the spec body as illustrations, diagrams, or mockups — not as separate attachment rows. The on-disk artifact stays a plain ![alt](artifacts/file.png) markdown link, so the spec is portable to any markdown viewer.

The Change Request gate kicks in on locked specs. The dispatch table:

Parent spec status	Add / Remove behavior
`draft`	`review`
`in-development`	`under-test`
`deprecated`	Add/remove disabled

On a locked spec, dropping in an artifact doesn't overwrite the canonical file. The system creates a Change Request in draft status and stages your file under a CR-suffixed disk name. The CR flows through the standard draft → review → approved → in-development → under-test → released lifecycle. When you click Merge, the staged file becomes canonical.

The reason for this discipline isn't audit hygiene — it's artifact / implementation drift. A spec in released is one whose word the implementation team has built against. The API contract attached to it is the contract the build was verified against. If that contract silently shifts, the implementation no longer matches the evidence and nobody knows. The CR-staging mechanism prevents the silent shift; an approved CR doubles as a signal into the implementation flow — same trigger releases the new evidence and tells the implementation it needs to update.

How Project Artifacts work

A single project-wide directory at speclan/artifacts/. Folders nest to any depth. Drop files in via the picker, drag from your OS file manager, or organize subfolders directly through the filesystem — the on-disk filetree is the source of truth, the editor's tree view auto-refreshes via a filesystem watcher.

No CR governance. No filename sanitation. No status check. Direct file ops always.

This isn't laziness; it's the second half of the design choice. Reference material doesn't have a release cycle of its own. An architecture diagram updates when the architecture updates. A brand-guidelines PDF updates when marketing pushes a new revision. The project's own lifecycle drives those changes — there is no spec entity with a status that owns them, so there's nothing to gate against.

The two mechanisms compose: a spec body can link to a project artifact via plain relative markdown ([brand guidelines](../../../artifacts/brand-guidelines.pdf)). The linkage is plain markdown — SPECLAN doesn't track it as a referential relationship, doesn't auto-stage it under a CR, and doesn't validate the path. They share one thing: a consistent icon vocabulary. A .pdf artifact gets the same icon in the spec's Artifacts section as it does in the project tree. Different governance, same visual language.

What the extension does NOT do with artifact bytes

One boundary worth being explicit about, because the question gets asked: SPECLAN does not read, parse, summarise, or interpret the bytes of your artifacts. None of the AI features (clarification assistants, code-walking inference, change-request merging) consume artifact contents. The file is stored, referenced, surfaced in the UI, governed through CRs, and kept in sync on rename — that's the whole interaction the extension has with it.

What the extension does is make sure the implementation agent — Claude Code, Codex, Cursor, whatever you hand the spec to — can find the artifacts and decide for itself how to read them. Markdown, JSON, source code, and most images are read natively by modern coding agents. PDFs, DOCX, PPTX usually need a skill attached to the agent or a pre-extraction step into a sibling Markdown artifact before the implementation hand-off.

This is deliberate. Bundling PDF/DOCX parsers into the extension would (a) bloat it, (b) lock users into one extraction pipeline, and (c) silently expose binary content to AI providers users may not have authorised for that scope. Artifacts are evidence the implementation agent can find — not pre-digested input the AI has already consumed.

The one-click bridge to implementation

While the architecture work was the headline of this release, the quality-of-life win that ships with it is a button the extension has needed since v0.9.0: Quick Impl. A pill-shaped button in the editor topbar, visible on Features, Requirements, and Change Requests, that turns an approved spec into a paste-ready implementation prompt with a single click.

The prompt is structured to read the spec from its relative path, ask the user which implementation technique to use, set the spec's status to in-development for the duration of the work, and bookend the lifecycle by flipping to under-test when development is complete. It's a single-spec, fire-and-forget hand-off — the explicit alternative to the planfile-based workflow for "I have one approved feature and just want to ship it now."

What this changed about my own workflow

For a year I'd been treating the spec body as the only thing that lived in git. The wireframes lived in Figma; the contracts lived in OpenAPI files in another repo; the regulatory references lived in shared-drive PDFs that I emailed myself when I needed them. The day I started attaching them all to the spec tree under v0.9.7, I noticed how much friction the previous pattern carried that I'd just become numb to.

The deliberate split — governed Spec Artifacts, ungoverned Project Artifacts — is the kind of design choice that's obvious in hindsight but easy to get wrong upfront. A less careful version would have shipped one universal artifact mechanism with CR governance everywhere, and users (myself included) would have spent months filing bugs about why the architecture diagram needs a 4-stage review cycle to update. Two mechanisms with different governance is what falls out of taking the entity-lifecycle abstraction seriously.

If you're building tooling that touches a spec lifecycle, the practical lesson generalizes: governance is determined by what the parent entity needs, not by what's most uniform. A unified mechanism is satisfying from the architect's seat. From the user's seat, it forces ceremony on material that has none.

For the SPECLAN-specific tour, the release notes and the Artifacts help page walk through the user surfaces in detail. And if you're curious which of today's frontier models writes specs you'd actually trust to carry real artifacts, speclan.net/compare parks 13 models' output on the same brief side-by-side.

What's the cleanest split you've made between governed and ungoverned state in tooling you've shipped? Curious whether the "what does the parent entity need" lens lands the same way elsewhere.

The Window Is Closing: Spend $1200 on Yourself Before AI Pricing Catches Up

Thomas Landgraf — Sat, 09 May 2026 06:14:28 +0000

The developers who use the current cheap-access era to become AI-native will keep accelerating when the pricing resets. The ones who wait will not catch up.

The number that started this for me

In March 2026, on the last day of NVIDIA GTC, Jason Calacanis asked Jensen Huang on the All-In Podcast whether NVIDIA was spending around $2 billion a year on AI tokens for its own engineering team. Huang's answer is the line that has been rattling around in my head for two months:

"Let's say you have a software engineer or AI researcher, and you pay them $500,000 a year. At the end of the year, I'm going to ask him how much did you spend in tokens. If that $500,000 engineer did not consume at least $250,000 worth of tokens, I am going to be deeply alarmed."

— Jensen Huang, NVIDIA GTC 2026

Read it twice. The CEO of the company that makes the GPUs everyone else is buying is saying, out loud and on a podcast, that the price-to-output ratio of his most senior engineers should look like 1:0.5 in tokens-to-salary. He went on to add that he plans to "give them probably half of that on top of it as tokens so that they could be amplified 10×." That isn't a hypothetical. That's NVIDIA's stated policy on how its own engineers are expected to operate.

Now look at the gap between that future and the room you're sitting in. In Jensen's world, a senior engineer consumes $250,000–$500,000 of inference per year and the company expects it. In the market today, a working developer can get a flat-rate Claude Max or ChatGPT Pro plan for $1,200 a year that delivers, by the public receipts I'll walk through below, somewhere between $9,000 and $36,000 of API-equivalent inference.

There are exactly two ways that gap closes:

The inference price comes down to meet the plan price. Possible, but the unit economics — training cost doubling every seven months, providers losing money on aggregate, venture investors eventually demanding margin — say no.
The plan price moves up to meet the inference price. The arbitrage shrinks. Companies start budgeting AI tokens as a real line item — Jensen-style — and the people who are not yet AI-native get capped out of the budget.

I'm betting on the second one. And if you're betting on the second one too, the thing to do right now, before the gap closes, is to buy yourself the cheapest year of frontier AI access you will probably ever live through. A $100/month plan is $1,200 a year. The compounding career return on twelve months of unrestricted AI exposure during this window is, in my honest read, the highest-ROI line item available to a working developer.

The two sections that follow show why the window is real, why it's closing, and what to do with it.

The subsidy story: what the receipts say

The headline that comes out of the public measurements:

Heavy interactive users are getting 5–20× the API-equivalent value of their subscription, sometimes much more, and the gap is widening as people learn how to push the tools.

The receipts that anchor that range:

A public usage tracker logged 755.7M tokens through Claude Max 20x in a single month, with an API-equivalent cost of ~$1,428 against $200 paid — about a 7× discount on a typical heavy user.
One developer documented 10B tokens of Claude Code use over eight months that would have cost ~$15,000 at API rates but cost ~$800 on a Max plan — a 19× discount sustained across the better part of a year. (Blended ~$1.50/Mtok — Sonnet-heavy with strong cache hits.)
The extreme case — Business Insider reported a single Claude Code user consuming ~11B tokens in a month against a $200 subscription, roughly $35,000 of API-equivalent work. That single user is reportedly part of why Anthropic introduced weekly limits in late 2025. (Blended ~$3.20/Mtok — Opus-heavy with weaker caching, which is why the same token volume costs more than twice as much in API-equivalent terms.)

Two structural numbers that tell the same story from a different angle:

Break-even threshold for Max 20x: ~22M Sonnet tokens per month. Any working developer who actually uses Claude Code on a real codebase clears that in the first week.
Break-even for Max 5x: ~11M tokens per month. A single agentic afternoon.

This isn't 20% better. It isn't 2× better. The plans are pricing access to frontier coding intelligence at a fraction of what the same intelligence costs on the meter — and the providers are visibly tightening the screws. The pattern is unmistakable once you line it up:

July 2025: Anthropic introduces weekly rate limits on Pro and Max plans on top of the existing 5-hour windows, framed as affecting "less than 5% of subscribers."
March 2026: Anthropic, as reported by The Register, quietly redistributes the 5-hour session limits so that Pro and Max users hit caps faster during U.S. business-hours peak (5:00–11:00 PT). The framing is "demand management." The effect on a working developer in a U.S. timezone is that the same workflow that fit inside a session in February no longer fits in March.
April 2026: OpenAI launches a $100 ChatGPT Pro tier as a direct counter to Claude Max, adding a new pricing rung rather than dropping the existing ones.
May 2026: Anthropic doubles Claude Code's 5-hour limits and removes peak-hour reductions — but only after locking in $100B+ of AWS capacity and a SpaceX Colossus 1 deal. The relief is real and explicitly capacity-bound; it isn't a price cut.

The arbitrage exists now. The trajectory of every change in the last twelve months has been to manage the gap, not eliminate it — and certainly not in the user's favor. The question this post asks is: how long can it possibly last, and what should you do with it while it's there?

The economics story: why the providers can't hold this price

Three numbers from the public unit-economics record matter for the rest of this argument:

1. The inference bill is climbing faster than the revenue. TechCrunch, citing leaked documents analyzed by Ed Zitron, reported OpenAI inference spend at ~$3.8B in 2024 and ~$8.65B in just the first nine months of 2025 — implying inference costs that may at points have exceeded API revenue. Anthropic's reported 2025 gross margin of ~40% came in after inference costs ran ~23% above their internal plan. These are companies that are gross-margin positive on the call but burning cash on the company.

2. Training costs are doubling every seven months. Epoch AI's modeling has frontier-model training cost growing roughly 3.5× per year. Dario Amodei has publicly said current frontier runs cost between $100M and $1B, with $10–100B plausible in the 2025–2027 window. Every dollar of API revenue you spend ultimately has to amortize the next training run, not the last one — and the next one is going to cost more than the last one did, by a factor that compounds.

3. The infrastructure commitments are real and not coming back. Anthropic raised a $30B Series G in February 2026 at a $380B post-money valuation and committed >$100B over ten years to AWS for up to 5 GW of capacity. OpenAI has Stargate. These are companies pre-paying for compute on a multi-year horizon. That money has to be repaid by someone eventually, and it isn't coming out of the API margin alone.

Now stack these:

Heavy user is getting 7–20× API-equivalent value on a plan
+ Inference costs growing faster than revenue
+ Training cost doubling every 7 months
+ Multi-billion-dollar capacity commitments to repay
+ Investors will eventually want margin, not growth-at-all-costs
= an unsustainable equilibrium

The plans are not cheap because compute got cheap. The plans are cheap because two companies are spending venture money to win a platform war for developer mindshare. Once one of them wins — or once the venture money insists on margin — the plan price has to move toward the API price, not the other way around.

You can argue with the timing. I cannot tell you whether this resets in twelve months, twenty-four, or thirty-six. I can tell you with high confidence that the current $/token price on a $100 plan is not the long-run equilibrium, and that the direction of travel is toward more expensive, not less.

What happens when the music stops

Now layer the unit economics onto a corporate hiring decision. Imagine the eventual repricing — even mild — where one truly AI-amplified senior developer is consuming the API-equivalent of $5,000 to $10,000 per month in inference, and the company is paying that as a real line item rather than a flat $200 plan.

Companies do not casually hand $10,000/month of inference to every junior on the team. They optimize ROI. The cold version of that calculation looks like this:

Senior, AI-native, deeply experienced with the toolchain → gets unrestricted access. Productivity multiplier of 3–10×. Easy ROI on the inference bill.
Mid-level, still ramping on AI workflows → gets capped access. Limited to the cheaper model tiers. ROI is real but smaller.
Junior, not yet AI-fluent → gets a Plus-tier seat at most, or none at all. The economic argument for hiring three juniors instead of one AI-amplified senior gets uncomfortable fast.

This isn't a future conversation. It is already showing up in real boardrooms, and it is the natural endpoint of the unit economics. The "junior developer arbitrage" — three cheap people doing the work of one expensive person — was historically supported by repetitive implementation work that AI now does better than a junior. The repetitive implementation work is exactly what LLMs are best at. That tier of work is the one collapsing.

The risk for a developer entering or sitting in this category is not "AI replaces me." The real risk is much more specific:

AI-enabled developers replacing non-AI developers.

A mediocre AI-native developer can already out-ship a good traditional developer on most implementation tasks. Not because they are smarter — because they are amplified. The pattern is the same as excavators replacing shovels, CAD replacing paper drafting, compilers replacing assembly, cloud replacing server rooms. The tool becomes the new abstraction layer; the people who don't speak it become economically invisible.

Why the next six months matter more than people realize

The current cheap-access phase creates a narrow opportunity window for developers who haven't yet converted. During this window you can still afford to spend entire evenings on the kind of unstructured experimentation that builds intuition:

Building actual side projects, not toy ones.
Wiring up your home automation through agents.
Prototyping the SaaS idea you've been carrying for two years.
Learning what context engineering actually means in practice.
Watching how the model fails on real code and developing the verification habits that catch it.
Building the prompt-and-orchestration patterns that don't exist in any course because the field is too new.

This list looks unserious. It isn't. AI development is becoming a new engineering discipline, and the discipline is built almost entirely from accumulated hours-on-tool. Decomposition strategies, context layering, specification-driven development, verification loops, AI orchestration, model routing, autonomous-coding patterns, human-AI review cycles — none of these are taught yet. They're built by working developers running a lot of agents through a lot of tasks and noticing which patterns hold.

The compounding shape of this is what makes the window so dangerous to miss:

Each month of unrestricted AI use makes you better at AI use, which makes the AI more useful to you, which accelerates your output, which creates more exposure, which makes you better at AI use. Skip the loop for a year and the gap to someone who didn't is not "twelve months of practice." It's twelve months of compounding.

You don't catch up to a year of compound growth by working harder for six months. The math doesn't allow it.

"My company doesn't pay for a plan yet"

Then pay for it yourself.

I am genuinely serious. A $100/month plan costs $1,200 a year. That is roughly the price of:

A flagship smartphone
One conference ticket with travel and a hotel night
A consumer GPU like an RTX 4070 Ti / 4080
A Herman Miller-class ergonomic chair
A long weekend somewhere nice

For $1,200 you get twelve months of effectively unlimited frontier-model coding intelligence — the receipts above put the API-equivalent value somewhere between $9,000 and $36,000 over that same year. There is, in my read, no other line item available to a working developer with a higher career-compounding return per dollar. None. Not a course, not a degree, not a conference, not a piece of hardware.

A few practical guardrails for spending it well:

Use it for things you actually care about. Side projects. The home-automation rebuild. The CLI tool you've sketched five times. Engagement matters more than novelty — you'll learn faster on a problem you care about than on a tutorial.
Don't only chat — build and ship. The skills that matter compound through real workflows: agent loops, verification, context management. Chat-only use will teach you a much smaller fraction of what's available.
Lean into the failure modes. Notice when the model loops, when it hallucinates a tool call, when it edits the wrong file. Those moments are how you build the intuition that separates an AI-native developer from someone who types prompts.
Pick one stack and go deep. Claude Code with Sonnet 4.6 / Opus 4.7 is what I use. Codex is fine. Pick one, build the muscle memory there first, generalize later.
Measure something. Token consumption, completion rate, time-to-shipped on side projects. You don't need a dashboard — even a notebook entry per week tells you whether you're improving.

The two-class industry

I'm increasingly convinced the software industry will divide into two broad categories over the next two to three years.

AI-amplified engineers. These developers orchestrate systems, guide AI agents, verify output, design architectures around what AI can and can't do, and effectively leverage massive inference budgets. Their per-developer output becomes hard to describe with the productivity language we use today. They are the ones companies will pay $5,000–$10,000/month of inference for, because the ROI is obvious.

Non-AI engineers. These developers will increasingly compete on lower cost, maintenance work, legacy systems, and commodity implementation tasks — the kind of work where AI exposure is either restricted or where the codebase resists it. That tier exists. It will keep existing. It is not where you want to spend the second half of your career.

The two-class split is not a moral judgment. It's an economic one. Once companies have a clear view of per-developer inference ROI, the people who don't justify the inference spend will be capped on it, and the people who do will get more.

My advice

Do not wait for permission. Do not wait for your company. Do not wait for the "official AI processes" memo.

Use this phase. The current pricing is, almost certainly, the cheapest year of AI access you will ever live through as a developer. The window is open now because two companies are spending venture money to keep it open. When that ends — whether through repricing, rate-limit tightening, or the slow grind of training-cost amortization showing up in the API price — the developers who already built the loop will keep accelerating, and the ones who didn't will spend years trying to catch up to a moving target.

The whole investment is $1,200. Build things. Break things. Automate your house. Ship side projects no one asked for. Use the model on hard problems and watch where it fails. Read traces. Notice patterns.

This is, sincerely, the highest-leverage career investment available to a working developer in 2026. If your employer covers it: great. If not: cover it yourself. A year from now you will either be inside the AI-native loop or trying to argue your way back into it from outside.

Of course, you can take the other bets. You can bet on cheap Chinese open-weights models catching up. You can bet on local LLMs being good enough by the time the plan price resets. You can bet that the weaker tiers of the big providers will stay capable enough to keep you competitive against developers running unrestricted on the frontier. Any of those might work — none of them are unreasonable. But look at the asymmetry: if my hypothesis is wrong, I've spent the price of a phone on a year of frontier coding intelligence. If the other bets are wrong, you've spent that same year falling behind people who didn't make them. For $1,200, I'd still rather be on my side of the trade.

A personal note before I close. The best investment my father ever made in my career — and my brothers' — was a Commodore VIC he bought in 1982 for 600 Deutsche Mark, about $250 in the money of the time. He had no idea what we would do with it. Neither did we. What it bought us was hours on a machine that was about to become economically central, during a window when very few people our age had access to one. Forty-some years later, both of us still earn our living downstream of those evenings on that little keyboard. A $1,200 plan in 2026 is the same shape of bet — bigger absolute number, smaller in real terms, and aimed at a tool that is changing what software work is even faster than the home computer did.

Think about it. And maybe thank me later.

If the math feels different from what I described — particularly if you've found a way to stay productive without a paid plan, or if you think I'm overstating the timing — I'd genuinely like to hear it. Drop a comment.

Sources

The Jensen Huang quote (NVIDIA GTC 2026)

Tom's Hardware — NVIDIA engineers should use AI tokens worth half their annual salary every year — tomshardware.com
All-In Podcast on X — original podcast clip — x.com/theallinpod
R&D World — NVIDIA CEO says elite engineers should spend at least $250K on tokens annually — rdworldonline.com

Plan vs API: pricing, receipts, break-even

Anthropic — Plans & Pricing — claude.com/pricing
Anthropic — API Pricing — platform.claude.com
OpenAI — API Pricing — openai.com/api/pricing
Verdent — Claude Code Pricing 2026 (the 755.7M-token / $1,428 Max 20x tracker) — verdent.ai
Product Compass — Subscriptions vs API (the 10B-token / 8-month / $15K → $800 case) — productcompass.pm
KSRed — I built a cost tracker for Claude Code — ksred.com
PricePerToken — Subscription vs API calculator — pricepertoken.com
Business Insider — Inference whales threaten AI coding startups' business model (the 11B-token / ~$35K example) — businessinsider.com

Provider rate-limit and pricing moves

TechCrunch — Anthropic unveils new rate limits to curb Claude Code power users (Jul 2025) — techcrunch.com
The Register — Anthropic tweaks Claude usage limits to manage capacity (Mar 2026) — theregister.com
TechCrunch — ChatGPT finally offers $100/month Pro plan (Apr 2026) — techcrunch.com
Anthropic — Higher Limits and a Compute Deal with SpaceX (May 2026) — anthropic.com
Bloomberg — Anthropic, SpaceX Sign Deal to Boost AI Computing Power (May 2026) — bloomberg.com

Unit economics, training cost, provider profitability

TechCrunch — Leaked documents shed light on how much OpenAI pays Microsoft — Zitron, $3.8B / $8.65B inference spend (Nov 2025) — techcrunch.com
Bloomberg — OpenAI sees better margins on business sales (Dec 2025) — bloomberg.com
Investing.com — Anthropic trims profit margin outlook (~40% gross margin, inference ~23% over plan) — investing.com
WSJ — OpenAI / Anthropic IPO finances — wsj.com
WSJ — The Spiraling Cost of Making AI — wsj.com
Financial Times — AI inference economics — ft.com
WIRED — Sam Altman on GPT-4 training cost ("more than $100M") — wired.com
Entrepreneur — Dario Amodei on training-cost trajectory — entrepreneur.com
Epoch AI — How much does it cost to train frontier AI models? (~3.5×/year growth) — epoch.ai
FutureSearch — OpenAI API Unit Economics (~75% gross margin estimate, June 2024) — futuresearch.ai
Anthropic — $30B Series G at $380B post-money (Feb 2026) — anthropic.com
Anthropic — Expanded compute partnership with Amazon (Apr 2026) — anthropic.com
OpenAI — GPT-5 system card — openai.com
OpenAI — GPT-4 Technical Report (PDF) — cdn.openai.com
ProPublica — OpenAI Inc. Form 990 filing — propublica.org

The SDK You Pick Matters More Than the Model — A 13-LLM Benchmark on the Same Agentic Task

Thomas Landgraf — Fri, 01 May 2026 09:16:36 +0000

If you have ever built an agent that walks a codebase, calls tools, and writes structured output, you have hit the same wall I kept hitting: the same model produces wildly different results on the same task depending on what harness you wrap it in. Swap Claude for GPT behind a single OPENAI_BASE_URL and you lose half your output quality. Everyone blames the model. The model is rarely the variable.

I ran an experiment to put a number on it. Thirteen LLMs — Claude Opus 4.7, Sonnet 4.6, Haiku 4.5, GPT 5.4, GPT 5.4 Mini, two Gemini 3.1 previews, and six open-weights locals (Qwen 3.6 35B A3B, Gemma 4 at three sizes, GPT-OSS 20B, Nemotron 3 Nano) — on the same real agentic task. Same codebase (excalidraw), same MCP tools, same system prompt. Only the model changes. The output is a specification tree: goal → feature → requirement hierarchies of Markdown files.

What if the SDK is doing more of the work than anyone admits?

Every provider ships an SDK. Most teams assume the SDK is a thin wire-protocol wrapper. It usually isn't. Here's what the Anthropic SDK ships with by default, alongside the MCP tools I expose:

A persistent Todo-List the model reads from and writes to across turns.
A planner for multi-step reasoning that doesn't burn the main conversation budget.
A scratchpad for cross-turn notes that never reach the final output.

Here's what the OpenAI SDK ships with by default when you give it MCP tools: the MCP tools. Nothing else.

I'm the creator of SPECLAN, a VS Code extension for spec-driven development, and the pipeline in this benchmark is one of SPECLAN's agents (full disclosure). But the lesson generalizes to any multi-provider agent harness — and the numbers are genuinely jarring.

The band gap

Requirements produced on the same codebase, same prompt:

Model	SDK	Requirements
Claude Opus 4.7 (1M)	Anthropic	197
Claude Sonnet 4.6	Anthropic	196
Claude Haiku 4.5	Anthropic	203
GPT 5.4	OpenAI	43
GPT 5.4 Mini	OpenAI	60
Gemini 3.1 Pro preview	OpenAI-compat	17
Gemini 3.1 Flash preview	OpenAI-compat	13
Qwen 3.6 35B A3B (local)	OpenAI-compat	174
Gemma 4 31B dense (local)	OpenAI-compat	60
Gemma 4 8B (local)	OpenAI-compat	21
GPT-OSS 20B (local)	OpenAI-compat	17
Nemotron 3 Nano (local)	OpenAI-compat	12

Two things jump out immediately.

The three Claude models cluster at 196–203 regardless of size. Opus is several times the size of Haiku. If model size were driving volume, you would see variance. You don't. That flatness is the scaffolding floor — the shape of what the Anthropic agent loop produces on this benchmark, not the ceiling of what Opus can do.

Every OpenAI-SDK model except one sits at 13–60. An order of magnitude below the Anthropic band. Different vendors (OpenAI, Google, Meta-derived, Chinese open-weights), different sizes (8B to 120B+), same roughly-converged output volume. That convergence is what you would expect if the binding constraint were the harness, not the model.

Why spec authoring exposes scaffolding so brutally

Here's the technical meat. Spec authoring is fundamentally a list-management problem: enumerate the features the code implements, write a requirement, cross it off, move to the next. A human technical writer does this with a notepad.

Without a Todo-List, an LLM has to re-derive from conversation history every turn: did I already write requirements for Shape Drawing Tools? Let me check the last 12 turns… yes I did. Element Organization? Let me check… no, that's next. Every single turn, this bookkeeping consumes context window and decision budget that could have gone into writing the actual requirement.

With a persistent Todo-List, the model does one tiny tool call (todo_list_read), sees next undone: Element Organization, and gets to work. It's doing a fundamentally easier version of the task. That's why you get 197 requirements from Opus and 43 from GPT 5.4 on the same brief — the first model was given a list-management abstraction, the second had to reinvent it in every turn.

If you've ever wondered why your Claude-via-Anthropic-SDK agent seems to "remember things twelve files ago" and your GPT-via-OpenAI-SDK agent feels like it restarts every turn — this is why. Anthropic's SDK implements memory as a tool. OpenAI's SDK expects you to bring your own.

The one exception — and why it matters

Look at the table again: Qwen 3.6 35B A3B produced 174 requirements on the no-scaffolding OpenAI-SDK path. Running locally in LM Studio on a Mac M4 Max, 50k context. Within 12% of the Anthropic cluster. It is the one outlier in an otherwise tight 13–60 band.

Our best guess for why: Qwen's training mix is heavy on agentic tool-call trajectories, and that data seems to have internalized some of the bookkeeping the Anthropic SDK externalizes as tools. The model brought its own list-management to the task.

This matters because it proves the gap is closable without the SDK help — just not by most models. You can think of the benchmark as a 2×2:

	Scaffolding in harness	No scaffolding in harness
Scaffolding-trained model	Opus / Sonnet / Haiku (196–203)	Qwen 3.6 35B A3B (174)
Not trained for agentic bookkeeping	— (we don't have data)	GPT 5.4, Gemini, Gemma, GPT-OSS, Nemotron (13–60)

Three of the four quadrants are populated. The missing one — "scaffolding-trained model on a scaffolding-free harness" — is the obvious follow-up: run Opus on an OpenAI-SDK harness with the Anthropic tools explicitly stripped, so Opus operates on the same MCP-only surface as the others. The delta between that number and Opus's 197 is the SDK's contribution. Whatever's left is the pure-Opus delta. That's the experiment we're shipping next.

The Gemini anomaly — same vendor, two outcomes

One finding from the benchmark lands harder than any single row: Gemma (local open-weights) succeeded at the task; Gemini (frontier cloud preview) failed. Same company. Same pipeline on our end. Same adapter layer (OpenAI-compatibility).

Gemma 4 8B wrote a coherent on-domain tree — every one of 21 requirements landed on a real excalidraw feature. Gemini 3.1 Pro preview wrote Account and Billing Management, Personalized Analytics Dashboard, full acceptance criteria for Subscription tier management (upgrade, downgrade, cancel). Excalidraw has no accounts and no billing.

Our working hypothesis: our OpenAI-compatibility shim round-trips tool-call payloads in a format Gemma tolerates but Gemini treats differently. Gemini falls back to training priors when the adapter produces turns it cannot fluently continue — and "enterprise SaaS reference architecture" is over-represented in those priors. Before anyone dismisses Gemini previews as weak at agentic work, that rerun through Google's native GenerateContent API with planning primitives enabled is on the follow-up list.

The generalizable lesson for anyone building multi-provider agents: every harness silently privileges some providers over others. Our harness privileges Anthropic (full SDK integration), accidentally privileges Gemma-like models (MCP-only works for them), and does not fit Gemini 3.1. Your harness will have the same asymmetry in a different shape. The answer to "which model is best for my harness?" is not "whichever has the most parameters" — it's "whichever your harness actually fits."

What to take from this if you are building agents

Audit what your SDK ships by default. If you picked the Anthropic SDK and haven't looked inside, a non-trivial share of your agent's competence is coming from the built-in Todo-List / planner / scratchpad. Switch providers without replacing that layer and you will measure a model-quality drop that is actually a scaffolding drop.
Invest in the scaffolding layer before investing in a bigger model. In our benchmark, scaffolding was worth roughly an order of magnitude of output volume. A bigger model on a thin harness will not close that gap. A smaller model on a thick harness often will.
Multi-provider support is a harness problem, not a config-flag problem. If you're offering users a choice of providers, you're offering them a choice of how well your harness fits their provider. That's architectural work, not one-line-of-YAML work.
The training mix sometimes bridges the gap for you. Qwen 3.6 35B A3B is the proof. Agentic-tool-call-heavy training data appears to internalize what other models rely on the SDK to externalize. If you're picking a local model for agentic workloads, pick one whose training mix matches that shape.

Try it yourself

All 13 spec trees are browsable side-by-side at speclan.net/compare/ with URL-sharable deep links. Two pairings worth five minutes of your time:

?left=opus&right=qwen3.6-35b-a3b — Anthropic SDK + Opus vs. OpenAI SDK + Qwen. The gap visible in the UI is the SDK story.
?left=gemini3.1-pro&right=gemma4-8b — same vendor, two outcomes. The Gemini adapter-fit anomaly in isolation.

The full canonical post with the complete 13-row table, every caveat (including the excalidraw-in-training-data one), and the follow-up experiments planned lives at speclan.net/blog/2026-04-29-model-comparison.

If you've built a multi-provider agent and seen the SDK-layer drop I'm describing — especially if you've measured it — I'd love to see your numbers in the comments. Particularly curious about LangGraph users who added a Todo-List abstraction and measured a lift across providers.

Four failure modes you'll hit running a local LLM in a multi-step agentic loop

Thomas Landgraf — Sat, 25 Apr 2026 09:56:49 +0000

Most local-LLM benchmarks measure single-turn chat quality. Agentic workflows are a different beast: the model has to read state, call a tool, inspect the tool's result, decide whether it's done, and — if not — call another tool. A model that scores 95% on chat benchmarks can fail catastrophically on this loop in characteristic, reproducible ways.

I spent three weeks trying to get local LLMs to reliably run the agentic workflows in a VS Code extension I maintain. Full disclosure: I'm the creator of SPECLAN, an extension that manages product specs as Markdown files with YAML frontmatter — Git-native, one file per requirement, organized in a tree. A core feature, Infer Specs, walks a codebase and proposes a Goal → Feature → Requirement tree by calling MCP tools (create_feature, update_requirement, read_file, etc.) in a loop until it decides the tree is complete. This is a heavy agentic workflow: multi-turn, tool-heavy, and the model has to know when to stop.

The concept works without the tool. Markdown-plus-YAML-plus-Git as a spec format is older than SPECLAN and is the generalizable pattern this article assumes. The failure modes below will hit any agentic workflow that uses MCP tool calls plus structured output — SPECLAN is just where I observed them on seven different models across two local servers.

Here are the four failure modes, in the order you'll probably hit them.

1. The tool-call loop

Setup: an instruction-tuned model, reasonable size, MCP tool wired, seed a requirement and ask the agent to populate it.

What you'll see in the trace:

18:56:25  update_requirement  → R-0049
18:56:25  update_requirement  → R-0049
18:56:26  update_requirement  → R-0049
18:56:26  update_requirement  → R-0049
...  (12 more times, same arguments)

Same tool, same target, same arguments, repeated until the agent runs out of turns. On disk: garbage. The requirement's description got jammed into the YAML title: field, the body is still the untouched template placeholder, and the "Acceptance Criteria" section ends up in the wrong place.

This is not a bug in your code. Google's own Gemma 4 docs acknowledge it: Gemma can emit multiple tool calls per turn and has no built-in loop termination. The model sees the tool's success response but doesn't recognize "I am done." MoE and MatFormer-style elastic variants hit this hardest.

Mitigation (application layer): track tool-call fingerprints in the agent runner. If you see the same (tool_name, stable_arg_hash) three times in a row, interrupt the loop with a synthetic tool result that says "this tool has already produced the expected effect; proceed to the next step or terminate." This works because the loop is usually driven by the model not trusting the first success.

const callHistory: string[] = [];
for await (const step of agent.stream()) {
  if (step.type === 'tool_call') {
    const fp = `${step.name}:${stableStringify(step.args)}`;
    if (callHistory.slice(-3).every(x => x === fp)) {
      yield { role: 'tool', content: 'Already applied. Continue or finish.' };
      continue;
    }
    callHistory.push(fp);
  }
  yield step;
}

Not beautiful but survives every MoE variant I've thrown at it.

2. The hallucinated success

Second failure is worse, because it passes superficial validation.

Trace:

17:22:01  update_requirement  → R-8881   [tool call happened]
17:22:03  assistant: "I read the current state of R-8881 and updated its
                     description with a full specification: [long convincing
                     summary of changes]"

File on disk: unchanged.

The tool call fired. Your logs show it. The agent's final answer says the task succeeded. But the tool-call arguments were malformed in a way your MCP server silently ignored — or the model narrated its intent as a completion without ever carrying it out.

This is the "hallucinated success" mode. It's worse than the loop because:

Tests that assert "the agent called update_requirement at least once" pass.
Tests that assert the file changed fail — but only if you actually assert that.
Manual review sees a confident, detailed "I did it" message and believes it.

Mitigation (observability layer): every tool-call MCP server should return a diff summary as part of its response, not just {"success": true}. Something like { changed: true, hash_before: '...', hash_after: '...', fields_modified: ['description', 'acceptance_criteria'] }. Then your agent runner can verify that the model's final claim is consistent with the actual diff history. If the model says "I updated the description" but the diff summary shows changed: false, flag the session as inconsistent.

I also keep a diff_since_seed field that the agent can read at any time — so the model can literally look at what it has and hasn't changed, rather than relying on its own memory of the conversation.

3. Edit-as-replace

Different workflow: user runs /add Acceptance Criteria on an existing 5-section spec. Claude and GPT-5 default to echoing the full document with the addition merged in. A weaker local model — gemma-4-26b-a4b in my case — returned only the new section. Three sentences. The editor received three sentences and replaced the entire document.

Silent data loss. No error, no warning.

This isn't exclusive to local models; it happens to cloud models too if your prompt doesn't explicitly state the invariant. But strong models infer the invariant ("they want me to add a section, not replace the doc"). Weak models execute the surface instruction. Prompt-engineer it out:

DOCUMENT COMPLETENESS RULE (NON-NEGOTIABLE)

Your response MUST contain the ENTIRE document, not just the portion
you modified. The editor replaces the current document with your full
response. A partial response will delete everything you didn't emit.

If you cannot reproduce the full document (length, context budget,
uncertainty), return the ORIGINAL document unchanged. A no-op is
always correct; silent truncation is never correct.

The fact that this has to be said in capitals to a 26B model is the whole lesson of weak-model prompting: invariants that strong models treat as obvious must be written down.

4. Structured-output non-compliance

Clarification flow: ask the model to propose JSON matching a schema like { changes: [...], reasoning: "..." }. Downstream code does response.changes.map(...).

A local model with no guidance returned a raw array instead of the wrapped object. .map on undefined, crash.

Here's the subtle part: the schema text was never reaching the prompt. A helper signature had changed; the caller was still passing the schema positionally. TypeScript accepted it. The local model made up its own structure because we never showed it the schema in-band.

The lesson: don't rely on OpenAI's response_format or any SDK-level structured-output guarantee for local models. Most local servers implement the OpenAI-compatible API but not the structured-output constraints behind it. Put the schema text directly into the system prompt:

const systemPrompt = `
You return JSON matching this exact schema:

${JSON.stringify(schema, null, 2)}

Critical: the root MUST be an object with a "changes" array and a
"reasoning" string. NEVER return a bare array at the root.
`;

Belt-and-braces with the SDK's structured-output call. Local models will still go off-script occasionally, but the schema-in-prompt approach catches ~95% of the drift in my testing.

The benchmark

Seeded a requirement, asked the agent to populate it with description + acceptance criteria, measured: did it call the right tool? did it call the same tool more than 3× (loop)? did the file on disk actually change?

Server	Model	Type	Heavy workflow	Failure mode
Ollama	`gemma4:latest` (8B)	Dense	PASS	—
Ollama	`gemma4:31b`	Dense	PASS	slow but clean
Ollama	`gpt-oss:20b`	MoE	PASS tools / FAIL schema	output non-compliance
LM Studio	`google/gemma-4-26b-a4b`	MoE	FAIL	tool-call loop ×16
LM Studio	`openai/gpt-oss-20b`	MoE	FAIL	hallucinates completion
LM Studio	`google/gemma-4-e4b`	Elastic	FAIL	"no final response"
LM Studio	`openai/gpt-oss-120b`	MoE	FAIL	tool called, file unchanged

Three findings fall out:

Dense beats MoE / elastic for agentic tool calling. Every MoE and MatFormer variant failed the heavy workflow. Every dense variant passed. The jdhodges 2026 local-LLM tool-calling benchmark shows the same pattern — Qwen 3.5 4B (3.4 GB) at 97.5%, beating models 5× its size. Dense weights + good tool-call fine-tuning dominate.
Ollama beats LM Studio on the same weights. Same gpt-oss:20b, opposite results. The difference is the tool-call translation layer: Ollama maps the model's native tool-call format to the OpenAI-compatible wire faithfully; LM Studio's current implementation loses fidelity in ways that matter. This one surprised me — I'd assumed weights dominated the harness.
Size doesn't rescue you. gpt-oss-120b failed the same way as its 20B sibling. You can't out-parameter a chat-template / tool-call-format mismatch.

What to carry away

If you're building something agentic on top of local LLMs, the checklist is short:

Start dense. Qwen 3.5/3.6 or Gemma 4 dense, on Ollama, 7B minimum.
Add loop detection at the application layer. Don't trust the model to self-terminate.
Return meaningful tool results, not {"success": true}. Diff summaries let you detect hallucinated success.
Put your schema in the prompt, not just the SDK.
Bump context length to 16K+ on LM Studio and reload the model (the setting doesn't apply to already-loaded models — I wasted half a day on "Model did not produce a final response" before I realized).
Prompt against weak-model literal-mindedness. The DOCUMENT COMPLETENESS RULE pattern prevents whole classes of silent data loss.

Everything here is generalizable — none of it is specific to how SPECLAN uses MCP tools. If you've run into different failure modes on your local-LLM agentic workflows (especially with Qwen 3.6 dense, Llama 3.3, or GLM-4.7), drop them in the comments. I'm particularly interested in anyone who's gotten Qwen3.6-35B-A3B to self-terminate reliably on a 10+-step tool-calling loop — the MoE training for agentic coding is supposed to have fixed this, but I haven't verified it yet.

References

The living journey version of this post (with SPECLAN-specific details): speclan.net/blog/2026-04-25-we-gave-speclan-a-local-brain
jdhodges 2026 local-LLM tool-calling benchmark: jdhodges.com/blog/local-llms-on-tool-calling-2026-pt1-local-lm
SPECLAN on the VS Code Marketplace: marketplace.visualstudio.com/items?itemName=DigitalDividend.speclan-vscode-extension

Your PO Should Own the Spec, Not the Developer — Here's How Status Gates Fix the AI Handoff Problem

Thomas Landgraf — Sat, 04 Apr 2026 08:56:47 +0000

In most AI-assisted workflows, the developer writes the prompt and owns the outcome. The Product Owner writes a Jira ticket, the developer interprets it, feeds it to an AI agent, and 2,000 lines of code appear. Three sprints later, everyone's still arguing about what was actually specified.

The root cause isn't bad developers or bad POs. It's that nobody owns the spec as a living artifact. Jira tickets describe work to do — they die when the sprint ends. Confluence pages describe features that were planned — they go stale the moment someone changes the code. The actual intent lives in chat logs, Slack threads, and someone's memory.

What if your specs lived in Git?

The idea: each product requirement is a Markdown file with YAML frontmatter, stored in your Git repository right next to the code. One file per requirement, organized in a directory tree that mirrors your feature hierarchy. The frontmatter carries metadata — who owns it, when it was last updated, and crucially, its status.

---
id: R-4201
type: requirement
title: "Add to Cart button"
status: approved
owner: sarah
---

When a user clicks "Add to Cart" on a product page...

No external tools. No Confluence sync. No copy-pasting between systems. The spec is a file, reviewed in PRs, versioned in Git, and readable by both humans and AI agents.

I've been building a VS Code extension called SPECLAN that adds a WYSIWYG editor, spec tree view, and AI implementation tooling on top of this approach (full disclosure: I'm the creator). But the core concept — specs as files with status gates — works with any editor and zero tooling.

Here's the part that changes everything: the status field isn't just a label. It's an ownership protocol.

The status lifecycle

draft → review → approved → in-development → under-test → released

Each transition is a handoff between roles — not a Slack message, not a status change in a project management tool, but a field in the file itself, committed to Git, visible to the entire team.

Here's the key insight: the status isn't a label. It's an ownership signal.

draft means the PO is still thinking. Devs can see it but shouldn't implement it yet.
review means the PO wants the team's eyes on it.
approved means it's been reviewed and is ready to implement — the handoff moment.
in-development means the dev team (or AI agent) owns it now.
under-test means responsibility flows back to the PO — did the result match the intent?
released means everyone agrees it's done. The spec stays as a permanent record.

Walking through it: adding a shopping cart

Sarah the PO creates three Requirements in her spec tree: "Add to Cart" button, cart page with quantity editing, and cart persistence across sessions. She writes each one describing what the feature does, not how to build it. She adds Acceptance Criteria — toast notification within 500ms, cart icon updates without reload, duplicate items increment quantity.

Status: draft. The dev team can see the specs in their tree view, but the status fence prevents premature implementation. Sarah is still thinking.

She moves to review. Marco (senior dev) flags a concern — localStorage has a 5MB limit that could bite them with large carts. Sarah updates the spec. Lisa (QA) adds a missing edge case: what happens at maximum stock quantity? Sarah adds it. All of this happens in Git commits, not Jira comments.

Status moves to approved. Now the dev team takes over. The AI agent reads the approved specs directly — not from a copy-pasted prompt, but through structured tools that give it access to the full requirement text, acceptance criteria, and the spec hierarchy. It implements what was specified, not what it guesses.

After implementation, the status moves to under-test. This is the handback moment — responsibility flows back from the dev team to the PO. Sarah tests each acceptance criterion against the running system. The person who defined the requirement is the person who accepts the result.

Status: released. The spec stays in the repo as the permanent record of what the product does. Six months later, when someone asks "why does the cart sync to the server?", the answer is in the spec file — including Marco's review comment about the 5MB limit.

Why this matters more than you think

It's not just about teams

If you're a solo developer, you're already playing all these roles — you just switch between them unconsciously. You're the PO when you decide what to build. The developer when you implement. The QA when you test.

The problem is these mental switches happen mid-sentence. You're halfway through writing a spec when you think "I know how to build this" and jump straight to coding. The spec never gets finished. Two weeks later, you can't remember what you intended.

Status gates give solo developers forced phase separation:

Specificator hat — write specs, think through edge cases. No coding yet.
Implementor hat — code from the approved spec, not from memory.
Verifier hat — check your own acceptance criteria. "Did I build what I intended?"

Specs outlive sprints

This is the real differentiator from Jira. Tickets describe work. Specs describe the product.

Aspect	Jira Ticket	Spec-as-file
Describes	Work to do	Product behavior
Lifespan	One sprint	Product lifetime
Lives in	External tool	Git, next to the code
After completion	Closed, archived	Still authoritative
AI-readable	Copy-paste into prompt	Structured tools read directly

You can use both — Jira for sprint planning, spec files for the actual requirements. But the spec should outlive the sprint.

AI agents need governance, not freedom

The AI agent reads specs through structured tools, not ad-hoc prompts. It only implements approved specs. It updates the status as work progresses. The human governance layer stays intact even when the coding is automated.

This is the part most AI coding workflows get wrong. They give the AI maximum freedom and wonder why month 3 is a mess. The fix isn't more prompting. It's giving the AI a spec to follow and a lifecycle to respect.

Try it yourself

The Markdown + YAML frontmatter approach works without any tooling — it's just files in Git. But if you want the tree view, WYSIWYG editor, and AI implementation assistant on top of it:

SPECLAN on the VS Code Marketplace (free)
speclan.net — docs and methodology guide

What does your spec handoff look like? I'm curious how other teams handle the PO → dev → PO loop — especially with AI agents in the mix.

Stop Vibe Coding: What Happens When You Give Your AI Agent a Real Spec

Thomas Landgraf — Thu, 05 Mar 2026 17:51:01 +0000

Your AI coding agent can write a feature in minutes. But did it write the right feature?

I've been using Claude Code, Cursor, and Copilot for the past year, and the pattern is always the same: you describe what you want in natural language, the agent generates code, and then you spend the next hour fixing the parts it got wrong. Not because the AI is bad — but because your intent was never structured enough for it to get right.

That loop — prompt, wrong output, re-prompt, repeat — is what people call vibe coding. It works for prototypes. It doesn't work for anything you need to maintain.

The missing layer

The gap isn't in the AI's coding ability. It's between your head and the agent's context window. You know what the feature should do, how it fits into the product, what the edge cases are, and which acceptance criteria matter. The agent knows... whatever you typed into the prompt.

Spec-driven development closes that gap by structuring your intent before the agent starts writing code. Not a 40-page requirements document. Just enough structure that the AI knows:

What the feature is and why it exists (business goal)
Where it fits in the product hierarchy (parent feature)
What "done" looks like (acceptance criteria)
What status it's in (can it be implemented yet?)

What this looks like in practice

I've been building a tool called SPECLAN that takes this approach — it's a free VS Code extension that manages specifications as a tree of Markdown files with YAML frontmatter, living in your Git repository.

I recorded a 7-minute walkthrough that shows the full workflow from importing a raw product idea to orchestrating AI agents against structured work packages:

Here's what the video covers:

0:00 — The problem. Why your AI agent keeps getting it wrong, and what's actually missing.

0:25 — Installation. One click from the VS Code Marketplace.

0:40 — Importing an idea. You paste a high-level product description. SPECLAN's AI decomposes it into a hierarchy: goals, features, requirements — each as a separate Markdown file.

1:10 — The specification tree. A navigable tree view in VS Code's sidebar. Goals break down into features, features into sub-features, sub-features into requirements. The hierarchy is your product structure.

1:35 — WYSIWYG editing. A rich text editor inside a VS Code webview, so you can write specs without thinking about Markdown syntax. What you see round-trips cleanly to Markdown + YAML frontmatter.

1:55 — AI chat assistant. Ask questions about your spec, get suggestions, refine requirements — all within the editor panel.

2:15 — Copy AI Prime Context. This is where it gets practical. One click copies a structured prompt containing the spec, its parent feature, the business goal, acceptance criteria, and surrounding context. Paste that into Claude Code or any agent, and it actually knows what to build.

2:40 — Status lifecycle. Specs move through draft -> review -> approved -> in-development -> under-test -> released. Only approved specs can be implemented. This prevents the "building against a moving target" problem.

2:55 — SWARM implementation. Break approved specs into work packages and let multiple AI agents work on them in parallel — with the specification as the shared source of truth.

3:20 — Change Requests. When an approved spec needs modification, you don't edit it directly. You create a Change Request — a separate file that tracks what changed and why. No more spec drift.

3:45 — Git integration. Every spec is a Markdown file in Git. You get diffs, branches, and merge workflows for free. Your specs live next to your code, versioned the same way.

Why Markdown files in Git?

I chose this approach over a database or a cloud service for one reason: portability.

Your specs are plain text files. They work with any editor, any AI agent, any CI pipeline. If you stop using SPECLAN tomorrow, your specifications are still there — readable, diffable, greppable Markdown. No export step, no migration, no vendor lock-in.

The YAML frontmatter carries the structured metadata (ID, status, parent reference, owner), while the Markdown body carries the human-readable content. Git gives you the audit trail. The VS Code extension gives you the GUI.

The ecosystem is growing

SPECLAN isn't the only tool exploring this space. The BMAD Method uses specialized AI agent personas for structured development. OpenSpec adds a spec layer for existing codebases. GitHub's Spec Kit provides CLI templates for spec-driven workflows. Kiro from AWS takes a steering-file approach.

Each tackles the same insight from a different angle: specifications are the missing layer between human intent and AI execution. The methodology matters more than any single tool.

Try it

SPECLAN is free and open source. Install it from the VS Code Marketplace, point it at any project, and see if structured specs change how your AI agent performs.

The docs are at speclan.net. The source is on GitHub.

I'm the creator — full disclosure. I built this because I was tired of re-prompting Claude Code with the same context every session. If you have questions or feedback, I'm in the comments.

What's your experience with spec-driven development? Are you structuring your prompts before sending them to AI agents, or do you find the overhead isn't worth it? Curious to hear what's working for others.

How I Use .claude/rules/ to Give Claude Code Domain Knowledge About My Project's File Structure

Thomas Landgraf — Wed, 04 Mar 2026 22:17:04 +0000

You know that moment when you ask Claude Code to edit a file and it treats your carefully structured project directory like a random pile of Markdown? It adds implementation details to a specification file. It puts a requirement under the wrong feature. It invents an ID format you never asked for.

The problem isn't that the AI is dumb. It's that it has no idea what your files mean.

I've been building a VS Code extension called SPECLAN that manages layered specifications as Markdown files with YAML frontmatter. The speclan/ directory in any project has a well-defined structure — entity types, ID schemes, status lifecycles, nesting rules. And Claude Code kept stepping on all of them until I discovered the paths frontmatter in .claude/rules/.

The problem: Claude doesn't know your conventions

My project has a directory like this:

speclan/
├── goals/           G-###-slug.md
├── features/        F-####-slug/F-####-slug.md
│   ├── requirements/  R-####-slug/R-####-slug.md
│   │   └── change-requests/  CR-####-slug.md
│   └── change-requests/  CR-####-slug.md
└── templates/

Every file is Markdown with YAML frontmatter. IDs are random, not sequential. Features nest recursively. Requirements always belong to exactly one feature. Status goes draft → review → approved → in-development → under-test → released → deprecated. Only approved specs can be implemented. Locked specs need a ChangeRequest to modify.

None of this is obvious from the files alone. Without guidance, Claude will:

Create sequential IDs (F-0001, F-0002) instead of random ones
Put requirements at the wrong nesting level
Mix implementation concerns into specification files
Skip required frontmatter fields
Ignore the status lifecycle entirely

The solution: path-scoped rules

Claude Code loads .claude/rules/*.md files as persistent context. That alone is useful for project-wide conventions. But the feature that makes it powerful for structured directories is the paths frontmatter:

---
paths:
  - "speclan/**/*.md"
---

This tells Claude Code: "Only inject these rules when I'm working with files that match this glob pattern." The rules file is invisible when you're editing TypeScript, writing tests, or doing anything else. But the moment you touch a file under speclan/, it kicks in.

Rules without a paths field load unconditionally — they're the equivalent of putting instructions in CLAUDE.md. Rules with paths only activate when Claude reads files matching the pattern. That distinction is what makes them useful for domain-specific knowledge.

What goes in the rules file

Here's the actual rules file I use (condensed — the real one is ~96 lines):

---
paths:
  - "speclan/**/*.md"
---

# SPECLAN Specification Rules

## Entity Hierarchy

Goal (G-###) → Feature (F-####) → Requirement (R-####)

ChangeRequest (CR-####) modifies locked entities.

## Directory Structure

speclan/
├── goals/           G-###-slug.md
├── features/        F-####-slug/F-####-slug.md (self-named dirs, recursive)
│   ├── requirements/  R-####-slug/R-####-slug.md
│   │   └── change-requests/  CR-####-slug.md
│   └── change-requests/  CR-####-slug.md
└── templates/<entityType>/  UUID-slug.md

## Frontmatter (YAML)

All specs are Markdown with YAML frontmatter. Required fields:
id, type, title, status, owner, created, updated

## ID Rules (NON-NEGOTIABLE)

- Goal: G-### (3 digits)
- Feature: F-#### (4 digits)
- Requirement: R-#### (4 digits)
- ChangeRequest: CR-#### (4 digits)
- IDs are random, not sequential
- Check collisions before creation

## Status Lifecycle

draft → review → approved → in-development → under-test → released → deprecated

Only approved specs can be implemented.
Locked statuses (approved+) require a ChangeRequest for modifications.

## Invariants

1. Requirements belong to exactly one Feature
2. Features may have sub-features AND requirements
3. ChangeRequests reference exactly one parent

IMPORTANT: files under speclan/ are specifications that tell
WHAT from user perspective, not HOW from developer perspective

That last line is the most important one. It's the semantic boundary that prevents Claude from mixing concerns. Without it, you ask for a new requirement and get implementation pseudocode.

Why this works better than CLAUDE.md alone

You could put all of this in your project's CLAUDE.md. I actually did that first. The problem is context pollution — when Claude is editing a React component, it doesn't need to know about SPECLAN's ID scheme. And when it's editing a spec file, it doesn't need your TypeScript lint rules.

Path-scoped rules solve this cleanly:

Focused context — rules only activate when relevant
No noise — Claude's context window isn't cluttered with irrelevant conventions
Composable — you can have multiple rules files for different parts of your project

The rules file acts like a domain expert sitting next to Claude, whispering "that's a specification file, here's how they work" exactly when it matters.

Designing a good rules file

After iterating on this for a few months, here's what I've found works:

Be declarative, not procedural. Don't write step-by-step instructions. Describe the structure, the constraints, the invariants. Claude is good at applying constraints if you state them clearly.

Mark hard boundaries. I use (NON-NEGOTIABLE) for rules that must never be violated — like the ID format. Claude respects this surprisingly well.

Include the "why" for non-obvious rules. "IDs are random, not sequential" needs the implicit why: collision avoidance across branches and contributors. "Files tell WHAT not HOW" needs no explanation but does need emphasis.

Keep it under 100 lines. This is context that gets injected into every relevant interaction. If your rules file is 500 lines, you're eating into the context window Claude needs for actual work. Compress ruthlessly. Tables over prose. ASCII trees over paragraphs. The official docs recommend targeting under 200 lines for any CLAUDE.md file — for path-scoped rules, I'd argue even tighter is better.

Quote your glob patterns. This is a gotcha that'll bite you: YAML treats * and { as reserved indicators. Always quote your patterns — "**/*.ts" not **/*.ts. Unquoted patterns can silently fail.

Use brace expansion for related types. Instead of listing patterns separately, combine them: "src/**/*.{ts,tsx}" matches both TypeScript and TSX files in one pattern. Same works for directories: "{src,lib}/**/*.ts".

Test it by asking Claude to create something. After writing the rules file, ask Claude to "create a new requirement for feature F-1234." If it gets the file path, ID format, frontmatter, and directory nesting right on the first try — your rules file works.

Beyond project directories: glob patterns for other domains

The speclan/**/*.md pattern is one application. The same mechanism works for any file pattern where Claude needs domain context. Here's what I use across my NX monorepo:

Test files (`/.spec.ts`)* — inject your testing conventions: which frameworks, which patterns, how to mock, what not to test. I have rules for Jest vs Mocha conventions since my project uses both (libraries vs VS Code extension).

Webview files (`/webview/`) — inject your browser-context constraints: no Node APIs, specific CSS framework rules, message-passing protocols between webview and extension host.

Infrastructure files (`cdk//.ts`)* — inject your CDK conventions, naming standards, tagging policies, security guardrails. Claude loves to create overly permissive IAM roles unless you tell it not to.

Security-sensitive code (`src/auth//, src/payments//`)** — guardrails for sensitive areas: never log tokens, always parameterize queries, validate all inputs at function boundaries. These rules are especially valuable because the cost of Claude getting them wrong is high.

Database migrations (`prisma/migrations//`)* — safety rules: always include rollback instructions, never delete columns in the same migration that removes the code using them, add columns as nullable first.

The pattern is always the same: you have files where the semantics aren't obvious from the syntax, and you need Claude to understand the domain rules before touching them.

Tips from the trenches

A few more things I've learned from running 12+ rules files across a monorepo:

One concern per file. A testing.md shouldn't also contain API design guidelines. Separation of concerns applies to instructions just as much as code. Descriptive filenames like api-validation.md beat rules1.md.

Subdirectories work. All .md files are discovered recursively, so you can organize rules into frontend/, backend/, infra/ subdirectories. No configuration needed.

Symlinks for shared rules. If you maintain coding standards across multiple projects, symlink a shared rules directory: ln -s ~/company-standards .claude/rules/shared. Circular symlinks are handled gracefully.

User-level rules for personal preferences. Put rules in ~/.claude/rules/ for things that apply to everything you work on — your preferred commit message format, your testing style, your debugging workflow. These load before project rules, so project rules can override them.

Don't duplicate between CLAUDE.md and rules. If a convention is path-specific, put it in .claude/rules/ with a paths field. If it's truly universal (build commands, project architecture), keep it in CLAUDE.md. Conflicting instructions across files get resolved arbitrarily — not what you want for your ID scheme.

Check what's loaded with /memory. When something isn't being respected, run /memory to see which rules files Claude actually has in context. If your file isn't listed, the glob pattern isn't matching.

The compound effect

One rules file doesn't feel like much. But once you have 3-4 of them covering different parts of your project, Claude starts behaving like a developer who actually read the architecture docs. It stops guessing and starts following your conventions. The number of "no, that's not how we do it" corrections drops dramatically.

I think of .claude/rules/ files as executable documentation. They serve double duty: they document your conventions for human readers and they enforce those conventions when AI touches the code. That's a pretty good return on 50-100 lines of Markdown.

I'm the creator of SPECLAN, a VS Code extension for managing layered specifications as Markdown files in Git. The path-scoped rules described here are how I keep Claude Code aligned with SPECLAN's file structure conventions — but the technique works for any project with well-defined directory semantics.

I built a spec management extension with a WYSIWYG Markdown editor in a VS Code webview — lessons learned

Thomas Landgraf — Tue, 03 Mar 2026 20:18:28 +0000

I've been building a VS Code extension for spec management over the past 3 months (full disclosure: I'm the creator, it's called SPECLAN — free side project). The idea is that specifications need the same structure we give source code: hierarchy, types, lifecycle tracking. So the extension organizes specs as a tree of Markdown files with YAML frontmatter — goals break down into features, features into sub-features, sub-features into requirements. Each file has a status lifecycle (draft → review → approved → in-development → released) so you always know what's specced, what's being built, and what needs to change.

The interesting VS Code challenge: making this usable for non-technical people. Product managers and business analysts define what to build, but they won't write raw Markdown with YAML frontmatter. So I needed a WYSIWYG editor inside a webview that round-trips cleanly to Markdown — same file in Git, two editing experiences.

That editor ate about 40% of total development effort. Here's what I learned.

The stack:

Quill 2.x in a VS Code webview (rich text editing)
remark + remark-gfm for Markdown → HTML on load
turndown + turndown-plugin-gfm for HTML → Markdown on save
gray-matter for YAML frontmatter — strips on load, reattaches on save
quill-table-up for GFM tables (Quill has no native table support)

What actually hurt:

Round-trip fidelity. The pipeline is Markdown → HTML → Quill Delta → HTML → Markdown. Every step is lossy. Links, emphasis, nested lists — they all drift across conversions. I spent weeks writing custom turndown rules to keep Markdown output stable. If you're building something similar: start with the save pipeline, not the editor. The round-trip is the constraint that shapes everything.
Frontmatter is invisible but critical. Each spec file has 10+ YAML fields — status, entity ID, parent references, timestamps. The editor only sees the Markdown body, but the file is meaningless without its frontmatter. gray-matter handles parsing, but you need to be careful that editor changes don't conflict with frontmatter values (e.g., someone editing a title in the body that's also in the YAML).
Tables. Quill doesn't do tables. quill-table-up adds them, but serializing table HTML through turndown into GFM pipe tables has edge cases everywhere — empty cells, inline formatting in cells, nested content.
Webview communication. Everything between the editor (iframe) and the extension host is a postMessage call — load, save, dirty state, undo, external file change detection. I ended up building a structured message protocol with typed handlers on both sides. console.log in the webview doesn't show up anywhere useful, so I added a logging bridge that routes webview logs to the extension's output channel.
Custom editor API. Using CustomTextEditorProvider means the document model is VS Code's TextDocument but the visual state is Quill's Delta. Keeping these in sync — especially during concurrent edits or Git operations that change the file underneath — required careful event sequencing.

What worked well:

The file-system-as-data-model approach. Directories ARE the spec hierarchy. speclan/features/F-1234-auth/requirements/R-5678-login/R-5678-login.md — any tool (or AI agent) can understand the structure by reading the file system. No database, no server.
Snapshot testing the conversion pipeline. Take a Markdown file, push it through the full round-trip, diff the output. Catches regressions fast.
Tree views for navigation. VS Code's TreeDataProvider is excellent. The spec tree (goals → features → requirements) renders as a native sidebar with status icons, drag-and-drop reordering, and context menus. Much less effort than the WYSIWYG editor.

Happy to answer questions about webviews, the conversion pipeline, or the spec structure approach.

Marketplace | speclan.net | GitHub