<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Thomas Landgraf</title>
    <description>The latest articles on DEV Community by Thomas Landgraf (@thlandgraf).</description>
    <link>https://dev.to/thlandgraf</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3804567%2Fe53f907f-a3b8-4afe-8ecf-5c7884dc867e.png</url>
      <title>DEV Community: Thomas Landgraf</title>
      <link>https://dev.to/thlandgraf</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/thlandgraf"/>
    <language>en</language>
    <item>
      <title>The SDK You Pick Matters More Than the Model — A 13-LLM Benchmark on the Same Agentic Task</title>
      <dc:creator>Thomas Landgraf</dc:creator>
      <pubDate>Fri, 01 May 2026 09:16:36 +0000</pubDate>
      <link>https://dev.to/thlandgraf/the-sdk-you-pick-matters-more-than-the-model-a-13-llm-benchmark-on-the-same-agentic-task-1im9</link>
      <guid>https://dev.to/thlandgraf/the-sdk-you-pick-matters-more-than-the-model-a-13-llm-benchmark-on-the-same-agentic-task-1im9</guid>
      <description>&lt;p&gt;If you have ever built an agent that walks a codebase, calls tools, and writes structured output, you have hit the same wall I kept hitting: &lt;strong&gt;the same model produces wildly different results on the same task depending on what harness you wrap it in.&lt;/strong&gt; Swap Claude for GPT behind a single &lt;code&gt;OPENAI_BASE_URL&lt;/code&gt; and you lose half your output quality. Everyone blames the model. The model is rarely the variable.&lt;/p&gt;

&lt;p&gt;I ran an experiment to put a number on it. Thirteen LLMs — Claude Opus 4.7, Sonnet 4.6, Haiku 4.5, GPT 5.4, GPT 5.4 Mini, two Gemini 3.1 previews, and six open-weights locals (Qwen 3.6 35B A3B, Gemma 4 at three sizes, GPT-OSS 20B, Nemotron 3 Nano) — on the same real agentic task. Same codebase (&lt;a href="https://github.com/excalidraw/excalidraw" rel="noopener noreferrer"&gt;excalidraw&lt;/a&gt;), same MCP tools, same system prompt. Only the model changes. The output is a specification tree: goal → feature → requirement hierarchies of Markdown files.&lt;/p&gt;

&lt;h2&gt;
  
  
  What if the SDK is doing more of the work than anyone admits?
&lt;/h2&gt;

&lt;p&gt;Every provider ships an SDK. Most teams assume the SDK is a thin wire-protocol wrapper. It usually isn't. Here's what the Anthropic SDK ships with by default, alongside the MCP tools I expose:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;persistent Todo-List&lt;/strong&gt; the model reads from and writes to across turns.&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;planner&lt;/strong&gt; for multi-step reasoning that doesn't burn the main conversation budget.&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;scratchpad&lt;/strong&gt; for cross-turn notes that never reach the final output.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here's what the OpenAI SDK ships with by default when you give it MCP tools: the MCP tools. Nothing else.&lt;/p&gt;

&lt;p&gt;I'm the creator of &lt;a href="https://marketplace.visualstudio.com/items?itemName=DigitalDividend.speclan-vscode-extension" rel="noopener noreferrer"&gt;SPECLAN&lt;/a&gt;, a VS Code extension for spec-driven development, and the pipeline in this benchmark is one of SPECLAN's agents (full disclosure). But the lesson generalizes to any multi-provider agent harness — and the numbers are genuinely jarring.&lt;/p&gt;

&lt;h2&gt;
  
  
  The band gap
&lt;/h2&gt;

&lt;p&gt;Requirements produced on the same codebase, same prompt:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;SDK&lt;/th&gt;
&lt;th&gt;Requirements&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Claude Opus 4.7 (1M)&lt;/td&gt;
&lt;td&gt;Anthropic&lt;/td&gt;
&lt;td&gt;197&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Sonnet 4.6&lt;/td&gt;
&lt;td&gt;Anthropic&lt;/td&gt;
&lt;td&gt;196&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Haiku 4.5&lt;/td&gt;
&lt;td&gt;Anthropic&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;203&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT 5.4&lt;/td&gt;
&lt;td&gt;OpenAI&lt;/td&gt;
&lt;td&gt;43&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT 5.4 Mini&lt;/td&gt;
&lt;td&gt;OpenAI&lt;/td&gt;
&lt;td&gt;60&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemini 3.1 Pro preview&lt;/td&gt;
&lt;td&gt;OpenAI-compat&lt;/td&gt;
&lt;td&gt;17&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemini 3.1 Flash preview&lt;/td&gt;
&lt;td&gt;OpenAI-compat&lt;/td&gt;
&lt;td&gt;13&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Qwen 3.6 35B A3B&lt;/strong&gt; (local)&lt;/td&gt;
&lt;td&gt;OpenAI-compat&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;174&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemma 4 31B dense (local)&lt;/td&gt;
&lt;td&gt;OpenAI-compat&lt;/td&gt;
&lt;td&gt;60&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemma 4 8B (local)&lt;/td&gt;
&lt;td&gt;OpenAI-compat&lt;/td&gt;
&lt;td&gt;21&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-OSS 20B (local)&lt;/td&gt;
&lt;td&gt;OpenAI-compat&lt;/td&gt;
&lt;td&gt;17&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Nemotron 3 Nano (local)&lt;/td&gt;
&lt;td&gt;OpenAI-compat&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Two things jump out immediately.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The three Claude models cluster at 196–203 regardless of size.&lt;/strong&gt; Opus is several times the size of Haiku. If model size were driving volume, you would see variance. You don't. That flatness is the scaffolding floor — the shape of what the Anthropic agent loop produces on this benchmark, not the ceiling of what Opus can do.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Every OpenAI-SDK model except one sits at 12–60.&lt;/strong&gt; That is roughly 3× to 17× below the Anthropic band. Different vendors (OpenAI, Google, Meta-derived, Chinese open-weights), different sizes (8B to 120B+), same roughly-converged output volume. That convergence is what you would expect if the binding constraint were the harness, not the model.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why spec authoring exposes scaffolding so brutally
&lt;/h2&gt;

&lt;p&gt;Here's the technical meat. Spec authoring is &lt;strong&gt;fundamentally a list-management problem&lt;/strong&gt;: enumerate the features the code implements, write a requirement, cross it off, move to the next. A human technical writer does this with a notepad.&lt;/p&gt;

&lt;p&gt;Without a Todo-List, an LLM has to re-derive from conversation history every turn: &lt;em&gt;did I already write requirements for Shape Drawing Tools? Let me check the last 12 turns… yes I did. Element Organization? Let me check… no, that's next.&lt;/em&gt; Every single turn, this bookkeeping consumes context window and decision budget that could have gone into writing the actual requirement.&lt;/p&gt;

&lt;p&gt;With a persistent Todo-List, the model does one tiny tool call (&lt;code&gt;todo_list_read&lt;/code&gt;), sees &lt;code&gt;next undone: Element Organization&lt;/code&gt;, and gets to work. It's doing a fundamentally easier version of the task. That's why you get 197 requirements from Opus and 43 from GPT 5.4 on the same brief — the first model was given a list-management abstraction, the second had to reinvent it in every turn.&lt;/p&gt;

&lt;p&gt;If you've ever wondered why your Claude-via-Anthropic-SDK agent seems to "remember things twelve files ago" and your GPT-via-OpenAI-SDK agent feels like it restarts every turn — this is why. Anthropic's SDK implements memory as a tool. OpenAI's SDK expects you to bring your own.&lt;/p&gt;
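
&lt;p&gt;To make the list-management point concrete, here is a minimal sketch of what a persistent Todo-List tool could look like. It is not the Anthropic SDK's implementation; the &lt;code&gt;TodoList&lt;/code&gt; class and its method names are assumptions for illustration, with &lt;code&gt;read()&lt;/code&gt; standing in for the &lt;code&gt;todo_list_read&lt;/code&gt; call mentioned above.&lt;/p&gt;

```typescript
// Hypothetical in-memory todo list that a harness could expose as tools.
// Class and method names are illustrative, not taken from any SDK.
type TodoItem = { title: string; done: boolean };

class TodoList {
  private items: TodoItem[];

  constructor(titles: string[]) {
    this.items = titles.map((title) => ({ title, done: false }));
  }

  // What a `todo_list_read` tool call could return to the model.
  read(): { nextUndone: string | null; remaining: number } {
    const open = this.items.filter((item) => !item.done);
    const first = this.items.find((item) => !item.done);
    return {
      nextUndone: first === undefined ? null : first.title,
      remaining: open.length,
    };
  }

  // What a companion "mark done" tool could do once a requirement is written.
  // Returns false for unknown titles and for items that are already done.
  complete(title: string): boolean {
    const item = this.items.find((candidate) => candidate.title === title);
    if (item === undefined || item.done) return false;
    item.done = true;
    return true;
  }
}

const todo = new TodoList(["Shape Drawing Tools", "Element Organization"]);
todo.complete("Shape Drawing Tools");
console.log(todo.read()); // nextUndone is "Element Organization", remaining is 1
```

&lt;p&gt;The design choice that matters is that the bookkeeping state lives outside the conversation, so one small tool result replaces a scan of the turn history.&lt;/p&gt;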

&lt;h2&gt;
  
  
  The one exception — and why it matters
&lt;/h2&gt;

&lt;p&gt;Look at the table again: &lt;strong&gt;Qwen 3.6 35B A3B produced 174 requirements on the no-scaffolding OpenAI-SDK path.&lt;/strong&gt; It ran locally in LM Studio on a Mac M4 Max with a 50k context, and landed within about 12% of Opus's 197. It is the one outlier in an otherwise tight 12–60 band.&lt;/p&gt;

&lt;p&gt;Our best guess for why: Qwen's training mix is heavy on agentic tool-call trajectories, and that data seems to have internalized some of the bookkeeping the Anthropic SDK externalizes as tools. The model brought its own list-management to the task.&lt;/p&gt;

&lt;p&gt;This matters because it proves the gap is &lt;strong&gt;closable without the SDK help&lt;/strong&gt; — just not by most models. You can think of the benchmark as a 2×2:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Scaffolding in harness&lt;/th&gt;
&lt;th&gt;No scaffolding in harness&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Scaffolding-trained model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Opus / Sonnet / Haiku (196–203)&lt;/td&gt;
&lt;td&gt;Qwen 3.6 35B A3B (174)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Not trained for agentic bookkeeping&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;— (we don't have data)&lt;/td&gt;
&lt;td&gt;GPT 5.4, Gemini, Gemma, GPT-OSS, Nemotron (12–60)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Three of the four quadrants are populated. The empty cell in the table is the bottom-left one: a model not trained for agentic bookkeeping on a scaffolded harness. The obvious follow-up, though, adds a second data point to the top-right cell ("scaffolding-trained model on a scaffolding-free harness"): run Opus on an OpenAI-SDK harness with the Anthropic tools explicitly stripped, so Opus operates on the same MCP-only surface as the others. The delta between that number and Opus's 197 is the SDK's contribution. Whatever's left is the pure-Opus delta. That's the experiment we're shipping next.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Gemini anomaly — same vendor, two outcomes
&lt;/h2&gt;

&lt;p&gt;One finding from the benchmark lands harder than any single row: &lt;strong&gt;Gemma (local open-weights) succeeded at the task; Gemini (frontier cloud preview) failed.&lt;/strong&gt; Same company. Same pipeline on our end. Same adapter layer (OpenAI-compatibility).&lt;/p&gt;

&lt;p&gt;Gemma 4 8B wrote a coherent on-domain tree — every one of 21 requirements landed on a real excalidraw feature. Gemini 3.1 Pro preview wrote &lt;code&gt;Account and Billing Management&lt;/code&gt;, &lt;code&gt;Personalized Analytics Dashboard&lt;/code&gt;, full acceptance criteria for &lt;code&gt;Subscription tier management (upgrade, downgrade, cancel)&lt;/code&gt;. Excalidraw has no accounts and no billing.&lt;/p&gt;

&lt;p&gt;Our working hypothesis: our OpenAI-compatibility shim round-trips tool-call payloads in a format Gemma tolerates but Gemini treats differently. Gemini falls back to training priors when the adapter produces turns it cannot fluently continue, and "enterprise SaaS reference architecture" is over-represented in those priors. Before anyone dismisses Gemini previews as weak at agentic work: a rerun through Google's native &lt;code&gt;GenerateContent&lt;/code&gt; API with planning primitives enabled is on the follow-up list.&lt;/p&gt;

&lt;p&gt;The generalizable lesson for anyone building multi-provider agents: &lt;strong&gt;every harness silently privileges some providers over others.&lt;/strong&gt; Our harness privileges Anthropic (full SDK integration), accidentally privileges Gemma-like models (MCP-only works for them), and does not fit Gemini 3.1. Your harness will have the same asymmetry in a different shape. The answer to "which model is best for my harness?" is not "whichever has the most parameters" — it's "whichever your harness actually fits."&lt;/p&gt;

&lt;h2&gt;
  
  
  What to take from this if you are building agents
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Audit what your SDK ships by default.&lt;/strong&gt; If you picked the Anthropic SDK and haven't looked inside, a non-trivial share of your agent's competence is coming from the built-in Todo-List / planner / scratchpad. Switch providers without replacing that layer and you will measure a model-quality drop that is actually a scaffolding drop.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Invest in the scaffolding layer before investing in a bigger model.&lt;/strong&gt; In our benchmark, scaffolding was worth roughly an order of magnitude of output volume. A bigger model on a thin harness will not close that gap. A smaller model on a thick harness often will.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-provider support is a harness problem, not a config-flag problem.&lt;/strong&gt; If you're offering users a choice of providers, you're offering them a choice of &lt;em&gt;how well your harness fits their provider&lt;/em&gt;. That's architectural work, not one-line-of-YAML work.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The training mix sometimes bridges the gap for you.&lt;/strong&gt; Qwen 3.6 35B A3B is the proof. Agentic-tool-call-heavy training data appears to internalize what other models rely on the SDK to externalize. If you're picking a local model for agentic workloads, pick one whose training mix matches that shape.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Try it yourself
&lt;/h2&gt;

&lt;p&gt;All 13 spec trees are browsable side-by-side at &lt;a href="https://speclan.net/compare/" rel="noopener noreferrer"&gt;speclan.net/compare/&lt;/a&gt; with URL-sharable deep links. Two pairings worth five minutes of your time:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://speclan.net/compare/?left=opus&amp;amp;right=qwen3.6-35b-a3b" rel="noopener noreferrer"&gt;&lt;code&gt;?left=opus&amp;amp;right=qwen3.6-35b-a3b&lt;/code&gt;&lt;/a&gt; — Anthropic SDK + Opus vs. OpenAI SDK + Qwen. The gap visible in the UI is the SDK story.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://speclan.net/compare/?left=gemini3.1-pro&amp;amp;right=gemma4-8b" rel="noopener noreferrer"&gt;&lt;code&gt;?left=gemini3.1-pro&amp;amp;right=gemma4-8b&lt;/code&gt;&lt;/a&gt; — same vendor, two outcomes. The Gemini adapter-fit anomaly in isolation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The full canonical post with the complete 13-row table, every caveat (including the excalidraw-in-training-data one), and the follow-up experiments planned lives at &lt;a href="https://speclan.net/blog/2026-04-29-model-comparison/" rel="noopener noreferrer"&gt;speclan.net/blog/2026-04-29-model-comparison&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;If you've built a multi-provider agent and seen the SDK-layer drop I'm describing — especially if you've measured it — I'd love to see your numbers in the comments. Particularly curious about LangGraph users who added a Todo-List abstraction and measured a lift across providers.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agentskills</category>
      <category>claude</category>
      <category>chatgpt</category>
    </item>
    <item>
      <title>Four failure modes you'll hit running a local LLM in a multi-step agentic loop</title>
      <dc:creator>Thomas Landgraf</dc:creator>
      <pubDate>Sat, 25 Apr 2026 09:56:49 +0000</pubDate>
      <link>https://dev.to/thlandgraf/four-failure-modes-youll-hit-running-a-local-llm-in-a-multi-step-agentic-loop-3kd9</link>
      <guid>https://dev.to/thlandgraf/four-failure-modes-youll-hit-running-a-local-llm-in-a-multi-step-agentic-loop-3kd9</guid>
      <description>&lt;p&gt;Most local-LLM benchmarks measure &lt;strong&gt;single-turn chat quality&lt;/strong&gt;. Agentic workflows are a different beast: the model has to read state, call a tool, inspect the tool's result, decide whether it's done, and — if not — call another tool. A model that scores 95% on chat benchmarks can fail catastrophically on this loop in characteristic, reproducible ways.&lt;/p&gt;

&lt;p&gt;I spent three weeks trying to get local LLMs to reliably run the agentic workflows in a VS Code extension I maintain. Full disclosure: I'm the creator of SPECLAN, an extension that manages product specs as Markdown files with YAML frontmatter — Git-native, one file per requirement, organized in a tree. A core feature, &lt;em&gt;Infer Specs&lt;/em&gt;, walks a codebase and proposes a Goal → Feature → Requirement tree by calling MCP tools (&lt;code&gt;create_feature&lt;/code&gt;, &lt;code&gt;update_requirement&lt;/code&gt;, &lt;code&gt;read_file&lt;/code&gt;, etc.) in a loop until it decides the tree is complete. This is a heavy agentic workflow: multi-turn, tool-heavy, and the model has to know when to stop.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The concept works without the tool.&lt;/strong&gt; Markdown-plus-YAML-plus-Git as a spec format is older than SPECLAN and is the generalizable pattern this article assumes. The failure modes below will hit any agentic workflow that uses MCP tool calls plus structured output. SPECLAN is just where I observed them, across seven model/server combinations on two local servers.&lt;/p&gt;

&lt;p&gt;Here are the four failure modes, in the order you'll probably hit them.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. The tool-call loop
&lt;/h2&gt;

&lt;p&gt;Setup: an instruction-tuned model, reasonable size, MCP tool wired, seed a requirement and ask the agent to populate it.&lt;/p&gt;

&lt;p&gt;What you'll see in the trace:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;18:56:25  update_requirement  → R-0049
18:56:25  update_requirement  → R-0049
18:56:26  update_requirement  → R-0049
18:56:26  update_requirement  → R-0049
...  (12 more times, same arguments)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same tool, same target, same arguments, repeated until the agent runs out of turns. On disk: garbage. The requirement's description got jammed into the YAML &lt;code&gt;title:&lt;/code&gt; field, the body is still the untouched template placeholder, and the "Acceptance Criteria" section ends up in the wrong place.&lt;/p&gt;

&lt;p&gt;This is not a bug in your code. Google's own Gemma 4 docs acknowledge it: Gemma can emit multiple tool calls per turn and &lt;strong&gt;has no built-in loop termination&lt;/strong&gt;. The model sees the tool's success response but doesn't recognize "I am done." MoE and MatFormer-style elastic variants hit this hardest.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mitigation (application layer):&lt;/strong&gt; track tool-call fingerprints in the agent runner. If you see the same &lt;code&gt;(tool_name, stable_arg_hash)&lt;/code&gt; three times in a row, interrupt the loop with a synthetic tool result that says &lt;em&gt;"this tool has already produced the expected effect; proceed to the next step or terminate."&lt;/em&gt; This works because the loop is usually driven by the model not trusting the first success.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;callHistory&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[];&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="k"&gt;await &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;step&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;step&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;tool_call&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;fp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;step&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;:&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nf"&gt;stableStringify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;step&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;args&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
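    &lt;span class="c1"&gt;// Require three recorded calls first: [].every(...) is true, so an empty history would trip the guard.&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;callHistory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;callHistory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;slice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;every&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;x&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="nx"&gt;fp&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;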
      &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;tool&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Already applied. Continue or finish.&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
      &lt;span class="k"&gt;continue&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="nx"&gt;callHistory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;fp&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="nx"&gt;step&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Not beautiful but survives every MoE variant I've thrown at it.&lt;/p&gt;
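
&lt;p&gt;For readers who want something they can run in isolation, here is a self-contained sketch of the same fingerprint idea as a small stateful guard. The &lt;code&gt;LoopGuard&lt;/code&gt; class and this particular &lt;code&gt;stableStringify&lt;/code&gt; are assumptions written for this example, not the runner's actual code; the guard lets the first three identical calls through and flags the fourth.&lt;/p&gt;

```typescript
// Deterministic serialization: object keys are sorted so that equal
// arguments yield the same fingerprint regardless of key order.
// Values JSON cannot represent (undefined, functions) become "undefined",
// which is acceptable for a fingerprint but is not valid JSON.
function stableStringify(value: unknown): string {
  if (value === null || typeof value !== "object") {
    return String(JSON.stringify(value));
  }
  if (Array.isArray(value)) {
    return `[${value.map((entry) => stableStringify(entry)).join(",")}]`;
  }
  const obj = value as { [key: string]: unknown };
  const body = Object.keys(obj)
    .sort()
    .map((key) => `${JSON.stringify(key)}:${stableStringify(obj[key])}`);
  return `{${body.join(",")}}`;
}

// Hypothetical guard: counts consecutive identical (tool, args) calls.
class LoopGuard {
  private readonly limit: number;
  private lastFingerprint = "";
  private streak = 0;

  constructor(limit = 3) {
    this.limit = limit;
  }

  // Returns true once the same call has already been seen `limit` times in
  // a row, meaning the runner should answer with a synthetic tool result.
  shouldInterrupt(name: string, args: unknown): boolean {
    const fingerprint = `${name}:${stableStringify(args)}`;
    this.streak = fingerprint === this.lastFingerprint ? this.streak + 1 : 1;
    this.lastFingerprint = fingerprint;
    return this.streak > this.limit;
  }
}

const guard = new LoopGuard(3);
const verdicts = [1, 2, 3, 4].map(() =>
  guard.shouldInterrupt("update_requirement", { id: "R-0049" }),
);
console.log(verdicts); // [ false, false, false, true ]
```

&lt;p&gt;Any different call resets the streak, so a model that alternates between two tools is not flagged by this sketch; catching that pattern would need a wider window.&lt;/p&gt;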

&lt;h2&gt;
  
  
  2. The hallucinated success
&lt;/h2&gt;

&lt;p&gt;Second failure is worse, because it passes superficial validation.&lt;/p&gt;

&lt;p&gt;Trace:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;17:22:01  update_requirement  → R-8881   [tool call happened]
17:22:03  assistant: "I read the current state of R-8881 and updated its
                     description with a full specification: [long convincing
                     summary of changes]"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;File on disk: &lt;strong&gt;unchanged.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The tool call fired. Your logs show it. The agent's final answer says the task succeeded. But the tool-call arguments were malformed in a way your MCP server silently ignored — or the model narrated its intent as a completion without ever carrying it out.&lt;/p&gt;

&lt;p&gt;This is the "hallucinated success" mode. It's worse than the loop because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tests that assert &lt;em&gt;"the agent called &lt;code&gt;update_requirement&lt;/code&gt; at least once"&lt;/em&gt; pass.&lt;/li&gt;
&lt;li&gt;Tests that assert the file changed fail — but only if you actually assert that.&lt;/li&gt;
&lt;li&gt;Manual review sees a confident, detailed "I did it" message and believes it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Mitigation (observability layer):&lt;/strong&gt; every MCP tool that writes state should return a &lt;strong&gt;diff summary&lt;/strong&gt; as part of its response, not just &lt;code&gt;{"success": true}&lt;/code&gt;. Something like &lt;code&gt;{ changed: true, hash_before: '...', hash_after: '...', fields_modified: ['description', 'acceptance_criteria'] }&lt;/code&gt;. Then your agent runner can verify that the model's final claim is consistent with the actual diff history. If the model says &lt;em&gt;"I updated the description"&lt;/em&gt; but the diff summary shows &lt;code&gt;changed: false&lt;/code&gt;, flag the session as inconsistent.&lt;/p&gt;

&lt;p&gt;I also keep a &lt;code&gt;diff_since_seed&lt;/code&gt; field that the agent can read at any time — so the model can literally look at what it has and hasn't changed, rather than relying on its own memory of the conversation.&lt;/p&gt;
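
&lt;p&gt;A rough sketch of how such a diff summary could be computed on the server side follows. The field names mirror the example above; the helper names (&lt;code&gt;hashFields&lt;/code&gt;, &lt;code&gt;summarizeDiff&lt;/code&gt;, &lt;code&gt;claimMatchesHistory&lt;/code&gt;) and the flat string-to-string field model are assumptions made for the example, and SHA-256 is just one reasonable choice of hash.&lt;/p&gt;

```typescript
import { createHash } from "node:crypto";

// Field names follow the example in the text; the helpers are hypothetical.
type DiffSummary = {
  changed: boolean;
  hash_before: string;
  hash_after: string;
  fields_modified: string[];
};

type SpecFields = { [name: string]: string };

// Hashes the fields in sorted key order so the result ignores key ordering.
function hashFields(fields: SpecFields): string {
  const entries = Object.keys(fields)
    .sort()
    .map((name) => [name, fields[name]]);
  return createHash("sha256").update(JSON.stringify(entries)).digest("hex");
}

// Builds the summary a write tool could return instead of {"success": true}.
function summarizeDiff(before: SpecFields, after: SpecFields): DiffSummary {
  const names = Array.from(
    new Set(Object.keys(before).concat(Object.keys(after))),
  ).sort();
  const fieldsModified = names.filter((name) => before[name] !== after[name]);
  return {
    changed: fieldsModified.length > 0,
    hash_before: hashFields(before),
    hash_after: hashFields(after),
    fields_modified: fieldsModified,
  };
}

// True when the model's "I changed something" claim agrees with the history.
function claimMatchesHistory(claimsChange: boolean, history: DiffSummary[]): boolean {
  const anyChange = history.some((summary) => summary.changed);
  return claimsChange === anyChange;
}

const noop = summarizeDiff({ description: "TODO" }, { description: "TODO" });
console.log(claimMatchesHistory(true, [noop])); // false: an update was claimed, nothing changed
```

&lt;p&gt;Deciding whether the model's final message actually claims a change is left to the runner here; the sketch only covers the comparison once that claim has been reduced to a boolean.&lt;/p&gt;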

&lt;h2&gt;
  
  
  3. Edit-as-replace
&lt;/h2&gt;

&lt;p&gt;Different workflow: user runs &lt;code&gt;/add Acceptance Criteria&lt;/code&gt; on an existing 5-section spec. Claude and GPT-5 default to echoing the full document with the addition merged in. A weaker local model — &lt;code&gt;gemma-4-26b-a4b&lt;/code&gt; in my case — returned &lt;strong&gt;only the new section&lt;/strong&gt;. Three sentences. The editor received three sentences and replaced the entire document.&lt;/p&gt;

&lt;p&gt;Silent data loss. No error, no warning.&lt;/p&gt;

&lt;p&gt;This isn't exclusive to local models; it happens to cloud models too if your prompt doesn't explicitly state the invariant. But strong models &lt;em&gt;infer&lt;/em&gt; the invariant ("they want me to add a section, not replace the doc"). Weak models execute the surface instruction. Prompt-engineer it out:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;DOCUMENT COMPLETENESS RULE (NON-NEGOTIABLE)

Your response MUST contain the ENTIRE document, not just the portion
you modified. The editor replaces the current document with your full
response. A partial response will delete everything you didn't emit.

If you cannot reproduce the full document (length, context budget,
uncertainty), return the ORIGINAL document unchanged. A no-op is
always correct; silent truncation is never correct.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The fact that this has to be said in capitals to a 26B model is the whole lesson of weak-model prompting: invariants that strong models treat as obvious must be written down.&lt;/p&gt;
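
&lt;p&gt;The prompt rule can be paired with a cheap check on the editor side. The sketch below is an assumption of mine rather than part of the original workflow: it treats Markdown headings as a proxy for completeness and falls back to the original document when the response has dropped any of them. The function names are hypothetical.&lt;/p&gt;

```typescript
// Extracts Markdown ATX headings ("# Title", "## Section", ...) from a document.
function headingsOf(markdown: string): string[] {
  return markdown
    .split("\n")
    .filter((line) => /^#{1,6}\s/.test(line))
    .map((line) => line.trim());
}

// Returns the text the editor should apply: the response if it still contains
// every heading of the original, otherwise the original (a no-op).
function safeReplacement(original: string, response: string): string {
  const kept = new Set(headingsOf(response));
  const missing = headingsOf(original).filter((heading) => !kept.has(heading));
  return missing.length === 0 ? response : original;
}

const original = "# Spec\n\n## Overview\nText.\n\n## Scope\nText.";
const partial = "## Acceptance Criteria\n- Works offline.";
console.log(safeReplacement(original, partial) === original); // true: partial response rejected
```

&lt;p&gt;A heading check will not notice a section whose body was truncated, so it complements the prompt rule rather than replacing it.&lt;/p&gt;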

&lt;h2&gt;
  
  
  4. Structured-output non-compliance
&lt;/h2&gt;

&lt;p&gt;Clarification flow: ask the model to propose JSON matching a schema like &lt;code&gt;{ changes: [...], reasoning: "..." }&lt;/code&gt;. Downstream code does &lt;code&gt;response.changes.map(...)&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;A local model with no guidance returned a &lt;strong&gt;raw array&lt;/strong&gt; instead of the wrapped object. &lt;code&gt;.map&lt;/code&gt; on &lt;code&gt;undefined&lt;/code&gt;, crash.&lt;/p&gt;

&lt;p&gt;Here's the subtle part: the schema text was never reaching the prompt. A helper signature had changed; the caller was still passing the schema positionally. TypeScript accepted it. The local model made up its own structure because we never showed it the schema in-band.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The lesson:&lt;/strong&gt; don't rely on OpenAI's &lt;code&gt;response_format&lt;/code&gt; or any SDK-level structured-output guarantee for local models. Most local servers implement the OpenAI-compatible API but not the structured-output constraints behind it. Put the schema text directly into the system prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;systemPrompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;`
You return JSON matching this exact schema:

&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;&lt;span class="s2"&gt;

Critical: the root MUST be an object with a "changes" array and a
"reasoning" string. NEVER return a bare array at the root.
`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Treat this as belt-and-braces alongside the SDK's structured-output call, not a replacement for it. Local models will still go off-script occasionally, but the schema-in-prompt approach catches ~95% of the drift in my testing.&lt;/p&gt;
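
&lt;p&gt;For the drift that remains, a defensive parse at the boundary avoids the &lt;code&gt;.map&lt;/code&gt;-on-&lt;code&gt;undefined&lt;/code&gt; crash. This is a minimal sketch under my own assumptions: &lt;code&gt;parseProposedChanges&lt;/code&gt; is a hypothetical helper, and wrapping a bare root array instead of rejecting it is a policy choice you may not want.&lt;/p&gt;

```typescript
type ProposedChanges = { changes: unknown[]; reasoning: string };

// Parses model output and enforces the wrapped-object shape.
// A bare array at the root is wrapped with an empty reasoning string;
// anything else that lacks a "changes" array is rejected with an error.
function parseProposedChanges(raw: string): ProposedChanges {
  const parsed: unknown = JSON.parse(raw);
  if (Array.isArray(parsed)) {
    return { changes: parsed, reasoning: "" };
  }
  if (typeof parsed === "object") {
    if (parsed !== null) {
      const candidate = parsed as { changes?: unknown; reasoning?: unknown };
      if (Array.isArray(candidate.changes)) {
        return {
          changes: candidate.changes,
          reasoning:
            typeof candidate.reasoning === "string" ? candidate.reasoning : "",
        };
      }
    }
  }
  throw new Error('Model output does not match { changes: [...], reasoning: "..." }');
}

console.log(parseProposedChanges('[{"field":"title"}]').changes.length); // 1
```

&lt;p&gt;Downstream code can then call &lt;code&gt;.map&lt;/code&gt; on &lt;code&gt;changes&lt;/code&gt; safely, and a thrown error becomes a signal to retry or log rather than a crash deep in the pipeline.&lt;/p&gt;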

&lt;h2&gt;
  
  
  The benchmark
&lt;/h2&gt;

&lt;p&gt;I seeded a requirement, asked the agent to populate it with a description and acceptance criteria, and measured three things: did it call the right tool? Did it call the same tool more than 3× (loop)? Did the file on disk actually change?&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Server&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Heavy workflow&lt;/th&gt;
&lt;th&gt;Failure mode&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Ollama&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;gemma4:latest&lt;/code&gt; (8B)&lt;/td&gt;
&lt;td&gt;Dense&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;PASS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ollama&lt;/td&gt;
&lt;td&gt;&lt;code&gt;gemma4:31b&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Dense&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;PASS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;slow but clean&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ollama&lt;/td&gt;
&lt;td&gt;&lt;code&gt;gpt-oss:20b&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;MoE&lt;/td&gt;
&lt;td&gt;PASS tools / FAIL schema&lt;/td&gt;
&lt;td&gt;output non-compliance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LM Studio&lt;/td&gt;
&lt;td&gt;&lt;code&gt;google/gemma-4-26b-a4b&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;MoE&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;FAIL&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;tool-call loop ×16&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LM Studio&lt;/td&gt;
&lt;td&gt;&lt;code&gt;openai/gpt-oss-20b&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;MoE&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;FAIL&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;hallucinates completion&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LM Studio&lt;/td&gt;
&lt;td&gt;&lt;code&gt;google/gemma-4-e4b&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Elastic&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;FAIL&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;"no final response"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LM Studio&lt;/td&gt;
&lt;td&gt;&lt;code&gt;openai/gpt-oss-120b&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;MoE&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;FAIL&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;tool called, file unchanged&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Three findings fall out:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Dense beats MoE / elastic for agentic tool calling.&lt;/strong&gt; Every MoE and MatFormer variant failed at least part of the heavy workflow (&lt;code&gt;gpt-oss:20b&lt;/code&gt; on Ollama got the tool calls right but failed the schema). Every dense variant passed. The &lt;a href="https://www.jdhodges.com/blog/local-llms-on-tool-calling-2026-pt1-local-lm/" rel="noopener noreferrer"&gt;jdhodges 2026 local-LLM tool-calling benchmark&lt;/a&gt; shows the same pattern: Qwen 3.5 4B (3.4 GB) at 97.5%, beating models 5× its size. Dense weights + good tool-call fine-tuning dominate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ollama beats LM Studio on the same weights.&lt;/strong&gt; Same &lt;code&gt;gpt-oss:20b&lt;/code&gt;, opposite results. The difference is the tool-call translation layer: Ollama maps the model's native tool-call format to the OpenAI-compatible wire faithfully; LM Studio's current implementation loses fidelity in ways that matter. This one surprised me — I'd assumed weights dominated the harness.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Size doesn't rescue you.&lt;/strong&gt; &lt;code&gt;gpt-oss-120b&lt;/code&gt; failed the same way as its 20B sibling. You can't out-parameter a chat-template / tool-call-format mismatch.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  What to carry away
&lt;/h2&gt;

&lt;p&gt;If you're building something agentic on top of local LLMs, the checklist is short:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Start dense.&lt;/strong&gt; Qwen 3.5/3.6 or Gemma 4 dense, on Ollama, 7B minimum.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Add loop detection&lt;/strong&gt; at the application layer. Don't trust the model to self-terminate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Return meaningful tool results,&lt;/strong&gt; not &lt;code&gt;{"success": true}&lt;/code&gt;. Diff summaries let you detect hallucinated success.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Put your schema in the prompt,&lt;/strong&gt; not just the SDK.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bump context length&lt;/strong&gt; to 16K+ on LM Studio and reload the model (the setting doesn't apply to already-loaded models — I wasted half a day on "Model did not produce a final response" before I realized).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prompt against weak-model literal-mindedness.&lt;/strong&gt; The &lt;code&gt;DOCUMENT COMPLETENESS RULE&lt;/code&gt; pattern prevents whole classes of silent data loss.&lt;/li&gt;
&lt;/ul&gt;
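&lt;p&gt;Two of the items above, loop detection and meaningful tool results, fit in a few lines at the application layer. The following is a minimal sketch, not the code used in the benchmark: &lt;code&gt;LoopDetector&lt;/code&gt;, &lt;code&gt;edit_result&lt;/code&gt;, the &lt;code&gt;max_repeats&lt;/code&gt; threshold, and the result keys are all hypothetical names chosen for illustration.&lt;/p&gt;

```python
import difflib
import json
from collections import Counter


class LoopDetector:
    """Flag a run when an identical tool call repeats too often."""

    def __init__(self, max_repeats=3):
        # Hypothetical threshold; tune it to your workflow.
        self.max_repeats = max_repeats
        self.seen = Counter()

    def record(self, tool_name, arguments):
        """Return True once the same call has exceeded max_repeats."""
        key = (tool_name, json.dumps(arguments, sort_keys=True))
        self.seen[key] += 1
        return self.seen[key] > self.max_repeats


def edit_result(before, after):
    """Build a tool result that reports what actually changed."""
    diff = list(difflib.ndiff(before.splitlines(), after.splitlines()))
    added = sum(1 for line in diff if line.startswith("+ "))
    removed = sum(1 for line in diff if line.startswith("- "))
    return {
        "changed": added + removed != 0,
        "lines_added": added,
        "lines_removed": removed,
    }


detector = LoopDetector(max_repeats=2)
call = ("update_spec", {"id": "R-4201"})
flags = [detector.record(*call) for _ in range(3)]
print(flags)                                  # [False, False, True]
print(edit_result("a\nb\n", "a\nb\n"))         # changed is False
```

&lt;p&gt;A result with &lt;code&gt;changed&lt;/code&gt; set to false gives the harness something concrete to check when the model claims the edit succeeded.&lt;/p&gt;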

&lt;p&gt;Everything here is generalizable — none of it is specific to how SPECLAN uses MCP tools. If you've run into different failure modes on your local-LLM agentic workflows (especially with Qwen 3.6 dense, Llama 3.3, or GLM-4.7), drop them in the comments. I'm particularly interested in anyone who's gotten &lt;code&gt;Qwen3.6-35B-A3B&lt;/code&gt; to self-terminate reliably on a 10+-step tool-calling loop — the MoE training for agentic coding is supposed to have fixed this, but I haven't verified it yet.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;References&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The living journey version of this post (with SPECLAN-specific details): &lt;a href="https://speclan.net/blog/2026-04-25-we-gave-speclan-a-local-brain/" rel="noopener noreferrer"&gt;speclan.net/blog/2026-04-25-we-gave-speclan-a-local-brain&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;jdhodges 2026 local-LLM tool-calling benchmark: &lt;a href="https://www.jdhodges.com/blog/local-llms-on-tool-calling-2026-pt1-local-lm/" rel="noopener noreferrer"&gt;jdhodges.com/blog/local-llms-on-tool-calling-2026-pt1-local-lm&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;SPECLAN on the VS Code Marketplace: &lt;a href="https://marketplace.visualstudio.com/items?itemName=DigitalDividend.speclan-vscode-extension" rel="noopener noreferrer"&gt;marketplace.visualstudio.com/items?itemName=DigitalDividend.speclan-vscode-extension&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>mcp</category>
      <category>claude</category>
      <category>githubcopilot</category>
    </item>
    <item>
      <title>Your PO Should Own the Spec, Not the Developer — Here's How Status Gates Fix the AI Handoff Problem</title>
      <dc:creator>Thomas Landgraf</dc:creator>
      <pubDate>Sat, 04 Apr 2026 08:56:47 +0000</pubDate>
      <link>https://dev.to/thlandgraf/your-po-should-own-the-spec-not-the-developer-heres-how-status-gates-fix-the-ai-handoff-problem-cn0</link>
      <guid>https://dev.to/thlandgraf/your-po-should-own-the-spec-not-the-developer-heres-how-status-gates-fix-the-ai-handoff-problem-cn0</guid>
      <description>&lt;p&gt;In most AI-assisted workflows, the developer writes the prompt and owns the outcome. The Product Owner writes a Jira ticket, the developer interprets it, feeds it to an AI agent, and 2,000 lines of code appear. Three sprints later, everyone's still arguing about what was actually specified.&lt;/p&gt;

&lt;p&gt;The root cause isn't bad developers or bad POs. It's that &lt;strong&gt;nobody owns the spec as a living artifact.&lt;/strong&gt; Jira tickets describe work to do — they die when the sprint ends. Confluence pages describe features that were planned — they go stale the moment someone changes the code. The actual intent lives in chat logs, Slack threads, and someone's memory.&lt;/p&gt;

&lt;h2&gt;
  
  
  What if your specs lived in Git?
&lt;/h2&gt;

&lt;p&gt;The idea: each product requirement is a &lt;strong&gt;Markdown file with YAML frontmatter&lt;/strong&gt;, stored in your Git repository right next to the code. One file per requirement, organized in a directory tree that mirrors your feature hierarchy. The frontmatter carries metadata — who owns it, when it was last updated, and crucially, its &lt;strong&gt;status&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;R-4201&lt;/span&gt;
&lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;requirement&lt;/span&gt;
&lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Add&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;to&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Cart&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;button"&lt;/span&gt;
&lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;approved&lt;/span&gt;
&lt;span class="na"&gt;owner&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sarah&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;

&lt;span class="s"&gt;When a user clicks "Add to Cart" on a product page...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No external tools. No Confluence sync. No copy-pasting between systems. The spec is a file, reviewed in PRs, versioned in Git, and readable by both humans and AI agents.&lt;/p&gt;
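&lt;p&gt;Because the spec is plain text, reading its status takes very little code. The sketch below is an illustration only, assuming flat &lt;code&gt;key: value&lt;/code&gt; frontmatter like the example above; &lt;code&gt;parse_frontmatter&lt;/code&gt; is a hypothetical helper, and a real project would more likely use a YAML library.&lt;/p&gt;

```python
def parse_frontmatter(text):
    """Split a spec file into (metadata dict, Markdown body).

    Handles only flat `key: value` pairs, which is all the example
    frontmatter uses. Nested YAML is out of scope for this sketch.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    metadata = {}
    for index, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            body = "\n".join(lines[index + 1:]).strip()
            return metadata, body
        key, separator, value = line.partition(":")
        if separator:
            metadata[key.strip()] = value.strip().strip('"')
    return {}, text  # no closing delimiter: treat as plain Markdown


SPEC = '''---
id: R-4201
type: requirement
title: "Add to Cart button"
status: approved
owner: sarah
---

When a user clicks "Add to Cart" on a product page...
'''

metadata, body = parse_frontmatter(SPEC)
print(metadata["status"], metadata["owner"])  # approved sarah
```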

&lt;p&gt;I've been building a VS Code extension called &lt;a href="https://marketplace.visualstudio.com/items?itemName=SPECLAN.speclan" rel="noopener noreferrer"&gt;SPECLAN&lt;/a&gt; that adds a WYSIWYG editor, spec tree view, and AI implementation tooling on top of this approach (full disclosure: I'm the creator). But the core concept — specs as files with status gates — works with any editor and zero tooling.&lt;/p&gt;

&lt;p&gt;Here's the part that changes everything: &lt;strong&gt;the status field isn't just a label. It's an ownership protocol.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The status lifecycle
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;draft → review → approved → in-development → under-test → released
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each transition is a handoff between roles — not a Slack message, not a status change in a project management tool, but a field in the file itself, committed to Git, visible to the entire team.&lt;/p&gt;

&lt;p&gt;Each status tells you who holds the spec right now:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;draft&lt;/code&gt; means the PO is still thinking. Devs can see it but shouldn't implement it yet.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;review&lt;/code&gt; means the PO wants the team's eyes on it.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;approved&lt;/code&gt; means it's been reviewed and is ready to implement — the handoff moment.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;in-development&lt;/code&gt; means the dev team (or AI agent) owns it now.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;under-test&lt;/code&gt; means responsibility flows back to the PO — did the result match the intent?&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;released&lt;/code&gt; means everyone agrees it's done. The spec stays as a permanent record.&lt;/li&gt;
&lt;/ul&gt;
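&lt;p&gt;The lifecycle can be written down as a small lookup. This is a sketch under two assumptions that are mine, not SPECLAN's: transitions move one step forward only, and the &lt;code&gt;HOLDER&lt;/code&gt; labels paraphrase the list above. The names &lt;code&gt;can_transition&lt;/code&gt; and &lt;code&gt;may_implement&lt;/code&gt; are hypothetical.&lt;/p&gt;

```python
# Status order taken from the lifecycle described in the article.
LIFECYCLE = ["draft", "review", "approved", "in-development", "under-test", "released"]

# Who holds the spec in each status (paraphrased, illustrative labels).
HOLDER = {
    "draft": "PO",
    "review": "PO, with the team reviewing",
    "approved": "handoff to the dev team",
    "in-development": "dev team or AI agent",
    "under-test": "PO",
    "released": "everyone",
}


def can_transition(current, target):
    """Assumption: only a single step forward is allowed."""
    return LIFECYCLE.index(target) - LIFECYCLE.index(current) == 1


def may_implement(status):
    """Implementation may start only from an approved spec."""
    return status == "approved"


print(can_transition("draft", "review"))    # True
print(can_transition("draft", "approved"))  # False: review was skipped
print(may_implement("draft"))               # False
```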

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmdolplv0h7osah9obgoy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmdolplv0h7osah9obgoy.png" alt=" " width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Walking through it: adding a shopping cart
&lt;/h2&gt;

&lt;p&gt;Sarah the PO creates three Requirements in her spec tree: "Add to Cart" button, cart page with quantity editing, and cart persistence across sessions. She writes each one describing &lt;em&gt;what&lt;/em&gt; the feature does, not &lt;em&gt;how&lt;/em&gt; to build it. She adds Acceptance Criteria — toast notification within 500ms, cart icon updates without reload, duplicate items increment quantity.&lt;/p&gt;

&lt;p&gt;Status: &lt;code&gt;draft&lt;/code&gt;. The dev team can see the specs in their tree view, but the status gate prevents premature implementation. Sarah is still thinking.&lt;/p&gt;

&lt;p&gt;She moves to &lt;code&gt;review&lt;/code&gt;. Marco (senior dev) flags a concern — localStorage has a 5MB limit that could bite them with large carts. Sarah updates the spec. Lisa (QA) adds a missing edge case: what happens at maximum stock quantity? Sarah adds it. All of this happens in Git commits, not Jira comments.&lt;/p&gt;

&lt;p&gt;Status moves to &lt;code&gt;approved&lt;/code&gt;. Now the dev team takes over. The AI agent reads the approved specs directly — not from a copy-pasted prompt, but through structured tools that give it access to the full requirement text, acceptance criteria, and the spec hierarchy. It implements what was specified, not what it guesses.&lt;/p&gt;

&lt;p&gt;After implementation, the status moves to &lt;code&gt;under-test&lt;/code&gt;. This is the &lt;strong&gt;handback moment&lt;/strong&gt; — responsibility flows back from the dev team to the PO. Sarah tests each acceptance criterion against the running system. The person who defined the requirement is the person who accepts the result.&lt;/p&gt;

&lt;p&gt;Status: &lt;code&gt;released&lt;/code&gt;. The spec stays in the repo as the permanent record of what the product does. Six months later, when someone asks "why does the cart sync to the server?", the answer is in the spec file — including Marco's review comment about the 5MB limit.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this matters more than you think
&lt;/h2&gt;

&lt;h3&gt;
  
  
  It's not just about teams
&lt;/h3&gt;

&lt;p&gt;If you're a solo developer, you're already playing all these roles — you just switch between them unconsciously. You're the PO when you decide what to build. The developer when you implement. The QA when you test.&lt;/p&gt;

&lt;p&gt;The problem is these mental switches happen mid-sentence. You're halfway through writing a spec when you think "I know how to build this" and jump straight to coding. The spec never gets finished. Two weeks later, you can't remember what you intended.&lt;/p&gt;

&lt;p&gt;Status gates give solo developers &lt;strong&gt;forced phase separation:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Specificator hat&lt;/strong&gt; — write specs, think through edge cases. No coding yet.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Implementor hat&lt;/strong&gt; — code from the approved spec, not from memory.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Verifier hat&lt;/strong&gt; — check your own acceptance criteria. "Did I build what I intended?"&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Specs outlive sprints
&lt;/h3&gt;

&lt;p&gt;This is the real differentiator from Jira. Tickets describe work. Specs describe the product.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;Jira Ticket&lt;/th&gt;
&lt;th&gt;Spec-as-file&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Describes&lt;/td&gt;
&lt;td&gt;Work to do&lt;/td&gt;
&lt;td&gt;Product behavior&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lifespan&lt;/td&gt;
&lt;td&gt;One sprint&lt;/td&gt;
&lt;td&gt;Product lifetime&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lives in&lt;/td&gt;
&lt;td&gt;External tool&lt;/td&gt;
&lt;td&gt;Git, next to the code&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;After completion&lt;/td&gt;
&lt;td&gt;Closed, archived&lt;/td&gt;
&lt;td&gt;Still authoritative&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AI-readable&lt;/td&gt;
&lt;td&gt;Copy-paste into prompt&lt;/td&gt;
&lt;td&gt;Structured tools read directly&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;You can use both — Jira for sprint planning, spec files for the actual requirements. But the spec should outlive the sprint.&lt;/p&gt;

&lt;h3&gt;
  
  
  AI agents need governance, not freedom
&lt;/h3&gt;

&lt;p&gt;The AI agent reads specs through structured tools, not ad-hoc prompts. It only implements approved specs. It updates the status as work progresses. The human governance layer stays intact even when the coding is automated.&lt;/p&gt;

&lt;p&gt;This is the part most AI coding workflows get wrong. They give the AI maximum freedom and wonder why month 3 is a mess. The fix isn't more prompting. It's giving the AI a spec to follow and a lifecycle to respect.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try it yourself
&lt;/h2&gt;

&lt;p&gt;The Markdown + YAML frontmatter approach works without any tooling — it's just files in Git. But if you want the tree view, WYSIWYG editor, and AI implementation assistant on top of it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://marketplace.visualstudio.com/items?itemName=SPECLAN.speclan" rel="noopener noreferrer"&gt;SPECLAN on the VS Code Marketplace&lt;/a&gt; (free)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://speclan.net" rel="noopener noreferrer"&gt;speclan.net&lt;/a&gt; — docs and methodology guide&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;What does your spec handoff look like? I'm curious how other teams handle the PO → dev → PO loop — especially with AI agents in the mix.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>vscode</category>
      <category>productivity</category>
      <category>ai</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Stop Vibe Coding: What Happens When You Give Your AI Agent a Real Spec</title>
      <dc:creator>Thomas Landgraf</dc:creator>
      <pubDate>Thu, 05 Mar 2026 17:51:01 +0000</pubDate>
      <link>https://dev.to/thlandgraf/stop-vibe-coding-what-happens-when-you-give-your-ai-agent-a-real-spec-378</link>
      <guid>https://dev.to/thlandgraf/stop-vibe-coding-what-happens-when-you-give-your-ai-agent-a-real-spec-378</guid>
      <description>&lt;p&gt;Your AI coding agent can write a feature in minutes. But did it write the &lt;em&gt;right&lt;/em&gt; feature?&lt;/p&gt;

&lt;p&gt;I've been using Claude Code, Cursor, and Copilot for the past year, and the pattern is always the same: you describe what you want in natural language, the agent generates code, and then you spend the next hour fixing the parts it got wrong. Not because the AI is bad — but because your intent was never structured enough for it to get right.&lt;/p&gt;

&lt;p&gt;That loop — prompt, wrong output, re-prompt, repeat — is what people call vibe coding. It works for prototypes. It doesn't work for anything you need to maintain.&lt;/p&gt;

&lt;h2&gt;
  
  
  The missing layer
&lt;/h2&gt;

&lt;p&gt;The gap isn't in the AI's coding ability. It's between your head and the agent's context window. You know what the feature should do, how it fits into the product, what the edge cases are, and which acceptance criteria matter. The agent knows... whatever you typed into the prompt.&lt;/p&gt;

&lt;p&gt;Spec-driven development closes that gap by structuring your intent &lt;em&gt;before&lt;/em&gt; the agent starts writing code. Not a 40-page requirements document. Just enough structure that the AI knows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What the feature is and why it exists (business goal)&lt;/li&gt;
&lt;li&gt;Where it fits in the product hierarchy (parent feature)&lt;/li&gt;
&lt;li&gt;What "done" looks like (acceptance criteria)&lt;/li&gt;
&lt;li&gt;What status it's in (can it be implemented yet?)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What this looks like in practice
&lt;/h2&gt;

&lt;p&gt;I've been building a tool called SPECLAN that takes this approach — it's a free VS Code extension that manages specifications as a tree of Markdown files with YAML frontmatter, living in your Git repository.&lt;/p&gt;

&lt;p&gt;I recorded a 7-minute walkthrough that shows the full workflow from importing a raw product idea to orchestrating AI agents against structured work packages:&lt;/p&gt;

&lt;p&gt;

  &lt;iframe src="https://www.youtube.com/embed/_fl494gtxbw"&gt;
  &lt;/iframe&gt;


&lt;/p&gt;

&lt;p&gt;Here's what the video covers:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;0:00 — The problem.&lt;/strong&gt; Why your AI agent keeps getting it wrong, and what's actually missing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;0:25 — Installation.&lt;/strong&gt; One click from the VS Code Marketplace.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;0:40 — Importing an idea.&lt;/strong&gt; You paste a high-level product description. SPECLAN's AI decomposes it into a hierarchy: goals, features, requirements — each as a separate Markdown file.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1:10 — The specification tree.&lt;/strong&gt; A navigable tree view in VS Code's sidebar. Goals break down into features, features into sub-features, sub-features into requirements. The hierarchy &lt;em&gt;is&lt;/em&gt; your product structure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1:35 — WYSIWYG editing.&lt;/strong&gt; A rich text editor inside a VS Code webview, so you can write specs without thinking about Markdown syntax. What you see round-trips cleanly to Markdown + YAML frontmatter.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1:55 — AI chat assistant.&lt;/strong&gt; Ask questions about your spec, get suggestions, refine requirements — all within the editor panel.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2:15 — Copy AI Prime Context.&lt;/strong&gt; This is where it gets practical. One click copies a structured prompt containing the spec, its parent feature, the business goal, acceptance criteria, and surrounding context. Paste that into Claude Code or any agent, and it actually knows what to build.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2:40 — Status lifecycle.&lt;/strong&gt; Specs move through &lt;code&gt;draft -&amp;gt; review -&amp;gt; approved -&amp;gt; in-development -&amp;gt; under-test -&amp;gt; released&lt;/code&gt;. Only approved specs can be implemented. This prevents the "building against a moving target" problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2:55 — SWARM implementation.&lt;/strong&gt; Break approved specs into work packages and let multiple AI agents work on them in parallel — with the specification as the shared source of truth.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3:20 — Change Requests.&lt;/strong&gt; When an approved spec needs modification, you don't edit it directly. You create a Change Request — a separate file that tracks what changed and why. No more spec drift.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3:45 — Git integration.&lt;/strong&gt; Every spec is a Markdown file in Git. You get diffs, branches, and merge workflows for free. Your specs live next to your code, versioned the same way.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Markdown files in Git?
&lt;/h2&gt;

&lt;p&gt;I chose this approach over a database or a cloud service for one reason: portability.&lt;/p&gt;

&lt;p&gt;Your specs are plain text files. They work with any editor, any AI agent, any CI pipeline. If you stop using SPECLAN tomorrow, your specifications are still there — readable, diffable, greppable Markdown. No export step, no migration, no vendor lock-in.&lt;/p&gt;

&lt;p&gt;The YAML frontmatter carries the structured metadata (ID, status, parent reference, owner), while the Markdown body carries the human-readable content. Git gives you the audit trail. The VS Code extension gives you the GUI.&lt;/p&gt;

&lt;h2&gt;
  
  
  The ecosystem is growing
&lt;/h2&gt;

&lt;p&gt;SPECLAN isn't the only tool exploring this space. The &lt;a href="https://github.com/bmad-code-org/BMAD-METHOD" rel="noopener noreferrer"&gt;BMAD Method&lt;/a&gt; uses specialized AI agent personas for structured development. &lt;a href="https://github.com/Fission-AI/OpenSpec" rel="noopener noreferrer"&gt;OpenSpec&lt;/a&gt; adds a spec layer for existing codebases. GitHub's &lt;a href="https://github.com/github/spec-kit" rel="noopener noreferrer"&gt;Spec Kit&lt;/a&gt; provides CLI templates for spec-driven workflows. &lt;a href="https://kiro.dev" rel="noopener noreferrer"&gt;Kiro&lt;/a&gt; from AWS takes a steering-file approach.&lt;/p&gt;

&lt;p&gt;Each tackles the same insight from a different angle: &lt;strong&gt;specifications are the missing layer between human intent and AI execution.&lt;/strong&gt; The methodology matters more than any single tool.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;

&lt;p&gt;SPECLAN is free and open source. Install it from the &lt;a href="https://marketplace.visualstudio.com/items?itemName=DigitalDividend.speclan-vscode-extension" rel="noopener noreferrer"&gt;VS Code Marketplace&lt;/a&gt;, point it at any project, and see if structured specs change how your AI agent performs.&lt;/p&gt;

&lt;p&gt;The docs are at &lt;a href="https://speclan.net" rel="noopener noreferrer"&gt;speclan.net&lt;/a&gt;. The source is on &lt;a href="https://github.com/nicob02/speclan" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I'm the creator — full disclosure. I built this because I was tired of re-prompting Claude Code with the same context every session. If you have questions or feedback, I'm in the comments.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;What's your experience with spec-driven development? Are you structuring your prompts before sending them to AI agents, or do you find the overhead isn't worth it? Curious to hear what's working for others.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>claudecode</category>
      <category>vscode</category>
      <category>productivity</category>
      <category>ai</category>
    </item>
    <item>
      <title>How I Use .claude/rules/ to Give Claude Code Domain Knowledge About My Project's File Structure</title>
      <dc:creator>Thomas Landgraf</dc:creator>
      <pubDate>Wed, 04 Mar 2026 22:17:04 +0000</pubDate>
      <link>https://dev.to/thlandgraf/how-i-use-clauderules-to-give-claude-code-domain-knowledge-about-my-projects-file-structure-47l9</link>
      <guid>https://dev.to/thlandgraf/how-i-use-clauderules-to-give-claude-code-domain-knowledge-about-my-projects-file-structure-47l9</guid>
      <description>&lt;p&gt;You know that moment when you ask Claude Code to edit a file and it treats your carefully structured project directory like a random pile of Markdown? It adds implementation details to a specification file. It puts a requirement under the wrong feature. It invents an ID format you never asked for.&lt;/p&gt;

&lt;p&gt;The problem isn't that the AI is dumb. It's that it has no idea what your files &lt;em&gt;mean&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;I've been building a VS Code extension called &lt;a href="https://speclan.net" rel="noopener noreferrer"&gt;SPECLAN&lt;/a&gt; that manages layered specifications as Markdown files with YAML frontmatter. The &lt;code&gt;speclan/&lt;/code&gt; directory in any project has a well-defined structure — entity types, ID schemes, status lifecycles, nesting rules. And Claude Code kept stepping on all of them until I discovered the &lt;code&gt;paths&lt;/code&gt; frontmatter in &lt;code&gt;.claude/rules/&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem: Claude doesn't know your conventions
&lt;/h2&gt;

&lt;p&gt;My project has a directory like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;speclan/
├── goals/           G-###-slug.md
├── features/        F-####-slug/F-####-slug.md
│   ├── requirements/  R-####-slug/R-####-slug.md
│   │   └── change-requests/  CR-####-slug.md
│   └── change-requests/  CR-####-slug.md
└── templates/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every file is Markdown with YAML frontmatter. IDs are random, not sequential. Features nest recursively. Requirements always belong to exactly one feature. Status goes &lt;code&gt;draft → review → approved → in-development → under-test → released → deprecated&lt;/code&gt;. Only approved specs can be implemented. Locked specs need a ChangeRequest to modify.&lt;/p&gt;

&lt;p&gt;None of this is obvious from the files alone. Without guidance, Claude will:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create sequential IDs (&lt;code&gt;F-0001&lt;/code&gt;, &lt;code&gt;F-0002&lt;/code&gt;) instead of random ones&lt;/li&gt;
&lt;li&gt;Put requirements at the wrong nesting level&lt;/li&gt;
&lt;li&gt;Mix implementation concerns into specification files&lt;/li&gt;
&lt;li&gt;Skip required frontmatter fields&lt;/li&gt;
&lt;li&gt;Ignore the status lifecycle entirely&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The solution: path-scoped rules
&lt;/h2&gt;

&lt;p&gt;Claude Code loads &lt;code&gt;.claude/rules/*.md&lt;/code&gt; files as persistent context. That alone is useful for project-wide conventions. But the feature that makes it powerful for structured directories is the &lt;code&gt;paths&lt;/code&gt; frontmatter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;paths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;speclan/**/*.md"&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This tells Claude Code: "Only inject these rules when I'm working with files that match this glob pattern." The rules file is invisible when you're editing TypeScript, writing tests, or doing anything else. But the moment you touch a file under &lt;code&gt;speclan/&lt;/code&gt;, it kicks in.&lt;/p&gt;

&lt;p&gt;Rules without a &lt;code&gt;paths&lt;/code&gt; field load unconditionally — they're the equivalent of putting instructions in &lt;code&gt;CLAUDE.md&lt;/code&gt;. Rules &lt;em&gt;with&lt;/em&gt; &lt;code&gt;paths&lt;/code&gt; only activate when Claude reads files matching the pattern. That distinction is what makes them useful for domain-specific knowledge.&lt;/p&gt;

&lt;h2&gt;
  
  
  What goes in the rules file
&lt;/h2&gt;

&lt;p&gt;Here's the actual rules file I use (condensed — the real one is ~96 lines):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;paths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;speclan/**/*.md"&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;

&lt;span class="gh"&gt;# SPECLAN Specification Rules&lt;/span&gt;

&lt;span class="gu"&gt;## Entity Hierarchy&lt;/span&gt;

Goal (G-###) → Feature (F-####) → Requirement (R-####)

ChangeRequest (CR-####) modifies locked entities.

&lt;span class="gu"&gt;## Directory Structure&lt;/span&gt;

speclan/
├── goals/           G-###-slug.md
├── features/        F-####-slug/F-####-slug.md (self-named dirs, recursive)
│   ├── requirements/  R-####-slug/R-####-slug.md
│   │   └── change-requests/  CR-####-slug.md
│   └── change-requests/  CR-####-slug.md
└── templates/&lt;span class="nt"&gt;&amp;lt;entityType&amp;gt;&lt;/span&gt;/  UUID-slug.md

&lt;span class="gu"&gt;## Frontmatter (YAML)&lt;/span&gt;

All specs are Markdown with YAML frontmatter. Required fields:
id, type, title, status, owner, created, updated

&lt;span class="gu"&gt;## ID Rules (NON-NEGOTIABLE)&lt;/span&gt;
&lt;span class="p"&gt;
-&lt;/span&gt; Goal: G-### (3 digits)
&lt;span class="p"&gt;-&lt;/span&gt; Feature: F-#### (4 digits)
&lt;span class="p"&gt;-&lt;/span&gt; Requirement: R-#### (4 digits)
&lt;span class="p"&gt;-&lt;/span&gt; ChangeRequest: CR-#### (4 digits)
&lt;span class="p"&gt;-&lt;/span&gt; IDs are random, not sequential
&lt;span class="p"&gt;-&lt;/span&gt; Check collisions before creation

&lt;span class="gu"&gt;## Status Lifecycle&lt;/span&gt;

draft → review → approved → in-development → under-test → released → deprecated

Only approved specs can be implemented.
Locked statuses (approved+) require a ChangeRequest for modifications.

&lt;span class="gu"&gt;## Invariants&lt;/span&gt;
&lt;span class="p"&gt;
1.&lt;/span&gt; Requirements belong to exactly one Feature
&lt;span class="p"&gt;2.&lt;/span&gt; Features may have sub-features AND requirements
&lt;span class="p"&gt;3.&lt;/span&gt; ChangeRequests reference exactly one parent

IMPORTANT: files under speclan/ are specifications that tell
WHAT from user perspective, not HOW from developer perspective
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That last line is the most important one. It's the semantic boundary that prevents Claude from mixing concerns. Without it, you ask for a new requirement and get implementation pseudocode.&lt;/p&gt;
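&lt;p&gt;The ID rules in that file (fixed width, random, collision-checked) are easy to state in code as well. The following is a rough sketch of one way to do it; &lt;code&gt;new_id&lt;/code&gt; and &lt;code&gt;DIGITS&lt;/code&gt; are hypothetical names, and this is not how SPECLAN itself generates IDs.&lt;/p&gt;

```python
import random

# Digits per entity prefix, as listed in the ID rules.
DIGITS = {"G": 3, "F": 4, "R": 4, "CR": 4}


def new_id(prefix, existing, rng=random):
    """Return a random, unused ID such as 'F-4821'.

    `existing` is the set of IDs already present in the spec tree.
    """
    width = DIGITS[prefix]
    capacity = 10 ** width
    taken = [item for item in existing if item.startswith(prefix + "-")]
    if len(taken) >= capacity:
        raise RuntimeError("no free IDs left for prefix " + prefix)
    while True:
        candidate = f"{prefix}-{rng.randrange(capacity):0{width}d}"
        if candidate not in existing:  # collision check before creation
            return candidate


existing_ids = {"F-1234", "F-0007", "R-4201"}
print(new_id("F", existing_ids))
```

&lt;p&gt;Passing a seeded &lt;code&gt;random.Random&lt;/code&gt; instance as &lt;code&gt;rng&lt;/code&gt; makes the function deterministic for tests.&lt;/p&gt;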

&lt;h2&gt;
  
  
  Why this works better than CLAUDE.md alone
&lt;/h2&gt;

&lt;p&gt;You could put all of this in your project's &lt;code&gt;CLAUDE.md&lt;/code&gt;. I actually did that first. The problem is context pollution — when Claude is editing a React component, it doesn't need to know about SPECLAN's ID scheme. And when it's editing a spec file, it doesn't need your TypeScript lint rules.&lt;/p&gt;

&lt;p&gt;Path-scoped rules solve this cleanly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Focused context&lt;/strong&gt; — rules only activate when relevant&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No noise&lt;/strong&gt; — Claude's context window isn't cluttered with irrelevant conventions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Composable&lt;/strong&gt; — you can have multiple rules files for different parts of your project&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The rules file acts like a domain expert sitting next to Claude, whispering "that's a specification file, here's how they work" exactly when it matters.&lt;/p&gt;

&lt;h2&gt;
  
  
  Designing a good rules file
&lt;/h2&gt;

&lt;p&gt;After iterating on this for a few months, here's what I've found works:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Be declarative, not procedural.&lt;/strong&gt; Don't write step-by-step instructions. Describe the structure, the constraints, the invariants. Claude is good at applying constraints if you state them clearly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mark hard boundaries.&lt;/strong&gt; I use &lt;code&gt;(NON-NEGOTIABLE)&lt;/code&gt; for rules that must never be violated — like the ID format. Claude respects this surprisingly well.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Include the "why" for non-obvious rules.&lt;/strong&gt; "IDs are random, not sequential" sticks better when the reason is stated: collision avoidance across branches and contributors. "Files tell WHAT not HOW" needs no explanation but &lt;em&gt;does&lt;/em&gt; need emphasis.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Keep it under 100 lines.&lt;/strong&gt; This is context that gets injected into every relevant interaction. If your rules file is 500 lines, you're eating into the context window Claude needs for actual work. Compress ruthlessly. Tables over prose. ASCII trees over paragraphs. The official docs recommend targeting under 200 lines for any CLAUDE.md file — for path-scoped rules, I'd argue even tighter is better.&lt;/p&gt;

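&lt;p&gt;&lt;strong&gt;Quote your glob patterns.&lt;/strong&gt; This is a gotcha that'll bite you: YAML treats &lt;code&gt;*&lt;/code&gt; and &lt;code&gt;{&lt;/code&gt; as reserved indicators (a leading &lt;code&gt;*&lt;/code&gt; starts an alias, a leading &lt;code&gt;{&lt;/code&gt; starts a flow mapping). Always quote your patterns: &lt;code&gt;"**/*.ts"&lt;/code&gt;, not &lt;code&gt;**/*.ts&lt;/code&gt;. Unquoted, the pattern may not be read as the string you intended, and the rule can silently fail to apply.&lt;/p&gt;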

&lt;p&gt;&lt;strong&gt;Use brace expansion for related types.&lt;/strong&gt; Instead of listing patterns separately, combine them: &lt;code&gt;"src/**/*.{ts,tsx}"&lt;/code&gt; matches both TypeScript and TSX files in one pattern. Same works for directories: &lt;code&gt;"{src,lib}/**/*.ts"&lt;/code&gt;.&lt;/p&gt;
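
&lt;p&gt;To make brace expansion concrete, here is a minimal sketch that lists the plain globs a braced pattern stands for. &lt;code&gt;expandBraces&lt;/code&gt; is a hypothetical helper written for this illustration; it is not how Claude Code matches paths internally, and it ignores nested braces.&lt;/p&gt;

```typescript
// Hypothetical helper for illustration only (not part of Claude Code).
// Expands {a,b} groups so you can see which plain globs a pattern covers.
// Assumption: brace groups are not nested.
function expandBraces(pattern: string): string[] {
  const open = pattern.indexOf("{");
  const close = pattern.indexOf("}", open + 1);
  if (open === -1 || close === -1) {
    return [pattern]; // nothing left to expand
  }
  const before = pattern.slice(0, open);
  const options = pattern.slice(open + 1, close).split(",");
  const after = pattern.slice(close + 1);
  const results: string[] = [];
  for (const option of options) {
    // Recurse so patterns with several brace groups expand fully.
    for (const expanded of expandBraces(before + option + after)) {
      results.push(expanded);
    }
  }
  return results;
}

// Covers src/**/*.ts and src/**/*.tsx
console.log(expandBraces("src/**/*.{ts,tsx}"));
// Covers src/**/*.ts and lib/**/*.ts
console.log(expandBraces("{src,lib}/**/*.ts"));
```

&lt;p&gt;If the expanded list contains a path you did not mean to cover, the braced pattern is too broad for that rules file.&lt;/p&gt;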

&lt;p&gt;&lt;strong&gt;Test it by asking Claude to create something.&lt;/strong&gt; After writing the rules file, ask Claude to "create a new requirement for feature F-1234." If it gets the file path, ID format, frontmatter, and directory nesting right on the first try — your rules file works.&lt;/p&gt;

&lt;h2&gt;
  
  
  Beyond project directories: glob patterns for other domains
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;speclan/**/*.md&lt;/code&gt; pattern is one application. The same mechanism works for any file pattern where Claude needs domain context. Here's what I use across my NX monorepo:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Test files (&lt;code&gt;**/*.spec.ts&lt;/code&gt;)&lt;/strong&gt;: inject your testing conventions: which frameworks, which patterns, how to mock, what not to test. I have rules for Jest vs Mocha conventions since my project uses both (libraries vs VS Code extension).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Webview files (&lt;code&gt;**/webview/**&lt;/code&gt;)&lt;/strong&gt;: inject your browser-context constraints: no Node APIs, specific CSS framework rules, message-passing protocols between webview and extension host.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Infrastructure files (&lt;code&gt;cdk/**/*.ts&lt;/code&gt;)&lt;/strong&gt;: inject your CDK conventions, naming standards, tagging policies, security guardrails. Claude loves to create overly permissive IAM roles unless you tell it not to.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Security-sensitive code (&lt;code&gt;src/auth/**/*&lt;/code&gt;, &lt;code&gt;src/payments/**/*&lt;/code&gt;)&lt;/strong&gt;: guardrails for sensitive areas: never log tokens, always parameterize queries, validate all inputs at function boundaries. These rules are especially valuable because the cost of Claude getting them wrong is high.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Database migrations (&lt;code&gt;prisma/migrations/**/*&lt;/code&gt;)&lt;/strong&gt;: safety rules: always include rollback instructions, never delete columns in the same migration that removes the code using them, add columns as nullable first.&lt;/p&gt;

&lt;p&gt;The pattern is always the same: you have files where the semantics aren't obvious from the syntax, and you need Claude to understand the domain rules before touching them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tips from the trenches
&lt;/h2&gt;

&lt;p&gt;A few more things I've learned from running 12+ rules files across a monorepo:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;One concern per file.&lt;/strong&gt; A &lt;code&gt;testing.md&lt;/code&gt; shouldn't also contain API design guidelines. Separation of concerns applies to instructions just as much as code. Descriptive filenames like &lt;code&gt;api-validation.md&lt;/code&gt; beat &lt;code&gt;rules1.md&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Subdirectories work.&lt;/strong&gt; All &lt;code&gt;.md&lt;/code&gt; files are discovered recursively, so you can organize rules into &lt;code&gt;frontend/&lt;/code&gt;, &lt;code&gt;backend/&lt;/code&gt;, &lt;code&gt;infra/&lt;/code&gt; subdirectories. No configuration needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Symlinks for shared rules.&lt;/strong&gt; If you maintain coding standards across multiple projects, symlink a shared rules directory: &lt;code&gt;ln -s ~/company-standards .claude/rules/shared&lt;/code&gt;. Circular symlinks are handled gracefully.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;User-level rules for personal preferences.&lt;/strong&gt; Put rules in &lt;code&gt;~/.claude/rules/&lt;/code&gt; for things that apply to everything you work on — your preferred commit message format, your testing style, your debugging workflow. These load before project rules, so project rules can override them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Don't duplicate between CLAUDE.md and rules.&lt;/strong&gt; If a convention is path-specific, put it in &lt;code&gt;.claude/rules/&lt;/code&gt; with a &lt;code&gt;paths&lt;/code&gt; field. If it's truly universal (build commands, project architecture), keep it in &lt;code&gt;CLAUDE.md&lt;/code&gt;. Conflicting instructions across files get resolved arbitrarily — not what you want for your ID scheme.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Check what's loaded with &lt;code&gt;/memory&lt;/code&gt;.&lt;/strong&gt; When something isn't being respected, run &lt;code&gt;/memory&lt;/code&gt; to see which rules files Claude actually has in context. If your file isn't listed, the glob pattern isn't matching.&lt;/p&gt;

&lt;h2&gt;
  
  
  The compound effect
&lt;/h2&gt;

&lt;p&gt;One rules file doesn't feel like much. But once you have 3-4 of them covering different parts of your project, Claude starts behaving like a developer who actually read the architecture docs. It stops guessing and starts following your conventions. The number of "no, that's not how we do it" corrections drops dramatically.&lt;/p&gt;

&lt;p&gt;I think of &lt;code&gt;.claude/rules/&lt;/code&gt; files as executable documentation. They serve double duty: they document your conventions for human readers &lt;em&gt;and&lt;/em&gt; they enforce those conventions when AI touches the code. That's a pretty good return on 50-100 lines of Markdown.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I'm the creator of &lt;a href="https://speclan.net" rel="noopener noreferrer"&gt;SPECLAN&lt;/a&gt;, a VS Code extension for managing layered specifications as Markdown files in Git. The path-scoped rules described here are how I keep Claude Code aligned with SPECLAN's file structure conventions — but the technique works for any project with well-defined directory semantics.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>claudecode</category>
      <category>vscode</category>
      <category>productivity</category>
      <category>ai</category>
    </item>
    <item>
      <title>I built a spec management extension with a WYSIWYG Markdown editor in a VS Code webview — lessons learned</title>
      <dc:creator>Thomas Landgraf</dc:creator>
      <pubDate>Tue, 03 Mar 2026 20:18:28 +0000</pubDate>
      <link>https://dev.to/thlandgraf/i-built-a-spec-management-extension-with-a-wysiwyg-markdown-editor-in-a-vs-code-webview-lessons-h5d</link>
      <guid>https://dev.to/thlandgraf/i-built-a-spec-management-extension-with-a-wysiwyg-markdown-editor-in-a-vs-code-webview-lessons-h5d</guid>
      <description>&lt;p&gt;I've been building a VS Code extension for spec management over the past 3 months (full disclosure: I'm the creator, it's called SPECLAN — free side project). The idea is that specifications need the same structure we give source code: hierarchy, types, lifecycle tracking. So the extension organizes specs as a tree of Markdown files with YAML frontmatter — goals break down into features, features into sub-features, sub-features into requirements. Each file has a status lifecycle (draft → review → approved → in-development → released) so you always know what's specced, what's being built, and what needs to change.&lt;/p&gt;

&lt;p&gt;The interesting VS Code challenge: &lt;strong&gt;making this usable for non-technical people.&lt;/strong&gt; Product managers and business analysts define what to build, but they won't write raw Markdown with YAML frontmatter. So I needed a WYSIWYG editor inside a webview that round-trips cleanly to Markdown — same file in Git, two editing experiences.&lt;/p&gt;

&lt;p&gt;That editor ate about 40% of total development effort. Here's what I learned.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The stack:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Quill 2.x&lt;/strong&gt; in a VS Code webview (rich text editing)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;remark + remark-gfm&lt;/strong&gt; for Markdown → HTML on load&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;turndown + turndown-plugin-gfm&lt;/strong&gt; for HTML → Markdown on save&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;gray-matter&lt;/strong&gt; for YAML frontmatter — strips on load, reattaches on save&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;quill-table-up&lt;/strong&gt; for GFM tables (Quill has no native table support)&lt;/li&gt;
&lt;/ul&gt;
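
&lt;p&gt;The "strips on load, reattaches on save" step can be sketched without any library. The code below is a hand-rolled stand-in for what gray-matter does in the real pipeline, not its API; &lt;code&gt;splitFrontmatter&lt;/code&gt; and &lt;code&gt;joinFrontmatter&lt;/code&gt; are hypothetical names, and the sketch assumes Unix line endings and a &lt;code&gt;---&lt;/code&gt; fence on the very first line.&lt;/p&gt;

```typescript
// Hand-rolled stand-in for the frontmatter step; helper names are hypothetical.
// Assumptions: "\n" line endings, frontmatter fenced by "---" lines at the top.
type SplitFile = { frontmatter: string; body: string };

const OPEN_FENCE = "---\n";
const CLOSE_FENCE = "\n---\n";

function splitFrontmatter(file: string): SplitFile {
  if (file.slice(0, OPEN_FENCE.length) !== OPEN_FENCE) {
    return { frontmatter: "", body: file }; // no frontmatter at all
  }
  // Start at the newline of the opening fence so an empty block is found too.
  const end = file.indexOf(CLOSE_FENCE, OPEN_FENCE.length - 1);
  if (end === -1) {
    return { frontmatter: "", body: file }; // unterminated fence: treat as body
  }
  return {
    frontmatter: file.slice(OPEN_FENCE.length, end + 1), // raw YAML, untouched
    body: file.slice(end + CLOSE_FENCE.length), // what the editor gets to see
  };
}

function joinFrontmatter(parts: SplitFile): string {
  // Note: an empty frontmatter block is dropped rather than re-emitted.
  if (parts.frontmatter === "") {
    return parts.body;
  }
  return OPEN_FENCE + parts.frontmatter + "---\n" + parts.body;
}

const specFile = "---\nid: F-1234\nstatus: draft\n---\n# Login\n";
const parts = splitFrontmatter(specFile);
const saved = joinFrontmatter({ frontmatter: parts.frontmatter, body: parts.body + "\nNew paragraph.\n" });
console.log(saved);
```

&lt;p&gt;Keeping the YAML as an opaque string while the body is edited means the editor cannot reorder or reformat fields by accident. The trade-off is that any value mirrored in both places, such as a title, still has to be reconciled separately.&lt;/p&gt;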

&lt;p&gt;&lt;strong&gt;What actually hurt:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Round-trip fidelity.&lt;/strong&gt; The pipeline is Markdown → HTML → Quill Delta → HTML → Markdown. Every step is lossy. Links, emphasis, nested lists — they all drift across conversions. I spent weeks writing custom turndown rules to keep Markdown output stable. If you're building something similar: start with the save pipeline, not the editor. The round-trip is the constraint that shapes everything.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Frontmatter is invisible but critical.&lt;/strong&gt; Each spec file has 10+ YAML fields — status, entity ID, parent references, timestamps. The editor only sees the Markdown body, but the file is meaningless without its frontmatter. gray-matter handles parsing, but you need to be careful that editor changes don't conflict with frontmatter values (e.g., someone editing a title in the body that's also in the YAML).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Tables.&lt;/strong&gt; Quill doesn't do tables. quill-table-up adds them, but serializing table HTML through turndown into GFM pipe tables has edge cases everywhere — empty cells, inline formatting in cells, nested content.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Webview communication.&lt;/strong&gt; Everything between the editor (iframe) and the extension host is a &lt;code&gt;postMessage&lt;/code&gt; call: load, save, dirty state, undo, external file change detection. I ended up building a structured message protocol with typed handlers on both sides. &lt;code&gt;console.log&lt;/code&gt; in the webview only shows up in the separate Webview Developer Tools, not alongside the extension's own logs, so I added a logging bridge that routes webview logs to the extension's output channel.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Custom editor API.&lt;/strong&gt; Using &lt;code&gt;CustomTextEditorProvider&lt;/code&gt; means the document model is VS Code's &lt;code&gt;TextDocument&lt;/code&gt; but the visual state is Quill's Delta. Keeping these in sync — especially during concurrent edits or Git operations that change the file underneath — required careful event sequencing.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
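
&lt;p&gt;A structured message protocol like the one described under "Webview communication" could look roughly like this. The message names and fields are made up for illustration and are not SPECLAN's actual protocol. The &lt;code&gt;sink&lt;/code&gt; array stands in for the extension's output channel so the sketch runs outside VS Code.&lt;/p&gt;

```typescript
// Illustrative message shapes; the names are assumptions, not SPECLAN's protocol.
type WebviewMessage =
  | { type: "save"; markdown: string }
  | { type: "dirty"; isDirty: boolean }
  | { type: "log"; level: "info" | "error"; text: string };

// Extension-host side: one typed entry point for everything the webview sends.
// `sink` stands in for an output channel so this runs without VS Code.
function handleMessage(msg: WebviewMessage, sink: string[]): void {
  switch (msg.type) {
    case "save":
      sink.push("save requested, " + msg.markdown.length + " chars");
      break;
    case "dirty":
      sink.push("dirty:" + String(msg.isDirty));
      break;
    case "log":
      // The logging bridge: webview logs land next to the extension's own logs.
      sink.push("[webview " + msg.level + "] " + msg.text);
      break;
    default: {
      // Compile-time exhaustiveness check: adding a message type without
      // handling it makes this assignment a type error.
      const unreachable: never = msg;
      throw new Error("Unhandled message: " + JSON.stringify(unreachable));
    }
  }
}

const output: string[] = [];
handleMessage({ type: "log", level: "error", text: "Quill failed to mount" }, output);
handleMessage({ type: "dirty", isDirty: true }, output);
console.log(output.join("\n"));
```

&lt;p&gt;In an extension, a function like this would be registered through &lt;code&gt;webview.onDidReceiveMessage&lt;/code&gt;, with the sink replaced by an &lt;code&gt;OutputChannel&lt;/code&gt; and its &lt;code&gt;appendLine&lt;/code&gt; method. The webview side would send the same shapes through the object returned by &lt;code&gt;acquireVsCodeApi()&lt;/code&gt;.&lt;/p&gt;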

&lt;p&gt;&lt;strong&gt;What worked well:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The file-system-as-data-model approach.&lt;/strong&gt; Directories ARE the spec hierarchy. &lt;code&gt;speclan/features/F-1234-auth/requirements/R-5678-login/R-5678-login.md&lt;/code&gt; — any tool (or AI agent) can understand the structure by reading the file system. No database, no server.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Snapshot testing the conversion pipeline.&lt;/strong&gt; Take a Markdown file, push it through the full round-trip, diff the output. Catches regressions fast.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tree views for navigation.&lt;/strong&gt; VS Code's TreeDataProvider is excellent. The spec tree (goals → features → requirements) renders as a native sidebar with status icons, drag-and-drop reordering, and context menus. Much less effort than the WYSIWYG editor.&lt;/li&gt;
&lt;/ul&gt;
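
&lt;p&gt;The round-trip check from the snapshot-testing point can be sketched as a small function that reports which lines drifted. The two converters are passed in as parameters. The &lt;code&gt;fakeToHtml&lt;/code&gt; and &lt;code&gt;fakeToMarkdown&lt;/code&gt; stand-ins below are deliberately trivial and only simulate one kind of drift; in the real pipeline they would wrap remark and turndown.&lt;/p&gt;

```typescript
type Converter = (input: string) => string;

// Pushes Markdown through both converters and lists every line that changed.
// An empty result means the round trip was stable for this input.
function roundTripDrift(markdown: string, toHtml: Converter, toMarkdown: Converter): string[] {
  const before = markdown.split("\n");
  const after = toMarkdown(toHtml(markdown)).split("\n");
  const lineCount = Math.max(before.length, after.length);
  const drift: string[] = [];
  for (let i = 0; i !== lineCount; i++) {
    const original = before[i] ?? ""; // a missing line counts as empty
    const result = after[i] ?? "";
    if (original !== result) {
      drift.push("line " + (i + 1) + ": " + JSON.stringify(original) + " became " + JSON.stringify(result));
    }
  }
  return drift;
}

// Stand-in converters (assumptions for the demo, not the real pipeline):
// the "save" side rewrites __strong__ as **strong**, a typical silent drift.
const fakeToHtml: Converter = (md) => md;
const fakeToMarkdown: Converter = (html) => html.replace(/__/g, "**");

const sample = "# Login\n\nUsers __must__ sign in.\n";
console.log(roundTripDrift(sample, fakeToHtml, fakeToMarkdown));
// Reports one drifted line: line 3, where __must__ became **must**
```

&lt;p&gt;Run over a folder of representative spec files, a check like this turns "the output looks about right" into a list of exact lines to write custom conversion rules for.&lt;/p&gt;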

&lt;p&gt;Happy to answer questions about webviews, the conversion pipeline, or the spec structure approach.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://marketplace.visualstudio.com/items?itemName=DigitalDividend.speclan-vscode-extension" rel="noopener noreferrer"&gt;Marketplace&lt;/a&gt; | &lt;a href="https://speclan.net" rel="noopener noreferrer"&gt;speclan.net&lt;/a&gt; | &lt;a href="https://github.com/thlandgraf/speclan-essentials" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;/p&gt;

</description>
      <category>vscode</category>
      <category>typescript</category>
      <category>markdown</category>
      <category>webdev</category>
    </item>
  </channel>
</rss>
