I Gave 13 LLMs the Same Codebase and Asked for a Specification. Six Ran on My Laptop.

#ai #vscode #programming #markdown

There are benchmarks for code an LLM writes. HumanEval, MBPP, SWE-Bench, LiveCodeBench. There are no benchmarks for the specifications an LLM writes. The upstream half of agentic software delivery has been flying blind — and the spec is what your downstream coding agent has to interpret.

I went looking for one and there isn't one. So I propose one, and to demonstrate it I gave thirteen LLMs the same real codebase (excalidraw) and asked each of them to produce a specification tree. Six of those thirteen ran locally on a laptop - via LM Studio and Ollama - and one of them landed within 12% of the frontier-cloud baseline. Then I made Claude Opus walk through every other model's output and judge it.

The numbers surprised me. So did how well the local half held up.

The metric: driftless implementability

A spec compiles to nothing. It is reviewed by the customer, the PM, the QA lead — not by a compiler. A bad function fails its test. A bad spec fails a meeting.

The question that matters to anyone shipping software with AI agents downstream is: hand the spec — and only the spec — to Claude Code, Google Antigravity, or Codex. Does the result match what the spec described? If yes, the spec was good. If the agent had to guess, invent, or ask, the spec was lossy. Drift is the cost. Driftlessness is the goal.

That third claim is the whole reason this experiment can exist. It re-frames "which LLM should I use to write my specs" as a question with an answer, not a vibe.

The setup

Thirteen LLMs. One brief. One codebase.

Cloud, frontier: Claude Opus 4.7, Sonnet 4.6, Haiku 4.5; GPT 5.4, GPT 5.4 Mini; Gemini 3.1 Pro and Flash previews
Local, open-weights: Qwen 3.6 35B A3B (LM Studio), Gemma 4 26B A4B (LM Studio, MoE), Gemma 4 31B (Ollama, dense), Gemma 4 8B (Ollama), GPT-OSS 20B (open weights), Nemotron 3 Nano (open weights)

Six of the thirteen never touched a cloud endpoint. The whole local cohort was a deliberate test of the question every privacy-constrained team has been asking since the open-weights wave: are on-laptop weights good enough for real agentic spec work in 2026?

I'm the creator of SPECLAN (full disclosure), a VS Code extension for spec-driven development. The pipeline that produced these trees is SPECLAN's Infer Specs from Code agent — it walks a codebase via MCP tools, decides what's a feature, writes the requirement, and stops. Same agent across all thirteen runs. Only the model changes.

Each output is a hierarchy of Markdown files: Goal → Feature → Requirement, every entity in its own file with YAML frontmatter (id, parent, status). You can walk all thirteen trees side-by-side on the speclan.net/compare gallery.

What the numbers say

The most defensible single metric is requirement count. Not because more is better — a spec with 200 noisy requirements is worse than one with 80 clean ones — but because it tells you whether the model committed. A model that wrote 12 requirements for excalidraw missed almost all of it. A model that wrote 200 saw the codebase.

The reference baseline: Claude Opus 4.7 produced 5 goals, 16 top-level features, 43 features, 197 requirements. That's the frontier-model bar.

The first surprise: Claude Haiku 4.5 produced more: 5 goals, 14 top, 45 features, 203 requirements. From a smaller and cheaper model. Haiku earned it by splitting Opus's requirements into smaller pieces — not by inventing things Opus missed, but by carving the same surface finer. The right read: model-family scaling sometimes trades resolution for terseness, not insight for capacity.

The second surprise: Qwen 3.6 35B A3B, running locally in LM Studio on an Apple M4 Max, produced 4 goals, 23 top, 49 features, 174 requirements - within 12% of Opus, no tokens leaving the machine. It was the strongest of the six local runs and the one that finally answers the on-laptop question affirmatively for me. The local LLM crowd has been right about something: open-weights MoE models in the 30-40B range crossed the threshold for real agentic work in 2026, and Qwen 3.6 35B A3B is the cleanest example I've benchmarked.

The local cohort split into three tiers worth naming. The 35B-class MoE (Qwen 3.6) was indistinguishable from frontier-cloud output at the structural layer. The 30B-class dense (Gemma 4 31B on Ollama) was the honest workhorse - 60 requirements across 5 top-level features, 3.3x sparser than Opus but covering the same core ground, credible as a privacy-safe substitute when cloud is off-limits. The smaller end (Gemma 4 8B, GPT-OSS 20B, Nemotron 3 Nano, and the Gemma 4 26B A4B MoE which terminated before the orientation pass) produced coherent feature trees but lost primitives - goals, vision, mission, or in Gemma 4 26B's case the literal goal body text.

The third surprise was a failure mode I'd never have predicted: Gemma 4 26B A4B (the MoE variant, also running locally) left the literal placeholder string "Goal description goes here." inside the body of G-093 "Intuitive User Experience & Customization." Real model output. Real shipped file. The smaller open-weights models often look like they're working - the tree fills in, the IDs validate, the structure passes - and then a goal body is template text that nobody asked for. This is exactly the failure shape that would silently slip past a junior reviewer and break implementation downstream.

The fourth surprise: GPT-OSS 20B (OpenAI's open weights) produced 0 goals — five top-level features, 16 features, 17 requirements — but no GOALS layer at all. Same hierarchy primitives, missing the highest level entirely. Coherent feature tree, no rationale layer above it. The kind of structural omission that's invisible in a single file but obvious the moment you open the tree.

The fifth surprise: Gemini 3.1 Pro wrote "User Identity and Access" features for a drawing tool with no accounts. Excalidraw is local-first, anonymous, no auth. The Gemini Pro spec invented an account-billing system that doesn't exist in the codebase. Pattern-match hallucination at the architecture layer — the model recognized "web app" and reached for the canonical web-app feature set, regardless of whether the actual code supported any of it.

Opus judges the rest

This is the part I didn't plan and ended up promoting to its own scene in the video.

After all 13 trees were generated, I had Claude Opus walk every other candidate's output and add a JUDGEMENT subsection at the top of each spec tree. Strengths. Weaknesses. Concrete drift risks. Opus on Haiku: positive — "captures the same surface at finer resolution." Opus on Gemma 4 26B: "placeholder text in G-093 body indicates incomplete generation; do not use as implementation input." Opus on GPT-OSS 20B: "feature tree is coherent but absence of goals layer means no traceability anchor for downstream agents." Opus on its own output: Opus declined to judge itself, which was the right call.

I find this beat the strongest single argument the video makes — not because Opus is the right judge of every spec, but because it demonstrates the thesis concretely. The downstream agent has to interpret the spec. If the most capable downstream agent we have access to today can name the drift risks in another model's spec, that's the same signal you'd get if you let it try to implement and watched it fumble. Faster to ask the question directly.

The local angle, unpacked

Six of the thirteen runs were local, mixed across LM Studio and Ollama as the runtime. The Qwen 3.6 35B A3B run was the strongest of the six and the one I expect most readers to care about, but the on-camera generation in the video is also the local one - so the privacy claim is visible, not just claimed.

LM Studio loaded Qwen 3.6 35B A3B at Q4_K_M with a 262K context window - comfortably above the ~50K floor SPECLAN's agents want for spec-tree generation. Tool use marked Supported. Architecture qwen35moe (Mixture of Experts). Same OpenAI-compatible /v1 endpoint surface as the frontier providers. Throughput on the M4 Max sat around 80 tok/sec.

SPECLAN's Local LLM (Experimental) provider in LLM Configuration accepts any OpenAI-compatible base URL. Switch the active provider from Anthropic to Local LLM, pick the model from LM Studio's (or Ollama's) loaded catalogue, click Apply. The same Infer Specs from Code wizard runs unchanged. The fact that the runtime is local is invisible to the agent code - it's just a different endpoint. Ollama serves the same /v1 shape, which is why the Gemma 4 31B and 8B runs slot in alongside the LM Studio ones without any agent-code change.

The macOS GPU monitor pinned at top-right of the video shows the M4 Max's GPU utilization bars churning the whole hour-long generation. The privacy claim - no tokens left the machine - is on screen, not just narrated. For privacy-constrained teams whose codebase can't leave the laptop, the structural finding from this benchmark is that you have a real choice in 2026: Qwen 3.6 35B A3B if you want output that's structurally indistinguishable from frontier-cloud, Gemma 4 31B (dense) if you want a slower-but-thoroughly-reliable workhorse, the smaller models if your tradeoff is hardware constraint over output quality.

The caveat: Qwen 3.6 35B A3B ran on the OpenAI SDK path (because LM Studio ships an OpenAI-compatible API; everyone does). The Anthropic SDK adds default scaffolding - a persistent Todo-List, a planner, a scratchpad - that the OpenAI SDK doesn't. Some of the requirement-count gap between Claude band (196-203 across Opus/Sonnet/Haiku) and the OpenAI-SDK candidates is SDK, not model. Qwen 3.6 35B A3B at 174 reqs is the OpenAI-SDK band outlier upward, which is the genuinely interesting signal.

A word on Qwen 3.6 35B A3B specifically

I want to be explicit about how remarkable this result is. The benchmark says: open-weights model, MoE architecture, running on a single Apple M4 Max laptop, quantized to Q4_K_M, talking to a SPECLAN agent through LM Studio's OpenAI-compatible endpoint, with no SDK-level scaffolding helping it - produced 174 requirements across 4 goals, 23 top-level features, 49 features on a 13K-file TypeScript monorepo. That output is structurally close enough to Claude Opus 4.7 (197 requirements, 16 top-level features) that walking the two spec trees side-by-side on the /compare page, you have to look at the model labels to tell them apart at a glance. Qwen's tree actually decomposes more aggressively at the top level (23 vs Opus's 16) - the model carved the canvas surface into finer top-level buckets than the frontier baseline did.

The tool-call reliability is the part that genuinely changed my mental model. Smaller-than-frontier models historically fail under multi-turn structured-output workloads - they produce coherent prose but the JSON-schema adherence falls apart by turn 8 or 10, and the agent's create_feature / create_requirement MCP calls start coming back malformed. Qwen 3.6 35B A3B held adherence across the full ~50-minute generation run with one self-correction (a delete_feature it issued after a misread on its own prior create_feature). One. On a multi-hundred-tool-call run. That's the kind of behavior I would have expected from a frontier model 18 months ago and not from open-weights weights running on a laptop.

If you're picking a single open-weights model to point at agentic spec work today, this is the one. It's the best on-laptop result I've seen, period - and it ran without the SDK scaffolding that the Claude-band candidates lean on. The privacy story finally has an output-quality story to match it.

What I'd tell you to do with this

If you're choosing a model to write the spec your downstream coding agent has to implement: read the /compare gallery, pick the two or three model families that are realistic for your budget and privacy posture, and walk their trees side-by-side. Don't average across model families — Gemini 3.1 Pro produced an architecturally different spec from Claude Opus, not a worse one or a better one. Different.

If you're privacy-constrained: Qwen 3.6 35B A3B in LM Studio on a 24-32GB unified-memory Mac is the current best on-laptop choice for agentic spec work. Throughput on M4 Max sat around 80 tok/sec; context held up well under multi-turn tool use; tool-call reliability surprised me more than the speed did. The 8B-class open-weights models are not there yet — they generate coherent feature trees but lose primitives (goals, status fields, or in Gemma 4 26B's case, the actual goal body).

If you're an SDD methodology nerd: the driftless-implementability framing generalizes beyond SPECLAN. You can apply it to any spec format — the test is "hand it to your downstream coding agent and watch what it does." Run the test on your own specs before you ship them.

The full 13-model walk-through is in the video at the top of this post. The interactive side-by-side viewer with all thirteen trees is at speclan.net/compare — every tree linkable, every requirement reachable, every JUDGEMENT subsection expanded. SPECLAN's Local LLM provider and the 13-model comparison blog post are the deeper-dive companions.

The spec is the upstream half of agentic delivery. It's been flying blind. Driftless implementability is one way to make it visible.