Thomas Landgraf
The SDK You Pick Matters More Than the Model — A 13-LLM Benchmark on the Same Agentic Task

If you have ever built an agent that walks a codebase, calls tools, and writes structured output, you have hit the same wall I kept hitting: the same model produces wildly different results on the same task depending on what harness you wrap it in. Swap Claude for GPT behind a single OPENAI_BASE_URL and you lose half your output quality. Everyone blames the model. The model is rarely the variable.

I ran an experiment to put a number on it: thirteen LLMs — Claude Opus 4.7, Sonnet 4.6, Haiku 4.5, GPT 5.4, GPT 5.4 Mini, two Gemini 3.1 previews, and six open-weights locals (Qwen 3.6 35B A3B, Gemma 4 at three sizes, GPT-OSS 20B, Nemotron 3 Nano) — on the same real agentic task. Same codebase (excalidraw), same MCP tools, same system prompt. Only the model changes. The output is a specification tree: goal → feature → requirement hierarchies of Markdown files.

What if the SDK is doing more of the work than anyone admits?

Every provider ships an SDK. Most teams assume the SDK is a thin wire-protocol wrapper. It usually isn't. Here's what the Anthropic SDK ships with by default, alongside the MCP tools I expose:

  • A persistent Todo-List the model reads from and writes to across turns.
  • A planner for multi-step reasoning that doesn't burn the main conversation budget.
  • A scratchpad for cross-turn notes that never reach the final output.

Here's what the OpenAI SDK ships with by default when you give it MCP tools: the MCP tools. Nothing else.
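
To make that concrete, here is roughly what "bring your own" means on the OpenAI path. This is a minimal sketch, not SPECLAN's actual code: the tool names (todo_list_read, todo_list_write) mirror the Anthropic-side primitives described above but are my own, and the declarations use the standard Chat Completions function-tool format. You still have to persist the list and handle the calls yourself.

```typescript
// Hypothetical todo-list tools you have to declare yourself on the OpenAI path.
// The Anthropic SDK ships an equivalent out of the box; here it is just schema,
// and storage plus call handling are still on you.
export const scaffoldingTools = [
  {
    type: "function" as const,
    function: {
      name: "todo_list_read",
      description: "Return the persistent todo list: done items and the next undone item.",
      parameters: { type: "object", properties: {}, required: [] },
    },
  },
  {
    type: "function" as const,
    function: {
      name: "todo_list_write",
      description: "Add items to the persistent todo list or mark an existing item done.",
      parameters: {
        type: "object",
        properties: {
          add: { type: "array", items: { type: "string" } },
          mark_done: { type: "string" },
        },
        required: [],
      },
    },
  },
];

// Passed alongside the MCP tools on every request, e.g.:
// const tools = [...mcpTools, ...scaffoldingTools];
```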

Full disclosure: I'm the creator of SPECLAN, a VS Code extension for spec-driven development, and the pipeline in this benchmark is one of SPECLAN's agents. But the lesson generalizes to any multi-provider agent harness — and the numbers are genuinely jarring.

The band gap

Requirements produced on the same codebase, same prompt:

| Model | SDK | Requirements |
| --- | --- | --- |
| Claude Opus 4.7 (1M) | Anthropic | 197 |
| Claude Sonnet 4.6 | Anthropic | 196 |
| Claude Haiku 4.5 | Anthropic | 203 |
| GPT 5.4 | OpenAI | 43 |
| GPT 5.4 Mini | OpenAI | 60 |
| Gemini 3.1 Pro preview | OpenAI-compat | 17 |
| Gemini 3.1 Flash preview | OpenAI-compat | 13 |
| Qwen 3.6 35B A3B (local) | OpenAI-compat | 174 |
| Gemma 4 31B dense (local) | OpenAI-compat | 60 |
| Gemma 4 8B (local) | OpenAI-compat | 21 |
| GPT-OSS 20B (local) | OpenAI-compat | 17 |
| Nemotron 3 Nano (local) | OpenAI-compat | 12 |

Two things jump out immediately.

The three Claude models cluster at 196–203 regardless of size. Opus is several times the size of Haiku. If model size were driving volume, you would see variance. You don't. That flatness is the scaffolding floor — the shape of what the Anthropic agent loop produces on this benchmark, not the ceiling of what Opus can do.

Every model on the OpenAI-SDK or OpenAI-compatible path except one sits at 12–60. An order of magnitude below the Anthropic band. Different vendors (OpenAI, Google, Meta-derived, Chinese open-weights), different sizes (8B to 120B+), same roughly-converged output volume. That convergence is what you would expect if the binding constraint were the harness, not the model.

Why spec authoring exposes scaffolding so brutally

Here's the technical meat. Spec authoring is fundamentally a list-management problem: enumerate the features the code implements, write a requirement, cross it off, move to the next. A human technical writer does this with a notepad.

Without a Todo-List, an LLM has to re-derive from conversation history every turn: did I already write requirements for Shape Drawing Tools? Let me check the last 12 turns… yes I did. Element Organization? Let me check… no, that's next. Every single turn, this bookkeeping consumes context window and decision budget that could have gone into writing the actual requirement.

With a persistent Todo-List, the model does one tiny tool call (todo_list_read), sees next undone: Element Organization, and gets to work. It's doing a fundamentally easier version of the task. That's why you get 197 requirements from Opus and 43 from GPT 5.4 on the same brief — the first model was given a list-management abstraction, the second had to reinvent it in every turn.
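
For illustration, here is a minimal sketch of the harness-side state behind those calls (hypothetical names, not SPECLAN's implementation). The point is that the list lives outside the conversation, so "what's next?" costs one tiny tool call instead of a re-read of the last dozen turns.

```typescript
// Minimal persistent todo store the harness owns across turns.
interface TodoItem {
  title: string;
  done: boolean;
}

class TodoStore {
  private items: TodoItem[] = [];

  add(titles: string[]): void {
    for (const title of titles) this.items.push({ title, done: false });
  }

  markDone(title: string): void {
    const item = this.items.find((i) => i.title === title);
    if (item) item.done = true;
  }

  // What todo_list_read returns: the whole list plus the next undone item,
  // so the model never reconstructs its progress from conversation history.
  read(): { items: TodoItem[]; next: string | null } {
    const next = this.items.find((i) => !i.done);
    return { items: this.items, next: next ? next.title : null };
  }
}

// In the agent loop, the model's tool calls dispatch to the store:
// store.add(["Shape Drawing Tools", "Element Organization"]);
// store.markDone("Shape Drawing Tools");
// store.read(); // → { next: "Element Organization", ... }
```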

If you've ever wondered why your Claude-via-Anthropic-SDK agent seems to "remember things twelve files ago" and your GPT-via-OpenAI-SDK agent feels like it restarts every turn — this is why. Anthropic's SDK implements memory as a tool. OpenAI's SDK expects you to bring your own.

The one exception — and why it matters

Look at the table again: Qwen 3.6 35B A3B produced 174 requirements on the no-scaffolding OpenAI-SDK path, running locally in LM Studio on a Mac M4 Max with a 50k context window. That is within about 12% of the Anthropic cluster, and it is the one outlier in an otherwise tight 12–60 band.

Our best guess for why: Qwen's training mix is heavy on agentic tool-call trajectories, and that data seems to have internalized some of the bookkeeping the Anthropic SDK externalizes as tools. The model brought its own list-management to the task.

This matters because it proves the gap is closable without the SDK's help — just not by most models. You can think of the benchmark as a 2×2:

|  | Scaffolding in harness | No scaffolding in harness |
| --- | --- | --- |
| Scaffolding-trained model | Opus / Sonnet / Haiku (196–203) | Qwen 3.6 35B A3B (174) |
| Not trained for agentic bookkeeping | — (we don't have data) | GPT 5.4, Gemini, Gemma, GPT-OSS, Nemotron (12–60) |

Three of the four quadrants are populated; the empty one is a model with no agentic-bookkeeping training running on a harness that does provide scaffolding. The more revealing follow-up, though, puts a Claude model into Qwen's quadrant: run Opus on an OpenAI-SDK harness with the Anthropic tools explicitly stripped, so Opus operates on the same MCP-only surface as the others. The delta between that number and Opus's 197 is the SDK's contribution. Whatever's left is the pure-Opus delta. That's the experiment we're shipping next.

The Gemini anomaly — same vendor, two outcomes

One finding from the benchmark lands harder than any single row: Gemma (local open-weights) succeeded at the task; Gemini (frontier cloud preview) failed. Same company. Same pipeline on our end. Same adapter layer (OpenAI-compatibility).

Gemma 4 8B wrote a coherent, on-domain tree — every one of its 21 requirements landed on a real excalidraw feature. Gemini 3.1 Pro preview wrote "Account and Billing Management", "Personalized Analytics Dashboard", and full acceptance criteria for "Subscription tier management" (upgrade, downgrade, cancel). Excalidraw has no accounts and no billing.

Our working hypothesis: our OpenAI-compatibility shim round-trips tool-call payloads in a format Gemma tolerates but Gemini treats differently. When the adapter produces turns it cannot fluently continue, Gemini falls back to training priors — and "enterprise SaaS reference architecture" is over-represented in those priors. Before anyone dismisses Gemini previews as weak at agentic work: a rerun through Google's native GenerateContent API with planning primitives enabled is on the follow-up list.
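
To show what that shim-level surface looks like, here is a sketch of the tool-call round trip in the Chat Completions shape our adapter speaks. The message shapes follow the standard convention; the claim that Gemini handles them more strictly than Gemma is the unverified part of the hypothesis.

```typescript
// Sketch of the tool-call round trip an OpenAI-compatibility shim has to produce.
type ChatMessage =
  | { role: "assistant"; content: string | null; tool_calls?: ToolCall[] }
  | { role: "tool"; tool_call_id: string; content: string }
  | { role: "user" | "system"; content: string };

interface ToolCall {
  id: string;
  type: "function";
  function: { name: string; arguments: string }; // arguments is a JSON *string*
}

// Append an MCP tool result so the next completion can continue the turn.
function appendToolResult(
  history: ChatMessage[],
  call: ToolCall,
  result: unknown,
): ChatMessage[] {
  return [
    ...history,
    // The tool_call_id must match the id the model emitted; some backends
    // reject or silently drop the result if the shapes don't line up exactly.
    { role: "tool", tool_call_id: call.id, content: JSON.stringify(result) },
  ];
}
```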

The generalizable lesson for anyone building multi-provider agents: every harness silently privileges some providers over others. Our harness privileges Anthropic (full SDK integration), accidentally privileges Gemma-like models (MCP-only works for them), and does not fit Gemini 3.1. Your harness will have the same asymmetry in a different shape. The answer to "which model is best for my harness?" is not "whichever has the most parameters" — it's "whichever your harness actually fits."

What to take from this if you are building agents

  1. Audit what your SDK ships by default. If you picked the Anthropic SDK and haven't looked inside, a non-trivial share of your agent's competence is coming from the built-in Todo-List / planner / scratchpad. Switch providers without replacing that layer and you will measure a model-quality drop that is actually a scaffolding drop.
  2. Invest in the scaffolding layer before investing in a bigger model. In our benchmark, scaffolding was worth roughly an order of magnitude of output volume. A bigger model on a thin harness will not close that gap. A smaller model on a thick harness often will. (See the sketch after this list for what that layer can look like.)
  3. Multi-provider support is a harness problem, not a config-flag problem. If you're offering users a choice of providers, you're offering them a choice of how well your harness fits their provider. That's architectural work, not one-line-of-YAML work.
  4. The training mix sometimes bridges the gap for you. Qwen 3.6 35B A3B is the proof. Agentic-tool-call-heavy training data appears to internalize what other models rely on the SDK to externalize. If you're picking a local model for agentic workloads, pick one whose training mix matches that shape.
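
Picking up point 2: here is a sketch of what investing in the scaffolding layer can look like structurally. The harness owns the scaffolding tools and hands the same set to whichever provider adapter is active, so switching providers never silently drops them. All names are illustrative, not an existing SDK's API.

```typescript
// Illustrative harness-level types, not a real library's interface.
interface ToolDef {
  name: string;
  description: string;
  parameters: object; // JSON Schema for the tool's arguments
  handler: (args: unknown) => Promise<unknown>;
}

interface ProviderAdapter {
  id: "anthropic" | "openai" | "openai-compat";
  // Runs one agent turn with the full toolset; wraps the provider SDK underneath.
  runTurn(tools: ToolDef[], transcript: unknown[]): Promise<unknown>;
}

// The scaffolding lives in the harness, not in any one provider's SDK,
// so every adapter sees the same todo-list / planner / scratchpad surface.
function buildToolset(mcpTools: ToolDef[], scaffolding: ToolDef[]): ToolDef[] {
  return [...mcpTools, ...scaffolding];
}

// const tools = buildToolset(mcpTools, [todoListRead, todoListWrite, scratchpadWrite]);
// await adapter.runTurn(tools, transcript);
```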

Try it yourself

All 13 spec trees are browsable side-by-side at speclan.net/compare/ with URL-sharable deep links. Two pairings worth five minutes of your time:

The full canonical post, with the complete 13-row table, every caveat (including the excalidraw-in-training-data one), and the planned follow-up experiments, lives at speclan.net/blog/2026-04-29-model-comparison.


If you've built a multi-provider agent and seen the SDK-layer drop I'm describing — especially if you've measured it — I'd love to see your numbers in the comments. Particularly curious about LangGraph users who added a Todo-List abstraction and measured a lift across providers.
