DEV Community: Devashish

Owning Inference - Qwen3.6 on DGX Spark for real coding

Devashish — Tue, 07 Jul 2026 17:00:00 +0000

If you spend as much time on Hacker News as I do, you've surely noticed an uptick in conversations around local models, skyrocketing inference costs, shifting focus on Qwen (DeepSeek, Mistral and the likes) for coding tasks and a general frustration around navigating the AI-coding-at-scale-without-going-broke problem.

On the other hand, running local models is no small feat by itself. I'm no AI researcher and my (rather naive) assumption when I got my DGX Spark a few months back was that local inference setup would be akin to pulling a Docker image, typing ollama serve and pointing my OpenCode to the endpoint. Once done, I'd have infinite tokens at my disposal and have agents completing work 24/7.

Needless to say, I opened the Pandora's box there and now I'm deep into the rabbit hole of figuring out what a functional local LLM stack looks like.

A recent 1,190-point HN thread on Qwen 3.6 27B provides a deep insight into the state of local models as of writing this post. Yes, we can now run dense models on a local setup but the true value remains to be seen. That's exactly what I've been trying to do. My goal is to get real, continuous and high quality work done by local models. This post covers the progress I've made so far on this journey.

My previous post in this series (Two Qwen3 Models on One DGX Spark) was about setting up a two model stack on the DGX. This post answers the follow-up: does this stack write code that ships? At present I have Qwen/Qwen3.6-27B-FP8 as the sole active model on my DGX Spark, serving reasoning + tools + MTP at 256K context, and shipping work into my open-source (OSS) project - Clawrium.

This post is broken down into two sections. The first section talks about the journey of getting the stack working and second one shows how this stack is put to work on a real project.

The final stack (for the impatient)

Host: DGX Spark GB10 (Grace Blackwell ARM64 (aarch64), sm_121 compute capability), ~119 GiB unified memory
Model: Qwen/Qwen3.6-27B-FP8 at 262,144-token context
Image: nvcr.io/nvidia/vllm:26.06-py3 with one pip pin (see Wall 3)
Proxy: LiteLLM v1.88.1 at :4000 with drop_params: true
Runner: Model Runner V1 (V2 breaks — see Wall 2)
Flags: --enable-prefix-caching --enforce-eager --enable-auto-tool-choice --tool-call-parser qwen3_xml --reasoning-parser qwen3 --speculative-config '{"method":"mtp","num_speculative_tokens":1}'
Clients: pi.dev (editor) and Hermes agents

Now the walls, in the order I hit them to get this working.

Why Change Models?

Before that, a brief overview of the problems I was facing. My previous setup on the same hardware was running two models:

Qwen3-Next-80B-Instruct-FP8
Qwen3-4B-Instruct

And before I moved to the current model, I also tried Qwen3.6-35B-A3B. Pretty much same problem with each of these — 4B-Instruct was anyway not suited for large tasks but I tried solving simple bugs with the other two models. I hoped to get better quality with 35B but since it was not a reasoning model and with only 3B active parameters, it failed on these tasks.

I hooked these up with Hermes and OpenCode and tried with different prompt sizes, system prompts and other tweaks but they didn't do much beyond basic summarization tasks.

The Hacker News post I mentioned earlier triggered my curiosity to use a dense model for the same task with the same rig.

Wall 1 — NGC image too old for `qwen3_5`

Dropped the model YAML onto the existing nvcr.io/nvidia/vllm:25.11-py3 image that had been serving Qwen3-Next-80B fine for weeks. The container crashed on start:

pydantic_core._pydantic_core.ValidationError: 1 validation error for ModelConfig
  Value error, The checkpoint you are trying to load has model type `qwen3_5`
  but Transformers does not recognize this architecture.

Qwen3.6 is Qwen3_5ForConditionalGeneration — a dense 27B with hybrid attention plus Mamba/GDN layers. The 25.11-py3 image ships a Transformers old enough to predate qwen3_5. Upgrading Transformers alone breaks the CUDA + vLLM + Transformers version matrix that NGC pins together.

Bumped the image to nvcr.io/nvidia/vllm:26.06-py3 (vLLM 0.22.1, Transformers 5.6.0, arm64 manifest, FP8 kernels for sm_121). The pull took 35 minutes and looked stuck — 20 GiB across ~9 concurrent CloudFront connections is just the actual wall time.

Lesson: NGC image tags bundle a version matrix. Don't upgrade individual components inside a running container. Bump the whole tag.

Wall 2 — the red herring that hid the real crash

New image, qwen3_5 recognized, weights loaded, torch.compile ran, engine died:

ERROR EngineCore failed to start.
File "torch/_ops.py", line 1408, in _get_packet
    op, overload_names = torch._C._jit_get_operation(qualname)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xad in position 181

Every guide on the internet ties this UnicodeDecodeError to torch.compile on aarch64. I ran with that hypothesis and added --enforce-eager to skip torch.compile entirely. Same crash. --enforce-eager was working — the log confirmed Enforce eager set, disabling torch.compile and CUDAGraphs — but the crash was unaffected.

Read the traceback more carefully. The crash was inside torch.library._del_library — that's Python interpreter shutdown code, not init.

Something else killed the engine, then the shutdown-cleanup path fired the visible error. Tightened the grep and found the real primary error earlier in the log:

AssertionError: Model Runner V2 has not yet supported mamba_cache_mode='align'.

Chained together:

Qwen3.6-27B is a hybrid attention + Mamba model.
--enable-prefix-caching forces mamba_cache_mode='align' on hybrid models — the only cache layout the linear-attention layers support with prefix caching.
Model Runner V2 in vLLM 0.22 is a rewrite that hasn't reimplemented align-mode prefix caching for hybrids (vllm#26201, vllm#38041).
The NGC 26.06-py3 compose template ships VLLM_USE_V2_MODEL_RUNNER: "1" as a default. That flips the switch that trips the assertion.
The engine crashes during _initialize_kv_caches, the interpreter shuts down, torch.library._del_library iterates a garbled JIT op registry — UnicodeDecodeError, fully downstream.

Fix: delete one env line from the compose template. V1 is the documented fallback and matches the vLLM Recipes reference command.

Lesson: when Python shuts down mid-crash, the visible error is a shutdown artifact. grep -B2 -A20 "ERROR .* failed to start" before you form any hypothesis. I lost about half a day to the wrong one.

Wall 3 — HTTP 500 on every request (Prometheus + FastAPI 0.137)

Engine came up clean. But every chat completion returned 500:

{"error":{"message":"'_IncludedRouter' object has no attribute 'path'",
          "type":"InternalServerError","code":500}}

FastAPI 0.137 (May 2026) restructured include_router() to wrap routers in _IncludedRouter objects that don't expose .path. prometheus-fastapi-instrumentator <= 8.0.0 unconditionally reads .path in its middleware. Every request through the middleware raises AttributeError, which bubbles to a 500. NGC 26.06-py3 bundles fastapi >= 0.137 with prometheus-fastapi-instrumentator == 8.0.0 — the exact broken pair.

Upstream fixes exist (instrumentator 8.0.1, vLLM PRs #45594 / #45629). None of them are in the 26.06-py3 build. No newer NGC tag exists. No vLLM CLI flag disables the FastAPI middleware.

Fix: a two-line derived Dockerfile that pins the instrumentator forward:

FROM nvcr.io/nvidia/vllm:26.06-py3
RUN pip install --no-cache-dir "prometheus-fastapi-instrumentator>=8.0.1"

Built locally, tagged vllm-inx:26.06-py3-patched. Model YAML image: swapped. Chat completion returned 200.

Lesson: NGC bundles a whole ecosystem tightly. When FastAPI does a routing refactor, middlewares break inside the container even if they've shipped a fix. A thin derived Dockerfile with a single pip pin is cheap enough to be the standard workaround.

Wall 4 — `hermes` parser doesn't recognize Qwen3.6's tool emission

Baseline serve closed. In the previous post (Two Qwen3 Models on One DGX Spark), hermes worked with Qwen3-Next-80B-Instruct; carrying that forward here, I added tool calling with the same parser:

- --enable-auto-tool-choice
- --tool-call-parser
- hermes

Ran a calculator tool test. tool_calls was null. The call text was in .content:

"content": "...\n<tool_call>\n<function=calculator>\n<parameter=expression>\n42 * 17\n</parameter>\n</function>\n</tool_call>",
"tool_calls": null

The model was emitting a tool call — in Qwen3-native XML (<function=…><parameter=…>), not Hermes JSON (<tool_call>{"name":"calculator","arguments":{...}}</tool_call>).

The parser doesn't match, everything falls through as raw text, the harness silently thinks the model refused the tool.

NGC 26.06-py3 ships two Qwen3 parsers registered as qwen3_xml and qwen3_coder. Qwen3.6 emits the XML form:

- - hermes
+ - qwen3_xml

Re-deploy. Structured tool call recognized, finish_reason: tool_calls, content mostly clean (reasoning still leaked — the next phase's target).

Lesson: tool parsers map to emission formats, not model families. Qwen 2.5 Coder, Qwen3-Coder, and Qwen3.6 all pick different parsers. Run a calculator-tool test on raw output before you wire any agent harness. The failure is silent.

Reasoning + tools together

Added --reasoning-parser qwen3 to route <think>...</think> into .reasoning_content. The critical simultaneous-use test asks the model to think, then call a tool:

"reasoning_content": "The user wants to know 42 times 17. I need to use the calculator tool...",
"content": null,
"tool_calls": [
  {"function": {"arguments": "{\"expression\": \"42 * 17\"}", "name": "calculator"}, ...}
],
"finish_reason": "tool_calls"

Reasoning populated, content: null (no leakage), tool_calls[0] populated, finish_reason: tool_calls. That's the shape a coding-agent harness wants.

MTP — the risk that turned out safe

MTP (Multi-Token Prediction speculative decoding) was the highest-risk phase because a draft head that doesn't respect structured-output boundaries corrupts either the tool XML or the <think> tags. Startup log after adding the flag:

Resolved architecture: Qwen3_5MTP
Loading drafter model...
Detected MTP model. Sharing target model {embedding,lm_head} weights with the draft model.

Architecture flipped from Qwen3_5ForConditionalGeneration to Qwen3_5MTP — the FP8 checkpoint ships the draft head embedded. Re-ran the reasoning + tool acceptance test. Same clean shape. MTP did not corrupt tool tags or <think> boundaries.

Lesson: read the resolved architecture line in the startup log. Qwen3_5MTP vs Qwen3_5ForConditionalGeneration tells you whether MTP is actually active. The flag can be accepted without the checkpoint having the head — silent no-op.

Wall 5 — pi sends `reasoning_effort`, LiteLLM 400s

Wired the pi.dev agent harness with a new provider extension pointed at http://<host>:4000/v1, model Qwen3.6-27B, reasoning: true. First call:

400 litellm.UnsupportedParamsError: openai does not support parameters:
['reasoning_effort'], for model=Qwen/Qwen3.6-27B-FP8.
To drop these, set `litellm.drop_params=True`

Pi's reasoning: true flag emits an OpenAI-style reasoning_effort param with every request. The LiteLLM route maps to a generic openai/* upstream, and OpenAI's SDK strictly refuses reasoning_effort for non-o1 models. LiteLLM propagates the strict validation as a 400. This is orthogonal to vLLM's own reasoning behavior — --reasoning-parser qwen3 operates on the model's <think> output tokens server-side; reasoning_effort is a client-facing param LiteLLM doesn't route by default.

One-line fix in the LiteLLM config:

litellm_settings:
  drop_params: true

LiteLLM now silently strips unsupported params before forwarding.

Lesson: drop_params: true is load-bearing for any LiteLLM deployment that serves multiple client SDKs. pi, aider, opencode, Continue, and Cursor each pass different reasoning-related params (reasoning_effort, thinking, reasoning, thinking_budget). Without drop_params, every new client is a fresh 400 to debug.

Also worth a mention: GB10's unified memory pool doesn't report the way discrete GPUs do. nvidia-smi --query-gpu=memory.used returns [N/A]. Use nvidia-smi --query-compute-apps=used_memory --format=csv and sum. Same intent, different query. Update any monitoring or pre-flight gates written against discrete-VRAM semantics.

The generalized rules

Five lessons that generalize past this box:

The visible error is not the primary error when Python shuts down. Grep for EngineCore failed to start first; the _del_library UnicodeDecodeError is downstream noise.
NGC image tags bundle a version matrix. Don't upgrade individual components. Bump the whole tag or write a thin derived Dockerfile.
Model Runner V2 doesn't support hybrid-attention align-mode prefix caching yet. If you're serving Qwen3.5/3.6, MiniMax M2, or any Mamba/GDN model, unset VLLM_USE_V2_MODEL_RUNNER.
Tool parsers map to emission formats, not model families. Run a calculator-tool test on raw output before wiring any agent harness. The failure is silent — the call lands in .content.
drop_params: true in LiteLLM is load-bearing. Client SDKs pass different reasoning params; the OpenAI upstream rejects unknown ones. drop_params decouples clients from upstream opinion.

Putting it all together

This work only matters if the stack survives contact with a real repo. Clawrium is my open-source agent fleet manager where this stack is routing to. I'm using this project to benchmark the stack with most of the new changes. This is a non-trivial project with real users so the objective is to dogfood my stack against something I have an incentive to keep stable. I started with small PRs and some documentation tasks. I'll give it a few weeks to settle down before graduating to more long running tasks.

Additionally, I've also onboarded a Hermes agent to use the same stack.

Evidence: code that shipped into Clawrium

Here are concrete Clawrium pull requests that went through the local-Qwen path:

One important nuance: not every artifact is an execution artifact. Some are planning artifacts that support the same workflow:

#858 — docs(#712): implementation plan for GUI lifecycle 502 fix (plan-only)
Hermes agent using this setup self-published a release notes blog: https://ric03uec.github.io/clawrium/blog/v26.7.0-release

The workflow

As much as this model seems promising, the rule of thumb is: Don't ask it for any non-trivial planning. Ask it to write the change once the plan is concrete. This also requires a change in approach. Granularity and precision of the plan defines the success of a local model.

Plan with Codex/Claude/GLM. Feed the issue, the file tree, and the acceptance criteria. Output: an ordered task list with file paths, function signatures, and test expectations.
Execute with local Qwen. Feed the task list to pi.dev or OpenCode pointed at the LiteLLM proxy. The model reads the codebase (tools), applies diffs (tools), runs tests (tools). All at 256K context with MTP speculative decoding.
Review locally, then ship. PR draft, review pass, merge. The frontier is called for planning, not for typing.
Harness choice matters. OpenCode remains my daily driver but as much as it makes model switching easier, it comes with real bloat. Going for local LLMs means obsessing over keeping your stack lean. Starting with the harness. I'm getting good results with pi.dev.
Code without validation is useless. I haven't ported my validation harness yet to use the local model (yet). For true cost efficiencies both generation and validation need to be localised.

That's the actual split I'm aiming for: frontier models plan, local Qwen executes, and the repo accumulates both kinds of output.

Trends so far

The stack usage is around 50M tokens so far over the past weekend. I'll start adding more tasks to the queue this week and report updated numbers soon.

Qwen3.6-27B-FP8 on the DGX Spark sustained about 14-16 tok/s decode throughput on active two-request workloads.

Prompt ingest showed bursty but high prefill throughput, with aggregate effective prefill around 12.6k tok/s.
Prefix caching was highly effective at roughly 95% hit rate.
Average TTFT was about 8.5s, and average end-to-end latency was about 42s, driven mostly by long decode phases on very large prompts.
The observed average prompt size was huge at about 78.8k tokens/request.

Closing

Local inference is finally getting interesting because it can survive contact with real work. It has to reason, call tools, survive a real repo, and ship code I'd actually merge.

That's what this DGX Spark stack is starting to do. Frontier models still do the heavy planning. But once the task is concrete, local Qwen is getting close to being the machine I trust to execute it. That's the future I want: owned inference, owned context, owned workflow. And this is the first setup I've had that feels like it might actually get there.

Two Qwen3 Models on One DGX Spark: The Residency Math for Local LLM Coding

Devashish — Tue, 16 Jun 2026 20:23:10 +0000

My agent stack with Hermes runs on a workstation. The models run on a DGX Spark on the same LAN. The split is deliberate: the workstation stays responsive, the Spark does the GPU work, and they talk over an HTTP proxy.

Since I started managing the agent fleet through Clawrium, the Hermes count has climbed. More agents on more hosts, more concurrent traffic, all hitting the same Spark. What was a one-laptop, one-model setup is now a small fleet against a single backend — and the shape of the load is exactly what a single-model server can't serve.

The Spark served models through ollama for months. It worked. One model up, single config, easy to bring down.

But ollama owns the card. There's no per-process memory budget, no gpu_memory_utilization knob, no straightforward way to coresident a heavy model for reasoning and a fast model for quick turns. KV cache management is whatever the underlying llama.cpp backend gives you. PagedAttention isn't there.

vLLM fixes all of that.

PagedAttention reclaims KV blocks instead of contiguous-pinning them.

gpu_memory_utilization gives you a per-container budget.

One Spark (GB10, 119.67 GiB unified memory) can run multiple vLLM containers behind a LiteLLM proxy on :4000, and Hermes hits one URL to route to either model. The promise: serve Qwen3-Next-80B-Instruct-FP8 for the heavy work and Qwen3-4B-Instruct-2507 for fast turns, coresident, both reachable from a single endpoint.

That's the why. What follows is what it took to make the promise hold.

Spark hardware will happily hold two Qwen3 models if the numbers line up. They didn't, for several days. That's where my last weekend went.

Attempt one: trust the target

First 80B config: gpu_memory_utilization: 0.75, max_model_len: 65536, max_num_seqs: 4. vLLM's KV cache init crashed with "No available memory for the cache blocks." Qwen3-Next is mostly Mamba; the per-block page alignment pushes KV pool demand higher than the ~14 GiB residue after weights.

Bumped to 0.85. Now the free-memory check crashed: "Free memory on device (98.51/119.67 GiB) is less than desired GPU memory utilization (0.85, 101.72 GiB)." The 4B was already resident at ~16 GiB. The 80B's 0.85 target was reading the whole card, not what was free.

That's the first lesson. gpu_memory_utilization is a fraction of total GPU memory, not free memory.

Two co-resident vLLM processes need their fractions to sum below ~0.95 to leave room for CUDA framework overhead. If your math assumes free, you'll oscillate between OOMs and silent KV starvation.

Settled at 0.80 / 32k / 2 for the 80B. Loaded clean. KV pool ~20.8 GiB after weights.

Attempt two: point Hermes at it

Then Hermes came online and tool calls came back as plain text. <tool_call> JSON sitting inside content. tool_calls: []. finish_reason: stop. Hermes never executed it.

A day of parser triage produced nothing actionable. Both hermes_tool_parser.py and qwen3xml_tool_parser.py look for <tool_call> (singular). The <tools> plural tag is the system-prompt definition, not the output. The parser wasn't wrong. The model wasn't emitting.

tool_choice: "required" worked. tool_choice: "auto" came back empty: tool_calls: [], content: "", 619 characters of reasoning inside <think> concluding "Alright, that's it" without emitting the call.

Qwen's own model card states it plainly: Qwen3-Next-80B-Thinking supports only thinking mode. enable_thinking: false is a structural no-op on this checkpoint. /no_think in the prompt is ignored. The model reasons inside <think>, decides, and never emits.

That's an unrecoverable failure for any agent SDK that defaults to tool_choice: "auto". The fix wasn't a parser flag. It was swapping the whole 80B backbone from Thinking to Instruct.

77 GiB pre-pull. Drain GPU. Bring up with --enable-auto-tool-choice --tool-call-parser hermes, no --reasoning-parser. Three LiteLLM aliases (writer / reviewer / sources) all passed tool_choice: "auto" cleanly with finish_reason: tool_calls. Trade accepted: reviewer loses native <think> traces. Reasoning moved into the prompt.

Attempt three: the bump that broke coresidency

Reviewer agent (running on Hermes) needed 64k context. Bumped the 80B to 0.85 / 65536 / 2. 80B loaded healthy. The 4B's restart loop kicked in 19 times: "Free memory on device (12.58/119.67 GiB) is less than desired GPU memory utilization (0.12, 14.36 GiB)."

80B's actual residency at 0.85 was 101.5 GiB. Plus ~5 GiB CUDA framework overhead. That left ~12.5 GiB free. The 4B needed 14.36 GiB. No room.

Toned the 80B back to 0.80, dropped the 4B to 0.10 / 16384 / 8. Both came up healthy. The 4B's max_model_len had to drop because the 0.10 allocation leaves only ~3.5 GiB for KV pool — 32k single-seq KV demand (~4.8 GiB) doesn't fit; 16k (~2.4 GiB) does.

The residency math

This is the table I wish I'd built on day one:

Component	Allocation target	Actual resident
Qwen3-Next-80B-Instruct-FP8 at 0.80	~95 GiB	87.8 GiB
Qwen3-4B-Instruct at 0.10	~12 GiB	13.8 GiB
Total	~107 GiB	101.6 GiB
Free headroom	~12 GiB	~18 GiB

Three observations from the actuals.

The 80B's actual residency at 0.80 ran 8 GiB under allocation. That cushion is the only reason the 4B's restart variability doesn't break the deployment. At 0.85, the cushion went negative — same hardware, same models, same vLLM build.

The 4B at 0.10 actually resides at 13.8 GiB, not the 12 GiB the target implies. CUDA framework overhead doesn't disappear at small allocations.

On Qwen3-Next specifically, max_model_len × max_num_seqs is dominated by Mamba state alignment, not attention KV. Halving max_model_len doesn't halve KV pool demand the way it does on a pure attention model. Plan KV against Mamba page sizes, not against intuition from Llama-class models.

Once the wiring was complete, LiteLLM showed all the aliases for the same two models running on the Spark.

The insight

gpu_memory_utilization is a snapshot vLLM takes at process start, against total card memory. It is not a target against free memory. CUDA contexts from prior failed attempts can transiently inflate residency and trip the check spuriously. Co-resident processes don't negotiate — they race.

The only number that matters is actual residency after both processes have stabilized, measured against the headroom the harder-to-restart model needs to come back from a crash. Target allocations are a planning input; actuals are the ground truth.

For a two-model Spark deployment, the playbook is: load the bigger model first, let it settle, run nvidia-smi to read actual residency, then size the smaller model's gpu_memory_utilization against the free pool minus ~5 GiB for its own framework overhead. Recheck after both restart cleanly twice.

The 24-hour action

If you have a vLLM deployment running right now, pull this:

nvidia-smi --query-gpu=memory.used --format=csv

Compare the actual number to what your gpu_memory_utilization target implies. If the two diverge by more than 10%, your sizing model is wrong. Fix it before you ship anything that depends on coresidency — agent stacks, parallel workers, fallback chains. The math has to be empirical, not aspirational.

If you're standing up a similar local-LLM stack — DGX Spark (or other hardware), vLLM, multiple coresident models, or wiring a remote agent fleet to a single inference backend — I'd love to compare notes.

https://www.devashish.me/p/aie-code-2025-wrapup

Devashish — Tue, 23 Dec 2025 06:56:57 +0000

AIE Code 2025 Wrapup - devashish.me

Leadership and engineering takeaways from AIE CODE 2025.

devashish.me