Lessons from the DGX Spark: Speed, VRAM, and the "Thinking" Problem
We have a DGX Spark at the office that everyone fights over, and I was dying to play with it. The goal was simple: build an internal automation agent that peers into Salesforce, Confluence, and our internal APIs to generate workflows, pricing quotes, and so on, while keeping sensitive data local and, frankly, cutting API costs as much as possible.
But as you know, “running it locally” is not straightforward. Many times I wanted to just throw an API key in .env and be done with it. Here’s what we learned from the trenches of model selection, VRAM management, and prompt tuning.
The “Thinking” Tax: Why We Pivoted from Qwen 3.6
The first instinct was to grab the newest shiny object: Qwen3.6-27B. It’s a beast on paper, but we ran into an immediate “personality” issue. The model has a heavy “scratchpad” style—it wants to think out loud before it gives you the answer.
For our use case—generating clean JSON for an internal UI—this was a disaster. It burned tokens and time on analysis we didn’t ask for. We tried enable_thinking=false, but it wasn’t consistent. We moved to Qwen3-30B-A3B and hit the same wall.
So, if you just need a model to follow tool calls and return a schema, “thinking” models can actually be a hindrance. You don’t need a philosopher; you need a clerk.
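For anyone who wants to try the same switch, here is a minimal sketch of the two pieces involved: asking the server to skip the thinking phase, and stripping any reasoning block that slips through anyway. It assumes a vLLM-style OpenAI-compatible server that forwards chat_template_kwargs to the Qwen3 chat template, and that reasoning arrives wrapped in <think> tags. The model name and both function names are placeholders, not what we actually ran.

```python
import re

# Matches a complete <think>...</think> block (across newlines) plus trailing whitespace.
_THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", re.DOTALL)


def build_request(messages, model="local-qwen"):
    """Build a chat payload that asks the server to skip the thinking phase.

    Assumption: the server accepts chat_template_kwargs in the request body
    (vLLM-style). The model name is a placeholder.
    """
    return {
        "model": model,
        "messages": messages,
        "chat_template_kwargs": {"enable_thinking": False},
    }


def strip_thinking(text):
    """Remove any complete reasoning blocks that were emitted anyway."""
    return _THINK_BLOCK.sub("", text).strip()
```

Note that the strip step only handles closed blocks. A response truncated in the middle of its reasoning would still need to be rejected by whatever validates the JSON afterwards.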
The Sweet Spot: Qwen2.5-32B-Instruct-fp8
We eventually landed on Qwen2.5-32B-Instruct-fp8.
The FP8 quantization allowed it to sit comfortably in the Spark’s VRAM, even with our embedding model (BGE-M3) running alongside it.
In head-to-head evals against Claude 3.5 Sonnet, the latency difference was a little surprising.
The Benchmarks (22 Paired Evals)
| Metric | Qwen2.5-32B (Local) | Claude 3.5 Sonnet (Cloud) |
| --- | --- | --- |
| TTFT | 1–2s | 9–35s |
| Response length | Concise | 2.3x longer |
Claude is impressive: it adds citations and caveats that Qwen just doesn’t match. But for routine synthesis, waiting up to 35 seconds for a first token is a non-starter for a snappy UI.
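If you want to reproduce a TTFT number like the ones above, the measurement itself is small. This is a sketch, not our eval harness: it assumes you can wrap each backend's streaming response as a plain Python iterable of text chunks, and the function name is made up for illustration.

```python
import time


def time_to_first_token(stream, clock=time.perf_counter):
    """Return (seconds until the first non-empty chunk, that chunk).

    `stream` is any iterable of text chunks, e.g. a wrapped streaming
    response. Returns (None, None) if the stream ends without any text.
    """
    start = clock()
    for chunk in stream:
        if chunk:  # skip empty keep-alive chunks
            return clock() - start, chunk
    return None, None
```

For a fair comparison, the request should be sent lazily when the iterable is first consumed (a generator works), so that network and queueing time land inside the measured window for both backends.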
Closing the Quality Gap: The “Schema-First” Strategy
Qwen was fast, but it was “hallucination-prone”—dropping schema fields and making up URLs. To fix this, we stopped treating it like a chatbot and started treating it like a compiler.
Our Optimization Stack:
Temperature 0.1: Kill the creativity.
Schema-First Prompting: We moved the JSON structure to the very top of the prompt. We tell it how to output before we tell it what to do.
Hard Constraints: We added rules like empty section = [] and a strict Never fabricate command.
Zero Persona: We stripped all “You are a helpful assistant” fluff. It just gets in the way of the logic.
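Here is a rough sketch of how the four items above could fit together in a prompt builder. The schema, the section labels, and the function name are all hypothetical; our real schema is not shown in this post.

```python
import json

# Hypothetical schema for illustration only.
QUOTE_SCHEMA = {
    "customer": "string",
    "line_items": [{"sku": "string", "quantity": "integer"}],
    "source_urls": ["string"],
}

# Hard constraints, stated as rules rather than as a persona.
RULES = [
    "Return only JSON matching the schema above.",
    "Empty section = [].",
    "Never fabricate values or URLs; use only what appears in the context.",
]

# Low temperature, passed alongside the prompt in the request.
SAMPLING = {"temperature": 0.1}


def build_prompt(task, context, schema=QUOTE_SCHEMA):
    """Assemble a prompt with the output format first and the task last."""
    parts = [
        "OUTPUT SCHEMA:",
        json.dumps(schema, indent=2),
        "RULES:",
        "\n".join(f"- {rule}" for rule in RULES),
        "CONTEXT:",
        context,
        "TASK:",
        task,
    ]
    return "\n".join(parts)
```

The ordering is the point: the model sees how to answer before it sees what to answer, and there is no "You are a helpful assistant" line anywhere in the output.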
The Hardware Squeeze
One thing to watch if you’re running on a Spark: VRAM is a zero-sum game, obviously. Adding BGE-M3 for semantic search and multilingual support was non-negotiable for our data, but it made the memory overhead incredibly tight.
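A back-of-envelope way to see where the memory goes is to count bytes per parameter. The sketch below covers weights only. The parameter counts are approximate figures from the public model cards (about 32.5B for Qwen2.5-32B and about 0.57B for BGE-M3), and running the embedder in FP16 is an assumption rather than something we measured.

```python
def weight_gb(params_billions, bytes_per_param):
    """Approximate weight memory in GB, using 1 GB = 1e9 bytes."""
    return params_billions * bytes_per_param


# Approximate parameter counts (assumptions, see above).
llm_fp8 = weight_gb(32.5, 1)        # FP8: 1 byte per parameter
llm_fp16 = weight_gb(32.5, 2)       # FP16/BF16: 2 bytes per parameter
embedder_fp16 = weight_gb(0.57, 2)  # assumed FP16 embedder

print(f"LLM at FP8:   {llm_fp8:.1f} GB")
print(f"LLM at FP16:  {llm_fp16:.1f} GB")
print(f"Embedder:     {embedder_fp16:.2f} GB")
```

FP8 halves the weight footprint relative to FP16, but this estimate leaves out the KV cache, activations, and runtime overhead, all of which grow with context length and batch size.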
What’s Next?
We’re going to run a full eval on these changes to see if the prompt tuning is enough. If not, the next step is building a middleware layer to catch malformed JSON and trigger second calls. I’m also looking at putting Llama 3 through its paces to see if the tool-calling is more robust.
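We have not built that middleware yet, so treat the following as a sketch of the idea rather than the real thing. It assumes `generate` is any callable that takes a prompt string and returns the model's raw text; the function name, the retry wording, and the default of two attempts are all placeholders.

```python
import json


def call_with_repair(generate, prompt, required_keys, max_attempts=2):
    """Call `generate` until it returns a JSON object containing all keys.

    Raises ValueError if every attempt is rejected.
    """
    last_error = "no attempts made"
    current_prompt = prompt
    for _ in range(max_attempts):
        raw = generate(current_prompt)
        try:
            data = json.loads(raw)
        except json.JSONDecodeError as exc:
            last_error = f"invalid JSON: {exc.msg}"
        else:
            if not isinstance(data, dict):
                last_error = "top-level value is not an object"
            else:
                missing = [k for k in required_keys if k not in data]
                if not missing:
                    return data
                last_error = f"missing fields: {', '.join(missing)}"
        # Second call: repeat the original prompt and name the problem.
        current_prompt = (
            f"{prompt}\n\nPrevious output was rejected ({last_error}). "
            "Return corrected JSON only."
        )
    raise ValueError(
        f"model output rejected after {max_attempts} attempts: {last_error}"
    )
```

This catches malformed JSON and dropped fields, but not fabricated values such as made-up URLs. Those would need a separate check against the source context.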
The Bottom Line: We’re closing the gap. We’ll use the local Qwen for the 90% “routine” synthesis and save the Claude API calls for the truly hard reasoning tasks.