After three weeks of running Qwen2.5-32B on a DGX Spark, the number that surprised me most wasn't the throughput or latency. It was zero.
Zero structural errors across 2,859 code generation tests.
What I Tested
EvalScope with code generation tasks covering:
- Structured JSON output
- Function calling (OpenAI tool format)
- Multi-step tool use chains
- Code completion with specific output formats
Each test run validates four things:
- Valid JSON structure — no unclosed brackets, no broken syntax
- Correct function call schema — the right parameters, right types
- No truncated output — response completes fully within the token budget
- Response within timeout — no hung generations
Seven test sessions, roughly 400 prompts each. Every single one passed.
The Setup
Model: Qwen2.5-32B-Instruct-AWQ (4-bit)
Engine: vLLM 0.21 with continuous batching
Temperature: 0 (deterministic mode)
Hardware: DGX Spark, 128GB unified memory, ARM64
bash
python -m vllm.entrypoints.openai.api_server \
--model Qwen2.5-32B-Instruct-AWQ \
--max-model-len 65536 \
--gpu-memory-utilization 0.9 \
--enforce-eager \
--enable-auto-tool-choice \
--tool-call-parser hermes
Why Zero Errors Surprised Me
I've used cloud APIs extensively. Even the best ones occasionally return truncated JSON under load, or a function call with a missing parameter. It's rare — 0.1-0.3% error rates — but when you're running autonomous agents doing 40+ sequential tool calls, a single failure cascades.
At 0.3% error rate per call, a 50-step agent loop has a ~14% chance of hitting at least one failure. Your agent works perfectly nine times, then mysteriously dies on the tenth run.
With zero errors in 2,859 trials, the 95% confidence upper bound on the error rate is 0.13%. That means a 50-step loop has a 93.8%+ chance of completing cleanly.
The Comparison
I also ran 1,280 identical prompts against cloud APIs:
| Backend | Latency (median) | Structural Errors |
|---|---|---|
| DeepSeek V3 | 2.6s | 0 |
| Kimi | 4.9s | 2 |
| Qwen2.5-14B (Mac M4) | 9.9s | 0 |
| STORM (DGX, 32B) | 19.6s | 0 |
Cloud wins on speed. But the local setup matched the cloud on reliability, while the 14B on a $599 Mac Mini held its own on quality.
Reproduce It
Full methodology, test datasets, and raw results are on GitHub:
https://github.com/YIQI-NUMBER1/stormengine
If you've got a local setup, pull the repo and run the benchmarks. If you find errors I missed, open an issue — I genuinely want to know what breaks this.
Top comments (0)