Running Qwen2.5-32B on a DGX Spark: 3 Weeks, 2,859 Tests, Zero Errors — Full Setup Guide

#ai #api

Why This Setup

If you're building agent pipelines, you already know the problem: one broken tool call at step 47, and your entire autonomous loop is toast. Cloud APIs have rate limits, and they don't care that your agent is running at 3 AM.
I wanted to see if a local setup could deliver the one thing that matters most for agents: deterministic, structurally perfect output. Every time. Here's what I learned after three weeks.

Hardware

DGX Spark (GB10)
128GB unified memory
20-core ARM64
Ubuntu 24.04 LTS

Single machine, single model. No Kubernetes. Sitting in a residential room behind CGNAT, exposed via Cloudflare Tunnel.

Model & Engine
bash
huggingface-cli download Qwen/Qwen2.5-32B-Instruct-AWQ

python -m vllm.entrypoints.openai.api_server \
--model Qwen2.5-32B-Instruct-AWQ \
--served-model-name Qwen2.5-32B \
--host 0.0.0.0 --port 8000 \
--max-model-len 65536 \
--gpu-memory-utilization 0.9 \
--dtype auto \
--enforce-eager \
--enable-auto-tool-choice \
--tool-call-parser hermes

Key flags explained:

--enforce-eager: ARM64 can't handle CUDA graphs — this is mandatory, not optional
--max-model-len 65536: Full 64K context window for long agent loops
--gpu-memory-utilization 0.9: Leave 10% headroom for KV cache spikes
--tool-call-parser hermes: Qwen2.5 uses Hermes format for tool calls

The AWQ 4-bit quantization is what makes this possible. 32B model at full precision would need ~64GB just for weights. Quantized, it's ~18GB, leaving plenty of room for KV cache in the 128GB unified memory pool.

The Numbers

Raw Performance

Single-stream generation: 12.9 tok/s. Not going to win any speed contests. ARM64 and 32B parameters are a heavy lift.

But throughput is a different story with vLLM's continuous batching:

25 concurrent: 266 tok/s system throughput
TTFT P50: 649ms
TTFT P99 at 25 concurrent: 1,579ms
TPOT median: 74ms

vLLM's prefix caching is doing the heavy lifting on TTFT — in agent loops, successive calls share system prompt context, and the cache hits keep first-token latency down.

The Concurrency Cliff

This was the most surprising finding:

30 concurrent: 100% success rate
35 concurrent: 100% timeout rate

Not gradual degradation. A hard wall. Memory bandwidth maxes out at ~32-33 concurrent requests, and the GPU memory simply can't serve more. If you're planning a DGX Spark deployment, plan for 30 concurrent max with zero headroom.

Benchmark Results

2,859 code generation tests via EvalScope across 7 sessions. Each test validates JSON structure, function call schema, output completeness, and timeout compliance.

Structural errors: zero.

I ran the same 1,280 prompts against cloud APIs for comparison:

Model	Latency	Errors	Output (avg lines)
STORM (DGX, 32B)	19.6s	0	37
DeepSeek V3	2.6s	0	43
Kimi	4.9s	2	40
Mac M4 Pro (14B)	9.9s	0	38

DeepSeek wins speed and verbosity. Kimi is fast but had format breaks. The Mac M4 with a 14B model was surprisingly competitive on quality.

What's the Takeaway?

For chat and real-time applications, cloud APIs win. They're faster, simpler, and you don't need to manage hardware.

For agent pipelines where:

You're running long tool-calling loops
A single malformed JSON breaks the entire flow
Rate limits at unpredictable hours are unacceptable
You want prompt data staying on your hardware

...local inference with the right configuration delivers something cloud APIs don't: guaranteed output structure. Not once in 2,859 tests did the model break format. That's the product.

Try It Yourself

Everything is open source. Reproduce the setup, run the benchmarks, verify the numbers:

Questions about the DGX setup, vLLM tuning, or benchmark methodology? Drop a comment.