Why This Setup
If you're building agent pipelines, you already know the problem: one broken tool call at step 47, and your entire autonomous loop is toast. Cloud APIs have rate limits, and they don't care that your agent is running at 3 AM.
I wanted to see if a local setup could deliver the one thing that matters most for agents: deterministic, structurally perfect output. Every time. Here's what I learned after three weeks.
Hardware
DGX Spark (GB10)
128GB unified memory
20-core ARM64
Ubuntu 24.04 LTS
Single machine, single model. No Kubernetes. Sitting in a residential room behind CGNAT, exposed via Cloudflare Tunnel.
Model & Engine
bash
huggingface-cli download Qwen/Qwen2.5-32B-Instruct-AWQ
python -m vllm.entrypoints.openai.api_server \
--model Qwen2.5-32B-Instruct-AWQ \
--served-model-name Qwen2.5-32B \
--host 0.0.0.0 --port 8000 \
--max-model-len 65536 \
--gpu-memory-utilization 0.9 \
--dtype auto \
--enforce-eager \
--enable-auto-tool-choice \
--tool-call-parser hermes
Key flags explained:
-
--enforce-eager: ARM64 can't handle CUDA graphs — this is mandatory, not optional -
--max-model-len 65536: Full 64K context window for long agent loops -
--gpu-memory-utilization 0.9: Leave 10% headroom for KV cache spikes -
--tool-call-parser hermes: Qwen2.5 uses Hermes format for tool calls
The AWQ 4-bit quantization is what makes this possible. 32B model at full precision would need ~64GB just for weights. Quantized, it's ~18GB, leaving plenty of room for KV cache in the 128GB unified memory pool.
The Numbers
Raw Performance
Single-stream generation: 12.9 tok/s. Not going to win any speed contests. ARM64 and 32B parameters are a heavy lift.
But throughput is a different story with vLLM's continuous batching:
- 25 concurrent: 266 tok/s system throughput
- TTFT P50: 649ms
- TTFT P99 at 25 concurrent: 1,579ms
- TPOT median: 74ms
vLLM's prefix caching is doing the heavy lifting on TTFT — in agent loops, successive calls share system prompt context, and the cache hits keep first-token latency down.
The Concurrency Cliff
This was the most surprising finding:
- 30 concurrent: 100% success rate
- 35 concurrent: 100% timeout rate
Not gradual degradation. A hard wall. Memory bandwidth maxes out at ~32-33 concurrent requests, and the GPU memory simply can't serve more. If you're planning a DGX Spark deployment, plan for 30 concurrent max with zero headroom.
Benchmark Results
2,859 code generation tests via EvalScope across 7 sessions. Each test validates JSON structure, function call schema, output completeness, and timeout compliance.
Structural errors: zero.
I ran the same 1,280 prompts against cloud APIs for comparison:
| Model | Latency | Errors | Output (avg lines) |
|---|---|---|---|
| STORM (DGX, 32B) | 19.6s | 0 | 37 |
| DeepSeek V3 | 2.6s | 0 | 43 |
| Kimi | 4.9s | 2 | 40 |
| Mac M4 Pro (14B) | 9.9s | 0 | 38 |
DeepSeek wins speed and verbosity. Kimi is fast but had format breaks. The Mac M4 with a 14B model was surprisingly competitive on quality.
What's the Takeaway?
For chat and real-time applications, cloud APIs win. They're faster, simpler, and you don't need to manage hardware.
For agent pipelines where:
- You're running long tool-calling loops
- A single malformed JSON breaks the entire flow
- Rate limits at unpredictable hours are unacceptable
- You want prompt data staying on your hardware
...local inference with the right configuration delivers something cloud APIs don't: guaranteed output structure. Not once in 2,859 tests did the model break format. That's the product.
Try It Yourself
Everything is open source. Reproduce the setup, run the benchmarks, verify the numbers:
- GitHub (code + data + methodology)
- Benchmark report
- API endpoint (free tier for testing)
Questions about the DGX setup, vLLM tuning, or benchmark methodology? Drop a comment.
Top comments (0)