<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Storm Engine Technology.</title>
    <description>The latest articles on DEV Community by Storm Engine Technology. (@yiqinumber1).</description>
    <link>https://dev.to/yiqinumber1</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3963941%2F7d26b306-e577-4d26-94e3-45dcaabb3c65.jpg</url>
      <title>DEV Community: Storm Engine Technology.</title>
      <link>https://dev.to/yiqinumber1</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/yiqinumber1"/>
    <language>en</language>
    <item>
      <title>How I Ran 2,859 LLM Code Generation Tests with EvalScope — and Got Zero Errors</title>
      <dc:creator>Storm Engine Technology.</dc:creator>
      <pubDate>Tue, 02 Jun 2026 07:07:02 +0000</pubDate>
      <link>https://dev.to/yiqinumber1/how-i-ran-2859-llm-code-generation-tests-with-evalscope-and-got-zero-errors-17km</link>
      <guid>https://dev.to/yiqinumber1/how-i-ran-2859-llm-code-generation-tests-with-evalscope-and-got-zero-errors-17km</guid>
      <description>&lt;p&gt;After three weeks of running Qwen2.5-32B on a DGX Spark, the number that surprised me most wasn't the throughput or latency. It was zero.&lt;/p&gt;

&lt;p&gt;Zero structural errors across 2,859 code generation tests.&lt;/p&gt;

&lt;p&gt;What I Tested&lt;/p&gt;

&lt;p&gt;EvalScope with code generation tasks covering:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Structured JSON output&lt;/li&gt;
&lt;li&gt;Function calling (OpenAI tool format)&lt;/li&gt;
&lt;li&gt;Multi-step tool use chains&lt;/li&gt;
&lt;li&gt;Code completion with specific output formats&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each test run validates four things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Valid JSON structure — no unclosed brackets, no broken syntax&lt;/li&gt;
&lt;li&gt;Correct function call schema — the right parameters, right types&lt;/li&gt;
&lt;li&gt;No truncated output — response completes fully within the token budget&lt;/li&gt;
&lt;li&gt;Response within timeout — no hung generations&lt;/li&gt;
&lt;/ol&gt;
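
&lt;p&gt;A minimal sketch of what those four checks could look like in Python. The function name &lt;code&gt;check_tool_call&lt;/code&gt;, the 60-second default timeout, and the type table are illustrative assumptions, not EvalScope's API or the harness used for these runs. The only outside convention it relies on is that OpenAI-style responses report a &lt;code&gt;finish_reason&lt;/code&gt; of &lt;code&gt;"length"&lt;/code&gt; when the token budget runs out.&lt;/p&gt;

```python
import json

# Hypothetical mapping from JSON Schema type names to Python types.
PY_TYPES = {
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
    "object": dict,
    "array": list,
}


def check_tool_call(raw_arguments, parameters_schema, finish_reason,
                    elapsed_s, timeout_s=60.0):
    """Return a list of structural problems; an empty list means a pass."""
    problems = []

    # Check 4: response arrived within the timeout.
    if elapsed_s > timeout_s:
        problems.append("timeout")

    # Check 3: "length" means generation stopped at the token budget.
    if finish_reason == "length":
        problems.append("truncated")

    # Check 1: the arguments string must parse as a JSON object.
    try:
        args = json.loads(raw_arguments)
    except json.JSONDecodeError:
        problems.append("invalid_json")
        return problems
    if not isinstance(args, dict):
        problems.append("arguments_not_object")
        return problems

    # Check 2: required parameters present, declared types respected.
    properties = parameters_schema.get("properties", {})
    for name in parameters_schema.get("required", []):
        if name not in args:
            problems.append(f"missing:{name}")
    for name, value in args.items():
        if name not in properties:
            problems.append(f"unexpected:{name}")
            continue
        expected = PY_TYPES.get(properties[name].get("type"))
        if expected is None:
            continue
        # bool is a subclass of int in Python, so reject it explicitly.
        bool_mismatch = isinstance(value, bool) and expected is not bool
        if bool_mismatch or not isinstance(value, expected):
            problems.append(f"wrong_type:{name}")
    return problems


schema = {
    "type": "object",
    "properties": {"city": {"type": "string"}, "days": {"type": "integer"}},
    "required": ["city"],
}
print(check_tool_call('{"city": "Oslo", "days": 3}', schema, "tool_calls", 2.1))
# []
print(check_tool_call('{"city": "Oslo", "days": ', schema, "length", 2.1))
# ['truncated', 'invalid_json']
```

&lt;p&gt;An empty list means the call passed all four checks. Returning every problem found, rather than stopping at the first, makes it easier to see whether a failure was a truncation that also broke the JSON or a schema slip on its own.&lt;/p&gt;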

&lt;p&gt;Seven test sessions, roughly 400 prompts each. Every single one passed.&lt;/p&gt;

&lt;p&gt;The Setup&lt;/p&gt;

&lt;p&gt;Model: Qwen2.5-32B-Instruct-AWQ (4-bit)&lt;br&gt;
Engine: vLLM 0.21 with continuous batching&lt;br&gt;
Temperature: 0 (deterministic mode)&lt;br&gt;
Hardware: DGX Spark, 128GB unified memory, ARM64&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-32B-Instruct-AWQ \
  --max-model-len 65536 \
  --gpu-memory-utilization 0.9 \
  --enforce-eager \
  --enable-auto-tool-choice \
  --tool-call-parser hermes
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Why Zero Errors Surprised Me&lt;/p&gt;

&lt;p&gt;I've used cloud APIs extensively. Even the best ones occasionally return truncated JSON under load, or a function call with a missing parameter. It's rare — 0.1-0.3% error rates — but when you're running autonomous agents doing 40+ sequential tool calls, a single failure cascades.&lt;/p&gt;

&lt;p&gt;At a 0.3% error rate per call, a 50-step agent loop has a ~14% chance of hitting at least one failure. That's roughly one run in seven: your agent works perfectly six times, then mysteriously dies on the seventh.&lt;/p&gt;

&lt;p&gt;With zero errors in 2,859 trials, the upper end of the exact two-sided 95% confidence interval on the error rate is 0.13% (the one-sided bound is tighter, about 0.10%). That means a 50-step loop has a 93.8%+ chance of completing cleanly.&lt;/p&gt;
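
&lt;p&gt;The arithmetic behind those figures can be sketched in a few lines of Python. The function names are illustrative, and the sketch assumes each call fails independently with the same probability, which real agent loops only approximate. The bound is the exact binomial (Clopper-Pearson) upper limit for the zero-failure case.&lt;/p&gt;

```python
def zero_failure_upper_bound(trials, confidence=0.95, two_sided=True):
    """Exact upper bound on the failure rate after observing zero failures."""
    alpha = 1.0 - confidence
    # A two-sided interval puts alpha/2 in the upper tail.
    tail = alpha / 2 if two_sided else alpha
    return 1.0 - tail ** (1.0 / trials)


def loop_success_probability(error_rate, steps):
    """Chance that every one of `steps` independent calls succeeds."""
    return (1.0 - error_rate) ** steps


# Cloud-style 0.3% per-call error rate over a 50-step loop.
print(f"{1 - loop_success_probability(0.003, 50):.3f}")  # 0.139

# Zero errors in 2,859 trials.
bound = zero_failure_upper_bound(2859)
print(f"{bound:.4%}")  # 0.1289%
print(f"{loop_success_probability(bound, 50):.3f}")  # 0.938
```

&lt;p&gt;Passing &lt;code&gt;two_sided=False&lt;/code&gt; gives the one-sided bound, which lands close to the familiar "rule of three" estimate of 3 divided by the number of trials.&lt;/p&gt;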

&lt;p&gt;The Comparison&lt;/p&gt;

&lt;p&gt;I also ran 1,280 identical prompts against two cloud APIs and a smaller local model:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Backend&lt;/th&gt;
&lt;th&gt;Latency (median)&lt;/th&gt;
&lt;th&gt;Structural Errors&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek V3&lt;/td&gt;
&lt;td&gt;2.6s&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kimi&lt;/td&gt;
&lt;td&gt;4.9s&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen2.5-14B (Mac M4)&lt;/td&gt;
&lt;td&gt;9.9s&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;STORM (DGX, 32B)&lt;/td&gt;
&lt;td&gt;19.6s&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Cloud wins on speed. But the local setup matched DeepSeek V3 on structural reliability and beat Kimi, while the 14B model on a $599 Mac Mini also finished with zero structural errors.&lt;/p&gt;

&lt;p&gt;Reproduce It&lt;/p&gt;

&lt;p&gt;Full methodology, test datasets, and raw results are on GitHub:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/YIQI-NUMBER1/stormengine" rel="noopener noreferrer"&gt;https://github.com/YIQI-NUMBER1/stormengine&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you've got a local setup, pull the repo and run the benchmarks. If you find errors I missed, open an issue — I genuinely want to know what breaks this.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>testing</category>
    </item>
    <item>
      <title>Running Qwen2.5-32B on a DGX Spark: 3 Weeks, 2,859 Tests, Zero Errors — Full Setup Guide</title>
      <dc:creator>Storm Engine Technology.</dc:creator>
      <pubDate>Tue, 02 Jun 2026 06:51:12 +0000</pubDate>
      <link>https://dev.to/yiqinumber1/running-qwen25-32b-on-a-dgx-spark-3-weeks-2859-tests-zero-errors-full-setup-guide-lh</link>
      <guid>https://dev.to/yiqinumber1/running-qwen25-32b-on-a-dgx-spark-3-weeks-2859-tests-zero-errors-full-setup-guide-lh</guid>
      <description>&lt;p&gt;Why This Setup&lt;/p&gt;

&lt;p&gt;If you're building agent pipelines, you already know the problem: one broken tool call at step 47, and your entire autonomous loop is toast. Cloud APIs have rate limits, and they don't care that your agent is running at 3 AM.&lt;/p&gt;

&lt;p&gt;I wanted to see if a local setup could deliver the one thing that matters most for agents: deterministic, structurally perfect output. Every time. Here's what I learned after three weeks.&lt;/p&gt;

&lt;p&gt;Hardware&lt;/p&gt;

&lt;p&gt;DGX Spark (GB10)&lt;br&gt;
128GB unified memory&lt;br&gt;
20-core ARM64&lt;br&gt;
Ubuntu 24.04 LTS&lt;/p&gt;

&lt;p&gt;Single machine, single model. No Kubernetes. Sitting in a residential room behind CGNAT, exposed via Cloudflare Tunnel.&lt;/p&gt;

&lt;p&gt;Model &amp;amp; Engine&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;huggingface-cli download Qwen/Qwen2.5-32B-Instruct-AWQ

python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-32B-Instruct-AWQ \
  --served-model-name Qwen2.5-32B \
  --host 0.0.0.0 --port 8000 \
  --max-model-len 65536 \
  --gpu-memory-utilization 0.9 \
  --dtype auto \
  --enforce-eager \
  --enable-auto-tool-choice \
  --tool-call-parser hermes
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Key flags explained:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;--enforce-eager&lt;/code&gt;: Disables CUDA graph capture and runs in eager mode. On this ARM64 box that was mandatory, not optional&lt;/li&gt;
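&lt;li&gt;
&lt;code&gt;--max-model-len 65536&lt;/code&gt;: 64K context window for long agent loops. Qwen2.5's default config covers 32K, so lengths beyond that rely on the YaRN rope scaling described in the model card&lt;/li&gt;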
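&lt;li&gt;
&lt;code&gt;--gpu-memory-utilization 0.9&lt;/code&gt;: Caps vLLM at 90% of memory for weights, activations, and the preallocated KV cache, leaving 10% for the OS and anything else sharing the unified pool&lt;/li&gt;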
&lt;li&gt;
&lt;code&gt;--tool-call-parser hermes&lt;/code&gt;: Qwen2.5 uses Hermes format for tool calls&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The AWQ 4-bit quantization is what makes this possible. A 32B model at 16-bit precision (FP16/BF16) would need ~64GB just for weights. Quantized, it's ~18GB, leaving plenty of room for KV cache in the 128GB unified memory pool.&lt;/p&gt;
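
&lt;p&gt;The weight figures follow from simple arithmetic, sketched below. The helper name is illustrative and the parameter count is rounded to 32 billion. Pure 4-bit weights come to about 16GB; the gap up to the ~18GB quoted above is plausibly quantization scales, zero points, and layers kept in 16-bit, though that breakdown is an assumption rather than something measured here.&lt;/p&gt;

```python
def weight_memory_gb(params_billion, bits_per_weight):
    """Approximate weight footprint in GB (10^9 bytes), ignoring quantization metadata."""
    return params_billion * bits_per_weight / 8


print(weight_memory_gb(32, 16))  # 64.0  (FP16/BF16)
print(weight_memory_gb(32, 4))   # 16.0  (4-bit, before scales and zero points)
```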

&lt;p&gt;The Numbers&lt;/p&gt;

&lt;p&gt;Raw Performance&lt;/p&gt;

&lt;p&gt;Single-stream generation: 12.9 tok/s. Not going to win any speed contests. ARM64 and 32B parameters are a heavy lift.&lt;/p&gt;

&lt;p&gt;But throughput is a different story with vLLM's continuous batching:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;25 concurrent: 266 tok/s system throughput&lt;/li&gt;
&lt;li&gt;TTFT P50: 649ms&lt;/li&gt;
&lt;li&gt;TTFT P99 at 25 concurrent: 1,579ms&lt;/li&gt;
&lt;li&gt;TPOT median: 74ms&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;vLLM's prefix caching is doing the heavy lifting on TTFT — in agent loops, successive calls share system prompt context, and the cache hits keep first-token latency down.&lt;/p&gt;

&lt;p&gt;The Concurrency Cliff&lt;/p&gt;

&lt;p&gt;This was the most surprising finding:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;30 concurrent: 100% success rate&lt;/li&gt;
&lt;li&gt;35 concurrent: 100% timeout rate&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Not gradual degradation. A hard wall. Memory bandwidth maxes out at ~32-33 concurrent requests, and the GPU memory simply can't serve more. If you're planning a DGX Spark deployment, treat 30 concurrent requests as the ceiling and don't count on any headroom above it.&lt;/p&gt;

&lt;p&gt;Benchmark Results&lt;/p&gt;

&lt;p&gt;2,859 code generation tests via EvalScope across 7 sessions. Each test validates JSON structure, function call schema, output completeness, and timeout compliance.&lt;/p&gt;

&lt;p&gt;Structural errors: &lt;strong&gt;zero&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;I ran the same 1,280 prompts against two cloud APIs and a 14B model on a Mac for comparison:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Latency&lt;/th&gt;
&lt;th&gt;Errors&lt;/th&gt;
&lt;th&gt;Output (avg lines)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;STORM (DGX, 32B)&lt;/td&gt;
&lt;td&gt;19.6s&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;37&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek V3&lt;/td&gt;
&lt;td&gt;2.6s&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;43&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kimi&lt;/td&gt;
&lt;td&gt;4.9s&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;40&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mac M4 Pro (14B)&lt;/td&gt;
&lt;td&gt;9.9s&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;38&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;DeepSeek wins on speed and output length. Kimi is fast but had two format breaks. The Mac with a 14B model was surprisingly competitive: zero errors and output length on par with the 32B.&lt;/p&gt;

&lt;p&gt;What's the Takeaway?&lt;/p&gt;

&lt;p&gt;For chat and real-time applications, cloud APIs win. They're faster, simpler, and you don't need to manage hardware.&lt;/p&gt;

&lt;p&gt;For agent pipelines where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You're running long tool-calling loops&lt;/li&gt;
&lt;li&gt;A single malformed JSON breaks the entire flow&lt;/li&gt;
&lt;li&gt;Rate limits at unpredictable hours are unacceptable&lt;/li&gt;
&lt;li&gt;You want prompt data staying on your hardware&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;...local inference with the right configuration delivers something cloud APIs don't: &lt;em&gt;guaranteed output structure&lt;/em&gt;. Not once in 2,859 tests did the model break format. That's the product.&lt;/p&gt;

&lt;p&gt;Try It Yourself&lt;/p&gt;

&lt;p&gt;Everything is open source. Reproduce the setup, run the benchmarks, verify the numbers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/YIQI-NUMBER1/stormengine" rel="noopener noreferrer"&gt;GitHub (code + data + methodology)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://api.stormengine.cloud/static/bench_report.html" rel="noopener noreferrer"&gt;Benchmark report&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://api.stormengine.cloud" rel="noopener noreferrer"&gt;API endpoint&lt;/a&gt; (free tier for testing)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Questions about the DGX setup, vLLM tuning, or benchmark methodology? Drop a comment.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>api</category>
    </item>
  </channel>
</rss>
