DEV Community: Storm Engine Technology.

STORM AI: When "Cheap" Isn't Cheap and Premium Hurts—I Built the Middle Ground

Storm Engine Technology. — Fri, 12 Jun 2026 07:54:42 +0000

Two weeks ago, I plugged an NVIDIA DGX Spark into the network, loaded Qwen2.5-32B, slapped on a proxy, and pointed a domain at it. STORM AI inference API went live.
No team. No budget. One machine. Day one, I ran 2,859 evaluation prompts through EvalScope across four models.
Result: zero structural errors, 100% success at 30 concurrent requests. Not the best hardware. Good enough.
Why another API? The market is flooded.
What's missing is determinism.
Doubao started charging. DeepSeek looks cheap—¥1/M input, ¥2/M output—but "shared instance at capacity" is the norm, not the exception. You deploy an Agent. It runs fine at 2 PM. At 3 AM it hits 429. You wake up to a log
full of retries. The cheap price tag hides the real cost: your time.
OpenAI GPT-4o-mini at $0.60/M output. Claude at $15/M. Looks reasonable until your Agent burns through twenty bucks a day during development. Trial and error at scale isn't free.

I'm not saying STORM is better. I'm saying there's a use case being ignored: inference built for Agents, not chatbots.
Chatbots forgive failure. One retry, nobody cares. Agents chain calls: A's output feeds B's input, B's output triggers C's tool selection. One random fluctuation anywhere in that chain, and the whole thing collapses.
You don't need the smartest model. You need the same output for the same input, every time.
That's why STORM defaults to temperature=0. Not because we hate creativity. Because Agents don't need it. They need reliability.
Numbers don't lie

 EvalScope head-to-head:

 |                  | Success Rate | Avg Latency | Output Throughput |
 |------------------|--------------|-------------|-------------------|
 | STORM (DGX, 32B) | 100%         | 24s         | 307 tok/s         |
 | DeepSeek V3      | 100%         | 4s          | 980 tok/s         |
 | Kimi             | 96%          | 6s          | 520 tok/s         |
 | Mac M4 (14B)     | 100%         | 73s         | 45 tok/s          |

DeepSeek is faster. Nobody disputes that. But that speed is shared-pool speed, not your speed. STORM is slower, but those 307 tok/s are yours alone—no noisy neighbors, no sudden rate limits, no "sorry, high traffic." The DGX sits in Nanjing. Its compute budget is finite. What's allocated to you stays allocated to you.
Free trial. No strings.
100,000 free tokens. No credit card. No signup form. Point your Agent at it and see if it holds up. If it works for you, paid plans start at $3.90/month for 500K tokens.

Pricing: https://api.stormengine.cloud
Benchmarks: https://api.stormengine.cloud/static/bench_report.html
API docs: right on the landing page
If something's broken, tell me. I'll fix it.

June 11, 2026 Morning News on Artificial Intelligence

Storm Engine Technology. — Fri, 12 Jun 2026 05:56:25 +0000

Good morning, readers. Today is June 11, 2026, Thursday. Welcome to the morning news on artificial intelligence. Mid-June has arrived, and the global AI industry is undergoing a multidimensional transformation: geopolitical tensions have triggered a brief sell-off in US tech stocks, causing Nvidia's market cap to fall below $5 trillion; however, the race for computing power continues unabated—OpenAI is negotiating the lease of a 10 GW super data center, driving the largest infrastructure layout in history. Meanwhile, the IPO sprint of global AI unicorns is entering a fever pitch, and Chinese AI forces are accelerating both in-depth industrial application and capitalization. Domestic large models have led global usage for six consecutive weeks.

Macro and Market: Geopolitical Risks Temporarily Impact AI Sector, Computing Power Race Continues
On June 10, the Middle East situation tightened again, and rising geopolitical risks quickly increased risk aversion in global capital markets, leading to a sell-off in US tech stocks. Nvidia (NVDA) fell 3.73%, with its market cap falling below the $5 trillion mark; Google fell 2.48%, Microsoft fell 1.5%, Amazon fell 2.53%, Meta fell 2.33%, and Tesla fell 3.8%. The chip sector was under pressure overall, with Qualcomm, ARM, and Broadcom all falling by more than 5%, and AMD and TSMC falling by more than 4%. Oracle fell more than 7% after the market due to unexpected quarterly capital expenditures, raising concerns about the profitability of AI infrastructure business.

Despite the sector-wide correction, the underlying logic of expanding global AI technology infrastructure remains unshaken. According to multiple reports, OpenAI is conducting in-depth negotiations to lease a 10 GW super data center on federal land in Ohio. Nvidia has discussed providing financial guarantees for the project. If calculated based on current estimates, the total cost of the park upon completion will be at least $500 billion, making it the largest infrastructure commitment in OpenAI's history. In terms of capitalization, OpenAI has officially submitted a confidential IPO application to the SEC, significantly accelerating its financing pace.

Global Landscape: Three AI Giants Enter Intense IPO Race, Global Computing Power Map Accelerates Reconfiguration
Currently, the capitalization process in the global AI field is advancing at an unprecedented speed. SpaceX is priced at $135 per share, corresponding to a valuation of approximately $1.75 trillion, and will officially list on Nasdaq on June 12, becoming the largest IPO in US stock market history. After completing a $65 billion H-round financing, Anthropic's valuation has risen to $965 billion, surpassing OpenAI's $852 billion, and has submitted a confidential IPO application to the SEC, expected to list as early as October this year. The combined valuation of the three companies is approximately $3.6 trillion, and their IPO processes will reshape the global capital's valuation anchor for the AI track.

In terms of model iteration, Anthropic officially launched the new model Claude Fable 5, claiming superior performance compared to all previously released models, with significant advantages in software engineering and visual tasks, especially in long and complex task scenarios. At the same time, OpenAI officially entered the physical robot field, CEO Altman announced the formation of a robotics team and adjusted the R&D focus to human-machine collaboration, marking a comprehensive extension of global large model competition from pure software competition to a full-stack competition of "software + hardware."

Specifically, the fundamental characteristics of the three IPO giants exhibit subtle differences:

Anthropic's valuation has surpassed OpenAI, and its commercialization efficiency is more prominent. In the first quarter of 2026, its market share reached 31.4%, surpassing OpenAI's 29% for the first time, becoming the new leader in the global large model industry. The gross margin of its reasoning business has risen from 38% to over 70%, and its business model centered on enterprise customers is favored by public market investors.

SpaceX, although with the highest valuation, also faces significant loss pressures—revenue in 2025 was $18.674 billion, with a loss of $4.937 billion; revenue in the first quarter of 2026 was $4.69 billion, with operating losses of $1.94 billion and net losses of $4.276 billion.

OpenAI continues to incur huge losses, with monthly revenue reaching $2 billion in March, but a loss of $8 billion. Predictions indicate that it may achieve profitability by 2030.

Domestic Industry: Thousand

The 2026 China AI Agent Leaders List Revealed, Witnessing the Year of Intelligent Agent Application Explosion

Storm Engine Technology. — Thu, 11 Jun 2026 01:40:56 +0000

(Source: Lei Feng Network)

On June 2nd, at the 2026 Beijing Cybersecurity Conference (BCS 2026), the 2026 China AI Agent Leaders List was officially announced. Over 100 intelligent agents submitted by more than 100 companies across 20 industries were selected for the list, comprehensively showcasing the latest landscape of intelligent agent technology implementation and industrial innovation in China.

The selection was jointly initiated by authoritative institutions such as the China Internet Association and the China Artificial Intelligence Industry Alliance, and organized by the Beijing Cybersecurity Conference (BCS). The aim was to select exemplary cases that demonstrate innovation, demonstration, and safety. Unlike previous evaluations that focused heavily on technical indicators, this evaluation emphasized both application value and controllable security. The focus was on the depth of implementation and sustained operational capability of intelligent agents in real business scenarios. Therefore, data security and permission control around intelligent agents became critical evaluation criteria.

Former directors of the First and Third Research Institutes of the Ministry of Public Security, Yan Ming, the chairman of the Computer Security Special Committee of CCF; Huang Qingcheng, the chairman of the Data Security and Governance Working Committee of the China Internet Association; Zhao Lin, the former deputy director of the Science and Technology Information Bureau of the Ministry of Public Security and the chairman of the Digital Security Professional Committee of the China Security Prevention Product Industry Association; Jing Jing, the senior business manager of the Security Governance Department of AIIA (China Artificial Intelligence Industry Alliance) and the China Academy of Information and Communications Technology, attended the event and presented certificates to the selected companies.

A Hundred Boats Racing, Over 100 Intelligent Agents Profoundly Reconstruct Productivity

Since the application submission began on March 30th, hundreds of applications from various industries including government affairs, finance, telecommunications, energy, healthcare, manufacturing, and more were received. The evaluation comprehensively assessed the advanced technology, application effectiveness, security, and replicability of the submissions, ultimately forming the "2026 China Intelligent Agent Industry Map."

Selected projects not only cover a wide range of areas but also demonstrate significant value in real business scenarios. From code security auditing to smart customer service, from power inspection to government office work, intelligent agents are becoming crucial tools for enterprises to reduce costs and improve efficiency, and for governments to enhance governance capabilities. This marks the transition of China's AI technology from "usable" to "effective," and heralds the arrival of a turning point for the large-scale implementation of intelligent agents.

Four Major Tracks, Witnessing Intelligent Agents Moving Towards Full Industry and Scenario Implementation

The top 100 intelligent agents selected this time were strictly chosen according to four core tracks, covering the core development directions of the intelligent agent industry. Each track has seen the emergence of leading enterprises with technological leadership and exemplary implementation.

(I) Key Capability Intelligent Agents: Building the Safe Foundation of AI Applications

This track focuses on core supporting capabilities such as AI security protection, data privacy protection, intelligent operation, and code security, which are the foundational bedrock for ensuring the safe and stable operation of the intelligent economy. Among the selected projects, Hubei Bank's AI Code Security Intelligent Agent and North Silver Financial's AI Code Security Audit and Automatic Verification Intelligent Agent have solved the security pain points in code development. China Life Insurance (Overseas)'s "Guo Shou Cross-border Compliance Security Operation Intelligent Agent" uses AI technology to intelligently identify, dynamically monitor, and automatically handle compliance risks in cross-border business, providing a digital safeguard for financial institutions going global.

Ping An Securities' Threat Intelligence Analysis Intelligent Agent and Qihoo 360's Threat Intelligence Analysis Intelligent Agent have built an intelligent cybersecurity threat disposal system, significantly improving the efficiency and response speed of security operations. China Unicom Guangzhou Software Institute's Intelligent Maintenance Intelligent Agent and Inspur Information's Security Protection Intelligent Agent have achieved technological breakthroughs in intelligent maintenance and infrastructure security protection. Alibaba Cloud and Tencent Cloud's "DDoS Security Operation Intelligent Agent" and "Cloud Security Intelligent Agent" demonstrate the deep accumulation and layout of public cloud providers in intelligent security operations.

(II) Organizational Operations Intelligent Agents: Reshaping the Efficiency Engine of Enterprise Operations

Organizational operations intelligent agents are quietly transforming core functions such as human resources, marketing, customer service, and legal affairs in enterprises, becoming new tools for reducing costs and improving efficiency. In the marketing and service domains, Toyota Motor's "Lexus Customer Service Intelligent Agent" and Beijing Drainage Group's "Hotline Intelligent AI Calling System" showcase the AI upgrade practices of multinational automakers and state-owned enterprises in customer service. Baidu's "Enterprise All-in-One AI Marketing Application Hogee" demonstrates tremendous

Struggling to Gain Posting Privileges on r/LocalLLaMA: Seeking Help

Storm Engine Technology. — Mon, 08 Jun 2026 03:56:44 +0000

There is a community on Reddit called r/LocalLLaMA. I am eager to post about my Storm Engine Technology website there.
However, the community restricts posting for anyone who has not
contributed to the website. I have been contributing by liking posts and commenting every day for over ten days, but I am still unable to post. I am very frustrated. Can someone help me? Thank you very much!

Self-Hosted Inference API — Open Beta, Looking for 5 Free Users

Storm Engine Technology. — Fri, 05 Jun 2026 03:57:44 +0000

What this is

 One NVIDIA DGX Spark (128GB unified memory), sitting in a room in Nanjing. Qwen2.5-32B-AWQ — the open-source model from Alibaba — running on the vLLM inference engine, straight quantized, no fine-tuning. Exposed through Cloudflare Tunnel. I built an OpenAI-compatible API endpoint: change one base_url and plug it into your Agent, your LangChain pipeline, your automated coding workflow.

This is not a big-company product. No GPU cluster, no elastic scaling, no fancy dashboard. Just a developer's self-hosted inference node, purpose-built for Agent workloads. Nothing else.

Why I'm looking for testers

Because I can't find real-world edge cases on my own. I've run 2,859 benchmark cases with zero structural errors, but that's a simulated environment. What happens when a real Agent hammers tool calling in a loop?
Does it hold up with 128K context stuffed to the brim? Only actual users can answer that.

What you get — completely free

Unlimited tokens. Both the 32B and 14B models, no caps.
60 requests per minute. No concurrency limits within that envelope.
Zero data retention. No logging for training, logs auto-purge after 30 days.
Just an API key. No account creation, no sign-up flow, no onboarding bullshit. Free for the first month. After that, if you find it useful, we'll talk pricing. No commitment. What I expect from you
You're actually running an Agent pipeline — not casually chatting with a bot. Code is executing, tool calls are flying, the system is doing work.
You're willing to give feedback when things break. A one-liner saying "it choked" is genuinely useful.
No stress-testing, no crypto mining. Please.

How to apply
Drop a comment with your email, or reach out directly:
jwx2020@aliyun.com. Five slots, first come first served. If you include a GitHub link or a quick description of what you're building, you jump the queue.
The downsides (no sugar-coating)

Single machine. If it goes down, it goes down. No failover.
ARM64 + 32B means single-request speed is modest (~13 tok/s). But vLLM continuous batching keeps system throughput reasonable under concurrent load.
Latency depends on your physical distance from Nanjing.

Details & API docs
https://stormengine.cloud

That's it. If you're building something that needs an inference backend and you don't need a cloud giant, shoot me an email.

Baidu Ernie Foundation Model 5.1 Officially Released

Storm Engine Technology. — Fri, 05 Jun 2026 00:13:56 +0000

On May 9, Baidu announced the official release of its next-generation foundation model, Ernie Foundation Model 5.1.
According to the introduction, Ernie 5.1 adopts "Multi-Dimensional Elastic Pre-training" technology, achieving leading baseline performance with only about 6% of the pre-training cost of industry models of the same scale, and ranking first domestically on the LMArena Search Leaderboard.
In addition, the latest rankings from the Large Model Arena show that Ernie 5.1 scored 1,223 points, placing first domestically and fourth globally on the LMArena Search Leaderboard, making it the only Chinese model on the list.
The significant improvement in Ernie 5.1's comprehensive capabilities is attributed to key technologies such as "Multi-Dimensional Elastic Pre-training." This technology was first introduced with the release of Ernie 5.0, enabling the generation of multiple model scales from a single training run. As a milestone achievement of this technology, Ernie 5.1 fully inherits the knowledge from Ernie 5.0, compressing the total parameters to approximately 1/3 and the active parameters to approximately 1/2, while achieving leading baseline performance with only about 6% of the pre-training cost of industry models of the same scale.
Currently, Ernie 5.1 has been simultaneously launched on Baidu Qianfan Model Square and the Ernie Bot official website, open for experience by enterprise users and developers.

In March 2026, China's daily Token (word token) call volume exceeded 140 trillion

Storm Engine Technology. — Thu, 04 Jun 2026 01:24:35 +0000

Officially disclosed by Liu Liehong, Director of the National Data Administration, at the China Development Forum 2026 Annual Meeting on March 23, 2026.
Growth Trajectory:
Early 2024: 100 billion
End of 2025: 100 trillion
March 2026: 140 trillion
Two-year growth: Over 1,000x
End of 2025 to March 2026: Over 40% growth in just 3 months
Global Comparison
During the same period, China's weekly AI large model Token call volume surpassed the United States for three consecutive weeks, becoming one of the countries with the highest AI application activity globally. Data from OpenRouter, the world's largest AI model API aggregation platform, shows that from March 16 to 22, 2026:
Global total Token call volume: 20.4 trillion
China's share: 7.359 trillion, accounting for 36% of global volume
Policy Background
This explosive growth was driven by intensive policy support:
August 2025: The State Council issued the "Opinions on Deepening the Implementation of the 'AI+' Action"
January 2026: Eight ministries jointly issued the "Implementation Opinions on the 'AI+Manufacturing' Special Action", proposing to launch 1,000 high-level industrial AI agents by 2027
March 2026: The Government Work Report at the National Two Sessions first proposed "building a new form of intelligent economy"
Data Infrastructure
By the end of 2025, the country had built over 100,000 high-quality datasets with a total volume exceeding 890 PB (equivalent to 310 times the digital resource volume of the National Library of China). The National Data Administration is advancing six special actions: "Strong Foundation Expansion, Annotation Breakthrough, Quality Improvement, Application Empowerment, Management Services, and Value Release."
Industry Significance
Liu Liehong, Director of the National Data Administration, pointed out that the significant increase in daily Token call volume indicates:
China's AI development has entered a rapid growth phase
Application scenarios have deepened from "being able to chat" to "intelligent agents capable of decision-making and execution"
China's AI industry competitiveness has significantly strengthened, with "Token going global" becoming a marker of enhanced industrial competitiveness
The value of data elements is accelerating its release, and a virtuous cycle of "data supply—value release" has begun to emerge
Application Ecosystem Data (April 2026)
According to the "2026 China AI Application Panorama Report" by QbitAI Think Tank:
Web monthly visits: Exceeded 900 million
APP monthly downloads: Over 240 million
DAU year-on-year growth: 223%
AI productivity tools occupy over 70% of Web traffic share
AI creation APP DAU year-on-year growth: 449%
Commercialization Signals
Since late January 2026, some model companies have set records of 20-day revenue exceeding total annual revenue of 2025. As the "value anchor of the intelligent era" and "settlement unit", Token is accelerating the evolution of a new business logic around its calling, distribution, and settlement.

llama.cpp b9455 Finally Caught vLLM: 70t/s on 2x3090 Qwen 27B UQ8

Storm Engine Technology. — Wed, 03 Jun 2026 06:03:19 +0000

A Reddit user on r/LocalLLaMA just dropped some impressive numbers for llama.cpp build b9455, and they're worth paying attention to if you're running multi-GPU setups.

For months, vLLM was the undisputed king of multi-GPU inference — its tensor parallelism consistently hit 70+ tokens per second on dual 3090s while llama.cpp languished at 30-50 t/s. GGUF users grudgingly accepted the speed penalty as the cost of good quantization.

Build b9455 changed that.

The Setup

Hardware: 2x RTX 3090s (24GB each). Model: Unsloth's Qwen3.6-27B-UD-Q8_K_XL — a UD-Q8 quant that this user had been running at 30-50 t/s on older llama.cpp builds.

The key new flags:

--tensor-split 50,50 -sm tensor
--flash-attn on
--cache-type-k q8_0 --cache-type-v q8_0
--spec-type draft-mtp --spec-draft-n-max 3

The -sm tensor flag is the magic here. It changes how llama.cpp splits the model across GPUs — instead of the default row-based split (which leaves one card mostly idle during certain operations), tensor parallelism distributes individual matrix multiplications across both GPUs simultaneously.

The Numbers

Here's a raw coding session trace. Each line shows context size and throughput:

ctx 27K · pp 27K/18.8s 1417t/s · out 248/3.0s 81t/s · cold
ctx 31K · pp 3.8K/3.2s 1171t/s · out 353/4.7s 74t/s · 27K cached
ctx 37K · pp 6.7K/5.7s 1184t/s · out 335/4.5s 74t/s · 31K cached
ctx 43K · pp 5.5K/4.9s 1121t/s · out 357/5.0s 71t/s · 37K cached
ctx 44K · pp 1.3K/1.5s 861t/s · out 377/5.2s 72t/s · 43K cached
ctx 63K · pp 6.0K/4.8s 1266t/s · out 856/12.3s 69t/s · 57K cached
ctx 68K · pp 68K/54.2s 1247t/s · out 2.0K/28.8s 68t/s · cold

Three things jump out:

1. Decode speed is rock-solid at 67-81 t/s. Even at 68K context with a 2K token output, it held 68 t/s. That's the kind of consistency you need for agent workloads where context grows relentlessly across turns.

2. Prompt processing is absurdly fast. Cold-started 27K context filled in 18.8 seconds — 1,417 tokens per second of prefill. At that rate you're looking at about 60-70 seconds to fill a full 100K context from cold.

3. The 54 t/s low point was a 4,500-token decode. Long outputs are usually the bottleneck. Even there it stayed above 50 t/s, which at Q8 quality is more than usable for streaming a full code review or refactor.

Why This Matters

The OP had been running qwen3.6-mtp-8.0 on vLLM as a compromise — it ran fast, but the 8.0 quant was making subtle coding mistakes. Wrong variable names. Off-by-one errors in generated loops. The kind of bugs that pass unit tests but fail code review.

UD-Q8_K_XL at this speed is a completely different experience. Code output is clean — not "mostly correct," actually correct. For anyone feeding models into an agent loop that runs 20+ turns without human intervention, those silent errors compound fast. Going back to Q8 at vLLM speed eliminates an entire class of failures.

The Interesting Bits

A few configuration details worth noting:

MTP speculative decoding (--spec-draft-n-max 3): the draft model predicts 2-3 tokens ahead accurately enough to justify the extra compute. Going higher than 3 showed diminishing returns.
--no-mmap: costs a few seconds on cold start but prevents VRAM fragmentation across server restarts.
KV cache quantization to q8_0: basically lossless at these context sizes, but saves roughly 30% cache memory — the difference between fitting 60K context with headroom vs. OOM at 50K.

tl;dr

If you shelved llama.cpp for multi-GPU inference because vLLM was faster — especially if you're forced into lower-quality quants on vLLM to get acceptable speed — b9455 with -sm tensor is worth a retest. The gap is gone.

Full credit to the original Reddit post on r/LocalLLaMA for the benchmarks. What are you seeing with tensor-split on your hardware?

How I Ran 2,859 LLM Code Generation Tests with EvalScope — and Got Zero Errors

Storm Engine Technology. — Tue, 02 Jun 2026 07:07:02 +0000

After three weeks of running Qwen2.5-32B on a DGX Spark, the number that surprised me most wasn't the throughput or latency. It was zero.

Zero structural errors across 2,859 code generation tests.

What I Tested

EvalScope with code generation tasks covering:

Structured JSON output
Function calling (OpenAI tool format)
Multi-step tool use chains
Code completion with specific output formats

Each test run validates four things:

Valid JSON structure — no unclosed brackets, no broken syntax
Correct function call schema — the right parameters, right types
No truncated output — response completes fully within the token budget
Response within timeout — no hung generations

Seven test sessions, roughly 400 prompts each. Every single one passed.

The Setup

Model: Qwen2.5-32B-Instruct-AWQ (4-bit)
Engine: vLLM 0.21 with continuous batching
Temperature: 0 (deterministic mode)
Hardware: DGX Spark, 128GB unified memory, ARM64

bash
python -m vllm.entrypoints.openai.api_server \
--model Qwen2.5-32B-Instruct-AWQ \
--max-model-len 65536 \
--gpu-memory-utilization 0.9 \
--enforce-eager \
--enable-auto-tool-choice \
--tool-call-parser hermes

Why Zero Errors Surprised Me

I've used cloud APIs extensively. Even the best ones occasionally return truncated JSON under load, or a function call with a missing parameter. It's rare — 0.1-0.3% error rates — but when you're running autonomous agents doing 40+ sequential tool calls, a single failure cascades.

At 0.3% error rate per call, a 50-step agent loop has a ~14% chance of hitting at least one failure. Your agent works perfectly nine times, then mysteriously dies on the tenth run.

With zero errors in 2,859 trials, the 95% confidence upper bound on the error rate is 0.13%. That means a 50-step loop has a 93.8%+ chance of completing cleanly.

The Comparison

I also ran 1,280 identical prompts against cloud APIs:

Backend	Latency (median)	Structural Errors
DeepSeek V3	2.6s	0
Kimi	4.9s	2
Qwen2.5-14B (Mac M4)	9.9s	0
STORM (DGX, 32B)	19.6s	0

Cloud wins on speed. But the local setup matched the cloud on reliability, while the 14B on a $599 Mac Mini held its own on quality.

Reproduce It

Full methodology, test datasets, and raw results are on GitHub:

https://github.com/YIQI-NUMBER1/stormengine

If you've got a local setup, pull the repo and run the benchmarks. If you find errors I missed, open an issue — I genuinely want to know what breaks this.

Running Qwen2.5-32B on a DGX Spark: 3 Weeks, 2,859 Tests, Zero Errors — Full Setup Guide

Storm Engine Technology. — Tue, 02 Jun 2026 06:51:12 +0000

Why This Setup

If you're building agent pipelines, you already know the problem: one broken tool call at step 47, and your entire autonomous loop is toast. Cloud APIs have rate limits, and they don't care that your agent is running at 3 AM.
I wanted to see if a local setup could deliver the one thing that matters most for agents: deterministic, structurally perfect output. Every time. Here's what I learned after three weeks.

Hardware

DGX Spark (GB10)
128GB unified memory
20-core ARM64
Ubuntu 24.04 LTS

Single machine, single model. No Kubernetes. Sitting in a residential room behind CGNAT, exposed via Cloudflare Tunnel.

Model & Engine
bash
huggingface-cli download Qwen/Qwen2.5-32B-Instruct-AWQ

python -m vllm.entrypoints.openai.api_server \
--model Qwen2.5-32B-Instruct-AWQ \
--served-model-name Qwen2.5-32B \
--host 0.0.0.0 --port 8000 \
--max-model-len 65536 \
--gpu-memory-utilization 0.9 \
--dtype auto \
--enforce-eager \
--enable-auto-tool-choice \
--tool-call-parser hermes

Key flags explained:

--enforce-eager: ARM64 can't handle CUDA graphs — this is mandatory, not optional
--max-model-len 65536: Full 64K context window for long agent loops
--gpu-memory-utilization 0.9: Leave 10% headroom for KV cache spikes
--tool-call-parser hermes: Qwen2.5 uses Hermes format for tool calls

The AWQ 4-bit quantization is what makes this possible. 32B model at full precision would need ~64GB just for weights. Quantized, it's ~18GB, leaving plenty of room for KV cache in the 128GB unified memory pool.

The Numbers

Raw Performance

Single-stream generation: 12.9 tok/s. Not going to win any speed contests. ARM64 and 32B parameters are a heavy lift.

But throughput is a different story with vLLM's continuous batching:

25 concurrent: 266 tok/s system throughput
TTFT P50: 649ms
TTFT P99 at 25 concurrent: 1,579ms
TPOT median: 74ms

vLLM's prefix caching is doing the heavy lifting on TTFT — in agent loops, successive calls share system prompt context, and the cache hits keep first-token latency down.

The Concurrency Cliff

This was the most surprising finding:

30 concurrent: 100% success rate
35 concurrent: 100% timeout rate

Not gradual degradation. A hard wall. Memory bandwidth maxes out at ~32-33 concurrent requests, and the GPU memory simply can't serve more. If you're planning a DGX Spark deployment, plan for 30 concurrent max with zero headroom.

Benchmark Results

2,859 code generation tests via EvalScope across 7 sessions. Each test validates JSON structure, function call schema, output completeness, and timeout compliance.

Structural errors: zero.

I ran the same 1,280 prompts against cloud APIs for comparison:

Model	Latency	Errors	Output (avg lines)
STORM (DGX, 32B)	19.6s	0	37
DeepSeek V3	2.6s	0	43
Kimi	4.9s	2	40
Mac M4 Pro (14B)	9.9s	0	38

DeepSeek wins speed and verbosity. Kimi is fast but had format breaks. The Mac M4 with a 14B model was surprisingly competitive on quality.

What's the Takeaway?

For chat and real-time applications, cloud APIs win. They're faster, simpler, and you don't need to manage hardware.

For agent pipelines where:

You're running long tool-calling loops
A single malformed JSON breaks the entire flow
Rate limits at unpredictable hours are unacceptable
You want prompt data staying on your hardware

...local inference with the right configuration delivers something cloud APIs don't: guaranteed output structure. Not once in 2,859 tests did the model break format. That's the product.

Try It Yourself

Everything is open source. Reproduce the setup, run the benchmarks, verify the numbers:

Questions about the DGX setup, vLLM tuning, or benchmark methodology? Drop a comment.