What this is
One NVIDIA DGX Spark (128GB unified memory), sitting in a room in Nanjing. Qwen2.5-32B-AWQ — the open-source model from Alibaba — running on the vLLM inference engine, straight quantized, no fine-tuning. Exposed through Cloudflare Tunnel. I built an OpenAI-compatible API endpoint: change one base_url and plug it into your Agent, your LangChain pipeline, your automated coding workflow.
This is not a big-company product. No GPU cluster, no elastic scaling, no fancy dashboard. Just a developer's self-hosted inference node, purpose-built for Agent workloads. Nothing else.
Why I'm looking for testers
Because I can't find real-world edge cases on my own. I've run 2,859 benchmark cases with zero structural errors, but that's a simulated environment. What happens when a real Agent hammers tool calling in a loop?
Does it hold up with 128K context stuffed to the brim? Only actual users can answer that.
What you get — completely free
- Unlimited tokens. Both the 32B and 14B models, no caps.
- 60 requests per minute. No concurrency limits within that envelope.
- Zero data retention. No logging for training, logs auto-purge after 30 days.
- Just an API key. No account creation, no sign-up flow, no onboarding bullshit. Free for the first month. After that, if you find it useful, we'll talk pricing. No commitment. What I expect from you
- You're actually running an Agent pipeline — not casually chatting with a bot. Code is executing, tool calls are flying, the system is doing work.
- You're willing to give feedback when things break. A one-liner saying "it choked" is genuinely useful.
- No stress-testing, no crypto mining. Please.
How to apply
Drop a comment with your email, or reach out directly:
jwx2020@aliyun.com. Five slots, first come first served. If you include a GitHub link or a quick description of what you're building, you jump the queue.
The downsides (no sugar-coating)
- Single machine. If it goes down, it goes down. No failover.
- ARM64 + 32B means single-request speed is modest (~13 tok/s). But vLLM continuous batching keeps system throughput reasonable under concurrent load.
- Latency depends on your physical distance from Nanjing.
Details & API docs
https://stormengine.cloud
That's it. If you're building something that needs an inference backend and you don't need a cloud giant, shoot me an email.
Top comments (0)