Self-Hosted Inference API — Open Beta, Looking for 5 Free Users

#api #ai #webdev #beginners

What this is

 One NVIDIA DGX Spark (128GB unified memory), sitting in a room in Nanjing. Qwen2.5-32B-AWQ — the open-source model from Alibaba — running on the vLLM inference engine, straight quantized, no fine-tuning. Exposed through Cloudflare Tunnel. I built an OpenAI-compatible API endpoint: change one base_url and plug it into your Agent, your LangChain pipeline, your automated coding workflow.

This is not a big-company product. No GPU cluster, no elastic scaling, no fancy dashboard. Just a developer's self-hosted inference node, purpose-built for Agent workloads. Nothing else.

Why I'm looking for testers

Because I can't find real-world edge cases on my own. I've run 2,859 benchmark cases with zero structural errors, but that's a simulated environment. What happens when a real Agent hammers tool calling in a loop?
Does it hold up with 128K context stuffed to the brim? Only actual users can answer that.

What you get — completely free

Unlimited tokens. Both the 32B and 14B models, no caps.
60 requests per minute. No concurrency limits within that envelope.
Zero data retention. No logging for training, logs auto-purge after 30 days.
Just an API key. No account creation, no sign-up flow, no onboarding bullshit. Free for the first month. After that, if you find it useful, we'll talk pricing. No commitment. What I expect from you
You're actually running an Agent pipeline — not casually chatting with a bot. Code is executing, tool calls are flying, the system is doing work.
You're willing to give feedback when things break. A one-liner saying "it choked" is genuinely useful.
No stress-testing, no crypto mining. Please.

How to apply
Drop a comment with your email, or reach out directly:
jwx2020@aliyun.com. Five slots, first come first served. If you include a GitHub link or a quick description of what you're building, you jump the queue.
The downsides (no sugar-coating)

Single machine. If it goes down, it goes down. No failover.
ARM64 + 32B means single-request speed is modest (~13 tok/s). But vLLM continuous batching keeps system throughput reasonable under concurrent load.
Latency depends on your physical distance from Nanjing.

Details & API docs
https://stormengine.cloud

That's it. If you're building something that needs an inference backend and you don't need a cloud giant, shoot me an email.