DEV Community

Alex


131 tokens per second on GPU under Kubernetes

For non-IT SME owners, choosing a GDPR-compliant provider feels like navigating fog. Transparency varies widely, and pay-as-you-go pricing makes costs hard to predict as usage grows.

We decided to build something they could control. We put Qwen 3.6 on an RTX A5000 pod in the Swedish cluster, loaded all model layers into GPU memory, and set a 196K-token context window.

I gave the model a task: summarize a 97-page PDF in German. It finished in 15 seconds. ChatGPT did the same in 13 seconds, Claude in 20. This is not a comprehensive benchmark, of course. The speed is kind of a side effect of going after fixed monthly bills and compliance. 🫣
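If you want to check a tokens-per-second figure like the one in the title on your own setup, here is a minimal sketch of the arithmetic. It is not the exact script we used: `tokens_per_second` and `measure_stream` are hypothetical helper names, and the stream can be any iterable that yields generated tokens one at a time (how you get that depends on your serving stack, which is an assumption here).

```python
import time
from typing import Callable, Iterable, Tuple


def tokens_per_second(token_count: int, elapsed_seconds: float) -> float:
    """Throughput: generated tokens divided by wall-clock seconds."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return token_count / elapsed_seconds


def measure_stream(
    stream: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[int, float]:
    """Consume a token stream; return (token count, seconds elapsed).

    `stream` is assumed to yield one generated token per item.
    `clock` is injectable so the function can be tested without waiting.
    """
    start = clock()
    count = sum(1 for _ in stream)  # drains the stream, counting tokens
    return count, clock() - start
```

With made-up numbers purely for illustration, `tokens_per_second(262, 2.0)` gives `131.0`. Keep in mind that an end-to-end time like the 15 seconds above also includes reading the prompt, so it is not the same thing as generation throughput.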
