GPUStack

Posted on Jun 30

GPUStack v2.2: From Model Serving to Token Operations, from Compute Pooling to GPU-as-a-Service

#ai #llm #opensource

Deploying a model and bringing it online is only the starting point of AI service delivery.

As large language model applications move into scaled production, AI infrastructure has to mature: from simply being able to run models to being operations-ready.

This shift is not just about adding more features. It reflects an evolution in platform positioning: from reliably serving inference workloads to becoming the infrastructure foundation that can truly support enterprise AI service delivery.

At this stage, the core challenge lies in advancing two areas in parallel: model serving must deliver operations-grade reliability and visibility, while compute management must expand from “serving inference workloads” to “unified allocation of the diverse resources required by AI.”

GPUStack v2.2 continues to move deeper in both directions: model serving is evolving from available to operations-ready, while compute management is extending from unified scheduling to on-demand services.

Deepening Support for Model Inference Scenarios and Lifecycle Management

The stability of model serving is not only a deployment-stage concern. After an instance starts running, issues such as OOM errors, hanging inference requests, and silent process crashes are often the more common problems in production environments.

Previously, GPUStack’s health checks mainly covered the startup phase. Once an instance started successfully, the platform had no way to detect issues that occurred later. Faulty instances could remain in the service pool and continue receiving traffic, resulting in silent failures that would only be addressed after someone noticed and handled them manually.

In v2.2, health probing is extended across the entire runtime lifecycle. The platform continuously checks the actual inference capability of each instance. When an abnormal instance is detected, it is immediately removed from the service pool and automatically restarted. Once recovered, it is automatically added back. Service availability is now proactively maintained by the platform, rather than relying on manual inspections or user reports.
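The remove, restart, and re-add cycle described above can be sketched as a small reconcile loop. This is a minimal illustration, not GPUStack's implementation: the `Instance` class, the `reconcile` function, and the `probe` callback are hypothetical names, and a real probe would send an actual inference request rather than read a dictionary.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Instance:
    name: str
    in_pool: bool = True  # whether the instance currently receives traffic
    restarts: int = 0     # how many times the platform has restarted it


def reconcile(instances: List[Instance], probe: Callable[[Instance], bool]) -> List[str]:
    """Run one probe pass over all instances and return the actions taken."""
    actions = []
    for inst in instances:
        healthy = probe(inst)
        if inst.in_pool and not healthy:
            # Failed probe: stop routing traffic first, then restart.
            inst.in_pool = False
            inst.restarts += 1
            actions.append(f"{inst.name}: removed and restarted")
        elif not inst.in_pool and healthy:
            # Recovered: start routing traffic to it again.
            inst.in_pool = True
            actions.append(f"{inst.name}: added back")
        # Otherwise nothing changes: healthy and serving, or still recovering.
    return actions


# Simulated probe results; instance "b" is failing on the first pass.
health = {"a": True, "b": False}
instances = [Instance("a"), Instance("b")]
print(reconcile(instances, lambda i: health[i.name]))  # ['b: removed and restarted']
health["b"] = True
print(reconcile(instances, lambda i: health[i.name]))  # ['b: added back']
```

The point of the sketch is the ordering: traffic is cut off before the restart, and the instance only rejoins the pool after a probe succeeds again.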

Troubleshooting capabilities have also been systematically enhanced. In production environments, what teams often need most is a complete record of what happened at the time of failure. Previously, this meant logging in to a terminal and checking logs by hand. v2.2 introduces three types of log access:

Historical logs before restart, allowing you to view the complete output before an instance crashed, instead of losing access to past failure logs after a restart;

Distributed sub-instance logs, allowing you to inspect the output of each node separately in multi-node deployments and quickly identify where the issue occurred;

Ray container logs, allowing you to view Ray container logs directly in the UI without troubleshooting through terminal commands.

Most production troubleshooting workflows can now be completed end to end within the GPUStack UI.

For distributed inference, v2.2 introduces an auto-distributed mode for vLLM's MP (multiprocessing) backend.

Previously, GPUStack only supported Ray-based auto-distributed vLLM deployments. MP-based distributed deployments had to be configured manually, and GPUStack could not automatically spin up all distributed instances.

vLLM has evolved quickly, and its newer MP-based distributed mode now offers clear advantages over Ray-based auto-distributed deployments in both operational overhead and inference performance.

Users can now choose the vLLM auto-distributed deployment strategy that best fits their needs.

Another notable update is support for Multi-LoRA.

In enterprise environments, fine-tuning models for different business scenarios is a common requirement. Previously, each LoRA Adapter had to run as a separate model instance, causing GPU memory overhead to grow linearly with the number of tasks and leading to significant resource waste.

With v2.2, multiple LoRA Adapters can be mounted to the same base model instance and switched dynamically. This allows the same hardware to support more fine-tuned tasks while significantly improving GPU memory utilization.
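A back-of-the-envelope comparison shows why sharing the base model matters. The sizes below are assumed, illustrative numbers rather than measurements, and the function names are hypothetical. The estimate counts weights only and ignores KV cache and runtime overhead.

```python
def separate_instances_gb(base_gb: float, adapter_gb: float, n_adapters: int) -> float:
    """Each adapter runs in its own instance, so the base weights are loaded n times."""
    return n_adapters * (base_gb + adapter_gb)


def shared_base_gb(base_gb: float, adapter_gb: float, n_adapters: int) -> float:
    """One instance loads the base weights once and mounts every adapter on it."""
    return base_gb + n_adapters * adapter_gb


# Assumed sizes for illustration only: 16 GB base model, 0.25 GB per adapter.
BASE_GB, ADAPTER_GB, N_ADAPTERS = 16.0, 0.25, 4

print(separate_instances_gb(BASE_GB, ADAPTER_GB, N_ADAPTERS))  # 65.0
print(shared_base_gb(BASE_GB, ADAPTER_GB, N_ADAPTERS))         # 17.0
```

In the first case the dominant term, the base model, is multiplied by the number of adapters. In the second, only the small adapter term grows.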

Token Usage Governance for Model Services

Models may be running, but where exactly are tokens being consumed? The question rarely comes up early on, when usage is limited. But once multiple teams and applications start sharing the same platform, it quickly becomes an operational pain point.

GPUStack previously supported usage statistics by model and by user, helping teams understand overall consumption trends.

However, these two dimensions were not sufficient for precise attribution.

When different applications and business lines share multiple keys under the same user account, usage cannot be clearly separated, making cost accounting difficult to perform.

v2.2 introduces usage statistics at the API Key level. Token consumption for each key is metered independently, allowing administrators to clearly see which caller is consuming what, and how much. This provides a direct basis for cross-team cost attribution and quota management.
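To see why the key dimension matters, consider a toy aggregation over request records. This is a sketch under assumed names: the `UsageRecord` fields and the `tokens_by` helper are hypothetical and do not reflect GPUStack's actual schema or API.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class UsageRecord:
    user: str
    api_key: str
    model: str
    prompt_tokens: int
    completion_tokens: int


def tokens_by(records: List[UsageRecord], dimension: str) -> Dict[str, int]:
    """Sum prompt and completion tokens, grouped by one field of the record."""
    totals: Dict[str, int] = defaultdict(int)
    for r in records:
        totals[getattr(r, dimension)] += r.prompt_tokens + r.completion_tokens
    return dict(totals)


# One user account, two keys used by two different applications.
records = [
    UsageRecord("alice", "key-chatbot", "model-a", 100, 50),
    UsageRecord("alice", "key-search", "model-a", 200, 100),
    UsageRecord("alice", "key-chatbot", "model-a", 20, 30),
]

print(tokens_by(records, "user"))     # {'alice': 500}
print(tokens_by(records, "api_key"))  # {'key-chatbot': 200, 'key-search': 300}
```

Grouping by user yields a single total for the account. Grouping by key separates the two applications, which is what cost attribution needs.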

Another important change is giving visibility back to users. Previously, users who wanted to understand their own consumption had to ask administrators to pull the data for them.

v2.2 introduces self-service personal usage queries on the user side. Consumption history by model and by time range can now be viewed directly in the UI, without going through a request process.

Once metering capabilities are in place, token consumption is no longer a black box. It becomes operational data that can support quota allocation, internal chargeback, and cost analysis.

Enhanced Production Deployment Capabilities

Platform capabilities can only be fully realized when supported by a solid deployment experience. In v2.2, GPUStack addresses key gaps in enterprise production deployment across three areas.

Kubernetes has become a mainstream choice for enterprise infrastructure. Until now, however, GPUStack lacked a standardized cloud-native path for deployment in K8s environments.

v2.2 provides an official Helm Chart, enabling installation and configuration through Helm in a single streamlined process. This allows GPUStack to fit directly into existing GitOps workflows and CI/CD systems, significantly reducing the operational cost of deployment and upgrades.

v2.2 also expands database compatibility with support for OceanBase and openGauss, giving teams more flexibility in enterprise deployment environments.

On the network topology side, v2.2 supports a Worker-to-Server one-way access mode.

In cross-region or cross-network-boundary deployment scenarios, many environments make it difficult to establish bidirectional connectivity between Server and Worker.

With one-way networking, Worker nodes only need to access the Server, while the Server does not need to initiate reverse connections. This removes a key networking barrier for unified management of multi-region clusters.
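One common way to get this property is a pull model: the Worker opens every connection, and the Server answers each request with whatever work is queued. The in-memory sketch below illustrates that pattern only. The source does not describe GPUStack's actual mechanism, so the `Server` and `Worker` classes and their methods are hypothetical.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, List


class Server:
    """Queues work per worker and only ever responds; it never dials out."""

    def __init__(self) -> None:
        self._pending: Dict[str, Deque[str]] = defaultdict(deque)
        self.reports: List[str] = []

    def enqueue(self, worker_id: str, task: str) -> None:
        self._pending[worker_id].append(task)

    def handle_poll(self, worker_id: str, status: str) -> List[str]:
        """Handle a worker-initiated request; the response carries queued tasks."""
        self.reports.append(f"{worker_id}: {status}")
        tasks = list(self._pending[worker_id])
        self._pending[worker_id].clear()
        return tasks


class Worker:
    """Initiates every exchange, so only outbound access to the Server is needed."""

    def __init__(self, worker_id: str, server: Server) -> None:
        self.worker_id = worker_id
        self.server = server  # stands in for an outbound network connection

    def poll_once(self) -> List[str]:
        return self.server.handle_poll(self.worker_id, "ready")


server = Server()
worker = Worker("w1", server)
server.enqueue("w1", "deploy model-a")
print(worker.poll_once())  # ['deploy model-a']
print(worker.poll_once())  # []
```

Because the Server only replies to requests it receives, firewalls and NAT on the Worker side need to allow outbound traffic only.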

From Compute Pooling to GPU Services

Unified scheduling of heterogeneous compute resources has always been one of GPUStack’s core capabilities: it brings GPUs from different vendors and with different specifications into a single compute pool for unified scheduling and monitoring. Previously, however, this compute pool primarily served one scenario: inference.

Data scientists often need an interactive development environment, while algorithm engineers may need dedicated GPUs for experimentation and debugging. In many teams, these needs were handled outside GPUStack through separate systems, resulting in fragmented resource allocation and metering.

v2.2 introduces the GPU Instance Service, bringing the process of “allocating an isolated GPU environment” under unified platform management.

Users can request isolated GPU instances on demand, specify the GPU vendor, model, and quantity, and select runtime templates that include storage mounts and port configurations. Once the instance is ready, they can access it through SSH or the web.

Usage is metered centrally by the platform and shares the same scheduling and usage system as inference services.
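The request described above carries a handful of parameters: vendor, model, quantity, and a template with mounts and ports. The sketch below models such a request as a plain data structure with basic validation. All field names are hypothetical, chosen for illustration, and are not GPUStack's API.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class GPUInstanceRequest:
    vendor: str      # GPU vendor, for example "nvidia"
    gpu_model: str   # GPU model name as listed in the compute pool
    gpu_count: int   # number of GPUs to allocate
    template: str    # name of the runtime template to start from
    mounts: Dict[str, str] = field(default_factory=dict)  # volume name -> mount path
    ports: List[int] = field(default_factory=list)        # ports to expose

    def validate(self) -> List[str]:
        """Return a list of problems; an empty list means the request looks sane."""
        errors = []
        if self.gpu_count < 1:
            errors.append("gpu_count must be at least 1")
        for port in self.ports:
            if not 1 <= port <= 65535:
                errors.append(f"port {port} is out of range")
        return errors


request = GPUInstanceRequest(
    vendor="nvidia",
    gpu_model="example-gpu",
    gpu_count=2,
    template="notebook",
    mounts={"datasets": "/data"},
    ports=[22, 8888],
)
print(request.validate())  # []
```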

This marks an expansion of GPUStack’s compute service model: the same compute resource pool can serve as the foundation for inference services, while also being exposed as on-demand GPU instances.

With unified scheduling and unified metering, management no longer fragments simply because resources are used in different scenarios. Future capabilities can build naturally on this foundation, such as resource scheduling for training and fine-tuning, or finer-grained virtualized compute partitioning.

The Next Stage of AI Infrastructure Platforms

The evolution of AI infrastructure follows a clear path: from being able to run models, to operating model services reliably, and then to unified management and allocation of the compute resources required by AI.

v2.2 takes a step forward in both dimensions. On the model serving side, deeper inference scenario support, improved instance lifecycle management, a token usage governance system, and enhanced production deployment capabilities provide the foundation for scaled service delivery.

On the compute side, the introduction of the GPU Instance Service expands resource allocation and metering beyond inference scenarios.

Both directions point toward the same goal: making GPUStack an infrastructure foundation that enterprises can truly rely on for AI service delivery.

Get started:

GitHub: https://github.com/gpustack/gpustack

Documentation: https://docs.gpustack.ai

GPUStack v2.2 Enterprise Edition: Coming Soon

The v2.2 open-source release lays the foundation for platform-level operational capabilities.

In more complex enterprise environments, there is another layer of capabilities that the open-source edition is not designed to fully cover. When a platform needs to serve multiple isolated tenants, manage compute consumption down to each API Key and model route, ensure end-to-end high availability, and support Token/GPU billing management, these are the challenges that GPUStack Enterprise Edition is built to address.

GPUStack v2.2 Enterprise Edition is designed for enterprise scenarios that require organization-level governance and commercial operations. It will support multi-tenant isolation; fine-grained quotas, rate limiting, and access control; production-grade high availability; resource topology visualization; and billing management. Stay tuned.

If you are interested in the Enterprise Edition, feel free to contact us to learn more and explore early access opportunities.