<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: GPUStack</title>
    <description>The latest articles on DEV Community by GPUStack (@gpustack).</description>
    <link>https://dev.to/gpustack</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1673709%2F6dfb7108-ed8e-4105-99f3-aaeb9ca11abd.png</url>
      <title>DEV Community: GPUStack</title>
      <link>https://dev.to/gpustack</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/gpustack"/>
    <language>en</language>
    <item>
      <title>GPUStack × MaxKB: Build a Powerful and Easy-to-Use Open-Source Enterprise AI Agent Platform</title>
      <dc:creator>GPUStack</dc:creator>
      <pubDate>Tue, 10 Mar 2026 02:54:50 +0000</pubDate>
      <link>https://dev.to/gpustack/gpustack-x-maxkb-build-a-powerful-and-easy-to-use-open-source-enterprise-ai-agent-platform-1mb8</link>
      <guid>https://dev.to/gpustack/gpustack-x-maxkb-build-a-powerful-and-easy-to-use-open-source-enterprise-ai-agent-platform-1mb8</guid>
      <description>&lt;h1&gt;
  
  
  GPUStack × MaxKB: Build a Powerful and Easy-to-Use Open-Source Enterprise AI Agent Platform
&lt;/h1&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Summary&lt;/strong&gt;: By leveraging GPUStack for efficient model deployment and management, and connecting those models to MaxKB, you can easily build an AI assistant with &lt;strong&gt;knowledge base retrieval + intelligent Q&amp;amp;A&lt;/strong&gt; capabilities.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;As AI applications become increasingly common within organizations, more teams are beginning to focus on two core challenges:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;How to efficiently manage and deploy local large models&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;How to quickly build enterprise knowledge bases and AI Agents&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you are looking for solutions to both problems, the combination of &lt;strong&gt;GPUStack + MaxKB&lt;/strong&gt; is well worth exploring.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GPUStack&lt;/strong&gt;: Focuses on GPU resource management and model deployment, supporting multi-node clusters and multi-model services.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MaxKB&lt;/strong&gt;: An open-source enterprise knowledge base and AI application platform that enables rapid development of knowledge-based Q&amp;amp;A systems and AI Agents.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By connecting &lt;strong&gt;GPUStack-provided model services to MaxKB&lt;/strong&gt;, you can easily build a &lt;strong&gt;practical enterprise AI knowledge assistant&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This article walks through the entire process from scratch: installing GPUStack, deploying models, deploying MaxKB, and wiring the two together.&lt;/p&gt;

&lt;h1&gt;
  
  
  📌 What You'll Learn
&lt;/h1&gt;

&lt;ol&gt;
&lt;li&gt;Deploy the latest &lt;strong&gt;GPUStack v2.1.0&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Deploy required models in GPUStack&lt;/li&gt;
&lt;li&gt;Obtain GPUStack model connection information&lt;/li&gt;
&lt;li&gt;Deploy &lt;strong&gt;MaxKB&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Connect GPUStack models in MaxKB&lt;/li&gt;
&lt;li&gt;Practical example: Build a &lt;strong&gt;GPUStack documentation knowledge base&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h1&gt;
  
  
  Install GPUStack v2.1.0
&lt;/h1&gt;

&lt;h2&gt;
  
  
  1. Install GPUStack Server
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;docker run &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="nt"&gt;--name&lt;/span&gt; gpustack-server &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--restart&lt;/span&gt; unless-stopped &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; 80:80 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-v&lt;/span&gt; gpustack-data:/var/lib/gpustack &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-v&lt;/span&gt; /data/gpustack_cache:/var/lib/gpustack/cache &lt;span class="se"&gt;\&lt;/span&gt;
  gpustack/gpustack:v2.1.0 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--bootstrap-password&lt;/span&gt; &lt;span class="s2"&gt;"123"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--debug&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
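
&lt;p&gt;Optionally, before opening the UI, you can confirm the container started cleanly. These are standard Docker commands, using the container name from the command above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Check that the container is running, then inspect the startup logs
sudo docker ps --filter name=gpustack-server
sudo docker logs --tail 50 gpustack-server
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;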



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd9h0e1tc39w4ywj2lxpz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd9h0e1tc39w4ywj2lxpz.png" width="800" height="333"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After running the command above, open your browser and visit:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;http://your_host_ip
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You will enter the &lt;strong&gt;GPUStack UI&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Default login credentials:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;admin / 123
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpkr2u7axddu3iztbcu9t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpkr2u7axddu3iztbcu9t.png" width="800" height="427"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Create a Cluster
&lt;/h2&gt;

&lt;p&gt;GPUStack manages worker nodes in units called &lt;strong&gt;Clusters&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;When deploying GPUStack Server for the first time, you will be prompted to create your first cluster. Click:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Create Your First Cluster&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Follow the UI instructions to complete the setup.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;You can also go to the &lt;strong&gt;Clusters&lt;/strong&gt; page from the sidebar and click &lt;strong&gt;Add Cluster&lt;/strong&gt; to create one manually.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2gtyibvn40543z3jxov3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2gtyibvn40543z3jxov3.png" width="800" height="427"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3vkvxvmorhy10g8asyoa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3vkvxvmorhy10g8asyoa.png" width="800" height="427"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3axbbhehoqre4jx66ob3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3axbbhehoqre4jx66ob3.png" width="800" height="427"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Add a Worker
&lt;/h2&gt;

&lt;p&gt;After creating a cluster, the system will prompt you to &lt;strong&gt;Add Worker&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Follow the instructions in the UI.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;You can also add one manually via the &lt;strong&gt;Workers&lt;/strong&gt; page in the sidebar.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fawy5nphxjgt67h3goj0a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fawy5nphxjgt67h3goj0a.png" width="800" height="427"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fabpe92adno4a57t0ez0z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fabpe92adno4a57t0ez0z.png" width="800" height="427"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5v8saioaklguhlaxwnxp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5v8saioaklguhlaxwnxp.png" width="800" height="427"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Run the diagnostic command provided in the guide interface.&lt;/p&gt;

&lt;p&gt;If the drivers and container runtime are correctly installed, you will see two &lt;strong&gt;OK&lt;/strong&gt; messages.&lt;/p&gt;

&lt;p&gt;If &lt;strong&gt;not configured&lt;/strong&gt; appears, follow the provided links to check dependency documentation and install the missing components according to your environment.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Filo1thxijfqzoolpx3ai.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Filo1thxijfqzoolpx3ai.png" width="800" height="241"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq6k60htirgwlwxb77cj7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq6k60htirgwlwxb77cj7.png" width="800" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Model Cache Volume Mount&lt;/strong&gt;: Mount this directory to the model cache directory &lt;code&gt;/var/lib/gpustack/cache&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GPUStack Data Volume&lt;/strong&gt;: Mount this directory to the data directory &lt;code&gt;/var/lib/gpustack&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F07bslxly0i1ap7s8byfc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F07bslxly0i1ap7s8byfc.png" width="800" height="427"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then run the Worker startup command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;docker run &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="nt"&gt;--name&lt;/span&gt; gpustack-worker &lt;span class="se"&gt;\&lt;/span&gt;
   &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;"GPUSTACK_RUNTIME_DEPLOY_MIRRORED_NAME=gpustack-worker"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
   &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;"GPUSTACK_TOKEN=gpustack_7b42996d3f5571d5_8181f986537c100369eaa2dfcf6d6359"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
   &lt;span class="nt"&gt;--restart&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;unless-stopped &lt;span class="se"&gt;\&lt;/span&gt;
   &lt;span class="nt"&gt;--privileged&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
   &lt;span class="nt"&gt;--network&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;host &lt;span class="se"&gt;\&lt;/span&gt;
   &lt;span class="nt"&gt;--volume&lt;/span&gt; /var/run/docker.sock:/var/run/docker.sock &lt;span class="se"&gt;\&lt;/span&gt;
   &lt;span class="nt"&gt;--volume&lt;/span&gt; gpustack-worker-data:/var/lib/gpustack &lt;span class="se"&gt;\&lt;/span&gt;
   &lt;span class="nt"&gt;--volume&lt;/span&gt; /data/gpustack_cache:/var/lib/gpustack/cache &lt;span class="se"&gt;\&lt;/span&gt;
   &lt;span class="nt"&gt;--runtime&lt;/span&gt; nvidia &lt;span class="se"&gt;\&lt;/span&gt;
   gpustack/gpustack:v2.1.0 &lt;span class="se"&gt;\&lt;/span&gt;
   &lt;span class="nt"&gt;--server-url&lt;/span&gt; http://192.168.50.14 &lt;span class="se"&gt;\&lt;/span&gt;
   &lt;span class="nt"&gt;--worker-ip&lt;/span&gt; 192.168.50.14
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Deploy Models in GPUStack
&lt;/h1&gt;

&lt;p&gt;Click &lt;strong&gt;Deployments&lt;/strong&gt; in the sidebar to open the model deployment page.&lt;/p&gt;

&lt;p&gt;If no models are currently deployed, you will see a &lt;strong&gt;Deploy Now&lt;/strong&gt; button in the center of the page.&lt;/p&gt;

&lt;p&gt;Click it to enter the &lt;strong&gt;Model Catalog&lt;/strong&gt;, select the desired model, and follow the prompts to deploy it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx10dblgvn4h3pij5z3ij.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx10dblgvn4h3pij5z3ij.png" width="800" height="427"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Additional deployment methods are available under the &lt;strong&gt;Deploy Model&lt;/strong&gt; menu in the top-right corner.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For this tutorial, we deploy the following three models:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Qwen3-Reranker-4B&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Qwen3-Embedding-4B&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Qwen3.5-35B-A3B&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;GPU memory allocation can be adjusted according to your environment.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Deploy Qwen3-Reranker-4B
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjibvn5koxi1ct74qsd5z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjibvn5koxi1ct74qsd5z.png" width="800" height="427"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3n12cp4z72r2cl1ov9p2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3n12cp4z72r2cl1ov9p2.png" width="800" height="427"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After deployment, you can test it in the &lt;strong&gt;Playground&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx51lidduddb6adlj1fin.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx51lidduddb6adlj1fin.png" width="800" height="427"&gt;&lt;/a&gt;&lt;/p&gt;
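
&lt;p&gt;Besides the Playground, you can exercise the reranker over HTTP. The request below is only a sketch: it assumes a Jina-style &lt;code&gt;/v1/rerank&lt;/code&gt; endpoint, and the host and API key are placeholders to replace with the connection details of your own deployment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Hypothetical example: score two documents against a query
curl http://192.168.50.14/v1/rerank \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer gpustack_xxxxxxxxxxxxxxxxx" \
  -d '{
    "model": "qwen3-reranker-4b",
    "query": "How do I add a worker to a GPUStack cluster?",
    "documents": [
      "Workers are added from the Workers page in the sidebar.",
      "MaxKB supports one-command Docker deployment."
    ]
  }'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;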

&lt;h2&gt;
  
  
  Deploy Qwen3-Embedding-4B
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5k5av21dql29mtfcaemx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5k5av21dql29mtfcaemx.png" width="800" height="427"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsau5d2d7fxy6ro5g97gt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsau5d2d7fxy6ro5g97gt.png" width="800" height="427"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After deployment, test it in the Playground.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgac5xvjqrp9730t2q3sr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgac5xvjqrp9730t2q3sr.png" width="800" height="427"&gt;&lt;/a&gt;&lt;/p&gt;
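
&lt;p&gt;The embedding model can also be tested with an OpenAI-compatible request. This is a sketch; the host and API key are placeholders to replace with your own values:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Request an embedding vector for a short text
curl http://192.168.50.14/v1/embeddings \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer gpustack_xxxxxxxxxxxxxxxxx" \
  -d '{"model": "qwen3-embedding-4b", "input": "GPUStack documentation"}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;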

&lt;h2&gt;
  
  
  Deploy Qwen3.5-35B-A3B
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;Here we additionally set the &lt;strong&gt;PYPI_PACKAGES_INSTALL&lt;/strong&gt; environment variable to upgrade the &lt;code&gt;transformers&lt;/code&gt; library.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyobzc1l0h70325hxz3ab.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyobzc1l0h70325hxz3ab.png" width="800" height="427"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3xaluy944pfnelo9ze5e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3xaluy944pfnelo9ze5e.png" width="800" height="427"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After deployment, test it in the &lt;strong&gt;Playground&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi55cpue7wxypd7qgctol.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi55cpue7wxypd7qgctol.png" width="800" height="427"&gt;&lt;/a&gt;&lt;/p&gt;
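
&lt;p&gt;You can likewise test the chat model from the command line with an OpenAI-compatible request. As before, this is a sketch with a placeholder host and API key:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Send a single chat message to the deployed model
curl http://192.168.50.14/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer gpustack_xxxxxxxxxxxxxxxxx" \
  -d '{
    "model": "qwen3.5-35b-a3b",
    "messages": [{"role": "user", "content": "What is GPUStack?"}]
  }'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;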

&lt;h1&gt;
  
  
  Obtain GPUStack Model Access Information
&lt;/h1&gt;

&lt;p&gt;Open the &lt;strong&gt;Routes&lt;/strong&gt; page from the sidebar.&lt;/p&gt;

&lt;p&gt;Click the three-dot menu next to the &lt;strong&gt;Route&lt;/strong&gt;, then select:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;API Access Info&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgko0djpn61kr0az0w5cj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgko0djpn61kr0az0w5cj.png" width="800" height="427"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Record the following information:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Base URL
Model Name
API Key
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;Base URL&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http://192.168.50.14/v1&lt;/span&gt;

&lt;span class="na"&gt;Model Name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="s"&gt;qwen3.5-35b-a3b&lt;/span&gt;
&lt;span class="s"&gt;qwen3-reranker-4b&lt;/span&gt;
&lt;span class="s"&gt;qwen3-embedding-4b&lt;/span&gt;

&lt;span class="na"&gt;API Key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="s"&gt;gpustack_xxxxxxxxxxxxxxxxx&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;You can create an API Key following the instructions in the UI.&lt;/p&gt;
&lt;/blockquote&gt;
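
&lt;p&gt;Before moving on to MaxKB, it can help to verify the recorded connection information with a quick request. Assuming the API is OpenAI-compatible, listing the available models should return the three deployed above (replace the host and key with your own API Access Info):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Sanity check: list the models this API key can reach
curl http://192.168.50.14/v1/models \
  -H "Authorization: Bearer gpustack_xxxxxxxxxxxxxxxxx"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;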

&lt;h1&gt;
  
  
  Deploy MaxKB
&lt;/h1&gt;

&lt;p&gt;MaxKB supports one-command Docker deployment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="nt"&gt;--name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;maxkb &lt;span class="nt"&gt;--restart&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;always &lt;span class="nt"&gt;-p&lt;/span&gt; 8080:8080 &lt;span class="nt"&gt;-v&lt;/span&gt; ~/.maxkb:/opt/maxkb 1panel/maxkb
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Default credentials:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;admin / MaxKB@123..
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foa0b22cjnz5xnqbpwppz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foa0b22cjnz5xnqbpwppz.png" width="800" height="427"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Upon first login, you will be prompted to change the password.&lt;/p&gt;

&lt;p&gt;Follow the instructions to update it.&lt;/p&gt;

&lt;h1&gt;
  
  
  Connect GPUStack Models in MaxKB
&lt;/h1&gt;

&lt;p&gt;In the top navigation bar of MaxKB, select &lt;strong&gt;Model&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faw5legtdwqovc6ew4ot9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faw5legtdwqovc6ew4ot9.png" width="800" height="427"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Click &lt;strong&gt;Add Model&lt;/strong&gt; in the upper-right corner.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgf9gjxldkjccok6rrgnn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgf9gjxldkjccok6rrgnn.png" width="800" height="427"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2qt03j6jobfzic92lrkg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2qt03j6jobfzic92lrkg.png" width="800" height="427"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fza2zo78gz8nzhnmfjtya.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fza2zo78gz8nzhnmfjtya.png" width="800" height="427"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;API URL&lt;/code&gt; and &lt;code&gt;API Key&lt;/code&gt; will only appear &lt;strong&gt;after entering the Base Model and pressing Enter&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Add the following models in the same way:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;qwen3-reranker-4b&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;qwen3-embedding-4b&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For &lt;strong&gt;qwen3-reranker-4b&lt;/strong&gt;, you must enable &lt;strong&gt;Generic Proxy&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fop06mr01gctdps0cn8c4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fop06mr01gctdps0cn8c4.png" width="800" height="427"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is required because MaxKB sends rerank requests to the endpoint:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;/v2/rerank
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fenpohberig3rmu39qpte.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fenpohberig3rmu39qpte.png" width="800" height="427"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F77amye6pqwm8aynpeoz7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F77amye6pqwm8aynpeoz7.png" width="800" height="427"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After configuration, it should look like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgn82lhbdr3397lmpboj0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgn82lhbdr3397lmpboj0.png" width="800" height="427"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Practical Example: Build a GPUStack Documentation Knowledge Base
&lt;/h1&gt;

&lt;p&gt;Open the &lt;strong&gt;Knowledge&lt;/strong&gt; page at the top and click &lt;strong&gt;Create&lt;/strong&gt; to create a knowledge base.&lt;/p&gt;

&lt;p&gt;Select &lt;strong&gt;Web Knowledge&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fni7ihbj48zje2qipvzjm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fni7ihbj48zje2qipvzjm.png" width="800" height="427"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Enter the GPUStack documentation URL.&lt;/p&gt;

&lt;p&gt;MaxKB will automatically crawl and parse the page content.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsg3xdydqa7skt7a2etz2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsg3xdydqa7skt7a2etz2.png" width="800" height="427"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After crawling is complete:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr970a2psqyfex4cmyw2a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr970a2psqyfex4cmyw2a.png" width="800" height="427"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Create an AI Agent
&lt;/h2&gt;

&lt;p&gt;Go to the &lt;strong&gt;Agent&lt;/strong&gt; page.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1z9av15o7qwzinsrl17u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1z9av15o7qwzinsrl17u.png" width="800" height="427"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Click &lt;strong&gt;Create&lt;/strong&gt; to create a new Agent.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgs1echpbcfc4n3coi806.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgs1echpbcfc4n3coi806.png" width="800" height="427"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After completing the configuration, click &lt;strong&gt;Publish&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Once published successfully, you can start chatting with the agent.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuxtxx29184mfdv3r0lar.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuxtxx29184mfdv3r0lar.png" width="800" height="427"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Chat Demo
&lt;/h2&gt;

&lt;p&gt;Open the chat interface:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F333brb860bzvlckvkeaj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F333brb860bzvlckvkeaj.png" width="800" height="427"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Example result:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F71gkas76tl05mx7m33q5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F71gkas76tl05mx7m33q5.png" width="800" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  🙌 Join the GPUStack Community
&lt;/h1&gt;

&lt;p&gt;If you have already started using GPUStack,&lt;br&gt;
or are exploring &lt;strong&gt;local large models / GPU resource management / AI infrastructure&lt;/strong&gt;,&lt;br&gt;
you are welcome to join our community group to exchange practical experience, pitfalls, and best practices.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://discord.gg/QAzGncGs" rel="noopener noreferrer"&gt;https://discord.gg/QAzGncGs&lt;/a&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>llm</category>
      <category>opensource</category>
    </item>
    <item>
      <title>No More Token Anxiety: Build an “Unlimited-Use” Local AI Assistant with GPUStack + OpenClaw</title>
      <dc:creator>GPUStack</dc:creator>
      <pubDate>Fri, 06 Mar 2026 02:32:29 +0000</pubDate>
      <link>https://dev.to/gpustack/no-more-token-anxiety-build-an-unlimited-use-local-ai-assistant-with-gpustack-openclaw-5de6</link>
      <guid>https://dev.to/gpustack/no-more-token-anxiety-build-an-unlimited-use-local-ai-assistant-with-gpustack-openclaw-5de6</guid>
      <description>&lt;p&gt;Over the past two years, more and more teams have integrated AI into their daily workflows.&lt;br&gt;
But soon, a practical issue emerged:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The more the model is used, the faster Tokens are consumed, and both costs and psychological pressure rise accordingly.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Many people rely on AI to improve efficiency, yet at the same time have to “use it sparingly” and “let it think less.”&lt;br&gt;
In the end, AI becomes a carefully budgeted consumable.&lt;/p&gt;

&lt;p&gt;If AI can run on your own GPU,&lt;br&gt;
&lt;strong&gt;without being billed by Token, available for conversation at any time, and running long-term inside collaboration tools,&lt;/strong&gt;&lt;br&gt;
then it truly feels like a real “work assistant.”&lt;/p&gt;

&lt;p&gt;Building on the local model capabilities provided by GPUStack, combined with &lt;strong&gt;OpenClaw (which supports collaboration platforms such as WhatsApp, Telegram, Discord, Slack, and Lark)&lt;/strong&gt; using Telegram as the example channel,&lt;br&gt;
this article walks through, step by step, how to build a &lt;strong&gt;truly usable, sustainably running, and almost Token-worry-free&lt;/strong&gt; local AI assistant.&lt;/p&gt;
&lt;h2&gt;
  
  
  📌 What This Article Covers
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Deploying a model with GPUStack&lt;/li&gt;
&lt;li&gt;Creating a Telegram bot application and configuring permissions&lt;/li&gt;
&lt;li&gt;Installing and configuring OpenClaw, with key considerations&lt;/li&gt;
&lt;li&gt;First-time authorization and connectivity testing on the Telegram side&lt;/li&gt;
&lt;li&gt;Practical example: Let the assistant star the GPUStack project&lt;/li&gt;
&lt;li&gt;Built-in assistant commands&lt;/li&gt;
&lt;li&gt;Useful OpenClaw commands and resource links&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;
  
  
  I. Deploy a Model with GPUStack and Prepare Access Information
&lt;/h2&gt;

&lt;p&gt;Before connecting OpenClaw, we need to complete model deployment in &lt;strong&gt;GPUStack&lt;/strong&gt; and obtain the model service access information.&lt;/p&gt;

&lt;p&gt;This section will use &lt;strong&gt;Qwen3.5-35B-A3B&lt;/strong&gt; as an example to demonstrate the complete process from&lt;br&gt;
&lt;strong&gt;Custom inference backend → Deploy model → Obtain access information&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  1. Environment Preparation and Version Information
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;GPUStack version: &lt;strong&gt;v2.0.3&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Custom inference backend image:
&lt;code&gt;vllm/vllm-openai:qwen3_5&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Model weights: &lt;strong&gt;Qwen/Qwen3.5-35B-A3B&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;⚠️ OpenClaw has requirements for the model context window:&lt;br&gt;
&lt;strong&gt;Minimum 16K, recommended 128K or above&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
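&lt;p&gt;If the default context length of a deployment falls short of this requirement, it can usually be raised through the vLLM startup parameters. A sketch (assuming the model itself supports a 128K context; &lt;code&gt;--max-model-len&lt;/code&gt; is a standard vLLM flag):&lt;/p&gt;

```plaintext
--max-model-len 131072
```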
&lt;h3&gt;
  
  
  2. Configure Custom Inference Backend (vLLM)
&lt;/h3&gt;

&lt;p&gt;In the GPUStack console, go to:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;“Inference Backends” → “Edit vLLM” → “Add Version”&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4n50a6nmswapd37g55ys.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4n50a6nmswapd37g55ys.png" width="800" height="508"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  3. Deploy the Qwen3.5-35B-A3B Model
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7ckvjnb5qu8z5ln5dc9l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7ckvjnb5qu8z5ln5dc9l.png" width="793" height="219"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F926wc9o61drenbfjx5hs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F926wc9o61drenbfjx5hs.png" width="770" height="1053"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Example parameters:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;--tensor-parallel-size=2
--mm-encoder-tp-mode data
--mm-processor-cache-type shm
--reasoning-parser qwen3
--enable-auto-tool-choice
--tool-call-parser qwen3_coder
--speculative-config '{"method": "mtp", "num_speculative_tokens": 1}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;If you encounter:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Error 803: system has unsupported display driver / cuda driver combination
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;You can try adding the environment variable:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;LD_LIBRARY_PATH=/usr/local/nvidia/lib64:/usr/local/nvidia/lib:/usr/lib/x86_64-linux-gnu
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  4. Obtain GPUStack Model Access Information
&lt;/h3&gt;

&lt;p&gt;Record the following three items:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;API Base URL&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Model ID&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;API Key&lt;/strong&gt; (create it in GPUStack)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9r1i3skvqvc9rlf983qz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9r1i3skvqvc9rlf983qz.png" width="800" height="336"&gt;&lt;/a&gt;&lt;/p&gt;
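&lt;p&gt;Before moving on, the three values can be sanity-checked from any terminal. The sketch below only composes the verification command; the URL, key, and model ID are placeholders to be replaced with the values recorded from your own deployment:&lt;/p&gt;

```shell
# Placeholders -- substitute the API Base URL, API Key, and Model ID
# recorded from your own GPUStack deployment.
GPUSTACK_BASE_URL="http://your-gpustack-server/v1"
GPUSTACK_API_KEY="gpustack_xxxxxxxx"
MODEL_ID="qwen3.5-35b-a3b"

# Compose the verification request as text so it can be reviewed first;
# run the printed command once the model is up to confirm connectivity.
REQUEST="curl ${GPUSTACK_BASE_URL}/chat/completions \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer ${GPUSTACK_API_KEY}' \
  -d '{\"model\": \"${MODEL_ID}\", \"messages\": [{\"role\": \"user\", \"content\": \"ping\"}]}'"
echo "$REQUEST"
```

&lt;p&gt;A successful response confirms the endpoint, key, and model ID before they are entered into OpenClaw.&lt;/p&gt;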

&lt;h2&gt;
  
  
  II. Create a Telegram Bot
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Open Telegram and search for &lt;strong&gt;BotFather&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Open the &lt;strong&gt;BotFather&lt;/strong&gt; app&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fausk5tydr18bj791dt5y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fausk5tydr18bj791dt5y.png" width="584" height="938"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol start="3"&gt;
&lt;li&gt;Create a new Bot and fill in the basic information&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4fzp5sjbum5wsfr569me.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4fzp5sjbum5wsfr569me.png" width="602" height="1068"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5arofy5sjknqcdbtbmhe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5arofy5sjknqcdbtbmhe.png" width="602" height="1068"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol start="4"&gt;
&lt;li&gt;Copy the &lt;strong&gt;Bot Token&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqz02jt86iacxqefqbmij.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqz02jt86iacxqefqbmij.png" width="602" height="1068"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For details, please refer to: &lt;a href="https://docs.openclaw.ai/channels/telegram" rel="noopener noreferrer"&gt;https://docs.openclaw.ai/channels/telegram&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  III. Install and Configure OpenClaw
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;Demo environment: Ubuntu 24.04&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  1. One-Click Installation
&lt;/h3&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl -fsSL https://openclaw.ai/install.sh | bash
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The script will automatically install dependencies such as Node and Git.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6rwhwbw5sb7lipqwfx2c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6rwhwbw5sb7lipqwfx2c.png" width="800" height="476"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Interactive Configuration Wizard
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Model/Auth Provider&lt;/strong&gt;
Select &lt;code&gt;Custom Provider (Any OpenAI or Anthropic compatible endpoint)&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmlvp9qjyo0us9mqdr4uv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmlvp9qjyo0us9mqdr4uv.png" width="800" height="522"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Enter the GPUStack &lt;strong&gt;API Base URL / API Key&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5mnlmvja8d5ymthuawt9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5mnlmvja8d5ymthuawt9.png" width="800" height="670"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Select &lt;code&gt;Telegram&lt;/code&gt; for &lt;strong&gt;Channel&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5q3ms06a38jcg00jhxfu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5q3ms06a38jcg00jhxfu.png" width="800" height="522"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Paste the &lt;strong&gt;Bot Token&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fawtd0sp933q1vb14zf0w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fawtd0sp933q1vb14zf0w.png" width="800" height="478"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  IV. First-Time Authorization and Testing
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Send a message to the bot in Telegram&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;On first use, it will prompt for &lt;strong&gt;Pairing authorization&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fno3ttqkzm8zuauz1yckv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fno3ttqkzm8zuauz1yckv.png" width="800" height="418"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol start="3"&gt;
&lt;li&gt;On the server, run:&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;openclaw pairing approve telegram &amp;lt;Pairing-Code&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fka2vy065y262m9kdb82k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fka2vy065y262m9kdb82k.png" width="800" height="306"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  V. Practical Example: Let the Bot Star the GPUStack Project
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Prepare a GitHub PAT
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Use &lt;strong&gt;Tokens (classic)&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Check the &lt;code&gt;repo&lt;/code&gt; scope&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq51fy4dnqb4b2fb0bal5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq51fy4dnqb4b2fb0bal5.png" alt="GitHub PAT" width="800" height="443"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Write to Environment Variables
&lt;/h3&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;vim ~/.openclaw/.env
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzwkl5b2char6p6t3xz6z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzwkl5b2char6p6t3xz6z.png" width="800" height="295"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Restart:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;openclaw gateway restart
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  3. Send a Command to the Bot
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwg9cysmqnb18lq44nklr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwg9cysmqnb18lq44nklr.png" width="800" height="361"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Result:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiiy9x4dn3ino6r2xtlyh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiiy9x4dn3ino6r2xtlyh.png" width="800" height="724"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  VI. Common Commands
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;/new&lt;/code&gt;: Start a new session&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;/status&lt;/code&gt;: Check bot status&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;/reset&lt;/code&gt;: Reset context&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;/model&lt;/code&gt;: View / switch model&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  VII. Useful OpenClaw Commands and Resources
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Common CLI Commands
&lt;/h3&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;openclaw logs --follow
openclaw doctor
openclaw gateway --help
openclaw dashboard
openclaw tui
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Documentation and Ecosystem
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;📘 &lt;a href="https://docs.openclaw.ai" rel="noopener noreferrer"&gt;https://docs.openclaw.ai&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;🌐 &lt;a href="https://clawhub.ai" rel="noopener noreferrer"&gt;https://clawhub.ai&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion: When AI Becomes Infrastructure, Not a Consumable
&lt;/h2&gt;

&lt;p&gt;Looking back, &lt;strong&gt;the essence of Token anxiety is not that models are expensive, but that AI is treated as an “external consumable resource.”&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When models run in the cloud and capabilities are controlled by others,&lt;br&gt;
we become accustomed to careful budgeting, limiting usage, and controlling call frequency.&lt;/p&gt;

&lt;p&gt;But when the model truly runs on your own GPU,&lt;br&gt;
when inference capability, context, and tool calls all become part of your infrastructure,&lt;br&gt;
the role of AI changes accordingly—&lt;/p&gt;

&lt;p&gt;It is no longer a paid API call each time,&lt;br&gt;
but a &lt;strong&gt;readily available, long-term online, continuously evolving work assistant&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This is exactly the significance of combining GPUStack and OpenClaw:&lt;br&gt;
&lt;strong&gt;Let AI return from a “cost item” to “productivity.”&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you already have GPU resources,&lt;br&gt;
you might as well try it yourself and truly integrate AI into your daily workflow.&lt;/p&gt;

&lt;p&gt;When you no longer worry about Tokens,&lt;br&gt;
you will truly begin to make good use of AI.&lt;/p&gt;

&lt;h2&gt;
  
  
  🙌 Join the GPUStack Community
&lt;/h2&gt;

&lt;p&gt;If you have already started using GPUStack,&lt;br&gt;
or are exploring &lt;strong&gt;local large models / GPU resource management / AI Infra&lt;/strong&gt;,&lt;br&gt;
you are welcome to join our community group to exchange practical experience, pitfalls, and best practices together.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://discord.gg/QAzGncGs" rel="noopener noreferrer"&gt;https://discord.gg/QAzGncGs&lt;/a&gt;&lt;/p&gt;

</description>
      <category>gpustack</category>
      <category>openclaw</category>
      <category>ai</category>
      <category>llm</category>
    </item>
    <item>
      <title>Building Your Private ChatGPT and Knowledge Base with AnythingLLM and GPUStack</title>
      <dc:creator>GPUStack</dc:creator>
      <pubDate>Tue, 12 Nov 2024 05:00:48 +0000</pubDate>
      <link>https://dev.to/gpustack/building-your-private-chatgpt-and-knowledge-base-with-anythingllm-and-gpustack-dgi</link>
      <guid>https://dev.to/gpustack/building-your-private-chatgpt-and-knowledge-base-with-anythingllm-and-gpustack-dgi</guid>
      <description>&lt;p&gt;&lt;strong&gt;AnythingLLM&lt;/strong&gt; [&lt;a href="https://github.com/Mintplex-Labs/anything-llm" rel="noopener noreferrer"&gt;https://github.com/Mintplex-Labs/anything-llm&lt;/a&gt;] is an all-in-one AI application that runs on Mac, Windows, and Linux. Its goal is to enable the local creation of a &lt;strong&gt;personal ChatGPT&lt;/strong&gt; using either commercial or open-source LLMs along with vector database solutions. AnythingLLM goes beyond being a simple chatbot by including Retrieval-Augmented Generation (RAG) and Agent capabilities. These features allow it to perform a variety of tasks, such as fetching website information, generating charts, summarizing documents, and more.&lt;/p&gt;

&lt;p&gt;AnythingLLM can integrate various types of documents into different workspaces, enabling users to reference document content during chats. This provides an easy way to organize workspaces for different tasks and documents.&lt;/p&gt;

&lt;p&gt;In this article, we will introduce how to build a personal ChatGPT with knowledge base using &lt;strong&gt;AnythingLLM + GPUStack&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Run models with GPUStack
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;GPUStack is an open-source GPU cluster manager for running large language models (LLMs)&lt;/strong&gt;. It enables you to create a unified cluster from GPUs across various platforms, including Apple MacBooks, Windows PCs, and Linux servers. Administrators can deploy LLMs from popular repositories like Hugging Face, allowing developers to access these models as easily as they would access public LLM services from providers such as OpenAI or Microsoft Azure.&lt;/p&gt;

&lt;p&gt;Unlike Ollama, &lt;strong&gt;GPUStack&lt;/strong&gt; is a cluster solution designed to aggregate GPU resources from multiple devices to run models.&lt;/p&gt;

&lt;p&gt;To deploy the &lt;strong&gt;Chat Model&lt;/strong&gt; and &lt;strong&gt;Embedding Model&lt;/strong&gt; on &lt;strong&gt;GPUStack&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Chat Model&lt;/strong&gt;: &lt;strong&gt;llama3.1&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Embedding Model&lt;/strong&gt;: &lt;strong&gt;bge-m3&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5ja76tpzv298cm4rbmbg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5ja76tpzv298cm4rbmbg.png" alt="image-20241105171908268" width="800" height="283"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You also need to create an API key. This key will be used by &lt;strong&gt;AnythingLLM&lt;/strong&gt; to authenticate when accessing the model APIs deployed on &lt;strong&gt;GPUStack&lt;/strong&gt;.&lt;/p&gt;
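&lt;p&gt;A quick way to confirm the key works is to call the embeddings endpoint directly before configuring AnythingLLM. A minimal sketch (the base URL and key below are placeholders for your own deployment):&lt;/p&gt;

```shell
# Placeholders -- replace with your GPUStack server URL and API key.
GPUSTACK_BASE_URL="http://your-gpustack-server/v1"
GPUSTACK_API_KEY="gpustack_xxxxxxxx"

# Compose the request as text for review; running the printed command
# against the bge-m3 deployment confirms the key is accepted.
REQUEST="curl ${GPUSTACK_BASE_URL}/embeddings \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer ${GPUSTACK_API_KEY}' \
  -d '{\"model\": \"bge-m3\", \"input\": \"hello\"}'"
echo "$REQUEST"
```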

&lt;h2&gt;
  
  
  Install and configure AnythingLLM
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;AnythingLLM&lt;/strong&gt; offers packages for &lt;strong&gt;Mac, Windows, and Linux&lt;/strong&gt;, you can download from &lt;a href="https://anythingllm.com/download" rel="noopener noreferrer"&gt;https://anythingllm.com/download&lt;/a&gt;. After installation, open AnythingLLM to begin the setup process.&lt;/p&gt;

&lt;h3&gt;
  
  
  Configure LLM Provider
&lt;/h3&gt;

&lt;p&gt;First, configure the chat model. Search for &lt;strong&gt;OpenAI&lt;/strong&gt;, select &lt;strong&gt;Generic OpenAI&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3nwdo25q7yjkjt1cfgvf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3nwdo25q7yjkjt1cfgvf.png" alt="image-20241105163235972" width="800" height="316"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And fill in the details for the model deployed on &lt;strong&gt;GPUStack&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F56l0jenlvsbsjcrt3f3j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F56l0jenlvsbsjcrt3f3j.png" alt="image-20241105163253668" width="800" height="312"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Save the settings, then configure the embedding model.&lt;/p&gt;

&lt;h3&gt;
  
  
  Configure Embedding Provider
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;AnythingLLM&lt;/strong&gt; includes a lightweight embedding model, &lt;strong&gt;all-MiniLM-L6-v2&lt;/strong&gt;, which offers limited performance and context length. For more powerful embedding capabilities, you can either opt for public embedding services or run open-source embedding models. Here, we’ll configure the embedding model &lt;strong&gt;bge-m3&lt;/strong&gt;, which is running on &lt;strong&gt;GPUStack&lt;/strong&gt;. Set the embedding provider to &lt;strong&gt;Generic OpenAI&lt;/strong&gt; and fill in the relevant configuration.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpk4o2vigwew0gx9j62m8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpk4o2vigwew0gx9j62m8.png" alt="image-20241105162753929" width="800" height="313"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then create a workspace; once it is ready, you can start using AnythingLLM.&lt;/p&gt;

&lt;h2&gt;
  
  
  Use AnythingLLM
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Chat with LLM
&lt;/h3&gt;

&lt;p&gt;Select a workspace, create a new thread, and send your question to the LLM:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb7elc5gv56ax1c58nu6c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb7elc5gv56ax1c58nu6c.png" alt="image-20241105163657917" width="800" height="476"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Fetch website content
&lt;/h3&gt;

&lt;p&gt;Click the upload button next to the workspace, enter the website URL in the &lt;strong&gt;Fetch website&lt;/strong&gt; box, and fetch the website content.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fglm84zqsx96x94ismm4d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fglm84zqsx96x94ismm4d.png" alt="image-20241105164159767" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The fetched website content will be sent to the embedding model for vectorization and then stored in the vector database.&lt;/p&gt;
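&lt;p&gt;Conceptually, the flow is: split the content into chunks, embed each chunk, store the vectors, then retrieve the most similar chunk at query time. The toy sketch below illustrates this pipeline; the bag-of-words "embedding" is a stand-in for a real call to the embedding model (e.g. bge-m3 on GPUStack), not what AnythingLLM actually uses internally:&lt;/p&gt;

```python
import math
from collections import Counter

def toy_embed(text):
    # Stand-in for the real embedding model: a simple bag-of-words vector,
    # just to illustrate the chunk-embed-store-retrieve pipeline.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values()))
    norm *= math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# 1. Split the fetched content into chunks.
chunks = ["GPUStack manages GPU clusters", "AnythingLLM builds RAG apps"]
# 2. Vectorize each chunk and store it in the "vector database" (a list here).
store = [(chunk, toy_embed(chunk)) for chunk in chunks]
# 3. At query time, embed the question and retrieve the most similar chunk.
query = toy_embed("which tool manages gpu clusters")
best = max(store, key=lambda item: cosine(query, item[1]))
```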

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6uhgfy1a5j2xtatyeo0q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6uhgfy1a5j2xtatyeo0q.png" alt="image-20241105164252415" width="800" height="434"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Check the content fetched from the website:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff4h5iuarbzfsvbv6powo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff4h5iuarbzfsvbv6powo.png" alt="image-20241105164801193" width="800" height="476"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Document embedding
&lt;/h3&gt;

&lt;p&gt;Click the upload button next to the workspace, then click the upload box and upload a document. The document will be sent to the embedding model for vectorization and then stored in the vector database.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fptocyq1wx46igsbpb621.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fptocyq1wx46igsbpb621.png" alt="image-20241105164914343" width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Check the content of embedded documents:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhx706cbhdrtmwdtde4sr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhx706cbhdrtmwdtde4sr.png" alt="image-20241105165047935" width="800" height="471"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For more information, please read the &lt;code&gt;AnythingLLM&lt;/code&gt; documentation: &lt;a href="https://docs.anythingllm.com/" rel="noopener noreferrer"&gt;https://docs.anythingllm.com/&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this tutorial, we have introduced how to use &lt;code&gt;AnythingLLM + GPUStack&lt;/code&gt; to aggregate GPUs across multiple devices and build an all-in-one AI application for RAG and AI Agents.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;GPUStack&lt;/code&gt; provides a standard OpenAI-compatible API that integrates quickly and smoothly with components across the LLM ecosystem. Want to give it a go? Integrate your tools, frameworks, and software with &lt;code&gt;GPUStack&lt;/code&gt; and share your experience with us!&lt;/p&gt;

&lt;p&gt;If you encounter any issues while integrating GPUStack with third parties, feel free to join &lt;a href="https://discord.gg/VXYJzuaqwD" rel="noopener noreferrer"&gt;GPUStack Discord Community&lt;/a&gt; and get support from our engineers.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Building Free GitHub Copilot Alternative with Continue + GPUStack</title>
      <dc:creator>GPUStack</dc:creator>
      <pubDate>Fri, 23 Aug 2024 17:00:00 +0000</pubDate>
      <link>https://dev.to/gpustack/building-free-github-copilot-alternative-with-continue-gpustack-2l37</link>
      <guid>https://dev.to/gpustack/building-free-github-copilot-alternative-with-continue-gpustack-2l37</guid>
      <description>&lt;p&gt;&lt;a href="https://seal.io/building-free-github-copilot-alternative-with-continue-and-gpustack/" rel="noopener noreferrer"&gt;Click here to read original post&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/continuedev/continue" rel="noopener noreferrer"&gt;&lt;code&gt;Continue&lt;/code&gt;&lt;/a&gt; is an open-source alternative to &lt;code&gt;GitHub Copilot&lt;/code&gt;, this is an open-source AI coding assistant that allows to connect various large language models(LLMs) within &lt;code&gt;VS Code&lt;/code&gt; and &lt;code&gt;JetBrains&lt;/code&gt; to build custom code autocompletion and chat capabilities. It supports:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Code parsing&lt;/li&gt;
&lt;li&gt;Code autocompletion&lt;/li&gt;
&lt;li&gt;Code optimization suggestions&lt;/li&gt;
&lt;li&gt;Code refactoring&lt;/li&gt;
&lt;li&gt;Inquiring about code implementations&lt;/li&gt;
&lt;li&gt;Online documentation search&lt;/li&gt;
&lt;li&gt;Terminal error parsing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;and more, helping developers code faster and work more efficiently.&lt;/p&gt;

&lt;p&gt;In this tutorial, we are going to use &lt;strong&gt;&lt;code&gt;Continue + GPUStack&lt;/code&gt;&lt;/strong&gt; to build a free, local GitHub Copilot alternative, providing developers with an AI pair-programming experience.&lt;/p&gt;

&lt;h2&gt;
  
  
  Running Models with GPUStack
&lt;/h2&gt;

&lt;p&gt;First, we will deploy the models on &lt;code&gt;GPUStack&lt;/code&gt;. There are three model types recommended by &lt;code&gt;Continue&lt;/code&gt;:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Chat model&lt;/strong&gt;: select &lt;code&gt;llama3.1&lt;/code&gt;, the latest open-source model from Meta.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Autocompletion model&lt;/strong&gt;: select &lt;code&gt;starcoder2:3b&lt;/code&gt;, a state-of-the-art code completion model from the BigCode project.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Embedding model&lt;/strong&gt;: select &lt;code&gt;nomic-embed-text&lt;/code&gt;, which supports a context length of 8192 tokens and outperforms OpenAI's ada-002 and text-embedding-3-small models on both short- and long-context tasks.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;
    &lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--lJmdCjxo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://gpustack-blogs.oss-cn-hongkong.aliyuncs.com/undefinedimage-20240822143650047.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--lJmdCjxo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://gpustack-blogs.oss-cn-hongkong.aliyuncs.com/undefinedimage-20240822143650047.png" alt="image 1" width="800" height="353"&gt;&lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;After deploying the models, you also need to create an &lt;code&gt;API key&lt;/code&gt; in the API Keys section; &lt;code&gt;Continue&lt;/code&gt; uses it for authentication when accessing the models deployed on &lt;code&gt;GPUStack&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Installing and Configuring Continue
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;Continue&lt;/code&gt; provides extensions for both &lt;code&gt;VS Code&lt;/code&gt; and &lt;code&gt;JetBrains&lt;/code&gt;. In this article, we will use &lt;code&gt;VS Code&lt;/code&gt; as an example. Install &lt;code&gt;Continue&lt;/code&gt; from the &lt;code&gt;VS Code&lt;/code&gt; extension store:&lt;/p&gt;

&lt;p&gt;
    &lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7YAwG3bw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://gpustack-blogs.oss-cn-hongkong.aliyuncs.com/undefinedimage-20240822144006940.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7YAwG3bw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://gpustack-blogs.oss-cn-hongkong.aliyuncs.com/undefinedimage-20240822144006940.png" alt="image 2" width="800" height="393"&gt;&lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;Once installed, drag the &lt;code&gt;Continue&lt;/code&gt; extension to the right panel to avoid conflict with the file explorer:&lt;/p&gt;

&lt;p&gt;
    &lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--V5RTjRFc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://gpustack-blogs.oss-cn-hongkong.aliyuncs.com/undefinedimage-20240822143946949.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--V5RTjRFc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://gpustack-blogs.oss-cn-hongkong.aliyuncs.com/undefinedimage-20240822143946949.png" alt="image 3" width="800" height="423"&gt;&lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;Then, click the settings button in the bottom-right corner to edit &lt;code&gt;Continue&lt;/code&gt;'s configuration and connect to the models deployed on &lt;code&gt;GPUStack&lt;/code&gt;. Update the &lt;code&gt;"models"&lt;/code&gt;, &lt;code&gt;"tabAutocompleteModel"&lt;/code&gt;, and &lt;code&gt;"embeddingsProvider"&lt;/code&gt; sections, filling in your GPUStack server URL and your own GPUStack-generated API key:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"models"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Llama 3.1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"provider"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"openai"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"llama3.1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"apiBase"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"http://192.168.50.4/v1-openai"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"apiKey"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"gpustack_f58451c1c04d8f14_c7e8fb2213af93062b4e87fa3c319005"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"tabAutocompleteModel"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Starcoder 2 3b"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"provider"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"openai"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"starcoder2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"apiBase"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"http://192.168.50.4/v1-openai"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"apiKey"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"gpustack_f58451c1c04d8f14_c7e8fb2213af93062b4e87fa3c319005"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"embeddingsProvider"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"provider"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"openai"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"nomic-embed-text"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"apiBase"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"http://192.168.50.4/v1-openai"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"apiKey"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"gpustack_f58451c1c04d8f14_c7e8fb2213af93062b4e87fa3c319005"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
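&lt;p&gt;A quick sanity check can catch typos before reloading the extension: all three sections should carry a provider, model, API base, and API key, and normally point at the same GPUStack server. The helper below is illustrative (not part of Continue); the API base mirrors the config above, and the key is a placeholder:&lt;/p&gt;

```python
# The same three sections shown above (the API key is a placeholder here).
config = {
    "models": [{"title": "Llama 3.1", "provider": "openai", "model": "llama3.1",
                "apiBase": "http://192.168.50.4/v1-openai", "apiKey": "YOUR_API_KEY"}],
    "tabAutocompleteModel": {"title": "Starcoder 2 3b", "provider": "openai",
                             "model": "starcoder2",
                             "apiBase": "http://192.168.50.4/v1-openai",
                             "apiKey": "YOUR_API_KEY"},
    "embeddingsProvider": {"provider": "openai", "model": "nomic-embed-text",
                           "apiBase": "http://192.168.50.4/v1-openai",
                           "apiKey": "YOUR_API_KEY"},
}

def check_continue_config(cfg):
    """Return a list of problems; an empty list means the sections look consistent."""
    problems = []
    sections = list(cfg.get("models", []))
    sections.append(cfg.get("tabAutocompleteModel", {}))
    sections.append(cfg.get("embeddingsProvider", {}))
    for section in sections:
        for key in ("provider", "model", "apiBase", "apiKey"):
            if not section.get(key):
                problems.append("missing " + key + " in " + section.get("model", "?"))
    bases = {s.get("apiBase") for s in sections}
    if len(bases) != 1:
        problems.append("sections point at different apiBase values: " + str(bases))
    return problems
```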



&lt;p&gt;
    &lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--PaSNQp7S--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://gpustack-blogs.oss-cn-hongkong.aliyuncs.com/undefinedimage-20240822144033667.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--PaSNQp7S--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://gpustack-blogs.oss-cn-hongkong.aliyuncs.com/undefinedimage-20240822144033667.png" alt="image 4" width="800" height="440"&gt;&lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;
    &lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--4fWKwaFO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://gpustack-blogs.oss-cn-hongkong.aliyuncs.com/undefinedimage-20240822144055057.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--4fWKwaFO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://gpustack-blogs.oss-cn-hongkong.aliyuncs.com/undefinedimage-20240822144055057.png" alt="image 5" width="800" height="439"&gt;&lt;/a&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  Using Continue
&lt;/h2&gt;

&lt;p&gt;After configuring &lt;code&gt;Continue&lt;/code&gt; to connect to the GPUStack-deployed models, go to the top-right corner of the &lt;code&gt;Continue&lt;/code&gt; plugin interface and select the &lt;code&gt;Llama 3.1&lt;/code&gt; model. Now you can use the features mentioned at the beginning of this tutorial:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Code Parsing&lt;/strong&gt;: Select the code, press &lt;code&gt;Cmd/Ctrl + L&lt;/code&gt;, and enter a prompt to let the local LLM parse the code:  &lt;/p&gt;

&lt;p&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--JFkUTpoF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://gpustack-blogs.oss-cn-hongkong.aliyuncs.com/undefinedimage-20240822145951464.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--JFkUTpoF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://gpustack-blogs.oss-cn-hongkong.aliyuncs.com/undefinedimage-20240822145951464.png" alt="image 6" width="800" height="430"&gt;&lt;/a&gt;
&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Code Autocompletion&lt;/strong&gt;: While coding, press &lt;code&gt;Tab&lt;/code&gt; to let the local LLM attempt to autocomplete the code:  &lt;/p&gt;

&lt;p&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--uxNMqP1I--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://gpustack-blogs.oss-cn-hongkong.aliyuncs.com/undefinedimage-20240822144132354.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--uxNMqP1I--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://gpustack-blogs.oss-cn-hongkong.aliyuncs.com/undefinedimage-20240822144132354.png" alt="image 7" width="800" height="500"&gt;&lt;/a&gt;
&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Code Refactoring&lt;/strong&gt;: Select the code, press &lt;code&gt;Cmd/Ctrl + I&lt;/code&gt;, and enter a prompt to let the local LLM attempt to optimize the code:  &lt;/p&gt;

&lt;p&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Y_5V-xQ4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://gpustack-blogs.oss-cn-hongkong.aliyuncs.com/undefinedimage-20240822145544825.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Y_5V-xQ4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://gpustack-blogs.oss-cn-hongkong.aliyuncs.com/undefinedimage-20240822145544825.png" alt="image 8" width="800" height="429"&gt;&lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;The LLM will provide suggestions, and you can decide whether to accept or reject them:  &lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;
    &lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Tcmwjwj2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://gpustack-blogs.oss-cn-hongkong.aliyuncs.com/undefinedimage-20240822144207805.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Tcmwjwj2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://gpustack-blogs.oss-cn-hongkong.aliyuncs.com/undefinedimage-20240822144207805.png" alt="image 9" width="800" height="549"&gt;&lt;/a&gt;
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Inquire About Code Implementation&lt;/strong&gt;: You can try &lt;code&gt;@Codebase&lt;/code&gt; to ask questions about the codebase, such as how a certain feature is implemented:  &lt;/p&gt;

&lt;p&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8Jxf_qzk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://gpustack-blogs.oss-cn-hongkong.aliyuncs.com/undefinedimage-20240822151421841.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8Jxf_qzk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://gpustack-blogs.oss-cn-hongkong.aliyuncs.com/undefinedimage-20240822151421841.png" alt="image 10" width="800" height="429"&gt;&lt;/a&gt;
&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Documentation Search&lt;/strong&gt;: Use &lt;code&gt;@Docs&lt;/code&gt;, select the documentation site you wish to search, and ask your question to find the results you need:&lt;/p&gt;

&lt;p&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jJADXV4A--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://gpustack-blogs.oss-cn-hongkong.aliyuncs.com/undefinedimage-20240822144718627.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jJADXV4A--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://gpustack-blogs.oss-cn-hongkong.aliyuncs.com/undefinedimage-20240822144718627.png" alt="image 11" width="800" height="428"&gt;&lt;/a&gt;
&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For more information, please read the official &lt;code&gt;Continue&lt;/code&gt; documentation: &lt;a href="https://docs.continue.dev/how-to-use-continue" rel="noopener noreferrer"&gt;https://docs.continue.dev/how-to-use-continue&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this tutorial, we have introduced how to use &lt;code&gt;Continue + GPUStack&lt;/code&gt; to build a free, local GitHub Copilot alternative, offering developers AI pair-programming capabilities at no cost.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;GPUStack&lt;/code&gt; provides a standard OpenAI-compatible API that integrates quickly and smoothly with components across the LLM ecosystem. Want to give it a go? Integrate your tools, frameworks, and software with &lt;code&gt;GPUStack&lt;/code&gt; and share your experience with us!&lt;/p&gt;

&lt;p&gt;If you encounter any issues while integrating GPUStack with third parties, feel free to join &lt;a href="https://discord.gg/VXYJzuaqwD" rel="noopener noreferrer"&gt;GPUStack Discord Community&lt;/a&gt; and get support from our engineers.&lt;/p&gt;

</description>
      <category>githubcopilot</category>
      <category>gpustack</category>
      <category>ai</category>
    </item>
    <item>
      <title>Introducing GPUStack: An open-source GPU cluster manager for running LLMs</title>
      <dc:creator>GPUStack</dc:creator>
      <pubDate>Thu, 25 Jul 2024 17:00:54 +0000</pubDate>
      <link>https://dev.to/gpustack/introducing-gpustack-an-open-source-gpu-cluster-manager-for-running-llms-5dmj</link>
      <guid>https://dev.to/gpustack/introducing-gpustack-an-open-source-gpu-cluster-manager-for-running-llms-5dmj</guid>
      <description>&lt;h2&gt;
  
  
  What is GPUStack?
&lt;/h2&gt;

&lt;p&gt;We are thrilled to launch GPUStack, an open-source GPU cluster manager for running Large Language Models (LLMs). Even though LLMs are widely available as public cloud services, organizations cannot easily host their own LLM deployments for private use. They need to install and manage complex clustering software such as Kubernetes and then figure out how to install and manage the AI tool stack on top. Popular ways to run LLMs locally, such as LM Studio and LocalAI, work only on a single machine.&lt;/p&gt;

&lt;p&gt;GPUStack allows you to create a unified cluster from any brand of GPUs in Apple MacBooks, Windows PCs, and Linux servers. Administrators can deploy LLMs from popular repositories such as Hugging Face. Developers can then access LLMs just as easily as accessing public LLM services from vendors like OpenAI or Microsoft Azure.&lt;/p&gt;

&lt;p&gt;For more details about GPUStack, visit:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;GitHub repo: &lt;a href="https://github.com/gpustack/gpustack" rel="noopener noreferrer"&gt;https://github.com/gpustack/gpustack&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;User guide: &lt;a href="https://docs.gpustack.ai" rel="noopener noreferrer"&gt;https://docs.gpustack.ai&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Why GPUStack?
&lt;/h2&gt;

&lt;p&gt;Today, organizations that want to host LLMs on a cluster of GPU servers have to do a lot of work to integrate a complex software stack. By using GPUStack, organizations no longer need to worry about cluster management, GPU optimization, LLM inference engines, usage metering, user management, API access, or the dashboard UI. GPUStack is a complete software platform for building your own LLM-as-a-Service (LLMaaS).&lt;/p&gt;

&lt;p&gt;As the following figure illustrates, the admin deploys models into GPUStack from a repository like HuggingFace, and then developers can connect to GPUStack to use these models in their applications.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgpustack.ai%2Fwp-content%2Fuploads%2F2024%2F07%2Fllmaas.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgpustack.ai%2Fwp-content%2Fuploads%2F2024%2F07%2Fllmaas.png" alt="img"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Key features of GPUStack
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;GPU cluster setup and resource aggregation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;GPUStack aggregates all GPU resources within a cluster. It is designed to support all GPU vendors, including Nvidia, Apple, AMD, Intel, Qualcomm, and others. GPUStack is compatible with laptops, desktops, workstations, and servers running macOS, Windows, and Linux.&lt;/p&gt;

&lt;p&gt;The initial release of GPUStack supports Windows PCs and Linux servers with Nvidia graphics cards, and Apple Macs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deployment and Inference for Models&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;GPUStack supports distributed deployment and inference of LLMs across a cluster of GPU machines.&lt;/p&gt;

&lt;p&gt;GPUStack selects the best inference engine for running the given LLM on the given GPU. The first LLM inference engine supported by GPUStack is LLaMA.cpp, which allows GPUStack to support GGUF models from Hugging Face and all models listed in the ollama library (&lt;a href="https://ollama.com/library" rel="noopener noreferrer"&gt;ollama.com/library&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;You can run any model on GPUStack by first converting it to GGUF format and uploading it to Hugging Face or the Ollama library.&lt;/p&gt;

&lt;p&gt;Support of other inference engines, such as vLLM, is on our roadmap and will be provided in the future.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; GPUStack will automatically schedule the model you select to run on machines with appropriate resources, relieving you of manual intervention. If you want to assess the resource consumption of your chosen model, you can use our GGUF Parser project: &lt;a href="https://github.com/gpustack/gguf-parser-go" rel="noopener noreferrer"&gt;https://github.com/gpustack/gguf-parser-go&lt;/a&gt;. We intend to provide more detailed tutorials in the future.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Although GPU acceleration is recommended for inference, we also support CPU inference, though the performance isn't as good as on a GPU. Alternatively, using a mix of GPU and CPU for inference can maximize resource utilization, which is particularly useful in edge or resource-constrained environments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Easy integration with your applications&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;GPUStack offers OpenAI-compatible APIs and provides an LLM playground along with API keys. The playground enables AI developers to experiment with and customize LLMs, and to seamlessly integrate them into AI-enabled applications.&lt;/p&gt;
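&lt;p&gt;Because the API follows the OpenAI schema, any OpenAI-style client can talk to GPUStack. The minimal sketch below shows the request body for a chat completion and how an OpenAI-style response is parsed; the model name follows the examples in this post, and the endpoint path and key are placeholders for your own deployment:&lt;/p&gt;

```python
import json

def chat_payload(model, user_message):
    # Body for POST {server}/v1-openai/chat/completions, sent with an
    # "Authorization: Bearer YOUR_API_KEY" header; field names follow the OpenAI spec.
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }).encode("utf-8")

def first_reply(response_json):
    # Extract the assistant message from an OpenAI-style response.
    return response_json["choices"][0]["message"]["content"]

payload = chat_payload("llama3.1", "Hello!")
# A server response has this shape (sample data, not a real completion):
sample = {"choices": [{"message": {"role": "assistant", "content": "Hi there!"}}]}
```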

&lt;p&gt;Additionally, you can use the metrics GPUStack provides to understand how your AI applications utilize various LLMs. This helps administrators manage GPU resource consumption effectively.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Observability metrics for GPUs and LLMs&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;GPUStack provides comprehensive monitoring of performance, utilization, and status metrics.&lt;/p&gt;

&lt;p&gt;For GPUs, administrators can use GPUStack to monitor real-time resource utilization and system status. Based on these metrics:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Administrators can perform scaling, optimization, and other maintenance operations.&lt;/li&gt;
&lt;li&gt;GPUStack adjusts its model scheduling algorithm.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For LLMs, developers can use GPUStack to access metrics like token throughput, token usage, and API request throughput. These metrics help developers evaluate model performance and optimize their applications. GPUStack plans to support auto-scaling based on these inference performance metrics in future releases.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Authentication and access control&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;GPUStack also provides authentication and role-based access control (RBAC) for enterprises. Users on the platform can have either admin or regular user roles. This guarantees that only authorized administrators can deploy and manage LLMs and that only authorized developers can utilize them.&lt;/p&gt;

&lt;h2&gt;
  
  
  GPUStack Use Cases
&lt;/h2&gt;

&lt;p&gt;GPUStack unlocks a world of possibilities for running LLMs on GPUs from any vendor. Here are just a few examples of what you can achieve with GPUStack:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Aggregate existing MacBooks, Windows PCs, and other GPU resources to offer a low-cost LLMaaS for a development team.&lt;/li&gt;
&lt;li&gt;In limited resource environments, aggregate multiple edge nodes to provide LLMaaS on CPU resources.&lt;/li&gt;
&lt;li&gt;Create your own enterprise-wide LLMaaS in your own data center for highly sensitive workloads that cannot be hosted in a cloud.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Getting Started with GPUStack
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Installation
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Linux or macOS&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;GPUStack provides a script to install it as a service on systemd or launchd based systems. To install GPUStack using this method, execute:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-sfL&lt;/span&gt; https://get.gpustack.ai | sh -
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now you have deployed and started the GPUStack server, which also serves as the first worker node. You can access the GPUStack UI via &lt;a href="http://myserver" rel="noopener noreferrer"&gt;http://myserver&lt;/a&gt; (replace &lt;code&gt;myserver&lt;/code&gt; with the IP address or domain of the host where you installed GPUStack).&lt;/p&gt;

&lt;p&gt;Log in to GPUStack with username &lt;code&gt;admin&lt;/code&gt; and the default password. You can run the following command to get the password for the default setup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; /var/lib/gpustack/initial_admin_password
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To add additional worker nodes and form a GPUStack cluster, please run the following command on each worker node:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-sfL&lt;/span&gt; https://get.gpustack.ai | sh - &lt;span class="nt"&gt;--server-url&lt;/span&gt; http://myserver &lt;span class="nt"&gt;--token&lt;/span&gt; mytoken
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Replace &lt;strong&gt;&lt;code&gt;http://myserver&lt;/code&gt;&lt;/strong&gt; with your GPUStack server URL and &lt;strong&gt;&lt;code&gt;mytoken&lt;/code&gt;&lt;/strong&gt; with your secret token for adding workers. To retrieve the token in the default setup from the GPUStack server, use the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; /var/lib/gpustack/token
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or follow the instructions on GPUStack to add workers:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgpustack.ai%2Fwp-content%2Fuploads%2F2024%2F07%2Fadd-worker.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgpustack.ai%2Fwp-content%2Fuploads%2F2024%2F07%2Fadd-worker.png" alt="img"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Windows&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Run PowerShell as administrator, then run the following command to install GPUStack:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="n"&gt;Invoke-Expression&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Invoke-WebRequest&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-Uri&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://get.gpustack.ai"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-UseBasicParsing&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Content&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can access the GPUStack UI at &lt;a href="http://myserver" rel="noopener noreferrer"&gt;http://myserver&lt;/a&gt; (replace &lt;code&gt;myserver&lt;/code&gt; with the IP address or domain of the host where you installed GPUStack).&lt;/p&gt;

&lt;p&gt;Log in to GPUStack with username &lt;code&gt;admin&lt;/code&gt; and the default password. You can run the following command to get the password for the default setup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="n"&gt;Get-Content&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-Path&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Join-Path&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-Path&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;$&lt;/span&gt;&lt;span class="nn"&gt;env&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nv"&gt;APPDATA&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-ChildPath&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"gpustack\initial_admin_password"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-Raw&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Optionally, you can add extra workers to form a GPUStack cluster by running the following command on other nodes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="n"&gt;Invoke-Expression&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&amp;amp; { &lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Invoke-WebRequest&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-Uri&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://get.gpustack.ai"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-UseBasicParsing&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;.Content) } -ServerURL http://myserver -Token mytoken"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the default setup, you can run the following to get the token used for adding workers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="n"&gt;Get-Content&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-Path&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Join-Path&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-Path&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;$&lt;/span&gt;&lt;span class="nn"&gt;env&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nv"&gt;APPDATA&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-ChildPath&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"gpustack\token"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-Raw&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For other installation scenarios, please refer to our installation documentation at: &lt;a href="https://gpustack.github.io/docs/quickstart" rel="noopener noreferrer"&gt;https://gpustack.github.io/docs/quickstart&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Serving LLMs
&lt;/h3&gt;

&lt;p&gt;As an LLM administrator, you can log in to GPUStack as the default system admin, navigate to &lt;strong&gt;&lt;code&gt;Resources&lt;/code&gt;&lt;/strong&gt; to monitor your GPU status and capacity, and then go to &lt;strong&gt;&lt;code&gt;Models&lt;/code&gt;&lt;/strong&gt; to deploy any open-source LLM to the GPUStack cluster. This lets you provide these LLMs to regular users for integration into their applications, helping you efficiently utilize your existing resources and deliver stable LLM services for a variety of needs and scenarios.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Access GPUStack to deploy the LLMs you need. Choose models from Hugging Face (only the GGUF format is currently supported) or the Ollama Library, download them to your local environment, and run them:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgpustack.ai%2Fwp-content%2Fuploads%2F2024%2F07%2Fdeploy-model.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgpustack.ai%2Fwp-content%2Fuploads%2F2024%2F07%2Fdeploy-model.png" alt="img"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GPUStack will automatically schedule the model to run on the appropriate Worker:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgpustack.ai%2Fwp-content%2Fuploads%2F2024%2F07%2Fmodel-list.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgpustack.ai%2Fwp-content%2Fuploads%2F2024%2F07%2Fmodel-list.png" alt="img"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You can manage and maintain LLMs by monitoring API requests, token consumption, token throughput, resource utilization, and more. This helps you decide whether to scale up or upgrade LLMs to keep the service stable.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgpustack.ai%2Fwp-content%2Fuploads%2F2024%2F07%2Fdashboard.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgpustack.ai%2Fwp-content%2Fuploads%2F2024%2F07%2Fdashboard.png" alt="img"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Integrating with your applications
&lt;/h3&gt;

&lt;p&gt;As an AI application developer, you can log in to GPUStack as a regular user and navigate to &lt;strong&gt;&lt;code&gt;Playground&lt;/code&gt;&lt;/strong&gt; from the menu. Here, you can interact with the LLM using the UI playground.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgpustack.ai%2Fwp-content%2Fuploads%2F2024%2F07%2Fplayground.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgpustack.ai%2Fwp-content%2Fuploads%2F2024%2F07%2Fplayground.png" alt="img"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, visit &lt;strong&gt;&lt;code&gt;API Keys&lt;/code&gt;&lt;/strong&gt; to generate and save your API key. Return to &lt;strong&gt;&lt;code&gt;Playground&lt;/code&gt;&lt;/strong&gt; to customize your LLM by adjusting the system prompt, adding few-shot examples, or tuning the prompt parameters. When you're done, click &lt;strong&gt;&lt;code&gt;View Code&lt;/code&gt;&lt;/strong&gt; and select your preferred code format (curl, Python, Node.js) along with the API key. Use this code in your applications to communicate with your private LLMs.&lt;/p&gt;

&lt;p&gt;You can now access the OpenAI-compatible API. For example, using curl:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;GPUSTACK_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;myapikey
curl http://myserver/v1-openai/chat/completions &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Bearer &lt;/span&gt;&lt;span class="nv"&gt;$GPUSTACK_API_KEY&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "model": "llama3",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Hello!"
      }
    ],
    "stream": true
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
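&lt;p&gt;The same request can be sketched in Python. The snippet below is a minimal sketch that only builds the JSON body mirroring the curl example above; the server URL and the &lt;code&gt;llama3&lt;/code&gt; model name are placeholders, so substitute whatever model you actually deployed in GPUStack:&lt;/p&gt;

```python
import json

# Sketch only: builds the same JSON body as the curl example above.
# "http://myserver" and "llama3" are placeholders -- substitute your
# GPUStack server URL and the model name you actually deployed.
GPUSTACK_ENDPOINT = "http://myserver/v1-openai/chat/completions"

def build_chat_request(model, user_message, stream=True):
    """Build the request body for GPUStack's OpenAI-compatible chat API."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_message},
        ],
        "stream": stream,
    }

payload = build_chat_request("llama3", "Hello!")
body = json.dumps(payload, indent=2)
print(body)
```

&lt;p&gt;POST this body to the endpoint with any HTTP client, sending your API key in an &lt;code&gt;Authorization: Bearer&lt;/code&gt; header exactly as in the curl example.&lt;/p&gt;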



&lt;h2&gt;
  
  
  Join Our Community
&lt;/h2&gt;

&lt;p&gt;Please find more information about GPUStack at: &lt;a href="https://gpustack.ai" rel="noopener noreferrer"&gt;https://gpustack.ai&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you encounter any issues or have suggestions for GPUStack, feel free to join our &lt;a href="https://discord.gg/VXYJzuaqwD" rel="noopener noreferrer"&gt;Community&lt;/a&gt; for support from the GPUStack team and to connect with fellow users globally.&lt;/p&gt;

&lt;p&gt;We are actively enhancing the GPUStack project and plan to introduce new features in the near future, including support for multimodal models, additional accelerators like AMD ROCm or Intel oneAPI, and more inference engines. Before getting started, we encourage you to follow and star our project on GitHub at &lt;a href="https://github.com/gpustack/gpustack" rel="noopener noreferrer"&gt;gpustack/gpustack&lt;/a&gt; to receive instant notifications about all future releases. We welcome your contributions to the project.&lt;/p&gt;

&lt;h2&gt;
  
  
  About Us
&lt;/h2&gt;

&lt;p&gt;GPUStack is brought to you by Seal, Inc., a team dedicated to making AI accessible to all. Our mission is to help enterprises put AI to work in their business, and GPUStack is a significant step toward that goal.&lt;/p&gt;

&lt;p&gt;Quickly build your own LLMaaS platform with GPUStack! Start experiencing the ease of creating GPU clusters locally, running and using LLMs, and integrating them into your applications.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>openai</category>
      <category>opensource</category>
      <category>news</category>
    </item>
  </channel>
</rss>
