Disclaimer: This article is meant for ordinary language model users rather than people who host or fine-tune models. It is by no means a professional article on AI or models. It is also heavily based on personal experience and is very opinionated.
Disclaimer: This article is written for the 12th day of Akatsuki Games Inc.'s Advent Calendar 2024
GPT-4o is one of the most widely used AI models today, thanks to ChatGPT's popularity. The AI landscape includes other major players like Anthropic's Claude models, Google's Gemini models, Meta's Llama models, Cohere's Command R models, Mistral's models, xAI's Grok models, and most recently, Amazon's Nova models.
Problem
However, besides these big commercial models, there are also lots of open-source and open-weights models available on HuggingFace, some with decent parameter counts and others smaller but fine-tuned on curated datasets, making them particularly good at certain areas (such as role playing or creative writing).
The problem is, while it's possible to chat with some models on HuggingFace via the Inference API, a model has to have enough activity to be deployed to the Inference API. That means to chat with or use models that are less popular or even niche, one has to either:
- Host the model on your local machine with your own GPUs
- Subscribe to hosts like Arli AI, Featherless or Infermatic
- Pay by tokens via hosts like Novita AI or Groq, or via proxy services like OpenRouter
- Use GPU Cloud to rent GPU hours to host models
Yet each of these comes with its own limitations.
Host locally
While a relatively decent GPU should be able to host a quantized 12B model, anything larger is out of reach without a better GPU or multiple GPUs. For a normal model user, that's simply impractical, let alone all the hassle one has to go through to set up a rig for it.
Subscription-based
Subscription-based services usually feature unlimited usage, but their context windows are usually shorter than what the model is capable of, due to costs. If you're like me and usually send requests with longer contexts, these services aren't really the best fit.
Pay by tokens
Pay by tokens is one of the most common ways I access models nowadays: thanks to OpenRouter's proxying, it's fairly easy to use lots of models hosted by different providers. The downside is that OpenRouter doesn't host models on its own, and hosts like Novita AI and Groq choose which models they want to host, so if the model you want to use is unavailable due to low demand or license problems (such as Mistral's licensing), you're out of luck. Also, paying by tokens means the more requests and tokens you use, the more you pay, and if you constantly use models like OpenAI o1, the cost builds up fairly quickly.
Host on the cloud (GPU Cloud)
GPU Cloud services let you rent powerful GPUs by the hour, giving you the flexibility to run any model you want without long-term commitment or hardware investment.
Therefore, if:
- You don't have your own GPUs to host models with arbitrary sizes
- You want longer context windows
- The model you want to use is not hosted by any provider, or can't be hosted commercially due to license problems
- You don't want to pay by tokens
then I think it's just simpler to rent GPU hours from a GPU cloud and host whatever model you're interested in, booting it up when you need it and shutting it down when you don't.
Host Models on the Cloud with RunPod
RunPod is a popular service that lets you rent GPU hours on demand. It might not be the best option, and it also offers other plans, but those are out of the scope of this article. This article covers how to host models on the cloud, from choosing the model you want to sending requests to your own pod with LibreChat.
Choose a model
There are tons of models available on HuggingFace, so the first step is choosing the model we want to host, as it determines how much VRAM and how much disk space we'll need.
Even though bigger models aren't necessarily better, generally the bigger the model, the smarter it is. Personally, I think 70B is a sweet spot that balances intelligence and cost.
I just stumbled upon a model called Chronos-Platinum-72B the other day and I kinda want to try it out, so in this article we'll use it as our example.
We'll choose the GGUF version of it since according to HuggingFace's documentation, GGUF is a binary format that is optimized for quick loading and saving of models, making it highly efficient for inference purposes.
Visiting Chronos-Platinum-72B's GGUF version page, we can see the available quantizations and their file sizes.
We won't get into the details of what quantizations are and how they work, but generally you don't want a quantization that's too low, as the quality deteriorates too much. For this example, we'll choose Q6_K.
According to Bartowski's notes on the same page:
Which file should I choose?
A great write up with charts showing various performances is provided by Artefact2 here
The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have.
If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM.
If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB Smaller than that total.
Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'.
If you don't want to think too much, grab one of the K-quants. These are in format 'QX_K_X', like Q5_K_M.
From the quantization list, we know that the size of the Q6_K quantization is 64.35 GB. If we want to fit the whole model in the GPU's VRAM, the VRAM of our pod must be larger than 64.35 GB. Also, the higher the context size, the more memory you will need. According to HuggingFace's documentation:
| Model Size | 1k tokens | 16k tokens | 128k tokens |
| --- | --- | --- | --- |
| 8B | 0.125 GB | 1.95 GB | 15.62 GB |
| 70B | 0.313 GB | 4.88 GB | 39.06 GB |
| 405B | 0.984 GB | 15.38 GB | 123.05 GB |
That means to host this model with 128K context:
Total VRAM needed = Model size (64.35GB) + Context window memory requirement (39.06GB for 128k tokens) = 103.41GB
To summarize, the hardware requirement of our pod will approximately be:
- Disk space: 64.35GB
- VRAM: 103.41GB
You can also use LLM-Model-VRAM-Calculator to calculate it.
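If you'd rather script this back-of-the-envelope estimate instead of doing it by hand (for example, to compare several quantizations), here's a minimal shell sketch using the numbers above; the two values are just this article's example figures.

```bash
# Rough VRAM estimate: GGUF file size + KV-cache memory for the target context.
# Example figures from this article: Q6_K files (~64.35 GB) and a ~70B model
# at 128k tokens of context (~39.06 GB, from the table above).
model_gb=64.35
kv_cache_gb=39.06
awk -v m="$model_gb" -v k="$kv_cache_gb" \
  'BEGIN { printf "Approximate VRAM needed: %.2f GB\n", m + k }'
```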
Now we have the requirements, it's time to spin up a pod.
Using KoboldCpp template on RunPod
KoboldCpp is a popular text generation software for GGML and GGUF models. It also comes with an OpenAI-compatible API endpoint when serving a model, which makes it easy to use with LibreChat and other software that can connect to OpenAI-compatible endpoints.
RunPod already has an official KoboldCpp template for spinning up a pod.
- Head to RunPod and register.
- On the left pane, click on "Pods", then click on "+ Deploy". You can see there are lots of GPU types to choose from. We need at least about 103 GB of VRAM, so 3x A40 (48 GB each, 144 GB in total) seems like a fair choice. Choose A40.
- Scroll down and drag the GPU Count bar to 3.
- If you haven't selected a template, click on the "Change Template" button, then search for and select "KoboldCpp - Official Template - Text and Image".
Now here's the important part: after selecting the official KoboldCpp template, we need to click on the "Edit Template" button and change the parameters to host the model of our choice.
- Click on the "Edit Template" button.
- In the "Pod Template Overrides" panel, we need to change the following parameters:
- Container Disk: 100 GB, since the model files take up about 64.35 GB of disk space.
- KCPP_MODEL: Replace everything with
https://huggingface.co/bartowski/Chronos-Platinum-72B-GGUF/resolve/main/Chronos-Platinum-72B-Q6_K/Chronos-Platinum-72B-Q6_K-00001-of-00002.gguf,https://huggingface.co/bartowski/Chronos-Platinum-72B-GGUF/resolve/main/Chronos-Platinum-72B-Q6_K/Chronos-Platinum-72B-Q6_K-00002-of-00002.gguf
This is because the Q6_K quantization of this model is split into 2 files, and we need to pass all split files to KoboldCpp by separating the links with commas.
You can get the links to the GGUF files by right-clicking the small icon next to each file name and copying the address.
- KCPP_ARGS: Replace everything with
--multiplayer --usecublas mmq --gpulayers 999 --contextsize 128000 --multiuser 20 --flashattention --ignoremissing --chatcompletionsadapter ChatML.json --hordemodelname Chronos-Platinum-72B
KoboldCpp's documentation covers the functionalities of these parameters, so we will only focus on some important parameters here.
- --contextsize 128000: Sets the context size to 128,000 tokens.
- --chatcompletionsadapter ChatML.json: Many models provide a recommended chat template to use with them. According to Chronos-Platinum-72B's model card, it uses the ChatML template. The officially bundled chat template files can be found here.
- --hordemodelname Chronos-Platinum-72B: Many models don't come with custom names, and by default any model served by KoboldCpp is named koboldcpp/model, which is barely helpful or identifiable. This parameter sets the model name properly, so when accessing the API and listing the available models, the model shows up with an appropriate name.
Then click on "Set Overrides" to save the overrides.
Now we are ready to deploy our pod. Click on "Deploy On-Demand" to deploy it.
Check our pod status
RunPod will start deploying our pod. In the "Pods" page, you can click on the "Logs" button of our newly created pod to see the logs and check if our model is ready.
When you see the line in the logs announcing the API endpoint, our model and OpenAI-compatible endpoint are ready. Copy this endpoint URL (which ends with /v1) to use later.
Now it's time for us to create a RunPod API key to access our model.
Creating a RunPod API key
Creating a RunPod API key is straightforward. Head to the "Settings" page in your RunPod console, and expand the "API Keys" panel. Then click on the "+ Create API Key" to create an API key. A Read Only API key is all we need to access our model in our pod.
After creating the API key, copy it somewhere safe so we can use it later.
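Before wiring anything into LibreChat, you can sanity-check the endpoint with curl. This is just a sketch with placeholder URL and key; depending on how the pod and KoboldCpp are configured, the endpoint may not actually require the key, but sending the header shouldn't hurt.

```bash
# List the models served by the pod; the name set via --hordemodelname should appear
curl https://<your-runpod-endpoint>/v1/models \
  -H "Authorization: Bearer <your-runpod-api-key>"

# Send a minimal chat completion request to the OpenAI-compatible endpoint
curl https://<your-runpod-endpoint>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <your-runpod-api-key>" \
  -d '{
        "model": "koboldcpp/Chronos-Platinum-72B",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 64
      }'
```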
Chat with Our RunPod Model Using LibreChat
Now that our pod is up and running, it's time to set up LibreChat locally and access our model.
- Clone the LibreChat repository. (A shell-command recap of these setup steps appears right before we boot up LibreChat below.)
- Copy docker-compose.override.yml.example, putting the copy in exactly the same folder, and rename it to docker-compose.override.yml.
- Uncomment services and the part under # USE LIBRECHAT CONFIG FILE so we can override settings with our custom config file.
- Add an environment section to the same part and put our RunPod API key there for easy access. IMPORTANT: For security reasons, you should NEVER store API keys directly in configuration files if you plan to:
  - Host LibreChat publicly where others can access it
  - Share or publish your LibreChat configuration
  - Push your configuration to version control systems like GitHub
  Instead, for production deployments, you should:
  - Let users provide their own API keys through the interface
  - Use secure environment variables or a secrets management system
  - Follow security best practices for credential management
  Since the purpose of this article is to host models for our own usage, storing the RunPod API key in our local docker-compose.override.yml will suffice.
- After these changes, the file should have services and the section under # USE LIBRECHAT CONFIG FILE uncommented, plus the new environment entry containing the RunPod API key. Remember: a configuration with an embedded API key should ONLY be used for personal, local setups where you are the sole user of the system.
- Copy .env.example, putting the copy in exactly the same folder, and rename it to .env.
- Copy librechat.example.yaml, putting the copy in exactly the same folder, and rename it to librechat.yaml.
- In librechat.yaml, scroll down to the custom part and, under the PortKey entry (or whatever the last entry in the custom endpoints section is), add the following settings:
- name: "Chronos-Platinum-72B"
  apiKey: '${RUNPOD_API_KEY}'
  baseURL: 'https://<Your RunPod Endpoint>/v1'
  models:
    default: ['koboldcpp/Chronos-Platinum-72B']
    fetch: true
  titleConvo: true
  titleModel: 'current_model'
  modelDisplayLabel: 'Chronos-Platinum-72B'
The options here are:
- name: The provider name that would appear in the drop down list in LibreChat
- apiKey: The RunPod API key we created in the previous step.
- baseURL: The OpenAI-compatible base URL we copied earlier by viewing the logs of our pod. This link has to end with /v1.
- models:
  - default: The default model to use.
  - fetch: Whether to fetch all available models so they will appear in the drop-down menu. This option is usually set to true even if there's only a single model in our pod.
- titleConvo: Whether to generate the title of a conversation by summarizing the content of the chat, just like what we see in Google AI Studio, ChatGPT, and Claude.
- titleModel: Which model to use for summarizing the conversation in order to give the conversation a title.
- modelDisplayLabel: Inside a conversation, the label/name of the model shown to the user (next to the endpoint's icon).
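As mentioned above, here's a recap of the setup steps as shell commands. It assumes LibreChat's official GitHub repository; the edits to docker-compose.override.yml and librechat.yaml described above still have to be done by hand.

```bash
# Clone LibreChat and enter the folder (assumes the official repository URL)
git clone https://github.com/danny-avila/LibreChat.git
cd LibreChat

# Create local copies of the example files, then edit them as described above
cp docker-compose.override.yml.example docker-compose.override.yml
cp .env.example .env
cp librechat.example.yaml librechat.yaml

# Optional: check that librechat.yaml is still well-formed after editing
# (assumes Python with PyYAML is available)
python3 -c "import yaml; yaml.safe_load(open('librechat.yaml'))" && echo "librechat.yaml parses OK"
```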
Once we've set up the RunPod endpoint, it's time to boot up LibreChat. Run docker compose up -d in the LibreChat folder, and if there are no errors, visiting http://localhost:3080/ should bring us to LibreChat's login page. Create an account and log into LibreChat.
Remember: LibreChat aims to clone the familiar interface of ChatGPT, so even though we need to register an account to log in, as long as we are running it locally, our account and email address are only stored in our local MongoDB (and on our local disk).
Once we create our account, we can log into LibreChat and see LibreChat's interface.
From the drop-down menu on the top left corner, we can see our newly added endpoint is already available.
Choose our endpoint and model, give the model a system prompt in the right pane, and send a message to say hello to our model hosted on RunPod!
Shut Down LibreChat
Simply run docker compose down or use GUIs such as Rancher Desktop, Docker Desktop, or OrbStack to shut down the containers.
Shut Down Our Pod
Click on the garbage can icon on the "Pods" page in RunPod to terminate our pod and stop incurring further costs.
Congratulations! We can now pick whatever HuggingFace model we want to use and whatever context window we want it to support (as long as the model itself supports it). There are still lots of settings and details that can be tweaked, and lots of different models to explore. I hope this article can serve as a starting point for anyone interested in hosting models for their own use. Happy prompting!