DEV Community

Kevin Naidoo

Posted on • Updated on • Originally published at kevincoder.co.za

How to host your own ChatGPT-like model?

Want to run a model similar to ChatGPT on your own infrastructure?

With the huge push to build open-source models, Mixtral is one of the best models available for free. It is also efficient enough to run on relatively low-spec hardware.

Although Mixtral is not as powerful as ChatGPT, it is still powerful enough for most generation tasks. I use Mixtral for product classification, labeling, and generating descriptions.

Setting up

You can probably get away with a decent-sized VPS or dedicated server, but I suggest getting a GPU box. These can be expensive; however, there are companies like Hetzner where you can get a GPU box for under $150 per month.

First things first, you will want to set up the graphics drivers and CUDA. These instructions are for Ubuntu 22.04 and may or may not work with other Ubuntu versions.

sudo add-apt-repository ppa:graphics-drivers/ppa --yes
sudo apt update -y
sudo apt-get install linux-headers-$(uname -r)
sudo ubuntu-drivers install --gpgpu
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update -y
sudo apt-get -y install cuda-toolkit-12-3

Install Ollama

Ollama is a powerful tool, written in Go, that runs large language models efficiently. I have tested various ways of running models, including llama.cpp, the Hugging Face inference API, and various other tools; Ollama tends to perform the best with a GPU.

If you are stuck on a CPU, llama.cpp may work better, although I managed to get Ollama working on a CPU just fine. I didn't run enough tests to draw a conclusion on which is better for CPU-only machines; on a GPU box, however, Ollama wins by a large margin.

To install Ollama:

curl https://ollama.ai/install.sh | sh

Now that Ollama is installed, set up Mixtral:

ollama pull mixtral:instruct

The above command downloads the Mixtral instruct model and configures it so that Ollama can run it locally.

Running Mixtral

Now that you have successfully configured Mixtral with Ollama, running the model is as simple as:

ollama run mixtral:instruct

The command above opens a prompt shell where you can chat with the model, much as you would with ChatGPT.

This is great for local testing but not very useful for integrating with web applications or other external apps. Next, we will look at running Ollama as an API server to solve this problem.

Running Ollama as an API Server

To run Ollama as an API server, you can use systemd. systemd is the service manager on most modern Linux distributions; it allows you to run and manage background services.

Here is an example of a systemd unit file:
mlapi.service

[Unit]
Description=Ollama Service
After=network-online.target

[Service]
ExecStart=/usr/local/bin/ollama serve
User=ollamauser
Group=ollamauser
Restart=always
RestartSec=3
Environment="OLLAMA_HOST=0.0.0.0:8000"
[Install]
WantedBy=default.target

In the above config, we run the process as "ollamauser", an isolated system user I created for security purposes.

You can run Ollama as any available user on your server; however, I would avoid running this process as "root". Rather, create a new system user (for example: "sudo useradd --system --no-create-home ollamauser") and keep the process as isolated as possible.

You will then need to place the config file in: "/etc/systemd/system/"

I called the file "mlapi.service". You can name it whatever you like; just be aware that you need to reference this exact file name when using the systemctl CLI tool.

To enable your service (run "sudo systemctl daemon-reload" first so systemd picks up the new unit file):

sudo systemctl enable mlapi.service

Now start your service as follows:

sudo systemctl start mlapi.service

To check that the service is up and running, you can use:

sudo systemctl status mlapi.service

Now that you are all set up, you can make an API call to the service as follows:

import requests

url = "http://127.0.0.1:8000/api/generate"

payload = {
    "model": "mixtral:instruct",
    "stream": False,
    "prompt": "Designing Data-Intensive Applications By Martin Kleppmann",
    "system": "Tag this book as one of the following: programming, cooking, fishing, young adult. Return only the tag exactly as per the tag list with no extra spaces or characters."
}

# requests serializes the dict to JSON and sets the Content-Type header for us
response = requests.post(url, json=payload)

# The reply is a JSON object; the model's answer is in its "response" field
print(response.json()["response"])


Sure enough, the model correctly returns "programming" as the tag. Ollama supports various other options; you can find more detailed information in the Ollama API documentation.
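For instance, the optional "options" field lets you tune sampling parameters such as "temperature" and "num_predict". Here is a minimal sketch (the helper function name is my own, not part of Ollama):

```python
import json


def build_generate_payload(model, prompt, system=None, stream=False, options=None):
    """Build the JSON body for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": stream}
    if system is not None:
        payload["system"] = system
    if options is not None:
        # e.g. {"temperature": 0.2, "num_predict": 128}
        payload["options"] = options
    return json.dumps(payload)


payload = build_generate_payload(
    "mixtral:instruct",
    "Summarize this article in one sentence.",
    options={"temperature": 0.2, "num_predict": 128},
)
```

You can then POST this body to the same URL as in the example above.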

To get you started, here is a breakdown of the most common parameters:

  1. model (required) - Ollama can run multiple models from the same API, so we need to tell it which model to use.
  2. stream (optional) - Set this to "false" to return the whole response as a single JSON object. The default is to stream the response, in which case you will receive a series of newline-delimited JSON objects, each containing a chunk of the reply.
  3. prompt (required) - The actual chat prompt.
  4. system (optional) - Any context information you want to give the model before it processes your prompt.
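When "stream" is left at its default of true, the server replies with one JSON object per line, each carrying a chunk of text in its "response" field and "done": true on the final object. Here is a rough sketch of how you might stitch those chunks back together (the helper name is my own, and the commented usage assumes the API server from the systemd section is listening on port 8000):

```python
import json


def join_stream_chunks(lines):
    """Concatenate the 'response' fields of a streamed Ollama reply."""
    parts = []
    for raw in lines:
        if not raw:  # skip blank keep-alive lines
            continue
        chunk = json.loads(raw)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)


# Usage against a live server:
# import requests
# resp = requests.post(
#     "http://127.0.0.1:8000/api/generate",
#     json={"model": "mixtral:instruct", "prompt": "Hello", "stream": True},
#     stream=True,
# )
# print(join_stream_chunks(resp.iter_lines()))
```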

Top comments

Aatmaj

Can you please elaborate on how effective it is compared to ChatGPT, and how much computing power is required?

Kevin Naidoo

Thanks for the question. On Hetzner, you can get an ARM box for around $30 per month with 16 vCPUs and 32 GB of RAM. This should be sufficient to run Mistral 7B. In fact, I ran it on one of the cheaper $15-per-month boxes as well, and the performance was okay.

Without a GPU, I got 9-12 tokens per second. The only issue is that you can run just one task at a time.

For better performance, a GPU box is advised; this can cost between $100 and $200 per month. I got an NVIDIA GeForce GTX 1080 box with 8 AMD cores and 64 GB of RAM for around 100 dollars. I am getting 5-8 tokens per second and can run multiple tasks in parallel, though it does start to slow down as you run more and more tasks.

ChatGPT is a commercial offering with loads of resources behind it, so Mistral is not going to be as good as ChatGPT. What I found, though, is that for tasks like generating articles, code, labeling, tagging, and categorization, this model works well, so it all depends on your use case.

