Containerized AI before Apocalypse 🐳🤖

chh

ChatGPT is awesome, but privacy is a concern for many. What if you could host your own private AI on an old PC, without relying on GPU clusters?

Thanks to amazing community projects like ggml and llama.cpp, and to contributors like TheBloke, anyone can now chat with an AI privately, without internet, before the apocalypse.

In this article, we will containerize an AI before it ends the world: we will deploy a Large Language Model (LLM, also known as "AI") in a container within a Kubernetes cluster, enabling us to have conversations with it.

To get started, you'll need a Kubernetes cluster, for example a minikube instance with approximately 8 CPU threads and 5GB of memory. Additionally, you'll need Helm installed.
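If you don't have a cluster handy, a minikube instance sized as above can be started like this (a sketch; adjust --cpus and --memory to your hardware):

```shell
# Start a local single-node cluster sized for a 3B-parameter quantized model
minikube start --cpus 8 --memory 5g
```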

Let's begin by deploying the LLM within a minimal wrapper.

cat > values.yaml <<EOF
replicas: 1
deployment:
  image: quay.io/chenhunghan/ialacol:latest
  env:
    DEFAULT_MODEL_HG_REPO_ID: TheBloke/orca_mini_3B-GGML
    DEFAULT_MODEL_FILE: orca-mini-3b.ggmlv3.q4_0.bin
    DEFAULT_MODEL_META: ""
    THREADS: 8
    BATCH_SIZE: 8
    CONTEXT_LENGTH: 1024
service:
  type: ClusterIP
  port: 8000
  annotations: {}
EOF
helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install orca-mini-3b ialacol/ialacol -f values.yaml

If you're interested in the technical details, here's what's happening behind the scenes:

  • We are deploying a Helm release orca-mini-3b using the Helm chart ialacol.
  • The container image ialacol is a mini RESTful API server compatible with the OpenAI API. (Disclaimer: I am the main contributor to this project.)
  • The deployed LLM, orca mini, has 3 billion parameters and is based on the OpenLLaMA project.
  • The model file has been quantized by TheBloke into a 4-bit GGML format.
Now, please be patient for a few minutes as the container downloads the model file, which is around 1.93GB in size:

INFO:     Downloading model... TheBloke/orca_mini_3B-GGML/orca-mini-3b.ggmlv3.q4_0.bin
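You can follow the download progress in the pod's logs (this assumes the chart names the Deployment after the Helm release, which may differ in your setup):

```shell
# Stream logs from the pod backing the orca-mini-3b Deployment
kubectl logs --follow deployment/orca-mini-3b
```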

Once the download is complete, it's time to start a conversation!

Expose the service:

kubectl port-forward svc/orca-mini-3b 8000:8000

Ask a question:

USER_QUERY="What is the meaning of life? Explain like I am 5."
MODEL="orca-mini-3b.ggmlv3.q4_0.bin"
curl -X POST \
     -H 'Content-Type: application/json' \
     -d '{ "prompt": "### System:\nYou are an AI assistant that follows instruction extremely well. Help as much as you can.\n\n### User:\n'"${USER_QUERY}"'\n\n### Response:\n", "model": "'"${MODEL}"'" }' \
     http://localhost:8000/v1/completions
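The endpoint returns JSON in the OpenAI completions format. To pull out just the generated text, you can pipe the response through a small Python one-liner (the sample body below is an assumption based on the OpenAI API schema; in practice you would pipe the actual curl output instead):

```shell
# Sample response body (shape assumed from the OpenAI completions API)
RESPONSE='{"choices":[{"text":"The meaning of life is..."}]}'
# Extract only the generated text from choices[0].text
echo "$RESPONSE" | python3 -c 'import json,sys; print(json.load(sys.stdin)["choices"][0]["text"])'
# prints: The meaning of life is...
```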

According to AI...

The meaning of life is a question that has puzzled humans for centuries. Some believe it to be finding happiness, others think it's achieving success or something greater than ourselves, while some see it as fulfilling our purpose on this planet. Ultimately, everyone answers this question differently and what matters most in the end is how we live our lives with integrity and make a positive impact on those around us.

Let's start scaling LLMs on Kubernetes!
