<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Sam Stoelinga</title>
    <description>The latest articles on DEV Community by Sam Stoelinga (@samos123).</description>
    <link>https://dev.to/samos123</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F305210%2F84f0238d-d14f-411a-aaa6-c9fac0307bdb.jpeg</url>
      <title>DEV Community: Sam Stoelinga</title>
      <link>https://dev.to/samos123</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/samos123"/>
    <language>en</language>
    <item>
      <title>Don't use a K8s Service for LLM Serving!</title>
      <dc:creator>Sam Stoelinga</dc:creator>
      <pubDate>Tue, 11 Mar 2025 07:25:05 +0000</pubDate>
      <link>https://dev.to/samos123/dont-use-a-k8s-service-for-llm-serving-1a2j</link>
      <guid>https://dev.to/samos123/dont-use-a-k8s-service-for-llm-serving-1a2j</guid>
      <description>&lt;p&gt;Relying solely on standard Kubernetes Services for load balancing can lead to suboptimal performance when serving LLMs. That's because engines like vLLM provide prefix caching, which can speed up inference. However, when you have multiple instances serving the same model, you need to make sure requests with the same prompt prefix go to the same vLLM instance. That's why a standard K8s Service won't work:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmw124ir39gnfis6ufrv4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmw124ir39gnfis6ufrv4.png" alt="Why hashing is important" width="800" height="316"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How LLM engines use caching&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;LLM engines use Key-Value (KV) caches to store the attention state computed for input prompts. This "prefix caching" speeds up responses when requests share a common prefix. Standard Kubernetes Services, however, distribute requests randomly across replicas, causing cache misses and slower responses.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Standard Load Balancing Falls Short&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Random request distribution leads to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Frequent Cache Evictions&lt;/strong&gt;: Caches are cleared often, reducing efficiency.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Increased Latency&lt;/strong&gt;: More time is needed to process requests without cache benefits.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;A Smarter Approach: Prompt Prefix Consistent Hashing with Bounded Loads (CHWBL)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Prompt-prefix-based CHWBL offers a better solution by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Maximizing Cache Use&lt;/strong&gt;: Similar requests go to the same LLM replica, keeping caches relevant.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Balancing Load&lt;/strong&gt;: Ensures no single replica is overwhelmed, maintaining performance.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
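The CHWBL idea can be sketched in a few lines of Python. This is an illustrative toy, not KubeAI's actual implementation; the replica names, virtual-node count, prefix length, and load-bound factor of 1.25 are made-up example values:

```python
import hashlib
from bisect import bisect_right

class CHWBL:
    """Toy consistent hashing with bounded loads, keyed on a prompt prefix."""

    def __init__(self, replicas, vnodes=100, bound=1.25):
        self.bound = bound  # max allowed load = bound * average load
        self.load = {r: 0 for r in replicas}
        # Place vnodes points per replica on the hash ring.
        self.ring = sorted(
            (self._hash(f"{r}-{i}"), r)
            for r in replicas for i in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(s):
        return int(hashlib.sha256(s.encode()).hexdigest(), 16)

    def pick(self, prompt, prefix_len=16):
        # Hash only the prompt prefix, so requests sharing a prefix map to
        # the same point on the ring (and thus the same warm KV cache).
        h = self._hash(prompt[:prefix_len])
        total = sum(self.load.values())
        limit = self.bound * (total + 1) / len(self.load)
        i = bisect_right(self.keys, h) % len(self.ring)
        # Walk the ring until we find a replica under the load bound.
        for _ in range(len(self.ring)):
            replica = self.ring[i][1]
            if self.load[replica] + 1 <= max(limit, 1):
                self.load[replica] += 1
                return replica
            i = (i + 1) % len(self.ring)
        raise RuntimeError("no replica under load bound")

    def done(self, replica):
        # Call when a request finishes, so the load accounting stays correct.
        self.load[replica] -= 1
```

Two requests that share a prefix land on the same replica (as long as it is under the load bound), while the bound prevents a hot prefix from overwhelming a single replica.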

&lt;p&gt;&lt;strong&gt;Real-World Benefits&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Implementing CHWBL has shown:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8j2pcvzqjwj720tswtk0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8j2pcvzqjwj720tswtk0.png" alt="Serving LLM LB benchmarks" width="800" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;95% Faster Initial Responses&lt;/strong&gt;: Quicker start to data processing.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;127% Increase in Throughput&lt;/strong&gt;: More requests handled efficiently.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For effective LLM serving, move beyond standard Kubernetes Services. Adopting advanced load balancing like CHWBL can significantly enhance performance and user satisfaction. &lt;/p&gt;

&lt;p&gt;Source: &lt;a href="https://www.kubeai.org/blog/2025/02/26/llm-load-balancing-at-scale-chwbl/" rel="noopener noreferrer"&gt;Prefix Aware Load Balancing&lt;/a&gt; on the KubeAI blog&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>ai</category>
      <category>llm</category>
    </item>
    <item>
      <title>Tutorial: Deploying Llama 3.1 405B on GKE Autopilot with 8 x A100 80GB</title>
      <dc:creator>Sam Stoelinga</dc:creator>
      <pubDate>Tue, 08 Oct 2024 02:44:38 +0000</pubDate>
      <link>https://dev.to/samos123/tutorial-deploying-llama-31-405b-on-gke-autopilot-with-8-x-a100-80gb-a97</link>
      <guid>https://dev.to/samos123/tutorial-deploying-llama-31-405b-on-gke-autopilot-with-8-x-a100-80gb-a97</guid>
      <description>&lt;p&gt;Tutorial on how to deploy the Llama 3.1 405B model on GKE Autopilot with 8 x A100 80GB GPUs using &lt;a href="https://github.com/substratusai/kubeai" rel="noopener noreferrer"&gt;KubeAI&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We're using fp8 (8-bit) precision for this model, which reduces the GPU memory required and lets us serve the model on a single machine.&lt;/p&gt;
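A quick back-of-the-envelope check shows why fp8 makes this possible (rough numbers only; real usage adds activation memory, CUDA overhead, and the KV cache):

```python
# Why fp8 lets a 405B-parameter model fit on one 8 x A100 80GB machine.
params = 405e9
weights_fp16_gb = params * 2 / 1e9   # 2 bytes/param -> ~810 GB, too big
weights_fp8_gb = params * 1 / 1e9    # 1 byte/param  -> ~405 GB
total_gpu_gb = 8 * 80                # 640 GB across the machine

assert weights_fp16_gb > total_gpu_gb  # fp16 weights alone don't fit
assert weights_fp8_gb < total_gpu_gb   # fp8 leaves room for the KV cache
```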

&lt;p&gt;Create a GKE Autopilot cluster&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud container clusters create-auto cluster-1 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--location&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;us-central1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Add the helm repo for KubeAI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;helm repo add kubeai https://www.kubeai.org
helm repo update
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create a values file for KubeAI with required settings:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt; &amp;gt; kubeai-values.yaml
resourceProfiles:
  nvidia-gpu-a100-80gb:
    imageName: "nvidia-gpu"
    limits:
      nvidia.com/gpu: "1"
    requests:
      nvidia.com/gpu: "1"
      # Each A100 80GB GPU gets 10 CPU and 12Gi memory
      cpu: 10
      memory: 12Gi
    tolerations:
      - key: "nvidia.com/gpu"
        operator: "Equal"
        value: "present"
        effect: "NoSchedule"
    nodeSelector:
      cloud.google.com/gke-accelerator: "nvidia-a100-80gb"
      cloud.google.com/gke-spot: "true"
&lt;/span&gt;&lt;span class="no"&gt;EOF
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Install KubeAI with Helm:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;helm upgrade &lt;span class="nt"&gt;--install&lt;/span&gt; kubeai kubeai/kubeai &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-f&lt;/span&gt; ./kubeai-values.yaml &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--wait&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Deploy Llama 3.1 405B by creating a KubeAI Model object:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; - &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;
apiVersion: kubeai.org/v1
kind: Model
metadata:
  name: llama-3.1-405b-instruct-fp8-a100
spec:
  features: [TextGeneration]
  owner:
  url: hf://neuralmagic/Meta-Llama-3.1-405B-Instruct-FP8
  engine: VLLM
  env:
    VLLM_ATTENTION_BACKEND: FLASHINFER
  args:
    - --max-model-len=65536
    - --max-num-batched-tokens=65536
    - --gpu-memory-utilization=0.98
    - --tensor-parallel-size=8
    - --enable-prefix-caching
    - --disable-log-requests
    - --max-num-seqs=128
    - --kv-cache-dtype=fp8
    - --enforce-eager
    - --enable-chunked-prefill=false
    - --num-scheduler-steps=8
  targetRequests: 128
  minReplicas: 1
  maxReplicas: 1
  resourceProfile: nvidia-gpu-a100-80gb:8
&lt;/span&gt;&lt;span class="no"&gt;EOF
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The pod takes about 15 minutes to start up. Wait for the model pod to be ready:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get pods &lt;span class="nt"&gt;-w&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once the pod is ready, the model is ready to serve requests.&lt;/p&gt;

&lt;p&gt;Set up a port-forward to the KubeAI service on localhost port 8000:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl port-forward service/kubeai 8000:80
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Send a request to the model to test:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt; curl &lt;span class="nt"&gt;-v&lt;/span&gt; http://localhost:8000/openai/v1/completions &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"model": "llama-3.1-405b-instruct-fp8-a100", "prompt": "Who was the first president of the United States?", "max_tokens": 40}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now let's run a benchmark using the vLLM benchmarking script:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/vllm-project/vllm.git
&lt;span class="nb"&gt;cd &lt;/span&gt;vllm/benchmarks
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
python3 benchmark_serving.py &lt;span class="nt"&gt;--backend&lt;/span&gt; openai &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--base-url&lt;/span&gt; http://localhost:8000/openai &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--dataset-name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;sharegpt &lt;span class="nt"&gt;--dataset-path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ShareGPT_V3_unfiltered_cleaned_split.json &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--model&lt;/span&gt; llama-3.1-405b-instruct-fp8-a100 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--seed&lt;/span&gt; 12345 &lt;span class="nt"&gt;--tokenizer&lt;/span&gt; neuralmagic/Meta-Llama-3.1-405B-Instruct-FP8
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This was the output of the benchmarking script on 8 x A100 80GB GPUs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;============ Serving Benchmark Result ============
Successful requests:                     1000
Benchmark duration (s):                  410.49
Total input tokens:                      232428
Total generated tokens:                  173391
Request throughput (req/s):              2.44
Output token throughput (tok/s):         422.40
Total Token throughput (tok/s):          988.63
---------------Time to First Token----------------
Mean TTFT (ms):                          136607.47
Median TTFT (ms):                        125998.27
P99 TTFT (ms):                           335309.25
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          302.24
Median TPOT (ms):                        267.34
P99 TPOT (ms):                           1427.52
---------------Inter-token Latency----------------
Mean ITL (ms):                           249.94
Median ITL (ms):                         128.63
P99 ITL (ms):                            1240.35
==================================================
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Hope this is helpful to other folks struggling to get Llama 3.1 405B up and running on GKE. Similar steps work for GKE Standard as long as you create your a2-ultragpu-8g node pools in advance.&lt;/p&gt;

</description>
      <category>gke</category>
      <category>kubernetes</category>
      <category>ai</category>
    </item>
    <item>
      <title>Infinity embeddings on Kubernetes with KubeAI</title>
      <dc:creator>Sam Stoelinga</dc:creator>
      <pubDate>Wed, 25 Sep 2024 13:41:20 +0000</pubDate>
      <link>https://dev.to/samos123/infinity-embeddings-on-kubernetes-with-kubeai-2a4a</link>
      <guid>https://dev.to/samos123/infinity-embeddings-on-kubernetes-with-kubeai-2a4a</guid>
      <description>&lt;p&gt;Just merged and released the &lt;a href="https://github.com/substratusai/kubeai/pull/197" rel="noopener noreferrer"&gt;Infinity support PR&lt;/a&gt; in KubeAI, adding Infinity as an embedding engine. You can now get embeddings running on your Kubernetes clusters with an OpenAI-compatible API.&lt;/p&gt;

&lt;p&gt;Infinity is a high performance and low latency embeddings engine: &lt;a href="https://github.com/michaelfeil/infinity" rel="noopener noreferrer"&gt;https://github.com/michaelfeil/infinity&lt;/a&gt;&lt;br&gt;
KubeAI is a Kubernetes Operator for running OSS ML serving engines: &lt;a href="https://github.com/substratusai/kubeai" rel="noopener noreferrer"&gt;https://github.com/substratusai/kubeai&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;How to use this?&lt;/p&gt;

&lt;p&gt;Run on any K8s cluster:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;helm repo add kubeai https://www.kubeai.org
helm install kubeai kubeai/kubeai --wait --timeout 10m
cat &amp;gt; model-values.yaml &amp;lt;&amp;lt; EOF
catalog:
  bge-embed-text-cpu:
    enabled: true
    features: ["TextEmbedding"]
    owner: baai
    url: "hf://BAAI/bge-small-en-v1.5"
    engine: Infinity
    resourceProfile: cpu:1
    minReplicas: 1
EOF
helm install kubeai-models kubeai/models -f ./model-values.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Forward the kubeai service to localhost:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl port-forward svc/kubeai 8000:80
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Afterwards you can use the OpenAI Python client to get embeddings:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from openai import OpenAI
# Assumes port-forward of kubeai service to localhost:8000.
client = OpenAI(api_key="ignored", base_url="http://localhost:8000/openai/v1")
response = client.embeddings.create(
    input="Your text goes here.",
    model="bge-embed-text-cpu"
)
print(response)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
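The response contains vectors (e.g. response.data[0].embedding); to actually use them you typically compare vectors with cosine similarity. A minimal pure-Python helper, independent of the running server:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```

Embed two texts with the client above and compare: values near 1.0 mean the texts are semantically similar, values near 0 mean they are unrelated.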



&lt;p&gt;What’s next?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Support for autoscaling based on Infinity reported metrics.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>kubernetes</category>
    </item>
    <item>
      <title>Introducing KubeAI: Open AI Inference Operator</title>
      <dc:creator>Sam Stoelinga</dc:creator>
      <pubDate>Mon, 16 Sep 2024 23:31:16 +0000</pubDate>
      <link>https://dev.to/samos123/introducing-kubeai-open-ai-inference-operator-4hm3</link>
      <guid>https://dev.to/samos123/introducing-kubeai-open-ai-inference-operator-4hm3</guid>
      <description>&lt;p&gt;We recently launched KubeAI. The goal of KubeAI is to get LLMs, embedding models, and speech-to-text models running on Kubernetes with ease.&lt;/p&gt;

&lt;p&gt;KubeAI provides an OpenAI compatible API endpoint which makes it work out of the box with most software that works with the OpenAI APIs.&lt;/p&gt;

&lt;p&gt;Repo on GitHub: &lt;a href="https://github.com/substratusai/kubeai" rel="noopener noreferrer"&gt;substratusai/kubeai&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd75lnti2ecnd3bzej1ww.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd75lnti2ecnd3bzej1ww.png" alt="Image description" width="780" height="528"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When it comes to LLMs, KubeAI directly operates vLLM and Ollama servers in isolated Pods, configured and optimized on a model-by-model basis. You get metrics-based auto scaling out of the box (including scale-from-zero). When you hear scale-from-zero in Kubernetes-land you probably think Knative and Istio - but not in KubeAI! We made an early design decision to avoid any external dependencies (Kubernetes is complicated enough as-is).&lt;/p&gt;

&lt;p&gt;We are hoping to release more functionality soon. Next up: model caching, metrics and dashboard.&lt;/p&gt;

&lt;p&gt;If you need any help or have any feedback, reach out directly, here, or via the channels listed in the repo. We are currently making it our priority to assist the project’s early adopters. So far users have seen success in use cases ranging from processing large scale batches in the cloud to running lightweight inference at the edge.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>How I deployed a global website speed checking service to 25 locations while costing less than $5/yr using GCP Cloud Run</title>
      <dc:creator>Sam Stoelinga</dc:creator>
      <pubDate>Mon, 28 Feb 2022 22:57:19 +0000</pubDate>
      <link>https://dev.to/samos123/how-i-deployed-a-global-website-speed-checking-service-to-25-locations-while-costing-less-than-5yr-using-gcp-cloud-run-5a05</link>
      <guid>https://dev.to/samos123/how-i-deployed-a-global-website-speed-checking-service-to-25-locations-while-costing-less-than-5yr-using-gcp-cloud-run-5a05</guid>
      <description>&lt;p&gt;I created &lt;a href="https://websu.io" rel="noopener noreferrer"&gt;https://websu.io&lt;/a&gt;, an open-source webpage speed monitoring tool; a key feature is the ability to test your webpage speed from locations across the world. This post describes the architecture of Websu and how it allowed me to easily deploy to 25 locations across the world for $3.73/yr.&lt;/p&gt;

&lt;p&gt;The easiest approach would be to spin up a server in each location and run speed tests there when users request them. That, however, comes with a large downside: you pay for each server even when no one is submitting requests to that specific location. Had I done this, it would have cost me 25 * ~$5/month = $1,500/year even without any users, which wasn't feasible for a pure hobby project. Note that I don't like spending money; I am Dutch after all, and there is a reason the saying "going Dutch" exists for splitting the bill.&lt;/p&gt;

&lt;p&gt;So what comes to the mind of a scrappy Dutchman? Serverless with Cloud Run! I split up Websu into 3 components to make it possible to run from any location:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;websu-ui: React/Nextjs front-end that consumes a REST API. OSS version here: &lt;a href="https://github.com/websu-io/websu-ui" rel="noopener noreferrer"&gt;https://github.com/websu-io/websu-ui&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;websu-api: The API front-end that takes in the end-user requests. Source repo: &lt;a href="https://github.com/websu-io/websu" rel="noopener noreferrer"&gt;https://github.com/websu-io/websu&lt;/a&gt; (note: this project came out of me learning to code, so excuse the rough edges).&lt;/li&gt;
&lt;li&gt;lighthouse-server: GRPC server that takes in lighthouse requests and responds with the raw lighthouse json output. Source repo: &lt;a href="https://github.com/websu-io/websu/tree/master/pkg/lighthouse" rel="noopener noreferrer"&gt;https://github.com/websu-io/websu/tree/master/pkg/lighthouse&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The websu-ui and websu-api are deployed to a single Cloud Run region, while the lighthouse-server is deployed to every region Cloud Run supports. This way, Cloud Run runs 0 lighthouse-server instances in any of the 25 regions while idle, so I'm not billed. Only when an end user tests page speed from a specific location, e.g. Japan, does Cloud Run spin up an instance automatically. The cold-start delay has been very minimal for me and definitely worth the savings. You can see a screenshot of my Cloud Run and Artifact Registry costs here:&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8j15wakorma0tx7ffmfd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8j15wakorma0tx7ffmfd.png" alt="Websu Cloud Run costs"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To make my life easier, I also wrote a simple python script to manage the deployment of new lighthouse-servers: &lt;a href="https://github.com/websu-io/websu/blob/master/scripts/deploy-lighthouse-servers.py" rel="noopener noreferrer"&gt;https://github.com/websu-io/websu/blob/master/scripts/deploy-lighthouse-servers.py&lt;/a&gt;&lt;/p&gt;
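The script essentially loops over regions and deploys the same container image to each one; a simplified sketch of the idea (the image name and region list here are illustrative, not the actual values the linked script uses):

```python
import subprocess

# Hypothetical image path for illustration; see the linked script for the real one.
IMAGE = "us-docker.pkg.dev/my-project/websu/lighthouse-server:latest"

def deploy_cmd(region):
    # min-instances=0 is what keeps idle regions free: no traffic, no bill.
    return [
        "gcloud", "run", "deploy", "lighthouse-server",
        "--image", IMAGE,
        "--region", region,
        "--platform", "managed",
        "--min-instances", "0",
        "--allow-unauthenticated",
    ]

def deploy_all(regions, run=subprocess.run):
    # `run` is injectable so the command construction can be tested without gcloud.
    for region in regions:
        run(deploy_cmd(region), check=True)
```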

&lt;p&gt;During this time I was able to serve 3.1k users (see the analytics screenshot below) across 25 locations for $3.73/yr instead of $1,500/yr. I hope this post gives people insight into new use cases for serverless.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fut5a2vy6egkf52giwgxi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fut5a2vy6egkf52giwgxi.png" alt="Websu visitor analytics"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The big spike was due to Korben.info mentioning Websu: &lt;a href="https://korben.info/script-mesurer-performances-site-web.html" rel="noopener noreferrer"&gt;https://korben.info/script-mesurer-performances-site-web.html&lt;/a&gt;, and I was happy that Cloud Run could easily sustain it.&lt;/p&gt;

&lt;p&gt;Note/Disclaimer: I'm excluding the cost of the VM running MongoDB, which costs me about $10/month, because I would have incurred that cost with or without Cloud Run. Opinions are my own and not the views of my employer.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>gcp</category>
      <category>serverless</category>
    </item>
  </channel>
</rss>
