
Sam Stoelinga

Tutorial: Deploying Llama 3.1 405B on GKE Autopilot with 8 x A100 80GB

This tutorial shows how to deploy the Llama 3.1 405B model on GKE Autopilot with 8 x A100 80GB GPUs using KubeAI.

We're using fp8 (8-bit) precision for this model, which reduces the GPU memory required and lets us serve the model on a single 8-GPU machine.
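As a rough back-of-envelope check: 405B parameters at 1 byte per parameter (fp8) is about 405 GB of weights, which fits in the 8 x 80 GB = 640 GB of total GPU memory with room left over for the KV cache. At 16-bit precision the weights alone would be roughly 810 GB and would not fit on a single 8 x A100 80GB machine.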

Create a GKE Autopilot cluster

gcloud container clusters create-auto cluster-1 \
    --location=us-central1
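The create-auto command normally configures kubectl credentials for you; if your kubeconfig isn't pointing at the new cluster yet, you can fetch them explicitly and confirm access:

gcloud container clusters get-credentials cluster-1 \
    --location=us-central1
kubectl get nodes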

Add the helm repo for KubeAI:

helm repo add kubeai https://www.kubeai.org
helm repo update
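Optionally confirm the chart is visible in the repo:

helm search repo kubeai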

Create a values file for KubeAI with required settings:

cat <<EOF > kubeai-values.yaml
resourceProfiles:
  nvidia-gpu-a100-80gb:
    imageName: "nvidia-gpu"
    limits:
      nvidia.com/gpu: "1"
    requests:
      nvidia.com/gpu: "1"
      # Each A100 80GB GPU gets 10 CPU and 12Gi memory
      cpu: 10
      memory: 12Gi
    tolerations:
      - key: "nvidia.com/gpu"
        operator: "Equal"
        value: "present"
        effect: "NoSchedule"
    nodeSelector:
      cloud.google.com/gke-accelerator: "nvidia-a100-80gb"
      cloud.google.com/gke-spot: "true"
EOF
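The nodeSelector above targets spot A100 80GB capacity. On Autopilot there are no node pools to create up front; GKE provisions matching GPU nodes automatically once a pod using this resource profile is scheduled.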

Install KubeAI with Helm:

helm upgrade --install kubeai kubeai/kubeai \
    -f ./kubeai-values.yaml \
    --wait
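Check that the KubeAI control plane came up (exact pod names depend on the chart version):

kubectl get pods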

Deploy Llama 3.1 405B by creating a KubeAI Model object:

kubectl apply -f - <<EOF
apiVersion: kubeai.org/v1
kind: Model
metadata:
  name: llama-3.1-405b-instruct-fp8-a100
spec:
  features: [TextGeneration]
  owner:
  url: hf://neuralmagic/Meta-Llama-3.1-405B-Instruct-FP8
  engine: VLLM
  env:
    VLLM_ATTENTION_BACKEND: FLASHINFER
  args:
    - --max-model-len=65536
    - --max-num-batched-tokens=65536
    - --gpu-memory-utilization=0.98
    - --tensor-parallel-size=8
    - --enable-prefix-caching
    - --disable-log-requests
    - --max-num-seqs=128
    - --kv-cache-dtype=fp8
    - --enforce-eager
    - --enable-chunked-prefill=false
    - --num-scheduler-steps=8
  targetRequests: 128
  minReplicas: 1
  maxReplicas: 1
  resourceProfile: nvidia-gpu-a100-80gb:8
EOF
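The :8 suffix on the resourceProfile tells KubeAI to use 8 GPUs from that profile for this model (the per-GPU CPU and memory requests defined earlier scale with it), which places the pod on a single a2-ultragpu-8g machine. You can verify the Model object was created:

kubectl get models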

The pod takes about 15 minutes to start up. Wait for the model pod to become ready:

kubectl get pods -w
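To follow the weight download and vLLM startup progress, tail the model pod's logs (substitute the pod name from the previous command):

kubectl logs -f <model-pod-name>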

Once the pod is ready, the model is ready to serve requests.

Set up a port-forward to the KubeAI service on localhost port 8000:

kubectl port-forward service/kubeai 8000:80

Send a test request to the model:

curl -v http://localhost:8000/openai/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama-3.1-405b-instruct-fp8-a100", "prompt": "Who was the first president of the United States?", "max_tokens": 40}'
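Since this is an instruct-tuned model behind an OpenAI-compatible API, the chat completions endpoint should work as well:

curl http://localhost:8000/openai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama-3.1-405b-instruct-fp8-a100", "messages": [{"role": "user", "content": "Who was the first president of the United States?"}], "max_tokens": 40}'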

Now let's run a benchmark using the vLLM benchmarking script:

git clone https://github.com/vllm-project/vllm.git
cd vllm/benchmarks
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
python3 benchmark_serving.py --backend openai \
    --base-url http://localhost:8000/openai \
    --dataset-name=sharegpt --dataset-path=ShareGPT_V3_unfiltered_cleaned_split.json \
    --model llama-3.1-405b-instruct-fp8-a100 \
    --seed 12345 --tokenizer neuralmagic/Meta-Llama-3.1-405B-Instruct-FP8

This was the output of the benchmarking script on 8 x A100 80GB GPUs:

============ Serving Benchmark Result ============
Successful requests:                     1000
Benchmark duration (s):                  410.49
Total input tokens:                      232428
Total generated tokens:                  173391
Request throughput (req/s):              2.44
Output token throughput (tok/s):         422.40
Total Token throughput (tok/s):          988.63
---------------Time to First Token----------------
Mean TTFT (ms):                          136607.47
Median TTFT (ms):                        125998.27
P99 TTFT (ms):                           335309.25
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          302.24
Median TPOT (ms):                        267.34
P99 TPOT (ms):                           1427.52
---------------Inter-token Latency----------------
Mean ITL (ms):                           249.94
Median ITL (ms):                         128.63
P99 ITL (ms):                            1240.35
==================================================

Hope this is helpful to other folks struggling to get Llama 3.1 405B up and running on GKE. Similar steps work on GKE Standard as long as you create the a2-ultragpu-8g node pools in advance.
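For reference, node pool creation on a GKE Standard cluster could look roughly like the sketch below (double-check the flags against the current GKE docs; --spot matters because the resource profile above selects spot nodes, and <your-standard-cluster> is a placeholder):

gcloud container node-pools create a100-80gb-pool \
    --cluster=<your-standard-cluster> \
    --region=us-central1 \
    --machine-type=a2-ultragpu-8g \
    --accelerator=type=nvidia-a100-80gb,count=8,gpu-driver-version=latest \
    --spot \
    --num-nodes=1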
