Last year, I mentioned that I'm interested in learning how to serve LLMs in production. At first it was just curiosity, but over time I wanted to actually try building something—not just reading about it.
This post is a small step in that direction: serving an LLM with vLLM, deploying it on Amazon EKS, provisioning the infrastructure with AWS CDK, and wrapping it all in a simple Streamlit chatbot.
TL;DR
- Exploring LLM serving on a Kubernetes cluster (EKS)
- Using vLLM as the inference engine
- Provisioning the infrastructure with AWS CDK (IaC)
- Building a simple chatbot to interact with the LLM using Streamlit
What We Tryna Build
The idea is simple: build a small chatbot powered by an LLM and run the model on Kubernetes.
I'm not focusing on training models here. I just want to understand how to serve an LLM properly.
The flow looks like this:
- User interacts with a chatbot (running locally)
- The chatbot sends a request to a vLLM API
- The model processes the request and returns a response
- The vLLM service runs on Amazon EKS
Prerequisites
Before we dive in, you'll need:
- AWS Account & IAM: an AWS account ID and an IAM user with administrator access (IAM to manage EKS). We'll need the IAM username to map `kubectl` (admin) permissions to the EKS cluster.
- AWS CLI installed and configured (`aws configure`) using your IAM user credentials.
- AWS CDK installed (`npm install -g aws-cdk`).
AWS usually limits new accounts to 0 vCPUs for "Running On-Demand G and VT instances". You'll need to go to the AWS Service Quotas console and request an increase to at least 4 vCPUs to run the `g4dn.xlarge` node.
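If you'd rather not click through the console, the same request can be made with boto3. This is just a sketch, assuming your AWS credentials are already configured; it looks the quota up by name instead of hardcoding a quota code:
import boto3

# Assumes credentials are already configured (aws configure)
sq = boto3.client("service-quotas", region_name="us-east-1")

# Find the "Running On-Demand G and VT instances" vCPU quota for EC2
target = None
paginator = sq.get_paginator("list_service_quotas")
for page in paginator.paginate(ServiceCode="ec2"):
    for quota in page["Quotas"]:
        if quota["QuotaName"] == "Running On-Demand G and VT instances":
            target = quota

print(f"Current limit: {target['Value']} vCPUs ({target['QuotaCode']})")

# Request an increase to 4 vCPUs (enough for a single g4dn.xlarge)
sq.request_service_quota_increase(
    ServiceCode="ec2",
    QuotaCode=target["QuotaCode"],
    DesiredValue=4,
)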
The Stack
- vLLM — inference engine for the LLM. Fast, supports streaming, and exposes an OpenAI-compatible API out of the box.
- Amazon EKS — The Kubernetes service on AWS to run the vLLM workload.
- AWS CDK — infrastructure as code to manage the AWS infra; this time I'm using Python. One `cdk deploy` and everything is provisioned.
- Streamlit — simple chatbot UI that talks to the vLLM endpoint.
Why vLLM?
There are a few ways to serve an LLM — you could use TGI, Triton, or just raw HuggingFace transformers. I went with vLLM for a few reasons:
- PagedAttention — manages GPU memory more efficiently, which matters a lot on a single `g4dn.xlarge`
- OpenAI-compatible API — the chatbot can use the `openai` Python SDK without any changes
- Streaming support — responses stream token by token, which makes the chatbot feel more responsive
Why EKS?
I could've just spun up an EC2 instance and SSH'd in. But that's not really building reliable infrastructure — that's just running a script on a server.
EKS gives us a proper environment to run GPU workloads: node groups, taints and tolerations to make sure only the vLLM pod lands on the GPU node, and a LoadBalancer service to expose the endpoint.
Environment Setup
Before getting into the code, let's set up a .env file at the root of the project. We'll use this to manage our AWS configurations so we don't hardcode them into the repository.
# AWS Config
AWS_DEFAULT_ACCOUNT=123456789012
AWS_DEFAULT_REGION=us-east-1
AWS_ADMIN_USER=your_aws_username
AWS_BUCKET=eks-llm-model-bucket
# EKS Config
CLUSTER_NAME=eks-llm
# VLLM Config
# VLLM_URL will be added later after the deployment is live
# VLLM_URL=http://<nlb-endpoint>.elb.us-east-1.amazonaws.com
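The CDK app entrypoint reads this file so the stacks know which account and region to target. Here's a minimal sketch of what app.py could look like (the stack class names, constructor parameters, and module paths are assumptions, matched to the snippets that follow):
import os

import aws_cdk as cdk
from dotenv import load_dotenv

# Hypothetical module paths; adjust to wherever your stacks actually live
from stacks.eks_stack import EksStack
from stacks.vllm_stack import VllmStack

load_dotenv()

env = cdk.Environment(
    account=os.environ["AWS_DEFAULT_ACCOUNT"],
    region=os.environ["AWS_DEFAULT_REGION"],
)

app = cdk.App()
eks = EksStack(app, "EksStack", env=env)
VllmStack(
    app,
    "VllmStack",
    cluster=eks.cluster,
    model_bucket_name=eks.model_bucket.bucket_name,
    env=env,
)
app.synth()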
The Code
EKS Stack
The EksStack provisions everything at the infrastructure level: VPC, EKS cluster, node groups, and an S3 bucket for model storage.
vpc = ec2.Vpc(self, "EksVpc", max_azs=2)
cluster = eks.Cluster(
self, "EksCluster",
version=eks.KubernetesVersion.V1_34,
vpc=vpc,
default_capacity=0,
kubectl_layer=kubectl_layer,
)
default_capacity=0 means no default node group — we define our own below.
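One thing the snippet doesn't show is where kubectl_layer comes from: it's the Lambda layer CDK uses to run kubectl and helm against the cluster, and it ships as a separately versioned package. A sketch under the assumption that one of the lambda-layer-kubectl packages is installed; swap in whichever version matches your setup:
# Assumption: the aws-cdk.lambda-layer-kubectl-v32 package (or a newer one) is installed;
# use the versioned layer that matches your cluster and toolchain.
from aws_cdk.lambda_layer_kubectl_v32 import KubectlV32Layer

kubectl_layer = KubectlV32Layer(self, "KubectlLayer")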
We have two node groups:
# 1. CPU, runs system pods (CoreDNS, kube-proxy, etc.)
self.cluster.add_nodegroup_capacity(
"ManagedNodeGroup",
desired_size=1,
min_size=1,
max_size=1,
instance_types=[ec2.InstanceType("t3.medium")],
ami_type=eks.NodegroupAmiType.AL2023_X86_64_STANDARD,
)
# 2. GPU, for running vLLM
gpu_node_role = iam.Role(
self,
"GpuNodeRole",
assumed_by=iam.ServicePrincipal("ec2.amazonaws.com"),
managed_policies=[
iam.ManagedPolicy.from_aws_managed_policy_name("AmazonEKSWorkerNodePolicy"),
iam.ManagedPolicy.from_aws_managed_policy_name("AmazonEC2ContainerRegistryReadOnly"),
iam.ManagedPolicy.from_aws_managed_policy_name("AmazonEKS_CNI_Policy"),
],
)
self.cluster.add_nodegroup_capacity(
"GpuNodeGroup",
desired_size=1,
min_size=0,
max_size=2,
disk_size=100,
instance_types=[ec2.InstanceType("g4dn.xlarge")],
node_role=gpu_node_role,
ami_type=eks.NodegroupAmiType.AL2023_X86_64_NVIDIA,
labels={"workload": "gpu"},
taints=[
eks.TaintSpec(
key="nvidia.com/gpu",
value="true",
effect=eks.TaintEffect.NO_SCHEDULE,
)
],
)
self.cluster.aws_auth.add_user_mapping(
iam.User.from_user_name(self, "AdminUser", os.environ["AWS_ADMIN_USER"]),
groups=["system:masters"],
)
# Allow GPU nodes to read from the model bucket
self.model_bucket.grant_read(gpu_node_role)
The disk_size=100 ensures we don't get pod eviction issues, as the default 20GB is too small for the vLLM container image and the model cache. The taint nvidia.com/gpu=true:NoSchedule on the GPU node group means no pod will be scheduled there unless it explicitly tolerates it. This keeps system pods off the GPU node.
The S3 bucket is for model weights, and the GPU node role gets read access to it:
# S3 bucket for model weights
self.model_bucket = s3.Bucket(
self,
"ModelBucket",
bucket_name=os.environ.get("AWS_BUCKET"),
removal_policy=RemovalPolicy.RETAIN,
block_public_access=s3.BlockPublicAccess.BLOCK_ALL,
)
Here are the instances we'll be spinning up:
| Node | vCPU | Memory |
|---|---|---|
| t3.medium | 2 | 4Gi |
| g4dn.xlarge | 4 | 16Gi |
vLLM Stack
The VllmStack takes the cluster from EksStack and deploys vLLM on top of it.
First, we install the NVIDIA device plugin via Helm. This is what makes EKS aware of the GPU on the node — without it, you can't request nvidia.com/gpu as a resource in your pod spec.
model_id = "hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4" # Our LLM
cluster.add_helm_chart(
"NvidiaDevicePlugin",
chart="nvidia-device-plugin",
repository="https://nvidia.github.io/k8s-device-plugin",
namespace="kube-system",
values={
"nodeSelector": {"workload": "gpu"},
"tolerations": [{"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"}],
},
)
Note the toleration on the plugin itself — it needs to run on the GPU node to expose the GPU, so it has to tolerate the taint we set earlier.
Then the vLLM Deployment:
cluster.add_manifest("VllmDeployment", {
"apiVersion": "apps/v1",
"kind": "Deployment",
"metadata": {"name": "vllm", "namespace": "default"},
"spec": {
"replicas": 1,
"selector": {"matchLabels": {"app": "vllm"}},
"template": {
"metadata": {"labels": {"app": "vllm"}},
"spec": {
"tolerations": [{"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"}],
"nodeSelector": {"workload": "gpu"},
"containers": [{
"name": "vllm",
"image": "vllm/vllm-openai:latest",
"args": [
"--model", model_id,
"--download-dir", "/model-cache",
"--dtype", "half",
"--quantization", "awq",
"--max-model-len", "4096",
],
"env": [
{"name": "AWS_DEFAULT_REGION", "value": self.region},
{"name": "MODEL_BUCKET", "value": model_bucket_name},
{"name": "VLLM_PORT", "value": "8000"},
],
"ports": [{"containerPort": 8000}],
"resources": {
"limits": {"nvidia.com/gpu": "1"},
"requests": {"memory": "12Gi", "cpu": "2"},
},
"volumeMounts": [{"name": "model-cache", "mountPath": "/model-cache"}],
"readinessProbe": {
"httpGet": {"path": "/health", "port": 8000},
"initialDelaySeconds": 120,
"periodSeconds": 15,
},
}],
"volumes": [{"name": "model-cache", "emptyDir": {}}],
},
},
},
})
A few things worth noting:
- `nodeSelector: workload=gpu` pins the pod to the GPU node group
- `nvidia.com/gpu: 1` requests exactly one GPU
- `dtype: half` and `quantization: awq` drop the model size to ~5.7GB so it comfortably fits in the 16GB VRAM of `g4dn.xlarge` without OOM (rough math below)
- `max-model-len: 4096` caps the context window to avoid OOM
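For a rough sense of why the quantized model fits, here's the back-of-envelope arithmetic (ballpark numbers, not an exact accounting):
params = 8_000_000_000            # Llama 3.1 8B
bytes_per_param = 0.5             # 4-bit AWQ weights
weights_gib = params * bytes_per_param / 1024**3
print(f"Quantized weights alone: ~{weights_gib:.1f} GiB")  # ~3.7 GiB

# AWQ keeps some tensors in fp16 and stores scales/zero-points, so the actual
# footprint lands closer to ~5.7GB. That still leaves most of the T4's 16GB
# for the KV cache (bounded by --max-model-len 4096) and CUDA overhead.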
Finally, a LoadBalancer service to expose the endpoint publicly:
cluster.add_manifest("VllmService", {
"apiVersion": "v1",
"kind": "Service",
"metadata": {
"name": "vllm",
"namespace": "default",
"annotations": {"service.beta.kubernetes.io/aws-load-balancer-type": "nlb"},
},
"spec": {
"type": "LoadBalancer",
"selector": {"app": "vllm"},
"ports": [{"port": 80, "targetPort": 8000, "protocol": "TCP"}],
},
})
# Internal cluster URL for the vLLM service
self.vllm_url = os.environ.get("VLLM_URL", "http://vllm.default.svc.cluster.local:80")
CfnOutput(self, "VllmUrl",
value=self.vllm_url,
description="Internal vLLM service URL",
)
Deploy
cdk bootstrap # first time only
cdk deploy --all
After the deployment succeeds, you should see the nodes in the cluster:
kubectl get nodes
NAME STATUS ROLES AGE VERSION
ip-10-0-xx-yy.ec2.internal Ready <none> 9m18s v1.34.7-eks-40737a8
$ kubectl get nodes --show-labels | grep gpu
ip-10-0-xx-yy.ec2.internal Ready <none> 4m23s v1.34.7-eks-40737a8 beta.kubernetes.io/arch=amd64,
...
workload=gpu
Wait for the vLLM pod to be ready (~5-10 minutes, model is downloaded from HuggingFace on first start):
kubectl get pods -w
NAME READY STATUS RESTARTS AGE
vllm-64c858884-pz4gz 0/1 Running 0 2m24s
kubectl logs -f deployment/vllm
WARNING 05-01 12:48:00 [argparse_utils.py:257] With `vllm serve`, you should provide the model as a positional argument or in a config file instead of via the `--model` o
ption. The `--model` option will be removed in a future version.
(APIServer pid=1) INFO 05-01 12:48:00 [utils.py:299]
(APIServer pid=1) INFO 05-01 12:48:00 [utils.py:299] █ █ █▄ ▄█
(APIServer pid=1) INFO 05-01 12:48:00 [utils.py:299] ▄▄ ▄█ █ █ █ ▀▄▀ █ version 0.20.0
(APIServer pid=1) INFO 05-01 12:48:00 [utils.py:299] █▄█▀ █ █ █ █ model hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4
(APIServer pid=1) INFO 05-01 12:48:00 [utils.py:299] ▀▀ ▀▀▀▀▀ ▀▀▀▀▀ ▀ ▀
(APIServer pid=1) INFO 05-01 12:48:00 [utils.py:299]
(APIServer pid=1) INFO 05-01 12:48:00 [utils.py:233] non-default args: {'model_tag': 'hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4', 'model': 'hugging-quants/Meta-L
lama-3.1-8B-Instruct-AWQ-INT4', 'dtype': 'half', 'max_model_len': 4096, 'quantization': 'awq', 'download_dir': '/model-cache'}
...
Inference
Once the pod is running, grab the NLB endpoint:
kubectl get svc vllm -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'
Then, check that the model is loaded:
curl http://<nlb-endpoint>/v1/models
You should see something like:
{
"object": "list",
"data": [{
"id": "hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4",
"object": "model",
"owned_by": "vllm"
}]
}
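Since the API is OpenAI-compatible, the same check works from Python with the openai SDK (a quick sketch; VLLM_URL is the NLB endpoint you grabbed above):
import os
from openai import OpenAI

# No real API key is needed; vLLM ignores it unless you configured one
client = OpenAI(base_url=f"{os.getenv('VLLM_URL')}/v1", api_key="none")

for model in client.models.list():
    print(model.id)  # hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4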
Next, send it a prompt:
curl http://<nlb-endpoint>/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4",
"messages": [
{"role": "system", "content": "You are a helpful AI assistant."},
{"role": "user", "content": "What is CAP theorem?"}
],
"max_tokens": 150
}'
Response:
{
"id": "chatcmpl-8609921b347e2718",
"object": "chat.completion",
"created": 1777640350,
"model": "hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "The CAP theorem, also known as the Brewer's CAP theorem, is a fundamental concept in distributed systems. It was first proposed by Eric Brewer in 2000.\n\n**CAP stands for:**\n\n1. **Consistency**: This refers to the ability of a system to ensure that all nodes in the system have the same view of the data. In other words, all nodes see the same data, and any updates are reflected uniformly across the system.\n2. **Availability**: This refers to the ability of a system to ensure that every request receives a (non-error) response, without guarantee that it contains the most recent version of the information. In other words, the system is always available, even if some nodes are down or"
},
}
...
The vLLM logs:
(APIServer pid=1) INFO: 10.0.253.68:42022 - "GET /health HTTP/1.1" 200 OK
(APIServer pid=1) INFO 05-01 12:59:19 [loggers.py:271] Engine 000: Avg prompt throughput: 1.4 tokens/s, Avg generation throughput: 4.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.2%, Prefix cache hit rate: 28.8%
(APIServer pid=1) INFO: 10.0.253.68:33400 - "GET /health HTTP/1.1" 200 OK
(APIServer pid=1) INFO 05-01 12:59:29 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 4.5 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.3%, Prefix cache hit rate: 28.8%
(APIServer pid=1) INFO 05-01 12:59:39 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 4.5 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.4%, Prefix cache hit rate: 28.8%
(APIServer pid=1) INFO: 10.0.253.68:34814 - "GET /health HTTP/1.1" 200 OK
(APIServer pid=1) INFO: 10.0.228.56:20496 - "POST /v1/chat/completions HTTP/1.1" 200 OK
Finally, if you get a response back, congrats, the model is live. 🎉
Hitting the API with curl works, but it's not exactly a great user experience. Let's build a chatbot UI on top of this.
Chatbot with Streamlit
So let's build a simple chatbot using Streamlit that talks directly to vLLM.
The nice part? Since vLLM exposes an OpenAI-compatible API, we can just use the `openai` Python SDK without any extra effort.
Setup
Install the dependencies:
pip install streamlit openai python-dotenv
Let's create a simple UI:
mkdir src
touch src/app.py
import os
import streamlit as st
from openai import OpenAI
from dotenv import load_dotenv
load_dotenv()
# Using the URL from AWS load balancer
VLLM_URL = os.getenv("VLLM_URL", "http://xx-yy.elb.us-east-1.amazonaws.com")
MODEL_ID = "hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4"
client = OpenAI(base_url=f"{VLLM_URL}/v1", api_key="none")
st.set_page_config(page_title="Llama 3 Chatbot", page_icon="🦙")
st.title("🦙 Llama 3 Chatbot")
st.caption("Powered by vLLM on EKS")
if "messages" not in st.session_state:
st.session_state.messages = []
for msg in st.session_state.messages:
st.chat_message(msg["role"]).write(msg["content"])
if prompt := st.chat_input("How is your day? Say something..."):
st.session_state.messages.append({"role": "user", "content": prompt})
st.chat_message("user").write(prompt)
with st.chat_message("assistant"):
stream = client.chat.completions.create(
model=MODEL_ID,
messages=st.session_state.messages,
stream=True,
)
response = st.write_stream(chunk.choices[0].delta.content or "" for chunk in stream)
st.session_state.messages.append({"role": "assistant", "content": response})
Run the UI:
streamlit run src/app.py
You can now view your Streamlit app in your browser.
Local URL: http://localhost:8501
Network URL: http://ww.xx.yy.zz:8501
Open the URL in your browser, and you should see a simple chatbot interface. Type in a message, and watch the response stream back token by token.
Conclusion
Serving an LLM is a bit different than deploying a typical web app. Memory constraints are real—we had to use an AWQ quantized model just to make it fit inside a single g4dn.xlarge instance without hitting OOM. But combining vLLM for inference and AWS CDK to spin up the EKS infrastructure makes the whole setup pretty straightforward.
Don't forget to run `cdk destroy --all` when you're done! Leaving an EKS cluster and a `g4dn.xlarge` node running 24/7 will result in a very hefty AWS bill. One caveat: the model bucket was created with RemovalPolicy.RETAIN, so it survives the destroy and has to be deleted manually if you no longer need it.
Aight. Thanks for reading this post, hope you found something useful 🚀