curl https://vllm-endpoint-xxxxx-ew4.a.run.app/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3.5-4B",
    "messages": [{"role": "user", "content": "Say hello in one word."}],
    "max_tokens": 32
  }'
That endpoint is real: an NVIDIA L4 GPU on Cloud Run, running vLLM (the best open-source way to serve LLMs) behind an OpenAI-compatible API.
Four commands to get there:
make tf-init # initialize Terraform
make build-cloud # build the Docker image remotely
make deploy # provision infrastructure + deploy
make smoke-test # verify it works end-to-end
No Kubernetes, no Helm charts, no 47-step tutorials. One Makefile, a Dockerfile, a few Terraform files. Here's how the pieces fit together:
HuggingFace
     │
     ▼  docker build (downloads weights into image)
Baked Image ───► Artifact Registry
                        │
                        ▼  terraform apply
                 Cloud Run (GPU)
                        │
                        ▼
            OpenAI-compatible Endpoint
Four pieces, one responsibility each.
The Dockerfile takes a model name and a HuggingFace token, downloads the weights at build time, and bundles them into a vLLM image. The result is a baked image — self-contained, no internet at runtime.
Artifact Registry holds that image. Terraform creates the repo with a cleanup policy that keeps only the latest tag. Old images get deleted. No surprise bills.
Cloud Run runs the image on an NVIDIA L4 GPU. Serverless — no VM, no cluster, just a URL.
The Makefile ties it together. build-cloud triggers a remote build on Cloud Build, so your laptop never touches 15 GB of weights. deploy runs Terraform. smoke-test hits the endpoint with three requests and tells you if anything's broken.
First piece: the Dockerfile.
The Dockerfile
FROM vllm/vllm-openai:latest-cu129-ubuntu2404
ARG MODEL_NAME
ARG HF_TOKEN
ENV HF_HOME=/model-cache
ENV MODEL_NAME=${MODEL_NAME}
ENV PORT=8080
RUN hf auth login --token ${HF_TOKEN} && \
    hf download ${MODEL_NAME}
ENV HF_HUB_OFFLINE=1
COPY entrypoint.sh /entrypoint.sh
RUN chmod +x /entrypoint.sh
ENTRYPOINT ["/entrypoint.sh"]
Eleven short instructions. The base image is vllm/vllm-openai: vLLM preconfigured to speak the OpenAI API. No server to write, nothing to install. Give it a model and it serves.
The first RUN line is the important one. It logs into HuggingFace and downloads the model during the image build, not at startup. The weights end up baked into an image layer.
When Cloud Run pulls this image, it gets the model too. No runtime download, no waiting on HuggingFace, no cold start that depends on someone else's network.
Then HF_HUB_OFFLINE=1. This tells the HuggingFace libraries: don't try the internet. If the model isn't already local, fail immediately. If you built the image without weights, you find out at startup — not after ten minutes of confusing errors.
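A pre-flight check along these lines could confirm that the weights are in the image before vLLM starts. This is a minimal sketch, assuming the standard Hugging Face hub cache layout (`<HF_HOME>/hub/models--<org>--<name>/snapshots/<revision>/`). The function name `cached_snapshots` is hypothetical and is not part of the repo.

```python
import os
from pathlib import Path


def cached_snapshots(model_name: str, hf_home: str) -> list[Path]:
    """Return non-empty snapshot directories for model_name under hf_home.

    Assumes the standard hub cache layout:
    <hf_home>/hub/models--<org>--<name>/snapshots/<revision>/
    """
    repo_dir = Path(hf_home) / "hub" / ("models--" + model_name.replace("/", "--"))
    snapshots = repo_dir / "snapshots"
    if not snapshots.is_dir():
        return []
    # A snapshot only counts if it actually holds files (config, weights, ...).
    return sorted(p for p in snapshots.iterdir() if p.is_dir() and any(p.iterdir()))


if __name__ == "__main__":
    model = os.environ.get("MODEL_NAME", "Qwen/Qwen3.5-4B")
    home = os.environ.get("HF_HOME", "/model-cache")
    found = cached_snapshots(model, home)
    print(f"{model}: {len(found)} cached snapshot(s) under {home}")
```

An empty result means the build step did not download anything, which is the same condition HF_HUB_OFFLINE=1 turns into an immediate startup failure.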
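The entrypoint script runs vLLM with bfloat16 precision, 90% GPU memory utilization, and 4096 max context tokens. Reasonable defaults for a single L4. Tune per model; in particular, --reasoning-parser qwen3 is specific to Qwen3-family models, so change or drop it when you swap in something else.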
#!/bin/bash
exec vllm serve "${MODEL_NAME}" \
  --host 0.0.0.0 \
  --port "${PORT:-8080}" \
  --dtype bfloat16 \
  --gpu-memory-utilization 0.9 \
  --tensor-parallel-size 1 \
  --reasoning-parser qwen3 \
  --max-model-len 4096
exec replaces the shell with the vLLM process so SIGTERM passes through. Cloud Run sends SIGTERM to shut down the container. Without exec, the signal gets swallowed and Cloud Run force-kills after the timeout.
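The effect of exec is easy to see by comparing process IDs. The sketch below is an illustration only, assuming a POSIX `sh` is on the PATH. The helper `shell_and_child_pids` is hypothetical. It launches a tiny child through the shell with and without exec and reports whether the child ended up with the PID the shell was started under, which is the PID a signal from the platform would reach.

```python
import subprocess
import sys


def shell_and_child_pids(use_exec: bool) -> tuple[int, int]:
    """Run a child through `sh -c` and return (shell PID, PID the child reports)."""
    child = f'"{sys.executable}" -c "import os; print(os.getpid())"'
    # With exec the shell is replaced by the child. Without it, the trailing
    # `; true` keeps the shell alive as the parent, as in a longer script.
    script = f"exec {child}" if use_exec else f"{child}; true"
    proc = subprocess.Popen(["sh", "-c", script], stdout=subprocess.PIPE, text=True)
    out, _ = proc.communicate()
    return proc.pid, int(out.strip())


if __name__ == "__main__":
    for use_exec in (True, False):
        shell_pid, child_pid = shell_and_child_pids(use_exec)
        print(f"exec={use_exec}: child runs as the original process = {shell_pid == child_pid}")
```

With exec, the two PIDs match, so a signal sent to the container's main process lands directly on the server. Without it, the server is a separate child behind the shell.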
Build it, push it, Cloud Run serves the model. But how does Cloud Run know to attach a GPU?
The Terraform
Cloud Run doesn't attach a GPU by default. You have to ask explicitly, and in the right way. Here's the Cloud Run service resource — the core of the deployment:
resource "google_cloud_run_v2_service" "vllm" {
  name     = var.service_name
  location = var.region
  project  = var.project_id
  template {
    execution_environment = "EXECUTION_ENVIRONMENT_GEN2"
    scaling {
      min_instance_count = 1
      max_instance_count = var.max_instances
    }
    containers {
      image = local.image_url
      ports {
        container_port = 8080
      }
      resources {
        limits = {
          cpu              = "8"
          memory           = "24Gi"
          "nvidia.com/gpu" = "1"
        }
        cpu_idle = false
      }
      startup_probe {
        tcp_socket {
          port = 8080
        }
        initial_delay_seconds = 240
        timeout_seconds       = 240
        failure_threshold     = 2
        period_seconds        = 240
      }
    }
    node_selector {
      accelerator = "nvidia-l4"
    }
  }
}
Three things make the GPU work. Miss any one and it won't.
"nvidia.com/gpu" = "1" in the resource limits. A Kubernetes-style resource request that Cloud Run adopted. "1" means one GPU.
node_selector { accelerator = "nvidia-l4" }. Tells Cloud Run which GPU. nvidia-l4 is the sweet spot: 24 GB VRAM, relatively cheap, and offered in a subset of Cloud Run regions, so check the supported-region list before setting var.region.
execution_environment = "EXECUTION_ENVIRONMENT_GEN2". GPUs require Gen2. Gen1 doesn't support them. Forget this and Terraform will error, but the message won't tell you why.
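Because all three settings are required together, a small check can flag whichever one is missing before you run Terraform. This is only a sketch: the `template` dict is a hypothetical plain-Python mirror of the Terraform template block, not something Terraform or the repo produces, and `missing_gpu_settings` is an invented helper name.

```python
def missing_gpu_settings(template: dict) -> list[str]:
    """List which of the three GPU-related settings are absent from `template`.

    `template` is a hypothetical dict that mirrors the Terraform template block.
    """
    problems = []
    if template.get("execution_environment") != "EXECUTION_ENVIRONMENT_GEN2":
        problems.append("execution_environment is not EXECUTION_ENVIRONMENT_GEN2")
    if not template.get("node_selector", {}).get("accelerator"):
        problems.append("node_selector.accelerator is not set")
    limits = template.get("containers", {}).get("resources", {}).get("limits", {})
    if "nvidia.com/gpu" not in limits:
        problems.append('resources.limits has no "nvidia.com/gpu" entry')
    return problems


if __name__ == "__main__":
    template = {
        "execution_environment": "EXECUTION_ENVIRONMENT_GEN2",
        "containers": {
            "resources": {"limits": {"cpu": "8", "memory": "24Gi", "nvidia.com/gpu": "1"}}
        },
        "node_selector": {"accelerator": "nvidia-l4"},
    }
    print(missing_gpu_settings(template) or "all three GPU settings present")
```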
Then the startup probe. vLLM needs 60–90 seconds to load a 4B-parameter model into VRAM. The probe waits 240 seconds before its first check and allows two attempts, 240 seconds apart. Generous, but it covers larger models too.
Cloud Run won't route traffic until the probe passes (TCP connect on port 8080). Until then, the service shows "not ready" and requests get a 503.
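What a TCP startup probe does can be sketched in a few lines. The version below is a simplified local imitation, not Cloud Run's implementation. The function names are hypothetical, and the parameters only mirror the Terraform fields by name. It is handy for checking a container you run locally.

```python
import socket
import time


def tcp_probe(host: str, port: int, timeout: float) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_until_ready(host: str, port: int, *, initial_delay: float,
                     period: float, timeout: float, failure_threshold: int) -> bool:
    """Probe up to failure_threshold times, `period` seconds apart, after a delay."""
    time.sleep(initial_delay)
    for attempt in range(failure_threshold):
        if tcp_probe(host, port, timeout):
            return True
        if attempt < failure_threshold - 1:
            time.sleep(period)
    return False


if __name__ == "__main__":
    ready = wait_until_ready("127.0.0.1", 8080, initial_delay=0,
                             period=5, timeout=2, failure_threshold=3)
    print("ready" if ready else "not ready")
```

Note that a TCP probe only proves the port is accepting connections. It says nothing about whether the model has finished loading behind it.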
The rest is plumbing. A service account, an Artifact Registry repo with a one-image cleanup policy, API enablement for Cloud Run and IAM. All boilerplate, all in the repo.
Image built, infrastructure defined. What does deploying actually look like?
Deploying, start to finish
Prerequisites: a GCP project, gcloud CLI, Terraform >= 1.5, a HuggingFace token, and make.
Configure HF_TOKEN in .env and project_id in terraform/terraform.tfvars, then run:
make tf-init # first time only
make build-cloud # 5–10 min, builds remotely on Cloud Build
make deploy # terraform apply, outputs the endpoint URL
make smoke-test # three requests against the live endpoint
build-cloud submits to Cloud Build so your laptop never touches the weights. For local builds, make build && make push works too.
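For a sense of what a smoke test against this kind of endpoint might check, here is a minimal standard-library sketch. It is not the repo's smoke-test target. The function names are hypothetical, and the prompts are placeholders, since the exact requests the repo sends are not shown here.

```python
import json
import urllib.request


def chat_payload(model: str, prompt: str, max_tokens: int = 32) -> dict:
    """Build an OpenAI-style chat completion request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def post_json(url: str, payload: dict, timeout: float = 60.0) -> dict:
    """POST a JSON body and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def looks_like_completion(body: dict) -> bool:
    """True if the body has at least one choice carrying a message object."""
    choices = body.get("choices") or []
    return bool(choices) and isinstance(choices[0].get("message"), dict)


def smoke_test(base_url: str, model: str, prompts: list[str], post=post_json) -> bool:
    """Send each prompt and require a well-formed completion for all of them."""
    url = base_url.rstrip("/") + "/v1/chat/completions"
    return all(looks_like_completion(post(url, chat_payload(model, p))) for p in prompts)


if __name__ == "__main__":
    ok = smoke_test(
        "https://vllm-endpoint-xxxxx-ew4.a.run.app",  # placeholder URL from the article
        "Qwen/Qwen3.5-4B",
        ["Say hello in one word.", "What is 2 + 2?", "Name a color."],
    )
    print("smoke test passed" if ok else "smoke test failed")
```

The `post` parameter is injectable so the response checks can be exercised without a live endpoint.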
Clone to working endpoint: ~15 minutes, mostly the build. The endpoint is OpenAI-compatible — set base_url and go:
from openai import OpenAI

client = OpenAI(
    base_url="https://vllm-endpoint-xxxxx-ew4.a.run.app/v1",
    api_key="not-needed",  # public endpoint, no auth yet
)

response = client.chat.completions.create(
    model="Qwen/Qwen3.5-4B",
    messages=[{"role": "user", "content": "Explain GPUs in one sentence."}],
)
print(response.choices[0].message.content)
The takeaway
vLLM is built for throughput under concurrency — PagedAttention, continuous batching, and scaling knobs for tensor parallelism, pipeline parallelism, data parallelism, and chunked prefill. Controls like max_num_batched_tokens, max_num_seqs, and gpu_memory_utilization let you tune the latency-throughput tradeoff. Cloud Run's GPU support means you get all of that without managing infrastructure.
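Continuous batching only helps when requests actually overlap, so a client that sends prompts one at a time will not see the benefit. A minimal sketch of a concurrent client follows, assuming `send` is a hypothetical callable you supply (for example, a thin wrapper around the OpenAI client call) that takes a prompt and returns the reply text.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def run_concurrently(send: Callable[[str], str], prompts: list[str],
                     workers: int = 8) -> list[str]:
    """Send every prompt through `send` in parallel; replies come back in prompt order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(send, prompts))


if __name__ == "__main__":
    # Stand-in for a real network call so the sketch runs without an endpoint.
    def fake_send(prompt: str) -> str:
        return f"reply to: {prompt}"

    for reply in run_concurrently(fake_send, ["first", "second", "third"]):
        print(reply)
```

Threads are enough here because each worker spends its time waiting on the network. The `workers` value is an arbitrary starting point, not a recommendation tied to any vLLM setting.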
Bake the model, push the image, run. Change MODEL_NAME to swap models — the infrastructure stays the same.
What's not covered yet: authentication (the endpoint is public), monitoring alerts, multi-model routing, and cost control at scale. Not hard problems — decisions for when you need them.
GitHub link: github.com/borisBarac/LLM-Prod-dep...