Self-Hosting LLMs on GKE: The Decision Most Teams Get Wrong
Most teams make the self-hosted vs managed LLM decision based on the wrong variable. They look at per-token pricing, see that Gemini API calls cost more than running Llama on their own GPU, and assume self-hosting is the obvious choice. Then they spend six months learning why infrastructure economics don't work that way.
I've watched this play out at multiple B2B SaaS companies building agentic workflows with Google's Agent Development Kit. The ADK makes it easy to swap model backends — that flexibility is a feature, not a bug. But the architectural decision of where to run your model isn't primarily technical. It's a cost, compliance, and operational maturity question that most teams answer backwards.
The Real Problem: Bad Math and Incomplete Requirements
The spreadsheet calculation looks simple. A single NVIDIA L4 GPU on GKE runs about $0.70/hour. Gemini 1.5 Flash charges per million tokens. If you do enough inference, self-hosting wins. Right?
The math is correct. The inputs are wrong.
Here's what I've seen go sideways:
GPU utilization that doesn't match projections. A team provisions an L4 node pool for their ADK agent. The agent handles customer support queries during business hours — maybe 40 hours of actual usage per week. But the GPU node runs 168 hours per week. They're paying for 128 hours of idle compute at $0.70/hour. That's $90/week in waste before they process a single token.
Model update responsibility nobody planned for. Llama 3.1 is great until Llama 3.2 ships with better instruction following. Gemini models improve automatically. Self-hosted models require you to pull new weights, test for regressions, and redeploy. Most teams don't budget engineering time for model ops.
No autoscaling, no cost control. I've reviewed GKE deployments where the vLLM container runs on a static GPU node pool with no Horizontal Pod Autoscaler configured. During low-traffic periods, that GPU sits warm. During traffic spikes, the single replica bottlenecks everything.
The teams that get this decision right ask different questions before they touch infrastructure.
What Actually Drives This Decision
In my experience advising SaaS companies preparing for SOC 2 and handling sensitive customer data, three factors dominate the architecture choice:
1. Data Residency and Compliance Requirements
This is the only factor that makes the decision obvious. If your data cannot leave a specific geography, or cannot be processed through shared API infrastructure, self-hosting isn't optional — it's mandatory.
PIPEDA-regulated data for Canadian customers, HIPAA-protected health information, financial services data subject to specific processing constraints — these requirements eliminate Vertex AI's hosted models from consideration. You need the model running on infrastructure you control, in a region you specify.
When compliance drives the decision, self-hosting is correct regardless of cost comparison. The alternative is regulatory risk that no per-token savings can offset.
2. Actual Token Volume at Scale
The break-even calculation depends on sustained inference load, not peak usage.
A rough model: Gemini 1.5 Flash input tokens cost approximately $0.075 per million, with output tokens at roughly four times that. An L4 GPU running Llama 3.1 8B can process roughly 2,000 tokens per second under load. If your workload keeps that GPU busy for most of the day, self-hosting can win economically.
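To put numbers on "keeps that GPU busy," here's the back-of-the-envelope version using the figures above (list prices at the time of writing; committed-use discounts, batching efficiency, and your input/output mix all shift the result):

```text
L4 node, running 24/7:          $0.70/hr x 24 h      ≈ $16.80/day
Gemini 1.5 Flash, input:        $0.075 per 1M tokens
Gemini 1.5 Flash, output:       $0.30 per 1M tokens
Break-even (input-only price):  $16.80 / $0.075      ≈ 224M tokens/day
Saturated L4 at 2,000 tok/s:    2,000 x 86,400 s     ≈ 173M tokens/day
```

On input pricing alone, even a fully saturated L4 doesn't quite get there; the self-hosted case pulls ahead when a meaningful share of your tokens are output (billed at four times the input rate) and utilization stays high.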
If your agents handle 50,000 tokens per day total? The API cost is negligible. The GPU cost is fixed overhead.
I've seen teams project that they'll "eventually" hit high volume and provision GPU infrastructure now. That eventually costs real money every hour it doesn't arrive.
3. Operational Capacity for GPU Infrastructure
This is where the SCALE framework's Lifecycle Operations stage becomes critical. Self-hosting an LLM isn't a deploy-once proposition. It's ongoing infrastructure:
- GPU driver updates and compatibility testing
- Model weight management and storage
- vLLM version upgrades (they ship fast)
- Monitoring and alerting for inference latency and errors
- Capacity planning as agent traffic grows
If your platform team is already stretched managing GKE workloads, Terraform pipelines, and security controls, adding GPU ops creates operational risk. If you have a team comfortable with ML infrastructure, it's manageable.
Making ADK Architecture Swap-Ready
Regardless of your initial decision, architect your ADK agents to support backend changes. This is where I've seen teams save themselves future pain.
ADK supports pluggable model backends. Don't hardcode Gemini API endpoints. Configure the model backend as an environment variable or secret that points to an endpoint — not a model name.
```yaml
env:
- name: LLM_ENDPOINT
  value: "http://vllm-service.default.svc.cluster.local:8000/v1"
```
vLLM exposes an OpenAI-compatible API. Your ADK agent can switch from self-hosted Llama to Vertex AI Gemini with a configuration change rather than code changes.
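To make "configuration change rather than code changes" concrete, here's a quick smoke test you can run from a pod inside the cluster against the service from the env example above. The same OpenAI-style request shape works whether the endpoint is vLLM or any other OpenAI-compatible backend:

```bash
# Chat completion against vLLM's OpenAI-compatible server (in-cluster address
# from the env example above; the model name must match what vLLM was launched with)
curl -s http://vllm-service.default.svc.cluster.local:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "Summarize this ticket in one sentence."}]
      }'
```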
This flexibility matters because your requirements will shift. A company that doesn't need data residency today might acquire a healthcare customer next quarter. A startup running light inference today might hit scale where self-hosting makes sense in six months.
The GKE Configuration That Actually Works
If you've answered the three questions and self-hosting is the right call, here's what works in production:
```bash
# g2-standard-8 bundles exactly one L4; the larger g2-standard-24 shape comes
# with two GPUs, which would not match count=1 below.
gcloud container node-pools create gpu-pool \
  --cluster=CLUSTER \
  --machine-type=g2-standard-8 \
  --accelerator=type=nvidia-l4,count=1 \
  --num-nodes=1 \
  --enable-autoscaling \
  --min-nodes=0 \
  --max-nodes=3
```
Setting min-nodes=0 is critical. The node pool scales to zero when no pods require GPU resources. You stop paying for idle GPUs.
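Scale-to-zero only kicks in when the GPU pods themselves go away; the cluster autoscaler removes the node once nothing on it requests nvidia.com/gpu. For the business-hours support agent from earlier, that means something (a scheduled pipeline step, a cron job, or a person) has to drop the replica count outside working hours. A minimal sketch, assuming the vLLM Deployment shown below is named vllm:

```bash
# End of support hours: scale the vLLM Deployment to zero so the GPU node can drain away
kubectl scale deployment vllm --replicas=0

# Start of support hours: bring it back; the pending GPU request triggers node pool scale-up
kubectl scale deployment vllm --replicas=1
```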
The vLLM deployment needs appropriate resource requests to trigger autoscaling:
```yaml
containers:
- name: vllm
  image: vllm/vllm-openai:latest
  args:
  - --model=meta-llama/Llama-3.1-8B-Instruct
  - --port=8000
  resources:
    limits:
      nvidia.com/gpu: "1"
    requests:
      nvidia.com/gpu: "1"
```
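This also closes out the earlier complaint about a static replica with no Horizontal Pod Autoscaler. GPU utilization isn't a built-in HPA signal, so in practice you scale on a request-level metric; vLLM exports queue depth and request counts on its /metrics endpoint. A minimal sketch, assuming the Deployment is named vllm and the queued-requests metric is already surfaced through a metrics adapter (Prometheus Adapter or Google Cloud Managed Service for Prometheus); the metric name here is illustrative:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm        # assumes the Deployment above is named vllm
  minReplicas: 1
  maxReplicas: 3      # matches the node pool's max-nodes
  metrics:
  - type: Pods
    pods:
      metric:
        name: vllm_num_requests_waiting   # illustrative; depends on your metrics adapter
      target:
        type: AverageValue
        averageValue: "5"
```

A standard HPA won't scale below one replica, so this handles spikes, not idle nights; pairing it with the off-hours scale-down above (or a scale-to-zero tool like KEDA) covers both ends.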
GKE Autopilot now supports GPU workloads, but Standard mode gives more control over node pool behavior and is typically cheaper for workloads that need persistent GPU allocation.
Trade-offs You Need to Accept
Self-hosting: lower per-token cost at high sustained volume, complete data control, no vendor SLA, responsibility for model updates, and ongoing GPU infrastructure ops.
Managed Vertex AI: higher per-token cost, data processed through Google's infrastructure, automatic model improvements, a managed SLA, and zero infrastructure overhead.
Neither is universally correct. The architecture decision follows from your compliance requirements, actual token volume, and team capacity.
Get the Inputs Right First
Before you provision a GPU node pool or wire up API credentials, answer three questions:
- Does your data residency or compliance posture require self-hosting?
- What's your actual sustained token volume — not projected, not peak?
- Does your team have operational capacity for GPU infrastructure?
The answers determine the architecture. The infrastructure follows.
Need help designing your ADK agent architecture on GKE? Work with a GCP specialist — book a free discovery call.
Amit Malhotra
Principal GCP Architect, Buoyant Cloud Inc
Work with a GCP specialist — book a free discovery call → https://buoyantcloudtech.com