Somewhere in your AI journey, you’re going to push the limits of what models can do.
You might need to squeeze out that extra bit of performance, or try to fit a big model just under a GPU’s VRAM limit. All of these situations require tweaking and redeployment, and that’s not as simple as it sounds when the infrastructure includes everything from GPU clusters to storage to networking.
The solution is to treat your infrastructure the same way you treat your application code. It needs to be versioned in Git. It needs to be tested. And it needs to be deployed through an automated pipeline. This practice, known as Infrastructure as Code, or IaC, is the foundation of any serious MLOps strategy.
This article is a practical guide to using Terraform for agile ML engineering. I’ll walk through a real-world example of deploying a high-performance inference endpoint with vLLM on Google Kubernetes Engine. You can follow along with the complete source code on GitHub in the vllm-gke-terraform repository.
We will use the Qwen3-32B model in this article, which can be run on easily accessible NVIDIA L4 GPUs on Google Cloud. The Terraform script has been tested on larger models, such as Qwen/Qwen3-235B-A22B-Instruct-2507 on a cluster with 8 H100 GPUs.
The scripts currently use GKE standard clusters for maximum flexibility. For production workloads where you want to offload node management and focus purely on the application, it’s recommended to leverage GKE Autopilot capabilities.
Declarative Infrastructure
Terraform uses a declarative language (HCL) where you define the desired end state of your infrastructure. You specify what you need, and Terraform’s engine calculates the necessary API calls to make the real-world infrastructure match that state. Before applying any changes, you can run the terraform plan command to see a detailed preview of what Terraform will create, modify, or destroy.
This allows for a thorough review to ensure the proposed changes align with your intentions, preventing unintended modifications. This declarative model is the key to eliminating configuration drift and ensuring that every environment is provisioned identically, a critical requirement for reproducible experiments.
The Terraform provider for Google Cloud is the interface between Terraform and Google Cloud. For example, the google_container_cluster resource is used to manage a GKE cluster. You can find the full set of GKE resources in the provider’s documentation.
In our project, the gke.tf file declares the desired state of a GKE cluster with specific node pools:
# gke.tf
resource "google_container_cluster" "qwen_cluster" {
  name     = local.cluster_name
  location = var.zone
  project  = var.project_id
  # ...
}

resource "google_container_node_pool" "gpu_pools" {
  # ...
  node_config {
    machine_type = each.value.machine_type

    guest_accelerator {
      type  = each.value.accelerator_type
      count = each.value.accelerator_count
    }
  }
}
To manage this, Terraform maintains a state file that maps these definitions to their real-world resources. For team collaboration, using a remote state backend like Cloud Storage is recommended. It provides a centralized source of truth and uses locking mechanisms to prevent conflicting changes. Here’s how to instruct Terraform to use GCS as its backend:
# main.tf
terraform {
  backend "gcs" {
    prefix = "terraform/state/vllm-gke"
  }
}
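The gcs backend also needs a bucket for the state files. Since none is hard-coded above, it is presumably supplied at init time, for example with terraform init -backend-config="bucket=<your-state-bucket>", which keeps environment-specific details out of the committed configuration.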
Reusable Modules
Terraform modules are the primary mechanism for abstraction and reuse. MLOps teams can create a library of standardized modules for common components like a GKE cluster or a vector database.
Modules are made reusable through input variables. This allows an engineer to maintain a single, version-controlled set of Terraform files and use variable files (.tfvars) to launch new, isolated deployments.
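As a hedged illustration of this pattern, instantiating such a module could look like the snippet below (the module path and variable names are assumptions for the sake of the example, not taken from the repository):

# Illustrative module instantiation (path and variables are assumptions)
module "vllm_stack" {
  source      = "./modules/vllm-gke"
  project_id  = var.project_id
  name_prefix = var.name_prefix
  model_id    = var.model_id
  gpu_type    = var.gpu_type
}

Every value the module needs is passed in explicitly, so the same module source can back any number of experiments.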
To test a new model, you could simply create a new variable file like llama3-test.tfvars. By overriding a few default values, you can spin up an entirely new, isolated environment to test Llama-3-8B on L4 GPUs:
# llama3-test.tfvars
project_id  = "my-gcp-project"
name_prefix = "my-llama3-deployment"
model_id    = "meta-llama/Meta-Llama-3-8B-Instruct"
gpu_type    = "l4"
Running terraform apply -var-file=llama3-test.tfvars makes spinning up parallel experiments a trivial, declarative operation, dramatically increasing a team’s experimental throughput.
For production systems, this same principle allows for sophisticated, zero-downtime strategies like Blue/Green deployments. A second, parallel “green” version of the entire stack is deployed by instantiating the Terraform configuration with a different set of variables. Once the new environment is fully validated, production traffic can be instantly switched at the load balancer or DNS level. The old “blue” environment can then be decommissioned. By codifying these complex release strategies, the entire deployment process becomes a version-controlled, auditable artifact.
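A minimal sketch of this, assuming the green stack keeps its own state (for example via a separate Terraform workspace or a different state prefix) so the blue stack is left untouched; the file name and values below are hypothetical:

# green.tfvars (hypothetical)
name_prefix = "qwen-green"
model_id    = "Qwen/Qwen3-32B"
gpu_type    = "l4"

Applying this variable file brings up the green environment alongside blue, ready for validation before the traffic switch.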
Configuring the vLLM Engine
Provisioning hardware consistently is the first step. Configuring software to utilize that hardware efficiently is next.
The sample project uses the popular vLLM inference engine. Let’s look at how Terraform variables are linked to vLLM’s configuration parameters.
In variables.tf, the high-level knobs for experiments are defined:
# variables.tf
variable "gpu_memory_utilization" {
  description = "GPU memory utilization ratio"
  type        = number
  default     = 0.9
}

variable "max_model_len" {
  description = "The maximum model length."
  type        = number
  default     = 8192
}

variable "vllm_max_num_seqs" {
  description = "The maximum number of sequences (requests) to batch together."
  type        = number
  default     = 64
}
Then, the deployment in kubernetes.tf consumes these variables to construct the vLLM server’s startup arguments:
# kubernetes.tf
# ...
container {
  name = "vllm-container"

  args = compact([
    # --- Base Model Arguments ---
    "--model",
    var.model_id,
    "--tensor-parallel-size",
    tostring(local.gpu_config.accelerator_count),

    # --- Performance Tuning from Variables ---
    "--gpu-memory-utilization",
    tostring(var.gpu_memory_utilization),
    "--max-model-len",
    tostring(var.max_model_len),
    "--max-num-seqs",
    tostring(var.vllm_max_num_seqs),
  ])
}
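The local.gpu_config reference above maps the high-level gpu_type variable to concrete machine and accelerator settings. A minimal sketch of what such a lookup could look like, with illustrative values rather than the repository’s actual map:

# Illustrative locals lookup (machine types and counts are assumptions)
locals {
  gpu_configs = {
    l4 = {
      machine_type      = "g2-standard-48"
      accelerator_type  = "nvidia-l4"
      accelerator_count = 4
    }
    h100 = {
      machine_type      = "a3-highgpu-8g"
      accelerator_type  = "nvidia-h100-80gb"
      accelerator_count = 8
    }
  }

  gpu_config = local.gpu_configs[var.gpu_type]
}

With a lookup like this, --tensor-parallel-size automatically matches the number of GPUs attached to each node.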
Production-Grade Architecture
The sample project showcases a blueprint for a production-grade inference endpoint on GKE designed for both performance and cost-efficiency.
The gke.tf file provisions a GKE cluster with both spot and on-demand GPU node pools, which allows for a flexible and cost-effective approach to managing expensive GPU resources. Backing spot VMs with an on-demand node pool keeps capacity available when spot instances are reclaimed; a sketch of such a pool follows.
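Here is a hedged sketch of what the spot pool declaration could look like (the field values are assumptions, not copied from gke.tf); an on-demand pool would be the same declaration without spot = true:

# Illustrative spot GPU node pool (values are assumptions)
resource "google_container_node_pool" "gpu_spot" {
  name     = "${local.cluster_name}-gpu-spot"
  cluster  = google_container_cluster.qwen_cluster.name
  location = var.zone

  node_config {
    spot         = true
    machine_type = "g2-standard-48"

    guest_accelerator {
      type  = "nvidia-l4"
      count = 4
    }
  }
}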
To avoid re-downloading large models on every pod restart, a kubernetes_persistent_volume_claim is created in kubernetes.tf to provide a persistent cache for the Hugging Face models. A Kubernetes Job, defined in kubernetes_jobs.tf, is then used to download the specified model into this persistent volume. This job runs to completion before the main vLLM deployment is scaled up, ensuring the model is ready before the inference server starts.
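A minimal sketch of the cache volume, with an illustrative name and size:

# Illustrative persistent cache for Hugging Face model weights
resource "kubernetes_persistent_volume_claim" "model_cache" {
  metadata {
    name = "huggingface-model-cache"
  }

  spec {
    access_modes = ["ReadWriteOnce"]

    resources {
      requests = {
        storage = "200Gi"
      }
    }
  }
}

The download Job writes into this claim and the vLLM pods read from it, so model weights survive pod restarts instead of being pulled again on every start.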
Automated Workflows
While Terraform itself is a big leap forward from shell scripting, it’s crucial that teams don’t stop there. The next step beyond running manual terraform commands is to embrace an automated, end-to-end CI/CD workflow, often called GitOps. The source control repository becomes the single source of truth for both application code and infrastructure.
The sample project includes a basic GitHub Actions workflow that validates the Terraform code on every push and pull request.
# .github/workflows/terraform-validate.yml
name: 'Terraform Validate'

on: [push, pull_request]

jobs:
  validate:
    name: 'Terraform Validate'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform fmt -check -recursive
      - run: terraform init -backend=false
      - run: terraform validate
A complete CI/CD pipeline would extend this by running terraform plan on pull requests to preview changes and automatically running terraform apply on merge to the main branch to deploy them. This creates a flywheel where code is pushed and infrastructure is updated without manual intervention.
Infrastructure as Code Is Now an AI Competency
The main takeaway is this: mastering Infrastructure as Code isn’t an optional “DevOps” skill. It’s a core competency for the modern ML engineer. For any organization serious about productionizing AI, Terraform on Google Cloud is a key step toward building a scalable engineering culture.
If you’d like to keep learning, I recommend the step-by-step guide on deploying a workload to a GKE cluster with Terraform: Quickstart: Deploy a workload with Terraform.
From there, I’d love to hear more about your journey with AI and Cloud infrastructure. Connect on LinkedIn, X, or Bluesky to continue the discussion!

