Maciej Strzelczyk for Google Cloud

Posted on • Originally published at Medium

Inference on GKE Private Clusters

Setting up an inference service without access to the Internet

Deploying an inference service on your GKE cluster in 2026 is a fairly simple task. With a short Deployment definition making use of a vLLM image (TPU or GPU) and a Service definition, you have the basic setup ready to go! vLLM grabs the model of your choosing from Hugging Face during its startup. It’s all nicely automated. However, this setup requires your GKE nodes to have access to the Internet. What should you do when there’s no Internet connection? I will discuss the options in this article, but first, let’s start with a short analysis of how and why you may want to have no Internet connection for your nodes.
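That basic setup can be sketched roughly as follows. This is a minimal illustration, not a production manifest: the model, resource requests, and all names are assumptions made for the example.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-inference
  template:
    metadata:
      labels:
        app: vllm-inference
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        # vLLM downloads this model from Hugging Face on startup,
        # which requires Internet access from the node
        args: ["--model", "Qwen/Qwen2.5-1.5B-Instruct"]
        ports:
        - containerPort: 8000
        resources:
          limits:
            nvidia.com/gpu: "1"
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
spec:
  selector:
    app: vllm-inference
  ports:
  - port: 8000
    targetPort: 8000
```

The rest of this article looks at what changes when the model download in the `args` line above is no longer possible.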

GKE Private Nodes

One situation where your vLLM pod might not be able to download a model from the Internet is when you decide to use a GKE private cluster. With this option, the nodes in your cluster are assigned only private IP addresses from your VPC network. A node with only a private IP address can't be reached from outside your network, but it also loses the default path for outbound communication. This feature is great for hardening your system, but the lack of connectivity to the outside world is an obvious drawback.

One easy solution to the private nodes situation is to configure Cloud NAT for the region your cluster is in. Cloud NAT gives the nodes, and the pods running on them, a way to access the Internet while still blocking any attempt to establish new connections from outside the network. However, if you want your pods to remain unable to connect to the Internet, you need another way to deliver the model that vLLM will run.
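For completeness, the Cloud NAT setup mentioned above can be done with two gcloud commands: one to create a Cloud Router and one to attach a NAT configuration to it. Network, router, and region names here are illustrative.

```shell
# Create a Cloud Router in the cluster's region (names are illustrative)
gcloud compute routers create nat-router \
    --network=my-vpc --region=us-central1

# Attach a NAT gateway so private nodes get outbound-only Internet access
gcloud compute routers nats create nat-config \
    --router=nat-router --region=us-central1 \
    --auto-allocate-nat-external-ips \
    --nat-all-subnet-ip-ranges
```

Skip this step entirely if the goal is a cluster with no Internet egress at all; the rest of the article assumes you did.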

Providing images to the pods

Another problem you might encounter with a private cluster that has no Internet access is that your nodes can't reach the default source of Docker images: Docker Hub. A simple vllm/vllm-openai:latest image specification will not work. You will need to copy the images you want to use into Artifact Registry; from there, GKE nodes can download and run them. This also gives you additional control over your environment: you can decide exactly which image versions to download and allow cluster users to run.
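One way to get an image into Artifact Registry is to pull it on a machine that does have Internet access, retag it, and push it. The project, repository, and region names below are placeholders for the example.

```shell
# Pull the public image, retag it for Artifact Registry, and push it
# (project, repository, and region names are illustrative)
docker pull vllm/vllm-openai:latest
docker tag vllm/vllm-openai:latest \
    us-central1-docker.pkg.dev/my-project/my-repo/vllm-openai:latest
docker push us-central1-docker.pkg.dev/my-project/my-repo/vllm-openai:latest
```

Your pod specs then reference the `us-central1-docker.pkg.dev/...` path instead of the Docker Hub name.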

Providing the LLM

vLLM can run a model stored in a local directory if you pass that directory as the --model argument value. To make use of this ability in your private GKE cluster, you have to provide the model to vLLM through a mounted directory. The easiest way to do this is GCS FUSE, which allows you to simply mount a GCS bucket as a folder in your pod. You just need to remember that:

  1. The GKE Cluster must have the GcsFuseCsiDriver add-on enabled.
  2. You should use Workload Identity and a dedicated service account to allow the pod to access the bucket. The roles/storage.objectViewer role should work just fine for read-only access.
  3. It’s important to host the model in the same region as the nodes of your cluster to ensure the fastest transfers.
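Putting the three points above together, a pod spec with a GCS FUSE mount could look roughly like this. The bucket, image path, service account, and model directory are all illustrative; the annotation and CSI driver name are the ones the GcsFuseCsiDriver add-on expects.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: vllm-gcs
  annotations:
    gke-gcsfuse/volumes: "true"   # enables the GCS FUSE sidecar injection
spec:
  # Kubernetes SA bound to a GCP SA (roles/storage.objectViewer)
  # via Workload Identity
  serviceAccountName: vllm-sa
  containers:
  - name: vllm
    image: us-central1-docker.pkg.dev/my-project/my-repo/vllm-openai:latest
    # Point vLLM at the mounted directory instead of a Hugging Face ID
    args: ["--model", "/models/my-model"]
    volumeMounts:
    - name: model-volume
      mountPath: /models
      readOnly: true
  volumes:
  - name: model-volume
    csi:
      driver: gcsfuse.csi.storage.gke.io
      readOnly: true
      volumeAttributes:
        bucketName: my-model-bucket
        mountOptions: "implicit-dirs"
```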

Serving LLMs from a mounted directory speeds up the startup process of your inference service, as it doesn’t have to download the model each time a new pod is started.

Alternative to mounting a GCS bucket: persistent disks

An alternative to mounting a bucket is to use a zonal or regional persistent disk or Hyperdisk. A single disk can be mounted by multiple pods at once in read-only mode. Creating a disk to store a model is a bit more time-consuming than using a GCS bucket, but it may offer better performance (depending on the disk type) and lower cost, since GCS and disk billing are structured differently.

To create a disk storing a model, you will need a temporary Compute Engine instance, where you will attach, format, mount, and fill the disk with data (hf download works just fine for this). Once the disk is ready, the VM can be deleted and the disk attached to the vLLM pods.
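The disk-preparation steps above can be sketched like this. Disk size, type, zone, and the example model are assumptions; the commented lines are run on the temporary VM itself.

```shell
# Create a disk and a temporary VM to populate it (names are illustrative)
gcloud compute disks create model-disk --size=200GB \
    --type=pd-ssd --zone=us-central1-a
gcloud compute instances create model-loader --zone=us-central1-a \
    --disk=name=model-disk,device-name=model-disk

# On the VM: format, mount, and download the model, for example:
#   sudo mkfs.ext4 /dev/disk/by-id/google-model-disk
#   sudo mount /dev/disk/by-id/google-model-disk /mnt
#   hf download Qwen/Qwen2.5-1.5B-Instruct --local-dir /mnt/model

# Afterwards, delete the VM but keep the disk
gcloud compute instances delete model-loader \
    --zone=us-central1-a --keep-disks=all
```

The resulting disk can then be exposed to the vLLM pods through a PersistentVolume mounted read-only, so several replicas can share it.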

Summary

Using GKE without Internet access can be a good practice, providing you with additional security and control. As you can see, the additional work required to get your inference service running in this case is not negligible, but it is also not a deal-breaker. It’s up to you to decide if it’s a configuration you would like to use in your setup. Using a GCS Bucket or persistent disk to store a model is also a very good idea to simply cut down on the startup time of your services, especially with larger models.

The ecosystem of AI is changing at a rapid pace and it’s important to stay up to date with all the latest news. Follow the official Google Cloud blog, Google Developers blog and Google Cloud Tech YouTube channel to not miss any updates!
