Last year, Google flew me to Paris for the announcement of Gemma 3. It was an exciting event. The demos were impressive. But what really mattered happened later, back at my desk, when I ran my own tests and found out the demos weren't lying.
Gemma 3 was the first open model that closed the gap on the big commercial ones. It didn't beat Gemini. But it reached the level Gemini was at a year earlier. For an open model you could run on your own infrastructure, that was a meaningful leap. I started integrating it into my own pipelines. Specific tasks, small steps, places where the answer doesn't need a frontier model to get it right.
Then I made a mistake.
I deployed Gemma 3 on Vertex AI Model Garden over a weekend for testing. Left it running. Didn't turn it off. Came back to a bill that made me rethink my relationship with cloud infrastructure. I made a video about it on my YouTube channel so others wouldn't repeat the same mistake.
This article is the redemption.
Gemma 4 just launched. It's a bigger jump than Gemma 3 was. And this time, I'm deploying it on Cloud Run, which scales to zero when you're not using it. Forget to turn it off. I dare you. You won't pay a cent.
This article is in two parts. The first covers what Gemma 4 is, why running your own model changes what you can build, how the deployment stack works, and the performance data from my own tests.
The second part is the step-by-step deployment guide. Prerequisites, VPC setup, model upload, deploy commands for all four model sizes, and cleanup.
Part 1: Understanding Gemma 4 on Cloud Run
What Changed in Gemma 4
Gemma 4 ships as four distinct models, not one. Two small ones and two large ones, each with a different tradeoff.
| Model | Parameters | Architecture | Context Window |
|---|---|---|---|
| E2B | 2.3B effective | Dense | 128k tokens |
| E4B | 4.5B effective | Dense | 128k tokens |
| 26B A4B | 26B total, 4B active | MoE | 256k tokens |
| 31B | 31B | Dense | 256k tokens |
The 26B model deserves a closer look. It uses Mixture of Experts (MoE) architecture: a design where the model has 26 billion parameters on disk but only activates 4 billion of them per token during inference. Think of it like a large team of specialists where only the relevant experts are called in for each task, rather than everyone working on every problem. The result: capability that approaches a 26B model, at the compute cost of a 4B one. This matters enormously at inference time, as you'll see in the numbers below.
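To make the routing idea concrete, here's a toy top-k MoE layer in pure Python. This is an illustration of the general mechanism only, not Gemma 4's actual architecture; every name and dimension here is made up for the example:

```python
import math
import random

def moe_layer(x, experts, gate_w, k=2):
    """Toy top-k MoE routing: each input only pays for k expert matmuls."""
    # Router: one score per expert (dot product with that expert's gate vector)
    scores = [sum(xi * wi for xi, wi in zip(x, gate)) for gate in gate_w]
    # Select the k highest-scoring experts; the other experts stay idle
    top = sorted(range(len(scores)), key=scores.__getitem__)[-k:]
    # Softmax over the selected scores to get mixing weights
    weights = [math.exp(scores[i]) for i in top]
    total = sum(weights)
    out = [0.0] * len(x)
    for w, i in zip(weights, top):
        # Only these k expert matrices are multiplied; the rest never run
        y = [sum(row[j] * x[j] for j in range(len(x))) for row in experts[i]]
        out = [o + (w / total) * yi for o, yi in zip(out, y)]
    return out

random.seed(0)
d, n_experts = 8, 16
experts = [[[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]
           for _ in range(n_experts)]                      # 16 experts "on disk"
gate_w = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n_experts)]
x = [random.gauss(0, 1) for _ in range(d)]
y = moe_layer(x, experts, gate_w, k=2)                     # only 2 of 16 run
```

All 16 expert matrices exist in memory, but each input multiplies through only 2 of them. That's the 26B-on-disk, 4B-active tradeoff in miniature.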
Beyond size, Gemma 4 adds multimodal input. Images, audio, video: all supported as inputs, with text output. The small models (E2B, E4B) can process video with audio. The larger ones handle images with extended context.
But the two improvements that matter most for anyone building agentic pipelines are reasoning and function calling.
Reasoning means the model works through a problem step by step before producing an answer, rather than jumping straight to a response. Complex tasks that previously required a frontier model can now be handled by a reasoning-capable Gemma 4 at a fraction of the cost. Function calling has also been significantly improved: the model reliably returns structured tool calls, which is what makes it composable inside an agent that orchestrates multiple steps.
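In practice, a structured tool call is just JSON the agent can dispatch on. Here's a sketch with an OpenAI-style schema; the function name and fields are invented for this example:

```python
import json

# A tool the agent advertises to the model (illustrative schema, OpenAI-style)
tools = [{
    "type": "function",
    "function": {
        "name": "lookup_invoice",
        "description": "Fetch an invoice record by its ID",
        "parameters": {
            "type": "object",
            "properties": {"invoice_id": {"type": "string"}},
            "required": ["invoice_id"],
        },
    },
}]

# What a reliable function-calling model returns: a name plus JSON arguments,
# not prose. The agent parses the arguments and invokes the real function.
tool_call = {"name": "lookup_invoice",
             "arguments": json.dumps({"invoice_id": "INV-0042"})}
args = json.loads(tool_call["arguments"])
```

The reliability part is what matters: if the model returns valid JSON matching the schema every time, the agent's dispatch code stays simple.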
Together, these two capabilities change where Gemma fits in a production system. My own pipelines split work into layers: small, focused steps that don't need deep reasoning, and orchestration steps that evaluate results and decide what happens next. Gemma 3 could handle the simple steps. Gemma 4 can handle more of the middle layer too: the tasks that previously needed a bigger, more expensive model to get right. Every step moved from a frontier API to a self-hosted Gemma is tokens you stop paying for.
Why Running Your Own Model Changes What You Can Build
First, let's be precise about what "open" means here. Gemma 4 is not open source. The training code, the training data, and the full recipe that produced the model are not public. What Google releases are the weights, the trained parameters of the model itself, under the Apache 2.0 license. You can download them, run them, modify them, and build commercial products on top of them. But you can't reproduce the training process.
That distinction matters less than people think for most use cases, and more than people think for one specific one: fine-tuning.
Because you have the weights, you can train on top of them. Take the 4B model, run it through your own domain-specific dataset, and produce a version that understands your terminology, follows your output format, and performs better on your specific tasks than the general-purpose model does. This is the path from "capable open model" to "model that knows your business." The data you use for fine-tuning never leaves your environment, and the result belongs to you.
There's a category of problem that big cloud-hosted models can't solve for certain customers. Not because the models aren't capable, but because the data can't leave the building.
Healthcare providers, financial institutions, legal firms. Organizations with serious data privacy obligations can't pipe sensitive information through external APIs. For them, powerful AI has always meant exposing data to someone else's infrastructure. That's a non-starter in many regulated environments.
A self-hosted Gemma on your own Cloud Run service changes the equation. The model runs in your project, your VPC, your infrastructure. The data never leaves.
But the most compelling example isn't in an office. It's in a field.
Imagine a drone with cameras flying over farmland, using computer vision and reasoning to detect crop diseases, identify irrigation problems, or spot pest damage. That drone can't wait for a round-trip API call to a cloud endpoint. It might not even have a reliable internet connection in a remote countryside location. The decision needs to happen on the device, or close to it. And it needs to happen fast.
Gemma 4's multimodal capability, combined with its small model sizes, makes that kind of on-device or edge deployment practical. A 2B or 4B model can run on hardware that a drone or industrial sensor could realistically carry or connect to. The reasoning capability means it can do more than classify. It can think through what it's seeing.
How Gemma 4 Gets Deployed: The Stack
When Gemma 3 came out, the typical deployment used Ollama, a tool designed for running models locally. It's simple, works well for small models, and you can get something running quickly. Ollama bakes the model weights directly into the container image. The container starts, the model is already there, and you're serving in seconds. For a 2B or 7B model this is fine.
Gemma 4's larger models break that pattern. A 31B model doesn't fit comfortably in a container image. You can't bake 65GB of weights into something you expect to deploy and scale quickly. Ollama also doesn't expose the production controls you need at scale: request batching, KV-cache sharing across concurrent requests, quantization that doesn't sacrifice accuracy. It's a great local tool. It's not designed for what we're building here.
Gemma 4's official path uses vLLM instead. vLLM is an inference engine built specifically for serving LLMs in production. It handles multiple concurrent requests efficiently by batching them together, shares the KV-cache across requests to reduce memory pressure, and supports fp8 quantization, which is what lets the 31B model fit inside the NVIDIA RTX Pro 6000 Blackwell's 96GB of VRAM without meaningful quality loss. No cluster to manage, no node pool to configure. One flag in your deploy command.
For the 26B and 31B models, there's a third piece: the Run:ai Model Streamer. Rather than waiting for the entire model to load before serving the first request, the streamer fetches weights in parallel from Google Cloud Storage while vLLM initializes. The model starts accepting requests before it's fully loaded. This is what makes large model cold starts on Cloud Run feasible rather than painful.
The Model Loading Decision: HuggingFace vs GCS
Before you deploy, there's one choice that affects everything else: where does the model come from at startup?
HuggingFace is the simple option. The container downloads the model weights on every cold start. No storage cost, no upfront setup. The tradeoff: you're downloading over the public internet each time, and that download time dominates your cold start.
Google Cloud Storage is the production option. You upload the weights once, and the container streams them from GCS on each cold start via the Run:ai Model Streamer. More setup upfront. But the streaming happens over Google's internal network, and the speed difference is significant.
Here's the part that surprised me when I tested it: GCS without proper VPC configuration is actually slower than HuggingFace for small models. The Run:ai streamer's advantage only materializes when traffic stays on Google's internal network. When it goes out to the public internet, the overhead eliminates the benefit.
The fix is Private Google Access on your VPC subnet. One command. It opens a route from your Cloud Run container to Google APIs (including GCS) without touching the public internet. The official documentation doesn't highlight this, and it's the single most important detail in this entire guide.
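If you already have a subnet, the setting can be flipped with a single update command (the deployment guide below enables it at subnet creation instead, so you won't need this there):

```shell
# Enable Private Google Access on an existing subnet (one-time toggle)
gcloud compute networks subnets update gemma-subnet \
  --region=us-central1 \
  --enable-private-ip-google-access
```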
The Numbers
I deployed all four model sizes from HuggingFace and from GCS, with and without VPC, and measured cold start time (first request after scale-to-zero), time to first token, and warm response time. Same prompt every time: "What is the moon?"
A note on methodology: these are single measurements, not averages from a full benchmark suite. LLMs are nondeterministic by nature, and infrastructure performance varies with load, region capacity, and network conditions. The numbers below are directionally correct, not scientifically precise. Use them to understand the relative tradeoffs between deployment options. Before committing to one approach in production, run your own tests with your own workload.
| Model | Source | Cold Start | Warm Response |
|---|---|---|---|
| 2B | HuggingFace | 311s | 1.75s |
| 2B | GCS (no VPC) | 334s | 1.81s |
| 2B | GCS + VPC | 245s | 1.81s |
| 4B | HuggingFace | 452s | 2.46s |
| 4B | GCS (no VPC) | 433s | 2.47s |
| 4B | GCS + VPC | 246s | 2.47s |
| 26B | GCS + VPC | 191s | 1.61s |
| 31B | GCS + VPC | 251s | 5.90s |
A few things to unpack.
GCS without VPC is slower than HuggingFace for small models. The streamer adds overhead that only pays off when traffic is fast. Over the public internet, HuggingFace wins for small files.
VPC changes everything. With Private Google Access, the 4B cold start drops from 433 seconds to 246 seconds. That's a 43% reduction just from routing traffic differently.
The 26B model cold starts faster than the 2B from HuggingFace. Read that again. A 26 billion parameter model, streamed over Google's internal network, is ready to serve in 191 seconds. The 2B downloading from HuggingFace takes 311 seconds. Network path and streaming architecture matter more than model size on disk.
MoE vs. dense matters at inference time, not just startup. The 26B warm response is 1.61 seconds. The 31B is 5.90 seconds. The 31B is a dense model: every one of its 31 billion parameters participates in every token. The 26B only activates 4 billion at a time. That's why the 26B responds nearly four times faster despite being nominally "larger." For latency-sensitive applications with a 256k context requirement, the 26B A4B is the more interesting choice.
Scale-to-Zero and What It Means for Cost
Cloud Run scales running instances based on traffic. When there are no requests, it scales to zero. No instances, no GPU allocated, no cost. The moment a request arrives, a new instance starts, loads the model, and serves it.
This is fundamentally different from Vertex AI Model Garden, where a deployed endpoint keeps a running instance alive. Walk away for the weekend, come back to a bill.
With Cloud Run's scale-to-zero, the worst case is the cold start delay. And as the numbers show, even a 26B model is ready in about three minutes. For development and testing, that tradeoff is straightforward.
To verify an instance has scaled to zero: open the Cloud Run console, click your service, go to the Metrics tab, and look at Instance count. Zero means you're not being charged. Or just wait 5 minutes after your last request and send a new one. If it takes 200+ seconds instead of 2, the instance scaled down.
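Once a service is deployed (SERVICE_URL_2B is captured in Part 2), the timing check from a terminal looks like this; vLLM's OpenAI-compatible server exposes a lightweight /v1/models endpoint that works well for it:

```shell
# ~2s: a warm instance answered. 200s+: the service had scaled to zero
# and a fresh instance had to load the model first.
time curl -s -o /dev/null "${SERVICE_URL_2B}/v1/models"
```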
Scale-to-zero is the right choice for development and testing. But it's not the only option. Cloud Run also lets you set a minimum number of instances to keep alive at all times. For production serving consistent traffic, you'd configure at least one warm instance to eliminate cold starts entirely. That changes the cost model: you're paying for idle time again. But it's a deliberate tradeoff, not an accident.
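Switching an existing service to one warm instance is a single update, shown here for the 2B service from the guide below:

```shell
# Keep one instance alive at all times: no cold starts,
# but you're paying for idle GPU time again
gcloud run services update gemma4-2b \
  --region="${GOOGLE_CLOUD_REGION}" \
  --min-instances=1
```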
The guide in this article is optimized for testing: minimal cost, maximum flexibility, scale-to-zero on everything. When you're ready to move to production and need to think about minimum instances, concurrency tuning, and traffic management, my post "This is Cloud Run: Configuration" covers those options.
Part 2: The Deployment Guide
I ran all of this myself before writing a word of this guide. Deployed every model size, hit every error, debugged every failure. The instructions below are what actually works for me.
What You'll Need
- A Google Cloud project with billing enabled
- Cloud Shell (recommended) or the gcloud CLI installed locally. This guide assumes you're using Cloud Shell.
- Access to the Gemma 4 models on HuggingFace (requires accepting the license for each model)
- A HuggingFace access token with read access
I ran everything below from Cloud Shell, Google Cloud's browser-based terminal. It comes pre-authenticated with your account and has gcloud already installed. No local setup, no version mismatches. Open it from the Cloud Console by clicking the terminal icon in the top right.
Step 1: Set Your Environment Variables
Set these once at the start of your Cloud Shell session. Every command in this guide uses them:
export GOOGLE_CLOUD_PROJECT="your-project-id"
export GOOGLE_CLOUD_REGION="us-central1"
export GCS_BUCKET="${GOOGLE_CLOUD_PROJECT}-model-cache"
export VPC_NETWORK="gemma-vpc"
export VPC_SUBNET="gemma-subnet"
export HF_TOKEN="your-huggingface-token"
export PROJECT_NUMBER=$(gcloud projects describe "${GOOGLE_CLOUD_PROJECT}" \
--format='value(projectNumber)')
Note 1: The NVIDIA RTX Pro 6000 GPU is available in us-central1 and europe-west4.
Note 2: Cloud Shell sessions don't persist environment variables across reconnects. If you close and reopen Cloud Shell, run this block again before continuing.
Step 2: Enable the Required APIs
On a new project, you'll need all of these:
gcloud services enable \
run.googleapis.com \
compute.googleapis.com \
storage.googleapis.com \
artifactregistry.googleapis.com \
iam.googleapis.com \
--project $GOOGLE_CLOUD_PROJECT
compute.googleapis.com is required for VPC and GPU resources. iam.googleapis.com is needed to grant permissions to the Cloud Run service agent. The others cover model storage and Cloud Run itself.
Step 3: Check GPU Quota (optional)
Cloud Run GPU access requires quota approval. You can check your current allocation:
Go to IAM & Admin > Quotas & System Limits in the Google Cloud Console, filter by NVIDIA_RTX_PRO_6000 and region:us-central1. If the limit is 0 or the quota doesn't appear, you need to request it.
Request GPU quota through the Cloud Run quotas page. Approval is not instant; allow a few days.
Step 4: Create the VPC
As explained in Part 1, Private Google Access on the subnet is the critical step. Without it, traffic to GCS routes over the public internet and the streamer's speed advantage disappears. It's included directly in the subnets create command below, so no separate update is needed.
gcloud compute networks create "${VPC_NETWORK}" \
--subnet-mode=custom \
--bgp-routing-mode=regional \
--project="${GOOGLE_CLOUD_PROJECT}"
gcloud compute networks subnets create "${VPC_SUBNET}" \
--network="${VPC_NETWORK}" \
--region="${GOOGLE_CLOUD_REGION}" \
--range=10.0.0.0/24 \
--enable-private-ip-google-access \
--project="${GOOGLE_CLOUD_PROJECT}"
Grant the Cloud Run service agent permission to use the subnet:
gcloud projects add-iam-policy-binding "${GOOGLE_CLOUD_PROJECT}" \
--member="serviceAccount:service-${PROJECT_NUMBER}@serverless-robot-prod.iam.gserviceaccount.com" \
--role="roles/compute.networkUser"
Note: In production, instead of using the default compute service account, create a dedicated one with least-privilege access to the GCS bucket.
Step 5: Create the GCS Bucket and Upload Models
Create a single-region bucket in the same region as your Cloud Run service:
gcloud storage buckets create "gs://${GCS_BUCKET}" \
--location="${GOOGLE_CLOUD_REGION}" \
--uniform-bucket-level-access \
--project="${GOOGLE_CLOUD_PROJECT}"
gcloud storage buckets add-iam-policy-binding "gs://${GCS_BUCKET}" \
--member="serviceAccount:${PROJECT_NUMBER}-compute@developer.gserviceaccount.com" \
--role="roles/storage.objectViewer"
On disk space: Cloud Shell has only about 5GB of free disk, which makes even the 2B (~5GB) a tight squeeze and rules out the 4B (~9GB), 26B (~50GB), and 31B (~65GB) entirely. The reliable path for the uploads is a temporary GCE VM:
gcloud compute instances create gemma-uploader \
--project="${GOOGLE_CLOUD_PROJECT}" \
--zone="${GOOGLE_CLOUD_REGION}-a" \
--machine-type=n2-standard-8 \
--boot-disk-size=300GB \
--boot-disk-type=pd-ssd \
--image-family=debian-12 \
--image-project=debian-cloud \
--network=default \
--subnet=default \
--scopes=storage-full
If you get a subnet error, add --network="${VPC_NETWORK}" --subnet="${VPC_SUBNET}" to the command. Cloud Shell handles this automatically.
gcloud compute ssh gemma-uploader \
--project="${GOOGLE_CLOUD_PROJECT}" \
--zone="${GOOGLE_CLOUD_REGION}-a"
Inside the VM, install dependencies and upload the models:
export HF_TOKEN="your-huggingface-token"
export GCS_BUCKET="your-bucket-name"
sudo apt-get update && sudo apt-get install -y python3-pip --fix-missing
python3 -m pip install huggingface_hub hf_transfer --break-system-packages
# Stop immediately if any step fails
set -e
# Download, upload, and clean up each model one at a time
for MODEL in "google/gemma-4-E2B-it:gemma-4-E2B-it" \
"google/gemma-4-E4B-it:gemma-4-E4B-it" \
"google/gemma-4-26B-A4B-it:gemma-4-26B-A4B-it" \
"google/gemma-4-31B-it:gemma-4-31B-it"; do
REPO="${MODEL%%:*}"
DIR="${MODEL##*:}"
export HF_HOME="/tmp/hf_cache_${DIR}"
python3 -c "
from huggingface_hub import snapshot_download
import os
snapshot_download(repo_id='${REPO}', local_dir='/tmp/${DIR}', token=os.environ['HF_TOKEN'])
"
gcloud storage cp /tmp/${DIR}/* "gs://${GCS_BUCKET}/models/${DIR}/" --recursive
rm -rf /tmp/${DIR} "/tmp/hf_cache_${DIR}"
done
Exit the VM and delete it:
exit
gcloud compute instances delete gemma-uploader \
--project="${GOOGLE_CLOUD_PROJECT}" \
--zone="${GOOGLE_CLOUD_REGION}-a" \
--quiet
Step 6: Deploy
Every deployment uses the same prebuilt container image from Google:
us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:gemma4
One important detail: the container doesn't read the model or vLLM flags from environment variables. They must be passed via --command="vllm" and --args. Without --command="vllm", the startup script fails immediately.
2B model:
export GCS_MODEL_PATH_2B="gs://${GCS_BUCKET}/models/gemma-4-E2B-it"
gcloud beta run deploy gemma4-2b \
--image="us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:gemma4" \
--project="${GOOGLE_CLOUD_PROJECT}" \
--region="${GOOGLE_CLOUD_REGION}" \
--execution-environment=gen2 \
--allow-unauthenticated \
--cpu=20 \
--memory=80Gi \
--gpu=1 \
--gpu-type=nvidia-rtx-pro-6000 \
--no-gpu-zonal-redundancy \
--no-cpu-throttling \
--max-instances=1 \
--concurrency=64 \
--timeout=600 \
--network="${VPC_NETWORK}" \
--subnet="${VPC_SUBNET}" \
--vpc-egress=all-traffic \
--command="vllm" \
--args="serve,${GCS_MODEL_PATH_2B},--served-model-name=google/gemma-4-E2B-it,--enable-chunked-prefill,--enable-prefix-caching,--generation-config=auto,--enable-auto-tool-choice,--tool-call-parser=gemma4,--reasoning-parser=gemma4,--dtype=bfloat16,--max-num-seqs=64,--gpu-memory-utilization=0.95,--load-format=runai_streamer,--tensor-parallel-size=1,--port=8080,--host=0.0.0.0" \
--startup-probe="tcpSocket.port=8080,initialDelaySeconds=60,failureThreshold=5,timeoutSeconds=60,periodSeconds=60"
4B model: same command, replace GCS_MODEL_PATH_2B with GCS_MODEL_PATH_4B, service name with gemma4-4b.
export GCS_MODEL_PATH_4B="gs://${GCS_BUCKET}/models/gemma-4-E4B-it"
gcloud beta run deploy gemma4-4b \
--image="us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:gemma4" \
--project="${GOOGLE_CLOUD_PROJECT}" \
--region="${GOOGLE_CLOUD_REGION}" \
--execution-environment=gen2 \
--allow-unauthenticated \
--cpu=20 \
--memory=80Gi \
--gpu=1 \
--gpu-type=nvidia-rtx-pro-6000 \
--no-gpu-zonal-redundancy \
--no-cpu-throttling \
--max-instances=1 \
--concurrency=64 \
--timeout=600 \
--network="${VPC_NETWORK}" \
--subnet="${VPC_SUBNET}" \
--vpc-egress=all-traffic \
--command="vllm" \
--args="serve,${GCS_MODEL_PATH_4B},--served-model-name=google/gemma-4-E4B-it,--enable-chunked-prefill,--enable-prefix-caching,--generation-config=auto,--enable-auto-tool-choice,--tool-call-parser=gemma4,--reasoning-parser=gemma4,--dtype=bfloat16,--max-num-seqs=64,--gpu-memory-utilization=0.95,--load-format=runai_streamer,--tensor-parallel-size=1,--port=8080,--host=0.0.0.0" \
--startup-probe="tcpSocket.port=8080,initialDelaySeconds=60,failureThreshold=5,timeoutSeconds=60,periodSeconds=60"
26B model: adds fp8 quantization and drops concurrency to 8.
The startup probe below allows 6 minutes total (1 minute initial delay + 5 checks × 1 minute each), based on my measured load time. If your deployment times out, increase failureThreshold — each unit adds one more minute:
export GCS_MODEL_PATH_26B="gs://${GCS_BUCKET}/models/gemma-4-26B-A4B-it"
gcloud beta run deploy gemma4-26b \
--image="us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:gemma4" \
--project="${GOOGLE_CLOUD_PROJECT}" \
--region="${GOOGLE_CLOUD_REGION}" \
--execution-environment=gen2 \
--allow-unauthenticated \
--cpu=20 \
--memory=80Gi \
--gpu=1 \
--gpu-type=nvidia-rtx-pro-6000 \
--no-gpu-zonal-redundancy \
--no-cpu-throttling \
--max-instances=1 \
--concurrency=8 \
--timeout=600 \
--network="${VPC_NETWORK}" \
--subnet="${VPC_SUBNET}" \
--vpc-egress=all-traffic \
--command="vllm" \
--args="serve,${GCS_MODEL_PATH_26B},--served-model-name=google/gemma-4-26B-A4B-it,--enable-chunked-prefill,--enable-prefix-caching,--generation-config=auto,--enable-auto-tool-choice,--tool-call-parser=gemma4,--reasoning-parser=gemma4,--dtype=bfloat16,--quantization=fp8,--kv-cache-dtype=fp8,--max-num-seqs=8,--gpu-memory-utilization=0.95,--load-format=runai_streamer,--tensor-parallel-size=1,--port=8080,--host=0.0.0.0,--max-model-len=32767" \
--startup-probe="tcpSocket.port=8080,initialDelaySeconds=60,failureThreshold=5,timeoutSeconds=60,periodSeconds=60"
31B model: same as 26B with failureThreshold=6, allowing 7 minutes total based on my measured load time. Increase failureThreshold if needed, same rule applies: each unit adds one minute:
export GCS_MODEL_PATH_31B="gs://${GCS_BUCKET}/models/gemma-4-31B-it"
gcloud beta run deploy gemma4-31b \
--image="us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:gemma4" \
--project="${GOOGLE_CLOUD_PROJECT}" \
--region="${GOOGLE_CLOUD_REGION}" \
--execution-environment=gen2 \
--allow-unauthenticated \
--cpu=20 \
--memory=80Gi \
--gpu=1 \
--gpu-type=nvidia-rtx-pro-6000 \
--no-gpu-zonal-redundancy \
--no-cpu-throttling \
--max-instances=1 \
--concurrency=8 \
--timeout=600 \
--network="${VPC_NETWORK}" \
--subnet="${VPC_SUBNET}" \
--vpc-egress=all-traffic \
--command="vllm" \
--args="serve,${GCS_MODEL_PATH_31B},--served-model-name=google/gemma-4-31B-it,--enable-chunked-prefill,--enable-prefix-caching,--generation-config=auto,--enable-auto-tool-choice,--tool-call-parser=gemma4,--reasoning-parser=gemma4,--dtype=bfloat16,--quantization=fp8,--kv-cache-dtype=fp8,--max-num-seqs=8,--gpu-memory-utilization=0.95,--load-format=runai_streamer,--tensor-parallel-size=1,--port=8080,--host=0.0.0.0,--max-model-len=32767" \
--startup-probe="tcpSocket.port=8080,initialDelaySeconds=60,failureThreshold=6,timeoutSeconds=60,periodSeconds=60"
Get the service URLs after deployment:
export SERVICE_URL_2B=$(gcloud run services describe gemma4-2b \
--region="${GOOGLE_CLOUD_REGION}" \
--project="${GOOGLE_CLOUD_PROJECT}" \
--format="value(status.url)")
export SERVICE_URL_4B=$(gcloud run services describe gemma4-4b \
--region="${GOOGLE_CLOUD_REGION}" \
--project="${GOOGLE_CLOUD_PROJECT}" \
--format="value(status.url)")
export SERVICE_URL_26B=$(gcloud run services describe gemma4-26b \
--region="${GOOGLE_CLOUD_REGION}" \
--project="${GOOGLE_CLOUD_PROJECT}" \
--format="value(status.url)")
export SERVICE_URL_31B=$(gcloud run services describe gemma4-31b \
--region="${GOOGLE_CLOUD_REGION}" \
--project="${GOOGLE_CLOUD_PROJECT}" \
--format="value(status.url)")
echo "2B: $SERVICE_URL_2B"
echo "4B: $SERVICE_URL_4B"
echo "26B: $SERVICE_URL_26B"
echo "31B: $SERVICE_URL_31B"
Step 7: Test It
The service exposes an OpenAI-compatible API. Any client that speaks the OpenAI protocol works against it.
The first request will be slow. If the instance has scaled to zero, Cloud Run needs to start a new one and load the model before responding. For the 2B and 4B models expect around 4 minutes. For the 26B and 31B, up to 5 minutes. Don't cancel the request — it will come back. Every request after that will be fast.
curl -X POST "${SERVICE_URL_2B}/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
"model": "google/gemma-4-E2B-it",
"messages": [{"role": "user", "content": "What is the moon?"}],
"max_tokens": 200
}'
The model name in the request must match the name vLLM serves under: the value of --served-model-name if you set one in --args, otherwise the model path passed to vllm serve.
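For programmatic access, any OpenAI-compatible client library works; here's a dependency-free Python sketch using only the standard library (the URL and model name are placeholders following this guide's conventions):

```python
import json
import urllib.request

def chat_request(base_url: str, model: str, prompt: str, max_tokens: int = 200):
    """Build a POST request for the vLLM OpenAI-compatible chat endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Cold starts can take minutes, so keep the timeout generous:
# req = chat_request(SERVICE_URL_2B, "google/gemma-4-E2B-it", "What is the moon?")
# with urllib.request.urlopen(req, timeout=600) as resp:
#     print(json.loads(resp.read())["choices"][0]["message"]["content"])
```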
Production Hardening
Everything above uses --allow-unauthenticated and the default compute service account. That's fine for testing. Before real users or real data:
Authentication. Replace --allow-unauthenticated with --no-allow-unauthenticated. Cloud Run supports OIDC tokens for service-to-service calls and IAP for user-facing access.
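With unauthenticated access turned off, a service-to-service call attaches an identity token. For example, from a terminal with an account that holds the run.invoker role on the service:

```shell
curl -X POST "${SERVICE_URL_2B}/v1/chat/completions" \
  -H "Authorization: Bearer $(gcloud auth print-identity-token)" \
  -H "Content-Type: application/json" \
  -d '{"model": "google/gemma-4-E2B-it", "messages": [{"role": "user", "content": "ping"}], "max_tokens": 10}'
```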
Dedicated Service Account. Create one with only roles/storage.objectViewer on the model bucket. The default compute service account has broader permissions than necessary.
Private endpoint. For sensitive workloads, remove the public URL and access the service only from within your VPC via Cloud Run private networking.
Cleaning Up
To remove everything after testing is done, run these commands in order:
Delete the Cloud Run services:
for SERVICE in gemma4-2b gemma4-4b gemma4-26b gemma4-31b; do
gcloud run services delete $SERVICE \
--region="${GOOGLE_CLOUD_REGION}" \
--project="${GOOGLE_CLOUD_PROJECT}" \
--quiet
done
Delete the GCS bucket and all model weights:
gcloud storage rm -r "gs://${GCS_BUCKET}"
Delete the VPC subnet and network:
gcloud compute networks subnets delete "${VPC_SUBNET}" \
--region="${GOOGLE_CLOUD_REGION}" \
--project="${GOOGLE_CLOUD_PROJECT}" \
--quiet
gcloud compute networks delete "${VPC_NETWORK}" \
--project="${GOOGLE_CLOUD_PROJECT}" \
--quiet
If subnet deletion fails with an error about IP addresses still in use: Cloud Run holds onto internal IP addresses for a period after services are deleted, and there's no way to force-release them. Give it a few hours and try again.
Remove the IAM binding:
gcloud projects remove-iam-policy-binding "${GOOGLE_CLOUD_PROJECT}" \
--member="serviceAccount:service-${PROJECT_NUMBER}@serverless-robot-prod.iam.gserviceaccount.com" \
--role="roles/compute.networkUser"
The Bigger Picture
I started this piece talking about a Paris visit and an unexpected cloud bill. But the real reason I spent time getting Gemma 4 running on Cloud Run isn't just the cost.
It's the access.
When an LLM runs in your own infrastructure, things become possible that weren't possible before. Regulated data that couldn't touch a third-party API can now be processed by a capable model. Privacy becomes a feature of the architecture, not a compromise against capability.
But there's a more practical argument too. Commercial frontier models are expensive per token, and they come with regional rate limits that cap how much you can do. When your production pipeline hits a rate limit, everything behind it slows down. When you call your own model, there's no rate limit. You control the capacity. You control the cost. You decide when to scale.
Gemma 4 is the first open model where that tradeoff genuinely makes sense across the full range of AI tasks: text, vision, reasoning, function calling. Not every step in your pipeline needs a frontier model. The steps that don't (and with Gemma 4's reasoning capability, that's more steps than before) can run on infrastructure you own, at a cost you control, without a rate limit in sight.
The drone flying over the farmer's fields, making decisions on its own, is not a hypothetical. The only thing that was missing was a model good enough to run on hardware that fits in a backpack.
Now there is one. And you know how to deploy it. Enjoy discovering it.