Olivier Bourgeois for Google AI

Originally published at cloud.google.com

Hands-on with Gemma 3 on Google Cloud

The landscape of generative AI is shifting. While proprietary APIs are powerful, there is growing demand for open models, whose architectures and weights are publicly available. This shift puts control back in developers' hands, offering transparency, data privacy, and the ability to fine-tune for specific use cases.

To help you navigate this landscape, we are releasing two new hands-on labs featuring Gemma 3, Google’s latest family of lightweight, state-of-the-art open models.

Why Gemma?

Built from the same research and technology as Gemini, Gemma models are designed for responsible AI development. Gemma 3 is particularly exciting because it offers multimodal capabilities (text and image input) and runs efficiently on a small hardware footprint while delivering strong performance for its size.

But running a model on your laptop is very different from running it in production. You need scale, reliability, and hardware acceleration (GPUs). The question is: Where should you deploy?

We have prepared two different paths for you, depending on your infrastructure needs: Cloud Run or Google Kubernetes Engine (GKE).

Path 1: The Serverless Approach (Cloud Run)

Best for: Developers who want an API that is up and running in minutes, requires no infrastructure management, and scales to zero when not in use.

If your priority is simplicity and cost-efficiency for stateless workloads, Cloud Run is your answer. It abstracts away server management entirely, and with the recent addition of GPU support on Cloud Run, you can now serve modern LLMs without provisioning a cluster.

Lab: Serving Gemma 3 with vLLM on Cloud Run

Objectives:

  • Containerize vLLM (a high-throughput serving engine).
  • Deploy Gemma 3 to Cloud Run.
  • Leverage GPU acceleration for fast inference.
  • Expose an OpenAI-compatible API endpoint (see the sketch after this list).
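
Once deployed, the vLLM service speaks the same wire protocol as the OpenAI API, so any OpenAI client library can talk to it. Here is a minimal sketch in Python; the service URL and model tag are illustrative assumptions, so substitute the values from your own deployment:

```python
# Minimal sketch: querying a Gemma 3 model served by vLLM on Cloud Run
# through its OpenAI-compatible endpoint. The URL and model tag below are
# placeholders, not values taken from the lab.
from openai import OpenAI

client = OpenAI(
    base_url="https://gemma-vllm-xxxxx-uc.a.run.app/v1",  # hypothetical Cloud Run URL
    api_key="unused",  # vLLM accepts any key unless an API key is configured
)

response = client.chat.completions.create(
    model="google/gemma-3-4b-it",  # assumed tag; match whatever your server loaded
    messages=[{"role": "user", "content": "Explain open models in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```

Because the endpoint is OpenAI-compatible, existing tooling built against the OpenAI SDK works unchanged; only the base URL and model name differ.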

Path 2: The Platform Approach (GKE)

Best for: Engineering teams building complex AI platforms that require high throughput, custom orchestration, or integration with a broader microservices ecosystem.

When your application graduates from a prototype to a high-traffic production system, you need the control of Kubernetes. GKE Autopilot gives you that control while still handling the heavy lifting of node management. This path takes you from local testing all the way to cloud production.

Lab: Deploying Open Models on GKE

In this lab, you will learn how to:

  • Prototype locally using Ollama (see the sketch after this list).
  • Containerize your setup and transition to GKE Autopilot.
  • Deploy a scalable inference service using standard Kubernetes manifests.
  • Manage resources effectively for production workloads.
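
For the local prototyping step, Ollama exposes a REST API on port 11434 that you can exercise before anything touches the cloud. A minimal sketch, assuming Ollama is running locally and a Gemma 3 model has been pulled (the exact model tag depends on your setup):

```python
# Minimal sketch: prompting a locally running Ollama server, as a sanity
# check before containerizing and moving to GKE Autopilot. Assumes
# `ollama serve` is running and that the "gemma3" tag (an assumption) is pulled.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "gemma3",                      # assumed model tag
        "prompt": "Say hello in one sentence.",
        "stream": False,                        # single JSON reply, not a stream
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])
```

Once the same container runs on GKE, the identical request works against the cluster endpoint; only the host changes.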

Which Path Will You Choose?

Whether you are looking for the serverless simplicity of Cloud Run or the robust orchestration of GKE, Google Cloud provides the tools to take Gemma 3 from a concept to a deployed application.

Dive into the labs today and start building.

Share your progress and connect with others on the journey using the hashtag #ProductionReadyAI. Happy learning!

These labs are part of the Open Models module in our official Production-Ready AI with Google Cloud program. Explore the full curriculum for more content that will help you bridge the gap from a promising prototype to a production-grade AI application.
