DEV Community: Google Cloud

How to Count Gemini Tokens Locally

Laurent Picard — Thu, 02 Jul 2026 13:50:42 +0000

✨ Overview

This article explores how Gemini tokenizes data and demonstrates how to count or estimate tokens locally. You’ll learn how to use the local tokenizer to estimate text token counts offline, understand the tokenization math for multimodal inputs (images, audio, video, PDFs), and see how to retrieve precise token usage metadata from API responses for accurate tracking and billing.

ℹ️ The complete source code is available in this notebook (including all setup details and future updates) under the Apache 2.0 license. You can also directly open the notebook in Colab. This article reproduces all the results generated by a click on “Run all”.

⚙️ Setup

🐍 Google Gen AI Python SDK

To call the Gemini API, we'll use the Google Gen AI Python SDK. The Gemini API provides a count_tokens method, and the SDK offers an experimental implementation of a LocalTokenizer class.

Make sure you have a recent version of the google-genai package with its local-tokenizer extra:

%pip install --quiet "google-genai[local-tokenizer]>=2.9.0"

🛠️ Google Cloud Project

To get started using the Gemini API on Agent Platform, you must have an existing Google Cloud project and enable the Agent Platform API.

Learn more about setting up a project and a development environment.

import os

# fmt: off
PROJECT_ID = ""  # @param {type: "string", placeholder: "[your-project-id]", isTemplate: true}
LOCATION = "global" # @param {type: "string", placeholder: "[your-region]", isTemplate: true}
# fmt: on

if not PROJECT_ID:
    PROJECT_ID = os.environ.get("GOOGLE_CLOUD_PROJECT")
    assert PROJECT_ID, "❌ Please set the PROJECT_ID variable"
if not LOCATION:
    LOCATION = os.environ.get("GOOGLE_CLOUD_REGION", "global")

🤖 Gen AI SDK Client

To interact with the Gemini API, we initialize a genai.Client. Since we're using the enterprise-ready Agent Platform backend (formerly Vertex AI), we pass enterprise=True along with our Google Cloud project and location:

from google import genai


def print_configuration(client: genai.Client) -> None:
    service = "Agent Platform" if client.vertexai else "Google AI"
    print(f"ℹ️ Using the {service} API", end="")
    if client._api_client.project:
        print(f' with project "{client._api_client.project[:7]}…"', end="")
        print(f' in location "{client._api_client.location}"')
    elif client._api_client.api_key:
        api_key = client._api_client.api_key
        print(f' with API key "{api_key[:5]}…{api_key[-5:]}"', end="")
        print(f" (in case of error, make sure it was created for {service})")


client = genai.Client(enterprise=True, project=PROJECT_ID, location=LOCATION)

print_configuration(client)

ℹ️ Using the Agent Platform API with project "lpdemo-…" in location "global"

🧠 Gemini Model

We'll use gemini-3.1-flash-lite as our default model for token counting and content generation. This lightweight, fast model is ideal for high-throughput tasks.

MODEL_ID = "gemini-3.1-flash-lite"

🧩 The Basics: Tokens and Tokenizers

Tokens

Large language models (LLMs) don't process our inputs directly, nor do they generate the final text or media we see. Instead, they operate on fundamental units called tokens, ingesting them as inputs and generating them as outputs.

Here's what happens when we send an LLM request:

Our inputs are transformed into tokens. In other words, they are tokenized.
The model generates output tokens, which represent the most likely next tokens based on the overall context.
These output tokens are transformed back into the final content we can use.

You can think of a token as a piece of information, and this tokenization process acts as an information compression codec:

Encoding: Input → Input tokens
Decoding: Output tokens → Output

Tokenization is necessary to compress information to the right level of semantic granularity, allowing the model's attention mechanism to focus and develop an understanding of the provided data.

Tokenizers

Gemini is natively multimodal and accepts text, images, audio, video, and PDFs. These media types can be processed by a set of three tokenizers:

Input	Text Tokenizer	Image Tokenizer	Audio Tokenizer	Comment
Text	✅			The original tokenizer type, when LLMs were only chatbots.
Image		✅		An image is can be worth a thousand ~~words~~ tokens!
Audio	✅		✅	Text tokens are used for timestamps (`MM:SS` or `H:MM:SS`).
Video	✅	✅	[✅]	By default, one frame is sampled per second, along with its corresponding timestamp. Audio is optional for videos.
PDF	✅	✅		PDFs are processed by vision tokenizers. Text tokens are used for OCR and pagination data.

As you can see, up to three tokenizers can be involved, depending on the modality.

💡 Keep in mind that not all underlying tokens are necessarily billed. See the usage_metadata section below for examples of tokens actually billed per modality.

Vocabulary

The complete set of unique tokens that an LLM can ingest or generate makes up its vocabulary. Once an LLM is trained, its vocabulary is fixed and is used for inference.

A vocabulary is essentially a lookup table mapping text sequences to token IDs (which correspond to vector representations in a semantic space). This means tokenizers are simply algorithms that use this vocabulary to encode and decode tokens (i.e., to convert data to and from token IDs).

For example, the Gemini text tokenizers process common words like this:

Text	Tokens	Tokenization	Token IDs
`hello`	1	A single token for most common sequences	`23391`
`passion`	1	`passion`	`208039`
`passionate`	2	`pass • ionate`	`4373 • 84242`
`passionné`	2	`passion • né` (passionate in French)	`208039 • 8504`
`passionately`	2	`passion • ately`	`208039 • 2295`
`passionalmente`	2	`pass • ionalmente` (passionately in Italian)	`4373 • 134916`

💡 As you can see, words with the same root aren't necessarily split the same way. Text tokenizers have no concept of syllables, prefixes, or suffixes. They don't think like linguists or grammarians; they think like statisticians and look for statistically optimal combinations.

🌐 Baseline: API Token Counting

The Gemini API lets you count tokens for any multimodal input by sending a count_tokens request. While you need to be authenticated to use it, this method is free of charge, so you can audit your prompts before committing to a paid request. Likewise, the compute_tokens method lets you retrieve the list of corresponding tokens and token IDs.

Let's reproduce the previous table:

from collections.abc import Iterator

import IPython.display

from google.genai.types import (
    ComputeTokensResponse,
    ComputeTokensResult,
    CountTokensResponse,
    CountTokensResult,
)

RowData = tuple[str, str, str, str]


def display_token_info_from_api(model: str, texts: list[str]) -> None:
    def yield_data() -> Iterator[RowData]:
        for text in texts:
            count_result = client.models.count_tokens(model=model, contents=text)
            compute_result = client.models.compute_tokens(model=model, contents=text)
            yield get_text_token_info(text, count_result, compute_result)

    display_token_info(yield_data())


def display_token_info(yield_data: Iterator[RowData]) -> None:
    def yield_row() -> Iterator[RowData]:
        yield "Text", "Tokens", "Tokenization", "Token IDs"
        yield "-", ":-:", "-", "-"
        yield from yield_data

    markdown = "\n".join("| " + " | ".join(row) + " |" for row in yield_row())
    IPython.display.display(IPython.display.Markdown(markdown))


def get_text_token_info(
    text: str,
    count_tokens_res: CountTokensResponse | CountTokensResult,
    compute_tokens_res: ComputeTokensResponse | ComputeTokensResult,
) -> RowData:
    def inline_code(s: str) -> str:
        return f"`{s}`"

    total_tokens = count_tokens_res.total_tokens
    tokens_info = compute_tokens_res.tokens_info
    assert tokens_info is not None and len(tokens_info) == 1
    info = tokens_info[0]
    assert info.tokens is not None and info.token_ids is not None
    tokenization = " • ".join(t.decode("utf-8", errors="replace") for t in info.tokens)
    token_ids = " • ".join(str(token_id) for token_id in info.token_ids)

    return (
        inline_code(text),
        str(total_tokens),
        inline_code(tokenization),
        inline_code(token_ids),
    )


TEXTS = [
    "hello",
    "passion",
    "passionate",
    "passionné",
    "passionately",
    "passionalmente",
]
display_token_info_from_api(MODEL_ID, TEXTS)

Text	Tokens	Tokenization	Token IDs
`hello`	1	`hello`	`23391`
`passion`	1	`passion`	`208039`
`passionate`	2	`pass • ionate`	`4373 • 84242`
`passionné`	2	`passion • né`	`208039 • 8504`
`passionately`	2	`passion • ately`	`208039 • 2295`
`passionalmente`	2	`pass • ionalmente`	`4373 • 134916`

🚀 Why Count Tokens Locally?

Here are a few use cases where counting (or just estimating) tokens locally is useful:

Offline & Speed: You can count tokens completely offline. Plus, even when you're online, doing it locally means you don't have to wait for a network round-trip to the Gemini API just to check your prompt size.
Quotas: While the count_tokens method is free, counting locally saves bandwidth and prevents you from hitting API rate limits, especially during high-volume token counting.
Latency: You can estimate how much time is needed to process your text input before you start receiving a response (for a given model, the time-to-first-token latency is roughly proportional to the number of input tokens).
Cost Control: You can estimate and budget your API costs before committing to a paid request.
Routing: Knowing which token-count bucket your input falls into lets you route requests to different models based on speed, cost, or context size.
Privacy: You can audit the token count of sensitive data without sending it over your network.

🔤 Using the Local Text Tokenizer

Create a local tokenizer for the specific Gemini model you're using:

from google.genai.local_tokenizer import LocalTokenizer

tokenizer = LocalTokenizer(model_name=MODEL_ID)

💡 Remarks

Creating a tokenizer takes a few seconds, during which the configuration and vocabulary are loaded into memory.

On the first call, the tokenizer data is downloaded and stored in a local cache. This step requires an internet connection and about 30MB of storage.

If you want to build a fully offline solution, you can check out the SDK source code and persist the tokenizer assets (e.g., by configuring a persistent cache directory or building a container image).

Checking the internal tokenizer name confirms that the Gemma open-weight models share the same text tokenizer as the Gemini 3 family:

print(f'Text tokenizer name for "{MODEL_ID}": "{tokenizer._tokenizer_name}"')

Text tokenizer name for "gemini-3.1-flash-lite": "gemma4"

Call the count_tokens() method on a small text input:

contents = "Hello World!"
result = tokenizer.count_tokens(contents)

print(f"{result.total_tokens=}")

result.total_tokens=3

Now, let's reproduce the previous API tokenization tests with our local tokenizer:

def display_token_info_from_local_tokenizer(
    tokenizer: LocalTokenizer, texts: list[str]
) -> None:
    def yield_data() -> Iterator[RowData]:
        for text in texts:
            count_result = tokenizer.count_tokens(contents=text)
            compute_result = tokenizer.compute_tokens(contents=text)
            yield get_text_token_info(text, count_result, compute_result)

    display_token_info(yield_data())


display_token_info_from_local_tokenizer(tokenizer, TEXTS)

Text	Tokens	Tokenization	Token IDs
`hello`	1	`hello`	`23391`
`passion`	1	`passion`	`208039`
`passionate`	2	`pass • ionate`	`4373 • 84242`
`passionné`	2	`passion • né`	`208039 • 8504`
`passionately`	2	`passion • ately`	`208039 • 2295`
`passionalmente`	2	`pass • ionalmente`	`4373 • 134916`

💡 As expected, we get exactly the same results, but with 100% local execution this time.

Finally, let's download a longer text, like Hamlet:

import requests


def get_text_from_url(content_url: str, force_encoding: str = "") -> str:
    response = requests.get(content_url, timeout=10)
    response.raise_for_status()
    if force_encoding:  # Use for HTTP headers with unknown/incorrect charset
        response.encoding = force_encoding
    return response.text


TEXT_URL = "https://storage.googleapis.com/dataflow-samples/shakespeare/hamlet.txt"
contents = get_text_from_url(TEXT_URL)

print(contents[:256] + "[…]")

    HAMLET


    DRAMATIS PERSONAE


CLAUDIUS    king of Denmark. (KING CLAUDIUS:)

HAMLET  son to the late, and nephew to the present king.

POLONIUS    lord chamberlain. (LORD POLONIUS:)

HORATIO friend to Hamlet.

LAERTES son to Polonius.

LUCIANUS    nephew to the kin[…]

How many tokens do we need to encode Hamlet?

result = tokenizer.count_tokens(contents)

print(f"{result.total_tokens=:,}")

result.total_tokens=54,660

💡 Hamlet gets broken down locally into 50k+ tokens in a fraction of a second. If you tokenize War and Peace, you'll get 850k+ tokens.

🕵️‍♂️ Accounting for "Hidden" Tokens

When you send a request to Gemini, the total input token count isn't always just the sum of your input data.

To keep things simple, we tested text token counts with default parameters. The count_tokens and compute_tokens methods both have a config parameter. Depending on your request configuration, your inputs and outputs may include additional tokens.

Keep an eye out for these hidden additions:

System Instructions: Any system prompt you set will add to the total token count.
Thinking: If thinking is enabled, an internal chain of thought can generate additional thinking tokens.
Tools and Functions: If you provide a list of tools (like Python execution or custom functions), their declarations, calls, and responses are part of your prompt payload.
Response Schema: Enforcing structured outputs (like JSON) requires the model to process the schema definition you provide, which consumes input tokens.
Chat History: In multi-turn conversations, the entire chat history is sent back to the model with every new message, meaning your input token count grows with each turn.

🧮 Multimodal Token Math

Multimodal inputs (images, audio, video, and documents) aren't tokenized like text. They usually have specific calculation rules based on the model (and its underlying tokenizers), the media type, and the request configuration.

For multimodal inputs, refer to the documentation for details on how token counts are calculated for different media types:

There are generally multiple tokenization options, even for a single modality. You can use the count_tokens method and the calculation rules to estimate the token count of your own payloads. To get a clearer picture, let's look at actual requests and see how token counts are broken down by modality…

🎯 Tracking Actual Token Usage

While estimating token counts is super useful, you should always rely on the usage_metadata returned in the API response when you need to track your actual usage down to the exact token. It's the single source of truth for billing.

Here's the gist of how usage_metadata lets you get the token counts by modality:

class GenerateContentResponse:
    # …
    usage_metadata: Optional[GenerateContentResponseUsageMetadata]
    # …


class GenerateContentResponseUsageMetadata:
    # …
    prompt_token_count: Optional[int]
    prompt_tokens_details: Optional[list[ModalityTokenCount]]
    # …


class ModalityTokenCount:
    modality: Optional[MediaModality]
    token_count: Optional[int]


class MediaModality(StrEnum):
    MODALITY_UNSPECIFIED = "MODALITY_UNSPECIFIED"
    TEXT = "TEXT"
    IMAGE = "IMAGE"
    VIDEO = "VIDEO"
    AUDIO = "AUDIO"
    DOCUMENT = "DOCUMENT"

🐍 Let's define a few helpers:

from google.genai.types import (
    FileData,
    GenerateContentResponse,
    MediaModality,
    Part,
    PartMediaResolution,
    PartMediaResolutionLevel,
    VideoMetadata,
)

TokensPerModality = dict[MediaModality, int]


def display_tokens_per_modality(response: GenerateContentResponse) -> None:
    usage_metadata = response.usage_metadata
    if not usage_metadata:
        print("⚠️ No usage metadata found in the response.")
        return
    prompt_tokens_details = usage_metadata.prompt_tokens_details or []
    tokens_per_modality = get_empty_tokens_per_modality()

    for tokens_details in prompt_tokens_details:
        modality = tokens_details.modality
        if modality and modality in tokens_per_modality:
            tokens_per_modality[modality] += tokens_details.token_count or 0

    prompt_token_count = usage_metadata.prompt_token_count or 0
    display_token_table(tokens_per_modality, prompt_token_count)


def get_empty_tokens_per_modality() -> TokensPerModality:
    return {
        modality: 0
        for modality in MediaModality
        if modality != MediaModality.MODALITY_UNSPECIFIED
    }


def display_token_table(
    tokens_per_modality: TokensPerModality,
    total_tokens: int,
) -> None:
    def yield_row() -> Iterator[list[str]]:
        yield [mod.value for mod in tokens_per_modality.keys()] + ["Total"]
        yield [":-:" for _ in range(len(tokens_per_modality) + 1)]
        yield [f"{t:,d}" for t in tokens_per_modality.values()] + [f"{total_tokens:,d}"]

    markdown = "\n".join("| " + " | ".join(row) + " |" for row in yield_row())
    IPython.display.display(IPython.display.Markdown(markdown))

Let's check a few examples…

🖼️ Image Tokenization

Image token counts depend on the image itself and the configured media resolution:

class PartMediaResolutionLevel(StrEnum):
    MEDIA_RESOLUTION_UNSPECIFIED = "MEDIA_RESOLUTION_UNSPECIFIED"
    MEDIA_RESOLUTION_LOW = "MEDIA_RESOLUTION_LOW"
    MEDIA_RESOLUTION_MEDIUM = "MEDIA_RESOLUTION_MEDIUM"
    MEDIA_RESOLUTION_HIGH = "MEDIA_RESOLUTION_HIGH"
    MEDIA_RESOLUTION_ULTRA_HIGH = "MEDIA_RESOLUTION_ULTRA_HIGH"

For a given media resolution level, the Gemini 3 tokenizers will use these maximum token budgets per image:

`media_resolution`	Tokens
`MEDIA_RESOLUTION_LOW`	280
`MEDIA_RESOLUTION_MEDIUM`	560
`MEDIA_RESOLUTION_HIGH` (default)	1,120
`MEDIA_RESOLUTION_ULTRA_HIGH`	2,240

🐍 Check how this cat image is tokenized by default:

def display_tokens_for_image(
    image_uri: str,
    media_resolution_level: PartMediaResolutionLevel | None = None,
) -> None:
    print(f"🧪 {media_resolution_level=}")
    contents = Part.from_uri(
        file_uri=image_uri,
        mime_type="image/*",
        media_resolution=(
            PartMediaResolution(level=media_resolution_level)
            if media_resolution_level
            else None
        ),
    )
    response = client.models.generate_content(model=MODEL_ID, contents=contents)
    display_tokens_per_modality(response)


IMAGE_URI = "https://storage.googleapis.com/cloud-samples-data/generative-ai/image/chair-cat.png"
display_tokens_for_image(IMAGE_URI)

🧪 media_resolution_level=None

TEXT	IMAGE	VIDEO	AUDIO	DOCUMENT	Total
0	1,080	0	0	0	1,080

💡 This image is tokenized into only 1,080 tokens (instead of the maximum 1,120), saving us 40 tokens! It's a nice touch that helps keep costs down rather than defaulting to the upper limit.

🐍 For less detailed images, you can reduce token counts by a factor of 2 or 4 using the medium or low levels:

display_tokens_for_image(IMAGE_URI, PartMediaResolutionLevel.MEDIA_RESOLUTION_LOW)

🧪 media_resolution_level=<PartMediaResolutionLevel.MEDIA_RESOLUTION_LOW: 'MEDIA_RESOLUTION_LOW'>

TEXT	IMAGE	VIDEO	AUDIO	DOCUMENT	Total
0	264	0	0	0	264

💡 At the other end of the media resolution range, the ultra-high level is great for detailed images (like a photo of a circuit board with many components), ensuring maximum visual understanding. An image at this level uses between 2,000 and 2,240 tokens.

🔊 Audio Tokenization

Audio tokenization currently uses 25 tokens per second to represent the audio stream semantically.

🐍 Here is the tokenization for a 3.049-second audio file:

def display_tokens_for_audio(audio_uri: str) -> None:
    contents = Part.from_uri(file_uri=audio_uri, mime_type="audio/*")
    response = client.models.generate_content(model=MODEL_ID, contents=contents)
    display_tokens_per_modality(response)


AUDIO_URI = "https://storage.googleapis.com/cloud-samples-data/generative-ai/audio/hello_gemini_are_you_there.wav"
display_tokens_for_audio(AUDIO_URI)

TEXT	IMAGE	VIDEO	AUDIO	DOCUMENT	Total
0	0	0	77	0	77

💡 ceil(3.049 s × 25 tok/s) = ceil(76.225 tok) = 77 tok

🐍 A longer, 30.772-second audio file requires 10 times as many tokens, as expected:

AUDIO_URI = "https://storage.googleapis.com/cloud-samples-data/generative-ai/audio/sailor_audio.mp3"
display_tokens_for_audio(AUDIO_URI)

TEXT	IMAGE	VIDEO	AUDIO	DOCUMENT	Total
0	0	0	770	0	770

💡 ceil(30.772 s × 25 tok/s) = ceil(769.3 tok) = 770 tok

🎬 Video Tokenization

For videos:

The audio tokenizer is the same as for standalone audio (25 tokens per second).
Video frames are sampled (1 FPS by default) and tokenized based on the media resolution.

For a given media resolution level, the Gemini 3 tokenizers will use these maximum token budgets per sampled frame:

`media_resolution`	Max. tokens
`MEDIA_RESOLUTION_LOW`/`MEDIA_RESOLUTION_MEDIUM` (default)	70
`MEDIA_RESOLUTION_HIGH`	280

🐍 Here's the tokenization for a 59-second video:

def display_tokens_for_video(
    video_uri: str,
    fps: float | None = None,
    media_resolution_level: PartMediaResolutionLevel | None = None,
) -> None:
    print(f"🧪 {fps=}, {media_resolution_level=}")
    contents = Part(
        file_data=FileData(file_uri=video_uri, mime_type="video/*"),
        video_metadata=VideoMetadata(fps=fps) if fps is not None else None,
        media_resolution=(
            PartMediaResolution(level=media_resolution_level)
            if media_resolution_level
            else None
        ),
    )
    response = client.models.generate_content(model=MODEL_ID, contents=contents)
    display_tokens_per_modality(response)


VIDEO_URI = "https://www.youtube.com/watch?v=0pJn3g8dfwk"
display_tokens_for_video(VIDEO_URI)

🧪 fps=None, media_resolution_level=None

TEXT	IMAGE	VIDEO	AUDIO	DOCUMENT	Total
0	0	3,894	1,475	0	5,369

💡 Details

Video: ceil(59 s × 1 frame/s × 66 tok/frame) = ceil(3894 tok) = 3894 tok

Audio: ceil(59 s × 25 tok/s) = ceil(1475 tok) = 1475 tok

🐍 Doubling the sampling rate requires twice as many video tokens:

display_tokens_for_video(VIDEO_URI, fps=2)

🧪 fps=2, media_resolution_level=None

TEXT	IMAGE	VIDEO	AUDIO	DOCUMENT	Total
0	0	7,788	1,475	0	9,263

💡 Details

Video: ceil(59 s × 2 frame/s × 66 tok/frame) = ceil(7788 tok) = 7788 tok

Audio: ceil(59 s × 25 tok/s) = ceil(1475 tok) = 1475 tok

🐍 If you switch from low/medium to high media resolution, sampled frames are tokenized in greater detail, requiring four times as many video tokens:

VIDEO_URI = "https://www.youtube.com/watch?v=0pJn3g8dfwk"
display_tokens_for_video(
    VIDEO_URI,
    media_resolution_level=PartMediaResolutionLevel.MEDIA_RESOLUTION_HIGH,
)

🧪 fps=None, media_resolution_level=<PartMediaResolutionLevel.MEDIA_RESOLUTION_HIGH: 'MEDIA_RESOLUTION_HIGH'>

TEXT	IMAGE	VIDEO	AUDIO	DOCUMENT	Total
0	0	15,576	1,475	0	17,051

💡 Details

Video: ceil(59 s × 1 frame/s × 264 tok/frame) = ceil(15576 tok) = 15576 tok

Audio: ceil(59 s × 25 tok/s) = ceil(1475 tok) = 1475 tok

📄 Document Tokenization

For a given media resolution level, the Gemini 3 tokenizers will use these maximum token budgets per PDF page:

`media_resolution`	Tokens
`MEDIA_RESOLUTION_LOW`	280
`MEDIA_RESOLUTION_MEDIUM` (default)	560
`MEDIA_RESOLUTION_HIGH`	1,120

🐍 Here's the tokenization for a one-page PDF at different media resolutions:

def display_tokens_for_document(
    document_uri: str,
    media_resolution_level: PartMediaResolutionLevel | None = None,
) -> None:
    print(f"🧪 {media_resolution_level=}")
    contents = Part.from_uri(
        file_uri=document_uri,
        mime_type="application/pdf",
        media_resolution=(
            PartMediaResolution(level=media_resolution_level)
            if media_resolution_level
            else None
        ),
    )
    response = client.models.generate_content(model=MODEL_ID, contents=contents)
    display_tokens_per_modality(response)


DOCUMENT_URI = (
    "https://storage.googleapis.com/cloud-samples-data/generative-ai/pdf/invoice.pdf"
)
media_resolution_levels = [
    PartMediaResolutionLevel.MEDIA_RESOLUTION_LOW,
    PartMediaResolutionLevel.MEDIA_RESOLUTION_MEDIUM,
    PartMediaResolutionLevel.MEDIA_RESOLUTION_HIGH,
]
for media_resolution_level in media_resolution_levels:
    display_tokens_for_document(DOCUMENT_URI, media_resolution_level)

🧪 media_resolution_level=<PartMediaResolutionLevel.MEDIA_RESOLUTION_LOW: 'MEDIA_RESOLUTION_LOW'>

TEXT	IMAGE	VIDEO	AUDIO	DOCUMENT	Total
0	266	0	0	0	266

🧪 media_resolution_level=<PartMediaResolutionLevel.MEDIA_RESOLUTION_MEDIUM: 'MEDIA_RESOLUTION_MEDIUM'>

TEXT	IMAGE	VIDEO	AUDIO	DOCUMENT	Total
0	532	0	0	0	532

🧪 media_resolution_level=<PartMediaResolutionLevel.MEDIA_RESOLUTION_HIGH: 'MEDIA_RESOLUTION_HIGH'>

TEXT	IMAGE	VIDEO	AUDIO	DOCUMENT	Total
0	1,092	0	0	0	1,092

💡 Remarks

Low: 266 tok/pg

Medium: 532 tok/pg

High: 1092 tok/pg

🐍 Here's another test for a 15-page PDF:

DOCUMENT_URI = "https://storage.googleapis.com/cloud-samples-data/generative-ai/pdf/1706.03762v7.pdf"
for media_resolution_level in media_resolution_levels:
    display_tokens_for_document(DOCUMENT_URI, media_resolution_level)

🧪 media_resolution_level=<PartMediaResolutionLevel.MEDIA_RESOLUTION_LOW: 'MEDIA_RESOLUTION_LOW'>

TEXT	IMAGE	VIDEO	AUDIO	DOCUMENT	Total
0	3,990	0	0	0	3,990

🧪 media_resolution_level=<PartMediaResolutionLevel.MEDIA_RESOLUTION_MEDIUM: 'MEDIA_RESOLUTION_MEDIUM'>

TEXT	IMAGE	VIDEO	AUDIO	DOCUMENT	Total
0	7,800	0	0	0	7,800

🧪 media_resolution_level=<PartMediaResolutionLevel.MEDIA_RESOLUTION_HIGH: 'MEDIA_RESOLUTION_HIGH'>

TEXT	IMAGE	VIDEO	AUDIO	DOCUMENT	Total
0	16,530	0	0	0	16,530

💡 Remarks

Low: 3990 tok / 15 pg = 266 tok/pg

Medium: 7800 tok / 15 pg = 520 tok/pg

High: 16530 tok / 15 pg = 1102 tok/pg

🎉 Conclusion

You've now mastered token counting both locally and via the Gemini API!

With the LocalTokenizer, you can estimate text token counts completely offline, saving bandwidth and avoiding rate limits. You've also seen how Gemini's multimodal tokenizers handle images, audio, video, and PDFs, and how to extract precise token usage from usage_metadata for accurate tracking and billing.

➕ More!

Try it yourself: Use the companion notebook (or run the notebook on Colab) to reproduce all results in this article.
Get inspired: Explore typical use cases in the Agent Platform Prompt Gallery.
Stay updated: Follow the Agent Platform Release Notes.
Follow me: Connect with me (@PicardParis) on LinkedIn or Twitter / X for more cloud, applied AI, and Python explorations…

Real-time IP capacity in Google Cloud subnets

Zach S. — Wed, 17 Jun 2026 15:46:58 +0000

When managing Shared VPCs, most teams allocate dedicated IP subnets for each service project to keep firewall rules simple, but this isolation often leads to poor IP utilization — it is not uncommon to see subnet IP utilization hovering in the low teens. On the other hand, using large shared subnets requires coordinating workload deployments to ensure there is enough internal IP address space for everyone. To optimize these shared networks, you need real-time visibility. The WITH_UTILIZATION query parameter on the Method: subnetworks.list | Compute Engine API solves this by returning the exact count of allocated and free IP addresses for each subnet IP range.

This capability is designed for query-time decisions. For example, if you need to deploy a GCE workload requiring 100 instances, you can search for a subnet with enough capacity. This query-time data comes directly from Google Cloud's internal IP allocator and includes both primary and secondary CIDR ranges.

Automating the search with gcloud and jq

To automate capacity checks before you deploy, you can script this check. The script below uses gcloud compute networks subnets list | Google Cloud SDK to grab the utilization data as JSON, and then uses jq to parse, filter, and sort the subnets based on your required capacity:

#!/bin/bash

# --- Configuration (Replace with your details) ---
PROJECT="<YOUR_PROJECT_ID>"
NETWORK_NAME="<YOUR_VPC_NETWORK_NAME>"
REGION="<YOUR_REGION>"
REQUIRED_IP_CAPACITY=100

echo "Searching $NETWORK_NAME in $REGION for subnets with >= $REQUIRED_IP_CAPACITY free IPs..."
echo "------------------------------------------------------------------------"

# Fetch subnets with utilization data, output as JSON, and pipe to jq
gcloud compute networks subnets list \
    --project="$PROJECT" \
    --network="$NETWORK_NAME" \
    --regions="$REGION" \
    --view=WITH_UTILIZATION \
    --format=json | \
jq -r --argjson min_ips "$REQUIRED_IP_CAPACITY" '
  [ 
    .[] | {
      name: .name,
      cidr: .ipCidrRange,
      # Safely extract totalFreeIp: if it is null, substitute "0" before converting to a number
      free_ips: (.utilizationDetails.ipv4Utilizations[0].totalFreeIp // "0" | tonumber)
    } 
    # Keep only the subnets that meet the minimum requirement
    | select(.free_ips >= $min_ips)
  ] 

  # Sort ascending by free_ips, then reverse to get descending order
  | sort_by(.free_ips) 
  | reverse 

  # Format the final output into clean, readable strings
  | .[] 
  | "Subnet: \(.name) | CIDR: \(.cidr) | Free IPs: \(.free_ips)"
'

Let's list the configured subnets in our target region first:

~  gcloud compute networks subnets list \
    --project="my-gcp-project" \
    --format='table(name, region, network, ipCidrRange)'

NAME       REGION      NETWORK  RANGE
subnet-a0  us-south1  vpc-a    10.0.0.0/28
subnet-a1  us-south1  vpc-a    10.0.1.0/28
subnet-a2  us-south1  vpc-a    10.0.2.0/28
subnet-a3  us-south1  vpc-a    10.0.3.0/28
subnet-a4  us-south1  vpc-a    10.0.4.0/28
subnet-a5  us-south1  vpc-a    10.0.5.0/24
subnet-a6  us-south1  vpc-a    10.0.6.0/25
subnet-a7  us-south1  vpc-a
subnet-a8  us-south1  vpc-a
subnet-a9  us-south1  vpc-a

Running the script returns only the subnets that can safely host our 100-instance workload:

~  bash gce-subnet-utilization.sh

Searching vpc-a in us-south1 for subnets with >= 100 free IPv4 addresses...
------------------------------------------------------------------------
Subnet: subnet-a5 | CIDR: 10.0.5.0/24 | Free IPs: 252
Subnet: subnet-a6 | CIDR: 10.0.6.0/25 | Free IPs: 124

Under the hood: Reading the utilization payload

When you request a subnet list with the utilization view, the API returns a utilizationDetails object. For a standard subnet with only a primary IPv4 address configured, the JSON payload looks like this:

"utilizationDetails": {
  "ipv4Utilizations": [
    {
      "totalAllocatedIp": "4",
      "totalFreeIp": "252"
    }
  ]
}

Notice that the totalAllocatedIp is 4. In any primary IPv4 range, Google Cloud reserves four IP addresses for default routing and metadata, as detailed in the Subnets | Virtual Private Cloud - Google Cloud Documentation.

If you have secondary ranges configured (often used for GKE Pods), the API includes utilization metrics for each secondary range, identified by rangeName:

"utilizationDetails": {
  "ipv4Utilizations": [
    {
      "totalAllocatedIp": "4",
      "totalFreeIp": "124"
    },
    {
      "rangeName": "a6-secondary",
      "totalAllocatedIp": "0",
      "totalFreeIp": "4096"
    }
  ]
}

The API also breaks down IPv6 utilization if you are running dual-stack subnets. It tracks external instance IPs, load balancer endpoints, and internal IPv6 allocations separately:

"utilizationDetails": {
  "externalIpv6InstanceUtilization": {
    "totalAllocatedIp": { "high": "0", "low": "0" },
    "totalFreeIp": { "high": "0", "low": "9223372036854775808" }
  },
  "externalIpv6LbUtilization": {
    "totalAllocatedIp": { "high": "0", "low": "0" },
    "totalFreeIp": { "high": "0", "low": "9223372036854775808" }
  },
  "internalIpv6Utilization": {
    "totalAllocatedIp": { "high": "0", "low": "8589934592" },
    "totalFreeIp": { "high": "0", "low": "18446744065119617024" }
  }
}

A few quick constraints

API Support: The WITH_UTILIZATION parameter works with both Method: subnetworks.get | Compute Engine and Method: subnetworks.list | Compute Engine.
gcloud Support: You can pass the parameter in gcloud using --view=WITH_UTILIZATION, as documented in gcloud compute networks subnets list | Google Cloud SDK.
JSON Strings: The API returns allocated and free counts as strings in the JSON payload. Make sure to cast them (like using tonumber in jq) before running any mathematical comparisons.

Next steps

Next time you are building a deployment pipeline, try integrating the WITH_UTILIZATION view. It is a simple way to programmatically ensure you have enough network headroom before kicking off a deployment.

Seamless scaling with VPA In-place Pod Resize on GKE

Olivier Bourgeois — Thu, 04 Jun 2026 18:19:03 +0000

Right-sizing Kubernetes workloads is a common platform engineering challenge. Set your requests too high, and you burn cloud budgets on idle capacity; set your limits too low, and your applications face throttling or dreaded OOMKills.

For years, the Vertical Pod Autoscaler (VPA) has been the standard answer to this problem, automatically adjusting CPU and memory requirements based on actual usage. However, this method of scaling came with a significant catch that prevented widespread adoption for critical workloads: applying new resource parameters required evicting and restarting the pod.

This disruption was often unacceptable for stateful applications, long-running connections, or latency-sensitive services.

Introducing In-place Pod Resize (IPPR) on GKE

In-place Pod Resize (IPPR) changes the game by allowing Kubernetes to modify resource requests and limits on live, running containers directly through the underlying container runtime, without triggering a restart.

By combining the intelligence of VPA with the non-disruptive nature of IPPR, GKE users finally have a viable path to dynamic, seamless, and automated right-sizing.

Note: As of writing, VPA IPPR is in Preview on GKE. While it is a massive step forward, I recommend evaluating it in staging environments before rolling it out to production workloads.

Getting started with IPPR

To use In-place Pod Resize, you need a GKE cluster running version 1.34.0-gke.2201000 or later.

GKE Autopilot: VPA is enabled by default.
GKE Standard: Requires the Vertical Pod Autoscaling feature to be enabled.

1. Enable the feature

If you aren't using Autopilot, ensure your cluster is created or updated with the necessary feature flags:

gcloud container clusters create CLUSTER_NAME \
  --project=PROJECT_ID \
  --location=us-east1 \
  --release-channel=rapid \
  --enable-vertical-pod-autoscaling

2. Define your VPA object

Create a VerticalPodAutoscaler resource targeting your Deployment or StatefulSet. The crucial element here is setting spec.updatePolicy.updateMode to InPlaceOrRecreate.

apiVersion: "autoscaling.k8s.io/v1"
kind: "VerticalPodAutoscaler"
metadata:
  name: "my-vpa"
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: "Deployment"
    name: "my-deployment"
  updatePolicy:
    updateMode: "InPlaceOrRecreate"

3. Watch it scale

Apply the resource to your cluster and monitor your application under load. Instead of watching Pods terminate and recreate, you can watch the resources modify live using kubectl describe.

kubectl describe pod POD_NAME

Look for the AllocatedResources field or check the events section. You will see the requests change in real-time to match the VPA recommendations, while the Restart Count remains exactly the same.

The "Or Recreate" Fallback: Keep in mind that physics still apply. If VPA recommends a resource size that exceeds the remaining capacity of the Node your Pod is currently running on, an in-place resize is impossible. In this scenario, VPA will fall back to evicting and recreating the Pod so it can be scheduled onto a larger or emptier Node.

Ready to dive deeper?

While this introduction covers the basics of IPPR, right-sizing is just one part of a robust scaling strategy. Implementing VPA often goes hand-in-hand with horizontal scaling and cluster autoscaling. Check out the guide to master scaling on GKE: Run full-stack workloads at scale on GKE.

Surviving the eviction: How to build interrupt-resilient AI workloads on GKE

Olivier Bourgeois — Tue, 02 Jun 2026 20:02:20 +0000

You did everything right. You containerized your massive model training job, deployed it to Google Kubernetes Engine (GKE), and cleverly routed it to a Spot VM node pool to save up to 90% on compute costs.

Everything is humming along perfectly for 38 hours. Then, a priority on-demand customer needs capacity, Google Cloud reclaims your underlying Spot VM, and your node vanishes.

Whether you are using preemptible Spot VMs to save money, or leveraging the Dynamic Workload Scheduler (DWS) to queue for scarce GPUs, you are building on top of ephemeral compute. The hardware will eventually be taken away. To successfully run critical AI workloads on un-committed capacity, your application architecture must assume failure is a given.

Here is a practical guide to building interruptible workloads on GKE.

1. Trap the warning

When Google Cloud reclaims a Spot VM, it doesn't just pull the power cord immediately. It sends an ACPI signal to the underlying node to begin a power off cycle. Kubernetes intercepts this and translates it into a SIGTERM signal sent directly to your running containers.

You have a grace period (up to 15 seconds for non-system pods) between that SIGTERM and the fatal SIGKILL.

Your application must explicitly listen for this signal. When caught, your code should immediately stop accepting new batches, finish its current loop, flush any in-memory data to disk, and exit with a 0 (success) status.

Here is a simple example on how to catch this signal in Python:

import signal
import sys
import time

def handle_sigterm(signum, frame):
    print("Received SIGTERM. Initiating graceful shutdown...")
    # 1. Stop processing new data
    # 2. Flush memory to persistent storage
    # 3. Save final checkpoint
    print("State saved. Exiting cleanly.")
    sys.exit(0)

# Register the signal handler
signal.signal(signal.SIGTERM, handle_sigterm)

# Your main training loop
print("Starting training loop...")
while True:
    # Train model...
    time.sleep(1)

2. Externalize your checkpoints

If your container dies, everything inside its local filesystem dies with it. To survive an interruption, you must periodically save your progress (model weights, optimizer states, epoch counters, etc.) to an external storage location.

Cloud Storage (GCS) is a common solution for this on Google Cloud.

Save frequently: Decide on a checkpointing interval that balances the cost of lost work against the overhead of writing to storage. Saving every epoch or every few thousand steps is common, but this can vary based on your needs.
Keep it local: Ensure your GCS buckets are in the same region as your GKE cluster (e.g., us-central1) to minimize latency and avoid outbound data transfer fees.
Resume, don't restart: The first thing your container's startup script should do is to check for that GCS bucket. If a checkpoint exists in the bucket, load it and resume from that exact step.

3. Design for Idempotency

"Idempotency" is a fancy way of saying that doing something twice yields the same result as doing it once.

Imagine a batch inference job that reads an image, processes it, and writes the result to a database. If your pod is preempted milliseconds after writing to the database but before it can mark the task as complete, the rescheduled pod will likely process that image again.

If your database blindly inserts new rows, you now have unintentional, duplicate data.

To build an idempotent pipeline:

Use UPSERT (update or insert) operations in your database based on a unique identifier (like an image ID).
Check if a record already exists before spending expensive GPU cycles processing it.

4. Decouple work queues for batch processing

If you are running a massive batch processing or inference job across thousands of files, do not write a monolithic Python script that iterates through a static CSV list. If the node dies at row 5,000, managing the state of where to restart is a nightmare.

Instead, decouple the workload:

Publish the work: Break your dataset down into discrete messages and push them into a message broker like Pub/Sub.
Pull the work: Have your Spot VM worker pods pull messages off the queue one by one or as a small chunk (e.g. 10 at a time).
Acknowledge completion: Only send an "ACK" (acknowledgment) back to Pub/Sub once the result is safely stored.

If a Spot node is preempted mid-inference, the worker dies before sending the ACK. After a brief timeout, Pub/Sub will automatically make that specific message available again. Another surviving worker pod will pick it up seamlessly. No data lost, no manual intervention required.

Key takeaways

Running on ephemeral compute like Spot VMs isn't just an infrastructure choice; it is a design choice. By handling termination signals, checkpointing aggressively to GCS, ensuring idempotent operations, and decoupling your queues, you can unlock massive cost savings and tap into scarce GPU pools without sacrificing reliability.

Strategies for running AI workloads on GKE without committed quota

Olivier Bourgeois — Mon, 01 Jun 2026 18:54:52 +0000

You’ve built your model, your training code is containerized, and you’re ready to scale up on Google Kubernetes Engine (GKE). You go to provision your nvidia-h100-80gb node pool and... QUOTA_EXCEEDED.

It’s one of the most common (and frustrating) roadblocks in modern AI development. High-end accelerators like H100s, A100s, and TPUs are in massive demand, and securing permanent, on-demand quota for them can be difficult. But a lack of on-demand quota doesn't mean you're out of options.

GKE provides two powerful, cost-effective strategies for acquiring these scarce resources when you can't get standard, on-demand instances: Spot VMs and the Dynamic Workload Scheduler (DWS).

Let's break down what they are, when to use each, and how to implement them.

Strategy 1: Spot VMs

Spot VMs are Google Cloud's excess compute capacity sold at a massive discount, up to 90% off the price of standard on-demand VMs. They are perfect for workloads that can be interrupted.

The catch is that Spot VMs have no availability guarantee. Google Cloud can "preempt" (i.e., terminate) them at any time if that capacity is needed for on-demand customers. GKE gets a 30-second warning before the node is terminated. Kubernetes uses this window to gracefully shut down your application (giving non-system pods up to 15 seconds to wrap up) before the node vanishes.

When to use Spot VMs for accelerators

Spot VMs are ideal for workloads that are:

Fault-tolerant and stateless: Your application can handle a node vanishing and having its pods rescheduled elsewhere.
Batch processing: Jobs that can be easily restarted or have checkpointing built-in.
CI/CD pipelines: Running tests or builds that don't need 100% uptime.

How to use Spot VMs in GKE

You can easily add a Spot VM node pool to your GKE Standard cluster. The key is to use Spot VMs for your workers, not your critical system pods.

Create a dedicated Spot VM node pool:
When creating a node pool, simply add the --spot flag and apply a taint so standard pods don't accidentally schedule there.

gcloud container node-pools create spot-gpu-pool \
  --cluster=my-cluster \
  --region=northamerica-northeast2 \
  --machine-type=g2-standard-4 \
  --accelerator=type=nvidia-l4,count=1 \
  --spot \
  --node-taints=cloud.google.com/gke-spot=true:NoSchedule

Add the toleration to your workload's YAML:

# You want to "tolerate" that taint only on the specific workloads you want to run there.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-batch-job
spec:
  template:
    # ... other specs
    spec:
      tolerations:
      - key: "cloud.google.com/gke-spot"
        operator: "Equal"
        value: "true"
        effect: "NoSchedule"

This architecture ensures your critical components stay on reliable on-demand nodes, while your interruptible training jobs run on the preemptible Spot nodes. (Note: If you are using GKE Autopilot, you simply request a Spot class in your pod spec and GKE handles the taints and nodes automatically!)

Strategy 2: Dynamic Workload Scheduler (DWS) with flex-start

What if your job can't be interrupted? Many large-scale training jobs can take days. While they might have checkpointing, restarting from scratch every few hours due to Spot preemptions is inefficient and costly.

This is where Dynamic Workload Scheduler (DWS) comes in.

DWS is a feature designed specifically for acquiring large amounts of scarce resources (like GPUs and TPUs) for batch workloads. It changes the request from "Give me this GPU right now" to "Give me this GPU when it becomes available."

The catch here is that your job doesn't start immediately. It enters a queue and might wait for minutes, hours, or even days for the resources to be provisioned.

There are a few massive upsides:

It's a "get-in-line" system: Instead of you writing a script to retry the gcloud command every 5 minutes, DWS queues your request and provisions the nodes automatically when capacity is found.
No preemptions: Once your DWS nodes are provisioned, they are yours for the entire duration of your job (up to seven days). They are not Spot VMs and will not be preempted.
Cost savings: DWS workloads are also offered at a significant discount (up to 53% for L4 GPUs) compared to on-demand instances.

When to use DWS

DWS is perfect for:

Model training or reinforcement learning (RL): Jobs that need to run uninterrupted for many hours or days.
Batch inference: Running a large inference job on a massive dataset.
Any workload that is not time-sensitive to start, but is sensitive to interruptions.

How to use DWS with flex-start

The flex-start mode in DWS is what enables this "wait-in-queue" behavior. If you are using a GKE Autopilot cluster (or a Standard cluster with Node Auto-provisioning enabled), implementing this is incredibly simple.

You do not need to create complex custom resources; you simply signal your intent via a nodeSelector in your standard Kubernetes Job object.

Request flex-start in your Job:
In your Job.yaml, add the cloud.google.com/gke-flex-start: "true" node selector alongside your accelerator request.

apiVersion: batch/v1
kind: Job
metadata:
  name: my-training-job
spec:
  template:
    spec:
      nodeSelector:
        cloud.google.com/gke-flex-start: "true"
        cloud.google.com/gke-accelerator: nvidia-tesla-a100
      containers:
      - name: my-trainer
        image: "gcr.io/my-project/my-training-image"
        resources:
          limits:
            nvidia.com/gpu: 1
      restartPolicy: Never

When you apply this Job, GKE sees the flex-start selector. It puts the Job's Pods into a Pending state until the DWS queueing system can provision the requested A100 node. Once the node is ready, the Pod is scheduled, your job runs to completion without interruption, and the node is automatically deprovisioned.

Which strategy should you choose?

Here's a simple cheat sheet:

Feature	Spot VMs	DWS with flex-start
Best for	Fault-tolerant, interruptible workloads	Long-running, uninterruptible batch jobs
Primary trade-off	Starts fast, can be preempted at any time	Can wait hours/days to start
Cost savings	Up to 90%	Up to 50%
GKE mode	Standard or Autopilot	Standard or Autopilot
Implementation	--spot flag in a Node Pool	cloud.google.com/gke-flex-start nodeSelector

By mastering both Spot VMs and the Dynamic Workload Scheduler, you can build a resilient and cost-effective AI platform on GKE, even when on-demand accelerator quota seems impossible to find.

Inference on GKE Private Clusters

Maciej Strzelczyk — Thu, 12 Mar 2026 12:52:00 +0000

Setting up inference service without access to Internet

Deploying an inference service on your GKE cluster in 2026 is a fairly simple task. With a short Deployment definition making use of a vLLM image (TPU or GPU) and a Service definition, you have the basic setup ready to go! vLLM grabs the model of your choosing from Hugging Face during its startup. It’s all nicely automated. However, this setup requires your GKE nodes to have access to the Internet. What should you do when there’s no Internet connection? I will discuss the options in this article, but first, let’s start with a short analysis of how and why you may want to have no Internet connection for your nodes.

GKE Private Nodes

One situation where your vLLM pod might not be able to download a model from the Internet is when you decide to use GKE Private Cluster. When you choose this option, the nodes in your cluster are assigned only a private IP from your VPC network. With only a private IP address, it’s impossible to reach them from outside of your network, but they also lose the default way to communicate with the outside world. This feature is great for increasing the security of your system, but it has obvious drawbacks, like this lack of connectivity to the world.

One easy solution to the private nodes situation is to configure Cloud NAT for the region your cluster is in. That will create a way for the nodes and pods running on them to access the Internet, while keeping them protected from any attempt to establish new connections from outside of the network. However, if you want your pods to be unable to connect to the Internet, we need another way to get the model for vLLM to run.

Providing images to the pods

One other problem you might encounter when choosing to use Private Cluster without access to the Internet is the fact that your nodes won’t have access to the default source of Docker images: Docker Hub. The simple vllm/vllm-openai:latest image specification will not work. You will need to copy the images you want to use to the Artifact Registry—this way GKE Nodes will be able to download the images and run them. This gives you additional control over your environment; you can carefully control which versions of the images to download and allow cluster users to use.

Providing the LLM

vLLM can run a model stored in a local directory if you pass it as the --model argument value. To make use of this ability in your private GKE cluster, you will have to somehow provide the model to the vLLM through a mounted directory. The easiest way to do this is through GCS FUSE, which allows you to simply mount a GCS bucket as a folder in your Pod. You just need to remember that:

The GKE Cluster must have the GcsFuseCsiDriver add-on enabled.
You should use Workload Identity and a dedicated service account to allow the pod to access the bucket. The roles/storage.objectViewer role should work just fine for read-only access.
It’s important to host the model in the same region as the nodes of your cluster to ensure the fastest transfers.

Serving LLMs from a mounted directory speeds up the startup process of your inference service, as it doesn’t have to download the model each time a new pod is started.

Alternative to mounting GCS Bucket - persistent disks

An alternative to mounting a bucket is to use a zonal or regional persistent disk or hyperdisk. A single disk can be mounted by multiple pods at once if using read-only mode. Creating a disk to store a model is a bit more time consuming than using a GCS bucket, but might provide better performance (depending on the disk type) and be cheaper, as GCS and disk billing is structured differently.

To create a disk storing a model, you will need a temporary Compute Instance, where you will mount, format and fill the disk with data (hf download works just fine for this). Once the disk is ready, the VM can be deleted and the disk attached to the vLLM pods.

Summary

Using GKE without Internet access can be a good practice, providing you with additional security and control. As you can see, the additional work required to get your inference service running in this case is not negligible, but it is also not a deal-breaker. It’s up to you to decide if it’s a configuration you would like to use in your setup. Using a GCS Bucket or persistent disk to store a model is also a very good idea to simply cut down on the startup time of your services, especially with larger models.

The ecosystem of AI is changing at a rapid pace and it’s important to stay up to date with all the latest news. Follow the official Google Cloud blog, Google Developers blog and Google Cloud Tech YouTube channel to not miss any updates!

AI deployment: to host or not to host?

Maciej Strzelczyk — Tue, 10 Mar 2026 23:28:46 +0000

So you’ve built your AI application prototype. You used your own local GPU to run the AI model, or just used the free AI Studio tier to power your clever program. The app is ready, the world is ready, time to deploy your production instance! In the case of traditional, non-AI powered apps and services, the choice of deployment platform is based on personal preference, what you are familiar with, how much control over fine details you want to have etc. Cost is usually not the most important factor, as for a new service, that’s just going to start gaining a userbase, the first usage bills won’t be that high anyway. The situation is different when it comes to running services that make use of AI. Here, you need to make two separate decisions. First is how to deploy your application, this is the same as for a vanilla non-AI app. Second is how you are going to provision the AI capabilities. This second decision will most likely be responsible for a big chunk of your bill and it shouldn’t be made without proper consideration. In this article, I will try to help you make the right decision for your use case.

Serverless vs hosted inference service

There are two ways of provisioning AI for a production-grade application:

Serverless - where you pay for the tokens your application sends and receives.This is sometimes called Model as a Service (MaaS). In Google Cloud, this approach is available in Vertex AI and Google AI Studio (Gemini API).
Hosted - where you pay for the time you use the infrastructure running an LLM. In Google Cloud, this model is available through multiple services like: Compute (through certain machine types), Vertex AI, GKE or Cloud Run.

Depending on your situation, you may not have an option to choose between the two, because only one would be possible. For example, if you have to use one of the Gemini models, there’s no way to host it yourself and the MaaS (pay per token) approach is the only one available. Similarly, if you have to use a custom model that is not available as a service, you just have to go down the hosted path.

In cases where you do have a choice between the two paths you need to understand how they will affect your budget.

Serverless (pay per token)

Paying only for the tokens your application uses is a fair and easy to understand setup. It works exactly like any other paid service on Google Cloud - you pay for what you use.

Pros:

It scales to zero, when you don’t use the AI,
you don’t have to worry about scaling,
Configuration and maintenance are extremely simple,

Cons:

Less predictable for your budget
You may reach service quota, either when your application experiences a rush-hour or when you reach some total monthly usage quota
In case your application is hacked, your bill might skyrocket
Once your application gets popular, the bill will grow with your active userbase

Hosted (pay per second)

Hosting an LLM on infrastructure that you pay for is extremely predictable cost-wise. As long as you know how long you are going to hold on to that GPU or TPU accelerated instance, you know exactly how much you are going to pay.

Pros:

Extremely predictable cost
Many ways to lower your bill: CUDs, Spot Instances, choosing a cheaper zone or choosing the right instance and/or accelerator type
No quota on how much tokens your application consumes
Full control over hardware and software inference configuration

Cons:

Big initial cost
Doesn’t scale as smoothly as serverless
Configuration and maintenance is more complicated

Couple of considerations

To help you out a bit further, here are some questions you should ask yourself, before deciding on one of the deployment options.

How much traffic do I expect?

With low traffic, the choice is almost obvious - serverless is cheaper and easier. However, as your usage grows, the number of tokens consumed will add up to a considerable amount. In such a case, using a self-hosted solution might save you from unexpected bills at the end of the month.

Am I legally bound to keep user data in certain region?

In some cases, like with medical or financial data, you might be required by local regulations or your own contracts to ensure your user data doesn’t leave a certain location, or will not be sent to service you don’t control. This might be a situation where no matter the cost effectiveness self-hosting an AI model is the only possible solution.

Am I likely to hit the hourly/monthly quota?

All API Services have some usage quotas, that includes AI services. If you expect your application may reach this quota, it’s a big hint that you should consider self-hosting your model.

Mixed-approach

It is also worth noting that you don’t have to limit your architecture to using only one AI Model with one deployment option. Imagine your application offers multiple AI-powered features - some of them might be simple enough for a small model to handle, while others require full power of Gemini. It is perfectly fine to have for example a Gemma 3 running on a VM, handling the easier tasks, while you delegate the harder/bigger tasks to Gemini API.

This is not an irrevocable decision

Even after careful consideration, the decision might still not be a simple one, especially if you’re starting with a new idea and simply don’t know how popular it’ll get. Luckily, with good architecture of your application, it is not that difficult to prepare for changing the AI API endpoint. It’s reasonable to start with a serverless solution, where you will often make great use of the fact that no traffic = zero cost. Once your application takes off and the Vertex AI or AI Studio bill reaches levels comparable to running a self-hosted model, you should reevaluate your situation and perhaps switch to the more predictable approach.

Keep up!

P.S. Did you know that Google Cloud now offers Developer Knowledge API and MCP server that can give your AI Agents access to always up-to-date knowledge straight from the official Google Cloud, Firebase and Android documentation?!

Making Sure Your Prompt Will Be There For You When You Need It

Shawn Jones — Tue, 10 Mar 2026 17:44:00 +0000

At Google, our team (Google Cloud Samples) uses Gemini to produce thousands of samples in batches. In doing so, we've learned that the biggest hurdle isn't the AI, it's our own expectations about these tools. As developers, we are wired for deterministic systems: we call a function and it produces the same result for the same input every time. This predictability allows for standard unit tests.

Large Language Models (LLMs) however, are probabilistic and stochastic. They don't store facts; they store the likelihood of patterns and use a "sophisticated roll of the dice" to choose the next token. This is why the same prompt can yield a “sparkly” ✨ success one minute and a hallucination 🤪 the next. You aren't just testing code anymore; you are forecasting the weather of your system. To move to production, we must build containment structures (like quality gates and evaluators) that make the unpredictability manageable.

LLMs Can Make Mistakes

Trying to make samples in large batches is different from asking for a single sample from a tool like Gemini CLI. When producing many samples at once, we see more mistakes because the statistics catch up with us. A small percentage of bad samples becomes a large number once the overall number of samples gets higher, not unlike issues in manufacturing. Here are some examples of mistakes.

Sometimes we detect code with syntax issues, like the def def snippet below. Python uses only one def keyword to specify the start of a function definition.

def def create_secret_with_expiration(
    project_id: str, location: str, secret_id: str
):

Syntax issues like this can be detected with linting or other build tools. If we detect them in our pipeline, we can just regenerate the sample. Other times the issues are more subtle, like how this JSDoc below is 7 lines away from the function it is documenting, separated from its function by a use statement, imports, and an object instantiation.

/**
 * Get secret metadata.
 *
 * @param projectId Google Cloud Project ID (such as 'example-project-id')
 * @param secretId ID of the secret to retrieve (such as 'my-secret-id')
 */
'use strict';

const {SecretManagerServiceClient} = require('@google-cloud/secret-manager');
const {status} = require('@grpc/grpc-js');

const client = new SecretManagerServiceClient();

async function getSecretMetadata(projectId, secretId) {

Or other times the docstring is incorrect, like how the the docstring below is missing parameters used by the function it documents.

def create_secret_with_notifications(
    project_id: str, location: str, secret_id: str, topics: list[str]
) -> None:
    """Create Secret with Pub/Sub Notifications. Creates a new secret resource
    configured to send notifications to Pub/Sub topics. This enables external
    systems to react to secret lifecycle events.

    Args:
        project_id: The Google Cloud project ID. for example,
            'example-project-id'
        location: The location of the resource. for example, 'us-central1'
    """

Issues don’t always show up directly in code, either. We have Gemini generating build artifacts, like package.json. In the case below, it was so eager to include the gRPC package that it listed the package 3 times in different ways, including one that has been deprecated.

{
  "name": "example",
  "private": true,
  "description": "Google Cloud Platform Code Samples 🎒",
  "dependencies": {
    "@google-cloud/secret-manager": "latest",
    "@grpc/grpc-js": "latest",
    "@grpc/grpc-js": "^1.10.0",
    "grpc": "latest"
  },
  "scripts": {
    "test": "node --test"
  }
}

We have other, more subtle issues as well. Sometimes the code is correct, but not saved with the correct filename or in the correct folder structure. Issues like these lead to more manual evaluation and testing. By iterating on prompts with evaluation we have improved our results.

Prompt Templates as Functional Interfaces

Quality responses are guided by the three elements shown above: the input data, a prompt template, and the LLM itself. As part of prompting for production, we’re evaluating prompt templates, like those created with the dotprompt format. Below is a very simple example of a prompt template in dotprompt. Using the prompt template we can reuse the same prompt text over and over with different inputs. Prompt templates give us a function interface for interacting with the LLM.

---
model: gemini-3-flash-preview
input:
  schema:
    need: string
    language: string
output:
  schema:
    code: string
---

Generate code that satisfies the need of {{ need }} using language {{ language }}.

By using templates, we can run the same logic across hundreds of different inputs to see where the "weather" changes.

We've found that a successful workflow follows these phases:

Build a foundation with Ground Truth
Finding Your Candidate Prompt (Vibe Check)
Statistical Trials – Because Unit Tests Alone Don’t Work

Phase 1: Build a foundation with Ground Truth

In the prompt template world, the template is only part of the picture. We need the input values as well. We also need the matching expected output values. You may say “But this sounds like unit testing!” and you would be right; it is a similar idea. The amount of testing data you need depends on what question you want to answer. If your question boils down to “Is the prompt template bad?” then 5-10 records of input/output test data is good enough. This will help you eliminate a bad prompt template quickly. If your question is more “Will my prompt template work well?” then you need 50 - 100. The more edge cases you can insert into your test data, the better.

Fortunately, we have a golden set of samples we can use as known good testing data. We continue to iterate on our test data while also adding more samples to it.

Phase 2: Finding Your Candidate Prompt (Vibe Check)

Before you share with your team, start experimenting by using a tool like Google AI studio to develop some handmade prompts. Try them with different inputs and outputs. Build an intuition for what works and what doesn’t. Use Gemini to help in your evaluation.

AI Studio’s playground can be very helpful at this stage, including providing structured outputs that can then be used to help plan the outputs used in our dotprompt file. When you feel good about your results, you have anecdotal evidence that your prompt template might work, but not statistical evidence.

Phase 3: Statistical Trials – Because Unit Tests Alone Don’t Work

Does your candidate prompt template work with many different inputs? This is where things get more complex and we move from the familiar deterministic unit testing to probabilistic testing. Because the LLM could answer differently each time, we need to run multiple trials for each input/output test record. But how many is enough? For recent academic work, my previous team ran as many as 128 times per input/output pair for better statistical relevance, but this gets expensive fast. To balance cost, time, and effort, the community consensus is either four and five times per input/output test record. The argument for five over four is that we need an odd number to “break ties.”

But how do you know if the output of your prompt is working well? Use a deterministic metric. In the case of samples, we build the code, we lint it, we apply other static analysis tools, which all provide deterministic review and feedback. Finally, once we have something that passes those quality gates, we perform manual testing and human review. With this many quality gates and a large number of samples, we can begin to rely on the Law of Large Numbers to determine if a prompt template is working and not worry about four or five trials per sample.

Embracing Statistical Techniques For The Best Performance

Beyond prompt templates, we can evaluate other parts of our workflow. The scenarios below show how we can freeze some elements of the workflow while keeping others the same (freeze). We start by listing the question we want to answer and then list which elements to change and which elements to freeze.

How well does my new prompt template work?
1. Change: prompt template
2. Freeze: model, hyperparameters, ground truth input and output
How well does a different model or model version affect the results?
1. Change: model
2. Freeze: hyperparameters, ground truth input and output, prompt template
Is a new input value a useful addition to the ground truth?
1. Change: input value
2. Freeze: model, hyperparameters, ground truth output, prompt template
Is a new output value a useful addition to the ground truth?
1. Change: output value
2. Freeze: model, hyperparameters, ground truth input, prompt template
How will changing the hyperparameter values improve the results?
1. Change: hyperparameter value
2. Freeze: model, ground truth input and output, prompt template

Say a new model version is released and we have results from testing the previous model. We can keep the hyperparameters, ground truth, and prompt template the same as before. Then we change the model in the dotprompt file and rerun our evaluation. Now we have data to decide if we want to use the new model version. Likewise, we can alter the other items in the list above to answer other questions.

We might be able to sidestep the statistical testing by forcing Gemini behave more deterministically. We could set its hyperparameters to their most deterministic values – such as temperature at 0, top-k at 1, top-p at 0, or by using the same seed the same value every time. This creates its own issues, and does not rid us of the need for testing. What if a given prompt’s deterministic response is incorrect every time? How do we automatically correct things for which there are no deterministic tools? We want there to be some degree of creativity and stochasticity in its responses. We want the option of running the generation again with the probability of getting a better response. We embrace this power but we also need to be more statistics-minded about our testing to make sure our prompts are there for us when we need them.

Join the Conversation

I’m curious about what others are doing to help evaluate their prompts and prompt templates.

Are you just starting out? How do you do your vibe checks? How do you test before shipping?
Have you been evaluating prompts for a while? How many times do you evaluate a prompt template before putting it into production? How do you keep time and cost down?
What recommendations do you follow when testing prompts? Do you have sources to share? Can we do this better?
What workflows have you found to work?

Please share in the comments.

A paper on the “Budget 5”: Confidence Improves Self-Consistency in LLMs
Vertex AI’s advice on evaluation datasets
Anthropic’s Writing Effective tools for AI agents
Stanford’s and UC Santa Barbara’s With LIttle Power Comes Great Responsibility about how many NLP studies are underpowered in terms of statistical testing
7 Technical Takeaways from Using Gemini to Generate Code Samples at Scale
How My Team Aligns on Prompting for Production

Thanks to Jennifer Davis, Adam Ross, Nim Jayawardena, and Katie McLaughlin for feedback on this post.

How My Team Aligns on Prompting for Production

Adam Ross — Tue, 17 Feb 2026 16:53:46 +0000

My team at Google is automating sample code generation and maintenance. Part of that is using Generative AI to produce and assess instructional code. This introduces a challenge: How do we trust the system to meet our specific standards, when core components are non-deterministic?

Establishing trust requires isolating and understanding each large language model (LLM) request. We need to know exactly what goes into the model, and a guarantee of what comes out.

This challenge isn't different from other feature development. To succeed, we realized we had to stop treating prompting like chatting or guessing and start treating it like coding.

Natural Language is a Fuzzy Programming Language

Prompting an LLM is effectively "natural language programming": we are programming in English. The problem is that English is not the greatest language for programming. It is ambiguous. It is subjective. It is open to interpretation.

In C++, a missing semicolon breaks the build. In English, a missing comma changes the objective entirely: "I don't know, John" becomes "I don't know John". In a programming language, syntax is binary; it works or it doesn't. In English, the difference between "Ensure variables are immutable" and "Make sure variables never change" might yield different results based on the model’s training data.

When you combine the fuzziness of human language with the "black box" probabilistic processing of an LLM, you face a difficult question: What is the weather going to be today in the land of AI?

To answer that, you have to make the intentions behind your prompts explicit.

Be Efficient with Brains (Pair & Review)

Writing a prompt is an exploratory process of finding words that trigger the best response. However, a single writer is limited by their own understanding. This is risky with LLMs, which are suited for ambiguous problems but require strict guardrails.

Relying on one person to do this creates blind spots. We found that prompt quality benefits from pairing. A diversity of thought helps create a more complete definition of any problem. What one engineer considers a clear instruction, another might see as a loophole. Pairing covers the gaps that a single brain might miss.

Furthermore, you should also review every prompt. This isn't just checking for typos; it’s a logic check. Does this prompt align with business requirements? We’ve found that prompt reviews often uncover disagreements about the requirements themselves, forcing us to align as a team before we ship.

Document "The Why" and manage change

Because English is fuzzy, the intent behind a specific word choice isn't always obvious. Why did we use the passive voice here? Why did we specify "immutable"?

Even well-structured prompts can eventually obscure the original business requirements. As optimizations blur into the general text, we must take every opportunity to document "the why":

Documentation: Avoid relying on prompts as canonical business requirements. Our LLM requests are a combination of system instructions, user input, context, and deterministic post-processing; the prompt is not complete for onboarding a developer.
Comments: Comment on complex prompts just as you would complex code. Spotlight specific constraints or even punctuation to explain the problem they solve. The model is a moving target, so any unintentional changes can make troubleshooting hard.
Commit Messages: Use commit messages as an opportunity to explain what was wrong with the prompt. (for example, fixed: Missing comma lost John)

Separation of Concerns (Use Dedicated Files)

Writing code and writing prompts require distinct mindsets. One focuses on syntax and execution flow; the other on semantics and intent. Embedding long English instructions inside code creates a distraction.

We keep prompts in dedicated files to disentangle application logic from the LLM interaction configuration, which requires frequent tuning.

By treating the prompt as a standalone component, we can prototype and iterate on the LLM behavior independent of the application's control flow. Tools like dotprompt allow us to treat these files as first-class artifacts containing text, model parameters, and schema definitions. This highlights that invoking a model isn't just a function call; it’s an integration with a distinct system that requires its own configuration.

Use Structured Output

To build a reliable tool, you need a bridge between unpredictable LLM output and deterministic computers.

We rely on structured output¹ to guide the model to emit JSON according to a schema. Even if we only need a single field, defining a schema provides a guardrail that helps the model output conform to a shape we can validate programmatically. This is critical for code generation, where models often add unwanted preambles, conversational filler, or inconsistent markdown fences.

If the output doesn't match the schema, we fail fast or retry. This allows us to integrate the LLM output into our process with the same confidence we have in detecting a bad API response.

From Magic to Engineering

Moving from one successful prompting to a reliable system requires acknowledging that prompts are code. You need to manage, review, and test them with the same rigor applied to the rest of your stack. While we are still working on better ways to benchmark quality, treating our prompts as first-class codebase assets is our first step towards building confidence in our AI assisted automation.

Join the conversation

I'm curious how you are handling the fuzzy nature of LLMs in production.

How does your team review and test prompts?
Do you treat prompts as configuration, code, or something else entirely?
Share your workflow in the comments.

The lumberjack paradox: From theory to practice by Jennifer Davis
Google Sandwich Manager, and the hallucinated SDK by Katie McLaughlin and Brian Dorsey

Thanks to Jennifer Davis & Shawn Jones for review and contributions.
Cover Photo by Jeton Bajrami on Unsplash

This links to how structured output works in the Genkit framework, which my team is using. A succinct example. ↩

Why your `curl` logic just bit you 🐾

Jennifer Davis — Mon, 09 Feb 2026 05:01:07 +0000

It’s a common strategy to test a new API directly with curl. It feels intuitive, fast, and removes the overhead of a language runtime. For example, if you are testing out the Google Cloud Logging API, you might start with a simple request to list logs from a Kubernetes container.:

curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  https://logging.googleapis.com/v2/entries:list \
  -d '{
    "resourceNames": ["projects/your-project-id"],
    "filter": "resource.type=\"k8s_container\""
  }'

The JSON returns perfectly. But then, you move that logic into a Node.js application using the official @google-cloud/logging library.

import { Logging } from "@google-cloud/logging";

export async function readLogsAsync() {
  // Common approach: Initializing the client without explicit auth
  const logging = new Logging();

  const [entries, , response] = await logging.getEntries({
    filter: 'resource.type="k8s_container"',
    resourceNames: ['projects/your-project-id'],
  });

  console.log(entries.length, response.nextPageToken);
}

All of a sudden, the code fails with a cryptic error: Error: Could not load the default credentials.

This happens because of a disconnect between how the gcloud CLI manages session tokens and how the Google Cloud client libraries search for credentials.

While your curl command relies on the explicit token you provided via gcloud auth print-access-token, the client libraries are designed to look for Application Default Credentials (ADC).

Running gcloud auth login authenticates, but it does not create the specific credential file that the Node.js library requires to run locally.

🤫 Authenticate for local development

The most efficient way to solve this is to provide the library with the credentials it is looking for. Run the gcloud auth application-default login command in your terminal.

gcloud auth application-default login

This opens a browser window to authorize your account and saves a JSON file to your local configuration folder. Once this is done, your Node.js code will automatically find these credentials—no code changes required.

🏡 Moving to Production

Once you get your local environment running, you need to think about "productionizing" your code. Hardcoded project IDs and lack of robust error handling are common causes of production outages.

🛡️ Robust Error Handling

In production, your app should handle permission issues or network timeouts gracefully.

try {
  const [entries] = await logging.getEntries({
    filter: 'resource.type="k8s_container"',
    resourceNames: [`projects/${process.env.GOOGLE_CLOUD_PROJECT}`],
    pageSize: 10, 
  });
} catch (error) {
  if (error.code === 7) {
    console.error("Permission Denied: Check your Service Account roles.");
  } else {
    console.error("Unexpected Logging Error:", error.message);
  }
}

🎭 Service Accounts

In production, you shouldn't use your personal user credentials. Use a Service Account with the "Logs Viewer" role.

🌍 Environment Variables

Avoid hardcoding project IDs. Use an environment variable to make your code portable across staging and production environments.

const PROJECT_ID = process.env.GOOGLE_CLOUD_PROJECT || 'your-project-id';
...

export async function readLogsAsync() {
  const logging = new Logging({ projectId: PROJECT_ID });

Structured Logging

Instead of just reading logs, think about how you write them. Using structured JSON logs makes them much easier to query later in the Cloud Logging console.

If you are running on Google Kubernetes Engine (GKE), you don't even need to use the Logging library to write logs. Printing a JSON string to stdout allows GKE to parse your data into searchable fields automatically.

export function logStructuredStatus() {
  const logEntry = {
    severity: "INFO",
    message: "Container health check successful",
    container_info: { version: "v2.1.0", uptime: process.uptime() },
    http_stats: { active_connections: 42 }
  };

  // GKE picks this up and converts it to a structured log automatically
  console.log(JSON.stringify(logEntry));
}

Get started

I'll be sharing a more thorough walk-through on using the native observability features provided with GKE real soon—including how to automate this stuff so you can avoid the manual debugging of credential errors.

In the meantime, dive deeper here:

📖 How Application Default Credentials work – Understand the "magic" behind the lookup.
🛠️ Providing credentials to ADC – Setup guides for every environment.

The lumberjack paradox: From theory to practice

Jennifer Davis — Wed, 19 Nov 2025 00:42:14 +0000

Previously, I shared my thoughts on Neal Sample’s "lumberjack paradox" and the urgent need to build the systems thinkers of tomorrow. I argued that leaders must move beyond simple efficiency and focus on re-engineering the experience (Dr. Gary Klein) and creating context to ensure we don't lose the path to deep expertise.

But what does "leadership as context creator" look like in practice?

For us in Cloud DevRel Engineering, it isn't abstract. It comes down to how we manage the most fundamental unit of our developer experience: the code sample.

As Neal notes, AI will lead to the "industrialization of creativity"—an infinite supply of ideas and code. In this world, the premium shifts to discernment: the ability to distinguish the quality from the mediocre.

But this isn't a choice between the axe (manual craft) and the chainsaw (AI). The modern expert needs both.

If you only have the axe, you are restricted to the problems that fit within manual reach. It is the perfect tool for the campsite, but it cannot clear the forest.
But if you only have the chainsaw, without the judgment to guide it, you are dangerous. You lack the control to distinguish a clean cut from a destructive one.

You need the deep expertise of the axe to get the precise, consistent outcomes from the chainsaw.

From theory to practice: The catalog as ground truth

In my previous post, I mentioned Dr. Richard Cook's work on "building common ground" and Donella Meadows’ warnings about suboptimization.

In Cloud DevRel Engineering, we realized that our code samples are the primary tool for building this common ground. In Dr. Cook’s terms, they form the "Line of Representation"—the tangible surface that connects the human "above the line" to the complex system "below the line."

When a developer (the human) learns a new platform, the sample is their manual for the "axe." When an AI assistant generates a solution, the sample is the training data that guides the "chainsaw."

When we looked at our systems, we saw suboptimization. By treating samples as low-priority content maintained by individual contributors, we created a fractured reality.

We broke the Line of Representation.

We saw this failure hit on two fronts:

We break the human judgment loop: If samples are inconsistent, developers cannot learn "good" from "bad." We fail to re-engineer the experience (Dr. Klein) necessary to build expertise.
We poison the AI well: AI models ingest our official repositories. The AI learns them, scales them, and feeds them back to the user.

We are currently witnessing exactly how this hand-crafted approach fails at scale.

The high cost of "geological strata" in code

Without central standardization, our repositories accumulated "geological strata"—layers of outdated practices—because manual maintenance cannot keep up with language evolution. This makes it hard to know what is correct today.

Node.js' paradigm tax: Our Node.js repositories contain a mix of callbacks, raw promises, and async/await. A user learning Pub/Sub sees one era, while a user learning Cloud Storage sees another. The AI sees all of it and treats it all as valid, stripping away the context of "outdated" versus "modern."
Python: The contributor long tail: With over 650 contributors, our Python samples suffer from extreme fragmentation. The total cost of ownership (TCO) of manually bringing thousands of older snippets up to modern Python 3.10+ standards is astronomically high, so it simply doesn't happen. This leaves a massive surface area of "technical debt" that the AI happily recycles.

Inconsistent quality creates "false best practices"

When samples are hand-written by federated teams, personal "developer flair" masquerades as industry best practice. Users copy-paste these patterns, inadvertently adopting technical debt.

Java's Framework creep: Instead of teaching the core platform, contributors often introduce heavy frameworks for simple tasks. This increases the "time-to-hello-world" and teaches the AI that simple tasks require complex dependencies.
Python vs. Go: Most Go samples handle errors correctly because the language forces it. Many Python samples show only the "happy path," skipping critical distributed systems patterns like exponential backoff or retry logic. The AI then generates code that looks clean but fails in production.

The hidden cost of incoherence

This is the "suboptimization" Donella Meadows warned about. It is not enough for individual samples to be correct in isolation; they must function as a cohesive unit.

For a human developer, shifting between products that use different coding styles creates friction. They have to spend mental energy decoding the "dialect" of a specific product team rather than focusing on the logic.

For an AI, this lack of cohesion is even more dangerous.

The Context Gap: When our samples for Cloud Storage look structurally different from our samples for BigQuery, the AI treats them as unrelated entities. It fails to learn the underlying "grammar" of our platform.
The Integration Failure: When a user asks for a solution that combines these products, the AI struggles to bridge the gap. Lacking a consistent pattern to follow, it often hallucinates a messy, "glue code" solution that is brittle and insecure.

By allowing fragmentation, we aren't just impacting the docs; we are training the AI to misunderstand how our platform is supposed to fit together.

Get started

We cannot view code samples as static documentation. They are the active constraints of our system—the "environment" we design for our users. If we fail to maintain them, we dull the tools that build developer judgment, and we degrade the quality of the AI they trust.

Coming up next

Next in this series, I will share our structural solution: the "Golden Path." This approach moves us away from isolated automation and towards a human-led, AI-scaled system that improves consistency.

I’ll be focusing more on the strategy in this series, but the execution is its own journey. Using AI to write code is well-known, but relying on it to produce production-ready educational content? Adam Ross, and Nim Jayawardena have shared the technical reality of our team's shift in a post on their 7 takeaways from generating samples at scale with Gemini.

Until then, ask yourself:

Are you trying to automate away your documentation debt without first defining a standard of quality?
Are your samples strong enough to serve as the "ground truth" for the AI models your developers rely on?

Special thanks to Katie McLaughlin, Adam Ross, and Nim Jayawardena for reviewing early drafts of this post.

How to enable Secure Boot for your AI workloads

Maciej Strzelczyk — Mon, 21 Jul 2025 14:43:09 +0000

Written in cooperation with Aron Eidelman.

As organizations race to deploy powerful GPU-accelerated workloads, they might overlook a foundational step: ensuring the integrity of the system from the very moment it turns on.

Threat actors, however, have not overlooked this. They increasingly target the boot process with sophisticated malware like bootkits, which seize control before any traditional security software can load and grant them the highest level of privilege to steal data or corrupt your most valuable AI models.

Why it matters: The most foundational security measure for any server is verifying its integrity the moment it powers on. This process, known as Secure Boot, is designed to stop deep-level malware that can hijack a system before its primary defenses are even awake.

Secure Boot is part of Google Cloud’s Shielded VM offering, which allows you to verify the integrity of your Compute VM instances, including the VMs that handle your AI workloads. It’s the only major cloud offering of its kind that can track changes beyond initial boot out of the box and without requiring the use of separate tools or event-driven rules.

The bottom line: Organizations don't have to sacrifice security for performance. There is a clear, repeatable process to sign your own GPU drivers, allowing you to lock down your infrastructure's foundation without compromising your AI workloads.

Google Cloud’s Secure Boot capability can be opted into at no additional charge, and now there’s a new, easier way to set it up for your GPU-accelerated machines.

Understanding the danger of bootkits

It’s important to secure your systems from boot-level threats. Bootkits target the boot process, the foundation of an operating system. By compromising the bootloader and other early-stage system components, a bootkit can gain kernel-level control before the operating system and its security measures load. Malware can then operate with the highest privileges, bypassing traditional security software.

This technique falls under the Persistence and Defense Evasion tactics in the MITRE ATT&CK framework. Bootkits are difficult to detect and remove due to their low-level operation. They hide by intercepting system calls and manipulating data, persisting across reboots, stealing data, installing malware, and disabling security features.

Bootkits and rootkits pose a persistent, embedded threat, and have been observed as part of current threat actor trends from Google Threat Intelligence Group, the European Union Agency for Cybersecurity (ENISA), and the U.S. Cybersecurity and Infrastructure Security Agency (CISA). Google Cloud always works on improving the security of our solutions by strengthening our products and providing tools you can use yourself. In this article, we would like to demonstrate a new, easier way of setting up Secure Boot for your GPU-accelerated machines.

Limitations of Secure Boot with GPUs

Shielded VMs employ a TPM 2.0-compliant virtual Trusted Platform Module (vTPM) as their root of trust, protected by Google Cloud's virtualization and isolation powered by Titan chips. While Secure Boot enforces signed software execution, Measured Boot logs boot component measurements to the vTPM for remote attestation and integrity verification.

Limitations start when you want to use a kernel module that is not part of the official distribution of your operating system. That is especially problematic for AI workloads, which rely on GPUs whose drivers are usually not part of official distributions. If you want to manually install GPU drivers on a system with Secure Boot, the system will refuse to use them because they won’t be properly signed.

How to use Secure Boot on GPU-accelerated machines

There are two ways you can tell Google Cloud to trust your signature when it confirms the GPU driver validity with Secure Boot: with an automated script, or manually.

The script that can help you prepare a Secure Boot compatible image is open-source and is available in our GitHub repository. Here’s how you can use it:

# Download the newest version of the script:
curl -L https://storage.googleapis.com/compute-gpu-installation-us/installer/latest/cuda_installer.pyz --output cuda_installer.pyz

# Make sure you are logged in with gcloud
gcloud auth login

# Check available option for the build process
python3 cuda_installer.pyz build_image --help

# Use the script to build an image based on Ubuntu 24.04
PROJECT = your_project_name
ZONE = zone_you_want_to_use
SECURE_BOOT_IMAGE = name_of_the_final_image

python3 cuda_installer.pyz build_image --project $PROJECT --vm-zone $ZONE --base-image ubuntu-24 $SECURE_BOOT_IMAGE

The script will execute each of the five steps described below for you. It may take up to 30 minutes, as the installation process takes this much time. We’ve also detailed how to use the building script in our documentation.

To manually tell Google Cloud to trust your signature, follow these five steps (also available in our documentation):

Generate your own certificate to be used for signing the driver.
Create a fresh VM with the OS of your choice (Secure Boot disabled, GPU not required).
Install and sign the GPU driver (and optionally CUDA toolkit).
Create a new Disk Image based on the machine with a self-signed driver, adding your certificate to the list of trusted certificates.
The new image can be now used with Secure Boot enabled VMs.

Whether you used the script or performed the task manually, you’ll want to verify that the process worked.

Start a new GPU accelerated VM using the created image

To verify that everything worked, we can create a new VM using the new disk image with the following command (we enable the Secure Boot option to verify that our process worked).

# Create a new VM with T4 GPU to verify that everything works. Note that here ZONE needs to have T4 GPUs available.
TEST_INSTANCE_NAME = name_of_the_test_instance

gcloud compute instances create $TEST_INSTANCE_NAME \
--project=$PROJECT \
--zone=$ZONE \
--machine-type=n1-standard-4 \
--accelerator=count=1,type=nvidia-tesla-t4 \
--create-disk=auto-delete=yes,boot=yes,device-name=$TEST_INSTANCE_NAME,image=projects/$PROJECT/global/images/$SECURE_BOOT_IMAGE,mode=rw,size=100,type=pd-balanced \
--shielded-secure-boot \
--shielded-vtpm \
--shielded-integrity-monitoring \
--maintenance-policy=TERMINATE

# gcloud compute ssh to run nvidia-smi and see the output
gcloud compute ssh --project=$PROJECT --zone=$ZONE $TEST_INSTANCE_NAME --command "nvidia-smi"

# If you decided to also install CUDA, you can verify it with the following command
gcloud compute ssh --project=$PROJECT --zone=$ZONE $TEST_INSTANCE_NAME --command "python3 cuda_installer.pyz verify_cuda"

Clean up

When you verify that the new image works, there’s no need to keep the verification VM around. You can delete it with:

gcloud compute instances delete --zone=$ZONE --project=$PROJECT $TEST_INSTANCE_NAME

Enabling Secure Boot

Now that you have built a Secure Boot compatible base image for your GPU-based workloads, remember to actually enable Secure Boot on your VM instances when you use those images! Secure Boot is disabled by default, so it needs to be explicitly enabled for Compute Engine instances.

When creating new instances

If you create a new instance using Cloud Console, the checkbox to enable Secure Boot can be found in the Security tab of the creation page, under the Shielded VM section.

For the gcloud enthusiasts, there’s --shielded-secure-boot flag available for the gcloud compute instances create command.

Updating existing instances

You can also enable Secure Boot for instances that already exist, however, make sure that they are running a compatible system. If the driver installed on those machines is not signed with a properly configured key, the driver will not be loaded. To update Secure Boot configuration for existing VMs, you’ll have to follow the stop, update and restart procedure, as described in this documentation page.

Get started

Make sure to visit our documentation page to learn more about the process and follow our GitHub repository to stay up to date with other GPU automation news.

DEV Community: Google Cloud

How to Count Gemini Tokens Locally

✨ Overview

⚙️ Setup

🐍 Google Gen AI Python SDK

🛠️ Google Cloud Project

🤖 Gen AI SDK Client

🧠 Gemini Model

🧩 The Basics: Tokens and Tokenizers

🌐 Baseline: API Token Counting

🚀 Why Count Tokens Locally?

🔤 Using the Local Text Tokenizer

🕵️‍♂️ Accounting for "Hidden" Tokens

🧮 Multimodal Token Math

🎯 Tracking Actual Token Usage

🖼️ Image Tokenization

🔊 Audio Tokenization

🎬 Video Tokenization

📄 Document Tokenization

🎉 Conclusion

➕ More!

Real-time IP capacity in Google Cloud subnets

Automating the search with gcloud and jq

Under the hood: Reading the utilization payload

A few quick constraints

Next steps

Seamless scaling with VPA In-place Pod Resize on GKE

Introducing In-place Pod Resize (IPPR) on GKE

Getting started with IPPR

1. Enable the feature

2. Define your VPA object

3. Watch it scale

Ready to dive deeper?

Surviving the eviction: How to build interrupt-resilient AI workloads on GKE

1. Trap the warning

2. Externalize your checkpoints

3. Design for Idempotency

4. Decouple work queues for batch processing

Key takeaways

Strategies for running AI workloads on GKE without committed quota

Strategy 1: Spot VMs

When to use Spot VMs for accelerators

How to use Spot VMs in GKE

Strategy 2: Dynamic Workload Scheduler (DWS) with flex-start

When to use DWS

How to use DWS with flex-start

Which strategy should you choose?

Inference on GKE Private Clusters

Setting up inference service without access to Internet

GKE Private Nodes

Providing images to the pods

Providing the LLM

Alternative to mounting GCS Bucket - persistent disks

Summary

AI deployment: to host or not to host?

Serverless vs hosted inference service

Serverless (pay per token)

Pros:

Cons:

Hosted (pay per second)

Pros:

Cons:

Couple of considerations

How much traffic do I expect?

Am I legally bound to keep user data in certain region?

Am I likely to hit the hourly/monthly quota?

Mixed-approach

This is not an irrevocable decision

Keep up!

Making Sure Your Prompt Will Be There For You When You Need It

LLMs Can Make Mistakes

Prompt Templates as Functional Interfaces

Phase 1: Build a foundation with Ground Truth

Phase 2: Finding Your Candidate Prompt (Vibe Check)

Phase 3: Statistical Trials – Because Unit Tests Alone Don’t Work

Embracing Statistical Techniques For The Best Performance

Join the Conversation

Read More

How My Team Aligns on Prompting for Production

Natural Language is a Fuzzy Programming Language