Laurent Picard for Google Cloud

Posted on • Originally published at hackernoon.com on

How to Count Gemini Tokens Locally

✨ Overview

This article explores how Gemini tokenizes data and demonstrates how to count or estimate tokens locally. You’ll learn how to use the local tokenizer to estimate text token counts offline, understand the tokenization math for multimodal inputs (images, audio, video, PDFs), and see how to retrieve precise token usage metadata from API responses for accurate tracking and billing.

ℹ️ The complete source code is available in this notebook (including all setup details and future updates) under the Apache 2.0 license. You can also directly open the notebook in Colab. This article reproduces all the results generated by a click on “Run all”.


⚙️ Setup

🐍 Google Gen AI Python SDK

To call the Gemini API, we'll use the Google Gen AI Python SDK. The Gemini API provides a count_tokens method, and the SDK offers an experimental implementation of a LocalTokenizer class.

Make sure you have a recent version of the google-genai package with its local-tokenizer extra:

%pip install --quiet "google-genai[local-tokenizer]>=2.9.0"

🛠️ Google Cloud Project

To get started using the Gemini API on Agent Platform, you must have an existing Google Cloud project and enable the Agent Platform API.

Learn more about setting up a project and a development environment.

import os

# fmt: off
PROJECT_ID = ""  # @param {type: "string", placeholder: "[your-project-id]", isTemplate: true}
LOCATION = "global" # @param {type: "string", placeholder: "[your-region]", isTemplate: true}
# fmt: on

if not PROJECT_ID:
    PROJECT_ID = os.environ.get("GOOGLE_CLOUD_PROJECT")
    assert PROJECT_ID, "❌ Please set the PROJECT_ID variable"
if not LOCATION:
    LOCATION = os.environ.get("GOOGLE_CLOUD_REGION", "global")

🤖 Gen AI SDK Client

To interact with the Gemini API, we initialize a genai.Client. Since we're using the enterprise-ready Agent Platform backend (formerly Vertex AI), we pass enterprise=True along with our Google Cloud project and location:

from google import genai


def print_configuration(client: genai.Client) -> None:
    service = "Agent Platform" if client.vertexai else "Google AI"
    print(f"ℹ️ Using the {service} API", end="")
    if client._api_client.project:
        print(f' with project "{client._api_client.project[:7]}…"', end="")
        print(f' in location "{client._api_client.location}"')
    elif client._api_client.api_key:
        api_key = client._api_client.api_key
        print(f' with API key "{api_key[:5]}…{api_key[-5:]}"', end="")
        print(f" (in case of error, make sure it was created for {service})")


client = genai.Client(enterprise=True, project=PROJECT_ID, location=LOCATION)

print_configuration(client)

ℹ️ Using the Agent Platform API with project "lpdemo-…" in location "global"

🧠 Gemini Model

We'll use gemini-3.1-flash-lite as our default model for token counting and content generation. This lightweight, fast model is ideal for high-throughput tasks.

MODEL_ID = "gemini-3.1-flash-lite"

🧩 The Basics: Tokens and Tokenizers

Tokens

Large language models (LLMs) don't process our inputs directly, nor do they generate the final text or media we see. Instead, they operate on fundamental units called tokens, ingesting them as inputs and generating them as outputs.

Here's what happens when we send an LLM request:

  1. Our inputs are transformed into tokens. In other words, they are tokenized.
  2. The model generates output tokens, which represent the most likely next tokens based on the overall context.
  3. These output tokens are transformed back into the final content we can use.

You can think of a token as a piece of information, and this tokenization process acts as an information compression codec:

  1. Encoding: Input → Input tokens
  2. Decoding: Output tokens → Output

Tokenization is necessary to compress information to the right level of semantic granularity, allowing the model's attention mechanism to focus and develop an understanding of the provided data.

Tokenizers

Gemini is natively multimodal and accepts text, images, audio, video, and PDFs. These media types can be processed by a set of three tokenizers:

| Input | Text Tokenizer | Image Tokenizer | Audio Tokenizer | Comment |
| - | :-: | :-: | :-: | - |
| Text | ✅ | | | The original tokenizer type, when LLMs were only chatbots. |
| Image | | ✅ | | An image can be worth a thousand tokens! |
| Audio | ✅ | | ✅ | Text tokens are used for timestamps (MM:SS or H:MM:SS). |
| Video | ✅ | ✅ | [✅] | By default, one frame is sampled per second, along with its corresponding timestamp. Audio is optional for videos. |
| PDF | ✅ | ✅ | | PDFs are processed by vision tokenizers. Text tokens are used for OCR and pagination data. |

As you can see, up to three tokenizers can be involved, depending on the modality.

💡 Keep in mind that not all underlying tokens are necessarily billed. See the usage_metadata section below for examples of tokens actually billed per modality.

Vocabulary

The complete set of unique tokens that an LLM can ingest or generate makes up its vocabulary. Once an LLM is trained, its vocabulary is fixed and is used for inference.

A vocabulary is essentially a lookup table mapping text sequences to token IDs (which correspond to vector representations in a semantic space). This means tokenizers are simply algorithms that use this vocabulary to encode and decode tokens (i.e., to convert data to and from token IDs).

For example, the Gemini text tokenizers process common words like this:

| Text | Tokens | Tokenization | Token IDs |
| - | :-: | - | - |
| hello | 1 | A single token for most common sequences | 23391 |
| passion | 1 | passion | 208039 |
| passionate | 2 | pass • ionate | 4373 • 84242 |
| passionné | 2 | passion • né (passionate in French) | 208039 • 8504 |
| passionately | 2 | passion • ately | 208039 • 2295 |
| passionalmente | 2 | pass • ionalmente (passionately in Italian) | 4373 • 134916 |

💡 As you can see, words with the same root aren't necessarily split the same way. Text tokenizers have no concept of syllables, prefixes, or suffixes. They don't think like linguists or grammarians; they think like statisticians and look for statistically optimal combinations.
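
To make the lookup-table idea concrete, here is a minimal sketch that uses only the token IDs from the table above. The `TOY_VOCAB` dictionary and the `encode`/`decode` helpers are hypothetical names for illustration. The sketch takes pieces that are already split, so it does not reproduce how the real tokenizer decides where to split a word:

```python
# Toy vocabulary: a tiny excerpt built from the token IDs in the table above.
# A real vocabulary is far larger (the IDs above already exceed 200,000).
TOY_VOCAB: dict[str, int] = {
    "hello": 23391,
    "passion": 208039,
    "pass": 4373,
    "ionate": 84242,
    "né": 8504,
    "ately": 2295,
    "ionalmente": 134916,
}
# Reverse mapping used for decoding (token ID -> text piece).
TOY_IDS: dict[int, str] = {token_id: piece for piece, token_id in TOY_VOCAB.items()}


def encode(pieces: list[str]) -> list[int]:
    """Map already-split text pieces to their token IDs."""
    return [TOY_VOCAB[piece] for piece in pieces]


def decode(token_ids: list[int]) -> str:
    """Map token IDs back to text by concatenating their pieces."""
    return "".join(TOY_IDS[token_id] for token_id in token_ids)


ids = encode(["pass", "ionate"])
print(ids)          # [4373, 84242]
print(decode(ids))  # passionate
```

Encoding and decoding are plain dictionary lookups here; the statistical part of a real tokenizer lies in how the vocabulary was built and how the input is split.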


🌐 Baseline: API Token Counting

The Gemini API lets you count tokens for any multimodal input by sending a count_tokens request. While you need to be authenticated to use it, this method is free of charge, so you can audit your prompts before committing to a paid request. Likewise, the compute_tokens method lets you retrieve the list of corresponding tokens and token IDs.

Let's reproduce the previous table:

from collections.abc import Iterator

import IPython.display

from google.genai.types import (
    ComputeTokensResponse,
    ComputeTokensResult,
    CountTokensResponse,
    CountTokensResult,
)

RowData = tuple[str, str, str, str]


def display_token_info_from_api(model: str, texts: list[str]) -> None:
    def yield_data() -> Iterator[RowData]:
        for text in texts:
            count_result = client.models.count_tokens(model=model, contents=text)
            compute_result = client.models.compute_tokens(model=model, contents=text)
            yield get_text_token_info(text, count_result, compute_result)

    display_token_info(yield_data())


def display_token_info(yield_data: Iterator[RowData]) -> None:
    def yield_row() -> Iterator[RowData]:
        yield "Text", "Tokens", "Tokenization", "Token IDs"
        yield "-", ":-:", "-", "-"
        yield from yield_data

    markdown = "\n".join("| " + " | ".join(row) + " |" for row in yield_row())
    IPython.display.display(IPython.display.Markdown(markdown))


def get_text_token_info(
    text: str,
    count_tokens_res: CountTokensResponse | CountTokensResult,
    compute_tokens_res: ComputeTokensResponse | ComputeTokensResult,
) -> RowData:
    def inline_code(s: str) -> str:
        return f"`{s}`"

    total_tokens = count_tokens_res.total_tokens
    tokens_info = compute_tokens_res.tokens_info
    assert tokens_info is not None and len(tokens_info) == 1
    info = tokens_info[0]
    assert info.tokens is not None and info.token_ids is not None
    # Join with " • " so that token boundaries stay visible in the table
    tokenization = " • ".join(t.decode("utf-8", errors="replace") for t in info.tokens)
    token_ids = " • ".join(str(token_id) for token_id in info.token_ids)

    return (
        inline_code(text),
        str(total_tokens),
        inline_code(tokenization),
        inline_code(token_ids),
    )


TEXTS = [
    "hello",
    "passion",
    "passionate",
    "passionné",
    "passionately",
    "passionalmente",
]
display_token_info_from_api(MODEL_ID, TEXTS)

| Text | Tokens | Tokenization | Token IDs |
| - | :-: | - | - |
| hello | 1 | hello | 23391 |
| passion | 1 | passion | 208039 |
| passionate | 2 | pass • ionate | 4373 • 84242 |
| passionné | 2 | passion • né | 208039 • 8504 |
| passionately | 2 | passion • ately | 208039 • 2295 |
| passionalmente | 2 | pass • ionalmente | 4373 • 134916 |

🚀 Why Count Tokens Locally?

Here are a few use cases where counting (or just estimating) tokens locally is useful:

  • Offline & Speed: You can count tokens completely offline. Plus, even when you're online, doing it locally means you don't have to wait for a network round-trip to the Gemini API just to check your prompt size.
  • Quotas: While the count_tokens method is free, counting locally saves bandwidth and prevents you from hitting API rate limits, especially during high-volume token counting.
  • Latency: You can estimate how much time is needed to process your text input before you start receiving a response (for a given model, the time-to-first-token latency is roughly proportional to the number of input tokens).
  • Cost Control: You can estimate and budget your API costs before committing to a paid request.
  • Routing: Knowing which token-count bucket your input falls into lets you route requests to different models based on speed, cost, or context size.
  • Privacy: You can audit the token count of sensitive data without sending it over your network.
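
As an illustration of the routing and cost-control use cases above, here is a minimal sketch that picks a model from a locally counted token total and estimates the input cost. Everything in it is an assumption made for the example: the `Route` class, the bucket limits, the model names, and the prices are placeholders, not actual Gemini models or pricing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    max_input_tokens: int  # Upper bound of the token-count bucket
    model: str  # Placeholder model name (hypothetical)
    usd_per_million_input_tokens: float  # Placeholder price (hypothetical)


# Buckets are ordered from smallest to largest. All values are made up.
ROUTES = [
    Route(8_000, "small-fast-model", 0.10),
    Route(128_000, "mid-size-model", 0.50),
    Route(1_000_000, "long-context-model", 2.00),
]


def pick_route(input_tokens: int) -> Route:
    """Return the first route whose bucket can hold the input."""
    for route in ROUTES:
        if input_tokens <= route.max_input_tokens:
            return route
    raise ValueError(f"Input too large for any route: {input_tokens:,} tokens")


def estimate_input_cost(input_tokens: int, route: Route) -> float:
    """Estimate the input cost in USD for a locally counted prompt."""
    return input_tokens / 1_000_000 * route.usd_per_million_input_tokens


route = pick_route(54_660)
print(route.model, f"${estimate_input_cost(54_660, route):.4f}")
```

In practice, the token count passed to `pick_route` would come from a local count, and the table would hold your own models and current prices.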

🔤 Using the Local Text Tokenizer

Create a local tokenizer for the specific Gemini model you're using:

from google.genai.local_tokenizer import LocalTokenizer

tokenizer = LocalTokenizer(model_name=MODEL_ID)

💡 Remarks

  • Creating a tokenizer takes a few seconds, during which the configuration and vocabulary are loaded into memory.
  • On the first call, the tokenizer data is downloaded and stored in a local cache. This step requires an internet connection and about 30MB of storage.
  • If you want to build a fully offline solution, you can check out the SDK source code and persist the tokenizer assets (e.g., by configuring a persistent cache directory or building a container image).

Checking the internal tokenizer name confirms that the Gemma open-weight models share the same text tokenizer as the Gemini 3 family:

print(f'Text tokenizer name for "{MODEL_ID}": "{tokenizer._tokenizer_name}"')

Text tokenizer name for "gemini-3.1-flash-lite": "gemma4"

Call the count_tokens() method on a small text input:

contents = "Hello World!"
result = tokenizer.count_tokens(contents)

print(f"{result.total_tokens=}")

result.total_tokens=3

Now, let's reproduce the previous API tokenization tests with our local tokenizer:

def display_token_info_from_local_tokenizer(
    tokenizer: LocalTokenizer, texts: list[str]
) -> None:
    def yield_data() -> Iterator[RowData]:
        for text in texts:
            count_result = tokenizer.count_tokens(contents=text)
            compute_result = tokenizer.compute_tokens(contents=text)
            yield get_text_token_info(text, count_result, compute_result)

    display_token_info(yield_data())


display_token_info_from_local_tokenizer(tokenizer, TEXTS)

| Text | Tokens | Tokenization | Token IDs |
| - | :-: | - | - |
| hello | 1 | hello | 23391 |
| passion | 1 | passion | 208039 |
| passionate | 2 | pass • ionate | 4373 • 84242 |
| passionné | 2 | passion • né | 208039 • 8504 |
| passionately | 2 | passion • ately | 208039 • 2295 |
| passionalmente | 2 | pass • ionalmente | 4373 • 134916 |

💡 As expected, we get exactly the same results, but with 100% local execution this time.

Finally, let's download a longer text, like Hamlet:

import requests


def get_text_from_url(content_url: str, force_encoding: str = "") -> str:
    response = requests.get(content_url, timeout=10)
    response.raise_for_status()
    if force_encoding:  # Use for HTTP headers with unknown/incorrect charset
        response.encoding = force_encoding
    return response.text


TEXT_URL = "https://storage.googleapis.com/dataflow-samples/shakespeare/hamlet.txt"
contents = get_text_from_url(TEXT_URL)

print(contents[:256] + "[…]")

    HAMLET


    DRAMATIS PERSONAE


CLAUDIUS    king of Denmark. (KING CLAUDIUS:)

HAMLET  son to the late, and nephew to the present king.

POLONIUS    lord chamberlain. (LORD POLONIUS:)

HORATIO friend to Hamlet.

LAERTES son to Polonius.

LUCIANUS    nephew to the kin[…]

How many tokens do we need to encode Hamlet?

result = tokenizer.count_tokens(contents)

print(f"{result.total_tokens=:,}")

result.total_tokens=54,660

💡 Hamlet gets broken down locally into 50k+ tokens in a fraction of a second. If you tokenize War and Peace, you'll get 850k+ tokens.
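
Fast local counting also makes it practical to split a long text into pieces that fit a token budget. The following is a minimal sketch under stated assumptions: `chunk_by_token_budget` and `count_words` are hypothetical helpers, and the word-based counter is only a crude stand-in so that the example runs without the SDK.

```python
from collections.abc import Callable, Iterator


def chunk_by_token_budget(
    text: str,
    count_tokens: Callable[[str], int],
    max_tokens: int,
) -> Iterator[str]:
    """Greedily pack whole lines into chunks of at most max_tokens each.

    Per-line counts are summed, which is an approximation: tokenizing a chunk
    as a whole can give a slightly different count than summing its lines.
    A single line that exceeds the budget is emitted as its own chunk.
    """
    chunk: list[str] = []
    chunk_tokens = 0
    for line in text.splitlines(keepends=True):
        line_tokens = count_tokens(line)
        if chunk and chunk_tokens + line_tokens > max_tokens:
            yield "".join(chunk)
            chunk, chunk_tokens = [], 0
        chunk.append(line)
        chunk_tokens += line_tokens
    if chunk:
        yield "".join(chunk)


def count_words(text: str) -> int:
    """Crude stand-in counter (whitespace-separated words), for this demo only."""
    return len(text.split())


sample = "to be or not to be\nthat is the question\nwhether tis nobler\n"
for chunk in chunk_by_token_budget(sample, count_words, max_tokens=10):
    print(repr(chunk))
```

With the local tokenizer created earlier, the counter could presumably be something like `lambda s: tokenizer.count_tokens(s).total_tokens or 0` in place of `count_words`.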


🕵️‍♂️ Accounting for "Hidden" Tokens

When you send a request to Gemini, the total input token count isn't always just the sum of your input data.

To keep things simple, we tested text token counts with default parameters. The count_tokens and compute_tokens methods both have a config parameter. Depending on your request configuration, your inputs and outputs may include additional tokens.

Keep an eye out for these hidden additions:

  • System Instructions: Any system prompt you set will add to the total token count.
  • Thinking: If thinking is enabled, an internal chain of thought can generate additional thinking tokens.
  • Tools and Functions: If you provide a list of tools (like Python execution or custom functions), their declarations, calls, and responses are part of your prompt payload.
  • Response Schema: Enforcing structured outputs (like JSON) requires the model to process the schema definition you provide, which consumes input tokens.
  • Chat History: In multi-turn conversations, the entire chat history is sent back to the model with every new message, meaning your input token count grows with each turn.
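
To see how quickly chat history adds up, here is a minimal arithmetic sketch. The `input_tokens_per_turn` helper and the token counts are hypothetical, and the model is simplified: it ignores thinking tokens, tool declarations, and any per-message overhead.

```python
def input_tokens_per_turn(
    system_tokens: int,
    turns: list[tuple[int, int]],
) -> list[int]:
    """Return the input token count sent with each turn of a chat.

    Each turn is (user_message_tokens, model_reply_tokens). The input for a
    turn is the system instruction, the full history so far, and the new
    user message.
    """
    counts: list[int] = []
    history_tokens = 0
    for user_tokens, reply_tokens in turns:
        counts.append(system_tokens + history_tokens + user_tokens)
        history_tokens += user_tokens + reply_tokens
    return counts


# Hypothetical conversation: a 200-token system instruction and three turns.
counts = input_tokens_per_turn(200, [(50, 300), (40, 250), (60, 400)])
print(counts)       # [250, 590, 900]
print(sum(counts))  # 1740
```

The three user messages total only 150 tokens, yet the three requests carry 1,740 input tokens between them because the history is resent each time.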

🧮 Multimodal Token Math

Multimodal inputs (images, audio, video, and documents) aren't tokenized like text. They usually have specific calculation rules based on the model (and its underlying tokenizers), the media type, and the request configuration.

Refer to the documentation for details on how token counts are calculated for each media type.

There are generally multiple tokenization options, even for a single modality. You can use the count_tokens method and the calculation rules to estimate the token count of your own payloads. To get a clearer picture, let's look at actual requests and see how token counts are broken down by modality…


🎯 Tracking Actual Token Usage

While estimating token counts is super useful, you should always rely on the usage_metadata returned in the API response when you need to track your actual usage down to the exact token. It's the single source of truth for billing.

Here's the gist of how usage_metadata lets you get the token counts by modality:

class GenerateContentResponse:
    # …
    usage_metadata: Optional[GenerateContentResponseUsageMetadata]
    # …


class GenerateContentResponseUsageMetadata:
    # …
    prompt_token_count: Optional[int]
    prompt_tokens_details: Optional[list[ModalityTokenCount]]
    # …


class ModalityTokenCount:
    modality: Optional[MediaModality]
    token_count: Optional[int]


class MediaModality(StrEnum):
    MODALITY_UNSPECIFIED = "MODALITY_UNSPECIFIED"
    TEXT = "TEXT"
    IMAGE = "IMAGE"
    VIDEO = "VIDEO"
    AUDIO = "AUDIO"
    DOCUMENT = "DOCUMENT"

🐍 Let's define a few helpers:

from google.genai.types import (
    FileData,
    GenerateContentResponse,
    MediaModality,
    Part,
    PartMediaResolution,
    PartMediaResolutionLevel,
    VideoMetadata,
)

TokensPerModality = dict[MediaModality, int]


def display_tokens_per_modality(response: GenerateContentResponse) -> None:
    usage_metadata = response.usage_metadata
    if not usage_metadata:
        print("⚠️ No usage metadata found in the response.")
        return
    prompt_tokens_details = usage_metadata.prompt_tokens_details or []
    tokens_per_modality = get_empty_tokens_per_modality()

    for tokens_details in prompt_tokens_details:
        modality = tokens_details.modality
        if modality and modality in tokens_per_modality:
            tokens_per_modality[modality] += tokens_details.token_count or 0

    prompt_token_count = usage_metadata.prompt_token_count or 0
    display_token_table(tokens_per_modality, prompt_token_count)


def get_empty_tokens_per_modality() -> TokensPerModality:
    return {
        modality: 0
        for modality in MediaModality
        if modality != MediaModality.MODALITY_UNSPECIFIED
    }


def display_token_table(
    tokens_per_modality: TokensPerModality,
    total_tokens: int,
) -> None:
    def yield_row() -> Iterator[list[str]]:
        yield [mod.value for mod in tokens_per_modality.keys()] + ["Total"]
        yield [":-:" for _ in range(len(tokens_per_modality) + 1)]
        yield [f"{t:,d}" for t in tokens_per_modality.values()] + [f"{total_tokens:,d}"]

    markdown = "\n".join("| " + " | ".join(row) + " |" for row in yield_row())
    IPython.display.display(IPython.display.Markdown(markdown))

Let's check a few examples…


🖼️ Image Tokenization

Image token counts depend on the image itself and the configured media resolution:

class PartMediaResolutionLevel(StrEnum):
    MEDIA_RESOLUTION_UNSPECIFIED = "MEDIA_RESOLUTION_UNSPECIFIED"
    MEDIA_RESOLUTION_LOW = "MEDIA_RESOLUTION_LOW"
    MEDIA_RESOLUTION_MEDIUM = "MEDIA_RESOLUTION_MEDIUM"
    MEDIA_RESOLUTION_HIGH = "MEDIA_RESOLUTION_HIGH"
    MEDIA_RESOLUTION_ULTRA_HIGH = "MEDIA_RESOLUTION_ULTRA_HIGH"

For a given media resolution level, the Gemini 3 tokenizers will use these maximum token budgets per image:

| media_resolution | Tokens |
| - | :-: |
| MEDIA_RESOLUTION_LOW | 280 |
| MEDIA_RESOLUTION_MEDIUM | 560 |
| MEDIA_RESOLUTION_HIGH (default) | 1,120 |
| MEDIA_RESOLUTION_ULTRA_HIGH | 2,240 |

🐍 Check how this cat image is tokenized by default:

def display_tokens_for_image(
    image_uri: str,
    media_resolution_level: PartMediaResolutionLevel | None = None,
) -> None:
    print(f"🧪 {media_resolution_level=}")
    contents = Part.from_uri(
        file_uri=image_uri,
        mime_type="image/*",
        media_resolution=(
            PartMediaResolution(level=media_resolution_level)
            if media_resolution_level
            else None
        ),
    )
    response = client.models.generate_content(model=MODEL_ID, contents=contents)
    display_tokens_per_modality(response)


IMAGE_URI = "https://storage.googleapis.com/cloud-samples-data/generative-ai/image/chair-cat.png"
display_tokens_for_image(IMAGE_URI)

🧪 media_resolution_level=None

| TEXT | IMAGE | VIDEO | AUDIO | DOCUMENT | Total |
| :-: | :-: | :-: | :-: | :-: | :-: |
| 0 | 1,080 | 0 | 0 | 0 | 1,080 |

💡 This image is tokenized into only 1,080 tokens (instead of the maximum 1,120), saving us 40 tokens! It's a nice touch that helps keep costs down rather than defaulting to the upper limit.

🐍 For less detailed images, you can reduce token counts by a factor of 2 or 4 using the medium or low levels:

display_tokens_for_image(IMAGE_URI, PartMediaResolutionLevel.MEDIA_RESOLUTION_LOW)

🧪 media_resolution_level=<PartMediaResolutionLevel.MEDIA_RESOLUTION_LOW: 'MEDIA_RESOLUTION_LOW'>

| TEXT | IMAGE | VIDEO | AUDIO | DOCUMENT | Total |
| :-: | :-: | :-: | :-: | :-: | :-: |
| 0 | 264 | 0 | 0 | 0 | 264 |

💡 At the other end of the media resolution range, the ultra-high level is great for detailed images (like a photo of a circuit board with many components), ensuring maximum visual understanding. An image at this level uses between 2,000 and 2,240 tokens.
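
Since actual image counts stay at or below the per-level budget, the budget table gives a simple offline upper bound. Here is a minimal sketch, assuming the budgets listed above; `IMAGE_MAX_TOKENS` and `max_image_tokens` are hypothetical names, and the result is a ceiling, not the exact count the API will report.

```python
# Maximum token budgets per image, copied from the table above.
IMAGE_MAX_TOKENS = {
    "MEDIA_RESOLUTION_LOW": 280,
    "MEDIA_RESOLUTION_MEDIUM": 560,
    "MEDIA_RESOLUTION_HIGH": 1_120,
    "MEDIA_RESOLUTION_ULTRA_HIGH": 2_240,
}
DEFAULT_IMAGE_LEVEL = "MEDIA_RESOLUTION_HIGH"


def max_image_tokens(image_count: int, level: str = DEFAULT_IMAGE_LEVEL) -> int:
    """Return an upper bound on the tokens used by image_count images."""
    return image_count * IMAGE_MAX_TOKENS[level]


print(max_image_tokens(1))                            # 1120
print(max_image_tokens(10, "MEDIA_RESOLUTION_LOW"))   # 2800
```

For the cat image, the bound (1,120) sits slightly above the 1,080 tokens actually reported, which is the safe direction for budgeting.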


🔊 Audio Tokenization

Audio tokenization currently uses 25 tokens per second to represent the audio stream semantically.

🐍 Here is the tokenization for a 3.049-second audio file:

def display_tokens_for_audio(audio_uri: str) -> None:
    contents = Part.from_uri(file_uri=audio_uri, mime_type="audio/*")
    response = client.models.generate_content(model=MODEL_ID, contents=contents)
    display_tokens_per_modality(response)


AUDIO_URI = "https://storage.googleapis.com/cloud-samples-data/generative-ai/audio/hello_gemini_are_you_there.wav"
display_tokens_for_audio(AUDIO_URI)

| TEXT | IMAGE | VIDEO | AUDIO | DOCUMENT | Total |
| :-: | :-: | :-: | :-: | :-: | :-: |
| 0 | 0 | 0 | 77 | 0 | 77 |

💡 ceil(3.049 s × 25 tok/s) = ceil(76.225 tok) = 77 tok

🐍 A longer, 30.772-second audio file requires 10 times as many tokens, as expected:

AUDIO_URI = "https://storage.googleapis.com/cloud-samples-data/generative-ai/audio/sailor_audio.mp3"
display_tokens_for_audio(AUDIO_URI)

| TEXT | IMAGE | VIDEO | AUDIO | DOCUMENT | Total |
| :-: | :-: | :-: | :-: | :-: | :-: |
| 0 | 0 | 0 | 770 | 0 | 770 |

💡 ceil(30.772 s × 25 tok/s) = ceil(769.3 tok) = 770 tok
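
The audio rule above is easy to apply offline once you know the duration. Here is a minimal sketch of that arithmetic; `estimate_audio_tokens` is a hypothetical helper, and the 25 tokens-per-second rate is the one observed in this article, which may differ for other models.

```python
import math

AUDIO_TOKENS_PER_SECOND = 25  # Rate stated above; may differ for other models


def estimate_audio_tokens(duration_s: float) -> int:
    """Estimate audio tokens by rounding duration x rate up to a whole token."""
    return math.ceil(duration_s * AUDIO_TOKENS_PER_SECOND)


print(estimate_audio_tokens(3.049))   # 77
print(estimate_audio_tokens(30.772))  # 770
```

Both estimates match the counts reported in the two tests above.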


🎬 Video Tokenization

For videos:

  • The audio tokenizer is the same as for standalone audio (25 tokens per second).
  • Video frames are sampled (1 FPS by default) and tokenized based on the media resolution.

For a given media resolution level, the Gemini 3 tokenizers will use these maximum token budgets per sampled frame:

| media_resolution | Max. tokens |
| - | :-: |
| MEDIA_RESOLUTION_LOW/MEDIA_RESOLUTION_MEDIUM (default) | 70 |
| MEDIA_RESOLUTION_HIGH | 280 |

🐍 Here's the tokenization for a 59-second video:

def display_tokens_for_video(
    video_uri: str,
    fps: float | None = None,
    media_resolution_level: PartMediaResolutionLevel | None = None,
) -> None:
    print(f"🧪 {fps=}, {media_resolution_level=}")
    contents = Part(
        file_data=FileData(file_uri=video_uri, mime_type="video/*"),
        video_metadata=VideoMetadata(fps=fps) if fps is not None else None,
        media_resolution=(
            PartMediaResolution(level=media_resolution_level)
            if media_resolution_level
            else None
        ),
    )
    response = client.models.generate_content(model=MODEL_ID, contents=contents)
    display_tokens_per_modality(response)


VIDEO_URI = "https://www.youtube.com/watch?v=0pJn3g8dfwk"
display_tokens_for_video(VIDEO_URI)

🧪 fps=None, media_resolution_level=None

| TEXT | IMAGE | VIDEO | AUDIO | DOCUMENT | Total |
| :-: | :-: | :-: | :-: | :-: | :-: |
| 0 | 0 | 3,894 | 1,475 | 0 | 5,369 |

💡 Details

  • Video: ceil(59 s × 1 frame/s × 66 tok/frame) = ceil(3894 tok) = 3894 tok
  • Audio: ceil(59 s × 25 tok/s) = ceil(1475 tok) = 1475 tok

🐍 Doubling the sampling rate requires twice as many video tokens:

display_tokens_for_video(VIDEO_URI, fps=2)

🧪 fps=2, media_resolution_level=None

| TEXT | IMAGE | VIDEO | AUDIO | DOCUMENT | Total |
| :-: | :-: | :-: | :-: | :-: | :-: |
| 0 | 0 | 7,788 | 1,475 | 0 | 9,263 |

💡 Details

  • Video: ceil(59 s × 2 frame/s × 66 tok/frame) = ceil(7788 tok) = 7788 tok
  • Audio: ceil(59 s × 25 tok/s) = ceil(1475 tok) = 1475 tok

🐍 If you switch from low/medium to high media resolution, sampled frames are tokenized in greater detail, requiring four times as many video tokens:

VIDEO_URI = "https://www.youtube.com/watch?v=0pJn3g8dfwk"
display_tokens_for_video(
    VIDEO_URI,
    media_resolution_level=PartMediaResolutionLevel.MEDIA_RESOLUTION_HIGH,
)

🧪 fps=None, media_resolution_level=<PartMediaResolutionLevel.MEDIA_RESOLUTION_HIGH: 'MEDIA_RESOLUTION_HIGH'>

| TEXT | IMAGE | VIDEO | AUDIO | DOCUMENT | Total |
| :-: | :-: | :-: | :-: | :-: | :-: |
| 0 | 0 | 15,576 | 1,475 | 0 | 17,051 |

💡 Details

  • Video: ceil(59 s × 1 frame/s × 264 tok/frame) = ceil(15576 tok) = 15576 tok
  • Audio: ceil(59 s × 25 tok/s) = ceil(1475 tok) = 1475 tok
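
The three breakdowns above follow the same formula, which can be sketched as a small offline estimator. `estimate_video_tokens` is a hypothetical helper. The 66 and 264 tokens-per-frame values are the ones observed for this particular video; other videos may land elsewhere under the 70 and 280 maximums.

```python
import math

AUDIO_TOKENS_PER_SECOND = 25  # Same rate as for standalone audio


def estimate_video_tokens(
    duration_s: float,
    tokens_per_frame: int,
    fps: float = 1.0,
    with_audio: bool = True,
) -> dict[str, int]:
    """Estimate video and audio tokens for a video input."""
    video = math.ceil(duration_s * fps * tokens_per_frame)
    audio = math.ceil(duration_s * AUDIO_TOKENS_PER_SECOND) if with_audio else 0
    return {"VIDEO": video, "AUDIO": audio, "Total": video + audio}


# 66 and 264 tok/frame are the values observed for this particular video.
print(estimate_video_tokens(59, tokens_per_frame=66))
print(estimate_video_tokens(59, tokens_per_frame=66, fps=2))
print(estimate_video_tokens(59, tokens_per_frame=264))
# Using the maximum budget (70 tok/frame) gives a conservative upper bound.
print(estimate_video_tokens(59, tokens_per_frame=70))
```

When the actual per-frame count is unknown, plugging in the maximum budget for the chosen media resolution keeps the estimate on the safe side.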

📄 Document Tokenization

For a given media resolution level, the Gemini 3 tokenizers will use these maximum token budgets per PDF page:

| media_resolution | Tokens |
| - | :-: |
| MEDIA_RESOLUTION_LOW | 280 |
| MEDIA_RESOLUTION_MEDIUM (default) | 560 |
| MEDIA_RESOLUTION_HIGH | 1,120 |

🐍 Here's the tokenization for a one-page PDF at different media resolutions:

def display_tokens_for_document(
    document_uri: str,
    media_resolution_level: PartMediaResolutionLevel | None = None,
) -> None:
    print(f"🧪 {media_resolution_level=}")
    contents = Part.from_uri(
        file_uri=document_uri,
        mime_type="application/pdf",
        media_resolution=(
            PartMediaResolution(level=media_resolution_level)
            if media_resolution_level
            else None
        ),
    )
    response = client.models.generate_content(model=MODEL_ID, contents=contents)
    display_tokens_per_modality(response)


DOCUMENT_URI = (
    "https://storage.googleapis.com/cloud-samples-data/generative-ai/pdf/invoice.pdf"
)
media_resolution_levels = [
    PartMediaResolutionLevel.MEDIA_RESOLUTION_LOW,
    PartMediaResolutionLevel.MEDIA_RESOLUTION_MEDIUM,
    PartMediaResolutionLevel.MEDIA_RESOLUTION_HIGH,
]
for media_resolution_level in media_resolution_levels:
    display_tokens_for_document(DOCUMENT_URI, media_resolution_level)

🧪 media_resolution_level=<PartMediaResolutionLevel.MEDIA_RESOLUTION_LOW: 'MEDIA_RESOLUTION_LOW'>

| TEXT | IMAGE | VIDEO | AUDIO | DOCUMENT | Total |
| :-: | :-: | :-: | :-: | :-: | :-: |
| 0 | 266 | 0 | 0 | 0 | 266 |

🧪 media_resolution_level=<PartMediaResolutionLevel.MEDIA_RESOLUTION_MEDIUM: 'MEDIA_RESOLUTION_MEDIUM'>

| TEXT | IMAGE | VIDEO | AUDIO | DOCUMENT | Total |
| :-: | :-: | :-: | :-: | :-: | :-: |
| 0 | 532 | 0 | 0 | 0 | 532 |

🧪 media_resolution_level=<PartMediaResolutionLevel.MEDIA_RESOLUTION_HIGH: 'MEDIA_RESOLUTION_HIGH'>

| TEXT | IMAGE | VIDEO | AUDIO | DOCUMENT | Total |
| :-: | :-: | :-: | :-: | :-: | :-: |
| 0 | 1,092 | 0 | 0 | 0 | 1,092 |

💡 Remarks

  • Low: 266 tok/pg
  • Medium: 532 tok/pg
  • High: 1092 tok/pg

🐍 Here's another test for a 15-page PDF:

DOCUMENT_URI = "https://storage.googleapis.com/cloud-samples-data/generative-ai/pdf/1706.03762v7.pdf"
for media_resolution_level in media_resolution_levels:
    display_tokens_for_document(DOCUMENT_URI, media_resolution_level)

🧪 media_resolution_level=<PartMediaResolutionLevel.MEDIA_RESOLUTION_LOW: 'MEDIA_RESOLUTION_LOW'>

| TEXT | IMAGE | VIDEO | AUDIO | DOCUMENT | Total |
| :-: | :-: | :-: | :-: | :-: | :-: |
| 0 | 3,990 | 0 | 0 | 0 | 3,990 |

🧪 media_resolution_level=<PartMediaResolutionLevel.MEDIA_RESOLUTION_MEDIUM: 'MEDIA_RESOLUTION_MEDIUM'>

| TEXT | IMAGE | VIDEO | AUDIO | DOCUMENT | Total |
| :-: | :-: | :-: | :-: | :-: | :-: |
| 0 | 7,800 | 0 | 0 | 0 | 7,800 |

🧪 media_resolution_level=<PartMediaResolutionLevel.MEDIA_RESOLUTION_HIGH: 'MEDIA_RESOLUTION_HIGH'>

| TEXT | IMAGE | VIDEO | AUDIO | DOCUMENT | Total |
| :-: | :-: | :-: | :-: | :-: | :-: |
| 0 | 16,530 | 0 | 0 | 0 | 16,530 |

💡 Remarks

  • Low: 3990 tok / 15 pg = 266 tok/pg
  • Medium: 7800 tok / 15 pg = 520 tok/pg
  • High: 16530 tok / 15 pg = 1102 tok/pg
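
As with images, the per-page budgets give a quick offline ceiling for PDFs. Here is a minimal sketch that compares that ceiling with the counts reported in the two tests above. `PDF_MAX_TOKENS_PER_PAGE`, `OBSERVED`, and `max_pdf_tokens` are hypothetical names, and the short level keys stand in for the full `MEDIA_RESOLUTION_*` values.

```python
# Maximum token budgets per PDF page, copied from the table above.
PDF_MAX_TOKENS_PER_PAGE = {"LOW": 280, "MEDIUM": 560, "HIGH": 1_120}

# (pages, level) -> tokens reported in the two tests above.
OBSERVED = {
    (1, "LOW"): 266,
    (1, "MEDIUM"): 532,
    (1, "HIGH"): 1_092,
    (15, "LOW"): 3_990,
    (15, "MEDIUM"): 7_800,
    (15, "HIGH"): 16_530,
}


def max_pdf_tokens(pages: int, level: str = "MEDIUM") -> int:
    """Return an upper bound on the tokens used by a PDF with this many pages."""
    return pages * PDF_MAX_TOKENS_PER_PAGE[level]


for (pages, level), observed in OBSERVED.items():
    upper = max_pdf_tokens(pages, level)
    ratio = observed / upper
    print(f"{pages:>2} pg {level:<6} observed={observed:>6,} max={upper:>6,} ({ratio:.0%})")
```

In these two tests, the reported counts stay within the page-count ceiling at every level, so `pages × budget` is a reasonable conservative estimate for planning.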

🎉 Conclusion

You've now mastered token counting both locally and via the Gemini API!

With the LocalTokenizer, you can estimate text token counts completely offline, saving bandwidth and avoiding rate limits. You've also seen how Gemini's multimodal tokenizers handle images, audio, video, and PDFs, and how to extract precise token usage from usage_metadata for accurate tracking and billing.

