Patrick Loeber for Google AI

Gemini 3.1 Flash-Lite: Developer guide and use cases

Gemini 3.1 Flash-Lite is the high-volume, affordable powerhouse of the Gemini family. It’s purpose-built for large-scale tasks where speed and cost-efficiency are the main priorities, making it the ideal engine for background processing. Whether you're handling a constant stream of user interactions or need to process massive datasets with tasks like translation, transcription, or extraction, Flash-Lite provides the optimal balance of speed and capability.

This guide walks through seven practical use cases for Flash-Lite using the google-genai Python SDK.

Setup

Install the SDK and configure your API key:

# pip install -U google-genai
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

1. Translation

If you're processing user-generated content at scale, such as chat messages, reviews, or support tickets, you need fast, cheap translation. Flash-Lite handles high-volume translation well, and you can use system instructions to constrain it to output only the translated text with no extra commentary.

text = "Hey, are you down to grab some pizza later? I'm starving!"

response = client.models.generate_content(
    model="gemini-3.1-flash-lite-preview",
    config={
        "system_instruction": "Only output the translated text"
    },
    contents=f"Translate the following text to German: {text}"
)

print(response.text)
# Hey, hast du Lust, später eine Pizza essen zu gehen? Ich habe riesigen Hunger!
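
For a stream of messages, the call above can be wrapped in a small helper. The following is a minimal sketch rather than part of the original example: `translate_messages` and its parameters are hypothetical names, and `client` is assumed to be the `genai.Client` created in Setup. It sends one request per message, sequentially, and reuses the same system instruction.

```python
def translate_messages(client, messages, target_language,
                       model="gemini-3.1-flash-lite-preview"):
    """Translate each message with one request per message.

    Hypothetical helper: `client` is assumed to be the genai.Client from Setup.
    """
    translations = []
    for message in messages:
        response = client.models.generate_content(
            model=model,
            config={"system_instruction": "Only output the translated text"},
            contents=f"Translate the following text to {target_language}: {message}",
        )
        # response.text can be None when no text is returned; normalize to "".
        translations.append((response.text or "").strip())
    return translations
```

A call such as `translate_messages(client, ["Where is my order?", "Great product!"], "German")` would return the translations in input order. For very large volumes, the Batch API described later may be a better fit than a sequential loop.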

2. Transcription

Flash-Lite supports multimodal inputs and handles speech-to-text quickly and at scale: you can pass audio files such as recordings, memos, or voice inputs directly for transcription. You can also use the prompt in the same request to ask for a specific transcript format, so the output is ready for downstream tasks like agent hand-offs or other workflows.

# Example audio file (assumed here to be saved locally as sample.mp3):
# https://storage.googleapis.com/generativeai-downloads/data/State_of_the_Union_Address_30_January_1961.mp3

# Upload the audio file to the GenAI File API
uploaded_file = client.files.upload(file="sample.mp3")

prompt = "Generate a transcript of the audio."
# prompt = "Generate a transcript of the audio. Remove filler words such as 'um', 'uh', 'like'."

response = client.models.generate_content(
    model="gemini-3.1-flash-lite-preview",
    contents=[prompt, uploaded_file]
)

print(response.text)
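
To illustrate requesting a specific format, here is a hedged sketch that asks for one utterance per line and parses the result. The prompt wording, the `[MM:SS] Speaker: text` layout, and the names `FORMATTED_PROMPT` and `parse_transcript` are assumptions made for this example. The model may not follow the layout on every line, so the parser skips anything that does not match.

```python
import re

# Hypothetical prompt: asks for one utterance per line in a fixed layout.
FORMATTED_PROMPT = (
    "Generate a transcript of the audio. "
    "Write one utterance per line in the form '[MM:SS] Speaker: text'."
)

# Matches lines such as "[01:15] Speaker 1: Good evening."
LINE_PATTERN = re.compile(r"^\[(\d{1,2}):(\d{2})\]\s*([^:]+):\s*(.*)$")


def parse_transcript(text):
    """Parse formatted transcript lines into dicts; skip lines that do not match."""
    utterances = []
    for line in text.splitlines():
        match = LINE_PATTERN.match(line.strip())
        if match is None:
            continue  # the model may emit blank or unformatted lines
        minutes, seconds, speaker, spoken = match.groups()
        utterances.append({
            "start_seconds": int(minutes) * 60 + int(seconds),
            "speaker": speaker.strip(),
            "text": spoken.strip(),
        })
    return utterances
```

You would pass `FORMATTED_PROMPT` in place of `prompt` in the request above and then call `parse_transcript(response.text)` to get a list that downstream steps can consume.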

3. Lightweight Agentic Tasks and Data Extraction

Flash-Lite supports structured JSON output, which makes it a good fit for entity extraction, classification, and lightweight data processing pipelines. You define your output schema (here using Pydantic) and the model returns valid JSON that conforms to it.

In this example, we extract structured data from an e-commerce customer review: the specific product aspect mentioned, a summary quote, a sentiment score, and whether the customer is likely to return the item.

from pydantic import BaseModel, Field

prompt = "Analyze the user review and determine the aspect, sentiment score, summary quote, and return risk"
input_text = "The boots look amazing and the leather is high quality, but they run way too small. I'm sending them back."

class ReviewAnalysis(BaseModel):
    aspect: str = Field(description="The feature mentioned (e.g., Price, Comfort, Style, Shipping)")
    summary_quote: str = Field(description="The specific phrase from the review about this aspect")
    sentiment_score: int = Field(description="1 to 5 (1=worst, 5=best)")
    is_return_risk: bool = Field(description="True if the user mentions returning the item")

response = client.models.generate_content(
    model="gemini-3.1-flash-lite-preview",
    contents=[prompt, input_text],
    config={
        "response_mime_type": "application/json",
        "response_json_schema": ReviewAnalysis.model_json_schema(),
    },
)

print(response.text)
# {
#  "aspect": "Size",
#  "summary_quote": "they run way too small",
#  "sentiment_score": 2,
#  "is_return_risk": true
# }
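
The schema fixes the field names and types, but the 1 to 5 range for `sentiment_score` appears only in the field description, so a pipeline may want to check it before storing the result. The sketch below uses only the standard library. `ParsedReview` and `parse_review` are hypothetical names introduced for illustration.

```python
import json
from dataclasses import dataclass


@dataclass
class ParsedReview:
    aspect: str
    summary_quote: str
    sentiment_score: int
    is_return_risk: bool


def parse_review(raw_json):
    """Parse the model's JSON text and check the documented 1-5 score range."""
    data = json.loads(raw_json)
    review = ParsedReview(
        aspect=str(data["aspect"]),
        summary_quote=str(data["summary_quote"]),
        sentiment_score=int(data["sentiment_score"]),
        is_return_risk=bool(data["is_return_risk"]),
    )
    if not 1 <= review.sentiment_score <= 5:
        raise ValueError(f"sentiment_score out of range: {review.sentiment_score}")
    return review
```

Calling `parse_review(response.text)` returns a typed object or raises on a missing key or an out-of-range score. If you prefer to stay within Pydantic, `ReviewAnalysis.model_validate_json(response.text)` performs the type validation against the model defined above.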

4. Document Processing & Summarization

Flash-Lite handles high-volume document tasks with ease, from parsing PDFs for concise summaries to performing cross-source comparisons. It is also an ideal fit for document processing pipelines that require quick triage, enabling you to categorize incoming files, run simple pass/fail checks, or perform standard data extraction.

import httpx

# Download PDF document
doc_url = "https://storage.googleapis.com/generativeai-downloads/data/med_gemini.pdf"
doc_data = httpx.get(doc_url).content

prompt = "Summarize this document"
response = client.models.generate_content(
    model="gemini-3.1-flash-lite-preview",
    contents=[
        types.Part.from_bytes(
            data=doc_data,
            mime_type='application/pdf',
        ),
        prompt
    ]
)

print(response.text)
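
The triage use case mentioned above (categorizing files and running a pass/fail check) can be sketched by combining the PDF input with structured output. This is an illustrative sketch under stated assumptions: `build_triage_config`, `triage_document`, the `category` and `passes_check` fields, and the prompt wording are all hypothetical, and `document_part` is assumed to be built with `types.Part.from_bytes` as in the example above.

```python
import json


def build_triage_config(categories):
    """Build a config dict that restricts the answer to one of `categories`."""
    return {
        "response_mime_type": "application/json",
        "response_json_schema": {
            "type": "object",
            "properties": {
                "category": {"type": "string", "enum": list(categories)},
                "passes_check": {"type": "boolean"},
            },
            "required": ["category", "passes_check"],
        },
    }


def triage_document(client, document_part, categories, check,
                    model="gemini-3.1-flash-lite-preview"):
    """Classify one document and run a simple pass/fail check on it.

    `document_part` is assumed to be built as in the example above, e.g.
    types.Part.from_bytes(data=..., mime_type='application/pdf').
    """
    prompt = (
        "Assign the document to exactly one category and decide whether it "
        f"passes this check: {check}"
    )
    response = client.models.generate_content(
        model=model,
        contents=[document_part, prompt],
        config=build_triage_config(categories),
    )
    return json.loads(response.text)
```

The `enum` constraint keeps the category within a known set, which makes the result easy to branch on in a pipeline.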

5. Model routing

You don't want to send every request to your most expensive model. A common pattern is to use a fast, cheap model as a classifier that routes queries to the appropriate model based on task complexity. Flash-Lite works well for this because the routing call itself needs to be low-latency and low-cost.

A real-world example of this pattern is the open-source Gemini CLI, which uses Flash-Lite to classify task complexity and route to Gemini Flash or Pro. The following example is adapted from the CLI’s classifier strategy.

FLASH_MODEL = 'flash'
PRO_MODEL = 'pro'

CLASSIFIER_SYSTEM_PROMPT = f"""
You are a specialized Task Routing AI. Your sole function is to analyze the user's request and classify its complexity. Choose between `{FLASH_MODEL}` (SIMPLE) or `{PRO_MODEL}` (COMPLEX).
1.  `{FLASH_MODEL}`: A fast, efficient model for simple, well-defined tasks.
2.  `{PRO_MODEL}`: A powerful, advanced model for complex, open-ended, or multi-step tasks.

A task is COMPLEX if it meets ONE OR MORE of the following criteria:
1.  High Operational Complexity (Est. 4+ Steps/Tool Calls)
2.  Strategic Planning and Conceptual Design
3.  High Ambiguity or Large Scope
4.  Deep Debugging and Root Cause Analysis

A task is SIMPLE if it is highly specific, bounded, and has Low Operational Complexity (Est. 1-3 tool calls).
"""

user_input = "I'm getting an error 'Cannot read property 'map' of undefined' when I click the save button. Can you fix it?"

response_schema = {
  "type": "object",
  "properties": {
    "reasoning": {
      "type": "string",
      "description": "A brief, step-by-step explanation for the model choice, referencing the rubric."
    },
    "model_choice": {
      "type": "string",
      "enum": [FLASH_MODEL, PRO_MODEL]
    }
  },
  "required": ["reasoning", "model_choice"]
}

response = client.models.generate_content(
    model="gemini-3.1-flash-lite-preview",
    contents=user_input,
    config={
        "system_instruction": CLASSIFIER_SYSTEM_PROMPT,
        "response_mime_type": "application/json",
        "response_json_schema": response_schema
    },
)

print(response.text)
# {
#   "reasoning": "The user is reporting an error symptom without a known cause. This requires investigation to identify the root cause, which falls under 'Deep Debugging & Root Cause Analysis'.",
#   "model_choice": "pro"
# }
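
The classifier only returns a label, so the application still has to map that label to a concrete model and decide what to do with a malformed answer. The following sketch shows one way to do that. `MODEL_IDS`, `pick_model`, and the placeholder model IDs are hypothetical, and falling back to the stronger model is just one possible policy.

```python
import json

# Placeholder model IDs: replace with the Flash and Pro models you actually use.
MODEL_IDS = {
    "flash": "YOUR_FLASH_MODEL_ID",
    "pro": "YOUR_PRO_MODEL_ID",
}


def pick_model(classifier_json, default_choice="pro"):
    """Map the classifier's JSON answer to a model ID.

    Falls back to `default_choice` when the answer is malformed or unknown.
    """
    try:
        choice = json.loads(classifier_json).get("model_choice")
    except (json.JSONDecodeError, AttributeError, TypeError):
        choice = None
    if not isinstance(choice, str):
        choice = None
    return MODEL_IDS.get(choice, MODEL_IDS[default_choice])
```

After the routing call, `pick_model(response.text)` gives the model to use for the actual request, for example `client.models.generate_content(model=pick_model(response.text), contents=user_input)`.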

6. Thinking with Gemini Flash-Lite

Flash-Lite supports configurable thinking levels, allowing the model to allocate additional compute to internal reasoning before producing a final response. This is ideal for tasks that benefit from step-by-step logic, such as math, coding, or multi-constraint problems, where you need higher accuracy while maintaining the efficiency of the Flash-Lite model. By default, Flash-Lite’s thinking level is set to minimal, but it can be adjusted to low, medium, or high depending on the complexity of your task.

response = client.models.generate_content(
    model="gemini-3.1-flash-lite-preview",
    contents="How does AI work?",
    config={
        "thinking_config": {"thinking_level": "high"}
    },
)

print(response.text)

For more on configuring thinking levels, see the Gemini API docs.
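
If different request types need different amounts of reasoning, the level can be chosen per request. The sketch below is only an illustration: the task names and the mapping in `THINKING_LEVEL_BY_TASK` are assumptions to tune for your own workload, and `thinking_config_for` is a hypothetical helper. The four level names are the ones listed above.

```python
# Hypothetical mapping from task type to thinking level; tune it for your workload.
THINKING_LEVEL_BY_TASK = {
    "translation": "minimal",
    "extraction": "low",
    "coding": "medium",
    "math": "high",
}


def thinking_config_for(task_type):
    """Return a config dict for `task_type`, defaulting to the minimal level."""
    level = THINKING_LEVEL_BY_TASK.get(task_type, "minimal")
    return {"thinking_config": {"thinking_level": level}}
```

The returned dict can be passed directly as `config=thinking_config_for("math")` in `generate_content`. Unknown task types fall back to `minimal`, which matches the default described above.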

7. Batch API

If you have large volumes of data to process and low latency isn't a priority, the Gemini Batch API is the perfect companion for Flash-Lite. It is designed specifically for asynchronous, high-throughput tasks at 50% of the standard cost. The target turnaround time is 24 hours, but in the majority of cases, it is much quicker.

You can implement the Batch API in your workflow using the following pattern:

# Create a JSONL file with your requests and upload it
uploaded_batch_requests = client.files.upload(file="batch_requests.json")

# Create the batch job
batch_job = client.batches.create(
    model="gemini-3.1-flash-lite-preview",
    src=uploaded_batch_requests.name,
    config={'display_name': "batch_job-1"}
)

print(f"Created batch job: {batch_job.name}")

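# Poll until the job reaches a terminal state (target turnaround is 24 hours)
import time

completed_states = {
    'JOB_STATE_SUCCEEDED',
    'JOB_STATE_FAILED',
    'JOB_STATE_CANCELLED',
    'JOB_STATE_EXPIRED',
}
batch_job = client.batches.get(name=batch_job.name)
while batch_job.state.name not in completed_states:
    time.sleep(60)
    batch_job = client.batches.get(name=batch_job.name)

if batch_job.state.name == 'JOB_STATE_SUCCEEDED':
    result_file_name = batch_job.dest.file_name
    file_content_bytes = client.files.download(file=result_file_name)
    file_content = file_content_bytes.decode('utf-8')

    # Each line of the result file holds one response
    for line in file_content.splitlines():
        print(line)
else:
    print(f"Batch job ended with state: {batch_job.state.name}")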
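
The pattern above assumes the requests file already exists. A minimal sketch of how it could be generated follows. `write_batch_requests` is a hypothetical helper, and the per-line layout (a `key` plus a `request` object containing `contents`) reflects the Batch API input format as I understand it, so verify it against the current Gemini API docs before relying on it.

```python
import json


def write_batch_requests(prompts, path="batch_requests.json"):
    """Write one JSON request per line (JSONL) and return the number written."""
    count = 0
    with open(path, "w", encoding="utf-8") as handle:
        for index, prompt in enumerate(prompts, start=1):
            record = {
                # The key lets you match each response back to its request.
                "key": f"request-{index}",
                "request": {"contents": [{"parts": [{"text": prompt}]}]},
            }
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count
```

For example, `write_batch_requests(["Translate to German: Hello", "Translate to German: Goodbye"])` writes two lines to `batch_requests.json`, which can then be uploaded as shown above.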

Conclusion

Gemini 3.1 Flash-Lite excels at the "boring but big" tasks that define high-scale production. It serves as a versatile workhorse for everything from data extraction to agentic routing, enabling you to build more balanced and efficient AI architectures. By using Flash-Lite for high-volume background processing, you can reserve larger models for the requests that need them and keep operational costs in check.

See the following resources to learn more: