
Evan Lin for Google Developer Experts

Originally published at evanlin.com

Gemini 3 Flash: Agentic Vision in LINE Bot - AI Image Annotation and More

Background


After I completed the Multi-Agent Orchestration architecture for my LINE Bot, its image analysis feature still simply sent the image to gemini-2.5-flash for recognition. Then Google released Agentic Vision for Gemini 3 Flash in January 2026, which lets the model not only "see" the image but also actively write Python code to zoom in on, crop, and annotate it.

This made me think of an interesting use case:

A user sends a photo and says, "Help me mark the coffee," and the AI not only replies with a text description but also draws a bounding box and annotates it on the image, then sends the annotated image back to LINE.

This article documents the complete process of implementing this function, including the pitfalls and solutions.


What is Agentic Vision?

Traditional image analysis is static: you give the model an image, and the model returns a text description.

Agentic Vision turns image understanding into an active investigation process, using a Think → Act → Observe cycle:

Agentic Vision process (Think → Act → Observe)

1. Think   - Analyze the image and plan how to investigate further
2. Act     - Write Python code (crop, zoom in, annotate, calculate)
3. Observe - Inspect the execution results (including any generated annotated image)
4. Repeat steps 1-3 until the analysis is complete


Technical Core

  • Model: gemini-3-flash-preview
  • Key Feature: code_execution tool — allows the model to write and execute Python code
  • Output: In addition to text analysis, it can also return annotated images generated by the model
# Enable Agentic Vision API call
response = client.models.generate_content(
    model="gemini-3-flash-preview",
    contents=[image_part, "Help me mark the coffee"],
    config=types.GenerateContentConfig(
        tools=[types.Tool(code_execution=types.ToolCodeExecution)],
        thinking_config=types.ThinkingConfig(thinkingBudget=2048),
    )
)

# Response contains multiple parts: text, code, execution results, annotated images
for part in response.candidates[0].content.parts:
    if part.text:                   # Text analysis
        print(part.text)
    if part.executable_code:        # Python code written by the model
        print(part.executable_code.code)
    if part.code_execution_result:  # Code execution results
        print(part.code_execution_result.output)
    if part.as_image():             # Generated annotated image!
        annotated_image = part.as_image()


Functional Design

User Experience Flow

Instead of directly analyzing the image upon receiving it, it's changed to let the user choose a mode first:

User sends an image
        │
        ▼
"📷 Image received, please select an analysis method:"
   [Recognize Image]  [Agentic Vision]      ← Quick Reply buttons
        │                     │
        ▼                     ▼
 gemini-2.5-flash       User inputs an instruction
 directly returns a     ("Help me mark the coffee")
 text description             │
                              ▼
                       gemini-3-flash-preview
                       + code_execution
                              │
                        ┌─────┴─────┐
                        ▼           ▼
                  Text analysis   Annotated image
                        │           │
                        ▼           ▼
               LINE TextSendMessage + ImageSendMessage


Why two steps?

Agentic Vision requires the user to provide specific instructions (e.g., "Mark everyone," "Count how many cats"), unlike general recognition which only needs to "describe the image." Therefore, after selecting Agentic Vision, the user is first asked to input their desired goal.


Implementation Details

1. Image Temporary Storage Mechanism

Because LINE's Quick Reply is asynchronous (the button tap arrives later as a separate PostbackEvent), the image has to be stored temporarily:

# main.py
image_temp_store: Dict[str, bytes] = {} # Temporary image storage (user_id → bytes)
pending_agentic_vision: Dict[str, bool] = {} # Waiting for user to input instructions


Process:

  1. Receive image → store in image_temp_store[user_id]
  2. User clicks "Agentic Vision" → set pending_agentic_vision[user_id] = True
  3. User inputs text → detect pending state, retrieve image + text and send them for analysis
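A minimal sketch of step 3, where the text-message handler consumes the pending state. The handler and helper names (handle_text_message, the error reply) are illustrative, not necessarily the exact ones in the repo, and the sketch assumes the main.py globals (line_bot_api, the two dicts) already exist:

# Sketch: inside the LINE text-message handler, check for a pending Agentic Vision request.
def handle_text_message(event):
    user_id = event.source.user_id
    text = event.message.text

    if pending_agentic_vision.pop(user_id, False):
        image_bytes = image_temp_store.pop(user_id, None)
        if image_bytes is None:
            line_bot_api.reply_message(
                event.reply_token,
                TextSendMessage(text="The image has expired, please send it again."))
            return
        # Hand the stored image plus the user's instruction to the Agentic Vision flow
        handle_agentic_vision_with_prompt(event, image_bytes, prompt=text)
        return

    # Otherwise fall through to the normal Multi-Agent Orchestration text flow
    ...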

2. Quick Reply Implementation

Use LINE SDK's PostbackAction, consistent with the existing YouTube summary and location search Quick Reply modes:

import json

from linebot.models import PostbackAction, QuickReply, QuickReplyButton

quick_reply_buttons = QuickReply(
    items=[
        QuickReplyButton(
            action=PostbackAction(
                label="Recognize Image",
                data=json.dumps({"action": "image_analyze", "mode": "recognize"}),
                display_text="Recognize Image"
            )
        ),
        QuickReplyButton(
            action=PostbackAction(
                label="Agentic Vision",
                data=json.dumps({"action": "image_analyze", "mode": "agentic_vision"}),
                display_text="Agentic Vision"
            )
        ),
    ]
)

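On the postback side, the JSON payload from these buttons gets parsed and dispatched by mode. A rough sketch, assuming the WebhookHandler-style registration of line-bot-sdk v1/v2 (adapt if the project uses WebhookParser; names are illustrative):

# Sketch: dispatch the Quick Reply postback by mode.
@handler.add(PostbackEvent)
def handle_postback(event):
    data = json.loads(event.postback.data)
    if data.get("action") != "image_analyze":
        return

    user_id = event.source.user_id
    if data.get("mode") == "agentic_vision":
        # Two-step flow: remember the user and ask for an instruction.
        pending_agentic_vision[user_id] = True
        line_bot_api.reply_message(
            event.reply_token,
            TextSendMessage(text="Please tell me what to do with the image, e.g. 'Help me mark the coffee'."))
    else:
        # "recognize": run the original gemini-2.5-flash path right away.
        image_bytes = image_temp_store.pop(user_id, None)
        ...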

3. Agentic Vision Analysis Core

# tools/summarizer.py
from google.genai import types

def analyze_image_agentic(image_data: bytes, prompt: str) -> dict:
    client = _get_vertex_client()

    contents = [
        types.Part.from_text(text=prompt),
        types.Part.from_bytes(data=image_data, mime_type="image/png")
    ]

    response = client.models.generate_content(
        model="gemini-3-flash-preview",
        contents=contents,
        config=types.GenerateContentConfig(
            temperature=0.5,
            max_output_tokens=4096,
            tools=[types.Tool(code_execution=types.ToolCodeExecution)],
            thinking_config=types.ThinkingConfig(thinkingBudget=2048),
        )
    )

    result_parts = []
    generated_images = []

    for part in response.candidates[0].content.parts:
        if hasattr(part, 'thought') and part.thought:
            continue # Skip thinking parts
        if part.text is not None:
            result_parts.append(part.text)
        if part.code_execution_result is not None:
            result_parts.append(f"[Code Output]: {part.code_execution_result.output}")
        # Extract the annotated images generated by the model
        img = part.as_image()
        if img is not None:
            generated_images.append(img.image_bytes)

    return {
        "status": "success",
        "analysis": "\n".join(result_parts),
        "images": generated_images # Annotated image bytes
    }

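For context, here is a hedged sketch of how a caller might turn this result dict into LINE messages. It uses the _create_image_send_message() helper shown later in this post; the surrounding names (image_bytes, user_id) are illustrative:

# Sketch: convert the analysis result into LINE messages and push them to the user.
result = analyze_image_agentic(image_bytes, "Help me mark the coffee")

messages = [TextSendMessage(text=result["analysis"][:4900])]  # LINE caps a text message at 5000 characters
for img_bytes in result["images"][:4]:                        # push_message accepts at most 5 messages
    messages.append(_create_image_send_message(img_bytes))

line_bot_api.push_message(user_id, messages)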

4. Image Return Mechanism

LINE's ImageSendMessage requires a publicly reachable HTTPS URL. Since the bot already runs on Cloud Run (which serves public HTTPS out of the box), we simply add an image-serving endpoint to FastAPI:

# Temporary storage of annotated images (UUID → bytes, 5 minutes TTL)
annotated_image_store: Dict[str, dict] = {}

@app.get("/images/{image_id}")
def serve_annotated_image(image_id: str):
    """Provide temporary annotated images for LINE to download"""
    entry = annotated_image_store.get(image_id)
    if not entry:
        raise HTTPException(status_code=404)
    if time.time() - entry["created_at"] > 300: # 5 minutes expired
        annotated_image_store.pop(image_id, None)
        raise HTTPException(status_code=404)
    return Response(content=entry["data"], media_type="image/png")


Automatically detect the App's base URL (from the webhook request headers):

@app.post("/")
async def handle_webhook_callback(request: Request):
    global app_base_url
    if not app_base_url:
        forwarded_proto = request.headers.get('x-forwarded-proto', 'https')
        host = request.headers.get('x-forwarded-host') or request.headers.get('host', '')
        if host:
            app_base_url = f"{forwarded_proto}://{host}"


Finally, combine into ImageSendMessage:

def _create_image_send_message(image_bytes: bytes):
    image_id = store_annotated_image(image_bytes)
    image_url = f"{app_base_url}/images/{image_id}"
    return ImageSendMessage(
        original_content_url=image_url,
        preview_image_url=image_url,
    )

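The store_annotated_image() helper used above isn't shown in the post; a minimal sketch of what it presumably looks like (a uuid key plus a timestamp, matching the 5-minute TTL check in the /images endpoint):

import time
import uuid

def store_annotated_image(image_bytes: bytes) -> str:
    """Store annotated image bytes and return the id used in /images/{image_id}."""
    image_id = uuid.uuid4().hex
    annotated_image_store[image_id] = {
        "data": image_bytes,
        "created_at": time.time(),  # compared against the 5-minute TTL when serving
    }
    return image_id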

Results Showcase

(Screenshots: the LINE conversation from 2026-01-28 demonstrating the annotated-image reply.)

Pitfalls Encountered

Pitfall 1: from_image_bytes Does Not Exist

ERROR: Error analyzing image: from_image_bytes


Reason: the google-genai SDK has no types.Part.from_image_bytes() method; the correct one is types.Part.from_bytes().

# ❌ Incorrect
types.Part.from_image_bytes(data=image_data, mime_type="image/png")

# ✅ Correct
types.Part.from_bytes(data=image_data, mime_type="image/png")


Pitfall 2: ThinkingLevel enum Does Not Exist

ERROR: module 'google.genai.types' has no attribute 'ThinkingLevel'


Reason: ThinkingConfig in google-genai==1.49.0 only supports thinkingBudget (integer), and does not support the thinking_level enum. Context7 and the examples in the official documentation are based on a newer version of the SDK.

# ❌ Does not exist in v1.49.0
types.ThinkingConfig(thinking_level=types.ThinkingLevel.MEDIUM)

# ✅ v1.49.0 supported method
types.ThinkingConfig(thinkingBudget=2048)


Lesson: AI-generated code examples may be written against a newer or older SDK version; always run python -c "from google.genai import types; help(types.ThinkingConfig)" to confirm which parameters are actually available.

Pitfall 3: Incomplete Image Recognition Results

Reason: gemini-2.5-flash enables thinking by default, and thinking tokens count against the max_output_tokens quota. I had originally set max_output_tokens=2048; thinking consumed most of that budget, so the actual reply was truncated.

# ❌ Before: thinking consumed most of the token quota
config=types.GenerateContentConfig(
    max_output_tokens=2048,
)

# ✅ After: Disable thinking + increase token quota
config=types.GenerateContentConfig(
    max_output_tokens=8192,
    thinking_config=types.ThinkingConfig(thinkingBudget=0), # Disable thinking
)


Key Point: For simple image descriptions, thinking is unnecessary overhead. Setting thinkingBudget=0 disables it, leaving the entire token budget for the reply.


Modified Files

| File | Modification |
|------|--------------|
| main.py | Quick Reply flow, image temporary storage, pending-state management, image-serving endpoint, ImageSendMessage reply |
| tools/summarizer.py | Added analyze_image_agentic(), corrected from_bytes, corrected ThinkingConfig, disabled thinking for plain image recognition |
| agents/vision_agent.py | Added analyze_agentic() method |
| agents/orchestrator.py | Added process_image_agentic() routing method |


Complete Architecture

The original VisionAgent only had one path, now it becomes:

LINE Image Message
      │
      ▼
handle_image_message()
      │
      ├── image_temp_store[user_id] = image_bytes
      │
      ▼
Quick Reply: "Recognize Image" / "Agentic Vision"
      │                          │
      ▼                          ▼
handle_image_analyze_      pending_agentic_vision[user_id] = True
postback()                       │
      │                          ▼
      │                    User inputs text instructions
      │                          │
      │                          ▼
      │                    handle_agentic_vision_with_prompt()
      │                          │
      ▼                          ▼
orchestrator               orchestrator
.process_image()           .process_image_agentic(prompt=user instructions)
      │                          │
      ▼                          ▼
VisionAgent.analyze()      VisionAgent.analyze_agentic()
      │                          │
      ▼                          ▼
analyze_image()            analyze_image_agentic()
gemini-2.5-flash           gemini-3-flash-preview
thinkingBudget=0           + code_execution
      │                    + thinkingBudget=2048
      ▼                          │
TextSendMessage                  ├── Text analysis → TextSendMessage
                                 ├── Annotated image → /images/{uuid} → ImageSendMessage
                                 └── push_message([text, image])


Development Experience

1. SDK Version Differences are the Biggest Pitfall

The most time-consuming part of this development was not the functional design, but the SDK version differences. The API of google-genai changes frequently:

  • from_image_bytes → from_bytes (method name changed)
  • ThinkingLevel enum does not exist in v1.49.0 (requires thinkingBudget integer)
  • The impact of thinking on max_output_tokens is not documented

Suggestion: Before development, run pip show google-genai to confirm the version, then use help() to confirm which APIs are actually available.

2. Limitations of LINE Bot Image Returns

LINE's ImageSendMessage requires the image to be available at a public HTTPS URL; raw bytes cannot be sent directly. Possible solutions:

| Solution | Advantages | Disadvantages |
|----------|------------|---------------|
| Upload to GCS | Stable, persistent | Requires a bucket and permission setup |
| Serve from a FastAPI endpoint | Simple, no external services required | Lost on restart, uses memory |
| Base64 embedded in text | Simplest | Not supported by LINE |

I chose the FastAPI endpoint solution because:

  • Cloud Run itself is public HTTPS
  • Annotated images only need to exist briefly (5 minutes TTL)
  • No need for additional GCS bucket settings

3. Thinking is a Double-Edged Sword

gemini-2.5-flash enables thinking by default, which is helpful for complex reasoning, but is a burden for simple image descriptions:

  • Consumes max_output_tokens quota
  • Increases latency
  • Replies may be truncated

Principle: Disable thinking for simple tasks (thinkingBudget=0), and only enable it for complex Agentic Vision.

4. Trade-offs in State Management

Agentic Vision requires two-step interaction (select mode → input instructions), which introduces state management:

image_temp_store: Dict[str, bytes] = {} # Image temporary storage
pending_agentic_vision: Dict[str, bool] = {} # Waiting for instructions


Using an in-memory dict is the simplest approach, but there is a risk: Cloud Run may restart (or route the second request to a different instance) between the two requests. That is acceptable for a personal bot, but a production-grade service should move this state to Redis or Firestore.
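As a rough idea of what the Firestore variant could look like, here is a sketch assuming google-cloud-firestore is installed. The collection and function names are made up for illustration, and note that Firestore documents are capped at about 1 MiB, so large images would still need GCS:

from google.cloud import firestore

db = firestore.Client()

def set_pending_vision(user_id: str, image_bytes: bytes) -> None:
    # Replaces image_temp_store + pending_agentic_vision for the Agentic Vision flow.
    db.collection("pending_vision").document(user_id).set({
        "image": image_bytes,                      # stored as a Firestore Blob
        "created_at": firestore.SERVER_TIMESTAMP,
    })

def pop_pending_vision(user_id: str) -> bytes | None:
    # Fetch and delete the pending entry in one helper.
    ref = db.collection("pending_vision").document(user_id)
    snap = ref.get()
    if not snap.exists:
        return None
    ref.delete()
    return snap.to_dict().get("image")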

