
Evan Lin for Google Developer Experts

Originally published at evanlin.com

Gemini 3 Flash: Agentic Vision in LINE Bot - AI Image Annotation and More

Background


After I completed the Multi-Agent Orchestration architecture for my LINE Bot, its image analysis feature still simply sent the image to gemini-2.5-flash for recognition. Then Google released Agentic Vision for Gemini 3 Flash in January 2026, which lets the model not only "see" the image but also actively write Python code to zoom in on, crop, and annotate it.

This made me think of an interesting use case:

A user sends a photo and says, "Help me mark the coffee," and the AI not only replies with a text description but also draws a bounding box and annotates it on the image, then sends the annotated image back to LINE.

This article documents the complete process of implementing this function, including the pitfalls and solutions.


What is Agentic Vision?

Traditional image analysis is static: you give the model an image, and the model returns a text description.

Agentic Vision turns image understanding into an active investigation process, using a Think → Act → Observe cycle:

Agentic Vision process (Think → Act → Observe)

1. Think   - Analyze the image and plan how to investigate further
2. Act     - Write Python code (crop, zoom in, annotate, calculate)
3. Observe - Inspect the execution results (including any generated annotated image)
4. Repeat steps 1-3 until the analysis is complete


Technical Core

  • Model: gemini-3-flash-preview
  • Key Feature: code_execution tool — allows the model to write and execute Python code
  • Output: In addition to text analysis, it can also return annotated images generated by the model
# Enable Agentic Vision API call
response = client.models.generate_content(
    model="gemini-3-flash-preview",
    contents=[image_part, "Help me mark the coffee"],
    config=types.GenerateContentConfig(
        tools=[types.Tool(code_execution=types.ToolCodeExecution)],
        thinking_config=types.ThinkingConfig(thinkingBudget=2048),
    )
)

# Response contains multiple parts: text, code, execution results, annotated images
for part in response.candidates[0].content.parts:
    if part.text:                   # Text analysis
        print(part.text)
    if part.executable_code:        # Python code written by the model
        print(part.executable_code.code)
    if part.code_execution_result:  # Code execution results
        print(part.code_execution_result.output)
    if part.as_image():             # Generated annotated image!
        annotated_image = part.as_image()


Functional Design

User Experience Flow

Instead of directly analyzing the image upon receiving it, it's changed to let the user choose a mode first:

User sends an image
        │
        ▼
"📷 Image received, please select an analysis method:"
   [Recognize Image]  [Agentic Vision]      ← Quick Reply buttons
        │                     │
        ▼                     ▼
 gemini-2.5-flash       User inputs an instruction
 directly returns a     ("Help me mark the coffee")
 text description             │
                              ▼
                       gemini-3-flash-preview
                       + code_execution
                              │
                        ┌─────┴─────┐
                        ▼           ▼
                  Text analysis   Annotated image
                        │           │
                        ▼           ▼
               LINE TextSendMessage + ImageSendMessage


Why two steps?

Agentic Vision requires the user to provide specific instructions (e.g., "Mark everyone," "Count how many cats"), unlike general recognition which only needs to "describe the image." Therefore, after selecting Agentic Vision, the user is first asked to input their desired goal.


Implementation Details

1. Image Temporary Storage Mechanism

Because LINE's Quick Reply is asynchronous (the button tap arrives later as a separate PostbackEvent), the image has to be stored temporarily:

# main.py
image_temp_store: Dict[str, bytes] = {} # Temporary image storage (user_id → bytes)
pending_agentic_vision: Dict[str, bool] = {} # Waiting for user to input instructions


Process:

  1. Receive image → store in image_temp_store[user_id]
  2. User clicks "Agentic Vision" → set pending_agentic_vision[user_id] = True
  3. User inputs text → detect pending state, retrieve image + text and send them for analysis
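A minimal sketch of step 3, where the text-message handler consumes the pending state. The handler and helper names (handle_text_message, the error reply) are illustrative, not necessarily the exact ones in the repo, and the sketch assumes the main.py globals (line_bot_api, the two dicts) already exist:

# Sketch: inside the LINE text-message handler, check for a pending Agentic Vision request.
def handle_text_message(event):
    user_id = event.source.user_id
    text = event.message.text

    if pending_agentic_vision.pop(user_id, False):
        image_bytes = image_temp_store.pop(user_id, None)
        if image_bytes is None:
            line_bot_api.reply_message(
                event.reply_token,
                TextSendMessage(text="The image has expired, please send it again."))
            return
        # Hand the stored image plus the user's instruction to the Agentic Vision flow
        handle_agentic_vision_with_prompt(event, image_bytes, prompt=text)
        return

    # Otherwise fall through to the normal Multi-Agent Orchestration text flow
    ...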

2. Quick Reply Implementation

Use LINE SDK's PostbackAction, consistent with the existing YouTube summary and location search Quick Reply modes:

import json

from linebot.models import PostbackAction, QuickReply, QuickReplyButton

quick_reply_buttons = QuickReply(
    items=[
        QuickReplyButton(
            action=PostbackAction(
                label="Recognize Image",
                data=json.dumps({"action": "image_analyze", "mode": "recognize"}),
                display_text="Recognize Image"
            )
        ),
        QuickReplyButton(
            action=PostbackAction(
                label="Agentic Vision",
                data=json.dumps({"action": "image_analyze", "mode": "agentic_vision"}),
                display_text="Agentic Vision"
            )
        ),
    ]
)

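On the postback side, the JSON payload from these buttons gets parsed and dispatched by mode. A rough sketch, assuming the WebhookHandler-style registration of line-bot-sdk v1/v2 (adapt if the project uses WebhookParser; names are illustrative):

# Sketch: dispatch the Quick Reply postback by mode.
@handler.add(PostbackEvent)
def handle_postback(event):
    data = json.loads(event.postback.data)
    if data.get("action") != "image_analyze":
        return

    user_id = event.source.user_id
    if data.get("mode") == "agentic_vision":
        # Two-step flow: remember the user and ask for an instruction.
        pending_agentic_vision[user_id] = True
        line_bot_api.reply_message(
            event.reply_token,
            TextSendMessage(text="Please tell me what to do with the image, e.g. 'Help me mark the coffee'."))
    else:
        # "recognize": run the original gemini-2.5-flash path right away.
        image_bytes = image_temp_store.pop(user_id, None)
        ...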

3. Agentic Vision Analysis Core

# tools/summarizer.py
from google.genai import types

def analyze_image_agentic(image_data: bytes, prompt: str) -> dict:
    client = _get_vertex_client()

    contents = [
        types.Part.from_text(text=prompt),
        types.Part.from_bytes(data=image_data, mime_type="image/png")
    ]

    response = client.models.generate_content(
        model="gemini-3-flash-preview",
        contents=contents,
        config=types.GenerateContentConfig(
            temperature=0.5,
            max_output_tokens=4096,
            tools=[types.Tool(code_execution=types.ToolCodeExecution)],
            thinking_config=types.ThinkingConfig(thinkingBudget=2048),
        )
    )

    result_parts = []
    generated_images = []

    for part in response.candidates[0].content.parts:
        if hasattr(part, 'thought') and part.thought:
            continue # Skip thinking parts
        if part.text is not None:
            result_parts.append(part.text)
        if part.code_execution_result is not None:
            result_parts.append(f"[Code Output]: {part.code_execution_result.output}")
        # Extract the annotated images generated by the model
        img = part.as_image()
        if img is not None:
            generated_images.append(img.image_bytes)

    return {
        "status": "success",
        "analysis": "\n".join(result_parts),
        "images": generated_images # Annotated image bytes
    }

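For context, here is a hedged sketch of how a caller might turn this result dict into LINE messages. It uses the _create_image_send_message() helper shown later in this post; the surrounding names (image_bytes, user_id) are illustrative:

# Sketch: convert the analysis result into LINE messages and push them to the user.
result = analyze_image_agentic(image_bytes, "Help me mark the coffee")

messages = [TextSendMessage(text=result["analysis"][:4900])]  # LINE caps a text message at 5000 characters
for img_bytes in result["images"][:4]:                        # push_message accepts at most 5 messages
    messages.append(_create_image_send_message(img_bytes))

line_bot_api.push_message(user_id, messages)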

4. Image Return Mechanism

LINE's ImageSendMessage requires a publicly reachable HTTPS URL. Since the bot already runs on Cloud Run (which serves public HTTPS out of the box), we simply add an image-serving endpoint to FastAPI:

# Temporary storage of annotated images (UUID → bytes, 5 minutes TTL)
annotated_image_store: Dict[str, dict] = {}

@app.get("/images/{image_id}")
def serve_annotated_image(image_id: str):
    """Provide temporary annotated images for LINE to download"""
    entry = annotated_image_store.get(image_id)
    if not entry:
        raise HTTPException(status_code=404)
    if time.time() - entry["created_at"] > 300: # 5 minutes expired
        annotated_image_store.pop(image_id, None)
        raise HTTPException(status_code=404)
    return Response(content=entry["data"], media_type="image/png")


Automatically detect the App's base URL (from the webhook request headers):

@app.post("/")
async def handle_webhook_callback(request: Request):
    global app_base_url
    if not app_base_url:
        forwarded_proto = request.headers.get('x-forwarded-proto', 'https')
        host = request.headers.get('x-forwarded-host') or request.headers.get('host', '')
        if host:
            app_base_url = f"{forwarded_proto}://{host}"


Finally, combine into ImageSendMessage:

def _create_image_send_message(image_bytes: bytes):
    image_id = store_annotated_image(image_bytes)
    image_url = f"{app_base_url}/images/{image_id}"
    return ImageSendMessage(
        original_content_url=image_url,
        preview_image_url=image_url,
    )

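The store_annotated_image() helper used above isn't shown in the post; a minimal sketch of what it presumably looks like (a uuid key plus a timestamp, matching the 5-minute TTL check in the /images endpoint):

import time
import uuid

def store_annotated_image(image_bytes: bytes) -> str:
    """Store annotated image bytes and return the id used in /images/{image_id}."""
    image_id = uuid.uuid4().hex
    annotated_image_store[image_id] = {
        "data": image_bytes,
        "created_at": time.time(),  # compared against the 5-minute TTL when serving
    }
    return image_id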

Results Showcase

(Screenshots: the LINE conversation from 2026-01-28 demonstrating the annotated-image reply.)

Pitfalls Encountered

Pitfall 1: from_image_bytes Does Not Exist

ERROR: Error analyzing image: from_image_bytes


Reason: the google-genai SDK has no types.Part.from_image_bytes() method; the correct one is types.Part.from_bytes().

# ❌ Incorrect
types.Part.from_image_bytes(data=image_data, mime_type="image/png")

# ✅ Correct
types.Part.from_bytes(data=image_data, mime_type="image/png")


Pitfall 2: ThinkingLevel enum Does Not Exist

ERROR: module 'google.genai.types' has no attribute 'ThinkingLevel'


Reason: ThinkingConfig in google-genai==1.49.0 only supports thinkingBudget (integer), and does not support the thinking_level enum. Context7 and the examples in the official documentation are based on a newer version of the SDK.

# ❌ Does not exist in v1.49.0
types.ThinkingConfig(thinking_level=types.ThinkingLevel.MEDIUM)

# ✅ v1.49.0 supported method
types.ThinkingConfig(thinkingBudget=2048)


Lesson: AI-generated code examples may be written against a newer or older SDK version; always run python -c "from google.genai import types; help(types.ThinkingConfig)" to confirm which parameters are actually available.

Pitfall 3: Incomplete Image Recognition Results

Reason: gemini-2.5-flash enables thinking by default, and thinking tokens count against the max_output_tokens quota. I had originally set max_output_tokens=2048; thinking consumed most of that budget, so the actual reply was truncated.

# ❌ Before: thinking consumed most of the token quota
config=types.GenerateContentConfig(
    max_output_tokens=2048,
)

# ✅ After: Disable thinking + increase token quota
config=types.GenerateContentConfig(
    max_output_tokens=8192,
    thinking_config=types.ThinkingConfig(thinkingBudget=0), # Disable thinking
)


Key Point: For simple image descriptions, thinking is unnecessary overhead. Setting thinkingBudget=0 disables it, leaving the entire token budget for the reply.


Modified Files

| File | Modification |
|------|--------------|
| main.py | Quick Reply flow, image temporary storage, pending-state management, image-serving endpoint, ImageSendMessage reply |
| tools/summarizer.py | Added analyze_image_agentic(), corrected from_bytes, corrected ThinkingConfig, disabled thinking for plain image recognition |
| agents/vision_agent.py | Added analyze_agentic() method |
| agents/orchestrator.py | Added process_image_agentic() routing method |


Complete Architecture

The original VisionAgent only had one path, now it becomes:

LINE Image Message
      │
      ▼
handle_image_message()
      │
      ├── image_temp_store[user_id] = image_bytes
      │
      ▼
Quick Reply: "Recognize Image" / "Agentic Vision"
      │                          │
      ▼                          ▼
handle_image_analyze_      pending_agentic_vision[user_id] = True
postback()                       │
      │                          ▼
      │                    User inputs text instructions
      │                          │
      │                          ▼
      │                    handle_agentic_vision_with_prompt()
      │                          │
      ▼                          ▼
orchestrator               orchestrator
.process_image()           .process_image_agentic(prompt=user instructions)
      │                          │
      ▼                          ▼
VisionAgent.analyze()      VisionAgent.analyze_agentic()
      │                          │
      ▼                          ▼
analyze_image()            analyze_image_agentic()
gemini-2.5-flash           gemini-3-flash-preview
thinkingBudget=0           + code_execution
      │                    + thinkingBudget=2048
      ▼                          │
TextSendMessage                  ├── Text analysis → TextSendMessage
                                 ├── Annotated image → /images/{uuid} → ImageSendMessage
                                 └── push_message([text, image])


Development Experience

1. SDK Version Differences are the Biggest Pitfall

The most time-consuming part of this development was not the functional design, but the SDK version differences. The API of google-genai changes frequently:

  • from_image_bytes → from_bytes (method name changed)
  • ThinkingLevel enum does not exist in v1.49.0 (requires thinkingBudget integer)
  • The impact of thinking on max_output_tokens is not documented

Suggestion: Before development, run pip show google-genai to confirm the version, then use help() to confirm which APIs are actually available.

2. Limitations of LINE Bot Image Returns

LINE's ImageSendMessage requires the image to be available at a public HTTPS URL; raw bytes cannot be sent directly. Possible solutions:

| Solution | Advantages | Disadvantages |
|----------|------------|---------------|
| Upload to GCS | Stable, persistent | Requires a bucket and permission setup |
| Serve from a FastAPI endpoint | Simple, no external services required | Lost on restart, uses memory |
| Base64 embedded in text | Simplest | Not supported by LINE |

I chose the FastAPI endpoint solution because:

  • Cloud Run itself is public HTTPS
  • Annotated images only need to exist briefly (5 minutes TTL)
  • No need for additional GCS bucket settings

3. Thinking is a Double-Edged Sword

gemini-2.5-flash enables thinking by default, which is helpful for complex reasoning, but is a burden for simple image descriptions:

  • Consumes max_output_tokens quota
  • Increases latency
  • Replies may be truncated

Principle: Disable thinking for simple tasks (thinkingBudget=0), and only enable it for complex Agentic Vision.

4. Trade-offs in State Management

Agentic Vision requires two-step interaction (select mode → input instructions), which introduces state management:

image_temp_store: Dict[str, bytes] = {} # Image temporary storage
pending_agentic_vision: Dict[str, bool] = {} # Waiting for instructions


Using an in-memory dict is the simplest approach, but there is a risk: Cloud Run may restart (or route the second request to a different instance) between the two requests. That is acceptable for a personal bot, but a production-grade service should move this state to Redis or Firestore.
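As a rough idea of what the Firestore variant could look like, here is a sketch assuming google-cloud-firestore is installed. The collection and function names are made up for illustration, and note that Firestore documents are capped at about 1 MiB, so large images would still need GCS:

from google.cloud import firestore

db = firestore.Client()

def set_pending_vision(user_id: str, image_bytes: bytes) -> None:
    # Replaces image_temp_store + pending_agentic_vision for the Agentic Vision flow.
    db.collection("pending_vision").document(user_id).set({
        "image": image_bytes,                      # stored as a Firestore Blob
        "created_at": firestore.SERVER_TIMESTAMP,
    })

def pop_pending_vision(user_id: str) -> bytes | None:
    # Fetch and delete the pending entry in one helper.
    ref = db.collection("pending_vision").document(user_id)
    snap = ref.get()
    if not snap.exists:
        return None
    ref.delete()
    return snap.to_dict().get("image")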

