Background
After completing the Multi-Agent Orchestration architecture for the LINE Bot, the original image analysis function simply sent the image to gemini-2.5-flash for recognition. However, in January 2026 Google released Agentic Vision for Gemini 3 Flash, which lets the model not only "see" the image but also actively write Python code to enlarge, crop, and annotate it.
This made me think of an interesting use case:
A user sends a photo and says, "Help me mark the coffee," and the AI not only replies with a text description but also draws a bounding box and annotates it on the image, then sends the annotated image back to LINE.
This article documents the complete process of implementing this function, including the pitfalls and solutions.
What is Agentic Vision?
Traditional image analysis is static: you give the model an image, and the model returns a text description.
Agentic Vision turns image understanding into an active investigation process, using a Think → Act → Observe cycle:
┌─────────────────────────────────────────────────────────────┐
│ Agentic Vision Process │
│ │
│ 1. Think - Analyze the image and plan how to investigate further │
│ 2. Act - Write Python code (crop, enlarge, annotate, calculate) │
│ 3. Observe - Observe the code execution results (including the generated annotated image) │
│ 4. Repeat the above steps until the analysis is complete │
└─────────────────────────────────────────────────────────────┘
Technical Core
- Model: gemini-3-flash-preview
- Key Feature: the code_execution tool — allows the model to write and execute Python code
- Output: in addition to text analysis, it can also return annotated images generated by the model
# Enable Agentic Vision API call
# (client is a google-genai client; image_part is a types.Part built from the uploaded image)
from google.genai import types

response = client.models.generate_content(
    model="gemini-3-flash-preview",
    contents=[image_part, "Help me mark the coffee"],
    config=types.GenerateContentConfig(
        tools=[types.Tool(code_execution=types.ToolCodeExecution)],
        thinking_config=types.ThinkingConfig(thinkingBudget=2048),
    ),
)
# Response contains multiple parts: text, code, execution results, annotated images
for part in response.candidates[0].content.parts:
    if part.text:                     # Text analysis
        ...
    if part.executable_code:          # Python code written by the model
        ...
    if part.code_execution_result:    # Code execution results
        ...
    if part.as_image():               # Generated annotated image!
        ...
Functional Design
User Experience Flow
Instead of directly analyzing the image upon receiving it, it's changed to let the user choose a mode first:
User sends an image
│
▼
┌─────────────────────────────────────┐
│ 📷 Image received, please select an analysis method: │
│ │
│ ┌──────────┐ ┌─────────────────┐ │
│ │ Recognize Image │ │ Agentic Vision │ │
│ └──────────┘ └─────────────────┘ │
│ (Quick Reply Buttons) │
└─────────────────────────────────────┘
│ │
▼ ▼
gemini-2.5-flash User inputs instructions
Directly returns a text description "Help me mark the coffee"
│
▼
gemini-3-flash-preview
+ code_execution
│
┌────┴────┐
▼ ▼
Text Analysis Annotated Image
(Text) (Image)
│ │
▼ ▼
LINE TextMsg + ImageSendMessage
Why two steps?
Agentic Vision requires the user to provide specific instructions (e.g., "Mark everyone," "Count how many cats"), unlike general recognition which only needs to "describe the image." Therefore, after selecting Agentic Vision, the user is first asked to input their desired goal.
Implementation Details
1. Image Temporary Storage Mechanism
Because LINE's Quick Reply is asynchronous (user clicks a button to trigger PostbackEvent), the image needs to be temporarily stored:
# main.py
image_temp_store: Dict[str, bytes] = {} # Temporary image storage (user_id → bytes)
pending_agentic_vision: Dict[str, bool] = {} # Waiting for user to input instructions
Process:

- Receive image → store in image_temp_store[user_id]
- User clicks "Agentic Vision" → set pending_agentic_vision[user_id] = True
- User inputs text → detect the pending state, retrieve the stored image plus the text, and send them for analysis (sketched below)
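A minimal sketch of that last step, assuming LINE SDK v2 and the names used elsewhere in this article (the handler body and the "expired" reply text are illustrative, not the project's actual code):

```python
# Hypothetical sketch of the pending-state check in the text message handler.
# handler, line_bot_api, image_temp_store, pending_agentic_vision and
# handle_agentic_vision_with_prompt() come from the existing main.py.
from linebot.models import MessageEvent, TextMessage, TextSendMessage

@handler.add(MessageEvent, message=TextMessage)
def handle_text_message(event):
    user_id = event.source.user_id
    text = event.message.text

    if pending_agentic_vision.pop(user_id, False):
        image_bytes = image_temp_store.pop(user_id, None)
        if image_bytes is None:
            line_bot_api.reply_message(
                event.reply_token,
                TextSendMessage(text="Image expired, please send it again."),
            )
            return
        # Hand the stored image and the user's instruction to Agentic Vision
        handle_agentic_vision_with_prompt(event, image_bytes, text)
        return

    # ...otherwise fall through to the normal text-message flow
```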
2. Quick Reply Implementation
Use LINE SDK's PostbackAction, consistent with the existing YouTube summary and location search Quick Reply modes:
quick_reply_buttons = QuickReply(
    items=[
        QuickReplyButton(
            action=PostbackAction(
                label="Recognize Image",
                data=json.dumps({"action": "image_analyze", "mode": "recognize"}),
                display_text="Recognize Image",
            )
        ),
        QuickReplyButton(
            action=PostbackAction(
                label="Agentic Vision",
                data=json.dumps({"action": "image_analyze", "mode": "agentic_vision"}),
                display_text="Agentic Vision",
            )
        ),
    ]
)
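The matching PostbackEvent handler is not shown in this article; a minimal sketch of how it could route the two modes, assuming the state dicts above (the reply wording and the orchestrator call are illustrative):

```python
import json

from linebot.models import PostbackEvent, TextSendMessage

# handler, line_bot_api, image_temp_store, pending_agentic_vision and the
# orchestrator come from the existing main.py setup
@handler.add(PostbackEvent)
def handle_postback(event):
    user_id = event.source.user_id
    data = json.loads(event.postback.data)

    if data.get("action") != "image_analyze":
        return  # other postback flows (YouTube summary, location search, ...)

    if data.get("mode") == "recognize":
        image_bytes = image_temp_store.pop(user_id, None)
        # ...call orchestrator.process_image(image_bytes) and reply with the text result
    elif data.get("mode") == "agentic_vision":
        # Keep the image and wait for the user's instruction in the next message
        pending_agentic_vision[user_id] = True
        line_bot_api.reply_message(
            event.reply_token,
            TextSendMessage(text="Please tell me what to do with the image, e.g. 'Help me mark the coffee'."),
        )
```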
3. Agentic Vision Analysis Core
# tools/summarizer.py
def analyze_image_agentic(image_data: bytes, prompt: str) -> dict:
    client = _get_vertex_client()

    contents = [
        types.Part.from_text(text=prompt),
        types.Part.from_bytes(data=image_data, mime_type="image/png"),
    ]

    response = client.models.generate_content(
        model="gemini-3-flash-preview",
        contents=contents,
        config=types.GenerateContentConfig(
            temperature=0.5,
            max_output_tokens=4096,
            tools=[types.Tool(code_execution=types.ToolCodeExecution)],
            thinking_config=types.ThinkingConfig(thinkingBudget=2048),
        ),
    )

    result_parts = []
    generated_images = []
    for part in response.candidates[0].content.parts:
        if hasattr(part, 'thought') and part.thought:
            continue  # Skip thinking parts
        if part.text is not None:
            result_parts.append(part.text)
        if part.code_execution_result is not None:
            result_parts.append(f"[Code Output]: {part.code_execution_result.output}")
        # Extract the annotated images generated by the model
        img = part.as_image()
        if img is not None:
            generated_images.append(img.image_bytes)

    return {
        "status": "success",
        "analysis": "\n".join(result_parts),
        "images": generated_images,  # Annotated image bytes
    }
4. Image Return Mechanism
LINE's ImageSendMessage requires a public HTTPS URL. Because we are deployed on Cloud Run (which is inherently public HTTPS), we directly add an image serving endpoint to FastAPI:
# Temporary storage of annotated images (UUID → bytes, 5-minute TTL)
annotated_image_store: Dict[str, dict] = {}

@app.get("/images/{image_id}")
def serve_annotated_image(image_id: str):
    """Provide temporary annotated images for LINE to download"""
    entry = annotated_image_store.get(image_id)
    if not entry:
        raise HTTPException(status_code=404)
    if time.time() - entry["created_at"] > 300:  # expired after 5 minutes
        annotated_image_store.pop(image_id, None)
        raise HTTPException(status_code=404)
    return Response(content=entry["data"], media_type="image/png")
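The store_annotated_image() helper referenced further down is not listed in the article; a minimal sketch consistent with the 5-minute TTL check above could be:

```python
import time
import uuid

def store_annotated_image(image_bytes: bytes) -> str:
    """Store an annotated image and return the UUID used in the /images/{image_id} URL."""
    image_id = uuid.uuid4().hex
    annotated_image_store[image_id] = {
        "data": image_bytes,
        "created_at": time.time(),  # compared against the 300-second TTL above
    }
    return image_id
```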
Automatically detect the App's base URL (from the webhook request headers):
@app.post("/")
async def handle_webhook_callback(request: Request):
    global app_base_url
    if not app_base_url:
        forwarded_proto = request.headers.get('x-forwarded-proto', 'https')
        host = request.headers.get('x-forwarded-host') or request.headers.get('host', '')
        if host:
            app_base_url = f"{forwarded_proto}://{host}"
Finally, combine into ImageSendMessage:
def _create_image_send_message(image_bytes: bytes):
    image_id = store_annotated_image(image_bytes)
    image_url = f"{app_base_url}/images/{image_id}"
    return ImageSendMessage(
        original_content_url=image_url,
        preview_image_url=image_url,
    )
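On the reply side, the Agentic Vision result can then be pushed back as text plus image(s); a simplified sketch (the helper name reply_agentic_result is illustrative):

```python
from linebot.models import TextSendMessage

# line_bot_api and _create_image_send_message() come from the existing main.py
def reply_agentic_result(user_id: str, result: dict) -> None:
    # Text analysis first, then any annotated images the model produced
    messages = [TextSendMessage(text=result["analysis"])]
    for image_bytes in result["images"]:
        messages.append(_create_image_send_message(image_bytes))
    # LINE allows at most 5 messages per push
    line_bot_api.push_message(user_id, messages[:5])
```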
Results Showcase
Pitfalls Encountered
Pitfall 1: from_image_bytes Does Not Exist
ERROR: Error analyzing image: from_image_bytes
Reason: There is no types.Part.from_image_bytes() method in the google-genai SDK; the correct method is types.Part.from_bytes().
# ❌ Incorrect
types.Part.from_image_bytes(data=image_data, mime_type="image/png")
# ✅ Correct
types.Part.from_bytes(data=image_data, mime_type="image/png")
Pitfall 2: ThinkingLevel enum Does Not Exist
ERROR: module 'google.genai.types' has no attribute 'ThinkingLevel'
Reason: ThinkingConfig in google-genai==1.49.0 only supports thinkingBudget (integer), and does not support the thinking_level enum. Context7 and the examples in the official documentation are based on a newer version of the SDK.
# ❌ Does not exist in v1.49.0
types.ThinkingConfig(thinking_level=types.ThinkingLevel.MEDIUM)
# ✅ v1.49.0 supported method
types.ThinkingConfig(thinkingBudget=2048)
Lesson: AI-generated code examples may be based on newer or older SDK versions. Always confirm the actually available parameters, e.g. with python -c "from google.genai import types; help(types.ThinkingConfig)".
Pitfall 3: Incomplete Image Recognition Results
Reason: gemini-2.5-flash enables thinking by default, and thinking tokens count against the max_output_tokens quota. With the original max_output_tokens=2048, thinking used up a large portion of the budget and the actual reply was truncated.
# ❌ Before: thinking consumed most of the token quota
config=types.GenerateContentConfig(
    max_output_tokens=2048,
)

# ✅ After: disable thinking + increase the token quota
config=types.GenerateContentConfig(
    max_output_tokens=8192,
    thinking_config=types.ThinkingConfig(thinkingBudget=0),  # Disable thinking
)
Key Point: For simple image descriptions, thinking is an unnecessary overhead. thinkingBudget=0 can disable thinking, allowing all tokens to be used for the reply.
Modified Files
| File | Modification Content |
|---|---|
| main.py | Quick Reply flow, image temporary storage, pending-state management, image serving endpoint, ImageSendMessage return |
| tools/summarizer.py | Added analyze_image_agentic(), corrected from_bytes, corrected ThinkingConfig, disabled thinking for image recognition |
| agents/vision_agent.py | Added analyze_agentic() method |
| agents/orchestrator.py | Added process_image_agentic() routing method |
Complete Architecture
The original VisionAgent only had one path, now it becomes:
LINE Image Message
│
▼
handle_image_message()
│
├── image_temp_store[user_id] = image_bytes
│
▼
Quick Reply: "Recognize Image" / "Agentic Vision"
│ │
▼ ▼
handle_image_analyze_ pending_agentic_vision[user_id] = True
postback() │
│ ▼
│ User inputs text instructions
│ │
│ ▼
│ handle_agentic_vision_with_prompt()
│ │
▼ ▼
orchestrator orchestrator
.process_image() .process_image_agentic(prompt=user instructions)
│ │
▼ ▼
VisionAgent.analyze() VisionAgent.analyze_agentic()
│ │
▼ ▼
analyze_image() analyze_image_agentic()
gemini-2.5-flash gemini-3-flash-preview
thinkingBudget=0 + code_execution
+ thinkingBudget=2048
│ │
▼ ├── Text analysis → TextSendMessage
TextSendMessage ├── Annotated image → /images/{uuid} → ImageSendMessage
└── push_message([text, image])
Development Experience
1. SDK Version Differences are the Biggest Pitfall
The most time-consuming part of this development was not the functional design, but the SDK version differences. The API of google-genai changes frequently:
- from_image_bytes → from_bytes (method name changed)
- The ThinkingLevel enum does not exist in v1.49.0 (it requires a thinkingBudget integer)
- The impact of thinking on max_output_tokens is not documented
Suggestion: Before development, run pip show google-genai to confirm the version, and then use help() to confirm the actually available API.
2. Limitations of LINE Bot Image Returns
LINE's ImageSendMessage requires the image to be a public HTTPS URL, and cannot directly transmit bytes. Solutions:
| Solution | Advantages | Disadvantages |
|---|---|---|
| Upload to GCS | Stable, persistent | Requires bucket and permission setup |
| Serve from a FastAPI endpoint | Simple, no external services required | Lost on restart, uses memory |
| Base64 embedded in text | Simplest | Not supported by LINE |
I chose the FastAPI endpoint solution because:
- Cloud Run itself is public HTTPS
- Annotated images only need to exist briefly (5 minutes TTL)
- No need for additional GCS bucket settings
3. Thinking is a Double-Edged Sword
gemini-2.5-flash enables thinking by default, which is helpful for complex reasoning, but is a burden for simple image descriptions:
- Consumes the max_output_tokens quota
- Increases latency
- Replies may be truncated
Principle: Disable thinking for simple tasks (thinkingBudget=0), and only enable it for complex Agentic Vision.
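One way to centralize that decision is a small config helper that reuses the exact values from the snippets above (the helper itself is illustrative, not the project's code):

```python
from google.genai import types

def build_config(agentic: bool) -> types.GenerateContentConfig:
    if not agentic:
        # Simple description: disable thinking so every output token goes to the reply
        return types.GenerateContentConfig(
            max_output_tokens=8192,
            thinking_config=types.ThinkingConfig(thinkingBudget=0),
        )
    # Agentic Vision: enable code execution and give thinking its own budget
    return types.GenerateContentConfig(
        temperature=0.5,
        max_output_tokens=4096,
        tools=[types.Tool(code_execution=types.ToolCodeExecution)],
        thinking_config=types.ThinkingConfig(thinkingBudget=2048),
    )
```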
4. Trade-offs in State Management
Agentic Vision requires two-step interaction (select mode → input instructions), which introduces state management:
image_temp_store: Dict[str, bytes] = {} # Image temporary storage
pending_agentic_vision: Dict[str, bool] = {} # Waiting for instructions
Using an in-memory dict is the simplest, but there is a risk: Cloud Run may restart between two requests. This is acceptable for a personal Bot, but if you want to make it a product-level service, you should switch to Redis or Firestore.
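For reference, a sketch of what the Redis variant could look like, with key names and TTLs as illustrative assumptions:

```python
import redis

r = redis.Redis(host="localhost", port=6379)  # connection details are illustrative

def set_pending(user_id: str, image_bytes: bytes) -> None:
    # Both keys expire on their own, so a Cloud Run restart no longer loses state silently
    r.setex(f"image:{user_id}", 300, image_bytes)
    r.setex(f"pending_agentic:{user_id}", 300, 1)

def pop_pending(user_id: str):
    if r.get(f"pending_agentic:{user_id}") is None:
        return None
    image_bytes = r.get(f"image:{user_id}")
    r.delete(f"pending_agentic:{user_id}", f"image:{user_id}")
    return image_bytes
```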
References
- Introducing Agentic Vision in Gemini 3 Flash - Google Official Blog
- Gemini 3 Developer Guide - API Development Documentation
- Code Execution - Code Execution Feature Documentation
- Image Understanding - Bounding Box and Image Analysis
- google-genai Python SDK - SDK Source Code
- linebot-helper-python - Source Code of this project


