References:
- Gemini API - Function Calling with Multimodal
- GitHub: linebot-gemini-multimodel-funcal
- Vertex AI - Multimodal Function Response
- Complete code GitHub
Background
I believe many people have used the combination of LINE Bot + Function Calling. When a user asks "What clothes did I buy last month?", the Bot calls the database query function, retrieves the order data, and then Gemini answers based on that JSON:
The traditional flow, as designed by the developer:

```
User: "Help me take a look at the jacket I bought before"
Bot: [calls get_order_history()]
Function returns: {"product_name": "Brown pilot jacket", "order_date": "2026-01-15", ...}
Gemini: "You bought a brown pilot jacket on January 15th for NT$1,890."
```
The answer is completely correct, but something always feels missing: the user is talking about "that jacket", yet Gemini is merely restating the text in the JSON and has no way to confirm what the garment actually looks like. If the database happens to contain three jackets, the AI simply cannot tell which one the user remembers.
AI can read text but cannot see images - this limitation has always been a blind spot of the traditional Function Calling architecture.
This problem wasn't truly solved until Gemini launched Multimodal Function Response.
What is Multimodal Function Response?
The traditional Function Calling flow looks like this:

```
[User message] → Gemini → [function_call] → [Execute function] → [Return JSON] → Gemini → [Text answer]
```

Multimodal Function Response changes that middle step: the function can return not only JSON, but also images (JPEG/PNG/WebP) or documents (PDF) in the same response:

```
[User message] → Gemini → [function_call] → [Execute function] → [Return JSON + image bytes] → Gemini → [Text answer after seeing the image]
```
When generating the next round of answers, Gemini can "see" the structured data and the images returned by the function at the same time, producing richer and more accurate responses.
The media formats currently supported officially:

| Category | Supported formats |
|---|---|
| Image | `image/jpeg`, `image/png`, `image/webp` |
| Document | `application/pdf`, `text/plain` |
The application scenarios are very broad: e-commerce customer service (identifying product images), medical consultation (analyzing PDFs of test reports), design review (giving suggestions based on screenshots)... almost any scenario where a function needs to return visual data for AI analysis applies.
Project Goals
This time, I used Multimodal Function Response to create a LINE e-commerce customer service robot, demonstrating the following scenario:
```
User: "Help me take a look at the jacket I bought before"

Bot (traditional): "You bought a brown pilot jacket."

Bot (Multimodal): "From the photo, you can see that this is a brown pilot jacket, made of
lightweight nylon, with metal zipper decorative pockets on the sides. This is your January 15th
order ORD-2026-0115, NT$1,890 in total, and it has been delivered." + product photo
```
The difference is obvious: Gemini really "saw" that piece of clothing, rather than just restating the text in the database.
Architecture Design
Why not use Google ADK?
Originally, this repo used Google ADK (Agent Development Kit) to manage the agent. ADK's Runner and Agent encapsulate the entire Function Calling flow, which is very convenient.
But Multimodal Function Response requires manually including image bytes in the parts of the function response, and ADK fully encapsulates that layer, leaving no way to intervene.
So this time, I used google.genai.Client directly and implemented the function-call iteration loop myself:
```python
# Old architecture (ADK)
runner = Runner(agent=root_agent, ...)
async for event in runner.run_async(...):
    ...  # ADK handles all function calls for you, but you cannot control the response content

# New architecture (use google.genai directly)
response = await client.aio.models.generate_content(
    model=model,
    contents=contents,
    config=types.GenerateContentConfig(tools=ECOMMERCE_TOOLS),
)
# Handle function calls yourself; include images yourself
```
Overall Architecture
```
LINE User
   │
   ▼ POST /
FastAPI Webhook Handler
   │
   ▼
EcommerceAgent.process_message(text, line_user_id)
   │
   ├─ ① Call Gemini (with conversation history)
   │
   ├─ ② Gemini decides to call a tool → function_call
   │
   ├─ ③ _execute_tool()
   │    ├─ Run the query function (search_products / get_order_history / get_product_details)
   │    └─ Read the real product photo from the img/ directory (Unsplash JPEG)
   │
   ├─ ④ Build the Multimodal Function Response
   │    └─ FunctionResponsePart(inline_data=FunctionResponseBlob(data=image_bytes))
   │
   ├─ ⑤ Call Gemini again (Gemini sees the image + data)
   │
   └─ ⑥ Return (ai_text, image_bytes)
        │
        ▼
LINE Reply:
  TextSendMessage(text=ai_text)
  ImageSendMessage(url=BOT_HOST_URL/images/{uuid})  ← served by the FastAPI /images endpoint
```
Where do the product images come from?
This demo uses real clothing photos from Unsplash. Each of the five products corresponds to an actual photo stored in the img/ directory. The reading logic is very simple:
```python
def generate_product_image(product: dict) -> bytes:
    """Read the product image and return JPEG bytes."""
    with open(product["image_path"], "rb") as f:
        return f.read()
```
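The post hard-codes `mime_type="image/jpeg"` later on, but PNG and WebP are also supported. A small hypothetical helper (not part of the repo; the function name is mine) could sniff the MIME type from the file's magic bytes before filling in `FunctionResponseBlob`:

```python
# Hypothetical helper (not in the repo): pick the mime_type for
# FunctionResponseBlob from the file's magic bytes instead of hard-coding it.

def sniff_image_mime(data: bytes) -> str:
    """Return one of the Gemini-supported image MIME types for raw bytes."""
    if data[:3] == b"\xff\xd8\xff":                     # JPEG SOI marker
        return "image/jpeg"
    if data[:8] == b"\x89PNG\r\n\x1a\n":                # PNG signature
        return "image/png"
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":   # RIFF container + WEBP tag
        return "image/webp"
    raise ValueError("Unsupported image format for Multimodal Function Response")
```

This keeps the table of officially supported formats and the code in sync without trusting file extensions.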
Each product in PRODUCTS_DB has an image_path field pointing to the corresponding image file:
| Product ID | Name | Image |
|---|---|---|
| P001 | Brown pilot jacket | tobias-tullius-...-unsplash.jpg |
| P002 | White cotton T-shirt | mediamodifier-...-unsplash.jpg |
| P003 | Dark blue denim jacket | caio-coelho-...-unsplash.jpg |
| P004 | Beige knit shawl | milada-vigerova-...-unsplash.jpg |
| P005 | Light blue simple T-shirt | cristofer-maximilian-...-unsplash.jpg |
The image bytes are used in two ways:
- Included as a `FunctionResponseBlob` for Gemini to analyze - real photos let Gemini describe the actual fabric texture and tailoring details
- Temporarily stored in the `image_cache` dict and served to the LINE Bot through the FastAPI `/images/{uuid}` endpoint
Core Code Details
Step 1: Define Tools (FunctionDeclaration)
```python
from google.genai import types

ECOMMERCE_TOOLS = [
    types.Tool(function_declarations=[
        types.FunctionDeclaration(
            name="get_order_history",
            description="Query the current user's order history",
            parameters=types.Schema(
                type=types.Type.OBJECT,
                properties={
                    "time_range": types.Schema(
                        type=types.Type.STRING,
                        description="Time range: all / last_month / last_3_months",
                        enum=["all", "last_month", "last_3_months"],
                    ),
                },
                required=[],
            ),
        ),
        # ... search_products, get_product_details
    ])
]
```
Step 2: Function Call Loop (up to 5 iterations)
```python
async def process_message(self, text: str, line_user_id: str):
    contents = self._get_history(line_user_id) + [
        types.Content(role="user", parts=[types.Part(text=text)])
    ]
    for _iteration in range(5):  # at most 5 rounds, to prevent infinite loops
        response = await self._client.aio.models.generate_content(
            model=self._model,
            contents=contents,
            config=types.GenerateContentConfig(
                system_instruction=_SYSTEM_INSTRUCTION,
                tools=ECOMMERCE_TOOLS,
            ),
        )
        model_content = response.candidates[0].content
        contents.append(model_content)

        # Collect all function_call parts
        fc_parts = [p for p in model_content.parts if p.function_call and p.function_call.name]
        if not fc_parts:
            # No function call → this is the final text response
            final_text = "".join(p.text for p in model_content.parts if p.text)
            break

        # Function call(s) present → execute the tool and attach the image
        tool_parts = []
        for fc_part in fc_parts:
            result_dict, image_bytes = _execute_tool(
                fc_part.function_call.name,
                dict(fc_part.function_call.args),
                line_user_id,
            )
            tool_parts.append(
                self._build_multimodal_response(fc_part.function_call.name, result_dict, image_bytes)
            )
        contents.append(types.Content(role="tool", parts=tool_parts))
```
Step 3: Construct Multimodal Function Response (the most critical step)
```python
def _build_multimodal_response(self, func_name, result_dict, image_bytes):
    multimodal_parts = []
    if image_bytes:
        # ⚠️ Note: this must be FunctionResponseBlob, not types.Blob!
        multimodal_parts.append(
            types.FunctionResponsePart(
                inline_data=types.FunctionResponseBlob(
                    mime_type="image/jpeg",
                    data=image_bytes,  # raw bytes; the SDK handles base64 internally
                )
            )
        )
    return types.Part.from_function_response(
        name=func_name,
        response=result_dict,      # structured JSON data
        parts=multimodal_parts or None,  # ← the image goes here; Gemini can "see" it
    )
```
In the next generate_content call, Gemini receives result_dict (the order JSON) and image_bytes (the product photo) together, so its answer can describe the visual content of the image.
Step 4: LINE Bot simultaneously returns text + image
```python
# main.py
ai_text, image_bytes = await ecommerce_agent.process_message(msg_text, line_user_id)

reply_messages = [TextSendMessage(text=ai_text)]
if image_bytes:
    image_id = str(uuid.uuid4())
    image_cache[image_id] = image_bytes                 # temporary in-memory storage
    image_url = f"{BOT_HOST_URL}/images/{image_id}"     # served by FastAPI
    reply_messages.append(
        ImageSendMessage(
            original_content_url=image_url,
            preview_image_url=image_url,
        )
    )
await get_line_bot_api().reply_message(event.reply_token, reply_messages)
```
LINE Bot's reply_message supports returning multiple messages at once (up to 5), so text and images can be sent at the same time.
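As a defensive touch around that five-message limit, a tiny hypothetical helper (mine, not in the repo) can split an arbitrary message list into API-sized chunks. Note that a LINE reply token is single-use, so only the first chunk can go through `reply_message`; any later chunks would have to be sent via `push_message`:

```python
# Hypothetical guard (not in the repo): LINE's reply API accepts at most five
# messages per call, so split longer lists into compliant chunks.

LINE_MESSAGE_LIMIT = 5

def chunk_messages(messages: list, limit: int = LINE_MESSAGE_LIMIT) -> list[list]:
    """Split a message list into chunks that respect LINE's per-call limit."""
    return [messages[i:i + limit] for i in range(0, len(messages), limit)]
```

For this bot the list is always one text plus at most one image, so the limit is never hit; the guard only matters if you later attach multiple product photos per answer.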
Pitfalls
❌ Pitfall 1: FunctionResponseBlob is not Blob
The easiest pitfall to step on: When constructing multimodal image components, you cannot use types.Blob, you need to use types.FunctionResponseBlob:
```python
# ❌ Wrong (raises a TypeError)
types.FunctionResponsePart(
    inline_data=types.Blob(mime_type="image/jpeg", data=image_bytes)
)

# ✅ Correct
types.FunctionResponsePart(
    inline_data=types.FunctionResponseBlob(mime_type="image/jpeg", data=image_bytes)
)
```
Although both classes have mime_type and data fields, FunctionResponsePart's inline_data field is typed as FunctionResponseBlob, and Pydantic validation will reject a Blob outright. You can confirm this with `python -c "from google.genai import types; print(types.FunctionResponsePart.model_fields)"`.
❌ Pitfall 2: aiohttp.ClientSession cannot be created at the module level
The original code directly created aiohttp.ClientSession() at the module level:
```python
# ❌ Old approach: module level
session = aiohttp.ClientSession()  # without a running event loop this warns or errors
async_http_client = AiohttpAsyncHttpClient(session)
```
When pytest imports main.py, there is no running event loop, so `RuntimeError: no running event loop` is raised. The fix is lazy initialization: create the session only the first time it is actually needed:
```python
# ✅ New approach: lazy init
_line_bot_api = None

def get_line_bot_api():
    global _line_bot_api
    if _line_bot_api is None:
        # Called from inside an async route handler, so an event loop is running
        session = aiohttp.ClientSession()
        _line_bot_api = AsyncLineBotApi(channel_access_token, AiohttpAsyncHttpClient(session))
    return _line_bot_api
```
❌ Pitfall 3: LINE Bot needs HTTPS URL to send images
Gemini receives raw bytes, but LINE Bot's ImageSendMessage needs a publicly accessible HTTPS URL.
The solution is to add a /images/{image_id} endpoint in FastAPI, temporarily store the read image bytes in the image_cache dict, and LINE retrieves the image through this endpoint:
```python
@app.get("/images/{image_id}")
async def serve_image(image_id: str):
    image_bytes = image_cache.get(image_id)
    if image_bytes is None:
        raise HTTPException(status_code=404, detail="Image not found")
    return Response(content=image_bytes, media_type="image/jpeg")
```
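One caveat with the plain `image_cache` dict: in a long-running deployment it grows without bound. A hedged alternative (my own sketch, not the repo's code; the 128-entry limit is arbitrary) is a small FIFO-bounded cache:

```python
# Sketch of a bounded replacement for the plain image_cache dict, so old
# image bytes are evicted instead of accumulating forever.
from collections import OrderedDict

class BoundedImageCache:
    def __init__(self, max_entries: int = 128):
        self._store: "OrderedDict[str, bytes]" = OrderedDict()
        self._max = max_entries

    def put(self, image_id: str, data: bytes) -> None:
        self._store[image_id] = data
        self._store.move_to_end(image_id)          # newest entry goes to the back
        while len(self._store) > self._max:
            self._store.popitem(last=False)        # evict the oldest entry

    def get(self, image_id: str) -> "bytes | None":
        return self._store.get(image_id)
```

A `put`/`get` pair slots into the existing webhook and `/images` handler unchanged; expired entries simply produce the 404 path already handled above.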
Use ngrok to expose port 8000 during local development; after deploying to Cloud Run, use the service URL directly.
Demo Display
Mock Database (default data for Demo)
The system has 5 built-in products (all with real Unsplash photos), and each LINE user is automatically bound to two demo orders the first time they query their orders:
| Order Number | Date | Product |
|---|---|---|
| ORD-2026-0115 | 2026-01-15 | P001 Brown pilot jacket |
| ORD-2026-0108 | 2026-01-08 | P003 Dark blue denim jacket |
Scenario 1: "Help me take a look at the jacket I bought before"
```
User sends: "Help me take a look at the jacket I bought before"

[Gemini → function_call]
get_order_history(time_range="all")

[_execute_tool execution]
- get_order_history() returns two orders (P001, P003)
- Reads img/tobias-tullius-...-unsplash.jpg → raw bytes of the brown pilot jacket photo

[Multimodal Function Response]
Part.from_function_response(
    name="get_order_history",
    response={"orders": [...], "order_count": 2},
    parts=[FunctionResponsePart(inline_data=FunctionResponseBlob(data=<photo>))]
)

[Gemini responds after seeing the real photo]
"From the photo, you can see that this is a brown pilot jacket, made of lightweight nylon
with a glossy feel, and a metal zipper pocket on the left sleeve. This is your January 15, 2026
order ORD-2026-0115, NT$1,890 in total, status: delivered."

LINE displays: [text] + [real photo of the brown pilot jacket]
```
Scenario 2: "Are there any dark blue jackets?"
```
[Gemini → function_call]
search_products(description="dark blue jacket", color="dark blue")

[Gemini sees the real photo of the P003 dark blue denim jacket]
"Yes! This dark blue denim jacket (P003) in the photo features retro stitching and a lapel
with metal buttons; the garment looks substantial. Priced at NT$1,490, 8 in stock."
```
Scenario 3: "What are the features of the P004 knit shawl?"
```
[Gemini → function_call]
get_product_details(product_id="P004")

[Gemini sees the real photo of the beige knit shawl]
"The photo shows a beige handmade crochet shawl with a V-neck design and tassels along the
bottom; you can see the light, lace-like mesh weave. Elegant texture, priced at NT$1,290."
```
Traditional Function Response vs Multimodal Function Response
| | Traditional | Multimodal |
|---|---|---|
| Function return | Pure JSON | JSON + image/PDF bytes |
| Gemini perception | Text data | Text + visual content |
| Answer quality | "You bought a brown pilot jacket" | "You can see the gloss of the nylon, the zipper pocket on the left sleeve..." |
| API call | `Part.from_function_response(name, response)` | `Part.from_function_response(name, response, parts=[...])` |
| Applicable scenarios | Pure text data queries | Scenarios requiring visual recognition/confirmation |
Analysis and Outlook
This implementation gave me a new understanding of Gemini's Function Calling capabilities.
What Multimodal Function Response really solves is letting the AI agent carry visual information inside the very act of "calling an external system", instead of querying text first and uploading an image separately. This will be an important foundational capability in visually driven domains such as e-commerce, healthcare, and design.
However, there are still a few limitations worth noting:
- Image URLs cannot be used directly: `FunctionResponseBlob` needs raw bytes; you cannot pass a URL (unlike including images directly in a prompt). If the image lives at a URL, download it with `requests.get()` into bytes first.
- `display_name` is optional: the official documentation examples use `display_name` and a `$ref` JSON reference, but in actual tests with google-genai 1.49.0 it works fine without `display_name`, and Gemini can still see and analyze the image.
- Model support: officially the Gemini 3 series is listed, but in actual tests `gemini-2.0-flash` also handles it, with the same API structure.
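The first limitation above is easy to bridge. Here is a hedged sketch (the function name is mine, and it uses the standard library's `urllib` rather than the `requests` the post mentions) of turning either a URL or a local path into the raw bytes `FunctionResponseBlob` needs:

```python
# Sketch of the "URL → bytes" conversion required by the first limitation,
# using only the standard library. Function name is illustrative.
import urllib.request

def load_image_bytes(source: str) -> bytes:
    """Return raw image bytes from an http(s) URL or a local file path."""
    if source.startswith(("http://", "https://")):
        with urllib.request.urlopen(source) as resp:  # network fetch
            return resp.read()
    with open(source, "rb") as f:                     # already on disk
        return f.read()
```

Whatever the source, the returned bytes can be passed straight into `FunctionResponseBlob(data=...)`.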
There are many directions to extend in the future: let users send their own product photos for the Bot to compare, attach PDF catalogs in the function response for Gemini to read directly, or have the Bot analyze report images converted from DICOM in medical scenarios... As long as visual data can be fetched from an external system, Multimodal Function Response can make the AI's answers deeper.
Summary
The whole point of this LINE Bot implementation fits in one sentence: let the function response carry images, and Gemini's answers upgrade from "restating data" to "telling the story based on the image".
The core API is just these few lines, but it takes a lot of details to get the whole process through:
```python
# The complete recipe for letting Gemini see an image returned by a function
types.Part.from_function_response(
    name="get_order_history",
    response={"orders": [...]},
    parts=[
        types.FunctionResponsePart(
            inline_data=types.FunctionResponseBlob(  # ← not types.Blob!
                mime_type="image/jpeg",
                data=image_bytes,
            )
        )
    ],
)
```
The complete code is on GitHub, feel free to clone and play with it.
See you next time!



