DEV Community

Evan Lin

Posted on • Originally published at evanlin.com

[Gemini] Building a LINE E-commerce Chatbot That Can "Tell Stories from Images"



Background

I believe many people have used the combination of LINE Bot + Function Calling. When a user asks "What clothes did I buy last month?", the Bot calls the database query function, retrieves the order data, and then Gemini answers based on that JSON:

The traditional flow, as designed by the developer:

User: "Help me see the jacket I bought before"
Bot: [Call get_order_history()]
Function returns: {"product_name": "Brown pilot jacket", "order_date": "2026-01-15", ...}
Gemini: "You bought a brown pilot jacket on January 15th for NT$1,890."


The answer is completely correct, but something always feels missing: the user is talking about "that jacket," while Gemini is merely restating the text in the JSON, with no way to confirm what the jacket actually looks like. If the database happens to contain three jackets, the AI cannot even tell which one is the one the user remembers.

AI can read text but cannot see pictures, and this limitation has always been a blind spot in the traditional Function Calling architecture.


This problem was truly solved only after Gemini introduced Multimodal Function Response.


What is Multimodal Function Response?

The traditional Function Calling process is as follows:

[User message] → Gemini → [function_call] → [Execute function] → [Return JSON] → Gemini → [Text answer]


Multimodal Function Response changes that middle step. The function can not only return JSON, but also include images (JPEG/PNG/WebP) or documents (PDF) in the same response:


[User message] → Gemini → [function_call] → [Execute function] → [Return JSON + image bytes] → Gemini → [Text answer that has seen the image]


When Gemini generates the next round of answers, it can "see" both the structured data and the image returned by the function, thereby generating richer and more accurate responses.

The officially supported media formats at the time of writing:

| Category | Supported formats |
| --- | --- |
| Image | `image/jpeg`, `image/png`, `image/webp` |
| Document | `application/pdf`, `text/plain` |
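Before building a response it can be handy to check that a file's format is on this list. A small helper of my own (not part of the google-genai SDK) that maps a file extension to a supported mime type:

```python
# Hypothetical helper: map a file extension to a mime type that
# Multimodal Function Response currently accepts. The SUPPORTED_MIME_TYPES
# table mirrors the list above; names here are my own, not the SDK's.
from pathlib import Path

SUPPORTED_MIME_TYPES = {
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".png": "image/png",
    ".webp": "image/webp",
    ".pdf": "application/pdf",
    ".txt": "text/plain",
}

def guess_function_response_mime(path: str) -> str:
    """Return the supported mime type for a file, or raise if unsupported."""
    ext = Path(path).suffix.lower()
    try:
        return SUPPORTED_MIME_TYPES[ext]
    except KeyError:
        raise ValueError(f"{ext} is not supported in a function response")
```

Calling it with an unsupported format (say, a GIF) fails fast instead of producing a request Gemini will reject.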

The application scenarios for this feature are very broad: e-commerce customer service (identifying product images), medical consultation (analyzing PDF inspection reports), design review (giving suggestions based on screenshots)... almost all scenarios that require "functions to return visual data for AI analysis" are applicable.


Project Goal

This time, I used Multimodal Function Response to create a LINE e-commerce customer service robot, demonstrating the following scenario:

User: "Help me see the jacket I bought before"

Bot (traditional): "You bought a brown pilot jacket."

Bot (Multimodal): "From the photo, you can see that this is a brown pilot jacket, made of lightweight nylon, with metal zipper pockets on the sides. This is your January 15th order ORD-2026-0115, for a total of NT$1,890, and it has been delivered." + Product photo

The difference is obvious: Gemini really "saw" the jacket, rather than just restating the text in the database.


Architecture Design

Why not use Google ADK?

Originally, this repo used Google ADK (Agent Development Kit) to manage the Agent. The Runner and Agent of ADK encapsulated the entire process of Function Calling, which was very convenient.

But Multimodal Function Response requires manually including image bytes in the parts of the function response, and ADK fully encapsulates that layer, leaving no way to intervene.

So this time, I directly used google.genai.Client to implement the iterative cycle of function calls myself:

# Old architecture (ADK)
runner = Runner(agent=root_agent, ...)
async for event in runner.run_async(...):
    ... # ADK handles all function calls for you, but you can't control the response content

# New architecture (directly use google.genai)
response = await client.aio.models.generate_content(
    model=model,
    contents=contents,
    config=types.GenerateContentConfig(tools=ECOMMERCE_TOOLS),
)
# Handle function calls yourself, include images yourself


Overall architecture

LINE User
    │
    ▼ POST /
FastAPI Webhook Handler
    │
    ▼
EcommerceAgent.process_message(text, line_user_id)
    │
    ├─ ① Call Gemini (with conversation history)
    │
    ├─ ② Gemini decides to call a tool → function_call
    │
    ├─ ③ _execute_tool()
    │ ├─ Execute query function (search_products / get_order_history / get_product_details)
    │ └─ Read real product photos in the img/ directory (Unsplash JPEG)
    │
    ├─ ④ Construct Multimodal Function Response
    │ └─ FunctionResponsePart(inline_data=FunctionResponseBlob(data=image_bytes))
    │
    ├─ ⑤ Call Gemini again (Gemini sees the image + data)
    │
    └─ ⑥ Return (ai_text, image_bytes)
    │
    ▼
LINE Reply:
  TextSendMessage(text=ai_text)
  ImageSendMessage(url=BOT_HOST_URL/images/{uuid}) ← FastAPI /images endpoint provides


How to get product images?

This demo uses real clothing photos from Unsplash. Each of the five products corresponds to an actual photo of the item, stored in the img/ directory. The reading logic is very simple:

def generate_product_image(product: dict) -> bytes:
    """Read the product image and return JPEG bytes."""
    with open(product["image_path"], "rb") as f:
        return f.read()


Each product in PRODUCTS_DB has an image_path field pointing to the corresponding image file:

| Product ID | Name | Image |
| --- | --- | --- |
| P001 | Brown pilot jacket | tobias-tullius-…-unsplash.jpg |
| P002 | White cotton university T | mediamodifier-…-unsplash.jpg |
| P003 | Dark blue denim jacket | caio-coelho-…-unsplash.jpg |
| P004 | Beige knitted shawl | milada-vigerova-…-unsplash.jpg |
| P005 | Light blue simple T-shirt | cristofer-maximilian-…-unsplash.jpg |

The image bytes that are read serve two purposes:

  1. Included as a FunctionResponseBlob for Gemini to analyze; real photos let Gemini describe the actual fabric texture and tailoring details
  2. Temporarily stored in the image_cache dict, provided to the LINE Bot for display through the FastAPI /images/{uuid} endpoint
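For reference, a minimal sketch of what one PRODUCTS_DB entry might look like, pieced together from the table and prices mentioned in this post; the real repo's schema may differ, and the elided Unsplash filename is kept as-is:

```python
# Hypothetical shape of one product record (fields inferred from the post).
PRODUCTS_DB = {
    "P001": {
        "name": "Brown pilot jacket",
        "price": 1890,  # NT$, matching the demo order total
        "image_path": "img/tobias-tullius-...-unsplash.jpg",  # elided filename
    },
}
```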

Detailed explanation of the core code

Step 1: Define tools (FunctionDeclaration)

from google.genai import types

ECOMMERCE_TOOLS = [
    types.Tool(function_declarations=[
        types.FunctionDeclaration(
            name="get_order_history",
            description="Query the current user's order history",
            parameters=types.Schema(
                type=types.Type.OBJECT,
                properties={
                    "time_range": types.Schema(
                        type=types.Type.STRING,
                        description="Time range: all / last_month / last_3_months",
                        enum=["all", "last_month", "last_3_months"],
                    ),
                },
                required=[],
            ),
        ),
        # ... search_products, get_product_details
    ])
]
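The search_products declaration is elided above. As a hedged guess at its shape, here is a dict that mirrors the types.Schema structure used for get_order_history; the parameter names description and color are inferred from the demo transcript later in the post, not confirmed from the repo:

```python
# Hypothetical JSON-schema-style mirror of the search_products declaration.
# Purely illustrative: field names beyond "name"/"description"/"parameters"
# are assumptions based on how search_products is called in the demo.
SEARCH_PRODUCTS_DECL = {
    "name": "search_products",
    "description": "Search the product catalog by free-text description",
    "parameters": {
        "type": "OBJECT",
        "properties": {
            "description": {
                "type": "STRING",
                "description": "Free-text description of the desired product",
            },
            "color": {
                "type": "STRING",
                "description": "Optional color filter, e.g. 'dark blue'",
            },
        },
        "required": ["description"],
    },
}
```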


Step 2: Function call cycle (up to 5 iterations)

async def process_message(self, text: str, line_user_id: str):
    contents = self._get_history(line_user_id) + [
        types.Content(role="user", parts=[types.Part(text=text)])
    ]

    for _iteration in range(5): # Up to 5 times, to prevent infinite loops
        response = await self._client.aio.models.generate_content(
            model=self._model,
            contents=contents,
            config=types.GenerateContentConfig(
                system_instruction=_SYSTEM_INSTRUCTION,
                tools=ECOMMERCE_TOOLS,
            ),
        )

        model_content = response.candidates[0].content
        contents.append(model_content)

        # Find all function_call parts
        fc_parts = [p for p in model_content.parts if p.function_call and p.function_call.name]

        if not fc_parts:
            # No function call → final text response
            final_text = "".join(p.text for p in model_content.parts if p.text)
            break

        # Has function call → execute tool, include image
        tool_parts = []
        for fc_part in fc_parts:
            result_dict, image_bytes = _execute_tool(
                fc_part.function_call.name,
                dict(fc_part.function_call.args),
                line_user_id,
            )
            tool_parts.append(
                self._build_multimodal_response(fc_part.function_call.name, result_dict, image_bytes)
            )

        contents.append(types.Content(role="tool", parts=tool_parts))


Step 3: Construct Multimodal Function Response (the most critical step)

def _build_multimodal_response(self, func_name, result_dict, image_bytes):
    multimodal_parts = []

    if image_bytes:
        # ⚠️ Note: Use FunctionResponseBlob here, not types.Blob!
        multimodal_parts.append(
            types.FunctionResponsePart(
                inline_data=types.FunctionResponseBlob(
                    mime_type="image/jpeg",
                    data=image_bytes, # raw bytes, SDK handles base64 internally
                )
            )
        )

    return types.Part.from_function_response(
        name=func_name,
        response=result_dict, # Structured JSON data
        parts=multimodal_parts or None, # ← Image is here! Gemini can "see" it after receiving it
    )


Gemini will receive both result_dict (order JSON) and image_bytes (product image) in the next generate_content call, and the generated answer can therefore describe the visual content of the image.

Step 4: LINE Bot simultaneously returns text + image

# main.py

ai_text, image_bytes = await ecommerce_agent.process_message(msg_text, line_user_id)

reply_messages = [TextSendMessage(text=ai_text)]

if image_bytes:
    image_id = str(uuid.uuid4())
    image_cache[image_id] = image_bytes # Temporary storage
    image_url = f"{BOT_HOST_URL}/images/{image_id}" # FastAPI provides service
    reply_messages.append(
        ImageSendMessage(
            original_content_url=image_url,
            preview_image_url=image_url,
        )
    )

await get_line_bot_api().reply_message(event.reply_token, reply_messages)


LINE Bot's reply_message supports returning multiple messages at once (up to 5), so text and images can be sent simultaneously.


Pitfalls

❌ Pitfall 1: FunctionResponseBlob is not Blob

The most common pitfall: When constructing multimodal image parts, you cannot use types.Blob, you must use types.FunctionResponseBlob:

# ❌ Wrong (rejected by Pydantic validation)
types.FunctionResponsePart(
    inline_data=types.Blob(mime_type="image/jpeg", data=image_bytes)
)

# ✅ Correct
types.FunctionResponsePart(
    inline_data=types.FunctionResponseBlob(mime_type="image/jpeg", data=image_bytes)
)


Although both have mime_type and data fields, the inline_data field type of FunctionResponsePart is FunctionResponseBlob, and Pydantic validation will directly reject Blob. You can confirm this with python -c "from google.genai import types; print(types.FunctionResponsePart.model_fields)".

❌ Pitfall 2: aiohttp.ClientSession cannot be created at the module level

The original code directly created aiohttp.ClientSession() at the module level:

# ❌ Old method: module level
session = aiohttp.ClientSession() # Will warn or error if there is no running event loop
async_http_client = AiohttpAsyncHttpClient(session)


When main.py is imported in pytest tests, there is no running event loop, so RuntimeError: no running event loop is raised. The solution is lazy initialization: create the session only on first actual use:

# ✅ New method: lazy init
_line_bot_api = None

def get_line_bot_api():
    global _line_bot_api
    if _line_bot_api is None:
        session = aiohttp.ClientSession() # Called within the async route handler, guaranteeing an event loop
        _line_bot_api = AsyncLineBotApi(channel_access_token, AiohttpAsyncHttpClient(session))
    return _line_bot_api


❌ Pitfall 3: LINE Bot needs HTTPS URL to send images

Gemini receives raw bytes, but LINE Bot's ImageSendMessage requires a publicly accessible HTTPS URL.

The solution is to add a /images/{image_id} endpoint in FastAPI, temporarily store the read image bytes in the image_cache dict, and LINE gets the image through this endpoint:

@app.get("/images/{image_id}")
async def serve_image(image_id: str):
    image_bytes = image_cache.get(image_id)
    if image_bytes is None:
        raise HTTPException(status_code=404, detail="Image not found")
    return Response(content=image_bytes, media_type="image/jpeg")


For local development, expose port 8000 with ngrok; after deploying to Cloud Run, use the service URL directly.


Demo Display

Mock database (default data for demo)

The system has 5 built-in products (all with real Unsplash photos), and each LINE user is automatically assigned two demo orders the first time they query their order history:

| Order number | Date | Product |
| --- | --- | --- |
| ORD-2026-0115 | 2026-01-15 | P001 Brown pilot jacket |
| ORD-2026-0108 | 2026-01-08 | P003 Dark blue denim jacket |

Scenario 1: "Help me see the jacket I bought before"

User sends: "Help me see the jacket I bought before"

[Gemini → function_call]
  get_order_history(time_range="all")

[_execute_tool execution]
  - get_order_history() returns two orders (P001, P003)
  - Read img/tobias-tullius-...-unsplash.jpg → Brown pilot jacket real photo bytes

[Multimodal Function Response]
  Part.from_function_response(
    name="get_order_history",
    response={"orders": [...], "order_count": 2},
    parts=[FunctionResponsePart(inline_data=FunctionResponseBlob(data=<photo>))]
  )

[Gemini responds after seeing the real photo]
  "From the photo, you can see that this is a brown pilot jacket, made of lightweight nylon with
   a glossy feel, and a metal zipper pocket on the left sleeve. This is your January 15, 2026
   order ORD-2026-0115, for a total of NT$1,890, status: delivered."

LINE displays: [Text] + [Brown pilot jacket real photo]


Scenario 2: "Are there any dark blue jackets?"

[Gemini → function_call]
  search_products(description="dark blue jacket", color="dark blue")

[Gemini sees the real photo of the P003 dark blue denim jacket]
  "Yes! This dark blue denim jacket (P003) in the photo features a retro stitching design,
   a lapel with metal buttons, and a very complete garment feel, priced at NT$1,490, with 8 in stock."


Scenario 3: "What are the features of the P004 knitted shawl?"

[Gemini → function_call]
  get_product_details(product_id="P004")

[Gemini sees the real photo of the beige knitted shawl]
  "The photo shows a beige handmade crochet shawl, with a V-neck design and tassels at the bottom,
   you can see the light lace-like mesh weave, elegant texture, priced at NT$1,290."


Traditional Function Response vs Multimodal Function Response

| | Traditional | Multimodal |
| --- | --- | --- |
| Function return | Pure JSON | JSON + image/PDF bytes |
| Gemini perception | Text data | Text + visual content |
| Answer quality | "You bought a brown pilot jacket" | "You can see the nylon texture in the photo, with a zipper pocket on the left sleeve..." |
| API difference | `Part.from_function_response(name, response)` | `Part.from_function_response(name, response, parts=[...])` |
| Applicable scenarios | Pure text data queries | Scenarios that require visual recognition or confirmation |

Analysis and Outlook

This implementation gave me a new understanding of Gemini's Function Calling capabilities.

The real problem Multimodal Function Response solves is letting an AI agent carry visual information inside the very act of "calling an external system," instead of querying text first and then uploading images separately. This will be an important foundational capability in visually heavy domains such as e-commerce, medicine, and design.

However, there are still a few limitations worth noting:

  1. Image URLs cannot be used directly: FunctionResponseBlob requires raw bytes, so a URL cannot be passed in as-is (unlike images attached directly to the prompt). If the image lives at a URL, download it to bytes first (e.g. with requests.get()) before passing it in.

  2. display_name is optional: the official documentation examples include display_name and a $ref JSON reference, but in actual testing with google-genai 1.49.0 everything works without display_name, and Gemini can still see and analyze the image.

  3. Model limitations: the official documentation lists the Gemini 3 series as supported, but in actual testing gemini-2.0-flash also handled it without issues, with an identical API structure.
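For limitation 1 above, converting an image URL into raw bytes is a short helper. A hedged sketch; the function name and the content-type check are my own additions:

```python
# Illustrative: download an image URL to the raw bytes that
# FunctionResponseBlob requires. Not from the repo.
import requests

SUPPORTED_IMAGE_TYPES = ("image/jpeg", "image/png", "image/webp")

def fetch_image_bytes(url: str, timeout: float = 10.0) -> bytes:
    """Download an image and return raw bytes suitable for a function response."""
    resp = requests.get(url, timeout=timeout)
    resp.raise_for_status()
    # Content-Type may carry a charset suffix; keep only the media type
    content_type = resp.headers.get("Content-Type", "").split(";")[0].strip()
    if content_type not in SUPPORTED_IMAGE_TYPES:
        raise ValueError(f"unsupported image type: {content_type!r}")
    return resp.content
```

The returned bytes drop straight into `FunctionResponseBlob(mime_type=content_type, data=...)`.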

There are many directions that can be extended in the future: let users send their own product photos for the Bot to compare, include PDF catalogs in the function response for Gemini to read directly, or let the Bot analyze the report images converted from DICOM in medical scenarios... As long as visual data can be obtained from external systems, Multimodal Function Response can make the AI's answers more in-depth.

Summary

The focus of this LINE Bot implementation is only one sentence: Let the function response carry the image, and Gemini's answer will be upgraded from "restating data" to "telling a story based on the picture".

The core API is just these few lines, but it takes a lot of details to get the whole process working:

# The complete way for Gemini to see the image returned by the function
types.Part.from_function_response(
    name="get_order_history",
    response={"orders": [...]},
    parts=[
        types.FunctionResponsePart(
            inline_data=types.FunctionResponseBlob( # ← Not types.Blob!
                mime_type="image/jpeg",
                data=image_bytes,
            )
        )
    ],
)


The complete code is on GitHub, feel free to clone and play with it.

See you next time!
