DEV Community

Evan Lin

Posted on • Originally published at evanlin.com

[Gemini] Building a LINE E-commerce Chatbot That Can "Tell Stories from Images"



Background

I believe many people have used the combination of LINE Bot + Function Calling. When a user asks "What clothes did I buy last month?", the Bot calls the database query function, retrieves the order data, and then Gemini answers based on that JSON:

The traditional flow, as designed by the developer:

User: "Help me see the jacket I bought before"
Bot: [Call get_order_history()]
Function returns: {"product_name": "Brown pilot jacket", "order_date": "2026-01-15", ...}
Gemini: "You bought a brown pilot jacket on January 15th for NT$1,890."


The answer is completely correct, but something always feels missing: the user is talking about "that jacket," while Gemini is merely restating the text in the JSON, with no way to confirm what the jacket actually looks like. If the database happens to contain three jackets, the AI cannot even tell which one is the one the user remembers.

AI can read text but cannot see pictures, and this limitation has always been a blind spot in the traditional Function Calling architecture.


This problem was truly solved only after Gemini introduced Multimodal Function Response.


What is Multimodal Function Response?

The traditional Function Calling process is as follows:

[User message] → Gemini → [function_call] → [Execute function] → [Return JSON] → Gemini → [Text answer]


Multimodal Function Response changes that middle step. The function can not only return JSON, but also include images (JPEG/PNG/WebP) or documents (PDF) in the same response:


[User message] → Gemini → [function_call] → [Execute function] → [Return JSON + image bytes] → Gemini → [Text answer that has seen the image]


When Gemini generates the next round of answers, it can "see" both the structured data and the image returned by the function, thereby generating richer and more accurate responses.

The officially supported media formats at the time of writing:

| Category | Supported formats |
| --- | --- |
| Image | `image/jpeg`, `image/png`, `image/webp` |
| Document | `application/pdf`, `text/plain` |
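Before building a response it can be handy to check that a file's format is on this list. A small helper of my own (not part of the google-genai SDK) that maps a file extension to a supported mime type:

```python
# Hypothetical helper: map a file extension to a mime type that
# Multimodal Function Response currently accepts. The SUPPORTED_MIME_TYPES
# table mirrors the list above; names here are my own, not the SDK's.
from pathlib import Path

SUPPORTED_MIME_TYPES = {
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".png": "image/png",
    ".webp": "image/webp",
    ".pdf": "application/pdf",
    ".txt": "text/plain",
}

def guess_function_response_mime(path: str) -> str:
    """Return the supported mime type for a file, or raise if unsupported."""
    ext = Path(path).suffix.lower()
    try:
        return SUPPORTED_MIME_TYPES[ext]
    except KeyError:
        raise ValueError(f"{ext} is not supported in a function response")
```

Calling it with an unsupported format (say, a GIF) fails fast instead of producing a request Gemini will reject.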

The application scenarios for this feature are very broad: e-commerce customer service (identifying product images), medical consultation (analyzing PDF inspection reports), design review (giving suggestions based on screenshots)... almost all scenarios that require "functions to return visual data for AI analysis" are applicable.


Project Goal

This time, I used Multimodal Function Response to create a LINE e-commerce customer service robot, demonstrating the following scenario:

User: "Help me see the jacket I bought before"

Bot (traditional): "You bought a brown pilot jacket."

Bot (Multimodal): "From the photo, you can see that this is a brown pilot jacket, made of lightweight nylon, with metal zipper pockets on the sides. This is your January 15th order ORD-2026-0115, for a total of NT$1,890, and it has been delivered." + Product photo

The difference is obvious: Gemini really "saw" the jacket, rather than just restating the text in the database.


Architecture Design

Why not use Google ADK?

Originally, this repo used Google ADK (Agent Development Kit) to manage the Agent. The Runner and Agent of ADK encapsulated the entire process of Function Calling, which was very convenient.

But Multimodal Function Response requires manually including image bytes in the parts of the function response, and ADK fully encapsulates that layer, leaving no way to intervene.

So this time, I directly used google.genai.Client to implement the iterative cycle of function calls myself:

# Old architecture (ADK)
runner = Runner(agent=root_agent, ...)
async for event in runner.run_async(...):
    ... # ADK handles all function calls for you, but you can't control the response content

# New architecture (directly use google.genai)
response = await client.aio.models.generate_content(
    model=model,
    contents=contents,
    config=types.GenerateContentConfig(tools=ECOMMERCE_TOOLS),
)
# Handle function calls yourself, include images yourself


Overall architecture

LINE User
    │
    ▼ POST /
FastAPI Webhook Handler
    │
    ▼
EcommerceAgent.process_message(text, line_user_id)
    │
    ├─ ① Call Gemini (with conversation history)
    │
    ├─ ② Gemini decides to call a tool → function_call
    │
    ├─ ③ _execute_tool()
    │ ├─ Execute query function (search_products / get_order_history / get_product_details)
    │ └─ Read real product photos in the img/ directory (Unsplash JPEG)
    │
    ├─ ④ Construct Multimodal Function Response
    │ └─ FunctionResponsePart(inline_data=FunctionResponseBlob(data=image_bytes))
    │
    ├─ ⑤ Call Gemini again (Gemini sees the image + data)
    │
    └─ ⑥ Return (ai_text, image_bytes)
    │
    ▼
LINE Reply:
  TextSendMessage(text=ai_text)
  ImageSendMessage(url=BOT_HOST_URL/images/{uuid}) ← FastAPI /images endpoint provides


How to get product images?

This demo uses real clothing photos from Unsplash. Each of the five products corresponds to an actual photo of the item, stored in the img/ directory. The reading logic is very simple:

def generate_product_image(product: dict) -> bytes:
    """Read the product image and return JPEG bytes."""
    with open(product["image_path"], "rb") as f:
        return f.read()


Each product in PRODUCTS_DB has an image_path field pointing to the corresponding image file:

| Product ID | Name | Image |
| --- | --- | --- |
| P001 | Brown pilot jacket | tobias-tullius-…-unsplash.jpg |
| P002 | White cotton university T | mediamodifier-…-unsplash.jpg |
| P003 | Dark blue denim jacket | caio-coelho-…-unsplash.jpg |
| P004 | Beige knitted shawl | milada-vigerova-…-unsplash.jpg |
| P005 | Light blue simple T-shirt | cristofer-maximilian-…-unsplash.jpg |

The image bytes that are read serve two purposes:

  1. Included as a FunctionResponseBlob for Gemini to analyze; real photos let Gemini describe the actual fabric texture and tailoring details
  2. Temporarily stored in the image_cache dict, provided to the LINE Bot for display through the FastAPI /images/{uuid} endpoint
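For reference, a minimal sketch of what one PRODUCTS_DB entry might look like, pieced together from the table and prices mentioned in this post; the real repo's schema may differ, and the elided Unsplash filename is kept as-is:

```python
# Hypothetical shape of one product record (fields inferred from the post).
PRODUCTS_DB = {
    "P001": {
        "name": "Brown pilot jacket",
        "price": 1890,  # NT$, matching the demo order total
        "image_path": "img/tobias-tullius-...-unsplash.jpg",  # elided filename
    },
}
```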

Detailed explanation of the core code

Step 1: Define tools (FunctionDeclaration)

from google.genai import types

ECOMMERCE_TOOLS = [
    types.Tool(function_declarations=[
        types.FunctionDeclaration(
            name="get_order_history",
            description="Query the current user's order history",
            parameters=types.Schema(
                type=types.Type.OBJECT,
                properties={
                    "time_range": types.Schema(
                        type=types.Type.STRING,
                        description="Time range: all / last_month / last_3_months",
                        enum=["all", "last_month", "last_3_months"],
                    ),
                },
                required=[],
            ),
        ),
        # ... search_products, get_product_details
    ])
]
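The search_products declaration is elided above. As a hedged guess at its shape, here is a dict that mirrors the types.Schema structure used for get_order_history; the parameter names description and color are inferred from the demo transcript later in the post, not confirmed from the repo:

```python
# Hypothetical JSON-schema-style mirror of the search_products declaration.
# Purely illustrative: field names beyond "name"/"description"/"parameters"
# are assumptions based on how search_products is called in the demo.
SEARCH_PRODUCTS_DECL = {
    "name": "search_products",
    "description": "Search the product catalog by free-text description",
    "parameters": {
        "type": "OBJECT",
        "properties": {
            "description": {
                "type": "STRING",
                "description": "Free-text description of the desired product",
            },
            "color": {
                "type": "STRING",
                "description": "Optional color filter, e.g. 'dark blue'",
            },
        },
        "required": ["description"],
    },
}
```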


Step 2: Function call cycle (up to 5 iterations)

async def process_message(self, text: str, line_user_id: str):
    contents = self._get_history(line_user_id) + [
        types.Content(role="user", parts=[types.Part(text=text)])
    ]

    for _iteration in range(5): # Up to 5 times, to prevent infinite loops
        response = await self._client.aio.models.generate_content(
            model=self._model,
            contents=contents,
            config=types.GenerateContentConfig(
                system_instruction=_SYSTEM_INSTRUCTION,
                tools=ECOMMERCE_TOOLS,
            ),
        )

        model_content = response.candidates[0].content
        contents.append(model_content)

        # Find all function_call parts
        fc_parts = [p for p in model_content.parts if p.function_call and p.function_call.name]

        if not fc_parts:
            # No function call → final text response
            final_text = "".join(p.text for p in model_content.parts if p.text)
            break

        # Has function call → execute tool, include image
        tool_parts = []
        for fc_part in fc_parts:
            result_dict, image_bytes = _execute_tool(
                fc_part.function_call.name,
                dict(fc_part.function_call.args),
                line_user_id,
            )
            tool_parts.append(
                self._build_multimodal_response(fc_part.function_call.name, result_dict, image_bytes)
            )

        contents.append(types.Content(role="tool", parts=tool_parts))


Step 3: Construct Multimodal Function Response (the most critical step)

def _build_multimodal_response(self, func_name, result_dict, image_bytes):
    multimodal_parts = []

    if image_bytes:
        # ⚠️ Note: Use FunctionResponseBlob here, not types.Blob!
        multimodal_parts.append(
            types.FunctionResponsePart(
                inline_data=types.FunctionResponseBlob(
                    mime_type="image/jpeg",
                    data=image_bytes, # raw bytes, SDK handles base64 internally
                )
            )
        )

    return types.Part.from_function_response(
        name=func_name,
        response=result_dict, # Structured JSON data
        parts=multimodal_parts or None, # ← Image is here! Gemini can "see" it after receiving it
    )


Gemini will receive both result_dict (order JSON) and image_bytes (product image) in the next generate_content call, and the generated answer can therefore describe the visual content of the image.

Step 4: LINE Bot simultaneously returns text + image

# main.py

ai_text, image_bytes = await ecommerce_agent.process_message(msg_text, line_user_id)

reply_messages = [TextSendMessage(text=ai_text)]

if image_bytes:
    image_id = str(uuid.uuid4())
    image_cache[image_id] = image_bytes # Temporary storage
    image_url = f"{BOT_HOST_URL}/images/{image_id}" # FastAPI provides service
    reply_messages.append(
        ImageSendMessage(
            original_content_url=image_url,
            preview_image_url=image_url,
        )
    )

await get_line_bot_api().reply_message(event.reply_token, reply_messages)


LINE Bot's reply_message supports returning multiple messages at once (up to 5), so text and images can be sent simultaneously.


Pitfalls

❌ Pitfall 1: FunctionResponseBlob is not Blob

The most common pitfall: When constructing multimodal image parts, you cannot use types.Blob, you must use types.FunctionResponseBlob:

# ❌ Wrong (rejected by Pydantic validation)
types.FunctionResponsePart(
    inline_data=types.Blob(mime_type="image/jpeg", data=image_bytes)
)

# ✅ Correct
types.FunctionResponsePart(
    inline_data=types.FunctionResponseBlob(mime_type="image/jpeg", data=image_bytes)
)


Although both have mime_type and data fields, the inline_data field type of FunctionResponsePart is FunctionResponseBlob, and Pydantic validation will directly reject Blob. You can confirm this with python -c "from google.genai import types; print(types.FunctionResponsePart.model_fields)".

❌ Pitfall 2: aiohttp.ClientSession cannot be created at the module level

The original code directly created aiohttp.ClientSession() at the module level:

# ❌ Old method: module level
session = aiohttp.ClientSession() # Will warn or error if there is no running event loop
async_http_client = AiohttpAsyncHttpClient(session)


When main.py is imported in pytest tests, there is no running event loop, so RuntimeError: no running event loop is raised. The solution is lazy initialization: create the session only on first actual use:

# ✅ New method: lazy init
_line_bot_api = None

def get_line_bot_api():
    global _line_bot_api
    if _line_bot_api is None:
        session = aiohttp.ClientSession() # Called within the async route handler, guaranteeing an event loop
        _line_bot_api = AsyncLineBotApi(channel_access_token, AiohttpAsyncHttpClient(session))
    return _line_bot_api


❌ Pitfall 3: LINE Bot needs HTTPS URL to send images

Gemini receives raw bytes, but LINE Bot's ImageSendMessage requires a publicly accessible HTTPS URL.

The solution is to add a /images/{image_id} endpoint in FastAPI, temporarily store the read image bytes in the image_cache dict, and LINE gets the image through this endpoint:

@app.get("/images/{image_id}")
async def serve_image(image_id: str):
    image_bytes = image_cache.get(image_id)
    if image_bytes is None:
        raise HTTPException(status_code=404, detail="Image not found")
    return Response(content=image_bytes, media_type="image/jpeg")


For local development, expose port 8000 with ngrok; after deploying to Cloud Run, use the service URL directly.


Demo Display

Mock database (default data for demo)

The system has 5 built-in products (all with real Unsplash photos), and each LINE user is automatically assigned two demo orders the first time they query their order history:

| Order number | Date | Product |
| --- | --- | --- |
| ORD-2026-0115 | 2026-01-15 | P001 Brown pilot jacket |
| ORD-2026-0108 | 2026-01-08 | P003 Dark blue denim jacket |

Scenario 1: "Help me see the jacket I bought before"

User sends: "Help me see the jacket I bought before"

[Gemini → function_call]
  get_order_history(time_range="all")

[_execute_tool execution]
  - get_order_history() returns two orders (P001, P003)
  - Read img/tobias-tullius-...-unsplash.jpg → Brown pilot jacket real photo bytes

[Multimodal Function Response]
  Part.from_function_response(
    name="get_order_history",
    response={"orders": [...], "order_count": 2},
    parts=[FunctionResponsePart(inline_data=FunctionResponseBlob(data=<photo>))]
  )

[Gemini responds after seeing the real photo]
  "From the photo, you can see that this is a brown pilot jacket, made of lightweight nylon with
   a glossy feel, and a metal zipper pocket on the left sleeve. This is your January 15, 2026
   order ORD-2026-0115, for a total of NT$1,890, status: delivered."

LINE displays: [Text] + [Brown pilot jacket real photo]


Scenario 2: "Are there any dark blue jackets?"

[Gemini → function_call]
  search_products(description="dark blue jacket", color="dark blue")

[Gemini sees the real photo of the P003 dark blue denim jacket]
  "Yes! This dark blue denim jacket (P003) in the photo features a retro stitching design,
   a lapel with metal buttons, and a very complete garment feel, priced at NT$1,490, with 8 in stock."


Scenario 3: "What are the features of the P004 knitted shawl?"

[Gemini → function_call]
  get_product_details(product_id="P004")

[Gemini sees the real photo of the beige knitted shawl]
  "The photo shows a beige handmade crochet shawl, with a V-neck design and tassels at the bottom,
   you can see the light lace-like mesh weave, elegant texture, priced at NT$1,290."


Traditional Function Response vs Multimodal Function Response

| | Traditional | Multimodal |
| --- | --- | --- |
| Function return | Pure JSON | JSON + image/PDF bytes |
| Gemini perception | Text data | Text + visual content |
| Answer quality | "You bought a brown pilot jacket" | "You can see the nylon texture in the photo, with a zipper pocket on the left sleeve..." |
| API difference | `Part.from_function_response(name, response)` | `Part.from_function_response(name, response, parts=[...])` |
| Applicable scenarios | Pure text data queries | Scenarios that require visual recognition or confirmation |

Analysis and Outlook

This implementation gave me a new understanding of Gemini's Function Calling capabilities.

The real problem Multimodal Function Response solves is letting an AI agent carry visual information inside the very act of "calling an external system," instead of querying text first and then uploading images separately. This will be an important foundational capability in visually heavy domains such as e-commerce, medicine, and design.

However, there are still a few limitations worth noting:

  1. Image URLs cannot be used directly: FunctionResponseBlob requires raw bytes, so a URL cannot be passed in as-is (unlike images attached directly to the prompt). If the image lives at a URL, download it to bytes first (e.g. with requests.get()) before passing it in.

  2. display_name is optional: the official documentation examples include display_name and a $ref JSON reference, but in actual testing with google-genai 1.49.0 everything works without display_name, and Gemini can still see and analyze the image.

  3. Model limitations: the official documentation lists the Gemini 3 series as supported, but in actual testing gemini-2.0-flash also handled it without issues, with an identical API structure.
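For limitation 1 above, converting an image URL into raw bytes is a short helper. A hedged sketch; the function name and the content-type check are my own additions:

```python
# Illustrative: download an image URL to the raw bytes that
# FunctionResponseBlob requires. Not from the repo.
import requests

SUPPORTED_IMAGE_TYPES = ("image/jpeg", "image/png", "image/webp")

def fetch_image_bytes(url: str, timeout: float = 10.0) -> bytes:
    """Download an image and return raw bytes suitable for a function response."""
    resp = requests.get(url, timeout=timeout)
    resp.raise_for_status()
    # Content-Type may carry a charset suffix; keep only the media type
    content_type = resp.headers.get("Content-Type", "").split(";")[0].strip()
    if content_type not in SUPPORTED_IMAGE_TYPES:
        raise ValueError(f"unsupported image type: {content_type!r}")
    return resp.content
```

The returned bytes drop straight into `FunctionResponseBlob(mime_type=content_type, data=...)`.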

There are many directions that can be extended in the future: let users send their own product photos for the Bot to compare, include PDF catalogs in the function response for Gemini to read directly, or let the Bot analyze the report images converted from DICOM in medical scenarios... As long as visual data can be obtained from external systems, Multimodal Function Response can make the AI's answers more in-depth.

Summary

The focus of this LINE Bot implementation is only one sentence: Let the function response carry the image, and Gemini's answer will be upgraded from "restating data" to "telling a story based on the picture".

The core API is just these few lines, but it takes a lot of details to get the whole process working:

# The complete way for Gemini to see the image returned by the function
types.Part.from_function_response(
    name="get_order_history",
    response={"orders": [...]},
    parts=[
        types.FunctionResponsePart(
            inline_data=types.FunctionResponseBlob( # ← Not types.Blob!
                mime_type="image/jpeg",
                data=image_bytes,
            )
        )
    ],
)


The complete code is on GitHub, feel free to clone and play with it.

See you next time!
