DEV Community

Evan Lin for Google Developer Experts

Posted on • Originally published at evanlin.com on

Gemini: Building a LINE E-commerce Chatbot That Can "Tell Stories" from Images



Background

I believe many people have used the LINE Bot + Function Calling combination. When a user asks "What clothes did I buy last month?", the Bot calls a database query function, retrieves the order data, and Gemini answers based on the returned JSON:

Traditional process designed by developers:

User: "Help me take a look at the jacket I bought before"
Bot: [Call get_order_history()]
Function returns: {"product_name": "Brown pilot jacket", "order_date": "2026-01-15", ...}
Gemini: "You bought a brown pilot jacket on January 15th for NT$1,890."


The answer is completely correct, yet something always feels missing: the user is talking about "that jacket", but Gemini is merely restating the text in the JSON and has no way to confirm what the garment actually looks like. If the database happens to contain three jackets, the AI simply cannot tell which one is the one the user remembers.

The AI can read text but cannot see images. This limitation has always been a blind spot in the traditional Function Calling architecture.


This problem wasn't truly solved until Gemini launched Multimodal Function Response.


What is Multimodal Function Response?

The traditional Function Calling process is as follows:

[User message] → Gemini → [function_call] → [Execute function] → [Return JSON] → Gemini → [Text answer]


Multimodal Function Response changes that middle step: the function can return not only JSON but also images (JPEG/PNG/WebP) or documents (PDF) in the same response:


[User message] → Gemini → [function_call] → [Execute function] → [Return JSON + image bytes] → Gemini → [Text answer after seeing the image]


Gemini can "see" the structured data and images returned by the function at the same time when generating the next round of answers, thereby generating richer and more accurate responses.

The officially supported media formats:

| Category | Supported formats |
| --- | --- |
| Image | `image/jpeg`, `image/png`, `image/webp` |
| Document | `application/pdf`, `text/plain` |
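Since an unsupported MIME type will only fail once the request reaches the API, a small guard can catch it earlier. This is a hypothetical convenience helper, not part of the demo repo:

```python
# Hypothetical guard for the officially supported MIME types listed above;
# not part of the demo repo, just a convenience sketch.
SUPPORTED_MIME_TYPES = {
    "image/jpeg", "image/png", "image/webp",  # images
    "application/pdf", "text/plain",          # documents
}

def validate_mime_type(mime_type: str) -> str:
    """Raise early instead of letting the API reject an unsupported blob."""
    if mime_type not in SUPPORTED_MIME_TYPES:
        raise ValueError(f"Unsupported function-response media type: {mime_type}")
    return mime_type
```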

This capability applies to a very broad range of scenarios: e-commerce customer service (identifying product images), medical consultation (analyzing PDF test reports), design review (giving feedback based on screenshots)... almost any scenario where a function needs to return visual data for the AI to analyze.


Project Goals

This time, I used Multimodal Function Response to create a LINE e-commerce customer service robot, demonstrating the following scenario:

User: "Help me take a look at the jacket I bought before" Bot (traditional): "You bought a brown pilot jacket." Bot (Multimodal): "From the photo, you can see that this is a brown pilot jacket, made of lightweight nylon, with metal zipper decorative pockets on the sides. This is your January 15th order ORD-2026-0115, a total of NT$1,890, and has been delivered." + Product photo

The difference is obvious: Gemini really "saw" the garment, rather than just restating text from the database.


Architecture Design

Why not use Google ADK?

Originally, this repo used Google ADK (Agent Development Kit) to manage the agent. ADK's Runner and Agent encapsulate the entire Function Calling flow, which is very convenient.

But Multimodal Function Response requires manually including image bytes in the function response's parts, and ADK fully encapsulates that layer, leaving no way to intervene.

So this time, I used google.genai.Client directly and implemented the function-call iteration loop myself:

# Old architecture (ADK)
runner = Runner(agent=root_agent, ...)
async for event in runner.run_async(...):
    ... # ADK handles all function calls for you, but you cannot control the response content

# New architecture (directly use google.genai)
response = await client.aio.models.generate_content(
    model=model,
    contents=contents,
    config=types.GenerateContentConfig(tools=ECOMMERCE_TOOLS),
)
# Handle function calls yourself, include images yourself


Overall Architecture

LINE User
    │
    ▼ POST /
FastAPI Webhook Handler
    │
    ▼
EcommerceAgent.process_message(text, line_user_id)
    │
    ├─ ① Call Gemini (with conversation history)
    │
    ├─ ② Gemini decides to call the tool → function_call
    │
    ├─ ③ _execute_tool()
    │ ├─ Execute query function (search_products / get_order_history / get_product_details)
    │ └─ Read real product photos in the img/ directory (Unsplash JPEG)
    │
    ├─ ④ Construct Multimodal Function Response
    │ └─ FunctionResponsePart(inline_data=FunctionResponseBlob(data=image_bytes))
    │
    ├─ ⑤ Call Gemini again (Gemini sees the image + data)
    │
    └─ ⑥ Return (ai_text, image_bytes)
    │
    ▼
LINE Reply:
  TextSendMessage(text=ai_text)
  ImageSendMessage(url=BOT_HOST_URL/images/{uuid}) ← served by the FastAPI /images endpoint


Where do the product images come from?

This demo uses real Unsplash clothing photos. Each of the five products corresponds to an actual photo stored in the img/ directory. The reading logic is very simple:

def generate_product_image(product: dict) -> bytes:
    """Read the product image and return JPEG bytes."""
    with open(product["image_path"], "rb") as f:
        return f.read()


Each product in PRODUCTS_DB has an image_path field pointing to the corresponding image file:

| Product ID | Name | Image |
| --- | --- | --- |
| P001 | Brown pilot jacket | tobias-tullius-...-unsplash.jpg |
| P002 | White cotton T-shirt | mediamodifier-...-unsplash.jpg |
| P003 | Dark blue denim jacket | caio-coelho-...-unsplash.jpg |
| P004 | Beige knit shawl | milada-vigerova-...-unsplash.jpg |
| P005 | Light blue simple T-shirt | cristofer-maximilian-...-unsplash.jpg |

The image bytes that are read serve two purposes:

  1. Included as a FunctionResponseBlob for Gemini to analyze - real photos let Gemini describe the actual fabric texture and tailoring details
  2. Temporarily stored in the image_cache dict and served to the LINE Bot for display through the FastAPI /images/{uuid} endpoint
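The plain dict cache from point 2 grows without bound in a long-running process. A capped variant (a hypothetical sketch, not what the repo ships) could evict the oldest entry once full:

```python
# Hypothetical capped variant of the image_cache dict described above:
# once the cache is full, the oldest entry is evicted.
import uuid
from collections import OrderedDict
from typing import Optional

class ImageCache:
    def __init__(self, max_entries: int = 100):
        self._store: "OrderedDict[str, bytes]" = OrderedDict()
        self._max = max_entries

    def put(self, image_bytes: bytes) -> str:
        """Store image bytes; return the id used in the /images/{uuid} URL."""
        image_id = str(uuid.uuid4())
        self._store[image_id] = image_bytes
        if len(self._store) > self._max:
            self._store.popitem(last=False)  # drop the oldest entry
        return image_id

    def get(self, image_id: str) -> Optional[bytes]:
        return self._store.get(image_id)
```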

Core Code Details

Step 1: Define Tools (FunctionDeclaration)

from google.genai import types

ECOMMERCE_TOOLS = [
    types.Tool(function_declarations=[
        types.FunctionDeclaration(
            name="get_order_history",
            description="Query the current user's order history",
            parameters=types.Schema(
                type=types.Type.OBJECT,
                properties={
                    "time_range": types.Schema(
                        type=types.Type.STRING,
                        description="Time range: all / last_month / last_3_months",
                        enum=["all", "last_month", "last_3_months"],
                    ),
                },
                required=[],
            ),
        ),
        # ... search_products, get_product_details
    ])
]

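For reference, the elided search_products declaration might look roughly like the following. The parameter names (description, color) are inferred from the demo scenarios later in this post, so treat the schema as an illustration rather than the repo's exact definition; it is written as a plain dict for readability, whereas the repo would use a types.FunctionDeclaration like the one above.

```python
# Illustrative schema for the elided search_products tool; the parameter
# names are inferred from the demo scenarios, not copied from the repo.
SEARCH_PRODUCTS_DECL = {
    "name": "search_products",
    "description": "Search the product catalog by a free-text description and optional color",
    "parameters": {
        "type": "OBJECT",
        "properties": {
            "description": {
                "type": "STRING",
                "description": "Free-text description, e.g. 'dark blue jacket'",
            },
            "color": {
                "type": "STRING",
                "description": "Optional color filter",
            },
        },
        "required": ["description"],
    },
}
```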

Step 2: Function Call Loop (up to 5 iterations)

async def process_message(self, text: str, line_user_id: str):
    contents = self._get_history(line_user_id) + [
        types.Content(role="user", parts=[types.Part(text=text)])
    ]

    for _iteration in range(5): # Up to 5 times, to prevent infinite loops
        response = await self._client.aio.models.generate_content(
            model=self._model,
            contents=contents,
            config=types.GenerateContentConfig(
                system_instruction=_SYSTEM_INSTRUCTION,
                tools=ECOMMERCE_TOOLS,
            ),
        )

        model_content = response.candidates[0].content
        contents.append(model_content)

        # Find all function_call parts
        fc_parts = [p for p in model_content.parts if p.function_call and p.function_call.name]

        if not fc_parts:
            # No function call → final text response
            final_text = "".join(p.text for p in model_content.parts if p.text)
            break

        # Has function call → execute tool, include image
        tool_parts = []
        for fc_part in fc_parts:
            result_dict, image_bytes = _execute_tool(
                fc_part.function_call.name,
                dict(fc_part.function_call.args),
                line_user_id,
            )
            tool_parts.append(
                self._build_multimodal_response(fc_part.function_call.name, result_dict, image_bytes)
            )

        contents.append(types.Content(role="tool", parts=tool_parts))

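Stripped of the Gemini SDK, the control flow of this loop reduces to the following standalone sketch. fake_model and run_tool are illustrative stand-ins for generate_content and _execute_tool, not real API calls:

```python
# Standalone sketch of the function-call loop, with the Gemini SDK replaced
# by stubs so the control flow can run by itself.

def fake_model(contents):
    """First turn: ask for a tool call. After a tool result: answer with text."""
    seen_tool_result = any(
        "tool_result" in part for content in contents for part in content["parts"]
    )
    if not seen_tool_result:
        return {"parts": [{"function_call": {"name": "get_order_history", "args": {}}}]}
    return {"parts": [{"text": "You bought a brown pilot jacket."}]}

def run_tool(name, args):
    """Stub for _execute_tool: return structured data for the named tool."""
    return {"orders": [{"product_name": "Brown pilot jacket"}]}

def process_message(text):
    contents = [{"role": "user", "parts": [{"text": text}]}]
    final_text = ""
    for _iteration in range(5):  # cap iterations, mirroring the loop above
        model_content = fake_model(contents)
        contents.append({"role": "model", **model_content})

        fc_parts = [p for p in model_content["parts"] if "function_call" in p]
        if not fc_parts:  # no tool call -> final text answer
            final_text = "".join(p.get("text", "") for p in model_content["parts"])
            break

        tool_parts = [
            {"tool_result": run_tool(p["function_call"]["name"], p["function_call"]["args"])}
            for p in fc_parts
        ]
        contents.append({"role": "tool", "parts": tool_parts})
    return final_text
```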

Step 3: Construct Multimodal Function Response (the most critical step)

def _build_multimodal_response(self, func_name, result_dict, image_bytes):
    multimodal_parts = []

    if image_bytes:
        # ⚠️ Note: Here you need to use FunctionResponseBlob, not types.Blob!
        multimodal_parts.append(
            types.FunctionResponsePart(
                inline_data=types.FunctionResponseBlob(
                    mime_type="image/jpeg",
                    data=image_bytes, # raw bytes, SDK handles base64 internally
                )
            )
        )

    return types.Part.from_function_response(
        name=func_name,
        response=result_dict, # Structured JSON data
        parts=multimodal_parts or None, # ← Image is here! Gemini can "see" it after receiving it
    )


In the next generate_content call, Gemini receives result_dict (the order JSON) and image_bytes (the product photo) at the same time, so the generated answer can describe the visual content of the image.

Step 4: LINE Bot simultaneously returns text + image

# main.py

ai_text, image_bytes = await ecommerce_agent.process_message(msg_text, line_user_id)

reply_messages = [TextSendMessage(text=ai_text)]

if image_bytes:
    image_id = str(uuid.uuid4())
    image_cache[image_id] = image_bytes # Temporary storage
    image_url = f"{BOT_HOST_URL}/images/{image_id}" # FastAPI provides service
    reply_messages.append(
        ImageSendMessage(
            original_content_url=image_url,
            preview_image_url=image_url,
        )
    )

await get_line_bot_api().reply_message(event.reply_token, reply_messages)


LINE Bot's reply_message supports sending up to five messages at once, so the text and the image can go out together.


Pitfalls

❌ Pitfall 1: FunctionResponseBlob is not Blob

The easiest pitfall to hit: when constructing the multimodal image part, you cannot use types.Blob; you must use types.FunctionResponseBlob:

# ❌ Wrong (raises a TypeError)
types.FunctionResponsePart(
    inline_data=types.Blob(mime_type="image/jpeg", data=image_bytes)
)

# ✅ Correct
types.FunctionResponsePart(
    inline_data=types.FunctionResponseBlob(mime_type="image/jpeg", data=image_bytes)
)


Although both have mime_type and data fields, FunctionResponsePart's inline_data field is typed as FunctionResponseBlob, so Pydantic validation rejects a Blob outright. You can confirm this with python -c "from google.genai import types; print(types.FunctionResponsePart.model_fields)".

❌ Pitfall 2: aiohttp.ClientSession cannot be created at the module level

The original code directly created aiohttp.ClientSession() at the module level:

# ❌ Old method: module level
session = aiohttp.ClientSession() # If there is no running event loop, there will be a warning or error
async_http_client = AiohttpAsyncHttpClient(session)


When main.py is imported in pytest tests, there is no running event loop, so RuntimeError: no running event loop appears. The solution is lazy initialization: create the session only the first time it is actually needed:

# ✅ New method: lazy init
_line_bot_api = None

def get_line_bot_api():
    global _line_bot_api
    if _line_bot_api is None:
        session = aiohttp.ClientSession() # Called within the async route handler, ensuring there is an event loop
        _line_bot_api = AsyncLineBotApi(channel_access_token, AiohttpAsyncHttpClient(session))
    return _line_bot_api


❌ Pitfall 3: LINE Bot needs HTTPS URL to send images

Gemini receives raw bytes, but LINE Bot's ImageSendMessage needs a publicly accessible HTTPS URL.

The solution is to add a /images/{image_id} endpoint in FastAPI: the image bytes are temporarily stored in the image_cache dict, and LINE fetches the image through this endpoint:

@app.get("/images/{image_id}")
async def serve_image(image_id: str):
    image_bytes = image_cache.get(image_id)
    if image_bytes is None:
        raise HTTPException(status_code=404, detail="Image not found")
    return Response(content=image_bytes, media_type="image/jpeg")


For local development, use ngrok to expose port 8000; after deploying to Cloud Run, use the service URL directly.


Demo Display

Mock Database (default data for Demo)

The system has five built-in products (all with real Unsplash photos), and each LINE user is automatically seeded with two demo orders the first time they query their orders:

| Order Number | Date | Product |
| --- | --- | --- |
| ORD-2026-0115 | 2026-01-15 | P001 Brown pilot jacket |
| ORD-2026-0108 | 2026-01-08 | P003 Dark blue denim jacket |

Scenario 1: "Help me take a look at the jacket I bought before"

User sends: "Help me take a look at the jacket I bought before"

[Gemini → function_call]
  get_order_history(time_range="all")

[_execute_tool execution]
  - get_order_history() returns two orders (P001, P003)
  - Read img/tobias-tullius-...-unsplash.jpg → Brown pilot jacket real photo bytes

[Multimodal Function Response]
  Part.from_function_response(
    name="get_order_history",
    response={"orders": [...], "order_count": 2},
    parts=[FunctionResponsePart(inline_data=FunctionResponseBlob(data=<photo>))]
  )

[Gemini responds after seeing the real photo]
  "From the photo, you can see that this is a brown pilot jacket, made of lightweight nylon
   material with a glossy feel, and a metal zipper decorative pocket on the left sleeve. This is your January 15, 2026
   order ORD-2026-0115, a total of NT$1,890, status: delivered."

LINE displays: [Text] + [Brown pilot jacket real photo]


Scenario 2: "Are there any dark blue jackets?"

[Gemini → function_call]
  search_products(description="dark blue jacket", color="dark blue")

[Gemini sees the real photo of P003 dark blue denim jacket]
  "Yes! This dark blue denim jacket (P003) in the photo adopts a retro stitching design,
   lapel with metal buttons, the overall garment feel is full, priced at NT$1,490, 8 in stock."


Scenario 3: "What are the features of the P004 knit shawl?"

[Gemini → function_call]
  get_product_details(product_id="P004")

[Gemini sees the real photo of the beige knit shawl]
  "The photo shows a beige handmade crochet shawl, V-neck design with tassels at the bottom,
   you can see the light lace-like mesh weave, elegant texture, priced at NT$1,290."


Traditional Function Response vs Multimodal Function Response

| | Traditional | Multimodal |
| --- | --- | --- |
| Function return | Pure JSON | JSON + image/PDF bytes |
| Gemini perception | Text data | Text + visual content |
| Answer quality | "You bought a brown pilot jacket" | "You can see the nylon material gloss, zipper pocket on the left sleeve..." |
| API difference | `Part.from_function_response(name, response)` | `Part.from_function_response(name, response, parts=[...])` |
| Applicable scenarios | Pure text data queries | Scenarios requiring visual recognition/confirmation |

Analysis and Outlook

This implementation gave me a new understanding of Gemini's Function Calling capabilities.

What Multimodal Function Response truly solves is letting the AI agent bring visual information into the very act of calling an external system, instead of checking the text first and then uploading the image separately. This will be an important foundational capability in visually driven areas such as e-commerce, medicine, and design.

However, there are still a few limitations worth noting:

  1. Image URLs cannot be used directly: FunctionResponseBlob needs raw bytes; you cannot pass in a URL (unlike including images directly in the prompt). If the image lives at a URL, you must first download it to bytes (e.g. with requests.get()) and then pass it in.

  2. display_name is optional: the official documentation examples include display_name and $ref JSON references, but in my tests with google-genai 1.49.0 everything works without display_name, and Gemini can still see and analyze the image.

  3. Model limitations: the official documentation lists support for the Gemini 3 series, but in my tests gemini-2.0-flash also handles it normally, with the same API structure.
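For limitation 1, turning an image URL into the raw bytes FunctionResponseBlob expects is a one-liner. This stdlib-only sketch mirrors what the requests.get() approach in the text would do:

```python
# Sketch for limitation 1: download an image URL into raw bytes before
# passing it to FunctionResponseBlob. Uses only the stdlib;
# requests.get(url).content is the equivalent mentioned in the text.
from urllib.request import urlopen

def fetch_image_bytes(url: str) -> bytes:
    with urlopen(url) as resp:
        return resp.read()
```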

There are many directions to extend this: let users send their own product photos for the Bot to compare, include PDF catalogs in the function response for Gemini to read directly, or have the Bot analyze report images converted from DICOM in medical scenarios... As long as visual data can be obtained from an external system, Multimodal Function Response can make the AI's answers deeper.

Summary

The focus of this LINE Bot implementation comes down to one sentence: let the function response carry images, and Gemini's answers upgrade from "restating data" to "telling stories based on images".

The core API is just a few lines, but plenty of details are needed to get the whole flow working:

# The complete pattern for a function response carrying an image Gemini can see
types.Part.from_function_response(
    name="get_order_history",
    response={"orders": [...]},
    parts=[
        types.FunctionResponsePart(
            inline_data=types.FunctionResponseBlob( # ← Not types.Blob!
                mime_type="image/jpeg",
                data=image_bytes,
            )
        )
    ],
)


The complete code is on GitHub, feel free to clone and play with it.

See you next time!
