DEV Community: Evan Lin

[Gemini API in Action] Building MemeFinder: A Native Mac Menu Bar Widget for Finding Memes via Text Using Gemini Vision & Semantic Embeddings

Evan Lin — Mon, 22 Jun 2026 00:41:17 +0000

The Origin: Mid-Conversation, Where on Earth Is That Meme?

Anyone who chats a lot has a folder full of memes on their phone and computer, but the moment you actually need one — the conversation is rolling, you want to drop a "thanks but no thanks" or an "I'm trash" reaction — you can't find it. The filename is IMG_4821.jpg, the photo library has no categories, and search is a non-starter.

I first came across a wonderful open-source project, ShiQu1218/MemeTalk. It builds a local meme semantic-search system with Python + Streamlit + SQLite: it scans your local meme folder, indexes images with OCR and vector embeddings, then does multi-route retrieval. Feature-complete, but research-oriented and requires opening a browser to run Streamlit.

What I wanted was something closer to an "everyday handy tool":

A native Mac app, one search box. I type what I'm looking for and the relevant meme pops up. Click it and it's copied straight to the clipboard.

So MemeFinder was born. This post records its journey from zero to "menu-bar resident + global hotkey," and several representative pitfalls along the way.

System Design and Architecture

The core concept is simple: point at a local meme folder → have Gemini build an index for each image → type to do a semantic search → click to copy.

I made three key technical decisions:

Native SwiftUI app, not Electron. Copying images to the clipboard, global hotkeys, menu-bar residency — with AppKit these are all first-class citizens.
Gemini does two things: the vision model gemini-3-flash-preview reads the text in each image and generates a Traditional Chinese description plus emotion tags; gemini-embedding-2 turns that semantics into a 768-dimensional vector.
Hybrid semantic-vector + keyword search. Pure keyword recall for Chinese is too poor; only semantic vectors achieve "type a related description and find the image."

System Architecture Flow

The project is deliberately split into two Swift Package targets:

Target	Type	Contents
`MemeFinder`	library	Logic, models, services, ViewModels (all unit-tested)
`MemeFinderApp`	executable	SwiftUI views + menu-bar shell (thin layer, depends on the library)

This split isn't decorative — it directly determines whether the tests can run smoothly, as "Pitfall #2" will explain.

Core Implementation

1. Auto-tagging memes with the Gemini vision model

During indexing, each image is sent to the vision model with a request to output only JSON: the text in the image, a Traditional Chinese description, tags, and emotion. responseMimeType is set to application/json to keep the output format stable:

public static func annotateRequest(apiKey: String, imageData: Data, mimeType: String) -> URLRequest {
    let prompt = """
    你是迷因圖標註助手。請閱讀這張圖，輸出 JSON，欄位：
    ocr_text(圖中所有文字), description(用繁體中文描述畫面與梗),
    tags(3-8 個繁體中文關鍵字陣列), emotion(單一情緒詞)。只輸出 JSON。
    """
    let body: [String: Any] = [
        "contents": [[
            "parts": [
                ["text": prompt],
                ["inline_data": ["mime_type": mimeType, "data": imageData.base64EncodedString()]]
            ]
        ]],
        "generationConfig": ["responseMimeType": "application/json"]
    ]
    // ... set URL, x-goog-api-key header, POST body
}

2. Hybrid semantic + keyword ranking

After the query string is embedded into a vector, we compute cosine similarity for every image, then add weight for keywords that hit the OCR text and tags, and merge-sort:

public func search(queryEmbedding: [Float], queryText: String,
                   in images: [IndexedImage], limit: Int) -> [SearchResult] {
    let tokens = queryText.lowercased().split(whereSeparator: { $0.isWhitespace }).map(String.init)
    let results: [SearchResult] = images.compactMap { image in
        let cos = cosineSimilarity(queryEmbedding, image.embedding)
        let haystack = (image.ocrText + " " + image.tags.joined(separator: " ")).lowercased()
        let matches = tokens.filter { !$0.isEmpty && haystack.contains($0) }.count
        let boost = 0.1 * Float(min(matches, 3))   // keyword boost capped at 0.3
        let score = cos + boost
        return score > 0 ? SearchResult(image: image, score: score) : nil
    }
    return Array(results.sorted { $0.score > $1.score }.prefix(limit))
}

The whole search engine is a pure function, with Gemini hidden behind a protocol, so this logic can be fully unit-tested offline without hitting the real API.

Major Pitfalls and Solutions

The real time sink in this project was never the happy path — it was the pitfalls below.

Pitfall #1: The mysterious `GeminiError error 0` — indexing and search both fail

App packaged, key set, folder chosen, hit search — and nothing shows below, just GeminiError error 0.

Rather than guessing, I hit the embedding endpoint once with a real key and printed the response:

curl "https://generativelanguage.googleapis.com/v1beta/models/gemini-embedding-2:embedContent" \
  -H "x-goog-api-key: $KEY" \
  -d '{"content":{"parts":[{"text":"貓"}]},"output_dimensionality":768}'

The evidence was unmistakable:

{ "embedding": { "values": [ -0.0063, -0.0200, ... ] } }

The problem: my parser was reading the plural embeddings[0].values (that's the batchEmbedContents batch-endpoint format), but the single embedContent call returns the singular embedding.values. So every embed call failed — indexing each image failed, embedding the query string failed — all throwing badResponse (shown in the UI as GeminiError error 0).

[Solution]
Fix the parser to read the singular embedding.values, keeping the plural format as a fallback; I also hardened the annotation parser (a thinking model sometimes returns a textless "thought" part first, so skip to the first part that actually has text):

public static func embedding(fromEmbedContent data: Data) throws -> [Float] {
    guard let root = try? JSONSerialization.jsonObject(with: data) as? [String: Any] else {
        throw GeminiError.badResponse("cannot parse embedContent payload")
    }
    // A single embedContent returns {"embedding":{"values":[...]}}
    if let embedding = root["embedding"] as? [String: Any],
       let values = embedding["values"] as? [Double] {
        return values.map(Float.init)
    }
    // batchEmbedContents is {"embeddings":[{"values":[...]}]} — tolerate it too
    if let embeddings = root["embeddings"] as? [[String: Any]],
       let values = embeddings.first?["values"] as? [Double] {
        return values.map(Float.init)
    }
    throw GeminiError.badResponse("cannot parse embedContent payload")
}

Lesson: trust the actual API response over your memory or secondhand docs. A single line of curl saved countless guesses.

Pitfall #2: SwiftPM's `main` entry-point conflict and the SwiftUICore linking error

I initially made the whole project a single executableTarget with the tests depending on it directly. The result: tests failed to link no matter what. An executable target needs a main entry point, but that entry point only exists at the UI step's @main App; and casually adding a placeholder main.swift then conflicts with @main (Swift doesn't allow two entry points in one target). Worse, SwiftUI in an executable target spews SwiftUICore.tbd ... not an allowed client linker warnings.

[Root cause analysis and solution]
This is actually an architecture problem, not a compilation problem. The right approach is to split the project into two layers:

MemeFinder (library target): all logic, models, services, ViewModels — the tests depend only on this layer, it has no entry point, and it links cleanly as a library. ViewModels import Combine (not SwiftUI) to get ObservableObject.
MemeFinderApp (executable target): only SwiftUI views and @main, with import MemeFinder to use the public types above.

After the split, the library and tests don't touch SwiftUI at all, the linker warnings disappear, and the @main conflict no longer exists. "What the tests need to depend on" often forces out clean module boundaries.

Pitfall #3: Parallel indexing's rate limit and "I want to stop indexing halfway"

The first version indexed one image at a time, serially calling Gemini (annotate then embed). For hundreds of images this was painfully slow. So I switched to bounded parallelism with withTaskGroup (at most 4 at once), which brought three new problems:

The Gemini free tier has a rate limit — too much concurrency triggers 429.
The user wants to cancel halfway through a large folder.
Parallel completion order is chaotic, but the results need stable sorting.

[Solution]
Handle the three problems separately, all converging in the same buildIndex:

429 backoff retry: retry only GeminiError.rateLimited with exponential backoff (max 3 attempts); other errors are recorded without retry.
Cooperative cancellation: honor Task.isCancelled; on cancel, stop scheduling new work and keep the completed portion. Even the backoff Task.sleep lets CancellationError propagate normally instead of swallowing it and firing one more API call.
Stable sorting: collect results into a [path: image] dictionary, then reassemble the output in the order of the pre-sorted file list, decoupled from completion order.

// Seed maxConcurrent tasks first, then refill one per completion — strictly cap concurrency
for _ in 0..<maxConcurrent { if !scheduleNext() { break } }
while let res = await group.next() {
    if let img = res.image { resultsByPath[res.path] = img }
    if let err = res.error { errors.append(err) }
    done += 1
    progress(done, total)
    _ = scheduleNext()
}

Incidentally, the HTTP status code was also extracted into a pure function mapResponse(data:statusCode:): 429 → rateLimited, other non-2xx → httpError(code), 2xx → return the data. The retry logic then has a basis, and this part is easy to test too.

Pitfall #4: Evolving from a "windowed app" into "menu-bar resident + global hotkey"

Whether a tool is pleasant to use comes down to "how many steps to summon it." I wanted to hit ⌃⌘M mid-conversation to bring up the search popover, with the app tucked into the menu bar, not occupying the Dock. This step hit two classic macOS pitfalls:

(a) Does a global hotkey need accessibility permission? No. Use Carbon's RegisterEventHotKey to register a fixed hotkey, which doesn't need Accessibility permission (unlike monitoring the whole keyboard). But under Swift 6 strict concurrency, the C event callback has to dispatch through a static id → instance registry, requiring nonisolated(unsafe) and relying on the invariant that "Carbon events are delivered on the main thread" for safety. If ⌃⌘M is already taken, RegisterEventHotKey returns failure — in which case we silently degrade, log a line, and the menu-bar icon still works.

(b) The timing race in the menu-bar right-click menu. The initial approach was "set statusItem.menu → performClick → immediately clear menu," but clearing synchronously fights AppKit's menu-tracking loop, and the menu flashes and disappears.

[Solution]
Pop the menu up directly, fully bypassing the assign-and-clear of statusItem.menu:

@objc private func statusButtonClicked() {
    guard let event = NSApp.currentEvent else { togglePopover(); return }
    if event.type == .rightMouseUp {
        // Pop up directly; don't assign then synchronously clear statusItem.menu
        // (it races AppKit's menu-tracking loop)
        if let button = statusItem?.button {
            NSMenu.popUpContextMenu(makeMenu(), with: event, for: button)
        }
    } else {
        togglePopover()
    }
}

Finally, adding LSUIElement = true to the Info.plist produced by build-app.sh makes the Dock icon disappear, and MemeFinder officially becomes a pure menu-bar tool.

Pitfall #5: The settings form is blank — one symptom, three layers of cause

After moving to the menu-bar version, a user reported "the settings window is completely blank." This seemingly simple bug, peeled apart, actually had three layers, each highly representative.

Layer 1: a Form collapses to zero height inside a hand-rolled NSWindow.
Originally the settings screen lived in SwiftUI's native Settings { } scene, which sizes it sensibly. After the refactor it was hosted in a hand-rolled NSWindow(contentViewController: NSHostingController(rootView: SettingsView())), and SettingsView ended with only .frame(width: 460) — width only, no height. NSWindow(contentViewController:) sizes the window from the content's natural size, but a SwiftUI Form is vertically greedy; with no constraint, its natural height resolves to nearly 0, so the window opens as a 460-wide, near-zero-height blank strip. The fix is just to add a height:

.padding(20)
// When hosted in a hand-rolled NSWindow (not a SwiftUI Settings scene), a Form
// with no height constraint collapses to ~0, turning the window into a blank strip.
.frame(width: 460, height: 320)

Layer 2: ⌘, and the menu-bar "Settings…" go down two different paths.
After adding the height, the user said "still blank." On follow-up I found out he was summoning settings with ⌘,, while the menu-bar right-click "Settings…" went down a different path. The reason: ⌘, in a SwiftUI app triggers the Settings { } scene, and to dodge a state-sharing problem during the refactor, I had set that to Settings { EmptyView() }:

// During the refactor, the Settings scene was left empty to avoid state-sharing
// — so ⌘, opens a blank window
var body: some Scene {
    Settings { EmptyView() }
}

In other words, settings had two entry points pointing at different things: ⌘, pointed at the empty scene, the menu-bar "Settings…" pointed at the real window. The fix unifies the two paths — let the Settings scene host the real SettingsView (so ⌘, works directly), and make the menu-bar "Settings…" open the same native settings window too:

Settings {
    SettingsView(vm: appDelegate.settings, indexing: appDelegate.indexing,
                 onReindex: { appDelegate.reindexNow() },
                 onCancel: { appDelegate.cancelReindex() })
}

// The menu-bar "Settings…" now opens the same Settings scene
@objc private func openSettings() {
    NSApp.activate(ignoringOtherApps: true)
    NSApp.sendAction(Selector(("showSettingsWindow:")), to: nil, from: nil)
}

This also leverages the fact that a SwiftUI App body is @MainActor-isolated — so reading the @MainActor appDelegate.settings directly from the body is legal, with no extra bridging needed.

Layer 3 (the most insidious): open doesn't reload a menu-bar app at all.
The biggest time-waster in the process was that after recompiling, I'd ask the user to open MemeFinder.app, yet he kept seeing the old behavior. Because MemeFinder is an LSUIElement menu-bar-resident app — when an instance is already running, open only wakes the existing old process instead of relaunching with the new binary. So we were actually testing the same old build the whole time. The correct dev loop is to truly kill it first, then run from source:

killall MemeFinderApp 2>/dev/null; swift run MemeFinderApp

This layer reminds me: when debugging, first confirm "what you're testing really is the version you changed" — otherwise all your reasoning is built on faulty observations.

On the "Development Process" Itself

This project was driven almost entirely by an AI agent workflow of spec → plan → subagent task-by-task implementation → two-stage review: each feature started with a design spec, was broken into independently testable small tasks, every task wrote a failing test first (TDD) before implementing, and after completion an independent review agent checked spec compliance and code quality, followed by one final whole-branch review.

Several of the pitfalls — GeminiError error 0, the library/executable split, swallowing CancellationError during backoff, the menu timing race — were in fact caught half the time during the review stage, not written correctly on the first pass. This echoes that old principle: having tests as armor, and someone (or an agent) seriously reading the diff, matters far more than writing fast. The final project maintains 47 unit tests and a zero-warning release build.

Results and Benefits

Type to find, click to paste: type a Chinese description in the menu-bar popover, semantic search instantly lists relevant memes, click one to copy it to the clipboard and paste straight into LINE / Slack / Messages.
Privacy-friendly, searchable offline: images and the index live locally (~/Library/Application Support/MemeFinder/index.json); only the "build the index" step calls Gemini.
A truly handy tool: ⌃⌘M is available anytime, menu-bar resident, no Dock footprint; incremental indexing only processes new/changed images, and indexing can show progress and be canceled.
A clean, maintainable architecture: a two-layer library/executable design, Gemini hidden behind a protocol, pure logic fully covered by tests.

All the development code for this project is open-sourced on GitHub: kkdai/meme-finder-app. Feel free to clone it, point it at your own meme-collection folder, and experience the joy of "type to find your meme"!

[Gemini API] Gemini Batch API and Webhook API practical usage on restaurant survey

Evan Lin — Mon, 15 Jun 2026 04:09:16 +0000

A Powerful Tool for Asynchronous Processing: Gemini Batch API & Webhooks

When developing LLM-based applications, we often need to handle a large number of data analysis tasks—for example, analyzing reviews from dozens of restaurants at once, classifying a large volume of articles, or batch generating translations. If we use traditional synchronous APIs (real-time calls), we would not only face severe Rate Limit blockages but also fail due to network connection timeouts and extremely high computing costs.

To overcome this limitation, Google has launched the Gemini Batch API and Webhook API:

Gemini Batch API: Allows developers to package a large number of requests into a JSONL file and upload them all at once. Gemini performs asynchronous scheduled computations in the background, without consuming your daily real-time API quotas (Rate Limits), and its computing cost is usually half that of real-time APIs, making it a perfect choice for non-urgent big data processing.
Webhook API: Traditional Batch tasks require us to constantly write polling logic locally to check the status. With Webhooks, when Gemini completes a Batch computation, it actively sends an HTTP POST callback to your specified URL, instantly notifying you that the task is complete, making the system architecture more elegant and energy-efficient.

This article will document how we integrated these two powerful APIs into our LINE Bot Restaurant Analysis Assistant to achieve one-click deep review and signature dish big data analysis for specific restaurants on mobile devices.

System Design and Optimized Architecture

Originally, the restaurant analysis function worked by having the Bot list nearby restaurants when a user sent their location, and then providing a generic "Deep Review Analysis (Batch)" button. Clicking it would send all nearby restaurants for analysis at once. However, this led to a poor UX: analyzing all restaurants took too long, and users often only wanted to delve into one specific restaurant they were interested in.

Therefore, we optimized the function into dynamic Quick Reply buttons:

The user sends their location, and the Bot searches for nearby restaurants via Google Maps Grounding.
After the client receives a plain text list of restaurants, the Bot automatically uses Gemini to extract the top 3 highest-rated restaurant names.
Three customized Quick Reply buttons are generated (e.g., 🍴 Analyze Din Tai Fung).
After the user clicks a specific restaurant button, the Bot immediately replies "Processing" to avoid LINE timeouts, and submits the Batch task for that single restaurant in the background. Once Gemini completes the computation, it proactively pushes a dedicated big data report.

System Architecture Flow

graph TD
    A[User Sends Location] -->|Location Message| B[Google Maps Grounding Search]
    B -->|Plain Text Restaurant List| C[Gemini-2.5-flash Extracts Top 3 Restaurants]
    C -->|Dynamically Generates Quick Reply| D[LINE Bot Replies with 3 Customized Analysis Buttons]
    D -->|User Clicks Specific Analysis| E[FastAPI Background Task]
    E -->|Immediate Reply ACK| F[LINE Chat Message]
    E -->|Package JSONL and Upload| G[Gemini Batch API Submission]
    G -->|Computation Complete Webhook/Polling Callback| H[Proactively Pushes Deep Report to User]

Core Implementation

1. Precisely Extracting Restaurant Names from Grounding Text using Gemini

In tools/maps_tool.py, the map search returns a plain text string rich in formatting and descriptions. We use Gemini-2.5-flash's structured output concept to precisely extract restaurant names in JSON format:

        # Extract top three restaurant names for Quick Reply
        names = []
        if place_type == "restaurant":
            try:
                extract_prompt = f"Please extract all restaurant names from the following text and return them in a JSON array format (e.g., [\"Restaurant A\", \"Restaurant B\"]). Please output the JSON array directly, without any markdown tags (like ```
{% endraw %}
json) or explanatory text.\n\n{result}"
                extract_res = client.models.generate_content(
                    model="gemini-2.5-flash",
                    contents=extract_prompt
                )
                extract_text = extract_res.text.strip() if extract_res.text else ""

                try:
                    names = json.loads(extract_text)
                except Exception:
                    import re
                    array_match = re.search(r"\[(.*?)\]", extract_text, re.DOTALL)
                    if array_match:
                        import ast
                        names = ast.literal_eval(f"[{array_match.group(1)}]")

                names = [str(n).strip() for n in names if n]
                logger.info(f"Extracted restaurant names for Quick Reply: {names}")
            except Exception as e_extract:
                logger.error(f"Failed to extract restaurant names: {e_extract}")
{% raw %}

2. Dynamically Generating LINE Quick Reply Buttons

In main.py, after obtaining the restaurant list, we dynamically generate QuickReplyButton. We need to pay special attention to LINE API's length limit for button label:


python
        quick_reply = None
        if place_type == "restaurant" and result.get("status") == "success":
            restaurant_names = result.get("restaurant_names", [])
            if restaurant_names:
                buttons = []
                for name in restaurant_names[:3]:
                    clean_label = name
                    # LINE label limit is 20 characters
                    if len(clean_label) > 10:
                        clean_label = clean_label[:9] + "…"
                    buttons.append(
                        QuickReplyButton(
                            action=PostbackAction(
                                label=f"🍴 分析 {clean_label}",
                                data=json.dumps({
                                    "action": "specific_foodie_deep_analysis",
                                    "restaurant_name": name
                                }),
                                display_text=f"🔍 進行「{name}」深度評論與招牌菜色分析"
                            )
                        )
                    )
                quick_reply = QuickReply(items=buttons)

Major Pitfalls and Solutions

During the process of connecting this dynamic Quick Reply to the Batch API, we encountered several critical UX and API limitation issues:

Pitfall One: LINE 20-character Limit Causing API Sending Errors

Initially, when implementing, we directly used the full restaurant name in the button's Label, for example: 🍴 Analyze Love Hot Pot Ultimate Hot Pot. As a result, the LINE API immediately returned a 400 error, and the message could not be sent at all:


plaintext
LineBotApiError: status_code=400, error_message=The property 'label' must be less than 20 characters.

[Cause Analysis and Solution] LINE's official label limit for Quick Reply is extremely strict; including emojis and spaces, it can have a maximum of 20 characters. To address this, we added a character count check and dynamic truncation mechanism in our code:

First, the original restaurant name (clean_label) is truncated: if its length exceeds 10 characters, it is forcibly cut to the first 9 characters and appended with "…" (occupying 10 characters).
Adding the prefix 🍴 Analyze (a total of 5 characters), the maximum total length becomes 15 characters, safely staying within the 20-character limit, thus eliminating the error!

Pitfall Two: Batch API Asynchronous Delay and LINE Webhook's "Three-Second Timeout Survival Battle"

When a user clicks the "Analyze Restaurant" button, the Bot must first call Google Search Grounding to collect online reviews for that restaurant, then package the JSONL file and upload it to Gemini to submit the Batch task. This entire sequence usually takes 3 to 8 seconds. However, the LINE Webhook server requires the Bot to return an HTTP 200 OK response within 3 seconds, otherwise it will be deemed a connection failure and re-send the request, leading to severe server congestion.

[Cause Analysis and Solution] We completely asynchronous the processing architecture:

Fast Response: When the Bot intercepts a specific_foodie_deep_analysis Postback action, it does not execute the analysis directly within the Request flow. Instead, it immediately calls LINE's reply_message to respond to the user: 🔍 Received! Performing deep analysis for you... This will take about 1-2 minutes..., and then instantly returns HTTP 200 to end that Webhook request.
Background Task Dispatch: Use Python asyncio.create_task to dispatch heavy network search, upload, and submission tasks to FastAPI's background Worker for execution.
Big Data Push: When the background Polling listener or Gemini Webhook receives a task completion notification, it then uses LINE's push_message to proactively send the analysis report to the specific user.

Pitfall Three: Gemini Batch API's Queuing and Pending Status

During testing, users sometimes got confused, "Why hasn't there been a reply after three minutes? Is the Bot down?". After checking the system logs, we found that our JSONL file had been successfully uploaded, but the task status on the Gemini server side was stuck at JobState.JOB_STATE_PENDING.

[Solution] This is a characteristic of the Batch API; tasks need to be queued, waiting for Google's server resources. We adopted two major optimizations:

Minimize Workload: Reduce the number of restaurants for batch analysis to 1, shrinking the number of request lines in the JSONL to the extreme, to speed up Gemini's scheduling and processing.
UX Optimization and Deduplication Mechanism: When a user clicks to analyze, we first check if that user already has a Batch Job running. If so, we reply: ⏳ Your deep analysis task is currently running, please wait patiently, preventing users from submitting multiple duplicate Batch Jobs due to anxious repeated clicks, which would consume unnecessary resources.

Results and Benefits

This optimization of Quick Reply and Gemini Batch API for the LINE Bot Restaurant Assistant has achieved excellent practical value:

Highly Customized Mobile Experience: After locating, users don't need to type; they can directly click on a restaurant of interest with one tap to precisely get a summary of its signature dishes and review pain points.
Robust Backend Architecture: By leveraging asynchronous background tasks and LINE's character limit safety valve, the risks of Webhook timeouts and LINE API errors have been completely resolved.
Cost Advantage for Big Data Processing: Through the Batch API's half-price advantage and Webhook's proactive callback, while ensuring user experience, it also saves significant computing resources and API costs for the server.

Through this architecture, the LINE Bot truly achieves a low-latency, highly stable big data deep analysis experience on mobile!

All development code for this project has been open-sourced on GitHub: kkdai/linebot-helper-python. Everyone is welcome to deploy and personally test this one-click analysis function, which we believe can bring a higher level of intelligent experience to your LINE Bot projects!

[I/O Extended Taipei] Building with Gemini APIs: From Calls to Autonomous Systems

Evan Lin — Sun, 14 Jun 2026 07:12:11 +0000

(Activity: Google I/O Extended 2026 Taipei / Presentation: SpeakerDeck)

Context: The Gemini API is no longer just "adding one more prompt"

If your impression of the Gemini API is still limited to "select a model, send a prompt, get back a piece of text," then when you see this round of updates in 2026, you'll likely suddenly realize something:

The Gemini API has evolved from a simple API interface into a complete platform that can be used to build applications, agents, and asynchronous workflows.

This content is compiled from my talk "Building Applications in the Gemini API Family" at Google I/O Extended 2026 Taipei. Evan Lin, Technical Director of LINE Taiwan Developer Relations, repeatedly emphasized a core observation at the event: what developers truly need to consider now is no longer just "Should I use Pro or Flash?", but rather "How do I string together models, retrieval, agents, callbacks, and cost control into a cohesive system?".

In other words, the focus is shifting from calling APIs to designing systems.

First, let's look at the big picture: What's new in the 2026 Gemini API family?

If we view the 2026 Gemini API as a capability map, it can broadly be divided into three layers.

Layer 1: Core Models

Gemini 3.5 Pro: Strongest reasoning capability, suitable for complex planning, advanced analysis, and multi-step tasks.
Gemini 3.5 Flash: Main model, best balance of speed, cost, and capability, suitable for most product traffic.
Flash-Lite: Intent classifier and pre-classifier for high-frequency, low-cost scenarios.
Gemini Embedding 2: Supports not only text but also multi-modal vectorization needs.

Layer 2: Key Capability Modules

Retrieval: File Search, Google Search Grounding, URL Context.
Agent / Async: Agents API, Webhook, Deep Research agent.
Infrastructure: Context caching, Batch API, Live API.

Layer 3: System Design Approach

This layer is arguably the most important. Because once the above capabilities are offered as platform services, many "intermediate layers" that previously had to be built manually suddenly disappear:

No longer necessarily need to build your own RAG pipeline.
No longer necessarily need to maintain your own agent loop.
No longer necessarily need to block the main server with polling while waiting for results.

Core Observation: The Gemini API upgrade is not just about "stronger models"; it's about Google absorbing the complexities that were originally at the application layer into the platform layer. This will directly change how we design AI systems.

Architectural Turning Point: Three Tools, Three Paradigm Shifts

What's most worth repeatedly digesting from this talk are the architectural changes represented by these three tools.

1. File Search: Shifting from Hand-Coded RAG to Managed RAG

Previously, when discussing enterprise knowledge Q&A, the immediate thought was:

Chunking.
Creating embeddings.
Storing in a vector DB.
Writing retrieval code.
Then manually adding citation and permission control.

Now, with the advent of File Search, developers can focus more on "how documents are governed, how permissions are allocated, and how answers are presented," rather than repeatedly writing that foundational infrastructure.

More importantly, it doesn't just search text.

Why is this File Search particularly noteworthy?

Images and text in the same space: Screenshots, charts, and mixed text-image layouts in PDFs are no longer just attachments, but content understandable by the model.
Metadata filtering: Can filter by department, system, and document type, which is crucial for internal enterprise knowledge retrieval.
Precise citation: Can refer back to specific page numbers and grounding metadata, making answers more trustworthy.

This represents a very practical shift: much of the time enterprises previously spent on LangChain, vector databases, and chunking strategies can now largely be redirected towards permission design, UX, and content governance.

2. Agents API: Shifting from Client-Side Loop to Server-Side Managed Agent

In the past, to build an agent, the common approach was to maintain your own ReAct or tool loop:

Model decides the next step.
Calls a tool.
Receives results.
Feeds back to the model.
Repeats until completion.

The problem is that this is full of engineering details: state preservation, timeouts, retries, background execution, long-task monitoring. Ultimately, you'd find yourself spending most of your time maintaining an "agent runtime."

What the Agents API changes is that you can POST a task to Gemini, allowing it to complete the long process on the server side, even handling complex tasks that take up to 20 minutes.

The significance behind this is not just "more convenient"; it means developers can finally refocus on:

How are tasks defined?
Which tools can be used?
What are the success criteria?
How should the product integrate the results when they return?

3. Webhook: Shifting from Polling to Event-Driven

Once tasks might run for several minutes, or even more than ten minutes, traditional synchronous requests become unreasonable.

Therefore, the role of Webhook is actually crucial: it's not a minor feature, but a prerequisite for the entire agent workflow to truly enter production. When Gemini completes a task and actively POSTs the result back to your server, your system can become event-driven:

The frontend first responds to the user with "Task received."
The Agents API executes in the background.
Upon completion, the result is pushed back via webhook.
Your service then notifies the user, updates the database, or triggers the next step.

This is particularly important for high-concurrency products, as you finally don't need to hold a bunch of server connections idly waiting.

From the Perspective of a LINE Bot, How Should a Gemini Application Be Designed?

A very practical suggestion Evan gave in his talk is to place a router layer before the LLM.

This design sounds simple, but it largely determines your cost, latency, and predictability.

A Very Pragmatic Routing Approach

First, use the inexpensive Flash-Lite for intent routing:

Quick Q&A: Directly handed over to Flash or Flash-Lite for generation.
Query company documents: Enters File Search.
Complex long tasks: Enters Agents API.

Doing this has three benefits:

Cost control first: Not every query directly hits the most expensive, heaviest model.
Latency control first: Simple requests should not mistakenly enter long processes.
System behavior control first: Makes the overall process more stable than "throwing everything at a large model for improvisation."

If you're building a LINE Bot, customer service assistant, internal knowledge assistant, or workflow agent, this router should almost certainly be the default configuration, rather than an afterthought.

Infrastructure is Not Unimportant, But You Don't Have to Rebuild it Yourself Every Time

Another strong message from this talk is that developers' time should be reallocated.

Previously, much of the man-hours in many AI projects were actually consumed by these tasks:

Vector database operations and maintenance
Chunking and retrieval parameter tuning
Long-task scheduling
Websocket / polling / callback processes
Token cost optimization

Now, with File Search, Agents API, Webhook, Context caching, and Batch API, the areas where we should spend more time have shifted to:

Business rules and tool boundaries
Document permissions and data governance
User interaction experience
Task decomposition and routing strategies
Failure recovery and result interpretability

This is also why I strongly agree with Evan's underlying message: What's truly valuable is not whether you can build your own vector database, but whether you can redirect 80% of your energy back to the product's core.

Three Most Valuable Practical Takeaways

1. Place a routing layer before the LLM

Don't send all problems directly to the same model. First classify, then decide whether to generate, retrieve, or enter an agent task.

2. Embrace asynchronous operations; don't force long tasks into synchronous APIs

If a task might take more than a few seconds, you should seriously consider Agents API + Webhook. This is not an optimization; it's an architectural correctness issue.

3. Redirect RAG engineering time to permissions and experience

When File Search can handle a large amount of foundational work, developers should be more concerned with: can data be securely queried, can answers be verified, and can citations be trusted by users.

Why is this talk worth revisiting repeatedly?

Because it highlights a turning point that many teams are currently facing:

We are no longer just writing prompts for LLMs; we are designing operating systems for AI applications.

Models are certainly still at the core, but what truly differentiates products is increasingly not "which model you choose," but:

How you decide when to use which capability.
How you make the system run reliably for extended periods.
How you make answers traceable, verifiable, and maintainable.

If you still understand generative AI using the 2024 approach of "a single chat endpoint for everything," then you'll easily underestimate the 2026 Gemini API family.

Postscript: From API User to AI System Designer

The most valuable aspect of this "Building Applications in the Gemini API Family" talk is not teaching you another new parameter or SDK, but reminding everyone of a more fundamental shift:

The competitiveness of the next phase will not be about who is better at calling models, but who is better at assembling models, retrieval, agents, and event flows into a truly functional system.

If you are working on a LINE Bot, enterprise knowledge base, internal assistant, customer service process, or any product requiring multi-step AI collaboration, this architectural perspective is well worth using to redraw your current system diagram.

Often, what truly needs refactoring is not the prompt, but the entire pipeline.

[Hands-on Gemini 3.5 Live

Evan Lin — Fri, 12 Jun 2026 06:09:59 +0000

Brand New API Unveiled: Gemini 3.5 Live Translate

On June 9, 2026, Google officially released its brand new real-time voice translation model — Gemini 3.5 Live Translate. This marks another significant breakthrough for Google in AI voice translation technology. It is currently available for public preview to developers in Google AI Studio and Gemini Live API, and has been simultaneously integrated into services like Google Translate and Google Meet.

Key features of Gemini 3.5 Live Translate include:

Fluent and Natural Bidirectional Voice Translation: Supports over 70 languages, automatically detecting the input voice language without manual configuration.
Continuous Stream Generation (Instead of Single-Sentence Turn-Taking): Unlike previous turn-by-turn systems that required the speaker to finish speaking before translation, Gemini 3.5 Live Translate generates translations in real-time while listening. It strikes a balance between contextual understanding and immediacy, with translations lagging only a few seconds behind the speaker, completely avoiding awkward pauses.
Preservation of Intonation and Rhythm: The generated voice is not only smooth but also retains the original speaker's tone, intonation, and speaking rhythm.
Robust Noise Cancellation Capability: Accurately captures and recognizes speech even in noisy or unstable environments.

This article will document how we developed a native macOS application, MeetingTranslator, using Swift, to integrate with this powerful new API and achieve real-time translation of specific app audio into Traditional Chinese voice and subtitles.

System Design and Architecture

Our goal is to develop a Native SwiftUI application that does not require installing virtual sound cards like BlackHole. Instead, it utilizes Apple's official ScreenCaptureKit framework to directly capture the audio stream from a selected application (such as YouTube in Google Chrome or an online meeting) and, through the Gemini Live WebSocket API, achieve ultra-low-latency conversational voice translation.

System Architecture Flow

graph TD
    A[ScreenCaptureKit <br>Capture Application Audio] -->|48kHz Stereo Float32| B[AVAudioConverter <br>Resampling and Channel Conversion]
    B -->|16kHz Mono Int16 PCM| C[Gemini Live API <br>WebSocket Connection]
    C -->|Real-time Subtitle Recognition| D[SwiftUI Subtitle HUD <br>Traditional Chinese Bilingual Subtitles]
    C -->|24kHz Mono Int16 PCM Translated Audio| E[AudioPlaybackManager <br>AVAudioEngine Player]

Core Implementation One: ScreenCaptureKit Capture and Resampling

ScreenCaptureKit, introduced in macOS 13, frees developers from the pain of relying on kernel audio virtual devices, allowing precise filtering and recording of specific application screens and audio.

1. Filter and Select Target App

We use SCShareableContent to get currently running applications on the system and filter out background services without names and system-自带 services:

func fetchShareableApps() async -> [SCRunningApplication] {
    do {
        let content = try await SCShareableContent.current
        return content.applications.filter { app in
            let name = app.applicationName
            guard !name.isEmpty else { return false }
            let bundleId = app.bundleIdentifier
            return !bundleId.hasPrefix("com.apple.system") && bundleId != Bundle.main.bundleIdentifier
        }.sorted { $0.applicationName < $1.applicationName }
    } catch {
        print("無法獲取可共享內容: \(error)")
        return []
    }
}

2. Start Audio Capture Stream

After filtering out the target App (e.g., Google Chrome), we create an SCContentFilter for it and apply it to SCStream:

let appFilter = SCContentFilter(display: content.displays.first!, including: [targetApp], exceptingWindows: [])
let config = SCStreamConfiguration()
config.capturesAudio = true
config.width = 32 // When only capturing audio, set video frame to minimal to save performance
config.height = 32

stream = SCStream(filter: appFilter, configuration: config, delegate: nil)
try stream?.addStreamOutput(self, type: .audio, sampleHandlerQueue: DispatchQueue(label: "com.translator.audioQueue"))
try await stream?.startCapture()

Core Implementation Two: Gemini Live WebSocket Bidirectional Connection

The core of the Gemini Live API lies in using a wss:// connection to transmit microphone/application audio in real-time through a single channel, and simultaneously receive model-generated translated text and translated audio.

In GeminiLiveConnection.swift, we maintain this bidirectional pipeline via URLSessionWebSocketTask. After connecting, a setup control message must be sent immediately to initialize the model configuration.

Major Pitfalls and Solutions

During the process of integrating the system, we encountered three blocking difficulties. Below is our troubleshooting process and solutions:

Pitfall One: Gemini Live Exclusive Model Restrictions

Initially, we tried to use standard REST API model names (e.g., gemini-3.5-flash) in the WebSocket connection, but the server immediately disconnected:

❌ WebSocket 被 Gemini 伺服器關閉 (CloseCode: 1008, 原因: models/gemini-3.5-flash is not found for API version v1beta, or is not supported for bidiGenerateContent.)

【Solution】 Gemini's bidirectional Live API currently only supports specific optimized real-time models. We must restrict the model field to:

gemini-2.0-flash-exp (standard bidirectional conversation)
gemini-3.5-live-translate-preview (preview model optimized for real-time translation)

Pitfall Two: Incorrect JSON Payload Field Structure (Hidden Differences Between Documentation and API Versions)

When configuring real-time interpretation, we referred to Google's official documentation and placed the inputAudioTranscription (input speech-to-text) and outputAudioTranscription (output speech-to-text) fields within generationConfig, which resulted in a 1007 error:

❌ WebSocket 被 Gemini 伺服器關閉 (CloseCode: 1007, 原因: Invalid JSON payload received. Unknown name "inputAudioTranscription" at 'setup.generation_config': Cannot find field.)

【Cause Analysis and Solution】 In the official documentation, for v1alpha and client SDKs (e.g., JavaScript / Python SDK), these two fields are wrapped within generationConfig. However, in the current v1beta WebSocket native endpoint: /ws/google.ai.generativelanguage.v1beta.GenerativeService.BidiGenerateContent

These two fields should be located at the root level of the setup object, while the translation-specific translationConfig must be placed under generationConfig. The correct JSON Payload structure is as follows:

setupMessage = [
    "setup": [
        "model": "models/\(modelName)",
        "inputAudioTranscription": [:], // Enable real-time input subtitles, placed at the setup root
        "outputAudioTranscription": [:], // Enable real-time output subtitles, placed at the setup root
        "generationConfig": [
            "responseModalities": ["AUDIO"],
            "translationConfig": [
                "targetLanguageCode": "zh-TW", // Set target translation language to Traditional Chinese
                "echoTargetLanguage": true
            ]
        ]
    ]
]

After this modification, the WebSocket setup finally successfully handshaked and no longer crashed!

Pitfall Three: "Zero-Byte Silence" Caused by Multi-Channel Stereo Capture

After successfully establishing the WebSocket pipeline and starting to push resampled audio, we found that Gemini still had no translation response. Observing the log output, we discovered that the content of the sent audio blocks was all 0 (Silence):

📊 [WebSocket] 已發送 500 個音訊區塊 | 大小: 640 bytes | 是否為靜音(全0): true

【Cause Analysis】 When the captured object (e.g., Google Chrome playing a YouTube video) outputs stereo (2 Channels) or multi-channel audio, our original method for converting CMSampleBuffer to AVAudioPCMBuffer:

// Old method: Directly assumes a single Channel pointer and copies
var audioBufferList = AudioBufferList()
var blockBuffer: CMBlockBuffer?
CMSampleBufferGetAudioBufferListWithRetainedBlockBuffer(..., &audioBufferList, ...)

In a multi-channel environment, this would lead to insufficient memory allocation, causing copy interruption or fill failure, resulting in all subsequent audio resampler (AVAudioConverter) inputs being null values (silence).

【Solution】 It is necessary to use the Double-Call technique to dynamically allocate memory space for AudioBufferList:

First Call: Pass nil as the buffer output, used only to precisely query the required physical memory size (bufferListSizeNeededOut) for that sampleBuffer.
Memory Allocation: Use UnsafeMutablePointer<AudioBufferList>.allocate to dynamically allocate space based on the queried size.
Second Call: Pass the allocated pointer to safely fill in multi-channel audio data.
Channel Reassembly: Based on the multi-channel format (Interleaved/Non-Interleaved), precisely use memcpy to copy the corresponding data segments into a temporary buffer, then send it to the converter for noise reduction and downsampling.

Core code correction:

private func audioBufferFromSampleBuffer(_ sampleBuffer: CMSampleBuffer, asbd: AudioStreamBasicDescription) -> AVAudioPCMBuffer? {
    guard let sourceFormat = sourceFormat else { return nil }

    // 1. Dynamically get the required AudioBufferList memory size
    var bufferListSize = 0
    var status = CMSampleBufferGetAudioBufferListWithRetainedBlockBuffer(
        sampleBuffer,
        bufferListSizeNeededOut: &bufferListSize,
        bufferListOut: nil,
        bufferListSize: 0,
        blockBufferAllocator: nil,
        blockBufferMemoryAllocator: nil,
        flags: 0,
        blockBufferOut: nil
    )

    guard status == noErr else { return nil }

    // 2. Allocate a pointer with sufficient space and fill it
    let bufferListPointer = UnsafeMutablePointer<AudioBufferList>.allocate(capacity: bufferListSize)
    defer { bufferListPointer.deallocate() }

    var blockBuffer: CMBlockBuffer?
    status = CMSampleBufferGetAudioBufferListWithRetainedBlockBuffer(
        sampleBuffer,
        bufferListSizeNeededOut: nil,
        bufferListOut: bufferListPointer,
        bufferListSize: bufferListSize,
        blockBufferAllocator: nil,
        blockBufferMemoryAllocator: nil,
        flags: 0,
        blockBufferOut: &blockBuffer
    )

    guard status == noErr else { return nil }

    // 3. Create an AVAudioPCMBuffer conforming to the source format and safely copy...
    let frameCount = AVAudioFrameCount(CMSampleBufferGetNumSamples(sampleBuffer))
    guard let pcmBuffer = AVAudioPCMBuffer(pcmFormat: sourceFormat, frameCapacity: frameCount) else { return nil }
    pcmBuffer.frameLength = frameCount

    let audioBuffers = UnsafeMutableAudioBufferListPointer(bufferListPointer)
    for (index, audioBuffer) in audioBuffers.enumerated() {
        guard let mData = audioBuffer.mData, index < Int(sourceFormat.channelCount) else { continue }
        // Differentiate between non-interleaved and interleaved formats for copying
        let isNonInterleaved = asbd.mFormatFlags & kAudioFormatFlagIsNonInterleaved != 0
        if isNonInterleaved {
            if let dst = pcmBuffer.int16ChannelData?[index] {
                memcpy(dst, mData, Int(audioBuffer.mDataByteSize))
            }
        } else {
            if let dst = pcmBuffer.int16ChannelData?[0] {
                let offset = index * Int(frameCount)
                memcpy(dst.advanced(by: offset), mData, Int(audioBuffer.mDataByteSize))
            }
        }
    }
    return pcmBuffer
}

After applying this refactoring, when we played a test video on Chrome's YouTube again, the console finally printed: 是否為靜音(全0): false, and we successfully received Gemini's real-time voice feedback!

Results and Benefits

Full development repo: https://github.com/kkdai/gemini-live-translate-macos

Through this architectural upgrade and bug fixes, MeetingTranslator has demonstrated excellent practical value:

Zero External Device Dependency: No need to set up complex routing like BlackHole or Loopback; it works out of the box.
Accurate and Real-time Subtitles: The Gemini Live API can complete English to Traditional Chinese translation within hundreds of milliseconds, smoothly displaying the results in a HUD floating window.
Synchronized Voice Translation Broadcast: Through AudioPlaybackManager, users can listen to the original meeting while simultaneously hearing high-quality 24kHz Traditional Chinese interpretation in their headphones.

We hope this record of pitfalls encountered with macOS Core Audio / ScreenCaptureKit and the Gemini WebSocket API can provide valuable reference for developers also exploring AI real-time voice applications!

[AI Practice] Building blazing-Fast AI Mac OS App with Antigravity CLI

Evan Lin — Fri, 12 Jun 2026 06:09:40 +0000

Foreword: A Developer's New Collaboration Model

Imagine this scenario: you are developing a real-time meeting translation App that combines macOS low-level audio (CoreAudio/ScreenCaptureKit) with Gemini Live API WebSocket. During the testing phase, the program suddenly crashed with an error, and the audio stream produced a complete silence of all zeros.

In the past, your troubleshooting process might have been:

Open the terminal and retrieve the log file.
Copy the entire error message and relevant code.
Switch to the browser, open an AI chat window, paste it, and ask for the reason.
After receiving modification suggestions, copy them back to the editor and test manually.
Repeat the above steps until fixed, then manually write README.md, write a blog post, create a GitHub repository, commit the code, and push it.

In this development cycle, we adopted the AGY CLI (Antigravity-CLI) agent designed by Google DeepMind. We were surprised to find that all the tedious context switching mentioned above could be fully automated through conversations with the intelligent agent within the terminal. This article will reconstruct the actual Prompt dialogue flow and share how we collaborated with AGY CLI to build a macOS meeting translation App from scratch.

Phase One: Idea Generation and Architecture Design

Everything originated from a development idea and a newly released Google API document. The developer pasted a URL into the terminal for AGY CLI and posed the first core question:

User : Following this example, is it possible to create a Mac OS App that can capture audio from the computer and translate it into other languages in real-time? Start helping me think about how to do this. My main goal is to be able to use it for Zoom / Google Meet meetings to directly translate other countries' languages into Traditional Chinese.

Here are other documents for your reference: https://ai.google.dev/gemini-api/docs/live-api/live-translate?hl=zh-tw

Upon receiving the instruction, AGY CLI acted as an architect, quickly analyzing and providing a set of native macOS solutions:

No virtual sound card installation required: It recommends using Apple's ScreenCaptureKit framework, introduced in macOS 13+, to directly capture the clean audio track of selected applications (such as Zoom or Chrome) in Sandbox secure mode.
Real-time audio resampling: Utilize AVAudioConverter to resample multi-channel/stereo 48kHz audio in real-time to the 16kHz mono PCM format supported by Gemini Live.
Bidirectional WebSocket communication: Implement a bidirectional long connection using Swift's URLSessionWebSocketTask to send PCM audio chunks while receiving Traditional Chinese translation results and broadcast audio.

Phase Two: Environment Configuration and Compilation Anxiety Elimination

Before starting to write Swift code, the developer raised questions about the macOS compilation environment:

User : Do I absolutely need Xcode to compile?

User : Help me run xcode-select –install

Addressing the developer's environment anxiety, AGY CLI provided the answers:

Although it's best to use Xcode to create standard macOS SwiftUI GUI application bundles, we don't necessarily need to manually open the Xcode IDE.
AGY CLI helped write an automated build script build_app.sh, which directly calls the terminal's swiftc command-line tool to compile all .swift source files and automatically packages them into a complete .app directory structure.
At the same time, in response to the developer's request to install Command Line Tools, AGY CLI proactively invoked the permission tool and directly ran xcode-select --install locally, automatically configuring the Swift compilation environment.

Phase Three: Connection Troubleshooting and Audio Bug Fixes

After the code was initially completed, the developer ran the App from the command line, but the connection status showed abnormalities, and no characters were translated:

User : Didn't see any error messages~ but the connection status is disconnected

This was the moment for AGY CLI to demonstrate its "autonomous troubleshooting" power. Upon receiving the prompt, it automatically located the debug.log file, called tail to analyze the runtime logs, and identified two critical issues:

Incompatible model name: The original program used the standard REST model models/gemini-3.5-flash, whereas the Live WebSocket API only accepts gemini-3.5-live-translate-preview.
Incorrect JSON configuration level: The API documentation used the v1alpha version SDK, which wrapped inputAudioTranscription within generationConfig; however, the native WebSocket's v1beta endpoint required these two fields to be placed directly under the setup root directory. This was the culprit behind the CloseCode 1007 crash.
Multi-channel stereo silence Bug: The multi-channel audio track captured by ScreenCaptureKit was truncated to complete silence (all zeros) during copying in the old code due to insufficient AudioBufferList memory allocation.

AGY CLI immediately proactively modified AudioCaptureManager.swift, introducing the "Double-Call" register allocation pointer technique, and refactored the Payload structure of GeminiLiveConnection.swift.

After the modifications were completed, the application ran smoothly, the console log finally printed 是否為靜音(全0): false (Is it silent (all 0s): false), and both real-time bilingual subtitles and real-time broadcast audio functioned correctly!

Phase Four: Automated DevOps and GitHub Delivery

Once the developer confirmed that the program was working correctly, the final step was to open-source and share the code:

User : I want to check in the swift-demo folder to my own GitHub repo. Give me a suggested repo name and write a README.md under swift-demo.

User : Help me commit all relevant changes in that folder to git@github.com:kkdai/gemini-live-translate-macos.git

AGY CLI immediately took over the final DevOps tasks:

It recommended using gemini-live-translate-macos as the Repo name and wrote the project's English GitHub description and topics tags.
It automatically completed the full environment preparation, Xcode Sandbox Capabilities settings, command-line script execution steps, and API troubleshooting tips in README.md.
After obtaining the user's repository URL, AGY CLI proactively ran git init in the background, wrote .gitignore, committed all the code, and successfully pushed it to the remote GitHub repository!

Conclusion: Development Transformation and Insights

Through this collaborative development with AGY CLI, we experienced an unprecedentedly rapid development process:

Reduced cognitive load: Developers only need to express their intentions in natural language (e.g., "help me run the installation," "help me troubleshoot why the connection is broken"), and the AI Agent will autonomously translate them into corresponding system commands and code modifications.
Native system-level control: AI can directly read and execute commands, synchronizing with the development environment in real-time, greatly reducing the hallucinations and environment version mismatches that often occurred with traditional Web AI Chat.
One-stop delivery: From the first phrase "think about how to do it" to the final "Push to GitHub repository" with a single click, AGY CLI seamlessly integrated the entire software engineering lifecycle.

This practical experience proves that in the era of Agentic AI, a single developer, paired with a powerful CLI agent, can deliver a high-quality Native application involving system-level foundations and the latest APIs in an extremely short amount of time. See you next time!

[GCP Practical] LINE Business Card Bot

Evan Lin — Sun, 07 Jun 2026 15:26:28 +0000

Upgrade Preamble

After refactoring the agent based on Vertex AI ADK, our LINE Name Card Assistant Bot (linebot-namecard-python) entered the production environment for testing. However, in real-world usage scenarios, we quickly identified three core pain points affecting user experience and security:

Unstable OCR JSON Parsing: Using the standard JSON Mode with a Prompt, Gemini occasionally still outputs Markdown tags or misses fields, causing parser errors.
Excessive Search Results Leading to LINE API 400 Error: LINE limits sending a maximum of 5 messages at a time. When search results include 5 cards plus the Agent's text reply, totaling 6, LINE directly rejects it and doesn't reply.
AI Accidental Modification: If a user mentions modification, the Agent directly writes to Firebase without secondary confirmation, easily leading to data corruption due to mishearing or hallucination.

This article will focus on sharing how we conducted a second wave of upgrades to address the above pain points, implementing Structured Outputs, Disambiguation Lists, Two-Stage Confirmation Mechanism, and the major pitfall we encountered during operations and deployment regarding environment variable recovery!

Optimization One: Embracing Gemini Structured Outputs

Previously, when calling gemini-3-flash-preview for name card image parsing, we commanded it via Prompt and manually parsed JSON. To ensure 100% format guarantee, we introduced the native Structured Outputs feature of the Vertex AI API.

1. Defining the Name Card Schema

In app/gemini_utils.py, we defined the constraint Schema for the name card object, forcing Gemini to strictly adhere to this format for output:

NAMECARD_SCHEMA = {
    "type": "OBJECT",
    "properties": {
        "name": {
            "type": "STRING",
            "description": "聯絡人姓名，如果看不出來，請填寫 N/A"
        },
        "title": {
            "type": "STRING",
            "description": "職稱或頭銜，如果看不出來，請填寫 N/A"
        },
        "company": {
            "type": "STRING",
            "description": "公司名稱，如果看不出來，請填寫 N/A"
        },
        "address": {
            "type": "STRING",
            "description": "公司或聯絡地址，如果看不出來，請填寫 N/A"
        },
        "phone": {
            "type": "STRING",
            "description": (
                "電話號碼，格式為 #886-0123-456-789,1234。"
                "沒有分機就忽略 ,1234。如果看不出來，請填寫 N/A"
            )
        },
        "email": {
            "type": "STRING",
            "description": "電子郵件信箱，如果看不出來，請填寫 N/A"
        }
    },
    "required": ["name", "title", "company", "address", "phone", "email"]
}

2. Applying to Generation Config

We only need to specify response_schema in generation_config when instantiating GenerativeModel:

def generate_json_from_image(img: PIL.Image.Image, prompt: str) -> object:
    model = GenerativeModel(
        "gemini-3-flash-preview",
        generation_config={
            "response_mime_type": "application/json",
            "response_schema": NAMECARD_SCHEMA
        },
    )
    img_part = Part.from_data(data=pil_to_bytes(img), mime_type="image/jpeg")
    response = model.generate_content([prompt, img_part], stream=False)
    return response

After application, the JSON error rate of the returned response dropped directly to 0%, eliminating complex string cleaning and parser error-prevention logic.

Optimization Two: Solving LINE Message Limit with 'Disambiguation List'

LINE Webhook has an iron rule: the number of message bubbles sent in a single reply_message must be between 1 and 5. If the search results happen to be 5 or more, and a text reply is added, the total will exceed 5, triggering a LINE API 400 error.

💡 Solution: Disambiguation List

We modified the search reply judgment in app/line_handlers.py:

When search results are 1 to 4 items: Directly display Carousel detailed name cards (conforming to LINE's 5-item limit).
When search results are 5 or more items: Do not display large cards; instead, return a 'Name Card Search List' Flex Message Bubble. The list itemizes names and companies, with a 'View ❯' Postback button on the right. Clicking it loads and displays that specific name card.

This design not only maintains a clean layout but also completely avoids the pitfall of exceeding the message limit!

        elif found_card_ids:
            if len(found_card_ids) <= 4:
                # If the quantity is less than or equal to 4, directly display Carousel detailed name cards
                for card_id in found_card_ids:
                    card_data = firebase_utils.get_card_by_id(user_id, card_id)
                    if card_data:
                        reply_msgs.append(
                            flex_messages.get_namecard_flex_msg(card_data, card_id)
                        )
            else:
                # If the quantity is greater than 4, display as a list Flex Message for disambiguation
                cards_list = []
                for card_id in found_card_ids:
                    card_data = firebase_utils.get_card_by_id(user_id, card_id)
                    if card_data:
                        cards_list.append({
                            "card_id": card_id,
                            "name": card_data.get("name", "N/A"),
                            "company": card_data.get("company", "N/A"),
                            "title": card_data.get("title", "N/A")
                        })
                if cards_list:
                    list_msg = flex_messages.get_namecard_list_flex_msg(
                        cards=cards_list,
                        title_text="🔍 Found multiple matching name cards"
                    )
                    reply_msgs.append(list_msg)

Optimization Three: Contact Modification Safety Lock — Two-Stage Confirmation Mechanism

Under the ADK agent architecture, users can update data through natural conversation (e.g., "Add 'Meeting next Monday' to Evan's memo"). However, if the LLM misinterprets the instruction, Firebase data can be directly overwritten.

To address this, we implemented a Two-Stage Confirmation mechanism:

Delayed Write: When the ADK Tool (update_namecard_field and update_namecard_memo) is invoked by the model, the system does not directly rewrite Firebase. Instead, it temporarily stores the content to be modified in user_states in memory and returns True to allow the Agent to continue generating dialogue.
Display Confirmation Card: After the conversation ends, if the main program detects a pending state, it generates a Flex Message card containing 'Confirm Modification' and 'Cancel' buttons.
Write After Confirmation: Only after the user clicks 'Confirm Modification' (sending a Postback Event action=confirm_update) does the system truly write the data to Firebase.

This not only perfectly prevents AI from accidentally triggering tools but also gives users absolute control when modifying data!

    # Handle confirmation of modification in handle_postback_event
    elif action == 'confirm_update':
        state = user_states.get(user_id, {})
        if state.get('action') == 'pending_update':
            update_type = state.get('update_type')
            card_id = state.get('card_id')
            # Read data from temporary storage based on update_type, and truly write to Firebase...
            if success:
                # Reply with successful modification, and automatically display the updated Flex Card for user verification

Ops Pitfall Record: Manual Deployment - The Mysterious Disappearance of Environment Variables

In addition to code refactoring, we also encountered a significant operational pitfall during deployment.

The Pitfall

When we attempted to upload a local folder to Cloud Run using the MCP deployment tool locally, because the command did not include environment variable declaration parameters, the previously working LINE Token and Firebase URL on Cloud Run were all cleared and overwritten. Upon restart, the Container crashed directly with an error:

Specify ChannelSecret as environment variable.

The online service instantly became paralyzed.

Recovery Process

Fortunately, Cloud Run fully retains the configuration settings of older versions. We can use the gcloud command to view previous Revisions and restore the lost variables:

Retrieve the detailed configuration of the last successfully running Revision:

gcloud run revisions describe linebot-namecard-python-00096-d89 --project=line-vertex --region=asia-east1

This will output the environment variable values bound to that version.

Re-inject environment variables into the service:

gcloud run services update linebot-namecard-python --project=line-vertex --region=asia-east1 --set-env-vars="ChannelAccessToken=...,ChannelSecret=..."

By restoring the variables, we seamlessly recovered the service within minutes. This also reminds us: when manually deploying to Cloud Run, always pay extra attention to the inheritance or declaration of environment variables to avoid accidentally clearing the official cloud configuration.

Summary and Benefits

This optimization brought excellent production-level transformations to our LINE Name Card Bot:

100% Format Security: Through API native Schema enforcement, the name card recognition format error rate dropped to 0%.
Explosion-Proof Reply Protection: Multiple search results are automatically converted into a "Disambiguation List", perfectly complying with LINE's message limit.
Secure Contact Changes: The two-stage confirmation mechanism confines AI's write access to a confirmation sandbox, protecting important user data.
Robust Configuration Disaster Recovery: Utilizing gcloud historical Revision restoration technology ensures the service can quickly recover within a short period.

The complete and linter-optimized code has been pushed to GitHub. We hope this practical experience helps everyone avoid detours when building production-grade AI Agents! See you next time!

[Gemini][Agent] Google Managed Agents API

Evan Lin — Wed, 03 Jun 2026 01:01:36 +0000

(Image Source: Google Cloud Docs - Managed Agents on Agent Platform)

Preamble: The era of hand-rolling your own agent loop is coming to an end

In the past, if you wanted to build an AI agent that could truly " do things ", the component list that came to mind probably looked something like this:

An LLM main loop (ReAct? Write your own state machine?)
A sandbox to run LLM-generated code (Docker? Firecracker? E2B?)
A filesystem to store intermediate files produced by the agent (S3? Local? Temporary or persistent?)
A search API (Connect to Google Custom Search yourself? SerpAPI?)
A page fetcher (playwright? readability-lxml?)
A tool router to connect all of the above
And only then, how to let the user continue the session

And once the session broke, the report.md, sources.json that the agent was halfway through writing, and the venv that was halfway running, would all be gone. Nobody wants to do "I'll open a Docker for you, mount a volume, and remember to delete it in 7 days" again.

These past few days, Google has turned this pipeline into " calling a managed API " in Cloud Docs — Gemini Enterprise Agent Platform launched the Managed Agents API (internal codename Antigravity), which manages the sandbox, filesystem, and toolset entirely. Just pass an environment ID, and the agent's intermediate files from last time will still be waiting for you.

This article will do two things:

Break down the core capabilities clearly, including what the underlying antigravity-preview-05-2026 model is doing.
Use an open-source LINE Research Planner Bot (kkdai/line-research-bot) as a live demonstration to see how new features are combined in actual production code — and share the five typical Pre-GA pitfalls I encountered during debugging to help you avoid them.

Three Key Core Capabilities

According to the official documentation, the core of Managed Agents revolves around three things:

1. Persistent Sandbox + Filesystem

In the past, code interpreter-like functions would restart a container with each call, losing all previously pip installed packages, written files, and half-open Python interpreters.

“Each agent operates within a sandboxed environment … capable of reasoning, planning, executing code, web searching, and file operations.”

Now, if you make a second interaction with the same environment_id, the agent will see the /workspace/ from the previous session:

/workspace/sources.json is still there
/workspace/report.md was half-written, this time it continues to modify it
Packages like markdown installed with pip install last time don't need to be reinstalled

For us product builders, this means:

No need to maintain your own sandbox infrastructure (Firecracker, microVM, expiration cleanup).
Agents can truly "complete a big task in multiple turns", instead of starting over each turn.
A TTL of 7 days, during which any interaction automatically refreshes, meaning it stays alive as long as the user uses it once a week.

My LINE Bot relies on this for " progressive deepening ": the user first says "research X" → the agent writes sources and a report in the sandbox; a few minutes later, the user says "Chapter 2, go deeper" → the agent reads back the original file, modifies Chapter 2, and rewrites it, all within the same sandbox and the same markdown file.

2. Built-in Tools

When building an agent, you just list the tools you want, without having to connect to APIs yourself:

tools=[
    {"type": "code_execution"}, # Python / bash / persistent venv
    {"type": "filesystem"}, # Read/write /workspace
    {"type": "google_search"}, # Real Google Search, not Custom Search
    {"type": "url_context"}, # Feed URL to automatically fetch content + extract
    {"type": "mcp_server", # Any plug-in MCP server
     "name": "grep-search",
     "url": "https://mcp.grep.app"},
]

Several key observations:

google_search is real Google, not the basic version that requires you to customize a search engine ID + API key. The return format includes search suggestions and can be used for grounding.
url_context is equivalent to free readability + content extraction, feed a URL and get the main text. No need to maintain another playwright fleet.
Native MCP support: You can directly integrate any Model Context Protocol server. The entire ecosystem is open.

3. Multi-turn Session Chaining

Each interaction returns an id. When calling the next turn, pass it as previous_interaction_id, and the agent will see the entire conversation history + sandbox state:

r1 = client.interactions.create(
    agent="research-planner",
    input="PLAN ...",
    environment={"type": "remote"}, # Open a new sandbox
    background=True,
)
# … poll until completed …

r2 = client.interactions.create(
    agent="research-planner",
    input="SEARCH_COMPARE", # No need to restate context
    environment=r1.environment_id, # Reuse sandbox
    previous_interaction_id=r1.id, # Connect history
    background=True,
)

This design turns your backend into " only responsible for deciding what prompt to send each turn ". Session state, conversation history, and file system are all server-side managed.

Two APIs: Agents for Control Plane, Interactions for Data Plane

The documentation divides into two APIs, with clear responsibilities:

API	Path	What it does
Agents API	`/projects/.../agents`	Create, update, delete agent settings (base_agent, tools, system_instruction)
Interactions API	`/projects/.../interactions:create`	Interact with deployed agents

Simply put: Agents = Configuration, Interactions = Execution. Creating an agent is a one-time task; running interactions is done every time a user message comes in. My LINE Bot only used the Agents API once during deployment to create the agent, and after that, Cloud Run only calls the Interactions API.

The underlying base model is hardcoded as antigravity-preview-05-2026, which is an agent-optimized version of the Gemini series (only this one is available during the Pre-GA preview period).

What Developers Truly Care About: Cost and Integration Cost

This API is still in Pre-GA, and the official documentation emphasizes:

“Antigravity is offered as Pre-General Availability software, which means it is not subject to any SLA or deprecation policy. Antigravity is not intended for production use or for use with sensitive data.”

In plain language:

Cannot be used for production sensitive data (for compliance scenarios, please wait for GA).
No SLA, the API shape might change someday.
Might be discontinued someday, don't bet your company's life on it.
Billing is at standard Vertex AI rates, with no additional sandbox runtime fees — this is super friendly for demos / internal tools / hackathons.

It's a very suitable entry point for personal side projects and POCs — you don't need to spend a month setting up sandbox infra yourself to build an agent that can get things done. But don't throw enterprise customer data into it.

Standard Workflow: 4 SDK Calls to Complete an Agent Interaction

The minimum viable flow after organizing the official colab (intro_managed_agents_python.ipynb):

from google import genai

# 1. Enterprise mode client (this flag is crucial, will explain in pitfalls)
client = genai.Client(enterprise=True, project="my-project", location="global")

# 2. Create agent (one-time, reusable)
agent = client.agents.create(
    id="research-planner",
    base_agent="antigravity-preview-05-2026",
    description="Multi-stage research agent",
    system_instruction="You are a research planner. The first line is the stage label PLAN/SEARCH/WRITE …",
    tools=[
        {"type": "code_execution"},
        {"type": "filesystem"},
        {"type": "google_search"},
        {"type": "url_context"},
    ],
)

# 3. First interaction, open a new sandbox
r1 = client.interactions.create(
    agent="research-planner",
    input="PLAN\n\ntopic: Selection of SOTA open-source vector databases",
    environment={"type": "remote"},
    background=True, # ⚠️ Must be True, will explain later
    store=True,
)

# 4. Continue with the same environment
r2 = client.interactions.create(
    agent="research-planner",
    input="SEARCH_COMPARE",
    environment=r1.environment_id,
    previous_interaction_id=r1.id, # Connect history
    background=True,
    store=True,
)

# poll for results
import time
while True:
    polled = client.interactions.get(r2.id)
    if polled.status == "completed":
        print(polled.output_text)
        break
    time.sleep(2)

No exaggeration, a multi-stage agent from scratch is less than 30 lines of code. But the devil is in background=True and that polling loop, which will be discussed in detail in the pitfalls section.

Demo Case: LINE Research Planner Bot

SDK examples alone are too abstract, so I built it into a working LINE Bot, open-sourced at kkdai/line-research-bot:

The user sends a research topic in the LINE chat box (e.g., "Research on the selection of SOTA open-source vector databases").
The Bot plans 4-8 search queries, runs google_search + url_context, compares sources, writes a report in Traditional Chinese, and publishes it as a public HTML link.
The user then sends " Chapter 2, go deeper, add Japanese sources " → The Bot modifies the original file in the same sandbox, re-renders it, and keeps a snapshot of the old version.
Deployment targets: GCP Cloud Run + Firestore + GCS + Cloud Tasks.

The architecture is very straightforward:

Component	Role
LINE Webhook	FastAPI receives message events
Firestore	`line_bot_users / line_bot_reports` persistence
Cloud Tasks	Pushes long-running tasks from webhook to background worker (avoids LINE reply token 60-second limit)
Managed Agent	Planning + Search comparison + Writing ( three-stage chain)
Cloud Run worker	Renders markdown → HTML → Uploads to GCS ( Why not in the sandbox? Pitfall 2 will explain )
GCS Bucket	Public HTML hosting

Comparing with the three core capabilities mentioned earlier:

Persistent Sandbox: The three stages PLAN → SEARCH_COMPARE → WRITE_REPORT are chained within the same environment_id, and sources.json written once can be read by all three stages.
Built-in Tools: The SEARCH_COMPARE stage uses google_search + url_context. The agent decides what to search, which pages to read, and how to summarize.
Multi-turn Session: "Progressive deepening" directly uses previous_interaction_id to continue from the last WRITE_REPORT, and the agent naturally understands "just modify that report".

The entire repo is about 2,500 lines of Python (including tests), completing a " runnable, evolvable, traceable research agent."

Deployment Practice: Commit → Go Live Automatically

It's not enough for the open-source example to just run; this time, the entire GCP infrastructure and CI/CD are integrated.

I only provided the project ID + LINE secret, and it handled the rest end-to-end:

# Enable 6 APIs
gcloud services enable aiplatform.googleapis.com run.googleapis.com \
    cloudtasks.googleapis.com firestore.googleapis.com \
    storage.googleapis.com secretmanager.googleapis.com

# Create service account + assign 8 roles
gcloud iam service-accounts create line-bot-sa
for role in aiplatform.user datastore.user cloudtasks.enqueuer \
            storage.objectAdmin secretmanager.secretAccessor \
            iam.serviceAccountTokenCreator run.invoker logging.logWriter; do
  gcloud projects add-iam-policy-binding line-vertex \
      --member="serviceAccount:line-bot-sa@line-vertex.iam.gserviceaccount.com" \
      --role="roles/$role" --condition=None
done

# Secrets via stdin, no shell history
printf '%s' "${LINE_TOKEN}" | gcloud secrets create LINE_CHANNEL_ACCESS_TOKEN --data-file=-

# Create Agent (one-time)
curl -sS -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    -d @agent-body.json \
    "https://aiplatform.googleapis.com/v1beta1/projects/line-vertex/locations/global/agents"

# Deploy Cloud Run
gcloud run deploy line-research-bot --source=. --timeout=3600 --memory=2Gi ...

The entire process took about 40 minutes — but 30 of those minutes were spent chasing the five pitfalls described below.

Pitfall Log: Five Pre-GA-Specific Issues

Pitfall One: Synchronous Calls → Mysterious `RESOURCE_PROJECT_INVALID`

The first time I followed the doc and directly POSTed interactions:create via REST, it returned this:

{
  "error": {
    "code": 400,
    "message": "Invalid resource field value in the request.",
    "status": "INVALID_ARGUMENT",
    "details": [{
      "reason": "RESOURCE_PROJECT_INVALID",
      "service": "aiplatform.googleapis.com"
    }]
  }
}

I spent a full hour and a half wondering:

Project not allowlisted? (Couldn't find where to apply)
Use project number or ID? (Tried both, both wrong)
Change region? (All wrong)
Change agent? (All wrong)
Even gemini-2.0-flash:generateContent returned RESOURCE_PROJECT_INVALID!

Until I carefully read the official colab and saw a line:

client = genai.Client(enterprise=True, project=..., location=...)

It differed from the genai.Client() we used by one enterprise=True. Then I ran the colab code and saw:

stream = client.interactions.create(
    ...,
    stream=False, background=True, store=True,
)

background=True.

I brought this back to REST: wrote SDK + background=True, and it immediately worked:

{"error": {"code": 500, "message": "Chiliagon path must set background to true."}}

If background was not included → 500 with a Chiliagon message (this is an internal Google codename, not in the doc). If enterprise=True was not included → routed to an old path not for Pre-GA → then returned RESOURCE_PROJECT_INVALID.

Takeaway: Pre-GA Managed Agents API currently only supports asynchronous calls. Actual usage requires:

Using the google-genai SDK with enterprise=True
interactions.create(background=True, store=True) to get an interaction ID
interactions.get(id) polling until status == "completed"

Don't waste an hour stubbornly trying raw REST like I did.

Pitfall Two: `gsutil` in the Sandbox is a Mock (This one is the most insidious)

My LINE Bot was originally designed for the agent to upload HTML to GCS itself:

gsutil -h "Cache-Control:no-cache, max-age=0" cp /workspace/report.html \
    gs://research-line/{report_id}/index.html
curl -sI https://storage.googleapis.com/research-line/{report_id}/index.html

The agent finished happily and returned:

{
  "report_id": "d4302f31...",
  "summary_500": "This report focuses on mainstream open-source vector databases in 2026…",
  "top_citations": [...],
  "new_version": 1
}

LINE received the Flex card, clicked the button → 404 NoSuchKey. GCS was empty.

I ran a diagnostic interaction to query the sandbox:

resp = client.interactions.create(
    agent="research-planner",
    input=(
        "Run these and report verbatim:\n"
        "1. echo 'X' > /tmp/diag.html\n"
        "2. gcloud auth list 2>&1\n"
        "3. gsutil cp /tmp/diag.html gs://research-line/probe.html 2>&1\n"
        "4. curl -sI https://storage.googleapis.com/research-line/probe.html\n"
        "5. gsutil ls gs://research-line/ 2>&1\n"
        "Reply ONLY with: {\"step1\":\"...\", ...}"
    ),
    environment=ENV_ID,
    background=True, store=True,
)

The returned JSON made me jump out of my chair:

{
  "step2": "No credentialed accounts.\n\nTo login, run:\n $ gcloud auth login...",
  "step3": "Mock gsutil: simulated copy to cp /tmp/diag.html gs://research-line/...",
  "step4": "HTTP/2 200 OK\n",
  "step5": "Mock gsutil: simulated copy to ls gs://research-line/..."
}

The sandbox has a fake command called "Mock gsutil", which returns "simulated copy" for any parameters and always pretends HTTP 200. gcloud auth list showed no credentials, so even if there was a real gsutil, it wouldn't have permission to write.

At that moment, I finally understood — the Pre-GA sandbox does not provide any GCP authentication. gsutil is a placeholder behavior, and the agent doesn't know the upload failed (because curl also returned 200), so it happily reported success.

Solution: Completely refactor the architecture. The agent no longer attempts to upload; instead, the agent returns the complete markdown via the report_md field:

# New system_instruction (excerpt)
"""
After writing /workspace/report.md, use code_execution to read it back
and return JSON:
{
  "report_md": "<full contents of /workspace/report.md>",
  "summary_500": "...",
  ...
}
DO NOT run gsutil. DO NOT run curl on storage.googleapis.com.
The host service handles publishing.
"""

Then the Cloud Run worker, using a service account with real IAM, takes over:

# app/publisher.py
import markdown
from google.cloud import storage

class GcsPublisher:
    def __init__ (self, *, bucket_name: str):
        self._bucket = storage.Client().bucket(bucket_name)

    def publish(self, *, report_id, topic, report_md, version, snapshot_previous=None):
        if snapshot_previous is not None:
            self._snapshot(report_id, snapshot_previous)
        body = markdown.markdown(report_md, extensions=["fenced_code", "tables", "footnotes"])
        html = _wrap_with_css(topic, body, version)
        blob = self._bucket.blob(f"{report_id}/index.html")
        blob.cache_control = "no-cache, max-age=0"
        blob.upload_from_string(html, content_type="text/html; charset=utf-8")
        return f"https://storage.googleapis.com/{self._bucket.name}/{report_id}/index.html"

Clear division of responsibilities: the agent is responsible for thinking + writing; Cloud Run is responsible for infra.

Takeaway: Do not assume the Pre-GA sandbox can access your GCP resources. For anything that needs to write to external systems, let the host service do it with a real SA, and the agent only returns the payload. By the way, from the forum, it seems that after GA, the sandbox might provide ambient credentials, but not in Pre-GA.

Pitfall Three: Cloud Run's `/healthz` is Intercepted by Google Frontend

I wrote a /healthz for Cloud Run health checks:

@app.get("/healthz")
async def healthz() -> dict:
    return {"status": "ok"}

After deployment, I called:

curl https://line-research-bot-xxx.run.app/healthz

It returned this:

<!DOCTYPE html>
<title>Error 404 (Not Found)!!1</title>
<p><b>404.</b> The requested URL /healthz was not found on this server.

It was Google Frontend's 404 page, not FastAPI's. But /docs, /webhook, /openapi.json all worked. OpenAPI also listed the GET /healthz route.

/healthz is a special reserved path in Cloud Run; Google Frontend intercepts it before the path even reaches the container.

Solution: Rename it to /readyz. Solved in one second.

@app.get("/readyz") # /healthz was intercepted, renamed
async def readyz() -> dict:
    return {"status": "ok"}

Pitfall Four: Service Account Needs to `actAs` Itself for Cloud Tasks OIDC to Sign

When pushing tasks from the webhook to Cloud Tasks, the task kept dispatching 0 times + dispatchDeadline expired. Cloud Run logs showed:

PERMISSION_DENIED: The principal lacks IAM permission "iam.serviceAccounts.actAs"
for the resource "line-bot-sa@line-vertex.iam.gserviceaccount.com"

I thought giving the SA iam.serviceAccountTokenCreator was enough, right? Not enough. Cloud Tasks needs to sign an OIDC token for the callback, which requires the SA to have actAs permission for " itself ":


shell
gcloud iam service-accounts add-iam-policy-binding \
    line

Using Google's New AI Command-Line Assistant: Antigravity CLI (agy) and YOLO's No-Confirmation Mode

Evan Lin — Wed, 27 May 2026 09:44:36 +0000

Background

With generative AI entering daily development, the AI assistant in the terminal has also ushered in an epic update! If you are a loyal supporter of the original Gemini CLI, you may already know that this tool will be officially retired on June 18, 2026.

Taking over the torch of this era is Google's stunning launch at I/O 2026, the next-generation lightweight, Go language-driven multi-agent terminal UI assistant —— Antigravity CLI (called agy in the terminal)!

However, the launch of new tools is always accompanied by various pitfalls and surprises. This article will focus on Antigravity CLI (agy), revealing how to deal with the "invisible color scheme hell", how to enable the addictive YOLO no-confirmation frenzy mode, and those terminal black technologies and setting secrets hidden deep within settings.json!

🛠️ Step 1: Antigravity CLI (agy) Color Scheme Savior for Invisible Text!

When installing and launching agy for the first time, the first blow that many developers accustomed to macOS / Linux dark background terminals usually face is: "The font is all black, and the text is completely invisible!"

This is because the agy default configuration file may be configured with a light (Light) theme. We don't need to compromise and change our favorite terminal background, just modify settings.json and it can be saved with one click!

🛠️ Steps to Fill the Pit

Find the global configuration file for Antigravity CLI, the path is usually: ~/.gemini/antigravity-cli/settings.json
Modify the "colorScheme" setting value from "light" to "dark":
After saving the file and restarting the terminal, all outputs will automatically convert to a high-contrast dark mode color scheme, and your eyes will be saved instantly!

🔥 The Main Event: YOLO Mode —— Unlock the "No Confirmation" Ultimate Move for Unlimited Automated Execution

When using AI to write code, the most annoying thing is that every time you make a file modification and execute a git command, the CLI will pop up a question: "Are you sure you want to perform this operation? (y/N)". This is simply a double torment of fingers and spirit when performing large-scale refactoring or batch tasks.

To this end, agy provides two levels of YOLO (You Only Live Once) no-confirmation automatic execution mode, allowing AI to smoothly and continuously execute autonomously until the task is completed:

1. ⚡ Extreme YOLO: `--dangerously-skip-permissions` Parameter

If you are in a completely isolated and secure sandbox environment, or have 100% confidence in the instructions generated by AI, you can add this ultimate move when starting:

agy --dangerously-skip-permissions

Once this Flag is added, agy will completely skip all tool authorization and command execution confirmation prompts, and enter an "all the way to the top" automatic execution state. Suitable for letting it run complex automated tests or file migrations on its own!

2. 🛡️ Moderate Control: `/permissions` Fine-grained Settings

If you don't want to risk the AI executing rm -rf, you can directly enter /permissions in the CLI or directly modify settings.json. Through a whitelist mechanism, only specific commands or paths are automatically approved:

{
  "permissions": {
    "allow": [
      "read_file(/Users/al03034132/Documents)",
      "command(git)",
      "command(npm test)"
    ],
    "deny": [
      "command(rm -rf)"
    ]
  }
}

This way, you can allow Git operations and unit tests to enter the YOLO no-confirmation state, while also ensuring the security of the core file system!

🤫 Those Unknown agy Black Technologies and Setting Secrets

As Google's latest official Code-first agent weapon, agy also has several hidden functions deep within the configuration file and commands, which are rarely seen in newspapers:

🧩 1. Asynchronous Subagents

This is definitely agy's most revolutionary multi-agent architecture! You can directly call multiple subagents in the terminal to run complex tasks in the background:

For example: one subagent goes online to check the latest API documentation, one runs unit tests in the background, and one performs code refactoring.
And your main terminal will not be blocked at all! You can enter /agents to monitor the health and execution progress of all subagents in the background at any time.

🧠 2. Change Brains at Any Time: `/model` Secret

agy not only supports the Gemini series models on Vertex AI, but if you need to, you can also use the built-in /model slash command to directly switch seamlessly between Gemini, Claude, and even other open-source models with one click, helping you verify the same bug with different thinking models, which is super convenient!

🛡️ 3. Multi-Operating System Level Security Sandbox (Terminal Sandbox)

In order to prevent AI from running out of control malicious code in YOLO mode, agy silently implements operating system level sandbox protection at the bottom!

nsjail isolation will be automatically enabled on Linux.
macOS will automatically call the system's native sandbox-exec.
Even if AI writes a script that pollutes the file system, it will be perfectly confined in the sandbox and unable to move!

📦 4. Upgrade Old Things: Seamless Migration Mechanism from Gemini CLI

Although Gemini CLI has gone down in history, agy has thoughtfully designed a "one-click import tool". When you start agy for the first time, it will automatically scan the old configuration path and perfectly align and migrate your original plugins, custom skills, and settings.json accumulated in Gemini CLI!

Summary and Suggestions

The upgrade from Gemini CLI to Antigravity CLI (agy) is not just a change in the command line name, but a leap forward from single-model question and answer to Multi-Agent Workflows.

By properly setting permissions in settings.json, combined with the no-confirmation function of YOLO mode, developers can allow AI to automatically and smoothly complete various medium and large tasks while ensuring the security of the host.

Quickly open your terminal and enter agy --dangerously-skip-permissions to experience this futuristic development artifact! See you next time for the actual combat!

GCP: Upgrading a LINE Bot with Vertex AI ADK Tools for Smart Business Cards and Backup Search

Evan Lin — Wed, 27 May 2026 09:44:24 +0000

Preface

In the previous article, we successfully upgraded the LINE business card assistant robot (linebot-namecard-python) from the AI Studio API Key verification mode to the enterprise-grade Google Cloud Vertex AI mechanism, completely freeing us from the 429 quota anxiety.

However, the original method of searching for business cards had significant limitations: We had to first fetch all the user's business cards from Firebase, package them into a huge JSON array, and then stuff them into the prompt, asking Gemini to select the most relevant business card object to return.

This approach has three major drawbacks:

Token Waste: With many business cards, each search is a ruthless blow to the token balance.
Lack of Flexibility: The model can only search passively; it cannot proactively ask for details or perform data updates.
Unable to Link Operations: If the user says, "Help me change David Wang's phone number," we have to write a bunch of complex NLP judgments and branches in the Webhook.

To solve these pain points, we decided to refactor the robot and embrace Google Cloud's latest, powerful, and code-friendly Agent Development Kit (ADK)!

This article will share with you how we completely refactored Firebase access into ADK Tools, implemented dynamic closures, and the various top-tier blood and tears pitfalls we encountered during deployment on Cloud Run and with the Antigravity CLI tool!

Architecture Upgrade: Why Choose ADK and Tools?

Agent Development Kit (ADK) is a code-first agent development framework launched by Google Cloud. Previously, in order for us to allow large models to call external APIs, we had to manually write long OpenAPI schemas or complex function-calling descriptions; ADK simplifies all of this into simple Python functions!

We planned five core data operation functions for the business card Agent and registered them as Tools of the Agent in the form of Python functions:

get_all_namecards(): Reads the list of all business cards (including IDs) for the current user.
get_namecard_by_id(card_id): Retrieves the detailed content of a specific business card.
display_namecard(card_id): The core tool! Called when the model matches a business card, used to tell the Python main program "it's time to display this business card on the screen".
update_namecard_memo(card_id, memo): Updates the business card memo.
update_namecard_field(card_id, field, value): Directly updates the specified fields of the business card (name, phone, email, etc.) in natural language.

Core Code Rewrite: Dynamic Closure Tools Implementation

In Webhook development, the most important thing is security. We absolutely cannot allow user A to search or modify user B's business cards.

Therefore, we cannot implement static, global Database Tools. Instead, in handle_smart_query, we dynamically create exclusive Tools for each conversation request through the closure mechanism.

This approach not only perfectly binds the user's user_id but also utilizes the found_card_ids list in the closure to perfectly collect "all business card IDs that the model wants to present to the user" during the decision-making process:

def make_adk_tools(user_id: str, found_card_ids: list):
    """Dynamically create exclusive Firebase data access and operation tools for a specific user"""
    def get_all_namecards() -> list[dict]:
        """Get the list of all business card data in the Firebase database for the current user.
        Each business card data contains a unique card_id field."""
        cards_dict = firebase_utils.get_all_cards(user_id)
        all_cards_list = []
        for card_id, card_data in cards_dict.items():
            card_data_with_id = card_data.copy()
            card_data_with_id['card_id'] = card_id
            all_cards_list.append(card_data_with_id)
        return all_cards_list

    def get_namecard_by_id(card_id: str) -> dict:
        """Get the detailed fields and data of a single business card through a specific card_id."""
        return firebase_utils.get_card_by_id(user_id, card_id)

    def display_namecard(card_id: str) -> str:
        """Display a specific business card to the user.
        When a business card matching the search is found, be sure to call this tool."""
        if card_id not in found_card_ids:
            found_card_ids.append(card_id)
        return f"已將名片 ID 標記為顯示：{card_id}"

    def update_namecard_memo(card_id: str, memo: str) -> bool:
        """Update the memo/note information of a specific business card."""
        return firebase_utils.update_namecard_memo(card_id, user_id, memo)

    def update_namecard_field(card_id: str, field: str, value: str) -> bool:
        """Update the specified field of a specific business card (optional fields: name, title, company, address, phone, email)."""
        return firebase_utils.update_namecard_field(
            user_id, card_id, field, value
        )

    return [
        get_all_namecards,
        get_namecard_by_id,
        display_namecard,
        update_namecard_memo,
        update_namecard_field
    ]

Refactored Main Webhook Logic (`handle_smart_query`)

Now, when LINE receives a text query, we only need to pass the message to the ADK Runner to run once. Once the Agent decides to call display_namecard, we combine the Agent's friendly Chinese explanation (text reply) with the business card Flex Message (the entire business card) in the LINE reply:

async def handle_smart_query(event: MessageEvent, user_id: str, msg: str):
    found_card_ids = []
    tools = make_adk_tools(user_id, found_card_ids)

    # 1. Create an ADK Agent equipped with exclusive Tools
    agent = Agent(
        name="namecard_agent",
        model="gemini-3-flash-preview",
        instruction=(
            "You are a smart and friendly LINE business card assistant. Your job is to help users manage their business card data.\n"
            "You can use the appropriate tools to read or modify business card records in the Firebase database.\n\n"
            "【Core Operation Guidelines】\n"
            "1. 【Query】When a user queries for someone's or a company's business card, please first call get_all_namecards to get all the data and perform analysis and comparison in the background.\n"
            "2. 【Display】As long as a business card that meets the conditions is found, 『must』 call the display_namecard tool to mark the card_id of that business card for display, so that the system can draw and present it on the LINE screen.\n"
            "3. 【Modify】If the user wants to modify a business card (e.g., phone number, Email, memo), please first compare and find the card_id, and then call the corresponding update tool (such as update_namecard_field or update_namecard_memo) to make the modification. After the modification is successful, 『must』 call display_namecard again to display the updated business card, allowing the user to confirm.\n"
            "4. 【Reply】Finally, please reply to the user with a friendly and concise traditional Chinese tone about the operation results or search progress."
        ),
        tools=tools,
    )

    # 2. Execute the Runner with an in-memory Session
    runner = Runner(
        app_name="namecard_bot_app",
        agent=agent,
        session_service=InMemorySessionService()
    )

    try:
        events = await runner.run_debug(
            msg, user_id=user_id, session_id=user_id
        )

        # Combine the Agent's text reply
        final_text = ""
        for ev in events:
            if ev.content and ev.content.parts:
                for part in ev.content.parts:
                    if part.text:
                        final_text += part.text

        final_text = final_text.strip() or "為您完成處理。"

        reply_msgs = [TextSendMessage(
            text=final_text,
            quick_reply=get_quick_reply_items()
        )]

        # 3. Get the business cards marked for display by the Agent and convert them to Flex Messages
        if found_card_ids:
            for card_id in found_card_ids[:5]:
                card_data = firebase_utils.get_card_by_id(user_id, card_id)
                if card_data:
                    reply_msgs.append(
                        flex_messages.get_namecard_flex_msg(card_data, card_id)
                    )

        await line_bot_api.reply_message(event.reply_token, reply_msgs)

Blood and Tears Pitfalls During the Migration Process

The refactoring process cannot be smooth sailing. In this upgrade, we encountered three top-tier deep pits, each of which almost prevented the online container from providing services. Here is valuable pit-filling experience:

Pitfall 1: Uvicorn Crashes the Event Loop at Startup

When we excitedly pushed the container containing google-adk onto Cloud Run, the deployment failed due to a health check timeout at the last moment! Checking the GCP Log, we were greeted with this heartbreaking RuntimeError:

  File "/app/app/bot_instance.py", line 7, in <module>
    session = aiohttp.ClientSession()
  File "/usr/local/lib/python3.10/site-packages/aiohttp/client.py", line 321, in __init__
    loop = loop or asyncio.get_running_loop()
RuntimeError: no running event loop

Reason: Under the new dependency environment, app/bot_instance.py directly instantiated aiohttp.ClientSession() globally when it was imported (Import Time). However, at this time, Uvicorn's asyncio Event Loop had not even started! This caused aiohttp to throw an exception and crash directly because it couldn't find a running event loop.

Solution: We designed a lazy-load LazyLineBotApi wrapper, delaying the creation of ClientSession and AsyncLineBotApi until the first LINE Webhook request comes in (at this time, the Event Loop must be running), perfectly avoiding the Import Time initialization crash:

class LazyLineBotApi:
    def __init__ (self):
        self._api = None
        self.session = None

    def _get_api(self):
        if self._api is None:
            self.session = aiohttp.ClientSession()
            async_http_client = AiohttpAsyncHttpClient(self.session)
            self._api = AsyncLineBotApi(
                config.CHANNEL_ACCESS_TOKEN, async_http_client
            )
        return self._api

    def __getattr__ (self, name):
        return getattr(self._get_api(), name)

line_bot_api = LazyLineBotApi()

Pitfall 2: GCP's Default `GOOGLE_CLOUD_LOCATION` and Region 404

After successfully starting the container, we tried entering text in LINE, but saw a big red error again in the background:

Error executing ADK smart query: 404 NOT_FOUND. 
Publisher Model `projects/line-vertex/locations/asia-east1/publishers/google/models/gemini-3-flash-preview` was not found.

Reason: Because our Cloud Run service is deployed in Taiwan (asia-east1), GCP will automatically inject GOOGLE_CLOUD_LOCATION=asia-east1 into the environment variables. However, in the Vertex AI ecosystem, many of the latest and most powerful models (such as gemini-3-flash-preview) only provide services in the global region! When the underlying SDK of ADK automatically reads asia-east1 to search for models, it will naturally throw a 404.

Solution: We directly override the environment variable at the first moment in the system's configuration entry app/config.py, directing all Vertex AI model searches to the global region:

# Force GOOGLE_CLOUD_LOCATION to global so that Vertex AI and ADK look
# for models in the global region
os.environ["GOOGLE_CLOUD_LOCATION"] = "global"

Pitfall 3: Insurance Mechanism in Extreme Situations - Local Keyword Backup Search

After the user's LINE bot goes live, any API quota explosion or network timeout should not cause the user to see a cold "server failure". To guarantee production-level SLA, we added a seamless keyword search backup mechanism (Local Keyword Fallback) in the except block of handle_smart_query.

If Vertex AI or ADK encounters any exceptions during execution, the system will automatically enable Firebase local keyword matching in the background, still perfectly returning matching business card Flex messages, providing the user with the most elegant protection net:

    except Exception as e:
        print(f"Error executing ADK smart query: {e}")
        # Backup search mechanism: When Vertex AI or ADK API is abnormal, automatically enable local keyword filtering search to ensure service continuity
        try:
            all_cards_dict = firebase_utils.get_all_cards(user_id)
            fallback_matches = []
            if all_cards_dict:
                for card_id, card_data in all_cards_dict.items():
                    name = card_data.get("name", "").lower()
                    company = card_data.get("company", "").lower()
                    query_lower = msg.lower()
                    if query_lower in name or query_lower in company:
                        fallback_matches.append((card_id, card_data))

            if fallback_matches:
                reply_msgs = [TextSendMessage(
                    text="「智慧搜尋」服務暫時無法取得，"
                         "已自動啟用「關鍵字備援搜尋」為您找到以下相關名片：",
                    quick_reply=get_quick_reply_items()
                )]
                for card_id, card_data in fallback_matches[:5]:
                    reply_msgs.append(
                        flex_messages.get_namecard_flex_msg(card_data, card_id)
                    )
                await line_bot_api.reply_message(event.reply_token, reply_msgs)
                return
        except Exception as fallback_err:
            print(f"Fallback search also failed: {fallback_err}")

Summary and Benefits

After refactoring into an ADK Agent + Tools architecture, it brought amazing substantial changes:

Extreme Token Saving: The model only calls get_all_namecards when it needs to read business cards, and general conversations no longer need to repeatedly transmit huge JSON data.
Multi-step Natural Dialogue Linking: The user only needs to type "Help me change David Wang's memo to 'Meeting next Monday'", and the model will automatically and continuously call get_all_namecards() -> find the ID -> call update_namecard_memo(id, ...) -> and then call display_namecard(id) to show the latest results.
Code Quality Leap: In this refactoring, we also strictly controlled through flake8, completing 100% clean code formatting and zero-warning compilation.

The complete and linter-optimized code has been pushed to GitHub simultaneously. I hope this dynamic closure design and Cloud Run, Event Loop pit-filling practice can help everyone avoid more detours when building production-level AI Agent Web applications! See you next time!

[Workshop][Gemini CLI] Building with AI 2026: Hands-on with Gemini CLI and Official MCP to Launch a Google Drive LINE Bot from Scratch

Evan Lin — Fri, 15 May 2026 00:45:26 +0000

(Event: Build with AI 2026 @ Google Taipei 101 / Presentation: SpeakerDeck / Materials: kkdai/BwAI-2026 / Example: kkdai/bwai2026-sample)

Background: When the CLI Becomes a "Thinking Colleague"

After Google I/O in 2026, Gemini CLI is no longer just another terminal toy that packages LLM, but a development tool that can mount MCPs, plan on its own, run gcloud on its own, and stop to ask you when it doesn't understand.

In this Build with AI 2026 workshop, I compressed this tool flow into two hands-on sessions:

Workshop 1: Environment Preparation + Two Essential Official MCPs — Connecting Gemini CLI to Google's official knowledge and Maps Platform.
Workshop 2: Tell Gemini CLI a Sentence and Deploy a LINE Bot to Cloud Run — No more hand-typing that long and painful gcloud run deploy ....

The entire teaching material has been open-sourced at kkdai/BwAI-2026, the example project is at kkdai/bwai2026-sample, and the event slides are on SpeakerDeck. This is the full text version of the on-site walkthrough, including the three pitfalls we encountered on stage that day.

Why Gemini CLI + MCP? First, Look at the Timeline

The update pace of Gemini API and its ecosystem has been very dense in the past year:

Time	New Stuff	Impact on Workflow
2025/08	Gemini YouTube Video Understanding	Directly feed URLs of videos to the model
2025/11	Gemini File Search	Managed RAG, no need to connect your own vector DB
2025/12	Google Search Grounding (Vertex)	Model answers can be grounded to search results
2025/12	Maps Grounding & Maps Platform Assist MCP	Native map scenarios
2026/02	Google Developer Knowledge API + MCP Server	Official documentation becomes a tool queryable by LLM
2026/03	Gemini 3 Flash + Tool Combo	Single call chains multiple grounding tools

Core Observation: Google has made each new capability into an MCP Server, which means that Gemini CLI can upgrade the IDE from "an LLM that can write code" to "an LLM that can write code using Google's official resources" with just one line of gemini mcp add.

This workshop, I chose two MCPs that are most impactful for LINE Bot developers to demonstrate.

Workshop 1: Environment Preparation and Official MCP Installation

Why It's Recommended to Start with Cloud Shell

The biggest fear in on-site workshops is the environment issue like "Teacher, I can't find Python 3.11 here". I put the entire demonstration directly on Google Cloud Shell:

gcloud is pre-installed.
gemini CLI is pre-installed (the latest Cloud Shell image is built-in).
gcloud auth automatically links with the Cloud Shell account, saving the OAuth dance.

Go to https://console.cloud.google.com/, first confirm that the project is the one you just created (don't accidentally open the company's official environment), and then click Cloud Shell in the upper right corner:

# Verify that both tools are there
gcloud --version
gemini --version

[!TIP] If you want to run it locally, you can follow the Gemini CLI official installation guide, but in the workshop, we all use Cloud Shell to avoid the tragedy of "everyone's environment is different".

What is MCP? Explained in Three Sentences

MCP (Model Context Protocol) is an open protocol proposed by Anthropic that allows LLM clients to communicate with external capability providers in a unified format.
Gemini CLI is the MCP client, and you can gemini mcp add ... to mount any server that complies with the MCP specification.
Google itself has now packaged several APIs into official MCP servers, which is equivalent to equipping your AI assistant with "Google's internal knowledge base".

MCP #1: Google Developer Knowledge

This MCP turns the official documentation of the Google family (Cloud / Android / Web / Firebase / Workspace…) into a tool that Gemini can call. The advantage over web search is that: it returns chunks that have been officially indexed, with the correct source URL, and will not be misled by outdated blogs.

Setup Steps

Enable Developer Knowledge API at Google Cloud Console.
Create an API Key in "Credentials" and restrict it to only call the Developer Knowledge API (the principle of least privilege).
Run in Cloud Shell:

gemini mcp add -t http \
  -H "X-Goog-Api-Key: YOUR_API_KEY" \
  google-developer-knowledge \
  https://developerknowledge.googleapis.com/mcp \
  --scope user

--scope user means that this MCP is valid for all your projects, and you don't need to install it again next time you change repos.

Verification

Enter gemini interactive mode, first type:

/mcp list

You should see google-developer-knowledge with the status Connected. Then throw a typical question:

Please help me query the latest deployment limits of Google Cloud Run (Deployment Limits) and list the top three.

Correct behavior:

Gemini will call the google-developer-knowledge tool.
The answer content is referenced from official pages like cloud.google.com/run/quotas.
Finally, it includes a reference URL.

MCP #2: Google Maps Platform Code Assist

This MCP is specifically designed to help you write code for Google Maps integration — including the latest calling methods for Maps JavaScript API, Places API, and Routes API. It is extremely friendly to developers who "want map features but are too lazy to flip through three docs".

gemini mcp add -s user -t http \
  maps-code-assist-mcp \
  https://mapscodeassist.googleapis.com/mcp

Verification

I want to embed a Google map in a webpage, please write a basic JavaScript code for me,
with the center point set to Taipei 101.

Expected behavior:

Gemini calls maps-code-assist-mcp.
The generated code will not use the deprecated new google.maps.Map() synchronous loader, but will use the currently recommended importLibrary async pattern.
It will proactively remind you to get the Maps JavaScript API Key and make referer restrictions.

If you see it still generating the old writing style from 2020, then the MCP is not mounted correctly — re-/mcp list to check the status.

Workshop 2: Deploying a LINE Bot to Cloud Run

This part uses the example project kkdai/bwai2026-sample. It is a LINE Bot file backup helper:

Users put images / videos / audio / PDFs into the LINE chat box.
The bot automatically saves the files to the user's own Google Drive, in folders by YYYY-MM.
Supports commands like /recent_files, /search_files <keyword>, /disconnect_drive.

Tech stack: Go + LINE Messaging API SDK + Google Drive API + Firestore (to store OAuth token) + Cloud Run.

git clone https://github.com/kkdai/bwai2026-sample
cd bwai2026-sample

Deployment Flow Overview

[Phase One] Get LINE Keys (Channel Secret + Access Token)
      ↓
[Phase Two] GCP Project Setup (Enable Run / Build / Firestore / Artifact / Drive API)
      ↓
[Phase Three] Set up OAuth Consent Screen + Gemini CLI Login
      ↓
[Phase Four] Tell Gemini CLI a sentence in Chinese and deploy to Cloud Run
      ↓
[Phase Five] Fill in the Webhook URL in LINE Developers Console

Phase One: LINE Keys

Create an official account at LINE Official Account Manager.
In the background, "Settings → Messaging API" enable Messaging API, and create a Provider.
Back to LINE Developers Console corresponding Channel:
- Basic settings → Get Channel Secret.
- Messaging API → Click Issue to get Channel Access Token (long-lived).
Very important: Go back to OA Manager and disable "Auto-reply messages", otherwise your code will never be able to get the messages to reply to.

Phase Two: GCP Project Activation

# Switch to the clean project used in the workshop
gcloud config set project your-cool-project-id

# Enable the entire set of services in one go
gcloud services enable \
  run.googleapis.com \
  cloudbuild.googleapis.com \
  firestore.googleapis.com \
  artifactregistry.googleapis.com \
  drive.googleapis.com

# Build Firestore (used to store per-user OAuth token + state anti-counterfeiting)
gcloud firestore databases create \
  --location=asia-east1 \
  --type=firestore-native

[!NOTE] --type=firestore-native This value will be explained in the third pitfall, why it's easy to get wrong.

Phase Three: OAuth Consent Screen + Gemini CLI Login

Because the Bot needs to represent "the user themselves" to upload files to their Google Drive, this path must go through OAuth.

Go to OAuth Consent Screen:
- User Type: External.
- Application Name: My LINE Bot (or whatever name you want to call it).
- Support Email / Developer Contact Email: Fill in your own Gmail.
Be sure to click "Publish App" after filling it out — if you don't publish it, only accounts in the Test Users list can use it.
Create an OAuth client ID:
- Select Web Application for the type.
- Authorized redirect URI: Temporarily fill in https://placeholder/oauth/callback, and come back to modify it after getting the Cloud Run URL in Phase Four.
- Save the Client ID and Client Secret.
Run locally:

gcloud auth application-default login

This will write ADC (Application Default Credentials) to the local machine, and Gemini CLI will use this credential when running gcloud, without popping up a browser to re-auth halfway.

Phase Four: Deploy to Cloud Run with Gemini CLI (The Highlight)

This part is where the participants in the workshop were most "wow".

After entering the project directory, start Gemini CLI interactive mode:

gemini

Then say a sentence:

Help me deploy to Cloud Run using gcloud, and stop to ask me if you need any data.
Refer to repo https://github.com/kkdai/bwai2026-sample,
region use asia-east1, environment variables will use
ChannelSecret, ChannelAccessToken, GOOGLE_CLIENT_ID,
GOOGLE_CLIENT_SECRET, GOOGLE_REDIRECT_URL.

Gemini CLI will then:

ls and cat Dockerfile by itself to confirm the project structure.
Generate a plan: First use PENDING to reserve the deployment → get the URL → supplement the OAuth redirect → update env vars.
Stop and ask you for confirmation before execution (this is the CLI's confirm mode, enabled by default, and will not yolo).
Run a command that looks like this:

gcloud run deploy linebot-backup-service \
  --source . \
  --region asia-east1 \
  --set-env-vars "GOOGLE_CLOUD_PROJECT=your-cool-project-id,\
ChannelSecret=YOUR_LINE_SECRET_XXXX,\
ChannelAccessToken=YOUR_LINE_TOKEN_XXXX,\
GOOGLE_CLIENT_ID=PENDING,\
GOOGLE_CLIENT_SECRET=PENDING,\
GOOGLE_REDIRECT_URL=PENDING" \
  --allow-unauthenticated \
  --quiet

After 3 to 5 minutes, get the Service URL, such as https://linebot-backup-service-xxxxx.a.run.app.

Supplement the Real OAuth Settings

Go back to the Console and change the https://placeholder/oauth/callback you just filled in to https://linebot-backup-service-xxxxx.a.run.app/oauth/callback.
Paste the real Client ID / Secret to Gemini CLI and ask it to help you update:

gcloud run services update linebot-backup-service \
  --region asia-east1 \
  --update-env-vars \
"GOOGLE_REDIRECT_URL=https://linebot-backup-service-xxxxx.a.run.app/oauth/callback,\
GOOGLE_CLIENT_ID=real-client-id.apps.googleusercontent.com,\
GOOGLE_CLIENT_SECRET=real-secret-xxxx"

Phase Five: Point the LINE Webhook to Cloud Run

Go back to LINE Developers Console → Messaging API tab.
Webhook URL: Fill in https://linebot-backup-service-xxxxx.a.run.app/callback.
Press Verify, and expect to see Success.
Toggle Use webhook to on.
Finally, go back to OA Manager and reconfirm that "Auto-reply messages" is off and "Webhook" is on.

Open LINE, add the Bot as a friend, throw a picture, run OAuth once, and see a folder LINE Bot Uploads/2026-05/... in Drive — the entire process is complete.

Common Maintenance Commands

Function	Command
Redeploy	`gcloud run deploy linebot-backup-service --source . --region asia-east1`
Change env vars	`gcloud run services update linebot-backup-service --update-env-vars "KEY=VALUE"`
Real-time log	`gcloud beta run services logs tail linebot-backup-service`
Check service status	`gcloud run services describe linebot-backup-service --region asia-east1`

The entire maintenance can actually be given to Gemini CLI: "Help me check the logs of linebot-backup-service for the last 5 minutes, and find 5xx" is enough.

Workshop On-Site Pitfall Records

Pitfall One: Billing Not Enabled, Red Error on First Deploy

The first gcloud run deploy directly spewed:

FAILED_PRECONDITION: Billing account for project [your-cool-project-id] is not found.
Please ensure that you have linked an active billing account.

Reason: Most workshop participants open new projects to do this, and new projects don't have Billing bound by default. Cloud Run, Cloud Build, and Artifact Registry all require billing to run — even within the free tier, you must have a "billing account with a linked card" attached to the project.

Solution:

# Check the current billing status of the project
gcloud beta billing projects describe your-cool-project-id

# List available billing accounts
gcloud beta billing accounts list

# Bind
gcloud beta billing projects link your-cool-project-id \
  --billing-account=0X0X0X-0X0X0X-0X0X0X

If you can't or don't want to bind a card, we used the " sandbox project with billing already " as a demonstration on site.

Pitfall Two: Firestore type Parameter Name

The first version of the teaching material (even what AI guessed the first time) was written as --type=native or --type=native-mode:

ERROR: argument --type: Invalid choice: 'native-mode'.
  Valid choices: ['firestore-native', 'datastore-mode']

Reason: After an update in 2024, gcloud firestore databases create changed the type parameter value to the more explicit firestore-native / datastore-mode. Old documents and old answers (including LLM training data) will give you the old values.

Solution:

gcloud firestore databases create \
  --location=asia-east1 \
  --type=firestore-native

This pitfall just demonstrated why you should install the Google Developer Knowledge MCP — after mounting it, Gemini will check the latest official documentation and will not give you outdated type values.

Pitfall Three: Forgot to Enable Drive API, OAuth Passed but Can't Write In

After deployment, Webhook is set up, OAuth consent screen is completed, and the token is obtained, but the first picture upload is 500. Check the log:

googleapi: Error 403: Google Drive API has not been used in project
your-cool-project-id before or it is disabled.

Reason: If you miss drive.googleapis.com in the gcloud services enable ... string in Phase Two, OAuth can pass (because the Consent Screen and Drive API are two different things), but your server will be blocked when it uses the access token to call drive.googleapis.com.

Solution (Quickest):

gcloud services enable drive.googleapis.com

Solution (Fundamental): Enable all the APIs you need at once, list them in the checklist of the teaching material, and run along with it on site so you won't miss it. I specifically wrote drive.googleapis.com into the string in Phase Two to block this pitfall.

[!TIP] A good habit for debugging: As long as the server has the correct token but is 403, first go to API Library to confirm that the corresponding API is enabled, then check the OAuth scope, and finally look at IAM. The wrong order will waste a lot of time.

Why is this combination worth learning?

After the workshop, I asked the on-site participants what moment they felt the most, and the answer was almost unanimous: "Deploying the service just by speaking Chinese to Gemini CLI" that moment.

So why does it feel that way? Breaking it down:

Previously, DevOps was stuck on remembering which command, now it's stuck on expressing clearly what you want to do. The latter is much lower in threshold, with newcomers getting started in three days vs. three months before daring to touch gcloud.
MCP injects official knowledge into Gemini in advance. You no longer need to RTFM yourself first, then translate it into a prompt for LLM; MCP is equivalent to letting LLM have the ability to RTFM itself.
Error messages return to the tool itself. Previously, you had to Google + StackOverflow for errors, now you can directly paste them back to the CLI, which reads the error and then decides the next step — forming a complete plan-act-observe loop.
The entire workflow is reproducible. The teaching materials, examples, and prompts are all in the GitHub repo, and anyone can clone it and follow along, and the results should be consistent.

Want to go deeper? Recommended Advanced Reading

Official Materials: kkdai/BwAI-2026
Example Project: kkdai/bwai2026-sample
Slides: SpeakerDeck
Gemini CLI: github.com/google/gemini-cli
MCP Specification: modelcontextprotocol.io
Extension: Using Gemini CLI + Developer Knowledge MCP, Map MCP Grounding

Postscript: Come to LINE and Make Things Together

This workshop is also one of the recruitment events for our LINE Taiwan DevRel. If you read this and feel:

Want to play with the integration of LINE Messaging API + Google Cloud + Gemini for a long time.
Like to write production code while making the process into teaching materials that can be copied by others.
Can invest more than three days a week and are willing to become a full-time partner after the internship.

Welcome to send me a private message or email to chat, we have a flexible internship program of three days a week, and if you do well, you have the opportunity to become a long-term partner.

Finally, thank you to all the developers who came to the site and did hands-on together — those who are willing to spend their weekends on "using new tools to get through the entire pipeline" are always the most admirable group in the community. See you next time!

Gemini API File Search: Enhanced Multimodal Capabilities with Embedding 2, Including Open-Source LINE Bot Implementation

Evan Lin — Tue, 12 May 2026 04:17:48 +0000

(Image source: Google Blog - Gemini API File Search is now multimodal: build efficient, verifiable RAG)

Recap: RAG Finally Doesn't Need to Build Legos

In the past few years, whenever developers thought about RAG (Retrieval-Augmented Generation), the component list that came to mind probably looked like this:

A chunker (langchain? Write it yourself?)
An embedding model (OpenAI text-embedding-3? Cohere? BGE?)
A vector database (ChromaDB, FAISS, pgvector, Pinecone… which one to choose is a battle)
A retrieval + rerank process
And then the LLM

Not to mention that multimodal RAG needs another layer: How to embed images? Do you need to OCR first? Do you need to split two stores, one for text and one for images? How to calculate scores for mixed text and image search? Just these few questions can take up a sprint.

Recently, Google released Expanded Gemini API File Search for multimodal RAG on the developer blog, turning the long pipeline above into " calling a managed API ", and images are natively supported.

This article will do two things:

Explain the new features clearly, including what Gemini Embedding 2 is doing behind the scenes.
Use an open-source LINE Bot (kkdai/linebot-multimodal-rag) as a live demonstration to see how the new features are combined in actual production code — and share the two typical pitfalls I encountered during debugging to help everyone avoid them.

Three Major Highlights of the New Features

According to the official blog, the core of this upgrade is three things:

1. True Multimodal File Search (Native Multimodal File Search)

In the past, File Search was pure text retrieval, and images could only be indexed by OCRing them into text.

“File Search now processes images and text together. Powered by the Gemini Embedding 2 model, the tool understands native image data.”

Now you can directly put images into the File Search Store, and index them together with text. The engine behind it is Gemini Embedding 2 — text, images, videos, audio, and documents share the same vector space, so you can "find text with images", "find images with text", or "find images with images" without having to align the spaces yourself.

For us product people, this means:

Mixed text and image search is no longer a research topic, it's an API call.
No need to maintain two stores (one for text chunks and one for CLIP-style image embeddings).
Scientific charts, UI screenshots, reports, photo albums... these things that used to lose most of their meaning after OCR can now retain the original visual information for retrieval.

2. Custom Metadata and Server-side Filtering

Each file you put into the store can now be tagged with key-value labels:

{"key": "user_id", "string_value": "U1234abcd..."}
{"key": "department", "string_value": "Legal"}
{"key": "status", "string_value": "Final"}

Use the google.aip.dev/160 filter syntax (same format as most GCP list APIs) when querying:

metadata_filter='department="Legal" AND status="Final"'

Filtering is done first on Google's side, not retrieving a bunch and then discarding. After reducing the noise, the speed and accuracy will both increase, which is a lifesaver for multi-tenant SaaS — one store with metadata filters can separate tenants, without the need to isolate N stores.

My LINE Bot uses this directly to do per-user data isolation: each time a file is uploaded, it's tagged with the LINE user_id, and when querying, a filter is applied, so user A will never see user B's data in the Q&A.

3. Page-level Citations

Each cited snippet in the response will now include the page number.

“captures the page number for every piece of indexed information.”

This is super critical for enterprise customers. "AI says Y is mentioned on page X of the contract" vs. "AI says Y is mentioned in the contract" — the former can be directly accepted by legal/auditing, while the latter requires manual effort to flip through the book for verification. Page numbers unlock the final mile of "LLM answers cannot be traced back to the source".

The Multimodal Engine: Gemini Embedding 2

The core of the new feature is this Gemini Embedding 2 model. Quote its specifications for your selection decisions:

Item	Specification
Supported Input	Text, images, videos, audio, documents (same embedding space)
Input token limit	8,192 tokens
Output dimensions	128 ～ 3,072 (using Matryoshka Representation Learning, small dimensions can also maintain similar accuracy)
Multilingual support	100+ languages

Several key benchmarks (recall@1):

Text-to-Image Search: TextCaps 89.6 / Docci 93.4
Image-to-Text Search: TextCaps 97.4
Multilingual (MTEB): mean 69.9
Video-Text Matching: Vatex ndcg@10 68.8
Speech-Text Retrieval: MSEB mrr@10 73.9

Several key observations:

Matryoshka is not a buzzword: You can store it with 3072 dimensions first, and when running retrieval, switch to 768 dimensions to run faster and maintain quality. Storage/scoring costs can be optimized in stages.
Cross-modal scores are very real: 97.4% recall@1 (image→text) means that if you have an image and want to find the corresponding descriptive text, you'll find it almost immediately. This can be directly implemented for use cases like "take a picture of a product label and find the corresponding page of the user manual".
100+ languages: This is a very real difference for the Taiwan/Japan/Korea/Southeast Asia markets.

What Developers Really Care About: Price and Access Cost

From the official tutorial article Multimodal RAG with the Gemini API File Search tool: a developer guide, there are two sections that developers sensitive to cost should highlight:

“Fully managed, with no vector database overhead.”

“Storage and query-time embeddings are free. You only pay for indexing and tokens.”

In plain English:

You don't pay for the vector database, nor do you pay for the monthly salary of the people maintaining it.
Storage is free, and embedding calculations at query time are also free.
You only have two things to pay for: the embedding fee for the initial indexing and the LLM tokens consumed when generating the answer.

This is a friendly cost curve for personal side projects and early startups — you don't need to decide on day one "can I afford the baseline of the vector DB".

Standard Workflow: 4 SDK calls to complete a RAG

Organized from the dev.to guide, the minimum viable workflow:

from google import genai
from google.genai import types

client = genai.Client()

# 1. Create a store (specify the multimodal embedding model)
store = client.file_search_stores.create(config={
    "display_name": "my-multimodal-rag",
    "embedding_model": "models/gemini-embedding-2",
})

# 2. Upload files + custom metadata
operation = client.file_search_stores.upload_to_file_search_store(
    file_search_store_name=store.name,
    file="report-q1.pdf",
    config={
        "display_name": "Q1 Report",
        "custom_metadata": [
            {"key": "department", "string_value": "Finance"},
            {"key": "year", "string_value": "2026"},
        ],
    },
)
# Upload is a long-running operation, needs to poll:
# operation = client.operations.get(operation)

# 3. Feed file_search as a tool to generate_content
response = client.models.generate_content(
    model="gemini-3-flash-preview",
    contents="What was the revenue growth rate in the first quarter of last year?",
    config=types.GenerateContentConfig(
        tools=[types.Tool(file_search=types.FileSearch(
            file_search_store_names=[store.name],
            metadata_filter='department="Finance" AND year="2026"',
        ))],
    ),
)

# 4. Get citations (including page numbers)
for citation in response.candidates[0].grounding_metadata.grounding_chunks:
    print(citation.web.uri, citation.web.title) # or the corresponding file/page fields

To provide citations with images to the user, there is also client.file_search_stores.download_media() that can be called.

It's no exaggeration, the entire multimodal RAG is less than 30 lines of code.

Demo Case: Putting These New Features into a LINE Bot

It's abstract just looking at the SDK examples, so I made it into a LINE Bot that can be put to work, open-sourced at kkdai/linebot-multimodal-rag:

Users drop PDFs / images / text files into the LINE chat box → Bot indexes into the File Search Store.
Users type questions → Gemini finds answers from the data uploaded by the user themselves.
Users drop an image and ask a question → The same can be done for image-to-text retrieval.
Deployment target: GCP Cloud Run + Cloud Build automatic deployment.

The architecture is very intuitive (key fields):

Component	Role
LINE Webhook	FastAPI receives message events
GCS	Persists original files (`uploads/{user_id}/{message_id}.{ext}`)
Gemini File Search Store	The only index layer (managed)
Custom metadata `user_id`	Multi-tenant isolation
FastAPI BackgroundTasks	Avoid the LINE reply token 30-second limit

Comparing to the three major new features mentioned earlier:

Multimodal: Users drop images, drop PDFs, all go into the same store, and all consume the same pipeline during search.
Custom metadata: Files for each LINE user are tagged with user_id, filtered during queries, achieving server-side forced isolation.
Page-level citations: In the future, to display "the answer comes from XX.pdf page 5" in LINE messages, directly consume grounding_metadata.

The entire repo is about 600 lines of Python, and it completes a " your own private multimodal knowledge base chat Bot ".

Deployment Battle: commit → automatic online

It's not enough for the open-source example to just run; to demonstrate it at the workshop, it needs to be at the level of "code changes, push to GitHub, and automatically deploy". This time, I asked Claude Code to be my co-pilot to help me connect CI/CD.

I only dropped one sentence:

"Help me create a Cloud Build connection to GitHub, and trigger a build to deploy to Cloud Run after committing to main."

Claude Code first scanned cloudbuild.yaml, existing Cloud Run settings, Secret Manager, and Artifact Registry, and listed a "current problem", and then stopped to ask me a key decision: Should I keep the existing service name or change the yaml? Does GitHub need authorization? After I answered, it built the missing resources in one go:

# Build Artifact Registry repo
gcloud artifacts repositories create linebot \
  --repository-format=docker --location=asia-east1

# Secret migration: move from the current service to Secret Manager (via stdin, don't leave shell history)
gcloud run services describe linebot-gemini-file-search --region=asia-east1 \
  --format='value(...)' \
  | gcloud secrets create LINE_CHANNEL_SECRET --data-file=-

# Give Cloud Build / Compute SA the roles needed for deployment
for role in run.admin iam.serviceAccountUser artifactregistry.writer \
            secretmanager.secretAccessor storage.objectAdmin logging.logWriter; do
  gcloud projects add-iam-policy-binding your-cool-project-id \
    --member="serviceAccount:660825558664-compute@developer.gserviceaccount.com" \
    --role="roles/$role" --condition=None
done

# Build trigger
gcloud builds triggers create github \
  --name=linebot-multimodal-rag-main \
  --repo-owner=kkdai --repo-name=linebot-multimodal-rag \
  --branch-pattern="^main$" --build-config=cloudbuild.yaml

The only thing that couldn't be automated was GitHub OAuth authorization — Claude Code directly admitted to me that "this step can only be done by clicking in the Console", and provided the URL and step-by-step instructions. After finishing the one-minute click, the trigger ran through.

Pitfalls Record: Two Traps Directly Related to the New Features

Pitfall 1: Hardcoded Model ID is Outdated

The default values in cloudbuild.yaml and code both write gemini-3.1-flash, but after looking at the Gemini API's current model id list: there's no such model at all. The correct ID for Gemini 3 Flash is gemini-3-flash-preview.

Why this happened: multimodal RAG is a very new feature, and related documents, tutorials, and examples are still being created in large numbers, and the naming has also been slightly adjusted. The initial version of the Repo can easily write an id that "looks like it but doesn't actually exist".

Solution: Change the entire repo to gemini-3-flash-preview, and also confirm that the embedding model is models/gemini-embedding-2 (correct, didn't step on the trap). After pushing, Cloud Build automatically triggered, and a new revision went online in three minutes.

Pitfall 2: Mysterious "Upload has already been terminated"

This trap was directly stepped on the " image upload " path newly supported by File Search Store — it's also the most worth sharing, because it demonstrates that "the error messages of new APIs are sometimes very euphemistic".

I sent a JPG from LINE to the Bot and clicked "store in database", and the result:

❌ Failed to store: 400 Bad Request. {'message': 'Upload has already been terminated.', 'status': 'Bad Request'}

Couldn't see the reason at all. Cloud Logging only had the same error, no stack trace. After looking around on the Google AI Developers Forum, I found that several file types (.md / .xlsx / large CSV) had encountered similar reports.

The real culprit is hidden in this seemingly innocent code:

# app/gemini_service.py (before modification)
suffix = mimetypes.guess_extension(mime_type) or ".bin"
with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
    tmp.write(file_bytes)
    tmp_path = tmp.name

Before Python 3.13, mimetypes.guess_extension("image/jpeg") returns .jpe, not .jpg. The reason is that in the MIME table of the standard library, .jpe is lexicographically before .jpg, and this quirk has existed for nearly twenty years.

Gemini File Search Store doesn't recognize the file extension .jpe, but the API's message uses "Upload has already been terminated" in a way that is very easy to mislead — at first, I thought it was because the upload size exceeded, or it was choked by concurrency, or there was a race inside the SDK.

Solution: Take the file extension directly from display_name (handlers have already been correctly set to image_<id>.jpg), and use an explicit MIME comparison table as a backup:

# app/gemini_service.py (after modification)
_MIME_TO_EXT = {
    "image/jpeg": ".jpg",
    "image/png": ".png",
    "image/webp": ".webp",
    "application/pdf": ".pdf",
    # ...
}

if "." in display_name:
    suffix = "." + display_name.rsplit(".", 1)[-1].lower()
else:
    suffix = _MIME_TO_EXT.get(mime_type) or mimetypes.guess_extension(mime_type) or ".bin"

print(f"[BG Store] uploading display_name={display_name!r} mime={mime_type} "
      f"size={len(file_bytes)} tmp_suffix={suffix}")

Also, add traceback.format_exc() to the except part, so that the next time something goes wrong, Cloud Logging will have the full stack.

The takeaway from this story: When you're running on a new modality on a "newly GA'd API", please be sure to:

First confirm on the client side that the filename / file extension you generate is the format expected by the API, don't trust the mimetypes standard library to guess for you.
Write the stack trace into the log, otherwise you can't save yourself from the esoteric discussions on the forum like "just change a file".
Compare the file extension you generate with the Gemini File Search official supported format list.

Summary: The Entry Fee for Multimodal RAG, the Lowest in History

This time's Gemini API File Search upgrade compresses a feature line that used to take 3 months to go online into " dozens of lines of code + a managed API " to run:

Native multimodal support: Text, images, videos, audio, and documents share the same embedding space, goodbye to the OCR transition layer.
Custom metadata + server-side filter: Multi-tenant SaaS doesn't need to struggle with how many stores to split.
Page-level citations: Enterprise compliance scenarios finally have native grounding.
Friendly to money: Storage / query embedding are both free, only pay for indexing + LLM tokens.
Cross-modal scores of Embedding 2: 97.4% recall@1 is not a demo number, it's the level that can directly support the product.

If you want to directly see a production-shaped end-to-end example: kkdai/linebot-multimodal-rag the entire repo PR welcome, and you're also welcome to use it to modify it into your own domain's RAG application — Notion knowledge base, employee manual Q&A machine, photo album manager, research paper index... probably only imagination will limit you.

If you want to get started, the recommended reading order:

Google official blog: Expanded Gemini API File Search for multimodal RAG
Gemini Embedding 2 specification page: deepmind.google/models/gemini/embedding
Developer implementation guide: Multimodal RAG with the Gemini API File Search tool: a developer guide
My open-source example: github.com/kkdai/linebot-multimodal-rag

Welcome everyone to try out this very powerful Multimodal RAG support!

[GCP Practice][BwAI] AI-Powered Development: Quickly Deploy a LINE Bot Cloud Backup Tool with Gemini CLI

Evan Lin — Thu, 07 May 2026 04:36:40 +0000

Background

In the upcoming Build With AI 2026 workshop, we're bringing a very practical project: the LINE Bot File Backup Robot. It allows you to directly upload images and files from your LINE chatroom to Google Drive, and it will automatically create folders by month to keep things organized.

Traditionally, putting a project like this, which includes OAuth authorization, a Firestore database, and Cloud Run container deployment, on the cloud would often leave beginners struggling with lengthy gcloud commands.

But this time it's different, we have a secret weapon: Gemini CLI.

This article will document how we used AI as a DevOps engineer, completing the entire complex deployment process by "talking," and of course, including the various real pitfalls we encountered along the way.

Preparation: Summoning the AI Assistant

Before we start, besides the basic gcloud installation and login, you only need to install Gemini CLI.

Prepare the following "confidential parameters" (all are Mock processed in this article):

PROJECT_ID: your-cool-project-id
LINE Channel Secret: YOUR_LINE_SECRET_XXXX
LINE Access Token: YOUR_LINE_TOKEN_XXXX

After entering the project folder, I only said one sentence to Gemini CLI:

"Help me deploy to Cloud Run using gcloud, and stop and ask me if you need any information. Refer to the repo…"

Next, it's time to witness miracles (and fix bugs).

Practical Deployment Process: AI Leading the Way

Gemini CLI intelligently analyzed Dockerfile and main.go and immediately listed a set of battle plans.

Step 1: Environment Detection and API Enablement

The AI first confirmed my current project settings in gcloud and then enabled the necessary services in one go:

gcloud services enable firestore.googleapis.com \
  cloudbuild.googleapis.com \
  run.googleapis.com \
  artifactregistry.googleapis.com

Step 2: Creating a Firestore Database (Encountering the First Pitfall)

Our Bot needs to record the OAuth State anti-counterfeiting mark, so Firestore is needed. The AI tried to execute the command, but we immediately encountered an error. (See the pitfall record below for details)

After correction, the correct command is:

gcloud firestore databases create --location=asia-east1 --type=firestore-native

Step 3: Deploying Cloud Run First, Filling in the Blanks Later

This is a classic "chicken or the egg" problem: Google OAuth needs to know your Cloud Run URL (Redirect URI), but your Cloud Run deployment needs to fill in the OAuth Client ID and Secret.

Gemini CLI's strategy is great: Deploy with placeholders first!

gcloud run deploy linebot-backup-service \
  --source . \
  --region asia-east1 \
  --set-env-vars "GOOGLE_CLOUD_PROJECT=your-cool-project-id,ChannelSecret=YOUR_LINE_SECRET_XXXX,ChannelAccessToken=YOUR_LINE_TOKEN_XXXX,GOOGLE_CLIENT_ID=PENDING,GOOGLE_CLIENT_SECRET=PENDING,GOOGLE_REDIRECT_URL=PENDING" \
  --allow-unauthenticated \
  --quiet

After successful deployment, we got a string of fragrant URLs: https://linebot-backup-service-xxxxx.a.run.app.

Step 4: Completing Google OAuth Settings and Environment Variable Updates

With the URL, I can go to the "API & Services" in Google Cloud Console to complete the settings:

Create an OAuth consent screen.
Create credentials for a Web application.
Fill in the "Authorized redirect URI" with the URL we just got, plus /oauth/callback.

After getting the real ID and Secret, I directly pasted the information to Gemini CLI, and it automatically updated the service for me:

gcloud run services update linebot-backup-service \
  --region asia-east1 \
  --update-env-vars "GOOGLE_REDIRECT_URL=https://[YOUR_URL]/oauth/callback,GOOGLE_CLIENT_ID=real-client-id.apps.googleusercontent.com,GOOGLE_CLIENT_SECRET=real-secret-xxxx"

Done! Finally, just go to the LINE Developers Console and fill in the Webhook.

Blood and Tears Pitfall Records During the Deployment Process

It looks smooth, but in fact, the AI and I hit a few walls together. This is also the most real experience of using CLI tools.

Pitfall 1: Forgetting to Bind a Credit Card, the 390001 Error

When executing the first gcloud run deploy, the terminal directly spewed red text all over the face:

FAILED_PRECONDITION: Billing account for project is not found...

Reason: Cloud Run and Cloud Build require the project to enable billing (Billing Enabled). This is a brand new test project, and I forgot to bind the billing account. Solution: The AI immediately checked the project status for me (gcloud beta billing projects describe) and asked me if I wanted to switch to a project with billing, or to fix it. I obediently went to the Console to bind my credit card, and the deployment was able to continue.

Pitfall 2: The Evolution of Command Parameter Syntax

When creating Firestore, the AI initially gave the command --type=native-mode or --type=native, but gcloud didn't appreciate it:

ERROR: argument --type: Invalid choice: 'native-mode'

Reason: The CLI parameters of gcloud will change with version updates. Solution: Carefully look at the gcloud error message, and now the correct parameter values are firestore-native or datastore-mode. After changing to --type=firestore-native, it passed smoothly.

Pitfall 3: The Invisible "Drive API"

When everything was deployed, we encountered a permission error when testing "upload to Google Drive". Reason: This is a Bot that helps you upload files to Drive, but when we enabled the API in the first step, we actually forgot to enable the protagonist: Google Drive API! Without it, even if OAuth authorization is successful, the program will still be blocked. Solution: I only entered the mysterious "3." (implying the third checkpoint) into the terminal, and the AI immediately understood and added this critical blow:

gcloud services enable drive.googleapis.com

Conclusion

Through Gemini CLI, the originally tedious and error-prone infrastructure construction work has become a "two-person pair programming" session.

AI can help you remember lengthy gcloud parameters, help you sort out the deployment logic (deploy with PENDING first and then update), and even adjust strategies quickly based on error messages when you encounter errors.

This is the core spirit that Build With AI 2026 wants to convey: let AI handle the tedious DevOps chores, so that developers can focus more energy on innovation in core business logic.

If you are still manually typing long and ugly gcloud commands, I strongly recommend you install Gemini CLI and give it a try!

DEV Community: Evan Lin

[Gemini API in Action] Building MemeFinder: A Native Mac Menu Bar Widget for Finding Memes via Text Using Gemini Vision & Semantic Embeddings

The Origin: Mid-Conversation, Where on Earth Is That Meme?

System Design and Architecture

System Architecture Flow

Core Implementation

1. Auto-tagging memes with the Gemini vision model

2. Hybrid semantic + keyword ranking

Major Pitfalls and Solutions

Pitfall #1: The mysterious GeminiError error 0 — indexing and search both fail

Pitfall #2: SwiftPM's main entry-point conflict and the SwiftUICore linking error

Pitfall #3: Parallel indexing's rate limit and "I want to stop indexing halfway"

Pitfall #4: Evolving from a "windowed app" into "menu-bar resident + global hotkey"

Pitfall #5: The settings form is blank — one symptom, three layers of cause

On the "Development Process" Itself

Results and Benefits

[Gemini API] Gemini Batch API and Webhook API practical usage on restaurant survey

A Powerful Tool for Asynchronous Processing: Gemini Batch API & Webhooks

System Design and Optimized Architecture

System Architecture Flow

Core Implementation

1. Precisely Extracting Restaurant Names from Grounding Text using Gemini

2. Dynamically Generating LINE Quick Reply Buttons

Major Pitfalls and Solutions

Pitfall One: LINE 20-character Limit Causing API Sending Errors

Pitfall Two: Batch API Asynchronous Delay and LINE Webhook's "Three-Second Timeout Survival Battle"

Pitfall Three: Gemini Batch API's Queuing and Pending Status

Results and Benefits

[I/O Extended Taipei] Building with Gemini APIs: From Calls to Autonomous Systems

Context: The Gemini API is no longer just "adding one more prompt"

First, let's look at the big picture: What's new in the 2026 Gemini API family?

Layer 1: Core Models

Layer 2: Key Capability Modules

Layer 3: System Design Approach

Architectural Turning Point: Three Tools, Three Paradigm Shifts

1. File Search: Shifting from Hand-Coded RAG to Managed RAG

Why is this File Search particularly noteworthy?

2. Agents API: Shifting from Client-Side Loop to Server-Side Managed Agent

3. Webhook: Shifting from Polling to Event-Driven

From the Perspective of a LINE Bot, How Should a Gemini Application Be Designed?

A Very Pragmatic Routing Approach

Infrastructure is Not Unimportant, But You Don't Have to Rebuild it Yourself Every Time

Three Most Valuable Practical Takeaways

1. Place a routing layer before the LLM

2. Embrace asynchronous operations; don't force long tasks into synchronous APIs

3. Redirect RAG engineering time to permissions and experience

Why is this talk worth revisiting repeatedly?

Postscript: From API User to AI System Designer

[Hands-on Gemini 3.5 Live

Brand New API Unveiled: Gemini 3.5 Live Translate

System Design and Architecture

System Architecture Flow

Core Implementation One: ScreenCaptureKit Capture and Resampling

1. Filter and Select Target App

2. Start Audio Capture Stream

Core Implementation Two: Gemini Live WebSocket Bidirectional Connection

Major Pitfalls and Solutions

Pitfall One: Gemini Live Exclusive Model Restrictions

Pitfall Two: Incorrect JSON Payload Field Structure (Hidden Differences Between Documentation and API Versions)

Pitfall Three: "Zero-Byte Silence" Caused by Multi-Channel Stereo Capture

Results and Benefits

[AI Practice] Building blazing-Fast AI Mac OS App with Antigravity CLI

Foreword: A Developer's New Collaboration Model

Phase One: Idea Generation and Architecture Design

Phase Two: Environment Configuration and Compilation Anxiety Elimination

Phase Three: Connection Troubleshooting and Audio Bug Fixes

Phase Four: Automated DevOps and GitHub Delivery

Conclusion: Development Transformation and Insights

[GCP Practical] LINE Business Card Bot

Upgrade Preamble

Optimization One: Embracing Gemini Structured Outputs

1. Defining the Name Card Schema

2. Applying to Generation Config

Optimization Two: Solving LINE Message Limit with 'Disambiguation List'

💡 Solution: Disambiguation List

Optimization Three: Contact Modification Safety Lock — Two-Stage Confirmation Mechanism

Ops Pitfall Record: Manual Deployment - The Mysterious Disappearance of Environment Variables

The Pitfall

Recovery Process

Summary and Benefits

Pitfall #1: The mysterious `GeminiError error 0` — indexing and search both fail

Pitfall #2: SwiftPM's `main` entry-point conflict and the SwiftUICore linking error

Pitfall One: Synchronous Calls → Mysterious `RESOURCE_PROJECT_INVALID`

Pitfall Two: `gsutil` in the Sandbox is a Mock (This one is the most insidious)

Pitfall Three: Cloud Run's `/healthz` is Intercepted by Google Frontend

Pitfall Four: Service Account Needs to `actAs` Itself for Cloud Tasks OIDC to Sign

1. ⚡ Extreme YOLO: `--dangerously-skip-permissions` Parameter

2. 🛡️ Moderate Control: `/permissions` Fine-grained Settings

🧠 2. Change Brains at Any Time: `/model` Secret

Refactored Main Webhook Logic (`handle_smart_query`)

Pitfall 2: GCP's Default `GOOGLE_CLOUD_LOCATION` and Region 404