DEV Community: Karl Weinmeister

The future of autonomous software maintenance: a dependency update agent

Karl Weinmeister — Thu, 09 Jul 2026 21:14:10 +0000

Most AI tools focus on writing new code, but up to 80% of the cost of software comes from maintenance. What if an intelligent agent could handle that burden for you?

In this article, I’ll walk through an autonomous agent called dependency-director that handles one of the most common maintenance scenarios: dependency updates. It goes beyond merging dependency updates that are passing tests; it upgrades your code to match changes in the dependency. This provides a sneak peek into the future of automated engineering workflows.

Note: dependency-director is an experimental proof of concept and is not intended for production environments. Although it uses security guardrails like sandboxing and command filters, running LLM-generated code dynamically always involves security trade-offs. If you experiment with the code on GitHub, please do so in an isolated testing environment, start with the --dry-run flag, and use highly restricted API tokens.

Loop engineering for software updates

A continuous triage loop is at the center of the process. The agent scans repositories and maps every open pull request to a state and triggers an action.

When a dependency update breaks a test suite, the agent automatically clones the repository and updates the code to align with the updated dependency.

Dependency Director runs on the Google Antigravity SDK, with Gemini taking care of the reasoning. You can configure the agent like this (full version in main.py):

config = LocalAgentConfig(
    model="gemini-3.5-flash",
    system_instructions=system_instructions,
    policies=policies,
    tools=agent_tools, # 12 GitHub API tools, plus a sandboxed shell
    workspaces=[project_root, workspace_tmp],
    capabilities=types.CapabilitiesConfig(
        enable_subagents=False,
        disabled_tools=[types.BuiltinTools.RUN_COMMAND],
    ),
)

Two lines do most of the security work. disabled_tools=[types.BuiltinTools.RUN_COMMAND] turns off the SDK’s built-in shell, so the only way the agent can run a command is through my sandboxed wrapper. And workspaces pins file operations to the project directory and a per-repo temp workspace, so the agent can’t wander through the filesystem.

Guardrails live in the tools, not the prompt

A common mistake in agent design is writing rules into the system prompt and hoping the model follows them. All important constraints are enforced in the tool functions themselves.

For example, the author of the pull request must be one of the allowed bots. The defaults are dependabot[bot] and renovate[bot].

def _check_bot_author(author: str, bots: list[BotConfig]) -> BotConfig:
    allowed = {b.author for b in bots}
    if author not in allowed:
        raise PermissionError(
            f"Security Block: Only pull requests authored by {allowed} "
            f"can be processed. This PR was authored by '{author}'."
        )
    return next(b for b in bots if b.author == author)

There’s also a --dry-run parameter, so that you can simulate the behavior locally, without updating the pull request.

async def merge_bot_pr(owner: str, repo: str, pr_number: int) -> str:
    author = await client.get_pr_author(owner, repo, pr_number)
    _check_bot_author(author, bots)
    if dry_run:
        return f"[DRY-RUN] Would have merged PR #{pr_number} in {owner}/{repo}."

All fixes are held up as an open PR for your review by default, unless you use --automerge.

Patching a red pull request

When a PR has a conflict, the agent follows a procedure in its system instructions located in instructions.py. It uses the code-review-and-quality skill from Addy Osmani to go beyond just making the test pass, and ensure a code review and quality standards are enforced:

RED: Retrieve logs via 'get_pr_workflow_run_logs'. Clone into a subdirectory,
then check out the PR branch:
  git fetch origin pull/<pr_number>/head:pr-<pr_number> && git checkout pr-<pr_number>
Otherwise install deps, test, fix, verify, and push to the remote branch.
Max 3 fix attempts per RED PR before skipping.
Run 'code-review-and-quality' self-review before committing.

The agent is also very clear about what counts as a fix. It’s told to consider what’s changed in the API documentation, and not just chase a green checkmark:

- Fix root causes (types, signatures, API changes) using changelogs.
- NEVER suppress errors (type: ignore, noqa, Any) unless upstream bugs
  leave no alternative (require comment).

Running untrusted code in a sandbox

To restrict filesystem and network operations, Dependency Director uses sandbox-runtime. It’s an open-source OS-level sandbox. Every shell command the agent runs is wrapped in it:

process = await asyncio.create_subprocess_exec(
    "srt", "--settings", active_config_path, "-c", command_line,
    cwd=target_cwd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, env=env,
)

You can configure the settings file where the sandbox is configured. The default settings allow network access to package registry domains such as PyPI, npm, crates.io. The agent is blocked from reading the home directory and writing to the .env environment variable file and .git/hooks. Here’s the bundled srt-settings.json, trimmed to a few representative entries:

{
  "network": {
    "allowedDomains": ["github.com", "pypi.org", "registry.npmjs.org"],
    "deniedDomains": ["*.ngrok.io", "*.webhook.site", "*.requestbin.com"]
  },
  "filesystem": {
    "denyRead": ["~"],
    "allowWrite": ["/tmp"],
    "denyWrite": [".env", ".git/hooks"],
    "allowGitConfig": false
  }
}

Trying it out yourself

Solid test coverage is essential for updating dependencies. Tests give you the confidence to know that your code still functions. When you use Dependency Director, they’re absolutely required so that the agent can safely patch your code to work with the updated dependencies. For tips on effective tests, see my blog post on five essential testing patterns.

Using an AI agent does consume tokens, of course. I performed a trial run on a repo with 6 open pull requests using gemini-3.5-flash. It used about 200k tokens. At the current published rate, that’s about $0.30 USD for the run, or roughly 6 cents per PR.

The code to try out dependency-director is on GitHub for you to try. Start with --dry-run to see how it works for you!

What maintenance work would you hand to an agent first? Tell me on X, LinkedIn, or Bluesky.

Google Antigravity SDK: The developer guide

Karl Weinmeister — Tue, 09 Jun 2026 15:01:34 +0000

The Google Antigravity SDK is a Python framework for building and running autonomous agents. It decouples your agent’s logic from where it runs, letting you focus on what the agent does while the SDK manages execution and state.

The Python SDK interfaces with a bundled Go harness over WebSockets. The local Go harness runs the core agentic loop and manages sandboxed tool execution. Python acts as the control plane where you configure tools, safety policies, and lifecycle hooks.

This guide outlines the SDK’s architecture one layer at a time, referencing the official source repository. Note that the SDK is currently pre-v1.0 and subject to change.

Where Antigravity fits in Google’s AI stack

Google’s AI stack offers multiple levels of abstraction for building with Gemini. Choosing the right one depends on how much control you need over the execution loop.

The Gemini API is stateless. You make an API call and get a response. You manage the entire loop.
The Agent Development Kit sits one level up. With the ADK, you design the event loops, pick the foundation models, and control how agents route messages to each other.
The Antigravity SDK is a pre-packaged runtime tightly integrated with Gemini. You don’t build the agentic loop; you’re given one. Your role is to govern it.

Getting started

Install the package with pip install google-antigravity, ensuring that GEMINI_API_KEY is set in your environment. Then you’re ready to build your first agent!

import asyncio
from google.antigravity import Agent, LocalAgentConfig

# 1. Define a tool function with a descriptive docstring
async def get_weather(location: str) -> str:
    """Gets the current weather for a location."""
    return f"The weather in {location} is sunny, 72°F."

async def main():
    # 2. Register the tool in the agent configuration
    config = LocalAgentConfig(
        system_instructions="You are a helpful weather assistant.",
        tools=[get_weather]
    )

    # 3. Initialize the agent and query it
    async with Agent(config) as agent:
        response = await agent.chat("How's the weather in San Diego?")
        async for token in response:
            print(token, end="", flush=True)

asyncio.run(main())

What’s happening here? The Agent context manager starts the Go harness, establishes a WebSocket connection, and registers the get_weather function as an available tool. The model automatically decides when to invoke it based on the user’s prompt. When the async with block exits, the harness shuts down and all connections are closed.

The three-layer architecture

The SDK separates concerns into three layers, each with a distinct responsibility.

Layer 1: Agent and LocalAgentConfig. The high-level entry point. Manages configuration, session lifecycle, tool wiring, hooks, and triggers. This is where you spend most of your time.

Layer 2: Conversation. The stateful session manager. Wraps the connection and handles message history accumulation, context window compaction, and token usage tracking (including Gemini’s “thinking tokens”).

Layer 3: Connection and ConnectionStrategy. The transport abstraction. For local development, LocalConnection communicates via WebSockets with the Go harness. This layer is what makes it possible to eventually swap in remote backends without changing your application code.

Now let’s look at what you can build on top of those three layers.

Tools and MCP

Built-in tools

The Go harness ships with optimized native tools for standard OS interactions: view_file, edit_file, create_file, list_directory, search_directory, run_command, and generate_image. These run inside the harness process, not in Python, so they’re fast and sandboxed.

Custom Python tools

If you need the agent to call your business logic, you write a standard Python function. The SDK’s ToolRunner uses reflection to inspect type hints and parse docstrings, generating the Gemini FunctionDeclaration automatically.

async def lookup_customer_tier(email: str) -> str:
    """Looks up a customer's subscription tier.

    Args:
        email: The customer's registered email address.
    """
    tier = await db.query(email)
    return f"The customer is on the {tier} plan."

config = LocalAgentConfig(tools=[lookup_customer_tier])

ToolContext for stateful tools

Sometimes a tool needs to remember things across invocations in the same conversation, like a pagination cursor or a running counter. Passing that state through the LLM wastes tokens and bloats the context window.

The SDK provides ToolContext, a conversation-scoped key-value store. Add ctx: ToolContext to your function signature and the SDK injects it automatically. The model never sees the parameter.

from google.antigravity.tools.tool_context import ToolContext

def process_logs(batch_size: int, ctx: ToolContext) -> str:
    """Processes the next batch of server logs."""
    cursor = ctx.get_state("log_cursor", 0)
    logs = fetch_logs(offset=cursor, limit=batch_size)
    ctx.set_state("log_cursor", cursor + batch_size)
    return logs

MCP integration

The SDK has native support for the Model Context Protocol using both Stdio transport and Streamable HTTP. Point your agent at an MCP server and it for access to its exposed tools.

Because MCP tools are integrated at the ToolRunner level, they’re governed by the exact same safety policies as built-in and custom tools.

Lifecycle hooks

The SDK treats agent lifecycles through composable middleware using hooks.

A common security flaw in custom agent frameworks is the Time-Of-Check to Time-Of-Use, or TOCTOU, vulnerability. A security hook approves a tool call’s arguments, then a subsequent middleware mutates those arguments before execution. Antigravity prevents this by categorizing hooks into three archetypes, enforced by the type system.

Decide hooks are read-only and blocking. They inspect incoming data (like a pending tool call) and return HookResult(allow=True/False). They can’t modify the payload. If any Decide hook denies, execution short-circuits. Example: PreToolCallDecideHook.

Inspect hooks are read-only and non-blocking. They receive data after an event and run concurrently. They can’t block the main flow. Example: PostToolCallHook (writing tool outputs to external systems).

Transform hooks are modifying and blocking. They receive data, mutate it, and pass the transformed payload back. Example: OnToolErrorHook.

The OnToolErrorHook is particularly useful. When a tool throws an exception, instead of crashing the entire loop or dumping a raw Python traceback into the model’s context, you intercept the error and feed strategic recovery guidance:

from typing import Optional
from google.antigravity.hooks import hooks

class FallbackHook(hooks.OnToolErrorHook):
    """Intercepts tool errors and returns recovery guidance."""

    async def run(self, context: hooks.HookContext, data: Exception) -> Optional[str]:
        if isinstance(data, ValueError):
            return (
                "[System: Invalid parameters. "
                "Try 'search_directory' to find the correct ID.]"
            )
        return None

config = LocalAgentConfig(hooks=[FallbackHook()])

You can stack these hook types together to build a middleware pipeline. For example, you could include rate-limiting via Decide hooks, audit logging via Inspect hooks, and crash recovery via Transform hooks.

Safety policies

Giving an autonomous agent access to your system requires guardrails. The SDK employs a declarative, priority-based policy engine that evaluates every single action at the runtime hook level.

Out of the box, the SDK takes a strict security stance. If you spin up an agent with zero configuration, it defaults to confirm_run_command(): the agent can read and write files, but shell execution requires explicit approval.

Policies evaluate top-down using a priority model. You configure rules with policy.allow(), policy.deny(), and policy.ask_user().

from google.antigravity import Agent, LocalAgentConfig
from google.antigravity.hooks import policy

policies = [
    # Block dangerous arguments instantly
    policy.deny(
        "run_command",
        when=lambda args: "rm " in args.get("CommandLine", "")
    ),
    # Ask the human for any other shell command
    policy.ask_user("run_command", handler=my_cli_prompt_function),
    # Allow safe tools silently
    policy.allow("view_file"),
    # Deny everything else
    policy.deny("*")
]

config = LocalAgentConfig(policies=policies)

Human-in-the-loop

The policy.ask_user() builder pauses the execution loop, invokes your custom handler, and waits for approval before continuing.

Disabling vs. denying

There’s an important distinction between disabling vs denying tools. CapabilitiesConfig.disabled_tools physically removes a tool’s JSON Schema from the context window before sending the prompt to Gemini. The model doesn’t know the tool exists, and you save input tokens. policy.deny() keeps the tool visible but blocks it at runtime. The model attempts to use it, gets an error message, and learns why it was blocked. It costs tokens for the failed attempt, but enables dynamic, argument-based restrictions and lets the model adapt.

Background triggers

True autonomous systems monitor their environment and alert you proactively. The SDK’s triggers are long-lived async tasks that run alongside the agent session, reacting to external events.

When you start an Agent context, the TriggerRunner spawns a separate asyncio task for each registered trigger. A crashing trigger won’t take down the agent. A busy agent won’t block the triggers.

Each trigger receives a TriggerContext. When it notices something in the outside world, it calls ctx.send(“Message”) to inject a notification into the agent’s conversation history. The agent reacts as if the user had typed it.

from google.antigravity import Agent, LocalAgentConfig
from google.antigravity.triggers import every, TriggerContext

async def monitor_queue(ctx: TriggerContext):
    tickets = await fetch_pagerduty_alerts()
    if tickets:
        await ctx.send(f"[System Alert]: {len(tickets)} new P0 alerts detected.")

config = LocalAgentConfig(triggers=[every(60, monitor_queue)])

The SDK also ships triggers.on_file_change() for OS-level file watching (great for local coding assistants) and @triggers.trigger for custom async listeners like GitHub webhook receivers.

Streaming and thoughts

When an agent is executing a multi-step task, waiting for a final output can make the application feel frozen.

The SDK addresses this by streaming execution events in real time. Instead of blocking, await agent.chat() immediately returns a ChatResponse object. This object acts as a shared, memory-cached buffer.

Unlike standard Python generators, which are exhausted once read, ChatResponse lets you attach multiple independent cursors to the same stream. This allows you to route different aspects of the same agent turn concurrently:

Main text stream (e.g., rendering markdown chunks to your frontend UI)
Chain-of-thought stream (e.g., logging the agent’s internal reasoning to a developer console)
Tool-call stream (e.g., displaying a live status widget as the agent invokes tools)

response = await agent.chat("Write a short story.")

# Stream raw text tokens
async for token in response:
    print(token, end="", flush=True)

The response.thoughts stream exposes the model’s Chain-of-Thought reasoning in real-time. Token costs are tracked with response.usage_metadata.thoughts_token_count.

The response.tool_calls stream yields strongly-typed ToolCall objects as soon as the agent dispatches them, so your UI can render updates instantly.

Subagents

One of the most common pitfalls in autonomous agents is context window bloat. The SDK solves this through hierarchical delegation.

Instead of doing all the work in a single thread, the main agent invokes the built-in start_subagent tool. This prompts the harness to spin up a fresh agent session with a clean context window to handle the subtask in isolation. The subagent works through the problem using its own tools and MCP servers, then shuts down. It returns only a synthesized summary of its findings, keeping the main agent’s context window clean and focused on high-level orchestration.

from google.antigravity import Agent, LocalAgentConfig

config = LocalAgentConfig(
    system_instructions="You are a lead developer. Delegate heavy research to subagents."
)

async with Agent(config) as agent:
    prompt = (
        "Use a subagent to research the /docs directory and "
        "write a synthesized lesson plan based on what it finds."
    )
    response = await agent.chat(prompt)
    print(await response.text())

To prevent privilege escalation, safety policies and hooks cascade hierarchically. If the main agent is restricted from running terminal commands, those same restrictions automatically apply to any subagents it spawns. You can also intercept and inspect subagent lifecycles using the same hook middleware (PreToolCallDecideHook and PostToolCallHook) that governs regular tool calls.

What will you build?

Building an agent loop is relatively straightforward, but securing and monitoring it in production is where challenges typically begin. The Antigravity SDK bridges this gap by decoupling your agent’s logic from its execution environment.

To get started, review the SDK overview docs and clone the source repository. Then try out one of the examples.

Stay tuned for the next agent I’ll build with the Antigravity SDK! Share with me what you’re building on X, LinkedIn, or Bluesky.

A Practical Guide to Evaluating Multi-Turn Agent Trajectories

Karl Weinmeister — Wed, 27 May 2026 17:33:04 +0000

Would you let an AI agent run in your terminal for hours, executing hundreds of tools, without being able to see what it is doing under the hood?

Harnesses like Antigravity or Claude Code can run for hours without intervention. If you’re driving one of these systems, you’re in the driver’s seat: you pick the base model, configure the harness, add skills, and plug in MCP servers.

But how do you know if your configuration is working? Response-based grading won’t cut it. You need trajectory-level evaluation that goes beyond analyzing just the final answer. This post walks through a telemetry-driven framework for measuring multi-turn agent systems.

Evaluating a Trajectory

Standard LLM evaluation grades a single response to a single prompt, assessing factors like factual correctness and semantic relevance.

Evaluating only the final output limits what you can see. An agent can stumble into the right answer despite a broken intermediate plan. Likewise, a minor formatting bug at the very end can mask an otherwise successful run.

To evaluate an agent, you need to examine the entire trajectory: every prompt, thought, tool call, and state change across dozens of turns.

Take a typical GKE cluster deployment trajectory. It chains multiple steps from gcloud to kubectl commands beforereporting success:

The compounding decay of sequential decision-making explains why this matters. If your agent completes ksteps, overall success probability is the product of each step’s reliability. If each step is 95% reliable, look at how quickly your overall success rate drops as the number of steps increases:

This is why performing well in single-turn demos doesn’t necessarily translate into real-world success. One failure can derail your agent or get it stuck in a loop.

Public benchmarks such as GAIA and SWE-bench measure general capabilities, and may not correlate with performance for your use case. Another pitfall is that researchers have shown that agents can game these benchmarks without actually solving the tasks. A well-defined, customized evaluation trajectory can address these challenges.

Optimization Metrics

When tuning your agentic stack, tracking success rates with resource consumption is critical.

Your agent’s success rate is the cornerstone, but binary pass/fail is insufficient. You need structured milestones that provide signal even when a run fails part of the way through. When a run fails, you award partial credit for the furthest milestone reached. That gradient gives you an actionable signal for prompt and tool adjustments.

Long-running agents are token amplifiers. One prompt can trigger dozens of sequential LLM calls. Because the full conversation history is re-sent each turn, input token usage can grow quadratically. That quadratic growth drives both cost and latency. Monitoring factors like cache hit ratios and total step counts is essential for determining whether your agent is production-viable.

Duration , considered in clock time or number of steps, also figures into the cost analysis. A cheaper model may enter endless execution loops or fail entirely, forcing developer intervention. Meanwhile, a premium model that completes the task in fewer turns can cost less overall.

Tracing with OpenTelemetry

When high-level metrics slip, you need to know why. OpenTelemetry exports from your agent harness to capture standardized traces, metrics, and logs for every step.

At the interaction level, you can log prompt length and turn counters. This is where you monitor plan coherence and detect loops. You can also verify result utilization (does the agent actually use the tool’s output in its next planning step, or does it ignore it and rely on hallucinated memories?). Be careful about enabling OTEL_LOG_USER_PROMPTS without privacy controls in place.

At the LLM request level, capture cache tokens to compute cost and model efficiency. This is where you verify tool selection accuracy: is the agent choosing the right tool? Or is it running a broad web search when it should be running a local database query?

At the tool level, log structural parameters like the MCP server name and argument payloads, pinpointing exactly which tool call failed. This is where you validate argument extraction (do the arguments match the target JSON schema?) and track active error recovery. When a tool fails, does the agent gracefully recover and try an alternative, or does it just crash?

Optimizing your API Spend

Long agent sessions are going to accumulate cost quickly. Combining the right strategies can reduce API spend significantly, while maintaining or even improving reliability.

Model routing analyzes prompt complexity to route routine tasks to cheaper models. Context compaction uses verbatim compaction to remove low-signal lines from history, preserving exact code signatures and error codes. Prompt caching natively caches stable prefixes (such as system rules and tool schemas) at the model level.

One catch to keep in mind: prefix caching is sequential. Any change in the middle of your prompt invalidates the cache for everything that follows. How can you address this?

First, relocate dynamic data to the tail by moving system messages like progress tracking to the end of your prompt. Wrapping them in custom XML tags like <system-reminder> ensures the model still parses them correctly, isolating churn to the tail and leaving your large static system prompt and tool definitions fully cached.

Second, sort your tool definitions alphabetically. This keeps tool schemas and subagent configurations byte-identical across runs, allowing different user sessions to share the same cache prefix.

Finally, freeze the clock by avoiding real-time clock injections. Freezing the datetime at task start (e.g., “Thursday, April 3, 2026”) ensures that minor clock ticks do not bust the cache.

Managing Context Bloat

In long-running tasks, verbose terminal logs and large file dumps clog the context window. This bloat drives up API costs and degrades model quality because the current goal gets buried in stale noise. Keeping an agent stable over multi-hour runs requires active context management.

Left unmanaged, token usage grows linearly until you hit the context wall. You can tackle this at two levels: the harness and the agent architecture.

Harness level compaction

Modern harnesses implement auto-compaction to protect the context window. They don’t just drop the oldest messages. Instead, they use strategies like truncating massive stdout payloads, stripping out verbatim file dumps once they are no longer needed, and replacing long conversational exchanges with AI-generated summaries. For example, Claude Code triggers compaction at 95% context usage by default, but you can tune this using the CLAUDE_AUTOCOMPACT_PCT_OVERRIDE parameter. Antigravity uses a similar approach, focusing on preserving the most recent working state while compressing the history.

Focusing your agent

While harness compaction is a great safety net, you can build active compression directly into your agent’s toolset using a focus agent pattern.

You can expose two tools to the LLM to “start focus” and “complete focus.” When your agent is about to start a complex sub-task, it can call start focus to checkpoint the context. Then, once it finishes, it can call complete focus to summarize what it learned. The harness can then prune all of the messy details like logs from the context that is no longer needed.

This creates a sawtooth pattern in your token usage. You keep the context window focused on the active problem instead of dragging the entire history of the session along for the ride:

Applying what you’ve learned

Building agents that run for hours requires a different playbook than single-turn LLM development. Performance decays over long sequences, so trace-level telemetry is how you see what’s actually happening. Alphabetical tool sorting, relocating dynamic data, context compaction, and model routing together can cut API costs.

With the right evaluation strategy in place, you can deploy agents confidently in production. I’d love to compare notes on how you’re measuring your stack. Find me on LinkedIn, X, or Bluesky.

On-Device AI with the Google AI Edge Gallery and Gemma 4

Karl Weinmeister — Mon, 06 Apr 2026 21:40:03 +0000

Until recently, running an LLM on your phone meant one thing: chat. You could have a conversation or maybe summarize some text. You were back to the cloud the moment you needed the model to do something more.

The Google AI Edge Gallery app, updated with the release of the Gemma 4 open-weight model family, shows what’s now possible. It can generate structured code and control device settings with natural language, all running offline on your phone. This post covers the Gallery’s key features, walks through building a custom Agent Skill, and shows how to transition to Google Cloud when you’re ready to try larger model variants.

Gemma 4 for Edge AI

Let’s start with a brief introduction to Gemma 4, and how it makes agentic AI at the edge possible.

The Gemma 4 family includes two edge-optimized variants that the Gallery app runs natively: Gemma 4 E2B (Effective 2 Billion parameters) and Gemma 4 E4B (Effective 4 Billion). “Effective” is the keyword: these models use a per-layer embedding architecture that keeps memory footprints tiny, while punching well above their weight class in reasoning benchmarks. All of the Gemma 4 models are fully open-weight, shipping under the Apache 2.0 license.

What makes these models useful beyond chat is a combination of three capabilities. First, they’ve been fine-tuned for structured output. Given a tool schema, they reliably emit parsable JSON. Second, a 128K context window, accelerated locally via LiteRT-LM, gives the model enough memory to handle long conversations and multi-step interactions without losing track of earlier context. Third, multimodal vision lets E2B and E4B process images and output bounding box coordinates for UI elements, opening the door to screen-aware applications.

The Google AI Edge Gallery

The Google AI Edge Gallery is an open-source app designed to showcase what on-device generative AI can actually do. It’s available right now on both major mobile platforms:

Once installed, you can download Gemma 4 E2B or E4B models directly within the app from Hugging Face and see what a fully offline LLM can do on your hardware. The app is entirely open-source (Kotlin on Android, Swift on iOS), so you can study the implementation, fork it, or use it as a reference for integrating LiteRT-LM into your own mobile apps.

If you want to build function calling into your own Android app, the repo’s Function Calling Guide walks through the Kotlin patterns for cloning the Gallery, defining custom ActionType enums, annotating tools with @Tool and @ToolParam, and wiring up performAction handlers. iOS developers can reference the same architectural patterns with the open-source Swift implementation.

Google AI Edge Gallery UI on iOS

Prompt Lab

The Prompt Lab gives you single-turn prompt execution with granular control over temperature, top-k, and other generation parameters. It ships with several task templates: Freeform Prompt, Summarize Text, Rewrite Tone, and Code Snippet.

To try it out, select Code Snippet, choose Python, and type: “Print the numbers 1 through 10.” The model generates working code on-device:

for i in range(1, 11):
    print(i)

That’s a trivial example, but the point is what’s happening underneath: the model parsed a natural language instruction, selected the correct language target, and emitted structured, executable output. Swap the prompt for something harder (“Write a function that fetches JSON from a URL and retries with exponential backoff”) and you’ll see the same pattern hold up.

Prompt Lab UI on iOS

Agent Skills

The Agent Skills feature is where things get interesting. Skills are modular tool packages: each one gives the model a new capability without bloating the system prompt with instructions it doesn’t need for the current task.

Each skill is defined by a SKILL.md file containing metadata and instructions. The LLM reviews available skill names and descriptions appended to its system prompt, and if a user’s request aligns with a skill, it invokes it automatically. Built-in skills include Wikipedia lookups, interactive maps, QR code generation, and mood tracking. You can load custom skills three ways: from the community-featured gallery, via a URL, or by importing from a local file.

For developers who want to build their own skills, the architecture supports two execution paths: JavaScript skills (custom logic running inside a hidden webview, with full access to the web ecosystem including fetch(), CDN libraries, and even WebAssembly) and Native App Intents (leveraging built-in OS capabilities — currently sending email and text messages out of the box, with the ability to add more by extending the app’s source code).

Agent Skills UI on iOS

Mobile Actions and Beyond

The Gallery also includes Mobile Actions, a feature powered by a fine-tuned FunctionGemma 270M model, that demonstrates offline device controls. These include toggling the flashlight, adjusting volume, or launching apps, all triggered by natural language.

Other workspaces include AI Chat with Thinking Mode (multi-turn conversations where you can toggle the model’s step-by-step reasoning visualization, currently supported for the Gemma 4 family), Ask Image (multimodal object recognition and visual Q&A using your camera or photo gallery), Audio Scribe (on-device voice transcription and translation), and Model Management & Benchmark for profiling how each model performs on your specific hardware.

For a full walkthrough of every feature, check the Project Wiki.

Mobile Actions UI on iOS

Scaling to the Cloud

The Edge Gallery shows you what Gemma 4 can do at the edge. When you’re ready for more power, every model in the Gemma 4 family shares the same chat template, tokenizer, and function-calling format. The prompts and skills you develop locally will work the same way with a larger Gemma 4 model running in the cloud.

Google Cloud provides an official guide for deploying Gemma 4 on Cloud Run using a prebuilt vLLM container with GPU support, and Vertex AI offers managed endpoints with fine-tuning capabilities for enterprise deployments. The Agent Development Kit (ADK) provides the orchestration framework for building production agents on top of either target.

Gemma 4 in the Vertex AI Model Garden

Getting Started

On-device AI just got a lot more capable. The Google AI Edge Gallery makes it easy to see for yourself. Here’s my roadmap to get started:

Download the Google AI Edge Gallery on Android or iOS.
Try the Code Snippet template in the Prompt Lab.
Build a custom Agent Skill by following the Skills guide.
Head to the Google Cloud Console to spin up a larger Gemma 4 variant on Cloud Run or Vertex AI for your backend agent.

If you build something cool with the Google AI Edge Gallery, I’d love to hear about it. You can find me on LinkedIn, X, or Bluesky.

How to Use the Gemini Deep Research API in Production

Karl Weinmeister — Wed, 04 Mar 2026 16:08:05 +0000

How many of us have gone down the research rabbit hole? Way too many tabs, links, and notes in the pursuit of knowledge? It’s all useful stuff, but time-consuming and distracting.

Since I discovered the Gemini Deep Research Agent, I haven’t turned back. And best of all, it has a powerful and straightforward API to kick off research programmatically. Let’s explore how to use it, and the patterns for including this in a production architecture.

Async changes everything

A single research task can trigger dozens of search queries and take several minutes to complete. The asynchronous Interactions API provides a polling-based interface with a required background=True parameter to check on progress.

If you’ve ever worked with a Pub/Sub pipeline or job queue, this will feel familiar.

Meet the Interactions API

The Interactions API is a newer, unified interface for working with Gemini models and agents. It replaces the older generateContent pattern for scenarios that need state management, tool orchestration, or background execution.

You create an interaction, point it at the deep research agent, and tell it to run in the background:

from google import genai

client = genai.Client(api_key=GEMINI_API_KEY)

# Launch the research agent in the background
interaction = client.interactions.create(
    input="Research the history and future of Solid State Batteries.",
    agent='deep-research-pro-preview-12-2025',
    background=True
)

That call returns immediately with an interaction ID. The agent is now off doing its thing, autonomously planning search queries, reading pages, and iterating on its analysis. Your application is free to do whatever it needs to do in the meantime.

Polling for results

Now you need a way to check whether the agent has finished. The status field tells you everything you need to know:

while True:
    interaction = client.interactions.get(interaction.id)

    if interaction.status == "completed":
        # The full research report is ready
        print(interaction.outputs[-1].text)
        break
    elif interaction.status == "failed":
        print(f"Research failed: {interaction.error}")
        break

    # Still working. Check again in 10 seconds.
    time.sleep(10)

Taking it to production with Cloud Run

In a notebook, a while True loop gets the job done. In production, you want something that scales, recovers from failures, and doesn’t burn compute waiting. Google Cloud offers three Cloud Run compute models that each map to a different integration pattern with the Deep Research agent.

Cloud Run service: webhook-triggered research

A Cloud Run service works when you want to trigger research from an HTTP request. The service accepts the request, kicks off the agent, stores the interaction ID, and returns immediately. A separate mechanism (a Cloud Scheduler cron, a Cloud Workflow, or a callback) handles checking the results later.

from fastapi import FastAPI
from pydantic import BaseModel
from google import genai

app = FastAPI()
client = genai.Client()

class ResearchRequest(BaseModel):
    topic: str

@app.post("/research")
async def start_research(req: ResearchRequest):
    interaction = client.interactions.create(
        input=req.topic,
        agent="deep-research-pro-preview-12-2025",
        background=True,
    )

    # Store the ID for later retrieval (e.g., in Firestore or Cloud SQL)
    save_interaction_id(interaction.id, req.topic)

    return {"interaction_id": interaction.id, "status": "started"}

Cloud Run job: batch research tasks

A Cloud Run job is a natural fit for one-shot or scheduled research. Jobs execute code and stop, which maps cleanly to “launch, poll, write, exit.” If you have a batch of research topics, you can fan them out as parallel job tasks.

from google import genai
from google.cloud import storage

client = genai.Client()

def run_research_job():
    topic = os.environ.get("RESEARCH_TOPIC", "Default research topic")

    interaction = client.interactions.create(
        input=topic,
        agent="deep-research-pro-preview-12-2025",
        background=True,
    )

    # Poll until done
    while True:
        result = client.interactions.get(interaction.id)
        if result.status == "completed":
            # Write the report to Cloud Storage and exit
            bucket = storage.Client().bucket("my-research-reports")
            bucket.blob(f"{interaction.id}.md").upload_from_string(
                result.outputs[-1].text
            )
            return
        elif result.status == "failed":
            raise RuntimeError(f"Research failed: {result.error}")
        time.sleep(10)

run_research_job()

Cloud Run worker pool: continuous research dispatcher

The most interesting option for a production pipeline is a Cloud Run worker pool. Worker pools are designed for continuous, non-HTTP, pull-based background processing. They don’t need a public endpoint, they don’t autoscale by default (you bring your own logic), and they cost up to 40% less than instance-billed services.

If you’re building a system that continuously pulls research requests from a Pub/Sub subscription, dispatches them to the agent, and writes completed reports to storage, a worker pool is purpose-built for that pattern.

from google import genai
from google.cloud import pubsub_v1, storage

client = genai.Client()
subscriber = pubsub_v1.SubscriberClient()
subscription_path = "projects/my-project/subscriptions/research-requests"

def handle_message(message):
    topic = message.data.decode("utf-8")

    interaction = client.interactions.create(
        input=topic,
        agent="deep-research-pro-preview-12-2025",
        background=True,
    )

    # Poll until done, then write results
    while True:
        result = client.interactions.get(interaction.id)
        if result.status == "completed":
            bucket = storage.Client().bucket("my-research-reports")
            bucket.blob(f"{interaction.id}.md").upload_from_string(
                result.outputs[-1].text
            )
            message.ack()
            return
        elif result.status == "failed":
            message.nack() # Retry later
            return
        time.sleep(10)

# Pull messages continuously (worker pool stays alive)
streaming_pull = subscriber.subscribe(subscription_path, callback=handle_message)
streaming_pull.result()

Grounding with your own data

Web research is powerful, but sometimes you need the agent to work with private data or internal documents. The Deep Research agent supports a file search tool for exactly this. Think of it as RAG, but orchestrated automatically by the agent rather than wired up manually.

interaction = client.interactions.create(
    input="Compare our 2025 fiscal year report against current public web news.",
    agent='deep-research-pro-preview-12-2025',
    background=True,
    tools=[{
        "type": "file_search",
        "file_search_store_names": [FILE_SEARCH_STORE_NAME]
    }]
)

This is where the architecture gets interesting for enterprise use cases. The agent can combine internet research with grounded analysis of your internal documents, all within a single research task.

Stateful follow-ups

After a research task completes, you can ask follow-up questions that reference the original research context without re-running the entire workflow:

follow_up = client.interactions.create(
    input="Can you elaborate on the key findings?",
    model="gemini-3.1-pro-preview",
    previous_interaction_id=interaction.id
)

print(follow_up.outputs[-1].text)

Getting started

This Deep Research notebook walks you through the entire flow, from setting up the client to launching research tasks. For pricing details, check the Gemini API pricing page.

Ready to stop Googling and start delegating? Grab the notebook and run your first deep research task. I’d love to hear what you build with it. Come find me on LinkedIn, X, or Bluesky and share what research tasks you’re automating.

Skills Made Easy with Google Antigravity and Gemini CLI

Karl Weinmeister — Thu, 26 Feb 2026 16:52:06 +0000

When you ask an AI assistant a question, you have two choices: hope its training is current, or burn through tokens reading documentation. What if you could give your agent the right answer, right away?

That’s the power of Agent Skills. Skills are reusable packages of knowledge that extend what your agent can do without overwhelming its context window. Defined with a SKILL.md file, they allow you to teach your agent how to accomplish tasks consistently. Instead of forcing an agent to process an entire library’s worth of documentation at once, Skills act as on-demand expertise.

You can learn more about the open standard at Agent Skills and discover community capabilities at skills.sh.

In this post, we’ll explore how to manage these skills in the Gemini CLI, a powerful terminal-native AI assistant, and Antigravity, an advanced agentic coding assistant.

Installing skills

Both the Gemini CLI and Antigravity access skills by reading them from standard directories on your local machine. To add new skills, you can drop them into these locations:

Managing Skills in Gemini CLI

Gemini CLI offers built-in skill management. You can use either interactive slash commands during a session, or terminal commands:

These commands makes it easy to pull in skills from a Git repository or local directory, and manage whether they are active for your current project.

For example, if you want to install a specific skill located inside a subdirectory of a larger repository (like Firebase’s firebase-ai-logic-basics), you can use the --path flag:

gemini skills install https://github.com/firebase/agent-skills.git — path skills/firebase-ai-logic-basics

To audit which skills are currently loaded into your agent’s context, you can run:

gemini skills list

This command provides a clear overview of all discovered skills across your workspace and global environments, showing their descriptions and file locations so you know exactly what expertise your agent has access to.

Unified management with the skills tool

While Gemini CLI has robust built-in tools, what if you want to manage skills across both Gemini CLI and Antigravity simultaneously? Managing them by hand across the different ~/.gemini/skills/ and ~/.gemini/antigravity/skills/ directories can get tedious.

That’s where the open-source CLI tool from vercel-labs/skills shines. It uses a symlink approach to easily install, update, and remove skills centrally, sharing them across multiple agents without duplicating files.

Getting Started with skills

The easiest way to begin with the unified CLI is by using the add command. You can add the -a or --agent parameter for each client you’d like to add the skill to.

For example, suppose you want to equip your agent with deep knowledge of Firebase to help build full-stack apps. You could run:

npx skills add firebase/agent-skills -a gemini-cli -a antigravity

⚠️ Note that the skill will be added to the Gemini CLI even without the -a parameter, as it supports the default ~/.agents/skills global directory. The extra parameter provided here for clarity to show both clients in one command.

This installs the skill and instantly makes it available to both Gemini and Antigravity. By adding firebase/agent-skills, your agents can reliably build and deploy apps with Firebase Auth, Firestore, and more. For more details on how this skill works, read Introducing Agent Skills for Firebase.

If you’re looking for skills related to a specific technology, you can search for them directly from your terminal. For instance, if you’re building a mobile app, you might want to find capabilities related to Flutter. You can use the find command to discover relevant skills:

npx skills find flutter

This command searches the community skills registry and returns a list of matching capabilities, displaying the most popular ones first alongside their installation commands. You can quickly copy those commands to add the expertise directly to your active agents.

Keeping your agent’s context clean

It’s easy to get excited and install dozens of skills. While progressive disclosure means your agent isn’t reading the entire instruction manual for every skill on every prompt, simply loading the names, descriptions, and metadata of 50 different skills can still clutter the initial context window, leading to confusion or degraded performance.

To keep your agents focused and efficient, make sure to keep your essential skills up-to-date with your chosen tool’s update commands. More importantly, if you find you aren’t using a skill anymore, take a moment to disable or remove it (e.g., /skills disable <name> in Gemini CLI or npx skills remove <name>) to free up that precious context space.

By managing skills in Gemini CLI and Antigravity with the skills CLI, you can tailor and organize your environment to your liking. To get more hands-on experience building skills, you can try out the Agent Skills codelab.

Have you built any interesting workflows using Agent Skills? I’d love to hear how you’re extending your agents. Share what you’ve built with me on LinkedIn, X, or Bluesky!

Performance shouldn’t be an afterthought: Hardening the AI-Assisted SDLC

Karl Weinmeister — Mon, 26 Jan 2026 17:31:22 +0000

Performance shouldn’t be an afterthought: Hardening the AI-assisted SDLC

It’s amazing how quickly you can now build a working application with AI assistance. It’s even more amazing how easily you can harden your application for production. But that’s a step that’s often left out of the “vibe coding” software development lifecycle, or SDLC. I hope to change that.

Why does it matter? The impact of high latency is lost users, and the impact of excess memory usage is lost budget.

Study after study shows that your application’s latency directly correlates with user satisfaction, a key ingredient for business success. Meanwhile, your application’s memory usage impacts your Cloud infrastructure cost. For example, Cloud Run offers memory limits at various tiers ranging from 512 MiB to 32 GiB. Not to mention, if you underprovision memory, your application reliability will suffer.

In this post, I’ll walk through steps I recommend that ensure your application is hardened for production. I’ll use Google Antigravity to build an application with sample application code available on GitHub.

Discovery and Tool Selection

If you aren’t an expert in the tooling ecosystem for your application’s language, use AI to bridge the gap. Avoid guessing and ask for industry standards. For example, you can ask:

“I need to profile a Python application for both CPU execution time and memory leaks. What are the most modern, low-overhead tools available? I know about cProfile, but are there better options with visualization (like flame graphs)?”

What modern stack might your AI assistant suggest? scalene is a high-performance profiler whose standout capability is separating time spent in Python versus native code. To dig into memory details, memray can track allocations in native extensions and generate flame graphs that make it easy to spot areas for improvement. Finally, pytest-benchmark is a useful plugin that handles warm-up rounds and statistical analysis automatically.

If you’re writing code in other languages, the same strategy applies. You might discover pprof for Go, clinic.js for Node.js, and other useful tools.

Establish a Baseline

My use case is calculating the perplexity of a given text, which is helpful for AI detection and other use cases. The initial implementation started with a naïve algorithm which processes one token at a time, which isn’t uncommon when you simply ask for a solution.

for i in range(seq_len - 1):
    current_token = input_ids_int64[i]

    # 1. Construct single-token input
    inputs = { "input_ids": np.array([[current_token]]) }
    inputs.update(past_key_values)

    # 2. Run inference for just this token
    outputs = session.run(None, inputs)

Optimize for Speed

While this code works, it’s slow. With our tools selected from the research phase, we can ask our AI agent to benchmark the baseline code.

“Generate a Python script using pytest-benchmark to benchmark my perplexity function against a baseline. Create a mock dataset to simulate load.”

Once we have a benchmark, we can then ask our AI agent to optimize it:

“Profile this baseline code and suggest an optimized routine. Focus on throughput.”

A standard engineering strategy to address loop overhead is vectorization. The revised approach feeds the entire sequence to the model in one go:

def calculate_perplexity_batch(context, text):
    # 1. Encode entire text at once
    input_ids = tokenizer.encode(text)

    # 2. Single inference call for the whole sequence
    outputs = session.run(None, inputs)
    logits = outputs[0] # Shape: [1, SeqLen, Vocab]

    # 3. Vectorized loss calculation (No loops)
    # ... numpy vector operations ...
    return float(np.exp(mean_nll))

In my test environment, this change led to an overall 2.5x speed improvement over the naïve loop.

Optimize Memory Usage

Unfortunately, this speed came at a cost. By loading all logits for the entire sequence into memory at once, I created an unbounded memory situation. Long documents would cause peak memory usage to spike uncontrollably. I had solved for latency, but in doing so, I had broken cost constraints.

How could I prompt Antigravity to help?

“Analyze my optimized perplexity routine. The target environment is Google Cloud Run with a strict 2GB memory limit. Identify the peak memory usage and refactor the code to stay under this limit without reverting to the slow loop.”

The solution balanced speed and memory, processing data in batches large enough to achieve high throughput but small enough to manage peak memory:

chunk_size = 128
logits_list = []

for i in range(len(input_ids) - 1):
    append_tokens(input_ids[i : i + 1])
    logits_list.append(get_logits()[0, 0, :])

    if len(logits_list) >= chunk_size:
        # Process this chunk
        _process_logits_chunk(logits_list, targets_list)

        # Free memory immediately to clip the peak
        logits_list = []

Final Thoughts

Before unleashing this process across your codebase, let’s be clear that performance engineering is a rigorous discipline that goes beyond optimizing functions. Industry veteran Brendan Gregg famously warns against the Streetlight Anti-Method: looking for performance problems where it’s easiest, rather than where the problems actually exist.

Providing your AI assistant the broader context of your application is key, and it’s easy to overlook important details in your prompting. An AI assistant doesn’t know that your production workload is 10 million rows, not the 100 rows in your test script. It can’t see that your database is missing an index or that your network bandwidth is saturated. Most importantly, an AI assistant doesn’t know your intent. If you steer it towards speeding up a query, it will focus on what you asked for, but it likely won’t ask why that data isn’t cached in the first place.

With those considerations in mind, using AI as a final check is a low-risk, high-reward step. It takes minutes and often catches low-hanging fruit that is overlooked. Then, the next step is maintaining your application’s performance. Consider leveraging tools for continuous application monitoring to identify regressions and ensure reliability in a live environment.

I’d love to hear how you’re innovating with your software development lifecycle. Connect with me on LinkedIn, X, or Bluesky!

AI Agent Engineering in Go with the Google ADK

Karl Weinmeister — Tue, 20 Jan 2026 16:41:44 +0000

While Python remains popular for model training and research, the requirements for serving and orchestrating AI agents align closely with Go’s strengths: low latency, high concurrency, and type safety.

Transitioning from a prototype to a production agent introduces engineering challenges that Golang can handle exceptionally well. Go’s static typing eliminates runtime errors when parsing structured LLM outputs. Its lightweight goroutines, which start with just a few kilobytes of stack memory, allow agents to handle thousands of concurrent tool executions without the overhead of heavy thread management.

In recent years, Go’s adoption for cloud-native microservices has surged: it showed the fourth-highest promise for languages, and maintained a 93% satisfaction rate. Google’s Agent Development Kit, or ADK, bridges the gap between these architectural advantages and generative AI.

In this guide, I’ll walk through scaffolding a new project and deploying it as a secure microservice on Google Cloud.

Get Started with the Agent Starter Pack

The good news is you don’t need to start from scratch. The Agent Starter Pack is a CLI tool that scaffolds a production-ready folder structure, including CI/CD pipelines, infrastructure configuration, and boilerplate code.

To get started, just run the create command with uvx:

uvx agent-starter-pack create

The CLI guides you through an interactive setup. For this project, I selected:

Project Name: my-first-go-agent
Template: Option 6 (Go ADK, Simple ReAct agent)
CI/CD: Option 3 (GitHub Actions)
Region: us-central1

Agent Starter Pack CLI

The tool automatically authenticates with Google Cloud, enables the necessary Vertex AI APIs, and configures your local environment. Once you see the green Success! message, you’re good to go.

Web User Interface

One of the most convenient features of the ADK is the ability to visually debug your agent before deploying it. By running the command make install && make playground, you launch a local development server with a built-in UI. Yes, it has a chat window, but it goes way beyond that by tracing events, tool calls, and more.

In the screenshot below, I’m interacting with the newly created agent. The agent is configured with a ReAct (Reasoning and Acting) pattern — a framework introduced by Yao et al. in 2022 that has become foundational in agentic AI. The ReAct pattern’s continuous loop of “Thought,” “Action,” and “Observation” enhances problem-solving and interpretability, making the agent’s decision-making process transparent. It recognized the intent, invoked the get_weather tool, and returned the structured data (“It’s sunny and 72°F”).

Agent Development Kit web user interface

Understanding the Code

Now that we’ve seen the agent in action, let’s look at the Go code that makes this work. The logic lives in agent/agent.go. This file handles tool definitions, model configuration, and initialization.

The ADK uses standard Go structs to define how the Large Language Model (LLM) interacts with your code. For example, to define the input parameters for our weather tool, we simply define a struct with json and jsonschema tags:

type GetWeatherArgs struct {
    City string `json:"city" jsonschema:"City name to get weather for"`
}

GetWeatherResult defines the structure of the data returned to the agent after the tool executes.

type GetWeatherResult struct {
 Weather string `json:"weather"`
}

GetWeather is a standard Golang function that accepts tool.Context and the arguments struct, performing the business logic and returning the result struct.

func GetWeather(_ tool.Context, args GetWeatherArgs) (GetWeatherResult, error) {
 return GetWeatherResult{
  Weather: "It's sunny and 72°F in " + args.City,
 }, nil
}

The NewRootAgent function is responsible for assembling and returning the agent.Agent instance that the application launcher requires. It begins by initializing the model configuration, creating a gemini-2.5-flash model instance backed by genai.BackendVertexAI.

Next, it bridges the gap between Go code and the LLM by wrapping the local GetWeather function into a [functiontool](https://pkg.go.dev/google.golang.org/adk/tool/functiontool). This step registers the tool with the name get\_weather and provides the necessary description for the model’s context. Finally, it constructs the agent using llmagent.New, which combines the initialized Gemini model, the system instructions that define the agent’s behavior, and the slice of available tools into a single unit.

The NewRootAgent looks like this (with some error-handling removed):

func NewRootAgent(ctx context.Context) (agent.Agent, error) {
 model, err := gemini.NewModel(ctx, "gemini-2.5-flash", &genai.ClientConfig{
  Backend: genai.BackendVertexAI,
 })

 weatherTool, err := functiontool.New(functiontool.Config{
  Name: "get_weather",
  Description: "Get the current weather for a city.",
 }, GetWeather)

 rootAgent, err := llmagent.New(llmagent.Config{
  Name: "my-first-go-agent",
  Model: model,
  Description: "A helpful AI assistant.",
  Instruction: "You are a helpful AI assistant designed to provide accurate and useful information.",
  Tools: []tool.Tool{weatherTool},
 })

Testing

The project contains both unit tests for internal logic, and end-to-end tests for server integration.

In agent/agent\_test.go, the GetWeather function is called with a suite of test cases, and verifies that the output string matches its expectations.

func TestGetWeather(t *testing.T) {
 // tests struct initialized with "San Francisco" and "New York"

 for _, tt := range tests {
  t.Run(tt.name, func(t *testing.T) {
   // Pass nil for tool.Context since GetWeather doesn't use it
   result, err := GetWeather(nil, GetWeatherArgs{City: tt.city})
   if err != nil {
    t.Fatalf("GetWeather() error = %v", err)
   }
   if !strings.Contains(result.Weather, tt.wantCity) {
    t.Errorf("GetWeather() = %v, want city %v in response", result.Weather, tt.wantCity)
   }
  })
 }
}

The end-to-end tests verify that the agent works correctly when running as a server, specifically checking that A2A or Agent-to-Agent protocol support is working correctly. The E2E tests start a real instance of the server, sending HTTP requests to it, and check the responses. Here’s a snippet from e2e/integration/server\_e2e\_test.go:

func TestA2AMessageSend(t *testing.T) {
    if testing.Short() { t.Skip("Skipping E2E test in short mode") }

    // Start server (local variable to avoid race conditions)
    t.Log("Starting server process")
    serverProcess := startServer(t)
    defer stopServer(t, serverProcess)

    if !waitForServer(t, 90*time.Second) {
   t.Fatal("Server failed to start")
}
t.Log("Server process started")

You can run all tests with make test:

make test                      
go test -v ./agent/... ./e2e/...
=== RUN TestGetWeather
=== RUN TestGetWeather/San_Francisco
=== RUN TestGetWeather/New_York
--- PASS: TestGetWeather (0.00s)
    --- PASS: TestGetWeather/San_Francisco (0.00s)
    --- PASS: TestGetWeather/New_York (0.00s)
PASS
ok my-first-go-agent/agent 0.218s

Deployment

The make deploy command automatically builds your application from source using Google Cloud Buildpacks, triggered by the --source . flag. It deploys this image to Cloud Run with several production-optimized flags: --memory “4Gi” to provide ample RAM for LLM operations, and --no-cpu-throttling to ensure the CPU remains allocated 24/7. This configuration is particularly valuable for Go applications.

`make deploy` builds the container and deploys to Cloud Run

To ensure your agent runs securely, the command is enabled with a strict configuration. It uses --no-allow-unauthenticated to block all public access by default, requiring Identity and Access Management (IAM) authentication for any requests. It also injects environment variables via --update-env-vars, including the use of Vertex AI GOOGLE\_GENAI\_USE\_VERTEXAI=True. After running the command, I have a service URL!

If you want to view the deployed web UI, I recommend deploying with make deploy IAP=true. This will handle the steps to enable IAP for Cloud Run. You will also need to provide access to users within your organization following the instructions in the documentation.

Adding a principal to IAP with the Google Cloud Console

With IAP enabled, I can now view the web UI or the deployed Agent Card. This card serves as your agent’s standard interface, allowing it to be dynamically discovered by other agents, orchestrators, or human-facing UI:

What’s next?

To continue your journey building production AI agents in Golang:

ADK Documentation: Complete guides on advanced patterns, multi-agent orchestration, and memory systems
Agent Starter Pack: Explore templates, including multi-agent systems and complex architectures
Cloud Run Documentation: Deep dives on performance optimization, scaling strategies, and security best practices
Go Concurrency Patterns: Understanding goroutines and channels will help you build more efficient agent tooling
Vertex AI Agent Engine: For managed agent infrastructure with built-in orchestration and tooling

As you scale from one agent to many, the engineering decisions we’ve discussed here compound in value. Go’s concurrency model and Cloud Run’s autoscaling are both necessary ingredients. Share what you’re building with me on LinkedIn, X, or Bluesky!

The Six Failures of Text-to-SQL (And How to Fix Them with Agents)

Karl Weinmeister — Tue, 11 Nov 2025 14:23:12 +0000

I’ve written countless SQL queries over the years. Unfortunately, like my golf game, I don’t write SQL enough to be a pro at it. Outside of straightforward SELECT statements, I approach SQL queries iteratively. I’ll inspect the tables, draft a query, and hope for the best. If there are any errors, I’ll go through this loop again.

While AI models are much better than me at SQL, they aren’t perfect. And that loop I described is just as important for automated approaches to be effective. Text-to-SQL is a deceptively difficult problem with challenges including linguistic ambiguity and rare SQL operations.

This is where a multi-agent architecture, built with a framework like Google’s Agent Development Kit (ADK), becomes essential. We can build a “virtual data analyst” by composing a team of specialized agents. A SchemaExtractor can find the right tables, a SqlGenerator can write the draft, and a SqlCorrector can critique and fix it. A SequentialAgent acts as the manager, ensuring the process is followed, every single time.

In this guide, we’ll walk through the six most common failure points for Text-to-SQL and show how to solve each one by building out our team of agents, moving from a simple script to a full-fledged agentic system. We’ll use the sample project kweinmeister/text-to-sql-agent to illustrate these solutions.

Problem 1: Agent Order Issues

Here’s the issue with a single LlmAgent that holds all the tools: it decides the order of operations. It might confidently skip fetching the schema and invent a table name. Or it might try to run a query before validating it. A single LLM is deciding what to do next, and it can (and will) make mistakes. That’s not a reliable process.

Solution: SequentialAgent for Order Control

The ADK gives us “Workflow Agents” for this. These specialized agents don’t use an LLM for flow control. They’re deterministic.

The SequentialAgent is the simplest and most powerful one to start with. It runs its sub-agents in the exact order you list them. Using a sequential agent also separates the concerns of “what to do” (our specialized agents) from “the order to do it in” (the workflow agent).

The SequentialAgent also acts as a guardrail. It turns our best practices (“always get the schema first,” “always validate before running”) into enforced infrastructure, not just suggestions in a prompt.

Code Example: Defining the Workflow Manager

Let’s define our root agent. Instead of a single LlmAgent, our root_agent will be a SequentialAgent. We’ll start by defining the specialists as stubs (we’ll build them out in the next sections):

from google.adk.agents import SequentialAgent

from .agents import (
    schema_extractor_agent,
    sql_correction_loop,
    sql_generator_agent,
)
from .callbacks import capture_user_message

root_agent = SequentialAgent(
    name="TextToSqlRootAgent",
    before_agent_callback=capture_user_message,
    sub_agents=[
        schema_extractor_agent,
        sql_generator_agent,
        sql_correction_loop,
    ],
)

Problem 2: LLM Schema Hallucinations

This is the classic failure mode. The LLM just doesn’t know your schema.

A common but flawed fix is to dump the entire database schema into the prompt. This backfires for two reasons. First, huge enterprise schemas won’t even fit in the context window. Second, even if they did, giving the LLM 100 irrelevant tables to find the 2 relevant ones just drowns it in noise and leads to worse results.

Solution: Dedicated Schema-Retrieval Tool

The answer is dynamic retrieval. Don’t give the agent a static block of schema; give it a tool to fetch schema. This lets the LLM reason about what it needs first, and then request only that specific information.

We can build a simple Python function for this. The ADK makes it easy to turn any function into an agent-callable tool with FunctionTool. The agent automatically figures out how to use it from its docstring, a best practice you’ll see in projects like gabrielpreda/adk-sql-agent.

Code Example: The Schema Tool

💡 In the kweinmeister/text-to-sql-agent project, the functions are not wrapped as tools, since they are directly called by a deterministic agent. They are provided centrally in a tools.py file, so that they can be easily leveraged as tools in a future LlmAgent.

import logging
from typing import Any

from .config import DB_URI
from .dialects.dialect import DatabaseDialect

logger = logging.getLogger( __name__ )

def load_schema_into_state(state: dict[str, Any], dialect: DatabaseDialect) -> None:
    """
    Loads the DDL and SQLGlot schema into the state dictionary.
    This function relies on the caching mechanism within the dialect object.
    """
    logger.info(f"Loading schema for dialect: {dialect.name}")

    db_uri = DB_URI
    # Error handling code omitted

    try:
        logger.info(f"Loading schema from database: {db_uri}")
        # The dialect object handles its own caching.
        # The first call to get_ddl will trigger the DB query and cache the DDL.
        logger.info("Calling dialect.get_ddl...")
        state["schema_ddl"] = dialect.get_ddl(db_uri)
        logger.info("DDL loaded successfully")

        # The call to get_sqlglot_schema will use the cached DDL if available,
        # then parse it and cache the result.
        logger.info("Calling dialect.get_sqlglot_schema...")
        state["sqlglot_schema"] = dialect.get_sqlglot_schema(db_uri)
        logger.info("SQLGlot schema loaded successfully")
        logger.info(f"SQLGlot schema keys: {list(state['sqlglot_schema'].keys())}")

    except Exception as e:
        error_msg = f"Error extracting schema: {e}"
        logger.error(error_msg, exc_info=True)
        state["schema_ddl"] = f"Error loading schema: {error_msg}"
        state["sqlglot_schema"] = {"error": error_msg}

Problem 3: Query Logic Errors

Even with the right schema, the LLM can still make logical mistakes with complex joins or aggregations. A human analyst would spot the error, critique it (“That join is wrong, you need to use user_id”), and refine it.

Our SequentialAgent is too simple for this. It’s a waterfall. It can’t go backwards and iterate.

Solution: LoopAgent for Iterative Refinement

The ADK has another workflow agent for this: the LoopAgent. This agent runs its sub-agents iteratively until a condition is met. It’s perfect for a “generate-and-critique” pattern.

We don’t have to replace our SequentialAgent. We can enhance it by nesting workflow agents. We’ll replace the single query generation step inside our SequentialAgent with a new LoopAgent. This loop will contain a team of two specialists:

A Writer Agent: An LlmAgent that writes the SQL draft.
A Critic Agent: A second LlmAgent with a different prompt, whose only job is to correct the writer’s SQL.

This is a powerful way to get LLMs to self-correct, which improves the quality of the final query.

Code Example: Building a “Generate-and-Critique” Loop

sql_generator_agent = Agent(
    name="sql_generator_agent",
    model=MODEL_NAME,
    description="Generates an initial SQL query from a natural language question.",
    instruction=get_generator_instruction,
    output_key="sql_query",
    after_model_callback=clean_sql_query,
)

sql_corrector_agent = Agent(
    name="sql_corrector_agent",
    model=MODEL_NAME,
    description="Corrects a failed SQL query.",
    instruction=get_corrector_instruction,
    output_key="sql_query",
    tools=[],
    after_model_callback=clean_sql_query,
)

sql_correction_loop = LoopAgent(
    name="SQLCorrectionLoop",
    sub_agents=[
        sql_processor_agent,
        sql_corrector_agent,
    ],
    max_iterations=3,
)

Problem 4: Agent Performance and Cost

We’re now using three LLM-powered agents. This is great for quality, but it’s slow and costs money with every API call.

What about simple, deterministic steps? Things like validating SQL syntax, formatting data, or cleaning up LLM output. Using a powerful LLM for these jobs is like using a sledgehammer to hang a picture. It’s slow, expensive, and surprisingly unreliable.

Solution: Custom Agents for Code-Based Logic

The ADK isn’t just for LLMs. You can create a “Custom Agent” by inheriting from BaseAgent and implementing the _run_async_impl method.

This agent has no LLM. It runs pure Python code. It’s fast and 100% deterministic. We’ll create a custom agent for our next problem: validation.

Code Example: Building a Non-LLM ValidationAgent

This agent will use the sqlglot library (which we’ll discuss in detail next) and will be a custom BaseAgent.

class SQLProcessor(BaseAgent):
    """
    Agent that handles the mechanical steps of:
    1. Validating the current SQL.
    2. Executing it ONLY if validation passed.
    3. Escalating to exit the loop on successful execution.
    """

    async def _run_async_impl(self, ctx: InvocationContext) -> AsyncGenerator[Event]:
        logger.info(f"[{self.name}] Starting SQL processing.")
        # ...

Problem 5: Dangerous Query Execution

This is the big one. You can’t execute LLM-generated code directly against your database. Ever. It’s a massive security and stability risk.

We need a fast, reliable check for syntax errors. What if the LLM produces a query that’s syntactically invalid? Or for the wrong SQL dialect?

Solution: Non-Destructive Dry Run with sqlglot

This is where our custom SqlValidationAgent shines. We’ll use the sqlglot library, a pure-Python SQL parser and transpiler.

Why sqlglot? It’s fast and local, building a real Abstract Syntax Tree (AST) which is infinitely more reliable than regex. It’s also dialect-aware, so it can catch syntax errors specific to, say, PostgreSQL.

We can just wrap sqlglot.parse_one(sql) in a try…except block. If it parses, the syntax is valid. If it throws a ParseError, it’s not. This gives us a fast and cheap validation signal.

Code Example: Full ValidationAgent Implementation

Here is the full implementation of the SqlValidationAgent we previewed with sqlglot validation.

from google.adk.agents import BaseAgent
from google.adk.core import InvocationContext, Event
from google.genai.types import Content, Part
import sqlglot
import sqlglot.expressions as exp
import asyncio
from typing import AsyncGenerator

class SQLProcessor(BaseAgent):
    """
    Agent that handles the mechanical steps of:
    1. Validating the current SQL.
    2. Executing it ONLY if validation passed.
    3. Escalating to exit the loop on successful execution.
    """

    async def _run_async_impl(self, ctx: InvocationContext) -> AsyncGenerator[Event]:
        logger.info(f"[{self.name}] Starting SQL processing.")
        state = ctx.session.state
        dialect = get_dialect()

        val_result: dict[str, Any] = run_sql_validation(state, dialect)
        yield Event(
            author=self.name,
            invocation_id=ctx.invocation_id,
            custom_metadata={"validation_result": val_result},
        )

        if val_result.get("status") == "success":
            exec_result: dict[str, Any] = run_sql_execution(state, dialect)

            result_event = Event(
                author=self.name,
                invocation_id=ctx.invocation_id,
                custom_metadata={"execution_result": exec_result},
            )

            # If execution succeeds, this is the final answer.
            # Escalate to exit the loop and provide the final content.
            if exec_result.get("status") == "success":
                logger.info(
                    f"[{self.name}] SQL execution successful. Escalating to exit loop."
                )
                result_event.actions.escalate = True

                final_query: str | None = state.get("sql_query")
                state["final_sql_query"] = final_query

                if final_query:
                    result_event.content = Content(
                        role="model", parts=[Part(text=final_query)]
                    )

            yield result_event
        else:
            logger.info(f"[{self.name}] Skipping execution due to validation failure.")
            state["execution_result"] = {
                "status": "skipped",
                "reason": "validation_failed",
            }

Problem 6: Messy LLM Output

One last thing. LLMs are trained to be helpful conversationalists. So when you ask for a SQL query, you often get this:

“Sure! Here is the SQL query you asked for: SELECT * FROM users;”

That conversational fluff will break our SqlValidationAgent every single time. We need a way to programmatically clean the LLM’s output before it’s passed to the next agent.

Solution: Callbacks for Post-Processing

We could add another CustomAgent just to strip the text, but that feels a bit heavy for such a simple task.

The ADK offers a more elegant solution: Callbacks.

An AfterAgentCallback is a function you attach to an agent that’s guaranteed to run immediately after the agent finishes. It can even modify the agent’s final output.

Code Example: Attaching a Cleanup Callback

import re
from google.adk.core import InvocationContext, Content

def cleanup_sql_output(
    context: InvocationContext,
    agent_output: Content
) -> Content:
    """
    This callback runs *after* the agent and cleans its output.
    """
    raw_text = agent_output.parts.text

    # Simple regex to find content within ```
{% endraw %}
sql...
{% raw %}

match = re.search(r"```

sql\s*(.?)\s

```", raw_text, re.DOTALL | re.IGNORECASE)

cleaned_text = raw_text
if match:
    cleaned_text = match.group(1)
else:
    # Fallback: simple stripping
    cleaned_text = raw_text.strip().strip("`").strip()

# Add a semicolon if it's missing (another common cleanup)
if not cleaned_text.endswith(";"):
    cleaned_text += ";"

# Return a *new* Content object to *replace* the original output
return Content.from_text(cleaned_text)




### Final Architecture

We’ve systematically tackled the six hardest problems in Text-to-SQL, evolving a brittle script into an extensible multi-agent system.

Our final root\_agent is a [SequentialAgent](https://google.github.io/adk-docs/api-reference/python/google.adk.agents.html#google.adk.agents.SequentialAgent) that orchestrates a team of specialists: a schema-fetching agent, a looping agent for iterative query improvement (with its own writer and critic), and a fast, deterministic validation agent using sqlglot.

The point is that modern agent development is about _composition_. You have to choose the right ADK construct for the right task. This table is a cheat sheet for making that decision.

### Agent Design: The “Right Tool for the Job”

![](https://cdn-images-1.medium.com/max/1024/1*OuHQ_Jb7kQ0IaBrlWKVICg.png)

### Conclusion: Building Reliable AI Systems

This pattern of **Specialization** , **Orchestration** , and **Safeguards** is the future of building production-ready AI. It’s not just for SQL, either. You can use this same architecture for autonomous code generation, document analysis, and much more.

So stop trying to build one “super-prompt” and start building teams of specialized agents. Welcome to the world of reliable, agentic systems.

What’s next? Get started in 3 simple steps in the [sample repository](https://github.com/kweinmeister/text-to-sql-agent). If you want a hands-on lab exercise, check out [Build Multi-Agent Systems with ADK](https://codelabs.developers.google.com/codelabs/production-ready-ai-with-gc/3-developing-agents/build-a-multi-agent-system-with-adk?hl=en#0&utm_campaign=CDR_0x2b6f3004_default_b459252462&utm_medium=external&utm_source=blog). To learn about powerful, built-in natural language capabilities in AlloyDB, try out the [AlloyDB AI NL SQL](https://codelabs.developers.google.com/alloydb-ai-nl-sql?hl=en#0&utm_campaign=CDR_0x2b6f3004_default_b459252462&utm_medium=external&utm_source=blog) codelab.

Want to keep the discussion going about multi-agent systems? Connect with me on [LinkedIn](https://www.linkedin.com/in/karlweinmeister/), [X](https://x.com/kweinmeister), or [Bluesky](https://bsky.app/profile/kweinmeister.bsky.social).

* * *

Deploy Faster with Terraform: Your Guide to vLLM on GKE with Infrastructure-as-Code

Karl Weinmeister — Sun, 12 Oct 2025 23:24:44 +0000

Somewhere in your AI journey, you’re going to push the limits of what models can do.

You might need to squeeze out that extra bit of performance, or try to fit a big model right under a GPU’s VRAM limit. All of these situations require tweaking and redeployment. That’s not as simple as it sounds, when the infrastructure includes everything from GPU clusters to storage to networking.

The solution is to treat your infrastructure the same way you treat your application code. It needs to be versioned in Git. It needs to be tested. And it needs to be deployed through an automated pipeline. This practice, known as Infrastructure as Code, or IaC, is the foundation of any serious MLOps strategy.

This article is a practical guide on how to use Terraform for agile ML engineering. I’ll walk through a real-world example of deploying a high-with vLLM on Google Kubernetes Engine. You can follow along with the complete source code on GitHub in the vllm-gke-terraform repository.

We will use the Qwen3–32B model in this article, which can be run on easily accessible NVIDIA L4 GPUs on Google Cloud. The Terraform script has been tested on larger models, such as the Qwen/Qwen3–235B-A22B-Instruct-2507 on a cluster with 8 H100 GPUs.

The scripts currently use GKE standard clusters for maximum flexibility. For production workloads where you want to offload node management and focus purely on the application, it’s recommended to leverage GKE Autopilot capabilities.

Declarative Infrastructure

Terraform uses a declarative language (HCL) where you define the desired end state of your infrastructure. You specify what you need, and Terraform’s engine calculates the necessary API calls to make the real-world infrastructure match that state. Before applying any changes, you can run the terraform plan command to see a detailed preview of what Terraform will create, modify, or destroy.

This allows for a thorough review to ensure the proposed changes align with your intentions, preventing unintended modifications. This declarative model is the key to eliminating configuration drift and ensuring that every environment is provisioned identically, a critical requirement for reproducible experiments.

The Terraform provider for Google Cloud is the interface between Terraform and Google Cloud. For example, the google_container_cluster resource is used to manage a GKE cluster. You can find the full set of GKE resources here.

In our project, the gke.tf file declares the desired state of a GKE cluster with specific node pools:

# gke.tf
resource "google_container_cluster" "qwen_cluster" {
  name = local.cluster_name
  location = var.zone
  project = var.project_id
  # ...
}

resource "google_container_node_pool" "gpu_pools" {
  # ...
  node_config {
    machine_type = each.value.machine_type
    guest_accelerator {
      type = each.value.accelerator_type
      count = each.value.accelerator_count
    }
  }
}

To manage this, Terraform maintains a state file that maps these definitions to their real-world resources. For team collaboration, using a remote state backend like Cloud Storage is recommended. It provides a centralized source of truth and uses locking mechanisms to prevent conflicting changes. Here’s how to instruct Terraform to use GCS as its backend:

# main.tf
terraform {
  backend "gcs" {
    prefix = "terraform/state/vllm-gke"
  }
}

Reusable Modules

Terraform modules are the primary mechanism for abstraction and reuse. MLOps teams can create a library of standardized modules for common components like a GKE cluster or a vector database.

Modules are made reusable through input variables. This allows an engineer to maintain a single, version-controlled set of Terraform files and use variable files (.tfvars) to launch new, isolated deployments.

To test a new model, you could simply create a new variable file like llama3-test.tfvars. By overriding a few default values, you can spin up an entirely new, isolated environment to test Llama-3–8B on L4 GPUs:

# my-experiment.tfvars
project_id = "my-gcp-project"
name_prefix = "my-llama3-deployment"
model_id = "meta-llama/Llama-3-8B-Instruct"
gpu_type = "l4"

Running terraform apply -var-file=llama3-test.tfvars makes spinning up parallel experiments a trivial, declarative operation, dramatically increasing a team’s experimental throughput.

For production systems, this same principle allows for sophisticated, zero-downtime strategies like Blue/Green deployments. A second, parallel “green” version of the entire stack is deployed by instantiating the Terraform configuration with a different set of variables. Once the new environment is fully validated, production traffic can be instantly switched at the load balancer or DNS level. The old “blue” environment can then be decommissioned. By codifying these complex release strategies, the entire deployment process becomes a version-controlled, auditable artifact.

Configuring the vLLM Engine

Provisioning hardware consistently is the first step. Configuring software to utilize that hardware efficiently is next.

The sample project uses the popular vLLM inference engine. Let’s show how to effectively link Terraform variables to configuration parameters in vLLM.

In variables.tf, the high-level knobs for experiments are defined:

# variables.tf
variable "gpu_memory_utilization" {
  description = "GPU memory utilization ratio"
  type = number
  default = 0.9
}

variable "max_model_len" {
  description = "The maximum model length."
  type = number
  default = 8192
}

variable "vllm_max_num_seqs" {
  description = "The maximum number of sequences (requests) to batch together."
  type = number
  default = 64
}

Then, the deployment in kubernetes.tf consumes these variables to construct the vLLM server’s startup arguments:

# kubernetes.tf
...
container {
  name = "vllm-container"
  args = compact([
    # --- Base Model Arguments ---
    "--model",
    var.model_id,
    "--tensor-parallel-size",
    tostring(local.gpu_config.accelerator_count),

    # --- Performance Tuning from Variables ---
    "--gpu-memory-utilization",
    tostring(var.gpu_memory_utilization),
    "--max-model-len",
    tostring(var.max_model_len),
    "--max-num-seqs",
    tostring(var.vllm_max_num_seqs),
  ])
}

Production-Grade Architecture

The sample project showcases a blueprint for a production-grade inference endpoint on GKE designed for both performance and cost-efficiency.

The gke.tf file provisions a GKE cluster with both spot and on-demand GPU node pools, which allows for a flexible and cost-effective approach to managing expensive GPU resources. You can read more here about the strategy to back up spot VMs with an on-demand node pool.

To avoid re-downloading large models on every pod restart, a kubernetes_persistent_volume_claim is created in kubernetes.tf to provide a persistent cache for the Hugging Face models. A Kubernetes Job, defined in kubernetes_jobs.tf, is then used to download the specified model into this persistent volume. This job runs to completion before the main vLLM deployment is scaled up, ensuring the model is ready before the inference server starts.

Automated Workflows

While Terraform itself is a big leap forward from shell scripting, it’s crucial that teams don’t stop there. The next step beyond running manual terraform commands is to embrace an automated, end-to-end CI/CD workflow, often called GitOps. The source control repository becomes the single source of truth for both application code and infrastructure.

The sample project includes a basic GitHub Actions workflow that validates the Terraform code on every push and pull request.

# .github/workflows/terraform-validate.yml
name: 'Terraform Validate'
on: [push, pull_request]

jobs:
  validate:
    name: 'Terraform Validate'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform fmt -check -recursive
      - run: terraform init -backend=false
      - run: terraform validate

A complete CI/CD pipeline would extend this by running terraform plan on pull requests to preview changes and automatically running terraform apply on merge to the main branch to deploy them. This creates a flywheel where code is pushed and infrastructure is updated without manual intervention.

Infrastructure-as-Code is a now an AI Competency

The main takeaway is this: mastering Infrastructure as Code isn’t an optional “DevOps” skill. It’s a core competency for the modern ML engineer. For any organization serious about productionizing AI, Terraform on Google Cloud is the a key step toward building a scalable engineering culture.

If you’d like to keep learning more, I recommend the step-by-step guide on using a GKE cluster with Terraform: Quickstart: Deploy a workload with Terraform.

From there, I’d love to hear more about your journey with AI and Cloud infrastructure. Connect on LinkedIn, X, or Bluesky to continue the discussion!

A Developer’s Guide to Model Routing

Karl Weinmeister — Mon, 25 Aug 2025 16:26:04 +0000

Not long ago, building with LLMs meant picking one general-purpose model and sticking with it. Today, the landscape is flooded with thousands of options: large and small, open and closed-source, generalist and specialist, each with unique capabilities and costs.

This explosion of choice has fundamentally changed how we build AI applications. The one-size-fits-all approach is over.

Instead, we architect systems that select the best model for each task. This is the idea behind model routing. This architectural pattern can be implemented today, and has the potential to change the economics of model inference. Let’s get into it!

Understanding Model Routing

As a developer building with LLMs, you’re constantly juggling three competing priorities: performance, cost, and latency.

Performance (Quality): For complex reasoning and creative generation, you might reach for state-of-the-art models like Google’s Gemini 2.5 Pro. These models deliver high-quality, accurate responses.
Cost: While premium models deliver state-of-the-art performance, they represent a significant investment. The key to a sustainable AI strategy is to reserve these powerful models for tasks where their advanced capabilities provide a clear return on investment. For more routine queries, smaller, highly efficient models can deliver excellent results at a fraction of the cost. Recent studies show this approach can yield cost savings without significantly degrading performance.
Latency: In interactive applications like chatbots, a fast response time is critical for a positive user experience. Smaller, specialized models can deliver near-instantaneous responses, making them ideal for real-time, conversational AI. By routing interactive queries to these faster models, you can create a more engaging and responsive application.

Relying on a single model forces an unnecessary compromise. Use a top-tier model for everything, and you pay a premium for power you don’t always need. Use a smaller model for everything, and you sacrifice quality on complex queries. So why are we still forcing ourselves to choose just one?

Model routing is an architectural pattern designed to solve this optimization problem. It involves maintaining a pool of candidate LLMs and routing each incoming prompt to the most suitable model. That’s often the smallest, fastest, and most cost-effective model that can successfully complete the task.

Common Routing Patterns

Implementing a model router involves choosing an architectural pattern that determines how routing decisions are made. These patterns exist on a spectrum of complexity and intelligence, from simple, predefined rules to sophisticated, AI-driven classification. We will focus on dynamic routing patterns that assess the content, intent, and complexity of the prompt to select the optimal model.

Rule-Based Routing

This is the simplest form of dynamic routing. It uses hard-coded logic, typically a series of if/else statements, to make routing decisions based on simple characteristics of the prompt.

The rules are based on easily measurable attributes of the prompt, such as the presence of certain keywords, its overall length, or matches against regular expressions. For instance, a system might check for specific terms to identify a task category or measure the prompt’s length to estimate its complexity.

Pros: This approach is predictable, transparent, and fast to execute. It’s an excellent choice for well-defined, simple workflows where task categories can be reliably distinguished by straightforward heuristics.
Cons: Rule-based systems are brittle and inflexible because they lack a true understanding of language. They can be easily confused by semantic nuance, such as negation or context. The system also becomes difficult to maintain and scale as the number of rules grows.

LLM-Based Routing

This pattern leverages the intelligence of an LLM to perform the routing task itself. A dedicated, often smaller and faster, “router LLM” acts as a classification engine.

The user’s prompt is fed into the router LLM. The router LLM is given a prompt that instructs it to analyze the query and classify it into predefined categories. To ensure the output is machine-readable, the router LLM is instructed to respond in a structured format like JSON. The application then parses this JSON output to determine which model to call next.

Pros: This is a powerful and flexible approach. The router LLM can understand complex, ambiguous, and nuanced language. It can handle multi-intent queries and can be adapted to new routing tasks simply by updating its system prompt.
Cons: The primary drawback is significant overhead. This method introduces an additional, full LLM API call into the critical path of every request. This adds both cost and latency, which can undermine the goals of optimization the router was intended to achieve.

Semantic Routing

Semantic routing offers a powerful compromise, combining the speed of rule-based systems with the intelligence of LLM-based approaches. It operates on the principle of semantic similarity in vector space and is the core mechanism we’ll implement.

The process involves four steps. First, routes are defined, each with a name and a list of representative example phrases, or utterances. Next, a text embedding model converts all of these utterances into high-dimensional numerical vectors that capture their semantic meaning, which are then stored in an efficient index. When a new user query arrives, the same embedding model converts it into a vector. Finally, a vector similarity search is performed between the query’s vector and all the utterance vectors in the index, and the route whose utterances are most similar to the query is selected as the winner.

Pros: This method is fast, with decision times often in the milliseconds, because it relies on optimized vector math rather than a slow, generative LLM call. It’s highly scalable to thousands of potential routes and is more robust than simple keyword matching because it understands meaning and context. Modern libraries often allow this configuration to be externalized into declarative files like YAML, separating the routing logic from the application code for better maintainability.
Cons: The effectiveness of a semantic router is highly dependent on the quality and comprehensiveness of the example utterances provided for each route. It can also struggle with contextual, multi-turn conversational queries where the user’s intent is not explicitly stated in their most recent message.

The choice of routing architecture is governed by the “Router Latency Paradox”: a component designed to reduce overall application latency must itself be exceptionally low-latency. An LLM-based router introduces a full inference step to every request, increasing both latency and cost. For this approach to be a net positive, the downstream savings must consistently outweigh its operational overhead, which is a high bar for most interactive applications. Semantic routing, in contrast, replaces this slow inference with a near-instantaneous vector search. This performance difference establishes semantic routing as the default architectural best practice for dynamic, real-time model routing. LLM-based routing is thus reserved for cases where the routing logic is too complex to be captured by semantic similarity alone and the added latency is an acceptable trade-off.

The Gemini 2.5 Model Family

To build an effective router, you need a solid grasp of the candidate models in your pool. For our implementation, we’ll use Google’s Gemini 2.5 family, a suite of models with a tiered structure of capability and cost that’s perfect for a routing architecture.

A key innovation across the Gemini 2.5 family is their capability as “thinking models.” This means they can be configured to perform internal reasoning steps, akin to a chain of thought, before generating a final response. This feature, controllable via an API parameter known as the “thinking budget,” can significantly improve performance and accuracy on complex tasks. This controllable reasoning becomes another powerful dimension for our routing logic to consider.

Gemini 2.5 Pro

Capabilities: Gemini 2.5 Pro is Google’s flagship model, engineered for maximum performance and state-of-the-art accuracy. It’s optimized for the most complex and demanding tasks, including deep logical reasoning, advanced code generation, and sophisticated multimodal understanding across text, images, audio, and video.
Router Use Case: This is our designated “strong” model. We’ll route only the most challenging queries here: prompts that involve complex problem-solving, novel algorithm design, in-depth analysis of dense technical documents, or multi-step logical puzzles.
Thinking: For this model, the “thinking” capability is on by default, as it’s integral to its high-end performance.

Gemini 2.5 Flash

Capabilities: Gemini 2.5 Flash is designed to be the best model in the family in terms of its price-to-performance ratio. It offers well-rounded, powerful capabilities that approach those of Pro but at a significantly lower operational cost. It also features a controllable thinking budget.
Router Use Case: This is our “default” or “go-to” model. It’s the workhorse that will handle the majority of general-purpose queries. These are tasks that are more complex than simple classification but don’t require the full power (and expense) of Pro. Ideal use cases include general conversation, creative writing, drafting emails, and performing detailed summarizations.

Gemini 2.5 Flash-Lite

Capabilities: As its name suggests, Gemini 2.5 Flash-Lite is the fastest and most cost-efficient model in the 2.5 family. It’s highly optimized for low latency and high-throughput scenarios, making it a cost-effective upgrade from previous generations of Flash models.
Router Use Case: This is our fastest model. We’ll route simple, high-volume, and latency-sensitive tasks here. It’s perfect for text classification (e.g., sentiment analysis), simple data extraction (e.g., pulling names and dates from text), translation, and answering straightforward factual questions.
Thinking: To maximize its speed and cost-efficiency, “thinking” is turned off by default for Flash-Lite. However, it can be optionally enabled, providing granular control for tasks that might need a small boost in reasoning without escalating to the full Flash model.

Implementing a Semantic Router

With the theory covered, let’s get to the code. This section walks through the gemini-model-router project, which builds a semantic router to intelligently distribute queries among the Gemini 2.5 Pro, Flash, and Flash-Lite models. It uses the open-source semantic-router library as its engine and serves it all up with FastAPI.

Embeddings are created upfront for each route, and then matched to queries at runtime

Project Setup

To get started, clone the repository and follow the setup instructions in the README.md file, which covers creating the .env file and installing the required dependencies from requirements.txt.

Centralizing Configuration

A key architectural decision in the gemini-model-router project is the separation of configuration from code. All routing logic, including the routes, their representative utterances, and the specific LLM assigned to each route, is defined in a single router.yaml file. This makes the system highly maintainable and easy to modify without changing the application’s Python code.

The router.yaml file has two main sections:

encoder : Specifies the embedding model to use for converting text to vectors. In this case, it uses Google’s gemini-embedding-001 via the semantic-router’s GoogleEncoder.
routes : A list of route definitions. Each route has:

name: A unique identifier that maps directly to a Gemini model.
description: A human-readable explanation of the route’s purpose.
utterances: A list of example phrases that define the semantic space of the route.
llm: An object specifying the custom class (GoogleLLM), the Python module where it’s defined (main), and the target model ID (e.g., gemini-2.5-pro).

Here is a snippet from the router.yaml file, defining the route for complex queries. A key parameter in the full configuration is the score_threshold. When the router compares a query to its routes, it calculates a similarity score. By setting the threshold to 0.0, we ensure that the router always selects the route with the highest similarity, effectively guaranteeing that a decision is always made.

# router.yaml
encoder_name: gemini-embedding-001
encoder_type: google
routes:
- name: gemini-2.5-pro
  description: For complex, multi-step tasks requiring deep reasoning, code generation, and analysis of large documents.
  utterances:
  - Develop a comprehensive, multi-year business plan for a direct-to-consumer sustainable
    fashion brand, including financial projections and marketing strategies.
  - Write a Python script to perform sentiment analysis on a large CSV of customer
    reviews, generate visualizations, and create a summary report.
  - Compare and contrast the philosophical implications of determinism and free will
    in the context of advanced artificial intelligence, citing relevant academic sources.
  llm:
    module: main
    class: GoogleLLM
    model: gemini-2.5-pro
#... other routes for flash and flash-lite follow...

Routing Logic

The main.py file contains the FastAPI application that serves the router. It includes several key components that work together to bring the YAML configuration to life.

The GoogleLLM Wrapper

The semantic-router library requires a compatible LLM object for each route. To integrate with Google’s GenAI SDK, the project defines a custom GoogleLLM class that inherits from semantic_router.llms.BaseLLM. This class acts as a bridge, translating the semantic-router’s call signature into an asynchronous request to the Vertex AI Gemini API.

# main.py (simplified)
from semantic_router.llms import BaseLLM
from google import genai

class GoogleLLM(BaseLLM):
    _client: ClassVar[Optional[genai.Client]] = None

    @classmethod
    def get_client(cls) -> genai.Client:
        if cls._client is None:
            project_id = os.getenv("GOOGLE_CLOUD_PROJECT")
            cls._client = genai.Client(vertexai=True, project=project_id)
        return cls._client

    async def __acall__ (self, messages: List[Message], **kwargs) -> Optional[str]:
        contents = kwargs.get("multimodal_contents", messages[0].content)
        config = kwargs.get("config", self.kwargs.get("config", {}))

        response = await self.get_client().aio.models.generate_content(
            model=self.name,
            contents=contents,
            **config,
        )
        return response.text if response else ""

The /query Endpoint

The main API endpoint uses a series of helper functions to route and execute the query. The handle_query function orchestrates the process: it extracts text for routing, determines the best route, and executes the LLM call.

# main.py (simplified)
@app.post("/query", response_model=RouterResponse)
async def handle_query(request: QueryRequest, fastapi_request: Request):
    router = fastapi_request.app.state.router
    default_route = fastapi_request.app.state.default_route_name

    # 1. Extract text and determine the route
    text_for_routing = _get_text_for_routing(request.contents)
    route_choice = _determine_route(router, text_for_routing, default_route)
    chosen_route = router.get(route_choice.name)

    # 2. Execute the call using the LLM from the chosen route
    model_response = await _execute_llm_call(
        chosen_route, request.contents, request.config, text_for_routing
    )

    return RouterResponse(
        route_name=chosen_route.name, model_response=model_response
    )

Deploying to Production

While FastAPI’s web server uvicorn is perfect for local development, a production deployment requires a robust, scalable hosting environment. Cloud Run is an ideal choice for this service because it’s a fully managed, serverless platform that takes your containerized application (including the Uvicorn server) and handles all the underlying infrastructure, scaling, and request management.

To deploy the router, you first need to have the Google Cloud SDK installed and configured. Then, you can deploy the service with a single command:

gcloud run deploy gemini-model-router \
  --source . \
  --region us-central1

This command builds a container from your source code, pushes it to the Artifact Registry, and deploys it as a public-facing service. Cloud Run handles all the infrastructure, so you can focus on the application logic.

Production Best Practices

Deploying a model router to production requires building an observable and resilient system. An API management platform like Google Cloud’s Apigee can serve as a unified and secure gateway to your model routing service. It can provide essential capabilities like enforcing security policies, managing traffic with rate limiting and quotas, and offering deep visibility through analytics and monitoring. Let’s review the key principles needed to move beyond a proof-of-concept.

First, treat the router as a mission-critical, standalone service. Because it can be a single point of failure and a performance bottleneck, it must be independently scalable and fault-tolerant. Containerize the router and deploy it on a platform like Cloud Run to ensure high availability, allowing it to scale independently of the applications that consume it.

Second, you cannot optimize what you cannot measure. Implement comprehensive logging and monitoring for every routing decision. For each request, log the chosen route, similarity score, final model, latency, and estimated cost. This data can be fed into Google Cloud’s observability suite to create dashboards for tracking key performance indicators like route distribution, cost per query, and P99 latency. This allows you to set up alerts for anomalies, such as a sudden shift in routing patterns or an increase in fallback rates.

Third, the initial configuration is just a starting point. True optimization requires a data-driven feedback loop. Collect and review production queries to identify misrouted requests, and use this analysis to refine your route utterances. A/B testing frameworks are invaluable for comparing different routing strategies or model configurations in a live environment to validate improvements.

Finally, enterprise-grade reliability requires planning for failure. Implement a chain of fallbacks that goes beyond a simple default route. For instance, if a request to gemini-2.5-pro fails, the system should automatically retry with exponential backoff. If that also fails, it should fall back to the next best model, gemini-2.5-flash.

The Future of Model Routing

There is a broader trend towards more modular and dynamic AI architectures, and model routing is no exception. The future of model routing could include:

Multimodal Routing: The next logical step is routing on more than just text. The current router simplifies the problem by extracting the text from a multimodal prompt, but the concept of vector similarity works for any modality you can embed.
Hierarchical Routing: The concept of system-level model routing is a macro-scale analog of what Mixture-of-Experts or MoE architectures do within a single neural network. In an MoE model, an internal “router” network dynamically selects which “expert” sub-networks should process each token of an input sequence. Our external router does the same thing, but its “experts” are entire, independent LLMs. Future systems may employ hierarchical routing, where a top-level semantic router first selects the best specialized MoE model for a task, which then performs its own fine-grained, internal routing to process the request.

Ultimately, model routing is a foundational building block for the next generation of complex, multi-agent AI systems. As we’ve shown, the combination of a powerful model family like Google’s Gemini 2.5, a serverless platform like Cloud Run, and the open-source gemini-model-router project makes this advanced architecture an achievable engineering task. The tools are here. The patterns are clear.

It’s time to start building. Share what you’ve built with me on LinkedIn, X, or Bluesky!

Mastering Agentic Development with Gemini and Roo Code

Karl Weinmeister — Sun, 20 Jul 2025 04:56:25 +0000

The conversation around AI in software development has matured beyond the “AI as a chatbot” and into sophisticated AI agents. We’re moving toward building a living blueprint that can reason about your code in its entirety and evolve with it over time.

For developers who want a powerful, all-in-one AI experience, Google’s Gemini Code Assist is a fantastic solution that provides a seamless, out-of-the-box experience, bringing the power of Gemini directly into your workflow.

For those who love to assemble best-in-class technologies from the open ecosystem, this article is for you. We will explore a production-ready stack for those who want a customized and self-hosted solution. This stack combines the Roo Code VS Code extension, powered by Google’s underlying Gemini models, and takes it to the next level with a self-hosted Qdrant vector database on Google Kubernetes Engine.

Solution architecture for agentic development with Roo Code, Gemini, and Qdrant

Solution Components

Roo Code is a VS Code extension that can be thought of as an “AI Dev Team” with modes ranging from Architect to Debug. You can give it a high-level task, like “refactor this module to use the new logging service,” and it will create a plan, identify the necessary code changes, and execute them across multiple files. For a deeper dive, check out the Roo Code documentation.

Using Roo Code to update a project README based on the current codebase

You can take full advantage of Roo Code’s capabilities with the massive context window available in Gemini models. This allows Roo Code to hold a vast amount of code in its “short-term memory,” enabling it to understand the intricate relationships between files and modules and to generate code that is consistent with the entire project. You can learn more about the Gemini API in the official documentation.

To make the use of this large context window efficient, Roo Code leverages prompt caching, a feature now available in Gemini models. When Roo Code sends the initial instructions and context to the model, Gemini generates an internal representation and returns a cache reference. On subsequent requests, Roo Code can send this cache reference instead of the full prompt, dramatically reducing token usage and improving latency, which is a key feature for making the system both cost-effective and performant.

For codebase indexing, Roo Code supports Gemini’s gemini-embedding-001 state-of-the-art embedding model. This is crucial for the accuracy of the semantic search, and you can find more information on Gemini’s embedding models here.

Using Gemini Models in Roo Code

The connection between Roo Code and a model is what enables its agentic capabilities: planning, executing commands, and writing code across your entire project. You can connect to Gemini’s models through the Gemini API or through Google Cloud’s Vertex AI.

To use the Gemini API, you simply create an API key in Google AI Studio, then in Roo Code’s settings, select the Google Gemini provider, paste your key, and choose a model. For detailed, step-by-step instructions on this process, refer to the Roo Code documentation for the Gemini provider.

For teams and enterprises using Google Cloud, connecting via Vertex AI provides unified billing, IAM permissions, and more. You will create a service account with the “Vertex AI User” role in the Google Cloud Console and download its JSON key file. Within Roo Code’s settings, select the GCP Vertex AI provider, provide the credentials from your JSON key, and enter your Project ID and Region. The Roo Code documentation for Vertex AI provides a complete walkthrough of this setup.

The Vertex AI LLM provider for Gemini in Roo Code

For both connection methods, we recommend starting with gemini-2.5-pro for the best experience. Its powerful reasoning capabilities and large context window are ideal for complex, multi-step tasks. For faster, more cost-effective use, gemini-2.5-flash is an excellent alternative.

With Roo Code’s reasoning engine now powered by Gemini, the next step is to give it a persistent, long-term memory of your code.

Codebase Indexing

Codebase indexing creates a semantic “long-term memory” of your code that the agent can access at any time. This is a multi-stage process that transforms your source code into a searchable knowledge base.

Intelligent Chunking

First, Roo Code uses Tree-sitter to parse your code into an Abstract Syntax Tree (AST). This gives it a deep, structural understanding of your code, just like a compiler does. Instead of arbitrarily splitting a file every few hundred lines, the AST is used to intelligently chunk the code into complete, semantic blocks.

This “semantic chunking” means the pieces of code being indexed are meaningful and self-contained units, such as:

A complete function or method.
An entire class or struct definition.
A specific configuration block.

This ensures that the context isn’t lost by splitting a function in half. For unsupported languages, Roo Code falls back to line-based chunking.

Generating Embeddings

Once the code is broken down into these intelligent chunks, the next step is to capture their semantic meaning in a way a machine can understand. This is where Gemini’s gemini-embedding-001 model comes in.

Each semantic chunk produced by Tree-sitter is fed into the embedding model, which outputs a high-dimensional numerical vector. This vector is the embedding — a mathematical representation of the code’s meaning. The Gemini embedding model captures fine details with 3072 dimensions in every embedding. For a deeper dive into Matryoshka Representation Learning, a technique used to train the model, see this video:

Storing and Searching Embeddings

With the codebase converted into a collection of semantically-rich embeddings, they need a place to be stored and searched efficiently. Roo Code uses Qdrant, a high-performance vector database, for this purpose.

When you ask a question, Roo Code’s search tool follows this process:

Query: Your natural language query (e.g., “where is our user authentication logic?”) is sent to the Gemini embedding model.
Vectorize: The model converts your query into an embedding vector.
Search: Roo Code performs a vector search in the Qdrant database, looking for the code chunk embeddings that are most similar (i.e., closest in vector space) to your query’s embedding.
Retrieve: The tool then returns the most relevant code snippets, along with their file paths and similarity scores.

Roo Code also provides a user-friendly interface for configuring the codebase indexer. You can easily select your embedding provider, enter your API keys, and specify the Qdrant URL. The advanced configuration options allow you to fine-tune the search behavior by adjusting the Search Score Threshold and Maximum Search Results. You can also specify which files to ignore by adding patterns to a .rooignore file.

Indexing a codebase in Roo Code

From Local to Centralized Indexing

The easiest way to get started is with a local Qdrant instance. As the official Qdrant Quickstart shows, you can be up and running in minutes with a single Docker command:

docker run -p 6333:6333 -v "$(pwd)/qdrant_storage:/qdrant/storage:z" qdrant/qdrant

For an individual developer, this is a fantastic way to get all the benefits of codebase indexing without any external dependencies.

As your team grows, managing dozens of individual Docker instances can become cumbersome. This is where a centralized Qdrant instance provides value — not as a single, conflict-prone shared index, but as a managed, cost-effective platform to host a fleet of personal indexes.

Google Kubernetes Engine, or GKE, is an excellent choice for this, offering high availability and enterprise-grade security. The principle is the same regardless of the platform: provide a robust, central service to host many isolated environments. You can deploy the infrastructure within minutes using the GKE tutorial for deploying Qdrant.

Using the instructions in the tutorial, you can easily access it from your local system using port forwarding:

PROJECT_ID="your-project-id"
REGION="us-central1"

gcloud container clusters get-credentials qdrant-cluster --region "$REGION" --project "$PROJECT_ID"

kubectl port-forward service/qdrant 6333:6333

Roo Code generates a unique Qdrant collection name by hashing the absolute local workspace path. This means that even when using a central Qdrant instance, each developer’s index is completely isolated. To avoid conflicts, each developer needs to ensure they are using a different path:

Developer A: /Users/alice/projects/my-app
Developer B: /Users/bob/projects/my-app

Conclusion

The future of AI-assisted development is about choice. Whether you prefer a powerful, all-in-one solution like Google’s Gemini Code Assist for a seamless, integrated experience, or the composable stack detailed in this article, the goal is the same: to create a truly intelligent development environment.

What will you build with Gemini and Roo Code? Feel free to continue the discussion on LinkedIn, X, and Bluesky.