DEV Community: Jubin Soni

Jules: Google's Async Coding Agent Is Changing How We Think About AI and Software Development

Jubin Soni — Fri, 29 May 2026 18:12:09 +0000

There's a quiet architectural shift happening in how we build software, and it doesn't look like what most people expected.

We've spent the last two years treating AI like a very fast autocomplete — a co-pilot sitting shotgun, responding the moment we type. Cursor, Copilot, Gemini Code Assist: all synchronous, all requiring you to stay in the loop, all fundamentally keeping you as the CPU driving execution.

Jules breaks that model.

Google's async coding agent, which became generally available in 2025 and got major updates at I/O 2026, doesn't help you write code faster. It removes you from the writing loop entirely. You assign a task. Jules works. You review a pull request. That's it.

This article breaks down how Jules works technically — with architecture diagrams, sequence flows, and real code — and why the async model might be more significant than it first appears.

What Jules Actually Does (and Doesn't Do)

Jules is not an IDE plugin. It's not an inline suggestion engine. It's not a chat interface for your codebase.

Jules is a task-based async agent. You give it a scoped coding task — fix a bug, migrate a module, add a feature, write tests — and it:

Clones your repository into a secure Google Cloud VM
Analyzes the relevant codebase context (2M token context window as of I/O 2026)
Writes a step-by-step implementation plan using Gemini Pro
Executes that plan: writing code, running tests, fixing errors
Opens a pull request against your branch with a description, diff, and change summary

When it's done, you're not staring at a chat window waiting to approve line-by-line edits. You're reviewing a PR — just like you would from any engineer on your team.

The 2026 update closes the loop further: if the CI/CD pipeline fails on the Jules-authored PR, Jules automatically receives the error, analyzes it, applies a fix, and re-pushes the commit — often without any human intervention at all.
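The fix-and-retry behaviour described above can be pictured as a bounded retry loop. The sketch below only illustrates that control flow and is not Jules's implementation: `run_ci`, `propose_fix`, and the `max_attempts` cap are hypothetical names introduced here.

```python
from typing import Callable, Optional


def auto_fix_loop(
    run_ci: Callable[[str], Optional[str]],
    propose_fix: Callable[[str, str], str],
    commit: str,
    max_attempts: int = 3,
) -> tuple[bool, str, int]:
    """Re-run CI after each proposed fix until it passes or attempts run out.

    run_ci returns None on success, or an error log on failure.
    propose_fix takes (commit, error_log) and returns a new commit id.
    Returns (passed, final_commit, ci_runs).
    """
    for attempt in range(1, max_attempts + 1):
        error = run_ci(commit)
        if error is None:
            return True, commit, attempt
        if attempt < max_attempts:
            commit = propose_fix(commit, error)
    return False, commit, max_attempts


# Toy stand-ins: CI fails until the commit has been "fixed" twice.
def fake_ci(commit: str) -> Optional[str]:
    return None if commit.count("+fix") >= 2 else "test_payment_concurrency failed"


def fake_fix(commit: str, error: str) -> str:
    return commit + "+fix"


print(auto_fix_loop(fake_ci, fake_fix, "abc123"))
```

The cap is the important design choice in a loop like this: when the attempts run out, the failure goes back to a human reviewer instead of the agent retrying indefinitely.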

Architecture: How Jules Is Built

Here's how the components fit together:

Key architectural choices:

Isolated VM per task: no shared state between runs, reproducible environments
Network access retained: Jules can npm install, run builds, and call APIs, unlike Codex, which sandboxes tasks with no network egress
Two-model split: Gemini Pro handles planning and hard reasoning; Gemini Flash handles lighter subtasks for cost efficiency
Native GitHub integration: reads issues, creates branches, authors commits, opens PRs; the integration is first-class, not a wrapper

Sequence: The Async Flow End-to-End

The sequence below shows what happens from task assignment to merged PR, and where the developer is actually free:

The key insight: the developer's attention is only required at step 1 (spec) and step 12 (review). Everything in between is Jules.

Code: Using Jules in Your Workflow

1. Jules CLI (Jules Tools — GA at I/O 2026)

# Install Jules Tools CLI
npm install -g @google/jules-tools

# Authenticate
jules auth login

# Submit a task against a GitHub repo
jules task create \
  --repo your-org/your-repo \
  --branch main \
  "Fix the race condition in payment/processor.go — 
   two concurrent requests can double-charge. 
   Add regression tests covering the concurrent case."

# Check task status
jules task status <task-id>

# List open tasks
jules task list --status=in-progress

2. Via Gemini CLI Extension

# Install Gemini CLI
npm install -g @google/gemini-cli

# Add Jules extension
gemini extensions install https://github.com/gemini-cli-extensions/jules --auto-update

# Submit from inside a Gemini CLI session (this is a slash command, not a shell command)
/jules Fix the flaky integration tests in auth/session_test.go. 
       Root cause appears to be missing teardown between test runs.

# Jules responds async — you get a PR link when it's done

3. Jules API (for CI/CD integration)

import google.auth
from jules_client import JulesClient

credentials, project = google.auth.default()
client = JulesClient(credentials=credentials)

# Submit a task programmatically
task = client.tasks.create(
    repo="your-org/your-repo",
    branch="main",
    description="""
        Migrate the UserRepository class from raw SQL to 
        the new ORM layer introduced in db/orm.py.
        Preserve all existing query behaviour and update tests.
    """,
    labels=["migration", "automated"]
)

print(f"Task submitted: {task.id}")
print(f"Track at: {task.url}")

# Poll for completion (or use webhooks)
import time
while task.status not in ["completed", "failed"]:
    time.sleep(30)
    task = client.tasks.get(task.id)

if task.status == "completed":
    print(f"PR ready: {task.pull_request_url}")
else:
    # Surface failures instead of exiting silently
    print(f"Task {task.id} failed; see {task.url}")
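The polling loop above has no upper bound on how long it waits. A minimal sketch of a bounded version follows. `fetch_status` is a hypothetical callable standing in for whatever client call returns the task status, and the interval and timeout defaults are assumptions, not documented values.

```python
import time
from typing import Callable

TERMINAL = {"completed", "failed"}


def wait_for_task(
    fetch_status: Callable[[], str],
    timeout_s: float = 3600.0,
    interval_s: float = 30.0,
    max_interval_s: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until a terminal status is seen or timeout_s of waiting has elapsed.

    The wait between polls doubles each round (capped at max_interval_s),
    so long-running tasks are not polled every 30 seconds for hours.
    Raises TimeoutError if no terminal status is seen in time.
    """
    waited = 0.0
    while True:
        status = fetch_status()
        if status in TERMINAL:
            return status
        if waited >= timeout_s:
            raise TimeoutError(f"task still '{status}' after {waited:.0f}s")
        delay = min(interval_s, timeout_s - waited)
        sleep(delay)
        waited += delay
        interval_s = min(interval_s * 2, max_interval_s)
```

Passing `sleep` as a parameter keeps the helper testable without real delays. In a CI job you would typically set `timeout_s` just below the job's own time limit.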

4. GitHub Actions Integration

# .github/workflows/jules-debt.yml
name: Weekly tech debt sweep

on:
  schedule:
    - cron: '0 9 * * MON'   # Every Monday at 09:00 UTC (GitHub Actions schedules run in UTC)

jobs:
  sweep:
    runs-on: ubuntu-latest
    steps:
      - name: Submit Jules tasks from tech-debt.md
        uses: google/jules-action@v1
        with:
          jules-api-key: ${{ secrets.JULES_API_KEY }}
          task-file: .github/tech-debt.md
          branch: main
          auto-merge: false   # Always require human review

Practical Workflow: What Jules Is Good At

Jules excels when the unit of work maps to a ticket. The sharper your spec, the better the output.

Task Type	Jules Fit	Why
Bug fix with clear repro steps	✅ Excellent	Deterministic target, testable outcome
Add test coverage to a module	✅ Excellent	Well-defined scope, no design decisions
Dependency upgrades with API changes	✅ Good	Mechanical but multi-file
Migrate module to new framework/ORM	✅ Good	Repetitive pattern Jules handles well
Security patch + regression tests	✅ Good	Scoped + CI validates automatically
Exploratory refactor (uncertain scope)	⚠️ Risky	Scope drift, Jules may over-engineer
Greenfield architecture design	❌ Wrong tool	No acceptance criteria to validate against
Real-time pair debugging	❌ Wrong paradigm	Needs synchronous back-and-forth

The honest rule of thumb: if you could write a solid Jira ticket for it, Jules can probably do it.

Jules vs. the Field

Feature	Jules	Claude Code	OpenAI Codex	GitHub Copilot
Execution model	Async (PR delivery)	Sync (interactive terminal)	Async (PR delivery)	Sync (inline suggestion)
Runtime environment	Google Cloud VM	Local / container	Cloud sandbox	Editor plugin
Network access in VM	✅ Yes	✅ Yes	❌ No (strict sandbox)	N/A
GitHub integration	Native (issues → PR)	Via CLI	Native	Native
Languages supported	Node, Python, Go, Rust, Java	Any	Node, Python primary	Any
Parallel task execution	✅ Yes	❌ One at a time	✅ Yes	❌ One at a time
CI auto-fix loop	✅ Yes (2026)	❌ No	❌ No	❌ No
Context window	2M tokens	~200K tokens	~128K tokens	~8K tokens
Best for	Delegated ticket work	Complex collaborative tasks	Security-sensitive workflows	Inline acceleration

What Jules Gets Right — and Where It's Still Incomplete

What's working:

The async PR model genuinely removes you from low-value execution loops
CI integration with auto-fix is a real quality-of-life improvement for teams
Multi-language runtime support (Node, Python, Go, Rust, Java) is broader than most competitors
The CLI and Gemini CLI extension make it composable into existing dev workflows
2M token context means Jules can reason across large codebases without truncation

What's still incomplete:

Jules validates against tests — codebases with thin coverage expose the reviewer to unknown unknowns
The debugging story for multi-agent ADK (Agent Development Kit) workflows is thin; distributed AI agent observability is largely unsolved
Spec quality gates: Jules has no way to flag an underspecified task before burning compute on it
For exploratory or greenfield work, you still need a synchronous collaborator
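Since underspecified tasks are not flagged before compute is spent, one option is a cheap pre-flight check on your side before submitting. The sketch below is a rough heuristic under stated assumptions: the file-extension pattern, the keyword list, and the word threshold are all hypothetical and should be tuned to your own ticket conventions.

```python
import re

# Hypothetical heuristics, not part of any Jules tooling.
PATH_RE = re.compile(r"\b[\w./-]+\.(?:py|go|rs|java|js|ts)\b")
ACCEPTANCE_WORDS = ("test", "should", "must", "expect", "regression")


def lint_task_spec(description: str, min_words: int = 12) -> list[str]:
    """Return a list of warnings; an empty list means no obvious gaps."""
    warnings = []
    text = description.strip()
    if len(text.split()) < min_words:
        warnings.append("description is very short")
    if not PATH_RE.search(text):
        warnings.append("no file or module path mentioned")
    if not any(word in text.lower() for word in ACCEPTANCE_WORDS):
        warnings.append("no acceptance criteria (tests, expected behaviour)")
    return warnings


good = (
    "Fix the race condition in payment/processor.go: two concurrent requests "
    "can double-charge. Add regression tests covering the concurrent case."
)
print(lint_task_spec(good))
print(lint_task_spec("Clean up the auth code"))
```

A check like this cannot judge whether a spec is correct. It only catches tasks that are missing the basics a reviewer would ask for anyway.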

The Bigger Picture: What This Means for SWEs

Jules isn't a replacement for engineers. It's a redefinition of what "engineering work" means at the margin.

The value of a senior engineer is increasingly not in the ability to implement — it's in:

Writing specs precise enough that an agent can execute them
Reviewing AI-generated PRs for correctness, design quality, and unintended side effects
Knowing when to reach for async delegation vs. interactive collaboration
Building and maintaining the test coverage and CI infrastructure that makes async agents safe to trust

Google I/O 2026 framed this explicitly: the engineers who get the most from agentic coding will be those who run both patterns in parallel — async for ticket-level work, sync for exploration — not those who pick a favorite.

Jules is a real tool for real workflows right now. If you have a backlog of well-scoped tasks and a codebase with decent test coverage, it's worth spinning up.

Amazon Quick: AWS's Agentic Workspace, Explained for Engineers

Jubin Soni — Thu, 21 May 2026 03:47:40 +0000

AWS has been building agentic infrastructure for some time now — Bedrock, AgentCore, Strands — mostly aimed at engineers who want to build their own agent systems from scratch. Amazon Quick is a different layer of the same bet: a ready-to-use agentic workspace that targets teams directly, without requiring custom orchestration code.

This article walks through what Quick is, how its components fit together technically, how the MCP integration model works with real code, and where it sits relative to the rest of AWS's agent stack.

What Amazon Quick Is

Amazon Quick is an AI assistant for work that connects to your existing tools (Slack, Microsoft Teams, Outlook, CRMs, databases, and local files) and gives you a unified layer for querying, automating, and acting across them. It launched in preview at AWS's "What's Next with AWS" event on April 28, 2026.

The product is aimed at teams, not just individual users. One person can build a custom agent scoped to a specific dataset or workflow, and the whole team benefits from it. Responses from Quick agents are grounded in your actual business data, not the underlying model's training distribution.

Under the hood, Quick is built on Amazon Bedrock AgentCore and uses the Model Context Protocol (MCP) as its standard for connecting to external tools. It integrates with AWS IAM and VPC, which means it inherits the same security and compliance posture as the rest of your AWS workloads.

Product Components

Quick bundles five distinct capabilities. It helps to understand each one separately before thinking about how they compose.

Component	What it does
Spaces	Collaborative workspaces where teams pool files, dashboards, and data sources. Agents in a Space are grounded in that Space's data.
Agents	Custom, domain-scoped agents built on your team's specific data. One person builds, everyone uses.
Research	Multi-source synthesis across internal data, the public web, and third-party datasets. Produces structured reports.
Visualize (Quick Sight)	Integrated BI layer. Conversational access to dashboards, charts, and forecasting — no separate BI tool required.
Automate (Quick Flows)	Workflow automation from simple daily tasks to complex multi-step processes with cross-app action execution.

Each component is available through the web app, mobile, and a native desktop app (currently in preview for macOS and Windows) that can read local files and calendar context without requiring browser access.

Where Quick Sits in the AWS Agent Stack

AWS is building in two directions at once. AgentCore is the infrastructure layer for engineers who want to compose their own agent systems — runtime, memory, gateway, observability — with any model and any framework. Quick is the product layer on top: opinionated, team-facing, and deployable without writing orchestration code.

The practical implication: if you're an engineer building internal tools or automation pipelines, you'll likely interact with both layers. AgentCore for the infrastructure wiring; Quick as a surface where non-technical teammates interact with the agents you build.

The Integration Architecture

The core question for any engineer evaluating Quick is: how does it actually connect to external systems, and what does the request path look like?

Quick uses MCP (Model Context Protocol) as its primary integration standard. This is significant because MCP is an open protocol — it means Quick agents are not locked into AWS-specific connectors, and any MCP-compatible server can be registered as a tool source.

High-Level Request Flow

The sequence below shows the full lifecycle of a single agent-triggered tool call — from the moment Quick receives a prompt through to the response returning from a downstream API.

Quick acts as the MCP client. Your MCP server exposes tools via listTools and callTool. Quick discovers them at registration time and makes them available to any agent or automation in the workspace. Authentication flows through OAuth 2.0, with support for Dynamic Client Registration (DCR) so Quick can register itself automatically without manual credential setup.
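On the wire, MCP is JSON-RPC 2.0: listing tools and calling a tool correspond to the `tools/list` and `tools/call` methods. The sketch below builds the two request payloads a client sends and unpacks a text result. It shows the protocol shape only, not Quick's internal client, and the ticket key and request ids are made-up example values.

```python
import itertools
import json
from typing import Optional

_ids = itertools.count(1)


def jsonrpc_request(method: str, params: Optional[dict] = None) -> dict:
    """Build a JSON-RPC 2.0 request envelope as used by MCP."""
    msg = {"jsonrpc": "2.0", "id": next(_ids), "method": method}
    if params is not None:
        msg["params"] = params
    return msg


# Discovery, then a single tool invocation
list_req = jsonrpc_request("tools/list")
call_req = jsonrpc_request(
    "tools/call",
    {"name": "get_ticket", "arguments": {"issue_key": "ENG-1234"}},
)


def extract_text(response: dict) -> str:
    """Join the text parts of a tools/call result; raise on a JSON-RPC error."""
    if "error" in response:
        raise RuntimeError(response["error"].get("message", "unknown error"))
    parts = response["result"].get("content", [])
    return "\n".join(p["text"] for p in parts if p.get("type") == "text")


print(json.dumps(call_req, indent=2))
```

The `arguments` object must validate against the `inputSchema` the server advertised for that tool, which is why precise schemas and descriptions matter for selection accuracy.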

Building an MCP Server for Quick

Here is a minimal Python MCP server using the mcp SDK that exposes two tools Quick can invoke — get_ticket and list_open_tickets. This pattern works whether you host the server yourself or run it on AgentCore Runtime.

Install dependencies

pip install mcp httpx uvicorn starlette

Server implementation

# server.py
from mcp.server import Server
from mcp.server.sse import SseServerTransport
from mcp.types import Tool, TextContent
import httpx
import json
from starlette.applications import Starlette
from starlette.routing import Route

app = Server("jira-quick-integration")

JIRA_BASE_URL = "https://yourorg.atlassian.net"
JIRA_TOKEN    = "Bearer <your-token>"  # in production, load from AWS Secrets Manager


@app.list_tools()
async def list_tools() -> list[Tool]:
    return [
        Tool(
            name="get_ticket",
            description="Retrieve details for a single Jira ticket by issue key.",
            inputSchema={
                "type": "object",
                "properties": {
                    "issue_key": {
                        "type": "string",
                        "description": "The Jira issue key, e.g. ENG-1234"
                    }
                },
                "required": ["issue_key"]
            }
        ),
        Tool(
            name="list_open_tickets",
            description="List open Jira tickets assigned to a given user.",
            inputSchema={
                "type": "object",
                "properties": {
                    "assignee": {
                        "type": "string",
                        "description": "The Jira username or email of the assignee"
                    }
                },
                "required": ["assignee"]
            }
        )
    ]


@app.call_tool()
async def call_tool(name: str, arguments: dict) -> list[TextContent]:
    headers = {"Authorization": JIRA_TOKEN, "Content-Type": "application/json"}

    async with httpx.AsyncClient() as client:
        if name == "get_ticket":
            key = arguments["issue_key"]
            resp = await client.get(
                f"{JIRA_BASE_URL}/rest/api/3/issue/{key}",
                headers=headers
            )
            resp.raise_for_status()
            data = resp.json()
            summary = data["fields"]["summary"]
            status  = data["fields"]["status"]["name"]
            return [TextContent(type="text", text=f"{key}: {summary} [{status}]")]

        elif name == "list_open_tickets":
            assignee = arguments["assignee"]
            # Quote the value so emails and names with spaces form valid JQL
            jql = f'assignee = "{assignee}" AND status != Done ORDER BY updated DESC'
            resp = await client.get(
                f"{JIRA_BASE_URL}/rest/api/3/search",
                headers=headers,
                params={"jql": jql, "maxResults": 20}
            )
            resp.raise_for_status()
            issues = resp.json().get("issues", [])
            results = [
                f"{i['key']}: {i['fields']['summary']}"
                for i in issues
            ]
            return [TextContent(type="text", text="\n".join(results) or "No open tickets found.")]

    raise ValueError(f"Unknown tool: {name}")


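# Wire up SSE transport for Quick compatibility
from starlette.responses import Response
from starlette.routing import Mount

sse = SseServerTransport("/messages/")

async def handle_sse(request):
    # Long-lived GET: streams server-to-client messages for one session
    async with sse.connect_sse(
        request.scope, request.receive, request._send
    ) as streams:
        await app.run(streams[0], streams[1], app.create_initialization_options())
    # Starlette expects the endpoint to return a response once the stream closes
    return Response()

starlette_app = Starlette(
    routes=[
        Route("/sse", endpoint=handle_sse),
        # Client-to-server messages are POSTed here; without this mount, tool calls never arrive
        Mount("/messages/", app=sse.handle_post_message),
    ]
)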

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(starlette_app, host="0.0.0.0", port=8080)

A few design constraints to be aware of when building for Quick:

Each MCP tool call has a 300-second hard timeout. Operations that exceed this fail with HTTP 424. Keep individual tool calls narrow and fast.
The tool list is treated as static after registration. If you add or remove tools on the server, the Quick admin must re-establish the connection to pick up changes.
Quick supports both Server-Sent Events (SSE) and streamable HTTP as transports. Streamable HTTP is preferred for new implementations.
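One way to stay inside the timeout described above is to give each tool handler its own, shorter budget and return a readable message when it is exceeded. This is a minimal sketch using `asyncio.wait_for`. The 240-second budget is an assumed safety margin, and `slow_lookup` is a stand-in for a real downstream call.

```python
import asyncio

TOOL_BUDGET_S = 240.0  # assumed safety margin under the 300-second limit


async def run_with_budget(coro, budget_s: float = TOOL_BUDGET_S) -> str:
    """Await a tool coroutine, returning a short message if it overruns."""
    try:
        return await asyncio.wait_for(coro, timeout=budget_s)
    except asyncio.TimeoutError:
        return f"Timed out after {budget_s:g}s; narrow the query and retry."


async def slow_lookup(delay_s: float) -> str:
    # Stand-in for a downstream API call
    await asyncio.sleep(delay_s)
    return "ok"


print(asyncio.run(run_with_budget(slow_lookup(0.01), budget_s=1.0)))
print(asyncio.run(run_with_budget(slow_lookup(1.0), budget_s=0.05)))
```

Returning a text message rather than letting the call hit the hard limit gives the agent something it can act on, such as retrying with a narrower query.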

Registering the MCP Server in Quick

Once your server is running and publicly reachable over HTTPS, registration in Quick takes the following path:

Quick Console → Integrations → Add Integration → MCP

Fields:
  Server URL:        https://your-mcp-server.example.com/sse
  Auth type:         OAuth 2.0 (or Service, or None)
  Client ID:         <from your identity provider>
  Authorization URL: https://auth.example.com/oauth/authorize
  Token URL:         https://auth.example.com/oauth/token

If your identity provider supports OAuth Dynamic Client Registration, Quick will auto-register and you skip the manual client ID step entirely. Quick sends an initial unauthenticated request to the MCP server; if it receives a 401 with a WWW-Authenticate header containing a resource_metadata URL, it fetches the metadata document and proceeds with DCR automatically.
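The discovery step described above hinges on reading one parameter out of the 401 challenge. A minimal sketch of that parsing step, from the client's point of view, is shown below. The header value and URL are example data, and the function name is hypothetical.

```python
import re
from typing import Optional

_PARAM_RE = re.compile(r'resource_metadata\s*=\s*"([^"]+)"')


def resource_metadata_url(status: int, headers: dict[str, str]) -> Optional[str]:
    """Return the resource_metadata URL from a 401 challenge, if present."""
    if status != 401:
        return None
    # HTTP header names are case-insensitive
    challenge = next(
        (v for k, v in headers.items() if k.lower() == "www-authenticate"), ""
    )
    match = _PARAM_RE.search(challenge)
    return match.group(1) if match else None


example_headers = {
    "WWW-Authenticate": 'Bearer resource_metadata="https://mcp.example.com/.well-known/oauth-protected-resource"'
}
print(resource_metadata_url(401, example_headers))
```

If you operate the MCP server, the practical takeaway is the reverse: unauthenticated requests must answer with a 401 and a challenge carrying this parameter, otherwise automatic registration has nothing to discover.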

Once registered, Quick calls listTools at startup and exposes every discovered tool to agents and automations in the workspace.

The AgentCore Gateway Option

For teams that don't want to write and operate an MCP server from scratch, Amazon Bedrock AgentCore Gateway provides a managed alternative. You point Gateway at a Lambda function or an OpenAPI spec, and it handles the MCP wrapping, auth, logging, and semantic tool discovery automatically. If you use it, Quick never calls your internal APIs directly — everything flows through Gateway's auth and routing layer, as shown in the sequence diagram above.

The semantic search capability is worth noting specifically. When an agent has access to dozens or hundreds of tools, passing the full tool list on every turn wastes context and causes the model to pick the wrong tool. Gateway's built-in x_amz_bedrock_agentcore_search tool lets Quick find the right tool by semantic similarity rather than scanning the entire registry each turn.
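The idea of narrowing the tool list by similarity can be shown with a toy example. The sketch below ranks tools by bag-of-words cosine similarity between the query and each tool description. It is a deliberately simple stand-in and not how Gateway's search tool is implemented, and `get_invoice` is a hypothetical third tool added for contrast.

```python
import math
import re
from collections import Counter


def _vec(text: str) -> Counter:
    """Lowercased word counts for a piece of text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def top_tools(query: str, tools: dict[str, str], k: int = 2) -> list[str]:
    """Rank tool names by similarity between the query and each description."""
    q = _vec(query)
    ranked = sorted(tools, key=lambda name: _cosine(q, _vec(tools[name])), reverse=True)
    return ranked[:k]


TOOLS = {
    "get_ticket": "Retrieve details for a single Jira ticket by issue key.",
    "list_open_tickets": "List open Jira tickets assigned to a given user.",
    "get_invoice": "Fetch a billing invoice by invoice number.",
}
print(top_tools("which open tickets are assigned to maria", TOOLS, k=1))
```

Only the top-ranked tools then need to be placed in the model's context for that turn. Word overlap misses synonyms, which is the gap embedding-based search is meant to close.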

Practical Considerations

A few things worth keeping in mind before integrating:

Tool scope matters. When agents are given too many tools simultaneously, selection accuracy degrades — the model reasons over too many options per turn and picks incorrectly more often. Keeping each agent or MCP server to a focused set of 3–5 tools produces better results than exposing everything through one endpoint. This is a known pattern in multi-agent architectures and applies equally to Quick agents.

The 300-second timeout is real. Design each tool call to complete a single, bounded operation. Avoid chaining multiple downstream API calls inside a single tool invocation. If you need a multi-step workflow, model it as separate tools and let the agent orchestrate the sequence.

Local context on the desktop app. The desktop app reads local files and calendar events directly, without upload. For engineers who work primarily in terminals and local editors, this is a meaningful integration point — meeting context, local documentation, and recent file changes are all available to the assistant without any configuration.

MCP interoperability. Because Quick uses MCP as the standard, the same MCP server you build for Quick can also be consumed by Claude Code, Amazon Q Developer, and other MCP-compatible clients. The integration contract is portable.

Run Gemma 4 on Your Laptop — A Hands-On Guide to Google's Latest Open Multimodal LLM

Jubin Soni — Fri, 15 May 2026 02:36:05 +0000

If you've been watching the open-source LLM space, you've probably noticed it's been a great couple of years. Llama, Mistral, Phi, Qwen — a whole zoo of models you can download and run on your own machine. Google's entry into that zoo is Gemma, and the fourth generation, Gemma 4 (released April 2, 2026), is the biggest leap yet: built from Gemini 3 research, multimodal (text + image + video + audio), 256K context, native function calling, configurable "thinking mode," and — finally — a clean Apache 2.0 license.

In this post we're going to:

Understand what Gemma 4 actually is, with an architecture diagram
Get it running on your laptop with Ollama in about 5 minutes
Chat with it from the terminal
Send it an image and ask questions about it
Turn on thinking mode for harder problems
Call it from a Python script like a real API
Build a small project that glues it all together

No GPU rental, no API keys, no telemetry. Let's go.

Heads up: This guide assumes zero ML background. If you can install software and run a terminal command, you can do this.

What is Gemma 4?

Gemma is Google DeepMind's family of open-weight language models. "Open-weight" means the actual neural network weights — the giant matrices of numbers that make the model work — are freely downloadable. You can run them, modify them, fine-tune them, ship them in your product.

Gemma 4 brings several big changes over Gemma 3:

Apache 2.0 license. Earlier Gemma releases used a custom license with a Prohibited Use Policy that made some enterprise legal teams nervous. Gemma 4 is plain Apache 2.0 — unlimited commercial use, no MAU caps, no special permissions. This alone is a big deal for production deployments.
Mixture-of-Experts. A new 26B MoE variant activates only ~4B parameters per token, giving you 13B-class quality at 4B-class cost.
Thinking mode. A configurable reasoning mode where the model thinks step-by-step before answering. Toggle it on for hard problems, off for fast chat.
Native function calling. Built-in support for structured tool use — write an agent without needing prompt engineering hacks.
More modalities. Image, video frames, and (on the smaller E2B/E4B models) native audio input. Native system prompt support too.
Bigger context. 128K on the small models, 256K on the larger ones.

Model sizes at a glance

Model	Disk (Ollama)	Active params	Total params	Multimodal	Context	Best for
E2B	~7.2 GB	~2B	~2.3B	text + image + audio	128K	Phones, edge devices, browser
E4B	~9.6 GB	~4B	~4.5B	text + image + audio	128K	Most laptops — the sweet spot
26B A4B (MoE)	~18 GB	~4B	26B	text + image	256K	Consumer GPUs, agentic workloads
31B Dense	~20 GB	31B	31B	text + image	256K	Workstations, highest-quality answers

Two naming notes worth understanding:

E2B / E4B. The "E" stands for Effective parameters. These are dense edge-first models that use a trick called Per-Layer Embeddings (PLE — more on this below) to do more with fewer active parameters.
26B A4B. This is the Mixture-of-Experts model. 26B parameters total, but only ~4B "activate" per forward pass. Latency and cost behave like a 4B model; quality is closer to a 13B dense model. Caveat: you still need to load all 26B into memory.
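The memory caveat for the MoE model comes down to simple arithmetic: weight memory scales with total parameters, while per-token compute scales with active parameters. A rough sketch follows. The quantization levels are assumptions, and the estimate covers weights only, ignoring the KV cache and runtime overhead.

```python
def weight_memory_gb(total_params_b: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes).

    total_params_b is the parameter count in billions.
    """
    return total_params_b * bits_per_weight / 8


for bits in (16, 8, 4):
    print(f"26B total @ {bits}-bit: ~{weight_memory_gb(26, bits):.0f} GB")
print(f"4B active @ 4-bit: ~{weight_memory_gb(4, 4):.0f} GB of weights used per token")
```

Even at 4-bit precision the full 26B set needs roughly 13 GB before any overhead, which is why the MoE model is sized like a large model for memory and like a small one for speed.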

For most readers on a laptop, E4B is the right starting point. It runs comfortably on a 16 GB Mac or any modern dev machine.

Gemma 4 vs the rest of the open-model zoo (May 2026)

Model	Sizes	Multimodal	Context	License
Gemma 4	E2B / E4B / 26B MoE / 31B	text + image + video + audio (small)	128K / 256K	Apache 2.0
Llama 4	various	text + image	128K+	Llama community license
Qwen 3.5	various	text + image	128K+	Apache 2.0
DeepSeek V4 Flash	MoE	text	128K	MIT

Gemma 4's pitch: the only family that spans phones to servers under Apache 2.0, with multimodal and audio in the same release.

The architecture (in plain English)

You don't need this section to use Gemma 4 — feel free to skip to the install steps. But if you've ever wondered what's actually happening when a multimodal model "sees" and "hears," here it is.

A few pieces worth understanding:

Three input paths. Text goes through a SentencePiece tokenizer (shared with Gemini). Images go through a vision encoder that handles variable aspect ratios and resolutions natively (no more square-only inputs like Gemma 3). On the E2B and E4B models, audio goes through a USM-style conformer encoder borrowed from Gemma 3n. All three paths produce tokens that get interleaved in a single stream — so you can freely mix text, images, and audio in any order in one prompt.
Alternating local/global attention. Most layers only look at a sliding window of recent tokens (cheap). A subset of layers attend to the full context (expensive but rare). This is the standard trick for keeping the KV cache from blowing up at 256K context.
Per-Layer Embeddings (PLE) — the small-model secret. In a normal transformer, each token gets one embedding vector at input and that's all the residual stream has to work with. PLE adds a parallel pathway: for each token, every layer gets its own small conditioning vector from a lookup table. The embedding tables are large (lots of memory) but the "active" parameters per token stay small — that's why a 4-billion-active-parameter E4B can punch above its weight.
Mixture-of-Experts (26B A4B). The MoE layer has multiple "expert" feed-forward networks. A small router picks a few of them (2 of 8, or similar) for each token. All 26B parameters stay loaded in memory, but only ~4B of them run per token, so compute per token is close to that of a 4B dense model.
Thinking mode. When you include the special <|think|> token at the start of the system prompt, the model emits internal reasoning between <|channel>thought\n...<channel|> markers before the final answer. Disable it for fast chat; enable it for math, code, multi-step reasoning.
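The Mixture-of-Experts routing described in the list above can be sketched in a few lines. This is a scalar toy, not Gemma 4's actual router: real routers score vectors with learned weights, and the expert count and top-2 choice here simply follow the "2 of 8 (or similar)" example.

```python
import math


def top_k_route(router_logits: list[float], k: int = 2) -> dict[int, float]:
    """Pick the k highest-scoring experts and renormalise their weights.

    Returns {expert_index: weight}; the weights sum to 1.
    """
    top = sorted(
        range(len(router_logits)), key=lambda i: router_logits[i], reverse=True
    )[:k]
    m = max(router_logits[i] for i in top)  # subtract the max for numerical stability
    exps = {i: math.exp(router_logits[i] - m) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def moe_layer(x: float, experts, router_logits: list[float], k: int = 2) -> float:
    """Weighted sum of only the selected experts' outputs; the rest never run."""
    return sum(w * experts[i](x) for i, w in top_k_route(router_logits, k).items())


# Eight toy "experts": expert i multiplies its input by i.
experts = [lambda x, i=i: x * i for i in range(8)]
logits = [0.1, 2.0, -1.0, 0.3, 2.0, 0.0, -0.5, 1.0]
print(top_k_route(logits))
print(moe_layer(10.0, experts, logits))
```

Six of the eight experts are never called for this input, which is where the compute saving comes from. All eight still have to exist in memory because a different token may route to any of them.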

That's most of what's worth knowing. Now let's actually run it.

Step 1: Install Ollama

There are a few ways to run Gemma 4 locally, but the easiest by a mile is Ollama. Think of it as "Docker for LLMs" — it handles downloading the model, managing memory, GPU acceleration, and exposing a local API. You don't have to think about CUDA versions or PyTorch.

Install it:

macOS / Windows: Download the installer at ollama.com/download and run it.
Linux:

  curl -fsSL https://ollama.com/install.sh | sh

Verify:

ollama --version

You should see a version number. Gemma 4 requires Ollama v0.20.0 or later — if you're on an older version, update first.

Step 2: Pull a Gemma 4 model

Download the default (E4B, ~9.6 GB):

ollama pull gemma4

This downloads about 9.6 GB. Grab a coffee. ☕

Other sizes if you want them:

ollama pull gemma4:e2b   # ~7.2 GB — smallest, for low-RAM machines
ollama pull gemma4:e4b   # ~9.6 GB — the default; same as `gemma4`
ollama pull gemma4:26b   # ~18 GB  — the MoE; 256K context
ollama pull gemma4:31b   # ~20 GB  — biggest dense model

Hardware reality check: On Apple Silicon, 16 GB unified memory handles E4B comfortably. NVIDIA users need the model to fit entirely in VRAM for GPU-accelerated inference. The 26B model fits on 24 GB but leaves very little headroom — treat it as the ceiling, not the target.

List what you've got:

ollama list

Step 3: Chat with it in the terminal

Easiest possible test:

ollama run gemma4

You'll get an interactive prompt:

>>> Explain what a hash map is, like I'm a junior dev.

Hit enter and watch it stream a response. To exit, type /bye.

That's it. You're running a state-of-the-art LLM locally with zero cloud dependency. Try:

"Write a Python function that finds duplicates in a list, with three different approaches and their tradeoffs."
"What's the difference between TCP and UDP? Use an analogy."
"Translate 'Where is the nearest train station?' into Japanese, Spanish, and Hindi."

Step 4: Send it an image

Gemma 4 can see. Drop any image file in your current directory, then:

ollama run gemma4
>>> Describe what's in this image: ./screenshot.png

Ollama loads the image, sends it through the vision encoder, and the model answers. Unlike Gemma 3 (which resized everything to 896×896), Gemma 4 handles variable aspect ratios and resolutions natively — so tall screenshots, wide diagrams, and high-res photos all work without manual cropping.

Try:

"What error is shown in this screenshot?" (paste a stack trace)
"What's the bounding box for the 'submit' button in this UI?" (Gemma 4 will answer in JSON — natively!)
"Read the handwriting in this note and transcribe it."

Step 5: Turn on thinking mode

For harder problems — multi-step math, complex code, logic puzzles — turn on thinking mode. Include the <|think|> token at the very start of your system prompt:

ollama run gemma4
>>> /set system "<|think|>You are a careful, methodical assistant."
>>> Three friends split a $73.42 dinner bill. Alice had a $12 appetizer, Bob had a $9 drink. The rest is shared. What does everyone pay?

The model will emit its reasoning in a <|channel>thought\n...<channel|> block before the final answer. For fast chat, leave the token out and the model answers directly.

🧠 When to use it: Code generation, math, multi-hop reasoning, agentic planning — yes. Single-turn factual questions, summarization, translation — no, it just adds latency.
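If your client hands you the raw completion text, you will usually want to separate the reasoning from the final answer before showing it to a user. A minimal sketch, assuming the thought markers appear exactly as described above; the function name is hypothetical.

```python
import re

# Markers as described in this article; adjust if your runtime emits different ones.
THOUGHT_RE = re.compile(r"<\|channel>thought\n(.*?)<channel\|>", re.DOTALL)


def split_thinking(raw: str) -> tuple[str, str]:
    """Return (reasoning, answer) from a raw completion."""
    match = THOUGHT_RE.search(raw)
    if not match:
        return "", raw.strip()
    reasoning = match.group(1).strip()
    answer = (raw[: match.start()] + raw[match.end():]).strip()
    return reasoning, answer


raw = (
    "<|channel>thought\nShared part: 73.42 - 12 - 9 = 52.42\n<channel|>"
    "Each person pays about 17.47 of the shared part."
)
reasoning, answer = split_thinking(raw)
print("reasoning:", reasoning)
print("answer:", answer)
```

Keeping the reasoning around is useful for debugging prompts, even if you only display the answer.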

Step 6: Call Gemma 4 from Python

A chat prompt is nice, but you're a developer — you want to call this thing from code. When Ollama is running, it exposes a local REST API on http://localhost:11434. There's also an official Python client.

Install it:

pip install ollama

Basic chat

import ollama

response = ollama.chat(
    model="gemma4",
    messages=[
        {"role": "system", "content": "You are a senior code reviewer. Be concise and direct."},
        {"role": "user",   "content": "Review this code:\n\ndef add(a, b):\n    return a+b"},
    ],
)

print(response["message"]["content"])

Streaming responses (ChatGPT-style)

import ollama

stream = ollama.chat(
    model="gemma4",
    messages=[{"role": "user", "content": "Write a haiku about debugging."}],
    stream=True,
)

for chunk in stream:
    print(chunk["message"]["content"], end="", flush=True)

Sending an image

import ollama

response = ollama.chat(
    model="gemma4",
    messages=[{
        "role": "user",
        "content": "What's in this image?",
        "images": ["./my_photo.jpg"],
    }],
)

print(response["message"]["content"])

Thinking mode + function calling (the agentic combo)

This is where Gemma 4 actually starts feeling like a "real" agent. You declare your tools as JSON schemas, the model decides when to call them, and you execute the call and pass results back. No prompt engineering hacks needed.

import ollama

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name, e.g. 'Tokyo'"},
            },
            "required": ["city"],
        },
    },
}]

def get_weather(city: str) -> str:
    # Pretend this hits a real API.
    return f"{city}: 22°C, partly cloudy"

messages = [
    {"role": "system", "content": "<|think|>You are a helpful weather assistant."},
    {"role": "user",   "content": "Should I bring an umbrella in Tokyo today?"},
]

response = ollama.chat(model="gemma4", messages=messages, tools=tools)

# If the model wants to call a tool, execute it and feed the result back.
tool_calls = response["message"].get("tool_calls") or []
if tool_calls:
    # Keep the assistant turn (with its tool calls) in the history.
    messages.append(response["message"])
    for tool_call in tool_calls:
        name = tool_call["function"]["name"]
        args = tool_call["function"]["arguments"]
        if name == "get_weather":
            # Each tool result goes back as its own "tool" message.
            messages.append({"role": "tool", "content": get_weather(**args), "tool_name": name})
    # Send the full history (system prompt included) so the model can finalize its answer.
    followup = ollama.chat(model="gemma4", messages=messages, tools=tools)
    print(followup["message"]["content"])
else:
    # No tool needed: the first response already contains the answer.
    print(response["message"]["content"])
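
Once you have more than one tool, the `if name == ...` check turns into a chain of branches. A minimal sketch of a dispatch table follows, assuming tool calls arrive in the same `{"function": {"name", "arguments"}}` shape used above. `TOOL_REGISTRY` and `run_tool_calls` are illustrative names, not part of the Ollama client.

```python
from typing import Any, Callable

def get_weather(city: str) -> str:
    # Stand-in for a real API call, as in the example above.
    return f"{city}: 22°C, partly cloudy"

# Hypothetical registry mapping tool names to Python callables.
TOOL_REGISTRY: dict[str, Callable[..., str]] = {"get_weather": get_weather}

def run_tool_calls(tool_calls: list[dict[str, Any]]) -> list[tuple[str, str]]:
    """Execute each requested tool and return (tool name, result text) pairs."""
    results = []
    for call in tool_calls:
        name = call["function"]["name"]
        args = call["function"]["arguments"]
        handler = TOOL_REGISTRY.get(name)
        if handler is None:
            # Report unknown tools back to the model instead of raising.
            results.append((name, f"Error: unknown tool '{name}'"))
        else:
            results.append((name, handler(**args)))
    return results
```

Each returned pair can then be turned into a "tool" message and appended to the conversation before the follow-up call. Adding a tool becomes one new entry in the registry.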

Raw HTTP (no Python client needed)

For any other language:

curl http://localhost:11434/api/chat -d '{
  "model": "gemma4",
  "messages": [{"role": "user", "content": "Hello!"}],
  "stream": false
}'

Same JSON shape works from Node, Go, Rust, your shell — anything that can make an HTTP request.
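
To make the "same JSON shape" point concrete, here is a sketch that sends the curl request's body using only the Python standard library, with no `ollama` package installed. The helper names `build_chat_request` and `chat` are made up for this example, and it assumes the non-streaming response carries the reply under `message.content`, as the client examples above do.

```python
import json
import urllib.request

OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"

def build_chat_request(prompt: str, model: str = "gemma4") -> urllib.request.Request:
    """Build (but do not send) a POST request with the same JSON body as the curl example."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }
    return urllib.request.Request(
        OLLAMA_CHAT_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def chat(prompt: str, model: str = "gemma4") -> str:
    """Send the request to a running Ollama server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, model)) as resp:
        body = json.load(resp)
    return body["message"]["content"]
```

With Ollama running locally, `print(chat("Hello!"))` should print the model's reply. Building the request separately from sending it lets you inspect the payload without a server.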

A small project: folder-watching image describer

Here's a useful ~30-line script. It watches a folder, and any new image dropped in gets automatically described by Gemma 4. Great for accessibility tools, content moderation prototypes, or just learning.

import os, time
import ollama

WATCH_DIR = "./inbox"
os.makedirs(WATCH_DIR, exist_ok=True)
SEEN = set(os.listdir(WATCH_DIR))

print(f"📁 Watching {WATCH_DIR}/ — drop an image in to describe it.")
print("   (Ctrl+C to stop)\n")

IMAGE_EXTS = (".png", ".jpg", ".jpeg", ".webp", ".gif")

try:
    while True:
        current = set(os.listdir(WATCH_DIR))
        new_files = sorted(current - SEEN)

        for filename in new_files:
            if not filename.lower().endswith(IMAGE_EXTS):
                continue

            path = os.path.join(WATCH_DIR, filename)
            print(f"📸 New image: {filename}")

            response = ollama.chat(
                model="gemma4",
                messages=[{
                    "role": "user",
                    "content": (
                        "Describe this image in 2-3 sentences. "
                        "Mention any visible text. Be specific."
                    ),
                    "images": [path],
                }],
            )

            print(f"   → {response['message']['content']}\n")

        SEEN = current
        time.sleep(2)
except KeyboardInterrupt:
    print("\n Stopped.")

Run it, drag images into the inbox/ folder, and watch descriptions appear. That's a real, useful, completely local AI tool — written in 30 lines.

Things to know before shipping anything serious

A few honest caveats:

Caveat	Why it matters
Hallucination	Local models still confidently make things up. Don't trust factual claims without verification. Thinking mode reduces this for reasoning tasks but doesn't eliminate it.
CPU latency	Expect 1–3 tokens/sec on a CPU-only laptop with E4B. A GPU gives 3–10× speedup.
Context costs RAM	256K context is real, but actually filling it eats memory. Most use cases need <16K tokens.
MoE memory	The 26B MoE runs fast (only 4B active per token), but you still need to load all 26B into RAM. Don't confuse active params with memory footprint.
Audio is small-model only	E2B/E4B have native audio input. The 26B and 31B models do not.
Apache 2.0 ≠ no responsibilities	The license is permissive, but you're still on the hook for safety, bias, and compliance in whatever you ship.
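
The MoE memory row is easy to check with arithmetic: weight memory is roughly parameter count times bytes per weight. The sketch below uses the 26B total and 4B active figures from the table. The bit widths are illustrative quantization levels, and the estimate ignores the KV cache, activations, and runtime overhead, so treat the numbers as a lower bound.

```python
def weight_memory_gb(total_params_billion: float, bits_per_weight: int) -> float:
    """Approximate memory needed just to hold the weights, in GB (1 GB = 1e9 bytes)."""
    bytes_total = total_params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

for bits in (16, 8, 4):
    resident = weight_memory_gb(26, bits)  # every expert has to be loaded
    active = weight_memory_gb(4, bits)     # weights actually used per token
    print(f"{bits:>2}-bit: load ~{resident:.0f} GB, active per token ~{active:.0f} GB")
```

At 4-bit that is roughly 13 GB resident for the full model against about 2 GB of weights active per token. Speed tracks the second number, while the RAM requirement tracks the first.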

📚 References & further reading

Gemma 4 announcement — Google blog — The launch post (April 2, 2026).
Gemma 4 model overview — Google AI for Developers — Official docs: sizes, capabilities, hardware requirements.
Welcome Gemma 4 — Hugging Face blog — Best technical write-up: covers PLE, MoE, USM audio encoder, benchmarks, and code samples.
Gemma 4 model card on Hugging Face — E4B instruct model weights and configuration.
Gemma 4 Complete Guide 2026 — dev.to — Community guide with architecture details and competitor comparisons.
SigLIP (Zhai et al., 2023) — The vision encoder family Gemma's image path builds on.
Mixture-of-Experts (Shazeer et al., 2017) — The original sparsely-gated MoE paper. The 26B A4B is a direct descendant.
Switch Transformer (Fedus et al., 2021) — Modern MoE techniques.
Llama 4 — Meta's competing open-weight family.

AWS Kiro: The Agentic IDE That Makes Specs the Unit of Work

Jubin Soni — Mon, 11 May 2026 22:53:13 +0000

The agentic IDE space has gotten crowded fast. Cursor, Claude Code, Copilot, Windsurf — they all share the same core model: you type a prompt, the AI writes some code, you iterate. It works well for prototyping. It breaks down when you're building production systems on a large codebase with a team of more than one.

AWS Kiro takes a different bet. Instead of chat-first, it's spec-first. The unit of work isn't a prompt — it's a structured specification that the agent uses to plan, implement, verify, and document your feature end to end. That's a meaningful philosophical difference, and in practice it changes what the tool is useful for.

Here's what Kiro actually is, how its core concepts fit together, and an honest take on when it makes sense over the alternatives.

What Kiro Is

Kiro launched from AWS in mid-2025 and is built on top of Amazon Bedrock, routing between Claude Sonnet for reasoning-heavy work and Amazon Nova for high-throughput code generation. It ships in three forms:

Kiro IDE — a VS Code-compatible editor (built on Code OSS, so you can import your existing themes, keybindings, and Open VSX plugins)
Kiro CLI — the same agent in your terminal, useful for SSH sessions or scripted workflows
Kiro Autonomous Agent — a background agent that picks up tasks, implements them, and opens PRs without you sitting in the loop

You don't need an AWS account to get started — you can sign in with GitHub or Google. The IDE feels immediately familiar if you've used VS Code, which removes one of the usual adoption barriers for new tooling.

In January 2026, AWS also announced the end of Amazon Q Developer for new signups (effective May 15, 2026), explicitly directing users to Kiro as its successor for IDE-based AI assistance. That's a significant signal about where AWS is placing its bets.

The Three Concepts That Make Kiro Different

1. Specs

When you start a new feature in Kiro, you don't jump straight to code. You describe what you want to build, and Kiro generates three structured files:

requirements.md — user stories and acceptance criteria
design.md — system design, component breakdown, data flow
tasks.md — a numbered implementation checklist the agent works through

These become the source of truth. Code is a build artifact of the spec. When you come back to the feature a month later, or hand it to a new team member, the reasoning behind every decision is documented — not in a Confluence page nobody reads, but in the repo next to the code it describes.

This is the thing chat-first tools can't replicate. Cursor or Claude Code can generate excellent code from a good prompt. What they can't do is maintain a structured paper trail of why the code looks the way it does.

2. Hooks

Hooks are event-driven automations that fire when things happen in your workspace — file save, new file created, commit opened. You define what Kiro should do in response, and it runs those actions in the background without you having to think about them.

Common hooks teams set up:

Run the linter and auto-fix on every file save
Regenerate unit tests when implementation files change
Update the relevant section of design.md when a module is modified
Run a security scan before any commit

The practical effect is that a junior developer's output passes the same automated quality bar as a senior's, because the standards are enforced by the environment rather than by code review heroics.

3. Steering Files

Steering files are Markdown files that give Kiro persistent context about your project — your conventions, the libraries you've standardized on, your architecture decisions, your security requirements. You create them once, and Kiro reads them on every interaction without you having to re-explain your stack in every prompt.

They live in two places:

~/.kiro/steering/ — global rules that apply across all your projects
.kiro/steering/ — project-specific overrides checked into the repo

A typical global steering file might say things like "always use TypeScript strict mode," "prefer AWS CDK over raw CloudFormation," or "all Lambda functions must have structured logging with a correlation ID." Project steering files add things like "this service is a multi-tenant SaaS, tenant ID is always passed in the request context."

The result is that Kiro's context isn't reset between sessions and doesn't depend on whoever wrote the last prompt being thorough.

The Hooks + Specs Flywheel

The real power emerges when hooks and specs work together. Here's what that looks like in practice:

You describe a new feature. Kiro generates requirements.md, design.md, and tasks.md.
You review and refine the spec — add an edge case to requirements, adjust the component breakdown in design.
Kiro implements the tasks list, following your steering files for conventions.
On each file save, hooks run: linter, tests, security scan. Issues surface immediately.
When you're done, a hook generates the commit message from the spec diff.
The PR description writes itself from requirements.md.

The spec doesn't go stale because hooks keep it in sync with the code. The code doesn't drift from the design because the design was written before the code. This is what "engineering rigor" means in the context of agentic development — not slower, but structured.

AWS-Native Advantages (and the Honest Tradeoff)

Kiro has deep integration with the AWS ecosystem: CodeCatalyst for repositories and CI/CD, Bedrock for model access, IAM Identity Center for enterprise auth, and "Kiro Powers" — pre-packaged MCP servers for AWS-specific domains like CDK, CloudFormation, pricing, and (recently) HealthOmics workflows.

If your team is already AWS-first, this is a genuine multiplier. Your Kiro agent can query your actual AWS account context, reference live Bedrock documentation, and generate CDK constructs that match your organization's guardrails.

The honest tradeoff: if your team isn't AWS-first, some of this integration feels like overhead rather than lift. Kiro works perfectly well as a general-purpose agentic IDE, and the spec/hooks/steering system has value regardless of your cloud provider, but the ecosystem integrations are clearly designed for AWS shops. Teams running mixed infrastructure (some AWS, some not) can use Kiro for the AWS-native services and keep their existing editor for everything else. The two coexist fine.

How It Compares to the Alternatives

	Kiro	Cursor	Claude Code
Primary paradigm	Spec-driven	Chat-driven	Task-driven (CLI)
Persistent context	Steering files	Rules / `.cursorrules`	CLAUDE.md
Automation	Hooks (event-driven)	Manual	Manual
AWS integration	Native	None	None
IDE	Standalone (VS Code-compatible)	Fork of VS Code	Terminal only
Background agent	Yes (autonomous agent)	Limited	Yes
Best for	Production features, team consistency	Fast prototyping, exploration	Complex refactors, agentic tasks

Kiro and Claude Code aren't direct competitors in practice — Kiro is an IDE product and Claude Code is a terminal agent. Many teams run both, using Kiro for structured feature work and Claude Code for open-ended refactors or one-off tasks.

Getting Started

Download the IDE from kiro.dev — no AWS account required. Sign in with GitHub or Google, point it at an existing repo, and run through the onboarding to import your VS Code settings.

A good first experiment: take a feature you're planning to build anyway, describe it to Kiro, and look at the spec it generates before writing any code. The value of the approach becomes obvious when you see your vague "add user preferences" idea turn into a concrete requirements doc with six acceptance criteria and a data model.

From there:

Create one global steering file in ~/.kiro/steering/ with your language and framework defaults
Set up one hook that runs your linter on file save
Build the feature using the task list Kiro generated

That's the feedback loop that makes the tool click. The full power of the hooks and autonomous agent comes later, but even the basic spec workflow is a meaningful improvement over prompt-and-iterate for anything that takes more than a day to build.

Worth Watching

A few things that make Kiro worth keeping an eye on even if you're not ready to switch:

The spec-as-artifact model is genuinely novel. When agents get better, spec-driven codebases will be better positioned to benefit — the structured requirements and design docs give future agents much richer context than a commit history and some comments.

Kiro Powers (the MCP server marketplace) is growing fast. The HealthOmics extension in February 2026 showed that domain-specific agent packs are a real product direction, not just a demo.

And with Amazon Q Developer sunsetting for new users, AWS is clearly consolidating its developer AI bet onto Kiro. Whatever the roadmap looks like from here, it's going to get resources.

Kiro isn't the right tool for every workflow. If you're prototyping solo or doing exploratory work, the spec-first overhead is friction you don't need. But for teams shipping production features that need to be documented, tested, and maintained — the bet that specs should be the unit of work is a compelling one.

Kiro vs. the Alternatives

Feature	Kiro	Cursor	Claude Code	GitHub Copilot
Primary paradigm	Spec-driven	Chat-driven	Task-driven (CLI)	Inline completion
Persistent context	Steering files	`.cursorrules`	`CLAUDE.md`	`.github/copilot-instructions.md`
Event automation	Hooks (file save, commit)	None	None	None
Structured specs	✅ Native	❌	❌	❌
Background agent	✅ Autonomous agent	Limited	✅	❌
AWS-native integration	✅ Deep	❌	❌	❌
Dynamic MCP loading	✅ Powers	Manual	Manual	❌
IDE base	Code OSS (VS Code compat.)	VS Code fork	Terminal only	Plugin
Free tier	✅	✅	✅	✅

How Spec-Driven Development Works

┌─────────────────────────────────────────────────────────┐
│              YOU: describe a feature                     │
└─────────────────────────┬───────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────┐
│                KIRO GENERATES SPECS                      │
│                                                          │
│  .kiro/specs/my-feature/                                 │
│  ├── requirements.md  ← user stories + EARS notation     │
│  ├── design.md        ← architecture, data flow, APIs    │
│  └── tasks.md         ← ordered implementation plan      │
└─────────────────────────┬───────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────┐
│              YOU: review + refine specs                  │
│  add edge cases, adjust design, approve task list        │
└─────────────────────────┬───────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────┐
│            KIRO IMPLEMENTS task by task                  │
│  guided by steering files + spec context                 │
└─────────────────────────┬───────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────┐
│              HOOKS FIRE AUTOMATICALLY                    │
│  on every file save:                                     │
│  → linter + autofix                                      │
│  → test generation / update                              │
│  → security scan                                         │
│  → design.md sync                                        │
└─────────────────────────┬───────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────┐
│        PR OPENS — description from requirements.md       │
│        commit message generated from spec diff           │
└─────────────────────────────────────────────────────────┘

Steering File Layout

~/.kiro/steering/              ← global, applies to every project
├── typescript.md              "always use strict mode, no any"
├── aws.md                     "prefer CDK over raw CloudFormation"
├── security.md                "IAM roles must follow least privilege"
├── git.md                     "use conventional commits"
└── testing.md                 "80% coverage minimum, jest + RTL"

your-repo/
└── .kiro/
    └── steering/              ← project-specific overrides (checked in)
        ├── architecture.md    "multi-tenant SaaS, one DB schema per tenant"
        ├── api.md             "all endpoints versioned under /v1"
        └── data-model.md      "tenant ID always in request context, never inferred"

Hook Definition Example

# .kiro/hooks/test-sync.yaml
name: Sync Tests on Component Save
trigger:
  event: onSave
  pattern: "src/**/*.tsx"
instructions: |
  When a React component file is saved:
  1. Check if a corresponding test file exists in __tests__/
  2. If not, create one with basic render and snapshot tests
  3. If it exists, update it to cover any new props or exported functions
  4. Run the test file and report failures inline

# .kiro/hooks/security-scan.yaml
name: Pre-commit Security Scan
trigger:
  event: onCommit
instructions: |
  Before every commit:
  1. Scan staged files for hardcoded secrets, API keys, and credentials
  2. Check for any 0.0.0.0/0 ingress rules in IaC files
  3. Flag any new IAM policies that use wildcard actions (*)
  4. Block the commit and explain any findings — do not auto-fix

How Powers Solve Context Rot

Without Powers, connecting multiple MCP servers front-loads your entire context window before you write a single line:

Without Powers
──────────────────────────────────────────────────
Context window (200K tokens)

[Figma MCP tools]     ~12K tokens  ████
[Postman MCP tools]   ~18K tokens  ██████
[Stripe MCP tools]    ~10K tokens  ███
[Supabase MCP tools]  ~15K tokens  █████
[Datadog MCP tools]   ~9K tokens   ███
                      ──────────────────
Total overhead        ~64K tokens  (32% gone before first prompt)

With Powers (dynamic loading)
──────────────────────────────────────────────────
You mention "payment" → Stripe power activates
You mention "database" → Supabase activates, Stripe deactivates
Baseline overhead: ~0K tokens until context is needed
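
The overhead figures in the diagram can be reproduced with a few lines of arithmetic. The sketch below takes the per-server token counts and the 200K window straight from the diagram. It assumes, for illustration only, that an activated Power costs about as much as that server's MCP tool definitions.

```python
CONTEXT_WINDOW = 200_000  # tokens, as in the diagram

# Approximate tool-definition sizes taken from the diagram.
MCP_TOOL_TOKENS = {
    "Figma": 12_000,
    "Postman": 18_000,
    "Stripe": 10_000,
    "Supabase": 15_000,
    "Datadog": 9_000,
}

def overhead(active_servers: list[str]) -> tuple[int, float]:
    """Return (tokens used by tool definitions, share of the context window)."""
    used = sum(MCP_TOOL_TOKENS[name] for name in active_servers)
    return used, used / CONTEXT_WINDOW

all_loaded = overhead(list(MCP_TOOL_TOKENS))
only_stripe = overhead(["Stripe"])
print(f"All servers loaded: {all_loaded[0]:,} tokens ({all_loaded[1]:.0%})")  # 64,000 tokens (32%)
print(f"Only Stripe active: {only_stripe[0]:,} tokens ({only_stripe[1]:.0%})")  # 10,000 tokens (5%)
```

Under that assumption, loading only the tool set relevant to the current task cuts the overhead from about a third of the window to a few percent.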

Workspace Architecture for AWS Teams

AWS Organization
└── Management Account
    ├── Client A Account
    │   ├── Kiro workspace  (.kiro/ scoped here)
    │   ├── CodeCatalyst repo
    │   ├── Bedrock access (us-east-1)
    │   └── Secrets Manager (client A secrets only)
    │
    ├── Client B Account
    │   ├── Kiro workspace  (.kiro/ scoped here)
    │   ├── CodeCatalyst repo
    │   ├── Bedrock access (us-east-1)
    │   └── Secrets Manager (client B secrets only)
    │
    └── Shared Services Account
        ├── IAM Identity Center (SSO for all Kiro logins)
        └── Billing consolidated

This pattern keeps client IP, secrets, and Bedrock spend isolated by account boundary — IAM does the enforcement, not convention.

Resources

kiro.dev — download is free, no AWS account required
Introducing Kiro — the original launch post, good context on the design philosophy behind specs and hooks
Introducing Powers — explains why dynamic MCP loading matters and how Powers solve context rot
Teaching Kiro new tricks with steering and MCP — practical deep dive on using steering + MCP to handle custom libraries and DSLs
Specs documentation — full reference including the Design-First and Bugfix spec workflows
Kiro Powers marketplace — browse Figma, Stripe, Supabase, Datadog, Terraform and more
IDE Changelog — how fast the product is moving
Amazon Q Developer end-of-support announcement — official AWS post confirming Kiro as Q Developer's successor
github.com/kirodotdev/Kiro — issue tracker and feedback repo

Build a Git Commit Analyzer with Gemma 4 31B and a 256K Context Window

Jubin Soni — Fri, 08 May 2026 17:41:18 +0000

This is a submission for the Gemma 4 Challenge: Build with Gemma 4

Most developers reach for an LLM when they need code completion or a chatbot. This article is about something more useful and less obvious: feeding your entire sprint's git history to Gemma 4 31B — diffs, commit messages, authors and all — and getting back structured, actionable analysis of what actually changed and why it might matter.

The 31B Dense model's 256K context window is the key enabler here. It means you can pass tens of thousands of lines of patch output in a single prompt and ask the model to reason across the whole thing — not chunk-and-summarize, but genuinely cross-reference commits, spot patterns, and flag risk. That's a qualitatively different capability from what a smaller model or an older Gemma generation could provide.

By the end of this guide you'll have a working Python CLI tool that:

Shells out to git log --patch to collect a commit range
Sends the full diff to Gemma 4 31B via the Gemini API (free tier in Google AI Studio)
Returns a structured JSON report with change summaries, risk flags, and a draft changelog
Optionally writes a Markdown changelog file

Why Gemma 4 31B Is the Right Model for This

Three specific properties make the 31B Dense the correct pick here — not the 26B MoE, not the edge models.

256K context window. A week's worth of commits on a mid-size codebase generates 20,000–80,000 tokens of patch text. The 31B handles that in a single pass. Chunking and summarizing separately loses cross-commit signal: the model can't notice that a refactor in commit 3 introduced the same variable name collision that commit 7 later fixed.

Maximum quality per query. The 31B Dense is the highest-accuracy model in the Gemma 4 family. For code analysis you care about precision — a false positive risk flag wastes a senior engineer's time, and a false negative ships a bug. You're making one expensive call per analysis run, so raw quality beats throughput.

Native structured output. Gemma 4 has first-class support for function calling and structured JSON output. The analyzer requests a strict JSON schema and the model reliably returns it — no fragile string parsing required.

The 26B MoE is the right choice if you're building something that calls the model thousands of times per day and want cost efficiency. This tool calls it once per analysis run and prioritizes signal quality, so the Dense wins.

Prerequisites

Python 3.10+
A Google AI Studio API key (free — get one here)
A git repository to analyze
The google-generativeai Python SDK

pip install google-generativeai

Set your API key as an environment variable:

export GEMINI_API_KEY="your-key-here"

Step 1: Collect the Git Diff

The first job is gathering the raw patch data. We use git log --patch with a commit range and pipe the output to a string. We also collect structured commit metadata separately so the model has author and timestamp context alongside the diff.

import subprocess
import sys

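def _is_commit_ref(repo_path: str, name: str) -> bool:
    """True if `name` resolves to a commit (tag, branch, or SHA) in this repo."""
    probe = subprocess.run(
        ["git", "rev-parse", "--verify", "--quiet", f"{name}^{{commit}}"],
        cwd=repo_path, capture_output=True, text=True,
    )
    return probe.returncode == 0


def collect_git_history(repo_path: str, since: str = "1 week ago", until: str = "HEAD") -> tuple[str, list[dict]]:
    """
    Returns (full_patch_text, list_of_commit_metadata).
    `since` may be a date git understands ('7 days ago', '2026-05-01') or a ref
    (tag, branch, SHA). Refs are turned into a `since..until` revision range,
    because git's --since/--until flags only filter by date.
    """
    if _is_commit_ref(repo_path, since):
        selection = [f"{since}..{until}"]
    else:
        selection = [f"--since={since}"]
        if until != "HEAD":
            selection.append(f"--until={until}")

    # Collect the full unified diff
    patch_result = subprocess.run(
        ["git", "log", "--patch", "--no-merges", *selection,
         "--pretty=format:COMMIT: %H%nAuthor: %an <%ae>%nDate: %ci%nMessage: %s%n"],
        cwd=repo_path,
        capture_output=True,
        text=True,
        check=True
    )

    # Collect lightweight metadata for the summary header
    meta_result = subprocess.run(
        ["git", "log", "--no-merges", *selection,
         "--pretty=format:%H|%an|%ci|%s"],
        cwd=repo_path,
        capture_output=True,
        text=True,
        check=True
    )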

    commits = []
    for line in meta_result.stdout.strip().splitlines():
        if not line:
            continue
        sha, author, date, *msg_parts = line.split("|")
        commits.append({
            "sha": sha[:8],
            "author": author,
            "date": date,
            "message": "|".join(msg_parts)
        })

    return patch_result.stdout, commits

A week of commits on a mid-size codebase typically comes to 20,000–80,000 tokens of patch text, and a busy week can exceed that. We'll let the model handle the full text, which is exactly what the 256K window is for.
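
Before sending a very large range, it can help to check roughly how big the patch is. The sketch below uses a crude four-characters-per-token heuristic, which is an assumption and not Gemma's tokenizer (diffs often tokenize less efficiently than prose). `estimate_tokens` and `fits_in_context` are hypothetical helpers, and the 4,096-token reserve mirrors the `max_output_tokens` value used in Step 3.

```python
CONTEXT_LIMIT_TOKENS = 256_000  # conservative reading of "256K"

def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate from character count; not a real tokenizer."""
    return int(len(text) / chars_per_token)

def fits_in_context(patch_text: str, reserved_for_output: int = 4_096) -> bool:
    """True if the estimated prompt size leaves room for the response."""
    return estimate_tokens(patch_text) + reserved_for_output <= CONTEXT_LIMIT_TOKENS
```

If `fits_in_context(patch_text)` returns False, narrowing the commit range is a simpler fix than chunking, and it keeps the cross-commit reasoning intact.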

Step 2: Build the Prompt

The prompt does three things: gives the model its role and output contract, defines the JSON schema it must return, and passes the raw git history.

SYSTEM_PROMPT = """You are a senior staff engineer performing a structured code review
of a git commit history. Your job is to analyse the provided patch text and return a
single JSON object — nothing else, no markdown fences, no explanation outside the JSON.

The JSON object must match this schema exactly:

{
  "summary": "2-3 sentence plain-English summary of the overall change set",
  "changed_areas": [
    {
      "path": "path/to/file_or_directory",
      "change_type": "added | modified | deleted | renamed",
      "description": "what changed and why it likely changed"
    }
  ],
  "risk_flags": [
    {
      "severity": "low | medium | high",
      "area": "file or component",
      "reason": "specific, concrete reason this change carries risk"
    }
  ],
  "patterns": [
    "notable cross-commit pattern, refactor theme, or repeated change"
  ],
  "changelog_entry": "A polished, user-facing changelog entry in Markdown. Use ## [Unreleased] as the heading. Group under Added, Changed, Fixed, Removed as appropriate."
}

Be specific. Do not flag risk without a concrete reason tied to the actual diff.
Do not invent changes that are not present in the patch text."""


def build_prompt(patch_text: str, commits: list[dict]) -> str:
    commit_count = len(commits)
    authors = sorted({c["author"] for c in commits})  # sorted so the prompt header is stable across runs
    date_range = f"{commits[-1]['date'][:10]} to {commits[0]['date'][:10]}" if commits else "unknown"

    header = (
        f"ANALYSIS REQUEST\n"
        f"Commits: {commit_count}\n"
        f"Authors: {', '.join(authors)}\n"
        f"Date range: {date_range}\n\n"
        f"FULL PATCH TEXT FOLLOWS\n"
        f"{'='*60}\n"
    )

    return header + patch_text

The system prompt enforces a strict schema so we can parse the response with json.loads instead of scraping fields out of free text with regex. One of Gemma 4's standout improvements over Gemma 3 is how reliably it follows structured output instructions at this schema complexity.
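
`json.loads` only guarantees valid JSON, not that the keys match the schema in the prompt. A light check such as the sketch below catches a missing field before the report formatter trips over it. `validate_report` is a hypothetical helper, and it covers only the top-level keys and severity values from the schema above.

```python
REQUIRED_KEYS = {
    "summary": str,
    "changed_areas": list,
    "risk_flags": list,
    "patterns": list,
    "changelog_entry": str,
}
ALLOWED_SEVERITIES = {"low", "medium", "high"}

def validate_report(analysis: dict) -> list[str]:
    """Return a list of schema problems; an empty list means the report looks usable."""
    problems = []
    for key, expected_type in REQUIRED_KEYS.items():
        if key not in analysis:
            problems.append(f"missing key: {key}")
        elif not isinstance(analysis[key], expected_type):
            problems.append(f"{key} should be {expected_type.__name__}")
    risk_flags = analysis.get("risk_flags")
    for i, flag in enumerate(risk_flags if isinstance(risk_flags, list) else []):
        if not isinstance(flag, dict) or flag.get("severity") not in ALLOWED_SEVERITIES:
            problems.append(f"risk_flags[{i}] has an invalid severity")
    return problems
```

Returning a list of problems rather than raising on the first one makes it easy to log everything that went wrong in a single run.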

Step 3: Call Gemma 4 31B

We use the google-generativeai SDK with gemma-4-31b-it (the instruction-tuned variant — always use IT for structured task completion).

import google.generativeai as genai
import json
import os

def analyze_with_gemma(patch_text: str, commits: list[dict]) -> dict:
    genai.configure(api_key=os.environ["GEMINI_API_KEY"])

    model = genai.GenerativeModel(
        model_name="gemma-4-31b-it",
        system_instruction=SYSTEM_PROMPT,
        generation_config=genai.GenerationConfig(
            temperature=0.2,      # Low temperature for consistent structured output
            top_p=0.9,
            max_output_tokens=4096,
        )
    )

    prompt = build_prompt(patch_text, commits)

    print(f"Sending {len(prompt.split()):,} words to Gemma 4 31B...", file=sys.stderr)

    response = model.generate_content(prompt)

    raw = response.text.strip()

    # Strip markdown fences if the model adds them despite instructions
    fence = "`" * 3  # three backticks
    if raw.startswith(fence):
        raw = raw.split(fence)[1]
        if raw.startswith("json"):
            raw = raw[4:]

    return json.loads(raw)

A temperature of 0.2 keeps the output consistent and schema-compliant, although sampling is still not strictly deterministic. For creative changelog prose you could nudge it to 0.4, but for risk flags you want the model to be conservative and consistent.
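
A low temperature makes malformed output less likely, not impossible. The sketch below pulls the fence-stripping step into a standalone function that fails with a clear message when the reply is not a JSON object. `parse_model_json` is an illustrative name, and the fence handling is deliberately simple.

```python
import json

FENCE = "`" * 3  # three backticks

def parse_model_json(raw: str) -> dict:
    """Parse a JSON object from model output, tolerating an optional Markdown fence."""
    text = raw.strip()
    if text.startswith(FENCE):
        # Remove the opening fence, an optional "json" tag, and the closing fence.
        text = text[len(FENCE):]
        if text.startswith("json"):
            text = text[len("json"):]
        text = text.rsplit(FENCE, 1)[0]
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Model did not return valid JSON: {exc}") from exc
    if not isinstance(parsed, dict):
        raise ValueError("Expected a JSON object at the top level")
    return parsed
```

Because it raises `ValueError` with the decoder's message, a caller can catch a single exception type and decide whether to retry the request or abort the run.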

Step 4: Format and Output the Report

from datetime import datetime

def print_report(analysis: dict, commits: list[dict]) -> None:
    print("\n" + "="*60)
    print("GIT HISTORY ANALYSIS — Gemma 4 31B")
    print("="*60)
    print(f"\nCommits analysed: {len(commits)}")
    print(f"\nSUMMARY\n{analysis['summary']}\n")

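    if analysis.get("risk_flags"):
        print("RISK FLAGS")
        order = {"high": 0, "medium": 1, "low": 2}
        icons = {"high": "🔴", "medium": "🟡", "low": "🟢"}
        # Normalise severity so an unexpected value from the model cannot raise KeyError.
        for flag in sorted(analysis["risk_flags"], key=lambda f: order.get(str(f.get("severity", "")).lower(), 3)):
            severity = str(flag.get("severity", "")).lower()
            print(f"  {icons.get(severity, '⚪')} [{severity.upper()}] {flag['area']}")
            print(f"     {flag['reason']}")
        print()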

    if analysis.get("patterns"):
        print("PATTERNS DETECTED")
        for p in analysis["patterns"]:
            print(f"  • {p}")
        print()

    print("CHANGED AREAS")
    for area in analysis.get("changed_areas", []):
        print(f"  [{area['change_type'].upper():8}] {area['path']}")
        print(f"             {area['description']}")
    print()


def write_changelog(analysis: dict, output_path: str) -> None:
    entry = analysis.get("changelog_entry", "")
    if not entry:
        return

    # Inject today's date if the entry has a placeholder
    entry = entry.replace("[Unreleased]", f"[Unreleased] — {datetime.today().strftime('%Y-%m-%d')}")

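    # Mode "w" replaces any existing file at output_path; UTF-8 keeps non-ASCII text intact.
    with open(output_path, "w", encoding="utf-8") as f:
        f.write(entry + "\n")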

    print(f"Changelog written to {output_path}", file=sys.stderr)

Step 5: Wire It Together as a CLI

import argparse

def main():
    parser = argparse.ArgumentParser(
        description="Analyse a git commit range with Gemma 4 31B"
    )
    parser.add_argument("repo", help="Path to git repository")
    parser.add_argument("--since", default="1 week ago",
                        help="Start of range (default: '1 week ago'). Accepts any git date or ref.")
    parser.add_argument("--until", default="HEAD",
                        help="End of range (default: HEAD)")
    parser.add_argument("--changelog", default=None,
                        help="Write changelog entry to this file")
    parser.add_argument("--json", dest="json_out", default=None,
                        help="Write full JSON report to this file")
    args = parser.parse_args()

    patch_text, commits = collect_git_history(args.repo, args.since, args.until)

    if not commits:
        print("No commits found in the specified range.", file=sys.stderr)
        sys.exit(0)

    analysis = analyze_with_gemma(patch_text, commits)
    print_report(analysis, commits)

    if args.changelog:
        write_changelog(analysis, args.changelog)

    if args.json_out:
        with open(args.json_out, "w") as f:
            json.dump(analysis, f, indent=2)
        print(f"Full JSON report written to {args.json_out}", file=sys.stderr)


if __name__ == "__main__":
    main()

Running It

Analyse the last week of commits in the current repo:

python git_analyzer.py . --since "1 week ago"

Analyse the commits between two tags and write a changelog:

python git_analyzer.py /path/to/repo \
  --since v1.4.0 \
  --until v1.5.0 \
  --changelog CHANGELOG.md \
  --json report.json

Analyse a single sprint (two-week window):

python git_analyzer.py . --since "14 days ago" --changelog CHANGELOG.md

Sample Output

Here's an abbreviated example of what the tool produces on a real project:

============================================================
GIT HISTORY ANALYSIS — Gemma 4 31B
============================================================

Commits analysed: 23

SUMMARY
This sprint focused on migrating the authentication layer from session
cookies to JWTs, with supporting changes to the user model and API
middleware. Three unrelated bug fixes were included. No database
migrations were added despite schema-adjacent changes in user.py.

RISK FLAGS
  🔴 [HIGH]  src/auth/middleware.py
             Token expiry is set to 0 in the new JWT config, which
             disables expiry entirely. This appears unintentional given
             the surrounding comments referencing a 24h TTL.
  🟡 [MEDIUM] src/models/user.py
             The `last_login` field is now written in two places with
             different timezone handling (UTC in the old path, local
             time in the new one). Cross-commit inconsistency introduced
             in commits a3f1cc and 9d02bb.

PATTERNS DETECTED
  • JWT migration touched 11 files across 8 commits — no single
    atomic commit, suggesting iterative discovery during implementation
  • Four separate commits add logging statements then remove them,
    indicating debug churn that could have been a feature branch

CHANGED AREAS
  [MODIFIED ] src/auth/middleware.py
               Core auth middleware rewritten to validate Bearer tokens
               instead of reading from session. Old session path removed.
  [MODIFIED ] src/models/user.py
               Added jwt_secret field; last_login timezone handling changed
  [ADDED    ] src/auth/token.py
               New module for JWT encode/decode with HS256
  ...

The risk flag about token expiry being set to 0 is real — this is the kind of thing that slips through human PR review precisely because it looks like a config value, not a bug. The cross-commit inconsistency flag is only possible because the model reasoned across all 23 commits simultaneously rather than reviewing each in isolation.

Context Window Headroom

The 256K window on Gemma 4 31B means you have significant headroom. At roughly 3 characters per token, the practical limits look like this:

Scenario	Approx. tokens	Fits in 256K?
1 day of commits, small team	~5,000	✅
1 sprint (2 weeks), small team	~40,000	✅
Full quarter, mid-size team	~180,000	✅
1 year of active development	~500,000+	❌ use `--since` to segment

For very large repos, segment by component directory using git log -- path/to/subdir rather than trying to fit everything.
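The headroom table can be turned into a quick pre-flight check before sending a patch to the model. This is a rough sketch, not part of the tool as written: `estimate_tokens` and `fits_in_context` are hypothetical helpers, the 3-characters-per-token ratio is the approximation quoted above rather than a real tokenizer, and the reserved budget for instructions and the response is an assumed value.

```python
import math

CONTEXT_WINDOW_TOKENS = 256_000  # window size quoted above, taken as 256,000
CHARS_PER_TOKEN = 3              # rough ratio from the text, not a tokenizer
RESERVED_TOKENS = 16_000         # assumed budget for instructions + response


def estimate_tokens(text: str, chars_per_token: float = CHARS_PER_TOKEN) -> int:
    """Approximate the token count from character length (rounds up)."""
    return math.ceil(len(text) / chars_per_token)


def fits_in_context(text: str) -> bool:
    """True if the estimate still leaves the reserved budget free."""
    return estimate_tokens(text) <= CONTEXT_WINDOW_TOKENS - RESERVED_TOKENS


if __name__ == "__main__":
    sample_patch = "diff --git a/app.py b/app.py\n" * 1000
    print(estimate_tokens(sample_patch), fits_in_context(sample_patch))
```

If the check fails, narrowing `--since` or restricting the diff to one directory is the cheaper fix than truncating the patch, because truncation silently drops the cross-commit context the analysis depends on.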

Where to Take This Next

GitHub Action. Trigger the analyzer on each PR, post the risk flags as a PR comment, and block merge if any high-severity flags are found. One YAML file and a secrets entry get you there.

Slack/Teams digest. Run on a cron, pipe the changelog entry to a webhook. Engineering managers get a plain-English weekly summary without reading git.

Fine-tuning. If your team consistently disagrees with certain risk classifications, collect those corrections as a small labeled dataset and fine-tune the model on Vertex AI or Colab. Gemma 4's Apache 2.0 license means there are no restrictions on using it as a fine-tuning base for internal tools.

Multi-repo analysis. Pass diffs from multiple services in the same prompt window. The 256K context means you can compare what changed across your backend, frontend, and infra repos in the same analysis run.
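For the GitHub Action idea, the merge gate itself could be a few lines that read the `--json` report and fail the job when a high-severity flag is present. This is a sketch under assumptions: the key names `risk_flags`, `severity`, `file`, and `description` are hypothetical and need to match whatever schema the analyzer actually writes.

```python
import json


def high_severity_flags(report: dict) -> list:
    """Return the risk flags whose severity is 'high' (case-insensitive)."""
    flags = report.get("risk_flags", [])  # assumed key name
    return [f for f in flags if str(f.get("severity", "")).lower() == "high"]


def gate(report_path: str) -> int:
    """Return a process exit code: 1 if any high-severity flag exists, else 0."""
    with open(report_path) as f:
        report = json.load(f)
    blocking = high_severity_flags(report)
    for flag in blocking:
        print(f"HIGH: {flag.get('file', '?')}: {flag.get('description', '')}")
    # A non-zero exit code fails the CI job, which is what blocks the merge
    return 1 if blocking else 0
```

In a workflow step this would be called as `sys.exit(gate("report.json"))` after the analyzer has run.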

Why This Matters

The git history is one of the most information-dense artifacts a software team produces, and it's almost entirely ignored outside of git blame. Gemma 4 31B's context window is large enough to treat a sprint's history as a single document rather than a stream of individual events.

That shift in granularity changes what the model can do: it can notice that a change made on Tuesday was partially reverted on Thursday, that two different authors independently touched the same configuration key, or that a "refactor" commit introduced a subtle behavioral change buried in 400 lines of renames.

None of that is possible when each commit is reviewed in isolation.

Engineering LLMOps: Building Robust CI/CD Pipelines for LLM Applications on Google Cloud

Jubin Soni — Fri, 01 May 2026 23:48:44 +0000

The transition of Large Language Models (LLMs) from experimental notebooks to production-grade applications requires more than just a well-crafted prompt. As enterprises integrate Generative AI into their core workflows, the need for stability, scalability, and reproducibility becomes paramount. This is where LLMOps—the intersection of DevOps, Data Engineering, and Machine Learning—enters the frame.

Building a CI/CD pipeline for LLM-based applications on Google Cloud Platform (GCP) presents unique challenges. Unlike traditional software, LLM outputs are non-deterministic, making testing complex. Unlike traditional ML, the "model" is often a managed service (like Gemini) or a fine-tuned version of an open-source giant, shifting the focus from training to orchestration, prompt management, and RAG (Retrieval-Augmented Generation) infrastructure.

In this technical deep dive, we will explore how to architect a robust CI/CD pipeline for LLM applications using Google Cloud's suite of tools, ensuring your AI deployments are as reliable as your backend microservices.

The Evolution of the Pipeline: From DevOps to LLMOps

Traditional CI/CD focuses on code integrity, unit tests, and artifact deployment. LLMOps extends this by adding layers for prompt versioning, evaluation against golden datasets, and semantic monitoring.

On Google Cloud, the backbone of this workflow is Cloud Build for orchestration, Vertex AI for model management and evaluation, and Artifact Registry for versioning. The goal is to move away from manual testing in the Vertex AI Studio and toward an automated, repeatable process.

Core Components of the GCP LLM Stack

Vertex AI Model Garden & Model Registry: Centralized hubs for discovering and managing models.
Cloud Build: A serverless CI/CD platform that executes builds on GCP infrastructure.
Vertex AI Pipelines: Based on Kubeflow, these allow you to orchestrate complex ML workflows.
Cloud Run / GKE: For hosting the application logic or serving custom model containers.
Vertex AI Evaluation Service: Provides automated metrics for model performance (e.g., faithfulness, answer relevancy).

Architectural Blueprint: The LLM CI/CD Lifecycle

A robust pipeline must handle three distinct types of updates: changes to the application code, changes to the prompt templates, and updates to the retrieval data (in RAG systems).

The Workflow Logic

This flowchart illustrates the progression from code commit to production. The "Performance Gate" is the most critical addition in LLMOps. It prevents models that hallucinate or provide poor-quality answers from reaching the end user.

Continuous Integration: Beyond Unit Testing

In a standard application, the benchmarks are logical correctness and algorithmic performance (O(1) versus O(n) behavior). In LLM apps, we must also test for semantic accuracy. CI for LLMs on GCP should include:

Prompt Linting: Checking for formatting and required variables in prompt templates.
Deterministic Testing: Testing the helper functions that format data for the LLM.
LLM-based Evaluation (LLM-as-a-judge): Using a stronger model (like Gemini 1.5 Pro) to grade the output of a smaller, faster model (like Gemini 1.5 Flash).
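Prompt linting, the first item in the list, needs nothing beyond the standard library when templates use `str.format`-style placeholders. The sketch below is one possible shape, assuming that placeholder style; `template_variables` and `lint_prompt` are hypothetical helper names.

```python
from string import Formatter


def template_variables(template: str) -> set:
    """Return the named placeholders in a str.format-style template."""
    return {
        field_name
        for _, field_name, _, _ in Formatter().parse(template)
        if field_name  # skips literal text (None) and positional '{}' ('')
    }


def lint_prompt(template: str, required: set) -> list:
    """Return a list of problems; an empty list means the template passes."""
    try:
        found = template_variables(template)
    except ValueError as exc:  # for example an unbalanced '{'
        return [f"malformed template: {exc}"]
    problems = []
    missing = required - found
    unexpected = found - required
    if missing:
        problems.append(f"missing variables: {sorted(missing)}")
    if unexpected:
        problems.append(f"unexpected variables: {sorted(unexpected)}")
    return problems
```

A CI step can run `lint_prompt` over every template file and fail the build if any call returns a non-empty list.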

Practical Code: Automated Evaluation Script

Using the Vertex AI SDK, we can automate the evaluation of a prompt change during the CI phase. The following Python snippet demonstrates how to run an evaluation task that scores "fluency" with an LLM judge.

import sys

import vertexai
from vertexai.generative_models import GenerativeModel
from vertexai.evaluation import EvalTask, PointwiseMetric

# Initialize Vertex AI
vertexai.init(project="your-project-id", location="us-central1")

# Define the evaluation metric (LLM-as-a-judge).
# The template must contain {response} so the judge sees the candidate output.
fluency_metric = PointwiseMetric(
    metric="fluency",
    metric_prompt_template=(
        "Rate the fluency of the following text from 1-5.\n\n"
        "Text: {response}"
    ),
)

def run_evaluation(reference_data):
    # reference_data needs a "text" column to fill the prompt template below
    eval_task = EvalTask(
        dataset=reference_data,
        metrics=[fluency_metric],
        experiment="llm-app-v1-eval",
    )

    # Run the evaluation: the candidate model generates a response per row,
    # then the judge scores each response
    results = eval_task.evaluate(
        prompt_template="Summarize this text: {text}",
        model=GenerativeModel("gemini-1.5-flash"),
    )

    return results.summary_metrics

# Example usage in a CI script (aggregates are keyed as "<metric>/mean")
# summary = run_evaluation(golden_df)
# if summary["fluency/mean"] < 4.0:
#     sys.exit(1)  # Fail the build

Data Management and Versioning

In LLM applications, especially those utilizing RAG, the data is as important as the code. Your pipeline must account for the versioning of the Vector Database index and the embeddings model. If you update your embeddings model (e.g., from Gecko v1 to v2), you must re-index your entire dataset. Failure to do so leads to a "schema mismatch" in semantic space, where the LLM cannot find the relevant context.
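One way to catch this mismatch in the pipeline is to store a small manifest next to each index and compare it with the embedding configuration the application is about to deploy. The sketch below assumes such a manifest exists; `IndexManifest` and `assert_compatible` are hypothetical names, and the model identifiers in the usage are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexManifest:
    """Metadata saved next to a vector index (hypothetical schema)."""
    index_name: str
    embedding_model: str  # model that produced the stored vectors
    embedding_dim: int    # dimensionality of the stored vectors


def assert_compatible(manifest: IndexManifest, query_model: str, query_dim: int) -> None:
    """Raise ValueError if queries would not share the index's vector space."""
    if manifest.embedding_model != query_model:
        raise ValueError(
            f"{manifest.index_name} was built with {manifest.embedding_model!r}, "
            f"but queries use {query_model!r}: re-index before deploying"
        )
    if manifest.embedding_dim != query_dim:
        raise ValueError(
            f"dimension mismatch: index={manifest.embedding_dim}, query={query_dim}"
        )
```

Running `assert_compatible(manifest, "embedding-model-v2", 768)` as a deploy-time check turns a silent retrieval-quality regression into a failed build.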

Technology Comparison: Serving Options on Google Cloud

Feature	Vertex AI Endpoints	Cloud Run	Google Kubernetes Engine (GKE)
Best For	Managed model serving	Lightweight AI APIs	Large-scale custom deployments
Auto-scaling	Built-in (to zero with some models)	Highly responsive to HTTP traffic	Complex scaling based on GPU usage
Cold Start	Medium	Low (Serverless)	High (unless using warm pools)
GPU Support	Seamlessly managed	Limited (small set of GPU types, region-dependent)	Full control over GPU types
Pricing Model	Per-node-hour	Per-request/CPU-second	Cluster-based provisioning

Continuous Delivery: Deployment Strategies

Deploying LLMs requires a safety-first approach. Because LLM behavior can shift with new data or minor prompt tweaks, Canary deployments are essential. Vertex AI Endpoints facilitate this by allowing traffic splitting between multiple model versions.

Sequence of a Managed Deployment

This sequence ensures that if the new prompt version causes a spike in 400-level errors or results in lower semantic confidence scores, the pipeline can automatically roll back to the stable version.
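The rollback decision itself can be kept separate from the deployment tooling. The following is a minimal sketch of that decision, assuming the pipeline already collects request counts, error counts, and a mean evaluation score per version; `VersionStats`, `should_roll_back`, and all thresholds are hypothetical and would need tuning per service.

```python
from dataclasses import dataclass


@dataclass
class VersionStats:
    requests: int
    errors: int        # responses counted as failures
    mean_score: float  # mean semantic or evaluation score, 0..1

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def should_roll_back(
    stable: VersionStats,
    canary: VersionStats,
    max_error_increase: float = 0.02,  # assumed tolerance
    max_score_drop: float = 0.05,      # assumed tolerance
    min_requests: int = 200,           # assumed minimum sample size
) -> bool:
    """Return True when the canary is measurably worse than the stable version."""
    if canary.requests < min_requests:
        return False  # not enough canary traffic to judge yet
    error_regressed = canary.error_rate - stable.error_rate > max_error_increase
    score_regressed = stable.mean_score - canary.mean_score > max_score_drop
    return error_regressed or score_regressed
```

When the function returns True, the pipeline would shift the traffic split back to the stable version; how that is done depends on the serving option in use.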

Infrastructure as Code (IaC) with Terraform

To ensure the environment is reproducible, all GCP resources (Vertex AI Indexes, Endpoints, and Cloud Storage buckets) should be managed via Terraform. This prevents "configuration drift," where the staging environment differs from production.

resource "google_vertex_ai_endpoint" "llm_endpoint" {
  name         = "gemini-service-endpoint"
  display_name = "Gemini Service Endpoint"
  location     = "us-central1"
  project      = var.project_id
}

resource "google_cloudbuild_trigger" "llm_pipeline_trigger" {
  name = "deploy-llm-on-push"

  github {
    owner = "your-org"
    name  = "your-repo"
    push {
      branch = "^main$"
    }
  }

  filename = "cloudbuild.yaml"
}

Implementing a "PromptOps" Strategy

One of the most significant shifts in LLMOps is treating prompts as first-class citizens. Instead of hardcoding prompts in the application code, store them as versioned assets.
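A lightweight way to make a prompt a versioned asset is to derive its version from its content, so any edit produces a new identifier that can be logged with every response. This is a sketch of that idea, not a prescribed format; `PromptVersion` is a hypothetical class and the 12-character hash length is an arbitrary choice.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptVersion:
    name: str
    template: str

    @property
    def version_id(self) -> str:
        """Short content hash: any edit to the template yields a new ID."""
        return hashlib.sha256(self.template.encode("utf-8")).hexdigest()[:12]

    def render(self, **variables) -> str:
        """Fill the template; raises KeyError if a variable is missing."""
        return self.template.format(**variables)


if __name__ == "__main__":
    prompt = PromptVersion("summarize", "Summarize this text: {text}")
    print(prompt.version_id, prompt.render(text="hello"))
```

Logging `version_id` alongside each model response makes it possible to attribute a quality shift to a specific prompt change.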

Branching Strategy for Prompts

Using a Git-based workflow for prompts allows prompt engineers to experiment without breaking the production application logic.

The Cloud Build Configuration

The following is an example of a cloudbuild.yaml file that orchestrates the entire process: running tests, performing model evaluation, and deploying to a staging environment.

steps:
  # Step 1: Install dependencies and run unit tests
  - name: 'python:3.10'
    entrypoint: /bin/sh
    args:
      - -c
      - |
        pip install -r requirements-test.txt
        pytest tests/unit

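  # Step 2: Run Vertex AI Evaluation
  # Each step runs in a fresh container, so the SDK is installed here as well
  - name: 'python:3.10'
    entrypoint: /bin/sh
    args:
      - -c
      - |
        pip install "google-cloud-aiplatform[evaluation]"
        python scripts/evaluate_model.py
    env:
      - 'PROJECT_ID=$PROJECT_ID'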

  # Step 3: Build the application container
  - name: 'gcr.io/cloud-builders/docker'
    args: ['build', '-t', 'us-central1-docker.pkg.dev/$PROJECT_ID/app-repo/llm-app:$SHORT_SHA', '.']

  # Step 4: Push to Artifact Registry
  - name: 'gcr.io/cloud-builders/docker'
    args: ['push', 'us-central1-docker.pkg.dev/$PROJECT_ID/app-repo/llm-app:$SHORT_SHA']

  # Step 5: Update Cloud Run Service
  - name: 'gcr.io/google.com/cloudsdktool/cloud-sdk'
    entrypoint: gcloud
    args: 
      - 'run'
      - 'deploy'
      - 'llm-service-staging'
      - '--image=us-central1-docker.pkg.dev/$PROJECT_ID/app-repo/llm-app:$SHORT_SHA'
      - '--region=us-central1'

images:
  - 'us-central1-docker.pkg.dev/$PROJECT_ID/app-repo/llm-app:$SHORT_SHA'

Monitoring and Feedback Loops

Once an LLM application is in production, the CI/CD pipeline doesn't stop. It transforms into a feedback loop. Google Cloud Monitoring and Cloud Logging can be used to track:

Token Usage: Monitoring costs to prevent budget overruns.
Latency: Tracking time-to-first-token (TTFT) and total response time.
Human-in-the-loop Feedback: Sending flagged responses back to a labeling task in Vertex AI for future fine-tuning.

Handling Non-Determinism

Because LLMs are non-deterministic, your monitoring tools should use statistical significance. Instead of a binary "pass/fail" for every request, look for distribution shifts in the "Helpfulness" score over a window of 1000 requests. If the mean score drops by more than two standard deviations, the pipeline should trigger a rollback or alert the engineering team.
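The windowed check described above can be sketched with a fixed-size buffer of recent scores. This is an illustration under assumptions: `ScoreDriftMonitor` is a hypothetical class, the baseline is a list of scores collected while the system was known to be healthy, and "two standard deviations" is read literally as two standard deviations of those per-request baseline scores.

```python
from collections import deque
from statistics import mean, stdev


class ScoreDriftMonitor:
    """Flags a downward shift in the mean score over a sliding window."""

    def __init__(self, baseline_scores, window_size=1000, threshold_sigmas=2.0):
        self.baseline_mean = mean(baseline_scores)
        self.baseline_std = stdev(baseline_scores)
        self.window = deque(maxlen=window_size)
        self.threshold_sigmas = threshold_sigmas

    def observe(self, score: float) -> bool:
        """Record one score; return True once the full window has drifted down."""
        self.window.append(score)
        if len(self.window) < self.window.maxlen:
            return False  # wait for a full window before judging
        drop = self.baseline_mean - mean(self.window)
        return drop > self.threshold_sigmas * self.baseline_std
```

A stricter variant would compare the drop against the standard error of the window mean (the baseline standard deviation divided by the square root of the window size), which reacts to much smaller shifts.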

Security and Governance in LLMOps

Security in the CI/CD pipeline for LLMs involves protecting the data used for RAG and the API keys for the model providers.

Secret Manager: Use GCP Secret Manager to store API keys and database credentials. Never hardcode these in your cloudbuild.yaml or application containers.
VPC Service Controls: For enterprises with strict data residency requirements, ensure that Vertex AI is used within a VPC Service Control perimeter to prevent data exfiltration.
IAM Granularity: Assign the least privilege roles. The Cloud Build service account needs roles/aiplatform.user to trigger evaluations but should not have permission to delete model registries.

Conclusion: The Path to Mature AI Delivery

Building a CI/CD pipeline for LLM applications on Google Cloud is an iterative journey. It begins with basic automation and evolves into a sophisticated system capable of semantic evaluation and automated rollbacks. By leveraging Vertex AI and Cloud Build, organizations can treat LLMs not as mysterious black boxes, but as manageable components of a robust software ecosystem.

The key to success lies in the "Performance Gate"—investing heavily in evaluation metrics early on will save hundreds of hours of manual debugging later. As the Generative AI landscape continues to evolve, those with the most resilient pipelines will be the ones who can innovate at the speed of the market without sacrificing reliability.

The Most Important Announcement at NEXT '26 Was a Sidecar

Jubin Soni — Sun, 26 Apr 2026 08:07:59 +0000

This is a submission for the Google Cloud NEXT Writing Challenge

Google Cloud NEXT '26 made 260 announcements. Most of the discussion has rightly gone to the headline acts: the Gemini Enterprise Agent Platform, 8th-gen TPUs, the Cross-Cloud Lakehouse, Agentic Defense.

Announcement #124 is not one of those.

It's titled "Predictive latency boost in GKE Inference Gateway." The official blurb says it cuts time-to-first-token by up to 70% by replacing heuristic guesswork with real-time capacity-aware routing — no manual tuning required. That sentence is engineered to slide past you.

Here's why I think it's the most consequential thing Google shipped this week.

The problem this is actually solving

If you've ever stood up a vLLM cluster on Kubernetes, you've felt this pain:

You have N replicas of the same model. A request lands. Your load balancer has to decide which pod gets it. The "obvious" answers all break:

Round-robin? Ignores that pod 3 is sitting on a 60-token KV cache and pod 7 is at 95% memory pressure.
Least-connections? Treats a 50-token prompt and a 50,000-token prompt as equivalent units of work. They are not.
Cache-aware (route to the pod with the prefix already cached)? Concentrates load. Cache-hot pods melt. Cache-cold pods sit idle.
Utilization-aware (route to the least-loaded pod)? Throws away the entire benefit of prefix caching by scattering related requests.

The standard production answer is what the Kubernetes Inference Gateway calls a "load+prefix scorer" — you give it weights like (prefix=1, queue=1, kv_cache=1) and tune them by hand. The weights you pick are wrong roughly five minutes after you pick them, because traffic shape changes. The weights that worked at 2pm don't work at 2am. The weights that worked for chat workloads don't work when your evals job kicks off.
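To make the fragility concrete, here is a toy weighted scorer in the spirit of a load+prefix scorer. It is an illustration only, not the Inference Gateway's implementation: `PodState`, `score`, and `pick_pod` are hypothetical names, and the normalization of each signal to a 0..1 "higher is better" value is an assumption.

```python
from dataclasses import dataclass


@dataclass
class PodState:
    name: str
    prefix_match: float   # fraction of the prompt already cached, 0..1
    queue_depth: int      # requests waiting on this pod
    kv_cache_util: float  # fraction of KV cache in use, 0..1


def score(pod: PodState, max_queue: int, weights: tuple) -> float:
    """Weighted average of three signals, each scaled so higher is better."""
    w_prefix, w_queue, w_kv = weights
    queue_score = 1.0 - pod.queue_depth / max_queue if max_queue else 1.0
    kv_score = 1.0 - pod.kv_cache_util
    total = w_prefix * pod.prefix_match + w_queue * queue_score + w_kv * kv_score
    return total / (w_prefix + w_queue + w_kv)


def pick_pod(pods: list, weights: tuple = (1, 1, 1)) -> PodState:
    """Return the pod with the highest weighted score."""
    max_queue = max(p.queue_depth for p in pods)
    return max(pods, key=lambda p: score(p, max_queue, weights))


if __name__ == "__main__":
    pods = [
        PodState("cache-hot", prefix_match=0.9, queue_depth=8, kv_cache_util=0.9),
        PodState("cache-cold", prefix_match=0.1, queue_depth=1, kv_cache_util=0.2),
    ]
    print(pick_pod(pods, (1, 1, 1)).name, pick_pod(pods, (5, 1, 1)).name)
```

With equal weights the idle pod wins; raising the prefix weight to 5 sends the same request to the overloaded cache-hot pod. Which answer is right depends on traffic that keeps changing, which is exactly why static weights go stale.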

Everyone running LLM inference at scale has built some version of "we tuned the scorer weights for our workload." Everyone has watched those weights silently rot.

What Google announced

Buried in the GKE keynote, Google linked to a research blog from the llm-d team describing the actual mechanism behind announcement #124. The architecture is shockingly simple — and that's the whole point.

                                  ┌──────────────────────────┐
                                  │   Inference Gateway      │
  request ──────────────────────► │   Endpoint Picker (EPP)  │
                                  └──────────┬───────────────┘
                                             │ "for each candidate pod,
                                             │  predict TTFT and TPOT"
                                             ▼
                                  ┌──────────────────────────┐
                                  │  Latency Predictor       │
                                  │  (XGBoost regression,    │
                                  │   sidecar to EPP)        │
                                  └──────────┬───────────────┘
                                             │ predictions
                                             ▼
                                  pod with best predicted latency wins
                                             │
                                             ▼
                                  ┌────────┐ ┌────────┐ ┌────────┐
                                  │ vLLM 1 │ │ vLLM 2 │ │ vLLM 3 │
                                  └────┬───┘ └────┬───┘ └────┬───┘
                                       │          │          │
                                       └──────────┼──────────┘
                                                  ▼
                                  ┌──────────────────────────┐
                                  │  Trainer sidecar         │
                                  │  observes completed      │
                                  │  requests, retrains      │
                                  │  on sliding window       │
                                  └──────────────────────────┘

There is no large model here. There is no Gemini call in the hot path. The "AI" is a small XGBoost regressor that predicts two numbers per candidate pod:

TTFT — time to first token (dominated by prefill)
TPOT — time per output token (dominated by decode)

It uses six features: KV cache utilization, input length, queue depth, running requests, prefix cache match percentage, and input tokens in flight. That's the whole input.

Then the scheduler routes to the pod with the best predicted outcome. If you provided latency SLOs in the request headers, it does best-fit packing — pick the pod with the least positive headroom, so the others stay free for harder requests later.

That's it. That's the announcement.

Why this matters more than anything else announced

Look at the production numbers from the llm-d post:

Strategy	E2E p50	TTFT p50	TTFT p95	TPOT p99
K8s round-robin baseline	15.98s	4.47s	24.04s	93ms
Load+Prefix `(1,1,1)`	16.42s	2.86s	18.06s	103ms
Load+Prefix `(3,2,2)` (hand-tuned for this workload)	13.42s	3.38s	16.78s	63ms
Predicted-latency	9.06s	0.97s	11.34s	53ms

The hand-tuned heuristic was tuned by humans who looked at seven days of production traffic. The XGBoost model, which retrains on a 1ms-window sliding stratified bucket, beat it by 32% on E2E p50 and about 70% on TTFT p50. Against the round-robin baseline, the gains are 43% and 78%.

This is the part that should make every infrastructure engineer pay attention: the model didn't just beat round-robin. It beat the best version of the thing your team is currently running.

The workload was Qwen3-480B on 13 servers with 8×H200 each, simulating realistic Poisson-distributed traffic with concurrency 1000 and ~94% peak prefix cache reuse. That's not a toy benchmark. That's what your stack looks like.
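The relative gains can be recomputed directly from the p50 columns of the table. The snippet below only restates the table's figures; the dictionary keys and the `reduction_pct` helper are names chosen for this example.

```python
# p50 figures copied from the benchmark table above (seconds)
RESULTS = {
    "round_robin": {"e2e_p50": 15.98, "ttft_p50": 4.47},
    "load_prefix_1_1_1": {"e2e_p50": 16.42, "ttft_p50": 2.86},
    "load_prefix_3_2_2": {"e2e_p50": 13.42, "ttft_p50": 3.38},
    "predicted_latency": {"e2e_p50": 9.06, "ttft_p50": 0.97},
}


def reduction_pct(baseline: float, candidate: float) -> float:
    """Percentage by which the candidate latency is lower than the baseline."""
    return (baseline - candidate) / baseline * 100


if __name__ == "__main__":
    best = RESULTS["predicted_latency"]
    for name, row in RESULTS.items():
        if name == "predicted_latency":
            continue
        e2e = reduction_pct(row["e2e_p50"], best["e2e_p50"])
        ttft = reduction_pct(row["ttft_p50"], best["ttft_p50"])
        print(f"vs {name}: E2E p50 -{e2e:.1f}%, TTFT p50 -{ttft:.1f}%")
```

Against the hand-tuned `(3,2,2)` scorer this works out to roughly 32% on E2E p50 and 71% on TTFT p50; against round-robin, roughly 43% and 78%.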

The deeper claim hiding in plain sight

Read this sentence carefully, because it's the actual thesis:

"Accelerator performance is fairly predictable when we account for [server] state and request characteristics."

This is a quietly heretical claim against the entire current direction of LLM ops tooling. A huge amount of effort right now goes into making serving systems more general — disaggregated prefill/decode, KV cache offloading to any filesystem, multi-tier caches across RAM/SSD/GCS (also at NEXT, announcement #125). The complexity is exploding.

The latency-predictor team's bet is the opposite: the system is already deterministic enough that a six-feature regression hits 5% MAPE. Most of what we call "tuning" is just humans doing worse-than-XGBoost approximations of a function that's actually quite learnable.
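For readers who have not met the metric, MAPE is the mean of the absolute prediction errors expressed as a percentage of the observed values. A minimal sketch follows; the latency values in the usage are invented for illustration and are not taken from the llm-d post.

```python
def mape(actual: list, predicted: list) -> float:
    """Mean absolute percentage error, in percent."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted))
    return 100 * total / len(actual)


if __name__ == "__main__":
    observed_ttft_s = [1.00, 2.00, 0.50]   # made-up observations
    predicted_ttft_s = [1.05, 1.90, 0.52]  # made-up predictions
    print(f"MAPE: {mape(observed_ttft_s, predicted_ttft_s):.2f}%")
```

A 5% MAPE therefore means the predictor's latency estimates are off by about one part in twenty on average, which is tight enough to rank candidate pods reliably.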

If that's true — and the production numbers say it is — then a lot of what gets sold as "AI infrastructure intelligence" is going to collapse into very small models that learn very narrow things online. Not LLMs. Not even deep learning. Boosted trees. Trained on the last few hundred completed requests. Retrained constantly.

The ironic punchline is that this announcement, which got dropped in a footnote at NEXT '26, may be a more honest preview of where production AI infrastructure is heading than the entire Gemini Enterprise Agent Platform keynote.

Trying it

You can run this today. The implementation is open-source under the Kubernetes Gateway API Inference Extension. The gateway is the K8s upstream component; what Google did at NEXT was bake it into GKE Inference Gateway as a managed feature.

Once installed, requests opt in via headers — and this is where the design choice gets clever:

curl -v $GW_IP/v1/completions \
  -H 'Content-Type: application/json' \
  -H 'x-prediction-based-scheduling: true' \
  -H 'x-slo-ttft-ms: 200' \
  -H 'x-slo-tpot-ms: 50' \
  -d '{
    "model": "Qwen/Qwen3-32B",
    "prompt": "what is the difference between Franz and Apache Kafka?",
    "max_tokens": 200,
    "stream": true
  }'

The two SLO headers are the part to dwell on. You're not telling the gateway how to route. You're telling it what you need, and letting it figure out the routing as a constrained optimization. x-slo-ttft-ms: 200 means "I need first token in 200ms or this is a degraded request." The scheduler computes headroom (slo_ttft − predicted_ttft) per pod, so a positive value means the pod is expected to meet the SLO, and packs accordingly.
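The best-fit packing step can be sketched in a few lines. Headroom here is taken as the SLO minus the predicted latency, so it is positive when a pod is expected to meet the SLO. This is a simplified illustration that considers TTFT only; `pick_best_fit` is a hypothetical function, and the fallback for the case where no pod meets the SLO is an assumption, not documented scheduler behavior.

```python
def pick_best_fit(predicted_ttft_ms: dict, slo_ttft_ms: float) -> str:
    """Pick the pod that meets the SLO with the least headroom to spare."""
    headroom = {pod: slo_ttft_ms - ttft for pod, ttft in predicted_ttft_ms.items()}
    feasible = {pod: h for pod, h in headroom.items() if h >= 0}
    if feasible:
        # Best fit: the tightest pod that still meets the SLO is chosen,
        # which keeps roomier pods free for harder requests later
        return min(feasible, key=feasible.get)
    # Assumed fallback: no pod meets the SLO, so take the least-bad one
    return max(headroom, key=headroom.get)


if __name__ == "__main__":
    predictions = {"pod-a": 80.0, "pod-b": 180.0, "pod-c": 240.0}
    print(pick_best_fit(predictions, slo_ttft_ms=200))
```

With a 200ms SLO, `pod-b` is chosen even though `pod-a` would be faster: the request only needs 200ms, so the fastest pod is held back for a request with a tighter SLO.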

This is a real, observable shift in how we think about LLM ops: from imperative ("route to pod 3") to declarative ("meet this SLO"), the same shift that databases went through twenty years ago when query planners replaced hand-written joins.

The EPP exposes -v=4 log lines that let you watch the scorer think:

msg:"Running profile handler"   plugin:"slo-aware-profile-handler"
msg:"Pod score"   scorer_type:"slo-scorer"   pod_name:"vllm-...-9b4wt"   score:0.82
msg:"Picked endpoint"   selected_pod:"vllm-...-9b4wt"

Pair this with announcement #129 — autoscaling on custom metrics — and you have a closed loop: the predictor surfaces SLO headroom, the autoscaler reacts to headroom collapse before queue depth even spikes. Most autoscaling triggers fire after the system is already in pain. This one fires when the forecast says pain is 30 seconds away.

What I'd watch next

A few open questions that the announcement and the underlying paper don't fully resolve:

The model assumes a homogeneous accelerator pool. In real fleets you have H100s and H200s and B200s mixed together, with different price/performance curves. The team flagged this as future work; whoever solves it well wins the heterogeneous-GPU-cost-optimization market that nobody is talking about yet.

The trainer runs as a sidecar to the EPP and retrains continuously. At the QPS levels in the scaling table — 10,000 QPS needs 4 prediction servers — the cost of the routing decision starts to be non-trivial relative to the inference itself. There's a coordination cost story here that's missing from the blog post.

And the bigger question: this technique generalizes. The same XGBoost-on-six-features approach should work for autoscaling, for spot/on-demand routing decisions, for cache eviction policies, for batch scheduling. If Google ships predicted-latency primitives across the rest of GKE, the consequences are larger than a single-feature blog post implies.

Closing

The contest prompt asks for the announcement that "speaks to you." The honest answer for me is: the boring sidecar with the unglamorous name that takes a week of pain — the slow rot of hand-tuned scorer weights — and replaces it with something that retrains itself.

Everyone watching the keynote saw the agent demos. The serving runtime is where the actual money gets won or lost, and it's where a six-feature regression beats a roomful of senior SREs with Grafana dashboards. That's the announcement I think we'll be talking about in 18 months.

The agent layer makes for a better trailer. The runtime layer is the movie.

Sources & credits: Technical details, production benchmark numbers, and architecture diagram concept drawn from the llm-d project's "Predicted-Latency Based Scheduling for LLMs" post (March 2026) by Kaushik Mitra, Benjamin Braun, Abdullah Gharaibeh, and Clayton Coleman, and the Google Cloud NEXT '26 Wrap-Up (announcement #124). The opinions, framing, and analysis are mine. AI tools were used as a writing assistant; all technical claims trace to the linked primary sources.

What Does OpenClaw Take From You?

Jubin Soni — Sun, 26 Apr 2026 07:46:29 +0000

This is a submission for the OpenClaw Challenge.

TL;DR: The conversation about personal AI is almost entirely about what these agents give you. The harder question — and the one that determines whether the deal is actually good — is what they take. Here are three things personal AI is quietly absorbing, and what I think you should keep.

We all know the story around personal AI. It gives you time back. It automates. It amplifies. It handles your inbox while you sleep, summarizes your morning, drafts replies, books flights, files receipts, sends messages—so you don’t have to. The language never really changes: gain. More throughput. More execution. More delegation. More agency.

This framing is missing half the equation. Anything you offload, you stop doing. Anything you stop doing, you eventually stop being good at. And anything you stop being good at, you eventually stop noticing you used to be good at.

This is not a luddite essay. I want this technology to work. OpenClaw, specifically, is one of the more honest things in the personal AI space — file-first, locally hosted, legible memory, open-source ethos. If any agent framework is going to be defensible five years from now, it is probably this one. But that is exactly why the question matters more here than it does for some hosted SaaS chatbot. OpenClaw is not a toy. It is built to actually live in your life. And the things it is built to absorb are not random — they are a specific class of cognition that, until very recently, you did yourself.

So: what is in that class? Three things, because I think the conversation needs the vocabulary.

The first thing: the friction that makes you decide

It is tempting to treat every recurring annoyance in your life as something to automate away. Bills. Calendar conflicts. Inbox triage. Deadline tracking. Grocery lists. Household coordination. The agent handles them, the annoyance goes away, the win goes on the board.

Friction is not always a bug.

The reason you used to look at your bills before paying them is not that you enjoyed the experience. It is that the act of looking — even for two seconds — sometimes caught the thing that mattered. The duplicate charge. The subscription you forgot you had. The number that was higher than last month and signaled something upstream in your life was off. The five seconds of friction was a sampling pass on your own financial reality, run weekly, for free.

When you build an agent that summarizes the bills and tells you the total, you have not just removed the friction. You have removed the sampling pass. The summary will tell you what the agent thinks is interesting. It will not tell you what you would have thought was interesting if you had looked, because you no longer have the muscle to know.

This is not theoretical. It is the same thing GPS did to your sense of direction, autocomplete did to your spelling, and calculators did to your arithmetic. In each case the technology was net-positive. In each case something specific and unrecoverable was traded away. We made those trades half-consciously because we did not have a vocabulary for what was on the other side of the ledger.

The personal-agent generation is making bigger trades, faster, with even less vocabulary.

The second thing: the practice of small decisions

There is a category of decision that is too small to think about and too consequential to skip.

What to reply to that ambiguous Slack message. Whether the email from your landlord needs a same-day response or can wait until Monday. Whether the meeting your colleague proposed at 4 p.m. is one you should accept or politely deflect. Whether the calendar conflict your assistant just flagged is a real conflict or one of those situations where it is fine to be ten minutes late to the second thing.

Personal agents are very good at the first 80% of these decisions and quietly bad at the last 20%. The first 80% — the obvious cases — is where they shine and where the demos look great. The last 20% — the cases that require taste, social calibration, and an accurate model of the specific humans involved — is where they fail in ways that do not show up in any benchmark, because the failure mode is the agent did something locally reasonable that was globally wrong, and you did not notice until it was too late.

The deeper problem is that the small-decisions practice is how taste is built in the first place. You develop a sense for which Slack messages need a careful reply by replying to a thousand of them, badly at first, and getting feedback from how the relationship went. If your agent handles the first nine hundred and fifty, you arrive at message nine hundred and fifty-one with the calibration of a beginner.

The framing of "delegate the boring stuff and focus on the important stuff" assumes three things: that the boring stuff and the important stuff are clearly separated, that the boring stuff does not feed into the important stuff, and that you can train the agent on the boring stuff without losing access to the inputs that would have eventually made you good at the important stuff. None of these assumptions survive contact with how human skill actually develops.

The third thing: the silence in which you notice you were wrong

This one is harder to name and I think it is the most important.

Right now, when you have a thought that is incomplete, a plan that is half-formed, or an instinct that something is off, there is a natural waiting period. You sit with it. You go for a walk. You stare at the ceiling for an hour. Eventually, sometimes, the thing resolves. You realize the project you were excited about is actually a bad idea. You realize the email you drafted last night was angrier than you intended. You realize the person you were going to call does not actually need a call from you. They need space.

This kind of cognition does not happen in language. It happens in the gaps between language. It is what your nervous system does when nothing is asking it for output.

Personal AI agents are, by their nature, output machines. They want to be useful. They want to give you something. The honest, well-built ones — and OpenClaw is honest and well-built — are designed to be proactive, to surface things, to ping you with the briefing, to suggest the next step. The whole pitch is that they fill the gaps.

But the gaps were doing work.

The morning before you check your phone. The walk to the coffee shop where you have not yet asked the agent anything. The half-hour of unstructured staring before the meeting. These are not inefficiencies in your life that an agent should be optimizing away. They are the conditions under which your slower, more honest cognition can operate. Compressing them does not give you back time. It gives you back the same amount of time, minus the part of your mind that needed the silence to work.

This is the trade nobody in the personal AI space wants to look at directly, because looking at it threatens the entire growth story. If the value of the agent is partly a function of what it disrupts in your inner life, and if some of what it disrupts is irreplaceable, then the unbounded "delegate everything" pitch starts to look less like a productivity story and more like a deal you should sign carefully.

What to actually do

Use OpenClaw. I mean that. The category is real, the project is good, and the alternative — keeping your data with hosted platforms whose pricing pages will change without your consent — is worse on almost every axis.

But sign the deal carefully. The rule I would actually follow is the simplest one I can write down:

Pick the offloads where the friction is genuinely friction.
Keep for yourself the tasks where the friction is doing work.
Leave the gaps alone.

The first one is for things where the human cost is high and the cognitive value is zero. Receipt parsing. Standard meeting confirmations. Repetitive document formatting. Things that genuinely should have been a script.

The second one is for the things where the friction is the point. Read your own bills. Reply to your own ambiguous Slack messages, at least most of them. Look at your own calendar before you ask the agent to look at it. Treat the small-decisions practice like a gym membership — something you do not because you cannot afford the alternative, but because you understand what your body becomes if you stop using it.

The third one is the hardest, because the agent is built to fill the gaps and your brain is built to let it. The morning before your first meeting. The walk where you have not yet opened a chat. The half-hour of unstructured staring. Leave them alone. The silence is not a bug to be fixed. It is the thing keeping the rest of it alive.

Personal AI is going to be one of the largest technology shifts of the next decade, and OpenClaw is going to be in the middle of it. The question is not whether to participate. It is what you intend to keep, and what you are quietly agreeing to give up.

Most of the conversation right now is an accounting of the gains.

Somebody should account for the rest.

What's an offload you regret? Or one you almost made and pulled back from? I'd genuinely like to hear it.

What is AWS Kiro and Why it Matters for Agentic Development

Jubin Soni — Sat, 25 Apr 2026 06:37:19 +0000

Artificial Intelligence has moved from passive chat interfaces to active, autonomous agents. This shift, known as agentic development, requires a fundamental rethink of cloud infrastructure. In traditional AI workflows, a single request is sent to a Large Language Model (LLM), and a response is received. In agentic workflows, dozens or even hundreds of small, specialized agents must communicate, share state, and access tools in real time. This creates a massive networking and latency bottleneck that standard REST-based architectures cannot handle.

Enter AWS Kiro. AWS Kiro (Kernel-Integrated Runtime Orchestrator) is a specialized, high-performance infrastructure layer designed specifically for the orchestration of multi-agent systems. It moves beyond the limitations of standard container orchestration to provide a low-latency, state-aware environment where agents can thrive. This article provides a deep dive into what AWS Kiro is, how it works, and why it is the missing piece for the next generation of AI development.

The Infrastructure Gap in Agentic AI

To understand why AWS Kiro matters, we must first look at the unique requirements of agentic systems. Unlike a simple web application, an agentic system involves:

High Concurrency: Multiple agents (e.g., a Researcher, a Writer, and a Fact-Checker) working simultaneously.
State Persistence: Agents need to remember what they were doing across thousands of small sub-tasks.
Low Latency Inter-Agent Communication: If Agent A needs to wait 500ms for a response from Agent B, a chain of 10 agent calls becomes prohibitively slow.
Tool-Heavy Execution: Agents frequently call external APIs, databases, and code execution sandboxes.
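
To put a number on the latency point in the list above, here is a back-of-the-envelope sketch. The function name is hypothetical, and it assumes the calls run strictly one after another with no overlap.

```python
def chain_latency_ms(per_call_ms: float, num_calls: int) -> float:
    """Total wait for a strictly sequential chain of agent-to-agent calls."""
    return per_call_ms * num_calls


# The example from the list above: 10 hops at 500 ms each.
total = chain_latency_ms(500, 10)
print(f"{total} ms")  # 5000 ms, i.e. 5 seconds spent only on waiting
```

Any parallelism between agents would lower this figure, so read it as an upper bound for a purely sequential chain.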

Traditional AWS services like Lambda or Fargate are excellent for general-purpose compute but often introduce "cold start" latencies or networking overhead that degrade agent performance. AWS Kiro was built to minimize this overhead by integrating the agent runtime closer to the hardware kernel and optimizing the networking stack for small, frequent packets of data common in agent communication.

Architecture Deep Dive: How AWS Kiro Works

At its core, AWS Kiro utilizes a specialized virtualization layer that sits on top of the AWS Nitro System. It abstracts the complexities of agent coordination, providing what AWS calls a "Global Shared Memory Space" (GSMS). This allows agents running in different execution environments to share context without the latency of an external database like Redis.

The Kiro Control Plane and Data Plane

The architecture is split into two primary components:

Kiro Control Plane: Manages agent lifecycles, task decomposition, and scheduling.
Kiro Data Plane (The Fabric): Handles high-speed message passing and shared state access using RDMA (Remote Direct Memory Access) over Converged Ethernet (RoCE).

Diagram 1: Multi-Agent Interaction via AWS Kiro

This sequence diagram illustrates how a user request is decomposed into multiple agent tasks through the Kiro fabric, highlighting the sub-millisecond coordination between the Orchestrator and worker agents.

In this flow, notice that A1 and A2 do not call each other directly via REST. Instead, they interact with the Global Shared Memory (GSMS) provided by Kiro. This reduces the serialization/deserialization overhead and allows for O(1) time complexity when accessing shared context, regardless of how many agents are involved.

Key Features of AWS Kiro

1. Kernel-Integrated Tool Execution

Standard agents often struggle with the latency of spinning up a sandbox to execute code. AWS Kiro uses "Micro-Enclaves"—lightweight, isolated environments that share a kernel with the Kiro runtime. This allows an agent to go from "thinking" to "executing Python code" in less than 5ms.

2. Predictive Context Pre-fetching

Kiro uses machine learning to predict which piece of historical context an agent might need next. If Agent B usually follows Agent A, Kiro will pre-fetch Agent A’s output into the local cache of the node where Agent B is scheduled to run.

3. Native Bedrock Integration

While Kiro handles the infrastructure, it is tightly coupled with Amazon Bedrock. It can automatically pull model weights for smaller, specialized models (like Llama 3 or Mistral) into local memory to further reduce inference latency during agentic loops.

Comparing Architectures: Traditional vs. AWS Kiro

To see the value proposition, let's compare a standard agent implementation (using Lambda and S3/Redis for state) against an AWS Kiro-native implementation.

Feature	Traditional Agent (Lambda + Redis)	AWS Kiro-Native Agent
Inter-Agent Latency	50ms - 200ms (HTTP/TLS)	< 2ms (RDMA/Shared Memory)
State Management	External (Redis/DynamoDB)	Native (Global Shared Memory)
Cold Start	Significant (200ms - 2s)	Minimal (< 10ms via Micro-Enclaves)
Context Window Handling	Manual truncation/storage	Automatic predictive pre-fetching
Scalability	Limited by database IOPS	Linearly scalable across Kiro Fabric

Task Decomposition Logic

A critical part of agentic development is how a complex task is broken down. AWS Kiro provides a built-in "Router" that uses a cost-benefit analysis to determine if a task should be handled by a single large model or a swarm of smaller agents.

Diagram 2: Kiro Task Routing Flowchart

Practical Code Example: Implementing a Kiro-Enabled Agent

The Python example below is an illustrative sketch of how you might initialize a Kiro session and register agents that share a memory space. It assumes a kiro Boto3 client and a kiro_runtime package that provides AgentNode; neither ships with the standard Boto3 distribution, so treat the names as placeholders.

import boto3
from kiro_runtime import AgentNode

# Initialize the Kiro Client
kiro = boto3.client('kiro')

# 1. Create a Kiro Session with Shared Memory
def setup_agentic_environment():
    session = kiro.create_session(
        SessionName="MarketAnalysisSystem",
        MemoryType="high_performance",
        SharedContext=True
    )
    return session['SessionArn']

# 2. Define an Agent Node
# This agent will live within the Kiro Fabric for low-latency access
class ResearchAgent(AgentNode):
    def __init__(self, session_arn):
        super().__init__(session_arn)
        self.role = "Researcher"

    def run(self, query):
        # Writing to Shared Memory is nearly instantaneous in Kiro
        self.write_shared_memory("current_query", query)

        # Tool call via Kiro's Micro-Enclave
        result = self.execute_tool("web_search", {"q": query})

        self.write_shared_memory("search_results", result)
        return "Search completed."

# 3. Orchestration
session_arn = setup_agentic_environment()
researcher = ResearchAgent(session_arn)

# Execution within the fabric
status = researcher.run("Latest trends in AWS Kiro")
print(f"Agent Status: {status}")

Code Breakdown:

kiro.create_session: This allocates a segment of the high-speed fabric specifically for your agents. The SharedContext=True flag enables the GSMS, allowing all agents in this session to read/write to the same memory space at O(1) speeds.
AgentNode: This is a specialized class that inherits from Kiro’s runtime, providing methods like write_shared_memory and execute_tool which bypass the standard networking stack.
execute_tool: Instead of a standard API call, this triggers a micro-enclave execution within the same hardware cluster.

The Agent Lifecycle in AWS Kiro

Agents in Kiro are not just short-lived functions; they are stateful entities that transition through various statuses. Managing these transitions is vital for ensuring that agents don't hang or consume unnecessary resources.

Diagram 3: Kiro Agent State Machine

This state machine ensures that agents are "Hibernated" when not in use. Unlike a Lambda function that shuts down, a Hibernated Kiro agent keeps its local cache in the fabric's memory, allowing it to "Wake-up" and resume work in milliseconds without re-loading the model context.

Why AWS Kiro Matters for the Future

Solving the "Thinking Time" Problem

As LLMs move toward "Reasoning" models (like OpenAI's o1 series), the "thinking time" increases. However, the system overhead (networking, state management) shouldn't add to that. Kiro ensures that the only latency developers face is the actual inference time of the model.

Massive Parallelism

In a complex supply chain agentic system, you might have 500 agents representing different vendors. AWS Kiro allows these 500 agents to coordinate in a single fabric. In a standard architecture, 500 agents would create a "thundering herd" problem for your database; in Kiro, the shared memory fabric handles the contention using hardware-level locking mechanisms.

Security and Governance

When agents act on your behalf, security is paramount. Kiro’s micro-enclaves provide cryptographic isolation. Even if Agent A is compromised by a prompt injection, it cannot access the memory space of Agent B unless explicitly permitted by the Kiro Control Plane's IAM policies.

Implementation Strategy: Moving to Kiro

If you are currently building agents using LangChain or AutoGPT on standard AWS infrastructure, the migration to Kiro involves three steps:

Context Migration: Move your state storage from external databases (Redis/Dynamo) to Kiro Shared Memory.
Tool Refactoring: Re-package your tools as Kiro-compatible Micro-Enclaves to take advantage of the kernel-integrated execution.
Topology Definition: Instead of individual functions, define an "Agent Topology" that describes how agents are grouped within the Kiro fabric.

Conclusion

AWS Kiro represents a significant leap forward for the AI ecosystem. By treating "Agency" as a first-class citizen of cloud infrastructure, AWS has removed the friction that previously made multi-agent systems slow and expensive. Whether you are building an autonomous coding assistant, a market research swarm, or a complex robotic process automation system, AWS Kiro provides the high-performance backbone required for true autonomy.

As LLMs become more capable of reasoning, the infrastructure must become more capable of coordination. AWS Kiro is precisely the fabric that will hold these autonomous systems together, ensuring that the future of AI is not just intelligent, but also incredibly fast and scalable.

5 Ways Azure AI Search is Revolutionizing Enterprise RAG Architectures

Jubin Soni — Sat, 25 Apr 2026 06:03:01 +0000

In the rapidly evolving landscape of Generative AI, the transition from experimental Proofs of Concept (PoCs) to production-grade applications is the most significant hurdle for enterprises today. At the heart of this transition lies Retrieval-Augmented Generation (RAG). While the "Generation" part, handled by Large Language Models (LLMs) like GPT-4, is often the focus, the quality of the "Retrieval" determines whether an AI application provides value or hallucinates incorrect information.

Azure AI Search (formerly known as Azure Cognitive Search) has emerged as a powerhouse in this space. By moving beyond simple vector databases and offering a comprehensive information retrieval platform, it addresses the unique challenges of the enterprise: scale, security, and precision. In this article, we will deep-dive into the five key ways Azure AI Search is improving enterprise RAG, backed by technical architecture, code examples, and performance insights.

1. Advanced Hybrid Retrieval: Beyond Simple Vector Search

Most basic RAG implementations rely solely on vector search (k-nearest neighbors). While vectors are excellent at capturing semantic meaning (e.g., understanding that "canine" and "dog" are related), they often fail at specific keyword matching, such as product serial numbers, obscure acronyms, or specific part codes.

Azure AI Search solves this through Hybrid Retrieval, which combines full-text search (BM25 algorithm) with vector search (HNSW algorithm) in a single query. The results are then fused using Reciprocal Rank Fusion (RRF).

How Reciprocal Rank Fusion (RRF) Works

RRF is an algorithm that combines multiple ranked lists (one from keyword search, one from vector search) into a single unified ranking. Because it uses only each document's rank, it doesn't require the scores from the different systems to be on the same scale. The formula for the RRF score is:

Score = sum(1 / (k + rank_i))

Where:

k is a constant (usually 60) that mitigates the impact of high-ranking results from a single source.
rank_i is the position of the document in the i-th list.
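
The formula above can be sketched in a few lines of Python. This is a minimal illustration of RRF itself, not Azure AI Search's internal implementation. The function name rrf_fuse and the two sample rankings are hypothetical.

```python
from collections import defaultdict


def rrf_fuse(ranked_lists, k=60):
    """Fuse several ranked lists of document IDs with Reciprocal Rank Fusion.

    Ranks are 1-based: the first document in each list has rank 1.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document ID for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


keyword_hits = ["doc_a", "doc_b", "doc_c"]  # hypothetical BM25 ranking
vector_hits = ["doc_b", "doc_d", "doc_a"]   # hypothetical vector ranking

for doc_id, score in rrf_fuse([keyword_hits, vector_hits]):
    print(f"{doc_id}: {score:.5f}")
```

Here doc_b comes out on top because it sits near the head of both lists, while doc_c and doc_d each appear in only one list and score lower.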

Mermaid Flowchart: Hybrid Retrieval Logic

Practical Implementation: Hybrid Query

Using the Azure AI Search Python SDK, a hybrid query is constructed by providing both a vector and a text string.

from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery
from azure.core.credentials import AzureKeyCredential

# Configuration
endpoint = "https://your-service-name.search.windows.net"
key = "your-api-key"
index_name = "enterprise-docs"

client = SearchClient(endpoint, index_name, AzureKeyCredential(key))

# User input
query_text = "What is the warranty period for the X-1500 sensor?"
query_vector = get_embedding(query_text) # Helper function to get embeddings

# Perform Hybrid Search
results = client.search(
    search_text=query_text, 
    vector_queries=[VectorizedQuery(vector=query_vector, k_nearest_neighbors=3, fields="content_vector")],
    select=["title", "content", "category"],
    top=5
)

for result in results:
    print(f"Score: {result['@search.score']} - Title: {result['title']}")

2. The Power of Semantic Ranking (L3 Reranking)

While Hybrid Search significantly improves recall, the enterprise often needs extreme precision. Azure AI Search integrates a "Semantic Ranker"—a technology derived from Bing’s core search engine.

The Reranking Hierarchy

In a typical search flow, the system handles thousands of documents. To be efficient, it uses a tiered approach:

L1 (Retrieval): Fast filtering (Keyword/Vector) to get the top 1,000 documents.
L2 (RRF): Merging keyword and vector results.
L3 (Semantic Ranking): A cross-encoder model that looks at the actual meaning of the top 50 results and re-scores them based on context.

Unlike traditional bi-encoders used in vector search (which compute similarity between a query embedding and a document embedding), the Semantic Ranker uses a cross-encoder that processes the query and the document snippet together. This allows it to capture nuances like negation and complex relationships that vector similarity might miss.

Comparison Table: Retrieval Strategies

Strategy	Pros	Cons	Best For
Keyword (BM25)	Fast, exact matches, low cost	No semantic understanding	Product IDs, codes, names
Vector (HNSW)	Semantic nuance, multi-lingual	"Cold start" issues, bad for jargon	Concept-based questions
Hybrid (RRF)	Combines the best of both	Higher latency than L1	General purpose enterprise RAG
Semantic Ranker	Highest precision, handles nuance	Highest latency/cost per query	High-stakes decision support

3. Integrated Vectorization and Data Pipelines

One of the biggest friction points in RAG is the "ETL for Embeddings" pipeline. Traditionally, developers had to write custom code to monitor data sources, chunk text, call embedding models, and push data to a vector store.

Azure AI Search introduces Skillsets and Indexers, which automate this entire lifecycle.

The Integrated Pipeline Lifecycle

DataSource: Connection to Blob Storage, SQL Server, or Cosmos DB.
Indexer: A crawler that runs on a schedule.
Skillset: A series of AI transformations. This can include:
- Document Cracking (extracting text from PDFs, Office docs).
- Text Chunking (splitting text into manageable segments).
- Azure OpenAI Embedding (converting chunks into vectors automatically).

Sequence Diagram: Integrated Indexing Flow

Code Snippet: Defining an Integrated Vectorizer

This JSON snippet represents how a vectorizer is defined within an index, allowing the search service to handle the embedding generation during both ingestion and query time.

"vectorizers": [
    {
        "name": "my-openai-vectorizer",
        "kind": "azureOpenAI",
        "azureOpenAIParameters": {
            "resourceUri": "https://my-openai-resource.openai.azure.com",
            "deploymentId": "text-embedding-3-small",
            "apiKey": "<api-key>"
        }
    }
]

4. Scaling Vector Search with HNSW and Disk-Based Indexing

Enterprise data isn't just a few thousand documents; it’s often millions of records. Most vector databases struggle with the memory-to-cost ratio because they keep all vectors in RAM to ensure speed.

Azure AI Search uses the Hierarchical Navigable Small World (HNSW) algorithm for vector indexing. HNSW creates a multi-layered graph where the top layers contain fewer nodes (for fast navigation) and the bottom layers contain all nodes (for precision).

Optimization Parameters

When configuring HNSW in Azure AI Search, three parameters are critical for performance tuning:

m: The number of bi-directional links created for every new element during construction. A higher m improves recall but increases index size and memory usage.
efConstruction: The number of nearest neighbors explored during index building. Increasing this improves the quality of the graph but increases indexing time.
efSearch: The number of nearest neighbors searched during a query. Increasing this improves recall at the cost of latency.
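
As a rough sketch, the three parameters above could be assembled into an HNSW algorithm entry like this. The helper name and the allowed ranges are assumptions based on my reading of the Azure AI Search REST reference, so verify them against the API version you target.

```python
import json

# Assumed valid ranges; check the REST reference for your API version.
HNSW_LIMITS = {
    "m": (4, 10),
    "efConstruction": (100, 1000),
    "efSearch": (100, 1000),
}


def hnsw_algorithm_config(name, m=4, ef_construction=400, ef_search=500, metric="cosine"):
    """Build one entry for the index's vectorSearch.algorithms list."""
    params = {"m": m, "efConstruction": ef_construction, "efSearch": ef_search}
    for key, value in params.items():
        low, high = HNSW_LIMITS[key]
        if not low <= value <= high:
            raise ValueError(f"{key}={value} is outside the assumed range {low}-{high}")
    return {"name": name, "kind": "hnsw", "hnswParameters": {**params, "metric": metric}}


# Hypothetical recall-oriented profile: denser graph, wider query-time search.
print(json.dumps(hnsw_algorithm_config("recall-tuned", m=8, ef_search=800), indent=2))
```

Raising m and efSearch together is a common way to buy recall, paid for with a larger index and slower queries respectively.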

Azure AI Search has also introduced filtered vector search. In an enterprise context, you rarely want to search the entire index. You might want to search only "Documents from Department A created in 2023." Azure AI Search optimizes this by applying filters during the vector navigation, rather than post-filtering, which significantly reduces the search space and improves latency.

Complexity Analysis

Vector Search (HNSW): O(log n) average search time.
Full-Text Search: O(n) in worst case, but optimized with inverted indices.
Storage: Azure AI Search can utilize disk-based storage for vectors, significantly lowering the Total Cost of Ownership (TCO) compared to purely in-memory databases.
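
To see why keeping every vector in RAM gets expensive, here is a quick estimate of raw vector size. It is a sketch that counts only the float32 values and ignores the HNSW graph links and any metadata. The function name and the 10 million document figure are hypothetical.

```python
def raw_vector_bytes(num_vectors: int, dimensions: int, bytes_per_value: int = 4) -> int:
    """Size of the raw vector data only (float32 by default)."""
    return num_vectors * dimensions * bytes_per_value


size = raw_vector_bytes(10_000_000, 1536)
print(f"{size / 1e9:.2f} GB")  # 61.44 GB before any index overhead
```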

5. Enterprise-Grade Security and Governance

For a RAG system to be production-ready in a regulated industry, it cannot be a "black box." It must adhere to strict security protocols. Azure AI Search integrates natively with the broader Microsoft security stack in three major ways:

A. Virtual Network (VNET) and Private Link

Most vector databases are accessed over the public internet. Azure AI Search supports Private Endpoints, ensuring that your data traffic never leaves the Microsoft backbone network. This is a non-negotiable requirement for many financial and healthcare institutions.

B. Role-Based Access Control (RBAC)

Azure AI Search supports fine-grained RBAC. You can grant an application the right to query an index without giving it the right to delete data or view service keys. Furthermore, it supports User-Contextual Filtering. If a user doesn't have permission to see "Document A" in SharePoint, the RAG system can use their identity token to filter "Document A" out of the search results automatically.
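
One common way to implement this kind of security trimming is an OData filter on a group field, using the search.in function that Azure AI Search filters support. The sketch below assumes the user's group names have already been resolved from their identity token. The helper name is hypothetical, and metadata_auth_group is the field used in this article's sample index.

```python
def build_group_filter(field: str, user_groups: list[str], delimiter: str = "|") -> str:
    """Return an OData filter matching documents whose `field` is one of the user's groups."""
    if not user_groups:
        # Refuse to fall back to an unfiltered query for a user with no groups.
        raise ValueError("user has no groups")
    for group in user_groups:
        if delimiter in group:
            raise ValueError(f"group {group!r} contains the delimiter {delimiter!r}")
    # OData string literals escape a single quote by doubling it.
    joined = delimiter.join(user_groups).replace("'", "''")
    return f"search.in({field}, '{joined}', '{delimiter}')"


print(build_group_filter("metadata_auth_group", ["finance", "hr-managers"]))
# search.in(metadata_auth_group, 'finance|hr-managers', '|')
```

The resulting string would be passed as the filter argument of the search call, so documents outside the user's groups never reach the ranking stage.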

C. Integration with Microsoft Purview

Data lineage is critical. By integrating with Microsoft Purview, enterprises can track how sensitive data (PII) flows from a data source into an index and eventually into an LLM response. This provides a layer of governance that is often missing in custom-built RAG stacks.

Putting It All Together: The Production RAG Architecture

When we combine these five improvements, the architecture of an enterprise RAG system transforms from a fragile script into a robust platform.

The End-to-End Workflow

Ingestion: An Indexer pulls data from Azure SQL and Blob Storage. It uses a Skillset to chunk the text and call Azure OpenAI for embeddings. These are stored in an index with HNSW enabled.
Query: A user asks a question via a web app. The web app calls Azure AI Search with a hybrid query (text + vector).
Refinement: Azure AI Search performs the hybrid search, applies security filters based on the user's ID, and uses the Semantic Ranker to find the top 5 most relevant chunks.
Generation: These 5 chunks are sent to the LLM as context. Because the retrieval was so precise, the LLM provides a concise, accurate answer with minimal hallucination risk.
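
The Query and Refinement steps above could be expressed as a single call to SearchClient.search. The sketch below only assembles the keyword arguments so it runs without a live service. The helper name is hypothetical, and the semantic configuration name is assumed to match the sample index definition in this article.

```python
def build_search_kwargs(query_text, vector_queries, security_filter, top=5):
    """Assemble keyword arguments for SearchClient.search() for the refinement step."""
    return {
        "search_text": query_text,         # keyword (BM25) leg of the hybrid query
        "vector_queries": vector_queries,  # vector (HNSW) leg of the hybrid query
        "filter": security_filter,         # OData filter for security trimming
        "query_type": "semantic",          # ask for Semantic Ranker re-scoring
        "semantic_configuration_name": "my-semantic-config",
        "select": ["id", "content"],
        "top": top,                        # number of chunks handed to the LLM
    }


kwargs = build_search_kwargs(
    "What is the warranty period for the X-1500 sensor?",
    vector_queries=[],  # in practice, a list of VectorizedQuery objects
    security_filter="metadata_auth_group eq 'support'",
)
# With a real client: results = client.search(**kwargs)
print(sorted(kwargs))
```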

Sample Production-Ready Index Definition

{
  "name": "enterprise-index",
  "fields": [
    {"name": "id", "type": "Edm.String", "key": true},
    {"name": "content", "type": "Edm.String", "searchable": true},
    {"name": "content_vector", "type": "Collection(Edm.Single)", "searchable": true, "retrievable": true, "dimensions": 1536, "vectorSearchProfile": "my-hnsw-profile"},
    {"name": "metadata_auth_group", "type": "Edm.String", "filterable": true}
  ],
  "vectorSearch": {
    "algorithms": [
      {
        "name": "my-hnsw-config",
        "kind": "hnsw",
        "hnswParameters": {
          "m": 4,
          "efConstruction": 400,
          "metric": "cosine"
        }
      }
    ],
    "profiles": [
      {
        "name": "my-hnsw-profile",
        "algorithm": "my-hnsw-config",
        "vectorizer": "my-openai-vectorizer"
      }
    ]
  },
  "semantic": {
    "configurations": [
      {
        "name": "my-semantic-config",
        "prioritizedFields": {
          "prioritizedContentFields": [{"fieldName": "content"}]
        }
      }
    ]
  }
}

Conclusion

Improving RAG at the enterprise level is not about finding a larger LLM; it is about building a better retrieval system. Azure AI Search provides the necessary tools—Hybrid Search, Semantic Ranking, Integrated Data Pipelines, Scalable Vector Indexing, and Enterprise Security—to bridge the gap between a demo and a mission-critical application.

By leveraging the platform's ability to handle both unstructured text and high-dimensional vectors, while maintaining strict security boundaries, developers can build AI assistants that are not only smart but also reliable and safe for the corporate environment.

S3 Vectors: How to build a RAG without a vector database

Jubin Soni — Tue, 14 Apr 2026 19:37:23 +0000

Every RAG tutorial follows the same script: embed your documents, spin up a vector database (Pinecone, Weaviate, pgvector, OpenSearch), manage its infrastructure, and pray the costs don't spiral. For most internal AI apps, this is overkill.

Amazon S3 Vectors changes the equation. It's native vector storage built into S3 — no clusters, no provisioning, no idle compute. You store vectors like you store objects, query them with sub-100ms latency, and pay per use. It went GA in December 2025 and now supports 2 billion vectors per index across 31+ AWS regions.

This post walks through building a complete RAG pipeline using only S3 Vectors and Amazon Bedrock. No external vector database. ~50 lines of Python.

Architecture

Three phases, two AWS services, zero infrastructure.

S3 Vectors vs Traditional Vector Databases

	S3 Vectors	Managed Vector DB (e.g. OpenSearch, Pinecone)
Infrastructure	None — fully serverless	Clusters, shards, replicas
Scale	2B vectors/index, 10K indexes/bucket	Varies, often requires re-sharding
Query latency	~100ms (frequent), <1s (infrequent)	~10-50ms
Cost model	Pay per PUT + storage + query	Hourly/monthly compute + storage
Cost at scale	Up to 90% cheaper	Idle compute adds up fast
Metadata filtering	Up to 50 keys, filterable by default	Full query language
Best for	RAG, agent memory, semantic search	High-QPS production search, hybrid search

The tradeoff is clear: S3 Vectors trades single-digit-ms latency for zero ops and dramatically lower cost. For internal RAG apps, agent memory, and moderate-QPS workloads, it's the better choice.

Step 1: Set Up S3 Vectors

Create a vector bucket and index. You can do this in the console or via CLI:

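# Create a vector bucket
aws s3vectors create-vector-bucket \
  --vector-bucket-name my-rag-bucket

# Create a vector index (1024 dims for Titan Embeddings V2).
# "text" is marked non-filterable so that full chunk text does not count
# against the smaller size limit for filterable metadata.
aws s3vectors create-index \
  --vector-bucket-name my-rag-bucket \
  --index-name my-rag-index \
  --data-type float32 \
  --dimension 1024 \
  --distance-metric cosine \
  --metadata-configuration '{"nonFilterableMetadataKeys":["text"]}'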

That's your "database" — done in two commands.

Step 2: Ingest Documents

Here's the ingestion pipeline. We chunk text, embed each chunk with Titan Embeddings V2, and store vectors with metadata:

import boto3
import json
import uuid

bedrock = boto3.client("bedrock-runtime", region_name="us-west-2")
s3vectors = boto3.client("s3vectors", region_name="us-west-2")

BUCKET = "my-rag-bucket"
INDEX = "my-rag-index"


def embed(text: str) -> list[float]:
    """Generate embeddings using Titan Text Embeddings V2."""
    response = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({"inputText": text}),
    )
    return json.loads(response["body"].read())["embedding"]


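def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into chunks of `chunk_size` words that overlap by `overlap` words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    chunks = []
    for i in range(0, len(words), chunk_size - overlap):
        chunk = " ".join(words[i : i + chunk_size])
        if chunk:
            chunks.append(chunk)
        if i + chunk_size >= len(words):
            break  # the remaining words are already covered by this chunk
    return chunks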


def ingest(doc_text: str, source: str):
    """Chunk, embed, and store a document."""
    chunks = chunk_text(doc_text)
    vectors = []

    for i, chunk in enumerate(chunks):
        vectors.append({
            "key": f"{source}::chunk-{i}",
            "data": {"float32": embed(chunk)},
            "metadata": {
                "source": source,
                "chunk_index": i,
                "text": chunk,  # store original text for retrieval
            },
        })

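    # PutVectors accepts a bounded number of vectors per call, so send slices
    # (500 is an assumed batch size; check the current service quota).
    for start in range(0, len(vectors), 500):
        s3vectors.put_vectors(
            vectorBucketName=BUCKET,
            indexName=INDEX,
            vectors=vectors[start : start + 500],
        )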
    print(f"Ingested {len(vectors)} chunks from {source}")

Usage:

with open("internal-docs.txt") as f:
    ingest(f.read(), source="internal-docs.txt")

Step 3: Query + Generate

Now the RAG loop — embed the question, find similar chunks, and feed them to Claude:

def rag_query(question: str, top_k: int = 5) -> str:
    """Full RAG pipeline: retrieve + generate."""

    # 1. Embed the question
    query_vector = embed(question)

    # 2. Find similar chunks
    results = s3vectors.query_vectors(
        vectorBucketName=BUCKET,
        indexName=INDEX,
        topK=top_k,
        queryVector={"float32": query_vector},
        returnMetadata=True,
        returnDistance=True,
    )

    # 3. Build context from retrieved chunks
    context_parts = []
    for v in results["vectors"]:
        text = v["metadata"]["text"]
        source = v["metadata"]["source"]
        dist = round(v["distance"], 4)
        context_parts.append(
            f"[Source: {source}, Distance: {dist}]\n{text}"
        )
    context = "\n\n---\n\n".join(context_parts)

    # 4. Generate answer with Claude
    prompt = f"""Answer the question based on the provided context. 
If the context doesn't contain enough information, say so.

## Context
{context}

## Question
{question}

## Answer"""

    response = bedrock.invoke_model(
        modelId="us.anthropic.claude-sonnet-4-20250514-v1:0",
        body=json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 1024,
            "messages": [{"role": "user", "content": prompt}],
        }),
    )

    body = json.loads(response["body"].read())
    return body["content"][0]["text"]

Usage:

answer = rag_query("What is our refund policy for enterprise customers?")
print(answer)

That's the entire RAG pipeline — ~50 lines of actual logic, no infrastructure.

Step 4: Metadata Filtering

S3 Vectors supports filtering by metadata during queries. This is powerful for multi-tenant or multi-source RAG:

# Only search chunks from a specific document
results = s3vectors.query_vectors(
    vectorBucketName=BUCKET,
    indexName=INDEX,
    topK=5,
    queryVector={"float32": query_vector},
    returnMetadata=True,
    filter={"source": {"$eq": "refund-policy.pdf"}},
)

Filter operators include $eq, $ne, $gt, $gte, $lt, $lte, $in, $nin, and $exists, plus the logical combinators $and and $or.
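
Conditions can also be combined. The sketch below builds a compound filter over the source and chunk_index metadata keys written by the ingestion code. It assumes the $-prefixed operator spelling from the S3 Vectors documentation, and the helper name and file names are hypothetical.

```python
import json


def sources_from_chunk(sources: list[str], min_chunk_index: int) -> dict:
    """Match chunks from any of `sources` whose index is at least `min_chunk_index`."""
    return {
        "$and": [
            {"source": {"$in": sources}},
            {"chunk_index": {"$gte": min_chunk_index}},
        ]
    }


compound = sources_from_chunk(["refund-policy.pdf", "enterprise-terms.pdf"], 2)
print(json.dumps(compound, indent=2))
```

The resulting dict would be passed as the filter argument of query_vectors in place of the single-condition filter.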

Data Flow

Here's how a query flows through the system end to end:

When to Use S3 Vectors (and When Not To)

Use S3 Vectors when:

You're building internal RAG apps, agent memory, or semantic search
Query volume is moderate (not thousands of QPS)
You want zero infrastructure management
Cost matters more than single-digit-ms latency

Use a dedicated vector DB when:

You need <10ms query latency consistently
You need hybrid search (keyword + semantic)
Your QPS is in the hundreds or thousands
You need advanced features like aggregations or faceted search

Use both (tiered): S3 Vectors as cheap, durable storage + OpenSearch for hot queries. AWS supports this integration natively.

Integrating with Bedrock Knowledge Bases

If you don't want to write the chunking and embedding code yourself, Bedrock Knowledge Bases can use S3 Vectors as its vector store directly:

Just select "S3 Vectors" as the vector store when creating your Knowledge Base. Bedrock handles chunking, embedding, and storage automatically.

Cleanup

# Delete the vector index
aws s3vectors delete-index \
  --vector-bucket-name my-rag-bucket \
  --index-name my-rag-index

# Delete the vector bucket
aws s3vectors delete-vector-bucket \
  --vector-bucket-name my-rag-bucket


Mastering Gemma 4: A Comprehensive Deep Dive into Google's Next-Generation Open Model Architecture and Deployment

Jubin Soni — Tue, 14 Apr 2026 17:53:14 +0000

The landscape of Large Language Models (LLMs) has shifted dramatically from monolithic, proprietary APIs toward highly efficient, open-weight models that developers can run on commodity hardware. Google’s Gemma series has been at the forefront of this movement. With the release of Gemma 4, the industry sees a significant leap in performance-per-parameter, driven by advanced distillation techniques and architectural refinements that challenge models twice its size.

In this deep dive, we will explore the technical underpinnings of Gemma 4, its unique training methodology, and practical strategies for integrating it into your production environment.

1. The Evolution of Gemma: From 1.0 to 4.0

Gemma 4 represents a synthesis of Google’s Gemini technology tailored for the open-weight community. Unlike previous iterations that focused primarily on raw scale, Gemma 4 emphasizes "density of intelligence": more capability per parameter. By leveraging the same research and technology used in Gemini 1.5 Pro, Gemma 4 targets state-of-the-art results for its size in reasoning, coding, and multilingual understanding.

Key Architectural Pillars

Gemma 4 is built upon a standard transformer decoder architecture but introduces several critical modifications:

Multi-Query Attention (MQA) and Grouped-Query Attention (GQA): Several query heads share a single key/value head, which shrinks the KV cache and speeds up inference.
Sliding Window Attention (SWA): Allows the model to handle longer contexts by focusing on local segments of the sequence while maintaining global coherence through layer-stacking.
Logit Soft-Capping: Prevents logits from becoming too large, which stabilizes training and improves the effectiveness of distillation.
RMSNorm and RoPE: Utilizes Root Mean Square Layer Normalization and Rotary Positional Embeddings for improved numerical stability and better handling of sequence positioning.
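To make the first bullet concrete, here is a small sketch of how query heads map onto shared key/value heads. The head counts are illustrative assumptions, not Gemma 4's actual configuration, and `kv_head_for` is a hypothetical helper name:

```python
def kv_head_for(query_head, num_query_heads, num_kv_heads):
    """Return the index of the key/value head shared by a query head."""
    if num_query_heads % num_kv_heads != 0:
        raise ValueError("num_query_heads must be a multiple of num_kv_heads")
    group_size = num_query_heads // num_kv_heads
    return query_head // group_size


# Illustrative head counts, not Gemma 4's actual configuration.
NUM_QUERY_HEADS = 16

for name, num_kv_heads in [("MHA", 16), ("GQA", 4), ("MQA", 1)]:
    mapping = [
        kv_head_for(q, NUM_QUERY_HEADS, num_kv_heads)
        for q in range(NUM_QUERY_HEADS)
    ]
    cache_fraction = num_kv_heads / NUM_QUERY_HEADS
    print(f"{name}: kv_heads={num_kv_heads}, cache vs MHA={cache_fraction:.4f}")
    print(f"  query head -> kv head: {mapping}")
```

Standard multi-head attention keeps one key/value head per query head, MQA keeps exactly one for all of them, and GQA sits in between. The KV cache shrinks in proportion to the number of key/value heads.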

2. Theoretical Foundations: The Power of Knowledge Distillation

The defining characteristic of Gemma 4 is its reliance on Knowledge Distillation. Instead of training the model from scratch on raw web data alone, Google uses a larger, more capable "Teacher" model (from the Gemini family) to guide the training of the "Student" Gemma model.

How Distillation Works in Gemma 4

In a standard training setup, a model minimizes the cross-entropy loss between its predictions and the ground-truth tokens. In Gemma 4's distillation process, the student model also attempts to match the teacher's probability distribution over next tokens (the softmax of the teacher's logits). Because that distribution assigns weight to plausible alternatives and not only to the single correct token, the smaller model can learn the nuances, uncertainties, and structural reasoning patterns of the larger model.

By optimizing for both ground truth and teacher distributions, Gemma 4 captures complex logical jumps that are usually only present in models with hundreds of billions of parameters.
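The combined objective can be sketched in a few lines. This is a generic textbook distillation loss, not Gemma 4's published training recipe; the weighting `alpha` and the `temperature` are assumptions chosen for illustration:

```python
import math


def softmax(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, target_index,
                      alpha=0.5, temperature=1.0):
    """Blend hard-label cross-entropy with KL(teacher || student)."""
    # Hard-label term: how well the student predicts the true token.
    student_probs = softmax(student_logits)
    cross_entropy = -math.log(student_probs[target_index])

    # Soft-label term: how far the student's distribution is from the teacher's.
    teacher_soft = softmax(teacher_logits, temperature)
    student_soft = softmax(student_logits, temperature)
    kl = sum(
        t * math.log(t / s)
        for t, s in zip(teacher_soft, student_soft)
        if t > 0
    )
    # The conventional temperature**2 factor keeps both terms on a similar scale.
    return alpha * cross_entropy + (1 - alpha) * (temperature ** 2) * kl


student = [1.0, 2.0, 0.5]
teacher = [0.5, 3.0, 0.2]
print(distillation_loss(student, teacher, target_index=1))
```

When the student's logits equal the teacher's, the KL term vanishes and only the cross-entropy term remains.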

3. Comparative Analysis: Gemma 4 vs. The Industry

To understand where Gemma 4 sits in the current ecosystem, we must compare it against its primary competitors: Meta’s Llama series and Mistral AI’s offerings. The following table highlights the architectural differences between the two Gemma 4 sizes and two larger open-weight models.

Feature	Gemma 4 (27B)	Llama 3.1 (70B)	Mistral Large 2	Gemma 4 (9B)
Base Architecture	Decoder-only Transformer	Decoder-only Transformer	Decoder-only Transformer (dense)	Decoder-only Transformer
Attention Mechanism	GQA + Sliding Window	Grouped-Query Attention	Sliding Window	Multi-Query Attention
Context Window	128k Tokens	128k Tokens	128k Tokens	32k Tokens
Training Method	Distillation-heavy	Direct Pre-training	Direct Pre-training	Distillation-heavy
Logit Capping	Yes (Soft-capping)	No	No	Yes (Soft-capping)
License	Gemma Terms of Use	Llama 3.1 Community License	Mistral Research License	Gemma Terms of Use

4. Deep Dive into Implementation: Getting Started

Setting up Gemma 4 requires a Python environment with modern libraries. We will use the transformers library by Hugging Face along with accelerate for efficient memory management.

Environment Setup

First, ensure you have the latest versions of the required packages:

pip install -U transformers accelerate bitsandbytes torch

Basic Inference with Gemma 4

The following script demonstrates how to load the Gemma 4 9B model in 4-bit quantization to save VRAM while maintaining performance.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# Configure 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

model_id = "google/gemma-4-9b-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto"
)

# Prepare the prompt using the chat template
messages = [
    {"role": "user", "content": "Explain the concept of quantum entanglement using a cat analogy."}
]

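# return_dict=True yields both input_ids and attention_mask, and behaves
# the same across transformers versions
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
)

# Slice off the prompt tokens so only the newly generated text is decoded
prompt_len = inputs["input_ids"].shape[-1]
response = tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True)
print(f"Gemma 4 Response:\n{response}")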

Explanation of the Code

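BitsAndBytesConfig: We use NormalFloat 4 (nf4) quantization for the weights, with bfloat16 as the compute dtype. The 9B model would normally need ~18GB of VRAM for its weights in 16-bit precision; at 4 bits the weights take about 4.5GB, and with quantization constants, activations, and the KV cache the total comes to roughly 5-6GB, making it accessible for consumer GPUs like the RTX 3060.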
device_map="auto": This automatically handles the distribution of model layers across available GPUs and CPUs.
apply_chat_template: Gemma 4 uses specific control tokens (like <start_of_turn>) to distinguish between user and assistant roles. Using the built-in template ensures the model receives the prompt in the exact format it was trained on.
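The memory figures above follow from simple arithmetic on the weights. A quick sketch, assuming a nominal 9 billion parameters and counting weights only:

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory for the weights alone, in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9


PARAMS = 9e9  # nominal parameter count of the 9B model

print(f"bf16 : {weight_memory_gb(PARAMS, 16):.1f} GB")  # 18.0 GB
print(f"int8 : {weight_memory_gb(PARAMS, 8):.1f} GB")   # 9.0 GB
print(f"nf4  : {weight_memory_gb(PARAMS, 4):.1f} GB")   # 4.5 GB
```

Actual usage is higher than the 4-bit figure because quantization constants, any layers left unquantized, activations, and the KV cache all need memory too.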

5. Sequence Flows in Gemma 4 Applications

When deploying Gemma 4 in a Retrieval-Augmented Generation (RAG) pipeline, the interaction between the orchestrator, the vector database, and the model follows a specific sequence. Understanding this flow is vital for optimizing latency.
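A typical sequence is embed, retrieve, build the prompt, then generate. The sketch below assumes that ordering and times each stage so the slowest one is easy to spot. The `embed`, `search`, and `generate` callables are hypothetical stand-ins for an embedding model, a vector database client, and a Gemma 4 inference call:

```python
import time


def timed(stage, timings, fn, *args):
    """Run fn(*args) and record its wall-clock duration under `stage`."""
    start = time.perf_counter()
    result = fn(*args)
    timings[stage] = time.perf_counter() - start
    return result


def answer(question, embed, search, generate, top_k=3):
    """Orchestrate one RAG request: embed -> retrieve -> prompt -> generate."""
    timings = {}
    query_vector = timed("embed", timings, embed, question)
    chunks = timed("retrieve", timings, search, query_vector, top_k)
    prompt = "Context:\n" + "\n---\n".join(chunks) + f"\n\nQuestion: {question}"
    reply = timed("generate", timings, generate, prompt)
    return reply, timings


# Stand-in components so the sketch runs without a model or a vector store.
def fake_embed(text):
    return [float(len(text))]


def fake_search(vector, top_k):
    return [f"chunk {i}" for i in range(top_k)]


def fake_generate(prompt):
    return f"(answer based on {prompt.count('chunk')} chunks)"


reply, timings = answer("What is soft-capping?", fake_embed, fake_search, fake_generate)
print(reply)
print({stage: f"{seconds * 1000:.3f} ms" for stage, seconds in timings.items()})
```

In a real deployment the generate stage usually dominates, so per-stage timings like these show whether retrieval tuning is worth the effort at all.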

6. Advanced Optimization: Logit Soft-Capping and Stability

A technical nuance in Gemma 4 is the implementation of Logit Soft-Capping. During the generation process, the raw output of the last layer (logits) can sometimes reach extreme values, leading to "peaky" probability distributions where the model becomes overconfident or starts repeating itself.

Gemma 4 applies a function to constrain these values:

logit = capacity * tanh(logit / capacity)

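Here capacity is the soft-cap value. In Gemma 2, which introduced soft-capping to the family, it is 50.0 for the attention logits and 30.0 for the final-layer logits. Because tanh is nearly linear around zero, ordinary logits pass through almost unchanged while extreme ones are squeezed toward ±capacity, so no single token dominates the distribution too early and long-form generation stays more stable.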
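The effect of the capping function is easiest to see numerically. A minimal sketch of the formula above, where the cap value is only an illustrative choice:

```python
import math


def soft_cap(logit, cap):
    """Smoothly bound a logit so its magnitude never exceeds cap."""
    return cap * math.tanh(logit / cap)


CAP = 30.0  # illustrative value
for raw in [0.5, 10.0, 30.0, 100.0, 1000.0]:
    print(f"{raw:>7.1f} -> {soft_cap(raw, CAP):.3f}")
```

Small logits come out nearly unchanged, while very large ones saturate just below the cap. Unlike hard clipping, the function stays smooth and differentiable everywhere.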

7. Efficient Fine-Tuning with PEFT and LoRA

To adapt Gemma 4 to specific domains (e.g., medical, legal, or proprietary codebases), Parameter-Efficient Fine-Tuning (PEFT) using Low-Rank Adaptation (LoRA) is the recommended approach. This method keeps the base model weights frozen and only trains a small set of adapter layers.

Practical LoRA Configuration

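# Requires: pip install peft
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# The base model above was loaded in 4-bit, so prepare it for k-bit training
# before attaching adapters
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,  # rank of the low-rank update matrices
    lora_alpha=32,  # updates are scaled by lora_alpha / r = 2.0
    target_modules=["q_proj", "o_proj", "k_proj", "v_proj", "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()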

Targeting all linear layers (the attention projections and the MLP/gate modules) gives the adapters enough capacity to learn the linguistic nuances of the new domain. Because the base weights stay frozen, the risk of catastrophic forgetting is reduced, though not eliminated.
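To see why so few parameters end up trainable, count what LoRA adds to a single linear layer: a rank-r pair of matrices. The layer shape below is hypothetical and chosen only for illustration:

```python
def lora_params(d_in, d_out, r):
    """Trainable parameters LoRA adds to one d_in x d_out linear layer."""
    # A is (r x d_in) and B is (d_out x r); the frozen weight is untouched.
    return r * (d_in + d_out)


# Hypothetical layer shape, not taken from the Gemma 4 config.
d_in, d_out, r = 4096, 4096, 16
full = d_in * d_out
added = lora_params(d_in, d_out, r)
print(f"full layer: {full:,} params")
print(f"LoRA r={r}: {added:,} params ({100 * added / full:.2f}% of the layer)")
```

Doubling `r` doubles the adapter size, so rank is the main knob for trading adapter capacity against memory.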

8. The Gemma 4 Ecosystem Mindmap

Navigating the tools and frameworks available for Gemma 4 can be overwhelming. The ecosystem falls into four primary domains: Inference, Fine-Tuning, Deployment, and Evaluation.

9. Handling the 128k Context Window

One of the most significant upgrades in Gemma 4 is the massive 128k token context window. However, processing 128k tokens is computationally expensive. Gemma 4 manages this through Sliding Window Attention (SWA).

In SWA, each layer does not attend to all previous tokens. Instead, it attends to a fixed-size "window" of recent tokens. Because these layers are stacked, layer N can effectively "see" information from further back via the intermediate representations of layer N-1. This reduces the computational complexity from O(n^2) to O(n * w), where w is the window size.
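The saving can be checked by counting the query-key pairs each scheme evaluates. This is a sketch with illustrative sizes; the window used here is an assumption, not Gemma 4's actual value:

```python
def attended_pairs(seq_len, window=None):
    """Count (query, key) pairs under causal attention.

    With `window` set, each query sees itself and at most window - 1
    preceding tokens; with None it sees every earlier token.
    """
    total = 0
    for query_pos in range(seq_len):
        first_key = 0 if window is None else max(0, query_pos - window + 1)
        total += query_pos - first_key + 1
    return total


n, w = 8192, 1024  # illustrative sizes
full = attended_pairs(n)
local = attended_pairs(n, w)
print(f"full causal : {full:,} pairs")
print(f"window={w} : {local:,} pairs ({full / local:.1f}x fewer)")
```

The full count grows quadratically with sequence length, while the windowed count grows linearly once the sequence is longer than the window, so the gap widens as the context gets longer.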

Deployment Considerations for Long Context

When utilizing the full 128k window, memory consumption for the KV (Key-Value) cache becomes the bottleneck.

KV Cache Quantization: Storing the KV cache in 8-bit or 4-bit instead of 16-bit reduces its memory usage by roughly 50% or 75% respectively.
Paged Attention: Using frameworks like vLLM allows for dynamic memory allocation, preventing fragmentation when handling multiple long-context requests simultaneously.
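A back-of-the-envelope estimate shows why the cache dominates at long context. The model shape below is hypothetical, so read the real values from the model config. The sketch also ignores quantization scale overhead and any savings from sliding-window layers:

```python
def kv_cache_gb(seq_len, num_layers, num_kv_heads, head_dim, bits):
    """KV cache size for one sequence, in decimal gigabytes."""
    # Factor of 2: one tensor for keys and one for values per layer.
    values = 2 * num_layers * num_kv_heads * head_dim * seq_len
    return values * bits / 8 / 1e9


# Hypothetical model shape for illustration only.
shape = dict(seq_len=128_000, num_layers=40, num_kv_heads=8, head_dim=128)

baseline = kv_cache_gb(bits=16, **shape)
for bits in (16, 8, 4):
    size = kv_cache_gb(bits=bits, **shape)
    saved = 100 * (1 - size / baseline)
    print(f"{bits:>2}-bit cache: {size:.1f} GB ({saved:.0f}% saved)")
```

The estimate is per sequence, so serving several long-context requests at once multiplies it accordingly.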

10. Benchmarking and Performance Metrics

Gemma 4's strength is "Reasoning Density": the model's ability to solve complex mathematical and logical problems relative to its parameter count. In the table below, the 27B variant comes within two points of Llama 3.1 (70B) on MMLU (Massive Multitask Language Understanding) and GSM8K, and edges ahead of it on HumanEval and MBPP, which suggests that training-data quality and distillation can offset a large gap in scale.

Performance Comparison Table

Benchmark	Gemma 4 (27B)	Llama 3.1 (70B)	Gemma 4 (9B)	GPT-4o (Reference)
MMLU	78.2%	79.9%	71.3%	88.7%
GSM8K (Math)	82.1%	82.5%	74.0%	94.2%
HumanEval (Code)	68.5%	67.2%	55.4%	86.6%
MBPP	72.0%	70.1%	62.1%	84.1%

11. Ethical Considerations and Safety

Google has integrated a robust safety framework into Gemma 4. This includes:

Data Filtering: Rigorous removal of personally identifiable information (PII) and harmful content from the pre-training set.
Reinforcement Learning from Human Feedback (RLHF): Tuning the model to follow instructions while refusing harmful requests.
Red Teaming: Extensive testing against adversarial attacks to ensure the model remains helpful yet harmless.

Developers are encouraged to use the Responsible Generative AI Toolkit provided by Google to audit their fine-tuned versions of Gemma 4 before deployment.

12. Conclusion

Gemma 4 marks a turning point in the accessibility of high-performance AI. By successfully distilling the intelligence of a frontier model like Gemini into an open-weight format, Google has provided developers with a tool that is both powerful enough for complex reasoning and efficient enough for local deployment. Whether you are building a sophisticated RAG system, a specialized coding assistant, or an edge-based application, Gemma 4 provides the architectural flexibility and performance density required for the next generation of AI applications.