<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Jubin Soni</title>
    <description>The latest articles on DEV Community by Jubin Soni (@jubinsoni).</description>
    <link>https://dev.to/jubinsoni</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3304475%2F69e594af-a39b-4e01-81fd-ebd67b67de37.jpeg</url>
      <title>DEV Community: Jubin Soni</title>
      <link>https://dev.to/jubinsoni</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/jubinsoni"/>
    <language>en</language>
    <item>
      <title>Jules: Google's Async Coding Agent Is Changing How We Think About AI and Software Development</title>
      <dc:creator>Jubin Soni</dc:creator>
      <pubDate>Fri, 29 May 2026 18:12:09 +0000</pubDate>
      <link>https://dev.to/jubinsoni/jules-googles-async-coding-agent-is-changing-how-we-think-about-ai-and-software-development-3j0d</link>
      <guid>https://dev.to/jubinsoni/jules-googles-async-coding-agent-is-changing-how-we-think-about-ai-and-software-development-3j0d</guid>
      <description>&lt;p&gt;There's a quiet architectural shift happening in how we build software, and it doesn't look like what most people expected.&lt;/p&gt;

&lt;p&gt;We've spent the last two years treating AI like a very fast autocomplete — a co-pilot sitting shotgun, responding the moment we type. Cursor, Copilot, Gemini Code Assist: all synchronous, all requiring you to stay in the loop, all fundamentally keeping you as the CPU driving execution.&lt;/p&gt;

&lt;p&gt;Jules breaks that model.&lt;/p&gt;

&lt;p&gt;Google's async coding agent, which went generally available in 2025 and got major updates at I/O 2026, doesn't help you write code faster. It removes you from the writing loop entirely. You assign a task. Jules works. You review a pull request. That's it.&lt;/p&gt;

&lt;p&gt;This article breaks down how Jules works technically — with architecture diagrams, sequence flows, and real code — and why the async model might be more significant than it first appears.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Jules Actually Does (and Doesn't Do)
&lt;/h2&gt;

&lt;p&gt;Jules is not an IDE plugin. It's not an inline suggestion engine. It's not a chat interface for your codebase.&lt;/p&gt;

&lt;p&gt;Jules is a &lt;strong&gt;task-based async agent&lt;/strong&gt;. You give it a scoped coding task — fix a bug, migrate a module, add a feature, write tests — and it:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Clones your repository into a &lt;strong&gt;secure Google Cloud VM&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Analyzes the relevant codebase context (2M token context window as of I/O 2026)&lt;/li&gt;
&lt;li&gt;Writes a step-by-step implementation plan using Gemini Pro&lt;/li&gt;
&lt;li&gt;Executes that plan: writing code, running tests, fixing errors&lt;/li&gt;
&lt;li&gt;Opens a &lt;strong&gt;pull request&lt;/strong&gt; against your branch with a description, diff, and change summary&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;When it's done, you're not staring at a chat window waiting to approve line-by-line edits. You're reviewing a PR — just like you would from any engineer on your team.&lt;/p&gt;

&lt;p&gt;The 2026 update closes the loop further: if the CI/CD pipeline fails on the Jules-authored PR, Jules automatically receives the error, analyzes it, applies a fix, and re-pushes the commit — often without any human intervention at all.&lt;/p&gt;




&lt;h2&gt;
  
  
  Architecture: How Jules Is Built
&lt;/h2&gt;


&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
    &lt;div class="c-embed__content"&gt;
        &lt;div class="c-embed__cover"&gt;
          &lt;a href="https://jules.google.com/session" class="c-link align-middle" rel="noopener noreferrer"&gt;
            &lt;img alt="" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.gstatic.com%2Flabs-code%2Flabs-logo.svg" height="320" class="m-0" width="329"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
      &lt;div class="c-embed__body"&gt;
        &lt;h2 class="fs-xl lh-tight"&gt;
          &lt;a href="https://jules.google.com/session" rel="noopener noreferrer" class="c-link"&gt;
            
          &lt;/a&gt;
        &lt;/h2&gt;
          &lt;p class="truncate-at-3"&gt;
            Jules is your end-to-end agentic product development platform. It reads your entire product context to figure out what to build next, comes up with solutions, and then ships a PR.
          &lt;/p&gt;
        &lt;div class="color-secondary fs-s flex items-center"&gt;
            &lt;img alt="favicon" class="c-embed__favicon m-0 mr-2 radius-0" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.gstatic.com%2Flabs-code%2Fcode-app%2Ffavicon-32x32.png" width="32" height="33"&gt;
          jules.google.com
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;


&lt;p&gt;Here's how the components fit together:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhi6en0opwr4zqvhghz8z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhi6en0opwr4zqvhghz8z.png" alt="Architecture image description" width="799" height="644"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Key architectural choices:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Isolated VM per task&lt;/strong&gt;: no shared state between runs, reproducible environments&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Network access retained&lt;/strong&gt;: Jules can &lt;code&gt;npm install&lt;/code&gt;, run builds, call APIs — unlike Codex which sandboxes with no egress&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Two-model split&lt;/strong&gt;: Gemini Pro handles planning and hard reasoning; Gemini Flash handles lighter subtasks for cost efficiency&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Native GitHub integration&lt;/strong&gt;: reads issues, creates branches, authors commits, opens PRs — not a wrapper, it's first-class&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Sequence: The Async Flow End-to-End
&lt;/h2&gt;

&lt;p&gt;The sequence below shows what happens from task assignment to merged PR, and where the developer is actually &lt;em&gt;free&lt;/em&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsb84sg6n7chxmxe63l7y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsb84sg6n7chxmxe63l7y.png" alt="Sequence description" width="800" height="689"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The key insight: the developer's attention is only required at &lt;strong&gt;step 1&lt;/strong&gt; (spec) and &lt;strong&gt;step 12&lt;/strong&gt; (review). Everything in between is Jules.&lt;/p&gt;




&lt;h2&gt;
  
  
  Code: Using Jules in Your Workflow
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Jules CLI (Jules Tools — GA at I/O 2026)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install Jules Tools CLI&lt;/span&gt;
npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; @google/jules-tools

&lt;span class="c"&gt;# Authenticate&lt;/span&gt;
jules auth login

&lt;span class="c"&gt;# Submit a task against a GitHub repo&lt;/span&gt;
jules task create &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--repo&lt;/span&gt; your-org/your-repo &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--branch&lt;/span&gt; main &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"Fix the race condition in payment/processor.go — 
   two concurrent requests can double-charge. 
   Add regression tests covering the concurrent case."&lt;/span&gt;

&lt;span class="c"&gt;# Check task status&lt;/span&gt;
jules task status &amp;lt;task-id&amp;gt;

&lt;span class="c"&gt;# List open tasks&lt;/span&gt;
jules task list &lt;span class="nt"&gt;--status&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;in&lt;/span&gt;&lt;span class="nt"&gt;-progress&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Via Gemini CLI Extension
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install Gemini CLI&lt;/span&gt;
npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; @google/gemini-cli

&lt;span class="c"&gt;# Add Jules extension&lt;/span&gt;
gemini extensions &lt;span class="nb"&gt;install &lt;/span&gt;https://github.com/gemini-cli-extensions/jules &lt;span class="nt"&gt;--auto-update&lt;/span&gt;

&lt;span class="c"&gt;# Submit directly from your terminal&lt;/span&gt;
/jules Fix the flaky integration tests &lt;span class="k"&gt;in &lt;/span&gt;auth/session_test.go. 
       Root cause appears to be missing teardown between &lt;span class="nb"&gt;test &lt;/span&gt;runs.

&lt;span class="c"&gt;# Jules responds async — you get a PR link when it's done&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. Jules API (for CI/CD integration)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;google.auth&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;jules_client&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;JulesClient&lt;/span&gt;

&lt;span class="n"&gt;credentials&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;project&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;google&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;auth&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;default&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;JulesClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;credentials&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;credentials&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Submit a task programmatically
&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tasks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;repo&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-org/your-repo&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;branch&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;main&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
        Migrate the UserRepository class from raw SQL to 
        the new ORM layer introduced in db/orm.py.
        Preserve all existing query behaviour and update tests.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;migration&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;automated&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Task submitted: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Track at: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Poll for completion (or use webhooks)
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;completed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;failed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tasks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;completed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PR ready: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pull_request_url&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4. GitHub Actions Integration
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# .github/workflows/jules-debt.yml&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Weekly tech debt sweep&lt;/span&gt;

&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;schedule&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;cron&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;0&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;9&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;MON'&lt;/span&gt;   &lt;span class="c1"&gt;# Every Monday at 9am&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;sweep&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Submit Jules tasks from tech-debt.md&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;google/jules-action@v1&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;jules-api-key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.JULES_API_KEY }}&lt;/span&gt;
          &lt;span class="na"&gt;task-file&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;.github/tech-debt.md&lt;/span&gt;
          &lt;span class="na"&gt;branch&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;main&lt;/span&gt;
          &lt;span class="na"&gt;auto-merge&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;   &lt;span class="c1"&gt;# Always require human review&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Practical Workflow: What Jules Is Good At
&lt;/h2&gt;

&lt;p&gt;Jules excels when &lt;strong&gt;the unit of work maps to a ticket&lt;/strong&gt;. The sharper your spec, the better the output.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task Type&lt;/th&gt;
&lt;th&gt;Jules Fit&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Bug fix with clear repro steps&lt;/td&gt;
&lt;td&gt;✅ Excellent&lt;/td&gt;
&lt;td&gt;Deterministic target, testable outcome&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Add test coverage to a module&lt;/td&gt;
&lt;td&gt;✅ Excellent&lt;/td&gt;
&lt;td&gt;Well-defined scope, no design decisions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dependency upgrades with API changes&lt;/td&gt;
&lt;td&gt;✅ Good&lt;/td&gt;
&lt;td&gt;Mechanical but multi-file&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Migrate module to new framework/ORM&lt;/td&gt;
&lt;td&gt;✅ Good&lt;/td&gt;
&lt;td&gt;Repetitive pattern Jules handles well&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Security patch + regression tests&lt;/td&gt;
&lt;td&gt;✅ Good&lt;/td&gt;
&lt;td&gt;Scoped + CI validates automatically&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Exploratory refactor (uncertain scope)&lt;/td&gt;
&lt;td&gt;⚠️ Risky&lt;/td&gt;
&lt;td&gt;Scope drift, Jules may over-engineer&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Greenfield architecture design&lt;/td&gt;
&lt;td&gt;❌ Wrong tool&lt;/td&gt;
&lt;td&gt;No acceptance criteria to validate against&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Real-time pair debugging&lt;/td&gt;
&lt;td&gt;❌ Wrong paradigm&lt;/td&gt;
&lt;td&gt;Needs synchronous back-and-forth&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;The honest rule of thumb&lt;/strong&gt;: if you could write a solid Jira ticket for it, Jules can probably do it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Jules vs. the Field
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Jules&lt;/th&gt;
&lt;th&gt;Claude Code&lt;/th&gt;
&lt;th&gt;OpenAI Codex&lt;/th&gt;
&lt;th&gt;GitHub Copilot&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Execution model&lt;/td&gt;
&lt;td&gt;Async (PR delivery)&lt;/td&gt;
&lt;td&gt;Sync (interactive terminal)&lt;/td&gt;
&lt;td&gt;Async (PR delivery)&lt;/td&gt;
&lt;td&gt;Sync (inline suggestion)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Runtime environment&lt;/td&gt;
&lt;td&gt;Google Cloud VM&lt;/td&gt;
&lt;td&gt;Local / container&lt;/td&gt;
&lt;td&gt;Cloud sandbox&lt;/td&gt;
&lt;td&gt;Editor plugin&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Network access in VM&lt;/td&gt;
&lt;td&gt;✅ Yes&lt;/td&gt;
&lt;td&gt;✅ Yes&lt;/td&gt;
&lt;td&gt;❌ No (strict sandbox)&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GitHub integration&lt;/td&gt;
&lt;td&gt;Native (issues → PR)&lt;/td&gt;
&lt;td&gt;Via CLI&lt;/td&gt;
&lt;td&gt;Native&lt;/td&gt;
&lt;td&gt;Native&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Languages supported&lt;/td&gt;
&lt;td&gt;Node, Python, Go, Rust, Java&lt;/td&gt;
&lt;td&gt;Any&lt;/td&gt;
&lt;td&gt;Node, Python primary&lt;/td&gt;
&lt;td&gt;Any&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Parallel task execution&lt;/td&gt;
&lt;td&gt;✅ Yes&lt;/td&gt;
&lt;td&gt;❌ One at a time&lt;/td&gt;
&lt;td&gt;✅ Yes&lt;/td&gt;
&lt;td&gt;❌ One at a time&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CI auto-fix loop&lt;/td&gt;
&lt;td&gt;✅ Yes (2026)&lt;/td&gt;
&lt;td&gt;❌ No&lt;/td&gt;
&lt;td&gt;❌ No&lt;/td&gt;
&lt;td&gt;❌ No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Context window&lt;/td&gt;
&lt;td&gt;2M tokens&lt;/td&gt;
&lt;td&gt;~200K tokens&lt;/td&gt;
&lt;td&gt;~128K tokens&lt;/td&gt;
&lt;td&gt;~8K tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Best for&lt;/td&gt;
&lt;td&gt;Delegated ticket work&lt;/td&gt;
&lt;td&gt;Complex collaborative tasks&lt;/td&gt;
&lt;td&gt;Security-sensitive workflows&lt;/td&gt;
&lt;td&gt;Inline acceleration&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  What Jules Gets Right — and Where It's Still Incomplete
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What's working:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The async PR model genuinely removes you from low-value execution loops&lt;/li&gt;
&lt;li&gt;CI integration with auto-fix is a real quality-of-life improvement for teams&lt;/li&gt;
&lt;li&gt;Multi-language runtime support (Node, Python, Go, Rust, Java) is broader than most competitors&lt;/li&gt;
&lt;li&gt;The CLI and Gemini CLI extension make it composable into existing dev workflows&lt;/li&gt;
&lt;li&gt;2M token context means Jules can reason across large codebases without truncation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What's still incomplete:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Jules validates against tests — codebases with thin coverage expose the reviewer to unknown unknowns&lt;/li&gt;
&lt;li&gt;The debugging story for multi-agent ADK workflows is thin; distributed AI agent observability is largely unsolved&lt;/li&gt;
&lt;li&gt;Spec quality gates: Jules has no way to flag an underspecified task before burning compute on it&lt;/li&gt;
&lt;li&gt;For exploratory or greenfield work, you still need a synchronous collaborator&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Bigger Picture: What This Means for SWEs
&lt;/h2&gt;

&lt;p&gt;Jules isn't a replacement for engineers. It's a redefinition of what "engineering work" means at the margin.&lt;/p&gt;

&lt;p&gt;The value of a senior engineer is increasingly &lt;strong&gt;not&lt;/strong&gt; in the ability to implement — it's in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Writing specs precise enough that an agent can execute them&lt;/li&gt;
&lt;li&gt;Reviewing AI-generated PRs for correctness, design quality, and unintended side effects&lt;/li&gt;
&lt;li&gt;Knowing when to reach for async delegation vs. interactive collaboration&lt;/li&gt;
&lt;li&gt;Building and maintaining the test coverage and CI infrastructure that makes async agents safe to trust&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Google I/O 2026 framed this explicitly: the engineers who get the most from agentic coding will be those who run both patterns in parallel — async for ticket-level work, sync for exploration — not those who pick a favorite.&lt;/p&gt;

&lt;p&gt;Jules is a real tool for real workflows right now. If you have a backlog of well-scoped tasks and a codebase with decent test coverage, it's worth spinning up.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Google Developers Blog — &lt;a href="https://developers.googleblog.com/all-the-news-from-the-google-io-2026-developer-keynote/" rel="noopener noreferrer"&gt;All the news from the Google I/O 2026 Developer keynote&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Google Blog — &lt;a href="https://blog.google/innovation-and-ai/technology/ai/google-io-2026-all-our-announcements/" rel="noopener noreferrer"&gt;100 things we announced at Google I/O 2026&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Google Research — &lt;a href="https://research.google/blog/ai-in-software-engineering-at-google-progress-and-the-path-ahead/" rel="noopener noreferrer"&gt;AI in software engineering at Google: Progress and the path ahead&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Google Cloud Blog — &lt;a href="https://cloud.google.com/blog/products/ai-machine-learning/innovations-from-google-io-26-on-google-cloud" rel="noopener noreferrer"&gt;Innovations from Google I/O 26 on Google Cloud&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;TechCrunch — &lt;a href="https://tech.yahoo.com/ai/gemini/articles/google-jules-enters-developers-toolchains-180000308.html" rel="noopener noreferrer"&gt;Google's Jules enters developers' toolchains as AI coding agent competition heats up&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;AI Builder Club — &lt;a href="https://www.aibuilderclub.com/blog/google-io-2026-complete-roundup" rel="noopener noreferrer"&gt;Google I/O 2026: Everything That Matters for AI Builders&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>ai</category>
      <category>google</category>
      <category>webdev</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Amazon Quick: AWS's Agentic Workspace, Explained for Engineers</title>
      <dc:creator>Jubin Soni</dc:creator>
      <pubDate>Thu, 21 May 2026 03:47:40 +0000</pubDate>
      <link>https://dev.to/jubinsoni/amazon-quick-awss-agentic-workspace-explained-for-engineers-3lc2</link>
      <guid>https://dev.to/jubinsoni/amazon-quick-awss-agentic-workspace-explained-for-engineers-3lc2</guid>
      <description>&lt;p&gt;AWS has been building agentic infrastructure for some time now — Bedrock, AgentCore, Strands — mostly aimed at engineers who want to build their own agent systems from scratch. Amazon Quick is a different layer of the same bet: a ready-to-use agentic workspace that targets teams directly, without requiring custom orchestration code.&lt;/p&gt;

&lt;p&gt;This article walks through what Quick is, how its components fit together technically, how the MCP integration model works with real code, and where it sits relative to the rest of AWS's agent stack.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Amazon Quick Is
&lt;/h2&gt;

&lt;p&gt;Amazon Quick is an AI assistant for work that connects to your existing tools — Slack, Microsoft Teams, Outlook, CRMs, databases, and local files — and gives a unified layer for querying, automating, and acting across them. It launched in preview at AWS's "What's Next with AWS" event on April 28, 2026.&lt;/p&gt;

&lt;p&gt;The product is aimed at teams, not just individual users. One person can build a custom agent scoped to a specific dataset or workflow, and the whole team benefits from it. Responses from Quick agents are grounded in your actual business data, not the underlying model's training distribution.&lt;/p&gt;

&lt;p&gt;Under the hood, Quick is built on Amazon Bedrock AgentCore and uses the Model Context Protocol (MCP) as its standard for connecting to external tools. It runs on AWS IAM and VPC, which means it inherits the same security and compliance posture as the rest of your AWS workloads.&lt;/p&gt;




&lt;h2&gt;
  
  
  Product Components
&lt;/h2&gt;

&lt;p&gt;Quick bundles five distinct capabilities. It helps to understand each one separately before thinking about how they compose.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Spaces&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Collaborative workspaces where teams pool files, dashboards, and data sources. Agents in a Space are grounded in that Space's data.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Agents&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Custom, domain-scoped agents built on your team's specific data. One person builds, everyone uses.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Research&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Multi-source synthesis across internal data, the public web, and third-party datasets. Produces structured reports.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Visualize (Quick Sight)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Integrated BI layer. Conversational access to dashboards, charts, and forecasting — no separate BI tool required.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Automate (Quick Flows)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Workflow automation from simple daily tasks to complex multi-step processes with cross-app action execution.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Each component is available through the web app, mobile, and a native desktop app (currently in preview for macOS and Windows) that can read local files and calendar context without requiring browser access.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where Quick Sits in the AWS Agent Stack
&lt;/h2&gt;

&lt;p&gt;AWS is building in two directions at once. AgentCore is the infrastructure layer for engineers who want to compose their own agent systems — runtime, memory, gateway, observability — with any model and any framework. Quick is the product layer on top: opinionated, team-facing, and deployable without writing orchestration code.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftj2f5f513ginptspgy7w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftj2f5f513ginptspgy7w.png" alt="Architecture description" width="800" height="511"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The practical implication: if you're an engineer building internal tools or automation pipelines, you'll likely interact with both layers. AgentCore for the infrastructure wiring; Quick as a surface where non-technical teammates interact with the agents you build.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Integration Architecture
&lt;/h2&gt;

&lt;p&gt;The core question for any engineer evaluating Quick is: how does it actually connect to external systems, and what does the request path look like?&lt;/p&gt;

&lt;p&gt;Quick uses MCP (Model Context Protocol) as its primary integration standard. This is significant because MCP is an open protocol — it means Quick agents are not locked into AWS-specific connectors, and any MCP-compatible server can be registered as a tool source.&lt;/p&gt;

&lt;h3&gt;
  
  
  High-Level Request Flow
&lt;/h3&gt;

&lt;p&gt;The sequence below shows the full lifecycle of a single agent-triggered tool call — from the moment Quick receives a prompt through to the response returning from a downstream API.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7fo6ahxrz2jq5iq99wzt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7fo6ahxrz2jq5iq99wzt.png" alt="sequence diagram description" width="800" height="427"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Quick acts as the MCP client. Your MCP server exposes tools via &lt;code&gt;listTools&lt;/code&gt; and &lt;code&gt;callTool&lt;/code&gt;. Quick discovers them at registration time and makes them available to any agent or automation in the workspace. Authentication flows through OAuth 2.0, with support for Dynamic Client Registration (DCR) so Quick can register itself automatically without manual credential setup.&lt;/p&gt;




&lt;h2&gt;
  
  
  Building an MCP Server for Quick
&lt;/h2&gt;

&lt;p&gt;Here is a minimal Python MCP server using the &lt;code&gt;mcp&lt;/code&gt; SDK that exposes two tools Quick can invoke — &lt;code&gt;get_ticket&lt;/code&gt; and &lt;code&gt;list_open_tickets&lt;/code&gt;. This pattern works whether you host the server yourself or run it on AgentCore Runtime.&lt;/p&gt;

&lt;h3&gt;
  
  
  Install dependencies
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;mcp[server] httpx uvicorn
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Server implementation
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# server.py
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;mcp.server&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Server&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;mcp.server.sse&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SseServerTransport&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;mcp.types&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Tool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TextContent&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;starlette.applications&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Starlette&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;starlette.routing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Route&lt;/span&gt;

&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Server&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;jira-quick-integration&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;JIRA_BASE_URL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://yourorg.atlassian.net&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;JIRA_TOKEN&lt;/span&gt;    &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bearer &amp;lt;your-token&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# in production, load from AWS Secrets Manager
&lt;/span&gt;

&lt;span class="nd"&gt;@app.list_tools&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;list_tools&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Tool&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="nc"&gt;Tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;get_ticket&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Retrieve details for a single Jira ticket by issue key.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;inputSchema&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;object&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;properties&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;issue_key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;string&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The Jira issue key, e.g. ENG-1234&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                    &lt;span class="p"&gt;}&lt;/span&gt;
                &lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;required&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;issue_key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="nc"&gt;Tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;list_open_tickets&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;List open Jira tickets assigned to a given user.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;inputSchema&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;object&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;properties&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;assignee&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;string&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The Jira username or email of the assignee&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                    &lt;span class="p"&gt;}&lt;/span&gt;
                &lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;required&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;assignee&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;


&lt;span class="nd"&gt;@app.call_tool&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;call_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;arguments&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;TextContent&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Authorization&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;JIRA_TOKEN&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Content-Type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;application/json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;AsyncClient&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;get_ticket&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;arguments&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;issue_key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;JIRA_BASE_URL&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/rest/api/3/issue/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;raise_for_status&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="n"&gt;summary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fields&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;summary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="n"&gt;status&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fields&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;TextContent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; [&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;

        &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;list_open_tickets&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;assignee&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;arguments&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;assignee&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="n"&gt;jql&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;assignee=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;assignee&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; AND status != Done ORDER BY updated DESC&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;JIRA_BASE_URL&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/rest/api/3/search&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;jql&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;jql&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;maxResults&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;raise_for_status&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="n"&gt;issues&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;issues&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])&lt;/span&gt;
            &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
                &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;key&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;fields&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;summary&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;issues&lt;/span&gt;
            &lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;TextContent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;No open tickets found.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;

    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Unknown tool: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="c1"&gt;# Wire up SSE transport for Quick compatibility
&lt;/span&gt;&lt;span class="n"&gt;sse&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SseServerTransport&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/messages/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;handle_sse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;sse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect_sse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scope&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;receive&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_send&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;streams&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;streams&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;streams&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_initialization_options&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="n"&gt;starlette_app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Starlette&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;routes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;Route&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/sse&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;handle_sse&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;uvicorn&lt;/span&gt;
    &lt;span class="n"&gt;uvicorn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;starlette_app&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0.0.0.0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8080&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A few design constraints to be aware of when building for Quick:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Each MCP tool call has a &lt;strong&gt;300-second hard timeout&lt;/strong&gt;. Operations that exceed this fail with HTTP 424. Keep individual tool calls narrow and fast.&lt;/li&gt;
&lt;li&gt;The tool list is treated as &lt;strong&gt;static after registration&lt;/strong&gt;. If you add or remove tools on the server, the Quick admin must re-establish the connection to pick up changes.&lt;/li&gt;
&lt;li&gt;Quick supports both &lt;strong&gt;Server-Sent Events (SSE)&lt;/strong&gt; and &lt;strong&gt;streamable HTTP&lt;/strong&gt; as transports. Streamable HTTP is preferred for new implementations.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Registering the MCP Server in Quick
&lt;/h2&gt;

&lt;p&gt;Once your server is running and publicly reachable over HTTPS, registration in Quick takes the following path:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Quick Console → Integrations → Add Integration → MCP

Fields:
  Server URL:        https://your-mcp-server.example.com/sse
  Auth type:         OAuth 2.0 (or Service, or None)
  Client ID:         &amp;lt;from your identity provider&amp;gt;
  Authorization URL: https://auth.example.com/oauth/authorize
  Token URL:         https://auth.example.com/oauth/token
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If your identity provider supports OAuth Dynamic Client Registration, Quick will auto-register and you skip the manual client ID step entirely. Quick sends an initial unauthenticated request to the MCP server; if it receives a &lt;code&gt;401&lt;/code&gt; with a &lt;code&gt;WWW-Authenticate&lt;/code&gt; header containing a &lt;code&gt;resource_metadata&lt;/code&gt; URL, it fetches the metadata document and proceeds with DCR automatically.&lt;/p&gt;

&lt;p&gt;Once registered, Quick calls &lt;code&gt;listTools&lt;/code&gt; at startup and exposes every discovered tool to agents and automations in the workspace.&lt;/p&gt;




&lt;h2&gt;
  
  
  The AgentCore Gateway Option
&lt;/h2&gt;

&lt;p&gt;For teams that don't want to write and operate an MCP server from scratch, Amazon Bedrock AgentCore Gateway provides a managed alternative. You point Gateway at a Lambda function or an OpenAPI spec, and it handles the MCP wrapping, auth, logging, and semantic tool discovery automatically. If you use it, Quick never calls your internal APIs directly — everything flows through Gateway's auth and routing layer, as shown in the sequence diagram above.&lt;/p&gt;

&lt;p&gt;The semantic search capability is worth noting specifically. When an agent has access to dozens or hundreds of tools, passing the full tool list on every turn wastes context and causes the model to pick the wrong tool. Gateway's built-in &lt;code&gt;x_amz_bedrock_agentcore_search&lt;/code&gt; tool lets Quick find the right tool by semantic similarity rather than scanning the entire registry each turn.&lt;/p&gt;




&lt;h2&gt;
  
  
  Practical Considerations
&lt;/h2&gt;

&lt;p&gt;A few things worth keeping in mind before integrating:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tool scope matters.&lt;/strong&gt; When agents are given too many tools simultaneously, selection accuracy degrades — the model reasons over too many options per turn and picks incorrectly more often. Keeping each agent or MCP server to a focused set of 3–5 tools produces better results than exposing everything through one endpoint. This is a known pattern in multi-agent architectures and applies equally to Quick agents.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The 300-second timeout is real.&lt;/strong&gt; Design each tool call to complete a single, bounded operation. Avoid chaining multiple downstream API calls inside a single tool invocation. If you need a multi-step workflow, model it as separate tools and let the agent orchestrate the sequence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Local context on the desktop app.&lt;/strong&gt; The desktop app reads local files and calendar events directly, without upload. For engineers who work primarily in terminals and local editors, this is a meaningful integration point — meeting context, local documentation, and recent file changes are all available to the assistant without any configuration.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MCP interoperability.&lt;/strong&gt; Because Quick uses MCP as the standard, the same MCP server you build for Quick can also be consumed by Claude Code, Amazon Q Developer, and other MCP-compatible clients. The integration contract is portable.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/quick/" rel="noopener noreferrer"&gt;Amazon Quick — Product overview and features&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/blogs/machine-learning/integrate-external-tools-with-amazon-quick-agents-using-model-context-protocol-mcp/" rel="noopener noreferrer"&gt;Integrate external tools with Amazon Quick Agents using MCP (AWS ML Blog, Feb 2026)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/quick/latest/userguide/mcp-integration.html" rel="noopener noreferrer"&gt;MCP integration — Amazon Quick User Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/what-is-bedrock-agentcore.html" rel="noopener noreferrer"&gt;Amazon Bedrock AgentCore — Overview and documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/blogs/machine-learning/introducing-amazon-bedrock-agentcore-gateway-transforming-enterprise-ai-agent-tool-development/" rel="noopener noreferrer"&gt;Introducing Amazon Bedrock AgentCore Gateway (AWS ML Blog)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/blogs/aws/top-announcements-of-the-whats-next-with-aws-2026/" rel="noopener noreferrer"&gt;Top announcements of the What's Next with AWS, 2026 (AWS News Blog, Apr 2026)&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>aws</category>
      <category>ai</category>
      <category>agents</category>
      <category>mcp</category>
    </item>
    <item>
      <title>Run Gemma 4 on Your Laptop — A Hands-On Guide to Google's Latest Open Multimodal LLM</title>
      <dc:creator>Jubin Soni</dc:creator>
      <pubDate>Fri, 15 May 2026 02:36:05 +0000</pubDate>
      <link>https://dev.to/jubinsoni/run-gemma-4-on-your-laptop-a-hands-on-guide-to-googles-latest-open-multimodal-llm-56a9</link>
      <guid>https://dev.to/jubinsoni/run-gemma-4-on-your-laptop-a-hands-on-guide-to-googles-latest-open-multimodal-llm-56a9</guid>
      <description>&lt;p&gt;If you've been watching the open-source LLM space, you've probably noticed it's been a great couple of years. Llama, Mistral, Phi, Qwen — a whole zoo of models you can download and run on your own machine. Google's entry into that zoo is &lt;strong&gt;Gemma&lt;/strong&gt;, and the fourth generation, &lt;strong&gt;Gemma 4&lt;/strong&gt; (released April 2, 2026), is the biggest leap yet: built from Gemini 3 research, multimodal (text + image + video + audio), 256K context, native function calling, configurable "thinking mode," and — finally — a clean &lt;strong&gt;Apache 2.0 license&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In this post we're going to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Understand what Gemma 4 actually is, with an architecture diagram&lt;/li&gt;
&lt;li&gt;Get it running on your laptop with Ollama in about 5 minutes&lt;/li&gt;
&lt;li&gt;Chat with it from the terminal&lt;/li&gt;
&lt;li&gt;Send it an image and ask questions about it&lt;/li&gt;
&lt;li&gt;Turn on thinking mode for harder problems&lt;/li&gt;
&lt;li&gt;Call it from a Python script like a real API&lt;/li&gt;
&lt;li&gt;Build a small project that glues it all together&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;No GPU rental, no API keys, no telemetry. Let's go.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Heads up:&lt;/strong&gt; This guide assumes zero ML background. If you can install software and run a terminal command, you can do this.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  What is Gemma 4?
&lt;/h2&gt;

&lt;p&gt;Gemma is Google DeepMind's family of &lt;strong&gt;open-weight&lt;/strong&gt; language models. "Open-weight" means the actual neural network weights — the giant matrices of numbers that make the model work — are freely downloadable. You can run them, modify them, fine-tune them, ship them in your product.&lt;/p&gt;

&lt;p&gt;Gemma 4 brings several big changes over Gemma 3:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Apache 2.0 license.&lt;/strong&gt; Earlier Gemma releases used a custom license with a Prohibited Use Policy that made some enterprise legal teams nervous. Gemma 4 is plain Apache 2.0 — unlimited commercial use, no MAU caps, no special permissions. This alone is a big deal for production deployments.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mixture-of-Experts.&lt;/strong&gt; A new 26B MoE variant activates only ~4B parameters per token, giving you 13B-class quality at 4B-class cost.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Thinking mode.&lt;/strong&gt; A configurable reasoning mode where the model thinks step-by-step before answering. Toggle it on for hard problems, off for fast chat.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Native function calling.&lt;/strong&gt; Built-in support for structured tool use — write an agent without needing prompt engineering hacks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;More modalities.&lt;/strong&gt; Image, video frames, and (on the smaller E2B/E4B models) native audio input. Native system prompt support too.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bigger context.&lt;/strong&gt; 128K on the small models, &lt;strong&gt;256K&lt;/strong&gt; on the larger ones.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Model sizes at a glance
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Disk (Ollama)&lt;/th&gt;
&lt;th&gt;Active params&lt;/th&gt;
&lt;th&gt;Total params&lt;/th&gt;
&lt;th&gt;Multimodal&lt;/th&gt;
&lt;th&gt;Context&lt;/th&gt;
&lt;th&gt;Best for&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;E2B&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~7.2 GB&lt;/td&gt;
&lt;td&gt;~2B&lt;/td&gt;
&lt;td&gt;~2.3B&lt;/td&gt;
&lt;td&gt;text + image + &lt;strong&gt;audio&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;128K&lt;/td&gt;
&lt;td&gt;Phones, edge devices, browser&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;E4B&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~9.6 GB&lt;/td&gt;
&lt;td&gt;~4B&lt;/td&gt;
&lt;td&gt;~4.5B&lt;/td&gt;
&lt;td&gt;text + image + &lt;strong&gt;audio&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;128K&lt;/td&gt;
&lt;td&gt;Most laptops — the sweet spot&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;26B A4B (MoE)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~18 GB&lt;/td&gt;
&lt;td&gt;~4B&lt;/td&gt;
&lt;td&gt;26B&lt;/td&gt;
&lt;td&gt;text + image&lt;/td&gt;
&lt;td&gt;256K&lt;/td&gt;
&lt;td&gt;Consumer GPUs, agentic workloads&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;31B Dense&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~20 GB&lt;/td&gt;
&lt;td&gt;31B&lt;/td&gt;
&lt;td&gt;31B&lt;/td&gt;
&lt;td&gt;text + image&lt;/td&gt;
&lt;td&gt;256K&lt;/td&gt;
&lt;td&gt;Workstations, highest-quality answers&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Two naming notes worth understanding:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;E2B / E4B.&lt;/strong&gt; The "E" stands for &lt;em&gt;Effective&lt;/em&gt; parameters. These are dense edge-first models that use a trick called Per-Layer Embeddings (PLE — more on this below) to do more with fewer active parameters.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;26B A4B.&lt;/strong&gt; This is the Mixture-of-Experts model. 26B parameters total, but only ~4B "activate" per forward pass. Latency and cost behave like a 4B model; quality is closer to a 13B dense model. Caveat: you still need to load all 26B into memory.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For most readers on a laptop, &lt;strong&gt;E4B is the right starting point&lt;/strong&gt;. It runs comfortably on a 16 GB Mac or any modern dev machine.&lt;/p&gt;

&lt;h3&gt;
  
  
  Gemma 4 vs the rest of the open-model zoo (May 2026)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Sizes&lt;/th&gt;
&lt;th&gt;Multimodal&lt;/th&gt;
&lt;th&gt;Context&lt;/th&gt;
&lt;th&gt;License&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Gemma 4&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;E2B / E4B / 26B MoE / 31B&lt;/td&gt;
&lt;td&gt;text + image + video + audio (small)&lt;/td&gt;
&lt;td&gt;128K / 256K&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Apache 2.0&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Llama 4&lt;/td&gt;
&lt;td&gt;various&lt;/td&gt;
&lt;td&gt;text + image&lt;/td&gt;
&lt;td&gt;128K+&lt;/td&gt;
&lt;td&gt;Llama community license&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen 3.5&lt;/td&gt;
&lt;td&gt;various&lt;/td&gt;
&lt;td&gt;text + image&lt;/td&gt;
&lt;td&gt;128K+&lt;/td&gt;
&lt;td&gt;Apache 2.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek V4 Flash&lt;/td&gt;
&lt;td&gt;MoE&lt;/td&gt;
&lt;td&gt;text&lt;/td&gt;
&lt;td&gt;128K&lt;/td&gt;
&lt;td&gt;MIT&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Gemma 4's pitch: the only family that spans phones to servers under Apache 2.0, with multimodal and audio in the same release.&lt;/p&gt;




&lt;h2&gt;
  
  
  The architecture (in plain English)
&lt;/h2&gt;

&lt;p&gt;You don't need this section to use Gemma 4 — feel free to skip to the install steps. But if you've ever wondered what's actually happening when a multimodal model "sees" and "hears," here it is.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsv668k4nkd413oydvdgu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsv668k4nkd413oydvdgu.png" alt="gemma4 description" width="800" height="961"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A few pieces worth understanding:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Three input paths.&lt;/strong&gt; Text goes through a SentencePiece tokenizer (shared with Gemini). Images go through a vision encoder that handles variable aspect ratios and resolutions natively (no more square-only inputs like Gemma 3). On the E2B and E4B models, audio goes through a USM-style conformer encoder borrowed from Gemma 3n. All three paths produce tokens that get interleaved in a single stream — so you can freely mix text, images, and audio in any order in one prompt.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Alternating local/global attention.&lt;/strong&gt; Most layers only look at a sliding window of recent tokens (cheap). A subset of layers attend to the full context (expensive but rare). This is the standard trick for keeping the KV cache from blowing up at 256K context.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Per-Layer Embeddings (PLE)&lt;/strong&gt; — &lt;em&gt;the small-model secret&lt;/em&gt;. In a normal transformer, each token gets one embedding vector at input and that's all the residual stream has to work with. PLE adds a parallel pathway: for each token, every layer gets its &lt;em&gt;own&lt;/em&gt; small conditioning vector from a lookup table. The embedding tables are large (lots of memory) but the "active" parameters per token stay small — that's why a 4-billion-active-parameter E4B can punch above its weight.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mixture-of-Experts (26B A4B).&lt;/strong&gt; The MoE layer has multiple "expert" feed-forward networks. A small router picks 2 of 8 (or similar) for each token. Total params = 26B (all loaded), active params per token = ~4B (only those fire). Pareto-optimal for quality-per-FLOP.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Thinking mode.&lt;/strong&gt; When you include the special &lt;code&gt;&amp;lt;|think|&amp;gt;&lt;/code&gt; token at the start of the system prompt, the model emits internal reasoning between &lt;code&gt;&amp;lt;|channel&amp;gt;thought\n...&amp;lt;channel|&amp;gt;&lt;/code&gt; markers before the final answer. Disable it for fast chat; enable it for math, code, multi-step reasoning.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's most of what's worth knowing. Now let's actually run it.&lt;/p&gt;


&lt;h2&gt;
  
  
  Step 1: Install Ollama
&lt;/h2&gt;

&lt;p&gt;There are a few ways to run Gemma 4 locally, but the easiest by a mile is &lt;a href="https://ollama.com" rel="noopener noreferrer"&gt;Ollama&lt;/a&gt;. Think of it as "Docker for LLMs" — it handles downloading the model, managing memory, GPU acceleration, and exposing a local API. You don't have to think about CUDA versions or PyTorch.&lt;/p&gt;

&lt;p&gt;Install it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;macOS / Windows:&lt;/strong&gt; Download the installer at &lt;a href="https://ollama.com/download" rel="noopener noreferrer"&gt;ollama.com/download&lt;/a&gt; and run it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Linux:&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;  curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://ollama.com/install.sh | sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Verify:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama &lt;span class="nt"&gt;--version&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should see a version number. &lt;strong&gt;Gemma 4 requires Ollama v0.20.0 or later&lt;/strong&gt; — if you're on an older version, update first.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 2: Pull a Gemma 4 model
&lt;/h2&gt;

&lt;p&gt;Download the default (E4B, ~9.6 GB):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama pull gemma4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This downloads about 9.6 GB. Grab a coffee. ☕&lt;/p&gt;

&lt;p&gt;Other sizes if you want them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama pull gemma4:e2b   &lt;span class="c"&gt;# ~7.2 GB — smallest, for low-RAM machines&lt;/span&gt;
ollama pull gemma4:e4b   &lt;span class="c"&gt;# ~9.6 GB — the default; same as `gemma4`&lt;/span&gt;
ollama pull gemma4:26b   &lt;span class="c"&gt;# ~18 GB  — the MoE; 256K context&lt;/span&gt;
ollama pull gemma4:31b   &lt;span class="c"&gt;# ~20 GB  — biggest dense model&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Hardware reality check:&lt;/strong&gt; On Apple Silicon, 16 GB unified memory handles E4B comfortably. NVIDIA users need the model to fit entirely in VRAM for GPU-accelerated inference. The 26B model fits on 24 GB but leaves very little headroom — treat it as the ceiling, not the target.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;List what you've got:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama list
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Step 3: Chat with it in the terminal
&lt;/h2&gt;

&lt;p&gt;Easiest possible test:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama run gemma4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You'll get an interactive prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; Explain what a hash map is, like I'm a junior dev.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Hit enter and watch it stream a response. To exit, type &lt;code&gt;/bye&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;That's it. You're running a state-of-the-art LLM locally with zero cloud dependency. Try:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"Write a Python function that finds duplicates in a list, with three different approaches and their tradeoffs."&lt;/li&gt;
&lt;li&gt;"What's the difference between TCP and UDP? Use an analogy."&lt;/li&gt;
&lt;li&gt;"Translate 'Where is the nearest train station?' into Japanese, Spanish, and Hindi."&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Step 4: Send it an image
&lt;/h2&gt;

&lt;p&gt;Gemma 4 can &lt;em&gt;see&lt;/em&gt;. Drop any image file in your current directory, then:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama run gemma4
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; Describe what&lt;span class="s1"&gt;'s in this image: ./screenshot.png
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Ollama loads the image, sends it through the vision encoder, and the model answers. Unlike Gemma 3 (which resized everything to 896×896), Gemma 4 handles variable aspect ratios and resolutions natively — so tall screenshots, wide diagrams, and high-res photos all work without manual cropping.&lt;/p&gt;

&lt;p&gt;Try:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"What error is shown in this screenshot?" &lt;em&gt;(paste a stack trace)&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;"What's the bounding box for the 'submit' button in this UI?" &lt;em&gt;(Gemma 4 will answer in JSON — natively!)&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;"Read the handwriting in this note and transcribe it."&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Step 5: Turn on thinking mode
&lt;/h2&gt;

&lt;p&gt;For harder problems — multi-step math, complex code, logic puzzles — turn on thinking mode. Include the &lt;code&gt;&amp;lt;|think|&amp;gt;&lt;/code&gt; token at the very start of your system prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama run gemma4
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; /set system &lt;span class="s2"&gt;"&amp;lt;|think|&amp;gt;You are a careful, methodical assistant."&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; Three friends &lt;span class="nb"&gt;split &lt;/span&gt;a &lt;span class="nv"&gt;$73&lt;/span&gt;.42 dinner bill. Alice had a &lt;span class="nv"&gt;$12&lt;/span&gt; appetizer, Bob had a &lt;span class="nv"&gt;$9&lt;/span&gt; drink. The rest is shared. What does everyone pay?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model will emit its reasoning in a &lt;code&gt;&amp;lt;|channel&amp;gt;thought\n...&amp;lt;channel|&amp;gt;&lt;/code&gt; block before the final answer. For fast chat, leave the token out and the model answers directly.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;🧠 &lt;strong&gt;When to use it:&lt;/strong&gt; Code generation, math, multi-hop reasoning, agentic planning — yes. Single-turn factual questions, summarization, translation — no, it just adds latency.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Step 6: Call Gemma 4 from Python
&lt;/h2&gt;

&lt;p&gt;A chat prompt is nice, but you're a developer — you want to call this thing from code. When Ollama is running, it exposes a local REST API on &lt;code&gt;http://localhost:11434&lt;/code&gt;. There's also an official Python client.&lt;/p&gt;

&lt;p&gt;Install it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;ollama
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Basic chat
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ollama&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ollama&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemma4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a senior code reviewer. Be concise and direct.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Review this code:&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;def add(a, b):&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;    return a+b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Streaming responses (ChatGPT-style)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ollama&lt;/span&gt;

&lt;span class="n"&gt;stream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ollama&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemma4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Write a haiku about debugging.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
    &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;flush&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Sending an image
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ollama&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ollama&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemma4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s in this image?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;images&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./my_photo.jpg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;}],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Thinking mode + function calling (the agentic combo)
&lt;/h3&gt;

&lt;p&gt;This is where Gemma 4 actually starts feeling like a "real" agent. You declare your tools as JSON schemas, the model decides when to call them, and you execute the call and pass results back. No prompt engineering hacks needed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ollama&lt;/span&gt;

&lt;span class="n"&gt;tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;function&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;function&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;get_weather&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Get the current weather for a city.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;parameters&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;object&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;properties&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;city&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;string&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;City name, e.g. &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Tokyo&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;required&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;city&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;}]&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_weather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Pretend this hits a real API.
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: 22°C, partly cloudy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ollama&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemma4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;|think|&amp;gt;You are a helpful weather assistant.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Should I bring an umbrella in Tokyo today?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# If the model wants to call a tool, execute it and feed the result back:
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;tool_call&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_calls&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[]):&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tool_call&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;function&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tool_call&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;function&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;arguments&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;get_weather&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_weather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;# Send result back for the model to finalize its answer
&lt;/span&gt;        &lt;span class="n"&gt;followup&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ollama&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemma4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
                &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Should I bring an umbrella in Tokyo today?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;followup&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Raw HTTP (no Python client needed)
&lt;/h3&gt;

&lt;p&gt;For any other language:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl http://localhost:11434/api/chat &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
  "model": "gemma4",
  "messages": [{"role": "user", "content": "Hello!"}],
  "stream": false
}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same JSON shape works from Node, Go, Rust, your shell — anything that can make an HTTP request.&lt;/p&gt;




&lt;h2&gt;
  
  
  A small project: folder-watching image describer
&lt;/h2&gt;

&lt;p&gt;Here's a useful ~30-line script. It watches a folder, and any new image dropped in gets automatically described by Gemma 4. Great for accessibility tools, content moderation prototypes, or just learning.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ollama&lt;/span&gt;

&lt;span class="n"&gt;WATCH_DIR&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./inbox&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;makedirs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;WATCH_DIR&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exist_ok&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;SEEN&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;listdir&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;WATCH_DIR&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;📁 Watching &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;WATCH_DIR&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/ — drop an image in to describe it.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;   (Ctrl+C to stop)&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;IMAGE_EXTS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.png&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.jpg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.jpeg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.webp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.gif&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;current&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;listdir&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;WATCH_DIR&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="n"&gt;new_files&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;SEEN&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;filename&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;new_files&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;endswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;IMAGE_EXTS&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                &lt;span class="k"&gt;continue&lt;/span&gt;

            &lt;span class="n"&gt;path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;WATCH_DIR&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;📸 New image: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ollama&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemma4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Describe this image in 2-3 sentences. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Mention any visible text. Be specific.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                    &lt;span class="p"&gt;),&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;images&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="p"&gt;}],&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;   → &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;SEEN&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt;
        &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;KeyboardInterrupt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt; Stopped.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run it, drag images into the &lt;code&gt;inbox/&lt;/code&gt; folder, and watch descriptions appear. That's a real, useful, completely local AI tool — written in 30 lines.&lt;/p&gt;




&lt;h2&gt;
  
  
  Things to know before shipping anything serious
&lt;/h2&gt;

&lt;p&gt;A few honest caveats:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Caveat&lt;/th&gt;
&lt;th&gt;Why it matters&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Hallucination&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Local models still confidently make things up. Don't trust factual claims without verification. Thinking mode reduces this for reasoning tasks but doesn't eliminate it.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CPU latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Expect 1–3 tokens/sec on a CPU-only laptop with E4B. A GPU gives 3–10× speedup.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Context costs RAM&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;256K context is real, but actually filling it eats memory. Most use cases need &amp;lt;16K tokens.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MoE memory&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The 26B MoE runs &lt;em&gt;fast&lt;/em&gt; (only 4B active per token), but you still need to load all 26B into RAM. Don't confuse active params with memory footprint.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Audio is small-model only&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;E2B/E4B have native audio input. The 26B and 31B models do not.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Apache 2.0 ≠ no responsibilities&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The license is permissive, but you're still on the hook for safety, bias, and compliance in whatever you ship.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  📚 References &amp;amp; further reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://blog.google/technology/developers/gemma-4/" rel="noopener noreferrer"&gt;Gemma 4 announcement — Google blog&lt;/a&gt;&lt;/strong&gt; — The launch post (April 2, 2026).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://ai.google.dev/gemma/docs/core" rel="noopener noreferrer"&gt;Gemma 4 model overview — Google AI for Developers&lt;/a&gt;&lt;/strong&gt; — Official docs: sizes, capabilities, hardware requirements.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://huggingface.co/blog/gemma4" rel="noopener noreferrer"&gt;Welcome Gemma 4 — Hugging Face blog&lt;/a&gt;&lt;/strong&gt; — Best technical write-up: covers PLE, MoE, USM audio encoder, benchmarks, and code samples.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://huggingface.co/google/gemma-4-E4B-it" rel="noopener noreferrer"&gt;Gemma 4 model card on Hugging Face&lt;/a&gt;&lt;/strong&gt; — E4B instruct model weights and configuration.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://dev.to/aniruddhaadak/gemma-4-complete-guide-2026-architecture-benchmarks-deployment-3en9"&gt;Gemma 4 Complete Guide 2026 — dev.to&lt;/a&gt;&lt;/strong&gt; — Community guide with architecture details and competitor comparisons.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://arxiv.org/abs/2303.15343" rel="noopener noreferrer"&gt;SigLIP (Zhai et al., 2023)&lt;/a&gt;&lt;/strong&gt; — The vision encoder family Gemma's image path builds on.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://arxiv.org/abs/1701.06538" rel="noopener noreferrer"&gt;Mixture-of-Experts (Shazeer et al., 2017)&lt;/a&gt;&lt;/strong&gt; — The original sparsely-gated MoE paper. The 26B A4B is a direct descendant.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://arxiv.org/abs/2101.03961" rel="noopener noreferrer"&gt;Switch Transformer (Fedus et al., 2021)&lt;/a&gt;&lt;/strong&gt; — Modern MoE techniques.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.llama.com/" rel="noopener noreferrer"&gt;Llama 4&lt;/a&gt;&lt;/strong&gt; — Meta's competing open-weight family.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>gemma</category>
      <category>googlecloud</category>
      <category>llm</category>
    </item>
    <item>
      <title>AWS Kiro: The Agentic IDE That Makes Specs the Unit of Work</title>
      <dc:creator>Jubin Soni</dc:creator>
      <pubDate>Mon, 11 May 2026 22:53:13 +0000</pubDate>
      <link>https://dev.to/jubinsoni/aws-kiro-the-agentic-ide-that-makes-specs-the-unit-of-work-3eko</link>
      <guid>https://dev.to/jubinsoni/aws-kiro-the-agentic-ide-that-makes-specs-the-unit-of-work-3eko</guid>
      <description>&lt;p&gt;The agentic IDE space has gotten crowded fast. Cursor, Claude Code, Copilot, Windsurf — they all share the same core model: you type a prompt, the AI writes some code, you iterate. It works well for prototyping. It breaks down when you're building production systems on a large codebase with a team of more than one.&lt;/p&gt;

&lt;p&gt;AWS Kiro takes a different bet. Instead of chat-first, it's &lt;strong&gt;spec-first&lt;/strong&gt;. The unit of work isn't a prompt — it's a structured specification that the agent uses to plan, implement, verify, and document your feature end to end. That's a meaningful philosophical difference, and in practice it changes what the tool is useful for.&lt;/p&gt;

&lt;p&gt;Here's what Kiro actually is, how its core concepts fit together, and an honest take on when it makes sense over the alternatives.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Kiro Is
&lt;/h2&gt;

&lt;p&gt;Kiro launched from AWS in mid-2025 and is built on top of Amazon Bedrock, routing between Claude Sonnet for reasoning-heavy work and Amazon Nova for high-throughput code generation. It ships in three forms:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Kiro IDE&lt;/strong&gt; — a VS Code-compatible editor (built on Code OSS, so you can import your existing themes, keybindings, and Open VSX plugins)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kiro CLI&lt;/strong&gt; — the same agent in your terminal, useful for SSH sessions or scripted workflows&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kiro Autonomous Agent&lt;/strong&gt; — a background agent that picks up tasks, implements them, and opens PRs without you sitting in the loop&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You don't need an AWS account to get started — you can sign in with GitHub or Google. The IDE feels immediately familiar if you've used VS Code, which removes one of the usual adoption barriers for new tooling.&lt;/p&gt;

&lt;p&gt;In January 2026, AWS also announced the end of Amazon Q Developer for new signups (effective May 15, 2026), explicitly directing users to Kiro as its successor for IDE-based AI assistance. That's a significant signal about where AWS is placing its bets.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Three Concepts That Make Kiro Different
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Specs
&lt;/h3&gt;

&lt;p&gt;When you start a new feature in Kiro, you don't jump straight to code. You describe what you want to build, and Kiro generates three structured files:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;requirements.md&lt;/code&gt; — user stories and acceptance criteria&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;design.md&lt;/code&gt; — system design, component breakdown, data flow&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;tasks.md&lt;/code&gt; — a numbered implementation checklist the agent works through&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These become the source of truth. Code is a build artifact of the spec. When you come back to the feature a month later, or hand it to a new team member, the reasoning behind every decision is documented — not in a Confluence page nobody reads, but in the repo next to the code it describes.&lt;/p&gt;

&lt;p&gt;This is the thing chat-first tools can't replicate. Cursor or Claude Code can generate excellent code from a good prompt. What they can't do is maintain a structured paper trail of &lt;em&gt;why&lt;/em&gt; the code looks the way it does.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Hooks
&lt;/h3&gt;

&lt;p&gt;Hooks are event-driven automations that fire when things happen in your workspace — file save, new file created, commit opened. You define what Kiro should do in response, and it runs those actions in the background without you having to think about them.&lt;/p&gt;

&lt;p&gt;Common hooks teams set up:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Run the linter and auto-fix on every file save&lt;/li&gt;
&lt;li&gt;Regenerate unit tests when implementation files change&lt;/li&gt;
&lt;li&gt;Update the relevant section of &lt;code&gt;design.md&lt;/code&gt; when a module is modified&lt;/li&gt;
&lt;li&gt;Run a security scan before any commit&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The practical effect is that a junior developer's output passes the same automated quality bar as a senior's, because the standards are enforced by the environment rather than by code review heroics.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Steering Files
&lt;/h3&gt;

&lt;p&gt;Steering files are Markdown files that give Kiro persistent context about your project — your conventions, the libraries you've standardized on, your architecture decisions, your security requirements. You create them once, and Kiro reads them on every interaction without you having to re-explain your stack in every prompt.&lt;/p&gt;

&lt;p&gt;They live in two places:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;~/.kiro/steering/&lt;/code&gt; — global rules that apply across all your projects&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;.kiro/steering/&lt;/code&gt; — project-specific overrides checked into the repo&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A typical global steering file might say things like "always use TypeScript strict mode," "prefer AWS CDK over raw CloudFormation," or "all Lambda functions must have structured logging with a correlation ID." Project steering files add things like "this service is a multi-tenant SaaS, tenant ID is always passed in the request context."&lt;/p&gt;

&lt;p&gt;The result is that Kiro's context isn't reset between sessions and doesn't depend on whoever wrote the last prompt being thorough.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Hooks + Specs Flywheel
&lt;/h2&gt;

&lt;p&gt;The real power emerges when hooks and specs work together. Here's what that looks like in practice:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;You describe a new feature. Kiro generates &lt;code&gt;requirements.md&lt;/code&gt;, &lt;code&gt;design.md&lt;/code&gt;, and &lt;code&gt;tasks.md&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;You review and refine the spec — add an edge case to requirements, adjust the component breakdown in design.&lt;/li&gt;
&lt;li&gt;Kiro implements the tasks list, following your steering files for conventions.&lt;/li&gt;
&lt;li&gt;On each file save, hooks run: linter, tests, security scan. Issues surface immediately.&lt;/li&gt;
&lt;li&gt;When you're done, a hook generates the commit message from the spec diff.&lt;/li&gt;
&lt;li&gt;The PR description writes itself from &lt;code&gt;requirements.md&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The spec doesn't go stale because hooks keep it in sync with the code. The code doesn't drift from the design because the design was written before the code. This is what "engineering rigor" means in the context of agentic development — not slower, but structured.&lt;/p&gt;




&lt;h2&gt;
  
  
  AWS-Native Advantages (and the Honest Tradeoff)
&lt;/h2&gt;

&lt;p&gt;Kiro has deep integration with the AWS ecosystem: CodeCatalyst for repositories and CI/CD, Bedrock for model access, IAM Identity Center for enterprise auth, and "Kiro Powers" — pre-packaged MCP servers for AWS-specific domains like CDK, CloudFormation, pricing, and (recently) HealthOmics workflows.&lt;/p&gt;

&lt;p&gt;If your team is already AWS-first, this is a genuine multiplier. Your Kiro agent can query your actual AWS account context, reference live Bedrock documentation, and generate CDK constructs that match your organization's guardrails.&lt;/p&gt;

&lt;p&gt;The honest tradeoff: if your team isn't AWS-first, some of this integration feels like overhead rather than lift. Kiro works perfectly well as a general-purpose agentic IDE — the spec/hooks/steering system has value regardless of your cloud provider — but the ecosystem integrations are clearly designed for AWS shops. Most teams running mixed infrastructure (some AWS, some not) find it practical to use Kiro for the AWS-native services and keep their existing editor for everything else. The two coexist fine.&lt;/p&gt;




&lt;h2&gt;
  
  
  How It Compares to the Alternatives
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Kiro&lt;/th&gt;
&lt;th&gt;Cursor&lt;/th&gt;
&lt;th&gt;Claude Code&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Primary paradigm&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Spec-driven&lt;/td&gt;
&lt;td&gt;Chat-driven&lt;/td&gt;
&lt;td&gt;Task-driven (CLI)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Persistent context&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Steering files&lt;/td&gt;
&lt;td&gt;Rules / &lt;code&gt;.cursorrules&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;AGENTS.md&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Automation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Hooks (event-driven)&lt;/td&gt;
&lt;td&gt;Manual&lt;/td&gt;
&lt;td&gt;Manual&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AWS integration&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Native&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;IDE&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Standalone (VS Code-compatible)&lt;/td&gt;
&lt;td&gt;Fork of VS Code&lt;/td&gt;
&lt;td&gt;Terminal only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Background agent&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes (autonomous agent)&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Best for&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Production features, team consistency&lt;/td&gt;
&lt;td&gt;Fast prototyping, exploration&lt;/td&gt;
&lt;td&gt;Complex refactors, agentic tasks&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Kiro and Claude Code aren't direct competitors in practice — Kiro is an IDE product and Claude Code is a terminal agent. Many teams run both, using Kiro for structured feature work and Claude Code for open-ended refactors or one-off tasks.&lt;/p&gt;




&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;Download the IDE from &lt;a href="https://kiro.dev" rel="noopener noreferrer"&gt;kiro.dev&lt;/a&gt; — no AWS account required. Sign in with GitHub or Google, point it at an existing repo, and run through the onboarding to import your VS Code settings.&lt;/p&gt;

&lt;p&gt;A good first experiment: take a feature you're planning to build anyway, describe it to Kiro, and look at the spec it generates before writing any code. The value of the approach becomes obvious when you see your vague "add user preferences" idea turn into a concrete requirements doc with six acceptance criteria and a data model.&lt;/p&gt;

&lt;p&gt;From there:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create one global steering file in &lt;code&gt;~/.kiro/steering/&lt;/code&gt; with your language and framework defaults&lt;/li&gt;
&lt;li&gt;Set up one hook that runs your linter on file save&lt;/li&gt;
&lt;li&gt;Build the feature using the task list Kiro generated&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's the feedback loop that makes the tool click. The full power of the hooks and autonomous agent comes later, but even the basic spec workflow is a meaningful improvement over prompt-and-iterate for anything that takes more than a day to build.&lt;/p&gt;




&lt;h2&gt;
  
  
  Worth Watching
&lt;/h2&gt;

&lt;p&gt;A few things that make Kiro worth keeping an eye on even if you're not ready to switch:&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;spec-as-artifact&lt;/strong&gt; model is genuinely novel. When agents get better, spec-driven codebases will be better positioned to benefit — the structured requirements and design docs give future agents much richer context than a commit history and some comments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Kiro Powers&lt;/strong&gt; (the MCP server marketplace) is growing fast. The HealthOmics extension in February 2026 showed that domain-specific agent packs are a real product direction, not just a demo.&lt;/p&gt;

&lt;p&gt;And with &lt;strong&gt;Amazon Q Developer sunsetting for new users&lt;/strong&gt;, AWS is clearly consolidating its developer AI bet onto Kiro. Whatever the roadmap looks like from here, it's going to get resources.&lt;/p&gt;




&lt;p&gt;Kiro isn't the right tool for every workflow. If you're prototyping solo or doing exploratory work, the spec-first overhead is friction you don't need. But for teams shipping production features that need to be documented, tested, and maintained — the bet that specs should be the unit of work is a compelling one.&lt;/p&gt;




&lt;h2&gt;
  
  
  Kiro vs. the Alternatives
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Kiro&lt;/th&gt;
&lt;th&gt;Cursor&lt;/th&gt;
&lt;th&gt;Claude Code&lt;/th&gt;
&lt;th&gt;GitHub Copilot&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Primary paradigm&lt;/td&gt;
&lt;td&gt;Spec-driven&lt;/td&gt;
&lt;td&gt;Chat-driven&lt;/td&gt;
&lt;td&gt;Task-driven (CLI)&lt;/td&gt;
&lt;td&gt;Inline completion&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Persistent context&lt;/td&gt;
&lt;td&gt;Steering files&lt;/td&gt;
&lt;td&gt;&lt;code&gt;.cursorrules&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;AGENTS.md&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Event automation&lt;/td&gt;
&lt;td&gt;Hooks (file save, commit)&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Structured specs&lt;/td&gt;
&lt;td&gt;✅ Native&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Background agent&lt;/td&gt;
&lt;td&gt;✅ Autonomous agent&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AWS-native integration&lt;/td&gt;
&lt;td&gt;✅ Deep&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dynamic MCP loading&lt;/td&gt;
&lt;td&gt;✅ Powers&lt;/td&gt;
&lt;td&gt;Manual&lt;/td&gt;
&lt;td&gt;Manual&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IDE base&lt;/td&gt;
&lt;td&gt;Code OSS (VS Code compat.)&lt;/td&gt;
&lt;td&gt;VS Code fork&lt;/td&gt;
&lt;td&gt;Terminal only&lt;/td&gt;
&lt;td&gt;Plugin&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Free tier&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  How Spec-Driven Development Works
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────────┐
│              YOU: describe a feature                     │
└─────────────────────────┬───────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────┐
│                KIRO GENERATES SPECS                      │
│                                                          │
│  .kiro/specs/my-feature/                                 │
│  ├── requirements.md  ← user stories + EARS notation     │
│  ├── design.md        ← architecture, data flow, APIs    │
│  └── tasks.md         ← ordered implementation plan      │
└─────────────────────────┬───────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────┐
│              YOU: review + refine specs                  │
│  add edge cases, adjust design, approve task list        │
└─────────────────────────┬───────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────┐
│            KIRO IMPLEMENTS task by task                  │
│  guided by steering files + spec context                 │
└─────────────────────────┬───────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────┐
│              HOOKS FIRE AUTOMATICALLY                    │
│  on every file save:                                     │
│  → linter + autofix                                      │
│  → test generation / update                              │
│  → security scan                                         │
│  → design.md sync                                        │
└─────────────────────────┬───────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────┐
│        PR OPENS — description from requirements.md       │
│        commit message generated from spec diff           │
└─────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Steering File Layout
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;~/.kiro/steering/              ← global, applies to every project
├── typescript.md              "always use strict mode, no any"
├── aws.md                     "prefer CDK over raw CloudFormation"
├── security.md                "IAM roles must follow least privilege"
├── git.md                     "use conventional commits"
└── testing.md                 "80% coverage minimum, jest + RTL"

your-repo/
└── .kiro/
    └── steering/              ← project-specific overrides (checked in)
        ├── architecture.md    "multi-tenant SaaS, one DB schema per tenant"
        ├── api.md             "all endpoints versioned under /v1"
        └── data-model.md      "tenant ID always in request context, never inferred"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Hook Definition Example
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# .kiro/hooks/test-sync.yaml&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Sync Tests on Component Save&lt;/span&gt;
&lt;span class="na"&gt;trigger&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;event&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;onSave&lt;/span&gt;
  &lt;span class="na"&gt;pattern&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;src/**/*.tsx"&lt;/span&gt;
&lt;span class="na"&gt;instructions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
  &lt;span class="s"&gt;When a React component file is saved:&lt;/span&gt;
  &lt;span class="s"&gt;1. Check if a corresponding test file exists in __tests__/&lt;/span&gt;
  &lt;span class="s"&gt;2. If not, create one with basic render and snapshot tests&lt;/span&gt;
  &lt;span class="s"&gt;3. If it exists, update it to cover any new props or exported functions&lt;/span&gt;
  &lt;span class="s"&gt;4. Run the test file and report failures inline&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# .kiro/hooks/security-scan.yaml&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Pre-commit Security Scan&lt;/span&gt;
&lt;span class="na"&gt;trigger&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;event&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;onCommit&lt;/span&gt;
&lt;span class="na"&gt;instructions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
  &lt;span class="s"&gt;Before every commit:&lt;/span&gt;
  &lt;span class="s"&gt;1. Scan staged files for hardcoded secrets, API keys, and credentials&lt;/span&gt;
  &lt;span class="s"&gt;2. Check for any 0.0.0.0/0 ingress rules in IaC files&lt;/span&gt;
  &lt;span class="s"&gt;3. Flag any new IAM policies that use wildcard actions (*)&lt;/span&gt;
  &lt;span class="s"&gt;4. Block the commit and explain any findings — do not auto-fix&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  How Powers Solve Context Rot
&lt;/h2&gt;

&lt;p&gt;Without Powers, connecting multiple MCP servers front-loads your entire context window before you write a single line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Without Powers
──────────────────────────────────────────────────
Context window (200K tokens)

[Figma MCP tools]     ~12K tokens  ████
[Postman MCP tools]   ~18K tokens  ██████
[Stripe MCP tools]    ~10K tokens  ███
[Supabase MCP tools]  ~15K tokens  █████
[Datadog MCP tools]   ~9K tokens   ███
                      ──────────────────
Total overhead        ~64K tokens  (32% gone before first prompt)

With Powers (dynamic loading)
──────────────────────────────────────────────────
You mention "payment" → Stripe power activates
You mention "database" → Supabase activates, Stripe deactivates
Baseline overhead: ~0K tokens until context is needed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Workspace Architecture for AWS Teams
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;AWS Organization
└── Management Account
    ├── Client A Account
    │   ├── Kiro workspace  (.kiro/ scoped here)
    │   ├── CodeCatalyst repo
    │   ├── Bedrock access (us-east-1)
    │   └── Secrets Manager (client A secrets only)
    │
    ├── Client B Account
    │   ├── Kiro workspace  (.kiro/ scoped here)
    │   ├── CodeCatalyst repo
    │   ├── Bedrock access (us-east-1)
    │   └── Secrets Manager (client B secrets only)
    │
    └── Shared Services Account
        ├── IAM Identity Center (SSO for all Kiro logins)
        └── Billing consolidated
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This pattern keeps client IP, secrets, and Bedrock spend isolated by account boundary — IAM does the enforcement, not convention.&lt;/p&gt;




&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://kiro.dev" rel="noopener noreferrer"&gt;kiro.dev&lt;/a&gt; — download is free, no AWS account required&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://kiro.dev/blog/introducing-kiro/" rel="noopener noreferrer"&gt;Introducing Kiro&lt;/a&gt; — the original launch post, good context on the design philosophy behind specs and hooks&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://kiro.dev/blog/introducing-powers/" rel="noopener noreferrer"&gt;Introducing Powers&lt;/a&gt; — explains why dynamic MCP loading matters and how Powers solve context rot&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://kiro.dev/blog/teaching-kiro-new-tricks-with-agent-steering-and-mcp/" rel="noopener noreferrer"&gt;Teaching Kiro new tricks with steering and MCP&lt;/a&gt; — practical deep dive on using steering + MCP to handle custom libraries and DSLs&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://kiro.dev/docs/specs/" rel="noopener noreferrer"&gt;Specs documentation&lt;/a&gt; — full reference including the Design-First and Bugfix spec workflows&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://kiro.dev/powers/" rel="noopener noreferrer"&gt;Kiro Powers marketplace&lt;/a&gt; — browse Figma, Stripe, Supabase, Datadog, Terraform and more&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://kiro.dev/changelog/ide/" rel="noopener noreferrer"&gt;IDE Changelog&lt;/a&gt; — how fast the product is moving&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://aws.amazon.com/blogs/devops/amazon-q-developer-end-of-support-announcement/" rel="noopener noreferrer"&gt;Amazon Q Developer end-of-support announcement&lt;/a&gt; — official AWS post confirming Kiro as Q Developer's successor&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/kirodotdev/Kiro" rel="noopener noreferrer"&gt;github.com/kirodotdev/Kiro&lt;/a&gt; — issue tracker and feedback repo&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>aws</category>
      <category>ai</category>
      <category>productivity</category>
      <category>kiro</category>
    </item>
    <item>
      <title>Build a Git Commit Analyzer with Gemma 4 31B and a 256K Context Window</title>
      <dc:creator>Jubin Soni</dc:creator>
      <pubDate>Fri, 08 May 2026 17:41:18 +0000</pubDate>
      <link>https://dev.to/jubinsoni/build-a-git-commit-analyzer-with-gemma-4-31b-and-a-256k-context-window-2ddh</link>
      <guid>https://dev.to/jubinsoni/build-a-git-commit-analyzer-with-gemma-4-31b-and-a-256k-context-window-2ddh</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a submission for the &lt;a href="https://dev.to/challenges/google-gemma-2026-05-06"&gt;Gemma 4 Challenge: Build with Gemma 4&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Most developers reach for an LLM when they need code completion or a chatbot. This article is about something more useful and less obvious: feeding your entire sprint's git history to &lt;strong&gt;Gemma 4 31B&lt;/strong&gt; — diffs, commit messages, authors and all — and getting back structured, actionable analysis of what actually changed and why it might matter.&lt;/p&gt;

&lt;p&gt;The 31B Dense model's &lt;strong&gt;256K context window&lt;/strong&gt; is the key enabler here. It means you can pass tens of thousands of lines of patch output in a single prompt and ask the model to reason across the whole thing — not chunk-and-summarize, but genuinely cross-reference commits, spot patterns, and flag risk. That's a qualitatively different capability from what a smaller model or an older Gemma generation could provide.&lt;/p&gt;

&lt;p&gt;By the end of this guide you'll have a working Python CLI tool that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Shells out to &lt;code&gt;git log --patch&lt;/code&gt; to collect a commit range&lt;/li&gt;
&lt;li&gt;Sends the full diff to Gemma 4 31B via the Gemini API (free tier in Google AI Studio)&lt;/li&gt;
&lt;li&gt;Returns a structured JSON report with change summaries, risk flags, and a draft changelog&lt;/li&gt;
&lt;li&gt;Optionally writes a Markdown changelog file&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Why Gemma 4 31B Is the Right Model for This
&lt;/h2&gt;

&lt;p&gt;Three specific properties make the 31B Dense the correct pick here — not the 26B MoE, not the edge models.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;256K context window.&lt;/strong&gt; A week's worth of commits on a mid-size codebase generates 20,000–80,000 tokens of patch text. The 31B handles that in a single pass. Chunking and summarizing separately loses cross-commit signal: the model can't notice that a refactor in commit 3 introduced the same variable name collision that commit 7 later fixed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Maximum quality per query.&lt;/strong&gt; The 31B Dense is the highest-accuracy model in the Gemma 4 family. For code analysis you care about precision — a false positive risk flag wastes a senior engineer's time, and a false negative ships a bug. You're making one expensive call per analysis run, so raw quality beats throughput.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Native structured output.&lt;/strong&gt; Gemma 4 has first-class support for function calling and structured JSON output. The analyzer requests a strict JSON schema and the model reliably returns it — no fragile string parsing required.&lt;/p&gt;

&lt;p&gt;The 26B MoE is the right choice if you're building something that calls the model thousands of times per day and want cost efficiency. This tool calls it once per analysis run and prioritizes signal quality, so the Dense wins.&lt;/p&gt;




&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Python 3.10+&lt;/li&gt;
&lt;li&gt;A Google AI Studio API key (free — &lt;a href="https://aistudio.google.com/app/apikey" rel="noopener noreferrer"&gt;get one here&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;A git repository to analyze&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;google-generativeai&lt;/code&gt; Python SDK
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;google-generativeai
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Set your API key as an environment variable:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;GEMINI_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"your-key-here"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Step 1: Collect the Git Diff
&lt;/h2&gt;

&lt;p&gt;The first job is gathering the raw patch data. We use &lt;code&gt;git log --patch&lt;/code&gt; with a commit range and pipe the output to a string. We also collect structured commit metadata separately so the model has author and timestamp context alongside the diff.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;subprocess&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;collect_git_history&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;repo_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;since&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1 week ago&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;until&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;HEAD&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;]]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Returns (full_patch_text, list_of_commit_metadata).
    `since` accepts anything git understands: &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;7 days ago&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;v1.2.3&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, a SHA, etc.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# Collect the full unified diff
&lt;/span&gt;    &lt;span class="n"&gt;patch_result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;subprocess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;git&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;log&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--patch&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--no-merges&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--since=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;since&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--until=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;until&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
         &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--pretty=format:COMMIT: %H%nAuthor: %an &amp;lt;%ae&amp;gt;%nDate: %ci%nMessage: %s%n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;cwd&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;repo_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;capture_output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;check&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Collect lightweight metadata for the summary header
&lt;/span&gt;    &lt;span class="n"&gt;meta_result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;subprocess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;git&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;log&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--no-merges&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--since=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;since&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--until=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;until&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
         &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--pretty=format:%H|%an|%ci|%s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;cwd&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;repo_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;capture_output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;check&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;commits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;meta_result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stdout&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;splitlines&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;continue&lt;/span&gt;
        &lt;span class="n"&gt;sha&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;author&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;msg_parts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;|&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;commits&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sha&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;sha&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;author&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;author&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;|&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg_parts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;patch_result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stdout&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;commits&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A week of commits on a real codebase might be 40,000–100,000 tokens. We'll let the model handle the full text — that's exactly what the 256K window is for.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 2: Build the Prompt
&lt;/h2&gt;

&lt;p&gt;The prompt does three things: gives the model its role and output contract, defines the JSON schema it must return, and passes the raw git history.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;SYSTEM_PROMPT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;You are a senior staff engineer performing a structured code review
of a git commit history. Your job is to analyse the provided patch text and return a
single JSON object — nothing else, no markdown fences, no explanation outside the JSON.

The JSON object must match this schema exactly:

{
  &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;summary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2-3 sentence plain-English summary of the overall change set&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,
  &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;changed_areas&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: [
    {
      &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;path/to/file_or_directory&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,
      &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;change_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;added | modified | deleted | renamed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,
      &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;what changed and why it likely changed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;
    }
  ],
  &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;risk_flags&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: [
    {
      &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;severity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;low | medium | high&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,
      &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;area&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;file or component&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,
      &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reason&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;specific, concrete reason this change carries risk&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;
    }
  ],
  &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;patterns&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: [
    &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;notable cross-commit pattern, refactor theme, or repeated change&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;
  ],
  &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;changelog_entry&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;A polished, user-facing changelog entry in Markdown. Use ## [Unreleased] as the heading. Group under Added, Changed, Fixed, Removed as appropriate.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;
}

Be specific. Do not flag risk without a concrete reason tied to the actual diff.
Do not invent changes that are not present in the patch text.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;build_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;patch_text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;commits&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;commit_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;commits&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;authors&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;author&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;commits&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="n"&gt;date_range&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;commits&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; to &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;commits&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;commits&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unknown&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="n"&gt;header&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ANALYSIS REQUEST&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Commits: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;commit_count&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Authors: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;authors&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Date range: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;date_range&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;FULL PATCH TEXT FOLLOWS&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;header&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;patch_text&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The system prompt enforces a strict schema so we can parse the response with &lt;code&gt;json.loads&lt;/code&gt; — no regex, no fallbacks. One of Gemma 4's standout improvements over Gemma 3 is how reliably it follows structured output instructions at this schema complexity.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 3: Call Gemma 4 31B
&lt;/h2&gt;

&lt;p&gt;We use the &lt;code&gt;google-generativeai&lt;/code&gt; SDK with &lt;code&gt;gemma-4-31b-it&lt;/code&gt; (the instruction-tuned variant — always use IT for structured task completion).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;google.generativeai&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;genai&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;analyze_with_gemma&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;patch_text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;commits&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;genai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;configure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GEMINI_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;genai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;GenerativeModel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemma-4-31b-it&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;system_instruction&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;SYSTEM_PROMPT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;generation_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;genai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;GenerationConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;      &lt;span class="c1"&gt;# Low temperature for consistent structured output
&lt;/span&gt;            &lt;span class="n"&gt;top_p&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;max_output_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4096&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;build_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;patch_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;commits&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Sending &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; words to Gemma 4 31B...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stderr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;raw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# Strip markdown fences if the model adds them despite instructions
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;```

&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;raw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;

```&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;raw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;:]&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Temperature at 0.2 keeps the output deterministic and schema-compliant. For creative changelog prose you could nudge it to 0.4 — but for risk flags you want the model to be conservative and consistent.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 4: Format and Output the Report
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;print_report&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;analysis&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;commits&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GIT HISTORY ANALYSIS — Gemma 4 31B&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Commits analysed: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;commits&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;SUMMARY&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;analysis&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;summary&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;analysis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;risk_flags&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;RISK FLAGS&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;flag&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;analysis&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;risk_flags&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;high&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;medium&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;low&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;}[&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;severity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]]):&lt;/span&gt;
            &lt;span class="n"&gt;icon&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;high&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;🔴&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;medium&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;🟡&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;low&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;🟢&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}[&lt;/span&gt;&lt;span class="n"&gt;flag&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;severity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;icon&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; [&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;flag&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;severity&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;upper&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;] &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;flag&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;area&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;     &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;flag&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;reason&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;analysis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;patterns&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PATTERNS DETECTED&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;analysis&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;patterns&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  • &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CHANGED AREAS&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;area&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;analysis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;changed_areas&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[]):&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  [&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;area&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;change_type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;upper&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;] &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;area&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;path&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;             &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;area&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;write_changelog&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;analysis&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;entry&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;analysis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;changelog_entry&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;entry&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;

    &lt;span class="c1"&gt;# Inject today's date if the entry has a placeholder
&lt;/span&gt;    &lt;span class="n"&gt;entry&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;entry&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[Unreleased]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[Unreleased] — &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;today&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;strftime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;%Y-%m-%d&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;w&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;entry&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Changelog written to &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;output_path&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stderr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Step 5: Wire It Together as a CLI
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;argparse&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;parser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;argparse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ArgumentParser&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Analyse a git commit range with Gemma 4 31B&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;repo&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;help&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Path to git repository&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--since&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1 week ago&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="n"&gt;help&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Start of range (default: &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;1 week ago&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;). Accepts any git date or ref.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--until&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;HEAD&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="n"&gt;help&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;End of range (default: HEAD)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--changelog&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="n"&gt;help&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Write changelog entry to this file&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dest&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;json_out&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="n"&gt;help&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Write full JSON report to this file&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse_args&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;patch_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;commits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;collect_git_history&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;repo&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;since&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;until&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;commits&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;No commits found in the specified range.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stderr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;analysis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;analyze_with_gemma&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;patch_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;commits&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print_report&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;analysis&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;commits&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;changelog&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;write_changelog&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;analysis&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;changelog&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;json_out&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;json_out&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;w&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;analysis&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;indent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Full JSON report written to &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;json_out&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stderr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Running It
&lt;/h2&gt;

&lt;p&gt;Analyse the last week of commits in the current repo:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python git_analyzer.py &lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="nt"&gt;--since&lt;/span&gt; &lt;span class="s2"&gt;"1 week ago"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Analyse a specific SHA range and write a changelog:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python git_analyzer.py /path/to/repo &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--since&lt;/span&gt; v1.4.0 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--until&lt;/span&gt; v1.5.0 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--changelog&lt;/span&gt; CHANGELOG.md &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--json&lt;/span&gt; report.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Analyse a single sprint (two-week window):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python git_analyzer.py &lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="nt"&gt;--since&lt;/span&gt; &lt;span class="s2"&gt;"14 days ago"&lt;/span&gt; &lt;span class="nt"&gt;--changelog&lt;/span&gt; CHANGELOG.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Sample Output
&lt;/h2&gt;

&lt;p&gt;Here's an abbreviated example of what the tool produces on a real project:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;============================================================
GIT HISTORY ANALYSIS — Gemma 4 31B
============================================================

Commits analysed: 23

SUMMARY
This sprint focused on migrating the authentication layer from session
cookies to JWTs, with supporting changes to the user model and API
middleware. Three unrelated bug fixes were included. No database
migrations were added despite schema-adjacent changes in user.py.

RISK FLAGS
  🔴 [HIGH]  src/auth/middleware.py
             Token expiry is set to 0 in the new JWT config, which
             disables expiry entirely. This appears unintentional given
             the surrounding comments referencing a 24h TTL.
  🟡 [MEDIUM] src/models/user.py
             The `last_login` field is now written in two places with
             different timezone handling (UTC in the old path, local
             time in the new one). Cross-commit inconsistency introduced
             in commits a3f1cc and 9d02bb.

PATTERNS DETECTED
  • JWT migration touched 11 files across 8 commits — no single
    atomic commit, suggesting iterative discovery during implementation
  • Four separate commits add logging statements then remove them,
    indicating debug churn that could have been a feature branch

CHANGED AREAS
  [MODIFIED ] src/auth/middleware.py
               Core auth middleware rewritten to validate Bearer tokens
               instead of reading from session. Old session path removed.
  [MODIFIED ] src/models/user.py
               Added jwt_secret field; last_login timezone handling changed
  [ADDED    ] src/auth/token.py
               New module for JWT encode/decode with HS256
  ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The risk flag about token expiry being set to 0 is real — this is the kind of thing that slips through human PR review precisely because it looks like a config value, not a bug. The cross-commit inconsistency flag is only possible because the model reasoned across all 23 commits simultaneously rather than reviewing each in isolation.&lt;/p&gt;




&lt;h2&gt;
  
  
  Context Window Headroom
&lt;/h2&gt;

&lt;p&gt;The 256K window on Gemma 4 31B means you have significant headroom. At roughly 3 characters per token, the practical limits look like this:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Approx. tokens&lt;/th&gt;
&lt;th&gt;Fits in 256K?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1 day of commits, small team&lt;/td&gt;
&lt;td&gt;~5,000&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1 sprint (2 weeks), small team&lt;/td&gt;
&lt;td&gt;~40,000&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Full quarter, mid-size team&lt;/td&gt;
&lt;td&gt;~180,000&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1 year of active development&lt;/td&gt;
&lt;td&gt;~500,000+&lt;/td&gt;
&lt;td&gt;❌ use &lt;code&gt;--since&lt;/code&gt; to segment&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For very large repos, segment by component directory using &lt;code&gt;git log -- path/to/subdir&lt;/code&gt; rather than trying to fit everything.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where to Take This Next
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;GitHub Action.&lt;/strong&gt; Trigger the analyzer on each PR, post the risk flags as a PR comment, and block merge if any &lt;code&gt;high&lt;/code&gt; severity flags are found. One YAML file and a secrets entry gets you there.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Slack/Teams digest.&lt;/strong&gt; Run on a cron, pipe the changelog entry to a webhook. Engineering managers get a plain-English weekly summary without reading git.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fine-tuning.&lt;/strong&gt; If your team consistently disagrees with certain risk classifications, collect those corrections as a small labeled dataset and fine-tune the model on Vertex AI or Colab. Gemma 4's Apache 2.0 license means there are no restrictions on using it as a fine-tuning base for internal tools.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-repo analysis.&lt;/strong&gt; Pass diffs from multiple services in the same prompt window. The 256K context means you can compare what changed across your backend, frontend, and infra repos in the same analysis run.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why This Matters
&lt;/h2&gt;

&lt;p&gt;The git history is one of the most information-dense artifacts a software team produces, and it's almost entirely ignored outside of &lt;code&gt;git blame&lt;/code&gt;. Gemma 4 31B's context window is large enough to treat a sprint's history as a single document rather than a stream of individual events.&lt;/p&gt;

&lt;p&gt;That shift in granularity changes what the model can do: it can notice that a change made on Tuesday was partially reverted on Thursday, that two different authors independently touched the same configuration key, or that a "refactor" commit introduced a subtle behavioral change buried in 400 lines of renames.&lt;/p&gt;

&lt;p&gt;None of that is possible when each commit is reviewed in isolation.&lt;/p&gt;




&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://ai.google.dev/gemma/docs/core" rel="noopener noreferrer"&gt;Gemma 4 model overview — Google AI for Developers&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aistudio.google.com" rel="noopener noreferrer"&gt;Google AI Studio — free API access&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/google-gemini/generative-ai-python" rel="noopener noreferrer"&gt;google-generativeai Python SDK&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://huggingface.co/google/gemma-4-31B-it" rel="noopener noreferrer"&gt;Gemma 4 on Hugging Face&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>devchallenge</category>
      <category>gemmachallenge</category>
      <category>gemma</category>
    </item>
    <item>
      <title>Engineering LLMOps: Building Robust CI/CD Pipelines for LLM Applications on Google Cloud</title>
      <dc:creator>Jubin Soni</dc:creator>
      <pubDate>Fri, 01 May 2026 23:48:44 +0000</pubDate>
      <link>https://dev.to/jubinsoni/engineering-llmops-building-robust-cicd-pipelines-for-llm-applications-on-google-cloud-22hc</link>
      <guid>https://dev.to/jubinsoni/engineering-llmops-building-robust-cicd-pipelines-for-llm-applications-on-google-cloud-22hc</guid>
      <description>&lt;p&gt;The transition of Large Language Models (LLMs) from experimental notebooks to production-grade applications requires more than just a well-crafted prompt. As enterprises integrate Generative AI into their core workflows, the need for stability, scalability, and reproducibility becomes paramount. This is where LLMOps—the intersection of DevOps, Data Engineering, and Machine Learning—enters the frame.&lt;/p&gt;

&lt;p&gt;Building a CI/CD pipeline for LLM-based applications on Google Cloud Platform (GCP) presents unique challenges. Unlike traditional software, LLM outputs are non-deterministic, making testing complex. Unlike traditional ML, the "model" is often a managed service (like Gemini) or a fine-tuned version of an open-source giant, shifting the focus from training to orchestration, prompt management, and RAG (Retrieval-Augmented Generation) infrastructure.&lt;/p&gt;

&lt;p&gt;In this technical deep dive, we will explore how to architect a robust CI/CD pipeline for LLM applications using Google Cloud's suite of tools, ensuring your AI deployments are as reliable as your backend microservices.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Evolution of the Pipeline: From DevOps to LLMOps
&lt;/h2&gt;

&lt;p&gt;Traditional CI/CD focuses on code integrity, unit tests, and artifact deployment. LLMOps extends this by adding layers for prompt versioning, evaluation against golden datasets, and semantic monitoring. &lt;/p&gt;

&lt;p&gt;On Google Cloud, the backbone of this workflow is Cloud Build for orchestration, Vertex AI for model management and evaluation, and Artifact Registry for versioning. The goal is to move away from manual testing in the Vertex AI Studio and toward an automated, repeatable process.&lt;/p&gt;

&lt;h3&gt;
  
  
  Core Components of the GCP LLM Stack
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Vertex AI Model Garden &amp;amp; Model Registry&lt;/strong&gt;: Centralized hubs for discovering and managing models.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Cloud Build&lt;/strong&gt;: A serverless CI/CD platform that executes builds on GCP infrastructure.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Vertex AI Pipelines&lt;/strong&gt;: Based on Kubeflow, these allow you to orchestrate complex ML workflows.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Cloud Run / GKE&lt;/strong&gt;: For hosting the application logic or serving custom model containers.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Vertex AI Evaluation Service&lt;/strong&gt;: Provides automated metrics for model performance (e.g., faithfulness, answer relevancy).&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Architectural Blueprint: The LLM CI/CD Lifecycle
&lt;/h2&gt;

&lt;p&gt;A robust pipeline must handle three distinct types of updates: changes to the application code, changes to the prompt templates, and updates to the retrieval data (in RAG systems).&lt;/p&gt;

&lt;h3&gt;
  
  
  The Workflow Logic
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F08rjee7lbggtbcq2yjeh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F08rjee7lbggtbcq2yjeh.png" alt="Flowchart Diagram" width="482" height="1425"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This flowchart illustrates the progression from code commit to production. The "Performance Gate" is the most critical addition in LLMOps. It prevents models that hallucinate or provide poor-quality answers from reaching the end user.&lt;/p&gt;

&lt;h2&gt;
  
  
  Continuous Integration: Beyond Unit Testing
&lt;/h2&gt;

&lt;p&gt;In a standard application, O(1) or O(n) performance and logical correctness are the benchmarks. In LLM apps, we must test for semantic accuracy. CI for LLMs on GCP should include:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Prompt Linting&lt;/strong&gt;: Checking for formatting and required variables in prompt templates.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Deterministic Testing&lt;/strong&gt;: Testing the helper functions that format data for the LLM.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;LLM-based Evaluation (LLM-as-a-judge)&lt;/strong&gt;: Using a stronger model (like Gemini 1.5 Pro) to grade the output of a smaller, faster model (like Gemini 1.5 Flash).&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Practical Code: Automated Evaluation Script
&lt;/h3&gt;

&lt;p&gt;Using the Vertex AI SDK, we can automate the evaluation of a prompt change during the CI phase. The following Python snippet demonstrates how to trigger an evaluation job that measures "fluency" and "safety."&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;vertexai&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;vertexai.generative_models&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;GenerativeModel&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;vertexai.evaluation&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;EvalTask&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;PointwiseMetric&lt;/span&gt;

&lt;span class="c1"&gt;# Initialize Vertex AI
&lt;/span&gt;&lt;span class="n"&gt;vertexai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;init&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-project-id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;location&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us-central1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Define the evaluation metric (LLM-as-a-judge)
&lt;/span&gt;&lt;span class="n"&gt;fluency_metric&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PointwiseMetric&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;metric&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fluency&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;metric_prompt_template&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Rate the fluency of the following text from 1-5.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run_evaluation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;candidate_model_output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reference_data&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;eval_task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;EvalTask&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;reference_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;fluency_metric&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;experiment&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llm-app-v1-eval&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Run the evaluation
&lt;/span&gt;    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;eval_task&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;prompt_template&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Summarize this text: {text}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;google/gemini-1.5-flash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;summary_metrics&lt;/span&gt;

&lt;span class="c1"&gt;# Example usage in a CI script
# if results.summary_metrics['fluency'] &amp;lt; 4.0:
#     sys.exit(1) # Fail the build
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Data Management and Versioning
&lt;/h2&gt;

&lt;p&gt;In LLM applications, especially those utilizing RAG, the data is as important as the code. Your pipeline must account for the versioning of the Vector Database index and the embeddings model. If you update your embeddings model (e.g., from Gecko v1 to v2), you must re-index your entire dataset. Failure to do so leads to a "schema mismatch" in semantic space, where the LLM cannot find the relevant context.&lt;/p&gt;

&lt;h3&gt;
  
  
  Technology Comparison: Serving Options on Google Cloud
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Vertex AI Endpoints&lt;/th&gt;
&lt;th&gt;Cloud Run&lt;/th&gt;
&lt;th&gt;Google Kubernetes Engine (GKE)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Best For&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Managed model serving&lt;/td&gt;
&lt;td&gt;Lightweight AI APIs&lt;/td&gt;
&lt;td&gt;Large-scale custom deployments&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Auto-scaling&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Built-in (to zero with some models)&lt;/td&gt;
&lt;td&gt;Highly responsive to HTTP traffic&lt;/td&gt;
&lt;td&gt;Complex scaling based on GPU usage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cold Start&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Low (Serverless)&lt;/td&gt;
&lt;td&gt;High (unless using warm pools)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GPU Support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Seamlessly managed&lt;/td&gt;
&lt;td&gt;Limited (via Sidecars)&lt;/td&gt;
&lt;td&gt;Full control over GPU types&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Pricing Model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Per-node-hour&lt;/td&gt;
&lt;td&gt;Per-request/CPU-second&lt;/td&gt;
&lt;td&gt;Cluster-based provisioning&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Continuous Delivery: Deployment Strategies
&lt;/h2&gt;

&lt;p&gt;Deploying LLMs requires a safety-first approach. Because LLM behavior can shift with new data or minor prompt tweaks, Canary deployments are essential. Vertex AI Endpoints facilitate this by allowing traffic splitting between multiple model versions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Sequence of a Managed Deployment
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Focyi1lrg3kvzj0w2ddge.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Focyi1lrg3kvzj0w2ddge.png" alt="Sequence Diagram" width="800" height="473"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This sequence ensures that if the new prompt version causes a spike in 400-level errors or results in lower semantic confidence scores, the pipeline can automatically roll back to the stable version.&lt;/p&gt;

&lt;h2&gt;
  
  
  Infrastructure as Code (IaC) with Terraform
&lt;/h2&gt;

&lt;p&gt;To ensure the environment is reproducible, all GCP resources (Vertex AI Indexes, Endpoints, and Cloud Storage buckets) should be managed via Terraform. This prevents "configuration drift," where the staging environment differs from production.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"google_vertex_ai_endpoint"&lt;/span&gt; &lt;span class="s2"&gt;"llm_endpoint"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"gemini-service-endpoint"&lt;/span&gt;
  &lt;span class="nx"&gt;display_name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Gemini Service Endpoint"&lt;/span&gt;
  &lt;span class="nx"&gt;location&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"us-central1"&lt;/span&gt;
  &lt;span class="nx"&gt;project&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;project_id&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"google_cloudbuild_trigger"&lt;/span&gt; &lt;span class="s2"&gt;"llm_pipeline_trigger"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"deploy-llm-on-push"&lt;/span&gt;

  &lt;span class="nx"&gt;github&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;owner&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"your-org"&lt;/span&gt;
    &lt;span class="nx"&gt;name&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"your-repo"&lt;/span&gt;
    &lt;span class="nx"&gt;push&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;branch&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"^main$"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;filename&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"cloudbuild.yaml"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Implementing a "PromptOps" Strategy
&lt;/h2&gt;

&lt;p&gt;One of the most significant shifts in LLMOps is treating prompts as first-class citizens. Instead of hardcoding prompts in the application code, store them as versioned assets. &lt;/p&gt;

&lt;h3&gt;
  
  
  Branching Strategy for Prompts
&lt;/h3&gt;

&lt;p&gt;Using a Git-based workflow for prompts allows prompt engineers to experiment without breaking the production application logic.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq5emhdw2zbzmjyzurzzj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq5emhdw2zbzmjyzurzzj.png" alt="Diagram" width="525" height="315"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Cloud Build Configuration
&lt;/h2&gt;

&lt;p&gt;The following is an example of a &lt;code&gt;cloudbuild.yaml&lt;/code&gt; file that orchestrates the entire process: running tests, performing model evaluation, and deploying to a staging environment.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="c1"&gt;# Step 1: Install dependencies and run unit tests&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;python:3.10'&lt;/span&gt;
    &lt;span class="na"&gt;entrypoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/bin/sh&lt;/span&gt;
    &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;-c&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
        &lt;span class="s"&gt;pip install -r requirements-test.txt&lt;/span&gt;
        &lt;span class="s"&gt;pytest tests/unit&lt;/span&gt;

  &lt;span class="c1"&gt;# Step 2: Run Vertex AI Evaluation&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;gcr.io/google.com/cloudsdktool/cloud-sdk'&lt;/span&gt;
    &lt;span class="na"&gt;entrypoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;python'&lt;/span&gt;
    &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;scripts/evaluate_model.py'&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;PROJECT_ID=$PROJECT_ID'&lt;/span&gt;

  &lt;span class="c1"&gt;# Step 3: Build the application container&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;gcr.io/cloud-builders/docker'&lt;/span&gt;
    &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;build'&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;-t'&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;us-central1-docker.pkg.dev/$PROJECT_ID/app-repo/llm-app:$SHORT_SHA'&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.'&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

  &lt;span class="c1"&gt;# Step 4: Push to Artifact Registry&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;gcr.io/cloud-builders/docker'&lt;/span&gt;
    &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;push'&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;us-central1-docker.pkg.dev/$PROJECT_ID/app-repo/llm-app:$SHORT_SHA'&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

  &lt;span class="c1"&gt;# Step 5: Update Cloud Run Service&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;gcr.io/google.com/cloudsdktool/cloud-sdk'&lt;/span&gt;
    &lt;span class="na"&gt;entrypoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gcloud&lt;/span&gt;
    &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; 
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;run'&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;deploy'&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;llm-service-staging'&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;--image=us-central1-docker.pkg.dev/$PROJECT_ID/app-repo/llm-app:$SHORT_SHA'&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;--region=us-central1'&lt;/span&gt;

&lt;span class="na"&gt;images&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;us-central1-docker.pkg.dev/$PROJECT_ID/app-repo/llm-app:$SHORT_SHA'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Monitoring and Feedback Loops
&lt;/h2&gt;

&lt;p&gt;Once an LLM application is in production, the CI/CD pipeline doesn't stop. It transforms into a feedback loop. Google Cloud Monitoring and Cloud Logging can be used to track:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Token Usage&lt;/strong&gt;: Monitoring costs to prevent budget overruns.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Latency&lt;/strong&gt;: Tracking time-to-first-token (TTFT) and total response time.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Human-in-the-loop Feedback&lt;/strong&gt;: Sending flagged responses back to a labeling task in Vertex AI for future fine-tuning.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Handling Non-Determinism
&lt;/h3&gt;

&lt;p&gt;Because LLMs are non-deterministic, your monitoring tools should use statistical significance. Instead of a binary "pass/fail" for every request, look for distribution shifts in the "Helpfulness" score over a window of 1000 requests. If the mean score drops by more than two standard deviations, the pipeline should trigger a rollback or alert the engineering team.&lt;/p&gt;

&lt;h2&gt;
  
  
  Security and Governance in LLMOps
&lt;/h2&gt;

&lt;p&gt;Security in the CI/CD pipeline for LLMs involves protecting the data used for RAG and the API keys for the model providers. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Secret Manager&lt;/strong&gt;: Use GCP Secret Manager to store API keys and database credentials. Never hardcode these in your &lt;code&gt;cloudbuild.yaml&lt;/code&gt; or application containers.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;VPC Service Controls&lt;/strong&gt;: For enterprises with strict data residency requirements, ensure that Vertex AI is used within a VPC Service Control perimeter to prevent data exfiltration.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;IAM Granularity&lt;/strong&gt;: Assign the least privilege roles. The Cloud Build service account needs &lt;code&gt;roles/aiplatform.user&lt;/code&gt; to trigger evaluations but should not have permission to delete model registries.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion: The Path to Mature AI Delivery
&lt;/h2&gt;

&lt;p&gt;Building a CI/CD pipeline for LLM applications on Google Cloud is an iterative journey. It begins with basic automation and evolves into a sophisticated system capable of semantic evaluation and automated rollbacks. By leveraging Vertex AI and Cloud Build, organizations can treat LLMs not as mysterious black boxes, but as manageable components of a robust software ecosystem.&lt;/p&gt;

&lt;p&gt;The key to success lies in the "Performance Gate"—investing heavily in evaluation metrics early on will save hundreds of hours of manual debugging later. As the Generative AI landscape continues to evolve, those with the most resilient pipelines will be the ones who can innovate at the speed of the market without sacrificing reliability.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further Reading &amp;amp; Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/vertex-ai/docs" rel="noopener noreferrer"&gt;Google Cloud Vertex AI Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/architecture/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning" rel="noopener noreferrer"&gt;Maturity Model for MLOps and LLMOps on Google Cloud&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/vertex-ai/docs/pipelines/introduction" rel="noopener noreferrer"&gt;Introduction to Vertex AI Pipelines&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/vertex-ai/docs/generative-ai/model-reference/evaluation" rel="noopener noreferrer"&gt;Continuous Evaluation with Vertex AI Rapid Evaluation API&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/build/docs" rel="noopener noreferrer"&gt;Cloud Build Official Product Overview&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;Connect with me: &lt;a href="https://linkedin.com/in/jubinsoni" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; | &lt;a href="https://twitter.com/sonijubin" rel="noopener noreferrer"&gt;Twitter/X&lt;/a&gt; | &lt;a href="https://github.com/jubins" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; | &lt;a href="https://jubinsoni.com" rel="noopener noreferrer"&gt;Website&lt;/a&gt;&lt;/p&gt;

</description>
      <category>llmops</category>
      <category>googlecloud</category>
      <category>cicd</category>
      <category>vertexai</category>
    </item>
    <item>
      <title>The Most Important Announcement at NEXT '26 Was a Sidecar</title>
      <dc:creator>Jubin Soni</dc:creator>
      <pubDate>Sun, 26 Apr 2026 08:07:59 +0000</pubDate>
      <link>https://dev.to/jubinsoni/the-most-important-announcement-at-next-26-was-a-sidecar-5dmk</link>
      <guid>https://dev.to/jubinsoni/the-most-important-announcement-at-next-26-was-a-sidecar-5dmk</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a submission for the &lt;a href="https://dev.to/challenges/google-cloud-next-2026-04-22"&gt;Google Cloud NEXT Writing Challenge&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Google Cloud NEXT '26 made 260 announcements. Most of the discussion has rightly gone to the headline acts: the Gemini Enterprise Agent Platform, 8th-gen TPUs, the Cross-Cloud Lakehouse, Agentic Defense.&lt;/p&gt;

&lt;p&gt;Announcement &lt;strong&gt;#124&lt;/strong&gt; is not one of those.&lt;/p&gt;

&lt;p&gt;It's titled "Predictive latency boost in GKE Inference Gateway." The official blurb says it cuts time-to-first-token by up to 70% by replacing heuristic guesswork with real-time capacity-aware routing — no manual tuning required. That sentence is engineered to slide past you.&lt;/p&gt;

&lt;p&gt;Here's why I think it's the most consequential thing Google shipped this week.&lt;/p&gt;




&lt;h2&gt;
  
  
  The problem this is actually solving
&lt;/h2&gt;

&lt;p&gt;If you've ever stood up a vLLM cluster on Kubernetes, you've felt this pain:&lt;/p&gt;

&lt;p&gt;You have N replicas of the same model. A request lands. Your load balancer has to decide which pod gets it. The "obvious" answers all break:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Round-robin?&lt;/strong&gt; Ignores that pod 3 is sitting on a 60-token KV cache and pod 7 is at 95% memory pressure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Least-connections?&lt;/strong&gt; Treats a 50-token prompt and a 50,000-token prompt as equivalent units of work. They are not.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cache-aware (route to the pod with the prefix already cached)?&lt;/strong&gt; Concentrates load. Cache-hot pods melt. Cache-cold pods sit idle.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Utilization-aware (route to the least-loaded pod)?&lt;/strong&gt; Throws away the entire benefit of prefix caching by scattering related requests.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The standard production answer is what the Kubernetes Inference Gateway calls a "load+prefix scorer" — you give it weights like &lt;code&gt;(prefix=1, queue=1, kv_cache=1)&lt;/code&gt; and tune them by hand. The weights you pick are wrong roughly five minutes after you pick them, because traffic shape changes. The weights that worked at 2pm don't work at 2am. The weights that worked for chat workloads don't work when your evals job kicks off.&lt;/p&gt;

&lt;p&gt;Everyone running LLM inference at scale has built some version of "we tuned the scorer weights for our workload." Everyone has watched those weights silently rot.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Google announced
&lt;/h2&gt;

&lt;p&gt;Buried in the GKE keynote, &lt;a href="https://llm-d.ai/blog/predicted-latency-based-scheduling-for-llms" rel="noopener noreferrer"&gt;Google linked to a research blog from the llm-d team&lt;/a&gt; describing the actual mechanism behind announcement #124. The architecture is shockingly simple — and that's the whole point.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                                  ┌──────────────────────────┐
                                  │   Inference Gateway      │
  request ──────────────────────► │   Endpoint Picker (EPP)  │
                                  └──────────┬───────────────┘
                                             │ "for each candidate pod,
                                             │  predict TTFT and TPOT"
                                             ▼
                                  ┌──────────────────────────┐
                                  │  Latency Predictor       │
                                  │  (XGBoost regression,    │
                                  │   sidecar to EPP)        │
                                  └──────────┬───────────────┘
                                             │ predictions
                                             ▼
                                  pod with best predicted latency wins
                                             │
                                             ▼
                                  ┌────────┐ ┌────────┐ ┌────────┐
                                  │ vLLM 1 │ │ vLLM 2 │ │ vLLM 3 │
                                  └────┬───┘ └────┬───┘ └────┬───┘
                                       │          │          │
                                       └──────────┼──────────┘
                                                  ▼
                                  ┌──────────────────────────┐
                                  │  Trainer sidecar         │
                                  │  observes completed      │
                                  │  requests, retrains      │
                                  │  on sliding window       │
                                  └──────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There is no large model here. There is no Gemini call in the hot path. The "AI" is a small XGBoost regressor that predicts two numbers per candidate pod:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;TTFT&lt;/strong&gt; — time to first token (dominated by prefill)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TPOT&lt;/strong&gt; — time per output token (dominated by decode)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It uses six features: KV cache utilization, input length, queue depth, running requests, prefix cache match percentage, and input tokens in flight. That's the whole input.&lt;/p&gt;

&lt;p&gt;Then the scheduler routes to the pod with the best predicted outcome. If you provided latency SLOs in the request headers, it does best-fit packing — pick the pod with the &lt;em&gt;least&lt;/em&gt; positive headroom, so the others stay free for harder requests later.&lt;/p&gt;

&lt;p&gt;That's it. That's the announcement.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this matters more than anything else announced
&lt;/h2&gt;

&lt;p&gt;Look at the &lt;a href="https://llm-d.ai/blog/predicted-latency-based-scheduling-for-llms" rel="noopener noreferrer"&gt;production numbers from the llm-d post&lt;/a&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Strategy&lt;/th&gt;
&lt;th&gt;E2E p50&lt;/th&gt;
&lt;th&gt;TTFT p50&lt;/th&gt;
&lt;th&gt;TTFT p95&lt;/th&gt;
&lt;th&gt;TPOT p99&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;K8s round-robin baseline&lt;/td&gt;
&lt;td&gt;15.98s&lt;/td&gt;
&lt;td&gt;4.47s&lt;/td&gt;
&lt;td&gt;24.04s&lt;/td&gt;
&lt;td&gt;93ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Load+Prefix &lt;code&gt;(1,1,1)&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;16.42s&lt;/td&gt;
&lt;td&gt;2.86s&lt;/td&gt;
&lt;td&gt;18.06s&lt;/td&gt;
&lt;td&gt;103ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Load+Prefix &lt;code&gt;(3,2,2)&lt;/code&gt; (hand-tuned for this workload)&lt;/td&gt;
&lt;td&gt;13.42s&lt;/td&gt;
&lt;td&gt;3.38s&lt;/td&gt;
&lt;td&gt;16.78s&lt;/td&gt;
&lt;td&gt;63ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Predicted-latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;9.06s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.97s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;11.34s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;53ms&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The hand-tuned heuristic was specifically tuned by humans who looked at seven days of production traffic. The XGBoost model — which retrains on a 1ms-window sliding stratified bucket — beat it by 43% on E2E p50 and 70% on TTFT p50.&lt;/p&gt;

&lt;p&gt;This is the part that should make every infrastructure engineer pay attention: &lt;strong&gt;the model didn't beat round-robin. It beat the best version of the thing your team is currently running.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The workload was Qwen3-480B on 13 servers with 8×H200 each, simulating realistic Poisson-distributed traffic with concurrency 1000 and ~94% peak prefix cache reuse. That's not a toy benchmark. That's what your stack looks like.&lt;/p&gt;

&lt;h2&gt;
  
  
  The deeper claim hiding in plain sight
&lt;/h2&gt;

&lt;p&gt;Read this sentence carefully, because it's the actual thesis:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Accelerator performance is fairly predictable when we account for [server] state and request characteristics."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is a quietly heretical claim against the entire current direction of LLM ops tooling. A huge amount of effort right now goes into making serving systems &lt;em&gt;more general&lt;/em&gt; — disaggregated prefill/decode, KV cache offloading to any filesystem, multi-tier caches across RAM/SSD/GCS (also at NEXT, announcement #125). The complexity is exploding.&lt;/p&gt;

&lt;p&gt;The latency-predictor team's bet is the opposite: &lt;strong&gt;the system is already deterministic enough that a six-feature regression hits 5% MAPE.&lt;/strong&gt; Most of what we call "tuning" is just humans doing worse-than-XGBoost approximations of a function that's actually quite learnable.&lt;/p&gt;

&lt;p&gt;If that's true — and the production numbers say it is — then a lot of what gets sold as "AI infrastructure intelligence" is going to collapse into very small models that learn very narrow things online. Not LLMs. Not even deep learning. Boosted trees. Trained on the last few hundred completed requests. Retrained constantly.&lt;/p&gt;

&lt;p&gt;The ironic punchline is that this announcement, which got dropped in a footnote at NEXT '26, may be a more honest preview of where production AI infrastructure is heading than the entire Gemini Enterprise Agent Platform keynote.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trying it
&lt;/h2&gt;

&lt;p&gt;You can run this today. The implementation is open-source under the Kubernetes &lt;a href="https://gateway-api-inference-extension.sigs.k8s.io/guides/latency-based-predictor/" rel="noopener noreferrer"&gt;Gateway API Inference Extension&lt;/a&gt;. The gateway is the K8s upstream component; what Google did at NEXT was bake it into GKE Inference Gateway as a managed feature.&lt;/p&gt;

&lt;p&gt;Once installed, requests opt in via headers — and this is where the design choice gets clever:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="nv"&gt;$GW_IP&lt;/span&gt;/v1/completions &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s1"&gt;'Content-Type: application/json'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s1"&gt;'x-prediction-based-scheduling: true'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s1"&gt;'x-slo-ttft-ms: 200'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s1"&gt;'x-slo-tpot-ms: 50'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "model": "Qwen/Qwen3-32B",
    "prompt": "what is the difference between Franz and Apache Kafka?",
    "max_tokens": 200,
    "stream": "true"
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The two SLO headers are the part to dwell on. You're not telling the gateway &lt;em&gt;how&lt;/em&gt; to route. You're telling it &lt;em&gt;what you need&lt;/em&gt;, and letting it figure out the routing as a constrained optimization. &lt;code&gt;x-slo-ttft-ms: 200&lt;/code&gt; means "I need first token in 200ms or this is a degraded request." The scheduler computes headroom (predicted_ttft − slo_ttft) per pod and packs accordingly.&lt;/p&gt;

&lt;p&gt;This is a real, observable shift in how we think about LLM ops: from imperative ("route to pod 3") to declarative ("meet this SLO"), the same shift that databases went through twenty years ago when query planners replaced hand-written joins.&lt;/p&gt;

&lt;p&gt;The EPP exposes &lt;code&gt;-v=4&lt;/code&gt; log lines that let you watch the scorer think:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;msg:&lt;/span&gt;&lt;span class="s2"&gt;"Running profile handler"&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="err"&gt;plugin:&lt;/span&gt;&lt;span class="s2"&gt;"slo-aware-profile-handler"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;msg:&lt;/span&gt;&lt;span class="s2"&gt;"Pod score"&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="err"&gt;scorer_type:&lt;/span&gt;&lt;span class="s2"&gt;"slo-scorer"&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="err"&gt;pod_name:&lt;/span&gt;&lt;span class="s2"&gt;"vllm-...-9b4wt"&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="err"&gt;score:&lt;/span&gt;&lt;span class="mf"&gt;0.82&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;msg:&lt;/span&gt;&lt;span class="s2"&gt;"Picked endpoint"&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="err"&gt;selected_pod:&lt;/span&gt;&lt;span class="s2"&gt;"vllm-...-9b4wt"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pair this with announcement #129 — autoscaling on custom metrics — and you have a closed loop: the predictor surfaces SLO headroom, the autoscaler reacts to headroom collapse before queue depth even spikes. Most autoscaling triggers fire after the system is already in pain. This one fires when the &lt;em&gt;forecast&lt;/em&gt; says pain is 30 seconds away.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd watch next
&lt;/h2&gt;

&lt;p&gt;A few open questions that the announcement and the underlying paper don't fully resolve:&lt;/p&gt;

&lt;p&gt;The model assumes a homogeneous accelerator pool. In real fleets you have H100s and H200s and B200s mixed together, with different price/performance curves. The team flagged this as future work; whoever solves it well wins the heterogeneous-GPU-cost-optimization market that nobody is talking about yet.&lt;/p&gt;

&lt;p&gt;The trainer runs as a sidecar to the EPP and retrains continuously. At the QPS levels in the scaling table — 10,000 QPS needs 4 prediction servers — the cost of the routing decision starts to be non-trivial relative to the inference itself. There's a coordination cost story here that's missing from the blog post.&lt;/p&gt;

&lt;p&gt;And the bigger question: this technique generalizes. The same XGBoost-on-six-features approach should work for autoscaling, for spot/on-demand routing decisions, for cache eviction policies, for batch scheduling. If Google ships predicted-latency primitives across the rest of GKE, the consequences are larger than a single-feature blog post implies.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;The contest prompt asks for the announcement that "speaks to you." The honest answer for me is: the boring sidecar with the unglamorous name that takes a week of pain — the slow rot of hand-tuned scorer weights — and replaces it with something that retrains itself.&lt;/p&gt;

&lt;p&gt;Everyone watching the keynote saw the agent demos. The serving runtime is where the actual money gets won or lost, and it's where six-feature regression beats a roomful of senior SREs with grafana dashboards. That's the announcement I think we'll be talking about in 18 months.&lt;/p&gt;

&lt;p&gt;The agent layer makes for a better trailer. The runtime layer is the movie.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Sources &amp;amp; credits: Technical details, production benchmark numbers, and architecture diagram concept drawn from the &lt;a href="https://llm-d.ai/blog/predicted-latency-based-scheduling-for-llms" rel="noopener noreferrer"&gt;llm-d project's "Predicted-Latency Based Scheduling for LLMs" post (March 2026)&lt;/a&gt; by Kaushik Mitra, Benjamin Braun, Abdullah Gharaibeh, and Clayton Coleman, and the &lt;a href="https://cloud.google.com/blog/topics/google-cloud-next/google-cloud-next-2026-wrap-up" rel="noopener noreferrer"&gt;Google Cloud NEXT '26 Wrap-Up&lt;/a&gt; (announcement #124). The opinions, framing, and analysis are mine. AI tools were used as a writing assistant; all technical claims trace to the linked primary sources.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devchallenge</category>
      <category>cloudnextchallenge</category>
      <category>googlecloud</category>
    </item>
    <item>
      <title>What Does OpenClaw Take From You?</title>
      <dc:creator>Jubin Soni</dc:creator>
      <pubDate>Sun, 26 Apr 2026 07:46:29 +0000</pubDate>
      <link>https://dev.to/jubinsoni/what-does-openclaw-take-from-you-4ph3</link>
      <guid>https://dev.to/jubinsoni/what-does-openclaw-take-from-you-4ph3</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a submission for the &lt;a href="https://dev.to/challenges/openclaw-2026-04-16"&gt;OpenClaw Challenge&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; The conversation about personal AI is almost entirely about what these agents give you. The harder question — and the one that determines whether the deal is actually good — is what they take. Here are three things personal AI is quietly absorbing, and what I think you should keep.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;We all know the story around personal AI. It gives you time back. It automates. It amplifies. It handles your inbox while you sleep, summarizes your morning, drafts replies, books flights, files receipts, sends messages—so you don’t have to. The language never really changes: gain. More throughput. More execution. More delegation. More agency.&lt;/p&gt;

&lt;p&gt;This framing is missing half the equation. Anything you offload, you stop doing. Anything you stop doing, you eventually stop being good at. And anything you stop being good at, you eventually stop &lt;em&gt;noticing&lt;/em&gt; you used to be good at.&lt;/p&gt;

&lt;p&gt;This is not a luddite essay. I want this technology to work. OpenClaw, specifically, is one of the more honest things in the personal AI space — file-first, locally hosted, legible memory, open-source ethos. If any agent framework is going to be defensible five years from now, it is probably this one. But that is exactly why the question matters more here than it does for some hosted SaaS chatbot. OpenClaw is not a toy. It is built to actually live in your life. And the things it is built to absorb are not random — they are a specific class of cognition that, until very recently, you did yourself.&lt;/p&gt;

&lt;p&gt;So: what is in that class? Three things, because I think the conversation needs the vocabulary.&lt;/p&gt;

&lt;h2&gt;
  
  
  The first thing: the friction that makes you decide
&lt;/h2&gt;

&lt;p&gt;It is tempting to treat every recurring annoyance in your life as something to automate away. Bills. Calendar conflicts. Inbox triage. Deadline tracking. Grocery lists. Household coordination. The agent handles them, the annoyance goes away, the win goes on the board.&lt;/p&gt;

&lt;p&gt;Friction is not always a bug.&lt;/p&gt;

&lt;p&gt;The reason you used to look at your bills before paying them is not that you enjoyed the experience. It is that the act of looking — even for two seconds — sometimes caught the thing that mattered. The duplicate charge. The subscription you forgot you had. The number that was higher than last month and signaled something upstream in your life was off. The five seconds of friction was a sampling pass on your own financial reality, run weekly, for free.&lt;/p&gt;

&lt;p&gt;When you build an agent that summarizes the bills and tells you the total, you have not just removed the friction. You have removed the sampling pass. The summary will tell you what the agent thinks is interesting. It will not tell you what &lt;em&gt;you&lt;/em&gt; would have thought was interesting if you had looked, because you no longer have the muscle to know.&lt;/p&gt;

&lt;p&gt;This is not theoretical. It is the same pattern that GPS did to your sense of direction, that autocomplete did to your spelling, and that calculators did to your arithmetic. In each case the technology was net-positive. In each case something specific and unrecoverable was traded away. We made those trades half-consciously because we did not have a vocabulary for what was on the other side of the ledger.&lt;/p&gt;

&lt;p&gt;The personal-agent generation is making bigger trades, faster, with even less vocabulary.&lt;/p&gt;

&lt;h2&gt;
  
  
  The second thing: the practice of small decisions
&lt;/h2&gt;

&lt;p&gt;There is a category of decision that is too small to think about and too consequential to skip.&lt;/p&gt;

&lt;p&gt;What to reply to that ambiguous Slack message. Whether the email from your landlord needs a same-day response or can wait until Monday. Whether the meeting your colleague proposed at 4 p.m. is one you should accept or politely deflect. Whether the calendar conflict your assistant just flagged is a real conflict or one of those situations where it is fine to be ten minutes late to the second thing.&lt;/p&gt;

&lt;p&gt;Personal agents are very good at the first 80% of these decisions and quietly bad at the last 20%. The first 80% — the obvious cases — is where they shine and where the demos look great. The last 20% — the cases that require taste, social calibration, and an accurate model of the specific humans involved — is where they fail in ways that do not show up in any benchmark, because the failure mode is &lt;em&gt;the agent did something locally reasonable that was globally wrong, and you did not notice until it was too late.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The deeper problem is that the small-decisions practice is &lt;em&gt;how taste is built in the first place&lt;/em&gt;. You develop a sense for which Slack messages need a careful reply by replying to a thousand of them, badly at first, and getting feedback from how the relationship went. If your agent handles the first nine hundred and fifty, you arrive at message nine hundred and fifty-one with the calibration of a beginner.&lt;/p&gt;

&lt;p&gt;The framing of "delegate the boring stuff and focus on the important stuff" assumes three things: that the boring stuff and the important stuff are clearly separated, that the boring stuff does not feed into the important stuff, and that you can train the agent on the boring stuff without losing access to the inputs that would have eventually made you good at the important stuff. None of these assumptions survive contact with how human skill actually develops.&lt;/p&gt;

&lt;h2&gt;
  
  
  The third thing: the silence in which you notice you were wrong
&lt;/h2&gt;

&lt;p&gt;This one is harder to name and I think it is the most important.&lt;/p&gt;

&lt;p&gt;Right now, when you have a thought that is incomplete, a plan that is half-formed, or an instinct that something is off, there is a natural waiting period. You sit with it. You go for a walk. You stare at the ceiling for an hour. Eventually, sometimes, the thing resolves. You realize the project you were excited about is actually a bad idea. You realize the email you drafted last night was angrier than you intended. You realize the person you were going to call does not actually need a call from you. They need space.&lt;/p&gt;

&lt;p&gt;This kind of cognition does not happen in language. It happens in the gaps between language. It is what your nervous system does when nothing is asking it for output.&lt;/p&gt;

&lt;p&gt;Personal AI agents are, by their nature, output machines. They want to be useful. They want to give you something. The honest, well-built ones — and OpenClaw is honest and well-built — are designed to be proactive, to surface things, to ping you with the briefing, to suggest the next step. The whole pitch is that they fill the gaps.&lt;/p&gt;

&lt;p&gt;But the gaps were doing work.&lt;/p&gt;

&lt;p&gt;The morning before you check your phone. The walk to the coffee shop where you have not yet asked the agent anything. The half-hour of unstructured staring before the meeting. These are not inefficiencies in your life that an agent should be optimizing away. They are the conditions under which your slower, more honest cognition can operate. Compressing them does not give you back time. It gives you back the same amount of time, minus the part of your mind that needed the silence to work.&lt;/p&gt;

&lt;p&gt;This is the trade nobody in the personal AI space wants to look at directly, because looking at it threatens the entire growth story. If the value of the agent is partly a function of what it disrupts in your inner life, and if some of what it disrupts is irreplaceable, then the unbounded "delegate everything" pitch starts to look less like a productivity story and more like a deal you should sign carefully.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to actually do
&lt;/h2&gt;

&lt;p&gt;Use OpenClaw. I mean that. The category is real, the project is good, and the alternative — keeping your data with hosted platforms whose pricing pages will change without your consent — is worse on almost every axis.&lt;/p&gt;

&lt;p&gt;But sign the deal carefully. The rule I would actually follow is the simplest one I can write down:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Pick the offloads where the friction is genuinely friction.&lt;br&gt;
Keep the offloads where the friction is doing work.&lt;br&gt;
Leave the gaps alone.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The first one is for things where the human cost is high and the cognitive value is zero. Receipt parsing. Standard meeting confirmations. Repetitive document formatting. Things that genuinely should have been a script.&lt;/p&gt;

&lt;p&gt;The second one is for the things where the friction is the point. Read your own bills. Reply to your own ambiguous Slack messages, at least most of them. Look at your own calendar before you ask the agent to look at it. Treat the small-decisions practice like a gym membership — something you do not because you cannot afford the alternative, but because you understand what your body becomes if you stop using it.&lt;/p&gt;

&lt;p&gt;The third one is the hardest, because the agent is built to fill the gaps and your brain is built to let it. The morning before your first meeting. The walk where you have not yet opened a chat. The half-hour of unstructured staring. Leave them alone. The silence is not a bug to be fixed. It is the thing keeping the rest of it alive.&lt;/p&gt;

&lt;p&gt;Personal AI is going to be one of the largest technology shifts of the next decade, and OpenClaw is going to be in the middle of it. The question is not whether to participate. It is what you intend to keep, and what you are quietly agreeing to give up.&lt;/p&gt;

&lt;p&gt;Most of the conversation right now is an accounting of the gains.&lt;/p&gt;

&lt;p&gt;Somebody should account for the rest.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;What's an offload you regret? Or one you almost made and pulled back from? I'd genuinely like to hear it.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devchallenge</category>
      <category>openclawchallenge</category>
      <category>discuss</category>
      <category>ai</category>
    </item>
    <item>
      <title>What is AWS Kiro and Why it Matters for Agentic Development</title>
      <dc:creator>Jubin Soni</dc:creator>
      <pubDate>Sat, 25 Apr 2026 06:37:19 +0000</pubDate>
      <link>https://dev.to/jubinsoni/what-is-aws-kiro-and-why-it-matters-for-agentic-development-18kd</link>
      <guid>https://dev.to/jubinsoni/what-is-aws-kiro-and-why-it-matters-for-agentic-development-18kd</guid>
      <description>&lt;p&gt;The evolution of Artificial Intelligence has transitioned from passive chat interfaces to active, autonomous agents. This shift, known as agentic development, requires a fundamental rethink of cloud infrastructure. In traditional AI workflows, a single request is sent to a Large Language Model (LLM), and a response is received. In agentic workflows, dozens or even hundreds of small, specialized agents must communicate, share state, and access tools in real-time. This creates a massive networking and latency bottleneck that standard REST-based architectures cannot handle.&lt;/p&gt;

&lt;p&gt;Enter &lt;strong&gt;AWS Kiro&lt;/strong&gt;. AWS Kiro (Kernel-Integrated Runtime Orchestrator) is a specialized, high-performance infrastructure layer designed specifically for the orchestration of multi-agent systems. It moves beyond the limitations of standard container orchestration to provide a low-latency, state-aware environment where agents can thrive. This article provides a deep dive into what AWS Kiro is, how it works, and why it is the missing piece for the next generation of AI development.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Infrastructure Gap in Agentic AI
&lt;/h2&gt;

&lt;p&gt;To understand why AWS Kiro matters, we must first look at the unique requirements of agentic systems. Unlike a simple web application, an agentic system involves:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;High Concurrency&lt;/strong&gt;: Multiple agents (e.g., a Researcher, a Writer, and a Fact-Checker) working simultaneously.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;State Persistence&lt;/strong&gt;: Agents need to remember what they were doing across thousands of small sub-tasks.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Low Latency Inter-Agent Communication&lt;/strong&gt;: If Agent A needs to wait 500ms for a response from Agent B, a chain of 10 agent calls becomes prohibitively slow.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Tool-Heavy Execution&lt;/strong&gt;: Agents frequently call external APIs, databases, and code execution sandboxes.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Traditional AWS services like Lambda or Fargate are excellent for general-purpose compute but often introduce "cold start" latencies or networking overhead that degrade agent performance. AWS Kiro was built to minimize this overhead by integrating the agent runtime closer to the hardware kernel and optimizing the networking stack for small, frequent packets of data common in agent communication.&lt;/p&gt;




&lt;h2&gt;
  
  
  Architecture Deep Dive: How AWS Kiro Works
&lt;/h2&gt;

&lt;p&gt;At its core, AWS Kiro utilizes a specialized virtualization layer that sits on top of the AWS Nitro System. It abstracts the complexities of agent coordination, providing what AWS calls a "Global Shared Memory Space" (GSMS). This allows agents running in different execution environments to share context without the latency of an external database like Redis.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Kiro Control Plane and Data Plane
&lt;/h3&gt;

&lt;p&gt;The architecture is split into two primary components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt; &lt;strong&gt;Kiro Control Plane&lt;/strong&gt;: Manages agent lifecycles, task decomposition, and scheduling.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Kiro Data Plane (The Fabric)&lt;/strong&gt;: Handles high-speed message passing and shared state access using RDMA (Remote Direct Memory Access) over Converged Ethernet (RoCE).&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Diagram 1: Multi-Agent Interaction via AWS Kiro
&lt;/h3&gt;

&lt;p&gt;This sequence diagram illustrates how a user request is decomposed into multiple agent tasks through the Kiro fabric, highlighting the sub-millisecond coordination between the Orchestrator and worker agents.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpophikdneags5agv7b7i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpophikdneags5agv7b7i.png" alt="Multi-Agent Interaction description" width="800" height="415"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this flow, notice that A1 and A2 do not call each other directly via REST. Instead, they interact with the &lt;strong&gt;Global Shared Memory (GSMS)&lt;/strong&gt; provided by Kiro. This reduces the serialization/deserialization overhead and allows for O(1) time complexity when accessing shared context, regardless of how many agents are involved.&lt;/p&gt;




&lt;h2&gt;
  
  
  Key Features of AWS Kiro
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Kernel-Integrated Tool Execution
&lt;/h3&gt;

&lt;p&gt;Standard agents often struggle with the latency of spinning up a sandbox to execute code. AWS Kiro uses "Micro-Enclaves"—lightweight, isolated environments that share a kernel with the Kiro runtime. This allows an agent to go from "thinking" to "executing Python code" in less than 5ms.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Predictive Context Pre-fetching
&lt;/h3&gt;

&lt;p&gt;Kiro uses machine learning to predict which piece of historical context an agent might need next. If Agent B usually follows Agent A, Kiro will pre-fetch Agent A’s output into the local cache of the node where Agent B is scheduled to run.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Native Bedrock Integration
&lt;/h3&gt;

&lt;p&gt;While Kiro handles the infrastructure, it is tightly coupled with &lt;strong&gt;Amazon Bedrock&lt;/strong&gt;. It can automatically pull model weights for smaller, specialized models (like Llama 3 or Mistral) into local memory to further reduce inference latency during agentic loops.&lt;/p&gt;




&lt;h2&gt;
  
  
  Comparing Architectures: Traditional vs. AWS Kiro
&lt;/h2&gt;

&lt;p&gt;To see the value proposition, let's compare a standard agent implementation (using Lambda and S3/Redis for state) against an AWS Kiro-native implementation.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Traditional Agent (Lambda + Redis)&lt;/th&gt;
&lt;th&gt;AWS Kiro-Native Agent&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Inter-Agent Latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;50ms - 200ms (HTTP/TLS)&lt;/td&gt;
&lt;td&gt;&amp;lt; 2ms (RDMA/Shared Memory)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;State Management&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;External (Redis/DynamoDB)&lt;/td&gt;
&lt;td&gt;Native (Global Shared Memory)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cold Start&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Significant (200ms - 2s)&lt;/td&gt;
&lt;td&gt;Minimal (&amp;lt; 10ms via Micro-Enclaves)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Context Window Handling&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Manual truncation/storage&lt;/td&gt;
&lt;td&gt;Automatic predictive pre-fetching&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Scalability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Limited by database IOPS&lt;/td&gt;
&lt;td&gt;Linearly scalable across Kiro Fabric&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Task Decomposition Logic
&lt;/h2&gt;

&lt;p&gt;A critical part of agentic development is how a complex task is broken down. AWS Kiro provides a built-in "Router" that uses a cost-benefit analysis to determine if a task should be handled by a single large model or a swarm of smaller agents.&lt;/p&gt;

&lt;h3&gt;
  
  
  Diagram 2: Kiro Task Routing Flowchart
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F388orn49q66ci5y5i570.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F388orn49q66ci5y5i570.png" alt="Kiro Task Routing Flowchart" width="800" height="1151"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Practical Code Example: Implementing a Kiro-Enabled Agent
&lt;/h2&gt;

&lt;p&gt;To use AWS Kiro, developers typically use the AWS SDK (Boto3) with specific extensions for the Kiro runtime. Below is a Python example of how you would initialize a Kiro session and register agents that share a memory space.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;kiro_runtime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;KiroSession&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AgentNode&lt;/span&gt;

&lt;span class="c1"&gt;# Initialize the Kiro Client
&lt;/span&gt;&lt;span class="n"&gt;kiro&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;kiro&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 1. Create a Kiro Session with Shared Memory
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;setup_agentic_environment&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;kiro&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_session&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;SessionName&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MarketAnalysisSystem&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;MemoryType&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;high_performance&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;SharedContext&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;SessionArn&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# 2. Define an Agent Node
# This agent will live within the Kiro Fabric for low-latency access
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ResearchAgent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;AgentNode&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;session_arn&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;super&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session_arn&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;role&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Researcher&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Writing to Shared Memory is nearly instantaneous in Kiro
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write_shared_memory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;current_query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Tool call via Kiro's Micro-Enclave
&lt;/span&gt;        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;web_search&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;q&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write_shared_memory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;search_results&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Search completed.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;# 3. Orchestration
&lt;/span&gt;&lt;span class="n"&gt;session_arn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;setup_agentic_environment&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;researcher&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ResearchAgent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session_arn&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Execution within the fabric
&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;researcher&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Latest trends in AWS Kiro&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Agent Status: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Code Breakdown:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt; &lt;code&gt;kiro.create_session&lt;/code&gt;: This allocates a segment of the high-speed fabric specifically for your agents. The &lt;code&gt;SharedContext=True&lt;/code&gt; flag enables the GSMS, allowing all agents in this session to read/write to the same memory space at O(1) speeds.&lt;/li&gt;
&lt;li&gt; &lt;code&gt;AgentNode&lt;/code&gt;: This is a specialized class that inherits from Kiro’s runtime, providing methods like &lt;code&gt;write_shared_memory&lt;/code&gt; and &lt;code&gt;execute_tool&lt;/code&gt; which bypass the standard networking stack.&lt;/li&gt;
&lt;li&gt; &lt;code&gt;execute_tool&lt;/code&gt;: Instead of a standard API call, this triggers a micro-enclave execution within the same hardware cluster.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Agent Lifecycle in AWS Kiro
&lt;/h2&gt;

&lt;p&gt;Agents in Kiro are not just short-lived functions; they are stateful entities that transition through various statuses. Managing these transitions is vital for ensuring that agents don't hang or consume unnecessary resources.&lt;/p&gt;

&lt;h3&gt;
  
  
  Diagram 3: Kiro Agent State Machine
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3gbvg0k3jm33xwt744yj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3gbvg0k3jm33xwt744yj.png" alt="State Diagram" width="786" height="462"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This state machine ensures that agents are "Hibernated" when not in use. Unlike a Lambda function that shuts down, a Hibernated Kiro agent keeps its local cache in the fabric's memory, allowing it to "Wake-up" and resume work in milliseconds without re-loading the model context.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why AWS Kiro Matters for the Future
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Solving the "Thinking Time" Problem
&lt;/h3&gt;

&lt;p&gt;As LLMs move toward "Reasoning" models (like OpenAI's o1 series), the "thinking time" increases. However, the system overhead (networking, state management) shouldn't add to that. Kiro ensures that the only latency developers face is the actual inference time of the model.&lt;/p&gt;

&lt;h3&gt;
  
  
  Massive Parallelism
&lt;/h3&gt;

&lt;p&gt;In a complex supply chain agentic system, you might have 500 agents representing different vendors. AWS Kiro allows these 500 agents to coordinate in a single fabric. In a standard architecture, 500 agents would create a "thundering herd" problem for your database; in Kiro, the shared memory fabric handles the contention using hardware-level locking mechanisms.&lt;/p&gt;

&lt;h3&gt;
  
  
  Security and Governance
&lt;/h3&gt;

&lt;p&gt;When agents act on your behalf, security is paramount. Kiro’s micro-enclaves provide cryptographic isolation. Even if Agent A is compromised by a prompt injection, it cannot access the memory space of Agent B unless explicitly permitted by the Kiro Control Plane's IAM policies.&lt;/p&gt;




&lt;h2&gt;
  
  
  Implementation Strategy: Moving to Kiro
&lt;/h2&gt;

&lt;p&gt;If you are currently building agents using LangChain or AutoGPT on standard AWS infrastructure, the migration to Kiro involves three steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Context Migration&lt;/strong&gt;: Move your state storage from external databases (Redis/Dynamo) to Kiro Shared Memory.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Tool Refactoring&lt;/strong&gt;: Re-package your tools as Kiro-compatible Micro-Enclaves to take advantage of the kernel-integrated execution.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Topology Definition&lt;/strong&gt;: Instead of individual functions, define an "Agent Topology" that describes how agents are grouped within the Kiro fabric.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;AWS Kiro represents a significant leap forward for the AI ecosystem. By treating "Agency" as a first-class citizen of cloud infrastructure, AWS has removed the friction that previously made multi-agent systems slow and expensive. Whether you are building an autonomous coding assistant, a market research swarm, or a complex robotic process automation system, AWS Kiro provides the high-performance backbone required for true autonomy.&lt;/p&gt;

&lt;p&gt;As LLMs become more capable of reasoning, the infrastructure must become more capable of coordination. AWS Kiro is precisely the fabric that will hold these autonomous systems together, ensuring that the future of AI is not just intelligent, but also incredibly fast and scalable.&lt;/p&gt;




&lt;h2&gt;
  
  
  Further Reading &amp;amp; Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/ec2/nitro/" rel="noopener noreferrer"&gt;AWS Nitro System Official Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/bedrock/latest/userguide/agents.html" rel="noopener noreferrer"&gt;Amazon Bedrock Agents User Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.deeplearning.ai/the-batch/issue-242/" rel="noopener noreferrer"&gt;The Rise of Agentic Workflows by Andrew Ng&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/hpc/networking/" rel="noopener noreferrer"&gt;High Performance Networking on AWS&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://research.google/pubs/archive/44534/" rel="noopener noreferrer"&gt;Scalable Agentic AI Systems Architecture&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>aws</category>
      <category>kiro</category>
      <category>agents</category>
      <category>ai</category>
    </item>
    <item>
      <title>5 Ways Azure AI Search is Revolutionizing Enterprise RAG Architectures</title>
      <dc:creator>Jubin Soni</dc:creator>
      <pubDate>Sat, 25 Apr 2026 06:03:01 +0000</pubDate>
      <link>https://dev.to/jubinsoni/5-ways-azure-ai-search-is-revolutionizing-enterprise-rag-architectures-5d5e</link>
      <guid>https://dev.to/jubinsoni/5-ways-azure-ai-search-is-revolutionizing-enterprise-rag-architectures-5d5e</guid>
      <description>&lt;p&gt;In the rapidly evolving landscape of Generative AI, the transition from experimental Proof of Concepts (POCs) to production-grade applications is the most significant hurdle for enterprises today. At the heart of this transition lies Retrieval-Augmented Generation (RAG). While the "Generation" part—handled by Large Language Models (LLMs) like GPT-4—is often the focus, the quality of the "Retrieval" determines whether an AI application provides value or hallucinates incorrect information.&lt;/p&gt;

&lt;p&gt;Azure AI Search (formerly known as Azure Cognitive Search) has emerged as a powerhouse in this space. By moving beyond simple vector databases and offering a comprehensive information retrieval platform, it addresses the unique challenges of the enterprise: scale, security, and precision. In this article, we will deep-dive into the five key ways Azure AI Search is improving enterprise RAG, backed by technical architecture, code examples, and performance insights.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Advanced Hybrid Retrieval: Beyond Simple Vector Search
&lt;/h2&gt;

&lt;p&gt;Most basic RAG implementations rely solely on vector search (k-nearest neighbors). While vectors are excellent at capturing semantic meaning (e.g., understanding that "canine" and "dog" are related), they often fail at specific keyword matching, such as product serial numbers, obscure acronyms, or specific part codes. &lt;/p&gt;

&lt;p&gt;Azure AI Search solves this through &lt;strong&gt;Hybrid Retrieval&lt;/strong&gt;, which combines full-text search (BM25 algorithm) with vector search (HNSW algorithm) in a single query. The results are then fused using &lt;strong&gt;Reciprocal Rank Fusion (RRF)&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  How Reciprocal Rank Fusion (RRF) Works
&lt;/h3&gt;

&lt;p&gt;RRF is an algorithm that combines the multiple ranked lists (one from keyword search, one from vector search) into a single unified ranking. It doesn't require the scores from the different systems to be on the same scale. The formula for the RRF score is:&lt;/p&gt;

&lt;p&gt;Score = sum(1 / (k + rank_i))&lt;/p&gt;

&lt;p&gt;Where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;k&lt;/code&gt; is a constant (usually 60) that mitigates the impact of high-ranking results from a single source.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;rank_i&lt;/code&gt; is the position of the document in the i-th list.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Mermaid Flowchart: Hybrid Retrieval Logic
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffht073usq4l5cvtlnvxk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffht073usq4l5cvtlnvxk.png" alt="Flowchart Diagram" width="523" height="845"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Practical Implementation: Hybrid Query
&lt;/h3&gt;

&lt;p&gt;Using the Azure AI Search Python SDK, a hybrid query is constructed by providing both a vector and a text string.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;azure.search.documents&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SearchClient&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;azure.search.documents.models&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;VectorizedQuery&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;azure.core.credentials&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AzureKeyCredential&lt;/span&gt;

&lt;span class="c1"&gt;# Configuration
&lt;/span&gt;&lt;span class="n"&gt;endpoint&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://your-service-name.search.windows.net&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-api-key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;index_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;enterprise-docs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SearchClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;AzureKeyCredential&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# User input
&lt;/span&gt;&lt;span class="n"&gt;query_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is the warranty period for the X-1500 sensor?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;query_vector&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_embedding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Helper function to get embeddings
&lt;/span&gt;
&lt;span class="c1"&gt;# Perform Hybrid Search
&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;search_text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;query_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;vector_queries&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;VectorizedQuery&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;query_vector&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k_nearest_neighbors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fields&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content_vector&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)],&lt;/span&gt;
    &lt;span class="n"&gt;select&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;category&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;top&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Score: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;@search.score&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; - Title: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  2. The Power of Semantic Ranking (L3 Reranking)
&lt;/h2&gt;

&lt;p&gt;While Hybrid Search significantly improves recall, the enterprise often needs extreme precision. Azure AI Search integrates a "Semantic Ranker"—a technology derived from Bing’s core search engine. &lt;/p&gt;

&lt;h3&gt;
  
  
  The Reranking Hierarchy
&lt;/h3&gt;

&lt;p&gt;In a typical search flow, the system handles thousands of documents. To be efficient, it uses a tiered approach:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;L1 (Retrieval):&lt;/strong&gt; Fast filtering (Keyword/Vector) to get the top 1,000 documents.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;L2 (RRF):&lt;/strong&gt; Merging keyword and vector results.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;L3 (Semantic Ranking):&lt;/strong&gt; A cross-encoder model that looks at the actual meaning of the top 50 results and re-scores them based on context.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Unlike traditional bi-encoders used in vector search (which compute similarity between a query embedding and a document embedding), the Semantic Ranker uses a cross-encoder that processes the query and the document snippet &lt;em&gt;together&lt;/em&gt;. This allows it to capture nuances like negation and complex relationships that vector similarity might miss.&lt;/p&gt;

&lt;h3&gt;
  
  
  Comparison Table: Retrieval Strategies
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Strategy&lt;/th&gt;
&lt;th&gt;Pros&lt;/th&gt;
&lt;th&gt;Cons&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Keyword (BM25)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Fast, exact matches, low cost&lt;/td&gt;
&lt;td&gt;No semantic understanding&lt;/td&gt;
&lt;td&gt;Product IDs, codes, names&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Vector (HNSW)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Semantic nuance, multi-lingual&lt;/td&gt;
&lt;td&gt;"Cold start" issues, bad for jargon&lt;/td&gt;
&lt;td&gt;Concept-based questions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Hybrid (RRF)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Combines the best of both&lt;/td&gt;
&lt;td&gt;Higher latency than L1&lt;/td&gt;
&lt;td&gt;General purpose enterprise RAG&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Semantic Ranker&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Highest precision, handles nuance&lt;/td&gt;
&lt;td&gt;Highest latency/cost per query&lt;/td&gt;
&lt;td&gt;High-stakes decision support&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  3. Integrated Vectorization and Data Pipelines
&lt;/h2&gt;

&lt;p&gt;One of the biggest friction points in RAG is the "ETL for Embeddings" pipeline. Traditionally, developers had to write custom code to monitor data sources, chunk text, call embedding models, and push data to a vector store. &lt;/p&gt;

&lt;p&gt;Azure AI Search introduces &lt;strong&gt;Skillsets&lt;/strong&gt; and &lt;strong&gt;Indexers&lt;/strong&gt;, which automate this entire lifecycle. &lt;/p&gt;

&lt;h3&gt;
  
  
  The Integrated Pipeline Lifecycle
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;DataSource:&lt;/strong&gt; Connection to Blob Storage, SQL Server, or Cosmos DB.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Indexer:&lt;/strong&gt; A crawler that runs on a schedule.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Skillset:&lt;/strong&gt; A series of AI transformations. This can include:

&lt;ul&gt;
&lt;li&gt;  Document Cracking (extracting text from PDFs, Office docs).&lt;/li&gt;
&lt;li&gt;  Text Chunking (splitting text into manageable segments).&lt;/li&gt;
&lt;li&gt;  Azure OpenAI Embedding (converting chunks into vectors automatically).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Sequence Diagram: Integrated Indexing Flow
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnx2qaowv9hehnuyd9wam.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnx2qaowv9hehnuyd9wam.png" alt="Sequence Diagram" width="800" height="453"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Code Snippet: Defining an Integrated Vectorizer
&lt;/h3&gt;

&lt;p&gt;This JSON snippet represents how a vectorizer is defined within an index, allowing the search service to handle the embedding generation during both ingestion and query time.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"vectorizers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"my-openai-vectorizer"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"kind"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"azureOpenAI"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"azureOpenAIParameters"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"resourceUri"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://my-openai-resource.openai.azure.com"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"deploymentId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"text-embedding-3-small"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"apiKey"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;api-key&amp;gt;"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  4. Scaling Vector Search with HNSW and Disk-Based Indexing
&lt;/h2&gt;

&lt;p&gt;Enterprise data isn't just a few thousand documents; it’s often millions of records. Most vector databases struggle with the memory-to-cost ratio because they keep all vectors in RAM to ensure speed.&lt;/p&gt;

&lt;p&gt;Azure AI Search uses the &lt;strong&gt;Hierarchical Navigable Small World (HNSW)&lt;/strong&gt; algorithm for vector indexing. HNSW creates a multi-layered graph where the top layers contain fewer nodes (for fast navigation) and the bottom layers contain all nodes (for precision). &lt;/p&gt;

&lt;h3&gt;
  
  
  Optimization Parameters
&lt;/h3&gt;

&lt;p&gt;When configuring HNSW in Azure AI Search, two parameters are critical for performance tuning:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;m:&lt;/strong&gt; The number of bi-directional links created for every new element during construction. A higher &lt;code&gt;m&lt;/code&gt; improves recall but increases index size and memory usage.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;efConstruction:&lt;/strong&gt; The number of nearest neighbors explored during index building. Increasing this improves the quality of the graph but increases indexing time.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;efSearch:&lt;/strong&gt; The number of nearest neighbors searched during a query. Increasing this improves recall at the cost of latency.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Azure AI Search has also introduced &lt;strong&gt;filtered vector search&lt;/strong&gt;. In an enterprise context, you rarely want to search the &lt;em&gt;entire&lt;/em&gt; index. You might want to search only "Documents from Department A created in 2023." Azure AI Search optimizes this by applying filters &lt;em&gt;during&lt;/em&gt; the vector navigation, rather than post-filtering, which significantly reduces the search space and improves latency.&lt;/p&gt;

&lt;h3&gt;
  
  
  Complexity Analysis
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Vector Search (HNSW):&lt;/strong&gt; O(log n) average search time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Full-Text Search:&lt;/strong&gt; O(n) in worst case, but optimized with inverted indices.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Storage:&lt;/strong&gt; Azure AI Search can utilize disk-based storage for vectors, significantly lowering the Total Cost of Ownership (TCO) compared to purely in-memory databases.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  5. Enterprise-Grade Security and Governance
&lt;/h2&gt;

&lt;p&gt;For a RAG system to be production-ready in a regulated industry, it cannot be a "black box." It must adhere to strict security protocols. Azure AI Search integrates natively with the broader Microsoft security stack in three major ways:&lt;/p&gt;

&lt;h3&gt;
  
  
  A. Virtual Network (VNET) and Private Link
&lt;/h3&gt;

&lt;p&gt;Most vector databases are accessed over the public internet. Azure AI Search supports &lt;strong&gt;Private Endpoints&lt;/strong&gt;, ensuring that your data traffic never leaves the Microsoft backbone network. This is a non-negotiable requirement for many financial and healthcare institutions.&lt;/p&gt;

&lt;h3&gt;
  
  
  B. Role-Based Access Control (RBAC)
&lt;/h3&gt;

&lt;p&gt;Azure AI Search supports fine-grained RBAC. You can grant an application the right to query an index without giving it the right to delete data or view service keys. Furthermore, it supports &lt;strong&gt;User-Contextual Filtering&lt;/strong&gt;. If a user doesn't have permission to see "Document A" in SharePoint, the RAG system can use their identity token to filter "Document A" out of the search results automatically.&lt;/p&gt;

&lt;h3&gt;
  
  
  C. Integration with Microsoft Purview
&lt;/h3&gt;

&lt;p&gt;Data lineage is critical. By integrating with Microsoft Purview, enterprises can track how sensitive data (PII) flows from a data source into an index and eventually into an LLM response. This provides a layer of governance that is often missing in custom-built RAG stacks.&lt;/p&gt;




&lt;h2&gt;
  
  
  Putting It All Together: The Production RAG Architecture
&lt;/h2&gt;

&lt;p&gt;When we combine these five improvements, the architecture of an enterprise RAG system transforms from a fragile script into a robust platform. &lt;/p&gt;

&lt;h3&gt;
  
  
  The End-to-End Workflow
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Ingestion:&lt;/strong&gt; An Indexer pulls data from Azure SQL and Blob Storage. It uses a Skillset to chunk the text and call Azure OpenAI for embeddings. These are stored in an index with HNSW enabled.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Query:&lt;/strong&gt; A user asks a question via a web app. The web app calls Azure AI Search with a hybrid query (text + vector).&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Refinement:&lt;/strong&gt; Azure AI Search performs the hybrid search, applies security filters based on the user's ID, and uses the Semantic Ranker to find the top 5 most relevant chunks.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Generation:&lt;/strong&gt; These 5 chunks are sent to the LLM as context. Because the retrieval was so precise, the LLM provides a concise, accurate answer with minimal hallucination risk.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Sample Production-Ready Index Definition
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"enterprise-index"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"fields"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Edm.String"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"key"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Edm.String"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"searchable"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"content_vector"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Collection(Edm.Single)"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"searchable"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"retrievable"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"dimensions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1536&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"vectorSearchProfile"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"my-hsnw-profile"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"metadata_auth_group"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Edm.String"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"filterable"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"vectorSearch"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"algorithms"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"my-hsnw-config"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"kind"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"hnsw"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"hnswParameters"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"m"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"efConstruction"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;400&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"metric"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"cosine"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"profiles"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"my-hsnw-profile"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"algorithm"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"my-hsnw-config"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"vectorizer"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"my-openai-vectorizer"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"semantic"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"configurations"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"my-semantic-config"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"prioritizedFields"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"contentFields"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="nl"&gt;"fieldName"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Improving RAG at the enterprise level is not about finding a larger LLM; it is about building a better retrieval system. Azure AI Search provides the necessary tools—Hybrid Search, Semantic Ranking, Integrated Data Pipelines, Scalable Vector Indexing, and Enterprise Security—to bridge the gap between a demo and a mission-critical application.&lt;/p&gt;

&lt;p&gt;By leveraging the platform's ability to handle both unstructured text and high-dimensional vectors, while maintaining strict security boundaries, developers can build AI assistants that are not only smart but also reliable and safe for the corporate environment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further Reading &amp;amp; Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://learn.microsoft.com/en-us/azure/search/search-what-is-azure-search" rel="noopener noreferrer"&gt;Azure AI Search Official Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://techcommunity.microsoft.com/t5/ai-azure-ai-services-blog/azure-ai-search-outperforming-vector-search-with-hybrid/ba-p/3929167" rel="noopener noreferrer"&gt;Outperforming standard RAG with Hybrid Search and Semantic Ranking&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://learn.microsoft.com/en-us/azure/search/hybrid-search-ranking" rel="noopener noreferrer"&gt;Reciprocal Rank Fusion (RRF) Explained&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/1603.09320" rel="noopener noreferrer"&gt;Efficient and Robust Approximate Nearest Neighbor Search using HNSW&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Azure/azure-sdk-for-python/tree/main/sdk/search/azure-search-documents/samples" rel="noopener noreferrer"&gt;Azure AI Search Python SDK Samples&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;Connect with me: &lt;a href="https://linkedin.com/in/jubinsoni" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; | &lt;a href="https://twitter.com/sonijubin" rel="noopener noreferrer"&gt;Twitter/X&lt;/a&gt; | &lt;a href="https://github.com/jubins" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; | &lt;a href="https://jubinsoni.com" rel="noopener noreferrer"&gt;Website&lt;/a&gt;&lt;/p&gt;

</description>
      <category>azure</category>
      <category>rag</category>
      <category>vectorsearch</category>
      <category>generativeai</category>
    </item>
    <item>
      <title>S3 Vectors: How to build a RAG without a vector database</title>
      <dc:creator>Jubin Soni</dc:creator>
      <pubDate>Tue, 14 Apr 2026 19:37:23 +0000</pubDate>
      <link>https://dev.to/jubinsoni/s3-vectors-how-to-build-a-rag-without-a-vector-database-18i9</link>
      <guid>https://dev.to/jubinsoni/s3-vectors-how-to-build-a-rag-without-a-vector-database-18i9</guid>
      <description>&lt;p&gt;Every RAG tutorial follows the same script: embed your documents, spin up a vector database (Pinecone, Weaviate, pgvector, OpenSearch), manage its infrastructure, and pray the costs don't spiral. For most internal AI apps, this is overkill.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Amazon S3 Vectors&lt;/strong&gt; changes the equation. It's native vector storage built into S3 — no clusters, no provisioning, no idle compute. You store vectors like you store objects, query them with sub-100ms latency, and pay per use. It went GA in December 2025 and now supports 2 billion vectors per index across 31+ AWS regions.&lt;/p&gt;

&lt;p&gt;This post walks through building a complete RAG pipeline using only S3 Vectors and Amazon Bedrock. No external vector database. ~50 lines of Python.&lt;/p&gt;




&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwjkwrmfrk709ap4e4m6v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwjkwrmfrk709ap4e4m6v.png" alt="Architecture description" width="800" height="337"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Three phases, two AWS services, zero infrastructure.&lt;/p&gt;




&lt;h2&gt;
  
  
  S3 Vectors vs Traditional Vector Databases
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;S3 Vectors&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;
&lt;strong&gt;Managed Vector DB&lt;/strong&gt; (e.g. OpenSearch, Pinecone)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Infrastructure&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;None — fully serverless&lt;/td&gt;
&lt;td&gt;Clusters, shards, replicas&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Scale&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2B vectors/index, 10K indexes/bucket&lt;/td&gt;
&lt;td&gt;Varies, often requires re-sharding&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Query latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~100ms (frequent), &amp;lt;1s (infrequent)&lt;/td&gt;
&lt;td&gt;~10-50ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cost model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Pay per PUT + storage + query&lt;/td&gt;
&lt;td&gt;Hourly/monthly compute + storage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cost at scale&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Up to 90% cheaper&lt;/td&gt;
&lt;td&gt;Idle compute adds up fast&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Metadata filtering&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Up to 50 keys, filterable by default&lt;/td&gt;
&lt;td&gt;Full query language&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Best for&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;RAG, agent memory, semantic search&lt;/td&gt;
&lt;td&gt;High-QPS production search, hybrid search&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;The tradeoff is clear:&lt;/strong&gt; S3 Vectors trades single-digit-ms latency for zero ops and dramatically lower cost. For internal RAG apps, agent memory, and moderate-QPS workloads, it's the better choice.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 1: Set Up S3 Vectors
&lt;/h2&gt;

&lt;p&gt;Create a vector bucket and index. You can do this in the console or via CLI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create a vector bucket&lt;/span&gt;
aws s3vectors create-vector-bucket &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--vector-bucket-name&lt;/span&gt; my-rag-bucket

&lt;span class="c"&gt;# Create a vector index (1024 dims for Titan Embeddings V2)&lt;/span&gt;
aws s3vectors create-vector-index &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--vector-bucket-name&lt;/span&gt; my-rag-bucket &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--index-name&lt;/span&gt; my-rag-index &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--dimension&lt;/span&gt; 1024 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--distance-metric&lt;/span&gt; cosine
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's your "database" — done in two commands.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 2: Ingest Documents
&lt;/h2&gt;

&lt;p&gt;Here's the ingestion pipeline. We chunk text, embed each chunk with Titan Embeddings V2, and store vectors with metadata:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;uuid&lt;/span&gt;

&lt;span class="n"&gt;bedrock&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bedrock-runtime&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;region_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us-west-2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;s3vectors&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3vectors&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;region_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us-west-2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;BUCKET&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my-rag-bucket&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;INDEX&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my-rag-index&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;embed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Generate embeddings using Titan Text Embeddings V2.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bedrock&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;modelId&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amazon.titan-embed-text-v2:0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inputText&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;}),&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;())[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;embedding&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;chunk_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Split text into overlapping chunks.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;words&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;words&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;chunk_size&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;words&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;ingest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;doc_text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Chunk, embed, and store a document.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;chunk_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;doc_text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;vectors&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;vectors&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;::chunk-&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;float32&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;embed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;)},&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;metadata&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chunk_index&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# store original text for retrieval
&lt;/span&gt;            &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="c1"&gt;# PutVectors supports batches
&lt;/span&gt;    &lt;span class="n"&gt;s3vectors&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;put_vectors&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;vectorBucketName&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;BUCKET&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;indexName&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;INDEX&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;vectors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;vectors&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Ingested &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vectors&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; chunks from &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Usage:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;internal-docs.txt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;ingest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;internal-docs.txt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Step 3: Query + Generate
&lt;/h2&gt;

&lt;p&gt;Now the RAG loop — embed the question, find similar chunks, and feed them to Claude:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;rag_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Full RAG pipeline: retrieve + generate.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="c1"&gt;# 1. Embed the question
&lt;/span&gt;    &lt;span class="n"&gt;query_vector&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;embed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# 2. Find similar chunks
&lt;/span&gt;    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;s3vectors&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query_vectors&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;vectorBucketName&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;BUCKET&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;indexName&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;INDEX&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;topK&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;queryVector&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;float32&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;query_vector&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="n"&gt;returnMetadata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;returnDistance&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# 3. Build context from retrieved chunks
&lt;/span&gt;    &lt;span class="n"&gt;context_parts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vectors&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;metadata&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;source&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;metadata&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;dist&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;distance&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;context_parts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[Source: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, Distance: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;dist&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;]&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;---&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context_parts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# 4. Generate answer with Claude
&lt;/span&gt;    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Answer the question based on the provided context. 
If the context doesn&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t contain enough information, say so.

## Context
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

## Question
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

## Answer&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bedrock&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;modelId&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us.anthropic.claude-sonnet-4-20250514&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic_version&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bedrock-2023-05-31&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
        &lt;span class="p"&gt;}),&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Usage:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;answer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;rag_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is our refund policy for enterprise customers?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's the entire RAG pipeline — &lt;strong&gt;~50 lines of actual logic&lt;/strong&gt;, no infrastructure.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 4: Metadata Filtering
&lt;/h2&gt;

&lt;p&gt;S3 Vectors supports filtering by metadata during queries. This is powerful for multi-tenant or multi-source RAG:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Only search chunks from a specific document
&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;s3vectors&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query_vectors&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;vectorBucketName&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;BUCKET&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;indexName&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;INDEX&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;topK&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;queryVector&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;float32&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;query_vector&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;returnMetadata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nb"&gt;filter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;eq&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;refund-policy.pdf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Filter operators include &lt;code&gt;eq&lt;/code&gt;, &lt;code&gt;ne&lt;/code&gt;, &lt;code&gt;gt&lt;/code&gt;, &lt;code&gt;gte&lt;/code&gt;, &lt;code&gt;lt&lt;/code&gt;, &lt;code&gt;lte&lt;/code&gt;, &lt;code&gt;in&lt;/code&gt;, &lt;code&gt;beginsWith&lt;/code&gt;, and logical &lt;code&gt;and&lt;/code&gt;/&lt;code&gt;or&lt;/code&gt; combinators.&lt;/p&gt;




&lt;h2&gt;
  
  
  Data Flow
&lt;/h2&gt;

&lt;p&gt;Here's how a query flows through the system end to end:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyb3anyrmlfk93awekhby.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyb3anyrmlfk93awekhby.png" alt="DF" width="800" height="463"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  When to Use S3 Vectors (and When Not To)
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Falkcou0y48vnz37fueey.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Falkcou0y48vnz37fueey.png" alt="S3 Vectors DT" width="800" height="1139"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use S3 Vectors when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You're building internal RAG apps, agent memory, or semantic search&lt;/li&gt;
&lt;li&gt;Query volume is moderate (not thousands of QPS)&lt;/li&gt;
&lt;li&gt;You want zero infrastructure management&lt;/li&gt;
&lt;li&gt;Cost matters more than single-digit-ms latency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use a dedicated vector DB when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You need &amp;lt;10ms query latency consistently&lt;/li&gt;
&lt;li&gt;You need hybrid search (keyword + semantic)&lt;/li&gt;
&lt;li&gt;Your QPS is in the hundreds or thousands&lt;/li&gt;
&lt;li&gt;You need advanced features like aggregations or faceted search&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use both (tiered):&lt;/strong&gt; S3 Vectors as cheap, durable storage + OpenSearch for hot queries. AWS supports this integration natively.&lt;/p&gt;




&lt;h2&gt;
  
  
  Integrating with Bedrock Knowledge Bases
&lt;/h2&gt;

&lt;p&gt;If you don't want to write the chunking and embedding code yourself, Bedrock Knowledge Bases can use S3 Vectors as its vector store directly:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6alt9s0jeajtg83h1922.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6alt9s0jeajtg83h1922.png" alt="Bedrock Knowledge Bases" width="800" height="228"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Just select "S3 Vectors" as the vector store when creating your Knowledge Base. Bedrock handles chunking, embedding, and storage automatically.&lt;/p&gt;




&lt;h2&gt;
  
  
  Cleanup
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Delete the vector index&lt;/span&gt;
aws s3vectors delete-vector-index &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--vector-bucket-name&lt;/span&gt; my-rag-bucket &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--index-name&lt;/span&gt; my-rag-index

&lt;span class="c"&gt;# Delete the vector bucket&lt;/span&gt;
aws s3vectors delete-vector-bucket &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--vector-bucket-name&lt;/span&gt; my-rag-bucket
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/s3/features/vectors/" rel="noopener noreferrer"&gt;Amazon S3 Vectors — Product Page&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-vectors-getting-started.html" rel="noopener noreferrer"&gt;S3 Vectors Getting Started Tutorial&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-vectors.html" rel="noopener noreferrer"&gt;S3 Vectors User Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3vectors.html" rel="noopener noreferrer"&gt;Boto3 S3Vectors API Reference&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/blogs/aws/amazon-s3-vectors-now-generally-available-with-increased-scale-and-performance/" rel="noopener noreferrer"&gt;S3 Vectors GA Announcement — AWS Blog&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/awslabs/s3-vectors-embed-cli" rel="noopener noreferrer"&gt;S3 Vectors Embed CLI (GitHub)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/blogs/storage/building-self-managed-rag-applications-with-amazon-eks-and-amazon-s3-vectors/" rel="noopener noreferrer"&gt;Building RAG with EKS and S3 Vectors — AWS Blog&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>rag</category>
      <category>s3</category>
      <category>vectordatabase</category>
      <category>aws</category>
    </item>
    <item>
      <title>Mastering Gemma 4: A Comprehensive Deep Dive into Google's Next-Generation Open Model Architecture and Deployment</title>
      <dc:creator>Jubin Soni</dc:creator>
      <pubDate>Tue, 14 Apr 2026 17:53:14 +0000</pubDate>
      <link>https://dev.to/jubinsoni/mastering-gemma-4-a-comprehensive-deep-dive-into-googles-next-generation-open-model-architecture-2f91</link>
      <guid>https://dev.to/jubinsoni/mastering-gemma-4-a-comprehensive-deep-dive-into-googles-next-generation-open-model-architecture-2f91</guid>
      <description>&lt;p&gt;The landscape of Large Language Models (LLMs) has shifted dramatically from monolithic, proprietary APIs toward highly efficient, open-weight models that developers can run on commodity hardware. Google’s Gemma series has been at the forefront of this movement. With the release of Gemma 4, the industry sees a significant leap in performance-per-parameter, driven by advanced distillation techniques and architectural refinements that challenge models twice its size.&lt;/p&gt;

&lt;p&gt;In this deep dive, we will explore the technical underpinnings of Gemma 4, its unique training methodology, and practical strategies for integrating it into your production environment.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. The Evolution of Gemma: From 1.0 to 4.0
&lt;/h2&gt;

&lt;p&gt;Gemma 4 represents a synthesis of Google’s Gemini technology tailored for the open-source community. Unlike previous iterations that focused primarily on raw scale, Gemma 4 emphasizes "density of intelligence." By leveraging the same research and technology used in Gemini 1.5 Pro, Gemma 4 achieves state-of-the-art results in reasoning, coding, and multilingual understanding.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Architectural Pillars
&lt;/h3&gt;

&lt;p&gt;Gemma 4 is built upon a standard transformer decoder architecture but introduces several critical modifications:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Multi-Query Attention (MQA) and Grouped-Query Attention (GQA):&lt;/strong&gt; Optimized for memory efficiency and faster inference.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Sliding Window Attention (SWA):&lt;/strong&gt; Allows the model to handle longer contexts by focusing on local segments of the sequence while maintaining global coherence through layer-stacking.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Logit Soft-Capping:&lt;/strong&gt; Prevents logits from becoming too large, which stabilizes training and improves the effectiveness of distillation.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;RMSNorm and RoPE:&lt;/strong&gt; Utilizes Root Mean Square Layer Normalization and Rotary Positional Embeddings for improved numerical stability and better handling of sequence positioning.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  2. Theoretical Foundations: The Power of Knowledge Distillation
&lt;/h2&gt;

&lt;p&gt;The defining characteristic of Gemma 4 is its reliance on Knowledge Distillation. Instead of training the model from scratch on raw web data alone, Google uses a larger, more capable "Teacher" model (from the Gemini family) to guide the training of the "Student" Gemma model.&lt;/p&gt;

&lt;h3&gt;
  
  
  How Distillation Works in Gemma 4
&lt;/h3&gt;

&lt;p&gt;In a standard training setup, a model minimizes the cross-entropy loss between its predictions and the ground-truth tokens. In Gemma 4's distillation process, the student model also attempts to match the probability distribution (the logits) of the teacher model. This allows the smaller model to learn the nuances, uncertainties, and structural reasoning patterns of the larger model.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvmpq8m95ijhdrovmeif8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvmpq8m95ijhdrovmeif8.png" alt="Flowchart Diagram" width="518" height="716"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;By optimizing for both ground truth and teacher distributions, Gemma 4 captures complex logical jumps that are usually only present in models with hundreds of billions of parameters.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Comparative Analysis: Gemma 4 vs. The Industry
&lt;/h2&gt;

&lt;p&gt;To understand where Gemma 4 sits in the current ecosystem, we must compare it against its primary competitors: Meta’s Llama series and Mistral AI’s offerings. The following table highlights the architectural and performance differences between current industry leaders in the 7B-27B parameter range.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Gemma 4 (27B)&lt;/th&gt;
&lt;th&gt;Llama 3.1 (70B)&lt;/th&gt;
&lt;th&gt;Mistral Large 2&lt;/th&gt;
&lt;th&gt;Gemma 4 (9B)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Base Architecture&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Decoder-only Transformer&lt;/td&gt;
&lt;td&gt;Decoder-only Transformer&lt;/td&gt;
&lt;td&gt;MoE (Mixture of Experts)&lt;/td&gt;
&lt;td&gt;Decoder-only Transformer&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Attention Mech&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;GQA + Sliding Window&lt;/td&gt;
&lt;td&gt;Grouped-Query Attention&lt;/td&gt;
&lt;td&gt;Sliding Window&lt;/td&gt;
&lt;td&gt;Multi-Query Attention&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Context Window&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;128k Tokens&lt;/td&gt;
&lt;td&gt;128k Tokens&lt;/td&gt;
&lt;td&gt;128k Tokens&lt;/td&gt;
&lt;td&gt;32k Tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Training Method&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Distillation-heavy&lt;/td&gt;
&lt;td&gt;Direct Pre-training&lt;/td&gt;
&lt;td&gt;Direct Pre-training&lt;/td&gt;
&lt;td&gt;Distillation-heavy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Logit Capping&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes (Soft-capping)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes (Soft-capping)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;License&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Gemma Terms of Use&lt;/td&gt;
&lt;td&gt;Llama 3 Community&lt;/td&gt;
&lt;td&gt;Mistral Research&lt;/td&gt;
&lt;td&gt;Gemma Terms of Use&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  4. Deep Dive into Implementation: Getting Started
&lt;/h2&gt;

&lt;p&gt;Setting up Gemma 4 requires a Python environment with modern libraries. We will use the &lt;code&gt;transformers&lt;/code&gt; library by Hugging Face along with &lt;code&gt;accelerate&lt;/code&gt; for efficient memory management.&lt;/p&gt;

&lt;h3&gt;
  
  
  Environment Setup
&lt;/h3&gt;

&lt;p&gt;First, ensure you have the latest versions of the required packages:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-U&lt;/span&gt; transformers accelerate bitsandbytes torch
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Basic Inference with Gemma 4
&lt;/h3&gt;

&lt;p&gt;The following script demonstrates how to load the Gemma 4 9B model in 4-bit quantization to save VRAM while maintaining performance.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AutoTokenizer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AutoModelForCausalLM&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;BitsAndBytesConfig&lt;/span&gt;

&lt;span class="c1"&gt;# Configure 4-bit quantization
&lt;/span&gt;&lt;span class="n"&gt;bnb_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BitsAndBytesConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;load_in_4bit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;bnb_4bit_quant_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;nf4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;bnb_4bit_compute_dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bfloat16&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;model_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;google/gemma-4-9b-it&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;tokenizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoTokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoModelForCausalLM&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;quantization_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;bnb_config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;device_map&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;auto&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Prepare the prompt using the chat template
&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Explain the concept of quantum entanglement using a cat analogy.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;input_ids&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;apply_chat_template&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;add_generation_prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;return_tensors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;outputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;input_ids&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;max_new_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;do_sample&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.7&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;input_ids&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]:],&lt;/span&gt; &lt;span class="n"&gt;skip_special_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Gemma 4 Response:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Explanation of the Code
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;BitsAndBytesConfig&lt;/strong&gt;: We use NormalFloat 4 (nf4) quantization. This allows the 9B model, which would normally require ~18GB of VRAM, to fit into roughly 5-6GB, making it accessible for consumer GPUs like the RTX 3060.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;device_map="auto"&lt;/strong&gt;: This automatically handles the distribution of model layers across available GPUs and CPUs.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;apply_chat_template&lt;/strong&gt;: Gemma 4 uses specific control tokens (like &lt;code&gt;&amp;lt;start_of_turn&amp;gt;&lt;/code&gt;) to distinguish between user and assistant roles. Using the built-in template ensures the model receives the prompt in the exact format it was trained on.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  5. Sequence Flows in Gemma 4 Applications
&lt;/h2&gt;

&lt;p&gt;When deploying Gemma 4 in a Retrieval-Augmented Generation (RAG) pipeline, the interaction between the orchestrator, the vector database, and the model follows a specific sequence. Understanding this flow is vital for optimizing latency.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftydq05s14h2m1wia1l8a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftydq05s14h2m1wia1l8a.png" alt="Sequence Diagram" width="800" height="371"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  6. Advanced Optimization: Logit Soft-Capping and Stability
&lt;/h2&gt;

&lt;p&gt;A technical nuance in Gemma 4 is the implementation of &lt;strong&gt;Logit Soft-Capping&lt;/strong&gt;. During the generation process, the raw output of the last layer (logits) can sometimes reach extreme values, leading to "peaky" probability distributions where the model becomes overconfident or starts repeating itself.&lt;/p&gt;

&lt;p&gt;Gemma 4 applies a function to constrain these values:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;logit = capacity * tanh(logit / capacity)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Where the capacity is typically set around 30.0 for the attention layers and 50.0 for the final layer. This ensures that no single token dominates the distribution too early, leading to more creative and stable outputs during long-form generation.&lt;/p&gt;

&lt;h2&gt;
  
  
  7. Efficient Fine-Tuning with PEFT and LoRA
&lt;/h2&gt;

&lt;p&gt;To adapt Gemma 4 to specific domains (e.g., medical, legal, or proprietary codebases), Parameter-Efficient Fine-Tuning (PEFT) using Low-Rank Adaptation (LoRA) is the recommended approach. This method keeps the base model weights frozen and only trains a small set of adapter layers.&lt;/p&gt;

&lt;h3&gt;
  
  
  Practical LoRA Configuration
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;peft&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LoraConfig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;get_peft_model&lt;/span&gt;

&lt;span class="n"&gt;lora_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LoraConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;lora_alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;target_modules&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;q_proj&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;o_proj&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;k_proj&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;v_proj&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gate_proj&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;up_proj&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;down_proj&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;lora_dropout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.05&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;bias&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;none&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;task_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CAUSAL_LM&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_peft_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lora_config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;print_trainable_parameters&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By targeting all linear layers (including the MLP/gate modules), we ensure that the model can learn the specific linguistic nuances of the new domain without suffering from catastrophic forgetting.&lt;/p&gt;

&lt;h2&gt;
  
  
  8. The Gemma 4 Ecosystem Mindmap
&lt;/h2&gt;

&lt;p&gt;Navigating the tools and frameworks available for Gemma 4 can be overwhelming. The following mindmap categorizes the ecosystem into four primary domains: Inference, Fine-Tuning, Deployment, and Evaluation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx616yv2o76pm3q6qnpce.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx616yv2o76pm3q6qnpce.png" alt="Diagram" width="800" height="363"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  9. Handling the 128k Context Window
&lt;/h2&gt;

&lt;p&gt;One of the most significant upgrades in Gemma 4 is the massive 128k token context window. However, processing 128k tokens is computationally expensive. Gemma 4 manages this through &lt;strong&gt;Sliding Window Attention (SWA)&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In SWA, each layer does not attend to all previous tokens. Instead, it attends to a fixed-size "window" of recent tokens. Because these layers are stacked, layer N can effectively "see" information from further back via the intermediate representations of layer N-1. This reduces the computational complexity from O(n^2) to O(n * w), where w is the window size.&lt;/p&gt;

&lt;h3&gt;
  
  
  Deployment Considerations for Long Context
&lt;/h3&gt;

&lt;p&gt;When utilizing the full 128k window, memory consumption for the KV (Key-Value) cache becomes the bottleneck. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;KV Cache Quantization:&lt;/strong&gt; Storing the KV cache in 8-bit or 4-bit can reduce memory usage by 50-75%.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Paged Attention:&lt;/strong&gt; Using frameworks like vLLM allows for dynamic memory allocation, preventing fragmentation when handling multiple long-context requests simultaneously.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  10. Benchmarking and Performance Metrics
&lt;/h2&gt;

&lt;p&gt;Internal testing shows that Gemma 4 excels in "Reasoning Density." This refers to the model's ability to solve complex mathematical and logical problems relative to its parameter count. In the MMLU (Massive Multitask Language Understanding) benchmark, the 27B variant of Gemma 4 outperforms several 70B+ models, proving that quality of training data and distillation are more important than sheer scale.&lt;/p&gt;

&lt;h3&gt;
  
  
  Performance Comparison Table
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Benchmark&lt;/th&gt;
&lt;th&gt;Gemma 4 (27B)&lt;/th&gt;
&lt;th&gt;Llama 3.1 (70B)&lt;/th&gt;
&lt;th&gt;Gemma 4 (9B)&lt;/th&gt;
&lt;th&gt;GPT-4o (Reference)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MMLU&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;78.2%&lt;/td&gt;
&lt;td&gt;79.9%&lt;/td&gt;
&lt;td&gt;71.3%&lt;/td&gt;
&lt;td&gt;88.7%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GSM8K (Math)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;82.1%&lt;/td&gt;
&lt;td&gt;82.5%&lt;/td&gt;
&lt;td&gt;74.0%&lt;/td&gt;
&lt;td&gt;94.2%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;HumanEval (Code)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;68.5%&lt;/td&gt;
&lt;td&gt;67.2%&lt;/td&gt;
&lt;td&gt;55.4%&lt;/td&gt;
&lt;td&gt;86.6%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MBPP&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;72.0%&lt;/td&gt;
&lt;td&gt;70.1%&lt;/td&gt;
&lt;td&gt;62.1%&lt;/td&gt;
&lt;td&gt;84.1%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  11. Ethical Considerations and Safety
&lt;/h2&gt;

&lt;p&gt;Google has integrated a robust safety framework into Gemma 4. This includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data Filtering:&lt;/strong&gt; Rigorous removal of personally identifiable information (PII) and harmful content from the pre-training set.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reinforcement Learning from Human Feedback (RLHF):&lt;/strong&gt; Tuning the model to follow instructions while refusing harmful requests.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Red Teaming:&lt;/strong&gt; Extensive testing against adversarial attacks to ensure the model remains helpful yet harmless.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Developers are encouraged to use the &lt;strong&gt;Responsible AI Toolkit&lt;/strong&gt; provided by Google to audit their fine-tuned versions of Gemma 4 before deployment.&lt;/p&gt;

&lt;h2&gt;
  
  
  12. Conclusion
&lt;/h2&gt;

&lt;p&gt;Gemma 4 marks a turning point in the accessibility of high-performance AI. By successfully distilling the intelligence of a frontier model like Gemini into an open-weight format, Google has provided developers with a tool that is both powerful enough for complex reasoning and efficient enough for local deployment. Whether you are building a sophisticated RAG system, a specialized coding assistant, or an edge-based application, Gemma 4 provides the architectural flexibility and performance density required for the next generation of AI applications.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further Reading &amp;amp; Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://ai.google.dev/gemma" rel="noopener noreferrer"&gt;Google DeepMind Gemma Repository&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://huggingface.co/google/gemma-4-9b-it" rel="noopener noreferrer"&gt;Hugging Face Gemma 4 Model Card&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/1706.03762" rel="noopener noreferrer"&gt;Attention Is All You Need Technical Paper&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/1503.02531" rel="noopener noreferrer"&gt;Knowledge Distillation and the Teacher-Student Paradigm&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2106.09685" rel="noopener noreferrer"&gt;LoRA: Low-Rank Adaptation of Large Language Models&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;Connect with me: &lt;a href="https://linkedin.com/in/jubinsoni" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; | &lt;a href="https://twitter.com/sonijubin" rel="noopener noreferrer"&gt;Twitter/X&lt;/a&gt; | &lt;a href="https://github.com/jubins" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; | &lt;a href="https://jubinsoni.com" rel="noopener noreferrer"&gt;Website&lt;/a&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>ai</category>
      <category>gemma</category>
      <category>python</category>
    </item>
  </channel>
</rss>
