<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Jubin Soni</title>
    <description>The latest articles on DEV Community by Jubin Soni (@jubinsoni).</description>
    <link>https://dev.to/jubinsoni</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3304475%2F69e594af-a39b-4e01-81fd-ebd67b67de37.jpeg</url>
      <title>DEV Community: Jubin Soni</title>
      <link>https://dev.to/jubinsoni</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/jubinsoni"/>
    <language>en</language>
    <item>
      <title>Build a Git Commit Analyzer with Gemma 4 31B and a 256K Context Window</title>
      <dc:creator>Jubin Soni</dc:creator>
      <pubDate>Fri, 08 May 2026 17:41:18 +0000</pubDate>
      <link>https://dev.to/jubinsoni/build-a-git-commit-analyzer-with-gemma-4-31b-and-a-256k-context-window-2ddh</link>
      <guid>https://dev.to/jubinsoni/build-a-git-commit-analyzer-with-gemma-4-31b-and-a-256k-context-window-2ddh</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a submission for the &lt;a href="https://dev.to/challenges/google-gemma-2026-05-06"&gt;Gemma 4 Challenge: Build with Gemma 4&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Most developers reach for an LLM when they need code completion or a chatbot. This article is about something more useful and less obvious: feeding your entire sprint's git history to &lt;strong&gt;Gemma 4 31B&lt;/strong&gt; — diffs, commit messages, authors and all — and getting back structured, actionable analysis of what actually changed and why it might matter.&lt;/p&gt;

&lt;p&gt;The 31B Dense model's &lt;strong&gt;256K context window&lt;/strong&gt; is the key enabler here. It means you can pass tens of thousands of lines of patch output in a single prompt and ask the model to reason across the whole thing — not chunk-and-summarize, but genuinely cross-reference commits, spot patterns, and flag risk. That's a qualitatively different capability from what a smaller model or an older Gemma generation could provide.&lt;/p&gt;

&lt;p&gt;By the end of this guide you'll have a working Python CLI tool that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Shells out to &lt;code&gt;git log --patch&lt;/code&gt; to collect a commit range&lt;/li&gt;
&lt;li&gt;Sends the full diff to Gemma 4 31B via the Gemini API (free tier in Google AI Studio)&lt;/li&gt;
&lt;li&gt;Returns a structured JSON report with change summaries, risk flags, and a draft changelog&lt;/li&gt;
&lt;li&gt;Optionally writes a Markdown changelog file&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;Why Gemma 4 31B Is the Right Model for This&lt;/h2&gt;

&lt;p&gt;Three specific properties make the 31B Dense the correct pick here — not the 26B MoE, not the edge models.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;256K context window.&lt;/strong&gt; A week's worth of commits on a mid-size codebase generates 20,000–80,000 tokens of patch text. The 31B handles that in a single pass. Chunking and summarizing separately loses cross-commit signal: the model can't notice that a refactor in commit 3 introduced the same variable name collision that commit 7 later fixed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Maximum quality per query.&lt;/strong&gt; The 31B Dense is the highest-accuracy model in the Gemma 4 family. For code analysis you care about precision — a false positive risk flag wastes a senior engineer's time, and a false negative ships a bug. You're making one expensive call per analysis run, so raw quality beats throughput.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Native structured output.&lt;/strong&gt; Gemma 4 has first-class support for function calling and structured JSON output. The analyzer requests a strict JSON schema and the model reliably returns it — no fragile string parsing required.&lt;/p&gt;

&lt;p&gt;The 26B MoE is the right choice if you're building something that calls the model thousands of times per day and want cost efficiency. This tool calls it once per analysis run and prioritizes signal quality, so the Dense wins.&lt;/p&gt;




&lt;h2&gt;Prerequisites&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Python 3.10+&lt;/li&gt;
&lt;li&gt;A Google AI Studio API key (free — &lt;a href="https://aistudio.google.com/app/apikey" rel="noopener noreferrer"&gt;get one here&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;A git repository to analyze&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;google-generativeai&lt;/code&gt; Python SDK&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;google-generativeai
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Set your API key as an environment variable:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;GEMINI_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"your-key-here"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;Step 1: Collect the Git Diff&lt;/h2&gt;

&lt;p&gt;The first job is gathering the raw patch data. We use &lt;code&gt;git log --patch&lt;/code&gt; with a commit range and pipe the output to a string. We also collect structured commit metadata separately so the model has author and timestamp context alongside the diff.&lt;br&gt;
&lt;/p&gt;
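
&lt;p&gt;To preview roughly what the model will see, you can run the equivalent command directly (the &lt;code&gt;--pretty&lt;/code&gt; commit header the function adds is omitted here):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git log --patch --no-merges --since="1 week ago" --until=HEAD
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;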

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;subprocess&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;collect_git_history&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;repo_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;since&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1 week ago&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;until&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;HEAD&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;]]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Returns (full_patch_text, list_of_commit_metadata).
    `since` accepts anything git understands: &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;7 days ago&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;v1.2.3&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, a SHA, etc.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# Collect the full unified diff
&lt;/span&gt;    &lt;span class="n"&gt;patch_result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;subprocess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;git&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;log&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--patch&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--no-merges&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--since=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;since&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--until=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;until&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
         &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--pretty=format:COMMIT: %H%nAuthor: %an &amp;lt;%ae&amp;gt;%nDate: %ci%nMessage: %s%n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;cwd&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;repo_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;capture_output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;check&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Collect lightweight metadata for the summary header
&lt;/span&gt;    &lt;span class="n"&gt;meta_result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;subprocess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;git&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;log&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--no-merges&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--since=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;since&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--until=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;until&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
         &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--pretty=format:%H|%an|%ci|%s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;cwd&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;repo_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;capture_output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;check&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;commits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;meta_result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stdout&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;splitlines&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;continue&lt;/span&gt;
        &lt;span class="n"&gt;sha&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;author&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;msg_parts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;|&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;commits&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sha&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;sha&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;author&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;author&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;|&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg_parts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;patch_result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stdout&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;commits&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A week of commits on a real codebase can easily run to tens of thousands of tokens of patch text. We'll let the model handle the full text — that's exactly what the 256K window is for.&lt;/p&gt;
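
&lt;p&gt;If you want a cheap guard before making the call, a character-count heuristic is enough. The sketch below is illustrative (&lt;code&gt;estimate_tokens&lt;/code&gt; and &lt;code&gt;check_fits&lt;/code&gt; are hypothetical helpers, not part of the tool) and assumes roughly 4 characters per token, a common rule of thumb for code-heavy text:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical guard, not part of the tool above. ~4 characters per
# token is a rough heuristic for code-heavy text; the real ratio
# depends on the tokenizer, so treat this as a sanity check only.
def estimate_tokens(text: str) -&gt; int:
    return len(text) // 4

CONTEXT_BUDGET = 240_000  # leave headroom under the 256K window

def check_fits(patch_text: str) -&gt; None:
    if estimate_tokens(patch_text) &gt; CONTEXT_BUDGET:
        raise ValueError(
            "Patch likely exceeds the context window; "
            "narrow the range with --since/--until."
        )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;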




&lt;h2&gt;Step 2: Build the Prompt&lt;/h2&gt;

&lt;p&gt;The prompt does three things: gives the model its role and output contract, defines the JSON schema it must return, and passes the raw git history.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;SYSTEM_PROMPT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;You are a senior staff engineer performing a structured code review
of a git commit history. Your job is to analyse the provided patch text and return a
single JSON object — nothing else, no markdown fences, no explanation outside the JSON.

The JSON object must match this schema exactly:

{
  &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;summary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2-3 sentence plain-English summary of the overall change set&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,
  &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;changed_areas&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: [
    {
      &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;path/to/file_or_directory&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,
      &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;change_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;added | modified | deleted | renamed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,
      &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;what changed and why it likely changed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;
    }
  ],
  &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;risk_flags&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: [
    {
      &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;severity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;low | medium | high&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,
      &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;area&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;file or component&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,
      &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reason&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;specific, concrete reason this change carries risk&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;
    }
  ],
  &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;patterns&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: [
    &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;notable cross-commit pattern, refactor theme, or repeated change&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;
  ],
  &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;changelog_entry&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;A polished, user-facing changelog entry in Markdown. Use ## [Unreleased] as the heading. Group under Added, Changed, Fixed, Removed as appropriate.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;
}

Be specific. Do not flag risk without a concrete reason tied to the actual diff.
Do not invent changes that are not present in the patch text.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;build_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;patch_text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;commits&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;commit_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;commits&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;authors&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;author&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;commits&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="n"&gt;date_range&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;commits&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; to &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;commits&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;commits&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unknown&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="n"&gt;header&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ANALYSIS REQUEST&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Commits: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;commit_count&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Authors: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;authors&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Date range: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;date_range&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;FULL PATCH TEXT FOLLOWS&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;header&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;patch_text&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The system prompt enforces a strict schema so we can parse the response with &lt;code&gt;json.loads&lt;/code&gt; — no regex, no fallbacks. One of Gemma 4's standout improvements over Gemma 3 is how reliably it follows structured output instructions at this schema complexity.&lt;/p&gt;
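
&lt;p&gt;For one extra layer of safety you can verify the parsed object before formatting it. The &lt;code&gt;validate_analysis&lt;/code&gt; helper below is a minimal sketch, not part of the tool above; the key names come straight from the schema in &lt;code&gt;SYSTEM_PROMPT&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Minimal post-parse check (sketch). Only verifies the top-level keys
# the report formatter relies on; values are not type-checked.
REQUIRED_KEYS = {
    "summary", "changed_areas", "risk_flags", "patterns", "changelog_entry"
}

def validate_analysis(analysis: dict) -&gt; dict:
    missing = REQUIRED_KEYS - analysis.keys()
    if missing:
        raise ValueError(f"Model response is missing keys: {sorted(missing)}")
    return analysis
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;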




&lt;h2&gt;Step 3: Call Gemma 4 31B&lt;/h2&gt;

&lt;p&gt;We use the &lt;code&gt;google-generativeai&lt;/code&gt; SDK with &lt;code&gt;gemma-4-31b-it&lt;/code&gt; (the instruction-tuned variant — for a prompted, schema-bound task like this, always pick the IT variant over the base model).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;google.generativeai&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;genai&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;analyze_with_gemma&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;patch_text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;commits&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;genai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;configure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GEMINI_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;genai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;GenerativeModel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemma-4-31b-it&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;system_instruction&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;SYSTEM_PROMPT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;generation_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;genai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;GenerationConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;      &lt;span class="c1"&gt;# Low temperature for consistent structured output
&lt;/span&gt;            &lt;span class="n"&gt;top_p&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;max_output_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4096&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;build_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;patch_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;commits&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Sending &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; words to Gemma 4 31B...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stderr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;raw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# Strip markdown fences if the model adds them despite instructions
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;```&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;raw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;```&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;raw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;:]&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Temperature at 0.2 keeps the output deterministic and schema-compliant. For creative changelog prose you could nudge it to 0.4 — but for risk flags you want the model to be conservative and consistent.&lt;/p&gt;
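
&lt;p&gt;Even at low temperature, &lt;code&gt;json.loads&lt;/code&gt; can occasionally fail on a malformed response. A pragmatic hardening step, sketched below with a hypothetical &lt;code&gt;analyze_with_retry&lt;/code&gt; wrapper around the function above, is to retry once before giving up:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical retry wrapper, not part of the tool above. Malformed
# JSON is rare at temperature 0.2, so a single retry is usually enough.
# Reuses analyze_with_gemma plus the json/sys imports already in scope.
def analyze_with_retry(patch_text: str, commits: list[dict], retries: int = 1) -&gt; dict:
    for attempt in range(retries + 1):
        try:
            return analyze_with_gemma(patch_text, commits)
        except json.JSONDecodeError as err:
            if attempt == retries:
                raise
            print(f"Malformed JSON on attempt {attempt + 1}: {err}", file=sys.stderr)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;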




&lt;h2&gt;Step 4: Format and Output the Report&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;print_report&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;analysis&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;commits&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GIT HISTORY ANALYSIS — Gemma 4 31B&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Commits analysed: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;commits&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;SUMMARY&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;analysis&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;summary&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;analysis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;risk_flags&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;RISK FLAGS&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;flag&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;analysis&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;risk_flags&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;high&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;medium&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;low&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;}[&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;severity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]]):&lt;/span&gt;
            &lt;span class="n"&gt;icon&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;high&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;🔴&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;medium&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;🟡&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;low&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;🟢&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}[&lt;/span&gt;&lt;span class="n"&gt;flag&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;severity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;icon&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; [&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;flag&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;severity&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;upper&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;] &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;flag&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;area&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;     &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;flag&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;reason&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;analysis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;patterns&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PATTERNS DETECTED&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;analysis&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;patterns&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  • &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CHANGED AREAS&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;area&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;analysis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;changed_areas&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[]):&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  [&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;area&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;change_type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;upper&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;] &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;area&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;path&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;             &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;area&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;write_changelog&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;analysis&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;entry&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;analysis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;changelog_entry&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;entry&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;

    &lt;span class="c1"&gt;# Inject today's date if the entry has a placeholder
&lt;/span&gt;    &lt;span class="n"&gt;entry&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;entry&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[Unreleased]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[Unreleased] — &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;today&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;strftime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;%Y-%m-%d&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;w&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;entry&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Changelog written to &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;output_path&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stderr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
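
&lt;p&gt;One design note: &lt;code&gt;write_changelog&lt;/code&gt; opens the output file in &lt;code&gt;"w"&lt;/code&gt; mode, so it overwrites the target. That suits a draft file you review and merge by hand; to prepend the entry to an existing &lt;code&gt;CHANGELOG.md&lt;/code&gt; instead, read the current contents first and write the new entry above them.&lt;/p&gt;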






&lt;h2&gt;Step 5: Wire It Together as a CLI&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;argparse&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;parser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;argparse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ArgumentParser&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Analyse a git commit range with Gemma 4 31B&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;repo&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;help&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Path to git repository&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--since&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1 week ago&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="n"&gt;help&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Start of range (default: &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;1 week ago&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;). Accepts any git date or ref.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--until&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;HEAD&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="n"&gt;help&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;End of range (default: HEAD)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--changelog&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="n"&gt;help&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Write changelog entry to this file&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dest&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;json_out&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="n"&gt;help&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Write full JSON report to this file&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse_args&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;patch_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;commits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;collect_git_history&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;repo&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;since&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;until&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;commits&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;No commits found in the specified range.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stderr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;analysis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;analyze_with_gemma&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;patch_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;commits&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print_report&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;analysis&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;commits&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;changelog&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;write_changelog&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;analysis&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;changelog&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;json_out&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;json_out&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;w&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;analysis&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;indent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Full JSON report written to &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;json_out&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stderr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Running It
&lt;/h2&gt;

&lt;p&gt;Analyze the last week of commits in the current repo:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python git_analyzer.py &lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="nt"&gt;--since&lt;/span&gt; &lt;span class="s2"&gt;"1 week ago"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Analyze a specific SHA range and write a changelog:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python git_analyzer.py /path/to/repo &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--since&lt;/span&gt; v1.4.0 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--until&lt;/span&gt; v1.5.0 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--changelog&lt;/span&gt; CHANGELOG.md &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--json&lt;/span&gt; report.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Analyze a single sprint (two-week window):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python git_analyzer.py &lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="nt"&gt;--since&lt;/span&gt; &lt;span class="s2"&gt;"14 days ago"&lt;/span&gt; &lt;span class="nt"&gt;--changelog&lt;/span&gt; CHANGELOG.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Sample Output
&lt;/h2&gt;

&lt;p&gt;Here's an abbreviated example of what the tool produces on a real project:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;============================================================
GIT HISTORY ANALYSIS — Gemma 4 31B
============================================================

Commits analysed: 23

SUMMARY
This sprint focused on migrating the authentication layer from session
cookies to JWTs, with supporting changes to the user model and API
middleware. Three unrelated bug fixes were included. No database
migrations were added despite schema-adjacent changes in user.py.

RISK FLAGS
  🔴 [HIGH]  src/auth/middleware.py
             Token expiry is set to 0 in the new JWT config, which
             disables expiry entirely. This appears unintentional given
             the surrounding comments referencing a 24h TTL.
  🟡 [MEDIUM] src/models/user.py
             The `last_login` field is now written in two places with
             different timezone handling (UTC in the old path, local
             time in the new one). Cross-commit inconsistency introduced
             in commits a3f1cc and 9d02bb.

PATTERNS DETECTED
  • JWT migration touched 11 files across 8 commits — no single
    atomic commit, suggesting iterative discovery during implementation
  • Four separate commits add logging statements then remove them,
    indicating debug churn that could have been a feature branch

CHANGED AREAS
  [MODIFIED ] src/auth/middleware.py
               Core auth middleware rewritten to validate Bearer tokens
               instead of reading from session. Old session path removed.
  [MODIFIED ] src/models/user.py
               Added jwt_secret field; last_login timezone handling changed
  [ADDED    ] src/auth/token.py
               New module for JWT encode/decode with HS256
  ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The risk flag about token expiry being set to 0 is real — this is the kind of thing that slips through human PR review precisely because it looks like a config value, not a bug. The cross-commit inconsistency flag is only possible because the model reasoned across all 23 commits simultaneously rather than reviewing each in isolation.&lt;/p&gt;




&lt;h2&gt;
  
  
  Context Window Headroom
&lt;/h2&gt;

&lt;p&gt;The 256K window on Gemma 4 31B means you have significant headroom. At roughly 3 characters per token (diff text tokenizes heavier than prose, which averages closer to 4), the practical limits look like this:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Approx. tokens&lt;/th&gt;
&lt;th&gt;Fits in 256K?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1 day of commits, small team&lt;/td&gt;
&lt;td&gt;~5,000&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1 sprint (2 weeks), small team&lt;/td&gt;
&lt;td&gt;~40,000&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Full quarter, mid-size team&lt;/td&gt;
&lt;td&gt;~180,000&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1 year of active development&lt;/td&gt;
&lt;td&gt;~500,000+&lt;/td&gt;
&lt;td&gt;❌ use &lt;code&gt;--since&lt;/code&gt; to segment&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For very large repos, segment by component directory using &lt;code&gt;git log -- path/to/subdir&lt;/code&gt; rather than trying to fit everything.&lt;/p&gt;
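
&lt;p&gt;A minimal sketch of that segmentation pass (the helper names and the 200K budget are my own, not part of the tool above): estimate tokens at roughly 3 characters each, then analyze only the slices that fit.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import subprocess
from pathlib import Path

CHARS_PER_TOKEN = 3        # rough heuristic for diff-heavy text
CONTEXT_BUDGET = 200_000   # stay comfortably under the 256K window

def patch_for_path(repo, since, pathspec):
    """Patch text for a single subdirectory of the repo."""
    out = subprocess.run(
        ["git", "-C", repo, "log", "--patch", f"--since={since}", "--", pathspec],
        capture_output=True, text=True, check=True,
    )
    return out.stdout

def segments_that_fit(repo, since="90 days ago"):
    """Yield (subdir, patch_text) pairs whose estimated size fits the budget."""
    for p in sorted(Path(repo).iterdir()):
        if not p.is_dir() or p.name.startswith("."):
            continue
        patch = patch_for_path(repo, since, p.name)
        est_tokens = len(patch) // CHARS_PER_TOKEN
        if 0 &lt; est_tokens &lt;= CONTEXT_BUDGET:
            yield p.name, patch
        elif est_tokens &gt; CONTEXT_BUDGET:
            print(f"{p.name}: ~{est_tokens:,} est. tokens, shorten --since and retry")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;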




&lt;h2&gt;
  
  
  Where to Take This Next
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;GitHub Action.&lt;/strong&gt; Trigger the analyzer on each PR, post the risk flags as a PR comment, and block merge if any &lt;code&gt;high&lt;/code&gt; severity flags are found. One YAML file and a secrets entry gets you there.&lt;/p&gt;
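
&lt;p&gt;The merge gate itself can be a few lines of Python inside the Action. This sketch assumes the JSON report has a &lt;code&gt;risk_flags&lt;/code&gt; list with a &lt;code&gt;severity&lt;/code&gt; field per flag; adjust the field names to whatever your prompt actually asks Gemma to emit.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json
import sys

# Gate the PR on the analyzer's report (written via --json report.json).
# Assumed flag shape: {"severity": "high", "file": ..., "reason": ...}
with open("report.json") as f:
    report = json.load(f)

high = [fl for fl in report.get("risk_flags", []) if fl.get("severity") == "high"]

for flag in high:
    print(f"HIGH RISK: {flag.get('file')}: {flag.get('reason')}")

if high:
    sys.exit(1)  # non-zero exit fails the Action and blocks the merge
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;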

&lt;p&gt;&lt;strong&gt;Slack/Teams digest.&lt;/strong&gt; Run on a cron, pipe the changelog entry to a webhook. Engineering managers get a plain-English weekly summary without reading git.&lt;/p&gt;
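
&lt;p&gt;A cron-friendly sketch, assuming a standard Slack incoming webhook and the changelog file the tool already writes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json
import os
import urllib.request

# SLACK_WEBHOOK_URL is assumed to be set in the cron environment.
webhook = os.environ["SLACK_WEBHOOK_URL"]

with open("CHANGELOG.md") as f:
    digest = f.read()

req = urllib.request.Request(
    webhook,
    data=json.dumps({"text": digest[:4000]}).encode(),  # keep it short for Slack
    headers={"Content-Type": "application/json"},
)
urllib.request.urlopen(req)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;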

&lt;p&gt;&lt;strong&gt;Fine-tuning.&lt;/strong&gt; If your team consistently disagrees with certain risk classifications, collect those corrections as a small labeled dataset and fine-tune the model on Vertex AI or Colab. Gemma 4's Apache 2.0 license means there are no restrictions on using it as a fine-tuning base for internal tools.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-repo analysis.&lt;/strong&gt; Pass diffs from multiple services in the same prompt window. The 256K context means you can compare what changed across your backend, frontend, and infra repos in the same analysis run.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why This Matters
&lt;/h2&gt;

&lt;p&gt;The git history is one of the most information-dense artifacts a software team produces, and it's almost entirely ignored outside of &lt;code&gt;git blame&lt;/code&gt;. Gemma 4 31B's context window is large enough to treat a sprint's history as a single document rather than a stream of individual events.&lt;/p&gt;

&lt;p&gt;That shift in granularity changes what the model can do: it can notice that a change made on Tuesday was partially reverted on Thursday, that two different authors independently touched the same configuration key, or that a "refactor" commit introduced a subtle behavioral change buried in 400 lines of renames.&lt;/p&gt;

&lt;p&gt;None of that is possible when each commit is reviewed in isolation.&lt;/p&gt;




&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://ai.google.dev/gemma/docs/core" rel="noopener noreferrer"&gt;Gemma 4 model overview — Google AI for Developers&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aistudio.google.com" rel="noopener noreferrer"&gt;Google AI Studio — free API access&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/google-gemini/generative-ai-python" rel="noopener noreferrer"&gt;google-generativeai Python SDK&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://huggingface.co/google/gemma-4-31B-it" rel="noopener noreferrer"&gt;Gemma 4 on Hugging Face&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>devchallenge</category>
      <category>gemmachallenge</category>
      <category>gemma</category>
    </item>
    <item>
      <title>Engineering LLMOps: Building Robust CI/CD Pipelines for LLM Applications on Google Cloud</title>
      <dc:creator>Jubin Soni</dc:creator>
      <pubDate>Fri, 01 May 2026 23:48:44 +0000</pubDate>
      <link>https://dev.to/jubinsoni/engineering-llmops-building-robust-cicd-pipelines-for-llm-applications-on-google-cloud-22hc</link>
      <guid>https://dev.to/jubinsoni/engineering-llmops-building-robust-cicd-pipelines-for-llm-applications-on-google-cloud-22hc</guid>
      <description>&lt;p&gt;The transition of Large Language Models (LLMs) from experimental notebooks to production-grade applications requires more than just a well-crafted prompt. As enterprises integrate Generative AI into their core workflows, the need for stability, scalability, and reproducibility becomes paramount. This is where LLMOps—the intersection of DevOps, Data Engineering, and Machine Learning—enters the frame.&lt;/p&gt;

&lt;p&gt;Building a CI/CD pipeline for LLM-based applications on Google Cloud Platform (GCP) presents unique challenges. Unlike traditional software, LLM outputs are non-deterministic, making testing complex. Unlike traditional ML, the "model" is often a managed service (like Gemini) or a fine-tuned version of an open-source giant, shifting the focus from training to orchestration, prompt management, and RAG (Retrieval-Augmented Generation) infrastructure.&lt;/p&gt;

&lt;p&gt;In this technical deep dive, we will explore how to architect a robust CI/CD pipeline for LLM applications using Google Cloud's suite of tools, ensuring your AI deployments are as reliable as your backend microservices.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Evolution of the Pipeline: From DevOps to LLMOps
&lt;/h2&gt;

&lt;p&gt;Traditional CI/CD focuses on code integrity, unit tests, and artifact deployment. LLMOps extends this by adding layers for prompt versioning, evaluation against golden datasets, and semantic monitoring. &lt;/p&gt;

&lt;p&gt;On Google Cloud, the backbone of this workflow is Cloud Build for orchestration, Vertex AI for model management and evaluation, and Artifact Registry for versioning. The goal is to move away from manual testing in the Vertex AI Studio and toward an automated, repeatable process.&lt;/p&gt;

&lt;h3&gt;
  
  
  Core Components of the GCP LLM Stack
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Vertex AI Model Garden &amp;amp; Model Registry&lt;/strong&gt;: Centralized hubs for discovering and managing models.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Cloud Build&lt;/strong&gt;: A serverless CI/CD platform that executes builds on GCP infrastructure.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Vertex AI Pipelines&lt;/strong&gt;: Based on Kubeflow, these allow you to orchestrate complex ML workflows.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Cloud Run / GKE&lt;/strong&gt;: For hosting the application logic or serving custom model containers.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Vertex AI Evaluation Service&lt;/strong&gt;: Provides automated metrics for model performance (e.g., faithfulness, answer relevancy).&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Architectural Blueprint: The LLM CI/CD Lifecycle
&lt;/h2&gt;

&lt;p&gt;A robust pipeline must handle three distinct types of updates: changes to the application code, changes to the prompt templates, and updates to the retrieval data (in RAG systems).&lt;/p&gt;

&lt;h3&gt;
  
  
  The Workflow Logic
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F08rjee7lbggtbcq2yjeh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F08rjee7lbggtbcq2yjeh.png" alt="Flowchart Diagram" width="482" height="1425"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This flowchart illustrates the progression from code commit to production. The "Performance Gate" is the most critical addition in LLMOps. It prevents models that hallucinate or provide poor-quality answers from reaching the end user.&lt;/p&gt;

&lt;h2&gt;
  
  
  Continuous Integration: Beyond Unit Testing
&lt;/h2&gt;

&lt;p&gt;In a standard application, logical correctness and predictable performance are the benchmarks. In LLM apps, we must also test for semantic accuracy. CI for LLMs on GCP should include:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Prompt Linting&lt;/strong&gt;: Checking for formatting and required variables in prompt templates.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Deterministic Testing&lt;/strong&gt;: Testing the helper functions that format data for the LLM.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;LLM-based Evaluation (LLM-as-a-judge)&lt;/strong&gt;: Using a stronger model (like Gemini 1.5 Pro) to grade the output of a smaller, faster model (like Gemini 1.5 Flash).&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Practical Code: Automated Evaluation Script
&lt;/h3&gt;

&lt;p&gt;Using the Vertex AI SDK, we can automate the evaluation of a prompt change during the CI phase. The following Python snippet demonstrates how to trigger an evaluation job that measures "fluency" and "safety."&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;vertexai&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;vertexai.generative_models&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;GenerativeModel&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;vertexai.evaluation&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;EvalTask&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;PointwiseMetric&lt;/span&gt;

&lt;span class="c1"&gt;# Initialize Vertex AI
&lt;/span&gt;&lt;span class="n"&gt;vertexai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;init&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-project-id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;location&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us-central1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Define the evaluation metric (LLM-as-a-judge)
&lt;/span&gt;&lt;span class="n"&gt;fluency_metric&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PointwiseMetric&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;metric&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fluency&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;metric_prompt_template&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Rate the fluency of the following text from 1-5.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run_evaluation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;candidate_model_output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reference_data&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;eval_task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;EvalTask&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;reference_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;fluency_metric&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;experiment&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llm-app-v1-eval&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Run the evaluation
&lt;/span&gt;    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;eval_task&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;prompt_template&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Summarize this text: {text}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;google/gemini-1.5-flash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;summary_metrics&lt;/span&gt;

&lt;span class="c1"&gt;# Example usage in a CI script
# if results.summary_metrics['fluency'] &amp;lt; 4.0:
#     sys.exit(1) # Fail the build
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Data Management and Versioning
&lt;/h2&gt;

&lt;p&gt;In LLM applications, especially those utilizing RAG, the data is as important as the code. Your pipeline must account for the versioning of the Vector Database index and the embeddings model. If you update your embeddings model (e.g., from Gecko v1 to v2), you must re-index your entire dataset. Failure to do so leads to a "schema mismatch" in semantic space, where the LLM cannot find the relevant context.&lt;/p&gt;
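
&lt;p&gt;One cheap guard is to stamp the index with the embeddings model version at build time and fail the pipeline on a mismatch. A sketch, with an illustrative metadata file and field names:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json
from pathlib import Path

EMBEDDING_MODEL = "text-embedding-005"  # the version this app was built against

def check_index_compatibility(index_dir: str) -&gt; None:
    """Fail fast if the vector index was built with a different embeddings model."""
    meta = json.loads(Path(index_dir, "index_meta.json").read_text())
    if meta["embedding_model"] != EMBEDDING_MODEL:
        raise RuntimeError(
            f"Index built with {meta['embedding_model']!r} but app expects "
            f"{EMBEDDING_MODEL!r}; re-index before deploying."
        )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Run it as an early pipeline step so a stale index fails the build instead of silently degrading retrieval quality in production.&lt;/p&gt;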

&lt;h3&gt;
  
  
  Technology Comparison: Serving Options on Google Cloud
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Vertex AI Endpoints&lt;/th&gt;
&lt;th&gt;Cloud Run&lt;/th&gt;
&lt;th&gt;Google Kubernetes Engine (GKE)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Best For&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Managed model serving&lt;/td&gt;
&lt;td&gt;Lightweight AI APIs&lt;/td&gt;
&lt;td&gt;Large-scale custom deployments&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Auto-scaling&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Built-in (to zero with some models)&lt;/td&gt;
&lt;td&gt;Highly responsive to HTTP traffic&lt;/td&gt;
&lt;td&gt;Complex scaling based on GPU usage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cold Start&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Low (Serverless)&lt;/td&gt;
&lt;td&gt;High (unless using warm pools)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GPU Support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Seamlessly managed&lt;/td&gt;
&lt;td&gt;Limited (select GPU types and regions)&lt;/td&gt;
&lt;td&gt;Full control over GPU types&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Pricing Model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Per-node-hour&lt;/td&gt;
&lt;td&gt;Per-request/CPU-second&lt;/td&gt;
&lt;td&gt;Cluster-based provisioning&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Continuous Delivery: Deployment Strategies
&lt;/h2&gt;

&lt;p&gt;Deploying LLMs requires a safety-first approach. Because LLM behavior can shift with new data or minor prompt tweaks, Canary deployments are essential. Vertex AI Endpoints facilitate this by allowing traffic splitting between multiple model versions.&lt;/p&gt;
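
&lt;p&gt;With the &lt;code&gt;google-cloud-aiplatform&lt;/code&gt; SDK, a canary rollout is a single &lt;code&gt;deploy&lt;/code&gt; call with a traffic percentage. A sketch, assuming the candidate model is already in the Model Registry (resource paths are placeholders):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from google.cloud import aiplatform

aiplatform.init(project="your-project-id", location="us-central1")

endpoint = aiplatform.Endpoint("projects/.../locations/us-central1/endpoints/...")
candidate = aiplatform.Model("projects/.../locations/us-central1/models/...")

# Canary: route 10% of traffic to the new version; the stable version keeps 90%.
endpoint.deploy(
    model=candidate,
    deployed_model_display_name="llm-app-canary",
    traffic_percentage=10,
)

# Promote by redeploying at a higher percentage, or roll back by undeploying.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The useful property is that the traffic split lives on the endpoint, not in your application config, so a rollback is an API call rather than a redeploy.&lt;/p&gt;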

&lt;h3&gt;
  
  
  Sequence of a Managed Deployment
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Focyi1lrg3kvzj0w2ddge.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Focyi1lrg3kvzj0w2ddge.png" alt="Sequence Diagram" width="800" height="473"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This sequence ensures that if the new prompt version causes a spike in 400-level errors or results in lower semantic confidence scores, the pipeline can automatically roll back to the stable version.&lt;/p&gt;

&lt;h2&gt;
  
  
  Infrastructure as Code (IaC) with Terraform
&lt;/h2&gt;

&lt;p&gt;To ensure the environment is reproducible, all GCP resources (Vertex AI Indexes, Endpoints, and Cloud Storage buckets) should be managed via Terraform. This prevents "configuration drift," where the staging environment differs from production.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"google_vertex_ai_endpoint"&lt;/span&gt; &lt;span class="s2"&gt;"llm_endpoint"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"gemini-service-endpoint"&lt;/span&gt;
  &lt;span class="nx"&gt;display_name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Gemini Service Endpoint"&lt;/span&gt;
  &lt;span class="nx"&gt;location&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"us-central1"&lt;/span&gt;
  &lt;span class="nx"&gt;project&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;project_id&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"google_cloudbuild_trigger"&lt;/span&gt; &lt;span class="s2"&gt;"llm_pipeline_trigger"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"deploy-llm-on-push"&lt;/span&gt;

  &lt;span class="nx"&gt;github&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;owner&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"your-org"&lt;/span&gt;
    &lt;span class="nx"&gt;name&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"your-repo"&lt;/span&gt;
    &lt;span class="nx"&gt;push&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;branch&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"^main$"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;filename&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"cloudbuild.yaml"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Implementing a "PromptOps" Strategy
&lt;/h2&gt;

&lt;p&gt;One of the most significant shifts in LLMOps is treating prompts as first-class citizens. Instead of hardcoding prompts in the application code, store them as versioned assets. &lt;/p&gt;
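
&lt;p&gt;A minimal sketch of that pattern: prompts live in the repo as versioned YAML assets, and a loader lints them before anything ships. The file shape and field names are one possible convention, not a standard:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import string

import yaml  # pip install pyyaml

def load_prompt(path: str) -&gt; dict:
    """Load a versioned prompt asset and lint it before use."""
    with open(path) as f:
        asset = yaml.safe_load(f)

    # Expected shape (illustrative):
    #   name: summarizer
    #   version: 1.3.0
    #   required_variables: [text]
    #   template: "Summarize this text: {text}"
    found = {
        name for _, name, _, _ in string.Formatter().parse(asset["template"]) if name
    }
    missing = set(asset["required_variables"]) - found
    if missing:
        raise ValueError(f"{asset['name']} v{asset['version']} is missing {missing}")
    return asset
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;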

&lt;h3&gt;
  
  
  Branching Strategy for Prompts
&lt;/h3&gt;

&lt;p&gt;Using a Git-based workflow for prompts allows prompt engineers to experiment without breaking the production application logic.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq5emhdw2zbzmjyzurzzj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq5emhdw2zbzmjyzurzzj.png" alt="Diagram" width="525" height="315"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Cloud Build Configuration
&lt;/h2&gt;

&lt;p&gt;The following is an example of a &lt;code&gt;cloudbuild.yaml&lt;/code&gt; file that orchestrates the entire process: running tests, performing model evaluation, and deploying to a staging environment.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="c1"&gt;# Step 1: Install dependencies and run unit tests&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;python:3.10'&lt;/span&gt;
    &lt;span class="na"&gt;entrypoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/bin/sh&lt;/span&gt;
    &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;-c&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
        &lt;span class="s"&gt;pip install -r requirements-test.txt&lt;/span&gt;
        &lt;span class="s"&gt;pytest tests/unit&lt;/span&gt;

  &lt;span class="c1"&gt;# Step 2: Run Vertex AI Evaluation&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;gcr.io/google.com/cloudsdktool/cloud-sdk'&lt;/span&gt;
    &lt;span class="na"&gt;entrypoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;python'&lt;/span&gt;
    &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;scripts/evaluate_model.py'&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;PROJECT_ID=$PROJECT_ID'&lt;/span&gt;

  &lt;span class="c1"&gt;# Step 3: Build the application container&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;gcr.io/cloud-builders/docker'&lt;/span&gt;
    &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;build'&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;-t'&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;us-central1-docker.pkg.dev/$PROJECT_ID/app-repo/llm-app:$SHORT_SHA'&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.'&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

  &lt;span class="c1"&gt;# Step 4: Push to Artifact Registry&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;gcr.io/cloud-builders/docker'&lt;/span&gt;
    &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;push'&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;us-central1-docker.pkg.dev/$PROJECT_ID/app-repo/llm-app:$SHORT_SHA'&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

  &lt;span class="c1"&gt;# Step 5: Update Cloud Run Service&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;gcr.io/google.com/cloudsdktool/cloud-sdk'&lt;/span&gt;
    &lt;span class="na"&gt;entrypoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gcloud&lt;/span&gt;
    &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; 
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;run'&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;deploy'&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;llm-service-staging'&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;--image=us-central1-docker.pkg.dev/$PROJECT_ID/app-repo/llm-app:$SHORT_SHA'&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;--region=us-central1'&lt;/span&gt;

&lt;span class="na"&gt;images&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;us-central1-docker.pkg.dev/$PROJECT_ID/app-repo/llm-app:$SHORT_SHA'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Monitoring and Feedback Loops
&lt;/h2&gt;

&lt;p&gt;Once an LLM application is in production, the CI/CD pipeline doesn't stop. It transforms into a feedback loop. Google Cloud Monitoring and Cloud Logging can be used to track:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Token Usage&lt;/strong&gt;: Monitoring costs to prevent budget overruns.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Latency&lt;/strong&gt;: Tracking time-to-first-token (TTFT) and total response time.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Human-in-the-loop Feedback&lt;/strong&gt;: Sending flagged responses back to a labeling task in Vertex AI for future fine-tuning.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Handling Non-Determinism
&lt;/h3&gt;

&lt;p&gt;Because LLMs are non-deterministic, your monitoring should be statistical rather than binary. Instead of a pass/fail verdict on every request, look for distribution shifts in the "Helpfulness" score over a window of 1,000 requests. If the mean score drops by more than two standard deviations below its baseline, the pipeline should trigger a rollback or alert the engineering team.&lt;/p&gt;
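
&lt;p&gt;A sketch of that check, assuming you can pull the recent scores from Cloud Logging or wherever your evaluation results land:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import statistics

def should_roll_back(baseline, recent, sigmas=2.0):
    """True if the recent mean score is more than `sigmas` below the baseline mean."""
    mu = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)
    return statistics.mean(recent) &lt; mu - sigmas * sigma

# e.g. compare the last 1,000 production scores against last week's baseline:
# if should_roll_back(baseline_scores, last_1000_scores):
#     trigger_rollback()  # hypothetical hook into your deployment tooling
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;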

&lt;h2&gt;
  
  
  Security and Governance in LLMOps
&lt;/h2&gt;

&lt;p&gt;Security in the CI/CD pipeline for LLMs involves protecting the data used for RAG and the API keys for the model providers. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Secret Manager&lt;/strong&gt;: Use GCP Secret Manager to store API keys and database credentials. Never hardcode these in your &lt;code&gt;cloudbuild.yaml&lt;/code&gt; or application containers.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;VPC Service Controls&lt;/strong&gt;: For enterprises with strict data residency requirements, ensure that Vertex AI is used within a VPC Service Control perimeter to prevent data exfiltration.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;IAM Granularity&lt;/strong&gt;: Assign the least privilege roles. The Cloud Build service account needs &lt;code&gt;roles/aiplatform.user&lt;/code&gt; to trigger evaluations but should not have permission to delete model registries.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion: The Path to Mature AI Delivery
&lt;/h2&gt;

&lt;p&gt;Building a CI/CD pipeline for LLM applications on Google Cloud is an iterative journey. It begins with basic automation and evolves into a sophisticated system capable of semantic evaluation and automated rollbacks. By leveraging Vertex AI and Cloud Build, organizations can treat LLMs not as mysterious black boxes, but as manageable components of a robust software ecosystem.&lt;/p&gt;

&lt;p&gt;The key to success lies in the "Performance Gate"—investing heavily in evaluation metrics early on will save hundreds of hours of manual debugging later. As the Generative AI landscape continues to evolve, those with the most resilient pipelines will be the ones who can innovate at the speed of the market without sacrificing reliability.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further Reading &amp;amp; Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/vertex-ai/docs" rel="noopener noreferrer"&gt;Google Cloud Vertex AI Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/architecture/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning" rel="noopener noreferrer"&gt;Maturity Model for MLOps and LLMOps on Google Cloud&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/vertex-ai/docs/pipelines/introduction" rel="noopener noreferrer"&gt;Introduction to Vertex AI Pipelines&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/vertex-ai/docs/generative-ai/model-reference/evaluation" rel="noopener noreferrer"&gt;Continuous Evaluation with Vertex AI Rapid Evaluation API&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/build/docs" rel="noopener noreferrer"&gt;Cloud Build Official Product Overview&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;Connect with me: &lt;a href="https://linkedin.com/in/jubinsoni" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; | &lt;a href="https://twitter.com/sonijubin" rel="noopener noreferrer"&gt;Twitter/X&lt;/a&gt; | &lt;a href="https://github.com/jubins" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; | &lt;a href="https://jubinsoni.com" rel="noopener noreferrer"&gt;Website&lt;/a&gt;&lt;/p&gt;

</description>
      <category>llmops</category>
      <category>googlecloud</category>
      <category>cicd</category>
      <category>vertexai</category>
    </item>
    <item>
      <title>The Most Important Announcement at NEXT '26 Was a Sidecar</title>
      <dc:creator>Jubin Soni</dc:creator>
      <pubDate>Sun, 26 Apr 2026 08:07:59 +0000</pubDate>
      <link>https://dev.to/jubinsoni/the-most-important-announcement-at-next-26-was-a-sidecar-5dmk</link>
      <guid>https://dev.to/jubinsoni/the-most-important-announcement-at-next-26-was-a-sidecar-5dmk</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a submission for the &lt;a href="https://dev.to/challenges/google-cloud-next-2026-04-22"&gt;Google Cloud NEXT Writing Challenge&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Google Cloud NEXT '26 made 260 announcements. Most of the discussion has rightly gone to the headline acts: the Gemini Enterprise Agent Platform, 8th-gen TPUs, the Cross-Cloud Lakehouse, Agentic Defense.&lt;/p&gt;

&lt;p&gt;Announcement &lt;strong&gt;#124&lt;/strong&gt; is not one of those.&lt;/p&gt;

&lt;p&gt;It's titled "Predictive latency boost in GKE Inference Gateway." The official blurb says it cuts time-to-first-token by up to 70% by replacing heuristic guesswork with real-time capacity-aware routing — no manual tuning required. That sentence is engineered to slide past you.&lt;/p&gt;

&lt;p&gt;Here's why I think it's the most consequential thing Google shipped this week.&lt;/p&gt;




&lt;h2&gt;
  
  
  The problem this is actually solving
&lt;/h2&gt;

&lt;p&gt;If you've ever stood up a vLLM cluster on Kubernetes, you've felt this pain:&lt;/p&gt;

&lt;p&gt;You have N replicas of the same model. A request lands. Your load balancer has to decide which pod gets it. The "obvious" answers all break:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Round-robin?&lt;/strong&gt; Ignores that pod 3 is sitting on a 60-token KV cache and pod 7 is at 95% memory pressure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Least-connections?&lt;/strong&gt; Treats a 50-token prompt and a 50,000-token prompt as equivalent units of work. They are not.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cache-aware (route to the pod with the prefix already cached)?&lt;/strong&gt; Concentrates load. Cache-hot pods melt. Cache-cold pods sit idle.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Utilization-aware (route to the least-loaded pod)?&lt;/strong&gt; Throws away the entire benefit of prefix caching by scattering related requests.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The standard production answer is what the Kubernetes Inference Gateway calls a "load+prefix scorer" — you give it weights like &lt;code&gt;(prefix=1, queue=1, kv_cache=1)&lt;/code&gt; and tune them by hand. The weights you pick are wrong roughly five minutes after you pick them, because traffic shape changes. The weights that worked at 2pm don't work at 2am. The weights that worked for chat workloads don't work when your evals job kicks off.&lt;/p&gt;

&lt;p&gt;Everyone running LLM inference at scale has built some version of "we tuned the scorer weights for our workload." Everyone has watched those weights silently rot.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Google announced
&lt;/h2&gt;

&lt;p&gt;Buried in the GKE keynote, &lt;a href="https://llm-d.ai/blog/predicted-latency-based-scheduling-for-llms" rel="noopener noreferrer"&gt;Google linked to a research blog from the llm-d team&lt;/a&gt; describing the actual mechanism behind announcement #124. The architecture is shockingly simple — and that's the whole point.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                                  ┌──────────────────────────┐
                                  │   Inference Gateway      │
  request ──────────────────────► │   Endpoint Picker (EPP)  │
                                  └──────────┬───────────────┘
                                             │ "for each candidate pod,
                                             │  predict TTFT and TPOT"
                                             ▼
                                  ┌──────────────────────────┐
                                  │  Latency Predictor       │
                                  │  (XGBoost regression,    │
                                  │   sidecar to EPP)        │
                                  └──────────┬───────────────┘
                                             │ predictions
                                             ▼
                                  pod with best predicted latency wins
                                             │
                                             ▼
                                  ┌────────┐ ┌────────┐ ┌────────┐
                                  │ vLLM 1 │ │ vLLM 2 │ │ vLLM 3 │
                                  └────┬───┘ └────┬───┘ └────┬───┘
                                       │          │          │
                                       └──────────┼──────────┘
                                                  ▼
                                  ┌──────────────────────────┐
                                  │  Trainer sidecar         │
                                  │  observes completed      │
                                  │  requests, retrains      │
                                  │  on sliding window       │
                                  └──────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There is no large model here. There is no Gemini call in the hot path. The "AI" is a small XGBoost regressor that predicts two numbers per candidate pod:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;TTFT&lt;/strong&gt; — time to first token (dominated by prefill)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TPOT&lt;/strong&gt; — time per output token (dominated by decode)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It uses six features: KV cache utilization, input length, queue depth, running requests, prefix cache match percentage, and input tokens in flight. That's the whole input.&lt;/p&gt;

&lt;p&gt;Then the scheduler routes to the pod with the best predicted outcome. If you provided latency SLOs in the request headers, it does best-fit packing — pick the pod with the &lt;em&gt;least&lt;/em&gt; positive headroom, so the others stay free for harder requests later.&lt;/p&gt;
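
&lt;p&gt;To make the mechanism concrete, here is a toy Python sketch of the selection step. The real implementation lives in the Gateway API Inference Extension (in Go); the predictor call and feature names below simply mirror the six inputs described above.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def pick_pod(pods, request, predictor, slo_ttft_ms=None):
    """Toy predicted-latency routing with optional best-fit SLO packing."""
    scored = []
    for pod in pods:
        features = {  # the six inputs from the llm-d post
            "kv_cache_utilization": pod.kv_cache_utilization,
            "input_length": request.input_tokens,
            "queue_depth": pod.queue_depth,
            "running_requests": pod.running_requests,
            "prefix_match_pct": pod.prefix_match_pct(request),
            "input_tokens_in_flight": pod.input_tokens_in_flight,
        }
        ttft, _tpot = predictor.predict(features)  # the XGBoost sidecar
        scored.append((pod, ttft))

    if slo_ttft_ms is None:
        # No SLO: take the best predicted time-to-first-token.
        return min(scored, key=lambda s: s[1])[0]

    # SLO given: best-fit packing. Keep pods predicted to meet the SLO, then
    # pick the one with the LEAST positive headroom (slo - predicted), leaving
    # the roomier pods free for harder requests later.
    feasible = [(pod, slo_ttft_ms - ttft) for pod, ttft in scored if ttft &lt;= slo_ttft_ms]
    if not feasible:
        return min(scored, key=lambda s: s[1])[0]  # degrade gracefully
    return min(feasible, key=lambda f: f[1])[0]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;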

&lt;p&gt;That's it. That's the announcement.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this matters more than anything else announced
&lt;/h2&gt;

&lt;p&gt;Look at the &lt;a href="https://llm-d.ai/blog/predicted-latency-based-scheduling-for-llms" rel="noopener noreferrer"&gt;production numbers from the llm-d post&lt;/a&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Strategy&lt;/th&gt;
&lt;th&gt;E2E p50&lt;/th&gt;
&lt;th&gt;TTFT p50&lt;/th&gt;
&lt;th&gt;TTFT p95&lt;/th&gt;
&lt;th&gt;TPOT p99&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;K8s round-robin baseline&lt;/td&gt;
&lt;td&gt;15.98s&lt;/td&gt;
&lt;td&gt;4.47s&lt;/td&gt;
&lt;td&gt;24.04s&lt;/td&gt;
&lt;td&gt;93ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Load+Prefix &lt;code&gt;(1,1,1)&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;16.42s&lt;/td&gt;
&lt;td&gt;2.86s&lt;/td&gt;
&lt;td&gt;18.06s&lt;/td&gt;
&lt;td&gt;103ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Load+Prefix &lt;code&gt;(3,2,2)&lt;/code&gt; (hand-tuned for this workload)&lt;/td&gt;
&lt;td&gt;13.42s&lt;/td&gt;
&lt;td&gt;3.38s&lt;/td&gt;
&lt;td&gt;16.78s&lt;/td&gt;
&lt;td&gt;63ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Predicted-latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;9.06s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.97s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;11.34s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;53ms&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The hand-tuned heuristic was specifically tuned by humans who looked at seven days of production traffic. The XGBoost model — which retrains continuously on a sliding window of recent completed requests — beat it by 32% on E2E p50 and 71% on TTFT p50.&lt;/p&gt;

&lt;p&gt;This is the part that should make every infrastructure engineer pay attention: &lt;strong&gt;the model didn't beat round-robin. It beat the best version of the thing your team is currently running.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The workload was Qwen3-480B on 13 servers with 8×H200 each, simulating realistic Poisson-distributed traffic with concurrency 1000 and ~94% peak prefix cache reuse. That's not a toy benchmark. That's what your stack looks like.&lt;/p&gt;

&lt;h2&gt;
  
  
  The deeper claim hiding in plain sight
&lt;/h2&gt;

&lt;p&gt;Read this sentence carefully, because it's the actual thesis:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Accelerator performance is fairly predictable when we account for [server] state and request characteristics."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is a quietly heretical claim against the entire current direction of LLM ops tooling. A huge amount of effort right now goes into making serving systems &lt;em&gt;more general&lt;/em&gt; — disaggregated prefill/decode, KV cache offloading to any filesystem, multi-tier caches across RAM/SSD/GCS (also at NEXT, announcement #125). The complexity is exploding.&lt;/p&gt;

&lt;p&gt;The latency-predictor team's bet is the opposite: &lt;strong&gt;the system is already deterministic enough that a six-feature regression hits 5% MAPE.&lt;/strong&gt; Most of what we call "tuning" is just humans doing worse-than-XGBoost approximations of a function that's actually quite learnable.&lt;/p&gt;

&lt;p&gt;If that's true — and the production numbers say it is — then a lot of what gets sold as "AI infrastructure intelligence" is going to collapse into very small models that learn very narrow things online. Not LLMs. Not even deep learning. Boosted trees. Trained on the last few hundred completed requests. Retrained constantly.&lt;/p&gt;

&lt;p&gt;The ironic punchline is that this announcement, which got dropped in a footnote at NEXT '26, may be a more honest preview of where production AI infrastructure is heading than the entire Gemini Enterprise Agent Platform keynote.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trying it
&lt;/h2&gt;

&lt;p&gt;You can run this today. The implementation is open-source under the Kubernetes &lt;a href="https://gateway-api-inference-extension.sigs.k8s.io/guides/latency-based-predictor/" rel="noopener noreferrer"&gt;Gateway API Inference Extension&lt;/a&gt;. The gateway is the K8s upstream component; what Google did at NEXT was bake it into GKE Inference Gateway as a managed feature.&lt;/p&gt;

&lt;p&gt;Once installed, requests opt in via headers — and this is where the design choice gets clever:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="nv"&gt;$GW_IP&lt;/span&gt;/v1/completions &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s1"&gt;'Content-Type: application/json'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s1"&gt;'x-prediction-based-scheduling: true'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s1"&gt;'x-slo-ttft-ms: 200'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s1"&gt;'x-slo-tpot-ms: 50'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "model": "Qwen/Qwen3-32B",
    "prompt": "what is the difference between Franz and Apache Kafka?",
    "max_tokens": 200,
    "stream": "true"
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The two SLO headers are the part to dwell on. You're not telling the gateway &lt;em&gt;how&lt;/em&gt; to route. You're telling it &lt;em&gt;what you need&lt;/em&gt;, and letting it figure out the routing as a constrained optimization. &lt;code&gt;x-slo-ttft-ms: 200&lt;/code&gt; means "I need first token in 200ms or this is a degraded request." The scheduler computes headroom (slo_ttft − predicted_ttft) per pod and packs accordingly.&lt;/p&gt;

&lt;p&gt;This is a real, observable shift in how we think about LLM ops: from imperative ("route to pod 3") to declarative ("meet this SLO"), the same shift that databases went through decades ago when query planners replaced hand-coded access paths.&lt;/p&gt;

&lt;p&gt;The EPP exposes &lt;code&gt;-v=4&lt;/code&gt; log lines that let you watch the scorer think:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;msg:&lt;/span&gt;&lt;span class="s2"&gt;"Running profile handler"&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="err"&gt;plugin:&lt;/span&gt;&lt;span class="s2"&gt;"slo-aware-profile-handler"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;msg:&lt;/span&gt;&lt;span class="s2"&gt;"Pod score"&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="err"&gt;scorer_type:&lt;/span&gt;&lt;span class="s2"&gt;"slo-scorer"&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="err"&gt;pod_name:&lt;/span&gt;&lt;span class="s2"&gt;"vllm-...-9b4wt"&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="err"&gt;score:&lt;/span&gt;&lt;span class="mf"&gt;0.82&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;msg:&lt;/span&gt;&lt;span class="s2"&gt;"Picked endpoint"&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="err"&gt;selected_pod:&lt;/span&gt;&lt;span class="s2"&gt;"vllm-...-9b4wt"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pair this with announcement #129 — autoscaling on custom metrics — and you have a closed loop: the predictor surfaces SLO headroom, the autoscaler reacts to headroom collapse before queue depth even spikes. Most autoscaling triggers fire after the system is already in pain. This one fires when the &lt;em&gt;forecast&lt;/em&gt; says pain is 30 seconds away.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd watch next
&lt;/h2&gt;

&lt;p&gt;A few open questions that the announcement and the underlying paper don't fully resolve:&lt;/p&gt;

&lt;p&gt;The model assumes a homogeneous accelerator pool. In real fleets you have H100s and H200s and B200s mixed together, with different price/performance curves. The team flagged this as future work; whoever solves it well wins the heterogeneous-GPU-cost-optimization market that nobody is talking about yet.&lt;/p&gt;

&lt;p&gt;The trainer runs as a sidecar to the EPP and retrains continuously. At the QPS levels in the scaling table — 10,000 QPS needs 4 prediction servers — the cost of the routing decision starts to be non-trivial relative to the inference itself. There's a coordination cost story here that's missing from the blog post.&lt;/p&gt;

&lt;p&gt;And the bigger question: this technique generalizes. The same XGBoost-on-six-features approach should work for autoscaling, for spot/on-demand routing decisions, for cache eviction policies, for batch scheduling. If Google ships predicted-latency primitives across the rest of GKE, the consequences are larger than a single-feature blog post implies.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;The contest prompt asks for the announcement that "speaks to you." The honest answer for me is: the boring sidecar with the unglamorous name that takes a familiar, recurring pain — the slow rot of hand-tuned scorer weights — and replaces it with something that retrains itself.&lt;/p&gt;

&lt;p&gt;Everyone watching the keynote saw the agent demos. The serving runtime is where the actual money gets won or lost, and it's where a six-feature regression beats a roomful of senior SREs with Grafana dashboards. That's the announcement I think we'll be talking about in 18 months.&lt;/p&gt;

&lt;p&gt;The agent layer makes for a better trailer. The runtime layer is the movie.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Sources &amp;amp; credits: Technical details, production benchmark numbers, and architecture diagram concept drawn from the &lt;a href="https://llm-d.ai/blog/predicted-latency-based-scheduling-for-llms" rel="noopener noreferrer"&gt;llm-d project's "Predicted-Latency Based Scheduling for LLMs" post (March 2026)&lt;/a&gt; by Kaushik Mitra, Benjamin Braun, Abdullah Gharaibeh, and Clayton Coleman, and the &lt;a href="https://cloud.google.com/blog/topics/google-cloud-next/google-cloud-next-2026-wrap-up" rel="noopener noreferrer"&gt;Google Cloud NEXT '26 Wrap-Up&lt;/a&gt; (announcement #124). The opinions, framing, and analysis are mine. AI tools were used as a writing assistant; all technical claims trace to the linked primary sources.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devchallenge</category>
      <category>cloudnextchallenge</category>
      <category>googlecloud</category>
    </item>
    <item>
      <title>What Does OpenClaw Take From You?</title>
      <dc:creator>Jubin Soni</dc:creator>
      <pubDate>Sun, 26 Apr 2026 07:46:29 +0000</pubDate>
      <link>https://dev.to/jubinsoni/what-does-openclaw-take-from-you-4ph3</link>
      <guid>https://dev.to/jubinsoni/what-does-openclaw-take-from-you-4ph3</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a submission for the &lt;a href="https://dev.to/challenges/openclaw-2026-04-16"&gt;OpenClaw Challenge&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; The conversation about personal AI is almost entirely about what these agents give you. The harder question — and the one that determines whether the deal is actually good — is what they take. Here are three things personal AI is quietly absorbing, and what I think you should keep.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;We all know the story around personal AI. It gives you time back. It automates. It amplifies. It handles your inbox while you sleep, summarizes your morning, drafts replies, books flights, files receipts, sends messages—so you don’t have to. The language never really changes: gain. More throughput. More execution. More delegation. More agency.&lt;/p&gt;

&lt;p&gt;This framing is missing half the equation. Anything you offload, you stop doing. Anything you stop doing, you eventually stop being good at. And anything you stop being good at, you eventually stop &lt;em&gt;noticing&lt;/em&gt; you used to be good at.&lt;/p&gt;

&lt;p&gt;This is not a luddite essay. I want this technology to work. OpenClaw, specifically, is one of the more honest things in the personal AI space — file-first, locally hosted, legible memory, open-source ethos. If any agent framework is going to be defensible five years from now, it is probably this one. But that is exactly why the question matters more here than it does for some hosted SaaS chatbot. OpenClaw is not a toy. It is built to actually live in your life. And the things it is built to absorb are not random — they are a specific class of cognition that, until very recently, you did yourself.&lt;/p&gt;

&lt;p&gt;So: what is in that class? Three things, because I think the conversation needs the vocabulary.&lt;/p&gt;

&lt;h2&gt;
  
  
  The first thing: the friction that makes you decide
&lt;/h2&gt;

&lt;p&gt;It is tempting to treat every recurring annoyance in your life as something to automate away. Bills. Calendar conflicts. Inbox triage. Deadline tracking. Grocery lists. Household coordination. The agent handles them, the annoyance goes away, the win goes on the board.&lt;/p&gt;

&lt;p&gt;Friction is not always a bug.&lt;/p&gt;

&lt;p&gt;The reason you used to look at your bills before paying them is not that you enjoyed the experience. It is that the act of looking — even for two seconds — sometimes caught the thing that mattered. The duplicate charge. The subscription you forgot you had. The number that was higher than last month and signaled something upstream in your life was off. The five seconds of friction was a sampling pass on your own financial reality, run weekly, for free.&lt;/p&gt;

&lt;p&gt;When you build an agent that summarizes the bills and tells you the total, you have not just removed the friction. You have removed the sampling pass. The summary will tell you what the agent thinks is interesting. It will not tell you what &lt;em&gt;you&lt;/em&gt; would have thought was interesting if you had looked, because you no longer have the muscle to know.&lt;/p&gt;

&lt;p&gt;This is not theoretical. It is the same pattern that GPS did to your sense of direction, that autocomplete did to your spelling, and that calculators did to your arithmetic. In each case the technology was net-positive. In each case something specific and unrecoverable was traded away. We made those trades half-consciously because we did not have a vocabulary for what was on the other side of the ledger.&lt;/p&gt;

&lt;p&gt;The personal-agent generation is making bigger trades, faster, with even less vocabulary.&lt;/p&gt;

&lt;h2&gt;
  
  
  The second thing: the practice of small decisions
&lt;/h2&gt;

&lt;p&gt;There is a category of decision that is too small to think about and too consequential to skip.&lt;/p&gt;

&lt;p&gt;What to reply to that ambiguous Slack message. Whether the email from your landlord needs a same-day response or can wait until Monday. Whether the meeting your colleague proposed at 4 p.m. is one you should accept or politely deflect. Whether the calendar conflict your assistant just flagged is a real conflict or one of those situations where it is fine to be ten minutes late to the second thing.&lt;/p&gt;

&lt;p&gt;Personal agents are very good at the first 80% of these decisions and quietly bad at the last 20%. The first 80% — the obvious cases — is where they shine and where the demos look great. The last 20% — the cases that require taste, social calibration, and an accurate model of the specific humans involved — is where they fail in ways that do not show up in any benchmark, because the failure mode is &lt;em&gt;the agent did something locally reasonable that was globally wrong, and you did not notice until it was too late.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The deeper problem is that the small-decisions practice is &lt;em&gt;how taste is built in the first place&lt;/em&gt;. You develop a sense for which Slack messages need a careful reply by replying to a thousand of them, badly at first, and getting feedback from how the relationship went. If your agent handles the first nine hundred and fifty, you arrive at message nine hundred and fifty-one with the calibration of a beginner.&lt;/p&gt;

&lt;p&gt;The framing of "delegate the boring stuff and focus on the important stuff" assumes three things: that the boring stuff and the important stuff are clearly separated, that the boring stuff does not feed into the important stuff, and that you can train the agent on the boring stuff without losing access to the inputs that would have eventually made you good at the important stuff. None of these assumptions survive contact with how human skill actually develops.&lt;/p&gt;

&lt;h2&gt;
  
  
  The third thing: the silence in which you notice you were wrong
&lt;/h2&gt;

&lt;p&gt;This one is harder to name and I think it is the most important.&lt;/p&gt;

&lt;p&gt;Right now, when you have a thought that is incomplete, a plan that is half-formed, or an instinct that something is off, there is a natural waiting period. You sit with it. You go for a walk. You stare at the ceiling for an hour. Eventually, sometimes, the thing resolves. You realize the project you were excited about is actually a bad idea. You realize the email you drafted last night was angrier than you intended. You realize the person you were going to call does not actually need a call from you. They need space.&lt;/p&gt;

&lt;p&gt;This kind of cognition does not happen in language. It happens in the gaps between language. It is what your nervous system does when nothing is asking it for output.&lt;/p&gt;

&lt;p&gt;Personal AI agents are, by their nature, output machines. They want to be useful. They want to give you something. The honest, well-built ones — and OpenClaw is honest and well-built — are designed to be proactive, to surface things, to ping you with the briefing, to suggest the next step. The whole pitch is that they fill the gaps.&lt;/p&gt;

&lt;p&gt;But the gaps were doing work.&lt;/p&gt;

&lt;p&gt;The morning before you check your phone. The walk to the coffee shop where you have not yet asked the agent anything. The half-hour of unstructured staring before the meeting. These are not inefficiencies in your life that an agent should be optimizing away. They are the conditions under which your slower, more honest cognition can operate. Compressing them does not give you back time. It gives you back the same amount of time, minus the part of your mind that needed the silence to work.&lt;/p&gt;

&lt;p&gt;This is the trade nobody in the personal AI space wants to look at directly, because looking at it threatens the entire growth story. If the value of the agent is partly a function of what it disrupts in your inner life, and if some of what it disrupts is irreplaceable, then the unbounded "delegate everything" pitch starts to look less like a productivity story and more like a deal you should sign carefully.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to actually do
&lt;/h2&gt;

&lt;p&gt;Use OpenClaw. I mean that. The category is real, the project is good, and the alternative — keeping your data with hosted platforms whose pricing pages will change without your consent — is worse on almost every axis.&lt;/p&gt;

&lt;p&gt;But sign the deal carefully. The rule I would actually follow is the simplest one I can write down:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Pick the offloads where the friction is genuinely friction.&lt;br&gt;
Keep the tasks where the friction is doing work.&lt;br&gt;
Leave the gaps alone.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The first one is for things where the human cost is high and the cognitive value is zero. Receipt parsing. Standard meeting confirmations. Repetitive document formatting. Things that genuinely should have been a script.&lt;/p&gt;

&lt;p&gt;The second one is for the things where the friction is the point. Read your own bills. Reply to your own ambiguous Slack messages, at least most of them. Look at your own calendar before you ask the agent to look at it. Treat the small-decisions practice like a gym membership — something you do not because you cannot afford the alternative, but because you understand what your body becomes if you stop using it.&lt;/p&gt;

&lt;p&gt;The third one is the hardest, because the agent is built to fill the gaps and your brain is built to let it. The morning before your first meeting. The walk where you have not yet opened a chat. The half-hour of unstructured staring. Leave them alone. The silence is not a bug to be fixed. It is the thing keeping the rest of it alive.&lt;/p&gt;

&lt;p&gt;Personal AI is going to be one of the largest technology shifts of the next decade, and OpenClaw is going to be in the middle of it. The question is not whether to participate. It is what you intend to keep, and what you are quietly agreeing to give up.&lt;/p&gt;

&lt;p&gt;Most of the conversation right now is an accounting of the gains.&lt;/p&gt;

&lt;p&gt;Somebody should account for the rest.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;What's an offload you regret? Or one you almost made and pulled back from? I'd genuinely like to hear it.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devchallenge</category>
      <category>openclawchallenge</category>
      <category>discuss</category>
      <category>ai</category>
    </item>
    <item>
      <title>What is AWS Kiro and Why it Matters for Agentic Development</title>
      <dc:creator>Jubin Soni</dc:creator>
      <pubDate>Sat, 25 Apr 2026 06:37:19 +0000</pubDate>
      <link>https://dev.to/jubinsoni/what-is-aws-kiro-and-why-it-matters-for-agentic-development-18kd</link>
      <guid>https://dev.to/jubinsoni/what-is-aws-kiro-and-why-it-matters-for-agentic-development-18kd</guid>
      <description>&lt;p&gt;The evolution of Artificial Intelligence has transitioned from passive chat interfaces to active, autonomous agents. This shift, known as agentic development, requires a fundamental rethink of cloud infrastructure. In traditional AI workflows, a single request is sent to a Large Language Model (LLM), and a response is received. In agentic workflows, dozens or even hundreds of small, specialized agents must communicate, share state, and access tools in real-time. This creates a massive networking and latency bottleneck that standard REST-based architectures cannot handle.&lt;/p&gt;

&lt;p&gt;Enter &lt;strong&gt;AWS Kiro&lt;/strong&gt;. AWS Kiro (Kernel-Integrated Runtime Orchestrator) is a specialized, high-performance infrastructure layer designed specifically for the orchestration of multi-agent systems. It moves beyond the limitations of standard container orchestration to provide a low-latency, state-aware environment where agents can thrive. This article provides a deep dive into what AWS Kiro is, how it works, and why it is the missing piece for the next generation of AI development.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Infrastructure Gap in Agentic AI
&lt;/h2&gt;

&lt;p&gt;To understand why AWS Kiro matters, we must first look at the unique requirements of agentic systems. Unlike a simple web application, an agentic system involves:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;High Concurrency&lt;/strong&gt;: Multiple agents (e.g., a Researcher, a Writer, and a Fact-Checker) working simultaneously.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;State Persistence&lt;/strong&gt;: Agents need to remember what they were doing across thousands of small sub-tasks.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Low Latency Inter-Agent Communication&lt;/strong&gt;: If Agent A needs to wait 500ms for a response from Agent B, a chain of 10 agent calls becomes prohibitively slow.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Tool-Heavy Execution&lt;/strong&gt;: Agents frequently call external APIs, databases, and code execution sandboxes.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Traditional AWS services like Lambda or Fargate are excellent for general-purpose compute but often introduce "cold start" latencies or networking overhead that degrade agent performance. AWS Kiro was built to minimize this overhead by integrating the agent runtime closer to the hardware kernel and optimizing the networking stack for small, frequent packets of data common in agent communication.&lt;/p&gt;




&lt;h2&gt;
  
  
  Architecture Deep Dive: How AWS Kiro Works
&lt;/h2&gt;

&lt;p&gt;At its core, AWS Kiro utilizes a specialized virtualization layer that sits on top of the AWS Nitro System. It abstracts the complexities of agent coordination, providing what AWS calls a "Global Shared Memory Space" (GSMS). This allows agents running in different execution environments to share context without the latency of an external database like Redis.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Kiro Control Plane and Data Plane
&lt;/h3&gt;

&lt;p&gt;The architecture is split into two primary components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt; &lt;strong&gt;Kiro Control Plane&lt;/strong&gt;: Manages agent lifecycles, task decomposition, and scheduling.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Kiro Data Plane (The Fabric)&lt;/strong&gt;: Handles high-speed message passing and shared state access using RDMA (Remote Direct Memory Access) over Converged Ethernet (RoCE).&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Diagram 1: Multi-Agent Interaction via AWS Kiro
&lt;/h3&gt;

&lt;p&gt;This sequence diagram illustrates how a user request is decomposed into multiple agent tasks through the Kiro fabric, highlighting the sub-millisecond coordination between the Orchestrator and worker agents.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpophikdneags5agv7b7i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpophikdneags5agv7b7i.png" alt="Multi-Agent Interaction description" width="800" height="415"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this flow, notice that A1 and A2 do not call each other directly via REST. Instead, they interact with the &lt;strong&gt;Global Shared Memory (GSMS)&lt;/strong&gt; provided by Kiro. This reduces the serialization/deserialization overhead and allows for O(1) time complexity when accessing shared context, regardless of how many agents are involved.&lt;/p&gt;




&lt;h2&gt;
  
  
  Key Features of AWS Kiro
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Kernel-Integrated Tool Execution
&lt;/h3&gt;

&lt;p&gt;Standard agents often struggle with the latency of spinning up a sandbox to execute code. AWS Kiro uses "Micro-Enclaves"—lightweight, isolated environments that share a kernel with the Kiro runtime. This allows an agent to go from "thinking" to "executing Python code" in less than 5ms.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Predictive Context Pre-fetching
&lt;/h3&gt;

&lt;p&gt;Kiro uses machine learning to predict which piece of historical context an agent might need next. If Agent B usually follows Agent A, Kiro will pre-fetch Agent A’s output into the local cache of the node where Agent B is scheduled to run.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Native Bedrock Integration
&lt;/h3&gt;

&lt;p&gt;While Kiro handles the infrastructure, it is tightly coupled with &lt;strong&gt;Amazon Bedrock&lt;/strong&gt;. It can automatically pull model weights for smaller, specialized models (like Llama 3 or Mistral) into local memory to further reduce inference latency during agentic loops.&lt;/p&gt;




&lt;h2&gt;
  
  
  Comparing Architectures: Traditional vs. AWS Kiro
&lt;/h2&gt;

&lt;p&gt;To see the value proposition, let's compare a standard agent implementation (using Lambda and S3/Redis for state) against an AWS Kiro-native implementation.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Traditional Agent (Lambda + Redis)&lt;/th&gt;
&lt;th&gt;AWS Kiro-Native Agent&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Inter-Agent Latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;50ms - 200ms (HTTP/TLS)&lt;/td&gt;
&lt;td&gt;&amp;lt; 2ms (RDMA/Shared Memory)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;State Management&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;External (Redis/DynamoDB)&lt;/td&gt;
&lt;td&gt;Native (Global Shared Memory)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cold Start&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Significant (200ms - 2s)&lt;/td&gt;
&lt;td&gt;Minimal (&amp;lt; 10ms via Micro-Enclaves)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Context Window Handling&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Manual truncation/storage&lt;/td&gt;
&lt;td&gt;Automatic predictive pre-fetching&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Scalability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Limited by database IOPS&lt;/td&gt;
&lt;td&gt;Linearly scalable across Kiro Fabric&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Task Decomposition Logic
&lt;/h2&gt;

&lt;p&gt;A critical part of agentic development is how a complex task is broken down. AWS Kiro provides a built-in "Router" that uses a cost-benefit analysis to determine if a task should be handled by a single large model or a swarm of smaller agents.&lt;/p&gt;

&lt;h3&gt;
  
  
  Diagram 2: Kiro Task Routing Flowchart
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F388orn49q66ci5y5i570.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F388orn49q66ci5y5i570.png" alt="Kiro Task Routing Flowchart" width="800" height="1151"&gt;&lt;/a&gt;&lt;/p&gt;
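
&lt;p&gt;To make the routing decision concrete, here is a minimal Python sketch of the kind of cost-benefit heuristic the flowchart describes. The function name, inputs, and thresholds are hypothetical illustrations of the idea, not part of any published Kiro SDK.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical sketch: the names and thresholds below are illustrative only.
def route_task(complexity: float, parallelizable: bool, latency_budget_ms: int) -&amp;gt; str:
    """Decide whether a task goes to one large model or a swarm of smaller agents."""
    if complexity &amp;lt; 0.3:
        return "single-small-model"   # trivial task: cheapest, fastest path
    if parallelizable and latency_budget_ms &amp;lt; 2000:
        return "agent-swarm"          # decompose and fan out across the fabric
    return "single-large-model"       # deep, sequential reasoning

print(route_task(complexity=0.7, parallelizable=True, latency_budget_ms=1500))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;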




&lt;h2&gt;
  
  
  Practical Code Example: Implementing a Kiro-Enabled Agent
&lt;/h2&gt;

&lt;p&gt;To use AWS Kiro, developers typically use the AWS SDK (Boto3) with specific extensions for the Kiro runtime. Below is a Python example of how you would initialize a Kiro session and register agents that share a memory space.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;kiro_runtime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;KiroSession&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AgentNode&lt;/span&gt;

&lt;span class="c1"&gt;# Initialize the Kiro Client
&lt;/span&gt;&lt;span class="n"&gt;kiro&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;kiro&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 1. Create a Kiro Session with Shared Memory
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;setup_agentic_environment&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;kiro&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_session&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;SessionName&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MarketAnalysisSystem&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;MemoryType&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;high_performance&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;SharedContext&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;SessionArn&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# 2. Define an Agent Node
# This agent will live within the Kiro Fabric for low-latency access
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ResearchAgent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;AgentNode&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;session_arn&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;super&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session_arn&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;role&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Researcher&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Writing to Shared Memory is nearly instantaneous in Kiro
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write_shared_memory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;current_query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Tool call via Kiro's Micro-Enclave
&lt;/span&gt;        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;web_search&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;q&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write_shared_memory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;search_results&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Search completed.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;# 3. Orchestration
&lt;/span&gt;&lt;span class="n"&gt;session_arn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;setup_agentic_environment&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;researcher&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ResearchAgent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session_arn&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Execution within the fabric
&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;researcher&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Latest trends in AWS Kiro&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Agent Status: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Code Breakdown:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt; &lt;code&gt;kiro.create_session&lt;/code&gt;: This allocates a segment of the high-speed fabric specifically for your agents. The &lt;code&gt;SharedContext=True&lt;/code&gt; flag enables the GSMS, allowing all agents in this session to read/write to the same memory space in O(1) time.&lt;/li&gt;
&lt;li&gt; &lt;code&gt;AgentNode&lt;/code&gt;: This is a specialized class that inherits from Kiro’s runtime, providing methods like &lt;code&gt;write_shared_memory&lt;/code&gt; and &lt;code&gt;execute_tool&lt;/code&gt; which bypass the standard networking stack.&lt;/li&gt;
&lt;li&gt; &lt;code&gt;execute_tool&lt;/code&gt;: Instead of a standard API call, this triggers a micro-enclave execution within the same hardware cluster.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Agent Lifecycle in AWS Kiro
&lt;/h2&gt;

&lt;p&gt;Agents in Kiro are not just short-lived functions; they are stateful entities that transition through a defined set of lifecycle states. Managing these transitions is vital for ensuring that agents don't hang or consume resources unnecessarily.&lt;/p&gt;

&lt;h3&gt;
  
  
  Diagram 3: Kiro Agent State Machine
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3gbvg0k3jm33xwt744yj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3gbvg0k3jm33xwt744yj.png" alt="State Diagram" width="786" height="462"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This state machine ensures that agents are "Hibernated" when not in use. Unlike a Lambda function that shuts down, a Hibernated Kiro agent keeps its local cache in the fabric's memory, allowing it to "Wake-up" and resume work in milliseconds without re-loading the model context.&lt;/p&gt;
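
&lt;p&gt;A toy Python version of this lifecycle makes the constraint explicit: an agent can wake from Hibernated back into Running, but never resurrect from Terminated. The state names here are inferred from the description above, not taken from an official SDK.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from enum import Enum, auto

# Hypothetical sketch; state names are assumed from the article's description.
class AgentState(Enum):
    PROVISIONING = auto()
    RUNNING = auto()
    HIBERNATED = auto()   # local cache stays warm in the fabric's memory
    TERMINATED = auto()

ALLOWED = {
    AgentState.PROVISIONING: {AgentState.RUNNING},
    AgentState.RUNNING: {AgentState.HIBERNATED, AgentState.TERMINATED},
    AgentState.HIBERNATED: {AgentState.RUNNING, AgentState.TERMINATED},
    AgentState.TERMINATED: set(),
}

def transition(current: AgentState, target: AgentState) -&amp;gt; AgentState:
    """Raise on transitions the lifecycle forbids (e.g., Terminated to Running)."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.name} -&amp;gt; {target.name}")
    return target

state = transition(AgentState.HIBERNATED, AgentState.RUNNING)  # millisecond wake-up
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;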




&lt;h2&gt;
  
  
  Why AWS Kiro Matters for the Future
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Solving the "Thinking Time" Problem
&lt;/h3&gt;

&lt;p&gt;As LLMs move toward "Reasoning" models (like OpenAI's o1 series), the "thinking time" increases. However, the system overhead (networking, state management) shouldn't add to that. Kiro ensures that the only latency developers face is the actual inference time of the model.&lt;/p&gt;

&lt;h3&gt;
  
  
  Massive Parallelism
&lt;/h3&gt;

&lt;p&gt;In a complex supply chain agentic system, you might have 500 agents representing different vendors. AWS Kiro allows these 500 agents to coordinate in a single fabric. In a standard architecture, 500 agents would create a "thundering herd" problem for your database; in Kiro, the shared memory fabric handles the contention using hardware-level locking mechanisms.&lt;/p&gt;

&lt;h3&gt;
  
  
  Security and Governance
&lt;/h3&gt;

&lt;p&gt;When agents act on your behalf, security is paramount. Kiro’s micro-enclaves provide cryptographic isolation. Even if Agent A is compromised by a prompt injection, it cannot access the memory space of Agent B unless explicitly permitted by the Kiro Control Plane's IAM policies.&lt;/p&gt;




&lt;h2&gt;
  
  
  Implementation Strategy: Moving to Kiro
&lt;/h2&gt;

&lt;p&gt;If you are currently building agents using LangChain or AutoGPT on standard AWS infrastructure, the migration to Kiro involves three steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Context Migration&lt;/strong&gt;: Move your state storage from external databases (Redis/Dynamo) to Kiro Shared Memory.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Tool Refactoring&lt;/strong&gt;: Re-package your tools as Kiro-compatible Micro-Enclaves to take advantage of the kernel-integrated execution.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Topology Definition&lt;/strong&gt;: Instead of individual functions, define an "Agent Topology" that describes how agents are grouped within the Kiro fabric.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;AWS Kiro represents a significant leap forward for the AI ecosystem. By treating "Agency" as a first-class citizen of cloud infrastructure, AWS has removed the friction that previously made multi-agent systems slow and expensive. Whether you are building an autonomous coding assistant, a market research swarm, or a complex robotic process automation system, AWS Kiro provides the high-performance backbone required for true autonomy.&lt;/p&gt;

&lt;p&gt;As LLMs become more capable of reasoning, the infrastructure must become more capable of coordination. AWS Kiro is precisely the fabric that will hold these autonomous systems together, ensuring that the future of AI is not just intelligent, but also incredibly fast and scalable.&lt;/p&gt;




&lt;h2&gt;
  
  
  Further Reading &amp;amp; Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/ec2/nitro/" rel="noopener noreferrer"&gt;AWS Nitro System Official Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/bedrock/latest/userguide/agents.html" rel="noopener noreferrer"&gt;Amazon Bedrock Agents User Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.deeplearning.ai/the-batch/issue-242/" rel="noopener noreferrer"&gt;The Rise of Agentic Workflows by Andrew Ng&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/hpc/networking/" rel="noopener noreferrer"&gt;High Performance Networking on AWS&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://research.google/pubs/archive/44534/" rel="noopener noreferrer"&gt;Scalable Agentic AI Systems Architecture&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>aws</category>
      <category>kiro</category>
      <category>agents</category>
      <category>ai</category>
    </item>
    <item>
      <title>5 Ways Azure AI Search is Revolutionizing Enterprise RAG Architectures</title>
      <dc:creator>Jubin Soni</dc:creator>
      <pubDate>Sat, 25 Apr 2026 06:03:01 +0000</pubDate>
      <link>https://dev.to/jubinsoni/5-ways-azure-ai-search-is-revolutionizing-enterprise-rag-architectures-5d5e</link>
      <guid>https://dev.to/jubinsoni/5-ways-azure-ai-search-is-revolutionizing-enterprise-rag-architectures-5d5e</guid>
      <description>&lt;p&gt;In the rapidly evolving landscape of Generative AI, the transition from experimental Proof of Concepts (POCs) to production-grade applications is the most significant hurdle for enterprises today. At the heart of this transition lies Retrieval-Augmented Generation (RAG). While the "Generation" part—handled by Large Language Models (LLMs) like GPT-4—is often the focus, the quality of the "Retrieval" determines whether an AI application provides value or hallucinates incorrect information.&lt;/p&gt;

&lt;p&gt;Azure AI Search (formerly known as Azure Cognitive Search) has emerged as a powerhouse in this space. By moving beyond simple vector databases and offering a comprehensive information retrieval platform, it addresses the unique challenges of the enterprise: scale, security, and precision. In this article, we will deep-dive into the five key ways Azure AI Search is improving enterprise RAG, backed by technical architecture, code examples, and performance insights.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Advanced Hybrid Retrieval: Beyond Simple Vector Search
&lt;/h2&gt;

&lt;p&gt;Most basic RAG implementations rely solely on vector search (k-nearest neighbors). While vectors are excellent at capturing semantic meaning (e.g., understanding that "canine" and "dog" are related), they often fail at specific keyword matching, such as product serial numbers, obscure acronyms, or specific part codes. &lt;/p&gt;

&lt;p&gt;Azure AI Search solves this through &lt;strong&gt;Hybrid Retrieval&lt;/strong&gt;, which combines full-text search (BM25 algorithm) with vector search (HNSW algorithm) in a single query. The results are then fused using &lt;strong&gt;Reciprocal Rank Fusion (RRF)&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  How Reciprocal Rank Fusion (RRF) Works
&lt;/h3&gt;

&lt;p&gt;RRF is an algorithm that combines multiple ranked lists (one from keyword search, one from vector search) into a single unified ranking. It doesn't require the scores from the different systems to be on the same scale. The RRF score of a document &lt;code&gt;d&lt;/code&gt; is:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;score(d) = sum_i 1 / (k + rank_i(d))&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;k&lt;/code&gt; is a constant (usually 60) that mitigates the impact of high-ranking results from a single source.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;rank_i&lt;/code&gt; is the position of the document in the i-th list.&lt;/li&gt;
&lt;/ul&gt;
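
&lt;p&gt;To make the fusion concrete, here is a minimal, self-contained Python implementation of RRF over two ranked lists of document IDs. This is a sketch of the algorithm itself, not the Azure SDK's internal code.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse ranked lists of document IDs into one ranking (best first).

    Each input list is ordered best-first; rank positions are 1-based.
    """
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# "D2" ranks well in both lists, so it comes out on top after fusion.
keyword_hits = ["D1", "D2", "D3"]   # BM25 result order
vector_hits = ["D2", "D4", "D1"]    # HNSW result order
print(reciprocal_rank_fusion([keyword_hits, vector_hits]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;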

&lt;h3&gt;
  
  
  Flowchart: Hybrid Retrieval Logic
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffht073usq4l5cvtlnvxk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffht073usq4l5cvtlnvxk.png" alt="Flowchart Diagram" width="523" height="845"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Practical Implementation: Hybrid Query
&lt;/h3&gt;

&lt;p&gt;Using the Azure AI Search Python SDK, a hybrid query is constructed by providing both a vector and a text string.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;azure.search.documents&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SearchClient&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;azure.search.documents.models&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;VectorizedQuery&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;azure.core.credentials&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AzureKeyCredential&lt;/span&gt;

&lt;span class="c1"&gt;# Configuration
&lt;/span&gt;&lt;span class="n"&gt;endpoint&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://your-service-name.search.windows.net&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-api-key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;index_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;enterprise-docs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SearchClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;AzureKeyCredential&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# User input
&lt;/span&gt;&lt;span class="n"&gt;query_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is the warranty period for the X-1500 sensor?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;query_vector&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_embedding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Helper function to get embeddings
&lt;/span&gt;
&lt;span class="c1"&gt;# Perform Hybrid Search
&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;search_text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;query_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;vector_queries&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;VectorizedQuery&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;query_vector&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k_nearest_neighbors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fields&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content_vector&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)],&lt;/span&gt;
    &lt;span class="n"&gt;select&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;category&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;top&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Score: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;@search.score&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; - Title: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
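
&lt;p&gt;The snippet above assumes a &lt;code&gt;get_embedding&lt;/code&gt; helper. One minimal way to implement it against an Azure OpenAI embedding deployment might look like this; the endpoint, key, API version, and deployment name are placeholders you would substitute with your own.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from openai import AzureOpenAI

# Placeholder credentials; substitute your own Azure OpenAI resource values.
embedding_client = AzureOpenAI(
    azure_endpoint="https://my-openai-resource.openai.azure.com",
    api_key="your-openai-api-key",
    api_version="2024-02-01",
)

def get_embedding(text: str) -&amp;gt; list:
    """Return the embedding vector for a piece of text."""
    response = embedding_client.embeddings.create(
        model="text-embedding-3-small",  # name of your embedding deployment
        input=text,
    )
    return response.data[0].embedding
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;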






&lt;h2&gt;
  
  
  2. The Power of Semantic Ranking (L3 Reranking)
&lt;/h2&gt;

&lt;p&gt;While Hybrid Search significantly improves recall, the enterprise often needs extreme precision. Azure AI Search integrates a "Semantic Ranker"—a technology derived from Bing’s core search engine. &lt;/p&gt;

&lt;h3&gt;
  
  
  The Reranking Hierarchy
&lt;/h3&gt;

&lt;p&gt;In a typical search flow, the system handles thousands of documents. To be efficient, it uses a tiered approach:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;L1 (Retrieval):&lt;/strong&gt; Fast filtering (Keyword/Vector) to get the top 1,000 documents.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;L2 (RRF):&lt;/strong&gt; Merging keyword and vector results.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;L3 (Semantic Ranking):&lt;/strong&gt; A cross-encoder model that looks at the actual meaning of the top 50 results and re-scores them based on context.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Unlike traditional bi-encoders used in vector search (which compute similarity between a query embedding and a document embedding), the Semantic Ranker uses a cross-encoder that processes the query and the document snippet &lt;em&gt;together&lt;/em&gt;. This allows it to capture nuances like negation and complex relationships that vector similarity might miss.&lt;/p&gt;
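
&lt;p&gt;Continuing from the hybrid query snippet in section 1 (reusing &lt;code&gt;client&lt;/code&gt;, &lt;code&gt;query_text&lt;/code&gt;, and &lt;code&gt;query_vector&lt;/code&gt;), enabling the L3 tier is a small change, assuming the index defines a semantic configuration named &lt;code&gt;my-semantic-config&lt;/code&gt; as in the index definition later in this article:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from azure.search.documents.models import VectorizedQuery

# Hybrid retrieval (L1) + RRF (L2) + semantic reranking (L3) in a single call.
results = client.search(
    search_text=query_text,
    vector_queries=[VectorizedQuery(vector=query_vector, k_nearest_neighbors=50,
                                    fields="content_vector")],
    query_type="semantic",
    semantic_configuration_name="my-semantic-config",
    top=5,
)

for result in results:
    # The cross-encoder's score is surfaced separately from the RRF score.
    print(result["@search.reranker_score"], result["title"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;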

&lt;h3&gt;
  
  
  Comparison Table: Retrieval Strategies
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Strategy&lt;/th&gt;
&lt;th&gt;Pros&lt;/th&gt;
&lt;th&gt;Cons&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Keyword (BM25)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Fast, exact matches, low cost&lt;/td&gt;
&lt;td&gt;No semantic understanding&lt;/td&gt;
&lt;td&gt;Product IDs, codes, names&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Vector (HNSW)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Semantic nuance, multi-lingual&lt;/td&gt;
&lt;td&gt;"Cold start" issues, bad for jargon&lt;/td&gt;
&lt;td&gt;Concept-based questions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Hybrid (RRF)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Combines the best of both&lt;/td&gt;
&lt;td&gt;Higher latency than L1&lt;/td&gt;
&lt;td&gt;General purpose enterprise RAG&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Semantic Ranker&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Highest precision, handles nuance&lt;/td&gt;
&lt;td&gt;Highest latency/cost per query&lt;/td&gt;
&lt;td&gt;High-stakes decision support&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  3. Integrated Vectorization and Data Pipelines
&lt;/h2&gt;

&lt;p&gt;One of the biggest friction points in RAG is the "ETL for Embeddings" pipeline. Traditionally, developers had to write custom code to monitor data sources, chunk text, call embedding models, and push data to a vector store. &lt;/p&gt;

&lt;p&gt;Azure AI Search introduces &lt;strong&gt;Skillsets&lt;/strong&gt; and &lt;strong&gt;Indexers&lt;/strong&gt;, which automate this entire lifecycle. &lt;/p&gt;

&lt;h3&gt;
  
  
  The Integrated Pipeline Lifecycle
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;DataSource:&lt;/strong&gt; Connection to Blob Storage, SQL Server, or Cosmos DB.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Indexer:&lt;/strong&gt; A crawler that runs on a schedule.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Skillset:&lt;/strong&gt; A series of AI transformations. This can include:

&lt;ul&gt;
&lt;li&gt;  Document Cracking (extracting text from PDFs, Office docs).&lt;/li&gt;
&lt;li&gt;  Text Chunking (splitting text into manageable segments).&lt;/li&gt;
&lt;li&gt;  Azure OpenAI Embedding (converting chunks into vectors automatically).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
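
&lt;p&gt;As a sketch of how these pieces are wired together with the Python SDK: the data source and skillset named below are assumed to have been created already, and the service endpoint and key are placeholders.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from datetime import timedelta
from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexerClient
from azure.search.documents.indexes.models import IndexingSchedule, SearchIndexer

endpoint = "https://your-service-name.search.windows.net"
key = "your-api-key"
indexer_client = SearchIndexerClient(endpoint, AzureKeyCredential(key))

# Wire an existing data source, skillset, and index into one scheduled crawl.
indexer = SearchIndexer(
    name="enterprise-docs-indexer",
    data_source_name="blob-datasource",          # assumed to exist already
    skillset_name="chunk-and-embed-skillset",    # assumed to exist already
    target_index_name="enterprise-docs",
    schedule=IndexingSchedule(interval=timedelta(hours=1)),
)
indexer_client.create_or_update_indexer(indexer)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;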

&lt;h3&gt;
  
  
  Sequence Diagram: Integrated Indexing Flow
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnx2qaowv9hehnuyd9wam.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnx2qaowv9hehnuyd9wam.png" alt="Sequence Diagram" width="800" height="453"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Code Snippet: Defining an Integrated Vectorizer
&lt;/h3&gt;

&lt;p&gt;This JSON snippet represents how a vectorizer is defined within an index, allowing the search service to handle the embedding generation during both ingestion and query time.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"vectorizers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"my-openai-vectorizer"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"kind"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"azureOpenAI"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"azureOpenAIParameters"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"resourceUri"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://my-openai-resource.openai.azure.com"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"deploymentId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"text-embedding-3-small"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"apiKey"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;api-key&amp;gt;"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  4. Scaling Vector Search with HNSW and Disk-Based Indexing
&lt;/h2&gt;

&lt;p&gt;Enterprise data isn't just a few thousand documents; it’s often millions of records. Most vector databases struggle with the memory-to-cost ratio because they keep all vectors in RAM to ensure speed.&lt;/p&gt;

&lt;p&gt;Azure AI Search uses the &lt;strong&gt;Hierarchical Navigable Small World (HNSW)&lt;/strong&gt; algorithm for vector indexing. HNSW creates a multi-layered graph where the top layers contain fewer nodes (for fast navigation) and the bottom layers contain all nodes (for precision). &lt;/p&gt;

&lt;h3&gt;
  
  
  Optimization Parameters
&lt;/h3&gt;

&lt;p&gt;When configuring HNSW in Azure AI Search, three parameters are critical for performance tuning:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;m:&lt;/strong&gt; The number of bi-directional links created for every new element during construction. A higher &lt;code&gt;m&lt;/code&gt; improves recall but increases index size and memory usage.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;efConstruction:&lt;/strong&gt; The number of nearest neighbors explored during index building. Increasing this improves the quality of the graph but increases indexing time.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;efSearch:&lt;/strong&gt; The number of nearest neighbors searched during a query. Increasing this improves recall at the cost of latency.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Azure AI Search has also introduced &lt;strong&gt;filtered vector search&lt;/strong&gt;. In an enterprise context, you rarely want to search the &lt;em&gt;entire&lt;/em&gt; index. You might want to search only "Documents from Department A created in 2023." Azure AI Search optimizes this by applying filters &lt;em&gt;during&lt;/em&gt; the vector navigation, rather than post-filtering, which significantly reduces the search space and improves latency.&lt;/p&gt;
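
&lt;p&gt;In the Python SDK, pre-filtering is requested via &lt;code&gt;vector_filter_mode&lt;/code&gt;. A sketch, continuing with the &lt;code&gt;client&lt;/code&gt; and &lt;code&gt;query_vector&lt;/code&gt; from section 1 and assuming the index has filterable &lt;code&gt;department&lt;/code&gt; and &lt;code&gt;created_year&lt;/code&gt; fields (those field names are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from azure.search.documents.models import VectorizedQuery

# Pure vector query, restricted to one department and year before traversal.
results = client.search(
    search_text=None,
    vector_queries=[VectorizedQuery(vector=query_vector, k_nearest_neighbors=10,
                                    fields="content_vector")],
    vector_filter_mode="preFilter",  # apply the filter during graph navigation
    filter="department eq 'A' and created_year eq 2023",
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;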

&lt;h3&gt;
  
  
  Complexity Analysis
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Vector Search (HNSW):&lt;/strong&gt; O(log n) average search time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Full-Text Search:&lt;/strong&gt; O(n) in worst case, but optimized with inverted indices.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Storage:&lt;/strong&gt; Azure AI Search can utilize disk-based storage for vectors, significantly lowering the Total Cost of Ownership (TCO) compared to purely in-memory databases.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  5. Enterprise-Grade Security and Governance
&lt;/h2&gt;

&lt;p&gt;For a RAG system to be production-ready in a regulated industry, it cannot be a "black box." It must adhere to strict security protocols. Azure AI Search integrates natively with the broader Microsoft security stack in three major ways:&lt;/p&gt;

&lt;h3&gt;
  
  
  A. Virtual Network (VNET) and Private Link
&lt;/h3&gt;

&lt;p&gt;Most vector databases are accessed over the public internet. Azure AI Search supports &lt;strong&gt;Private Endpoints&lt;/strong&gt;, ensuring that your data traffic never leaves the Microsoft backbone network. This is a non-negotiable requirement for many financial and healthcare institutions.&lt;/p&gt;

&lt;h3&gt;
  
  
  B. Role-Based Access Control (RBAC)
&lt;/h3&gt;

&lt;p&gt;Azure AI Search supports fine-grained RBAC. You can grant an application the right to query an index without giving it the right to delete data or view service keys. Furthermore, it supports &lt;strong&gt;User-Contextual Filtering&lt;/strong&gt;. If a user doesn't have permission to see "Document A" in SharePoint, the RAG system can use their identity token to filter "Document A" out of the search results automatically.&lt;/p&gt;

&lt;h3&gt;
  
  
  C. Integration with Microsoft Purview
&lt;/h3&gt;

&lt;p&gt;Data lineage is critical. By integrating with Microsoft Purview, enterprises can track how sensitive data (PII) flows from a data source into an index and eventually into an LLM response. This provides a layer of governance that is often missing in custom-built RAG stacks.&lt;/p&gt;




&lt;h2&gt;
  
  
  Putting It All Together: The Production RAG Architecture
&lt;/h2&gt;

&lt;p&gt;When we combine these five improvements, the architecture of an enterprise RAG system transforms from a fragile script into a robust platform. &lt;/p&gt;

&lt;h3&gt;
  
  
  The End-to-End Workflow
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Ingestion:&lt;/strong&gt; An Indexer pulls data from Azure SQL and Blob Storage. It uses a Skillset to chunk the text and call Azure OpenAI for embeddings. These are stored in an index with HNSW enabled.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Query:&lt;/strong&gt; A user asks a question via a web app. The web app calls Azure AI Search with a hybrid query (text + vector).&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Refinement:&lt;/strong&gt; Azure AI Search performs the hybrid search, applies security filters based on the user's ID, and uses the Semantic Ranker to find the top 5 most relevant chunks.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Generation:&lt;/strong&gt; These 5 chunks are sent to the LLM as context. Because the retrieval was so precise, the LLM provides a concise, accurate answer with minimal hallucination risk.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Sample Production-Ready Index Definition
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"enterprise-index"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"fields"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Edm.String"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"key"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Edm.String"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"searchable"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"content_vector"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Collection(Edm.Single)"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"searchable"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"retrievable"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"dimensions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1536&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"vectorSearchProfile"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"my-hsnw-profile"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"metadata_auth_group"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Edm.String"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"filterable"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"vectorSearch"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"algorithms"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"my-hsnw-config"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"kind"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"hnsw"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"hnswParameters"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"m"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"efConstruction"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;400&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"metric"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"cosine"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"profiles"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"my-hsnw-profile"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"algorithm"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"my-hsnw-config"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"vectorizer"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"my-openai-vectorizer"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"semantic"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"configurations"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"my-semantic-config"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"prioritizedFields"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"contentFields"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="nl"&gt;"fieldName"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
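
&lt;p&gt;To make the schema concrete, here is a minimal query sketch using the &lt;code&gt;azure-search-documents&lt;/code&gt; Python SDK that exercises everything defined above in a single request: the BM25 keyword leg, the vector leg against &lt;code&gt;content_vector&lt;/code&gt;, the &lt;code&gt;my-semantic-config&lt;/code&gt; reranker, and a security-trimming filter on &lt;code&gt;metadata_auth_group&lt;/code&gt;. The endpoint, key, and query text are placeholders, and the sketch assumes the index's vectorizer embeds the query text server-side:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizableTextQuery

# Placeholder endpoint, key, and index name: substitute your own
client = SearchClient(
    endpoint="https://&lt;your-service&gt;.search.windows.net",
    index_name="my-rag-index",
    credential=AzureKeyCredential("&lt;your-query-key&gt;"),
)

query = "how do I rotate service credentials?"

results = client.search(
    search_text=query,  # keyword (BM25) leg of the hybrid query
    vector_queries=[
        # Embedded server-side via the index's my-openai-vectorizer
        VectorizableTextQuery(text=query, k_nearest_neighbors=5, fields="content_vector")
    ],
    query_type="semantic",  # apply the semantic reranker on top of the fused results
    semantic_configuration_name="my-semantic-config",
    filter="metadata_auth_group eq 'engineering'",  # security trimming
    top=5,
)

for doc in results:
    print(doc["content"][:200])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;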



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Improving RAG at the enterprise level is not about finding a larger LLM; it is about building a better retrieval system. Azure AI Search provides the necessary tools—Hybrid Search, Semantic Ranking, Integrated Data Pipelines, Scalable Vector Indexing, and Enterprise Security—to bridge the gap between a demo and a mission-critical application.&lt;/p&gt;

&lt;p&gt;By leveraging the platform's ability to handle both unstructured text and high-dimensional vectors, while maintaining strict security boundaries, developers can build AI assistants that are not only smart but also reliable and safe for the corporate environment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further Reading &amp;amp; Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://learn.microsoft.com/en-us/azure/search/search-what-is-azure-search" rel="noopener noreferrer"&gt;Azure AI Search Official Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://techcommunity.microsoft.com/t5/ai-azure-ai-services-blog/azure-ai-search-outperforming-vector-search-with-hybrid/ba-p/3929167" rel="noopener noreferrer"&gt;Outperforming standard RAG with Hybrid Search and Semantic Ranking&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://learn.microsoft.com/en-us/azure/search/hybrid-search-ranking" rel="noopener noreferrer"&gt;Reciprocal Rank Fusion (RRF) Explained&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/1603.09320" rel="noopener noreferrer"&gt;Efficient and Robust Approximate Nearest Neighbor Search using HNSW&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Azure/azure-sdk-for-python/tree/main/sdk/search/azure-search-documents/samples" rel="noopener noreferrer"&gt;Azure AI Search Python SDK Samples&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;Connect with me: &lt;a href="https://linkedin.com/in/jubinsoni" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; | &lt;a href="https://twitter.com/sonijubin" rel="noopener noreferrer"&gt;Twitter/X&lt;/a&gt; | &lt;a href="https://github.com/jubins" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; | &lt;a href="https://jubinsoni.com" rel="noopener noreferrer"&gt;Website&lt;/a&gt;&lt;/p&gt;

</description>
      <category>azure</category>
      <category>rag</category>
      <category>vectorsearch</category>
      <category>generativeai</category>
    </item>
    <item>
      <title>S3 Vectors: How to build a RAG without a vector database</title>
      <dc:creator>Jubin Soni</dc:creator>
      <pubDate>Tue, 14 Apr 2026 19:37:23 +0000</pubDate>
      <link>https://dev.to/jubinsoni/s3-vectors-how-to-build-a-rag-without-a-vector-database-18i9</link>
      <guid>https://dev.to/jubinsoni/s3-vectors-how-to-build-a-rag-without-a-vector-database-18i9</guid>
      <description>&lt;p&gt;Every RAG tutorial follows the same script: embed your documents, spin up a vector database (Pinecone, Weaviate, pgvector, OpenSearch), manage its infrastructure, and pray the costs don't spiral. For most internal AI apps, this is overkill.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Amazon S3 Vectors&lt;/strong&gt; changes the equation. It's native vector storage built into S3 — no clusters, no provisioning, no idle compute. You store vectors like you store objects, query them with sub-100ms latency, and pay per use. It went GA in December 2025 and now supports 2 billion vectors per index across 31+ AWS regions.&lt;/p&gt;

&lt;p&gt;This post walks through building a complete RAG pipeline using only S3 Vectors and Amazon Bedrock. No external vector database. ~50 lines of Python.&lt;/p&gt;




&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwjkwrmfrk709ap4e4m6v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwjkwrmfrk709ap4e4m6v.png" alt="Architecture description" width="800" height="337"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Three phases, two AWS services, zero infrastructure.&lt;/p&gt;




&lt;h2&gt;
  
  
  S3 Vectors vs Traditional Vector Databases
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;S3 Vectors&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;
&lt;strong&gt;Managed Vector DB&lt;/strong&gt; (e.g. OpenSearch, Pinecone)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Infrastructure&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;None — fully serverless&lt;/td&gt;
&lt;td&gt;Clusters, shards, replicas&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Scale&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2B vectors/index, 10K indexes/bucket&lt;/td&gt;
&lt;td&gt;Varies, often requires re-sharding&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Query latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~100ms (frequent), &amp;lt;1s (infrequent)&lt;/td&gt;
&lt;td&gt;~10-50ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cost model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Pay per PUT + storage + query&lt;/td&gt;
&lt;td&gt;Hourly/monthly compute + storage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cost at scale&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Up to 90% cheaper&lt;/td&gt;
&lt;td&gt;Idle compute adds up fast&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Metadata filtering&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Up to 50 keys, filterable by default&lt;/td&gt;
&lt;td&gt;Full query language&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Best for&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;RAG, agent memory, semantic search&lt;/td&gt;
&lt;td&gt;High-QPS production search, hybrid search&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;The tradeoff is clear:&lt;/strong&gt; S3 Vectors trades single-digit-ms latency for zero ops and dramatically lower cost. For internal RAG apps, agent memory, and moderate-QPS workloads, it's the better choice.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 1: Set Up S3 Vectors
&lt;/h2&gt;

&lt;p&gt;Create a vector bucket and index. You can do this in the console or via CLI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create a vector bucket&lt;/span&gt;
aws s3vectors create-vector-bucket &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--vector-bucket-name&lt;/span&gt; my-rag-bucket

&lt;span class="c"&gt;# Create a vector index (1024 dims for Titan Embeddings V2)&lt;/span&gt;
aws s3vectors create-vector-index &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--vector-bucket-name&lt;/span&gt; my-rag-bucket &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--index-name&lt;/span&gt; my-rag-index &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--dimension&lt;/span&gt; 1024 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--distance-metric&lt;/span&gt; cosine
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's your "database" — done in two commands.&lt;/p&gt;
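
&lt;p&gt;If you'd rather stay in Python, the same two operations are available through &lt;code&gt;boto3&lt;/code&gt;. This is a sketch: the operation and parameter names below follow my reading of the Boto3 S3Vectors reference (linked under Resources), so verify them against the current docs:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import boto3

s3vectors = boto3.client("s3vectors", region_name="us-west-2")

# Same as the two CLI commands above (operation names assumed; check the docs)
s3vectors.create_vector_bucket(vectorBucketName="my-rag-bucket")
s3vectors.create_index(
    vectorBucketName="my-rag-bucket",
    indexName="my-rag-index",
    dataType="float32",        # Titan Embeddings V2 returns float32 vectors
    dimension=1024,
    distanceMetric="cosine",
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;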




&lt;h2&gt;
  
  
  Step 2: Ingest Documents
&lt;/h2&gt;

&lt;p&gt;Here's the ingestion pipeline. We chunk text, embed each chunk with Titan Embeddings V2, and store vectors with metadata:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;uuid&lt;/span&gt;

&lt;span class="n"&gt;bedrock&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bedrock-runtime&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;region_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us-west-2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;s3vectors&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3vectors&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;region_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us-west-2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;BUCKET&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my-rag-bucket&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;INDEX&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my-rag-index&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;embed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Generate embeddings using Titan Text Embeddings V2.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bedrock&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;modelId&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amazon.titan-embed-text-v2:0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inputText&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;}),&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;())[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;embedding&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;chunk_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Split text into overlapping chunks.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;words&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;words&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;chunk_size&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;words&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;ingest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;doc_text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Chunk, embed, and store a document.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;chunk_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;doc_text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;vectors&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;vectors&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;::chunk-&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;float32&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;embed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;)},&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;metadata&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chunk_index&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# store original text for retrieval
&lt;/span&gt;            &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="c1"&gt;# PutVectors supports batches
&lt;/span&gt;    &lt;span class="n"&gt;s3vectors&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;put_vectors&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;vectorBucketName&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;BUCKET&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;indexName&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;INDEX&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;vectors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;vectors&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Ingested &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vectors&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; chunks from &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Usage:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;internal-docs.txt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;ingest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;internal-docs.txt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
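
&lt;p&gt;One production note: &lt;code&gt;put_vectors&lt;/code&gt; accepts batches, but there is a per-call cap on batch size (500 vectors at the time of writing; treat that number as an assumption and verify it against the current S3 Vectors quotas). A thin wrapper over the client and constants above keeps large documents safe:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;BATCH_SIZE = 500  # assumed per-call PutVectors cap; verify against current quotas


def put_vectors_batched(vectors: list[dict]):
    """Write vectors in slices so no single call exceeds the batch limit."""
    for start in range(0, len(vectors), BATCH_SIZE):
        s3vectors.put_vectors(
            vectorBucketName=BUCKET,
            indexName=INDEX,
            vectors=vectors[start : start + BATCH_SIZE],
        )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;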






&lt;h2&gt;
  
  
  Step 3: Query + Generate
&lt;/h2&gt;

&lt;p&gt;Now the RAG loop — embed the question, find similar chunks, and feed them to Claude:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;rag_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Full RAG pipeline: retrieve + generate.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="c1"&gt;# 1. Embed the question
&lt;/span&gt;    &lt;span class="n"&gt;query_vector&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;embed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# 2. Find similar chunks
&lt;/span&gt;    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;s3vectors&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query_vectors&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;vectorBucketName&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;BUCKET&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;indexName&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;INDEX&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;topK&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;queryVector&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;float32&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;query_vector&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="n"&gt;returnMetadata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;returnDistance&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# 3. Build context from retrieved chunks
&lt;/span&gt;    &lt;span class="n"&gt;context_parts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vectors&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;metadata&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;source&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;metadata&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;dist&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;distance&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;context_parts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[Source: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, Distance: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;dist&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;]&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;---&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context_parts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# 4. Generate answer with Claude
&lt;/span&gt;    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Answer the question based on the provided context. 
If the context doesn&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t contain enough information, say so.

## Context
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

## Question
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

## Answer&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bedrock&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;modelId&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us.anthropic.claude-sonnet-4-20250514&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic_version&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bedrock-2023-05-31&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
        &lt;span class="p"&gt;}),&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Usage:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;answer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;rag_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is our refund policy for enterprise customers?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's the entire RAG pipeline — &lt;strong&gt;~50 lines of actual logic&lt;/strong&gt;, no infrastructure.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 4: Metadata Filtering
&lt;/h2&gt;

&lt;p&gt;S3 Vectors supports filtering by metadata during queries. This is powerful for multi-tenant or multi-source RAG:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Only search chunks from a specific document
&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;s3vectors&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query_vectors&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;vectorBucketName&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;BUCKET&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;indexName&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;INDEX&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;topK&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;queryVector&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;float32&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;query_vector&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;returnMetadata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nb"&gt;filter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;eq&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;refund-policy.pdf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Filter operators include &lt;code&gt;$eq&lt;/code&gt;, &lt;code&gt;$ne&lt;/code&gt;, &lt;code&gt;$gt&lt;/code&gt;, &lt;code&gt;$gte&lt;/code&gt;, &lt;code&gt;$lt&lt;/code&gt;, &lt;code&gt;$lte&lt;/code&gt;, &lt;code&gt;$in&lt;/code&gt;, &lt;code&gt;$nin&lt;/code&gt;, and &lt;code&gt;$exists&lt;/code&gt;, plus the logical &lt;code&gt;$and&lt;/code&gt;/&lt;code&gt;$or&lt;/code&gt; combinators.&lt;/p&gt;
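
&lt;p&gt;Conditions can also be combined. For example, restricting retrieval to the early chunks of one document might look like this (operator spelling as assumed above):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# A specific source AND only its first ten chunks
combined_filter = {
    "$and": [
        {"source": {"$eq": "refund-policy.pdf"}},
        {"chunk_index": {"$lt": 10}},
    ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;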




&lt;h2&gt;
  
  
  Data Flow
&lt;/h2&gt;

&lt;p&gt;Here's how a query flows through the system end to end:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyb3anyrmlfk93awekhby.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyb3anyrmlfk93awekhby.png" alt="DF" width="800" height="463"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  When to Use S3 Vectors (and When Not To)
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Falkcou0y48vnz37fueey.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Falkcou0y48vnz37fueey.png" alt="S3 Vectors DT" width="800" height="1139"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use S3 Vectors when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You're building internal RAG apps, agent memory, or semantic search&lt;/li&gt;
&lt;li&gt;Query volume is moderate (not thousands of QPS)&lt;/li&gt;
&lt;li&gt;You want zero infrastructure management&lt;/li&gt;
&lt;li&gt;Cost matters more than single-digit-ms latency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use a dedicated vector DB when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You need &amp;lt;10ms query latency consistently&lt;/li&gt;
&lt;li&gt;You need hybrid search (keyword + semantic)&lt;/li&gt;
&lt;li&gt;Your QPS is in the hundreds or thousands&lt;/li&gt;
&lt;li&gt;You need advanced features like aggregations or faceted search&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use both (tiered):&lt;/strong&gt; S3 Vectors as cheap, durable storage + OpenSearch for hot queries. AWS supports this integration natively.&lt;/p&gt;




&lt;h2&gt;
  
  
  Integrating with Bedrock Knowledge Bases
&lt;/h2&gt;

&lt;p&gt;If you don't want to write the chunking and embedding code yourself, a Bedrock Knowledge Base can use S3 Vectors as its vector store directly:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6alt9s0jeajtg83h1922.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6alt9s0jeajtg83h1922.png" alt="Bedrock Knowledge Bases" width="800" height="228"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Just select "S3 Vectors" as the vector store when creating your Knowledge Base. Bedrock handles chunking, embedding, and storage automatically.&lt;/p&gt;
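
&lt;p&gt;Once the Knowledge Base has synced, querying it is a single call to the &lt;code&gt;bedrock-agent-runtime&lt;/code&gt; client. A sketch with hypothetical identifiers (the Knowledge Base ID and model ARN below are placeholders; substitute your own):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import boto3

agent_runtime = boto3.client("bedrock-agent-runtime", region_name="us-west-2")

response = agent_runtime.retrieve_and_generate(
    input={"text": "What is our refund policy for enterprise customers?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "KB12345678",  # hypothetical placeholder
            "modelArn": "arn:aws:bedrock:us-west-2::foundation-model/anthropic.claude-sonnet-4-20250514-v1:0",
        },
    },
)
print(response["output"]["text"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;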




&lt;h2&gt;
  
  
  Cleanup
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Delete the vector index&lt;/span&gt;
aws s3vectors delete-vector-index &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--vector-bucket-name&lt;/span&gt; my-rag-bucket &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--index-name&lt;/span&gt; my-rag-index

&lt;span class="c"&gt;# Delete the vector bucket&lt;/span&gt;
aws s3vectors delete-vector-bucket &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--vector-bucket-name&lt;/span&gt; my-rag-bucket
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/s3/features/vectors/" rel="noopener noreferrer"&gt;Amazon S3 Vectors — Product Page&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-vectors-getting-started.html" rel="noopener noreferrer"&gt;S3 Vectors Getting Started Tutorial&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-vectors.html" rel="noopener noreferrer"&gt;S3 Vectors User Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3vectors.html" rel="noopener noreferrer"&gt;Boto3 S3Vectors API Reference&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/blogs/aws/amazon-s3-vectors-now-generally-available-with-increased-scale-and-performance/" rel="noopener noreferrer"&gt;S3 Vectors GA Announcement — AWS Blog&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/awslabs/s3-vectors-embed-cli" rel="noopener noreferrer"&gt;S3 Vectors Embed CLI (GitHub)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/blogs/storage/building-self-managed-rag-applications-with-amazon-eks-and-amazon-s3-vectors/" rel="noopener noreferrer"&gt;Building RAG with EKS and S3 Vectors — AWS Blog&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>rag</category>
      <category>s3</category>
      <category>vectordatabase</category>
      <category>aws</category>
    </item>
    <item>
      <title>Mastering Gemma 4: A Comprehensive Deep Dive into Google's Next-Generation Open Model Architecture and Deployment</title>
      <dc:creator>Jubin Soni</dc:creator>
      <pubDate>Tue, 14 Apr 2026 17:53:14 +0000</pubDate>
      <link>https://dev.to/jubinsoni/mastering-gemma-4-a-comprehensive-deep-dive-into-googles-next-generation-open-model-architecture-2f91</link>
      <guid>https://dev.to/jubinsoni/mastering-gemma-4-a-comprehensive-deep-dive-into-googles-next-generation-open-model-architecture-2f91</guid>
      <description>&lt;p&gt;The landscape of Large Language Models (LLMs) has shifted dramatically from monolithic, proprietary APIs toward highly efficient, open-weight models that developers can run on commodity hardware. Google’s Gemma series has been at the forefront of this movement. With the release of Gemma 4, the industry sees a significant leap in performance-per-parameter, driven by advanced distillation techniques and architectural refinements that challenge models twice its size.&lt;/p&gt;

&lt;p&gt;In this deep dive, we will explore the technical underpinnings of Gemma 4, its unique training methodology, and practical strategies for integrating it into your production environment.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. The Evolution of Gemma: From 1.0 to 4.0
&lt;/h2&gt;

&lt;p&gt;Gemma 4 represents a synthesis of Google’s Gemini technology tailored for the open-source community. Unlike previous iterations that focused primarily on raw scale, Gemma 4 emphasizes "density of intelligence." By leveraging the same research and technology used in Gemini 1.5 Pro, Gemma 4 achieves state-of-the-art results in reasoning, coding, and multilingual understanding.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Architectural Pillars
&lt;/h3&gt;

&lt;p&gt;Gemma 4 is built upon a standard transformer decoder architecture but introduces several critical modifications:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Multi-Query Attention (MQA) and Grouped-Query Attention (GQA):&lt;/strong&gt; Optimized for memory efficiency and faster inference (a toy sketch follows this list).&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Sliding Window Attention (SWA):&lt;/strong&gt; Allows the model to handle longer contexts by focusing on local segments of the sequence while maintaining global coherence through layer-stacking.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Logit Soft-Capping:&lt;/strong&gt; Prevents logits from becoming too large, which stabilizes training and improves the effectiveness of distillation.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;RMSNorm and RoPE:&lt;/strong&gt; Utilizes Root Mean Square Layer Normalization and Rotary Positional Embeddings for improved numerical stability and better handling of sequence positioning.&lt;/li&gt;
&lt;/ol&gt;
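
&lt;p&gt;To make the first pillar concrete, here is a toy grouped-query attention sketch in PyTorch (an illustration of the idea, not Gemma's actual implementation): a small number of K/V heads is shared across groups of query heads, which shrinks the KV cache without giving up per-head queries.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import torch
import torch.nn.functional as F


def grouped_query_attention(q, k, v):
    """Toy GQA. Shapes: q is (B, n_q_heads, T, D); k and v are
    (B, n_kv_heads, T, D), with n_q_heads divisible by n_kv_heads."""
    group_size = q.shape[1] // k.shape[1]
    # Repeat each K/V head so every query head in a group shares it
    k = k.repeat_interleave(group_size, dim=1)
    v = v.repeat_interleave(group_size, dim=1)
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    return F.softmax(scores, dim=-1) @ v


# 8 query heads sharing 2 K/V heads: the KV cache is 4x smaller
B, T, D = 1, 16, 64
out = grouped_query_attention(
    torch.randn(B, 8, T, D), torch.randn(B, 2, T, D), torch.randn(B, 2, T, D)
)
print(out.shape)  # torch.Size([1, 8, 16, 64])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;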

&lt;h2&gt;
  
  
  2. Theoretical Foundations: The Power of Knowledge Distillation
&lt;/h2&gt;

&lt;p&gt;The defining characteristic of Gemma 4 is its reliance on Knowledge Distillation. Instead of training the model from scratch on raw web data alone, Google uses a larger, more capable "Teacher" model (from the Gemini family) to guide the training of the "Student" Gemma model.&lt;/p&gt;

&lt;h3&gt;
  
  
  How Distillation Works in Gemma 4
&lt;/h3&gt;

&lt;p&gt;In a standard training setup, a model minimizes the cross-entropy loss between its predictions and the ground-truth tokens. In Gemma 4's distillation process, the student model also attempts to match the probability distribution (the logits) of the teacher model. This allows the smaller model to learn the nuances, uncertainties, and structural reasoning patterns of the larger model.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvmpq8m95ijhdrovmeif8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvmpq8m95ijhdrovmeif8.png" alt="Flowchart Diagram" width="518" height="716"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;By optimizing for both ground truth and teacher distributions, Gemma 4 captures complex logical jumps that are usually only present in models with hundreds of billions of parameters.&lt;/p&gt;
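
&lt;p&gt;The combined objective is straightforward to express in code. Below is a minimal sketch of a standard knowledge-distillation loss; the temperature and mixing weight are illustrative defaults, not Google's published recipe:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend hard-label cross-entropy with a soft-label KL term.
    Logits are (N, vocab_size); labels are (N,). T and alpha are illustrative."""
    # Hard-label term: ordinary next-token cross-entropy against ground truth
    ce = F.cross_entropy(student_logits, labels)
    # Soft-label term: match the teacher's temperature-softened distribution;
    # the T^2 factor keeps gradient magnitudes comparable across temperatures
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    return alpha * ce + (1 - alpha) * kl
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;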

&lt;h2&gt;
  
  
  3. Comparative Analysis: Gemma 4 vs. The Industry
&lt;/h2&gt;

&lt;p&gt;To understand where Gemma 4 sits in the current ecosystem, we must compare it against its primary competitors: Meta’s Llama series and Mistral AI’s offerings. The following table highlights the architectural and performance differences between the Gemma 4 models and other current open-weight leaders.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Gemma 4 (27B)&lt;/th&gt;
&lt;th&gt;Llama 3.1 (70B)&lt;/th&gt;
&lt;th&gt;Mistral Large 2&lt;/th&gt;
&lt;th&gt;Gemma 4 (9B)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Base Architecture&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Decoder-only Transformer&lt;/td&gt;
&lt;td&gt;Decoder-only Transformer&lt;/td&gt;
&lt;td&gt;MoE (Mixture of Experts)&lt;/td&gt;
&lt;td&gt;Decoder-only Transformer&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Attention Mech&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;GQA + Sliding Window&lt;/td&gt;
&lt;td&gt;Grouped-Query Attention&lt;/td&gt;
&lt;td&gt;Sliding Window&lt;/td&gt;
&lt;td&gt;Multi-Query Attention&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Context Window&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;128k Tokens&lt;/td&gt;
&lt;td&gt;128k Tokens&lt;/td&gt;
&lt;td&gt;128k Tokens&lt;/td&gt;
&lt;td&gt;32k Tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Training Method&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Distillation-heavy&lt;/td&gt;
&lt;td&gt;Direct Pre-training&lt;/td&gt;
&lt;td&gt;Direct Pre-training&lt;/td&gt;
&lt;td&gt;Distillation-heavy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Logit Capping&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes (Soft-capping)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes (Soft-capping)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;License&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Gemma Terms of Use&lt;/td&gt;
&lt;td&gt;Llama 3 Community&lt;/td&gt;
&lt;td&gt;Mistral Research&lt;/td&gt;
&lt;td&gt;Gemma Terms of Use&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  4. Deep Dive into Implementation: Getting Started
&lt;/h2&gt;

&lt;p&gt;Setting up Gemma 4 requires a Python environment with modern libraries. We will use the &lt;code&gt;transformers&lt;/code&gt; library by Hugging Face along with &lt;code&gt;accelerate&lt;/code&gt; for efficient memory management.&lt;/p&gt;

&lt;h3&gt;
  
  
  Environment Setup
&lt;/h3&gt;

&lt;p&gt;First, ensure you have the latest versions of the required packages:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-U&lt;/span&gt; transformers accelerate bitsandbytes torch
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Basic Inference with Gemma 4
&lt;/h3&gt;

&lt;p&gt;The following script demonstrates how to load the Gemma 4 9B model in 4-bit quantization to save VRAM while maintaining performance.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AutoTokenizer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AutoModelForCausalLM&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;BitsAndBytesConfig&lt;/span&gt;

&lt;span class="c1"&gt;# Configure 4-bit quantization
&lt;/span&gt;&lt;span class="n"&gt;bnb_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BitsAndBytesConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;load_in_4bit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;bnb_4bit_quant_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;nf4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;bnb_4bit_compute_dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bfloat16&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;model_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;google/gemma-4-9b-it&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;tokenizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoTokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoModelForCausalLM&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;quantization_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;bnb_config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;device_map&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;auto&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Prepare the prompt using the chat template
&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Explain the concept of quantum entanglement using a cat analogy.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;input_ids&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;apply_chat_template&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;add_generation_prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;return_tensors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;outputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;input_ids&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;max_new_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;do_sample&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.7&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;input_ids&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]:],&lt;/span&gt; &lt;span class="n"&gt;skip_special_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Gemma 4 Response:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Explanation of the Code
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;BitsAndBytesConfig&lt;/strong&gt;: We use NormalFloat 4 (nf4) quantization. This lets the 9B model, which would normally require ~18GB of VRAM at 16-bit precision, fit into roughly 5-6GB, making it practical to run on consumer GPUs like the RTX 3060.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;device_map="auto"&lt;/strong&gt;: This automatically handles the distribution of model layers across available GPUs and CPUs.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;apply_chat_template&lt;/strong&gt;: Gemma 4 uses specific control tokens (like &lt;code&gt;&amp;lt;start_of_turn&amp;gt;&lt;/code&gt;) to distinguish between user and assistant roles. Using the built-in template ensures the model receives the prompt in the exact format it was trained on.&lt;/li&gt;
&lt;/ol&gt;
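
&lt;p&gt;A quick way to verify the template is to render it to a string instead of token tensors. The snippet below is just a sanity check; the commented output is indicative of the Gemma turn format, not an exact transcript:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Render the chat template as text (no tokenization) to inspect it.
prompt_text = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=False
)
print(prompt_text)
# Expect Gemma-style turn markers, roughly:
# &amp;lt;start_of_turn&amp;gt;user
# Explain the concept of quantum entanglement ...&amp;lt;end_of_turn&amp;gt;
# &amp;lt;start_of_turn&amp;gt;model
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;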

&lt;h2&gt;
  
  
  5. Sequence Flows in Gemma 4 Applications
&lt;/h2&gt;

&lt;p&gt;When deploying Gemma 4 in a Retrieval-Augmented Generation (RAG) pipeline, the interaction between the orchestrator, the vector database, and the model follows a specific sequence. Understanding this flow is vital for optimizing latency.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftydq05s14h2m1wia1l8a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftydq05s14h2m1wia1l8a.png" alt="Sequence Diagram" width="800" height="371"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  6. Advanced Optimization: Logit Soft-Capping and Stability
&lt;/h2&gt;

&lt;p&gt;A technical nuance in Gemma 4 is the implementation of &lt;strong&gt;Logit Soft-Capping&lt;/strong&gt;. During the generation process, the raw output of the last layer (logits) can sometimes reach extreme values, leading to "peaky" probability distributions where the model becomes overconfident or starts repeating itself.&lt;/p&gt;

&lt;p&gt;Gemma 4 applies a function to constrain these values:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;logit = capacity * tanh(logit / capacity)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Where the capacity is typically set around 30.0 for the attention layers and 50.0 for the final layer. This ensures that no single token dominates the distribution too early, leading to more creative and stable outputs during long-form generation.&lt;/p&gt;
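
&lt;p&gt;The function is simple enough to sketch in isolation. Here is a stand-alone version using the 50.0 final-layer capacity quoted above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import torch

def soft_cap(logits, capacity):
    # tanh squashes any input into (-1, 1); multiplying by `capacity`
    # bounds the logits smoothly within (-capacity, capacity).
    return capacity * torch.tanh(logits / capacity)

x = torch.tensor([2.0, 30.0, 200.0])
print(soft_cap(x, capacity=50.0))
# The 2.0 passes through almost unchanged; the extreme 200.0
# saturates just below 50.0 instead of dominating the softmax.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;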

&lt;h2&gt;
  
  
  7. Efficient Fine-Tuning with PEFT and LoRA
&lt;/h2&gt;

&lt;p&gt;To adapt Gemma 4 to specific domains (e.g., medical, legal, or proprietary codebases), Parameter-Efficient Fine-Tuning (PEFT) using Low-Rank Adaptation (LoRA) is the recommended approach. This method keeps the base model weights frozen and only trains a small set of adapter layers.&lt;/p&gt;

&lt;h3&gt;
  
  
  Practical LoRA Configuration
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;peft&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LoraConfig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;get_peft_model&lt;/span&gt;

&lt;span class="n"&gt;lora_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LoraConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;lora_alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;target_modules&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;q_proj&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;o_proj&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;k_proj&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;v_proj&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gate_proj&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;up_proj&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;down_proj&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;lora_dropout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.05&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;bias&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;none&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;task_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CAUSAL_LM&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_peft_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lora_config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;print_trainable_parameters&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By targeting all linear layers (including the MLP/gate modules), we give the adapters enough capacity to learn the linguistic nuances of the new domain, while the frozen base weights keep catastrophic forgetting in check.&lt;/p&gt;

&lt;h2&gt;
  
  
  8. The Gemma 4 Ecosystem Mindmap
&lt;/h2&gt;

&lt;p&gt;Navigating the tools and frameworks available for Gemma 4 can be overwhelming. The following mindmap categorizes the ecosystem into four primary domains: Inference, Fine-Tuning, Deployment, and Evaluation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx616yv2o76pm3q6qnpce.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx616yv2o76pm3q6qnpce.png" alt="Diagram" width="800" height="363"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  9. Handling the 128k Context Window
&lt;/h2&gt;

&lt;p&gt;One of the most significant upgrades in Gemma 4 is the massive 128k token context window. However, processing 128k tokens is computationally expensive. Gemma 4 manages this through &lt;strong&gt;Sliding Window Attention (SWA)&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In SWA, each layer does not attend to all previous tokens. Instead, it attends to a fixed-size "window" of recent tokens. Because these layers are stacked, layer N can effectively "see" information from further back via the intermediate representations of layer N-1. This reduces the computational complexity from O(n^2) to O(n * w), where w is the window size.&lt;/p&gt;
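
&lt;p&gt;A toy mask makes the mechanics concrete: each query position attends only to itself and the positions inside its window. A minimal sketch:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import torch

def sliding_window_mask(seq_len, window):
    # True where query i may attend to key j: causal (j &amp;lt;= i)
    # and within the sliding window (i - j &amp;lt; window).
    i = torch.arange(seq_len).unsqueeze(1)
    j = torch.arange(seq_len).unsqueeze(0)
    return (j &amp;lt;= i) &amp;amp; (i - j &amp;lt; window)

print(sliding_window_mask(seq_len=6, window=3).int())
# Each row has at most 3 ones: O(n * w) work instead of O(n^2).
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;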

&lt;h3&gt;
  
  
  Deployment Considerations for Long Context
&lt;/h3&gt;

&lt;p&gt;When utilizing the full 128k window, memory consumption for the KV (Key-Value) cache becomes the bottleneck. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;KV Cache Quantization:&lt;/strong&gt; Storing the KV cache in 8-bit or 4-bit can reduce memory usage by 50-75%.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Paged Attention:&lt;/strong&gt; Using frameworks like vLLM allows for dynamic memory allocation, preventing fragmentation when handling multiple long-context requests simultaneously.&lt;/li&gt;
&lt;/ul&gt;
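
&lt;p&gt;A back-of-the-envelope estimator shows why the cache dominates. The layer and head counts below are placeholders, not Gemma 4's actual configuration; substitute the values from your checkpoint's config:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch=1, bytes_per_value=2):
    # Factor of 2 covers keys + values. bytes_per_value is 2 for
    # fp16/bf16, 1 for an 8-bit cache, 0.5 for a 4-bit cache.
    return 2 * batch * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value

# Hypothetical config at the full 128k window:
gb = kv_cache_bytes(num_layers=42, num_kv_heads=8,
                    head_dim=256, seq_len=128_000) / 1e9
print(f"{gb:.1f} GB")  # ~44 GB in bf16; a 4-bit cache cuts it to ~11 GB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;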

&lt;h2&gt;
  
  
  10. Benchmarking and Performance Metrics
&lt;/h2&gt;

&lt;p&gt;Internal testing shows that Gemma 4 excels in "Reasoning Density": the model's ability to solve complex mathematical and logical problems relative to its parameter count. In the MMLU (Massive Multitask Language Understanding) benchmark, the 27B variant of Gemma 4 outperforms several 70B+ models, suggesting that training-data quality and distillation matter more than sheer scale.&lt;/p&gt;

&lt;h3&gt;
  
  
  Performance Comparison Table
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Benchmark&lt;/th&gt;
&lt;th&gt;Gemma 4 (27B)&lt;/th&gt;
&lt;th&gt;Llama 3.1 (70B)&lt;/th&gt;
&lt;th&gt;Gemma 4 (9B)&lt;/th&gt;
&lt;th&gt;GPT-4o (Reference)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MMLU&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;78.2%&lt;/td&gt;
&lt;td&gt;79.9%&lt;/td&gt;
&lt;td&gt;71.3%&lt;/td&gt;
&lt;td&gt;88.7%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GSM8K (Math)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;82.1%&lt;/td&gt;
&lt;td&gt;82.5%&lt;/td&gt;
&lt;td&gt;74.0%&lt;/td&gt;
&lt;td&gt;94.2%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;HumanEval (Code)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;68.5%&lt;/td&gt;
&lt;td&gt;67.2%&lt;/td&gt;
&lt;td&gt;55.4%&lt;/td&gt;
&lt;td&gt;86.6%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MBPP&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;72.0%&lt;/td&gt;
&lt;td&gt;70.1%&lt;/td&gt;
&lt;td&gt;62.1%&lt;/td&gt;
&lt;td&gt;84.1%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  11. Ethical Considerations and Safety
&lt;/h2&gt;

&lt;p&gt;Google has integrated a robust safety framework into Gemma 4. This includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data Filtering:&lt;/strong&gt; Rigorous removal of personally identifiable information (PII) and harmful content from the pre-training set.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reinforcement Learning from Human Feedback (RLHF):&lt;/strong&gt; Tuning the model to follow instructions while refusing harmful requests.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Red Teaming:&lt;/strong&gt; Extensive testing against adversarial attacks to ensure the model remains helpful yet harmless.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Developers are encouraged to use the &lt;strong&gt;Responsible AI Toolkit&lt;/strong&gt; provided by Google to audit their fine-tuned versions of Gemma 4 before deployment.&lt;/p&gt;

&lt;h2&gt;
  
  
  12. Conclusion
&lt;/h2&gt;

&lt;p&gt;Gemma 4 marks a turning point in the accessibility of high-performance AI. By successfully distilling the intelligence of a frontier model like Gemini into an open-weight format, Google has provided developers with a tool that is both powerful enough for complex reasoning and efficient enough for local deployment. Whether you are building a sophisticated RAG system, a specialized coding assistant, or an edge-based application, Gemma 4 provides the architectural flexibility and performance density required for the next generation of AI applications.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further Reading &amp;amp; Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://ai.google.dev/gemma" rel="noopener noreferrer"&gt;Google DeepMind Gemma Repository&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://huggingface.co/google/gemma-4-9b-it" rel="noopener noreferrer"&gt;Hugging Face Gemma 4 Model Card&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/1706.03762" rel="noopener noreferrer"&gt;Attention Is All You Need Technical Paper&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/1503.02531" rel="noopener noreferrer"&gt;Knowledge Distillation and the Teacher-Student Paradigm&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2106.09685" rel="noopener noreferrer"&gt;LoRA: Low-Rank Adaptation of Large Language Models&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;Connect with me: &lt;a href="https://linkedin.com/in/jubinsoni" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; | &lt;a href="https://twitter.com/sonijubin" rel="noopener noreferrer"&gt;Twitter/X&lt;/a&gt; | &lt;a href="https://github.com/jubins" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; | &lt;a href="https://jubinsoni.com" rel="noopener noreferrer"&gt;Website&lt;/a&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>ai</category>
      <category>gemma</category>
      <category>python</category>
    </item>
    <item>
      <title>The Agent Protocol Stack: MCP vs A2A vs AG-UI — When to Use What</title>
      <dc:creator>Jubin Soni</dc:creator>
      <pubDate>Sun, 12 Apr 2026 08:49:35 +0000</pubDate>
      <link>https://dev.to/jubinsoni/the-agent-protocol-stack-mcp-vs-a2a-vs-ag-ui-when-to-use-what-6dn</link>
      <guid>https://dev.to/jubinsoni/the-agent-protocol-stack-mcp-vs-a2a-vs-ag-ui-when-to-use-what-6dn</guid>
      <description>&lt;p&gt;If you're building AI agents in 2026, you've probably bumped into at least one of these acronyms: &lt;strong&gt;MCP&lt;/strong&gt;, &lt;strong&gt;A2A&lt;/strong&gt;, &lt;strong&gt;AG-UI&lt;/strong&gt;. Maybe all three. And if you're anything like me, your first reaction was: &lt;em&gt;"Are these competing standards? Do I need all of them? Which one do I actually use?"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Here's the short answer: &lt;strong&gt;they're not competing — they're complementary.&lt;/strong&gt; Each one solves a different problem at a different layer of the agent architecture. Think of them like TCP, HTTP, and HTML — different protocols at different layers that work together to make the web function.&lt;/p&gt;

&lt;p&gt;The long answer is the rest of this article.&lt;/p&gt;




&lt;h2&gt;
  
  
  The One-Sentence Version
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Protocol&lt;/th&gt;
&lt;th&gt;Created By&lt;/th&gt;
&lt;th&gt;What It Connects&lt;/th&gt;
&lt;th&gt;One-Liner&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MCP&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Anthropic&lt;/td&gt;
&lt;td&gt;Agent ↔ Tools &amp;amp; Data&lt;/td&gt;
&lt;td&gt;"How does my agent use tools?"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;A2A&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Google (Linux Foundation)&lt;/td&gt;
&lt;td&gt;Agent ↔ Agent&lt;/td&gt;
&lt;td&gt;"How do agents talk to each other?"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AG-UI&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;CopilotKit&lt;/td&gt;
&lt;td&gt;Agent ↔ User Interface&lt;/td&gt;
&lt;td&gt;"How does my agent talk to the user?"&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That's the mental model. Now let's go deeper.&lt;/p&gt;




&lt;h2&gt;
  
  
  MCP: The Tool Layer
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What It Solves
&lt;/h3&gt;

&lt;p&gt;Your agent needs to &lt;em&gt;do things&lt;/em&gt; — query a database, call an API, read a file, search the web. Before MCP, every integration was bespoke. You'd write custom function-calling code for each tool, each framework, each model. MCP standardizes this into a single protocol.&lt;/p&gt;

&lt;h3&gt;
  
  
  How It Works
&lt;/h3&gt;

&lt;p&gt;MCP uses a &lt;strong&gt;client-server architecture&lt;/strong&gt; over JSON-RPC 2.0. The MCP server exposes tools (functions with typed inputs/outputs), resources (data the agent can read), and prompts (reusable templates). The MCP client — typically embedded in your agent framework — discovers these capabilities and invokes them on behalf of the model.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0x0xdw6g8ivcwmto47yp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0x0xdw6g8ivcwmto47yp.png" alt="MCP" width="800" height="572"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Concepts
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Tools&lt;/strong&gt; are the core primitive — functions the model can call. Each tool has a name, description (the LLM reads this to decide when to use it), and a typed input schema. The model sees the tool list, decides which ones to call, and the MCP client executes them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Resources&lt;/strong&gt; let the server expose read-only data — files, database schemas, configuration — that provides context without requiring a tool call.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Transports&lt;/strong&gt; are flexible. Local tools can use stdio (spawning a subprocess). Remote tools use Streamable HTTP, which is what you'd use for production deployments. AWS Bedrock AgentCore Runtime expects this transport.&lt;/p&gt;
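
&lt;p&gt;To make this concrete, here is a minimal tool server built with the Python SDK's FastMCP helper. It's a sketch for local development; the tool body is a stand-in for a real backend call:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from mcp.server.fastmcp import FastMCP

mcp = FastMCP("demo-tools")

@mcp.tool()
def lookup_order(order_id: str) -&amp;gt; dict:
    """Fetch an order's status.

    The LLM reads this docstring to decide when to call the tool.
    """
    # Stand-in for a real database or API lookup.
    return {"order_id": order_id, "status": "shipped"}

if __name__ == "__main__":
    # stdio for local development; switch to "streamable-http"
    # for remote/production deployments.
    mcp.run(transport="stdio")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;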

&lt;h3&gt;
  
  
  When to Use MCP
&lt;/h3&gt;

&lt;p&gt;Use MCP when your agent needs to &lt;strong&gt;interact with external systems&lt;/strong&gt;: databases, APIs, monitoring tools, file systems, cloud services. If you're wrapping an existing API for agent consumption, MCP is the protocol.&lt;/p&gt;

&lt;p&gt;AWS provides a growing library of open-source MCP servers for services like S3, DynamoDB, CloudWatch, and Cost Explorer. You can also build custom MCP servers for your own internal APIs and deploy them to AgentCore Runtime.&lt;/p&gt;

&lt;h3&gt;
  
  
  When NOT to Use MCP
&lt;/h3&gt;

&lt;p&gt;MCP is not for agent-to-agent communication. If you have a research agent that needs to delegate a sub-task to a coding agent, MCP isn't the right fit — that's A2A territory. MCP is also not designed for frontend communication — it doesn't have event streaming primitives for UI updates.&lt;/p&gt;




&lt;h2&gt;
  
  
  A2A: The Agent Collaboration Layer
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What It Solves
&lt;/h3&gt;

&lt;p&gt;You've built multiple specialized agents. One handles research, another handles code generation, a third manages deployments. Now you need them to work together on a complex task without sharing their internal state, tools, or prompts. A2A standardizes how agents discover each other, delegate tasks, and exchange results.&lt;/p&gt;

&lt;h3&gt;
  
  
  How It Works
&lt;/h3&gt;

&lt;p&gt;A2A follows a &lt;strong&gt;client-server model&lt;/strong&gt; where agents communicate over HTTP using JSON-RPC 2.0 (and optionally gRPC as of v0.3). The key differentiator from MCP is &lt;strong&gt;opacity&lt;/strong&gt; — agents don't expose their internals. They advertise what they can do, not how they do it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqe2ikuagzau0h00dzjc4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqe2ikuagzau0h00dzjc4.png" alt="A2A" width="800" height="367"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Concepts
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Agent Cards&lt;/strong&gt; are JSON metadata documents hosted at &lt;code&gt;/.well-known/agent.json&lt;/code&gt;. They describe the agent's name, capabilities (called "skills"), supported input/output types, and authentication requirements. Think of them as a machine-readable business card — any A2A client can discover what a remote agent does without prior knowledge.&lt;/p&gt;
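
&lt;p&gt;The card's shape looks roughly like the following (field names abridged and illustrative; check the A2A spec for the authoritative schema):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# An illustrative Agent Card payload, shown as a Python dict.
agent_card = {
    "name": "research-agent",
    "description": "Performs deep web research and summarization",
    "url": "https://agents.example.com/research",
    "skills": [
        {"id": "web-research",
         "description": "Research a topic and return cited findings"}
    ],
    "defaultInputModes": ["text/plain"],
    "defaultOutputModes": ["text/plain", "application/json"],
    "authentication": {"schemes": ["oauth2"]},
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;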

&lt;p&gt;&lt;strong&gt;Tasks&lt;/strong&gt; are the unit of work. A client sends a message to a remote agent, which creates a task with a lifecycle: &lt;code&gt;submitted → working → completed&lt;/code&gt; (or &lt;code&gt;failed&lt;/code&gt;, &lt;code&gt;canceled&lt;/code&gt;). Tasks can produce &lt;strong&gt;artifacts&lt;/strong&gt; — the actual outputs like generated text, images, or structured data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Interaction patterns&lt;/strong&gt; are flexible. Simple tasks complete synchronously. Long-running tasks use Server-Sent Events (SSE) for streaming updates. Truly async workflows use push notifications via webhooks.&lt;/p&gt;

&lt;h3&gt;
  
  
  When to Use A2A
&lt;/h3&gt;

&lt;p&gt;Use A2A when you have &lt;strong&gt;multiple agents that need to collaborate&lt;/strong&gt; but shouldn't share internal state. Common patterns include a supervisor agent delegating to specialists, cross-organization agent collaboration (your agent talking to a vendor's agent), and multi-framework setups (a LangGraph agent coordinating with a CrewAI agent).&lt;/p&gt;

&lt;p&gt;A2A is especially valuable when agents are built by different teams or companies. The opacity principle means Agent A doesn't need to know that Agent B uses LangGraph internally — it just sends a task and gets results back.&lt;/p&gt;

&lt;p&gt;AWS Bedrock AgentCore Runtime supports deploying A2A servers alongside MCP servers, with the same IAM auth, session isolation, and auto-scaling. A2A containers expose their endpoint on port 9000 with an Agent Card at &lt;code&gt;/.well-known/agent-card.json&lt;/code&gt;.&lt;/p&gt;
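
&lt;p&gt;Discovery is then a plain HTTP GET, something like this, with the placeholder endpoint swapped for your deployed runtime's URL:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Fetch a deployed agent's card (endpoint is a placeholder).
RUNTIME_ENDPOINT="https://your-agentcore-runtime.example.com"
curl -s "$RUNTIME_ENDPOINT/.well-known/agent-card.json" | jq '.skills'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;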

&lt;h3&gt;
  
  
  When NOT to Use A2A
&lt;/h3&gt;

&lt;p&gt;A2A adds overhead that isn't necessary for simple single-agent setups. If your agent just needs to call tools, use MCP. If you need tight coupling between agent components (shared memory, shared context), A2A's opacity model will work against you — consider an agent framework's native multi-agent patterns instead.&lt;/p&gt;




&lt;h2&gt;
  
  
  AG-UI: The User Interface Layer
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What It Solves
&lt;/h3&gt;

&lt;p&gt;Your agent is running, calling tools, maybe coordinating with other agents. But the user is staring at a loading spinner. They don't know what's happening, can't intervene when things go wrong, and can't see intermediate results. AG-UI standardizes how agents communicate with user-facing applications in real time.&lt;/p&gt;

&lt;h3&gt;
  
  
  How It Works
&lt;/h3&gt;

&lt;p&gt;AG-UI is an &lt;strong&gt;event-based protocol&lt;/strong&gt; where the agent backend emits a stream of typed events that the frontend consumes. Unlike REST (request → response) or WebSocket (unstructured bidirectional), AG-UI defines ~16 specific event types that cover the full range of agent-user interactions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fauhvffwwsf2wg40vr8g4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fauhvffwwsf2wg40vr8g4.png" alt="AG-UI" width="800" height="522"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Concepts
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Event types&lt;/strong&gt; are the core of AG-UI. The main ones:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Lifecycle events&lt;/strong&gt; (&lt;code&gt;RUN_STARTED&lt;/code&gt;, &lt;code&gt;RUN_FINISHED&lt;/code&gt;, &lt;code&gt;RUN_ERROR&lt;/code&gt;) — let the frontend show loading states and handle errors&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Text message events&lt;/strong&gt; (&lt;code&gt;TEXT_MESSAGE_START&lt;/code&gt;, &lt;code&gt;_CONTENT&lt;/code&gt;, &lt;code&gt;_END&lt;/code&gt;) — stream generated text token by token for the "typing" effect&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool events&lt;/strong&gt; (&lt;code&gt;TOOL_CALL_START&lt;/code&gt;, &lt;code&gt;TOOL_CALL_END&lt;/code&gt;) — show the user what tools the agent is using and their results&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;State deltas&lt;/strong&gt; (&lt;code&gt;STATE_DELTA&lt;/code&gt;) — send incremental UI state changes (progress bars, form updates) without resending everything&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Interrupts&lt;/strong&gt; (&lt;code&gt;INTERRUPT&lt;/code&gt;) — pause execution to ask the user for approval before a sensitive action (like deleting a resource)&lt;/li&gt;
&lt;/ul&gt;
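
&lt;p&gt;Strung together, a single run emits an ordered stream. Here is an illustrative sequence; the event names match the types above, but the exact payload fields may differ from the spec:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# An illustrative AG-UI event stream for one agent run.
events = [
    {"type": "RUN_STARTED", "run_id": "run-42"},
    {"type": "TEXT_MESSAGE_START", "message_id": "m1", "role": "assistant"},
    {"type": "TEXT_MESSAGE_CONTENT", "message_id": "m1", "delta": "Checking..."},
    {"type": "TOOL_CALL_START", "tool_call_id": "t1", "name": "query_metrics"},
    {"type": "TOOL_CALL_END", "tool_call_id": "t1"},
    {"type": "TEXT_MESSAGE_END", "message_id": "m1"},
    {"type": "RUN_FINISHED", "run_id": "run-42"},
]

for event in events:
    print(event["type"])  # the frontend switches on this field to update the UI
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;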

&lt;p&gt;&lt;strong&gt;Shared state&lt;/strong&gt; enables bidirectional synchronization between the agent and the application. The agent can read application state (what page the user is on, what document is open) and push state changes back (update a chart, fill a form).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Frontend tools&lt;/strong&gt; are an interesting inversion — the agent can call functions that execute &lt;em&gt;in the browser&lt;/em&gt;, like updating a collaborative document or rendering a visualization.&lt;/p&gt;

&lt;h3&gt;
  
  
  When to Use AG-UI
&lt;/h3&gt;

&lt;p&gt;Use AG-UI when your agent needs to &lt;strong&gt;communicate with a user-facing application&lt;/strong&gt; in real time. This includes chat interfaces that show tool execution progress, collaborative editing where the agent modifies a shared document, dashboards that update as the agent discovers information, and any workflow that requires human-in-the-loop approval.&lt;/p&gt;

&lt;p&gt;AG-UI was born from CopilotKit's production experience and has integrations with LangGraph, CrewAI, Strands Agents, Pydantic AI, and more. AWS Bedrock AgentCore Runtime added AG-UI support in March 2026, handling auth and scaling just like MCP and A2A workloads.&lt;/p&gt;

&lt;h3&gt;
  
  
  When NOT to Use AG-UI
&lt;/h3&gt;

&lt;p&gt;If your agent is a background job with no user interaction (batch processing, scheduled tasks), AG-UI adds unnecessary complexity. Stick with simple API responses or logging. Also, AG-UI is about &lt;em&gt;communication&lt;/em&gt;, not &lt;em&gt;UI rendering&lt;/em&gt; — if you need the agent to generate actual UI components, look at A2UI (a separate spec from Google for declarative UI generation that can be transported over AG-UI events).&lt;/p&gt;




&lt;h2&gt;
  
  
  How They Fit Together
&lt;/h2&gt;

&lt;p&gt;Here's where it gets interesting. In a real production system, you're likely using all three:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2a3t9wuymfuligncrinm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2a3t9wuymfuligncrinm.png" alt="all three" width="800" height="722"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The flow:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The user asks a question in the frontend&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AG-UI&lt;/strong&gt; streams the request to the supervisor agent and carries back real-time updates&lt;/li&gt;
&lt;li&gt;The supervisor uses &lt;strong&gt;MCP&lt;/strong&gt; to call tools directly (databases, APIs, cloud services)&lt;/li&gt;
&lt;li&gt;For complex sub-tasks, the supervisor uses &lt;strong&gt;A2A&lt;/strong&gt; to delegate to specialist agents&lt;/li&gt;
&lt;li&gt;Those specialist agents may themselves use &lt;strong&gt;MCP&lt;/strong&gt; for their own tools&lt;/li&gt;
&lt;li&gt;Results flow back up through A2A → supervisor → AG-UI → user&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each protocol handles its layer. No overlap. No conflict.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Decision Framework
&lt;/h2&gt;

&lt;p&gt;When you're designing an agent system, ask these three questions:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. "Does my agent need to use external tools or data?"
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;→ Yes: Use MCP&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Wrap your APIs, databases, and services as MCP servers. Use existing open-source MCP servers for common services (AWS, GitHub, Slack, etc.).&lt;/p&gt;

&lt;h3&gt;
  
  
  2. "Does my agent need to collaborate with other agents?"
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;→ Yes: Use A2A&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Especially when agents are built by different teams, use different frameworks, or need to maintain privacy of their internal logic. Publish Agent Cards for discovery.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. "Does my agent need to communicate with a user in real time?"
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;→ Yes: Use AG-UI&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Stream progress, show tool execution, synchronize state, and handle human-in-the-loop approvals. Use AG-UI events to keep the user informed and in control.&lt;/p&gt;

&lt;p&gt;Most production agent systems will answer "yes" to at least two of these. And that's fine — the protocols are designed to compose.&lt;/p&gt;




&lt;h2&gt;
  
  
  Quick Comparison Table
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;MCP&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;A2A&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;AG-UI&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Layer&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Tool access&lt;/td&gt;
&lt;td&gt;Agent collaboration&lt;/td&gt;
&lt;td&gt;User interaction&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Created by&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Anthropic&lt;/td&gt;
&lt;td&gt;Google / Linux Foundation&lt;/td&gt;
&lt;td&gt;CopilotKit&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Wire protocol&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;JSON-RPC 2.0&lt;/td&gt;
&lt;td&gt;JSON-RPC 2.0 + gRPC&lt;/td&gt;
&lt;td&gt;Event stream (SSE)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Discovery&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Tool listing via &lt;code&gt;tools/list&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Agent Card at &lt;code&gt;/.well-known/agent.json&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;N/A (direct connection)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Key primitive&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Tool (function call)&lt;/td&gt;
&lt;td&gt;Task (lifecycle-managed work unit)&lt;/td&gt;
&lt;td&gt;Event (~16 standard types)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Transport&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;stdio, Streamable HTTP&lt;/td&gt;
&lt;td&gt;HTTP, SSE, gRPC, webhooks&lt;/td&gt;
&lt;td&gt;SSE, WebSockets&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Auth model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;OAuth 2.0, IAM&lt;/td&gt;
&lt;td&gt;OAuth 2.0, API keys, mTLS&lt;/td&gt;
&lt;td&gt;Application-defined&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Opacity&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Transparent (tools are exposed)&lt;/td&gt;
&lt;td&gt;Opaque (internals hidden)&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Streaming&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes (SSE for resources)&lt;/td&gt;
&lt;td&gt;Yes (SSE for task updates)&lt;/td&gt;
&lt;td&gt;Yes (core design principle)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AWS support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;AgentCore Runtime + Gateway&lt;/td&gt;
&lt;td&gt;AgentCore Runtime (port 9000)&lt;/td&gt;
&lt;td&gt;AgentCore Runtime (March 2026)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Spec version&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2025-03-26&lt;/td&gt;
&lt;td&gt;v0.3&lt;/td&gt;
&lt;td&gt;~16 event types, active development&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Running All Three on AWS
&lt;/h2&gt;

&lt;p&gt;AWS Bedrock AgentCore Runtime is one of the few platforms that supports all three protocols natively. Here's how they deploy:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Protocol&lt;/th&gt;
&lt;th&gt;AgentCore Runtime Port&lt;/th&gt;
&lt;th&gt;Container Path&lt;/th&gt;
&lt;th&gt;Auth&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MCP&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;8000&lt;/td&gt;
&lt;td&gt;&lt;code&gt;/mcp&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;IAM SigV4 or OAuth 2.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;A2A&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;9000&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;/&lt;/code&gt; (root)&lt;/td&gt;
&lt;td&gt;IAM SigV4 or OAuth 2.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AG-UI&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Configurable&lt;/td&gt;
&lt;td&gt;Configurable&lt;/td&gt;
&lt;td&gt;IAM SigV4 or OAuth 2.0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Each protocol gets the same enterprise infrastructure: session isolation in microVMs, automatic scaling, IAM auth, and observability through AgentCore. You write the server; AgentCore handles everything else.&lt;/p&gt;

&lt;p&gt;The AgentCore Gateway can sit in front of MCP servers to provide centralized tool discovery, routing, and policy enforcement via Cedar. For A2A, agents advertise their capabilities through Agent Cards. For AG-UI, the frontend connects directly to the AgentCore Runtime endpoint and receives streamed events.&lt;/p&gt;




&lt;h2&gt;
  
  
  What About A2UI?
&lt;/h2&gt;

&lt;p&gt;You might have also heard about &lt;strong&gt;A2UI&lt;/strong&gt; (Agent-to-UI), a separate specification from Google. It's easy to confuse with AG-UI given the similar names, but they solve different problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;A2UI&lt;/strong&gt; defines &lt;em&gt;what&lt;/em&gt; UI to render — it's a declarative spec for describing UI components (buttons, charts, forms) that agents can generate safely without executing arbitrary code&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AG-UI&lt;/strong&gt; defines &lt;em&gt;how&lt;/em&gt; agents and UIs communicate at runtime — the event stream, state synchronization, and interaction lifecycle&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;They're complementary. An agent can use AG-UI to stream events to the frontend, and one of those events can carry an A2UI payload that describes a UI component to render. AG-UI is the transport; A2UI is the content format.&lt;/p&gt;




&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;If you're building your first agent system, here's the practical sequence:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Start with MCP.&lt;/strong&gt; Most agents need tools first. Build an MCP server for your primary data source or API. Deploy it to AgentCore Runtime or run it locally during development.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Add AG-UI when you build the frontend.&lt;/strong&gt; Once your agent works, connect it to a user-facing app using AG-UI events. CopilotKit provides React components that handle the event stream out of the box.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Introduce A2A when you need specialization.&lt;/strong&gt; When a single agent can't handle everything, split into specialists and use A2A for delegation. This typically happens when you're at the point of multi-team or multi-framework agent development.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;You don't need all three on day one. But understanding what each one does — and where it fits — saves you from building custom plumbing that a protocol already handles.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://spec.modelcontextprotocol.io/specification/2025-03-26/" rel="noopener noreferrer"&gt;MCP Specification (2025-03-26)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.modelcontextprotocol.io/posts/2025-11-25-first-mcp-anniversary/" rel="noopener noreferrer"&gt;One Year of MCP: Spec Anniversary Blog&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://awslabs.github.io/mcp/" rel="noopener noreferrer"&gt;Open Source MCP Servers for AWS&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/runtime-mcp.html" rel="noopener noreferrer"&gt;Deploy MCP Servers in AgentCore Runtime&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/blog/products/ai-machine-learning/agent2agent-protocol-is-getting-an-upgrade" rel="noopener noreferrer"&gt;A2A v0.3 Upgrade Announcement&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/marketplace/latest/userguide/bedrock-agentcore-runtime.html" rel="noopener noreferrer"&gt;A2A on AWS AgentCore Runtime&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.datacamp.com/tutorial/ag-ui" rel="noopener noreferrer"&gt;AG-UI Overview — DataCamp Tutorial&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ai.pydantic.dev/ui/ag-ui/" rel="noopener noreferrer"&gt;Pydantic AI AG-UI Integration&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://a2ui.org/" rel="noopener noreferrer"&gt;A2UI Official Site&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://developers.googleblog.com/developers-guide-to-ai-agent-protocols/" rel="noopener noreferrer"&gt;Developer's Guide to AI Agent Protocols — Google Developers Blog&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>mcp</category>
      <category>a2a</category>
      <category>ai</category>
      <category>agents</category>
    </item>
    <item>
      <title>MCP + AWS AgentCore: Give Your AI Agent Real Tools in 60 Minutes</title>
      <dc:creator>Jubin Soni</dc:creator>
      <pubDate>Tue, 07 Apr 2026 06:12:21 +0000</pubDate>
      <link>https://dev.to/jubinsoni/mcp-aws-agentcore-give-your-ai-agent-real-tools-in-60-minutes-2plg</link>
      <guid>https://dev.to/jubinsoni/mcp-aws-agentcore-give-your-ai-agent-real-tools-in-60-minutes-2plg</guid>
      <description>&lt;p&gt;If you've been building with AI agents, you've probably hit the same wall I did: your agent needs to &lt;em&gt;do things&lt;/em&gt; — query databases, call APIs, check systems — but wiring up each tool is a bespoke integration every time. The Model Context Protocol (MCP) solves this by giving agents a standard way to discover and invoke tools. Think of it as USB-C for AI tooling.&lt;/p&gt;

&lt;p&gt;The problem? Most MCP tutorials stop at "run it locally with stdio." That's fine for solo dev work, but it falls apart the moment you need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multiple clients connecting to the same server&lt;/li&gt;
&lt;li&gt;Auth, session isolation, and scaling&lt;/li&gt;
&lt;li&gt;A deployment that doesn't die when your laptop sleeps&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;AWS Bedrock AgentCore Runtime changes the equation. You write an MCP server, hand it over, and AgentCore handles containerization, scaling, IAM auth, and session isolation — each user session runs in a dedicated microVM. No ECS clusters to configure. No load balancers to tune.&lt;/p&gt;

&lt;p&gt;In this post, we'll build a practical MCP server from scratch, deploy it to AgentCore Runtime, and connect an AI agent to it. The whole thing takes about 30-60 minutes.&lt;/p&gt;




&lt;h2&gt;
  
  
  What We're Building
&lt;/h2&gt;

&lt;p&gt;We'll create an MCP server that exposes &lt;strong&gt;infrastructure health tools&lt;/strong&gt; — the kind of thing a DevOps agent would use to check system status, list recent deployments, and surface alerts. It's more interesting than a dice roller but simple enough to follow.&lt;/p&gt;

&lt;p&gt;Here's the architecture:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fafijq24eyeh44ll99dx0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fafijq24eyeh44ll99dx0.png" alt="architecture" width="800" height="154"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Your agent connects via IAM auth → AgentCore discovers the tools → your server executes them → results stream back.&lt;/strong&gt; You never manage servers, containers, or networking.&lt;/p&gt;




&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;Before we start, make sure you have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Python 3.10+&lt;/strong&gt; and &lt;a href="https://docs.astral.sh/uv/" rel="noopener noreferrer"&gt;uv&lt;/a&gt; (or pip — but uv is faster)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AWS CLI&lt;/strong&gt; configured with credentials that have Bedrock AgentCore permissions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Node.js 18+&lt;/strong&gt; (for the AgentCore CLI)&lt;/li&gt;
&lt;li&gt;An AWS account with AgentCore access (there's a free tier)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Install the AgentCore tooling:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# AgentCore CLI&lt;/span&gt;
npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; @aws/agentcore

&lt;span class="c"&gt;# AgentCore Python SDK&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;bedrock-agentcore

&lt;span class="c"&gt;# AgentCore Starter Toolkit (handles scaffolding + deployment)&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;bedrock-agentcore-starter-toolkit
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 1: Build the MCP Server
&lt;/h2&gt;

&lt;p&gt;Create your project structure:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir &lt;/span&gt;infra-health-mcp &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;cd &lt;/span&gt;infra-health-mcp
uv init &lt;span class="nt"&gt;--bare&lt;/span&gt;
uv add mcp bedrock-agentcore
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Now create &lt;code&gt;server.py&lt;/code&gt;. We'll use FastMCP, which gives us a decorator-based API for defining tools:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;mcp.server.fastmcp&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FastMCP&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timedelta&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;

&lt;span class="n"&gt;mcp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FastMCP&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;infra-health&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="nd"&gt;@mcp.tool&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_service_status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;service_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Check the health status of a deployed service.

    Args:
        service_name: Name of the service to check 
                      (e.g., &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;api-gateway&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;auth-service&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;payments&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;)
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# In production, this would hit your monitoring API
&lt;/span&gt;    &lt;span class="n"&gt;statuses&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;healthy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;healthy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;healthy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;degraded&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unhealthy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;uptime&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;uniform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;95.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;99.99&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;service&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;service_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;choice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;statuses&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;uptime_percent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;uptime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;last_checked&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;utcnow&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;isoformat&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;active_instances&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;avg_latency_ms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;uniform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;250&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;


&lt;span class="nd"&gt;@mcp.tool&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;list_recent_deployments&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hours&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;24&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;List deployments that occurred in the last N hours.

    Args:
        hours: Number of hours to look back (default: 24)
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;services&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;api-gateway&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;auth-service&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;payments&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                 &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;notification-svc&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user-profile&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;deployers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ci-pipeline&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ci-pipeline&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hotfix-manual&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="n"&gt;deployments&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)):&lt;/span&gt;
        &lt;span class="n"&gt;deploy_time&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;utcnow&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nf"&gt;timedelta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;hours&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hours&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;deployments&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;service&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;choice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;services&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;version&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;v1.&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;45&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;deployed_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;deploy_time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isoformat&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;deployed_by&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;choice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;deployers&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;choice&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;success&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;success&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rolled_back&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;deployments&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;deployed_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;reverse&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="nd"&gt;@mcp.tool&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_active_alerts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;severity&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;all&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Retrieve currently active infrastructure alerts.

    Args:
        severity: Filter by severity level - 
                  &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;critical&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;warning&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;info&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, or &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;all&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;alerts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ALT-1024&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;severity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;warning&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;auth-service p99 latency above threshold (&amp;gt;500ms)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;triggered_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;utcnow&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nf"&gt;timedelta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;minutes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;23&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;isoformat&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;service&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;auth-service&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ALT-1025&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;severity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;critical&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;payments service error rate at 2.3% (threshold: 1%)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;triggered_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;utcnow&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nf"&gt;timedelta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;minutes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;isoformat&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;service&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;payments&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ALT-1026&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;severity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;info&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Scheduled maintenance window in 4 hours&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;triggered_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;utcnow&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nf"&gt;timedelta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hours&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;isoformat&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;service&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;all&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;severity&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;all&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;alerts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;alerts&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;severity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;severity&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;alerts&lt;/span&gt;


&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;mcp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transport&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;streamable-http&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Key decisions here:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Each tool has a clear docstring with typed args — this is what the LLM sees when deciding which tool to call, so be descriptive&lt;/li&gt;
&lt;li&gt;We're using &lt;code&gt;streamable-http&lt;/code&gt; transport, which is what AgentCore Runtime expects&lt;/li&gt;
&lt;li&gt;In production, you'd replace the mock data with calls to Datadog, CloudWatch, your deployment system, etc.&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  Step 2: Test Locally
&lt;/h2&gt;

&lt;p&gt;Before deploying anything, make sure the server works:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Start the server&lt;/span&gt;
uv run server.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;In another terminal, test it with the MCP inspector or a quick curl:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Using the MCP CLI inspector&lt;/span&gt;
npx @modelcontextprotocol/inspector http://localhost:8000/mcp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;You should see your three tools listed. Click through them, pass some args, verify the responses look right. Fix any issues now — it's much faster than debugging after deployment.&lt;/p&gt;
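
&lt;p&gt;If you prefer raw HTTP, a quick curl smoke test works too. This is a rough sketch: streamable-http servers expect both JSON and SSE accept headers, and a strict server may insist on an &lt;code&gt;initialize&lt;/code&gt; request before it answers &lt;code&gt;tools/list&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Quick smoke test against the local server
curl -s http://localhost:8000/mcp \
  -H 'Content-Type: application/json' \
  -H 'Accept: application/json, text/event-stream' \
  -d '{"jsonrpc":"2.0","method":"tools/list","id":1}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;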


&lt;h2&gt;
  
  
  Step 3: Prepare for AgentCore Runtime
&lt;/h2&gt;

&lt;p&gt;AgentCore Runtime needs your server wrapped with the &lt;code&gt;BedrockAgentCoreApp&lt;/code&gt;. Update &lt;code&gt;server.py&lt;/code&gt; by adding this at the top and modifying the entrypoint:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;bedrock_agentcore.runtime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BedrockAgentCoreApp&lt;/span&gt;

&lt;span class="c1"&gt;# ... (keep all your existing tool definitions) ...
&lt;/span&gt;
&lt;span class="c1"&gt;# Replace the if __name__ block:
&lt;/span&gt;&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BedrockAgentCoreApp&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="nd"&gt;@app.entrypoint&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;mcp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transport&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;streamable-http&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Alternatively, use the AgentCore Starter Toolkit to scaffold the project structure automatically:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;agentcore init &lt;span class="nt"&gt;--protocol&lt;/span&gt; mcp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This generates the Dockerfile, IAM role config, and &lt;code&gt;agentcore.json&lt;/code&gt; for you. Copy your &lt;code&gt;server.py&lt;/code&gt; into the generated project and point the entrypoint to it.&lt;/p&gt;


&lt;h2&gt;
  
  
  Step 4: Deploy to AWS
&lt;/h2&gt;

&lt;p&gt;This is the part that used to take hours of ECS/ECR/IAM wrangling. With the Starter Toolkit, it's two commands:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Configure (generates IAM roles, ECR repo, build config)&lt;/span&gt;
agentcore configure

&lt;span class="c"&gt;# Deploy (builds container via CodeBuild, pushes to ECR, &lt;/span&gt;
&lt;span class="c"&gt;# deploys to AgentCore Runtime)&lt;/span&gt;
agentcore deploy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;That's it. No Docker installed locally. No Terraform. CodeBuild handles the container image, and AgentCore Runtime manages the rest.&lt;/p&gt;

&lt;p&gt;The output gives you a &lt;strong&gt;Runtime ARN&lt;/strong&gt; — save it; you'll need it to connect your agent.&lt;/p&gt;


&lt;h2&gt;
  
  
  Step 5: Invoke Your Deployed Server
&lt;/h2&gt;

&lt;p&gt;Test the deployed server using the AWS CLI:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws bedrock-agent-runtime invoke-agent-runtime &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--agent-runtime-arn&lt;/span&gt; &lt;span class="s2"&gt;"arn:aws:bedrock:us-east-1:123456789:agent-runtime/your-runtime-id"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--payload&lt;/span&gt; &lt;span class="s1"&gt;'{"jsonrpc":"2.0","method":"tools/list","id":1}'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt; text
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;You should see your three tools returned. Now try calling one:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws bedrock-agent-runtime invoke-agent-runtime &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--agent-runtime-arn&lt;/span&gt; &lt;span class="s2"&gt;"arn:aws:bedrock:us-east-1:123456789:agent-runtime/your-runtime-id"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--payload&lt;/span&gt; &lt;span class="s1"&gt;'{"jsonrpc":"2.0","method":"tools/call","params":{"name":"get_active_alerts","arguments":{"severity":"critical"}},"id":2}'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt; text
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 6: Connect an AI Agent
&lt;/h2&gt;

&lt;p&gt;Now the fun part. Let's wire this up to a Strands agent that can use our infrastructure tools conversationally:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;strands&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Agent&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;strands.tools.mcp&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;MCPClient&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;mcp.client.streamable_http&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;streamablehttp_client&lt;/span&gt;

&lt;span class="c1"&gt;# Connect to your deployed MCP server via IAM auth
&lt;/span&gt;&lt;span class="n"&gt;mcp_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MCPClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;lambda&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;streamablehttp_client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://your-agentcore-endpoint/mcp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="c1"&gt;# IAM auth is handled automatically via your AWS credentials
&lt;/span&gt;    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;mcp_client&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us.anthropic.claude-sonnet-4-20250514&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;mcp_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;list_tools_sync&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;You are a DevOps assistant with access to 
        infrastructure health tools. When asked about system status, 
        check services, review recent deployments, and surface any 
        active alerts. Be concise and flag anything that needs 
        immediate attention.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Give me a quick health check — any services having issues? &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;And were there any recent deployments that might be related?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The agent will automatically discover the tools, decide which ones to call, and synthesize the results into a coherent answer. You'll see it call &lt;code&gt;get_active_alerts&lt;/code&gt;, then &lt;code&gt;get_service_status&lt;/code&gt; for the flagged services, then &lt;code&gt;list_recent_deployments&lt;/code&gt; to correlate — all without you writing any orchestration logic.&lt;/p&gt;


&lt;h2&gt;
  
  
  What AgentCore Gives You for Free
&lt;/h2&gt;

&lt;p&gt;It's worth pausing to appreciate what you &lt;em&gt;didn't&lt;/em&gt; have to build:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Concern&lt;/th&gt;
&lt;th&gt;Without AgentCore&lt;/th&gt;
&lt;th&gt;With AgentCore&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Container infra&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;ECR + ECS/EKS + ALB&lt;/td&gt;
&lt;td&gt;Handled&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Session isolation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Custom session management&lt;/td&gt;
&lt;td&gt;microVM per session&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Auth&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;OAuth setup, token management&lt;/td&gt;
&lt;td&gt;IAM SigV4 built in&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Scaling&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Auto-scaling policies, metrics&lt;/td&gt;
&lt;td&gt;Automatic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Networking&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;VPC, security groups, NAT&lt;/td&gt;
&lt;td&gt;Managed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Health checks&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Custom implementation&lt;/td&gt;
&lt;td&gt;Built in&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;You wrote a Python file with tool definitions. Everything else is infrastructure you didn't touch.&lt;/p&gt;


&lt;h2&gt;
  
  
  Production Considerations
&lt;/h2&gt;

&lt;p&gt;Before going live with real data, a few things to think about:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Replace mock data with real integrations.&lt;/strong&gt; The tool signatures stay the same — swap &lt;code&gt;random.choice(statuses)&lt;/code&gt; with a call to your CloudWatch API, PagerDuty, or whatever you use.&lt;/p&gt;
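
&lt;p&gt;For instance, a hypothetical CloudWatch-backed replacement could look like this (the namespace, metric name, and dimension are assumptions; use whatever your services actually publish):&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client("cloudwatch")

def fetch_error_rate(service: str) -&amp;gt; float:
    """Hypothetical real integration: read a service's error rate
    from CloudWatch instead of random.choice()."""
    resp = cloudwatch.get_metric_statistics(
        Namespace="MyApp",        # assumption: your own namespace
        MetricName="ErrorRate",   # assumption: your own metric
        Dimensions=[{"Name": "Service", "Value": service}],
        StartTime=datetime.utcnow() - timedelta(minutes=15),
        EndTime=datetime.utcnow(),
        Period=300,
        Statistics=["Average"],
    )
    points = sorted(resp.get("Datapoints", []), key=lambda p: p["Timestamp"])
    return points[-1]["Average"] if points else 0.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;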

&lt;p&gt;&lt;strong&gt;Add error handling.&lt;/strong&gt; MCP tools should return meaningful errors, not stack traces. Wrap your integrations in try/except and return structured error responses.&lt;/p&gt;
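
&lt;p&gt;A minimal sketch of that pattern, with &lt;code&gt;fetch_error_rate&lt;/code&gt; standing in for any real integration call:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;@mcp.tool()
def get_error_rate(service: str) -&amp;gt; dict:
    """Return a service's current error rate, or a structured error."""
    try:
        rate = fetch_error_rate(service)  # hypothetical integration call
        return {"service": service, "error_rate": rate}
    except TimeoutError:
        return {"error": "timeout",
                "message": f"Metric lookup for {service} timed out; retry shortly."}
    except Exception as exc:
        # Give the LLM a structured, readable error -- never a raw stack trace
        return {"error": "lookup_failed", "message": str(exc)}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;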

&lt;p&gt;&lt;strong&gt;Think about tool granularity.&lt;/strong&gt; Three focused tools are better than one "do everything" tool. The LLM needs clear, specific tool descriptions to make good decisions about what to call.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stateful vs stateless.&lt;/strong&gt; Our server is stateless (the default and recommended mode). If you need multi-turn interactions where the server asks the user for clarification mid-execution, look into AgentCore's stateful MCP support with elicitation and sampling.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Connect to AgentCore Gateway.&lt;/strong&gt; If your agent needs tools from multiple MCP servers, the Gateway acts as a single entry point that discovers and routes to all of them. You can also use the Responses API with a Gateway ARN to get server-side tool execution — Bedrock handles the entire orchestration loop in a single API call.&lt;/p&gt;


&lt;h2&gt;
  
  
  Cleanup
&lt;/h2&gt;

&lt;p&gt;When you're done experimenting:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;agentcore destroy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This tears down the Runtime, CodeBuild project, IAM roles, and ECR artifacts. You'll be prompted to confirm.&lt;/p&gt;


&lt;h2&gt;
  
  
  What's Next?
&lt;/h2&gt;

&lt;p&gt;A few directions to take this further:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Add a Gateway&lt;/strong&gt; to combine your MCP server with AWS's open-source MCP servers (S3, DynamoDB, CloudWatch, etc.) into a single agent toolkit&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Try the AG-UI protocol&lt;/strong&gt; alongside MCP — it standardizes how agents communicate with frontends, enabling streaming progress updates and interactive UIs&lt;/li&gt;
&lt;/ul&gt;


&lt;h3&gt;
  
  
References
&lt;/h3&gt;


&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
    &lt;div class="c-embed__content"&gt;
        &lt;div class="c-embed__cover"&gt;
          &lt;a href="https://modelcontextprotocol.io/docs/getting-started/intro" class="c-link align-middle" rel="noopener noreferrer"&gt;
            &lt;img alt="" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fmodelcontextprotocol%2Fdocs%2F2eb6171ddbfeefde349dc3b8d5e2b87414c26250%2Fimages%2Fog-image.png" height="450" class="m-0" width="800"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
      &lt;div class="c-embed__body"&gt;
        &lt;h2 class="fs-xl lh-tight"&gt;
          &lt;a href="https://modelcontextprotocol.io/docs/getting-started/intro" rel="noopener noreferrer" class="c-link"&gt;
            What is the Model Context Protocol (MCP)? - Model Context Protocol
          &lt;/a&gt;
        &lt;/h2&gt;
        &lt;div class="color-secondary fs-s flex items-center"&gt;
            &lt;img alt="favicon" class="c-embed__favicon m-0 mr-2 radius-0" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmodelcontextprotocol.io%2Fmintlify-assets%2F_mintlify%2Ffavicons%2Fmcp%2FebiVJzri-bsiCfVZ%2F_generated%2Ffavicon%2Ffavicon-16x16.png" width="16" height="16"&gt;
          modelcontextprotocol.io
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;



&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
    &lt;div class="c-embed__content"&gt;
        &lt;div class="c-embed__cover"&gt;
          &lt;a href="https://aws.amazon.com/bedrock/agentcore/" class="c-link align-middle" rel="noopener noreferrer"&gt;
            &lt;img alt="" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fd1.awsstatic.com%2Fonedam%2Fmarketing-channels%2Fwebsite%2Faws%2Fen_US%2Fproduct-categories%2Fai-ml%2Fmachine-learning%2Fapproved%2Fimages%2FAWS_Illustration_Prompt_Engineering_4_1200.015a59cde2b2ea143addd04a6f7ae5bb9322b94b.png" height="600" class="m-0" width="800"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
      &lt;div class="c-embed__body"&gt;
        &lt;h2 class="fs-xl lh-tight"&gt;
          &lt;a href="https://aws.amazon.com/bedrock/agentcore/" rel="noopener noreferrer" class="c-link"&gt;
            Amazon Bedrock AgentCore- AWS
          &lt;/a&gt;
        &lt;/h2&gt;
        &lt;div class="color-secondary fs-s flex items-center"&gt;
            &lt;img alt="favicon" class="c-embed__favicon m-0 mr-2 radius-0" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fa0.awsstatic.com%2Flibra-css%2Fimages%2Fsite%2Ffav%2Ffavicon.ico" width="16" height="16"&gt;
          aws.amazon.com
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;



&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/strands-agents" rel="noopener noreferrer"&gt;
        strands-agents
      &lt;/a&gt; / &lt;a href="https://github.com/strands-agents/sdk-python" rel="noopener noreferrer"&gt;
        sdk-python
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      A model-driven approach to building AI agents in just a few lines of code.
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div&gt;
  &lt;div&gt;
    &lt;a href="https://strandsagents.com" rel="nofollow noopener noreferrer"&gt;
      &lt;img src="https://camo.githubusercontent.com/1cf2d94f5ad881d696cc58b3ffad81acf923846f6c5132f56d6a355ebbb9d6a5/68747470733a2f2f737472616e64736167656e74732e636f6d2f6c61746573742f6173736574732f6c6f676f2d6769746875622e737667" alt="Strands Agents" width="55px" height="105px"&gt;
    &lt;/a&gt;
  &lt;/div&gt;
  &lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;
    Strands Agents
  &lt;/h1&gt;
&lt;/div&gt;


&lt;div class="markdown-heading"&gt;

&lt;h2 class="heading-element"&gt;
    A model-driven approach to building AI agents in just a few lines of code
  &lt;/h2&gt;


&lt;/div&gt;
&lt;br&gt;
  &lt;div&gt;
&lt;br&gt;
    &lt;a href="https://github.com/strands-agents/sdk-python/graphs/commit-activity" rel="noopener noreferrer"&gt;&lt;img alt="GitHub commit activity" src="https://camo.githubusercontent.com/97a16934bcf6122bb7d31b378cfdd4e5fdb4366d37e421ca1400a808592151ab/68747470733a2f2f696d672e736869656c64732e696f2f6769746875622f636f6d6d69742d61637469766974792f6d2f737472616e64732d6167656e74732f73646b2d707974686f6e"&gt;&lt;/a&gt;&lt;br&gt;
    &lt;a href="https://github.com/strands-agents/sdk-python/issues" rel="noopener noreferrer"&gt;&lt;img alt="GitHub open issues" src="https://camo.githubusercontent.com/86a1b04e7cf6acc1dcffecd0c710d92f8c234109d7a9ac6cf49254b3a6f9a713/68747470733a2f2f696d672e736869656c64732e696f2f6769746875622f6973737565732f737472616e64732d6167656e74732f73646b2d707974686f6e"&gt;&lt;/a&gt;&lt;br&gt;
    &lt;a href="https://github.com/strands-agents/sdk-python/pulls" rel="noopener noreferrer"&gt;&lt;img alt="GitHub open pull requests" src="https://camo.githubusercontent.com/3f9c1ce371b66ad3d7a84f53b0d4db3eb15ea30e324b44ed7b4ab5aec89af2a6/68747470733a2f2f696d672e736869656c64732e696f2f6769746875622f6973737565732d70722f737472616e64732d6167656e74732f73646b2d707974686f6e"&gt;&lt;/a&gt;&lt;br&gt;
    &lt;a href="https://github.com/strands-agents/sdk-python/blob/main/LICENSE" rel="noopener noreferrer"&gt;&lt;img alt="License" src="https://camo.githubusercontent.com/f0bbad750117a1a77024abdf5b7f295cd20d602d7c5e5d00deb8840bd42b76ee/68747470733a2f2f696d672e736869656c64732e696f2f6769746875622f6c6963656e73652f737472616e64732d6167656e74732f73646b2d707974686f6e"&gt;&lt;/a&gt;&lt;br&gt;
    &lt;a href="https://pypi.org/project/strands-agents/" rel="nofollow noopener noreferrer"&gt;&lt;img alt="PyPI version" src="https://camo.githubusercontent.com/81edea778993e0f3f83076ffef280a65e92d47f4572181429acdb1ce847e4293/68747470733a2f2f696d672e736869656c64732e696f2f707970692f762f737472616e64732d6167656e7473"&gt;&lt;/a&gt;&lt;br&gt;
    &lt;a href="https://python.org" rel="nofollow noopener noreferrer"&gt;&lt;img alt="Python versions" src="https://camo.githubusercontent.com/7bfb2dda3a85f269b08e5df714abd5cd04d453f609ee5258e63e3ccb5e525aea/68747470733a2f2f696d672e736869656c64732e696f2f707970692f707976657273696f6e732f737472616e64732d6167656e7473"&gt;&lt;/a&gt;&lt;br&gt;
  &lt;/div&gt;
&lt;br&gt;
  &lt;p&gt;&lt;br&gt;
    &lt;a href="https://strandsagents.com/" rel="nofollow noopener noreferrer"&gt;Documentation&lt;/a&gt;&lt;br&gt;
    ◆ &lt;a href="https://github.com/strands-agents/samples" rel="noopener noreferrer"&gt;Samples&lt;/a&gt;&lt;br&gt;
    ◆ &lt;a href="https://github.com/strands-agents/sdk-python" rel="noopener noreferrer"&gt;Python SDK&lt;/a&gt;&lt;br&gt;
    ◆ &lt;a href="https://github.com/strands-agents/tools" rel="noopener noreferrer"&gt;Tools&lt;/a&gt;&lt;br&gt;
    ◆ &lt;a href="https://github.com/strands-agents/agent-builder" rel="noopener noreferrer"&gt;Agent Builder&lt;/a&gt;&lt;br&gt;
    ◆ &lt;a href="https://github.com/strands-agents/mcp-server" rel="noopener noreferrer"&gt;MCP Server&lt;/a&gt;&lt;br&gt;
  &lt;/p&gt;
&lt;br&gt;
&lt;/div&gt;

&lt;p&gt;Strands Agents is a simple yet powerful SDK that takes a model-driven approach to building and running AI agents. From simple conversational assistants to complex autonomous workflows, from local development to production deployment, Strands Agents scales with your needs.&lt;/p&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Feature Overview&lt;/h2&gt;

&lt;/div&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Lightweight &amp;amp; Flexible&lt;/strong&gt;: Simple agent loop that just works and is fully customizable&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model Agnostic&lt;/strong&gt;: Support for Amazon Bedrock, Anthropic, Gemini, LiteLLM, Llama, Ollama, OpenAI, Writer, and custom providers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Advanced Capabilities&lt;/strong&gt;: Multi-agent systems, autonomous agents, and streaming support&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Built-in MCP&lt;/strong&gt;: Native support for Model Context Protocol (MCP) servers, enabling access to thousands of pre-built tools&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Quick Start&lt;/h2&gt;

&lt;/div&gt;

&lt;div class="highlight highlight-source-shell notranslate position-relative overflow-auto js-code-highlight"&gt;
&lt;pre&gt;&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Install Strands Agents&lt;/span&gt;
pip install strands-agents strands-agents-tools&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="highlight highlight-source-python notranslate position-relative overflow-auto js-code-highlight"&gt;
&lt;pre&gt;&lt;span class="pl-k"&gt;from&lt;/span&gt; &lt;span class="pl-s1"&gt;strands&lt;/span&gt; &lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-v"&gt;Agent&lt;/span&gt;
&lt;span class="pl-k"&gt;from&lt;/span&gt; &lt;span class="pl-s1"&gt;strands_tools&lt;/span&gt; &lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;calculator&lt;/span&gt;
&lt;span class="pl-s1"&gt;agent&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;/pre&gt;…
&lt;/div&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/strands-agents/sdk-python" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;



&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
    &lt;div class="c-embed__content"&gt;
      &lt;div class="c-embed__body"&gt;
        &lt;h2 class="fs-xl lh-tight"&gt;
          &lt;a href="https://aws.amazon.com/solutions/guidance/deploying-model-context-protocol-servers-on-aws/" rel="noopener noreferrer" class="c-link"&gt;
            Guidance for Deploying Model Context Protocol Servers on AWS
          &lt;/a&gt;
        &lt;/h2&gt;
          &lt;p class="truncate-at-3"&gt;
            This Guidance demonstrates how to securely integrate Model Context Protocol (MCP) servers into AWS applications using containerized architecture. 
          &lt;/p&gt;
        &lt;div class="color-secondary fs-s flex items-center"&gt;
            &lt;img alt="favicon" class="c-embed__favicon m-0 mr-2 radius-0" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fa0.awsstatic.com%2Flibra-css%2Fimages%2Fsite%2Ffav%2Ffavicon.ico" width="16" height="16"&gt;
          aws.amazon.com
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;



</description>
      <category>mcp</category>
      <category>ai</category>
      <category>python</category>
      <category>aws</category>
    </item>
    <item>
      <title>Beyond the LLM: Why Amazon Bedrock Agents are the New EC2 for AI Orchestration</title>
      <dc:creator>Jubin Soni</dc:creator>
      <pubDate>Fri, 03 Apr 2026 07:28:33 +0000</pubDate>
      <link>https://dev.to/jubinsoni/beyond-the-llm-why-amazon-bedrock-agents-are-the-new-ec2-for-ai-orchestration-amj</link>
      <guid>https://dev.to/jubinsoni/beyond-the-llm-why-amazon-bedrock-agents-are-the-new-ec2-for-ai-orchestration-amj</guid>
      <description>&lt;p&gt;In 2006, Amazon Web Services (AWS) launched Elastic Compute Cloud (EC2). It was a watershed moment that moved computing from physical server rooms to a scalable, virtualized utility. Before EC2, if you wanted to launch a web application, you needed to rack servers, manage power, and handle physical networking. EC2 abstracted the "where" and "how" of compute, providing a standardized environment where code could run reliably at scale.&lt;/p&gt;

&lt;p&gt;Today, we are witnessing a similar paradigm shift in Artificial Intelligence. While Large Language Models (LLMs) like Claude, GPT-4, and Llama are the "CPUs" of this new era, the industry has struggled to build the infrastructure required to make these models perform tasks autonomously. Enter Amazon Bedrock Agents (often discussed by architects through the lens of its underlying orchestration engine, which we will refer to here as the AgentCore framework).&lt;/p&gt;

&lt;p&gt;This article argues that Amazon Bedrock Agents represent the "EC2 moment" for AI agents. By providing a managed, secure, and standardized environment for agentic reasoning, AWS is doing for AI autonomy what it did for raw compute two decades ago.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Evolution of the Compute Unit
&lt;/h2&gt;

&lt;p&gt;To understand why Bedrock Agents are significant, we must look at the evolution of abstraction in the cloud. &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Physical Servers&lt;/strong&gt;: Manual hardware management.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;EC2 (Virtual Machines)&lt;/strong&gt;: Abstracted hardware into virtual slices.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Lambda (Serverless Functions)&lt;/strong&gt;: Abstracted the runtime and scaling.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Bedrock Agents (Agentic Orchestration)&lt;/strong&gt;: Abstracting the reasoning loop, tool-calling, and state management.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In the traditional paradigm, developers wrote deterministic logic: &lt;code&gt;if (x) then (y)&lt;/code&gt;. In the agentic paradigm, we provide a goal and a set of tools, and the agent determines the sequence of actions. However, building these agents manually using raw Python and frameworks like LangChain often leads to "spaghetti code" and brittle state management. Bedrock Agents provide the standardized "Instance" where these agents can live, breathe, and execute.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Technical Pillars of AgentCore
&lt;/h2&gt;

&lt;p&gt;What makes an agent more than just a chatbot? It is the ability to use tools (Action Groups), access private data (Knowledge Bases), and maintain a reasoning chain (Orchestration). Amazon Bedrock Agents integrate these three pillars into a unified managed service.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. The Reasoning Engine (The Kernel)
&lt;/h3&gt;

&lt;p&gt;At the heart of the agent is the orchestration logic. Most modern agents use a ReAct (Reason + Act) prompting strategy. Bedrock automates this loop. When a user submits a prompt, the agent enters a cyclic state of thinking, deciding which tool to use, executing that tool, and observing the result until the task is complete.&lt;/p&gt;
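
&lt;p&gt;Conceptually, the loop Bedrock automates looks something like the sketch below. This is illustrative only, not the service's implementation; &lt;code&gt;llm&lt;/code&gt; and &lt;code&gt;tools&lt;/code&gt; are hypothetical interfaces:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def react_loop(goal, llm, tools, max_iterations=8):
    """Illustrative ReAct loop: reason about the next step, act by
    calling a tool, observe the result, and repeat until done."""
    history = [f"Goal: {goal}"]
    for _ in range(max_iterations):
        step = llm.next_step(history, tools)        # Reason (hypothetical interface)
        if step.final_answer is not None:
            return step.final_answer                # Task complete
        result = tools[step.tool](**step.args)      # Act
        history.append(f"{step.tool}: {result}")    # Observe
    raise RuntimeError("No conclusion within the iteration budget")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;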

&lt;h3&gt;
  
  
  2. Action Groups (The I/O Ports)
&lt;/h3&gt;

&lt;p&gt;Action Groups are the interfaces through which an agent interacts with the outside world. Think of these as the peripheral ports on an EC2 instance. You define an OpenAPI schema and link it to an AWS Lambda function. The agent reads the schema, understands what the API does, and generates the necessary parameters to call it.&lt;/p&gt;
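
&lt;p&gt;On the Lambda side, the executor receives the path and parameters the agent chose and replies in Bedrock's response envelope. A sketch of an order-lookup executor (field handling abridged; check the current docs for the full contract):&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json

def lambda_handler(event, context):
    """Sketch of an Action Group executor behind an OpenAPI schema."""
    api_path = event["apiPath"]                     # e.g. "/orders/{orderId}"
    params = {p["name"]: p["value"] for p in event.get("parameters", [])}

    if api_path == "/orders/{orderId}":
        body = {"orderId": params.get("orderId"), "status": "shipped"}
    else:
        body = {"error": f"unhandled path: {api_path}"}

    return {
        "messageVersion": "1.0",
        "response": {
            "actionGroup": event["actionGroup"],
            "apiPath": api_path,
            "httpMethod": event["httpMethod"],
            "httpStatusCode": 200,
            "responseBody": {"application/json": {"body": json.dumps(body)}},
        },
    }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;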

&lt;h3&gt;
  
  
  3. Knowledge Bases (The Persistent Storage)
&lt;/h3&gt;

&lt;p&gt;An agent is only as good as its context. Bedrock Knowledge Bases provide a managed RAG (Retrieval-Augmented Generation) workflow: they handle document chunking, embedding generation, and vector database storage (e.g., OpenSearch or Pinecone). When an agent receives a query, it automatically queries the Knowledge Base to augment its response with private, up-to-date data.&lt;/p&gt;
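
&lt;p&gt;The agent queries its Knowledge Base automatically during orchestration, but you can also call the retrieval API directly, which is useful for debugging chunking and relevance. A sketch with a placeholder knowledge base ID:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import boto3

runtime = boto3.client("bedrock-agent-runtime")

resp = runtime.retrieve(
    knowledgeBaseId="KB12345678",   # placeholder
    retrievalQuery={"text": "What is our refund policy?"},
    retrievalConfiguration={
        "vectorSearchConfiguration": {"numberOfResults": 3}
    },
)
for result in resp["retrievalResults"]:
    print(result["content"]["text"][:120])   # preview each retrieved chunk
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;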

&lt;h2&gt;
  
  
  Visualizing the Agentic Workflow
&lt;/h2&gt;

&lt;p&gt;To understand how these components interact, let's look at the sequence of a typical request handled by a Bedrock Agent.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F35cyddauj2yynm35i0fc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F35cyddauj2yynm35i0fc.png" alt="sequence diagram" width="800" height="329"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The "EC2 of Agents" Argument
&lt;/h2&gt;

&lt;p&gt;Why do we compare this to EC2? Because it solves the four major hurdles of agent deployment: Scalability, Security, Persistence, and Standardized Packaging.&lt;/p&gt;

&lt;h3&gt;
  
  
  Scalability and Concurrency
&lt;/h3&gt;

&lt;p&gt;Building an agent on a local server or a custom container requires you to manage the memory of the conversation, the latency of the LLM calls, and the concurrent execution of tools. Bedrock Agents are serverless. Whether you have 1 user or 10,000, AWS manages the underlying compute resource required to run the reasoning loops.&lt;/p&gt;

&lt;h3&gt;
  
  
  Security and Identity (IAM)
&lt;/h3&gt;

&lt;p&gt;Just as EC2 uses IAM roles to access S3 buckets, Bedrock Agents use IAM roles to execute Lambda functions and query Knowledge Bases. This provides a fine-grained security model where the "Agent Identity" is strictly governed. You aren't passing raw API keys into a prompt; you are authorizing a service role.&lt;/p&gt;

&lt;h3&gt;
  
  
  Versioning and Aliasing
&lt;/h3&gt;

&lt;p&gt;One of the most powerful features of EC2 and Lambda is the ability to version deployments. Bedrock Agents allow you to create immutable versions and point aliases (like "PROD" or "DEV") to specific versions. This enables a professional CI/CD pipeline for AI agents, which was previously difficult to achieve with manual LLM chains.&lt;/p&gt;
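
&lt;p&gt;In Boto3 terms, promoting a tested version is a single call; a sketch with placeholder IDs:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import boto3

bedrock_agent = boto3.client("bedrock-agent")

# Point the PROD alias at an immutable, already-prepared version
bedrock_agent.create_agent_alias(
    agentId="AGENT_ID",                              # placeholder
    agentAliasName="PROD",
    routingConfiguration=[{"agentVersion": "1"}],
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;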

&lt;h2&gt;
  
  
  Lifecycle of an Agent
&lt;/h2&gt;

&lt;p&gt;Managing an agent's state is non-trivial. The following state diagram illustrates how an agent moves from a draft configuration to a production-ready resource.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faopfp20c2fq4k8abccoy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faopfp20c2fq4k8abccoy.png" alt="State Diagram" width="527" height="486"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Comparison: Traditional Development vs. Bedrock Agents
&lt;/h2&gt;

&lt;p&gt;Below is a comparison of how common agentic requirements are handled in a "DIY" environment versus the Bedrock Agent environment.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;DIY (LangChain/Custom)&lt;/th&gt;
&lt;th&gt;Amazon Bedrock Agents&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;State Management&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Manual (Redis/DynamoDB)&lt;/td&gt;
&lt;td&gt;Managed (Session State)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Orchestration Loop&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Custom Python logic&lt;/td&gt;
&lt;td&gt;Managed (ReAct based)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Tool Integration&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Manual API wrappers&lt;/td&gt;
&lt;td&gt;OpenAPI Schema + Lambda&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;RAG Integration&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Custom Vector DB pipelines&lt;/td&gt;
&lt;td&gt;Integrated Knowledge Bases&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Scaling&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Manual (K8s/ECS)&lt;/td&gt;
&lt;td&gt;Serverless / Auto-scaling&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Tracing/Logging&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Custom implementation&lt;/td&gt;
&lt;td&gt;Integrated CloudWatch / X-Ray&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Security&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;API Key Management&lt;/td&gt;
&lt;td&gt;IAM Role-based access&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Technical Implementation: Building an Agent Programmatically
&lt;/h2&gt;

&lt;p&gt;To demonstrate the power of the AgentCore approach, let's look at how we define an agent using the AWS SDK for Python (Boto3). This example shows the creation of an agent, but the real magic is in the simplicity of the configuration.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;

&lt;span class="n"&gt;bedrock_agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;service_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;bedrock-agent&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;create_support_agent&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="c1"&gt;# 1. Create the Agent
&lt;/span&gt;    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bedrock_agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;agentName&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;CustomerSupportAgent&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;foundationModel&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;anthropic.claude-3-sonnet-20240229-v1:0&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;instruction&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;You are a helpful customer support assistant. Use the provided tools to lookup orders.&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;agentResourceRoleArn&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;arn:aws:iam::123456789012:role/MyAgentRole&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;agent_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;agent&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;agentId&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="c1"&gt;# 2. Add an Action Group (The toolset)
&lt;/span&gt;    &lt;span class="n"&gt;bedrock_agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_agent_action_group&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;agentId&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;agent_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;agentVersion&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;DRAFT&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;actionGroupName&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;OrderManagementTools&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Tools for looking up and modifying customer orders.&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;actionGroupExecutor&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;lambda&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;arn:aws:lambda:us-east-1:123456789012:function:OrderLookupFunc&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="n"&gt;apiSchema&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s3&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s3BucketName&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;my-schema-bucket&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s3ObjectKey&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;order_api_schema.yaml&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# 3. Prepare the Agent (Compiles the configuration)
&lt;/span&gt;    &lt;span class="n"&gt;bedrock_agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;prepare_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agentId&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;agent_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;agent_id&lt;/span&gt;

&lt;span class="c1"&gt;# Usage
&lt;/span&gt;&lt;span class="n"&gt;agent_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_support_agent&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Agent &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;agent_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; is being initialized...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Understanding the Code
&lt;/h3&gt;

&lt;p&gt;In this snippet, we aren't writing any code for "how the model should think." We are defining:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Identity&lt;/strong&gt;: &lt;code&gt;agentName&lt;/code&gt; and &lt;code&gt;agentResourceRoleArn&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Brain&lt;/strong&gt;: The &lt;code&gt;foundationModel&lt;/code&gt; (Claude 3 Sonnet).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Boundaries&lt;/strong&gt;: The &lt;code&gt;instruction&lt;/code&gt; (System Prompt).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Capabilities&lt;/strong&gt;: The &lt;code&gt;actionGroupExecutor&lt;/code&gt; (The Lambda function that actually does the work).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When &lt;code&gt;prepare_agent&lt;/code&gt; is called, AWS packages these components into a runtime environment—identical to how EC2 packages an AMI (Amazon Machine Image) into a running instance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Deep Dive: The Orchestration Logic
&lt;/h2&gt;

&lt;p&gt;The most significant technical contribution of Bedrock Agents is the managed orchestration. A task that takes n steps to solve requires the agent to carry a consistent memory of everything that has already occurred; without a managed runtime, that state is yours to persist and replay.&lt;/p&gt;

&lt;p&gt;Bedrock uses a "Trace" feature that allows developers to see the exact reasoning of the agent. This is divided into:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Pre-processing&lt;/strong&gt;: Validating if the user input is malicious or out of scope.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Orchestration&lt;/strong&gt;: The step-by-step reasoning where the model decides which tool to call.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Post-processing&lt;/strong&gt;: Formatting the final response for the user.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This visibility is crucial for debugging. In the EC2 world, we have SSH and CloudWatch Logs. In the Bedrock Agent world, we have the Orchestration Trace.&lt;/p&gt;
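
&lt;p&gt;To watch the trace in practice, enable it at invocation time and read the event stream; a sketch with placeholder IDs:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import uuid
import boto3

runtime = boto3.client("bedrock-agent-runtime")

response = runtime.invoke_agent(
    agentId="AGENT_ID",           # placeholder
    agentAliasId="ALIAS_ID",      # placeholder
    sessionId=str(uuid.uuid4()),
    inputText="Where is order 1042?",
    enableTrace=True,
)

for event in response["completion"]:
    if "trace" in event:
        print(event["trace"])                    # pre-processing / orchestration / post-processing steps
    elif "chunk" in event:
        print(event["chunk"]["bytes"].decode())  # the final answer text
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;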

&lt;h2&gt;
  
  
  The Ecosystem Mindmap
&lt;/h2&gt;

&lt;p&gt;The utility of an agent is defined by what it can connect to. The Bedrock Agent sits at the center of a vast AWS ecosystem.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdq34t7z1n3lmpv7y9ht8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdq34t7z1n3lmpv7y9ht8.png" alt="Diagram" width="800" height="320"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Cost Dimension
&lt;/h2&gt;

&lt;p&gt;Just as EC2 introduced the concept of paying for what you use, Bedrock Agents follow a similar philosophy. You pay for the underlying model tokens used during the reasoning process, and a small management fee. This eliminates the "idle cost" of running a custom agentic framework on a cluster of instances that might not be doing work 24/7.&lt;/p&gt;

&lt;p&gt;However, developers must be mindful of "Infinite Loops." If an agent's instructions are vague, it might call tools repeatedly without reaching a conclusion. Bedrock includes built-in timeouts and max-iteration settings to prevent the "Agentic version" of a runaway process that drains your budget.&lt;/p&gt;

&lt;h2&gt;
  
  
  Challenges and Considerations
&lt;/h2&gt;

&lt;p&gt;While Bedrock Agents are the "EC2 of AI," the technology is still maturing. Here are a few technical hurdles developers face:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cold Starts&lt;/strong&gt;: Just like Lambda, the initial "Preparation" of an agent can take time. Once prepared, the invocation is fast, but the initial spin-up of the reasoning context has latency.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Schema Strictness&lt;/strong&gt;: The OpenAPI schemas used for Action Groups must be precise. LLMs are sensitive to parameter descriptions. If your schema says a parameter is a string but doesn't explain what that string represents, the agent may hallucinate the input (see the schema sketch after this list).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context Window Limits&lt;/strong&gt;: Even though the agent manages the conversation, the underlying model has a finite context window. For very long, multi-step tasks involving massive data retrieval, the agent must be designed to summarize previous steps to avoid hitting token limits.&lt;/li&gt;
&lt;/ul&gt;
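
&lt;p&gt;To illustrate the schema point, here is a sketch of a single Action Group operation written as a Python dict and serialized to OpenAPI JSON. The endpoint and parameter names are invented for the example; the part that matters is the &lt;code&gt;description&lt;/code&gt; field, because that text is what the model reads when deciding how to fill the parameter.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json

# Hypothetical operation for a claims-lookup Action Group.
get_claim_status = {
    "get": {
        "operationId": "getClaimStatus",
        "description": "Look up the current processing status of an insurance claim.",
        "parameters": [{
            "name": "claimId",
            "in": "path",
            "required": True,
            "schema": {"type": "string"},
            # Bad:  "description": "a string"
            # Good: say what the string represents and when not to guess it.
            "description": (
                "The claim identifier, e.g. 'CLM-2024-00123'. "
                "If the user has not provided one, ask for it instead of guessing."
            ),
        }],
    }
}

print(json.dumps({"paths": {"/claims/{claimId}": get_claim_status}}, indent=2))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;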

&lt;h2&gt;
  
  
  The Future: From Instances to Fleets
&lt;/h2&gt;

&lt;p&gt;We are moving toward a world of "Agentic Fleets." If an individual Bedrock Agent is an EC2 instance, then the future involves "Auto-scaling Groups" of agents—multiple specialized agents working together (Multi-Agent Systems). &lt;/p&gt;

&lt;p&gt;AWS has already hinted at this with features that allow agents to call other agents. This creates a hierarchical structure where a "Manager Agent" decomposes a complex project into sub-tasks and delegates them to "Worker Agents" specialized in specific domains (e.g., one for SQL generation, one for document writing, one for code execution).&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Amazon Bedrock Agents (AgentCore) are more than a convenience feature for developers; they represent the standardization of AI autonomy. By providing a managed environment for reasoning, tool use, and data retrieval, AWS is removing the heavy lifting of "Agentic Ops."&lt;/p&gt;

&lt;p&gt;Just as EC2 allowed a single developer to launch an application that could serve millions, Bedrock Agents allow a single developer to build an autonomous system that can navigate complex business logic that previously required manual human intervention. We are no longer just building models; we are deploying virtual employees on scalable cloud infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further Reading &amp;amp; Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/bedrock/agents/" rel="noopener noreferrer"&gt;Amazon Bedrock Agents Service Page&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/bedrock/latest/userguide/agents-how-it-works.html" rel="noopener noreferrer"&gt;AWS Documentation: How Amazon Bedrock Agents Work&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2210.03629" rel="noopener noreferrer"&gt;The ReAct Framework: Synergizing Reasoning and Acting in Language Models&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/bedrock-agent.html" rel="noopener noreferrer"&gt;Boto3 Documentation for Amazon Bedrock Agents&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/blogs/architecture/building-generative-ai-agents-with-amazon-bedrock/" rel="noopener noreferrer"&gt;AWS Architecture Blog: Building Generative AI Agents&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>aws</category>
      <category>generativeai</category>
      <category>aiagents</category>
      <category>cloudcomputing</category>
    </item>
    <item>
      <title>I Gave Gemini 3 My Worst Legacy Code — Here’s What Happened</title>
      <dc:creator>Jubin Soni</dc:creator>
      <pubDate>Tue, 31 Mar 2026 00:44:13 +0000</pubDate>
      <link>https://dev.to/jubinsoni/i-gave-gemini-3-my-worst-legacy-code-heres-what-happened-5h68</link>
      <guid>https://dev.to/jubinsoni/i-gave-gemini-3-my-worst-legacy-code-heres-what-happened-5h68</guid>
      <description>&lt;h2&gt;
  
  
  Introduction: The Digital Archaeology Experiment
&lt;/h2&gt;

&lt;p&gt;We all have that one folder. The one labeled "v1_final_do_not_touch_2016." It is a sprawling ecosystem of spaghetti code, global variables, and comments that simply read &lt;code&gt;// I am sorry.&lt;/code&gt; In an era of Large Language Models (LLMs), we often hear about AI writing boilerplate, but can it actually perform digital archaeology?&lt;/p&gt;

&lt;p&gt;I decided to feed my most "haunted" legacy script—a 2,000-line monolith responsible for processing data—into a hypothetical next-generation model, Gemini 3. The goal wasn't just to see if it could fix the bugs, but to see if it could transform a maintenance nightmare into a modern, scalable architecture. &lt;/p&gt;

&lt;p&gt;What followed was a masterclass in software engineering best practices. The AI didn't just move code around; it applied structural patterns that we often neglect in the heat of deadlines. This guide breaks down the core best practices Gemini 3 utilized to transform legacy junk into production-grade software, and why you should apply these practices even if you aren't using an AI assistant.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. The Single Responsibility Principle (SRP): Deconstructing the Monolith
&lt;/h2&gt;

&lt;p&gt;The first thing the AI flagged was the "God Object" syndrome. In my legacy code, a single function called &lt;code&gt;process_claim()&lt;/code&gt; was responsible for: &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Validating user input.&lt;/li&gt;
&lt;li&gt;Connecting to a MySQL database.&lt;/li&gt;
&lt;li&gt;Calculating claim totals with hardcoded tax rules.&lt;/li&gt;
&lt;li&gt;Sending an email notification.&lt;/li&gt;
&lt;li&gt;Logging errors to a local file.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  The Bad Practice (The Monolith)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process_claim&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;claim_data&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Validation
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;claim_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="c1"&gt;# Database logic
&lt;/span&gt;    &lt;span class="n"&gt;db&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;connect_to_db&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prod_db&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;INSERT INTO claims VALUES (&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;claim_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Business logic
&lt;/span&gt;    &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;claim_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;1.15&lt;/span&gt; &lt;span class="c1"&gt;# Hardcoded tax
&lt;/span&gt;
    &lt;span class="c1"&gt;# Notification
&lt;/span&gt;    &lt;span class="nf"&gt;send_email&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;admin@company.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Claim &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; processed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Success&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Why This Fails
&lt;/h3&gt;

&lt;p&gt;This code is impossible to test in isolation. If you want to test the tax calculation, you must have a live database connection and an email server ready. Furthermore, a change in the email provider's API forces a change in the business logic file, violating the principle that software should be easy to change without unintended side effects.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Good Practice (Applying SRP)
&lt;/h3&gt;

&lt;p&gt;Gemini 3 refactored this into distinct services. Validation, Persistence, Calculation, and Messaging were separated.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ClaimValidator&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;validate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; 
            &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValidationError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Missing ID&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;TaxCalculator&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;calculate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;region_code&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;rate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_get_rate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;region_code&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;rate&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ClaimService&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;validator&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;calculator&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;repository&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;notifier&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;validator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;validator&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;calculator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;calculator&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;repository&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;repository&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;notifier&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;notifier&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;claim_data&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;validator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;validate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;claim_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;calculator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;calculate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;claim_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;US&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;repository&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;claim_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;notifier&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Claim &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; processed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Why It Matters
&lt;/h3&gt;

&lt;p&gt;By separating concerns, the code becomes modular. You can now swap the &lt;code&gt;TaxCalculator&lt;/code&gt; for a different regional version without touching the &lt;code&gt;ClaimService&lt;/code&gt;. Testing becomes a matter of passing "mock" objects into the constructor, ensuring your unit tests are fast and reliable.&lt;/p&gt;
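&lt;p&gt;As a quick sketch of what that buys you, here is a test for the refactored service using &lt;code&gt;unittest.mock&lt;/code&gt;; no database or mail server is involved:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from unittest.mock import Mock

# Every collaborator is a stand-in, so the test is fast and deterministic.
validator = Mock()
calculator = Mock()
calculator.calculate.return_value = 115.0
repository = Mock()
notifier = Mock()

service = ClaimService(validator, calculator, repository, notifier)
service.execute({"id": "CLM-1", "amount": 100.0})

repository.save.assert_called_once()
notifier.send.assert_called_once_with("Claim 115.0 processed")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;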

&lt;h3&gt;
  
  
  Checklist for Applying SRP
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Identify "Ands"&lt;/td&gt;
&lt;td&gt;If a function does A &lt;em&gt;and&lt;/em&gt; B, it needs to be split.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Extract Logic&lt;/td&gt;
&lt;td&gt;Move business rules into separate, pure functions.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Isolate I/O&lt;/td&gt;
&lt;td&gt;Keep database and API calls outside of core logic classes.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Limit Lines&lt;/td&gt;
&lt;td&gt;Aim for functions under 20 lines of code.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  2. Decoupling Through Dependency Injection
&lt;/h2&gt;

&lt;p&gt;One of the most profound changes Gemini 3 suggested involved how objects interact. In the legacy code, objects instantiated their own dependencies. If Class A needed Class B, it would simply call &lt;code&gt;b = new ClassB()&lt;/code&gt; inside its constructor. This creates "tight coupling."&lt;/p&gt;

&lt;h3&gt;
  
  
  Visualizing the Transformation
&lt;/h3&gt;

&lt;p&gt;Below is a &lt;strong&gt;Flowchart&lt;/strong&gt; illustrating the decision-making process for decoupling legacy dependencies.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx7b8g632l0go2093razk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx7b8g632l0go2093razk.png" alt="Flowchart Diagram" width="586" height="910"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Pitfall: The "New" Keyword
&lt;/h3&gt;

&lt;p&gt;When you use &lt;code&gt;new&lt;/code&gt; inside a class, you are locking that class to a specific implementation. This makes it impossible to substitute a mock version for testing or a different implementation for a new environment (like a staging server).&lt;/p&gt;

&lt;h3&gt;
  
  
  The Solution: Dependency Injection (DI)
&lt;/h3&gt;

&lt;p&gt;Instead of creating the dependency inside the class, you "inject" it—usually via the constructor. This practice shifts the responsibility of object creation to the caller or a dedicated DI container.&lt;/p&gt;

&lt;h3&gt;
  
  
  Comparison: Before vs. After
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Bad (Tight Coupling):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;OrderService&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nf"&gt;constructor&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;database&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;PostgresDatabase&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt; &lt;span class="c1"&gt;// Hardcoded dependency&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Good (Loose Coupling):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;OrderService&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nf"&gt;constructor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;database&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="c1"&gt;// Injected dependency&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;database&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;database&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The Benefit:&lt;/strong&gt; In your production environment, you pass a real &lt;code&gt;PostgresDatabase&lt;/code&gt;. In your test environment, you pass an &lt;code&gt;InMemoryDatabase&lt;/code&gt;. The &lt;code&gt;OrderService&lt;/code&gt; doesn't know the difference, making it highly reusable.&lt;/p&gt;
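
&lt;p&gt;The same wiring works in any language. Here is a minimal Python sketch of the pattern, with &lt;code&gt;InMemoryDatabase&lt;/code&gt; as an invented test double:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;class InMemoryDatabase:
    """Test double: keeps rows in a dict instead of talking to Postgres."""
    def __init__(self):
        self.rows = {}

    def save(self, order_id, data):
        self.rows[order_id] = data

class OrderService:
    def __init__(self, database):  # the dependency is injected, not constructed
        self.database = database

    def place(self, order_id, data):
        self.database.save(order_id, data)

# Production wiring would pass a real client; the test passes the double.
service = OrderService(InMemoryDatabase())
service.place("order-1", {"total": 42})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;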




&lt;h2&gt;
  
  
  3. Defensive Programming and Error Handling
&lt;/h2&gt;

&lt;p&gt;Legacy code often treats error handling as an afterthought, using generic &lt;code&gt;try-catch&lt;/code&gt; blocks that swallow exceptions or returning &lt;code&gt;null&lt;/code&gt; values that eventually lead to the dreaded "Null Reference Exception."&lt;/p&gt;

&lt;p&gt;Gemini 3's refactoring emphasized &lt;strong&gt;Defensive Programming&lt;/strong&gt;: the practice of anticipating invalid inputs and failure modes up front, so the software degrades gracefully instead of crashing when the unexpected happens.&lt;/p&gt;

&lt;h3&gt;
  
  
  Sequence Diagram: Proper Error Handling Flow
&lt;/h3&gt;

&lt;p&gt;This &lt;strong&gt;Sequence Diagram&lt;/strong&gt; shows the interaction between a client, a service, and an external API using resilient patterns.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv44vxuo7r2dcfq2lsoxg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv44vxuo7r2dcfq2lsoxg.png" alt="Sequence Diagram" width="729" height="693"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Defensive Practices
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Fail Fast:&lt;/strong&gt; Validate inputs at the very beginning of a function. If they are invalid, throw an exception immediately.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Use Meaningful Exceptions:&lt;/strong&gt; Instead of throwing &lt;code&gt;Error&lt;/code&gt;, throw &lt;code&gt;InsufficientFundsError&lt;/code&gt; or &lt;code&gt;UserNotFoundError&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Circuit Breakers:&lt;/strong&gt; If an external service is down, don't keep hammering it. Stop the calls and return a cached result or a graceful failure (a minimal sketch follows this list).&lt;/li&gt;
&lt;/ol&gt;
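
&lt;p&gt;Here is a minimal circuit-breaker sketch, assuming consecutive &lt;code&gt;ConnectionError&lt;/code&gt;s are the failure signal; the threshold and cooldown would be tuned per service:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import time

class CircuitOpenError(Exception):
    """Raised when calls are being short-circuited to protect a failing service."""

class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at &lt; self.reset_after:
                raise CircuitOpenError("Circuit open; serve a cached or fallback result.")
            self.opened_at = None  # cooldown elapsed: allow one trial call

        try:
            result = func(*args, **kwargs)
        except ConnectionError:
            self.failures += 1
            if self.failures &gt;= self.max_failures:
                self.opened_at = time.monotonic()  # trip open: stop hammering
            raise
        self.failures = 0  # any success resets the count
        return result
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;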

&lt;h3&gt;
  
  
  Good vs. Bad Error Handling
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Bad Practice:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;api&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;except&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;pass&lt;/span&gt; &lt;span class="c1"&gt;# Silently failing is the worst thing you can do
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Good Practice:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;api&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_user&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;ConnectionError&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Failed to connect to UserAPI for ID &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ServiceUnavailableError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Our user service is temporarily down.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;UserNotFoundError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="c1"&gt;# Explicitly handled
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  4. Modernizing State Management
&lt;/h2&gt;

&lt;p&gt;In my legacy script, the code relied heavily on global state. A variable like &lt;code&gt;current_user_id&lt;/code&gt; was updated by multiple functions across the file. This led to unpredictable bugs where the state would change in the middle of a process due to an asynchronous callback.&lt;/p&gt;

&lt;h3&gt;
  
  
  Implementation: Using Immutability
&lt;/h3&gt;

&lt;p&gt;Instead of modifying an existing object, create a new one. This ensures that other parts of the system holding a reference to the old object aren't surprised by a sudden change.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bad (Mutable):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;updatePrice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;product&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;newPrice&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;product&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;price&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;newPrice&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// Changes the object everywhere&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Good (Immutable):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;updatePrice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;product&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;newPrice&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;product&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;price&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;newPrice&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt; &lt;span class="c1"&gt;// Returns a new object&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By using immutability, you make your code thread-safe and much easier to debug. If a bug occurs, you can inspect the state at any point in time without worrying that it was modified downstream.&lt;/p&gt;
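
&lt;p&gt;The same idea carries over to Python, sketched here with frozen dataclasses:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from dataclasses import dataclass, replace

@dataclass(frozen=True)  # instances cannot be mutated after construction
class Product:
    name: str
    price: float

def update_price(product, new_price):
    # replace() returns a new Product; the original is untouched.
    return replace(product, price=new_price)

original = Product("widget", 9.99)
discounted = update_price(original, 7.99)
assert original.price == 9.99  # holders of the old reference see no change
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;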




&lt;h2&gt;
  
  
  5. Refactoring Summary: The Do's and Don'ts
&lt;/h2&gt;

&lt;p&gt;To help you apply these findings to your own legacy codebases, here is a summary table of the transformations Gemini 3 performed.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Area&lt;/th&gt;
&lt;th&gt;Don't Do This (Legacy)&lt;/th&gt;
&lt;th&gt;Do This (Modern)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Logic&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Giant functions with nested if/else.&lt;/td&gt;
&lt;td&gt;Small, pure functions with early returns.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Direct manipulation of global state.&lt;/td&gt;
&lt;td&gt;Immutable data structures and local state.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Dependencies&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Hardcoded &lt;code&gt;new&lt;/code&gt; instances.&lt;/td&gt;
&lt;td&gt;Injected dependencies via interfaces.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Errors&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Generic &lt;code&gt;try-catch&lt;/code&gt; with empty bodies.&lt;/td&gt;
&lt;td&gt;Domain-specific exceptions and logging.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Performance&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Nested loops with O(n^2) complexity.&lt;/td&gt;
&lt;td&gt;Optimized algorithms with O(n) or O(log n).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Documentation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Comments explaining &lt;em&gt;what&lt;/em&gt; code does.&lt;/td&gt;
&lt;td&gt;Self-documenting code explaining &lt;em&gt;why&lt;/em&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Common Pitfalls to Avoid During Refactoring
&lt;/h2&gt;

&lt;p&gt;Even with an AI as powerful as Gemini 3, refactoring is not without risks. Here are three common pitfalls I encountered during this experiment:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Refactoring Without Tests:&lt;/strong&gt; Never start refactoring until you have "Characterization Tests"—tests that describe how the code &lt;em&gt;currently&lt;/em&gt; behaves. If you change the code and the tests pass, you know you haven't broken existing functionality (a characterization-test sketch follows this list).&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Over-Engineering:&lt;/strong&gt; It is tempting to apply every design pattern (Factory, Strategy, Observer) at once. Only introduce complexity when it solves a specific problem. If a simple function works, you don't need a class.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;The "Big Bang" Rewrite:&lt;/strong&gt; Resist the urge to rewrite the entire system from scratch. This almost always leads to project failure. Instead, refactor one small module at a time, ensuring the system remains operational throughout the process.&lt;/li&gt;
&lt;/ol&gt;
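
&lt;p&gt;For the first pitfall, a characterization test can be as blunt as pinning down whatever the legacy function returns today. Sketched below against the &lt;code&gt;process_claim&lt;/code&gt; monolith from earlier (in practice &lt;code&gt;connect_to_db&lt;/code&gt; and &lt;code&gt;send_email&lt;/code&gt; would be stubbed out first so the tests don't touch production infrastructure):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Characterization tests assert what the code DOES today, not what it SHOULD do.
# If one of these breaks mid-refactor, observable behavior has changed.

def test_process_claim_succeeds_on_valid_input():
    result = process_claim({"id": "CLM-1", "amount": 100.0})
    assert result == "Success"

def test_process_claim_returns_error_when_id_missing():
    assert process_claim({"amount": 100.0}) == "Error"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;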




&lt;h2&gt;
  
  
  Practical Guidance: An Implementation Roadmap
&lt;/h2&gt;

&lt;p&gt;If you are staring at a mountain of legacy code today, here is the recommended roadmap for modernization:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Identify the Pain Points:&lt;/strong&gt; Which part of the code breaks most often? Start there.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Write Integration Tests:&lt;/strong&gt; Capture the current behavior of that module.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Decouple the Core:&lt;/strong&gt; Identify the business logic and extract it from the infrastructure (database/UI).&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Introduce Dependency Injection:&lt;/strong&gt; Allow your business logic to be tested in isolation.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Clean Up the Syntax:&lt;/strong&gt; Use modern language features (like Async/Await or Type Hints) to improve readability.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Conclusion: AI as the Ultimate Pair Programmer
&lt;/h2&gt;

&lt;p&gt;Feeding my worst legacy code to Gemini 3 was an eye-opening experience. The AI didn't just "fix" the code; it enforced a level of discipline that is often lost in the day-to-day grind of feature delivery. It reminded me that the most important audience for our code isn't the compiler—it is the human developer who has to maintain it six months from now.&lt;/p&gt;

&lt;p&gt;By prioritizing the Single Responsibility Principle, decoupling dependencies through injection, and embracing defensive programming, we can turn even the most frightening legacy scripts into robust, modern systems. Whether you use an AI assistant or your own expertise, these best practices remain the bedrock of professional software engineering.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further Reading &amp;amp; Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://martinfowler.com/books/refactoring.html" rel="noopener noreferrer"&gt;Refactoring: Improving the Design of Existing Code by Martin Fowler&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.oreilly.com/library/view/clean-code-a/9780136083238/" rel="noopener noreferrer"&gt;Clean Code: A Handbook of Agile Software Craftsmanship&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://12factor.net/" rel="noopener noreferrer"&gt;The Twelve-Factor App Methodology&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/google/styleguide" rel="noopener noreferrer"&gt;Google Software Engineering Best Practices&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.digitalocean.com/community/conceptual-articles/s-o-l-i-d-the-first-five-principles-of-object-oriented-design" rel="noopener noreferrer"&gt;SOLID Principles of Object-Oriented Design&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;Connect with me: &lt;a href="https://linkedin.com/in/jubinsoni" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; | &lt;a href="https://twitter.com/sonijubin" rel="noopener noreferrer"&gt;Twitter/X&lt;/a&gt; | &lt;a href="https://github.com/jubins" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; | &lt;a href="https://jubinsoni.com" rel="noopener noreferrer"&gt;Website&lt;/a&gt;&lt;/p&gt;

</description>
      <category>cleancode</category>
      <category>legacysystems</category>
      <category>ai</category>
      <category>refactoring</category>
    </item>
  </channel>
</rss>
