<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Asako Hayase</title>
    <description>The latest articles on DEV Community by Asako Hayase (@asakohayase).</description>
    <link>https://dev.to/asakohayase</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2019147%2F9adcceb0-1f6b-4ba5-ab9a-6f884529de64.jpeg</url>
      <title>DEV Community: Asako Hayase</title>
      <link>https://dev.to/asakohayase</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/asakohayase"/>
    <language>en</language>
    <item>
      <title>I Built a Rhythm Game That Lives Above My IDE</title>
      <dc:creator>Asako Hayase</dc:creator>
      <pubDate>Sun, 07 Jun 2026 20:31:05 +0000</pubDate>
      <link>https://dev.to/asakohayase/i-built-a-rhythm-game-that-lives-above-my-ide-icp</link>
      <guid>https://dev.to/asakohayase/i-built-a-rhythm-game-that-lives-above-my-ide-icp</guid>
      <description>&lt;h2&gt;
  
  
  1. Introduction
&lt;/h2&gt;

&lt;p&gt;Every year, I pick up new hobbies. This year: drums.&lt;/p&gt;

&lt;p&gt;When I was watching Claude Code flibbertigibbeting, I thought, "why don't I build a game to practice my rhythm skills?"&lt;/p&gt;

&lt;p&gt;So I built a rhythm game in Electron that floats transparently above my IDE. When Claude Code is thinking, I hit F and J to play low and high hits against whatever song is loaded. When it responds, I go back to coding. No context switch, no separate window. The game is just there, above everything else, always.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. What I Built
&lt;/h2&gt;

&lt;p&gt;The game loads any audio file, analyzes it offline to find where the low and high hits land, then spawns hit targets that scroll toward two drum pads synced to the song's playback position. You hit F for low hits, J for high hits. Timing accuracy scores each hit.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Loads any audio file&lt;/strong&gt;: MP3, WAV, OGG, M4A, AAC, FLAC&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Analyzes beats offline&lt;/strong&gt;: runs onset detection across the whole file before playback starts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Five visual themes&lt;/strong&gt;: Lime, Classic, Forest, Neon, Dusk&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;BPM auto-detection&lt;/strong&gt;: estimates tempo from detected low-hit intervals&lt;/li&gt;
&lt;/ul&gt;
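&lt;p&gt;To make "timing accuracy scores each hit" concrete, here is a minimal sketch of how a key press could be graded against the detected hit times. The function name &lt;code&gt;judgeHit&lt;/code&gt; and the 50ms / 120ms windows are assumptions for illustration, not values taken from the game's source.&lt;/p&gt;

```javascript
// Hypothetical grading helper: compares a key press time with the nearest note time.
// The window sizes (in seconds) are assumed values, chosen only for illustration.
function judgeHit(noteTimes, pressTime, windows = { perfect: 0.05, good: 0.12 }) {
  if (noteTimes.length === 0) return { grade: 'miss', offset: null, noteTime: null };

  // Find the note whose timestamp is closest to the key press.
  const nearest = noteTimes.reduce((best, t) =>
    Math.abs(best - pressTime) > Math.abs(t - pressTime) ? t : best);

  // Positive offset means the press was late, negative means early.
  const offset = pressTime - nearest;
  const distance = Math.abs(offset);

  let grade = 'miss';
  if (windows.good >= distance) grade = 'good';
  if (windows.perfect >= distance) grade = 'perfect';
  return { grade, offset, noteTime: nearest };
}
```

&lt;p&gt;A real game loop would likely also remove the matched note so it cannot be scored twice; that bookkeeping is left out of this sketch.&lt;/p&gt;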




&lt;h2&gt;
  
  
  3. How to Run
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/asakohayase/drum-overlay.git
&lt;span class="nb"&gt;cd &lt;/span&gt;drum-overlay
npm &lt;span class="nb"&gt;install
&lt;/span&gt;npm start
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Controls:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;F&lt;/code&gt;: low hits&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;J&lt;/code&gt;: high hits&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Space&lt;/code&gt;: play / pause song&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Cmd+Shift+Q&lt;/code&gt;: quit&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  4. What is an Electron Overlay?
&lt;/h2&gt;

&lt;p&gt;A browser tab can't float above other apps. It lives inside the browser window, so you'd have to alt-tab to use it, which defeats the whole point. You need OS-level window control.&lt;/p&gt;

&lt;p&gt;Electron gives you that. It's normally used to build standalone desktop apps: VS Code, Claude Desktop, and Slack are all Electron apps. Three window settings make the overlay possible:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;transparent: true&lt;/code&gt;&lt;/strong&gt; removes the default white window background Electron adds. Without it, it looks like this: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7hh7slgkepd7vo3pfmzs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7hh7slgkepd7vo3pfmzs.png" alt=" " width="800" height="502"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;alwaysOnTop: true&lt;/code&gt;&lt;/strong&gt; keeps the window above all other windows system-wide. It doesn't lose its position when you click on something else.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;setIgnoreMouseEvents(true, { forward: true })&lt;/code&gt;&lt;/strong&gt; is what lets you keep clicking your IDE. The window covers the full screen, so without this call it would intercept every click. With it, clicks pass through to whatever is underneath, while mouse-move events are still forwarded so the overlay knows where your cursor is. When the cursor enters the panel, the overlay temporarily becomes clickable. When it leaves, clicks pass through again.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;win&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;BrowserWindow&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;transparent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;frame&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;alwaysOnTop&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;webPreferences&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;preload&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;__dirname&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;preload.js&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="nx"&gt;win&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setIgnoreMouseEvents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;forward&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The renderer toggles interactivity dynamically based on what the cursor is over:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nb"&gt;document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addEventListener&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;mousemove&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;over&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;target&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;closest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;.pad, .icon-btn, .play-btn, .progress-bar&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;ipcRenderer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;set-ignore-mouse&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;over&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
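&lt;p&gt;The renderer only sends the &lt;code&gt;set-ignore-mouse&lt;/code&gt; message; something in the main process has to receive it and update the window. Here is a minimal sketch of that receiving side, assuming it listens with &lt;code&gt;ipcMain.on&lt;/code&gt;. The helper name &lt;code&gt;registerIgnoreMouseHandler&lt;/code&gt; is hypothetical, and &lt;code&gt;ipcMain&lt;/code&gt; and &lt;code&gt;win&lt;/code&gt; are passed in as arguments so the wiring is easy to follow.&lt;/p&gt;

```javascript
// Hypothetical helper: wires the 'set-ignore-mouse' channel to the overlay window.
// `ipcMain` comes from require('electron'); `win` is the overlay BrowserWindow.
function registerIgnoreMouseHandler(ipcMain, win) {
  ipcMain.on('set-ignore-mouse', (_event, ignore) => {
    // forward: true keeps mouse-move events flowing to the renderer while clicks
    // pass through, so the renderer can notice when the cursor reaches a control.
    win.setIgnoreMouseEvents(Boolean(ignore), { forward: true });
  });
}
```

&lt;p&gt;In the main process this would be called once, right after the window is created. Sending a message on every &lt;code&gt;mousemove&lt;/code&gt; is simple but chatty; remembering the last value in the renderer and sending only on change would be a reasonable refinement.&lt;/p&gt;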






&lt;h2&gt;
  
  
  5. Architecture
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Web Audio API (built-in):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;OfflineAudioContext&lt;/code&gt;: runs the full analysis pass before playback starts. Some libraries only offer real-time analysis, which is too late to pre-populate the note lane.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;createBiquadFilter&lt;/code&gt;: applies lowpass/bandpass frequency filters&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;decodeAudioData&lt;/code&gt;: decodes MP3/WAV/etc into raw samples&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Custom code built on top:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;detectOnsets&lt;/code&gt;: finds low-hit and high-hit timestamps from the filtered audio&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;estimateBPM&lt;/code&gt;: estimates tempo from low-hit intervals&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;playKick&lt;/code&gt; / &lt;code&gt;playSnare&lt;/code&gt;: synthesized drum sounds&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;drawFrame&lt;/code&gt;: game loop, note scrolling, hit detection, scoring
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Audio file
    │
    ▼
decodeAudioData()
    │
    ├─ detectOnsets(lowpass,  100Hz)  → lowTimes[]
    └─ detectOnsets(bandpass, 2500Hz) → highTimes[]
    │
    ▼
estimateBPM(lowTimes) → bpm
    │
    ▼
Game loop (requestAnimationFrame)
    ├─ Spawn notes from lowTimes/highTimes ahead of currentTime
    ├─ Scroll notes toward hit zone
    └─ Score hit on keydown (F=kick, J=snare)
    │
    ▼
playKick() / playSnare() on hit
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
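&lt;p&gt;The &lt;code&gt;estimateBPM(lowTimes)&lt;/code&gt; step in the diagram can be pictured with a small sketch. This version takes the median gap between consecutive low hits and converts it to beats per minute. The median is an assumption on my part for this illustration (it tolerates a few missed or extra onsets), and the name &lt;code&gt;estimateBpmFromIntervals&lt;/code&gt; is hypothetical; the game's own function may work differently.&lt;/p&gt;

```javascript
// Hypothetical sketch: tempo from the median interval between onset times (seconds).
function estimateBpmFromIntervals(times) {
  // Gaps between consecutive onsets, ignoring zero or negative gaps.
  const intervals = times
    .slice(1)
    .map((t, i) => t - times[i])
    .filter((gap) => gap > 0)
    .sort((a, b) => a - b);
  if (intervals.length === 0) return null; // not enough data to estimate

  // Median of the sorted gaps.
  const mid = Math.floor(intervals.length / 2);
  const median = intervals.length % 2 === 1
    ? intervals[mid]
    : (intervals[mid - 1] + intervals[mid]) / 2;

  // One beat every `median` seconds means 60 / median beats per minute.
  return 60 / median;
}
```

&lt;p&gt;For hits at 0s, 0.5s, 1.0s, 1.5s and 2.5s (one hit missing at 2.0s), the gaps are 0.5, 0.5, 0.5 and 1.0, the median is 0.5, and the estimate is 120 BPM. A mean would have been pulled down by the missing hit.&lt;/p&gt;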



&lt;h3&gt;
  
  
  Onset detection
&lt;/h3&gt;

&lt;p&gt;DSP (Digital Signal Processing) is math applied to audio signals: filtering frequencies, measuring energy, finding patterns in waveforms.&lt;/p&gt;

&lt;p&gt;The naive approach is to threshold amplitude: find frames above a loudness cutoff. This fails on any real track because overall loudness varies constantly. A quiet verse and a loud chorus have completely different amplitude ranges.&lt;/p&gt;

&lt;p&gt;The insight: &lt;strong&gt;drum hits are transients, sharp sudden attacks, not just loud frames.&lt;/strong&gt; A low hit is a sudden spike in bass energy that decays in under half a second. What distinguishes it isn't loudness. It's a sharp &lt;em&gt;increase&lt;/em&gt; in energy. So instead of thresholding energy, threshold the &lt;em&gt;first difference&lt;/em&gt; of energy.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// RMS energy in 10ms windows, 5ms hop&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;energy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Float32Array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;nFrames&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nx"&gt;nFrames&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;s&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nx"&gt;hop&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;e&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;j&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;j&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nx"&gt;win&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;j&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="nx"&gt;e&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="nx"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;s&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;j&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nx"&gt;energy&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;e&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nx"&gt;win&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Half-wave rectified first difference: energy increases only&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;strength&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Float32Array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;nFrames&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nx"&gt;nFrames&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;strength&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;energy&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;energy&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The threshold is a percentile rather than mean+std: "the top 3% of positive energy increases count as onset candidates." It adapts to each song automatically, regardless of the noise floor.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;positives&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[...&lt;/span&gt;&lt;span class="nx"&gt;strength&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;v&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;v&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;sort&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;a&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;b&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;threshold&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;positives&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;floor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;positives&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.97&lt;/span&gt;&lt;span class="p"&gt;)];&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Onsets are the local maxima of the onset strength above the threshold, with a 220ms minimum gap between consecutive onsets so echoes are not counted. Without the gap, a hit's ring-out produces a secondary spike that gets detected as a second note.&lt;/p&gt;

&lt;p&gt;Low and high hits are separated by frequency: a lowpass at 100Hz captures bass-range hits (kick, tom, bass), and a bandpass at 2500Hz captures treble-range hits (hi-hat, cymbals, snare crack). &lt;code&gt;OfflineAudioContext&lt;/code&gt; applies these filters and renders faster than real time.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;lowTimes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;highTimes&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;all&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
  &lt;span class="nf"&gt;detectOnsets&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;buf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;lowpass&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.22&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.98&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="nf"&gt;detectOnsets&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;buf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;bandpass&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.22&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.97&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;]);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
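&lt;p&gt;The peak-picking step described above (local maxima above the threshold, with a minimum gap) can be sketched as follows. The function name &lt;code&gt;pickOnsets&lt;/code&gt; and its parameter names are hypothetical; this is one straightforward way to do it, not necessarily how &lt;code&gt;detectOnsets&lt;/code&gt; does it internally.&lt;/p&gt;

```javascript
// Hypothetical sketch: turn an onset-strength curve into onset timestamps (seconds).
// strength: array or Float32Array of per-frame onset strength
// hopSeconds: time between frames; minGapSeconds: minimum spacing between onsets
function pickOnsets(strength, threshold, hopSeconds, minGapSeconds) {
  const times = [];
  let lastOnset = -Infinity;
  strength.forEach((value, i) => {
    // Skip the edges: a local maximum needs a neighbour on both sides.
    if (i === 0 || i === strength.length - 1) return;
    const time = i * hopSeconds;
    const checks = [
      value > threshold,                 // strong enough
      value > strength[i - 1],           // rising from the left
      value >= strength[i + 1],          // not exceeded on the right
      time - lastOnset >= minGapSeconds, // far enough from the previous onset
    ];
    if (checks.every(Boolean)) {
      times.push(time);
      lastOnset = time;
    }
  });
  return times;
}
```

&lt;p&gt;With the 5ms hop and 220ms gap mentioned in the text, the call would look like &lt;code&gt;pickOnsets(strength, threshold, 0.005, 0.22)&lt;/code&gt;. Note that this sketch is greedy: the first peak inside a gap window wins, even if a stronger one follows shortly after.&lt;/p&gt;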



&lt;h3&gt;
  
  
  Sound synthesis
&lt;/h3&gt;

&lt;p&gt;A kick is a sine wave sweeping from 160Hz down to near-zero over 450ms (the body) plus a 20ms square-wave burst at 900Hz (the click attack).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;playKick&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;c&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;osc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createOscillator&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="nx"&gt;osc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;sine&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nx"&gt;osc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;frequency&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setValueAtTime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;160&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;currentTime&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;osc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;frequency&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exponentialRampToValueAtTime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.001&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;currentTime&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mf"&gt;0.45&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="c1"&gt;// gain envelope, connect to destination...&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A snare is white noise (&lt;code&gt;Math.random()&lt;/code&gt; into a buffer) filtered through a bandpass at 2200Hz plus a short triangle-wave tone sweep for the crack.&lt;/p&gt;
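&lt;p&gt;The noise half of that snare can be sketched with standard Web Audio nodes. This is an illustration under assumptions, not the game's &lt;code&gt;playSnare&lt;/code&gt;: the name &lt;code&gt;playSnareSketch&lt;/code&gt;, the 0.2s length, and the gain envelope are mine, the &lt;code&gt;AudioContext&lt;/code&gt; is passed in as &lt;code&gt;c&lt;/code&gt;, and the triangle-wave crack is left out.&lt;/p&gt;

```javascript
// Hypothetical sketch of a noise-based snare. `c` is an AudioContext.
// The 0.2s duration and the envelope values are assumed, not from the game's source.
function playSnareSketch(c) {
  const duration = 0.2;
  const length = Math.floor(c.sampleRate * duration);

  // White noise: fill a mono buffer with random samples in [-1, 1).
  const buffer = c.createBuffer(1, length, c.sampleRate);
  const data = buffer.getChannelData(0);
  for (let i = 0; length > i; i++) data[i] = Math.random() * 2 - 1;

  const source = c.createBufferSource();
  source.buffer = buffer;

  // Bandpass at 2200Hz, the centre frequency quoted in the text.
  const filter = c.createBiquadFilter();
  filter.type = 'bandpass';
  filter.frequency.value = 2200;

  // Fast exponential decay so the noise reads as a hit rather than a hiss.
  const gain = c.createGain();
  gain.gain.setValueAtTime(1, c.currentTime);
  gain.gain.exponentialRampToValueAtTime(0.001, c.currentTime + duration);

  // source -> filter -> gain -> speakers
  source.connect(filter);
  filter.connect(gain);
  gain.connect(c.destination);
  source.start(c.currentTime);
  source.stop(c.currentTime + duration);
  return { source, filter, gain };
}
```

&lt;p&gt;The ramp targets 0.001 rather than 0 because an exponential ramp cannot reach zero, the same reason the kick's frequency sweep ends at 0.001.&lt;/p&gt;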




&lt;h2&gt;
  
  
  6. Key Learnings
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;code&gt;OfflineAudioContext&lt;/code&gt; for pre-analysis.&lt;/strong&gt; To render the note lane, all hit timestamps need to be known before playback starts. &lt;code&gt;OfflineAudioContext&lt;/code&gt; runs the full analysis pass upfront. Real-time analysis would only surface hits as the song plays, too late to populate the lane.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Percentile threshold over mean+std.&lt;/strong&gt; Tracks with heavy cymbal wash raise the noise floor and collapse mean+std thresholds. A percentile threshold only cares about relative spike height within the track.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Onset detection parameters needed tuning.&lt;/strong&gt; The minimum gap between onsets and the percentile threshold both took a few iterations to feel right. Too permissive and you catch echoes; too strict and real hits get dropped.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  7. Conclusion
&lt;/h2&gt;

&lt;p&gt;AI thinking pauses are dead time by default. They don't have to be. Build something creative, build it for yourself, and you might end up with more than you expected.&lt;/p&gt;




&lt;h2&gt;
  
  
  8. Resources
&lt;/h2&gt;

&lt;p&gt;🚀 &lt;strong&gt;Try it yourself:&lt;/strong&gt; &lt;a href="https://github.com/asakohayase/drum-overlay" rel="noopener noreferrer"&gt;github.com/asakohayase/drum-overlay&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;📚 &lt;strong&gt;Learn more:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://developer.mozilla.org/en-US/docs/Web/API/Web_Audio_API" rel="noopener noreferrer"&gt;Web Audio API (MDN)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.electronjs.org/docs/latest/api/browser-window" rel="noopener noreferrer"&gt;BrowserWindow options (Electron docs)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
    </item>
    <item>
      <title>Optimizing a Customer Support Agent on AgentCore</title>
      <dc:creator>Asako Hayase</dc:creator>
      <pubDate>Thu, 04 Jun 2026 22:12:59 +0000</pubDate>
      <link>https://dev.to/asakohayase/optimizing-a-customer-support-agent-on-agentcore-4mn3</link>
      <guid>https://dev.to/asakohayase/optimizing-a-customer-support-agent-on-agentcore-4mn3</guid>
      <description>&lt;h2&gt;
  
  
  1. Introduction
&lt;/h2&gt;

&lt;p&gt;The AI agent stack has evolved quickly through a few distinct phases.&lt;/p&gt;

&lt;p&gt;First came the &lt;strong&gt;model&lt;/strong&gt;: call an API, get a response. The intelligence is in the model; your job is to write a good prompt.&lt;/p&gt;

&lt;p&gt;Then came the &lt;strong&gt;harness&lt;/strong&gt;: frameworks like LangGraph, CrewAI, and Strands gave agents tools, memory, and multi-step loops. Orchestration became the product.&lt;/p&gt;

&lt;p&gt;Now the question is: how do you make a deployed agent &lt;em&gt;better over time&lt;/em&gt; without rebuilding it from scratch on every iteration? That's the phase we're in, and it's where most of the interesting engineering work is happening.&lt;/p&gt;

&lt;p&gt;AgentCore Optimization is designed for this. Now in public preview, it gives you the infrastructure to run controlled A/B experiments on a live agent: split traffic across configurations, score every session automatically with LLM-as-a-judge evaluators, and read results in CloudWatch.&lt;/p&gt;

&lt;p&gt;In this post, I'll walk through how I built a LangGraph-based customer support agent on Amazon Bedrock AgentCore, then ran three sequential A/B experiments to optimize it: a better prompt, better tool descriptions, and a bigger model.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. What's AgentCore Optimization?
&lt;/h2&gt;

&lt;p&gt;AgentCore Optimization is a set of integrated AWS services that lets you continuously improve an agent without rebuilding it. It's built on three primitives:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Configuration Bundles&lt;/strong&gt; are versioned JSON payloads containing whatever per-request config you want to test: system prompt, tool descriptions, model ID, or any arbitrary key. The bundle gets injected into every invocation by the gateway, so you can run two different agent configurations off the same container, with no redeployment.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Agent reads its config bundle on every request
&lt;/span&gt;&lt;span class="n"&gt;bundle&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;BedrockAgentCoreContext&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_config_bundle&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;model_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bundle&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;system_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bundle&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system_prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;tool_descriptions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bundle&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_descriptions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;The AgentCore Gateway&lt;/strong&gt; sits in front of your runtime and handles traffic routing. You create an A/B test that maps two config bundles to traffic percentages (50/50, 80/20, etc.) and attach it to a gateway target. From that point, every invocation is probabilistically routed to one variant (so a 50/50 split is a target, not a guarantee), and the variant assignment is recorded in OTel spans.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Online Evaluators&lt;/strong&gt; are LLM-as-a-judge scorers that run asynchronously after every session. You define evaluation criteria in natural language, choose a judge model and scoring scale, and register the evaluator with AgentCore. Once attached to an online evaluation config, it scores every session in the A/B test and writes the results to a CloudWatch log group. You can define custom evaluators tuned to your domain, or use AgentCore's built-in evaluators.&lt;/p&gt;

&lt;p&gt;These three primitives compose into a four-step continuous improvement loop (&lt;a href="https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/optimization.html" rel="noopener noreferrer"&gt;as described in the official docs&lt;/a&gt;):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Generate a recommendation.&lt;/strong&gt; Point the Recommendations API at agent traces in CloudWatch and specify the evaluator you want to optimize for. It analyzes failure patterns and returns an improved system prompt or tool descriptions, along with an explanation of what changed and why.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Package as a configuration bundle.&lt;/strong&gt; Version the recommended config as an immutable snapshot. This decouples agent behavior from code: you can change prompts, models, and tool descriptions without touching the container.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Validate with an A/B test.&lt;/strong&gt; Split production traffic between current (control) and improved (treatment) through the gateway. Online evaluation scores every session and reports statistical significance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deploy the winner and repeat.&lt;/strong&gt; Route 100% of traffic to the winning variant. The new baseline's traces seed the next iteration.&lt;/li&gt;
&lt;/ol&gt;


&lt;h2&gt;
  
  
  3. What I Built
&lt;/h2&gt;

&lt;p&gt;I built a customer support agent that handles three common ticket types:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;account_locked&lt;/code&gt;&lt;/strong&gt;: "I can't log in, keep getting an error" → call &lt;code&gt;validate_account_identity&lt;/code&gt;, confirm identity, explain unlock steps&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;billing_duplicate&lt;/code&gt;&lt;/strong&gt;: "I was charged twice" → call &lt;code&gt;fetch_billing_history&lt;/code&gt;, identify duplicate, initiate refund via &lt;code&gt;check_refund_status&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;gdpr_deletion&lt;/code&gt;&lt;/strong&gt;: "Delete my data under Article 17" → verify identity, explain deletion process, escalate to privacy team&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The agent has three tools:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@tool&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fetch_billing_history&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Retrieve complete billing transaction history for a customer by user_id.
    Returns itemized charges, payment dates, amounts, and subscription details
    for the past 90 days.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;

&lt;span class="nd"&gt;@tool&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;check_refund_status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ticket_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Check the current processing status of a refund request by ticket_id.
    Returns status (pending/approved/rejected), refund amount, and estimated
    completion timeline.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;

&lt;span class="nd"&gt;@tool&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;validate_account_identity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Verify a customer&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s account identity and retrieve their account status,
    access level, subscription tier, and any active restrictions or flags.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Each ticket requires the agent to correctly identify which tool(s) to call, extract the right parameters, and ground its response in what the tools actually returned, not what it thinks the answer should be.&lt;/p&gt;


&lt;h2&gt;
  
  
  4. Architecture
&lt;/h2&gt;

&lt;p&gt;The full stack:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F47ln1ht1y0kjyxwm9ntr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F47ln1ht1y0kjyxwm9ntr.png" alt=" " width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The agent is built with LangGraph's &lt;code&gt;StateGraph&lt;/code&gt; and &lt;code&gt;ToolNode&lt;/code&gt;. AgentCore is framework-agnostic, so plain Python, LangChain, CrewAI, or any other framework works equally well:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
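&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_core.messages&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SystemMessage&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langgraph.graph&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;MessagesState&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;StateGraph&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langgraph.prebuilt&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ToolNode&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tools_condition&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;create_agent&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;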
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;chatbot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;MessagesState&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;model_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_active_model_id&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# from config bundle
&lt;/span&gt;        &lt;span class="n"&gt;llm_with_tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_get_llm_with_tools&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;system_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_active_system_prompt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;SystemMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;llm_with_tools&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;)]}&lt;/span&gt;

    &lt;span class="n"&gt;graph&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;StateGraph&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;MessagesState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chatbot&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chatbot&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tools&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;ToolNode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ALL_TOOLS&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_conditional_edges&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chatbot&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tools_condition&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tools&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chatbot&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_entry_point&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chatbot&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The per-request config injection happens in the &lt;code&gt;@app.entrypoint&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@app.entrypoint&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;customer_support_agent_runtime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;bundle&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;BedrockAgentCoreContext&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_config_bundle&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;bundle&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;model_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bundle&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Config bundle missing model_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;_active_model_id&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;_active_system_prompt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bundle&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system_prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;BASELINE_SYSTEM_PROMPT&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="nf"&gt;_apply_tool_description_overrides&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ALL_TOOLS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bundle&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_descriptions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{}))&lt;/span&gt;

    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;HumanMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])]})&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;code&gt;contextvars.ContextVar&lt;/code&gt; scopes the config to the current request without thread-safety issues, even under concurrent invocations.&lt;/p&gt;
&lt;h3&gt;
  
  
  Observability
&lt;/h3&gt;

&lt;p&gt;OTel spans flow to &lt;code&gt;aws/spans&lt;/code&gt; via the AWS Distro for OpenTelemetry (ADOT). The &lt;code&gt;LangchainInstrumentor&lt;/code&gt; captures LangGraph node executions. The online evaluators read both &lt;code&gt;aws/spans&lt;/code&gt; and the runtime log group. If either is missing, scoring fails silently.&lt;/p&gt;

&lt;p&gt;One gotcha: the default Dockerfile from the AgentCore starter sets &lt;code&gt;OTEL_TRACES_EXPORTER=none&lt;/code&gt;, which disables all span export. You have to remove that line and add the ADOT configurator:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="c"&gt;# Remove this — it kills all observability:&lt;/span&gt;
&lt;span class="c"&gt;# ENV OTEL_TRACES_EXPORTER=none&lt;/span&gt;

&lt;span class="k"&gt;ENV&lt;/span&gt;&lt;span class="s"&gt; AGENT_OBSERVABILITY_ENABLED=true&lt;/span&gt;
&lt;span class="k"&gt;ENV&lt;/span&gt;&lt;span class="s"&gt; OTEL_PYTHON_DISTRO=aws_distro&lt;/span&gt;
&lt;span class="k"&gt;ENV&lt;/span&gt;&lt;span class="s"&gt; OTEL_PYTHON_CONFIGURATOR=aws_configurator&lt;/span&gt;
&lt;span class="k"&gt;ENV&lt;/span&gt;&lt;span class="s"&gt; OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Infrastructure Setup
&lt;/h3&gt;

&lt;p&gt;The official way to set up an AgentCore project is:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;agentcore create   &lt;span class="c"&gt;# scaffold project&lt;/span&gt;
agentcore deploy   &lt;span class="c"&gt;# build container, push to ECR, create runtime&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;In practice, I hit two bugs that kept this from working out of the box.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bug 1 — CodeBuild project name mismatch.&lt;/strong&gt; The CLI creates a CodeBuild project named &lt;code&gt;AgentCore-&amp;lt;project&amp;gt;-default-container-builder&lt;/code&gt;, but &lt;code&gt;deploy.py&lt;/code&gt; looks for &lt;code&gt;bedrock-agentcore-&amp;lt;agent_name&amp;gt;-builder&lt;/code&gt;. The build trigger silently does nothing because the project it expects doesn't exist.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bug 2 — Wrong architecture.&lt;/strong&gt; AgentCore Runtime requires arm64 containers. The CLI-generated CodeBuild project uses x86, which fails at runtime with &lt;code&gt;ValidationException: Architecture incompatible&lt;/code&gt;. You need &lt;code&gt;ARM_CONTAINER&lt;/code&gt; compute type and the &lt;code&gt;amazonlinux2-aarch64-standard:3.0&lt;/code&gt; image, neither of which the CLI sets.&lt;/p&gt;

&lt;p&gt;I worked around this with &lt;code&gt;bootstrap_infra.py&lt;/code&gt;, a one-time setup script that creates the ECR repo, S3 bucket, IAM role, and CodeBuild project with the correct name and architecture. It's idempotent, so it's safe to re-run if anything already exists.&lt;/p&gt;
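&lt;p&gt;The idempotent part follows a check-then-create pattern. Below is a rough sketch for the ECR repository only, not the contents of &lt;code&gt;bootstrap_infra.py&lt;/code&gt;. The &lt;code&gt;ensure_ecr_repo&lt;/code&gt; helper, the repository name, and the &lt;code&gt;FakeEcr&lt;/code&gt; stub are hypothetical. The stub mirrors the method names and response shapes of the boto3 ECR client so the example runs without AWS credentials.&lt;/p&gt;

```python
class _NotFound(Exception):
    """Stand-in for ecr.exceptions.RepositoryNotFoundException."""


class FakeEcr:
    """In-memory stub exposing the same method names as the boto3 ECR client."""

    class exceptions:
        RepositoryNotFoundException = _NotFound

    def __init__(self):
        self._repos = {}

    def describe_repositories(self, repositoryNames):
        name = repositoryNames[0]
        if name not in self._repos:
            raise _NotFound(name)
        return {"repositories": [{"repositoryUri": self._repos[name]}]}

    def create_repository(self, repositoryName):
        uri = f"example.invalid/{repositoryName}"  # placeholder URI
        self._repos[repositoryName] = uri
        return {"repository": {"repositoryUri": uri}}


def ensure_ecr_repo(ecr, name):
    """Return (uri, created); create the repo only if it does not exist yet."""
    try:
        found = ecr.describe_repositories(repositoryNames=[name])
        return found["repositories"][0]["repositoryUri"], False
    except ecr.exceptions.RepositoryNotFoundException:
        made = ecr.create_repository(repositoryName=name)
        return made["repository"]["repositoryUri"], True


ecr = FakeEcr()
first = ensure_ecr_repo(ecr, "support-agent")   # creates the repo
second = ensure_ecr_repo(ecr, "support-agent")  # finds it, creates nothing
print(first, second)
```

&lt;p&gt;With a real client you would pass &lt;code&gt;boto3.client("ecr")&lt;/code&gt; instead of the stub. The same shape applies to the bucket, role, and CodeBuild project, each with its own "not found" error.&lt;/p&gt;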
&lt;h3&gt;
  
  
  Pre-built evaluators — and why they weren't enough
&lt;/h3&gt;

&lt;p&gt;AgentCore ships with built-in evaluators that need no setup. Here's what each one actually measures:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;Builtin.GoalSuccessRate&lt;/code&gt;&lt;/strong&gt; &lt;em&gt;(session-level)&lt;/em&gt;: Did the agent successfully complete all user goals across the entire conversation? The judge outputs &lt;strong&gt;Yes / No&lt;/strong&gt;, which AgentCore maps to &lt;strong&gt;1.0 / 0.0&lt;/strong&gt; before writing to CloudWatch. The aggregated scores you see (e.g. 0.154, 0.647) are the proportion of sessions that scored "Yes".&lt;/p&gt;
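&lt;p&gt;The aggregation is just a mean over the mapped 1.0 / 0.0 values. Here is a small sketch of that arithmetic. The session counts are invented so the result lands on the same rounded figures as the examples above; they are not the actual counts from any experiment, and &lt;code&gt;goal_success_rate&lt;/code&gt; is a hypothetical helper.&lt;/p&gt;

```python
def goal_success_rate(verdicts):
    """Map Yes/No verdicts to 1.0/0.0 and return the proportion of Yes."""
    scores = [1.0 if verdict == "Yes" else 0.0 for verdict in verdicts]
    return sum(scores) / len(scores)


# Hypothetical verdict lists, chosen only to illustrate the arithmetic.
control = ["Yes"] * 2 + ["No"] * 11     # 2 of 13 sessions succeeded
treatment = ["Yes"] * 11 + ["No"] * 6   # 11 of 17 sessions succeeded

control_rate = round(goal_success_rate(control), 3)
treatment_rate = round(goal_success_rate(treatment), 3)
print(control_rate, treatment_rate)  # 0.154 0.647
```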

&lt;p&gt;&lt;strong&gt;&lt;code&gt;Builtin.Helpfulness&lt;/code&gt;&lt;/strong&gt; &lt;em&gt;(trace-level)&lt;/em&gt;: Did the response move the user closer to their goal, from the user's perspective? Scores on a 0–6 categorical scale. Explicitly ignores factual accuracy — it only evaluates whether the response felt helpful to the user.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;Builtin.Correctness&lt;/code&gt;&lt;/strong&gt; &lt;em&gt;(trace-level)&lt;/em&gt;: Is the response factually accurate? Framed like a quiz: only content matters, not style or presentation. Scores &lt;strong&gt;Perfectly Correct / Partially Correct / Incorrect&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The full prompt templates for all built-in evaluators are &lt;a href="https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/prompt-templates-builtin.html" rel="noopener noreferrer"&gt;published in the AWS docs&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;They're domain-agnostic. For a customer support agent, that's not specific enough — but I'll show exactly what I mean once the results are in.&lt;/p&gt;
&lt;h3&gt;
  
  
  Custom Evaluators
&lt;/h3&gt;

&lt;p&gt;I defined four domain-specific LLM-as-a-judge evaluators, each scoring on a 0.0–1.0 scale:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;cs_intent_resolution&lt;/code&gt;&lt;/strong&gt; &lt;em&gt;(Exp 1 — Prompt Strategy)&lt;/em&gt;: Did the agent correctly identify the customer's underlying intent and fully address it, even when the request was ambiguous?&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You are evaluating a SaaS customer support agent.

Assess whether the agent correctly identified what the customer actually needed
and addressed it completely.

HIGH QUALITY:
- Correctly classifies the intent (billing, access, refund, privacy, etc.)
- Asks a targeted clarifying question when the request is genuinely ambiguous
- Does not ask for information it already has
- Resolves the stated problem or provides a clear path to resolution

LOW QUALITY:
- Misidentifies or ignores the customer's actual need
- Responds to a surface request while missing the underlying issue
- Asks unnecessary clarifying questions when intent is already clear
- Leaves the customer without a resolution or next step

Context (customer message and conversation): {context}
Agent response to evaluate: {assistant_turn}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;1.00 — Perfect Resolution&lt;/strong&gt;: Intent correctly identified; response fully addresses the customer's need with a concrete resolution or escalation path&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;0.75 — Mostly Resolved&lt;/strong&gt;: Intent correctly identified and mostly addressed, but one minor gap (e.g. missing a follow-up step or detail)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;0.50 — Partially Resolved&lt;/strong&gt;: Intent recognised but only partially addressed, or a correct clarifying question was asked but no resolution yet&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;0.25 — Misaligned&lt;/strong&gt;: Agent responded to the wrong intent or provided a solution that does not match the customer's actual problem&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;0.00 — Failed&lt;/strong&gt;: Intent completely missed, customer redirected incorrectly, or no actionable response provided&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;cs_tool_groundedness&lt;/code&gt;&lt;/strong&gt; &lt;em&gt;(Exp 2 — Tool Descriptions)&lt;/em&gt;: Did the agent select the right tool, cite specific data from the tool result, and avoid making up facts that should have come from a tool call?&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You are evaluating a SaaS customer support agent that has access to three tools:
fetch_billing_history, check_refund_status, and validate_account_identity.

Assess whether the agent:
(a) selected the right tool for the customer's issue
(b) cited specific data from the tool result (amounts, dates, statuses, account details)
(c) avoided making up facts that should have come from a tool call

HIGH QUALITY:
- Calls the most appropriate tool for the stated issue
- Cites specific values: '$49.00 duplicate charge on May 1st',
  'account locked after 5 failed attempts', 'refund approved, ETA May 9th'
- Never invents billing amounts, account statuses, or ticket details

LOW QUALITY:
- Calls the wrong tool or skips tool calls entirely
- Responds with generic statements: 'your billing looks fine' without checking
- Fabricates specific data that should have been retrieved

Context (customer message and conversation): {context}
Agent response to evaluate: {assistant_turn}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;1.00 — Fully Grounded&lt;/strong&gt;: Correct tool selected; response cites specific retrieved data; no hallucinated facts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;0.75 — Mostly Grounded&lt;/strong&gt;: Correct tool used; most claims are data-backed but one minor detail is missing or slightly imprecise&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;0.50 — Partially Grounded&lt;/strong&gt;: Tool was called but the response mixes real data with generic or inferred statements&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;0.25 — Wrong Tool / Mostly Generic&lt;/strong&gt;: Wrong tool called, or the right tool was skipped and the response is largely generic with little specific data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;0.00 — Hallucinated / No Tool&lt;/strong&gt;: No tool called when one was clearly needed, or data cited in the response was fabricated&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;cs_support_quality&lt;/code&gt;&lt;/strong&gt; &lt;em&gt;(Exp 3 — Model Comparison)&lt;/em&gt;: Holistic quality scored equally across four dimensions: empathy, clarity, completeness, and tone.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You are evaluating a SaaS customer support agent response.

Assess the response on four dimensions equally:
1. EMPATHY — does it acknowledge the customer's frustration or situation?
2. CLARITY — is the response easy to understand and act on?
3. COMPLETENESS — does it cover all aspects of the customer's issue?
4. TONE — is it professional, warm, and appropriate for support?

HIGH QUALITY:
- Opens with genuine acknowledgment of the customer's experience
- Explains what happened and why in plain language
- Provides concrete next steps with timelines where applicable
- Closes with an offer to help further
- Would not cause the customer to escalate or churn

LOW QUALITY:
- Robotic or dismissive tone
- Incomplete — addresses only part of the issue
- Unclear or filled with jargon
- Leaves the customer without a clear next step

Context (customer message and conversation): {context}
Agent response to evaluate: {assistant_turn}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;1.00 — Excellent&lt;/strong&gt;: Empathetic, clear, complete, and professional. Would fully satisfy the customer and prevent escalation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;0.75 — Good&lt;/strong&gt;: Strong on most dimensions with a minor gap, perhaps slightly terse or missing one follow-up detail&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;0.50 — Adequate&lt;/strong&gt;: Technically correct but lacking empathy, clarity, or completeness in a noticeable way&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;0.25 — Poor&lt;/strong&gt;: Multiple gaps: robotic tone, incomplete answer, or confusing language that would frustrate the customer&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;0.00 — Unacceptable&lt;/strong&gt;: Response would cause the customer to escalate or churn: dismissive, wrong, incoherent, or entirely unhelpful&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;cs_overall_customer_outcome&lt;/code&gt; ⭐&lt;/strong&gt; &lt;em&gt;(North star — all experiments)&lt;/em&gt;: Holistic score across all dimensions simultaneously: resolution, data accuracy, tone, and compliance process. A response that excels on one dimension but fails another (e.g. empathetic but factually wrong) should not score above 0.50.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You are evaluating a SaaS customer support agent on its ultimate business outcome:
did the customer get a good result?

Score based on ALL of the following:
- Was the customer's issue RESOLVED or correctly ESCALATED?
- Did the agent use REAL DATA (no hallucinated amounts, statuses, dates)?
- Was the TONE empathetic enough that the customer would not churn?
- For compliance issues (GDPR, legal): was the correct process followed?

This is a holistic score — a response that excels on one dimension but fails another
(e.g. empathetic but factually wrong) should not score above 0.50.

Context (customer message and conversation): {context}
Agent response to evaluate: {assistant_turn}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;1.00 — Outstanding Outcome&lt;/strong&gt;: Issue fully resolved or correctly escalated; no hallucinated data; empathetic tone; customer would be satisfied&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;0.75 — Good Outcome&lt;/strong&gt;: Issue substantially addressed with minor gaps; data accurate; tone acceptable; customer unlikely to escalate&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;0.50 — Neutral Outcome&lt;/strong&gt;: Issue partially addressed, or data accurate but tone poor, or tone good but resolution incomplete&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;0.25 — Poor Outcome&lt;/strong&gt;: Issue largely unresolved, or significant hallucinated data, or tone likely to frustrate the customer&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;0.00 — Failed Outcome&lt;/strong&gt;: Issue not addressed, wrong advice given, compliance process ignored, or response would directly cause churn or harm&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;An important lesson on evaluator throughput: I originally used Claude Sonnet 4.5 as the judge model. With 4 evaluators firing asynchronously per session, concurrent Converse calls regularly exceeded Sonnet's throughput limit. About half of &lt;code&gt;cs_overall_customer_outcome&lt;/code&gt; scores silently failed with &lt;code&gt;ThrottlingException&lt;/code&gt;. No error in the logs; scores just didn't appear. The fix was switching to Claude Haiku 4.5, which has roughly 10x higher throughput limits:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;EVALUATOR_MODEL_ID&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us.anthropic.claude-haiku-4-5-20251001-v1:0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Haiku is fast enough for async LLM-as-judge scoring at demo scale. Save Sonnet for the inference model, not the judge.&lt;/p&gt;
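&lt;p&gt;Because throttled scores simply never show up, it can help to check coverage explicitly after a run. The sketch below assumes you have already pulled the score records out of CloudWatch into a list of dicts. The &lt;code&gt;session_id&lt;/code&gt; and &lt;code&gt;evaluator&lt;/code&gt; keys and the &lt;code&gt;missing_scores&lt;/code&gt; helper are hypothetical, not an AgentCore record format or API.&lt;/p&gt;

```python
EVALUATORS = [
    "cs_intent_resolution",
    "cs_tool_groundedness",
    "cs_support_quality",
    "cs_overall_customer_outcome",
]


def missing_scores(session_ids, score_records):
    """Return {session_id: [evaluators with no score]} for incomplete sessions."""
    seen = {(rec["session_id"], rec["evaluator"]) for rec in score_records}
    missing = {}
    for session_id in session_ids:
        gaps = [name for name in EVALUATORS if (session_id, name) not in seen]
        if gaps:
            missing[session_id] = gaps
    return missing


# Hypothetical records: session s2 never received its north-star score.
records = [{"session_id": "s1", "evaluator": name} for name in EVALUATORS]
records += [{"session_id": "s2", "evaluator": name} for name in EVALUATORS[:3]]

gaps = missing_scores(["s1", "s2"], records)
print(gaps)  # {'s2': ['cs_overall_customer_outcome']}
```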


&lt;h2&gt;
  
  
  5. Test Overview
&lt;/h2&gt;

&lt;p&gt;Three sequential experiments, each isolating one variable:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Exp 1 — System prompt&lt;/strong&gt;: C = Baseline ("Respond immediately with a solution"), T1 = Optimized (Classify → clarify if ambiguous → cite tool data)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Exp 2 — Tool descriptions&lt;/strong&gt;: C = Vague ("Get data for a user."), T1 = Precise (full typed signatures with return value descriptions)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Exp 3 — Model&lt;/strong&gt;: C = Claude Haiku 4.5, T1 = Claude Sonnet 4.6&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each experiment ran 30 sessions (10 repeats × 3 ticket types), routed 50/50 via the AgentCore Gateway. Each session was scored by all four evaluators asynchronously.&lt;/p&gt;
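&lt;p&gt;With only about 15 sessions per arm, a gap in mean score can easily be noise. The sketch below shows one generic way to sanity-check a difference, a two-sided permutation test. It is not how AgentCore computes significance (this post doesn't cover that), and the score lists are made-up placeholders.&lt;/p&gt;

```python
import random
from statistics import mean


def permutation_p_value(control, treatment, rounds=2000, seed=0):
    """Estimate how often a random relabeling gives a gap at least as large."""
    rng = random.Random(seed)
    observed = abs(mean(treatment) - mean(control))
    pooled = list(control) + list(treatment)
    n_control = len(control)
    hits = 0
    for _ in range(rounds):
        rng.shuffle(pooled)  # randomly reassign sessions to the two arms
        gap = abs(mean(pooled[n_control:]) - mean(pooled[:n_control]))
        if gap >= observed:
            hits += 1
    return hits / rounds


# Hypothetical scores on the 0.0 to 1.0 rubric, 15 sessions per arm.
same = permutation_p_value([0.5] * 15, [0.5] * 15)
apart = permutation_p_value([0.25] * 15, [1.0] * 15)
print(same, apart)
```

&lt;p&gt;Identical arms give a p-value of 1.0, while fully separated arms give a value near zero. Real runs fall somewhere in between.&lt;/p&gt;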
&lt;h3&gt;
  
  
  Experiment Pipeline
&lt;/h3&gt;

&lt;p&gt;The traffic in this demo is synthetic, not real user activity. Two different mechanisms were used depending on the phase.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phase 2 (baseline batch evaluation)&lt;/strong&gt; uses AgentCore's &lt;code&gt;BatchEvaluationRunner&lt;/code&gt; with a simulated customer actor — a Claude Haiku model playing the customer role. Given a character profile and a goal, the actor dynamically responds to whatever the support agent says, producing realistic multi-turn conversations up to 4 turns deep. For example, the GDPR ticket actor is briefed as an EU customer who understands their Article 17 rights and will push back if the agent seems evasive.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phases 5, 8, and 9 (A/B experiments)&lt;/strong&gt; use single-turn prompts sent directly to the gateway — one fixed message per ticket type, repeated 10 times each. There is no back-and-forth; each invocation is a complete self-contained session.&lt;/p&gt;
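&lt;p&gt;The single-turn session plan is easy to generate. In this sketch the prompt strings reuse the example phrasings from the ticket list earlier in the post, which may differ from the exact fixed messages used. The session ID format and the &lt;code&gt;build_sessions&lt;/code&gt; helper are hypothetical, and sending each prompt to the gateway is left out.&lt;/p&gt;

```python
# Assumed fixed prompts, one per ticket type.
TICKET_PROMPTS = {
    "account_locked": "I can't log in, keep getting an error",
    "billing_duplicate": "I was charged twice",
    "gdpr_deletion": "Delete my data under Article 17",
}
REPEATS = 10


def build_sessions(prompts, repeats):
    """Return one self-contained, single-turn session per (ticket type, repeat)."""
    sessions = []
    for ticket_type, prompt in prompts.items():
        for index in range(repeats):
            sessions.append(
                {
                    "session_id": f"{ticket_type}-{index:02d}",  # placeholder format
                    "ticket_type": ticket_type,
                    "prompt": prompt,
                }
            )
    return sessions


sessions = build_sessions(TICKET_PROMPTS, REPEATS)
print(len(sessions))  # 30
```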

&lt;p&gt;In production, you would replace synthetic traffic with real user interactions. The infrastructure — gateway routing, online evaluation, CloudWatch logging — works identically regardless of whether the traffic is real or simulated. The practical advantage of real traffic is that it captures the authentic distribution of how users phrase requests, including edge cases and ambiguous formulations that synthetic prompts don't cover.&lt;/p&gt;

&lt;p&gt;The three experiments ran sequentially. AgentCore only allows one active A/B test per gateway at a time. The full phase sequence:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Phase 1&lt;/strong&gt;: Generate baseline traffic. Invoke the runtime directly across all 3 ticket types.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Phase 2&lt;/strong&gt;: Baseline batch evaluation. Score the baseline sessions to establish a starting benchmark.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Phase 3&lt;/strong&gt;: AI prompt recommendation. Point the Recommendations API at baseline traces; get an optimized system prompt.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Phase 4&lt;/strong&gt;: AI tool description recommendation. Same API, optimized tool descriptions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Phase 5&lt;/strong&gt;: Create Exp 1 config bundles, run A/B test (prompt strategy), promote winner.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Phase 6&lt;/strong&gt;: AI tool description recommendation. Generate improved tool descriptions based on Exp 1 traces.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Phase 7&lt;/strong&gt;: Create Exp 2 config bundles. Best prompt + vague tools (C) vs best prompt + precise tools (T1).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Phase 8&lt;/strong&gt;: Run Exp 2 A/B test. Stop Exp 1, create new A/B test, send 30 sessions; promote Exp 2 winner.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Phase 9&lt;/strong&gt;: Run Exp 3 A/B test. Best prompt + best tools, vary only &lt;code&gt;model_id&lt;/code&gt;: Haiku (C) vs Sonnet (T1).&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  How the Recommendations API works
&lt;/h3&gt;

&lt;p&gt;Phases 3, 4, and 6 each call &lt;code&gt;StartRecommendation&lt;/code&gt; with a &lt;code&gt;type&lt;/code&gt; parameter that specifies what to optimize. Prompt and tool descriptions are separate calls — there's no combined mode:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;dp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start_recommendation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SYSTEM_PROMPT_RECOMMENDATION&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# or "TOOL_DESCRIPTION_RECOMMENDATION"
&lt;/span&gt;    &lt;span class="n"&gt;recommendationConfig&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;systemPromptRecommendationConfig&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;systemPrompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;CURRENT_SYSTEM_PROMPT&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agentTraces&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cloudwatchLogs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{...}},&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;evaluationConfig&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;evaluators&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;evaluatorArn&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Builtin.GoalSuccessRate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}]},&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Tool description recommendations use &lt;code&gt;toolDescriptionRecommendationConfig&lt;/code&gt; instead and don't accept an &lt;code&gt;evaluationConfig&lt;/code&gt;. Prompt recommendations do accept one, but only with session-level evaluators, which is why they always optimize for something like &lt;code&gt;Builtin.GoalSuccessRate&lt;/code&gt; rather than your custom trace-level north star.&lt;/p&gt;


&lt;h2&gt;
  
  
  6. Test Results and Analysis
&lt;/h2&gt;

&lt;p&gt;I included &lt;code&gt;Builtin.GoalSuccessRate&lt;/code&gt; as a reference signal alongside my custom evaluators. Across all three experiments, it frequently disagreed with &lt;code&gt;cs_overall_customer_outcome&lt;/code&gt;, my north star. The pattern was consistent: any ticket that required escalation or a follow-up step (GDPR deletion, account unlock pending identity verification) scored "No" from GoalSuccessRate because the agent didn't complete the action in a single turn. Escalating or deferring is the correct process for those tickets, but GoalSuccessRate doesn't know that.&lt;/p&gt;

&lt;p&gt;The gap matters because the Recommendations API only accepts session-level evaluators, which means it optimizes for GoalSuccessRate, not your custom north star. Worth knowing before you treat the API's output as ground truth.&lt;/p&gt;
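
&lt;p&gt;One way to quantify that gap is to count the sessions where the two evaluators disagree. This is a minimal sketch, assuming each session has already been reduced to a pair of scores. The record shape, the sample values, and the 0.75 pass threshold are all hypothetical.&lt;/p&gt;

```python
def disagreement_rate(sessions, pass_threshold=0.75):
    """Fraction of sessions where the builtin verdict and the north star disagree."""
    if not sessions:
        return 0.0
    disagreements = 0
    for session in sessions:
        north_star_pass = session["cs_overall_customer_outcome"] >= pass_threshold
        builtin_pass = session["goal_success"] == 1
        if north_star_pass != builtin_pass:
            disagreements += 1
    return disagreements / len(sessions)


# Made-up scores for illustration only.
sample = [
    {"cs_overall_customer_outcome": 0.875, "goal_success": 0},  # correct escalation, builtin says No
    {"cs_overall_customer_outcome": 0.875, "goal_success": 1},
    {"cs_overall_customer_outcome": 0.50, "goal_success": 0},
    {"cs_overall_customer_outcome": 0.75, "goal_success": 0},
]
print(disagreement_rate(sample))  # 0.5
```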

&lt;p&gt;All verdicts below are based on &lt;code&gt;cs_overall_customer_outcome&lt;/code&gt; only. Builtin scores are shown for reference.&lt;/p&gt;
&lt;h3&gt;
  
  
  Experiment 1 — Prompt Strategy
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Baseline prompt&lt;/strong&gt; (C): Answer immediately, use tools when needed, keep responses concise.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Optimized prompt&lt;/strong&gt; (T1): Classify intent first → ask one clarifying question if ambiguous → call the right tool → cite actual tool data → provide clear next steps.&lt;/p&gt;

&lt;p&gt;The optimized prompt was generated by AgentCore's recommendation API after analyzing baseline session traces.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;C (n=13)&lt;/th&gt;
&lt;th&gt;T1 (n=17)&lt;/th&gt;
&lt;th&gt;Δ&lt;/th&gt;
&lt;th&gt;p&lt;/th&gt;
&lt;th&gt;Significant?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;cs_overall_outcome ⭐&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.708&lt;/td&gt;
&lt;td&gt;0.835&lt;/td&gt;
&lt;td&gt;+18.0%&lt;/td&gt;
&lt;td&gt;0.059&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;cs_tool_groundedness&lt;/td&gt;
&lt;td&gt;0.865&lt;/td&gt;
&lt;td&gt;0.985&lt;/td&gt;
&lt;td&gt;+13.9%&lt;/td&gt;
&lt;td&gt;0.002&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Yes&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;cs_intent_resolution&lt;/td&gt;
&lt;td&gt;0.923&lt;/td&gt;
&lt;td&gt;0.971&lt;/td&gt;
&lt;td&gt;+5.1%&lt;/td&gt;
&lt;td&gt;0.324&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;cs_support_quality&lt;/td&gt;
&lt;td&gt;0.827&lt;/td&gt;
&lt;td&gt;0.838&lt;/td&gt;
&lt;td&gt;+1.4%&lt;/td&gt;
&lt;td&gt;0.800&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Builtin.GoalSuccessRate&lt;/td&gt;
&lt;td&gt;0.154&lt;/td&gt;
&lt;td&gt;0.647&lt;/td&gt;
&lt;td&gt;+320.6%&lt;/td&gt;
&lt;td&gt;0.002&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Verdict&lt;/strong&gt;: DIRECTIONAL. T1 leads on &lt;code&gt;cs_overall_customer_outcome&lt;/code&gt; (+18.0%, p=0.059) but narrowly misses significance at n=30.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Analysis&lt;/strong&gt;: The most revealing number is &lt;code&gt;cs_tool_groundedness&lt;/code&gt;, the only statistically significant cs_* result across all three experiments (p=0.002). The baseline prompt was partially answering from the model's own knowledge rather than grounding responses in what tools returned.&lt;/p&gt;

&lt;p&gt;The billing sessions show where GoalSuccessRate is actually useful. The baseline called &lt;code&gt;fetch_billing_history&lt;/code&gt;, confirmed the duplicate $49 charge, then stopped. cs_overall gave it 0.75 — correct tool, accurate data, reasonable response. GoalSuccessRate gave 0 — the overcharge wasn't resolved, so the user's goal wasn't met. GoalSuccessRate was right. Diagnosing a problem is not the same as fixing it. The optimized prompt's "provide clear next steps" step is what moved the agent from diagnosis to action, and GoalSuccessRate captured that clearly (0.0 → 0.833 on billing).&lt;/p&gt;
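
&lt;p&gt;For anyone who wants to reproduce the Δ and p columns on their own per-session scores, here is one possible approach: a relative delta plus a two-sided permutation test. The choice of test is an assumption for illustration and may not be the one behind the numbers in the table; the function names are hypothetical.&lt;/p&gt;

```python
import random
from statistics import mean


def relative_delta(control_mean, treatment_mean):
    """Percentage change of the treatment mean relative to the control mean."""
    return (treatment_mean - control_mean) / control_mean * 100.0


def permutation_p_value(control, treatment, n_resamples=10_000, seed=0):
    """Two-sided permutation test on the difference of means."""
    rng = random.Random(seed)
    observed = abs(mean(treatment) - mean(control))
    pooled = list(control) + list(treatment)
    n_control = len(control)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)  # randomly reassign sessions to the two variants
        diff = abs(mean(pooled[n_control:]) - mean(pooled[:n_control]))
        if diff >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_resamples + 1)


# Means taken from the cs_tool_groundedness row above.
print(round(relative_delta(0.865, 0.985), 1))  # 13.9
```

&lt;p&gt;A permutation test makes no normality assumption, which suits small samples of bounded judge scores like these.&lt;/p&gt;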


&lt;h3&gt;
  
  
  Experiment 2 — Tool Descriptions
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Baseline tool descriptions&lt;/strong&gt; (C): Vague one-liners that give the LLM almost no signal.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fetch_billing_history&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Get data for a user.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;check_refund_status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;   &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Process a request.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;validate_account_identity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Run a query.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Optimized tool descriptions&lt;/strong&gt; (T1): Precise typed signatures with return value descriptions.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fetch_billing_history&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Retrieve complete billing transaction history for a customer by user_id. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Returns itemized charges, payment dates, amounts, and subscription details &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;for the past 90 days.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Both variants use the winning prompt from Exp 1, so only tool selection behavior changes.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;C (n=14)&lt;/th&gt;
&lt;th&gt;T1 (n=16)&lt;/th&gt;
&lt;th&gt;Δ&lt;/th&gt;
&lt;th&gt;p&lt;/th&gt;
&lt;th&gt;Significant?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;cs_overall_outcome ⭐&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.804&lt;/td&gt;
&lt;td&gt;0.859&lt;/td&gt;
&lt;td&gt;+6.9%&lt;/td&gt;
&lt;td&gt;0.193&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;cs_tool_groundedness&lt;/td&gt;
&lt;td&gt;0.964&lt;/td&gt;
&lt;td&gt;1.000&lt;/td&gt;
&lt;td&gt;+3.7%&lt;/td&gt;
&lt;td&gt;0.141&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;cs_intent_resolution&lt;/td&gt;
&lt;td&gt;0.964&lt;/td&gt;
&lt;td&gt;0.922&lt;/td&gt;
&lt;td&gt;-4.4%&lt;/td&gt;
&lt;td&gt;0.271&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;cs_support_quality&lt;/td&gt;
&lt;td&gt;0.814&lt;/td&gt;
&lt;td&gt;0.797&lt;/td&gt;
&lt;td&gt;-2.1%&lt;/td&gt;
&lt;td&gt;0.650&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Builtin.GoalSuccessRate&lt;/td&gt;
&lt;td&gt;0.643&lt;/td&gt;
&lt;td&gt;0.188&lt;/td&gt;
&lt;td&gt;-70.8%&lt;/td&gt;
&lt;td&gt;0.006&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Yes&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Verdict&lt;/strong&gt;: DIRECTIONAL. T1 leads on &lt;code&gt;cs_overall_customer_outcome&lt;/code&gt; (+6.9%, p=0.193), but the difference is not significant.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Analysis&lt;/strong&gt;: Better descriptions pushed &lt;code&gt;cs_tool_groundedness&lt;/code&gt; to a perfect 1.000. The GoalSuccessRate collapse (-70.8%) looks bad but reflects a judge inconsistency, not a real regression. On account_locked sessions, GoalSuccessRate scored C at 0.75 and T1 at 0, yet cs_overall was 0.875 for both. Both variants did the same thing: they confirmed the account lock and requested identity verification. GoalSuccessRate sometimes counted that as success, sometimes as failure. cs_overall (+6.9%) is the more consistent signal here.&lt;/p&gt;


&lt;h3&gt;
  
  
  Experiment 3 — Model Comparison
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Control&lt;/strong&gt; (C): Claude Haiku 4.5 (fast, cost-efficient, ~$1/M input tokens).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Treatment&lt;/strong&gt; (T1): Claude Sonnet 4.6 (more capable, ~$3/M input tokens).&lt;/p&gt;

&lt;p&gt;Both variants use the best prompt and best tool descriptions from Exp 1 and 2. The only difference is &lt;code&gt;model_id&lt;/code&gt; in the config bundle. No redeployment needed.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;C (n=14)&lt;/th&gt;
&lt;th&gt;T1 (n=16)&lt;/th&gt;
&lt;th&gt;Δ&lt;/th&gt;
&lt;th&gt;p&lt;/th&gt;
&lt;th&gt;Significant?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;cs_overall_outcome ⭐&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.875&lt;/td&gt;
&lt;td&gt;0.812&lt;/td&gt;
&lt;td&gt;-7.1%&lt;/td&gt;
&lt;td&gt;0.212&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;cs_tool_groundedness&lt;/td&gt;
&lt;td&gt;0.982&lt;/td&gt;
&lt;td&gt;1.000&lt;/td&gt;
&lt;td&gt;+1.8%&lt;/td&gt;
&lt;td&gt;0.317&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;cs_intent_resolution&lt;/td&gt;
&lt;td&gt;0.911&lt;/td&gt;
&lt;td&gt;0.938&lt;/td&gt;
&lt;td&gt;+2.9%&lt;/td&gt;
&lt;td&gt;0.537&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;cs_support_quality&lt;/td&gt;
&lt;td&gt;0.786&lt;/td&gt;
&lt;td&gt;0.844&lt;/td&gt;
&lt;td&gt;+7.4%&lt;/td&gt;
&lt;td&gt;0.142&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Builtin.GoalSuccessRate&lt;/td&gt;
&lt;td&gt;0.214&lt;/td&gt;
&lt;td&gt;0.438&lt;/td&gt;
&lt;td&gt;+104.2%&lt;/td&gt;
&lt;td&gt;0.193&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Verdict&lt;/strong&gt;: INCONCLUSIVE / C holds. Haiku 4.5 leads on &lt;code&gt;cs_overall_customer_outcome&lt;/code&gt; across every ticket type. Not statistically significant (p=0.212), but the direction is consistent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Analysis&lt;/strong&gt;: Sonnet scored higher on &lt;code&gt;cs_support_quality&lt;/code&gt; (+7.4%) — richer responses, better formatted. But on one billing session it scored 0.5 because it required identity verification before committing to the refund, even though &lt;code&gt;fetch_billing_history&lt;/code&gt; had already confirmed the duplicate and the user's eligibility. That extra step wasn't warranted by the data. Haiku saw the same tool output and offered the refund directly. On a structured task where the tool result already tells you what to do, Sonnet's tendency to add caution worked against it.&lt;/p&gt;


&lt;h2&gt;
  
  
  7. Key Learnings
&lt;/h2&gt;
&lt;h3&gt;
  
  
  What worked well
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Online evaluation is fully automatic once configured.&lt;/strong&gt; After the eval config is set up, scores land in CloudWatch for every gateway session without any extra instrumentation on your end. The only thing you need to do is read the log group.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Config bundles make iteration fast.&lt;/strong&gt; Swapping model ID, system prompt, and tool descriptions across variants with no container rebuild cuts the turnaround of an experiment from hours to minutes.&lt;/p&gt;
&lt;h3&gt;
  
  
  Gotchas
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The Recommendations API optimizes for a metric you may not be using.&lt;/strong&gt; The API only accepts session-level evaluators. If your north star is a custom trace-level evaluator (as mine was), the API silently falls back to &lt;code&gt;Builtin.GoalSuccessRate&lt;/code&gt; instead. As the results show, those two metrics frequently disagree. Treat AI-generated recommendations as a strong starting point, not a guaranteed improvement. The A/B test is the actual verdict.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Each A/B test is limited to two variants.&lt;/strong&gt; The gateway supports one control and one treatment per experiment. Testing three or more configurations requires running sequential experiments, which means more time and the risk of confounds between runs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LLM-as-judge has variance. Consider deterministic evals where possible.&lt;/strong&gt; For outputs with a clear correct answer, a deterministic check (exact field match, regex, schema validation) is more reliable than asking a judge model. LLM-as-judge is necessary for open-ended quality, but if part of your rubric can be verified programmatically, that part should be.&lt;/p&gt;
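
&lt;p&gt;As a minimal sketch of a deterministic check, the function below verifies that every dollar amount quoted in a response also appears in the tool output. The function name and the input shapes are hypothetical; a real groundedness check would cover more fields than amounts.&lt;/p&gt;

```python
import re

# Matches amounts such as $49 or $49.00.
AMOUNT_PATTERN = re.compile(r"\$(\d+(?:\.\d{1,2})?)")


def cites_only_tool_amounts(response_text, tool_amounts):
    """True if the response quotes at least one amount and all of them come from the tool."""
    quoted = {float(match) for match in AMOUNT_PATTERN.findall(response_text)}
    allowed = {float(amount) for amount in tool_amounts}
    return bool(quoted) and quoted.issubset(allowed)


print(cites_only_tool_amounts("I can see the duplicate $49 charge.", [49.0]))  # True
print(cites_only_tool_amounts("I'll refund the $59 charge.", [49.0]))          # False
```

&lt;p&gt;A check like this returns the same verdict every time for the same transcript, so it can anchor the part of a rubric that a judge model would otherwise score with some noise.&lt;/p&gt;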

&lt;p&gt;&lt;strong&gt;Use a smaller, high-throughput model for LLM-as-judge.&lt;/strong&gt; When multiple evaluators fire concurrently per session, a capable-but-limited-throughput model will silently drop scores under throttling — no errors, scores just don't appear. A faster, cheaper model handles the concurrency, and for judging structured rubrics the quality difference is negligible.&lt;/p&gt;
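
&lt;p&gt;Because dropped scores raise no error, it helps to check for gaps explicitly. The sketch below assumes the score records have already been pulled from the log group into plain dictionaries; the record shape and the &lt;code&gt;find_missing_scores&lt;/code&gt; helper are hypothetical, while the evaluator names are the ones used in this post.&lt;/p&gt;

```python
EXPECTED_EVALUATORS = frozenset({
    "cs_overall_customer_outcome",
    "cs_tool_groundedness",
    "cs_intent_resolution",
    "cs_support_quality",
    "Builtin.GoalSuccessRate",
})


def find_missing_scores(session_ids, score_records, expected=EXPECTED_EVALUATORS):
    """Map each session to the evaluators that never produced a score for it."""
    # Start from the sessions we sent, so sessions with zero scores still show up.
    seen = {session_id: set() for session_id in session_ids}
    for record in score_records:
        seen.setdefault(record["session_id"], set()).add(record["evaluator"])
    missing = {}
    for session_id, evaluators in seen.items():
        gap = expected - evaluators
        if gap:
            missing[session_id] = sorted(gap)
    return missing


# Hypothetical records: session "s2" lost two scores.
records = [
    {"session_id": "s1", "evaluator": name} for name in EXPECTED_EVALUATORS
] + [
    {"session_id": "s2", "evaluator": "cs_overall_customer_outcome"},
    {"session_id": "s2", "evaluator": "cs_tool_groundedness"},
    {"session_id": "s2", "evaluator": "Builtin.GoalSuccessRate"},
]
print(find_missing_scores(["s1", "s2"], records))
# {'s2': ['cs_intent_resolution', 'cs_support_quality']}
```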


&lt;h2&gt;
  
  
  8. Conclusion
&lt;/h2&gt;

&lt;p&gt;Building the agent was the easy part.&lt;/p&gt;

&lt;p&gt;The surprising difficulty was eval design. Writing a north star metric that genuinely reflects your business goal, not just something easy to score, takes real iteration. And once you have a north star, designing the supporting diagnostics that explain &lt;em&gt;why&lt;/em&gt; it moves is just as hard. The built-in evaluators are a useful reference, but they're domain-agnostic by design. They will disagree with your north star at exactly the moments that matter most.&lt;/p&gt;

&lt;p&gt;A few things I'd carry into the next project:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Config bundles are the operational win.&lt;/strong&gt; Swapping prompts, tool descriptions, and model IDs in production with a 50/50 split, with no container rebuild, changes how you think about iteration. Changes that used to require a deploy cycle become experiments you can start in minutes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The loop is the product.&lt;/strong&gt; Baseline → recommend → A/B test → promote → repeat. Every step is already an API call returning structured data. There's nothing stopping an agent from evaluating itself, triggering a new recommendation when scores drop, and starting a test automatically. I'm not quite at fully self-driving agents yet, but the primitives are already here.&lt;/p&gt;
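
&lt;p&gt;The decision step of such a loop could be as small as the sketch below. The function name, the 0.05 tolerance, and the minimum session count are hypothetical, and the calls that would actually start a recommendation or an A/B test are left out.&lt;/p&gt;

```python
from statistics import mean


def should_trigger_recommendation(recent_scores, baseline_mean,
                                  tolerance=0.05, min_sessions=10):
    """Decide whether the north star has dropped enough to start a new cycle."""
    if min_sessions > len(recent_scores):
        return False  # not enough sessions to act on
    return baseline_mean - mean(recent_scores) > tolerance


print(should_trigger_recommendation([0.70] * 10, baseline_mean=0.835))  # True
print(should_trigger_recommendation([0.83] * 10, baseline_mean=0.835))  # False
```

&lt;p&gt;Requiring a minimum number of sessions keeps a couple of noisy judge scores from kicking off a new experiment on their own.&lt;/p&gt;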


&lt;h2&gt;
  
  
  9. Resources
&lt;/h2&gt;

&lt;p&gt;✍️ My Blog&lt;br&gt;
&lt;/p&gt;
&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
    &lt;div class="c-embed__content"&gt;
      &lt;div class="c-embed__body flex items-center justify-between"&gt;
        &lt;a href="https://www.asakohayase.com/blog" rel="noopener noreferrer" class="c-link fw-bold flex items-center"&gt;
          &lt;span class="mr-2"&gt;asakohayase.com&lt;/span&gt;
          

        &lt;/a&gt;
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;



&lt;p&gt;🚀 Try It Yourself&lt;br&gt;
&lt;a href="https://github.com/asakohayase/agentcore-customer-support-agent" rel="noopener noreferrer"&gt;GitHub Repository&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;📚 Learn More&lt;br&gt;
&lt;a href="https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/optimization.html" rel="noopener noreferrer"&gt;AgentCore Optimization official docs&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agentcore</category>
      <category>aws</category>
    </item>
  </channel>
</rss>
