DEV Community: Asako Hayase

I Built a Rhythm Game That Lives Above My IDE

Asako Hayase — Sun, 07 Jun 2026 20:31:05 +0000

1. Introduction

Every year, I pick up new hobbies. This year: drums.

When I was watching Claude Code flibbertigibbeting, I thought, "why don't I build a game to practice my rhythm skills?"

So I built a rhythm game in Electron that floats transparently above my IDE. When Claude Code is thinking, I hit F and J to play low and high hits against whatever song is loaded. When it responds, I go back to coding. No context switch, no separate window. The game is just there, above everything else, always.

2. What I Built

The game loads any audio file, analyzes it offline to find where the low and high hits land, then spawns hit targets that scroll toward two drum pads synced to the song's playback position. You hit F for low hits, J for high hits. Timing accuracy scores each hit.

Loads any audio file: MP3, WAV, OGG, M4A, AAC, FLAC
Analyzes beats offline: runs onset detection across the whole file before playback starts
Five visual themes: Lime, Classic, Forest, Neon, Dusk
BPM auto-detection: estimates tempo from detected low-hit intervals

3. How to Run

git clone https://github.com/asakohayase/drum-overlay.git
cd drum-overlay
npm install
npm start

Controls:

F: low hits
J: high hits
Space: play / pause song
Cmd+Shift+Q: quit

4. What is an Electron Overlay?

A browser tab can't float above other apps. It lives inside the browser window, so you'd have to alt-tab to use it, which defeats the whole point. You need OS-level window control.

Electron gives you that. It's normally used to build standalone desktop apps: VS Code, Claude Desktop, and Slack are all Electron apps. Three window settings make the overlay possible:

transparent: true removes the default white window background Electron adds. Without it, the overlay is an opaque rectangle that hides whatever is underneath.

alwaysOnTop: true keeps the window above all other windows system-wide. It doesn't lose its position when you click on something else.

setIgnoreMouseEvents(true, { forward: true }) lets clicks through. Without it, you cannot click your IDE: the window covers the full screen, so it would intercept every click. This call passes clicks through to whatever's underneath, while still telling the overlay where your cursor is. When the cursor enters the panel, the overlay temporarily becomes clickable. When it leaves, clicks pass through again.

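const { BrowserWindow } = require('electron');
const path = require('path');

// Create the window only after app.whenReady() has resolved.
const win = new BrowserWindow({
  transparent: true,   // no opaque window background
  frame: false,        // no title bar or window chrome
  alwaysOnTop: true,   // stay above other windows
  webPreferences: { preload: path.join(__dirname, 'preload.js') }
});
// Let clicks pass through, but keep forwarding mouse-move events to the page.
win.setIgnoreMouseEvents(true, { forward: true });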

The renderer toggles interactivity dynamically based on what the cursor is over:

document.addEventListener('mousemove', (e) => {
  const over = e.target.closest('.pad, .icon-btn, .play-btn, .progress-bar');
  ipcRenderer.send('set-ignore-mouse', !over);
});

5. Architecture

Web Audio API (built-in):

OfflineAudioContext: runs the full analysis pass before playback starts. Some libraries only offer real-time analysis, which is too late to pre-populate the note lane.
createBiquadFilter: applies lowpass/bandpass frequency filters
decodeAudioData: decodes MP3/WAV/etc into raw samples

Custom code built on top:

detectOnsets: finds low-hit and high-hit timestamps from the filtered audio
estimateBPM: estimates tempo from low-hit intervals
playKick / playSnare: synthesized drum sounds
drawFrame: game loop, note scrolling, hit detection, scoring

Audio file
    │
    ▼
decodeAudioData()
    │
    ├─ detectOnsets(lowpass,  100Hz)  → lowTimes[]
    └─ detectOnsets(bandpass, 2500Hz) → highTimes[]
    │
    ▼
estimateBPM(lowTimes) → bpm
    │
    ▼
Game loop (requestAnimationFrame)
    ├─ Spawn notes from lowTimes/highTimes ahead of currentTime
    ├─ Scroll notes toward hit zone
    └─ Score hit on keydown (F=kick, J=snare)
    │
    ▼
playKick() / playSnare() on hit
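
The post doesn't show estimateBPM itself, so here is one common way to turn low-hit timestamps into a tempo, sketched in Python rather than the game's JavaScript. The median-interval approach, the name estimate_bpm, and the 60 to 180 BPM folding range are all assumptions for illustration, not the project's actual code.

```python
from statistics import median

def estimate_bpm(low_times, min_bpm=60.0, max_bpm=180.0):
    """Estimate tempo from onset timestamps (seconds). Returns None if too few hits."""
    # Gaps between consecutive low hits
    intervals = [b - a for a, b in zip(low_times, low_times[1:]) if b - a > 0]
    if not intervals:
        return None
    # The median ignores the occasional missed or extra hit better than the mean
    bpm = 60.0 / median(intervals)
    # Fold half-time / double-time estimates into an assumed plausible range
    while bpm < min_bpm:
        bpm *= 2
    while bpm > max_bpm:
        bpm /= 2
    return bpm

print(estimate_bpm([0.0, 0.5, 1.0, 1.5, 2.0]))  # 120.0
```

Using the median means a single dropped kick (one interval twice as long as the rest) does not shift the estimate.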

Onset detection

DSP (Digital Signal Processing) is math applied to audio signals: filtering frequencies, measuring energy, finding patterns in waveforms.

The naive approach is to threshold amplitude: find frames above a loudness cutoff. This fails on any real track because overall loudness varies constantly. A quiet verse and a loud chorus have completely different amplitude ranges.

The insight: drum hits are transients, sharp sudden attacks, not just loud frames. A low hit is a sudden spike in bass energy that decays in under half a second. What distinguishes it isn't loudness. It's a sharp increase in energy. So instead of thresholding energy, threshold the first difference of energy.

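// RMS energy in 10ms windows, 5ms hop.
// raw is the filtered mono sample array and sampleRate its rate (names assumed here).
const win = Math.round(sampleRate * 0.010);
const hop = Math.round(sampleRate * 0.005);
const nFrames = Math.floor((raw.length - win) / hop) + 1;
const energy = new Float32Array(nFrames);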
for (let i = 0; i < nFrames; i++) {
  const s = i * hop;
  let e = 0;
  for (let j = 0; j < win; j++) e += raw[s + j] ** 2;
  energy[i] = Math.sqrt(e / win);
}

// Half-wave rectified first difference: energy increases only
const strength = new Float32Array(nFrames);
for (let i = 1; i < nFrames; i++) {
  strength[i] = Math.max(0, energy[i] - energy[i - 1]);
}

I use a percentile threshold rather than mean+std: "the top 3% of energy spikes count as onsets." Because the cutoff is taken from the track's own distribution of spikes, it adapts to each song automatically, regardless of the noise floor.

const positives = [...strength].filter(v => v > 0).sort((a, b) => a - b);
const threshold = positives[Math.floor(positives.length * 0.97)];

Onsets are the local maxima above the threshold, with a 220ms minimum gap between them to avoid catching echoes. Without the gap, a hit's ring-out produces a secondary spike that gets detected as a second note.

Low and high hits are separated by frequency: a lowpass at 100Hz captures bass-range hits (kick, tom, bass); a bandpass at 2500Hz captures treble-range hits (hi-hat, cymbals, snare crack). OfflineAudioContext applies these filters and renders faster than real time.

const [lowTimes, highTimes] = await Promise.all([
  detectOnsets(buf, 'lowpass',  100,  1.0, 0.22, 0.98),
  detectOnsets(buf, 'bandpass', 2500, 1.5, 0.22, 0.97),
]);
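
The detection steps described above (frame energy, rectified difference, percentile threshold, peak picking with a minimum gap) can be sketched end to end. This is a Python port for illustration, not the game's JavaScript: the name pick_onsets is hypothetical, the defaults are taken from the numbers quoted in the post, the greedy "keep the first peak, drop later ones inside the gap" rule is an assumption, and the input is expected to be already band-filtered.

```python
import math

def pick_onsets(samples, sample_rate, win_s=0.010, hop_s=0.005,
                percentile=0.97, min_gap_s=0.22):
    """Return onset times (seconds) for a band-filtered mono signal."""
    win = max(1, int(sample_rate * win_s))
    hop = max(1, int(sample_rate * hop_s))
    n_frames = max(0, (len(samples) - win) // hop + 1)

    # RMS energy per frame
    energy = []
    for i in range(n_frames):
        frame = samples[i * hop:i * hop + win]
        energy.append(math.sqrt(sum(x * x for x in frame) / win))

    # Half-wave rectified first difference: keep energy increases only
    strength = [0.0] + [max(0.0, energy[i] - energy[i - 1])
                        for i in range(1, n_frames)]

    # Percentile threshold over the positive increases
    positives = sorted(v for v in strength if v > 0)
    if not positives:
        return []
    idx = min(len(positives) - 1, int(len(positives) * percentile))
    threshold = positives[idx]

    # Local maxima above the threshold, at least min_gap_s apart
    onsets = []
    for i in range(1, n_frames - 1):
        is_peak = strength[i] >= strength[i - 1] and strength[i] > strength[i + 1]
        if is_peak and strength[i] >= threshold:
            t = i * hop / sample_rate
            if not onsets or t - onsets[-1] >= min_gap_s:
                onsets.append(t)
    return onsets
```

Called on a signal with two decaying bursts one second apart, this returns two timestamps within a frame or so of each burst's start.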

Sound synthesis

A kick is a sine wave sweeping from 160Hz down to near-zero over 450ms (the body) plus a 20ms square-wave burst at 900Hz (the click attack).

function playKick() {
  const c = ctx();
  const osc = c.createOscillator();
  osc.type = 'sine';
  osc.frequency.setValueAtTime(160, c.currentTime);
  osc.frequency.exponentialRampToValueAtTime(0.001, c.currentTime + 0.45);
  // gain envelope, connect to destination...
}
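
To see what that sweep does over time, the curve can be evaluated directly. The Web Audio specification defines an exponential ramp from V0 at T0 to V1 at T1 as v(t) = V0 * (V1 / V0) ^ ((t - T0) / (T1 - T0)). The Python helpers below are a hypothetical illustration (exp_ramp and time_to_reach are not part of the project), and the 20 Hz figure is used only as a rough lower edge of hearing.

```python
import math

def exp_ramp(v0, v1, t0, t1, t):
    """Value at time t of an exponential ramp from v0 (at t0) to v1 (at t1)."""
    if t <= t0:
        return v0
    if t >= t1:
        return v1
    return v0 * (v1 / v0) ** ((t - t0) / (t1 - t0))

def time_to_reach(v0, v1, t0, t1, target):
    """Time at which the ramp passes through target (between v0 and v1)."""
    return t0 + (t1 - t0) * math.log(target / v0) / math.log(v1 / v0)

# Kick body from the post: 160 Hz down to 0.001 Hz over 0.45 s
print(round(exp_ramp(160, 0.001, 0.0, 0.45, 0.1), 1))        # 11.2 (Hz at 100 ms)
print(round(time_to_reach(160, 0.001, 0.0, 0.45, 20.0), 3))  # 0.078 (s)
```

With these parameters the pitch falls below 20 Hz after roughly 78 ms, so most of the audible thump happens early in the 450 ms ramp. How the rest sounds depends on the gain envelope, which the post elides.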

A snare is white noise (Math.random() into a buffer) filtered through a bandpass at 2200Hz plus a short triangle-wave tone sweep for the crack.

6. Key Learnings

OfflineAudioContext for pre-analysis. To render the note lane, all hit timestamps need to be known before playback starts. OfflineAudioContext runs the full analysis pass upfront. Real-time analysis would only surface hits as the song plays, too late to populate the lane.
Percentile threshold over mean+std. Tracks with heavy cymbal wash raise the noise floor and collapse mean+std thresholds. Percentile threshold only cares about relative spike height within the track.
Onset detection parameters needed tuning. The minimum gap between onsets and the percentile threshold both took a few iterations to feel right. Too permissive and you catch echoes; too strict and real hits get dropped.

7. Conclusion

AI thinking pauses are dead time by default. They don't have to be. Build something creative, build it for yourself, and you might end up with more than you expected.

8. Resources

🚀 Try it yourself: github.com/asakohayase/drum-overlay

Optimizing a Customer Support Agent on AgentCore

Asako Hayase — Thu, 04 Jun 2026 22:12:59 +0000

1. Introduction

The AI agent stack has evolved quickly through a few distinct phases.

First came the model: call an API, get a response. The intelligence is in the model; your job is to write a good prompt.

Then came the harness: frameworks like LangGraph, CrewAI, and Strands gave agents tools, memory, and multi-step loops. Orchestration became the product.

Now the question is: how do you make a deployed agent better over time without rebuilding it from scratch on every iteration? That's the phase we're in, and it's where most of the interesting engineering work is happening.

AgentCore Optimization is designed for this. Now in public preview, it gives you the infrastructure to run controlled A/B experiments on a live agent: split traffic across configurations, score every session automatically with LLM-as-a-judge evaluators, and read results in CloudWatch.

In this post, I'll walk through how I built a LangGraph-based customer support agent on Amazon Bedrock AgentCore, then ran three sequential A/B experiments to optimize it: a better prompt, better tool descriptions, and a bigger model.

2. What's AgentCore Optimization?

AgentCore Optimization is a set of integrated AWS services that lets you continuously improve an agent without rebuilding it. It's built on three primitives:

Configuration Bundles are versioned JSON payloads containing whatever per-request config you want to test: system prompt, tool descriptions, model ID, or any arbitrary key. The bundle gets injected into every invocation by the gateway, so you can run two different agent configurations off the same container, with no redeployment.

# Agent reads its config bundle on every request
bundle = BedrockAgentCoreContext.get_config_bundle()
model_id = bundle["model_id"]
system_prompt = bundle["system_prompt"]
tool_descriptions = bundle.get("tool_descriptions", {})

The AgentCore Gateway sits in front of your runtime and handles traffic routing. You create an A/B test that maps two config bundles to traffic percentages (50/50, 80/20, etc.) and attach it to a gateway target. From that point, every invocation is probabilistically routed to one variant — so a 50/50 split is a target, not a guarantee — and the variant assignment is recorded in OTel spans.
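
To put a number on "a target, not a guarantee": if each invocation is routed independently with probability 0.5 (an assumption about the gateway, which the post describes only as probabilistic), the split over a small batch follows a binomial distribution. The helper names below are hypothetical.

```python
from math import comb

def split_probability(n_sessions, k_control, p_control=0.5):
    """Probability that exactly k_control of n_sessions land on the control variant."""
    return (comb(n_sessions, k_control)
            * p_control ** k_control
            * (1 - p_control) ** (n_sessions - k_control))

def prob_at_least_as_uneven(n_sessions, k_control):
    """For a 50/50 target: probability of a split at least this far from even."""
    deviation = abs(k_control - n_sessions / 2)
    return sum(split_probability(n_sessions, k)
               for k in range(n_sessions + 1)
               if abs(k - n_sessions / 2) >= deviation)

print(round(split_probability(30, 15), 3))        # 0.144
print(round(prob_at_least_as_uneven(30, 13), 3))  # 0.585
```

Under that assumption, a perfectly even 15/15 split over 30 sessions happens only about 14% of the time, and a split as lopsided as 13/17 or worse happens more often than not. That matches the uneven group sizes reported in the experiments later in this post.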

Online Evaluators are LLM-as-a-judge scorers that run asynchronously after every session. You define evaluation criteria in natural language, choose a judge model and scoring scale, and register the evaluator with AgentCore. Once attached to an online evaluation config, it scores every session in the A/B test and writes the results to a CloudWatch log group. You can define custom evaluators tuned to your domain, or use AgentCore's built-in evaluators.

These three primitives compose into a four-step continuous improvement loop (as described in the official docs):

Generate a recommendation. Point the Recommendations API at agent traces in CloudWatch and specify the evaluator you want to optimize for. It analyzes failure patterns and returns an improved system prompt or tool descriptions, along with an explanation of what changed and why.
Package as a configuration bundle. Version the recommended config as an immutable snapshot. This decouples agent behavior from code: you can change prompts, models, and tool descriptions without touching the container.
Validate with an A/B test. Split production traffic between current (control) and improved (treatment) through the gateway. Online evaluation scores every session and reports statistical significance.
Deploy the winner and repeat. Route 100% of traffic to the winning variant. The new baseline's traces seed the next iteration.

3. What I Built

I built a customer support agent that handles three common ticket types:

account_locked: "I can't log in, keep getting an error" → call validate_account_identity, confirm identity, explain unlock steps
billing_duplicate: "I was charged twice" → call fetch_billing_history, identify duplicate, initiate refund via check_refund_status
gdpr_deletion: "Delete my data under Article 17" → verify identity, explain deletion process, escalate to privacy team

The agent has three tools:

@tool
def fetch_billing_history(user_id: str) -> dict:
    """Retrieve complete billing transaction history for a customer by user_id.
    Returns itemized charges, payment dates, amounts, and subscription details
    for the past 90 days."""
    ...

@tool
def check_refund_status(ticket_id: str) -> dict:
    """Check the current processing status of a refund request by ticket_id.
    Returns status (pending/approved/rejected), refund amount, and estimated
    completion timeline."""
    ...

@tool
def validate_account_identity(user_id: str) -> dict:
    """Verify a customer's account identity and retrieve their account status,
    access level, subscription tier, and any active restrictions or flags."""
    ...

Each ticket requires the agent to correctly identify which tool(s) to call, extract the right parameters, and ground its response in what the tools actually returned, not what it thinks the answer should be.

4. Architecture

The agent is built with LangGraph's StateGraph and ToolNode. AgentCore is framework-agnostic, so plain Python, LangChain, CrewAI, or any other framework works equally well:

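from langchain_core.messages import SystemMessage
from langgraph.graph import MessagesState, StateGraph
from langgraph.prebuilt import ToolNode, tools_condition

def create_agent():
    def chatbot(state: MessagesState):
        # Model and prompt come from the per-request config bundle
        model_id = _active_model_id.get()
        llm_with_tools = _get_llm_with_tools(model_id)
        system_prompt = _active_system_prompt.get()
        messages = [SystemMessage(content=system_prompt)] + state["messages"]
        # MessagesState appends returned messages, so return only the new one
        return {"messages": [llm_with_tools.invoke(messages)]}

    graph = StateGraph(MessagesState)
    graph.add_node("chatbot", chatbot)
    graph.add_node("tools", ToolNode(ALL_TOOLS))
    # Route to "tools" when the model requested a tool call, otherwise finish
    graph.add_conditional_edges("chatbot", tools_condition)
    graph.add_edge("tools", "chatbot")
    graph.set_entry_point("chatbot")
    return graph.compile()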

The per-request config injection happens in the @app.entrypoint:

@app.entrypoint
def customer_support_agent_runtime(payload: dict) -> str:
    bundle = BedrockAgentCoreContext.get_config_bundle()
    if bundle:
        model_id = bundle.get("model_id")
        if not model_id:
            raise ValueError("Config bundle missing model_id")
        _active_model_id.set(model_id)
        _active_system_prompt.set(bundle.get("system_prompt", BASELINE_SYSTEM_PROMPT))
        _apply_tool_description_overrides(ALL_TOOLS, bundle.get("tool_descriptions", {}))

    response = agent.invoke({"messages": [HumanMessage(content=payload["prompt"])]})
    return response["messages"][-1].content

contextvars.ContextVar scopes the config to the current request without thread-safety issues, even under concurrent invocations.

Observability

OTel spans flow to aws/spans via the AWS Distro for OpenTelemetry (ADOT). The LangchainInstrumentor captures LangGraph node executions. The online evaluators read both aws/spans and the runtime log group. If either is missing, scoring fails silently.

One gotcha: the default Dockerfile from the AgentCore starter sets OTEL_TRACES_EXPORTER=none, which disables all span export. You have to remove that line and add the ADOT configurator:

# Remove this — it kills all observability:
# ENV OTEL_TRACES_EXPORTER=none

ENV AGENT_OBSERVABILITY_ENABLED=true
ENV OTEL_PYTHON_DISTRO=aws_distro
ENV OTEL_PYTHON_CONFIGURATOR=aws_configurator
ENV OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf

Infrastructure Setup

The official way to set up an AgentCore project is:

agentcore create   # scaffold project
agentcore deploy   # build container, push to ECR, create runtime

In practice, I hit two bugs that kept this from working out of the box.

Bug 1 — CodeBuild project name mismatch. The CLI creates a CodeBuild project named AgentCore-<project>-default-container-builder, but deploy.py looks for bedrock-agentcore-<agent_name>-builder. The build trigger silently does nothing because the project it expects doesn't exist.

Bug 2 — Wrong architecture. AgentCore Runtime requires arm64 containers. The CLI-generated CodeBuild project uses x86, which fails at runtime with ValidationException: Architecture incompatible. You need ARM_CONTAINER compute type and the amazonlinux2-aarch64-standard:3.0 image, neither of which the CLI sets.

I worked around this with bootstrap_infra.py, a one-time setup script that creates the ECR repo, S3 bucket, IAM role, and CodeBuild project with the correct name and architecture. It's idempotent, so safe to re-run if anything already exists.

Pre-built evaluators — and why they weren't enough

AgentCore ships with built-in evaluators out of the box. No setup, works immediately. Here's what each one actually measures:

Builtin.GoalSuccessRate (session-level): Did the agent successfully complete all user goals across the entire conversation? The judge outputs Yes / No, which AgentCore maps to 1.0 / 0.0 before writing to CloudWatch. The aggregated scores you see (e.g. 0.154, 0.647) are the proportion of sessions that scored "Yes".

Builtin.Helpfulness (trace-level): Did the response move the user closer to their goal, from the user's perspective? Scores on a 0–6 categorical scale. Explicitly ignores factual accuracy — it only evaluates whether the response felt helpful to the user.

Builtin.Correctness (trace-level): Is the response factually accurate? Framed like a quiz: only content matters, not style or presentation. Scores Perfectly Correct / Partially Correct / Incorrect.

The full prompt templates for all built-in evaluators are published in the AWS docs.
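
The GoalSuccessRate aggregation described above is simple enough to reproduce by hand. The sketch below is not AgentCore code; goal_success_rate is a hypothetical helper, and the verdict counts are chosen only because they are consistent with the aggregated values quoted above for groups of 13 and 17 sessions.

```python
def goal_success_rate(verdicts):
    """Map Yes/No judge verdicts to 1.0/0.0 and return the proportion of Yes."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    scores = [1.0 if v == "Yes" else 0.0 for v in verdicts]
    return sum(scores) / len(scores)

# 2 "Yes" out of 13 sessions, and 11 "Yes" out of 17 sessions
print(round(goal_success_rate(["Yes"] * 2 + ["No"] * 11), 3))   # 0.154
print(round(goal_success_rate(["Yes"] * 11 + ["No"] * 6), 3))   # 0.647
```

With groups this small, one session flipping from "No" to "Yes" moves the aggregate by six to eight percentage points.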

The built-in evaluators are domain-agnostic. For a customer support agent, that's not specific enough, but I'll show exactly what I mean once the results are in.

Custom Evaluators

I defined four domain-specific LLM-as-a-judge evaluators, each scoring on a 0.0–1.0 scale:

cs_intent_resolution (Exp 1 — Prompt Strategy): Did the agent correctly identify the customer's underlying intent and fully address it, even when the request was ambiguous?

You are evaluating a SaaS customer support agent.

Assess whether the agent correctly identified what the customer actually needed
and addressed it completely.

HIGH QUALITY:
- Correctly classifies the intent (billing, access, refund, privacy, etc.)
- Asks a targeted clarifying question when the request is genuinely ambiguous
- Does not ask for information it already has
- Resolves the stated problem or provides a clear path to resolution

LOW QUALITY:
- Misidentifies or ignores the customer's actual need
- Responds to a surface request while missing the underlying issue
- Asks unnecessary clarifying questions when intent is already clear
- Leaves the customer without a resolution or next step

Context (customer message and conversation): {context}
Agent response to evaluate: {assistant_turn}

1.00 — Perfect Resolution: Intent correctly identified; response fully addresses the customer's need with a concrete resolution or escalation path
0.75 — Mostly Resolved: Intent correctly identified and mostly addressed, but one minor gap (e.g. missing a follow-up step or detail)
0.50 — Partially Resolved: Intent recognised but only partially addressed, or a correct clarifying question was asked but no resolution yet
0.25 — Misaligned: Agent responded to the wrong intent or provided a solution that does not match the customer's actual problem
0.00 — Failed: Intent completely missed, customer redirected incorrectly, or no actionable response provided

cs_tool_groundedness (Exp 2 — Tool Descriptions): Did the agent select the right tool, cite specific data from the tool result, and avoid making up facts that should have come from a tool call?

You are evaluating a SaaS customer support agent that has access to three tools:
fetch_billing_history, check_refund_status, and validate_account_identity.

Assess whether the agent:
(a) selected the right tool for the customer's issue
(b) cited specific data from the tool result (amounts, dates, statuses, account details)
(c) avoided making up facts that should have come from a tool call

HIGH QUALITY:
- Calls the most appropriate tool for the stated issue
- Cites specific values: '$49.00 duplicate charge on May 1st',
  'account locked after 5 failed attempts', 'refund approved, ETA May 9th'
- Never invents billing amounts, account statuses, or ticket details

LOW QUALITY:
- Calls the wrong tool or skips tool calls entirely
- Responds with generic statements: 'your billing looks fine' without checking
- Fabricates specific data that should have been retrieved

Context (customer message and conversation): {context}
Agent response to evaluate: {assistant_turn}

1.00 — Fully Grounded: Correct tool selected; response cites specific retrieved data; no hallucinated facts
0.75 — Mostly Grounded: Correct tool used; most claims are data-backed but one minor detail is missing or slightly imprecise
0.50 — Partially Grounded: Tool was called but the response mixes real data with generic or inferred statements
0.25 — Wrong Tool / Mostly Generic: Wrong tool called, or the right tool was skipped and the response is largely generic with little specific data
0.00 — Hallucinated / No Tool: No tool called when one was clearly needed, or data cited in the response was fabricated

cs_support_quality (Exp 3 — Model Comparison): Holistic quality scored equally across four dimensions: empathy, clarity, completeness, and tone.

You are evaluating a SaaS customer support agent response.

Assess the response on four dimensions equally:
1. EMPATHY — does it acknowledge the customer's frustration or situation?
2. CLARITY — is the response easy to understand and act on?
3. COMPLETENESS — does it cover all aspects of the customer's issue?
4. TONE — is it professional, warm, and appropriate for support?

HIGH QUALITY:
- Opens with genuine acknowledgment of the customer's experience
- Explains what happened and why in plain language
- Provides concrete next steps with timelines where applicable
- Closes with an offer to help further
- Would not cause the customer to escalate or churn

LOW QUALITY:
- Robotic or dismissive tone
- Incomplete — addresses only part of the issue
- Unclear or filled with jargon
- Leaves the customer without a clear next step

Context (customer message and conversation): {context}
Agent response to evaluate: {assistant_turn}

1.00 — Excellent: Empathetic, clear, complete, and professional. Would fully satisfy the customer and prevent escalation
0.75 — Good: Strong on most dimensions with a minor gap, perhaps slightly terse or missing one follow-up detail
0.50 — Adequate: Technically correct but lacking empathy, clarity, or completeness in a noticeable way
0.25 — Poor: Multiple gaps: robotic tone, incomplete answer, or confusing language that would frustrate the customer
0.00 — Unacceptable: Response would cause the customer to escalate or churn: dismissive, wrong, incoherent, or entirely unhelpful

cs_overall_customer_outcome ⭐ (North star — all experiments): Holistic score across all dimensions simultaneously: resolution, data accuracy, tone, and compliance process. A response that excels on one dimension but fails another (e.g. empathetic but factually wrong) should not score above 0.50.

You are evaluating a SaaS customer support agent on its ultimate business outcome:
did the customer get a good result?

Score based on ALL of the following:
- Was the customer's issue RESOLVED or correctly ESCALATED?
- Did the agent use REAL DATA (no hallucinated amounts, statuses, dates)?
- Was the TONE empathetic enough that the customer would not churn?
- For compliance issues (GDPR, legal): was the correct process followed?

This is a holistic score — a response that excels on one dimension but fails another
(e.g. empathetic but factually wrong) should not score above 0.50.

Context (customer message and conversation): {context}
Agent response to evaluate: {assistant_turn}

1.00 — Outstanding Outcome: Issue fully resolved or correctly escalated; no hallucinated data; empathetic tone; customer would be satisfied
0.75 — Good Outcome: Issue substantially addressed with minor gaps; data accurate; tone acceptable; customer unlikely to escalate
0.50 — Neutral Outcome: Issue partially addressed, or data accurate but tone poor, or tone good but resolution incomplete
0.25 — Poor Outcome: Issue largely unresolved, or significant hallucinated data, or tone likely to frustrate the customer
0.00 — Failed Outcome: Issue not addressed, wrong advice given, compliance process ignored, or response would directly cause churn or harm

An important lesson on evaluator throughput: I originally used Claude Sonnet 4.5 as the judge model. With 4 evaluators firing asynchronously per session, concurrent Converse calls regularly exceeded Sonnet's throughput limit. About half of cs_overall_customer_outcome scores silently failed with ThrottlingException. No error in the logs; scores just didn't appear. The fix was switching to Claude Haiku 4.5, which has roughly 10x higher throughput limits:

EVALUATOR_MODEL_ID = "us.anthropic.claude-haiku-4-5-20251001-v1:0"

Haiku is fast enough for async LLM-as-judge scoring at demo scale. Save Sonnet for the inference model, not the judge.

5. Test Overview

Three sequential experiments, each isolating one variable:

Exp 1 — System prompt: C = Baseline ("Respond immediately with a solution"), T1 = Optimized (Classify → clarify if ambiguous → cite tool data)
Exp 2 — Tool descriptions: C = Vague ("Get data for a user."), T1 = Precise (full typed signatures with return value descriptions)
Exp 3 — Model: C = Claude Haiku 4.5, T1 = Claude Sonnet 4.6

Each experiment ran 30 sessions (10 repeats × 3 ticket types), routed 50/50 via the AgentCore Gateway. Each session was scored by all four evaluators asynchronously.
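
One cheap guard for "each isolating one variable" is to diff the two config bundles before starting a test. The sketch below uses the bundle keys shown earlier (model_id, system_prompt, tool_descriptions); the changed_keys helper and every value in angle brackets are placeholders, not the actual bundles from these experiments.

```python
def changed_keys(control, treatment):
    """Return the sorted top-level keys whose values differ between two bundles."""
    keys = set(control) | set(treatment)
    return sorted(k for k in keys if control.get(k) != treatment.get(k))

# Placeholder bundles shaped like Exp 3: same prompt and tools, different model
exp3_control = {
    "model_id": "<haiku-4.5-model-id>",
    "system_prompt": "<winning prompt from Exp 1>",
    "tool_descriptions": {"fetch_billing_history": "<winning description from Exp 2>"},
}
exp3_treatment = dict(exp3_control, model_id="<sonnet-4.6-model-id>")

print(changed_keys(exp3_control, exp3_treatment))  # ['model_id']
```

If the diff returns more than one key, the A/B result can no longer be attributed to a single change.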

Experiment Pipeline

The traffic in this demo is synthetic, not real user activity. Two different mechanisms were used depending on the phase.

Phase 2 (baseline batch evaluation) uses AgentCore's BatchEvaluationRunner with a simulated customer actor — a Claude Haiku model playing the customer role. Given a character profile and a goal, the actor dynamically responds to whatever the support agent says, producing realistic multi-turn conversations up to 4 turns deep. For example, the GDPR ticket actor is briefed as an EU customer who understands their Article 17 rights and will push back if the agent seems evasive.

Phases 5, 8, and 9 (A/B experiments) use single-turn prompts sent directly to the gateway — one fixed message per ticket type, repeated 10 times each. There is no back-and-forth; each invocation is a complete self-contained session.

In production, you would replace synthetic traffic with real user interactions. The infrastructure — gateway routing, online evaluation, CloudWatch logging — works identically regardless of whether the traffic is real or simulated. The practical advantage of real traffic is that it captures the authentic distribution of how users phrase requests, including edge cases and ambiguous formulations that synthetic prompts don't cover.

The three experiments ran sequentially. AgentCore only allows one active A/B test per gateway at a time. The full phase sequence:

Phase 1: Generate baseline traffic. Invoke the runtime directly across all 3 ticket types.
Phase 2: Baseline batch evaluation. Score the baseline sessions to establish a starting benchmark.
Phase 3: AI prompt recommendation. Point the Recommendations API at baseline traces; get an optimized system prompt.
Phase 4: AI tool description recommendation. Same API, optimized tool descriptions.
Phase 5: Create Exp 1 config bundles, run A/B test (prompt strategy), promote winner.
Phase 6: AI tool description recommendation. Generate improved tool descriptions based on Exp 1 traces.
Phase 7: Create Exp 2 config bundles. Best prompt + vague tools (C) vs best prompt + precise tools (T1).
Phase 8: Run Exp 2 A/B test. Stop Exp 1, create new A/B test, send 30 sessions; promote Exp 2 winner.
Phase 9: Run Exp 3 A/B test. Best prompt + best tools, vary only model_id: Haiku (C) vs Sonnet (T1).

How the Recommendations API works

Phases 3, 4, and 6 each call StartRecommendation with a type parameter that specifies what to optimize. Prompt and tool descriptions are separate calls — there's no combined mode:

dp.start_recommendation(
    type="SYSTEM_PROMPT_RECOMMENDATION",  # or "TOOL_DESCRIPTION_RECOMMENDATION"
    recommendationConfig={
        "systemPromptRecommendationConfig": {
            "systemPrompt": {"text": CURRENT_SYSTEM_PROMPT},
            "agentTraces": {"cloudwatchLogs": {...}},
            "evaluationConfig": {"evaluators": [{"evaluatorArn": "Builtin.GoalSuccessRate"}]},
        }
    },
)

Tool description recommendations use toolDescriptionRecommendationConfig instead and don't accept an evaluationConfig. Prompt recommendations do accept one, but only with a session-level evaluator such as Builtin.GoalSuccessRate, so they can't optimize directly for your custom trace-level north star.

6. Test Results and Analysis

I included Builtin.GoalSuccessRate as a reference signal alongside my custom evaluators. Across all three experiments, it frequently disagreed with cs_overall_customer_outcome, my north star. The pattern was consistent: any ticket that required escalation or a follow-up step — GDPR deletion, account unlock pending identity verification — scored "No" from GoalSuccessRate because the agent didn't complete the action in a single turn. That's the correct process, but GoalSuccessRate doesn't know that.

The gap matters because the Recommendations API only accepts session-level evaluators, which means it optimizes for GoalSuccessRate, not your custom north star. Worth knowing before you treat the API's output as ground truth.

All verdicts below are based on cs_overall_customer_outcome only. Builtin scores are shown for reference.

Experiment 1 — Prompt Strategy

Baseline prompt (C): Answer immediately, use tools when needed, keep responses concise.

Optimized prompt (T1): Classify intent first → ask one clarifying question if ambiguous → call the right tool → cite actual tool data → provide clear next steps.

The optimized prompt was generated by AgentCore's recommendation API after analyzing baseline session traces.

Metric	C mean (n=13)	T1 mean (n=17)	Relative Δ	p	Significant?
cs_overall_customer_outcome ⭐	0.708	0.835	+18.0%	0.059	No
cs_tool_groundedness	0.865	0.985	+13.9%	0.002	Yes
cs_intent_resolution	0.923	0.971	+5.1%	0.324	No
cs_support_quality	0.827	0.838	+1.4%	0.800	No
Builtin.GoalSuccessRate	0.154	0.647	+320.6%	0.002	Yes

Verdict: DIRECTIONAL. T1 leads on cs_overall_customer_outcome (+18.0%, p=0.059), just misses significance at n=30.
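
The post doesn't state how Δ is computed. Reading it as the relative change of the treatment mean over the control mean is an assumption, but it reproduces the reported values up to rounding, as this small check shows (relative_delta_pct is a hypothetical helper).

```python
def relative_delta_pct(control, treatment):
    """Relative change of treatment over control, in percent."""
    return (treatment - control) / control * 100.0

# GoalSuccessRate, taking the reported proportions as 2/13 and 11/17
print(round(relative_delta_pct(2 / 13, 11 / 17), 1))  # 320.6
# North-star metric, from the rounded means in the table
print(round(relative_delta_pct(0.708, 0.835), 1))     # 17.9
```

The second value lands at 17.9 rather than the table's 18.0, presumably because the table was computed from unrounded means. The first shows why a relative Δ can look dramatic on a small base: +320.6% corresponds to an absolute move from roughly 15% to 65% of sessions.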

Analysis: The most revealing number is cs_tool_groundedness, the only statistically significant cs_* result across all three experiments (p=0.002). The baseline prompt was partially answering from the model's own knowledge rather than grounding responses in what tools returned.

The billing sessions show where GoalSuccessRate is actually useful. The baseline called fetch_billing_history, confirmed the duplicate $49 charge, then stopped. cs_overall gave it 0.75 — correct tool, accurate data, reasonable response. GoalSuccessRate gave 0 — the overcharge wasn't resolved, so the user's goal wasn't met. GoalSuccessRate was right. Diagnosing a problem is not the same as fixing it. The optimized prompt's "provide clear next steps" step is what moved the agent from diagnosis to action, and GoalSuccessRate captured that clearly (0.0 → 0.833 on billing).

Experiment 2 — Tool Descriptions

Baseline tool descriptions (C): Vague one-liners that give the LLM almost no signal.

"fetch_billing_history": "Get data for a user."
"check_refund_status":   "Process a request."
"validate_account_identity": "Run a query."

Optimized tool descriptions (T1): Precise typed signatures with return value descriptions.

"fetch_billing_history": (
    "Retrieve complete billing transaction history for a customer by user_id. "
    "Returns itemized charges, payment dates, amounts, and subscription details "
    "for the past 90 days."
)

Both variants use the winning prompt from Exp 1, so only tool selection behavior changes.

Metric	C mean (n=14)	T1 mean (n=16)	Relative Δ	p	Significant?
cs_overall_customer_outcome ⭐	0.804	0.859	+6.9%	0.193	No
cs_tool_groundedness	0.964	1.000	+3.7%	0.141	No
cs_intent_resolution	0.964	0.922	-4.4%	0.271	No
cs_support_quality	0.814	0.797	-2.1%	0.650	No
Builtin.GoalSuccessRate	0.643	0.188	-70.8%	0.006	Yes

Verdict: DIRECTIONAL. T1 leads on cs_overall_customer_outcome (+6.9%, p=0.193), but the difference is not significant.

Analysis: Better descriptions pushed cs_tool_groundedness to a perfect 1.000. The GoalSuccessRate collapse (-70.8%) looks bad but reflects a judge inconsistency, not a real regression. On account_locked sessions, GoalSuccessRate was 0.75 for C and 0 for T1, yet cs_overall was 0.875 for both. Both variants did the same thing: they confirmed the account lock and requested identity verification. GoalSuccessRate sometimes counted that as success, sometimes as failure. cs_overall (+6.9%) is the more consistent signal here.
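One way to surface this kind of disagreement mechanically is to look for ticket types where the north star is essentially tied across variants while the second metric swings widely. The sketch below is a minimal illustration, not part of the AgentCore tooling: the `TicketTypeScores` structure and both thresholds are assumptions, and only the account_locked numbers come from the results above.

```python
from dataclasses import dataclass


@dataclass
class TicketTypeScores:
    """Per-ticket-type mean scores for control (C) and treatment (T1)."""
    ticket_type: str
    north_star_c: float
    north_star_t1: float
    goal_success_c: float
    goal_success_t1: float


def flag_judge_disagreement(rows, same_tol=0.05, diverge_tol=0.5):
    """Return ticket types where the north star is roughly tied across
    variants but GoalSuccessRate differs by at least diverge_tol."""
    flagged = []
    for row in rows:
        north_star_gap = abs(row.north_star_c - row.north_star_t1)
        goal_gap = abs(row.goal_success_c - row.goal_success_t1)
        if north_star_gap <= same_tol and goal_gap >= diverge_tol:
            flagged.append(row.ticket_type)
    return flagged


rows = [
    # Numbers quoted in the analysis above for account_locked sessions.
    TicketTypeScores("account_locked", 0.875, 0.875, 0.75, 0.0),
    # Illustrative row (not from the experiment): both metrics move together.
    TicketTypeScores("example_consistent", 0.70, 0.90, 0.40, 0.80),
]
print(flag_judge_disagreement(rows))  # ['account_locked']
```

A flagged ticket type is a prompt to read the transcripts, not proof that the judge is wrong.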

Experiment 3 — Model Comparison

Control (C): Claude Haiku 4.5 (fast, cost-efficient, ~$1/M input tokens).

Treatment (T1): Claude Sonnet 4.6 (more capable, ~$3/M input tokens).

Both variants use the best prompt and best tool descriptions from Exp 1 and 2. The only difference is model_id in the config bundle. No redeployment needed.
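A quick guard against accidental confounds is to diff the two bundles before starting the test. The bundle layout below is a hypothetical dictionary, not the actual AgentCore config bundle schema, and the model ID strings are placeholders rather than real identifiers.

```python
def changed_keys(control: dict, treatment: dict) -> set:
    """Return the top-level keys whose values differ between two bundles."""
    all_keys = control.keys() | treatment.keys()
    return {key for key in all_keys if control.get(key) != treatment.get(key)}


# Hypothetical bundle layout; every value here is a placeholder.
control_bundle = {
    "model_id": "<haiku-4.5-model-id>",
    "system_prompt": "<winning prompt from Exp 1>",
    "tool_descriptions": "<winning descriptions from Exp 2>",
}
treatment_bundle = {**control_bundle, "model_id": "<sonnet-4.6-model-id>"}

diff = changed_keys(control_bundle, treatment_bundle)
if diff != {"model_id"}:
    raise ValueError(f"Experiment changes more than the model: {sorted(diff)}")
print(sorted(diff))  # ['model_id']
```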

Metric	C (n=14)	T1 (n=16)	Δ	p	Significant?
cs_overall_outcome ⭐	0.875	0.812	-7.1%	0.212	No
cs_tool_groundedness	0.982	1.000	+1.8%	0.317	No
cs_intent_resolution	0.911	0.938	+2.9%	0.537	No
cs_support_quality	0.786	0.844	+7.4%	0.142	No
Builtin.GoalSuccessRate	0.214	0.438	+104.2%	0.193	No

Verdict: INCONCLUSIVE / C holds. Haiku 4.5 leads on cs_overall_customer_outcome across every ticket type. Not statistically significant (p=0.212), but the direction is consistent.

Analysis: Sonnet scored higher on cs_support_quality (+7.4%) — richer responses, better formatted. But on one billing session it scored 0.5 because it required identity verification before committing to the refund, even though fetch_billing_history had already confirmed the duplicate and the user's eligibility. That extra step wasn't warranted by the data. Haiku saw the same tool output and offered the refund directly. On a structured task where the tool result already tells you what to do, Sonnet's tendency to add caution worked against it.
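For anyone who wants to sanity-check the Δ and p columns in tables like these, here is a minimal sketch. The Δ column is the relative change in means. For the p-value I'm using a two-sided permutation test, which is an assumption on my part: nothing above says which test produced the reported p-values, so don't expect this to reproduce them exactly. The per-session scores in the usage example are made up for illustration.

```python
import random


def relative_delta(control_mean: float, treatment_mean: float) -> float:
    """Relative change of the treatment mean versus the control mean."""
    return (treatment_mean - control_mean) / control_mean


def _mean(values):
    return sum(values) / len(values)


def permutation_p_value(control, treatment, n_resamples=10_000, seed=0):
    """Two-sided permutation test on the difference of means.

    Shuffles the pooled scores and counts how often a random split
    produces a gap at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(_mean(treatment) - _mean(control))
    pooled = list(control) + list(treatment)
    n_control = len(control)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        gap = abs(_mean(pooled[n_control:]) - _mean(pooled[:n_control]))
        if gap >= observed - 1e-12:  # small tolerance for float rounding
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_resamples + 1)


# GoalSuccessRate row from Experiment 2: 0.643 -> 0.188.
print(round(100 * relative_delta(0.643, 0.188), 1))  # -70.8

# Made-up per-session scores, for illustration only.
control_scores = [0.75, 1.0, 0.5, 0.75, 1.0, 0.75, 0.5]
treatment_scores = [1.0, 0.75, 1.0, 1.0, 0.75, 1.0, 0.75, 1.0]
print(permutation_p_value(control_scores, treatment_scores))
```

With groups of 14 and 16 sessions, a handful of sessions can swing the result, which is why most of the rows above stay on the wrong side of 0.05.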

7. Key Learnings

What worked well

Online evaluation is fully automatic once configured. After the eval config is set up, scores land in CloudWatch for every gateway session without any extra instrumentation on your end. The only thing you need to do is read the log group.

Config bundles make iteration fast. Swapping model ID, system prompt, and tool descriptions across variants with no container rebuild changes the cost of an experiment from hours to minutes.

Gotchas

The Recommendations API optimizes for a metric you may not be using. The API only accepts session-level evaluators. If your north star is a custom trace-level evaluator (as mine was), the API silently falls back to Builtin.GoalSuccessRate instead. As the results show, those two metrics frequently disagree. Treat AI-generated recommendations as a strong starting point, not a guaranteed improvement. The A/B test is the actual verdict.

Each A/B test is limited to two variants. The gateway supports one control and one treatment per experiment. Testing three or more configurations requires running sequential experiments, which means more time and the risk of confounds between runs.

LLM-as-judge has variance. Consider deterministic evals where possible. For outputs with a clear correct answer, a deterministic check (exact field match, regex, schema validation) is more reliable than asking a judge model. LLM-as-judge is necessary for open-ended quality, but if part of your rubric can be verified programmatically, that part should be.
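As a rough sketch of what the deterministic part of a rubric could look like for the duplicate-charge billing ticket: the payload fields (`ticket_type`, `refund_issued`, `amount_usd`) and the equal weighting of the three checks are hypothetical, chosen only to show schema validation, an exact field match, and a regex side by side.

```python
import re

# Hypothetical structured output the agent is expected to emit.
REQUIRED_FIELDS = {"ticket_type": str, "refund_issued": bool, "amount_usd": float}

# Matches dollar amounts such as "$49" or "$49.00".
AMOUNT_RE = re.compile(r"\$\s?(\d+(?:\.\d{2})?)")


def check_schema(payload: dict) -> bool:
    """Schema validation: every required field is present with the right type."""
    return all(isinstance(payload.get(key), typ) for key, typ in REQUIRED_FIELDS.items())


def mentions_amount(reply: str, expected: float) -> bool:
    """Regex check: the reply text states the expected dollar amount."""
    return any(abs(float(m) - expected) < 0.005 for m in AMOUNT_RE.findall(reply))


def deterministic_score(payload: dict, reply: str, expected_amount: float) -> float:
    """Fraction of deterministic checks passed (equal weights, an assumption)."""
    checks = [
        check_schema(payload),
        payload.get("amount_usd") == expected_amount,  # exact field match
        mentions_amount(reply, expected_amount),
    ]
    return sum(checks) / len(checks)


payload = {"ticket_type": "billing", "refund_issued": True, "amount_usd": 49.0}
reply = "I've refunded the duplicate $49 charge to your original payment method."
print(deterministic_score(payload, reply, 49.0))  # 1.0
```

The same inputs always produce the same score, so any movement in this number between variants comes from the agent, not from the judge.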

Use a smaller, high-throughput model for LLM-as-judge. When multiple evaluators fire concurrently per session, a capable but throughput-limited judge model gets throttled and scores go missing silently: no errors, the scores just don't appear. A faster, cheaper model handles the concurrency, and for judging structured rubrics the quality difference is negligible.
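Because nothing errors out, the only way to notice missing scores is to count them. A minimal sketch, assuming you have already pulled scores out of the log group into a dictionary keyed by session ID (that extraction step, and the session IDs below, are hypothetical):

```python
EXPECTED_EVALUATORS = [
    "cs_overall_customer_outcome",
    "cs_tool_groundedness",
    "cs_intent_resolution",
    "cs_support_quality",
    "Builtin.GoalSuccessRate",
]


def find_missing_scores(expected_evaluators, scores_by_session):
    """Map each session ID to the evaluators that never reported a score."""
    missing = {}
    for session_id, scores in scores_by_session.items():
        absent = sorted(set(expected_evaluators) - set(scores))
        if absent:
            missing[session_id] = absent
    return missing


# Hypothetical sessions: the second one lost two scores.
scores_by_session = {
    "session-001": {name: 0.8 for name in EXPECTED_EVALUATORS},
    "session-002": {
        "cs_overall_customer_outcome": 0.75,
        "cs_tool_groundedness": 1.0,
        "Builtin.GoalSuccessRate": 0.0,
    },
}
print(find_missing_scores(EXPECTED_EVALUATORS, scores_by_session))
# {'session-002': ['cs_intent_resolution', 'cs_support_quality']}
```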

8. Conclusion

Building the agent was the easy part.

The surprising difficulty was eval design. Writing a north star metric that genuinely reflects your business goal, not just something easy to score, takes real iteration. And once you have a north star, designing the supporting diagnostics that explain why it moves is just as hard. The built-in evaluators are a useful reference, but they're domain-agnostic by design. They will disagree with your north star at exactly the moments that matter most.

A few things I'd carry into the next project:

Config bundles are the operational win. Swapping prompts, tool descriptions, and model IDs in production with a 50/50 split, with no container rebuild, changes how you think about iteration. Changes that used to require a deploy cycle become experiments you can start in minutes.

The loop is the product. Baseline → recommend → A/B test → promote → repeat. Every step is already an API call returning structured data. There's nothing stopping an agent from evaluating itself, triggering a new recommendation when scores drop, and starting a test automatically. I'm not quite at fully self-driving agents yet, but the primitives are already here.
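To make the shape of that loop concrete, here is a sketch of one cycle. Every callable is a hypothetical stand-in for whatever API call you wire up at that step; none of these names come from AgentCore, and the threshold is an arbitrary example value.

```python
from typing import Callable


def run_cycle(
    current_score: Callable[[], float],
    recommend: Callable[[], dict],
    ab_test: Callable[[dict], str],
    promote: Callable[[dict], None],
    threshold: float,
) -> str:
    """Run one baseline -> recommend -> A/B test -> promote cycle."""
    if current_score() >= threshold:
        return "hold"  # north star is healthy, nothing to do
    candidate = recommend()  # ask for a new config candidate
    verdict = ab_test(candidate)  # the A/B test is the actual verdict
    if verdict == "treatment":
        promote(candidate)
        return "promoted"
    return "kept_control"


# Stubbed usage: the score dropped below the threshold and the treatment won.
promoted = []
outcome = run_cycle(
    current_score=lambda: 0.62,
    recommend=lambda: {"system_prompt": "<candidate prompt>"},
    ab_test=lambda candidate: "treatment",
    promote=promoted.append,
    threshold=0.80,
)
print(outcome)  # promoted
```

Keeping each step behind a plain callable makes it easy to start with a human approving the promote step and remove that gate later.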

9. Resources

✍️ My Blog

asakohayase.com

🚀 Try It Yourself
GitHub Repository

📚 Learn More
AgentCore Optimization official docs