DEV Community: yoko / Naoki Yokomachi

I Built an Issue-Based Claude Code Plugin "cadenza" for Technical Output Creation

yoko / Naoki Yokomachi — Sat, 09 May 2026 04:03:01 +0000

Introduction

When I write blog posts or give lightning talks, I often feel that my output isn't quite as engaging as I'd like. Reflecting on it briefly, I think the causes are: (1) trying to cram in too much and ending up wide but shallow, (2) staying at the level of mere introduction without doing my own verification or analysis, and (3) the structure lacking dynamic pacing and feeling flat.

To address these issues and make my output more effective, efficient, and sustainable, I created a Claude Code plugin. (The contents are simply a collection of Skills, so it can be used with other agents as well.)

The repository is below. Anyone can install and use it by following the README.md.
https://github.com/n-yokomachi/cadenza

Plugin Overview

The plugin is named cadenza. It comes from the musical term "cadenza" (a free, often improvised solo passage in a concerto where the soloist showcases their virtuosity). The idea is that while the structured workflow is strictly enforced, the user's free will drives the issue framing and verification.

cadenza is a Claude Code plugin that divides technical output creation into 5 phases. Each phase functions as a "gate," preventing rushed writing while encouraging users to clarify their questions.

The fundamental design philosophy is based on issue-first knowledge production theories such as Kazuto Ataka's Issue Driven (Eiji Press, 2010).

The 5-phase structure

Phase	Skill	Role
1. Planning	`/cadenza:issue-finding`	Identify what question is worth answering
2. Design	`/cadenza:issue-decomposition`	Decompose into sub-issues and design the storyline
3. Storyboard	`/cadenza:storyboarding`	Decide the "presentation" (code snippets, diagrams, tables, etc.) for each sub-issue
4. Verification	`/cadenza:analysis-execution`	Implement, measure, and create diagrams according to the storyboard
5. Finishing	`/cadenza:output-crafting`	Generate the Markdown deliverable

In addition, there are 2 skills for review.

/cadenza:output-proofread — AI-driven exhaustive proofreading (fact-checking + language proofreading)
/cadenza:output-review — Author-driven review cycle support

Each skill asks "Shall we proceed to the next phase?" upon completion and, if approved, calls the next skill in a chain. The final deliverable of cadenza is a single general-purpose Markdown file at ./.cadenza/output.md, intended as the basis for blog posts or slide decks.

Each phase functions as a "gate"

A key feature of this plugin is that each phase checks on startup whether its own start conditions are met, that is, whether the upstream phase has been completed. For example, when storyboarding starts up, it checks whether the Phase 2 (Issue Decomposition) section has been written to ./.cadenza/state.md, and if not, it directs the user to run issue-decomposition first.

This structurally prevents phase skipping.
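As a rough illustration of this kind of gate, here is a minimal Python sketch. It is not cadenza's actual implementation (the plugin is a collection of Skills, not Python code), and the "## Phase 2" heading format is an assumption about how state.md is laid out.

```python
from pathlib import Path

# Assumption: each phase writes a section whose heading starts like this.
# The real state.md layout may differ.
PHASE2_HEADING = "## Phase 2"


def section_exists(state_text: str, heading: str) -> bool:
    """Return True if any line of state.md starts with the given heading."""
    return any(line.strip().startswith(heading) for line in state_text.splitlines())


def storyboarding_gate(project_dir: Path) -> str:
    """Decide whether storyboarding may start in the given project directory."""
    state_file = project_dir / ".cadenza" / "state.md"
    # A missing state.md is treated the same as an empty one.
    state_text = state_file.read_text(encoding="utf-8") if state_file.is_file() else ""
    if section_exists(state_text, PHASE2_HEADING):
        return "ok: start storyboarding"
    return "blocked: run /cadenza:issue-decomposition first"
```

Because the check reads only the file on disk, it does not depend on what happened earlier in the conversation.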

Each phase also defines "upstream regression signals." For example, if the storyline starts to drift during the design phase, the plugin guides the user back to the planning phase to reconsider the issue.

Enforcing user accountability

As an aside: even though I'm introducing this AI tool for output creation, I still take a critical stance toward output that is simply written by AI alone.
https://zenn.dev/yokomachi/articles/202512_ai_article_comment

For this reason, cadenza is designed so that output cannot be produced by handing everything to the agent. Mandatory steps such as user confirmation are placed in every phase, so that everything written remains the user's responsibility.

Roughly, the user-led and agent-led steps are as follows.

Phase / Skill	User-led steps	Agent-led steps
Phase 1: issue-finding	Confirm primary information / articulate target audience, problem hypothesis, and post-read change / explicitly consent to the one-line issue	Survey existing content (WebSearch) / 3-condition check
Phase 2: issue-decomposition	Agree on the decomposition pattern / articulate a hypothesis for each sub-issue / agree on the storyline pattern / pin the claim down to one sentence	Storyline validity check
Phase 3: storyboarding	Agree on format and specifications for each sub-issue / specify output style	Storyboard assembly / storyboard review
Phase 4: analysis-execution	Decide whether to regress upstream / run verification / request additional checks beyond the skill's defaults	Classify verification type / pin premises down / run verification / structure results
Phase 5: output-crafting	Select the title (1 from 3 candidates)	Assemble structure / write TL;DR / write each section / final check
output-proofread	Decide which findings to accept and edit accordingly	Technical accuracy check / language proofreading / generate proofreading report
output-review	User-led overall (author re-reads, asks questions, issues editing instructions)	Provide grounded answers to questions / edit only when instructed

The leadership balance in Phase 4 (verification) shifts depending on the type of verification. For verifications like Implementation, Measurement, and Reproduction, the agent handles the baseline planning, while the actual hands-on work — or directing the agent on separate implementation tasks — is user-led. On the other hand, Comparison (research of public information) and Diagramming are AI-led from start to finish.

Ideally I'd want a guard that prevents output unless the user demonstrably understands the verification results (for example, the agent quizzing the user on the results and refusing to create the output unless they answer correctly), but I'll leave that to the user's (my own) conscience.

How to Use

Installation

The cadenza repository itself functions as a Claude Code marketplace. You can use it just by registering the marketplace and installing the plugin.

claude plugin marketplace add github.com/n-yokomachi/cadenza
claude plugin install cadenza@cadenza

After installation, reload the plugin with /reload-plugins, then check that cadenza is installed with /plugins.

Launch

Launch Claude Code in the project directory where you want to write a technical output, and start the flow by running /cadenza:issue-finding.

/cadenza:issue-finding

State management

cadenza creates a ./.cadenza/ directory directly under the working directory and consolidates state and deliverables there.

./.cadenza/
├── state.md          # Consolidates confirmed information from each phase
└── output.md         # Final deliverable

Confirmed information from Phase 1 through Phase 5 is appended to state.md sequentially. Each skill checks that the upstream phase's section exists in state.md before proceeding downstream, so phase skipping is structurally prevented.

If you work in a different project directory, ./.cadenza/ will naturally be a separate one, so you can produce multiple articles in parallel.

Suspend and resume

Since the result of each phase is written out to state.md, you can resume even if the Claude Code session is disconnected. Running /cadenza:<next phase> in a new session loads the contents of state.md and lets you pick up where you left off.
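To make the resume logic concrete, the following Python sketch finds the first phase that has no section in state.md yet and returns the skill to run next. The skill names and phase order come from the table earlier in this article. The "## Phase N" heading format and the function name next_skill are assumptions for illustration only.

```python
from typing import Optional

# Phase order and skill names as listed in the 5-phase table.
PHASES = [
    (1, "/cadenza:issue-finding"),
    (2, "/cadenza:issue-decomposition"),
    (3, "/cadenza:storyboarding"),
    (4, "/cadenza:analysis-execution"),
    (5, "/cadenza:output-crafting"),
]


def next_skill(state_text: str) -> Optional[str]:
    """Return the skill for the first phase without a section, or None if all exist."""
    lines = state_text.splitlines()
    for number, skill in PHASES:
        heading = f"## Phase {number}"  # assumed heading format
        if not any(line.startswith(heading) for line in lines):
            return skill
    return None


# Example: Phases 1 and 2 are recorded, so the next skill is storyboarding.
saved = "## Phase 1: Issue Finding\n...\n## Phase 2: Issue Decomposition\n...\n"
print(next_skill(saved))  # /cadenza:storyboarding
```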

Bonus: Actual Flow

The following walkthrough shows how cadenza actually behaves at each step of output creation, using the sample topic "Quantitative Comparison of AI Coding Agent Free Tiers as of May 2026." The output content itself is also just a sample. I haven't reviewed it properly, so please take it with a grain of salt.

Phase 1: Issue Finding

When /cadenza:issue-finding is launched, it starts with theme selection, hypothesis, and issue framing.

The following 5 steps run in order.

Step	Lead	Interaction details
1	User	Confirm primary information → I've used all 7 tools / no experience getting stuck. Selected "Continue as research/survey by an experienced user"
2	User	Articulate target audience / reader's problem hypothesis / post-read change in 1-2 sentences
3	AI	Survey existing articles via WebSearch → Found that the combination "free-tier-focused × quantitative × Japanese" was not yet covered
4	AI	All 3 condition checks (essential choice / deep hypothesis / answerable) passed
5	User	Finalize the one-line issue and give explicit consent

The final issue was decided as follows.

As of May 2026, how should individual developers wanting to try AI coding agents choose the "tool to try first" that fits their use case from the free tiers of 7 major tools (Claude Code / Codex CLI / Cursor / Gemini CLI / GitHub Copilot / Windsurf / Kiro), and what usage patterns should they consider for paid upgrades?

Phase 2: Issue Decomposition

Launching /cadenza:issue-decomposition enters the process of building the storyline.

The following 5 steps run.

Step	Lead	Interaction details
1	AI proposal → user agreement	Decomposition pattern selection. This time, Compare-Select (Options → Criteria → Decision) was proposed → adopted
2	User (this time: AI draft → user edit)	One-line hypothesis for each sub-issue. Since I had already answered "no experience getting stuck / no unique elements" in Phase 1, the AI provided a draft that I adopted as-is
3	AI proposal → user agreement	Storyline pattern. Sky-Rain-Umbrella, suitable for long-form articles, was proposed → adopted
4	User	Pin the claim down to one sentence. Selected from 3 candidates
5	AI-led	Storyline validity check (5 items). All items passed

The storyline was settled as follows.

Role	Question
☁️ Sky (fact)	What does it look like when each tool's free-tier limits are aligned in common units?
🌧️ Rain (interpretation)	Do the differences in limits stem from each provider's business model? (Does the 3-strategy classification of growth-first / paid conversion / platform infiltration hold up?)
☂️ Umbrella (action)	How should we sort the tools by use case (completion / agent / refactoring) into "try first" vs. "paid required"?

The claim is as follows.

Each provider's free tier reflects 3 strategic patterns (growth-first / paid conversion / platform infiltration), and the right approach is to choose by matching the reader's use case to each provider's strategic pattern.

Phase 3: Storyboarding

Launching /cadenza:storyboarding lets you design "how to show it" (code / diagrams / tables / benchmarks, etc.) and "what needs to be verified vs. what won't be" for each sub-issue.

Step	Lead	Interaction details
1, 2	AI proposal → user agreement	Propose the format (presentation) and specifications (content) for each sub-issue
3	AI proposal → user agreement	Adjust to match the output style. This time I specified "drop flat onto Zenn, no diagrams, tables only"
4	AI-led	Organize all verification items and out-of-scope items (what won't be done)
5	AI-led	Storyboard review (5 items). All items passed

The policy is to express each sub-issue in a single table (Mermaid diagrams are not used).

Sub-issue	Format	Specification overview
☁️ Sky (sub-issue 1)	Comparison table	Rows = 7 tools / Columns = official limits, common-unit conversion (tasks/day), billing trigger, credit card requirement
🌧️ Rain (sub-issue 2)	3-strategy classification table	Rows = 7 tools / Columns = strategy pattern, limit generosity, offering type, basis for judgment
☂️ Umbrella (sub-issue 3)	Decision matrix + descriptive paragraph	A 21-cell grid of rows = 3 use cases × columns = 7 tools, marked with ◎ / ○ / △ / ×, followed by a decision-guidance paragraph

Phase 4: Analysis Execution

In /cadenza:analysis-execution, the actual verification work is executed. This time, since it's just a sample, I scoped everything down to web research only.

Verification execution and main findings

I aggregated the official pricing pages of the 7 tools and the business model background of each company via WebSearch, and defined the baseline unit "1 coding task = 1 file edit + ~5 completions, or 1 agent invocation" based on my own usage experience. Using this, I converted each tool's free tier into "tasks/day."
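The conversion itself is simple arithmetic. The Python sketch below reproduces it for three of the figures quoted in the sample output (2,000 completions/month, 50 requests/month, and 1,000 requests/day at 5 requests per task). The 30-day month and the function names are assumptions made for this illustration.

```python
DAYS_PER_MONTH = 30        # assumption: quotas are spread over a 30-day month
COMPLETIONS_PER_TASK = 5   # from the baseline unit: 1 task = 1 file edit + ~5 completions


def completions_to_tasks_per_day(completions_per_month: float) -> float:
    """Convert a monthly completion quota into coding tasks per day."""
    return completions_per_month / COMPLETIONS_PER_TASK / DAYS_PER_MONTH


def monthly_to_daily(requests_per_month: float) -> float:
    """Spread a monthly request quota evenly over the month."""
    return requests_per_month / DAYS_PER_MONTH


def requests_to_tasks_per_day(requests_per_day: float, requests_per_task: float) -> float:
    """Convert a daily request quota into tasks, given requests consumed per task."""
    return requests_per_day / requests_per_task


print(round(completions_to_tasks_per_day(2000), 1))  # 13.3
print(round(monthly_to_daily(50), 1))                # 1.7
print(requests_to_tasks_per_day(1000, 5))            # 200.0
```

These are rough conversions meant for order-of-magnitude comparison, not precise measurements.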

Main findings:

For Tab completion generosity, Windsurf (unlimited) and GitHub Copilot Free (equivalent to 2,000/month) stand out
For agent-driven use, Gemini CLI (1,000 req/day) is more generous than expected
Claude Code / Codex CLI's free tiers are essentially zero (Pro at $20/month is required for serious use)
Refactoring and large-scale tasks exceed the free tier across all providers; a paid plan is required

Issue reconsideration (partial)

The 3-strategy names (growth-first / paid conversion / platform infiltration) set in Phase 2 turned out to be inaccurate when checked against the data; "platform on-ramp / pure-tool paid funnel / completion-focused giveaway" proved to be a more persuasive classification.

cadenza defines 3 types of upstream regression signals (storyline collapse / new issue discovery / format mismatch). Since this finding didn't match any of them — the underlying structure of three strategy patterns held up, only the labels needed updating — I decided that returning to Phase 2 was unnecessary, and updated only the hypothesis names within Phase 4.

Partial shifts in the storyline driven by data are within the expected range, and cadenza is designed to ask the user whether to handle such shifts by "going all the way back upstream" or "updating in place."
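The decision described above can be summarized in a small Python sketch. The three signal names come from the text. The function name and return strings are hypothetical, and in cadenza the final choice is always confirmed with the user rather than decided automatically.

```python
from enum import Enum
from typing import Optional


class RegressionSignal(Enum):
    """The 3 upstream regression signals named in this article."""
    STORYLINE_COLLAPSE = "storyline collapse"
    NEW_ISSUE_DISCOVERY = "new issue discovery"
    FORMAT_MISMATCH = "format mismatch"


def suggest_handling(signal: Optional[RegressionSignal]) -> str:
    """Suggest how to handle a finding; the user still makes the final call."""
    if signal is None:
        # No signal matched, e.g. only the labels changed while the structure held up.
        return "update in place"
    return f"consider going back upstream ({signal.value})"


# The relabeling of the 3 strategies matched none of the signals.
print(suggest_handling(None))  # update in place
```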

Materials to hand off to Phase 5

Comparison table (7 tools × 4 columns: official limits / task conversion / billing trigger / credit-card requirement)
3-strategy classification table (revised, with the basis for each tool's classification)
Decision matrix (3 use cases × 7 tools = 21 cells of ◎/○/△/× judgments)
Updated claim candidate (version with the 3-strategy labels replaced)

That wraps up Phase 4. Next is Phase 5 (Output Crafting), which generates ./.cadenza/output.md.

Phase 5: Output Crafting

/cadenza:output-crafting writes out the final Markdown to ./.cadenza/output.md based on the confirmed information from Phases 1 through 4.

Step	Lead	Interaction details
1	AI-led	Build the structural skeleton (Title / TL;DR / Background / one section per sub-issue / Conclusion / References)
2	AI proposal → user selection	Propose 3 title candidates; user selects one
3	AI-led	TL;DR / opening (claim + target audience + post-read change in 3-5 lines)
4	AI-led	Write each sub-issue as its own section. Use the visuals (tables) decided in the Phase 3 storyboard as-is, maintaining storyboard fidelity
5	AI-led	Final check on code / diagrams / personal info. All 7 final confirmation items passed

Output body

The generated output.md is reproduced verbatim below as raw Markdown source.

# AI Coding Agent Free Tiers Reflect 3 Strategies: Sorting 7 Tools by Use Case

## TL;DR

When the free tiers of 7 major AI coding agent tools (Claude Code / Codex CLI / Cursor / Gemini CLI / GitHub Copilot / Windsurf / Kiro) are aligned on a common "tasks/day" unit, each provider's limit design reflects one of 3 business strategies (platform on-ramp / pure-tool paid funnel / completion-focused giveaway). For individual developers, the realistic answer is not to commit to a single tool, but to combine the best free tier for each use case (completion-focused / agent-driven / refactoring or large-scale).

The target audience is individual developers and small-team developers who want to try AI coding agents and quantitatively compare multiple tools before going paid. After reading, you'll be able to gauge the practical value of each tool's free tier in common units, and sort tools that fit your use case into "try first" vs. "free isn't enough — paid required."

## Stance of this article

This comparison is a research / survey-style write-up by an experienced user. The author has actually used all 7 tools, but has no particular experience of "getting stuck" with the free tiers. The purpose is to organize information so that readers about to try them can quantitatively grasp the limits. Read this not as "stories of when I got stuck," but as "a map for those about to try them out."

## Defining the common unit: "1 coding task"

Each provider publishes limits in different units (requests / completions / tokens / premium requests / messages, etc.), so they can't be compared apples-to-apples as-is. This article normalizes them against the following baseline unit.

> **1 coding task = 1 file edit + about 5 completions, or 1 agent invocation**

Here, "completion" means inline completion (short suggestions accepted via Tab), and "agent invocation" treats chat-based instructions or Edit / Cascade-style operations spanning multiple files as 1 unit. Conversion accuracy is roughly ±50% — the goal is approximate comparison of practical usage, not precision.

## Each tool's free tier limits (as of May 2026)

The following organizes each provider's official pricing page side-by-side.

| Tool | Official limit | Common unit conversion (tasks/day) | Billing trigger | Credit card required |
|--------|-----------|------------------------|------------|-----------|
| Claude Code | Pro at $20/month (annual contract: $17/month) required; not usable on the free tier. Some sources indicate that new API accounts receive about $5 in API credit | About 5 tasks/day (rough estimate, dividing $5 of API credit at Claude Sonnet 4.x mid-tier rates, assuming 5,000-10,000 tokens per task, spread over 30 days) | Pro subscription / API credit depletion | Required for both API and Pro subscription |
| Codex CLI | Codex is included with ChatGPT Free / Go, but specific usage limits for Free / Go must be checked individually in the ChatGPT usage dashboard (not listed in the official pricing table) | Not evaluable (limits aren't public, so can't be quantified) | ChatGPT Plus at $20/month | Optional |
| Cursor (Hobby) | Per publicly available info, 2,000 completions + 50 slow premium model requests/month (not directly extractable from the official pricing page; sourced from review articles) | About 13 tasks/day (completion) + about 1.7 premium/day | Monthly quota depletion → Pro at $20 | Not required |
| Gemini CLI | 1,000 requests/day, 60 requests/minute, about 250,000 tokens/minute (Flash model-centric; Pro model is limited. Specific values are checked individually in the Google AI Studio dashboard) | About 200 tasks/day (1 task = 5 requests) | Daily limit / per-minute token limit | Not required (Google account only) |
| GitHub Copilot Free | Officially listed as 2,000 completions/month + 50 agent mode or chat requests/month | About 13 tasks/day (completion) + about 1.7 agent / chat requests/day | Monthly quota depletion → Pro at $10 | Not required |
| Windsurf (Free) | Public info indicates Tab completion is exempt from the usage quota. Advanced features like Cascade are quota-based (not directly extractable from the official pricing page; sourced from review articles) | Completion: effectively unlimited / advanced: a few times/day | Advanced feature use → Pro at $20 | Not required |
| Kiro (Free) | Officially listed as 50 credits/month + an initial 500-credit bonus (must be used within 30 days), with overage at $0.04/credit | Steady-state: about 1.7 credits/day / Initial bonus: about 17 credits/day (vibe mode 1 = 1 credit, spec mode consumes several credits) | Credit depletion → Pro at $20 / additional $0.04/credit | Not required |

> **Note**: The above values are aggregated from each provider's official pricing pages and review articles as of May 2026. Only GitHub Copilot Free and Kiro publish specific free-tier values directly in their official pricing tables. The others (Claude Code / Codex / Cursor / Gemini CLI / Windsurf) require checking separate pages or usage dashboards, or rely on external review articles for the published limits. Please verify each provider's latest official information before actually trying them.

The values span a **5-10× range**. The differences are too large to lump together as the same "free tier."

Laid out side-by-side, you can see significant variation in the generosity of free tiers. Anthropic and OpenAI are essentially zero, Google and Microsoft (GitHub) are more generous, Windsurf is unlimited only for completion, and Amazon's Kiro takes an unusual credit-based approach. This isn't a technical constraint — it's **a reflection of each provider's business model**.

## 3-strategy classification of free tiers

> **Note**: The 3-strategy classification below is this article's original framing, not an established industry taxonomy. It groups providers into 3 categories from this article's perspective, based on strategies that each provider has officially expressed (see "judgment basis" below for citations). Readers are welcome to reclassify along other axes (IDE-embedded / CLI / feature maturity, etc.).

When the free-tier design of the 7 tools is viewed from a strategic perspective using this article's framing, the following 3 patterns emerge.

| Pattern | Name | Characteristics | Tools |
|------|------|------|----------|
| A | Platform on-ramp | Generous free tier draws users into the provider's other services (GitHub / Google Cloud / AWS). Their main business is cloud/platform, and the coding agent serves as an entry point | GitHub Copilot, Gemini CLI, Kiro |
| B | Pure-tool paid funnel | Thin free tier allows trial use, but full use requires a paid plan. The tool itself is the main business, with subscriptions as the primary revenue source | Cursor, Claude Code, Codex CLI |
| C | Completion-focused giveaway | Tab completion is fully released as unlimited; agent and advanced features are gated behind paid tiers. A strategic loss-leader aimed at maximizing IDE adoption | Windsurf |

### Each company's judgment basis (official source citation)

**A. Platform on-ramp**

- **GitHub Copilot**: Microsoft CEO Satya Nadella stated, "Any per user business of ours, whether it's productivity or coding or security, will become a per user and usage business," positioning Copilot as part of Microsoft's company-wide per-user + usage strategy ([GitHub Blog: usage-based billing](https://github.blog/news-insights/company-news/github-copilot-is-moving-to-usage-based-billing/)). Copilot Free can be read as the entry point, aiming for stickiness to GitHub accounts and repositories plus monetization through Enterprise integration.
- **Gemini CLI**: Google's official blog announcement explicitly states, "industry's largest allowance with 60 model requests per minute and 1,000 requests per day at no charge," and lays out a tiered funnel where additional quota moves to "usage-based billing with Google AI Studio or Vertex AI key" ([Google Blog: Introducing Gemini CLI](https://blog.google/technology/developers/introducing-gemini-cli-open-source-ai-agent/)). It's designed as the entry point for Cloud migration: Free → Standard → Vertex AI Enterprise.
- **Kiro**: AWS positions Kiro as the successor to Amazon Q Developer and has announced that no new Q Developer Free Tier accounts will be created going forward ([AWS Blog: Q Developer end-of-support](https://aws.amazon.com/blogs/devops/amazon-q-developer-end-of-support-announcement/)). Signing in with an AWS Builder ID enables direct integration with Amazon Q and the broader AWS ecosystem ([Kiro Authentication docs](https://kiro.dev/docs/getting-started/authentication/)), which reads as a play for AWS developer acquisition.

**B. Pure-tool paid funnel**

- **Cursor**: Anysphere has reached **$1B in annualized revenue with over 1 million paying users** on Cursor alone, with the $20 Pro subscription as its revenue mainstay. Hobby Free is positioned as an evaluation tier: "a real comparison... most developers who give it a serious two-week test either upgrade to Pro or decide the tool is not for them" (interpretation based on public benchmarks + review articles).
- **Claude Code**: Anthropic Head of Growth Amol Avasare publicly explained the bundling strategy with Pro/Max: "Max launched a year prior, it didn't include Claude Code, and the company later bundled Claude Code into Max after it took off" ([reported by The Register](https://www.theregister.com/2026/04/22/anthropic_removes_claude_code_pro/)). Claude Code is positioned as a lever for boosting subscription engagement.
- **Codex CLI**: OpenAI's official blog states, "Codex is included with ChatGPT Plus, Pro, Business, and Enterprise plans—no separate subscription needed" ([OpenAI: Introducing Codex](https://openai.com/index/introducing-codex/)), and signing in to Plus / Pro also grants free API credit ($5/$50). The tight Free / Go limits can be read as a deliberate push toward the ChatGPT subscription.

**C. Completion-focused giveaway**

- **Windsurf**: Cognition (Devin's parent company) acquired Windsurf for about $250M in December 2025. Cognition CEO Scott Wu laid out the strategy in an official blog post: "start by integrating Cognition's autonomous AI-powered engineer Devin into Windsurf's IDE," and "developers can plan tasks in Windsurf and launch a team of Devins" ([Cognition Blog: Windsurf acquisition](https://cognition.ai/blog/windsurf)). The Free plan's unlimited Tab completion accelerates IDE adoption, while monetization comes from advanced features (Cascade / Devin integration).

Once you see the strategy patterns, you realize that even similar numbers like 2,000 completions/month **play opposite roles depending on the strategy**. For GitHub Copilot it acts as "an entry point for user retention (a steady allowance to keep users in the GitHub ecosystem long-term)," while for Cursor it acts as "a cutoff line before paid (a mechanism to push serious users toward Pro at $20/month)." That's the contrast.

## Sorting by use case × tool

Based on the strategy patterns, the table below judges how far the 7 tools can be pushed across 3 typical use cases.

| Use case | Claude Code | Codex CLI | Cursor | Gemini CLI | GitHub Copilot | Windsurf | Kiro |
|------|-------------|-----------|--------|------------|----------------|----------|------|
| Completion-focused (Tab completion as main) | × | × | △ | ○ | ◎ | ◎ | △ |
| Agent-driven (delegating tasks) | △ | × | △ | ◎ | △ | △ | ○ |
| Refactoring / large-scale (multi-file editing) | × | × | × | △ | × | △ | × |

Legend: ◎ try first / ○ worth trying / △ paid required / × skip

### Reading by use case

**Completion-focused users** (typing in the IDE while heavily using Tab completion) have **two clear winners: GitHub Copilot Free and Windsurf**. Copilot Free offers 2,000 completions/month, while Windsurf has fully unlimited Tab. Copilot brings strong integration with the GitHub ecosystem, while Windsurf stands out for the polish of the IDE itself. Gemini CLI's 1,000 req/day is hard to dismiss for completion use, but being CLI-based, it's a fundamentally different experience from in-IDE completion. Cursor and Kiro deplete quota quickly under completion-focused use, and Claude Code and Codex CLI don't target completion as their primary use case (both lean agent).

**Agent-driven users** (delegating tasks via chat, auto-editing multiple files) will find that, perhaps surprisingly, **Gemini CLI Free is the strongest option**. Even though it's Flash-model-centric, 1,000 requests/day is plenty to try agent tasks, and being usable with just a Google account is a big plus. Kiro also suits agent delegation in spec mode, but burns through credits quickly under its credit-based system. Claude Code has high-quality agent design, but with a free tier of essentially zero, Pro is required for serious use. Cursor's 50 premium requests/month is insufficient for trying agent-driven workflows, and Codex CLI's free-tier limits aren't public, so it falls outside the scope of evaluation here.

**Refactoring / large-scale users** (structural changes spanning multiple files, heavy editing) unfortunately face **structurally insufficient free tiers across the board**. Cursor's 50 premium runs out in a few days, as do Kiro's 50 credits and Copilot Free's 50 agent / chat requests. Windsurf's Cascade is also limited to a few uses per day on the free tier. Gemini CLI's 1,000 requests/day is theoretically generous, but keeping multi-file editing within 5 requests per task is practically infeasible, and the quota gets consumed regardless. **For serious use of AI agents in this category, a paid upgrade to one of the tools should be assumed from the outset.**

## Conclusion

The fact that free-tier strategies divide into 3 patterns gives the chooser a clear guideline: **"match your use case to each provider's strategy pattern."**

- **Completion-focused trial** → Pattern A (GitHub Copilot Free / Windsurf) is enough. You get the benefits of the platform on-ramp while staying within the free tier
- **Agent-driven trial** → Short-term verification with Pattern A (Gemini CLI Free, or Kiro's initial bonus)
- **Refactoring / large-scale serious use** → A paid upgrade to one of the Pattern B tools (Cursor Pro / Claude Code Pro) is required

In other words, **individual developers shouldn't commit to a single tool but should combine free tiers by use case** — that's the realistic answer as of May 2026. Once you read each provider's strategy against your own use case, the right timing for going paid also becomes apparent on its own.

## Aside: OSS BYO API tools

While outside the scope of this article, the following OSS tools are worth knowing about as a separate category — **"the tool itself is $0 since it's OSS, but you pay separately for LLM API usage"**.

- **Cline**: A VS Code extension; an OSS tool whose popularity is rapidly rising. You bring your own API keys for Anthropic / OpenAI / Google, etc.
- **Aider**: A terminal-CLI-based OSS tool, aimed at power users
- **Continue**: OSS that runs as an extension for VS Code / JetBrains

These follow a "tool free, API at cost" model, so they don't fit on this article's "quantitative comparison of free tiers" axis. They are, however, an option for developers who don't want to pay a fixed monthly API fee or who prefer to manage billing themselves.

## References

- [Claude Code Pricing (Anthropic)](https://claude.com/pricing)
- [Codex Pricing (OpenAI Developers)](https://developers.openai.com/codex/pricing)
- [Cursor Models & Pricing](https://cursor.com/docs/models-and-pricing)
- [Gemini CLI Quotas (Google AI)](https://ai.google.dev/gemini-api/docs/rate-limits)
- [GitHub Copilot Plans](https://github.com/features/copilot/plans)
- [Windsurf Pricing](https://windsurf.com/pricing)
- [Kiro Pricing](https://kiro.dev/pricing/)

Improving and Validating Multi-Agent Prompts with Bedrock AgentCore Optimization

yoko / Naoki Yokomachi — Mon, 04 May 2026 07:22:57 +0000

This article is an AI-assisted translation of a Japanese technical article.

Introduction

In April 2026, Amazon Bedrock AgentCore added a new capability called Optimization, which takes real agent traces and proposes prompt improvements based on them.
https://aws.amazon.com/about-aws/whats-new/2026/05/bedrock-agentcore-optimization-preview/

In this article, I apply AgentCore Optimization to a Strands Agents-as-Tools setup (a main agent that wraps sub-agents as @tools) and walk through what actually happens. What kind of improvements does Recommendations propose? Does the change hold up under real traffic in an A/B test? And how does it feel to put this into operation? Those are the questions I tried to answer.

Inside AgentCore Optimization

Let me start by laying out what Optimization actually consists of.

The three capabilities

Capability	Role
Recommendations	Takes real trace logs plus a target Evaluator as input, and has an AI generate improved versions of system prompts and tool descriptions. Instead of you iterating manually, Recommendations does the iteration for you.
Configuration bundles	Externalizes prompts and tool descriptions out of source code and version-manages them on the AgentCore side. You can change agent behavior just by swapping the bundled values — no code change, no redeploy. Also used to run two settings side by side in the A/B test described below.
A/B testing	Routes real traffic via AgentCore Gateway between two variants (control / treatment), scoring each side with an Evaluator. You can compare which prompt actually performs better in production, with statistical backing.

The official docs describe these three as a "continuous improvement loop": Recommendations generates an improved version → Configuration bundles version-controls it → A/B testing validates the effect under real traffic. The three capabilities are designed to cycle.

Prerequisites

Following the official docs, the setup requires:

An agent built with Strands Agents
Deployed to AgentCore Runtime with Observability enabled
CloudWatch Transaction Search enabled

Building the test setup

For the experiment I built a multi-agent setup with Strands Agents — a main agent that delegates to specialized sub-agents for weather and news, wired together with the Agents-as-Tools pattern.

The repo:
https://github.com/n-yokomachi/agentcore-optimization-lab

Configuration bundle structure

To make a setup A/B-testable, prompts and tool descriptions need to be externalized in configBundles inside agentcore.json. The bundle structure I ended up with:

{
  "components": {
    "{{runtime:agentsAsToolsLab}}": {
      "configuration": {
        "systemPrompt": "You are an assistant that answers questions about weather and news.",
        "weather_agent": "Get weather",
        "news_agent": "Get news"
      }
    }
  }
}

A note on the prompts: I deliberately wrote them quite carelessly so the impact of Recommendations would be easy to see.

{{runtime:agentsAsToolsLab}} is an agentcore CLI placeholder; it gets resolved to the actual Runtime ARN at deploy time.

One quirk: the tool descriptions (weather_agent / news_agent) sit directly under configuration as flat siblings. This shape matches how the Recommendations API resolves the tool description path. The default structure that the AgentCore CLI generates with --with-config-bundle (which nests them under toolDescriptions) didn't resolve correctly for tool description Recommendations, so I flattened it and that worked.

Adding the bundle definition and deploying are both done through the AgentCore CLI:

agentcore add config-bundle
agentcore deploy

Wiring the bundle into the agent

To inject bundle values into the Runtime dynamically, we use Strands' hook mechanism. The ConfigBundleHook class overrides the main agent's system prompt at BeforeInvocationEvent and each tool's description at BeforeToolCallEvent.

class ConfigBundleHook(HookProvider):
    def register_hooks(self, registry: HookRegistry, **kwargs: Any) -> None:
        registry.add_callback(BeforeInvocationEvent, self._inject_system_prompt)
        registry.add_callback(BeforeToolCallEvent, self._override_tool_description)

    def _inject_system_prompt(self, event: BeforeInvocationEvent) -> None:
        config = BedrockAgentCoreContext.get_config_bundle()
        event.agent.system_prompt = config.get("systemPrompt", DEFAULT_SYSTEM_PROMPT)

    def _override_tool_description(self, event: BeforeToolCallEvent) -> None:
        config = BedrockAgentCoreContext.get_config_bundle()
        override = config.get(event.tool_use["name"])
        if override and event.selected_tool:
            spec = event.selected_tool.tool_spec
            if spec and "description" in spec:
                spec["description"] = override

This Hook class is based on the template the AgentCore CLI generates with --with-config-bundle. Because I flattened the bundle structure, the tool description lookup (config.get(event.tool_use["name"])) is simpler than the generated default.
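
To make the difference concrete, here is a minimal sketch of the two lookups on plain dictionaries. The nested shape is an assumption based only on the description above (tool descriptions grouped under a toolDescriptions key), and resolve_flat / resolve_nested are hypothetical helper names, not part of the AgentCore CLI template.

```python
from typing import Any, Optional

# Flat shape used in this article: tool descriptions are siblings of systemPrompt.
FLAT_BUNDLE: dict[str, Any] = {
    "systemPrompt": "You are an assistant that answers questions about weather and news.",
    "weather_agent": "Get weather",
    "news_agent": "Get news",
}

# Assumed nested shape: the same values grouped under "toolDescriptions".
NESTED_BUNDLE: dict[str, Any] = {
    "systemPrompt": FLAT_BUNDLE["systemPrompt"],
    "toolDescriptions": {"weather_agent": "Get weather", "news_agent": "Get news"},
}


def resolve_flat(config: dict[str, Any], tool_name: str) -> Optional[str]:
    """Flat lookup: the tool name is the key itself."""
    return config.get(tool_name)


def resolve_nested(config: dict[str, Any], tool_name: str) -> Optional[str]:
    """Nested lookup: one extra hop through the grouping key."""
    return config.get("toolDescriptions", {}).get(tool_name)


print(resolve_flat(FLAT_BUNDLE, "weather_agent"))
print(resolve_nested(NESTED_BUNDLE, "news_agent"))
```

One trade-off of the flat layout is that tool names share a namespace with keys like systemPrompt, so a tool could not be named systemPrompt without a collision.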

Recommendations and A/B test run

For the experiment I generated trace logs from 8 English queries × 5 rounds = 40 sessions, then ran both system-prompt and tool-description Recommendations against the agent.

agentcore run recommendation --type system-prompt
agentcore run recommendation --type tool-description

Recommendations on the system prompt

The original system prompt and the Recommendations output are both visible in the AWS Console. The improved prompt now factors in tool calling — phrases like "call both tools in parallel" and "use news_agent to find related news" appear in the suggestion.

Recommendations on the tool descriptions

The before/after for tool descriptions is visible in the same way. The descriptions are filled out more thoroughly, and they explicitly call out the possibility of parallel use with the other sub-agent — phrases like "Often used alongside news_agent" and "Often used alongside weather_agent".

A/B test for effect validation

To verify that the Recommendations output actually moves the needle, I ran an A/B test as well.

Control variant (C): bundle version with the human-authored prompt and tool descriptions
Treatment variant (T1): bundle version with the Recommendations output applied
Traffic split: 50/50 (sticky session-to-variant assignment by session ID)
Online Evaluator: Builtin.GoalSuccessRate
Traffic volume: 8 queries × 5 rounds = 40 sessions
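
How the Gateway implements sticky assignment internally is not covered here, so the following is only an illustrative sketch of one common approach: hashing the session ID into a bucket so that the same session always lands on the same variant. The function name assign_variant and the hashing scheme are assumptions for illustration, not the Gateway's actual algorithm.

```python
import hashlib
import uuid


def assign_variant(session_id: str, treatment_percent: int = 50) -> str:
    """Deterministically map a session ID to "C" (control) or "T1" (treatment)."""
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    # Reduce the first 8 bytes of the hash to a bucket in the range 0..99.
    bucket = int.from_bytes(digest[:8], "big") % 100
    return "T1" if bucket < treatment_percent else "C"


# Simulate 40 fresh sessions, as in the experiment.
counts = {"C": 0, "T1": 0}
for _ in range(40):
    counts[assign_variant(str(uuid.uuid4()))] += 1
print(counts)
```

With a scheme like this, a 50/50 split only holds on average. A small run of 40 sessions will usually come out slightly uneven.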

To run the A/B test you need an HTTP Gateway and an Online evaluation config. The HTTP Gateway has to be added by hand to httpGateways in agentcore.json (no add subcommand seems to exist for it at the moment). The Online evaluation config is added with agentcore add online-eval.

"httpGateways": [
  {
    "name": "agentsAsToolsLabGateway",
    "runtimeRef": "agentsAsToolsLab"
  }
]

agentcore add online-eval

Then add the A/B test itself and register everything in one go with deploy.

agentcore add ab-test
agentcore deploy

Traffic generation is done by POSTing to the AgentCore Gateway URL with SigV4 auth. agentcore invoke hits the Runtime directly, so for the A/B test we have to go through the Gateway URL. Here's the script I used:

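import json
import urllib.request
import uuid

from botocore.auth import SigV4Auth
from botocore.awsrequest import AWSRequest
from botocore.session import Session  # assumption: boto3.Session exposes the same get_credentials()

GATEWAY_URL = "https://agentsastoolslabgateway-XXXXX.gateway.bedrock-agentcore.us-west-2.amazonaws.com/agentsAsToolsLab/invocations"
credentials = Session().get_credentials()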

def invoke_one(query: str):
    sid = str(uuid.uuid4())
    payload = json.dumps({"prompt": query}).encode()
    req = AWSRequest(method="POST", url=GATEWAY_URL, data=payload, headers={
        "Content-Type": "application/json",
        "X-Amzn-Bedrock-AgentCore-Runtime-Session-Id": sid,
    })
    SigV4Auth(credentials, "bedrock-agentcore", "us-west-2").add_auth(req)
    http_req = urllib.request.Request(GATEWAY_URL, data=payload, headers=dict(req.headers), method="POST")
    with urllib.request.urlopen(http_req, timeout=180) as resp:
        return sid, resp.status
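
The loop that drives invoke_one is not shown above. A minimal sketch of it could look like the following. The query strings are placeholders (the actual 8 queries are not listed in this article), and run_rounds and fake_invoke are hypothetical names. The invoke function is passed in as a parameter so the loop can be dry-run without touching the Gateway.

```python
from typing import Callable

# Placeholder queries: the actual 8 queries used in the experiment are not listed.
QUERIES = [f"placeholder query {i}" for i in range(1, 9)]


def run_rounds(
    queries: list[str],
    rounds: int,
    invoke: Callable[[str], tuple[str, int]],
) -> list[tuple[str, int]]:
    """Call invoke once per query per round and collect (session_id, status)."""
    results = []
    for _ in range(rounds):
        for query in queries:
            sid, status = invoke(query)
            results.append((sid, status))
    return results


def fake_invoke(query: str) -> tuple[str, int]:
    """Stand-in for invoke_one so this sketch runs without AWS access."""
    return ("dry-run-session", 200)


results = run_rounds(QUERIES, rounds=5, invoke=fake_invoke)
print(len(results))  # 8 queries x 5 rounds = 40
```

For the real run, pass invoke_one in place of fake_invoke. Each call then creates a new session ID, so each of the 40 requests is assigned to a variant independently.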

The A/B test results are visible in the AWS Console under "Bedrock AgentCore > Optimizations > A/B Tests".

Here are the numbers:

Metric	Value	Meaning
Sessions routed to control	21	Number of sessions routed to the control variant
Sessions routed to variant	19	Number of sessions routed to the treatment variant
Control average (Goal Success Rate)	0.48	Mean Goal Success Rate of the control variant
Variant average	0.53	Mean Goal Success Rate of the treatment variant
Variant improvement	Not significant: +10.5% (p=0.95)	Treatment shows a +10.5% improvement over control, but not statistically significant (p>0.05)

Directionally, the treatment is ahead by +5pt absolute (= +10.5% relative). So the Recommendations output is moving things in the right direction, but with only 40 sessions there isn't enough data to claim statistical significance. Since the original goal — confirming Recommendations actually works end to end — is met, and going further would start to hurt my wallet, I'm cutting the experiment off here.
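
As a rough sanity check on why 40 sessions is not enough, here is a sketch of a two-proportion z-test and a textbook sample size estimate using only the standard library. It assumes each session's Goal Success Rate score is binary (success or failure). The counts 10/21 and 10/19 are inferred from the rounded averages in the table and are an assumption, not numbers read from the console. The console may use a different test, so the p-value here is not expected to match the one it reports.

```python
from math import ceil, sqrt
from statistics import NormalDist


def two_proportion_p_value(success_a: int, n_a: int, success_b: int, n_b: int) -> float:
    """Two-sided p-value for the difference between two success rates."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    standard_error = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


def sessions_per_variant(p_control: float, p_treatment: float,
                         alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate sessions needed per variant to detect the given difference."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p_treatment - p_control) ** 2)


# Assumed counts: 10 of 21 control sessions and 10 of 19 treatment sessions succeeded.
p_value = two_proportion_p_value(10, 21, 10, 19)
relative_lift = (10 / 19) / (10 / 21) - 1
print(f"relative lift: {relative_lift:+.1%}, p-value: {p_value:.2f}")
print("sessions needed per variant:", sessions_per_variant(0.48, 0.53))
```

Under these assumptions the p-value is far above 0.05, and detecting a 5-point difference at the usual thresholds would take well over a thousand sessions per variant.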

Where to draw the line with Recommendations

This is just from this experiment, but if I sort the improvement patterns Recommendations produced, I think the natural division of labor between Recommendations and the developer looks something like this:

Owner	Domain
Recommendations	Mention of parallel calls, naming of related elements, multilingual support callouts, response format directives, safety mechanisms, proactive behavior
Developer	Domain context, business logic, data interpretation policy

So when you put Recommendations into your operational loop, the parts you (the human) still need to write are:

Domain-specific context (specific customer business processes, external API specs, etc.)
Business logic (output constraints, compliance, billing rules, etc.)
Data interpretation policy (e.g. "when this field is empty, treat it as X")

For everything else — the "general patterns of good prompt writing" — it might be reasonable to let Recommendations handle it. That's the takeaway for me from this experiment.

Wrap-up

So that was a hands-on look at AgentCore Optimization on an Agents-as-Tools setup. The takeaways:

Recommendations extracts general patterns like parallel invocation, tangential topic handling, response format, and safety mechanisms
A boundary becomes visible between what humans should write (domain context, business logic) and what we can hand off to Recommendations
The A/B testing capability and its outputs are confirmed working, but at this experiment's scale the sample size isn't enough for significance

That's it. I hope this is useful for anyone planning to try Optimization themselves.

Bonus: Japanese system prompts getting misflagged as prompt injection?

When I ran the system prompt Recommendation with a Japanese prompt like --inline "あなたは天気とニュースに答えるアシスタント。", I got this error:

[ValidationException] The provided content was detected as unsafe by 
prompt attack protection. Please review your system prompt and try again.

After narrowing it down:

Fails regardless of Evaluator (Builtin.GoalSuccessRate / Builtin.Helpfulness)
Fails whether via bundle or inline mode
Fails even when I rewrite the Japanese prompt in different ways
Works as soon as I switch to English

So the only difference that flips the outcome is the language of the prompt. Tool description Recommendations work fine in Japanese, by the way.

For that reason, all the experiments in this article ended up being run with English prompts.

References

https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/optimization.html
https://github.com/aws/agentcore-cli
https://aws.amazon.com/about-aws/whats-new/2026/05/bedrock-agentcore-optimization-preview/

Building an AWS Cost Visualization Workflow with Strands Agents Skills and AgentCore Code Interpreter

yoko / Naoki Yokomachi — Fri, 03 Apr 2026 01:26:25 +0000

Introduction

I'm currently developing a personal AI agent called TONaRi. It also has an X (Twitter) account where it posts tech news and more.
https://x.com/tonari_with

The agent's core architecture is built on Strands Agents + Amazon Bedrock AgentCore.

In this article, I combined AgentCore Code Interpreter with Strands Agents' Agent Skills to implement a workflow that retrieves AWS cost data and generates chart images using code. Check out the video demo below:

https://x.com/_cityside/status/2035339843014987845

Although this was an addition to an existing web application codebase, I hope it also serves as a useful reference for building something similar from scratch.

Here are the main technologies used:

AgentCore Code Interpreter: One of Amazon Bedrock AgentCore's building blocks that executes code in a sandboxed environment
Agent Skills (SKILL.md): Externalized prompts that are loaded on demand
Cost Explorer API: An API for retrieving AWS cost data, called from an agent tool
S3: Stores chart images generated by Code Interpreter, served to the frontend via Presigned URLs

Amazon Bedrock AgentCore Code Interpreter

Amazon Bedrock AgentCore Code Interpreter (hereafter "Code Interpreter") is one of the building blocks that allows agents hosted on AgentCore Runtime to safely execute code in a sandboxed environment.
https://aws.amazon.com/blogs/machine-learning/introducing-the-amazon-bedrock-agentcore-code-interpreter/

Key features include:

Code execution in a sandboxed environment
Pre-installed libraries such as pandas, numpy, and matplotlib
In addition to the default access-restricted environment, you can create user-defined environments with public internet access or VPC connectivity

In this project, I use Code Interpreter to have the agent dynamically generate chart images from data using matplotlib.

Strands Agents Skills

Agent Skills is a mechanism originally proposed by Anthropic. In a nutshell, it works like this: you define procedures you want the agent to execute in Markdown files (similar to system prompts), then inject only the metadata into the system prompt. The agent dynamically loads the Skill files based on the metadata and executes the procedures. This approach helps reduce token consumption and prevents context pollution.
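
The "metadata first, body on demand" idea can be sketched in plain Python as follows. This is a simplified illustration, not how Strands Agents or any other SDK actually implements it. parse_frontmatter and metadata_prompt are hypothetical helpers, and the parser handles only single-line key: value pairs.

```python
def parse_frontmatter(skill_md: str) -> tuple[dict[str, str], str]:
    """Split a SKILL.md string into (metadata, body)."""
    lines = skill_md.strip().splitlines()
    if len(lines) < 2 or lines[0].strip() != "---":
        return {}, skill_md
    metadata: dict[str, str] = {}
    for index, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            # Everything after the closing delimiter is the skill body.
            return metadata, "\n".join(lines[index + 1:]).strip()
        if ":" not in line:
            continue
        key, _, value = line.partition(":")
        metadata[key.strip()] = value.strip().strip('"')
    return {}, skill_md  # no closing delimiter: treat everything as body


def metadata_prompt(skills: list[dict[str, str]]) -> str:
    """Build the short listing that would be injected into the system prompt."""
    return "\n".join(f"- {s['name']}: {s['description']}" for s in skills)


SKILL_MD = """---
name: aws-cost
description: Analyze and visualize AWS cost data
---

# AWS Cost Analysis Skill
(full procedure, loaded only when the skill is selected)
"""

meta, body = parse_frontmatter(SKILL_MD)
print(metadata_prompt([meta]))
```

Only the one-line listing is sent with every request. The body stays out of the context until the agent decides the skill is relevant, which is where the token savings come from.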

As of March 2026, Agent Skills are now available in Strands Agents as well:
https://strandsagents.com/docs/user-guide/concepts/plugins/skills/

For this project, I defined the following workflow as a Skill:

1. Call the Cost Explorer API tool to retrieve cost data for the user-specified period
2. Call the cost visualization tool
   - 2-1. Convert cost data into a chart image using Code Interpreter
   - 2-2. Upload the image to S3
   - 2-3. Return the S3 presigned URL

Processing Flow

Here's a simplified overview of the processing flow:

User: "Show me this month's AWS costs"
  ↓
Main Agent
  ├─ ① skills tool: Load skill
  ├─ ② get_aws_cost tool: Call Cost Explorer API
  └─ ③ execute_python tool
     └─ ③-1 Generate matplotlib chart via Code Interpreter
        ③-2 Upload to S3
        ③-3 Return presigned URL
  ↓
Frontend: Detect S3 image URL in text → Display inline in chat

Implementation

get_aws_cost: Cost Data Retrieval Tool

The AWS cost retrieval tool is defined as an agent tool using the @tool decorator. The logic is separated from the Code Interpreter chart image generation.

import json

import boto3
from strands import tool

_ce_client = boto3.client("ce", region_name="ap-northeast-1")

@tool
def get_aws_cost(
    period: str = "monthly",
    months: int = 1,
    group_by_service: bool = True,
) -> str:
    """Retrieve AWS cost data from Cost Explorer.

    Use this tool to fetch cost data. Then pass the result to execute_python
    to create matplotlib charts for visualization.

    Args:
        period: Granularity - "monthly" or "daily".
        months: Number of months to look back (default: 1, max: 6).
        group_by_service: If True, break down costs by AWS service.

    Returns:
        JSON string with cost data.
    """
    ce = _ce_client
    # ...
    response = ce.get_cost_and_usage(
        TimePeriod={"Start": start, "End": end},
        Granularity="MONTHLY",
        Metrics=["UnblendedCost"],
        GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
    )
    return json.dumps({"data": data})

execute_python: Code Execution Tool

Similarly, Code Interpreter code execution is defined as an agent tool using the @tool decorator. To reliably capture matplotlib figures, the tool automatically injects capture code before and after the agent-generated code.

import os

import boto3
from bedrock_agentcore.tools.code_interpreter_client import code_session
from strands import tool

CODE_INTERPRETER_REGION = os.getenv("CODE_INTERPRETER_REGION", "ap-northeast-1")
# A bucket name has no sensible default; set CODE_INTERPRETER_OUTPUT_BUCKET explicitly.
OUTPUT_BUCKET = os.getenv("CODE_INTERPRETER_OUTPUT_BUCKET", "")
_s3_client = boto3.client("s3", region_name=os.getenv("AWS_REGION", "ap-northeast-1"))

@tool
def execute_python(code: str, description: str = "") -> str:
    """Execute Python code in a sandboxed environment. Use this to run data analysis,
    generate charts with matplotlib, or perform calculations.

    Available libraries: pandas, numpy, matplotlib, json, datetime.
    Use ONLY matplotlib for plotting (not seaborn).
    Use English for all chart labels and titles (Japanese fonts are not available).

    IMPORTANT for chart generation:
    - Do NOT call plt.savefig() — images are auto-captured from open figures.
    - Do NOT call plt.close() — closing figures prevents image capture.
    - Just create figures with plt.subplots() and leave them open.
    - Do NOT use boto3 — the sandbox has no AWS credentials.

    Args:
        code: Python code to execute.
        description: Optional description of what the code does.

    Returns:
        JSON string with execution results including stdout, stderr, and image URLs.
    """
    # Automatically inject matplotlib image capture code
    img_code = f"""
import matplotlib
matplotlib.use('Agg')
{code}
import matplotlib.pyplot as plt, base64, io, json as _json
_imgs = []
for _i in plt.get_fignums():
    _b = io.BytesIO()
    plt.figure(_i).savefig(_b, format='png', bbox_inches='tight', dpi=100)
    _b.seek(0)
    _imgs.append({{'i': _i, 'd': base64.b64encode(_b.read()).decode()}})
if _imgs:
    print('_IMG_' + _json.dumps(_imgs) + '_END_')
plt.close('all')
"""
    with code_session(CODE_INTERPRETER_REGION) as code_client:
        response = code_client.invoke("executeCode", {
            "code": img_code,
            "language": "python",
            "clearContext": False,
        })
        # Extract images from stdout using _IMG_..._END_ markers
        # Upload to S3 and return presigned URLs
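
The two commented steps at the end could be filled in roughly like this. It is a sketch under assumptions: extract_images and upload_and_presign are hypothetical helper names, the stdout string is assumed to have already been pulled out of the Code Interpreter response (that part is not shown here), and the S3 client is passed in so the logic can be exercised without AWS access.

```python
import base64
import json
import re

# Matches the marker pair printed by the injected capture code.
_IMG_PATTERN = re.compile(r"_IMG_(.*?)_END_", re.DOTALL)


def extract_images(stdout: str) -> tuple[str, list[bytes]]:
    """Return stdout without the marker payload, plus the decoded PNG bytes."""
    images: list[bytes] = []
    for payload in _IMG_PATTERN.findall(stdout):
        for item in json.loads(payload):
            images.append(base64.b64decode(item["d"]))
    cleaned = _IMG_PATTERN.sub("", stdout).strip()
    return cleaned, images


def upload_and_presign(s3_client, bucket: str, key: str, png: bytes,
                       expires_in: int = 3600) -> str:
    """Upload one image and return a presigned GET URL for it."""
    s3_client.put_object(Bucket=bucket, Key=key, Body=png, ContentType="image/png")
    return s3_client.generate_presigned_url(
        "get_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=expires_in,
    )


# Dry run with a fabricated payload in place of real sandbox output.
sample_payload = json.dumps([{"i": 1, "d": base64.b64encode(b"fake-png").decode()}])
text, pngs = extract_images(f"total: 12.34\n_IMG_{sample_payload}_END_\n")
print(text, len(pngs))
```

Stripping the marker payload from stdout before returning it to the agent keeps the base64 data out of the model's context.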

Creating the SKILL.md

Now that the tools are defined, we create the Agent Skill that defines how to call them. The directory structure looks like this:

agentcore/
├── skills/
│   └── aws-cost/
│       └── SKILL.md
├── app.py
└── ...

The SKILL.md file contains YAML frontmatter and a Markdown-formatted prompt:

---
name: aws-cost
description: "Analyze and visualize AWS cost data using get_aws_cost for data retrieval and execute_python for matplotlib chart generation"
allowed-tools: get_aws_cost execute_python
---

# AWS Cost Analysis Skill

Two-step process: fetch data with `get_aws_cost`,
then visualize with `execute_python`.

## Critical Rules

- **NEVER call plt.savefig()** — images are auto-captured from open figures.
- **NEVER call plt.close()** — closing figures prevents image capture.
- **Use English for ALL text** in charts — Japanese fonts are unavailable.

## Step 1: Fetch Data
(How to call get_aws_cost)

## Step 2: Visualize
(matplotlib code template)

Integrating with the Agent

The tools are passed via the tools parameter, and the Skill is initialized with the AgentSkills plugin and passed to the agent.

from strands import Agent, AgentSkills
from src.agent.code_interpreter import execute_python
from src.agent.aws_cost import get_aws_cost

# Initialize the Skills plugin
skills_plugin = AgentSkills(skills="./skills/")

# Create the agent
agent = Agent(
    tools=[*other_tools, execute_python, get_aws_cost],
    plugins=[skills_plugin],
    system_prompt=system_prompt,
)

I'll skip the frontend implementation details, but essentially it detects image URLs in the agent's response and automatically fetches and displays them inline.

Demo

Here's what it looks like when the skill is actually running. Since the chart-generating code is dynamically created by the agent, the output varies depending on how you phrase your instructions.

Here's the video demo again from the beginning of the article:

https://x.com/_cityside/status/2035339843014987845

Wrapping Up

That's how I implemented an AWS cost charting feature using Agent Skills + Code Interpreter. (Admittedly, you could just look at the Cost Explorer console for the same information, but this was more of a proof of concept...)

In this implementation, I used the default Code Interpreter tool, which restricts public internet access. However, by using a user-defined Code Interpreter tool, you could enable more flexible code execution. I'd love to explore the possibilities further.

Using OpenRouter's OpenAI-Compatible Models (Grok 4.1 Fast) with Strands Agents

yoko / Naoki Yokomachi — Sun, 15 Mar 2026 02:05:39 +0000

This article is an AI-assisted translation of a Japanese technical article.

Introduction

I'm building a personal AI agent called TONaRi ("tonari" means "next to" in Japanese — named with the idea of an AI that stands next to you and supports your daily life). It's built with Strands Agents + Amazon Bedrock AgentCore, with a VRM-powered 3D avatar frontend using AITuberKit.

In a previous article, I wrote about cost reduction through sub-agent splitting.
https://dev.to/yokomachi/28-tool-definitions-cutting-ai-agent-costs-with-sub-agent-splitting-4dbp

This time, I took cost reduction a step further by making it possible to switch the LLM itself to Grok 4.1 Fast via OpenRouter.

Cost Comparison

Let's compare the costs between Claude Haiku 4.5 (Amazon Bedrock), which I had been using as the main model, and Grok 4.1 Fast (OpenRouter), the new alternative.

	Claude Haiku 4.5 (Bedrock)	Grok 4.1 Fast (OpenRouter)
Input	$1.10 / 1M tokens	$0.20 / 1M tokens
Output	$5.50 / 1M tokens	$0.50 / 1M tokens

That's a significant difference. As I mentioned in the previous article, LLM per-token pricing is by far the biggest cost driver, so reducing the unit price — while maintaining an acceptable quality balance — has the greatest impact.
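
To see what the table means for an actual bill, here is a small sketch that applies both price lists to the same workload. The prices come from the table above. The workload (10M input tokens and 1M output tokens per month) is a made-up example, and the dictionary keys are just labels, not model IDs.

```python
# USD per 1M tokens, taken from the comparison table.
PRICES_USD_PER_MILLION = {
    "Claude Haiku 4.5 (Bedrock)": {"input": 1.10, "output": 5.50},
    "Grok 4.1 Fast (OpenRouter)": {"input": 0.20, "output": 0.50},
}


def monthly_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for the given token volumes."""
    price = PRICES_USD_PER_MILLION[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000


# Hypothetical workload: 10M input tokens and 1M output tokens per month.
for model in PRICES_USD_PER_MILLION:
    print(f"{model}: ${monthly_cost(model, 10_000_000, 1_000_000):.2f}/month")
```

Because agent workloads tend to be input-heavy, the input price difference dominates the result in a scenario like this one.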

Switching Models in Strands Agents

Strands Agents is an open-source agent SDK provided by AWS, and it supports models beyond Bedrock. Using the OpenAIModel class, you can directly use models from any service that provides an OpenAI-compatible API, such as OpenRouter. If you need broader provider support, LiteLLMModel is also an option. Since Grok 4.1 Fast is OpenAI-compatible, we use the OpenAIModel class directly.

Creating an OpenAIModel

First, add the openai dependency.

dependencies = [
    "strands-agents>=1.23.0",
    "openai>=1.0.0",
    # ...
]

Then create the model instance via OpenRouter.

from strands.models.openai import OpenAIModel

model = OpenAIModel(
    client_args={
        "api_key": "your-openrouter-api-key",
        "base_url": "https://openrouter.ai/api/v1",
    },
    model_id="x-ai/grok-4.1-fast",
)

The created model can be passed to an Agent with the exact same interface as a Bedrock model.

from strands import Agent

agent = Agent(
    model=model,  # Works the same whether BedrockModel or OpenAIModel
    system_prompt="You are a personal AI assistant.",
    tools=my_tools,
)
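
The api_key in the OpenAIModel example above is a placeholder string. One way to avoid hard-coding it is to build client_args from an environment variable, as in this sketch. The variable name OPENROUTER_API_KEY and the helper openrouter_client_args are assumptions for illustration.

```python
import os
from typing import Mapping


def openrouter_client_args(env: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Build the client_args dict for OpenRouter from environment variables."""
    api_key = env.get("OPENROUTER_API_KEY")  # hypothetical variable name
    if not api_key:
        raise RuntimeError("OPENROUTER_API_KEY is not set")
    return {"api_key": api_key, "base_url": "https://openrouter.ai/api/v1"}


# Usage with the OpenAIModel shown earlier (left as a comment so this sketch
# runs without the SDK installed):
# model = OpenAIModel(client_args=openrouter_client_args(), model_id="x-ai/grok-4.1-fast")
print(openrouter_client_args({"OPENROUTER_API_KEY": "example-key"})["base_url"])
```

Failing fast when the key is missing makes a misconfigured deployment show up at startup rather than on the first model call.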

Wrap Up

So I switched the model used for everyday conversations to Grok 4.1 Fast, and my impression is that quality isn't a major issue for casual conversation. However, application-specific conversation tags (this AI agent uses tags like [happy] or [bow] to trigger facial expressions and motions) sometimes get ignored or misinterpreted by the model, so that still needs tuning.

I also had concerns about tool calling via AgentCore Gateway, but it's been working surprisingly well without any major adjustments.

I'll continue monitoring and consider trying other models or implementing model-specific routing if needed.

28 TOOL DEFINITIONS! — Cutting AI Agent Costs with Sub-Agent Splitting

yoko / Naoki Yokomachi — Sat, 07 Mar 2026 12:53:02 +0000

This article is an AI-assisted translation of a Japanese technical article.

Introduction

As I kept adding tools to make my personal AI agent more useful for daily tasks, the input tokens per API call ballooned — and so did the cost.

It's lower now, but the projection was heading toward $120/month

In this article, I'll walk through the input token bloat problem caused by too many tools and how I tackled it by splitting into sub-agents.

Architecture Overview

Here's a high-level look at TONaRi's architecture:

Frontend (Next.js + VRM 3D Avatar)
  → Next.js API Route
    → AgentCore Runtime (Strands Agent)
      → AgentCore Gateway → Lambda functions (tools)
      → AgentCore Memory (STM/LTM)

The agent runs as a container deployed on Bedrock AgentCore Runtime. External tools are implemented as Lambda functions accessed through AgentCore Gateway. Adding a new tool is as simple as writing a Lambda function and registering it as a Gateway target.

All the Tools

AgentCore Gateway lets you expose Lambda functions as agent tools.

from strands.tools.mcp import MCPClient
from mcp_proxy_for_aws.client import aws_iam_streamablehttp_client

def create_mcp_client(gateway_url: str, region: str) -> MCPClient:
    def create_transport():
        return aws_iam_streamablehttp_client(
            endpoint=gateway_url,
            aws_region=region,
            aws_service="bedrock-agentcore",
        )
    return MCPClient(create_transport)

Here are all the tools I've connected:

Domain	Tools	Count
Task Management	List, Add, Complete, Update	4
Calendar	List events, Check availability, Create, Update, Delete, Suggest schedule	6
Gmail	Search, Get, Create draft, Archive	4
Notion	Search pages, Get page, Create, Update, Query DB, Get DB	6
Twitter	Get today's tweets, Post	2
Diary	Save, Get	2
Date Utils	Get current datetime, Calculate date, List date range	3
Web Search	Web search	1
Total		28

Each tool can be called individually, but the real power is chaining. For example, saying "Search for a recipe, save the bookmark to Notion, create a shopping list, and add grocery shopping to my tasks" triggers:

Web search tool finds a recipe
Saves the URL to a Notion bookmark page
Creates a shopping list from the recipe and saves it to a Notion memo page
Adds a grocery shopping task to TONaRi's task list

The AI agent sits between tools and interprets vague user requests to orchestrate across them — this is the most useful aspect of using an AI agent day-to-day.

The Input Token Explosion

Behind the convenience, costs were quietly piling up. When calling the Bedrock API, input tokens consist of four main components:

System prompt: Agent character settings, behavior rules
Tool definitions: Name, description, and JSON schema for every tool
Long-term memory (LTM): Episodes and facts extracted from past conversations
Conversation history (STM): Current session content

The biggest culprit was tool definitions. I had Claude Code calculate it — the 28 tools directly connected to the agent consumed about 5,000 tokens.

Breaking Down the Numbers

Here's a rough breakdown of input tokens per call for the monolithic agent:

Component	Estimated Tokens
System prompt (character + all domain rules)	~3,500
Tool definitions (28 tools × schema)	~5,000
LTM search results	~1,500
Conversation history (10 turns)	Variable (~5,000–30,000)

The system prompt, tools, and LTM are essentially fixed costs sent with every message — that's 10,000 tokens per call. With about 100 calls per day, the monthly fixed cost alone is:

10,000 tokens × 100 calls/day × 30 days = 30,000,000 tokens/month

At Claude Haiku 4.5's Bedrock input token rate ($1.10/1M tokens for Japan cross-region inference), that's $33/month in fixed costs alone. As a solo developer, paying ~$33/month in fixed overhead was painful, especially since about half of it goes to tool definitions that might not even be used on a given call.
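
The same arithmetic, broken down per component, can be written as a short script. All numbers are the estimates from the table and text above. Only the variable and function names are made up.

```python
# Estimated fixed input tokens per call, from the breakdown table.
FIXED_TOKENS_PER_CALL = {
    "system_prompt": 3_500,
    "tool_definitions": 5_000,
    "ltm_results": 1_500,
}
CALLS_PER_DAY = 100
DAYS_PER_MONTH = 30
USD_PER_MILLION_INPUT_TOKENS = 1.10


def monthly_cost_usd(tokens_per_call: int) -> float:
    """Monthly input token cost for a component sent on every call."""
    tokens_per_month = tokens_per_call * CALLS_PER_DAY * DAYS_PER_MONTH
    return tokens_per_month * USD_PER_MILLION_INPUT_TOKENS / 1_000_000


for name, tokens in FIXED_TOKENS_PER_CALL.items():
    print(f"{name}: ${monthly_cost_usd(tokens):.2f}/month")
print(f"total: ${monthly_cost_usd(sum(FIXED_TOKENS_PER_CALL.values())):.2f}/month")
```

Split this way, tool definitions account for $16.50 of the $33, which is why they are the first thing to attack.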

Splitting into Sub-Agents

To reduce the number of tool definitions the main agent loads, I created domain-specific sub-agents and had the main agent call them via the @tool decorator.

[Before: Monolithic]
Main Agent
├── System prompt (all domain rules)
└── 28 tools ← sent every single call

[After: Sub-agent split]
Main Agent
├── System prompt (generic rules only)
├── DateTool (3 tools)      ← frequently used, kept in main
├── TavilySearch (1 tool)   ← same
├── task_agent      ← defined as @tool (4 tools)
├── calendar_agent  ← defined as @tool (6 tools)
├── gmail_agent     ← defined as @tool (4 tools)
├── notion_agent    ← defined as @tool (6 tools)
├── diary_agent     ← defined as @tool (2 tools)
├── briefing_agent  ← defined as @tool (multi-domain tools)
└── twitter_agent   ← defined as @tool (2 tools)

Sub-Agent Implementation

With Strands Agents' @tool decorator, you can define a sub-agent as a tool for the main agent:

from strands import Agent, tool
from strands.models import BedrockModel

@tool
def calendar_agent(request: str) -> str:
    """Google Calendar sub-agent. Handles listing, availability checks, creating, updating, and deleting events.

    Args:
        request: A request related to the owner's calendar
    """
    try:
        agent = Agent(
            model=BedrockModel(
                model_id="jp.anthropic.claude-haiku-4-5-20251001-v1:0",
                region_name="ap-northeast-1",
                streaming=True,
            ),
            system_prompt="You are a Google Calendar specialist assistant...",
            tools=_calendar_tools,  # calendar tools only
            callback_handler=None,
        )
        result = agent(request)
        return str(result)
    except Exception as e:
        return f"Calendar operation error: {e}"

System Prompt Reduction

By splitting sub-agents by domain, domain-specific rules moved from the main system prompt to each sub-agent's prompt.

Before: Main prompt contained all domain rules

- Calendar rules (duplicate checks, deletion confirmation, etc.)
- Gmail rules (draft only, date search caveats, etc.)
- Notion rules (property formats, database mappings, etc.)
- Briefing procedure (5 detailed sections)
- Diary creation flow (interview → generate → save)
- ...

After: Main prompt only has sub-agent list and delegation rules

## Sub-agent Coordination
- task_agent: Task management (list, add, complete, update)
- calendar_agent: Google Calendar (get, create, update, delete events)
- gmail_agent: Gmail (search, get, create drafts)
- ...

### Delegation Rules
- Describe requests to sub-agents in detail
- Rephrase sub-agent results in your own words

This reduced the system prompt from ~7,400 characters to ~3,800 characters.

Cost Reduction

Comparing the main agent's fixed cost per call:

Component	Before (Monolithic)	After (Sub-agent split)
System prompt	~3,500 tokens	~2,000 tokens
Tool definitions	28 tools (~5,000 tokens)	12 tools (~2,500 tokens)
LTM search results	~1,500 tokens	~1,500 tokens
Fixed cost total	~10,000 tokens	~6,000 tokens

Those 4,000 tokens weren't deleted — they moved to the sub-agents. Here's the per-call input token cost for each sub-agent:

Sub-agent	Prompt	Tool Defs	Request Message	Total
task_agent	~400	~400	~100	~900
calendar_agent	~400	~850	~100	~1,350
gmail_agent	~400	~400	~100	~900
notion_agent	~400	~700	~100	~1,200
briefing_agent	~500	~2,500	~100	~3,100
diary_agent	~400	~200	~100	~700
twitter_agent	~400	~150	~100	~650

If you just add everything up, the "After" total is actually higher. But the key insight is reducing tokens sent on every call. For example, the briefing_agent loads Gmail, Calendar, and task tools all at once and has complex rules — it's expensive, but it only runs once a day. Before, all those definitions were loaded on every single call. Now they only load when needed.

Monthly Cost Impact

Estimating with ~100 calls per day:

[Main agent fixed cost reduction (every call)]
  4,000 tokens/call × 100 calls/day × 30 days = 12,000,000 tokens/month

[Sub-agent additional cost (only when invoked)]
  Assuming ~60% of calls (60/day) trigger one sub-agent
  Average 900 tokens/call × 60 calls/day × 30 days = 1,620,000 tokens/month
  *briefing_agent (~3,100 tokens) runs once/day, calculated separately
  briefing: 3,100 tokens × 30 days = 93,000 tokens/month

[Net savings]
  12,000,000 - 1,620,000 - 93,000 = 10,287,000 tokens/month

At Claude Haiku 4.5's Bedrock input token rate ($1.10/1M tokens, Japan cross-region inference), that's roughly $11/month in input token savings.
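
The estimate above can be parameterized so the assumptions are easy to vary. This is a sketch using the same figures. The function name and parameter names are made up, and like the estimate itself it counts input tokens only.

```python
def net_monthly_savings(
    calls_per_day: int = 100,
    sub_agent_calls_per_day: int = 60,
    days: int = 30,
    main_saving_per_call: int = 4_000,
    avg_sub_agent_tokens: int = 900,
    briefing_tokens_per_day: int = 3_100,
) -> int:
    """Net input tokens saved per month after paying for sub-agent calls."""
    saved = main_saving_per_call * calls_per_day * days
    added = avg_sub_agent_tokens * sub_agent_calls_per_day * days
    briefing = briefing_tokens_per_day * days
    return saved - added - briefing


USD_PER_MILLION_INPUT_TOKENS = 1.10

tokens = net_monthly_savings()
print(f"{tokens:,} tokens/month")
print(f"${tokens * USD_PER_MILLION_INPUT_TOKENS / 1_000_000:.2f}/month")
```

Even if every one of the 100 daily calls triggered a sub-agent (sub_agent_calls_per_day=100), the same formula still gives a net saving of about 9.2M tokens per month, so the result is not very sensitive to the 60% assumption.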

Other Optimizations

I also made several complementary changes:

Conversation Window Reduction

Changed SlidingWindowConversationManager's window_size from 15 to 10.
Savings: $3–5/month

LTM Search Result Reduction

Reduced top_k across LTM strategies (total 18 → 10 results).
Savings: $2–3/month

Lightweight Pipeline Agents

For automated tasks like scheduled tweets and news collection, I was using the full main agent. I replaced these with lightweight dedicated agents that share memory but carry only minimal tools.
Savings: $2–3/month

Total Savings

Optimization	Est. Monthly Savings
Sub-agent splitting	$11
Conversation window reduction	$3–5
LTM result reduction	$2–3
Lightweight pipeline agents	$2–3
Total	$18–22

Wrapping Up

So I managed to cut costs to some degree, but it's still expensive...!
If you have any clever cost reduction ideas, I'd love to hear them.

(Fortunately I was recently selected as an AWS Community Builder, so I'm hoping for some AWS credits!)

Controlling VRM Character Motions for an AI Agent on the Web

yoko / Naoki Yokomachi — Sat, 21 Feb 2026 13:00:22 +0000

This article is an AI-assisted translation of a Japanese technical article.
https://zenn.dev/yokomachi/articles/202602_vrm-motion-control-on-web

Introduction

I'm currently working on a personal AI agent project and decided to use a 3D model as the user interface.
Since I didn't have the knowledge to build everything from scratch, I leveraged AITuberKit, an OSS project I'd been aware of for a while, to quickly set up the frontend.

Tech Stack

VRM model creation: VRoid Studio
Web frontend: Next.js, TypeScript
VRM rendering & control: three-vrm (v3.0.0), Three.js
Base kit: AITuberKit
Agent implementation: Strands Agents, Amazon Bedrock AgentCore (not covered in detail in this article)

VRM and VRoid Studio

VRM is a file format designed for 3D avatars.
With VRoid Studio, you can create characters and export them in VRM format without any 3D modeling knowledge.
In my case, my only prior experience was creating characters in video games, but I was able to create two models (male and female) in about an hour — that's how easy it is.

https://x.com/_cityside/status/2019742015617994773

What AITuberKit Can Do

AITuberKit is an open-source toolkit that displays VRM models in a web browser and bundles features such as LLM-powered chat, facial expression control, and speech synthesis.

Here are some of the key features AITuberKit provides:

VRM model display, facial expression control, and lip-sync
LLM-powered chatbot functionality
Speech synthesis API integration
YouTube streaming integration
Multimodal input
etc.

For my project, since I'm building it as a personal AI agent, I'm using AITuberKit's base features like VRM display control and chatbot functionality while adding heavy customizations on top.

Implementing Motion Control

Here's where we get to the main topic.
AITuberKit supports switching facial expressions (smile, angry face, etc.) out of the box, so I decided to implement additional body motions (bowing, extending a hand, etc.).

https://x.com/_cityside/status/2016874430056845502

Architecture Overview

Here's the overall picture of the motion control system:

LLM Response
  ↓ Streaming parser
  ├─ [emotion] Emotion tag → ExpressionController → Facial expression control
  └─ [bow/present] Motion tag → GestureController → Bone control
                                         ↑
                                    EmoteController (conflict resolution)

The EmoteController sits between facial expressions and motions to handle conflicts between them.

Motion Definitions

Motions are implemented by defining bone rotations as keyframes.

Here's an example definition for a bow:

// src/features/emoteController/gestureController.ts
interface BoneRotation {
  bone: VRMHumanBoneName
  rotation: THREE.Quaternion
}

interface GestureKeyframe {
  duration: number
  bones: BoneRotation[]
}

interface GestureDefinition {
  keyframes: GestureKeyframe[]
  holdDuration: number
  closeEyes?: boolean
}

For the bow motion, three bones — spine, chest, and neck — are each rotated forward to create a more natural-looking bow rather than simply bending at the waist.
The arm bones are also adjusted to achieve a natural posture.

// src/features/emoteController/gestureController.ts
this._gestures.set('bow', {
  keyframes: [
    {
      duration: 1.0,
      bones: [
        {
          bone: 'spine',
          rotation: new THREE.Quaternion().setFromEuler(
            new THREE.Euler(0.25, 0, 0)
          ),
        },
        {
          bone: 'chest',
          rotation: new THREE.Quaternion().setFromEuler(
            new THREE.Euler(0.15, 0, 0)
          ),
        },
        {
          bone: 'neck',
          rotation: new THREE.Quaternion().setFromEuler(
            new THREE.Euler(0.12, 0, 0)
          ),
        },
        // Arm bones are also adjusted (omitted)
      ],
    },
  ],
  holdDuration: 1.0,
  closeEyes: true, // Close eyes during the bow
})
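
To get a feel for what these numbers mean, the sketch below works with the X-axis angles directly instead of THREE.Quaternion. It assumes the three values are local rotations on bones parented in sequence (spine, then chest, then neck), so the bends add up, and it assumes plain linear interpolation over the keyframe duration; the actual controller may ease differently. The helper names are hypothetical.

```typescript
interface BoneAngle {
  bone: string;
  angle: number; // radians around the X axis
}

// Target angles copied from the bow keyframe above.
const bowTargets: BoneAngle[] = [
  { bone: "spine", angle: 0.25 },
  { bone: "chest", angle: 0.15 },
  { bone: "neck", angle: 0.12 },
];
const keyframeDuration = 1.0; // seconds, from the definition above

// Hypothetical helper: angle of each bone `elapsedSeconds` into the keyframe.
// For rotations around a single axis, scaling the angle by t gives the same
// result as a slerp from the rest pose to the target quaternion.
function bowAnglesAt(elapsedSeconds: number): BoneAngle[] {
  const t = Math.min(Math.max(elapsedSeconds / keyframeDuration, 0), 1);
  return bowTargets.map(({ bone, angle }) => ({ bone, angle: angle * t }));
}

// Sum of the per-bone bends, converted to degrees.
function totalBendDegrees(angles: BoneAngle[]): number {
  const totalRadians = angles.reduce((sum, b) => sum + b.angle, 0);
  return (totalRadians * 180) / Math.PI;
}

const halfway = totalBendDegrees(bowAnglesAt(0.5)); // about 14.9 degrees
const full = totalBendDegrees(bowAnglesAt(1.0)); // about 29.8 degrees
```

Under these assumptions the full bow bends the upper body forward by roughly 30 degrees, spread across three joints rather than concentrated at one.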

Triggering Motions from LLM Responses

The character's expressions are controlled by having the LLM output emotion and motion tags in its responses.

Emotion tags are implemented by default in AITuberKit. The LLM response looks like this:

[happy]Thank you so much!

Motion tags are a custom addition. They appear in the response just like emotion tags:

Welcome! [bow]What kind of fragrance are you looking for today?

When both emotion and motion tags appear simultaneously, both are triggered.
For example, [happy][bow] results in the character bowing with a smile.
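
As a rough illustration of the tag handling, here is a minimal, non-streaming sketch that pulls emotion and motion tags out of a complete response. The real parser works on a stream and is not shown in this article, so the `parseTags` function and its return shape are assumptions; the tag names are the ones that appear in this article.

```typescript
const EMOTION_TAGS = ["neutral", "happy", "angry", "sad", "relaxed", "surprised"];
const MOTION_TAGS = ["bow", "present"];

interface ParsedResponse {
  emotions: string[];
  motions: string[];
  text: string; // response with recognized tags removed
}

// Hypothetical helper: split a full response into tags and display text.
function parseTags(response: string): ParsedResponse {
  const emotions: string[] = [];
  const motions: string[] = [];
  const text = response.replace(/\[([a-z]+)\]/g, (match: string, tag: string) => {
    if (EMOTION_TAGS.indexOf(tag) !== -1) {
      emotions.push(tag);
      return "";
    }
    if (MOTION_TAGS.indexOf(tag) !== -1) {
      motions.push(tag);
      return "";
    }
    return match; // leave unknown bracketed text untouched
  });
  return { emotions, motions, text };
}

const parsed = parseTags("[happy][bow]Thank you so much!");
// parsed.emotions -> ["happy"], parsed.motions -> ["bow"],
// parsed.text -> "Thank you so much!"
```

Unknown bracketed text is deliberately passed through so that ordinary brackets in a reply are not swallowed.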

The system prompt includes the following instructions:

## Emotional Expression
The format for conversation text is as follows. Choose the single most appropriate emotion for the entire response and prepend the emotion tag at the beginning.
[{neutral|happy|angry|sad|relaxed|surprised}]{conversation text}

Handling Conflicts Between Expressions and Motions

Simply applying both facial expressions and motions at the same time can cause unexpected behavior, so I've added the following controls.
For example, having the eyes open during a bow looked unnatural, so I set closeEyes: true to close the eyes on the motion control side.

The EmoteController manages this by passing flags between controllers:

// src/features/emoteController/emoteController.ts
public updateExpression(delta: number) {
  const isEmotionActive = this._expressionController.isEmotionActive
  // Skip auto-blink if the motion is closing eyes and expression is neutral
  const skipAutoBlink =
    this._gestureController.isClosingEyes && !isEmotionActive
  this._expressionController.update(delta, skipAutoBlink)
}

public updateGesture(delta: number) {
  const isEmotionActive = this._expressionController.isEmotionActive
  // Skip motion eye-close if an emotion expression is active
  this._gestureController.update(delta, isEmotionActive)
}

The emotion expressions and the motion's eye-close feature are mutually exclusive.
When the emotion is neutral, the motion side closes the eyes. When an emotion is active, the motion's eye-close is disabled and control is handed to the expression side.
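
The same rule can be written as a small pure function, which makes the combinations easy to check. This is a sketch that mirrors the flag logic above; `resolveEyeControl` and `EyeControl` are hypothetical names, and how the gesture controller uses the flag internally is an assumption based on the description.

```typescript
interface EyeControl {
  skipAutoBlink: boolean; // expression side pauses auto-blink
  gestureClosesEyes: boolean; // motion side is allowed to close the eyes
}

// Hypothetical helper: the motion side owns the eyes only when it wants
// them closed and no emotion expression is active.
function resolveEyeControl(
  isClosingEyes: boolean,
  isEmotionActive: boolean
): EyeControl {
  const gestureOwnsEyes = isClosingEyes && !isEmotionActive;
  return { skipAutoBlink: gestureOwnsEyes, gestureClosesEyes: gestureOwnsEyes };
}

// Bow with a neutral face: the motion closes the eyes, auto-blink pauses.
const neutralBow = resolveEyeControl(true, false);
// Bow with [happy]: the expression keeps control of the eyes.
const happyBow = resolveEyeControl(true, true);
```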

Wrapping Up

A chat UI is the most common frontend for an AI agent, but even a simple model like this feels lively once it starts moving, which makes it really fun.
That said, controlling motions is tricky: figuring out which bones to rotate, and by how much, is surprisingly difficult.
For more complex motions, purchasing a ready-made motion pack may be a good option.