DEV Community

AgentCore Registry: 16 Skills, 1 Hour, Zero Downtime

The story of migrating our governance agent from hardcoded skills to dynamic Registry loading — the wins, the gotchas, and what we learned along the way.

Why Bother?

Our AWS governance agent has 16 domain skills — security analysis, cost optimization, network intelligence, the works. Every single one of them was baked into the system prompt on every request. Ask about a single S3 bucket? Here's all 16 skill descriptions anyway.

That's a lot of tokens doing nothing useful.

The goal was simple: move skill definitions to AgentCore Registry so the agent loads only what it needs per request. Less prompt bloat, smarter skill selection, and the foundation for per-thread skill loading down the road.

The key constraint: Registry acts as a catalog only. Tool implementations stay in the agent code where they belong (lowest latency). The Registry just stores metadata — name, description, instructions, which tools a skill can use.

And because we're not reckless, we added a USE_REGISTRY env var toggle. Flip it off, and everything goes back to the old local files. Zero drama.
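A toggle like this can be sketched in a few lines. This is a minimal illustration, not our actual plugin code: the loader functions are hypothetical placeholders, and the set of accepted "on" values and the default of "off" are assumptions.

```python
import os


def use_registry(env=None):
    """Return True when the USE_REGISTRY toggle is switched on.

    Accepting "1", "yes" and "on" in addition to "true" is an assumption;
    the toggle defaults to off when the variable is missing.
    """
    env = os.environ if env is None else env
    value = env.get("USE_REGISTRY", "false")
    return value.strip().lower() in {"1", "true", "yes", "on"}


def load_local_skills():
    # Hypothetical placeholder for the original loader that reads local files.
    return "local"


def load_registry_skills():
    # Hypothetical placeholder for the Registry-backed loader.
    return "registry"


def select_skill_source(env=None):
    """Pick a loader based on the toggle; both code paths stay intact."""
    return load_registry_skills() if use_registry(env) else load_local_skills()
```

Passing the environment in as a plain mapping keeps the switch easy to unit test without touching the real process environment.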

[Figure: before/after comparison]

The Plan

We used Claude Code with agentic planning to break it into 7 phases: SDK upgrades, plugin creation, upload scripts, IAM permissions, tests, this journal, and documentation. The whole thing — plan through production deployment — took about an hour. A few decisions we locked in early:

  • Registry as catalog, not runtime — tools stay in-process for speed
  • One record per skill — 16 skills, 16 records, clean mapping
  • Session-scoped caching — fetch the catalog once per session, not every turn
  • Auto-approval enabled — all uploads come from our repo, no human review gate needed

Setup: Console and CLI Only

As of April 2026, AgentCore Registry has no Terraform provider and no CloudFormation support, so we did everything through the AWS Console and CLI. That worked fine for a one-time setup, but it's worth knowing if you plan to manage everything as IaC from day one.

Created the registry in the console. We already had a working JWT auth setup with Azure Entra ID on our main AgentCore agent, so we matched that same configuration. Straightforward.

Then we needed the Registry ID.

You'd think there'd be a labeled field with a copy button, like every other AWS service. There isn't. The ID is buried inside the ARN — you have to extract it yourself, or notice it tucked into sample CLI commands at the bottom of the page.
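Pulling the ID out of the ARN is easy to script. The sketch below assumes the ID is the last "/"-separated segment of the ARN's resource part; the ARN in the usage example is made up for illustration and is not a real Registry ARN format.

```python
def resource_id_from_arn(arn: str) -> str:
    """Return the trailing ID of an ARN's resource part.

    Assumption: the ID is the final "/"-separated segment of the resource.
    """
    # An ARN has six colon-separated fields; the sixth is the resource.
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn":
        raise ValueError(f"not an ARN: {arn!r}")
    return parts[5].rsplit("/", 1)[-1]


# Hypothetical ARN, for illustration only.
example_arn = "arn:aws:example-service:us-east-1:123456789012:registry/abc123"
registry_id = resource_id_from_arn(example_arn)
```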

Not a showstopper, but the kind of thing that makes you squint at the screen for five minutes wondering what you're missing.

Dear AWS: A "Registry ID" field with a copy button would save everyone some confusion.

Phase 1–2: Foundation and Plugin

Bumped our SDK versions (strands-agents and boto3), added config fields, and built the new plugin — a drop-in replacement for the old AgentSkills loader.

One thing worth knowing: match your Registry auth to your agent's auth. If your agent uses IAM, create an IAM-auth registry. If your agent uses OIDC, create a JWT registry. We use OIDC on our agent, so we went with JWT. The plugin itself runs with the agent's IAM execution role, so it calls the control-plane APIs (list_registry_records + get_registry_record), which accept IAM regardless of registry type.

Phase 3: Upload Scripts (Where Things Got Interesting)

We wrote scripts to push all 16 skill files to the Registry. Two things bit us:

The frontmatter problem. Our SDK's Skill.instructions field helpfully strips the YAML frontmatter from skill files. The Registry's upload API helpfully requires it. We switched to reading raw file content.
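To make the difference concrete, here is a small sketch. `strip_frontmatter` is a stand-in written for this example that mimics the effect we saw from the parsed skill object; it is not the SDK's code. `read_raw_skill` and `has_frontmatter` are likewise hypothetical helper names.

```python
from pathlib import Path


def strip_frontmatter(text: str) -> str:
    """Stand-in for the parsed view: the body without its YAML frontmatter."""
    if text.startswith("---\n"):
        # Search from index 3 so an empty frontmatter block is handled too.
        end = text.find("\n---\n", 3)
        if end != -1:
            return text[end + len("\n---\n"):]
    return text


def has_frontmatter(text: str) -> bool:
    """Cheap pre-upload check that the frontmatter is still present."""
    return text.startswith("---\n") and text.find("\n---\n", 3) != -1


def read_raw_skill(path) -> str:
    """Read the skill file exactly as it is on disk, frontmatter included."""
    return Path(path).read_text(encoding="utf-8")


raw = "---\nname: cost-optimization\ndescription: Find savings\n---\n# Instructions\n"
body_only = strip_frontmatter(raw)
```

Checking `has_frontmatter` right before the upload call would turn this class of mistake into a local error instead of a rejected request.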

The async creation dance. create_registry_record returns immediately with an ARN but no record ID field. The record is in CREATING state. You have to: extract the ID from the ARN, wait for it to reach DRAFT, then submit it for approval as a separate step. None of this is one API call.
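The sequence looks roughly like the sketch below. The three callables are injected rather than hard-coded, because the exact request and response shapes of the Registry operations are not shown here; in practice they would wrap the corresponding boto3 calls. The timeout and interval defaults, and the assumption that the record ID is the last segment of the ARN, are illustrative.

```python
import time


def wait_for_status(get_status, target, timeout_s=60.0, interval_s=2.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until it returns target, or raise TimeoutError."""
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status == target:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"still {status!r}, expected {target!r}")
        sleep(interval_s)


def create_and_submit(create_record, get_status, submit_for_approval,
                      **wait_kwargs):
    """Create a record, wait for DRAFT, then submit it for approval.

    create_record() returns the new record's ARN, get_status(record_id)
    returns its current state, submit_for_approval(record_id) starts review.
    """
    arn = create_record()
    # Assumption: the record ID is the last "/" segment of the ARN.
    record_id = arn.rsplit("/", 1)[-1]
    wait_for_status(lambda: get_status(record_id), "DRAFT", **wait_kwargs)
    submit_for_approval(record_id)
    return record_id
```

The same `wait_for_status` helper can be reused for any other state transition that needs polling, such as waiting for a record to reach APPROVED.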

Phase 4–5: IAM and Tests

IAM was uneventful — added registry permissions to the execution role, scoped to our specific registry ARN.

Tests were thorough: 13 unit tests for the plugin (including XML injection protection and caching behavior), 6 for the upload scripts. All green.

Phase 6: The Moment of Truth

Deployed with USE_REGISTRY=true. Opened Slack. Asked Schwarzi a question.

It worked.

The logs told the story:

  • Skills loaded from AgentCore Registry ✓
  • 16 skills fetched ✓
  • Same tool count as before (30) ✓
  • Extra latency: ~1.2 seconds at agent creation, once per session

That 1.2 seconds is the catalog fetch, and it happens once per session. After that, each skill is fetched on first use (~200ms) and cached for the rest of the session. Acceptable trade-off for dynamic loading.
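Session-scoped caching of this kind needs very little code. The class below is a minimal sketch, not our plugin: `SessionSkillCache` is a hypothetical name, and the two fetch functions are injected so the example runs without AWS access.

```python
class SessionSkillCache:
    """Cache the catalog and individual skills for one agent session."""

    def __init__(self, fetch_catalog, fetch_skill):
        self._fetch_catalog = fetch_catalog  # slow call, made once per session
        self._fetch_skill = fetch_skill      # faster call, made once per skill
        self._catalog = None
        self._skills = {}

    def catalog(self):
        # Fetch on first access, then reuse for every later turn.
        if self._catalog is None:
            self._catalog = self._fetch_catalog()
        return self._catalog

    def skill(self, name):
        # Only the first lookup of a given skill reaches the backend.
        if name not in self._skills:
            self._skills[name] = self._fetch_skill(name)
        return self._skills[name]
```

Creating one cache object per session and discarding it at the end is what lets an updated skill show up on the next session without any explicit invalidation.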

"But Does It Cost More Tokens?"

This was our first question too. We'd just added API calls and latency — surely we were also burning more tokens?

No. The LLM sees exactly the same text either way.

Both plugins inject an <available_skills> XML block into the system prompt before every turn — 16 skill names and descriptions, same format, same size. When the agent activates a skill, both return the full SKILL.md content as a tool result. Same content, same tokens. The Registry is a different storage backend, not a different prompt.
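For a sense of what such a block involves, here is a sketch of rendering it with escaping, which is the kind of XML injection protection our unit tests cover. Only the `<available_skills>` wrapper comes from our setup; the inner `<skill>`, `<name>` and `<description>` element names and the function name are assumptions made for this example.

```python
from xml.sax.saxutils import escape


def render_available_skills(skills):
    """Render skill names and descriptions as an escaped XML block.

    skills is a list of dicts with "name" and "description" keys.
    """
    lines = ["<available_skills>"]
    for skill in skills:
        # escape() neutralizes &, < and > so a description cannot close
        # the wrapper element or inject extra tags into the prompt.
        lines.append("  <skill>")
        lines.append(f"    <name>{escape(skill['name'])}</name>")
        lines.append(f"    <description>{escape(skill['description'])}</description>")
        lines.append("  </skill>")
    lines.append("</available_skills>")
    return "\n".join(lines)
```

Because the rendering depends only on names and descriptions, the output is identical whether those strings come from local files or from the Registry, which is why the token count does not change.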

The actual cost delta is purely operational: ~1.2 seconds of latency at session start, ~200ms per uncached skill fetch, and 3–4 AWS API calls per session that didn't exist before. Those are real, but they're measured in milliseconds and pennies, not tokens.

If anything, the Registry sets up future token savings. Right now we load the full catalog into the prompt every turn. With per-thread skill loading (the next phase), we'd inject only the skills relevant to the current conversation. That's a prompt reduction, not an increase — but it requires the dynamic loading infrastructure we just built.

The Real Payoff: Skill Updates Without Redeployment

The best part came after the migration. We needed to fix a skill's instructions — previously, that meant editing the markdown, rebuilding the container, and running agentcore launch. A full redeployment cycle for a text change.

Now? Edit the SKILL.md, run the update script, and the agent picks it up on the next session. No container build, no deploy, no downtime. Skill content is decoupled from the agent runtime.

That's the kind of workflow improvement that compounds over time. Every skill tweak, every prompt refinement, every new tool reference — just a markdown edit and an upload.

The Gotcha Tracker

For the detail-oriented, here's everything that went sideways:

#  What Happened                                  What We Did
1  Registry ID not labeled in console             Extracted from ARN
2  Registry APIs missing in boto3 < 1.42.88       Upgraded boto3
3  Upload API requires YAML frontmatter           Sent raw file content
4  Record creation is async, no ID in response    Parsed ID from ARN, polled for DRAFT status
5  Data-plane search requires matching auth type  Used control-plane APIs (accept IAM regardless)
6  Registry takes ~45s to become READY            Added polling before uploads
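Gotcha 2 is cheap to catch at startup with a version guard. The sketch below compares version numbers as integer tuples instead of strings; the helper names are hypothetical, and the 1.42.88 minimum is the version we found the Registry APIs in.

```python
import re

MIN_BOTO3 = (1, 42, 88)  # first boto3 version where we saw the Registry APIs


def parse_version(version: str):
    """Turn 'X.Y.Z...' into an (X, Y, Z) tuple of ints."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version: {version!r}")
    return tuple(int(group) for group in match.groups())


def supports_registry(version: str, minimum=MIN_BOTO3) -> bool:
    """True when the given boto3 version is at least the minimum."""
    return parse_version(version) >= minimum
```

In the agent this would be called with `boto3.__version__`, failing fast with a clear message instead of a missing-operation error later. Comparing tuples matters because a plain string comparison would rank "1.9.200" above "1.42.88".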

What We'd Tell Past Us

  1. Match your Registry auth to your agent. IAM agent → IAM registry. OIDC agent → JWT registry. We matched our existing Azure Entra ID config and it worked on the first try.

  2. Control plane is your friend. The control-plane APIs (list_registry_records + get_registry_record) accept IAM regardless of registry auth type. Use them for catalog fetches from your plugin.

  3. Send the raw file, frontmatter and all. The SDK strips it; the Registry needs it. Read the file from disk, not from the parsed object.

  4. Record creation is a multi-step process. Create → wait for DRAFT → submit for approval → wait for APPROVED. Budget for polling and retries.

  5. Check your boto3 version. Registry APIs landed in 1.42.88. Older versions have the clients but not the operations — a confusing failure mode.

  6. Cache at the session level. The catalog fetch is the slow part (~1.2s). Individual skill lookups are fast (~200ms) and cache well. Don't re-fetch what hasn't changed.

  7. Use an env var toggle for migrations. USE_REGISTRY=true/false with both code paths intact means you can roll back in seconds. No database flags, no deployment. Just flip the switch.
