What Is an Agent Registry? (And What We Broke Before We Had One)

TL;DR

An AI agent registry is a centralized catalog of every agent in your organization — what each agent does, what tools it can access, what version is running, who owns it, and how to call it
It's to agents what a container registry is to Docker images or what a service mesh is to microservices — the layer that makes distributed components governable
We hit the "which agents do we have?" wall at 14 agents across 3 teams. That's when the registry stopped being a nice-to-have

About four months into our agentic AI buildout, our head of security asked a question I couldn't answer: "Can you give me a list of every AI agent running in production, what systems they have access to, and what version of each is currently deployed?"

I had a rough mental model. I knew about the agents my team had built. I had a vague idea of what the data engineering team had shipped. The product team had recently added two agents I'd heard about secondhand.

I spent the better part of a day pulling together a spreadsheet. By the time I finished, one of the agents I'd listed had already been replaced by a newer version. Two of them had been granted access to an internal API I hadn't known about.

The spreadsheet was outdated before I sent it.

That was our forcing function for building a proper agent registry. This post is what I wish I'd read before that conversation happened.

What an agent registry is

An agent registry is a centralized catalog of AI agents — a single source of truth that tracks every agent deployed in your organization, its capabilities, its integrations, its ownership, and its current state.

The analogy that landed for me: it's to agents what a container registry (Docker Hub, ECR, GCR) is to container images. When you have three containers running, you don't need a registry — you know what you have. When you have 40 containers across six teams, you need a registry to know what's running, who owns it, what version is deployed, and what depends on what.

Agents are the same. At two or three agents, a shared Notion doc is sufficient. At 14 agents across three teams, you need infrastructure that tracks live state, not a doc that nobody has touched in a month.

A registry stores metadata for each agent:

Identity and ownership — which team built it, who's the current owner, what's the canonical name
Capabilities — what the agent can do, expressed as a standard interface (increasingly via the Model Context Protocol, so other agents can discover and call it without custom integration)
Tool and model access — which MCP servers it's authorized to use, which models it can call, what permissions it holds
Version and deployment state — which version is currently in production, what changed, when it was last updated
Observability metadata — success rate, latency, last error, evaluation scores if you're running evals
Access policy — which other agents or services are authorized to call this agent

The last one is what distinguishes a registry from a spreadsheet: it's not just a catalog, it's the enforcement point for agent-to-agent communication.
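To make the metadata list above concrete, here is a minimal sketch of a single registry entry. The field names, the example agent, and the `may_be_called_by` helper are all illustrative assumptions, not the schema of any particular product.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class AgentRecord:
    """One registry entry. Every field name here is a hypothetical choice."""
    name: str                        # canonical name
    owner_team: str                  # identity and ownership
    capabilities: list[str]          # what the agent can do
    allowed_tools: list[str]         # tools / MCP servers it may use
    allowed_models: list[str]        # models it may call
    version: str                     # version currently in production
    success_rate: Optional[float] = None  # observability metadata, if any
    allowed_callers: set[str] = field(default_factory=set)  # access policy

    def may_be_called_by(self, caller: str) -> bool:
        # Default deny: only callers listed in the policy are accepted.
        return caller in self.allowed_callers


# Example entry with made-up names.
record = AgentRecord(
    name="invoice-triage",
    owner_team="data-eng",
    capabilities=["classify_invoice"],
    allowed_tools=["internal-data-api"],
    allowed_models=["example-model"],
    version="1.4.2",
    allowed_callers={"billing-orchestrator"},
)
```

The access policy lives on the record itself, which is the part a spreadsheet can describe but cannot enforce.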

What goes wrong without one

We ran without a registry for longer than we should have. Here's what actually broke.

Shadow agents. Three separate teams had independently built agents that called our internal data API. None of them knew about the others. When we introduced rate limits on that API, two of the agents started failing intermittently — and we spent a week debugging what we thought was a data API problem before realizing the actual problem was three agents competing for quota we'd only budgeted for one.

Version confusion at 2am. An agent went into production with a bug. We rolled back. The rollback was applied to one environment but not the other. For six hours, our staging environment had the fixed version and production had the broken one, because there was no single source of truth for which version was where. The incident took longer to resolve than it should have because different team members were looking at different version references.

The offboarding gap. When an engineer left the team, we revoked their credentials for the systems we knew about. Three weeks later, a contractor reported that an internal Jira webhook was still firing from an agent the departed engineer had built. The agent had been registered nowhere. It was running on infrastructure the engineer had stood up themselves, using credentials that were never on the offboarding checklist because nobody knew the agent existed.

M×N integration hell. Each new agent that needed to call tools had to build its own integration with each tool. Eight agents, six tools: 48 potential integration points, each with its own credential management, error handling, and retry logic. When a tool API changed, we had to manually find and update every agent that used it.

The registry fixes all four of these. Shadow agents can't exist if registration is a prerequisite for deployment. Version state is tracked centrally. Offboarding becomes "revoke this agent's access in the registry." The M×N integrations collapse to M+N registrations: each tool is registered once, and each agent points at the registry.
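Two of those fixes, registration as a deployment prerequisite and offboarding as a registry operation, can be sketched in a few lines. This is a toy in-memory version under stated assumptions: the `registry` dict, the agent and owner names, and both function names are hypothetical, and a real setup would call the gate from a CI/CD pipeline.

```python
class DeploymentBlocked(Exception):
    """Raised when an agent may not be deployed."""


# Hypothetical in-memory registry: agent name -> metadata.
registry = {
    "ticket-summarizer": {"owner": "alice", "active": True},
    "jira-sync": {"owner": "bob", "active": True},
}


def check_deployable(agent_name: str) -> None:
    """Deployment gate: refuse anything the registry does not know about."""
    entry = registry.get(agent_name)
    if entry is None:
        raise DeploymentBlocked(f"{agent_name} is not registered")
    if not entry["active"]:
        raise DeploymentBlocked(f"{agent_name} has been revoked")


def offboard(owner: str) -> list[str]:
    """Revoke every active agent owned by a departing engineer."""
    revoked = []
    for name, entry in registry.items():
        if entry["owner"] == owner and entry["active"]:
            entry["active"] = False
            revoked.append(name)
    return revoked
```

With this in place, an unregistered agent fails the gate before it ships, and offboarding returns the list of agents that were just revoked instead of relying on someone's memory.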

What a registry is not

Worth being explicit, because I conflated some things early on.

It's not a deployment platform. The registry tracks what's running, but it doesn't run the agents. Deployment is a separate concern — Kubernetes, a container orchestrator, whatever your team uses. The registry is the catalog; deployment is the execution layer.

It's not an orchestration framework. LangGraph, CrewAI, AutoGen — those handle how agents coordinate with each other. The registry handles what agents exist and whether they're authorized to talk to each other at all. These are complementary, not competing.

It's not an MCP server list. An MCP server registry catalogs available tools. An agent registry catalogs available agents. Both are useful. Both are needed. TrueFoundry calls the combination of the two a unified MCP and Agents Registry — one place where you can see both the tools agents can use and the agents themselves. That unification matters because the governance question is really "which agents can call which tools" — you need both catalogs to answer it.

It's not just a spreadsheet. The spreadsheet version of an agent catalog is a snapshot. A proper registry is stateful: it connects to your observability layer and shows live performance, not whatever someone last typed in a week ago. When TrueFoundry's registry shows you an agent's success rate, it's pulling from real-time telemetry, not a manually updated field.
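The governance question from this section, "which agents can call which tools," only has an answer when both catalogs sit side by side. A minimal sketch, assuming two in-memory catalogs with made-up agent and tool names (this is not TrueFoundry's API):

```python
# Hypothetical tool catalog (the "MCP server list").
tools = {"internal-data-api", "jira", "slack"}

# Hypothetical agent catalog: agent name -> tools it is authorized to use.
agents = {
    "report-builder": {"internal-data-api", "slack"},
    "ticket-triage": {"jira"},
    "etl-monitor": {"internal-data-api"},
}


def agents_with_access(tool: str) -> list[str]:
    """Answer 'which agents can call this tool?' as a lookup."""
    if tool not in tools:
        raise KeyError(f"unknown tool: {tool}")
    return sorted(name for name, granted in agents.items() if tool in granted)


def ungoverned_grants() -> dict[str, set[str]]:
    """Grants that point at tools missing from the tool catalog."""
    return {
        name: granted - tools
        for name, granted in agents.items()
        if granted - tools
    }


print(agents_with_access("internal-data-api"))
```

The second function is the reason to unify the catalogs: a grant to a tool nobody has registered is exactly the kind of gap a standalone agent list would never surface.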

The architecture pattern that makes it work

The pattern that made everything cleaner: every agent registers with the gateway using the Model Context Protocol. Once registered, the agent looks like a standard MCP endpoint to every other agent in the system. A LangGraph agent and a CrewAI agent and a custom HTTP service all appear as the same kind of thing to the orchestrator — they're all just callable endpoints with a defined schema.

This is what solves the M×N problem architecturally. Each tool is registered once. Each agent is registered once. The registry maps which agents can call which tools. Agents don't need to know how to integrate with Jira or Slack or your internal data API directly — they call the registry endpoint, and the registry handles routing, credentials, and access control.
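The arithmetic behind that claim is worth spelling out. This sketch assumes the worst case where every agent needs every tool; the function names are illustrative.

```python
def point_to_point_integrations(num_agents: int, num_tools: int) -> int:
    """Every agent integrates with every tool directly: M x N."""
    return num_agents * num_tools


def registry_registrations(num_agents: int, num_tools: int) -> int:
    """Every agent and every tool registers once: M + N."""
    return num_agents + num_tools


# The eight agents and six tools from earlier in the post.
print(point_to_point_integrations(8, 6))  # 48
print(registry_registrations(8, 6))       # 14
```

Adding a ninth agent costs six new integrations in the first model and one registration in the second, which is why the gap widens as the fleet grows.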

The other pattern that mattered: the registry as the access control enforcement point. Before this, access control for agent-to-agent calls lived in application code — each agent decided for itself whether to accept a call. That's as reliable as it sounds. Moving access control to the registry layer means it's enforced centrally, consistently, and not dependent on each individual agent implementation being correct.
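A rough sketch of what central enforcement can look like: one policy table, one check, default deny, and a log entry for every decision. The policy contents, agent names, and `authorize_call` function are assumptions for illustration only.

```python
# Hypothetical policy table: target agent -> callers allowed to invoke it.
ACCESS_POLICY = {
    "payments-agent": {"checkout-orchestrator"},
    "search-agent": {"checkout-orchestrator", "support-bot"},
}

# Every decision is recorded with both agent identities.
audit_log: list[tuple[str, str, str]] = []


def authorize_call(caller: str, target: str) -> bool:
    """Decide centrally whether `caller` may invoke `target`."""
    # Targets with no policy entry get an empty allowlist, so the default is deny.
    allowed = caller in ACCESS_POLICY.get(target, set())
    audit_log.append((caller, target, "allowed" if allowed else "denied"))
    return allowed
```

Because the check lives in one place, a bug in an individual agent's own request handling no longer decides who gets in.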

What we ended up using

After the security audit incident, we evaluated a few options and landed on TrueFoundry's Agent Registry. Here is specifically what mattered.

Unified agent and MCP catalog. Every agent and every tool visible in one place. When the security team asks "which agents have access to the internal data API," the answer is a query, not a two-day investigation.

Framework-agnostic registration. We have agents on LangGraph, one on CrewAI, and two custom HTTP services. The registry handles all of them through a standard registration interface. Once registered, governance policies apply regardless of what framework built the agent — the same RBAC rules, the same audit trail, the same access policies.

Live performance tracking. The registry shows each agent's success rate, average latency, and last error pulled from the observability layer. We set a routing rule: for production code changes, only route to agents with >90% success rate on the latest eval run. The registry enforces this automatically rather than requiring a human to check before deploying.

A2A communication via MCP. When an agent needs to call another agent, it goes through the registry. The registry checks whether the calling agent is authorized to invoke the target agent, handles the call, and logs the interaction with both agent identities. The over-privileged sub-agent problem — where a spawned agent inherits more permissions than it should — is closed at the registry layer.
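The success-rate routing rule mentioned above reduces to a filter over the latest eval results. This is a sketch of the logic only, with invented agent names and scores; it does not represent how TrueFoundry configures or enforces the rule.

```python
MIN_SUCCESS_RATE = 0.90  # the ">90%" threshold from the routing rule

# Hypothetical results from the latest eval run.
latest_eval = {
    "refactor-agent": 0.96,
    "codegen-agent": 0.88,
    "review-agent": 0.90,
}


def eligible_agents(eval_scores: dict[str, float],
                    threshold: float = MIN_SUCCESS_RATE) -> list[str]:
    """Agents allowed to receive production traffic."""
    # Strictly greater than the threshold, so exactly 90% does not qualify.
    return sorted(
        name for name, rate in eval_scores.items() if rate > threshold
    )


print(eligible_agents(latest_eval))  # ['refactor-agent']
```

An agent missing from the eval results never appears in the output, so under this sketch an unevaluated agent is ineligible by default.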

The tradeoff: TrueFoundry is Kubernetes-native, so there's real infrastructure investment if you're not already on K8s. For a team of 5 with 3 agents, a YAML file is probably enough. The inflection point for us was around 10 agents across multiple teams with compliance requirements.

When you actually need one

The honest answer: you need a registry before you think you do, and you usually only realize it after an incident you couldn't untangle without one.

Some concrete signals:

You can't answer "which agents do we have in production" without asking multiple people
A team deploys an agent and you find out about it from a runaway cost alert rather than a check-in
An engineer leaves and you realize you don't know what credentials their agents were using
Two teams built agents that do similar things because neither knew the other existed
You want to introduce rate limits or access controls on an internal system and don't know how many agents are calling it

If any of those describe your situation, the registry conversation is overdue. If none of them do yet, you're probably still small enough that the overhead isn't justified.

What pushed you toward building or adopting a registry — and what does your current agent catalog look like? Curious whether most teams are still on the spreadsheet version or if the registry infrastructure has actually caught up to the agent deployment pace. Drop it in the comments.