Yesterday I asked an AI agent to recommend a library for parsing European VAT numbers. It suggested one I've been seeing in Stack Overflow answers since 2022. I asked three different ways. Same library every time.
The library works. That's not the point. The point is that there are at least four newer options in this space that are objectively better, and the agent had no idea any of them existed. Not because they were hidden. Because they shipped after the underlying model was trained.
This is a small, specific moment that points at a much bigger problem. When an AI agent recommends a tool, capability, library, or service to its user, the recommendation is essentially frozen at whatever the foundation model saw during training. Anything that came into existence after the training cut is invisible.
The gap compounds
Every month that passes is another month of releases the recommendation layer doesn't know about. A capability launched yesterday won't be recommended by any agent on the planet until the next round of foundation model training, which might be six months away, might be a year. And even then, there's no guarantee the new capability gets enough representation in the training data to actually surface in recommendations. Popular things stay popular. New things stay invisible.
For builders shipping tools into the agent ecosystem, this is closer to an existential issue than people realise. The whole pitch of AI agents is that they take work off your plate by making good decisions on your behalf, including the decision of which tool to use. If "good decision" means "the same handful of options that were popular eighteen months ago," then the agent ecosystem rewards incumbency in a way the open web never did. You can build the best thing in the category and remain invisible to the buyers who matter.
What search engines did, and why it's hard to replicate
Search engines solved a version of this problem with crawlers and ranking algorithms. The mechanism was ugly but it worked: build a thing, ship it, get crawled, eventually get found. The ranking signal, PageRank and the hundred refinements that came after, was visible enough that builders could optimise for it. SEO is a much-maligned discipline, but what it actually represents is a path to discoverability for things that didn't exist yesterday.
Agents have no equivalent yet. There's no AEO: no equivalent of SEO that gives a new tool a visible, optimisable path into agent recommendations.
You might think "well, just put your tool in the model's training data." But foundation model training is an opaque process controlled by a small number of labs, with a feedback cycle measured in months or years. It's not a primitive that builders can rely on. And even if you could somehow guarantee inclusion, training data is by definition retrospective. The recommendation layer has to be live, not historical.
The partial answers taking shape
A few things are starting to emerge that don't solve the problem on their own but together start to look like something.
MCP server registries. Tools like Glama, Smithery, and Docker's MCP catalogue let agents discover MCP servers at runtime. The agent queries the registry, finds available capabilities, and calls them through the Model Context Protocol. This bypasses the training cut entirely. The catch is that MCP registries today are mostly unvetted lists. An agent that finds a server in a registry has no way to know whether it works, whether it returns useful data, or whether someone is going to start charging for it tomorrow.
Framework PRs. Opening pull requests against framework repos like Pydantic AI, LangChain, or LlamaIndex puts your tool's name and description into the codebase, which then gets ingested into the next round of foundation model training. This is a slow path but a real one. It's also gameable in obvious ways and probably won't survive contact with adversarial actors at scale.
Documentation indexers. Services like Context7 index documentation specifically for retrieval by AI agents and LLMs. If your tool has good docs and is indexed in Context7, an agent that uses Context7 as a retrieval source can find you. This is closer to a real answer but it relies on the agent being configured to query the indexer in the first place, which isn't a default in most frameworks today.
None of these solves the recommendation layer problem. But taken together, they suggest the shape of a solution: a registry an agent queries at runtime, with content that updates in real time, and a quality signal the agent can verify itself rather than trust on reputation.
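To make that shape concrete, here is a minimal sketch of a runtime discovery query. Everything in it is an assumption for illustration: the registry is an in-memory stand-in, and the capability names, description text, and `quality_score` field are invented, not any real registry's API.

```python
from dataclasses import dataclass

@dataclass
class Capability:
    name: str
    description: str
    quality_score: float  # 0.0-1.0, assumed to be refreshed by continuous testing

# Hypothetical in-memory stand-in for a live registry API.
REGISTRY = [
    Capability("vat-check-eu", "Validates European VAT numbers", 0.97),
    Capability("legacy-vat-lib", "Older VAT parser, unmaintained", 0.61),
]

def discover(query: str, min_score: float = 0.8) -> list[Capability]:
    """Return matching capabilities above a quality floor, best first."""
    matches = [c for c in REGISTRY if query.lower() in c.description.lower()]
    return sorted(
        (c for c in matches if c.quality_score >= min_score),
        key=lambda c: c.quality_score,
        reverse=True,
    )

candidates = discover("VAT")
```

The point of the sketch is the lookup happening at call time rather than at training time: a capability added to the registry this morning is a candidate this afternoon, with no retraining in the loop.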
The hard part is the ranking signal
The mechanical part of building a registry is straightforward. Anyone can stand up a database, give it an API, and call it a registry. The hard part is the same hard part search engines had: how do you rank?
PageRank worked because it was a citation graph. The thing being measured was visible to anyone who wanted to look, and it correlated reasonably well with quality. It was also gameable, which is why search engines have spent the last twenty-five years building defences. But the gameability was a feature in a sense. It gave builders a clear target to optimise for.
Agents need a ranking signal with similar properties: visible, correlated with quality, optimisable by builders, and resistant to obvious manipulation. Stars and reviews don't work because an agent making thousands of tool choices a day can't stop to weigh prose reviews for each one. Download counts don't work because they reward incumbency exactly the way the current system does. Reputation doesn't work because reputation is laggy and capturable.
Probably the right primitive is something the agent can verify itself at call time. If the agent can independently check whether a capability does what it claims to do, by running a test, checking a signature, validating a known-good output, then the ranking signal becomes empirical rather than social. That means the registry's job isn't to tell agents which capabilities are good. Its job is to expose enough verifiable data that the agent can decide for itself.
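A sketch of that empirical check, under loud assumptions: the known-good test case, the capability being tested, and the way the agent invokes it are all hypothetical stand-ins written as local functions so the example is self-contained.

```python
# Hypothetical known-good case the agent can replay before trusting a capability.
KNOWN_GOOD = {
    "input": "DE123456789",
    "expected": {"country": "DE", "valid_format": True},
}

def call_capability(vat_number: str) -> dict:
    # Stand-in for the remote capability under test (not a real service).
    return {"country": vat_number[:2], "valid_format": vat_number[2:].isdigit()}

def verify(capability, case) -> bool:
    """Run one known-good case and compare output: empirical, not social."""
    try:
        return capability(case["input"]) == case["expected"]
    except Exception:
        # A capability that errors on a known-good input fails verification.
        return False

trusted = verify(call_capability, KNOWN_GOOD)
```

The signal here is binary and crude, but it illustrates the property that matters: the agent checked the claim itself, at call time, rather than inheriting someone else's opinion.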
What we're working on
This is what we're building toward at Strale. The shape is a registry of capabilities for AI agents (data lookups, validations, integrations with external services), where each capability is continuously tested against a standardised test suite and the test results are exposed as a quality score the agent can read before deciding whether to call it. Every score is the result of recent tests, not a static rating. When reality moves, the score moves.
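How an agent might gate a call on such a score can be sketched as follows. The field names, thresholds, and score record are illustrative assumptions, not Strale's actual schema; the idea being shown is that both the pass rate and its freshness factor into the decision.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical score record an agent might read from a registry.
score = {
    "capability": "vat-check-eu",
    "pass_rate": 0.97,
    "last_tested": datetime.now(timezone.utc) - timedelta(hours=2),
}

def should_call(score: dict,
                min_pass_rate: float = 0.9,
                max_staleness: timedelta = timedelta(days=1)) -> bool:
    """Call only on a recent, high pass rate: a stale score is no score."""
    fresh = datetime.now(timezone.utc) - score["last_tested"] <= max_staleness
    return fresh and score["pass_rate"] >= min_pass_rate

ok = should_call(score)
```

The staleness check is what separates this from a static rating: a capability that stopped working yesterday stops being recommended today, because its score either drops or ages out.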
Whether this turns out to be the right primitive is still an open question. Maybe the answer is something else entirely. A federated trust graph, or per-agent learned preferences, or something none of us have thought of yet. But it feels like the recommendation layer is the bottleneck for the whole agent ecosystem, and most of the conversation right now is still about payment rails and protocols.
Discovery is upstream of payment. If agents can't find the capabilities that exist, it doesn't matter how cheap or fast the transaction layer is.
If you're building tools for agents, or building agents that need to call tools, the discovery problem is going to come up for you sooner or later. We'd be glad to compare notes with anyone wrestling with it from a different angle.