A story about attention, and the three architectures it took to figure that out.

The setup
Early 2025. I was building a chat assistant for a hospitality company I'll call Harbor. Airport lounges, a loyalty card program, gift cards, a B2B portal for travel agencies. Four products, four kinds of customer, one chat window.
The brief: "One AI. Handles everything. Like talking to a person who actually works here."
"Everything" meant 53 distinct backend operations across five completely different user journeys. If you've ever tried to wire that many tools into a 2025-era LLM, you already know where this is going.
Attempt one · a plain ReAct agent
We started the obvious way. One agent, all 53 tools, good prompt, let the model reason.
On a real workload, it collapsed. You have to remember what early-2025 models were like. Impressive in a short loop, two or three tool calls and done, but push them into longer chains with 50+ tools and they fell apart. They'd hallucinate tool names. They'd forget the user was a distributor and reach for customer-facing tools. They'd call the same thing twice with slightly different arguments.
None of this was the model being dumb. It was the model being asked to do something the reasoning budget of that generation couldn't sustain. So we did what everyone else was doing: we retreated into a graph.
Attempt two · the distributed state graph
The industry consensus at the time was that if ReAct couldn't handle your complexity, you broke the problem into a graph of smaller agents. Each node a specialist with three or four tools. A router up top. State flowing between them.
For a while it worked better. Then real conversations hit it.
A user in booking mode would ask about their loyalty points. The router would hand them to the membership specialist, which had no idea a half-finished booking was sitting in state. Control would come back and the booking flow would be in a subtly broken state. We patched. We added edges. We added a supervisor layer.
Six weeks in, the graph looked like a subway map drawn by someone having a bad week. The uncomfortable truth took us too long to admit: the graph wasn't a better architecture. It was scaffolding around a weak foundation. We hadn't solved the reasoning problem; we'd buried it under so many layers that when it failed, we couldn't find it. We called it a night and decided to tear it down.
Two architectures, two different kinds of complexity, both collapsing under the same underlying problem.

Attempt three · stop being clever, start managing attention
Between starting the project and tearing the graph down, two things had changed.
The first was that the models had gotten quietly better. Not dramatically. Just enough that a ReAct loop which had collapsed six months earlier now held together much longer.
The second was that I'd started reading the emerging research on why large toolsets hurt agents, and the answer wasn't "the model is bad at choosing." It was subtler and more useful.
Tool descriptions compete with the user's message for the model's attention. Every tool definition you put in the prompt is a block of tokens the model has to process, weigh, and mostly ignore. A 2025 paper on RAG-MCP showed that in typical MCP deployments, 72% of the agent's context window is consumed by tool definitions before any actual work begins. Almost three-quarters of the model's working memory spent on descriptions it will never touch on this turn. (Gan & Sun, 2025)

When the RAG-MCP authors replaced the all-tools-at-once approach with retrieval, showing the model only the tools relevant to the current query, tool-selection accuracy went from 13.62% to 43.13% on their benchmark. More than triple. Same model. Same tools. The only thing that changed was how much of the model's attention was spent on noise.
The broader literature tells the same story. Recent benchmarks show per-tool accuracy as high as 96% in isolation collapsing to under 15% in large-toolset multi-turn settings. As one recent survey put it: "A well-scoped agent with 8 tools that all apply to its task outperforms a general agent with 80 tools in almost every benchmark that matters." (Ziółkowski, 2026)
So we stopped thinking of our job as "help the model choose between 53 tools" and started thinking of it as "protect the model's attention so it only ever has to choose between the 6 to 18 that actually matter right now."
The frame shifted from "pick the right tool" to "defend the attention budget." Everything that followed came from that reframing.
The architecture, in one sentence
One agent. All 53 tools in the registry. But on any given turn, the model only sees the tools relevant to what the user is currently doing, and only the prompt text that applies to the current scope.
Two moving parts: a tool-scoping layer and a three-layer prompt.
Part one · tool scoping
We split Harbor into five contexts: booking (~25 tools), gift cards (6), account (12), distributor (18), membership (16). Three tools are always available on top of any context: load_context, get_knowledge, and logout. Those three are the agent's navigation kit.
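The scope registry described above can be sketched as a plain dict plus one lookup function. This is a minimal illustration, not Harbor's real registry: the context names match the post, but the tool names and the (deliberately short) per-context lists are stand-ins.

```python
# Hypothetical sketch of the context registry. Tool names are illustrative;
# the real scopes hold ~25, 6, 12, 18, and 16 tools respectively.
ALWAYS_AVAILABLE = {"load_context", "get_knowledge", "logout"}  # the navigation kit

TOOL_SCOPES = {
    "booking":     {"search_lounges", "create_booking", "cancel_booking"},
    "gift_cards":  {"buy_gift_card", "redeem_gift_card"},
    "account":     {"get_profile", "update_profile"},
    "distributor": {"get_commission_report", "create_billing_extract"},
    "membership":  {"get_points_balance", "top_up_card"},
}

def allowed_tools(active_context):
    """Tools the model sees this turn: the navigation kit plus the active scope.

    With no active context, only the navigation kit is exposed.
    """
    scoped = TOOL_SCOPES.get(active_context, set())
    return ALWAYS_AVAILABLE | scoped
```

The important property is that the union is computed fresh on every turn, so switching contexts never requires touching the registry itself.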
Enforcement is a middleware that runs before every model call. Seven lines do the real work:
```python
def _scope_tools(self, request, state):
    active_context = state.get("active_context")
    if active_context and active_context in self.TOOL_SCOPES:
        allowed = self.TOOL_SCOPES[active_context]
        tools = [t for t in (request.tools or []) if t.name in allowed]
        return request.override(tools=tools)
    return request
```
The hidden tools are not in the function-calling schema at all. There's nothing to get confused by, and more importantly, nothing taking up attention.
This same pattern is showing up independently across the industry. GitHub's official MCP server ships a --dynamic-toolsets mode that starts with exactly three meta-tools (list_available_toolsets, get_toolset_tools, enable_toolset) and lets the LLM activate whole tool groups on demand. (Ziółkowski, 2026) Different domain, different terminology, exactly the same idea: start narrow, let the model widen itself when it needs to.
Part two · the three-layer prompt
Tool scoping alone isn't enough. The model also needs to know what it could do if it switched, or it'll never think to switch. So we split the system prompt into three layers:

A concise capability index, always loaded. A short list of everything the agent can do across every context: "You can book lounges, manage existing bookings, buy or redeem gift cards, manage partner accounts, top up loyalty cards." Five lines. Tells the model the universe of possibilities without drowning it in detail.
The always-on core, always loaded. Core behavior: how to think, how to talk, what it can't do, security boundaries. The same for every context. This is who the agent is, regardless of what it's currently doing.
The switchable module, hot-swapped per context. The detailed playbook for the current scope. Booking gets flight codes and passenger types. Distributor gets commission periods and billing extracts. Neither has to know about the other.
Every model call gets one overview + one set of behavior rules + one detailed module. Not five modules fighting for attention. One.
The index in layer 1 matters more than it looks. Without it, an agent stuck in the gift-card scope has no idea it could also help with loyalty cards, so it never calls load_context when it should. The index is a tiny attention cost that unlocks the entire navigation system.
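The three-layer assembly is simple enough to show in full. Everything below is an illustrative stand-in: the prompt text, the `PROMPT_MODULES` keys, and the function name are assumptions, meant only to show how the index and core stay fixed while one module is hot-swapped per context.

```python
# Layer 1: concise capability index, always loaded (illustrative text).
CAPABILITY_INDEX = (
    "You can book lounges, manage existing bookings, buy or redeem gift "
    "cards, manage partner accounts, and top up loyalty cards."
)

# Layer 2: always-on core behavior, identical for every context (illustrative text).
CORE_BEHAVIOR = (
    "Be concise. Never invent prices. If the user's request falls outside "
    "your current scope, call load_context with the appropriate context name."
)

# Layer 3: one detailed playbook per context, hot-swapped (illustrative stubs).
PROMPT_MODULES = {
    "booking": "Booking playbook: collect flight code and passenger types first...",
    "gift_cards": "Gift card playbook: confirm amount and recipient before purchase...",
}

def build_system_prompt(active_context):
    """Assemble one index + one core + exactly one module for this turn."""
    module = PROMPT_MODULES[active_context]
    return "\n\n".join([CAPABILITY_INDEX, CORE_BEHAVIOR, module])
```

Note that the booking prompt never contains a single line of the gift-card playbook, which is the whole point: one module, not five fighting for attention.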
The non-obvious bit · the LLM drives its own navigation
load_context isn't a routing decision our code makes. It's a tool the model calls. When a user in booking mode says "actually, can I buy a gift card?", the model decides to call load_context("gift_cards"). State updates. On the next turn, the middleware swaps the prompt module and filters the tools, and the model finds itself in a new room with new capabilities.
From the user's side: nothing happened. From our side: we wrote zero routing logic. The model navigates itself, because we gave it the ability and told it when to use it.
This is the opposite of the state graph. In the graph, we decided what transitions were allowed. Here, we decide what tools and instructions live in each room, and the model walks between rooms on its own. Adding a new context is a dict entry, a prompt module, and a list of tools. About an hour of work, versus two weeks of node surgery in the old system.
The router isn't gone. It moved inside the model.
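As a sketch, `load_context` can be as small as a state write plus a confirmation message. This assumes the agent framework dispatches tool calls to plain Python functions and injects a shared mutable state dict as the first argument; the function body and messages are illustrative, not Harbor's actual implementation.

```python
# Hypothetical sketch of load_context as a model-callable tool.
VALID_CONTEXTS = {"booking", "gift_cards", "account", "distributor", "membership"}

def load_context(state, context):
    """Tool the model calls to walk itself into a new room.

    No routing logic runs here: the middleware reads state["active_context"]
    on the *next* turn to swap the prompt module and filter the tool list.
    """
    if context not in VALID_CONTEXTS:
        return f"Unknown context '{context}'. Valid contexts: {sorted(VALID_CONTEXTS)}"
    state["active_context"] = context
    return f"Switched to {context}. The matching tools are available on your next step."
```

The invalid-context branch matters in practice: returning an error string (instead of raising) gives the model a readable correction it can act on in its next reasoning step.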
What I actually want you to take away
Every architecture in this post was a workaround for a limitation in the process of disappearing.
The graph existed because ReAct couldn't handle 53 tools in early 2025. The scoping architecture exists because even today, attention degrades when the menu gets too long. Both are compromises. Both are scaffolding around something the model, at this moment, can't quite do.
Here's the thing nobody writing about AI architecture wants to say out loud: the models are eating our compromises. The graph layer we spent four months on in mid-2025 is something a 2026-class model no longer needs. The scoping trick I'm writing about right now is something a 2027-class model probably won't need either. Every workaround you ship today has a half-life. You're building trellises for a plant growing faster than you can build.
This isn't a reason to stop building them. You need the trellis now. But it should change how you hold the work. Don't fall in love with your architecture. Don't let a clever middleware layer become part of your identity. The whole point of scaffolding is to be removed when the thing it's holding up can stand on its own.
Build honestly. Solve the problem you actually have with the models you actually have. Keep the scaffolding as light and removable as possible. Every line of architecture is a bet that the limitation it's working around will still be there tomorrow. Most of those bets will lose. That's fine. That's what progress looks like from the inside.
Scaling agents isn't about picking tools. It's about managing attention. Give the model a small menu, a concise sense of what else is out there, and the ability to walk itself between rooms, and it will stay where it belongs. For now.
Hold it lightly. The next model is coming.
Sources
- RAG-MCP: Mitigating Prompt Bloat in LLM Tool Selection via Retrieval-Augmented Generation (Gan & Sun, arXiv 2505.03275). The 13.62% → 43.13% accuracy jump, and the finding that tool definitions consume ~72% of agent context windows in typical MCP deployments.
- Research: Architecting Tools for AI Agents at Scale, by Grzegorz Ziółkowski (April 2026). The 8-vs-80 tools observation, GitHub's --dynamic-toolsets meta-tool pattern, and the STRAP tool-consolidation pattern.
- MCP Tool Overload: Why More Tools Make Your Agent Worse (Nebula, DEV.to). An accessible walkthrough of the same phenomenon for readers who want a less academic entry point.