Gen.Y.Sakai
Not Everything Needs MCP, Part 2: The 2026 Phase Transition — When Three Independent Roads Led to the Same Conclusion

The Ancient Past of Eighteen Months Ago — And What It Taught Us About the Future of AI Agents

Let me tell you a story from the ancient past.

By which I mean eighteen months ago.

In the world of AI, eighteen months is geological time. Think back to mid-2024. Context windows were small. "Prompt engineering" was the skill everyone was hiring for. MCP didn't exist yet. The idea of AI agents autonomously operating external services was mostly theoretical.

I was building a medical AI product in Osaka, Japan. And I had a problem that, looking back, contained the seed of everything that happened in 2026.

This is Part 2 of my "Not Everything Needs MCP" series. Part 1 told the story of Google Workspace CLI implementing a full MCP server, then deliberately deleting all 1,151 lines of it two days after launch. That investigation revealed an architectural mismatch between MCP's protocol design and large-scale APIs.

But that was only one data point. Since publishing that article, I discovered two more — and together, they tell a much bigger story about where AI agent architecture is heading in 2026.


The Timestamp Hack: Before MCP Had a Name

In early 2024, I was working on an AI assistant for my company's medical IT platform. We serve clinics across the Kansai region of Japan (Osaka and its neighboring prefectures), and I'd been using ChatGPT's Custom GPTs to prototype workflows.

I had a simple need: I wanted every AI response to include the exact timestamp of when the conversation happened. Not for fun — for traceability. In medical IT, knowing when a decision was discussed matters. It matters for audits. It matters for compliance. It turned out to matter for patent applications too.

Here's what I did. I deployed a tiny Web API on a server we host publicly. It did exactly one thing: return the current time. Then I configured the Custom GPT to call this API before every response, and output the timestamp first.

The result looked like this:

User: Hey, long time no see!
(Communicated with universalkarte.com)

🕐 Response time: 2025-04-02 09:39:00 (JST) / 2025-04-02 00:39:00 (UTC)

Oh wow, it's been a while! So great to hear from you! 😊

A web API that returns a timestamp. Called before every response. Output deterministically, every time. Nothing more, nothing less.
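The whole service fits in a few lines. This is not the original code, just a minimal sketch of the idea: one endpoint, one deterministic payload, nothing for the LLM to reason about.

```python
from datetime import datetime, timezone, timedelta

JST = timezone(timedelta(hours=9))  # Japan Standard Time, no DST

def current_time_payload() -> dict:
    """Return the one thing the Custom GPT needs: the current time."""
    now_utc = datetime.now(timezone.utc)
    now_jst = now_utc.astimezone(JST)
    fmt = "%Y-%m-%d %H:%M:%S"
    return {
        "jst": now_jst.strftime(fmt) + " (JST)",
        "utc": now_utc.strftime(fmt) + " (UTC)",
    }

# Wire this to any web framework; the handler body is just:
#     return json.dumps(current_time_payload())
```

The Custom GPT's only instruction is "call this before every response and print the result first." No reasoning, no interpretation, no room for hallucinated clocks.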

At the time, this was called "Function Calling" or "Tool Use" — the predecessors to what Anthropic would later formalize as MCP in November 2024. I didn't know I was implementing a pattern that would become the center of a protocol war. I just needed a clock.

But here's what matters: the design decision I made instinctively was to keep the external call as small and deterministic as possible. One API. One purpose. Minimal payload. The LLM didn't need to understand time zones or server infrastructure — it just needed to paste the result.

It wasn't a "hack" because I was lazy. It was an architectural instinct: keep the LLM away from what the system already knows. Deterministic output for a deterministic need. Don't make the AI think about the time — just give it the time.

Looking back now, eighteen months later, it turns out this minimal pattern — one deterministic call, zero reasoning overhead — was already the architecture that the rest of the industry would independently converge on. I didn't see it that way at the time. I was just solving a problem.


The MCP Honeymoon — And the Hangover

November 2024. Anthropic open-sourced MCP. By February 2025, Google and others rushed to announce MCP support. The community was electric. Finally, a standard protocol for connecting LLMs to external tools!

I dove in immediately. I connected MCP servers for GitHub, for databases, for various services. Context windows were getting larger. The future felt bright.

And at first, it was genuinely impressive. GitHub operations that used to require manual terminal commands — commits with thoughtful messages, PR creation, branch management — the AI handled them smoothly through MCP. I felt the productivity gains. They were real.

But then something else started happening.

The AI started getting... dumber.

Not in the "wrong answer" sense. In fact, the AI got better at executing tasks exactly as intended — MCP meant it could commit code, create PRs, and query databases with precision. But something subtler was degrading. The quality of reasoning. The ability to take a vague idea and turn it into a structured thought. What I call "zero-to-one thinking" — the creative, synthetic part of working with an LLM.

I spent the second half of 2025 with this nagging feeling. More tools, more capabilities, but less... intelligence. More precise in execution, less insightful in thought. I kept thinking: "I wish context windows would just get bigger so this wouldn't matter." But I also suspected that bigger windows alone wouldn't fix it — the AI would probably just get confused in different ways.

I couldn't quantify this feeling at the time. But I now know that researchers were documenting exactly what I was experiencing.


The Science Behind "Getting Dumber"

It turns out my gut feeling had a name: context rot.

Here's what researchers found — and why it matters for anyone loading MCP servers into their workflow:

| Research | Key Finding |
| --- | --- |
| Context Rot (Chroma Research) | Irrelevant context degrades reasoning first. Retrieval survives; thinking dies. |
| Reasoning Degradation with Long Context Windows (14-model benchmark) | Reasoning ability decays as a function of input size — even when the model can still find the right information. |
| Maximum Effective Context Window (Paulsen, 2025) | The actual usable window is up to 99% smaller than advertised. Severe degradation at just 1,000 tokens in some top models. |
| Fundamental Limits of LLMs at Scale (arXiv, 2026) | Context compression, reasoning degradation, and retrieval fragility are proven architectural ceilings — not bugs to be patched. |

Let me unpack why this hits MCP users so hard.

Chroma Research showed that as irrelevant context increases in an LLM's input, performance degrades — and the degradation is worse when the task requires genuine reasoning rather than simple retrieval. The less obvious the connection between question and answer, the more devastating the irrelevant context becomes.

The "Challenging LLMs Beyond Information Retrieval" study tested 14 different LLMs and demonstrated that reasoning ability degrades as a function of input size — even when the model can still find the right information. Information retrieval and reasoning are different capabilities, and reasoning breaks first.

And here's the connection to MCP that makes this personal:

A single popular MCP server like Playwright contains 21 tools. Just the definitions of those tools — names, descriptions, parameter schemas — consume over 11,700 tokens. And these definitions are included in every single message, whether you use the tools or not.

Now multiply that by 10 MCP servers. You've burned 100,000+ tokens on tool definitions alone. Add system prompts and conversation history, and your 200k context window is effectively the 70k that power users report. And it's not just smaller — it's polluted with information that actively degrades the model's ability to reason about the thing you actually asked it to do.
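The arithmetic is worth making explicit. The ~11,700-token figure is the measured Playwright number from above; using it as a rough per-server average for all ten servers is an assumption for illustration.

```python
# Back-of-the-envelope context budget, using the figures from the text.
# ~11,700 tokens is the measured Playwright overhead; treating it as a
# per-server average across ten servers is an illustrative assumption.
CONTEXT_WINDOW = 200_000
TOKENS_PER_SERVER = 11_700
SERVERS = 10

definitions = TOKENS_PER_SERVER * SERVERS  # paid on EVERY message, used or not
remaining = CONTEXT_WINDOW - definitions

print(f"tool definitions: {definitions:,} tokens")   # 117,000 tokens
print(f"left for everything else: {remaining:,}")    # 83,000 tokens
```

And that 83k still has to hold the system prompt, the conversation so far, and the actual task, before a single token of reasoning happens.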

This is what I felt. The AI wasn't broken. It was drowning. More tools meant more noise in the signal. More capability meant less room to think.


The 15,000-Character Prompt and the Limits of "Prompt Engineering"

While I was wrestling with MCP overhead, I was also building an AI-powered tool — essentially a converter that takes ambiguous, unstructured text input and generates structured, formatted output. Think of it as a bridge between how humans naturally communicate and how systems need to receive data.

The core of this tool is a system prompt. That prompt went through dozens of iterations. At its peak, it was 20,000 characters. I tested, compared outputs, and eventually settled on 15,000 characters.

15,000 characters of instructions. For a single task.

The whole time, a thought kept nagging me: "Would a human expert need 15,000 characters of instructions to do this job?" A domain specialist would need maybe a paragraph of guidance. The rest is knowledge they already have — accumulated through years of working in their field.

And that's when "prompt engineering" started feeling like what it really was: a brute-force workaround for the absence of domain expertise in the model's operating context.

But here's the twist. Despite the bloated prompt, the tool worked. Output quality stayed consistent and reliable. Why?

Because I had constrained the domain. The tool operated within a specific industry workflow — a narrow slice of reality with its own vocabulary, its own established patterns, its own expected output formats. By telling the LLM upfront "you are operating within this domain," the massive prompt became effective.

If you've ever worked with LLMs, you already know this intuitively: a purely descriptive, narrative-style prompt — no matter how long — doesn't guarantee output quality. We've all been there. But a prompt that constrains the domain changes the game.

Here's why, and you don't need a PhD to see it. Think about what's happening inside a Transformer model. The attention mechanism operates on an enormous matrix — in large models, tens of thousands of dimensions. Every token is trying to figure out which other tokens matter. When the domain is wide open, the model is searching for relevance across a vast, noisy space. The outputs fluctuate. The reasoning wanders. Anyone who's done even basic linear algebra — even 3×3 matrices in high school — can imagine what happens when you scale that uncertainty to tens of thousands of dimensions. Of course the output changes every time.

But constrain the domain, and you dramatically narrow where the model needs to look. The relevant vectors cluster. The gap between what the model retrieves and what the human intended shrinks toward zero. Domain limitation doesn't just help. It's the mechanism by which prompts actually work.

This taught me something that would later click into place: domain limitation is the real optimization. Not longer prompts. Not bigger context windows. Narrower scope.

And if that's true for prompts, shouldn't the same principle apply to how we design AI agents?


From Prompt Engineering to Architecture Engineering

As the tool matured, the architecture evolved in a direction I didn't fully appreciate at the time.

The initial version was pure prompt — a single, monolithic instruction set that did everything through LLM reasoning. Unstructured text in, structured text out.

But the real world isn't one output format. My domain required multiple types of structured documents — each with its own format, its own required fields, its own regulatory and compliance requirements. The number of output variations kept growing.

Trying to handle all of these through prompt engineering alone was... well, it was exactly the "spread the entire menu on the table" problem from Part 1.

So the architecture shifted. The LLM's output became fully structured JSON — deterministic, parseable, machine-readable. Document generation moved to Google Workspace via GCP. The LLM's job narrowed to what it's actually good at: understanding the input, extracting the meaning, structuring the reasoning. Everything else — formatting, template selection, compliance checks, document assembly — moved to deterministic systems.

The LLM handles the ambiguous. Deterministic systems handle the deterministic.
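In code, the split looks something like this. The schema and field names here are hypothetical, not my production system; the point is where the boundary sits: the LLM emits JSON, and everything after the parse is deterministic.

```python
import json

# Fields the LLM must emit -- a hypothetical schema for illustration.
REQUIRED_FIELDS = {"doc_type", "patient_summary", "findings"}
KNOWN_DOC_TYPES = {"referral_letter", "visit_summary"}

def validate(llm_output: str) -> dict:
    """Gate between the probabilistic and deterministic halves of the pipeline."""
    data = json.loads(llm_output)          # malformed JSON fails loudly here
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        raise ValueError(f"LLM output missing fields: {sorted(missing)}")
    if data["doc_type"] not in KNOWN_DOC_TYPES:
        raise ValueError(f"Unknown doc_type: {data['doc_type']}")
    return data

def render(doc: dict) -> str:
    """Deterministic document assembly -- template selection, no LLM involved."""
    template = "{doc_type}: {patient_summary} / findings: {findings}"
    return template.format(**doc)
```

From here, `render` would hand off to the real template engine or Workspace API. The LLM never sees a formatting rule, and the compliance checks never see a probability distribution.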

I was doing this throughout 2025, iterating toward an architecture where AI reasoning and programmatic execution were cleanly separated. And I kept thinking about Google Workspace — if only there were a way to programmatically drive every Workspace API from the command line, it would be the perfect backend for the document generation pipeline...


And Then GWS Appeared

March 2026. Google released gws — Google Workspace CLI. A Rust-based CLI that covers nearly every Google Workspace API, with commands dynamically generated from Google's Discovery Service.

When I saw the announcement, my reaction was immediate: "This is it. This is what I've been waiting for."

A CLI that could drive Gmail, Drive, Docs, Sheets, Calendar — all from the command line, all returning structured JSON. Perfect for my document generation pipeline. Perfect for AI agent integration.

And then I noticed the articles mentioning MCP support. Perfect! I could connect it directly to—

$ gws mcp
{
  "error": {
    "code": 400,
    "message": "Unknown service 'mcp'."
  }
}

You know the rest. That investigation became Part 1. Google had implemented a full MCP server — 1,151 lines of Rust — then deliberately deleted it as a breaking change. Two days after launch.

At the time, I focused on the forensic story: what happened, why, and what it meant for tool design. But the deeper significance only hit me later.

Google didn't just remove MCP. Google arrived at the same architectural conclusion I had been groping toward with my own product — that for large-scale operations, the right pattern is CLI-first with structured output, not protocol-mediated tool discovery. "Order from the kitchen when you're hungry" beats "spread the entire menu on the table."

That was two independent arrivals at the same destination.

Then I found the third.


The Hackathon Winner's Blueprint

A few days after publishing Part 1, I came across the everything-claude-code repository by Affaan Mustafa (@affaanmustafa). Affaan won the Anthropic × Forum Ventures hackathon in NYC, building zenith.chat entirely with Claude Code in 8 hours. His repository — 77,000+ stars, 640+ commits, 76 contributors — packages 10+ months of daily Claude Code usage into a complete agent configuration system.

I started reading it out of curiosity. Within minutes, I was sitting bolt upright.

The philosophy was identical to what I'd been building independently.

Let me show you the parallels.

MCP: Deliberately Minimized

From Affaan's guide:

"Your 200k context window before compacting might only be 70k with too many tools enabled."

His rule of thumb: have 20–30 MCPs configured, but keep under 10 enabled and under 80 tools active. The repository includes mcp-configs/mcp-servers.json with explicit disabledMcpServers entries — actively turning off MCP servers to protect context space.

This is exactly what Google concluded with gws. And exactly what I experienced — more tools, less thinking room.

CLI Skills as MCP Replacements

From Affaan's longform guide:

"Instead of having the GitHub MCP loaded at all times, create a /gh-pr command that wraps gh pr create with your preferred options. Instead of the Supabase MCP eating context, create skills that use the Supabase CLI directly. The functionality is the same, the convenience is similar, but your context window is freed up for actual work."

Skills in Claude Code are Markdown files — tiny prompt templates that load only when invoked. A /gh-pr skill might be 200 tokens. The GitHub MCP server's tool definitions are thousands. Same functionality. Orders of magnitude less context consumption.
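The same trade can be sketched in plain code. This is a hypothetical wrapper, not Affaan's actual skill file: instead of carrying thousands of tokens of GitHub MCP tool schemas, a skill collapses the decision down to one known-good command line.

```python
import subprocess

def gh_pr_create(title: str, body: str, draft: bool = False) -> list[str]:
    """Build the exact `gh pr create` invocation a /gh-pr skill would run.

    The preferred options are baked in, so the model never has to reason
    about GitHub's API surface -- it just fills in title and body.
    """
    cmd = ["gh", "pr", "create", "--title", title, "--body", body,
           "--base", "main", "--assignee", "@me"]
    if draft:
        cmd.append("--draft")
    return cmd

# To actually run it (requires the gh CLI installed and authenticated):
#     subprocess.run(gh_pr_create("fix: auth timeout", "Details..."), check=True)
```

Two free parameters instead of a full tool schema. Everything else is a decision made once, at design time, outside the context window.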

This is the "kitchen model" from Part 1, independently rediscovered by a power user.

Domain Expert Agents

The repository is organized into specialized subagents: planner.md, code-reviewer.md, tdd-guide.md, security-reviewer.md, build-error-resolver.md. Each agent has a narrow scope, specific tools, and defined behaviors.

This mirrors what I learned from my own product development — that established industries organize into specialties for a reason, and AI should follow the same principle. You don't ask a generalist to do a specialist's job. You don't ask a general-purpose agent to handle security review when a specialized security-reviewer agent would be more precise and use less context.

Context Hygiene as First Principle

Affaan's system includes automatic compaction hooks, session memory persistence, and strategic context management. The entire architecture is built around one principle: protect the context window for reasoning.

Not storage. Not tool definitions. Reasoning.


The Convergence

So here's what happened in 2026:

Google — a trillion-dollar company with the largest productivity API surface in the world — implemented MCP, stress-tested it against 200–400 tool definitions, and deleted it. Their conclusion: CLI-first with on-demand schema discovery. Context stays clean.

Affaan Mustafa — an individual developer who won an AI hackathon and spent 10+ months refining his workflow — independently concluded that MCP should be minimized, replaced with CLI skills where possible, and the context window should be protected for reasoning above all else.

I — a medical IT veteran building AI-powered tools in Japan — arrived at the same architecture through a completely different path. A timestamp API in 2024. The "getting dumber" experience in 2025. A product's evolution from monolithic prompt to JSON + deterministic pipeline. And then the forensic discovery of Google's MCP deletion.

Three different starting points. Three different domains. Three different scales. The same conclusion.

That's not coincidence. That's a phase transition.


What the 2026 Phase Transition Actually Means

When people talk about AI milestones, they usually mean model capabilities. GPT-4. Claude 3. Gemini Ultra. Bigger context windows. Better benchmarks.

But the real phase transition of 2026 isn't about model capabilities. It's about how we architect around the capabilities we already have.

The shift can be summarized in one sentence:

"Do it for me" is expensive. "Do this specific thing" is cheap.

Every token spent on tool definitions, prompt engineering, and ambiguous instructions is a token not spent on reasoning. And the research confirms what practitioners have been feeling: irrelevant context doesn't just waste space — it actively degrades the model's ability to think.

Here's what that means in practice:

The end of "prompt engineering" as we knew it. A 15,000-character prompt is a confession that we're compensating for missing architecture. The future is narrower prompts, domain-specific skills, and deterministic systems handling everything that doesn't require reasoning.

MCP is not dead — it's bounded. MCP remains excellent for small-to-medium tool sets (under 50 tools). But for large API surfaces, CLI-first is the proven pattern. The "everything via MCP" fantasy is over.

"Skills" are the new unit of AI agent design. Whether you call them Skills (Affaan), Agent Skills (Google), or domain-specific prompts (what I've been doing with my own tools), the pattern is the same: small, scoped, loaded on demand, discarded after use.

Context windows are not memory — they're working memory. Treating the context window as storage is like covering your entire desk with every book you own before you even pick up a pen. You haven't left any room to actually write. The desk needs to be clear for thinking — and every MCP tool definition, every bloated prompt, every retained conversation turn is another book on the pile.


The Human Parallel (Or: Why "Do It For Me" Was Always Expensive)

There's an observation I keep coming back to, and it's one that makes me laugh every time.

Consider how humans delegate work:

Boss: "Handle this, will you?"
Employee: (Internal monologue: What exactly? By when? In what format? Who approved this? What's the budget?) → 10 rounds of clarification follow.

Now consider the alternative:

Boss: "Run git commit -m 'fix: resolve auth timeout' && git push origin main."
Employee: Done. One round. Zero ambiguity.

The first conversation — the "human" one — requires the employee to infer intent, plan actions, select tools, estimate parameters, and verify assumptions. Every step of that inference costs time and mental bandwidth.

In LLM terms, every step of that inference costs tokens.

MCP tool definitions are the LLM equivalent of "let me explain everything you might possibly need to know before we start." CLI commands are the equivalent of "just do this one thing."

What the token economy has done — accidentally, beautifully — is make the cost of human communication ambiguity visible as a number. Every vague instruction, every "you know what I mean," every "figure it out" translates directly to token consumption that crowds out actual reasoning.

As someone with forty-plus years of programming experience — from assembly language to LLMs — I find this deeply ironic. We spent decades making computers understand human language. Now we're learning that the most efficient way to use language-understanding computers is... to give them precise, unambiguous commands. Like assembly language. Like CLI.

The wheel doesn't just turn. It circles back to the truth.


What Comes Next

If the pattern holds, the next phase is already emerging.

Domain-specific agent languages. Not natural language prompts. Not traditional programming languages. Something in between — structured enough for deterministic execution, flexible enough for AI reasoning. We're already seeing DSLs for agent workflows (LangGraph's graph definitions), constrained syntax languages designed for LLM generation, and YAML/JSON-based knowledge objects.

Agent architecture as a discipline. "Prompt engineer" was the job title of 2024. The 2026 equivalent is closer to "Agent Architect" or "Domain Skill Designer" — someone who understands how to decompose workflows into deterministic and non-deterministic components, and how to allocate context window real estate accordingly.

Domain specialization as a design principle. This is my domain bias speaking — I come from medical IT, where specialization has been refined over centuries. There's a reason medicine has cardiologists and dermatologists. It isn't bureaucratic — it's cognitive. A specialist holds deep domain knowledge that makes their work faster, more accurate, and more reliable. I believe AI agents should be organized the same way. Not one giant model that knows everything. A team of specialists, each with their own skills, routing tasks to the right expert. Every industry has its own version of "specialties." The principle is universal.


Closing

In Part 1, I wrote: "If you write about an OSS tool, run it first."

In Part 2, the lesson is different:

If three independent paths converge on the same conclusion, pay attention.

Google didn't read Affaan's guide before deleting MCP from gws. Affaan didn't study my architecture before recommending CLI skills over MCP. I didn't know about either of them when I built a timestamp API in 2024 and started separating deterministic from non-deterministic processing.

We all arrived at the same place: protect the context window for reasoning. Push everything deterministic to CLI, scripts, and structured pipelines. Load skills on demand. Discard them when done. Let the AI think.

That convergence — from a trillion-dollar company, a hackathon winner, and someone who's been writing code since assembly language was the only option — is what makes 2026 a phase transition.

Not because the models got better. Because we finally learned how to stop wasting them.


Try It Yourself

If you want to feel what "the 2026 phase transition" means in practice rather than just reading about it, the fastest way is to inject Affaan's system into your own Claude Code environment.

I did it myself. The difference was immediate — sessions stayed coherent longer, context stopped rotting mid-task, and the AI's reasoning felt sharper in ways that are hard to quantify but impossible to miss once you've experienced them.

The quickest path — install as a Plugin directly inside Claude Code:

# Inside Claude Code
/plugin marketplace add affaan-m/everything-claude-code
/plugin install everything-claude-code@everything-claude-code

That alone gives you the commands, skills, and hooks. You'll notice the difference.

For the full setup including rules and language-specific configurations:

git clone https://github.com/affaan-m/everything-claude-code.git
cd everything-claude-code
./install.sh typescript   # or: python / golang / rust

You don't need to install everything. Start with the plugin. Use it for a day. Pay attention to how long your sessions stay productive before context degrades. Compare it to yesterday.

I suspect you'll have your own moment of convergence — your own version of the realization that Google, Affaan, and I all had independently. That the bottleneck was never the model. It was how much of the context window we were wasting on everything except thinking.

Your setup is different from mine. Your domain is different. But the principle is the same.

Let the AI think.

And if this feels familiar —

it is.

