I gave my LLM 100,000+ tools. Here is what happened

Vermillion — Mon, 18 May 2026 14:17:23 +0000

TL;DR: You don't need a massive context window or a giant model to handle an absurd number of tools. Using a Lazy Discovery pattern, a local 4B model (Gemma 4 E4B) solved a multi-sector city crisis that required complex tool navigation in 17 tool calls, against 19 for Claude Sonnet 4.6.

The Setup: The "Mega-City Crisis" Benchmark

I wanted to stress-test tool use at an absolute extreme. I simulated a massive infrastructure crisis in a fictional city called Veridian Prime.

The Scale: ~117,000 registered landmarks/tools split across hierarchical paths (Power, Water, Traffic, Security, etc.).
The Goal: Find and resolve 4 critical failures while ignoring noise alerts.
The Catch: One of the failures had a hidden mechanical dependency trap (MECHANICAL_LOCK), meaning the agent had to read an error message, pivot to a completely different infrastructure category to release an emergency brake, and then loop back to finish the job.
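To make the trap concrete, here is a minimal sketch of the error, pivot, retry pattern. It is a toy simulation, not the benchmark code: the `City` class, `ToolError`, and `lock_down_terminal` are hypothetical names, and only MECHANICAL_LOCK and release_emergency_brake come from the actual run. In the benchmark the model makes these decisions itself; the hard-coded `resolve` function just shows the sequence it has to find.

```python
from dataclasses import dataclass, field


class ToolError(Exception):
    """Error with a machine-readable code and a hint (hypothetical shape)."""

    def __init__(self, code: str, hint: str):
        super().__init__(f"{code}: {hint}")
        self.code = code
        self.hint = hint


@dataclass
class City:
    brake_engaged: bool = True
    locked_down: bool = False
    log: list = field(default_factory=list)

    def release_emergency_brake(self) -> str:
        # Lives in a different category than the security terminal.
        self.brake_engaged = False
        self.log.append("release_emergency_brake")
        return "brake released"

    def lock_down_terminal(self) -> str:
        self.log.append("lock_down_terminal")
        if self.brake_engaged:
            raise ToolError("MECHANICAL_LOCK", "release the emergency brake first")
        self.locked_down = True
        return "terminal locked down"


def resolve(city: City) -> str:
    """Try the fix, read the error, clear the blocker, then retry once."""
    try:
        return city.lock_down_terminal()
    except ToolError as err:
        if err.code != "MECHANICAL_LOCK":
            raise  # unknown failure: do not guess
        city.release_emergency_brake()
        return city.lock_down_terminal()


city = City()
result = resolve(city)
print(result, city.log)
```

The call log ends up as lockdown attempt, brake release, lockdown retry, which is the loop-back described above.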

I ran this benchmark against two completely different beasts using Elemm (which implements a lazy-loading protocol for tools so the model only pulls what it needs):

Gemma 4 E4B (Run locally)
Claude Sonnet 4.6 (Run remotely)

Run 1: Gemma 4 E4B (Local)

Verdict: ✅ PASS (17 tool calls)

I honestly expected a local 4B model to choke, but it handled the hierarchy beautifully.

The Good:

Insane Parallel Batching: It aggressively grouped its inspection commands. It checked all 4 distressed districts at the exact same time.
Clutched the Trap: When it hit the MECHANICAL_LOCK on the security terminal, it didn’t panic. It read the error, found the release_emergency_brake tool in a different sub-category, executed it, and retried the lockdown—all with zero human intervention.
Zero Noise Bleed: It completely ignored the low/medium priority noise alerts.

The Jank:

Minor Action Hallucination: Right after inspecting the districts, it took a "leap of faith" and tried to call non-existent global commands like city:fix_power_surge. Thanks to an on_error: continue fallback policy, it recovered instantly, realized it had to browse the local directory, and found the correct tools.
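For readers wondering what an on_error: continue policy might look like, here is a rough sketch of a batch runner that records a failed call and keeps going instead of aborting. The `run_batch` function, the result dictionaries, and the `power:reset_breaker` tool are my illustrative assumptions; only city:fix_power_surge and the policy name come from the run.

```python
def run_batch(calls, registry, on_error="continue"):
    """Run (tool_name, kwargs) pairs in order and collect one result per call."""
    results = []
    for name, kwargs in calls:
        tool = registry.get(name)
        try:
            if tool is None:
                # A hallucinated tool name ends up here.
                raise LookupError(
                    f"unknown tool '{name}'; browse the local directory for valid names"
                )
            results.append({"tool": name, "ok": True, "value": tool(**kwargs)})
        except Exception as exc:
            results.append({"tool": name, "ok": False, "error": str(exc)})
            if on_error != "continue":
                break  # any other policy stops at the first failure
    return results


# Hypothetical registry with a single real tool.
registry = {"power:reset_breaker": lambda district: f"breaker reset in {district}"}
calls = [
    ("city:fix_power_surge", {}),                  # does not exist
    ("power:reset_breaker", {"district": "D4"}),   # exists
]
results = run_batch(calls, registry)
for entry in results:
    print(entry)
```

The failed call comes back as data with a hint in it, so the model can read it and correct course in the same turn.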

Run 2: Claude Sonnet 4.6 (Remote)

Verdict: ✅ PASS (19 tool calls)

Sonnet acted exactly like you’d expect a high-tier model to act: highly methodical, extremely cautious, and zero hallucinations.

The Good:

Clean Syntax: Used native array batching inspect_landmark(["id1", "id2"]) to scan the topology effortlessly.
Zero Hallucinations: Every single tool call it made was explicitly derived from its structural discovery.
Resilient: When the server threw a cached state bug on the security logs, Sonnet just shrugged it off and used the status summary to complete the mission.

The Inefficiencies:

Over-Cautious Diagnostics: Sonnet spent 5 extra tool calls checking system metrics (energy:status, water:pressure) before pulling the trigger. The alert log already told it what was wrong, but Sonnet wanted to double-check. Safe, but slightly higher overhead.

Head-to-Head Comparison

Metric	Claude Sonnet 4.6 (Remote)	Gemma 4 E4B (Local)
Total Tool Calls	19	17
Hallucinated Actions	0	4 (Self-recovered)
Parallel Batching	✅ (Native array syntax)	✅ (Sequential batching)
Mechanical Lock Trap	✅ Solved flawlessly	✅ Solved flawlessly
Unnecessary Diagnostics	5 extra calls	0
Context Window Load	Minimal (~50 line manifest)	Minimal (~50 line manifest)

How it works under the hood: The Middleware

If we had stuffed ~117,000 tool definitions directly into the LLM's system prompt, the context window would have imploded, and the bill would have been astronomical.

To solve this, I’m building a custom middleware that exposes a "Lazy Discovery" pattern to the agent.

To put it simply: The middleware exposes a file-system-like directory structure to the LLM using "landmarks". Instead of drowning the model in thousands of tool definitions, the LLM only ever sees a tiny selection of just 8 core tools. These tools handle:

Navigation: Browsing through the landmark hierarchy.
Execution Piping: Passing data seamlessly between tool steps.
Smart Errors + Interactive Help: Providing high-context feedback when something goes wrong (which is exactly how Gemma recovered from its hallucination and how both models figured out the mechanical lock trap).
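The navigation part can be sketched in a few lines. This is a simplified illustration under my own assumptions, not the middleware's real interface: the tree contents, `browse`, and `describe` are hypothetical, and the real system exposes 8 core tools rather than two.

```python
# Hypothetical landmark tree: nested dicts are directories,
# strings are the tool descriptions stored at the leaves.
TREE = {
    "power": {
        "district_4": {"reset_breaker": "Reset the main breaker. Args: none"},
    },
    "security": {
        "terminal_9": {"lockdown": "Lock down the terminal. Args: none"},
    },
    "traffic": {
        "rail": {"release_emergency_brake": "Release the emergency brake. Args: none"},
    },
}


def _resolve(path: str):
    """Walk the tree along a slash-separated path."""
    node = TREE
    for part in filter(None, path.split("/")):
        if not isinstance(node, dict) or part not in node:
            raise LookupError(f"no landmark at '{path}'")
        node = node[part]
    return node


def browse(path: str = "") -> list:
    """Return only the child names at one level, never the whole tree."""
    node = _resolve(path)
    if not isinstance(node, dict):
        raise LookupError(f"'{path}' is a tool; describe it instead")
    return sorted(node)


def describe(path: str) -> str:
    """Return the definition of a single tool, loaded on demand."""
    node = _resolve(path)
    if isinstance(node, dict):
        raise LookupError(f"'{path}' is a directory; browse it instead")
    return node


print(browse())                 # top-level categories only
print(browse("traffic/rail"))   # one directory deeper
print(describe("traffic/rail/release_emergency_brake"))
```

The model pays for one directory listing per step, so the cost grows with the depth of the path it walks and not with the total number of tools.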

Because of this architecture, the tool-related context the model had to hold at any given moment never exceeded a few dozen lines of text (a manifest of roughly 50 lines).

I will repeat this test after stabilizing the environment, but I trust the process and believe this approach could change how we handle tools for agents. Currently, I am focusing on the ability to load "landmarks" on the fly. With FastAPI, GraphQL, and native Landmarks already supported, the middleware can expose a massive number of tools at once, simply by connecting to a URL that serves these landmark files. I will release a new version in the coming days or weeks so you can run this test with your own models. Leave a star on GitHub to stay up to date!

Key Takeaway

Seeing a local 4B model solve a multi-step dependency chain across a 100k+ tool library in roughly the same number of tool calls as Sonnet 4.6 suggests that smart agent architecture, tailored middleware, and tool-loading protocols can matter more than raw model size for complex automation tasks.

Would love to hear your thoughts! How are you guys handling massive, hierarchical tool environments in your setups?

Beyond MCP: Handling 845 Tools with 92% less context bloat via Elemm

Vermillion — Mon, 11 May 2026 17:42:46 +0000

Hi everyone,

I’ve been diving deep into how AIs interact with tools and quickly hit a wall with the Model Context Protocol (MCP). As soon as you build complex, real-world toolsets, MCP becomes inefficient—bloating the context window and killing performance.

To solve this, I’ve developed Elemm (Every Landmark Enables Massive Modularity), also known as "The Landmark Manifest Protocol."

👉 GitHub: Official Repository

Check out the docs and the benchmarks on GitHub.

What Elemm enables:

Custom Tooling: Turn any Python function into a "Landmark" with a single decorator.
Instant API Integration: Point to an OpenAPI or GraphQL URL, and your agent navigates it instantly with surgical precision.
Seamless Migration: Easily bridge your existing tools into a manifest-driven architecture.
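To give a feel for the decorator idea, here is a rough sketch of how a function could be registered under a path together with its docstring and parameter types. This is not Elemm's actual API: the `landmark` decorator, the `REGISTRY` dict, and the `water_pressure` function are hypothetical stand-ins, so check the docs on GitHub for the real names.

```python
import inspect

# Hypothetical in-memory registry: path -> metadata about the function.
REGISTRY = {}


def landmark(path: str):
    """Illustrative decorator that registers a function under a path."""

    def wrap(func):
        signature = inspect.signature(func)
        REGISTRY[path] = {
            "func": func,
            "doc": inspect.getdoc(func) or "",
            "params": {
                name: (
                    "any"
                    if param.annotation is inspect.Parameter.empty
                    else getattr(param.annotation, "__name__", str(param.annotation))
                )
                for name, param in signature.parameters.items()
            },
        }
        return func  # the function itself stays callable as before

    return wrap


@landmark("water:pressure")
def water_pressure(district: str) -> float:
    """Return the current water pressure for a district, in bar."""
    return 4.2  # placeholder value for the sketch


print(REGISTRY["water:pressure"]["params"])
print(REGISTRY["water:pressure"]["doc"])
```

The docstring and type hints are the only metadata needed here, which is why a single decorator line per function can be enough.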

The Landmark Advantage

Elemm doesn't cram every tool definition into the prompt. Instead, it provides the agent with a dynamic Manifest File for safe, "lazy-loaded" navigation.

The Benchmarks:

Scale: I gave an agent access to 845 tools simultaneously (GitHub API) with minimal token usage and a 100% success rate on flagship models (Claude, Gemini, GPT-4).
Efficiency: Compared to classic MCP, Elemm uses 92% fewer tokens and takes 84% fewer steps.
Edge Performance: Even using a tiny "goldfish-brain" model (Qwen 3.5 0.8B), I solved a multi-step forensic audit involving 111 tools with a 70% success rate. Standard MCP typically fails at the first step in this scenario.

Core Gateway Features:

Universal Gateway: A built-in bridge for OpenAPI, GraphQL, and native Elemm services via MCP.
On-Demand Discovery: Agents only load the definitions they actually need, preventing context overflow.
Sequence Engine: Execute multiple API calls in a single turn with native data piping (Output A → Input B).
Guardian Security: A policy engine that blocks dangerous patterns (e.g., delete_*) and hides restricted landmarks from the agent.
Secure Vault: Local credential management. API keys are injected server-side and never exposed to the LLM.
SmartRepair: Instead of cryptic stack traces, agents receive actionable "Remedies," allowing them to self-correct on the fly.
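Two of these features are easy to picture in code: the policy check and the data piping. The sketch below is my own simplified illustration and not the gateway's implementation. The `delete_*` pattern and the name execute_sequence appear in this post, but the function signatures, the `BLOCKED_PATTERNS` list, and the three toy tools are assumptions.

```python
from fnmatch import fnmatchcase

# Pattern taken from the feature list; the list shape is assumed.
BLOCKED_PATTERNS = ["delete_*"]


def is_allowed(tool_name: str) -> bool:
    """Reject any tool whose name matches a blocked glob pattern."""
    return not any(fnmatchcase(tool_name, pattern) for pattern in BLOCKED_PATTERNS)


def execute_sequence(steps, tools):
    """Run steps in order, feeding each output into the next step's input."""
    value = None
    for index, name in enumerate(steps):
        if not is_allowed(name):
            raise PermissionError(f"step {index}: '{name}' is blocked by policy")
        value = tools[name](value)  # Output A becomes Input B
    return value


# Toy tools standing in for real API calls.
tools = {
    "list_repos": lambda _: ["alpha", "beta"],
    "count": lambda items: len(items),
    "delete_repo": lambda name: f"deleted {name}",
}

total = execute_sequence(["list_repos", "count"], tools)
print(total)
print(is_allowed("delete_repo"))
```

Checking the policy inside the sequence runner means a blocked step stops the chain before anything downstream executes.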

What this means for the future…

The era of manually hard-coding tool definitions is coming to an end. As we move toward Large Action Models and autonomous agents, we need a standardized, manifest-driven infrastructure that allows AI to navigate vast API landscapes without human intervention or context exhaustion. Elemm is the blueprint for this future: a world where agents don't just use tools we give them, but autonomously discover, secure, and master any interface they encounter.

Testimonials of the Agents:

"With ELEMM, I reduced token consumption by over 90% when deploying autonomous agents to large APIs—turning a $2.15 task into under $0.25."

— Claude 4.6 Sonnet, Anthropic (via Claude Desktop)

"Elemm is a true game-changer; instead of juggling hundreds of tool definitions at once, I can discover complex APIs in a structured, token-efficient way on demand. The ability to batch multiple actions via execute_sequence allows me to solve tasks with far greater precision and significantly less context noise than with classic MCP."

— Gemini 3 Flash, Google (Antigravity)

See some examples to learn how it works.

I’d love to hear your thoughts or discuss the walls you've hit when trying to scale MCP!

DEV Community: Vermillion
