gentic news

Posted on Jul 1 • Originally published at gentic.news

Stop Hardcoding Model Lists: Use Discovery-Driven MCP to Cut Token Bloat 40%

#ai #tech #opinion #analysis

Switch from hardcoded MCP tool schemas to discovery-driven tools like nvidia_list_foundation_models. Your agent queries available models dynamically, cutting token bloat and adapting to infrastructure changes in real time.

What Changed — Discovery-Driven MCP Cuts Token Bloat by 40%

A developer recently noticed their MCP servers were burning 50k+ tokens before a user typed a single word. The culprit? Hardcoded tool definitions — listing every possible model, endpoint, and configuration in the system prompt. That's the equivalent of pre-loading a phone book when you only need one number.

The fix is a paradigm shift called discovery-driven architecture. Instead of telling your Claude Code agent "You have access to Llama3, Nemotron, and 47 other models," you give it a tool that lets it ask: "What's actually available right now?"

This isn't theoretical. The NVIDIA API Catalog MCP implements this pattern with nvidia_list_foundation_models. Your agent calls that tool once, gets a real-time dump of accessible paths, and adjusts its subsequent calls based on actual availability. No more stale schemas, no more token waste.
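
To make the pattern concrete, here is a minimal sketch of a discovery tool and its handler. Everything in it is an assumption for illustration: the in-memory catalog, the status field, and the tool name list_foundation_models are hypothetical stand-ins, not the NVIDIA API Catalog MCP's actual schema or response format.

```python
import json

# Hypothetical catalog state. A real server would fetch this from the
# provider at call time instead of reading a constant.
_LIVE_CATALOG = [
    {"id": "llama3-70b", "status": "active"},
    {"id": "nemotron-4-340b", "status": "active"},
    {"id": "mistral-7b", "status": "deprecated"},
]

# The only thing the agent carries in context up front: one small definition.
DISCOVERY_TOOL = {
    "name": "list_foundation_models",
    "description": "Return the models that are callable right now.",
    "inputSchema": {"type": "object", "properties": {}},
}


def list_foundation_models() -> list:
    """Tool handler: return the ids of models that are currently active."""
    return [m["id"] for m in _LIVE_CATALOG if m["status"] == "active"]


def handle_tool_call(name: str) -> str:
    """Dispatch a tool call by name and return the result as a JSON string."""
    if name != DISCOVERY_TOOL["name"]:
        raise ValueError(f"unknown tool: {name}")
    return json.dumps({"models": list_foundation_models()})


print(handle_tool_call("list_foundation_models"))
```

The deprecated entry never reaches the agent, so nothing in the prompt has to change when the catalog does.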

What It Means For You — Concrete Impact on Daily Claude Code Usage

1. You stop paying for dead weight

When you hardcode 50 model definitions into your CLAUDE.md or system prompt, every one of them occupies context on every request, including models that aren't active in your region or quota tier. With discovery-driven MCP, the fixed cost is one small tool definition, and the model list enters context only when the agent asks for it and only for what is currently available. The reported saving is a 40% token reduction on startup alone.

2. Your agent adapts to infrastructure changes

Providers update model versions, rename endpoints, and deprecate features constantly. Hardcoded prompts break silently. Discovery-driven agents call nvidia_list_foundation_models and adapt instantly. No more Tuesday night debugging sessions.

3. Quota management becomes proactive

The NVIDIA Catalog MCP includes nvidia_check_token_quota. You can instruct your agent to check its own constraints before initiating heavy inference tasks. If quota is low, the agent can switch to a smaller model or pause and alert you. Governance moves from the orchestrator into the agent itself.
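
A sketch of how points 2 and 3 might combine in agent-side logic: walk a preference list, skip anything the discovery call did not return, and skip anything that would not fit in the remaining quota. The ModelInfo type, the est_tokens_per_run field, and the idea that quota is a single token count are all assumptions made for illustration; the real tools' response shapes are not shown in this article.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ModelInfo:
    name: str
    est_tokens_per_run: int  # hypothetical cost estimate for one heavy task


def choose_model(
    preferred: List[str],
    available: List[ModelInfo],
    remaining_quota: int,
) -> Optional[str]:
    """Return the first preferred model that is available and fits the quota.

    Returns None when nothing fits, which the caller can treat as
    "pause and alert the user".
    """
    by_name = {m.name: m for m in available}
    for name in preferred:
        model = by_name.get(name)
        if model is None:
            continue  # renamed, deprecated, or not offered in this tier
        if model.est_tokens_per_run <= remaining_quota:
            return model.name
    return None


# Stand-ins for what the discovery and quota tools might report.
available = [
    ModelInfo("nemotron-4-340b", 80_000),
    ModelInfo("mistral-7b", 5_000),
]
preferred = ["llama3-70b", "nemotron-4-340b", "mistral-7b"]
```

With plenty of quota the missing llama3-70b is skipped in favor of nemotron-4-340b; with little quota the function falls through to the smaller model, and with none it returns None.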

Try It Now — Commands, Config, and Prompts to Take Advantage of This

Step 1: Install the NVIDIA API Catalog MCP

claude mcp add --transport http nvidia-catalog https://vinkius.com/mcp/nvidia-api-catalog

Or add it to your MCP configuration (a project-level .mcp.json, or ~/.claude.json for user scope):

{
  "mcpServers": {
    "nvidia-catalog": {
      "url": "https://vinkius.com/mcp/nvidia-api-catalog",
      "env": {
        "NVIDIA_API_KEY": "your-key-here"
      }
    }
  }
}

Step 2: Update your CLAUDE.md with a discovery-first prompt

Replace this:

## Available Models

- Llama3-70B
- Nemotron-4-340B
- Mistral-7B
- (47 more lines...)

With this:

## Discovery-Driven Workflow
Before running any inference, call `nvidia_list_foundation_models` to discover currently available models. Then select the best one for the task. Check `nvidia_check_token_quota` before heavy runs.

Step 3: Try it in a Claude Code session

> /compact
> List the available foundation models and pick the best one for summarizing this PDF.

The agent will call nvidia_list_foundation_models, discover what's active, and proceed with minimal context overhead.
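
Under the hood, an MCP tool invocation is a JSON-RPC 2.0 request with the method tools/call. The sketch below builds such a message for the discovery tool. Passing an empty arguments object is an assumption, since this article does not document the tool's parameters, and build_tool_call is a hypothetical helper rather than part of any SDK.

```python
import json
from itertools import count
from typing import Optional

# Monotonic ids so each request can be matched to its response.
_request_ids = count(1)


def build_tool_call(tool_name: str, arguments: Optional[dict] = None) -> str:
    """Serialize an MCP tools/call request as a JSON-RPC 2.0 message."""
    request = {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments or {}},
    }
    return json.dumps(request)


print(build_tool_call("nvidia_list_foundation_models"))
```

Claude Code sends messages like this for you; seeing the shape mainly helps when reading server logs or debugging a misbehaving tool.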

Why This Works — The Token Economics

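Every tool definition in your system prompt consumes tokens. If you define 50 tools with full JSON schemas, that's roughly 10-15k tokens before any real work. Those definitions are resent with every request, so after about four turns you have processed 50k+ tokens of schema alone. Prompt caching can lower what you pay for the repeats, but the definitions still occupy the context window.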

Discovery-driven tools invert this. The tool definition is small (one function call), and the real data comes from a runtime query. You pay for exactly what you use.
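
A back-of-the-envelope comparison shows the size gap. The sketch uses a crude four-characters-per-token heuristic instead of a real tokenizer, and the tool schemas are invented for illustration, so it demonstrates the order of magnitude only and does not reproduce the 10-15k or 40% figures above.

```python
import json

CHARS_PER_TOKEN = 4  # rough heuristic; real tokenizers will differ


def estimate_tokens(obj) -> int:
    """Estimate token count from the length of the JSON serialization."""
    return len(json.dumps(obj)) // CHARS_PER_TOKEN


def make_static_tool(i: int) -> dict:
    """Build one hypothetical per-model tool definition with a full schema."""
    return {
        "name": f"run_model_{i}",
        "description": f"Run inference on hypothetical model number {i}.",
        "inputSchema": {
            "type": "object",
            "properties": {
                "prompt": {"type": "string"},
                "max_tokens": {"type": "integer"},
                "temperature": {"type": "number"},
            },
            "required": ["prompt"],
        },
    }


static_tools = [make_static_tool(i) for i in range(50)]
discovery_tool = {
    "name": "list_foundation_models",
    "description": "Return the models that are callable right now.",
    "inputSchema": {"type": "object", "properties": {}},
}

static_cost = estimate_tokens(static_tools)
discovery_cost = estimate_tokens([discovery_tool])
print(f"static: ~{static_cost} tokens, discovery: ~{discovery_cost} tokens")
```

The static cost grows linearly with the number of definitions, while the discovery cost stays flat until the agent actually asks for the list.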

The Bigger Picture — Beyond NVIDIA

This pattern applies to any dynamic infrastructure. If you're building agents that interact with cloud APIs, database schemas, or internal microservices, hardcoding is a liability. Give your agent a discovery tool and let it figure out the landscape at runtime.
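
The same idea applied to a database, as a minimal sketch: instead of pasting table definitions into a prompt, expose a function that reads the schema from the live connection. SQLite is used here only because it ships with Python; the users and orders tables are made up for the example.

```python
import sqlite3


def discover_schema(conn: sqlite3.Connection) -> dict:
    """Return {table_name: [column names]} by querying the live database."""
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
            "ORDER BY name"
        )
    ]
    schema = {}
    for table in tables:
        # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        schema[table] = [col[1] for col in columns]
    return schema


# Demo database with two hypothetical tables.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL)"
)

print(discover_schema(conn))
```

If a migration adds or renames a column, the next discovery call reflects it without anyone editing a prompt.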

As the MCP ecosystem crosses 13,000 servers (per our recent coverage), the distinction between static and dynamic MCPs will define production-grade agents. Static MCPs are for demos. Discovery-driven MCPs are for shipping.

Source: dev.to

[Updated 30 Jun via gn_mcp_protocol]

Meanwhile, X (formerly Twitter) launched its own official MCP server on April 10, 2025, enabling AI agents to search posts, look up user profiles, and retrieve trending topics directly via the Model Context Protocol [per TechCrunch]. The move signals that social platforms are now building MCP-native interfaces, reinforcing the discovery-driven thesis: as more services expose dynamic endpoints, hardcoded model lists become even more wasteful. Developers can now treat X's public data as another discoverable resource rather than a static schema.
