Every time your agent sends a prompt like "read the file src/routes/ventas.js, find line 45, and tell me what's there", you're paying for 25 tokens of natural language that the model has to interpret, might misunderstand, and will probably hallucinate part of the answer.
When my agent does the same thing, it calls:
{
  "tool": "read_file",
  "parameters": {
    "path": "src/routes/ventas.js",
    "offset": 40,
    "limit": 10
  }
}
The model didn't generate that path from memory. It didn't guess what's on line 45. The MCP tool returned the actual file content with line numbers from the actual file system. Zero interpretation. Zero hallucination. Fewer tokens.
I built 24 custom MCP tools organized in 6 categories. They power an autonomous agent that manages 6 production services for my business. This post is about what I learned building those tools and why MCP is the single biggest lever you have for reducing cost, hallucination, and prompt bloat in any agent system.
What "Native Prompts" Means (And Why It Matters More Than Model Choice)
I use the term "native prompt" to describe something most agent builders overlook: every MCP tool definition is an instruction the model consumes without you writing it in the system prompt.
When you register a tool like this:
@server.tool()
async def search_code(pattern: str, glob: str = "**/*", case_insensitive: bool = True) -> str:
    """Regex search across project files using ripgrep.
    Returns matching lines with file paths and line numbers.
    Use for finding function definitions, variable usage,
    import patterns, or error-related code."""
That docstring, those parameter names, those type hints — they are documentation the model actually reads. You don't need to write in your system prompt: "When you need to find code patterns, use regex search. Pass the pattern as the first argument..." The tool schema already communicates this.
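The mechanics are easy to see without the SDK: everything an MCP server advertises for a tool is recoverable from the function's signature and docstring. A minimal sketch using plain `inspect` (this is illustrative, not the actual MCP SDK's schema generator):

```python
import inspect

def search_code(pattern: str, glob: str = "**/*", case_insensitive: bool = True) -> str:
    """Regex search across project files using ripgrep.
    Returns matching lines with file paths and line numbers."""
    ...

def tool_schema(fn):
    """Build a JSON-schema-like tool description from signature + docstring,
    roughly what an MCP server sends to the client for each tool."""
    props = {}
    for name, param in inspect.signature(fn).parameters.items():
        props[name] = {
            # Map Python annotations to JSON-schema type names
            "type": {str: "string", bool: "boolean", int: "integer"}.get(param.annotation, "string"),
            # Parameters without defaults are required
            "required": param.default is inspect.Parameter.empty,
        }
    return {"name": fn.__name__, "description": inspect.getdoc(fn), "parameters": props}

schema = tool_schema(search_code)
```

The resulting dictionary is the "native prompt": the model sees the name, the description, and the typed parameters, with no system-prompt prose involved.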
This is a fundamental shift in how you think about prompt engineering for agents:
TRADITIONAL AGENT PROMPT:
┌─────────────────────────────────────────────┐
│ System prompt (800 tokens) │
│ ├── Role description │
│ ├── How to read files (150 tokens) │
│ ├── How to edit files (200 tokens) │
│ ├── How to search code (100 tokens) │
│ ├── How to manage processes (120 tokens) │
│ ├── How to use git (130 tokens) │
│ └── Safety rules │
│ │
│ Every instruction = tokens you pay for │
│ Every ambiguity = hallucination risk │
└─────────────────────────────────────────────┘
MCP-BASED AGENT:
┌─────────────────────────────────────────────┐
│ System prompt (200 tokens) │
│ ├── Role description │
│ └── Safety rules │
│ │
│ Tool schemas (consumed natively by model) │
│ ├── read_file: schema + docstring │
│ ├── edit_file: schema + docstring │
│ ├── search_code: schema + docstring │
│ ├── restart_process: schema + docstring │
│ ├── check_health: schema + docstring │
│ └── ... 19 more tools │
│ │
│ Instructions live IN the tools, not │
│ in prose the model might misread │
└─────────────────────────────────────────────┘
The system prompt shrinks from 800 tokens to 200 because the tools carry their own documentation. And that documentation is structured: parameter names, types, and descriptions, not free-text that the model has to parse and might misinterpret.
Native prompts are cheaper, more precise, and harder to hallucinate against.
The 24 Tools: Anatomy of a Custom MCP Server
My MCP server is a single Python file using the MCP SDK. Each production service gets its own instance, parameterized by project root and process name. Here's the full toolbox:
┌──────────────────────────────────────────────────┐
│ MCP PROJECT SERVER │
│ 24 tools · 6 categories │
├──────────────┬───────────────────────────────────┤
│ CODE READ │ read_file (with line #s) │
│ │ list_files (glob patterns) │
│ │ search_code (ripgrep regex) │
│ │ get_project_structure (dir tree) │
├──────────────┼───────────────────────────────────┤
│ CODE WRITE │ edit_file (search & replace) │
│ │ write_file (create/overwrite) │
│ │ delete_file │
│ │ create_directory │
├──────────────┼───────────────────────────────────┤
│ PM2 PROCESS │ get_status (CPU, mem, uptime) │
│ │ view_logs (last N lines) │
│ │ restart_process │
│ │ stop_process │
│ │ start_process │
├──────────────┼───────────────────────────────────┤
│ GIT │ git_status │
│ │ git_diff │
│ │ git_log │
│ │ git_pull │
│ │ git_commit │
│ │ git_add │
├──────────────┼───────────────────────────────────┤
│ TESTING │ run_tests (autodetect runtime) │
│ │ check_health (HTTP status check) │
├──────────────┼───────────────────────────────────┤
│ CONTEXT │ read_claude_md (project docs) │
│ │ get_dependencies (pkg/req files) │
│ │ run_command (shell, with timeout) │
└──────────────┴───────────────────────────────────┘
Every tool has hard constraints baked into the server code, not into the prompt:
- edit_file requires old_text to match exactly once in the file. If it's ambiguous, the tool returns an error — the model cannot apply a vague edit.
- read_file caps at 500 lines and 500 KB — the model can't accidentally dump a 10 MB log into context.
- run_command has a blocklist of 13 substrings + 5 regex patterns (path traversal, fork bombs, curl|bash piping, DROP TABLE, etc.).
- All tools are sandboxed to the project directory — path traversal with ../../ is blocked at the server level.
These constraints would normally be paragraphs in your system prompt that the model might ignore under pressure. As MCP server logic, they are enforced by code, not by hope.
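Both the sandbox and the read cap fit in a few lines of server code. A sketch of how they might look (the project root path is illustrative; the 500-line/500 KB limits are the ones described above):

```python
from pathlib import Path

PROJECT_ROOT = Path("/srv/projects/api").resolve()  # illustrative root
MAX_LINES = 500
MAX_BYTES = 500 * 1024

def safe_path(relative: str) -> Path:
    """Resolve a path and reject anything that escapes the project root."""
    target = (PROJECT_ROOT / relative).resolve()
    if not target.is_relative_to(PROJECT_ROOT):  # blocks ../../ traversal
        raise PermissionError(f"path escapes project root: {relative}")
    return target

def read_file(path: str, offset: int = 0, limit: int = MAX_LINES) -> str:
    """Return numbered lines, hard-capped at 500 lines / 500 KB."""
    target = safe_path(path)
    data = target.read_bytes()[:MAX_BYTES].decode("utf-8", errors="replace")
    lines = data.splitlines()[offset : offset + min(limit, MAX_LINES)]
    return "\n".join(f"{offset + i + 1}: {line}" for i, line in enumerate(lines))
```

Because `safe_path` runs before every filesystem operation, no prompt the model produces can reach outside the project directory.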
How MCP Tools Reduce Token Consumption
Let's trace a real operation: "Find where producto.precio is used and check if there's a null safety issue."
Without MCP tools (raw prompt approach):
TURN 1 — User prompt:
"Search the codebase for producto.precio usage" (9 tokens)
TURN 1 — Model output:
"I'll search for that pattern. Let me run:
grep -rn 'producto.precio' src/" (22 tokens generated)
TURN 2 — Execution result injected:
raw grep output, 40 lines
TURN 2 — Model output:
"I found it in 3 files. Let me read ventas.js:
cat -n src/routes/ventas.js | head -60" (28 tokens generated)
TURN 3 — Execution result injected:
60 lines of code
TURN 3 — Model output:
"Line 45 has no null check. Here's the fix..." (50 tokens)
TOTAL: ~809 tokens across 3 turns
With MCP tools:
TURN 1 — Model calls tool:
search_code(pattern="producto\.precio", glob="**/*.js")
→ Tool returns structured matches (~120 tokens)
TURN 1 — Model calls tool:
read_file(path="src/routes/ventas.js", offset=40, limit=10)
→ Tool returns 10 lines with numbers (~80 tokens)
TURN 1 — Model output:
"Line 45 has no null check. Here's the fix..." (50 tokens)
TOTAL: ~250 tokens in 1 turn
3.2x fewer tokens. 1 turn instead of 3. No generated bash commands. No raw output parsing.
The savings compound across every execution. With structured tool responses, the model receives exactly the data it needs in a predictable format: no wasted tokens on grep headers, bash syntax, or conversational padding.
The multiplier effect at scale
Monthly executions: 30,000
Tokens saved per exec: ~550 (809 - 250)
Total tokens saved/month: 16,500,000
At Claude Sonnet API rates ($3 input / $15 output per MTok):
Savings ≈ $150-$300/month just from token compression
At GPT-4o rates ($2.50 / $10):
Savings ≈ $100-$200/month
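Spelled out as arithmetic (prices as listed above; the input/output mix is an assumption, here 50/50, and output-heavier mixes push the figure toward the top of the quoted ranges):

```python
EXECUTIONS_PER_MONTH = 30_000
TOKENS_SAVED_PER_EXEC = 809 - 250  # from the traced example: 559 tokens
saved_mtok = EXECUTIONS_PER_MONTH * TOKENS_SAVED_PER_EXEC / 1_000_000  # ≈ 16.77 MTok

def monthly_savings(input_rate: float, output_rate: float, input_share: float = 0.5) -> float:
    """USD/month saved for a given $/MTok pricing and input/output mix."""
    blended = input_rate * input_share + output_rate * (1 - input_share)
    return saved_mtok * blended

sonnet = monthly_savings(3.00, 15.00)  # lands near the bottom of the $150-$300 range
gpt4o  = monthly_savings(2.50, 10.00)  # lands near the bottom of the $100-$200 range
```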
This is before you factor in the flat subscription model. MCP tools reduce costs on API and subscription plans alike: on subscriptions, you consume less of your rate-limited quota per operation.
The Cost Equation: Why Flat Beats Per-Token for Agents
My agent runs on claude -p (Claude Code CLI) using a Max subscription at $100/month. No API key. No per-token billing. The CLI invokes Claude with native MCP support: it reads mcp-projects.json, connects to the specified servers via stdio, and exposes all tools to the model automatically.
Here's what this looks like compared to API pricing for a moderately active agent (1,000 daily executions, ~2,600 tokens each):
MONTHLY COST COMPARISON — 30,000 executions/month
═══════════════════════════════════════════════════
$100 ██ Claude Max (flat)
$390 ████████ GPT-4o API ($2.50/$10 per MTok)
$408 ████████ Gemini 3.1 Pro ($2.00/$12)
$441 █████████ GPT-5.2 API ($1.75/$14)
$540 ███████████ Claude Sonnet 4.6 API ($3/$15)
$900 ██████████████████ Claude Opus 4.6 API ($5/$25)
The flat model wins by 4-9x. But the real insight is: MCP tools make the flat model even flatter. Because each execution consumes fewer tokens (thanks to tool compression), you fit more executions within the same rate-limited window.
One developer tracked 10 billion tokens of Claude Code usage over 8 months and estimated it would have cost over $15,000 on API pricing. He paid ~$800 total on the Max plan. That's a 93% saving before any MCP optimization.
Dynamic MCP config: only load what you need
Here's an optimization most people miss. Instead of loading all 6 MCP servers (one per project) into every execution, my agent generates a mini-config with only the target project's server:
// claude-runner.js
function generarMiniConfig(proyecto) {
  const serverKey = PROYECTOS[proyecto].mcp;
  return {
    mcpServers: {
      [serverKey]: fullConfig.mcpServers[serverKey]
    }
  };
}
Why does this matter? Because every MCP server loaded = tool schemas injected into context = tokens consumed. If you have 6 servers × 24 tools = 144 tool definitions in context, that's a significant chunk of your prompt budget wasted on tools the model won't use in this execution.
Loading only the relevant server keeps the tool context tight: 24 tools instead of 144. That's ~80% reduction in tool-schema tokens per execution.
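The same filtering is trivial in any language. A Python equivalent of the runner snippet above, with hypothetical project and server names standing in for the real config:

```python
# Hypothetical full config: one MCP server entry per project.
FULL_CONFIG = {
    "mcpServers": {
        "mcp-api": {"command": "python", "args": ["server.py", "--root", "/srv/api"]},
        "mcp-bot": {"command": "python", "args": ["server.py", "--root", "/srv/bot"]},
        # ... one entry per remaining project
    }
}

# Maps a project name to its MCP server key.
PROJECTS = {"api": "mcp-api", "bot": "mcp-bot"}

def mini_config(project: str) -> dict:
    """Return a config exposing only the target project's MCP server,
    so only that server's tool schemas enter the context window."""
    key = PROJECTS[project]
    return {"mcpServers": {key: FULL_CONFIG["mcpServers"][key]}}
```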
MCP as a Code Optimization Pattern
Beyond cost and hallucination, custom MCP tools change how you structure agent code. Here are three patterns I've found most impactful:
Pattern 1: Constraint enforcement via tool design
Instead of writing in your prompt "never apply an edit that could match more than one location", design the tool to enforce it:
@server.tool()
async def edit_file(path: str, old_text: str, new_text: str) -> str:
    """Edit a file using exact search and replace.
    old_text must match exactly ONE location in the file.
    If old_text appears 0 or 2+ times, the edit is rejected."""
    content = read(path)
    count = content.count(old_text)
    if count == 0:
        return "ERROR: old_text not found in file"
    if count > 1:
        return f"ERROR: old_text found {count} times. Be more specific."
    new_content = content.replace(old_text, new_text, 1)
    write(path, new_content)
    return f"OK: replaced 1 occurrence in {path}"
The constraint is impossible to bypass through prompt manipulation. No amount of creative prompting will make edit_file accept an ambiguous edit.
Pattern 2: Context injection via tool responses
Your read_claude_md tool is not just "reading a file." It's injecting project-specific context into the model's reasoning window at exactly the right moment:
@server.tool()
async def read_claude_md() -> str:
    """Read the project's CLAUDE.md documentation.
    Contains architecture decisions, conventions,
    known issues, and deployment notes.
    Call this BEFORE making changes to understand project context."""
    claude_md = project_root / "CLAUDE.md"
    if claude_md.exists():
        return claude_md.read_text(encoding="utf-8")[:5000]
    return "No CLAUDE.md found for this project."
That docstring line, "Call this BEFORE making changes," is a native prompt. The model reads it as part of the tool schema and learns when to use the tool, not just how. You didn't write this timing instruction in your system prompt; the tool teaches it.
Pattern 3: One server, N projects
The most powerful code optimization: my entire MCP server is one Python file parameterized by CLI arguments.
Same server.py, different instances
python server.py --root C:\projects\api --pm2 tacos-api --name api
python server.py --root C:\projects\bot --pm2 TacosAragon --name bot
python server.py --root C:\projects\cfo --pm2 cfo-agent --name cfo
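A sketch of how that parameterization might look at the top of server.py (the flag names come from the commands above; the rest is illustrative):

```python
import argparse
from pathlib import Path

def parse_args(argv=None):
    """CLI flags that turn one generic MCP server into a project-specific instance."""
    parser = argparse.ArgumentParser(description="Per-project MCP server")
    parser.add_argument("--root", required=True, type=Path,
                        help="project root all tools are sandboxed to")
    parser.add_argument("--pm2", required=True,
                        help="PM2 process name used by restart/stop/start tools")
    parser.add_argument("--name", required=True,
                        help="server instance name")
    return parser.parse_args(argv)

# Example invocation, matching one of the command lines above
args = parse_args(["--root", "/srv/api", "--pm2", "tacos-api", "--name", "api"])
```

Every tool then closes over `args.root` and `args.pm2`, so the same file serves any number of projects.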
Adding a new project to the agent requires zero new code. Three config lines:
"project-new": {
"command": "python",
"args": ["server.py", "--root", "C:\new-project", "--pm2", "new-svc", "--name", "new"]
}
Restart. The agent now has full read/write/git/process/test capabilities over the new project. 24 tools, zero development time.
This is the "N×M problem" that MCP was designed to solve. Without it, adding a new project would mean writing new integration code — bash scripts, API wrappers, custom parsers. With MCP, the protocol is the integration layer.
Sessions: The Forgotten Token Optimization
MCP tools compress tokens per execution. But sessions compress tokens across executions.
My agent uses --session-id UUID to maintain context for 1 hour across up to 8 messages. Here's what this saves:
WITHOUT SESSIONS:
Message 1: system prompt (200 tok) + tool schemas + task → response
Message 2: system prompt (200 tok) + tool schemas + task → response
Message 3: system prompt (200 tok) + tool schemas + task → response
Message 4: system prompt (200 tok) + tool schemas + task → response
Total system prompt tokens: 800+ (repeated 4 times)
Context from previous messages: 0 (each starts fresh)
WITH SESSIONS:
Message 1: system prompt (200 tok) + tool schemas + task → response
Message 2: task only → response (has full prior context)
Message 3: task only → response (has full prior context)
Message 4: task only → response (has full prior context)
Total system prompt tokens: 200 (loaded once)
Context from previous messages: everything
The model remembers files it read, changes it made, and errors it encountered without re-sending any of it. For a sequence of related operations (diagnose → fix → verify → report), sessions eliminate ~600 tokens of redundant context per follow-up message.
Over 30,000 monthly executions with an average session length of 3 messages, that's roughly 12 million tokens saved: tokens that never enter the context window, never count against your rate limit, and never cost you a cent.
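The 12-million figure is just the stated numbers multiplied out:

```python
executions = 30_000            # monthly executions
avg_session_len = 3            # messages per session
tokens_per_followup = 600      # redundant preamble eliminated per follow-up

sessions = executions // avg_session_len            # 10,000 sessions/month
followups = sessions * (avg_session_len - 1)        # messages that skip the preamble
tokens_saved = followups * tokens_per_followup      # 12,000,000 tokens/month
```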
The Blocklist: What Your MCP Server Should Never Allow
If you're building MCP tools that execute code or commands, here's the blocklist I arrived at after running in production:
BLOCKED_SUBSTRINGS = [
    "rm -rf /", "rm -rf ~",   # filesystem destruction
    "format", "del /s /q",    # Windows destruction
    "rmdir /s /q",            # Windows recursive delete
    "shutdown", "reboot",     # system control
    "halt", "poweroff",       # system control
    ":(){:|:&};:",            # fork bomb
    "DROP TABLE",             # database destruction
    "chmod 777",              # permission escalation
    "chown -R",               # ownership takeover
]
BLOCKED_PATTERNS = [
    r"curl.*\|\s*(bash|sh|python|node)",  # remote code exec via pipe
    r"wget.*\|\s*(bash|sh)",              # remote code exec via pipe
    r"(bash|sh)\s*<\(",                   # process substitution
    r"eval\s+\$\(",                       # eval injection
    r"(\.\./){3,}",                       # path traversal
]
These are not prompt instructions. They are server-side enforcement that the model cannot circumvent, regardless of prompt injection, jailbreaking, or hallucination. The run_command tool checks every command against this list before execution and returns a hard error if any pattern matches.
Could a determined attacker find ways around these? Maybe. But the point is that the defense doesn't depend on the model behaving correctly. It's code, not a suggestion.
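Tying the two lists together is a single gate function run before any shell execution. A sketch of the check (abbreviated, illustrative patterns mirroring the categories above, not the full production list):

```python
import re

BLOCKED_SUBSTRINGS = ["rm -rf /", "shutdown", "DROP TABLE", "chmod 777"]  # abbreviated
BLOCKED_PATTERNS = [
    re.compile(r"curl.*\|\s*(bash|sh|python|node)"),  # pipe remote script to interpreter
    re.compile(r"(\.\./){3,}"),                       # deep path traversal
]

def command_allowed(command: str) -> bool:
    """Server-side gate: reject a command if any blocklist entry matches."""
    lowered = command.lower()
    if any(s.lower() in lowered for s in BLOCKED_SUBSTRINGS):
        return False
    if any(p.search(command) for p in BLOCKED_PATTERNS):
        return False
    return True
```

Because the gate runs server-side, a jailbroken prompt can make the model *ask* for a dangerous command, but never get it executed.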
Key Takeaways
- Every MCP tool is a native prompt. Tool schemas carry documentation that the model reads automatically. Move instructions from your system prompt into tool definitions: they're cheaper, more precise, and structurally enforced.
- Tool constraints beat prompt constraints. "Never edit ambiguously" as a prompt instruction is a suggestion. edit_file rejecting non-unique matches is a guarantee. Put your guardrails in server code, not in prose.
- MCP tools compress tokens 3x. Structured tool calls replace multi-turn bash generation, raw output parsing, and conversational overhead. The savings compound at scale.
- Dynamic MCP config saves context budget. Load only the servers relevant to the current task. 24 tools in context instead of 144 is an 80% reduction in schema tokens.
- Sessions are the multiplier. Token compression per-execution (MCP tools) × token elimination across-executions (sessions) = dramatic reduction in total consumption. This is the compounding effect that makes the flat subscription model viable at high volume.
- One server, N projects. Parameterize your MCP server by project root and process name. Adding a new project should be a config change, not a code change.
- The flat subscription changes everything. At 30,000 executions/month, a $100 flat plan is 4-9x cheaper than any per-token API. MCP tools amplify this advantage by fitting more operations into the same rate-limited window. Build your tools once. Reuse them everywhere. Let the protocol do the integration work.

The Stack
- MCP Server: Python + MCP SDK (24 tools, single file, parameterized per project)
- Agent runtime: claude -p (Claude Code CLI, Max plan, Sonnet model)
- Protocol: MCP over stdio (JSON-RPC 2.0)
- Sessions: --session-id / --resume (1-hour context retention)
- Config: Dynamic per-execution mini-config (only target project's server)
I'm Gumaro González. I run a restaurant in Culiacán, México, and I build the software behind it, from the WhatsApp order bot to the autonomous agent infrastructure. Everything is built with Claude Code as my copilot.
GitHub: github.com/Gumagonza1