Evicting MCP tool calls from your Kubernetes cluster

Misha Mekhanov

Model Context Protocol (MCP) has been finding its legs for a while now, and its growing pains have articulated a frustration many of us in the agentic space have felt but hadn't quantified:

LLMs are exceptional code generators, but terrible state machines.

The industry standard for agents is the ReAct pattern (Reason + Act), which essentially turns your LLM into a pseudo-runtime. The model decides on a tool, calls the API, waits for the network response, parses the JSON result, and repeats. But here's the catch: this approach isn't exactly efficient with today's LLMs. You're basically asking a probabilistic engine to handle deterministic control flow, which is like using a neural network to do basic arithmetic. The result? Hallucinations, a context window that gets polluted fast, and state management that becomes surprisingly fragile.
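
To make that concrete, here is a minimal sketch of what the orchestrator side of that loop looks like (the `Planner` interface, `llm.next`, and `executeTool` are hypothetical names, not a specific SDK):

```typescript
// A minimal sketch of a ReAct-style orchestration loop: every tool result is
// appended back into the conversation, so the context grows with each turn.
type Message = { role: 'user' | 'assistant' | 'tool'; content: string };

interface Planner {
  next(history: Message[]): Promise<{ tool?: string; args?: unknown; answer?: string }>;
}

async function reactLoop(
  llm: Planner,
  executeTool: (tool: string, args: unknown) => Promise<string>,
  task: string,
): Promise<string> {
  const messages: Message[] = [{ role: 'user', content: task }];
  for (let turn = 0; turn < 10; turn++) {
    const step = await llm.next(messages);                    // probabilistic "state machine"
    if (step.answer) return step.answer;                      // the model decides it is done
    const result = await executeTool(step.tool!, step.args);  // e.g. raw `kubectl get pods` JSON
    messages.push({ role: 'assistant', content: `call ${step.tool}` });
    messages.push({ role: 'tool', content: result });         // raw output pollutes the context window
  }
  throw new Error('ReAct loop exceeded its turn budget');
}
```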

I implemented the Code Mode pattern for Kubernetes diagnostics. Instead of exposing atomic tools (list, get, log, etc.) for the LLM to orchestrate sequentially, I exposed a sandboxed runtime. The LLM writes a program, and the agent executes it.

The results aren't just "better"; they represent a fundamental shift in how agents should interact with infrastructure.

To put numbers on it: in real debugging sessions, I've seen traditional tool-calling workflows (pinpointing the issue behind a failed pod from its logs) climb to around 1 million tokens over roughly 8–10 tool calls. The same style of investigation in Code Mode lands in the 100–200k token range end-to-end, roughly a 5–10× reduction. For context-heavy tasks (log parsing, namespace-scoped troubleshooting, global cluster health checks) the savings can be even larger; I've had runs that pinpointed the problematic pod and its root-cause message with a single tool call, saving more than 90% of the tokens.

The Economics of Context

Typical MCP context size

In production Kubernetes workflows (where logs are large), a standard "tell me why this pod is broken" workflow typically consumes 1M+ tokens over 8–10 turns. With each tool call, the context window fills with intermediate JSON that the user never needs to see, increasing the surface area for hallucination on every turn.

Traditional MCP tool-call chains are also a poor fit for prompt caching, because each tool interaction fundamentally changes the conversation context.

By switching to Code Mode, the model receives a prompt, writes a single code block, and receives the final distilled answer. By eliminating unnecessary tool calls, Code Mode drastically cuts token usage, latency, and hallucination rate. Recent benchmarks by Anthropic showed that switching from sequential tool calls to code execution cut a workflow from 150,000 tokens to ~2,000 tokens (a ~98% context saving). In our own Kubernetes scenarios, we've seen similarly dramatic reductions in prompt context size (up to 90%).

  • Traditional tool use: typically ~1M tokens across 8–10+ round-trips.
  • Code Mode: typically ~100k–200k tokens across 4–6+ round-trips.

Why it works

LLMs are exceptional code generators, but terrible state machines.

Code Mode works because it aligns the architecture with the training data. Frontier models are heavily pre-trained on valid, executable code. They understand control flow, error handling, and data filtering implicitly.

Conversely, they struggle with the interactive Tool-Calling paradigm, which requires them to:

  • Manage loop state across stateless HTTP requests.
  • Parse verbose JSON without getting distracted by noise from previous tool calls.

By moving the execution logic into a deterministic sandbox, we offload the boring, strict logic, such as looping, filtering, and conditional waiting, to a runtime designed for it. The LLM essentially writes a script that says:

"Loop through all pods; if a pod restarts > 5 times, fetch its logs, grep for 'Error', and return the result."

The sandbox runtime executes this loop in milliseconds (after fetching data). The LLM never sees the raw list of 500 healthy pods; it only sees the final root cause. Intermediate results are processed in the sandbox rather than polluting the context window, which Anthropic notes is one of the most common causes of degraded agent performance.

Code Mode vs Standard

According to Anthropic's internal testing, Code Mode also significantly improves accuracy. This aligns with broader research showing that LLMs struggle when forced to return code wrapped in JSON: Aider's benchmarks found that models produce lower-quality code when asked to structure it as JSON due to the cognitive overhead of ensuring format validity while simultaneously solving the coding problem.

Higher-Order Kubernetes Diagnostics

This shift enables workflows that were previously cost-prohibitive or technically impossible due to context limits.

1. The "Cluster health scan" scenario

Old Way: The agent lists all pods, then tries to describe each of them across successive tool-call turns. This quickly hits a rate limit or the context limit.
New Way: The agent writes a loop.

```javascript
// Execution inside the sandbox: the loop and filtering run in the runtime,
// so only the distilled result comes back to the LLM.
const pods = tools.kubernetes.list({ namespace: 'default' });
const problems = pods
  .filter(p => p.status === 'CrashLoopBackOff')                   // keep only crashing pods
  .map(p => {
     const logs = tools.kubernetes.logs({ name: p.metadata.name }); // fetch logs just for those
     return analyze(logs);                                          // e.g. grep for error signatures
  });
return problems;                                                    // only root causes reach the context window
```

The entire "scan" happens off the main LLM chat chain.

2. Configuration Drift & Audits

An agent can pull a Helm release manifest (tools.helm.get) and the live Kubernetes object (tools.kubernetes.get), diff them in-memory within the sandbox, and return only the drift. Doing this via standard tool calling would require pasting two massive YAML files into the context window and asking the LLM to squint at them.
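
A minimal sketch of what that could look like inside the sandbox, assuming `tools.helm.get` and `tools.kubernetes.get` return plain objects (the tool names and the `deepDiff` helper are illustrative, not a published API):

```typescript
// Sandbox-side drift check: fetch both manifests, diff them in memory,
// and return only the differing paths to the LLM.
const desired = tools.helm.get({ release: 'payments', namespace: 'prod' });
const live = tools.kubernetes.get({ kind: 'Deployment', name: 'payments', namespace: 'prod' });

// Minimal recursive diff; a real implementation would ignore server-managed
// fields such as status, resourceVersion, and defaulted values.
function deepDiff(a: any, b: any, path = ''): string[] {
  if (typeof a !== 'object' || a === null || typeof b !== 'object' || b === null) {
    return a === b ? [] : [`${path}: ${JSON.stringify(a)} -> ${JSON.stringify(b)}`];
  }
  const keys = new Set([...Object.keys(a), ...Object.keys(b)]);
  return [...keys].flatMap(k => deepDiff(a[k], b[k], path ? `${path}.${k}` : k));
}

return deepDiff(desired.spec, live.spec); // only the drift reaches the context window
```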

3. Event-Driven Debugging

Because the sandbox provides a true runtime, the agent can write polling logic. It can check a deployment status, wait 5 seconds, and check again without burning a single token on the "wait" state. This allows for atomic "Rollout and Verify" operations that feel like magic compared to the stuttering steps of a chat-based agent.
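
A hedged sketch of such a "rollout and verify" script, assuming the bridge exposes `tools.kubernetes.rolloutRestart`, `tools.kubernetes.get`, `tools.kubernetes.events`, and a `sleep` helper (all illustrative names):

```typescript
// Poll the rollout inside the sandbox: the "wait" turns cost zero tokens.
tools.kubernetes.rolloutRestart({ kind: 'Deployment', name: 'api', namespace: 'prod' });

for (let attempt = 1; attempt <= 30; attempt++) {
  const dep = tools.kubernetes.get({ kind: 'Deployment', name: 'api', namespace: 'prod' });
  if (dep.status?.readyReplicas === dep.spec.replicas) {
    return { status: 'rolled out', attempts: attempt };   // the LLM only sees this summary
  }
  sleep(5000);                                            // deterministic wait in the runtime
}
return {
  status: 'timed out',
  recentEvents: tools.kubernetes.events({ name: 'api', namespace: 'prod' }),
};
```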

4. PII data fetching

Code Mode with MCP provides a critical security advantage for Kubernetes environments: sensitive data never has to touch the LLM's context window. In traditional tool calling, when an agent retrieves API credentials, database passwords, or service account tokens from Kubernetes Secrets, those values flow through the cloud model provider before the LLM can act on them. With code execution in a sandboxed runtime, intermediate results (including sensitive data) stay isolated in the execution environment, and a tokenization layer in front of the MCP tools can extend this protection further.
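
A minimal sketch under the same assumptions (`tools.kubernetes.getSecret` and an `atob` helper in the runtime are illustrative): the Secret is inspected inside the sandbox and only a redacted summary ever reaches the model.

```typescript
// The raw credential stays in sandbox memory; only a redacted summary is returned.
const secret = tools.kubernetes.getSecret({ name: 'db-credentials', namespace: 'prod' });

// Validate the Secret without ever echoing its values back to the model.
const keys = Object.keys(secret.data ?? {});
const summary = keys.map(k => ({
  key: k,
  present: secret.data[k].length > 0,
  lengthBytes: atob(secret.data[k]).length,  // Secret values are base64-encoded; atob assumed available
}));

return { secret: 'db-credentials', keys: summary };  // no plaintext values in the context window
```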

Engineering Code Mode

Implementing this required solving several security and architectural challenges. We couldn't simply eval() code in the main process.

The Isolation Layer

We need a locked-down sandbox environment. The sandbox should have zero filesystem access and no network I/O capability. It is a hermetically sealed compute unit that can only communicate with the outside world through a single channel: the MCP Bridge.

The MCP Bridge

Function calls from the generated code, such as tools.kubernetes.list(), should not make direct HTTP requests. Instead, they should invoke a function on an internal bridge, which marshals the request to the actual Kubernetes client and returns the result to the sandbox's memory.
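
Here is a minimal host-side sketch of such a bridge; the `bridgeInvoke` channel and the `listPods`/`fetchPodLogs` placeholders are assumptions of this design, not KubeView MCP's actual implementation:

```typescript
// Host-side bridge: the sandbox can only send { tool, args } requests over a
// single channel; only registered handlers run, and results come back as JSON.
type ToolHandler = (args: Record<string, unknown>) => Promise<unknown>;

const registry = new Map<string, ToolHandler>([
  // Handlers wrap the real Kubernetes client on the host; the sandbox itself
  // never holds kubeconfig credentials or opens network connections.
  ['kubernetes.list', (args) => listPods(String(args.namespace ?? 'default'))],
  ['kubernetes.logs', (args) => fetchPodLogs(String(args.name))],
]);

export async function bridgeInvoke(tool: string, args: Record<string, unknown>): Promise<string> {
  const handler = registry.get(tool);
  if (!handler) throw new Error(`Unknown tool: ${tool}`);  // strict contract: no invented tools
  const result = await handler(args);
  return JSON.stringify(result);                           // marshal back into sandbox memory
}

// Placeholders for calls into the actual Kubernetes client (for example
// @kubernetes/client-node) running in the host process.
declare function listPods(namespace: string): Promise<unknown>;
declare function fetchPodLogs(podName: string): Promise<unknown>;
```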

To reduce the risk of the model inventing tools or parameters, the system should build a dynamic context from the tool schema at startup. During initialization, the full tool schema is translated into type definitions and injected into the sandbox context, so every available tool and its signature are explicitly defined. This enforces strict contract adherence: the language model can only call functions and parameters that exist in the exposed API surface. The Model Context Protocol SDK supports this pattern by providing a standardized way for applications to supply additional context: in a TypeScript-based MCP server, a global.d.ts file defines global types and interfaces that are accessible throughout the project without requiring explicit imports.
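
As an illustration, the injected definitions could look roughly like this (the shapes below are assumptions; in practice they would be generated from the MCP tool schema at startup):

```typescript
// global.d.ts injected into the sandbox: the model can only call what is declared here.
declare namespace tools {
  namespace kubernetes {
    interface Pod {
      metadata: { name: string; namespace: string };
      status: string;
      restarts: number;
    }
    /** List pods, optionally scoped to a namespace or label selector. */
    function list(params: { namespace?: string; labelSelector?: string }): Pod[];
    /** Fetch container logs for a pod. */
    function logs(params: { name: string; namespace?: string; tailLines?: number }): string;
  }
  namespace helm {
    /** Return the rendered manifest of a Helm release. */
    function get(params: { release: string; namespace?: string }): Record<string, unknown>;
  }
}
```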

Progressive Discovery

The system should implement a file-system-like discovery mechanism that enables incremental tool schema loading. Rather than injecting the complete schema of all available Kubernetes tools into the system prompt at initialization, this approach should allow the agent to query available capabilities dynamically using a search function (e.g., tools.search("ingress")). This mechanism significantly reduces initial prompt overhead by deferring schema injection until specific tool categories are needed, while also supporting alternative discovery patterns such as capability-based filtering, hierarchical tool categorization, or lazy-loading schemas based on the agent's current task context. The MCP Bridge should facilitate these discovery queries, marshaling them to the tool registry and returning only the relevant schema segments to the sandbox, thereby optimizing both memory usage and inference latency.
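
From the model's point of view, discovery then becomes just another thing it can script against; `tools.search` and `tools.invoke` below are assumptions of this design, not a published API:

```typescript
// The agent discovers only the tools relevant to its current task; schemas for
// other tool categories stay unloaded until a later search asks for them.
const matches = tools.search('ingress');
// e.g. [{ name: 'kubernetes.listIngresses', description: '...', inputSchema: {...} }, ...]

// Once a schema segment has been loaded, the agent can call the tool through the bridge.
const ingresses = tools.invoke('kubernetes.listIngresses', { namespace: 'prod' });
return ingresses.filter((ing: any) => !ing.status?.loadBalancer?.ingress?.length);
```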

Alternatively, Cloudflare's virtual file system implementation provides a complementary model. Cloudflare Workers use an in-memory virtual file system where files are organized hierarchically, allowing efficient isolation and lazy-loading of resources. This pattern of hierarchical, request-scoped file isolation can inform similar discovery mechanisms where tool schemas are organized by category or namespace, loaded on-demand as the agent navigates the tool hierarchy.

Based on my experience with both methods, I recommend starting with the search-tool approach for tool discovery: it is straightforward to implement (though it does not scale especially well). This strategy is also promoted by Anthropic, which recently introduced a Tool Search Tool to manage large toolsets more efficiently. It also opens the door to state persistence in multi-step workflows, where intermediate state is written to the sandbox filesystem and resumed atomically when a step fails. ...And yes, that topic deserves a separate article.

The Trade-Offs: Not a Silver Bullet

While Code Mode is superior for diagnostics, it introduces complexity that engineering teams must weigh:

  1. Infrastructure Complexity: You are no longer just passing JSON; you are now a Remote Code Execution (RCE) provider. Even with sandboxing, the security surface area increases. You must manage CPU quotas, memory limits, and execution timeouts to prevent infinite loops from freezing the agent (see the sketch after this list).
  2. State Mutation Concerns: Code Mode may not be appropriate for all operations, particularly state-changing actions. When the agent orchestrates multiple destructive operations in code (deleting resources, modifying configurations), you lose the granular approval checkpoints that individual tool calls provide.
  3. Overhead on simple tasks: For a trivial task like "check if this pod is running", Code Mode is overkill; you will spend more tokens on such simple operations than a single tool call would.
  4. Model Dependence: This approach relies on models that are good at coding. Smaller, quantized models (around 7B parameters) often struggle to produce valid, executable code on the first try, whereas they might manage a simple JSON tool call just fine. This pushes you toward more expensive frontier models.
  5. Debugging Challenges: When code execution fails, troubleshooting becomes significantly harder. You need to diagnose runtime errors that span multiple tool invocations, all happening inside a sandbox with limited visibility.
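
To illustrate point 1, here is one way to enforce those limits, assuming the sandbox is built on isolated-vm (just one of several isolation options):

```typescript
import ivm from 'isolated-vm';

// Cap heap size per isolate and wall-clock time per run so a runaway
// generated script cannot freeze or exhaust the agent process.
export async function runGuarded(code: string): Promise<string> {
  const isolate = new ivm.Isolate({ memoryLimit: 128 });           // MB of heap for the sandbox
  try {
    const context = await isolate.createContext();
    const script = await isolate.compileScript(code);
    const result = await script.run(context, { timeout: 5_000 });  // kill infinite loops after 5s
    return String(result);
  } finally {
    isolate.dispose();                                             // always reclaim the isolate
  }
}
```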

That said, emerging protocols like ACP (Agent Communication Protocol) are shaping up to be a more robust and grown-up way to orchestrate tool calls.

Conclusion

Code Mode shifts the burden to the infrastructure layer, and thus requires robust sandboxing and governance.

By moving logic out of the probabilistic reasoning layer and into a deterministic runtime, we reduce hallucinations and cut token costs by up to an order of magnitude. For complex Kubernetes environments, that trade is likely worth it.

P.S.: You can compare the behaviors directly by toggling the MCP_MODE in KubeView MCP. Give it a spin and let your own judgment be the compass:

```bash
# Run in Code Mode (sandboxed tool execution)
MCP_MODE=code npx -y kubeview-mcp

# Run in Standard Mode ("legacy" tool chaining)
MCP_MODE=tools npx -y kubeview-mcp
```

P.P.S.: If you found this article helpful, I'd appreciate a ⭐ on https://github.com/mikhae1/kubeview-mcp 🦋
