It was 2am and I had an agent that had been churning through a financial reconciliation flow for 40 minutes. Everything green in tests. Everything green in staging. In production, after the seventeenth call to an MCP tool, it started making decisions based on data that no longer existed in the source system.
It didn't crash. It didn't throw an exception. It just kept working with a model of the world that had gone stale three tool calls ago.
Took me two days to understand what was happening. Took another three to accept that the problem wasn't my code.
## MCP protocol gaps in agents: the silent assumption nobody documents
There's an implicit conceptual contract baked into how MCP is designed: the context you pass to a tool in call 1 is assumed to still be valid when you get to call 17. The protocol has no native mechanism to express that the world changed while the agent was working.
This isn't an implementation bug. It's a design decision. And it makes sense for the 80% of use cases MCP was built for: read tools, searches, static data transformations.
But my agents don't live in that 80%.
They live in systems where:
- A record can be modified by another process while the agent is analyzing it
- The state of an entity changes as a side effect of the very tool the agent just called
- Multiple agents are running in parallel over the same dataset
In those contexts, the stationary context assumption becomes a silent trap.
## The three bugs I documented (with real code)
### Bug 1: The ghost of the deleted entity
This was the first one. I had an agent processing purchase orders. The flow was:
1. `get_pending_orders()` — fetches the list of pending orders
2. For each order:
   - `get_order_details(order_id)` — fetches full details
   - `validate_order(order_id, validation_rules)` — validates against business rules
   - `approve_or_reject_order(order_id, decision)` — executes the decision
```javascript
// What the agent was doing internally — simplified pseudocode
// of how the LLM was building its plan
const orders = await mcp.call('get_pending_orders');
// orders = [{ id: 'ORD-001' }, { id: 'ORD-002' }, { id: 'ORD-003' }]

for (const order of orders) {
  // Between get_pending_orders and this point, ORD-002 may have been
  // cancelled by another process — MCP doesn't know that
  const details = await mcp.call('get_order_details', { id: order.id });

  const validation = await mcp.call('validate_order', {
    id: order.id,
    rules: details.applicable_rules
  });

  // If ORD-002 was cancelled after get_order_details,
  // approve_or_reject will operate on an entity that no longer exists
  // in the state the agent believes it to be in
  await mcp.call('approve_or_reject_order', {
    id: order.id,
    decision: validation.recommendation
  });
}
```
The problem: in staging the dataset was static. In production, other users were cancelling orders while the agent was processing. The agent was calling approve_or_reject_order with validation data calculated against an entity the system already considered to be in a different state.
No error thrown, because the system accepted the operation (defensive backend design). But the result was logically wrong.
The tests passed because nobody tests real concurrency in an MCP context.
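What a test for this would have to do is inject the mutation *between* the agent's read and its write. Here's a minimal sketch of that kind of integration test against a toy in-memory backend — everything here (`makeMockBackend`, the defensive `approve`) is illustrative and not part of any MCP SDK:

```javascript
// A mock backend where a second "process" cancels an order between
// the agent's read and its write. The backend accepts the write anyway
// (defensive design), which is exactly why the bug is silent.
function makeMockBackend() {
  const orders = new Map([
    ['ORD-001', { status: 'pending' }],
    ['ORD-002', { status: 'pending' }],
  ]);
  return {
    // Returns a copy: this is the agent's snapshot of the world
    getOrder: (id) => ({ ...orders.get(id) }),
    cancel: (id) => { orders.get(id).status = 'cancelled'; },
    approve: (id) => {
      const o = orders.get(id);
      // Record what state the entity was actually in when the write landed
      o.approvedFromStatus = o.status;
      o.status = 'approved';
      return o;
    },
  };
}

function runAgentFlowWithRace() {
  const backend = makeMockBackend();
  // Agent reads the order...
  const snapshot = backend.getOrder('ORD-002');
  // ...another process cancels it between the read and the write...
  backend.cancel('ORD-002');
  // ...and the agent writes based on the stale snapshot.
  const result = backend.approve('ORD-002');
  return { snapshot, result };
}

const { snapshot, result } = runAgentFlowWithRace();
// snapshot.status is 'pending' (what the agent believed);
// result.approvedFromStatus is 'cancelled' (what was actually true).
// An assertion on that divergence is the test staging never had.
```

The point of the sketch: no concurrency primitives needed, just a deterministic interleaving of read, external mutation, write. That's enough to reproduce the class of bug on every CI run.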
### Bug 2: The side effect the agent never saw
This one was more subtle. I had a `process_payment(invoice_id)` tool that, as a side effect, marked the invoice as "processing" and applied a 5-minute temporary lock.
```javascript
// The MCP tool — server definition
{
  name: 'process_payment',
  description: 'Processes payment for an invoice by ID',
  inputSchema: {
    type: 'object',
    properties: {
      invoice_id: { type: 'string' }
    }
  }
  // PROBLEM: the description doesn't mention the side effect
  // MCP has no native way to express that this tool
  // mutates entity state for subsequent calls
}
```
```javascript
// What the agent tried to do afterward
// (in the same flow, 3 tool calls later)
const invoiceStatus = await mcp.call('get_invoice_status', {
  id: invoice_id
});
// Returns: { status: 'processing', locked: true, locked_until: ... }

// The agent interpreted 'processing' as a pre-existing state
// unrelated to its own action 3 calls ago
// and made bad decisions based on that interpretation
```
The agent had no way of knowing that the "processing" state was a direct consequence of its own earlier call. MCP has no mechanism to express "this tool mutates state and here are the affected entities."
The result: the agent interpreted its own side effect as evidence of an external problem and triggered retry logic that generated loops.
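One client-side mitigation I've seen work (not part of MCP, just a wrapper around whatever call loop you already have) is an effects ledger: the agent loop records every mutation it caused, so later reads can be flagged as self-inflicted before the LLM interprets them. All names below are hypothetical:

```javascript
// Hypothetical client-side workaround: record every mutation this
// agent caused, so later observations can be annotated instead of
// being handed to the LLM as unexplained external state.
function makeEffectLedger() {
  const effects = []; // { tool, entityId, at }
  return {
    record(tool, entityId) {
      effects.push({ tool, entityId, at: Date.now() });
    },
    causedBySelf(entityId) {
      return effects.some((e) => e.entityId === entityId);
    },
  };
}

const ledger = makeEffectLedger();

// Right after calling process_payment, the loop records the side effect:
ledger.record('process_payment', 'INV-881');

// Three calls later, before interpreting a 'processing' status,
// the loop checks the ledger and annotates the observation:
const status = { status: 'processing', locked: true };
const annotated = ledger.causedBySelf('INV-881')
  ? { ...status, note: "likely caused by this agent's earlier process_payment call" }
  : status;
```

It doesn't fix the protocol gap, but it breaks the retry loop: the model sees "this is probably your own side effect" instead of inferring an external failure.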
### Bug 3: The context that traveled between sessions
This was the strangest one, and the one that took me longest to find.
I had an agent with persistent memory between sessions (using an external store). The agent saved references to entity IDs it had processed. The problem: IDs in the source system were reusable after a certain period of inactivity.
```javascript
// Session 1 — the agent saves context
const memory = {
  last_processed_batch: 'BATCH-2024-001',
  processed_item_ids: ['ITEM-4521', 'ITEM-4522', 'ITEM-4523'],
  processing_rules_version: 'v2.1'
};
await persistMemory(agentId, memory);
```

```javascript
// Session 2 — 6 weeks later
// The agent retrieves its context
const memory = await getMemory(agentId);
// memory.processed_item_ids is still ['ITEM-4521', 'ITEM-4522'...]
// BUT the source system reused those IDs for new entities

// MCP has no context TTL. No reference invalidation.
// The agent calls the tool with IDs that now point
// to completely different entities
const itemDetails = await mcp.call('get_item_details', {
  id: 'ITEM-4521'
});
// Returns data for a new entity that happens to have the same ID
// The agent thinks it's looking at something it already processed
```
This bug was especially nasty because it depended on the combination of three factors: agent persistent memory, ID reuse in the source system, and MCP's implicit assumption that references are stable.
## The common mistakes when you discover this gap
Mistake 1: Trying to solve this in the LLM.
My first instinct was to add instructions in the system prompt: "always verify the current state of an entity before operating on it." It worked for some cases. It added overhead to all of them. And eventually the LLM found reasoning paths where it skipped the verification anyway because it "logically seemed unnecessary."
The LLM is not the right place to solve data infrastructure problems.
Mistake 2: Manually adding versioning to context.
I tried serializing a "snapshot timestamp" into every MCP call and comparing it on the server. It worked. It also added state complexity that basically reinvented distributed transactions — very, very poorly.
Mistake 3: Ignoring it and adding retries.
Worst decision. The retries masked the symptom for weeks until the bug surfaced in a context where retrying made the problem bigger, not smaller.
What works (partially):
Explicit mutability modeling in tool descriptions. Not elegant, but honest:
```javascript
{
  name: 'process_payment',
  description: `
    Processes payment for an invoice.

    STATE EFFECTS: This tool marks the invoice as 'processing'
    and applies a 5-minute lock. Subsequent calls to
    get_invoice_status for this invoice will reflect these changes.

    CONTEXT VALIDITY: The result of this tool assumes the invoice
    state has not changed since the last call to get_invoice_details.
    If the flow has taken more than 2 minutes since that call,
    re-verify state before calling this tool.
  `,
  // ...
}
```
That's not the solution. It's a crutch that documents the problem until the protocol has a better answer.
## Why this matters beyond MCP
This problem isn't unique to MCP. It's a problem for any system that exposes stateful tools to agents operating across time.
When I wrote about the trust problem that Emacs solved and agents ignore, I was brushing up against the same issue: the implicit trust that the environment behaves consistently. MCP has the same problem in the temporal dimension.
And when I dug into the changes between Claude Opus 4.6 and 4.7, one thing I observed is that model changes also mutate the "stationary context" your tools assume. A model that reasons differently about your tool descriptions is another vector of context mutation.
The pattern keeps showing up: we build systems assuming stability in layers that aren't stable.
## FAQ: MCP protocol gaps in real agents
### Does MCP have plans to add support for mutable context or state versioning?
As of when I'm writing this, the MCP spec has no native mechanisms to express state mutability, reference TTLs, or context invalidation. There are discussions in the Anthropic repo about protocol extensions, but nothing concrete on the public roadmap. It's a known problem in the community but it's not prioritized because most current use cases deal with relatively static data.
### Can these bugs be caught with unit tests for the tools?
No, and that's exactly the problem. Unit tests for MCP tools test each tool in isolation with static context. The bugs I described emerge from temporal interaction between tools in multi-step flows. You need integration tests that simulate real concurrency and state mutation between calls. Most agent testing frameworks don't have good support for this yet.
### Do these problems apply equally to all LLMs or are they specific to how Claude reasons about tools?
The problem is in the protocol, not the model. But different models have different tendencies to re-verify state vs. assume continuity. In my experience, larger models tend to be more conservative and re-verify, while smaller models (more token-efficient) tend to assume prior context is still valid. This means the bugs are more frequent when you're optimizing for speed/cost and running smaller models.
### Are there server-side MCP workarounds that solve this without modifying the protocol?
Yes, but all of them have tradeoffs. The most robust is implementing a context middleware on the MCP server that tracks the state of relevant entities and injects warnings into responses when it detects divergence. It's extra work and it's not portable across implementations. Another approach is designing tools as "snapshot-first": every tool that reads state returns a version token, and every tool that writes accepts that token and fails if state changed (optimistic concurrency style). Works well, but requires the underlying system to support that pattern.
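The snapshot-first pattern is easy to sketch against a toy store. Real servers would derive the version token from a row version, an ETag, or an `updated_at` column — the names below are illustrative, not an MCP API:

```javascript
// Optimistic-concurrency sketch: reads return a version token,
// writes must present the token and fail on divergence.
function makeVersionedStore() {
  const rows = new Map(); // id -> { value, version }
  return {
    seed(id, value) { rows.set(id, { value, version: 1 }); },
    read(id) {
      const row = rows.get(id);
      // The version token travels with every read
      return { value: row.value, version: row.version };
    },
    write(id, value, expectedVersion) {
      const row = rows.get(id);
      if (!row || row.version !== expectedVersion) {
        // State changed since the caller's read: reject loudly
        // instead of silently accepting the stale write
        return { ok: false, error: 'version_conflict' };
      }
      rows.set(id, { value, version: row.version + 1 });
      return { ok: true, version: row.version + 1 };
    },
  };
}

const store = makeVersionedStore();
store.seed('INV-1', { status: 'pending' });

const snap = store.read('INV-1');                             // version 1
store.write('INV-1', { status: 'cancelled' }, snap.version);  // another actor wins, version 2
const stale = store.write('INV-1', { status: 'approved' }, snap.version);
// stale.ok is false — the agent is forced to re-read
// instead of acting on context that no longer exists
```

The design choice that matters: the conflict is surfaced as data the agent can reason about, not as a silent success. That turns the staleness from an invisible assumption into an observable event.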
### How do you know if your use case is in the safe 80% or the problematic 20%?
Simple question: can any entity your agent processes be modified by an external process during flow execution? Does any tool have side effects on entities that other tools in the same flow also read? Does your agent have persistent memory with references to IDs from systems that reuse identifiers? If you answered yes to any of those three, you're in the 20% and you need to design explicitly for the problem.
### Isn't this basically the distributed transactions problem? Why not use existing solutions?
Yes and no. On the surface it looks similar, but the context is different: in distributed transactions, the participants are deterministic systems you can coordinate. Here, one of the participants is an LLM with probabilistic reasoning. Classic solutions (two-phase commit, sagas, etc.) assume you can roll back cleanly. With an agent that's already made decisions based on incorrect context, "rollback" isn't technical — it's semantic. It's a lot messier.
## What I changed in my stack and what I still haven't solved
After documenting these three cases, I made three concrete changes:
1. Every tool that reads state now returns an opaque `context_version`. Tools that write accept it as an optional parameter and log divergence if state changed.
2. I added explicit `context_ttl` in tool descriptions. I tell the LLM how long it can assume context is still valid before re-verifying.
3. For agents with persistent memory, I added hashing of key properties of referenced entities. If the hash changes between sessions, the agent gets an explicit warning before operating.
What I still haven't solved: concurrency between multiple instances of the same agent. If you have two instances processing the same dataset in parallel, the mutable context problem multiplies. I haven't found an elegant solution that doesn't require centralized coordination, which destroys a good chunk of the value of having distributed agents in the first place.
MCP is a young protocol. These gaps are expected. What's not acceptable is not documenting them, because in production someone pays for them — usually at 2am, with an agent that keeps working with a model of the world that no longer exists.
If you're building with MCP in systems with mutable state, also check how configuration context affects agents — there's another silently stationary context vector you're probably not testing.
And if you've found other gaps I didn't cover: I want to know. This is an area where collective documentation is worth more than any single post.
This article was originally published on juanchi.dev