Memorylake AI

Posted on May 28

MCP Isn’t Dead: What the Latest MCP Updates Mean for Memory Servers


TL;DR

Claude Code's April releases raised the per-tool MCP output limit to 500,000 characters, added concurrent server connections, and shipped Tool Search and lazy loading.
For most MCP servers (Slack, GitHub, filesystem), this is a quality-of-life bump.
For memory servers, the four changes compound. They flip the design problem from "what's the smallest useful response we can fit?" to "what's the richest payload the model will actually use?"

If you've run an MCP server that exposes memory to Claude, you've felt the squeeze.

Tools fight for token budget. The model picks one or two memory items per turn, then runs out of room. You can return more from your server, but the prompt only holds so much before the assistant starts ignoring things. Either you ship less context, or you ship less useful context. There wasn't a third option.

April changed the math on that. Four updates landed close together. On their own, none of them is the kind of release note you'd retweet. Stacked, they reshape what a memory server can actually do in a session — and most of the writeups I've seen treat them as generic developer wins instead of the specific kind of win they are for memory.

Here's what changed, what each one means for a memory server in particular, and the config tweaks worth making this week.

What shipped

Per-tool MCP output limit raised to 500,000 characters. This is the headline. The old limit forced memory servers to truncate aggressively.
Concurrent MCP server connections. Multiple servers can be queried in parallel within one turn. Previously you queued.
MCP Tool Search. Claude searches across registered tools rather than carrying every tool description in the system prompt.
Lazy loading. Tool schemas load when they're needed, not at session start.

Two of these, the output limit and concurrent connections, change what your server can deliver. The other two, Tool Search and lazy loading, free up the prompt budget you were paying just to have your server registered. They compound.

What this means for a memory server, concretely

Memory servers have an awkward shape inside MCP. Most servers have natural ceilings on what they should return — a Slack connector hands back recent messages, a GitHub MCP fetches a file, a filesystem MCP lists a directory. Memory doesn't have an obvious ceiling. The most useful response is often "everything relevant to the question," and in 2025 that meant "everything we can fit in the truncation budget."

500,000 characters per tool call changes that ceiling.

You can now return:

Full conversation summaries with timestamps and references, not single-line digests
Original document excerpts with provenance alongside the extracted fact
Multi-source synthesis in one call instead of forcing the agent to make four
Skill or rule memories with examples included, not just the rule name

The trade-off has flipped. The question isn't "what's the minimum we can return that still answers the question?" anymore. It's "what's the maximum useful payload before the model starts ignoring the structure?" That's a much better optimization problem to have.
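One way to picture the new optimization problem is a packer that fills the response with the most relevant items until a character budget is spent. This is a minimal sketch, not how any particular memory server works: `MemoryItem`, `pack_payload`, and the `relevance` score are hypothetical names, and the 500,000 figure is simply the ceiling quoted above.

```python
from dataclasses import dataclass

# Assumption: the per-tool output ceiling described above, in characters.
MAX_TOOL_OUTPUT_CHARS = 500_000


@dataclass
class MemoryItem:
    """Hypothetical shape for one recalled memory."""
    text: str
    source: str
    timestamp: str    # ISO date, e.g. "2026-05-20"
    relevance: float  # higher means more relevant to the query


def pack_payload(items: list[MemoryItem], budget: int = MAX_TOOL_OUTPUT_CHARS) -> str:
    """Greedily add the most relevant items until the character budget is spent."""
    parts: list[str] = []
    used = 0
    for item in sorted(items, key=lambda i: i.relevance, reverse=True):
        block = f"- {item.timestamp}  {item.source}  -> {item.text}\n"
        if used + len(block) > budget:
            continue  # this item does not fit; a smaller one further down may
        parts.append(block)
        used += len(block)
    return "".join(parts)
```

A greedy pass like this is deliberately crude. The useful part is that the budget becomes an explicit parameter you can tune against recall quality rather than a truncation you discover in production.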

The concurrent connection change matters most for cross-stack setups. If you run a memory server alongside a GitHub MCP, a filesystem MCP, and a web-fetch MCP, recall against memory now overlaps with everything else instead of blocking on it. End-to-end wall time drops, but the more important effect is that the model isn't waiting on memory before it can start reasoning.
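The overlap is easy to see in a toy model. The sketch below is not MCP client code: `call_server` just sleeps to stand in for a tool call, and the latencies are made-up numbers chosen for illustration. It shows why wall time approaches the slowest call rather than the sum of all calls.

```python
import asyncio
import time


async def call_server(name: str, latency_s: float) -> str:
    """Stand-in for one MCP tool call; sleeps instead of doing network I/O."""
    await asyncio.sleep(latency_s)
    return f"{name}: ok"


async def run_turn(latencies: dict[str, float]) -> tuple[list[str], float]:
    """Issue all calls at once and report results plus elapsed wall time."""
    start = time.perf_counter()
    results = await asyncio.gather(
        *(call_server(name, latency) for name, latency in latencies.items())
    )
    return list(results), time.perf_counter() - start


if __name__ == "__main__":
    # Hypothetical latencies in seconds, chosen only for illustration.
    latencies = {"memory": 0.3, "github": 0.2, "filesystem": 0.1, "web-fetch": 0.25}
    results, elapsed = asyncio.run(run_turn(latencies))
    print(results)
    print(f"concurrent: {elapsed:.2f}s vs serial: {sum(latencies.values()):.2f}s")
```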

A Claude Desktop config that opts into the new behavior

If your config looks like the 2025 default, you're leaving the new headroom on the floor. Here's the shape that takes advantage of it:

{
  "mcpServers": {
    "memory": {
      "url": "https://<your-memory-endpoint>",
      "headers": {
        "Authorization": "Bearer <YOUR_API_KEY_SECRET>"
      }
    },
    "filesystem": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-filesystem", "/path/to/project"]
    },
    "github": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-github"]
    }
  }
}

A few specifics worth knowing:

Order matters less than it used to. With concurrent connections, Claude isn't walking your servers top to bottom. The config is a registry now, not a priority list.
If your memory server returns big payloads, structure them. A 400K-character blob of unstructured text wastes the new ceiling. Sections with headings, source attribution, and timestamps survive the model's compression pass much better than walls of prose.
Don't expose every tool. Lazy loading helps, but you still pay listing cost when Tool Search inventories your server. Five to ten well-named tools per server is the right number. Twenty is too many.

Structure beats volume

The thing I'd flag, because it's not obvious, is that "use the new headroom" doesn't mean "stuff it full." Attention budget is still attention budget. The model can technically read a 500K payload; whether it uses any of it depends on whether the structure makes the useful parts findable.

Returns that work:

# Recall: "what did I work on this week"

## Summary
- Spent three of five days on Project X (auth migration)
- Held a decision pending on Project Y schema choice

## Sources
- 2026-05-20  Slack #project-x  → "shipped the JWT refresh, deferred refresh-token rotation"
- 2026-05-22  Document/PRD v3 (excerpt)  → "...decision on schema deferred to next sprint"
- 2026-05-24  Claude conversation summary  → "discussed three migration paths, narrowed to two"

## Related preferences (stable memory)
- Decisions are recorded as ADRs in /docs/decisions
- Code review requires the "schema-breaking" tag for migrations

Returns that fail:

This week the user worked on a lot of different things including Project X
which involved authentication work and Project Y which had some schema
discussions and there were also several Slack threads where various
implementation choices were debated and...

Both fit. Only one of them gets used.
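If your server assembles responses in code, the structured shape above can come from a small renderer. This is a sketch under assumptions: `Source` and `render_recall` are hypothetical names, and the section headings simply mirror the example payload rather than any required format.

```python
from dataclasses import dataclass


@dataclass
class Source:
    date: str     # ISO date string, so lexical order is chronological order
    origin: str   # e.g. a channel, document, or conversation label
    excerpt: str


def render_recall(query: str, summary: list[str], sources: list[Source],
                  preferences: list[str]) -> str:
    """Render a recall result with headed sections, oldest source first."""
    lines = [f'# Recall: "{query}"', "", "## Summary"]
    lines += [f"- {point}" for point in summary]
    lines += ["", "## Sources"]
    for src in sorted(sources, key=lambda s: s.date):
        lines.append(f'- {src.date}  {src.origin}  → "{src.excerpt}"')
    if preferences:  # omit the section entirely rather than emit an empty heading
        lines += ["", "## Related preferences (stable memory)"]
        lines += [f"- {pref}" for pref in preferences]
    return "\n".join(lines) + "\n"
```

Keeping the rendering in one function also gives you one place to enforce that every source line carries a date and an origin.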

Rate limits are about to bite you

If you've been running a memory server with per-IP rate limits set for the old "one call per turn" assumption, concurrent connections are going to surprise you.

A single turn can now produce three to five calls to the same server through different tool paths. If your limit was 5r/s to be safe, you'll hit it on a single agent's normal usage.

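# For self-hosted nginx-fronted servers
# Bump the limit, but more importantly understand the new burst shape
limit_req_zone $binary_remote_addr zone=mcp:10m rate=20r/s;
# Apply the zone in the location that proxies to your MCP server.
# burst=10 is an illustrative starting point, not a recommendation.
limit_req zone=mcp burst=10 nodelay;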

For managed memory layers, check whether their rate-limit policy assumes concurrent or serial calls. If their docs were written before April and haven't been updated, ask. (This is also worth knowing if you're evaluating vendors — it's a quick signal of whether they're paying attention to the protocol.)
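To see why the burst shape matters more than the headline rate, here is a simplified leaky-bucket model in the spirit of nginx's `limit_req`. It is a sketch for reasoning about behavior, not a reimplementation of nginx: `simulate` is a hypothetical helper, and real deployments add queuing and delay semantics that this ignores.

```python
def simulate(arrivals_ms: list[int], rate_per_s: int, burst: int = 0) -> list[bool]:
    """Return, per request in arrival order, whether it would be accepted.

    Excess is tracked in milli-requests so the arithmetic stays in integers.
    """
    accepted: list[bool] = []
    excess = 0
    last_ms = None
    for now_ms in sorted(arrivals_ms):
        if last_ms is None:
            # The first request from a client always passes.
            accepted.append(True)
            last_ms = now_ms
            continue
        # Drain at the configured rate, then add the cost of this request.
        candidate = max(0, excess - rate_per_s * (now_ms - last_ms) + 1000)
        if candidate > burst * 1000:
            accepted.append(False)  # rejected requests leave the state unchanged
        else:
            excess, last_ms = candidate, now_ms
            accepted.append(True)
    return accepted


if __name__ == "__main__":
    five_at_once = [0, 0, 0, 0, 0]
    print(simulate(five_at_once, rate_per_s=5))           # only the first passes
    print(simulate(five_at_once, rate_per_s=20))          # a higher rate alone does not help
    print(simulate(five_at_once, rate_per_s=5, burst=4))  # a burst allowance does
```

In this model, five calls landing in the same millisecond are rejected at any rate unless a burst allowance exists, which is why raising `rate` without setting `burst` may not fix the errors you see.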

A small sanity-check workflow

Before you ship config changes, it's worth measuring on your own setup:

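# Assumes the Claude Code CLI: --mcp-config loads MCP servers from a JSON file,
# -p prints the response and exits, and the shell's `time` reports wall time.

# 1. Old config, recall-heavy query
time claude --mcp-config old-config.json -p "summarize what I worked on this week"

# 2. New config (concurrent enabled, larger payloads from memory)
time claude --mcp-config new-config.json -p "summarize what I worked on this week"

# 3. Compare wall time, tokens used, and (the important one) recall completeness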

The interesting metric is completeness, not speed. Speed wins are nice but inconsistent across providers and load. Completeness — did the response actually include the things you wanted it to include — is what justifies redesigning the server.
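Completeness is easier to compare across runs if you pin it to a number. A minimal sketch, assuming you can write down the phrases a good answer must mention: `recall_completeness` is a hypothetical helper, and substring matching is a crude proxy that will miss paraphrases.

```python
def recall_completeness(response: str, expected: list[str]) -> float:
    """Fraction of expected phrases that appear in the response (case-insensitive)."""
    if not expected:
        return 1.0  # nothing was required, so nothing is missing
    text = response.lower()
    hits = sum(1 for phrase in expected if phrase.lower() in text)
    return hits / len(expected)


if __name__ == "__main__":
    # Hypothetical checklist for the "what did I work on this week" query.
    expected = ["Project X", "Project Y", "JWT refresh"]
    old = "You worked on Project X this week."
    new = "You shipped the JWT refresh for Project X and deferred the Project Y schema."
    print(recall_completeness(old, expected), recall_completeness(new, expected))
```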

What's still hard

A few things April didn't fix:

Auth is still a per-server snowflake. Bearer tokens, OAuth, config-time API keys, signed URLs. If you run a memory service with per-project API keys (where the key has three parts and the Secret only displays once), you still have to explain that in prose in every directory listing. The MCP spec didn't standardize this in April. It probably won't this year.
Concurrent calls amplify dependency failures. If your memory server depends on a vector store and a graph DB, and now five concurrent calls all need both, you've widened the failure surface. Circuit breakers and per-dependency budgets matter more than they did.
Tool Search helps discovery within a session, not outside it. Getting your server found in the first place is still about being on canonical directories (punkpeye/awesome-mcp-servers, tolkonepiu/best-of-mcp-servers, mcp.so, Glama, Smithery). The April release didn't change anything about that.
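For the dependency-failure point, a per-dependency circuit breaker can be quite small. The sketch below is one common shape, not a prescription: `CircuitBreaker` and `guarded_call` are hypothetical names, the thresholds are placeholders, and it is not thread-safe as written.

```python
import time


class CircuitBreaker:
    """Minimal per-dependency breaker: opens after N consecutive failures."""

    def __init__(self, max_failures: int = 3, reset_after_s: float = 30.0,
                 clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock  # injectable so the cool-down can be tested
        self.failures = 0
        self.opened_at = None

    def allow(self) -> bool:
        """Closed: allow. Open: allow again only after the cool-down has passed."""
        if self.opened_at is None:
            return True
        return self.clock() - self.opened_at >= self.reset_after_s

    def record(self, ok: bool) -> None:
        """Reset on success; count failures and open once the threshold is hit."""
        if ok:
            self.failures = 0
            self.opened_at = None
            return
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = self.clock()


def guarded_call(breaker: CircuitBreaker, fn, fallback):
    """Call fn unless the breaker is open; return fallback on skip or failure."""
    if not breaker.allow():
        return fallback
    try:
        result = fn()
    except Exception:
        breaker.record(False)
        return fallback
    breaker.record(True)
    return result
```

With one breaker per dependency (say, one for the vector store and one for the graph DB), five concurrent recalls hitting a dead dependency degrade to the fallback quickly instead of each waiting out its own timeout.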

The thing worth saying out loud

The bigger story behind April isn't any individual feature. It's that the prompt-economy assumptions that shaped how memory servers were designed in 2025 are gone.

Servers built to fit the old constraints will keep working. They just won't be using the new headroom. If you designed your recall logic around "smallest possible useful response," there's a real architectural opportunity to rebuild it around "richest possible useful response that still respects attention" — which is a different problem with different solutions.

That's the part of the April release that's worth re-architecting around first.

If you're running a memory server, this is the moment to revisit assumptions. If you don't want to run one yourself, MemoryLake handles the cross-model, protocol-evolution, and auth pieces — one Memory you carry across ChatGPT, Claude, Gemini and coding agents via MCP, end-to-end encrypted and user-owned.

Part of a series on AI agent memory in 2026. Next: securing your MCP server after April's RCE disclosures.

Discussion welcome in the comments — what changed in your setup once concurrent connections went live?
